# CSE 4/587 PROJECT

## TITLE 
PREDICTING POWERLIFTING RESULTS

## PROBLEM STATEMENT
Given a lifter's age, body weight, and equipment, we will predict maximum powerlifting performance on the squat, bench press, and deadlift.

## BACKGROUND
We are interested in powerlifting because we enjoy fitness and lifting ourselves. Our question came from wondering how much we would have to lift at our age and body weight to be competitive powerlifters. More specifically, we want to know which of our lifts fall furthest behind the predicted values and which come closest. This is a significant problem because it can take years of specialization on a lift to bring it up to par. Knowing which lifts are lacking can help save years of suboptimal training.

## POTENTIAL
Our project can make a valuable contribution to the powerlifting domain, both for average lifters and for current powerlifters. We believe that predicting powerlifting performance from age, body weight, and equipment would benefit non-powerlifters who want to see how close their own lifts are to being competitive. This could help motivate non-powerlifters to join the sport. We also believe that predicting performance could be of value to current powerlifters who want to see how they compare with others of the same age, body weight, and equipment. Our project can help both groups plan their training by identifying lifts to focus on when those lifts are lower than predicted.

## DATASET
Link: https://www.kaggle.com/datasets/open-powerlifting/powerlifting-database

# 1) DATA CLEANING

In [None]:
'''

* Initially the csv file is viewed and its features are understood. 
* Columns of importance are decided and a modified dataframe with columns of interest is created.
* All rows containing NaN values are dropped. 

'''
import pandas as pd
df = pd.read_csv('openpowerlifting.csv', low_memory=False)

# processing step-1
'''
Cleaning MeetCountry, MeetName, MeetState
Type: String
Step Description: Stripping a string
'''
# .str.strip() leaves NaN entries in place, so every row stays aligned with its index
df['MeetCountry'] = df['MeetCountry'].str.strip()
df['MeetState'] = df['MeetState'].str.strip()
df['MeetName'] = df['MeetName'].str.strip()
#print(df.info())
'''
Rounding floats to 2 decimal places in Age and BodyweightKg
Type: Float
Step Description: rounding to 2 decimal places
'''
df['Age'] = df['Age'].round(2)
df['BodyweightKg'] = df['BodyweightKg'].round(2)
df2 = df.drop(columns=['Squat4Kg', 'Bench4Kg', 'Deadlift4Kg'])
# cleaning step-2: drop every row that still contains a NaN value
df2 = df2.dropna()
print(df.shape, df2.shape)
'''
here the columns: 'Squat4Kg', 'Bench4Kg', 'Deadlift4Kg' are dropped.
It was observed that only 14 competitors are participating in all categories of powerlifting. 
Almost all powerlifting meets only have 3 attempts per lift. 4Kg is an unusual 4th attempt.

'''
# This modified dataframe has 20271 rows and 34 columns
print(df2.shape)
df2.info()
# Lifters rarely take 'Squat4Kg', 'Bench4Kg', or 'Deadlift4Kg':
# a 4th attempt is not typical
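
The string-cleaning step above strips whitespace from `MeetCountry`, `MeetState`, and `MeetName`. The sketch below, on a small made-up frame (the values are illustrative, not taken from the dataset), shows why a vectorized `.str.strip()` may be the safer choice when a column contains NaN: filtering through a list comprehension shortens the result, so later values shift to the wrong rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the meet columns (values are made up).
toy = pd.DataFrame({
    "MeetCountry": [" USA", np.nan, "Canada "],
    "MeetName": ["Open ", " Nationals", np.nan],
})

# List-comprehension approach: the NaN is filtered out, so the result has
# only two entries and "Canada" moves from row 2 to row 1.
filtered = pd.Series([s.strip() for s in toy["MeetCountry"] if isinstance(s, str)])

# Vectorized approach: NaN stays where it was and the index is preserved.
stripped = toy["MeetCountry"].str.strip()

print(filtered.tolist())  # ['USA', 'Canada']
print(stripped.tolist())  # ['USA', nan, 'Canada']
```

Assigning `filtered` back to the frame would leave the last row empty, whereas `stripped` can be assigned back directly.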

In [None]:
#processing step-3
df3 = df2.drop(columns=['Squat1Kg', 'Squat2Kg', 'Squat3Kg'])
#processing step-4
df3 = df3.drop(columns=['Bench1Kg', 'Bench2Kg', 'Bench3Kg'])
#processing step-5
df3 = df3.drop(columns=['Deadlift1Kg', 'Deadlift2Kg', 'Deadlift3Kg'])
# cleaning step-6: The table is cleared of null entries
df3 = df3.dropna()
'''
here the 1st, 2nd, and 3rd attempts for squat, bench, and deadlift are dropped.
Attempts 1 and 2 are usually submaximal weights and don't represent total strength.
We are interested in the max columns: Best3BenchKg, Best3SquatKg, Best3DeadliftKg

'''
df3
# This modified dataframe has 20271 rows and 26 columns
print(df3.shape, df3.info())

In [None]:
df4 = df3.drop(columns=['Name', 'Country', 'Federation','Date','MeetCountry','MeetState','MeetName'])#cleaning step-6
'''
here the personal identifying info, date, and other non numerical lifting data are ignored
We are only interested in factors that affect lifting performance

'''
df4
# This modified dataframe has 20271 rows and 19 columns
print(df4.shape, df4.info())
#SPLITTING DATA CATEGORICALLY
df_categorical = df4.select_dtypes(include='object')
df_not_categorical = df4.select_dtypes(exclude='object')
df_categorical, df_not_categorical
# REVERSING ORDER OF COLUMNS IN NON-CATEGORICAL DATA SO THAT A POSSIBLE TARGET VALUE, IPFPoints, APPEARS AS THE FIRST COLUMN
# AND OTHER COLUMNS CAN BE POSSIBLE PREDICTORS - CLEANING STEP
df_ipf_points = df_not_categorical.loc[:, ::-1]
x, y = df_ipf_points.shape
df_ipf_points.index = range(0, x, 1)
df_ipf_points

In [None]:
x, y = df_not_categorical.shape
df_not_categorical.index = range(0, x, 1)
df_not_categorical

In [None]:
x, y = df_categorical.shape
df_categorical.index = range(0, x, 1)
df_categorical

In [None]:
# Every event is SBD so the column is not needed
df4["Event"].value_counts()['SBD']

In [None]:
# Every event is tested so not needed
df4['Tested'].value_counts()

In [None]:
# Raw and single-ply make up the majority of lifters
# Wraps and multi-ply give larger skewed numbers
df4['Equipment'].value_counts()

## NEW CLEANING STEP: RENAMING COLUMN 'TotalKg' to 'Total_wt_lifted' to avoid ambiguity about which weight it refers to.

In [None]:
# cleaning step-7: every event is SBD and every lifter is tested, so neither column is needed
df5 = df4.drop(columns=['Event', 'Tested'])
# keep only Raw and Single-ply lifters; Wraps and Multi-ply give larger, skewed numbers
df5 = df5[~df5['Equipment'].isin(['Wraps', 'Multi-ply'])]
# cleaning step-8: drop AgeClass, WeightClassKg, and Division because we are interested in exact values, not categories
df5 = df5.drop(columns=['AgeClass','Division','WeightClassKg'])
# cleaning step-9: Wilks, McCulloch, Glossbrenner, and IPFPoints are not needed; we are only interested in the exact TotalKg
df5 = df5.drop(columns=['Wilks','McCulloch','Glossbrenner','IPFPoints'])
# cleaning step-10: Place doesn't matter for prediction
df5 = df5.drop(columns=['Place'])

In [None]:
# This modified dataframe has 19603 rows and 8 columns
print(df5.shape, df5.info())
df5

In [None]:
# This modified dataframe has 19603 rows and 8 columns
print(df5.shape, df5.info())
x, y = df5.shape
df5.index = range(0, x, 1)  # NEW CLEANING STEP: RESETTING INDICES IN DATAFRAME
num_row = x
row_index = [i for i in range(x)]
df5.rename(columns = {'TotalKg':'Total_wt_lifted'}, inplace = True)  # RENAMING COLUMNS
df5

In [None]:
df5.describe()

In [None]:
# NORMALIZING AND MODE CALCULATION FOR DF5
# Normalizing the non-categorical entries in the above dataframe.
# The form of normalization used is min-max normalization.
# In the function below we first normalize the data, select the mode of the normalized values in each column, and
# reverse the calculation to estimate the mode of each original column.
# This process is particularly useful for identifying the mode
# (most frequently repeated observation) of each column in the above dataframe df5.
def normalize_col(x,name):
    nor_list=[]
    x_list = x.to_list()
    low=min(x_list)
    high=max(x_list)
    for i in x_list:
        nor_list.append(round((i-low)/(high-low),2)) 
    nor_age = sorted(dict(pd.Series(nor_list).value_counts()).items(), key=lambda item:item[1], reverse=True)
    nor_age_dict_sort = dict(nor_age)
    #print(nor_age_dict_sort)
    # Reverse the min-max scaling to recover the most common value in the original units
    mode_ages=0
    #print(list(nor_age_dict_sort.keys())[0])
    mode_ages = round((list(nor_age_dict_sort.keys())[0]*(high-low))+low,1)
    print(f'The mode for the {name} column is {mode_ages}')
    return nor_list
        
df_age_normal=normalize_col(df5['Age'],'Age')
df_bodywt_normal=normalize_col(df5['BodyweightKg'], 'BodyweightKg')
df_squat_normal=normalize_col(df5['Best3SquatKg'], 'Best3SquatKg')
df_bench_normal=normalize_col(df5['Best3BenchKg'], 'Best3BenchKg')
df_lift_normal=normalize_col(df5['Best3DeadliftKg'], 'Best3DeadliftKg')
df_total_normal=normalize_col(df5['Total_wt_lifted'], 'Total_wt_lifted')
#print(type(df_age_normal), type(df_bodywt_normal), type(df_bench_normal), type(df_lift_normal), type(df_total_normal))


In [None]:
# Entries are normalized to find out the most densely populated attributes.
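
The min-max scaling and mode recovery described above can also be sketched with vectorized pandas operations. The helper name `minmax_mode` and the sample ages are assumptions made for illustration; this is not the notebook's own function. Each value is scaled as (x - min) / (max - min), rounded into bins, and the most populated bin is mapped back to the original units.

```python
import pandas as pd

def minmax_mode(col, decimals=2):
    """Min-max scale a column to [0, 1] and estimate its mode from the rounded bins.

    Hypothetical helper for illustration; assumes col.max() > col.min().
    """
    low, high = col.min(), col.max()
    scaled = ((col - low) / (high - low)).round(decimals)
    # The most frequent rounded bin, mapped back to the original scale.
    top_bin = scaled.value_counts().idxmax()
    approx_mode = round(top_bin * (high - low) + low, 1)
    return scaled, approx_mode

ages = pd.Series([20.0, 21.0, 21.0, 30.0, 40.0])  # made-up ages
scaled, approx_mode = minmax_mode(ages)
print(approx_mode)  # 21.0
```

Because the scaled values are rounded before counting, the result is the center of the most populated bin rather than the exact mode; `Series.mode()` gives the exact most frequent value when binning is not wanted.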

In [None]:
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# EDA to see TotalKg
fig, ax = plt.subplots()
ax.hist(df5["Total_wt_lifted"], edgecolor="white")
plt.xlabel("TotalKgs")
plt.ylabel("Lifters")
plt.show()

In [None]:
# EDA to see Bench
fig, ax = plt.subplots()
ax.hist(df5["Best3BenchKg"], edgecolor="white")
plt.xlabel("BenchKgs")
plt.ylabel("Lifters")
plt.show()

In [None]:
# EDA to see Squat
fig, ax = plt.subplots()
ax.hist(df5["Best3SquatKg"], edgecolor="white")
plt.xlabel("SquatKgs")
plt.ylabel("Lifters")
plt.show()

In [None]:
# EDA to see Deadlift
fig, ax = plt.subplots()
ax.hist(df5["Best3DeadliftKg"], edgecolor="white")
plt.xlabel("DeadliftKgs")
plt.ylabel("Lifters")
plt.show()

In [None]:
# EDA to see age vs total
plt.plot(df5["Age"],df5["Total_wt_lifted"],'o')
plt.xlabel("Ages")
plt.ylabel("TotalKgs")
plt.show()

In [None]:
# EDA to see age vs bench
plt.plot(df5["Age"],df5["Best3BenchKg"],'o')
plt.xlabel("Ages")
plt.ylabel("BenchKgs")
plt.show()

In [None]:
# EDA to see age vs squat
plt.plot(df5["Age"],df5["Best3SquatKg"],'o')
plt.xlabel("Ages")
plt.ylabel("SquatKgs")
plt.show()

In [None]:
# EDA to see age vs deadlift
plt.plot(df5["Age"],df5["Best3DeadliftKg"],'o')
plt.xlabel("Ages")
plt.ylabel("DeadliftKgs")
plt.show()

In [None]:
# EDA to see weight vs total
plt.plot(df5["BodyweightKg"],df5["Total_wt_lifted"],'o')
plt.xlabel("BodyweightKgs")
plt.ylabel("TotalKgs")
plt.show()

In [None]:
# EDA to see weight vs bench
plt.plot(df5["BodyweightKg"],df5["Best3BenchKg"],'o')
plt.xlabel("BodyweightKgs")
plt.ylabel("BenchKgs")
plt.show()

In [None]:
# EDA to see weight vs squat
plt.plot(df5["BodyweightKg"],df5["Best3SquatKg"],'o')
plt.xlabel("BodyweightKgs")
plt.ylabel("SquatKgs")
plt.show()

In [None]:
# EDA to see weight vs deadlift
plt.plot(df5["BodyweightKg"],df5["Best3DeadliftKg"],'o')
plt.xlabel("BodyweightKgs")
plt.ylabel("DeadliftKgs")
plt.show()

In [None]:
raw = df5.loc[df5['Equipment'] == 'Raw']
equiped = df5.loc[df5['Equipment'] == 'Single-ply']

In [None]:
# EDA for raw vs equiped on totalkg
plt.hist([raw["Total_wt_lifted"], equiped["Total_wt_lifted"]], label=['Raw', 'Single-ply'])
plt.xlabel("TotalKgs")
plt.ylabel("Lifters")
plt.legend(loc='upper right')
plt.show()

In [None]:
# EDA for raw vs equiped on bench
plt.hist([raw["Best3BenchKg"], equiped["Best3BenchKg"]], label=['Raw', 'Single-ply'])
plt.xlabel("BenchKgs")
plt.ylabel("Lifters")
plt.legend(loc='upper right')
plt.show()

In [None]:
# EDA for raw vs equiped on squat
plt.hist([raw["Best3SquatKg"], equiped["Best3SquatKg"]], label=['Raw', 'Single-ply'])
plt.xlabel("SquatKgs")
plt.ylabel("Lifters")
plt.legend(loc='upper right')
plt.show()

In [None]:
# EDA for raw vs equiped on deadlift
plt.hist([raw["Best3DeadliftKg"], equiped["Best3DeadliftKg"]], label=['Raw', 'Single-ply'])
plt.xlabel("DeadliftKgs")
plt.ylabel("Lifters")
plt.legend(loc='upper right')
plt.show()

## FOLLOW-UP QUESTIONS FROM EDA
* Why does the deadlift have a more normal (bell-curve) distribution?
* Do younger lifters have higher potential?
* Which lift does age impact the most?
* Which lift does bodyweight impact the most?
* Higher bodyweight seems to have diminishing returns after a point. What is that point?
* The squat seems to have the best linear relation to bodyweight, and the deadlift the worst.
* Equipped lifters do much better on the squat and bench, but not on the deadlift.
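
One way to put numbers on the bodyweight questions above is the Pearson correlation between bodyweight and each lift. The sketch below uses synthetic data with hypothetical column names (`tight_lift`, `loose_lift`), so the values say nothing about the real dataset; it only shows the mechanics of ranking lifts by how linearly they follow bodyweight.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
bodyweight = rng.uniform(50, 120, size=200)

# Synthetic stand-ins: both lifts grow with bodyweight, one with far more noise.
toy = pd.DataFrame({
    "BodyweightKg": bodyweight,
    "tight_lift": 2.0 * bodyweight + rng.normal(0, 5, size=200),
    "loose_lift": 2.0 * bodyweight + rng.normal(0, 60, size=200),
})

# Correlation of each lift with bodyweight, strongest linear relation first.
corr = toy.corr()["BodyweightKg"].drop("BodyweightKg").sort_values(ascending=False)
print(corr)
```

On the notebook's data the same call could be applied to `df5[['BodyweightKg', 'Best3SquatKg', 'Best3BenchKg', 'Best3DeadliftKg']]`. A correlation coefficient only measures linear association, so it would not by itself locate the point of diminishing returns.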

# PHASE-2: Machine Learning and Statistical Analysis

In [None]:
#Model-1: Using Linear Regression to predict Best3SquatKg, Best3BenchKg and Best3DeadliftKg. 
#From EDA analysis above we observed that, the quantities BodyWeightKg, Equipment

In [None]:
df5

In [None]:
#y = df5['response_variable']
print(df5['Sex'].unique())
print(df5['Equipment'].unique())
# Encode the two categorical columns as integer codes so they can be used as regressors
gen_cat = {'Sex':{"M":1,"F":2}}
equip_cat = {'Equipment':{"Raw":1,"Single-ply":2}}
df5 = df5.replace(gen_cat)
df5 = df5.replace(equip_cat)
df5
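
The cell above maps `Sex` and `Equipment` to the integer codes 1 and 2. As an alternative, the sketch below one-hot encodes the same two columns with `pd.get_dummies` on a made-up three-row frame. For columns with only two levels, both encodings lead to equivalent linear fits; one-hot encoding matters more when a column has three or more unordered levels.

```python
import pandas as pd

# Made-up rows with the same categorical columns as df5.
toy = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "Equipment": ["Raw", "Single-ply", "Raw"],
    "BodyweightKg": [83.0, 63.0, 93.0],
})

# drop_first=True keeps one indicator per two-level column, which avoids
# perfectly collinear columns in a linear model with an intercept.
encoded = pd.get_dummies(toy, columns=["Sex", "Equipment"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded["Sex_M"].tolist())  # [1, 0, 1]
```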

### Best3SquatKg

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn.metrics as skm
X = df5[['Sex','Equipment','Age','BodyweightKg','Total_wt_lifted']]
y = df5['Best3SquatKg']
model = LinearRegression()
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X, y,random_state=120, test_size=0.35, shuffle=True)
result1 = model.fit(X1_train, Y1_train)
# Train accuracy (R^2 score)
print('Train accuracy (R^2 score) is', result1.score(X1_train, Y1_train))
# Making predictions
y1_pred = model.predict(X1_test)
y1_train_pred = model.predict(X1_train)
# MSE values
train_mse = skm.mean_squared_error(Y1_train, y1_train_pred)
mse_ln = skm.mean_squared_error(Y1_test, y1_pred)
# Test accuracy (R^2 score)
print('Test accuracy (R^2 score) is', result1.score(X1_test, Y1_test))
# Mean absolute error calculation
print('Mean absolute error is', skm.mean_absolute_error(Y1_test, y1_pred))
print('Train MSE of linear regression is', train_mse)
print('Test MSE of linear regression is', mse_ln)
# Test MSE minus train MSE, the same sign convention as the 'diff of MSE' rows in the tables below
new_diff = mse_ln - train_mse

In [None]:
x_list=[]
#print(X1_test.shape)
size_x=X1_test.shape[0]
for i in range(size_x):
    x_list.append(i+1)

In [None]:
fig=plt.figure(figsize=(100,100))
plt.scatter(x_list,Y1_test,s=170)
plt.scatter(x_list,y1_pred,s=170)
print('The graph below represents overfitting.')
plt.show()

In [None]:
from sklearn.linear_model import Ridge,Lasso
X=X.drop(['Sex','Equipment','Age'],axis=1)# These columns have correlation values less than 0.5
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X, y,random_state=150, test_size=0.2, shuffle=True)
alpha_vals=[0.01,0.1,0.7,1,10,100]
dict_info={}
count=0
for i in alpha_vals:
    count+=1
    rid_model = Ridge(alpha=i)
    rid_model.fit(X2_train, Y2_train)
    y_rid_pred = rid_model.predict(X2_test)
    mse_rid = skm.mean_squared_error(Y2_test, y_rid_pred)    
    r2_rid = skm.r2_score(Y2_test, y_rid_pred)
    y2_train_pred = rid_model.predict(X2_train)
    train_mse = skm.mean_squared_error(Y2_train, y2_train_pred)
    #mean absolute error calculation
    #print(f'For the learning rate {i} the results are as follows:')
    #print('Mean absolute error is', skm.mean_absolute_error(Y2_test,y_rid_pred))
    #print("Test MSE of ridge regression is", mse_rid)
    #print("Train MSE of Ridge regression is", train_mse)
    #print('Train accuracy of Ridge Regression is', rid_model.score(X2_train,Y2_train))
    #print("Test Accuracy of Ridge Regression is", r2_rid)
    dict_info['Trial'+str(count)]=[i, round(skm.mean_absolute_error(Y2_test,y_rid_pred),5), 
                                   round(skm.mean_squared_error(Y2_train, y2_train_pred),4),
                                   round(mse_rid,4),
                                   round(mse_rid,4)-round(skm.mean_squared_error(Y2_train, y2_train_pred),4),
                                  round(rid_model.score(X2_train,Y2_train),4),
                                  round(rid_model.score(X2_test,Y2_test),4)]    
    #print('Prediction variance is:', skm.explained_variance_score(y2_train_pred,Y2_train))
    #print('Test variance is:', skm.explained_variance_score(Y2_test,y_rid_pred))
info_display= pd.DataFrame(dict_info)
info_display.index = ['alpha','MAE','Train MSE','Test MSE','diff of MSE','Train accuracy','Test Accuracy']
print('The following observations were made while predicting Best3SquatKg by applying Ridge Regression with different regularization strengths (alpha).')
info_display

In [None]:
comp_list=[new_diff for i in range(6)]
x1_list=[i+1 for i in range(6)]
fig,ax=plt.subplots()
bar_width=0.4
ax.bar(np.array(x1_list) - bar_width/2 , comp_list, bar_width, label='Linear MSE diff')
ax.bar(np.array(x1_list) + bar_width/2, info_display.loc['diff of MSE'], bar_width, label='RIDGE MSE diff', alpha=0.5)
ax.set_xlabel('Trial number')
ax.set_ylabel('MSE difference')
ax.set_title('Comparing MSE differences for different alpha values in RIDGE Regression.')
ax.legend()
plt.show()
#plt.scatter(x1_list, comp_list)
#plt.scatter(x1_list, info_display.loc['diff of MSE'])
#print('Below is the graph comparing differnce in MSE for linear regression with differnt learning rates of RIDGE Regression.')
#plt.show()

In [None]:
alpha_vals=[0.01,0.1,0.7,1,10,100]
dict_info={}
count=0
for i in alpha_vals:
    count+=1
    lasso_model = Lasso(alpha=i)
    lasso_model.fit(X2_train, Y2_train)
    y_lasso_pred = lasso_model.predict(X2_test)
    mse_lasso = skm.mean_squared_error(Y2_test, y_lasso_pred)
    r2_lasso = skm.r2_score(Y2_test, y_lasso_pred)
    y2_train_pred = lasso_model.predict(X2_train)
    #mean absolute error calculation
    #print(f'For the learning rate {i} the results are as follows:')
    #print('Mean absolute error is', skm.mean_absolute_error(Y2_test,y_lasso_pred))
    #print("Train MSE of Lasso regression is", skm.mean_squared_error(Y2_train, y2_train_pred))
    #print("Test MSE of Lasso regression is", mse_lasso)
    #print('Train accuracy of Laso Regression is', lasso_model.score(X2_train,Y2_train))
    #print("Accuracy of Lasso Regression is", r2_lasso)
    dict_info['Trial'+str(count)]=[i, round(skm.mean_absolute_error(Y2_test,y_lasso_pred),5), 
                                   round(skm.mean_squared_error(Y2_train, y2_train_pred),4),
                                   round(mse_lasso,4),
                                   round(mse_lasso,4)-round(skm.mean_squared_error(Y2_train, y2_train_pred),4),
                                  round(lasso_model.score(X2_train,Y2_train),4),
                                  round(lasso_model.score(X2_test,Y2_test),4)]
    #print('Prediction variance is:', skm.explained_variance_score(y2_train_pred,Y2_train))
    #print('Test variance is:', skm.explained_variance_score(Y2_test,y_lasso_pred))
info_display= pd.DataFrame(dict_info)
info_display.index = ['alpha','MAE','Train MSE','Test MSE','diff of MSE','Train accuracy','Test Accuracy']
print('The following observations were made while predicting Best3SquatKg by applying Lasso Regression with different regularization strengths (alpha).')
info_display

In [None]:
# new_diff still holds the linear-regression MSE difference computed earlier
comp_list=[new_diff for i in range(6)]
x1_list=[i+1 for i in range(6)]
fig,ax=plt.subplots()
bar_width=0.4
ax.bar(np.array(x1_list) - bar_width/2 , comp_list, bar_width, label='Linear MSE diff')
ax.bar(np.array(x1_list) + bar_width/2, info_display.loc['diff of MSE'], bar_width, label='LASSO MSE diff', alpha=0.5)
ax.set_xlabel('Trial number')
ax.set_ylabel('MSE difference')
ax.set_title('Comparing MSE differences for different alpha values in LASSO Regression.')
ax.legend()
plt.show()
#plt.scatter(x1_list, comp_list)
#plt.scatter(x1_list, info_display.loc['diff of MSE'])
#print('Below is the graph comparing differnce in MSE for linear regression with differnt learning rates of LASSO Regression.')
#plt.show()

### Best3BenchKg

In [None]:
#Bestbenchprediction
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn.metrics as skm
X = df5[['Sex','Equipment','Age','BodyweightKg','Total_wt_lifted']]
y = df5['Best3BenchKg']
model = LinearRegression()
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X, y,random_state=120, test_size=0.35, shuffle=True)
result1 = model.fit(X1_train, Y1_train)
#Train accuracy
#print('Train accuracy(R^2 score) is',skm.r2_score(X1_train, Y1_train))
#Making predictions
y1_pred = model.predict(X1_test)
y1_train_pred = model.predict(X1_train)
#mse value
mse_ln = skm.mean_squared_error(Y1_test, y1_pred)
# Train and test MSE of the bench model
train_mse = skm.mean_squared_error(Y1_train, y1_train_pred)
print('Train MSE of linear regression is', train_mse)
print('Test MSE of linear regression is',mse_ln)
#mean absolute error calculation
print('Mean absolute error is', skm.mean_absolute_error(Y1_test,y1_pred))
# Train and test accuracy (R^2 score)
print('Train accuracy is' ,model.score(X1_train, Y1_train))
print('Test accuracy is', model.score(X1_test,Y1_test))
#print('Prediction variance is:', skm.explained_variance_score(y1_train_pred,Y1_train))
#print('Test variance is:', skm.explained_variance_score(Y1_test,y1_pred))
new_diff=mse_ln-train_mse

In [None]:
# Correlation of every column with Best3BenchKg, sorted and still labelled by column name
corr_list=df5.corr()['Best3BenchKg'].sort_values()
corr_list

In [None]:

fig=plt.figure(figsize=(100,100))
plt.scatter(x_list,Y1_test,s=170)
plt.scatter(x_list,y1_pred,s=170)
print('The graph below represents overfitting.')
plt.show()

In [None]:
from sklearn.linear_model import Ridge,Lasso
X=X.drop(['Sex','Equipment'],axis=1)# These columns have correlation values less than 0.5
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X, y,random_state=150, test_size=0.35, shuffle=True)
alpha_vals=[0.0001,0.1,0.7,1,10,100]
dict_info={}
count=0
for i in alpha_vals:
    count+=1
    rid_model = Ridge(alpha=i)
    rid_model.fit(X2_train, Y2_train)
    y_rid_pred = rid_model.predict(X2_test)
    mse_rid = skm.mean_squared_error(Y2_test, y_rid_pred)
    r2_rid = skm.r2_score(Y2_test, y_rid_pred)
    #mean absolute error calculation
    y2_train_pred = rid_model.predict(X2_train)
    #print(f'For the learning rate {i} the results are as follows:')
    #print('Mean absolute error is', skm.mean_absolute_error(Y2_test,y_rid_pred))
    #print("Train MSE of ridge regression is", skm.mean_squared_error(Y2_train, y2_train_pred))
    #print("Test MSE of ridge regression is", mse_rid)
    #print("Train Accuracy of Ridge Regression is", rid_model.score(X2_train,Y2_train))
    #print("Test Accuracy of Ridge Regression is", rid_model.score(X2_test,Y2_test))
    dict_info['Trial'+str(count)]=[i, round(skm.mean_absolute_error(Y2_test,y_rid_pred),5), 
                                   round(skm.mean_squared_error(Y2_train, y2_train_pred),4),
                                   round(mse_rid,4),round(mse_rid,4)-round(skm.mean_squared_error(Y2_train, y2_train_pred),4),
                                  round(rid_model.score(X2_train,Y2_train),4),
                                  round(rid_model.score(X2_test,Y2_test),4)]
    #print('Prediction variance is:', skm.explained_variance_score(y2_train_pred,Y2_train))
    #print('Test variance is:', skm.explained_variance_score(Y2_test,y_rid_pred))
info_display= pd.DataFrame(dict_info)
print('The following observations were made while predicting Best3BenchKg by applying Ridge Regression with different regularization strengths (alpha).')
info_display.index = ['alpha','MAE','Train MSE','Test MSE','diff of MSE','Train accuracy','Test Accuracy']


info_display

In [None]:
comp_list=[new_diff for i in range(6)]
x1_list=[i+1 for i in range(6)]
fig,ax=plt.subplots()
bar_width=0.4
ax.bar(np.array(x1_list) - bar_width/2 , comp_list, bar_width, label='Linear MSE diff')
ax.bar(np.array(x1_list) + bar_width/2, info_display.loc['diff of MSE'], bar_width, label='RIDGE MSE diff', alpha=0.5)
ax.set_xlabel('Trial number')
ax.set_ylabel('MSE difference')
ax.set_title('Comparing MSE differences for different alpha values in RIDGE Regression.')
ax.legend()
plt.show()
#plt.scatter(x1_list, comp_list)
#plt.scatter(x1_list, info_display.loc['diff of MSE'])
#print('Below is the graph comparing differnce in MSE for linear regression with differnt learning rates of LASSO Regression.')
#plt.show()

In [None]:
alpha_vals=[0.01,0.1,0.7,1,10,100]
dict_info={}
count=0
for i in alpha_vals:
    count+=1
    lasso_model = Lasso(alpha=i)
    lasso_model.fit(X2_train, Y2_train)
    y_lasso_pred = lasso_model.predict(X2_test)
    mse_lasso = skm.mean_squared_error(Y2_test, y_lasso_pred)
    r2_lasso = skm.r2_score(Y2_test, y_lasso_pred)
    y2_train_pred = lasso_model.predict(X2_train)
    #mean absolute error calculation
    #print(f'For the learning rate {i} the results are as follows:')
    #print('Mean absolute error is', skm.mean_absolute_error(Y2_test,y_lasso_pred))
    #print("Train MSE of Lasso regression is", skm.mean_squared_error(Y2_train, y2_train_pred))
    #print("Test MSE of Lasso regression is", mse_lasso)
    #print('Train accuracy of Laso Regression is', lasso_model.score(X2_train,Y2_train))
    #print("Accuracy of Lasso Regression is", r2_lasso)
    dict_info['Trial'+str(count)]=[i, round(skm.mean_absolute_error(Y2_test,y_lasso_pred),5), 
                                   round(skm.mean_squared_error(Y2_train, y2_train_pred),4),
                                   round(mse_lasso,4),
                                   round(mse_lasso,4)-round(skm.mean_squared_error(Y2_train, y2_train_pred),4),
                                  round(lasso_model.score(X2_train,Y2_train),4),
                                  round(lasso_model.score(X2_test,Y2_test),4)]
    #print('Prediction variance is:', skm.explained_variance_score(y2_train_pred,Y2_train))
    #print('Test variance is:', skm.explained_variance_score(Y2_test,y_lasso_pred))
info_display= pd.DataFrame(dict_info)
info_display.index = ['alpha','MAE','Train MSE','Test MSE','diff of MSE','Train accuracy','Test Accuracy']
print('The following observations were made while predicting Best3BenchKg by applying Lasso Regression with different regularization strengths (alpha).')
info_display

In [None]:
comp_list=[new_diff for i in range(6)]
x1_list=[i+1 for i in range(6)]
fig,ax=plt.subplots()
bar_width=0.4
ax.bar(np.array(x1_list) - bar_width/2 , comp_list, bar_width, label='Linear MSE diff')
ax.bar(np.array(x1_list) + bar_width/2, info_display.loc['diff of MSE'], bar_width, label='LASSO MSE diff', alpha=0.5)
ax.set_xlabel('Trial number')
ax.set_ylabel('MSE difference')
ax.set_title('Comparing MSE differences for different alpha values in LASSO Regression.')
ax.legend()
plt.show()
#plt.scatter(x1_list, comp_list)
#plt.scatter(x1_list, info_display.loc['diff of MSE'])
#print('Below is the graph comparing differnce in MSE for linear regression with differnt learning rates of LASSO Regression.')
#plt.show()

### Best3DeadliftKg

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn.metrics as skm
X = df5[['Sex','Equipment','Age','BodyweightKg','Total_wt_lifted']]
y = df5['Best3DeadliftKg']
model = LinearRegression()
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X, y,random_state=120, test_size=0.35, shuffle=True)
result1 = model.fit(X1_train, Y1_train)
#Train accuracy
#print('Train accuracy(R^2 score) is',skm.r2_score(X1_train, Y1_train))
#Making predictions
y1_pred = model.predict(X1_test)
y1_train_pred = model.predict(X1_train)
#mse value
mse_ln = skm.mean_squared_error(Y1_test, y1_pred)
# Train and test MSE
train_mse = skm.mean_squared_error(Y1_train, y1_train_pred)
print('Train MSE of linear regression is', train_mse)
print('Test MSE of linear regression is',mse_ln)
#mean absolute error calculation
print('Mean absolute error is', skm.mean_absolute_error(Y1_test,y1_pred))
print('Train accuracy is',model.score(X1_train,Y1_train))
print('Test accuracy is',model.score(X1_test,Y1_test))
#print('Prediction variance is:', skm.explained_variance_score(y1_train_pred,Y1_train))
#print('Test variance is:', skm.explained_variance_score(Y1_test,y1_pred))
new_diff=mse_ln-train_mse

In [None]:
fig=plt.figure(figsize=(100,100))
plt.scatter(x_list,Y1_test,s=170)
plt.scatter(x_list,y1_pred,s=170)
print('The graph below represents overfitting.')
plt.show()

In [None]:
from sklearn.linear_model import Ridge,Lasso
X=X.drop(['Sex','Equipment','Age'],axis=1)# These columns have correlation values less than 0.5
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X, y,random_state=150, test_size=0.2, shuffle=True)
alpha_vals=[0.01,0.1,0.7,1,10,100]
dict_info={}
count=0
for i in alpha_vals:
    count+=1
    rid_model = Ridge(alpha=i)
    rid_model.fit(X2_train, Y2_train)
    y_rid_pred = rid_model.predict(X2_test)
    mse_rid = skm.mean_squared_error(Y2_test, y_rid_pred)    
    r2_rid = skm.r2_score(Y2_test, y_rid_pred)
    y2_train_pred = rid_model.predict(X2_train)
    train_mse = skm.mean_squared_error(Y2_train, y2_train_pred)
    #mean absolute error calculation
    #print(f'For the learning rate {i} the results are as follows:')
    #print('Mean absolute error is', skm.mean_absolute_error(Y2_test,y_rid_pred))
    #print("Test MSE of ridge regression is", mse_rid)
    #print("Train MSE of Ridge regression is", train_mse)
    #print('Train accuracy of Ridge Regression is', rid_model.score(X2_train,Y2_train))
    #print("Test Accuracy of Ridge Regression is", r2_rid)
    dict_info['Trial'+str(count)]=[i, round(skm.mean_absolute_error(Y2_test,y_rid_pred),5), 
                                   round(skm.mean_squared_error(Y2_train, y2_train_pred),4),
                                   round(mse_rid,4),
                                  round(mse_rid,4)-round(skm.mean_squared_error(Y2_train, y2_train_pred),4),
                                  round(rid_model.score(X2_train,Y2_train),4),
                                  round(rid_model.score(X2_test,Y2_test),4)]    
    #print('Prediction variance is:', skm.explained_variance_score(y2_train_pred,Y2_train))
    #print('Test variance is:', skm.explained_variance_score(Y2_test,y_rid_pred))
info_display= pd.DataFrame(dict_info)
info_display.index = ['alpha','MAE','Train MSE','Test MSE','diff of MSE','Train accuracy','Test Accuracy']
print('The following observations were made while predicting Best3DeadliftKg by applying Ridge Regression with different regularization strengths (alpha).')
info_display

In [None]:
comp_list=[new_diff for i in range(6)]
x1_list=[i+1 for i in range(6)]
fig,ax=plt.subplots()
bar_width=0.4
ax.bar(np.array(x1_list) - bar_width/2 , comp_list, bar_width, label='Linear MSE diff')
ax.bar(np.array(x1_list) + bar_width/2, info_display.loc['diff of MSE'], bar_width, label='RIDGE MSE diff', alpha=0.5)
ax.set_xlabel('Trial number')
ax.set_ylabel('MSE difference')
ax.set_title('Comparing MSE differences for different alpha values in RIDGE Regression.')
ax.legend()
plt.show()
#plt.scatter(x1_list, comp_list)
#plt.scatter(x1_list, info_display.loc['diff of MSE'])
#print('Below is the graph comparing differnce in MSE for linear regression with differnt learning rates of LASSO Regression.')
#plt.show()

In [None]:
alpha_vals=[0.01,0.1,0.7,1,10,100]
dict_info={}
count=0
x_list=[]
#print(X1_test.shape)
size_x=X2_test.shape[0]
for i in range(size_x):
    x_list.append(i+1)
for i in alpha_vals:
    count+=1
    lasso_model = Lasso(alpha=i)
    lasso_model.fit(X2_train, Y2_train)
    y_lasso_pred = lasso_model.predict(X2_test)
    mse_lasso = skm.mean_squared_error(Y2_test, y_lasso_pred)
    r2_lasso = skm.r2_score(Y2_test, y_lasso_pred)
    y2_train_pred = lasso_model.predict(X2_train)
    train_mse = skm.mean_squared_error(Y2_train, y2_train_pred)
    if i==0.01:
        fig=plt.figure(figsize=(100,100))
        plt.scatter(x_list, y_lasso_pred,s=170)
        plt.scatter(x_list, Y2_test,s=170)
        print('The graph below represents reduced overfitting.')
        plt.show()
    #mean absolute error calculation
    #print(f'For the learning rate {i} the results are as follows:')
    #print('Mean absolute error is', skm.mean_absolute_error(Y2_test,y_lasso_pred))
    #print("Test MSE of ridge regression is", mse_lasso)
    #print("Train MSE of ridge regression is", train_mse)    
    #print('Train accuracy of Ridge Regression is', lasso_model.score(X2_train,Y2_train))
    #print("Test Accuracy of Ridge Regression is", r2_lasso)
    #print('Prediction variance is:', skm.explained_variance_score(y2_train_pred,Y2_train))
    #print('Test variance is:', skm.explained_variance_score(Y2_test,y_lasso_pred))
    # Row order: alpha, MAE, train MSE, test MSE, test MSE minus train MSE, train score, test score
    dict_info['Trial'+str(count)]=[i, round(skm.mean_absolute_error(Y2_test,y_lasso_pred),5),
                                   round(train_mse,4),
                                   round(mse_lasso,4),
                                   round(mse_lasso-train_mse,4),
                                   round(lasso_model.score(X2_train,Y2_train),4),
                                   round(lasso_model.score(X2_test,Y2_test),4)]
info_display= pd.DataFrame(dict_info)
info_display.index = ['alpha','MAE','Train MSE','Test MSE','diff of MSE','Train accuracy','Test Accuracy']
print('The following observations were made while predicting Best3DeadliftKg by applying Lasso Regression with different alpha values.')
info_display
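
The 'diff of MSE' row in the table above is the test MSE minus the train MSE, which the notebook uses as its overfitting indicator. The following is a minimal sketch of that calculation in plain Python. The helper names `mse` and `mse_gap` and all of the numbers are made up for illustration; they are not taken from the dataset or from the cells above.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mse_gap(train_true, train_pred, test_true, test_pred):
    """Test MSE minus train MSE; a large positive value hints at overfitting."""
    return mse(test_true, test_pred) - mse(train_true, train_pred)


# Made-up targets and predictions, for illustration only.
train_true, train_pred = [100.0, 150.0, 200.0], [101.0, 149.0, 200.0]
test_true, test_pred = [120.0, 180.0], [123.0, 176.0]

gap = mse_gap(train_true, train_pred, test_true, test_pred)
print("Train MSE:", round(mse(train_true, train_pred), 4))
print("Test MSE:", round(mse(test_true, test_pred), 4))
print("diff of MSE:", round(gap, 4))
```

A gap near zero means the model does about as well on held-out rows as on the rows it was fitted to.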

In [None]:
comp_list=[new_diff for i in range(6)]
x1_list=[i+1 for i in range(6)]
fig,ax=plt.subplots()
bar_width=0.4
ax.bar(np.array(x1_list) - bar_width/2 , comp_list, bar_width, label='Linear MSE diff')
ax.bar(np.array(x1_list) + bar_width/2, info_display.loc['diff of MSE'], bar_width, label='LASSO MSE diff', alpha=0.5)
ax.set_xlabel('Trial number')
ax.set_ylabel('MSE difference')
ax.set_title('Comparing MSE differences for different alpha values in LASSO Regression.')
ax.legend()
plt.show()
#plt.scatter(x1_list, comp_list)
#plt.scatter(x1_list, info_display.loc['diff of MSE'])
#print('Below is the graph comparing difference in MSE for linear regression with different alpha values of LASSO Regression.')
#plt.show()

In [None]:
# Summary: We started with a simple linear regression, using the features
# 'Sex', 'Equipment', 'Age', 'BodyweightKg' and 'Total_wt_lifted' to predict Best3SquatKg, Best3BenchKg and Best3DeadliftKg.
# In the EDA scatter plots, the pairs (Age, Best3BenchKg), (Age, Best3SquatKg) and (Age, Best3DeadliftKg)
# increased roughly in proportion to each other.
# Total_wt_lifted was also directly proportional to Best3BenchKg, Best3SquatKg and Best3DeadliftKg.
# Hence we decided to include these features when making predictions.
# Linear regression gave very good test scores (R^2) of 97.7%, 95.5% and 92.1% when
# predicting Best3SquatKg, Best3BenchKg and Best3DeadliftKg respectively.
# The point of concern is the large difference between the test and train MSE values. This indicates that,
# although the model fits the training data well, it may run into problems on unseen data.
# New, unseen input could therefore produce a data product of poor quality.
# Hence, to improve on the linear regression result and avoid overfitting, we used Ridge and Lasso regression models.
# We also observed that the columns 'Sex' and 'Equipment' had correlation values less than zero,
# so these columns were dropped before proceeding with further analysis.
# After applying Ridge and Lasso regression, the difference between train and test MSE dropped significantly,
# with only a minute decrease in the model score. This is due to the bias-variance trade-off.
# This indicates that the overfitting problem has been addressed in our model.
# At the expense of a slightly lower test score, we now have comparable test and train MSE values,
# which suggests that our model generalizes better to unseen data.
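
To make the bias-variance trade-off in the summary concrete, here is a small sketch of how the Ridge and Lasso penalties shrink a single coefficient as alpha grows. It assumes one feature and no intercept, and it uses the objective forms that, to our understanding, scikit-learn documents for `Lasso` and `Ridge`. The helpers `soft_threshold`, `lasso_1d` and `ridge_1d` and the data are hypothetical, written only for this illustration.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_1d(x, y, alpha):
    """Minimizer over w of (1/(2n)) * sum((y - w*x)^2) + alpha * |w|."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n
    scale = sum(a * a for a in x) / n
    return soft_threshold(rho, alpha) / scale


def ridge_1d(x, y, alpha):
    """Minimizer over w of sum((y - w*x)^2) + alpha * w^2."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return xy / (xx + alpha)


# Made-up data where the unpenalized slope is exactly 2.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

for alpha in (0.0, 1.5, 20.0):
    print(alpha, lasso_1d(x, y, alpha), ridge_1d(x, y, alpha))
```

In this toy case the Lasso coefficient reaches exactly zero once alpha is large enough, while the Ridge coefficient only shrinks toward zero. Both trade a little training fit (more bias) for a more stable model (less variance).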

In [None]:
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns

In [None]:
sns.scatterplot(x=df5['BodyweightKg'],y=df5['Age'], hue=df5['Total_wt_lifted'])

In [None]:
male_raw = df5.loc[(df5['Sex'] == 1)&(df5["Equipment"] == 1)]
sns.scatterplot(x=male_raw['BodyweightKg'], y=male_raw['Best3SquatKg'], hue=male_raw['Total_wt_lifted'])

In [None]:
male_raw = df5.loc[(df5['Sex'] == 1)&(df5["Equipment"] == 1)]
sns.scatterplot(x=male_raw['BodyweightKg'], y=male_raw['Best3BenchKg'], hue=male_raw['Total_wt_lifted'])

In [None]:
male_raw = df5.loc[(df5['Sex'] == 1)&(df5["Equipment"] == 1)]
sns.scatterplot(x=male_raw['BodyweightKg'], y=male_raw['Best3DeadliftKg'], hue=male_raw['Total_wt_lifted'])

In [None]:
# Create the model comparing bodyweight and age to their resulting total lifts rounded to the 100s
knn = KNeighborsClassifier(n_neighbors=100)
X2 = df5[["BodyweightKg","Age"]]
y2 = df5["Total_wt_lifted"].round(-2)
X3_train, X3_test, Y3_train, Y3_test = train_test_split(X2, y2,random_state=150, test_size=0.25, shuffle=True)
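
Rounding the total to the nearest 100 is what turns a continuous target into class labels that a classifier can predict. The sketch below uses Python's built-in `round` as a stand-in for `Series.round(-2)`, with made-up totals that avoid exact ties such as 450.

```python
from collections import Counter

# Made-up totals in kg, for illustration only.
totals = [437.5, 462.5, 512.0, 388.0, 449.9]

# round(x, -2) snaps each value to the nearest multiple of 100,
# so the continuous target collapses into a small set of class labels.
labels = [round(t, -2) for t in totals]
class_counts = Counter(labels)

print(labels)
print(class_counts)
```

One consequence of this design is that two lifters whose totals differ by a few kilograms can land in different classes (449.9 versus 462.5 here), and accuracy scores a miss by one bin the same as a miss by five bins.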

In [None]:
knn.fit(X3_train, Y3_train)
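
For intuition about what the fitted classifier does at prediction time, here is a minimal plain-Python sketch of k-nearest-neighbours voting. It assumes Euclidean distance and unweighted majority voting, which we believe are the `KNeighborsClassifier` defaults; tie-breaking and the neighbour search itself may differ from scikit-learn. The function `knn_predict` and the rows are hypothetical.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k):
    """Return the most common label among the k training points nearest to query."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up (BodyweightKg, Age) rows and rounded totals, for illustration only.
train_X = [(60.0, 22.0), (62.0, 25.0), (65.0, 30.0),
           (100.0, 28.0), (105.0, 31.0), (110.0, 35.0)]
train_y = [300.0, 300.0, 400.0, 600.0, 600.0, 700.0]

light_lifter = knn_predict(train_X, train_y, (63.0, 24.0), k=3)
heavy_lifter = knn_predict(train_X, train_y, (104.0, 30.0), k=3)
print(light_lifter, heavy_lifter)
```

With `n_neighbors=100`, as in the cell above, each prediction is the most common rounded total among the 100 most similar lifters in the training split.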

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [None]:
# model statistics
y_pred = knn.predict(X3_test)
accuracy = accuracy_score(Y3_test, y_pred)
precision = precision_score(Y3_test, y_pred, average='micro')
recall = recall_score(Y3_test, y_pred, average='micro')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
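
With `average='micro'` and a single label per row, precision and recall are pooled over all classes, so both come out equal to accuracy. The three printed numbers above are therefore expected to match. A small sketch with made-up labels (the helper `micro_scores` is hypothetical) shows why:

```python
def micro_scores(y_true, y_pred):
    """Accuracy plus micro-averaged precision and recall for single-label data."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1  # predicted this class and it was right
            elif p == label and t != label:
                fp += 1  # predicted this class but it was another one
            elif t == label and p != label:
                fn += 1  # this class was missed
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, tp / (tp + fp), tp / (tp + fn)


# Made-up rounded totals, for illustration only.
y_true = [400, 400, 500, 600, 500]
y_pred = [400, 500, 500, 600, 600]

accuracy, precision, recall = micro_scores(y_true, y_pred)
print(accuracy, precision, recall)
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the pooled ratios coincide. Macro or weighted averaging would give per-class information that micro averaging hides.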

In [None]:
# Try the model again, this time comparing bodyweight and equipment to the resulting total lifts rounded to the 100s
knn2 = KNeighborsClassifier(n_neighbors=100)
X3 = df5[["BodyweightKg","Equipment"]]
y3 = df5["Total_wt_lifted"].round(-2)
X4_train, X4_test, Y4_train, Y4_test = train_test_split(X3, y3,random_state=150, test_size=0.25, shuffle=True)

In [None]:
knn2.fit(X4_train, Y4_train)
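
KNN relies on distances, so a feature measured in kilograms can drown out an encoded category such as Equipment. The sketch below assumes BodyweightKg is still in raw kilograms at this point, which may not hold if an earlier normalization step already rescaled it. The helper `min_max_scale` and the three rows are made up for illustration.

```python
import math


def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


# Made-up rows: lifters 0 and 1 weigh almost the same but use different
# equipment; lifter 2 shares equipment with lifter 0 but is 40 kg heavier.
bodyweight = [60.0, 61.0, 100.0]
equipment = [0.0, 1.0, 0.0]

raw_rows = list(zip(bodyweight, equipment))
scaled_rows = list(zip(min_max_scale(bodyweight), min_max_scale(equipment)))

raw_near = math.dist(raw_rows[0], raw_rows[1])
raw_far = math.dist(raw_rows[0], raw_rows[2])
scaled_near = math.dist(scaled_rows[0], scaled_rows[1])
scaled_far = math.dist(scaled_rows[0], scaled_rows[2])

print("raw:", raw_near, raw_far)
print("scaled:", scaled_near, scaled_far)
```

On the raw values the equipment difference barely registers next to a 40 kg bodyweight difference. After min-max scaling the two features contribute on the same scale. Whether that is desirable depends on how much weight Equipment should carry in the comparison.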

In [None]:
# model statistics
y_pred = knn2.predict(X4_test)
accuracy = accuracy_score(Y4_test, y_pred)
precision = precision_score(Y4_test, y_pred, average='micro')
recall = recall_score(Y4_test, y_pred, average='micro')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

In [None]:
# try knn but separate gender and equipment out before comparing bodyweight to total
# Select male and raw
knn3 = KNeighborsClassifier(n_neighbors=150)
X4 = df5.loc[(df5['Sex'] == 1)&(df5["Equipment"] == 1)]["BodyweightKg"].values.reshape(-1,1)
y4 = df5.loc[(df5['Sex'] == 1)&(df5["Equipment"] == 1)]["Total_wt_lifted"].round(-2)
X5_train, X5_test, Y5_train, Y5_test = train_test_split(X4, y4,random_state=150, test_size=0.25, shuffle=True)

In [None]:
knn3.fit(X5_train, Y5_train)

In [None]:
# model statistics
y_pred = knn3.predict(X5_test)
accuracy = accuracy_score(Y5_test, y_pred)
precision = precision_score(Y5_test, y_pred, average='micro')
recall = recall_score(Y5_test, y_pred, average='micro')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

In [None]:
# compare squat to total
knn4 = KNeighborsClassifier(n_neighbors=200)
X5 = df5["Best3SquatKg"].values.reshape(-1,1)
y5 = df5["Total_wt_lifted"].round(-2)
X6_train, X6_test, Y6_train, Y6_test = train_test_split(X5, y5,random_state=150, test_size=0.25, shuffle=True)

In [None]:
knn4.fit(X6_train, Y6_train)

In [None]:
# model statistics
y_pred = knn4.predict(X6_test)
accuracy = accuracy_score(Y6_test, y_pred)
precision = precision_score(Y6_test, y_pred, average='micro')
recall = recall_score(Y6_test, y_pred, average='micro')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

In [None]:
# compare bench to total
knn5 = KNeighborsClassifier(n_neighbors=200)
X6 = df5["Best3BenchKg"].values.reshape(-1,1)
y6 = df5["Total_wt_lifted"].round(-2)
X7_train, X7_test, Y7_train, Y7_test = train_test_split(X6, y6,random_state=150, test_size=0.25, shuffle=True)
knn5.fit(X7_train, Y7_train)
# model statistics
y_pred = knn5.predict(X7_test)
accuracy = accuracy_score(Y7_test, y_pred)
precision = precision_score(Y7_test, y_pred, average='micro')
recall = recall_score(Y7_test, y_pred, average='micro')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

In [None]:
# compare deadlift to total
knn6 = KNeighborsClassifier(n_neighbors=200)
X7 = df5["Best3DeadliftKg"].values.reshape(-1,1)
y7 = df5["Total_wt_lifted"].round(-2)
X8_train, X8_test, Y8_train, Y8_test = train_test_split(X7, y7,random_state=150, test_size=0.25, shuffle=True)
knn6.fit(X8_train, Y8_train)
# model statistics
y_pred = knn6.predict(X8_test)
accuracy = accuracy_score(Y8_test, y_pred)
precision = precision_score(Y8_test, y_pred, average='micro')
recall = recall_score(Y8_test, y_pred, average='micro')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
sns.scatterplot(x=df5['BodyweightKg'],y=df5['Sex'], hue=df5['Total_wt_lifted'])

In [None]:
sns.scatterplot(x=df5['BodyweightKg'],y=df5['Equipment'], hue=df5['Total_wt_lifted'])

In [None]:
# based on age, weight, gender, lifts, and total, were they equipped
X9 = df5.drop('Equipment',axis=1)
y9 = df5["Equipment"]
# split
X9_train, X9_test, Y9_train, Y9_test = train_test_split(X9, y9,random_state=150, test_size=0.25, shuffle=True)
# instantiate the model (using the default parameters)
logreg = LogisticRegression(random_state=16)
# fit the model with data
logreg.fit(X9_train, Y9_train)
y_pred = logreg.predict(X9_test)
accuracy = accuracy_score(Y9_test, y_pred)
precision = precision_score(Y9_test, y_pred, average='micro')
recall = recall_score(Y9_test, y_pred, average='micro')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

In [None]:
# based on age, weight, equipment, lifts, and total, what gender were they
X10 = df5.drop('Sex',axis=1)
y10 = df5["Sex"]
# split
X10_train, X10_test, Y10_train, Y10_test = train_test_split(X10, y10,random_state=150, test_size=0.25, shuffle=True)
# instantiate the model (using the default parameters)
logreg = LogisticRegression(random_state=16)
# fit the model with data
logreg.fit(X10_train, Y10_train)
y_pred = logreg.predict(X10_test)
accuracy = accuracy_score(Y10_test, y_pred)
precision = precision_score(Y10_test, y_pred, average='micro')
recall = recall_score(Y10_test, y_pred, average='micro')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
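
For a two-class target such as Sex, a fitted logistic regression predicts by passing a weighted sum of the features through the sigmoid function and thresholding the result. The sketch below shows only that prediction step. The weights, bias and feature values are hypothetical and are not the coefficients learned by `logreg` above.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_binary(features, weights, bias, threshold=0.5):
    """Return (label, probability of class 1) for one row of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return (1 if p >= threshold else 0), p


# Hypothetical coefficients for (BodyweightKg, Total_wt_lifted).
weights = (0.01, 0.004)
bias = -2.5

label_a, prob_a = predict_binary((90.0, 500.0), weights, bias)
label_b, prob_b = predict_binary((55.0, 200.0), weights, bias)
print(label_a, round(prob_a, 3))
print(label_b, round(prob_b, 3))
```

Fitting the model amounts to choosing the weights and bias that make these probabilities agree with the training labels as closely as possible.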

# REFERENCES
https://www.oreilly.com/library/view/hands-on-machine-learning/9781788393485/fd5b8a44-e9d3-4c19-bebb-c2fa5a5ebfee.xhtml#:~:text=Min%2Dmax%20normalization%20(usually%20called,among%20the%20original%20data%20values.

https://stackoverflow.com/questions/26785354/normalizing-a-list-of-numbers-in-python

https://developers.google.com/machine-learning/data-prep/transform/normalization#:~:text=The%20goal%20of%20normalization%20is,training%20stability%20of%20the%20model.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html

https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html

https://sparkbyexamples.com/python/count-occurrences-of-element-in-python-list/#:~:text=To%20count%20the%20occurrences%20of%20an%20element%20in%20a%20list,of%20elements%20in%20a%20list.

https://matplotlib.org/stable/plot_types/index.html

https://www.w3schools.com/python/matplotlib_intro.asp

# REFERENCES-PHASE 2
https://www.statology.org/one-hot-encoding-in-python/

https://www.turing.com/kb/convert-categorical-data-in-pandas-and-scikit-learn

https://www.geeksforgeeks.org/how-to-do-train-test-split-using-sklearn-in-python/

https://medium.com/@alexstrebeck/training-and-testing-machine-learning-models-e1f27dc9b3cb

https://towardsdatascience.com/linear-regression-using-python-b136c91bf0a2

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

https://www.datacamp.com/tutorial/understanding-logistic-regression-python