# Prediction

In [7]:
import pandas as pd

file_path = './data/filtered_data.csv'
data = pd.read_csv(file_path)
print(data.head())


   Sex   Age  BodyweightKg AgeClass  Squat1Kg  Squat2Kg  Squat3Kg  \
0    0  29.0          59.8    24-34      80.0      92.5     105.0   
1    0  29.0          58.5    24-34     100.0     110.0     120.0   
2    0  23.0          60.0    20-23    -105.0    -105.0     105.0   
3    0  45.0         104.0    45-49     120.0     130.0     140.0   
4    0  37.0          74.0    35-39     127.5     135.0     142.5   

   Best3SquatKg  Bench1Kg  Bench2Kg  Bench3Kg  Best3BenchKg  Deadlift1Kg  \
0         105.0      45.0      50.0      55.0          55.0        110.0   
1         120.0      55.0      62.5      67.5          67.5        130.0   
2         105.0      67.5      72.5     -75.0          72.5        132.5   
3         140.0      70.0      75.0      80.0          80.0        150.0   
4         142.5      72.5      77.5      82.5          82.5        125.0   

   Deadlift2Kg  Deadlift3Kg  Best3DeadliftKg  
0        120.0        130.0            130.0  
1        140.0        145.0       

In [8]:
print(data.info())
data = data.dropna(subset=['Age', 'BodyweightKg', 'Best3SquatKg'])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 486038 entries, 0 to 486037
Data columns (total 16 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Sex              486038 non-null  int64  
 1   Age              486038 non-null  float64
 2   BodyweightKg     486038 non-null  float64
 3   AgeClass         485949 non-null  object 
 4   Squat1Kg         264359 non-null  float64
 5   Squat2Kg         261447 non-null  float64
 6   Squat3Kg         254484 non-null  float64
 7   Best3SquatKg     486038 non-null  float64
 8   Bench1Kg         264315 non-null  float64
 9   Bench2Kg         261943 non-null  float64
 10  Bench3Kg         254429 non-null  float64
 11  Best3BenchKg     486038 non-null  float64
 12  Deadlift1Kg      264496 non-null  float64
 13  Deadlift2Kg      260505 non-null  float64
 14  Deadlift3Kg      249573 non-null  float64
 15  Best3DeadliftKg  486038 non-null  float64
dtypes: float64(14), int64(1), object(1)
me

## Polynomal Regression
Polynomial regression is a form of linear regression in which the relationship between the independent variable xx and the dependent variable yy is modeled as an nth degree polynomial. Unlike simple linear regression, which models the relationship as a straight line, polynomial regression can fit a wide range of curvature in the data. This flexibility makes it particularly useful for modeling datasets where the relationship between variables is not linear but still follows a specific trend.

In [13]:
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

features = data[['Age', 'BodyweightKg', 'Sex']]
target = data['Best3SquatKg']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.1)

# Generating polynomial features
poly_degree = 2
poly_features = PolynomialFeatures(degree=poly_degree)

X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Training the model
poly_model_squat = LinearRegression()
poly_model_squat.fit(X_train_poly, y_train)

# Making predictions and evaluating the model
y_pred = poly_model_squat.predict(X_test_poly)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Creating a DataFrame to display the values
error_metrics = pd.DataFrame({
    "Metric": ["Mean Squared Error (MSE)", "R-squared (R²)"],
    "Value": [mse, r2]
})

error_metrics


Unnamed: 0,Metric,Value
0,Mean Squared Error (MSE),2257.131963
1,R-squared (R²),0.568054


In [14]:
features = data[['Age', 'BodyweightKg', 'Sex']]
target = data['Best3BenchKg']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.1)

# Generating polynomial features
poly_degree = 2
poly_features = PolynomialFeatures(degree=poly_degree)

X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Training the model
poly_model_bench = LinearRegression()
poly_model_bench.fit(X_train_poly, y_train)

# Making predictions and evaluating the model
y_pred = poly_model_bench.predict(X_test_poly)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Creating a DataFrame to display the values
error_metrics = pd.DataFrame({
    "Metric": ["Mean Squared Error (MSE)", "R-squared (R²)"],
    "Value": [mse, r2]
})

error_metrics


Unnamed: 0,Metric,Value
0,Mean Squared Error (MSE),957.552367
1,R-squared (R²),0.645674


In [15]:
features = data[['Age', 'BodyweightKg', 'Sex']]
target = data['Best3DeadliftKg']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.1)

# Generating polynomial features
poly_degree = 2
poly_features = PolynomialFeatures(degree=poly_degree)

X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Training the model
poly_model_deadlift = LinearRegression()
poly_model_deadlift.fit(X_train_poly, y_train)

# Making predictions and evaluating the model
y_pred = poly_model_deadlift.predict(X_test_poly)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Creating a DataFrame to display the values
error_metrics = pd.DataFrame({
    "Metric": ["Mean Squared Error (MSE)", "R-squared (R²)"],
    "Value": [mse, r2]
})

error_metrics


Unnamed: 0,Metric,Value
0,Mean Squared Error (MSE),1357.2254
1,R-squared (R²),0.652179


## Random Forest Regressor
Random Forest is an ensemble learning method, particularly effective for regression tasks. It operates by constructing multiple decision trees during training and outputting the mean or average prediction of the individual trees. This approach helps in reducing overfitting, a common problem in decision tree models, and improves the predictive accuracy.

In [4]:
from sklearn.ensemble import RandomForestRegressor

# Define the features and the target
X = data[['Age', 'BodyweightKg']]
y = data['Best3SquatKg']

# Spliting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100)

# Train the model
rf_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Creating a DataFrame to display the values
error_metrics = pd.DataFrame({
    "Metric": ["Mean Squared Error (MSE)", "R-squared (R²)"],
    "Value": [mse, r2]
})

error_metrics


Unnamed: 0,Metric,Value
0,Mean Squared Error (MSE),1900.001312
1,R-squared (R²),0.165592


## MLP Regression
MLP Regression refers to the application of a Multi-Layer Perceptron, a class of feedforward artificial neural network, to regression tasks. MLP consists of multiple layers of nodes, simulating a biological neural network, which makes it highly effective in capturing complex and non-linear relationships in data.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X = data[['Age', 'BodyweightKg']]
y = data['Best3SquatKg']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create the MLPRegressor model
model = MLPRegressor(hidden_layer_sizes=(200, 200, 200), activation='relu', 
                     solver='adam', max_iter=500, random_state=1)

# Fit the model
model.fit(X_train, y_train)

# Make predictions on the test set
predictions = model.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, y_pred)

# Creating a DataFrame to display the values
error_metrics = pd.DataFrame({
    "Metric": ["Mean Squared Error (MSE)", "R-squared (R²)"],
    "Value": [mse, r2]
})

error_metrics


Unnamed: 0,Metric,Value
0,Mean Squared Error (MSE),1884.384624
1,R-squared (R²),-0.420458


## Conclusion

It appears that the models I've experimented with aren't performing as expected. Despite various attempts and configurations, the accuracy remains unsatisfactorily low, and the last model even yielded a negative R² score, which suggests there might be some issues in my approach. This leads me to believe that our current dataset might be insufficient for accurate predictions. Incorporating additional features, such as muscle mass, nutritional habits, and body length, could potentially enhance the model's predictive power and provide a more comprehensive understanding of the factors influencing the outcomes. Other ways could be to use Grid Search or Random Search for hyperparameter tuning to improve the models.

In [6]:
import pickle

with open('model_squat.pkl', 'wb') as file:
    pickle.dump(poly_model_squat, file)

with open('model_bench.pkl', 'wb') as file:
    pickle.dump(poly_model_bench, file)

with open('model_deadlift.pkl', 'wb') as file:
    pickle.dump(poly_model_deadlift, file)