# Prediction

In [1]:
import pandas as pd

file_path = './data/filtered_data.csv'
data = pd.read_csv(file_path)
print(data.head())


   Sex   Age  BodyweightKg AgeClass  Squat1Kg  Squat2Kg  Squat3Kg  \
0    1  19.0          73.1    18-19     210.0    -220.0    -220.0   
1    1  28.0          74.9    24-34     190.0     200.0     210.0   
2    1  19.0          74.1    18-19     150.0     160.0     165.0   
3    1  25.0          74.2    24-34     255.0     275.0     290.0   
4    1  21.0          74.9    20-23     200.0     217.5    -225.0   

   Best3SquatKg  Bench1Kg  Bench2Kg  Bench3Kg  Best3BenchKg  Deadlift1Kg  \
0         210.0    -120.0    -120.0     120.0         120.0       -250.0   
1         210.0     135.0     140.0     142.5         142.5        250.0   
2         165.0     125.0    -132.5     135.0         135.0        225.0   
3         290.0     160.0     170.0     175.0         175.0        270.0   
4         217.5     110.0     120.0    -127.5         120.0        217.5   

   Deadlift2Kg  Deadlift3Kg  Best3DeadliftKg  
0       -250.0        250.0            250.0  
1       -265.0       -270.0       

In [2]:
print(data.info())
data = data.dropna(subset=['Age', 'BodyweightKg', 'Best3SquatKg'])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42946 entries, 0 to 42945
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Sex              42946 non-null  int64  
 1   Age              42946 non-null  float64
 2   BodyweightKg     42946 non-null  float64
 3   AgeClass         42936 non-null  object 
 4   Squat1Kg         23772 non-null  float64
 5   Squat2Kg         23566 non-null  float64
 6   Squat3Kg         22990 non-null  float64
 7   Best3SquatKg     42946 non-null  float64
 8   Bench1Kg         23769 non-null  float64
 9   Bench2Kg         23585 non-null  float64
 10  Bench3Kg         22876 non-null  float64
 11  Best3BenchKg     42946 non-null  float64
 12  Deadlift1Kg      23788 non-null  float64
 13  Deadlift2Kg      23489 non-null  float64
 14  Deadlift3Kg      22442 non-null  float64
 15  Best3DeadliftKg  42946 non-null  float64
dtypes: float64(14), int64(1), object(1)
memory usage: 5.2+ MB


## Polynomal Regression
Polynomial regression is a form of linear regression in which the relationship between the independent variable xx and the dependent variable yy is modeled as an nth degree polynomial. Unlike simple linear regression, which models the relationship as a straight line, polynomial regression can fit a wide range of curvature in the data. This flexibility makes it particularly useful for modeling datasets where the relationship between variables is not linear but still follows a specific trend.

In [3]:
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

features = data[['Age', 'BodyweightKg']]
target = data['Best3SquatKg']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)

# Generating polynomial features
poly_degree = 2
poly_features = PolynomialFeatures(degree=poly_degree)

X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Training the model
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)

# Making predictions and evaluating the model
y_pred = poly_model.predict(X_test_poly)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Creating a DataFrame to display the values
error_metrics = pd.DataFrame({
    "Metric": ["Mean Squared Error (MSE)", "R-squared (R²)"],
    "Value": [mse, r2]
})

error_metrics


Unnamed: 0,Metric,Value
0,Mean Squared Error (MSE),1797.268199
1,R-squared (R²),0.214059


## Random Forest Regressor
Random Forest is an ensemble learning method, particularly effective for regression tasks. It operates by constructing multiple decision trees during training and outputting the mean or average prediction of the individual trees. This approach helps in reducing overfitting, a common problem in decision tree models, and improves the predictive accuracy.

In [4]:
from sklearn.ensemble import RandomForestRegressor

# Define the features and the target
X = data[['Age', 'BodyweightKg']]
y = data['Best3SquatKg']

# Spliting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100)

# Train the model
rf_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Creating a DataFrame to display the values
error_metrics = pd.DataFrame({
    "Metric": ["Mean Squared Error (MSE)", "R-squared (R²)"],
    "Value": [mse, r2]
})

error_metrics


(1884.4420701156012, 0.17551183091425093)

## MLP Regression
MLP Regression refers to the application of a Multi-Layer Perceptron, a class of feedforward artificial neural network, to regression tasks. MLP consists of multiple layers of nodes, simulating a biological neural network, which makes it highly effective in capturing complex and non-linear relationships in data.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X = data[['Age', 'BodyweightKg']]
y = data['Best3SquatKg']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create the MLPRegressor model
model = MLPRegressor(hidden_layer_sizes=(200, 200, 200), activation='relu', 
                     solver='adam', max_iter=500, random_state=1)

# Fit the model
model.fit(X_train, y_train)

# Make predictions on the test set
predictions = model.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, y_pred)

# Creating a DataFrame to display the values
error_metrics = pd.DataFrame({
    "Metric": ["Mean Squared Error (MSE)", "R-squared (R²)"],
    "Value": [mse, r2]
})

error_metrics


(1805.5464686223825, -0.4065978150029137)

## Conclusion

It appears that the models I've experimented with aren't performing as expected. Despite various attempts and configurations, the accuracy remains unsatisfactorily low, and the last model even yielded a negative R² score, which suggests there might be some issues in my approach. This leads me to believe that our current dataset might be insufficient for accurate predictions. Incorporating additional features, such as muscle mass, nutritional habits, and body length, could potentially enhance the model's predictive power and provide a more comprehensive understanding of the factors influencing the outcomes. Other ways could be to use Grid Search or Random Search for hyperparameter tuning to improve the models.