# Gym Trainer ML: Predicting Top Set Intensity

This notebook demonstrates the workflow for predicting the optimal top set intensity for a user’s next workout session. 
We use past workout data, engineered features, and XGBoost to make predictions. 

**Goals:**
- Clean and preprocess the workout data
- Engineer features for ML and rule-based logic
- Train an XGBoost model
- Keep rule-based features separate for UI/alerts


### Imports and Setup

In [98]:
# Standard libraries
import pandas as pd
import numpy as np

# scikit-learn imports
from sklearn.model_selection import train_test_split

# Utils for ML (DE_utils.py)
import de_utils

### Load Data

In this section, we load the workout data and perform an initial inspection to understand the dataset. This includes checking:

- The number of rows and columns.
- Basic statistics about the numerical columns.
- Data types and potential issues (e.g., missing values).

In [99]:
from pathlib import Path

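# Path to the processed data file, relative to this notebook
path = Path("../../data/processed/baseline_all_processed.csv")
df = pd.read_csv(path)

# Hold out 20% of the rows. Splitting a single DataFrame returns (train, test)
# row subsets, not (features, target), so the results are named accordingly.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)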

# View the basic info and statistics of the DataFrame
df.info()
df.describe()

# Uncomment to see first 5 rows
# df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13921 entries, 0 to 13920
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   date                 13921 non-null  object 
 1   distance             13921 non-null  float64
 2   effective_load       13921 non-null  float64
 3   exercise_name        13921 non-null  object 
 4   exercise_normalized  13921 non-null  object 
 5   notes                42 non-null     object 
 6   reps                 13921 non-null  int64  
 7   rpe                  8426 non-null   float64
 8   seconds              13921 non-null  int64  
 9   set_order            13921 non-null  int64  
 10  set_volume           13921 non-null  float64
 11  weight               13921 non-null  float64
 12  workout_name         13921 non-null  object 
 13  workout_notes        221 non-null    object 
dtypes: float64(5), int64(3), object(6)
memory usage: 1.5+ MB


| statistic | distance | effective_load | reps | rpe | seconds | set_order | set_volume | weight |
|---|---|---|---|---|---|---|---|---|
| count | 13921.0 | 13921.0 | 13921.0 | 8426.0 | 13921.0 | 13921.0 | 13921.0 | 13921.0 |
| mean | 0.063505 | 78.980662 | 8.931111 | 9.373645 | 0.769557 | 2.840816 | 1059.633528 | 132.51978 |
| std | 4.457465 | 112.363625 | 3.277108 | 0.622712 | 24.986785 | 1.818789 | 971.775131 | 117.887481 |
| min | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 6.0 | 8.90411 | 0.0 | 1.0 | 348.75 | 35.5 |
| 50% | 0.0 | 25.5 | 8.0 | 9.5 | 0.0 | 2.0 | 900.0 | 100.0 |
| 75% | 0.0 | 138.013699 | 10.0 | 10.0 | 0.0 | 4.0 | 1610.0 | 225.0 |
| max | 363.0 | 2632.054795 | 60.0 | 10.0 | 1260.0 | 11.0 | 17736.0 | 2956.0 |


### Data Cleaning

We drop the following columns, which are not relevant to predicting workout intensity:

- `notes`, `workout_notes`: Free-text notes about the workout that are not useful for prediction.
- `seconds`: Not used as a feature in this model.

`exercise_name` is not a model input either, but it stays in the DataFrame for now and is dropped at the end of the feature engineering pipeline.

This application focuses exclusively on strength training, so we also remove cardio entries. Cardio rows are identified by a nonzero `distance` value: the column has no missing values and is 0 for most rows. Once those rows are removed, `distance` is redundant, so we drop the column as well.

In [100]:
# Drop columns we know are unnecessary for our ML application
df = df.drop(columns=['notes', 'workout_notes', 'seconds'])

# Drop cardio rows (nonzero distance) since we are focusing on strength training
cardio_rows = df[df['distance'] > 0]
df = df.drop(cardio_rows.index)

# Drop distance column since it is now redundant
df = df.drop(columns=['distance'])

### View Missing Values and Data Types

We now check for missing values across all columns. This will help us identify any columns that need imputation or handling of missing data.

In [101]:
print("\nMissing Values:")
print(df.isnull().sum())
print("\nData Types:")
print(df.dtypes)



Missing Values:
date                      0
effective_load            0
exercise_name             0
exercise_normalized       0
reps                      0
rpe                    5495
set_order                 0
set_volume                0
weight                    0
workout_name              0
dtype: int64

Data Types:
date                    object
effective_load         float64
exercise_name           object
exercise_normalized     object
reps                     int64
rpe                    float64
set_order                int64
set_volume             float64
weight                 float64
workout_name            object
dtype: object


### View Categorical and Numerical Columns

Categorical columns with low cardinality are those with a small number of unique values (here, 20 or fewer). These columns are typically suitable for one-hot encoding or label encoding, while high-cardinality columns are only listed by name.

In [102]:
def explore_columns(df):
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    print("\n\nCategorical Columns with low cardinality:")
    for col in categorical_cols:
        if df[col].nunique() <= 20:
            print(f"{col}")
            print(df[col].unique())
            print("\n")

    print("\n\nUnique Values for Categorical Columns with high cardinality:")
    for col in categorical_cols:
        if df[col].nunique() > 20:
            print(f"{col}")

    print("\n\nNumerical Columns")
    numerical_cols = df.select_dtypes(include=[np.number]).columns
    for col in numerical_cols:
        print(f"{col}")

explore_columns(df)



Categorical Columns with low cardinality:


Unique Values for Categorical Columns with high cardinality:
date
exercise_name
exercise_normalized
workout_name


Numerical Columns
effective_load
reps
rpe
set_order
set_volume
weight


We need to decide how to handle the columns with missing values:

- **`notes`**: Dropped, as it doesn't provide relevant information for our model.
- **`rpe`**: About 39% of the raw rows (5,495 of 13,921) have no RPE. Because RPE is a key feature for our model, we impute the missing values with the **median** (or mean). Alternatively, since the proportion of missing values is significant, we could consider dropping those rows.
- **`workout_notes`**: Dropped, as it doesn't contribute to predicting workout intensity.
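
The share of missing values that drives the impute-or-drop decision can be checked directly. The snippet below is a minimal sketch on a made-up toy DataFrame (the values are illustrative and not taken from the workout data); on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; values are made up
toy = pd.DataFrame({
    "reps": [5, 8, 10, 12],
    "rpe": [9.0, np.nan, 8.5, np.nan],
})

# isna() yields booleans; their mean is the fraction of missing values per column
missing_fraction = toy.isna().mean()
print(missing_fraction)
```

A column with a high missing fraction is a candidate for imputation plus a missing-value indicator rather than row deletion, since deleting rows would discard a large part of the training data.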

Let's think about which features we truly need for a model that predicts the optimal top set intensity for the next session.

What we definitely need:
- `date` (for time-based features)
- `effective_load` (target variable)
- `reps`
- `rpe` (after imputation and encoding)

What we don't need:
- `notes`
- `workout_notes`
- `exercise_name`
- `workout_name`
- `exercise_normalized`
- `distance` (cardio rows can be dropped entirely)
- `seconds`


## Now for some Feature Engineering Brainstorming!

#### Time-Based Feature Engineering
- days_since_last_workout: an integer indicating the number of days since the previous workout session
- days_since_first_workout: an integer indicating the number of days since the first recorded workout
- session_number: an integer indicating the session number for each exercise
- rolling_avg_load_last_n_sessions: a float giving the rolling average load over the last n sessions
- rolling_trend_load: a float giving the trend (slope) of load over the last n sessions
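
To make these definitions concrete, here is a minimal sketch of how such features could be computed with pandas. It is not the code in `de_utils`: the toy data, the helper `slope`, and the choice to shift loads by one session (so a row only sees earlier sessions) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy log for one exercise; values are illustrative only
toy = pd.DataFrame({
    "exercise_normalized": ["squat"] * 4,
    "date": pd.to_datetime(["2024-01-01", "2024-01-04", "2024-01-08", "2024-01-15"]),
    "effective_load": [100.0, 105.0, 102.5, 110.0],
}).sort_values(["exercise_normalized", "date"]).reset_index(drop=True)

key = toy["exercise_normalized"]

# Session counter per exercise, starting at 1
toy["session_number"] = toy.groupby("exercise_normalized").cumcount() + 1

# Days since the first logged workout in the whole log
toy["days_since_first_workout"] = (toy["date"] - toy["date"].min()).dt.days

# Days since the previous session of the same exercise (NaN for the first one)
toy["days_since_last_workout"] = toy.groupby("exercise_normalized")["date"].diff().dt.days


def slope(values):
    """Least-squares slope of the values against positions 0..len-1 (hypothetical helper)."""
    return np.polyfit(np.arange(len(values)), values, 1)[0]


n = 3
# Assumption: shift by one session so the current row's load is not part of its own feature
previous_load = toy.groupby("exercise_normalized")["effective_load"].shift(1)

toy["rolling_avg_load_last_3_sessions"] = previous_load.groupby(key).transform(
    lambda s: s.rolling(n).mean()
)
toy["rolling_trend_load"] = previous_load.groupby(key).transform(
    lambda s: s.rolling(n).apply(slope, raw=True)
)
print(toy)
```

The first rows of each exercise have no full window of earlier sessions, so the rolling columns start as NaN and need a fill strategy before modeling.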

#### RPE Feature Engineering
First, we need to handle missing RPE values.
We create a column (`rpe_missing`) indicating which rows are missing an RPE value.
Then we impute the missing values with the median RPE.

We'll bin RPE and then encode it as an ordinal categorical variable:
- 0 - 5: Low
- 6 - 7: Medium
- 8 - 10: High
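
The RPE steps described here (flag, impute, bin, encode) could look roughly like the sketch below. This is not the `de_utils` implementation. In particular, the bin edges are an assumption: the listed ranges do not say where fractional values such as 7.5 belong, so this sketch uses left-closed bins `[0, 6)`, `[6, 8)` and `[8, inf)`.

```python
import numpy as np
import pandas as pd

# Toy RPE values for illustration only
toy = pd.DataFrame({"rpe": [5.0, 6.5, np.nan, 9.0, 10.0]})

# 1. Flag rows where RPE was not recorded, before imputing
toy["rpe_missing"] = toy["rpe"].isna().astype(int)

# 2. Impute missing values with the median of the observed values
toy["rpe"] = toy["rpe"].fillna(toy["rpe"].median())

# 3. Bin into ordered categories (edges are an assumption, see the note above)
toy["rpe_binned"] = pd.cut(
    toy["rpe"],
    bins=[0, 6, 8, np.inf],
    right=False,  # left-closed intervals: [0, 6), [6, 8), [8, inf)
    labels=["Low", "Medium", "High"],
)

# 4. Ordinal encoding: Low -> 0, Medium -> 1, High -> 2
toy["rpe_ordinal"] = toy["rpe_binned"].cat.codes
print(toy)
```

Keeping `rpe_missing` alongside the imputed value lets the model distinguish a recorded RPE from a filled-in one.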

#### REPS Feature Engineering
We can create bins for reps as well (`rep_range_buckets`):
- 1-5: Strength
- 6-15: Hypertrophy
- 16+: Endurance
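
A rep-range bucket of this kind can be sketched with `pd.cut`. The column name `rep_range_bucket` and the toy values are hypothetical, and this sketch assigns exactly 15 reps to Hypertrophy, which is an assumption about the boundary.

```python
import numpy as np
import pandas as pd

# Toy rep counts for illustration only
toy = pd.DataFrame({"reps": [3, 5, 6, 12, 15, 20]})

# Right-closed intervals: (0, 5], (5, 15], (15, inf)
toy["rep_range_bucket"] = pd.cut(
    toy["reps"],
    bins=[0, 5, 15, np.inf],
    labels=["Strength", "Hypertrophy", "Endurance"],
)
print(toy)
```

With these edges a set logged with 0 reps falls outside every interval and becomes NaN, so such rows would need to be filtered or handled separately.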

<br>

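>Fortunately, the functions that implement these feature engineering ideas are already defined in DE_utils.py, so there is no need to code them here. Run the cell below to build the features. The mutual information (MI) scoring at the end of the cell is commented out and can be enabled to view the score for each feature.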

In [None]:


def feature_engineering_pipeline(df):
    """
    Main pipeline for feature engineering.

    Parameters:
    df (DataFrame): The raw DataFrame to process.

    Returns:
    DataFrame: The DataFrame with engineered features
    """
    dfCopy = df.copy()


    dfCopy = de_utils.add_session_number_per_exercise(dfCopy)
    dfCopy = de_utils.add_time_features(dfCopy)
    dfCopy = de_utils.add_days_since_last_workout(dfCopy)
    dfCopy = de_utils.add_rolling_avg_load_last_n_sessions(dfCopy, n=3)
    dfCopy = de_utils.add_rolling_trend_load(dfCopy, n=3)
    dfCopy['rolling_trend_load'] = dfCopy['rolling_trend_load'].fillna(dfCopy['rolling_trend_load'].mean())

    dfCopy = de_utils.handle_missing_rpe(dfCopy)
    dfCopy = de_utils.bin_rpe(dfCopy)
    dfCopy = de_utils.encode_rpe_ordinal(dfCopy)

    # IMPORTANT: use dfCopy, not df, to ensure engineered columns are available
    dfCopy = de_utils.filter_top_set_sessions(dfCopy)

    # We can drop these columns since we already extracted useful features from them, and we get to keep our numerical features
    dfCopy = dfCopy.drop(columns=['date', 'rpe', 'rpe_binned', 'workout_name', 'exercise_name', 'exercise_normalized'])
  

    return dfCopy



df = feature_engineering_pipeline(df)

explore_columns(df)

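# Mutual information scores (uncomment to run); the target is effective_load
# X = df.drop(columns=['effective_load'])
# y = df['effective_load']
# mi_scores = de_utils.get_mi_scores(X, y)
# print("\nMutual Information Scores:")
# print(mi_scores.sort_values(ascending=False))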




Categorical Columns with low cardinality:


Unique Values for Categorical Columns with high cardinality:
exercise_name


Numerical Columns
effective_load
reps
set_order
set_volume
weight
session_number
days_since_first_workout
days_since_last_workout
rolling_avg_load_last_3_sessions
rolling_trend_load
rpe_missing
rpe_ordinal
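
The MI step that is commented out in the cell above relies on `de_utils.get_mi_scores`, whose implementation is not shown in this notebook. As a rough illustration of what such a helper might compute, here is a minimal sketch using scikit-learn's `mutual_info_regression` on synthetic data. The feature names `informative` and `noise` are made up, and nothing here reflects results on the workout data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: one feature drives the target, the other is pure noise
rng = np.random.default_rng(0)
n_rows = 300
X = pd.DataFrame({
    "informative": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
y = 3.0 * X["informative"] + rng.normal(scale=0.1, size=n_rows)

# MI between each feature and a continuous target; higher means more shared information
mi_scores = pd.Series(
    mutual_info_regression(X, y, random_state=0),
    index=X.columns,
    name="MI Scores",
).sort_values(ascending=False)
print(mi_scores)
```

MI scores are non-negative and capture nonlinear dependence, so they are useful for ranking features before training but do not indicate the direction of an effect.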
