# Gym Trainer ML: Predicting Top Set Intensity

This notebook demonstrates the workflow for predicting the optimal top set intensity for a user’s next workout session. 
We use past workout data, engineered features, and XGBoost to make predictions. 

**Goals:**
- Clean and preprocess the workout data
- Engineer features for ML and rule-based logic
- Train an XGBoost model
- Keep rule-based features separate for UI/alerts


### Imports and Setup

In [None]:
# Standard libraries
import pandas as pd
import numpy as np

# sklearn imports
from sklearn.model_selection import train_test_split

# Utils for ML (DE_utils.py)
import de_utils

### Load Data

In this section, we load the workout data and perform an initial inspection to understand the dataset. This includes checking:

- The number of rows and columns.
- Basic statistics about the numerical columns.
- Data types and potential issues (e.g., missing values).

In [None]:
from pathlib import Path

# Path to the processed data file, relative to this notebook
path = Path("../../data/processed/baseline_all_processed.csv")
df = pd.read_csv(path)

# View the basic info and statistics of the DataFrame
df.info()
df.describe()

# Uncomment to see first 5 rows
# df.head()

### Data Cleaning

We drop the following columns, which are not relevant to predicting workout intensity:

- `notes`, `workout_notes`: Free-text notes about the workout that are not useful for prediction.
- `seconds`: A column that doesn’t contain relevant numerical or categorical data for this model.

`exercise_name` is not a model input either, but it stays in the DataFrame for now and is dropped at the end of the feature engineering pipeline.


This application focuses exclusively on strength training, so we also remove cardio entries. Cardio rows are identified by a non-null value in the `distance` column, so we drop every row where `distance` is not NaN and then drop the `distance` column altogether.

In [None]:
# Drop columns we know are unnecessary for our ML application
df = df.drop(columns=['notes', 'workout_notes', 'seconds'])

# Drop cardio rows (rows that have a distance value) since we are focusing on strength training
cardioRows = df[df['distance'].notna()]
df = df.drop(cardioRows.index)

# Drop distance column since it is now redundant
df = df.drop(columns=['distance'])

### View Missing Values and Data Types

We now check for missing values across all columns. This will help us identify any columns that need imputation or handling of missing data.

In [None]:
print("\nMissing Values:")
print(df.isnull().sum())
print("\nData Types:")
print(df.dtypes)


### View Categorical and Numerical Columns

Categorical columns with low cardinality are those with a small number of unique values (here, 20 or fewer). These columns are typically suitable for one-hot encoding or label encoding.

In [None]:
def explore_columns(df):
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    print("\n\nCategorical Columns with low cardinality:")
    for col in categorical_cols:
        if df[col].nunique() <= 20:
            print(f"{col}")
            print(df[col].unique())
            print("\n")

    print("\n\nCategorical Columns with high cardinality:")
    for col in categorical_cols:
        if df[col].nunique() > 20:
            print(f"{col}")

    print("\n\nNumerical Columns")
    numerical_cols = df.select_dtypes(include=[np.number]).columns
    for col in numerical_cols:
        print(f"{col}")

explore_columns(df)

We need to decide how to handle missing values in the following columns:

- **`notes`**: Already dropped during data cleaning, as it doesn't provide relevant information for our model.
- **`rpe`**: Impute missing values with the **median** (or mean), as RPE is a key feature for our model. If a large proportion of values is missing, dropping those rows is an alternative.
- **`workout_notes`**: Already dropped during data cleaning, as it doesn't contribute to predicting workout intensity.

Let's think about which features we truly need for a model that predicts the optimal top set intensity for the next session.

What we definitely need:
- `date` (for time-based features)
- `effective_load` (target variable)
- `reps`
- `rpe` (after imputation and encoding)

What we don't need:
- `notes`
- `workout_notes`
- `exercise_name`
- `workout_name`
- `exercise_normalized`
- `distance` (cardio rows can be dropped entirely)
- `seconds`


## Now for some Feature Engineering Brainstorming!

**Time-based features** help the model learn patterns based on when the workout occurred relative to other sessions:

- **`days_since_last_workout`** (captures recovery periods between sessions)
- **`days_since_first_workout`** (shows the user’s experience level)
- **`session_number`** (tracks progression within each exercise)
- **`rolling_avg_load_last_n_sessions`** (captures trends in workout intensity)
- **`rolling_trend_load`** (slope of recent loads; measures whether the user’s intensity is improving or declining)
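
As a rough illustration of what these time-based features look like in pandas, here is a minimal sketch. It is not the implementation in `DE_utils.py`: the helper names `_slope` and `add_time_features_sketch`, the toy data, the one-row-per-session layout, and the choice to build the rolling features from previous sessions only (so a row never sees its own load) are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def _slope(values: np.ndarray) -> float:
    """Least-squares slope of the values against their position in the window."""
    return float(np.polyfit(np.arange(len(values)), values, 1)[0])


def add_time_features_sketch(df: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Hypothetical helper; assumes one row per session per exercise."""
    out = df.sort_values(["exercise_name", "date"]).copy()
    key = out["exercise_name"]

    # Days since the previous session of the same exercise (NaN for the first one)
    out["days_since_last_workout"] = out.groupby("exercise_name")["date"].diff().dt.days
    # Days since the earliest logged session in the data
    out["days_since_first_workout"] = (out["date"] - out["date"].min()).dt.days
    # 1-based session counter within each exercise
    out["session_number"] = out.groupby("exercise_name").cumcount() + 1

    # Assumption: only previous sessions feed the rolling features
    prev_load = out.groupby("exercise_name")["effective_load"].shift(1)
    prev_by_exercise = prev_load.groupby(key)
    out[f"rolling_avg_load_last_{n}_sessions"] = (
        prev_by_exercise.rolling(n, min_periods=1).mean().reset_index(level=0, drop=True)
    )
    # The slope needs a full window of n previous sessions, otherwise it stays NaN
    out["rolling_trend_load"] = (
        prev_by_exercise.rolling(n).apply(_slope, raw=True).reset_index(level=0, drop=True)
    )
    return out


# Toy data, invented for illustration only
sessions = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-04", "2024-01-08", "2024-01-15", "2024-01-02"]
    ),
    "exercise_name": ["squat", "squat", "squat", "squat", "bench"],
    "effective_load": [100.0, 105.0, 110.0, 115.0, 60.0],
})
features = add_time_features_sketch(sessions, n=3)
print(features)
```

The first session of each exercise has no history, so its rolling features are NaN and need an explicit fill strategy before training.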


**RPE (Rate of Perceived Exertion)** will be handled in the following steps:

1. **Imputation**: Fill missing RPE values with the **median** RPE.
2. **Binning**: RPE values will be categorized into three groups:
   - **Low**: 0 - 5
   - **Medium**: 6 - 7
   - **High**: 8 - 10
3. **Encoding**: Convert RPE categories into an ordinal feature to capture workout intensity.
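
The three steps above can be sketched in a few lines of pandas. This is only an illustration under assumptions, not the code in `DE_utils.py`: the helper name `prepare_rpe_sketch` is hypothetical, and because the bins above do not say where half-point values such as 5.5 or 7.5 belong, the sketch places the cut points at 5.5 and 7.5 (each value falls into the lower bin when it sits exactly on a cut point).

```python
import pandas as pd


def prepare_rpe_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute, bin, and ordinally encode RPE."""
    out = df.copy()

    # Record which rows had no RPE logged before filling them in
    out["rpe_missing"] = out["rpe"].isna().astype(int)

    # 1. Imputation: fill missing values with the median of the observed RPE
    out["rpe"] = out["rpe"].fillna(out["rpe"].median())

    # 2. Binning: low (0 to 5), medium (6 to 7), high (8 to 10);
    #    the 5.5 and 7.5 cut points are an assumption for half-point values
    out["rpe_binned"] = pd.cut(
        out["rpe"],
        bins=[0, 5.5, 7.5, 10],
        labels=["low", "medium", "high"],
        include_lowest=True,
    )

    # 3. Encoding: pd.cut returns an ordered categorical, so the codes are 0 < 1 < 2
    out["rpe_ordinal"] = out["rpe_binned"].cat.codes
    return out


# Toy data, invented for illustration only
rpe_example = prepare_rpe_sketch(pd.DataFrame({"rpe": [5.0, None, 7.0, 9.0, 6.0]}))
print(rpe_example)
```

Keeping a separate missing-value flag makes it possible to check later whether the imputed rows behave differently from the logged ones.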


**Reps (Number of Repetitions)** will be binned as follows:

- **Strength**: 1 - 5 reps (lower rep ranges focus on strength training)
- **Hypertrophy**: 6 - 15 reps (moderate rep ranges for muscle growth)
- **Endurance**: 16+ reps (higher rep ranges for endurance training)

Binning reps helps capture the type of training being done and its impact on the top-set intensity.
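
A minimal sketch of this binning with `pd.cut` follows. The helper name `bin_reps_sketch` and the output labels are hypothetical, and the sketch assumes that a set of exactly 15 reps counts as hypertrophy.

```python
import pandas as pd


def bin_reps_sketch(reps: pd.Series) -> pd.Series:
    """Hypothetical helper: map rep counts to a training-style category."""
    return pd.cut(
        reps,
        # Right-closed intervals: (0, 5], (5, 15], (15, inf)
        bins=[0, 5, 15, float("inf")],
        labels=["strength", "hypertrophy", "endurance"],
    )


# Toy data, invented for illustration only
rep_ranges = bin_reps_sketch(pd.Series([3, 5, 6, 12, 15, 20], name="reps"))
print(rep_ranges.tolist())
```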


<br>

> Fortunately, the functions that implement these feature engineering ideas are already defined in `DE_utils.py`, so there is no need to code them out here. Just run the cell below and view the MI score for each feature.

In [None]:


def feature_engineering_pipeline(df):
    """
    Main pipeline for feature engineering.

    Parameters:
    df (DataFrame): The raw DataFrame to process.

    Returns:
    DataFrame: The DataFrame with engineered features
    """
    dfCopy = df.copy()


    dfCopy = de_utils.add_session_number_per_exercise(dfCopy)
    dfCopy = de_utils.add_time_features(dfCopy)
    dfCopy = de_utils.add_days_since_last_workout(dfCopy)
    dfCopy = de_utils.add_rolling_avg_load_last_n_sessions(dfCopy, n=3)
    dfCopy = de_utils.add_rolling_trend_load(dfCopy, n=3)
    dfCopy['rolling_trend_load'] = dfCopy['rolling_trend_load'].fillna(dfCopy['rolling_trend_load'].mean())

    dfCopy = de_utils.handle_missing_rpe(dfCopy)
    dfCopy = de_utils.bin_rpe(dfCopy)
    dfCopy = de_utils.encode_rpe_ordinal(dfCopy)

    # IMPORTANT: use dfCopy, not df, to ensure engineered columns are available
    dfCopy = de_utils.filter_top_set_sessions(dfCopy)

    # We can drop these columns since we already extracted useful features from them, and we get to keep our numerical features
    dfCopy = dfCopy.drop(columns=['date', 'rpe', 'rpe_binned', 'workout_name', 'exercise_name', 'exercise_normalized'])
  

    return dfCopy



df = feature_engineering_pipeline(df)

X = df.drop(columns=['effective_load'])
y = df['effective_load']

mi_scores = de_utils.get_mi_scores(X, y)



print("\nMutual Information Scores:")
print(mi_scores.sort_values(ascending=False))


### **Mutual Information Scores**

The cell above computes the mutual information (MI) between each feature and the target (`effective_load`) and prints the scores from highest to lowest.
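
For reference, MI scores of this kind can be computed with scikit-learn's `mutual_info_regression`. The sketch below is an assumption about what a helper like `get_mi_scores` might do, not its actual implementation; the function name `mi_scores_sketch` and the synthetic `signal` and `noise` features are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def mi_scores_sketch(X: pd.DataFrame, y: pd.Series, random_state: int = 0) -> pd.Series:
    """Hypothetical helper: MI between each numeric feature and a continuous target."""
    scores = mutual_info_regression(X, y, random_state=random_state)
    return pd.Series(scores, index=X.columns, name="mi_score").sort_values(ascending=False)


# Synthetic data: one feature drives the target, the other is pure noise
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
toy_y = 2.0 * toy_X["signal"] + rng.normal(scale=0.1, size=300)

toy_scores = mi_scores_sketch(toy_X, toy_y)
print(toy_scores)
```

MI is non-negative and has no fixed upper bound, so scores are best read relative to each other rather than on an absolute scale.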


### **Feature Selection Strategy**

We'll drop features with an MI score below **0.15**, as they offer minimal value for predicting `effective_load`.

#### **Features to Drop**:
- **days_since_last_workout** (MI = 0.000)
- **rpe_ordinal** (MI = 0.013)
- **is_top_set** (MI = 0.059)

#### **Remaining Features**:
- **weight**
- **rolling_avg_load_last_3_sessions**
- **set_volume**
- **rolling_trend_load**
- **days_since_first_workout**
- **rpe_missing**
- **session_number**
- **reps**
- **set_order**
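
The threshold rule can be applied programmatically instead of by hand. The sketch below uses a hypothetical helper, `select_by_mi`, and placeholder feature names and scores that are not results from this notebook.

```python
import pandas as pd


def select_by_mi(mi_scores: pd.Series, threshold: float = 0.15):
    """Hypothetical helper: split feature names into keep/drop lists by MI score."""
    keep = mi_scores[mi_scores >= threshold].index.tolist()
    drop = mi_scores[mi_scores < threshold].index.tolist()
    return keep, drop


# Placeholder scores, invented for illustration only
example_scores = pd.Series({"feature_a": 0.80, "feature_b": 0.20, "feature_c": 0.05})
keep, drop = select_by_mi(example_scores)
print("keep:", keep)
print("drop:", drop)
```

With the real scores, the returned `drop` list could be passed to `X.drop(columns=drop)` before training.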


### However...
`rpe_missing` has a suspiciously high MI score for a flag that only records whether RPE was logged, which is unlikely to be a strong predictor of load on its own. The score is probably inflated, and keeping the feature risks the model memorizing noise, so it is better to **drop** `rpe_missing`.


The remaining features will be kept for model training.
