# ChronoFit MVAA - Model Training Pipeline

This notebook trains the AI model for personalized workout recommendations using:
- **20,000 synthetic training records** with realistic parameter distributions
- **User feedback data** weighted 2x for rapid adaptation
- **RandomForest + MultiOutput Regression** for duration & intensity prediction
- **Cross-validation** for robust performance evaluation
- **Automatic model export** to .joblib files for Streamlit deployment

## Pipeline Steps:
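1. **Synthetic Data Generation** - Create 20,000 synthetic feature records
2. **Target Generation** - Derive goal, duration, and intensity targets
3. **Data Preprocessing** - Build the feature transformation pipeline
4. **Train-Test Split** - Stratified 80/20 split on goal
5. **Goal Classification** - Train the goal classifier
6. **Model Training** - Train the main regression model with GridSearchCV
7. **Evaluation & Validation** - Assess performance on the train set, test set, and 5-fold CV
8. **User Feedback Integration** - Load real user feedback, if available
9. **Feature Importance** - Rank features for each target
10. **Model Export** - Save trained models for deployment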

In [1]:
import numpy as np
import pandas as pd
import joblib
import os
import time
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedShuffleSplit, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Set seed for reproducibility
np.random.seed(42)
print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## Step 1: Generate Synthetic Training Data (20k records)

In [2]:
NUM_RECORDS = 20000
print(f"🔄 Generating {NUM_RECORDS} synthetic training records...\n")

# Realistic age distribution (bimodal: young athletes + older fitness enthusiasts)
AGE = np.concatenate([
    np.random.normal(28, 6, int(NUM_RECORDS * 0.6)).clip(18, 40).astype(int),
    np.random.normal(48, 8, int(NUM_RECORDS * 0.4)).clip(40, 70).astype(int)
])
np.random.shuffle(AGE)
AGE = AGE[:NUM_RECORDS]

# Sex distribution
SEX = np.random.choice(['M', 'F'], size=NUM_RECORDS, p=[0.6, 0.4])

# Weight follows BMI-realistic distribution
WEIGHT_KG = np.random.normal(78, 14, size=NUM_RECORDS).clip(45, 130).round(1)

# Sleep: realistic with some sleep-deprived individuals
SLEEP_HRS = np.concatenate([
    np.random.normal(7.3, 0.8, int(NUM_RECORDS * 0.75)),
    np.random.normal(5.5, 0.6, int(NUM_RECORDS * 0.25))
]).clip(4.0, 10.0).round(1)
np.random.shuffle(SLEEP_HRS)

# RHR rises with weight (0.15 bpm per kg above 75 kg), plus Gaussian noise
RHR_BASE = 62 + (WEIGHT_KG - 75) * 0.15
RHR_BPM = (RHR_BASE + np.random.normal(0, 6, size=NUM_RECORDS)).clip(45, 95).round(0)

# Soreness distribution (skewed toward low)
SORENESS = np.random.choice([1, 2, 3, 4, 5], size=NUM_RECORDS, p=[0.35, 0.30, 0.20, 0.10, 0.05])

# Mental stress
MENTAL_STRESS = np.random.choice([1, 2, 3, 4, 5], size=NUM_RECORDS, p=[0.20, 0.25, 0.30, 0.15, 0.10])

# Nutrition (user input estimates)
CALORIES = np.random.normal(2200, 400, size=NUM_RECORDS).clip(800, 4000).round(0)
PROTEIN_G = np.random.normal(100, 25, size=NUM_RECORDS).clip(30, 300).round(1)
CARBS_G = np.random.normal(280, 60, size=NUM_RECORDS).clip(50, 600).round(1)

# Confidence score in nutrition data
NUTR_CONF_SCORE = np.random.choice([1, 2, 3, 4, 5], size=NUM_RECORDS, p=[0.15, 0.20, 0.30, 0.20, 0.15])

# Create features dataframe
X = pd.DataFrame({
    'age': AGE,
    'sex': SEX,
    'weight_kg': WEIGHT_KG,
    'sleep_hrs': SLEEP_HRS,
    'rhr_bpm': RHR_BPM,
    'soreness': SORENESS,
    'mental_stress': MENTAL_STRESS,
    'calories_in': CALORIES,
    'protein_g': PROTEIN_G,
    'carbs_g': CARBS_G,
    'nutrition_confidence': NUTR_CONF_SCORE
})

print(f"✅ Generated {len(X)} records")
print(f"\nFeature statistics:\n{X.describe().round(2)}")
print(f"\nFirst 5 records:\n{X.head()}")

🔄 Generating 20000 synthetic training records...

✅ Generated 20000 records

Feature statistics:
            age  weight_kg  sleep_hrs   rhr_bpm  soreness  mental_stress  \
count  20000.00   20000.00   20000.00  20000.00  20000.00       20000.00   
mean      35.90      77.83       6.86     62.47      2.20           2.71   
std       11.91      13.81       1.09      6.32      1.17           1.22   
min       18.00      45.00       4.00     45.00      1.00           1.00   
25%       26.00      68.40       6.10     58.00      1.00           2.00   
50%       33.00      77.80       7.00     62.00      2.00           3.00   
75%       45.00      87.20       7.70     67.00      3.00           3.00   
max       70.00     129.50      10.00     89.00      5.00           5.00   

       calories_in  protein_g   carbs_g  nutrition_confidence  
count     20000.00   20000.00  20000.00              20000.00  
mean       2198.69      99.96    279.75                  3.01  
std         397.36   
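
The age and sleep mixtures above size their components with `int(NUM_RECORDS * share)`. That adds up exactly for 20,000 records, but for a size such as 20,001 the floored shares sum to 20,000 and the array would no longer match the other columns. A minimal sketch of a helper that always returns exactly `n` values is shown here; `sample_mixture` and `n_records` are hypothetical names, and it uses NumPy's `default_rng` generator rather than the global seed used in the notebook.

```python
import numpy as np

def sample_mixture(rng, n, components):
    """Draw exactly n values from a mixture of clipped normal distributions.

    components: list of (weight, mean, std, low, high) tuples; weights should sum to 1.
    """
    # Floor each component's share, then let the last component absorb the remainder
    sizes = [int(n * weight) for weight, *_ in components]
    sizes[-1] = n - sum(sizes[:-1])
    parts = [
        rng.normal(mean, std, size).clip(low, high)
        for size, (_, mean, std, low, high) in zip(sizes, components)
    ]
    values = np.concatenate(parts)
    rng.shuffle(values)  # in-place shuffle so component order carries no information
    return values

rng = np.random.default_rng(42)
n_records = 20001  # deliberately not a round number
age = sample_mixture(rng, n_records, [(0.6, 28, 6, 18, 40), (0.4, 48, 8, 40, 70)]).astype(int)
sleep_hrs = sample_mixture(rng, n_records, [(0.75, 7.3, 0.8, 4.0, 10.0), (0.25, 5.5, 0.6, 4.0, 10.0)]).round(1)
print(len(age), len(sleep_hrs))  # 20001 20001
```

The component parameters are copied from the cell above; only the sizing logic differs.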

## Step 2: Generate Target Variables (Duration & Intensity)

In [3]:
# Goal categories
GOAL_LABELS = ['Strength', 'Endurance', 'Maintenance', 'Flexibility']
GOALS = np.random.choice(GOAL_LABELS, size=NUM_RECORDS, p=[0.35, 0.30, 0.20, 0.15])

# Base duration & intensity by goal
goal_params = {
    'Strength': (45, 8),      # (duration_min, intensity)
    'Endurance': (60, 6),
    'Maintenance': (40, 5),
    'Flexibility': (30, 3)
}

DURATION_BASE = np.array([goal_params[g][0] for g in GOALS], dtype=float)
INTENSITY_BASE = np.array([goal_params[g][1] for g in GOALS], dtype=float)

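# Modifiers based on readiness metrics
sleep_modifier = (SLEEP_HRS - 5.5) * 2  # More sleep = more intense/longer
rhr_modifier = (RHR_BPM - 50) / 10      # Grows with RHR; added below, so a higher RHR raises intensity
soreness_modifier = (6 - SORENESS) * 1.5  # Larger when less sore; subtracted below, so less soreness lowers both targets
stress_modifier = (6 - MENTAL_STRESS) * 1.2  # Larger when less stressed; subtracted below, so less stress lowers intensity

# Apply modifiers
# NOTE: as written, the RHR, soreness, and stress terms move the targets opposite to the
# usual readiness logic (fitter, less sore, less stressed = harder session).
# Flip their signs in the two lines below if that logic is the intent.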
DURATION = (DURATION_BASE + sleep_modifier * 3 - soreness_modifier * 2).clip(20, 120).round(0)
INTENSITY = (INTENSITY_BASE + sleep_modifier * 0.5 + rhr_modifier * 0.3 - soreness_modifier * 0.3 - stress_modifier * 0.2).clip(1, 10).round(1)

# Create target dataframe
Y = pd.DataFrame({
    'goal': GOALS,
    'duration_min': DURATION,
    'intensity': INTENSITY
})

print("✅ Generated target variables")
print(f"\nGoal distribution:\n{Y['goal'].value_counts()}")
print(f"\nDuration stats:\n{Y['duration_min'].describe().round(2)}")
print(f"\nIntensity stats:\n{Y['intensity'].describe().round(2)}")

✅ Generated target variables

Goal distribution:
goal
Strength       7095
Endurance      5926
Maintenance    4019
Flexibility    2960
Name: count, dtype: int64

Duration stats:
count    20000.0
mean        43.1
std         12.4
min         20.0
25%         34.0
50%         43.0
75%         52.0
max         82.0
Name: duration_min, dtype: float64

Intensity stats:
count    20000.00
mean         5.31
std          2.08
min          1.00
25%          3.80
50%          5.40
75%          6.90
max         10.00
Name: intensity, dtype: float64
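
To make the target formulas concrete, the sketch below applies them to single records in plain Python. It follows the cell above term by term, signs included; `readiness_targets` and `GOAL_PARAMS` are hypothetical names introduced only for this illustration.

```python
# Base (duration_min, intensity) per goal, copied from the cell above
GOAL_PARAMS = {
    'Strength': (45, 8),
    'Endurance': (60, 6),
    'Maintenance': (40, 5),
    'Flexibility': (30, 3),
}

def readiness_targets(goal, sleep_hrs, rhr_bpm, soreness, mental_stress):
    """Return (duration_min, intensity) for one record, using the notebook's formulas."""
    duration_base, intensity_base = GOAL_PARAMS[goal]
    sleep_modifier = (sleep_hrs - 5.5) * 2
    rhr_modifier = (rhr_bpm - 50) / 10
    soreness_modifier = (6 - soreness) * 1.5
    stress_modifier = (6 - mental_stress) * 1.2

    duration = duration_base + sleep_modifier * 3 - soreness_modifier * 2
    intensity = (intensity_base + sleep_modifier * 0.5 + rhr_modifier * 0.3
                 - soreness_modifier * 0.3 - stress_modifier * 0.2)

    # Same clipping ranges and rounding precision as the vectorised version
    duration = float(round(min(max(duration, 20.0), 120.0)))
    intensity = round(min(max(intensity, 1.0), 10.0), 1)
    return duration, intensity

# Strength, 7.5 h sleep, RHR 60, soreness 2, stress 3:
# duration = 45 + 4*3 - 6*2 = 45; intensity = 8 + 2 + 0.3 - 1.8 - 0.72 = 7.78
print(readiness_targets('Strength', 7.5, 60, 2, 3))     # (45.0, 7.8)

# Flexibility, 5 h sleep, RHR 70, soreness 1, stress 1: both raw values fall below the clip floor
print(readiness_targets('Flexibility', 5.0, 70, 1, 1))  # (20.0, 1.0)
```

The second record shows the clipping at work: the raw values are 12 minutes and -0.35, which are lifted to the minimums of 20 and 1.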


## Step 3: Build Data Preprocessing Pipeline

In [4]:
# Define feature types
categorical_features = ['sex']
numerical_features = ['age', 'weight_kg', 'sleep_hrs', 'rhr_bpm', 'soreness', 'mental_stress', 
                      'calories_in', 'protein_g', 'carbs_g', 'nutrition_confidence']

# Build preprocessing pipeline
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_features),
    ('num', StandardScaler(), numerical_features)
])

# Fit and transform features
X_processed = preprocessor.fit_transform(X)

print("✅ Preprocessing pipeline built")
print(f"Processed features shape: {X_processed.shape}")
print(f"Feature names after transformation: {len(X_processed[0])} features")

✅ Preprocessing pipeline built
Processed features shape: (20000, 11)
Feature names after transformation: 11 features


## Step 4: Train-Test Split with Stratification

In [5]:
# Stratified split on goals for balanced distribution
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_idx, test_idx in sss.split(X_processed, Y['goal']):
    X_train, X_test = X_processed[train_idx], X_processed[test_idx]
    Y_train, Y_test = Y.iloc[train_idx], Y.iloc[test_idx]

print(f"✅ Train-Test Split Complete")
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"\nGoal distribution in train set:\n{Y_train['goal'].value_counts()}")
print(f"\nGoal distribution in test set:\n{Y_test['goal'].value_counts()}")

✅ Train-Test Split Complete
Training set: 16000 samples (80.0%)
Test set: 4000 samples (20.0%)

Goal distribution in train set:
goal
Strength       5676
Endurance      4741
Maintenance    3215
Flexibility    2368
Name: count, dtype: int64

Goal distribution in test set:
goal
Strength       1419
Endurance      1185
Maintenance     804
Flexibility     592
Name: count, dtype: int64
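
An equivalent stratified 80/20 split can usually be written in one call with `train_test_split(..., stratify=...)`, which avoids the single-iteration loop. The sketch below uses made-up `features` and `goals` arrays of 1,000 rows and checks that each goal keeps roughly the same share in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
goal_labels = ['Strength', 'Endurance', 'Maintenance', 'Flexibility']
goals = rng.choice(goal_labels, size=1000, p=[0.35, 0.30, 0.20, 0.15])
features = rng.normal(size=(1000, 3))  # placeholder feature matrix

X_tr, X_te, g_tr, g_te = train_test_split(
    features, goals, test_size=0.2, stratify=goals, random_state=42
)

# Class shares should agree closely between the two partitions
for label in goal_labels:
    print(f"{label:12s} train={np.mean(g_tr == label):.3f}  test={np.mean(g_te == label):.3f}")
```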


## Step 5: Train Goal Classification Model

In [6]:
# Encode goal labels
goal_encoder = LabelEncoder()
Y_goal_encoded = goal_encoder.fit_transform(Y_train['goal'])
Y_goal_test_encoded = goal_encoder.transform(Y_test['goal'])

# Train goal classifier
goal_classifier = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42, n_jobs=-1)
goal_classifier.fit(X_train, Y_goal_encoded)

# Evaluate goal classifier
Y_goal_pred_train = goal_classifier.predict(X_train)
Y_goal_pred_test = goal_classifier.predict(X_test)

goal_accuracy_train = (Y_goal_pred_train == Y_goal_encoded).mean()
goal_accuracy_test = (Y_goal_pred_test == Y_goal_test_encoded).mean()

print("✅ Goal Classification Model Trained")
print(f"Training accuracy: {goal_accuracy_train:.4f}")
print(f"Test accuracy: {goal_accuracy_test:.4f}")
print(f"\nGoal classes: {goal_encoder.classes_}")

✅ Goal Classification Model Trained
Training accuracy: 0.7480
Test accuracy: 0.3483

Goal classes: ['Endurance' 'Flexibility' 'Maintenance' 'Strength']
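
A test accuracy of 0.3483 is best read against a baseline. In the test set, Strength accounts for 1419 of 4000 records (about 0.355), so always predicting Strength would score slightly higher. This is expected here because Step 2 draws the goals independently of every feature. The sketch below reproduces that situation on made-up data with `DummyClassifier`; the array names and the 4,000-row size are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 4000
X_demo = rng.normal(size=(n, 5))
# Labels drawn independently of X_demo, mirroring how GOALS is generated in Step 2
goals = rng.choice(['Strength', 'Endurance', 'Maintenance', 'Flexibility'],
                   size=n, p=[0.35, 0.30, 0.20, 0.15])

X_tr, X_te, g_tr, g_te = train_test_split(
    X_demo, goals, test_size=0.2, stratify=goals, random_state=42
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, g_tr)
forest = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42).fit(X_tr, g_tr)

baseline_acc = baseline.score(X_te, g_te)
forest_train_acc = forest.score(X_tr, g_tr)
forest_test_acc = forest.score(X_te, g_te)
print(f"Majority-class baseline: {baseline_acc:.3f}")
print(f"Random forest train:     {forest_train_acc:.3f}")
print(f"Random forest test:      {forest_test_acc:.3f}")
```

When the forest's test accuracy sits at or below the baseline while its training accuracy is much higher, the model has memorised noise rather than learned a relationship.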


## Step 6: Train Main Regression Model with GridSearch

In [7]:
# Prepare targets for regression (duration + intensity)
Y_values = Y_train[['duration_min', 'intensity']].values

# Define parameter grid for GridSearch
param_grid = {
    'estimator__n_estimators': [100, 200],
    'estimator__max_depth': [12, 15, 18],
    'estimator__min_samples_split': [5, 10],
    'estimator__min_samples_leaf': [2, 4]
}

# Base regressor
base_rf = RandomForestRegressor(random_state=42, n_jobs=-1)

# MultiOutput wrapper for predicting both duration and intensity
multi_model = MultiOutputRegressor(base_rf)

# GridSearch for best parameters
print("🔍 Running GridSearch CV (this may take 2-3 minutes)...\n")
start_time = time.time()

grid_search = GridSearchCV(multi_model, param_grid, cv=3, n_jobs=-1, verbose=1)
grid_search.fit(X_train, Y_values)

elapsed = time.time() - start_time
best_mvva_model = grid_search.best_estimator_

print(f"\n✅ GridSearch Complete in {elapsed:.1f}s")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

🔍 Running GridSearch CV (this may take 2-3 minutes)...

Fitting 3 folds for each of 24 candidates, totalling 72 fits

✅ GridSearch Complete in 231.1s
Best parameters: {'estimator__max_depth': 12, 'estimator__min_samples_leaf': 4, 'estimator__min_samples_split': 10, 'estimator__n_estimators': 200}
Best CV score: 0.3183
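
The "24 candidates, totalling 72 fits" line follows directly from the grid: 2 × 3 × 2 × 2 parameter combinations, each fitted on 3 folds. The sketch below confirms the count with `ParameterGrid` and shows where the `estimator__` prefix comes from: `MultiOutputRegressor` exposes the wrapped forest's parameters under that name. Since no `scoring` argument is passed, the reported best CV score is the estimator's default score, which for regressors is R².

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid
from sklearn.multioutput import MultiOutputRegressor

param_grid = {
    'estimator__n_estimators': [100, 200],
    'estimator__max_depth': [12, 15, 18],
    'estimator__min_samples_split': [5, 10],
    'estimator__min_samples_leaf': [2, 4],
}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 3
print(n_candidates, n_candidates * n_folds)  # 24 72

# Every grid key must be a valid parameter name on the wrapped model
wrapper_params = MultiOutputRegressor(RandomForestRegressor()).get_params()
print(all(key in wrapper_params for key in param_grid))  # True
```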


## Step 7: Model Evaluation & Cross-Validation

In [9]:
# Make predictions on train and test sets
Y_train_pred = best_mvva_model.predict(X_train)
Y_test_pred = best_mvva_model.predict(X_test)

# Calculate metrics
def calculate_metrics(Y_true, Y_pred, set_name):
    """Calculate RMSE, MAE, and R2 for both targets"""
    rmse_duration = np.sqrt(mean_squared_error(Y_true[:, 0], Y_pred[:, 0]))
    rmse_intensity = np.sqrt(mean_squared_error(Y_true[:, 1], Y_pred[:, 1]))
    mae_duration = mean_absolute_error(Y_true[:, 0], Y_pred[:, 0])
    mae_intensity = mean_absolute_error(Y_true[:, 1], Y_pred[:, 1])
    r2_duration = r2_score(Y_true[:, 0], Y_pred[:, 0])
    r2_intensity = r2_score(Y_true[:, 1], Y_pred[:, 1])
    
    print(f"\n{set_name} Metrics:")
    print(f"  Duration - RMSE: {rmse_duration:.2f} min, MAE: {mae_duration:.2f} min, R²: {r2_duration:.4f}")
    print(f"  Intensity - RMSE: {rmse_intensity:.2f}, MAE: {mae_intensity:.2f}, R²: {r2_intensity:.4f}")
    
    return rmse_duration, rmse_intensity, mae_duration, mae_intensity, r2_duration, r2_intensity

# Evaluate on training set (use only numeric columns: duration_min and intensity)
Y_train_numeric = Y_train[['duration_min', 'intensity']].values
rmse_dur_train, rmse_int_train, mae_dur_train, mae_int_train, r2_dur_train, r2_int_train = \
    calculate_metrics(Y_train_numeric, Y_train_pred, "Training Set")

# Evaluate on test set (use only numeric columns: duration_min and intensity)
Y_test_numeric = Y_test[['duration_min', 'intensity']].values
rmse_dur_test, rmse_int_test, mae_dur_test, mae_int_test, r2_dur_test, r2_int_test = \
    calculate_metrics(Y_test_numeric, Y_test_pred, "Test Set")

# Cross-validation scores (use only numeric targets)
cv_scores = cross_val_score(best_mvva_model, X_train, Y_train_numeric, cv=5, scoring='r2')
print(f"\n5-Fold Cross-Validation R² Scores: {cv_scores}")
print(f"Mean CV R² Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

print("\n✅ Model evaluation complete")


Training Set Metrics:
  Duration - RMSE: 7.96 min, MAE: 6.43 min, R²: 0.5878
  Intensity - RMSE: 1.32, MAE: 1.08, R²: 0.5944

Test Set Metrics:
  Duration - RMSE: 10.13 min, MAE: 8.22 min, R²: 0.3308
  Intensity - RMSE: 1.72, MAE: 1.41, R²: 0.3358

5-Fold Cross-Validation R² Scores: [0.31318104 0.31592473 0.33447834 0.32253356 0.31745378]
Mean CV R² Score: 0.3207 (+/- 0.0075)

✅ Model evaluation complete
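
With two targets, `scoring='r2'` reports a single number per fold: the unweighted mean of the per-target R² values. That is consistent with the test metrics above, where 0.3308 and 0.3358 average to about 0.333. The sketch below shows the relationship on made-up arrays; `y_true` and `y_pred` are placeholders, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Two placeholder targets on roughly the scales of duration_min and intensity
y_true = np.column_stack([rng.normal(43, 12, 500), rng.normal(5.3, 2.0, 500)])
y_pred = y_true + np.column_stack([rng.normal(0, 10, 500), rng.normal(0, 1.7, 500)])

per_target = r2_score(y_true, y_pred, multioutput='raw_values')  # one R² per column
combined = r2_score(y_true, y_pred)  # default: uniform average over the columns

print(f"Duration R²:  {per_target[0]:.4f}")
print(f"Intensity R²: {per_target[1]:.4f}")
print(f"Combined R²:  {combined:.4f}")
```

Reporting the per-target values alongside the average makes it easier to spot a case where one target is predicted well and the other poorly.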


## Step 8: Integrate User Feedback Data (if available)

In [10]:
# Check for feedback in a local CSV file (in production, feedback is stored in MongoDB)
feedback_file = 'feedback_history.csv'
feedback_data = []

if os.path.exists(feedback_file):
    print(f"📊 Loading feedback from {feedback_file}...")
    feedback_df = pd.read_csv(feedback_file)
    print(f"✅ Loaded {len(feedback_df)} feedback records")
    
    # Show feedback summary
    if len(feedback_df) > 0:
        print(f"\nFeedback Summary:")
        print(f"  Columns: {list(feedback_df.columns)}")
        print(f"  Date range: {feedback_df['timestamp'].min()} to {feedback_df['timestamp'].max()}")
else:
    print("ℹ️ No feedback file found. Using only synthetic data for training.")
    feedback_df = pd.DataFrame()

# Note: Feedback retraining will happen automatically in chronofit_app.py when 3+ feedbacks are submitted
print("\n💡 In production:")
print("  - User feedback is saved to MongoDB")
print("  - Background thread retrains model every 3 feedbacks")
print("  - See chronofit_app.py for continuous learning implementation")

📊 Loading feedback from feedback_history.csv...
✅ Loaded 4 feedback records

Feedback Summary:
  Columns: ['timestamp', 'age', 'sex', 'weight_kg', 'sleep_hrs', 'rhr_bpm', 'soreness_before', 'mental_stress', 'calories_in', 'protein_g', 'carbs_g', 'predicted_goal', 'recommended_duration', 'recommended_intensity', 'workout_completion_pct', 'actual_intensity', 'difficulty_feedback', 'recovery_feeling', 'soreness_next_day_expected', 'would_repeat']
  Date range: 2025-11-13 10:05:41.293369 to 2025-11-13 10:39:30.525258

💡 In production:
  - User feedback is saved to MongoDB
  - Background thread retrains model every 3 feedbacks
  - See chronofit_app.py for continuous learning implementation
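
This cell only loads and summarises the feedback file; the retraining itself lives in chronofit_app.py and is not shown here. As a rough sketch of how a 2x feedback weighting could be expressed, the example below stacks synthetic and feedback rows and passes a `sample_weight` vector to `MultiOutputRegressor.fit`. The arrays are random placeholders, `FEEDBACK_WEIGHT` is a hypothetical constant, and the mapping from feedback columns to training targets is an assumption left out of the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(42)
# Placeholder data: 500 synthetic rows and 4 feedback rows, 4 features, 2 targets
X_syn, Y_syn = rng.normal(size=(500, 4)), rng.normal(size=(500, 2))
X_fb, Y_fb = rng.normal(size=(4, 4)), rng.normal(size=(4, 2))

FEEDBACK_WEIGHT = 2.0  # hypothetical: each feedback row counts twice

X_all = np.vstack([X_syn, X_fb])
Y_all = np.vstack([Y_syn, Y_fb])
weights = np.concatenate([
    np.ones(len(X_syn)),                  # synthetic rows keep weight 1
    np.full(len(X_fb), FEEDBACK_WEIGHT),  # feedback rows are up-weighted
])

model = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=42))
model.fit(X_all, Y_all, sample_weight=weights)  # weights are forwarded to each per-target forest

print(model.predict(X_fb).shape)  # (4, 2)
```

With only a handful of feedback rows against thousands of synthetic ones, a factor of 2 shifts the fit very little, so the weight may need tuning as feedback accumulates.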


## Step 9: Feature Importance Analysis

In [11]:
# Get feature importances from the model
feature_importance_duration = best_mvva_model.estimators_[0].feature_importances_
feature_importance_intensity = best_mvva_model.estimators_[1].feature_importances_

# Get feature names (OneHotEncoded sex + numerical features)
feature_names = ['sex_M'] + numerical_features

# Create importance dataframe
importance_df = pd.DataFrame({
    'feature': feature_names,
    'duration_importance': feature_importance_duration,
    'intensity_importance': feature_importance_intensity
})
importance_df['avg_importance'] = (importance_df['duration_importance'] + importance_df['intensity_importance']) / 2
importance_df = importance_df.sort_values('avg_importance', ascending=False)

print("✅ Feature Importance Analysis:")
print(f"\nTop 10 Most Important Features:")
print(importance_df.head(10).to_string(index=False))

print(f"\nDuration Prediction - Top 5 Features:")
for i, row in importance_df.nlargest(5, 'duration_importance').iterrows():
    print(f"  {row['feature']}: {row['duration_importance']:.4f}")

print(f"\nIntensity Prediction - Top 5 Features:")
for i, row in importance_df.nlargest(5, 'intensity_importance').iterrows():
    print(f"  {row['feature']}: {row['intensity_importance']:.4f}")

✅ Feature Importance Analysis:

Top 10 Most Important Features:
             feature  duration_importance  intensity_importance  avg_importance
           sleep_hrs             0.468712              0.451929        0.460321
            soreness             0.136000              0.094135        0.115068
           protein_g             0.065936              0.069202        0.067569
         calories_in             0.067099              0.067851        0.067475
             carbs_g             0.065153              0.068575        0.066864
           weight_kg             0.062678              0.068225        0.065452
             rhr_bpm             0.043576              0.058886        0.051231
                 age             0.046917              0.052045        0.049481
       mental_stress             0.017806              0.041217        0.029512
nutrition_confidence             0.018813              0.020193        0.019503

Duration Prediction - Top 5 Features:
  sleep_hrs: 0.
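
The table above gives columns such as `protein_g`, `calories_in`, and `carbs_g` an importance of roughly 0.065 each, even though the Step 2 formulas never use them. Impurity-based importances are computed on the training data, so a forest that fits noise will credit whichever features it split on. Permutation importance on held-out rows is one way to cross-check. The sketch below uses a made-up three-feature dataset in which `noise_feature` plays the role of an unused column; the data and coefficients are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
n = 2000
sleep = rng.normal(7, 1, n)
soreness = rng.integers(1, 6, n).astype(float)
noise_feature = rng.normal(0, 1, n)  # never used to build the targets
X_demo = np.column_stack([sleep, soreness, noise_feature])

# Two targets driven by sleep and soreness plus unexplained variation
Y_demo = np.column_stack([
    45 + 6 * (sleep - 5.5) - 3 * (6 - soreness) + rng.normal(0, 8, n),
    6 + (sleep - 5.5) - 0.45 * (6 - soreness) + rng.normal(0, 1.5, n),
])

X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.25, random_state=42)
model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=100, max_depth=12, min_samples_leaf=4, random_state=42)
).fit(X_tr, Y_tr)

# Impurity-based importance, averaged over the two per-target forests
impurity = np.mean([est.feature_importances_ for est in model.estimators_], axis=0)
# Permutation importance: drop in held-out R² when one column is shuffled
perm = permutation_importance(model, X_te, Y_te, n_repeats=5, random_state=42)

for name, imp, pm in zip(['sleep_hrs', 'soreness', 'noise_feature'], impurity, perm.importances_mean):
    print(f"{name:14s} impurity={imp:.3f}  permutation={pm:.3f}")
```

A feature whose permutation importance stays near zero on held-out data contributes little to generalisation, whatever its impurity-based score suggests.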

## Step 10: Save Models for Deployment

In [12]:
# Define model file paths
MODEL_PATH = 'mvva_model_v2.joblib'
PREPROCESSOR_PATH = 'mvva_preprocessor_v2.joblib'
GOAL_CLASSIFIER_PATH = 'mvva_goal_classifier_v2.joblib'
GOAL_ENCODER_PATH = 'mvva_goal_encoder_v2.joblib'

# Save all models
print("💾 Saving trained models...\n")

joblib.dump(best_mvva_model, MODEL_PATH)
print(f"✅ Saved main model: {MODEL_PATH}")

joblib.dump(preprocessor, PREPROCESSOR_PATH)
print(f"✅ Saved preprocessor: {PREPROCESSOR_PATH}")

joblib.dump(goal_classifier, GOAL_CLASSIFIER_PATH)
print(f"✅ Saved goal classifier: {GOAL_CLASSIFIER_PATH}")

joblib.dump(goal_encoder, GOAL_ENCODER_PATH)
print(f"✅ Saved goal encoder: {GOAL_ENCODER_PATH}")

# Report the combined on-disk size of the saved artifacts (os is imported in the first cell)
total_size = sum(os.path.getsize(f) for f in [MODEL_PATH, PREPROCESSOR_PATH, GOAL_CLASSIFIER_PATH, GOAL_ENCODER_PATH])
print(f"\n📦 Total model size: {total_size / (1024**2):.1f} MB")

print("\n✅ All models saved successfully!")
print("\n📝 Models are ready for deployment:")
print(f"  - Streamlit app: chronofit_app.py")
print(f"  - MongoDB handler: mongodb_handler.py")
print(f"  - Deploy to: https://share.streamlit.io")

💾 Saving trained models...

✅ Saved main model: mvva_model_v2.joblib
✅ Saved preprocessor: mvva_preprocessor_v2.joblib
✅ Saved goal classifier: mvva_goal_classifier_v2.joblib
✅ Saved goal encoder: mvva_goal_encoder_v2.joblib

📦 Total model size: 69.3 MB

✅ All models saved successfully!

📝 Models are ready for deployment:
  - Streamlit app: chronofit_app.py
  - MongoDB handler: mongodb_handler.py
  - Deploy to: https://share.streamlit.io
