Injury-Prediction-in-Runners

Injury Prediction in Competitive Runners With Machine Learning

Overview

This project predicts injuries in competitive runners using machine learning models. It compares three boosting algorithms (XGBoost, LightGBM, and CatBoost) and evaluates their performance across different feature sets. In addition, an ensemble approach combines the models with the aim of improving prediction performance.

Purpose

To build on previous work in injury prediction, this project aimed not only to replicate but also to improve existing models by systematically comparing different boosting algorithms and feature sets. Special attention was given to evaluating the predictive value of subjective (self-reported) versus objective (measured) training features, as well as exploring how different temporal resolutions (daily vs. weekly data) affect model performance.

Models Implemented

1. XGBoost

  • Utilized for binary classification.
  • Hyperparameter tuning performed with Optuna.
  • Model trained on various feature subsets.

2. LightGBM

  • Typically trains faster than XGBoost on large datasets.
  • Uses histogram-based split finding, which reduces training time and memory use.
  • Optimized with Optuna to maximize AUC score.

3. CatBoost

  • Handles categorical features natively, without manual encoding.
  • Uses oblivious (symmetric) trees, in which every node at a given depth shares the same split condition.
  • Optimized with Optuna to maximize AUC score.

Feature Sets

To analyze model performance, five different feature sets were used:

  1. All Features – Combination of all feature categories (139 Features in Total)
  2. Days Features – Data collected on a daily basis (70 Features in Total)
  3. Weeks Features – Aggregated weekly data (69 Features in Total)
  4. Objective Features – Metrics derived from measurable performance (91 Features in Total)
  5. Subjective Features – Data based on athlete-reported conditions (48 Features in Total)
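
The totals above can be cross-checked against the structure of the feature lists at the end of this README: seven days of daily variables, three weeks of weekly variables, and three relative weekly-distance ratios. The following is a small arithmetic sketch; the variable names are illustrative and not taken from the notebook.

```python
# Sanity check of the feature-set sizes. The per-day and per-week counts are
# read off the feature lists in this README; variable names are illustrative.
DAYS, WEEKS = 7, 3

daily_objective = 7    # sessions, total km, km Z3-4, km Z5-T1-T2, sprinting, strength, alternative hours
daily_subjective = 3   # perceived exertion, training success, recovery
weekly_objective = 13  # session counts, rest days, distance and intensity totals, strength trainings
weekly_subjective = 9  # avg/min/max of exertion, training success, recovery
relative_km = 3        # 'rel total kms week 0_1', '0_2', '1_2' (counted as objective)

days_total = DAYS * (daily_objective + daily_subjective)
weeks_total = WEEKS * (weekly_objective + weekly_subjective) + relative_km
objective_total = DAYS * daily_objective + WEEKS * weekly_objective + relative_km
subjective_total = DAYS * daily_subjective + WEEKS * weekly_subjective

print(days_total, weeks_total, objective_total, subjective_total)  # 70 69 91 48
print(days_total + weeks_total)  # 139
```

Days plus Weeks and Objective plus Subjective each partition the full set of 139 features.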

Training and Evaluation

  • Data Handling: Features and labels are stored as .pkl files.
  • Data Splitting: Standard train-validation-test split (70%-15%-15%).
  • Batch Sampling: Balanced mini-batches are created to handle class imbalance (2048 samples in total, 50% injured and 50% healthy events).
  • Performance Metrics:
    • AUC (Area Under the Curve) → Measures how well the model distinguishes between injured and non-injured individuals.
    • Recall (Sensitivity) → The proportion of actually injured individuals correctly identified by the model.
    • Precision → The proportion of predicted injured individuals who are actually injured.
    • F1-Score → The harmonic mean of Precision and Recall.
    • Specificity → The proportion of actually non-injured individuals correctly identified as non-injured by the model.
  • Ensemble Approach: Models calibrated using CalibratedClassifierCV, then combined via probability averaging.
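
The splitting and batch-sampling steps above could look roughly like the following sketch. It uses only the standard library and toy labels. The function names, the random shuffle, the reading of 2048 as the size of a single batch, and sampling with replacement are all assumptions made for illustration; the notebook may split and sample differently.

```python
import random

def split_indices(n, train_pct=70, val_pct=15, seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def balanced_batch(labels, indices, batch_size=2048, rng=None):
    """Draw batch_size indices, half injured (label 1) and half healthy (label 0).

    Sampling is with replacement (an assumption) so that the rare injured
    class can fill its half of the batch.
    """
    rng = rng or random.Random()
    injured = [i for i in indices if labels[i] == 1]
    healthy = [i for i in indices if labels[i] == 0]
    half = batch_size // 2
    batch = rng.choices(injured, k=half) + rng.choices(healthy, k=batch_size - half)
    rng.shuffle(batch)
    return batch

# Toy labels with roughly 2% injuries (made-up prevalence for illustration).
label_rng = random.Random(1)
labels = [1 if label_rng.random() < 0.02 else 0 for _ in range(10_000)]

train_idx, val_idx, test_idx = split_indices(len(labels))
batch = balanced_batch(labels, train_idx, rng=random.Random(2))
print(len(train_idx), len(val_idx), len(test_idx))  # 7000 1500 1500
print(sum(labels[i] for i in batch), len(batch))    # 1024 2048
```

Batches are drawn from the training indices only, so the validation and test sets keep their natural class imbalance.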

Results
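
The comparisons in this section rely on the metrics defined under Training and Evaluation. Apart from AUC, they follow directly from confusion-matrix counts. The sketch below shows that arithmetic with made-up counts; the function name is hypothetical and the numbers are not results from this project.

```python
def classification_metrics(tp, fp, tn, fn):
    """Recall, precision, F1 and specificity from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision,
            "f1": f1, "specificity": specificity}

# Made-up counts for illustration only: 50 injured and 900 healthy events.
metrics = classification_metrics(tp=30, fp=70, tn=830, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, recall is 30/50 and precision is 30/100. This shows how a model can find most injuries while still raising many false alarms on an imbalanced dataset.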

Comparison of Feature Sets for XGBoost

Comparison of Feature Sets for LightGBM

Comparison of Feature Sets for CatBoost

Comparison of Feature Sets for Ensemble Models
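
The ensemble is described as calibrating each model with CalibratedClassifierCV and then averaging the predicted probabilities. The following is a minimal sketch of that idea on synthetic data. The scikit-learn classifiers are stand-ins for XGBoost, LightGBM, and CatBoost, and the calibration method (`sigmoid`), the number of folds, and all hyperparameters are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data standing in for the injury dataset.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Stand-in base models; the project itself uses XGBoost, LightGBM and CatBoost.
base_models = [
    GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0),
    GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=1),
    RandomForestClassifier(n_estimators=50, random_state=2),
]

# Calibrate each model with cross-validated sigmoid (Platt) scaling.
calibrated = [
    CalibratedClassifierCV(model, method="sigmoid", cv=3).fit(X_train, y_train)
    for model in base_models
]

# Ensemble by averaging the calibrated positive-class probabilities.
probs = np.mean([m.predict_proba(X_test)[:, 1] for m in calibrated], axis=0)
ensemble_auc = roc_auc_score(y_test, probs)
print(f"Ensemble AUC on toy data: {ensemble_auc:.3f}")
```

Calibrating first puts the models' outputs on a comparable probability scale, so a plain average does not let one over-confident model dominate.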

Usage

Running in Google Colab

  1. Open the Colab notebook in your browser.
  2. Upload the required .pkl files containing the dataset.
  3. Run the notebook cells to train and evaluate the models.
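
For orientation, loading a pair of pickled feature and label files could look like the sketch below. The file names `features.pkl` and `labels.pkl` and the helper `load_dataset` are hypothetical, since the README does not name the files, and the round trip uses toy data so that it runs without the real dataset.

```python
import pickle
import tempfile
from pathlib import Path

def load_dataset(data_dir, features_name="features.pkl", labels_name="labels.pkl"):
    """Load features and labels from two pickle files (file names are hypothetical)."""
    data_dir = Path(data_dir)
    with open(data_dir / features_name, "rb") as f:
        features = pickle.load(f)
    with open(data_dir / labels_name, "rb") as f:
        labels = pickle.load(f)
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same number of rows")
    return features, labels

# Round trip with toy data so the sketch runs without the real files.
with tempfile.TemporaryDirectory() as tmp:
    toy_features = [{"total km": 12.0, "perceived exertion": 0.4},
                    {"total km": 0.0, "perceived exertion": 0.0}]
    toy_labels = [0, 1]
    with open(Path(tmp) / "features.pkl", "wb") as f:
        pickle.dump(toy_features, f)
    with open(Path(tmp) / "labels.pkl", "wb") as f:
        pickle.dump(toy_labels, f)
    features, labels = load_dataset(tmp)

print(len(features), labels)  # 2 [0, 1]
```

Pickle files can execute code when loaded, so only open .pkl files from a source you trust.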

Running Locally (Optional)

  1. Clone the repository:
    git clone https://github.com/pascalghanimi/Injury-Prediction-in-Runners.git
    cd Injury-Prediction-in-Runners
  2. Install dependencies:
    pip install -r requirements.txt
  3. Modify the notebook or create Python scripts for local execution.

Future Improvements

  • Feature Engineering: Experiment with additional metrics.
  • Hyperparameter Optimization: Run further fine-tuning with more trials, in particular separate tuning for each individual feature set (e.g., days, weeks, objective, and subjective features).
  • Deployment: Convert the best model into a web-based API for real-time predictions.

Feature Lists

For the analysis, all available features were grouped into five distinct feature sets: All Features, Days, Weeks, Objective, and Subjective. This structure allows for a targeted evaluation of different data dimensions. While All Features includes the complete set of variables, the remaining subsets focus on specific aspects such as daily or weekly metrics, purely objective training data, or subjective athlete self-assessments.

  1. All Features – 'nr. sessions_day', 'total km', 'km Z3-4', 'km Z5-T1-T2', 'km sprinting', 'strength training', 'hours alternative', 'perceived exertion', 'perceived trainingSuccess', 'perceived recovery', 'nr. sessions.1_day', 'total km.1', 'km Z3-4.1', 'km Z5-T1-T2.1', 'km sprinting.1', 'strength training.1', 'hours alternative.1', 'perceived exertion.1', 'perceived trainingSuccess.1', 'perceived recovery.1', 'nr. sessions.2_day', 'total km.2', 'km Z3-4.2', 'km Z5-T1-T2.2', 'km sprinting.2', 'strength training.2', 'hours alternative.2', 'perceived exertion.2', 'perceived trainingSuccess.2', 'perceived recovery.2', 'nr. sessions.3', 'total km.3', 'km Z3-4.3', 'km Z5-T1-T2.3', 'km sprinting.3', 'strength training.3', 'hours alternative.3', 'perceived exertion.3', 'perceived trainingSuccess.3', 'perceived recovery.3', 'nr. sessions.4', 'total km.4', 'km Z3-4.4', 'km Z5-T1-T2.4', 'km sprinting.4', 'strength training.4', 'hours alternative.4', 'perceived exertion.4', 'perceived trainingSuccess.4', 'perceived recovery.4', 'nr. sessions.5', 'total km.5', 'km Z3-4.5', 'km Z5-T1-T2.5', 'km sprinting.5', 'strength training.5', 'hours alternative.5', 'perceived exertion.5', 'perceived trainingSuccess.5', 'perceived recovery.5', 'nr. sessions.6', 'total km.6', 'km Z3-4.6', 'km Z5-T1-T2.6', 'km sprinting.6', 'strength training.6', 'hours alternative.6', 'perceived exertion.6', 'perceived trainingSuccess.6', 'perceived recovery.6', 'nr. sessions_week', 'nr. rest days', 'total kms', 'max km one day', 'total km Z3-Z4-Z5-T1-T2', 'nr. tough sessions (effort in Z5, T1 or T2)', 'nr. days with interval session', 'total km Z3-4', 'max km Z3-4 one day', 'total km Z5-T1-T2', 'max km Z5-T1-T2 one day', 'total hours alternative training', 'nr. strength trainings', 'avg exertion', 'min exertion', 'max exertion', 'avg training success', 'min training success', 'max training success', 'avg recovery', 'min recovery', 'max recovery', 'nr. sessions.1_week', 'nr. rest days.1', 'total kms.1', 'max km one day.1', 'total km Z3-Z4-Z5-T1-T2.1', 'nr. tough sessions (effort in Z5, T1 or T2).1', 'nr. days with interval session.1', 'total km Z3-4.1', 'max km Z3-4 one day.1', 'total km Z5-T1-T2.1', 'max km Z5-T1-T2 one day.1', 'total hours alternative training.1', 'nr. strength trainings.1', 'avg exertion.1', 'min exertion.1', 'max exertion.1', 'avg training success.1', 'min training success.1', 'max training success.1', 'avg recovery.1', 'min recovery.1', 'max recovery.1', 'nr. sessions.2_week', 'nr. rest days.2', 'total kms.2', 'max km one day.2', 'total km Z3-Z4-Z5-T1-T2.2', 'nr. tough sessions (effort in Z5, T1 or T2).2', 'nr. days with interval session.2', 'total km Z3-4.2', 'max km Z3-4 one day.2', 'total km Z5-T1-T2.2', 'max km Z5-T1-T2 one day.2', 'total hours alternative training.2', 'nr. strength trainings.2', 'avg exertion.2', 'min exertion.2', 'max exertion.2', 'avg training success.2', 'min training success.2', 'max training success.2', 'avg recovery.2', 'min recovery.2', 'max recovery.2', 'rel total kms week 0_1', 'rel total kms week 0_2', 'rel total kms week 1_2'
  2. Days Features – 'nr. sessions_day', 'total km', 'km Z3-4', 'km Z5-T1-T2', 'km sprinting', 'strength training', 'hours alternative', 'perceived exertion', 'perceived trainingSuccess', 'perceived recovery', 'nr. sessions.1_day', 'total km.1', 'km Z3-4.1', 'km Z5-T1-T2.1', 'km sprinting.1', 'strength training.1', 'hours alternative.1', 'perceived exertion.1', 'perceived trainingSuccess.1', 'perceived recovery.1', 'nr. sessions.2_day', 'total km.2', 'km Z3-4.2', 'km Z5-T1-T2.2', 'km sprinting.2', 'strength training.2', 'hours alternative.2', 'perceived exertion.2', 'perceived trainingSuccess.2', 'perceived recovery.2', 'nr. sessions.3', 'total km.3', 'km Z3-4.3', 'km Z5-T1-T2.3', 'km sprinting.3', 'strength training.3', 'hours alternative.3', 'perceived exertion.3', 'perceived trainingSuccess.3', 'perceived recovery.3', 'nr. sessions.4', 'total km.4', 'km Z3-4.4', 'km Z5-T1-T2.4', 'km sprinting.4', 'strength training.4', 'hours alternative.4', 'perceived exertion.4', 'perceived trainingSuccess.4', 'perceived recovery.4', 'nr. sessions.5', 'total km.5', 'km Z3-4.5', 'km Z5-T1-T2.5', 'km sprinting.5', 'strength training.5', 'hours alternative.5', 'perceived exertion.5', 'perceived trainingSuccess.5', 'perceived recovery.5', 'nr. sessions.6', 'total km.6', 'km Z3-4.6', 'km Z5-T1-T2.6', 'km sprinting.6', 'strength training.6', 'hours alternative.6', 'perceived exertion.6', 'perceived trainingSuccess.6', 'perceived recovery.6'
  3. Weeks Features – 'nr. sessions_week', 'nr. rest days', 'total kms', 'max km one day', 'total km Z3-Z4-Z5-T1-T2', 'nr. tough sessions (effort in Z5, T1 or T2)', 'nr. days with interval session', 'total km Z3-4', 'max km Z3-4 one day', 'total km Z5-T1-T2', 'max km Z5-T1-T2 one day', 'total hours alternative training', 'nr. strength trainings', 'avg exertion', 'min exertion', 'max exertion', 'avg training success', 'min training success', 'max training success', 'avg recovery', 'min recovery', 'max recovery', 'nr. sessions.1_week', 'nr. rest days.1', 'total kms.1', 'max km one day.1', 'total km Z3-Z4-Z5-T1-T2.1', 'nr. tough sessions (effort in Z5, T1 or T2).1', 'nr. days with interval session.1', 'total km Z3-4.1', 'max km Z3-4 one day.1', 'total km Z5-T1-T2.1', 'max km Z5-T1-T2 one day.1', 'total hours alternative training.1', 'nr. strength trainings.1', 'avg exertion.1', 'min exertion.1', 'max exertion.1', 'avg training success.1', 'min training success.1', 'max training success.1', 'avg recovery.1', 'min recovery.1', 'max recovery.1', 'nr. sessions.2_week', 'nr. rest days.2', 'total kms.2', 'max km one day.2', 'total km Z3-Z4-Z5-T1-T2.2', 'nr. tough sessions (effort in Z5, T1 or T2).2', 'nr. days with interval session.2', 'total km Z3-4.2', 'max km Z3-4 one day.2', 'total km Z5-T1-T2.2', 'max km Z5-T1-T2 one day.2', 'total hours alternative training.2', 'nr. strength trainings.2', 'avg exertion.2', 'min exertion.2', 'max exertion.2', 'avg training success.2', 'min training success.2', 'max training success.2', 'avg recovery.2', 'min recovery.2', 'max recovery.2', 'rel total kms week 0_1', 'rel total kms week 0_2', 'rel total kms week 1_2'
  4. Objective Features – 'nr. sessions_day', 'total km', 'km Z3-4', 'km Z5-T1-T2', 'km sprinting', 'strength training', 'hours alternative', 'nr. sessions.1_day', 'total km.1', 'km Z3-4.1', 'km Z5-T1-T2.1', 'km sprinting.1', 'strength training.1', 'hours alternative.1', 'nr. sessions.2_day', 'total km.2', 'km Z3-4.2', 'km Z5-T1-T2.2', 'km sprinting.2', 'strength training.2', 'hours alternative.2', 'nr. sessions.3', 'total km.3', 'km Z3-4.3', 'km Z5-T1-T2.3', 'km sprinting.3', 'strength training.3', 'hours alternative.3', 'nr. sessions.4', 'total km.4', 'km Z3-4.4', 'km Z5-T1-T2.4', 'km sprinting.4', 'strength training.4', 'hours alternative.4', 'nr. sessions.5', 'total km.5', 'km Z3-4.5', 'km Z5-T1-T2.5', 'km sprinting.5', 'strength training.5', 'hours alternative.5', 'nr. sessions.6', 'total km.6', 'km Z3-4.6', 'km Z5-T1-T2.6', 'km sprinting.6', 'strength training.6', 'hours alternative.6', 'nr. sessions_week', 'nr. rest days', 'total kms', 'max km one day', 'total km Z3-Z4-Z5-T1-T2', 'nr. tough sessions (effort in Z5, T1 or T2)', 'nr. days with interval session', 'total km Z3-4', 'max km Z3-4 one day', 'total km Z5-T1-T2', 'max km Z5-T1-T2 one day', 'total hours alternative training', 'nr. strength trainings', 'nr. sessions.1_week', 'nr. rest days.1', 'total kms.1', 'max km one day.1', 'total km Z3-Z4-Z5-T1-T2.1', 'nr. tough sessions (effort in Z5, T1 or T2).1', 'nr. days with interval session.1', 'total km Z3-4.1', 'max km Z3-4 one day.1', 'total km Z5-T1-T2.1', 'max km Z5-T1-T2 one day.1', 'total hours alternative training.1', 'nr. strength trainings.1', 'nr. sessions.2_week', 'nr. rest days.2', 'total kms.2', 'max km one day.2', 'total km Z3-Z4-Z5-T1-T2.2', 'nr. tough sessions (effort in Z5, T1 or T2).2', 'nr. days with interval session.2', 'total km Z3-4.2', 'max km Z3-4 one day.2', 'total km Z5-T1-T2.2', 'max km Z5-T1-T2 one day.2', 'total hours alternative training.2', 'nr. strength trainings.2', 'rel total kms week 0_1', 'rel total kms week 0_2', 'rel total kms week 1_2'
  5. Subjective Features – 'perceived exertion', 'perceived trainingSuccess', 'perceived recovery', 'perceived exertion.1', 'perceived trainingSuccess.1', 'perceived recovery.1', 'perceived exertion.2', 'perceived trainingSuccess.2', 'perceived recovery.2', 'perceived exertion.3', 'perceived trainingSuccess.3', 'perceived recovery.3', 'perceived exertion.4', 'perceived trainingSuccess.4', 'perceived recovery.4', 'perceived exertion.5', 'perceived trainingSuccess.5', 'perceived recovery.5', 'perceived exertion.6', 'perceived trainingSuccess.6', 'perceived recovery.6', 'avg exertion', 'min exertion', 'max exertion', 'avg training success', 'min training success', 'max training success', 'avg recovery', 'min recovery', 'max recovery', 'avg exertion.1', 'min exertion.1', 'max exertion.1', 'avg training success.1', 'min training success.1', 'max training success.1', 'avg recovery.1', 'min recovery.1', 'max recovery.1', 'avg exertion.2', 'min exertion.2', 'max exertion.2', 'avg training success.2', 'min training success.2', 'max training success.2', 'avg recovery.2', 'min recovery.2', 'max recovery.2'
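
In the lists above, every subjective column name contains one of a few recurring words, so the objective/subjective grouping can be reproduced from the names alone. The sketch below illustrates this with a handful of the listed columns. The keyword rule and the helper names are assumptions made for illustration and are not necessarily how the notebook builds its feature sets.

```python
# Keywords that appear in every subjective column listed in this README
# (an illustrative rule, not necessarily the notebook's own logic).
SUBJECTIVE_KEYWORDS = ("perceived", "exertion", "training success", "recovery")

def is_subjective(column):
    """True if the column name looks like an athlete self-report."""
    name = column.lower()
    return any(keyword in name for keyword in SUBJECTIVE_KEYWORDS)

def split_objective_subjective(columns):
    """Partition column names into (objective, subjective), preserving order."""
    subjective = [c for c in columns if is_subjective(c)]
    objective = [c for c in columns if not is_subjective(c)]
    return objective, subjective

# A few column names taken from the lists above.
sample = ["total km", "perceived trainingSuccess.3", "nr. strength trainings.1",
          "avg training success.2", "min recovery", "rel total kms week 0_1",
          "strength training.4", "max exertion.1"]
objective, subjective = split_objective_subjective(sample)
print(objective)
print(subjective)
```

Matching on the phrase 'training success' rather than the word 'training' keeps objective columns such as 'strength training' and 'total hours alternative training' out of the subjective set.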

Author: Pascal Ghanimi
GitHub Repository: https://github.com/pascalghanimi/Injury-Prediction-in-Runners.git
