Injury-Prediction-in-Runners

Injury Prediction in Competitive Runners With Machine Learning

Overview

This project predicts injuries in competitive runners using machine learning models. It compares three boosting algorithms (XGBoost, LightGBM, and CatBoost) and evaluates their performance across different feature sets. In addition, an ensemble approach combines the calibrated predictions of the individual models, with the aim of improving predictive performance.

Purpose

To build on previous work in injury prediction, this project aimed not only to replicate but also to improve existing models by systematically comparing different boosting algorithms and feature sets. Special attention was given to evaluating the predictive value of subjective (self-reported) versus objective (measured) training features, as well as exploring how different temporal resolutions (daily vs. weekly data) affect model performance.

Models Implemented

1. XGBoost

Utilized for binary classification.
Hyperparameter tuning performed with Optuna.
Model trained on various feature subsets.

2. LightGBM

Often trains faster than XGBoost on large datasets.
Uses histogram-based split finding, which groups feature values into bins to reduce training cost.
Optimized with Optuna to maximize AUC score.
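
To illustrate what histogram-based learning means, here is a minimal pure-Python sketch. It is not LightGBM's implementation: it uses equal-width bins and Gini impurity for simplicity, whereas LightGBM works with gradient statistics. The function names, the bin count, and the toy data are assumptions made for this example. The point is that samples are counted into bins once, and only the bin boundaries are then tried as split thresholds.

```python
def gini(pos, total):
    """Binary Gini impurity 2p(1-p) for `pos` positives among `total` samples."""
    p = pos / total
    return 2.0 * p * (1.0 - p)


def best_histogram_split(values, labels, n_bins=4):
    """Return (threshold, weighted impurity) of the best bin-boundary split."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0          # avoid zero width for constant features
    edges = [lo + width * i for i in range(1, n_bins)]

    # Build the histogram once: per-bin sample counts and positive counts.
    pos, tot = [0] * n_bins, [0] * n_bins
    for v, y in zip(values, labels):
        b = sum(1 for e in edges if v >= e)    # bin index of v
        pos[b] += y
        tot[b] += 1

    best = (None, float("inf"))
    left_pos = left_tot = 0
    for b, edge in enumerate(edges):           # candidate split: v < edge vs. v >= edge
        left_pos += pos[b]
        left_tot += tot[b]
        right_pos, right_tot = sum(pos) - left_pos, sum(tot) - left_tot
        if left_tot == 0 or right_tot == 0:
            continue                           # a split must leave samples on both sides
        score = (left_tot * gini(left_pos, left_tot)
                 + right_tot * gini(right_pos, right_tot)) / sum(tot)
        if score < best[1]:
            best = (edge, score)
    return best


# Toy data (invented): daily kilometres and a binary injury label.
km = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
injured = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_histogram_split(km, injured))       # (7.0, 0.0)
```

With 4 bins only 3 thresholds are evaluated, regardless of how many distinct values the feature has.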

3. CatBoost

Handles categorical features natively.
Uses oblivious (symmetric) trees, in which all nodes at the same depth share one split condition, so every tree is balanced.
Optimized with Optuna to maximize AUC score.
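
As a rough illustration of an oblivious tree, the sketch below evaluates one by hand. It is not CatBoost's implementation; the thresholds and leaf values are invented, and only the two feature names come from this project's feature lists. Because each level applies a single shared test, a tree of depth d has 2^d leaves, and the leaf index is simply the bit pattern of the d test results.

```python
def oblivious_leaf_index(sample, level_tests):
    """Compute the leaf index from one (feature, threshold) test per tree level."""
    index = 0
    for feature, threshold in level_tests:
        bit = 1 if sample[feature] > threshold else 0
        index = (index << 1) | bit             # append this level's outcome as a bit
    return index


def oblivious_predict(sample, level_tests, leaf_values):
    """Look up the leaf value selected by the level tests."""
    return leaf_values[oblivious_leaf_index(sample, level_tests)]


# Hypothetical depth-2 tree: thresholds and leaf values are made up.
level_tests = [("total km", 15.0), ("perceived exertion", 0.7)]
leaf_values = [0.05, 0.10, 0.20, 0.60]         # one value per leaf, 2**2 leaves

sample = {"total km": 18.0, "perceived exertion": 0.9}
print(oblivious_leaf_index(sample, level_tests))            # 3
print(oblivious_predict(sample, level_tests, leaf_values))  # 0.6
```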

Feature Sets

To analyze model performance, five different feature sets were used:

All Features – Combination of all feature categories (139 features in total)
Days Features – Data collected on a daily basis (70 features in total)
Weeks Features – Aggregated weekly data (69 features in total)
Objective Features – Metrics derived from measurable performance (91 features in total)
Subjective Features – Data based on athlete-reported conditions (48 features in total)

Training and Evaluation

Data Handling: Features and labels are stored as .pkl files.
Data Splitting: Standard train-validation-test split (70%-15%-15%).
Batch Sampling: Balanced mini-batches are created to handle class imbalance (2048 samples in total, 50% injured and 50% healthy events).
Performance Metrics:
- AUC (Area Under the Curve) → Measures how well the model distinguishes between injured and non-injured individuals.
- Recall (Sensitivity) → The proportion of actually injured individuals correctly identified by the model.
- Precision → The proportion of predicted injured individuals who are actually injured.
- F1-Score → The harmonic mean of Precision and Recall.
- Specificity → The proportion of actually non-injured individuals correctly identified as non-injured by the model.
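
The metric definitions above can be written out in a few lines of plain Python. This is a sketch rather than the project's evaluation code: it assumes AUC refers to the area under the ROC curve, computed here through its pairwise-ranking interpretation, and the labels, scores, and 0.5 threshold are invented.

```python
def classification_metrics(y_true, y_pred):
    """Recall, precision, F1 and specificity for binary labels (1 = injured)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {"recall": recall, "precision": precision,
            "f1": f1, "specificity": specificity}


def roc_auc(y_true, scores):
    """Probability that a random injured sample outscores a random healthy one.

    Ties count as half a win. Quadratic in the sample count, so only
    suitable for small illustrations.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 3 injured and 5 healthy samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # threshold 0.5 is an assumption

print(classification_metrics(y_true, y_pred))
print(round(roc_auc(y_true, scores), 4))          # 0.8667
```

Note that AUC is computed from the raw scores, while the other four metrics depend on the chosen decision threshold.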
Ensemble Approach: Models are calibrated using scikit-learn's CalibratedClassifierCV and then combined by averaging their predicted probabilities.
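
Two of the steps listed above, balanced batch sampling and probability averaging, can be sketched in plain Python. This is an illustration under assumptions, not the notebooks' code: the function names are hypothetical, the 2048 figure is treated as the size of one batch, and sampling with replacement is assumed because injury events are typically too rare to fill half a batch otherwise. Calibration itself is left out; the inputs to the averaging function are assumed to be already calibrated probabilities.

```python
import random


def balanced_batch(labels, batch_size=2048, seed=0):
    """Return row indices for one batch: half injured (1), half healthy (0)."""
    injured = [i for i, y in enumerate(labels) if y == 1]
    healthy = [i for i, y in enumerate(labels) if y == 0]
    if not injured or not healthy:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    half = batch_size // 2
    # Sampling with replacement (an assumption) lets the rare class fill its half.
    batch = rng.choices(injured, k=half) + rng.choices(healthy, k=half)
    rng.shuffle(batch)
    return batch


def average_probabilities(*model_probs):
    """Element-wise mean of the per-model probability lists."""
    return [sum(ps) / len(ps) for ps in zip(*model_probs)]


# Invented data: 20 injured and 980 healthy events.
labels = [1] * 20 + [0] * 980
batch = balanced_batch(labels)
print(len(batch), sum(labels[i] for i in batch))   # 2048 1024

# Invented calibrated probabilities from three models for two samples.
ensemble = average_probabilities([0.2, 0.9], [0.4, 0.7], [0.6, 0.8])
print([round(p, 2) for p in ensemble])             # [0.4, 0.8]
```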

Results

Comparison of Feature Sets for XGBoost

Comparison of Feature Sets for LightGBM

Comparison of Feature Sets for CatBoost

Comparison of Feature Sets for Ensemble Models

Usage

Running in Google Colab

Open the Colab notebook in your browser.
Upload the required .pkl files containing the dataset.
Run the notebook cells to train and evaluate the models.

Running Locally (Optional)

Clone the repository:

git clone https://github.com/pascalghanimi/Injury-Prediction-in-Runners.git
cd Injury-Prediction-in-Runners

Install dependencies:
```
pip install -r requirements.txt
```
Modify the notebook or create Python scripts for local execution.
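
A local script might start by loading the pickled data and reproducing the 70%-15%-15% split. The sketch below is only a starting point under assumptions: the file name is hypothetical, a throwaway file stands in for the real dataset, and a plain shuffled split is assumed since the exact splitting strategy is not described here. If the real .pkl files contain pandas objects, pandas must be installed to unpickle them.

```python
import pickle
import random
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled object. Only unpickle files from a source you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def split_indices(n_samples, seed=42):
    """Shuffle row indices and cut them into 70% / 15% / 15% parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 70 // 100
    n_val = n_samples * 15 // 100
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])             # remainder goes to the test set


# Demo with a temporary file in place of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "labels_demo.pkl"      # hypothetical file name
    demo_path.write_bytes(pickle.dumps([0, 1] * 50))
    labels = load_pickle(demo_path)

train_idx, val_idx, test_idx = split_indices(len(labels))
print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
```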

Future Improvements

Feature Engineering: Experiment with additional metrics.
Hyperparameter Optimization: Run more tuning trials, and tune the hyperparameters separately for each feature set (e.g. days, weeks, objective, and subjective features).
Deployment: Convert the best model into a web-based API for real-time predictions.

Feature Lists

For the analysis, all available features were grouped into five distinct feature sets: All Features, Days, Weeks, Objective, and Subjective. This structure allows for a targeted evaluation of different data dimensions. While All Features includes the complete set of variables, the remaining subsets focus on specific aspects such as daily or weekly metrics, purely objective training data, or subjective athlete self-assessments.

All Features – 'nr. sessions_day', 'total km', 'km Z3-4', 'km Z5-T1-T2', 'km sprinting', 'strength training', 'hours alternative', 'perceived exertion', 'perceived trainingSuccess', 'perceived recovery', 'nr. sessions.1_day', 'total km.1', 'km Z3-4.1', 'km Z5-T1-T2.1', 'km sprinting.1', 'strength training.1', 'hours alternative.1', 'perceived exertion.1', 'perceived trainingSuccess.1', 'perceived recovery.1', 'nr. sessions.2_day', 'total km.2', 'km Z3-4.2', 'km Z5-T1-T2.2', 'km sprinting.2', 'strength training.2', 'hours alternative.2', 'perceived exertion.2', 'perceived trainingSuccess.2', 'perceived recovery.2', 'nr. sessions.3', 'total km.3', 'km Z3-4.3', 'km Z5-T1-T2.3', 'km sprinting.3', 'strength training.3', 'hours alternative.3', 'perceived exertion.3', 'perceived trainingSuccess.3', 'perceived recovery.3', 'nr. sessions.4', 'total km.4', 'km Z3-4.4', 'km Z5-T1-T2.4', 'km sprinting.4', 'strength training.4', 'hours alternative.4', 'perceived exertion.4', 'perceived trainingSuccess.4', 'perceived recovery.4', 'nr. sessions.5', 'total km.5', 'km Z3-4.5', 'km Z5-T1-T2.5', 'km sprinting.5', 'strength training.5', 'hours alternative.5', 'perceived exertion.5', 'perceived trainingSuccess.5', 'perceived recovery.5', 'nr. sessions.6', 'total km.6', 'km Z3-4.6', 'km Z5-T1-T2.6', 'km sprinting.6', 'strength training.6', 'hours alternative.6', 'perceived exertion.6', 'perceived trainingSuccess.6', 'perceived recovery.6', 'nr. sessions_week', 'nr. rest days', 'total kms', 'max km one day', 'total km Z3-Z4-Z5-T1-T2', 'nr. tough sessions (effort in Z5, T1 or T2)', 'nr. days with interval session', 'total km Z3-4', 'max km Z3-4 one day', 'total km Z5-T1-T2', 'max km Z5-T1-T2 one day', 'total hours alternative training', 'nr. strength trainings', 'avg exertion', 'min exertion', 'max exertion', 'avg training success', 'min training success', 'max training success', 'avg recovery', 'min recovery', 'max recovery', 'nr. sessions.1_week', 'nr. rest days.1', 'total kms.1', 'max km one day.1', 'total km Z3-Z4-Z5-T1-T2.1', 'nr. tough sessions (effort in Z5, T1 or T2).1', 'nr. days with interval session.1', 'total km Z3-4.1', 'max km Z3-4 one day.1', 'total km Z5-T1-T2.1', 'max km Z5-T1-T2 one day.1', 'total hours alternative training.1', 'nr. strength trainings.1', 'avg exertion.1', 'min exertion.1', 'max exertion.1', 'avg training success.1', 'min training success.1', 'max training success.1', 'avg recovery.1', 'min recovery.1', 'max recovery.1', 'nr. sessions.2_week', 'nr. rest days.2', 'total kms.2', 'max km one day.2', 'total km Z3-Z4-Z5-T1-T2.2', 'nr. tough sessions (effort in Z5, T1 or T2).2', 'nr. days with interval session.2', 'total km Z3-4.2', 'max km Z3-4 one day.2', 'total km Z5-T1-T2.2', 'max km Z5-T1-T2 one day.2', 'total hours alternative training.2', 'nr. strength trainings.2', 'avg exertion.2', 'min exertion.2', 'max exertion.2', 'avg training success.2', 'min training success.2', 'max training success.2', 'avg recovery.2', 'min recovery.2', 'max recovery.2', 'rel total kms week 0_1', 'rel total kms week 0_2', 'rel total kms week 1_2'
Days Features – 'nr. sessions_day', 'total km', 'km Z3-4', 'km Z5-T1-T2', 'km sprinting', 'strength training', 'hours alternative', 'perceived exertion', 'perceived trainingSuccess', 'perceived recovery', 'nr. sessions.1_day', 'total km.1', 'km Z3-4.1', 'km Z5-T1-T2.1', 'km sprinting.1', 'strength training.1', 'hours alternative.1', 'perceived exertion.1', 'perceived trainingSuccess.1', 'perceived recovery.1', 'nr. sessions.2_day', 'total km.2', 'km Z3-4.2', 'km Z5-T1-T2.2', 'km sprinting.2', 'strength training.2', 'hours alternative.2', 'perceived exertion.2', 'perceived trainingSuccess.2', 'perceived recovery.2', 'nr. sessions.3', 'total km.3', 'km Z3-4.3', 'km Z5-T1-T2.3', 'km sprinting.3', 'strength training.3', 'hours alternative.3', 'perceived exertion.3', 'perceived trainingSuccess.3', 'perceived recovery.3', 'nr. sessions.4', 'total km.4', 'km Z3-4.4', 'km Z5-T1-T2.4', 'km sprinting.4', 'strength training.4', 'hours alternative.4', 'perceived exertion.4', 'perceived trainingSuccess.4', 'perceived recovery.4', 'nr. sessions.5', 'total km.5', 'km Z3-4.5', 'km Z5-T1-T2.5', 'km sprinting.5', 'strength training.5', 'hours alternative.5', 'perceived exertion.5', 'perceived trainingSuccess.5', 'perceived recovery.5', 'nr. sessions.6', 'total km.6', 'km Z3-4.6', 'km Z5-T1-T2.6', 'km sprinting.6', 'strength training.6', 'hours alternative.6', 'perceived exertion.6', 'perceived trainingSuccess.6', 'perceived recovery.6'
Weeks Features – 'nr. sessions_week', 'nr. rest days', 'total kms', 'max km one day', 'total km Z3-Z4-Z5-T1-T2', 'nr. tough sessions (effort in Z5, T1 or T2)', 'nr. days with interval session', 'total km Z3-4', 'max km Z3-4 one day', 'total km Z5-T1-T2', 'max km Z5-T1-T2 one day', 'total hours alternative training', 'nr. strength trainings', 'avg exertion', 'min exertion', 'max exertion', 'avg training success', 'min training success', 'max training success', 'avg recovery', 'min recovery', 'max recovery', 'nr. sessions.1_week', 'nr. rest days.1', 'total kms.1', 'max km one day.1', 'total km Z3-Z4-Z5-T1-T2.1', 'nr. tough sessions (effort in Z5, T1 or T2).1', 'nr. days with interval session.1', 'total km Z3-4.1', 'max km Z3-4 one day.1', 'total km Z5-T1-T2.1', 'max km Z5-T1-T2 one day.1', 'total hours alternative training.1', 'nr. strength trainings.1', 'avg exertion.1', 'min exertion.1', 'max exertion.1', 'avg training success.1', 'min training success.1', 'max training success.1', 'avg recovery.1', 'min recovery.1', 'max recovery.1', 'nr. sessions.2_week', 'nr. rest days.2', 'total kms.2', 'max km one day.2', 'total km Z3-Z4-Z5-T1-T2.2', 'nr. tough sessions (effort in Z5, T1 or T2).2', 'nr. days with interval session.2', 'total km Z3-4.2', 'max km Z3-4 one day.2', 'total km Z5-T1-T2.2', 'max km Z5-T1-T2 one day.2', 'total hours alternative training.2', 'nr. strength trainings.2', 'avg exertion.2', 'min exertion.2', 'max exertion.2', 'avg training success.2', 'min training success.2', 'max training success.2', 'avg recovery.2', 'min recovery.2', 'max recovery.2', 'rel total kms week 0_1', 'rel total kms week 0_2', 'rel total kms week 1_2'
Objective Features – 'nr. sessions_day', 'total km', 'km Z3-4', 'km Z5-T1-T2', 'km sprinting', 'strength training', 'hours alternative', 'nr. sessions.1_day', 'total km.1', 'km Z3-4.1', 'km Z5-T1-T2.1', 'km sprinting.1', 'strength training.1', 'hours alternative.1', 'nr. sessions.2_day', 'total km.2', 'km Z3-4.2', 'km Z5-T1-T2.2', 'km sprinting.2', 'strength training.2', 'hours alternative.2', 'nr. sessions.3', 'total km.3', 'km Z3-4.3', 'km Z5-T1-T2.3', 'km sprinting.3', 'strength training.3', 'hours alternative.3', 'nr. sessions.4', 'total km.4', 'km Z3-4.4', 'km Z5-T1-T2.4', 'km sprinting.4', 'strength training.4', 'hours alternative.4', 'nr. sessions.5', 'total km.5', 'km Z3-4.5', 'km Z5-T1-T2.5', 'km sprinting.5', 'strength training.5', 'hours alternative.5', 'nr. sessions.6', 'total km.6', 'km Z3-4.6', 'km Z5-T1-T2.6', 'km sprinting.6', 'strength training.6', 'hours alternative.6', 'nr. sessions_week', 'nr. rest days', 'total kms', 'max km one day', 'total km Z3-Z4-Z5-T1-T2', 'nr. tough sessions (effort in Z5, T1 or T2)', 'nr. days with interval session', 'total km Z3-4', 'max km Z3-4 one day', 'total km Z5-T1-T2', 'max km Z5-T1-T2 one day', 'total hours alternative training', 'nr. strength trainings', 'nr. sessions.1_week', 'nr. rest days.1', 'total kms.1', 'max km one day.1', 'total km Z3-Z4-Z5-T1-T2.1', 'nr. tough sessions (effort in Z5, T1 or T2).1', 'nr. days with interval session.1', 'total km Z3-4.1', 'max km Z3-4 one day.1', 'total km Z5-T1-T2.1', 'max km Z5-T1-T2 one day.1', 'total hours alternative training.1', 'nr. strength trainings.1', 'nr. sessions.2_week', 'nr. rest days.2', 'total kms.2', 'max km one day.2', 'total km Z3-Z4-Z5-T1-T2.2', 'nr. tough sessions (effort in Z5, T1 or T2).2', 'nr. days with interval session.2', 'total km Z3-4.2', 'max km Z3-4 one day.2', 'total km Z5-T1-T2.2', 'max km Z5-T1-T2 one day.2', 'total hours alternative training.2', 'nr. strength trainings.2', 'rel total kms week 0_1', 'rel total kms week 0_2', 'rel total kms week 1_2'
Subjective Features – 'perceived exertion', 'perceived trainingSuccess', 'perceived recovery', 'perceived exertion.1', 'perceived trainingSuccess.1', 'perceived recovery.1', 'perceived exertion.2', 'perceived trainingSuccess.2', 'perceived recovery.2', 'perceived exertion.3', 'perceived trainingSuccess.3', 'perceived recovery.3', 'perceived exertion.4', 'perceived trainingSuccess.4', 'perceived recovery.4', 'perceived exertion.5', 'perceived trainingSuccess.5', 'perceived recovery.5', 'perceived exertion.6', 'perceived trainingSuccess.6', 'perceived recovery.6', 'avg exertion', 'min exertion', 'max exertion', 'avg training success', 'min training success', 'max training success', 'avg recovery', 'min recovery', 'max recovery', 'avg exertion.1', 'min exertion.1', 'max exertion.1', 'avg training success.1', 'min training success.1', 'max training success.1', 'avg recovery.1', 'min recovery.1', 'max recovery.1', 'avg exertion.2', 'min exertion.2', 'max exertion.2', 'avg training success.2', 'min training success.2', 'max training success.2', 'avg recovery.2', 'min recovery.2', 'max recovery.2'
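
The lists above follow a regular pattern, so the five feature sets can be rebuilt from a handful of base names instead of being typed out. The sketch below is inferred from the lists themselves rather than taken from the project's notebooks: day features carry lags 0 to 6, week features carry lags 0 to 2, and the `_day` / `_week` tag appears only on 'nr. sessions' for lags 0 to 2, the one case where day and week names would otherwise collide. The helper name `lagged` and the keyword-based subjective filter are assumptions.

```python
DAY_BASE = [
    "nr. sessions", "total km", "km Z3-4", "km Z5-T1-T2", "km sprinting",
    "strength training", "hours alternative",
    "perceived exertion", "perceived trainingSuccess", "perceived recovery",
]
WEEK_BASE = [
    "nr. sessions", "nr. rest days", "total kms", "max km one day",
    "total km Z3-Z4-Z5-T1-T2", "nr. tough sessions (effort in Z5, T1 or T2)",
    "nr. days with interval session", "total km Z3-4", "max km Z3-4 one day",
    "total km Z5-T1-T2", "max km Z5-T1-T2 one day",
    "total hours alternative training", "nr. strength trainings",
    "avg exertion", "min exertion", "max exertion",
    "avg training success", "min training success", "max training success",
    "avg recovery", "min recovery", "max recovery",
]
RELATIVE = ["rel total kms week 0_1", "rel total kms week 0_2",
            "rel total kms week 1_2"]
# Substrings that mark a self-reported (subjective) feature.
SUBJECTIVE_MARKERS = ("exertion", "training success", "trainingSuccess", "recovery")


def lagged(base_names, n_lags, tag):
    """Expand base names over lags: lag 0 has no suffix, lag k gets '.k'."""
    names = []
    for lag in range(n_lags):
        suffix = "" if lag == 0 else f".{lag}"
        for base in base_names:
            name = base + suffix
            # Only 'nr. sessions' at lags 0-2 exists in both groups, so only it is tagged.
            if base == "nr. sessions" and lag < 3:
                name += tag
            names.append(name)
    return names


days = lagged(DAY_BASE, 7, "_day")
weeks = lagged(WEEK_BASE, 3, "_week") + RELATIVE
all_features = days + weeks
subjective = [n for n in all_features if any(m in n for m in SUBJECTIVE_MARKERS)]
objective = [n for n in all_features if n not in subjective]

print(len(all_features), len(days), len(weeks), len(objective), len(subjective))
# 139 70 69 91 48
```

The printed counts match the totals given in the Feature Sets section, which is a quick way to check a column selection against a loaded dataset.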

Author: Pascal Ghanimi

GitHub Repository: https://github.com/pascalghanimi/Injury-Prediction-in-Runners.git

