# Injury Prediction in Runners

## Overview

This project predicts injuries in athletes using machine learning. It compares three gradient-boosting algorithms (XGBoost, LightGBM, and CatBoost), evaluates their performance across several feature sets, and combines multiple calibrated models in an ensemble for improved prediction accuracy.
## Models

### XGBoost
- Used for binary classification.
- Hyperparameter tuning performed with Optuna.
- Trained on various feature subsets.
### LightGBM
- Faster than XGBoost on large datasets.
- Uses histogram-based learning for better efficiency.
- Tuned with Optuna to maximize the AUC score.
### CatBoost
- Handles categorical features efficiently.
- Uses oblivious (symmetric) trees for balanced splits.
- Tuned with Optuna to maximize the AUC score.
## Feature Sets

To analyze model performance, five different feature sets were used:
- All Features – Combination of all feature categories.
- Days Features – Data collected on a daily basis.
- Weeks Features – Aggregated weekly data.
- Objective Features – Metrics derived from measurable performance.
- Subjective Features – Data based on athlete-reported conditions.
## Methodology

- Data Handling: Features and labels are stored as `.pkl` files.
- Data Splitting: Standard train-validation-test split (70%/15%/15%).
- Batch Sampling: Balanced mini-batches of 2048 samples (50% injured, 50% healthy events) to handle class imbalance.
- Performance Metric: AUC (Area Under the ROC Curve) is used for evaluation.
- Ensemble Approach: Models are calibrated with `CalibratedClassifierCV`, then combined via probability averaging.
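The balanced mini-batch sampling described above can be sketched as follows; the function and array names are hypothetical, and the minority class is oversampled with replacement so the 50/50 split holds under heavy imbalance.

```python
import numpy as np

def sample_balanced_batch(X, y, batch_size=2048, rng=None):
    """Draw a mini-batch with 50% injured (y=1) and 50% healthy (y=0) events."""
    rng = np.random.default_rng(rng)
    half = batch_size // 2
    injured = np.flatnonzero(y == 1)
    healthy = np.flatnonzero(y == 0)
    idx = np.concatenate([
        # Sample with replacement only when a class has too few examples.
        rng.choice(injured, size=half, replace=len(injured) < half),
        rng.choice(healthy, size=half, replace=len(healthy) < half),
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Example: 10,000 samples with only ~5% injury events.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
y = (rng.random(10_000) < 0.05).astype(int)
Xb, yb = sample_balanced_batch(X, y, rng=0)
print(Xb.shape, yb.mean())  # (2048, 8) 0.5
```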
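The calibrate-then-average ensemble can be sketched with scikit-learn alone. Here a logistic regression and a random forest stand in for the tuned boosters, and the data is synthetic; the project's actual models and data differ.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Calibrate each base model so their probabilities are comparable.
base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
]
calibrated = [
    CalibratedClassifierCV(m, method="sigmoid", cv=3).fit(X_train, y_train)
    for m in base_models
]

# Ensemble prediction: simple average of calibrated probabilities.
probs = np.mean([m.predict_proba(X_test)[:, 1] for m in calibrated], axis=0)
print(f"Ensemble AUC: {roc_auc_score(y_test, probs):.3f}")
```

Averaging after calibration matters: uncalibrated models can output probabilities on very different scales, which would let one model dominate the average.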
## How to Run

### In Google Colab
1. Open the Colab notebook in your browser.
2. Upload the required `.pkl` files containing the dataset.
3. Run the notebook cells to train and evaluate the models.
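Loading the uploaded `.pkl` files might look like the sketch below. The file name and the dictionary layout are hypothetical (the round trip is only here to make the example self-contained); use whatever names the notebook expects.

```python
import pickle
from pathlib import Path

# Round-trip demo: in practice you would only run the load step,
# pointing at the uploaded .pkl files.
data = {"features": [[0.1, 0.2], [0.3, 0.4]], "labels": [0, 1]}
path = Path("dataset.pkl")
path.write_bytes(pickle.dumps(data))

with path.open("rb") as f:
    loaded = pickle.load(f)

print(loaded["labels"])  # [0, 1]
```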
### Locally
1. Clone the repository:
   ```bash
   git clone https://github.com/pascalghanimi/Injury-Prediction-in-Runners.git
   cd Injury-Prediction-in-Runners
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Modify the notebook or create Python scripts for local execution.
## Future Improvements

- Feature Engineering: Experiment with additional metrics.
- Hyperparameter Optimization: Run more trials, and tune each feature set (days, weeks, objective, and subjective features) individually.
- Deployment: Serve the best model as a web-based API for real-time predictions.
## Author

Pascal Ghanimi

GitHub Repository: https://github.com/pascalghanimi/Injury-Prediction-in-Runners.git