# Data Preprocessing for Injury Prediction Model

## Overview
This notebook prepares and cleans the datasets on player injuries, muscle imbalances, and training sessions. It converts date columns, merges the datasets, engineers injury-history and workload features, fills missing values, and saves a cleaned dataset together with a per-player train/test split.

In [62]:
# Importing necessary libraries
import pandas as pd          # pandas for data manipulation and analysis
import numpy as np           # numpy for numerical calculations
from datetime import timedelta  # timedelta for calculating time differences
from sklearn.decomposition import PCA


### Data Loading: Injury, Muscle Imbalance, and Player Sessions Datasets


In [63]:
# Each dataset is loaded into a pandas DataFrame

injury_history = pd.read_csv('../Data/injury_history(injury_history).csv')
# The `injury_history` dataset contains player injury records, which may include injury type, duration, and recovery status.

muscle_imbalance_data = pd.read_csv('../Data/injury_history(muscle_imbalance_data).csv')
# The `muscle_imbalance_data` dataset includes muscle imbalance metrics for players, likely predictive of injury risk.

player_sessions = pd.read_csv('../Data/injury_history(player_sessions).csv', encoding='latin1')
# The `player_sessions` dataset records player training sessions, including session duration, intensity, and other relevant metrics.


### Date Conversion for Merging


In [64]:
# Convert date columns to datetime format for merging
injury_history['Injury Date'] = pd.to_datetime(injury_history['Injury Date'], errors='coerce')
muscle_imbalance_data['Date Recorded'] = pd.to_datetime(muscle_imbalance_data['Date Recorded'], errors='coerce')
player_sessions['session_date'] = pd.to_datetime(player_sessions['session_date'], errors='coerce')
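
Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth counting how many non-null entries were lost in the conversion. The following is a minimal sketch of such a check; the helper name `count_unparsed_dates` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def count_unparsed_dates(raw: pd.Series) -> int:
    """Count entries that were non-null in the raw column but became NaT after parsing."""
    parsed = pd.to_datetime(raw, errors="coerce")
    return int((parsed.isna() & raw.notna()).sum())

# Hypothetical raw values: one valid date, one typo, one missing value, one impossible date
raw_dates = pd.Series(["2023-01-05", "not a date", None, "2023-02-30"])
n_unparsed = count_unparsed_dates(raw_dates)
print("Unparseable non-null dates:", n_unparsed)
```

Running the same check on each raw date column before overwriting it would show whether any rows will later drop out of the date-based merges and window comparisons.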


### Feature Addition: Severity Score and Risk Label

- **Severity Score**: A numeric value (1-3) representing injury severity, derived from injury grades.  
  - **Importance**: Quantifies injury impact, helping models link severity with future injury risk.
  - **Calculation**: Categorical grades mapped to numeric values using a dictionary (`{'Grade 1': 1, 'Grade 2': 2, 'Grade 3': 3}`).

- **Risk Label**: A binary indicator (0 or 1) showing if a session falls within a high-risk period.
  - **Importance**: Identifies high-risk sessions, aiding in proactive injury prevention.
  - **Calculation**: Sessions in the 7 days up to and including an injury date are labeled high-risk (`Risk_Label = 1`) and assigned that injury's `Severity Score`; all other sessions keep `0` for both.
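
The labeling rule above can also be sketched without a row-by-row loop, by pairing each session with the same player's injuries and keeping pairs that fall inside the window. This is an illustrative alternative, not the notebook's implementation: the function name `label_risk_windows` and the toy data are hypothetical, and when two injury windows overlap this sketch keeps the highest severity, which is an assumption (the notebook's loop keeps whichever injury is processed last).

```python
import pandas as pd

def label_risk_windows(sessions, injuries, window_days=7):
    """Flag sessions in the window_days up to and including an injury date."""
    out = sessions.reset_index(drop=True)
    out["_row"] = out.index  # stable row id used to map results back
    pairs = out[["_row", "playerid", "session_date"]].merge(
        injuries[["playerid", "Injury Date", "Severity_Score"]],
        on="playerid", how="inner",
    )
    days_before = (pairs["Injury Date"] - pairs["session_date"]).dt.days
    hits = pairs[(days_before >= 0) & (days_before <= window_days)]
    # Assumption: overlapping windows resolve to the highest severity
    worst = hits.groupby("_row")["Severity_Score"].max()
    out["Risk_Label"] = out["_row"].isin(worst.index).astype(int)
    out["Severity_Score"] = out["_row"].map(worst).fillna(0).astype(int)
    return out.drop(columns="_row")

# Toy data (hypothetical): player 1 is injured on 2023-03-10, player 2 never
sessions = pd.DataFrame({
    "playerid": [1, 1, 1, 1, 2],
    "session_date": pd.to_datetime(
        ["2023-03-01", "2023-03-08", "2023-03-10", "2023-03-20", "2023-03-09"]),
})
injuries = pd.DataFrame({
    "playerid": [1],
    "Injury Date": pd.to_datetime(["2023-03-10"]),
    "Severity_Score": [2],
})
labeled = label_risk_windows(sessions, injuries)
print(labeled)
```

Only the two sessions on 2023-03-08 and 2023-03-10 fall inside the window here; the session nine days before the injury and the one after it stay at `0`.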


In [65]:
# Step 1: Map severity grades to numeric scores in injury_history
severity_mapping = {'Grade 1': 1, 'Grade 2': 2, 'Grade 3': 3}
injury_history['Severity_Score'] = injury_history['Severity'].map(severity_mapping)

# Step 2: Add 'Year' and 'Month' columns to both datasets for monthly merge
muscle_imbalance_data['Year'] = muscle_imbalance_data['Date Recorded'].dt.year
muscle_imbalance_data['Month'] = muscle_imbalance_data['Date Recorded'].dt.month
player_sessions['Year'] = player_sessions['session_date'].dt.year
player_sessions['Month'] = player_sessions['session_date'].dt.month

# Rename 'Player.ID' to 'playerid' so the key matches the column name in player_sessions
injury_history = injury_history.rename(columns={'Player.ID': 'playerid'})
muscle_imbalance_data = muscle_imbalance_data.rename(columns={'Player.ID': 'playerid'})

# Step 3: Left-merge muscle_imbalance_data onto player_sessions on 'playerid', 'Year', and 'Month'
merged_sessions_imbalance = pd.merge(
    player_sessions,
    muscle_imbalance_data,
    on=['playerid', 'Year', 'Month'],
    how='left'
)

# Initialize 'Risk_Label' and 'Severity_Score' in merged_sessions_imbalance
merged_sessions_imbalance['Risk_Label'] = 0
merged_sessions_imbalance['Severity_Score'] = 0  # default score for non-risk sessions

# Step 4: Set Risk_Label and Severity_Score based on injuries within a 7-day window
for idx, injury_row in injury_history.iterrows():
    player_id = injury_row['playerid']
    injury_date = injury_row['Injury Date']
    severity_score = injury_row['Severity_Score']
    
    # Mark sessions as high risk within 7 days before the injury and assign Severity_Score
    merged_sessions_imbalance.loc[
        (merged_sessions_imbalance['playerid'] == player_id) &
        (merged_sessions_imbalance['session_date'] >= (injury_date - timedelta(days=7))) &
        (merged_sessions_imbalance['session_date'] <= injury_date),
        ['Risk_Label', 'Severity_Score']
    ] = [1, severity_score]

# Display a sample of the merged data to verify
print("Merged Sessions with Imbalance Data, Risk_Label, and Severity_Score added:")
print(merged_sessions_imbalance[['playerid', 'session_date', 'Risk_Label', 'Severity_Score']].head(10))


Merged Sessions with Imbalance Data, Risk_Label, and Severity_Score added:
   playerid session_date  Risk_Label  Severity_Score
0       112   2023-01-01           0             0.0
1       112   2023-01-03           0             0.0
2       112   2023-01-04           0             0.0
3       112   2023-01-06           0             0.0
4       112   2023-01-07           0             0.0
5       112   2023-01-08           0             0.0
6       112   2023-01-10           0             0.0
7       112   2023-01-11           0             0.0
8       112   2023-01-12           0             0.0
9       112   2023-01-16           0             0.0
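
The left join in Step 3 keys on player, year, and month. If a player ever has more than one imbalance record in the same month, each of that month's sessions would be duplicated in the result. Whether that happens in this dataset is not shown here, so the following is only a sketch of how it could be detected with the `validate` argument of `pd.merge`; the toy frames and the choice to average duplicate records are assumptions.

```python
import pandas as pd

keys = ["playerid", "Year", "Month"]

# Toy data (hypothetical): two sessions and two imbalance records in the same month
sessions = pd.DataFrame({
    "playerid": [1, 1], "Year": [2023, 2023], "Month": [1, 1],
    "session_date": pd.to_datetime(["2023-01-02", "2023-01-09"]),
})
imbalance = pd.DataFrame({
    "playerid": [1, 1], "Year": [2023, 2023], "Month": [1, 1],
    "Quad Imbalance Percent": [4.0, 6.0],
})

# validate='many_to_one' raises MergeError if the right-hand keys are not unique
try:
    pd.merge(sessions, imbalance, on=keys, how="left", validate="many_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True

# One possible remedy (an assumption): average to a single record per player-month
monthly = imbalance.groupby(keys, as_index=False).mean(numeric_only=True)
merged = pd.merge(sessions, monthly, on=keys, how="left", validate="many_to_one")
print(duplicates_detected, len(merged))
```

After aggregation the merge keeps exactly one row per session, so the row count of the merged frame matches the session table.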


### Feature Explanation: Days Since Last Injury

- **Days_Since_Last_Injury**: The number of days since a player’s most recent injury, calculated for each session.
  - **Importance**: Helps identify high-risk sessions based on recovery time from past injuries, giving insights into potential re-injury risks.
  - **Calculation**: For each session, finds the player's most recent injury dated strictly before the session and takes the difference in days. If no prior injury exists, the value is set to `-1` to indicate no injury history.
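
The calculation described above is an "as-of" lookup, which pandas supports through `pd.merge_asof`. The sketch below is one possible formulation under stated assumptions: the function name `days_since_last_injury` and the toy data are hypothetical, session dates are assumed to contain no `NaT`, and `playerid` is assumed to have the same dtype in both frames.

```python
import pandas as pd

def days_since_last_injury(sessions, injuries, no_history=-1):
    """Days between each session and the player's latest strictly earlier injury."""
    left = sessions.reset_index(drop=True)
    left["_row"] = left.index  # remember the original row order
    left = left.sort_values("session_date")  # merge_asof needs sorted keys
    right = (
        injuries[["playerid", "Injury Date"]]
        .dropna(subset=["Injury Date"])
        .rename(columns={"Injury Date": "last_injury_date"})
        .sort_values("last_injury_date")
    )
    matched = pd.merge_asof(
        left, right,
        left_on="session_date", right_on="last_injury_date",
        by="playerid", direction="backward",
        allow_exact_matches=False,  # same-day injuries do not count as prior
    )
    gap = (matched["session_date"] - matched["last_injury_date"]).dt.days
    matched["Days_Since_Last_Injury"] = gap.fillna(no_history).astype(int)
    return (
        matched.sort_values("_row")
        .drop(columns=["_row", "last_injury_date"])
        .reset_index(drop=True)
    )

# Toy data (hypothetical): player 1 is injured on 2023-01-10, player 2 never
sessions = pd.DataFrame({
    "playerid": [1, 2, 1, 1],
    "session_date": pd.to_datetime(
        ["2023-01-01", "2023-01-15", "2023-01-20", "2023-02-01"]),
})
injuries = pd.DataFrame({
    "playerid": [1],
    "Injury Date": pd.to_datetime(["2023-01-10"]),
})
result = days_since_last_injury(sessions, injuries)
print(result)
```

Sessions with no earlier injury receive `-1` directly, so no intermediate sentinel value is needed in this formulation.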


In [66]:
# Calculate Days_Since_Last_Injury for each session based on the most recent prior injury date
# Initialize 'Days_Since_Last_Injury' as a placeholder
merged_sessions_imbalance['Days_Since_Last_Injury'] = np.nan

# Sort injury history by Player ID and Injury Date
injury_history = injury_history.sort_values(['playerid', 'Injury Date'])

# Iterate over each player and update Days_Since_Last_Injury for each session
for player_id, player_group in merged_sessions_imbalance.groupby('playerid'):
    # Get the injury dates for this player
    injuries = injury_history[injury_history['playerid'] == player_id]
    
    # Sort player sessions by date for consistent calculations
    sessions = player_group.sort_values('session_date')
    
    # Track the most recent injury date prior to each session
    last_injury_date = None
    days_since_last_injury = []
    
    for session_date in sessions['session_date']:
        # Update last_injury_date if there were injuries before the session date
        previous_injuries = injuries[injuries['Injury Date'] < session_date]
        
        if not previous_injuries.empty:
            last_injury_date = previous_injuries['Injury Date'].max()
            days_since_last_injury.append((session_date - last_injury_date).days)
        else:
            # If no prior injuries, use 9999 as a temporary sentinel (replaced with -1 below)
            days_since_last_injury.append(9999)
    
    # Assign computed values back to the DataFrame for this player
    merged_sessions_imbalance.loc[sessions.index, 'Days_Since_Last_Injury'] = days_since_last_injury

# Convert Days_Since_Last_Injury to integer type, then swap the 9999 sentinel for -1
merged_sessions_imbalance['Days_Since_Last_Injury'] = merged_sessions_imbalance['Days_Since_Last_Injury'].astype(int)
merged_sessions_imbalance['Days_Since_Last_Injury'] = merged_sessions_imbalance['Days_Since_Last_Injury'].replace(9999, -1)
# Display the updated sessions with Days_Since_Last_Injury
print("Player sessions with corrected Days_Since_Last_Injury:")
print(merged_sessions_imbalance[['playerid', 'session_date', 'Days_Since_Last_Injury']].head(10))

merged_sessions_imbalance.info()


Player sessions with corrected Days_Since_Last_Injury:
   playerid session_date  Days_Since_Last_Injury
0       112   2023-01-01                      -1
1       112   2023-01-03                      -1
2       112   2023-01-04                      -1
3       112   2023-01-06                      -1
4       112   2023-01-07                      -1
5       112   2023-01-08                      -1
6       112   2023-01-10                      -1
7       112   2023-01-11                      -1
8       112   2023-01-12                      -1
9       112   2023-01-16                      -1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2604 entries, 0 to 2603
Data columns (total 43 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   name                         2604 non-null   object        
 1   playerid                     2604 non-null   int64         
 2   groupid                      2604 

### Feature Explanation: Injuries in the Past 3 Months

- **Past_Injuries_3m**: The number of injuries a player has had in the past 90 days, recorded for each session.
  - **Importance**: Indicates recent injury frequency, providing insights into injury recurrence and helping to identify patterns associated with high-risk sessions.
  - **Calculation**: Counts injuries occurring within the 90-day window preceding each session date.
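
For a single player, the 90-day count described above can be computed with two binary searches over the sorted injury dates instead of filtering the injury table once per session. This is a sketch only; the helper name `count_recent_injuries` and the example dates are hypothetical, and dates are assumed to carry no time-of-day component.

```python
import numpy as np
import pandas as pd

def count_recent_injuries(session_dates, injury_dates, window_days=90):
    """Count one player's injuries in [session - window_days, session) per session."""
    inj = np.sort(injury_dates.dropna().to_numpy())
    sess = session_dates.to_numpy()
    # Number of injuries strictly before each session date
    before_session = np.searchsorted(inj, sess, side="left")
    # Number of injuries strictly before the start of each window
    before_window = np.searchsorted(
        inj, sess - np.timedelta64(window_days, "D"), side="left")
    return before_session - before_window

# Toy data (hypothetical) for one player
injury_dates = pd.Series(pd.to_datetime(["2023-01-10", "2023-03-01"]))
session_dates = pd.Series(pd.to_datetime(
    ["2023-01-10", "2023-03-15", "2023-05-01", "2023-07-01"]))
counts = count_recent_injuries(session_dates, injury_dates)
print(counts)
```

The first session falls on an injury date, which is not counted because only injuries strictly before the session qualify. Applying the helper per player (for example inside a `groupby` loop) would reproduce the feature for the whole table.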


In [67]:
# Step 1: Initialize the new feature 'Past_Injuries_3m' in merged_sessions_imbalance
merged_sessions_imbalance['Past_Injuries_3m'] = 0

# Step 2: Calculate the count of total injuries within the past 90 days for each session date
for idx, session_row in merged_sessions_imbalance.iterrows():
    player_id = session_row['playerid']
    session_date = session_row['session_date']
    
    # Filter injuries for the same player within the last 90 days before the session date
    past_injuries = injury_history[
        (injury_history['playerid'] == player_id) &
        (injury_history['Injury Date'] < session_date) &
        (injury_history['Injury Date'] >= session_date - timedelta(days=90))
    ]
    
    # Assign the count of past injuries to 'Past_Injuries_3m' for this session date
    merged_sessions_imbalance.loc[idx, 'Past_Injuries_3m'] = len(past_injuries)


### Feature Additions

- **Average_Recovery_Time**: The average recovery time (in days) per player, calculated from historical injury data.
  - **Importance**: Helps establish individual recovery baselines, useful for assessing readiness and injury risk.
  - **Calculation**: Calculated as the mean recovery time for each player, then merged into the main dataset.

- **Exertion_Deviation_7d**: A rolling standard deviation of exertion levels for each player. Despite the `7d` suffix, the window covers the 3 most recent sessions, not 7 calendar days.
  - **Importance**: Identifies variations in exertion, which may indicate periods of overtraining or recovery.
  - **Calculation**: Calculated per player as the standard deviation over a rolling window of 3 sessions (`min_periods=1`), so each player's first session has no value.

- **Baseline_Exertion & Baseline_Exertion_Deviation**: Baseline exertion per player and deviation from this baseline for each session.
  - **Importance**: Helps measure exertion consistency and detect any significant deviations that could affect injury risk.
  - **Calculation**: Baseline is the player’s average exertion; deviation is the difference between session exertion and baseline.

- **High_Intensity_Session**: Binary indicator (0 or 1) of whether a session's exertion is above the 75th percentile threshold.
  - **Importance**: Highlights high-intensity sessions, which can be a risk factor for injury.
  - **Calculation**: Sessions above the 75th percentile of exertion are marked as high intensity.

- **Heart_Rate_Recovery**: Difference between max and min heart rates within a session.
  - **Importance**: Serves as a rough proxy for cardiovascular stress and recovery within a session.
  - **Calculation**: Calculated as `heartratemaxbpm - heartrateminbpm`.

- **TRIMP_Change**: The change in Training Impulse (TRIMP) from the previous session.
  - **Importance**: Measures changes in training load, useful for spotting sudden workload spikes.
  - **Calculation**: Difference in `trimp` between consecutive sessions per player.
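
A rolling window counted in sessions and one counted in calendar days can give different results when sessions are unevenly spaced. The sketch below contrasts the two on toy data; the values and the output column names `std_3_sessions` and `std_7_days` are hypothetical, and the time-based variant is shown as an alternative, not as what the notebook computes.

```python
import pandas as pd

# Toy data (hypothetical): three sessions on consecutive days, then a long gap
sessions = pd.DataFrame({
    "playerid": [1] * 5,
    "session_date": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-20", "2023-01-21"]),
    "exertions": [100.0, 120.0, 140.0, 300.0, 320.0],
}).sort_values(["playerid", "session_date"]).reset_index(drop=True)

# Session-count window: the last 3 sessions, however far apart they are
sessions["std_3_sessions"] = sessions.groupby("playerid")["exertions"].transform(
    lambda x: x.rolling(window=3, min_periods=1).std()
)

# Calendar window: sessions within the 7 days up to each session date
by_time = (
    sessions.set_index("session_date")
    .groupby("playerid")["exertions"]
    .rolling("7D", min_periods=1)
    .std()
)
# Rows are already sorted by player and date, so positions line up
sessions["std_7_days"] = by_time.to_numpy()
print(sessions)
```

On 2023-01-20 the 3-session window still mixes in exertion values from more than two weeks earlier, while the 7-day window contains only that one session and therefore yields no standard deviation.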


### Feature Reduction: Muscle Imbalance Using PCA

- **Muscle_Imbalance**: A new feature created by applying Principal Component Analysis (PCA) to collapse several muscle imbalance metrics into a single component.
  - **Importance**: Reduces dimensionality by combining the imbalance metrics into one score while retaining as much of their variance as a single component can hold. This may simplify the model and help its performance.
  - **Calculation**: PCA with `n_components=1` is applied to the selected columns (`Hamstring To Quad Ratio`, `Quad Imbalance Percent`, `HamstringImbalance Percent`, `Calf Imbalance Percent`, `Groin Imbalance Percent`), producing a single `Muscle_Imbalance` score.
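
PCA is sensitive to the scale of its inputs: a ratio that varies by a few hundredths contributes far less variance than a percentage that varies by several points, so the first component can end up dominated by the larger-scale columns. A common option is to standardize the columns first. The sketch below illustrates this on synthetic data; the generated values are made up for illustration, and standardization is a suggested variant, not a step the notebook performs.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): three metrics driven by one shared latent factor
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=n)
features = pd.DataFrame({
    "Hamstring To Quad Ratio": 0.6 + 0.05 * latent + rng.normal(scale=0.02, size=n),
    "Quad Imbalance Percent": 8 + 3 * latent + rng.normal(scale=1.0, size=n),
    "Calf Imbalance Percent": 6 + 2 * latent + rng.normal(scale=1.0, size=n),
})

# Standardize each column to zero mean and unit variance before PCA
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=1)
score = pca.fit_transform(scaled)[:, 0]  # take the single component as a 1-D array
explained = pca.explained_variance_ratio_[0]
print(f"Variance explained by the first component: {explained:.2f}")
```

Checking `explained_variance_ratio_` gives a quick sense of how much information the single score retains from the original columns.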


In [68]:
# Calculate average recovery time for each player
avg_recovery_time = injury_history.groupby('playerid')['Recovery Time (days)'].mean().reset_index()
avg_recovery_time.columns = ['playerid', 'Average_Recovery_Time']

# Merge Average_Recovery_Time without duplicating columns
if 'Average_Recovery_Time' not in merged_sessions_imbalance.columns:
    merged_sessions_imbalance = pd.merge(merged_sessions_imbalance, avg_recovery_time, on='playerid', how='left')

# Add exertion features (the rolling statistic assumes rows are in chronological order within each player)
merged_sessions_imbalance['Exertion_Deviation_7d'] = merged_sessions_imbalance.groupby('playerid')['exertions'].transform(lambda x: x.rolling(window=3, min_periods=1).std())
merged_sessions_imbalance['Baseline_Exertion'] = merged_sessions_imbalance.groupby('playerid')['exertions'].transform('mean')
merged_sessions_imbalance['Baseline_Exertion_Deviation'] = merged_sessions_imbalance['exertions'] - merged_sessions_imbalance['Baseline_Exertion']

# High_Intensity_Session: Identify sessions above 75th percentile of exertion
high_intensity_threshold = merged_sessions_imbalance['exertions'].quantile(0.75)
merged_sessions_imbalance['High_Intensity_Session'] = (merged_sessions_imbalance['exertions'] > high_intensity_threshold).astype(int)

# Heart Rate Recovery: Difference between max and min heart rates in each session
merged_sessions_imbalance['Heart_Rate_Recovery'] = (
    merged_sessions_imbalance['heartratemaxbpm'] - merged_sessions_imbalance['heartrateminbpm']
)

# TRIMP Change: Difference in TRIMP between consecutive sessions
merged_sessions_imbalance['TRIMP_Change'] = merged_sessions_imbalance.groupby('playerid')['trimp'].diff()

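# PCA for Muscle Imbalance Features
# Note: PCA does not accept NaN inputs, so these columns must be complete at this point
muscle_features = merged_sessions_imbalance[['Hamstring To Quad Ratio', 'Quad Imbalance Percent', 
                                             'HamstringImbalance Percent', 'Calf Imbalance Percent', 
                                             'Groin Imbalance Percent']]
pca = PCA(n_components=1)
# fit_transform returns an (n_samples, 1) array; take its single column
merged_sessions_imbalance['Muscle_Imbalance'] = pca.fit_transform(muscle_features)[:, 0]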


In [69]:
merged_sessions_imbalance

Unnamed: 0,name,playerid,groupid,groupname,leagueid,sessionid,session_date,position,distancemi,distanceminmi,...,Days_Since_Last_Injury,Past_Injuries_3m,Average_Recovery_Time,Exertion_Deviation_7d,Baseline_Exertion,Baseline_Exertion_Deviation,High_Intensity_Session,Heart_Rate_Recovery,TRIMP_Change,Muscle_Imbalance
0,Anthony Lopez,112,212,Group 1,301,1001,2023-01-01,Center,4.58,0.12,...,-1,0,17.0,,295.534031,11.465969,0,124,,-23.611603
1,Anthony Lopez,112,212,Group 1,301,1002,2023-01-03,Center,1.18,0.11,...,-1,0,17.0,89.802561,295.534031,-115.534031,0,117,9.0,-23.611603
2,Anthony Lopez,112,212,Group 1,301,1003,2023-01-04,Center,5.59,0.14,...,-1,0,17.0,130.011538,295.534031,144.465969,1,94,-121.0,-23.611603
3,Anthony Lopez,112,212,Group 1,301,1004,2023-01-06,Center,3.22,0.09,...,-1,0,17.0,153.079500,295.534031,154.465969,1,122,31.0,-23.611603
4,Anthony Lopez,112,212,Group 1,301,1005,2023-01-07,Center,2.19,0.10,...,-1,0,17.0,17.473790,295.534031,120.465969,1,84,-28.0,-23.611603
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2599,Xavier Foster,108,211,Group 1,301,1216,2023-12-24,Guard,1.08,0.12,...,-1,0,,124.145076,282.259091,53.740909,0,84,13.0,-21.762990
2600,Xavier Foster,108,211,Group 1,301,1217,2023-12-25,Guard,5.32,0.11,...,-1,0,,134.257216,282.259091,58.740909,0,125,121.0,-21.762990
2601,Xavier Foster,108,211,Group 1,301,1218,2023-12-26,Guard,1.24,0.14,...,-1,0,,107.127650,282.259091,-129.259091,0,89,-133.0,-21.762990
2602,Xavier Foster,108,211,Group 1,301,1219,2023-12-28,Guard,2.52,0.07,...,-1,0,,94.214295,282.259091,-46.259091,0,68,134.0,-21.762990


### Handling Missing Values

- **Days_Since_Last_Injury**: Set to `-1` for sessions with no prior injury history.
- **Exertion_Deviation_7d**: Set to `0` for initial sessions lacking sufficient data.
- **Baseline_Exertion_Deviation**: Set to `0` if a player has no baseline exertion record.
- **High_Intensity_Session**: Set to `0`, assuming no exertion implies no high intensity.
- **Heart_Rate_Recovery**: Set to `0` where missing, for consistency.
- **TRIMP_Change**: Set to `0` where missing, which covers each player’s first session (there is no previous session to difference against).
- **Average_Recovery_Time**: Set to the column mean for players with no recovery record.
- **Muscle Imbalance Features**: Filled with the player’s own mean, falling back to the overall mean if the player has no values.
- **Severity_Score**: Set to `0` for sessions without an associated injury.


In [75]:
# Handle missing Days_Since_Last_Injury by setting to -1 for sessions with no prior injury
merged_sessions_imbalance['Days_Since_Last_Injury'] = merged_sessions_imbalance['Days_Since_Last_Injury'].fillna(-1).astype(int)

# Set Exertion_Deviation_7d to 0 for initial sessions without enough data for deviation calculation
merged_sessions_imbalance['Exertion_Deviation_7d'] = merged_sessions_imbalance['Exertion_Deviation_7d'].fillna(0)

# Baseline_Exertion_Deviation: Set to 0 if the player has no baseline exertion record
merged_sessions_imbalance['Baseline_Exertion_Deviation'] = merged_sessions_imbalance['Baseline_Exertion_Deviation'].fillna(0)

# High_Intensity_Session: Fill nulls with 0 assuming no exertion implies no high intensity
merged_sessions_imbalance['High_Intensity_Session'] = merged_sessions_imbalance['High_Intensity_Session'].fillna(0)

# Heart_Rate_Recovery and TRIMP_Change: Fill nulls with 0
merged_sessions_imbalance['Heart_Rate_Recovery'] = merged_sessions_imbalance['Heart_Rate_Recovery'].fillna(0)
merged_sessions_imbalance['TRIMP_Change'] = merged_sessions_imbalance['TRIMP_Change'].fillna(0)

# Average_Recovery_Time: Fill missing values with the overall mean if no recovery time record exists
merged_sessions_imbalance['Average_Recovery_Time'] = merged_sessions_imbalance['Average_Recovery_Time'].fillna(
    merged_sessions_imbalance['Average_Recovery_Time'].mean()
)

# Fill remaining muscle imbalance features with player-specific or overall mean
muscle_features = ['Hamstring To Quad Ratio', 'Quad Imbalance Percent', 'HamstringImbalance Percent', 
                   'Calf Imbalance Percent', 'Groin Imbalance Percent']
for feature in muscle_features:
    merged_sessions_imbalance[feature] = merged_sessions_imbalance.groupby('playerid')[feature].transform(lambda x: x.fillna(x.mean()))
    merged_sessions_imbalance[feature] = merged_sessions_imbalance[feature].fillna(merged_sessions_imbalance[feature].mean())

# Severity_Score: Fill nulls with 0 to indicate no associated injury
merged_sessions_imbalance['Severity_Score'] = merged_sessions_imbalance['Severity_Score'].fillna(0).astype(int)


In [76]:
# Check for missing values in the entire DataFrame
# Note: isnull() is an alias of isna(), so the two sets of counts below are always identical
null_counts = merged_sessions_imbalance.isnull().sum()
nan_counts = merged_sessions_imbalance.isna().sum()

# Display columns with null or NaN values
print("Columns with Null Values:")
print(null_counts[null_counts > 0])

print("\nColumns with NaN Values:")
print(nan_counts[nan_counts > 0])

# If you want a summary of rows with any null or NaN values
rows_with_nulls = merged_sessions_imbalance[merged_sessions_imbalance.isnull().any(axis=1)]
rows_with_nans = merged_sessions_imbalance[merged_sessions_imbalance.isna().any(axis=1)]

print("\nNumber of rows with any null values:", len(rows_with_nulls))
print("Number of rows with any NaN values:", len(rows_with_nans))


Columns with Null Values:
Series([], dtype: int64)

Columns with NaN Values:
Series([], dtype: int64)

Number of rows with any null values: 0
Number of rows with any NaN values: 0


In [73]:
merged_sessions_imbalance.to_csv("../Data/Cleaned Data/Cleaned_data.csv")

### Train-Test Split and Save

- **Objective**: Hold out player 115 (`playerid == 115`) as the test set and use all remaining players as the training set.
- **Steps**:
  - Filter rows with `playerid` 115 into `test_data`, and use all other rows as `train_data`.
  - Save both datasets as CSV files in the `Cleaned Data` directory.
  - Print the shapes of `train_data` and `test_data` to confirm the split.
  
Because the split is made by player and not by row, no session from the held-out player appears in the training set.
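
One caveat with a per-player hold-out is that statistics computed over the full table before the split, such as the 75th-percentile exertion threshold or an overall column mean, also include the test player's sessions. The sketch below shows one way to split first and derive such a threshold from the training rows only; the helper name `split_by_player` and the toy values are hypothetical, and this ordering is a suggestion, not what the notebook does.

```python
import pandas as pd

def split_by_player(df, test_players, id_col="playerid"):
    """Split rows into train/test so that no player appears in both sets."""
    is_test = df[id_col].isin(test_players)
    train, test = df[~is_test].copy(), df[is_test].copy()
    # Sanity check: the two sets share no player ids
    assert set(train[id_col]).isdisjoint(set(test[id_col]))
    return train, test

# Toy data (hypothetical)
df = pd.DataFrame({
    "playerid": [112, 112, 112, 112, 115, 115],
    "exertions": [100.0, 200.0, 300.0, 400.0, 1000.0, 150.0],
})
train, test = split_by_player(df, test_players=[115])

# Derive the threshold from training rows only, then apply it to both sets
threshold = train["exertions"].quantile(0.75)
train["High_Intensity_Session"] = (train["exertions"] > threshold).astype(int)
test["High_Intensity_Session"] = (test["exertions"] > threshold).astype(int)
print(threshold, train.shape, test.shape)
```

With this ordering, the unusually high exertion value of the held-out player does not shift the threshold used to label the training sessions.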



In [74]:
# Separate rows with Player ID 115 for the test set
test_data = merged_sessions_imbalance[merged_sessions_imbalance['playerid'] == 115]

# The remaining data will be used for the training set
train_data = merged_sessions_imbalance[merged_sessions_imbalance['playerid'] != 115]

# Save the datasets to CSV files
train_data.to_csv('../Data/Cleaned Data/train.csv', index=False)
test_data.to_csv('../Data/Cleaned Data/test.csv', index=False)

# Confirm the shapes of the train and test sets
print("Train Data Shape:", train_data.shape)
print("Test Data Shape:", test_data.shape)

Train Data Shape: (2519, 52)
Test Data Shape: (85, 52)
