# Predicting basketball injuries using ML
### Alternative finance project

Supervised by Prof. Marcus Frunza - coded by **Soughati Kenza, Henry-Biabaud Briac, Collin Thibault**

<div style="background-color: #ffffcc; border: 1px solid #d0d0d0; padding: 10px;">

The aim of this project was to apply machine learning models to predict whether an NBA player will be injured in the next game, using a collection of individual box-score statistics from previous games.

</div>

<div style="background-color: #f0f0f0; border: 1px solid #d0d0d0; padding: 10px;">
Having already extracted all the required data, we now apply machine learning techniques to analyze the datasets and build the prediction.
</div>

# Loading general libraries

In [1]:
import numpy as np
import pandas as pd

# Feature engineering

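Because our dataset is heavily imbalanced by nature (barely 5% of all individual games result in an injury), we have to rebalance it. We perform *cluster-based resampling*: the majority class (no injury) is grouped with k-means into as many clusters as there are injury cases, and each cluster is replaced by its centroid. This preserves the patterns of the majority class and only collapses what can be considered close duplicates.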

In [2]:
stats_df = pd.read_csv('stats_full.txt', sep=' ')

y_df = stats_df[['Inj After']]
x_df = stats_df.drop(columns=['Inj After', 'Player', 'Date', 'Season'])

In [None]:
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

majority_class_indices = y_df[y_df['Inj After'] == 0].index
minority_class_indices = y_df[y_df['Inj After'] == 1].index

kmeans = KMeans(n_clusters=len(minority_class_indices), random_state=203)
kmeans.fit(x_df.loc[majority_class_indices])

x_df_majority_undersampled = pd.DataFrame(kmeans.cluster_centers_, columns=x_df.columns.tolist())
x_df_minority = x_df.loc[minority_class_indices]

x_df_balanced = pd.concat([x_df_majority_undersampled, x_df_minority], ignore_index=True)
y_df_balanced = np.concatenate([np.zeros(len(x_df_majority_undersampled)), np.ones(len(minority_class_indices))])
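
The resampling step above can be checked on a small synthetic set. The sketch below is only an illustration: the helper `undersample_with_centroids`, the toy columns `f1` to `f3`, and the random data are assumptions made for this example and are not part of the project's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def undersample_with_centroids(x, y, random_state=0):
    """Replace the majority class (label 0) by one k-means centroid per minority row."""
    minority = x[y == 1]
    majority = x[y == 0]
    kmeans = KMeans(n_clusters=len(minority), n_init=10, random_state=random_state)
    kmeans.fit(majority)
    centroids = pd.DataFrame(kmeans.cluster_centers_, columns=x.columns)
    x_bal = pd.concat([centroids, minority], ignore_index=True)
    y_bal = np.concatenate([np.zeros(len(centroids)), np.ones(len(minority))])
    return x_bal, y_bal


# Hypothetical toy data: 200 rows, 5% positives (not the NBA set).
rng = np.random.default_rng(0)
x_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y_toy = np.zeros(200)
y_toy[:10] = 1

x_bal, y_bal = undersample_with_centroids(x_toy, y_toy)
print(x_bal.shape, y_bal.mean())
```

Note that the majority rows in the balanced set are cluster centroids, that is, averages of several games rather than actual observed games.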

Splitting our set into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_df_balanced, y_df_balanced, test_size=0.2, random_state=203)
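
One optional refinement, not used in the cell above, is to pass `stratify` so that both subsets keep the same class ratio. The sketch below assumes hypothetical toy arrays `x_toy` and `y_toy` and only illustrates the parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 negatives, 10 positives.
x_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# stratify=y_toy keeps the 50/50 class ratio in both the train and test subsets.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=203, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```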

# Feature selection

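Performing PCA to remove weak features: we standardize the training features, keep enough principal components to explain 97.5% of the variance, and retain for each component the original feature with the largest absolute loading.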

In [None]:
features = list(x_df_balanced.columns)

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# pcaSelection is a boolean flag expected to be defined elsewhere (not shown here)
if pcaSelection:
    scaler = StandardScaler()
    x_scaled = scaler.fit_transform(x_train[features])

    pca = PCA(n_components=0.975)
    x_pca = pca.fit_transform(x_scaled)
    n_components_chosen = pca.n_components_

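    # For each retained component, keep the original feature with the largest absolute loading
    components_abs = np.abs(pca.components_)
    important_feature_indices = np.argmax(components_abs, axis=1)
    # Several components can point to the same feature: drop duplicates, preserving order
    features = list(dict.fromkeys(features[i] for i in important_feature_indices))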
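
To make the selection rule concrete, here is a minimal sketch on synthetic data. The toy frame `toy` and its columns `a` to `e` are assumptions made for this example; column `e` is built as a rescaled copy of `a`, so the two are perfectly correlated and PCA needs fewer components than there are columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: four independent columns plus one redundant column.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
toy["e"] = toy["a"] * 2.0  # perfectly correlated with "a"

# Standardize, then keep enough components to explain 97.5% of the variance.
scaled = StandardScaler().fit_transform(toy)
pca = PCA(n_components=0.975).fit(scaled)

# For each component, pick the feature with the largest absolute loading,
# then drop repeated picks while preserving order.
top = np.argmax(np.abs(pca.components_), axis=1)
selected = list(dict.fromkeys(toy.columns[i] for i in top))
print(pca.n_components_, selected)
```

This rule keeps at most one feature per component, so the number of selected features never exceeds the number of retained components.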

# Model selection

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(cleaned_df_train_team, df_scores, test_size=0.2, random_state=20)
x_train = x_train[features]
x_val = x_val[features]
x_test = cleaned_df_test_team[features]

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

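# xgBoost (boolean flag) and map_label (row-wise labelling function) are expected to be defined elsewhere
if xgBoost: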
    label_encoder = LabelEncoder()    
    y_train_ready_encoded = label_encoder.fit_transform(y_train.apply(map_label, axis=1))
    y_val_ready_encoded = label_encoder.transform(y_val.apply(map_label, axis=1))
    
    param_grid = {
        'xgboost__n_estimators': [40],
        'xgboost__max_depth': [5, 7],
        'xgboost__learning_rate': [0.01, 0.15],
        'xgboost__subsample': [0.6, 0.8],
        'xgboost__colsample_bytree': [0.7, 1.0]
    }

    xgb_pipeline = Pipeline([('xgboost', XGBClassifier(use_label_encoder=False, eval_metric='mlogloss'))])
    grid_search = GridSearchCV(xgb_pipeline, param_grid, cv=10, n_jobs=-1, scoring='accuracy', verbose=True)
    grid_search.fit(x_train, y_train_ready_encoded)

    best_params = grid_search.best_params_
    # With the default refit=True, best_estimator_ is already refit on the full training set
    best_model = grid_search.best_estimator_
    
    print("Best parameters found: ", best_params)

    pred_train = best_model.predict(x_train)
    pred_train = pd.Series(pred_train, index=y_train.index)
    pred_train = label_encoder.inverse_transform(pred_train)
    
    pred_val = best_model.predict(x_val)
    pred_val = pd.Series(pred_val, index=y_val.index)
    pred_val = label_encoder.inverse_transform(pred_val)