# Models Comparison

This notebook compares several machine learning algorithms for predicting whether an NBA player will still be in the league after five years. We optimize for precision in order to minimize false positives: recommending a player who does not last is the costly mistake for investors.

**We compare the performance of:**

- Logistic Regression
- Support Vector Machine (SVM)
- Random Forest
- XGBoost
- Gradient Boosting
- Multi-Layer Perceptron (MLP)

**To potentially improve model performance, we will explore:**

- Principal Component Analysis (PCA) for dimensionality reduction
- Interquartile Range (IQR) for outlier clipping

**Selection Criteria and Deployment Considerations**

- Based on the results, we will select the model with the highest average precision, while also weighing interpretability and training time, since both matter for deployment.

In [1]:
import os
import pandas as pd
import matplotlib

%matplotlib widget
%load_ext autoreload
%autoreload 2

pd.options.display.max_columns = None

## Global variables

In [2]:
DATA_INPUT_PATH = "../data/inputs"
DATA_OUTPUT_PATH = "../data/outputs"

## Data Import

In [3]:
data = pd.read_csv(os.path.join(DATA_OUTPUT_PATH, "preprocessed_nba_data.csv"))
print(f'data shape: {data.shape}')
data.head()

data shape: (1280, 21)


Unnamed: 0,Name,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,TARGET_5Yrs
0,Zach Randolph,41,5.8,2.8,1.2,2.6,44.9,0.0,0.0,0.0,0.4,0.7,66.7,0.8,0.9,1.7,0.3,0.2,0.1,0.4,1.0
1,Zach LaVine,77,24.7,10.1,3.7,8.8,42.2,0.7,2.2,34.1,1.9,2.3,84.2,0.4,2.4,2.8,3.6,0.7,0.1,2.5,0.0
2,Xavier McDaniel,82,33.0,17.1,7.0,14.3,49.0,0.0,0.1,20.0,3.0,4.4,68.7,3.7,4.2,8.0,2.4,1.2,0.5,3.0,1.0
3,Winston Garland,67,31.7,12.4,5.1,11.6,43.9,0.2,0.6,33.3,2.1,2.3,87.9,1.0,2.4,3.4,6.4,1.7,0.1,2.5,1.0
4,Winston Bennett,55,18.0,6.1,2.5,5.2,47.9,0.0,0.0,0.0,1.2,1.7,66.7,1.5,1.9,3.4,1.0,0.4,0.2,1.1,0.0


## Splitting Features and Target

In [4]:
X = data.drop(columns= ["Name", "TARGET_5Yrs"])
y = data["TARGET_5Yrs"].values

## Data Standardization

**Note that tree-based models are largely insensitive to feature scaling; standardization matters mainly for Logistic Regression, SVM, and MLP. We scale the features once and use the same matrix for all models.**

In [5]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
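
Fitting the scaler on the full dataset, as above, lets the held-out folds influence the scaling statistics used later in cross-validation. A minimal sketch of a leak-free alternative, assuming synthetic data from `make_classification` as a stand-in for the NBA features (the names `X_demo`, `y_demo`, and `model` are illustrative and not part of the notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 19 numeric NBA features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=19, random_state=42)

# Inside a pipeline the scaler is re-fit on the training part of every fold,
# so the held-out fold never contributes to the mean/std used for scaling.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="precision")
print(f"Precision per fold: {np.round(scores, 4)}")
```

With 1280 rows the effect of global scaling is probably small, but the pipeline form removes the question entirely.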

## Training Binary Classification Models: Base models

In this section, we will compare the performance of different algorithms for binary classification. We will train each model with default hyperparameters to establish a baseline performance. This will help us identify which algorithm performs best for this specific task.

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "SVM": SVC(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(objective='binary:logistic', random_state=42),
    "MLP": MLPClassifier(random_state=42),
}

In [7]:
from src.utils import evaluate_classifiers

evaluate_classifiers(classifiers,  X_scaled, y, n_splits=3, random_state=42)
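
The helper `evaluate_classifiers` lives in `src.utils` and is not shown in this notebook. The sketch below is a guess at what such a function could do for a single model, based only on the printed output (fold-averaged precision and recall, plus a confusion matrix summed over folds). The function name `evaluate_sketch` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold


def evaluate_sketch(clf, X, y, n_splits=3, random_state=42):
    """Hypothetical stand-in for src.utils.evaluate_classifiers (one model)."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    precisions, recalls = [], []
    total_cm = np.zeros((2, 2))  # float matrix, summed over all folds
    for train_idx, test_idx in cv.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        precisions.append(precision_score(y[test_idx], y_pred))
        recalls.append(recall_score(y[test_idx], y_pred))
        total_cm += confusion_matrix(y[test_idx], y_pred, labels=[0, 1])
    return np.mean(precisions), np.mean(recalls), total_cm


# Synthetic data instead of the NBA dataset (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=19, random_state=42)
precision, recall, cm = evaluate_sketch(LogisticRegression(max_iter=1000), X_demo, y_demo)
print(f"Average precision: {precision:.4f}")
print(f"Average recall: {recall:.4f}")
print(f"Confusion matrix:\n{cm}")
```

Because every sample falls in exactly one test fold, the summed confusion matrix counts each sample once.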

******* Training Logistic Regression *******
Average precision of Logistic Regression: 0.7521
Average recall of Logistic Regression: 0.8235
Confusion matrix of Logistic Regression: 
 [[264. 217.]
 [141. 658.]]
******* Training SVM *******
Average precision of SVM: 0.7456
Average recall of SVM: 0.8198
Confusion matrix of SVM: 
 [[257. 224.]
 [144. 655.]]
******* Training Random Forest *******
Average precision of Random Forest: 0.7356
Average recall of Random Forest: 0.8010
Confusion matrix of Random Forest: 
 [[251. 230.]
 [159. 640.]]
******* Training Gradient Boosting *******
Average precision of Gradient Boosting: 0.7362
Average recall of Gradient Boosting: 0.8111
Confusion matrix of Gradient Boosting: 
 [[249. 232.]
 [151. 648.]]
******* Training XGBoost *******
Average precision of XGBoost: 0.7138
Average recall of XGBoost: 0.7735
Confusion matrix of XGBoost: 
 [[233. 248.]
 [181. 618.]]
******* Training MLP *******




Average precision of MLP: 0.7460
Average recall of MLP: 0.7721
Confusion matrix of MLP: 
 [[270. 211.]
 [182. 617.]]





- Logistic Regression: Achieves the highest average precision (0.7521) among all models, and also the highest average recall (0.8235).
- Other Models: SVM (0.7456), Random Forest (0.7356), Gradient Boosting (0.7362), MLP (0.7460), and XGBoost (0.7138) all have lower precision than Logistic Regression.
- Recall: No model beats Logistic Regression on recall in this run. Precision nevertheless remains the primary criterion, so we favor the model that minimizes false positives.
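
The reported metrics can be sanity-checked against the printed confusion matrices. The sketch below does this for the baseline Logistic Regression, assuming the scikit-learn layout (rows are true classes, columns are predicted classes) and assuming the matrix is summed over the three folds:

```python
# Confusion matrix printed for the baseline Logistic Regression.
# Assumed layout: rows = true class, columns = predicted class.
cm = [[264, 217],
      [141, 658]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)  # share of predicted "stays 5+ years" that is correct
recall = tp / (tp + fn)     # share of actual "stays 5+ years" that is found

print(f"precision = {precision:.4f}")  # 0.7520
print(f"recall    = {recall:.4f}")     # 0.8235
```

The pooled precision (0.7520) differs slightly from the reported fold average (0.7521) because averaging per-fold ratios is not the same as taking the ratio of pooled counts. The 217 false positives are the cases a precision-focused selection tries to reduce.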

## Hyperparameter tuning

In [8]:
classifiers_params = {
    "Logistic Regression": (LogisticRegression(random_state=42), {'random_state': [42], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}),
    "Random Forest": (RandomForestClassifier(random_state=42), {'random_state': [42], 'n_estimators': [50, 100, 200, 400, 600], 'max_depth': [None, 10, 20]}),
    "Support Vector Machine": (SVC(random_state=42), {'random_state': [42],'C': [0.1, 1, 10], 'gamma': [0.1, 0.01, 0.001], 'kernel': ['linear', 'poly', 'rbf']}),
    "XGBoost": (XGBClassifier(random_state=42), {'random_state': [42],'learning_rate': [0.001, 0.01, 0.1, 0.5], 'n_estimators': [100, 200, 300, 500, 700], 'max_depth': [5, 10, 20, 50]}),
    "Gradient Boosting": (GradientBoostingClassifier(random_state=42), {'random_state': [42],'learning_rate': [0.01, 0.1, 0.5], 'n_estimators': [100, 200, 300]}),
    "MLP": (MLPClassifier(batch_size=32, random_state=42), {'random_state': [42],'hidden_layer_sizes': [(100,), (50, 100), (50, 50, 50), (200, 200, 200)], 'activation': ['relu', 'elu'], 'alpha': [0.0001, 0.001, 0.01]})
}

In [9]:
from src.utils import tune_and_evaluate_models

tune_and_evaluate_models(classifiers_params, X_scaled, y, scoring='precision', cv=10, n_jobs=-1)

************Tuning hyperparameters for Logistic Regression ************


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
60 fits failed out of a total of 120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the fa

Best Hyperparameters: {'C': 0.1, 'penalty': 'l2', 'random_state': 42}
Test Precision: 0.7281
************Tuning hyperparameters for Random Forest ************
Best Hyperparameters: {'max_depth': 10, 'n_estimators': 100, 'random_state': 42}
Test Precision: 0.9742
************Tuning hyperparameters for Support Vector Machine ************
Best Hyperparameters: {'C': 1, 'gamma': 0.1, 'kernel': 'rbf', 'random_state': 42}
Test Precision: 0.7734
************Tuning hyperparameters for XGBoost ************
Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100, 'random_state': 42}
Test Precision: 0.9578
************Tuning hyperparameters for Gradient Boosting ************
Best Hyperparameters: {'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 42}
Test Precision: 0.9648
************Tuning hyperparameters for MLP ************


120 fits failed out of a total of 240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
44 fits failed with the following error:
Traceback (most recent call last):
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/base.py", line 1467, in wrapper
    estimator._validate_params()
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 95

Best Hyperparameters: {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'random_state': 42}
Test Precision: 0.8492
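
The "fits failed" warnings above follow directly from the parameter grids. The default `lbfgs` solver of `LogisticRegression` does not support the `l1` penalty, and `MLPClassifier` accepts only `identity`, `logistic`, `tanh`, and `relu` as activations, so every `elu` candidate is rejected. The sketch below reproduces the failure counts with plain Python. The helper `count_fits` and the "fixed" grids are illustrative assumptions, not part of the original notebook.

```python
from itertools import product

CV_FOLDS = 10  # cv=10 in the tuning call


def count_fits(grid, is_invalid):
    """Return (total_fits, failed_fits) for a parameter grid under CV_FOLDS-fold CV."""
    keys = list(grid)
    combos = [dict(zip(keys, values)) for values in product(*grid.values())]
    failed = sum(1 for params in combos if is_invalid(params))
    return len(combos) * CV_FOLDS, failed * CV_FOLDS


log_reg_grid = {
    "random_state": [42],
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
}
mlp_grid = {
    "random_state": [42],
    "hidden_layer_sizes": [(100,), (50, 100), (50, 50, 50), (200, 200, 200)],
    "activation": ["relu", "elu"],
    "alpha": [0.0001, 0.001, 0.01],
}

# 'lbfgs' (the default solver) cannot fit an L1-penalized model.
print(count_fits(log_reg_grid, lambda p: p["penalty"] == "l1"))  # (120, 60)
# 'elu' is not a valid MLPClassifier activation.
print(count_fits(mlp_grid, lambda p: p["activation"] == "elu"))  # (240, 120)

# One possible repair (an assumption, not the author's grid):
# 'liblinear' supports both l1 and l2; 'tanh' is a valid activation.
fixed_log_reg_grid = {**log_reg_grid, "solver": ["liblinear"]}
fixed_mlp_grid = {**mlp_grid, "activation": ["relu", "tanh"]}
```

In effect, half of each of these two grids was never evaluated, so the search only compared `l2` candidates for Logistic Regression and `relu` candidates for MLP.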




### Training Binary Classification Models: Using Optimal Hyperparameters

In [10]:
classifiers = {
    "Logistic Regression": LogisticRegression(C=0.1, penalty='l2', random_state=42),
    "SVM": SVC(C=1, gamma=0.1, kernel='rbf', random_state=42),
    "Random Forest": RandomForestClassifier(max_depth=10, n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(learning_rate=0.1, n_estimators=300, random_state=42),
    "XGBoost": XGBClassifier(objective='binary:logistic',learning_rate= 0.1, max_depth=5, n_estimators=100, random_state=42),
    "MLP": MLPClassifier(activation= 'relu', alpha= 0.0001, hidden_layer_sizes= (100,), random_state=42),
}

In [11]:
evaluate_classifiers(classifiers,  X_scaled, y, n_splits=3, random_state=42)

******* Training Logistic Regression *******
Average precision of Logistic Regression: 0.7548
Average recall of Logistic Regression: 0.8361
Confusion matrix of Logistic Regression: 
 [[264. 217.]
 [131. 668.]]
******* Training SVM *******
Average precision of SVM: 0.7468
Average recall of SVM: 0.8136
Confusion matrix of SVM: 
 [[260. 221.]
 [149. 650.]]
******* Training Random Forest *******
Average precision of Random Forest: 0.7377
Average recall of Random Forest: 0.8097
Confusion matrix of Random Forest: 
 [[250. 231.]
 [152. 647.]]
******* Training Gradient Boosting *******
Average precision of Gradient Boosting: 0.7398
Average recall of Gradient Boosting: 0.8011
Confusion matrix of Gradient Boosting: 
 [[256. 225.]
 [159. 640.]]
******* Training XGBoost *******
Average precision of XGBoost: 0.7331
Average recall of XGBoost: 0.8048
Confusion matrix of XGBoost: 
 [[247. 234.]
 [156. 643.]]
******* Training MLP *******




Average precision of MLP: 0.7460
Average recall of MLP: 0.7721
Confusion matrix of MLP: 
 [[270. 211.]
 [182. 617.]]




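**Tuning yielded small gains for most models, and Logistic Regression remains the best model in terms of precision:**

- Logistic Regression:

  - Precision increased slightly (0.7521 -> 0.7548), a marginal improvement in the share of predicted positives that are correct.
  - Recall also increased (0.8235 -> 0.8361), so it captured more true positives at no cost in precision.

- Other Models:

  - Precision rose slightly for SVM (0.7456 -> 0.7468), Random Forest (0.7356 -> 0.7377), and Gradient Boosting (0.7362 -> 0.7398), and more clearly for XGBoost (0.7138 -> 0.7331). MLP was unchanged (0.7460) because the selected hyperparameters are the scikit-learn defaults.
  - Recall changes were mixed: Random Forest (0.8010 -> 0.8097) and XGBoost (0.7735 -> 0.8048) improved, while SVM (0.8198 -> 0.8136) and Gradient Boosting (0.8111 -> 0.8011) declined.

These comparisons use the 3-fold cross-validated figures. The "Test Precision" values printed during tuning differ sharply from them (for example, 0.9742 versus 0.7377 for Random Forest) and are not used here.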

## Hyperparameter tuning: Using PCA

In [12]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X_scaled)
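
Passing a float to `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction (here 90%). The sketch below shows how to inspect what was kept, using synthetic correlated features as a stand-in for the NBA data (the names `X_demo` and `pca_demo` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with redundant (correlated) columns, standing in for
# box score statistics such as FGM/FGA/PTS that move together (assumption).
X_demo, _ = make_classification(
    n_samples=300, n_features=19, n_informative=5, n_redundant=10, random_state=42
)
X_demo = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=0.90).fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print(f"Components kept: {pca_demo.n_components_} of {X_demo.shape[1]}")
print(f"Cumulative explained variance: {np.round(cumulative, 3)}")
```

Running the same two `print` statements on the notebook's `pca` object would show how many of the original features' dimensions survive the 90% threshold.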

In [13]:
classifiers_params = {
    "Logistic Regression": (LogisticRegression(random_state=42), {'random_state': [42], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}),
    "Random Forest": (RandomForestClassifier(random_state=42), {'random_state': [42], 'n_estimators': [50, 100, 200, 400, 600], 'max_depth': [None, 10, 20]}),
    "Support Vector Machine": (SVC(random_state=42), {'random_state': [42],'C': [0.1, 1, 10], 'gamma': [0.1, 0.01, 0.001], 'kernel': ['linear', 'poly', 'rbf']}),
    "XGBoost": (XGBClassifier(random_state=42), {'random_state': [42],'learning_rate': [0.001, 0.01, 0.1, 0.5], 'n_estimators': [100, 200, 300, 500, 700], 'max_depth': [5, 10, 20, 50]}),
    "Gradient Boosting": (GradientBoostingClassifier(random_state=42), {'random_state': [42],'learning_rate': [0.01, 0.1, 0.5], 'n_estimators': [100, 200, 300]}),
    "MLP": (MLPClassifier(batch_size=32, random_state=42), {'random_state': [42],'hidden_layer_sizes': [(100,), (50, 100), (50, 50, 50), (200, 200, 200)], 'activation': ['relu', 'elu'], 'alpha': [0.0001, 0.001, 0.01]})
}

In [14]:
tune_and_evaluate_models(classifiers_params, X_pca, y, scoring='precision', cv=10, n_jobs=-1)

************Tuning hyperparameters for Logistic Regression ************
Best Hyperparameters: {'C': 100, 'penalty': 'l2', 'random_state': 42}
Test Precision: 0.7133
************Tuning hyperparameters for Random Forest ************


60 fits failed out of a total of 120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
60 fits failed with the following error:
Traceback (most recent call last):
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
         

Best Hyperparameters: {'max_depth': 10, 'n_estimators': 100, 'random_state': 42}
Test Precision: 0.9461
************Tuning hyperparameters for Support Vector Machine ************
Best Hyperparameters: {'C': 1, 'gamma': 0.1, 'kernel': 'linear', 'random_state': 42}
Test Precision: 0.7148
************Tuning hyperparameters for XGBoost ************
Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100, 'random_state': 42}
Test Precision: 0.9289
************Tuning hyperparameters for Gradient Boosting ************
Best Hyperparameters: {'learning_rate': 0.1, 'n_estimators': 200, 'random_state': 42}
Test Precision: 0.9047
************Tuning hyperparameters for MLP ************


120 fits failed out of a total of 240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
31 fits failed with the following error:
Traceback (most recent call last):
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/base.py", line 1467, in wrapper
    estimator._validate_params()
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 95

Best Hyperparameters: {'activation': 'relu', 'alpha': 0.01, 'hidden_layer_sizes': (100,), 'random_state': 42}
Test Precision: 0.7680




### Training Binary Classification Models: Using Optimal Hyperparameters and PCA

In [15]:
classifiers = {
    "Logistic Regression": LogisticRegression(C=100, penalty='l2', random_state=42),
    "SVM": SVC(C=1, gamma=0.1, kernel='linear', random_state=42),
    "Random Forest": RandomForestClassifier(max_depth=10, n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(learning_rate=0.1, n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(objective='binary:logistic',learning_rate= 0.1, max_depth=5, n_estimators=100, random_state=42),
    "MLP": MLPClassifier(activation= 'relu', alpha= 0.01, hidden_layer_sizes= (100,), random_state=42),
}

In [16]:
evaluate_classifiers(classifiers,  X_pca, y, n_splits=3, random_state=42)

******* Training Logistic Regression *******
Average precision of Logistic Regression: 0.7494
Average recall of Logistic Regression: 0.8236
Confusion matrix of Logistic Regression: 
 [[261. 220.]
 [141. 658.]]
******* Training SVM *******
Average precision of SVM: 0.7548
Average recall of SVM: 0.8249
Confusion matrix of SVM: 
 [[267. 214.]
 [140. 659.]]
******* Training Random Forest *******
Average precision of Random Forest: 0.7469
Average recall of Random Forest: 0.7973
Confusion matrix of Random Forest: 
 [[265. 216.]
 [162. 637.]]
******* Training Gradient Boosting *******
Average precision of Gradient Boosting: 0.7389
Average recall of Gradient Boosting: 0.7935
Confusion matrix of Gradient Boosting: 
 [[257. 224.]
 [165. 634.]]
******* Training XGBoost *******
Average precision of XGBoost: 0.7516
Average recall of XGBoost: 0.7797
Confusion matrix of XGBoost: 
 [[275. 206.]
 [176. 623.]]
******* Training MLP *******




Average precision of MLP: 0.7553
Average recall of MLP: 0.8073
Confusion matrix of MLP: 
 [[272. 209.]
 [154. 645.]]




**Applying PCA to the input data gave mixed results relative to the tuned models without PCA:**

- Logistic Regression:

    - Precision decreased slightly (0.7548 -> 0.7494).
    - Recall also decreased (0.8361 -> 0.8236).
- SVM:

    - SVM showed a slight improvement in precision (0.7468 -> 0.7548) despite the dimensionality reduction.
    - Recall also increased marginally (0.8136 -> 0.8249).
- Other Models:

    - Precision increased for Random Forest (0.7377 -> 0.7469), XGBoost (0.7331 -> 0.7516), and MLP (0.7460 -> 0.7553), which has the highest precision in this run. Gradient Boosting was essentially unchanged (0.7398 -> 0.7389).
    - Recall dropped for Random Forest (0.8097 -> 0.7973), Gradient Boosting (0.8011 -> 0.7935), and XGBoost (0.8048 -> 0.7797); only MLP improved (0.7721 -> 0.8073).

## Hyperparameter Tuning: Outlier Clipping Threshold

In [17]:
X = data.drop(columns= ["Name", "TARGET_5Yrs"]).values
y = data["TARGET_5Yrs"].values

In [18]:
from src.utils import replace_outliers_with_bounds
X = replace_outliers_with_bounds(X)
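
The helper `replace_outliers_with_bounds` comes from `src.utils` and is not shown here. A common way to implement IQR clipping is sketched below, assuming the conventional factor of 1.5 and per-column bounds. The function name `clip_outliers_iqr` and the parameter `k` are hypothetical and may not match the actual helper.

```python
import numpy as np


def clip_outliers_iqr(X, k=1.5):
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    # Values outside the bounds are replaced by the nearest bound.
    return np.clip(X, q1 - k * iqr, q3 + k * iqr)


demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
print(clip_outliers_iqr(demo).ravel())  # [1. 2. 3. 4. 7.]
```

In this toy column Q1 = 2 and Q3 = 4, so the bounds are -1 and 7, and only the extreme value 100 is pulled in. Clipping keeps every row, unlike dropping outliers, which matters with only 1280 players.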

In [19]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [20]:
classifiers_params = {
    "Logistic Regression": (LogisticRegression(random_state=42), {'random_state': [42], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}),
    "Random Forest": (RandomForestClassifier(random_state=42), {'random_state': [42], 'n_estimators': [50, 100, 200, 400, 600], 'max_depth': [None, 10, 20]}),
    "Support Vector Machine": (SVC(random_state=42), {'random_state': [42],'C': [0.1, 1, 10], 'gamma': [0.1, 0.01, 0.001], 'kernel': ['linear', 'poly', 'rbf']}),
    "XGBoost": (XGBClassifier(random_state=42), {'random_state': [42],'learning_rate': [0.001, 0.01, 0.1, 0.5], 'n_estimators': [100, 200, 300, 500, 700], 'max_depth': [5, 10, 20, 50]}),
    "Gradient Boosting": (GradientBoostingClassifier(random_state=42), {'random_state': [42],'learning_rate': [0.01, 0.1, 0.5], 'n_estimators': [100, 200, 300]}),
    "MLP": (MLPClassifier(batch_size=32, random_state=42), {'random_state': [42],'hidden_layer_sizes': [(100,), (50, 100), (50, 50, 50), (200, 200, 200)], 'activation': ['relu', 'elu'], 'alpha': [0.0001, 0.001, 0.01]})
}

In [21]:
tune_and_evaluate_models(classifiers_params, X_scaled, y, scoring='precision', cv=10, n_jobs=-1)

************Tuning hyperparameters for Logistic Regression ************
Best Hyperparameters: {'C': 0.1, 'penalty': 'l2', 'random_state': 42}
Test Precision: 0.7297
************Tuning hyperparameters for Random Forest ************


60 fits failed out of a total of 120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
60 fits failed with the following error:
Traceback (most recent call last):
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
         

Best Hyperparameters: {'max_depth': 10, 'n_estimators': 100, 'random_state': 42}
Test Precision: 0.9727
************Tuning hyperparameters for Support Vector Machine ************
Best Hyperparameters: {'C': 0.1, 'gamma': 0.1, 'kernel': 'linear', 'random_state': 42}
Test Precision: 0.7234
************Tuning hyperparameters for XGBoost ************
Best Hyperparameters: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 500, 'random_state': 42}
Test Precision: 0.8914
************Tuning hyperparameters for Gradient Boosting ************
Best Hyperparameters: {'learning_rate': 0.1, 'n_estimators': 200, 'random_state': 42}
Test Precision: 0.9266
************Tuning hyperparameters for MLP ************


120 fits failed out of a total of 240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/base.py", line 1467, in wrapper
    estimator._validate_params()
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/mnt/CAB/NBA-Challenge/.venv/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 95

Best Hyperparameters: {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': (100,), 'random_state': 42}
Test Precision: 0.8609




In [22]:
classifiers = {
    "Logistic Regression": LogisticRegression(C=0.1, penalty='l2', random_state=42),
    "SVM": SVC(C=0.1, gamma=0.1, kernel='linear', random_state=42),
    "Random Forest": RandomForestClassifier(max_depth=10, n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(learning_rate=0.1, n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(objective='binary:logistic',learning_rate= 0.01, max_depth=5, n_estimators=500, random_state=42),
    "MLP": MLPClassifier(activation= 'relu', alpha= 0.001, hidden_layer_sizes= (100,), random_state=42),
}

In [23]:
evaluate_classifiers(classifiers,  X_scaled, y, n_splits=3, random_state=42)

******* Training Logistic Regression *******
Average precision of Logistic Regression: 0.7580
Average recall of Logistic Regression: 0.8349
Confusion matrix of Logistic Regression: 
 [[268. 213.]
 [132. 667.]]
******* Training SVM *******
Average precision of SVM: 0.7624
Average recall of SVM: 0.8098
Confusion matrix of SVM: 
 [[279. 202.]
 [152. 647.]]
******* Training Random Forest *******
Average precision of Random Forest: 0.7355
Average recall of Random Forest: 0.8123
Confusion matrix of Random Forest: 
 [[247. 234.]
 [150. 649.]]
******* Training Gradient Boosting *******
Average precision of Gradient Boosting: 0.7271
Average recall of Gradient Boosting: 0.8011
Confusion matrix of Gradient Boosting: 
 [[241. 240.]
 [159. 640.]]
******* Training XGBoost *******
Average precision of XGBoost: 0.7354
Average recall of XGBoost: 0.8172
Confusion matrix of XGBoost: 
 [[246. 235.]
 [146. 653.]]
******* Training MLP *******




Average precision of MLP: 0.7494
Average recall of MLP: 0.7722
Confusion matrix of MLP: 
 [[274. 207.]
 [182. 617.]]




**After clipping outliers with the IQR (Interquartile Range) rule, precision improves for Logistic Regression and SVM but not for the other models (changes are relative to the PCA run above):**

- Precision:

    - Logistic Regression: The largest gain (0.7494 -> 0.7580), also above the tuned run without PCA (0.7548).
    - SVM: A noticeable increase (0.7548 -> 0.7624), the highest average precision of any run in this comparison.
    - Random Forest (0.7469 -> 0.7355), Gradient Boosting (0.7389 -> 0.7271), XGBoost (0.7516 -> 0.7354), and MLP (0.7553 -> 0.7494) all lost precision.
- Recall:

    - Logistic Regression recall increased (0.8236 -> 0.8349), while SVM recall decreased (0.8249 -> 0.8098).
    - Random Forest (0.7973 -> 0.8123), Gradient Boosting (0.7935 -> 0.8011), and XGBoost (0.7797 -> 0.8172) traded precision for higher recall; MLP recall fell (0.8073 -> 0.7722).

#### Conclusion

- Outlier Clipping with IQR: This technique gave the highest precision for Logistic Regression (0.7580) and SVM (0.7624), but not for the tree-based models or MLP.
- PCA: PCA had a small, mixed impact on precision. It slightly helped SVM, Random Forest, XGBoost, and MLP, and slightly hurt Logistic Regression and Gradient Boosting.
- Model Comparison: Logistic Regression had the highest average precision in the baseline and tuned runs, SVM in the outlier-clipping run (0.7624, the best overall), and MLP narrowly in the PCA run (0.7553 vs. 0.7548 for SVM).
- Performance could likely be improved further by adding player statistics that go beyond traditional box score metrics, such as:

    - Injuries: Injury history, including frequency, type, and severity of past injuries.
    - Advanced metrics: Statistics that capture a more nuanced picture of player performance, such as Win Shares, Player Efficiency Rating (PER), or Value Over Replacement Player (VORP).
    - Game context: Data on factors like opponent strength, home/away advantage, and playing time in specific situations (clutch time, fourth quarter, etc.).

- Based on these experiments, two strong candidates for deployment emerged: Logistic Regression and SVM. While SVM achieved slightly better precision with outlier clipping, Logistic Regression offers a significant advantage in interpretability: its coefficients show which factors drive each prediction, which is valuable for decision-making. The final choice depends on the relative importance of interpretability and raw performance in this specific context.
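
To illustrate the interpretability argument, the sketch below reads the coefficients of a fitted Logistic Regression as odds ratios. It uses synthetic data and placeholder feature names (`feature_0` to `feature_4`), both assumptions made for illustration; on the real data the names would be the box score columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = [f"feature_{i}" for i in range(5)]  # hypothetical names
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_demo = StandardScaler().fit_transform(X_demo)

clf = LogisticRegression(C=0.1, random_state=42).fit(X_demo, y_demo)

# With standardized inputs, exp(coef) is the multiplicative change in the odds
# of the positive class for a one-standard-deviation increase in that feature.
odds_ratios = np.exp(clf.coef_[0])
ranked = sorted(zip(feature_names, odds_ratios), key=lambda t: -abs(np.log(t[1])))
for name, ratio in ranked:
    print(f"{name}: odds ratio {ratio:.2f}")
```

An odds ratio above 1 pushes the prediction toward "stays five years", and one below 1 pushes it away. An SVM with an RBF kernel offers no comparably direct reading.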

## Reduce the number of features for logistic regression

In [24]:
X = data.drop(columns= ["Name", "TARGET_5Yrs"])
y = data["TARGET_5Yrs"].values

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [25]:
from sklearn.feature_selection import RFECV

log_reg = LogisticRegression(C=0.1, penalty='l2', random_state=42)
rfecv = RFECV(estimator=log_reg, cv=20, scoring='precision')
rfecv.fit(X_scaled, y)
print(f"Used features: {X.columns[rfecv.support_]}")

Used features: Index(['GP', 'PTS', 'OREB'], dtype='object')


In [26]:
classifiers = {
    "Logistic Regression": LogisticRegression(C=0.1, penalty='l2', random_state=42),
}
evaluate_classifiers(classifiers,  X_scaled[:, rfecv.support_], y, n_splits=3, random_state=42)

******* Training Logistic Regression *******
Average precision of Logistic Regression: 0.7602
Average recall of Logistic Regression: 0.8174
Confusion matrix of Logistic Regression: 
 [[275. 206.]
 [146. 653.]]


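RFECV retained three features: GP, PTS, and OREB. With only these, Logistic Regression reaches a precision of 0.7602 (vs. 0.7548 with all features) at a somewhat lower recall (0.8174 vs. 0.8361). We keep these three features so that users have fewer fields to complete in the form, without a significant loss in performance.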