# Example of use of the Assistant class

We will use an NBA player draft dataset; the goal is to predict whether each player will stay in the league for more than 5 years.
The machine learning itself is kept standard, since it is not the main point of this notebook. The aim is to show how the class can be used, and how it saves time and code, and therefore errors and debugging hours.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("nba_logreg.csv")
df.head(10)

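| | Name | GP | MIN | PTS | FGM | FGA | FG% | 3P Made | 3PA | 3P% | ... | FTA | FT% | OREB | DREB | REB | AST | STL | BLK | TOV | TARGET_5Yrs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brandon Ingram | 36 | 27.4 | 7.4 | 2.6 | 7.6 | 34.7 | 0.5 | 2.1 | 25.0 | ... | 2.3 | 69.9 | 0.7 | 3.4 | 4.1 | 1.9 | 0.4 | 0.4 | 1.3 | 0.0 |
| 1 | Andrew Harrison | 35 | 26.9 | 7.2 | 2.0 | 6.7 | 29.6 | 0.7 | 2.8 | 23.5 | ... | 3.4 | 76.5 | 0.5 | 2.0 | 2.4 | 3.7 | 1.1 | 0.5 | 1.6 | 0.0 |
| 2 | JaKarr Sampson | 74 | 15.3 | 5.2 | 2.0 | 4.7 | 42.2 | 0.4 | 1.7 | 24.4 | ... | 1.3 | 67.0 | 0.5 | 1.7 | 2.2 | 1.0 | 0.5 | 0.3 | 1.0 | 0.0 |
| 3 | Malik Sealy | 58 | 11.6 | 5.7 | 2.3 | 5.5 | 42.6 | 0.1 | 0.5 | 22.6 | ... | 1.3 | 68.9 | 1.0 | 0.9 | 1.9 | 0.8 | 0.6 | 0.1 | 1.0 | 1.0 |
| 4 | Matt Geiger | 48 | 11.5 | 4.5 | 1.6 | 3.0 | 52.4 | 0.0 | 0.1 | 0.0 | ... | 1.9 | 67.4 | 1.0 | 1.5 | 2.5 | 0.3 | 0.3 | 0.4 | 0.8 | 1.0 |
| 5 | Tony Bennett | 75 | 11.4 | 3.7 | 1.5 | 3.5 | 42.3 | 0.3 | 1.1 | 32.5 | ... | 0.5 | 73.2 | 0.2 | 0.7 | 0.8 | 1.8 | 0.4 | 0.0 | 0.7 | 0.0 |
| 6 | Don MacLean | 62 | 10.9 | 6.6 | 2.5 | 5.8 | 43.5 | 0.0 | 0.1 | 50.0 | ... | 1.8 | 81.1 | 0.5 | 1.4 | 2.0 | 0.6 | 0.2 | 0.1 | 0.7 | 1.0 |
| 7 | Tracy Murray | 48 | 10.3 | 5.7 | 2.3 | 5.4 | 41.5 | 0.4 | 1.5 | 30.0 | ... | 0.8 | 87.5 | 0.8 | 0.9 | 1.7 | 0.2 | 0.2 | 0.1 | 0.7 | 1.0 |
| 8 | Duane Cooper | 65 | 9.9 | 2.4 | 1.0 | 2.4 | 39.2 | 0.1 | 0.5 | 23.3 | ... | 0.5 | 71.4 | 0.2 | 0.6 | 0.8 | 2.3 | 0.3 | 0.0 | 1.1 | 0.0 |
| 9 | Dave Johnson | 42 | 8.5 | 3.7 | 1.4 | 3.5 | 38.3 | 0.1 | 0.3 | 21.4 | ... | 1.4 | 67.8 | 0.4 | 0.7 | 1.1 | 0.3 | 0.2 | 0.0 | 0.7 | 0.0 |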


There is a lot to say about the preprocessing of this dataset, but we will keep it light:
* The column *3P%* contains some *NA* values caused by a division by zero (no three-point attempts); the correct value is 0
* Some names appear several times, with the exact same statistics but different labels. We decided to remove these observations
* We create three new features based on known advanced NBA statistics

In [2]:
# Missing 3P% values come from a division by zero: the correct value is 0
df["3P%"] = df["3P%"].fillna(0)
# Keep only the players whose name appears exactly once
df = df.drop_duplicates(subset="Name", keep=False).reset_index(drop=True)
# True shooting attempts and true shooting percentage
df["TSA"] = df["FGA"] + 0.44 * df["FTA"]
df["TS%"] = df["PTS"] / (2 * df["TSA"])
# Third feature, built from made and missed three-pointers
df["3NG"] = 1.94 * df["3P Made"] - 1.06 * (df["3PA"] - df["3P Made"])
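To make the duplicate filter and the true-shooting features concrete, here is a minimal sketch on a made-up three-row frame. The players `A` and `B` and their statistics are invented for illustration; only the formulas come from the cell above.

```python
import pandas as pd

# Toy data, invented for illustration: player "A" appears twice with the
# same statistics but conflicting labels, player "B" appears once.
toy = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "PTS": [7.4, 7.4, 5.2],
    "FGA": [7.6, 7.6, 4.7],
    "FTA": [2.3, 2.3, 1.3],
    "TARGET_5Yrs": [0.0, 1.0, 0.0],
})

# Keep only the names that occur exactly once (drops every copy of "A").
toy = toy.drop_duplicates(subset="Name", keep=False).reset_index(drop=True)

# True shooting attempts and true shooting percentage, as in the cell above.
toy["TSA"] = toy["FGA"] + 0.44 * toy["FTA"]
toy["TS%"] = toy["PTS"] / (2 * toy["TSA"])

print(toy[["Name", "TSA", "TS%"]])
```

Only player `B` survives the filter, with `TSA = 4.7 + 0.44 * 1.3 = 5.272` and `TS% = 5.2 / (2 * 5.272)`, roughly 0.49.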

We can now start using the **Assistant** class.

In [3]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from Jarvis import Assistant

# Metrics reported by the assistant for every model
metrics = [accuracy_score, precision_score, recall_score, f1_score]

# Features and target, then a stratified 80/20 train/test split
X = df.drop(columns=["Name", "TARGET_5Yrs"])
y = df["TARGET_5Yrs"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.8)

# Third argument: f1_score (presumably the metric used to compare models)
jarvis = Assistant(X_train, y_train, f1_score, metrics=metrics)
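The internals of the assistant are not shown in this notebook. As a rough mental model, evaluating one model on a list of metrics with plain scikit-learn could look like the sketch below. The helper name `summarize_cv`, the 5-fold cross-validation, the mean and standard deviation summary, and the synthetic dataset are all assumptions made for illustration, not the class's actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, make_scorer,
                             precision_score, recall_score)
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the NBA data (assumption: any binary dataset works).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

metrics = [accuracy_score, precision_score, recall_score, f1_score]


def summarize_cv(model_class, X, y, metrics, cv=5):
    """Hypothetical helper: cross-validate a default model on several metrics."""
    # One scorer per metric, named "accuracy", "precision", "recall", "f1".
    scoring = {m.__name__.replace("_score", ""): make_scorer(m) for m in metrics}
    results = cross_validate(model_class(), X, y, scoring=scoring, cv=cv)
    return {name: (results[f"test_{name}"].mean(), results[f"test_{name}"].std())
            for name in scoring}


summary = summarize_cv(LogisticRegression, X_demo, y_demo, metrics)
print(LogisticRegression.__name__)
print("\t ".join(f"{name}: {mean:.2f} (+/-{std:.2f})"
                 for name, (mean, std) in summary.items()))
```

Passing the model class rather than an instance mirrors the way the assistant is called in this notebook.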

We first look at the performance of a standard logistic regression, without tuning:

In [4]:
from sklearn.linear_model import LogisticRegression
jarvis.tryout(LogisticRegression)

LogisticRegression
accuracy: 0.71 (+/-0.04)	 precision: 0.75 (+/-0.03)	 recall: 0.81 (+/-0.06)	 f1: 0.78 (+/-0.04)	 
--------------------------------------------------------------------------------------------------------------


Quite good! However, no model has been added to the assistant yet. We need to learn a fine-tuned model:

In [5]:
jarvis.learn([(LogisticRegression, {"C": [1e-6, 1e-5, 1e-4, 1e-3, 0.01, 0.1],
                                    "solver": ["lbfgs"], 
                                    "max_iter": [1000]})])

LogisticRegression
accuracy: 0.72 (+/-0.06)	 precision: 0.75 (+/-0.04)	 recall: 0.83 (+/-0.07)	 f1: 0.79 (+/-0.05)	 
--------------------------------------------------------------------------------------------------------------
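`learn` takes a list of `(model class, parameter grid)` pairs. How it searches the grid is not shown here; one plausible way to do the same thing by hand is scikit-learn's `GridSearchCV`. In the sketch below the synthetic data, the shortened grid, the 5 folds and the `stored_models` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training set.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Shortened version of the grid used in the cell above.
param_grid = {"C": [1e-3, 0.01, 0.1], "solver": ["lbfgs"], "max_iter": [1000]}

# Cross-validated grid search, scored with F1.
search = GridSearchCV(LogisticRegression(), param_grid,
                      scoring=make_scorer(f1_score), cv=5)
search.fit(X_demo, y_demo)

# Hypothetical store: the best refitted estimator is kept under its class name.
stored_models = {type(search.best_estimator_).__name__: search.best_estimator_}
print(search.best_params_)
```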


The assistant stored the best model found by the cross-validated grid search we performed. Let's add other algorithms:

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier


params_logit = {"C": [1e-6, 1e-5, 1e-4, 1e-3, 0.01, 0.1], 
                "solver": ["lbfgs"], 
                "max_iter": [5000]}

params_tree = {"max_depth": [7, 8, 9, 10, 11, 12, 13, 14, 15], 
               "min_samples_leaf": [1, 2, 3, 4, 5]}

params_forest = {"n_estimators": [10, 20, 30, 40, 50, 75, 100], 
                 "max_depth": [6, 8, 10, 12, 14]}

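# "lr" is kept as in the recorded run; note that LightGBM documents this
# parameter under the name "learning_rate"
params_LGBM = {"n_estimators": [10, 20, 30, 40, 50, 75, 100],
               "max_depth": [6, 8, 10, 12, 14],
               "lr": [1e-3, 1e-2, 0.1, 0.5, 1, 1.5]}
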
models = [(LogisticRegression, params_logit), 
          (DecisionTreeClassifier, params_tree), 
          (RandomForestClassifier, params_forest),
          (LGBMClassifier, params_LGBM)
         ]


jarvis.learn(models)

LogisticRegression
accuracy: 0.72 (+/-0.06)	 precision: 0.75 (+/-0.04)	 recall: 0.83 (+/-0.07)	 f1: 0.79 (+/-0.05)	 
--------------------------------------------------------------------------------------------------------------
DecisionTreeClassifier
accuracy: 0.66 (+/-0.06)	 precision: 0.71 (+/-0.04)	 recall: 0.77 (+/-0.08)	 f1: 0.73 (+/-0.05)	 
--------------------------------------------------------------------------------------------------------------
RandomForestClassifier
accuracy: 0.71 (+/-0.05)	 precision: 0.74 (+/-0.03)	 recall: 0.81 (+/-0.06)	 f1: 0.77 (+/-0.04)	 
--------------------------------------------------------------------------------------------------------------
LGBMClassifier
accuracy: 0.71 (+/-0.03)	 precision: 0.74 (+/-0.02)	 recall: 0.82 (+/-0.04)	 f1: 0.78 (+/-0.03)	 
--------------------------------------------------------------------------------------------------------------
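As a side note on the cost of the call above: an exhaustive grid search evaluates every combination of the listed values, so the number of candidates per model is the product of the list lengths. The short sketch below counts them for the four grids, copied as they appear in the cell; each candidate is then fitted once per cross-validation fold (the number of folds used by the assistant is not stated here).

```python
from math import prod

# Grids copied from the cell above.
grids = {
    "LogisticRegression": {"C": [1e-6, 1e-5, 1e-4, 1e-3, 0.01, 0.1],
                           "solver": ["lbfgs"], "max_iter": [5000]},
    "DecisionTreeClassifier": {"max_depth": [7, 8, 9, 10, 11, 12, 13, 14, 15],
                               "min_samples_leaf": [1, 2, 3, 4, 5]},
    "RandomForestClassifier": {"n_estimators": [10, 20, 30, 40, 50, 75, 100],
                               "max_depth": [6, 8, 10, 12, 14]},
    "LGBMClassifier": {"n_estimators": [10, 20, 30, 40, 50, 75, 100],
                       "max_depth": [6, 8, 10, 12, 14],
                       "lr": [1e-3, 1e-2, 0.1, 0.5, 1, 1.5]},
}

# Number of candidate settings = product of the lengths of each value list.
sizes = {name: prod(len(values) for values in grid.values())
         for name, grid in grids.items()}
for name, size in sizes.items():
    print(f"{name}: {size} candidates")
```

The counts are 6, 45, 35 and 210, so the LightGBM grid dominates the running time.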


It looks like we tried the best-performing algorithm first! Let's recap all the models stored so far:

In [7]:
jarvis.performance_recap()

id: LogisticRegression - LogisticRegression(C=0.1, max_iter=1000)
accuracy: 0.72 (+/-0.06)	 precision: 0.75 (+/-0.04)	 recall: 0.83 (+/-0.07)	 f1: 0.79 (+/-0.05)	 
--------------------------------------------------------------------------------------------------------------
id: LogisticRegression_1 - LogisticRegression(C=0.1, max_iter=5000)
accuracy: 0.72 (+/-0.06)	 precision: 0.75 (+/-0.04)	 recall: 0.83 (+/-0.07)	 f1: 0.79 (+/-0.05)	 
--------------------------------------------------------------------------------------------------------------
id: DecisionTreeClassifier - DecisionTreeClassifier(max_depth=7, min_samples_leaf=5)
accuracy: 0.66 (+/-0.05)	 precision: 0.72 (+/-0.03)	 recall: 0.75 (+/-0.09)	 f1: 0.73 (+/-0.05)	 
--------------------------------------------------------------------------------------------------------------
id: RandomForestClassifier - RandomForestClassifier(max_depth=6)
accuracy: 0.70 (+/-0.06)	 precision: 0.74 (+/-0.03)	 recall: 0.80 (+/-0.06)	 f1: 0.77 (+/
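The ids in the recap follow the class name, with a numeric suffix (`LogisticRegression_1`) when the name is already taken. The exact rule used by the assistant is not shown; a minimal sketch of such a scheme, with a hypothetical helper `unique_id`, could be:

```python
def unique_id(base, existing):
    """Hypothetical helper: return `base`, or `base_<n>` if it is already taken."""
    if base not in existing:
        return base
    n = 1
    while f"{base}_{n}" in existing:
        n += 1
    return f"{base}_{n}"


# Register the same class twice, then a different one.
registry = {}
for class_name in ["LogisticRegression", "LogisticRegression",
                   "DecisionTreeClassifier"]:
    registry[unique_id(class_name, registry)] = class_name

print(list(registry))
```

With this rule, the second logistic regression is stored as `LogisticRegression_1` instead of overwriting the first one.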

We have two logistic regressions stored, with the exact same performance. Also, the decision tree does not perform as well as the others, so we delete both of these models from the assistant.

In [8]:
jarvis.delete_model(["LogisticRegression_1", "DecisionTreeClassifier"])

Let's check it worked:

In [9]:
jarvis.performance_recap()

id: LogisticRegression - LogisticRegression(C=0.1, max_iter=1000)
accuracy: 0.72 (+/-0.06)	 precision: 0.75 (+/-0.04)	 recall: 0.83 (+/-0.07)	 f1: 0.79 (+/-0.05)	 
--------------------------------------------------------------------------------------------------------------
id: RandomForestClassifier - RandomForestClassifier(max_depth=6)
accuracy: 0.71 (+/-0.06)	 precision: 0.74 (+/-0.05)	 recall: 0.81 (+/-0.07)	 f1: 0.76 (+/-0.04)	 
--------------------------------------------------------------------------------------------------------------
id: LGBMClassifier - LGBMClassifier(lr=0.001, max_depth=8, n_estimators=20)
accuracy: 0.71 (+/-0.03)	 precision: 0.74 (+/-0.02)	 recall: 0.82 (+/-0.04)	 f1: 0.78 (+/-0.03)	 
--------------------------------------------------------------------------------------------------------------


It does! Now, we want to build a voting classifier from these three algorithms:

In [10]:
jarvis.make_ensemble()
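To illustrate what a voting classifier does, the sketch below implements plain majority ("hard") voting on made-up predictions. Whether `make_ensemble` uses hard or soft voting is not stated in this notebook, and the prediction lists are invented for the example.

```python
from collections import Counter

# Made-up predictions from three classifiers on five players
# (1 = stays more than 5 years, 0 = does not).
predictions = {
    "LogisticRegression": [1, 0, 1, 1, 0],
    "RandomForestClassifier": [1, 1, 0, 1, 0],
    "LGBMClassifier": [0, 0, 1, 1, 1],
}


def majority_vote(predictions):
    """Return, for each sample, the label predicted by most classifiers."""
    votes_per_sample = zip(*predictions.values())
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]


ensemble = majority_vote(predictions)
print(ensemble)
```

With three voters and two classes there can be no tie, and a single classifier's mistake is outvoted whenever the other two agree.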

Let's have a look at the performance:

In [11]:
jarvis.performance_recap()

id: LogisticRegression - LogisticRegression(C=0.1, max_iter=1000)
accuracy: 0.72 (+/-0.06)	 precision: 0.75 (+/-0.04)	 recall: 0.83 (+/-0.07)	 f1: 0.79 (+/-0.05)	 
--------------------------------------------------------------------------------------------------------------
id: RandomForestClassifier - RandomForestClassifier(max_depth=6)
accuracy: 0.70 (+/-0.06)	 precision: 0.74 (+/-0.03)	 recall: 0.81 (+/-0.06)	 f1: 0.76 (+/-0.04)	 
--------------------------------------------------------------------------------------------------------------
id: LGBMClassifier - LGBMClassifier(lr=0.001, max_depth=8, n_estimators=20)
accuracy: 0.71 (+/-0.03)	 precision: 0.74 (+/-0.02)	 recall: 0.82 (+/-0.04)	 f1: 0.78 (+/-0.03)	 
--------------------------------------------------------------------------------------------------------------
id: VotingClassifier - VotingClassifier(estimators=[('LogisticRegression',
                              LogisticRegression(C=0.1, max_iter=1000)),
                  

The ensemble model has a low variance, as expected from the theory, while keeping a high performance overall. We have now seen everything needed for learning and testing algorithms.
We still need to predict on an unseen test dataset. Any stored algorithm can be picked by its id. Let's try:

In [12]:
y_pred = jarvis.predict("Logit", X_test)

NameError: 'Logit' is not in the model id list

As the error says, the model identifier is not valid. We added this error because it is much easier to understand than the one that would otherwise be raised.
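A minimal sketch of this kind of check is given below. The `ModelStore` class is a hypothetical stand-in, not the assistant's code; it only shows how testing the id before the lookup yields a message that names the offending id.

```python
class ModelStore:
    """Hypothetical stand-in for the part of the assistant that stores models."""

    def __init__(self):
        self._models = {}

    def add(self, model_id, model):
        self._models[model_id] = model

    def get(self, model_id):
        # Check the id explicitly so that the message names the offending id,
        # instead of letting the dictionary raise a bare KeyError.
        if model_id not in self._models:
            raise NameError(f"'{model_id}' is not in the model id list")
        return self._models[model_id]


store = ModelStore()
store.add("VotingClassifier", object())  # placeholder for a fitted model

try:
    store.get("Logit")
except NameError as err:
    message = str(err)
    print(message)
```

A `KeyError` or `ValueError` would be the more conventional exception type in Python; `NameError` is used here only to mirror the output above.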
This time with a valid id:

In [13]:
y_pred = jarvis.predict("VotingClassifier", X_test)
print("F1-score : %0.4f" % f1_score(y_test, y_pred))

F1-score : 0.7929


The test score falls within the performance interval estimated by cross-validation, perfect!
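This kind of check can be written down explicitly. Since the ensemble's own line is cut off in the recap, the sketch below uses the cross-validated F1 of the logistic regression (0.79, +/-0.05) as a stand-in; the helper `within_interval` is a hypothetical name.

```python
def within_interval(score, mean, spread):
    """Return True if `score` lies in [mean - spread, mean + spread]."""
    return mean - spread <= score <= mean + spread


test_f1 = 0.7929
# Cross-validated F1 of the logistic regression, read from the recap above;
# used as a stand-in because the ensemble's own line is truncated.
cv_mean, cv_spread = 0.79, 0.05

print(within_interval(test_f1, cv_mean, cv_spread))
```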