# NBA draft classifier
The goal of the model is to use data from the 1996-97 through 2021-22 seasons to correctly classify whether a player is *undrafted*, that is, was not selected in the draft.

Step by step:
* Choose the problem ✔
* Split the data into training, validation, and test sets ✔
* Select algorithms to solve the problem
* Add MLflow to model training
* Run a hyperparameter selection tool
* Diagnose the best model and improve it based on the findings

## Dependencies

In [None]:
!pip install mlflow

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import mlflow
import mlflow.sklearn
from urllib.parse import urlparse

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# Path Tales
path = '/content/drive/MyDrive/2022.1/TA GDI/projeto1/data/classification.csv'

In [7]:
dataset = pd.read_csv(path)

In [8]:
dataset.head()

| | player_name | team_abbreviation | age | player_height | player_weight | college | country | draft_year | draft_round | draft_number | ... | oreb_pct | dreb_pct | usg_pct | ts_pct | ast_pct | season | undrafted | season_start_year | season_end_year | gp_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2071 | 16 | 0.153846 | 0.75 | 0.485714 | 65 | 72 | 0.568966 | 0.125 | 0.175758 | ... | 0.127 | 0.182 | 0.142 | 0.536 | 0.052 | 0 | 0.0 | 0.0 | 0.0 | 0.865854 |
| 1 | 1475 | 13 | 0.461538 | 0.678571 | 0.485714 | 117 | 72 | | | | ... | 0.016 | 0.115 | 0.151 | 0.535 | 0.099 | 0 | 1.0 | 0.0 | 0.0 | 0.865854 |
| 2 | 1466 | 2 | 0.423077 | 0.714286 | 0.533333 | 227 | 72 | | | | ... | 0.083 | 0.152 | 0.167 | 0.542 | 0.101 | 0 | 1.0 | 0.0 | 0.0 | 0.902439 |
| 3 | 1465 | 9 | 0.153846 | 0.642857 | 0.485714 | 189 | 72 | 0.568966 | 0.125 | 0.151515 | ... | 0.109 | 0.118 | 0.233 | 0.482 | 0.114 | 0 | 0.0 | 0.0 | 0.0 | 0.512195 |
| 4 | 1464 | 35 | 0.153846 | 0.535714 | 0.438095 | 249 | 72 | 0.551724 | 0.25 | 0.30303 | ... | 0.087 | 0.045 | 0.135 | 0.47 | 0.125 | 0 | 0.0 | 0.0 | 0.0 | 0.109756 |


In [9]:
dataset.groupby(['undrafted'])['player_name'].count()

undrafted
0.0    9629
1.0    1875
Name: player_name, dtype: int64

## Splitting into training, validation, and test sets

In [10]:
RANDOM_STATE = 42

In [11]:
dataset.columns

Index(['player_name', 'team_abbreviation', 'age', 'player_height',
       'player_weight', 'college', 'country', 'draft_year', 'draft_round',
       'draft_number', 'gp', 'pts', 'reb', 'ast', 'net_rating', 'oreb_pct',
       'dreb_pct', 'usg_pct', 'ts_pct', 'ast_pct', 'season', 'undrafted',
       'season_start_year', 'season_end_year', 'gp_pct'],
      dtype='object')

We chose the height, weight, college, country, games played, and performance-metric columns (pts, reb, ast...) as the feature set; the `undrafted` column is our label.

In [12]:
X = dataset[['player_height', 'player_weight', 'college', 'country', 
             'gp', 'pts', 'reb', 'ast', 'net_rating', 'oreb_pct',
             'dreb_pct', 'usg_pct', 'ts_pct', 'ast_pct']]
y = dataset['undrafted']
X.shape, y.shape

((11504, 14), (11504,))

We assign 70% of the data to training, 20% to testing, and 10% to validation.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.125, random_state=RANDOM_STATE)

X_train.shape, X_val.shape, X_test.shape

((8052, 14), (1151, 14), (2301, 14))
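
The two-step split works because the second `test_size` is a fraction of what remains after the first split: 0.125 of the remaining 80% is 10% of the full dataset. The sketch below reproduces that arithmetic on placeholder arrays with the same number of rows and the class counts shown earlier. The `stratify` argument is an optional variation that the notebook does not use, and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same row count and class counts as the notebook's dataset.
X_demo = np.arange(11504 * 2).reshape(11504, 2)
y_demo = np.array([0] * 9629 + [1] * 1875)

# Step 1: hold out 20% of all rows for testing.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
# Step 2: 0.125 of the remaining 80% equals 10% of the original rows.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.125, random_state=42, stratify=y_rest
)

sizes = (len(X_tr), len(X_va), len(X_te))
print(sizes)
# With stratify, each subset keeps roughly the same share of undrafted players.
print(y_tr.mean(), y_va.mean(), y_te.mean())
```

Stratifying is one way to keep the minority class equally represented in all three subsets, which may matter for a dataset this imbalanced.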

## Algorithm selection
We selected the following algorithms for this classification problem:
* MLP
* RandomForest
* XGBoost
* LogisticRegression

In addition, because the dataset is heavily imbalanced (9,629 drafted vs. 1,875 undrafted rows), we use the F1 score as the main evaluation metric.

In [15]:
def eval_metrics(y_true, y_pred, y_proba):
  f1 = f1_score(y_true, y_pred, average='weighted')
  acc = accuracy_score(y_true, y_pred)
  auroc = roc_auc_score(y_true, y_proba[:, 1])
  return acc, auroc, f1
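
To illustrate why accuracy alone can mislead at this class balance, the sketch below scores a hypothetical classifier that always predicts the majority class, using the class counts from the `groupby` output above. The `baseline_*` names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the groupby output earlier in the notebook.
y_true = np.array([0] * 9629 + [1] * 1875)
# Hypothetical baseline: always predict the majority class (0 = drafted).
y_majority = np.zeros_like(y_true)

baseline_acc = accuracy_score(y_true, y_majority)
# Same averaging as eval_metrics; zero_division=0 silences the warning for the
# class that is never predicted.
baseline_f1_weighted = f1_score(y_true, y_majority, average='weighted', zero_division=0)
# F1 computed only for the undrafted class (label 1).
baseline_f1_undrafted = f1_score(y_true, y_majority, pos_label=1, zero_division=0)

print(f"accuracy:            {baseline_acc:.4f}")
print(f"weighted F1:         {baseline_f1_weighted:.4f}")
print(f"F1, undrafted class: {baseline_f1_undrafted:.4f}")
```

On the full dataset this baseline reaches roughly 0.84 accuracy and roughly 0.76 weighted F1 while never identifying an undrafted player. Because weighted F1 is dominated by the majority class, it can be worth also reporting F1 for the undrafted class alone when comparing models.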

### MLPs
For the MLP, we vary the parameters `hidden_layer_sizes` (number of neurons in each hidden layer), `activation` (the neurons' activation function), and `solver` (the optimizer used to update the weights).

In [14]:
EPOCHS=500
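
The three parameter combinations tried in this section could also be evaluated in a single loop. The sketch below does this on synthetic, imbalanced data from `make_classification`, which is a stand-in and not the NBA dataset; the `configs` and `results` names are hypothetical, and MLflow logging is left out to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data with 14 features and a roughly 84/16 class balance.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=14, weights=[0.84, 0.16], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The three configurations varied in this section.
configs = [
    {"hidden_layer_sizes": (100,), "activation": "relu", "solver": "adam"},
    {"hidden_layer_sizes": (10, 10), "activation": "relu", "solver": "adam"},
    {"hidden_layer_sizes": (10, 10), "activation": "logistic", "solver": "sgd"},
]

results = []
for cfg in configs:
    model = MLPClassifier(random_state=42, max_iter=500, **cfg)
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_va)
    y_score = model.predict_proba(X_va)[:, 1]  # probability of class 1
    results.append({
        **cfg,
        "f1": f1_score(y_va, y_pred, average="weighted"),
        "auroc": roc_auc_score(y_va, y_score),
    })

for row in results:
    print(row)
```

The scores printed here describe the synthetic data only and say nothing about the NBA results reported in the attempts that follow.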

#### Attempt 1

In [40]:
hidden_layer_sizes=(100,)
activation='relu'
solver='adam'

In [41]:
with mlflow.start_run():
  mlp = MLPClassifier(random_state=RANDOM_STATE, max_iter=EPOCHS, hidden_layer_sizes=hidden_layer_sizes, activation=activation, solver=solver)
  mlp.fit(X_train, y_train)

  y_pred = mlp.predict(X_val)
  y_proba = mlp.predict_proba(X_val)

  (acc, auroc, f1) = eval_metrics(y_val, y_pred, y_proba)

  mlflow.log_param("hidden_layer_sizes", hidden_layer_sizes)
  mlflow.log_param("activation", activation)
  mlflow.log_param("solver", solver)
  mlflow.log_metric("accuracy", acc)
  mlflow.log_metric("auroc", auroc)
  mlflow.log_metric("f1", f1)

  tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme

  # Model registry does not work with file store
  if tracking_url_type_store != "file":
    # Register the model
    # There are other ways to use the Model Registry, which depends on the use case,
    # please refer to the doc for more information:
    # https://mlflow.org/docs/latest/model-registry.html#api-workflow
    mlflow.sklearn.log_model(mlp, "model", registered_model_name="MLP_NBA_Undrafted")
  else:
    mlflow.sklearn.log_model(mlp, "model")

In [44]:
!less /content/mlruns/0/b5f8ad22854f4b568521073bda44aa34/params/activation

relu

In [45]:
!less /content/mlruns/0/b5f8ad22854f4b568521073bda44aa34/params/hidden_layer_sizes

(100,)

In [46]:
!less /content/mlruns/0/b5f8ad22854f4b568521073bda44aa34/params/solver

adam

In [47]:
!less /content/mlruns/0/b5f8ad22854f4b568521073bda44aa34/metrics/auroc

1662078783968 0.7446037229928325 0

In [48]:
!less /content/mlruns/0/b5f8ad22854f4b568521073bda44aa34/metrics/f1

1662078783970 0.754230082735617 0

In [49]:
!less /content/mlruns/0/b5f8ad22854f4b568521073bda44aa34/metrics/accuracy

1662078783967 0.8201563857515204 0

#### Attempt 2

```
hidden_layer_sizes = (10, 10)
```



In [50]:
hidden_layer_sizes=(10, 10)
activation='relu'
solver='adam'

In [51]:
with mlflow.start_run():
  mlp = MLPClassifier(random_state=RANDOM_STATE, max_iter=EPOCHS, hidden_layer_sizes=hidden_layer_sizes, activation=activation, solver=solver)
  mlp.fit(X_train, y_train)

  y_pred = mlp.predict(X_val)
  y_proba = mlp.predict_proba(X_val)

  (acc, auroc, f1) = eval_metrics(y_val, y_pred, y_proba)

  mlflow.log_param("hidden_layer_sizes", hidden_layer_sizes)
  mlflow.log_param("activation", activation)
  mlflow.log_param("solver", solver)
  mlflow.log_metric("accuracy", acc)
  mlflow.log_metric("auroc", auroc)
  mlflow.log_metric("f1", f1)

  tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme

  # Model registry does not work with file store
  if tracking_url_type_store != "file":
    # Register the model
    # There are other ways to use the Model Registry, which depends on the use case,
    # please refer to the doc for more information:
    # https://mlflow.org/docs/latest/model-registry.html#api-workflow
    mlflow.sklearn.log_model(mlp, "model", registered_model_name="MLP_NBA_Undrafted")
  else:
    mlflow.sklearn.log_model(mlp, "model")

In [52]:
!cat /content/mlruns/0/dac45b74013a437bae0dfae8d95fe6f1/params/activation

relu

In [53]:
!cat /content/mlruns/0/dac45b74013a437bae0dfae8d95fe6f1/params/hidden_layer_sizes

(10, 10)

In [54]:
!cat /content/mlruns/0/dac45b74013a437bae0dfae8d95fe6f1/params/solver

adam

In [55]:
!cat /content/mlruns/0/dac45b74013a437bae0dfae8d95fe6f1/metrics/auroc

1662078957608 0.7428969215696385 0


In [56]:
!cat /content/mlruns/0/dac45b74013a437bae0dfae8d95fe6f1/metrics/f1

1662078957610 0.7431908403435481 0


In [57]:
!cat /content/mlruns/0/dac45b74013a437bae0dfae8d95fe6f1/metrics/accuracy

1662078957607 0.8218940052128584 0
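
Each metric file printed above holds one whitespace-separated record per logged value. A minimal parsing sketch follows, assuming the three fields are a timestamp in milliseconds, the metric value, and the step, which matches the output shown. `MetricRecord` and `parse_metric_line` are hypothetical names, not part of MLflow.

```python
from typing import NamedTuple


class MetricRecord(NamedTuple):
    timestamp_ms: int  # assumed: Unix time in milliseconds
    value: float       # the logged metric value
    step: int          # assumed: the step passed to log_metric (0 by default)


def parse_metric_line(line: str) -> MetricRecord:
    """Parse one 'timestamp value step' line from a metric file."""
    timestamp, value, step = line.split()
    return MetricRecord(int(timestamp), float(value), int(step))


# Accuracy record of attempt 2, copied from the output above.
record = parse_metric_line("1662078957607 0.8218940052128584 0")
print(record.value)
```

Reading values this way avoids copying them by hand when comparing runs.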


#### Attempt 3

```
hidden_layer_sizes = (10, 10)
activation = 'logistic'
solver = 'sgd'
```




In [58]:
hidden_layer_sizes=(10, 10)
activation='logistic'
solver='sgd'

In [59]:
with mlflow.start_run():
  mlp = MLPClassifier(random_state=RANDOM_STATE, max_iter=EPOCHS, hidden_layer_sizes=hidden_layer_sizes, activation=activation, solver=solver)
  mlp.fit(X_train, y_train)

  y_pred = mlp.predict(X_val)
  y_proba = mlp.predict_proba(X_val)

  (acc, auroc, f1) = eval_metrics(y_val, y_pred, y_proba)

  mlflow.log_param("hidden_layer_sizes", hidden_layer_sizes)
  mlflow.log_param("activation", activation)
  mlflow.log_param("solver", solver)
  mlflow.log_metric("accuracy", acc)
  mlflow.log_metric("auroc", auroc)
  mlflow.log_metric("f1", f1)

  tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme

  # Model registry does not work with file store
  if tracking_url_type_store != "file":
    # Register the model
    # There are other ways to use the Model Registry, which depends on the use case,
    # please refer to the doc for more information:
    # https://mlflow.org/docs/latest/model-registry.html#api-workflow
    mlflow.sklearn.log_model(mlp, "model", registered_model_name="MLP_NBA_Undrafted")
  else:
    mlflow.sklearn.log_model(mlp, "model")

In [60]:
!cat /content/mlruns/0/c3f5c69195704f6f80d3a0b0c326c97d/params/activation

logistic

In [61]:
!cat /content/mlruns/0/c3f5c69195704f6f80d3a0b0c326c97d/params/hidden_layer_sizes

(10, 10)

In [62]:
!cat /content/mlruns/0/c3f5c69195704f6f80d3a0b0c326c97d/params/solver

sgd

In [63]:
!cat /content/mlruns/0/c3f5c69195704f6f80d3a0b0c326c97d/metrics/auroc

1662079054027 0.48951683597174234 0


In [64]:
!cat /content/mlruns/0/c3f5c69195704f6f80d3a0b0c326c97d/metrics/f1

1662079054028 0.7415467133346342 0


In [65]:
!cat /content/mlruns/0/c3f5c69195704f6f80d3a0b0c326c97d/metrics/accuracy

1662079054025 0.8218940052128584 0
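
To compare the three attempts side by side, the validation metrics printed above can be collected into one table. The sketch below copies the logged values into a pandas DataFrame and ranks the attempts by weighted F1; the `attempts` and `ranked` names are illustrative.

```python
import pandas as pd

# Validation metrics copied from the MLflow outputs of the three attempts.
attempts = pd.DataFrame([
    {"attempt": 1, "hidden_layer_sizes": "(100,)", "activation": "relu", "solver": "adam",
     "accuracy": 0.8201563857515204, "auroc": 0.7446037229928325, "f1": 0.754230082735617},
    {"attempt": 2, "hidden_layer_sizes": "(10, 10)", "activation": "relu", "solver": "adam",
     "accuracy": 0.8218940052128584, "auroc": 0.7428969215696385, "f1": 0.7431908403435481},
    {"attempt": 3, "hidden_layer_sizes": "(10, 10)", "activation": "logistic", "solver": "sgd",
     "accuracy": 0.8218940052128584, "auroc": 0.48951683597174234, "f1": 0.7415467133346342},
]).set_index("attempt")

# Rank by the weighted F1 score, the metric chosen for this problem.
ranked = attempts.sort_values("f1", ascending=False)
print(ranked.round(4))
```

Ranked by weighted F1, attempt 1 comes first. Attempt 3 has the same accuracy as attempt 2 but an AUROC just under 0.5, which suggests that its predicted probabilities carry little ranking information on the validation set.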
