## Using Machine Learning to Predict the Longevity of NBA Rookies
This notebook looks into using various Python-based machine learning and data science libraries in an attempt to build a machine learning model capable of predicting whether or not an NBA rookie will last 5 years in the league based on their rookie year statistics.

Here is a brief outline of the notebook:
1. Problem Definition and Data Introduction
2. Data Wrangling
3. Exploratory Data Analysis
4. Modelling
5. Evaluation
6. Conclusion

# 1. Problem Definition and Data Introduction

## Problem Definition
The NBA is the world's premier basketball league and as such, the competition for admission into the league is fierce; only about 1% of NCAA College Basketball players get drafted into the NBA. In order to remain in the league, newly drafted players must continue to prove their worth on the court. This notebook will use various rookie year stats to predict whether or not a player will last five years in the league.

## Data Introduction
This project uses two datasets: a rookies dataset and an active players dataset. The rookies dataset includes all of the rookies drafted between 1980 and 2015. The active players dataset lists the active players during each season from 1980 to 2017. Both datasets were taken from data.world. The active players dataset is used to create a target column in the rookies dataset, which then serves as the label for modelling.


## Rookies Data Dictionary
The following are the rookie-year statistics that will be used to predict whether or not a player lasts 5 years in the league (`target` is the label, not a predictor):

`Year Drafted`, `GP`, `MIN`, `PTS`, `FGM`, `FGA`, `FG%`, `3P Made`, `3PA`, `3P%`, `FTM`, `FTA`, `FT%`, `OREB`, `DREB`, `REB`, `AST`, `STL`, `BLK`, `TOV`, `EFF`, `target`

1. Year Drafted
2. GP: games played during rookie season
3. MIN: average minutes played per game
4. PTS: average points per game
5. FGM: average field goals made per game
6. FGA: average field goals attempted per game
7. FG%: average field goal percentage
8. 3P Made: average 3-point field goals made per game
9. 3PA: average 3-point field goals attempted per game
10. 3P%: 3-point percentage
11. FTM: average free throws made per game
12. FTA: average free throws attempted per game
13. FT%: free throw percentage
14. OREB: average offensive rebounds per game
15. DREB: average defensive rebounds per game
16. REB: average total rebounds per game
17. AST: average assists per game
18. STL: average steals per game
19. BLK: average blocks per game
20. TOV: average turnovers per game
21. EFF: a player's efficiency; EFF = (PTS + REB + AST + STL + BLK - Missed FG - Missed FT - TOV) / GP
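
As a worked illustration of the EFF formula, the sketch below computes efficiency from season totals. The function name `efficiency` and the sample stat line are hypothetical and are not taken from the dataset.

```python
def efficiency(pts, reb, ast, stl, blk, fgm, fga, ftm, fta, tov, gp):
    """EFF from season totals, following the formula in the data dictionary."""
    missed_fg = fga - fgm  # missed field goals
    missed_ft = fta - ftm  # missed free throws
    return (pts + reb + ast + stl + blk - missed_fg - missed_ft - tov) / gp

# Hypothetical rookie season totals over 70 games (illustrative numbers only)
eff = efficiency(pts=700, reb=350, ast=140, stl=70, blk=35,
                 fgm=280, fga=630, ftm=105, fta=140, tov=105, gp=70)
print(round(eff, 2))
```

Missed shots and turnovers subtract from the total, so a high-volume but inefficient scorer can end up with a lower EFF than a player with fewer points.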

# 2. Data Wrangling

## Prepare the tools

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, plot_roc_curve, accuracy_score
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-dark')


## Cleaning the Rookies Dataset

In [None]:
#import the rookies dataset
rookies_original = pd.read_excel("../input/nba-rookies-stats/NBA_Rookies_by_Year.xlsx")
rookies = rookies_original[rookies_original["Year Drafted"] < 2016]
rookies.index = range(0, len(rookies.index)) 
rookies.head()

In [None]:
rookies

## Cleaning the Active Players Dataset


In [None]:
#import players dataset
players_all = pd.read_csv("../input/nba-players-stats-19802017/player_df.csv")
players_all = players_all.drop(players_all.columns[0], axis=1)
players_all.head()

In [None]:
#dropping columns with irregularities
players_all = players_all.drop(["G","OWS","BPM","FG%","2P","FT","DRB","BLK"], axis=1)
players_all.head()

In [None]:
#converting year column to int
players_all = players_all.astype({"Year":int})
players_all

In [None]:
#we can disregard rookies drafted after 2013 because the players dataset only goes up to 2017
rookies = rookies[rookies["Year Drafted"] < 2014]
rookies

## Creating a Target Column (whether or not the rookie lasted 5 years)

In [None]:
#storing rookie info in a dictionary
rkeys_list = list(rookies.loc[:, "Name"])
rval_list = list(rookies.loc[:, "Year Drafted"])
rookie_dict = {k:v for k,v in zip(rkeys_list, rval_list)}

In [None]:
#function that groups active players in a list based on the year
def active_players(year):
    players_year = players_all[players_all["Year"] == year]
    players_year = list(players_year.loc[:, "Player"])
    players_year = [s.strip('*') for s in players_year]
    return players_year

#creating a 2D list: one inner list of active players for each season from 1980 to 2017
players_by_year = [active_players(year) for year in range(1980, 2018)]

In [None]:
#storing active player info in a dictionary where the key is the year and the value is the list of players active during that year

#keys
keys_list = [year for year in range(1980,2018)]

#creating dictionary
players_dict = {k:v for k,v in zip(keys_list, players_by_year)}

In [None]:
#creating list of players that spent at least 5 years in the league
fivyrs = []
for player, rookie_year in rookie_dict.items():
    target_year = rookie_year + 4
    if player in players_dict[target_year]:
        fivyrs.append(player)

In [None]:
#creating the target column: 1 if the rookie appears in fivyrs, 0 otherwise
rookie_names = list(rookies.loc[:, "Name"])
fivyrs_set = set(fivyrs)  #set lookup avoids scanning the whole list for every rookie
#the length follows from rookie_names, so no row count needs to be hard-coded
target_col = np.array([1 if rookie in fivyrs_set else 0 for rookie in rookie_names])
print(target_col)

In [None]:
#adding the target column to the dataframe
target_col = pd.DataFrame(data=target_col, index=[i for i in range(0,len(rookies.index))], columns=["target"])
rookies.index = range(0,len(rookies.index))
rookies["target"] = target_col.loc[:, "target"]

In [None]:
rookies
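
The labelling rule used above (a rookie drafted in year Y is marked 1 if their name appears among the active players in season Y + 4) can be sketched on toy data. The names, years, and the helper `label_rookies` below are hypothetical. Note that matching on names alone would conflate two players who share a name.

```python
def label_rookies(rookie_years, active_by_year, offset=4):
    """Return {name: 1 or 0}; 1 if the rookie is active `offset` seasons after the draft year."""
    labels = {}
    for name, draft_year in rookie_years.items():
        # A season missing from the lookup is treated as "not active"
        roster = active_by_year.get(draft_year + offset, set())
        labels[name] = 1 if name in roster else 0
    return labels

# Toy data with made-up names
rookie_years = {"Player A": 1990, "Player B": 1990, "Player C": 1991}
active_by_year = {
    1994: {"Player A", "Veteran X"},
    1995: {"Player A", "Player C"},
}
labels = label_rookies(rookie_years, active_by_year)
print(labels)
```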

# 3. Exploratory Data Analysis

In [None]:
pd.set_option('display.max_columns', None)
rookies.head(10)

In [None]:
rookies.tail(10)

In [None]:
#Let's find out how many of each class there is
rookies["target"].value_counts()

In [None]:
#Let's visualize this distribution
rookies["target"].value_counts().plot(kind="bar", color=["salmon", "lightblue"])

In [None]:
#Deleting the name column
rookies = rookies.drop(["Name"], axis=1)
rookies

In [None]:
#General description of data
rookies.describe()

In [None]:
#compare target column with year
yr_series = pd.Series(rookies.loc[:, "Year Drafted"])
target_series = pd.Series(rookies.loc[:, "target"])
pd.crosstab(target_series, yr_series)

In [None]:
#visualizing this info
pd.crosstab(yr_series, target_series).plot(kind="bar", figsize=(10,7), color=["salmon", "lightblue"])
plt.title("5yr Survival By Year")
plt.ylabel("Count")

In [None]:
rookies.head()

In [None]:
#PTS Distribution
rookies["PTS"].plot(kind="hist")

In [None]:
#MIN Distribution
rookies["MIN"].plot(kind="hist")

In [None]:
#FG% Distribution
rookies["FG%"].plot(kind="hist")

In [None]:
#GP Distribution
rookies["GP"].plot(kind="hist")

In [None]:
rookies.head()

In [None]:
#Correlation matrix
corr_matrix = rookies.corr()
fig, ax = plt.subplots(figsize=(16,10))
ax = sns.heatmap(corr_matrix, annot=True, linewidths=0.5, fmt=".2f", cmap="YlGnBu")

# 4. Modelling 

In [None]:
#Cleaning the 3P% column
rookies["3P%"] = rookies["3P%"].map(lambda x:0 if x=="-" else x)

In [None]:
#Creating Matrix of Features
X = rookies.drop(["target"], axis = 1)

In [None]:
X

In [None]:
#creating target column
y = rookies.loc[:, "target"]
y

In [None]:
rookies.dtypes

In [None]:
# Splitting into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
X_train

In [None]:
# Models dictionary
models = {"Logistic Regression": LogisticRegression(),
         "KNN": KNeighborsClassifier(),
         "Random Forest": RandomForestClassifier(),
         "XGBoost": XGBClassifier()}

#Function that will evaluate the model performance using various metrics
def evaluate_pred(y_pred, y_test):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    metric_dict = {"accuracy": round(accuracy, 2), "precision": round(precision, 2), "recall": round(recall, 2),
                  "f1": round(f1,2)}
    print(f"Accuracy: {accuracy*100:.2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 score: {f1:.2f}")
    
    return metric_dict

# Function that will fit and score the models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    #Dictionary of model scores
    model_scores = {}
    
    #Loop through models
    for name, model in models.items():
        clf = model
        clf.fit(X_train, y_train)
        model_scores[name] = model.score(X_test, y_test)
    return model_scores

In [None]:
model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)
model_scores


# Hyperparameter Tuning with GridSearchCV
We will use GridSearchCV to try to improve the performance of these models:
1. Logistic Regression Tuning
2. XGBoost Tuning
3. KNN Tuning
4. Random Forest Tuning

## Logistic Regression Tuning

In [None]:
# Create hyperparameter options
grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}

# Apply grid search
log_clf = GridSearchCV(LogisticRegression(), grid, cv=5, verbose=0)

#Fit
log_clf.fit(X_train, y_train)

In [None]:
#print the best estimator
log_clf.best_estimator_

In [None]:
#evaluating the performance of the best estimator
log_clf1 = LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
log_clf1.fit(X_train, y_train)
y_pred = log_clf1.predict(X_test)
accuracy_score(y_pred, y_test)

In [None]:
#negligible increase in accuracy

## XGBoost Tuning

In [None]:
#Constructing the grid
param_test1 = {
 'n_estimators':range(50,200,10),
 'max_depth':range(3,10,2),
 'min_child_weight':range(1,6,2)
}

#Apply grid search
xg_clf = GridSearchCV(XGBClassifier(), param_test1, cv=5, verbose=0)
xg_clf.fit(X_train, y_train)
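
To gauge the cost of the exhaustive search above, the sketch below counts the parameter combinations in `param_test1` and the resulting number of model fits under 5-fold cross-validation. It uses only the standard library. By default `GridSearchCV` also refits the best combination once on the full training data, which adds one more fit.

```python
from itertools import product

# Same grid as in the XGBoost tuning cell
param_test1 = {
    "n_estimators": range(50, 200, 10),
    "max_depth": range(3, 10, 2),
    "min_child_weight": range(1, 6, 2),
}

# Every combination that an exhaustive grid search evaluates
combinations = [dict(zip(param_test1, values))
                for values in product(*param_test1.values())]
n_folds = 5
n_fits = len(combinations) * n_folds

print(len(combinations), n_fits)
```

Because the count grows multiplicatively with each added parameter, a randomized search such as `RandomizedSearchCV` (already imported in this notebook) may be worth considering for larger grids.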

In [None]:
#Print best estimator
xg_clf.best_estimator_

In [None]:
#evaluating the performance of the best estimator
xg_clf1 = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=7,
              min_child_weight=1, monotone_constraints='()',
              n_estimators=120, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

xg_clf1.fit(X_train, y_train)
y_pred = xg_clf1.predict(X_test)
accuracy_score(y_pred, y_test)

In [None]:
#decrease in accuracy

## KNN Tuning

In [None]:
#Desired range for k parameter
k_range = list(range(19, 50))

#Creating grid
param_grid = dict(n_neighbors=k_range)

#Applying GridSearchCV
knn_clf = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
knn_clf.fit(X_train, y_train)  #fit on the training split only, so the test set stays unseen during tuning

In [None]:
#printing best estimator
knn_clf.best_estimator_

In [None]:
#evaluating the performance of the best estimator
knn_clf1 = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=37, p=2,
                     weights='uniform')
knn_clf1.fit(X_train, y_train)
y_pred = knn_clf1.predict(X_test)
print(accuracy_score(y_pred, y_test))

In [None]:
#6% increase in accuracy achieved

## Random Forest Tuning

In [None]:
#Creating the grid
param_grid = {
    'n_estimators'      : range(50,200,10),
    'max_depth'         : [8, 9, 10, 11, 12],
    'random_state'      : [0],
    #'max_features': ['auto'],
    #'criterion' :['gini']
}

#Applying grid search
cv_rfc = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, cv= 10, scoring='accuracy')
cv_rfc.fit(X_train, y_train)

In [None]:
#printing best estimator
cv_rfc.best_estimator_

In [None]:
#evaluating performance of best estimator
rfc = RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=9, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=340,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
accuracy_score(y_pred, y_test)

In [None]:
#negligible increase in accuracy

# 5. Evaluation
For each model we will look at:
1. ROC curve and AUC score
2. Confusion matrix
3. Classification report
4. Precision
5. Recall
6. F1 Score
7. Feature Importance
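
As a reminder of how precision, recall, and F1 relate to the confusion matrix, here is a minimal sketch with made-up counts. The helper `metrics_from_counts` is hypothetical; in the notebook itself these values come from scikit-learn's `precision_score`, `recall_score`, and `f1_score`.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class.

    Assumes at least one predicted positive and one actual positive,
    so that no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of the predicted positives, how many are correct
    recall = tp / (tp + fn)     # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only
m = metrics_from_counts(tp=60, fp=20, fn=40, tn=80)
print({k: round(v, 2) for k, v in m.items()})
```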

In [None]:
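#Function that creates visualization for confusion matrix
sns.set(font_scale=1.0)

def plot_conf_mat(y_preds, y_test):
    #Argument order matches the calls below: predictions first, true labels second
    fig, ax = plt.subplots(figsize=(4, 4))
    #Rows (y-axis) are predicted labels and columns (x-axis) are true labels
    ax = sns.heatmap(confusion_matrix(y_preds, y_test),
                     annot=True,
                     fmt="d",  #show counts as integers rather than in scientific notation
                     cbar=False)
    plt.xlabel("True label")
    plt.ylabel("Predicted label")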

In [None]:
#function that calculates classification metrics using cross validation
cv_metrics = ["accuracy", "precision", "recall", "f1"]
def cv_calculator(cv_metrics, clf, X, y):
    cv_dict = {}
    for metric in cv_metrics:
        cv_dict[metric] = np.mean(cross_val_score(clf, X, y, cv=5, scoring=metric))
    return cv_dict
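
For intuition about what `cv=5` does inside `cross_val_score`, the sketch below splits sample indices into k contiguous folds and uses each fold once as the validation set. This is plain k-fold under simplifying assumptions; for classifiers, scikit-learn defaults to a stratified variant that also preserves class proportions in each fold. `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, non-shuffled folds."""
    # The first n_samples % k folds get one extra sample
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=12, k=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Each sample is used for validation exactly once, and the reported score is the mean over the k validation folds.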

## XGBoost Evaluation

In [None]:
#Plot ROC Curve and calculate AUC for XGB
plot_roc_curve(xg_clf1, X_test, y_test)
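
The AUC shown in the legend of the ROC plot can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. A minimal pure-Python sketch with made-up scores follows; `auc_from_scores` is a hypothetical helper, not a scikit-learn function.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.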

In [None]:
#confusion matrix for XGB
y_pred1 = xg_clf1.predict(X_test)
plot_conf_mat(y_pred1, y_test)

In [None]:
#cross validated classification metrics for XGB
cv_dict = cv_calculator(cv_metrics, xg_clf1, X, y)
cv_dict

In [None]:
#visualize the cv metrics
cv_metrics1 = pd.DataFrame(cv_dict, index=["score"])
cv_metrics1.T.plot.bar(title="XGB CV Metrics", legend=False)

In [None]:
#feature importance XGB
plt.figure(figsize=(15, 5))
plt.bar(list(X_train.columns), xg_clf1.feature_importances_, align='edge', width=0.3)
plt.show()

## Logistic Regression Evaluation

In [None]:
#Plot ROC Curve and calculate AUC for Logistic Regression
plot_roc_curve(log_clf1, X_test, y_test)

In [None]:
#confusion matrix for Log Reg
y_pred2 = log_clf1.predict(X_test)
plot_conf_mat(y_pred2, y_test)

In [None]:
#cross validated classification metrics for Log Reg
cv_dict2 = cv_calculator(cv_metrics, log_clf1, X, y)  #use the tuned estimator, not the GridSearchCV object
cv_dict2

In [None]:
#visualize the cv metrics
cv_metrics2 = pd.DataFrame(cv_dict2, index=["score"])
cv_metrics2.T.plot.bar(title="Log Reg CV Metrics", legend=False)

In [None]:
#feature importance log reg#

#Match coefficients to corresponding feature columns (X excludes the target)
feature_dict = dict(zip(X.columns, list(log_clf1.coef_[0])))

#Visualize feature importance
feature_df = pd.DataFrame(feature_dict, index=[0])
feature_df.T.plot.bar(title="Feature Importance", legend=False, figsize=(15, 5))

## KNN Evaluation 

In [None]:
#Plot ROC Curve and calculate AUC KNeighbors
plot_roc_curve(knn_clf1, X_test, y_test)

In [None]:
#confusion matrix for KNN
y_pred3 = knn_clf1.predict(X_test)
plot_conf_mat(y_pred3, y_test)

In [None]:
#cross validated classification metrics for KNN
cv_dict3 = cv_calculator(cv_metrics, knn_clf1, X, y)
cv_dict3

In [None]:
#visualize the cv metrics
cv_metrics3 = pd.DataFrame(cv_dict3, index=["score"])
cv_metrics3.T.plot.bar(title="KNN CV Metrics", legend=False)

Feature importance is not explicitly defined for the KNN algorithm

## Random Forest Evaluation

In [None]:
#Plot ROC Curve and calculate AUC for Random Forest
plot_roc_curve(rfc, X_test, y_test)

In [None]:
#confusion matrix for RFC
y_pred4 = rfc.predict(X_test)
plot_conf_mat(y_pred4, y_test)

In [None]:
#cross validated classification metrics RF
cv_dict4 = cv_calculator(cv_metrics, rfc, X, y)
cv_dict4

In [None]:
#visualize the cv metrics
cv_metrics4 = pd.DataFrame(cv_dict4, index=["score"])
cv_metrics4.T.plot.bar(title="Random Forest CV Metrics", legend=False)

In [None]:
#feature importance for random forest#

#creating feature importance dictionary (X excludes the target)
features_dict2 = dict(zip(X.columns, rfc.feature_importances_))

#visualizing feature importance using the random forest values
feature_df2 = pd.DataFrame(features_dict2, index=[0])
feature_df2.T.plot.bar(title="Random Forest Feature Importance", legend=False, figsize=(15, 5))

# 6. Conclusion

With cross-validated accuracies of around 69%, the logistic regression, KNN, and random forest models perform best. This modest accuracy is likely due in part to the fact that the training data contains no information about injuries, which can be a critical determinant of a rookie's longevity. Even so, the models that expose feature importances (logistic regression, random forest, and XGBoost) identify a rookie's efficiency as the most important determinant of longevity.