## Can we predict whether a shot will be a hit in the NBA?

![](https://betsulblog.wecontent.com.br/media/nba-melhores-temporadas-calouro-historiaj7h.jpg)

The NBA is the most prestigious basketball league in the world, and only the best players get a chance to play in it. Given that, can we accurately predict whether a shot will be a hit or a miss? To answer that, I'm going to use data from the 2014/2015 season containing variables like shot distance and shot clock. In this kernel I will explore the dataset to visualize the feature distributions and their relationships, and use unsupervised learning algorithms to look for structure in the data. Then I will use tree-based algorithms to predict the shot outcome and to understand which variables matter most for the classification. **If you like the notebook, please give it an upvote!**

## Import libs and data

In [None]:
# Libs to deal with tabular data
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)

# Plotting packages
import seaborn as sns
sns.set_style("darkgrid")  # set_style applies the theme; axes_style only returns its settings
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')

# Machine Learning
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import roc_auc_score
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier

from lightgbm import LGBMClassifier

# Optimization
!pip uninstall optuna -y
!pip uninstall typing -y
!pip install optuna==2.3.0
import optuna
from optuna.samplers import TPESampler
from optuna.visualization import plot_contour, plot_optimization_history
from optuna.visualization import plot_param_importances, plot_slice


# To display stuff in notebook
from IPython.display import display, Markdown

# Misc
from tqdm.notebook import tqdm
import time

In [None]:
shots = pd.read_csv('../input/nba-shot-logs/shot_logs.csv')

## Data preparation

In this section I will take a first look at the data and transform it into a dataframe more suitable for machine learning.

#### Random sample

In [None]:
shots.sample(5)

In [None]:
shots.shape

In [None]:
shots.columns = shots.columns.str.lower()
shots.dtypes

In [None]:
shots.isnull().sum()

#### Ten rows of a random match

In [None]:
with pd.option_context('display.max_columns', None):
    display(shots[shots['game_id'] == shots['game_id'].sample().iloc[0]].head(10))

#### Inspecting touch_time

In [None]:
shots['touch_time'].value_counts()

In [None]:
(shots['touch_time'] < 0).sum()

In [None]:
with pd.option_context('display.max_columns', None):
    display(shots.loc[shots['touch_time'] < 0, :].head())

Touch time is the amount of time the player has possession of the ball before making a shot. Negative values are therefore meaningless but, as we can see above, they are present.

#### Fixing columns

In this part I convert time variables to seconds, transform categorical variables into booleans and remove redundant columns.

In [None]:
# Convert game clock to seconds
shots['game_clock'] = shots['game_clock'].apply(
    lambda x: 60*int(x.split(':')[0]) + int(x.split(':')[1])
)

# Replacing abnormal values with NaNs
shots.loc[shots['touch_time'] < 0, 'touch_time'] = np.nan

# Converting type of shot (2 or 3 points) to a binary flag (1 = three-pointer)
shots['pts_type'] = (shots['pts_type'] == 3) * 1

# Converting location to a binary flag (1 = home game)
shots['location'] = (shots['location'] == 'H') * 1

# Renaming columns
shots = shots.rename(columns = {
    'fgm':'hit',
    'pts_type':'3pts_shot',
    'location':'home_match'
})

# Dropping identifier columns (not useful for modelling) as well as
# future variables which won't be available at prediction time
shots = shots.drop(columns = [
    'game_id',
    'matchup',
    'w',
    'final_margin',
    'closest_defender_player_id',
    'player_id',
    'shot_result',
    'closest_defender',
    'player_name',
    'pts'
])

#### Final dataset

Notice that I dropped categorical features like player name and defender name. I did so because I want to create a model that predicts a miss or a hit regardless of the player making the shot. In other words, I want to use game features, not "personal" features. 

In [None]:
shots.head(5)

In [None]:
shots.shape

## Visualizing data

Let's first visualize variables individually and then inspect their relationships.

In [None]:
def plot_distribution(col, bins=10):
    shots[col].plot.hist(bins=bins)
    plt.title(col, fontsize=16)
    plt.show()
    
def plot_relationship(x, y):
    sns.boxplot(data = shots, y = y, x = x)
    plt.title('{} vs {}'.format(x, y), fontsize=16)
    plt.show()
    
def show_frequency(x, y):
    joint = pd.crosstab(shots[x], shots[y], margins = True)
    joint = joint / joint.loc['All', 'All']
    display(joint)
    
def plot_scatter(x, y):
    sns.scatterplot(data = shots_scaled, x = x, y = y)
    plt.title('{} vs {}'.format(x, y), fontsize=16)
    plt.show()

### Distributions

In [None]:
ax = shots['hit'].replace({
    0:'Miss',
    1:'Hit'
}).value_counts().plot.bar(rot=0)
ax.set_title('hit', fontsize=16)
plt.show()

In [None]:
shots['hit'].replace({
    0:'Miss',
    1:'Hit'
}).value_counts() / shots.shape[0]

Notice that the dataset is almost balanced, so we can use metrics like accuracy and AUC to assess a model's performance.

In [None]:
ax = shots['home_match'].replace({
    0:'Away',
    1:'Home'
}).value_counts().plot.bar(rot=0)
ax.set_title('home_match', fontsize=16)
plt.show()

ax = shots['period'].value_counts().plot.bar(rot=0)
ax.set_title('period', fontsize=16)
plt.show()

plot_distribution('game_clock', bins=10)

- The distributions above are roughly uniform, except for the period. Periods beyond the fourth are overtime: if a game ends in a tie, extra periods are played until the match has a winner, so they are rare.
- Notice that, although the effect is small, the number of shot attempts decreases as the game progresses. One possible explanation is that players tend to play faster and take more risks early in the game.

In [None]:
ax = shots['3pts_shot'].replace({
    0:'2 points',
    1:'3 points'
}).value_counts().plot.bar(rot=0)
ax.set_title('3pts_shot', fontsize=16)
plt.show()

Notice that roughly one third of the shots are taken from behind the three-point line.

In [None]:
plot_distribution('shot_dist', bins=25)

The shot distance is probably bimodal because of the three-point line. It lies 23 ft 9 in (7.24 m) from the basket at the top of the arc and 22 ft (6.71 m) in the corners, which is very close to the second peak. Also notice that there are some outliers above 30 ft, which are probably desperation shots taken as the clock was running out.

In [None]:
plot_distribution('shot_clock', bins=20)

For me, the shot clock distribution is interesting. It is roughly normal and has a peak at the end, which may correspond to shots originating from rebounds.

In [None]:
plot_distribution('close_def_dist', bins=40)

The closest defender distance distribution tells us that most defenders are within 5 ft of the shooter.

In [None]:
plot_distribution('shot_number', bins=10)
plot_distribution('dribbles', bins=30)
plot_distribution('touch_time', bins=20)

These last histograms show how dynamic a basketball match is nowadays. Teams usually play together, exchanging lots of passes and keeping ball possession times low. From the point of view of a defensive player, it is much harder to counter an attack if the ball keeps changing direction all the time and you have to adjust your position frequently.

### Descriptive statistics

To finish, let's take a look at the means, percentiles and outliers of each distribution.

In [None]:
shots.describe()

### Features and target relationships

Now, let's check whether there is a relationship between the explanatory variables and our target. Below I show only the relevant relationships.

In [None]:
show_frequency('3pts_shot', 'hit')

The frequency table above shows that, when analyzing shots by type, 2-point shots are hits or misses with roughly the same probability, whereas 3-point shots are converted in only about one third of the attempts.

In [None]:
plot_relationship('hit', 'close_def_dist')

Notice how the distribution of the closest defender distance has many more outliers when the shot is a hit. Rebounds and counter-attacks could explain that.

In [None]:
plot_relationship('hit', 'shot_dist')
plot_relationship('hit', 'touch_time')
plot_relationship('hit', 'dribbles')

The three plots above tell us that hits typically have smaller touch times, shot distances and numbers of dribbles.

In [None]:
plot_relationship('hit', 'shot_clock')
plot_relationship('3pts_shot', 'close_def_dist')

Also, shots are converted more often when the defender is far from the shooter and the offensive team has more time left on the shot clock to develop the attack.

## Scaling data and filling missing values

Now we must scale the distributions and fill NaNs with placeholders in order to apply unsupervised learning techniques and further explore the data. In the following sections, we are going to apply K-Means and PCA, which are sensitive to measurement units. Ideally, all features should be on a comparable scale so that the algorithm gives them the same importance. In other words, we don't want a PCA component to be dominated by one feature just because its measurement unit produces larger values than the others.

Below I apply transformations to make the variances of the variables more similar. Also, depending on the distribution, I fill missing values with the mean or the median, which are measures of central tendency.
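To make the difference between the three kinds of scaling concrete, here is a minimal NumPy sketch on a made-up column with one extreme value. The array `values` is purely illustrative and the formulas are my reading of what standardization, robust scaling and min-max scaling do by default.

```python
import numpy as np

# Toy column with one extreme value (hypothetical data, not from the shot logs)
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization: subtract the mean, divide by the standard deviation
standardized = (values - values.mean()) / values.std()

# Robust scaling: subtract the median, divide by the interquartile range
q1, q3 = np.percentile(values, [25, 75])
robust = (values - np.median(values)) / (q3 - q1)

# Min-max scaling: map the observed range onto [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

print(standardized.round(2))
print(robust.round(2))
print(min_max.round(2))
```

Because the median and the interquartile range barely react to the extreme value, robust scaling keeps the bulk of the points spread out, which is why it is a reasonable choice for skewed columns.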

In [None]:
shots_scaled = shots.copy()

# Standardization
shots_scaled[['shot_clock']] = preprocessing.StandardScaler().fit_transform(shots[['shot_clock']].values)

# Robust scaling
skewed_cols = ['shot_number', 'dribbles', 'touch_time', 'close_def_dist']
shots_scaled[skewed_cols] = preprocessing.RobustScaler().fit_transform(shots[skewed_cols].values)
    
# Min max transformation
min_max_cols = ['period', 'game_clock', 'shot_dist']
shots_scaled[min_max_cols] = preprocessing.MinMaxScaler().fit_transform(shots[min_max_cols].values)

# Filling NaNs with the mean (shot_clock) or the median (touch_time)
shots_scaled['shot_clock'] = shots_scaled['shot_clock'].fillna(shots_scaled['shot_clock'].mean())
shots_scaled['touch_time'] = shots_scaled['touch_time'].fillna(shots_scaled['touch_time'].median())
shots['shot_clock'] = shots['shot_clock'].fillna(shots['shot_clock'].mean())
shots['touch_time'] = shots['touch_time'].fillna(shots['touch_time'].median())

## Analyzing correlations

Let's analyze our features based on the Pearson correlation. These coefficients range from -1 to 1 and tell us whether a pair of variables has a linear relationship. For two variables X and Y:

- 1 indicates that Y increases as X increases, fitting a line perfectly.
- -1 indicates that Y decreases as X increases, also fitting a line perfectly.
- 0 indicates that the variables don't have a linear relationship.
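The three cases above can be checked on toy data. This is a small sketch with a hand-written `pearson` helper (a name I introduce for illustration); the last case shows that a coefficient of 0 only rules out a linear relationship, not a relationship in general.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

x = np.arange(-5, 6)  # -5, ..., 5, symmetric around zero

r_pos = pearson(x, 2 * x + 1)    # perfect increasing line
r_neg = pearson(x, -3 * x + 4)   # perfect decreasing line
r_sq = pearson(x, x ** 2)        # strong but purely non-linear relationship

print(r_pos, r_neg, r_sq)
```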

In [None]:
x_corr = shots_scaled[['shot_number', 'shot_clock', 'dribbles', 'touch_time', 'shot_dist', 'close_def_dist', 'period', 'game_clock']].corr()
corr_mask = np.zeros_like(x_corr, dtype=bool)  # built-in bool; the np.bool alias is missing from some NumPy releases
corr_mask[np.tril_indices_from(corr_mask, k=0)] = True

sns.heatmap(x_corr, mask = corr_mask, annot=True)
plt.show()

Some coefficients are higher than 0.5. Let's take a look at their scatter plots. The explanations for these relationships are fairly straightforward; for example, the number of dribbles increases as the touch time (how long the shooter holds the ball) increases.

In [None]:
plot_scatter('dribbles', 'touch_time')

In [None]:
plot_scatter('shot_dist', 'close_def_dist')

In [None]:
plot_scatter('period', 'shot_number')

## Principal component analysis

Principal component analysis is a dimensionality reduction algorithm that tries to find linear combinations of variables that explain as much variance as possible. It is very useful when we have correlated variables and want to reduce them to a single variable, which makes the dataset less redundant and complex.

PCA works in steps, creating components sequentially. In each step it finds a linear combination of features, which defines a new feature, whose variance is the maximum possible. From the second component on, each component is required to be uncorrelated with the preceding ones. An alternative interpretation is that when we find a component with maximum variance, we are actually finding a line in the feature space such that the projections onto it have the maximum variance, and uncorrelated components correspond to perpendicular lines. So, when we apply PCA we find a new set of axes that better describes the dataset.
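The geometric picture can be verified on synthetic data. The sketch below does not use scikit-learn's `PCA`; it computes the components from the eigendecomposition of the covariance matrix, which is one standard way to obtain them, and checks that the new axes are perpendicular and the projected features are uncorrelated. The data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two strongly correlated features plus an independent one
a = rng.normal(size=500)
X = np.column_stack([a, 0.9 * a + 0.1 * rng.normal(size=500), rng.normal(size=500)])

# Center the data and diagonalize its covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order

# Sort the components by decreasing variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the new axes
projected = Xc @ eigvecs
explained_ratio = eigvals / eigvals.sum()

print(explained_ratio.round(3))
print(np.cov(projected, rowvar=False).round(3))  # approximately diagonal
```

The covariance matrix of `projected` is diagonal up to rounding, with the eigenvalues on the diagonal: each new feature is uncorrelated with the others and its variance is the variance explained by that component.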

In [None]:
# Fitting PCA and showing PVE
pca = PCA(random_state=42).fit(shots_scaled.drop('hit', axis=1).values)
pve = pca.explained_variance_ratio_

In [None]:
plt.plot(range(1, len(pve) + 1), pve.cumsum())
plt.title('Cumulative sum of explained variance', fontsize=16)
plt.xlabel('Components')
plt.ylabel('Cumulative explained variance ratio')
plt.show()

One way to select the number of components is to analyze the amount of variance the components explain relative to the total variance. The graph above shows the cumulative variance explained as components are added. From the fourth component on, the curve doesn't increase as much as before, creating an elbow. We can stop there, arguing that adding more components won't lead to a much better result.
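Reading an elbow off a plot is subjective. A common alternative, which is not what I did here, is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. A minimal sketch, where both the helper name and the ratios are hypothetical:

```python
import numpy as np

def components_for_threshold(pve, threshold):
    """Smallest number of components whose cumulative explained variance reaches the threshold.

    Assumes the threshold is actually reached by the cumulative sum.
    """
    cumulative = np.cumsum(pve)
    return int(np.argmax(cumulative >= threshold) + 1)

# Hypothetical explained-variance ratios, not the ones from this dataset
pve_example = np.array([0.50, 0.25, 0.15, 0.07, 0.03])

n_components = components_for_threshold(pve_example, threshold=0.80)
print(n_components)
```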

In [None]:
pca_components = pd.DataFrame(
    pca.components_[:4,:].T,
    index = shots.columns[:-1]
)
sns.heatmap(pca_components, annot=True)

Now it is also useful to inspect the coefficients of the linear combinations that generated each component. We can see that:

- The first component is mostly about dribbles and touch time, which are correlated. We can interpret it as a variable describing how the player acted before the shot.
- The second one is dominated by the shot clock. It tells us whether the shot involved teamwork.
- The third component can be described as a measure of the difficulty of the shot. It accounts for closest defender distance, shot distance and shot type.
- The last one describes the match situation, that is, whether the player is warmed up.

## Clustering with K-means

Applying K-Means is usually useful because it can show whether there are clusters of instances that we can analyze more deeply. It is an iterative technique that uses Euclidean distance and works better with spherical clusters. Its main hyperparameter is the number of clusters we want to find. Most of the time we don't know this number, so I ran the algorithm with several values.
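To show what "iterative" means here, this is a bare-bones sketch of the classic assign-then-update loop (Lloyd's algorithm) in NumPy. It is a simplification: the initial centroids are passed in by hand, whereas library implementations choose them with smarter randomized schemes, and the function name and toy blobs are my own.

```python
import numpy as np

def kmeans(X, init_centroids, n_iter=20):
    """Plain K-Means: alternate between assigning points and recomputing centroids."""
    centroids = np.array(init_centroids, dtype=float)
    k = len(centroids)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid, shape (n, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    # Final assignment and inertia (sum of squared distances to the closest centroid)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centroids, float(d2.min(axis=1).sum())

rng = np.random.default_rng(1)
# Two well-separated toy blobs of 50 points each
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(50, 2)),
])

labels, centroids, inertia = kmeans(X, init_centroids=X[[0, -1]])
_, _, inertia_one_cluster = kmeans(X, init_centroids=X[[0]])

print(centroids.round(2))
print(inertia, inertia_one_cluster)
```

The inertia drops sharply when going from one cluster to two because the data really has two groups. On real data the drop is usually much less clear-cut.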

In [None]:
n_clusters = range(2, 41)
kms, inertias = [], []
for n in tqdm(n_clusters):
    km = KMeans(n_clusters=n, random_state=42)
    km.fit(shots_scaled.drop('hit', axis=1).values)
    kms.append(km)
    inertias.append(km.inertia_)

In [None]:
# Checking the within-cluster variation.
plt.plot(n_clusters, inertias)
plt.title('Within-cluster variance', fontsize=16)
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.show()

The graph above is typically the first option to assess K-Means performance. The y-axis is the sum of squared distances between each point and its respective centroid. It always decreases as clusters are added, so we need to find the point where adding a new cluster no longer leads to a much better segmentation. In this case, that point can't be clearly seen.

## Splitting the dataset

To start modelling, we first need to split the data into train and test sets. I will also create two versions of the features: the original one and the PCA-reduced one.

In [None]:
# Splitting original dataset
X = shots.drop('hit', axis = 1).values
y = shots['hit'].values
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

# Splitting scaled and transformed by PCA dataset
X_scaled = pca.transform(shots_scaled.drop('hit', axis = 1).values)[:,:4]
x_train_scaled, x_test_scaled, _, _ = train_test_split(X_scaled, y, test_size = 0.2, random_state=42)

## Decision Tree

First I will use a decision tree due to its interpretability and ease of use. The algorithm's core idea is to split the feature space into smaller regions so that each region is as pure as possible. In other words, we want each region to contain mostly one class. These splits can be interpreted as rules that divide the space in two. For example, we can split the data according to whether the shot distance is greater or lower than 15 ft.

Since this algorithm sequentially splits the feature space into two regions, we can think of it as a binary tree where each internal node represents a split (i.e. a rule) and the leaves represent regions of the space where the purity should be higher. In classification tasks, the training set is used to figure out these rules and create good leaves. Then, to predict an instance, we follow the rules and use the training points contained in the resulting leaf to make a prediction, which is usually the most frequent class among these points.
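As an illustration of how a single rule can be chosen, the sketch below scores every candidate threshold on one feature by the weighted Gini impurity of the two resulting regions and keeps the purest split. The helper names and the six toy shots are made up, and real implementations are considerably more optimized.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of binary labels: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    p = float(np.mean(labels))
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def best_threshold(feature, labels):
    """Return (threshold, weighted impurity) of the purest split 'feature <= threshold'."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    best = (None, gini(labels))            # start from the impurity without any split
    values = np.unique(feature)
    for t in (values[:-1] + values[1:]) / 2:   # midpoints between consecutive values
        left, right = labels[feature <= t], labels[feature > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (float(t), weighted)
    return best

# Hypothetical shots: close ones go in, far ones miss
shot_dist = [2, 4, 5, 18, 22, 25]
hit = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_threshold(shot_dist, hit)
print(gini(hit), threshold, impurity)
```

A tree repeats this search over all features at every node, always taking the split that reduces impurity the most.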

If we let the tree grow indefinitely, we would end up overfitting the training set. That's because each subsequent split is more specific and ends up fitting the noise in the data instead of capturing the general patterns. So it is necessary to regularize the tree by controlling its depth, i.e., the maximum number of splits between the root and a leaf. Another option is to use cost-complexity pruning, but controlling the depth is easier and more efficient.

In this part I'm going to implement a function to train and test the decision tree using cross-validation. Then I will save the results and report them using a graph to see whether our model has overfitted.

In [None]:
def decision_tree_cv(x, y, folds=5):
    cv = KFold(folds, random_state=42, shuffle=True)
    depths = list(range(1, 101))
    scores = np.zeros((len(depths), folds, 2, 2)) #depth, fold, split, metric
    
    for id_split, array_idxs in tqdm(enumerate(cv.split(x))):
        train_index, val_index = array_idxs[0], array_idxs[1]
        x_train, x_val = x[train_index], x[val_index]
        y_train, y_val = y[train_index], y[val_index]
        
        for depth in depths:
            clf = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(x_train, y_train)
            scores[depth - 1, id_split, 0, 0] = clf.score(x_train, y_train)
            scores[depth - 1, id_split, 1, 0] = clf.score(x_val, y_val)
            scores[depth - 1, id_split, 0, 1] = roc_auc_score(y_train, clf.predict_proba(x_train)[:,1])
            scores[depth - 1, id_split, 1, 1] = roc_auc_score(y_val, clf.predict_proba(x_val)[:,1])
            
    return scores

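def report_cv(scores):
    # Average over folds: the shape becomes (depth, split, metric)
    mean_scores = scores.mean(1)
    depths = list(range(1, mean_scores.shape[0] + 1))

    # Plot the mean AUC (metric index 1) on the train and validation folds
    sns.lineplot(data = pd.DataFrame(
        mean_scores[:, :, 1],
        index = depths,
        columns = ['train', 'validation']
    ))
    plt.show()

    # The best model is the depth with the highest mean validation AUC;
    # both metrics are reported for that same depth
    best = mean_scores[:, 1, 1].argmax()

    print('Best model')
    print('***********')
    print('Mean validation accuracy: ', mean_scores[best, 1, 0])
    print('Mean validation AUC: ', mean_scores[best, 1, 1])
    print('Depth of the best model: ', best + 1)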

To evaluate each model, I'm going to use accuracy, AUC and cross validation with 5 folds.
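Accuracy depends on a decision threshold, while AUC only depends on how the predicted scores rank hits relative to misses. A small sketch of both, using the pairwise definition of AUC (the fraction of hit/miss pairs where the hit receives the higher score). The function names and the four toy predictions are illustrative.

```python
import numpy as np

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of correct predictions after thresholding the probabilities."""
    y_true = np.asarray(y_true).astype(bool)
    return float(np.mean((np.asarray(y_prob) >= threshold) == y_true))

def auc_pairwise(y_true, y_prob):
    """AUC as the probability that a random positive is ranked above a random negative."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive/negative pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

acc = accuracy(y_true, y_prob)
auc = auc_pairwise(y_true, y_prob)
print(acc, auc)
```

The pairwise version is quadratic in the number of samples, so it is only meant to explain the metric; `roc_auc_score` should be used on real data.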

### Original dataset

In [None]:
scores = decision_tree_cv(x_train, y_train)

In [None]:
report_cv(scores)

### Dataset transformed by PCA

In [None]:
scores_pca = decision_tree_cv(x_train_scaled, y_train)

In [None]:
report_cv(scores_pca)

## Interpreting the decision tree

In this section we inspect which variables are the most important for the prediction and visualize the beginning of the decision tree to see how it works. Before we do that, I will fit a decision tree to the entire training set using the depth of the best model.

In [None]:
depth_best_model = scores.mean(1)[:, 1, 1].argmax() + 1
clf = DecisionTreeClassifier(max_depth=depth_best_model, random_state=42).fit(x_train, y_train)

### Gini importance

First, let's take a look at the feature importances.

In [None]:
importances = pd.Series(clf.feature_importances_, index=shots.columns[:-1]).sort_values(ascending=False)

sns.barplot(x = importances.values, y = importances.index, orient='h', palette='Reds_r')
plt.title('Gini importance of features', fontsize=16)
plt.show()

The Gini importance of a feature is the normalized total reduction in impurity achieved by the splits that use this variable. It is called Gini importance because the metric used to measure region impurity during training is the Gini index.

As we can see, there are 5 very important variables:

- Shot distance
- Closest defender distance
- Touch time
- Shot clock
- Game clock

However, shot distance and closest defender distance are by far the most important variables for making a good prediction. The remaining 5 variables reduced impurity very little, which indicates that we could drop them and fit the model with just the first 5 variables. This could lead to faster predictions, less noise and maybe better performance (generalization).

### Permutation importance

Now, let's evaluate the feature importances using a method called permutation importance. In order to assess the importance of a feature X:

1. Choose a dataset to evaluate the model and a metric. In our case, I will use the training set and AUC.
2. Evaluate the model performance on this dataset.
3. Build a new feature set where X is randomly permuted and the other variables are kept unchanged.
4. Evaluate again the model on the modified dataset.
5. The importance is the difference in performance between the original and the modified dataset.

This procedure is repeated a number of times with different random permutations (`n_repeats` in the code) to generate a distribution of importances for each variable, which gives an idea of their variability.

The intuition behind this method is that permuting an important feature causes great damage to the model's performance, while permuting a less important feature doesn't. That is, higher values mean higher importance.
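The steps above can be written out by hand in a few lines. This sketch uses a fixed, hypothetical "model" that only looks at the first column and accuracy as the metric, so the expected outcome is easy to reason about. It is meant to mirror the procedure, not to replace scikit-learn's `permutation_importance`.

```python
import numpy as np

def permutation_importance_manual(predict, X, y, metric, n_repeats=10, seed=0):
    """Drop in the metric after shuffling each column, repeated n_repeats times."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))                    # step 2: score on the intact data
    importances = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, col] = rng.permutation(X_perm[:, col])   # step 3: shuffle one column
            importances[col, rep] = baseline - metric(y, predict(X_perm))  # steps 4 and 5
    return importances

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)        # only the first column determines the target

def predict(X):
    # Hypothetical already-fitted model that uses the first column only
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

imp = permutation_importance_manual(predict, X, y, accuracy)
print(imp.mean(axis=1))
```

Shuffling the first column brings the accuracy down to roughly chance level, while shuffling the second column changes nothing because the model never reads it.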

In [None]:
perm_importances = permutation_importance(
    clf,
    x_train,
    y_train,
    scoring = 'roc_auc',
    n_repeats = 10,
    n_jobs = -1
)
df_perm_importances = pd.DataFrame(perm_importances.importances, index=shots.columns[:-1]).T
df_perm_importances = df_perm_importances.melt(
    value_vars = df_perm_importances.columns,
    var_name = 'feature',
    value_name = 'importance'
)

plt.figure(figsize=(10,5))
sns.boxplot(data = df_perm_importances, x = 'importance', y = 'feature')
plt.title('Permutation importance using AUC (train set)', fontsize=16)
plt.show()

The rank of the first four variables didn't change between methods, but the shot type (2 or 3 points) seems to have higher importance according to the permutation method.

### Visualizing the tree

Below you can see a representation of the decision tree. Since the tree has 7 levels, it is too deep to be visualized in full, so we only show the first 2 levels (not counting the root node).

In [None]:
plt.figure(figsize=(14,10))
plot_tree(
    clf,
    feature_names = shots.columns[:-1], 
    class_names = ['Miss', 'Hit'],
    filled = True,
    impurity = False,
    max_depth = 2
)
plt.show()

Of these first 7 nodes, 6 split the feature space using shot distance and closest defender distance, which are the most important features according to the graphs above. Also, it seems like the left side of the tree is predicting a hit while the right side predicts a miss. Let's visualize the entire tree to see the leaf colors, not the content.

In [None]:
plot_tree(
    clf,
    feature_names = shots.columns[:-1], 
    class_names = ['Miss', 'Hit'],
    filled = True,
    impurity = False
)
plt.show()

Indeed, most leaves at the left side predict a hit, while the right side predicts a miss.

## Random Forest

Now that we have a baseline for our problem using a decision tree, it's time to use more powerful models. But before we explore random forests, it's important to understand the weaknesses of decision trees. A very common problem is that they easily overfit. We tried to avoid this behaviour by controlling the depth of the tree, but sometimes that's not enough. More sophisticated ways of controlling the decision tree are cost-complexity pruning or tuning other hyperparameters. The problem is that fitting a decision tree with this added complexity can become time-consuming. Another problem is that slight changes in the dataset can drastically change the structure of the tree. In other words, decision trees usually have high variance.

Random forests were proposed to address some of these issues. A random forest is composed of a set of decision trees, each one independently fitted on a resampled version of the training set. The main idea is to use bootstrapping to decrease the variance of the prediction by "averaging" the results of the trees (in classification, this is a majority vote). In addition, in order to make the individual trees even more different from each other, random forests only consider a limited, randomly chosen subset of features at each split. To sum up, random forests offer two advantages:

- They create variability by using a different (resampled) dataset for each tree.
- They reduce the impact of very important or correlated features, allowing the trees to explore different ways of splitting the feature space.
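Both ingredients can be sketched by hand with plain decision trees. The example below uses synthetic data from `make_classification`, draws one bootstrap sample per tree, limits the features considered at each split with `max_features='sqrt'`, and combines the trees with a hard majority vote. As far as I know, scikit-learn's `RandomForestClassifier` averages predicted probabilities instead, so this is a simplification of the idea and not a reimplementation of the library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (not the shot logs)
X, y = make_classification(n_samples=600, n_features=9, n_informative=4, random_state=0)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

rng = np.random.default_rng(0)
trees = []
for i in range(25):  # odd number of trees, so the vote can never tie
    # Bootstrap: sample the training set with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Each split only looks at a random subset of sqrt(n_features) features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Majority vote across the trees
votes = np.stack([t.predict(X_test) for t in trees])   # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

forest_acc = float(np.mean(forest_pred == y_test))
single_acc = float(np.mean(trees[0].predict(X_test) == y_test))
print(forest_acc, single_acc)
```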

Important hyperparameters of the model are the number of trees and the number of features considered at each split, but it's always good to also control the depth of the trees or the minimum sample size in each split or leaf to avoid overfitting. In this case, I'm going to set the number of trees to 100, which is probably high enough to get a good set of different trees, and I'm going to consider $\sqrt{p}$ variables at each split, where $p$ is the number of features, as recommended in the book An Introduction to Statistical Learning. Finally, I'm going to employ grid search to find the best configuration of the other hyperparameters.

To do so, I'm going to use scikit-learn's `GridSearchCV` class, which provides an easy interface to run a grid search with cross-validation.

In [None]:
param_grid = {
    'n_estimators': [100],
    'criterion': ['gini'],
    'max_depth': [5, 10, 15, 20, 25, 50],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [2, 5, 10, 15],
    'max_features': ['sqrt']
}

grid_search_cv = GridSearchCV(
    estimator = RandomForestClassifier(n_jobs = -1, random_state = 42),
    param_grid = param_grid,
    scoring = ['accuracy', 'roc_auc'],
    n_jobs = -1,
    refit = 'roc_auc',
    cv = 5,
    verbose = 3,
    return_train_score = True    
).fit(x_train, y_train)

In [None]:
scores = pd.DataFrame(grid_search_cv.cv_results_).sort_values('rank_test_roc_auc')

print('Best model')
print('***********')
print('Mean validation accuracy: ', scores.iloc[0, :]['mean_test_accuracy'])
print('Mean validation AUC: ', grid_search_cv.best_score_, '\n')
print('Best hyperparameters')
print('***********')
for param, val in grid_search_cv.best_params_.items():
    print(param + ':', val)

As I expected, accuracy and AUC improved, although not by much. Now let's see how our model performs on the train and test folds. Since we tuned 3 hyperparameters, we can't visualize all combinations at the same time. Thus I'm going to inspect each one individually and average the scores to get one score per parameter value.

In [None]:
def report_train_test(scores, param, name):
    # Select the score columns before averaging so that non-numeric columns are never aggregated
    df = scores.groupby(param)[['mean_train_roc_auc', 'mean_test_roc_auc']].mean()
    df.columns = ['Train', 'Test']
    df.index.name = name
    sns.lineplot(data = df)
    plt.ylabel('ROC AUC')
    plt.title('Tuning AUC based on {}'.format(name.lower()), fontsize = 14)
    plt.show()

In [None]:
report_train_test(scores, 'param_max_depth', 'Max depth')
report_train_test(scores, 'param_min_samples_leaf', 'Min samples on leaves')
report_train_test(scores, 'param_min_samples_split', 'Min samples on internal nodes')

Based on the graphs above, we can see that building trees deeper than 10 levels doesn't improve our model. Also, notice that increasing the minimum number of samples on the leaves slightly improves the test metric while it drastically decreases the train performance. When we increase this hyperparameter, we are also indirectly restricting the tree's height. The result is better generalization: the test AUC improves while the model fits the training set less closely.

I could probably get better results if I expanded the grid search, but with the current settings it already took over 45 minutes, which is a lot.

## Interpreting the random forest

When using random forests, there is an obvious accuracy-explainability trade-off. There is no easy way to visualize hundreds of trees, so we lose an appealing feature of the decision tree. But at least we can assess which variables are the most important using Gini importance or permutation importance.

In [None]:
perm_importances = permutation_importance(
    grid_search_cv.best_estimator_,
    x_train,
    y_train,
    scoring = 'roc_auc',
    n_repeats = 10,
    n_jobs = -1
)
df_perm_importances = pd.DataFrame(perm_importances.importances, index=shots.columns[:-1]).T
df_perm_importances = df_perm_importances.melt(
    value_vars = df_perm_importances.columns,
    var_name = 'feature',
    value_name = 'importance'
)

plt.figure(figsize=(10,5))
sns.boxplot(data = df_perm_importances, x = 'importance', y = 'feature')
plt.title('Permutation importance using AUC (train set)', fontsize=16)
plt.show()

The importance rank didn't change, but now we can see that variables that had very little importance in the decision tree, such as home match, shot number and period, are more important in the random forest. Also, shuffling a variable has a greater impact on AUC in the random forest than in the decision tree.

## Gradient Boosting Machine

Instead of fitting a number of trees in parallel, another way to improve decision trees is to chain them sequentially. The core idea of a boosting algorithm is to use a sequence of weak learners to gradually, step by step, learn a function that best approximates our target. Generally these weak learners are decision trees with a small number of leaves (small depth), which constrains the number of interactions between variables. For example, a tree with $d$ leaves has $d-1$ splits, so it uses at most $d-1$ variables.

In addition, each tree is fitted on the residuals of the function approximated by the trees before it. Notice that for this to work in a classification task, the trees must predict on the log-odds scale rather than predict classes. This allows each weak learner to focus on the areas where the previous learners didn't perform well. Thus, the final function is the sum of the predictions of all trees. However, we can do two more things to further improve boosting algorithms. First, we can multiply each tree's prediction by a shrinkage parameter to slow the process down even more. This parameter can be fixed and tuned using grid search, but in Gradient Boosting Machines it is found by optimization: in each boosting step we solve an optimization problem that minimizes a loss function depending on the shrinkage. Second, we can multiply the shrinkage parameter and the weak learner's prediction by a learning rate. This learning rate is fixed and should be fine-tuned.
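A stripped-down version of this loop fits in a few lines. The sketch below works on the log-odds scale with the log-loss, fits a shallow regression tree to the residuals `y - p` at every step and adds its prediction scaled by a fixed learning rate. It deliberately skips the per-step optimization of the shrinkage described above, so it is a simplification under my own assumptions and not what LightGBM does internally. The data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Synthetic binary classification problem (not the shot logs)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

learning_rate = 0.1
prior = y.mean()
# Start from a constant prediction: the log-odds of the positive class
F = np.full(len(y), np.log(prior / (1 - prior)))

trees = []
losses = [log_loss(y, sigmoid(F))]
for _ in range(50):
    # Residuals on the probability scale (the negative gradient of the log-loss w.r.t. F)
    residuals = y - sigmoid(F)
    # Weak learner: a shallow regression tree fitted to the residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add its (shrunken) prediction to the running log-odds
    F += learning_rate * tree.predict(X)
    trees.append(tree)
    losses.append(log_loss(y, sigmoid(F)))

print(losses[0], losses[-1])
```

The training loss keeps going down as trees are added, which is exactly why the number of trees has to be controlled with held-out data.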

Therefore, we typically have three essential parameters to fine-tune in order to avoid overfitting: the number of trees, the depth of the trees and the learning rate. If we want to get the best performance out of GBMs, we could also search for the best parameters related to the tree construction. Notice that if we let the algorithm build a very long sequence of learners, it will easily overfit. Instead of grid searching over the number of trees, we can use a validation set to assess when to stop adding models (early stopping).
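The stopping rule itself is simple: keep track of the best validation score so far and stop once it has not improved for a given number of rounds. A minimal sketch of that logic, with a hypothetical helper name and made-up validation AUCs:

```python
def early_stopping_round(val_scores, patience=5):
    """Return (best_round, best_score), stopping after `patience` rounds without improvement."""
    best_score, best_round = float('-inf'), 0
    for current_round, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_round = score, current_round
        elif current_round - best_round >= patience:
            break  # no improvement for `patience` rounds: stop adding trees
    return best_round, best_score

# Hypothetical validation AUC after each boosting round
val_auc = [0.60, 0.62, 0.63, 0.635, 0.634, 0.633, 0.634, 0.632, 0.631, 0.640]

best_round, best_score = early_stopping_round(val_auc, patience=5)
print(best_round, best_score)
```

Note that the search stops before the last value is ever seen: early stopping trades a possibly better score later on for a much shorter training time.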

To evaluate this model and find the best hyperparameter values, I'm going to use LightGBM and Optuna. The latter is an optimization package that implements a Bayesian approach to perform a guided search of the hyperparameter space.

In [None]:
class Light_GBM_CV:
    def __init__(self, x, y, folds=5, random_state=42):
        # Hold these implementation-specific arguments as fields of the class.
        self.x = x
        self.y = y
        self.folds = folds
        self.random_state = random_state

    def __call__(self, trial):
        cv = KFold(
            self.folds, 
            random_state = self.random_state, 
            shuffle=True
        )
        
        clf = LGBMClassifier(
            boosting_type = 'gbdt',
            objective = 'binary',
            metric = 'auc',
            random_state = self.random_state,
            num_leaves = trial.suggest_int('num_leaves', 16, 256),
            max_depth = trial.suggest_int('max_depth', 4, 8),
            learning_rate = trial.suggest_loguniform('learning_rate', 1e-5, 1.0),
            min_child_samples = trial.suggest_int('min_child_samples', 5, 100),
            n_estimators = trial.suggest_int('n_estimators', 10, 200),
            lambda_l1 = trial.suggest_loguniform('lambda_l1', 1e-5, 1.0),
            lambda_l2 = trial.suggest_loguniform('lambda_l2', 1e-5, 1.0),
            max_bin = trial.suggest_int('max_bin', 50, 256)
        )
        
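        scores = []

        # Fit on each training fold and score on the matching validation fold
        for train_index, val_index in cv.split(self.x):
            x_train, x_val = self.x[train_index], self.x[val_index]
            y_train, y_val = self.y[train_index], self.y[val_index]

            # Note: early_stopping_rounds/verbose in fit() need LightGBM < 4.0;
            # newer versions expect callbacks instead
            clf.fit(
                x_train, 
                y_train,
                eval_set = [(x_val, y_val)],
                early_stopping_rounds = 5,
                verbose = False
            )
            # Validation AUC at the best iteration found by early stopping
            scores.append(clf.best_score_['valid_0']['auc'])

        # Mean validation AUC across folds is the value Optuna maximizes
        return sum(scores) / self.folds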

In [None]:
lgbm_cv = Light_GBM_CV(x_train, y_train)
study = optuna.create_study(sampler=TPESampler(seed = 42), direction='maximize')
study.optimize(lgbm_cv, n_trials=1000)

In [None]:
print('Best model')
print('***********')
print('Mean validation AUC: ', study.best_value, '\n')
print('Best hyperparameters')
print('***********')
for param, val in study.best_params.items():
    print(param + ':', val)

Now that we have run hundreds of hyperparameter trials, we can use some visualizations provided by Optuna to see how the optimization worked and how the hyperparameters interact.

In [None]:
plot_optimization_history(study)

The plot above shows how, after some time, Optuna figured out the best parameters and kept exploring their neighborhood to try to improve the objective value (in our case the cross-validated AUC).

In [None]:
plot_slice(study)

The slice plot shows how each hyperparameter relates to the objective value. Notice how in the later trials the objective is closer to the maximum value.

In [None]:
plot_contour(study, params=['num_leaves', 'min_child_samples', 'n_estimators'])

Using contour plots we can see how pairs of hyperparameters interact. It's interesting to see that there are large regions where we can get a high AUC.

In [None]:
plot_param_importances(study)

With this graph it's clear that adjusting the learning rate is extremely important. As a side note, these parameter importances are calculated (by Optuna's default fANOVA evaluator) by fitting a random forest that uses the hyperparameters as features and the objective value as target.

## Interpreting the Light GBM

As with random forests, there is no easy way to visualize hundreds of stacked trees. But again we can compute the permutation importance. Before that, I'm going to fit a GBM to the entire training set with the best parameters given by Optuna.

In [None]:
lgbm_final_clf = LGBMClassifier(
    boosting_type = 'gbdt',
    objective = 'binary',
    metric = 'auc',
    random_state = 42,
    **study.best_params
)

lgbm_final_clf.fit(
    x_train, 
    y_train,
    eval_set = (x_test, y_test),
    early_stopping_rounds = 5
)

In [None]:
perm_importances = permutation_importance(
    lgbm_final_clf,
    x_test,
    y_test,
    scoring = 'roc_auc',
    n_repeats = 10,
    n_jobs = -1,
    random_state = 42  # make the shuffles reproducible
)
df_perm_importances = pd.DataFrame(perm_importances.importances, index=shots.columns[:-1]).T
df_perm_importances = df_perm_importances.melt(
    value_vars = df_perm_importances.columns,
    var_name = 'feature',
    value_name = 'importance'
)

plt.figure(figsize=(10,5))
sns.boxplot(data = df_perm_importances, x = 'importance', y = 'feature')
plt.title('Permutation importance using AUC (test set)', fontsize=16)
plt.show()
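
For reference, the permutation importance computed in the cell above can be sketched by hand with NumPy: shuffle one column at a time and measure how much the AUC drops. This is a simplified illustration and not scikit-learn's implementation; the data is synthetic, `score_fn` stands in for a fitted model, and the rank-based `auc` helper assumes there are no tied scores.

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative.
    Assumes no tied scores."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def permutation_importance_auc(score_fn, X, y, n_repeats=10, seed=0):
    """Drop in AUC after shuffling each column; shape (n_features, n_repeats)."""
    rng = np.random.default_rng(seed)
    baseline = auc(y, score_fn(X))
    importances = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            importances[j, r] = baseline - auc(y, score_fn(X_perm))
    return importances

# Synthetic data (an assumption for the demo): column 0 drives the outcome,
# column 1 is pure noise
rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([rng.uniform(0, 30, n), rng.normal(size=n)])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(1.0 - 0.1 * X[:, 0])))).astype(int)

def score_fn(X):
    # Stand-in "model": a shorter distance means a higher score; column 1 is ignored
    return -X[:, 0]

importances = permutation_importance_auc(score_fn, X, y)
print(importances.mean(axis=1))
```

A feature the model never uses gets an importance of exactly zero, while shuffling an informative feature pushes the AUC back toward 0.5.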

## Conclusions

After fitting three different models, it's clear that Light GBM is the winner. I got approximately 64.4% cross-validation AUC, and on the test set (which wasn't used for hyperparameter tuning) we have about 63.8% AUC. It's important to remember that 50% AUC is the threshold for random predictions, so after all our model isn't very good. But this is probably due to our restrictions on the variables used. If we had kept and explored things such as player performance in the game or in the tournament, we could have gotten a better result.

One takeaway from this study is that it doesn't matter how good or complex the model we are using is. Most of the accuracy comes from good predictive data; that is, if we don't have data with quality and predictive power, the gains from using more advanced models will be marginal.

**If you found this notebook interesting, please give it an upvote. Thanks for reading!**