# ATP Tennis Data - Model Tuning for Gradient Boosting


In our previous notebooks, we removed some columns from our feature set and found that it had little effect on our model.

In this notebook, we will use the same feature columns and tune our Gradient Boosting model to see if we can improve our performance.

### Before tuning

```
Model Score: 0.6903443619176233

ROC/AUC Score: 0.6903161608528401
              precision    recall  f1-score   support

        Loss       0.69      0.68      0.69      7381
         Win       0.69      0.70      0.69      7429

    accuracy                           0.69     14810
   macro avg       0.69      0.69      0.69     14810
weighted avg       0.69      0.69      0.69     14810
```


# Summary of Results




In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.decomposition import PCA
from datetime import datetime
import pickle
import json
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from util import jupyter_util as ju
from util.model_util import ModelWrapper, REPORT_FILE, RSTATE, N_JOBS, MAX_ITER, LABEL_COL

%matplotlib inline
sns.set()


# date
DATE_FORMAT = '%Y-%m-%d'
DATE = datetime.now().strftime(DATE_FORMAT)

DESCRIPTION = "ohe-reduced_history_matchup"


# update this
FEATURE_FILE = f'../datasets/atp_matches_1985-2019_features-ohe-history5-matchup5.csv'

START_YEAR = 1998
END_YEAR = 2018




In [4]:
X_train_orig, X_test_orig, y_train, y_test = ju.get_data(FEATURE_FILE, LABEL_COL, START_YEAR, END_YEAR)

Our dataset contains both historical data and matchup data. We will remove the columns we are not using from this dataset.

In [6]:
import re

def filter_features(data: pd.DataFrame):
    
    print(f'\nBefore: data.shape {data.shape}')

    new_features = data[["p1_rank", "p2_rank", "p1_seed", "p2_seed", "p1_history_games_won_percentage_diff", "p1_history_sets_won_percentage_diff", 
                        "p1_ht", "p2_ht", "p1_age", "p2_age", "p1_matchup_games_won_percentage", "p2_matchup_games_won_percentage",
                        "p1_history_wins_diff", "tourney_level_label", "p1_matchup_sets_won_percentage", "p2_matchup_sets_won_percentage",
                        "tourney_month", "round_label"]]
             
    surface_cols = [col for col in data.columns if re.match("surface_", col)]
    new_features = pd.concat([new_features, data[surface_cols]], axis=1)

    best_of_cols = [col for col in data.columns if re.match("best_of_", col)]
    new_features = pd.concat([new_features, data[best_of_cols]], axis=1)
             
    player_ioc_cols = [col for col in data.columns if re.match(r"(p1|p2)_ioc_", col)]
    new_features = pd.concat([new_features, data[player_ioc_cols]], axis=1)

    player_id_cols = [col for col in data.columns if re.match(r"(p1|p2)_[\d]+", col)]
    new_features = pd.concat([new_features, data[player_id_cols]], axis=1)

    


    print(f'After: data.shape {new_features.shape}')
    return new_features

X_train = filter_features(X_train_orig)
X_test = filter_features(X_test_orig)


Before: data.shape (44429, 5299)
After: data.shape (44429, 5044)

Before: data.shape (14810, 5299)
After: data.shape (14810, 5044)


In [7]:
print(f'Columns removed: {[col for col in X_test_orig.columns if col not in X_test.columns]}')

Columns removed: ['draw_size', 'tourney_year', 'p1_hand_l', 'p1_hand_r', 'p1_hand_u', 'p2_hand_l', 'p2_hand_r', 'p2_hand_u', 'tourney_id_0301', 'tourney_id_0308', 'tourney_id_0311', 'tourney_id_0314', 'tourney_id_0315', 'tourney_id_0316', 'tourney_id_0319', 'tourney_id_0321', 'tourney_id_0322', 'tourney_id_0328', 'tourney_id_0329', 'tourney_id_0337', 'tourney_id_0341', 'tourney_id_0352', 'tourney_id_0360', 'tourney_id_0375', 'tourney_id_0402', 'tourney_id_0407', 'tourney_id_0410', 'tourney_id_0414', 'tourney_id_0421', 'tourney_id_0424', 'tourney_id_0425', 'tourney_id_0429', 'tourney_id_0439', 'tourney_id_0451', 'tourney_id_0495', 'tourney_id_0496', 'tourney_id_0499', 'tourney_id_0500', 'tourney_id_0506', 'tourney_id_0533', 'tourney_id_0568', 'tourney_id_0605', 'tourney_id_0717', 'tourney_id_0741', 'tourney_id_0773', 'tourney_id_0891', 'tourney_id_1536', 'tourney_id_1720', 'tourney_id_201', 'tourney_id_215', 'tourney_id_224', 'tourney_id_2276', 'tourney_id_237', 'tourney_id_240', 'tourn

# Load our Best Estimator to give us a starting point

In [11]:
report = pd.read_csv(REPORT_FILE)
current_report = report[(report.model_name == 'GradientBoostingClassifier') &
                                  (report.description == "ohe-reduced_history_matchup")]
mw = ModelWrapper.get_model_wrapper_from_report(current_report)
mw.model.n_estimators_

96
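The value above is the number of boosting stages that were actually fitted. A minimal sketch of why this can be lower than the requested `n_estimators`, assuming the saved model was trained with early stopping (`n_iter_no_change`), as the grid search in this notebook is. The synthetic data and the value 500 are illustrative assumptions, not the notebook's dataset or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, used only to illustrate early stopping (an assumption, not ATP data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# With n_iter_no_change set, part of the training data is held out for validation
# and boosting stops once the validation score stops improving for 4 iterations
model = GradientBoostingClassifier(n_estimators=500, n_iter_no_change=4, random_state=0)
model.fit(X, y)

# n_estimators is the requested upper bound; n_estimators_ is what was actually fitted
print(f"requested: {model.n_estimators}, fitted: {model.n_estimators_}")
```

If early stopping triggers, `n_estimators_` is smaller than `n_estimators`; otherwise the two are equal.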

# Gradient Boosting - Grid Search 1

In [None]:
from sklearn.model_selection import GridSearchCV

dt = GradientBoostingClassifier(random_state=RSTATE, verbose=1, n_iter_no_change = 4)
parameters = {'n_estimators': [100, 200, 300],
              'min_samples_split': [2, 4, 8, 16],  
              'min_samples_leaf': [1, 2, 4, 8, 16], 
              'max_depth': [1, 3, 6, 12, 24], 
              'max_features': [None, 'sqrt', 'log2'],
            }
gscv = GridSearchCV(dt, parameters, cv=5, scoring='accuracy', verbose = 1, refit = True, n_jobs = N_JOBS)
gscv.fit(X_train, y_train)

Fitting 5 folds for each of 900 candidates, totalling 4500 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
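The candidate count in the log is the product of the number of values for each parameter. A short sketch that reproduces the arithmetic with scikit-learn's `ParameterGrid`, using the same grid as the cell above:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the search above
parameters = {
    "n_estimators": [100, 200, 300],
    "min_samples_split": [2, 4, 8, 16],
    "min_samples_leaf": [1, 2, 4, 8, 16],
    "max_depth": [1, 3, 6, 12, 24],
    "max_features": [None, "sqrt", "log2"],
}

# 3 * 4 * 5 * 5 * 3 parameter combinations
n_candidates = len(ParameterGrid(parameters))
# Each candidate is fitted once per cross-validation fold (cv=5)
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 900 4500
```

Counting the fits before launching the search is a cheap way to judge whether the grid is affordable.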


In [None]:
gscv.cv_results_.keys()

In [None]:
results_df = pd.DataFrame(gscv.cv_results_)
results_df.head()

In [None]:
results_df.iloc[gscv.best_index_].T

In [None]:
max_features_list = results_df["param_max_features"].unique()
replace_dict = { max_features_list[idx]: idx for idx in np.arange(len(max_features_list)) }
replace_dict

In [None]:
results_df["param_max_features"] = results_df["param_max_features"].replace(replace_dict)
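One alternative way to turn `max_features` into numeric codes for plotting is sketched below. It assumes the column holds `None`, `'sqrt'`, and `'log2'` as in the grid above; the small `results` frame and the `param_max_features_code` column name are hypothetical stand-ins for `results_df`.

```python
import pandas as pd

# Hypothetical stand-in for the param_max_features column of results_df
results = pd.DataFrame(
    {"param_max_features": pd.Series([None, "sqrt", "log2", None, "sqrt"], dtype=object)}
)

# Replace the missing value with an explicit label so it is treated as a category
labels = results["param_max_features"].fillna("None")

# factorize assigns integer codes in order of first appearance
codes, uniques = pd.factorize(labels)
results["param_max_features_code"] = codes

print(list(uniques))
print(results)
```

Keeping the codes in a separate column leaves the original parameter values available for later inspection.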

In [None]:
results_df.iloc[gscv.best_index_].T

In [None]:
gscv.best_score_

In [None]:
gscv.best_params_

In [None]:
gscv.best_index_

In [None]:
results_df[["param_n_estimators", "param_max_depth", "param_max_features", "param_min_samples_leaf", "param_min_samples_split", "mean_test_score"]].head()

## Grid Search Results

* XXX

In [None]:
f, a = plt.subplots(5, 1, figsize=(20, 25))

# (column in results_df, plot title) for each tuned hyperparameter
param_plots = [("param_max_depth", "Max Depth vs Test Score"),
               ("param_max_features", "Max Features vs Test Score"),
               ("param_min_samples_leaf", "Min Sample Leaf vs Test Score"),
               ("param_min_samples_split", "Min Sample Split vs Test Score"),
               ("param_n_estimators", "N_estimators vs Test Score")]

for idx, (column, title) in enumerate(param_plots):
    # parameter value per candidate on the left axis, mean CV score on the right axis
    ax = results_df[column].plot(ax=a[idx], legend=True)
    results_df["mean_test_score"].plot(ax=ax.twinx(), legend=True, color='r')
    # axvline takes ymin/ymax as axes fractions (0 to 1); the defaults span the full height
    ax.axvline(gscv.best_index_, c='orange', linewidth=5)
    ax.set_title(title)
    ax.grid(False)
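The plots above index the results by candidate row, which can make the effect of a single parameter hard to read. A hedged alternative is to average the cross-validation score over all candidates that share a parameter value. The `cv_results` frame below is a small hypothetical stand-in for `pd.DataFrame(gscv.cv_results_)`, and its scores are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for pd.DataFrame(gscv.cv_results_); scores are invented
cv_results = pd.DataFrame({
    "param_max_depth": [1, 1, 3, 3, 6, 6],
    "param_n_estimators": [100, 200, 100, 200, 100, 200],
    "mean_test_score": [0.60, 0.62, 0.66, 0.68, 0.64, 0.66],
})

# Average score across all candidates that share the same max_depth
by_depth = cv_results.groupby("param_max_depth")["mean_test_score"].mean()

# Parameter value with the highest average score
best_depth = by_depth.idxmax()

print(by_depth)
print(best_depth)
```

This summarizes the marginal effect of one parameter only; the best overall combination is still the one reported by `best_params_`.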



# We will do another Grid Search to fine tune some of our parameters

In [None]:
gscv.best_params_

In [None]:
from sklearn.model_selection import GridSearchCV

# Second, narrower search over the two leaf/split parameters; the cells below depend on gscv2
estimator2 = GradientBoostingClassifier(random_state=RSTATE, verbose=1, n_iter_no_change = 4)
parameters = { 'min_samples_split': [26, 28, 30, 32, 34, 36, 38, 40, 50],  'min_samples_leaf': [3, 4, 5, 6, 7]}
gscv2 = GridSearchCV(estimator2, parameters, cv=5, scoring='accuracy', verbose = 1, refit = True, n_jobs = N_JOBS, return_train_score = True)
gscv2.fit(X_train, y_train)

In [None]:
results_df2 = pd.DataFrame(gscv2.cv_results_)
results_df2.head()

In [None]:
gscv2.best_score_

In [None]:
gscv2.best_index_

In [None]:
gscv2.best_params_

## Looks like our results are still the same as before, with min samples leaf at 4 and min samples split at 32

In [None]:
f, a = plt.subplots(2, 1, figsize=(20, 10))

# (column in results_df2, plot title) for the two parameters in the second search
param_plots2 = [("param_min_samples_leaf", "Min Sample Leaf vs Test Score"),
                ("param_min_samples_split", "Min Sample Split vs Test Score")]

for idx, (column, title) in enumerate(param_plots2):
    # parameter value per candidate on the left axis, mean CV score on the right axis
    ax = results_df2[column].plot(ax=a[idx], legend=True)
    results_df2["mean_test_score"].plot(ax=ax.twinx(), legend=True, color='r')
    # axvline takes ymin/ymax as axes fractions (0 to 1); the defaults span the full height
    ax.axvline(gscv2.best_index_, c='orange', linewidth=5)
    ax.set_title(title)
    ax.grid(False)


## Let's compare our model accuracy against our test dataset

In [None]:
gscv2.best_params_

### Best Model from our Original Grid Search

In [None]:
mw = ModelWrapper(gscv.best_estimator_,
                description = DESCRIPTION, 
                 data_file = FEATURE_FILE,
                  start_year = START_YEAR,
                  end_year = END_YEAR,
                   X_train = X_train,
                   y_train = y_train,
                   X_test = X_test,
                   y_test = y_test)
y_predict_dt = mw.predict()
mw.analyze()

## Model from our 2nd Grid search

Looks like there is a slight improvement in accuracy compared to our first grid search.

Although recall for Losses dropped by 1%, precision for Losses increased by 1%, meaning that we identify slightly fewer of the actual losses, but when we do predict a loss it is more likely to be correct.

Win precision decreased by 1% while recall for Wins increased by 1%, meaning that we identify more of the actual wins, but a larger share of our win predictions are false positives.

In [None]:
mw2 = ModelWrapper(gscv2.best_estimator_,
                description = DESCRIPTION, 
                 data_file = FEATURE_FILE,
                  start_year = START_YEAR,
                  end_year = END_YEAR,
                   X_train = X_train,
                   y_train = y_train,
                   X_test = X_test,
                   y_test = y_test)
y_predict_dt2 = mw2.predict()
mw2.analyze()
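The precision and recall trade-off discussed for this model can be checked by hand on a toy example. The sketch below uses made-up labels (not the notebook's predictions) and scikit-learn's `precision_score` and `recall_score`, with the class of interest passed as `pos_label`.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only: 4 actual losses and 4 actual wins
y_true = ["Loss", "Loss", "Loss", "Loss", "Win", "Win", "Win", "Win"]
# The model predicts "Loss" only twice, and both of those predictions are correct
y_pred = ["Loss", "Loss", "Win", "Win", "Win", "Win", "Win", "Win"]

# Precision: of the matches predicted as a class, how many truly belong to it
# Recall: of the matches truly in a class, how many were predicted as it
loss_precision = precision_score(y_true, y_pred, pos_label="Loss")  # 2 / 2
loss_recall = recall_score(y_true, y_pred, pos_label="Loss")        # 2 / 4
win_precision = precision_score(y_true, y_pred, pos_label="Win")    # 4 / 6
win_recall = recall_score(y_true, y_pred, pos_label="Win")          # 4 / 4

print(loss_precision, loss_recall, win_precision, win_recall)
```

Here a cautious model has perfect Loss precision but misses half of the losses, while every win is found at the cost of two false positives.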

## Saving off our model in case we need it later

In [None]:
mw2.save()
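How `ModelWrapper.save()` stores the model is not shown in this notebook. As a rough sketch of one common approach, a fitted estimator can be written to disk and restored with `pickle`. The synthetic data, the file name, and the temporary directory are assumptions made for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data and a small model, used only to demonstrate the round trip
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name
    path = os.path.join(tmp_dir, "gradient_boosting.pkl")
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickled models should be loaded with the same scikit-learn version that created them, and only from trusted sources.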