## Full Data Set Modeling (Clay Court Version)


In the previous stage (see Workbook 5; Modeling_ClaySurface), four models were evaluated:

* Linear Regression and three tree-based ensemble models: Random Forest, Gradient Boosting Regressor and HistGradient Boosting Regressor. The latter was still experimental in scikit-learn at the time of writing. It is an ensemble algorithm that is very fast relative to standard gradient boosting models and performs consistently well on heterogeneous data sets. Boosting, generally, refers to a class of ensemble learning algorithms that add tree models to an ensemble sequentially, with each new tree fit to correct the errors of the ensemble built so far.
* Of these four models, the best by training set cross-validation error (RMSE) was GradientBoostingRegressor.
    * The best model resulted in an RMSE (STD) of 5.84% (0.09%) for training set cross-validation and 5.87% for the test set.
* Here, this best model is refit on the full data set (2012-2019, with a threshold of 20 prior matches per player; 2009-2011 was additionally used before the modeling stage for stats accrual/feature generation).
* See the Intro and Summary of Findings sections of Workbook 5; Modeling_ClaySurface for details on modeling and prior stages, as well as for interpretation of findings and proposed next steps.
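
To make the "add trees sequentially" idea concrete, here is a minimal sketch of boosting under squared-error loss, written in plain Python with one-split trees (stumps) on a toy one-dimensional problem. It is an illustration only, not scikit-learn's implementation; the helper names (`fit_stump`, `boost`) and the toy data are assumptions made for this example.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        left_val = sum(left) / len(left)
        right_val = sum(right) / len(right)
        sse = (sum((r - left_val) ** 2 for r in left)
               + sum((r - right_val) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_val, right_val)
    return best[1:]  # (threshold, left value, right value)


def predict_stump(stump, xi):
    thr, left_val, right_val = stump
    return left_val if xi <= thr else right_val


def boost(x, y, n_stages=50, learning_rate=0.1):
    """Start from the mean, then add one stump per stage, each fit to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    mse_by_stage = [mse(y, pred)]
    for _ in range(n_stages):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        pred = [pi + learning_rate * predict_stump(stump, xi)
                for pi, xi in zip(pred, x)]
        mse_by_stage.append(mse(y, pred))
    return base, stumps, mse_by_stage


# Toy data (hypothetical): a noiseless sine curve.
x_toy = [i / 10 for i in range(40)]
y_toy = [math.sin(xi) for xi in x_toy]
base, stumps, mse_by_stage = boost(x_toy, y_toy)
print(f"training MSE: start={mse_by_stage[0]:.4f}, end={mse_by_stage[-1]:.4f}")
```

Each stage only sees what the ensemble so far gets wrong, which is why the training error shrinks as stages are added. A smaller `learning_rate` makes each step more cautious and typically calls for more stages.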


### Imports

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.model_selection import cross_validate

### Load Best Model

In [2]:
expected_model_version = '1.0'
model_path = '../models/tennis_CC_model.pkl'
if os.path.exists(model_path):
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    if model.version != expected_model_version:
        print("Expected model version doesn't match version loaded")
    if model.sklearn_version != sklearn_version:
        print("Warning: model created under different sklearn version")
else:
    print("Expected model not found")
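
The load cell relies on metadata (`version`, `sklearn_version`, and later `X_columns`) having been attached to the model object before it was pickled. The sketch below shows one way such a round trip might look. It uses a `SimpleNamespace` as a hypothetical stand-in for the fitted pipeline so that it runs with the standard library alone; the helper names and the example version strings are assumptions, not the code used in Workbook 5.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


def save_with_metadata(model, path, version, library_version, feature_names):
    """Attach bookkeeping attributes to the model object, then pickle it."""
    model.version = version
    model.sklearn_version = library_version
    model.X_columns = list(feature_names)
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_checked(path, expected_version, current_library_version):
    """Load a pickled model and collect any version mismatches as messages."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    problems = []
    if getattr(model, "version", None) != expected_version:
        problems.append("model version mismatch")
    if getattr(model, "sklearn_version", None) != current_library_version:
        problems.append("library version mismatch")
    return model, problems


# Hypothetical stand-in for the fitted pipeline.
stub_model = SimpleNamespace(name="stub")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stub_model.pkl")
    save_with_metadata(stub_model, path, "1.0", "0.0.0", ["p_rank", "p_age"])
    loaded, problems = load_checked(path, "1.0", "0.0.0")
    _, stale_problems = load_checked(path, "2.0", "9.9.9")

print(problems)        # no mismatches
print(stale_problems)  # both mismatches reported
```

Storing the feature list alongside the model is what lets a later notebook select `X` with `model.X_columns` and be sure the columns match those used at training time.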

### Load Data

In [3]:
# This is the file and analysis data range used for the main clay court analysis
df = pd.read_csv('../data/df_player_all_2009to2019.csv')
df.head()

| | p_pts_won% | p_sv_pts_won% | p_ret_pts_won% | p_ace% | p_aced% | p_bp_save% | p_bp_convert% | t_id | t_date | tour_wk | ... | p_tot_pts_last_diff | p_tot_pts_l6_diff | p_tot_pts_l6_decay_diff | p_matches_diff | p_matches_surf_diff | p_stam_adj_fatigue_diff | p_stam_adj_fatigue_decay_diff | p_H2H_diff | p_H2H_pts_won%_diff | m_outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47.84 | 56.28 | 38.41 | 1.09 | 3.66 | 55.0 | 43.75 | 2019-560 | 20190826 | 2019_24 | ... | -0.0 | -0.0 | -0.0 | -243.0 | -109.0 | 179.240506 | 134.43038 | -0.0 | | 0 |
| 1 | 41.29 | 49.4 | 31.94 | 4.82 | 15.28 | 33.33 | 40.0 | 2019-M014 | 20191014 | 2019_29 | ... | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | | 0 |
| 2 | 37.23 | 51.85 | 17.5 | 3.7 | 5.0 | 33.33 | 0.0 | 2019-M004 | 20190225 | 2019_07 | ... | -0.0 | -0.0 | -0.0 | -25.0 | -24.0 | 133.80531 | 100.353982 | -0.0 | | 0 |
| 3 | 59.14 | 69.23 | 46.34 | 3.85 | 4.88 | 100.0 | 37.5 | 2019-7696 | 20191105 | 2019_33 | ... | 9.0 | 49.0 | 46.9 | -79.0 | -77.0 | 109.423559 | 90.917141 | 0.0 | | 1 |
| 4 | 53.66 | 70.77 | 34.48 | 7.69 | 3.45 | 88.89 | 37.5 | 2019-7696 | 20191105 | 2019_33 | ... | 53.0 | 44.0 | 42.5 | -17.0 | -20.0 | 55.064126 | 48.178912 | 0.0 | | 1 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50356 entries, 0 to 50355
Columns: 160 entries, p_pts_won% to m_outcome
dtypes: float64(138), int64(16), object(6)
memory usage: 61.5+ MB


### Refit Best Model on All Available Data Used in Train-Test Split in Previous Model Construction

In [5]:
# This is the data range used for the main clay court analysis: 2012-2019 in the model; 2009-2011 used only as additional stats accrual "runway"
df_filter = df[~df['tour_wk'].str.contains("2009")] 
df_filter = df_filter[~df_filter['tour_wk'].str.contains("2010")]
df_filter = df_filter[~df_filter['tour_wk'].str.contains("2011")]
#df_filter = df_filter[~df_filter['tour_wk'].str.contains("2012")]
#df_filter = df_filter[~df_filter['tour_wk'].str.contains("2013")]
#df_filter = df_filter[~df_filter['tour_wk'].str.contains("2014")]
#df_filter = df_filter[~df_filter['tour_wk'].str.contains("2015")]
#df_filter = df_filter[~df_filter['tour_wk'].str.contains("2016")]
#df_filter = df_filter[~df_filter['tour_wk'].str.contains("2017")]
#df_filter = df_filter[~df_filter['tour_wk'].str.contains("2018")]
#df_filter = df_filter[~df_filter['tour_wk'].str.contains("2019")]

In [6]:
# Filter down to only matches played on clay courts
df_filter2 = df_filter.loc[(df_filter["t_surf"] == 1)]
#df_filter2 = df_filter.loc[(df_filter["t_surf"] == 1) & (df_filter["p_matches_surf"] > 50)]

In [7]:
# Also remove BOTH players' rows for any match remaining in the surface-specific, year-filtered sample
# in which one or both players had played 20 or fewer prior matches (p_matches_surf) before the match being predicted.
df_low = df_filter2.loc[df_filter2['p_matches_surf'] <= 20, 'm_num']
df_filter3 = df_filter2[~df_filter2['m_num'].isin(df_low)]
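
The two-step filter above can be easier to follow on a toy frame. The sketch below assumes, as the comment implies, that each match appears as two rows (one per player) sharing the same `m_num`; the toy values and the `MIN_PRIOR_MATCHES` name are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): three matches, two rows each (one per player).
toy = pd.DataFrame({
    "m_num": [101, 101, 102, 102, 103, 103],
    "p_matches_surf": [35, 12, 40, 28, 20, 55],
})

MIN_PRIOR_MATCHES = 20  # same threshold as in the cell above

# Step 1: collect the match IDs where at least one player is at or below the threshold.
low_matches = toy.loc[toy["p_matches_surf"] <= MIN_PRIOR_MATCHES, "m_num"]

# Step 2: drop every row belonging to those matches, so both players are removed together.
kept = toy[~toy["m_num"].isin(low_matches)]

print(kept)
```

Match 101 is dropped because one player has 12 prior matches, and match 103 is dropped because the comparison is `<=`, so exactly 20 also counts as too few. Only match 102, where both players exceed the threshold, survives with both of its rows.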

In [8]:
# Pare down to just the predictive features (both raw and player-opponent differentials for the match being predicted) and the target itself (player % of points won in the match being predicted).
# All features are derived from data available prior to the match being predicted, so there is no data leakage.
df_model1 = df_filter3[["p_pts_won%", "t_indoor", "t_alt", "t_ace%_last", "t_lvl", "t_draw_size", "t_rd_num", "m_best_of", "p_rank", "p_log_rank", "p_rank_pts", "p_ent", "p_hd", "p_ht", "p_age", "p_matches", "p_matches_surf", "p_H2H_w", "p_H2H_pts_won%", "p_pts_won%_l60_decay", "p_pts_won%_l60_decay_IO", "p_pts_won%_l10", "p_SOS_adj_pts_won%_l60_decay", "p_SOS_adj_pts_won%_l60_decay_IO", "p_SOS_adj_pts_won%_l60_decay_IO_weighted", "p_SOS_adj_pts_won%_l10", "p_sv_pts_won%_l60_decay", "p_sv_pts_won%_l10", "p_SOS_adj_sv_pts_won%_l60_decay", "p_SOS_adj_sv_pts_won%_l10", "p_ret_pts_won%_l60_decay", "p_ret_pts_won%_l10", "p_SOS_adj_ret_pts_won%_l60_decay", "p_SOS_adj_ret_pts_won%_l10", "p_ace%_l60_decay", "p_ace%_l10", "p_SOS_adj_ace%_l60_decay", "p_SOS_adj_ace%_l10", "p_aced%_l60_decay", "p_aced%_l10", "p_SOS_adj_aced%_l60_decay", "p_SOS_adj_aced%_l10", "p_bp_save%_l60", "p_bp_save%_l10", "p_SOS_adj_bp_save%_l60", "p_SOS_adj_bp_save%_l10", "p_bp_convert%_l60", "p_bp_convert%_l10", "p_SOS_adj_bp_convert%_l60", "p_SOS_adj_bp_convert%_l10", "p_pts_won%_std_l60_decay","p_sv_pts_won%_std_l60_decay", "p_ret_pts_won%_std_l60_decay","p_m_time_last", "p_tot_time_l6", "p_tot_time_l6_decay", "p_tot_pts_last", "p_tot_pts_l6", "p_tot_pts_l6_decay", "p_stamina_adj_fatigue", "p_stamina_adj_fatigue_decay", "high_t_ace_p_ace", "high_t_ace_p_aced", "p_opp_rank_diff", "p_opp_log_rank_diff", "p_opp_rank_pts_diff", "p_ent_diff", "p_opp_ht_diff", "p_opp_age_diff", "p_L_opp_R", "p_HCA_opp_N", "p_pts_won%_l60_decay_diff", "p_pts_won%_l60_decay_IO_diff", "p_SOS_adj_pts_won%_l60_decay_diff", "p_SOS_adj_pts_won%_l60_decay_IO_diff", "p_SOS_adj_pts_won%_l60_decay_IO_weighted_diff", "p_pts_won%_l10_diff", "p_SOS_adj_pts_won%_l10_diff", "p_sv_pts_won%_l60_decay_diff", "p_SOS_adj_sv_pts_won%_l60_decay_diff", "p_sv_pts_won%_l10_diff", "p_SOS_adj_sv_pts_won%_l10_diff", "p_ret_pts_won%_l60_decay_diff", "p_SOS_adj_ret_pts_won%_l60_decay_diff", "p_ret_pts_won%_l10_diff", "p_SOS_adj_ret_pts_won%_l10_diff", 
"p_sv_opp_ret_pts_won%_l60_decay_diff", "p_SOS_adj_sv_opp_ret_pts_won%_l60_decay_diff", "p_sv_opp_ret_pts_won%_l10_diff", "p_SOS_adj_sv_opp_ret_pts_won%_l10_diff", "p_ret_opp_sv_pts_won%_l60_decay_diff", "p_SOS_adj_ret_opp_sv_pts_won%_l60_decay_diff", "p_ret_opp_sv_pts_won%_l10_diff", "p_SOS_adj_ret_opp_sv_pts_won%_l10_diff", "p_ace%_l60_decay_diff", "p_SOS_adj_ace%_l60_decay_diff", "p_ace%_l10_diff", "p_SOS_adj_ace%_l10_diff", "p_aced%_l60_decay_diff", "p_SOS_adj_aced%_l60_decay_diff", "p_aced%_l10_diff", "p_SOS_adj_aced%_l10_diff", "p_ace%_opp_aced%_l60_decay_diff", "p_SOS_adj_ace%_opp_aced%_l60_decay_diff", "p_ace%_opp_aced%_l10_diff", "p_SOS_adj_ace%_opp_aced%_l10_diff", "p_aced%_opp_ace%_l60_decay_diff", "p_SOS_adj_aced%_opp_ace%_l60_decay_diff", "p_aced%_opp_ace%_l10_diff", "p_SOS_adj_aced%_opp_ace%_l10_diff", "p_bp_save%_l60_diff", "p_SOS_adj_bp_save%_l60_diff", "p_bp_save%_l10_diff", "p_SOS_adj_bp_save%_l10_diff", "p_bp_convert%_l60_diff", "p_SOS_adj_bp_convert%_l60_diff", "p_bp_convert%_l10_diff", "p_SOS_adj_bp_convert%_l10_diff", "p_bp_convert%_opp_bp_save%_l60_diff", "p_SOS_adj_bp_convert%_opp_bp_save%_l60_diff", "p_bp_convert%_opp_bp_save%_l10_diff", "p_SOS_adj_bp_convert%_opp_bp_save%_l10_diff", "p_bp_save%_opp_bp_convert%_l60_diff", "p_SOS_adj_bp_save%_opp_bp_convert%_l60_diff", "p_bp_save%_opp_bp_convert%_l10_diff", "p_SOS_adj_bp_save%_opp_bp_convert%_l10_diff", "p_pts_won%_std_l60_decay_diff", 'p_sv_pts_won%_std_l60_decay_diff','p_ret_pts_won%_std_l60_decay_diff', "p_m_time_last_diff", "p_tot_time_l6_diff", "p_tot_time_l6_decay_diff", "p_tot_pts_last_diff", "p_tot_pts_l6_diff", "p_tot_pts_l6_decay_diff", "p_matches_diff", "p_matches_surf_diff", "p_stam_adj_fatigue_diff", "p_stam_adj_fatigue_decay_diff", "p_H2H_diff", "p_H2H_pts_won%_diff"]] #all features

In [9]:
X = df_model1[model.X_columns]
y = df_model1["p_pts_won%"]

In [10]:
len(X), len(y)

(7214, 7214)

In [11]:
model.fit(X,y)

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                ('standardscaler', StandardScaler()),
                ('gradientboostingregressor',
                 GradientBoostingRegressor(learning_rate=0.04, max_depth=4,
                                           max_features=9, random_state=47))])
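
For readers without access to the pickled model, a pipeline with the same steps and hyperparameters as the repr above could be assembled roughly as follows. This is a sketch on synthetic data, not the Workbook 5 code: the data, the `_demo` names, and the use of `make_pipeline` (suggested by the auto-generated step names) are assumptions. It also shows that `cross_validate` accepts a list of scorers, so MAE and RMSE can be obtained from a single call.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 12 features, a few missing values.
rng = np.random.default_rng(47)
X_demo = rng.normal(size=(300, 12))
y_demo = 50 + 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(size=300)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Same steps and hyperparameters as shown in the fitted pipeline's repr.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    GradientBoostingRegressor(learning_rate=0.04, max_depth=4,
                              max_features=9, random_state=47),
)

# One call, two metrics: results are keyed as "test_<scorer name>".
scores = cross_validate(
    pipe, X_demo, y_demo, cv=5,
    scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
)
mae_demo = -scores["test_neg_mean_absolute_error"]
rmse_demo = -scores["test_neg_root_mean_squared_error"]
print(mae_demo.mean(), rmse_demo.mean())
```

Because the imputer and scaler live inside the pipeline, they are refit on the training portion of each fold, which keeps fold-level statistics from leaking into the held-out portion.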

#### Mean Absolute Error (MAE)

In [12]:
cv_results = cross_validate(model, X, y, scoring='neg_mean_absolute_error', cv=5)

In [13]:
cv_results['test_score']

array([-4.57841031, -4.61885104, -4.75321448, -4.57095525, -4.62657835])

In [14]:
mae_mean, mae_std = np.mean(-1 * cv_results['test_score']), np.std(-1 * cv_results['test_score'])
mae_mean, mae_std

(4.629601887157525, 0.06552068398569626)
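
scikit-learn's `neg_*` scorers return negated errors so that "higher is better" holds for every scorer, which is why the fold scores are multiplied by -1 before summarizing. The short sketch below reproduces the summary from the fold scores printed above using only the standard library; `pstdev` is the population standard deviation, matching the default (`ddof=0`) behavior of `np.std`.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross_validate output above (negated MAE).
fold_scores = [-4.57841031, -4.61885104, -4.75321448, -4.57095525, -4.62657835]

# Flip the sign to recover the errors themselves.
fold_mae = [-s for s in fold_scores]

mae_mean = mean(fold_mae)
mae_std = pstdev(fold_mae)  # population std, as np.std computes by default

print(f"MAE (STD): {mae_mean:.2f} ({mae_std:.2f})")
```

With only five folds, the sample standard deviation (`statistics.stdev`, or `np.std(..., ddof=1)`) would come out somewhat larger; either is defensible as long as the choice is stated.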

#### Root Mean Squared Error (RMSE)

In [15]:
cv_results2 = cross_validate(model, X, y, scoring='neg_root_mean_squared_error', cv=5)

In [16]:
cv_results2['test_score']

array([-5.81831453, -5.78797122, -5.98252974, -5.85974993, -5.79421397])

In [17]:
rmse_mean, rmse_std = np.mean(-1 * cv_results2['test_score']), np.std(-1 * cv_results2['test_score'])
rmse_mean, rmse_std

(5.848555878283586, 0.07157148373472538)

When refit on the full data set, the model's five-fold cross-validation RMSE (STD) was 5.85% (0.07%), very similar to the 5.84% (0.09%) seen with cross-validation on the training set and the 5.87% seen on the test set.
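
As a quick arithmetic check of that comparison, the sketch below recomputes the full-data RMSE summary from the fold scores above and measures its gap to the two figures reported for the previous stage. Using the fold-to-fold standard deviation as a yardstick is an informal choice made for this example, not a formal statistical test.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross_validate output above (negated RMSE).
full_cv_scores = [-5.81831453, -5.78797122, -5.98252974, -5.85974993, -5.79421397]
full_cv_rmse = [-s for s in full_cv_scores]

rmse_mean = mean(full_cv_rmse)
rmse_std = pstdev(full_cv_rmse)

# Figures reported for the previous stage in the introduction.
train_cv_rmse = 5.84
test_rmse = 5.87

gap_vs_train_cv = rmse_mean - train_cv_rmse
gap_vs_test = rmse_mean - test_rmse

print(f"full-data CV RMSE (STD): {rmse_mean:.2f} ({rmse_std:.2f})")
print(f"gap vs training CV: {gap_vs_train_cv:+.3f}")
print(f"gap vs test set:    {gap_vs_test:+.3f}")
```

Both gaps are smaller than the spread between individual folds, which is consistent with the statement that the refit model's error is essentially unchanged.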