## Overview of Feature Development - Clay Court

In the previous workbook (02_Data_Cleaning_Wrangling_ClayCourt), raw data were imported, cleaned, and expanded upon. In this workbook, an extensive set of features is developed from the raw data (currently years 2009-2019). At a high level:
* A target feature is created for each match to be predicted on (% of total points won by each player in that match)
* For each match to be predicted on, long-term player performance is assessed as a predictor of performance in the next match under similar conditions: mean features derived from match stats are accrued over the 60 matches preceding the match being predicted on.
    * This accrual is surface-specific (hard or clay courts, with additional features sensitive to indoor or outdoor match status)
    * These past 60-match features are also decay-weighted, such that matches closer in time to the match being predicted on are weighted more heavily than less recent matches. Both the 60-match horizon and the specific decay weighting currently employed have been derived ("optimized") empirically, based on feedback from model performance with different parameters.
    * Decay-weighted past performance features (over the last 60 surface matches relative to a given match to be predicted on) have also been adjusted for the relatively strong or weak past performance of the opponents faced during that 60-match stretch, relative to a theoretical schedule of opponents of average past performance.
        * This "Strength of Schedule" weighting makes a substantial improvement to prediction accuracy (I've determined this from running models with and without such adjustment). 
        * This weighting also respects time decay, i.e., the opponent-strength computation is subject to the same time-decay weighting, relative to the match being predicted on, as the player's own performance being adjusted.
* Apart from performance-based features, another set of features generated here deals with player stamina (derived from number of past matches played) and fatigue (points or minutes spent on court thus far in a given tournament). Stamina and fatigue are integrated into one "body battery" metric as well, which I've shown has more predictive value than considering fatigue or stamina in isolation. 
    * As with engineered performance features, fatigue features are fitted with empirically-derived, time decay-weighted curves
* A number of other features are derived in this workbook, covering head-to-head past performance between the two players in a given match to be predicted on, handedness, height, "home court advantage", and court speed estimation (based on the past year's aggregate ace performance at the same tournament) and potential effects relative to individual player profiles
* All player-level features generated in this workbook are formulated in both a "raw" version (i.e., without respect to the opponent in the match to be predicted on) and a "differential" version (i.e., WITH respect to the opponent in the match to be predicted on). Modeling shows that the differential versions of the features are MUCH more predictive than the raw versions, though including the raw versions as well does yield a slight improvement in target feature prediction.
* Currently, a data inclusion date range of 2009-2019 is delimited (this has been shaped by feedback from EDA and modeling). Matches played on grass were filtered out (too low a sample size), as were Davis Cup and Olympics matches (for the same reason, as well as for their "odd" contexts) and matches where one player withdrew (usually for injury reasons) either before the match or early on in the match; none of these were included in feature generation. Critically, matches filtered out beyond this point (see EDA and Modeling workbooks) WERE used for initial feature generation/accrual.
    * For modeling clay court tennis specifically, where the sample is much smaller (by ~3-4x) than for hard court tennis over the same time frame, the sample has been expanded by 3 additional years (to 2009-2019) relative to the hard court model. See reporting from the original Tennis Prediction Project for detailed analyses and discussion of this dichotomy.
* Finally, pains are taken throughout to avoid "data leakage", i.e., incorporating into predictive features any information that would not have been available before a given match being predicted on was played.
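
The leakage guard described in the last bullet can be illustrated on a toy series. This is a minimal sketch rather than the workbook's own code: the column names `p_id` and `p_pts_won%` mirror those used later in this workbook, but the values and the 3-match window are invented for illustration. Shifting the rolling mean by one row means each match only sees matches played strictly before it.

```python
import pandas as pd

# Toy data (invented values): one player's matches in chronological order
toy = pd.DataFrame({
    "p_id": [1, 1, 1, 1],
    "p_pts_won%": [50.0, 54.0, 46.0, 52.0],
})

# Rolling mean over up to the 3 previous matches; shift(1) excludes the
# current match, so the feature never contains the outcome being predicted
toy["prior_mean"] = toy.groupby("p_id")["p_pts_won%"].transform(
    lambda x: x.rolling(window=3, min_periods=1).mean().shift(1)
)
print(toy)
```

The first match has no history, so its feature is NaN; every later row is built only from earlier rows.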

### Imports

In [1]:
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

### Load Preprocessed Data

In [2]:
#Load preprocessed
df = pd.read_csv('../data/cleaned_data_for_FeatureDev_Clay.csv')

In [None]:
df.info()

In [None]:
df.head()

#### dataframe in by-match organization (first iteration)

### Target Feature Creation: % Pts Won By Player in a Given Match

In [3]:
# Creation of target feature for each player in a given match: proportionalizing points played in the match appropriately between the two players.
df["w_pts_won%"] = round(((df["w_1stWon"] + df["w_2ndWon"] + (df["l_svpt"] - (df["l_1stWon"] + df["l_2ndWon"])))/ (df["w_svpt"] + df["l_svpt"]))*100, 2)

# Loser % pts won is simply 100 - w % pts won (re-rounded to avoid floating-point residue from the subtraction)
df["l_pts_won%"] = (100 - df["w_pts_won%"]).round(2)

In [4]:
# Target broken down into serving and returning 

# Winner % Serve pts won
df["w_sv_pts_won%"] = round((df["w_1stWon"] + df["w_2ndWon"]) / df["w_svpt"]*100,2)

# Winner % Return pts won
df["w_ret_pts_won%"] =round(((df["l_svpt"] - (df["l_1stWon"] + df["l_2ndWon"]))/df["l_svpt"])*100,2)

# Loser % Serve pts won
df["l_sv_pts_won%"] = round((df["l_1stWon"] + df["l_2ndWon"]) / df["l_svpt"]*100,2)

# Loser % Return pts won
df["l_ret_pts_won%"] =round(((df["w_svpt"] - (df["w_1stWon"] + df["w_2ndWon"]))/df["w_svpt"])*100,2)

In [5]:
# Not a target feature, but useful for generating predictive features: Total Points Played In Match
df["tot_pts"] = df["l_svpt"] + df["w_svpt"]
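
As a sanity check on the target formulas above, here is a small worked example in plain Python. The box-score numbers are invented for illustration; only the arithmetic mirrors the cells above. A player's return points won are the opponent's serve points minus the serve points the opponent won.

```python
# Hypothetical box score for one match (numbers invented for illustration)
w_svpt, w_1stWon, w_2ndWon = 80, 40, 15   # winner: serve points played / won
l_svpt, l_1stWon, l_2ndWon = 90, 38, 14   # loser: serve points played / won

w_serve_won = w_1stWon + w_2ndWon               # points the winner won on serve
w_return_won = l_svpt - (l_1stWon + l_2ndWon)   # points the winner won on return
tot_pts = w_svpt + l_svpt                       # total points played in the match

w_pts_won_pct = round((w_serve_won + w_return_won) / tot_pts * 100, 2)
l_pts_won_pct = round(100 - w_pts_won_pct, 2)
w_sv_pct = round(w_serve_won / w_svpt * 100, 2)
w_ret_pct = round(w_return_won / l_svpt * 100, 2)
l_sv_pct = round((l_1stWon + l_2ndWon) / l_svpt * 100, 2)

print(w_pts_won_pct, l_pts_won_pct, w_sv_pct, w_ret_pct, l_sv_pct)
```

Note that the winner's return % and the loser's serve % are complements (they sum to 100), since both are computed over the loser's serve points.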

#### dataframe in by-player organization (first iteration)

In [None]:
df.info()

### Retrospective, Surface-Specific Performance Prediction Features by Player per Match

The goal in this section is to generate, for a given player relative to a given match to be played, backward-looking predictors of performance in the match to be predicted on. A number of early experiments I conducted with integration windows and decay weights (including feedback from EDA and simple modeling) led to the values currently used here, though more optimization will surely follow in iterations after complex modeling.

In [6]:
# Winner's view of each match: drop loser-only columns (the loser's id and serve/ace/break-point counts are kept as opponent stats)
df_winners = df.drop(["l_name", "l_rank", "l_rank_pts", "l_ioc", "l_ent", "l_hd", "l_ht", "l_age", "l_1stWon", "l_2ndWon", "l_SvGms", "AvgL_C_IP_NV", "PSL_O_IP_NV", "PSL_C_IP_NV", "l_pts_won%", "l_sv_pts_won%", "l_ret_pts_won%"], axis = 1)
df_winners["m_outcome"] = 1
# Loser's view of each match: drop winner-only columns in the same way
df_losers =  df.drop(["w_name", "w_rank", "w_rank_pts", "w_ioc", "w_ent", "w_hd", "w_ht", "w_age", "w_1stWon", "w_2ndWon", "w_SvGms", "AvgW_C_IP_NV", "PSW_O_IP_NV", "PSW_C_IP_NV", "w_pts_won%", "w_sv_pts_won%", "w_ret_pts_won%"], axis = 1)
df_losers["m_outcome"] = 0

In [None]:
df_winners.info()

In [None]:
df_losers.info()

In [7]:
# Split out winners and losers from by-match organization and concatenate into a per player organization
df_winners = df_winners.set_axis(["t_id", "t_date", "tour_day", "tour_wk", "t_name", "t_country", "t_surf", "t_indoor", "t_alt", "t_lvl", "t_draw_size", "m_num", "t_round", "t_rd_num", "m_best_of", "m_score", "m_time(m)", "p_id", "p_name","p_rank", "p_rank_pts", "p_country", "p_ent", "p_hd", "p_ht", "p_age", "p_svpt", "p_1stWon","p_2ndWon","p_SvGms","p_ace","p_bpSaved","p_bpFaced","opp_id","opp_svpt","opp_ace","opp_bpSaved","opp_bpFaced", "p_AVG_C_IP_NV", "p_PS_O_IP_NV", "p_PS_C_IP_NV", "p_pts_won%","p_sv_pts_won%","p_ret_pts_won%", "m_tot_pts", "m_outcome"], axis=1)
df_losers = df_losers.set_axis(["t_id", "t_date", "tour_day", "tour_wk", "t_name", "t_country", "t_surf", "t_indoor", "t_alt", "t_lvl", "t_draw_size", "m_num", "t_round", "t_rd_num", "m_best_of", "m_score", "m_time(m)", "opp_id", "opp_svpt","opp_ace","opp_bpSaved","opp_bpFaced","p_id","p_name","p_rank", "p_rank_pts", "p_country", "p_ent", "p_hd", "p_ht", "p_age", "p_svpt", "p_1stWon","p_2ndWon","p_SvGms", "p_ace","p_bpSaved","p_bpFaced","p_AVG_C_IP_NV", "p_PS_O_IP_NV", "p_PS_C_IP_NV", "p_pts_won%","p_sv_pts_won%","p_ret_pts_won%", "m_tot_pts", "m_outcome"], axis=1)
df_player1 = pd.concat([df_winners, df_losers], ignore_index=True)
df_player1 = df_player1.sort_values(by=['p_id','tour_wk','t_rd_num'], ascending = False)
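
The by-match to by-player reshape above relies on positional renaming with `set_axis`, which depends on the column order of the input file. As an illustration only (not the workbook's method), the same idea can be sketched with explicit `rename` mappings on a toy frame. The toy columns and values below are invented; `pd.concat` aligns on column names, so the two halves need not share a column order.

```python
import pandas as pd

# Toy by-match frame (invented values); w_ = winner columns, l_ = loser columns
matches = pd.DataFrame({
    "m_num": [1, 2],
    "w_id": [101, 103],
    "l_id": [102, 101],
    "w_pts_won%": [54.0, 51.5],
    "l_pts_won%": [46.0, 48.5],
})

# Winner's view: the winner becomes the player (p_), the loser the opponent (opp_)
winners = matches.rename(columns={
    "w_id": "p_id", "l_id": "opp_id", "w_pts_won%": "p_pts_won%",
}).drop(columns=["l_pts_won%"])
winners["m_outcome"] = 1

# Loser's view: roles swapped
losers = matches.rename(columns={
    "l_id": "p_id", "w_id": "opp_id", "l_pts_won%": "p_pts_won%",
}).drop(columns=["w_pts_won%"])
losers["m_outcome"] = 0

# One row per player per match
by_player = pd.concat([winners, losers], ignore_index=True)
print(by_player)
```

Each match now contributes two rows, one per player, which is the shape the rolling per-player aggregations below require.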

Generated below are a number of retrospective (relative to the match being predicted on) predictive performance features. Aggregations are surface-specific in this workstream (though some filtering components from the cross-surface workbook, used at this stage of analysis in the original Tennis Prediction Project, remain in the code).

In [8]:
#Sorting as such helps visually verify the complicated, backward-looking stat accrual calculations we will make below
df_player1 = df_player1.sort_values(by=['p_id','tour_wk','t_rd_num'], ascending = False)

In [None]:
df_player1.head(20)

In [9]:
# % total points won over up to the last 60 surface-specific matches for a given player prior to a match to be predicted on
# In EDA and modeling, we will require a minimum # of matches in the past relative to a match being predicted on FOR BOTH PLAYERS IN THE MATCH
# Therefore, we do not need to go to extremes to backfill NaNs here when a window to compute on doesn't meet the min period requirement.

df_player1 = df_player1.iloc[::-1]

df_player1['p_pts_won%_1to10'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player1['p_pts_won%_11to20'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player1['p_pts_won%_11to20'] = df_player1['p_pts_won%_11to20'].fillna(df_player1['p_pts_won%_1to10'])

df_player1['p_pts_won%_21to30'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player1['p_pts_won%_21to30'] = df_player1['p_pts_won%_21to30'].fillna(df_player1['p_pts_won%_11to20'])

df_player1['p_pts_won%_31to40'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player1['p_pts_won%_31to40'] = df_player1['p_pts_won%_31to40'].fillna(df_player1['p_pts_won%_21to30'])

df_player1['p_pts_won%_41to50'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player1['p_pts_won%_41to50'] = df_player1['p_pts_won%_41to50'].fillna(df_player1['p_pts_won%_31to40'])

df_player1['p_pts_won%_51to60'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player1['p_pts_won%_51to60'] = df_player1['p_pts_won%_51to60'].fillna(df_player1['p_pts_won%_41to50'])

#df_player1['p_pts_won%_61to70'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player1['p_pts_won%_61to70'] = df_player1['p_pts_won%_61to70'].fillna(df_player1['p_pts_won%_51to60'])

#df_player1['p_pts_won%_71to80'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(71))
#df_player1['p_pts_won%_71to80'] = df_player1['p_pts_won%_71to80'].fillna(df_player1['p_pts_won%_61to70'])

df_player1 = df_player1.iloc[::-1]
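
The block pattern in the cell above (a 10-match rolling mean, shifted back by 1, 11, 21, ... rows) repeats for every statistic in this workbook. The following is a sketch of how it could be factored into a helper; `prior_block_mean` is a hypothetical name, not a function from this project, and the toy data are invented. It assumes the frame is already sorted oldest-to-newest within each group, as the `iloc[::-1]` reversal above arranges.

```python
import pandas as pd

def prior_block_mean(df, col, keys, start, window=10, min_periods=3):
    """Mean of `col` over prior matches start..start+window-1 within each group.

    Assumes `df` is sorted oldest-to-newest within each group.
    """
    return df.groupby(keys)[col].transform(
        lambda x: x.rolling(window=window, min_periods=min_periods)
                   .mean().round(2).shift(start)
    )

# Toy data (invented): 25 clay matches for one player, with values 1..25
toy = pd.DataFrame({
    "p_id": 1,
    "t_surf": "Clay",
    "p_pts_won%": [float(v) for v in range(1, 26)],
})

keys = ["p_id", "t_surf"]
toy["b1to10"] = prior_block_mean(toy, "p_pts_won%", keys, start=1, min_periods=1)
toy["b11to20"] = prior_block_mean(toy, "p_pts_won%", keys, start=11)
# Fall back to the more recent block when the older block is not yet available
toy["b11to20"] = toy["b11to20"].fillna(toy["b1to10"])
print(toy.tail())
```

For the last row, the 1-to-10 block averages the values 15..24 and the 11-to-20 block averages 5..14, confirming that the two windows are adjacent and neither includes the current match.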

In [10]:
# Time-decay weighting the total pts won % by player result from above.
#  Core version
df_player1["p_pts_won%_l60_decay"] = (((df_player1['p_pts_won%_1to10'] * 14) + (df_player1['p_pts_won%_11to20'] * 8) + (df_player1['p_pts_won%_21to30'] * 5) 
+ (df_player1['p_pts_won%_31to40'] * 3) + (df_player1['p_pts_won%_41to50'] * 2) + (df_player1['p_pts_won%_51to60'] * 1))/33).round(2)

#Dropping the transient columns used for the decay calculations
df_player1.drop(["p_pts_won%_11to20","p_pts_won%_21to30","p_pts_won%_31to40","p_pts_won%_41to50","p_pts_won%_51to60"],axis=1, inplace=True)
#df_player1
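
To make the decay weighting concrete, here is a small worked example in plain Python. The weights (14, 8, 5, 3, 2, 1, summing to 33) are those used in the cell above; the six block means are invented for illustration.

```python
# Block weights from most recent (matches 1-10) to oldest (matches 51-60)
weights = [14, 8, 5, 3, 2, 1]

# Hypothetical block means for one player (invented values), same order
block_means = [55.0, 52.0, 50.0, 50.0, 48.0, 47.0]

# Decay-weighted mean versus a plain (unweighted) mean of the six blocks
decayed = round(sum(w * m for w, m in zip(weights, block_means)) / sum(weights), 2)
plain = round(sum(block_means) / len(block_means), 2)

# Share of the total weight carried by each block
shares = [round(w / sum(weights), 3) for w in weights]

print(decayed, plain, shares)
```

With these weights the most recent 10 matches carry roughly 42% of the total weight, so a player on a recent upswing (as in this toy example) gets a higher decayed value than the plain mean.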

In [11]:
# Short term total points won% performance
df_player1["p_pts_won%_l10"] = df_player1["p_pts_won%_1to10"]

df_player1.drop(["p_pts_won%_1to10"],axis=1, inplace=True)

In [12]:
# Total points won% over the last 60 surface-specific matches taking indoor vs outdoor into consideration

# In EDA and modeling, we will require a minimum # of matches in the past relative to a match being predicted on FOR BOTH PLAYERS IN THE MATCH
# Therefore, we do not need to go to extremes to backfill NaNs here when a window to compute on doesn't meet the min period requirement.

df_player1 = df_player1.iloc[::-1]

df_player1['p_pts_won%_1to10_IO'] = df_player1.groupby(['p_id','t_surf','t_indoor'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player1['p_pts_won%_11to20_IO'] = df_player1.groupby(['p_id','t_surf','t_indoor'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player1['p_pts_won%_11to20_IO'] = df_player1['p_pts_won%_11to20_IO'].fillna(df_player1['p_pts_won%_1to10_IO'])

df_player1['p_pts_won%_21to30_IO'] = df_player1.groupby(['p_id','t_surf','t_indoor'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player1['p_pts_won%_21to30_IO'] = df_player1['p_pts_won%_21to30_IO'].fillna(df_player1['p_pts_won%_11to20_IO'])

df_player1['p_pts_won%_31to40_IO'] = df_player1.groupby(['p_id','t_surf','t_indoor'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player1['p_pts_won%_31to40_IO'] = df_player1['p_pts_won%_31to40_IO'].fillna(df_player1['p_pts_won%_21to30_IO'])

df_player1['p_pts_won%_41to50_IO'] = df_player1.groupby(['p_id','t_surf','t_indoor'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player1['p_pts_won%_41to50_IO'] = df_player1['p_pts_won%_41to50_IO'].fillna(df_player1['p_pts_won%_31to40_IO'])

df_player1['p_pts_won%_51to60_IO'] = df_player1.groupby(['p_id','t_surf','t_indoor'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player1['p_pts_won%_51to60_IO'] = df_player1['p_pts_won%_51to60_IO'].fillna(df_player1['p_pts_won%_41to50_IO'])

#df_player1['p_pts_won%_61to70_IO'] = df_player1.groupby(['p_id','t_surf','t_indoor'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player1['p_pts_won%_61to70_IO'] = df_player1['p_pts_won%_61to70_IO'].fillna(df_player1['p_pts_won%_51to60_IO'])

#df_player1['p_pts_won%_71to80'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(71))
#df_player1['p_pts_won%_71to80'] = df_player1['p_pts_won%_71to80'].fillna(df_player1['p_pts_won%_61to70'])

df_player1 = df_player1.iloc[::-1]

In [13]:
# Time-decay weighting the total pts won % by player result from above.
df_player1["p_pts_won%_l60_decay_IO"] = (((df_player1['p_pts_won%_1to10_IO'] * 14) + (df_player1['p_pts_won%_11to20_IO'] * 8) + (df_player1['p_pts_won%_21to30_IO'] * 5) 
+ (df_player1['p_pts_won%_31to40_IO'] * 3) + (df_player1['p_pts_won%_41to50_IO'] * 2) + (df_player1['p_pts_won%_51to60_IO'] * 1))/33).round(2)

#Dropping the transient columns used for the decay calculations
df_player1.drop(["p_pts_won%_1to10_IO","p_pts_won%_11to20_IO","p_pts_won%_21to30_IO","p_pts_won%_31to40_IO","p_pts_won%_41to50_IO","p_pts_won%_51to60_IO"],axis=1, inplace=True)
#df_player1

In [14]:
# % SERVE points won over up to the last 60 surface-specific matches for a given player prior to a match to be predicted on
# In EDA and modeling, we will require a minimum # of matches in the past relative to a match being predicted on FOR BOTH PLAYERS IN THE MATCH
# Therefore, we do not need to go to extremes to backfill NaNs here when a window to compute on doesn't meet the min period requirement.

df_player1 = df_player1.iloc[::-1]

df_player1['p_sv_pts_won%_1to10'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player1['p_sv_pts_won%_11to20'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player1['p_sv_pts_won%_11to20'] = df_player1['p_sv_pts_won%_11to20'].fillna(df_player1['p_sv_pts_won%_1to10'])

df_player1['p_sv_pts_won%_21to30'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player1['p_sv_pts_won%_21to30'] = df_player1['p_sv_pts_won%_21to30'].fillna(df_player1['p_sv_pts_won%_11to20'])

df_player1['p_sv_pts_won%_31to40'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player1['p_sv_pts_won%_31to40'] = df_player1['p_sv_pts_won%_31to40'].fillna(df_player1['p_sv_pts_won%_21to30'])

df_player1['p_sv_pts_won%_41to50'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player1['p_sv_pts_won%_41to50'] = df_player1['p_sv_pts_won%_41to50'].fillna(df_player1['p_sv_pts_won%_31to40'])

df_player1['p_sv_pts_won%_51to60'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player1['p_sv_pts_won%_51to60'] = df_player1['p_sv_pts_won%_51to60'].fillna(df_player1['p_sv_pts_won%_41to50'])

#df_player1['p_sv_pts_won%_61to70'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player1['p_sv_pts_won%_61to70'] = df_player1['p_sv_pts_won%_61to70'].fillna(df_player1['p_sv_pts_won%_51to60'])

df_player1 = df_player1.iloc[::-1]


In [15]:
# Time-decay weighting the SERVE pts won % by player result from above.
df_player1["p_sv_pts_won%_l60_decay"] = (((df_player1['p_sv_pts_won%_1to10'] * 14) + (df_player1['p_sv_pts_won%_11to20'] * 8) + (df_player1['p_sv_pts_won%_21to30'] * 5) 
+ (df_player1['p_sv_pts_won%_31to40'] * 3) + (df_player1['p_sv_pts_won%_41to50'] * 2) + (df_player1['p_sv_pts_won%_51to60'] * 1))/33).round(2)

#Dropping the transient columns used for the decay calculations
df_player1.drop(["p_sv_pts_won%_11to20","p_sv_pts_won%_21to30","p_sv_pts_won%_31to40","p_sv_pts_won%_41to50","p_sv_pts_won%_51to60"],axis=1, inplace=True)


In [16]:
# Short term serve points won% performance
df_player1["p_sv_pts_won%_l10"] = df_player1["p_sv_pts_won%_1to10"]

df_player1.drop(["p_sv_pts_won%_1to10"],axis=1, inplace=True)

In [17]:
# % RETURN points won over up to the last 60 surface-specific matches for a given player prior to a match to be predicted on
# In EDA and modeling, we will require a minimum # of matches in the past relative to a match being predicted on FOR BOTH PLAYERS IN THE MATCH
# Therefore, we do not need to go to extremes to backfill NaNs here when a window to compute on doesn't meet the min period requirement.

df_player1 = df_player1.iloc[::-1]

df_player1['p_ret_pts_won%_1to10'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player1['p_ret_pts_won%_11to20'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player1['p_ret_pts_won%_11to20'] = df_player1['p_ret_pts_won%_11to20'].fillna(df_player1['p_ret_pts_won%_1to10'])

df_player1['p_ret_pts_won%_21to30'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player1['p_ret_pts_won%_21to30'] = df_player1['p_ret_pts_won%_21to30'].fillna(df_player1['p_ret_pts_won%_11to20'])

df_player1['p_ret_pts_won%_31to40'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player1['p_ret_pts_won%_31to40'] = df_player1['p_ret_pts_won%_31to40'].fillna(df_player1['p_ret_pts_won%_21to30'])

df_player1['p_ret_pts_won%_41to50'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player1['p_ret_pts_won%_41to50'] = df_player1['p_ret_pts_won%_41to50'].fillna(df_player1['p_ret_pts_won%_31to40'])

df_player1['p_ret_pts_won%_51to60'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player1['p_ret_pts_won%_51to60'] = df_player1['p_ret_pts_won%_51to60'].fillna(df_player1['p_ret_pts_won%_41to50'])

#df_player1['p_ret_pts_won%_61to70'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player1['p_ret_pts_won%_61to70'] = df_player1['p_ret_pts_won%_61to70'].fillna(df_player1['p_ret_pts_won%_51to60'])

df_player1 = df_player1.iloc[::-1]


In [18]:
# Time-decay weighting the RETURN pts won % by player result from above.
df_player1["p_ret_pts_won%_l60_decay"] = (((df_player1['p_ret_pts_won%_1to10'] * 14) + (df_player1['p_ret_pts_won%_11to20'] * 8) + (df_player1['p_ret_pts_won%_21to30'] * 5) 
+ (df_player1['p_ret_pts_won%_31to40'] * 3) + (df_player1['p_ret_pts_won%_41to50'] * 2) + (df_player1['p_ret_pts_won%_51to60'] * 1))/33).round(2)

#Dropping the transient columns used for the decay calculations
df_player1.drop(["p_ret_pts_won%_11to20","p_ret_pts_won%_21to30","p_ret_pts_won%_31to40","p_ret_pts_won%_41to50","p_ret_pts_won%_51to60"],axis=1, inplace=True)


In [19]:
# Short term return points won% performance
df_player1["p_ret_pts_won%_l10"] = df_player1["p_ret_pts_won%_1to10"]

df_player1.drop(["p_ret_pts_won%_1to10"],axis=1, inplace=True)

In [None]:
#Save to review
#df_player1.to_csv('../data/df_player1.csv', index=False)

In [20]:
# player ace% over up to the last 60 surface-specific matches for a given player prior to a match to be predicted on
# In EDA and modeling, we will require a minimum # of matches in the past relative to a match being predicted on FOR BOTH PLAYERS IN THE MATCH
# Therefore, we do not need to go to extremes to backfill NaNs here when a window to compute on doesn't meet the min period requirement.

# decay-weighted player ace % over up to the last 60 matches (surface-specific)
df_player1["p_ace%"] = ((df_player1["p_ace"]/df_player1["p_svpt"])*100).round(2)

df_player1 = df_player1.iloc[::-1]

df_player1['p_ace%_1to10'] = df_player1.groupby(['p_id','t_surf'])['p_ace%'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player1['p_ace%_11to20'] = df_player1.groupby(['p_id','t_surf'])['p_ace%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player1['p_ace%_11to20'] = df_player1['p_ace%_11to20'].fillna(df_player1['p_ace%_1to10'])

df_player1['p_ace%_21to30'] = df_player1.groupby(['p_id','t_surf'])['p_ace%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player1['p_ace%_21to30'] = df_player1['p_ace%_21to30'].fillna(df_player1['p_ace%_11to20'])

df_player1['p_ace%_31to40'] = df_player1.groupby(['p_id','t_surf'])['p_ace%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player1['p_ace%_31to40'] = df_player1['p_ace%_31to40'].fillna(df_player1['p_ace%_21to30'])

df_player1['p_ace%_41to50'] = df_player1.groupby(['p_id','t_surf'])['p_ace%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player1['p_ace%_41to50'] = df_player1['p_ace%_41to50'].fillna(df_player1['p_ace%_31to40'])

df_player1['p_ace%_51to60'] = df_player1.groupby(['p_id','t_surf'])['p_ace%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player1['p_ace%_51to60'] = df_player1['p_ace%_51to60'].fillna(df_player1['p_ace%_41to50'])

#df_player1['p_ace%_61to70'] = df_player1.groupby(['p_id','t_surf'])['p_ace%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player1['p_ace%_61to70'] = df_player1['p_ace%_61to70'].fillna(df_player1['p_ace%_51to60'])

df_player1 = df_player1.iloc[::-1]

In [21]:
# Time-decay weighting the ace% by player result from above.
df_player1["p_ace%_l60_decay"] = (((df_player1['p_ace%_1to10'] * 14) + (df_player1['p_ace%_11to20'] * 8) + (df_player1['p_ace%_21to30'] * 5) 
+ (df_player1['p_ace%_31to40'] * 3) + (df_player1['p_ace%_41to50'] * 2) + (df_player1['p_ace%_51to60'] * 1))/33).round(2)

#Dropping the transient columns used for the decay calculations
df_player1.drop(["p_ace%_11to20","p_ace%_21to30","p_ace%_31to40","p_ace%_41to50","p_ace%_51to60"],axis=1, inplace=True)

In [22]:
# Short term ace% performance
df_player1["p_ace%_l10"] = df_player1["p_ace%_1to10"]
df_player1.drop(["p_ace%_1to10"],axis=1, inplace=True)

In [23]:
# player aced% (as a returner) over up to the last 60 surface-specific matches for a given player prior to a match to be predicted on
# In EDA and modeling, we will require a minimum # of matches in the past relative to a match being predicted on FOR BOTH PLAYERS IN THE MATCH
# Therefore, we do not need to go to extremes to backfill NaNs here when a window to compute on doesn't meet the min period requirement.

df_player1["p_aced%"] = ((df_player1["opp_ace"]/df_player1["opp_svpt"])*100).round(2)

df_player1 = df_player1.iloc[::-1]

df_player1['p_aced%_1to10'] = df_player1.groupby(['p_id','t_surf'])['p_aced%'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player1['p_aced%_11to20'] = df_player1.groupby(['p_id','t_surf'])['p_aced%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player1['p_aced%_11to20'] = df_player1['p_aced%_11to20'].fillna(df_player1['p_aced%_1to10'])

df_player1['p_aced%_21to30'] = df_player1.groupby(['p_id','t_surf'])['p_aced%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player1['p_aced%_21to30'] = df_player1['p_aced%_21to30'].fillna(df_player1['p_aced%_11to20'])

df_player1['p_aced%_31to40'] = df_player1.groupby(['p_id','t_surf'])['p_aced%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player1['p_aced%_31to40'] = df_player1['p_aced%_31to40'].fillna(df_player1['p_aced%_21to30'])

df_player1['p_aced%_41to50'] = df_player1.groupby(['p_id','t_surf'])['p_aced%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player1['p_aced%_41to50'] = df_player1['p_aced%_41to50'].fillna(df_player1['p_aced%_31to40'])

df_player1['p_aced%_51to60'] = df_player1.groupby(['p_id','t_surf'])['p_aced%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player1['p_aced%_51to60'] = df_player1['p_aced%_51to60'].fillna(df_player1['p_aced%_41to50'])

#df_player1['p_aced%_61to70'] = df_player1.groupby(['p_id','t_surf'])['p_aced%'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player1['p_aced%_61to70'] = df_player1['p_aced%_61to70'].fillna(df_player1['p_aced%_51to60'])

df_player1 = df_player1.iloc[::-1]

In [24]:
# Time-decay weighting the aced% by player result from above.
df_player1["p_aced%_l60_decay"] = (((df_player1['p_aced%_1to10'] * 14) + (df_player1['p_aced%_11to20'] * 8) + (df_player1['p_aced%_21to30'] * 5) 
+ (df_player1['p_aced%_31to40'] * 3) + (df_player1['p_aced%_41to50'] * 2) + (df_player1['p_aced%_51to60'] * 1))/33).round(2)

#Dropping the transient columns used for the decay calculations
df_player1.drop(["p_aced%_11to20","p_aced%_21to30","p_aced%_31to40","p_aced%_41to50","p_aced%_51to60"],axis=1, inplace=True)

In [25]:
# Short term aced% performance
df_player1["p_aced%_l10"] = df_player1["p_aced%_1to10"]
df_player1.drop(["p_aced%_1to10"],axis=1, inplace=True)

In [None]:
#df_player1.to_csv('../data/df_player1.csv', index=False)

In [26]:
# player break point save % (as a server) over up to the last 60 matches (surface-specific)
# I played around with a weighted version of this, but it didn't work as well as unweighted just due to the rareness of the events

df_player1["p_bp_save%"] = ((df_player1["p_bpSaved"]/df_player1["p_bpFaced"])*100).round(2)
df_player1['p_bp_save%'] = df_player1['p_bp_save%'].fillna(100) #covers the cases where a player faced 0 break pts in the match

df_player1 = df_player1.iloc[::-1]
df_player1['p_bp_save%_l60'] = df_player1.groupby(['p_id','t_surf'])['p_bp_save%'].transform(lambda x: x.rolling(window=60, min_periods = 1).mean().round(2).shift(1))
df_player1 = df_player1.iloc[::-1]

In [27]:
# player break point save% (as a server) over the short term (last 10 matches) (surface-specific)
df_player1 = df_player1.iloc[::-1]
df_player1['p_bp_save%_l10'] = df_player1.groupby(['p_id','t_surf'])['p_bp_save%'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))
df_player1 = df_player1.iloc[::-1]

In [28]:
# player break point conversion % (as a returner) over up to the last 60 matches (surface-specific)
# I played around with a weighted version of this, but it didn't work as well as unweighted just due to the rareness of the events

df_player1["p_bp_convert%"] = ((1 - (df_player1["opp_bpSaved"]/df_player1["opp_bpFaced"]))*100).round(2)
df_player1['p_bp_convert%'] = df_player1['p_bp_convert%'].fillna(0) #covers the cases where a player generated 0 break point opportunities in a match

df_player1 = df_player1.iloc[::-1]
df_player1['p_bp_convert%_l60'] = df_player1.groupby(['p_id','t_surf'])['p_bp_convert%'].transform(lambda x: x.rolling(window=60, min_periods = 1).mean().round(2).shift(1))
df_player1 = df_player1.iloc[::-1]

In [29]:
# player break point convert% (as a returner) over the short term (last 10 matches) (surface-specific)
df_player1 = df_player1.iloc[::-1]
df_player1['p_bp_convert%_l10'] = df_player1.groupby(['p_id','t_surf'])['p_bp_convert%'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))
df_player1 = df_player1.iloc[::-1]
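
The `fillna` defaults in the break point cells above handle matches with zero break points, where the ratio is 0/0. A minimal sketch on invented toy data shows the behavior for the save % case; the convert % case works the same way with a fill value of 0.

```python
import pandas as pd

# Toy data (invented): break points saved / faced in three matches
toy = pd.DataFrame({
    "p_bpSaved": [3, 0, 0],
    "p_bpFaced": [5, 0, 2],
})

# 0/0 yields NaN in pandas rather than raising an error
ratio = toy["p_bpSaved"] / toy["p_bpFaced"] * 100

# A match with no break points faced is treated as a perfect save rate
toy["p_bp_save%"] = ratio.round(2).fillna(100)
print(toy)
```

One caveat of this approach: `fillna` cannot tell a 0/0 match apart from a row whose break point stats are missing altogether, so any such rows would also receive the default value.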

We can also use the "wisdom of the markets" from previous matches involving a given player to create putatively predictive features. In practice, we use the implied win probability (derived from averaged, vig-removed closing lines across a number of sportsbooks) to build these features.
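
The vig-removed implied probabilities (`p_AVG_C_IP_NV`) are computed upstream of this workbook, and the exact method is not shown here. As an illustration only, the sketch below assumes decimal odds and removes the bookmaker margin by proportional normalization, which is one common approach; the function name and the odds are hypothetical.

```python
def implied_probs_no_vig(odds_a, odds_b):
    """Convert two-way decimal odds to win probabilities that sum to 1.

    Assumption: the bookmaker margin is removed by proportional
    normalization; the upstream workbook may use a different method.
    """
    raw_a, raw_b = 1 / odds_a, 1 / odds_b
    overround = raw_a + raw_b  # exceeds 1 when the book carries a margin
    return raw_a / overround, raw_b / overround

# Hypothetical closing line: favourite at 1.50, underdog at 2.60
p_fav, p_dog = implied_probs_no_vig(1.50, 2.60)
print(round(p_fav, 3), round(p_dog, 3))
```

The raw inverse odds here sum to about 1.05, and after normalization the two probabilities sum to exactly 1.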

In [30]:
# Mean Implied Win Probability (based on Closing Line with vig removed across a number of sports books) in up to the last 60 surface-specific matches for a given player prior to a match to be predicted on
# In EDA and modeling, we will require a minimum # of matches in the past relative to a match being predicted on FOR BOTH PLAYERS IN THE MATCH
# Therefore, we do not need to go to extremes to backfill NaNs here when a window to compute on doesn't meet the min period requirement.

df_player1 = df_player1.iloc[::-1]

df_player1['p_IWP_1to10'] = df_player1.groupby(['p_id','t_surf'])['p_AVG_C_IP_NV'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player1['p_IWP_11to20'] = df_player1.groupby(['p_id','t_surf'])['p_AVG_C_IP_NV'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player1['p_IWP_11to20'] = df_player1['p_IWP_11to20'].fillna(df_player1['p_IWP_1to10'])

df_player1['p_IWP_21to30'] = df_player1.groupby(['p_id','t_surf'])['p_AVG_C_IP_NV'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player1['p_IWP_21to30'] = df_player1['p_IWP_21to30'].fillna(df_player1['p_IWP_11to20'])

df_player1['p_IWP_31to40'] = df_player1.groupby(['p_id','t_surf'])['p_AVG_C_IP_NV'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player1['p_IWP_31to40'] = df_player1['p_IWP_31to40'].fillna(df_player1['p_IWP_21to30'])

df_player1['p_IWP_41to50'] = df_player1.groupby(['p_id','t_surf'])['p_AVG_C_IP_NV'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player1['p_IWP_41to50'] = df_player1['p_IWP_41to50'].fillna(df_player1['p_IWP_31to40'])

df_player1['p_IWP_51to60'] = df_player1.groupby(['p_id','t_surf'])['p_AVG_C_IP_NV'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player1['p_IWP_51to60'] = df_player1['p_IWP_51to60'].fillna(df_player1['p_IWP_41to50'])

df_player1 = df_player1.iloc[::-1]
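
The bucket logic in the cell above (a 10-wide rolling mean shifted by 1, 11, 21, ...) can be checked on a toy series. This is a minimal sketch for a single hypothetical player in chronological order, where the value of each row is simply its match number; it omits the `groupby` and the `round(2)` used in the notebook.

```python
import pandas as pd

# Toy chronological series for one hypothetical player: value k is match k
s = pd.Series(range(1, 26), dtype=float)

# Mean over the 10 matches immediately before each row (matches 1-10 back)
w_1to10 = s.rolling(window=10, min_periods=1).mean().shift(1)

# Mean over matches 11-20 back: same 10-wide window, shifted 11 rows
w_11to20 = s.rolling(window=10, min_periods=3).mean().shift(11)

# Fall back to the more recent bucket when the older one is unavailable
w_11to20 = w_11to20.fillna(w_1to10)

# At the last row (match 25) the buckets cover matches 15..24 and 5..14
print(w_1to10.iloc[-1], w_11to20.iloc[-1])
```

Because every window is shifted by at least one row, the match being predicted on never contributes to its own features.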

In [31]:
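# Time-decay weighting of the per-bucket mean Implied Win Probability from above.
# The bucket weights (14, 8, 5, 3, 2, 1) sum to 33, so the result is a weighted mean that favors the most recent matches.
#  Core version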
df_player1["p_IWP_l60_decay"] = (((df_player1['p_IWP_1to10'] * 14) + (df_player1['p_IWP_11to20'] * 8) + (df_player1['p_IWP_21to30'] * 5) 
+ (df_player1['p_IWP_31to40'] * 3) + (df_player1['p_IWP_41to50'] * 2) + (df_player1['p_IWP_51to60'] * 1))/33).round(2)

#Dropping the transient columns used for the decay calculations
df_player1.drop(["p_IWP_11to20","p_IWP_21to30","p_IWP_31to40","p_IWP_41to50","p_IWP_51to60"],axis=1, inplace=True)
#df_player1
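
A worked example of the decay weighting, using the same bucket weights as the cell above. The bucket means are made-up numbers chosen to show how a recent upward trend pulls the weighted value above the plain average.

```python
# Bucket weights from the cell above, most recent bucket first
weights = [14, 8, 5, 3, 2, 1]

# Hypothetical bucket means of implied win probability
# (1-10, 11-20, 21-30, 31-40, 41-50, 51-60 matches back)
bucket_means = [0.70, 0.60, 0.55, 0.50, 0.50, 0.45]

decayed = sum(w * m for w, m in zip(weights, bucket_means)) / sum(weights)
plain = sum(bucket_means) / len(bucket_means)

print(sum(weights), round(decayed, 2), round(plain, 2))
```

With these weights the most recent 10 matches carry 14/33 (about 42%) of the total weight, while the oldest 10 carry 1/33 (about 3%).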

In [32]:
# Short term player mean Implied Win Percentage
df_player1["p_IWP_l10"] = df_player1["p_IWP_1to10"]

df_player1.drop(["p_IWP_1to10"],axis=1, inplace=True)

In previous model iterations I looked closely at whether variability in past performance, after accounting for strength of schedule, yielded useful predictive features. These features turned out to have very little impact on model performance, so I've dropped them for now but left the code below, commented out.

In [None]:
# Variability (standard deviation) in total pts won% over the previous 60 surface-specific matches (decay weighted)
#df_player1 = df_player1.iloc[::-1]

#df_player1['p_pts_won%_std_1to10'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 1).std().round(2).shift(1))

#df_player1['p_pts_won%_std_11to20'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(11))
#df_player1['p_pts_won%_std_11to20'] = df_player1['p_pts_won%_std_11to20'].fillna(df_player1['p_pts_won%_std_1to10'])

#df_player1['p_pts_won%_std_21to30'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(21))
#df_player1['p_pts_won%_std_21to30'] = df_player1['p_pts_won%_std_21to30'].fillna(df_player1['p_pts_won%_std_11to20'])

#df_player1['p_pts_won%_std_31to40'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(31))
#df_player1['p_pts_won%_std_31to40'] = df_player1['p_pts_won%_std_31to40'].fillna(df_player1['p_pts_won%_std_21to30'])

#df_player1['p_pts_won%_std_41to50'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(41))
#df_player1['p_pts_won%_std_41to50'] = df_player1['p_pts_won%_std_41to50'].fillna(df_player1['p_pts_won%_std_31to40'])

#df_player1['p_pts_won%_std_51to60'] = df_player1.groupby(['p_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(51))
#df_player1['p_pts_won%_std_51to60'] = df_player1['p_pts_won%_std_51to60'].fillna(df_player1['p_pts_won%_std_41to50'])

#df_player1 = df_player1.iloc[::-1]

In [None]:
# Time-decay weighting the variability (std) in total pts won % by player from above.

#df_player1["p_pts_won%_std_l60_decay"] = (((df_player1['p_pts_won%_std_1to10'] * 14) + (df_player1['p_pts_won%_std_11to20'] * 8) + (df_player1['p_pts_won%_std_21to30'] * 5) 
#+ (df_player1['p_pts_won%_std_31to40'] * 3) + (df_player1['p_pts_won%_std_41to50'] * 2) + (df_player1['p_pts_won%_std_51to60'] * 1))/33).round(2)

#Dropping the transient columns used for the decay calculations
#df_player1.drop(["p_pts_won%_std_1to10", "p_pts_won%_std_11to20","p_pts_won%_std_21to30","p_pts_won%_std_31to40","p_pts_won%_std_41to50","p_pts_won%_std_51to60"],axis=1, inplace=True)


In [None]:
# Variability (standard deviation) in serve pts won% over the previous 60 surface-specific matches (decay weighted)
#df_player1 = df_player1.iloc[::-1]

#df_player1['p_sv_pts_won%_std_1to10'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 1).std().round(2).shift(1))

#df_player1['p_sv_pts_won%_std_11to20'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(11))
#df_player1['p_sv_pts_won%_std_11to20'] = df_player1['p_sv_pts_won%_std_11to20'].fillna(df_player1['p_sv_pts_won%_std_1to10'])

#df_player1['p_sv_pts_won%_std_21to30'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(21))
#df_player1['p_sv_pts_won%_std_21to30'] = df_player1['p_sv_pts_won%_std_21to30'].fillna(df_player1['p_sv_pts_won%_std_11to20'])

#df_player1['p_sv_pts_won%_std_31to40'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(31))
#df_player1['p_sv_pts_won%_std_31to40'] = df_player1['p_sv_pts_won%_std_31to40'].fillna(df_player1['p_sv_pts_won%_std_21to30'])

#df_player1['p_sv_pts_won%_std_41to50'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(41))
#df_player1['p_sv_pts_won%_std_41to50'] = df_player1['p_sv_pts_won%_std_41to50'].fillna(df_player1['p_sv_pts_won%_std_31to40'])

#df_player1['p_sv_pts_won%_std_51to60'] = df_player1.groupby(['p_id','t_surf'])['p_sv_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(51))
#df_player1['p_sv_pts_won%_std_51to60'] = df_player1['p_sv_pts_won%_std_51to60'].fillna(df_player1['p_sv_pts_won%_std_41to50'])

#df_player1 = df_player1.iloc[::-1]

In [None]:
# Time-decay weighting the variability (std) in serve pts won % by player from above.

#df_player1["p_sv_pts_won%_std_l60_decay"] = (((df_player1['p_sv_pts_won%_std_1to10'] * 14) + (df_player1['p_sv_pts_won%_std_11to20'] * 8) + (df_player1['p_sv_pts_won%_std_21to30'] * 5) 
#+ (df_player1['p_sv_pts_won%_std_31to40'] * 3) + (df_player1['p_sv_pts_won%_std_41to50'] * 2) + (df_player1['p_sv_pts_won%_std_51to60'] * 1))/33).round(2)

#Dropping the transient columns used for the decay calculations
#df_player1.drop(["p_sv_pts_won%_std_1to10", "p_sv_pts_won%_std_11to20","p_sv_pts_won%_std_21to30","p_sv_pts_won%_std_31to40","p_sv_pts_won%_std_41to50","p_sv_pts_won%_std_51to60"],axis=1, inplace=True)


In [None]:
# Variability (standard deviation) in return pts won% over the previous 60 surface-specific matches (decay weighted)
#df_player1 = df_player1.iloc[::-1]

#df_player1['p_ret_pts_won%_std_1to10'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 1).std().round(2).shift(1))

#df_player1['p_ret_pts_won%_std_11to20'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(11))
#df_player1['p_ret_pts_won%_std_11to20'] = df_player1['p_ret_pts_won%_std_11to20'].fillna(df_player1['p_ret_pts_won%_std_1to10'])

#df_player1['p_ret_pts_won%_std_21to30'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(21))
#df_player1['p_ret_pts_won%_std_21to30'] = df_player1['p_ret_pts_won%_std_21to30'].fillna(df_player1['p_ret_pts_won%_std_11to20'])

#df_player1['p_ret_pts_won%_std_31to40'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(31))
#df_player1['p_ret_pts_won%_std_31to40'] = df_player1['p_ret_pts_won%_std_31to40'].fillna(df_player1['p_ret_pts_won%_std_21to30'])

#df_player1['p_ret_pts_won%_std_41to50'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(41))
#df_player1['p_ret_pts_won%_std_41to50'] = df_player1['p_ret_pts_won%_std_41to50'].fillna(df_player1['p_ret_pts_won%_std_31to40'])

#df_player1['p_ret_pts_won%_std_51to60'] = df_player1.groupby(['p_id','t_surf'])['p_ret_pts_won%'].transform(lambda x: x.rolling(window=10, min_periods = 3).std().round(2).shift(51))
#df_player1['p_ret_pts_won%_std_51to60'] = df_player1['p_ret_pts_won%_std_51to60'].fillna(df_player1['p_ret_pts_won%_std_41to50'])

#df_player1 = df_player1.iloc[::-1]

In [None]:
# Time-decay weighting the variability (std) in return pts won % by player from above.

#df_player1["p_ret_pts_won%_std_l60_decay"] = (((df_player1['p_ret_pts_won%_std_1to10'] * 14) + (df_player1['p_ret_pts_won%_std_11to20'] * 8) + (df_player1['p_ret_pts_won%_std_21to30'] * 5) 
#+ (df_player1['p_ret_pts_won%_std_31to40'] * 3) + (df_player1['p_ret_pts_won%_std_41to50'] * 2) + (df_player1['p_ret_pts_won%_std_51to60'] * 1))/33).round(2)

#Dropping the transient columns used for the decay calculations
#df_player1.drop(["p_ret_pts_won%_std_1to10", "p_ret_pts_won%_std_11to20","p_ret_pts_won%_std_21to30","p_ret_pts_won%_std_31to40","p_ret_pts_won%_std_41to50","p_ret_pts_won%_std_51to60"],axis=1, inplace=True)


In [None]:
#df_player1.to_csv('../data/df_player1.csv', index=False)

In [None]:
df_player1.info()

### Fatigue and Stamina Predictive Features

In [33]:
# Computes the number of clay-court matches contained in the sample that a player has played before the match being predicted on

df_player1 = df_player1.iloc[::-1]
df_player1['p_matches_surf'] = df_player1.groupby(['p_id','t_surf'])['p_id'].transform(lambda x: x.rolling(window = 1000, min_periods=1).count().shift(1))
df_player1 = df_player1.iloc[::-1]

# If this is the first match in the sample for the player, the NaN will become 1 (these matches will be filtered out before modeling anyhow)
df_player1['p_matches_surf'] = df_player1['p_matches_surf'].fillna(1)
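
For comparison, a count of prior matches per player and surface can also be sketched with `groupby(...).cumcount()`. This is an alternative shown under assumptions, not the method used above: the toy frame is in chronological order (so no reversal is needed), the surface labels are placeholders, and a player's first match gets 0 here rather than the 1 produced by the `fillna(1)` in the cell above.

```python
import pandas as pd

# Toy frame in chronological order with hypothetical player ids and surfaces
df = pd.DataFrame({
    "p_id":   [1, 1, 2, 1, 2, 1],
    "t_surf": ["Clay", "Clay", "Clay", "Hard", "Clay", "Clay"],
})

# Number of earlier rows for the same player and surface
df["p_matches_surf"] = df.groupby(["p_id", "t_surf"]).cumcount()

print(df["p_matches_surf"].tolist())
```

Since first-in-sample matches are filtered out before modeling, the 0-versus-1 difference should not affect the rows that reach the model.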

In [34]:
# Computes decay-weighted time spent on court for a player across up to his prior 6 matches within the same tournament (the most prior matches a player can have in one tournament).

# For the sake of fatigue computation, time spent on court is weighted by the number of days between the match at hand and a given previous match 

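# Fatigue weighting factors, indexed by the number of days (0-14) between the match at hand and a given earlier match within the same tournament
# Two decaying schedules are kept below for reference; the first has only 14 entries, so it would need a 15th before FF[14] could be indexed.
#FF = [1, .67, .45, .3, .2, .14, .09, .06, .04, .03, .02, .01, .005, .004]
#FF = [1, .75, .56, .42, .32, .24, .18, .13, .1, .08, .06, .04, .03, .02,.01]
# Active setting: every weight is 1, so no day-based decay is currently applied
FF = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]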

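# shift(-1) within each player/tournament-week group pulls the previous match's duration (this assumes rows are ordered most-recent-first, consistent with the iloc[::-1] reversals around the rolling calculations above)
df_player1["p_m_time_last"] = df_player1.groupby(['p_id','tour_wk'])['m_time(m)'].shift(-1)
# No earlier match in the tournament: fill with a nominal 15 minutes
df_player1['p_m_time_last'] = df_player1['p_m_time_last'].fillna(15)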

df_player1["p_m_day"] = df_player1.groupby(['p_id','tour_wk'])['tour_day'].shift(0)
df_player1["p_m_day_last"] = df_player1.groupby(['p_id','tour_wk'])['tour_day'].shift(-1)
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 0), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[0]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 1), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[1]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 2), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[2]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 3), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[3]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 4), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[4]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 5), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[5]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 6), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[6]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 7), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[7]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 8), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[8]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 9), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[9]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 10), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[10]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 11), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[11]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 12), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[12]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 13), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[13]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 14), "p_m_time_last_w"] = df_player1['p_m_time_last'] * FF[14]
df_player1['p_m_time_last_w'] = df_player1['p_m_time_last_w'].fillna(df_player1['p_m_time_last'])


df_player1["p_m_time_2ago"] = df_player1.groupby(['p_id','tour_wk'])['m_time(m)'].shift(-2)
df_player1['p_m_time_2ago'] = df_player1['p_m_time_2ago'].fillna(15)

df_player1["p_m_day_2ago"] = df_player1.groupby(['p_id','tour_wk'])['tour_day'].shift(-2)
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 0), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[0]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 1), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[1]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 2), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[2]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 3), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[3]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 4), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[4]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 5), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[5]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 6), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[6]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 7), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[7]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 8), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[8]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 9), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[9]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 10), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[10]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 11), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[11]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 12), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[12]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 13), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[13]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 14), "p_m_time_2ago_w"] = df_player1['p_m_time_2ago'] * FF[14]
df_player1['p_m_time_2ago_w'] = df_player1['p_m_time_2ago_w'].fillna(15)


df_player1["p_m_time_3ago"] = df_player1.groupby(['p_id','tour_wk'])['m_time(m)'].shift(-3)
df_player1['p_m_time_3ago'] = df_player1['p_m_time_3ago'].fillna(15)

df_player1["p_m_day_3ago"] = df_player1.groupby(['p_id','tour_wk'])['tour_day'].shift(-3)
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 0), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[0]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 1), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[1]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 2), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[2]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 3), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[3]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 4), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[4]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 5), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[5]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 6), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[6]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 7), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[7]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 8), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[8]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 9), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[9]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 10), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[10]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 11), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[11]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 12), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[12]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 13), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[13]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_3ago"]) == 14), "p_m_time_3ago_w"] = df_player1['p_m_time_3ago'] * FF[14]
df_player1['p_m_time_3ago_w'] = df_player1['p_m_time_3ago_w'].fillna(15)


df_player1["p_m_time_4ago"] = df_player1.groupby(['p_id','tour_wk'])['m_time(m)'].shift(-4)
df_player1['p_m_time_4ago'] = df_player1['p_m_time_4ago'].fillna(15)

df_player1["p_m_day_4ago"] = df_player1.groupby(['p_id','tour_wk'])['tour_day'].shift(-4)
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 0), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[0]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 1), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[1]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 2), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[2]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 3), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[3]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 4), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[4]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 5), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[5]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 6), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[6]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 7), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[7]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 8), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[8]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 9), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[9]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 10), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[10]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 11), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[11]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 12), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[12]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 13), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[13]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_4ago"]) == 14), "p_m_time_4ago_w"] = df_player1['p_m_time_4ago'] * FF[14]
df_player1['p_m_time_4ago_w'] = df_player1['p_m_time_4ago_w'].fillna(15)


df_player1["p_m_time_5ago"] = df_player1.groupby(['p_id','tour_wk'])['m_time(m)'].shift(-5)
df_player1['p_m_time_5ago'] = df_player1['p_m_time_5ago'].fillna(15)

df_player1["p_m_day_5ago"] = df_player1.groupby(['p_id','tour_wk'])['tour_day'].shift(-5)
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 0), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[0]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 1), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[1]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 2), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[2]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 3), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[3]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 4), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[4]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 5), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[5]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 6), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[6]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 7), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[7]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 8), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[8]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 9), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[9]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 10), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[10]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 11), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[11]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 12), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[12]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 13), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[13]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_5ago"]) == 14), "p_m_time_5ago_w"] = df_player1['p_m_time_5ago'] * FF[14]
df_player1['p_m_time_5ago_w'] = df_player1['p_m_time_5ago_w'].fillna(15)


df_player1["p_m_time_6ago"] = df_player1.groupby(['p_id','tour_wk'])['m_time(m)'].shift(-6)
df_player1['p_m_time_6ago'] = df_player1['p_m_time_6ago'].fillna(15)

df_player1["p_m_day_6ago"] = df_player1.groupby(['p_id','tour_wk'])['tour_day'].shift(-6)
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 0), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[0]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 1), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[1]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 2), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[2]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 3), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[3]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 4), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[4]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 5), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[5]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 6), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[6]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 7), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[7]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 8), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[8]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 9), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[9]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 10), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[10]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 11), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[11]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 12), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[12]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 13), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[13]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_6ago"]) == 14), "p_m_time_6ago_w"] = df_player1['p_m_time_6ago'] * FF[14]
df_player1['p_m_time_6ago_w'] = df_player1['p_m_time_6ago_w'].fillna(15)


# Summing latency-weighted time on court across previous matches within the tournament
df_player1['p_tot_time_l6_decay'] = ((df_player1['p_m_time_last_w']) + (df_player1['p_m_time_2ago_w']) + (df_player1['p_m_time_3ago_w']) + (df_player1['p_m_time_4ago_w']) + (df_player1['p_m_time_5ago_w']) + (df_player1['p_m_time_6ago_w']))

# Dropping transient columns 
df_player1 = df_player1.drop(['p_m_day', 'p_m_day_last', 'p_m_day_2ago', 'p_m_day_3ago','p_m_day_4ago', 'p_m_day_5ago', 'p_m_day_6ago','p_m_time_last', 'p_m_time_last_w', 'p_m_time_2ago', 'p_m_time_2ago_w','p_m_time_3ago', 'p_m_time_3ago_w','p_m_time_4ago', 'p_m_time_4ago_w','p_m_time_5ago', 'p_m_time_5ago_w', 'p_m_time_6ago', 'p_m_time_6ago_w'],axis=1)
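
The six near-identical lag blocks above follow one pattern, which can be sketched as a loop. This is a compact illustration under assumptions, not a drop-in replacement: the helper name `within_tournament_load` and its parameters are hypothetical, rows are assumed to be ordered most-recent-first, and `tour_day` is assumed to hold whole-day integers. It keeps the same fallbacks as the cell above (the raw value for the most recent prior match, the default otherwise).

```python
import pandas as pd


def within_tournament_load(df, value_col, default, ff, max_lag=6):
    """Sum `value_col` over up to `max_lag` earlier matches in the same
    tournament week, weighting each by ff[days since that match].

    Assumes rows are ordered most-recent-first, so shift(-lag) reaches back.
    """
    grouped = df.groupby(["p_id", "tour_wk"])
    total = pd.Series(0.0, index=df.index)
    for lag in range(1, max_lag + 1):
        prev_val = grouped[value_col].shift(-lag).fillna(default)
        gap = df["tour_day"] - grouped["tour_day"].shift(-lag)
        # Gaps outside 0..len(ff)-1 (or missing) get no weight and stay NaN
        usable = gap.between(0, len(ff) - 1)
        weight = gap.where(usable).map(lambda d: ff[int(d)], na_action="ignore")
        # Same fallbacks as the cell above: raw value for lag 1, default otherwise
        fallback = prev_val if lag == 1 else default
        total = total + (prev_val * weight).fillna(fallback)
    return total


# Toy data: one hypothetical player, three matches in one tournament week
toy = pd.DataFrame({
    "p_id": [7, 7, 7],
    "tour_wk": ["2019_20", "2019_20", "2019_20"],
    "tour_day": [5, 3, 1],
    "m_time(m)": [120.0, 95.0, 150.0],
})
FF = [1] * 15  # all-ones weights, as currently used in the notebook
result = within_tournament_load(toy, "m_time(m)", default=15, ff=FF)
print(result.tolist())
```

With all-ones weights, each row's total is the sum of the durations of its earlier matches in the tournament plus 15 minutes for every unused lag slot, so a player's first match of a tournament gets 6 × 15 = 90. The same helper could be called with `"m_tot_pts"` and `default=40` for the points-based variant.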

In [None]:
#df_player1.to_csv('../data/df_player1.csv', index=False)

In [35]:
# Computes decay-weighted total points played for a player across up to his prior 6 matches within the same tournament (the most prior matches a player can have in one tournament).

# For the sake of fatigue computation, total points played are weighted by the number of days between the match at hand and a given previous match 

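# Fatigue weighting factors, indexed by the number of days (0-14) between the match at hand and a given earlier match within the same tournament
# Two decaying schedules are kept below for reference; the first has only 14 entries, so it would need a 15th before FF[14] could be indexed.
#FF = [1, .67, .45, .3, .2, .14, .09, .06, .04, .03, .02, .01, .005, .004]
#FF = [1, .75, .56, .42, .32, .24, .18, .13, .1, .08, .06, .04, .03, .02, .01]
# Active setting: every weight is 1, so no day-based decay is currently applied
FF = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]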

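# shift(-1) within each player/tournament-week group pulls the previous match's total points (this assumes rows are ordered most-recent-first)
df_player1["p_tot_pts_last"] = df_player1.groupby(['p_id','tour_wk'])['m_tot_pts'].shift(-1)
# No earlier match in the tournament: fill with a nominal 40 points
df_player1['p_tot_pts_last'] = df_player1['p_tot_pts_last'].fillna(40)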

df_player1["p_m_day"] = df_player1.groupby(['p_id','tour_wk'])['tour_day'].shift(0)
df_player1["p_m_day_last"] = df_player1.groupby(['p_id','tour_wk'])['tour_day'].shift(-1)
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 0), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[0]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 1), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[1]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 2), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[2]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 3), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[3]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 4), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[4]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 5), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[5]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 6), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[6]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 7), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[7]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 8), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[8]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 9), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[9]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 10), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[10]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 11), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[11]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 12), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[12]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 13), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[13]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_last"]) == 14), "p_tot_pts_last_w"] = df_player1['p_tot_pts_last'] * FF[14]
df_player1['p_tot_pts_last_w'] = df_player1['p_tot_pts_last_w'].fillna(df_player1['p_tot_pts_last'])


df_player1["p_tot_pts_2ago"] = df_player1.groupby(['p_id','tour_wk'])['m_tot_pts'].shift(-2)
df_player1['p_tot_pts_2ago'] = df_player1['p_tot_pts_2ago'].fillna(40)

df_player1["p_m_day_2ago"] = df_player1.groupby(['p_id','tour_wk'])['tour_day'].shift(-2)
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 0), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[0]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 1), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[1]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 2), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[2]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 3), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[3]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 4), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[4]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 5), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[5]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 6), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[6]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 7), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[7]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 8), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[8]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 9), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[9]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 10), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[10]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 11), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[11]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 12), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[12]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 13), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[13]
df_player1.loc[((df_player1["p_m_day"] - df_player1["p_m_day_2ago"]) == 14), "p_tot_pts_2ago_w"] = df_player1['p_tot_pts_2ago'] * FF[14]
df_player1['p_tot_pts_2ago_w'] = df_player1['p_tot_pts_2ago_w'].fillna(40)


# Same steps as above for the matches 3 to 6 back within the tournament week
for n in range(3, 7):
    pts_col = f"p_tot_pts_{n}ago"
    day_col = f"p_m_day_{n}ago"
    w_col = f"p_tot_pts_{n}ago_w"

    df_player1[pts_col] = df_player1.groupby(['p_id','tour_wk'])['m_tot_pts'].shift(-n)
    df_player1[pts_col] = df_player1[pts_col].fillna(40)

    df_player1[day_col] = df_player1.groupby(['p_id','tour_wk'])['tour_day'].shift(-n)
    # Weight points by the day gap (0-14) to the current match using FF; other gaps stay NaN
    for d in range(15):
        df_player1.loc[((df_player1["p_m_day"] - df_player1[day_col]) == d), w_col] = df_player1[pts_col] * FF[d]
    df_player1[w_col] = df_player1[w_col].fillna(40)


# Summing latency-weighted total pts across previous matches within the tournament
df_player1['p_tot_pts_l6_decay'] = ((df_player1['p_tot_pts_last_w']) + (df_player1['p_tot_pts_2ago_w']) + (df_player1['p_tot_pts_3ago_w']) + (df_player1['p_tot_pts_4ago_w']) + (df_player1['p_tot_pts_5ago_w']) + (df_player1['p_tot_pts_6ago_w']))

# Dropping transient columns 
df_player1 = df_player1.drop(['p_m_day', 'p_m_day_last', 'p_m_day_2ago', 'p_m_day_3ago','p_m_day_4ago', 'p_m_day_5ago', 'p_m_day_6ago','p_tot_pts_last', 'p_tot_pts_last_w', 'p_tot_pts_2ago', 'p_tot_pts_2ago_w','p_tot_pts_3ago', 'p_tot_pts_3ago_w','p_tot_pts_4ago', 'p_tot_pts_4ago_w','p_tot_pts_5ago', 'p_tot_pts_5ago_w', 'p_tot_pts_6ago', 'p_tot_pts_6ago_w'],axis=1)
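The day-gap weighting used for each earlier match can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own code: `FF_demo` and its values are hypothetical stand-ins for the `FF` factors defined earlier in the notebook, and the toy column names are assumptions.

```python
import pandas as pd

# Hypothetical decay factors indexed by day gap (the notebook defines its own FF earlier).
FF_demo = [1.0, 0.9, 0.8, 0.7]

toy = pd.DataFrame({
    "day_gap": [1.0, 3.0, float("nan")],  # days between the current match and an earlier one
    "pts": [180.0, 240.0, 40.0],          # points played in that earlier match
})

# Look up the factor for each gap; rows with no earlier match stay NaN.
factor = toy["day_gap"].apply(
    lambda g: FF_demo[int(g)] if pd.notna(g) else float("nan")
)

# Weighted points, with the notebook's default of 40 where no earlier match exists.
toy["pts_w"] = (toy["pts"] * factor).fillna(40)
print(toy)
```

Summing such weighted columns across the previous matches gives a single decay-weighted fatigue figure per row.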

In [36]:
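# Integrates "stamina" and "fatigue" features into a "body battery" feature (decay-weighted for the match points variable)
# Currently, the player's surface match count in the denominator is taken to the 4th root, based on prediction-quality feedback from a simple (linear) model
# This version uses decay-weighted points played as the fatigue input, but time on court could also be used.
# Note: as written, .round(2) applies to the 4th-root denominator only, not to the final ratio (hence the unrounded values in the output below)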

df_player1["p_stamina_adj_fatigue_decay"] = (df_player1["p_tot_pts_l6_decay"])/(df_player1["p_matches_surf"]**(1/4)).round(2)
#df_player1["p_stamina_adj_fatigue_decay"] = (df_player1["p_tot_pts_l6_decay"]/np.sqrt(df_player1["p_matches"])).round(2)
#df_player1["p_stamina_adj_fatigue"] = (df_player1["p_tot_pts_l6"]/np.sqrt(df_player1["p_matches"])).round(2)
#df_player1["p_stamina_adj_fatigue"] = (df_player1["p_tot_pts_l6"])/(df_player1["p_matches"]**(1/4)).round(2)
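A small sketch of the "body battery" ratio on toy numbers may help here. The column names follow the notebook, but the values are illustrative assumptions. It also contrasts rounding the denominator (what the cell above does) with rounding the finished ratio.

```python
import pandas as pd

# Toy values; only the column names come from the notebook.
toy = pd.DataFrame({
    "p_tot_pts_l6_decay": [362.0, 240.0, 240.0],
    "p_matches_surf": [6.0, 5.0, 1.0],
})

# Fatigue (decay-weighted points) is divided by the 4th root of surface matches played.
fourth_root = toy["p_matches_surf"] ** (1 / 4)

# Rounding the denominator first, as the cell above does:
rounded_denominator = toy["p_tot_pts_l6_decay"] / fourth_root.round(2)

# Rounding the finished ratio instead:
rounded_ratio = (toy["p_tot_pts_l6_decay"] / fourth_root).round(2)

print(pd.DataFrame({
    "rounded_denominator": rounded_denominator,
    "rounded_ratio": rounded_ratio,
}))
```

The two variants differ slightly whenever the 4th root is not already a two-decimal number, so whichever is chosen should be applied consistently across workbooks.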

#### Below, a given player's H2H wins versus an opponent prior to the match being predicted on are computed in a surface-specific manner (2012-2019 inclusive)

In [37]:
df_player1 = df_player1.iloc[::-1]
df_player1['p_H2H_w'] = df_player1.groupby(['p_id','opp_id','t_surf'])['m_outcome'].transform(lambda x: x.rolling(window=2000, min_periods = 1).sum().shift(1))
df_player1 = df_player1.iloc[::-1]
df_player1['p_H2H_w'] = df_player1['p_H2H_w'].fillna(0)
df_player1

Unnamed: 0,t_id,t_date,tour_day,tour_wk,t_name,t_country,t_surf,t_indoor,t_alt,t_lvl,...,p_bp_convert%,p_bp_convert%_l60,p_bp_convert%_l10,p_IWP_l60_decay,p_IWP_l10,p_matches_surf,p_tot_time_l6_decay,p_tot_pts_l6_decay,p_stamina_adj_fatigue_decay,p_H2H_w
8932,2019-0439,20190715,1908.0,2019_18,Umag,CRO,Clay,0,0,1,...,33.33,36.19,36.19,40.02,40.02,6.0,178.0,362.0,230.573248,0.0
133,2019-0439,20190715,1905.0,2019_18,Umag,CRO,Clay,0,0,1,...,66.67,30.09,30.09,38.71,38.71,5.0,90.0,240.0,160.000000,0.0
9132,2019-7694,20190520,1885.0,2019_16,Lyon,FRA,Clay,0,0,1,...,20.00,32.62,32.62,32.66,32.66,4.0,90.0,240.0,170.212766,0.0
9165,2019-M009,20190513,1881.0,2019_15,Rome Masters,ITA,Clay,0,0,2,...,33.33,32.38,32.38,38.40,38.40,3.0,186.0,357.0,270.454545,0.0
369,2019-M009,20190513,1878.0,2019_15,Rome Masters,ITA,Clay,0,0,2,...,57.14,20.00,20.00,35.78,35.78,2.0,90.0,240.0,201.680672,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4044,2014-414,20140714,657.0,2014_19,Hamburg,GER,Clay,0,0,1,...,36.36,42.46,42.46,24.82,24.82,4.0,133.0,296.0,209.929078,0.0
4056,2014-414,20140714,656.0,2014_19,Hamburg,GER,Clay,0,0,1,...,55.56,38.10,38.10,20.79,20.79,3.0,90.0,240.0,181.818182,0.0
12934,2014-321,20140707,650.0,2014_18,Stuttgart,GER,Clay,0,0,1,...,66.67,23.81,23.81,17.05,17.05,2.0,90.0,240.0,201.680672,0.0
13273,2014-308,20140428,609.0,2014_13,Munich,GER,Clay,0,0,1,...,14.29,33.33,33.33,10.48,10.48,1.0,90.0,240.0,240.000000,0.0


In [38]:
# H2H Past Comparison at the level of pts% won
df_player1 = df_player1.iloc[::-1]
df_player1['p_H2H_pts_won%'] = df_player1.groupby(['p_id','opp_id','t_surf'])['p_pts_won%'].transform(lambda x: x.rolling(window=2000, min_periods = 1).mean().shift(1))
df_player1 = df_player1.iloc[::-1]
#df_player1['p_H2H_pts_won%'] = df_player1['p_H2H_pts_won%'].fillna(0)
#df_player1
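The "prior matches only" H2H pattern can be checked on a toy frame. This is a sketch under assumptions: the rows are already in chronological order (the notebook stores newest first and reverses with `iloc[::-1]` before and after), and the values are made up.

```python
import pandas as pd

# Toy matches in chronological order (oldest first); values are illustrative only.
toy = pd.DataFrame({
    "p_id": [1, 1, 1, 1],
    "opp_id": [2, 2, 3, 2],
    "t_surf": ["Clay"] * 4,
    "m_outcome": [1, 0, 1, 1],
})

# Wins against this opponent on this surface BEFORE each match:
# a very long rolling window acts as a cumulative sum, and shift(1) excludes the current match.
toy["p_H2H_w"] = (
    toy.groupby(["p_id", "opp_id", "t_surf"])["m_outcome"]
    .transform(lambda x: x.rolling(window=2000, min_periods=1).sum().shift(1))
    .fillna(0)
)
print(toy)
```

The first meeting with any opponent has no history, so it is filled with 0. The same shift keeps the current match's result out of its own feature.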

In [None]:
#Save current df prior to another transformation back to by-match organization
#df_player2.to_csv('../data/df_player2.csv', index=False)

Here we convert briefly back to by-match organization to obtain the data needed to compute each player's "Strength of Schedule" (SOS) at the time of each match to be predicted on, across a range of features. The concept is the same as you might have seen in football or soccer analytics. At the time of every match to be predicted on, we have a number of (mostly time-decay weighted) assessments of how the player performed over their most recent stretch (60 matches) on the same surface as the match to be played. However, we want to normalize these predictive features by the aggregate strength of the opponents faced over the stretch in which those features were generated. For example, winning 60% of your serve points against a schedule of opponents who had historically yielded 65% of opponent serve points (i.e., had won 35% of their own return points) is not as impressive as winning 60% of your serve points against a schedule of opponents who had historically yielded 55% of opponent serve points (i.e., had won 45% of their own return points).

In [39]:
df_winners2 = df_player1[df_player1['m_outcome'] == 1]
df_losers2 = df_player1[df_player1['m_outcome'] == 0]
df_match2 = df_winners2.merge(df_losers2, on='m_num', how = 'left')
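To illustrate what this winner/loser merge produces, here is a toy version. Column names mirror the notebook; the values are assumptions. Each match collapses to one row, with the loser's columns carried alongside the winner's.

```python
import pandas as pd

# Toy by-player rows: two rows per match, one per player (values are illustrative).
toy = pd.DataFrame({
    "m_num": [101, 101, 102, 102],
    "p_id": [1, 2, 3, 1],
    "m_outcome": [1, 0, 0, 1],
    "p_pts_won%_l60_decay": [52.1, 49.3, 48.0, 52.4],
})

winners = toy[toy["m_outcome"] == 1]
losers = toy[toy["m_outcome"] == 0]

# One row per match; pandas appends _x (winner) and _y (loser) to overlapping column names.
by_match = winners.merge(losers, on="m_num", how="left")
print(by_match[["m_num", "p_id_x", "p_id_y", "p_pts_won%_l60_decay_y"]])
```

The `_y` columns are what later become each player's `p_opp_*` features once the frame is split back into by-player rows.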

In [None]:
#Save the by-match df before transforming back to by-player organization
df_match2.to_csv('../data/df_match2.csv', index=False)

In [None]:
df_match2.info()

In [40]:
#Back to by-player organization, picking up the reciprocal columns per player needed to make SOS calculations

# Dropping other player columns for winners
df_winners2 = df_match2.drop(["p_svpt_x", "p_1stWon_x", "p_2ndWon_x", "p_SvGms_x", "p_ace_x", "p_bpSaved_x", "p_bpFaced_x", "opp_svpt_x", "opp_ace_x", "opp_bpSaved_x", "opp_bpFaced_x", "m_outcome_x", "t_id_y", "t_date_y", "tour_day_y", "tour_wk_y", "t_name_y", "t_country_y", "t_surf_y", "t_indoor_y", "t_alt_y", "t_lvl_y", "t_draw_size_y", "t_round_y", "t_rd_num_y", "m_best_of_y", "m_score_y", "m_time(m)_y", "p_id_y", "p_name_y", "p_rank_y", "p_rank_pts_y", "p_country_y", "p_ent_y", "p_hd_y", "p_ht_y", "p_age_y", "p_svpt_y", "p_1stWon_y", "p_2ndWon_y", "p_SvGms_y", "p_ace_y", "p_bpSaved_y", "p_bpFaced_y", "opp_id_y", "opp_svpt_y", "opp_ace_y", "opp_bpSaved_y", "opp_bpFaced_y", "p_AVG_C_IP_NV_y", "p_PS_O_IP_NV_y", "p_PS_C_IP_NV_y", "p_pts_won%_y", "p_sv_pts_won%_y", "p_ret_pts_won%_y", "m_tot_pts_y", "m_outcome_y", "p_ace%_y", "p_aced%_y", "p_bp_save%_y", "p_bp_convert%_y", "p_matches_surf_y", "p_tot_time_l6_decay_y", "p_tot_pts_l6_decay_y", "p_stamina_adj_fatigue_decay_y", "p_H2H_w_y", "p_H2H_pts_won%_y"], axis = 1)
df_winners2["m_outcome"] = 1

In [None]:
df_winners2.info()

In [41]:
#Renaming columns to remove winner-loser descriptions so we can re-concatenate winners and losers
df_winners2 = df_winners2.set_axis(["t_id", "t_date", "tour_day", "tour_wk", "t_name", "t_country", "t_surf", "t_indoor", "t_alt", "t_lvl", "t_draw_size", "m_num", "t_round", "t_rd_num", "m_best_of", "m_score","m_time(m)", "p_id", "p_name","p_rank", "p_rank_pts", "p_country", "p_ent", "p_hd", "p_ht", "p_age", "opp_id", "p_AVG_C_IP_NV", "p_PS_O_IP_NV", "p_PS_C_IP_NV", "p_pts_won%", "p_sv_pts_won%", "p_ret_pts_won%", "m_tot_pts", "p_pts_won%_l60_decay", "p_pts_won%_l10", "p_pts_won%_l60_decay_IO", "p_sv_pts_won%_l60_decay", "p_sv_pts_won%_l10", "p_ret_pts_won%_l60_decay", "p_ret_pts_won%_l10", "p_ace%", "p_ace%_l60_decay", "p_ace%_l10", "p_aced%", "p_aced%_l60_decay", "p_aced%_l10", "p_bp_save%", "p_bp_save%_l60", "p_bp_save%_l10", "p_bp_convert%", "p_bp_convert%_l60", "p_bp_convert%_l10", "p_IWP_l60_decay", "p_IWP_l10", "p_matches_surf", "p_tot_time_l6_decay", "p_tot_pts_l6_decay", "p_stamina_adj_fatigue_decay", "p_H2H_w", "p_H2H_pts_won%", "p_opp_pts_won%_l60_decay", "p_opp_pts_won%_l10", "p_opp_pts_won%_l60_decay_IO", "p_opp_sv_pts_won%_l60_decay", "p_opp_sv_pts_won%_l10", "p_opp_ret_pts_won%_l60_decay", "p_opp_ret_pts_won%_l10", "p_opp_ace%_l60_decay", "p_opp_ace%_l10", "p_opp_aced%_l60_decay", "p_opp_aced%_l10", "p_opp_bp_save%_l60", "p_opp_bp_save%_l10", "p_opp_bp_convert%_l60", "p_opp_bp_convert%_l10", "p_opp_IWP_l60_decay", "p_opp_IWP_l10", "m_outcome"], axis=1)

In [42]:
#Dropping other player columns for losers
df_losers2 = df_match2.drop(["p_id_x", "p_name_x", "p_rank_x", "p_rank_pts_x", "p_country_x", "p_ent_x", "p_hd_x", "p_ht_x", "p_age_x", "p_svpt_x", "p_1stWon_x", "p_2ndWon_x", "p_SvGms_x", "p_ace_x", "p_bpSaved_x", "p_bpFaced_x", "opp_id_x", "opp_svpt_x", "opp_ace_x", "opp_bpSaved_x", "opp_bpFaced_x", "p_AVG_C_IP_NV_x", "p_PS_O_IP_NV_x", "p_PS_C_IP_NV_x", "p_pts_won%_x", "p_sv_pts_won%_x", "p_ret_pts_won%_x", "m_outcome_x", "p_ace%_x", "p_aced%_x", "p_bp_save%_x", "p_bp_convert%_x", "p_matches_surf_x", "p_tot_time_l6_decay_x", "p_tot_pts_l6_decay_x", "p_stamina_adj_fatigue_decay_x", "p_H2H_w_x", "p_H2H_pts_won%_x", "t_id_y", "t_date_y", "tour_day_y", "tour_wk_y", "t_name_y", "t_country_y", "t_surf_y", "t_indoor_y", "t_alt_y", "t_lvl_y", "t_draw_size_y", "t_round_y", "t_rd_num_y", "m_best_of_y", "m_score_y", "m_time(m)_y", "p_svpt_y", "p_1stWon_y", "p_2ndWon_y", "p_SvGms_y", "p_ace_y", "p_bpSaved_y", "p_bpFaced_y", "opp_svpt_y", "opp_ace_y", "opp_bpSaved_y", "opp_bpFaced_y", "m_tot_pts_y", "m_outcome_y"], axis = 1)
df_losers2["m_outcome"] = 0

In [None]:
df_losers2.info()

In [43]:
#Renaming columns to remove winner-loser descriptions so we can re-concatenate winners and losers
df_losers2 = df_losers2.set_axis(["t_id", "t_date", "tour_day", "tour_wk", "t_name", "t_country", "t_surf", "t_indoor", "t_alt", "t_lvl", "t_draw_size", "m_num", "t_round", "t_rd_num", "m_best_of", "m_score","m_time(m)", "m_tot_pts", "p_opp_pts_won%_l60_decay", "p_opp_pts_won%_l10", "p_opp_pts_won%_l60_decay_IO", "p_opp_sv_pts_won%_l60_decay", "p_opp_sv_pts_won%_l10", "p_opp_ret_pts_won%_l60_decay", "p_opp_ret_pts_won%_l10", "p_opp_ace%_l60_decay", "p_opp_ace%_l10", "p_opp_aced%_l60_decay", "p_opp_aced%_l10", "p_opp_bp_save%_l60", "p_opp_bp_save%_l10", "p_opp_bp_convert%_l60", "p_opp_bp_convert%_l10", "p_opp_IWP_l60_decay", "p_opp_IWP_l10", "p_id", "p_name","p_rank", "p_rank_pts", "p_country", "p_ent", "p_hd", "p_ht", "p_age", "opp_id", "p_AVG_C_IP_NV", "p_PS_O_IP_NV", "p_PS_C_IP_NV", "p_pts_won%", "p_sv_pts_won%", "p_ret_pts_won%", "p_pts_won%_l60_decay", "p_pts_won%_l10", "p_pts_won%_l60_decay_IO", "p_sv_pts_won%_l60_decay", "p_sv_pts_won%_l10", "p_ret_pts_won%_l60_decay", "p_ret_pts_won%_l10", "p_ace%", "p_ace%_l60_decay", "p_ace%_l10", "p_aced%", "p_aced%_l60_decay", "p_aced%_l10", "p_bp_save%", "p_bp_save%_l60", "p_bp_save%_l10", "p_bp_convert%", "p_bp_convert%_l60", "p_bp_convert%_l10", "p_IWP_l60_decay", "p_IWP_l10", "p_matches_surf", "p_tot_time_l6_decay", "p_tot_pts_l6_decay", "p_stamina_adj_fatigue_decay", "p_H2H_w", "p_H2H_pts_won%", "m_outcome"], axis=1)

In [None]:
df_losers2.info()

In [44]:
#Re-merge data, but now with no separate columns for winners and losers 
df_player2 = pd.concat([df_winners2, df_losers2], ignore_index=True)

In [None]:
df_player2.info()

In [45]:
#Reorder columns and sort in useful way visually 
df_player2 = df_player2[["t_id", "t_date", "tour_day", "tour_wk", "t_name", "t_country", "t_surf", "t_indoor", "t_alt", "t_lvl", "t_draw_size", "t_round", "t_rd_num", "m_num", "m_best_of", "m_outcome", "m_score","m_time(m)", "m_tot_pts", "p_id", "p_name", "opp_id", "p_H2H_w", "p_H2H_pts_won%", "p_rank", "p_rank_pts", "p_country", "p_ent", "p_hd", "p_ht", "p_age", "p_matches_surf", "p_pts_won%", "p_pts_won%_l60_decay", "p_pts_won%_l60_decay_IO", "p_pts_won%_l10", "p_sv_pts_won%", "p_sv_pts_won%_l60_decay", "p_sv_pts_won%_l10", "p_ret_pts_won%", "p_ret_pts_won%_l60_decay", "p_ret_pts_won%_l10", "p_ace%", "p_ace%_l60_decay", "p_ace%_l10", "p_aced%", "p_aced%_l60_decay", "p_aced%_l10", "p_bp_save%", "p_bp_save%_l60", "p_bp_save%_l10", "p_bp_convert%", "p_bp_convert%_l60", "p_bp_convert%_l10", "p_tot_time_l6_decay", "p_tot_pts_l6_decay", "p_stamina_adj_fatigue_decay", "p_AVG_C_IP_NV", "p_PS_O_IP_NV", "p_PS_C_IP_NV", "p_IWP_l60_decay", "p_IWP_l10", "p_opp_pts_won%_l60_decay", "p_opp_pts_won%_l60_decay_IO", "p_opp_pts_won%_l10", "p_opp_sv_pts_won%_l60_decay", "p_opp_sv_pts_won%_l10", "p_opp_ret_pts_won%_l60_decay", "p_opp_ret_pts_won%_l10", "p_opp_ace%_l60_decay", "p_opp_ace%_l10", "p_opp_aced%_l60_decay", "p_opp_aced%_l10", "p_opp_bp_save%_l60", "p_opp_bp_save%_l10", "p_opp_bp_convert%_l60", "p_opp_bp_convert%_l10", "p_opp_IWP_l60_decay", "p_opp_IWP_l10"]]
df_player2 = df_player2.sort_values(by=['p_id','tour_wk','t_rd_num'], ascending = False)

In [None]:
#df_player2.head(20)

In [None]:
df_player2.info()

In [46]:
#Save to review
df_player2.to_csv('../data/df_player2.csv', index=False)

### Strength of Schedule Calculation and Adjustment for Predictive Features

For a given player, (mostly) time decay-weighted performance over their most recent (up to) 60 matches on clay prior to the match being predicted on is adjusted by how far above or below the sample mean their opponents during that stretch had THEMSELVES performed over THEIR last 60 matches heading into the match with the player of interest.

This Strength of Schedule adjustment is common practice in team sports, but tennis requires much more computation because the adjustment is recomputed for every match in a comparatively large sample per player (in the NFL, for example, you only need to re-compute 16 times per season).

In [47]:
#Calculates % total points won 'Strength of Schedule' for the past 60 opponents of a given player in a given match.
#Uses each opponent's decay-weighted last 60 match performance prior to facing the player of interest (surface-specific)
#With this, we can obtain the "expected" performance over the last 60 matches for the player of interest, then can SOS adjust
#that player's performance over their last 60 on the surface to reflect how much above or below an average schedule they faced (see calculations below)

df_player2 = df_player2.iloc[::-1]

df_player2['p_pts_won%_SOS_1to10'] = df_player2.groupby(['p_id','t_surf'])['p_opp_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player2['p_pts_won%_SOS_11to20'] = df_player2.groupby(['p_id','t_surf'])['p_opp_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player2['p_pts_won%_SOS_11to20'] = df_player2['p_pts_won%_SOS_11to20'].fillna(df_player2['p_pts_won%_SOS_1to10'])

df_player2['p_pts_won%_SOS_21to30'] = df_player2.groupby(['p_id','t_surf'])['p_opp_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player2['p_pts_won%_SOS_21to30'] = df_player2['p_pts_won%_SOS_21to30'].fillna(df_player2['p_pts_won%_SOS_11to20'])

df_player2['p_pts_won%_SOS_31to40'] = df_player2.groupby(['p_id','t_surf'])['p_opp_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player2['p_pts_won%_SOS_31to40'] = df_player2['p_pts_won%_SOS_31to40'].fillna(df_player2['p_pts_won%_SOS_21to30'])

df_player2['p_pts_won%_SOS_41to50'] = df_player2.groupby(['p_id','t_surf'])['p_opp_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player2['p_pts_won%_SOS_41to50'] = df_player2['p_pts_won%_SOS_41to50'].fillna(df_player2['p_pts_won%_SOS_31to40'])

df_player2['p_pts_won%_SOS_51to60'] = df_player2.groupby(['p_id','t_surf'])['p_opp_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player2['p_pts_won%_SOS_51to60'] = df_player2['p_pts_won%_SOS_51to60'].fillna(df_player2['p_pts_won%_SOS_41to50'])

df_player2['p_pts_won%_SOS_61to70'] = df_player2.groupby(['p_id','t_surf'])['p_opp_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
df_player2['p_pts_won%_SOS_61to70'] = df_player2['p_pts_won%_SOS_61to70'].fillna(df_player2['p_pts_won%_SOS_51to60'])

#df_player2['p_pts_won%_SOS_71to80'] = df_player2.groupby(['p_id','t_surf'])['p_opp_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
#df_player2['p_pts_won%_SOS_71to80'] = df_player2['p_pts_won%_SOS_71to80'].fillna(df_player2['p_pts_won%_SOS_61to70'])

df_player2 = df_player2.iloc[::-1]
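The block-wise windows above can be verified on a single toy series. This sketch assumes chronological order (oldest first) and uses made-up values 1 to 25 in place of opponent ratings. It shows how `shift` selects the 10-match block and how the fallback fill behaves early in a player's history.

```python
import pandas as pd

# 25 toy opponent ratings for one player, oldest first (illustrative values 1..25).
s = pd.Series(range(1, 26), dtype=float)

# Mean over the 10 matches immediately before each match (matches 1-10 back).
block_1to10 = s.rolling(window=10, min_periods=1).mean().shift(1)

# Mean over matches 11-20 back: same window, shifted a further 10 rows.
block_11to20 = s.rolling(window=10, min_periods=3).mean().shift(11)

# Where the older block is not available yet, fall back to the more recent block.
block_11to20 = block_11to20.fillna(block_1to10)

print(pd.DataFrame({"1to10": block_1to10, "11to20": block_11to20}).tail())
```

For the last row, the 1-10 block averages the values 15 to 24 and the 11-20 block averages the values 5 to 14, so neither block includes the current match.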


In [48]:
# Decay weights the SOS calculation at each match to be predicted on, and frames it as the expected points% given up by the player's opponents over the last 60 surface-specific matches (we will contrast this directly with the player's ACTUAL performance over the last 60 prior to that match to be predicted on).
# Core version

df_player2["p_expected_opp_yield_pts%"] = (100 - (((df_player2['p_pts_won%_SOS_1to10'] * 14) + (df_player2['p_pts_won%_SOS_11to20'] * 8) + (df_player2['p_pts_won%_SOS_21to30'] * 5) 
+ (df_player2['p_pts_won%_SOS_31to40'] * 3) + (df_player2['p_pts_won%_SOS_41to50'] * 2) + (df_player2['p_pts_won%_SOS_51to60'] * 1))/33)).round(2)

# Drops transient columns
df_player2.drop(["p_pts_won%_SOS_11to20","p_pts_won%_SOS_21to30","p_pts_won%_SOS_31to40","p_pts_won%_SOS_41to50","p_pts_won%_SOS_51to60"],axis=1, inplace=True)
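As a worked example of the weighting in this cell, the six block means can be combined by hand. The weights (14, 8, 5, 3, 2, 1) are those used above; the block means are illustrative assumptions.

```python
# Block means of opponents' pts-won%, most recent block first (illustrative values).
block_means = [51.0, 50.0, 49.0, 50.0, 52.0, 48.0]

# Block weights used in the cell above; they sum to 33.
weights = [14, 8, 5, 3, 2, 1]

# Decay-weighted average of what the opponents themselves had been winning.
weighted_opp_pts_won = sum(m * w for m, w in zip(block_means, weights)) / sum(weights)

# Expressed as the share of points this schedule of opponents is expected to give up.
expected_opp_yield = round(100 - weighted_opp_pts_won, 2)
print(expected_opp_yield)
```

Because the most recent block carries 14 of the 33 weight units, the last ten opponents account for a little over 40% of the expected-yield figure.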

In [49]:
# Calculates mean opponent performance per surface. We will use these to factor player l60 performance based on opponent 
# l60 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean

mean_clay_SOS1 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_pts_won%_l60_decay'].mean()
mean_clay_SOS1 = 100 - mean_clay_SOS1 #we want in terms of pct pts the field ALLOWS on average
#mean_hard_SOS1 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_pts_won%_l60_decay'].mean()
#mean_hard_SOS1 = 100 - mean_hard_SOS1 #we want in terms of pct pts the field ALLOWS on average
mean_clay_SOS1

49.75878188626242

In [51]:
# Puts together the above: factors the player's actual performance over the last 60 by their schedule of opponents' aggregate performance over THEIR l60 prior to when they faced the player.
# An adjustment proportional to the opponents' deviation from the field mean performs better than the various boosted or blunted versions attempted
df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_pts_won%_l60_decay"] = ((df_player2["p_pts_won%_l60_decay"])*(mean_clay_SOS1/df_player2["p_expected_opp_yield_pts%"])).round(2)          
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_pts_won%_l60_decay"] = ((df_player2["p_pts_won%_l60_decay"])*(mean_hard_SOS1/df_player2["p_expected_opp_yield_pts%"])).round(2)

df_player2.loc[(df_player2["p_SOS_adj_pts_won%_l60_decay"] > 100), "p_SOS_adj_pts_won%_l60_decay"] = df_player2["p_SOS_adj_pts_won%_l60_decay"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modelling at all, as these matches will be filtered out)

#df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_pts_won%_l60_decay"] = ((df_player2["p_pts_won%_l60_decay"])*(np.cbrt(mean_clay_SOS1/df_player2["p_expected_opp_yield_pts%"]))).round(2)          
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_pts_won%_l60_decay"] = ((df_player2["p_pts_won%_l60_decay"])*(np.cbrt(mean_hard_SOS1/df_player2["p_expected_opp_yield_pts%"]))).round(2)

#df_player2.loc[(df_player2["p_SOS_adj_pts_won%_l60_decay"] > df_player2["p_pts_won%_l60_decay"]), "p_SOS_adj_pts_won%_l60_decay"] = df_player2["p_SOS_adj_pts_won%_l60_decay"] + 0.25 
#df_player2.loc[(df_player2["p_SOS_adj_pts_won%_l60_decay"] < df_player2["p_pts_won%_l60_decay"]), "p_SOS_adj_pts_won%_l60_decay"] = df_player2["p_SOS_adj_pts_won%_l60_decay"] - 0.25
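The proportional adjustment can be illustrated with two hypothetical schedules. In this sketch, `sos_adjust` is a hypothetical helper (not part of the notebook), the field-average yield is the clay value printed above, rounded to two decimals, and the player numbers are made up.

```python
# Field-average yield taken from the output above (rounded); other numbers are illustrative.
mean_clay_yield = 49.76


def sos_adjust(p_pts_won, expected_opp_yield, field_mean_yield=mean_clay_yield):
    """Scale a player's pts-won% by field-average yield over their opponents' expected yield."""
    return round(p_pts_won * (field_mean_yield / expected_opp_yield), 2)


# Same raw performance, different schedules:
tough = sos_adjust(52.0, 48.0)  # opponents give up fewer points than the field
soft = sos_adjust(52.0, 52.0)   # opponents give up more points than the field
print(tough, soft)
```

A player who won 52% of points against stingier-than-average opponents is adjusted upward, while the same 52% against a generous schedule is adjusted downward.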

In [52]:
#Calculates % total points won 'Strength of Schedule' for the past 60 opponents of a given player in a given match
#This version respects INDOOR AND OUTDOOR distinction
#Uses each opponent's decay-weighted last 60 match performance prior to facing the player of interest (surface-specific)
#With this, we can obtain the "expected" performance over the last 60 matches for the player of interest, then can SOS adjust
#that player's performance over their last 60 on the surface to reflect how much above or below an average schedule they faced (see calculations below)

df_player2 = df_player2.iloc[::-1]

df_player2['p_pts_won%_SOS_1to10_IO'] = df_player2.groupby(['p_id','t_surf','t_indoor'])['p_opp_pts_won%_l60_decay_IO'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player2['p_pts_won%_SOS_11to20_IO'] = df_player2.groupby(['p_id','t_surf','t_indoor'])['p_opp_pts_won%_l60_decay_IO'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player2['p_pts_won%_SOS_11to20_IO'] = df_player2['p_pts_won%_SOS_11to20_IO'].fillna(df_player2['p_pts_won%_SOS_1to10_IO'])

df_player2['p_pts_won%_SOS_21to30_IO'] = df_player2.groupby(['p_id','t_surf','t_indoor'])['p_opp_pts_won%_l60_decay_IO'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player2['p_pts_won%_SOS_21to30_IO'] = df_player2['p_pts_won%_SOS_21to30_IO'].fillna(df_player2['p_pts_won%_SOS_11to20_IO'])

df_player2['p_pts_won%_SOS_31to40_IO'] = df_player2.groupby(['p_id','t_surf','t_indoor'])['p_opp_pts_won%_l60_decay_IO'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player2['p_pts_won%_SOS_31to40_IO'] = df_player2['p_pts_won%_SOS_31to40_IO'].fillna(df_player2['p_pts_won%_SOS_21to30_IO'])

df_player2['p_pts_won%_SOS_41to50_IO'] = df_player2.groupby(['p_id','t_surf','t_indoor'])['p_opp_pts_won%_l60_decay_IO'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player2['p_pts_won%_SOS_41to50_IO'] = df_player2['p_pts_won%_SOS_41to50_IO'].fillna(df_player2['p_pts_won%_SOS_31to40_IO'])

df_player2['p_pts_won%_SOS_51to60_IO'] = df_player2.groupby(['p_id','t_surf','t_indoor'])['p_opp_pts_won%_l60_decay_IO'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player2['p_pts_won%_SOS_51to60_IO'] = df_player2['p_pts_won%_SOS_51to60_IO'].fillna(df_player2['p_pts_won%_SOS_41to50_IO'])

#df_player2['p_pts_won%_SOS_61to70_IO'] = df_player2.groupby(['p_id','t_surf','t_indoor'])['p_opp_pts_won%_l60_decay_IO'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player2['p_pts_won%_SOS_61to70_IO'] = df_player2['p_pts_won%_SOS_61to70_IO'].fillna(df_player2['p_pts_won%_SOS_51to60_IO'])

#df_player2['p_pts_won%_SOS_71to80'] = df_player2.groupby(['p_id','t_surf'])['p_opp_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
#df_player2['p_pts_won%_SOS_71to80'] = df_player2['p_pts_won%_SOS_71to80'].fillna(df_player2['p_pts_won%_SOS_61to70'])

df_player2 = df_player2.iloc[::-1]

In [53]:
# Decay-weights the SOS calculation at each match to be predicted on, and frames it as the expected points% given up by the player's opponents over the last 60 surface-specific matches (we will contrast this directly with the player's ACTUAL performance over the last 60 prior to that match to be predicted on).
#This version respects INDOOR AND OUTDOOR distinction

df_player2["p_expected_opp_yield_pts%_IO"] = (100 - (((df_player2['p_pts_won%_SOS_1to10_IO'] * 14) + (df_player2['p_pts_won%_SOS_11to20_IO'] * 8) + (df_player2['p_pts_won%_SOS_21to30_IO'] * 5) 
+ (df_player2['p_pts_won%_SOS_31to40_IO'] * 3) + (df_player2['p_pts_won%_SOS_41to50_IO'] * 2) + (df_player2['p_pts_won%_SOS_51to60_IO'] * 1))/33)).round(2)

# Drops transient columns
df_player2.drop(["p_pts_won%_SOS_1to10_IO", "p_pts_won%_SOS_11to20_IO","p_pts_won%_SOS_21to30_IO","p_pts_won%_SOS_31to40_IO","p_pts_won%_SOS_41to50_IO","p_pts_won%_SOS_51to60_IO"],axis=1, inplace=True)
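
As a hedged illustration of the decay weighting in this cell, the sketch below applies the same weights (14, 8, 5, 3, 2, 1, summing to 33) to six window columns. The short column names and the function `expected_opp_yield` are hypothetical stand-ins for the notebook's longer `p_pts_won%_SOS_*` columns.

```python
import pandas as pd

# Weights copied from the cell above: most recent 10-match block first.
WEIGHTS = {"1to10": 14, "11to20": 8, "21to30": 5, "31to40": 3, "41to50": 2, "51to60": 1}

def expected_opp_yield(windows: pd.DataFrame) -> pd.Series:
    """100 minus the weighted mean of the six opponent-strength windows."""
    total = sum(WEIGHTS.values())  # 14 + 8 + 5 + 3 + 2 + 1 = 33
    weighted = sum(windows[name] * w for name, w in WEIGHTS.items())
    return (100 - weighted / total).round(2)

# Row 0: opponents got stronger recently. Row 1: a perfectly average schedule.
demo = pd.DataFrame({
    "1to10": [52.0, 50.0],
    "11to20": [51.0, 50.0],
    "21to30": [50.0, 50.0],
    "31to40": [49.0, 50.0],
    "41to50": [48.0, 50.0],
    "51to60": [47.0, 50.0],
})
yield_pct = expected_opp_yield(demo)
print(yield_pct.tolist())
```

In row 0 the unweighted mean of the six windows is 49.5, but the weighted mean is pulled toward the most recent block, so the expected yield drops below 50.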

In [54]:
# Calculates mean opponent performance per surface. We will use these to factor player l60 performance based on opponent 
# l60 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean
#This version respects INDOOR AND OUTDOOR distinction

mean_clay_outdoor_SOS1 = df_player2.loc[(df_player2['t_surf'] == "Clay") & (df_player2['t_indoor'] == 0), 'p_opp_pts_won%_l60_decay_IO'].mean()
mean_clay_outdoor_SOS1 = 100 - mean_clay_outdoor_SOS1 #we want in terms of pct pts the field ALLOWS on average
mean_clay_indoor_SOS1 = df_player2.loc[(df_player2['t_surf'] == "Clay") & (df_player2['t_indoor'] == 1), 'p_opp_pts_won%_l60_decay_IO'].mean()
mean_clay_indoor_SOS1 = 100 - mean_clay_indoor_SOS1 #we want in terms of pct pts the field ALLOWS on average

#mean_hard_outdoor_SOS1 = df_player2.loc[(df_player2['t_surf'] == "Hard") & (df_player2['t_indoor'] == 0), 'p_opp_pts_won%_l60_decay_IO'].mean()
#mean_hard_outdoor_SOS1 = 100 - mean_hard_outdoor_SOS1 #we want in terms of pct pts the field ALLOWS on average
#mean_hard_indoor_SOS1 = df_player2.loc[(df_player2['t_surf'] == "Hard") & (df_player2['t_indoor'] == 1), 'p_opp_pts_won%_l60_decay_IO'].mean()
#mean_hard_indoor_SOS1 = 100 - mean_hard_indoor_SOS1 #we want in terms of pct pts the field ALLOWS on average

mean_clay_outdoor_SOS1, mean_clay_indoor_SOS1

(49.744519513991136, 48.72746388443018)

In [55]:
# Puts together the above: factors the player's actual performance over the last 60 by the schedule of opponents' aggregate performance over THEIR l60 prior to when they faced the player.
# Adjustment proportional to opponents' deviation from field mean performs better than various boosted or blunted versions attempted 
#This version respects INDOOR AND OUTDOOR distinction

df_player2.loc[(df_player2["t_surf"] == "Clay") & (df_player2['t_indoor'] == 0), "p_SOS_adj_pts_won%_l60_decay_IO"] = ((df_player2["p_pts_won%_l60_decay_IO"])*(mean_clay_outdoor_SOS1/df_player2["p_expected_opp_yield_pts%_IO"])).round(2)          
df_player2.loc[(df_player2["t_surf"] == "Clay") & (df_player2['t_indoor'] == 1), "p_SOS_adj_pts_won%_l60_decay_IO"] = ((df_player2["p_pts_won%_l60_decay_IO"])*(mean_clay_indoor_SOS1/df_player2["p_expected_opp_yield_pts%_IO"])).round(2)          
#df_player2.loc[(df_player2["t_surf"] == "Hard") & (df_player2['t_indoor'] == 0), "p_SOS_adj_pts_won%_l60_decay_IO"] = ((df_player2["p_pts_won%_l60_decay_IO"])*(mean_hard_outdoor_SOS1/df_player2["p_expected_opp_yield_pts%_IO"])).round(2)          
#df_player2.loc[(df_player2["t_surf"] == "Hard") & (df_player2['t_indoor'] == 1), "p_SOS_adj_pts_won%_l60_decay_IO"] = ((df_player2["p_pts_won%_l60_decay_IO"])*(mean_hard_indoor_SOS1/df_player2["p_expected_opp_yield_pts%_IO"])).round(2)          

df_player2.loc[(df_player2["p_SOS_adj_pts_won%_l60_decay_IO"] > 100), "p_SOS_adj_pts_won%_l60_decay_IO"] = df_player2["p_SOS_adj_pts_won%_l60_decay_IO"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modelling at all, as these matches will be filtered out)
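
The proportional adjustment in this cell can be read as `adjusted = actual * (field mean yield / expected opponent yield)`. The following is a minimal sketch with made-up numbers; `sos_adjust` and its argument names are hypothetical and not part of the notebook.

```python
import pandas as pd

def sos_adjust(actual_pct, expected_opp_yield_pct, field_mean_yield_pct):
    """Scale actual performance by how stingy the faced opponents were relative to the field."""
    return (actual_pct * (field_mean_yield_pct / expected_opp_yield_pct)).round(2)

# Three players with the same raw points-won%, facing different schedules.
actual = pd.Series([52.0, 52.0, 52.0])
expected_yield = pd.Series([48.0, 50.0, 52.0])  # tougher, average, weaker schedule
adjusted = sos_adjust(actual, expected_yield, field_mean_yield_pct=50.0)
print(adjusted.tolist())
```

A player whose opponents were expected to give up fewer points than the field average is adjusted upward, an average schedule leaves the value unchanged, and a soft schedule is adjusted downward.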


In [56]:
# New SOS-adjusted last 60 points won% feature weighting overall surface-specific performance and indoor or outdoor-specific performance

df_player2["p_SOS_adj_pts_won%_l60_decay_IO_weighted"] = (((df_player2["p_SOS_adj_pts_won%_l60_decay"]*1) + (df_player2["p_SOS_adj_pts_won%_l60_decay_IO"]*1))/2).round(2)

In [57]:
# SOS adjustment for total points% for just the last 10 matches (recent performance)

df_player2["p_expected_opp_yield_pts%_l10"] = (100 - df_player2['p_pts_won%_SOS_1to10'])

# Calculates mean opponent performance per surface. We will use these to factor player l10 performance based on opponent 
# l10 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean
mean_clay_SOS1 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_pts_won%_l10'].mean()
mean_clay_SOS1 = 100 - mean_clay_SOS1 #we want in terms of pct pts the field ALLOWS on average
#mean_hard_SOS1 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_pts_won%_l10'].mean()
#mean_hard_SOS1 = 100 - mean_hard_SOS1 #we want in terms of pct pts the field ALLOWS on average
mean_clay_SOS1

# Puts together the above: factors the player's actual performance over the last 10 by the schedule of opponents' aggregate performance over THEIR l10 prior to when they faced the player.
# Adjustment proportional to opponents' deviation from field mean performs better than various boosted or blunted versions attempted 
df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_pts_won%_l10"] = ((df_player2["p_pts_won%_l10"])*(mean_clay_SOS1/df_player2["p_expected_opp_yield_pts%_l10"])).round(2)          
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_pts_won%_l10"] = ((df_player2["p_pts_won%_l10"])*(mean_hard_SOS1/df_player2["p_expected_opp_yield_pts%_l10"])).round(2)

df_player2.loc[(df_player2["p_SOS_adj_pts_won%_l10"] > 100), "p_SOS_adj_pts_won%_l10"] = df_player2["p_SOS_adj_pts_won%_l10"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modelling at all, as these matches will be filtered out)

df_player2.drop(["p_pts_won%_SOS_1to10"],axis=1, inplace=True)

In [None]:
#Save to review
#df_player2.to_csv('../data/df_player2b.csv', index=False)

In [58]:
#Calculates % SERVE points won 'Strength of Schedule' for the past 60 opponents of a given player in a given match
#Uses each opponent's decay-weighted last 60 match performance prior to facing the player of interest (surface-specific)
#With this, we can obtain the "expected" performance over the last 60 matches for the player of interest, then can SOS adjust
#that player's performance over their last 60 on the surface to reflect how much above or below an average schedule they faced (see calculations below)

df_player2 = df_player2.iloc[::-1]

df_player2['p_sv_pts_won%_SOS_1to10'] = df_player2.groupby(['p_id','t_surf'])['p_opp_sv_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player2['p_sv_pts_won%_SOS_11to20'] = df_player2.groupby(['p_id','t_surf'])['p_opp_sv_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player2['p_sv_pts_won%_SOS_11to20'] = df_player2['p_sv_pts_won%_SOS_11to20'].fillna(df_player2['p_sv_pts_won%_SOS_1to10'])

df_player2['p_sv_pts_won%_SOS_21to30'] = df_player2.groupby(['p_id','t_surf'])['p_opp_sv_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player2['p_sv_pts_won%_SOS_21to30'] = df_player2['p_sv_pts_won%_SOS_21to30'].fillna(df_player2['p_sv_pts_won%_SOS_11to20'])

df_player2['p_sv_pts_won%_SOS_31to40'] = df_player2.groupby(['p_id','t_surf'])['p_opp_sv_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player2['p_sv_pts_won%_SOS_31to40'] = df_player2['p_sv_pts_won%_SOS_31to40'].fillna(df_player2['p_sv_pts_won%_SOS_21to30'])

df_player2['p_sv_pts_won%_SOS_41to50'] = df_player2.groupby(['p_id','t_surf'])['p_opp_sv_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player2['p_sv_pts_won%_SOS_41to50'] = df_player2['p_sv_pts_won%_SOS_41to50'].fillna(df_player2['p_sv_pts_won%_SOS_31to40'])

df_player2['p_sv_pts_won%_SOS_51to60'] = df_player2.groupby(['p_id','t_surf'])['p_opp_sv_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player2['p_sv_pts_won%_SOS_51to60'] = df_player2['p_sv_pts_won%_SOS_51to60'].fillna(df_player2['p_sv_pts_won%_SOS_41to50'])

#df_player2['p_sv_pts_won%_SOS_61to70'] = df_player2.groupby(['p_id','t_surf'])['p_opp_sv_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player2['p_sv_pts_won%_SOS_61to70'] = df_player2['p_sv_pts_won%_SOS_61to70'].fillna(df_player2['p_sv_pts_won%_SOS_51to60'])

df_player2 = df_player2.iloc[::-1]


In [59]:
# Decay-weights the SOS calculation at each match to be predicted on, and frames it as the expected RETURN PTS YIELD given up by the player's opponents over the last 60 surface-specific matches (we will contrast this directly with the player's ACTUAL performance over the last 60 prior to that match to be predicted on).

df_player2["p_expected_opp_yield_ret_pts%"] = (100 - (((df_player2['p_sv_pts_won%_SOS_1to10'] * 14) + (df_player2['p_sv_pts_won%_SOS_11to20'] * 8) + (df_player2['p_sv_pts_won%_SOS_21to30'] * 5) 
+ (df_player2['p_sv_pts_won%_SOS_31to40'] * 3) + (df_player2['p_sv_pts_won%_SOS_41to50'] * 2) + (df_player2['p_sv_pts_won%_SOS_51to60'] * 1))/33)).round(2)

# Drops transient columns
df_player2.drop(["p_sv_pts_won%_SOS_11to20","p_sv_pts_won%_SOS_21to30","p_sv_pts_won%_SOS_31to40","p_sv_pts_won%_SOS_41to50","p_sv_pts_won%_SOS_51to60"],axis=1, inplace=True)

In [60]:
# Calculates mean opponent performance per surface. We will use these to factor player l60 performance based on opponent 
# l60 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean

mean_clay_SOS2 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_sv_pts_won%_l60_decay'].mean()
mean_clay_SOS2 = 100 - mean_clay_SOS2 #we want in terms of pct RETURN pts the field ALLOWS on average
#mean_hard_SOS2 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_sv_pts_won%_l60_decay'].mean()
#mean_hard_SOS2 = 100 - mean_hard_SOS2 #we want in terms of pct RETURN pts the field ALLOWS on average
mean_clay_SOS2

37.99821319915747
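
The field-mean step takes the opponents' mean serve points won% on a surface and subtracts it from 100 to get the return points% the field allows. As a hedged alternative to one variable per surface, the sketch below computes the same quantity for every surface at once with a `groupby`; the frame and column names are invented for illustration.

```python
import pandas as pd

# Illustrative values only; not taken from the notebook's data.
toy = pd.DataFrame({
    "t_surf": ["Clay", "Clay", "Hard", "Hard"],
    "opp_sv_pts_won_pct": [61.0, 63.0, 64.0, 66.0],
})

# Return points% the field allows = 100 - serve points% the field wins.
field_ret_yield = 100 - toy.groupby("t_surf")["opp_sv_pts_won_pct"].mean()
print(field_ret_yield.to_dict())
```

The result is a Series indexed by surface, so `field_ret_yield["Clay"]` plays the role of a per-surface constant such as `mean_clay_SOS2`.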

In [61]:
# Puts together the above: factors the player's actual performance over the last 60 by the schedule of opponents' aggregate performance over THEIR l60 prior to when they faced the player.

df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_ret_pts_won%_l60_decay"] = ((df_player2["p_ret_pts_won%_l60_decay"])*(mean_clay_SOS2/df_player2["p_expected_opp_yield_ret_pts%"])).round(2)
                
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_ret_pts_won%_l60_decay"] = ((df_player2["p_ret_pts_won%_l60_decay"])*(mean_hard_SOS2/df_player2["p_expected_opp_yield_ret_pts%"])).round(2)

df_player2.loc[(df_player2["p_SOS_adj_ret_pts_won%_l60_decay"] > 100), "p_SOS_adj_ret_pts_won%_l60_decay"] = df_player2["p_SOS_adj_ret_pts_won%_l60_decay"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modeling at all, as these matches will be filtered out)

In [62]:
# SOS adjustment for return points% (OPPONENT RETURN PTS% YIELD) for just the last 10 matches (recent performance)
df_player2["p_expected_opp_yield_ret_pts%_l10"] = (100 - df_player2['p_sv_pts_won%_SOS_1to10'])

# Calculates mean opponent performance per surface. We will use these to factor player l10 performance based on opponent 
# l10 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean
mean_clay_SOS2 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_sv_pts_won%_l10'].mean()
mean_clay_SOS2 = 100 - mean_clay_SOS2 #we want in terms of pct pts the field ALLOWS on average
#mean_hard_SOS2 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_sv_pts_won%_l10'].mean()
#mean_hard_SOS2 = 100 - mean_hard_SOS2 #we want in terms of pct pts the field ALLOWS on average
mean_clay_SOS2

# Puts together the above: factors the player's actual performance over the last 10 by the schedule of opponents' aggregate performance over THEIR l10 prior to when they faced the player.
# Adjustment proportional to opponents' deviation from field mean performs better than various boosted or blunted versions attempted 
df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_ret_pts_won%_l10"] = ((df_player2["p_ret_pts_won%_l10"])*(mean_clay_SOS2/df_player2["p_expected_opp_yield_ret_pts%_l10"])).round(2)          
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_ret_pts_won%_l10"] = ((df_player2["p_ret_pts_won%_l10"])*(mean_hard_SOS2/df_player2["p_expected_opp_yield_ret_pts%_l10"])).round(2)

df_player2.loc[(df_player2["p_SOS_adj_ret_pts_won%_l10"] > 100), "p_SOS_adj_ret_pts_won%_l10"] = df_player2["p_SOS_adj_ret_pts_won%_l10"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modelling at all, as these matches will be filtered out)

df_player2.drop(["p_sv_pts_won%_SOS_1to10"],axis=1, inplace=True)

In [63]:
#Calculates % RETURN points won 'Strength of Schedule' for the past 60 opponents of a given player in a given match
#Uses each opponent's decay-weighted last 60 match performance prior to facing the player of interest (surface-specific)
#With this, we can obtain the "expected" performance over the last 60 matches for the player of interest, then can SOS adjust
#that player's performance over their last 60 on the surface to reflect how much above or below an average schedule they faced (see calculations below)

df_player2 = df_player2.iloc[::-1]

df_player2['p_ret_pts_won%_SOS_1to10'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ret_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player2['p_ret_pts_won%_SOS_11to20'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ret_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player2['p_ret_pts_won%_SOS_11to20'] = df_player2['p_ret_pts_won%_SOS_11to20'].fillna(df_player2['p_ret_pts_won%_SOS_1to10'])

df_player2['p_ret_pts_won%_SOS_21to30'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ret_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player2['p_ret_pts_won%_SOS_21to30'] = df_player2['p_ret_pts_won%_SOS_21to30'].fillna(df_player2['p_ret_pts_won%_SOS_11to20'])

df_player2['p_ret_pts_won%_SOS_31to40'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ret_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player2['p_ret_pts_won%_SOS_31to40'] = df_player2['p_ret_pts_won%_SOS_31to40'].fillna(df_player2['p_ret_pts_won%_SOS_21to30'])

df_player2['p_ret_pts_won%_SOS_41to50'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ret_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player2['p_ret_pts_won%_SOS_41to50'] = df_player2['p_ret_pts_won%_SOS_41to50'].fillna(df_player2['p_ret_pts_won%_SOS_31to40'])

df_player2['p_ret_pts_won%_SOS_51to60'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ret_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player2['p_ret_pts_won%_SOS_51to60'] = df_player2['p_ret_pts_won%_SOS_51to60'].fillna(df_player2['p_ret_pts_won%_SOS_41to50'])

#df_player2['p_ret_pts_won%_SOS_61to70'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ret_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player2['p_ret_pts_won%_SOS_61to70'] = df_player2['p_ret_pts_won%_SOS_61to70'].fillna(df_player2['p_ret_pts_won%_SOS_51to60'])

df_player2 = df_player2.iloc[::-1]


In [64]:
# Decay-weights the SOS calculation at each match to be predicted on, and frames it as the expected SERVE PTS YIELD given up by the player's opponents over the last 60 surface-specific matches (we will contrast this directly with the player's ACTUAL performance over the last 60 prior to that match to be predicted on).

df_player2["p_expected_opp_yield_sv_pts%"] = (100 - (((df_player2['p_ret_pts_won%_SOS_1to10'] * 14) + (df_player2['p_ret_pts_won%_SOS_11to20'] * 8) + (df_player2['p_ret_pts_won%_SOS_21to30'] * 5) 
+ (df_player2['p_ret_pts_won%_SOS_31to40'] * 3) + (df_player2['p_ret_pts_won%_SOS_41to50'] * 2) + (df_player2['p_ret_pts_won%_SOS_51to60'] * 1))/33)).round(2)

# Drops transient columns
df_player2.drop(["p_ret_pts_won%_SOS_11to20","p_ret_pts_won%_SOS_21to30","p_ret_pts_won%_SOS_31to40","p_ret_pts_won%_SOS_41to50","p_ret_pts_won%_SOS_51to60"],axis=1, inplace=True)

In [65]:
# Calculates mean opponent performance per surface. We will use these to factor player l60 performance based on opponent 
# l60 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean

mean_clay_SOS3 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_ret_pts_won%_l60_decay'].mean()
mean_clay_SOS3 = 100 - mean_clay_SOS3 #we want in terms of pct SERVE pts the field ALLOWS on average
#mean_hard_SOS3 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_ret_pts_won%_l60_decay'].mean()
#mean_hard_SOS3 = 100 - mean_hard_SOS3 #we want in terms of pct SERVE pts the field ALLOWS on average
mean_clay_SOS3

61.49536449801049

In [67]:
# Puts together the above: factors the player's actual performance over the last 60 by the schedule of opponents' aggregate performance over THEIR l60 prior to when they faced the player.

df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_sv_pts_won%_l60_decay"] = ((df_player2["p_sv_pts_won%_l60_decay"])*(mean_clay_SOS3/df_player2["p_expected_opp_yield_sv_pts%"])).round(2)
                
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_sv_pts_won%_l60_decay"] = ((df_player2["p_sv_pts_won%_l60_decay"])*(mean_hard_SOS3/df_player2["p_expected_opp_yield_sv_pts%"])).round(2)

df_player2.loc[(df_player2["p_SOS_adj_sv_pts_won%_l60_decay"] > 100), "p_SOS_adj_sv_pts_won%_l60_decay"] = df_player2["p_SOS_adj_sv_pts_won%_l60_decay"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modeling at all, as these matches will be filtered out)

In [68]:
# SOS adjustment for serve points% (OPPONENT SERVE PTS% YIELD) for just the last 10 matches (recent performance)
df_player2["p_expected_opp_yield_sv_pts%_l10"] = (100 - df_player2['p_ret_pts_won%_SOS_1to10'])

# Calculates mean opponent performance per surface. We will use these to factor player l10 performance based on opponent 
# l10 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean
mean_clay_SOS3 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_ret_pts_won%_l10'].mean()
mean_clay_SOS3 = 100 - mean_clay_SOS3 #we want in terms of pct pts the field ALLOWS on average
#mean_hard_SOS3 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_ret_pts_won%_l10'].mean()
#mean_hard_SOS3 = 100 - mean_hard_SOS3 #we want in terms of pct pts the field ALLOWS on average
mean_clay_SOS3

# Puts together the above: factors the player's actual performance over the last 10 by the schedule of opponents' aggregate performance over THEIR l10 prior to when they faced the player.
# Adjustment proportional to opponents' deviation from field mean performs better than various boosted or blunted versions attempted 
df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_sv_pts_won%_l10"] = ((df_player2["p_sv_pts_won%_l10"])*(mean_clay_SOS3/df_player2["p_expected_opp_yield_sv_pts%_l10"])).round(2)          
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_sv_pts_won%_l10"] = ((df_player2["p_sv_pts_won%_l10"])*(mean_hard_SOS3/df_player2["p_expected_opp_yield_sv_pts%_l10"])).round(2)

df_player2.loc[(df_player2["p_SOS_adj_sv_pts_won%_l10"] > 100), "p_SOS_adj_sv_pts_won%_l10"] = df_player2["p_SOS_adj_sv_pts_won%_l10"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modelling at all, as these matches will be filtered out)

df_player2.drop(["p_ret_pts_won%_SOS_1to10"],axis=1, inplace=True)

In [None]:
#Save to review
#df_player2.to_csv('../data/df_player2c.csv', index=False)

In [69]:
#Calculates ace % 'Strength of Schedule' for the past 60 opponents of a given player in a given match
#Uses each opponent's decay-weighted last 60 match performance prior to facing the player of interest (surface-specific)
#With this, we can obtain the "expected" performance over the last 60 matches for the player of interest, then can SOS adjust
#that player's performance over their last 60 on the surface to reflect how much above or below an average schedule they faced (see calculations below)

df_player2 = df_player2.iloc[::-1]

df_player2['p_ace%_SOS_1to10'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ace%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player2['p_ace%_SOS_11to20'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ace%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player2['p_ace%_SOS_11to20'] = df_player2['p_ace%_SOS_11to20'].fillna(df_player2['p_ace%_SOS_1to10'])

df_player2['p_ace%_SOS_21to30'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ace%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player2['p_ace%_SOS_21to30'] = df_player2['p_ace%_SOS_21to30'].fillna(df_player2['p_ace%_SOS_11to20'])

df_player2['p_ace%_SOS_31to40'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ace%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player2['p_ace%_SOS_31to40'] = df_player2['p_ace%_SOS_31to40'].fillna(df_player2['p_ace%_SOS_21to30'])

df_player2['p_ace%_SOS_41to50'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ace%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player2['p_ace%_SOS_41to50'] = df_player2['p_ace%_SOS_41to50'].fillna(df_player2['p_ace%_SOS_31to40'])

df_player2['p_ace%_SOS_51to60'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ace%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player2['p_ace%_SOS_51to60'] = df_player2['p_ace%_SOS_51to60'].fillna(df_player2['p_ace%_SOS_41to50'])

#df_player2['p_ace%_SOS_61to70'] = df_player2.groupby(['p_id','t_surf'])['p_opp_ace%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player2['p_ace%_SOS_61to70'] = df_player2['p_ace%_SOS_61to70'].fillna(df_player2['p_ace%_SOS_51to60'])

df_player2 = df_player2.iloc[::-1]


In [70]:
# Decay weights the SOS calculation at each match to be predicted on, and frames as expected ACED% YIELD by player's opponents over the last 60 surface-specific matches (we will contrast directly to the player's ACTUAL performance over the last 60 prior to that match to be predicted on). 

df_player2["p_expected_opp_yield_aced%"] = (((df_player2['p_ace%_SOS_1to10'] * 14) + (df_player2['p_ace%_SOS_11to20'] * 8) + (df_player2['p_ace%_SOS_21to30'] * 5) 
+ (df_player2['p_ace%_SOS_31to40'] * 3) + (df_player2['p_ace%_SOS_41to50'] * 2) + (df_player2['p_ace%_SOS_51to60'] * 1))/33).round(2)

# Drops transient columns
df_player2.drop(["p_ace%_SOS_11to20","p_ace%_SOS_21to30","p_ace%_SOS_31to40","p_ace%_SOS_41to50","p_ace%_SOS_51to60"],axis=1, inplace=True)

In [71]:
# Calculates mean opponent performance per surface. We will use these to factor player l60 performance based on opponent 
# l60 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean

mean_clay_SOS4 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_ace%_l60_decay'].mean()
mean_clay_SOS4 = mean_clay_SOS4 #we want in terms of pct ACED the field ALLOWS on average
#mean_hard_SOS4 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_ace%_l60_decay'].mean()
#mean_hard_SOS4 = mean_hard_SOS4 #we want in terms of pct ACED the field ALLOWS on average
mean_clay_SOS4

5.479813362976831

In [72]:
# Puts together the above: factors the player's actual performance over the last 60 by the schedule of opponents' aggregate performance over THEIR l60 prior to when they faced the player.
 
df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_aced%_l60_decay"] = ((df_player2["p_aced%_l60_decay"])*(mean_clay_SOS4/df_player2["p_expected_opp_yield_aced%"])).round(2)
                
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_aced%_l60_decay"] = ((df_player2["p_aced%_l60_decay"])*(mean_hard_SOS4/df_player2["p_expected_opp_yield_aced%"])).round(2)

df_player2["p_SOS_adj_aced%_l60_decay"] = df_player2["p_SOS_adj_aced%_l60_decay"].replace(np.inf, np.nan) #deals with a few infs in first handful of matches in sample where there is no SOS (divide by zero errors in the above); assigned back rather than replaced in place on a column selection, which newer pandas versions may not propagate to the frame
df_player2.loc[(df_player2["p_SOS_adj_aced%_l60_decay"] > 100), "p_SOS_adj_aced%_l60_decay"] = df_player2["p_SOS_adj_aced%_l60_decay"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modeling at all, as these matches will be filtered out)
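
The two clean-up steps for the ace-based features (infinities from a zero expected yield, then values above 100) can be seen on a toy Series. This is a sketch with invented numbers; note that, as in the notebook's cell, the replacement mean is computed before the cap is applied and so still includes the outlier.

```python
import numpy as np
import pandas as pd

# Toy adjusted-ace% values: one inf (division by a zero expected yield),
# one implausibly large value, and one missing value.
adj = pd.Series([5.0, np.inf, 250.0, 7.0, np.nan])

# Step 1: infinities become NaN so they do not poison the mean.
adj = adj.replace([np.inf, -np.inf], np.nan)

# Step 2: values above 100 are replaced with the column mean (NaN is skipped).
col_mean = adj.mean()  # (5 + 250 + 7) / 3
adj = adj.mask(adj > 100, col_mean)
print(adj.tolist())
```

Assigning the result back to the variable (or column) avoids relying on in-place modification of a selected column.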


In [73]:
# SOS adjustment for aced% (OPPONENT ACED% YIELD) for just the last 10 matches (recent performance)
df_player2["p_expected_opp_yield_aced%_l10"] = df_player2['p_ace%_SOS_1to10']

# Calculates mean opponent performance per surface. We will use these to factor player l10 performance based on opponent 
# l10 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean
mean_clay_SOS4 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_ace%_l10'].mean()
mean_clay_SOS4 = mean_clay_SOS4 #we want in terms of pct pts the field ALLOWS on average
#mean_hard_SOS4 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_ace%_l10'].mean()
#mean_hard_SOS4 = mean_hard_SOS4 #we want in terms of pct pts the field ALLOWS on average
mean_clay_SOS4

# Puts together the above: factors the player's actual performance over the last 10 by the schedule of opponents' aggregate performance over THEIR l10 prior to when they faced the player.
# Adjustment proportional to opponents' deviation from field mean performs better than various boosted or blunted versions attempted 
df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_aced%_l10"] = ((df_player2["p_aced%_l10"])*(mean_clay_SOS4/df_player2["p_expected_opp_yield_aced%_l10"])).round(2)          
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_aced%_l10"] = ((df_player2["p_aced%_l10"])*(mean_hard_SOS4/df_player2["p_expected_opp_yield_aced%_l10"])).round(2)

df_player2["p_SOS_adj_aced%_l10"] = df_player2["p_SOS_adj_aced%_l10"].replace(np.inf, np.nan) #deals with a few infs in first handful of matches in sample where there is no SOS (divide by zero errors in the above)
df_player2.loc[(df_player2["p_SOS_adj_aced%_l10"] > 100), "p_SOS_adj_aced%_l10"] = df_player2["p_SOS_adj_aced%_l10"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modelling at all, as these matches will be filtered out)

df_player2.drop(["p_ace%_SOS_1to10"],axis=1, inplace=True)

In [None]:
#Save to review
#df_player2.to_csv('../data/df_player2d.csv', index=False)

In [74]:
#Calculates aced % 'Strength of Schedule' for the past 60 opponents of a given player in a given match
#Uses each opponent's decay-weighted last 60 match performance prior to facing the player of interest (surface-specific)
#With this, we can obtain the "expected" performance over the last 60 matches for the player of interest, then can SOS adjust
#that player's performance over their last 60 on the surface to reflect how much above or below an average schedule they faced (see calculations below)

df_player2 = df_player2.iloc[::-1]

df_player2['p_aced%_SOS_1to10'] = df_player2.groupby(['p_id','t_surf'])['p_opp_aced%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player2['p_aced%_SOS_11to20'] = df_player2.groupby(['p_id','t_surf'])['p_opp_aced%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player2['p_aced%_SOS_11to20'] = df_player2['p_aced%_SOS_11to20'].fillna(df_player2['p_aced%_SOS_1to10'])

df_player2['p_aced%_SOS_21to30'] = df_player2.groupby(['p_id','t_surf'])['p_opp_aced%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player2['p_aced%_SOS_21to30'] = df_player2['p_aced%_SOS_21to30'].fillna(df_player2['p_aced%_SOS_11to20'])

df_player2['p_aced%_SOS_31to40'] = df_player2.groupby(['p_id','t_surf'])['p_opp_aced%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player2['p_aced%_SOS_31to40'] = df_player2['p_aced%_SOS_31to40'].fillna(df_player2['p_aced%_SOS_21to30'])

df_player2['p_aced%_SOS_41to50'] = df_player2.groupby(['p_id','t_surf'])['p_opp_aced%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player2['p_aced%_SOS_41to50'] = df_player2['p_aced%_SOS_41to50'].fillna(df_player2['p_aced%_SOS_31to40'])

df_player2['p_aced%_SOS_51to60'] = df_player2.groupby(['p_id','t_surf'])['p_opp_aced%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player2['p_aced%_SOS_51to60'] = df_player2['p_aced%_SOS_51to60'].fillna(df_player2['p_aced%_SOS_41to50'])

#df_player2['p_aced%_SOS_61to70'] = df_player2.groupby(['p_id','t_surf'])['p_opp_aced%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player2['p_aced%_SOS_61to70'] = df_player2['p_aced%_SOS_61to70'].fillna(df_player2['p_aced%_SOS_51to60'])

df_player2 = df_player2.iloc[::-1]


In [75]:
# Decay weights the SOS calculation at each match to be predicted on, and frames as expected ACE% YIELD by player's opponents over the last 60 surface-specific matches (we will contrast directly to the player's ACTUAL performance over the last 60 prior to that match to be predicted on). 

df_player2["p_expected_opp_yield_ace%"] = (((df_player2['p_aced%_SOS_1to10'] * 14) + (df_player2['p_aced%_SOS_11to20'] * 8) + (df_player2['p_aced%_SOS_21to30'] * 5) 
+ (df_player2['p_aced%_SOS_31to40'] * 3) + (df_player2['p_aced%_SOS_41to50'] * 2) + (df_player2['p_aced%_SOS_51to60'] * 1))/33).round(2)

# Drops transient columns
df_player2.drop(["p_aced%_SOS_11to20","p_aced%_SOS_21to30","p_aced%_SOS_31to40","p_aced%_SOS_41to50","p_aced%_SOS_51to60"],axis=1, inplace=True)

In [76]:
# Calculates mean opponent performance per surface. We will use these to factor player l60 performance based on opponent 
# l60 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean

mean_clay_SOS5 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_aced%_l60_decay'].mean()
mean_clay_SOS5 = mean_clay_SOS5 #we want in terms of pct ACES the field ALLOWS on average
#mean_hard_SOS5 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_aced%_l60_decay'].mean()
#mean_hard_SOS5 = mean_hard_SOS5 #we want in terms of pct ACES the field ALLOWS on average
mean_clay_SOS5

5.2839187924175235

In [77]:
# Puts together the above: factors the player's actual performance over the last 60 by the schedule of opponents' aggregate performance over THEIR l60 prior to when they faced the player.
# If opponents 
df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_ace%_l60_decay"] = ((df_player2["p_ace%_l60_decay"])*(mean_clay_SOS5/df_player2["p_expected_opp_yield_ace%"])).round(2)
                
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_ace%_l60_decay"] = ((df_player2["p_ace%_l60_decay"])*(mean_hard_SOS5/df_player2["p_expected_opp_yield_ace%"])).round(2)

df_player2["p_SOS_adj_ace%_l60_decay"] = df_player2["p_SOS_adj_ace%_l60_decay"].replace(np.inf, np.nan) #deals with a few infs in first handful of matches in sample where there is no SOS (divide by zero errors in the above)
df_player2.loc[(df_player2["p_SOS_adj_ace%_l60_decay"] > 100), "p_SOS_adj_ace%_l60_decay"] = df_player2["p_SOS_adj_ace%_l60_decay"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modeling at all, as these matches will be filtered out)
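
The Strength of Schedule adjustment in the cell above multiplies a player's actual stat by the ratio of the field mean to the expected opponent yield. The sketch below reproduces that ratio on toy data, assuming made-up values; `sos_adjust` and the toy column names are hypothetical and only illustrate the arithmetic and the handling of divide-by-zero rows.

```python
import numpy as np
import pandas as pd

def sos_adjust(actual, expected_opp_yield, field_mean):
    """Scale actual performance by field_mean / expected opponent yield."""
    adjusted = (actual * (field_mean / expected_opp_yield)).round(2)
    # A zero expected yield produces inf; treat those rows as missing.
    return adjusted.replace([np.inf, -np.inf], np.nan)

toy = pd.DataFrame({
    "ace_pct": [8.0, 8.0, 8.0],
    "expected_yield": [4.0, 5.0, 0.0],  # hypothetical opponent aced% values
})
toy["sos_adj"] = sos_adjust(toy["ace_pct"], toy["expected_yield"], field_mean=5.0)
print(toy)
```

In the first row the opponents conceded fewer aces than the field average, so the same raw ace% is credited upward; in the second row the schedule was exactly average and the value is unchanged.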

In [78]:
# SOS adjustment for ace% (OPPONENT ACE% YIELD) for just the last 10 matches (recent performance)
df_player2["p_expected_opp_yield_ace%_l10"] = df_player2['p_aced%_SOS_1to10']

# Calculates mean opponent performance per surface. We will use these to factor player l10 performance based on opponent 
# l10 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean
mean_clay_SOS5 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_aced%_l10'].mean()
mean_clay_SOS5 = mean_clay_SOS5 #we want in terms of pct ACES the field ALLOWS on average
#mean_hard_SOS5 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_aced%_l10'].mean()
#mean_hard_SOS5 = mean_hard_SOS5 #we want in terms of pct ACES the field ALLOWS on average
mean_clay_SOS5

# Puts together the above: factors the player's actual performance over the last 10 by schedule of opponents' aggregate performance over THEIR l10 prior to when they faced the player.
# Adjustment proportional to opponents' deviation from field mean performs better than various boosted or blunted versions attempted 
df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_ace%_l10"] = ((df_player2["p_ace%_l10"])*(mean_clay_SOS5/df_player2["p_expected_opp_yield_ace%_l10"])).round(2)          
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_ace%_l10"] = ((df_player2["p_ace%_l10"])*(mean_hard_SOS5/df_player2["p_expected_opp_yield_ace%_l10"])).round(2)

df_player2["p_SOS_adj_ace%_l10"] = df_player2["p_SOS_adj_ace%_l10"].replace(np.inf, np.nan) #deals with a few infs in first handful of matches in sample where there is no SOS (divide by zero errors in the above)
df_player2.loc[(df_player2["p_SOS_adj_ace%_l10"] > 100), "p_SOS_adj_ace%_l10"] = df_player2["p_SOS_adj_ace%_l10"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modelling at all, as these matches will be filtered out)

df_player2.drop(["p_aced%_SOS_1to10"],axis=1, inplace=True)

In [None]:
#Save to review
#df_player2.to_csv('../data/df_player2e.csv', index=False)

In [79]:
#Calculates break point saved% 'Strength of Schedule' for the past 60 opponents of a given player in a given match
#Uses each opponent's (non decay-weighted) last 60 match performance prior to facing the player of interest (surface-specific)
#With this, we can obtain the "expected" performance over the last 60 matches for the player of interest, then can SOS adjust
#that player's performance over their last 60 on the surface to reflect how much above or below an average schedule they faced (see calculations below)

df_player2 = df_player2.iloc[::-1]

df_player2['p_bp_save%_SOS_1to10'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_save%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player2['p_bp_save%_SOS_11to20'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_save%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player2['p_bp_save%_SOS_11to20'] = df_player2['p_bp_save%_SOS_11to20'].fillna(df_player2['p_bp_save%_SOS_1to10'])

df_player2['p_bp_save%_SOS_21to30'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_save%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player2['p_bp_save%_SOS_21to30'] = df_player2['p_bp_save%_SOS_21to30'].fillna(df_player2['p_bp_save%_SOS_11to20'])

df_player2['p_bp_save%_SOS_31to40'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_save%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player2['p_bp_save%_SOS_31to40'] = df_player2['p_bp_save%_SOS_31to40'].fillna(df_player2['p_bp_save%_SOS_21to30'])

df_player2['p_bp_save%_SOS_41to50'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_save%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player2['p_bp_save%_SOS_41to50'] = df_player2['p_bp_save%_SOS_41to50'].fillna(df_player2['p_bp_save%_SOS_31to40'])

df_player2['p_bp_save%_SOS_51to60'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_save%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player2['p_bp_save%_SOS_51to60'] = df_player2['p_bp_save%_SOS_51to60'].fillna(df_player2['p_bp_save%_SOS_41to50'])

#df_player2['p_bp_save%_SOS_61to70'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_save%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player2['p_bp_save%_SOS_61to70'] = df_player2['p_bp_save%_SOS_61to70'].fillna(df_player2['p_bp_save%_SOS_51to60'])

df_player2 = df_player2.iloc[::-1]
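
The six bucket columns built in the cell above all follow one pattern: a 10-match rolling mean per player and surface, shifted so that it only covers matches before the current one, with gaps filled from the next more recent bucket. The loop below is a compact sketch of that pattern, assuming rows are already in chronological order within each group; `add_sos_buckets`, `opp_stat` and the `sos` prefix are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

def add_sos_buckets(df, src_col, prefix, n_buckets=6, width=10):
    """Add shifted rolling-mean bucket columns (1to10, 11to20, ...) per player/surface."""
    out = df.copy()
    prev = None
    for b in range(n_buckets):
        start = b * width + 1
        name = f"{prefix}_{start}to{start + width - 1}"
        min_periods = 1 if b == 0 else 3  # same thresholds as the cell above
        out[name] = out.groupby(["p_id", "t_surf"])[src_col].transform(
            lambda x, s=start, m=min_periods: x.rolling(window=width, min_periods=m)
            .mean().round(2).shift(s)
        )
        if prev is not None:
            # Fall back to the next more recent bucket when history is too short.
            out[name] = out[name].fillna(out[prev])
        prev = name
    return out

# Toy data: one player, 25 clay matches, opponent stat values 1..25.
toy = pd.DataFrame({"p_id": 1, "t_surf": "Clay", "opp_stat": np.arange(1.0, 26.0)})
res = add_sos_buckets(toy, "opp_stat", "sos", n_buckets=2)
print(res.iloc[[5, 10, 20]])
```

The `shift` keeps the current match out of its own feature, which is what prevents the target match from leaking into the predictor.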

In [80]:
# Decay weights the SOS calculation at each match to be predicted on, and frames as expected BP CONVERT% YIELD by player's opponents over the last 60 surface-specific matches (we will contrast directly to the player's ACTUAL performance over the last 60 prior to that match to be predicted on). 

df_player2["p_expected_opp_yield_bp_convert%"] = (100 - (((df_player2['p_bp_save%_SOS_1to10'] * 14) + (df_player2['p_bp_save%_SOS_11to20'] * 8) + (df_player2['p_bp_save%_SOS_21to30'] * 5) 
+ (df_player2['p_bp_save%_SOS_31to40'] * 3) + (df_player2['p_bp_save%_SOS_41to50'] * 2) + (df_player2['p_bp_save%_SOS_51to60'] * 1))/33)).round(2)

# Drops transient columns
df_player2.drop(["p_bp_save%_SOS_11to20","p_bp_save%_SOS_21to30","p_bp_save%_SOS_31to40","p_bp_save%_SOS_41to50","p_bp_save%_SOS_51to60"],axis=1, inplace=True)

In [81]:
# Calculates mean opponent performance per surface. We will use these to factor player l60 performance based on opponent 
# l60 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean

mean_clay_SOS6 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_bp_save%_l60'].mean()
mean_clay_SOS6 = 100 - mean_clay_SOS6 #we want in terms of pct BREAK CONVERSIONS the field ALLOWS on average
#mean_hard_SOS6 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_bp_save%_l60'].mean()
#mean_hard_SOS6 = 100 - mean_hard_SOS6 #we want in terms of pct BREAK CONVERSIONS the field ALLOWS on average
mean_clay_SOS6

40.96534987128441

In [82]:
# Puts together the above: factors the player's actual performance over the last 60 by schedule of opponents' aggregate performance over THEIR l60 prior to when they faced the player.

df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_bp_convert%_l60"] = ((df_player2["p_bp_convert%_l60"])*(mean_clay_SOS6/df_player2["p_expected_opp_yield_bp_convert%"])).round(2)
                
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_bp_convert%_l60"] = ((df_player2["p_bp_convert%_l60"])*(mean_hard_SOS6/df_player2["p_expected_opp_yield_bp_convert%"])).round(2)

df_player2["p_SOS_adj_bp_convert%_l60"] = df_player2["p_SOS_adj_bp_convert%_l60"].replace(np.inf, np.nan) #deals with a few infs in first handful of matches in sample where there is no SOS (divide by zero errors in the above)
df_player2.loc[(df_player2["p_SOS_adj_bp_convert%_l60"] > 100), "p_SOS_adj_bp_convert%_l60"] = df_player2["p_SOS_adj_bp_convert%_l60"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modeling at all, as these matches will be filtered)

In [83]:
# SOS adjustment for break point conversion% (OPPONENT BP CONVERT% YIELD) for just the last 10 matches (recent performance)
df_player2["p_expected_opp_yield_bp_convert%_l10"] = (100 - df_player2['p_bp_save%_SOS_1to10'])

# Calculates mean opponent performance per surface. We will use these to factor player l10 performance based on opponent 
# l10 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean
mean_clay_SOS6 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_bp_save%_l10'].mean()
mean_clay_SOS6 = 100 - mean_clay_SOS6 #we want in terms of pct BREAK CONVERSIONS the field ALLOWS on average
#mean_hard_SOS6 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_bp_save%_l10'].mean()
#mean_hard_SOS6 = 100 - mean_hard_SOS6 #we want in terms of pct BREAK CONVERSIONS the field ALLOWS on average
mean_clay_SOS6

# Puts together the above: factors the player's actual performance over the last 10 by schedule of opponents' aggregate performance over THEIR l10 prior to when they faced the player.
# Adjustment proportional to opponents' deviation from field mean performs better than various boosted or blunted versions attempted 
df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_bp_convert%_l10"] = ((df_player2["p_bp_convert%_l10"])*(mean_clay_SOS6/df_player2["p_expected_opp_yield_bp_convert%_l10"])).round(2)          
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_bp_convert%_l10"] = ((df_player2["p_bp_convert%_l10"])*(mean_hard_SOS6/df_player2["p_expected_opp_yield_bp_convert%_l10"])).round(2)

df_player2["p_SOS_adj_bp_convert%_l10"] = df_player2["p_SOS_adj_bp_convert%_l10"].replace(np.inf, np.nan) #deals with a few infs in first handful of matches in sample where there is no SOS (divide by zero errors in the above)
df_player2.loc[(df_player2["p_SOS_adj_bp_convert%_l10"] > 100), "p_SOS_adj_bp_convert%_l10"] = df_player2["p_SOS_adj_bp_convert%_l10"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modelling at all, as these matches will be filtered out)

df_player2.drop(["p_bp_save%_SOS_1to10"],axis=1, inplace=True)

In [None]:
#Save to review
#df_player2.to_csv('../data/df_player2f.csv', index=False)

In [84]:
#Calculates break point converted% 'Strength of Schedule' for the past 60 opponents of a given player in a given match
#Uses each opponent's (non decay-weighted) last 60 match performance prior to facing the player of interest (surface-specific)
#With this, we can obtain the "expected" performance over the last 60 matches for the player of interest, then can SOS adjust
#that player's performance over their last 60 on the surface to reflect how much above or below an average schedule they faced (see calculations below)

df_player2 = df_player2.iloc[::-1]

df_player2['p_bp_convert%_SOS_1to10'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_convert%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player2['p_bp_convert%_SOS_11to20'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_convert%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player2['p_bp_convert%_SOS_11to20'] = df_player2['p_bp_convert%_SOS_11to20'].fillna(df_player2['p_bp_convert%_SOS_1to10'])

df_player2['p_bp_convert%_SOS_21to30'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_convert%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player2['p_bp_convert%_SOS_21to30'] = df_player2['p_bp_convert%_SOS_21to30'].fillna(df_player2['p_bp_convert%_SOS_11to20'])

df_player2['p_bp_convert%_SOS_31to40'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_convert%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player2['p_bp_convert%_SOS_31to40'] = df_player2['p_bp_convert%_SOS_31to40'].fillna(df_player2['p_bp_convert%_SOS_21to30'])

df_player2['p_bp_convert%_SOS_41to50'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_convert%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player2['p_bp_convert%_SOS_41to50'] = df_player2['p_bp_convert%_SOS_41to50'].fillna(df_player2['p_bp_convert%_SOS_31to40'])

df_player2['p_bp_convert%_SOS_51to60'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_convert%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player2['p_bp_convert%_SOS_51to60'] = df_player2['p_bp_convert%_SOS_51to60'].fillna(df_player2['p_bp_convert%_SOS_41to50'])

#df_player2['p_bp_convert%_SOS_61to70'] = df_player2.groupby(['p_id','t_surf'])['p_opp_bp_convert%_l60'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
#df_player2['p_bp_convert%_SOS_61to70'] = df_player2['p_bp_convert%_SOS_61to70'].fillna(df_player2['p_bp_convert%_SOS_51to60'])

df_player2 = df_player2.iloc[::-1]

In [85]:
# Decay weights the SOS calculation at each match to be predicted on, and frames as expected BP SAVE% YIELD by player's opponents over the last 60 surface-specific matches (we will contrast directly to the player's ACTUAL performance over the last 60 prior to that match to be predicted on). 

df_player2["p_expected_opp_yield_bp_save%"] = (100 - (((df_player2['p_bp_convert%_SOS_1to10'] * 14) + (df_player2['p_bp_convert%_SOS_11to20'] * 8) + (df_player2['p_bp_convert%_SOS_21to30'] * 5) 
+ (df_player2['p_bp_convert%_SOS_31to40'] * 3) + (df_player2['p_bp_convert%_SOS_41to50'] * 2) + (df_player2['p_bp_convert%_SOS_51to60'] * 1))/33)).round(2)

# Drops transient columns
df_player2.drop(["p_bp_convert%_SOS_11to20","p_bp_convert%_SOS_21to30","p_bp_convert%_SOS_31to40","p_bp_convert%_SOS_41to50","p_bp_convert%_SOS_51to60"],axis=1, inplace=True)

In [86]:
# Calculates mean opponent performance per surface. We will use these to factor player l60 performance based on opponent 
# l60 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean

mean_clay_SOS7 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_bp_convert%_l60'].mean()
mean_clay_SOS7 = 100 - mean_clay_SOS7 #we want in terms of pct BREAK PTS SAVED the field ALLOWS on average
#mean_hard_SOS7 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_bp_convert%_l60'].mean()
#mean_hard_SOS7 = 100 - mean_hard_SOS7 #we want in terms of pct BREAK PTS SAVED the field ALLOWS on average
mean_clay_SOS7

58.0961163117247

In [87]:
# Puts together the above: factors the player's actual performance over the last 60 by schedule of opponents' aggregate performance over THEIR l60 prior to when they faced the player.

df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_bp_save%_l60"] = ((df_player2["p_bp_save%_l60"])*(mean_clay_SOS7/df_player2["p_expected_opp_yield_bp_save%"])).round(2)
                
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_bp_save%_l60"] = ((df_player2["p_bp_save%_l60"])*(mean_hard_SOS7/df_player2["p_expected_opp_yield_bp_save%"])).round(2)

df_player2["p_SOS_adj_bp_save%_l60"] = df_player2["p_SOS_adj_bp_save%_l60"].replace(np.inf, np.nan) #deals with a few infs in first handful of matches in sample where there is no SOS (divide by zero errors in the above)
df_player2.loc[(df_player2["p_SOS_adj_bp_save%_l60"] > 100), "p_SOS_adj_bp_save%_l60"] = df_player2["p_SOS_adj_bp_save%_l60"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modeling at all, as these matches will be filtered)

In [88]:
# SOS adjustment for break point save% (OPPONENT BP SAVE% YIELD) for just the last 10 matches (recent performance)
df_player2["p_expected_opp_yield_bp_save%_l10"] = (100 - df_player2['p_bp_convert%_SOS_1to10'])

# Calculates mean opponent performance per surface. We will use these to factor player l10 performance based on opponent 
# l10 performance (surface-specific) prior to the match of interest relative to the surface-specific sample mean
mean_clay_SOS7 = df_player2.loc[df_player2['t_surf'] == "Clay", 'p_opp_bp_convert%_l10'].mean()
mean_clay_SOS7 = 100 - mean_clay_SOS7 #we want in terms of pct BREAK PTS SAVED the field ALLOWS on average
#mean_hard_SOS7 = df_player2.loc[df_player2['t_surf'] == "Hard", 'p_opp_bp_convert%_l10'].mean()
#mean_hard_SOS7 = 100 - mean_hard_SOS7 #we want in terms of pct BREAK PTS SAVED the field ALLOWS on average
mean_clay_SOS7

# Puts together the above: factors the player's actual performance over the last 10 by schedule of opponents' aggregate performance over THEIR l10 prior to when they faced the player.
# Adjustment proportional to opponents' deviation from field mean performs better than various boosted or blunted versions attempted 
df_player2.loc[(df_player2["t_surf"] == "Clay"), "p_SOS_adj_bp_save%_l10"] = ((df_player2["p_bp_save%_l10"])*(mean_clay_SOS7/df_player2["p_expected_opp_yield_bp_save%_l10"])).round(2)          
#df_player2.loc[(df_player2["t_surf"] == "Hard"), "p_SOS_adj_bp_save%_l10"] = ((df_player2["p_bp_save%_l10"])*(mean_hard_SOS7/df_player2["p_expected_opp_yield_bp_save%_l10"])).round(2)

df_player2["p_SOS_adj_bp_save%_l10"] = df_player2["p_SOS_adj_bp_save%_l10"].replace(np.inf, np.nan) #deals with a few infs in first handful of matches in sample where there is no SOS (divide by zero errors in the above)
df_player2.loc[(df_player2["p_SOS_adj_bp_save%_l10"] > 100), "p_SOS_adj_bp_save%_l10"] = df_player2["p_SOS_adj_bp_save%_l10"].mean() #deals with a few spuriously high values when there's only one previous match (won't impact modelling at all, as these matches will be filtered out)

df_player2.drop(["p_bp_convert%_SOS_1to10"],axis=1, inplace=True)

In [89]:
#Calculates Implied Win Probability (derived from average across a number of books) 'Strength of Schedule' for the past 60 opponents of a given player in a given match.
#Uses each opponent's decay-weighted last 60 match performance prior to facing the player of interest (surface-specific)
#With this, we can obtain the "expected" performance over the last 60 matches for the player of interest, then can SOS adjust
#that player's performance over their last 60 on the surface to reflect how much above or below an average schedule they faced (see calculations below)

df_player2 = df_player2.iloc[::-1]

df_player2['p_IWP_SOS_1to10'] = df_player2.groupby(['p_id','t_surf'])['p_opp_IWP_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 1).mean().round(2).shift(1))

df_player2['p_IWP_SOS_11to20'] = df_player2.groupby(['p_id','t_surf'])['p_opp_IWP_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(11))
df_player2['p_IWP_SOS_11to20'] = df_player2['p_IWP_SOS_11to20'].fillna(df_player2['p_IWP_SOS_1to10'])

df_player2['p_IWP_SOS_21to30'] = df_player2.groupby(['p_id','t_surf'])['p_opp_IWP_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(21))
df_player2['p_IWP_SOS_21to30'] = df_player2['p_IWP_SOS_21to30'].fillna(df_player2['p_IWP_SOS_11to20'])

df_player2['p_IWP_SOS_31to40'] = df_player2.groupby(['p_id','t_surf'])['p_opp_IWP_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(31))
df_player2['p_IWP_SOS_31to40'] = df_player2['p_IWP_SOS_31to40'].fillna(df_player2['p_IWP_SOS_21to30'])

df_player2['p_IWP_SOS_41to50'] = df_player2.groupby(['p_id','t_surf'])['p_opp_IWP_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(41))
df_player2['p_IWP_SOS_41to50'] = df_player2['p_IWP_SOS_41to50'].fillna(df_player2['p_IWP_SOS_31to40'])

df_player2['p_IWP_SOS_51to60'] = df_player2.groupby(['p_id','t_surf'])['p_opp_IWP_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
df_player2['p_IWP_SOS_51to60'] = df_player2['p_IWP_SOS_51to60'].fillna(df_player2['p_IWP_SOS_41to50'])

df_player2['p_IWP_SOS_61to70'] = df_player2.groupby(['p_id','t_surf'])['p_opp_IWP_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(61))
df_player2['p_IWP_SOS_61to70'] = df_player2['p_IWP_SOS_61to70'].fillna(df_player2['p_IWP_SOS_51to60'])

#df_player2['p_pts_won%_SOS_71to80'] = df_player2.groupby(['p_id','t_surf'])['p_opp_pts_won%_l60_decay'].transform(lambda x: x.rolling(window=10, min_periods = 3).mean().round(2).shift(51))
#df_player2['p_pts_won%_SOS_71to80'] = df_player2['p_pts_won%_SOS_71to80'].fillna(df_player2['p_pts_won%_SOS_61to70'])

df_player2 = df_player2.iloc[::-1]

In [None]:
#Save to review
df_player2.to_csv('../data/df_player2g.csv', index=False)

In [None]:
df_player2.info()

In [None]:
#Tidy up the latest by-player dataframe before converting back to by-match format to compute player differentials for each match to be predicted on
df_player3 = df_player2[["t_id", "t_date", "tour_wk", "t_name", "t_country", "t_surf", "t_indoor", "t_alt", "t_lvl", "t_draw_size", "t_rd_num", "m_num", "m_best_of", "m_outcome", "m_time(m)", "m_tot_pts", "p_id", "p_name", "p_H2H_w", "p_H2H_pts_won%", "p_rank", "p_rank_pts", "p_country", "p_ent", "p_hd", "p_ht", "p_age", "p_matches", "p_matches_surf", "p_pts_won%", "p_pts_won%_l60_decay", "p_pts_won%_l60_decay_IO", "p_pts_won%_l10", "p_SOS_adj_pts_won%_l60_decay", "p_SOS_adj_pts_won%_l60_decay_IO", "p_SOS_adj_pts_won%_l60_decay_IO_weighted", "p_SOS_adj_pts_won%_l10", "p_sv_pts_won%", "p_sv_pts_won%_l60_decay", "p_sv_pts_won%_l10", "p_SOS_adj_sv_pts_won%_l60_decay", "p_SOS_adj_sv_pts_won%_l10", "p_ret_pts_won%", "p_ret_pts_won%_l60_decay", "p_ret_pts_won%_l10", "p_SOS_adj_ret_pts_won%_l60_decay", "p_SOS_adj_ret_pts_won%_l10", "p_ace%", "p_ace%_l60_decay", "p_ace%_l10", "p_SOS_adj_ace%_l60_decay", "p_SOS_adj_ace%_l10", "p_aced%", "p_aced%_l60_decay", "p_aced%_l10", "p_SOS_adj_aced%_l60_decay", "p_SOS_adj_aced%_l10", "p_bp_save%", "p_bp_save%_l60", "p_bp_save%_l10", "p_SOS_adj_bp_save%_l60", "p_SOS_adj_bp_save%_l10", "p_bp_convert%", "p_bp_convert%_l60", "p_bp_convert%_l10", "p_SOS_adj_bp_convert%_l60", "p_SOS_adj_bp_convert%_l10", "p_pts_won%_std_l60_decay", "p_sv_pts_won%_std_l60_decay", "p_ret_pts_won%_std_l60_decay", "p_m_time_last", "p_tot_time_l6", "p_tot_time_l6_decay", "p_tot_pts_last", "p_tot_pts_l6", "p_tot_pts_l6_decay", "p_stamina_adj_fatigue_decay", "p_stamina_adj_fatigue", "p_IP_NV"]]

In [None]:
df_player3.info()

In [None]:
#Save prior to conversion back to by-match format for computation of player differential features per match.
df_player3.to_csv('../data/df_player3.csv', index=False)

### Player vs Player Differentials in Predictive Features By Match

The dataframe is now converted back to by-match format to compute player differentials for the predictive features aligned to each match to be predicted on. A few additional by-player features are computed in the process. Once these features are computed, the dataframe is converted back to by-player format for output to EDA.

In [None]:
df_winners3 = df_player3[df_player3['m_outcome'] == 1]
df_losers3 = df_player3[df_player3['m_outcome'] == 0]
df_match3 = df_winners3.merge(df_losers3, on='m_num', how = 'left')

In [None]:
df_match3 = df_match3.drop(["t_id_y", "t_date_y", "tour_wk_y", "t_name_y", "t_country_y", "t_surf_y", "t_indoor_y", "t_alt_y", "t_lvl_y", "t_draw_size_y", "t_rd_num_y", "m_best_of_y", "m_time(m)_y", "m_tot_pts_y"], axis=1)
df_match3.rename(columns = {'t_id_x':'t_id', 't_date_x':'t_date', 'tour_wk_x':'tour_wk', 't_name_x':'t_name','t_country_x':'t_country','t_surf_x':'t_surf','t_indoor_x':'t_indoor', 't_alt_x':'t_alt','t_lvl_x':'t_lvl','t_draw_size_x':'t_draw_size', 't_rd_num_x':'t_rd_num', 'm_best_of_x':'m_best_of', 'm_time(m)_x':'m_time(m)','m_tot_pts_x':'m_tot_pts'}, inplace=True)
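
The by-player to by-match conversion above splits rows by outcome and joins winner and loser on the match number. The toy example below sketches the same idea with one variation, stated as an assumption rather than the notebook's method: match-level columns are dropped from the loser frame before the merge, so no `_y` duplicates need to be dropped or renamed afterwards. All data values are made up.

```python
import pandas as pd

# Toy by-player frame: two matches, one row per player per match.
by_player = pd.DataFrame({
    "m_num": [1, 1, 2, 2],
    "t_surf": ["Clay"] * 4,
    "m_outcome": [1, 0, 0, 1],
    "p_id": [10, 20, 30, 40],
    "p_rank": [5, 50, 12, 3],
})
match_cols = ["t_surf"]  # match-level columns repeated on both player rows

winners = by_player[by_player["m_outcome"] == 1]
losers = by_player[by_player["m_outcome"] == 0].drop(columns=match_cols)

# validate="one_to_one" raises if any match number appears twice on either side.
by_match = winners.merge(
    losers, on="m_num", how="left", suffixes=("_x", "_y"), validate="one_to_one"
)
print(by_match)
```

The `validate` argument is a cheap guard: a duplicated `m_num` would otherwise silently multiply rows in the by-match frame.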

In [None]:
#df_match3.to_csv('../data/df_match3.csv', index=False)

#### Rankings and Entry Type-Related Player Differential Predictive Features By Match

In [None]:
# ATP ranking differential between winner (_x) vs loser (_y) (and loser vs winner) (to be consistent with points diff, positive number = better ranking than opp)
max_winners = df_match3['p_rank_x'].max()
max_losers = df_match3['p_rank_y'].max()
max_sample = max(max_winners, max_losers)
#max_sample

df_match3['p_rank_x'] = df_match3['p_rank_x'].fillna(max_sample + 1) # if player has no ranking, assign sample max + 1
df_match3['p_rank_y'] = df_match3['p_rank_y'].fillna(max_sample + 1) # if player has no ranking, assign sample max + 1
df_match3["p_rank_diff_x"] = -(df_match3["p_rank_x"] - df_match3["p_rank_y"])
df_match3["p_rank_diff_y"] = -df_match3["p_rank_diff_x"]
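
As a worked illustration of the ranking differential above, the toy frame below fills unranked players with the sample's worst rank plus one and then takes the sign-flipped difference, so a positive value means a better ranking than the opponent. The rank values are invented for the example.

```python
import numpy as np
import pandas as pd

m = pd.DataFrame({
    "p_rank_x": [3.0, np.nan, 120.0],   # winner ranks (NaN = unranked)
    "p_rank_y": [40.0, 15.0, np.nan],   # loser ranks
})

# Worst observed rank across both columns, plus one, stands in for "unranked".
worst = max(m["p_rank_x"].max(), m["p_rank_y"].max()) + 1
m = m.fillna({"p_rank_x": worst, "p_rank_y": worst})

m["p_rank_diff_x"] = m["p_rank_y"] - m["p_rank_x"]   # same as -(x - y)
m["p_rank_diff_y"] = -m["p_rank_diff_x"]
print(m)
```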

In [None]:
# Generate log of ranking for both players and then calculate the difference (assumption that one ranking place separates players more as you get closer to the top of the rankings)
df_match3["p_log_rank_x"] = np.log(df_match3["p_rank_x"]).round(2)
df_match3["p_log_rank_y"] = np.log(df_match3["p_rank_y"]).round(2)
df_match3["p_log_rank_diff_x"] = -(df_match3["p_log_rank_x"] - df_match3["p_log_rank_y"])
df_match3["p_log_rank_diff_y"] = -(df_match3["p_log_rank_diff_x"])
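
To see why the log transform matters, the short check below compares a one-place gap at the top of the rankings with a one-place gap near rank 100. The helper `log_rank_gap` is a hypothetical name used only for this illustration.

```python
import numpy as np

def log_rank_gap(rank_a, rank_b):
    """Log-rank differential; positive when rank_a is the better (lower) rank."""
    return float(np.log(rank_b) - np.log(rank_a))

gap_top = log_rank_gap(1, 2)       # one place apart at the top
gap_low = log_rank_gap(101, 102)   # one place apart near rank 100

# Effect of rounding each log to 2 dp before differencing.
rounded_low = float(np.round(np.log(102), 2) - np.round(np.log(101), 2))
print(gap_top, gap_low, rounded_low)
```

The gap between ranks 1 and 2 is about 70 times the gap between ranks 101 and 102. Note that when each log is rounded to two decimals before the subtraction, the 101 vs 102 pair collapses to zero, so it may be worth checking whether that loss of resolution matters for lower-ranked matchups.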

In [None]:
# ATP ranking points differential between winner (_x) and loser (_y) (and loser vs winner)
df_match3['p_rank_pts_x'] = df_match3['p_rank_pts_x'].fillna(0) # if player has no pts, assign 0
df_match3['p_rank_pts_y'] = df_match3['p_rank_pts_y'].fillna(0) # if player has no pts, assign 0
df_match3["p_rank_pts_diff_x"] = df_match3["p_rank_pts_x"] - df_match3["p_rank_pts_y"]
df_match3["p_rank_pts_diff_y"] = -df_match3["p_rank_pts_diff_x"]

In [None]:
# Entry Type Differential (entry type was encoded in first stage as 3=Ranking-based entry; 2=Qualifier; 1.5=Lucky Loser; 1=Special Entry/non-rankings based entry)
df_match3["p_ent_diff_x"] = df_match3["p_ent_x"] - df_match3["p_ent_y"]
df_match3["p_ent_diff_y"] = -df_match3["p_ent_diff_x"]

#### Basic Player Characteristics Differential Predictive Features By Match

In [None]:
# Height differential between winner (_x) vs loser (_y) (in cm) (and loser vs winner)
df_match3["p_ht_diff_x"] = (df_match3["p_ht_x"] - df_match3["p_ht_y"])
df_match3["p_ht_diff_y"] = -df_match3["p_ht_diff_x"]

In [None]:
# Age differential between winner (_x) vs loser (_y) (yrs) (and loser vs winner)
df_match3["p_age_diff_x"] = (df_match3["p_age_x"] - df_match3["p_age_y"])
df_match3["p_age_diff_y"] = -df_match3["p_age_diff_x"]

In [None]:
# Marker column for if winner was Left-Handed and loser was Right-Handed (and vice versa) (1=T, 0=F)
df_match3['p_L_opp_R_x'] = np.where((df_match3['p_hd_x'] == 'L') & (df_match3['p_hd_y'] == 'R'), 1, 0)
df_match3['p_L_opp_R_y'] = np.where((df_match3['p_hd_x'] == 'R') & (df_match3['p_hd_y'] == 'L'), 1, 0)

# a small number of low-match-count players in the sample have unknown (U) handedness, even after investigation on the ATP site.

In [None]:
# Convert player handedness itself to numeric encoding
df_match3.loc[(df_match3["p_hd_x"] == "L"), "p_hd_x"] = 2 #Lefties converts to 2
df_match3.loc[(df_match3["p_hd_y"] == "L"), "p_hd_y"] = 2 #Lefties converts to 2
df_match3.loc[(df_match3["p_hd_x"] == "R"), "p_hd_x"] = 1 #Righties converts to 1
df_match3.loc[(df_match3["p_hd_y"] == "R"), "p_hd_y"] = 1 #Righties converts to 1
df_match3.loc[(df_match3["p_hd_x"] == "U"), "p_hd_x"] = 1 #Unknowns convert to 1 (same code as righties)
df_match3.loc[(df_match3["p_hd_y"] == "U"), "p_hd_y"] = 1 #Unknowns convert to 1 (same code as righties)

df_match3["p_hd_x"] = pd.to_numeric(df_match3["p_hd_x"])
df_match3["p_hd_y"] = pd.to_numeric(df_match3["p_hd_y"])
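
The handedness encoding above can also be written as a single dictionary lookup. This is an alternative sketch, not the notebook's code; the mapping for `"U"` mirrors the value assigned in the cell above, and the sample series is invented.

```python
import pandas as pd

# L -> 2, R -> 1, U -> 1 (the values assigned in the cell above).
HAND_CODES = {"L": 2, "R": 1, "U": 1}

hands = pd.Series(["L", "R", "U", "R"])
encoded = hands.map(HAND_CODES).astype(int)
print(encoded.tolist())
```

Keeping the codes in one dictionary makes it harder for a comment and an assigned value to drift apart, and any label missing from the dictionary becomes NaN, so the integer cast fails loudly instead of leaving a string in the column.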

In [None]:
# Marker column for if winner was from the country where the tourney was held, and opponent was not (and vice versa) (1=T, 0=F)
df_match3['p_HCA_opp_N_x'] = np.where((df_match3['t_country'] == df_match3['p_country_x']) & (df_match3['t_country'] != df_match3['p_country_y']), 1, 0)
df_match3['p_HCA_opp_N_y'] = np.where((df_match3['t_country'] != df_match3['p_country_x']) & (df_match3['t_country'] == df_match3['p_country_y']), 1, 0)
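
The home-court marker is only set when exactly one of the two players is from the host country. The toy example below sketches that logic with boolean masks; the country codes are made up, and the second row shows that two home players cancel out.

```python
import pandas as pd

m = pd.DataFrame({
    "t_country": ["FRA", "FRA", "ESP"],
    "p_country_x": ["FRA", "FRA", "ARG"],   # winner's country
    "p_country_y": ["ESP", "FRA", "ESP"],   # loser's country
})

home_x = m["t_country"] == m["p_country_x"]
home_y = m["t_country"] == m["p_country_y"]

# Flag a player only when they are at home and the opponent is not.
m["p_HCA_opp_N_x"] = (home_x & ~home_y).astype(int)
m["p_HCA_opp_N_y"] = (home_y & ~home_x).astype(int)
print(m)
```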

#### Retrospective Player Performance Differential Predictive Features By Match

In [None]:
# % total points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_pts_won%_l60_decay_diff_x"] = df_match3["p_pts_won%_l60_decay_x"] - df_match3["p_pts_won%_l60_decay_y"]
df_match3["p_pts_won%_l60_decay_diff_y"] = -(df_match3["p_pts_won%_l60_decay_diff_x"])
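
Every differential in this section follows the same two-line pattern: winner minus loser for `_diff_x`, then the negation for `_diff_y`. A small helper along the following lines could generate them from a list of feature stems; `add_diffs` is a hypothetical name and the toy values are invented.

```python
import pandas as pd

def add_diffs(df, features):
    """Add <feature>_diff_x (x minus y) and <feature>_diff_y (its negation)."""
    out = df.copy()
    for f in features:
        out[f"{f}_diff_x"] = out[f"{f}_x"] - out[f"{f}_y"]
        out[f"{f}_diff_y"] = -out[f"{f}_diff_x"]
    return out

toy = pd.DataFrame({
    "p_pts_won%_l10_x": [52.0, 49.5],
    "p_pts_won%_l10_y": [50.0, 51.0],
})
toy = add_diffs(toy, ["p_pts_won%_l10"])
print(toy)
```

Because the two differentials are exact negations of each other, each by-player row later carries the same information from its own point of view.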

In [None]:
# % total points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version
# This version respects INDOOR vs OUTDOOR SEPARATION

df_match3["p_pts_won%_l60_decay_IO_diff_x"] = df_match3["p_pts_won%_l60_decay_IO_x"] - df_match3["p_pts_won%_l60_decay_IO_y"]
df_match3["p_pts_won%_l60_decay_IO_diff_y"] = -(df_match3["p_pts_won%_l60_decay_IO_diff_x"])

In [None]:
# % total points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_pts_won%_l60_decay_diff_x"] = df_match3["p_SOS_adj_pts_won%_l60_decay_x"] - df_match3["p_SOS_adj_pts_won%_l60_decay_y"]
df_match3["p_SOS_adj_pts_won%_l60_decay_diff_y"] = -(df_match3["p_SOS_adj_pts_won%_l60_decay_diff_x"])

In [None]:
# % total points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version
# This version respects INDOOR vs OUTDOOR SEPARATION

df_match3["p_SOS_adj_pts_won%_l60_decay_IO_diff_x"] = df_match3["p_SOS_adj_pts_won%_l60_decay_IO_x"] - df_match3["p_SOS_adj_pts_won%_l60_decay_IO_y"]
df_match3["p_SOS_adj_pts_won%_l60_decay_IO_diff_y"] = -(df_match3["p_SOS_adj_pts_won%_l60_decay_IO_diff_x"])

In [None]:
# % total points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is a version of the standard SOS-Adjusted l60 WEIGHTED BY INDOOR/OUTDOOR DISTINCTION

df_match3["p_SOS_adj_pts_won%_l60_decay_IO_weighted_diff_x"] = df_match3["p_SOS_adj_pts_won%_l60_decay_IO_weighted_x"] - df_match3["p_SOS_adj_pts_won%_l60_decay_IO_weighted_y"]
df_match3["p_SOS_adj_pts_won%_l60_decay_IO_weighted_diff_y"] = -(df_match3["p_SOS_adj_pts_won%_l60_decay_IO_weighted_diff_x"])

In [None]:
# % total points won in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_pts_won%_l10_diff_x"] = df_match3["p_pts_won%_l10_x"] - df_match3["p_pts_won%_l10_y"]
df_match3["p_pts_won%_l10_diff_y"] = -(df_match3["p_pts_won%_l10_diff_x"])

In [None]:
# % total points won in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_pts_won%_l10_diff_x"] = df_match3["p_SOS_adj_pts_won%_l10_x"] - df_match3["p_SOS_adj_pts_won%_l10_y"]
df_match3["p_SOS_adj_pts_won%_l10_diff_y"] = -(df_match3["p_SOS_adj_pts_won%_l10_diff_x"])

In [None]:
# "OFFENSE VS OFFENSE": % SERVE points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_sv_pts_won%_l60_decay_diff_x"] = df_match3["p_sv_pts_won%_l60_decay_x"] - df_match3["p_sv_pts_won%_l60_decay_y"]
df_match3["p_sv_pts_won%_l60_decay_diff_y"] = -(df_match3["p_sv_pts_won%_l60_decay_diff_x"])

In [None]:
# "OFFENSE VS OFFENSE": % SERVE points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_sv_pts_won%_l60_decay_diff_x"] = df_match3["p_SOS_adj_sv_pts_won%_l60_decay_x"] - df_match3["p_SOS_adj_sv_pts_won%_l60_decay_y"]
df_match3["p_SOS_adj_sv_pts_won%_l60_decay_diff_y"] = -(df_match3["p_SOS_adj_sv_pts_won%_l60_decay_diff_x"])

In [None]:
# "OFFENSE VS OFFENSE": % SERVE points won in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_sv_pts_won%_l10_diff_x"] = df_match3["p_sv_pts_won%_l10_x"] - df_match3["p_sv_pts_won%_l10_y"]
df_match3["p_sv_pts_won%_l10_diff_y"] = -(df_match3["p_sv_pts_won%_l10_diff_x"])

In [None]:
# "OFFENSE VS OFFENSE": % SERVE points won in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_sv_pts_won%_l10_diff_x"] = df_match3["p_SOS_adj_sv_pts_won%_l10_x"] - df_match3["p_SOS_adj_sv_pts_won%_l10_y"]
df_match3["p_SOS_adj_sv_pts_won%_l10_diff_y"] = -(df_match3["p_SOS_adj_sv_pts_won%_l10_diff_x"])

In [None]:
# "DEFENSE VS DEFENSE": % RETURN points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_ret_pts_won%_l60_decay_diff_x"] = df_match3["p_ret_pts_won%_l60_decay_x"] - df_match3["p_ret_pts_won%_l60_decay_y"]
df_match3["p_ret_pts_won%_l60_decay_diff_y"] = -(df_match3["p_ret_pts_won%_l60_decay_diff_x"])

In [None]:
# "DEFENSE VS DEFENSE": % RETURN points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_ret_pts_won%_l60_decay_diff_x"] = df_match3["p_SOS_adj_ret_pts_won%_l60_decay_x"] - df_match3["p_SOS_adj_ret_pts_won%_l60_decay_y"]
df_match3["p_SOS_adj_ret_pts_won%_l60_decay_diff_y"] = -(df_match3["p_SOS_adj_ret_pts_won%_l60_decay_diff_x"])

In [None]:
# "DEFENSE VS DEFENSE": % RETURN points won in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_ret_pts_won%_l10_diff_x"] = df_match3["p_ret_pts_won%_l10_x"] - df_match3["p_ret_pts_won%_l10_y"]
df_match3["p_ret_pts_won%_l10_diff_y"] = -(df_match3["p_ret_pts_won%_l10_diff_x"])

In [None]:
# "DEFENSE VS DEFENSE": % RETURN points won in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_ret_pts_won%_l10_diff_x"] = df_match3["p_SOS_adj_ret_pts_won%_l10_x"] - df_match3["p_SOS_adj_ret_pts_won%_l10_y"]
df_match3["p_SOS_adj_ret_pts_won%_l10_diff_y"] = -(df_match3["p_SOS_adj_ret_pts_won%_l10_diff_x"])

In [None]:
# "OFFENSE VS DEFENSE": % SERVE points won VS OPPONENT % RETURN points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_sv_opp_ret_pts_won%_l60_decay_diff_x"] = df_match3["p_sv_pts_won%_l60_decay_x"] - df_match3["p_ret_pts_won%_l60_decay_y"]
df_match3["p_sv_opp_ret_pts_won%_l60_decay_diff_y"] = df_match3["p_sv_pts_won%_l60_decay_y"] - df_match3["p_ret_pts_won%_l60_decay_x"]

In [None]:
# "OFFENSE VS DEFENSE": % SERVE points won VS OPPONENT % RETURN points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_sv_opp_ret_pts_won%_l60_decay_diff_x"] = df_match3["p_SOS_adj_sv_pts_won%_l60_decay_x"] - df_match3["p_SOS_adj_ret_pts_won%_l60_decay_y"]
df_match3["p_SOS_adj_sv_opp_ret_pts_won%_l60_decay_diff_y"] = df_match3["p_SOS_adj_sv_pts_won%_l60_decay_y"] - df_match3["p_SOS_adj_ret_pts_won%_l60_decay_x"]

In [None]:
# "OFFENSE VS DEFENSE": % SERVE points won VS OPPONENT % RETURN points won in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_sv_opp_ret_pts_won%_l10_diff_x"] = df_match3["p_sv_pts_won%_l10_x"] - df_match3["p_ret_pts_won%_l10_y"]
df_match3["p_sv_opp_ret_pts_won%_l10_diff_y"] = df_match3["p_sv_pts_won%_l10_y"] - df_match3["p_ret_pts_won%_l10_x"]

In [None]:
# "OFFENSE VS DEFENSE": % SERVE points won VS OPPONENT % RETURN points won in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_sv_opp_ret_pts_won%_l10_diff_x"] = df_match3["p_SOS_adj_sv_pts_won%_l10_x"] - df_match3["p_SOS_adj_ret_pts_won%_l10_y"]
df_match3["p_SOS_adj_sv_opp_ret_pts_won%_l10_diff_y"] = df_match3["p_SOS_adj_sv_pts_won%_l10_y"] - df_match3["p_SOS_adj_ret_pts_won%_l10_x"]

In [None]:
# "DEFENSE VS OFFENSE": % RETURN points won VS OPPONENT % SERVE points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_ret_opp_sv_pts_won%_l60_decay_diff_x"] = df_match3["p_ret_pts_won%_l60_decay_x"] - df_match3["p_sv_pts_won%_l60_decay_y"]
df_match3["p_ret_opp_sv_pts_won%_l60_decay_diff_y"] = df_match3["p_ret_pts_won%_l60_decay_y"] - df_match3["p_sv_pts_won%_l60_decay_x"]

In [None]:
# "DEFENSE VS OFFENSE": % RETURN points won VS OPPONENT % SERVE points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_ret_opp_sv_pts_won%_l60_decay_diff_x"] = df_match3["p_SOS_adj_ret_pts_won%_l60_decay_x"] - df_match3["p_SOS_adj_sv_pts_won%_l60_decay_y"]
df_match3["p_SOS_adj_ret_opp_sv_pts_won%_l60_decay_diff_y"] = df_match3["p_SOS_adj_ret_pts_won%_l60_decay_y"] - df_match3["p_SOS_adj_sv_pts_won%_l60_decay_x"]

In [None]:
# "DEFENSE VS OFFENSE": % RETURN points won VS OPPONENT % SERVE points won in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_ret_opp_sv_pts_won%_l10_diff_x"] = df_match3["p_ret_pts_won%_l10_x"] - df_match3["p_sv_pts_won%_l10_y"]
df_match3["p_ret_opp_sv_pts_won%_l10_diff_y"] = df_match3["p_ret_pts_won%_l10_y"] - df_match3["p_sv_pts_won%_l10_x"]

In [None]:
# "DEFENSE VS OFFENSE": % RETURN points won VS OPPONENT % SERVE points won in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_ret_opp_sv_pts_won%_l10_diff_x"] = df_match3["p_SOS_adj_ret_pts_won%_l10_x"] - df_match3["p_SOS_adj_sv_pts_won%_l10_y"]
df_match3["p_SOS_adj_ret_opp_sv_pts_won%_l10_diff_y"] = df_match3["p_SOS_adj_ret_pts_won%_l10_y"] - df_match3["p_SOS_adj_sv_pts_won%_l10_x"]
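
Unlike the same-stat differentials, the cross-matchup features above ("OFFENSE VS DEFENSE" and "DEFENSE VS OFFENSE") are not mirror images: the loser's value is computed from its own pairing rather than by negating the winner's. A minimal sketch of that pattern, assuming a hypothetical helper `add_cross_diff` and invented toy values:

```python
import pandas as pd

def add_cross_diff(df: pd.DataFrame, own: str, opp: str, name: str) -> pd.DataFrame:
    """Pair each player's `own` stat against the opponent's `opp` stat."""
    out = df.copy()
    # Winner's own stat minus loser's opposing stat.
    out[f"{name}_diff_x"] = out[f"{own}_x"] - out[f"{opp}_y"]
    # Loser's own stat minus winner's opposing stat (NOT the negative of _diff_x).
    out[f"{name}_diff_y"] = out[f"{own}_y"] - out[f"{opp}_x"]
    return out

# Toy values invented purely for illustration.
toy = pd.DataFrame({
    "p_sv_pts_won%_l10_x": [65.0], "p_sv_pts_won%_l10_y": [60.0],
    "p_ret_pts_won%_l10_x": [40.0], "p_ret_pts_won%_l10_y": [35.0],
})
toy = add_cross_diff(toy, "p_sv_pts_won%_l10", "p_ret_pts_won%_l10",
                     "p_sv_opp_ret_pts_won%_l10")
```

In the toy row the winner's value is 65 - 35 and the loser's is 60 - 40, so the two columns carry different information.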

In [None]:
# "OFFENSE VS OFFENSE": player ace% VS OPPONENT ace% in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_ace%_l60_decay_diff_x"] = df_match3["p_ace%_l60_decay_x"] - df_match3["p_ace%_l60_decay_y"]
df_match3["p_ace%_l60_decay_diff_y"] = -(df_match3["p_ace%_l60_decay_diff_x"]) 

In [None]:
# "OFFENSE VS OFFENSE": player ace% VS OPPONENT ace% in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_ace%_l60_decay_diff_x"] = df_match3["p_SOS_adj_ace%_l60_decay_x"] - df_match3["p_SOS_adj_ace%_l60_decay_y"]
df_match3["p_SOS_adj_ace%_l60_decay_diff_y"] = -(df_match3["p_SOS_adj_ace%_l60_decay_diff_x"]) 

In [None]:
# "OFFENSE VS OFFENSE": player ace% VS OPPONENT ace% in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_ace%_l10_diff_x"] = df_match3["p_ace%_l10_x"] - df_match3["p_ace%_l10_y"]
df_match3["p_ace%_l10_diff_y"] = -(df_match3["p_ace%_l10_diff_x"]) 

In [None]:
# "OFFENSE VS OFFENSE": player ace% VS OPPONENT ace% in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_ace%_l10_diff_x"] = df_match3["p_SOS_adj_ace%_l10_x"] - df_match3["p_SOS_adj_ace%_l10_y"]
df_match3["p_SOS_adj_ace%_l10_diff_y"] = -(df_match3["p_SOS_adj_ace%_l10_diff_x"]) 

In [None]:
# "DEFENSE VS DEFENSE": player aced% VS OPPONENT aced% in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_aced%_l60_decay_diff_x"] = df_match3["p_aced%_l60_decay_x"] - df_match3["p_aced%_l60_decay_y"]
df_match3["p_aced%_l60_decay_diff_y"] = -(df_match3["p_aced%_l60_decay_diff_x"]) 

In [None]:
# "DEFENSE VS DEFENSE": player aced% VS OPPONENT aced% in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_aced%_l60_decay_diff_x"] = df_match3["p_SOS_adj_aced%_l60_decay_x"] - df_match3["p_SOS_adj_aced%_l60_decay_y"]
df_match3["p_SOS_adj_aced%_l60_decay_diff_y"] = -(df_match3["p_SOS_adj_aced%_l60_decay_diff_x"])

In [None]:
# "DEFENSE VS DEFENSE": player aced% VS OPPONENT aced% in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_aced%_l10_diff_x"] = df_match3["p_aced%_l10_x"] - df_match3["p_aced%_l10_y"]
df_match3["p_aced%_l10_diff_y"] = -(df_match3["p_aced%_l10_diff_x"]) 

In [None]:
# "DEFENSE VS DEFENSE": player aced% VS OPPONENT aced% in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_aced%_l10_diff_x"] = df_match3["p_SOS_adj_aced%_l10_x"] - df_match3["p_SOS_adj_aced%_l10_y"]
df_match3["p_SOS_adj_aced%_l10_diff_y"] = -(df_match3["p_SOS_adj_aced%_l10_diff_x"]) 

In [None]:
# "OFFENSE VS DEFENSE": player ace% VS OPPONENT aced% in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_ace%_opp_aced%_l60_decay_diff_x"] = df_match3["p_ace%_l60_decay_x"] - df_match3["p_aced%_l60_decay_y"]
df_match3["p_ace%_opp_aced%_l60_decay_diff_y"] = df_match3["p_ace%_l60_decay_y"] - df_match3["p_aced%_l60_decay_x"]

In [None]:
# "OFFENSE VS DEFENSE": player ace% VS OPPONENT aced% in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_ace%_opp_aced%_l60_decay_diff_x"] = df_match3["p_SOS_adj_ace%_l60_decay_x"] - df_match3["p_SOS_adj_aced%_l60_decay_y"]
df_match3["p_SOS_adj_ace%_opp_aced%_l60_decay_diff_y"] = df_match3["p_SOS_adj_ace%_l60_decay_y"] - df_match3["p_SOS_adj_aced%_l60_decay_x"]

In [None]:
# "OFFENSE VS DEFENSE": player ace% VS OPPONENT aced% in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_ace%_opp_aced%_l10_diff_x"] = df_match3["p_ace%_l10_x"] - df_match3["p_aced%_l10_y"]
df_match3["p_ace%_opp_aced%_l10_diff_y"] = df_match3["p_ace%_l10_y"] - df_match3["p_aced%_l10_x"]

In [None]:
# "OFFENSE VS DEFENSE": player ace% VS OPPONENT aced% in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_ace%_opp_aced%_l10_diff_x"] = df_match3["p_SOS_adj_ace%_l10_x"] - df_match3["p_SOS_adj_aced%_l10_y"]
df_match3["p_SOS_adj_ace%_opp_aced%_l10_diff_y"] = df_match3["p_SOS_adj_ace%_l10_y"] - df_match3["p_SOS_adj_aced%_l10_x"]

In [None]:
# "DEFENSE VS OFFENSE": player aced% VS OPPONENT ace% in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_aced%_opp_ace%_l60_decay_diff_x"] = df_match3["p_aced%_l60_decay_x"] - df_match3["p_ace%_l60_decay_y"]
df_match3["p_aced%_opp_ace%_l60_decay_diff_y"] = df_match3["p_aced%_l60_decay_y"] - df_match3["p_ace%_l60_decay_x"]

In [None]:
# "DEFENSE VS OFFENSE": player aced% VS OPPONENT ace% in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_aced%_opp_ace%_l60_decay_diff_x"] = df_match3["p_SOS_adj_aced%_l60_decay_x"] - df_match3["p_SOS_adj_ace%_l60_decay_y"]
df_match3["p_SOS_adj_aced%_opp_ace%_l60_decay_diff_y"] = df_match3["p_SOS_adj_aced%_l60_decay_y"] - df_match3["p_SOS_adj_ace%_l60_decay_x"]

In [None]:
# "DEFENSE VS OFFENSE": player aced% VS OPPONENT ace% in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_aced%_opp_ace%_l10_diff_x"] = df_match3["p_aced%_l10_x"] - df_match3["p_ace%_l10_y"]
df_match3["p_aced%_opp_ace%_l10_diff_y"] = df_match3["p_aced%_l10_y"] - df_match3["p_ace%_l10_x"]

In [None]:
# "DEFENSE VS OFFENSE": player aced% VS OPPONENT ace% in previous 10 matches (surface-specific; NON-decay-weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_aced%_opp_ace%_l10_diff_x"] = df_match3["p_SOS_adj_aced%_l10_x"] - df_match3["p_SOS_adj_ace%_l10_y"]
df_match3["p_SOS_adj_aced%_opp_ace%_l10_diff_y"] = df_match3["p_SOS_adj_aced%_l10_y"] - df_match3["p_SOS_adj_ace%_l10_x"]

In [None]:
# "DEFENSE VS DEFENSE": player bp saved% VS OPPONENT bp saved % in previous 60 matches (surface-specific; non-decay weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_bp_save%_l60_diff_x"] = df_match3["p_bp_save%_l60_x"] - df_match3["p_bp_save%_l60_y"]
df_match3["p_bp_save%_l60_diff_y"] = -(df_match3["p_bp_save%_l60_diff_x"]) 

In [None]:
# "DEFENSE VS DEFENSE": player bp saved% VS OPPONENT bp saved % in previous 60 matches (surface-specific; non-decay weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_bp_save%_l60_diff_x"] = df_match3["p_SOS_adj_bp_save%_l60_x"] - df_match3["p_SOS_adj_bp_save%_l60_y"]
df_match3["p_SOS_adj_bp_save%_l60_diff_y"] = -(df_match3["p_SOS_adj_bp_save%_l60_diff_x"]) 

In [None]:
# "DEFENSE VS DEFENSE": player bp saved% VS OPPONENT bp saved % in previous 10 matches (surface-specific; NON-decay weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_bp_save%_l10_diff_x"] = df_match3["p_bp_save%_l10_x"] - df_match3["p_bp_save%_l10_y"]
df_match3["p_bp_save%_l10_diff_y"] = -(df_match3["p_bp_save%_l10_diff_x"])

In [None]:
# "DEFENSE VS DEFENSE": player bp saved% VS OPPONENT bp saved % in previous 10 matches (surface-specific; NON-decay weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_bp_save%_l10_diff_x"] = df_match3["p_SOS_adj_bp_save%_l10_x"] - df_match3["p_SOS_adj_bp_save%_l10_y"]
df_match3["p_SOS_adj_bp_save%_l10_diff_y"] = -(df_match3["p_SOS_adj_bp_save%_l10_diff_x"])

In [None]:
# "OFFENSE VS OFFENSE": player bp convert% VS OPPONENT bp convert% in previous 60 matches (surface-specific; NON-decay weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_bp_convert%_l60_diff_x"] = df_match3["p_bp_convert%_l60_x"] - df_match3["p_bp_convert%_l60_y"]
df_match3["p_bp_convert%_l60_diff_y"] = -(df_match3["p_bp_convert%_l60_diff_x"]) 

In [None]:
# "OFFENSE VS OFFENSE": player bp convert% VS OPPONENT bp convert% in previous 60 matches (surface-specific; NON-decay weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_bp_convert%_l60_diff_x"] = df_match3["p_SOS_adj_bp_convert%_l60_x"] - df_match3["p_SOS_adj_bp_convert%_l60_y"]
df_match3["p_SOS_adj_bp_convert%_l60_diff_y"] = -(df_match3["p_SOS_adj_bp_convert%_l60_diff_x"]) 

In [None]:
# "OFFENSE VS OFFENSE": player bp convert% VS OPPONENT bp convert% in previous 10 matches (surface-specific; NON-decay weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_bp_convert%_l10_diff_x"] = df_match3["p_bp_convert%_l10_x"] - df_match3["p_bp_convert%_l10_y"]
df_match3["p_bp_convert%_l10_diff_y"] = -(df_match3["p_bp_convert%_l10_diff_x"]) 

In [None]:
# "OFFENSE VS OFFENSE": player bp convert% VS OPPONENT bp convert% in previous 10 matches (surface-specific; NON-decay weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_bp_convert%_l10_diff_x"] = df_match3["p_SOS_adj_bp_convert%_l10_x"] - df_match3["p_SOS_adj_bp_convert%_l10_y"]
df_match3["p_SOS_adj_bp_convert%_l10_diff_y"] = -(df_match3["p_SOS_adj_bp_convert%_l10_diff_x"]) 

In [None]:
# "OFFENSE VS DEFENSE": player bp convert% VS OPPONENT bp save% in previous 60 matches (surface-specific; NON-decay weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_bp_convert%_opp_bp_save%_l60_diff_x"] = df_match3["p_bp_convert%_l60_x"] - df_match3["p_bp_save%_l60_y"]
df_match3["p_bp_convert%_opp_bp_save%_l60_diff_y"] = df_match3["p_bp_convert%_l60_y"] - df_match3["p_bp_save%_l60_x"]

In [None]:
# "OFFENSE VS DEFENSE": player bp convert% VS OPPONENT bp save% in previous 60 matches (surface-specific; NON-decay weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_bp_convert%_opp_bp_save%_l60_diff_x"] = df_match3["p_SOS_adj_bp_convert%_l60_x"] - df_match3["p_SOS_adj_bp_save%_l60_y"]
df_match3["p_SOS_adj_bp_convert%_opp_bp_save%_l60_diff_y"] = df_match3["p_SOS_adj_bp_convert%_l60_y"] - df_match3["p_SOS_adj_bp_save%_l60_x"]

In [None]:
# "OFFENSE VS DEFENSE": player bp convert% VS OPPONENT bp save% in previous 10 matches (surface-specific; NON-decay weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_bp_convert%_opp_bp_save%_l10_diff_x"] = df_match3["p_bp_convert%_l10_x"] - df_match3["p_bp_save%_l10_y"]
df_match3["p_bp_convert%_opp_bp_save%_l10_diff_y"] = df_match3["p_bp_convert%_l10_y"] - df_match3["p_bp_save%_l10_x"]

In [None]:
# "OFFENSE VS DEFENSE": player bp convert% VS OPPONENT bp save% in previous 10 matches (surface-specific; NON-decay weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_bp_convert%_opp_bp_save%_l10_diff_x"] = df_match3["p_SOS_adj_bp_convert%_l10_x"] - df_match3["p_SOS_adj_bp_save%_l10_y"]
df_match3["p_SOS_adj_bp_convert%_opp_bp_save%_l10_diff_y"] = df_match3["p_SOS_adj_bp_convert%_l10_y"] - df_match3["p_SOS_adj_bp_save%_l10_x"]

In [None]:
# "DEFENSE VS OFFENSE": player bp save% VS OPPONENT bp convert% in previous 60 matches (surface-specific; non-decay weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_bp_save%_opp_bp_convert%_l60_diff_x"] = df_match3["p_bp_save%_l60_x"] - df_match3["p_bp_convert%_l60_y"]
df_match3["p_bp_save%_opp_bp_convert%_l60_diff_y"] = df_match3["p_bp_save%_l60_y"] - df_match3["p_bp_convert%_l60_x"]

In [None]:
# "DEFENSE VS OFFENSE": player bp save% VS OPPONENT bp convert% in previous 60 matches (surface-specific; non-decay weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_bp_save%_opp_bp_convert%_l60_diff_x"] = df_match3["p_SOS_adj_bp_save%_l60_x"] - df_match3["p_SOS_adj_bp_convert%_l60_y"]
df_match3["p_SOS_adj_bp_save%_opp_bp_convert%_l60_diff_y"] = df_match3["p_SOS_adj_bp_save%_l60_y"] - df_match3["p_SOS_adj_bp_convert%_l60_x"]

In [None]:
# "DEFENSE VS OFFENSE": player bp save% VS OPPONENT bp convert% in previous 10 matches (surface-specific; non-decay weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_bp_save%_opp_bp_convert%_l10_diff_x"] = df_match3["p_bp_save%_l10_x"] - df_match3["p_bp_convert%_l10_y"]
df_match3["p_bp_save%_opp_bp_convert%_l10_diff_y"] = df_match3["p_bp_save%_l10_y"] - df_match3["p_bp_convert%_l10_x"]

In [None]:
# "DEFENSE VS OFFENSE": player bp save% VS OPPONENT bp convert% in previous 10 matches (surface-specific; non-decay weighted) differential between winner (_x) and loser (_y)
#This is the Strength of Schedule Adjusted Version

df_match3["p_SOS_adj_bp_save%_opp_bp_convert%_l10_diff_x"] = df_match3["p_SOS_adj_bp_save%_l10_x"] - df_match3["p_SOS_adj_bp_convert%_l10_y"]
df_match3["p_SOS_adj_bp_save%_opp_bp_convert%_l10_diff_y"] = df_match3["p_SOS_adj_bp_save%_l10_y"] - df_match3["p_SOS_adj_bp_convert%_l10_x"]
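
With this many near-identical cells, a quick check that no Strength of Schedule adjusted column is an exact copy of its unadjusted counterpart can help catch copy-paste slips. The sketch below is an assumption about how such a check might look (the function name `identical_sos_pairs` and the toy values are hypothetical); it relies only on the `p_SOS_adj_` naming prefix used throughout this workbook.

```python
import pandas as pd

def identical_sos_pairs(df: pd.DataFrame) -> list:
    """Return SOS-adjusted columns that exactly equal their unadjusted twin."""
    prefix = "p_SOS_adj_"
    flagged = []
    for col in df.columns:
        if col.startswith(prefix):
            plain = "p_" + col[len(prefix):]
            # Identical values suggest the adjusted inputs were not actually used.
            if plain in df.columns and df[col].equals(df[plain]):
                flagged.append(col)
    return flagged

# Toy values invented purely for illustration.
toy = pd.DataFrame({
    "p_bp_save%_l10_diff_x": [1.0, -2.0],
    "p_SOS_adj_bp_save%_l10_diff_x": [1.5, -2.5],
    "p_ace%_l10_diff_x": [0.3, 0.1],
    "p_SOS_adj_ace%_l10_diff_x": [0.3, 0.1],
})
flagged = identical_sos_pairs(toy)
```

On real data, any column name returned by the check would be worth inspecting by hand, since identical values could in principle also arise legitimately.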

In [None]:
# Diff in std for % total points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_pts_won%_std_l60_decay_diff_x"] = df_match3["p_pts_won%_std_l60_decay_x"] - df_match3["p_pts_won%_std_l60_decay_y"]
df_match3["p_pts_won%_std_l60_decay_diff_y"] = -(df_match3["p_pts_won%_std_l60_decay_diff_x"])

In [None]:
# Diff in std for % serve points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_sv_pts_won%_std_l60_decay_diff_x"] = df_match3["p_sv_pts_won%_std_l60_decay_x"] - df_match3["p_sv_pts_won%_std_l60_decay_y"]
df_match3["p_sv_pts_won%_std_l60_decay_diff_y"] = -(df_match3["p_sv_pts_won%_std_l60_decay_diff_x"])

In [None]:
# Diff in std for % return points won in previous 60 matches (surface-specific; decay-weighted) differential between winner (_x) and loser (_y)
#This is the NON-Strength of Schedule Adjusted Version

df_match3["p_ret_pts_won%_std_l60_decay_diff_x"] = df_match3["p_ret_pts_won%_std_l60_decay_x"] - df_match3["p_ret_pts_won%_std_l60_decay_y"]
df_match3["p_ret_pts_won%_std_l60_decay_diff_y"] = -(df_match3["p_ret_pts_won%_std_l60_decay_diff_x"])

#### Retrospective Stamina and Fatigue Player Performance Differential Predictive Features By Match

In [None]:
# Diff in number of minutes played in the immediate previous (within tournament) match between winner (_x) and loser (_y)

df_match3["p_m_time_last_diff_x"] = df_match3["p_m_time_last_x"] - df_match3["p_m_time_last_y"]
df_match3["p_m_time_last_diff_y"] = -(df_match3["p_m_time_last_diff_x"])

In [None]:
# Diff in NON-decay-adjusted number of minutes played over up to the last 6 (within tournament) matches between winner (_x) and loser (_y)

df_match3["p_tot_time_l6_diff_x"] = df_match3["p_tot_time_l6_x"] - df_match3["p_tot_time_l6_y"]
df_match3["p_tot_time_l6_diff_y"] = -(df_match3["p_tot_time_l6_diff_x"])

In [None]:
# Diff in decay-adjusted number of minutes played over up to the last 6 (within tournament) matches between winner (_x) and loser (_y)

df_match3["p_tot_time_l6_decay_diff_x"] = df_match3["p_tot_time_l6_decay_x"] - df_match3["p_tot_time_l6_decay_y"]
df_match3["p_tot_time_l6_decay_diff_y"] = -(df_match3["p_tot_time_l6_decay_diff_x"]) 

In [None]:
# Diff in number of points played in the immediate previous (within tournament) match between winner (_x) and loser (_y)

df_match3["p_tot_pts_last_diff_x"] = df_match3["p_tot_pts_last_x"] - df_match3["p_tot_pts_last_y"]
df_match3["p_tot_pts_last_diff_y"] = -(df_match3["p_tot_pts_last_diff_x"])

In [None]:
# Diff in NON-decay-adjusted total number of points played over up to the last 6 (within tournament) matches between winner (_x) and loser (_y)

df_match3["p_tot_pts_l6_diff_x"] = df_match3["p_tot_pts_l6_x"] - df_match3["p_tot_pts_l6_y"]
df_match3["p_tot_pts_l6_diff_y"] = -(df_match3["p_tot_pts_l6_diff_x"]) 

In [None]:
# Diff in decay-adjusted total number of points played over up to the last 6 (within tournament) matches between winner (_x) and loser (_y)

df_match3["p_tot_pts_l6_decay_diff_x"] = df_match3["p_tot_pts_l6_decay_x"] - df_match3["p_tot_pts_l6_decay_y"]
df_match3["p_tot_pts_l6_decay_diff_y"] = -(df_match3["p_tot_pts_l6_decay_diff_x"]) 

In [None]:
# Difference in total matches played in the entire sample (non-surface specific) between winner (_x) and loser (_y)

df_match3["p_matches_diff_x"] = df_match3["p_matches_x"] - df_match3["p_matches_y"]
df_match3["p_matches_diff_y"] = -(df_match3["p_matches_diff_x"]) 

In [None]:
# Difference in total matches played in the entire sample (SURFACE-SPECIFIC) between winner (_x) and loser (_y)

df_match3["p_matches_surf_diff_x"] = df_match3["p_matches_surf_x"] - df_match3["p_matches_surf_y"]
df_match3["p_matches_surf_diff_y"] = -(df_match3["p_matches_surf_diff_x"]) 

In [None]:
# Difference in stamina-adjusted fatigue (decay weighted total time played last 6 component) between winner (_x) and loser (_y)

df_match3["p_stamina_adj_fatigue_decay_diff_x"] = df_match3["p_stamina_adj_fatigue_decay_x"] - df_match3["p_stamina_adj_fatigue_decay_y"]
df_match3["p_stamina_adj_fatigue_decay_diff_y"] = -(df_match3["p_stamina_adj_fatigue_decay_diff_x"]) 

In [None]:
# Difference in stamina-adjusted fatigue (NON-decay weighted total time played last 6 component) between winner (_x) and loser (_y)

df_match3["p_stamina_adj_fatigue_diff_x"] = df_match3["p_stamina_adj_fatigue_x"] - df_match3["p_stamina_adj_fatigue_y"]
df_match3["p_stamina_adj_fatigue_diff_y"] = -(df_match3["p_stamina_adj_fatigue_diff_x"]) 

In [None]:
# Head-to-Head Matchup Past Differential (surface-specific, but no time constraints) between winner (_x) and loser (_y)

df_match3["p_H2H_diff_x"] = df_match3["p_H2H_w_x"] - df_match3["p_H2H_w_y"]
df_match3["p_H2H_diff_y"] = -(df_match3["p_H2H_diff_x"])

In [None]:
# Head-to-Head Matchup Past Points Won % Differential (surface-specific, but no time constraints) between winner (_x) and loser (_y)

df_match3["p_H2H_pts_won%_diff_x"] = df_match3["p_H2H_pts_won%_x"] - df_match3["p_H2H_pts_won%_y"]
df_match3["p_H2H_pts_won%_diff_y"] = -(df_match3["p_H2H_pts_won%_diff_x"])

Now we return to by-player organization one final time. A few additional features related to court speed prediction will be computed, and the data will then be prepared for the next stage (exploratory data analysis).
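
The by-match to by-player reshaping can be sketched on a toy frame as follows. This is a simplified illustration under stated assumptions, not the workbook's exact procedure: the helper `to_player_rows` is hypothetical, it strips the `_x`/`_y` suffixes generically rather than renaming columns one by one, and it assumes losers are labeled with `m_outcome = 0`.

```python
import pandas as pd

def to_player_rows(df_match: pd.DataFrame) -> pd.DataFrame:
    """Stack winner (_x) and loser (_y) columns into one row per player per match."""
    shared = [c for c in df_match.columns if not c.endswith(("_x", "_y"))]
    frames = []
    for suffix, outcome in (("_x", 1), ("_y", 0)):  # 0 for losers is an assumption
        own = [c for c in df_match.columns if c.endswith(suffix)]
        # Keep match-level columns plus this side's columns, then drop the suffix.
        side = df_match[shared + own].rename(columns={c: c[:-2] for c in own})
        side["m_outcome"] = outcome
        frames.append(side)
    return pd.concat(frames, ignore_index=True)

# Toy values invented purely for illustration.
toy = pd.DataFrame({
    "m_num": [1, 2],
    "p_name_x": ["A", "C"], "p_name_y": ["B", "D"],
    "p_rank_x": [3, 10], "p_rank_y": [8, 4],
})
players = to_player_rows(toy)
```

Each match contributes two rows to the result, one per player, with identical column names so that winners and losers can be modeled together.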

In [None]:
df_match3.to_csv('../data/df_match3.csv', index=False)

In [None]:
#df_match3.info()

In [None]:
#Dropping loser (_y) columns for remerge by player
df_winners4 = df_match3.drop(["m_outcome_x", "m_outcome_y", "p_id_y", "p_name_y", "p_H2H_w_y", "p_H2H_pts_won%_y", "p_rank_y", "p_rank_pts_y", "p_country_y", "p_ent_y", "p_hd_y", "p_ht_y", "p_age_y", "p_matches_y", "p_matches_surf_y", "p_pts_won%_y", "p_pts_won%_l60_decay_y", "p_pts_won%_l60_decay_IO_y", "p_pts_won%_l10_y", "p_SOS_adj_pts_won%_l60_decay_y", "p_SOS_adj_pts_won%_l60_decay_IO_y", "p_SOS_adj_pts_won%_l60_decay_IO_weighted_y","p_SOS_adj_pts_won%_l10_y", "p_sv_pts_won%_y", "p_sv_pts_won%_l60_decay_y", "p_sv_pts_won%_l10_y", "p_SOS_adj_sv_pts_won%_l60_decay_y", "p_SOS_adj_sv_pts_won%_l10_y", "p_ret_pts_won%_y", "p_ret_pts_won%_l60_decay_y", "p_ret_pts_won%_l10_y", "p_SOS_adj_ret_pts_won%_l60_decay_y", "p_SOS_adj_ret_pts_won%_l10_y", "p_ace%_y", "p_ace%_l60_decay_y", "p_ace%_l10_y", "p_SOS_adj_ace%_l60_decay_y", "p_SOS_adj_ace%_l10_y", "p_aced%_y", "p_aced%_l60_decay_y", "p_aced%_l10_y", "p_SOS_adj_aced%_l60_decay_y", "p_SOS_adj_aced%_l10_y", "p_bp_save%_y", "p_bp_save%_l60_y", "p_bp_save%_l10_y", "p_SOS_adj_bp_save%_l60_y", "p_SOS_adj_bp_save%_l10_y", "p_bp_convert%_y", "p_bp_convert%_l60_y", "p_bp_convert%_l10_y", "p_SOS_adj_bp_convert%_l60_y", "p_SOS_adj_bp_convert%_l10_y", "p_pts_won%_std_l60_decay_y", "p_sv_pts_won%_std_l60_decay_y", "p_ret_pts_won%_std_l60_decay_y", "p_m_time_last_y", "p_tot_time_l6_y", "p_tot_time_l6_decay_y", "p_tot_pts_last_y",  "p_tot_pts_l6_y", "p_tot_pts_l6_decay_y", "p_stamina_adj_fatigue_decay_y", "p_stamina_adj_fatigue_y", "p_IP_NV_y", "p_rank_diff_y", "p_log_rank_y", "p_log_rank_diff_y", "p_rank_pts_diff_y", "p_ent_diff_y", "p_ht_diff_y", "p_age_diff_y", "p_L_opp_R_y", "p_HCA_opp_N_y", "p_pts_won%_l60_decay_diff_y", "p_pts_won%_l60_decay_IO_diff_y", "p_SOS_adj_pts_won%_l60_decay_diff_y", "p_SOS_adj_pts_won%_l60_decay_IO_diff_y", "p_SOS_adj_pts_won%_l60_decay_IO_weighted_diff_y", "p_pts_won%_l10_diff_y", "p_SOS_adj_pts_won%_l10_diff_y", "p_sv_pts_won%_l60_decay_diff_y", "p_SOS_adj_sv_pts_won%_l60_decay_diff_y", 
"p_sv_pts_won%_l10_diff_y", "p_SOS_adj_sv_pts_won%_l10_diff_y", "p_ret_pts_won%_l60_decay_diff_y", "p_SOS_adj_ret_pts_won%_l60_decay_diff_y", "p_ret_pts_won%_l10_diff_y", "p_SOS_adj_ret_pts_won%_l10_diff_y", "p_sv_opp_ret_pts_won%_l60_decay_diff_y", "p_SOS_adj_sv_opp_ret_pts_won%_l60_decay_diff_y", "p_sv_opp_ret_pts_won%_l10_diff_y", "p_SOS_adj_sv_opp_ret_pts_won%_l10_diff_y", "p_ret_opp_sv_pts_won%_l60_decay_diff_y", "p_SOS_adj_ret_opp_sv_pts_won%_l60_decay_diff_y", "p_ret_opp_sv_pts_won%_l10_diff_y", "p_SOS_adj_ret_opp_sv_pts_won%_l10_diff_y", "p_ace%_l60_decay_diff_y", "p_SOS_adj_ace%_l60_decay_diff_y", "p_ace%_l10_diff_y", "p_SOS_adj_ace%_l10_diff_y", "p_aced%_l60_decay_diff_y", "p_SOS_adj_aced%_l60_decay_diff_y", "p_aced%_l10_diff_y", "p_SOS_adj_aced%_l10_diff_y", "p_ace%_opp_aced%_l60_decay_diff_y", "p_SOS_adj_ace%_opp_aced%_l60_decay_diff_y", "p_ace%_opp_aced%_l10_diff_y", "p_SOS_adj_ace%_opp_aced%_l10_diff_y", "p_aced%_opp_ace%_l60_decay_diff_y", "p_SOS_adj_aced%_opp_ace%_l60_decay_diff_y", "p_aced%_opp_ace%_l10_diff_y", "p_SOS_adj_aced%_opp_ace%_l10_diff_y", "p_bp_save%_l60_diff_y", "p_SOS_adj_bp_save%_l60_diff_y", "p_bp_save%_l10_diff_y", "p_SOS_adj_bp_save%_l10_diff_y", "p_bp_convert%_l60_diff_y", "p_SOS_adj_bp_convert%_l60_diff_y", "p_bp_convert%_l10_diff_y", "p_SOS_adj_bp_convert%_l10_diff_y", "p_bp_convert%_opp_bp_save%_l60_diff_y", "p_SOS_adj_bp_convert%_opp_bp_save%_l60_diff_y", "p_bp_convert%_opp_bp_save%_l10_diff_y", "p_SOS_adj_bp_convert%_opp_bp_save%_l10_diff_y", "p_bp_save%_opp_bp_convert%_l60_diff_y", "p_SOS_adj_bp_save%_opp_bp_convert%_l60_diff_y", "p_bp_save%_opp_bp_convert%_l10_diff_y", "p_SOS_adj_bp_save%_opp_bp_convert%_l10_diff_y", "p_pts_won%_std_l60_decay_diff_y", "p_sv_pts_won%_std_l60_decay_diff_y", "p_ret_pts_won%_std_l60_decay_diff_y", "p_m_time_last_diff_y", "p_tot_time_l6_diff_y", "p_tot_time_l6_decay_diff_y", "p_tot_pts_last_diff_y", "p_tot_pts_l6_diff_y", "p_tot_pts_l6_decay_diff_y", "p_matches_diff_y", "p_matches_surf_diff_y", 
"p_stamina_adj_fatigue_decay_diff_y", "p_stamina_adj_fatigue_diff_y", "p_H2H_diff_y", "p_H2H_pts_won%_diff_y"], axis=1)
df_winners4["m_outcome"] = 1

In [None]:
df_winners4.to_csv('../data/df_winners4.csv', index=False)

In [None]:
#Renaming columns to remove winner-loser descriptions so we can re-concatenate winners and losers
df_winners4 = df_winners4.set_axis(["t_id", "t_date", "tour_wk", "t_name", "t_country", "t_surf", "t_indoor", "t_alt", "t_lvl", "t_draw_size", "t_rd_num", "m_num", "m_best_of", "m_time(m)", "m_tot_pts", "p_id", "p_name", "p_H2H_w", "p_H2H_pts_won%", "p_rank", "p_rank_pts", "p_country", "p_ent", "p_hd", "p_ht", "p_age", "p_matches", "p_matches_surf", "p_pts_won%", "p_pts_won%_l60_decay", "p_pts_won%_l60_decay_IO", "p_pts_won%_l10", "p_SOS_adj_pts_won%_l60_decay", "p_SOS_adj_pts_won%_l60_decay_IO", "p_SOS_adj_pts_won%_l60_decay_IO_weighted", "p_SOS_adj_pts_won%_l10", "p_sv_pts_won%", "p_sv_pts_won%_l60_decay", "p_sv_pts_won%_l10", "p_SOS_adj_sv_pts_won%_l60_decay", "p_SOS_adj_sv_pts_won%_l10", "p_ret_pts_won%", "p_ret_pts_won%_l60_decay", "p_ret_pts_won%_l10", "p_SOS_adj_ret_pts_won%_l60_decay", "p_SOS_adj_ret_pts_won%_l10", "p_ace%", "p_ace%_l60_decay", "p_ace%_l10", "p_SOS_adj_ace%_l60_decay", "p_SOS_adj_ace%_l10", "p_aced%", "p_aced%_l60_decay", "p_aced%_l10", "p_SOS_adj_aced%_l60_decay", "p_SOS_adj_aced%_l10", "p_bp_save%", "p_bp_save%_l60", "p_bp_save%_l10", "p_SOS_adj_bp_save%_l60", "p_SOS_adj_bp_save%_l10", "p_bp_convert%", "p_bp_convert%_l60", "p_bp_convert%_l10", "p_SOS_adj_bp_convert%_l60", "p_SOS_adj_bp_convert%_l10", "p_pts_won%_std_l60_decay",'p_sv_pts_won%_std_l60_decay','p_ret_pts_won%_std_l60_decay', "p_m_time_last", "p_tot_time_l6", "p_tot_time_l6_decay", "p_tot_pts_last", "p_tot_pts_l6", "p_tot_pts_l6_decay", "p_stamina_adj_fatigue_decay", "p_stamina_adj_fatigue", "p_IP_NV", "p_opp_rank_diff", "p_log_rank", "p_opp_log_rank_diff", "p_opp_rank_pts_diff", "p_ent_diff", "p_opp_ht_diff","p_opp_age_diff","p_L_opp_R","p_HCA_opp_N", "p_pts_won%_l60_decay_diff", "p_pts_won%_l60_decay_IO_diff", "p_SOS_adj_pts_won%_l60_decay_diff", "p_SOS_adj_pts_won%_l60_decay_IO_diff", "p_SOS_adj_pts_won%_l60_decay_IO_weighted_diff", "p_pts_won%_l10_diff", "p_SOS_adj_pts_won%_l10_diff", "p_sv_pts_won%_l60_decay_diff", "p_SOS_adj_sv_pts_won%_l60_decay_diff", 
"p_sv_pts_won%_l10_diff", "p_SOS_adj_sv_pts_won%_l10_diff", "p_ret_pts_won%_l60_decay_diff", "p_SOS_adj_ret_pts_won%_l60_decay_diff", "p_ret_pts_won%_l10_diff", "p_SOS_adj_ret_pts_won%_l10_diff", "p_sv_opp_ret_pts_won%_l60_decay_diff", "p_SOS_adj_sv_opp_ret_pts_won%_l60_decay_diff", "p_sv_opp_ret_pts_won%_l10_diff", "p_SOS_adj_sv_opp_ret_pts_won%_l10_diff", "p_ret_opp_sv_pts_won%_l60_decay_diff", "p_SOS_adj_ret_opp_sv_pts_won%_l60_decay_diff", "p_ret_opp_sv_pts_won%_l10_diff", "p_SOS_adj_ret_opp_sv_pts_won%_l10_diff", "p_ace%_l60_decay_diff", "p_SOS_adj_ace%_l60_decay_diff", "p_ace%_l10_diff", "p_SOS_adj_ace%_l10_diff", "p_aced%_l60_decay_diff", "p_SOS_adj_aced%_l60_decay_diff", "p_aced%_l10_diff", "p_SOS_adj_aced%_l10_diff", "p_ace%_opp_aced%_l60_decay_diff", "p_SOS_adj_ace%_opp_aced%_l60_decay_diff", "p_ace%_opp_aced%_l10_diff", "p_SOS_adj_ace%_opp_aced%_l10_diff", "p_aced%_opp_ace%_l60_decay_diff", "p_SOS_adj_aced%_opp_ace%_l60_decay_diff", "p_aced%_opp_ace%_l10_diff", "p_SOS_adj_aced%_opp_ace%_l10_diff", "p_bp_save%_l60_diff", "p_SOS_adj_bp_save%_l60_diff", "p_bp_save%_l10_diff", "p_SOS_adj_bp_save%_l10_diff", "p_bp_convert%_l60_diff", "p_SOS_adj_bp_convert%_l60_diff", "p_bp_convert%_l10_diff", "p_SOS_adj_bp_convert%_l10_diff", "p_bp_convert%_opp_bp_save%_l60_diff", "p_SOS_adj_bp_convert%_opp_bp_save%_l60_diff", "p_bp_convert%_opp_bp_save%_l10_diff", "p_SOS_adj_bp_convert%_opp_bp_save%_l10_diff", "p_bp_save%_opp_bp_convert%_l60_diff", "p_SOS_adj_bp_save%_opp_bp_convert%_l60_diff", "p_bp_save%_opp_bp_convert%_l10_diff", "p_SOS_adj_bp_save%_opp_bp_convert%_l10_diff", "p_pts_won%_std_l60_decay_diff", "p_sv_pts_won%_std_l60_decay_diff", "p_ret_pts_won%_std_l60_decay_diff", "p_m_time_last_diff", "p_tot_time_l6_diff", "p_tot_time_l6_decay_diff", "p_tot_pts_last_diff", "p_tot_pts_l6_diff", "p_tot_pts_l6_decay_diff", "p_matches_diff", "p_matches_surf_diff", "p_stam_adj_fatigue_decay_diff", "p_stam_adj_fatigue_diff", "p_H2H_diff", "p_H2H_pts_won%_diff", "m_outcome"], 
axis=1)

In [None]:
#Dropping winner (_x) columns for remerge by player
df_losers4 = df_match3.drop(["m_outcome_x", "m_outcome_y", "p_id_x", "p_name_x", "p_H2H_w_x", "p_H2H_pts_won%_x", "p_rank_x", "p_rank_pts_x", "p_country_x", "p_ent_x", "p_hd_x", "p_ht_x", "p_age_x", "p_matches_x", "p_matches_surf_x", "p_pts_won%_x", "p_pts_won%_l60_decay_x", "p_pts_won%_l60_decay_IO_x", "p_pts_won%_l10_x", "p_SOS_adj_pts_won%_l60_decay_x", "p_SOS_adj_pts_won%_l60_decay_IO_x", "p_SOS_adj_pts_won%_l60_decay_IO_weighted_x", "p_SOS_adj_pts_won%_l10_x", "p_sv_pts_won%_x", "p_sv_pts_won%_l60_decay_x", "p_sv_pts_won%_l10_x", "p_SOS_adj_sv_pts_won%_l60_decay_x", "p_SOS_adj_sv_pts_won%_l10_x", "p_ret_pts_won%_x", "p_ret_pts_won%_l60_decay_x", "p_ret_pts_won%_l10_x", "p_SOS_adj_ret_pts_won%_l60_decay_x", "p_SOS_adj_ret_pts_won%_l10_x", "p_ace%_x", "p_ace%_l60_decay_x", "p_ace%_l10_x", "p_SOS_adj_ace%_l60_decay_x", "p_SOS_adj_ace%_l10_x", "p_aced%_x", "p_aced%_l60_decay_x", "p_aced%_l10_x", "p_SOS_adj_aced%_l60_decay_x", "p_SOS_adj_aced%_l10_x", "p_bp_save%_x", "p_bp_save%_l60_x", "p_bp_save%_l10_x", "p_SOS_adj_bp_save%_l60_x", "p_SOS_adj_bp_save%_l10_x", "p_bp_convert%_x", "p_bp_convert%_l60_x", "p_bp_convert%_l10_x", "p_SOS_adj_bp_convert%_l60_x", "p_SOS_adj_bp_convert%_l10_x", "p_pts_won%_std_l60_decay_x", "p_sv_pts_won%_std_l60_decay_x", "p_ret_pts_won%_std_l60_decay_x", "p_m_time_last_x", "p_tot_time_l6_x", "p_tot_time_l6_decay_x", "p_tot_pts_last_x", "p_tot_pts_l6_x", "p_tot_pts_l6_decay_x", "p_stamina_adj_fatigue_decay_x", "p_stamina_adj_fatigue_x","p_IP_NV_x", "p_rank_diff_x", "p_log_rank_x", "p_log_rank_diff_x", "p_rank_pts_diff_x", "p_ent_diff_x", "p_ht_diff_x", "p_age_diff_x", "p_L_opp_R_x", "p_HCA_opp_N_x", "p_pts_won%_l60_decay_diff_x", "p_pts_won%_l60_decay_IO_diff_x", "p_SOS_adj_pts_won%_l60_decay_diff_x", "p_SOS_adj_pts_won%_l60_decay_IO_diff_x", "p_SOS_adj_pts_won%_l60_decay_IO_weighted_diff_x", "p_pts_won%_l10_diff_x", "p_SOS_adj_pts_won%_l10_diff_x", "p_sv_pts_won%_l60_decay_diff_x", "p_SOS_adj_sv_pts_won%_l60_decay_diff_x", 
"p_sv_pts_won%_l10_diff_x", "p_SOS_adj_sv_pts_won%_l10_diff_x", "p_ret_pts_won%_l60_decay_diff_x", "p_SOS_adj_ret_pts_won%_l60_decay_diff_x", "p_ret_pts_won%_l10_diff_x", "p_SOS_adj_ret_pts_won%_l10_diff_x", "p_sv_opp_ret_pts_won%_l60_decay_diff_x", "p_SOS_adj_sv_opp_ret_pts_won%_l60_decay_diff_x", "p_sv_opp_ret_pts_won%_l10_diff_x", "p_SOS_adj_sv_opp_ret_pts_won%_l10_diff_x", "p_ret_opp_sv_pts_won%_l60_decay_diff_x", "p_SOS_adj_ret_opp_sv_pts_won%_l60_decay_diff_x", "p_ret_opp_sv_pts_won%_l10_diff_x", "p_SOS_adj_ret_opp_sv_pts_won%_l10_diff_x", "p_ace%_l60_decay_diff_x", "p_SOS_adj_ace%_l60_decay_diff_x", "p_ace%_l10_diff_x", "p_SOS_adj_ace%_l10_diff_x", "p_aced%_l60_decay_diff_x", "p_SOS_adj_aced%_l60_decay_diff_x", "p_aced%_l10_diff_x", "p_SOS_adj_aced%_l10_diff_x", "p_ace%_opp_aced%_l60_decay_diff_x", "p_SOS_adj_ace%_opp_aced%_l60_decay_diff_x", "p_ace%_opp_aced%_l10_diff_x", "p_SOS_adj_ace%_opp_aced%_l10_diff_x", "p_aced%_opp_ace%_l60_decay_diff_x", "p_SOS_adj_aced%_opp_ace%_l60_decay_diff_x", "p_aced%_opp_ace%_l10_diff_x", "p_SOS_adj_aced%_opp_ace%_l10_diff_x", "p_bp_save%_l60_diff_x", "p_SOS_adj_bp_save%_l60_diff_x", "p_bp_save%_l10_diff_x", "p_SOS_adj_bp_save%_l10_diff_x", "p_bp_convert%_l60_diff_x", "p_SOS_adj_bp_convert%_l60_diff_x", "p_bp_convert%_l10_diff_x", "p_SOS_adj_bp_convert%_l10_diff_x", "p_bp_convert%_opp_bp_save%_l60_diff_x", "p_SOS_adj_bp_convert%_opp_bp_save%_l60_diff_x", "p_bp_convert%_opp_bp_save%_l10_diff_x", "p_SOS_adj_bp_convert%_opp_bp_save%_l10_diff_x", "p_bp_save%_opp_bp_convert%_l60_diff_x", "p_SOS_adj_bp_save%_opp_bp_convert%_l60_diff_x", "p_bp_save%_opp_bp_convert%_l10_diff_x", "p_SOS_adj_bp_save%_opp_bp_convert%_l10_diff_x", "p_pts_won%_std_l60_decay_diff_x", "p_sv_pts_won%_std_l60_decay_diff_x", "p_ret_pts_won%_std_l60_decay_diff_x", "p_m_time_last_diff_x", "p_tot_time_l6_diff_x", "p_tot_time_l6_decay_diff_x", "p_tot_pts_last_diff_x", "p_tot_pts_l6_diff_x", "p_tot_pts_l6_decay_diff_x", "p_matches_diff_x", "p_matches_surf_diff_x", 
"p_stamina_adj_fatigue_decay_diff_x", "p_stamina_adj_fatigue_diff_x", "p_H2H_diff_x", "p_H2H_pts_won%_diff_x"], axis=1)
df_losers4["m_outcome"] = 0

In [None]:
df_losers4.to_csv('../data/df_losers4.csv', index=False)

In [None]:
#Renaming columns to remove winner-loser descriptions so we can re-concatenate winners and losers
df_losers4 = df_losers4.set_axis(["t_id", "t_date", "tour_wk", "t_name", "t_country", "t_surf", "t_indoor", "t_alt", "t_lvl", "t_draw_size", "t_rd_num", "m_num", "m_best_of", "m_time(m)", "m_tot_pts", "p_id", "p_name", "p_H2H_w", "p_H2H_pts_won%", "p_rank", "p_rank_pts", "p_country", "p_ent", "p_hd", "p_ht", "p_age", "p_matches", "p_matches_surf", "p_pts_won%", "p_pts_won%_l60_decay", "p_pts_won%_l60_decay_IO", "p_pts_won%_l10", "p_SOS_adj_pts_won%_l60_decay", "p_SOS_adj_pts_won%_l60_decay_IO", "p_SOS_adj_pts_won%_l60_decay_IO_weighted", "p_SOS_adj_pts_won%_l10", "p_sv_pts_won%", "p_sv_pts_won%_l60_decay", "p_sv_pts_won%_l10", "p_SOS_adj_sv_pts_won%_l60_decay", "p_SOS_adj_sv_pts_won%_l10", "p_ret_pts_won%", "p_ret_pts_won%_l60_decay", "p_ret_pts_won%_l10", "p_SOS_adj_ret_pts_won%_l60_decay", "p_SOS_adj_ret_pts_won%_l10", "p_ace%", "p_ace%_l60_decay", "p_ace%_l10", "p_SOS_adj_ace%_l60_decay", "p_SOS_adj_ace%_l10", "p_aced%", "p_aced%_l60_decay", "p_aced%_l10", "p_SOS_adj_aced%_l60_decay", "p_SOS_adj_aced%_l10", "p_bp_save%", "p_bp_save%_l60", "p_bp_save%_l10", "p_SOS_adj_bp_save%_l60", "p_SOS_adj_bp_save%_l10", "p_bp_convert%", "p_bp_convert%_l60", "p_bp_convert%_l10", "p_SOS_adj_bp_convert%_l60", "p_SOS_adj_bp_convert%_l10", "p_pts_won%_std_l60_decay",'p_sv_pts_won%_std_l60_decay','p_ret_pts_won%_std_l60_decay', "p_m_time_last", "p_tot_time_l6", "p_tot_time_l6_decay", "p_tot_pts_last", "p_tot_pts_l6", "p_tot_pts_l6_decay", "p_stamina_adj_fatigue_decay", "p_stamina_adj_fatigue", "p_IP_NV", "p_opp_rank_diff", "p_log_rank", "p_opp_log_rank_diff", "p_opp_rank_pts_diff", "p_ent_diff", "p_opp_ht_diff","p_opp_age_diff","p_L_opp_R","p_HCA_opp_N", "p_pts_won%_l60_decay_diff", "p_pts_won%_l60_decay_IO_diff", "p_SOS_adj_pts_won%_l60_decay_diff", "p_SOS_adj_pts_won%_l60_decay_IO_diff", "p_SOS_adj_pts_won%_l60_decay_IO_weighted_diff", "p_pts_won%_l10_diff", "p_SOS_adj_pts_won%_l10_diff", "p_sv_pts_won%_l60_decay_diff", "p_SOS_adj_sv_pts_won%_l60_decay_diff", 
"p_sv_pts_won%_l10_diff", "p_SOS_adj_sv_pts_won%_l10_diff", "p_ret_pts_won%_l60_decay_diff", "p_SOS_adj_ret_pts_won%_l60_decay_diff", "p_ret_pts_won%_l10_diff", "p_SOS_adj_ret_pts_won%_l10_diff", "p_sv_opp_ret_pts_won%_l60_decay_diff", "p_SOS_adj_sv_opp_ret_pts_won%_l60_decay_diff", "p_sv_opp_ret_pts_won%_l10_diff", "p_SOS_adj_sv_opp_ret_pts_won%_l10_diff", "p_ret_opp_sv_pts_won%_l60_decay_diff", "p_SOS_adj_ret_opp_sv_pts_won%_l60_decay_diff", "p_ret_opp_sv_pts_won%_l10_diff", "p_SOS_adj_ret_opp_sv_pts_won%_l10_diff", "p_ace%_l60_decay_diff", "p_SOS_adj_ace%_l60_decay_diff", "p_ace%_l10_diff", "p_SOS_adj_ace%_l10_diff", "p_aced%_l60_decay_diff", "p_SOS_adj_aced%_l60_decay_diff", "p_aced%_l10_diff", "p_SOS_adj_aced%_l10_diff", "p_ace%_opp_aced%_l60_decay_diff", "p_SOS_adj_ace%_opp_aced%_l60_decay_diff", "p_ace%_opp_aced%_l10_diff", "p_SOS_adj_ace%_opp_aced%_l10_diff", "p_aced%_opp_ace%_l60_decay_diff", "p_SOS_adj_aced%_opp_ace%_l60_decay_diff", "p_aced%_opp_ace%_l10_diff", "p_SOS_adj_aced%_opp_ace%_l10_diff", "p_bp_save%_l60_diff", "p_SOS_adj_bp_save%_l60_diff", "p_bp_save%_l10_diff", "p_SOS_adj_bp_save%_l10_diff", "p_bp_convert%_l60_diff", "p_SOS_adj_bp_convert%_l60_diff", "p_bp_convert%_l10_diff", "p_SOS_adj_bp_convert%_l10_diff", "p_bp_convert%_opp_bp_save%_l60_diff", "p_SOS_adj_bp_convert%_opp_bp_save%_l60_diff", "p_bp_convert%_opp_bp_save%_l10_diff",  "p_SOS_adj_bp_convert%_opp_bp_save%_l10_diff", "p_bp_save%_opp_bp_convert%_l60_diff", "p_SOS_adj_bp_save%_opp_bp_convert%_l60_diff", "p_bp_save%_opp_bp_convert%_l10_diff", "p_SOS_adj_bp_save%_opp_bp_convert%_l10_diff", "p_pts_won%_std_l60_decay_diff", "p_sv_pts_won%_std_l60_decay_diff", "p_ret_pts_won%_std_l60_decay_diff", "p_m_time_last_diff", "p_tot_time_l6_diff", "p_tot_time_l6_decay_diff", "p_tot_pts_last_diff", "p_tot_pts_l6_diff", "p_tot_pts_l6_decay_diff", "p_matches_diff", "p_matches_surf_diff", "p_stam_adj_fatigue_decay_diff", "p_stam_adj_fatigue_diff", "p_H2H_diff", "p_H2H_pts_won%_diff", "m_outcome"], 
axis=1)

In [None]:
#Re-merge data, but now with no separate columns for winners and losers 
df_player4 = pd.concat([df_winners4, df_losers4], ignore_index=True)
#df_player4.info()

In [None]:
#df_player4.head(30)

In [None]:
df_player5 = df_player4[["t_id", "t_date", "tour_wk", "t_name", "t_country", "t_surf","t_indoor", "t_alt", "t_lvl", "t_draw_size", "t_rd_num", "m_num", "m_best_of", "m_time(m)", "m_tot_pts", "p_id", "p_name", "p_rank", "p_log_rank", "p_rank_pts", "p_country", "p_ent", "p_hd", "p_ht", "p_age", "p_matches", "p_matches_surf", "p_H2H_w", "p_H2H_pts_won%", "p_pts_won%", "p_pts_won%_l60_decay", "p_pts_won%_l60_decay_IO", "p_pts_won%_l10", "p_SOS_adj_pts_won%_l60_decay", "p_SOS_adj_pts_won%_l60_decay_IO", "p_SOS_adj_pts_won%_l60_decay_IO_weighted", "p_SOS_adj_pts_won%_l10", "p_sv_pts_won%", "p_sv_pts_won%_l60_decay", "p_sv_pts_won%_l10", "p_SOS_adj_sv_pts_won%_l60_decay", "p_SOS_adj_sv_pts_won%_l10", "p_ret_pts_won%", "p_ret_pts_won%_l60_decay", "p_ret_pts_won%_l10", "p_SOS_adj_ret_pts_won%_l60_decay", "p_SOS_adj_ret_pts_won%_l10", "p_ace%", "p_ace%_l60_decay", "p_ace%_l10", "p_SOS_adj_ace%_l60_decay", "p_SOS_adj_ace%_l10", "p_aced%", "p_aced%_l60_decay", "p_aced%_l10", "p_SOS_adj_aced%_l60_decay", "p_SOS_adj_aced%_l10", "p_bp_save%", "p_bp_save%_l60", "p_bp_save%_l10", "p_SOS_adj_bp_save%_l60", "p_SOS_adj_bp_save%_l10", "p_bp_convert%", "p_bp_convert%_l60", "p_bp_convert%_l10", "p_SOS_adj_bp_convert%_l60", "p_SOS_adj_bp_convert%_l10", "p_pts_won%_std_l60_decay",'p_sv_pts_won%_std_l60_decay','p_ret_pts_won%_std_l60_decay', "p_m_time_last", "p_tot_time_l6", "p_tot_time_l6_decay", "p_tot_pts_last", "p_tot_pts_l6", "p_tot_pts_l6_decay", "p_stamina_adj_fatigue", "p_stamina_adj_fatigue_decay", "p_opp_rank_diff", "p_opp_log_rank_diff", "p_opp_rank_pts_diff", "p_ent_diff", "p_opp_ht_diff", "p_opp_age_diff", "p_L_opp_R", "p_HCA_opp_N", "p_pts_won%_l60_decay_diff", "p_pts_won%_l60_decay_IO_diff", "p_SOS_adj_pts_won%_l60_decay_diff", "p_SOS_adj_pts_won%_l60_decay_IO_diff", "p_SOS_adj_pts_won%_l60_decay_IO_weighted_diff", "p_pts_won%_l10_diff", "p_SOS_adj_pts_won%_l10_diff", "p_sv_pts_won%_l60_decay_diff", "p_SOS_adj_sv_pts_won%_l60_decay_diff", "p_sv_pts_won%_l10_diff", 
"p_SOS_adj_sv_pts_won%_l10_diff", "p_ret_pts_won%_l60_decay_diff", "p_SOS_adj_ret_pts_won%_l60_decay_diff", "p_ret_pts_won%_l10_diff", "p_SOS_adj_ret_pts_won%_l10_diff", "p_sv_opp_ret_pts_won%_l60_decay_diff", "p_SOS_adj_sv_opp_ret_pts_won%_l60_decay_diff", "p_sv_opp_ret_pts_won%_l10_diff", "p_SOS_adj_sv_opp_ret_pts_won%_l10_diff", "p_ret_opp_sv_pts_won%_l60_decay_diff", "p_SOS_adj_ret_opp_sv_pts_won%_l60_decay_diff", "p_ret_opp_sv_pts_won%_l10_diff", "p_SOS_adj_ret_opp_sv_pts_won%_l10_diff", "p_ace%_l60_decay_diff", "p_SOS_adj_ace%_l60_decay_diff", "p_ace%_l10_diff", "p_SOS_adj_ace%_l10_diff", "p_aced%_l60_decay_diff", "p_SOS_adj_aced%_l60_decay_diff", "p_aced%_l10_diff", "p_SOS_adj_aced%_l10_diff", "p_ace%_opp_aced%_l60_decay_diff", "p_SOS_adj_ace%_opp_aced%_l60_decay_diff", "p_ace%_opp_aced%_l10_diff", "p_SOS_adj_ace%_opp_aced%_l10_diff", "p_aced%_opp_ace%_l60_decay_diff", "p_SOS_adj_aced%_opp_ace%_l60_decay_diff", "p_aced%_opp_ace%_l10_diff", "p_SOS_adj_aced%_opp_ace%_l10_diff", "p_bp_save%_l60_diff", "p_SOS_adj_bp_save%_l60_diff", "p_bp_save%_l10_diff", "p_SOS_adj_bp_save%_l10_diff", "p_bp_convert%_l60_diff", "p_SOS_adj_bp_convert%_l60_diff", "p_bp_convert%_l10_diff", "p_SOS_adj_bp_convert%_l10_diff", "p_bp_convert%_opp_bp_save%_l60_diff", "p_SOS_adj_bp_convert%_opp_bp_save%_l60_diff", "p_bp_convert%_opp_bp_save%_l10_diff", "p_SOS_adj_bp_convert%_opp_bp_save%_l10_diff", "p_bp_save%_opp_bp_convert%_l60_diff", "p_SOS_adj_bp_save%_opp_bp_convert%_l60_diff", "p_bp_save%_opp_bp_convert%_l10_diff", "p_SOS_adj_bp_save%_opp_bp_convert%_l10_diff", "p_pts_won%_std_l60_decay_diff", "p_sv_pts_won%_std_l60_decay_diff", "p_ret_pts_won%_std_l60_decay_diff", "p_m_time_last_diff", "p_tot_time_l6_diff", "p_tot_time_l6_decay_diff", "p_tot_pts_last_diff", "p_tot_pts_l6_diff", "p_tot_pts_l6_decay_diff", "p_matches_diff", "p_matches_surf_diff", "p_stam_adj_fatigue_diff", "p_stam_adj_fatigue_decay_diff",  "p_H2H_diff", "p_H2H_pts_won%_diff", "p_IP_NV", "m_outcome"]]

In [None]:
#Sorting as such helps visually verify the complicated, backward-looking stat accrual calculations we will make below
df_player5 = df_player5.sort_values(by=['p_id','tour_wk','t_rd_num'], ascending = False)

In [None]:
#df_player5.info()

In [None]:
#df_player5.to_csv('../data/df_player5.csv', index=False)

Ideally, a model of court conditions would be generated as close to real time as possible before each match we want to predict. Once a sufficient number of matches have been played in a given tournament, priors on court speed could be updated as well.

For now, we will use the ace% at a given tournament in the previous year (when available) as a proxy for court speed. Conditions are dictated by a number of factors, including the balls, altitude, watering frequency (clay) and the amount of sand in the surface mix (hard courts). Indoor conditions also tend to be faster than outdoor. This is challenging to model because the conditions variables are seldom all the same from year to year at a given venue. In addition, the weather at the time of a match can make a considerable difference in court conditions, possibly an even greater one than the underlying "weather neutral" conditions.
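As a minimal sketch of the previous-year proxy on toy data: the tournament names, the explicit `year` column, and all values below are invented for illustration (the workbook itself keys on `t_id` rather than a separate year column).

```python
import pandas as pd

# Toy per-match ace rates; names, years and values are hypothetical.
matches = pd.DataFrame({
    "t_name": ["Rome", "Rome", "Rome", "Rome", "Madrid", "Madrid"],
    "year":   [2017, 2017, 2018, 2018, 2018, 2018],
    "p_ace%": [4.0, 6.0, 5.0, 9.0, 8.0, 10.0],
})

# Mean ace rate per tournament edition.
t_ace = (
    matches.groupby(["t_name", "year"], as_index=False)["p_ace%"]
    .mean()
    .rename(columns={"p_ace%": "t_ace%"})
    .sort_values(["t_name", "year"])
)

# Previous edition's value within each tournament; the first edition has none (NaN).
t_ace["t_ace%_last"] = t_ace.groupby("t_name")["t_ace%"].shift(1)
print(t_ace)
```

Note that a shift of this kind picks up the previous edition present in the data, which is the previous calendar year only when no edition is missing.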

In [None]:
# First, generate by-tournament ace%
t_ace_perc = df_player5[["p_name","t_name","t_id","p_ace%"]]
t_ace_perc.head()

In [None]:
# Before computing by-tourny, by-year means, removing data from the three largest ace outliers in tennis history (apologies to Andy Roddick and Milos Raonic).
# Their absence or presence, especially if they go very deep in the tourny, really does make a big difference to ace stats at the individual tourny level.
t_ace_perc = t_ace_perc[~t_ace_perc['p_name'].str.contains("Karlovic")]
t_ace_perc = t_ace_perc[~t_ace_perc['p_name'].str.contains("Isner")]
t_ace_perc = t_ace_perc[~t_ace_perc['p_name'].str.contains("Opelka")]
t_ace_perc.info()

In [None]:
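# computes mean ace % per tourny per year (with the 6'7" and above outliers removed above)
# selecting the numeric column explicitly keeps the string column p_name out of the mean
t_ace_perc = t_ace_perc.groupby(['t_id','t_name'])[['p_ace%']].mean().round(2)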
t_ace_perc.head(20)

In [None]:
t_ace_perc = t_ace_perc.sort_values(by=['t_name','t_id'], ascending = False)
t_ace_perc

In [None]:
t_ace_perc.rename(columns = {'p_ace%':'t_ace%'}, inplace=True)

In [None]:
#t_ace_perc.info()

In [None]:
#t_ace_perc.to_csv('../data/t_ace_perc.csv', index=True)

In [None]:
# For each tourny in the sample, applies the previous year's ace% from the same tourney (where available) as the speed conditions proxy
t_ace_perc["t_ace%_last"] = t_ace_perc.groupby('t_name')['t_ace%'].shift(-1)
t_ace_perc

In [None]:
# Now we can just do a left join with the main dataframe on t_id to fill in the proper last year's value for each player/match
df_player6 = df_player5.merge(t_ace_perc['t_ace%_last'], on='t_id', how = 'left')

In [None]:
#df_player6.info()

For tournaments without a prior year to assess conditions from (mostly tournaments from the first year of the sample (2012), which we won't actually make predictions on), we will just use the overall sample mean for the tournament's surface (hard or clay) and indoor or outdoor status.
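The fallback can be sketched on toy data as follows. This is an illustrative, single-surface variant using `fillna` rather than the workbook's row-by-row `.loc` assignments; the data values and the name `ALT_BUMP` are assumptions, while the 1.5-point altitude adjustment for outdoor events is the figure used in this workbook.

```python
import numpy as np
import pandas as pd

# Toy single-surface data; values are hypothetical.
df = pd.DataFrame({
    "t_indoor":    [0, 0, 0, 1, 1, 0],
    "t_alt":       [0, 0, 1, 0, 0, 0],
    "p_ace%":      [5.0, 7.0, 9.0, 8.0, 10.0, 6.0],
    "t_ace%_last": [5.5, np.nan, np.nan, np.nan, 8.5, 6.5],
})

ALT_BUMP = 1.5  # percentage points added for outdoor high-altitude events

# Sample mean ace rate for each row's indoor/outdoor group.
group_mean = df.groupby("t_indoor")["p_ace%"].transform("mean")

# The altitude bump applies only to outdoor, high-altitude rows.
high_outdoor = (df["t_indoor"] == 0) & (df["t_alt"] == 1)
fallback = group_mean + ALT_BUMP * high_outdoor

# Only rows without a previous-year value receive the fallback.
df["t_ace%_last"] = df["t_ace%_last"].fillna(fallback)
print(df)
```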

In [None]:
surface_ace_perc_means = df_player5[["p_name", "t_name","t_id","t_surf","t_indoor","t_alt","p_ace%"]]
surface_ace_perc_means.info()

In [None]:
# as with the by-tourny means above, removing the extreme ace outliers before computing the by surface averages for filling in the NaNs
surface_ace_perc_means = surface_ace_perc_means[~surface_ace_perc_means['p_name'].str.contains("Karlovic")]
surface_ace_perc_means = surface_ace_perc_means[~surface_ace_perc_means['p_name'].str.contains("Isner")]
surface_ace_perc_means = surface_ace_perc_means[~surface_ace_perc_means['p_name'].str.contains("Opelka")]
surface_ace_perc_means.info()

In [None]:
# computes mean across all matches played on one surface in the sample (clay or hard court), split by indoor/outdoor status. Used to fill in NaNs with surface specificity
# selecting the numeric column explicitly keeps the string columns (p_name, t_name) out of the mean
surface_ace_perc_means = surface_ace_perc_means.groupby(['t_surf','t_indoor'])[['p_ace%']].mean().round(2)
surface_ace_perc_means.rename(columns = {'p_ace%':'t_ace%'}, inplace=True)
surface_ace_perc_means

In [None]:
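# positional access: first group is outdoor (t_indoor == 0), second is indoor (t_indoor == 1)
surface_ace_perc_means["t_ace%"].iloc[0], surface_ace_perc_means["t_ace%"].iloc[1]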

In [None]:
# Assigns overall sample means per surface, per indoor-outdoor status, to matches with no previous-year value
# Additionally, a few tournaments at extreme altitude (Quito and Gstaad on outdoor clay, Bogota on outdoor hard)
# get a small boost (1.5 percentage points) over the above mean, given known effects above around 3,000 ft
outdoor_mean = surface_ace_perc_means["t_ace%"].iloc[0]  # first group: outdoor (t_indoor == 0)
indoor_mean = surface_ace_perc_means["t_ace%"].iloc[1]   # second group: indoor (t_indoor == 1)
no_prior_clay = df_player6["t_ace%_last"].isnull() & (df_player6["t_surf"] == "Clay")

df_player6.loc[no_prior_clay & (df_player6["t_indoor"] == 0) & (df_player6["t_alt"] == 0), "t_ace%_last"] = outdoor_mean
df_player6.loc[no_prior_clay & (df_player6["t_indoor"] == 0) & (df_player6["t_alt"] == 1), "t_ace%_last"] = outdoor_mean + 1.5

df_player6.loc[no_prior_clay & (df_player6["t_indoor"] == 1), "t_ace%_last"] = indoor_mean

#df_player6.loc[(df_player6["t_ace%_last"].isnull()) & (df_player6["t_surf"] == "Hard") & (df_player6["t_indoor"] == 0) & (df_player6["t_alt"] == 0), "t_ace%_last"] = surface_ace_perc_means["t_ace%"][0]
#df_player6.loc[(df_player6["t_ace%_last"].isnull()) & (df_player6["t_surf"] == "Hard") & (df_player6["t_indoor"] == 0) & (df_player6["t_alt"] == 1), "t_ace%_last"] = (surface_ace_perc_means["t_ace%"][0]) + 1.5

#df_player6.loc[(df_player6["t_ace%_last"].isnull()) & (df_player6["t_surf"] == "Hard") & (df_player6["t_indoor"] == 1), "t_ace%_last"] = surface_ace_perc_means["t_ace%"][1]

This historic court speed proxy is potentially useful for putting player characteristics in the context of the court speed at the tournament in which the match being predicted on is played. To this end, two marker columns are created below per player per match to be predicted on. The first column indicates whether the tournament at hand has an ace rate more than a fraction of a standard deviation (0.35 std) above the surface average AND the PLAYER also has an ace rate more than 0.4 std above the surface average. The second column indicates whether the tournament at hand has an ace rate more than 0.3 std above the surface average AND the PLAYER also has an ACED rate more than 0.3 std above the surface average ace rate.
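The marker logic can be sketched as a small helper on toy data. The function name `high_marker`, its parameters, and the means and standard deviations passed in are hypothetical; only the "mean plus a fraction of a standard deviation" thresholding mirrors the approach described above.

```python
import pandas as pd

def high_marker(t_val, p_val, t_mean, t_std, p_mean, p_std, k_t=0.35, k_p=0.4):
    """Return 1 where both the tournament and player values exceed mean + k*std, else 0."""
    t_cut = t_mean + k_t * t_std
    p_cut = p_mean + k_p * p_std
    # Comparisons against NaN evaluate to False, so missing values yield 0.
    return ((t_val > t_cut) & (p_val > p_cut)).astype(int)

# Toy rows; values are invented.
df = pd.DataFrame({
    "t_ace%_last": [5.0, 8.0, 8.0, float("nan")],
    "p_SOS_adj_ace%_l60_decay": [9.0, 9.0, 4.0, 9.0],
})

df["high_t_ace_p_ace"] = high_marker(
    df["t_ace%_last"], df["p_SOS_adj_ace%_l60_decay"],
    t_mean=6.0, t_std=2.0, p_mean=6.0, p_std=3.0,
)
print(df)
```

With these toy inputs the tournament cutoff is 6.7 and the player cutoff is 7.2, so only the second row is flagged.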

In [None]:
# Surface ace rate per tourney, per surface (collapsed across indoor and outdoor per surface; ace outliers still removed)
surface_ace_perc_means2 = df_player5[["p_name", "t_name","t_id","t_surf","p_ace%"]]
surface_ace_perc_means2 = surface_ace_perc_means2[~surface_ace_perc_means2['p_name'].str.contains("Karlovic")]
surface_ace_perc_means2 = surface_ace_perc_means2[~surface_ace_perc_means2['p_name'].str.contains("Isner")]
surface_ace_perc_means2 = surface_ace_perc_means2[~surface_ace_perc_means2['p_name'].str.contains("Opelka")]
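# selecting the numeric column explicitly keeps the string columns (p_name, t_name) out of the mean
surface_ace_perc_means2 = surface_ace_perc_means2.groupby(['t_surf','t_id'])[['p_ace%']].mean().round(2)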
#surface_ace_perc_means2.rename(columns = {'p_ace%':'t_ace%'}, inplace=True)
surface_ace_perc_means2

In [None]:
#surface_ace_perc_means2.to_csv('../data/surface_ace_perc_means2.csv', index=True)

In [None]:
# Mean ace rate per surface (using means at BY TOURNAMENT LEVEL generated above)
surface_ace_perc_means3 = surface_ace_perc_means2.groupby(['t_surf']).mean().round(2)
surface_ace_perc_means3.rename(columns = {'p_ace%':'t_ace%'}, inplace=True)
#surface_ace_perc_means3

# Ace rate stdev per surface (standard deviation of the BY TOURNAMENT means generated above)
surface_ace_perc_std = surface_ace_perc_means2.groupby(['t_surf']).std().round(2)
surface_ace_perc_std.rename(columns = {'p_ace%':'t_ace%'}, inplace=True)
#surface_ace_perc_std

In [None]:
surface_ace_perc_means3["t_ace%"].iloc[0]

In [None]:
surface_ace_perc_std["t_ace%"].iloc[0]

In [None]:
# Now we want to assess mean and std for SOS_adjusted ace and aced% over the last l60, on a surface-specific basis 

In [None]:
player_ace_l60_perc_means = df_player5[["p_name", "t_surf","p_SOS_adj_ace%_l60_decay"]]
player_ace_l60_perc_means = player_ace_l60_perc_means[~player_ace_l60_perc_means['p_name'].str.contains("Karlovic")]
player_ace_l60_perc_means = player_ace_l60_perc_means[~player_ace_l60_perc_means['p_name'].str.contains("Isner")]
player_ace_l60_perc_means = player_ace_l60_perc_means[~player_ace_l60_perc_means['p_name'].str.contains("Opelka")]
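# selecting the numeric column explicitly keeps the string column p_name out of the mean
player_ace_l60_perc_means = player_ace_l60_perc_means.groupby(['t_surf'])[['p_SOS_adj_ace%_l60_decay']].mean().round(2)
player_ace_l60_perc_means

In [None]:
player_ace_l60_perc_means["p_SOS_adj_ace%_l60_decay"].iloc[0]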

In [None]:
player_ace_l60_perc_std = df_player5[["p_name", "t_surf","p_SOS_adj_ace%_l60_decay"]]
player_ace_l60_perc_std = player_ace_l60_perc_std[~player_ace_l60_perc_std['p_name'].str.contains("Karlovic")]
player_ace_l60_perc_std = player_ace_l60_perc_std[~player_ace_l60_perc_std['p_name'].str.contains("Isner")]
player_ace_l60_perc_std = player_ace_l60_perc_std[~player_ace_l60_perc_std['p_name'].str.contains("Opelka")]
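# selecting the numeric column explicitly keeps the string column p_name out of the std
player_ace_l60_perc_std = player_ace_l60_perc_std.groupby(['t_surf'])[['p_SOS_adj_ace%_l60_decay']].std().round(2)
player_ace_l60_perc_std

In [None]:
player_ace_l60_perc_std["p_SOS_adj_ace%_l60_decay"].iloc[0]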

In [None]:
# Marker column for tournament high court speed proxy AND player high ace rate
# Thresholds sit a fraction of a standard deviation above the mean (0.35 std for tourney, 0.4 std for player), calibrated in univariate linear modeling
t_ace_cut = surface_ace_perc_means3["t_ace%"].iloc[0] + (0.35*surface_ace_perc_std["t_ace%"].iloc[0])
p_ace_cut = player_ace_l60_perc_means["p_SOS_adj_ace%_l60_decay"].iloc[0] + (0.4*player_ace_l60_perc_std["p_SOS_adj_ace%_l60_decay"].iloc[0])
df_player6["high_t_ace_p_ace"] = 0
df_player6.loc[(df_player6["t_surf"] == "Clay") & (df_player6["t_ace%_last"] > t_ace_cut) & (df_player6["p_SOS_adj_ace%_l60_decay"] > p_ace_cut), "high_t_ace_p_ace"] = 1
#df_player6.loc[(df_player6["t_surf"] == "Hard") & (df_player6["t_ace%_last"] > t_ace_cut) & (df_player6["p_SOS_adj_ace%_l60_decay"] > p_ace_cut), "high_t_ace_p_ace"] = 1

In [None]:
df_player6["high_t_ace_p_ace"].value_counts()

In [None]:
# Marker column for tournament high court speed proxy AND player high ACED rate
# Thresholds sit a fraction of a standard deviation above the mean (0.3 std for both tourney and player), calibrated in univariate linear modeling
# Note: the player's ACED rate is compared against the mean and std of the player ACE rate computed above
t_ace_cut_aced = surface_ace_perc_means3["t_ace%"].iloc[0] + (0.3*surface_ace_perc_std["t_ace%"].iloc[0])
p_aced_cut = player_ace_l60_perc_means["p_SOS_adj_ace%_l60_decay"].iloc[0] + (0.3*player_ace_l60_perc_std["p_SOS_adj_ace%_l60_decay"].iloc[0])
df_player6["high_t_ace_p_aced"] = 0
df_player6.loc[(df_player6["t_surf"] == "Clay") & (df_player6["t_ace%_last"] > t_ace_cut_aced) & (df_player6["p_SOS_adj_aced%_l60_decay"] > p_aced_cut), "high_t_ace_p_aced"] = 1
#df_player6.loc[(df_player6["t_surf"] == "Hard") & (df_player6["t_ace%_last"] > t_ace_cut_aced) & (df_player6["p_SOS_adj_aced%_l60_decay"] > p_aced_cut), "high_t_ace_p_aced"] = 1

In [None]:
df_player6["high_t_ace_p_aced"].value_counts()

In [None]:
#df_player6.to_csv('../data/df_player6.csv', index=False)

In [None]:
# Numerically encode surface (and handedness, which should have been converted earlier) moving forward
df_player6.loc[(df_player6["t_surf"] == "Hard"), "t_surf"] = 2 #Hard Court
df_player6.loc[(df_player6["t_surf"] == "Clay"), "t_surf"] = 1 #Clay Court
df_player6.loc[(df_player6["t_surf"] == "Carpet"), "t_surf"] = 2 #Carpet, grouped with Hard Court

df_player6["t_surf"] = pd.to_numeric(df_player6["t_surf"])

In [None]:
# Now just drop player ace% per match column so we don't accidentally include in predictions
#df_player6 = df_player6.drop(["p_ace%"], axis=1)

In [None]:
#df_player6.info()

In [None]:
# One last thing to sneak in: a differential version of the implied win probability derived from the average closing wagering line (vig-removed)
# With the vig removed, the two players' implied probabilities sum to 100, so the opponent's value is the complement
df_player6["p_IP_NV_opp"] = 100 - df_player6["p_IP_NV"]

df_player6["p_IP_NV_diff"] = df_player6["p_IP_NV"] - df_player6["p_IP_NV_opp"]
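As a sketch of where a vig-removed value like `p_IP_NV` could come from, the example below normalizes the raw implied probabilities from a pair of decimal odds so that they sum to 100, then forms the differential. The proportional normalization, the decimal odds format, and the function name `no_vig_probs` are assumptions for illustration; this section does not show how the workbook's own values were derived.

```python
def no_vig_probs(odds_a, odds_b):
    """Convert two decimal odds into implied win percentages that sum to 100."""
    raw_a = 1.0 / odds_a          # raw implied probability, still includes the vig
    raw_b = 1.0 / odds_b
    total = raw_a + raw_b         # exceeds 1 by the bookmaker's margin
    return 100 * raw_a / total, 100 * raw_b / total

# Hypothetical closing line: 1.60 on the player, 2.40 on the opponent.
p_ip_nv, p_ip_nv_opp = no_vig_probs(1.60, 2.40)
p_ip_nv_diff = p_ip_nv - p_ip_nv_opp

print(round(p_ip_nv, 2), round(p_ip_nv_opp, 2), round(p_ip_nv_diff, 2))
```

Because the two vig-free values sum to 100, the opponent's value equals `100 - p_IP_NV`, and the differential equals `2 * p_IP_NV - 100`.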

### Save Data for EDA Stage

In [None]:
# Creates a dataframe containing ranking-related features (and target feature) for dummy and benchmark model testing.
# Also contains features necessary for thresholding the minimum number of matches played by a player prior to the match being predicted on, and
# for restricting modeling to 2012 and onward (2009-2012 used to accrue retrospective stats/features)
df_player_benchmark = df_player6[["p_pts_won%", "p_rank", "p_log_rank", "p_rank_pts", "p_opp_rank_diff", "p_opp_log_rank_diff","p_opp_rank_pts_diff", "p_matches_surf", "t_indoor", "m_num", "tour_wk"]]

In [None]:
# Saves ranking-related player features (for dummy and benchmark model testing)
df_player_benchmark.to_csv('../data/df_player_benchmark_clay.csv', index=False)

In [None]:
# Creates dataframe with all predictive features (and target, plus more granular targets we might use later)
df_player_all = df_player6[["p_pts_won%", "p_sv_pts_won%", "p_ret_pts_won%", "p_ace%", "p_aced%", "p_bp_save%", "p_bp_convert%", "t_id", "t_date", "tour_wk", "t_name", "t_country", "t_indoor", "t_alt", "t_ace%_last", "t_lvl", "t_draw_size", "t_rd_num", "m_num", "m_best_of", "m_time(m)", "m_tot_pts", "p_id", "p_name", "p_rank", "p_log_rank", "p_rank_pts", "p_country", "p_ent", "p_hd", "p_ht", "p_age", "p_matches_surf", "p_H2H_w", "p_H2H_pts_won%", "p_pts_won%_l60_decay", "p_pts_won%_l60_decay_IO", "p_pts_won%_l10", "p_SOS_adj_pts_won%_l60_decay", "p_SOS_adj_pts_won%_l60_decay_IO", "p_SOS_adj_pts_won%_l60_decay_IO_weighted", "p_SOS_adj_pts_won%_l10", "p_sv_pts_won%_l60_decay", "p_sv_pts_won%_l10", "p_SOS_adj_sv_pts_won%_l60_decay", "p_SOS_adj_sv_pts_won%_l10", "p_ret_pts_won%_l60_decay", "p_ret_pts_won%_l10", "p_SOS_adj_ret_pts_won%_l60_decay", "p_SOS_adj_ret_pts_won%_l10", "p_ace%_l60_decay", "p_ace%_l10", "p_SOS_adj_ace%_l60_decay", "p_SOS_adj_ace%_l10", "p_aced%_l60_decay", "p_aced%_l10", "p_SOS_adj_aced%_l60_decay", "p_SOS_adj_aced%_l10", "p_bp_save%_l60", "p_bp_save%_l10", "p_SOS_adj_bp_save%_l60", "p_SOS_adj_bp_save%_l10", "p_bp_convert%_l60", "p_bp_convert%_l10", "p_SOS_adj_bp_convert%_l60", "p_SOS_adj_bp_convert%_l10", "p_pts_won%_std_l60_decay",'p_sv_pts_won%_std_l60_decay','p_ret_pts_won%_std_l60_decay', "p_m_time_last", "p_tot_time_l6", "p_tot_time_l6_decay", "p_tot_pts_last", "p_tot_pts_l6", "p_tot_pts_l6_decay", "p_stamina_adj_fatigue", "p_stamina_adj_fatigue_decay", "high_t_ace_p_ace", "high_t_ace_p_aced", "p_opp_rank_diff", "p_opp_log_rank_diff", "p_opp_rank_pts_diff", "p_ent_diff", "p_opp_ht_diff", "p_opp_age_diff", "p_L_opp_R", "p_HCA_opp_N", "p_pts_won%_l60_decay_diff", "p_pts_won%_l60_decay_IO_diff", "p_SOS_adj_pts_won%_l60_decay_diff", "p_SOS_adj_pts_won%_l60_decay_IO_diff", "p_SOS_adj_pts_won%_l60_decay_IO_weighted_diff", "p_pts_won%_l10_diff", "p_SOS_adj_pts_won%_l10_diff", "p_sv_pts_won%_l60_decay_diff", "p_SOS_adj_sv_pts_won%_l60_decay_diff", 
"p_sv_pts_won%_l10_diff", "p_SOS_adj_sv_pts_won%_l10_diff", "p_ret_pts_won%_l60_decay_diff", "p_SOS_adj_ret_pts_won%_l60_decay_diff", "p_ret_pts_won%_l10_diff", "p_SOS_adj_ret_pts_won%_l10_diff", "p_sv_opp_ret_pts_won%_l60_decay_diff", "p_SOS_adj_sv_opp_ret_pts_won%_l60_decay_diff", "p_sv_opp_ret_pts_won%_l10_diff", "p_SOS_adj_sv_opp_ret_pts_won%_l10_diff", "p_ret_opp_sv_pts_won%_l60_decay_diff", "p_SOS_adj_ret_opp_sv_pts_won%_l60_decay_diff", "p_ret_opp_sv_pts_won%_l10_diff", "p_SOS_adj_ret_opp_sv_pts_won%_l10_diff", "p_ace%_l60_decay_diff", "p_SOS_adj_ace%_l60_decay_diff", "p_ace%_l10_diff", "p_SOS_adj_ace%_l10_diff", "p_aced%_l60_decay_diff", "p_SOS_adj_aced%_l60_decay_diff", "p_aced%_l10_diff", "p_SOS_adj_aced%_l10_diff", "p_ace%_opp_aced%_l60_decay_diff", "p_SOS_adj_ace%_opp_aced%_l60_decay_diff", "p_ace%_opp_aced%_l10_diff", "p_SOS_adj_ace%_opp_aced%_l10_diff", "p_aced%_opp_ace%_l60_decay_diff", "p_SOS_adj_aced%_opp_ace%_l60_decay_diff", "p_aced%_opp_ace%_l10_diff", "p_SOS_adj_aced%_opp_ace%_l10_diff", "p_bp_save%_l60_diff", "p_SOS_adj_bp_save%_l60_diff", "p_bp_save%_l10_diff", "p_SOS_adj_bp_save%_l10_diff", "p_bp_convert%_l60_diff", "p_SOS_adj_bp_convert%_l60_diff", "p_bp_convert%_l10_diff", "p_SOS_adj_bp_convert%_l10_diff", "p_bp_convert%_opp_bp_save%_l60_diff", "p_SOS_adj_bp_convert%_opp_bp_save%_l60_diff", "p_bp_convert%_opp_bp_save%_l10_diff", "p_SOS_adj_bp_convert%_opp_bp_save%_l10_diff", "p_bp_save%_opp_bp_convert%_l60_diff", "p_SOS_adj_bp_save%_opp_bp_convert%_l60_diff", "p_bp_save%_opp_bp_convert%_l10_diff", "p_SOS_adj_bp_save%_opp_bp_convert%_l10_diff", "p_pts_won%_std_l60_decay_diff", "p_sv_pts_won%_std_l60_decay_diff", "p_ret_pts_won%_std_l60_decay_diff", "p_m_time_last_diff", "p_tot_time_l6_diff", "p_tot_time_l6_decay_diff", "p_tot_pts_last_diff", "p_tot_pts_l6_diff", "p_tot_pts_l6_decay_diff", "p_matches_surf_diff", "p_stam_adj_fatigue_diff", "p_stam_adj_fatigue_decay_diff",  "p_H2H_diff", "p_H2H_pts_won%_diff", "p_IP_NV", "p_IP_NV_diff", 
"m_outcome"]]

In [None]:
# Saves dataframe with all predictive features - clay court time range
df_player_all.to_csv('../data/df_player_all_clay.csv', index=False)