In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Loading in the Data

We saved the cleaned data to a file in the `data-cleaning` notebook, so let's load that data in.

In [2]:
data_file = Path("./data", "cleaned_data.hdf")
match_2018_train_df = pd.read_hdf(data_file, "train")

In [3]:
match_2018_train_df.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_name,winner_hand,...,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
2291,2018-M-DC-2018-G1-AO-M-PAK-UZB-01,Davis Cup G1 R2: PAK vs UZB,Grass,4,D,20180406,4,104797,Denis Istomin,R,...,1.0,0.0,4.0,54.0,21.0,13.0,11.0,8.0,6.0,11.0
2414,2018-M-DC-2018-G2-EPA-M-TUN-FIN-01,Davis Cup G2 R1: TUN vs FIN,Hard,4,D,20180203,1,104291,Malek Jaziri,R,...,1.0,0.0,2.0,64.0,48.0,30.0,7.0,10.0,3.0,6.0
1784,2018-0314,Gstaad,Clay,32,A,20180723,278,103852,Feliciano Lopez,L,...,1.0,5.0,2.0,72.0,41.0,31.0,16.0,10.0,1.0,2.0
634,2018-M006,Indian Wells Masters,Hard,128,M,20180305,276,105023,Sam Querrey,R,...,3.0,4.0,6.0,121.0,78.0,52.0,19.0,16.0,11.0,15.0
2367,2018-M-DC-2018-G2-AO-M-THA-SRI-01,Davis Cup G2 R1: THA vs SRI,Clay,4,D,20180203,2,106397,Wishaya Trongcharoenchaikul,R,...,1.0,6.0,2.0,55.0,35.0,21.0,10.0,9.0,3.0,6.0


# Setting Up for Feature Engineering

## Filtering the Training Data

In general, training data must be representative of the same population as the inputs we want to make predictions about. In this context, if we are trying to predict the outcome of a match between Roger Federer and Rafael Nadal, our training data should consist of all matches played by Federer or all matches played by Nadal. Each player in the dataset will therefore have their own model that predicts their performance against an opponent. For feature selection, I will primarily look at Federer's matches, but validate with a player who wins around half their matches to ensure I am not selecting features that are only relevant to Federer.

In [25]:
def get_matches_for_player(data, player):
    # Keep every match in which `player` appears, whether as winner or loser.
    return data[(data['winner_name'] == player) | (data['loser_name'] == player)]

In [26]:
fed_2018 = get_matches_for_player(match_2018_train_df, "Roger Federer")
fed_2018.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_name,winner_hand,...,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
2143,2018-560,Us Open,Hard,128,G,20180827,163,103819,Roger Federer,R,...,9.0,0.0,4.0,110.0,64.0,29.0,27.0,13.0,9.0,15.0
1529,2018-540,Wimbledon,Grass,128,G,20180702,100,103819,Roger Federer,R,...,0.0,0.0,1.0,84.0,59.0,35.0,10.0,13.0,6.0,11.0
1413,2018-0500,Halle,Grass,32,A,20180618,300,106432,Borna Coric,R,...,4.0,12.0,1.0,88.0,62.0,54.0,11.0,15.0,1.0,3.0
641,2018-M006,Indian Wells Masters,Hard,128,M,20180305,269,103819,Roger Federer,R,...,3.0,2.0,2.0,71.0,33.0,24.0,23.0,10.0,0.0,1.0
1389,2018-0321,Stuttgart,Grass,32,A,20180611,297,103819,Roger Federer,R,...,2.0,6.0,1.0,57.0,37.0,24.0,14.0,10.0,4.0,6.0


To find a good candidate for a validation player, let's compute the win rate for each player.

In [27]:
wins = match_2018_train_df.groupby('winner_name')['tourney_id'].count()
losses = match_2018_train_df.groupby('loser_name')['tourney_id'].count()
win_rate = wins / (wins + losses)
win_rate.sort_values(ascending=False).head()

Rafael Nadal                   0.933333
Roger Federer                  0.846154
Wishaya Trongcharoenchaikul    0.833333
Novak Djokovic                 0.829787
Kamil Majchrzak                0.800000
Name: tourney_id, dtype: float64
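
One caveat about the calculation above: `wins / (wins + losses)` aligns the two Series on player name, so any player who appears in only one of them (all wins or all losses) gets `NaN` rather than a rate of 1.0 or 0.0. Rates based on very few matches are also noisy. The following is a minimal sketch of a `NaN`-safe variant that also tracks how many matches each rate is based on. The toy frame, the name `win_rate_safe`, and the `MIN_MATCHES` threshold are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for match_2018_train_df (hypothetical players, not real results).
toy = pd.DataFrame({
    "winner_name": ["A", "A", "B", "C", "A", "A"],
    "loser_name":  ["B", "C", "C", "B", "B", "D"],
})

wins = toy["winner_name"].value_counts()
losses = toy["loser_name"].value_counts()

# Align on player name, treating a missing count as zero instead of NaN.
played = wins.add(losses, fill_value=0)
win_rate_safe = wins.reindex(played.index, fill_value=0) / played

# Hypothetical threshold: only trust rates based on at least this many matches.
MIN_MATCHES = 3
summary = pd.DataFrame({"played": played, "win_rate": win_rate_safe})
reliable = summary[summary["played"] >= MIN_MATCHES]
print(reliable.sort_values("win_rate", ascending=False))
```

In the toy data, player A never loses and player D never wins; both would be `NaN` under the original formula but get 1.0 and 0.0 here. Whether a minimum match count is appropriate, and what value to use, depends on the dataset.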

Now we find the players whose win rate is around 50%, here taken to mean strictly between 45% and 55%.

In [28]:
win_rate[(win_rate > 0.45) & (win_rate < 0.55)]

Aisam Ul Haq Qureshi          0.500000
Alessandro Giannessi          0.500000
Alexey Vatutin                0.500000
Aljaz Bedene                  0.482759
Amine Ahouda                  0.500000
Andreas Seppi                 0.484848
Andrej Martin                 0.500000
Andrey Rublev                 0.461538
Benjamin Lock                 0.500000
Benoit Paire                  0.500000
Cameron Norrie                0.517241
Carlos Taberner               0.500000
Christian Harrison            0.500000
Christopher Diaz Figueroa     0.500000
Damir Dzumhur                 0.476190
Daniel Elahi Galan Riveros    0.500000
David Agung Susanto           0.500000
Denis Istomin                 0.482759
Dusan Lajovic                 0.513514
Egor Gerasimov                0.500000
Emil Ruusuvuori               0.500000
Federico Coria                0.500000
Feliciano Lopez               0.484848
Fernando Verdasco             0.523810
Filip Krajinovic              0.500000
Francis Tiafoe           

Since he has been on the ATP tour for a while and has a playing style different from Federer's, I will choose Fernando Verdasco to be my validation player.

In [29]:
verd_2018 = get_matches_for_player(match_2018_train_df, "Fernando Verdasco")
verd_2018.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_name,winner_hand,...,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
737,2018-M007,Miami Masters,Hard,128,M,20180319,268,104269,Fernando Verdasco,L,...,4.0,2.0,8.0,83.0,46.0,28.0,11.0,12.0,5.0,12.0
134,2018-M001,Sydney,Hard,32,A,20180108,276,200282,Alex De Minaur,R,...,7.0,3.0,4.0,45.0,27.0,16.0,9.0,9.0,2.0,6.0
1276,2018-520,Roland Garros,Clay,128,G,20180528,144,104269,Fernando Verdasco,L,...,17.0,0.0,0.0,189.0,128.0,80.0,23.0,0.0,11.0,22.0
490,2018-6932,Rio De Janeiro,Clay,32,A,20180219,300,106043,Diego Sebastian Schwartzman,R,...,9.0,3.0,3.0,51.0,27.0,18.0,7.0,8.0,1.0,5.0
912,2018-0410,Monte Carlo Masters,Clay,64,M,20180416,239,104269,Fernando Verdasco,L,...,12.0,5.0,5.0,116.0,55.0,33.0,28.0,16.0,10.0,16.0


## Adding a Variable to Predict

Since our ultimate goal is to predict the outcome of a match, we need a binary variable representing whether the match resulted in a win or a loss for a given player. This binary variable is what we will try to predict with our model.

In [41]:
def add_win_loss(data, player):
    # Add a binary 'result' column in place: 1 if `player` won the match, 0 otherwise.
    data['result'] = (data['winner_name'] == player).astype(int)

In [43]:
add_win_loss(fed_2018, "Roger Federer")
add_win_loss(verd_2018, "Fernando Verdasco")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
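
The warning above appears because `fed_2018` and `verd_2018` were produced by boolean indexing, so pandas cannot tell whether assigning a new column is meant to affect the original DataFrame. One common way to avoid it is to take an explicit `.copy()` when filtering, or to build the new column with `assign`, which returns a new frame instead of mutating its input. The sketch below illustrates both on toy data. The function names `get_matches_for_player_copy` and `with_win_loss` and the placeholder player names are hypothetical, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the match data (placeholder players, not real results).
toy = pd.DataFrame({
    "winner_name": ["Player A", "Player B", "Player C"],
    "loser_name":  ["Player B", "Player A", "Player D"],
})

def get_matches_for_player_copy(data, player):
    """Hypothetical variant of get_matches_for_player returning an independent copy."""
    mask = (data["winner_name"] == player) | (data["loser_name"] == player)
    return data[mask].copy()

def with_win_loss(data, player):
    """Return a new frame with a 0/1 'result' column; the input is left unchanged."""
    return data.assign(result=(data["winner_name"] == player).astype(int))

player_a = get_matches_for_player_copy(toy, "Player A")
player_a = with_win_loss(player_a, "Player A")
print(player_a)
```

Either change alone is enough to avoid the warning. Using both keeps the filtering and labelling steps free of side effects on the original training frame.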
  
