# ATP Tennis Data - Feature Engineering

In the previous notebook, we pre-processed the ATP match data by:
* dropping columns with little data
* cleaning string columns
* imputing any data that we can reasonably impute

For this notebook, we will use the data saved from pre-processing and start doing some basic feature engineering so we can feed our data to our models. My goal is to start training models and see how they do as soon as possible, so we will stick with the following basic feature engineering techniques:
* one hot encode any categorical data - ie, tournament surface, round, winner_ioc, loser_ioc
* extract month from the tournament date
* label each row with 0 as Player 1 losing to Player 2 and 1 as Player 1 beating Player 2
* remove any remaining columns that we will not use

Output of basic feature engineering:
* tournament
    * tournament id - extract from current tourney_id field
    * tournament level - one hot encode
    * month of tournament - extract from tourney_date
    * year of tournament - extract from tourney_date
    * surface - one hot encode
    * draw size
    * best of
* player 1 and player 2
    * rank
    * height
    * ioc - one hot encode
    * age
    * seed
    * hand
* match
    * round - custom ordinal encoding
* label
    * whether player 1 beat player 2 - 0 - False, 1 - True
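
The labelling scheme above can be sketched on a toy frame as follows. This is only an illustration, not this notebook's final implementation: the `p1_`/`p2_` column prefixes, the `label` column name, and the random swap are all assumptions made for the example. Because every row in the source data is stored as winner/loser, always putting the winner in the Player 1 slot would make the label constant, so the sketch randomly decides which player becomes Player 1.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the match data (values are made up for illustration)
toy = pd.DataFrame({
    "winner_name": ["a", "b", "c", "d"],
    "loser_name": ["e", "f", "g", "h"],
    "winner_rank": [1, 5, 10, 3],
    "loser_rank": [20, 2, 40, 8],
})

rng = np.random.default_rng(0)
# True -> the winner becomes Player 1 (label 1); False -> the loser does (label 0)
winner_is_p1 = rng.random(len(toy)) < 0.5

labeled = pd.DataFrame({"label": winner_is_p1.astype(int)})
for field in ["name", "rank"]:
    win_col, lose_col = toy[f"winner_{field}"], toy[f"loser_{field}"]
    labeled[f"p1_{field}"] = np.where(winner_is_p1, win_col, lose_col)
    labeled[f"p2_{field}"] = np.where(winner_is_p1, lose_col, win_col)

print(labeled)
```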
    

## Future:
In a future notebook, I plan on implementing more advanced feature engineering with:
* look up matchup history for players
    * match-up stats for previous matchups
* add match player record leading up to the match for each player (ie, last X matches)
* look up player stats leading up to match for each player (ie, last X matches)
* construct 2 entries of data per row, since match-up and player record leading up to a match will differ depending on which player is player 1 or 2


## Missing Data:

There are some features that are missing from our dataset. In the future, we could look into scraping or manually collecting these. I looked at the ATP website to see how to get this data, but it is not readily apparent, since each player/tournament has some type of unique identifier on the website that I haven't figured out how to get. Here are some potential features to get in the future:


* tournament
    * tournament location - ie, city, country
    * prize money
* player
    * weight
    * backhand - one/two handed
    * number of years as a pro

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

In [38]:
# Constants

DATASET_DIR = '../datasets'
MODEL_DIR = '../models'
# this is the file we generated from our pre-processing notebook
PREPROCESSED_FILE = f'{DATASET_DIR}/atp_matches_preprocessed.csv'
OUTFILE = f'{DATASET_DIR}/atp_matches_features.csv'

pre = pd.read_csv(PREPROCESSED_FILE, parse_dates=["tourney_date"])
pre = pre.astype({'draw_size': np.int32})


## Let's look at the data to make sure all the types are correct before we begin

In [21]:
pre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57868 entries, 0 to 57867
Data columns (total 45 columns):
tourney_id       57868 non-null object
tourney_name     57868 non-null object
surface          57868 non-null object
draw_size        57868 non-null int32
tourney_level    57868 non-null object
tourney_date     57868 non-null datetime64[ns]
match_num        57868 non-null int64
winner_id        57868 non-null int64
winner_seed      57868 non-null float64
winner_name      57868 non-null object
winner_hand      57868 non-null object
winner_ht        57868 non-null float64
winner_ioc       57868 non-null object
winner_age       57868 non-null float64
loser_id         57868 non-null int64
loser_seed       57868 non-null float64
loser_name       57868 non-null object
loser_hand       57868 non-null object
loser_ht         57868 non-null float64
loser_ioc        57868 non-null object
loser_age        57868 non-null float64
score            57867 non-null object
best_of          57868 

In [22]:
pre.sample(10).T

Unnamed: 0,17060,27832,2507,56756,51698,23210,24628,55058,42010,20753
tourney_id,2003-423,2007-416,1998-560,2018-0316,2016-7480,2005-560,2006-403,2017-0352,2012-540,2004-429
tourney_name,los angeles,rome masters,us open,bastad,los cabos,us open,miami masters,paris masters,wimbledon,stockholm
surface,hard,clay,hard,clay,hard,hard,hard,hard,grass,hard
draw_size,32,64,128,32,32,128,128,64,128,32
tourney_level,a,m,g,a,a,g,m,m,g,a
tourney_date,2003-07-28 00:00:00,2007-05-07 00:00:00,1998-08-31 00:00:00,2018-07-16 00:00:00,2016-08-08 00:00:00,2005-08-29 00:00:00,2006-03-20 00:00:00,2017-10-30 00:00:00,2012-06-25 00:00:00,2004-10-25 00:00:00
match_num,17,11,21,298,277,78,92,298,27,19
winner_id,103720,103454,103084,104926,104719,104925,103344,106058,104214,103163
winner_seed,1,46,85,3,8,93,6,16,92,4
winner_name,lleyton hewitt,nicolas massu,guillermo canas,fabio fognini,marcel granollers,novak djokovic,ivan ljubicic,jack sock,igor andreev,tommy haas


# Tournament Info

* tournament id - extract from current tourney_id field
* tournament level - one hot encode
* month of tournament - extract from tourney_date
* year of tournament - extract from tourney_date
* surface - one hot encode
* draw size
* best of


### Extract Tournament ID and Encode

Since players may have an affinity for certain tournaments because of location or conditions, we should include the tournament in our features.

First, let's rename the column, since the current name is rather confusing: tourney_id is a composite of {year}-{id}. The id part is alphanumeric, so we need to encode it.

In [26]:
matches = pre
matches = matches.rename({"tourney_id": "tourney_year_plus_id"}, axis=1)
matches["tourney_id"] = matches.tourney_year_plus_id.apply(lambda x: x.split("-")[1])
matches.sample(5, random_state=1).tourney_id

19830     419
49384    7290
32731     407
34227     418
53983     540
Name: tourney_id, dtype: object
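
As a side note, the same split can be done without `apply`. The sketch below uses pandas' vectorized string methods on a few ids copied from the outputs in this notebook; it assumes every id contains at least one `-`, and the `year` and `tourney_id` column names are chosen for the example.

```python
import pandas as pd

# Sample ids copied from the outputs shown in this notebook
ids = pd.Series(["2003-423", "2018-0316", "2016-7480", "2018-m015"])

# Split once on the first "-" into two columns: year and tournament id
parts = ids.str.split("-", n=1, expand=True)
parts.columns = ["year", "tourney_id"]
print(parts)
```

Both parts stay as strings, so leading zeros such as `0316` are preserved.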

In [60]:
import json
from sklearn.preprocessing import LabelEncoder

tidle = LabelEncoder()
tid_labels = tidle.fit_transform(matches['tourney_id'])
tid_map = {label: num for num, label in enumerate(tidle.classes_)}
print(tid_map)

matches["tourney_id_num"] = matches["tourney_id"].map(tid_map)

# save off map to be used later
with open(f'{MODEL_DIR}/tid_map.json', 'w') as file:
    json.dump(tid_map, file)
    
matches.sample(10)[["tourney_id", "tourney_id_num"]]

{'0301': 0, '0308': 1, '0311': 2, '0314': 3, '0315': 4, '0316': 5, '0319': 6, '0321': 7, '0322': 8, '0328': 9, '0329': 10, '0337': 11, '0341': 12, '0352': 13, '0360': 14, '0375': 15, '0402': 16, '0407': 17, '0410': 18, '0414': 19, '0421': 20, '0424': 21, '0425': 22, '0429': 23, '0439': 24, '0451': 25, '0495': 26, '0496': 27, '0499': 28, '0500': 29, '0506': 30, '0533': 31, '0568': 32, '0605': 33, '0717': 34, '0741': 35, '0773': 36, '0891': 37, '1536': 38, '1720': 39, '2276': 40, '301': 41, '306': 42, '308': 43, '311': 44, '312': 45, '314': 46, '315': 47, '316': 48, '317': 49, '319': 50, '321': 51, '322': 52, '325': 53, '326': 54, '327': 55, '328': 56, '329': 57, '3348': 58, '336': 59, '337': 60, '338': 61, '339': 62, '341': 63, '3465': 64, '348': 65, '352': 66, '357': 67, '359': 68, '360': 69, '375': 70, '379': 71, '401': 72, '402': 73, '403': 74, '404': 75, '407': 76, '408': 77, '409': 78, '410': 79, '414': 80, '416': 81, '418': 82, '419': 83, '420': 84, '421': 85, '422': 86, '423': 87

Unnamed: 0,tourney_id,tourney_id_num
34392,3348,58
42731,560,114
9298,451,97
47907,560,114
55505,407,17
25363,540,113
9900,433,92
27486,403,74
43742,499,104
863,468,98
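
For reference, here is a rough sketch of how a saved map could be read back and applied to new data later. It writes to a temporary directory rather than `MODEL_DIR` so that it runs on its own; `map_path`, `loaded_map`, and the id `9999` are made up for the example.

```python
import json
import os
import tempfile

import pandas as pd

# A small excerpt of the id map printed above
tid_map = {"0301": 0, "0308": 1, "0311": 2}

# Write to a temporary directory here; the notebook writes to MODEL_DIR instead
with tempfile.TemporaryDirectory() as tmp_dir:
    map_path = os.path.join(tmp_dir, "tid_map.json")
    with open(map_path, "w") as file:
        json.dump(tid_map, file)
    with open(map_path) as file:
        loaded_map = json.load(file)

# Ids missing from the map come back as NaN, which makes them easy to spot
new_ids = pd.Series(["0308", "0301", "9999"])
encoded = new_ids.map(loaded_map)
print(encoded)
```

Since the keys are strings and the values are integers, the map should survive the JSON round trip unchanged.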


### Tournament Level

Tournament levels have an implicit ordinality, since tournaments are worth different ranking points depending on their level, with 'g' (grand slams) worth the most.

We will use a custom encoding for these.

Here are the points awarded to the winner at each level:
* g - grand slam - 2000
* f - tour final - 1500
* m - masters - 1000
* a - other tour event - anywhere between 250 to 500 depending on the series

In [27]:
np.unique(matches.tourney_level)

array(['a', 'f', 'g', 'm'], dtype=object)

In [41]:
import json

level_map = {'g': 1, 'f': 2, 'm': 3, 'a': 4}
matches["tourney_level_num"] = matches['tourney_level'].map(level_map)

# let's save off the map for later use
with open(f'{MODEL_DIR}/tourney_level_map.json', 'w') as file:
    json.dump(level_map, file)

# let's check our work (this must be the last expression in the cell to be displayed)
matches.sample(10)[["tourney_level", "tourney_level_num"]]

### Tournament Year and Month

We can extract this from the tourney_date

In [33]:
matches["tourney_year"] = matches.tourney_date.dt.year
matches["tourney_month"] = matches.tourney_date.dt.month
matches.sample(10)[["tourney_year", "tourney_month", "tourney_date"]]

Unnamed: 0,tourney_year,tourney_month,tourney_date
56873,2018,7,2018-07-23
27461,2007,3,2007-03-05
5014,1999,7,1999-07-05
34908,2009,10,2009-10-26
15075,2002,11,2002-11-11
45296,2013,8,2013-08-26
27858,2007,5,2007-05-07
2887,1998,10,1998-10-12
52205,2016,10,2016-10-10
53493,2017,5,2017-05-01


### Tournament Surface

There is no ordinality here so we can just use a label encoder

In [45]:
from sklearn.preprocessing import LabelEncoder

gle = LabelEncoder()
surface_labels = gle.fit_transform(matches['surface'])
surface_map = {label: num for num, label in enumerate(gle.classes_)}
print(surface_map)

matches["surface_num"] = matches["surface"].map(surface_map)

# save off map to be used later
with open(f'{MODEL_DIR}/surface_map.json', 'w') as file:
    json.dump(surface_map, file)
    
matches.sample(10)[["surface", "surface_num"]]

{'carpet': 0, 'clay': 1, 'grass': 2, 'hard': 3}


Unnamed: 0,surface,surface_num
42075,grass,2
8951,carpet,0
39523,clay,1
35230,hard,3
48299,hard,3
14310,hard,3
45552,hard,3
41340,clay,1
17171,hard,3
42786,hard,3
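
One caveat: integer codes such as carpet=0 through hard=3 still carry an order that some models (linear models, for example) will treat as meaningful. A possible alternative, sketched here on a toy frame and assuming a one-hot layout suits the downstream model, is `pd.get_dummies`. The `surface_` prefix is chosen for illustration.

```python
import pandas as pd

surfaces = pd.DataFrame({"surface": ["grass", "carpet", "clay", "hard", "hard"]})

# One indicator column per surface; each row has exactly one column set to 1
one_hot = pd.get_dummies(surfaces["surface"], prefix="surface", dtype=int)
print(one_hot)
```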


# Generate Player Features

Most are already numbers that we can feed into the model - we will just keep these for now

### Player Name

We will treat player name as a non-ordinal category to encode. Certain players may match up better against a specific type of player, so including this seems important.

In [72]:
from sklearn.preprocessing import LabelEncoder

playerle = LabelEncoder()
# Series.append was removed in pandas 2.0, so combine the two columns with pd.concat
player_labels = playerle.fit_transform(pd.concat([matches['loser_name'], matches['winner_name']]))
player_map = {label: num for num, label in enumerate(playerle.classes_)}

# one more thing: we reserve index 0 for an unknown player, to be used later,
# so we move the player currently at index 0 to key_max + 1
key_max = max(player_map.values())
player_map[list(player_map.keys())[0]] = key_max + 1

# we won't print out the map here 
# print(player_map)

matches["loser_id"] = matches["loser_name"].map(player_map)
matches["winner_id"] = matches["winner_name"].map(player_map)

# save off map to be used later
with open(f'{MODEL_DIR}/player_map.json', 'w') as file:
    json.dump(player_map, file)
    
matches.sample(10)[["loser_name", "loser_id", "winner_name", "winner_id"]]

Unnamed: 0,loser_name,loser_id,winner_name,winner_id
16210,wayne ferreira,1409,rainer schuettler,1150
36056,ernests gulbis,377,fernando verdasco,409
51373,lukas lacko,815,marin cilic,868
8376,hicham arazi,528,carlos moya,206
31880,diego junqueira,322,jose acasuso,685
28302,potito starace,1131,novak djokovic,1042
46539,lleyton hewitt,794,rafael nadal,1148
32714,simone bolelli,1267,rafael nadal,1148
54802,bernard tomic,162,diego sebastian schwartzman,325
13382,christian vinck,230,fernando gonzalez,406
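
The reserved index 0 could later be used roughly like this when a name is missing from the map. This is a sketch only: `UNKNOWN_PLAYER_ID` and the name "some new player" are made up for the example, and the two ids are taken from the sample output above.

```python
import pandas as pd

# Excerpt of a player map in which index 0 has been left unused
player_map = {"novak djokovic": 1042, "rafael nadal": 1148}
UNKNOWN_PLAYER_ID = 0  # assumed meaning of the reserved index

names = pd.Series(["rafael nadal", "some new player", "novak djokovic"])
# Names missing from the map become NaN, which we replace with the reserved id
player_ids = names.map(player_map).fillna(UNKNOWN_PLAYER_ID).astype(int)
print(player_ids.tolist())
```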


### Player origin (ioc)

Again there is no ordinality to this so we will use the LabelEncoder

In [49]:
iocle = LabelEncoder()
# we have to combine the loser and winner lists to get a comprehensive list
# (Series.append was removed in pandas 2.0, so we use pd.concat)
ioc_labels = iocle.fit_transform(pd.concat([matches["loser_ioc"], matches["winner_ioc"]]))
ioc_map = {label: num for num, label in enumerate(iocle.classes_)}
print(ioc_map)

matches["loser_ioc_num"] = matches["loser_ioc"].map(ioc_map)
matches["winner_ioc_num"] = matches["winner_ioc"].map(ioc_map)

# save off map to be used later
with open(f'{MODEL_DIR}/ioc_map.json', 'w') as file:
    json.dump(ioc_map, file)
    
matches.sample(10)[["loser_ioc", "loser_ioc_num", "winner_ioc", "winner_ioc_num"]]

{'alg': 0, 'arg': 1, 'arm': 2, 'aus': 3, 'aut': 4, 'aze': 5, 'bah': 6, 'bar': 7, 'bel': 8, 'bih': 9, 'blr': 10, 'bra': 11, 'bul': 12, 'can': 13, 'chi': 14, 'chn': 15, 'col': 16, 'crc': 17, 'cro': 18, 'cyp': 19, 'cze': 20, 'den': 21, 'dom': 22, 'ecu': 23, 'egy': 24, 'esa': 25, 'esp': 26, 'est': 27, 'fin': 28, 'fra': 29, 'gbr': 30, 'geo': 31, 'ger': 32, 'gre': 33, 'hkg': 34, 'hun': 35, 'ind': 36, 'irl': 37, 'isr': 38, 'ita': 39, 'jpn': 40, 'kaz': 41, 'kor': 42, 'kuw': 43, 'lat': 44, 'ltu': 45, 'lux': 46, 'mar': 47, 'mas': 48, 'mda': 49, 'mex': 50, 'mkd': 51, 'mon': 52, 'ned': 53, 'nor': 54, 'nzl': 55, 'pak': 56, 'par': 57, 'per': 58, 'phi': 59, 'pol': 60, 'por': 61, 'qat': 62, 'rou': 63, 'rsa': 64, 'rus': 65, 'slo': 66, 'srb': 67, 'sui': 68, 'svk': 69, 'swe': 70, 'tha': 71, 'tog': 72, 'tpe': 73, 'tun': 74, 'tur': 75, 'uae': 76, 'ukr': 77, 'uru': 78, 'usa': 79, 'uzb': 80, 'ven': 81, 'vie': 82, 'zim': 83}


Unnamed: 0,loser_ioc,loser_ioc_num,winner_ioc,winner_ioc_num
12069,srb,67,usa,79
54509,usa,79,cro,18
15630,mon,52,sui,68
53762,ukr,77,bel,8
1229,par,57,ita,39
11342,ecu,23,usa,79
31274,ita,39,ger,32
44216,fra,29,fra,29
15714,arg,1,arg,1
48936,aut,4,ukr,77
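
The reason for fitting on both columns at once can be shown on a toy frame. In this sketch, `cro` appears only as a winner and `srb` only as a loser, so a map built from one column alone would miss one of them. Plain Python is used in place of `LabelEncoder`; sorting the unique values is intended to mirror the ordering of `classes_`.

```python
import pandas as pd

toy = pd.DataFrame({
    "loser_ioc": ["srb", "usa", "arg"],
    "winner_ioc": ["usa", "cro", "arg"],
})

# Stack both columns so a code that appears only on one side is still included
all_codes = pd.concat([toy["loser_ioc"], toy["winner_ioc"]], ignore_index=True)
ioc_map = {code: num for num, code in enumerate(sorted(all_codes.unique()))}
print(ioc_map)

# Both columns share the same map, so the same country always gets the same number
toy["loser_ioc_num"] = toy["loser_ioc"].map(ioc_map)
toy["winner_ioc_num"] = toy["winner_ioc"].map(ioc_map)
```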


### Player Hand - (l/r/u)

This is simple categorical data - we will use LabelEncoder

In [56]:
handcle = LabelEncoder()
# we have to combine the loser and winner lists to get a comprehensive list
# (Series.append was removed in pandas 2.0, so we use pd.concat)
hand_labels = handcle.fit_transform(pd.concat([matches["loser_hand"], matches["winner_hand"]]))
hand_map = {label: num for num, label in enumerate(handcle.classes_)}
print(hand_map)

matches["loser_hand_num"] = matches["loser_hand"].map(hand_map)
matches["winner_hand_num"] = matches["winner_hand"].map(hand_map)

# save off map to be used later
with open(f'{MODEL_DIR}/hand_map.json', 'w') as file:
    json.dump(hand_map, file)
    
matches.sample(10)[["loser_hand", "loser_hand_num", "winner_hand", "winner_hand_num"]]

{'l': 0, 'r': 1, 'u': 2}


Unnamed: 0,loser_hand,loser_hand_num,winner_hand,winner_hand_num
16289,r,1,r,1
49998,r,1,r,1
17348,r,1,r,1
19715,r,1,l,0
47485,l,0,r,1
45278,r,1,r,1
41236,r,1,r,1
46589,r,1,r,1
19698,r,1,r,1
7148,r,1,r,1


## Match Features

The only feature we will use for now is round. There is an ordinality to the rounds, so we will use a custom encoding.

A couple of issues with this:

'rr' - means round robin. This usually means that all players in a tournament are divided into multiple flights (ie, 2), and all players in a flight play against each other. The year-end final, for instance, has 8 players. The top 2 players from each flight make it to the semi finals, and the winners of the semi finals go to the final. So rr could mean any of the 3 matches a player would play in the flight. We will just pick the next unique number to encode this.

https://en.wikipedia.org/wiki/ATP_Finals

'br' - Not sure what this indicates. We will encode this as the last value.

In [51]:
np.unique(matches["round"])

array(['br', 'f', 'qf', 'r128', 'r16', 'r32', 'r64', 'rr', 'sf'],
      dtype=object)

In [52]:
round_map = {'f': 1, 
            'sf': 2,
            'qf': 3,
            'r16': 4,
            'r32': 5,
            'r64': 6,
            'r128': 7,
            'rr': 8,
            'br': 9}
matches["round_num"] = matches["round"].map(round_map)

# save off map to be used later
with open(f'{MODEL_DIR}/round_map.json', 'w') as file:
    json.dump(round_map, file)
    
matches.sample(10)[["round", "round_num"]]


Unnamed: 0,round,round_num
212,r64,6
7384,r64,6
53372,r32,5
40725,r16,4
50487,r128,7
32679,qf,3
32758,r32,5
3565,r32,5
57603,r32,5
38135,r32,5
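
With a hand-written map like this, it may be worth checking that no round code was missed, since `.map()` silently turns unmapped values into NaN. A minimal sketch on toy data follows; `er` is a made-up code used only to trigger the check.

```python
import pandas as pd

round_map = {"f": 1, "sf": 2, "qf": 3, "r16": 4, "r32": 5,
             "r64": 6, "r128": 7, "rr": 8, "br": 9}

# "er" is a made-up code used only to show what an unmapped value looks like
rounds = pd.Series(["r32", "qf", "er", "f"])
round_num = rounds.map(round_map)

# Values absent from the map turn into NaN after .map()
unmapped = sorted(rounds[round_num.isna()].unique())
print(unmapped)
```

An empty list here would mean that every round code in the data has an entry in the map.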


# Let's check out our data one more time

In [57]:
matches.sample(10).T

Unnamed: 0,57414,52804,29740,24112,3324,48232,32014,45319,1603,31956
tourney_year_plus_id,2018-m015,2017-0506,2008-580,2006-505,1999-301,2014-429,2008-329,2013-560,1998-311,2008-747
tourney_name,beijing,buenos aires,australian open,vina del mar,auckland,stockholm,tokyo,us open,queen s club,beijing
surface,hard,clay,hard,clay,hard,hard,hard,hard,grass,hard
draw_size,32,32,128,32,32,32,64,128,64,32
tourney_level,a,a,g,a,a,a,a,g,a,a
tourney_date,2018-10-01 00:00:00,2017-02-13 00:00:00,2008-01-14 00:00:00,2006-01-30 00:00:00,1999-01-11 00:00:00,2014-10-13 00:00:00,2008-09-29 00:00:00,2013-08-26 00:00:00,1998-06-08 00:00:00,2008-09-22 00:00:00
match_num,293,288,50,12,13,13,27,83,9,27
winner_id,105223,105807,103582,104593,103103,104607,104025,103852,101601,104053
winner_seed,1,4,50,29,8,1,34,23,13,2
winner_name,juan martin del potro,pablo carreno busta,michael berrer,daniel gimeno traver,dominik hrbaty,tomas berdych,amer delic,feliciano lopez,brett steven,andy roddick


In [58]:
# Let's pick out the columns we will use

In [7]:
pre[pre["round"] == 'r128'].sample(10).T

Unnamed: 0,50072,36740,50093,703,39130,19613,21144,29042,10961,25511
year_tourney_id,2015 540,2010 540,2015 540,1998 403,2011 520,2004 540,2005 580,2007 560,2001 540,2006 540
tourney_name,wimbledon,wimbledon,wimbledon,miami masters,roland garros,wimbledon,australian open,us open,wimbledon,wimbledon
surface,grass,grass,grass,hard,clay,grass,hard,hard,grass,grass
tourney_level,g,g,g,m,g,g,g,g,g,g
tourney_date,2015-06-29,2010-06-21,2015-06-29,1998-03-16,2011-05-22,2004-06-21,2005-01-17,2007-08-27,2001-06-25,2006-06-26
match_num,31,37,52,10,1,52,5,5,63,34
winner_id,103163,104312,104665,102765,104745,102434,103344,103852,102965,102703
winner_name,tommy haas,andreas seppi,pablo andujar,nicolas escude,rafael nadal,vincent spadea,ivan ljubicic,feliciano lopez,jamie delgado,hyung taik lee
winner_hand,r,r,r,r,l,r,r,l,r,r
winner_ht,188,190,180,185,185,183,193,188,178,180
