# ATP Tennis Data - Feature Engineering

In previous notebook, we pre-processed the ATP match data by:
* dropping columns with little data
* cleaning string columns
* imputing any data that we can reasonably impute

For this notebook, we will use the data saved from pre-processing and start doing some basic feature engineering so we can feed  our data to our models to see how we do. My goal is to start training models and see how they do as soon as possible, so we will stick with the following basic feature engineering techniques:
* one hot encode any categorical data - ie, tournament surface, round, winner_ioc, loser_ioc
* extra month from the tournament date
* label each row with 0 as Player 1 losing to Player 2 and 1 as Player 1 beats Player 2
* remove any remaining columns that we

Output of basic feature engineering:
* tournament
    * tournament id - extract from current tourney_id field
    * tournament level - one hot encode
    * month of tournament - extract from tourney_date
    * year of tournament - extract from tourney_date
    * surface - one hot encode
    * draw size
    * best of
* player 1 and player 2
    * rank
    * height
    * ioc - one hot encode
    * age
    * rank
    * seed
    * hand
* match
    * round - 
* label
    * whether player 1 beat player 2 - 0 - False, 1 - True
    

## Future:
In a future notebook, I plan on implementing more advanced feature engineering with:
* look up matchup history for players
    * match-up stats for pervious matchups
* add match player record leading up to the match for each player (ie, last X matches)
* look up player stats leading up to match for each player (ie, last X matches)
* contruct 2 entries of data per row since match-up and player record leading up to a match will differe depending on which player is player 1 or 2


## Missing Data:

There are some features that are  missing from our dataset. In the future, we could look into somehow scrapping or manually getting these. I looked at the ATP website to see how to get this data, but it is not readily apparent since each player/tournament has some type of unique identifier on their website that I haven't figure out how to get. Here are some potential features to get i the future:


* tournament
    * tournament location - ie, city, country
    * prize money
* player
    * weight
    * backhand - one/two handed
    * number of years as a pro

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import json

%matplotlib inline
sns.set()

In [2]:
# Contants

DATASET_DIR = '../datasets'
MODEL_DIR = '../models'
# this is the file we generated from our pre-processing notebook
PREPROCESSED_FILE = f'{DATASET_DIR}/atp_matches_preprocessed.csv'
OUTFILE = f'{DATASET_DIR}/atp_matches_features.csv'

pre = pd.read_csv(PREPROCESSED_FILE, parse_dates=["tourney_date"])
pre = pre.astype({'draw_size': np.int32})


## Let's look at the data to make sure all the types are correct before we being

In [3]:
pre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57868 entries, 0 to 57867
Data columns (total 45 columns):
tourney_id       57868 non-null object
tourney_name     57868 non-null object
surface          57868 non-null object
draw_size        57868 non-null int32
tourney_level    57868 non-null object
tourney_date     57868 non-null datetime64[ns]
match_num        57868 non-null int64
winner_id        57868 non-null int64
winner_seed      57868 non-null float64
winner_name      57868 non-null object
winner_hand      57868 non-null object
winner_ht        57868 non-null float64
winner_ioc       57868 non-null object
winner_age       57868 non-null float64
loser_id         57868 non-null int64
loser_seed       57868 non-null float64
loser_name       57868 non-null object
loser_hand       57868 non-null object
loser_ht         57868 non-null float64
loser_ioc        57868 non-null object
loser_age        57868 non-null float64
score            57867 non-null object
best_of          57868 

In [4]:
pre.sample(10).T

Unnamed: 0,28873,16535,16836,39624,48507,54632,1460,31401,28334,18918
tourney_id,2007-3348,2003-311,2003-315,2011-439,2015-338,2017-560,1998-520,2008-419,2007-540,2004-410
tourney_name,new haven,queen s club,newport,umag,sydney,us open,roland garros,indianapolis,wimbledon,monte carlo masters
surface,hard,grass,grass,clay,hard,hard,clay,hard,grass,clay
draw_size,64,64,32,32,32,128,128,32,128,64
tourney_level,a,a,a,a,a,g,g,a,g,m
tourney_date,2007-08-19 00:00:00,2003-06-09 00:00:00,2003-07-07 00:00:00,2011-07-25 00:00:00,2015-01-11 00:00:00,2017-08-28 00:00:00,1998-05-25 00:00:00,2008-07-14 00:00:00,2007-06-25 00:00:00,2004-04-19 00:00:00
match_num,9,32,27,2,25,213,55,25,80,48
winner_id,102563,102450,103184,103694,105062,126094,101549,103484,104542,103786
winner_seed,38,7,27,14,19,48,123,1,103,47
winner_name,thomas johansson,tim henman,bob bryan,olivier rochus,mikhail kukushkin,andrey rublev,rodolphe gilbert,james blake,jo wilfried tsonga,nikolay davydenko


# Tournament Info

* tournament id - extract from current tourney_id field
* tournament level - one hot encode
* month of tournament - extract from tourney_date
* year of tournament - extract from tourney_date
* surface - one hot encode
* draw size
* best of


### Extract tournment ID aand Encode

Since players may have an affinity for certainly tournaments because of location or conditions, we should include this into our features

First let's rename the columns since it's rather confusing - currently, tourney_id is a composite of {year}-{id}. ID is alphanumeric so we need to encode these

In [5]:
matches = pre
matches = matches.rename({"tourney_id": "tourney_year_plus_id"}, axis=1)
matches["tourney_id"] = matches.tourney_year_plus_id.apply(lambda x: x.split("-")[1])
matches.sample(5, random_state=1).tourney_id

19830     419
49384    7290
32731     407
34227     418
53983     540
Name: tourney_id, dtype: object

In [6]:


tidle = LabelEncoder()
tid_labels = tidle.fit_transform(matches['tourney_id'])
tid_map = {label: num for num, label in enumerate(tidle.classes_)}
print(tid_map)

matches["tourney_id_label"] = matches["tourney_id"].map(tid_map)

# save off map to be used later
with open(f'{MODEL_DIR}/tid_map.json', 'w') as file:
    json.dump(surface_map, file)
    
matches.sample(10)[["tourney_id", "tourney_id_label"]]

{'0301': 0, '0308': 1, '0311': 2, '0314': 3, '0315': 4, '0316': 5, '0319': 6, '0321': 7, '0322': 8, '0328': 9, '0329': 10, '0337': 11, '0341': 12, '0352': 13, '0360': 14, '0375': 15, '0402': 16, '0407': 17, '0410': 18, '0414': 19, '0421': 20, '0424': 21, '0425': 22, '0429': 23, '0439': 24, '0451': 25, '0495': 26, '0496': 27, '0499': 28, '0500': 29, '0506': 30, '0533': 31, '0568': 32, '0605': 33, '0717': 34, '0741': 35, '0773': 36, '0891': 37, '1536': 38, '1720': 39, '2276': 40, '301': 41, '306': 42, '308': 43, '311': 44, '312': 45, '314': 46, '315': 47, '316': 48, '317': 49, '319': 50, '321': 51, '322': 52, '325': 53, '326': 54, '327': 55, '328': 56, '329': 57, '3348': 58, '336': 59, '337': 60, '338': 61, '339': 62, '341': 63, '3465': 64, '348': 65, '352': 66, '357': 67, '359': 68, '360': 69, '375': 70, '379': 71, '401': 72, '402': 73, '403': 74, '404': 75, '407': 76, '408': 77, '409': 78, '410': 79, '414': 80, '416': 81, '418': 82, '419': 83, '420': 84, '421': 85, '422': 86, '423': 87

NameError: name 'json' is not defined

### Tournament Level

Tournament level does have some type of implicit ordinality to them since tournaments are worth different points according to their levels with 'g' (grand slams) worth the most number of points

We will do a custom encoding for these

Here are the point values for winner the levels:
* g - grand slam - 2000
* f - tour final - 1500
* m - masters - 1000
* a - other tour event - anywhere between 250 to 500 depending on the series

In [None]:
np.unique(matches.tourney_level)

In [None]:
import json

level_map = {'g': 1, 'f': 2, 'm': 3, 'a': 4}
matches["tourney_level_label"] = matches['tourney_level'].map(level_map)
# let's check our work
matches.sample(10)[["tourney_level", "tourney_level_label"]]

# let's save off the map for later use
with open(f'{MODEL_DIR}/tourney_level_map.json', 'w') as file:
    json.dump(level_map, file)

### Tournamant year and Month

We can extract this from the tourney_date

In [None]:
matches["tourney_year"] = matches.tourney_date.dt.year
matches["tourney_month"] = matches.tourney_date.dt.month
matches.sample(10)[["tourney_year", "tourney_month", "tourney_date"]]

### Tournament Surface

There is no ordinality here so we can just use a label encoder

In [None]:
from sklearn.preprocessing import LabelEncoder
import pickle

gle = LabelEncoder()
surface_labels = gle.fit_transform(matches['surface'])
surface_map = {label: num for num, label in enumerate(gle.classes_)}
print(surface_map)

matches["surface_label"] = matches["surface"].map(surface_map)

# save off map to be used later
with open(f'{MODEL_DIR}/surface_map.json', 'w') as file:
    json.dump(surface_map, file)
    
matches.sample(10)[["surface", "surface_label"]]

# Generate Player Features

Most are already numbers that we can feed into the model - we will just keep these for now

### Player Name

We will treat player name as a non-oridnal category to encode. Certain players may match up better against a specific type of player so including this seems important

In [None]:
from sklearn.preprocessing import LabelEncoder
import pickle

playerle = LabelEncoder()
player_labels = playerle.fit_transform(matches['loser_name'].append(matches['winner_name']))
player_map = {label: num for num, label in enumerate(playerle.classes_)}

# we are actually going to do one more thing here, instead of starting at 0 index. We are going to reserve this index for unknown player for later
# we will swap out player with 0 index with the key-max + 1
key_max = max(player_map.values())
player_map[list(player_map.keys())[0]] = key_max + 1

# we won't print out the map here 
# print(player_map)

matches["loser_label"] = matches["loser_name"].map(player_map)
matches["winner_label"] = matches["winner_name"].map(player_map)

# save off map to be used later
with open(f'{MODEL_DIR}/player_map.json', 'w') as file:
    json.dump(player_map, file)
    
matches.sample(10)[["loser_name", "loser_label", "winner_name", "winner_label"]]

### Player origin (ioc)

Again there is no ordinality to this so we will use the LabelEncoder

In [None]:
iocle = LabelEncoder()
# we have to append loser and winner list to get comprehensive list
ioc_labels = iocle.fit_transform(matches["loser_ioc"].append(matches["winner_ioc"]))
ioc_map = {label: num for num, label in enumerate(iocle.classes_)}
print(ioc_map)

matches["loser_ioc_label"] = matches["loser_ioc"].map(ioc_map)
matches["winner_ioc_label"] = matches["winner_ioc"].map(ioc_map)

# save off map to be used later
with open(f'{MODEL_DIR}/ioc_map.json', 'w') as file:
    json.dump(ioc_map, file)
    
matches.sample(10)[["loser_ioc", "loser_ioc_label", "winner_ioc", "winner_ioc_label"]]

### Player Hand - (l/r/u)

This is simple categorial data - will use LabelEncoder

In [None]:
handcle = LabelEncoder()
# we have to append loser and winner list to get comprehensive list
hand_labels = handcle.fit_transform(matches["loser_hand"].append(matches["winner_hand"]))
hand_map = {label: num for num, label in enumerate(handcle.classes_)}
print(hand_map)

matches["loser_hand_label"] = matches["loser_hand"].map(hand_map)
matches["winner_hand_label"] = matches["winner_hand"].map(hand_map)

# save off map to be used later
with open(f'{MODEL_DIR}/hand_map.json', 'w') as file:
    json.dump(hand_map, file)
    
matches.sample(10)[["loser_hand", "loser_hand_label", "winner_hand", "winner_hand_label"]]

## Match Features

Only feature we will use for now is round. There is an ordinality to the rounds so we will use a custome encoding

Couple issues with this:

'rr' - means round robin. This usually means that all players in a tournament is divided into multiple flights (ie, 2). All players in the flight play against each other. The year end final for instance has 8 players. Top 2 players from each flight make it to semi finals. Winner of that match goes to finals. So rr could mean any of the 3 matches a player would play in the flight. We will just pick the next unique number to encode this.

https://en.wikipedia.org/wiki/ATP_Finals

'br' - Not sure what this indicates. Will encode this as the last value

In [None]:
np.unique(matches["round"])

In [None]:
round_map = {'f': 1, 
            'sf': 2,
            'qf': 3,
            'r16': 4,
            'r32': 5,
            'r64': 6,
            'r128': 7,
            'rr': 8,
            'br': 9}
matches["round_label"] = matches["round"].map(round_map)

# save off map to be used later
with open(f'{MODEL_DIR}/round_map.json', 'w') as file:
    json.dump(round_map, file)
    
matches.sample(10)[["round", "round_label"]]


# Let's check out our data one more time

In [None]:
matches.sample(10).T

In [None]:
pre[pre["round"] == 'r128'].sample(10).T