# Pre-process our tennis data set so we can feed the data to ML and DL models

We will do the following in this notebook:

## Reading Data
* the dataset is split into one file per year - we read all match data from 1985 through the latest data we have (2019). I chose 1985 because player rank starts becoming available in the dataset that year

## Clean Data
* rename tourney_id because this is misleading - this is a composite of year and tourney id
* clean string columns - remove special characters and accents
* filter out tournaments below the main tour level (e.g., challengers, satellites) and Davis Cup ties
* clean out tiebreak scores from scores since we will not be using this as part of our features

## Impute Data
* match minutes from the average minutes for the tournament
* player height and age from the average player height and age
* draw_size from the size of the tournament
* player rank from the average player rank in the same tournament round - winner rank is derived from winners, loser rank is derived from losers


At the end of this notebook, we are left with around 100k entries (99,955 in the run shown here). The only missing data we have are match stats, which we won't be using for the time being


Match data source: https://github.com/JeffSackmann/tennis_atp

In [1]:
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline



# Reading Dataset

In [2]:
DATASET_DIR = '../datasets'
START_YEAR = 1985
END_YEAR = 2019
OUTFILE = f'{DATASET_DIR}/atp_matches_{START_YEAR}-{END_YEAR}_preprocessed.csv'
years = np.arange(START_YEAR, END_YEAR + 1)

# read in one ATP file per year, parse tournament dates, and concatenate everything
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
matches_orig = pd.concat(
    [pd.read_csv(f'{DATASET_DIR}/tennis_atp-master/atp_matches_{year}.csv', parse_dates=["tourney_date"])
     for year in years],
    ignore_index=True)

print(len(matches_orig))
matches_orig.sample(10).T

113991


Unnamed: 0,31414,43370,91519,46285,55283,76831,64670,60164,29583,16143
tourney_id,1993-419,1996-620,2011-540,1997-418,2000-359,2006-560,2003-580,2001-319,1993-D063,1989-319
tourney_name,Indianapolis,Bournemouth,Wimbledon,Washington,Bogota,US Open,Australian Open,Kitzbuhel,Davis Cup G2 PO: KUW vs SRI,Kitzbuhel
surface,Hard,Clay,Grass,Hard,Clay,Hard,Hard,Clay,Hard,Clay
draw_size,,,,,,,,,,
tourney_level,A,A,G,A,A,G,G,A,D,A
tourney_date,1993-08-16 00:00:00,1996-09-09 00:00:00,2011-06-20 00:00:00,1997-07-14 00:00:00,2000-03-06 00:00:00,2006-08-28 00:00:00,2003-01-13 00:00:00,2001-07-23 00:00:00,1993-03-26 00:00:00,1989-07-31 00:00:00
match_num,18,16,67,1,21,114,76,35,2,14
winner_id,101214,102456,105223,101955,103324,103484,103252,102869,108849,101534
winner_seed,,2,24,,5,5,,,,
winner_entry,,,,Q,,,,,,


# Data Cleaning

In [3]:
# make all column names lower case so they are easier to remember
# work on a copy so the raw data in matches_orig stays untouched
matches = matches_orig.copy()
matches.columns = [col.strip().lower() for col in matches.columns]

# these columns don't have much data according to our EDA, so we can't impute them. Let's drop them
# rank points determine a player's ATP ranking at the time of the tournament, so they duplicate rank - we drop them as well
drop_columns = ["loser_entry", "winner_entry", "loser_rank_points", "winner_rank_points"]
matches = matches.drop(drop_columns, axis=1)


# we are predicting grand slams, so we only keep tour-level tournaments
# this filters out challengers (C), satellites (S), and Davis Cup ties (D)
# note: the date bounds are strict, so tournaments dated on or before {START_YEAR}-12-31
# or on or after {END_YEAR}-01-14 are excluded
matches = matches[(~matches.tourney_level.isin(["C", "S", "D"])) & 
                  (matches.tourney_date > datetime.datetime(START_YEAR, 12, 31)) &
                 (matches.tourney_date < datetime.datetime(END_YEAR, 1, 14))].copy()

len(matches)

99955

### Clean String Columns 

Let's standardize the non-numeric data by stripping leading/trailing spaces and converting to lowercase

We will also remove any special characters

In [4]:
# keys for the data are player names and tournament names - these are strings
# we also have some categorical columns - e.g. loser_hand, winner_hand, surface, tourney_level
# let's convert any non-numerical column data into lower case and strip
# we will also remove any special characters and accents

import unicodedata
import re

# first let's print one of the columns
print("Before lowering...")
print(matches[:5].loser_hand)

# np.object was removed in NumPy 1.24; the builtin object works the same way here
lower_columns = [col for col, dt in matches.dtypes.items() if dt == object]
for col in lower_columns:
    print(f'cleaning col {col}')
    matches[col] = matches[col].str.strip()
    matches[col] = matches[col].str.lower()
    # decompose accented characters, then drop anything that is not ASCII
    # note: str(x) turns missing values into the string 'nan'
    matches[col] = matches[col].apply(lambda x: unicodedata.normalize('NFKD', str(x)).encode('ascii', 'ignore').decode('utf-8', 'ignore'))
    # we won't do this for score because they should look like 6-3 6-7(7)
    if col not in ["score", "tourney_id"]:
        matches[col] = matches[col].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', ' ', x, flags=re.I | re.A))

# check to make sure we've done this correctly
print("After lowering...")
matches[:5].loser_hand

Before lowering...
3378    R
3379    L
3380    R
3381    R
3382    R
Name: loser_hand, dtype: object
cleaning col tourney_id
cleaning col tourney_name
cleaning col surface
cleaning col tourney_level
cleaning col winner_name
cleaning col winner_hand
cleaning col winner_ioc
cleaning col loser_name
cleaning col loser_hand
cleaning col loser_ioc
cleaning col score
cleaning col round
After lowering...


3378    r
3379    l
3380    r
3381    r
3382    r
Name: loser_hand, dtype: object
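
To make the string cleaning easier to reason about, here is a minimal standalone sketch of the same steps applied to a single value. The helper name `clean_text` and the sample names are made up for illustration and are not part of the notebook.

```python
import re
import unicodedata


def clean_text(value: str) -> str:
    """Lower-case, strip, drop accents, and replace special characters with spaces."""
    text = value.strip().lower()
    # NFKD splits an accented letter into a base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops the combining mark
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # anything that is not a letter, digit, or whitespace becomes a space
    return re.sub(r"[^a-z0-9\s]", " ", text)


print(clean_text("  Björn Borg "))       # bjorn borg
print(clean_text("Jo-Wilfried Tsonga"))  # jo wilfried tsonga
```

Note that punctuation is replaced by a space rather than removed, so hyphenated names end up as separate words.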

# Cleaning Scores

We will most likely break the score into the number of sets and the number of games won by each player as features

The current score includes tiebreak points - since we are not going all the way down to point level, we can strip these out

In [5]:
import re

# first let's get some scores with tiebreaks so we can check that we did this correctly
tiebreaks = matches[matches.score.str.contains(r"\)")]
print(tiebreaks["score"].head(10))
tb_indexes = tiebreaks.index

# strip out tiebreak scores, e.g. 7-6(3) -> 7-6
has_tb = matches.score.str.contains(r"\)")
matches.loc[has_tb, "score"] = matches.loc[has_tb, "score"].str.replace(r'\(\d+\)', '', regex=True)

# sometimes there are strings in the scores - e.g. ret (retired), w/o (walkover)
# we leave these in place for now




10891    7-6(3) 4-6 7-6(1) 6-4
21102           6-4 3-6 7-6(2)
21103           6-3 3-6 7-6(6)
21105               7-6(2) 6-1
21111        6-7(4) 7-6(5) 6-4
21113           7-6(4) 3-0 ret
21114               6-2 7-6(2)
21115               6-3 7-6(1)
21118            7-6(5) 7-6(8)
21120               7-6(4) 7-5
Name: score, dtype: object


In [6]:
print(matches.loc[tb_indexes[:10]]["score"])

# verify if there are any other tiebreak scores left
len(matches[matches.score.str.contains(r"\)")]["score"])


10891    7-6 4-6 7-6 6-4
21102        6-4 3-6 7-6
21103        6-3 3-6 7-6
21105            7-6 6-1
21111        6-7 7-6 6-4
21113        7-6 3-0 ret
21114            6-2 7-6
21115            6-3 7-6
21118            7-6 7-6
21120            7-6 7-5
Name: score, dtype: object


0
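
As a rough idea of how a cleaned score could later be split into sets and games, here is a hedged sketch. The function `parse_score` is hypothetical, and it assumes the first number in each set belongs to the match winner. It also credits unfinished sets naively, which may not be what we want for retirements.

```python
import re


def parse_score(score: str):
    """Return ((winner_sets, loser_sets), (winner_games, loser_games)) for a cleaned score."""
    w_sets = l_sets = w_games = l_games = 0
    for token in score.split():
        m = re.fullmatch(r"(\d+)-(\d+)", token)
        if m is None:
            # skip non-numeric tokens such as "ret" or "w/o"
            continue
        w, l = int(m.group(1)), int(m.group(2))
        w_games += w
        l_games += l
        # naive rule: whoever has more games in the token is credited with the set,
        # so an unfinished set in a retirement is still counted
        if w > l:
            w_sets += 1
        elif l > w:
            l_sets += 1
    return (w_sets, l_sets), (w_games, l_games)


print(parse_score("7-6 4-6 7-6 6-4"))  # ((3, 1), (24, 22))
print(parse_score("7-6 3-0 ret"))      # ((2, 0), (10, 6))
```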

In [7]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99955 entries, 3378 to 113478
Data columns (total 45 columns):
tourney_id       99955 non-null object
tourney_name     99955 non-null object
surface          99955 non-null object
draw_size        0 non-null float64
tourney_level    99955 non-null object
tourney_date     99955 non-null datetime64[ns]
match_num        99955 non-null int64
winner_id        99955 non-null int64
winner_seed      44675 non-null float64
winner_name      99955 non-null object
winner_hand      99955 non-null object
winner_ht        97827 non-null float64
winner_ioc       99955 non-null object
winner_age       99939 non-null float64
loser_id         99955 non-null int64
loser_seed       23955 non-null float64
loser_name       99955 non-null object
loser_hand       99955 non-null object
loser_ht         95800 non-null float64
loser_ioc        99955 non-null object
loser_age        99818 non-null float64
score            99955 non-null object
best_of          9995

# Impute Missing Data

We are missing the minutes (i.e., the length of the match) for some matches. We can impute this: match length might depend on the tournament (e.g., surface) and on whether the tournament is best of 3 or 5, so we use the mean of match minutes for that tournament


In [8]:
## Some matches are missing minutes

In [9]:
# we are missing some matches' minutes (i.e., the length of the match)
# match length might depend on the tournament (e.g., surface)
# and on whether the tournament is best of 3 or 5, so we impute
# with the mean of the recorded match minutes for that tournament
tids = set(matches[matches.minutes.isnull()].tourney_id)
for tid in tids:
    matches.loc[(matches.minutes.isnull()) & (matches.tourney_id == tid), "minutes"] = \
             matches[(matches.minutes.notnull()) & (matches.tourney_id == tid)].minutes.mean()
    
# some tournaments have no minutes at all, we will just use the mean
matches.loc[matches.minutes.isnull(), "minutes"] = \
             matches[matches.minutes.notnull()].minutes.mean()
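
The same two-step imputation can also be written without a Python loop. The sketch below uses a small made-up DataFrame (`toy`) rather than the real data, and is only meant to show the idea: per-tournament mean first, overall mean as a fallback.

```python
import numpy as np
import pandas as pd

# toy data: tournament "b" has no recorded minutes at all
toy = pd.DataFrame({
    "tourney_id": ["a", "a", "a", "b", "c"],
    "minutes": [100.0, np.nan, 140.0, np.nan, 60.0],
})

# fill each gap with the mean of the recorded minutes in the same tournament
per_tourney_mean = toy.groupby("tourney_id")["minutes"].transform("mean")
toy["minutes"] = toy["minutes"].fillna(per_tourney_mean)

# tournaments with no minutes at all fall back to the overall mean
toy["minutes"] = toy["minutes"].fillna(toy["minutes"].mean())

print(toy)
```

Here the gap in tournament "a" becomes 120, while tournament "b" gets 105, the mean of all values after the first step.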

In [10]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99955 entries, 3378 to 113478
Data columns (total 45 columns):
tourney_id       99955 non-null object
tourney_name     99955 non-null object
surface          99955 non-null object
draw_size        0 non-null float64
tourney_level    99955 non-null object
tourney_date     99955 non-null datetime64[ns]
match_num        99955 non-null int64
winner_id        99955 non-null int64
winner_seed      44675 non-null float64
winner_name      99955 non-null object
winner_hand      99955 non-null object
winner_ht        97827 non-null float64
winner_ioc       99955 non-null object
winner_age       99939 non-null float64
loser_id         99955 non-null int64
loser_seed       23955 non-null float64
loser_name       99955 non-null object
loser_hand       99955 non-null object
loser_ht         95800 non-null float64
loser_ioc        99955 non-null object
loser_age        99818 non-null float64
score            99955 non-null object
best_of          9995

In [11]:
matches[matches.minutes.isnull()][["tourney_id", "minutes"]]

Unnamed: 0,tourney_id,minutes


The query above returns no rows, so every match now has a value for minutes. Tournaments with no recorded minutes at all received the overall mean instead of a per-tournament mean.

It looks like the '98 and '99 Grand Slam Cup did not record match minutes. This is a year-end tournament: 
https://www.grandslamhistory.com/atp/grand-slam-cup-munich

In [12]:
matches[matches.tourney_id.isin(matches[matches.minutes.isnull()].tourney_id.tolist())]["tourney_name"].unique()

array([], dtype=object)

## Impute Height and Age

We are missing some values for player height and age. We can impute these by using the average height and age

In [14]:
# impute height with mean of players
matches.loc[matches.loser_ht.isnull(), 'loser_ht'] = matches.loser_ht.mean()
matches.loc[matches.winner_ht.isnull(), 'winner_ht'] = matches.winner_ht.mean()

# impute age with mean of players
matches.loc[matches.loser_age.isnull(), 'loser_age'] = matches.loser_age.mean()
matches.loc[matches.winner_age.isnull(), 'winner_age'] = matches.winner_age.mean()
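
For reference, the four assignments above can be collapsed into one `fillna` call. This is a sketch on a made-up DataFrame (`toy`), not a change to the notebook's results.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "winner_ht": [185.0, np.nan, 191.0],
    "loser_ht": [np.nan, 180.0, 188.0],
    "winner_age": [24.0, 30.0, np.nan],
    "loser_age": [22.0, np.nan, 28.0],
})

cols = ["winner_ht", "loser_ht", "winner_age", "loser_age"]
# DataFrame.mean() returns one mean per column; fillna matches them by column name
toy[cols] = toy[cols].fillna(toy[cols].mean())
print(toy)
```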

In [15]:
matches.sample(5).T

Unnamed: 0,89649,103348,111261,68147,20936
tourney_id,2010-568,2015-540,2018-m006,2004-506,1990-352
tourney_name,st petersburg,wimbledon,indian wells masters,buenos aires,paris masters
surface,hard,grass,hard,clay,carpet
draw_size,,,,,
tourney_level,a,g,m,a,m
tourney_date,2010-10-25 00:00:00,2015-06-29 00:00:00,2018-03-05 00:00:00,2004-02-16 00:00:00,1990-10-29 00:00:00
match_num,25,19,294,22,9
winner_id,104022,104269,105223,103970,101265
winner_seed,1,,6,,
winner_name,mikhail youzhny,fernando verdasco,juan martin del potro,david ferrer,patrick mcenroe


In [16]:
# Lastly, let's drop any rows where we don't have scores for the matches
matches = matches.dropna(axis=0, subset=["score"])

### Let's see what else we are missing data for

In [17]:
def print_columns_with_missing_data(m: pd.DataFrame):
    print(m.columns[m.isnull().any()].tolist())
    
    
print_columns_with_missing_data(matches)

['draw_size', 'winner_seed', 'loser_seed', 'w_ace', 'w_df', 'w_svpt', 'w_1stin', 'w_1stwon', 'w_2ndwon', 'w_svgms', 'w_bpsaved', 'w_bpfaced', 'l_ace', 'l_df', 'l_svpt', 'l_1stin', 'l_1stwon', 'l_2ndwon', 'l_svgms', 'l_bpsaved', 'l_bpfaced', 'winner_rank', 'loser_rank']


## Missing Rank

At first I thought a missing rank might mean that the player was unranked

Looking closer, I don't recognize all of the players, but Tim Henman was an English player who was at the top of his game in 1998 and was in the top 10 according to his Wikipedia page: https://en.wikipedia.org/wiki/Tim_Henman

So I think we should somehow impute this

In [18]:
matches[(matches.winner_rank.isnull())][["tourney_date", "tourney_id", "round", "tourney_name", "winner_name"]].head(10)

Unnamed: 0,tourney_date,tourney_id,round,tourney_name,winner_name
4572,1986-05-26,1986-520,r128,roland garros,thierry van den daele
5181,1986-07-21,1986-417,r64,boston,richey reneberg
6560,1986-11-17,1986-426,r32,johannesburg,schalk van der merwe
6904,1987-02-02,1987-401,r64,philadelphia,martin blackman
9031,1987-08-10,1987-379,r32,prague,vaclav roubicek
10609,1988-02-22,1988-341,r32,metz,laurent prades
11058,1988-04-11,1988-329,r64,tokyo outdoor,kim warwick
11836,1988-06-13,1988-313,r64,bristol,gary drake
12007,1988-07-04,1988-417,r64,boston,magnus larsson
12219,1988-07-18,1988-224,r32,schenectady,martin blackman


I think a reasonable way to impute this is to use data from the same tournament. We will take the mean player rank in that round and use it for our missing ranks

In [19]:
def impute_missing_rank(matches: pd.DataFrame, missing_list: pd.DataFrame, name_col: str) -> None:
    """Fill missing ranks in `matches` in place, using the mean rank of the same tournament round."""
    for index, row in missing_list.iterrows():
        current_round_matches = matches[(matches.tourney_id == row.tourney_id) & 
                                        (matches["round"] == row["round"]) & (matches[name_col].notnull())]
        if len(current_round_matches) == 0:
            # no other ranked player in this round - fall back to the mean rank for the whole tournament
            print(f'Unable to find other matches in this round index: {index} tourney_id: {str(row["tourney_id"])} round: {row["round"]} column: {name_col}. Imputing with tournament ranking')
            matches.loc[index, name_col] = matches[(matches.tourney_id == row.tourney_id) &
                                                   (matches[name_col].notnull())][name_col].mean()
        else:
            matches.loc[index, name_col] = int(current_round_matches[name_col].mean())

In [20]:
losers_missing_rank = matches[matches.loser_rank.isnull()][["tourney_id", "round", "loser_name"]]
print(f'Missing loser rank before imputing: {len(losers_missing_rank)}')
print(losers_missing_rank.head(5))
impute_missing_rank(matches, losers_missing_rank, "loser_rank")    

losers_missing_rank = matches[matches.loser_rank.isnull()][["tourney_id", "round", "loser_name"]]

print(f'Missing loser rank after imputing: {len(losers_missing_rank)}')

Missing loser rank before imputing: 347
     tourney_id round      loser_name
3378   1986-301   r32  simon robinson
3384   1986-301   r32   malcolm elley
3392   1986-301   r32    brett steven
3393   1986-301   r32    neil borwick
3576   1986-403  r128   david wheaton
Unable to find other matches in this round index: 50009 tourney_id: 1998-423 round: f column: loser_rank. Imputing with tournament ranking
Unable to find other matches in this round index: 51548 tourney_id: 1999-495 round: f column: loser_rank. Imputing with tournament ranking
Missing loser rank after imputing: 0


In [21]:
winner_missing_ranks = matches[matches.winner_rank.isnull()][["tourney_id", "round", "winner_name"]]
print(f'Missing winner rank before imputing: {len(winner_missing_ranks)}')

impute_missing_rank(matches, winner_missing_ranks, "winner_rank")    

winner_missing_ranks = matches[matches.winner_rank.isnull()][["tourney_id", "round", "winner_name"]]
print(f'Missing winner rank after imputing: {len(winner_missing_ranks)}')

Missing winner rank before imputing: 89
Missing winner rank after imputing: 0
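
A vectorized variant of the rank imputation might look like the sketch below, shown on made-up data (`toy`). It is not identical to `impute_missing_rank`: it does not truncate the mean to an integer, and it computes every mean from observed ranks only, whereas the loop updates `matches` as it goes.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "tourney_id": ["t1"] * 6,
    "round": ["r32", "r32", "r32", "sf", "sf", "f"],
    "loser_rank": [10.0, 30.0, np.nan, 50.0, 70.0, np.nan],
})

# mean observed rank within the same tournament and round
round_mean = toy.groupby(["tourney_id", "round"])["loser_rank"].transform("mean")
# mean observed rank across the whole tournament, used when a round has no ranks
tourney_mean = toy.groupby("tourney_id")["loser_rank"].transform("mean")

toy["loser_rank"] = toy["loser_rank"].fillna(round_mean).fillna(tourney_mean)
print(toy)
```

The missing r32 rank becomes 20 (the round mean), and the final, which has no other ranked loser, falls back to the tournament mean of 40.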


## Impute Draw Size

For single elimination tournaments, we can impute this from the first round: a tournament whose earliest round is r32 has a draw size of 32

In [22]:
# let's look at some random samples before we impute
rounds = ['r128', 'r64', 'r32', 'r16']
missing_draw_indexes = matches[matches["round"].isin(rounds)].sample(10, random_state = 1).index
print(missing_draw_indexes)
matches.loc[missing_draw_indexes][["tourney_name", "round", "draw_size"]]

Int64Index([63000, 69467, 9640, 111060, 96670, 110950, 51925, 29158, 78650,
            22555],
           dtype='int64')


Unnamed: 0,tourney_name,round,draw_size
63000,wimbledon,r128,
69467,wimbledon,r64,
9640,sydney indoor,r32,
111060,rio de janeiro,r32,
96670,miami masters,r128,
110950,new york,r16,
51925,miami masters,r64,
29158,philadelphia,r32,
78650,barcelona,r32,
22555,hamburg masters,r32,


In [23]:
# impute draw size for single elimination tournament
import re

# rounds is ordered from the largest round to the smallest, and we only touch
# tournaments whose draw_size is still missing, so each tournament gets its largest round
for r in rounds:
    size = int(re.sub("r", "", r))
    # impute draw_size for the entire tournament
    tids = np.unique(matches[(matches["round"] == r) & (matches.draw_size.isnull())][["tourney_id"]])
    matches.loc[matches["tourney_id"].isin(tids), "draw_size"] = size
    
# let's look at the same random samples and see if they look correct
# the round size should be <= draw_size
matches.loc[missing_draw_indexes][["tourney_name", "round", "draw_size"]]
    

Unnamed: 0,tourney_name,round,draw_size
63000,wimbledon,r128,128.0
69467,wimbledon,r64,128.0
9640,sydney indoor,r32,32.0
111060,rio de janeiro,r32,32.0
96670,miami masters,r128,128.0
110950,new york,r16,32.0
51925,miami masters,r64,128.0
29158,philadelphia,r32,32.0
78650,barcelona,r32,64.0
22555,hamburg masters,r32,64.0
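
The same idea (largest numbered round per tournament) can be sketched without a loop. The data below is made up, and the variable `round_size` is just an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({
    "tourney_id": ["t1", "t1", "t1", "t2", "t2", "t2"],
    "round": ["r32", "r16", "f", "r128", "r64", "qf"],
})

# pull the number out of labels such as "r32"; labels like "qf" or "f" become NaN
round_size = pd.to_numeric(toy["round"].str.extract(r"^r(\d+)$", expand=False))

# the largest numbered round in a tournament is taken as its draw size
toy["draw_size"] = round_size.groupby(toy["tourney_id"]).transform("max")
print(toy)
```

Round robin rows ("rr") do not match the pattern, so they would stay empty here and still need separate handling.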


Let's now look at round robin tournaments (rr)

Background - these tournaments are generally divided into pools, then the winners of the pools go into a single elimination tournament of 4 at the end

To calculate the draw size, we will have to look at how many players were in each tournament (winners and losers)

In [24]:
def get_draw_size(matches: pd.DataFrame, yti: str):
    tourney_matches = matches[matches.tourney_id == yti]
    pids = np.unique(np.append(tourney_matches.winner_id, tourney_matches.loser_id))
    matches.loc[matches.tourney_id == yti, "draw_size"] = len(pids)
    
    
for i in matches[matches["round"] == 'rr']["tourney_id"].unique():
    get_draw_size(matches, i)
    
print(matches[matches["round"] == 'rr'].sample(10, random_state=1)[["tourney_name", "round", "draw_size"]])

# check to see if there are any more tournaments without draw_size
print(f'Tournaments without draw_size left: {len(matches[matches.draw_size.isnull()].tourney_id)}')

       tourney_name round  draw_size
49157    dusseldorf    rr       18.0
38165    dusseldorf    rr       18.0
61218   masters cup    rr        8.0
11511    dusseldorf    rr       20.0
52681    dusseldorf    rr       18.0
49162    dusseldorf    rr       18.0
104607  tour finals    rr        8.0
104611  tour finals    rr        8.0
28590   tour finals    rr        8.0
80583   masters cup    rr        8.0
Tournaments without draw_size left: 21
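
Counting distinct players per tournament can also be done for all tournaments at once. This is a sketch on made-up data; `players` and `n_players` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "tourney_id": ["t1", "t1", "t1", "t1", "t2"],
    "winner_id": [1, 1, 2, 4, 7],
    "loser_id": [2, 3, 3, 1, 8],
})

# stack winners and losers into one long table of (tourney_id, player_id) pairs
players = pd.concat([
    toy[["tourney_id", "winner_id"]].rename(columns={"winner_id": "player_id"}),
    toy[["tourney_id", "loser_id"]].rename(columns={"loser_id": "player_id"}),
])

# number of distinct players per tournament, mapped back onto every match row
n_players = players.groupby("tourney_id")["player_id"].nunique()
toy["draw_size"] = toy["tourney_id"].map(n_players)
print(toy)
```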


In [25]:
# looks like we are still missing some - let's use the median
matches.loc[matches.draw_size.isnull(), 'draw_size'] = matches.draw_size.median()

## Player Seed

We are missing quite a lot of player seeds but we can derive this

Player seed is determined by the player's rank. The player with the best rank in the tournament gets seed #1, and so on

In [26]:
# We will impute both loser_seed and winner_seed

# let's get a list of tournaments where we are missing either loser_seed or winner_seed
tids_seed = np.unique(matches[(matches.loser_seed.isnull()) | (matches.winner_seed.isnull())]["tourney_id"])

# get a list of all players and their ranks for each tournament
for tid in tids_seed:
    tourney_matches = matches[matches.tourney_id == tid]
    winners = tourney_matches[["winner_id", "winner_rank"]].rename({"winner_id": "player_id", "winner_rank": "rank"}, axis=1)
    losers = tourney_matches[["loser_id", "loser_rank"]].rename({"loser_id": "player_id", "loser_rank": "rank"}, axis=1)

    # let's seed all players in the tournament: the best (lowest) rank gets seed 1
    # (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
    players = pd.concat([winners, losers], ignore_index=True).drop_duplicates().sort_values("rank")
    players["seed"] = np.arange(1, len(players) + 1, dtype=float)

    for index, match in tourney_matches.iterrows():
        if pd.isnull(match.winner_seed):
            seed = players[players.player_id == match.winner_id]["seed"].values[0]
            matches.loc[index, 'winner_seed'] = seed
        if pd.isnull(match.loser_seed):
            seed = players[players.player_id == match.loser_id]["seed"].values[0]
            matches.loc[index, 'loser_seed'] = seed
        
# pick one tournament and see how we did
matches[matches.tourney_id == tids_seed[10]][["winner_rank", "winner_seed", "loser_rank", "loser_seed"]]

Unnamed: 0,winner_rank,winner_seed,loser_rank,loser_seed
4693,6.0,1.0,59.0,27.0
4694,141.0,57.0,76.0,36.0
4695,63.0,30.0,48.0,22.0
4696,30.0,16.0,46.0,21.0
4697,110.0,51.0,42.0,18.0
...,...,...,...,...
4751,7.0,3.0,31.0,12.0
4752,20.0,8.0,3.0,2.0
4753,6.0,1.0,39.0,17.0
4754,20.0,8.0,7.0,3.0
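
Deriving a seed from rank within a tournament can be sketched with a grouped `rank` call. The player ids and ranks below are made up, and this only illustrates the ordering step, not the write-back into `matches`.

```python
import pandas as pd

# one row per player per tournament; a lower rank number means a better player
players = pd.DataFrame({
    "tourney_id": ["t1", "t1", "t1", "t1"],
    "player_id": [101, 102, 103, 104],
    "rank": [6.0, 141.0, 59.0, 20.0],
})

# rank(method="first") numbers players 1..n by ATP rank and breaks ties by row order
players["seed"] = (
    players.groupby("tourney_id")["rank"].rank(method="first").astype(int)
)
print(players.sort_values("seed"))
```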


In [27]:
matches[matches.loser_rank.isnull()][["tourney_name", "tourney_id", "draw_size", "loser_name"]]

Unnamed: 0,tourney_name,tourney_id,draw_size,loser_name


# Done

OK. Looks like we have cleaned up and imputed as much data as we can at this point

For this round, we will not be using any of the match statistics, so we will not drop rows that are missing data in these columns.

No rows are missing loser_rank any more (the query above returns nothing), so there is nothing left to drop. We can save to file and move on to feature engineering

In [28]:
print_columns_with_missing_data(matches)

['w_ace', 'w_df', 'w_svpt', 'w_1stin', 'w_1stwon', 'w_2ndwon', 'w_svgms', 'w_bpsaved', 'w_bpfaced', 'l_ace', 'l_df', 'l_svpt', 'l_1stin', 'l_1stwon', 'l_2ndwon', 'l_svgms', 'l_bpsaved', 'l_bpfaced']


### Save our data

In [29]:
matches.to_csv(OUTFILE, index=False)

In [30]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99955 entries, 3378 to 113478
Data columns (total 45 columns):
tourney_id       99955 non-null object
tourney_name     99955 non-null object
surface          99955 non-null object
draw_size        99955 non-null float64
tourney_level    99955 non-null object
tourney_date     99955 non-null datetime64[ns]
match_num        99955 non-null int64
winner_id        99955 non-null int64
winner_seed      99955 non-null float64
winner_name      99955 non-null object
winner_hand      99955 non-null object
winner_ht        99955 non-null float64
winner_ioc       99955 non-null object
winner_age       99955 non-null float64
loser_id         99955 non-null int64
loser_seed       99955 non-null float64
loser_name       99955 non-null object
loser_hand       99955 non-null object
loser_ht         99955 non-null float64
loser_ioc        99955 non-null object
loser_age        99955 non-null float64
score            99955 non-null object
best_of          