# ATP Data Exploratory Data Analysis

https://www.kaggle.com/sijovm/atpdata


## Data description
* tourney_id - tournament_id
* tourney_name - tournament_name
* surface - surface on which the match is played
* draw_size - the size of the draw
* tourney_level - tournament level
    * 'G' = Grand Slams
    * 'M' = Masters 1000s
    * 'A' = other tour-level events
    * 'C' = Challengers
    * 'S' = Satellites/ITFs
    * 'F' = Tour finals and other season-ending events
    * 'D' = Davis Cup
* tourney_date - starting date of the tournament
* match_num - match number within a tournament
* id - player id
* seed - the seed of the player in that tournament
* entry - How did the player enter the tournaments?
    * WC - Wildcard
    * Q - Qualifier
    * LL - Lucky loser
    * PR - Protected ranking
    * SE - Special Exempt
    * ALT - Alternate player
* name - player name
* hand - hand of the player, right or left
* ht - the height of the player
* IOC - the country of origin
* age - age of the player
* score - final score in the match
* best_of - the maximum number of sets played
* round - the round in the tournament a match belongs to
* minutes - duration of the match in minutes
* ace - number of aces in the match 
* df - double faults
* svpt - number of serve points played
* 1stIn - number of first serves made
* 1stWon - number of first-serve points won
* 2ndWon - number of second-serve points won
* SvGms - number of games played on serve (So, the maximum difference between w_SvGms and l_SvGms will be 1)
* bpSaved - breakpoints saved
* bpFaced - breakpoints faced

Credits: 1) http://www.tennisabstract.com/ 2) Jeff Sackmann

NOTE: The rankings are available in the other CSVs
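
The serve columns can be turned into rates once the file is loaded. The following is a minimal sketch, assuming the columns hold raw counts as described in Jeff Sackmann's upstream data dictionary. The `toy` frame and the `*_pct` column names are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for ATP.csv; the numbers are invented for illustration.
toy = pd.DataFrame({
    "w_svpt": [80, 100],    # serve points played
    "w_1stIn": [48, 65],    # first serves made
    "w_1stWon": [36, 52],   # first-serve points won
    "w_2ndWon": [16, 14],   # second-serve points won
})

# First-serve percentage: first serves made / serve points played.
toy["w_1st_in_pct"] = toy.w_1stIn / toy.w_svpt
# Share of first-serve points won.
toy["w_1st_won_pct"] = toy.w_1stWon / toy.w_1stIn
# Second-serve points played = serve points minus first serves made.
toy["w_2nd_won_pct"] = toy.w_2ndWon / (toy.w_svpt - toy.w_1stIn)
```

The same three lines would apply to the `l_` columns for the losing player.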

In [1]:
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

matches_all = pd.read_csv('atpdata/ATP.csv', parse_dates=["tourney_date"])

%matplotlib inline



In [2]:
matches_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169690 entries, 0 to 169689
Data columns (total 49 columns):
best_of               169690 non-null int64
draw_size             1232 non-null float64
l_1stIn               83415 non-null float64
l_1stWon              83415 non-null float64
l_2ndWon              83415 non-null float64
l_SvGms               83415 non-null float64
l_ace                 83415 non-null float64
l_bpFaced             83415 non-null float64
l_bpSaved             83415 non-null float64
l_df                  83415 non-null float64
l_svpt                83415 non-null float64
loser_age             164700 non-null float64
loser_entry           25339 non-null object
loser_hand            169605 non-null object
loser_ht              139052 non-null float64
loser_id              169690 non-null int64
loser_ioc             169690 non-null object
loser_name            169690 non-null object
loser_rank            145909 non-null float64
loser_rank_points     93025 non-nul

## draw size - looks like this dataset has no draw_size values after 1968

In [3]:
matches_all[matches_all.draw_size.notnull()][["tourney_id", "tourney_name", "tourney_level", "tourney_date", "draw_size"]].tourney_date.max()

Timestamp('1968-12-30 00:00:00')
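
A per-year count gives a fuller picture than the single maximum date. This is a small sketch on made-up rows; on the real data `toy` would be `matches_all`.

```python
import pandas as pd

# Toy rows; dates and draw sizes are invented for illustration.
toy = pd.DataFrame({
    "tourney_date": pd.to_datetime(
        ["1968-07-01", "1968-12-30", "1975-05-26", "2001-01-15"]
    ),
    "draw_size": [32.0, 64.0, None, None],
})

# Number of non-null draw_size values per calendar year.
draw_by_year = toy.groupby(toy.tourney_date.dt.year)["draw_size"].count()

# Latest year that still has at least one draw_size value.
last_year_with_draw = draw_by_year[draw_by_year > 0].index.max()
```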

## Filtering out lower-tier events (Challengers, Satellites/ITFs, Davis Cup) and pre-1998 matches leaves around 59k entries

In [4]:
# exclude Challengers, Satellites/ITFs, and Davis Cup (other tour-level 'A' events are kept)
matches = matches_all[~matches_all.tourney_level.isin(["C", "S", "D"])]
# Federer turned pro in 1998 - we will exclude all data before then
matches = matches[matches.tourney_date > datetime.datetime(1998, 1, 1)]
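
The two filters can be checked on a handful of made-up rows to confirm which levels survive. This is only a sketch; `toy`, `excluded_levels`, and `cutoff` are illustrative names.

```python
import datetime
import pandas as pd

# Toy rows covering every tourney_level code; dates are invented.
toy = pd.DataFrame({
    "tourney_level": ["G", "M", "A", "C", "S", "F", "D", "A"],
    "tourney_date": pd.to_datetime([
        "1999-01-18", "2005-03-10", "1997-06-09", "2010-02-01",
        "2003-04-07", "2018-11-11", "2012-02-10", "1998-01-05",
    ]),
})

excluded_levels = ["C", "S", "D"]        # Challengers, Satellites/ITFs, Davis Cup
cutoff = datetime.datetime(1998, 1, 1)   # keep matches after this date

# Same two conditions as the cell above, combined into one mask.
kept = toy[~toy.tourney_level.isin(excluded_levels) & (toy.tourney_date > cutoff)]
```

Grand Slams, Masters, tour finals, and other tour-level events remain, and the 1997 row is dropped by the date cutoff.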

In [5]:
matches.tourney_date.min()

Timestamp('1998-01-05 00:00:00')

In [6]:
matches.tourney_date.max()

Timestamp('2019-02-25 00:00:00')

## Summary of missing data
* Dataset contains tournaments from 1968 to 2/2019
* most players are missing entry, and winners are missing it more often than losers. We will have to drop this column. It could be an important statistic, since qualifiers have to play more matches to get into the tournament, so we should look at why this data is missing.
* only a subset of players have seeds - we could impute this based on the player's ranking
* player height - we seem to be missing some - we can probably just impute this with the average height from that column
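
A per-column summary makes the points above easy to compare. The sketch below uses made-up rows; on the real data `toy` would be `matches`, and `missing` is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "winner_entry": [np.nan, "Q", np.nan, np.nan],
    "loser_entry": [np.nan, "WC", "Q", np.nan],
    "winner_ht": [185.0, np.nan, 188.0, 190.0],
    "minutes": [95.0, 120.0, 80.0, 101.0],
})

# Count and share of missing values per column, worst columns first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean().round(3),
}).sort_values("pct_missing", ascending=False)
```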

### Player entry

Entry is supposed to tell you how the player made it into the tournament. We are missing this information for most players.
Looking at the rows with a missing entry, there is no obvious pattern:
* we have players that are ranked and seeded
* players that are not seeded but have a pretty high rank (i.e., < 50)
* players that have lower rankings (i.e., > 100)

There is also an 'S' value in the loser entry, which is not explained in the dataset description

We should drop this column, since we don't have enough information to impute it
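
Instead of eyeballing random samples, the relationship between a missing entry, seeding, and rank can be tabulated. This is a rough sketch on made-up rows; the string labels are illustrative and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "loser_entry": [np.nan, np.nan, "Q", np.nan, "WC", np.nan],
    "loser_seed": [5.0, np.nan, np.nan, np.nan, np.nan, 1.0],
    "loser_rank": [27.0, 72.0, 150.0, 107.0, 300.0, 3.0],
})

# Label each row by whether entry is missing and whether the player is seeded.
entry_status = toy.loser_entry.isnull().map(
    {True: "entry missing", False: "entry present"}
)
seed_status = toy.loser_seed.notnull().map({True: "seeded", False: "unseeded"})

# Counts for every combination of the two labels.
table = pd.crosstab(entry_status, seed_status)

# Median rank with and without an entry value.
rank_by_entry = toy.groupby(entry_status)["loser_rank"].median()
```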

In [11]:
print(matches.loser_entry.unique())
print(matches.winner_entry.unique())
print(matches[matches.loser_entry.isnull()].sample(10)[["loser_name", "loser_rank", "loser_seed"]])
print(matches[matches.winner_entry.isnull()].sample(10)[["winner_name", "winner_rank", "winner_seed"]])

[nan 'Q' 'WC' 'LL' 'PR' 'S' 'SE' 'ALT']
[nan 'Q' 'WC' 'LL' 'PR' 'SE' 'ALT']
             loser_name  loser_rank  loser_seed
121182    Attila Savolt       107.0         NaN
138201    Jurgen Melzer        72.0         NaN
142483     Peter Luczak        77.0         NaN
134607   Dominik Hrbaty        27.0         5.0
132060     Lukasz Kubot       138.0         NaN
138073    Nicolas Mahut        57.0         NaN
145511        Dudi Sela        75.0         NaN
114877    Albert Portas        46.0         NaN
122216  Sargis Sargsian        54.0         NaN
165795      Guido Pella        72.0         NaN
                winner_name  winner_rank  winner_seed
161321         Gael Monfils         16.0         13.0
161956       Kevin Anderson         24.0          1.0
103843         Marcelo Rios          7.0          1.0
167938         Milos Raonic         32.0         13.0
123421        Stefan Koubek         53.0          NaN
116723      Guillermo Canas         21.0          NaN
142318        Ivan

In [12]:
# drop these columns
matches = matches.drop(["draw_size","loser_entry", "winner_entry", "loser_seed", "winner_seed"], axis=1)

In [13]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59764 entries, 103342 to 169689
Data columns (total 44 columns):
best_of               59764 non-null int64
l_1stIn               59164 non-null float64
l_1stWon              59164 non-null float64
l_2ndWon              59164 non-null float64
l_SvGms               59164 non-null float64
l_ace                 59164 non-null float64
l_bpFaced             59164 non-null float64
l_bpSaved             59164 non-null float64
l_df                  59164 non-null float64
l_svpt                59164 non-null float64
loser_age             59764 non-null float64
loser_hand            59764 non-null object
loser_ht              55821 non-null float64
loser_id              59764 non-null int64
loser_ioc             59764 non-null object
loser_name            59764 non-null object
loser_rank            59608 non-null float64
loser_rank_points     59608 non-null float64
match_num             59764 non-null int64
minutes               57858 non-null fl

## Missing Match Stats

* looks like when a match is missing one stat, it is missing all the others as well (~600 matches)

In [21]:
matches[matches.l_1stIn.isnull()].sample(10).T

,125986,130967,119175,105365,109851,164460,109450,139550,113057,156874
best_of,5,3,3,3,5,3,3,3,3,5
l_1stIn,,,,,,,,,,
l_1stWon,,,,,,,,,,
l_2ndWon,,,,,,,,,,
l_SvGms,,,,,,,,,,
l_ace,,,,,,,,,,
l_bpFaced,,,,,,,,,,
l_bpSaved,,,,,,,,,,
l_df,,,,,,,,,,
l_svpt,,,,,,,,,,
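
The "all or nothing" pattern can be verified for every row rather than a sample of 10. A minimal sketch on made-up rows follows; `stat_cols` lists only a few of the stat columns for brevity.

```python
import numpy as np
import pandas as pd

stat_cols = ["l_1stIn", "l_1stWon", "l_ace", "l_df"]

# Toy rows: two matches with full stats, two with none.
toy = pd.DataFrame({
    "l_1stIn": [40.0, np.nan, 35.0, np.nan],
    "l_1stWon": [30.0, np.nan, 25.0, np.nan],
    "l_ace": [3.0, np.nan, 7.0, np.nan],
    "l_df": [2.0, np.nan, 1.0, np.nan],
})

# Number of missing stat columns in each row.
n_missing_per_row = toy[stat_cols].isnull().sum(axis=1)

# True when every row has either no missing stats or all stats missing.
all_or_nothing = n_missing_per_row.isin([0, len(stat_cols)]).all()
```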


## Let's see if the tournament has anything to do with the missing stats

To see if there are any tournaments with no stats at all, we add up the stats columns per tournament - if the sum is 0, then there were no stats at all

Looks like there are only 4 tournaments where we are completely missing stats

In [27]:
match_stats = matches.groupby("tourney_id").sum()
match_stats[match_stats.l_1stIn == 0]

tourney_id,best_of,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_ace,l_bpFaced,l_bpSaved,l_df,l_svpt,...,w_ace,w_bpFaced,w_bpSaved,w_df,w_svpt,winner_age,winner_ht,winner_id,winner_rank,winner_rank_points
1998-604,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,275.1,2000.0,1125372,94.0,29235.0
1999-604,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,270.22,2103.0,1128150,131.0,22674.0
2000-96,194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1542.69,11911.0,6582540,3380.0,74757.0
2004-96,194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1569.23,11922.0,6625770,2352.0,82124.0
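
The sum-based check works because `sum()` skips NaN, so a group with only missing values adds up to 0. A more direct alternative is to ask whether every value in the group is null. This is a sketch on made-up rows, reusing two tournament ids from the outputs in this notebook purely as labels.

```python
import numpy as np
import pandas as pd

# Toy rows: one tournament with no stats at all, one with partial stats.
toy = pd.DataFrame({
    "tourney_id": ["1998-604", "1998-604", "2002-741", "2002-741"],
    "l_1stIn": [np.nan, np.nan, 40.0, np.nan],
})

# True for tournaments where l_1stIn is null in every match.
no_stats = toy.l_1stIn.isnull().groupby(toy.tourney_id).all()

# Ids of tournaments that are completely missing stats.
tids = no_stats[no_stats].index.tolist()
```

This avoids treating a genuine total of 0 as missing, although for `l_1stIn` that case is unlikely in practice.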


There are 150 matches in the 4 tournaments that are missing stats, which means there are 450 more from other tournaments

In [25]:
tids = match_stats[match_stats.l_1stIn == 0].index.tolist()
print(len(tids))
len(matches[matches.tourney_id.isin(tids)])

4


150

In [51]:
missing_matches = matches[(matches.l_1stIn.isnull()) & (~matches.tourney_id.isin(tids))]
# number of tournaments that have matches with missing stats
len(missing_matches.tourney_id.unique())

344

Most tournaments are only missing stats for a few matches

In [52]:
missing_matches.tourney_id.value_counts()

2002-741     10
1999-580      9
1999-328      8
2002-410      5
2000-441      4
             ..
2007-423      1
2012-352      1
2010-422      1
1998-326      1
2013-5014     1
Name: tourney_id, Length: 344, dtype: int64

In [56]:
missing_matches["round"].value_counts()

R16     160
R32     111
QF       75
R64      39
SF       32
F        13
RR       13
R128      5
BR        2
Name: round, dtype: int64
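
Raw counts per round are hard to compare, because early rounds have many more matches than finals. A missing rate per round controls for that. The sketch below uses made-up rows; on the real data `toy` would be `matches`.

```python
import numpy as np
import pandas as pd

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "round": ["R32", "R32", "R32", "R32", "R16", "R16", "F", "F"],
    "l_1stIn": [40.0, np.nan, 35.0, 50.0, np.nan, 42.0, 60.0, 55.0],
})

# Share of matches with missing stats in each round, highest first.
missing_rate = (
    toy.l_1stIn.isnull()
    .groupby(toy["round"])
    .mean()
    .sort_values(ascending=False)
)
```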

## Other Missing Data

* missing height - let's impute this with the average for that column
* minutes (match length) - let's impute this with average as well
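
Mean imputation for height could look like the following. This is a minimal sketch on made-up heights, not the notebook's final approach.

```python
import numpy as np
import pandas as pd

# Toy heights in cm; values are invented for illustration.
toy = pd.DataFrame({"loser_ht": [185.0, np.nan, 191.0, 188.0]})

# mean() skips NaN, so this is the average of the known heights.
mean_ht = toy.loser_ht.mean()

# Replace missing heights with that average.
toy["loser_ht"] = toy.loser_ht.fillna(mean_ht)
```

Filling with a single value shrinks the variance of the column, which is worth keeping in mind if height is later used as a feature.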

## Let's take a closer look at missing data for minutes

In [None]:
# Let's impute this per best_of rather than with the overall average, since some matches are best of 3 and some are best of 5
# matches["minutes"] = matches.minutes.fillna(matches.minutes.mean())
matches[matches.minutes.isnull()]["best_of"].plot(kind='hist')

In [None]:
# looks like there are quite a lot more matches that are best of 3 than best of 5
# let's impute missing values with the average match time for the same best_of (best of 3 here)

# select the minutes column before taking the mean so that a single number is assigned
matches.loc[(matches.minutes.isnull()) & (matches.best_of == 3), "minutes"] = matches[(matches.minutes.notnull()) & (matches.best_of == 3)].minutes.mean()
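
Both `best_of` values can be handled in one step with a group-wise mean. This is a sketch on made-up rows; on the real data `toy` would be `matches`.

```python
import numpy as np
import pandas as pd

# Toy rows; durations are invented for illustration.
toy = pd.DataFrame({
    "best_of": [3, 3, 3, 5, 5, 5],
    "minutes": [90.0, np.nan, 110.0, 180.0, 200.0, np.nan],
})

# Mean duration of each best_of group, aligned to the original rows.
group_mean = toy.groupby("best_of")["minutes"].transform("mean")

# Fill each missing duration with the mean of its own best_of group.
toy["minutes"] = toy.minutes.fillna(group_mean)
```

`transform("mean")` returns a Series with the same index as `toy`, so `fillna` can align it row by row without a loop over groups.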

In [None]:
matches[matches.minutes.isnull()]["best_of"].plot(kind='hist')

In [None]:
matches[matches.isnull().any(axis=1)].sample(5).T

In [None]:
matches.tourney_level.unique()

In [None]:
matches.iloc[:5][["loser_name", "winner_name"]]

In [None]:
matches.iloc[:5][["loser_ht"]]

In [None]:
matches.score.iloc[:5]

In [None]:
sorted(matches.tourney_name.unique())

In [None]:
len(matches.tourney_name.unique())

In [None]:
matches.tourney_id.unique()

In [None]:
matches.surface.unique()

In [None]:
matches['tourney_id_no_year'] = matches.tourney_id.apply(lambda x: x.split("-")[1])
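
The same split can be written with pandas string methods instead of `apply`. This is a sketch, assuming every id has the form `<year>-<code>` as in the outputs shown earlier.

```python
import pandas as pd

# Toy ids copied from the value_counts output earlier in this notebook.
toy = pd.DataFrame({"tourney_id": ["1998-604", "2002-741", "2013-5014"]})

# Split once on the first hyphen and keep the part after the year.
toy["tourney_id_no_year"] = toy.tourney_id.str.split("-", n=1).str[1]
```

With `n=1` everything after the first hyphen is kept, which differs from `x.split("-")[1]` only if the code itself contains a hyphen.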

In [None]:
len(matches.tourney_id_no_year.unique())

In [None]:
matches.tourney_date.max()

In [None]:
len(matches[(matches.tourney_date > datetime.datetime(2017, 12, 31)) & (matches.tourney_date < datetime.datetime(2019,1,1))].tourney_name.unique())
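
The count of distinct tournaments in 2018 can also be expressed with the `.dt` accessor and `nunique()`. This is a sketch on made-up rows; the tournament names are placeholders.

```python
import pandas as pd

# Toy rows; names and dates are invented for illustration.
toy = pd.DataFrame({
    "tourney_name": ["Brisbane", "Brisbane", "Doha", "Sydney", "Doha"],
    "tourney_date": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-01-01", "2019-01-07", "2017-01-02",
    ]),
})

# Number of distinct tournament names among matches dated in 2018.
n_2018 = toy.loc[toy.tourney_date.dt.year == 2018, "tourney_name"].nunique()
```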

In [None]:
# got the following warning when reading in data frame 
# //anaconda3/envs/sb/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3057: DtypeWarning: Columns (11,12) have mixed types. Specify dtype option on import or set low_memory=False.
#  interactivity=interactivity, compiler=compiler, result=result)
# these are actually the WRank and LRank columns - going to set them explicitly to object
# for players that are not ranked yet, you get NR in these columns
import numpy as np
data = pd.read_csv('atp-tour-20002016/Data.csv', parse_dates=["Date"], dtype={'WRank': object, 'LRank': object}, encoding='ISO-8859-1')
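
Once the rank columns are read as strings, the `NR` markers can be turned into missing values so the ranks become numeric. This is a minimal sketch on an in-memory CSV with made-up rows; only the `WRank` and `LRank` column names come from the file above.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for Data.csv; rows are invented.
csv_text = "Winner,Loser,WRank,LRank\nA,B,5,NR\nC,D,NR,40\nE,F,12,7\n"
toy = pd.read_csv(
    io.StringIO(csv_text), dtype={"WRank": object, "LRank": object}
)

# Convert ranks to numbers; anything unparseable (such as 'NR') becomes NaN.
for col in ["WRank", "LRank"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")
```

An alternative would be to pass `na_values=["NR"]` to `pd.read_csv` directly, which treats `NR` as missing in every column at load time.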


In [None]:
data.info()

In [None]:
data.sample(5).T

In [None]:
data.Series.unique()

In [None]:
data[(data.WRank == 'NR') | (data.LRank == 'NR')][["ATP", "Location", "Tournament", "WRank", "LRank"]]

In [None]:
# columns = data.columns
# for col in columns:
#     print(f'{col}: {data.columns.get_loc(col)}')

In [None]:
data[data.Date > datetime.datetime(1998,1,1)].Tournament.unique()

In [None]:
data.Loser.unique()