# Predicting Tennis Match Results: Data Wrangling

## Notebook by Tanaka Chakanyuka

## License

This dataset was exported from Kaggle, which used Jeff Sackmann's GitHub Contribution under CC attributions, non-commercial, and share alike license.

## Dataset Description

The dataset comes from Kaggle: \
https://www.kaggle.com/pablodroca/atp-tennis-matches-20002019?select=atp_matches_2000.csv

The dataset includes 21 individual csv files: \
1 csv file for matches in each year from 2000-2019 \
1 csv file for players

The matches csv files have the following columns: \
**tourney_name**: a unique identifier for each tournament \
**tourney_date**: date of the tournament in YYYYMMDD format \
**surface**: court surface (hard, clay, etc.) \
**winner_id**: player_id for winner of the match \
**loser_id**: player_id for loser of the match \
**score**: match score \
**best_of**: '3' or '5', indicating the maximum number of sets for the match \
**round**: round of tournament \
**minutes**: match time in minutes \
**w_ace, l_ace**: number of aces for winner, loser \
**w_df, l_df**: number of double faults for winner, loser \
**w_svpt, l_svpt**: number of serve points for winner, loser \
**w_1stIn, l_1stIn**: number of first serves made for winner, loser \
**w_1stWon, l_1stWon**: number of first-serve points won for winner, loser \
**w_2ndWon, l_2ndWon**: number of second-serve points won for winner, loser \
**w_SvGms, l_SvGms**: number of serve games for winner, loser \
**w_bpSaved, l_bpSaved**: number of break points saved for winner, loser \
**w_bpFaced, l_bpFaced**: number of break points faced for winner, loser \
**winner_rank, loser_rank**: winner's and loser's rank, as of the tourney_date \
**winner_rank_points, loser_rank_point**: winner's and loser's number of ranking points, as of the tourney date \

The players csv file has the following columns: \
**player_id**: a unique identifier for each player \
**name_first**: first name of player \
**name_last**: last name of player \
**hand**: 'R', 'L', 'A', or 'U' for right-handed, left-handed, ambidextrous, or unsure \
**birthdate**: birthdate of player \
**country**: player's country 

## Imports and Reading Files

In [14]:
# First, import modules and packages
import pandas as pd
import numpy as np
import glob
from datetime import date
import matplotlib.pyplot as plt

In [15]:
# Get csv file names
file_names = glob.glob('../raw_data/atp_matches*.csv')

In [22]:
# Read in atp_matches files and create DataFrame with all data
import glob
import os
import pandas as pd   
filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df_matches = pd.concat(map(pd.read_csv, filepaths))

In [23]:
df_matches.head()

Unnamed: 0,tourney_id,tourney_name,tourney_date,surface,winner_id,loser_id,score,best_of,round,minutes,...,bpSaved,SvGms,2ndWon,1stWon,1stIn,svpt,df,ace,outcome,hand_R
0,2000-339,Adelaide,20000103.0,Hard,102358.0,103096.0,6-3 6-4,3.0,R32,76.0,...,,,,,,,,,,
1,2000-339,Adelaide,20000103.0,Hard,103819.0,102533.0,6-1 6-4,3.0,R32,45.0,...,,,,,,,,,,
2,2000-339,Adelaide,20000103.0,Hard,102998.0,101885.0,3-6 7-6(5) 6-4,3.0,R32,115.0,...,,,,,,,,,,
3,2000-339,Adelaide,20000103.0,Hard,103206.0,102776.0,6-2 6-1,3.0,R32,65.0,...,,,,,,,,,,
4,2000-339,Adelaide,20000103.0,Hard,102796.0,102401.0,6-4 6-4,3.0,R32,68.0,...,,,,,,,,,,


In [24]:
# See summary of the data
df_matches.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 285957 entries, 0 to 86229
Data columns (total 54 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   tourney_id          59430 non-null   object 
 1   tourney_name        59430 non-null   object 
 2   tourney_date        59430 non-null   float64
 3   surface             59312 non-null   object 
 4   winner_id           59430 non-null   float64
 5   loser_id            59430 non-null   float64
 6   score               59429 non-null   object 
 7   best_of             59430 non-null   float64
 8   round               59430 non-null   object 
 9   minutes             52431 non-null   float64
 10  w_ace               53749 non-null   float64
 11  w_df                53749 non-null   float64
 12  w_svpt              53749 non-null   float64
 13  w_1stIn             53749 non-null   float64
 14  w_1stWon            53749 non-null   float64
 15  w_2ndWon            53749 non-null 

The df_matches dataframe has 59430 entries and 32 columns.

## Missing Values

In [25]:
# Count the number of missing values in each column and sort them
missing = pd.concat([df_matches.isnull().sum(), 100 * df_matches.isnull().mean()], axis=1)
missing.columns = ['count','%']
missing.sort_values(by='count')

Unnamed: 0,count,%
hand_R,113497,39.690233
ace,113497,39.690233
df,113497,39.690233
svpt,113497,39.690233
1stIn,113497,39.690233
1stWon,113497,39.690233
2ndWon,113497,39.690233
SvGms,113497,39.690233
bpSaved,113497,39.690233
bpFaced,113497,39.690233


For most of the stats columns, there are 5681 rows with missing data. Let's investigate.

In [26]:
print(df_matches[df_matches['w_1stWon'].isnull()])

      tourney_id                 tourney_name  tourney_date surface  \
282    2000-D066  Davis Cup G1 QF: CHN vs UZB    20000128.0    Hard   
283    2000-D066  Davis Cup G1 QF: CHN vs UZB    20000128.0    Hard   
284    2000-D074  Davis Cup G2 QF: HKG vs PAK    20000128.0    Hard   
285    2000-D074  Davis Cup G2 QF: HKG vs PAK    20000128.0    Hard   
286    2000-D074  Davis Cup G2 QF: HKG vs PAK    20000128.0    Hard   
...          ...                          ...           ...     ...   
86225        NaN                          NaN           NaN     NaN   
86226        NaN                          NaN           NaN     NaN   
86227        NaN                          NaN           NaN     NaN   
86228        NaN                          NaN           NaN     NaN   
86229        NaN                          NaN           NaN     NaN   

       winner_id  loser_id                      score  best_of round  minutes  \
282     102957.0  103337.0  6-4 6-7(5) 6-3 6-7(2) 6-3      5.0    

When one stat is missing, the others appear to be missing as well.

Before removing these missing values, let's drop columns with irrelevant information: tourney_id and tourney_name.
Also, drop columns that are the same regardless of winner or loser: surface, score, round, and minutes

In [27]:
df_matches.drop(columns=['tourney_id','tourney_name','surface','score', \
                        'round','minutes'], inplace=True)

In [28]:
# Drop rows with missing stats data
df_matches.dropna(subset=['w_1stWon','winner_rank','loser_rank'], inplace=True)

In [29]:
# Count the number of missing values in each column and sort them
missing = pd.concat([df_matches.isnull().sum(), 100 * df_matches.isnull().mean()], axis=1)
missing.columns = ['count','%']
missing.sort_values(by='count')

Unnamed: 0,count,%
tourney_date,0,0.0
loser_rank_points,0,0.0
loser_rank,0,0.0
winner_rank,0,0.0
l_bpFaced,0,0.0
l_bpSaved,0,0.0
l_SvGms,0,0.0
l_2ndWon,0,0.0
l_1stWon,0,0.0
l_1stIn,0,0.0


Great, no more missing data.

In [30]:
# Tennis matches can be either best of 5 sets or best of 3 sets. 
# See how many of each there are in our dataset.
df_matches['best_of'].value_counts()

3.0    43320
5.0    10152
Name: best_of, dtype: int64

In [31]:
# We need to drop either the rows with best of 5 or the rows with best of 3 because all the stats are absolute
# numbers, so they would obviously be higher for matches with 5 sets which would skew the results.
# So, drop the rows with a best of 5. Then, drop the best_of column.
df_matches = df_matches[df_matches.best_of == 3]
df_matches.drop(columns=['best_of'], inplace=True)

In [32]:
# Convert tourney_date to datetime object
df_matches['tourney_date'] = df_matches['tourney_date'].astype(str)
df_matches['tourney_date'] = pd.to_datetime(df_matches['tourney_date'], format = '%Y%m%d')

In [33]:
df_matches.head()

Unnamed: 0,tourney_date,winner_id,loser_id,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,...,bpSaved,SvGms,2ndWon,1stWon,1stIn,svpt,df,ace,outcome,hand_R
0,2000-01-03,102358.0,103096.0,6.0,0.0,66.0,29.0,23.0,23.0,10.0,...,,,,,,,,,,
1,2000-01-03,103819.0,102533.0,6.0,3.0,46.0,28.0,24.0,12.0,9.0,...,,,,,,,,,,
2,2000-01-03,102998.0,101885.0,8.0,3.0,81.0,40.0,35.0,28.0,15.0,...,,,,,,,,,,
3,2000-01-03,103206.0,102776.0,4.0,2.0,66.0,35.0,28.0,14.0,7.0,...,,,,,,,,,,
4,2000-01-03,102796.0,102401.0,6.0,2.0,52.0,32.0,26.0,12.0,10.0,...,,,,,,,,,,


## Separate matches into wins and losses

In [34]:
# Separate the matches into wins and losses
df_wins = df_matches[['tourney_date','winner_id','winner_rank_points','winner_rank','w_bpFaced','w_bpSaved',\
                      'w_SvGms','w_2ndWon','w_1stWon','w_1stIn','w_svpt','w_df','w_ace']]

df_losses = df_matches[['tourney_date','loser_id','loser_rank_points','loser_rank','l_bpFaced','l_bpSaved',\
                        'l_SvGms','l_2ndWon','l_1stWon','l_1stIn','l_svpt','l_df','l_ace']]

In [35]:
# Add a column for outcome
df_wins.loc[:,'outcome'] = 1
df_losses.loc[:,'outcome'] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_wins.loc[:,'outcome'] = 1
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_losses.loc[:,'outcome'] = 0


In [36]:
df_wins.head()

Unnamed: 0,tourney_date,winner_id,winner_rank_points,winner_rank,w_bpFaced,w_bpSaved,w_SvGms,w_2ndWon,w_1stWon,w_1stIn,w_svpt,w_df,w_ace,outcome
0,2000-01-03,102358.0,1850.0,4.0,2.0,2.0,10.0,23.0,23.0,29.0,66.0,0.0,6.0,1
1,2000-01-03,103819.0,515.0,64.0,0.0,0.0,9.0,12.0,24.0,28.0,46.0,3.0,6.0,1
2,2000-01-03,102998.0,544.0,58.0,1.0,0.0,15.0,28.0,35.0,40.0,81.0,3.0,8.0,1
3,2000-01-03,103206.0,928.0,27.0,4.0,4.0,7.0,14.0,28.0,35.0,66.0,2.0,4.0,1
4,2000-01-03,102796.0,1244.0,15.0,1.0,0.0,10.0,12.0,26.0,32.0,52.0,2.0,6.0,1


In [18]:
df_losses.head()

Unnamed: 0,tourney_date,loser_id,loser_rank_points,loser_rank,l_bpFaced,l_bpSaved,l_SvGms,l_2ndWon,l_1stWon,l_1stIn,l_svpt,l_df,l_ace,outcome
0,2018-12-31,126203,974.0,49.0,10.0,9.0,17.0,27.0,57.0,72.0,123.0,4.0,22.0,0
1,2018-12-31,105815,814.0,61.0,18.0,13.0,15.0,20.0,44.0,63.0,106.0,4.0,12.0,0
2,2018-12-31,106415,701.0,75.0,3.0,0.0,9.0,7.0,21.0,33.0,47.0,1.0,1.0,0
3,2018-12-31,200005,572.0,102.0,1.0,0.0,11.0,14.0,35.0,45.0,75.0,4.0,6.0,0
4,2018-12-31,105526,875.0,57.0,6.0,4.0,14.0,22.0,32.0,41.0,74.0,2.0,10.0,0


In [37]:
# Rename the columns so they are identical for df_wins and df_losses
df_wins = df_wins.rename(columns={'winner_id':'player_id','winner_rank_points':'rank_points','winner_rank':'rank',\
                       'w_bpFaced':'bpFaced','w_bpSaved':'bpSaved','w_SvGms':'SvGms','w_2ndWon':'2ndWon',\
                       'w_1stWon':'1stWon','w_1stIn':'1stIn','w_svpt':'svpt','w_df':'df','w_ace':'ace'})
df_losses = df_losses.rename(columns={'loser_id':'player_id','loser_rank_points':'rank_points','loser_rank':'rank',\
                       'l_bpFaced':'bpFaced','l_bpSaved':'bpSaved','l_SvGms':'SvGms','l_2ndWon':'2ndWon',\
                       'l_1stWon':'1stWon','l_1stIn':'1stIn','l_svpt':'svpt','l_df':'df','l_ace':'ace'})

## Supplement with player data

In [41]:
# First, read in the player data
df_players = pd.read_csv('atp_players.csv')
df_players.head()

Unnamed: 0,player_id,name_first,name_list,hand,birthdate,country
0,100001,Gardnar,Mulloy,R,19131122.0,USA
1,100002,Pancho,Segura,R,19210620.0,ECU
2,100003,Frank,Sedgman,R,19271002.0,AUS
3,100004,Giuseppe,Merlo,R,19271011.0,ITA
4,100005,Richard Pancho,Gonzales,R,19280509.0,USA


In [42]:
# Clean up a bit and drop irrelevant columns
df_players['name'] = df_players['name_first'] + ' ' + df_players['name_list']
df_players.drop(columns=['name_first','name_list','country'], inplace=True)

In [43]:
# Convert birthdate to datetime object
df_players['birthdate'] = df_players['birthdate'].astype(str)
df_players['birthdate'] = pd.to_datetime(df_players['birthdate'], format = '%Y%m%d')

In [44]:
df_wins = pd.merge(df_wins, df_players, how='left', on='player_id')
df_wins.head()

Unnamed: 0,tourney_date,player_id,rank_points,rank,bpFaced,bpSaved,SvGms,2ndWon,1stWon,1stIn,svpt,df,ace,outcome,hand,birthdate,name
0,2000-01-03,102358.0,1850.0,4.0,2.0,2.0,10.0,23.0,23.0,29.0,66.0,0.0,6.0,1,R,1974-03-13,Thomas Enqvist
1,2000-01-03,103819.0,515.0,64.0,0.0,0.0,9.0,12.0,24.0,28.0,46.0,3.0,6.0,1,R,1981-08-08,Roger Federer
2,2000-01-03,102998.0,544.0,58.0,1.0,0.0,15.0,28.0,35.0,40.0,81.0,3.0,8.0,1,R,1977-06-03,Jan Michael Gambill
3,2000-01-03,103206.0,928.0,27.0,4.0,4.0,7.0,14.0,28.0,35.0,66.0,2.0,4.0,1,R,1978-05-29,Sebastien Grosjean
4,2000-01-03,102796.0,1244.0,15.0,1.0,0.0,10.0,12.0,26.0,32.0,52.0,2.0,6.0,1,R,1976-05-30,Magnus Norman


In [45]:
df_losses = pd.merge(df_losses, df_players, how='left', on='player_id')
df_losses.head()

Unnamed: 0,tourney_date,player_id,rank_points,rank,bpFaced,bpSaved,SvGms,2ndWon,1stWon,1stIn,svpt,df,ace,outcome,hand,birthdate,name
0,2000-01-03,103096.0,490.0,56.0,4.0,2.0,9.0,13.0,25.0,37.0,59.0,3.0,1.0,0,R,1977-12-17,Arnaud Clement
1,2000-01-03,102533.0,404.0,91.0,3.0,0.0,8.0,12.0,13.0,15.0,42.0,5.0,3.0,0,R,1975-02-15,Jens Knippschild
2,2000-01-03,101885.0,243.0,105.0,5.0,4.0,16.0,22.0,49.0,59.0,103.0,2.0,26.0,0,L,1971-03-18,Wayne Arthurs
3,2000-01-03,102776.0,602.0,54.0,6.0,1.0,8.0,8.0,12.0,22.0,49.0,3.0,0.0,0,R,1976-04-18,Andrew Ilie
4,2000-01-03,102401.0,219.0,154.0,10.0,7.0,10.0,16.0,25.0,40.0,73.0,2.0,4.0,0,L,1974-06-05,Scott Draper


In [46]:
# Concatenate the wins and losses into a single dataframe
df_all = pd.concat([df_wins,df_losses])
df_all.head()

Unnamed: 0,tourney_date,player_id,rank_points,rank,bpFaced,bpSaved,SvGms,2ndWon,1stWon,1stIn,svpt,df,ace,outcome,hand,birthdate,name
0,2000-01-03,102358.0,1850.0,4.0,2.0,2.0,10.0,23.0,23.0,29.0,66.0,0.0,6.0,1,R,1974-03-13,Thomas Enqvist
1,2000-01-03,103819.0,515.0,64.0,0.0,0.0,9.0,12.0,24.0,28.0,46.0,3.0,6.0,1,R,1981-08-08,Roger Federer
2,2000-01-03,102998.0,544.0,58.0,1.0,0.0,15.0,28.0,35.0,40.0,81.0,3.0,8.0,1,R,1977-06-03,Jan Michael Gambill
3,2000-01-03,103206.0,928.0,27.0,4.0,4.0,7.0,14.0,28.0,35.0,66.0,2.0,4.0,1,R,1978-05-29,Sebastien Grosjean
4,2000-01-03,102796.0,1244.0,15.0,1.0,0.0,10.0,12.0,26.0,32.0,52.0,2.0,6.0,1,R,1976-05-30,Magnus Norman


## More cleaning

In [47]:
# Calculate age of player at time of tourney
df_all['age'] = df_all.tourney_date - df_all.birthdate
df_all['age'] = df_all.age / np.timedelta64(1, 'Y')
df_all['age'] = df_all['age'].apply(np.floor)

In [48]:
# Create new column with year of tourney, and then drop the tourney_date, player_id, and birthdate columns
df_all['tourney_year'] = pd.DatetimeIndex(df_all['tourney_date']).year
df_all.drop(columns=['tourney_date','player_id','birthdate'], inplace=True)

In [28]:
df_all.head()

Unnamed: 0,rank_points,rank,bpFaced,bpSaved,SvGms,2ndWon,1stWon,1stIn,svpt,df,ace,outcome,hand,name,age,tourney_year
0,810.0,63.0,4.0,4.0,17.0,26.0,58.0,73.0,117.0,4.0,8.0,1,R,Denis Kudla,26.0,2018
1,1083.0,38.0,5.0,3.0,15.0,15.0,49.0,68.0,98.0,2.0,8.0,1,R,John Millman,29.0,2018
2,1835.0,19.0,7.0,6.0,10.0,12.0,37.0,43.0,76.0,6.0,9.0,1,R,Grigor Dimitrov,27.0,2018
3,275.0,185.0,0.0,0.0,11.0,10.0,39.0,43.0,58.0,0.0,12.0,1,R,Yasutaka Uchiyama,26.0,2018
4,1050.0,40.0,3.0,2.0,15.0,21.0,40.0,52.0,87.0,4.0,15.0,1,R,Jeremy Chardy,31.0,2018


In [49]:
# Rearrange columns
df_all = df_all[['name','tourney_year','age','hand','rank_points','rank','bpFaced','bpSaved','SvGms','2ndWon',\
                 '1stWon','1stIn','svpt','df','ace','outcome']]

In [50]:
# Count the number of missing values in each column and sort them
missing = pd.concat([df_all.isnull().sum(), 100 * df_all.isnull().mean()], axis=1)
missing.columns = ['count','%']
missing.sort_values(by='count')

Unnamed: 0,count,%
name,0,0.0
tourney_year,0,0.0
rank_points,0,0.0
rank,0,0.0
bpFaced,0,0.0
bpSaved,0,0.0
SvGms,0,0.0
2ndWon,0,0.0
1stWon,0,0.0
1stIn,0,0.0


In [51]:
# Drop rows with missing stats data
df_all.dropna(subset=['age','hand'], inplace=True)

In [52]:
# What are the unique categories for 'hand'?
df_all['hand'].unique()

array(['R', 'L', 'U'], dtype=object)

In [53]:
df_all['hand'].value_counts()

R    75071
L    11159
U      401
Name: hand, dtype: int64

In [54]:
# Assuming 'U' stands for uncertain/unsure, drop these rows.
df_all = df_all[df_all.hand != 'U']

In [55]:
# Use get_dummies for the categorial variable 'hand'
df_all = pd.get_dummies(df_all, columns=['hand'])

In [56]:
df_all.head()

Unnamed: 0,name,tourney_year,age,rank_points,rank,bpFaced,bpSaved,SvGms,2ndWon,1stWon,1stIn,svpt,df,ace,outcome,hand_L,hand_R
0,Thomas Enqvist,2000,25.0,1850.0,4.0,2.0,2.0,10.0,23.0,23.0,29.0,66.0,0.0,6.0,1,0,1
1,Roger Federer,2000,18.0,515.0,64.0,0.0,0.0,9.0,12.0,24.0,28.0,46.0,3.0,6.0,1,0,1
2,Jan Michael Gambill,2000,22.0,544.0,58.0,1.0,0.0,15.0,28.0,35.0,40.0,81.0,3.0,8.0,1,0,1
3,Sebastien Grosjean,2000,21.0,928.0,27.0,4.0,4.0,7.0,14.0,28.0,35.0,66.0,2.0,4.0,1,0,1
4,Magnus Norman,2000,23.0,1244.0,15.0,1.0,0.0,10.0,12.0,26.0,32.0,52.0,2.0,6.0,1,0,1


In [57]:
# Drop hand_L to remove collinearity
df_all = df_all.drop(columns=['hand_L'])

In [58]:
# Extract numerical columns into separate dataframe to explore
df_numeric = df_all[['age','rank_points','rank','bpFaced','bpSaved','SvGms','2ndWon','1stWon','1stIn',\
                    'svpt','df','ace','hand_R','outcome']]
df_numeric.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,86230.0,25.918416,3.808094,15.0,23.0,26.0,29.0,44.0
rank_points,86230.0,1225.659886,1528.817899,1.0,510.0,793.0,1298.75,16790.0
rank,86230.0,77.518439,103.867787,1.0,26.0,54.0,92.0,2159.0
bpFaced,86230.0,6.175264,3.898824,0.0,3.0,6.0,9.0,29.0
bpSaved,86230.0,3.717511,2.917379,0.0,1.0,3.0,5.0,24.0
SvGms,86230.0,11.159364,2.948313,0.0,9.0,10.0,14.0,33.0
2ndWon,86230.0,14.443396,5.993341,0.0,10.0,14.0,18.0,60.0
1stWon,86230.0,30.665418,10.861418,0.0,23.0,29.0,37.0,101.0
1stIn,86230.0,43.081155,14.494993,0.0,32.0,41.0,52.0,135.0
svpt,86230.0,71.576655,21.793757,0.0,55.0,68.0,86.0,193.0


In [61]:
# Save the data to a new csv file

df_all.to_csv('tennis_data_all.csv', index=False)
df_numeric.to_csv('tennis_data_numeric.csv', index=False)