# ATP Initial Exploration

This is the first of two note books conducting an initial exploration of the ATP Dataset. This notebook focuses on exploring the raw data and steps to cleaning/wrangling. The second notebook will take the prepared data and create subsets for further analysis.

# Table of Contents

### 1. Importing Libraries and CSVs
### 2. Exploratory Analysis: Three Main Datasets
#### 2A. Matches
#### 2B. Players
#### 2C. Rankings
### 3. Wrangling and Cleaning Steps
#### 3A. Renaming Columns
#### 3B. Deriving New Variables
#### 3C. Creating a New Main Dataframe: "df_post1991"
 - Removing entries with no/missing/faulty match statistics
 - Removing entries prior to 2000 
  - There were no match statistics before 1991
  - B
#### 3D. Changing Data Types for Certain Variables
#### 3E. Final Cleaning: Converting Derived Percentage Variables to Integers
### 4. Export PKLs
 
## On Notebook 2: "ATP Initial Exploration Part 2"
### 1. Importing Libraries and PKL File
### 2. Creating Subsets
- df_matchstats
- df_carpet
- df_hard
- df_grass
- df_clay
- df_big3_win
- df_big3_lose
- df_big3


## 1. Importing Libraries and Datasets

In [1]:
#Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import scipy
import matplotlib

In [2]:
#Set Path
path = r'/Users/tristansavella/Desktop/Important Things/Data Analytics/CareerFoundry/Data Immersion/Achievement 6/Master Folder ATP/02 Data'

In [3]:
#Import Datasets
df_matches = pd.read_csv(os.path.join(path, 'Original Data', 'matches.csv'), index_col = False)
df_players = pd.read_csv(os.path.join(path, 'Original Data', 'players.csv'), index_col = False)
df_rankings = pd.read_csv(os.path.join(path, 'Original Data', 'rankings.csv'), index_col = False)

In [4]:
#Show all columns
pd.set_option('display.max_columns', None)

In [5]:
#Show all rows
pd.set_option('display.max_rows', None)

## 2. Initial Exploration

### 2A. Matches

In [6]:
df_matches.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,1968-2029,Dublin,Grass,32,A,19680708,270,112411,,,Douglas Smith,U,,AUS,,110196,,,Peter Ledbetter,U,,UNK,24.0,6-1 7-5,3,R32,,,,,,,,,,,,,,,,,,,,,,,
1,1968-2029,Dublin,Grass,32,A,19680708,271,126914,,,Louis Pretorius,R,,RSA,,209536,,,Maurice Pollock,U,,IRL,,6-1 6-1,3,R32,,,,,,,,,,,,,,,,,,,,,,,
2,1968-2029,Dublin,Grass,32,A,19680708,272,209523,,,Cecil Pedlow,U,,IRL,,209535,,,John Mulvey,U,,IRL,,6-2 6-2,3,R32,,,,,,,,,,,,,,,,,,,,,,,
3,1968-2029,Dublin,Grass,32,A,19680708,273,100084,,,Tom Okker,R,178.0,NED,24.3,209534,,,Unknown Fearmon,U,,,,6-1 6-1,3,R32,,,,,,,,,,,,,,,,,,,,,,,
4,1968-2029,Dublin,Grass,32,A,19680708,274,100132,,,Armistead Neely,R,,USA,21.3,209533,,,Harry Sheridan,U,,IRL,,6-2 6-4,3,R32,,,,,,,,,,,,,,,,,,,,,,,


In [7]:
#Shape
df_matches.shape

(188161, 49)

In [8]:
#Columns

df_matches.columns

#I will need to rename some of these variables

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'loser_id', 'loser_seed', 'loser_entry', 'loser_name', 'loser_hand',
       'loser_ht', 'loser_ioc', 'loser_age', 'score', 'best_of', 'round',
       'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon',
       'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt',
       'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced',
       'winner_rank', 'winner_rank_points', 'loser_rank', 'loser_rank_points'],
      dtype='object')

In [9]:
#Checking for missing values
df_matches.isnull().sum()

#Seed; there is no seeding for team tournaments, such as Davis Cup; also, not all players are seeded 
#the fact that more
#For match statistics: similar amount of missing data for these

tourney_id                 0
tourney_name               0
surface                 2317
draw_size                  0
tourney_level              0
tourney_date               0
match_num                  0
winner_id                  0
winner_seed           118467
winner_entry          171891
winner_name                0
winner_hand               17
winner_ht              16237
winner_ioc                10
winner_age              1335
loser_id                   0
loser_seed            152824
loser_entry           160432
loser_name                 0
loser_hand                64
loser_ht               28698
loser_ioc                 69
loser_age               4825
score                      8
best_of                    0
round                      0
minutes                98650
w_ace                  95941
w_df                   95942
w_svpt                 95942
w_1stIn                95942
w_1stWon               95942
w_2ndWon               95942
w_SvGms                95941
w_bpSaved     

In [10]:
#duplicates check

df_matches_dups = df_matches[df_matches.duplicated()]
df_matches_dups.shape

#no duplicates

(0, 49)

In [11]:
df_matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188161 entries, 0 to 188160
Data columns (total 49 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   tourney_id          188161 non-null  object 
 1   tourney_name        188161 non-null  object 
 2   surface             185844 non-null  object 
 3   draw_size           188161 non-null  int64  
 4   tourney_level       188161 non-null  object 
 5   tourney_date        188161 non-null  int64  
 6   match_num           188161 non-null  int64  
 7   winner_id           188161 non-null  int64  
 8   winner_seed         69694 non-null   float64
 9   winner_entry        16270 non-null   object 
 10  winner_name         188161 non-null  object 
 11  winner_hand         188144 non-null  object 
 12  winner_ht           171924 non-null  float64
 13  winner_ioc          188151 non-null  object 
 14  winner_age          186826 non-null  float64
 15  loser_id            188161 non-nul

In [12]:
df_matches.describe()

#the following columns should be turned into strings: 'tourney_date','match_num','winner_id','loser_id','best_of'

Unnamed: 0,draw_size,tourney_date,match_num,winner_id,winner_seed,winner_ht,winner_age,loser_id,loser_seed,loser_ht,loser_age,best_of,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
count,188161.0,188161.0,188161.0,188161.0,69694.0,171924.0,186826.0,188161.0,35337.0,159463.0,183336.0,188161.0,89511.0,92220.0,92219.0,92219.0,92219.0,92219.0,92219.0,92220.0,92219.0,92219.0,92219.0,92220.0,92219.0,92219.0,92219.0,92219.0,92220.0,92219.0,92219.0,153197.0,105973.0,144834.0,104354.0
mean,52.926292,19933500.0,76.618598,103820.251673,6.280225,184.449187,25.658362,104967.986995,7.667402,184.226592,25.709391,3.441218,103.497403,6.517708,2.736258,78.068142,47.581724,35.873746,16.733883,12.396758,3.527549,5.167406,4.828745,3.488224,80.908284,48.011549,31.894892,14.985263,12.197387,4.812002,8.742884,75.255716,1366.471611,112.88415,859.219896
std,36.446303,154444.5,110.714957,11470.048991,5.509548,6.667033,4.045128,14866.251405,5.995551,6.655036,4.190362,0.830009,39.365772,5.341289,2.367377,29.523162,19.216689,13.83661,6.989782,4.120292,3.08639,4.063021,4.67071,2.618833,29.458713,19.390121,14.449465,7.220377,4.129834,3.275387,4.131839,121.053512,1726.089469,162.191701,987.192154
min,2.0,19680110.0,1.0,100001.0,1.0,160.0,14.3,100001.0,1.0,160.0,14.4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-6.0,0.0,1.0,0.0,1.0,0.0
25%,32.0,19800410.0,10.0,100402.0,2.0,180.0,22.7,100502.0,4.0,180.0,22.7,3.0,75.0,3.0,1.0,56.0,34.0,26.0,12.0,9.0,1.0,2.0,2.0,2.0,59.0,34.0,22.0,10.0,9.0,2.0,6.0,17.0,489.0,37.0,361.0
50%,32.0,19930300.0,25.0,101686.0,5.0,185.0,25.2,101843.0,6.0,185.0,25.3,3.0,96.0,5.0,2.0,73.0,44.0,33.0,16.0,11.0,3.0,4.0,4.0,3.0,76.0,45.0,29.0,14.0,11.0,4.0,8.0,44.0,846.0,70.0,630.0
75%,64.0,20060720.0,80.0,103898.0,8.0,188.0,28.2,104252.0,10.0,188.0,28.3,3.0,125.0,9.0,4.0,94.0,58.0,43.0,21.0,15.0,5.0,7.0,7.0,5.0,97.0,58.0,40.0,19.0,15.0,7.0,11.0,86.0,1532.0,118.0,1013.0
max,128.0,20221130.0,1701.0,211468.0,35.0,211.0,58.7,211805.0,35.0,211.0,63.6,5.0,1146.0,113.0,26.0,491.0,361.0,292.0,82.0,90.0,24.0,34.0,103.0,26.0,489.0,328.0,284.0,101.0,91.0,28.0,38.0,2101.0,16950.0,2159.0,16950.0


### Initial Findings:

- Lots of missing statistics in many matches: figure out why data is missing here. Were these tournaments low profile? Too old?

### 2B. Players

In [13]:
df_players.shape

(58687, 8)

In [14]:
df_players.head()

Unnamed: 0,player_id,name_first,name_last,hand,dob,ioc,height,wikidata_id
0,100001,Gardnar,Mulloy,R,19131122.0,USA,185.0,Q54544
1,100002,Pancho,Segura,R,19210620.0,ECU,168.0,Q54581
2,100003,Frank,Sedgman,R,19271002.0,AUS,180.0,Q962049
3,100004,Giuseppe,Merlo,R,19271011.0,ITA,,Q1258752
4,100005,Richard,Gonzalez,R,19280509.0,USA,188.0,Q53554


In [15]:
df_players.tail()

Unnamed: 0,player_id,name_first,name_last,hand,dob,ioc,height,wikidata_id
58682,212032,Abhishek,Bastola,U,,NPL,,
58683,212033,Cameron,Muller,U,,USA,,
58684,212034,Ian,Bracks,U,,USA,,
58685,212035,Alexander,Visser,U,,USA,,
58686,212036,Abhishek,Thorat,U,,USA,,


In [16]:
df_players.columns

Index(['player_id', 'name_first', 'name_last', 'hand', 'dob', 'ioc', 'height',
       'wikidata_id'],
      dtype='object')

In [17]:
#duplicates check

df_players_dups = df_players[df_players.duplicated()]
df_players_dups.shape

#no duplicates

(0, 8)

In [18]:
df_players.isnull().sum()

player_id          0
name_first       355
name_last         41
hand             240
dob            13547
ioc              101
height         55899
wikidata_id    53793
dtype: int64

### 2C. Rankings

In [19]:
df_rankings.shape

(3113261, 4)

In [20]:
df_rankings.head()

Unnamed: 0,ranking_date,rank,player,points
0,20100104,1,103819,10550.0
1,20100104,2,104745,9205.0
2,20100104,3,104925,8310.0
3,20100104,4,104918,7030.0
4,20100104,5,105223,6785.0


In [21]:
df_rankings.columns

Index(['ranking_date', 'rank', 'player', 'points'], dtype='object')

In [22]:
#duplicates check

df_rankings_dups = df_rankings[df_rankings.duplicated()]
df_rankings_dups.shape

#how to see duplicates

(180, 4)

In [23]:
df_rankings.isnull().sum()

ranking_date         0
rank                 0
player               0
points          376080
dtype: int64

### Initial Exploration: Final Notes

- As of today (May 10th, 2024), I am planning only to use "df_matches"; I may later on decide to use "df_rankings"
- Next step: Data Cleaning and Wrangling for "df_matches"

## 3. Data Cleaning and Wrangling

## Cleaning and Wrangling Steps:


### A. Rename following columns
- w_ace --> w_#aces
- w_df --> w_#dfs
- w_svpt --> w_#servepoints
- w_1stin --> w_#1stservesin
- w_1stWon --> w_#1stWon
- w_2ndWon --> w_#2ndWon
- w_SvGms --> w_#SvGms
- w_ bpSaved --> w_#bpSaved
- w_bpFaced --> w_#bpFaced
- l_ace --> l_#aces
- l_df --> l_#dfs
- l_svpt --> l_#servepoints
- l_1stin --> l_#1stservesin
- l_1stWon --> l_#1stWon
- l_2ndWon --> l_#2ndWon
- l_SvGms --> l_#SvGms
- l_bpSaved --> l_#bpSaved
- l_bpFaced --> l_#bpFaced

### B. Create/derive following variables
###### Serve Statistics
- "w_#2ndservepoints" = "w_#servepoints" - "w_#1stservesin"
- "w_%1stservesin" = "w_#1stservesin"/"w_#servepoints"
- "w_%1stWon" = "w_#1stWon"/"w_#1stservesin"
- "w_%2ndWon" = "w_#2ndWon"/"w_#2ndservepoints"
- "l_#2ndservepoints" = "l_#servepoints" - "l_#1stservesin"
- "l_%1stservesin" = "l_#1stservesin"/"l_#servepoints"
- "l_%1stWon" = "l_#1stWon"/"l_#1stservesin"
- "l_%2ndWon" = "l_#2ndWon"/"l_#2ndservepoints"

### C. New Main Dataframe
- New Main DF: Remove entries prior to 1991 AND with missing match statistics: "df_matchstats"

### D. Changing Data Types
- Change the following variables' data types from integers to strings:





### 3A: Renaming Columns

In [24]:
#Renaming Columns

df_matches.rename(columns = 
                  {'w_ace' : 'w_#aces',
                   'w_df' : 'w_#dfs',
                   'w_svpt' : 'w_#ServePoints',
                   'w_1stIn' : 'w_#1stServesIn',
                   'w_1stWon' : 'w_#1stWon',
                   'w_2ndWon' : 'w_#2ndWon',
                   'w_SvGms' : 'w_#ServeGames',
                   'w_ bpSaved' : 'w_#bpSaved',
                   'w_bpFaced' : 'w_#bpFaced',
                    'l_ace' : 'l_#aces',
                   'l_df' : 'l_#dfs',
                   'l_svpt' : 'l_#ServePoints',
                   'l_1stIn' : 'l_#1stServesIn',
                   'l_1stWon' : 'l_#1stWon',
                   'l_2ndWon' : 'l_#2ndWon',
                   'l_SvGms' : 'l_#ServeGames',
                   'l_ bpSaved' : 'l_#bpSaved',
                   'l_bpFaced' : 'l_#bpFaced'}, inplace = True)

In [25]:
df_matches.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,w_#aces,w_#dfs,w_#ServePoints,w_#1stServesIn,w_#1stWon,w_#2ndWon,w_#ServeGames,w_bpSaved,w_#bpFaced,l_#aces,l_#dfs,l_#ServePoints,l_#1stServesIn,l_#1stWon,l_#2ndWon,l_#ServeGames,l_bpSaved,l_#bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,1968-2029,Dublin,Grass,32,A,19680708,270,112411,,,Douglas Smith,U,,AUS,,110196,,,Peter Ledbetter,U,,UNK,24.0,6-1 7-5,3,R32,,,,,,,,,,,,,,,,,,,,,,,
1,1968-2029,Dublin,Grass,32,A,19680708,271,126914,,,Louis Pretorius,R,,RSA,,209536,,,Maurice Pollock,U,,IRL,,6-1 6-1,3,R32,,,,,,,,,,,,,,,,,,,,,,,
2,1968-2029,Dublin,Grass,32,A,19680708,272,209523,,,Cecil Pedlow,U,,IRL,,209535,,,John Mulvey,U,,IRL,,6-2 6-2,3,R32,,,,,,,,,,,,,,,,,,,,,,,
3,1968-2029,Dublin,Grass,32,A,19680708,273,100084,,,Tom Okker,R,178.0,NED,24.3,209534,,,Unknown Fearmon,U,,,,6-1 6-1,3,R32,,,,,,,,,,,,,,,,,,,,,,,
4,1968-2029,Dublin,Grass,32,A,19680708,274,100132,,,Armistead Neely,R,,USA,21.3,209533,,,Harry Sheridan,U,,IRL,,6-2 6-4,3,R32,,,,,,,,,,,,,,,,,,,,,,,


### 3B: Deriving New Variables

In [26]:
#winner's 2nd serve points played

df_matches['w_#2ndServePoints'] = df_matches['w_#ServePoints'] - df_matches['w_#1stServesIn']

In [27]:
#winner's First Serve Percentage

df_matches['w_%1stServesIn'] = df_matches['w_#1stServesIn']/df_matches['w_#ServePoints']

In [28]:
#winner's Percentage of First Serve Points Won

df_matches['w_%1stWon'] = df_matches['w_#1stWon']/df_matches['w_#1stServesIn']

In [29]:
#winner's Percentage of Second Serve Points Won

df_matches['w_%2ndWon'] = df_matches['w_#2ndWon']/df_matches['w_#2ndServePoints']

In [30]:
#loser's 2nd serve points played

df_matches['l_#2ndServePoints'] = df_matches['l_#ServePoints'] - df_matches['l_#1stServesIn']

In [31]:
#loser's First Serve Percentage

df_matches['l_%1stServesIn'] = df_matches['l_#1stServesIn']/df_matches['l_#ServePoints']

In [32]:
#loser's Percentage of First Serve Points Won

df_matches['l_%1stWon'] = df_matches['l_#1stWon']/df_matches['l_#1stServesIn']

In [33]:
#loser's Percentage of Second Serve Points Won

df_matches['l_%2ndWon'] = df_matches['l_#2ndWon']/df_matches['l_#2ndServePoints']

### Year' will be created later in this notebook (in section 3D), after changing 'tourney_level' to a string type, using the first four digits of tourney date

In [34]:
df_matches.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,w_#aces,w_#dfs,w_#ServePoints,w_#1stServesIn,w_#1stWon,w_#2ndWon,w_#ServeGames,w_bpSaved,w_#bpFaced,l_#aces,l_#dfs,l_#ServePoints,l_#1stServesIn,l_#1stWon,l_#2ndWon,l_#ServeGames,l_bpSaved,l_#bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points,w_#2ndServePoints,w_%1stServesIn,w_%1stWon,w_%2ndWon,l_#2ndServePoints,l_%1stServesIn,l_%1stWon,l_%2ndWon
0,1968-2029,Dublin,Grass,32,A,19680708,270,112411,,,Douglas Smith,U,,AUS,,110196,,,Peter Ledbetter,U,,UNK,24.0,6-1 7-5,3,R32,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,1968-2029,Dublin,Grass,32,A,19680708,271,126914,,,Louis Pretorius,R,,RSA,,209536,,,Maurice Pollock,U,,IRL,,6-1 6-1,3,R32,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,1968-2029,Dublin,Grass,32,A,19680708,272,209523,,,Cecil Pedlow,U,,IRL,,209535,,,John Mulvey,U,,IRL,,6-2 6-2,3,R32,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,1968-2029,Dublin,Grass,32,A,19680708,273,100084,,,Tom Okker,R,178.0,NED,24.3,209534,,,Unknown Fearmon,U,,,,6-1 6-1,3,R32,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,1968-2029,Dublin,Grass,32,A,19680708,274,100132,,,Armistead Neely,R,,USA,21.3,209533,,,Harry Sheridan,U,,IRL,,6-2 6-4,3,R32,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


### 3C. Creating New Main Dataframe

#### New Main Dataframe
New Main DF: Remove entries prior to 2000 AND with missing match statistics: "df_matchstats"

In [39]:
#Checking for NaN in Match Stats
df_matches['w_#1stServesIn'].value_counts(dropna = False)

w_#1stServesIn
NaN      95942
35.0      2389
32.0      2371
38.0      2371
36.0      2369
37.0      2349
34.0      2302
33.0      2247
39.0      2222
40.0      2201
30.0      2189
31.0      2162
41.0      2148
43.0      2098
44.0      2065
42.0      2033
46.0      1962
45.0      1961
29.0      1940
48.0      1869
49.0      1844
28.0      1839
47.0      1791
50.0      1735
51.0      1698
53.0      1632
27.0      1622
52.0      1575
54.0      1467
55.0      1465
26.0      1440
56.0      1406
57.0      1380
58.0      1297
25.0      1290
59.0      1251
60.0      1210
61.0      1174
62.0      1148
24.0      1017
63.0       997
64.0       939
66.0       935
65.0       935
23.0       913
67.0       809
69.0       787
68.0       764
22.0       753
70.0       666
71.0       653
73.0       608
72.0       605
21.0       585
74.0       543
76.0       488
75.0       481
20.0       462
77.0       431
78.0       405
80.0       350
79.0       344
19.0       341
81.0       293
82.0       286
83.0      

In [41]:
#New Main Dataframe

df_post2000 = df_matches[df_matches['w_#1stServesIn'].notna()]

In [43]:
#Checking head for new main dataframe

df_post2000.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,w_#aces,w_#dfs,w_#ServePoints,w_#1stServesIn,w_#1stWon,w_#2ndWon,w_#ServeGames,w_bpSaved,w_#bpFaced,l_#aces,l_#dfs,l_#ServePoints,l_#1stServesIn,l_#1stWon,l_#2ndWon,l_#ServeGames,l_bpSaved,l_#bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points,w_#2ndServePoints,w_%1stServesIn,w_%1stWon,w_%2ndWon,l_#2ndServePoints,l_%1stServesIn,l_%1stWon,l_%2ndWon
85848,1991-339,Adelaide,Hard,32,A,19901231,1,101723,,,Magnus Larsson,R,193.0,SWE,20.7,101414,1.0,,Boris Becker,R,190.0,GER,23.1,6-4 3-6 7-6(2),3,R32,130.0,6.0,2.0,96.0,55.0,39.0,25.0,15.0,2.0,4.0,8.0,3.0,95.0,62.0,44.0,23.0,16.0,6.0,8.0,56.0,,2.0,,41.0,0.572917,0.709091,0.609756,33.0,0.652632,0.709677,0.69697
85849,1991-339,Adelaide,Hard,32,A,19901231,2,100946,,Q,Slobodan Zivojinovic,R,198.0,YUG,27.4,101256,,,Mark Kratzmann,L,178.0,AUS,24.6,6-3 3-6 7-6(6),3,R32,119.0,19.0,4.0,101.0,56.0,45.0,25.0,15.0,9.0,10.0,8.0,2.0,84.0,41.0,35.0,27.0,15.0,1.0,2.0,304.0,,75.0,,45.0,0.554455,0.803571,0.555556,43.0,0.488095,0.853659,0.627907
85850,1991-339,Adelaide,Hard,32,A,19901231,3,101234,,,Patrik Kuhnen,R,190.0,GER,24.8,101421,,,Veli Paloheimo,R,183.0,FIN,23.0,6-0 6-4,3,R32,71.0,6.0,1.0,54.0,31.0,24.0,13.0,8.0,1.0,1.0,2.0,2.0,60.0,37.0,22.0,6.0,8.0,4.0,8.0,82.0,,69.0,,23.0,0.574074,0.774194,0.565217,23.0,0.616667,0.594595,0.26087
85851,1991-339,Adelaide,Hard,32,A,19901231,4,101889,8.0,,Todd Woodbridge,R,178.0,AUS,19.7,101703,,,Guillaume Raoux,R,180.0,FRA,20.8,7-6(2) 6-1,3,R32,85.0,2.0,0.0,60.0,40.0,30.0,14.0,9.0,3.0,3.0,3.0,3.0,74.0,45.0,30.0,11.0,10.0,5.0,8.0,50.0,,84.0,,20.0,0.666667,0.75,0.7,29.0,0.608108,0.666667,0.37931
85852,1991-339,Adelaide,Hard,32,A,19901231,5,101274,,,Udo Riglewski,R,185.0,GER,24.4,101843,4.0,,Sergi Bruguera,R,188.0,ESP,19.9,7-5 6-3,3,R32,90.0,4.0,2.0,72.0,40.0,33.0,14.0,10.0,7.0,8.0,2.0,2.0,77.0,41.0,28.0,15.0,11.0,4.0,8.0,88.0,,28.0,,32.0,0.555556,0.825,0.4375,36.0,0.532468,0.682927,0.416667


In [44]:
df_post2000.shape

(92219, 57)

In [45]:
#Columns

df_post2000.columns

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'loser_id', 'loser_seed', 'loser_entry', 'loser_name', 'loser_hand',
       'loser_ht', 'loser_ioc', 'loser_age', 'score', 'best_of', 'round',
       'minutes', 'w_#aces', 'w_#dfs', 'w_#ServePoints', 'w_#1stServesIn',
       'w_#1stWon', 'w_#2ndWon', 'w_#ServeGames', 'w_bpSaved', 'w_#bpFaced',
       'l_#aces', 'l_#dfs', 'l_#ServePoints', 'l_#1stServesIn', 'l_#1stWon',
       'l_#2ndWon', 'l_#ServeGames', 'l_bpSaved', 'l_#bpFaced', 'winner_rank',
       'winner_rank_points', 'loser_rank', 'loser_rank_points',
       'w_#2ndServePoints', 'w_%1stServesIn', 'w_%1stWon', 'w_%2ndWon',
       'l_#2ndServePoints', 'l_%1stServesIn', 'l_%1stWon', 'l_%2ndWon'],
      dtype='object')

In [47]:
#Checking for missing values
df_post2000.isnull().sum()

## Only Important to remove the missing values for missing match statistics but figure out why they're missing?

tourney_id                0
tourney_name              0
surface                   0
draw_size                 0
tourney_level             0
tourney_date              0
match_num                 0
winner_id                 0
winner_seed           51523
winner_entry          79965
winner_name               0
winner_hand               5
winner_ht               234
winner_ioc                0
winner_age                0
loser_id                  0
loser_seed            69989
loser_entry           71975
loser_name                0
loser_hand               24
loser_ht               1031
loser_ioc                 0
loser_age                20
score                     0
best_of                   0
round                     0
minutes                2764
w_#aces                   0
w_#dfs                    0
w_#ServePoints            0
w_#1stServesIn            0
w_#1stWon                 0
w_#2ndWon                 0
w_#ServeGames             0
w_bpSaved                 0
w_#bpFaced          

In [48]:
#Checking for NaN in 'w_%1stServesIn'

df_post2000_nan_Percent1stServesIn = df_post2000[df_post2000['w_%1stServesIn'].isnull()== True]

In [49]:
df_post2000_nan_Percent1stServesIn

#This data should all be erased; the data is faulty as it makes no sense, 
#For example, no complete match can be played when the winner has 0 points played on serve

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,w_#aces,w_#dfs,w_#ServePoints,w_#1stServesIn,w_#1stWon,w_#2ndWon,w_#ServeGames,w_bpSaved,w_#bpFaced,l_#aces,l_#dfs,l_#ServePoints,l_#1stServesIn,l_#1stWon,l_#2ndWon,l_#ServeGames,l_bpSaved,l_#bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points,w_#2ndServePoints,w_%1stServesIn,w_%1stWon,w_%2ndWon,l_#2ndServePoints,l_%1stServesIn,l_%1stWon,l_%2ndWon
115295,1998-773,Bucharest,Clay,32,A,19980914,4,103031,8.0,,Julian Alonso,R,185.0,ESP,21.1,102720,,,Tomas Zib,R,178.0,CZE,22.6,1-0 RET,3,R32,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,2.0,0.0,0.0,1.0,0.0,1.0,66.0,673.0,151.0,302.0,0.0,,,,2.0,0.5,0.0,0.0
120022,2000-357,Stuttgart Masters,Hard,48,M,20001030,37,101965,,,Wayne Ferreira,R,185.0,RSA,29.1,102882,12.0,,Mark Philippoussis,R,196.0,AUS,23.9,1-0 RET,3,R16,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,5.0,1.0,0.0,1.0,0.0,1.0,19.0,1234.0,13.0,1612.0,0.0,,,,0.0,1.0,0.2,
144264,2007-520,Roland Garros,Clay,128,G,20070528,8,104022,13.0,,Mikhail Youzhny,R,183.0,RUS,24.9,103401,,,Jan Hernych,R,190.0,CZE,27.8,RET,5,R128,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,9.0,5.0,3.0,1.0,0.0,1.0,1.0,15.0,1685.0,92.0,454.0,0.0,,,,4.0,0.555556,0.6,0.25
165479,2014-316,Bastad,Clay,28,A,20140707,6,105311,5.0,,Joao Sousa,R,185.0,POR,25.2,105137,,,Pere Riba,R,183.0,ESP,26.2,0-1 RET,3,R32,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,12.0,7.0,7.0,0.0,1.0,2.0,2.0,40.0,1020.0,93.0,592.0,0.0,,,,5.0,0.583333,1.0,0.0
172750,2017-M-DC-2017-G2-EPA-M-FIN-GEO-01,Davis Cup G2 R1: FIN vs GEO,Carpet,4,D,20170203,5,106106,,,George Tsivadze,U,191.0,GEO,24.0,126609,,,Patrik Niklas Salminen,L,188.0,FIN,19.9,6-4 7-5 5-7 0-6 7-6(5),5,RR,,0.0,0.0,0.0,0.0,0.0,0.0,26.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,26.0,0.0,0.0,819.0,21.0,745.0,28.0,0.0,,,,0.0,1.0,0.0,
175047,2017-0329,Tokyo,Hard,32,A,20171002,291,105216,,,Yuichi Sugita,R,173.0,JPN,29.0,105683,3.0,,Milos Raonic,R,196.0,CAN,26.7,1-0 RET,3,R16,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,5.0,3.0,0.0,1.0,2.0,3.0,40.0,1127.0,12.0,2825.0,0.0,,,,3.0,0.625,0.6,0.0


In [50]:
#Checking for NaN in 'w_%2ndWon'

df_post2000_nan_Percent2ndWon = df_post2000[df_post2000['w_%2ndWon'].isnull()== True]

In [51]:
df_post2000_nan_Percent2ndWon

# This data is likely faulty and should be deleted from the dataset
# This data suggests that the winner of these matches did not miss a single serve in an entire match, which is highly unlikely.
# In all of these cases, the loser also played 8 or fewer second serves, which is also highly unlikely

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,w_#aces,w_#dfs,w_#ServePoints,w_#1stServesIn,w_#1stWon,w_#2ndWon,w_#ServeGames,w_bpSaved,w_#bpFaced,l_#aces,l_#dfs,l_#ServePoints,l_#1stServesIn,l_#1stWon,l_#2ndWon,l_#ServeGames,l_bpSaved,l_#bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points,w_#2ndServePoints,w_%1stServesIn,w_%1stWon,w_%2ndWon,l_#2ndServePoints,l_%1stServesIn,l_%1stWon,l_%2ndWon
98748,1994-308,Munich,Clay,32,A,19940425,10,102234,,WC,Oliver Gross,R,185.0,GER,20.8,101900,,,Nicklas Kulti,R,190.0,SWE,23.0,6-4 3-6 6-3,3,R32,113.0,0.0,0.0,84.0,84.0,55.0,0.0,14.0,4.0,6.0,0.0,0.0,82.0,82.0,51.0,0.0,14.0,0.0,3.0,169.0,222.0,72.0,586.0,0.0,1.0,0.654762,,0.0,1.0,0.621951,
101818,1995-807,Mexico City,Clay,32,A,19950227,14,103060,,WC,Alejandro Hernandez,R,180.0,MEX,17.4,102024,,,Hernan Gumy,R,188.0,ARG,22.9,5-7 6-4 6-4,3,R32,156.0,3.0,0.0,108.0,108.0,61.0,0.0,16.0,11.0,16.0,5.0,0.0,113.0,113.0,63.0,0.0,16.0,9.0,15.0,354.0,85.0,121.0,382.0,0.0,1.0,0.564815,,0.0,1.0,0.557522,
105967,1996-568,St. Petersburg,Carpet,32,A,19960318,28,101320,7.0,,Magnus Gustafsson,R,185.0,SWE,29.2,101888,2.0,,Daniel Vacek,R,190.0,CZE,24.9,2-1 RET,3,QF,11.0,1.0,0.0,4.0,4.0,4.0,0.0,2.0,0.0,0.0,1.0,3.0,13.0,7.0,5.0,1.0,2.0,1.0,2.0,70.0,673.0,38.0,959.0,0.0,1.0,1.0,,6.0,0.538462,0.714286,0.166667
113836,1998-414,Hamburg Masters,Clay,56,M,19980504,53,102610,,,Albert Costa,R,180.0,ESP,22.8,102344,6.0,,Karol Kucera,R,188.0,SVK,24.1,3-0 RET,3,SF,16.0,1.0,0.0,11.0,11.0,8.0,0.0,2.0,3.0,3.0,1.0,2.0,12.0,4.0,1.0,2.0,2.0,1.0,3.0,26.0,1308.0,10.0,2141.0,0.0,1.0,0.727273,,8.0,0.333333,0.25,0.25
115295,1998-773,Bucharest,Clay,32,A,19980914,4,103031,8.0,,Julian Alonso,R,185.0,ESP,21.1,102720,,,Tomas Zib,R,178.0,CZE,22.6,1-0 RET,3,R32,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,2.0,0.0,0.0,1.0,0.0,1.0,66.0,673.0,151.0,302.0,0.0,,,,2.0,0.5,0.0,0.0
120022,2000-357,Stuttgart Masters,Hard,48,M,20001030,37,101965,,,Wayne Ferreira,R,185.0,RSA,29.1,102882,12.0,,Mark Philippoussis,R,196.0,AUS,23.9,1-0 RET,3,R16,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,5.0,1.0,0.0,1.0,0.0,1.0,19.0,1234.0,13.0,1612.0,0.0,,,,0.0,1.0,0.2,
129662,2003-325,Palermo,Clay,32,A,20030922,4,103656,,,Albert Montanes,R,175.0,ESP,22.8,103105,7.0,,Ruben Ramirez Hidalgo,R,183.0,ESP,25.7,6-2 6-4,3,R32,76.0,1.0,0.0,50.0,50.0,36.0,0.0,9.0,1.0,2.0,0.0,0.0,51.0,45.0,27.0,2.0,9.0,3.0,7.0,93.0,408.0,73.0,496.0,0.0,1.0,0.72,,6.0,0.882353,0.6,0.333333
140974,2006-499,Delray Beach,Hard,32,A,20060130,7,103594,,,Oliver Marach,R,185.0,AUT,25.5,102642,,,Cyril Saulnier,R,190.0,FRA,30.4,2-0 RET,3,R32,9.0,0.0,0.0,5.0,5.0,4.0,0.0,1.0,0.0,0.0,0.0,1.0,6.0,2.0,0.0,2.0,1.0,0.0,1.0,121.0,360.0,84.0,460.0,0.0,1.0,0.8,,4.0,0.333333,0.0,0.5
143934,2007-439,Umag,Clay,32,A,20070723,4,103812,,,Victor Hanescu,R,198.0,ROU,26.0,104386,8.0,,Janko Tipsarevic,R,180.0,SRB,23.0,2-0 RET,3,R32,4.0,1.0,0.0,4.0,4.0,4.0,0.0,1.0,0.0,0.0,0.0,1.0,5.0,1.0,1.0,0.0,1.0,1.0,2.0,257.0,149.0,48.0,711.0,0.0,1.0,1.0,,4.0,0.2,1.0,0.0
144264,2007-520,Roland Garros,Clay,128,G,20070528,8,104022,13.0,,Mikhail Youzhny,R,183.0,RUS,24.9,103401,,,Jan Hernych,R,190.0,CZE,27.8,RET,5,R128,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,9.0,5.0,3.0,1.0,0.0,1.0,1.0,15.0,1685.0,92.0,454.0,0.0,,,,4.0,0.555556,0.6,0.25


In [52]:
#Checking for NaN in 'l_%1stServesIn'

df_post2000_nan_l_Percent1stServesIn = df_post2000[df_post2000['l_%1stServesIn'].isnull()== True]

In [53]:
df_post2000_nan_l_Percent1stServesIn

#this data must also be erased; there was no serve data available for the loser of this match

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,w_#aces,w_#dfs,w_#ServePoints,w_#1stServesIn,w_#1stWon,w_#2ndWon,w_#ServeGames,w_bpSaved,w_#bpFaced,l_#aces,l_#dfs,l_#ServePoints,l_#1stServesIn,l_#1stWon,l_#2ndWon,l_#ServeGames,l_bpSaved,l_#bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points,w_#2ndServePoints,w_%1stServesIn,w_%1stWon,w_%2ndWon,l_#2ndServePoints,l_%1stServesIn,l_%1stWon,l_%2ndWon
86398,1991-402,Memphis,Hard,48,A,19910218,46,101529,7.0,,Michael Stich,R,193.0,GER,22.3,102021,5.0,,Michael Chang,R,175.0,USA,18.9,6-2 6-2,3,SF,66.0,6.0,0.0,48.0,26.0,21.0,14.0,8.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,912.0,15.0,1176.0,22.0,0.541667,0.807692,0.636364,0.0,,,
90952,1992-308,Munich,Clay,32,A,19920427,6,101549,,,Rodolphe Gilbert,L,183.0,FRA,23.3,102118,,Q,Radomir Vasek,R,188.0,CZE,19.5,6-2 6-4,3,R32,86.0,4.0,3.0,61.0,24.0,21.0,17.0,9.0,7.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,71.0,511.0,345.0,60.0,37.0,0.393443,0.875,0.459459,0.0,,,
91331,1992-520,Roland Garros,Clay,128,G,19920525,88,101529,4.0,,Michael Stich,R,193.0,GER,23.6,102201,,WC,Lionel Roux,R,185.0,FRA,19.1,6-1 6-4 6-4,5,R64,81.0,3.0,1.0,71.0,50.0,43.0,13.0,14.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2626.0,308.0,75.0,21.0,0.704225,0.86,0.619048,0.0,,,
91992,1992-319,Kitzbuhel,Clay,48,A,19920720,4,101500,,,Christian Saceanu,R,190.0,GER,24.0,102382,,WC,Julian Knowle,L,188.0,AUT,18.2,6-4 5-7 7-6(2),3,R64,180.0,1.0,13.0,130.0,83.0,52.0,17.0,17.0,12.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,150.0,239.0,648.0,13.0,47.0,0.638462,0.626506,0.361702,0.0,,,
92108,1992-475,San Marino,Clay,32,A,19920727,10,101086,,,Ronald Agenor,R,180.0,USA,27.7,102338,,WC,Yevgeny Kafelnikov,R,190.0,RUS,18.4,6-1 6-2,3,R32,64.0,0.0,1.0,59.0,41.0,28.0,11.0,8.0,6.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,118.0,326.0,488.0,31.0,18.0,0.694915,0.682927,0.611111,0.0,,,
92582,1992-323,Bordeaux,Clay,32,A,19920914,6,101016,,,Marcos Aurelio Gorriz,L,185.0,ESP,28.5,108378,,Q,Thierry Van Den Daele,R,,FRA,26.3,6-4 6-4,3,R32,90.0,1.0,3.0,60.0,39.0,24.0,15.0,10.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,102.0,415.0,,,21.0,0.65,0.615385,0.714286,0.0,,,
92624,1992-322,Cologne,Clay,32,A,19920914,17,101404,1.0,,Thomas Muster,L,180.0,AUT,24.9,102378,,Q,Juan Gisbert Jr,R,198.0,ESP,18.4,4-6 7-6(5) 7-6(5),3,R16,160.0,4.0,0.0,112.0,80.0,54.0,19.0,17.0,3.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.0,1396.0,234.0,126.0,32.0,0.714286,0.675,0.59375,0.0,,,
92789,1992-367,Athens,Clay,32,A,19921005,16,101532,2.0,,Francisco Clavet,L,183.0,ESP,23.9,108354,,Q,Talito Corrales,R,,ESP,,6-1 6-2,3,R32,85.0,2.0,3.0,65.0,36.0,24.0,15.0,8.0,5.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0,1112.0,628.0,15.0,29.0,0.553846,0.666667,0.517241,0.0,,,
93038,1992-457,Taipei,Carpet,32,A,19921019,16,102202,2.0,,Kenneth Carlsen,L,190.0,DEN,19.5,101682,,Q,David Adams,R,188.0,RSA,22.7,3-6 6-3 7-5,3,R32,150.0,11.0,4.0,100.0,52.0,39.0,24.0,15.0,9.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,75.0,503.0,522.0,24.0,48.0,0.52,0.75,0.5,0.0,,,
93288,1992-438,Moscow,Carpet,32,A,19921109,17,101120,1.0,,Karel Novacek,R,190.0,CZE,27.6,101482,,,Markus Zoecke,R,196.0,GER,24.5,6-3 6-1,3,R16,72.0,6.0,2.0,58.0,36.0,32.0,8.0,10.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.0,1089.0,111.0,371.0,22.0,0.62069,0.888889,0.363636,0.0,,,


In [54]:
#removing all data where 'l_%1stServesIn' is NaN
df_post2000 = df_post2000[df_post2000['l_%1stServesIn'].isnull()== False]

In [55]:
#Re-Checking Shape
df_post2000.shape
#30 entries removed, which is how many NaNs were in 'l_%1stServesIn'

(92189, 57)

In [56]:
#Re-checking for missing values
df_post2000.isnull().sum()
#still more to remove

tourney_id                0
tourney_name              0
surface                   0
draw_size                 0
tourney_level             0
tourney_date              0
match_num                 0
winner_id                 0
winner_seed           51505
winner_entry          79937
winner_name               0
winner_hand               5
winner_ht               234
winner_ioc                0
winner_age                0
loser_id                  0
loser_seed            69962
loser_entry           71962
loser_name                0
loser_hand               24
loser_ht               1026
loser_ioc                 0
loser_age                19
score                     0
best_of                   0
round                     0
minutes                2761
w_#aces                   0
w_#dfs                    0
w_#ServePoints            0
w_#1stServesIn            0
w_#1stWon                 0
w_#2ndWon                 0
w_#ServeGames             0
w_bpSaved                 0
w_#bpFaced          

In [59]:
#removing all data where 'l_%2ndWon' is NaN
df_post2000 = df_post2000[df_post2000['l_%2ndWon'].isnull()== False]

In [60]:
#Re-Checking Shape
df_post2000.shape
#15 entries removed, which is how many NaNs were in 'l_%1stServesIn'

(92174, 57)

In [61]:
#Re-checking for missing values
df_post2000.isnull().sum()
#still some NaN values under 'w_%2ndWon' and 'l_%1stWon'

tourney_id                0
tourney_name              0
surface                   0
draw_size                 0
tourney_level             0
tourney_date              0
match_num                 0
winner_id                 0
winner_seed           51492
winner_entry          79928
winner_name               0
winner_hand               5
winner_ht               234
winner_ioc                0
winner_age                0
loser_id                  0
loser_seed            69952
loser_entry           71950
loser_name                0
loser_hand               24
loser_ht               1025
loser_ioc                 0
loser_age                19
score                     0
best_of                   0
round                     0
minutes                2756
w_#aces                   0
w_#dfs                    0
w_#ServePoints            0
w_#1stServesIn            0
w_#1stWon                 0
w_#2ndWon                 0
w_#ServeGames             0
w_bpSaved                 0
w_#bpFaced          

In [53]:
#removing all data where 'w_%2ndWon' is NaN
df_post1991 = df_post1991[df_post1991['w_%2ndWon'].isnull()== False]

In [54]:
#Re-Checking Shape
df_post1991.shape
#12 entries removed, which is how many NaNs were in 'w_%2ndWon'

(92162, 57)

In [55]:
#Re-checking for missing values
df_post1991.isnull().sum()
#1 entry left to remove under 'l_%1stWon'

tourney_id                0
tourney_name              0
surface                   0
draw_size                 0
tourney_level             0
tourney_date              0
match_num                 0
winner_id                 0
winner_seed           51485
winner_entry          79918
winner_name               0
winner_hand               5
winner_ht               234
winner_ioc                0
winner_age                0
loser_id                  0
loser_seed            69946
loser_entry           71939
loser_name                0
loser_hand               24
loser_ht               1025
loser_ioc                 0
loser_age                19
score                     0
best_of                   0
round                     0
minutes                2755
w_#aces                   0
w_#dfs                    0
w_#ServePoints            0
w_#1stServesIn            0
w_#1stWon                 0
w_#2ndWon                 0
w_#ServeGames             0
w_bpSaved                 0
w_#bpFaced          

In [56]:
#removing all data where 'l_%1stWon' is NaN
df_post1991 = df_post1991[df_post1991['l_%1stWon'].isnull()== False]

In [57]:
#Re-checking for missing values
df_post1991.isnull().sum()

tourney_id                0
tourney_name              0
surface                   0
draw_size                 0
tourney_level             0
tourney_date              0
match_num                 0
winner_id                 0
winner_seed           51484
winner_entry          79917
winner_name               0
winner_hand               5
winner_ht               234
winner_ioc                0
winner_age                0
loser_id                  0
loser_seed            69945
loser_entry           71938
loser_name                0
loser_hand               24
loser_ht               1025
loser_ioc                 0
loser_age                19
score                     0
best_of                   0
round                     0
minutes                2755
w_#aces                   0
w_#dfs                    0
w_#ServePoints            0
w_#1stServesIn            0
w_#1stWon                 0
w_#2ndWon                 0
w_#ServeGames             0
w_bpSaved                 0
w_#bpFaced          

### 3D. Changing Data type

In [58]:
#Checking for Mixed Data Types
for col in df_post1991.columns.tolist():
  weird = (df_post1991[[col]].applymap(type) != df_post1991[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_post1991[weird]) > 0:
    print (col)
    
#The following variables have mixed data types, but will likely not be used for analysis. They should all be strings

winner_entry
winner_hand
loser_entry
loser_hand


In [59]:
#checking data type for each variable

df_post1991.info()

<class 'pandas.core.frame.DataFrame'>
Index: 92161 entries, 85848 to 188160
Data columns (total 57 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tourney_id          92161 non-null  object 
 1   tourney_name        92161 non-null  object 
 2   surface             92161 non-null  object 
 3   draw_size           92161 non-null  int64  
 4   tourney_level       92161 non-null  object 
 5   tourney_date        92161 non-null  int64  
 6   match_num           92161 non-null  int64  
 7   winner_id           92161 non-null  int64  
 8   winner_seed         40677 non-null  float64
 9   winner_entry        12244 non-null  object 
 10  winner_name         92161 non-null  object 
 11  winner_hand         92156 non-null  object 
 12  winner_ht           91927 non-null  float64
 13  winner_ioc          92161 non-null  object 
 14  winner_age          92161 non-null  float64
 15  loser_id            92161 non-null  int64  
 16  lose

#### The following variables should be changed into object:
'tourney_date', 'winner_id', 'loser_id', 'best_of'

In [60]:
#changing tourney_date

df_post1991['tourney_date'] = df_post1991['tourney_date'].astype('str')



##### Derive new column, "Year", using first four digits of "tourney_date"

In [61]:
df_post1991['Year'] = df_post1991['tourney_date']

In [62]:
df_post1991['Year'] = df_post1991['Year'].str[:4]

In [63]:
#changing winner_id

df_post1991['winner_id'] = df_post1991['winner_id'].astype('object')

In [64]:
#changing loser_id

df_post1991['loser_id'] = df_post1991['loser_id'].astype('object')

In [65]:
#changing best_of

df_post1991['best_of'] = df_post1991['best_of'].astype('object')

In [66]:
#checking data type for each variable

df_post1991.info()

<class 'pandas.core.frame.DataFrame'>
Index: 92161 entries, 85848 to 188160
Data columns (total 58 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tourney_id          92161 non-null  object 
 1   tourney_name        92161 non-null  object 
 2   surface             92161 non-null  object 
 3   draw_size           92161 non-null  int64  
 4   tourney_level       92161 non-null  object 
 5   tourney_date        92161 non-null  object 
 6   match_num           92161 non-null  int64  
 7   winner_id           92161 non-null  object 
 8   winner_seed         40677 non-null  float64
 9   winner_entry        12244 non-null  object 
 10  winner_name         92161 non-null  object 
 11  winner_hand         92156 non-null  object 
 12  winner_ht           91927 non-null  float64
 13  winner_ioc          92161 non-null  object 
 14  winner_age          92161 non-null  float64
 15  loser_id            92161 non-null  object 
 16  lose

### Convert all percentages into integers 
This will be useful later when making categorical plots

In [67]:
#Convert all Percentage Variables to Integer by multiplying by 100 (for categorical plot bins)
df_post1991['w_%1stServesIn'] = df_post1991['w_%1stServesIn'].apply(lambda x: int(x * 100))
df_post1991['w_%1stWon'] = df_post1991['w_%1stWon'].apply(lambda x: int(x * 100))
df_post1991['w_%2ndWon'] = df_post1991['w_%2ndWon'].apply(lambda x: int(x * 100))
df_post1991['l_%1stServesIn'] = df_post1991['l_%1stServesIn'].apply(lambda x: int(x * 100))
df_post1991['l_%1stWon'] = df_post1991['l_%1stWon'].apply(lambda x: int(x * 100))
df_post1991['l_%2ndWon'] = df_post1991['l_%2ndWon'].apply(lambda x: int(x * 100))

In [68]:
df_post1991.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,w_#aces,w_#dfs,w_#ServePoints,w_#1stServesIn,w_#1stWon,w_#2ndWon,w_#ServeGames,w_bpSaved,w_#bpFaced,l_#aces,l_#dfs,l_#ServePoints,l_#1stServesIn,l_#1stWon,l_#2ndWon,l_#ServeGames,l_bpSaved,l_#bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points,w_#2ndServePoints,w_%1stServesIn,w_%1stWon,w_%2ndWon,l_#2ndServePoints,l_%1stServesIn,l_%1stWon,l_%2ndWon,Year
85848,1991-339,Adelaide,Hard,32,A,19901231,1,101723,,,Magnus Larsson,R,193.0,SWE,20.7,101414,1.0,,Boris Becker,R,190.0,GER,23.1,6-4 3-6 7-6(2),3,R32,130.0,6.0,2.0,96.0,55.0,39.0,25.0,15.0,2.0,4.0,8.0,3.0,95.0,62.0,44.0,23.0,16.0,6.0,8.0,56.0,,2.0,,41.0,57,70,60,33.0,65,70,69,1990
85849,1991-339,Adelaide,Hard,32,A,19901231,2,100946,,Q,Slobodan Zivojinovic,R,198.0,YUG,27.4,101256,,,Mark Kratzmann,L,178.0,AUS,24.6,6-3 3-6 7-6(6),3,R32,119.0,19.0,4.0,101.0,56.0,45.0,25.0,15.0,9.0,10.0,8.0,2.0,84.0,41.0,35.0,27.0,15.0,1.0,2.0,304.0,,75.0,,45.0,55,80,55,43.0,48,85,62,1990
85850,1991-339,Adelaide,Hard,32,A,19901231,3,101234,,,Patrik Kuhnen,R,190.0,GER,24.8,101421,,,Veli Paloheimo,R,183.0,FIN,23.0,6-0 6-4,3,R32,71.0,6.0,1.0,54.0,31.0,24.0,13.0,8.0,1.0,1.0,2.0,2.0,60.0,37.0,22.0,6.0,8.0,4.0,8.0,82.0,,69.0,,23.0,57,77,56,23.0,61,59,26,1990
85851,1991-339,Adelaide,Hard,32,A,19901231,4,101889,8.0,,Todd Woodbridge,R,178.0,AUS,19.7,101703,,,Guillaume Raoux,R,180.0,FRA,20.8,7-6(2) 6-1,3,R32,85.0,2.0,0.0,60.0,40.0,30.0,14.0,9.0,3.0,3.0,3.0,3.0,74.0,45.0,30.0,11.0,10.0,5.0,8.0,50.0,,84.0,,20.0,66,75,70,29.0,60,66,37,1990
85852,1991-339,Adelaide,Hard,32,A,19901231,5,101274,,,Udo Riglewski,R,185.0,GER,24.4,101843,4.0,,Sergi Bruguera,R,188.0,ESP,19.9,7-5 6-3,3,R32,90.0,4.0,2.0,72.0,40.0,33.0,14.0,10.0,7.0,8.0,2.0,2.0,77.0,41.0,28.0,15.0,11.0,4.0,8.0,88.0,,28.0,,32.0,55,82,43,36.0,53,68,41,1990


### 4. Export New Main Dataframe

In [69]:
df_post1991.to_pickle(os.path.join(path, 'Prepared Data', 'df_post1991.pkl'))

# In the Next Notebook

### 1. Importing Libraries and PKL File
### 2. Creating Subsets
#### 2.1 Creating New Main Dataframe: df_matchstats
#### 2.2 Best of 3 or 5
#### 2.3 Court Surface
#### 2.4 Big 3 Participation
### 3. Exporting Subsets as PKL Files