# CAS BDAI Individual Innovation Project: Tennis Match Predictor

## Table of Contents 
1. [Introduction](#introduction)
2. [Understanding the Data](#understanding-data)
3. [Final Conclusion](#final-concl)


## Introduction <a name="introduction"></a>

### Tennis Match Predictor: GAImeSetMatch


### Goal of this project

...

![.png](img/project/image.png)

Source: [something](https://example.com/)

## Understanding the Data <a name="understanding-data"></a>

### Import the dependencies
First we need to import the required libraries: pandas, numpy, datetime and matplotlib.pyplot.


In [149]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline

### Define some helper functions
These will help us later with common tasks.


### Load and explore the data
This section loads the data from the .csv files provided by the aforementioned source, explores it, and then cleans it to improve ease of use and data quality.

In [150]:
# first, set some static parameters and options (used later too for loading other files)

# directory containing the .csv files
dirname = 'data'

# set options for pandas viewing
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
# pd.reset_option('display.float_format')

# pd.reset_option('^display.', silent=True)
# pd.options.display.float_format = '{:.5f}'.format


#### Matches
Data is available in the form of results of ATP matches. For simplicity, we focus only on matches played since the year 2000*. Each year is stored in one file, using the naming convention atp_matches_yyyy.csv.

*The reasoning behind this: several factors have shaped the modern form of the sport and have influenced match outcomes since around the year 2000. For me, these are:
1. Racquet technology: Since the 1980s, racquets have been made mainly of graphite. Reference: [Link](https://www.pledgesports.org/2019/08/evolution-of-tennis-rackets/)
2. String technology: In the late 1990s, polyester strings were introduced, which revolutionised the sport. Reference: [Link](https://scientificinquirer.com/2021/08/30/string-theory-the-synthetic-revolution-that-changed-tennis-forever/)
3. Surfaces: In 2009, the ATP discontinued the use of carpet courts in all its tournaments. Reference: [Link](https://racketsportsworld.com/tennis-not-played-carpet-courts/#When_was_Carpet_Discontinued_from_Use_in_Tennis)

In [151]:
# create a list of match files (since the year 2000) to load
atp_match_files = [f'{dirname}/atp_matches_{year}.csv' for year in range(2000, 2024)]

In [152]:
# read every match file and combine them into a single DataFrame
# (one pd.concat call avoids re-copying the growing DataFrame on each iteration)
matches_df = pd.concat([pd.read_csv(filen, index_col=None) for filen in atp_match_files])


In [153]:
# explore the matches data
matches_df.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,score,best_of,round,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,2000-301,Auckland,Hard,32,A,20000110,1,103163,1.0,,Tommy Haas,R,188.0,GER,21.7,101543,,,Jeff Tarango,L,180.0,USA,31.1,7-5 4-6 7-5,3,R32,108.0,18.0,4.0,96.0,49.0,39.0,28.0,17.0,3.0,5.0,7.0,8.0,106.0,55.0,39.0,29.0,17.0,4.0,7.0,11.0,1612.0,63.0,595.0
1,2000-301,Auckland,Hard,32,A,20000110,2,102607,,Q,Juan Balcells,R,190.0,ESP,24.5,102644,,,Franco Squillari,L,183.0,ARG,24.3,7-5 7-5,3,R32,85.0,5.0,3.0,76.0,52.0,39.0,13.0,12.0,5.0,6.0,5.0,10.0,74.0,32.0,25.0,18.0,12.0,3.0,6.0,211.0,157.0,49.0,723.0
2,2000-301,Auckland,Hard,32,A,20000110,3,103252,,,Alberto Martin,R,175.0,ESP,21.3,102238,,,Alberto Berasategui,R,173.0,ESP,26.5,6-3 6-1,3,R32,56.0,0.0,0.0,55.0,35.0,25.0,12.0,8.0,1.0,1.0,0.0,6.0,56.0,33.0,20.0,7.0,8.0,7.0,11.0,48.0,726.0,59.0,649.0
3,2000-301,Auckland,Hard,32,A,20000110,4,103507,7.0,,Juan Carlos Ferrero,R,183.0,ESP,19.9,103819,,,Roger Federer,R,185.0,SUI,18.4,6-4 6-4,3,R32,68.0,5.0,1.0,53.0,28.0,26.0,15.0,10.0,0.0,0.0,11.0,2.0,70.0,43.0,29.0,14.0,10.0,6.0,8.0,45.0,768.0,61.0,616.0
4,2000-301,Auckland,Hard,32,A,20000110,5,102103,,Q,Michael Sell,R,180.0,USA,27.3,102765,4.0,,Nicolas Escude,R,185.0,FRA,23.7,0-6 7-6(7) 6-1,3,R32,115.0,1.0,2.0,98.0,66.0,39.0,14.0,13.0,6.0,11.0,8.0,8.0,92.0,46.0,34.0,18.0,12.0,5.0,9.0,167.0,219.0,34.0,873.0


In [154]:
# get an overview of number of features, instances, empty values and data types 
matches_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 71213 entries, 0 to 2368
Data columns (total 49 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tourney_id          71213 non-null  object 
 1   tourney_name        71213 non-null  object 
 2   surface             71213 non-null  object 
 3   draw_size           71213 non-null  int64  
 4   tourney_level       71213 non-null  object 
 5   tourney_date        71213 non-null  int64  
 6   match_num           71213 non-null  int64  
 7   winner_id           71213 non-null  int64  
 8   winner_seed         29586 non-null  float64
 9   winner_entry        8944 non-null   object 
 10  winner_name         71213 non-null  object 
 11  winner_hand         71204 non-null  object 
 12  winner_ht           69582 non-null  float64
 13  winner_ioc          71213 non-null  object 
 14  winner_age          71208 non-null  float64
 15  loser_id            71213 non-null  int64  
 16  loser_seed

In [155]:
matches_df.describe()

Unnamed: 0,draw_size,tourney_date,match_num,winner_id,winner_seed,winner_ht,winner_age,loser_id,loser_seed,loser_ht,loser_age,best_of,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
count,71213.0,71213.0,71213.0,71213.0,29586.0,69582.0,71208.0,71213.0,16330.0,67939.0,71207.0,71213.0,63277.0,64811.0,64811.0,64811.0,64811.0,64811.0,64811.0,64812.0,64811.0,64811.0,64811.0,64811.0,64811.0,64811.0,64811.0,64811.0,64812.0,64811.0,64811.0,70666.0,70666.0,69793.0,69793.0
mean,55.12847,20109104.9577,94.71271,108736.80855,7.37552,186.13802,26.28357,108802.46563,8.89167,185.5986,26.38963,3.45761,106.69245,6.91245,2.65052,77.99489,47.97124,36.30004,16.64463,12.51935,3.46433,5.03759,5.11475,3.37616,81.03353,48.55815,32.39725,14.9649,12.31158,4.77988,8.62809,79.60923,1592.6501,117.93823,965.14317
std,40.04523,68421.82063,130.18938,18210.10981,6.78996,6.81147,3.95905,18259.89643,7.32843,6.76917,4.07194,0.84013,41.17812,5.53431,2.29043,29.23857,18.97155,13.59139,6.9798,4.23338,3.07829,4.03469,4.88851,2.5354,29.21421,19.24118,14.38498,7.20746,4.2343,3.27306,4.14767,138.94982,1997.6575,186.05034,1112.62339
min,2.0,20000103.0,1.0,100644.0,1.0,163.0,14.9,100644.0,1.0,163.0,14.5,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
25%,32.0,20050509.0,11.0,103498.0,3.0,183.0,23.4,103444.0,4.0,181.0,23.4,3.0,77.0,3.0,1.0,56.0,34.0,27.0,12.0,9.0,1.0,2.0,2.0,2.0,60.0,35.0,22.0,10.0,9.0,2.0,6.0,18.0,573.0,36.0,426.0
50%,32.0,20110117.0,28.0,104339.0,5.0,185.0,26.1,104338.0,7.0,185.0,26.2,3.0,99.0,6.0,2.0,73.0,45.0,34.0,16.0,11.0,3.0,4.0,4.0,3.0,76.0,45.0,30.0,14.0,11.0,4.0,8.0,45.0,933.0,68.0,703.0
75%,64.0,20170203.0,169.0,105227.0,9.0,190.0,29.0,105385.0,12.0,190.0,29.2,3.0,129.0,9.0,4.0,94.0,58.0,43.0,20.0,15.0,5.0,7.0,7.0,5.0,97.0,59.0,40.0,19.0,15.0,7.0,11.0,85.0,1715.0,114.0,1095.0
max,128.0,20230828.0,1701.0,211468.0,35.0,211.0,42.3,212041.0,35.0,211.0,46.0,5.0,1146.0,113.0,26.0,491.0,361.0,292.0,82.0,90.0,24.0,30.0,103.0,26.0,489.0,328.0,284.0,101.0,91.0,27.0,38.0,2101.0,16950.0,2159.0,16950.0
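
Note that `tourney_date` is stored as an integer in YYYYMMDD form (see the `int64` dtype in the `info()` output), so its mean and standard deviation in the summary above are not meaningful as dates. A minimal sketch of converting such a column to a proper datetime, shown on a toy frame; the column names `tourney_date_parsed` and `year` are assumptions introduced here for illustration:

```python
import pandas as pd

# Toy frame with the same integer date encoding as matches_df['tourney_date']
toy = pd.DataFrame({'tourney_date': [20000110, 20230828]})

# Convert the YYYYMMDD integers to strings, then parse them with an explicit format
toy['tourney_date_parsed'] = pd.to_datetime(toy['tourney_date'].astype(str), format='%Y%m%d')

# Datetime columns expose calendar parts through the .dt accessor
toy['year'] = toy['tourney_date_parsed'].dt.year
print(toy)
```

Parsing with an explicit `format` avoids pandas guessing how to interpret the eight-digit values.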


Here is a small sample matches dataframe (5 matches), used to test the helper functions below.

In [156]:

# Small sample dataframe (5 matches) for misc usage
sample_matches_df = pd.DataFrame(data = {
    'tourney_id': ['2000-301', '2000-301', '2000-301', '2000-301', '2000-301'],
    'tourney_name': ['Auckland', 'Auckland', 'Auckland', 'Auckland', 'Auckland'],
    'surface': ['Hard', 'Hard', 'Hard', 'Hard', 'Hard'],
    'draw_size': [32, 32, 32, 32, 32],
    'tourney_level': ['A', 'A', 'A', 'A', 'A'],
    'tourney_date': [20000110, 20000110, 20000110, 20000110, 20000110],
    'match_num': [1, 2, 3, 4, 5],
    'winner_id': [103163, 102607, 103252, 103507, 102103],
    'winner_seed': [1.0, None, None, 7.0, None],
    'winner_entry': [None, 'Q', None, None, 'Q'],
    'winner_name': ['Tommy Haas', 'Juan Balcells', 'Alberto Martin', 'Juan Carlos Ferrero', 'Michael Sell'],
    'winner_hand': ['R', 'R', 'R', 'R', 'R'],
    'winner_ht': [188.0, 190.0, 175.0, 183.0, 180.0],
    'winner_ioc': ['GER', 'ESP', 'ESP', 'ESP', 'USA'],
    'winner_age': [21.7, 24.5, 21.3, 19.9, 27.3],
    'loser_id': [101543, 102644, 102238, 103819, 102765],
    'loser_seed': [None, None, None, None, 4.0],
    'loser_entry': [None, None, None, None, None],
    'loser_name': ['Jeff Tarango', 'Franco Squillari', 'Alberto Berasategui', 'Roger Federer', 'Nicolas Escude'],
    'loser_hand': ['L', 'L', 'L', 'L', 'L'],
    'loser_ht': [180.0, 183.0, 173.0, 185.0, 185.0],
    'loser_ioc': ['USA', 'ARG', 'ESP', 'SUI', 'FRA'],
    'loser_age': [31.1, 24.3, 26.5, 18.4, 23.7],
    'score': ['7-5 4-6 7-5', '7-5 7-5', '6-3 6-1', '6-4 6-4', '0-6 7-6(7) 6-1'],
    'best_of': [3, 3, 3, 3, 3],
    'round': ['R32', 'R32', 'R32', 'R32', 'R32'],
    'minutes': [108.0, 85.0, 56.0, 68.0, 115.0],
    'w_ace': [18.0, 5.0, 0.0, 5.0, 1.0],
    'w_df': [4.0, 3.0, 0.0, 1.0, 2.0],
    'w_svpt': [96.0, 76.0, 55.0, 53.0, 98.0],
    'w_1stIn': [49.0, 52.0, 35.0, 28.0, 66.0],
    'w_1stWon': [39.0, 39.0, 25.0, 26.0, 39.0],
    'w_2ndWon': [28.0, 13.0, 12.0, 15.0, 14.0],
    'w_SvGms': [17.0, 12.0, 8.0, 10.0, 13.0],
    'w_bpSaved': [3.0, 5.0, 1.0, 0.0, 6.0],
    'w_bpFaced': [5.0, 6.0, 1.0, 0.0, 8.0],
    'l_ace': [7.0, 10.0, 6.0, 11.0, 8.0],
    'l_df': [8.0, 7.0, 6.0, 2.0, 8.0],
    'l_svpt': [106.0, 74.0, 56.0, 70.0, 92.0],
    'l_1stIn': [55.0, 32.0, 33.0, 43.0, 46.0],
    'l_1stWon': [39.0, 25.0, 20.0, 29.0, 34.0],
    'l_2ndWon': [29.0, 18.0, 7.0, 14.0, 18.0],
    'l_SvGms': [17.0, 12.0, 8.0, 10.0, 12.0],
    'l_bpSaved': [4.0, 3.0, 7.0, 6.0, 5.0],
    'l_bpFaced': [7.0, 6.0, 11.0, 8.0, 9.0],
    'winner_rank': [1612.0, 211.0, 48.0, 768.0, 167.0],
    'winner_rank_points': [63.0, 157.0, 726.0, 616.0, 219.0],
    'loser_rank': [595.0, 723.0, 649.0, 616.0, 873.0],
    'loser_rank_points': [None, 723.0, 649.0, 616.0, 873.0]
}
)

#### Rankings
Data is also available in the form of ATP player rankings. It may be needed to fill in rankings that are missing from the matches dataset, for example when a player has no recorded ranking at the time of a match.

In [157]:
# create a list of rankings files to load (one file per decade since 2000, plus the current rankings)
atp_rankings_files = [f'{dirname}/atp_rankings_{suffix}.csv' for suffix in ['00s', '10s', '20s', 'current']]

In [158]:
# read every rankings file and combine them into a single DataFrame
rankings_df = pd.concat([pd.read_csv(filen, index_col=None) for filen in atp_rankings_files])


In [159]:
# explore the rankings data
rankings_df.head()

Unnamed: 0,ranking_date,rank,player,points
0,20000110,1,101736,4135.0
1,20000110,2,102338,2915.0
2,20000110,3,101948,2419.0
3,20000110,4,103017,2184.0
4,20000110,5,102856,2169.0


In [160]:
# get an overview of number of features, instances, empty values and data types 
rankings_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2140631 entries, 0 to 58510
Data columns (total 4 columns):
 #   Column        Dtype  
---  ------        -----  
 0   ranking_date  int64  
 1   rank          int64  
 2   player        int64  
 3   points        float64
dtypes: float64(1), int64(3)
memory usage: 81.7 MB


In [161]:
# sanity checks on the data (min values, max values, etc.)
rankings_df.describe()

Unnamed: 0,ranking_date,rank,player,points
count,2140631.0,2140631.0,2140631.0,2139882.0
mean,20112972.33072,941.09613,119768.98851,117.05611
std,66763.21263,547.58143,31216.72435,455.87982
min,20000110.0,1.0,100149.0,1.0
25%,20060213.0,470.0,104128.0,2.0
50%,20110919.0,946.0,105498.0,10.0
75%,20170306.0,1381.0,120568.0,65.0
max,20230911.0,2271.0,212464.0,16950.0
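
One possible way to supplement missing ranks in the matches data from the rankings data is an "as-of" join: for each match, look up the player's most recent ranking on or before the tournament date. The sketch below uses `pd.merge_asof` on toy frames. The helper name `fill_missing_rank`, the temporary columns `_row` and `_looked_up`, and all toy values are assumptions made for illustration; whether the ranking of the tournament date itself should count is also a design choice not settled here.

```python
import pandas as pd

# Toy stand-ins for matches_df and rankings_df; every value below is invented for illustration.
matches = pd.DataFrame({
    'tourney_date': [20000117, 20000110],
    'player_1_id': [102, 101],
    'player_1_rank': [15.0, None],   # player 101 has no rank recorded for this match
})
rankings = pd.DataFrame({
    'ranking_date': [20000103, 20000110, 20000110],
    'player': [101, 101, 102],
    'rank': [40, 38, 15],
})

def fill_missing_rank(matches, rankings, id_col, rank_col):
    """Hypothetical helper: fill NaN ranks with the latest ranking on or before the match date."""
    left = matches.reset_index(drop=True)
    left['_row'] = left.index  # remember the original row order
    right = rankings.rename(columns={'player': id_col, 'rank': '_looked_up'})
    # merge_asof requires both frames to be sorted by their "on" keys;
    # the YYYYMMDD integers sort in chronological order, so they can be used directly
    merged = pd.merge_asof(
        left.sort_values('tourney_date'),
        right[['ranking_date', id_col, '_looked_up']].sort_values('ranking_date'),
        left_on='tourney_date', right_on='ranking_date',
        by=id_col, direction='backward',
    )
    merged = merged.sort_values('_row').reset_index(drop=True)
    # keep existing ranks, use the looked-up rank only where the rank is missing
    merged[rank_col] = merged[rank_col].fillna(merged['_looked_up'])
    return merged.drop(columns=['_row', '_looked_up', 'ranking_date'])

filled = fill_missing_rank(matches, rankings, 'player_1_id', 'player_1_rank')
print(filled)
```

The same helper could be called a second time with the `player_2_` columns once the winner and loser columns have been renamed.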


### Helper function: Hide winner and loser from column names

In [162]:

def hide_winner_loser(input_df):
    '''Replace the 'winner_' and 'loser_' columns with 'player_1_' and 'player_2_' columns.

    To predict who wins a match, the column names must not reveal the outcome. Each
    'winner_'/'loser_' feature is therefore reassigned to 'player_1_'/'player_2_', where
    player_1 and player_2 are the two players in alphabetical order of their names.
    Columns starting with 'w_' and 'l_' hold statistics recorded during the match; they
    are not used for predicting the outcome, so they are removed.
    A 'winner' column ('player_1' or 'player_2') is added at the end as the target (y) variable.
    '''

    # List of required features
    features = ['id', 'seed', 'entry', 'name', 'hand', 'ht', 'ioc', 'age', 'rank', 'rank_points']
    
    # Copy the input DataFrame to a new one
    df = input_df.copy()

    # Add player_1_name and player_2_name columns
    df['player_1_name'] = df.apply(lambda row: min(row['winner_name'], row['loser_name']), axis=1)
    df['player_2_name'] = df.apply(lambda row: max(row['winner_name'], row['loser_name']), axis=1)

    # Transfer the values from 'winner_' and 'loser_' features to 'player_1_' and 'player_2_' features, according to who was the winner & loser
    for feat in features:
        player_1_feature = np.where(df['player_1_name'] == df['winner_name'],
                                    df['winner_' + feat],
                                    df['loser_' + feat]
                                    )
        player_2_feature = np.where(df['player_2_name'] == df['winner_name'],
                                    df['winner_' + feat],
                                    df['loser_' + feat]
                                    )
        df['player_1_' + feat] = player_1_feature
        df['player_2_' + feat] = player_2_feature   

          
    # Add a winner column
    df['winner'] = df.apply(lambda row: 'player_1' if row['winner_name'] == row['player_1_name'] else 'player_2', axis=1)

    # Remove columns starting with 'winner_' and 'loser_' (they have been replaced by player_1_ and player_2_)
    df = df.loc[:, ~df.columns.str.startswith('winner_') & ~df.columns.str.startswith('loser_')]

    # Remove columns starting with 'w_' and 'l_' (in-match statistics, not needed for prediction)
    df = df.loc[:, ~df.columns.str.startswith('w_') & ~df.columns.str.startswith('l_')]

    return df
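
The row-wise `df.apply(..., axis=1)` calls above can be slow on tens of thousands of matches. As a sketch of an alternative (not the function used in the rest of this notebook), the same idea can be expressed with a single boolean mask. The name `hide_winner_loser_vectorized` and the shortened default feature list are assumptions made for illustration; the toy rows reuse values from the first Auckland matches shown earlier.

```python
import numpy as np
import pandas as pd

def hide_winner_loser_vectorized(input_df, features=('id', 'name', 'rank')):
    """Sketch: same idea as hide_winner_loser(), using one boolean mask instead of row-wise apply."""
    df = input_df.copy()
    # True where the winner's name sorts first alphabetically, i.e. the winner becomes player_1
    winner_first = (df['winner_name'] <= df['loser_name']).to_numpy()
    for feat in features:
        w = df['winner_' + feat].to_numpy()
        l = df['loser_' + feat].to_numpy()
        df['player_1_' + feat] = np.where(winner_first, w, l)
        df['player_2_' + feat] = np.where(winner_first, l, w)
    # Target column: which of the two players won
    df['winner'] = np.where(winner_first, 'player_1', 'player_2')
    # Drop every column whose name reveals the outcome or holds in-match statistics
    keep = [c for c in df.columns if not c.startswith(('winner_', 'loser_', 'w_', 'l_'))]
    return df[keep]

toy = pd.DataFrame({
    'surface': ['Hard', 'Hard'],
    'winner_name': ['Tommy Haas', 'Juan Carlos Ferrero'],
    'loser_name': ['Jeff Tarango', 'Roger Federer'],
    'winner_id': [103163, 103507],
    'loser_id': [101543, 103819],
    'winner_rank': [11.0, 45.0],
    'loser_rank': [63.0, 61.0],
    'w_ace': [18.0, 5.0],
})
out = hide_winner_loser_vectorized(toy)
print(out)
```

Passing the full feature list from `hide_winner_loser()` as `features` would cover the remaining columns in the same way.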


#### Test the function hide_winner_loser()

In [163]:
output_df = hide_winner_loser(sample_matches_df)
print(output_df.shape)
output_df

(5, 32)


Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,score,best_of,round,minutes,player_1_name,player_2_name,player_1_id,player_2_id,player_1_seed,player_2_seed,player_1_entry,player_2_entry,player_1_hand,player_2_hand,player_1_ht,player_2_ht,player_1_ioc,player_2_ioc,player_1_age,player_2_age,player_1_rank,player_2_rank,player_1_rank_points,player_2_rank_points,winner
0,2000-301,Auckland,Hard,32,A,20000110,1,7-5 4-6 7-5,3,R32,108.0,Jeff Tarango,Tommy Haas,101543,103163,,1.0,,,L,R,180.0,188.0,USA,GER,31.1,21.7,595.0,1612.0,,63.0,player_2
1,2000-301,Auckland,Hard,32,A,20000110,2,7-5 7-5,3,R32,85.0,Franco Squillari,Juan Balcells,102644,102607,,,,Q,L,R,183.0,190.0,ARG,ESP,24.3,24.5,723.0,211.0,723.0,157.0,player_2
2,2000-301,Auckland,Hard,32,A,20000110,3,6-3 6-1,3,R32,56.0,Alberto Berasategui,Alberto Martin,102238,103252,,,,,L,R,173.0,175.0,ESP,ESP,26.5,21.3,649.0,48.0,649.0,726.0,player_2
3,2000-301,Auckland,Hard,32,A,20000110,4,6-4 6-4,3,R32,68.0,Juan Carlos Ferrero,Roger Federer,103507,103819,7.0,,,,R,L,183.0,185.0,ESP,SUI,19.9,18.4,768.0,616.0,616.0,616.0,player_1
4,2000-301,Auckland,Hard,32,A,20000110,5,0-6 7-6(7) 6-1,3,R32,115.0,Michael Sell,Nicolas Escude,102103,102765,,4.0,Q,,R,L,180.0,185.0,USA,FRA,27.3,23.7,167.0,873.0,219.0,873.0,player_1


In [164]:
# replace the winner and loser columns with player_1 and player_2 for the matches dataset
matches_df = hide_winner_loser(matches_df)
matches_df.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,score,best_of,round,minutes,player_1_name,player_2_name,player_1_id,player_2_id,player_1_seed,player_2_seed,player_1_entry,player_2_entry,player_1_hand,player_2_hand,player_1_ht,player_2_ht,player_1_ioc,player_2_ioc,player_1_age,player_2_age,player_1_rank,player_2_rank,player_1_rank_points,player_2_rank_points,winner
0,2000-301,Auckland,Hard,32,A,20000110,1,7-5 4-6 7-5,3,R32,108.0,Jeff Tarango,Tommy Haas,101543,103163,,1.0,,,L,R,180.0,188.0,USA,GER,31.1,21.7,63.0,11.0,595.0,1612.0,player_2
1,2000-301,Auckland,Hard,32,A,20000110,2,7-5 7-5,3,R32,85.0,Franco Squillari,Juan Balcells,102644,102607,,,,Q,L,R,183.0,190.0,ARG,ESP,24.3,24.5,49.0,211.0,723.0,157.0,player_2
2,2000-301,Auckland,Hard,32,A,20000110,3,6-3 6-1,3,R32,56.0,Alberto Berasategui,Alberto Martin,102238,103252,,,,,R,R,173.0,175.0,ESP,ESP,26.5,21.3,59.0,48.0,649.0,726.0,player_2
3,2000-301,Auckland,Hard,32,A,20000110,4,6-4 6-4,3,R32,68.0,Juan Carlos Ferrero,Roger Federer,103507,103819,7.0,,,,R,R,183.0,185.0,ESP,SUI,19.9,18.4,45.0,61.0,768.0,616.0,player_1
4,2000-301,Auckland,Hard,32,A,20000110,5,0-6 7-6(7) 6-1,3,R32,115.0,Michael Sell,Nicolas Escude,102103,102765,,4.0,Q,,R,R,180.0,185.0,USA,FRA,27.3,23.7,167.0,34.0,219.0,873.0,player_1


#### Zeros
Here we check for zeros in the matches dataframe, in order to decide what to do with them.

In [165]:
# count the zeros in each feature
zero_count_per_feature = matches_df.apply(lambda col: (col == 0).sum())
zero_count_per_feature

tourney_id               0
tourney_name             0
surface                  0
draw_size                0
tourney_level            0
tourney_date             0
match_num                0
score                    0
best_of                  0
round                    0
minutes                 47
player_1_name            0
player_2_name            0
player_1_id              0
player_2_id              0
player_1_seed            0
player_2_seed            0
player_1_entry           0
player_2_entry           0
player_1_hand            0
player_2_hand            0
player_1_ht              0
player_2_ht              0
player_1_ioc             0
player_2_ioc             0
player_1_age             0
player_2_age             0
player_1_rank            0
player_2_rank            0
player_1_rank_points     0
player_2_rank_points     0
winner                   0
dtype: int64

In [172]:
# explore the matches lasting 0 minutes or less
matches_lessthan_0mins = matches_df.loc[matches_df['minutes']<=0]
matches_lessthan_0mins.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,score,best_of,round,minutes,player_1_name,player_2_name,player_1_id,player_2_id,player_1_seed,player_2_seed,player_1_entry,player_2_entry,player_1_hand,player_2_hand,player_1_ht,player_2_ht,player_1_ioc,player_2_ioc,player_1_age,player_2_age,player_1_rank,player_2_rank,player_1_rank_points,player_2_rank_points,winner
255,2020-580,Australian Open,Hard,128,G,20200120,188,W/O,5,R64,0.0,Philipp Kohlschreiber,Stefanos Tsitsipas,104259,126774,,6.0,,,R,R,178.0,193.0,GER,GRE,36.2,21.4,79.0,6.0,700.0,5375.0,player_2
361,2020-0891,Pune,Hard,32,A,20200203,287,W/O,3,R16,0.0,Viktor Troicki,Yuichi Sugita,104678,105216,,5.0,Q,,R,R,193.0,173.0,SRB,JPN,33.9,31.3,191.0,86.0,263.0,645.0,player_2
376,2020-0506,Buenos Aires,Clay,32,A,20200210,299,W/O,3,SF,0.0,Diego Schwartzman,Pedro Sousa,106043,105155,1.0,,,LL,R,R,170.0,180.0,ARG,POR,27.4,31.7,14.0,145.0,2325.0,373.0,player_2
454,2020-0407,Rotterdam,Hard,32,A,20200210,275,W/O,3,R32,0.0,Jannik Sinner,Radu Albot,206173,105430,,,WC,,R,R,188.0,175.0,ITA,MDA,18.4,30.2,79.0,50.0,710.0,977.0,player_1
1258,2020-0352,Paris Masters,Hard,64,M,20201102,271,W/O,3,R32,0.0,Corentin Moutet,Marin Cilic,144895,105227,,,WC,,L,R,178.0,198.0,FRA,CRO,21.5,32.0,75.0,43.0,838.0,1280.0,player_2


The 0-minute matches shown above are all walkovers (score "W/O"), meaning that one player did not contest the match due to injury, illness, etc. These instances should not be used for predicting matches, as they do not measure a player's performance.
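
A minimal sketch of how such rows could be excluded, shown on a toy frame. One detail worth noting: `minutes` contains missing values in the real data, and a filter such as `minutes > 0` would silently drop those rows too, because comparisons with NaN evaluate to False. The variable names below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one played match, one walkover, one played match with unknown duration
toy = pd.DataFrame({
    'score': ['6-3 6-1', 'W/O', '7-5 7-5'],
    'minutes': [56.0, 0.0, np.nan],
})

is_walkover = toy['score'].str.strip().eq('W/O')
zero_minutes = toy['minutes'].le(0)   # NaN compares as False, so unknown durations are kept

# Keep every row that is neither a walkover nor recorded as lasting 0 minutes or less
played = toy.loc[~(is_walkover | zero_minutes)]
print(played)
```

Whether matches with an unknown duration should be kept is a separate decision that belongs with the handling of missing values.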

#### NaN or empty values
Here we check for NaN or empty values in the matches dataframe, in order to decide what to do with them.

....
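
As a sketch of what such a check could look like, the snippet below counts missing values per column and their share of all rows, on a toy frame with invented values. By default, `pd.read_csv` reads empty fields as NaN, so `isna()` covers both cases for data loaded from .csv files. The name `na_summary` and its column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in a numeric and in a text column
toy = pd.DataFrame({
    'player_1_seed': [np.nan, 7.0, np.nan],
    'player_1_entry': [np.nan, np.nan, 'Q'],
    'minutes': [108.0, np.nan, 56.0],
})

# Number and share of missing values per column, most affected columns first
na_summary = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'share_missing': toy.isna().mean(),
}).sort_values('n_missing', ascending=False)
print(na_summary)
```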

In [None]:
# remove matches with 0 or less minutes



### Observations
Some initial observations about the x data sets
1. ...
2. ...

## Final Conclusion <a name="final-concl"></a>

...