# Portfolio Project: Data Analysis on Tennis stats

## TABLE OF CONTENTS
1. Introduction
2. Imports and Functions
3. The Data
4. Data Preparation
5. Data Processing
6. Data Analysis and Visualization
7. Conclusions
8. Improvements & Next Steps
9. References

## 1. INTRODUCTION
For those who don't know, I spent more than a year working on a predictive system able to make money consistently and profitably by betting on NBA games. I started this journey in September 2019 and, by December 2020, I had found an approach that was way better than what I had hoped to find.

That's when Netty was born and, an entire regular season later, I can proudly say that it yielded a 9.14% ROI betting mostly on underdogs (average odds were 2.15) and made more than 51 units in profit in just 280 games. To give an example, someone with a unit of 100€ would have made **over 5100€ in less than 5 months**.

Being the ambitious person that I am, I wasn't going to settle for something that was profitable only 6 months a year. I want to earn money consistently and regularly, which is why I need another model that can work the entire year.

There were different options: tennis, horse racing, greyhound racing... I ended up choosing tennis because I'm not particularly a fan of animal racing and tennis is a much more famous sport (and that usually means more data).

Now, I won't be creating the model and doing all the work that comes after. This project will consist of all the preceding phases: data manipulation, data analysis, data visualization, and getting the data almost ready for modelling. So **expect the analysis to move towards finding the relevant features** to use in a predictive model.

I'll then, privately, use the insights I discover to create the model and hopefully make it profitable enough.

## 2. IMPORTS AND FUNCTIONS

This might not be the most efficient practice but I like to keep things organized and I can't think of a better way to do so.

Below, you'll be able to see the packages I'm using throughout the entire document and also some external functions that will help me make this path easier.

In [None]:
#######################################################################################
# IMPORTS
#######################################################################################

# Math
import numpy as np
import math, statistics

# Data manipulation
import pandas as pd
from fuzzywuzzy import fuzz, process
!pip install xlrd
!pip install openpyxl

# Data visualization
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
plt.style.use('ggplot')
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,8)

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# Extras
from tqdm import tqdm
import time

# Shutting down warnings, just to make things cleaner
import warnings
warnings.simplefilter(action='ignore', category=Warning)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 20)

In [None]:
#######################################################################################
# FUNCTIONS
#######################################################################################

def divider(ini, stop, denominator):
    '''Acts as a range() but instead of adding up or subtracting, it multiplies/divides'''
    l = []
    if ini > stop:
        while ini > stop:
            l.append(int(ini))
            ini /= denominator
    else:
        while ini < stop:
            l.append(int(ini))
            ini *= denominator
        
    return l

def compute_diff(df, cols):
    '''This computes the difference of certain features between player1 and player2'''
    # ALWAYS player 1 - player 2
    # cols shouldn't have player suffixes
    for col in cols:
        df[col + "_diff"] = df['p1_' + col] - df['p2_' + col]
    return df

def myround(x, base=.5):
    '''This rounds to the desired base (default to .5)'''
    return base * round(x/base)

def f(x):
    '''This will be used to display information when grouping by odds or confidence'''
    d = {}
    d["Games"] = x["Right"].count()
    d['Accuracy'] = x['Right'].mean()
    d['Mean Odds'] = x['Odd'].mean()
    d['Profits (unit)'] = x['Profits_unit'].sum()
    d['Profits (odd)'] = x['Profits_odd'].sum()
    d['ROI (unit)'] = x['Profits_unit'].sum() / x['Odd'].count()
    d['ROI (odd)'] = x['Profits_odd'].sum() / x['Odd'].sum()
    return pd.Series(d, index=d.keys())

## 3. THE DATA
The available data consists of two types of files, each extracted from a different source, with one file per type for every year between 2005 and 2021 (both included).

* **data/atp_matches_{year}.csv**: they contain valuable qualitative information (like each player's hand, surface, tourney level...) as well as quantitative data from that particular event (typical match stats).
* **data/{year}.xls[x]**: these mainly contain information related to odds as well as the match results and metadata. 

Let's take a quick peek at both types:

In [None]:
pd.read_csv("/kaggle/input/atp-masters-tennis-dataset/atp_matches_2021.csv").head()

In [None]:
pd.read_excel("/kaggle/input/tennis-2021/2021.xlsx").head()

## 4. DATA PREPARATION

This will be a pretty mechanical process consisting of a series of steps:

1. Create the necessary transformations to join datasets (that is, normalize player and tournament names, dates...)
2. Join datasets, create one per year.
3. Get rid of unnecessary columns.
4. Reorganize dataset: from winner and loser to player1 and player2, evenly distributed (that is, player1 wins half of the matches and player2 the other half). Also create a new column named "p1_win".
5. Create new features?


### 4.1. Transform identifier columns
The plan is to join both types of dataframes and, to do so, I'll be using the following columns as identifiers (format: column_name_df1 = column_name_df2), some of which I'll have to create:
* Month = month
* Year = year
* winner = winner_name
* loser = loser_name

If we look closely at both types of dataframes, one has full names for both players while the other has just the last name followed by the initial of the player's first name. I'll be using the fuzzywuzzy package to perform some string matching.
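To make the matching idea concrete, here is a minimal sketch that uses the standard library's `difflib` instead of fuzzywuzzy (a deliberate swap, so the snippet has no dependencies). The player names and the `abbreviate` and `best_match` helpers are made up for illustration, and the sketch assumes a simple "First Last" name order; the cell that follows is the one actually used.

```python
import difflib

def abbreviate(full_name):
    """'Roger Federer' -> 'Federer R.' (hypothetical helper, assumes 'First Last' order)."""
    first, _, last = full_name.partition(" ")
    return f"{last} {first[0]}."

def best_match(short_name, full_names):
    """Return the full name whose abbreviation is most similar to short_name."""
    def score(full_name):
        return difflib.SequenceMatcher(
            None, short_name.lower(), abbreviate(full_name).lower()
        ).ratio()
    return max(full_names, key=score)

full_names = ["Roger Federer", "Rafael Nadal", "Novak Djokovic"]
short_names = ["Federer R.", "Nadal R.", "Djokovic N."]

# Map every abbreviated name to its closest full name
matches = {short: best_match(short, full_names) for short in short_names}
print(matches)
```

Multi-word surnames would need a smarter `abbreviate`, which is one reason a fuzzy score is more forgiving than an exact join.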

In [None]:
years = [y for y in range(2005, 2022)]
d = {}

for year in tqdm(years):
    # Type of XLS file changed in 2013. We'll read them accordingly.
    if year < 2013:
        df1 = pd.read_excel("/kaggle/input/tennis-data-atp/{}.xls".format(year))
    elif year < 2019:
        df1 = pd.read_excel("/kaggle/input/tennis-data-atp/{}.xlsx".format(year))
    elif year < 2021:
        df1 = pd.read_excel("/kaggle/input/atp-mens/ATP_Data/{}.xlsx".format(year))
    else:
        df1 = pd.read_excel("/kaggle/input/tennis-2021/2021.xlsx")
    df2 = pd.read_csv("/kaggle/input/atp-masters-tennis-dataset/atp_matches_{}.csv".format(year))
    
    # Converting dates into two columns: year and month -> year will always be the same as the iteration variable 
    # but let's do things right just in case
    df1["Year"] = df1["Date"].dt.year
    df1["Month"] = df1["Date"].dt.month
    df2["year"] = df2["tourney_date"].astype(str).str[:4].astype(int)
    df2["month"] = df2["tourney_date"].astype(str).str[4:6].astype(int)
    
    # Formatting strings in player names to match the other df
    l = []
    w = []
    players_w = pd.unique(df2["winner_name"])
    players_l = pd.unique(df2["loser_name"])
    for i,row in df1.iterrows():
        winner = row["Winner"]
        loser = row["Loser"]
        
        w1 = process.extract(winner, players_w)
        l1 = process.extract(loser, players_l)
        
        if len(w1) > 0:
            w.append(w1[0][0])
        else:
            w.append("")
            
        if len(l1) > 0:
            l.append(l1[0][0])
        else:
            l.append("")
            
    df1["winner"] = w
    df1["loser"] = l
    
    d[year] = [df1, df2]

### 4.2. Join datasets

In [None]:
for year in years:
    d[year] = d[year][0].merge(d[year][1], left_on=["Month", "Year", "winner", "loser"], right_on=["month", "year", "winner_name", "loser_name"])

As expected, we've lost some matches in the process: 13.24% of them on average. Is that a lot? Is it not? I'd say it's higher than ideal but still good enough.
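The cell above doesn't show how that share was measured. One possible way, sketched here on made-up data, is to run the same merge as a left join with pandas' `indicator=True` option and count the rows that found no partner. The toy frames and the `lost_share` name are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the odds file (df1) and the stats file (df2)
odds = pd.DataFrame({
    "Year": [2021, 2021, 2021],
    "Month": [1, 1, 2],
    "winner": ["Roger Federer", "Rafael Nadal", "Novak Djokovic"],
    "loser": ["Andy Murray", "Dominic Thiem", "Stan Wawrinka"],
})
stats = pd.DataFrame({
    "year": [2021, 2021],
    "month": [1, 1],
    "winner_name": ["Roger Federer", "Rafael Nadal"],
    "loser_name": ["Andy Murray", "Dominic Thiem"],
})

# A left join keeps unmatched rows; "_merge" says where each row came from
merged = odds.merge(
    stats,
    how="left",
    left_on=["Year", "Month", "winner", "loser"],
    right_on=["year", "month", "winner_name", "loser_name"],
    indicator=True,
)
lost_share = (merged["_merge"] == "left_only").mean()
print("Share of matches lost in the join: {:.2%}".format(lost_share))
```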

### 4.3. Remove unwanted columns

First of all, not all dataframes have the same columns, so we must add consistency here. Let's quickly go over all of them to see which has the lowest number of columns.

In [None]:
for year in d:
    print("{} has {} columns".format(year, len(d[year].columns)))

2021 is the one with the fewest columns. Let's use it as a basis and see if the other years have the same ones.

Then, of all the columns available right now, I'll be removing some which I think won't have any relevance. I'll be keeping some of these, though, for a better understanding of the data.

I'll make sure all available games have been completed (not finished by injury or other causes), keep only those rows in which ATP ranks and ATP points coincide in both dfs, and remove those rows without odds.

Lastly, I'll be creating one single dataframe from them all, so we have all the data in one single data structure.

In [None]:
cols = d[2021].columns.tolist() # Choosing 2021 because it's the one with the fewest columns

for year in d:
    cols = [col for col in cols if col in d[year].columns]
print("Final number of columns: {}".format(len(cols)))

# Unwanted columns
remove = ["ATP", "Location", "Tournament", "Date", "Series", "Best of", "Winner", "Loser", "Year", "Month", 
          "tourney_id", "tourney_name", "surface", "draw_size", "tourney_level", "tourney_date", "match_num",
          "winner_id", "winner_entry", "winner_seed", "winner", "loser", "winner_ioc", "loser_id", "loser_seed",
          "loser_entry", "loser_ioc", "score", "best_of", "round", "winner_rank", "loser_rank", "winner_rank_points",
          "loser_rank_points", "Comment"
         ]
wanted_cols = [col for col in cols if col not in remove]

# Making sure ranks and points coincide, also checking that the game had been completed and odds are not missing.
for year in d:
    d[year] = d[year][(d[year]["WRank"] == d[year]["winner_rank"]) & (d[year]["LRank"] == d[year]["loser_rank"]) & 
            (d[year]["WPts"] == d[year]["winner_rank_points"]) & (d[year]["LPts"] == d[year]["loser_rank_points"])
            & (d[year]["Comment"] == "Completed") & (~d[year]["B365W"].isna()) & (~d[year]["B365L"].isna())
           ][wanted_cols]

# Creating a single dataframe for all data, using just the desired columns
df = pd.concat([d[year] for year in d])
df.reset_index(drop=True, inplace=True)
df = df[['Court', 'Surface', 'Round', 'year', 'month', 'minutes', 
         'winner_name', 'WRank', 'WPts', 'W1', 'W2', 'W3', 'W4', 'W5', 'Wsets', 'B365W', 'winner_hand', 'winner_ht',
         'winner_age', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon', 'w_SvGms', 'w_bpSaved', 
         'w_bpFaced',
         'loser_name', 'LRank', 'LPts', 'L1', 'L2', 'L3', 'L4', 'L5', 'Lsets', 'B365L', 'loser_hand', 'loser_ht', 
         'loser_age', 'l_ace', 'l_df', 'l_svpt', 'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 
         'l_bpFaced']]

# I'll keep working with the same variable, df, all the time (aesthetics). It's a good practice to save 
# the current work in another variable, just in case I mess up later.
df_v1 = df.copy()

df.head()

### 4.4. Reorganize dataset
I think it's a good time to start shaping the data the way we'll need it. Instead of having winner and loser, I'll create player1 and player2. For future purposes, I'll make player1 the winner half of the time and player2 the other half (keeping it as it is now, a model would learn that the first player always wins, and that would be a huge mistake).

Obviously, I'll need an extra column telling which player won: "p1_win".
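The alternation can also be written without a row loop. This is only a sketch on a made-up four-row frame with two column pairs, not the cell used in this notebook (that one follows right after); the `swap` mask and the toy names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

matches = pd.DataFrame({
    "winner_name": ["A", "C", "E", "G"],
    "loser_name": ["B", "D", "F", "H"],
    "WRank": [1, 5, 10, 3],
    "LRank": [20, 2, 30, 40],
})

# Even rows keep the winner as player 1, odd rows swap the two players
swap = np.arange(len(matches)) % 2 == 1

out = pd.DataFrame({
    "p1_name": np.where(swap, matches["loser_name"], matches["winner_name"]),
    "p2_name": np.where(swap, matches["winner_name"], matches["loser_name"]),
    "p1_Rank": np.where(swap, matches["LRank"], matches["WRank"]),
    "p2_Rank": np.where(swap, matches["WRank"], matches["LRank"]),
    "p1_win": (~swap).astype(int),
})
print(out)
```

The result has player 1 winning exactly every other match, which is the same 1, 0, 1, 0 pattern the mean-based loop produces.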

In [None]:
data = []
p1_wins = []

inverted_cols = df.columns[:6].tolist() + df.columns[28:].tolist() + df.columns[6:28].tolist()

final_cols = ['Court', 'Surface', 'Round', 'year', 'month', 'minutes', 
              'p1_name', 'p1_Rank', 'p1_Pts', 'p1_1', 'p1_2', 'p1_3', 'p1_4', 'p1_5', 'p1_sets', 'p1_B365', 
              'p1_hand', 'p1_ht', 'p1_age', 'p1_ace', 'p1_df', 'p1_svpt', 'p1_1stIn', 'p1_1stWon', 'p1_2ndWon', 
              'p1_SvGms', 'p1_bpSaved', 'p1_bpFaced',
              'p2_name', 'p2_Rank', 'p2_Pts', 'p2_1', 'p2_2', 'p2_3', 'p2_4', 'p2_5', 'p2_sets', 'p2_B365', 
              'p2_hand', 'p2_ht', 'p2_age', 'p2_ace', 'p2_df', 'p2_svpt', 'p2_1stIn', 'p2_1stWon', 'p2_2ndWon', 
              'p2_SvGms', 'p2_bpSaved', 'p2_bpFaced']

for i,row in tqdm(df.iterrows()):
    if len(p1_wins) == 0 or statistics.mean(p1_wins) <= 0.5: # More 0 than 1 (or same) -> we add 1 (that means, player1 is the one who won)
        p1_wins.append(1)
        data.append(row.tolist())
    else:
        p1_wins.append(0)
        data.append(row[inverted_cols].tolist())
        
df = pd.DataFrame(data, columns = final_cols)
df["p1_2"] = df["p1_2"].astype(float)
df["p1_3"] = df["p1_3"].replace(" ", np.nan)
df["p1_3"] = df["p1_3"].astype(float)
df["p2_2"] = df["p2_2"].astype(float)
df["p2_3"] = df["p2_3"].replace(" ", np.nan)
df["p2_3"] = df["p2_3"].astype(float)
df["p1_win"] = p1_wins
df

## 5. DATA PROCESSING

This is where stuff starts getting interesting. In a bit, you'll see me cleaning data, creating new features...

### 5.1. Checking consistency in Odds

In [None]:
# Visualize them
for i,row in df.iterrows():
    if (1/row["p2_B365"]) + (1/row["p1_B365"]) < 1:
        print(row["p2_B365"], row["p1_B365"])
    if row["p2_B365"]<1 or row["p1_B365"]<1:
        print(row["p2_B365"], row["p1_B365"])
        
# Drop the rows where either odd is not above 1 (copy so later assignments act on an independent frame)
df = df[(df["p2_B365"]>1) & (df["p1_B365"]>1)].copy()

All good now. Let's generate some more advanced features.

### 5.2. Feature generation

If you're someone who loves tennis and loves stats, I'm sure you're familiar with the website Ultimate Tennis Statistics. If you're not, feel free to check it out (it's a website full of tennis stats, as you may have guessed).

I'm referring to this service because it has advanced stats I can use for this analysis. Specifically, there's this [glossary](https://www.ultimatetennisstatistics.com/glossary) containing the formulas, some of which I'll be implementing right now. Don't worry if something's not clear, it's just code performing simple mathematical computations to generate these advanced features.

In [None]:
# First, turn the missing values into 0 within the games-won-per-set columns
# (assigning back instead of chaining inplace calls, which newer pandas versions may not apply)
set_cols = ["p1_3", "p1_4", "p1_5", "p2_3", "p2_4", "p2_5"]
df[set_cols] = df[set_cols].fillna(0)

# 1st Serve Effectiveness
df["p1_1stWon%"] = df["p1_1stWon"] / df["p1_1stIn"]
df["p1_2ndWon%"] = df["p1_2ndWon"] / (df["p1_svpt"] - df["p1_1stIn"])
df["p1_1stServeEffectiveness"] = df["p1_1stWon%"]/df["p1_2ndWon%"]

df["p2_1stWon%"] = df["p2_1stWon"] / df["p2_1stIn"]
df["p2_2ndWon%"] = df["p2_2ndWon"] / (df["p2_svpt"] - df["p2_1stIn"])
df["p2_1stServeEffectiveness"] = df["p2_1stWon%"]/df["p2_2ndWon%"]

# Return to Service Points Ratio 
df["p1_Ret2ServPtsRatio"] = df["p2_svpt"] / df["p1_svpt"]
df["p2_Ret2ServPtsRatio"] = df["p1_svpt"] / df["p2_svpt"]

# Point Dominance Ratio
df["p1_ServeWon%"] = (df["p1_1stWon"] + df["p1_2ndWon"]) / df["p1_svpt"]
df["p2_ServeWon%"] = (df["p2_1stWon"] + df["p2_2ndWon"]) / df["p2_svpt"]

# A player's return points won are the points the opponent lost on serve
df["p1_ReturnWon%"] = 1 - df["p2_ServeWon%"]
df["p2_ReturnWon%"] = 1 - df["p1_ServeWon%"]

df["p1_PtsDominanceRatio"] = df["p1_ReturnWon%"] / df["p2_ReturnWon%"]
df["p2_PtsDominanceRatio"] = df["p2_ReturnWon%"] / df["p1_ReturnWon%"]

# Break Points Ratio
df["p1_BPConverted%"] = (df["p2_bpFaced"] - df["p2_bpSaved"]) / df["p2_bpFaced"]
df["p2_BPConverted%"] = (df["p1_bpFaced"] - df["p1_bpSaved"]) / df["p1_bpFaced"]

df["p1_BPRatio"] = df["p1_BPConverted%"] / df["p2_BPConverted%"]
df["p2_BPRatio"] = df["p2_BPConverted%"] / df["p1_BPConverted%"]

# Points to Sets Over-Performing Ratio
df["p1_SetWon%"] = df["p1_sets"] / (df["p1_sets"] + df["p2_sets"])
df["p1_PtsWon%"] = (df["p1_1stWon"] + df["p1_2ndWon"] + df["p2_1stIn"] - df["p2_1stWon"] + (df["p2_svpt"] - df["p2_1stIn"]) - df["p2_2ndWon"]) / (df["p1_svpt"] + df["p2_svpt"])
df["p1_Pts2Sets_OP_Ratio"] = df["p1_SetWon%"] / df["p1_PtsWon%"]

df["p2_SetWon%"] = df["p2_sets"] / (df["p1_sets"] + df["p2_sets"])
df["p2_PtsWon%"] = (df["p2_1stWon"] + df["p2_2ndWon"] + df["p1_1stIn"] - df["p1_1stWon"] + (df["p1_svpt"] - df["p1_1stIn"]) - df["p1_2ndWon"]) / (df["p1_svpt"] + df["p2_svpt"])
df["p2_Pts2Sets_OP_Ratio"] = df["p2_SetWon%"] / df["p2_PtsWon%"]

# Points to Games Over-Performing Ratio
df["p1_GmsWon%"] = (df["p1_1"] + df["p1_2"] + df["p1_3"] + df["p1_4"] + df["p1_5"]) / (df["p1_1"] + df["p1_2"] + df["p1_3"] + df["p1_4"] + df["p1_5"] + df["p2_1"] + df["p2_2"] + df["p2_3"] + df["p2_4"] + df["p2_5"])
df["p1_Pts2Gms_OP_Ratio"] = df["p1_GmsWon%"] / df["p1_PtsWon%"]

df["p2_GmsWon%"] = (df["p2_1"] + df["p2_2"] + df["p2_3"] + df["p2_4"] + df["p2_5"]) / (df["p1_1"] + df["p1_2"] + df["p1_3"] + df["p1_4"] + df["p1_5"] + df["p2_1"] + df["p2_2"] + df["p2_3"] + df["p2_4"] + df["p2_5"])
df["p2_Pts2Gms_OP_Ratio"] = df["p2_GmsWon%"] / df["p2_PtsWon%"]

# Games to Sets Over-Performing Ratio
df["p1_Gms2Sets_OP_Ratio"] = df["p1_SetWon%"] / df["p1_GmsWon%"]
df["p2_Gms2Sets_OP_Ratio"] = df["p2_SetWon%"] / df["p2_GmsWon%"]

# Break Points Over-Performing Ratio
df["p1_BPWon%"] = (df["p2_bpFaced"] - df["p2_bpSaved"] + df["p1_bpSaved"]) / (df["p1_bpFaced"] + df["p2_bpFaced"])
df["p1_BP_OP_Ratio"] = df["p1_BPWon%"] / df["p1_PtsWon%"]

df["p2_BPWon%"] = (df["p1_bpFaced"] - df["p1_bpSaved"] + df["p2_bpSaved"]) / (df["p1_bpFaced"] + df["p2_bpFaced"])
df["p2_BP_OP_Ratio"] = df["p2_BPWon%"] / df["p2_PtsWon%"]

# Break Points Saved Over-Performing Ratio
df["p1_BPSaved%"] = df["p1_bpSaved"] / df["p1_bpFaced"]
df["p1_BPSaved_OP_Ratio"] = df["p1_BPSaved%"] / df["p1_ServeWon%"]

df["p2_BPSaved%"] = df["p2_bpSaved"] / df["p2_bpFaced"]
df["p2_BPSaved_OP_Ratio"] = df["p2_BPSaved%"] / df["p2_ServeWon%"]

# Break Points Converted Over-Performing Ratio
df["p1_BPConverted_OP_Ratio"] = df["p1_BPConverted%"] / df["p1_ReturnWon%"]
df["p2_BPConverted_OP_Ratio"] = df["p2_BPConverted%"] / df["p2_ReturnWon%"]

# Extras I might need
df["p1_Ace%"] = df["p1_ace"]/df["p1_svpt"]
df["p1_DF%"] = df["p1_df"]/df["p1_svpt"]
df["p1_1stServe%"] = df["p1_1stIn"] / df["p1_svpt"]
df["p1_1stReturnWon%"] = (df["p2_1stIn"] - df["p2_1stWon"]) / df["p2_1stIn"]

df["p2_Ace%"] = df["p2_ace"]/df["p2_svpt"]
df["p2_DF%"] = df["p2_df"]/df["p2_svpt"]
df["p2_1stServe%"] = df["p2_1stIn"] / df["p2_svpt"]
df["p2_1stReturnWon%"] = (df["p1_1stIn"] - df["p1_1stWon"]) / df["p1_1stIn"]

# Upsets: the worse-ranked player (higher rank number) wins the match
df["p1_UpsetScored"] = ((df["p1_Rank"] > df["p2_Rank"]) & (df["p1_win"] == 1)).astype(int)
df["p2_UpsetScored"] = ((df["p2_Rank"] > df["p1_Rank"]) & (df["p1_win"] == 0)).astype(int)
df["p1_UpsetAgainst"] = df["p2_UpsetScored"]
df["p2_UpsetAgainst"] = df["p1_UpsetScored"]

# Rank variation
r1, r2 = [], []
for i,row in tqdm(df.iterrows()):
    year = row["year"]
    month = row["month"]
    p1 = row["p1_name"]
    p2 = row["p2_name"]
    p1_rank = row["p1_Rank"]
    p2_rank = row["p2_Rank"]
    
    if month < 7:
        year -= 1
        month = 12 + month - 6
    else:
        month -= 6
        
    try:
        aux1 = df[((df["year"] == year) & (df["month"] <= month)) | (df["year"] < year)].loc[(df["p1_name"] == p1) | (df["p2_name"] == p1)].iloc[-1]
        prev_rank1 = aux1["p1_Rank"] if aux1["p1_name"] == p1 else aux1["p2_Rank"]
        r1.append(prev_rank1 - p1_rank)
    except:
        r1.append(0)
        
    try:
        aux2 = df[((df["year"] == year) & (df["month"] <= month)) | (df["year"] < year)].loc[(df["p1_name"] == p2) | (df["p2_name"] == p2)].iloc[-1]
        prev_rank2 = aux2["p1_Rank"] if aux2["p1_name"] == p2 else aux2["p2_Rank"]
        r2.append(prev_rank2 - p2_rank)
    except:
        r2.append(0)
    
df["p1_RankVariation"] = r1
df["p2_RankVariation"] = r2

df.head()

### 5.3. Handling Null Values
Let's examine the proportion of missing values in the entire dataset:

In [None]:
pd.set_option('display.max_rows', 50)
df.isnull().sum().sort_values(ascending=False).head(50)/df.shape[0]

Columns corresponding to sets 3, 4 and 5 would have a lot of missing values if I hadn't converted them into 0. That makes sense, since some tournaments are "best of 3" and a 2-0 finishes the match without the need to play a third set. I don't really need them from now on, so I'll be getting rid of them.

Some heights are also missing. Around 21% of the games lack this feature (for the winner, the loser, or both). I could impute it but I prefer not to, it just doesn't sound natural to me. For now, I won't be removing it, but I will if it proves to have little or no correlation with the chances of winning.

Also, 0.1016% of games are missing the BPRatio feature. This is because it depends on percentage features which can be NaN due to a 0/0 division. These rows clearly call for a dropna(). Same with other advanced features I've created.
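As a small illustration of where those NaN values come from, the sketch below uses a made-up three-row frame: pandas returns NaN for 0/0 and inf for a non-zero value divided by 0, and both can be cleaned in one pass. The column names `opp_converted_pct` and `converted_pct` are hypothetical stand-ins, not columns of this dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bpFaced": [4, 0, 2],                   # break points the opponent faced
    "bpSaved": [3, 0, 1],                   # break points the opponent saved
    "opp_converted_pct": [0.5, 0.25, 0.0],  # hypothetical denominator of the ratio
})

# Row 1 is 0/0, which pandas turns into NaN instead of raising an error
toy["converted_pct"] = (toy["bpFaced"] - toy["bpSaved"]) / toy["bpFaced"]

# Row 2 divides a positive number by 0, which gives inf
toy["BPRatio"] = toy["converted_pct"] / toy["opp_converted_pct"]

# Treat inf like a missing value and drop both kinds at once
clean = toy.replace([np.inf, -np.inf], np.nan).dropna(subset=["BPRatio"])
print(toy)
print(clean)
```

Note that a plain `dropna()` leaves the inf rows in place, so they have to be handled separately.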

I'll also drop those rows with missing values in the traditional stats (aces, double faults...), which are all missing in the same rows.

Apart from these, the rest of missing values (minutes column) can be imputed using Sklearn.

In [None]:
pd.set_option('display.max_rows', 20) # Get it back to the default we established

# 1. drop set columns
df.drop(columns = ["p1_1", "p2_1", "p1_2", "p2_2", "p1_3", "p2_3", "p1_4", "p2_4", "p1_5", "p2_5"], inplace=True)

# 2. drop na
df.dropna(subset = ["p1_BPRatio", "p2_BPRatio", "p1_Gms2Sets_OP_Ratio", "p2_Gms2Sets_OP_Ratio",
                    "p1_Pts2Gms_OP_Ratio", "p2_Pts2Gms_OP_Ratio", "p1_GmsWon%", "p1_2ndWon%",
                    "p1_sets"], inplace=True)

# 3. Impute using sklearn
si = SimpleImputer(strategy = "median")
si.fit(df[["minutes"]])
df[["minutes"]] = si.transform(df[["minutes"]])

# I'll keep working with the same variable, df, all the time (aesthetics). It's a good practice to save 
# the current work in another variable, just in case I mess up later.
df_v2 = df.copy()

df.head()

Someone could think we've done enough. Data is apparently clean and is ready to be analyzed, that is true. But let's go back to the goals or the motivations I had to create this analysis: forecasting a match-winner.

If I happened to analyze all these features and how they correlated with player 1 winning the match, that'd be an error and could lead to misleading conclusions. Why? Because each row contains the data from that match, which is information we obviously don't have before the event, when we want to make the prediction.

We should, somehow, do something to use the information previous to match-time to analyze its effects on the outcome of the game. How? **Rolling averages**

### 5.4. Rolling Averages

Creating a rolling average simply consists of assigning to the current row the average of the previous X rows. We want to do it in a way that doesn't take the current row into account, and the number of games I'll be using is 30 (with at least 10 previous games required, otherwise the value is left missing).

Furthermore, I'll add the Win% feature.
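To illustrate the shift-then-roll idea on something small, here is a sketch with made-up data, a window of 3 and a minimum of 2 previous games instead of the 30 and 10 used in the cell that follows. It uses `groupby(...).transform(...)` rather than a loop over players, which is just one way to write it.

```python
import pandas as pd

toy_long = pd.DataFrame({
    "name": ["A", "A", "A", "A", "B", "B", "B"],
    "Win":  [1,   0,   1,   1,   0,   0,   1],
})

window, min_periods = 3, 2  # toy values; the notebook uses 30 and 10

# shift(1) removes the current match from its own average,
# rolling(...).mean() then averages the previous matches of the same player
toy_long["Win%"] = toy_long.groupby("name")["Win"].transform(
    lambda s: s.shift(1).rolling(window, min_periods=min_periods).mean()
)
print(toy_long)
```

The first two rows of each player stay NaN because fewer than two earlier matches are available, and no row ever sees its own result.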

In [None]:
# 1. Transform to long-format table (two rows per match, one per each player)
p1 = [col for col in df.columns if "p1_" in col and col != "p1_win"]
p2 = [col for col in df.columns if "p2_" in col]
info = [col for col in df.columns if "p1_" not in col and "p2_" not in col]

new_cols = ["Win"] + info + [col[3:] for col in p1]
l = []
for i,row in df.iterrows():
    l.append([row["p1_win"]] + row[info + p1].tolist())
    l.append([abs(1-row["p1_win"])] + row[info + p2].tolist())
    
df = pd.DataFrame(l, columns = new_cols)
players = pd.unique(df["name"])
df["Win%"] = df["Win"]

# Columns to average
nums_avg = [
    "minutes", "sets", "ace", "df", "svpt", "1stIn", "1stWon", "2ndWon", "SvGms",
    "bpSaved", "bpFaced", "1stWon%", "2ndWon%", "1stServeEffectiveness", "Ret2ServPtsRatio", "ServeWon%",
    "ReturnWon%", "PtsDominanceRatio", "BPConverted%", "BPRatio", "SetWon%", "PtsWon%", "Pts2Sets_OP_Ratio",
    "GmsWon%", "Pts2Gms_OP_Ratio", "Gms2Sets_OP_Ratio", "BPWon%", "BP_OP_Ratio", "BPSaved%", "BPSaved_OP_Ratio",
    "BPConverted_OP_Ratio", "Ace%", "DF%", "1stServe%", "1stReturnWon%", "UpsetScored", "UpsetAgainst", "Win%"
           ]
# Rolling averages 
window = 30
for player in tqdm(players):
    for col in nums_avg:
        df.loc[df["name"] == player, col] = (
            df.loc[df["name"] == player, col].shift(1).rolling(window, min_periods = 10).mean()
        )
        
df.rename(columns={"UpsetScored": "UpsetsScored%", "UpsetAgainst": "UpsetsAgainst%"}, inplace=True)
df

Instead of converting it back into a wider format, I actually think keeping it like this could be beneficial at analysis time (we don't have to distinguish between player 1 and player 2, so we have a clearer picture of how stats correlate with other features).


## 6. DATA ANALYSIS AND VISUALIZATION

### 6.1. Summary
Let's quickly run a describe() to see our data briefly summarized.

In [None]:
df.describe()

### 6.2. Categorical data

I want to start visualizing data. Starting simple, let's just see how categorical data is distributed by plotting the value-counts of each type.

In [None]:
categorical = ["Court", "Surface", "hand", "Win", "Round"]
numerical = [ 
    'Rank','Pts', 'sets', 'B365','ht', 'age', 'ace', 'df', 'svpt', 
    '1stIn', '1stWon', '2ndWon', 'SvGms', 'bpSaved', 'bpFaced', "1stServeEffectiveness", 
    "Ret2ServPtsRatio", "PtsDominanceRatio", "BPRatio", "Pts2Sets_OP_Ratio", "Pts2Gms_OP_Ratio", 
    "Gms2Sets_OP_Ratio", "BP_OP_Ratio", "BPSaved_OP_Ratio", "BPConverted_OP_Ratio", "Ace%", 
    "DF%", "1stServe%", "1stReturnWon%"
]
other = ["year", "month", "name"]

# Useful lists
res = ["Win"]

fig, axs = plt.subplots(2,2, constrained_layout=True)
fig.suptitle("Value Counts of Court, Surface, Playing Hand and Win")
sns.countplot(x="Court", data=df, palette="Set2", ax = axs[0,0])
sns.countplot(x="Surface", data=df, palette="Set2", ax = axs[0,1])
sns.countplot(x="hand", data=df, palette="Set2", ax = axs[1,0])
sns.countplot(x="Win", data=df, palette="Set2", ax = axs[1,1])

plt.show()

In [None]:
plt.rcParams["figure.figsize"] = [12,5]
fig, axs = plt.subplots(1, 3, constrained_layout=True)
fig.suptitle("Value Counts of Court, Surface, Playing Hand for both winner and loser.")
sns.countplot(x="Court", data=df, hue = "Win", palette="Set2", ax = axs[0])
sns.countplot(x="Surface", data=df, hue = "Win", palette="Set2", ax = axs[1])
sns.countplot(x="hand", data=df, hue = "Win", palette="Set2", ax = axs[2])

plt.show()

Obvious observations:
1. Most of the tournaments are played **outdoors**. Keep in mind that every match contributes one winning row and one losing row, so the win/loss split per court or surface is 50/50 by construction.
2. Hard is the most common surface, followed by clay, grass and carpet (in order). 
3. Most of the players are right-handed, just like in the real world.

Nothing surprising here, we're just getting to know the data better!

### 6.3. Numerical data

Time for a heatmap? Let's see if there's correlation between numeric variables and the target feature by taking a look at the correlation matrix.

In [None]:
# Pearson Correlation
corr_matrix = df[numerical + res].corr().sort_values(by=["Win"])
corr_matrix

Creating a heatmap for all the variables we have would create an ugly plot with a lot of numbers overlapping each other. That's not nice to see... Instead, I'll show only the variables most correlated with the player actually winning the game.

In [None]:
plt.rcParams["figure.figsize"] = [12,8]
matrix = pd.concat([corr_matrix.iloc[:3], corr_matrix.iloc[-9:]])

sns.heatmap(matrix[matrix.index], cmap='Blues', annot=True)
plt.show()

This is interesting. None of the correlations is huge, but some are above 0.2, which makes them worth looking at!

We can get good insights from here: the **Points to Games Over-Performance Ratio**, the **Points to Sets Over-Performance Ratio**, the **Games to Sets Over-Performance Ratio** and the average number of sets won are somewhat correlated with winning, as are the **Points Dominance Ratio** and, obviously, the **odds**. If the odds are the number-one feature in terms of correlation, we can tell bookies are doing a good job.

What's surprising, or at least it shocks me, is that the Win% doesn't seem very relevant, nor does any other advanced feature which is not a ratio. Curious.

Most of those stats are highly correlated with each other. And it makes sense, they share a linear relationship. Should we study that?

In [None]:
sns.pairplot(data = df[["Win", "PtsDominanceRatio", "Gms2Sets_OP_Ratio", "sets", "Pts2Gms_OP_Ratio", "Pts2Sets_OP_Ratio"]], diag_kind = 'kde', hue = "Win")
plt.show()

Well, that's very clear, right? They all show a pretty strong linear relationship, which is a sign of dependence. That's normal, as they were built as combinations of other basic features.

I've also added a hue which indicates whether a certain point resulted in a win or a loss for that player, just to see how colors were distributed throughout the graphs. To be honest, I expected them to be more distinguishable. If we look at the numbers as well as the plots, the pattern exists: in general, orange points tend to occupy more space at one extreme and blue ones at the other. But, again, it's not extremely clear.

I'd now move to feature importance but I'd first like to start encoding categorical data, so I can take them into account too.

### 6.4. Feature Importance

In [None]:
# One-hot encoding (get_dummies already adds "_" after the prefix)
surface=pd.get_dummies(df["Surface"], prefix='surface')
df = pd.concat([df,surface],axis=1)
df.drop(columns='Surface', inplace=True)

hand=pd.get_dummies(df["hand"], prefix='hand')
df = pd.concat([df,hand],axis=1)
df.drop(columns='hand', inplace=True)

playing_round=pd.get_dummies(df["Round"])
df = pd.concat([df,playing_round],axis=1)
df.drop(columns='Round', inplace=True)

df["Court"] = df["Court"].replace({'Outdoor': 1, 'Indoor': 0})

Now that we have all relevant data in numeric format, it's time to decide upon the features we'll be using for the model. There are several ways to do so, but I'm going to use one of the simplest because it will probably be enough.

I'll be using sklearn's RandomForestClassifier, which could serve as the predictive model in and of itself, but it also internally ranks features in terms of importance. That's why I'm choosing this method. Again, there are many other ways to do it, and some are probably better.

In [None]:
df.dropna(subset=["minutes", "ht"], inplace=True)
df.fillna(0, inplace=True)
df.replace([np.inf, -np.inf], 0, inplace=True)

X = df.drop(columns=["Win", "name", "year", "month"])
feature_names = X.columns.tolist()
forest = RandomForestClassifier(random_state=0)
forest.fit(X, df["Win"])

# Index both series by feature name so the error bars stay aligned after sorting
forest_importances = pd.Series(forest.feature_importances_, index=feature_names).sort_values(ascending=False)
std = pd.Series(
    np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0),
    index=feature_names)

plt.rcParams["figure.figsize"] = [15,7]
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std[forest_importances.index].values, ax=ax)
ax.set_title("Feature importances")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()

This plot is ordered in terms of importance. We see the odds being the most important by far, and then the vast majority have pretty much the same relevance. I'm glad 7 of the top 10 are the advanced, created features.

The lowest-importance features are those that correspond to the one-hot encoded columns. That makes sense and shouldn't be a reason to leave them out. Even so, I'm planning on dropping the columns corresponding to the round being played, I don't like this feature.

I'll also drop height.

Apart from that, I think we actually have enough! The data observed in our correlation visualization and what this random forest model provides is insightful and helps us determine what stats are important.

## 7. CONCLUSIONS

This project has given us some useful insights in terms of which stats and features may be the most important when it comes to forecasting a match-winner in tennis. Most of the tasks were focused on data preparation and processing, but we also used some visualizations to better understand our data and get the results we were looking for.

Odds are, by far, the most important feature. And it makes sense, bookies are right around 70% of the time! I was also surprised by how relevant the player's rank and points are. I conclude from this that ATP points are solid and significant, pretty well computed.

Apart from that, we've already seen some more advanced ratios and how they correlate with the player's chances of winning, as well as some other more advanced stats. 

Age seems to be a considerable factor, more so than many features one might have expected to be important (like Ace%, Double Faults...).

Lastly, I want to highlight my surprise at seeing that the Win% within the last 30 games seems negligible. Also, upset-related stats aren't very important either.


## 8. IMPROVEMENTS & NEXT STEPS
There are several things I could have done. I wanted to keep this as simple as possible, even though it ended up being quite long.

What I've restricted myself from doing is adding a lot more new features, like the **implied probability** (which comes from the odds), **player fatigue**...
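For reference, the implied probability is simply the inverse of the decimal odds. Because the bookmaker keeps a margin, the two inverses add up to more than 1, so they are usually rescaled. The sketch below uses proportional rescaling, which is only one of several possible conventions; the `implied_probabilities` helper and the example odds are made up for illustration.

```python
def implied_probabilities(odds_p1, odds_p2):
    """Turn two decimal odds into probabilities that sum to 1 (hypothetical helper)."""
    raw_p1, raw_p2 = 1 / odds_p1, 1 / odds_p2
    overround = raw_p1 + raw_p2  # above 1 when the bookmaker keeps a margin
    return raw_p1 / overround, raw_p2 / overround, overround

prob_p1, prob_p2, overround = implied_probabilities(1.50, 2.60)
print("p1: {:.3f}, p2: {:.3f}, overround: {:.3f}".format(prob_p1, prob_p2, overround))
```

Applied to the `B365` columns, this would give one extra feature per player that is already on a 0 to 1 scale.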

I could have also studied the number of previous games we'd be performing the rolling averages over. I chose 30 but it could have been any other number. For better results, we should aim to find the number that optimizes our future results.

The next steps are clear: build the model and find a betting strategy to make it profitable.

## 9. REFERENCES
The data comes from two sources:
* [Jeff Sackmann](https://github.com/JeffSackmann/tennis_atp): he has an amazing set of files with almost everything one would need. I used some of his files to get the stats per match.
* [Tennis-data.co.uk](http://www.tennis-data.co.uk/alldata.php): It was used to get the information related to odds as well as the match results and other qualitative information.

I also used [Ultimate Tennis Statistics](https://www.ultimatetennisstatistics.com/glossary) to create new features and learn more about tennis itself.
