# Exploring Tennis Attributes and Tournament Outcomes

#### Authors: A.J. Madison, Robert Silber, and Spencer Tillman

#### Overview: This project is exploring the attributes of tennis players and how they placed in various tournaments throughout their career.

## Literature Review

#### Literature review

## Project Background

#### Information about purpose of the project

## Environment Setup and Data Gathering



In [None]:
#If needed, get requirements.txt file for all required packages

#!curl -0 https://raw.githubusercontent.com/spencer130/COMP4447_FinalProject/main/requirements.txt?token=GHSAT0AAAAAABY4YJJYIHTJ3OVAL4ADFD4KY22WJUA
#!pip install -r requirements.txt

In [14]:
#Import python packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import statsmodels.api as sm
import scipy


#### Next, we will be importing our data from a GitHub repository where csv files are created showing ATP tournament data annually. We will import this data directly into a pandas data frame.



In [None]:
#Retrieve data from github

url_1 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_futures_1995.csv'
url_2 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1996.csv'
url_3 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1997.csv'
url_4 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1998.csv'
url_5 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1999.csv'

tennis_1 = pd.read_csv(url_1)
tennis_2 = pd.read_csv(url_2)
tennis_3 = pd.read_csv(url_3)
tennis_4 = pd.read_csv(url_4)
tennis_5 = pd.read_csv(url_5)

#Combine the data across multiple years into one data frame and add a year column
tennis_df = pd.concat([tennis_1, tennis_2, tennis_3, tennis_4, tennis_5], join='inner')
tennis_df['year'] = [x[:4] for x in tennis_df['tourney_id']]

tennis_df.sample(10)

## Data Cleaning

#### There is a lot of data in each csv file. To begin, we will clean the data and put it into a format that is more useful for our analysis. This is necessary before we start exploring the data.

#### We will start by subsetting the data to columns that will be used in the analysis. The columns we need are:
* tourney_name: this gives us the name of the tournament
* winner_id: this shows who won the match
* winner_seed: this shows where the player ranked in the current tournament

In [None]:
#Subset the data

tennis_df_subset = tennis_df[['tourney_name', 'surface', 'winner_id', 'winner_name', 'winner_seed', 'winner_rank', 'winner_hand', 'w_ace', 'winner_age', 'loser_id', 'loser_name', 'loser_rank', 'loser_seed', 'loser_hand', 'year']].copy()
tennis_df_subset.sample(10)

#### Next, we need to find and address all null values in the data. To do so, we begin with finding how many there are in the data set.

In [None]:
#Count null values
tennis_df_subset.isna().sum()

#### We have null values in our winner seed and winner rank columns. In this instance, a null value is no rank. We will replace these nulls with zeroes to show that they do not have a rank.

In [None]:
#Replace null values with zero
tennis_df_subset['winner_seed'] = tennis_df_subset['winner_seed'].fillna(0)
tennis_df_subset['winner_rank'] = tennis_df_subset['winner_rank'].fillna(0)
tennis_df_subset['loser_rank'] = tennis_df_subset['loser_rank'].fillna(0)
tennis_df_subset['loser_seed'] = tennis_df_subset['loser_seed'].fillna(0)
tennis_df_subset['w_ace'] = tennis_df_subset['w_ace'].fillna(999)
tennis_df_subset['winner_age'] = tennis_df_subset['winner_age'].fillna(0)
tennis_df_subset.isna().sum()

####  Now that we have our data subsetted and null values are addressed, we need to check our data types to make sure they are in the correct format.

In [None]:
#Check the data types

tennis_df_subset.dtypes

#### Most of the data types seem good. However, our year should be changed to datetime. We can change the data types to be more accurate.

In [None]:
tennis_df_subset['year'] = pd.to_numeric(tennis_df_subset['year'])
tennis_df_subset.dtypes

#### Updating ages to be round numbers

In [None]:
tennis_df_subset['winner_age'] = tennis_df_subset['winner_age'].apply(lambda x: round(x))
tennis_df_subset

## Exploratory Data Analysis

#### After cleaning our data, we can begin exploring it. We will then visualize our data to get a better understanding.

In [119]:
#How many tournaments are in the data
tournament_names = tennis_df_subset['tourney_name'].unique()
print(len(tournament_names))

809


In [None]:
#What is the range of winners aces in the tournaments? Also remove values equal to 999 to not include NaN
w_ace_range = sns.histplot(x='w_ace', bins=30, data=tennis_df_subset[-(tennis_df_subset['w_ace']==999)])
print(w_ace_range)


In [None]:
#What is the range of winners ages? Also remove values equal to zero
w_age_range = sns.histplot(x='winner_age', bins=30, data=tennis_df_subset[-(tennis_df_subset['winner_age']==0)])
print(w_age_range)

#### Who has the most wins over the timeframe and how many?

In [None]:
#Most wins by a player over the timeframe
most_wins = str(tennis_df_subset['winner_name'].value_counts().max())
player_most_wins = tennis_df_subset['winner_name'].value_counts().index.tolist()[0]
print(str(player_most_wins) + ' has ' + str(most_wins) + ' wins.')

#Wins by that player over time
player = tennis_df_subset.loc[tennis_df_subset['winner_name']==tennis_df_subset['winner_name'].value_counts().index.tolist()[0]]

player_wins = player.groupby(by='year', as_index=False).count()
print(player_wins['winner_name'])

player_wins_chart = sns.barplot(x='year', y='winner_id', data=player_wins, palette='crest')
player_wins_chart.bar_label(player_wins_chart.containers[0])
player_wins_chart.set_ylabel('wins')
player_wins_chart.set_title('Wins by ' + player_most_wins + ' over the years')
player_wins_chart


## Feature Engineering

#### Look for any feature engineering opportunities to build on the existing data

In [None]:
#Biggest upsets by rank
differences = []

for row in tennis_df_subset:
    diff = tennis_df_subset['winner_rank'] - tennis_df_subset['loser_rank']
    differences.append(diff)
sns.catplot(x='surface', y='tourney_name', kind='box', data=tennis_df_subset)

## Linear Regression

#### Since we have completed our EDA, we will begin looking at linear regression models. To develop a model helping find player attributes that contribute or impede player performance, we will explore linear regression. To beign, let's start with simple linear regression. Our outcome variable will be a win for the match. Using that as our target variable, we will look at how strongly correlated other variables are to producing that outcome.

#### Diagnostic plots

#### To begin, we need to create some diagnostic plots to determine if linear regression is a good choice for our analysis. One parameter to check is normality. We can do this with both the qq plot and Shapiro-Wilks test.

In [None]:
#Test for normality using qq plot

sm.qqplot(data=tennis_df_subset['w_ace'], line='45')

In [None]:
#Test for normality using Shapiro-Wilks test

scipy.stats.shapiro(tennis_df_subset['w_ace'].sample(n=500))

#### Build regression model

#### Build a few models to look at correlations

In [None]:
#Build a simple linear regression model using OLS regression model for surface types

#Convert surface types to numbers first
surface_numbers = pd.get_dummies(tennis_df_subset['surface'])
tennis_df_regression = tennis_df_subset.join(surface_numbers)

#Create OLS regression model
x = tennis_df_regression.iloc[:, 14:18]
y = tennis_df_regression['w_ace']

lm_1 = sm.OLS(y, x)
lm_1_fit = lm_1.fit()
lm_1_fit.summary()


#### Based on the model above, we have a very low R^2 at just 0.008. This shows that there is no correlation between winner's aces and the surface type.

In [None]:
# OLS model with winner rank and winner aces

#Create OLS regression model
x = tennis_df_regression['winner_rank']
y = tennis_df_regression['w_ace']

lm_2 = sm.OLS(y, x)
lm_2_fit = lm_2.fit()
lm_2_fit.summary()

In [None]:
#Residuals Plot for winner rank and ace

sns.residplot(x='winner_rank', y='w_ace', data=tennis_df_subset)

In [None]:
#Residuals vs. Leverage Plot for winners rank

#get the influence, leverage, and residual values
lm_2_infl = lm_2_fit.get_influence()
lm_2_lev = lm_2_infl.hat_matrix_diag
lm_2_resid = lm_2_fit.resid

#Plot the results
sns.scatterplot(x=lm_2_lev, y=lm_2_resid)

In [None]:
# Ridge regression

## Conclusion

#### Findings

## Next Steps

#### Any follow on analysis that could be performed

## References

#### Tennis databases, files, and algorithms by Jeff Sackmann / Tennis Abstract is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
#### Based on a work at https://github.com/JeffSackmann.