# Exploring Tennis Attributes and Tournament Outcomes

#### Authors: A.J. Madison, Robert Silber, and Spencer Tillman

#### Overview: This project explores the attributes of tennis players and how they placed in various tournaments throughout their careers.

## Literature Review

Literature review

## Project Background

Information about purpose of the project

## Environment Setup and Data Gathering



In [None]:
#If needed, get requirements.txt file for all required packages

#!curl -o requirements.txt "https://raw.githubusercontent.com/spencer130/COMP4447_FinalProject/main/requirements.txt?token=GHSAT0AAAAAABY4YJJYIHTJ3OVAL4ADFD4KY22WJUA"
#!pip install -r requirements.txt

In [1]:
#Import python packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


#### Next, we will import our data from a GitHub repository that publishes ATP match data as one csv file per year. We will read these files directly into pandas data frames.



In [42]:
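#Retrieve data from github
#Note: the 1995 file is the futures-level file (atp_matches_futures_1995.csv); 1996-1999 are the atp_matches files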

url_1 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_futures_1995.csv'
url_2 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1996.csv'
url_3 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1997.csv'
url_4 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1998.csv'
url_5 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1999.csv'

tennis_1 = pd.read_csv(url_1)
tennis_2 = pd.read_csv(url_2)
tennis_3 = pd.read_csv(url_3)
tennis_4 = pd.read_csv(url_4)
tennis_5 = pd.read_csv(url_5)

#Combine the data across multiple years into one data frame and add a year column
tennis_df = pd.concat([tennis_1, tennis_2, tennis_3, tennis_4, tennis_5], join='inner')
tennis_df['year'] = [x[:4] for x in tennis_df['tourney_id']]

tennis_df.sample(10)

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points,year
1314,1996-329,Tokyo,Hard,56,A,19960415,43,102358,3.0,,...,24.0,11.0,9.0,4.0,7.0,8.0,2298.0,47.0,840.0,1996
6422,1995-M-SA-EGY-02A-1995b,Egypt 2 Masters 2,Clay,32,S,19950722,4,102526,7.0,,...,,,,,,944.0,6.0,,,1995
9119,1995-M-SA-GRE-01A-1995b,Greece Masters 2,Carpet,24,S,19950916,21,103018,1.0,,...,,,,,,665.0,19.0,782.0,11.0,1995
1376,1996-410,Monte Carlo Masters,Clay,56,M,19960422,50,101611,,,...,13.0,10.0,8.0,3.0,9.0,29.0,1077.0,33.0,997.0,1996
5597,1995-M-SA-ITA-03A-1995a,Italy 3 Masters 1,Clay,32,S,19950701,18,101876,,,...,,,,,,355.0,83.0,397.0,68.0,1995
630,1996-408,Milan,Carpet,32,A,19960226,22,101964,4.0,,...,18.0,11.0,9.0,0.0,4.0,6.0,2567.0,42.0,855.0,1996
1339,1996-410,Monte Carlo Masters,Clay,56,M,19960422,13,101320,,WC,...,32.0,15.0,9.0,10.0,13.0,45.0,871.0,164.0,248.0,1996
8976,1995-M-SA-USA-05A-1995c,USA 5 Masters 3,Hard,48,S,19950915,19,103185,,,...,,,,,,,,576.0,28.0,1995
6069,1995-M-SA-COL-01A-1995a,Colombia Masters 1,Clay,32,S,19950708,22,102152,3.0,,...,,,,,,332.0,,831.0,,1995
1259,1997-410,Monte Carlo Masters,Clay,56,M,19970421,32,102845,6.0,,...,32.0,14.0,12.0,10.0,19.0,8.0,2156.0,31.0,1222.0,1997
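
The five separate `read_csv` calls above can also be written as a loop. The following is a minimal sketch, assuming the same file naming pattern as the URLs above; the helper names `build_match_urls` and `load_matches` are hypothetical and not part of the original notebook.

```python
import pandas as pd

BASE_URL = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/'

def build_match_urls(years, futures_years=()):
    """Return one csv URL per year (hypothetical helper).

    Years listed in futures_years use the 'atp_matches_futures_' prefix,
    all other years use the 'atp_matches_' prefix.
    """
    urls = []
    for year in years:
        prefix = 'atp_matches_futures_' if year in futures_years else 'atp_matches_'
        urls.append(f'{BASE_URL}{prefix}{year}.csv')
    return urls

def load_matches(urls):
    """Read every csv and stack the rows, keeping only the shared columns."""
    frames = [pd.read_csv(url) for url in urls]
    combined = pd.concat(frames, join='inner', ignore_index=True)
    #The first four characters of tourney_id hold the year
    combined['year'] = combined['tourney_id'].str[:4]
    return combined

#Same five files as above: futures for 1995, atp_matches for 1996-1999
urls = build_match_urls(range(1995, 2000), futures_years={1995})
#tennis_df = load_matches(urls)  #uncomment to download the data
```

Passing `ignore_index=True` is an optional change that gives every row a unique index label; the original cell keeps each file's own row numbers, which is why index values can repeat across years.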


## Data Cleaning

#### There is a lot of data in each csv file. To begin, we will clean the data and put it into a format that is more useful for our analysis. This is necessary before we start exploring the data.

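#### We will start by subsetting the data to the columns that will be used in the analysis. The columns we need are:
* tourney_name: the name of the tournament
* surface: the court surface the match was played on (for example Hard, Clay, or Carpet)
* winner_id: the ID of the player who won the match
* winner_seed: the winner's seed in the current tournament
* winner_rank: the winner's ATP ranking at the time of the tournament
* year: the year of the tournament, taken from the first four characters of tourney_id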

In [52]:
#Subset the data

tennis_df_subset = tennis_df[['tourney_name', 'surface', 'winner_id', 'winner_seed', 'winner_rank', 'year']].copy()
tennis_df_subset.sample(10)

Unnamed: 0,tourney_name,surface,winner_id,winner_seed,winner_rank,year
1001,Estoril,Clay,102845,2.0,17.0,1998
3312,Tour Finals,Hard,101736,,1.0,1999
1334,Davis Cup G2 R1: CIV vs EGY,Hard,102279,,809.0,1998
1498,Mexico 1 Masters 4,Hard,101612,3.0,371.0,1995
1476,Hamburg Masters,Clay,103163,,126.0,1997
158,Australian Open,Hard,102563,,63.0,1997
343,Zagreb,Carpet,101964,1.0,10.0,1996
2345,Cincinnati Masters,Hard,102770,,93.0,1999
3351,Ostrava,Carpet,102796,7.0,27.0,1997
1653,Coral Springs,Clay,101318,4.0,56.0,1996


#### Next, we need to find and address all null values in the data. To do so, we begin by counting how many there are in each column.

In [None]:
#Count null values
tennis_df_subset.isna().sum()

#### We have null values in our winner seed and winner rank columns. Here, a null seed means the player was unseeded in that tournament, and a null rank means the player has no recorded ranking. We will replace these nulls with zeroes to mark players without a seed or rank.

In [54]:
#Replace null values with zero
tennis_df_subset['winner_seed'] = tennis_df_subset['winner_seed'].fillna(0)
tennis_df_subset['winner_rank'] = tennis_df_subset['winner_rank'].fillna(0)
tennis_df_subset.isna().sum()

tourney_name    0
surface         0
winner_id       0
winner_seed     0
winner_rank     0
year            0
dtype: int64
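
Filling with zero is simple, but a zero can be mistaken for a real value once the column is used in arithmetic or a model. As an alternative, here is a minimal sketch on a small made-up data frame (the values and the `is_seeded` column are illustrative assumptions, not part of the analysis above) that records which rows were missing before filling.

```python
import numpy as np
import pandas as pd

#Small made-up sample with the same column names as tennis_df_subset
sample = pd.DataFrame({
    'winner_id': [101, 102, 103, 104],
    'winner_seed': [3.0, np.nan, 1.0, np.nan],
    'winner_rank': [8.0, 126.0, np.nan, 63.0],
})

#Record which winners were seeded before the nulls are overwritten
sample['is_seeded'] = sample['winner_seed'].notna()

#Fill both columns in one call by passing a dict of column -> fill value
sample = sample.fillna({'winner_seed': 0, 'winner_rank': 0})

print(sample.isna().sum())
print(sample)
```

Keeping a flag column like this means later steps can still tell an unseeded player apart from a numeric seed, even though the seed column itself no longer contains nulls.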

#### Now that the data is subsetted and the null values are addressed, we need to check that each column has an appropriate data type.

In [48]:
#Check the data types

tennis_df_subset.dtypes

tourney_name     object
winner_id         int64
winner_seed     float64
year             object
dtype: object

#### Most of the data types look appropriate. However, the seed does not need to be a float, since every seed is a whole number. We can convert it to an integer type to be more accurate.

In [49]:
#Convert winner_seed from float to int (nulls were already replaced with zero)
tennis_df_subset['winner_seed'] = tennis_df_subset['winner_seed'].astype(int)
tennis_df_subset.dtypes

tourney_name    object
winner_id        int64
winner_seed      int64
year            object
dtype: object
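
Casting a float column to `int` silently truncates any fractional part, so it can be worth confirming that every value is whole first. Below is a minimal sketch on made-up values (the sample data is an assumption for illustration), which also converts the string `year` column to an integer.

```python
import pandas as pd

#Made-up sample; in the notebook this would be tennis_df_subset
sample = pd.DataFrame({
    'winner_seed': [3.0, 0.0, 1.0, 7.0],
    'year': ['1996', '1995', '1997', '1999'],
})

#Check that no seed has a fractional part before casting
all_whole = (sample['winner_seed'] % 1 == 0).all()
if all_whole:
    sample['winner_seed'] = sample['winner_seed'].astype(int)

#year was built by slicing a string, so it is stored as object; cast to int
sample['year'] = sample['year'].astype(int)

print(sample.dtypes)
```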

## Exploratory Data Analysis

#### After cleaning our data, we can begin exploring it. We will then visualize our data to get a better understanding.

In [50]:
#How many tournaments are in the data
tournament_names = tennis_df_subset['tourney_name'].unique()
len(tournament_names)

#What is the range of ranks in the tournaments?


#Wins per player
#tennis_df_subset['winner_id'].value_counts()
#Wins per player in each tournament
#pd.crosstab(tennis_df_subset['tourney_name'], tennis_df_subset['winner_id'])

809
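
The questions left open in the cell above (range of ranks, wins per player) can be answered with a few pandas calls. This is a sketch on a small made-up data frame that reuses the notebook's column names; the rows are invented for illustration only.

```python
import pandas as pd

#Made-up sample with the same column names as tennis_df_subset
sample = pd.DataFrame({
    'tourney_name': ['Tokyo', 'Tokyo', 'Milan', 'Milan', 'Estoril'],
    'surface': ['Hard', 'Hard', 'Carpet', 'Carpet', 'Clay'],
    'winner_id': [101, 102, 101, 101, 103],
    'winner_rank': [8, 47, 8, 8, 0],
})

#Number of distinct tournaments
n_tournaments = sample['tourney_name'].nunique()

#Range of ranks, ignoring the zeros that stand in for missing ranks
ranked = sample.loc[sample['winner_rank'] > 0, 'winner_rank']
rank_min, rank_max = ranked.min(), ranked.max()

#Wins per player: each row is one match won by winner_id
wins_per_player = sample['winner_id'].value_counts()

#Wins per player in each tournament (rows: tournament, columns: player)
wins_by_tourney = pd.crosstab(sample['tourney_name'], sample['winner_id'])

print(n_tournaments, rank_min, rank_max)
print(wins_per_player)
print(wins_by_tourney)
```

Because missing ranks were replaced with zero during cleaning, the rank range is computed only over rows with a rank above zero; otherwise the minimum would always be zero.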

## Feature Engineering

Look for any feature engineering opportunities to build on the existing data

## Linear Regression

#### Having completed our EDA, we now turn to linear regression models. Our goal is a model that helps identify which player attributes contribute to or impede player performance. To begin, let's start with simple linear regression. Our outcome variable will be 
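
As a reference point, simple linear regression fits a line y = intercept + slope * x by least squares. The sketch below uses `numpy.polyfit` on made-up numbers; the choice of predictor (a player's rank) and outcome (a player's number of match wins) is purely a placeholder assumption for illustration, not the variables chosen for this project.

```python
import numpy as np

#Made-up values: x is a player's rank, y is that player's number of wins
x = np.array([1.0, 5.0, 10.0, 20.0, 50.0])
y = np.array([60.0, 52.0, 45.0, 30.0, 12.0])

#Least-squares fit of y = intercept + slope * x (degree-1 polynomial)
slope, intercept = np.polyfit(x, y, deg=1)

#R^2: share of the variance in y explained by the fitted line
predicted = intercept + slope * x
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f'slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}')
```

With real data, the same fit could be run on columns of the cleaned data frame once the predictor and outcome are decided.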

## Conclusion

Findings

## Next Steps

Any follow on analysis that could be performed

## References

#### Tennis databases, files, and algorithms by Jeff Sackmann / Tennis Abstract is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
#### Based on a work at https://github.com/JeffSackmann.