# Data Analysis Practice on 2021 data

## Brainstorming of Questions
1. Are there more break points in clay court matches?
2. Is there a correlation between tournament level and length of match?
3. Is there a correlation between length of match and difference between player rankings?
4. Is there a correlation between first-serve-in percentage and break points faced?
5. Is there a correlation between player height and number of aces?
6. Who had the best tie break record in the calendar year?
7. Is there a correlation between tie-break win percentage and end-of-year ranking? (I think the end-of-year ranking would need to come from somewhere else.)
8. Is there a correlation between difference in total games won and difference in first serve in percentage (per match)?


In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from typing import Union

df = pd.read_csv('data/atp_matches_2021.csv')
df = df.astype({'tourney_date':'string'})
df.tourney_date = pd.to_datetime(df.tourney_date)
pd.set_option('display.max_columns', None)
df = df.sort_values(by =['tourney_date', 'match_num'])
df = df.reset_index(drop=True)  # drop=True avoids keeping the old index as an extra 'index' column

## Dealing with the Score

### Total games won by winner and loser

The score for each set can be in one of 6 formats:
1. 6-x
2. x-6
3. 7-5
4. 5-7
5. 7-6(x)
6. (x)6-7

- From these formats, winner_games_won and loser_games_won can be calculated. Conveniently, the winner's games are always quoted first in every set, including sets the winner lost, so once a set is split on the hyphen the first number belongs to the winner and the second to the loser.

- [Edit]: Inevitably, some alternative formats turned up, such as `4-6 6-3 [7-10]`, where a first-to-10 tie break decides the match instead of a third set (usually in doubles). These bracketed scores are not games, so they are skipped when counting.

- Now that I have the number of games won by each player, I can combine it with the number of break points faced, the number of break points saved and the total number of service games to find how many service games each player won and lost.
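The last point can be sketched as follows. This is a minimal illustration, assuming the dataset's columns `w_SvGms`, `w_bpFaced` and `w_bpSaved` (and their `l_` counterparts) hold service games played, break points faced and break points saved. Every break point that was not saved costs one service game, so service games lost is faced minus saved. The helper name `add_service_game_columns`, the output column names and the toy row are all hypothetical.

```python
import pandas as pd


def add_service_game_columns(matches: pd.DataFrame) -> pd.DataFrame:
    """Add service games lost and won for the winner ('w') and loser ('l')."""
    out = matches.copy()
    for p in ('w', 'l'):
        # Each unsaved break point is a service game lost (a break of serve)
        out[f'{p}_sv_gms_lost'] = out[f'{p}_bpFaced'] - out[f'{p}_bpSaved']
        # The remaining service games were held
        out[f'{p}_sv_gms_won'] = out[f'{p}_SvGms'] - out[f'{p}_sv_gms_lost']
    return out


# Toy row standing in for the real data: a 6-4 6-3 match
toy = pd.DataFrame({
    'w_SvGms': [10], 'w_bpFaced': [4], 'w_bpSaved': [3],
    'l_SvGms': [9], 'l_bpFaced': [6], 'l_bpSaved': [3],
})
result = add_service_game_columns(toy)
print(result[['w_sv_gms_won', 'w_sv_gms_lost', 'l_sv_gms_won', 'l_sv_gms_lost']])
```

In the toy row the winner holds 9 service games and breaks 3 times, which adds up to the 12 games of a 6-4 6-3 win. That identity only holds for matches without tie breaks, since a tie break counts as a game but not as a service game.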

In [69]:
import re

def get_winner_games_won(score: str) -> int:
    """Takes in the score for the match and returns the number of games won by the winner.

    Args:
        score (str): The score as a string

    Returns:
        int: The winner's total games won
    """
    w_games = 0
    for set_score in score.split():
        # Skip retirements, walkovers, defaults and match tie breaks such as [7-10]
        if 'R' in set_score or 'W' in set_score or 'Def.' in set_score or '[' in set_score:
            continue
        # Drop the tie-break points in parentheses, e.g. '7-6(5)' -> '7-6',
        # so a tie-break set is read like any other set (winner's games first)
        games = re.sub(r'\(\d+\)', '', set_score).split('-')
        w_games += int(games[0])
    return w_games

def get_loser_games_won(score: str) -> int:
    """Takes in the score for the match and returns the number of games won by the loser.

    Args:
        score (str): The score as a string

    Returns:
        int: The loser's total games won
    """
    l_games = 0
    for set_score in score.split():
        # Skip retirements, walkovers, defaults and match tie breaks such as [7-10]
        if 'R' in set_score or 'W' in set_score or 'Def.' in set_score or '[' in set_score:
            continue
        # Drop the tie-break points in parentheses before splitting on the hyphen
        games = re.sub(r'\(\d+\)', '', set_score).split('-')
        l_games += int(games[1])
    return l_games

In [70]:
df['w_games'] = df['score'].apply(get_winner_games_won)
df['l_games'] = df['score'].apply(get_loser_games_won)

### Tie breaks 

The score can also be used to tally the number of tie breaks in each match and who won them. That tally can then show, among other things, who had the best tie break record of the season.
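A per-match tally could be sketched as below. It assumes that a set decided by a tie break is the only kind of set token containing parentheses, and that the winner's games are quoted first, as in the formats listed earlier. The function name `count_tie_breaks` is hypothetical, and bracketed match tie breaks such as `[10-8]` are deliberately left out of the count.

```python
import re
from typing import Tuple


def count_tie_breaks(score: str) -> Tuple[int, int]:
    """Return (tie breaks won by the match winner, tie breaks won by the match loser)."""
    w_tb = 0
    l_tb = 0
    for set_score in score.split():
        # Only sets decided by a tie break carry the points in parentheses
        if '(' not in set_score:
            continue
        # Remove the tie-break points, leaving e.g. '7-6' or '6-7'
        w, l = re.sub(r'\(\d+\)', '', set_score).split('-')
        if int(w) > int(l):
            w_tb += 1
        else:
            l_tb += 1
    return w_tb, l_tb


print(count_tie_breaks('7-6(5) 6-7(4) 6-3'))
```

Applied with `df['score'].apply(count_tie_breaks)`, this would give one pair per match, which could then be summed per player to build a season record.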