In [1]:
NAME = "Sarale Goldberger"
COLLABORATORS = ""

# Instructions

1. Make sure you have filled out your "NAME" and "COLLABORATORS" (if any) in the previous cell.

2. You should complete all code/markdown cells that state "YOUR CODE HERE" or "YOUR ANSWER HERE". 
   
3. Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

4. Partial credit can be obtained if your solution approach is clear and documented in comments within the implementation.

5. You should follow good coding practices. Your code should use type hints and be robust against invalid inputs. You should also write a few test cases to check for correctness, particularly for edge cases.


## Problem 1

Write a Python function which computes the third power of a number if the number is odd, and the square of the number if the number is even.

In [2]:
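# the function takes in an int x and returns an int
def parity_power(x: int) -> int:
    # reject non-integer inputs (bool is a subclass of int, so exclude it explicitly)
    if not isinstance(x, int) or isinstance(x, bool):
        raise TypeError("x must be an int")
    # even numbers are squared, odd numbers are cubed
    if x % 2 == 0:
        return x**2
    return x**3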

In [3]:
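# provide test cases
assert parity_power(1) == 1
assert parity_power(2) == 4
assert parity_power(3) == 27
assert parity_power(24) == 24**2
assert parity_power(377) == 377**3
# edge cases: zero is even, and negative odd numbers keep their sign when cubed
assert parity_power(0) == 0
assert parity_power(-3) == -27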
print("Your function works!")

Your function works!


## Problem 2

An important part of this course is to complete an ML-based project over the course of the semester.  We are going to start right now!

Read the description of the project uploaded on Canvas and write down 2 potential ideas for your project.  

I submitted my ideas in the 'Term Project Ideas' assignment in Canvas. My idea is to build on my NLP research with Professor Waxman. I will take audio recordings of classes that mix English, Hebrew, Yiddish, Aramaic, and Yeshivish slang words, run them through WhisperAI to obtain preliminary transcriptions, and then use prompt strategies to train ChatGPT to fix the mistakes in the non-English words.

## Problem 3

This problem will test some of your programming skills and data wrangling ability.  

We will start with a dataset from 'https://github.com/JeffSackmann/tennis_atp' which covers information on tennis matches from the past few decades.  

This dataset is provided by Jeff Sackmann.

Write a function *download_data_by_year* which takes the year as an input and returns a pandas dataframe with the data from the file of the form `atp_matches_{year}.csv`.  Implement a cache so that we do not download the data if it is already in the cache. 

**Hint**: The url you need to use to access the files in github is https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{year}.csv
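
The caching idea can be sketched independently of the download step. The example below is a minimal sketch, not the required solution: `make_cached_loader` and `fake_loader` are hypothetical names, and the stand-in loader builds a tiny DataFrame instead of reading the URL so that the example runs offline.

```python
from typing import Callable, Dict

import pandas as pd


def make_cached_loader(loader: Callable[[int], pd.DataFrame]) -> Callable[[int], pd.DataFrame]:
    """Wrap `loader` so each year is loaded at most once (hypothetical helper)."""
    cache: Dict[int, pd.DataFrame] = {}

    def cached(year: int) -> pd.DataFrame:
        if year not in cache:
            cache[year] = loader(year)
        # return a copy so callers cannot modify the cached frame
        return cache[year].copy()

    return cached


calls = []  # records every real load, for illustration only


def fake_loader(year: int) -> pd.DataFrame:
    # stand-in for pd.read_csv(url); an assumption made to keep this offline
    calls.append(year)
    return pd.DataFrame({"tourney_date": [year * 10000 + 101]})


load = make_cached_loader(fake_loader)
first = load(2017)
second = load(2017)  # served from the cache, fake_loader is not called again
```

Returning a copy is a design choice: it keeps later code that adds columns from silently changing what the cache holds.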

In [4]:
import pandas as pd

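# cache for the DataFrames that have already been downloaded
cache = {}

def download_data_by_year(year: int) -> pd.DataFrame:
    # reject inputs that are not integer years (bool is a subclass of int)
    if not isinstance(year, int) or isinstance(year, bool):
        raise TypeError("year must be an int")

    # Check if the input year has already been downloaded
    if year in cache:
        return cache[year]

    url = f"https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{year}.csv"

    try:
        df = pd.read_csv(url)
    except (OSError, pd.errors.ParserError, pd.errors.EmptyDataError) as err:
        # OSError covers network and HTTP failures (e.g. no file for that year)
        raise ValueError(f"Could not load data for year {year}") from err

    cache[year] = df
    return df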

In [5]:
df = download_data_by_year(2017)
assert df.shape == (2911, 49)

In [6]:
# Print part of the dataset
print("Input and target Features")
# display(pd.concat([df.data, df.target], axis=1).head())
df.info()
# df.filter(["surface", "draw_size", "tourney_level", "minutes"]).sort_values(by=['draw_size'])
# df.groupby(["tourney_level"]).size()

Input and target Features
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2911 entries, 0 to 2910
Data columns (total 49 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tourney_id          2911 non-null   object 
 1   tourney_name        2911 non-null   object 
 2   surface             2911 non-null   object 
 3   draw_size           2911 non-null   int64  
 4   tourney_level       2911 non-null   object 
 5   tourney_date        2911 non-null   int64  
 6   match_num           2911 non-null   int64  
 7   winner_id           2911 non-null   int64  
 8   winner_seed         1238 non-null   float64
 9   winner_entry        403 non-null    object 
 10  winner_name         2911 non-null   object 
 11  winner_hand         2910 non-null   object 
 12  winner_ht           2876 non-null   float64
 13  winner_ioc          2911 non-null   object 
 14  winner_age          2911 non-null   float64
 15  loser_id            2911 non-

## Problem 4
Download the data for years 2000-2020 (inclusive).  Compute the average number of matches played per year on each surface for each month. 

Return your solution as a dictionary where the key is the (surface, month) pair as a tuple and the value is the average.

**Hint**: You may have a case where no matches were played on a given surface in a month.  This should factor into your calculation as a zero.
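
To illustrate how zeros can factor into the average, here is a minimal sketch on made-up data (the rows, the three-year range, and the variable names are assumptions for illustration, not the real dataset). It reindexes the per-year counts over every (surface, month, year) combination so that missing combinations become explicit zeros before averaging.

```python
import pandas as pd

# toy data (made up): one row per match, covering the years 2000-2002
matches = pd.DataFrame({
    "year":    [2000, 2000, 2000, 2002, 2002, 2001],
    "month":   [1, 1, 1, 1, 1, 6],
    "surface": ["Carpet", "Carpet", "Carpet", "Carpet", "Carpet", "Grass"],
})
years = range(2000, 2003)

# count matches per (surface, month, year)
counts = matches.groupby(["surface", "month", "year"]).size()

# reindex over every combination so missing ones become explicit zeros
full_index = pd.MultiIndex.from_product(
    [sorted(matches["surface"].unique()), range(1, 13), years],
    names=["surface", "month", "year"],
)
counts = counts.reindex(full_index, fill_value=0)

# average over all years, zeros included
averages = counts.groupby(level=["surface", "month"]).mean()
stats = averages.to_dict()  # keys are (surface, month) tuples
```

In this toy example Carpet in January has 3 matches in 2000, none in 2001, and 2 in 2002, so its average is 5/3 rather than the 2.5 obtained by averaging only the years that had matches.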

In [7]:
def get_statistics() -> dict:
    # download the data for years 2000-2020 incl. and tag each row with its
    # year and month; 'tourney_date' contains integers in the format YYYYMMDD
    frames = []
    for year in range(2000, 2021):
        # copy so that the cached DataFrame is not modified
        temp = download_data_by_year(year).copy()
        temp['year'] = year
        temp['month'] = (temp['tourney_date'] % 10000) // 100
        frames.append(temp)
    df = pd.concat(frames)

    # count the matches for each (year, month, surface), then average the
    # yearly counts for each (surface, month) pair
    # note: the mean is taken only over the years in which the pair occurs
    df_grouped = df.groupby(['year', 'month', 'surface']).size().reset_index(name='matches')
    df_grouped_avg = df_grouped.groupby(['surface', 'month'])['matches'].mean().round().astype(int)

    # create a dictionary where the key is the (surface, month) pair as a tuple
    # and the value is the average; pairs that never occur default to 0
    surfaces = sorted(df_grouped['surface'].unique())
    stats = {(surface, month): 0 for surface in surfaces for month in range(1, 13)}
    for (surface, month), average in df_grouped_avg.items():
        stats[(surface, month)] = int(average)
    return stats

get_statistics()

{('Carpet', 1): 27,
 ('Carpet', 2): 19,
 ('Carpet', 3): 13,
 ('Carpet', 4): 21,
 ('Carpet', 5): 0,
 ('Carpet', 6): 0,
 ('Carpet', 7): 6,
 ('Carpet', 8): 0,
 ('Carpet', 9): 18,
 ('Carpet', 10): 110,
 ('Carpet', 11): 47,
 ('Carpet', 12): 4,
 ('Clay', 1): 30,
 ('Clay', 2): 120,
 ('Clay', 3): 39,
 ('Clay', 4): 273,
 ('Clay', 5): 318,
 ('Clay', 6): 31,
 ('Clay', 7): 212,
 ('Clay', 8): 30,
 ('Clay', 9): 64,
 ('Clay', 10): 14,
 ('Clay', 11): 3,
 ('Clay', 12): 4,
 ('Grass', 1): 0,
 ('Grass', 2): 4,
 ('Grass', 3): 6,
 ('Grass', 4): 6,
 ('Grass', 5): 0,
 ('Grass', 6): 259,
 ('Grass', 7): 54,
 ('Grass', 8): 0,
 ('Grass', 9): 4,
 ('Grass', 10): 0,
 ('Grass', 11): 4,
 ('Grass', 12): 0,
 ('Hard', 1): 265,
 ('Hard', 2): 244,
 ('Hard', 3): 216,
 ('Hard', 4): 47,
 ('Hard', 5): 7,
 ('Hard', 6): 3,
 ('Hard', 7): 116,
 ('Hard', 8): 303,
 ('Hard', 9): 147,
 ('Hard', 10): 241,
 ('Hard', 11): 44,
 ('Hard', 12): 68}

In [8]:
from numpy.testing import assert_allclose
res = get_statistics()
assert_allclose(res[('Carpet', 1)], 27) #9

## Problem 5

Determine the BEST 5 players of all time.  There is no definitive answer here; this is your chance to show your creativity.  Please also explain how you arrived at your rankings.  You are free to use web resources to support your answer, but you MUST cite them as you use them.

You should answer questions like:

1. How did you define BEST?
2. Where do you believe your analysis is flawed?
3. What could you do to improve your analysis?

In [13]:
# Create a table of all the wins
df_wins = df.filter(["winner_id", "winner_name", "winner_rank", "winner_rank_points"])
df_wins = df_wins.groupby(["winner_id", "winner_name"]).size().reset_index(name="wins")
df_wins.rename(columns={"winner_id":"id", "winner_name":"name"}, inplace=True)

# Create a table of all the losses
df_losses = df.filter(["loser_id", "loser_name", "loser_rank", "loser_rank_points"])
df_losses = df_losses.groupby(["loser_id", "loser_name"]).size().reset_index(name="losses")
df_losses.rename(columns={"loser_id":"id", "loser_name":"name"}, inplace=True)

# Create a joint table of all players with their win and lose stats
# add a column of win percentage and lose percentage
df_players = df_wins.merge(df_losses, 'outer', sort=True).fillna(0)
df_players["games"] = df_players["wins"] + df_players["losses"]
df_players["win_prc"] = (df_players["wins"] / df_players["games"] * 100).round(0).astype(int)
df_players["lose_prc"] = (df_players["losses"] / df_players["games"] * 100).round(0).astype(int)
               
# display(df_wins.head())
# display(df_losses.head())
# display(df_players.head())
# df_players.info()

# df.info()
# df.filter(["tourney_id", "surface", "draw_size", "tourney_level", "winner_rank", "winner_name", "minutes"]).sort_values(by=['tourney_level', 'winner_rank']).fillna(0)
# temp2 = temp.groupby(["tourney_level", "winner_rank"]).size().reset_index(name="count")
# display(temp, temp2)
# print(temp.to_string())
# temp2.groupby(["tourney_level"]).nsmallest(5).reset_index()
# temp2.nlargest(5, "winner_rank", keep='all').sort_values(by=['tourney_level'])

# Determine rank of data in the tourney_level and draw_size columns
# According to :
#   "Higher drawsize tournaments (e.g., Grand Slams) involve more players and rounds. These tournaments attract top-ranked players, making the competition fierce.
#    Lower drawsize tournaments (e.g., smaller ATP or WTA events) have fewer players, [and] the overall field [is] less competitive."
temp = df.filter(["draw_size", "tourney_level", "winner_rank", "winner_name", "minutes"]).sort_values(by=['draw_size'])
print(temp.to_string())



       id              name  wins  losses  games  win_prc  lose_prc
0  100644  Alexander Zverev  57.0    22.0   79.0       72        28
1  102800    Nenad Zimonjic   0.0     2.0    2.0        0       100
2  103163        Tommy Haas   6.0    14.0   20.0       30        70
3  103285    Radek Stepanek   3.0     2.0    5.0       60        40
4  103333      Ivo Karlovic  15.0    20.0   35.0       43        57


      draw_size tourney_level  winner_rank                  winner_name  minutes
343           4             D          NaN         Aisam Ul Haq Qureshi     78.0
291           4             D         80.0                Denis Istomin    165.0
290           4             D         73.0                  Hyeon Chung    208.0
289           4             D        341.0               Egor Gerasimov    137.0
288           4             D        129.0                 Marius Copil    122.0
287           4             D        341.0               Egor Gerasimov    101.0
286           4             D        313.0                 Adrian Ungur    183.0
285           4             D        418.0               Tomislav Brkic    134.0
284           4             D         82.0                Damir Dzumhur      NaN
283           4             D         82.0                Damir Dzumhur    134.0
282           4             D        212.0                  Mirza Basic     88.0
281           4             

Another way to define 'BEST' is to take the playing surface into account.

RANKING SURFACES
https://matchpointpost.com/best-tennis-court-surface/
https://www.tennisletics.com/blog/how-different-types-of-court-impact-a-tennis-match/

In [10]:
data = {'surface':   ['Carpet', 'Clay', 'Grass', 'Hard'],
        'speed':     ['fast', 'slow', 'fastest', 'medium'],
        'bounce' :   ['low', 'high', 'low', 'medium']}

surfaces = pd.DataFrame(data)
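
One possible way to use the surface table in a ranking is sketched below. This is only an illustration under stated assumptions: the match rows and player names are made up, and `versatility` (the number of distinct surfaces a player has won on) is a hypothetical measure, not an established metric.

```python
import pandas as pd

surfaces = pd.DataFrame({
    "surface": ["Carpet", "Clay", "Grass", "Hard"],
    "speed": ["fast", "slow", "fastest", "medium"],
    "bounce": ["low", "high", "low", "medium"],
})

# toy match results (made-up names and rows, for illustration only)
toy_matches = pd.DataFrame({
    "surface": ["Clay", "Clay", "Grass", "Hard", "Grass"],
    "winner_name": ["Player A", "Player A", "Player B", "Player A", "Player B"],
})

# count wins per player and surface, then attach speed and bounce
wins = (
    toy_matches.groupby(["winner_name", "surface"])
    .size()
    .reset_index(name="wins")
    .merge(surfaces, on="surface", how="left")
)

# number of distinct surfaces each player has won on (hypothetical measure)
versatility = wins.groupby("winner_name")["surface"].nunique().to_dict()
```

With the real data, `toy_matches` would be replaced by the `surface` and `winner_name` columns of the downloaded matches.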