In [1]:
NAME = "Amira Isenberg"
COLLABORATORS = ""

# Instructions

1. Make sure you have filled out your "NAME" and "COLLABORATORS" (if any) in the previous cell.

2. You should complete all code/markdown cells that state "YOUR CODE HERE" or "YOUR ANSWER HERE". 
   
3. Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

4. Partial credit can be obtained if your solution approach is clear and documented in comments within the implementation.

5. You should follow good coding practices. Your code should use type hints and be robust against invalid inputs, and you should write a few test cases to check correctness, particularly for edge cases.  


## Problem 1

Write a Python function that computes the third power of a number if the number is odd, and the square of the number if the number is even.

In [2]:
def parity_power(x: int) -> int:
    """Return x squared if x is even, x cubed if x is odd."""
    if x % 2 == 0:  # even
        return x ** 2
    return x ** 3  # odd

In [15]:
assert parity_power(0) == 0
assert parity_power(1) == 1
assert parity_power(2) == 4
assert parity_power(3) == 27
assert parity_power(-1) == -1
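
The instructions ask for code that is robust against invalid inputs. One way this could look for `parity_power` is sketched below. The name `parity_power_checked` and the choice to reject `bool` and non-integer values with a `TypeError` are assumptions made for illustration, not part of the assignment.

```python
def parity_power_checked(x: int) -> int:
    """Return x**2 for even x and x**3 for odd x, rejecting non-integers."""
    # bool is a subclass of int, so it is excluded explicitly (an assumption)
    if isinstance(x, bool) or not isinstance(x, int):
        raise TypeError(f"expected an int, got {type(x).__name__}")
    return x ** 2 if x % 2 == 0 else x ** 3


# edge cases: zero, negative values, and invalid input
assert parity_power_checked(0) == 0
assert parity_power_checked(-2) == 4
assert parity_power_checked(-3) == -27
try:
    parity_power_checked(2.5)
except TypeError as err:
    print(f"rejected: {err}")
```

Negative odd numbers work because Python's `%` returns a non-negative remainder for a positive modulus, so `-3 % 2` is `1`.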

## Problem 2

An important part of this course is to complete an ML-based project over the course of the semester.  We are going to start right now!

Read the description of the project uploaded on Canvas and write down 2 potential ideas for your project.  

I am writing my honors thesis on Naive Bayes Classifiers, so I would like to be able to use that model in this project.
Potential ideas:
1. Using a medical dataset, I can design a model that predicts the likelihood of specific diagnoses (for example, the dataset located at https://www.kaggle.com/datasets/joebeachcapital/cirrhosis-patient-survival-prediction).
2. Using a dataset containing housing market data, I can predict the price category of new houses/apartments being sold (for example, the dataset located at https://www.kaggle.com/datasets/nelgiriyewithana/new-york-housing-market).

## Problem 3

This problem will test some of your programming skills and data wrangling ability.  

We will start with a dataset from 'https://github.com/JeffSackmann/tennis_atp' which covers information on Tennis matches from the past few decades.  

This dataset is provided by Jeff Sackmann.

Write a function *download_data_by_year* which takes the year as an input and returns a pandas dataframe with the data from the file of the form `atp_matches_{year}.csv`.  Implement a cache so that we do not download the data if it is already in the cache. 

**Hint**: The url you need to use to access the files in github is https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{year}.csv

In [4]:
import pandas as pd
import requests 
import os

# Given a year, return a pandas DataFrame with the data from the
# file of the form atp_matches_{year}.csv, using a local file cache.
def download_data_by_year(year: int) -> pd.DataFrame:
    # the cache is a CSV file in the working directory
    file_path = f"atp_matches_{year}.csv"

    # if cached, read from cache
    if os.path.exists(file_path):
        return pd.read_csv(file_path)

    # otherwise, download from the URL and save to cache
    url = f'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{year}.csv'
    response = requests.get(url)

    # if the request fails (e.g. there is no file for that year), raise an error
    if response.status_code != 200:
        raise ValueError(f"Failed to download data for year: {year}")

    # save data to cache
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(response.text)

    # read data into dataframe and return it
    return pd.read_csv(file_path)

In [16]:
df1 = download_data_by_year(2017)
assert df1.shape == (2911, 49)

df2 = download_data_by_year(1968)
assert df2.shape == (4377, 49)

df3 = download_data_by_year(2023)
assert df3.shape == (2986, 49)

year = 1967
try:
    df4 = download_data_by_year(year) 
except Exception as e:
    print(f"Error for year: {year}")

Error for year: 1967
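
The caching logic above can be exercised without network access if the download step is passed in as a parameter. The sketch below assumes a hypothetical helper `fetch_text_with_cache` and a hypothetical `fetch` callable. Neither is part of the assignment, and the real function would still call `requests.get` and `pd.read_csv` as in the cell above.

```python
import os
import tempfile
from typing import Callable


def fetch_text_with_cache(url: str, cache_path: str, fetch: Callable[[str], str]) -> str:
    """Return the text at url, reading cache_path if it already exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as file:
            return file.read()
    text = fetch(url)  # may raise; nothing is cached in that case
    with open(cache_path, "w", encoding="utf-8") as file:
        file.write(text)
    return text


# demonstrate that the second call is served from the cache
calls = []


def fake_fetch(url: str) -> str:
    # stand-in for a real download; records each call it receives
    calls.append(url)
    return "tourney_id,surface\n2017-0001,Hard\n"


with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, "atp_matches_2017.csv")
    first = fetch_text_with_cache("https://example.invalid/2017.csv", path, fake_fetch)
    second = fetch_text_with_cache("https://example.invalid/2017.csv", path, fake_fetch)

assert first == second
assert len(calls) == 1  # the fake download ran only once
```

Passing the downloader in as an argument keeps the cache test fast and repeatable, since it does not depend on GitHub being reachable.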


## Problem 4
Download the data for years 2000-2020 (inclusive).  Compute the average number of matches played per year on each surface for each month. 

Return your solution as a dictionary where the key is the (surface, month) pair as a tuple and the value is the average.

**Hint**: You may have a case where no matches were played on a given surface in a month.  This should factor into your calculation as a zero.

In [17]:
import pandas as pd
from collections import defaultdict
from datetime import datetime
from typing import Dict, Tuple

def get_statistics() -> Dict[Tuple[str, int], float]:
    # download data for years 2000-2020 (inclusive)
    data_frames = [download_data_by_year(year) for year in range(2000,2021)]

    # concatenate data frames for all years into a single data frame
    all_data = pd.concat(data_frames, ignore_index=True)

    # convert 'tourney_date' column to datetime
    all_data['tourney_date'] = pd.to_datetime(all_data['tourney_date'], format="%Y%m%d")

    # create dictionary to store count of matches for each surface + month
    # using defaultdict so that a default value can be provided for any month with
    # no matches played
    surface_month_count = defaultdict(int)

    # iterate through each row + update count
    for index, row in all_data.iterrows():
        surface = row['surface']
        month = row['tourney_date'].month
        surface_month_count[(surface,month)] += 1

    # create dictionary to store avg number of matches for each surface + month
    surface_month_avg = {}

    # calculate average:
    for key, value in surface_month_count.items():
        surface_month_avg[key] = value / len(data_frames)

    return surface_month_avg

In [18]:
from numpy.testing import assert_allclose
res = get_statistics()
assert_allclose(res[('Carpet', 1)], 9)
assert_allclose(res[('Carpet', 12)], 0.19047619)
assert_allclose(res[('Grass', 6)], 247)
assert_allclose(res[('Hard', 3)], 216.428571)
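
The hint about months with no matches can be checked on a small made-up example. The sketch below uses invented `(year, surface, month)` records and a hypothetical helper `average_per_year`. It suggests that dividing the total count by the number of years gives the same result as averaging the per-year counts with the missing years counted as zero.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, str, int]  # (year, surface, month); made-up toy data


def average_per_year(records: Iterable[Record], years: List[int]) -> Dict[Tuple[str, int], float]:
    """Average matches per year for each (surface, month), zeros included."""
    per_year = Counter(records)  # number of matches per (year, surface, month)
    keys = {(surface, month) for _, surface, month in per_year}
    return {
        # a Counter returns 0 for a missing key, so years without matches count as zero
        (surface, month): sum(per_year[(year, surface, month)] for year in years) / len(years)
        for surface, month in keys
    }


years = [2000, 2001, 2002]
records = [(2000, "Grass", 6), (2000, "Grass", 6), (2002, "Grass", 6), (2001, "Hard", 3)]
averages = average_per_year(records, years)

# Grass in June: 2 + 0 + 1 matches over 3 years
assert averages[("Grass", 6)] == 1.0
# same value as the total count divided by the number of years
assert averages[("Grass", 6)] == 3 / len(years)
```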

## Problem 5

Determine the BEST 5 players of all time.  There is no definitive answer here; this is your chance to show your creativity.  Please also explain how you arrived at your rankings.  You are free to use web resources to support your answer, but you MUST cite them as you use them.

You should answer questions like:

1. How did you define BEST?
2. Where do you believe your analysis is flawed?
3. What could you do to improve your analysis?

In [11]:
from collections import defaultdict
from typing import List, Dict

# calculate player statistics - specifically the number of Grand Slam match wins
# (each row of the data is one match, so this counts matches won, not titles)
def calculate_player_statistics(data_frames: List[pd.DataFrame]) -> Dict[str, Dict[str, int]]:
    # dictionary to store player statistics
    player_statistics = defaultdict(dict)

    # iterate through each dataframe
    for df in data_frames:
        for index, row in df.iterrows():
            winner_name = row['winner_name']

            # check if the match was played at a Grand Slam
            if row['tourney_level'] == 'G':
                # update Grand Slam match wins for the winner
                stats = player_statistics[winner_name]
                stats['grand_slam_match_wins'] = stats.get('grand_slam_match_wins', 0) + 1

    return player_statistics

def determine_best_players() -> List[str]:
    # download data for years 1968-2023
    data_frames = [download_data_by_year(year) for year in range(1968, 2024)]

    # calculate player statistics - Grand Slam match wins
    player_statistics = calculate_player_statistics(data_frames)

    # sort players based on Grand Slam match wins and keep the top 5
    best_players = sorted(player_statistics.items(), key=lambda x: x[1].get('grand_slam_match_wins', 0), reverse=True)[:5]

    # extract player names
    best_player_names = [player[0] for player in best_players]

    return best_player_names

In [12]:
try:
    best_players_result = determine_best_players()
    print("Best Players Result:")
    print(best_players_result)
except Exception as e:
    print(f"Error: {e}")

Best Players Result:
['Roger Federer', 'Novak Djokovic', 'Rafael Nadal', 'Jimmy Connors', 'Andre Agassi']


I defined the best player by success at Grand Slam tournaments. Because each row of the dataset is a single match, the code above counts every match won at a Grand Slam (`tourney_level == 'G'`), so the ranking reflects Grand Slam match wins rather than titles. This analysis is flawed for newer players, who have not yet had the chance to compete in as many Grand Slams, and it doesn't take into account playing style and other subjective factors. To improve my analysis, I could count only finals won to get actual titles, and add the number of overall titles and especially win percentages for a more nuanced ranking.
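
To make the difference between match wins and titles concrete, the sketch below counts both on a few invented match rows, together with a win percentage. It assumes that the final of a tournament is marked with `round == 'F'` and that each row carries `winner_name` and `loser_name`. The player names, the rows, and the helper `summarize_players` are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# invented rows: one Grand Slam semifinal, one Grand Slam final, one lower-level final
matches: List[Dict[str, str]] = [
    {"tourney_level": "G", "round": "SF", "winner_name": "Player A", "loser_name": "Player C"},
    {"tourney_level": "G", "round": "F", "winner_name": "Player B", "loser_name": "Player A"},
    {"tourney_level": "A", "round": "F", "winner_name": "Player A", "loser_name": "Player B"},
]


def summarize_players(rows: List[Dict[str, str]]) -> Dict[str, Dict[str, float]]:
    """Count Grand Slam match wins, Grand Slam titles, and win percentage per player."""
    stats = defaultdict(lambda: {"slam_match_wins": 0, "slam_titles": 0, "wins": 0, "losses": 0})
    for row in rows:
        winner, loser = stats[row["winner_name"]], stats[row["loser_name"]]
        winner["wins"] += 1
        loser["losses"] += 1
        if row["tourney_level"] == "G":
            winner["slam_match_wins"] += 1
            # assumption: only the winner of the final ('F') earns the title
            if row["round"] == "F":
                winner["slam_titles"] += 1
    # every player here has at least one match, so the denominator is never zero
    return {
        name: dict(s, win_pct=s["wins"] / (s["wins"] + s["losses"]))
        for name, s in stats.items()
    }


summary = summarize_players(matches)
assert summary["Player A"]["slam_match_wins"] == 1
assert summary["Player A"]["slam_titles"] == 0
assert summary["Player B"]["slam_titles"] == 1
```

In this toy data Player A and Player B have the same number of Grand Slam match wins, but only Player B has a title, which is the distinction the ranking above does not capture.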