# Predicting the Winner of the Men's Singles Title at the 2025 US Open

In this project, I first review ATP match data to identify which factors contribute to winning a Men's Singles Grand Slam title. The data comes from the following Kaggle dataset:  
https://www.kaggle.com/datasets/dissfya/atp-tennis-2000-2023daily-pull/data

## Outline
The analysis proceeds in the following steps:
1. **Data Loading**: Load the ATP dataset.
2. **Data Cleaning**: Handle missing values and ensure data consistency.
3. **Feature Engineering**: Create new features that may help in predicting the winners.
4. **Exploratory Data Analysis (EDA)**: Analyze the data to find patterns and relationships.
5. **Visualization**: Use plots to visualize the data and findings.
6. **Conclusion**: Summarize the findings and insights from the EDA.
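
As a rough illustration of the feature engineering step, the sketch below derives a few match-level features from the columns this dataset provides (`Player_1`, `Player_2`, `Winner`, `Rank_1`, `Rank_2`). The helper `add_match_features`, the new column names, and the toy rows are hypothetical and only show the idea; they are not the features used later in the analysis.

```python
import pandas as pd

# Toy rows with placeholder players and made-up ranks (illustrative only).
matches = pd.DataFrame({
    "Player_1": ["Player A", "Player C", "Player E"],
    "Player_2": ["Player B", "Player D", "Player F"],
    "Winner":   ["Player A", "Player D", "Player E"],
    "Rank_1":   [10, 30, 80],
    "Rank_2":   [50, 5, 20],
})


def add_match_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with simple, hypothetical match-level features."""
    out = df.copy()
    # 1 if Player_1 won the match, 0 if Player_2 won.
    out["p1_won"] = (out["Winner"] == out["Player_1"]).astype(int)
    # Positive when Player_1 is the better-ranked (lower rank number) player.
    out["rank_diff"] = out["Rank_2"] - out["Rank_1"]
    # True when the better-ranked player won (equal ranks are not handled specially).
    out["favorite_won"] = (out["rank_diff"] > 0) == (out["p1_won"] == 1)
    return out


features = add_match_features(matches)
print(features[["p1_won", "rank_diff", "favorite_won"]])
```

A binary target such as `p1_won` keeps the data at one row per match, which is a common starting point for match-outcome models.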

## Import Necessary Libraries and Packages
First, install and import the required dependencies.

In [1]:
# Install dependencies as needed:
%pip install kagglehub[pandas-datasets]

Note: you may need to restart the kernel to use updated packages.





In [2]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

## Data Loading
Next, load the ATP dataset from Kaggle.

In [12]:
# Set the path to the file you'd like to load
file_path = "atp_tennis.csv"  # Update this to the correct file name if needed

# Load the latest version
dfatp = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "dissfya/atp-tennis-2000-2023daily-pull",
  # "dissfya/atp-tennis-daily-pull",
  file_path,
  # Provide any additional arguments like 
  # sql_query or pandas_kwargs. See the 
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

print("First 5 records:", dfatp.head())
print("File name of imported dataset:", file_path)



Downloading from https://www.kaggle.com/api/v1/datasets/download/dissfya/atp-tennis-2000-2023daily-pull?dataset_version_number=819&file_name=atp_tennis.csv...


100%|██████████| 8.37M/8.37M [00:06<00:00, 1.31MB/s]


First 5 records:                            Tournament        Date         Series    Court  \
0  Australian Hardcourt Championships  2000-01-03  International  Outdoor   
1  Australian Hardcourt Championships  2000-01-03  International  Outdoor   
2  Australian Hardcourt Championships  2000-01-03  International  Outdoor   
3  Australian Hardcourt Championships  2000-01-03  International  Outdoor   
4  Australian Hardcourt Championships  2000-01-03  International  Outdoor   

  Surface      Round  Best of        Player_1       Player_2       Winner  \
0    Hard  1st Round        3      Dosedel S.    Ljubicic I.   Dosedel S.   
1    Hard  1st Round        3      Clement A.     Enqvist T.   Enqvist T.   
2    Hard  1st Round        3       Escude N.  Baccanello P.    Escude N.   
3    Hard  1st Round        3  Knippschild J.     Federer R.   Federer R.   
4    Hard  1st Round        3     Fromberg R.  Woodbridge T.  Fromberg R.   

   Rank_1  Rank_2  Pts_1  Pts_2  Odd_1  Odd_2        Scor

In [17]:
# List every column in the dataset with its dtype
dfatp.dtypes

Tournament     object
Date           object
Series         object
Court          object
Surface        object
Round          object
Best of         int64
Player_1       object
Player_2       object
Winner         object
Rank_1          int64
Rank_2          int64
Pts_1           int64
Pts_2           int64
Odd_1         float64
Odd_2         float64
Score          object
dtype: object
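
The listing above shows `Date` stored as `object` (plain strings), which blocks date arithmetic and year-based grouping. A minimal sketch of one way to fix this, using a small stand-in frame called `sample` (an assumption; the real data lives in `dfatp`) and a derived `Year` column that the original dataset does not contain:

```python
import pandas as pd

# Stand-in frame with a few of the dataset's columns; dates follow the
# YYYY-MM-DD format seen in the notebook output.
sample = pd.DataFrame({
    "Tournament": ["US Open", "US Open", "Wimbledon"],
    "Date": ["2000-08-28", "2001-08-27", "2002-06-24"],
    "Surface": ["Hard", "Hard", "Grass"],
})

# Parse the string dates into real timestamps.
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# Derive a calendar year for later grouping (hypothetical helper column).
sample["Year"] = sample["Date"].dt.year

# Low-cardinality text columns can be stored as categoricals to save memory.
sample["Surface"] = sample["Surface"].astype("category")

print(sample.dtypes)
```

The same three statements should apply to `dfatp` directly, assuming every value in `Date` follows the same format.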

In [19]:
# Create a frequency table of tournaments with key characteristics
tournament_freq = (
    dfatp.groupby(['Tournament', 'Date', 'Series', 'Court', 'Surface'])
         .size()
         .reset_index(name='Frequency')
         .sort_values('Frequency', ascending=False)
)

# Display the first 10 rows
print("Tournament Frequency Table:")
print(tournament_freq.head(10))

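# Get summary statistics
# Note: len(tournament_freq) counts unique (Tournament, Date, Series, Court, Surface)
# combinations, not distinct tournament names.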
print("\nTotal unique tournaments:", len(tournament_freq))
print("\nDistribution by surface:")
print(dfatp['Surface'].value_counts())
print("\nDistribution by series:")
print(dfatp['Series'].value_counts())
print("\nDistribution by Tournament:")
print(dfatp['Tournament'].value_counts())

Tournament Frequency Table:
            Tournament        Date      Series    Court Surface  Frequency
912    Australian Open  2001-01-15  Grand Slam  Outdoor    Hard        125
10021          US Open  2001-08-27  Grand Slam  Outdoor    Hard        124
10020          US Open  2000-08-28  Grand Slam  Outdoor    Hard        124
911    Australian Open  2000-01-17  Grand Slam  Outdoor    Hard        124
10620        Wimbledon  2002-06-24  Grand Slam  Outdoor   Grass        124
10618        Wimbledon  2000-06-26  Grand Slam  Outdoor   Grass        123
913    Australian Open  2002-01-14  Grand Slam  Outdoor    Hard        122
3689       French Open  2002-05-27  Grand Slam  Outdoor    Clay        122
3688       French Open  2001-05-28  Grand Slam  Outdoor    Clay        121
10619        Wimbledon  2001-06-25  Grand Slam  Outdoor   Grass        121

Total unique tournaments: 11009

Distribution by surface:
Surface
Hard      35419
Clay      21389
Grass      7444
Carpet     1632
Name: count, dty
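
Because the frequency table groups on the exact `Date`, a single tournament can be split across several rows when its matches carry different dates. One possible alternative, sketched here on made-up rows, is to count matches per tournament edition by grouping on the tournament name and calendar year. The frame `toy`, the `Year` and `Matches` columns, and the dates are all illustrative assumptions.

```python
import pandas as pd

# Made-up rows: one tournament can span several match dates within a year.
toy = pd.DataFrame({
    "Tournament": ["US Open", "US Open", "US Open", "Wimbledon"],
    "Date": ["2023-08-28", "2023-08-29", "2024-08-26", "2023-07-03"],
    "Series": ["Grand Slam"] * 4,
})

# Derive the calendar year from the match date.
toy["Year"] = pd.to_datetime(toy["Date"]).dt.year

# One row per (Tournament, Year) edition with its match count.
editions = (
    toy.groupby(["Tournament", "Year"])
       .size()
       .reset_index(name="Matches")
       .sort_values(["Tournament", "Year"])
       .reset_index(drop=True)
)

# Narrow down to the tournament this project is about.
us_open = editions[editions["Tournament"] == "US Open"]
print(us_open)
```

Grouping by year assumes no tournament edition crosses a calendar-year boundary, which should be checked against the real data before relying on it.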