<a href="https://colab.research.google.com/github/shishirnarwal/tennis_prediction_model/blob/main/01_data_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Exploring Jeff Sackmann's Tennis Dataset

In [None]:
# Loading ATP matches for 2023
import pandas as pd

atp_matches_2023 = 'https://raw.githubusercontent.com/shishirnarwal/tennis_atp_jeff_sackman/refs/heads/master/atp_matches_2023.csv'

try:
    matches_df = pd.read_csv(atp_matches_2023)
    print("CSV loaded successfully!")
    print(matches_df.head())
except Exception as e:
    print(f"Error loading CSV from GitHub: {e}")

In [None]:
# Loading ATP rankings at end of 2023
atp_rankings = 'https://raw.githubusercontent.com/shishirnarwal/tennis_atp_jeff_sackman/refs/heads/master/atp_rankings_current.csv'

try:
    rank_df = pd.read_csv(atp_rankings)
    print("CSV loaded successfully!")
    print(rank_df.head())
except Exception as e:
    print(f"Error loading CSV from GitHub: {e}")

In [None]:
# Loading ATP players details
atp_players = 'https://raw.githubusercontent.com/shishirnarwal/tennis_atp_jeff_sackman/refs/heads/master/atp_players.csv'

try:
    players_df = pd.read_csv(atp_players)
    print("CSV loaded successfully!")
    print(players_df.head())
except Exception as e:
    print(f"Error loading CSV from GitHub: {e}")

In [None]:
# Joining ATP rankings with ATP players
rank_df_joined = rank_df.merge(players_df, left_on='player', right_on='player_id')
rank_df_joined.head()
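
The merge above is an inner join by default, so any ranking row whose `player` value has no match in `player_id` is dropped silently. A minimal sketch of how this could be checked, using small hypothetical frames (`rank_toy`, `players_toy`, and their values are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frames; only the key column names mirror the merge above.
rank_toy = pd.DataFrame({"player": [101, 102, 999], "rank": [1, 2, 3]})
players_toy = pd.DataFrame({"player_id": [101, 102], "name_last": ["A", "B"]})

# how="left" keeps every ranking row; indicator=True adds a "_merge" column
# recording whether a matching player row was found.
checked = rank_toy.merge(
    players_toy,
    left_on="player",
    right_on="player_id",
    how="left",
    indicator=True,
    validate="many_to_one",  # raises if player_id is duplicated in players_toy
)

unmatched = checked[checked["_merge"] == "left_only"]
print(f"Ranking rows without a player record: {len(unmatched)}")
```

Running the same check on `rank_df` and `players_df` would show whether the inner join loses any ranking rows.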

In [None]:
# Checking for retirements (scores containing 'RET')
print(f"Number of matches with retirements: {matches_df['score'].str.contains('RET', na=False).sum()}")

In [None]:
# Checking for missing winner and loser ranks
print(f"Number of matches with missing winner rank: {matches_df['winner_rank'].isna().sum()}")
print(f"Number of matches with missing loser rank: {matches_df['loser_rank'].isna().sum()}")

In [None]:
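# Removing retirements (scores containing 'RET') and matches with a missing winner/loser rank
matches_df_cleaned = matches_df.dropna(subset=['winner_rank', 'loser_rank'])
# .copy() so that later column assignments do not trigger SettingWithCopyWarning
matches_df_cleaned = matches_df_cleaned[~matches_df_cleaned['score'].str.contains('RET', na=False)].copy()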
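
The filter above matches on the substring `RET`, which marks retirements. A rough sketch of how retirements and walkovers could be flagged separately is shown below. The score strings are hypothetical, and the use of `W/O` as the walkover marker is an assumption that should be checked against the actual `score` column.

```python
import pandas as pd

# Hypothetical score strings; "W/O" as the walkover marker is an assumption.
scores = pd.Series(["6-4 6-3", "6-2 2-1 RET", "W/O", None, "7-6(5) 3-0 RET"])

# na=False treats a missing score as "no match" instead of propagating NaN.
is_retirement = scores.str.contains("RET", na=False)
# regex=False matches the literal text "W/O".
is_walkover = scores.str.contains("W/O", na=False, regex=False)

print(f"Retirements: {is_retirement.sum()}, walkovers: {is_walkover.sum()}")

# Keep only scores that exist and are neither a retirement nor a walkover.
completed = scores[~(is_retirement | is_walkover) & scores.notna()]
print(completed.tolist())
```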

In [None]:
matches_df_cleaned.head()

Calculating Win Rates by Surface

In [None]:
# Create flag for whether higher ranked player won
matches_df_cleaned['Favorite_won'] = (matches_df_cleaned['winner_rank'] < matches_df_cleaned['loser_rank']).astype(int)

# Group by Surface
surface_stats = matches_df_cleaned.groupby('surface').agg({
    'match_num': 'count',
    'Favorite_won': 'mean',
    'winner_rank': 'mean',
    'loser_rank': 'mean'
}).round(3)

surface_stats

**Key Finding: Surface Predictability**
   
Favorite win rates by surface:
- Hard: 64.0% (n=1,641)
- Clay: 62.3% (n=872)  
- Grass: 63.3% (n=316)
   
**Insight:** Surface type has minimal impact on match predictability: favorite win rates span less than 2 percentage points (62.3% to 64.0%). Ranking alone provides a ~63% baseline accuracy across all surfaces.
   
**Limitation:** The grass court sample (n=316) is about 5x smaller than the hard court sample (n=1,641), so the grass estimate is noisier.
This will likely reduce model accuracy for grass court predictions.
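
To put the sample-size limitation in numbers, the sketch below computes an approximate 95% Wilson score interval for each surface from the win rates and counts reported above. The helper `wilson_interval` is introduced here for illustration and is not part of the original analysis.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Favorite win rates and match counts as reported above.
surfaces = {"Hard": (0.640, 1641), "Clay": (0.623, 872), "Grass": (0.633, 316)}

intervals = {name: wilson_interval(p, n) for name, (p, n) in surfaces.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

The grass interval comes out roughly twice as wide as the hard court interval, and all three intervals overlap, which is consistent with surface having little measurable effect on the favorite's win rate in this sample.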

Ranking vs Win Rates

In [None]:
# Calculate ranking difference
matches_df_cleaned['rank_diff'] = abs(matches_df_cleaned['winner_rank'] - matches_df_cleaned['loser_rank'])

# Create bins
bins = [0, 10, 25, 50, 100, float('inf')]
labels = ['1-10', '11-25', '26-50', '51-100', '100+']
matches_df_cleaned['rank_diff_bin'] = pd.cut(matches_df_cleaned['rank_diff'], bins=bins, labels=labels)

# Calculate favorite win% by rank difference bin
fav_win_prob = matches_df_cleaned.groupby('rank_diff_bin')['Favorite_won'].mean()

# Plot
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
fav_win_prob.plot(kind='bar')
plt.title('Higher-Ranked Player Win Probability by Ranking Gap')
plt.ylabel('Win Probability')
plt.xlabel('Ranking Difference')
plt.axhline(y=0.5, color='r', linestyle='--', label='Coin flip')
plt.legend()
plt.ylim(0.4, 0.8)
plt.tight_layout()
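
`pd.cut` uses right-closed intervals by default, so each bin includes its upper edge. The sketch below runs the same `bins` and `labels` on a few hypothetical ranking gaps that sit on or next to the edges, to show where boundary values land.

```python
import pandas as pd

bins = [0, 10, 25, 50, 100, float("inf")]
labels = ["1-10", "11-25", "26-50", "51-100", "100+"]

# Hypothetical ranking gaps chosen to sit on or next to the bin edges.
gaps = pd.Series([1, 10, 11, 25, 26, 100, 101])
binned = pd.cut(gaps, bins=bins, labels=labels)

print(pd.DataFrame({"rank_diff": gaps, "rank_diff_bin": binned}))
```

A gap of exactly 100 falls in `51-100`, so the `100+` bin only contains gaps of 101 or more.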

In [None]:
print(fav_win_prob)
matches_df_cleaned['rank_diff_bin'].value_counts(normalize=True).sort_index()

**Key Finding: Ranking Predictability**
   
Favorite win rates by rank difference:
- 1-10: 57.4% (12.7% of matches)
- 11-25: 56.1% (19.0% of matches)  
- 26-50: 63.5% (22.9% of matches)
- 51-100: 63.9% (24.6% of matches)
- 100+: 73.3% (20.8% of matches)
   
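**Insight:** Ranking difference has a predictable but non-linear impact on win predictability: the favorite's win rate stays near 56-57% for gaps up to 25, rises to about 64% for gaps of 26-100, and reaches 73.3% for gaps above 100.
   
**Limitation:** Closely ranked matchups (rank difference of 25 or less) will be difficult to predict and would require in-depth feature engineering.
This will likely reduce model accuracy for matches between closely ranked players.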
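
As a quick consistency check, the per-bin win rates can be combined into an overall favorite win rate by weighting each bin by its share of matches. The sketch below uses only the rounded figures reported above, so the result is approximate.

```python
# Favorite win rate and share of matches per bin, as reported above.
bins_reported = {
    "1-10": (0.574, 0.127),
    "11-25": (0.561, 0.190),
    "26-50": (0.635, 0.229),
    "51-100": (0.639, 0.246),
    "100+": (0.733, 0.208),
}

# Divide by the total share in case the rounded shares do not sum to exactly 1.
total_share = sum(share for _, share in bins_reported.values())
overall = sum(rate * share for rate, share in bins_reported.values()) / total_share

print(f"Overall favorite win rate implied by the bins: {overall:.1%}")
```

The weighted average comes out at about 63.5%, in line with the ~63% ranking-only baseline from the surface breakdown.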

In [None]:
print(matches_df_cleaned['rank_diff_bin'].value_counts().sort_index())