# 🎾 Tennis Match Prediction – Notebook 2: Feature Engineering

> 🧠 **Note:** This is our baseline feature engineering notebook. We’re keeping it focused and simple — just enough to build a clean logistic regression model. We'll introduce more features in future notebooks as our modeling goals become more complex.

## 📋 What This Notebook Does

This notebook creates a small set of clean, interpretable features that we use in our first predictive model. We'll focus on a limited but powerful set of inputs:

- 🎯 Ranking difference between players
- 🏷️ Seeding status for both winner and loser

By keeping our features minimal, we can better understand how each one contributes to model performance. More complex features (like surface, recent form, or break points) will come later.



In [29]:
import pandas as pd

# Load the 2023 ATP match data (note: this path points at the raw data folder)
file_path = "../data/raw/atp_matches_2023.csv"  # Adjust if needed
df = pd.read_csv(file_path)


## 🎯 Feature 1: Ranking Difference

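This feature captures the **ranking gap** between the two players. We subtract the winner's rank from the loser's rank (`loser_rank - winner_rank`). Because a smaller rank number means a better-ranked player:
- A positive number = winner was higher-ranked (expected win)
- A negative number = winner was lower-ranked (a potential upset); the larger the magnitude, the bigger the upset

This is useful in modeling and betting because it quantifies how mismatched the two players were on paper. If either rank is missing, the difference is `NaN`.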

In [30]:

df['ranking_diff'] = df['loser_rank'] - df['winner_rank']
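
To make the sign convention concrete, here is a minimal sketch on a toy DataFrame. The rank values are hypothetical and chosen only for illustration; the column names match the ones used above.

```python
import pandas as pd

# Toy rows with made-up ranks (not taken from the dataset)
toy = pd.DataFrame({
    "winner_rank": [9, 120, 15],
    "loser_rank": [16, 8, None],  # None becomes NaN: rank unknown
})

# Same formula as the notebook cell: loser's rank minus winner's rank
toy["ranking_diff"] = toy["loser_rank"] - toy["winner_rank"]

# Row 0: 16 - 9 = 7      (positive: higher-ranked player won)
# Row 1: 8 - 120 = -112  (negative: lower-ranked player won, an upset)
# Row 2: NaN             (a missing rank propagates to the difference)
print(toy)
```

Rows with a missing rank will need to be dropped or imputed before modeling, since most estimators do not accept `NaN` inputs.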

## 🏷️ Feature 2: Seed Indicator

Seeding reflects tournament expectations. We create a binary indicator for each player:
- `1` if the player is seeded (the seed column has a value)
- `0` otherwise (the seed column is missing)

This helps us evaluate whether unseeded players are beating expectations and whether seeds perform consistently.


In [31]:
df['winner_seeded'] = df['winner_seed'].notnull().astype(int)
df['loser_seeded'] = df['loser_seed'].notnull().astype(int)
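
As a quick illustration of how the indicator behaves, the sketch below applies the same logic to a toy DataFrame. The seed values are hypothetical; it uses `notna()`, which is an alias of `notnull()` in pandas.

```python
import numpy as np
import pandas as pd

# Toy seeds (made-up values); NaN means the player was unseeded
seeds_toy = pd.DataFrame({
    "winner_seed": [3.0, np.nan, 1.0],
    "loser_seed": [np.nan, np.nan, 8.0],
})

# A present seed maps to 1, a missing seed maps to 0
seeds_toy["winner_seeded"] = seeds_toy["winner_seed"].notna().astype(int)
seeds_toy["loser_seeded"] = seeds_toy["loser_seed"].notna().astype(int)

print(seeds_toy)
```

Note that this treats every missing seed as "unseeded", so it cannot distinguish a truly unseeded player from a row where the seed was simply not recorded.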

## 💾 Save Processed Data (Optional)

The cell below saves the updated dataset with the new features. Comment the line out if you want to skip this step:


In [32]:
df.to_csv("../data/processed/atp_matches_2023_features.csv", index=False)
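
`to_csv` does not create missing folders, so the save fails if `../data/processed/` does not exist yet. One possible way to guard against that is sketched below. `save_features` is a hypothetical helper, not part of this project, and the demo writes to a temporary folder rather than the real data directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_features(frame: pd.DataFrame, out_path) -> Path:
    """Write frame to CSV, creating the parent folder first if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(out_path, index=False)
    return out_path


# Demo on a throwaway folder with a tiny made-up frame
with tempfile.TemporaryDirectory() as tmp_dir:
    demo = pd.DataFrame({"ranking_diff": [7.0], "winner_seeded": [1]})
    saved = save_features(demo, Path(tmp_dir) / "processed" / "demo_features.csv")
    round_trip = pd.read_csv(saved)
    print(round_trip)
```

In the notebook this would be called as `save_features(df, "../data/processed/atp_matches_2023_features.csv")`.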

## 🔍 Preview Engineered Features

Let’s look at a few rows of the new features to confirm they were created properly and to understand the data we're feeding into the model.

In [33]:
from IPython.display import display

preview_cols = [
    'winner_name', 'loser_name', 'winner_rank', 'loser_rank', 'ranking_diff',
    'winner_seeded', 'loser_seeded'
]
display(df[preview_cols].head().style.set_table_attributes('style="width:100%;"')
        .set_caption("Preview of Engineered Features")
        .format(na_rep='-', precision=1))

|   | winner_name | loser_name | winner_rank | loser_rank | ranking_diff | winner_seeded | loser_seeded |
|---|---|---|---|---|---|---|---|
| 0 | Taylor Fritz | Matteo Berrettini | 9.0 | 16.0 | 7.0 | 1 | 1 |
| 1 | Frances Tiafoe | Lorenzo Musetti | 19.0 | 23.0 | 4.0 | 0 | 0 |
| 2 | Taylor Fritz | Hubert Hurkacz | 9.0 | 10.0 | 1.0 | 1 | 1 |
| 3 | Frances Tiafoe | Kacper Zuk | 19.0 | 245.0 | 226.0 | 0 | 0 |
| 4 | Stefanos Tsitsipas | Matteo Berrettini | 4.0 | 16.0 | 12.0 | 1 | 1 |
