
# 🛠️ Feature Engineering: Rank-Only Model

This notebook prepares a clean feature set to test **how well ATP rank alone predicts match outcomes**.



## 📥 Load Cleaned Match Data (2020–2023)

In this step, we load ATP match data for 2020 through 2023 from the yearly raw files. We keep only the match date, the player names, and their pre-match rankings, and we drop matches where either ranking is missing, so we can test how much predictive power rankings alone provide.


In [4]:
import numpy as np
import pandas as pd

years = [2020, 2021, 2022, 2023]
columns = ["tourney_date", "winner_name", "loser_name", "winner_rank", "loser_rank"]
dfs = []

for year in years:
    df = pd.read_csv(f"../../data/raw/atp_matches_{year}.csv")
    # Keep only the columns needed for the rank-only baseline. The copy avoids
    # a SettingWithCopyWarning on the assignments below.
    df = df[columns].copy()
    # Drop matches where either player has no ranking.
    df = df.dropna(subset=["winner_rank", "loser_rank"])
    df["winner_rank"] = df["winner_rank"].astype(int)
    df["loser_rank"] = df["loser_rank"].astype(int)
    df["year"] = year
    dfs.append(df)

match_data = pd.concat(dfs, ignore_index=True)
match_data.head()

| | tourney_date | winner_name | loser_name | winner_rank | loser_rank | year |
|---|---|---|---|---|---|---|
| 0 | 20200106 | Novak Djokovic | Rafael Nadal | 2 | 1 | 2020 |
| 1 | 20200106 | Roberto Bautista Agut | Dusan Lajovic | 10 | 34 | 2020 |
| 2 | 20200106 | Novak Djokovic | Daniil Medvedev | 2 | 5 | 2020 |
| 3 | 20200106 | Dusan Lajovic | Karen Khachanov | 34 | 17 | 2020 |
| 4 | 20200106 | Rafael Nadal | Alex De Minaur | 1 | 18 | 2020 |
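
Before building features, it can help to check how often the better-ranked player wins in the loaded data, since that share is roughly what a rank-only model can hope to exploit. The sketch below is an illustration only: `favorite_win_rate` is a hypothetical helper, and the toy frame just copies the five rows shown above rather than the full dataset.

```python
import pandas as pd

# Toy frame copied from the five rows shown above (not the full dataset).
toy_matches = pd.DataFrame({
    "winner_name": ["Novak Djokovic", "Roberto Bautista Agut", "Novak Djokovic",
                    "Dusan Lajovic", "Rafael Nadal"],
    "loser_name": ["Rafael Nadal", "Dusan Lajovic", "Daniil Medvedev",
                   "Karen Khachanov", "Alex De Minaur"],
    "winner_rank": [2, 10, 2, 34, 1],
    "loser_rank": [1, 34, 5, 17, 18],
})


def favorite_win_rate(df):
    """Share of matches won by the player with the smaller (better) rank number."""
    # Ignore matches where both players have the same rank.
    decided = df[df["winner_rank"] != df["loser_rank"]]
    return float((decided["winner_rank"] < decided["loser_rank"]).mean())


toy_favorite_rate = favorite_win_rate(toy_matches)
print(f"Favorite won {toy_favorite_rate:.0%} of the toy matches")
```

On the real data the same call would be `favorite_win_rate(match_data)`; five rows are far too few to say anything about the actual rate.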


## 🔁 Randomly Assign Player A and Player B
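In the raw data the winner always appears in the `winner_*` columns, so a model could learn the outcome from column position alone. To avoid this positional bias, we randomly assign one player to be "Player A" and the other to be "Player B" for each match.

The target variable `player_A_won` is 1 if Player A is the actual winner and 0 otherwise. Because the assignment is a fair coin flip, roughly half of the rows end up with `player_A_won = 1`, and neither position is favored.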


In [6]:
def assign_players_rank_only(df, seed=42):
    # Seed NumPy's global RNG so the assignment is reproducible.
    np.random.seed(seed)
    # True -> the winner becomes Player A; False -> the loser becomes Player A.
    mask = np.random.rand(len(df)) < 0.5

    pairwise = df[["tourney_date", "year"]].copy()
    pairwise["player_A_name"] = np.where(mask, df["winner_name"], df["loser_name"])
    pairwise["player_B_name"] = np.where(mask, df["loser_name"], df["winner_name"])
    pairwise["player_A_rank"] = np.where(mask, df["winner_rank"], df["loser_rank"])
    pairwise["player_B_rank"] = np.where(mask, df["loser_rank"], df["winner_rank"])
    # Player A won exactly when the winner was placed in position A.
    pairwise["player_A_won"] = mask.astype(int)

    return pairwise

pairwise_rank_only = assign_players_rank_only(match_data)
pairwise_rank_only.head()

| | tourney_date | year | player_A_name | player_B_name | player_A_rank | player_B_rank | player_A_won |
|---|---|---|---|---|---|---|---|
| 0 | 20200106 | 2020 | Novak Djokovic | Rafael Nadal | 2 | 1 | 1 |
| 1 | 20200106 | 2020 | Dusan Lajovic | Roberto Bautista Agut | 34 | 10 | 0 |
| 2 | 20200106 | 2020 | Daniil Medvedev | Novak Djokovic | 5 | 2 | 0 |
| 3 | 20200106 | 2020 | Karen Khachanov | Dusan Lajovic | 17 | 34 | 0 |
| 4 | 20200106 | 2020 | Rafael Nadal | Alex De Minaur | 1 | 18 | 1 |
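
One way to confirm that the random assignment lost no information is to rebuild the winner from the pairwise rows and compare it with the original data. The sketch below assumes a hypothetical helper, `check_pairwise`, and uses toy frames copied from the rows shown above; it is not part of the original pipeline.

```python
import numpy as np
import pandas as pd


def check_pairwise(original, pairwise):
    """Hypothetical helper: verify the A/B assignment against the source rows."""
    won = pairwise["player_A_won"].to_numpy().astype(bool)
    # Rebuild the winner's name: Player A where A won, Player B otherwise.
    rebuilt_winner = np.where(won, pairwise["player_A_name"], pairwise["player_B_name"])
    names_ok = bool((rebuilt_winner == original["winner_name"].to_numpy()).all())
    # Share of rows where Player A won; expected to be near 0.5 on a large sample.
    return names_ok, float(won.mean())


# Toy frames copied from the first five rows of each table above.
toy_original = pd.DataFrame({
    "winner_name": ["Novak Djokovic", "Roberto Bautista Agut", "Novak Djokovic",
                    "Dusan Lajovic", "Rafael Nadal"],
})
toy_pairwise = pd.DataFrame({
    "player_A_name": ["Novak Djokovic", "Dusan Lajovic", "Daniil Medvedev",
                      "Karen Khachanov", "Rafael Nadal"],
    "player_B_name": ["Rafael Nadal", "Roberto Bautista Agut", "Novak Djokovic",
                      "Dusan Lajovic", "Alex De Minaur"],
    "player_A_won": [1, 0, 0, 0, 1],
})

names_ok, share_A_won = check_pairwise(toy_original, toy_pairwise)
print(names_ok, share_A_won)
```

With the full data, `check_pairwise(match_data, pairwise_rank_only)` should report matching names and a share close to 0.5. A share far from 0.5 would suggest the shuffle is not behaving like a fair coin flip.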


## 🧮 Feature: Rank Difference Only

We calculate the difference in rankings as Player B's rank minus Player A's rank. This single numeric feature will drive our baseline model. Because a smaller rank number means a better-ranked player:
- A positive `rank_diff` means Player A is higher-ranked (has the smaller rank number).
- A negative value means Player B is higher-ranked.

In [7]:
pairwise_rank_only["rank_diff"] = pairwise_rank_only["player_B_rank"] - pairwise_rank_only["player_A_rank"]
pairwise_rank_only.head()

| | tourney_date | year | player_A_name | player_B_name | player_A_rank | player_B_rank | player_A_won | rank_diff |
|---|---|---|---|---|---|---|---|---|
| 0 | 20200106 | 2020 | Novak Djokovic | Rafael Nadal | 2 | 1 | 1 | -1 |
| 1 | 20200106 | 2020 | Dusan Lajovic | Roberto Bautista Agut | 34 | 10 | 0 | -24 |
| 2 | 20200106 | 2020 | Daniil Medvedev | Novak Djokovic | 5 | 2 | 0 | -3 |
| 3 | 20200106 | 2020 | Karen Khachanov | Dusan Lajovic | 17 | 34 | 0 | 17 |
| 4 | 20200106 | 2020 | Rafael Nadal | Alex De Minaur | 1 | 18 | 1 | 17 |
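
To make the sign convention concrete, the sketch below applies a simple rule to the five rows shown above: predict that Player A wins whenever `rank_diff` is positive. The rule and the `predicted_A_won` column are assumptions made for illustration; they are not the baseline model trained in the next notebook.

```python
import pandas as pd

# Toy frame copied from the five rows shown above (not the full dataset).
toy_features = pd.DataFrame({
    "player_A_rank": [2, 34, 5, 17, 1],
    "player_B_rank": [1, 10, 2, 34, 18],
    "player_A_won": [1, 0, 0, 0, 1],
})
toy_features["rank_diff"] = toy_features["player_B_rank"] - toy_features["player_A_rank"]

# Hypothetical rule for illustration: pick Player A whenever rank_diff > 0,
# i.e. whenever Player A has the smaller (better) rank number.
toy_features["predicted_A_won"] = (toy_features["rank_diff"] > 0).astype(int)
toy_accuracy = float(
    (toy_features["predicted_A_won"] == toy_features["player_A_won"]).mean()
)

print(toy_features[["rank_diff", "predicted_A_won", "player_A_won"]])
print(f"Accuracy on the toy rows: {toy_accuracy:.2f}")
```

The rule gets three of the five toy rows right. It misses the two upsets: Djokovic (rank 2) beating Nadal (rank 1), and Lajovic (rank 34) beating Khachanov (rank 17).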


## ✅ Save Feature Dataset (Rank Only)

Finally, we save the feature matrix so that the next notebook can train a rank-only baseline model on it.


In [9]:
pairwise_rank_only.to_csv("../../data/processed/features_rank_match_winner_rank_only.csv", index=False)
print("✅ Saved rank-only feature matrix for baseline modeling.")


✅ Saved rank-only feature matrix for baseline modeling.
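
A quick round-trip check can confirm that the saved file reloads with the same columns and values. The sketch below writes two toy rows to a temporary directory instead of the real output path; the file name `rank_only_demo.csv` is a hypothetical choice for this example.

```python
import os
import tempfile

import pandas as pd

# Toy feature rows copied from the table above (not the full dataset).
toy_features = pd.DataFrame({
    "tourney_date": [20200106, 20200106],
    "year": [2020, 2020],
    "player_A_name": ["Novak Djokovic", "Dusan Lajovic"],
    "player_B_name": ["Rafael Nadal", "Roberto Bautista Agut"],
    "player_A_rank": [2, 34],
    "player_B_rank": [1, 10],
    "player_A_won": [1, 0],
    "rank_diff": [-1, -24],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "rank_only_demo.csv")  # hypothetical file name
    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the CSV is read back.
    toy_features.to_csv(demo_path, index=False)
    reloaded = pd.read_csv(demo_path)

same_columns = list(reloaded.columns) == list(toy_features.columns)
same_values = reloaded.to_dict("list") == toy_features.to_dict("list")
print(same_columns, same_values)
```

The same comparison could be run against the real file by reading `features_rank_match_winner_rank_only.csv` back and comparing it with `pairwise_rank_only`.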
