# Predict

**INPUT**: "models/best_final_xgb_model.json"

**OUTPUT**: Predictions of matches

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree
from sklearn import tree
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from utils.updateStats import getStats, updateStats, createStats
pd.set_option('display.max_columns', None)

## Re-Calculate all the stats

First, we need to recalculate all the stats. I could have exported them in 1.CreateDataset, but I thought it would be simpler to recompute them here (the exported statistics might take up a lot of disk space).

This is fine, since it only takes about a minute on my machine. Obviously, if it took longer, I would export all the stats directly from 1.CreateDataset instead of doing this.
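
If the recomputation ever gets slow, one option is to cache the result on disk. This is a minimal sketch, assuming the stats object can be pickled (I have not checked that for the output of createStats). The helper `load_or_build` and the cache path in the usage note are hypothetical names, not part of utils.updateStats.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build_fn):
    """Return the cached object if the file exists, otherwise build it and cache it."""
    path = Path(cache_path)
    if path.exists():
        # Cache hit: skip the expensive rebuild
        with path.open("rb") as f:
            return pickle.load(f)

    # Cache miss: build once, then store the result for next time
    result = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Usage would look like `prev_stats = load_or_build("./data/prev_stats.pkl", rebuild_stats)`, where `rebuild_stats` is a function that wraps the update loop. The cache file has to be deleted by hand whenever the CSV changes.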

In [2]:
update_stats_param = {
    "k_factor": None,
    "base_k_factor": 43,
    "max_k_factor": 62,
    "div_number": 800,
    "bonus_after_layoff": True
}
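
Judging by their names, these parameters appear to control the K-factor of the Elo ratings that updateStats maintains. I am not reproducing that function here. As a reference point only, this is the textbook Elo update with a fixed K. The 400-point scale and the `k=32` default are the classic chess conventions, not values taken from this project.

```python
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(winner_rating, loser_rating, k=32.0):
    """Return the new (winner, loser) ratings after one match with a fixed K-factor."""
    # The winner gains more when the win was unexpected (low expected score)
    gain = k * (1.0 - expected_score(winner_rating, loser_rating))
    return winner_rating + gain, loser_rating - gain


new_winner, new_loser = elo_update(1500.0, 1500.0)
print(new_winner, new_loser)
```

A larger K makes ratings react faster to recent results, which is presumably why the parameters above define a base and a maximum value.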

In [3]:
clean_data = pd.read_csv("./data/0cleanDatasetWithQualifiersWith2025.csv")
prev_stats = createStats()

# Iterate through each row in clean_data
for index, row in tqdm(clean_data.iterrows(), total=len(clean_data)):
    ########## UPDATE STATS ##########
    # We only need to update the stats, since we don't need to create a dataset
    prev_stats = updateStats(row, prev_stats, **update_stats_param)

100%|██████████| 197263/197263 [01:03<00:00, 3093.11it/s]


In [4]:
prev_stats.keys()

dict_keys(['elo_players', 'elo_surface_players', 'elo_grad_players', 'last_k_matches', 'last_k_matches_stats', 'matches_played', 'matches_surface_played', 'h2h', 'h2h_surface', 'last_tourney', 'last_tourney_surface'])

## Predict Any Two Players

In [5]:
# Load the model from models
xgb_model = XGBClassifier()
xgb_model.load_model("./models/best_final_xgb_model.json")

# I define this here to make the results easier to interpret
mapper = np.vectorize(lambda x: "Player 2 Wins" if x == 0 else "Player 1 Wins")

Here, I'm going to predict a match between Sinner and Alcaraz. I'm going to simulate them playing a Grand Slam quarterfinal. The run below uses Clay; you can change SURFACE to try Hard or Grass.

In [6]:
# Example match between Carlos Alcaraz and Jannik Sinner
player1 = {
    "Name": "Jannik Sinner",                    # Name is not needed, but I wrote it for clarity
    "ID": 206173,                               # You can search for the ID in "./data/atp_players.csv"
    "ATP_POINTS": 11000,                        # You can find this on the ATP website
    "ATP_RANK": 1,                              # You can find this on the ATP website
    "AGE": 23.6,                                # A decimal age is not required (but the more precise, the better)
    "HEIGHT": 191,                              # This can also be found in "./data/atp_players.csv"
}

player2 = {
    "Name": "Carlos Alcaraz",
    "ID": 207989,
    "ATP_POINTS": 9000,
    "ATP_RANK": 2,
    "AGE": 21.6,
    "HEIGHT": 183,
}

match = {
    "BEST_OF": 5,                               # Set this to 5 if grand slam, otherwise 3 normally
    "DRAW_SIZE": 128,                           # Depending on the tournament
    "SURFACE": "Clay",                          # Surface of the match. Options are ("Hard", "Clay", "Grass", "Carpet")
    "ROUND": "QF"
}

# Build the feature dict for this matchup with getStats
output = getStats(player1, player2, match, prev_stats)

# One-row DataFrame with the columns sorted alphabetically by name
match_data = pd.DataFrame([dict(sorted(output.items()))])
mapper(xgb_model.predict(np.array(match_data, dtype=object)))

array(['Player 2 Wins'], dtype='<U13')
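
Because the features are passed as a plain array, the prediction relies on the alphabetical column order matching the order used in training. Below is a small defensive sketch that orders a feature dict against an explicit list of expected names and fails loudly on a mismatch. The helper `align_features` is hypothetical. On a fitted model, `xgb_model.get_booster().feature_names` may provide the expected list, though it can be `None` when the model was trained on plain arrays.

```python
def align_features(features, expected_names):
    """Return the feature values in the order given by expected_names."""
    missing = [name for name in expected_names if name not in features]
    unexpected = [name for name in features if name not in expected_names]
    if missing or unexpected:
        # Refuse to predict on a row that does not match the training layout
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return [features[name] for name in expected_names]


# Toy example with made-up feature names and values
row = align_features({"ELO_DIFF": 28.3, "AGE_DIFF": 2.0}, ["AGE_DIFF", "ELO_DIFF"])
print(row)
```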

In [7]:
match_data

Unnamed: 0,AGE_DIFF,ATP_RANK_DIFF,BEST_OF,DOMINANCE_RATIO_LAST_100_DIFF,DOMINANCE_RATIO_LAST_10_DIFF,DOMINANCE_RATIO_LAST_25_DIFF,DOMINANCE_RATIO_LAST_3_DIFF,DOMINANCE_RATIO_LAST_50_DIFF,DRAW_SIZE,ELO_DIFF,ELO_GRAD_LAST_100_DIFF,ELO_GRAD_LAST_10_DIFF,ELO_GRAD_LAST_25_DIFF,ELO_GRAD_LAST_3_DIFF,ELO_GRAD_LAST_50_DIFF,ELO_SURFACE_DIFF,H2H_DIFF,H2H_SURFACE_DIFF,HEIGHT_DIFF,N_GAMES_DIFF,P_1ST_IN_LAST_100_DIFF,P_1ST_IN_LAST_10_DIFF,P_1ST_IN_LAST_25_DIFF,P_1ST_IN_LAST_3_DIFF,P_1ST_IN_LAST_50_DIFF,P_1ST_WON_LAST_100_DIFF,P_1ST_WON_LAST_10_DIFF,P_1ST_WON_LAST_25_DIFF,P_1ST_WON_LAST_3_DIFF,P_1ST_WON_LAST_50_DIFF,P_2ND_WON_LAST_100_DIFF,P_2ND_WON_LAST_10_DIFF,P_2ND_WON_LAST_25_DIFF,P_2ND_WON_LAST_3_DIFF,P_2ND_WON_LAST_50_DIFF,P_ACE_LAST_100_DIFF,P_ACE_LAST_10_DIFF,P_ACE_LAST_25_DIFF,P_ACE_LAST_3_DIFF,P_ACE_LAST_50_DIFF,P_BP_CONV_LAST_100_DIFF,P_BP_CONV_LAST_10_DIFF,P_BP_CONV_LAST_25_DIFF,P_BP_CONV_LAST_3_DIFF,P_BP_CONV_LAST_50_DIFF,P_BP_SAVED_LAST_100_DIFF,P_BP_SAVED_LAST_10_DIFF,P_BP_SAVED_LAST_25_DIFF,P_BP_SAVED_LAST_3_DIFF,P_BP_SAVED_LAST_50_DIFF,P_DF_LAST_100_DIFF,P_DF_LAST_10_DIFF,P_DF_LAST_25_DIFF,P_DF_LAST_3_DIFF,P_DF_LAST_50_DIFF,P_RET_1ST_WON_LAST_100_DIFF,P_RET_1ST_WON_LAST_10_DIFF,P_RET_1ST_WON_LAST_25_DIFF,P_RET_1ST_WON_LAST_3_DIFF,P_RET_1ST_WON_LAST_50_DIFF,P_RET_2ND_WON_LAST_100_DIFF,P_RET_2ND_WON_LAST_10_DIFF,P_RET_2ND_WON_LAST_25_DIFF,P_RET_2ND_WON_LAST_3_DIFF,P_RET_2ND_WON_LAST_50_DIFF,P_RET_ACE_AGAINST_LAST_100_DIFF,P_RET_ACE_AGAINST_LAST_10_DIFF,P_RET_ACE_AGAINST_LAST_25_DIFF,P_RET_ACE_AGAINST_LAST_3_DIFF,P_RET_ACE_AGAINST_LAST_50_DIFF,P_RPW_LAST_100_DIFF,P_RPW_LAST_10_DIFF,P_RPW_LAST_25_DIFF,P_RPW_LAST_3_DIFF,P_RPW_LAST_50_DIFF,P_TOTAL_PTS_WON_LAST_100_DIFF,P_TOTAL_PTS_WON_LAST_10_DIFF,P_TOTAL_PTS_WON_LAST_25_DIFF,P_TOTAL_PTS_WON_LAST_3_DIFF,P_TOTAL_PTS_WON_LAST_50_DIFF,ROUND,WIN_LAST_100_DIFF,WIN_LAST_10_DIFF,WIN_LAST_25_DIFF,WIN_LAST_3_DIFF,WIN_LAST_50_DIFF
0,2.0,-1,5,11.496222,14.681369,20.065882,-20.886741,8.588416,128,28.321917,0.004387,-11.387592,-6.921975,-14.334072,-1.351263,-146.714432,-4,-2,8,45,-3.313163,-1.448676,-4.849732,-9.37806,-3.904147,5.058073,0.338289,4.404942,-8.64694,4.515237,1.344185,-4.08011,2.269427,-5.651967,1.973775,2.731577,-4.855461,2.089642,-13.730384,2.917547,1.144584,-1.226021,-4.029249,-23.492063,-2.869361,9.323282,4.031746,1.638672,-16.666667,5.923871,-0.812055,-2.427963,-1.381407,-2.866947,-1.781518,-1.881548,4.887175,-0.433518,8.096451,-2.811379,1.138033,8.383605,3.636464,5.43278,2.235864,2.299064,-1.727734,1.227668,-5.958705,1.928348,-0.661514,6.023451,1.354062,5.932513,-0.381311,1.213541,2.770368,2.249623,-0.728881,1.484121,4,7,-2,-1,-1,3


How cool! I simulated a match between Carlos Alcaraz and Jannik Sinner. With the surface set to Clay, as in the run above, the model predicts that Carlos Alcaraz wins. Grass gave the same favorite when I tried it. However, when I changed the surface to Hard, it predicted Jannik Sinner would win.

This is super cool, because that's what I would have predicted myself. Carlos has won both Roland Garros and Wimbledon, and he's really good on both of those surfaces. Meanwhile, Sinner excels on hard courts and has won the last two Australian Open tournaments and the last US Open.
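
The surface comparison can be scripted instead of editing the cell by hand. In this sketch, `compare_surfaces` is a hypothetical helper. Both steps are passed in as callables, so it assumes nothing about getStats beyond the call used earlier, and it copies the match dict rather than modifying it.

```python
def compare_surfaces(build_features, predict_p1, match, surfaces=("Hard", "Clay", "Grass")):
    """Return {surface: probability that Player 1 wins} for each surface."""
    results = {}
    for surface in surfaces:
        # Copy the match so the original dict keeps its own SURFACE value
        scenario = {**match, "SURFACE": surface}
        results[surface] = predict_p1(build_features(scenario))
    return results


# Possible wiring in this notebook (assumes getStats does not modify prev_stats):
# compare_surfaces(
#     lambda m: pd.DataFrame([dict(sorted(getStats(player1, player2, m, prev_stats).items()))]),
#     lambda X: xgb_model.predict_proba(np.array(X, dtype=object))[0][1],
#     match,
# )
```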

In [8]:
# See how confident the model is in its prediction
probs = xgb_model.predict_proba(np.array(match_data, dtype=object))

# Extract probability of each class
prob_player1_wins = probs[0][1]
prob_player2_wins = probs[0][0]

print(f"Probability of {player1['Name']} winning: {prob_player1_wins:.2%}")
print(f"Probability of {player2['Name']} winning: {prob_player2_wins:.2%}")

Probability of Jannik Sinner winning: 45.37%
Probability of Carlos Alcaraz winning: 54.63%


These numbers are the estimated probability that the model builds from the outputs of all its trees, which is a bit more complicated than a discrete vote like in random forests.
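
For a binary logistic objective (the usual setup for XGBClassifier with two classes, and I am assuming it is what this model uses), each tree contributes a real-valued score, the scores are summed into a margin, and a sigmoid maps that margin to a probability. Here is a toy sketch with made-up leaf values, not values read from the saved model:

```python
import math


def margin_to_probability(tree_outputs, base_margin=0.0):
    """Sum the per-tree scores and squash the total with a sigmoid."""
    margin = base_margin + sum(tree_outputs)
    return 1.0 / (1.0 + math.exp(-margin))


# Made-up scores for three trees; a real model has many more
toy_outputs = [0.10, -0.25, 0.05]
p1 = margin_to_probability(toy_outputs)
print(f"P(Player 1 wins) = {p1:.2%}, P(Player 2 wins) = {1 - p1:.2%}")
```

On the real model, `xgb_model.predict(X, output_margin=True)` should return the raw margin before the sigmoid, which is a quick way to check this relationship against `predict_proba`.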

## Why you should not bet using my model
Firstly, I'm just a YouTuber and a CS student. Also, this was a small project.
Bookmakers are cracked, and they have the best models, which they keep a secret. I doubt that my model will ever be able to compete with them.
That being said, I think this is a fun project about how you can use Machine Learning to do some pretty cool things. Also, this model could be improved in a lot of ways, which I will briefly explain below.

I hope you enjoyed!