# Regression Forest

Our main model will be a regression forest. First, we'll design a generic regression tree model, based on the Federer pilot tree, that can be applied to any player. We'll then populate a forest with any number of these trees to obtain a distribution of predictions. Hyperparameter selection will be done through GridSearchCV. I'm choosing this approach instead of a classic forest regressor because, due to the volatile nature of our problem, it's hard to say whether any individual tree's prediction is better than another's. Thus, I'd like to visualize the whole range of predictions offered by my trees.

In [1]:
### IMPORTS ###

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We select the player and generate a table containing that player's matches, which we store in "player.csv" so as to avoid creating one table for each player.  
Note : This setup step has to be repeated each time we change players.

In [2]:
### CLEAN PLAYER MATCHES TABLES ###

PLAYER = "Alexander Zverev"
setup = False

if (setup) :    # Creating the table (if the player.csv table currently contains information for another player)

    atp = pd.read_csv("atp_cat.csv")
    # atp = atp[atp.loser_hand != 'U']
    # atp.to_csv("atp_cat.csv", index=False)

    player_won = atp[atp["winner_name"] == PLAYER]
    player_lost = atp[atp["loser_name"] == PLAYER]

    # We'll drop all of the player's information except his age and rank
    player_won = player_won.drop(labels=["winner_name", "winner_hand", "winner_ht", "winner_ioc"], axis=1)
    player_won = player_won.rename(columns={"winner_age": "player_age", "winner_rank": "player_rank", "winner_rank_points": "player_rank_points",
                                    "loser_name":"opp_name", "loser_hand":"opp_hand", "loser_ht":"opp_ht", "loser_ioc":"opp_ioc", "loser_age":"opp_age",
                                    "loser_rank": "opp_rank", "loser_rank_points": "opp_rank_points"})
    player_won["index2"] = player_won.index
    player_won["player_won"] = "1"

    player_lost = player_lost.drop(labels=["loser_name", "loser_hand", "loser_ht", "loser_ioc"], axis=1)
    player_lost = player_lost.rename(columns={"loser_age": "player_age", "loser_rank": "player_rank", "loser_rank_points": "player_rank_points",
                                        "winner_name":"opp_name", "winner_hand":"opp_hand", "winner_ht":"opp_ht", "winner_ioc":"opp_ioc", "winner_age":"opp_age",
                                        "winner_rank": "opp_rank", "winner_rank_points": "opp_rank_points"})
    player_lost["index2"] = player_lost.index
    player_lost["player_won"] = "0"

    player = pd.concat([player_won, player_lost])
    player.drop(list(player.filter(regex = 'Unnamed')), axis = 1, inplace = True)
    player.sort_index(inplace=True)

    # Win streaks
    result = player.player_won.astype(int)
    consecutive = result.groupby((result != result.shift()).cumsum()).cumcount()
    wins = pd.DataFrame({"win" : result, "consecutive" : consecutive})
    m = wins.win == 1
    wins.consecutive = wins.consecutive.where(m, 0)
    player["consecutive"] = wins.consecutive

    player.to_csv("player.csv", index=False)     # Saving the table for ease of use

    atp_players = pd.read_csv("atp_players.csv")
    atp_players['name'] = atp_players['name_first'] + ' ' + atp_players['name_last']
    atp_players.to_csv("atp_players.csv", index=False)     # Saving the table for ease of use


else :  # player.csv already contains this player's information

    player = pd.read_csv("player.csv")

    atp_players = pd.read_csv("atp_players.csv")


display(player)
print(f"{PLAYER} has {len(player)} recorded matches.")

Unnamed: 0,tourney_name,surface,tourney_level,tourney_date,player_age,opp_name,opp_hand,opp_ht,opp_ioc,opp_age,...,best_of,round,minutes,player_rank,player_rank_points,opp_rank,opp_rank_points,index2,player_won,consecutive
0,Hamburg,Clay,A,20130715,16.235455,Roberto Bautista Agut,R,183.0,ESP,25.251198,...,3,R64,67.0,798.0,20.0,49.0,872.0,60756,0,0
1,Munich,Clay,A,20140428,17.021218,Jurgen Melzer,L,183.0,AUT,32.933607,...,3,R32,63.0,765.0,25.0,66.0,700.0,62670,0,0
2,Stuttgart,Clay,A,20140707,17.212868,Lukas Rosol,R,196.0,CZE,28.952772,...,3,R32,103.0,285.0,163.0,48.0,900.0,63309,0,0
3,Hamburg,Clay,A,20140714,17.232033,Robin Haase,R,190.0,NED,27.271732,...,3,R64,58.0,285.0,163.0,51.0,865.0,63361,1,0
4,Hamburg,Clay,A,20140714,17.232033,Mikhail Youzhny,R,183.0,RUS,32.052019,...,3,R32,120.0,285.0,163.0,19.0,1735.0,63377,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
388,Us Open,Hard,G,20210830,24.361396,Novak Djokovic,R,188.0,SRB,34.275154,...,5,SF,214.0,4.0,8240.0,1.0,11113.0,80082,0,0
389,Indian Wells Masters,Hard,M,20211004,24.457221,Taylor Fritz,R,193.0,USA,23.934292,...,3,QF,140.0,4.0,7603.0,39.0,1495.0,80198,0,0
390,Indian Wells Masters,Hard,M,20211004,24.457221,Gael Monfils,R,193.0,FRA,35.091034,...,3,R16,61.0,4.0,7603.0,18.0,2418.0,80205,1,0
391,Indian Wells Masters,Hard,M,20211004,24.457221,Andy Murray,R,190.0,GBR,34.390144,...,3,R32,127.0,4.0,7603.0,121.0,661.0,80218,1,1


Alexander Zverev has 393 recorded matches.


## Regression Decision Tree

We will create a decision tree using these features :
- Surface  
- Best of  
- Opponent hand  
- Opponent height  
- Opponent age  
- Ranking difference  
- Tournament level  
- Match round (QF, SF, F, etc...)
- Player form  

Some of these features seem irrelevant, and they probably are. However, a decision tree only splits on the features that help it (i.e. it picks the splits that most reduce the prediction error), and GridSearchCV will tune the tree's hyperparameters, so we will feed the model all the information we have and let it select what is important.
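
One way to check which features a fitted tree actually relies on is to inspect its `feature_importances_` attribute. The sketch below is only an illustration on synthetic data; the column names `useful` and `noise` are made up and are not part of the match table.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the target depends only on "useful"
demo_X = pd.DataFrame({
    "useful": rng.uniform(0, 10, 200),
    "noise": rng.uniform(0, 10, 200),
})
demo_y = 3 * demo_X["useful"] + rng.normal(0, 0.1, 200)

demo_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(demo_X, demo_y)

# Each importance is the share of the total error reduction credited to that feature
importances = pd.Series(demo_tree.feature_importances_, index=demo_X.columns)
print(importances.sort_values(ascending=False))
```

A feature the tree never splits on gets an importance of 0, which is a quick way to spot columns that could be dropped.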

### Formatting

The "player.csv" table contains many unnecessary columns. Here, we create a player1 table containing all the features we potentially need for our model and format them correctly. Specifically, we use one-hot encoding to split each categorical column into multiple boolean columns.
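
As a minimal illustration of what one-hot encoding does, here is `pd.get_dummies` applied to a three-row toy table. The values are made up for the example; only the column names mirror the real table.

```python
import pandas as pd

# Toy table (illustrative values only)
demo = pd.DataFrame({
    "minutes": [67.0, 214.0, 140.0],
    "surface": ["Clay", "Hard", "Hard"],
    "best_of": [3, 5, 3],
})

# Each listed column is replaced by one indicator column per distinct value
encoded = pd.get_dummies(data=demo, columns=["surface", "best_of"])
print(encoded.columns.tolist())
print(encoded)
```

Note that a category absent from the data (e.g. Grass here) produces no column at all, and that the indicator dtype (0/1 integers or booleans) depends on the pandas version.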

In [3]:
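# .copy() gives us an independent table, so adding columns below doesn't trigger SettingWithCopyWarning
player1_pf = player[["minutes", "surface", "best_of", "opp_hand", "opp_ht", "opp_age", "tourney_level", "round"]].copy()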
player1_pf["rank_diff"] = player["player_rank"] - player["opp_rank"]
player1_pf["consecutive"] = player["consecutive"]

display(player1_pf)


# One-Hot Encoding
player1 = pd.get_dummies(data=player1_pf, columns=["surface", "best_of", "opp_hand", "tourney_level", "round"])

player1 = player1.dropna(axis=0, how='any')

# player1 = player1.drop(["round_BR"] , axis=1)

player1.tail(5)



Unnamed: 0,minutes,surface,best_of,opp_hand,opp_ht,opp_age,tourney_level,round,rank_diff,consecutive
0,67.0,Clay,3,R,183.0,25.251198,A,R64,749.0,0
1,63.0,Clay,3,L,183.0,32.933607,A,R32,699.0,0
2,103.0,Clay,3,R,196.0,28.952772,A,R32,237.0,0
3,58.0,Clay,3,R,190.0,27.271732,A,R64,234.0,0
4,120.0,Clay,3,R,183.0,32.052019,A,R32,266.0,1
...,...,...,...,...,...,...,...,...,...,...
388,214.0,Hard,5,R,188.0,34.275154,G,SF,3.0,0
389,140.0,Hard,3,R,193.0,23.934292,M,QF,-35.0,0
390,61.0,Hard,3,R,193.0,35.091034,M,R16,-14.0,0
391,127.0,Hard,3,R,190.0,34.390144,M,R32,-117.0,1


Unnamed: 0,minutes,opp_ht,opp_age,rank_diff,consecutive,surface_Clay,surface_Grass,surface_Hard,best_of_3,best_of_5,...,tourney_level_G,tourney_level_M,round_F,round_QF,round_R128,round_R16,round_R32,round_R64,round_RR,round_SF
387,126.0,193.0,24.511978,-42.0,9,0,0,1,0,1,...,1,0,0,1,0,0,0,0,0,0
388,214.0,188.0,34.275154,3.0,0,0,0,1,0,1,...,1,0,0,0,0,0,0,0,0,1
389,140.0,193.0,23.934292,-35.0,0,0,0,1,1,0,...,0,1,0,1,0,0,0,0,0,0
390,61.0,193.0,35.091034,-14.0,0,0,0,1,1,0,...,0,1,0,0,0,1,0,0,0,0
391,127.0,190.0,34.390144,-117.0,1,0,0,1,1,0,...,0,1,0,0,0,0,1,0,0,0



#### Pre-processing

For pre-processing we have three options :
- not scaling x or y
- scaling both x and y
- scaling x but not y   
  
I feel like scaling gives us better results, but it adds a hurdle when interpreting the tree's visualization, since the tree displays scaled values. We can scale both input and output values back, but we can't display the unscaled values with sklearn's plot_tree (or any tree visualization that I've found so far). For now, the workaround is simply to print out the scaled input and output.
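
To make the "scale back" step concrete, here is a small sketch of `StandardScaler` on a handful of match lengths. The variable names (`demo_scaler`, `z_threshold`) are illustrative and not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# StandardScaler expects a 2D array of shape (n_samples, n_features)
minutes = np.array([67.0, 63.0, 103.0, 58.0, 120.0]).reshape(-1, 1)

demo_scaler = StandardScaler().fit(minutes)
scaled = demo_scaler.transform(minutes)            # zero mean, unit variance
restored = demo_scaler.inverse_transform(scaled)   # back to minutes

print(scaled.ravel().round(2))
print(restored.ravel())

# A value read off a scaled tree (e.g. a leaf value or split threshold)
# can be converted back by hand: x = z * scale_ + mean_
z_threshold = 0.5
unscaled_threshold = z_threshold * demo_scaler.scale_[0] + demo_scaler.mean_[0]
print(unscaled_threshold)
```

The last lines show that any single number displayed in a scaled tree can be converted back manually, which is enough to annotate a plot after the fact.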

In [4]:
### PRE-PROCESSING ###

from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

target='minutes'

scale_x = False
scale_y = False

def preprocessing(scale_x=False, scale_y=False) :

    # 1. X, y unscaled
    X = player1.drop([target], axis=1)
    y = np.asarray(player1[target])
    y = y.reshape(-1,1)

    xscaler = StandardScaler()
    yscaler = StandardScaler()

    if (scale_x) :
        print("Scaling X...")
        xscaler.fit(X[['opp_ht', 'opp_age', 'rank_diff']])
        X[['opp_ht', 'opp_age', 'rank_diff']] = xscaler.transform(X[['opp_ht', 'opp_age', 'rank_diff']])

    if (scale_y) :
        print("Scaling y...")
        yscaler.fit(y)
        y = yscaler.transform(y)

    display(pd.DataFrame(X).head())
    display(pd.DataFrame(y).head())

    return X, y, xscaler, yscaler

X, y, xscaler, yscaler = preprocessing(scale_x, scale_y)

Unnamed: 0,opp_ht,opp_age,rank_diff,consecutive,surface_Clay,surface_Grass,surface_Hard,best_of_3,best_of_5,opp_hand_L,...,tourney_level_G,tourney_level_M,round_F,round_QF,round_R128,round_R16,round_R32,round_R64,round_RR,round_SF
0,183.0,25.251198,749.0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
1,183.0,32.933607,699.0,0,1,0,0,1,0,1,...,0,0,0,0,0,0,1,0,0,0
2,196.0,28.952772,237.0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
3,190.0,27.271732,234.0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
4,183.0,32.052019,266.0,1,1,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0


Unnamed: 0,0
0,67.0
1,63.0
2,103.0
3,58.0
4,120.0


#### Building the tree

We will set up three functions in order to build and exploit our tree model :  

**1. Parameter Selection** : To build a good tree, we need to select values for the tree's parameters (e.g. max depth, minimum samples per leaf). We'll perform this hyperparameter tuning with GridSearchCV, a scikit-learn class that tries every combination of the parameter values we give it and scores each one with cross-validation. The trees are compared on the R² criterion. We could say this is the actual building of the tree.
  
![score formula](res/score.png "sklearn score")  

**2. Custom Prediction** : Our end goal is to predict the length of a match given any set of variables. This function does that by wrapping the native *dtr.predict* function: it builds the input row and converts the output back to minutes. It looks complicated because we distinguish 3 cases depending on whether we're scaling x and y (the fourth combination, scaling y only, is not handled). Whether this case distinction is useful remains to be discussed.

**3. Tree Visualization** : Sklearn provides a native *plot_tree* function, but it doesn't label the branches with True/False. I've used another library, pydot, to render the Graphviz export of the tree, which does show these labels.
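
For reference, the R² score used to compare trees in step 1 can be computed by hand. The sketch below assumes the formula in the image above is scikit-learn's usual definition, 1 - u/v, with u the residual sum of squares and v the total sum of squares. The numbers are made up.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only: true match lengths and some model's predictions
y_true = np.array([67.0, 63.0, 103.0, 58.0, 120.0])
y_hat = np.array([70.0, 60.0, 95.0, 65.0, 110.0])

ss_res = ((y_true - y_hat) ** 2).sum()          # u: residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # v: total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that.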

In [5]:
### PARAMETER SELECTION ###

from sklearn.model_selection import GridSearchCV

params = {
    # "criterion":("squared_error", "friedman_mse", "absolute_error", "poisson"), 
    "max_depth":np.arange(3, 10), 
    "min_samples_leaf":np.arange(1, 80), 
    # "min_weight_fraction_leaf":[0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.075, 0.001], 
    # "max_features":np.arange(1, 25)
}

dtr = DecisionTreeRegressor(random_state=42)
dtr_cv = GridSearchCV(dtr, params, scoring="r2", n_jobs=-1, verbose=1, cv=10)
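
The grid above has 7 values of max_depth and 79 values of min_samples_leaf, i.e. 7 × 79 = 553 candidates, each fitted once per fold. As a rough sketch of how the search object behaves once fitted, here is the same pattern on synthetic data with a much smaller grid. All `demo_*` names and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic regression problem (not the match data)
demo_X = rng.uniform(0, 10, size=(120, 2))
demo_y = np.sin(demo_X[:, 0]) + rng.normal(0, 0.1, 120)

# 2 x 2 = 4 candidates, each evaluated with 3-fold cross-validation
demo_params = {"max_depth": [2, 4], "min_samples_leaf": [1, 10]}
demo_cv = GridSearchCV(DecisionTreeRegressor(random_state=42), demo_params, scoring="r2", cv=3)
demo_cv.fit(demo_X, demo_y)

print(demo_cv.best_params_)          # winning combination
print(demo_cv.best_score_)           # its mean cross-validated R²
best_tree = demo_cv.best_estimator_  # already refit on all of demo_X (refit=True is the default)
```

Since `best_estimator_` is refit automatically, it can be used for predictions directly instead of building a new tree from `best_params_`.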

In [6]:
### CUSTOM PREDICTION ###

def custom_predict(dtr, scale_x, scale_y, height, age, rank_diff, consecutive, surface, best_of, hand, level, round) :
    # returns the predicted length of a match given a set of match conditions

    # surface : clay(1), grass(2), hard(3)
    # best of : 3(1), 5(2)
    # hand : left(1), right(2)
    # level : A(1), D(2), F(3), G(4), M(5)
    # round : F(1), QF(2), R128(3), R16(4), R32(5), R64(6), RR(7), SF(8)

    surface_input = [0, 0, 0]
    surface_input[surface - 1] = 1

    bo_input = [0, 0]
    bo_input[best_of - 1] = 1

    hand_input = [0, 0]
    hand_input[hand - 1] = 1

    level_input = [0, 0, 0, 0, 0]
    level_input[level - 1] = 1

    round_input = [0, 0, 0, 0, 0, 0, 0, 0]
    round_input[round - 1] = 1

    X_custom = pd.DataFrame(columns=X.columns)


    if (not scale_x and not scale_y) :  # 1. X, y unscaled
        input = [height, age, rank_diff] + [consecutive] + surface_input + bo_input + hand_input + level_input + round_input
        X_custom = pd.DataFrame(columns=X.columns)
        X_custom.loc[0] = input
        display(X_custom)

        print("Prediction : ", dtr.predict(X_custom), " minutes")

        return(dtr.predict(X_custom)[0])

    elif (scale_x and scale_y) :    # 2. X, y scaled
        input = xscaler.transform([[height, age, rank_diff]])
        print(input)
        input = np.append(input[0], [consecutive] + surface_input + bo_input + hand_input + level_input + round_input)

        X_custom.loc[0] = input
        print("Scaled input : ")
        display(X_custom)

        scaled_pred = dtr.predict(X_custom).reshape(-1, 1)     # inverse_transform expects a 2D array
        pred = yscaler.inverse_transform(scaled_pred)[0, 0]     # back to minutes
        print("Scaled Prediction : ", scaled_pred[0, 0])
        print("Prediction : ", pred, " minutes")

        return pred

    elif (scale_x and not scale_y) :    # 3. X scaled, y unscaled
        input = xscaler.transform([[height, age, rank_diff]]).tolist()
        input = np.append(input[0], [consecutive] + surface_input + bo_input + hand_input + level_input + round_input)

        X_custom.loc[0] = input
        print("Scaled input : ")
        display(X_custom)

        print("Prediction : ", dtr.predict(X_custom)[0], " minutes")

        return(dtr.predict(X_custom)[0])


# e.g. custom_predict(dtr, scale_x, scale_y, 180, 22, 0, 0, 4, 1, 2, 4, 5)
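
The position-based encoding above only works if the integer codes stay in sync with the column order of X. A possible alternative, sketched here under assumptions, is to set the one-hot columns by name and let `reindex` fill in the rest. `make_row` and `demo_columns` are hypothetical names, and the column list is only a subset of the real one.

```python
import pandas as pd

# Hypothetical subset of the training columns, for illustration only
demo_columns = ["opp_ht", "opp_age", "rank_diff", "consecutive",
                "surface_Clay", "surface_Grass", "surface_Hard",
                "round_F", "round_SF"]

def make_row(columns, numeric, categorical):
    """Build a single-row frame: numeric values as given, one-hot columns set by name."""
    row = dict(numeric)
    for feature, value in categorical.items():
        row[f"{feature}_{value}"] = 1
    unknown = set(row) - set(columns)
    if unknown:
        raise KeyError(f"Columns not seen in training: {sorted(unknown)}")
    # Missing one-hot columns default to 0; column order follows `columns`
    return pd.DataFrame([row]).reindex(columns=columns, fill_value=0)

demo_row = make_row(
    demo_columns,
    {"opp_ht": 185.0, "opp_age": 25.0, "rank_diff": 12.0, "consecutive": 0},
    {"surface": "Hard", "round": "F"},
)
print(demo_row.iloc[0].to_dict())
```

With this approach, a category the tree never saw during training raises an error instead of silently landing in the wrong column.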

In [7]:
### TREE VISUALIZATION ###

# Without True/False (sklearn)

# x_ax = range(len(ytest))
# plt.plot(x_ax, ytest, linewidth=1, label="original")
# plt.plot(x_ax, ypred, linewidth=1.1, label="predicted")
# plt.title("Y-test and y-predicted data")
# plt.xlabel('X-axis')
# plt.ylabel('Y-axis')
# plt.legend(loc='best',fancybox=True, shadow=True)
# plt.grid(True)
# plt.show()

# plt.figure(figsize=(30,15))
# tree.plot_tree(dtr,
#           filled=True,
#           rounded=True,
#           fontsize=10,
#           feature_names=["opp_ht", "opp_age", "rank_diff", "consecutive", "surface_Carpet", "surface_Clay", "surface_Grass", "surface_Hard", 
#                          "best_of_3", "best_of_5", "opp_hand_L", "opp_hand_R", 
#                          'tourney_level_A', 'tourney_level_D', 'tourney_level_F', 'tourney_level_G', 'tourney_level_M', 
#                          'round_F', 'round_QF', 'round_R128', 'round_R16', 'round_R32', 'round_R64', 'round_RR', 'round_SF'])

# # plt.savefig('tree_high_dpi', dpi=600)

# # With True/False label (pydot)

from IPython.display import Image
from io import StringIO     # standard library; avoids the extra dependency on six
from sklearn.tree import export_graphviz
import pydot

def visualize_tree(dtr) :

    features = list(player1.columns)
    features.remove("minutes")

    dot_data = StringIO()
    export_graphviz(dtr, out_file=dot_data, feature_names=features, filled=True)
    graph = pydot.graph_from_dot_data(dot_data.getvalue())
    display(Image(graph[0].create_png()))
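
If Graphviz or pydot is not available, scikit-learn's `export_text` gives a plain-text view of the tree in which each branch is written out explicitly as `<=` or `>`. The sketch below uses synthetic data; the feature names are borrowed from our table purely as labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)

# Synthetic data (illustration only): the target jumps when the first feature exceeds 5
demo_X = rng.uniform(0, 10, size=(100, 2))
demo_y = np.where(demo_X[:, 0] > 5, 120.0, 80.0)

demo_tree = DecisionTreeRegressor(max_depth=2, random_state=1).fit(demo_X, demo_y)

# One line per node; indentation shows depth, leaves show the predicted value
rules = export_text(demo_tree, feature_names=["rank_diff", "opp_ht"])
print(rules)
```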

### User Interface

We're building a simple UI to input match and player variables using ipywidgets.

In [8]:
import ipywidgets as widgets
from IPython.display import display

In [9]:
### INTERFACE ###

PLAYER1 = 'Roger Federer'
PLAYER2 = 'Rafael Nadal'

# Match Settings

match_settings_title = widgets.Label(value='MATCH CONDITIONS')

surface_dropdown = widgets.Dropdown(
    description="Surface",
    options=['Hard', 'Grass', 'Clay']
)

best_of_dropdown = widgets.Dropdown(
    description="Best of",
    options=['3', '5']
)

level_dropdown = widgets.Dropdown(
    description="Tournament Level",
    options=['Grand Slam (G)', 'Masters 1000s (M)', 'Finals (F)', 'Davis Cup (D)', 'Other (A)']
)

round_dropdown = widgets.Dropdown(
    description="Round of",
    options=['Final', 'Semifinals', 'Quarterfinals', 'R16', 'R32', 'R64', 'R128', 'R']
)

match_inputs_1 = widgets.HBox([surface_dropdown, best_of_dropdown])
match_inputs_2 = widgets.HBox([level_dropdown, round_dropdown])

# Players

player1_title = widgets.Label(value='PLAYER 1')

player1_text = widgets.Text(placeholder="Player 1", description='Name')

player1_height_display = widgets.Text(description='Height')
player1_hand_display = widgets.Text(description='Hand')
player1_rank_display = widgets.Text(description='Rank')
player1_age_display = widgets.Text(description='Age')
player1_cons_display = widgets.Text(description='Consecutive wins')
player1_age_display.value = '25'
player1_cons_display.value = '0'

player1_widgets = widgets.VBox([player1_title, player1_text, player1_height_display, player1_hand_display, player1_rank_display, player1_age_display, player1_cons_display])


player2_title = widgets.Label(value='PLAYER 2')

player2_text = widgets.Text(placeholder="Player 2", description='Name')

player2_height_display = widgets.Text(description='Height')
player2_hand_display = widgets.Text(description='Hand')
player2_rank_display = widgets.Text(description='Rank')
player2_age_display = widgets.Text(description='Age')
player2_cons_display = widgets.Text(description='Consecutive wins')
player2_age_display.value = '25'
player2_cons_display.value = '0'

player2_widgets = widgets.VBox([player2_title, player2_text, player2_height_display, player2_hand_display, player2_rank_display, player2_age_display, player2_cons_display])


player_inputs = widgets.HBox([player1_widgets, player2_widgets])

display(match_settings_title, match_inputs_1, match_inputs_2)
display(player_inputs)



### BEHAVIOR ###

def player1_eventhandler(change):
    PLAYER1 = change.new
    
    if PLAYER1 in atp_players.name.values :

        player1_height = atp_players.loc[atp_players.name==PLAYER1, 'height'].values[0]
        player1_hand = atp_players.loc[atp_players.name==PLAYER1, 'hand'].values[0]
        player1_rank = atp_players.loc[atp_players.name==PLAYER1, 'rank'].values[0]

        player1_height_display.value = str(player1_height)
        player1_hand_display.value = str(player1_hand)
        player1_rank_display.value = str(player1_rank)
        
def player2_eventhandler(change):
    PLAYER2 = change.new
    
    if PLAYER2 in atp_players.name.values :

        player2_height = atp_players.loc[atp_players.name==PLAYER2, 'height'].values[0]
        player2_hand = atp_players.loc[atp_players.name==PLAYER2, 'hand'].values[0]
        player2_rank = atp_players.loc[atp_players.name==PLAYER2, 'rank'].values[0]

        player2_height_display.value = str(player2_height)
        player2_hand_display.value = str(player2_hand)
        player2_rank_display.value = str(player2_rank)

player1_text.observe(player1_eventhandler, names='value')
player2_text.observe(player2_eventhandler, names='value')


## Regression Forest

Finally, we'll set up a loop to build any number of trees to form a decision forest. We'll then visualize the predictions as a histogram, which acts as a distribution chart. We can set the number of trees (to get an idea, a tree takes 8-10 seconds) as well as the time intervals for the distribution.
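
Stripped of the grid search and the UI, the idea can be sketched as follows: fit each tree on a different random train/test split, ask every tree for a prediction on the same match, and bin the answers. The data, the fixed hyperparameters and the `demo_*` names below are assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)

# Synthetic "match length" data (illustration only)
demo_X = rng.uniform(0, 10, size=(200, 3))
demo_y = 60 + 10 * demo_X[:, 0] + rng.normal(0, 15, 200)
query = np.array([[5.0, 5.0, 5.0]])   # the match we want a prediction for

n_trees = 20
demo_predictions = []
for seed in range(n_trees):
    # A different random split per tree is what makes the trees disagree
    X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, test_size=0.3, random_state=seed)
    tree_i = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20, random_state=seed).fit(X_tr, y_tr)
    demo_predictions.append(float(tree_i.predict(query)[0]))

# Same binning idea as the histogram: 10-minute intervals from 60 to 300
counts, edges = np.histogram(demo_predictions, bins=np.arange(60, 300 + 10, 10))
print(counts.sum(), np.mean(demo_predictions))
```

The spread of `demo_predictions` across bins is the quantity of interest here, rather than the single averaged value a classic forest would return.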

In [11]:
# Converting the user inputs into inputs usable by the custom_predict function

sd = { 'Clay':1, 'Grass':2, 'Hard':3 }                                                                  # Surface dictionary
bod = { '3':1, '5':2 }                                                                                  # Best of dictionary
hd = { 'L':1, 'R':2 }                                                                                   # Hand dictionary
ld = { 'Other (A)':1, 'Davis Cup (D)':2, 'Finals (F)':3, 'Grand Slam (G)':4, 'Masters 1000s (M)':5 }    # Level dictionary
rd = { 'Final':1, 'Quarterfinals':2, 'R128':3, 'R16':4, 'R32':5, 'R64':6, 'R':7, 'Semifinals':8 }       # Round dictionary

def t2i(mapping, text) :   # text to input (parameter renamed so it doesn't shadow the built-in dict)
    return mapping[text]

ht = float(player2_height_display.value)
a = float(player2_age_display.value)
rk = float(player1_rank_display.value) - float(player2_rank_display.value)
c = int(player1_cons_display.value)
s = t2i(sd, surface_dropdown.value)
bo = int(t2i(bod, best_of_dropdown.value))
h = t2i(hd, player2_hand_display.value)
l = t2i(ld, level_dropdown.value)
r = t2i(rd, round_dropdown.value)

In [12]:
### THE LOOP ###

test_scores = []
predictions = []

iter = 2

for i in range(iter) :

    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=.3)
    dtr_cv.fit(Xtrain, ytrain)

    best_params = dtr_cv.best_params_
    print(f"Best parameters: {best_params}")

    dtr = DecisionTreeRegressor(**best_params)
    dtr.fit(Xtrain, ytrain)

    # Testing :

    ypred = dtr.predict(Xtest)

    print(f"Training score {i+1} : ", dtr.score(Xtrain, ytrain))
    print(f"Testing score {i+1} : ", dtr.score(Xtest, ytest))
    test_scores.append(dtr.score(Xtest, ytest))

    predictions.append(custom_predict(dtr, scale_x, scale_y, ht, a, rk, c, s, bo, h, l, r))     # this tree's prediction for the match entered in the UI

    # visualize_tree(dtr)

    print("------------------------------------------------------------------------------------------------")

print("------------------------------------------------------------------------------------------------")
print("Test scores : ", test_scores)
print("Test scores average : ", sum(test_scores)/len(test_scores))

average_prediction = sum(predictions) / len(predictions)

print("------------------------------------------------------------------------------------------------")
print(f"Predictions : {predictions}")
print(f"Average predicted length : {average_prediction}")

Fitting 10 folds for each of 553 candidates, totalling 5530 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Done 2832 tasks      | elapsed:    6.9s


Best parameters: {'max_depth': 3, 'min_samples_leaf': 27})
Training score 1 :  0.36361458154781257
Testing score 1 :  0.2894346456478729


[Parallel(n_jobs=-1)]: Done 5530 out of 5530 | elapsed:   10.3s finished


Unnamed: 0,opp_ht,opp_age,rank_diff,consecutive,surface_Clay,surface_Grass,surface_Hard,best_of_3,best_of_5,opp_hand_L,...,tourney_level_G,tourney_level_M,round_F,round_QF,round_R128,round_R16,round_R32,round_R64,round_RR,round_SF
0,185.0,25.0,12.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Prediction :  [106.328]  minutes
------------------------------------------------------------------------------------------------
Fitting 10 folds for each of 553 candidates, totalling 5530 fits


[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 2160 tasks      | elapsed:    2.9s


Best parameters: {'max_depth': 3, 'min_samples_leaf': 26})
Training score 2 :  0.39748255384281383
Testing score 2 :  0.2130285387210451


[Parallel(n_jobs=-1)]: Done 5530 out of 5530 | elapsed:    7.3s finished


Unnamed: 0,opp_ht,opp_age,rank_diff,consecutive,surface_Clay,surface_Grass,surface_Hard,best_of_3,best_of_5,opp_hand_L,...,tourney_level_G,tourney_level_M,round_F,round_QF,round_R128,round_R16,round_R32,round_R64,round_RR,round_SF
0,185.0,25.0,12.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Prediction :  [98.00775194]  minutes
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Test scores :  [0.2894346456478729, 0.2130285387210451]
Test scores average :  0.251231592184459
------------------------------------------------------------------------------------------------
Predictions : [106.328, 98.0077519379845]
Average predicted length : 102.16787596899225


In [68]:
### HISTOGRAM ###

# ATP Colors
# Dark blue : #002865
# Light blue : #00AFF0
# Roland Garros Orange : #CB5A19

time_interval = 10  # default histogram bin width, in minutes

def plot_distribution(time_step) : 
    
    fig, ax = plt.subplots(1, figsize=(18,6))

    plt.title('Distribution of Match Length Predictions', loc='left')
    plt.xlabel("Minutes")
    plt.ylabel("Nb of predictions")

    ax.spines['bottom'].set_visible(True)
    ax.spines['top'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['right'].set_visible(False)

    plt.yticks([])
    
    
    intervals=np.arange(60, 300 + time_step, time_step)
    n, bins, patches = plt.hist(predictions, bins=intervals, color='#00AFF0')

    plt.xticks(bins)
    plt.grid(color='white', lw = 1, axis='x')

    xticks = [(bins[idx+1] + value)/2 for idx, value in enumerate(bins[:-1])]

    for idx, value in enumerate(n) :
        if value > 0 :
            plt.text(xticks[idx], value * 1.05, f"{int(100 * value / len(predictions))}%", ha='center')     # share of all predictions
            plt.text(xticks[idx], value / 2, int(value), ha='center', color='w')

    plt.axvline(x=average_prediction, color='#CB5A19')
    plt.show()

out = widgets.Output()
time_step_slider = widgets.IntSlider(value=time_interval, min=5, max=45, step=5)

def time_step_slider_eventhandler(change) :
    out.clear_output()
    with out:
        plot_distribution(change.new)
    
display(time_step_slider)

time_step_slider.observe(time_step_slider_eventhandler, names='value')

display(out)
