In [1]:
import pandas as pd
from pathlib import Path
import sklearn.linear_model as lm

# Loading in Data and Functions

Since we're making the model for Roger Federer, let's load in his data from `feature-engineering.ipynb`.

In [14]:
data_file = Path('./data', 'fed_2018.hdf')
fed_2018 = pd.read_hdf(data_file, 'fed_2018')
fed_2018.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_name,winner_hand,...,player_1stIn%,opp_1stIn%,player_1stWon%,opp_1stWon%,player_2ndWon%,opp_2ndWon%,player_bpSaved%,opp_bpSaved%,player_bpFaced%,opp_bpFaced%
202,2018-580,Australian Open,Hard,128,G,2018-01-15,164,103819,Roger Federer,R,...,0.716049,0.555556,0.775862,0.7,0.608696,0.45,1.0,0.692308,0.024691,0.144444
234,2018-580,Australian Open,Hard,128,G,2018-01-15,232,103819,Roger Federer,R,...,0.576087,0.504673,0.830189,0.759259,0.666667,0.509434,0.666667,0.727273,0.032609,0.102804
250,2018-580,Australian Open,Hard,128,G,2018-01-15,316,103819,Roger Federer,R,...,0.648936,0.677778,0.803279,0.672131,0.424242,0.413793,0.5,0.375,0.021277,0.088889
258,2018-580,Australian Open,Hard,128,G,2018-01-15,408,103819,Roger Federer,R,...,0.592593,0.527778,0.833333,0.666667,0.757576,0.529412,1.0,0.7,0.0,0.092593
262,2018-580,Australian Open,Hard,128,G,2018-01-15,504,103819,Roger Federer,R,...,0.625,0.56383,0.833333,0.698113,0.527778,0.512195,0.6,0.5,0.052083,0.085106


We're also going to need the functions we defined in `feature-engineering.ipynb`, so let's "import" that notebook.

In [3]:
%%capture
%run feature-engineering.ipynb

# Training and Evaluating the Model

We need to create the matrix for training the model. This matrix consists of the features we chose and engineered in `feature-engineering.ipynb`.

In [30]:
features = ['player_ace%', 'opp_ace%', 'player_df%', 'opp_df%', 'player_1stIn%', 'opp_1stIn%',
            'player_1stWon%', 'opp_1stWon%', 'player_2ndWon%', 'opp_2ndWon%', 'player_bpSaved%',
            'opp_bpSaved%', 'player_bpFaced%', 'opp_bpFaced%', 'win_streak',
            'head_to_head', 'opponent_hand_L', 'opponent_hand_R']
train_features, train_results = fed_2018[features], fed_2018['result']
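
Before fitting, it can help to confirm that every feature column is present and free of missing values, since `LogisticRegression` rejects `NaN` inputs. The sketch below is only an illustration: `check_feature_matrix` is a hypothetical helper that is not defined in `feature-engineering.ipynb`, and the toy frame stands in for `fed_2018` with made-up values.

```python
import pandas as pd

def check_feature_matrix(df, feature_names):
    """Return df[feature_names]; raise ValueError if a column is missing or has NaN."""
    missing = [name for name in feature_names if name not in df.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    nan_counts = df[feature_names].isna().sum()
    with_nans = nan_counts[nan_counts > 0]
    if not with_nans.empty:
        raise ValueError(f"Features containing NaN: {list(with_nans.index)}")
    return df[feature_names]

# Toy frame standing in for fed_2018 (values are made up for illustration).
toy = pd.DataFrame({
    'player_ace%': [0.10, 0.08],
    'opp_ace%': [0.05, 0.07],
    'result': [1, 0],
})
toy_features = check_feature_matrix(toy, ['player_ace%', 'opp_ace%'])
```

In this notebook the equivalent call would presumably be `check_feature_matrix(fed_2018, features)`.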

Now we need to apply the same transformations to the test data as we did to the training data in `feature-engineering.ipynb`.

In [31]:
test_data = pd.read_hdf(data_file, 'fed_2018_test')
test_data.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_name,winner_hand,...,w_1stIn%,l_1stIn%,w_1stWon%,l_1stWon%,w_2ndWon%,l_2ndWon%,w_bpSaved%,l_bpSaved%,w_bpFaced%,l_bpFaced%
408,2018-0407,Rotterdam,Hard,32,A,2018-02-12,293,103819,Roger Federer,R,...,0.676471,0.615385,0.891304,0.642857,0.636364,0.685714,1.0,0.833333,0.0,0.065934
404,2018-0407,Rotterdam,Hard,32,A,2018-02-12,297,103819,Roger Federer,R,...,0.57377,0.657143,0.742857,0.608696,0.730769,0.375,0.0,0.375,0.016393,0.114286
613,2018-M006,Indian Wells Masters,Hard,128,M,2018-03-05,297,103819,Roger Federer,R,...,0.666667,0.517241,0.695652,0.533333,0.565217,0.535714,0.833333,0.428571,0.086957,0.12069
1641,2018-540,Wimbledon,Grass,128,G,2018-07-02,212,103819,Roger Federer,R,...,0.671053,0.77,0.901961,0.545455,0.52,0.478261,1.0,0.583333,0.052632,0.12
617,2018-M006,Indian Wells Masters,Hard,128,M,2018-03-05,293,103819,Roger Federer,R,...,0.510204,0.666667,1.0,0.711538,0.791667,0.384615,1.0,0.6,0.0,0.064103


In [35]:
def process_data(data, test=False):
    """Apply the feature-engineering steps from `feature-engineering.ipynb` to `data`.

    The `test` argument is currently unused.
    """
    X = (
        data
            .pipe(convert_match_stats_to_percent)
            .pipe(replace_nan_bp)
            .pipe(add_win_loss, "Roger Federer")
            #.pipe(add_win_streak)
            #.pipe(add_head_to_head)
            .pipe(add_opponent_hand)
            .pipe(add_player_v_opponent_stats, "Roger Federer")
    )
    return X
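
For readers less familiar with `DataFrame.pipe`, here is a minimal sketch of how such a chain behaves. `add_total` and `add_label` are hypothetical toy functions, not the ones from `feature-engineering.ipynb`; the point is only that `df.pipe(f, arg)` calls `f(df, arg)` and passes the result on to the next step.

```python
import pandas as pd

def add_total(df):
    """Toy step: add a column holding the row-wise sum of 'a' and 'b'."""
    out = df.copy()
    out['total'] = out['a'] + out['b']
    return out

def add_label(df, label):
    """Toy step with an extra argument: add a constant 'label' column."""
    out = df.copy()
    out['label'] = label
    return out

toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# Each .pipe call feeds the previous result into the next function;
# extra positional arguments are forwarded after the DataFrame.
piped = toy.pipe(add_total).pipe(add_label, "demo")

# The same computation written as nested function calls.
nested = add_label(add_total(toy), "demo")
```

Both forms should produce the same frame; the piped version simply reads top to bottom in the order the steps are applied.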

In [36]:
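# Run the pipeline once and reuse the result for both the features and the labels.
processed_test = process_data(test_data)
test_features, test_results = processed_test[features], processed_test['result']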

Since we are trying to predict a binary variable (win or lose), we use a logistic regression model. We fit the model to the training data, then measure its accuracy on the test data.
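
For intuition, a logistic regression model turns a weighted sum of the features into a win probability with the sigmoid function, then predicts a win when that probability is at least 0.5. The sketch below uses hypothetical weights and feature values chosen only for illustration; they are not the coefficients learned from Federer's data.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_win_probability(feature_values, weights, intercept):
    """Return P(win) as sigmoid(intercept + sum(w_i * x_i))."""
    score = intercept + sum(w * x for w, x in zip(weights, feature_values))
    return sigmoid(score)

# Hypothetical weights and feature values, chosen only for illustration.
example_weights = [2.0, -1.5]
example_intercept = 0.1
example_features = [0.8, 0.6]  # two percentage-style features

p_win = predict_win_probability(example_features, example_weights, example_intercept)
predicted_result = 1 if p_win >= 0.5 else 0
```

Here the linear score is 0.1 + 2.0 * 0.8 - 1.5 * 0.6 = 0.8, which the sigmoid maps to a probability of about 0.69, so the predicted result is a win.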

In [37]:
# `multi_class` is omitted: the target is binary, and recent scikit-learn releases deprecate the parameter.
fed_model = lm.LogisticRegression(penalty='l2', C=1.0, fit_intercept=True)
fed_model.fit(train_features, train_results)
fed_model.score(test_features, test_results)

1.0

As the output above shows, the model predicts every match in the test set correctly. However, because we only considered Federer's 2018 matches, the test set is very small (only around 5 matches), so a perfect score here is weak evidence. We cannot conclude that the model will be accurate for Federer's matches in general.
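
To put a rough number on that uncertainty, we can compute a confidence interval for the accuracy. The sketch below uses the Wilson score interval, which is one common choice and not something taken from the original analysis, and it assumes exactly 5 test matches, all predicted correctly.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Return the Wilson score interval (lower, upper) for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Assumed counts: 5 test matches, all predicted correctly.
lower, upper = wilson_interval(successes=5, trials=5)
print(f"95% interval for accuracy: [{lower:.3f}, {upper:.3f}]")
```

Under these assumptions the lower bound comes out at roughly 0.57, so a perfect score on 5 matches is still compatible with a model that is right only a little more than half the time.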