In [1]:
import pandas as pd
from pathlib import Path
import sklearn.linear_model as lm

# Loading in Data and Functions

Since we're making the model for Roger Federer, let's load in his data from `feature-engineering.ipynb`.

In [2]:
data_file = Path('./data', 'fed_2018.hdf')
fed_2018 = pd.read_hdf(data_file, 'fed_2018')
fed_2018.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_name,winner_hand,...,player_1stIn%,opp_1stIn%,player_1stWon%,opp_1stWon%,player_2ndWon%,opp_2ndWon%,player_bpSaved%,opp_bpSaved%,player_bpFaced%,opp_bpFaced%
202,2018-580,Australian Open,Hard,128,G,2018-01-15,164,103819,Roger Federer,R,...,0.716049,0.555556,0.775862,0.7,0.608696,0.45,1.0,0.692308,0.024691,0.144444
234,2018-580,Australian Open,Hard,128,G,2018-01-15,232,103819,Roger Federer,R,...,0.576087,0.504673,0.830189,0.759259,0.666667,0.509434,0.666667,0.727273,0.032609,0.102804
250,2018-580,Australian Open,Hard,128,G,2018-01-15,316,103819,Roger Federer,R,...,0.648936,0.677778,0.803279,0.672131,0.424242,0.413793,0.5,0.375,0.021277,0.088889
258,2018-580,Australian Open,Hard,128,G,2018-01-15,408,103819,Roger Federer,R,...,0.592593,0.527778,0.833333,0.666667,0.757576,0.529412,1.0,0.7,0.0,0.092593
262,2018-580,Australian Open,Hard,128,G,2018-01-15,504,103819,Roger Federer,R,...,0.625,0.56383,0.833333,0.698113,0.527778,0.512195,0.6,0.5,0.052083,0.085106


We're also going to need the functions we defined in `data-cleaning.ipynb` and `feature-engineering.ipynb`, so let's "import" those notebooks by running them.

In [3]:
%%capture
%run data-cleaning.ipynb
%run feature-engineering.ipynb

# Training and Evaluating the Model

We need to create the matrix for training the model. This matrix consists of the features we chose and engineered in `feature-engineering.ipynb`.

In [4]:
features = ['player_ace%', 'opp_ace%', 'player_df%', 'opp_df%', 'player_1stIn%', 'opp_1stIn%',
            'player_1stWon%', 'opp_1stWon%', 'player_2ndWon%', 'opp_2ndWon%', 'player_bpSaved%',
            'opp_bpSaved%', 'player_bpFaced%', 'opp_bpFaced%', 'win_streak',
            'head_to_head', 'opponent_hand_L', 'opponent_hand_R']

In [5]:
def process_data(data, player="Roger Federer"):
    """Run the feature engineering pipeline on `data` from the perspective of `player`."""
    X = (
        data
            .pipe(convert_match_stats_to_percent)
            .pipe(replace_nan_bp)
            .pipe(add_win_loss, player)
            .pipe(add_win_streak)
            .pipe(add_head_to_head)
            .pipe(add_opponent_hand)
            .pipe(add_player_v_opponent_stats, player)
    )
    return X

In [6]:
def make_train_matrix(data, player, test=False):
    """Build the feature matrix for `player`; if `test` is True, also return the match results."""
    all_data = clean_data(data)
    player_data = get_matches_for_player(all_data, player)
    # Run the pipeline once and reuse the result for both features and labels.
    processed = process_data(player_data, player)
    if test:
        return processed[features], processed['result']
    return processed[features]

In [7]:
fed_2018_features, fed_2018_results = fed_2018[features], fed_2018['result']

## A Note on Evaluating the Model

In most cases, the available dataset is randomly split into a set for training the model and a set for testing the model. However, in this case, that approach will not give a reliable evaluation of the model because the values of the `win_streak` and `head_to_head` features depend on the full, chronologically ordered match history (i.e. splitting the dataset results in incorrect values for these two features). Therefore, we will instead train the model on Federer's 2017 data and use the model's score on Federer's 2018 data to determine the quality of the model.
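To make the problem with a random split concrete, here is a minimal sketch in plain Python. It assumes a win streak is "the number of consecutive wins immediately before a match"; the actual definition lives in `add_win_streak` and may differ. The `win_streaks` helper, the `season` results, and the `kept` indices are all hypothetical.

```python
def win_streaks(results):
    """For each match, count the consecutive wins immediately before it (assumed definition)."""
    streaks, current = [], 0
    for won in results:
        streaks.append(current)
        current = current + 1 if won else 0
    return streaks

# Hypothetical season in chronological order: 1 = win, 0 = loss.
season = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
full = win_streaks(season)

# Hypothetical "random" split that happens to keep only these matches.
kept = [0, 2, 4, 5, 8, 9]
subset = win_streaks([season[i] for i in kept])

# Compare the streak computed from the full history with the one computed on the subset.
for idx, s in zip(kept, subset):
    print(f"match {idx}: full history = {full[idx]}, subset only = {s}")
```

Because the losses at matches 3 and 6 are dropped from the subset, the recomputed streak keeps growing even though the real streak was reset. Splitting by season avoids this for the matches within each season.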

In [8]:
fed_2017_train, fed_2017_results = make_train_matrix(pd.read_csv(Path('./data/jeff_sackman_data/atp_matches_2017.csv')), "Roger Federer", True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

## Training the Model

Since we are trying to predict a binary variable (win or loss), we use a logistic regression model.
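As a rough illustration of what a logistic regression computes, the sketch below turns a weighted sum of features into a win probability with the logistic (sigmoid) function and thresholds it at 0.5. The function name `predict_win_probability`, the two-feature input, and the weights are made up for illustration; they are not the fitted model.

```python
import math

def predict_win_probability(x, weights, intercept):
    """Squash the linear score intercept + w.x into a probability in (0, 1)."""
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-feature example: [win_streak, head_to_head] with made-up weights.
weights = [1.5, 0.8]
intercept = -1.0

p = predict_win_probability([2, 1], weights, intercept)
prediction = int(p >= 0.5)  # 1 = predicted win, 0 = predicted loss
print(f"P(win) = {p:.3f}, predicted class = {prediction}")
```

A positive weight pushes the probability toward a win as the feature grows, and a negative weight pushes it toward a loss.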

In [9]:
fed_model = lm.LogisticRegression(penalty='l2', C=1.0, fit_intercept=True, multi_class='ovr')
fed_model.fit(fed_2017_train, fed_2017_results)
fed_model.score(fed_2018_features, fed_2018_results)

1.0

The above output is the model's mean accuracy on the 2018 data: a score of 1.0 means that the model trained on Federer's 2017 data correctly predicts all of Federer's 2018 match outcomes. Thus, we have strong evidence to believe the model is good. Let's take a look at the coefficients of the trained model to see which features were the strongest.

In [10]:
fed_model.coef_

array([[ 0.03453387, -0.08877613, -0.02458042,  0.01307659, -0.11027405,
        -0.14874659, -0.02687958, -0.29519857, -0.08934161, -0.20274061,
         0.080021  , -0.2390508 , -0.03268396,  0.04879537,  1.61853159,
         0.78727113,  0.17789702, -0.33590054]])

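The coefficients are listed in the same order as `features`. The two largest in magnitude are `win_streak` (about 1.62) and `head_to_head` (about 0.79), which indicates that the win streak and head-to-head features were the strongest in predicting the outcome of the match.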
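One way to read the coefficient array is to pair each value with its feature name and sort by absolute value. The sketch below copies the feature list and the printed coefficients from the cells above into plain Python lists. Note that, assuming the features were not standardized before fitting, the magnitudes are only a rough guide to importance, since the features are on different scales.

```python
features = ['player_ace%', 'opp_ace%', 'player_df%', 'opp_df%', 'player_1stIn%', 'opp_1stIn%',
            'player_1stWon%', 'opp_1stWon%', 'player_2ndWon%', 'opp_2ndWon%', 'player_bpSaved%',
            'opp_bpSaved%', 'player_bpFaced%', 'opp_bpFaced%', 'win_streak',
            'head_to_head', 'opponent_hand_L', 'opponent_hand_R']

# Values copied from the fed_model.coef_ output above.
coefficients = [0.03453387, -0.08877613, -0.02458042, 0.01307659, -0.11027405,
                -0.14874659, -0.02687958, -0.29519857, -0.08934161, -0.20274061,
                0.080021, -0.2390508, -0.03268396, 0.04879537, 1.61853159,
                0.78727113, 0.17789702, -0.33590054]

# Rank features by the magnitude of their coefficient, largest first.
ranked = sorted(zip(features, coefficients), key=lambda pair: abs(pair[1]), reverse=True)

for name, coef in ranked[:5]:
    print(f"{name:>16}: {coef:+.4f}")
```

With the fitted model in memory, the same pairing could be done with `zip(features, fed_model.coef_[0])` instead of the copied values.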