## 3) Initial Modeling and Error Evaluation

In [1]:
import pandas as pd
import numpy as np

df_model = pd.read_excel('C:\\Users\\rbush\\Documents\\Projects\\PGA Finish Projections\\PGA Finish Projections_modeling data.xlsx')

In [2]:
df_model.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'player', 'player_id', 'tournament_name',
       'tournament_id', 'course', 'course_id', 'region', 'region_id', 'season',
       'days_from_today', 'sg_putt', 'sg_arg', 'sg_app', 'sg_ott', 'sg_t2g',
       'sg_total', 'nan', 'mc', 'strokes', 'strokes_rel_par', 'place_adj',
       'nan_count', 'tourn_count', 'sg_putt1', 'sg_putt2', 'sg_putt3',
       'sg_arg1', 'sg_arg2', 'sg_arg3', 'sg_app1', 'sg_app2', 'sg_app3',
       'sg_ott1', 'sg_ott2', 'sg_ott3', 'sg_t2g1', 'sg_t2g2', 'sg_t2g3',
       'sg_total1', 'sg_total2', 'sg_total3'],
      dtype='object')

So far, I've cleaned the dataset, investigated whether a player's history at a particular course or tournament makes a significant difference in performance, and examined how much recent performance affects a player's next tournament finish. In this section, I will use these insights as features in a model to predict future performance.

Below I do a few different things:

<ol>
    <li><b>Establish a baseline for error evaluation:</b> I will use each record's prior-tournament strokes-gained data as predictors of <em>strokes_rel_par</em> (the target used in the code below), and use the resulting <em>root-mean-squared error (RMSE)</em> as an error baseline. When I create a model with new features or hyperparameters, I will calculate the difference in RMSE between the baseline and the new model to see how much more of the relationship is explained (i.e. if the RMSE difference is negative after adding new features, the error has been reduced, and the new model is better).</li>
    <li><b>Quantify the impact of including player, course, and tournament on predictive power:</b> the baseline does not care which player posted the recent strokes-gained performance, or where they did it. However, we know from our ANOVA that course and tournament are significantly correlated with a player's finish, so I expect these categorical features to improve predictive power significantly.</li>
    <li><b>Assess the improvement due to lagged strokes-gained data:</b> in the baseline case we will only use the prior tournament's performance as a predictor. However, we know that auto-correlation in strokes-gained performance is observable over a period of 3 weeks, and that it in turn has a relationship with tournament finishes. We will add lagged strokes-gained performance and assess its impact on predictive power.</li>
</ol>
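
The comparison described in the first item comes down to a percentage change in RMSE relative to the baseline. Below is a minimal sketch of that arithmetic. The helper name `rmse_pct_change` is my own, and the error values are made up for illustration; they are not results from this dataset.

```python
def rmse_pct_change(new_rmse, base_rmse):
    """Percentage change in RMSE relative to the baseline (negative = improvement)."""
    return round(100 * (new_rmse - base_rmse) / base_rmse, 2)

# Hypothetical values: a model that lowers the error, and one that raises it
print(rmse_pct_change(9.0, 10.0))   # -10.0
print(rmse_pct_change(10.5, 10.0))  # 5.0
```

A negative result means the new model's error is lower than the baseline's, which matches the sign convention used in the rest of this section.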

For our modeling and error evaluation in this section, I'll use Scikit-Learn to fit simple 100-estimator random forest models.  Below, I have written a simple function to calculate RMSE which is used to evaluate each model's error.

In [3]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from math import sqrt

In [4]:
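def rmse(predictions, actual):
    """Return the root-mean-squared error and the per-row residuals."""
    # mean_squared_error expects (y_true, y_pred); MSE is symmetric, so the value is unchanged
    mse = mean_squared_error(actual, predictions)
    error = sqrt(mse)
    residuals = predictions - actual
    return error, residuals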

In [5]:
train, test = train_test_split(df_model, test_size = 0.1)

In [6]:
print('The train dataset has %s rows' % len(train))
print('The test dataset has %s rows' % len(test))

The train dataset has 23605 rows
The test dataset has 2623 rows
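
Note that the split above is not seeded, so a re-run will place different rows in the test set and the error figures below will shift slightly. As a sketch only, here is how a seeded split behaves on a small synthetic frame. The `toy` data and the seed value of 42 are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_model (values are random, not real tour data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'sg_putt1': rng.normal(size=50),
                    'strokes_rel_par': rng.normal(size=50)})

# Passing the same random_state yields the same partition every time
train_a, test_a = train_test_split(toy, test_size=0.1, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.1, random_state=42)

print(len(train_a), len(test_a))           # 45 5
print(test_a.index.equals(test_b.index))   # True
```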


##### Base Accuracy Measurement

We'll create a list of the strokes-gained fields from the week prior and store it as the variable <em>lag_1_fields</em>. As we add features into the model, we will expand on this list and evaluate the new accuracy measurement.

In [7]:
lag_1_fields = ['sg_putt1','sg_arg1','sg_app1','sg_ott1','sg_t2g1']

In [8]:
rf_base = RandomForestRegressor(n_estimators = 100, random_state = 42)

In [9]:
rf_base.fit(train[lag_1_fields], train['strokes_rel_par'])

RandomForestRegressor(random_state=42)

In [10]:
predictions_base = rf_base.predict(test[lag_1_fields])

In [11]:
# rmse() returns (error, residuals); only the scalar error is needed here
rmse_base, _ = rmse(predictions_base, test['strokes_rel_par'])
base_rmse = round(rmse_base, 2)

In [12]:
print('Root Mean-Squared Error:', base_rmse)

Root Mean-Squared Error: 7.06


In real terms, an RMSE of ~7 means that our predicted score relative to par is typically off by about 7 strokes, high or low, in a given PGA Tour event. Because RMSE weights large misses more heavily than close predictions, my first reaction is that this is likely not a terrible place to start for the most highly-ranked and consistent players in the world. However, we will certainly look to improve upon this figure.
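
To illustrate why RMSE is sensitive to edge cases, here is a small sketch with made-up errors. The helper `rmse_simple` and both sets of misses are hypothetical. Each set has the same mean absolute error of 2 strokes, but the one containing a single large miss has twice the RMSE.

```python
from math import sqrt

def rmse_simple(predictions, actual):
    """Plain-Python RMSE: square the errors, average them, take the root."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actual)]
    return sqrt(sum(squared_errors) / len(squared_errors))

actual = [0, 0, 0, 0]
even_misses = [2, -2, 2, -2]   # every prediction is off by 2
one_big_miss = [0, 0, 0, 8]    # three perfect predictions, one off by 8

print(rmse_simple(even_misses, actual))   # 2.0
print(rmse_simple(one_big_miss, actual))  # 4.0
```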

In the <b><em>Feature Evaluation</em></b> section below, we'll see what kind of improvement we get by adding in the features we mentioned above.

##### Feature Evaluation

In [13]:
def feature_evaluate(train, test, target_field, base_error, scenario_names, scenario_features, n_estimators, random_state):
    """
    Fit a random forest for each feature scenario and print the % change in RMSE relative to the base case.
    A negative value means the scenario's error is lower than the baseline's.
    """

    for name, features in zip(scenario_names, scenario_features):

        # Use the caller's settings rather than hard-coded values
        random_forest = RandomForestRegressor(n_estimators = n_estimators, random_state = random_state)
        random_forest.fit(train[features], train[target_field])
        predictions = random_forest.predict(test[features])

        # rmse() returns (error, residuals); only the scalar error is needed
        error, _ = rmse(predictions, test[target_field])
        error_pct_diff = round(100*(error-base_error)/base_error, 2)

        print('%s: %s%%\n' % (name, error_pct_diff))

In [14]:
scenario_names = ['Player, Course, and Tournament', 
                 'Lagged Strokes-Gained Features', 
                 'Player, Course, and Tournament with Lagged Strokes-Gained']

Below we'll select only the fields we need to investigate error reduction in our scenarios listed above.

In [15]:
# Player, Course, and Tournament
pct_fields = ['player_id']+['course_id']+['tournament_id']+lag_1_fields

# Lagged Strokes-Gained Feature
lag_2_fields = ['sg_putt2','sg_arg2','sg_app2','sg_ott2','sg_t2g2']
lag_3_fields = ['sg_putt3','sg_arg3','sg_app3','sg_ott3','sg_t2g3']

sg_fields = lag_1_fields+lag_2_fields+lag_3_fields

# Player, Course, and Tournament with Lagged Strokes-Gained
pct_sg_fields = sg_fields+['player_id']+['tournament_id']+['course_id']
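
The construction of the lagged columns is not shown in this section. As a rough sketch only, columns like `sg_putt1` to `sg_putt3` could be derived by shifting each player's history. The toy data and the `event_order` column are hypothetical, and the real columns may have been built differently.

```python
import pandas as pd

# Toy history for two players; event_order is a made-up chronological index
toy = pd.DataFrame({
    'player_id':   [1, 1, 1, 1, 2, 2, 2],
    'event_order': [1, 2, 3, 4, 1, 2, 3],
    'sg_putt':     [0.5, -0.2, 1.1, 0.3, -0.4, 0.0, 0.9],
})

# Sort chronologically within each player, then shift to get prior events
toy = toy.sort_values(['player_id', 'event_order'])
for lag in (1, 2, 3):
    toy['sg_putt%s' % lag] = toy.groupby('player_id')['sg_putt'].shift(lag)

print(toy)
```

Rows without enough history get `NaN` in the lagged columns, so they would need to be dropped or imputed before fitting a model.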

In [16]:
scenario_features = [pct_fields, sg_fields, pct_sg_fields]

In [17]:
feature_evaluate(train, test, 'strokes_rel_par', base_rmse, scenario_names, scenario_features, n_estimators = 100, random_state = 42)

Player, Course, and Tournament: -17.53%

Lagged Strokes-Gained Features: -3.13%

Player, Course, and Tournament with Lagged Strokes-Gained: -16.26%



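Above, we can see that including the features we tested in our earlier statistical analysis reduces our error. When player, course, tournament, and lagged performance are all included, we see a reduction in error of roughly 16%. Most of that comes from the player, course, and tournament identifiers, which reduce error by about 17.5% on their own, while the lagged strokes-gained features alone reduce it by about 3%.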

However, while it's nice to see an error reduction, in real terms it only moves our typical miss from ~7.1 strokes relative to par to ~5.9. We're certainly on the right track, but there is more we can do to squeeze additional information out of this dataset. In the next section, we'll see how much more we can increase our accuracy via feature engineering, another modeling approach, and hyperparameter tuning.