# Machine Learning on PGA Tour - Data Preprocessing

In this notebook we preprocess the data before feeding it to our classification models.

In [88]:
# IMPORTS
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold

In [2]:
df = pd.read_csv("../data/pga_data.csv")
df

| Unnamed: 0 | Name | Season | Ranking | Driving Distance | Driving Accuracy | Club Head Speed | Ball Speed | Spin Rate | Eligible |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Justin Thomas | 2017 | 1 | 309.1 | 54.64 | 116.52 | 174.84 | 2320.1 | 1.0 |
| 1 | Jordan Spieth | 2017 | 2 | 294.6 | 58.67 | 112.66 | 168.55 | 2439.6 | 1.0 |
| 2 | Xander Schauffele | 2017 | 3 | 306.5 | 58.80 | 118.33 | 174.24 | 2518.8 | 1.0 |
| 3 | Dustin Johnson | 2017 | 4 | 314.8 | 54.02 | 121.45 | 180.66 | 2499.9 | 1.0 |
| 4 | Jon Rahm | 2017 | 5 | 305.3 | 58.27 | 116.42 | 174.53 | 2193.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 611 | C.T. Pan | 2021 | 121 | 296.3 | 61.03 | 111.20 | 167.34 | 2129.2 | -1.0 |
| 612 | Matt Kuchar | 2021 | 122 | 288.0 | 65.81 | 108.60 | 162.18 | 2419.4 | -1.0 |
| 613 | Brice Garnett | 2021 | 123 | 288.1 | 70.86 | 109.53 | 164.71 | 2539.5 | -1.0 |
| 614 | Scott Stallings | 2021 | 124 | 298.2 | 58.83 | 115.96 | 173.80 | 2516.0 | -1.0 |


First, we will look at how skewed the classes are in our data set.

In [3]:
pos_amount = len(df[df["Eligible"] == 1].index)
neg_amount = len(df[df["Eligible"] == -1].index)
print(f"Number of eligible: {pos_amount}")
print(f"Number of not eligible: {neg_amount}")

Number of eligible: 149
Number of not eligible: 467


As we can see, our data set is heavily skewed: the negative class has more than three times as many data points as the positive class (467 vs. 149). A classifier could therefore assign every data point to the dominating class and still reach a high accuracy, while its precision and recall for the positive class would be small (or zero). To counter this, we will use data augmentation to balance the classes.
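To make the risk described above concrete, here is a minimal sketch of a hypothetical "always negative" baseline evaluated on the class counts printed earlier. The baseline itself is an illustration and not one of the models used in this project; the metric functions come from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Class counts taken from the output above.
n_pos, n_neg = 149, 467
y_true = np.array([1] * n_pos + [-1] * n_neg)

# Hypothetical baseline: predict the negative class for every player.
y_pred = np.full_like(y_true, -1)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when there are no positive predictions.
precision = precision_score(y_true, y_pred, pos_label=1, zero_division=0)
recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

The baseline is right on 467 of 616 players (roughly 76% accuracy) without identifying a single eligible player, which is why accuracy alone is a misleading metric here.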

In [72]:
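def augment_data(X):
    """Create three noisy copies of each row in X, all labelled as eligible (1)."""
    aug_X = []
    aug_y = []
    for i in range(X.shape[0]):
        for _ in range(3):
            # One standard-normal perturbation per feature, added on the raw (unscaled) feature scale
            perturbations = np.random.standard_normal(X.shape[1])
            aug_X.append(X[i] + perturbations)
            aug_y.append(1)
    return np.array(aug_X), np.array(aug_y)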

In [82]:
X_pos = df[df["Eligible"] == 1].iloc[:, 3:8].to_numpy()
y_pos = df[df["Eligible"] == 1].iloc[:, 8].to_numpy()
X_aug, y_aug = augment_data(X_pos)
X_aug.shape[0]

447
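The augmentation above draws from NumPy's global random state, so each run produces different samples. As a sketch only, a reproducible variant could look like the following. The function name `augment_data_seeded` and the parameters `copies`, `noise_scale` and `seed` are hypothetical and not part of the original notebook, and the toy array stands in for `X_pos`.

```python
import numpy as np

def augment_data_seeded(X, copies=3, noise_scale=1.0, seed=0):
    """Return `copies` noisy versions of every row in X, all labelled +1."""
    rng = np.random.default_rng(seed)
    # Repeat each row `copies` times, then add independent Gaussian noise.
    X_rep = np.repeat(X, copies, axis=0)
    noise = rng.normal(loc=0.0, scale=noise_scale, size=X_rep.shape)
    return X_rep + noise, np.ones(X_rep.shape[0], dtype=int)

# Toy stand-in for X_pos: 4 samples, 5 features.
X_toy = np.arange(20, dtype=float).reshape(4, 5)
X_aug_toy, y_aug_toy = augment_data_seeded(X_toy)
print(X_aug_toy.shape, y_aug_toy.shape)
```

With 149 positive samples and `copies=3`, this would yield the same 447 augmented rows as above, but identical on every run for a fixed seed.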

As the ranges of our feature values differ quite a lot, we also standardize the data. The scaler is fit on the training data only, and the mean and standard deviation learned there are then applied to the validation/testing data.

In [86]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_aug)
# later use scaler.transform(X_test)
X_scaled

array([[ 1.02038113, -1.57294431,  0.02710474,  0.15276793, -1.1746936 ],
       [ 1.1751274 , -1.15583169, -0.53643043,  0.26857336, -1.16318628],
       [ 0.925306  , -1.5673207 ,  0.10255744,  0.31230229, -1.171197  ],
       ...,
       [ 0.81899994, -1.20794155,  0.67821711,  0.87641008, -0.53228462],
       [ 0.9903229 , -1.18353483,  0.89605884,  0.82532824, -0.54137369],
       [ 0.95107153, -0.91194904,  0.64776378,  0.98770068, -0.52950311]])
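The fit-on-train, transform-on-test pattern mentioned above can be sketched as follows. The data here is synthetic, with made-up means and spreads chosen only to mimic features on very different scales, and the 80/20 split is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: 100 samples, 5 features on very different scales.
X = rng.normal(loc=[300, 60, 115, 170, 2400], scale=[9, 5, 3, 4, 200], size=(100, 5))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))
```

Only the training columns end up with exactly zero mean and unit variance. The test columns will be close to that but not exact, which is expected because the test data never influences the scaling parameters.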

We use k-fold cross-validation with 20 shuffled folds in the validation phase of the models.

In [89]:
cv = KFold(n_splits=20, random_state=42, shuffle=True)
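As a sketch of how this splitter might be combined with the standardization step, the scaler can be placed in a pipeline so that it is refit on the training part of every fold. The `LogisticRegression` classifier and the synthetic data below are assumptions for illustration; the notebook does not say which models are used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 samples, 5 features, labels in {-1, 1}.
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, -1)

# The pipeline refits the scaler on the training part of every fold.
model = make_pipeline(StandardScaler(), LogisticRegression())
cv = KFold(n_splits=20, random_state=42, shuffle=True)

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Keeping the scaler inside the pipeline means the held-out fold never contributes to the scaling parameters, which matches the fit-on-train rule stated earlier.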