# Machine Learning on PGA Tour - Data Preprocessing

In this notebook, we preprocess the PGA Tour data before passing it to our classification models.

In [1]:
# IMPORTS
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold

In [2]:
df = pd.read_csv("../data/pga_data.csv")
df

Unnamed: 0,Name,Season,Ranking,Driving Distance,Driving Accuracy,Club Head Speed,Ball Speed,Spin Rate,Eligible
0,Rory McIlroy,2016,1,304.9,61.80,119.62,179.01,2435.6,1.0
1,Dustin Johnson,2016,2,313.9,56.85,122.38,181.75,2685.6,1.0
2,Patrick Reed,2016,3,296.7,56.68,118.37,172.95,2936.1,1.0
3,Adam Scott,2016,4,303.9,55.71,119.05,179.21,2507.4,1.0
4,Paul Casey,2016,5,294.0,64.41,117.46,173.41,2473.3,1.0
...,...,...,...,...,...,...,...,...,...
734,C.T. Pan,2021,121,296.3,61.03,111.20,167.34,2129.2,-1.0
735,Matt Kuchar,2021,122,288.0,65.81,108.60,162.18,2419.4,-1.0
736,Brice Garnett,2021,123,288.1,70.86,109.53,164.71,2539.5,-1.0
737,Scott Stallings,2021,124,298.2,58.83,115.96,173.80,2516.0,-1.0


First, we will look at how skewed the classes in our data set are.

In [3]:
pos_amount = len(df[df["Eligible"] == 1].index)
neg_amount = len(df[df["Eligible"] == -1].index)
print(f"Number of eligible: {pos_amount}")
print(f"Number of not eligible: {neg_amount}")

Number of eligible: 179
Number of not eligible: 560


As we can see, the data set is heavily skewed: the negative class has more than three times as many data points as the positive class (560 vs. 179). A classifier could then assign every data point to the dominant class, which in our case would give high accuracy but low (or zero) precision and recall. Therefore, we will use data augmentation to balance the classes.
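To make the accuracy pitfall concrete, the sketch below scores a hypothetical classifier that always predicts the negative class. The label arrays are rebuilt from the class counts printed above (179 eligible, 560 not eligible) purely for illustration; no model from this project is involved.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Labels rebuilt from the class counts printed above (illustrative only).
y_true = np.concatenate([np.ones(179), -np.ones(560)])

# Hypothetical degenerate classifier: always predicts the majority class.
y_pred = -np.ones_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positives are predicted.
precision = precision_score(y_true, y_pred, pos_label=1, zero_division=0)
recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

Accuracy lands at roughly 76% even though the classifier never identifies a single eligible player, which is why accuracy alone is a misleading metric here.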

In [4]:
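def augment_data(X):
    """Create three noisy copies of each row of X, all labelled positive (1)."""
    aug_X = []
    aug_y = []
    for i in range(X.shape[0]):
        for _ in range(3):
            # Standard normal noise, one value per feature column
            perturbations = np.random.standard_normal(X.shape[1])
            aug_X.append(X[i] + perturbations)
            aug_y.append(1)
    return np.array(aug_X), np.array(aug_y)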

In [5]:
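# Keep only the eligible (positive) rows; features and label are selected by column position
X_pos = df[df["Eligible"] == 1].iloc[:, 3:8].to_numpy()
y_pos = df[df["Eligible"] == 1].iloc[:, 8].to_numpy()
X_aug, y_aug = augment_data(X_pos)
X_aug.shape[0]  # number of augmented samples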

537
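Because `augment_data` adds unit-variance noise to unscaled features, the perturbation appears large relative to a feature such as Club Head Speed but almost negligible for Spin Rate. One possible variation, which is not what this notebook does, is to scale the noise by each feature's standard deviation. The helper `augment_scaled`, its `copies` and `noise_level` parameters, and the seed are hypothetical names and values chosen for illustration; the demo rows are the first two rows of the table above.

```python
import numpy as np

def augment_scaled(X, copies=3, noise_level=0.1, seed=0):
    """Return `copies` noisy versions of every row, with noise proportional
    to each feature's standard deviation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    feature_std = X.std(axis=0)                  # one standard deviation per column
    repeated = np.repeat(X, copies, axis=0)      # each row repeated `copies` times
    noise = rng.standard_normal(repeated.shape) * noise_level * feature_std
    return repeated + noise, np.ones(repeated.shape[0])

# First two rows of the table above, in the notebook's feature order.
X_demo = np.array([[304.9, 61.80, 119.62, 179.01, 2435.6],
                   [313.9, 56.85, 122.38, 181.75, 2685.6]])
X_new, y_new = augment_scaled(X_demo)
print(X_new.shape, y_new.shape)
```

With this approach every feature is perturbed by the same fraction of its own spread, so the choice of `noise_level` becomes a single tunable parameter rather than an implicit consequence of the units.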

Because the ranges of our feature values differ considerably, we also standardize the data. The scaler is first fitted on the training data, and the parameters learned there (mean and standard deviation) are then applied to the validation/testing data.

In [6]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_aug)
# later use scaler.transform(X_test)
X_scaled

array([[ 0.40767232, -0.07066548,  0.92159943,  0.7388761 , -0.62174926],
       [ 0.43542156,  0.10241775,  0.95507728,  1.11700156, -0.63120485],
       [ 0.58292314,  0.12316052,  0.6698381 ,  0.94470415, -0.62743871],
       ...,
       [ 1.26870666, -0.99303086,  0.68182268,  0.94657959, -0.52594978],
       [ 1.08137675, -1.08534859,  0.49851873,  0.86831544, -0.52300911],
       [ 1.10410275, -1.20352787,  0.86403682,  0.77442501, -0.52585952]])
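The fit-on-training, transform-on-test pattern can be sketched as follows. The arrays are synthetic stand-ins, since the notebook does not show its train/test split; the split ratio and seeds are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (stand-in data).
X = rng.normal(loc=[300.0, 60.0, 2500.0], scale=[8.0, 5.0, 200.0], size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from training rows
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and std

print(X_train_scaled.mean(axis=0).round(6))  # approximately zero per feature
print(X_test_scaled.mean(axis=0).round(2))   # close to, but not exactly, zero
```

The test-set means are not exactly zero because the scaler never sees those rows, which is the point: no information from the held-out data leaks into preprocessing.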

We use k-fold cross-validation in the validation phase of the model, with 20 shuffled folds and a fixed random seed for reproducibility.

In [7]:
cv = KFold(n_splits=20, random_state=42, shuffle=True)
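A minimal sketch of how the `cv` splitter could drive validation, assuming a logistic regression stands in for the classifiers (the notebook does not name them here) and using synthetic data. Wrapping the scaler in a pipeline refits it on each training fold, so the held-out fold never influences the scaling parameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))                      # synthetic features
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)   # synthetic labels in {-1, 1}

cv = KFold(n_splits=20, random_state=42, shuffle=True)
# Stand-in model: scaling and classifier are fitted together per fold.
model = make_pipeline(StandardScaler(), LogisticRegression())

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"{len(scores)} folds, mean accuracy {scores.mean():.3f}")
```

With imbalanced data, swapping `scoring="accuracy"` for `"precision"` or `"recall"` gives a per-fold view of the metrics that matter more for the minority class.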