<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>

<h1 align="center"><font size="5">Supervised Machine Learning: Classification - Final Assignment</font></h1>

## Importing required libraries

In [127]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support, confusion_matrix, ConfusionMatrixDisplay, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
import lime.lime_tabular
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

## Importing the dataset

In [128]:
filepath = "data/pgafulldata.csv"
data = pd.read_csv(filepath, encoding='latin1')
data.head()

Unnamed: 0,PLAYER NAME,Ball Speed,Driving Distance,Approaches from > 100 yards,Eagles (Holes per),Putts Per Round,Birdie Average,Proximity to Hole,FedexCup Regular Season Points,Average Distance of Putts made,...,Total Birdies,Total Eagles,Spin Rate,Top 10 Finishes,Sand Save Percentage,Scrambling,SG: Total,Smash Factor,Country,College
0,A.J. McInerney (2018),,,,,,,,,,...,,,,1.0,,,,,USA,"University of Nevada-Las Vegas 2016, Finance"
1,Aaron Wise (2017),,,,,,,,,,...,,,,1.0,,,,,USA,University of Oregon
2,Aaron Wise (2018),171.85,302.9,"33' 5""",264.0,29.15,4.11,"36' 8""",1086.0,"79' 2""",...,362.0,6.0,2580.3,4.0,47.92,56.28,0.703,1.502,USA,University of Oregon
3,Aaron Wise (2019),174.29,302.6,"33' 10""",414.0,29.25,4.42,"37' 5""",400.0,"72' 1""",...,305.0,3.0,2331.3,1.0,48.19,52.05,0.329,1.495,USA,University of Oregon
4,Abraham Ancer (2016),159.6,276.4,"32' 4""",864.0,29.06,3.33,"34' 9""",147.0,"65' 4""",...,160.0,1.0,2604.7,,46.75,56.67,-0.727,1.493,MEX,"University of Oklahoma 2013, Multi-Disciplinar..."


# 1. About the Data

## Dataset Summary
This dataset contains the statistics for the top 200 PGA Tour players from 2015-2019. The original dataset comprises of 29 columns with stats about every aspect of their game, including driving distance, putts per round, proximity to the hole, and many more.
<br><br>
The primary purpose of this analysis is to determine what factors led to these players either having or not having a top ten finish in the year that the statistics were collected.

## Objective of the Analysis
In this analysis of PGA Tour data, the model to be constructed will mostly focus on interpretation. While the model will try and predict whether or not a player will have a top ten finish, there is more interest on the features that drive those predictions. This will tell us the aspects of the golf game that most directly lead high finishes.

## Data Cleaning and Feature Engineering
Below are listed all of the actions taken to clean the data and implement new features:
- Any distance related metric that was measured in feet and inches was converted to a decimal in feet
- This dataset included a lot superfluous statistics that do not relate to performance, so any of those are removed
- If a player did not record a top ten finish, the column was labeled NAN, so that was changed to 0
- A new column (our target column) was created to indicate if a player had a top ten finish or not (0 if they didn't, 1 if they did)
- Many rows had almost all null values, so any rows with a null value were discarded.

In [129]:
#Replace any NAN values with 0 for top ten finishes column
def no_top_ten_replacer(value):
    if pd.isna(value):
        return 0
    else:
        return value
    
#Determines if a player has a top ten finish or not
def top_ten_bool(value):
    if value:
        return 1
    else:
        return 0

#Seperating the name column into a name and year column
def obj_to_int(obj):
    if pd.isna(obj):
        return None
    
    return int(obj)

data['Player Name'] = data['PLAYER NAME'].str.extract(r'^(.*) \(\d{4}\)$')
data['Year'] = data['PLAYER NAME'].str.extract(r'\((\d{4})\)')
data['Year'] = data['Year'].apply(obj_to_int)

#Function to convert distance to a decimal
def feet_inches_string_to_decimal(ft_inches_str):
    if pd.isna(ft_inches_str) or ft_inches_str.strip == "":
        return None
    
    clean_str = ft_inches_str.replace("'", "")
    clean_str = clean_str.replace("\"", "")
    parts = clean_str.split(" ")
    if len(parts) != 2:
        return None
    
    height = 0
    height += int(parts[0]) + float(int(parts[1])/12)
    return height

#Convert all feet and inches columns to decimals using the function above
data['Decimal Approach from > 100 Yards (Feet)'] = data['Approaches from > 100 yards'].apply(feet_inches_string_to_decimal)
data['Average Distance Putts Made (Feet)'] = data['Average Distance of Putts made'].apply(feet_inches_string_to_decimal)
data['Decimal Proximity to Hole (Feet)'] = data['Proximity to Hole'].apply(feet_inches_string_to_decimal)
data['Top 10 Finishes'] = data['Top 10 Finishes'].apply(no_top_ten_replacer)

#Create the target column, top ten finish (T10?)
data['T10?'] = data['Top 10 Finishes'].apply(top_ten_bool)

#Removing all unecessary columns
data.drop(columns=['Eagles (Holes per)', 'FedexCup Regular Season Points', 'Official Money',
                   'Total Birdies', 'Total Eagles', 'Country', 'College', 'Top 10 Finishes',
                   'PLAYER NAME', 'Approaches from > 100 yards', 'Average Distance of Putts made',
                   'Proximity to Hole', 'Player Name', 'Year'], inplace=True)

#Finally, removing any rows with null values, as there are many rows that are have little to no data
data = data.dropna()

# 2. Model Construction & Testing

In [130]:
def precision_scores(y_test, preds):
    accuracy = accuracy_score(y_test, preds)
    precision, recall, fbeta, support = precision_recall_fscore_support(y_test, preds, beta=5, pos_label=1, average='binary')
    auc = roc_auc_score(y_test, preds)
    print(f"Accuracy is: {accuracy:.2f}")
    print(f"Precision is: {precision:.2f}")
    print(f"Recall is: {recall:.2f}")
    print(f"Fscore is: {fbeta:.2f}")
    print(f"AUC is: {auc:.2f}")

## Data Split

In [131]:
X = data.drop(columns='T10?')
y = data['T10?']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

## Scaling Features

For many classification models, scaling is crucial to identify decision boundaries between classes. Therefore, scaling is applied below in order for the models to perform optimally

In [132]:
ss = StandardScaler()
ss = ss.fit(X)
X_scaled = ss.fit_transform(X)
X = pd.DataFrame(X_scaled, columns=X.columns)
X.head()

Unnamed: 0,Ball Speed,Driving Distance,Putts Per Round,Birdie Average,GIR Percentage from Fairway,Bogey Average,Par 3 Scoring Average,Club Head Speed,Par 4 Scoring Average,Par 5 Scoring Average,Driving Accuracy Percentage,Scoring Average,Spin Rate,Sand Save Percentage,Scrambling,SG: Total,Smash Factor,Decimal Approach from > 100 Yards (Feet),Average Distance Putts Made (Feet),Decimal Proximity to Hole (Feet)
0,0.427479,1.06538,0.15128,1.595539,1.67175,0.198506,0.095696,0.125518,-0.986152,-0.343162,0.362434,-0.989217,-0.078835,-0.388669,-0.769352,0.778216,1.080796,0.470195,1.569196,0.451328
1,0.822495,1.032727,0.354674,2.604599,1.151096,0.236132,-1.069065,0.626891,0.070041,-1.622385,0.056275,-0.296735,-1.269121,-0.343114,-2.05429,0.241928,0.600012,0.732957,-0.220288,0.919951
2,-1.555692,-1.818914,-0.031774,-0.943386,-2.702448,1.778785,0.561601,-1.610718,0.862186,2.073149,0.767415,1.670928,0.037804,-0.586076,-0.650883,-1.272298,0.462646,-0.212985,-1.925561,-0.746265
3,-0.265417,0.325259,-0.336865,0.065674,-0.439199,-0.742135,0.095696,-0.461739,-0.722104,-0.343162,0.422503,-0.409333,0.014859,0.65742,1.177799,0.229022,0.806063,0.207434,-0.62029,-0.329711
4,-0.166663,0.020504,-0.540258,0.879432,-0.818178,-0.742135,0.095696,-0.417636,-1.514248,-0.485297,1.695578,-0.520525,0.775399,-0.608011,0.962124,0.508638,1.012113,-0.580851,-0.409763,-0.850404


## Important Note:
All classifiers will have their hyperparameters tuned on GridSearchCV, and the best estimator will be trained on a control X/y train/test sets and metrics will be evaluated from there.

## Logistic Regression Model (Coefficient based)

In [133]:
lr_model = LogisticRegression(penalty='l2')
param_grid = {'C': np.geomspace(1, 1e5, 6),
              'max_iter': [1000, 2000, 3000]}

grid_lr = GridSearchCV(estimator=lr_model, 
                       param_grid=param_grid,
                       scoring='roc_auc')
grid_lr = grid_lr.fit(X, y)

#Best model metrics calculation
best_lr = grid_lr.best_estimator_
best_lr = best_lr.fit(X_train, y_train)
y_pred = best_lr.predict(X_test)
print(precision_scores(y_test, y_pred))
print(grid_lr.best_params_)

Accuracy is: 0.88
Precision is: 0.91
Recall is: 0.95
Fscore is: 0.95
AUC is: 0.65
None
{'C': np.float64(1.0), 'max_iter': 1000}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Decision Tree Model

In [134]:
dt_model = DecisionTreeClassifier()
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [5*x-1 for x in range(1,22,5)],
              'min_samples_leaf': [1,2,5]}
grid_dt = GridSearchCV(estimator=dt_model, 
                       param_grid=param_grid,
                       scoring='roc_auc')
grid_dt = grid_dt.fit(X, y)
best_dt = grid_dt.best_estimator_
y_pred = best_dt.predict(X_test)
print(precision_scores(y_test, y_pred))
print(grid_dt.best_params_)

Accuracy is: 0.87
Precision is: 0.87
Recall is: 1.00
Fscore is: 0.99
AUC is: 0.50
None
{'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 2}


While both of these models are fairly good, there are some changes that might make them perform even better

## Class Weighting on Logistic Regression
While the previous logistic regression did perform well, the classes were unbalanced. Therefore, this new model will add the class weight parameter to see how it enhances performace

In [135]:
# Getting an idea of how unbalanced the classes are (should be about an 80/20 split)
data['T10?'].value_counts(normalize=True)

T10?
1    0.829812
0    0.170188
Name: proportion, dtype: float64

In [141]:
#Tune the class_weight parameter
best_lr_weighted = best_lr
best_lr_weighted.class_weights_ = {0:0.8, 1:0.2}
best_lr_weighted = best_lr_weighted.fit(X_train, y_train)
precision_scores(y_test, best_lr_weighted.predict(X_test))

Accuracy is: 0.78
Precision is: 0.94
Recall is: 0.80
Fscore is: 0.81
AUC is: 0.72


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Ensemble Methods on the Decision Tree (Random Forest)

In [145]:
rf_model = RandomForestClassifier(max_depth=4, criterion='gini', min_samples_leaf=2)
param_grid = {'n_estimators': [x for x in range(20, 120, 20)]}
grid_rf = GridSearchCV(estimator=rf_model,
                       param_grid=param_grid,
                       scoring='roc_auc')
grid_rf = grid_rf.fit(X,y)
best_rf = grid_rf.best_estimator_
best_rf = best_rf.fit(X_train, y_train)
print(precision_scores(y_test, best_rf.predict(X_test)))
print(grid_rf.best_params_)

Accuracy is: 0.89
Precision is: 0.90
Recall is: 0.97
Fscore is: 0.97
AUC is: 0.65
None
{'n_estimators': 40}
