# Predictive Analysis - Stint to Previous Stint Comparison

Age, percent through career, the number of teammates of the same nationality, and the change in that number versus the previous stint are used to predict whether a player's current stint scores higher or lower than the previous one. The season year is also included as a feature.

## Load Data and Specify Features

Each player's first stint in the league is removed: there is no previous stint to compare against, so the target is undefined and the row cannot be used for training.

In [74]:
import pandas as pd

df = pd.read_csv('Data/Gold/main.csv')
df = df[['year','age', 'percent_through_career', 'teammates_same_nationality', 'tsm_vs_prev_stint', 'stint_vs_prev_stint']]

# Remove rows where 'stint_vs_prev_stint' is NaN
df = df.dropna(subset=['stint_vs_prev_stint'])

df.head()

| | year | age | percent_through_career | teammates_same_nationality | tsm_vs_prev_stint | stint_vs_prev_stint |
|---|---|---|---|---|---|---|
| 1 | 2010 | 23.0 | 0.258373 | 3 | 0.0 | 1.0 |
| 2 | 2011 | 24.0 | 0.61244 | 1 | -1.0 | 0.0 |
| 4 | 1927 | 27.0 | 0.132132 | 0 | 0.0 | 0.0 |
| 5 | 1928 | 28.0 | 0.201201 | 0 | 0.0 | 1.0 |
| 6 | 1929 | 29.0 | 0.333333 | 0 | 0.0 | 1.0 |
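
The gaps in the index above (rows 0 and 3 are missing) are consistent with first stints having been dropped. As a minimal sketch of why a first stint has no usable target, the toy example below derives a comparison column from a per-player score. The column names `player_id` and `score`, and the way the target is computed, are assumptions made for illustration; they are not taken from the actual pipeline that produced `main.csv`.

```python
import pandas as pd

# Toy data; `player_id` and `score` are hypothetical column names.
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2],
    "year": [2009, 2010, 2011, 1926, 1927],
    "score": [5.0, 7.0, 6.0, 3.0, 3.0],
})
toy = toy.sort_values(["player_id", "year"])

# Score of the same player's previous stint; NaN for a first stint.
prev_score = toy.groupby("player_id")["score"].shift(1)

# 1.0 if the stint beat the previous one, 0.0 otherwise,
# and NaN where there is no previous stint to compare against.
toy["stint_vs_prev_stint"] = (
    (toy["score"] > prev_score).astype(float).where(prev_score.notna())
)

# Same filtering step as in the cell above.
trainable = toy.dropna(subset=["stint_vs_prev_stint"])
print(trainable)
```

Each player's first row is removed by `dropna`, which leaves the same kind of index gaps seen in the real data.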


### Setup Result Table

In [75]:
table_columns = (
    'Run', 'Mean CV ROC AUC', 'Mean CV F1', 'Test ROC AUC',
    'Test F1', 'Train ROC AUC', 'Train F1',
)

results_table = []

results_table.append(table_columns)

## Random Forest Analysis

### Initial Run

In [76]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, f1_score, roc_auc_score

# Split input features and target feature
X = df.drop(columns=['stint_vs_prev_stint'])
y = df['stint_vs_prev_stint']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model for classification
model = RandomForestClassifier(random_state=42)

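# Define the scoring metrics by name. The built-in 'roc_auc' scorer scores
# predicted probabilities for this model, so it replaces
# make_scorer(..., needs_proba=True), which is deprecated as of scikit-learn 1.4.
f1_scorer = 'f1'
roc_auc_scorer = 'roc_auc'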

# Perform cross-validation and compute F1 and ROC AUC scores
cv_f1_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=f1_scorer)
cv_roc_auc_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=roc_auc_scorer)

# Train the model on the full training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model on the test set
test_f1 = f1_score(y_test, y_test_pred, average='binary')
test_roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set
train_f1 = f1_score(y_train, y_train_pred, average='binary')
train_roc_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])

# Feature importances
feature_importances = model.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the results
print("Cross-Validation F1 Scores:", cv_f1_scores)
print("Mean Cross-Validation F1:", cv_f1_scores.mean())
print("Cross-Validation ROC AUC Scores:", cv_roc_auc_scores)
print("Mean Cross-Validation ROC AUC:", cv_roc_auc_scores.mean())

print("Test Set F1 Score:", test_f1)
print("Test Set ROC AUC:", test_roc_auc)
print("Train Set F1 Score:", train_f1)
print("Train Set ROC AUC:", train_roc_auc)
print("Feature Importances:\n", feature_importances_df)

# Append results to the results table
results_table.append(('Initial Run', cv_roc_auc_scores.mean(), cv_f1_scores.mean(), test_roc_auc, test_f1, train_roc_auc, train_f1))





Cross-Validation F1 Scores: [0.49259498 0.50817046 0.49270767 0.50706033 0.51579955]
Mean Cross-Validation F1: 0.5032665990702172
Cross-Validation ROC AUC Scores: [0.56377078 0.58463559 0.56151325 0.58414523 0.58684195]
Mean Cross-Validation ROC AUC: 0.5761813594324071
Test Set F1 Score: 0.5055455248903792
Test Set ROC AUC: 0.5861509983237954
Train Set F1 Score: 1.0
Train Set ROC AUC: 1.0
Feature Importances:
                       Feature  Importance
2      percent_through_career    0.427560
0                        year    0.235091
3  teammates_same_nationality    0.158197
1                         age    0.130597
4           tsm_vs_prev_stint    0.048555


A ROC AUC of 0.5 is equivalent to random guessing. Our test ROC AUC of 0.586 is only slightly better than that, and the test F1 score of 0.506 is also weak; we would hope to see it closer to 0.7 or 0.8. The training scores of 1.0 show that the model fits the training data perfectly but does not generalize, which indicates overfitting.
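
To make the 0.5 baseline concrete, here is a small sketch that computes ROC AUC directly from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The `roc_auc` function is a hand-written illustration, not the scikit-learn implementation, and the labels and scores are randomly generated rather than taken from our data.

```python
import random

def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(2000)]

# Scores unrelated to the labels: the model is guessing.
random_scores = [rng.random() for _ in labels]
# Scores that separate the classes perfectly.
perfect_scores = [float(y) for y in labels]

auc_random = roc_auc(labels, random_scores)
auc_perfect = roc_auc(labels, perfect_scores)
print("AUC with random scores:", auc_random)
print("AUC with perfect scores:", auc_perfect)
```

Random scores land close to 0.5 and a perfect ranking gives exactly 1.0, which is the scale against which the test scores in this notebook should be read.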

### Control - Only Age & Percent Through Career

To see whether the teammate features we developed improve model performance, we create a control model that excludes them, keeping only year, age, and percent through career.
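
Comparing feature sets means repeating the same split, cross-validation, and scoring steps on different column subsets. One possible way to do that with less repetition is sketched below. The helper `evaluate_feature_set` is hypothetical and is not part of this notebook, and the data is synthetic, so the printed numbers say nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split

def evaluate_feature_set(df, features, target, **rf_params):
    """Hypothetical helper: mean 5-fold CV ROC AUC and F1 for one feature subset."""
    X_train, _, y_train, _ = train_test_split(
        df[features], df[target], test_size=0.2, random_state=42
    )
    model = RandomForestClassifier(random_state=42, **rf_params)
    # cross_validate scores several metrics in a single pass over the folds.
    cv = cross_validate(model, X_train, y_train, cv=5, scoring=("roc_auc", "f1"))
    return {
        "Mean CV ROC AUC": cv["test_roc_auc"].mean(),
        "Mean CV F1": cv["test_f1"].mean(),
    }

# Synthetic stand-in data (assumption: not the real main.csv).
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "age": rng.integers(18, 40, n),
    "percent_through_career": rng.random(n),
    "teammates_same_nationality": rng.integers(0, 5, n),
})
noise = rng.normal(0, 0.3, n)
demo["stint_vs_prev_stint"] = (
    (demo["percent_through_career"] + noise) < 0.5
).astype(float)

control = evaluate_feature_set(
    demo, ["age", "percent_through_career"],
    "stint_vs_prev_stint", n_estimators=50,
)
full = evaluate_feature_set(
    demo, ["age", "percent_through_career", "teammates_same_nationality"],
    "stint_vs_prev_stint", n_estimators=50,
)
print("Control:", control)
print("Full:   ", full)
```

Because the split and the random seed are fixed inside the helper, any difference between the two result dictionaries comes from the feature set alone.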

In [77]:
df = pd.read_csv('Data/Gold/main.csv')
df = df[['year','age', 'percent_through_career', 'stint_vs_prev_stint']]
# Remove rows where 'stint_vs_prev_stint' is NaN
df = df.dropna(subset=['stint_vs_prev_stint'])


In [78]:

# Split input features and target feature
X = df.drop(columns=['stint_vs_prev_stint'])
y = df['stint_vs_prev_stint']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model for classification
model = RandomForestClassifier(random_state=42)

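# Define the scoring metrics by name. The built-in 'roc_auc' scorer scores
# predicted probabilities for this model, so it replaces
# make_scorer(..., needs_proba=True), which is deprecated as of scikit-learn 1.4.
f1_scorer = 'f1'
roc_auc_scorer = 'roc_auc'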

# Perform cross-validation and compute F1 and ROC AUC scores
cv_f1_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=f1_scorer)
cv_roc_auc_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=roc_auc_scorer)

# Train the model on the full training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model on the test set
test_f1 = f1_score(y_test, y_test_pred, average='binary')
test_roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set
train_f1 = f1_score(y_train, y_train_pred, average='binary')
train_roc_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])

# Feature importances
feature_importances = model.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the results
print("Cross-Validation F1 Scores:", cv_f1_scores)
print("Mean Cross-Validation F1:", cv_f1_scores.mean())
print("Cross-Validation ROC AUC Scores:", cv_roc_auc_scores)
print("Mean Cross-Validation ROC AUC:", cv_roc_auc_scores.mean())

print("Test Set F1 Score:", test_f1)
print("Test Set ROC AUC:", test_roc_auc)
print("Train Set F1 Score:", train_f1)
print("Train Set ROC AUC:", train_roc_auc)
print("Feature Importances:\n", feature_importances_df)

# Append results to the results table
results_table.append(('Control', cv_roc_auc_scores.mean(), cv_f1_scores.mean(), test_roc_auc, test_f1, train_roc_auc, train_f1))



Cross-Validation F1 Scores: [0.50352653 0.49797445 0.49011125 0.50169075 0.51222571]
Mean Cross-Validation F1: 0.5011057346521257
Cross-Validation ROC AUC Scores: [0.54511965 0.56052111 0.5493164  0.55198245 0.57724271]
Mean Cross-Validation ROC AUC: 0.5568364645002308
Test Set F1 Score: 0.502773575390822
Test Set ROC AUC: 0.5650047218926314
Train Set F1 Score: 0.9998788759689923
Train Set ROC AUC: 0.9999999741763581
Feature Importances:
                   Feature  Importance
2  percent_through_career    0.660455
0                    year    0.259147
1                     age    0.080399


### Reduced Features

Percent through career, year, and teammates of the same nationality were the three most important features in the initial run with all features. We test whether performance improves when these are the only features.
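
The top three features can also be selected programmatically rather than read off the printed table. The sketch below rebuilds the importance table by hand from the initial run's output above and picks the three largest values; in the notebook itself, `feature_importances_df` from the initial run could be used directly, provided it has not yet been overwritten by a later cell.

```python
import pandas as pd

# Importances copied from the initial run's printed output.
feature_importances_df = pd.DataFrame({
    "Feature": [
        "year", "age", "percent_through_career",
        "teammates_same_nationality", "tsm_vs_prev_stint",
    ],
    "Importance": [0.235091, 0.130597, 0.427560, 0.158197, 0.048555],
})

# Keep the three features with the largest importance.
top_features = (
    feature_importances_df.nlargest(3, "Importance")["Feature"].tolist()
)
print(top_features)
```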

In [79]:
df = pd.read_csv('Data/Gold/main.csv')
df = df[['year','percent_through_career', 'teammates_same_nationality', 'stint_vs_prev_stint']]
# Remove rows where 'stint_vs_prev_stint' is NaN
df = df.dropna(subset=['stint_vs_prev_stint'])

In [80]:

# Split input features and target feature
X = df.drop(columns=['stint_vs_prev_stint'])
y = df['stint_vs_prev_stint']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model for classification
model = RandomForestClassifier(random_state=42)

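# Define the scoring metrics by name. The built-in 'roc_auc' scorer scores
# predicted probabilities for this model, so it replaces
# make_scorer(..., needs_proba=True), which is deprecated as of scikit-learn 1.4.
f1_scorer = 'f1'
roc_auc_scorer = 'roc_auc'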

# Perform cross-validation and compute F1 and ROC AUC scores
cv_f1_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=f1_scorer)
cv_roc_auc_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=roc_auc_scorer)

# Train the model on the full training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model on the test set
test_f1 = f1_score(y_test, y_test_pred, average='binary')
test_roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set
train_f1 = f1_score(y_train, y_train_pred, average='binary')
train_roc_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])

# Feature importances
feature_importances = model.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the results
print("Cross-Validation F1 Scores:", cv_f1_scores)
print("Mean Cross-Validation F1:", cv_f1_scores.mean())
print("Cross-Validation ROC AUC Scores:", cv_roc_auc_scores)
print("Mean Cross-Validation ROC AUC:", cv_roc_auc_scores.mean())

print("Test Set F1 Score:", test_f1)
print("Test Set ROC AUC:", test_roc_auc)
print("Train Set F1 Score:", train_f1)
print("Train Set ROC AUC:", train_roc_auc)
print("Feature Importances:\n", feature_importances_df)

# Append results to the results table
results_table.append(('Reduced Features', cv_roc_auc_scores.mean(), cv_f1_scores.mean(), test_roc_auc, test_f1, train_roc_auc, train_f1))





Cross-Validation F1 Scores: [0.48271527 0.5070159  0.49071782 0.50770179 0.51731535]
Mean Cross-Validation F1: 0.5010932277382446
Cross-Validation ROC AUC Scores: [0.55271195 0.58085444 0.55003106 0.565417   0.57348263]
Mean Cross-Validation ROC AUC: 0.564499417028769
Test Set F1 Score: 0.5071196602548089
Test Set ROC AUC: 0.5661619632973485
Train Set F1 Score: 0.9998182809376703
Train Set ROC AUC: 0.9999999031613431
Feature Importances:
                       Feature  Importance
1      percent_through_career    0.668601
0                        year    0.239432
2  teammates_same_nationality    0.091967


### Hyperparameter Experimentation

To experiment with hyperparameters efficiently, we define a grid of parameter values and use a grid search to evaluate every combination. The initial run had the best overall performance so far, so we return to that feature set for hyperparameter experimentation.
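
A grid search tries every combination of the listed values, so its cost grows multiplicatively with each added parameter. The sketch below expands the same grid used in this section with `itertools.product` and counts the resulting model fits. It only illustrates the bookkeeping and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

param_grid = {
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2, 3],
}

# One candidate per combination of values: 3 * 2 * 3 combinations.
names = list(param_grid)
candidates = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

# Each candidate is fitted once per cross-validation fold.
cv_folds = 3
total_fits = len(candidates) * cv_folds

print("Candidates:", len(candidates))
print("Total fits:", total_fits)
```

With 3 folds this gives 18 candidates and 54 fits, matching the "Fitting 3 folds for each of 18 candidates" line in the grid search output.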

In [81]:
df = pd.read_csv('Data/Gold/main.csv')
df = df[['year','age', 'percent_through_career', 'teammates_same_nationality', 'tsm_vs_prev_stint', 'stint_vs_prev_stint']]

# Remove rows where 'stint_vs_prev_stint' is NaN
df = df.dropna(subset=['stint_vs_prev_stint'])

# Split input features and target feature
X = df.drop(columns=['stint_vs_prev_stint'])
y = df['stint_vs_prev_stint']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [82]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid
param_grid = {
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2, 3],
}

# Initialize the Random Forest model
rf = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV with ROC AUC scoring
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2, scoring='roc_auc')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best ROC AUC Score:", best_score)


Fitting 3 folds for each of 18 candidates, totalling 54 fits
Best Parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2}
Best ROC AUC Score: 0.6106785975486931
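
The best parameters do not have to be copied by hand into a new model. As a sketch on synthetic data (generated with `make_classification`, with a smaller grid and fewer trees than the real search), the fitted search object exposes both the refitted model and the winning parameter dictionary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the real stint data).
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid={"max_depth": [None, 5], "min_samples_leaf": [1, 2]},
    cv=3,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Option 1: the search refits the best candidate on all of X_train by default.
tuned = search.best_estimator_

# Option 2: unpack the winning parameters into a fresh model.
rebuilt = RandomForestClassifier(
    n_estimators=25, random_state=42, **search.best_params_
)
rebuilt.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, tuned.predict_proba(X_test)[:, 1])
print("Best parameters:", search.best_params_)
print("Test ROC AUC:", test_auc)
```

Unpacking `best_params_` avoids transcription mistakes when the grid changes, while keeping the final model definition explicit.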


Next, a model is trained with the hyperparameters that gave the best ROC AUC score in the grid search.

In [83]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, f1_score, roc_auc_score

# Split input features and target feature
X = df.drop(columns=['stint_vs_prev_stint'])
y = df['stint_vs_prev_stint']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model for classification
model = RandomForestClassifier(
    max_depth=10,
    min_samples_leaf=2,
    min_samples_split=2,
    random_state=42,
)

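# Define the scoring metrics by name. The built-in 'roc_auc' scorer scores
# predicted probabilities for this model, so it replaces
# make_scorer(..., needs_proba=True), which is deprecated as of scikit-learn 1.4.
f1_scorer = 'f1'
roc_auc_scorer = 'roc_auc'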

# Perform cross-validation and compute F1 and ROC AUC scores
cv_f1_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=f1_scorer)
cv_roc_auc_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=roc_auc_scorer)

# Train the model on the full training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model on the test set
test_f1 = f1_score(y_test, y_test_pred, average='binary')
test_roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set
train_f1 = f1_score(y_train, y_train_pred, average='binary')
train_roc_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])

# Feature importances
feature_importances = model.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the results
print("Cross-Validation F1 Scores:", cv_f1_scores)
print("Mean Cross-Validation F1:", cv_f1_scores.mean())
print("Cross-Validation ROC AUC Scores:", cv_roc_auc_scores)
print("Mean Cross-Validation ROC AUC:", cv_roc_auc_scores.mean())

print("Test Set F1 Score:", test_f1)
print("Test Set ROC AUC:", test_roc_auc)
print("Train Set F1 Score:", train_f1)
print("Train Set ROC AUC:", train_roc_auc)
print("Feature Importances:\n", feature_importances_df)

# Append results to the results table
results_table.append(('Hyperparameter Experimentation', cv_roc_auc_scores.mean(), cv_f1_scores.mean(), test_roc_auc, test_f1, train_roc_auc, train_f1))





Cross-Validation F1 Scores: [0.50813273 0.51759973 0.50542942 0.51554664 0.49722607]
Mean Cross-Validation F1: 0.5087869180581406
Cross-Validation ROC AUC Scores: [0.59637546 0.61841115 0.59766856 0.61806568 0.62732624]
Mean Cross-Validation ROC AUC: 0.6115694169534902
Test Set F1 Score: 0.5243804956035172
Test Set ROC AUC: 0.6163039512193204
Train Set F1 Score: 0.6162147933666754
Train Set ROC AUC: 0.7524160211939793
Feature Importances:
                       Feature  Importance
2      percent_through_career    0.481219
0                        year    0.199137
1                         age    0.154695
3  teammates_same_nationality    0.125659
4           tsm_vs_prev_stint    0.039291


## Results

### Results Table

In [84]:
# Extract the column titles
columns = results_table[0]

# Extract the data values and round them to 4 decimal places
values = [
    (row[0], *[round(val, 4) for val in row[1:]])
    for row in results_table[1:]
]

# Convert the values into a DataFrame
results_df = pd.DataFrame(values, columns=columns)

# Display the DataFrame
results_df

| | Run | Mean CV ROC AUC | Mean CV F1 | Test ROC AUC | Test F1 | Train ROC AUC | Train F1 |
|---|---|---|---|---|---|---|---|
| 0 | Initial Run | 0.5762 | 0.5033 | 0.5862 | 0.5055 | 1.0 | 1.0 |
| 1 | Control | 0.5568 | 0.5011 | 0.565 | 0.5028 | 1.0 | 0.9999 |
| 2 | Reduced Features | 0.5645 | 0.5011 | 0.5662 | 0.5071 | 1.0 | 0.9998 |
| 3 | Hyperparameter Experimentation | 0.6116 | 0.5088 | 0.6163 | 0.5244 | 0.7524 | 0.6162 |


### Summary

The feature selection in the Control and Reduced Features models did not improve performance over the initial run, which included all features. None of the models performed particularly well, but applying the tuned hyperparameters to the initial feature set greatly reduced overfitting: the training scores dropped to more reasonable levels (train ROC AUC of 0.7524 instead of 1.0), and this run achieved the highest test scores of the four (test ROC AUC of 0.6163 and test F1 of 0.5244).
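
One simple way to quantify overfitting is the gap between training and test ROC AUC. The sketch below computes that gap from the rounded values in the results table above; the dictionary is typed in by hand for illustration rather than read from `results_df`.

```python
# Train and test ROC AUC per run, copied from the results table.
results = {
    "Initial Run": {"train_auc": 1.0, "test_auc": 0.5862},
    "Control": {"train_auc": 1.0, "test_auc": 0.565},
    "Reduced Features": {"train_auc": 1.0, "test_auc": 0.5662},
    "Hyperparameter Experimentation": {"train_auc": 0.7524, "test_auc": 0.6163},
}

# A large train-test gap means the model memorizes the training data.
gaps = {
    run: round(m["train_auc"] - m["test_auc"], 4)
    for run, m in results.items()
}

for run, gap in gaps.items():
    print(f"{run}: train-test ROC AUC gap = {gap}")

smallest_gap_run = min(gaps, key=gaps.get)
print("Smallest gap:", smallest_gap_run)
```

The tuned model's gap is roughly a third of the gap seen in the other three runs, which is the numerical form of the overfitting improvement described above.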