# Predictive Analysis - Performance Score (Stint Score)

In this notebook, we train machine learning models to predict a player's `stint_score`, henceforth referred to as their performance score.

## Load Data and Specify Features

In [54]:
import pandas as pd

df = pd.read_csv('Data/Gold/main.csv')
df = df[['year','age', 'percent_through_career', 'teammates_same_nationality', 'tsm_vs_prev_stint', 'stint_score']]

### Setup Result Table

In [55]:
table_columns = ('Run', 'Mean CV MSE', 'Mean CV R²', 'Test MSE', 'Test R²', 'Train MSE', 'Train R²')

results_table = []

results_table.append(table_columns)

## Random Forest Analysis

Because no clear linear relationship was found between the number of teammates sharing a player's nationality and that player's performance, we use a Random Forest. As a non-linear ensemble of decision trees, a Random Forest can capture non-linear relationships and interactions between features without assuming a particular functional form.

### Initial Run

In [56]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error, r2_score

# Split input features and target feature
X = df.drop(columns=['stint_score'])
y = df['stint_score']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Define the scoring metrics
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
r2_scorer = make_scorer(r2_score)

# Perform cross-validation and compute MSE and R² scores
cv_mse_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=mse_scorer)
cv_r2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=r2_scorer)

# Train the model on the full training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model on the test set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

# Feature importances
feature_importances = model.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the results
print("Cross-Validation MSE Scores:", cv_mse_scores)
print("Mean Cross-Validation MSE:", -cv_mse_scores.mean())  # Negated to get positive value
print("Cross-Validation R² Scores:", cv_r2_scores)
print("Mean Cross-Validation R²:", cv_r2_scores.mean())

print("Test Set Mean Squared Error:", test_mse)
print("Test Set R-Squared:", test_r2)
print("Train Set Mean Squared Error:", train_mse)
print("Train Set R-Squared:", train_r2)
print("Feature Importances:\n", feature_importances_df)

results_table.append(('Initial Run', -cv_mse_scores.mean(), cv_r2_scores.mean(), test_mse, test_r2, train_mse, train_r2))


Cross-Validation MSE Scores: [-0.86602951 -0.88224342 -0.86929714 -0.8565786  -0.89277956]
Mean Cross-Validation MSE: 0.8733856450066917
Cross-Validation R² Scores: [0.13384674 0.12111081 0.14075823 0.13600072 0.1086928 ]
Mean Cross-Validation R²: 0.12808185901750171
Test Set Mean Squared Error: 0.8669143163951625
Test Set R-Squared: 0.12673396557694883
Train Set Mean Squared Error: 0.13087317173234508
Train Set R-Squared: 0.8693643696643413
Feature Importances:
                       Feature  Importance
2      percent_through_career    0.471434
0                        year    0.200815
3  teammates_same_nationality    0.151379
1                         age    0.131113
4           tsm_vs_prev_stint    0.045259
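
The negative fold scores printed above follow from scikit-learn's convention that scorers are maximized: `make_scorer(..., greater_is_better=False)` flips the sign of the metric, so the mean has to be negated to recover an MSE. A minimal sketch of that arithmetic, using the rounded fold values copied from the output above (the variable names are illustrative):

```python
# Fold scores as printed by cross_val_score for the initial run (rounded).
cv_mse_scores = [-0.86602951, -0.88224342, -0.86929714, -0.8565786, -0.89277956]

# The scorer returns -MSE, so negate the mean to get a positive MSE.
mean_cv_mse = -sum(cv_mse_scores) / len(cv_mse_scores)

print(round(mean_cv_mse, 4))
```

The result agrees with the "Mean Cross-Validation MSE" line in the output, up to the rounding of the fold values.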


### Control - Only Year, Age & Percent Through Career

To see whether the teammate features we developed improve model performance, we create a control model that excludes them.

In [57]:
df = pd.read_csv('Data/Gold/main.csv')
df = df[['year','age', 'percent_through_career', 'stint_score']]

In [58]:
# Split input features and target feature
X = df.drop(columns=['stint_score'])
y = df['stint_score']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Define the scoring metrics
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
r2_scorer = make_scorer(r2_score)

# Perform cross-validation and compute MSE and R² scores
cv_mse_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=mse_scorer)
cv_r2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=r2_scorer)

# Train the model on the full training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model on the test set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

# Feature importances
feature_importances = model.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the results
print("Cross-Validation MSE Scores:", cv_mse_scores)
print("Mean Cross-Validation MSE:", -cv_mse_scores.mean())  # Negated to get positive value
print("Cross-Validation R² Scores:", cv_r2_scores)
print("Mean Cross-Validation R²:", cv_r2_scores.mean())

print("Test Set Mean Squared Error:", test_mse)
print("Test Set R-Squared:", test_r2)
print("Train Set Mean Squared Error:", train_mse)
print("Train Set R-Squared:", train_r2)
print("Feature Importances:\n", feature_importances_df)

results_table.append(('Control', -cv_mse_scores.mean(), cv_r2_scores.mean(), test_mse, test_r2, train_mse, train_r2))

Cross-Validation MSE Scores: [-0.91115088 -0.91953607 -0.91723727 -0.90427235 -0.93437222]
Mean Cross-Validation MSE: 0.9173137586124425
Cross-Validation R² Scores: [0.08871892 0.08395994 0.09337263 0.08789379 0.06716873]
Mean Cross-Validation R²: 0.08422280270169116
Test Set Mean Squared Error: 0.9058013272112092
Test Set R-Squared: 0.08756203695186404
Train Set Mean Squared Error: 0.15841510851828053
Train Set R-Squared: 0.8418724228805199
Feature Importances:
                   Feature  Importance
2  percent_through_career    0.610152
0                    year    0.236674
1                     age    0.153173


Without the teammate same-nationality features, the model performs worse on every metric than the initial model (for example, test R² falls from 0.127 to 0.088), indicating that the teammate features are beneficial.
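
To make the comparison concrete, the sketch below subtracts the control metrics from the initial-run metrics. The values are copied (rounded to four decimals) from the printed outputs above; the dictionary names are illustrative and not part of the original notebook.

```python
# Metrics copied from the printed output of the two runs above.
initial = {"cv_r2": 0.1281, "test_r2": 0.1267, "test_mse": 0.8669}
control = {"cv_r2": 0.0842, "test_r2": 0.0876, "test_mse": 0.9058}

# Positive R² deltas and negative MSE deltas favor the model with teammate features.
deltas = {name: round(initial[name] - control[name], 4) for name in initial}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

The differences are small in absolute terms, which is worth keeping in mind when judging how much the teammate features contribute.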

### Reduced Features

`percent_through_career`, `year`, and `teammates_same_nationality` were the top three features in the initial run with all features. We test whether performance improves when these are the only features.

In [59]:
# remove age and tsm_vs_prev_stint
df = pd.read_csv('Data/Gold/main.csv')
df = df[['year', 'percent_through_career', 'teammates_same_nationality', 'stint_score']]

In [60]:
# Split input features and target feature
X = df.drop(columns=['stint_score'])
y = df['stint_score']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Define the scoring metrics
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
r2_scorer = make_scorer(r2_score)

# Perform cross-validation and compute MSE and R² scores
cv_mse_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=mse_scorer)
cv_r2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=r2_scorer)

# Train the model on the full training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model on the test set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

# Feature importances
feature_importances = model.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the results
print("Cross-Validation MSE Scores:", cv_mse_scores)
print("Mean Cross-Validation MSE:", -cv_mse_scores.mean())  # Negated to get positive value
print("Cross-Validation R² Scores:", cv_r2_scores)
print("Mean Cross-Validation R²:", cv_r2_scores.mean())

print("Test Set Mean Squared Error:", test_mse)
print("Test Set R-Squared:", test_r2)
print("Train Set Mean Squared Error:", train_mse)
print("Train Set R-Squared:", train_r2)
print("Feature Importances:\n", feature_importances_df)

results_table.append(('Reduced Features', -cv_mse_scores.mean(), cv_r2_scores.mean(), test_mse, test_r2, train_mse, train_r2))

Cross-Validation MSE Scores: [-0.92382674 -0.93002508 -0.90755884 -0.91913583 -0.95260342]
Mean Cross-Validation MSE: 0.9266299836399853
Cross-Validation R² Scores: [0.07604125 0.07351081 0.10293911 0.07290154 0.04896759]
Mean Cross-Validation R²: 0.07487206092605529
Test Set Mean Squared Error: 0.9268842159106914
Test Set R-Squared: 0.06632467789504681
Train Set Mean Squared Error: 0.15927714018388087
Train Set R-Squared: 0.8410119558457994
Feature Importances:
                       Feature  Importance
1      percent_through_career    0.601697
0                        year    0.221464
2  teammates_same_nationality    0.176839


### Hyperparameter Experimentation


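Restricting the model to the top three features made performance worse on every metric. Note that the cells below reuse `df` as assigned in the Reduced Features step, so the tuning that follows is applied to the reduced feature set rather than the original one, as the three-feature importance table in the final output confirms. All models so far also show signs of overfitting: train R² is roughly 0.84 to 0.87, while test R² stays below 0.13. To address this, we experiment with hyperparameters.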

In [61]:
# Split input features and target feature
X = df.drop(columns=['stint_score'])
y = df['stint_score']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


To explore hyperparameters efficiently, we define a parameter grid and use `GridSearchCV` to evaluate every combination in it with 3-fold cross-validation, scored by R².

In [62]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2, 3],
}

# Initialize the Random Forest model
rf = RandomForestRegressor(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2, scoring='r2')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

best_params, best_score

Fitting 3 folds for each of 18 candidates, totalling 54 fits


({'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5},
 np.float64(0.1774304182354913))
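
The "18 candidates, totalling 54 fits" message above can be reproduced by enumerating the grid by hand. This is a sketch of how an exhaustive grid is expanded, not of how `GridSearchCV` is implemented internally; `cv_folds`, `candidates`, and `total_fits` are illustrative names.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2, 3],
}
cv_folds = 3

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
```

The count grows multiplicatively with each added hyperparameter value, which is why the grid here is kept small.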

The model is run again with the hyperparameters selected in the previous cell.

In [63]:
# Split input features and target feature
X = df.drop(columns=['stint_score'])
y = df['stint_score']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model
model = RandomForestRegressor(
    n_estimators=100, 
    random_state=42, 
    max_depth=10, 
    min_samples_leaf=1, 
    min_samples_split=5
)

# Define the scoring metrics
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
r2_scorer = make_scorer(r2_score)

# Perform cross-validation and compute MSE and R² scores
cv_mse_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=mse_scorer)
cv_r2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=r2_scorer)

# Train the model on the full training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model on the test set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

# Feature importances
feature_importances = model.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the results
print("Cross-Validation MSE Scores:", cv_mse_scores)
print("Mean Cross-Validation MSE:", -cv_mse_scores.mean())  # Negated to get positive value
print("Cross-Validation R² Scores:", cv_r2_scores)
print("Mean Cross-Validation R²:", cv_r2_scores.mean())

print("Test Set Mean Squared Error:", test_mse)
print("Test Set R-Squared:", test_r2)
print("Train Set Mean Squared Error:", train_mse)
print("Train Set R-Squared:", train_r2)
print("Feature Importances:\n", feature_importances_df)

results_table.append(('Hyperparameters Adjusted', -cv_mse_scores.mean(), cv_r2_scores.mean(), test_mse, test_r2, train_mse, train_r2))

Cross-Validation MSE Scores: [-0.82447198 -0.82187292 -0.81179235 -0.81205728 -0.8376697 ]
Mean Cross-Validation MSE: 0.8215728469674051
Cross-Validation R² Scores: [0.17541021 0.18125178 0.19759784 0.18090774 0.16371177]
Mean Cross-Validation R²: 0.1797758673735608
Test Set Mean Squared Error: 0.7995144078887128
Test Set R-Squared: 0.19462770052719125
Train Set Mean Squared Error: 0.7057429359881517
Train Set R-Squared: 0.2955380229776705
Feature Importances:
                       Feature  Importance
1      percent_through_career    0.724703
0                        year    0.190125
2  teammates_same_nationality    0.085171


Performance has improved: test R² rises to 0.195 and test MSE falls to 0.800. The overfitting is also much less severe. Training R² drops from about 0.84 to 0.296, but the gap between training and test R² shrinks to roughly 0.10, so the model generalizes to unseen data much better after the hyperparameter adjustments.
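
One simple way to quantify the overfitting discussed here is the gap between train and test R² for each run. The sketch below uses the rounded values from the printed outputs; treating this gap as an overfitting indicator is a rough heuristic rather than a formal test, and the names `runs` and `gaps` are illustrative.

```python
# (train R², test R²) copied from the outputs above.
runs = {
    "Initial Run": (0.8694, 0.1267),
    "Control": (0.8419, 0.0876),
    "Reduced Features": (0.8410, 0.0663),
    "Hyperparameters Adjusted": (0.2955, 0.1946),
}

# A large positive gap means the model fits the training data far better
# than it predicts unseen data.
gaps = {name: round(train - test, 4) for name, (train, test) in runs.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.4f}")
```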

## Results Table

In [65]:
# Extract the column titles
columns = results_table[0]

# Extract the data values and round them to 4 decimal places
values = [
    (row[0], *[round(val, 4) for val in row[1:]])
    for row in results_table[1:]
]

# Convert the values into a DataFrame
results_df = pd.DataFrame(values, columns=columns)

# Display the DataFrame
results_df

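|   | Run | Mean CV MSE | Mean CV R² | Test MSE | Test R² | Train MSE | Train R² |
|---|-----|-------------|------------|----------|---------|-----------|----------|
| 0 | Initial Run | 0.8734 | 0.1281 | 0.8669 | 0.1267 | 0.1309 | 0.8694 |
| 1 | Control | 0.9173 | 0.0842 | 0.9058 | 0.0876 | 0.1584 | 0.8419 |
| 2 | Reduced Features | 0.9266 | 0.0749 | 0.9269 | 0.0663 | 0.1593 | 0.841 |
| 3 | Hyperparameters Adjusted | 0.8216 | 0.1798 | 0.7995 | 0.1946 | 0.7057 | 0.2955 |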


## Summary

None of the model runs produced strong results: even in the best case (the tuned model, with a test R² of 0.195), only a small share of the variation in the target variable is explained by the input features. That said, we found that the teammate variables can help predict a player's performance score. When these variables were removed, model performance decreased. In the initial run, the mean cross-validation R² of 0.128 is modest, but it is markedly higher than the 0.084 obtained when only age, year, and percent_through_career were considered.
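
The phrase "share of the variation" can be made precise: R² equals 1 - MSE / Var(y), where Var(y) is the population variance of the target on the evaluated set. As a sanity check, the sketch below backs out the implied test-set variance from each run's printed test MSE and test R². Because every run uses the same split (`random_state=42`), the implied variances should agree. The names `runs` and `implied_variance` are illustrative.

```python
# (test MSE, test R²) copied at full precision from the printed outputs.
runs = {
    "Initial Run": (0.8669143163951625, 0.12673396557694883),
    "Control": (0.9058013272112092, 0.08756203695186404),
    "Reduced Features": (0.9268842159106914, 0.06632467789504681),
    "Hyperparameters Adjusted": (0.7995144078887128, 0.19462770052719125),
}

# Rearranging R² = 1 - MSE / Var(y) gives Var(y) = MSE / (1 - R²).
implied_variance = {name: mse / (1 - r2) for name, (mse, r2) in runs.items()}

for name, var in implied_variance.items():
    print(f"{name}: {var:.4f}")
```

With the target's variance close to 1, each run's test MSE is close to 1 minus its test R², which is why the two metrics move together in the results table.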