# Predictive Analysis - Performance Score (Stint Score)

In this notebook, we conduct machine learning analysis to predict a player's stint_score, henceforth referred to as their performance score.

## Load Data and Specify Features

In [1]:
import pandas as pd

df = pd.read_csv('Data/Gold/main.csv')

df = df[['year','age', 'percent_through_career', 'teammates_same_nationality', 'tsm_vs_prev_stint', 'stint_score']]

### Setup result table

In [2]:
table_columns = ('Run', 'Cross-Validation MSE Scores', 'Mean Cross-Validation MSE', 'Cross-Validation R² Scores', 'Mean Cross-Validation R²', 'Test Set Mean Squared Error', 'Test Set R-Squared', 'Train Set Mean Squared Error', 'Train Set R-Squared'
)

results_table = []

results_table.append(table_columns)



## Random Forest Analysis

Because no clear linear relationship was found between the number of teammates sharing a player's nationality and that player's performance, we use a Random Forest. As a non-linear ensemble model, Random Forest can be effective at predictions in situations similar to this one.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error, r2_score

# Split input features and target feature
X = df.drop(columns=['stint_score'])
y = df['stint_score']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Define the scoring metrics
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
r2_scorer = make_scorer(r2_score)

# Perform cross-validation and compute MSE and R² scores
cv_mse_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=mse_scorer)
cv_r2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=r2_scorer)

# Train the model on the full training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model on the test set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

# Feature importances
feature_importances = model.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the results
print("Cross-Validation MSE Scores:", cv_mse_scores)
print("Mean Cross-Validation MSE:", -cv_mse_scores.mean())  # Negated to get positive value
print("Cross-Validation R² Scores:", cv_r2_scores)
print("Mean Cross-Validation R²:", cv_r2_scores.mean())

print("Test Set Mean Squared Error:", test_mse)
print("Test Set R-Squared:", test_r2)
print("Train Set Mean Squared Error:", train_mse)
print("Train Set R-Squared:", train_r2)
print("Feature Importances:\n", feature_importances_df)

results_table.append(('Initial Run', cv_mse_scores, -cv_mse_scores.mean(), cv_r2_scores, cv_r2_scores.mean(), test_mse, test_r2, train_mse, train_r2))


### Initial Notes

Percent through career, year, and teammates_same_nationality are the top three features.  We will see if performance improves when these are the only features.

In [None]:
# remove age and tsm_vs_prev_stint
df = df[['year', 'percent_through_career', 'teammates_same_nationality', 'stint_score']]

## Model #2 - Reduced Features

In [None]:
# Split input features and target feature
X = df.drop(columns=['stint_score'])
y = df['stint_score']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Define the scoring metrics
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
r2_scorer = make_scorer(r2_score)

# Perform cross-validation and compute MSE and R² scores
cv_mse_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=mse_scorer)
cv_r2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=r2_scorer)

# Train the model on the full training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model on the test set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

# Feature importances
feature_importances = model.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the results
print("Cross-Validation MSE Scores:", cv_mse_scores)
print("Mean Cross-Validation MSE:", -cv_mse_scores.mean())  # Negated to get positive value
print("Cross-Validation R² Scores:", cv_r2_scores)
print("Mean Cross-Validation R²:", cv_r2_scores.mean())

print("Test Set Mean Squared Error:", test_mse)
print("Test Set R-Squared:", test_r2)
print("Train Set Mean Squared Error:", train_mse)
print("Train Set R-Squared:", train_r2)
print("Feature Importances:\n", feature_importances_df)

Cross-Validation MSE Scores: [-0.92382674 -0.93002508 -0.90755884 -0.91913583 -0.95260342]
Mean Cross-Validation MSE: 0.9266299836399853
Cross-Validation R² Scores: [0.07604125 0.07351081 0.10293911 0.07290154 0.04896759]
Mean Cross-Validation R²: 0.07487206092605529
Test Set Mean Squared Error: 0.9268842159106914
Test Set R-Squared: 0.06632467789504681
Train Set Mean Squared Error: 0.15927714018388087
Train Set R-Squared: 0.8410119558457994
Feature Importances:
                       Feature  Importance
1      percent_through_career    0.601697
0                        year    0.221464
2  teammates_same_nationality    0.176839


Model performance has gotten worse, so we will return to the original model and experiment with hyper parameters.

In [None]:
# reset model to initial feature set
df = pd.read_csv('Data/Gold/main.csv')
df = df[['year','age', 'percent_through_career', 'teammates_same_nationality', 'tsm_vs_prev_stint', 'stint_score']]

## Hyperparameter Experimentation

To experiment with hyperparameters in an efficient way, a grid of parameters and a GridSearch are used to quickly experiment with a variety of hyperparameter combinations.

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2, 3],
}

# Initialize the Random Forest model
rf = RandomForestRegressor(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2, scoring='r2')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

best_params, best_score

Fitting 3 folds for each of 18 candidates, totalling 54 fits


({'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5},
 np.float64(0.1774304182354913))

The model is ran again with the hyperparamters selected in the previous cell.

In [None]:
# Split input features and target feature
X = df.drop(columns=['stint_score'])
y = df['stint_score']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model
model = RandomForestRegressor(
    n_estimators=100, 
    random_state=42, 
    max_depth=10, 
    min_samples_leaf=1, 
    min_samples_split=5
)

# Define the scoring metrics
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
r2_scorer = make_scorer(r2_score)

# Perform cross-validation and compute MSE and R² scores
cv_mse_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=mse_scorer)
cv_r2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=r2_scorer)

# Train the model on the full training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model on the test set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Evaluate the model on the training set
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

# Feature importances
feature_importances = model.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the results
print("Cross-Validation MSE Scores:", cv_mse_scores)
print("Mean Cross-Validation MSE:", -cv_mse_scores.mean())  # Negated to get positive value
print("Cross-Validation R² Scores:", cv_r2_scores)
print("Mean Cross-Validation R²:", cv_r2_scores.mean())

print("Test Set Mean Squared Error:", test_mse)
print("Test Set R-Squared:", test_r2)
print("Train Set Mean Squared Error:", train_mse)
print("Train Set R-Squared:", train_r2)
print("Feature Importances:\n", feature_importances_df)

Cross-Validation MSE Scores: [-0.80387353 -0.81282177 -0.8025946  -0.80072406 -0.8219909 ]
Mean Cross-Validation MSE: 0.8084009726441709
Cross-Validation R² Scores: [0.1960116  0.19026852 0.2066892  0.19233914 0.17936472]
Mean Cross-Validation R²: 0.19293463436876462
Test Set Mean Squared Error: 0.7962687547208445
Test Set R-Squared: 0.19789713398493225
Train Set Mean Squared Error: 0.6843471432768496
Train Set R-Squared: 0.3168949812478371
Feature Importances:
                       Feature  Importance
2      percent_through_career    0.637400
0                        year    0.178935
1                         age    0.092734
3  teammates_same_nationality    0.072761
4           tsm_vs_prev_stint    0.018171


## Results Table

## Summary