---
title: "project3"
format: html
---


## Objective

Can we predict a player’s performance/worth in an upcoming season based on their previous performance stats and other metrics?

## Description of Data

We have data for qualified batters (300+ plate appearances) in each season from 2015 to 2024. Most of it comes from Statcast, but we also added the columns WAR (wins above replacement), R (runs), OPS+ (on-base plus slugging, adjusted for league and ballpark), rOBA (Baseball Reference's weighted on-base average), and Rbat+ from Baseball Reference. We will predict a player's WAR in an upcoming season given their previous stats and the stats of all other MLB players. We expect the most important factors in this prediction to be WAR, Age, OPS (on-base plus slugging), xBA (expected batting average), Barrel%, K%, BB%, Rbat+ (Baseball Reference's counterpart to weighted runs created plus), and HR (home runs).

## Exploratory Data Analysis

Data cleaning (details listed in Project 2/eda.qmd):


In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

def concat_statcast(input1, input2, output):
  try:
    df1 = pd.read_csv(input1)
    df2 = pd.read_csv(input2)
    combined_df = pd.concat([df1, df2], ignore_index=True)
    combined_df.to_csv(output, index=False)
  except FileNotFoundError:
    # Catch only missing files so that other errors are not silently mislabeled
    print("file not found error")
    
input1 = "Statcast.csv"
input2 = "Statcast_2020.csv"
output = "Complete_Statcast.csv"
concat_statcast(input1, input2, output)

def merge_data(input3, input4, output, columns):
  try:
    df1 = pd.read_csv(input3)
    df1['Player'] = df1['last_name, first_name'].str.split(', ').str[::-1].str.join(' ')
    all_years_df = []
    for i in input4:
      filename = f'{i}.csv'
      df2 = pd.read_csv(filename, skiprows=4, skipfooter=3, engine='python')
      df2['year'] = i 
      df2['Player'] = df2['Player'].str.replace(r'[*#]', '', regex=True).str.strip()
      df2_subset = df2[['Player', 'year'] + columns]
      all_years_df.append(df2_subset)
    combined_df2 = pd.concat(all_years_df, ignore_index=True)
    merged_df = pd.merge(df1, combined_df2, on=['Player', 'year'], how='inner')
    merged_df.to_csv(output, index=False)
    merged_df.info()  # info() prints its own summary; the original printed the bound method
  except FileNotFoundError:
    print("file not found error")
    
input3 = 'Complete_Statcast.csv'
input4 = range(2015, 2025)
columns = ['WAR', 'R', 'OPS+', 'rOBA', 'Rbat+']
output = "Complete_Data.csv"
merge_data(input3, input4, output, columns)
complete_dataset = pd.read_csv(output)
data = complete_dataset  # alias used by the EDA cells below

key_columns = ['xba', 'barrel_batted_rate', 'player_age', 'WAR', 'k_percent', 'bb_percent', 'on_base_plus_slg', 'Rbat+']

### Univariate Analysis (Histograms)


In [None]:
# Univariate Analysis
print(data[key_columns].describe())

# Histograms
plt.figure(figsize=(16, 12))
for i, col in enumerate(key_columns, 1):
    plt.subplot(3, 3, i)
    sns.histplot(data[col], kde=True, bins=30)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('univariate.png')
plt.close()

The histograms show that most of these variables are either roughly symmetric around their center or skewed left.
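
The visual impression of skew can also be checked numerically. The sketch below is a minimal illustration on synthetic data (the columns `left_skewed` and `symmetric` are made up for the example and are not part of the dataset). It uses pandas' `DataFrame.skew()`, where a negative value indicates a left skew and a value near zero indicates a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; these columns are hypothetical, not from Complete_Data.csv
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "left_skewed": 1.0 - rng.exponential(scale=1.0, size=5000),  # long tail to the left
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),       # roughly bell-shaped
})

# Sample skewness per column: < 0 means left-skewed, near 0 means roughly symmetric
skewness = demo.skew()
print(skewness)
```

On the real data, the equivalent call would be `data[key_columns].skew()`.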

### Bivariate Analysis (Correlation)


In [None]:
print(data[key_columns].corr())

plt.figure(figsize=(12, 8))
# xBA vs WAR
plt.subplot(2, 2, 1)
sns.scatterplot(x='xba', y='WAR', data=data)
plt.title('xBA vs WAR')
# Barrel% vs WAR
plt.subplot(2, 2, 2)
sns.scatterplot(x='barrel_batted_rate', y='WAR', data=data)
plt.title('Barrel% vs WAR')
# Age vs WAR
plt.subplot(2, 2, 3)
sns.scatterplot(x='player_age', y='WAR', data=data)
plt.title('Player Age vs WAR')
# Rbat+ vs WAR
plt.subplot(2, 2, 4)
sns.scatterplot(x='Rbat+', y='WAR', data=data)
plt.title('Rbat+ vs WAR')
plt.tight_layout()
plt.savefig('bivariate.png')
plt.close()

We can see that xBA vs WAR and Rbat+ vs WAR are somewhat positively correlated, whereas Player Age vs WAR and Barrel% vs WAR are not strongly correlated.
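
To put a number on "somewhat positively correlated", Pearson's r can be computed for a pair of columns instead of being read off the scatterplots. The sketch below uses a small synthetic frame (the values and the column names `stat_a`, `stat_b`, `stat_c` are invented for illustration) together with `Series.corr`, which computes the Pearson correlation by default.

```python
import numpy as np
import pandas as pd

# Synthetic data: stat_b is constructed to correlate with stat_a, stat_c is independent
rng = np.random.default_rng(1)
n = 2000
stat_a = rng.normal(size=n)
toy = pd.DataFrame({
    "stat_a": stat_a,
    "stat_b": 0.6 * stat_a + 0.8 * rng.normal(size=n),  # theoretical r with stat_a is 0.6
    "stat_c": rng.normal(size=n),                       # theoretical r with stat_a is 0
})

r_ab = toy["stat_a"].corr(toy["stat_b"])  # Pearson by default
r_ac = toy["stat_a"].corr(toy["stat_c"])
print(f"r(stat_a, stat_b) = {r_ab:.2f}")
print(f"r(stat_a, stat_c) = {r_ac:.2f}")
```

For the real data, `data[key_columns].corr()['WAR']` would list each key variable's correlation with WAR in one column.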

### Multivariate Analysis (Pairplot)


In [None]:
sns.pairplot(data[key_columns], diag_kind='kde', corner=True)
plt.suptitle('Pair Plot of Key Variables')
plt.savefig('multivariate.png')
plt.close()

In every panel, most of the data points are clustered in a small area, with a few points scattered around that cluster. The pairs where we expect a positive correlation, such as WAR vs xBA and OPS vs BB%, are indeed somewhat positively correlated, while other pairs show little apparent correlation.

The resulting PNG files are in Project 1 -> Data.

## Modeling

We will use a recurrent neural network (an LSTM) as our main model. Before modeling, however, we must reshape the data into sequences of seasons for each player, so that each training example pairs a window of past seasons with the WAR of the season that follows it.
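
The windowing idea can be sketched on a toy table first. This is a minimal illustration, not the project's pipeline: the helper name `make_sequences` and all values in `toy` are hypothetical. Each player's rows are sorted by year, and every run of `sequence_length` seasons becomes one input, with the next season's target as the label.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented purely to show the windowing
toy = pd.DataFrame({
    "Player": ["A"] * 4 + ["B"] * 2,
    "year":   [2018, 2019, 2020, 2021, 2020, 2021],
    "WAR":    [1.0, 2.0, 3.0, 4.0, 0.5, 1.5],
    "xba":    [0.250, 0.260, 0.270, 0.280, 0.240, 0.245],
})

def make_sequences(df, feature_cols, target, sequence_length):
    """Return (X, y): windows of past seasons and the following season's target."""
    X, y = [], []
    for _, player_df in df.sort_values(["Player", "year"]).groupby("Player"):
        feats = player_df[feature_cols].to_numpy()
        targets = player_df[target].to_numpy()
        # A player with n seasons yields n - sequence_length windows (none if n is too small)
        for i in range(len(player_df) - sequence_length):
            X.append(feats[i:i + sequence_length])
            y.append(targets[i + sequence_length])
    return np.array(X), np.array(y)

X, y = make_sequences(toy, ["WAR", "xba"], "WAR", sequence_length=3)
print(X.shape, y)
```

Player A has four seasons and contributes one window (2018 to 2020, predicting 2021). Player B has only two seasons and contributes none, which is why players with too few seasons are filtered out.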


import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def rnn_prediction(input, output, target, key_columns, sequence_length):
    # Assumption: `input` is the path to the merged CSV (e.g. "Complete_Data.csv")
    df = pd.read_csv(input)
    # Sort by player and season so each player's rows are in chronological order
    df = df.sort_values(['Player', 'year'], ascending=[True, True])
    n_features = len(key_columns)
    # Keep players with more than sequence_length seasons (one full window plus a target)
    season_counts = df.groupby('Player')['year'].nunique()
    valid_players = season_counts[season_counts > sequence_length].index
    # Create sequences: sequence_length seasons of features -> next season's target
    X, y, player_years = [], [], []
    for player in valid_players:
        player_df = df[df['Player'] == player]
        features = player_df[key_columns]
        for i in range(len(player_df) - sequence_length):
            seq = features.iloc[i:i+sequence_length].values
            if seq.shape != (sequence_length, n_features):
                continue
            X.append(seq)
            y.append(player_df[target].iloc[i+sequence_length])
            player_years.append((player, player_df['year'].iloc[i+sequence_length]))

    if not X:
        raise ValueError(
            f"No valid sequences created with sequence_length={sequence_length}. "
            f"Found {len(valid_players)} players with more than {sequence_length} seasons."
        )

    X = np.array(X, dtype=float)
    y = np.array(y, dtype=float)

    # Split data: first 80% of sequences for training, the rest for testing
    train_idx = int(0.8 * len(X))
    X_train, X_test = X[:train_idx], X[train_idx:]
    y_train, y_test_orig = y[:train_idx], y[train_idx:]
    test_player_years = player_years[train_idx:]

    # Scale features and target, fitting the scalers on the training split only
    # so that test-set statistics do not leak into training
    scaler_X = StandardScaler()
    scaler_y = StandardScaler()
    X_train = scaler_X.fit_transform(X_train.reshape(-1, n_features)).reshape(X_train.shape)
    X_test = scaler_X.transform(X_test.reshape(-1, n_features)).reshape(X_test.shape)
    y_train = scaler_y.fit_transform(y_train.reshape(-1, 1)).flatten()

    # Build RNN model
    model = Sequential([
        LSTM(50, input_shape=(sequence_length, n_features), return_sequences=False),
        Dense(25, activation='relu'),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse')

    # Train model
    model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0, validation_split=0.2)

    # Predict and convert predictions back to the original WAR scale
    y_pred_scaled = model.predict(X_test)
    y_pred = scaler_y.inverse_transform(y_pred_scaled).flatten()

    # Evaluate
    rmse = np.sqrt(mean_squared_error(y_test_orig, y_pred))
    r2 = r2_score(y_test_orig, y_pred)

    # Save predictions
    test_df = pd.DataFrame({
        'Player': [py[0] for py in test_player_years],
        'Year': [py[1] for py in test_player_years],
        'Actual_WAR': y_test_orig,
        'Predicted_WAR': y_pred
    })
    test_df.to_csv(output, index=False)

    return model, test_df, rmse, r2
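
The two evaluation metrics returned by `rnn_prediction`, RMSE and R², can also be written out directly, which makes it easier to interpret them. The sketch below uses made-up numbers (they are not model output) and plain NumPy. RMSE is in the same units as WAR, and R² is the share of variance in the actual values explained by the predictions.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
actual = np.array([2.0, 4.0, 6.0, 8.0])
predicted = np.array([3.0, 4.0, 5.0, 8.0])

residuals = actual - predicted

# Root mean squared error: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```

An R² close to zero would mean the model does little better than always predicting the mean WAR of the test set.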