# Phillies Quantitative Analyst Take-Home
### Author: Ryan Williams
### Date: [Submission Date]

This notebook addresses Question 11 of the Phillies take-home assessment. The objective is to predict each player’s **2024 Strikeout Percentage (K%)** from prior seasons' data, optionally incorporating supplemental information to improve predictive power.

We follow a structured workflow:
- Exploratory Data Analysis
- Baseline Linear Model
- Advanced ML Model (LightGBM)
- Optional: External Feature Augmentation (e.g., fastball velocity via `pybaseball`)
- Evaluation & Prediction


In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import lightgbm as lgb
# Optional: pybaseball for supplemental data
# from pybaseball import pitching_stats

sns.set(style='whitegrid')

## Load Data
We begin by loading `k.csv`, which contains:
- Player identifiers (MLBAMID, FanGraphs ID)
- Age and season
- Total Batters Faced (TBF) and Strikeout Percentage (K%)

In [None]:
df = pd.read_csv('k.csv')
df.head()

## Exploratory Data Analysis (EDA)
- Examine K% distribution by season
- TBF trends by age
- Visualize player-level trajectories if useful

In [None]:
df['Season'] = df['Season'].astype(int)
df['Age'] = df['Age'].astype(float)

sns.histplot(data=df, x='K%', bins=30, kde=True)
plt.title('Strikeout % Distribution')
plt.show()
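
The first two EDA bullets can also be checked numerically before plotting. The sketch below is a minimal illustration on a made-up toy frame that mimics the columns described for `k.csv`; it assumes `K%` is stored as a numeric fraction, and the helper name `summarize_by` is hypothetical.

```python
import pandas as pd

# Toy stand-in for k.csv; the values are made up purely for illustration.
toy = pd.DataFrame({
    'MLBAMID': [1, 1, 2, 2, 3, 3],
    'Season':  [2022, 2023, 2022, 2023, 2022, 2023],
    'Age':     [25, 26, 30, 31, 25, 26],
    'TBF':     [600, 650, 200, 180, 400, 420],
    'K%':      [0.25, 0.27, 0.18, 0.17, 0.22, 0.24],
})

def summarize_by(frame, key, value):
    """Count, mean, and standard deviation of `value` within each `key` group."""
    return frame.groupby(key)[value].agg(['count', 'mean', 'std'])

# K% distribution by season and TBF by age, as small summary tables
k_by_season = summarize_by(toy, 'Season', 'K%')
tbf_by_age = summarize_by(toy, 'Age', 'TBF')
print(k_by_season)
print(tbf_by_age)
```

On the real data, the same calls with `df` in place of `toy` would show whether league-wide K% drifts between seasons and how workload varies with age.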

## Feature Engineering
We will reshape the dataset so each row pairs a player's season with his prior-season stats as features.

- Encode prior years' K% and TBF as features
- Compute year-over-year deltas
- Optional: include player age or trend features

In [None]:
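# Example preprocessing stub: build (prior season -> current season) pairs per pitcher
def preprocess_for_modeling(df):
    df = df.sort_values(['MLBAMID', 'Season'])
    features = []
    targets = []
    for pid, group in df.groupby('MLBAMID'):
        group = group.reset_index(drop=True)
        # A pitcher needs at least two seasons to form one pair
        if len(group) < 2:
            continue
        for i in range(1, len(group)):
            # Skip pairs separated by a missed season so 'Prev_' always means the year before
            if group.loc[i, 'Season'] - group.loc[i-1, 'Season'] != 1:
                continue
            row = {
                'Prev_K%': group.loc[i-1, 'K%'],
                'Prev_TBF': group.loc[i-1, 'TBF'],
                'Age': group.loc[i, 'Age']
            }
            features.append(row)
            targets.append(group.loc[i, 'K%'])
    return pd.DataFrame(features), pd.Series(targets)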

X, y = preprocess_for_modeling(df)
X.head()
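
The year-over-year delta listed in the bullets is not computed by the stub. One possible vectorized sketch uses `groupby(...).shift` to build lags. The function name `add_lag_features` and the `Delta_K%` column are assumptions introduced here, and the toy values are invented.

```python
import pandas as pd

def add_lag_features(frame):
    """Attach prior-season K%/TBF and a year-over-year K% delta to each row."""
    out = frame.sort_values(['MLBAMID', 'Season']).copy()
    cols = ['Season', 'K%', 'TBF']
    prev1 = out.groupby('MLBAMID')[cols].shift(1)  # one row back, same player
    prev2 = out.groupby('MLBAMID')[cols].shift(2)  # two rows back, same player
    out['Prev_K%'] = prev1['K%']
    out['Prev_TBF'] = prev1['TBF']
    # Delta between the two prior seasons, kept only when they are exactly
    # one and two years before the current season (NaN otherwise).
    gap2_ok = (out['Season'] - prev2['Season']) == 2
    out['Delta_K%'] = (prev1['K%'] - prev2['K%']).where(gap2_ok)
    # Keep rows whose previous row is the immediately preceding season.
    gap1_ok = (out['Season'] - prev1['Season']) == 1
    return out[gap1_ok].reset_index(drop=True)

# Toy data: player 1 has three straight seasons, player 2 skipped 2022.
toy = pd.DataFrame({
    'MLBAMID': [1, 1, 1, 2, 2],
    'Season':  [2021, 2022, 2023, 2021, 2023],
    'TBF':     [500, 550, 600, 300, 250],
    'K%':      [0.20, 0.22, 0.25, 0.30, 0.28],
})
lagged = add_lag_features(toy)
print(lagged)
```

Rows with a missing `Delta_K%` can be left as NaN for LightGBM, which handles missing values natively, but would need imputation or removal before fitting `LinearRegression`.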

## Baseline Linear Model
We'll begin with a simple linear regression on the prior-season features to establish baseline performance.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(f'R²: {r2_score(y_test, y_pred):.4f}')
# RMSE as the square root of MSE; `squared=False` was removed from recent scikit-learn releases
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}')
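
A random row-level split can place different seasons of the same pitcher in both the training and test sets. One alternative, sketched here under the assumption that the feature frame keeps the target `Season` column (the stub above drops it), is to hold out the most recent season, which mirrors the actual task of forecasting 2024 from earlier years. The helper names and toy numbers are hypothetical. The same holdout also scores a naive "repeat last year's K%" forecast, a useful floor for any model.

```python
import numpy as np
import pandas as pd

def temporal_split(frame, holdout_season):
    """Train on target seasons before `holdout_season`, test on that season."""
    train = frame[frame['Season'] < holdout_season]
    test = frame[frame['Season'] == holdout_season]
    return train, test

def rmse(y_true, y_pred):
    """Root mean squared error computed directly with NumPy."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy (prior season -> current season) pairs; values are invented.
pairs = pd.DataFrame({
    'Season':  [2022, 2022, 2023, 2023],
    'Prev_K%': [0.20, 0.25, 0.22, 0.24],
    'K%':      [0.22, 0.24, 0.25, 0.20],
})
train, test = temporal_split(pairs, holdout_season=2023)

# Naive persistence forecast: this year's K% equals last year's K%.
naive_rmse = rmse(test['K%'], test['Prev_K%'])
print(f'Naive persistence RMSE: {naive_rmse:.4f}')
```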

## LightGBM Model (Gradient Boosted Trees)
A more flexible model to capture nonlinear effects and feature interactions.


In [None]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'objective': 'regression',
    'metric': 'rmse',
    'verbosity': -1
}
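# Early stopping is passed as a callback; the `early_stopping_rounds` keyword was removed in LightGBM 4.0.
# Note: stopping on the test fold makes the RMSE reported below optimistic.
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=100,
    valid_sets=[lgb_eval],
    callbacks=[lgb.early_stopping(stopping_rounds=10)]
)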
y_pred_gbm = gbm.predict(X_test)

# RMSE as the square root of MSE; `squared=False` was removed from recent scikit-learn releases
print(f'LGBM RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_gbm)):.4f}')

## Optional: Integrating External Data (Fastball Velocity, Stuff+, etc.)
You may augment features using `pybaseball` to retrieve:
- Fastball velocity
- Pitch type usage
- Historical Stuff+ ratings

Features like these are used by MLB teams and may add signal beyond prior-season K%; including them also shows initiative.
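
However the supplemental table is retrieved, joining it onto the K% data is an ordinary keyed merge. The sketch below uses a made-up `external` frame rather than a live `pybaseball` call. The `FB_Velo` column, the `merge_external` helper, and the choice of `MLBAMID` plus `Season` as join keys are all assumptions; depending on the source table, the key may need to be the FanGraphs ID instead.

```python
import pandas as pd

def merge_external(base, external, keys=('MLBAMID', 'Season')):
    """Left-join supplemental pitch data onto the K% table and flag matches."""
    merged = base.merge(
        external,
        on=list(keys),
        how='left',               # keep every pitcher-season from the base table
        validate='one_to_one',    # fail loudly if either side has duplicate keys
        indicator=True,
    )
    merged['has_external'] = merged['_merge'] == 'both'
    return merged.drop(columns='_merge')

# Toy frames; all values are invented for illustration.
base = pd.DataFrame({
    'MLBAMID': [1, 2, 3],
    'Season':  [2023, 2023, 2023],
    'K%':      [0.25, 0.18, 0.22],
})
external = pd.DataFrame({
    'MLBAMID': [1, 3],
    'Season':  [2023, 2023],
    'FB_Velo': [95.1, 92.4],  # hypothetical fastball velocity column (mph)
})
augmented = merge_external(base, external)
print(augmented)
```

The `has_external` flag makes coverage gaps visible, so pitchers without supplemental data are not silently dropped or imputed.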


## Conclusion & Deliverables
- Summary of findings
- Model performance comparison
- Export of 2024 predictions for all eligible players (if requested)
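
For the export step, a minimal sketch follows. It assumes "eligible" means a pitcher with a 2023 row, that `Age` is the player's age in the listed season (so 2024 age is 2023 age plus one), and that the fitted model exposes a scikit-learn-style `predict`. `PersistenceModel`, `build_2024_features`, `predict_2024`, and the output column name are hypothetical placeholders.

```python
import pandas as pd

FEATURES = ['Prev_K%', 'Prev_TBF', 'Age']

def build_2024_features(frame, last_season=2023):
    """Turn each pitcher's most recent season into a feature row for the next one."""
    latest = frame[frame['Season'] == last_season]
    return pd.DataFrame({
        'MLBAMID': latest['MLBAMID'].to_numpy(),
        'Prev_K%': latest['K%'].to_numpy(),
        'Prev_TBF': latest['TBF'].to_numpy(),
        'Age': latest['Age'].to_numpy() + 1,  # assumption: one year older in 2024
    })

class PersistenceModel:
    """Placeholder with a predict() method; stands in for a fitted lr or gbm."""
    def predict(self, X):
        return X['Prev_K%'].to_numpy()

def predict_2024(frame, model):
    """Return one predicted 2024 K% per pitcher who appeared in 2023."""
    feats = build_2024_features(frame)
    out = feats[['MLBAMID']].copy()
    out['Pred_K%_2024'] = model.predict(feats[FEATURES])
    return out

# Toy data; player 3 has no 2023 row and so receives no prediction.
toy = pd.DataFrame({
    'MLBAMID': [1, 1, 2, 3],
    'Season':  [2022, 2023, 2023, 2022],
    'Age':     [25, 26, 31, 28],
    'TBF':     [600, 650, 180, 400],
    'K%':      [0.25, 0.27, 0.17, 0.22],
})
preds = predict_2024(toy, PersistenceModel())
csv_text = preds.to_csv(index=False)  # pass a file path instead to write to disk
print(csv_text)
```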
