# Phillies Quantitative Analyst Take-Home
### Author: Ryan Williams
### Date: 05/12/2025

**Language**: Julia 1.11+

This notebook addresses Question 11 of the assessment, where we are tasked with predicting each pitcher's 2024 strikeout percentage (K%) using their prior performance data.

We use a hierarchical Bayesian model implemented via `Turing.jl` to estimate future K% while accounting for player-level effects and uncertainty.
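As a rough illustration of what such a model could look like in `Turing.jl`, here is a minimal sketch. It is not the author's final model: the model name `k_hier`, the priors, the single predictor, and the synthetic data are all assumptions chosen only to show partial pooling of pitcher-level effects.

```julia
using Turing, Distributions, Random

# Hypothetical model (an assumption, not the notebook's final specification):
# each pitcher gets an intercept offset that is partially pooled toward zero
# through the shared scale τ.
@model function k_hier(y, k_prev, player, n_players)
    μ ~ Normal(0, 1)                           # league-level intercept
    β ~ Normal(0, 1)                           # slope on prior-season K%
    τ ~ truncated(Normal(0, 0.1); lower=0)     # between-pitcher standard deviation
    σ ~ truncated(Normal(0, 0.1); lower=0)     # residual standard deviation
    z ~ filldist(Normal(0, 1), n_players)      # standardized pitcher effects
    for i in eachindex(y)
        y[i] ~ Normal(μ + β * k_prev[i] + τ * z[player[i]], σ)
    end
end

# Synthetic data only, to show the call pattern
Random.seed!(42)
n_players = 5
player = repeat(1:n_players, inner=3)          # 3 seasons per synthetic pitcher
k_prev = 0.22 .+ 0.05 .* randn(length(player))
y = 0.08 .+ 0.65 .* k_prev .+ 0.02 .* randn(length(player))

chain = sample(k_hier(y, k_prev, player, n_players), NUTS(), 200; progress=false)
```

The non-centered form (`τ * z`) is a common choice for hierarchical models with few observations per group. Whether it suits this data set is something to check against sampler diagnostics.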


In [None]:
# Setup Julia environment for the k_model package
using Pkg
Pkg.activate("k_model_env")
Pkg.add([
    "CSV",
    "DataFrames",
    "StatsPlots",
    "Turing",
    "Random",
    "Distributions",
    "StatsBase",
    "CategoricalArrays"
])
Pkg.precompile()

In [8]:
# Load libraries
using CSV, DataFrames
using StatsPlots
using Turing
using Random, Distributions
using StatsBase
using CategoricalArrays
Random.seed!(42)

TaskLocalRNG()

## Load and inspect data
We begin by loading `k.csv`, which contains:
- Player identifiers (MLBAMID, FanGraphs ID)
- Age and season
- Total Batters Faced (TBF) and Strikeout Percentage (K%)

In [3]:
df = CSV.read("k.csv", DataFrame)
first(df, 5)

Row,MLBAMID,PlayerId,Name,Team,Age,Season,TBF,K%
,Int64,Int64,String31,String7,Int64,Int64,Int64,Float64
1,695243,31757,Mason Miller,OAK,25,2024,249,0.417671
2,621242,14710,Edwin Díaz,NYM,30,2024,216,0.388889
3,518585,7048,Fernando Cruz,CIN,34,2024,288,0.378472
4,623352,14212,Josh Hader,HOU,30,2024,278,0.377698
5,663574,19926,Tony Santillan,CIN,27,2024,122,0.377049


In [None]:
describe(df)
# names(df)

## Preprocess Data
- We sort by player and season
- We compute lagged K% and TBF
- We exclude rows with missing historical data (e.g., rookies)

In [4]:
rename!(df, Symbol("K%") => :K)

Row,MLBAMID,PlayerId,Name,Team,Age,Season,TBF,K
,Int64,Int64,String31,String7,Int64,Int64,Int64,Float64
1,695243,31757,Mason Miller,OAK,25,2024,249,0.417671
2,621242,14710,Edwin Díaz,NYM,30,2024,216,0.388889
3,518585,7048,Fernando Cruz,CIN,34,2024,288,0.378472
4,623352,14212,Josh Hader,HOU,30,2024,278,0.377698
5,663574,19926,Tony Santillan,CIN,27,2024,122,0.377049
6,669093,22210,Jeremiah Estrada,SDP,25,2024,252,0.373016
7,547973,10233,Aroldis Chapman,PIT,36,2024,265,0.369811
8,671305,22533,Michel Otañez,OAK,26,2024,151,0.364238
9,489446,9073,Kirby Yates,TEX,37,2024,237,0.35865
10,670955,20539,Edwin Uceta,TBR,26,2024,159,0.358491


In [5]:
# Sort so that each pitcher's seasons are contiguous and in chronological order
sort!(df, [:MLBAMID, :Season])

# Shift a column down by one row; the first row has no predecessor
lag1(x) = [missing; x[1:end-1]]

# Add the previous recorded season's K% and TBF for each pitcher.
# Note: "previous" means the prior row for that pitcher, which is not
# necessarily the preceding calendar year if a season is absent from the data.
transform!(groupby(df, :MLBAMID),
    :K => lag1 => :K_prev,
    :TBF => lag1 => :TBF_prev)

# Drop rows that don’t have lagged values (e.g. rookies or first-year records)
df_model = dropmissing(df, [:K_prev, :TBF_prev])

Row,MLBAMID,PlayerId,Name,Team,Age,Season,TBF,K,K_prev,TBF_prev
,Int64,Int64,String31,String7,Int64,Int64,Int64,Float64,Float64,Int64
1,425794,2233,Adam Wainwright,STL,40,2022,803,0.178082,0.210145,828
2,425794,2233,Adam Wainwright,STL,41,2023,484,0.113636,0.178082,803
3,425844,1943,Zack Greinke,KCR,38,2022,585,0.124786,0.172166,697
4,425844,1943,Zack Greinke,KCR,39,2023,593,0.163575,0.124786,585
5,434378,8700,Justin Verlander,- - -,40,2023,669,0.215247,0.277778,666
6,434378,8700,Justin Verlander,HOU,41,2024,396,0.186869,0.215247,669
7,445276,3096,Kenley Jansen,ATL,34,2022,260,0.326923,0.309353,278
8,445276,3096,Kenley Jansen,BOS,35,2023,188,0.276596,0.326923,260
9,445276,3096,Kenley Jansen,BOS,36,2024,218,0.284404,0.276596,188
10,445926,5448,Jesse Chavez,- - -,38,2022,292,0.253425,0.270677,133


## Target and Features
- Target: K% for current season
- Features: prior K%, prior TBF, age
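The target and features listed above can be assembled into a design matrix and a target vector. A minimal sketch follows, assuming the column names from the preprocessing step; `df_toy` and `feature_cols` are illustrative names, and the rows are copied from the `df_model` preview shown earlier.

```julia
using DataFrames

# Rows copied from the df_model preview earlier in this notebook
df_toy = DataFrame(
    MLBAMID  = [425794, 425794, 425844, 425844],
    Age      = [40, 41, 38, 39],
    Season   = [2022, 2023, 2022, 2023],
    K        = [0.178082, 0.113636, 0.124786, 0.163575],
    K_prev   = [0.210145, 0.178082, 0.172166, 0.124786],
    TBF_prev = [828, 803, 697, 585],
)

feature_cols = [:K_prev, :TBF_prev, :Age]
X = Matrix{Float64}(df_toy[:, feature_cols])   # n × 3 feature matrix
y = df_toy.K                                   # current-season K% (target)
```

Because TBF and age are on very different scales from K%, standardizing the columns of `X` before fitting may be worth considering.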

In [12]:
# Map each pitcher's MLBAMID to a contiguous integer index (1..n_players)
player_cats = categorical(df_model.MLBAMID)
player_index = levelcode.(player_cats)

## Baseline Linear Model
We'll begin with a simple linear regression to establish a baseline performance.
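A baseline of this kind can be sketched directly in Julia with ordinary least squares. The example below is a minimal sketch on synthetic data, not results from `k.csv`: the coefficients used to generate the data, the 80/20 split, and the single predictor are all assumptions made for illustration.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(42)
n = 200
k_prev = 0.22 .+ 0.05 .* randn(n)                # synthetic prior-season K%
y = 0.08 .+ 0.65 .* k_prev .+ 0.02 .* randn(n)   # synthetic current-season K%

# Random 80/20 train/test split
idx = shuffle(1:n)
n_test = round(Int, 0.2 * n)
test, train = idx[1:n_test], idx[n_test+1:end]

# Design matrix with an intercept column; solve least squares with `\`
X = hcat(ones(n), k_prev)
β = X[train, :] \ y[train]

# Evaluate on the held-out rows
ŷ = X[test, :] * β
rmse = sqrt(mean((y[test] .- ŷ) .^ 2))
r2 = 1 - sum((y[test] .- ŷ) .^ 2) / sum((y[test] .- mean(y[test])) .^ 2)
```

With the real data, `X` would instead hold the lagged K%, lagged TBF, and age columns from `df_model`.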


In [None]:
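# Note: this cell is Python (scikit-learn), not Julia, and needs a Python kernel.
# It assumes X (prior K%, prior TBF, age) and y (current K%) are already defined.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(f'R²: {r2_score(y_test, y_pred):.4f}')
print(f'RMSE: {mean_squared_error(y_test, y_pred) ** 0.5:.4f}')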

## LightGBM Model (Gradient Boosted Trees)
A more flexible model to capture nonlinear effects and feature interactions.


In [None]:
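# Note: this cell is Python (LightGBM) and reuses the split from the baseline cell.
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'objective': 'regression',
    'metric': 'rmse',
    'verbosity': -1
}
# LightGBM 4.0 removed the early_stopping_rounds argument of train(); use the callback
gbm = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])
y_pred_gbm = gbm.predict(X_test)

print(f'LGBM RMSE: {mean_squared_error(y_test, y_pred_gbm) ** 0.5:.4f}')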

## Optional: Integrating External Data (Fastball Velocity, Stuff+, etc.)
We could augment the features with external data, for example retrieved with the Python package `pybaseball`:
- Fastball velocity
- Pitch type usage
- Historical Stuff+ ratings

These are predictive features of the kind used by MLB teams.
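Once such external data has been retrieved, it can be merged onto the modeling table by pitcher and season. A minimal sketch with `DataFrames.jl` follows, assuming the external table shares the `MLBAMID` and `Season` keys. The tables `base` and `velo`, the column `FB_velo`, and every ID and value in them are invented placeholders, not real pitcher data.

```julia
using DataFrames

# Placeholder tables: IDs and values are invented for illustration only
base = DataFrame(MLBAMID = [1, 2, 3],
                 Season  = [2023, 2023, 2023],
                 K_prev  = [0.20, 0.25, 0.30])
velo = DataFrame(MLBAMID = [1, 2],
                 Season  = [2023, 2023],
                 FB_velo = [93.5, 96.1])

# A left join keeps every pitcher-season in `base`;
# rows with no match in `velo` get `missing` for FB_velo
augmented = leftjoin(base, velo, on = [:MLBAMID, :Season])
sort!(augmented, :MLBAMID)
```

Pitchers without external measurements end up with `missing` values, so the model would need either imputation or a restriction to complete rows.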


## Conclusion & Deliverables
- Summary of findings
- Model performance comparison
- Export of 2024 predictions for all eligible players (if requested)
