# 📊 Phillies Take-Home: Strikeout Percentage Prediction (2024)
**Language**: Julia 1.9+

This notebook addresses Question 11 of the assessment, where we are tasked with predicting each pitcher's 2024 strikeout percentage (K%) using their prior performance data.

We use a hierarchical Bayesian model implemented via `Turing.jl` to estimate future K% while accounting for player-level effects and uncertainty.

In [None]:
# 🧱 Load libraries
using CSV, DataFrames
using StatsPlots
using Turing
using Random, Distributions
using StatsBase
Random.seed!(42)

## 📂 Load and inspect data

In [None]:
df = CSV.read("k.csv", DataFrame)
first(df, 5)

## 🧼 Preprocess Data
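- We sort by player, then by season, so each pitcher's rows are in chronological order
- We compute lagged K% and TBF (each pitcher's values from their previous row)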
- We exclude rows with missing historical data (e.g., rookies)

In [None]:
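sort!(df, [:MLBAMID, :Season])
# Lag columns start as missing; they hold numeric values, not Bool
df.K_prev = missings(Float64, nrow(df))
df.TBF_prev = missings(Float64, nrow(df))

for g in groupby(df, :MLBAMID)
    rows = parentindices(g)[1]  # row numbers of this player's seasons in df
    for i in 2:nrow(g)
        df.K_prev[rows[i]] = g.K_[i-1]
        df.TBF_prev[rows[i]] = g.TBF[i-1]
    end
end

# Drop each player's first row, which has no history
df_model = dropmissing(df, [:K_prev, :TBF_prev])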
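
As a sanity check on the lagging step above, here is a minimal sketch of the same idea on a toy dataset using plain vectors. The ids, seasons, and K% values are made up for illustration and do not come from `k.csv`.

```julia
# Toy rows, already sorted by player id and then season (hypothetical values).
ids     = [1, 1, 1, 2, 2, 3]
seasons = [2021, 2022, 2023, 2022, 2023, 2023]
k_pct   = [0.20, 0.24, 0.22, 0.30, 0.28, 0.18]

# Lag within each player: a player's first row has no history and stays missing.
k_prev = Vector{Union{Missing, Float64}}(missing, length(ids))
for i in 2:length(ids)
    if ids[i] == ids[i-1]
        k_prev[i] = k_pct[i-1]
    end
end

# Rows with usable history, i.e. the rows that would survive dropmissing.
keep = .!ismissing.(k_prev)
println(collect(zip(ids[keep], seasons[keep], k_prev[keep])))
```

Note that this kind of lag links each row to the player's previous row, which is not necessarily the previous calendar season. If a pitcher skipped a year, one option would be to also require `seasons[i] == seasons[i-1] + 1` before filling the lag.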

## 🎯 Target and Features
- Target: K% for current season
- Features: prior K%, prior TBF, age

In [None]:
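X = select(df_model, [:K_prev, :TBF_prev, :Age]) |> Matrix
y = df_model.K_ |> collect
# Map each MLBAMID to a contiguous integer index 1..n_players
id_to_idx = Dict(id => i for (i, id) in enumerate(unique(df_model.MLBAMID)))
player_index = [id_to_idx[id] for id in df_model.MLBAMID]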
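
The three features live on very different scales (TBF is in the hundreds, while K% is a rate), which can sit awkwardly with unit-scale priors on the coefficients. One possible preprocessing step, not part of the original notebook, is to z-score each column. The sketch below assumes a hypothetical helper `standardize` and a made-up toy matrix.

```julia
using Statistics

# Column-wise z-scoring: subtract each column's mean and divide by its
# standard deviation. Returns the scaled matrix plus the means and sds.
function standardize(X::AbstractMatrix)
    col_mean = mean(X, dims=1)
    col_sd = std(X, dims=1)
    return (X .- col_mean) ./ col_sd, col_mean, col_sd
end

# Hypothetical toy matrix with columns K_prev, TBF_prev, Age.
X_toy = [0.20 650.0 27.0;
         0.25 300.0 31.0;
         0.30 480.0 24.0]
X_std, X_mean, X_sd = standardize(X_toy)
```

The returned means and standard deviations would need to be kept so that the same transformation can be applied to the rows used for the 2024 predictions.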

## 🧠 Turing Model Definition
We model K% as a linear function of the prior-season features plus a player-level intercept. Each intercept gets a fixed `Normal(0, 1)` prior.

In [None]:
@model function k_predict_model(X, y, player_idx)
    N, D = size(X)
    n_players = maximum(player_idx)

    α ~ Normal(0, 1)
    # Independent Normal(0, 1) priors on the D slopes
    β ~ filldist(Normal(0, 1), D)
    σ ~ Exponential(1.0)
    θ_player ~ filldist(Normal(0, 1), n_players)

    μ = α .+ θ_player[player_idx] .+ X * β
    y .~ Normal.(μ, σ)
end
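
Written out, the model in the cell above appears to correspond to the following specification, where $p(i)$ denotes the player index of row $i$ (the `player_idx` argument), $\mathbf{x}_i$ is row $i$ of `X`, and the second argument of $\mathcal{N}$ is a standard deviation, as in `Distributions.jl`:

$$
\begin{aligned}
\mu_i &= \alpha + \theta_{p(i)} + \mathbf{x}_i^\top \boldsymbol{\beta}, \\
y_i &\sim \mathcal{N}(\mu_i, \sigma), \\
\alpha &\sim \mathcal{N}(0, 1), \quad \beta_d \sim \mathcal{N}(0, 1), \quad \theta_j \sim \mathcal{N}(0, 1), \quad \sigma \sim \mathrm{Exponential}(1).
\end{aligned}
$$

Because the standard deviation of $\theta_j$ is fixed at 1, the amount of pooling across players is set by the prior rather than learned. A variant with a learned scale, for example $\theta_j \sim \mathcal{N}(0, \tau)$ with $\tau \sim \mathrm{Exponential}(1)$, would be one way to let the data decide. That variant is an assumption for illustration and is not what the code above fits.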

## 🔁 Fit the Model

In [None]:
model = k_predict_model(X, y, player_index)
chain = sample(model, NUTS(), 500)
describe(chain)
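
Once the chain is sampled, a 2024 prediction for a pitcher can be formed by pushing their latest features through every posterior draw. The sketch below shows the arithmetic only. The function `predict_k`, the simulated draws, and the feature vector are all hypothetical stand-ins; in practice the arrays would be extracted from `chain`.

```julia
using Statistics, Random

# Summarize the posterior predictive distribution for one pitcher.
# α, θ, σ: vectors of S draws; β: S×D matrix of draws; x_new: D features.
function predict_k(rng, α, β, θ, σ, x_new)
    μ = α .+ θ .+ β * x_new                      # linear predictor per draw
    y_rep = μ .+ σ .* randn(rng, length(μ))      # add observation noise
    return (mean = mean(y_rep),
            lower = quantile(y_rep, 0.05),
            upper = quantile(y_rep, 0.95))
end

# Simulated draws standing in for the sampled chain (not real results).
rng = MersenneTwister(1)
S = 1_000
α = 0.22 .+ 0.01 .* randn(rng, S)
β = hcat(0.5 .+ 0.05 .* randn(rng, S),
         0.01 .* randn(rng, S),
         -0.01 .* randn(rng, S))
θ = 0.02 .* randn(rng, S)
σ = fill(0.03, S)
x_new = [0.05, 0.3, -0.5]   # hypothetical K_prev, TBF_prev, Age inputs

pred = predict_k(rng, α, β, θ, σ, x_new)
println(pred)
```

The result is a point estimate with a 90% interval rather than a single number, which matches the stated goal of accounting for uncertainty. For a pitcher who is not in the training data, one option would be to draw `θ` from its prior instead of from the chain.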

## 📈 Posterior Check (optional)

In [None]:
using MCMCChains
# Trace and density plots for each sampled parameter (uses the StatsPlots recipes loaded above)
plot(chain)