# **Stolen Base Probability**

## **Table of Contents**

- **[Imports and Installations](#imports-and-installations)** - Project Imports and Installations
- **[Globals](#globals)** - Global Arguments for Learning
- **[Utils](#utils)** - Functions to Assist with Learning 
- **[Create Player DF](#create-player-df)** - Creating the Necessary DataFrames for Learning
- **[Lead Distance](#lead-distance)** - Learn Players' Lead Distance at Release
    - **[Runner and Pitcher DataFrames](#runner-and-pitcher-dataframes)** - Appropriate DataFrames for Learning Lead Distance
    - **[Prep Lead Distance DataFrame](#prep-lead-distance-dataframe)** - Preparing the Lead Distance DataFrame for Learning
    - **[Learn Lead Distance](#learn-lead-distance)** - Function to Learn the Secondary Lead Distance of a Runner and Pitcher Matchup

## **Imports and Installations**

In [None]:
%pip install numpy==1.26.0
%pip install numpyro
%pip install pymc==5.9.0

In [48]:
# ------------------------- Standard library imports ------------------------- #
import os  
from pathlib import Path 

# ---------------- Scientific computing and data manipulation ---------------- #
import numpy as np  
import pandas as pd  
from scipy.stats import norm 

# ------------------------------- Visualization ------------------------------ #
from tqdm import tqdm
from matplotlib import pyplot as plt  

# ----------------------------- Bayesian modeling ---------------------------- #
import pymc as pm  
import pymc.sampling.jax as pmjax

# ------------------------------- Local import ------------------------------- #
from stolen_base.utils import get_player_speed

## **Globals**

In [49]:
YEARS = "2022-2025"
MPH_FT_PER_SEC_MULTIPLE = 1.4666667
PLATE_DISTANCE = 90  # feet

## **Utils**

In [50]:
def create_index_df(df: pd.DataFrame, col_name: str, col_index: str, file_path: str):
    """
    Create an index DataFrame that maps each unique value in a column to a
    contiguous integer index, and save it as a CSV file.
    
    Args:
        df (pd.DataFrame): The input DataFrame containing the data.
        col_name (str): The column whose unique values are mapped to indices.
        col_index (str): The name of the index column to be created.
        file_path (str): The path where the index DataFrame will be saved.
    """
    # Sort the unique IDs so the index assignment is reproducible across runs
    items = sorted(set(df[col_name].unique().astype(int)))
    index_df = pd.DataFrame({
        col_index: np.arange(len(items)),
        col_name: items
    })
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    index_df.to_csv(file_path, index=False)
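The mapping this utility produces can be illustrated on toy data without writing to disk. The sketch below is a minimal example, not the notebook's own code: the runner IDs are made up, and the unique IDs are sorted here so that the index assignment is the same on every run.

```python
import numpy as np
import pandas as pd

# Toy data: the IDs below are made up for illustration only.
toy = pd.DataFrame({"runner_id": [660271, 592450, 660271, 545361]})

# Sort the unique IDs so each ID always receives the same index.
unique_ids = np.sort(toy["runner_id"].unique().astype(int))
runner_index_df = pd.DataFrame({
    "runner_index": np.arange(len(unique_ids)),
    "runner_id": unique_ids,
})

# Attach the contiguous index to every row of the original data.
indexed = toy.merge(runner_index_df, on="runner_id", how="left")
print(indexed)
```

Contiguous indices of this kind are what a hierarchical model needs in order to look up one parameter per player.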

## **Create Player DF**

In [None]:
sb_data = pd.read_csv(f'data/sb_data_{YEARS}.csv')

# Player Index DataFrames
create_index_df(sb_data, col_name='catcher_id', col_index='catcher_index', file_path=Path('data/player_index/catcher_index.csv'))
create_index_df(sb_data, col_name='pitcher_id', col_index='pitcher_index', file_path=Path('data/player_index/pitcher_index.csv'))
create_index_df(sb_data, col_name='runner_id', col_index='runner_index', file_path=Path('data/player_index/runner_index.csv'))
create_index_df(sb_data, col_name='batter_id', col_index='batter_index', file_path=Path('data/player_index/batter_index.csv'))
create_index_df(sb_data, col_name='fielder_id', col_index='fielder_index', file_path=Path('data/player_index/fielder_index.csv'))

# Runner Speed DataFrame
players = set(sb_data['runner_id'].unique().astype(int))
player_speed_df = pd.DataFrame({
    'runner_id': list(players),
    'mu_runner_speed': [round(get_player_speed(player)['sprint_speed'].mean(), 3) for player in players]
})
player_speed_df.to_csv(Path('data/player_speed.csv'), index=False)

## **Lead Distance**

Learn a base runner's average lead distance gained between the pitcher's first 
movement and the release of the ball. This estimate will later assist in learning 
a pitcher's windup time.
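As a rough illustration of how lead distance gained could relate to windup time, the sketch below divides the distance gained by an average runner speed over that interval. This is an assumption made for illustration, not the notebook's method: the helper names `mph_to_ft_per_sec` and `estimate_windup_time` are hypothetical, and a constant average speed is a simplification because the runner is accelerating from a near standstill.

```python
MPH_FT_PER_SEC_MULTIPLE = 1.4666667  # same value as in Globals


def mph_to_ft_per_sec(speed_mph: float) -> float:
    """Convert a speed in miles per hour to feet per second."""
    return speed_mph * MPH_FT_PER_SEC_MULTIPLE


def estimate_windup_time(
    lead_at_first_move_ft: float,
    lead_at_release_ft: float,
    avg_speed_ft_per_sec: float,
) -> float:
    """Hypothetical helper: time (s) implied by the lead gained at a constant average speed."""
    if avg_speed_ft_per_sec <= 0:
        raise ValueError("avg_speed_ft_per_sec must be positive")
    lead_gained_ft = lead_at_release_ft - lead_at_first_move_ft
    return lead_gained_ft / avg_speed_ft_per_sec


# Made-up numbers: 6 ft gained at an average of 5 ft/s over the interval.
windup_time = estimate_windup_time(10.0, 16.0, 5.0)
print(f"{windup_time:.2f} s")
```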

### **Runner and Pitcher DataFrames**

In [51]:
# Player Index DataFrames
pitcher_index_df = pd.read_csv(Path('data/player_index/pitcher_index.csv'))
runner_index_df = pd.read_csv(Path('data/player_index/runner_index.csv'))

# Player Speed DataFrame
player_speed_df = pd.read_csv(Path('data/player_speed.csv'))

### **Prep Lead Distance DataFrame**

In [52]:
# Build lead distance DataFrame
lead_distance_df = sb_data[['pitcher_id', 'runner_id', 'lead_distance_gained', 'at_pitchers_first_move', 'at_pitch_release']]

# Add pitcher and runner indices
lead_distance_df = lead_distance_df.merge(pitcher_index_df, on='pitcher_id', how='left')
lead_distance_df = lead_distance_df.merge(runner_index_df, on='runner_id', how='left')

# Add each runner's average speed
lead_distance_df = lead_distance_df.merge(player_speed_df, on='runner_id', how='left')
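A left merge silently leaves `NaN` in the new columns when a key has no match, which would later break the integer indexing inside the model. The sketch below shows one way to check for that on toy data, using the `validate` and `indicator` options of `DataFrame.merge`. The frames and values are made up for illustration.

```python
import pandas as pd

# Toy frames: runner 3 has no entry in the speed table.
obs = pd.DataFrame({
    "runner_id": [1, 2, 3],
    "at_pitch_release": [14.2, 15.0, 13.1],
})
speed = pd.DataFrame({
    "runner_id": [1, 2],
    "mu_runner_speed": [27.5, 28.9],
})

# validate raises if the right-hand table has duplicate runner_id values;
# indicator adds a `_merge` column recording where each row came from.
merged = obs.merge(
    speed, on="runner_id", how="left", validate="many_to_one", indicator=True
)

# Rows found only on the left side are the unmatched runners.
unmatched = merged.loc[merged["_merge"] == "left_only", "runner_id"].tolist()
print(unmatched)

merged = merged.drop(columns="_merge")
```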

### **Learn Lead Distance**

In [None]:
def learn_lead_distance(
        df: pd.DataFrame,
        n_pitchers: int, 
        n_runners: int,
        mu_mu_estimate: int,
        mu_sigma_estimate: int,
        sigma_mu_estimate: int,
        sigma_sigma_estimate: int,
        lower_confidence: float = None,
        upper_confidence: float = None,
        tune: int = 2000,
        n_samples: int = 2000,
        n_chains: int = 4,
):
    """
    Learn the secondary lead distance of a runner against a specific pitcher.
    
    Args:
        df (pd.DataFrame): The input DataFrame containing the data.
        n_pitchers (int): The number of unique pitchers.
        n_runners (int): The number of unique runners.
        mu_mu_estimate (int): The hyper-prior mean for the secondary lead distance.
        mu_sigma_estimate (int): The hyper-prior standard deviation for the secondary lead distance.
        sigma_mu_estimate (int): The hyper-prior mean for the standard deviation of the secondary lead distance.
        sigma_sigma_estimate (int): The hyper-prior standard deviation for the standard deviation of the secondary lead distance.
        lower_confidence (float, optional): The lower confidence interval for the posterior distribution.
        upper_confidence (float, optional): The upper confidence interval for the posterior distribution.
        tune (int, optional): The number of tuning steps.
        n_samples (int, optional): The number of samples to draw from the posterior distribution.
        n_chains (int, optional): The number of chains to use for sampling.

    Returns:
        The trace returned by `pmjax.sample_numpyro_nuts`.
    """
    # Fit model
    coords = {
        "pitcher": np.arange(n_pitchers),
        "runner": np.arange(n_runners),
        "observation": np.arange(df.shape[0])
        }

    with pm.Model(coords=coords) as mod:
        # Extract pitcher, runner indices, and secondary lead distance ('at_pitch_release')
        pitcher_index = pm.ConstantData("pitcher_index", df['pitcher_index'].values)
        runner_index = pm.ConstantData("runner_index", df['runner_index'].values)
        lead_distance = pm.ConstantData("secondary_lead", df['at_pitch_release'].values)

        # Priors for secondary lead mu

        # Priors of secondary lead sigma

        # Sample distribution for each runner

        # Likelihood of the observed data
    
    # Train model
    with mod:
        trace = pmjax.sample_numpyro_nuts(
            n_samples,
            tune=tune,
            chains=n_chains,
        )
        
    return trace
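
The prior and likelihood steps are left blank in the function above. Purely as an illustration, one hierarchical structure that would be consistent with the argument names is sketched below. Everything in it is an assumption rather than the notebook's actual model: $m_\mu, s_\mu, m_\sigma, s_\sigma$ stand for `mu_mu_estimate`, `mu_sigma_estimate`, `sigma_mu_estimate`, and `sigma_sigma_estimate`; $\delta_p$ is a hypothetical pitcher offset with a hypothetical scale $\tau$; and $y_i$ is the observed `at_pitch_release` value for observation $i$ with runner $r[i]$ and pitcher $p[i]$.

$$
\begin{aligned}
\mu_r &\sim \mathcal{N}(m_\mu,\ s_\mu), & r &= 1, \dots, n_{\text{runners}} \\
\delta_p &\sim \mathcal{N}(0,\ \tau), & p &= 1, \dots, n_{\text{pitchers}} \\
\sigma_r &\sim \mathcal{N}^{+}(m_\sigma,\ s_\sigma), & r &= 1, \dots, n_{\text{runners}} \\
y_i &\sim \mathcal{N}\left(\mu_{r[i]} + \delta_{p[i]},\ \sigma_{r[i]}\right), & i &= 1, \dots, N
\end{aligned}
$$

Here $\mathcal{N}^{+}$ denotes a normal distribution truncated at zero, so that the standard deviations stay positive.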