# Modifying Batting Strategy

When considering modifications to the batting strategy, we are considering a change in the percentage where that pitch count ends the at-bat.  As a result, the modification must leave the sum of pitch counts the same.  Adding (or removing) some amount percentage from a pitch count i-j will then have the opposite effect on all  pitch counts with more balls or more strikes.  Explicitly, labelling the fraction of pitch count i-j as $P(i,j)$, the transformation we are looking at is given by

\begin{align}
P(b,s) \rightarrow P'(b,s) &= P(b,s) \pm x\\
P(i\ge b, j\ge s) \rightarrow P'(i\ge b, j\ge s) &= P(i,j)\left(1 \mp \frac{x}{-P(b,s) + \sum_{i\ge b, j\ge s} P(i,j)}\right),
\end{align}

where $P(i,j)$ is unchanged if either $i<b$ or $j<s$. For each pitch count we wish to modified, we must calculate the entire transformation.  In particular, we calculate the change in the fractional pitch count $(P'-P)/x$.  This transformation can be written in the form of a 12x12 lower triangular matrix:

\begin{equation}
P'- P=
\begin{bmatrix}
-1 & 0 & 0 & \dots\\
\frac{P(10)}{1-P(00)} & -1 & 0\\
\frac{P(20)}{1-P(00)} & \frac{P(20)}{1-P(00)-P(10)-P(01)-P(02)-P(11)-P(12)} & -1 &\\
\vdots & & & \ddots
\end{bmatrix} \vec{x}
\end{equation}

Note, for each column in the transformation matrix, the entries are either $0,-1$, or have the same denominator.  This is because the columns denote a shift in one particular pitch count.  It is also important to point out, the pitch count $P(3,2)$ cannot be directly changed, this the entire column is filled with zeros.  This is because there is no state with more balls or strikes that can compensate for the change.  Once you have reached this pitch count, the at-bat will necessarily end at that pitch count.

### Reacquire data and model from previous work

We will need the following transformations for both home and away:

Standardization: ()_standardize

Principal Component Analysis: ()_pca

Linear Regression Model: ()_lin_regr

Number of Principal Components kept for regre: num_pc_kept

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
import os


# Set up each team's pitch count data
path_counts = '' # Path to the directory with the pitch count files
count_files = os.listdir(path_counts)
home_files = [files for files in count_files if files[4:]=='Home']
away_files = [files for files in count_files if files[4:]=='Away']

home_counts = pd.DataFrame()
away_counts = pd.DataFrame()


# Import and save home and away data separately
for file in home_files:
    cur_home = pd.read_csv(path_counts+file, index_col=[0])
    cur_home = pd.concat({file[:4]:cur_home}, names=['Year', 'Team'])
    home_counts = pd.concat([home_counts,cur_home])

for file in away_files:
    cur_away = pd.read_csv(path_counts+file, index_col=[0])
    cur_away = pd.concat({file[:4]:cur_away}, names=['Year', 'Team'])
    away_counts = pd.concat([away_counts,cur_away])


# Add W-L data for each team
path_wl = '' # Path to directory with Win/Loss team data
wl_records = os.listdir(path_wl)

home_wl = pd.DataFrame()
away_wl = pd.DataFrame()

for file in wl_records:
    year = file[-4:]
    cur_year_records = pd.read_csv(path_wl+file, index_col=[0])
    
    cur_year_home = pd.concat({year:(cur_year_records['Home_Win']/(cur_year_records['Home_Loss']+cur_year_records['Home_Win']))},
                          names=['Year', 'Team'])
    cur_year_away = pd.concat({year:(cur_year_records['Away_Win']/(cur_year_records['Away_Loss']+cur_year_records['Away_Win']))},
                          names=['Year', 'Team'])
    
    home_wl = pd.concat([home_wl, cur_year_home])
    away_wl = pd.concat([away_wl, cur_year_away])
    
home_counts_wl = pd.concat([home_counts, home_wl], axis=1).rename(columns={0:'W%'})
away_counts_wl = pd.concat([away_counts, away_wl], axis=1).rename(columns={0:'W%'})


# Add a defensive stat in Runs Against
path_ra = '' # Path to the directory with the teams' runs for and against data
runs_records = os.listdir(path_ra)

home_ra = pd.DataFrame()
away_ra = pd.DataFrame()

for file in runs_records:
    year = file[-4:]
    cur_year_records = pd.read_csv(path_ra+file, index_col=[0])
    
    cur_year_home = pd.concat({year:(cur_year_records['Home Runs Against'])}, names=['Year', 'Team'])
    cur_year_away = pd.concat({year:(cur_year_records['Away Runs Against'])}, names=['Year', 'Team'])
    home_ra = pd.concat([home_ra, cur_year_home])
    away_ra = pd.concat([away_ra, cur_year_away])
    
home_counts_ra = pd.concat([home_counts_wl, home_ra], axis=1).rename(columns={0:'RA'})
away_counts_ra = pd.concat([away_counts_wl, away_ra], axis=1).rename(columns={0:'RA'})

In [2]:
s = np.array(['0','1','2']) # Strikes
b = np.array(['0','1','2','3']) # Balls
counts = [i+j for j in s for i in b] # All possible counts

# Standardize pitch count data.
# Do not transform the win percentage of each team since it is the dependent variable, which will not be used in the PCA

def StandPCA(df, cols):
    pc_col_names = ['pc'+str(i+1) for i in range(len(cols))] # List of names pc1, pc2,...
    
    df_standardized = df.copy()
    scalar = StandardScaler().fit(df[cols])
    df_standardized[cols] = scalar.transform(df[cols]) # Standardize data
    
    # Perform a principal component analysis on all columns in consideration.
    # Save the explained variance ratio in a separate DataFrame to return for evaluation
    pca = PCA(n_components=len(cols)).fit(df_standardized[cols])
    expl_var_df = pd.DataFrame(pca.explained_variance_ratio_, columns=['EVR'], index=pc_col_names)
    expl_var_df['Cumulative'] = expl_var_df.cumsum()

    # Transform the columns considered into Principal Components
    # Columns not used in the PCA are merged after the principal components.
    df_pc = pd.DataFrame(pca.transform(df_standardized[cols]),columns=pc_col_names,index=df_standardized.index)
    df_pc = pd.concat([df_pc, df_standardized.drop(cols,axis=1)],axis=1)
    
    return [df_pc, pca,scalar]


# Standardize and PCA both home and away data
home_pc_ra, home_pca, home_standardize = StandPCA(home_counts_ra, counts)
away_pc_ra, away_pca, away_standardize = StandPCA(away_counts_ra, counts)

home_model = home_pc_ra.copy()
away_model = away_pc_ra.copy()

home_model['RAxW%'] = home_pc_ra['RA']*home_pc_ra['W%']
away_model['RAxW%'] = away_pc_ra['RA']*away_pc_ra['W%']

In [3]:
def LinReg(data, pcs_model, outlier_stat):
    # Remobe particularly large outliers in one particular stat
    X = data[pcs_model][np.abs((data[outlier_stat]-data[outlier_stat].mean())/data[outlier_stat].std())<=2]
    y = data['RAxW%'][np.abs((data[outlier_stat]-data[outlier_stat].mean())/data[outlier_stat].std())<=2]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    lin_regr = LinearRegression().fit(X_train, y_train)
    
    print('Trained R^2:',lin_regr.score(X_train,y_train))
    print('Test R^2:', lin_regr.score(X_test,y_test))

    return lin_regr


num_pc_kept = 5
# Set up linear regression, again keeping only the first few principal components and Runs Against
pcs = ['pc'+str(i+1) for i in range(num_pc_kept)]
pcs_model = pcs
pcs_model.append('RA')
home_lin_regr = LinReg(home_model, pcs_model, 'W%')
away_lin_regr = LinReg(home_model, pcs_model, 'W%')

Trained R^2: 0.43266162168193434
Test R^2: 0.4812555588404419
Trained R^2: 0.4539807619474555
Test R^2: 0.39574161583808176


### Constructing the Transformation Matrix

For now, we will randomly choose a team within the data and calculate the resulting change in pitch counts to maximize the improvement in wins.  To begin, we calculate the transformation matrix for the particular team chosen.

As a check, we see the matrix has the correct form and each column correctly sums to zero.

In [4]:
team_pitch_counts = home_counts.sample() # Select a Team and save its data
team_pc_array = team_pitch_counts.to_numpy().reshape((3,4)).T # Reshape the pitch count array to the form [b,s]

# Calculate the denominator in the transformation matrix for each column
denominator = np.zeros((4,3))
for b in range(3,-1,-1):
    for s in range(2,-1,-1):
        prob_sum = -team_pc_array[b,s]
        for i in range(b,4):
            for j in range(s,3):
                prob_sum += team_pc_array[i,j]
        denominator[b,s] = prob_sum

# Calculate the entire transformation matrix
transformation = np.zeros((12,12))
for i in range(12):
    transformation[i,i] = -1 # Diagonal entries take the form -1
    for j in range(i+1,12): # Lower triangular matrix
        # The only non-zero entries are when the balls (i%4) and strikes (i//4) are larger than the current row's values
        # When both conditions are satisfied, set appropriate fraction.
        transformation[j,i]=(i%4 <= j%4)*(i//4 <= j//4)*team_pc_array[j%4,j//4]/denominator[i%4,i//4]
# No direct modification can be made to the 3-2 pitch count.
# If you reach this point in the at-bat, you are either reaching base or getting out.
transformation[11,11] = 0
        
transform_df = pd.DataFrame(transformation, columns = team_pitch_counts.columns, index = team_pitch_counts.columns)
# display(transform_df)
# transform_df.sum()

## x is equal for each possible pitch count

First, we consider shifting (one at a time) each possible pitch count by some percentage x of the total. As an example, x = 0.01 is a 1% shift of the total to any one particular pitch count.  Written explicitly, this type of modificaition has shifted pitch counts of the form:
\begin{equation}
\text{Count}'(i,j) = \text{Count}(i,j) + \left(x\sum_{i,j}\text{Count}(i,j)\right)T\cdot\vec{B},
\end{equation}
for transformation matrix $T$ and a basis vector $\vec{B}$, and x is a percentage of the total pitch counts to be modified into a single pitch count.

Selecting $\vec{x}$ as a basis element (only 1 non-zero element) in the original space of pitch counts, we are modifying the team's strategy at a single pitch count.  By transposing the entire transformation given at the start of the notebook, we can collect the rows of $P'$s into a single dataframe to feed into the model and extract each modified strategy's predicted wins.

Note, despite no modifications being possible to the 3-2 pitch count, this strategy is still kept to compare the other strategies.

In [5]:
# Apply all transformations and save to DataFrame.  Each row represents a different strategy.
x = 0.01
transform_value = team_pitch_counts.to_numpy() - transform_df.T*np.sum(team_pitch_counts.values)*x
all_mod_strats = pd.DataFrame(home_pca.transform(home_standardize.transform(transform_value)), columns = ['pc'+str(i+1) for i in range(12)])

# Add the 'Runs Against' data to each strategy.  These values are not changed by the change in hitting.
all_mod_strats['RA'] = home_model.loc[(team_pitch_counts.index)]['RA'].values[0]

# Predict how many wins each strategy will get. Note, the raw output is RAxW%, so we divide again by RA.
predicted_wins = home_lin_regr.predict(all_mod_strats[pcs_model])/all_mod_strats['RA']
predicted_wins.index = transform_value.index

### Being more aggressive or patient?

The original transformation considered only more patient strategies (decreasing a particular pitch count value).  However, all transformations and models are linear, so more aggressive strategies can be found by simply flipping the sign of the difference in wins between a particular strategy and the final unmodified strategy.

In [6]:
max_change_index = np.argmax(np.abs(predicted_wins - predicted_wins[-1]))
win_change = np.max(np.abs(predicted_wins - predicted_wins[-1]))*162
is_patient = ((predicted_wins - predicted_wins[-1])[max_change_index] > 0)
patient_or_aggressive = 'Patient'*int(is_patient) + 'Aggressive'*(1-int(is_patient))


print('Team: ' + ' '.join(team_pitch_counts.index[0]))
print('Change on Pitch Count: '+ predicted_wins.index[max_change_index])
print(x,patient_or_aggressive)
print('Actual Home Wins: {:.0f}'.format(81*home_model.loc[(team_pitch_counts.index)]['W%'].values[0]))
print('Original Predicted Home Wins: {:.2f}'.format(81*predicted_wins[-1]))
print('Improved Predicted  Home Wins: {:.2f}'.format(81*predicted_wins[-1]+win_change))
print('Improvement: {:.2f}'.format(win_change))

Team: 2004 PIT
Change on Pitch Count: 20
0.01 Patient
Actual Home Wins: 39
Original Predicted Home Wins: 39.66
Improved Predicted  Home Wins: 43.80
Improvement: 4.14


## x is a percentage of each pitch count

One key problem with the previous metric is that the practical implementation is somewhat convoluted.  Telling a team that 1% of their total at bats should be shifted to how they approach a single type of pitch count is realistically difficult to implement.

Instead, a better metric to use is what percentage of a particular type of at-bat should batters be more aggressive or more patient.  As a result, the value of x for each modified strategy will be proportional to how often that particular pitch count is met by the team's batters.  As a result, the modified pitch counts now take the form:
\begin{equation}
\text{Count}'(i,j) = \text{Count}(i,j) + x \text{Count}(i,j)T\cdot\vec{B}.
\end{equation}
The quantity x still represents the percentage of how much change, but now it is only the percentage of a particular pitch count rather than all at-bats.

In [7]:
# Apply all transformations and save to DataFrame.  Each row represents a different strategy.
x = 0.01
transform_value = team_pitch_counts.to_numpy() - transform_df.T.multiply(team_pitch_counts.to_numpy(),axis=0)*x
all_mod_strats = pd.DataFrame(home_pca.transform(home_standardize.transform(transform_value)), columns = ['pc'+str(i+1) for i in range(12)])

# Add the 'Runs Against' data to each strategy.  These values are not changed by the change in hitting.
all_mod_strats['RA'] = home_model.loc[(team_pitch_counts.index)]['RA'].values[0]

# Predict how many wins each strategy will get. Note, the raw output is RAxW%, so we divide again by RA.
predicted_wins = home_lin_regr.predict(all_mod_strats[pcs_model])/all_mod_strats['RA']
predicted_wins.index = transform_value.index


max_change_index = np.argmax(np.abs(predicted_wins - predicted_wins[-1]))
win_change = np.max(np.abs(predicted_wins - predicted_wins[-1]))*162
is_patient = ((predicted_wins - predicted_wins[-1])[max_change_index] > 0)
patient_or_aggressive = 'Patient'*int(is_patient) + 'Aggressive'*(1-int(is_patient))


print('Team: ' + ' '.join(team_pitch_counts.index[0]))
print('Change on Pitch Count: '+ predicted_wins.index[max_change_index])
print(x,patient_or_aggressive)
print('Actual Home Wins: {:.0f}'.format(81*home_model.loc[(team_pitch_counts.index)]['W%'].values[0]))
print('Original Predicted Home Wins: {:.2f}'.format(81*predicted_wins[-1]))
print('Improved Predicted  Home Wins: {:.2f}'.format(81*predicted_wins[-1]+win_change))
print('Improvement: {:.2f}'.format(win_change))

Team: 2004 PIT
Change on Pitch Count: 00
0.01 Aggressive
Actual Home Wins: 39
Original Predicted Home Wins: 39.66
Improved Predicted  Home Wins: 39.87
Improvement: 0.21


### Final Words

For the same percentage change, the second method will produce a much smaller improvement, simply because fewer at-bats are being adjusted.  In both cases however, the results are unsurprising.  Generally speaking, teams are told to be more patient on hitter's counts and more aggressive on pitcher's counts.  

There is relatively little difference in the advice being given out.  Similar to the conclusions drawn from the data exploration and modeling Notebook, this likely has to do with the fact that we are not considering how well different teams are taking advantage or falling behind in different scenarios.  Some teams may do very well in certain pitch counts that would push the team to wanting to reach those pitch counts.