# SCORE Sports Data Repository Questions
## Justin Verlander Pitches

### Motivation

After missing nearly two full seasons due to injury, Justin Verlander returned for the 2022 season at the age of 39 to win the American League Cy Young Award and the World Series with the Houston Astros. Leading the league in a variety of statistics, Verlander dominated in his starts throughout the season.

Pitch selection has played a key role in Verlander’s recent success with the Astros. Verlander throws four types of pitches (using MLB’s abbreviations): fastball (FF), slider (SL), curveball (CU), and changeup (CH).

However, pitches are thrown in the context of an at-bat where the ball-strike count starts 0-0, and progresses until either the batter strikes out (reaches three strikes), is walked (reaches four balls), or is either hit-by-pitch or hits the ball in-play.

As the count varies, pitchers often favor certain pitches over others, e.g., with three balls (i.e., 3-X counts) the pitcher may favor more accurate fastballs, whereas out-of-the-zone offspeed pitches are favored with two strikes (i.e., X-2 counts).

### Questions

1. Using the appropriate statistical test, assess whether or not Justin Verlander’s pitch_type is independent of the count. Perform this analysis for both the 2019 and 2022 datasets. Are the conclusions similar or different?

2. Which combinations of pitch_type and count appear more than expected under the assumption of independence? Which combinations appear fewer than expected? Perform this analysis for both the 2019 and 2022 datasets. Are the conclusions similar or different?

### Organizing Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

In [2]:
df_19 = pd.read_csv('verlander-pitches-2019.csv')
df_22 = pd.read_csv('verlander-pitches-2022.csv')

In [64]:
df_19['pitch_type'].value_counts(normalize = True)

FF    0.498567
SL    0.289919
CU    0.169575
CH    0.041938
Name: pitch_type, dtype: float64

In [65]:
df_22['pitch_type'].value_counts(normalize = True)

FF    0.507563
SL    0.282353
CU    0.188235
CH    0.021849
Name: pitch_type, dtype: float64

Verlander's pitching repertoire consists of a Fastball (FF), Slider (SL), Curveball (CU) and Change-up (CH).

The overall proportions are similar across the two years, but he used his change-up less frequently in 2022 than in 2019 (about 2% of pitches versus about 4%), with most of that share going to his curveball.
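
To make the year-over-year shift easier to read, the two outputs above can be placed side by side. This is only a small sketch: the proportions are copied from the `value_counts` output, while the names `usage` and `change_pp` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Proportions copied from the value_counts(normalize=True) output above.
share_19 = pd.Series({"FF": 0.498567, "SL": 0.289919, "CU": 0.169575, "CH": 0.041938})
share_22 = pd.Series({"FF": 0.507563, "SL": 0.282353, "CU": 0.188235, "CH": 0.021849})

# Usage in percent for each season, plus the change in percentage points.
usage = pd.DataFrame({"2019": share_19, "2022": share_22}) * 100
usage["change_pp"] = usage["2022"] - usage["2019"]

print(usage.round(2))
```

In this table the change-up falls by roughly 2 percentage points and the curveball gains a little under 2, which matches the reading above.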

### Question 1: Using the appropriate statistical test, assess whether or not Justin Verlander’s `pitch_type` is independent of the `count`. Perform this analysis for both the 2019 and 2022 datasets. Are the conclusions similar or different?

The appropriate statistical test for independence between two categorical variables, here `pitch_type` and `count`, is the chi-squared test of independence.

A chi-squared test uses a contingency table to compare the actual number of occurrences of each combination with the number expected if the variables were independent.  The null hypothesis in our case is that there is no relationship between `pitch_type` and `count`.

I will use a crosstab to set up the contingency table and then run the chi-squared test.  The main result I am looking for is the p-value of the test: anything below 0.05 is considered statistically significant, which allows us to reject the null hypothesis.
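
For readers who want to see where the expected counts come from, here is a minimal sketch on a small made-up table (the numbers are hypothetical and are not Verlander's data). It checks that the expected table returned by `chi2_contingency` equals row total times column total divided by the grand total, and that the statistic is the usual sum of squared deviations.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of counts (rows: pitch types, columns: counts).
observed = np.array([[30, 10, 20],
                     [10, 20, 10]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Under independence: expected = row total * column total / grand total.
expected_manual = row_totals * col_totals / n

chi2, p_value, dof, expected = chi2_contingency(observed)

# SciPy's expected table matches the manual formula.
assert np.allclose(expected, expected_manual)

# Degrees of freedom are (rows - 1) * (columns - 1).
assert dof == (observed.shape[0] - 1) * (observed.shape[1] - 1)

# The statistic is the sum of (observed - expected)^2 / expected.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
assert np.isclose(chi2, chi2_manual)
```

SciPy applies a continuity correction only when the table has one degree of freedom, so the manual statistic matches for this table and for the larger pitch-by-count tables used here.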

In [59]:
ct_19 = pd.crosstab(df_19['pitch_type'], df_19['count'])
ct_22 = pd.crosstab(df_22['pitch_type'], df_22['count'])

In [60]:
p_19 = chi2_contingency(ct_19)[1]
p_22 = chi2_contingency(ct_22)[1]

print(f'The p-value for the 2019 season is {p_19}')
print(f'The p-value for the 2022 season is {p_22}')

The p-value for the 2019 season is 8.367149152143298e-74
The p-value for the 2022 season is 7.994849407390398e-35


The p-values for both seasons are well below 0.05, so the null hypothesis can be rejected in both cases; there is a statistical relationship between `pitch_type` and `count`.

However, the p-value for the 2022 season is many orders of magnitude larger than the one for the 2019 season.  This is consistent with a weaker (though still significant) relationship between `pitch_type` and `count` in 2022 than in 2019, although a p-value also depends on the number of pitches in each dataset, so the comparison is suggestive rather than conclusive.
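
Because a p-value reflects sample size as well as strength of association, one hedged way to compare the two seasons is an effect size such as Cramér's V. The sketch below is not part of the original analysis: `cramers_v` is a hypothetical helper (without bias correction), and the two tables are made-up data with identical proportions but different sample sizes.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V for a contingency table (no bias correction)."""
    table = np.asarray(table)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    min_dim = min(table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))

# Hypothetical tables: same proportions, ten times the sample size.
small = np.array([[30, 10, 20],
                  [10, 20, 10]])
large = small * 10

p_small = chi2_contingency(small)[1]
p_large = chi2_contingency(large)[1]

# The p-value shrinks with more data, while Cramér's V stays the same.
print(p_small, p_large)
print(cramers_v(small), cramers_v(large))
```

Applied to this notebook, `cramers_v(ct_19)` and `cramers_v(ct_22)` would give a comparison of association strength that does not depend on how many pitches were thrown in each season.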

### 2a. For 2019, which combinations of pitch_type and count appear more than expected under the assumption of independence? Which combinations appear fewer than expected?

To judge which `pitch_type` appeared more or less frequently than expected in each count, I will rearrange the data so that we can see the actual and expected percentages for each `pitch_type` within each `count`, along with the difference between them.

In [31]:
def expand_ct(ct):
    """Compare actual and expected pitch usage within each count.

    `ct` is a crosstab with pitch types as rows and 'balls-strikes' labels as columns.
    """
    # Expected counts under independence, labelled like the observed table.
    calc_df = pd.DataFrame(chi2_contingency(ct)[3], columns = ct.columns, index = ct.index)

    balls = []
    strikes = []
    pitch_type = []
    actual = []
    expected = []

    for col in ct.columns:
        for i in ct.index:
            # Column labels look like '3-2': first character is balls, third is strikes.
            balls.append(int(col[0]))
            strikes.append(int(col[2]))
            pitch_type.append(i)
            actual.append(int(ct.loc[i, col]))
            # Note: int() truncates the expected count to a whole number,
            # so the expected percentages below are approximate.
            expected.append(int(round(calc_df.loc[i, col], 2)))

    df = pd.DataFrame([balls, strikes, pitch_type, actual, expected],
                      index = ['balls', 'strikes', 'pitch_type', 'actual', 'expected']).T

    # Convert counts to percentages of all pitches thrown in the same ball-strike count.
    for ball in df['balls'].unique():
        for strike in df['strikes'].unique():
            in_count = (df['balls'] == ball) & (df['strikes'] == strike)
            for pitch in df['pitch_type'].unique():
                cell = in_count & (df['pitch_type'] == pitch)
                for kind in ['actual', 'expected']:
                    df.loc[cell, kind + '%'] = df.loc[cell, kind] / df.loc[in_count, kind].sum() * 100

    # Positive values mean the pitch was thrown more often than expected in that count.
    df['difference'] = df['actual%'] - df['expected%']

    return df
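
As a hedged alternative, the same comparison can be sketched without explicit loops by normalizing each column of the crosstab. The helper name `actual_vs_expected_pct` and the toy counts are hypothetical; this version also keeps the expected counts unrounded, so its numbers can differ slightly from the function above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def actual_vs_expected_pct(ct):
    """Long table of actual vs expected pitch share (%) within each count."""
    expected = pd.DataFrame(chi2_contingency(ct)[3], index=ct.index, columns=ct.columns)
    # Normalize each column (count) so its pitch types sum to 100%.
    actual_pct = ct.div(ct.sum(axis=0), axis=1) * 100
    expected_pct = expected.div(expected.sum(axis=0), axis=1) * 100
    # unstack() on a flat-indexed DataFrame gives a Series keyed by (column, row).
    out = pd.concat(
        {"actual%": actual_pct.unstack(), "expected%": expected_pct.unstack()}, axis=1
    )
    out["difference"] = out["actual%"] - out["expected%"]
    return out.reset_index()

# Hypothetical counts, only to show the shape of the result.
toy = pd.DataFrame(
    {"0-0": [60, 30, 10], "0-2": [20, 50, 30]},
    index=pd.Index(["FF", "SL", "CU"], name="pitch_type"),
)
toy.columns.name = "count"

result = actual_vs_expected_pct(toy)
print(result)
```

Without rounding, the expected share of a pitch type is the same in every count and equals its overall usage share, so `difference` is simply usage within a count minus overall usage. Calling `actual_vs_expected_pct(ct_19)` should work on the crosstabs built earlier, since `pd.crosstab` names the index `pitch_type` and the columns `count`.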

In [32]:
full_19 = expand_ct(ct_19)

In [61]:
full_19.sort_values('difference', ascending = False).head(10)

Unnamed: 0,balls,strikes,pitch_type,actual,expected,actual%,expected%,difference
38,3,0,FF,20,9,100.0,52.941176,47.058824
42,3,1,FF,50,26,92.592593,50.0,42.592593
26,2,0,FF,83,51,79.807692,50.0,29.807692
11,0,2,SL,158,104,43.888889,28.969359,14.91953
23,1,2,SL,199,131,43.736264,28.918322,14.817941
2,0,0,FF,592,470,62.711864,49.893843,12.818022
14,1,0,FF,203,164,61.515152,50.152905,11.362246
30,2,1,FF,103,83,61.309524,50.0,11.309524
35,2,2,SL,136,99,39.766082,29.117647,10.648435
5,0,1,CU,132,90,24.719101,16.917293,7.801808


These are the Top 10 `pitch_type`/`count` combinations that happened **more** than expected.

Three trends are apparent here:
1. Verlander pitched more fastballs when the batter had the advantage (3-0, 3-1, 2-0 counts)
2. Verlander pitched more sliders with 2 strikes.
3. Verlander pitched more fastballs early in the count (0-0, 1-0)

From a narrative standpoint, it would seem that Verlander wanted to establish his fastball early in the count, continue to use his fastball if he got in trouble, but utilize his slider as his 'out' pitch.

In [75]:
full_19.sort_values('difference').head(10)

Unnamed: 0,balls,strikes,pitch_type,actual,expected,actual%,expected%,difference
39,3,0,SL,0,5,0.0,29.411765,-29.411765
10,0,2,FF,102,179,28.333333,49.860724,-21.527391
43,3,1,SL,4,15,7.407407,28.846154,-21.438746
22,1,2,FF,144,226,31.648352,49.889625,-18.241273
37,3,0,CU,0,3,0.0,17.647059,-17.647059
41,3,1,CU,0,9,0.0,17.307692,-17.307692
25,2,0,CU,1,17,0.961538,16.666667,-15.705128
3,0,0,SL,155,273,16.419492,28.980892,-12.5614
34,2,2,FF,131,170,38.304094,50.0,-11.695906
29,2,1,CU,9,28,5.357143,16.86747,-11.510327


These are the Top 10 `pitch_type`/`count` combinations that happened **less** than expected.

There are 3 noticeable trends here as well:
1. Verlander does not like to throw a Curveball or Slider when the batter has the advantage
2. With the Slider being his 'out' pitch, he throws Fastballs with two strikes less than expected
3. Verlander throws significantly fewer Sliders to start an at bat than expected.

These align with the narrative mentioned above, using the Fastball early and the Slider as an out pitch.  It's interesting to see that he does not throw Curveballs in dangerous counts either.

As well, his Change-up does not appear in either list.  Considering he only throws the pitch 4% of the time, the small sample size may account for it.  However, since it would be a more 'surprising' pitch to throw compared to the others, it is interesting that he seems to throw it in-line with expectations across all counts.
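
Since small samples make percentage differences hard to judge for a rarely thrown pitch, one option is to look at Pearson residuals, which scale each deviation by the square root of the expected count. The sketch below uses a hypothetical helper, `pearson_residuals`, and made-up counts; it is an illustration under those assumptions, not a result from the Verlander data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def pearson_residuals(ct):
    """(observed - expected) / sqrt(expected) for each cell of a crosstab."""
    expected = chi2_contingency(ct)[3]
    return (ct - expected) / np.sqrt(expected)

# Hypothetical counts: a common pitch (FF) next to a rarely thrown one (CH).
toy = pd.DataFrame(
    {"0-0": [300, 10], "3-2": [80, 10]},
    index=pd.Index(["FF", "CH"], name="pitch_type"),
)

residuals = pearson_residuals(toy)
print(residuals.round(2))
```

A common rule of thumb treats residuals with an absolute value above about 2 as noteworthy. Running `pearson_residuals(ct_19)` would show whether any change-up cells stand out once their small expected counts are taken into account.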

### 2b. For 2022, which combinations of pitch_type and count appear more than expected under the assumption of independence? Which combinations appear fewer than expected?

In [67]:
full_22 = expand_ct(ct_22)

In [69]:
full_22.sort_values('difference', ascending = False).head(10)

Unnamed: 0,balls,strikes,pitch_type,actual,expected,actual%,expected%,difference
38,3,0,FF,19,10,90.47619,55.555556,34.920635
26,2,0,FF,72,43,83.72093,51.190476,32.530454
42,3,1,FF,35,24,71.428571,51.06383,20.364742
23,1,2,SL,163,103,44.657534,28.374656,16.282879
11,0,2,SL,103,69,41.869919,28.278689,13.59123
46,3,2,FF,68,57,60.176991,51.351351,8.82564
14,1,0,FF,171,145,59.79021,51.056338,8.733872
17,1,1,CU,73,54,25.172414,18.75,6.422414
2,0,0,FF,429,385,56.447368,50.791557,5.655812
35,2,2,SL,83,70,33.2,28.225806,4.974194


These differences align closely with the 2019 narrative:  Verlander appears to use the fastball to establish the count and when he gets in trouble, and utilizes the slider as an out pitch.

However, the differences in the percentages are a bit less stark than they were in 2019.  In 2019, 9 `pitch_type`/`count` combinations had a difference of more than 10 percentage points between `actual%` and `expected%`, whereas only 5 do in 2022 (and the top 3 are noticeably lower too).  This lends credence to the result of the chi-squared test that the `count` had a weaker relationship to `pitch_type` in the 2022 season than it did in the 2019 season.
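
The tally of combinations above the 10-point mark can be reproduced directly from the two top-10 tables. The values below are copied from those outputs; the variable names and the `threshold` constant are illustrative.

```python
import pandas as pd

# Largest positive differences (percentage points), copied from the top-10 tables above.
top_19 = pd.Series([47.058824, 42.592593, 29.807692, 14.91953, 14.817941,
                    12.818022, 11.362246, 11.309524, 10.648435, 7.801808])
top_22 = pd.Series([34.920635, 32.530454, 20.364742, 16.282879, 13.59123,
                    8.82564, 8.733872, 6.422414, 5.655812, 4.974194])

threshold = 10  # percentage points

# Number of combinations thrown more than `threshold` points above expectation.
print((top_19 > threshold).sum(), (top_22 > threshold).sum())
```

On the full tables, the same count could be taken with `(full_19['difference'] > 10).sum()` and `(full_22['difference'] > 10).sum()`.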

In [73]:
full_22.sort_values('difference').head(10)

Unnamed: 0,balls,strikes,pitch_type,actual,expected,actual%,expected%,difference
22,1,2,FF,112,185,30.684932,50.964187,-20.279256
39,3,0,SL,2,5,9.52381,27.777778,-18.253968
25,2,0,CU,2,16,2.325581,19.047619,-16.722038
37,3,0,CU,0,3,0.0,16.666667,-16.666667
27,2,0,SL,11,24,12.790698,28.571429,-15.780731
10,0,2,FF,87,124,35.365854,50.819672,-15.453818
41,3,1,CU,2,9,4.081633,19.148936,-15.067304
45,3,2,CU,5,21,4.424779,18.918919,-14.49414
3,0,0,SL,147,214,19.342105,28.23219,-8.890085
29,2,1,CU,13,23,10.569106,19.008264,-8.439159


This again supports the initial narrative, and also shows that the relationship between `pitch_type` and `count` is weaker in 2022.  Here it is obvious that Verlander does not like to throw the Curveball or Slider when he is in trouble.

However, in 2019 Verlander threw a Slider 21 percentage points less than expected on a 3-1 count, good for the 3rd greatest difference.  In 2022, that combination did not even make the Top 10, suggesting that Verlander was more willing to throw his 'off-speed' pitches during dangerous counts in 2022.

### 2c. Are the conclusions similar or different between the 2019 and 2022 seasons?

Regardless of direction, I would like to see which of these combinations changed the most between 2019 and 2022.

In [51]:
full_df = pd.merge(full_19, full_22, on=['balls', 'strikes', 'pitch_type'], suffixes=('_19', '_22'))\
[['balls', 'strikes', 'pitch_type', 'difference_19', 'difference_22']]

In [55]:
full_df['percent_change'] = abs(full_df['difference_19'] - full_df['difference_22'])

In [71]:
full_df.sort_values('percent_change', ascending = False).head(10)

Unnamed: 0,balls,strikes,pitch_type,difference_19,difference_22,percent_change
42,3,1,FF,42.592593,20.364742,22.227851
43,3,1,SL,-21.438746,-5.210595,16.228152
38,3,0,FF,47.058824,34.920635,12.138189
39,3,0,SL,-29.411765,-18.253968,11.157796
34,2,2,FF,-11.695906,-4.006452,7.689455
44,3,2,CH,-4.117647,3.507933,7.62558
6,0,1,FF,-6.928839,0.381499,7.310338
30,2,1,FF,11.309524,4.044883,7.26464
2,0,0,FF,12.818022,5.655812,7.16221
31,2,1,SL,-2.129948,4.421152,6.5511


The biggest difference between the seasons is that Verlander threw the Fastball less often at the start of the at-bat (0-0) and when he was behind in the count (3-1, 3-0).  He seemed to shift this usage towards his Slider, which moved closer to the expected values in the chi-squared contingency table during dangerous counts (3-0 and 3-1).
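
Because `percent_change` is an absolute value, it hides the direction of each shift. The sketch below, using rows copied from the comparison table above, adds a signed change and a flag for whether usage moved toward the expected share. The column names `signed_change` and `closer_to_expected` are hypothetical additions.

```python
import pandas as pd

# Rows copied from the comparison table above (differences in percentage points).
changes = pd.DataFrame(
    [
        ("3-1", "FF", 42.592593, 20.364742),
        ("3-1", "SL", -21.438746, -5.210595),
        ("3-0", "FF", 47.058824, 34.920635),
        ("3-0", "SL", -29.411765, -18.253968),
        ("0-0", "FF", 12.818022, 5.655812),
    ],
    columns=["count", "pitch_type", "difference_19", "difference_22"],
)

# The signed change keeps the direction that abs() discards.
changes["signed_change"] = changes["difference_22"] - changes["difference_19"]

# A smaller absolute difference in 2022 means usage moved toward the expected share.
changes["closer_to_expected"] = (
    changes["difference_22"].abs() < changes["difference_19"].abs()
)

print(changes)
```

For these rows the Fastball changes are negative and the Slider changes are positive, and every combination moved closer to its expected share in 2022.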

Two interesting notes on these results:
1. The frequency of his Fastball on a 0-1 count increased, going against the 2022 trend of throwing fewer fastballs at the beginning of the at-bat.  It seems that in 2019 Verlander liked to establish his fastball early and then follow it with the slider, whereas in 2022 he was more willing to start with the Slider and follow it with a Fastball on a 0-1 count.

2. Verlander's Change-up had a big swing on a 3-2 count.  The sample size is small here (increasing the total number of change-ups thrown on that count from 0 to 6) but it highlights Verlander's willingness to go to his off-speed pitches more frequently in more dangerous situations in 2022.  This would seem to indicate that Verlander may have grown more confident to use the change-up as a secondary 'out' pitch (behind his slider) for the 2022 season.