# MLB Hit Predictor

This program uses data from the 2021 MLB season to predict whether a batted ball in play will be an out or a hit, based on the exit velocity and launch angle off the bat. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Analysing our Data

Let's open our data set and check it out.

In [2]:
df = pd.read_csv('/Users/zacrossman/Downloads/Statcast_2021.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,...,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,spin_axis,delta_home_win_exp,delta_run_exp
0,2926,FF,2021-10-03,92.3,1.4,6.8,"Smith, Will",596019,519293,field_out,...,5,0,5,0,5,Infield shift,Strategic,148.0,0.0,-0.073
1,3048,SL,2021-10-03,80.6,1.6,6.64,"Smith, Will",596019,519293,,...,5,0,5,0,5,Infield shift,Strategic,315.0,0.0,-0.027
2,3294,CU,2021-10-03,75.5,1.46,6.88,"Smith, Will",596019,519293,,...,5,0,5,0,5,Infield shift,Standard,328.0,0.0,-0.02
3,3447,CU,2021-10-03,75.0,1.53,6.83,"Smith, Will",596019,519293,,...,5,0,5,0,5,Standard,Standard,330.0,0.0,0.016
4,3545,FF,2021-10-03,91.2,1.49,6.66,"Smith, Will",607043,519293,field_out,...,5,0,5,0,5,Standard,Standard,143.0,0.002,-0.189


This data set contains a lot of columns, but we only care about the launch speed, the launch angle, and the outcome of the pitch. So let's grab those columns specifically.

In [3]:
df = df[['events', 'launch_speed', 'launch_angle']]
df.head()

Unnamed: 0,events,launch_speed,launch_angle
0,field_out,95.2,-13.0
1,,75.3,75.0
2,,83.5,65.0
3,,,
4,field_out,93.3,-18.0


Just looking at our first five rows, we can see that not every pitch results in an outcome. These pitches are usually either taken strikes, taken balls, or foul balls that are not caught. Let's go ahead and delete those rows.

In [4]:
df = df.dropna(axis = 0)
df.head()

Unnamed: 0,events,launch_speed,launch_angle
0,field_out,95.2,-13.0
4,field_out,93.3,-18.0
9,field_out,75.3,65.0
18,field_out,96.8,18.0
27,field_out,91.8,29.0
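
Calling `dropna(inplace=True)` on a frame that was itself selected from another frame can trigger pandas' `SettingWithCopyWarning`. A minimal sketch of the same filtering step that assigns the result to a new name instead, using a tiny made-up frame in place of the Statcast file (the `toy` values are illustrative, not taken from the data set):

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the three Statcast columns used here.
toy = pd.DataFrame({
    "events": ["field_out", np.nan, "single", "strikeout"],
    "launch_speed": [95.2, 75.3, 101.4, np.nan],
    "launch_angle": [-13.0, 75.0, 12.0, np.nan],
})

# Keep only rows where all three columns are present.
# dropna returns a new frame, so nothing is modified in place.
batted = toy.dropna(subset=["events", "launch_speed", "launch_angle"])
print(batted)
```

Only the rows with an event and both launch measurements survive; the original row labels are kept, which is why the index in the output above skips numbers.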


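Now our data consists only of batted balls that result in an outcome, either a hit or an out. Because dropna removes any row with a missing value, it also removes outcomes that have no launch data, such as strikeouts and walks. Let's see all of the different outcomes in our data.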

In [5]:
unique_plays = df['events'].unique()
print('Unique Play Results')
print(unique_plays)

Unique Play Results
['field_out' 'grounded_into_double_play' 'single' 'sac_bunt' 'double'
 'home_run' 'force_out' 'triple' 'field_error' 'sac_fly' 'double_play'
 'fielders_choice' 'fielders_choice_out' 'sac_fly_double_play'
 'catcher_interf' 'sac_bunt_double_play' 'triple_play' 'game_advisory']


That's a lot of unique outcomes. The events 'field_out', 'grounded_into_double_play', 'force_out', 'field_error', 'fielders_choice_out', 'double_play', and 'triple_play' all count against one's batting average, so we will label them as outs (a 'field_error' is not technically an out, but it is scored as a hitless at-bat). The events 'single', 'double', 'triple', and 'home_run' are hits, and count for one's batting average. We also label 'fielders_choice' as a hit here, although official scoring counts it as a hitless at-bat. Let's convert the hits into 1's and the outs into 0's.

In [6]:
# Map each event that counts as an at-bat to 0 (out) or 1 (hit).
# Assigning the result back avoids chained in-place replacement on a column.
outs = ['field_out', 'grounded_into_double_play', 'force_out', 'field_error',
        'fielders_choice_out', 'double_play', 'triple_play']
hits = ['single', 'double', 'triple', 'home_run', 'fielders_choice']
label_map = {**{event: 0 for event in outs}, **{event: 1 for event in hits}}
df['events'] = df['events'].replace(label_map)
df.head()

Unnamed: 0,events,launch_speed,launch_angle
0,0,95.2,-13.0
4,0,93.3,-18.0
9,0,75.3,65.0
18,0,96.8,18.0
27,0,91.8,29.0


Let's take a look at our remaining unique events.

In [7]:
results = df['events'].value_counts()
print('Unique Event Totals')
print(results)

Unique Event Totals
0                       79612
1                       39802
sac_fly                  1126
sac_bunt                  756
sac_fly_double_play        17
catcher_interf              3
sac_bunt_double_play        2
game_advisory               1
Name: events, dtype: int64


The remaining events here don't hurt or help a player's batting average, so let's just leave those rows out.

In [8]:
df = df.loc[(df['events'] == 1) | (df['events'] == 0)]
results = df['events'].value_counts()
print('Event Totals')
print(results)

Event Totals
0    79612
1    39802
Name: events, dtype: int64


Now our data consists of only the rows that we want, so we're ready to move on. Let's now convert our feature columns and label column into arrays, and then split the data into train and test subsets.

In [9]:
X = np.array(df[['launch_speed', 'launch_angle']])
y = np.array(df['events'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .1, random_state = 0)
print('Training data shape:',X_train.shape, y_train.shape)
print()
print('Test data shape:', X_test.shape, y_test.shape)

Training data shape: (107472, 2) (107472,)

Test data shape: (11942, 2) (11942,)
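
Because outs outnumber hits roughly two to one in this data, a purely random split can leave the two subsets with slightly different hit rates. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves the class proportions. A minimal sketch on made-up labels, where `X_toy` and `y_toy` are placeholders rather than Statcast data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 30 hits (1) and 70 outs (0); features are placeholders.
X_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy keeps the 30/70 hit/out ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0, stratify=y_toy
)
print('Test hits, outs:', int(y_te.sum()), ',', int((y_te == 0).sum()))
```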


## Training and Implementing our Model

Now let's train a linear regression model on our training data.

In [10]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

Let's take a look at some of our predictions against some actual results.

In [11]:
# Pair each actual label with its predicted score
comparison = [[actual, predicted] for actual, predicted in zip(y_test, y_pred)]
print('Comparison [y_test, y_pred]')
print(comparison[:10])

Comparison [y_test, y_pred]
[[1, 0.41208235912630153], [0, -0.1501122324471083], [0, 0.3498322384532113], [1, 0.19093742802962932], [0, 0.2040234066100438], [1, 0.3057551268851153], [0, 0.3757732192026676], [1, 0.4759718833972614], [0, 0.27915046710408253], [0, 0.4103475768369933]]
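
Note the negative score in the sample above: linear regression does not constrain its output to the range from 0 to 1. As a point of comparison only (the notebook itself uses `LinearRegression`), here is a minimal sketch of scikit-learn's `LogisticRegression`, whose `predict_proba` does return probabilities. The features and the labelling rule below are synthetic assumptions, not real Statcast data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in features: [launch_speed, launch_angle].
X = np.column_stack([rng.normal(88, 14, 2000), rng.normal(12, 28, 2000)])

# Hypothetical labelling rule, for illustration only:
# harder-hit balls become hits more often.
p_hit = 1 / (1 + np.exp(-(X[:, 0] - 95) / 8))
y = (rng.random(2000) < p_hit).astype(int)

# Linear regression scores are unbounded; logistic probabilities lie in [0, 1].
linear_scores = LinearRegression().fit(X, y).predict(X)
logit_probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

print('linear min/max:  ', linear_scores.min(), linear_scores.max())
print('logistic min/max:', logit_probs.min(), logit_probs.max())
```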


Our model returns a continuous score that mostly falls between 0 and 1, although linear regression does not bound its output (note the negative score above). Let's try declaring anything at or above .5 as a hit, and anything less than .5 as an out, and see how accurate we can get. Let's first see how many outs and hits we have in our test data.

In [12]:
hit_threshold = 0.5

In [13]:
y_test_hits = 0
y_test_outs = 0
for element in y_test:
    if element == 0:
        y_test_outs += 1
    else:
        y_test_hits += 1
print('y_test Hit, Out Totals:', y_test_hits,',', y_test_outs)

y_test Hit, Out Totals: 4063 , 7879


Let's compare these numbers to our prediction totals.

In [14]:
y_pred_hits = 0
y_pred_outs = 0
for element in y_pred:
    if element >= hit_threshold:
        y_pred_hits += 1
    else:
        y_pred_outs += 1
print('y_pred Hit, Out Totals:', y_pred_hits, ',', y_pred_outs)

y_pred Hit, Out Totals: 384 , 11558
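
The same tally can be written without an explicit loop by comparing the whole score array to the threshold at once. A small sketch, using the ten scores printed in the comparison earlier (rounded to three decimals) rather than the full `y_pred` array:

```python
import numpy as np

# First ten prediction scores from the comparison output, rounded.
scores = np.array([0.412, -0.150, 0.350, 0.191, 0.204,
                   0.306, 0.376, 0.476, 0.279, 0.410])
hit_threshold = 0.5

# Boolean comparison on the whole array, then cast to 0/1 labels (1 = predicted hit).
pred_labels = (scores >= hit_threshold).astype(int)
n_pred_hits = int(pred_labels.sum())
n_pred_outs = int(len(pred_labels) - n_pred_hits)
print('Predicted Hit, Out Totals:', n_pred_hits, ',', n_pred_outs)
```

None of these ten scores reaches .5, which matches the pattern in the full totals: very few predictions clear the threshold.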


It looks like our model is predicting way more outs than hits. Let's investigate by looking at our average prediction score for all actual hits, and do the same thing for all actual outs.

In [15]:
hit_total = 0
hit_total_pred = 0
out_total = 0
out_total_pred = 0
for row in comparison:
    if row[0] == 0:
        out_total += 1
        out_total_pred += row[1]
    else:
        hit_total += 1
        hit_total_pred += row[1]
avg_hit_pred = hit_total_pred / hit_total
avg_out_pred = out_total_pred / out_total
print('Average Prediction Scores for Outs, Hits')
print(avg_out_pred, avg_hit_pred)

Average Prediction Scores for Outs, Hits
0.3108343148316676 0.37593365796899375


Our model does have a higher score on average for hits than outs, which is good. However, it's only a difference of .06, which is not as high as we would hope. We also would have expected our model to give a score of over .5 for hits, and instead it is much lower, at .376 on average. Let's do some more investigation.

Let's see how the model does on hits and outs separately. 

In [16]:
#Hit accuracy
hit_list = []
for row in comparison:
    if row[0] == 1:
        hit_list.append(row)
hit_correct = 0
for element in hit_list:
    if element[1] >= hit_threshold:
        hit_correct += 1
hit_pred_accuracy = hit_correct / len(hit_list)
print('Prediction accuracy on hits:', hit_pred_accuracy)

#Out accuracy
out_list = []
for row in comparison:
    if row[0] == 0:
        out_list.append(row)
out_correct = 0
for element in out_list:
    if element[1] < hit_threshold:
        out_correct += 1
out_pred_accuracy = out_correct / len(out_list)
print('Prediction accuracy on outs:', out_pred_accuracy)

#Total accuracy
total_correct = 0
for row in comparison:
    if row[0] == 1 and row[1] >= hit_threshold:
        total_correct += 1
    elif row[0] == 0 and row[1] < hit_threshold:
        total_correct += 1
accuracy = total_correct / len(comparison)
print('Total Accuracy:', accuracy)

Prediction accuracy on hits: 0.06226925916810239
Prediction accuracy on outs: 0.9833735245589542
Total Accuracy: 0.6699882766705745
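
The per-class figures above are what scikit-learn calls recall: the share of actual hits (or actual outs) that were predicted correctly. A hedged sketch of computing them with `sklearn.metrics`, again on the ten sample pairs printed earlier. The `per_class_report` helper is a hypothetical name introduced for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

def per_class_report(actual, scores, threshold):
    """Threshold the scores, then report recall per class and overall accuracy."""
    predicted = (np.asarray(scores) >= threshold).astype(int)
    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
    return {
        'confusion': (int(tn), int(fp), int(fn), int(tp)),
        'hit_recall': float(recall_score(actual, predicted, pos_label=1)),
        'out_recall': float(recall_score(actual, predicted, pos_label=0)),
        'accuracy': float(accuracy_score(actual, predicted)),
    }

# First ten actual labels and rounded scores from the comparison output.
actual = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])
scores = np.array([0.412, -0.150, 0.350, 0.191, 0.204,
                   0.306, 0.376, 0.476, 0.279, 0.410])

report = per_class_report(actual, scores, threshold=0.5)
print(report)
```

Reporting the confusion matrix alongside accuracy makes the imbalance visible at a glance: a model can score well overall while finding almost none of the hits.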


Our model is heavily favoring outs and predicting too few hits. Our total accuracy is about 67%, which is better than random guessing by someone with no baseball knowledge. Earlier we saw that our test data has 4063 hits and 7879 outs, so the batting average in our test data is .340. If someone were to guess 'out' on every instance in our test data, they would have an accuracy of 66%, which is just slightly worse than our model. Since our model leans so heavily towards outs, let's lower our hit threshold to 0.416 and see if we can be more accurate.

In [17]:
hit_threshold = 0.416

In [18]:
y_pred_hits = 0
y_pred_outs = 0
for element in y_pred:
    if element >= hit_threshold:
        y_pred_hits += 1
    else:
        y_pred_outs += 1
print('y_pred Hit, Out Totals:', y_pred_hits, ',', y_pred_outs)

y_pred Hit, Out Totals: 3363 , 8579


In [19]:
#Hit accuracy
hit_list = []
for row in comparison:
    if row[0] == 1:
        hit_list.append(row)
hit_correct = 0
for element in hit_list:
    if element[1] >= hit_threshold:
        hit_correct += 1
hit_pred_accuracy = hit_correct / len(hit_list)
print('Prediction accuracy on hits:', hit_pred_accuracy)

#Out accuracy
out_list = []
for row in comparison:
    if row[0] == 0:
        out_list.append(row)
out_correct = 0
for element in out_list:
    if element[1] < hit_threshold:
        out_correct += 1
out_pred_accuracy = out_correct / len(out_list)
print('Prediction accuracy on outs:', out_pred_accuracy)

#Total accuracy
total_correct = 0
for row in comparison:
    if row[0] == 1 and row[1] >= hit_threshold:
        total_correct += 1
    elif row[0] == 0 and row[1] < hit_threshold:
        total_correct += 1
accuracy = total_correct / len(comparison)
print('Total Accuracy:', accuracy)

Prediction accuracy on hits: 0.4708343588481418
Prediction accuracy on outs: 0.8159664932097982
Total Accuracy: 0.6985429576285379
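
The threshold of 0.416 is set by hand above. One way such a value could be searched for is a simple sweep over candidate thresholds, sketched here under the assumption that a separate validation split is used for tuning so the test set stays untouched. The `best_threshold` helper and the candidate grid are illustrative, and the ten sample pairs printed earlier merely stand in for a validation set:

```python
import numpy as np

def best_threshold(y_true, scores, candidates):
    """Return (threshold, accuracy) with the highest accuracy for `scores >= threshold`.

    Ties are resolved in favor of the earliest candidate.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    best_t, best_acc = None, -1.0
    for t in candidates:
        acc = float(np.mean((scores >= t).astype(int) == y_true))
        if acc > best_acc:
            best_t, best_acc = float(t), acc
    return best_t, best_acc

# Stand-in validation data: the first ten labels and rounded scores shown earlier.
y_val = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])
val_scores = np.array([0.412, -0.150, 0.350, 0.191, 0.204,
                       0.306, 0.376, 0.476, 0.279, 0.410])

threshold, val_accuracy = best_threshold(
    y_val, val_scores, candidates=[0.30, 0.35, 0.40, 0.45, 0.50]
)
print('Best threshold:', threshold, 'accuracy:', val_accuracy)
```

Ten samples are far too few to pick a threshold reliably; the point is only the shape of the search.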


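Lowering our hit threshold increased our accuracy by about 3 percentage points, from 67.0% to 69.9%. Accuracy on hits rose from about 6% to 47%, while accuracy on outs fell from about 98% to 82%. We are now just under 70% accurate.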

## Conclusion

Using a simple multiple linear regression model, we were able to create a program that predicts hits and outs based on exit velocity and launch angle. Our model predicts with accuracy around 70%. For comparison, a person giving random guesses would on average predict with 50% accuracy, and a person guessing 'out' in each instance would reach about 66%, so the model gains roughly 4 percentage points over the simplest baseline.
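
The two baselines quoted here follow directly from the test-set totals reported earlier (4063 hits, 7879 outs). A short check of that arithmetic:

```python
hits, outs = 4063, 7879           # test-set totals reported earlier
total = hits + outs

batting_average = hits / total    # share of batted balls that were hits
always_out_accuracy = outs / total

# A fair coin is right half the time on hits and half the time on outs,
# so its expected accuracy is 50% regardless of the class balance.
coin_flip_accuracy = 0.5 * batting_average + 0.5 * (1 - batting_average)

print(f'Batting average:       {batting_average:.3f}')
print(f'Always-out accuracy:   {always_out_accuracy:.3f}')
print(f'Coin-flip accuracy:    {coin_flip_accuracy:.3f}')
```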