# Brief Introduction to the Problem:
Baseball, often referred to as America’s pastime, is the quintessential American sport, with a rich history and tradition of uniting communities. 

Baseball is also widely regarded as one of the most analytics-driven sports, if not the most. Because the game is made up of discrete events, every play yields a wide range of data, from exact metrics about play outcomes to detailed measurements of pitch location and velocity. Using this information for in-depth analysis has become a key focus of the sport, helping players improve and teams win more games. This project uses a dataset of college baseball data gathered over the 2024 season and obtained by the UT Baseball team.

## UT Baseball and NCAA Division I
Founded in 1894 at The University of Texas at Austin, the Texas Longhorns baseball team is the “winningest” NCAA Division I intercollegiate men’s baseball program of all time. The team has an all-time win–loss–tie record of 3774–1442–32 (.722) as of the end of the 2024 season, and the Longhorns have won 6 NCAA baseball national championships.

The NCAA Division I baseball season starts in February and ends in June each year with a 64-team tournament played in four rounds:

- Regionals: In the first round, the 64 teams are split into 16 brackets of 4 teams each. Each is a double-elimination bracket (a team is not eliminated until it loses two games).
- Super regionals: The 16 winners of the regionals move on to the super regionals and are split into 8 pairings.
- Men's College World Series: The 8 winners of the super regionals head to the Men's College World Series (MCWS). They are split into two double-elimination brackets of 4 teams each.
- Men's College World Series Finals: In the MCWS Finals, the NCAA champion is decided between the winners of the 2 MCWS brackets.

College baseball is about much more than sports. It develops the next generation of baseball talent and is an integral part of the economic and cultural fabric of a university and its city or state. It gives players the opportunity to develop their skills for the professional level, earn scholarships for higher education, and experience the camaraderie of a team sport. College baseball games attract millions of viewers, although interest is not what it was in the past, as college basketball and football take a front seat in the college sports scene. The future of baseball is uncertain, and programs around the country are working to sustain their teams and the relevance of baseball in American society.


## What is the machine learning problem you are trying to solve? Why does the problem matter? 
The machine learning problem we are trying to solve is a multiclass classification problem: deciding which kind of pitch a pitcher should throw to get their desired play result.

The start of a play in baseball can be an opportunity or a missed chance for a team. It can shape the outcome of the play, and even of the game. Pitchers must make a split-second decision about how to throw the ball to get the desired outcome and help their team win in the long run, weighing many different aspects of the game.

The problem of how to throw a pitch to get a desired result affects all baseball players, and pitchers especially. We would like to help pitchers in their decision-making process. This can help amateur pitchers improve, and professional pitchers can use it to choose pitches that are more likely to produce their desired outcome and ultimately a win for their team.


## What could the results of your predictive model be used for? 
The results of our predictive model can help pitchers figure out which type of pitch to throw to obtain a certain outcome from an at-bat, assuming the batter swings at the pitch. Coaches and pitchers can use this to decide what kind of pitch to throw in crucial game situations and to develop a robust pitch selection strategy. If certain pitchers on a team are better at certain pitch types than others, coaches can use that knowledge to choose who pitches based on which pitch type is needed. Coaches can also adjust their pitch recommendation for specific match-ups, such as a pitcher facing a batter of the same handedness.


## Why would we want to be able to predict the thing you’re trying to predict? 
We want to predict the Play Result because it tells pitchers what outcome to expect when a batter swings at a particular pitch. Predicting Play Result from the type of pitch thrown gives pitchers insight into which of their pitches lead to positive outcomes versus negative ones, which helps them decide what type of pitch to throw in future games. For example, it is useful for pitchers to know which pitches result in outs versus scoring plays like a home run.


## Dataset Description
The dataset contains baseball statistics from the UT baseball team covering the 2024 baseball season, from January to April. It provides information on 4904 unique games, each with its own game_id. For each pitch, it records the game situation, pitch data, hitter data, and the names of the pitcher, batter, and catcher.
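
The figures in this description can be checked directly once the data is loaded. The sketch below uses a tiny made-up frame with the same `game_id` and `Date` column names; `summarize_games` is a hypothetical helper, not part of the original notebook, and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; the rows are made up for illustration only.
toy = pd.DataFrame({
    "game_id": ["g1", "g1", "g2", "g3"],
    "Date": ["2024-02-16", "2024-02-16", "2024-03-01", "2024-04-20"],
})

def summarize_games(frame: pd.DataFrame) -> dict:
    """Return the number of unique games and the first and last game dates."""
    dates = pd.to_datetime(frame["Date"])
    return {
        "n_games": frame["game_id"].nunique(),
        "first_date": dates.min().date().isoformat(),
        "last_date": dates.max().date().isoformat(),
    }

summary = summarize_games(toy)
print(summary)
```

Calling `summarize_games(df)` on the loaded frame should reproduce the game count and date range stated above, assuming `Date` parses as a date.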


In [48]:
import pandas as pd
df = pd.read_csv("2024_combined_data (1).csv")

In [49]:
df.head()

Unnamed: 0,game_id,Date,Time,PitchNo,Inning,inning_half,PAofInning,PitchofPA,Pitcher,PitcherId,...,z0,vx0,vy0,vz0,ax0,ay0,az0,catcher,catcher_id,catcher_team
0,20240220-HighPointUniversity-1,2024-02-20,60314.0,82,3,Top,4,2,"Olsovsky, Dalton",1000251274,...,5.41,3.28,-106.58,0.58,12.91,22.07,-34.04,"Ruiz, Justin",1000209000.0,HIG_PAN
1,20240220-HighPointUniversity-1,2024-02-20,63576.0,185,6,Top,2,3,"Glover, Lucas",1000138461,...,6.01,5.23,-118.11,-3.37,3.74,26.71,-27.88,"Ruiz, Justin",1000209000.0,HIG_PAN
2,20240220-HighPointUniversity-1,2024-02-20,66446.0,269,8,Top,3,1,"Carter, Noah",1000108939,...,5.52,3.08,-122.94,-1.77,-3.68,24.28,-16.89,"Grintz, Eric",686456.0,HIG_PAN
3,20240220-HighPointUniversity-1,2024-02-20,64809.0,216,6,Bottom,6,1,"Welch, Collin",1000192105,...,6.23,0.38,-117.74,-0.37,-12.27,24.59,-28.63,"Church, Braxton",1000192000.0,APP_MOU
4,20240220-HighPointUniversity-1,2024-02-20,67985.0,308,9,Bottom,4,3,"Lewis, Zach",1000127413,...,5.18,9.62,-132.57,-10.2,-8.59,32.05,-14.93,"Church, Braxton",1000192000.0,APP_MOU


In [50]:
df.columns

Index(['game_id', 'Date', 'Time', 'PitchNo', 'Inning', 'inning_half',
       'PAofInning', 'PitchofPA', 'Pitcher', 'PitcherId', 'PitcherThrows',
       'PitcherTeam', 'Batter', 'BatterId', 'BatterSide', 'BatterTeam',
       'PitchCall', 'PlayResult', 'KorBB', 'OutsOnPlay', 'RunsScored', 'Balls',
       'Strikes', 'Outs', 'TaggedPitchType', 'RelSpeed', 'SpinRate',
       'SpinAxis', 'Tilt', 'InducedVertBreak', 'VertBreak', 'HorzBreak',
       'VertApprAngle', 'HorzApprAngle', 'vert_rel_angle', 'horz_rel_angle',
       'RelHeight', 'RelSide', 'Extension', 'PlateLocHeight', 'PlateLocSide',
       'zone_time', 'EffectiveVelo', 'SpeedDrop', 'TaggedHitType', 'hit_x',
       'hit_y', 'ExitSpeed', 'Angle', 'HitSpinRate', 'hit_spin_axis',
       'Distance', 'hit_last_tracked_distance', 'hit_hang_time', 'Direction',
       'Bearing', 'hit_max_height', 'hit_contact_x', 'hit_contact_y',
       'hit_contact_z', 'position_110x', 'position_110y', 'position_110z',
       'pfxx', 'pfxz', 'x0', 'y0', 'z

# Data Cleaning
In this section, we will clean our data by removing the features we believe are irrelevant to our model.

In [51]:
df['PitcherThrows'].value_counts()  # not displayed: a notebook cell only shows its last expression
df['BatterSide'].value_counts()

# PitcherThrows takes the values R, L, and S

BatterSide
R            971742
L            541647
Undefined        50
Name: count, dtype: int64

In [52]:
# Kept for now: 'x0', 'y0', 'z0', 'vx0', 'vy0', 'vz0', 'ax0', 'ay0', 'az0' appear to be the
# pitch's initial position, velocity, and acceleration components (unconfirmed).
cols_to_drop = ['game_id', 'Date', 'Time', 'Batter', 'BatterId', 'BatterTeam', 'TaggedHitType', 'hit_x',
                'hit_y', 'ExitSpeed', 'Angle', 'HitSpinRate', 'hit_spin_axis',
                'Distance', 'hit_last_tracked_distance', 'hit_hang_time', 'Direction',
                'Bearing', 'hit_max_height', 'hit_contact_x', 'hit_contact_y',
                'hit_contact_z','PitcherTeam','Pitcher', 'PitcherId', 'KorBB',
                'catcher', 'catcher_id', 'catcher_team']

df = df.drop(columns=cols_to_drop)

In [53]:
df.head()

Unnamed: 0,PitchNo,Inning,inning_half,PAofInning,PitchofPA,PitcherThrows,BatterSide,PitchCall,PlayResult,OutsOnPlay,...,pfxz,x0,y0,z0,vx0,vy0,vz0,ax0,ay0,az0
0,82,3,Top,4,2,R,R,StrikeCalled,Undefined,0,...,-1.66,-2.0,50.0,5.41,3.28,-106.58,0.58,12.91,22.07,-34.04
1,185,6,Top,2,3,R,R,StrikeCalled,Strikeout,0,...,3.1,-1.87,50.0,6.01,5.23,-118.11,-3.37,3.74,26.71,-27.88
2,269,8,Top,3,1,R,R,HitByPitch,Undefined,0,...,9.98,-2.41,50.0,5.52,3.08,-122.94,-1.77,-3.68,24.28,-16.89
3,216,6,Bottom,6,1,R,L,BallCalled,Undefined,0,...,2.55,-1.17,50.0,6.23,0.38,-117.74,-0.37,-12.27,24.59,-28.63
4,308,9,Bottom,4,3,R,R,BallCalled,Undefined,0,...,9.83,-1.23,50.0,5.18,9.62,-132.57,-10.2,-8.59,32.05,-14.93


In [54]:
df['PlayResult'].value_counts()

PlayResult
Undefined         1128864
Out                146365
Strikeout           78363
Single              60293
Walk                44806
Double              17877
HomeRun             11520
Sacrifice            7748
FieldersChoice       6756
Error                6375
StolenBase           1980
Triple               1883
CaughtStealing        602
SIngle                  4
error                   2
homerun                 1
Name: count, dtype: int64

In [55]:
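# Normalize capitalization so that 'SIngle', 'error', and 'homerun' merge into their main labels.
# Note: str.title() also lowercases interior capitals, e.g. 'HomeRun' -> 'Homerun'.
df['PlayResult'] = df['PlayResult'].str.title()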

In [56]:
df['PlayResult'].value_counts()

PlayResult
Undefined         1128864
Out                146365
Strikeout           78363
Single              60297
Walk                44806
Double              17877
Homerun             11521
Sacrifice            7748
Fielderschoice       6756
Error                6377
Stolenbase           1980
Triple               1883
Caughtstealing        602
Name: count, dtype: int64
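
Title-casing merges the misspelled variants, but it also rewrites labels such as `HomeRun` and `FieldersChoice`. A possible alternative, sketched below in plain Python, maps labels onto their canonical spelling case-insensitively. `CANONICAL` and `canonical_play_result` are hypothetical names introduced here; the label list is copied from the first `value_counts()` output.

```python
# Canonical spellings, copied from the value_counts() output shown earlier.
CANONICAL = [
    "Undefined", "Out", "Strikeout", "Single", "Walk", "Double", "HomeRun",
    "Sacrifice", "FieldersChoice", "Error", "StolenBase", "Triple",
    "CaughtStealing",
]

# Lookup from the lowercase spelling to the canonical spelling.
_BY_LOWER = {label.lower(): label for label in CANONICAL}

def canonical_play_result(label: str) -> str:
    """Map a label to its canonical spelling, ignoring case.

    Labels that are not in CANONICAL are returned unchanged.
    """
    return _BY_LOWER.get(label.lower(), label)

examples = ["SIngle", "error", "homerun", "HomeRun", "FieldersChoice"]
cleaned = [canonical_play_result(x) for x in examples]
print(cleaned)  # ['Single', 'Error', 'HomeRun', 'HomeRun', 'FieldersChoice']
```

On the real frame this could be applied with `df['PlayResult'].map(canonical_play_result)`, assuming the column has no missing values. Any later filter would then need to use `'HomeRun'` rather than `'Homerun'`.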

In [58]:
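# Keep only the five outcomes we want to predict; .copy() avoids SettingWithCopyWarning on later edits
df = df[df['PlayResult'].isin(['Out', 'Single', 'Double', 'Triple', 'Homerun'])].copy()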

In [59]:
df['PlayResult'].value_counts()

PlayResult
Out        146365
Single      60297
Double      17877
Homerun     11521
Triple       1883
Name: count, dtype: int64
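
These counts show that the five classes are heavily imbalanced, which matters when judging a multiclass classifier. As a rough sketch, the code below computes each class's share, the accuracy of a trivial model that always predicts the most common class, and inverse-frequency class weights. The weighting formula is one common heuristic and an assumption here, not something the notebook prescribes.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "Out": 146365,
    "Single": 60297,
    "Double": 17877,
    "Homerun": 11521,
    "Triple": 1883,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts the most common class scores this accuracy.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

# Inverse-frequency weights: total / (number of classes * class count).
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
for label in counts:
    print(f"{label:8s} share={shares[label]:.4f} weight={weights[label]:.2f}")
```

Any model trained on this data should beat the majority-class baseline to be useful, and per-class metrics are more informative than overall accuracy for rare outcomes such as `Triple`.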

In [60]:
df.describe()

Unnamed: 0,PitchNo,Inning,PAofInning,PitchofPA,OutsOnPlay,RunsScored,Balls,Strikes,Outs,RelSpeed,...,pfxz,x0,y0,z0,vx0,vy0,vz0,ax0,ay0,az0
count,237943.0,237943.0,237943.0,237943.0,237943.0,237943.0,237943.0,237943.0,237943.0,236633.0,...,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0
mean,155.472458,4.784797,3.130846,3.285909,0.648908,0.221419,1.086718,1.036958,0.975355,85.173191,...,5.410697,-0.698182,50.0,5.593194,2.341157,-123.517539,-3.02085,-3.101841,25.955383,-23.195081
std,93.301664,2.525452,1.965054,1.794471,0.532081,0.572583,1.011131,0.818237,0.821488,5.760423,...,4.727797,1.563894,0.0,0.49885,5.104889,8.339524,2.740909,10.427141,4.112026,7.675004
min,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,45.43,...,-25.18,-8.46,50.0,0.64,-17.36,-148.25,-13.88,-35.56,4.47,-74.6
25%,78.0,3.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,80.91,...,2.11,-1.75,50.0,5.33,-1.13,-130.02,-4.95,-12.03,22.96,-29.2
50%,150.0,5.0,3.0,3.0,1.0,0.0,1.0,1.0,1.0,86.19,...,6.14,-1.21,50.0,5.63,3.83,-124.98,-3.15,-4.67,26.01,-22.48
75%,228.0,7.0,4.0,5.0,1.0,0.0,2.0,2.0,2.0,89.67,...,9.2,0.52,50.0,5.91,5.96,-117.37,-1.2,5.53,28.91,-16.84
max,538.0,17.0,21.0,17.0,3.0,4.0,4.0,3.0,2.0,102.08,...,31.68,7.37,50.0,8.92,17.26,-60.19,26.4,30.95,48.79,24.08


In [73]:
df.columns

Index(['PitchNo', 'Inning', 'inning_half', 'PAofInning', 'PitchofPA',
       'PitcherThrows', 'BatterSide', 'PitchCall', 'PlayResult', 'OutsOnPlay',
       'RunsScored', 'Balls', 'Strikes', 'Outs', 'TaggedPitchType', 'RelSpeed',
       'SpinRate', 'SpinAxis', 'Tilt', 'InducedVertBreak', 'VertBreak',
       'HorzBreak', 'VertApprAngle', 'HorzApprAngle', 'vert_rel_angle',
       'horz_rel_angle', 'RelHeight', 'RelSide', 'Extension', 'PlateLocHeight',
       'PlateLocSide', 'zone_time', 'EffectiveVelo', 'SpeedDrop',
       'position_110x', 'position_110y', 'position_110z', 'pfxx', 'pfxz', 'x0',
       'y0', 'z0', 'vx0', 'vy0', 'vz0', 'ax0', 'ay0', 'az0'],
      dtype='object')

In [70]:
df.iloc[:,:10].describe(include="all")

Unnamed: 0,PitchNo,Inning,inning_half,PAofInning,PitchofPA,PitcherThrows,BatterSide,PitchCall,PlayResult,OutsOnPlay
count,237943.0,237943.0,237943,237943.0,237943.0,237943,237943,237943,237943,237943.0
unique,,,2,,,3,3,5,5,
top,,,Bottom,,,R,R,InPlay,Out,
freq,,,120330,,,176000,153568,237938,146365,
mean,155.472458,4.784797,,3.130846,3.285909,,,,,0.648908
std,93.301664,2.525452,,1.965054,1.794471,,,,,0.532081
min,1.0,1.0,,1.0,1.0,,,,,0.0
25%,78.0,3.0,,2.0,2.0,,,,,0.0
50%,150.0,5.0,,3.0,3.0,,,,,1.0
75%,228.0,7.0,,4.0,5.0,,,,,1.0


In [71]:
df.iloc[:,10:20].describe(include="all")

Unnamed: 0,RunsScored,Balls,Strikes,Outs,TaggedPitchType,RelSpeed,SpinRate,SpinAxis,Tilt,InducedVertBreak
count,237943.0,237943.0,237943.0,237943.0,237383,236633.0,236621.0,235037.0,235037.0,235037.0
unique,,,,,13,,,,,
top,,,,,Fastball,,,,,
freq,,,,,112191,,,,,
mean,0.221419,1.086718,1.036958,0.975355,,85.173191,2160.798928,182.151698,22862.053209,9.314316
std,0.572583,1.011131,0.818237,0.821488,,5.760423,296.475393,70.088924,15707.3179,8.564277
min,0.0,0.0,0.0,0.0,,45.43,615.8,0.03,3600.0,-46.3
25%,0.0,0.0,0.0,0.0,,80.91,2014.18,129.48,6300.0,3.42
50%,0.0,1.0,1.0,1.0,,86.19,2183.34,203.29,26100.0,10.7
75%,0.0,2.0,2.0,2.0,,89.67,2340.68,231.86,37800.0,16.18


In [74]:
df.iloc[:,20:30].describe(include="all")

Unnamed: 0,VertBreak,HorzBreak,VertApprAngle,HorzApprAngle,vert_rel_angle,horz_rel_angle,RelHeight,RelSide,Extension,PlateLocHeight,...,pfxz,x0,y0,z0,vx0,vy0,vz0,ax0,ay0,az0
count,235037.0,235037.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236555.0,236633.0,...,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0
mean,-30.219308,2.730778,-6.478636,-0.617926,-0.879045,-1.111111,5.681059,0.78632,5.775338,-0.0179,...,5.410697,-0.698182,50.0,5.593194,2.341157,-123.517539,-3.02085,-3.101841,25.955383,-23.195081
std,13.352875,11.952718,1.442519,1.784944,1.445757,2.492375,0.558147,1.747224,0.577031,0.575115,...,4.727797,1.563894,0.0,0.49885,5.104889,8.339524,2.740909,10.427141,4.112026,7.675004
min,-107.94,-32.45,-15.73,-8.16,-6.35,-8.16,-0.85,-6.81,-3.38,-13.22,...,-25.18,-8.46,50.0,0.64,-17.36,-148.25,-13.88,-35.56,4.47,-74.6
25%,-39.6,-7.76,-7.4,-1.73,-1.91,-2.9,5.39,-0.63,5.39,-0.41,...,2.11,-1.75,50.0,5.33,-1.13,-130.02,-4.95,-12.03,22.96,-29.2
50%,-26.88,5.2,-6.25,-0.57,-1.04,-1.85,5.72,1.37,5.78,-0.02,...,6.14,-1.21,50.0,5.63,3.83,-124.98,-3.15,-4.67,26.01,-22.48
75%,-19.39,13.03,-5.42,0.43,0.03,0.57,6.03,1.96,6.16,0.38,...,9.2,0.52,50.0,5.91,5.96,-117.37,-1.2,5.53,28.91,-16.84
max,21.23,30.02,11.74,8.3,23.87,7.87,9.01,8.15,11.73,9.43,...,31.68,7.37,50.0,8.92,17.26,-60.19,26.4,30.95,48.79,24.08


In [75]:
df.iloc[:,30:].describe(include="all")

Unnamed: 0,PlateLocSide,zone_time,EffectiveVelo,SpeedDrop,position_110x,position_110y,position_110z,pfxx,pfxz,x0,y0,z0,vx0,vy0,vz0,ax0,ay0,az0
count,236633.0,236633.0,236633.0,236633.0,114988.0,114988.0,114988.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0,236633.0
mean,2.347228,0.451443,83.964447,7.503356,102.949612,47.704054,-1.631737,-1.747698,5.410697,-0.698182,50.0,5.593194,2.341157,-123.517539,-3.02085,-3.101841,25.955383,-23.195081
std,0.594866,0.034108,6.005551,1.07002,7.342748,31.480258,38.01012,6.669099,4.727797,1.563894,0.0,0.49885,5.104889,8.339524,2.740909,10.427141,4.112026,7.675004
min,-5.16,0.37,35.93,1.97,-109.02,2.42,-101.57,-25.74,-25.18,-8.46,50.0,0.64,-17.36,-148.25,-13.88,-35.56,4.47,-74.6
25%,1.95,0.42,79.51,6.76,98.75,22.61,-32.8,-7.47,2.11,-1.75,50.0,5.33,-1.13,-130.02,-4.95,-12.03,22.96,-29.2
50%,2.35,0.44,85.01,7.52,105.55,41.67,-2.31,-3.02,6.14,-1.21,50.0,5.63,3.83,-124.98,-3.15,-4.67,26.01,-22.48
75%,2.74,0.47,88.64,8.24,108.94,66.2,29.03,3.97,9.2,0.52,50.0,5.91,5.96,-117.37,-1.2,5.53,28.91,-16.84
max,19.43,1.05,101.92,13.97,110.0,207.73,103.51,20.82,31.68,7.37,50.0,8.92,17.26,-60.19,26.4,30.95,48.79,24.08
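
The `count` rows in these summaries differ between columns (for example 237943 for the game-state columns, 236633 for most pitch-tracking columns, and 114988 for the `position_110*` columns), which indicates missing values. A minimal sketch of how to quantify this is shown below on a made-up frame; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with invented values; None marks a missing measurement.
toy = pd.DataFrame({
    "RelSpeed": [88.1, None, 91.3, 85.0],
    "position_110x": [None, None, 104.2, 99.8],
    "Balls": [0, 1, 2, 3],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running `df.isna().mean()` on the real frame would show which columns need imputation or removal before modeling. Based on the counts above, the `position_110*` columns are missing in roughly half of the rows.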


In [84]:
df["TaggedPitchType"].value_counts()

TaggedPitchType
Fastball            112152
Slider               47344
ChangeUp             26291
Sinker               14933
Curveball            13038
Cutter                7454
FourSeamFastBall      7210
Changeup              2743
Four-Seam             2660
TwoSeamFastBall       2251
Splitter              1194
Knuckleball             29
OneSeamFastBall         23
Name: count, dtype: int64
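
Some of these labels look like spelling variants of the same pitch (`ChangeUp` and `Changeup`, `FourSeamFastBall` and `Four-Seam`). The sketch below merges them with an alias table, using the counts copied from the output above. The `ALIASES` mapping is an assumption for illustration; whether the generic `Fastball` tag should also be merged with the four-seam labels is a domain decision left open here.

```python
from collections import Counter

# Counts copied from the TaggedPitchType output above.
tagged_counts = {
    "Fastball": 112152, "Slider": 47344, "ChangeUp": 26291, "Sinker": 14933,
    "Curveball": 13038, "Cutter": 7454, "FourSeamFastBall": 7210,
    "Changeup": 2743, "Four-Seam": 2660, "TwoSeamFastBall": 2251,
    "Splitter": 1194, "Knuckleball": 29, "OneSeamFastBall": 23,
}

# Assumed aliases: variant spelling -> label to keep.
ALIASES = {
    "Changeup": "ChangeUp",
    "Four-Seam": "FourSeamFastBall",
}

# Re-accumulate the counts under the merged labels.
merged = Counter()
for label, n in tagged_counts.items():
    merged[ALIASES.get(label, label)] += n

for label, n in merged.most_common():
    print(f"{label:18s}{n:>8d}")
```

On the real frame the same table could be applied with `df['TaggedPitchType'].replace(ALIASES)`.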

In [80]:
# Remove rows with unusable handedness labels: pitchers tagged "S" and batters tagged "Undefined"
df = df[df["PitcherThrows"] != "S"]
df = df[df["BatterSide"] != "Undefined"]
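
Since same-handedness match-ups are one of the intended uses of the model, the cleaned handedness columns can be turned into an explicit feature. The sketch below does this on a made-up frame; `same_hand` is a hypothetical column name. It keeps only `R` and `L` values, which matches the filter above as long as those columns contain no other values.

```python
import pandas as pd

# Toy frame with invented rows; column names follow the dataset.
toy = pd.DataFrame({
    "PitcherThrows": ["R", "L", "S", "R", "L"],
    "BatterSide": ["R", "R", "L", "Undefined", "L"],
})

# Keep rows where both handedness labels are a plain R or L.
valid = toy["PitcherThrows"].isin(["R", "L"]) & toy["BatterSide"].isin(["R", "L"])
clean = toy[valid].copy()

# True when pitcher and batter share the same handedness.
clean["same_hand"] = clean["PitcherThrows"] == clean["BatterSide"]
print(clean)
```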