
# The Statcast revolution

github.com/tushar2704, kaggle.com/tusharaggarwal27, linkedin.com/in/tusharaggarwalinseec

**Statcast** is a state-of-the-art tracking system that uses high-resolution cameras and radar equipment to measure the precise location and movement of baseballs and baseball players. Introduced in 2015 to all 30 major league ballparks, Statcast data is revolutionizing the game. Teams are engaging in an "arms race" of data analysis, hiring analysts left and right in an attempt to gain an edge over their competition. 

Learn more about it here: https://www.youtube.com/watch?v=9rOKGKhQe8U

**In this notebook, we're going to wrangle, analyze, and visualize Statcast data to compare Aaron Judge with another (extremely large) teammate of his, Giancarlo Stanton. Let's start by loading the data into our notebook. There are two CSV files, judge.csv and stanton.csv, both of which contain Statcast data for 2015-2017. We'll use pandas DataFrames to store this data. Let's also load our data visualization libraries, matplotlib and seaborn.**

In [1]:
# Data manipulation imports
import numpy as np
import pandas as pd

# Visualization imports
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
%matplotlib inline

# Modeling imports
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

**Loading the required data**

Aaron Judge is one of the physically largest players in Major League Baseball, standing 6 feet 7 inches (2.01 m) tall and weighing 282 pounds (128 kg). He also hit the hardest home run ever recorded. How do we know this? Statcast.

In [2]:
# Loading Aaron Judge's Statcast data
judge = pd.read_csv("/kaggle/input/a-new-era-of-data-analysis-in-baseball/judge.csv")

# Loading Giancarlo Stanton's Statcast data
stanton = pd.read_csv("/kaggle/input/a-new-era-of-data-analysis-in-baseball/stanton.csv")
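
A quick sanity check after loading can confirm that the dates parse and that the seasons fall in the expected 2015-2017 range. The sketch below is only an illustration: it reads a made-up two-row CSV from memory instead of the real files, and `csv_text`, `df`, and `seasons` are hypothetical names that do not appear in the notebook.

```python
import io

import pandas as pd

# Made-up stand-in for judge.csv: same column names, invented rows
csv_text = (
    "pitch_type,game_date,game_year\n"
    "FF,2017-04-02,2017\n"
    "CH,2016-08-13,2016\n"
)

# parse_dates converts game_date from text to datetime64 while reading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["game_date"])

# Seasons present in the file, as plain Python ints in ascending order
seasons = sorted(int(year) for year in df["game_year"].unique())

print(df.dtypes)
print("Seasons:", seasons)
```

With the real files, the same `parse_dates=["game_date"]` argument could be passed to the `pd.read_csv` calls above, assuming the date column is stored in the `YYYY-MM-DD` form shown in the data.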

**What can Statcast measure?**

The better question might be, what can't Statcast measure?

Starting with the pitcher, Statcast can **measure simple data points such as velocity**. At the same time, Statcast digs a whole lot deeper, also **measuring the release point and spin rate of every pitch.**

Moving on to hitters, Statcast is capable of **measuring the exit velocity, launch angle and vector of the ball as it comes off the bat**. From there, Statcast can also track the **hang time and projected distance that a ball travels.**

Let's inspect the last five rows of the judge DataFrame. You'll see that each row represents one pitch thrown to a batter. You'll also see that some columns have esoteric names. If these don't make sense now, don't worry. The relevant ones will be explained as necessary.

In [3]:
# Display all columns (pandas will collapse some columns if we don't set this option)
pd.set_option('display.max_columns', None)

# Display the last five rows of the Aaron Judge file
judge.tail()

Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,spin_dir,spin_rate_deprecated,break_angle_deprecated,break_length_deprecated,zone,des,game_type,stand,p_throws,home_team,away_team,type,hit_location,bb_type,balls,strikes,game_year,pfx_x,pfx_z,plate_x,plate_z,on_3b,on_2b,on_1b,outs_when_up,inning,inning_topbot,hc_x,hc_y,tfs_deprecated,tfs_zulu_deprecated,pos2_person_id,umpire,sv_id,vx0,vy0,vz0,ax,ay,az,sz_top,sz_bot,hit_distance_sc,launch_speed,launch_angle,effective_speed,release_spin_rate,release_extension,game_pk,pos1_person_id,pos2_person_id.1,pos3_person_id,pos4_person_id,pos5_person_id,pos6_person_id,pos7_person_id,pos8_person_id,pos9_person_id,release_pos_y,estimated_ba_using_speedangle,estimated_woba_using_speedangle,woba_value,woba_denom,babip_value,iso_value,launch_speed_angle,at_bat_number,pitch_number
3431,CH,2016-08-13,85.6,-1.9659,5.9113,Aaron Judge,592450,542882,,ball,,,,,14.0,,R,R,R,NYY,TB,B,,,0,0,2016,-0.379108,0.370567,0.739,1.442,,,,0,5,Bot,,,,,571912.0,,160813_144259,6.96,-124.371,-4.756,-2.821,23.634,-30.22,3.93,1.82,,,,84.459,1552.0,5.683,448611,542882.0,571912.0,543543.0,523253.0,446334.0,622110.0,545338.0,595281.0,543484.0,54.8144,0.0,0.0,,,,,,36,1
3432,CH,2016-08-13,87.6,-1.9318,5.9349,Aaron Judge,592450,542882,home_run,hit_into_play_score,,,,,4.0,Aaron Judge homers (1) on a fly ball to center...,R,R,R,NYY,TB,X,,fly_ball,1,2,2016,-0.295608,0.3204,-0.419,3.273,,,,2,2,Bot,130.45,14.58,,,571912.0,,160813_135833,4.287,-127.452,-0.882,-1.972,24.694,-30.705,4.01,1.82,446.0,108.8,27.41,86.412,1947.0,5.691,448611,542882.0,571912.0,543543.0,523253.0,446334.0,622110.0,545338.0,595281.0,543484.0,54.8064,0.98,1.937,2.0,1.0,0.0,3.0,6.0,14,4
3433,CH,2016-08-13,87.2,-2.0285,5.8656,Aaron Judge,592450,542882,,ball,,,,,14.0,,R,R,R,NYY,TB,B,,,0,2,2016,-0.668575,0.198567,0.561,0.96,,,,2,2,Bot,,,,,571912.0,,160813_135815,7.491,-126.665,-5.862,-6.393,21.952,-32.121,4.01,1.82,,,,86.368,1761.0,5.721,448611,542882.0,571912.0,543543.0,523253.0,446334.0,622110.0,545338.0,595281.0,543484.0,54.777,0.0,0.0,,,,,,14,3
3434,CU,2016-08-13,79.7,-1.7108,6.1926,Aaron Judge,592450,542882,,foul,,,,,4.0,,R,R,R,NYY,TB,S,,,0,1,2016,0.397442,-0.614133,-0.803,2.742,,,,2,2,Bot,,,,,571912.0,,160813_135752,1.254,-116.062,0.439,5.184,21.328,-39.866,4.01,1.82,9.0,55.8,-24.973,77.723,2640.0,5.022,448611,542882.0,571912.0,543543.0,523253.0,446334.0,622110.0,545338.0,595281.0,543484.0,55.4756,0.0,0.0,,,,,1.0,14,2
3435,FF,2016-08-13,93.2,-1.8476,6.0063,Aaron Judge,592450,542882,,called_strike,,,,,8.0,,R,R,R,NYY,TB,S,,,0,0,2016,-0.82305,1.6233,-0.273,2.471,,,,2,2,Bot,,,,,571912.0,,160813_135736,5.994,-135.497,-6.736,-9.36,26.782,-13.446,4.01,1.82,,,,92.696,2271.0,6.068,448611,542882.0,571912.0,543543.0,523253.0,446334.0,622110.0,545338.0,595281.0,543484.0,54.4299,0.0,0.0,,,,,,14,1
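
To connect the measurements described earlier with the column names in this output, here is a minimal sketch that rebuilds two of the rows shown above (the home run and the foul ball) with only a handful of columns. The names `sample`, `pitch_cols`, and `batted_cols` are illustrative and not part of the notebook, and the units in the comments reflect my understanding of Statcast conventions rather than anything stated in the data.

```python
import pandas as pd

# Two rows copied from the judge.tail() output, keeping only a few columns
sample = pd.DataFrame({
    "pitch_type": ["CH", "CU"],
    "release_speed": [87.6, 79.7],          # pitch velocity (assumed mph)
    "release_spin_rate": [1947.0, 2640.0],  # spin rate (assumed rpm)
    "launch_speed": [108.8, 55.8],          # exit velocity (assumed mph)
    "launch_angle": [27.41, -24.973],       # launch angle (assumed degrees)
    "hit_distance_sc": [446.0, 9.0],        # projected distance (assumed feet)
    "events": ["home_run", None],           # None: pitch did not end the at-bat
})

# Columns describing the pitch vs. columns describing the batted ball
pitch_cols = ["pitch_type", "release_speed", "release_spin_rate"]
batted_cols = ["launch_speed", "launch_angle", "hit_distance_sc"]

print(sample[pitch_cols])
print(sample[batted_cols])

# Batted-ball measurements for the home run only
home_runs = sample.loc[sample["events"] == "home_run", batted_cols]
print(home_runs)
```

The same column lists could be used to narrow the full `judge` DataFrame when the 70-plus columns get in the way.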


In [4]:
print(judge.isna().sum())

pitch_type              40
game_date                0
release_speed           41
release_pos_x           41
release_pos_z           41
                      ... 
babip_value           2663
iso_value             2663
launch_speed_angle    2757
at_bat_number            0
pitch_number             0
Length: 78, dtype: int64
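
The printed summary is truncated, so most of the 78 columns are hidden. One possible way to see which columns are emptiest is to rank them by missing count and fraction. The sketch below uses a small invented frame; `toy` and `missing_summary` are hypothetical names, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the judge DataFrame
toy = pd.DataFrame({
    "pitch_type": ["FF", "CH", None, "CU"],
    "game_date": ["2016-08-13"] * 4,
    "launch_speed": [np.nan, 108.8, np.nan, 55.8],
    "iso_value": [np.nan, 3.0, np.nan, np.nan],
})

na_mask = toy.isna()

# One row per column: how many values are missing, and what share that is
missing_summary = pd.DataFrame({
    "n_missing": na_mask.sum(),
    "frac_missing": na_mask.mean(),
}).sort_values("n_missing", ascending=False)

# Temporarily lift the row limit so nothing is collapsed into "..."
with pd.option_context("display.max_rows", None):
    print(missing_summary)
```

In the real data, columns such as `launch_speed_angle` are mostly empty because they only apply to pitches that were put in play, so a high missing count there is expected rather than a data-quality problem.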


**Aaron Judge and Giancarlo Stanton, prolific sluggers**

Giancarlo Stanton is also a very large human being, standing 6 feet 6 inches tall and weighing 245 pounds. Although he and Judge did not wear the same jersey through 2017, in 2018 they will be teammates on the New York Yankees. They are similar in a lot of ways, one being that they hit a lot of home runs. Stanton and Judge led baseball in home runs in 2017, with 59 and 52, respectively. These are exceptional totals: the player in third "only" had 45 home runs.

Stanton and Judge are also different in many ways. One difference is in their batted ball events. A batted ball event is any batted ball that produces a result, which includes outs, hits, and errors. The next cell pulls each player's `events` column for 2017. Each row is one pitch, so the value is NaN unless that pitch ended the plate appearance. The frequencies of the individual events are quite different for the two players.

In [5]:
# All of Aaron Judge's batted ball events in 2017
judge_events_2017 = judge.loc[judge['game_year']==2017].events
print("Aaron Judge batted ball event totals, 2017:")
print(judge_events_2017)

# All of Giancarlo Stanton's batted ball events in 2017
stanton_events_2017 = stanton.loc[stanton['game_year']==2017].events
print("\nGiancarlo Stanton batted ball event totals, 2017:")
print(stanton_events_2017)

Aaron Judge batted ball event totals, 2017:
0       strikeout
1             NaN
2             NaN
3            walk
4             NaN
          ...    
3023          NaN
3024    field_out
3025          NaN
3026          NaN
3027       double
Name: events, Length: 3028, dtype: object

Giancarlo Stanton batted ball event totals, 2017:
0       strikeout
1             NaN
2             NaN
3             NaN
4       field_out
          ...    
2780       double
2781    field_out
2782          NaN
2783          NaN
2784          NaN
Name: events, Length: 2785, dtype: object
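
The output above lists one row per pitch, so most entries are NaN and no totals are visible. One way to turn the column into per-event totals is `Series.value_counts()`, which skips NaN by default. A minimal sketch on an invented frame follows; `toy`, `events_2017`, and `totals` are hypothetical names and the rows are made up.

```python
import numpy as np
import pandas as pd

# Invented pitch-level data: NaN marks pitches that did not end the at-bat
toy = pd.DataFrame({
    "game_year": [2016, 2017, 2017, 2017, 2017, 2017],
    "events": ["home_run", "strikeout", np.nan, "walk", "strikeout", "double"],
})

# Select the 2017 rows and the events column in a single .loc call
events_2017 = toy.loc[toy["game_year"] == 2017, "events"]

# Count each event type; NaN rows are dropped automatically
totals = events_2017.value_counts()
print(totals)
```

Applied to the notebook's data, the equivalent call would be `judge.loc[judge['game_year'] == 2017, 'events'].value_counts()`, and likewise for `stanton`.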


# Analyzing home runs with Statcast data