### Background:

Following the Astros sign stealing fiasco, I saw this reddit thread: https://www.reddit.com/r/baseball/comments/dw9wnd/the_astros_home_k_dropped_from_244_in_2016_to_166/

It showed that the Astros' home K% dropped from 24.4% in 2016 to 16.6% in 2017, a 31.8% decrease. However, this doesn't account for roster changes between those years, so I wanted to look at the K% change for _just_ the players who were on the team in both seasons.

In [1]:
import pandas as pd
pd.set_option("display.max_columns", 101)
import numpy as np
import matplotlib.pyplot as plt
from pybaseball import league_batting_stats
from bs4 import BeautifulSoup
import requests

Get 2016 and 2017 stats

In [2]:
df2016 = league_batting_stats.batting_stats_bref(2016)
astros2016 = df2016.query("Tm=='Houston'")
df2017 = league_batting_stats.batting_stats_bref(2017)
astros2017 = df2017.query("Tm=='Houston'")

Do an inner merge on player names, keeping only players present in both years

In [3]:
mixed_df = astros2016.merge(astros2017, how='inner', on='Name', suffixes=('2016','2017'))

In [4]:
mixed_df["Name"].values

array(['Jose Altuve', 'Alex Bregman', 'Carlos Correa', 'Evan Gattis',
       'Marwin Gonzalez', 'Yuli Gurriel', 'Tony Kemp', 'Dallas Keuchel',
       'Jake Marisnick', 'Lance McCullers Jr.', 'Collin McHugh',
       'Colin Moran', 'Joe Musgrove', 'AJ Reed', 'George Springer',
       'Max Stassi', 'Tyler White'], dtype=object)
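To make the merge step concrete, here is a minimal sketch on two made-up rosters (the player names and numbers are hypothetical, not real data). It shows how an inner merge on `Name` keeps only the shared players and how the suffixes produce columns like `PA2016` and `SO2017`. It also checks for duplicate names, since two different players with the same name would multiply rows in the merge.

```python
import pandas as pd

# Hypothetical toy rosters; names and numbers are made up for illustration.
roster_a = pd.DataFrame({"Name": ["Player A", "Player B", "Player C"],
                         "PA": [600, 450, 120],
                         "SO": [90, 110, 40]})
roster_b = pd.DataFrame({"Name": ["Player B", "Player C", "Player D"],
                         "PA": [500, 200, 300],
                         "SO": [80, 50, 75]})

# The inner merge keeps only names found in both frames; the suffixes
# disambiguate the overlapping stat columns (PA -> PA2016 / PA2017).
both = roster_a.merge(roster_b, how="inner", on="Name", suffixes=("2016", "2017"))

# A name that appears more than once in either frame would multiply rows.
has_duplicates = bool(roster_a["Name"].duplicated().any()
                      or roster_b["Name"].duplicated().any())

print(both)
print("duplicate names:", has_duplicates)
```

Only Player B and Player C survive the merge, which mirrors how the real merge drops anyone who played for Houston in just one of the two years.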

Looks right. Now compute the pooled K rate (total SO / total PA) for 2016 and 2017 and take the difference in percentage points

In [5]:
krate2016 = (mixed_df["SO2016"].sum() / mixed_df["PA2016"].sum())*100
krate2017 = (mixed_df["SO2017"].sum() / mixed_df["PA2017"].sum())*100
dkrate = krate2017 - krate2016
print("2016 K%: {0:.2f}".format(krate2016))
print("2017 K%: {0:.2f}".format(krate2017))
print("year-over-year change: {0:.2f}".format(dkrate))

2016 K%: 21.19
2017 K%: 16.95
year-over-year change: -4.23
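One thing worth spelling out: the rate above is pooled (total strikeouts over total plate appearances), so players with more PA carry more weight. A small sketch with two made-up stat lines (hypothetical numbers, not real players) shows how that differs from averaging each player's individual K%:

```python
# Hypothetical (SO, PA) lines for two hitters; not real data.
lines = {"everyday hitter": (100, 600), "bench bat": (15, 30)}

total_so = sum(so for so, pa in lines.values())
total_pa = sum(pa for so, pa in lines.values())

# Pooled rate: every plate appearance counts equally.
pooled_rate = 100 * total_so / total_pa

# Mean of rates: every player counts equally, regardless of playing time.
mean_of_rates = sum(100 * so / pa for so, pa in lines.values()) / len(lines)

print("pooled K%: {0:.2f}".format(pooled_rate))
print("mean of player K%: {0:.2f}".format(mean_of_rates))
```

The pooled number is the one that is comparable to a team-level K%, which is why the cell above sums before dividing.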


### However, this covers both home and away games. Isolating home games requires some scraping

In [6]:
from pybaseball import playerid_lookup

In [7]:
player_keys = {}
for player in mixed_df["Name"].values:
    print("Getting player: {0}".format(player))

    # Annoying exception: AJ Reed has to be looked up as "A. J."
    if player == "AJ Reed":
        first, last = "A. J.", "Reed"
    else:
        # First token is the first name, second token the last name
        # (this also ignores the "Jr." in "Lance McCullers Jr.")
        first, last = player.split(" ")[0], player.split(" ")[1]

    # Drop rows with missing IDs and keep the last match
    player_keys[player] = playerid_lookup(last, first).dropna().iloc[-1]["key_fangraphs"]

Getting player: Jose Altuve
Gathering player lookup table. This may take a moment.
Getting player: Alex Bregman
Gathering player lookup table. This may take a moment.
Getting player: Carlos Correa
Gathering player lookup table. This may take a moment.
Getting player: Evan Gattis
Gathering player lookup table. This may take a moment.
Getting player: Marwin Gonzalez
Gathering player lookup table. This may take a moment.
Getting player: Yuli Gurriel
Gathering player lookup table. This may take a moment.
Getting player: Tony Kemp
Gathering player lookup table. This may take a moment.
Getting player: Dallas Keuchel
Gathering player lookup table. This may take a moment.
Getting player: Jake Marisnick
Gathering player lookup table. This may take a moment.
Getting player: Lance McCullers Jr.
Gathering player lookup table. This may take a moment.
Getting player: Collin McHugh
Gathering player lookup table. This may take a moment.
Getting player: Colin Moran
Gathering player lookup table. This m
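The name handling in the lookup loop could also be pulled into a small helper. This is only a sketch: `split_name`, `SUFFIXES`, and `LOOKUP_OVERRIDES` are hypothetical names I'm introducing here, and the only override included is the "A. J." spelling already used above.

```python
# Hypothetical helper names; only the AJ Reed override comes from the notebook.
SUFFIXES = {"Jr.", "Sr.", "II", "III", "IV"}
LOOKUP_OVERRIDES = {"AJ Reed": ("A. J.", "Reed")}


def split_name(full_name):
    """Return (first, last) for a 'First Last [Suffix]' style name."""
    if full_name in LOOKUP_OVERRIDES:
        return LOOKUP_OVERRIDES[full_name]
    parts = full_name.split()
    # Drop a trailing generational suffix such as "Jr."
    if parts[-1] in SUFFIXES:
        parts = parts[:-1]
    return parts[0], " ".join(parts[1:])


for name in ["Jose Altuve", "Lance McCullers Jr.", "AJ Reed"]:
    print(name, "->", split_name(name))
```

The resulting pair can be passed to `playerid_lookup(last, first)` exactly as in the loop above.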

The Fangraphs splits URL also requires each player's position, so I put those in by hand. Additionally, since the Astros play in the AL, where pitchers rarely bat, I pruned the pitchers

In [8]:
# Hand prune pitchers (commented out below)
player_pos = {
    'Jose Altuve': "2B",
    'Alex Bregman': "3B",
    'Carlos Correa': "SS",
    'Evan Gattis': "C/DH",
    'Marwin Gonzalez': "SS",
    'Yuli Gurriel': "1B",
    'Tony Kemp': "OF",
    #'Dallas Keuchel': "P",
    'Jake Marisnick': "OF",
    #'Lance McCullers Jr.': "P",
    #'Collin McHugh': "P",
    'Colin Moran': "3B",
    #'Joe Musgrove': "P",
    'AJ Reed': "1B/DH",
    'George Springer': "OF",
    'Max Stassi': "C",
    'Tyler White': "1B"
}

Scrape Fangraphs

In [9]:
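from io import StringIO

home_splits = {}
for key in player_pos:
    # Fangraphs splits URL: player ID and position fill the query string
    play_split = 'https://www.fangraphs.com/statsplits.aspx?playerid={0}&position={1}&season=0&split=1.1'.format(player_keys[key], player_pos[key])
    page = requests.get(play_split)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Locate the season-by-season splits table
    found = soup.find('table',{'class':'rgMasterTable', 'id':'SeasonSplits1_dgSeason1_ctl00'})
    # Newer pandas versions expect a file-like object rather than a raw HTML string
    player_df = pd.read_html(StringIO(str(found)))[0]
    # Drop the last row of the table
    home_splits[key] = player_df.drop(player_df.tail(1).index)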
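Some of the hand-entered positions contain a slash (`C/DH`, `1B/DH`), so one option is to let the standard library build and encode the query string instead of formatting it by hand. A minimal sketch, assuming the same query parameters as the URL above; `build_split_url` is a hypothetical helper and `12345` is a placeholder ID, not a real player key. I have not checked how the site treats the encoded form.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.fangraphs.com/statsplits.aspx"


def build_split_url(player_id, position):
    """Build the splits URL with the same parameters used in the scraping loop."""
    params = {
        "playerid": player_id,
        "position": position,
        "season": 0,
        "split": "1.1",
    }
    # urlencode percent-encodes reserved characters such as "/"
    return BASE_URL + "?" + urlencode(params)


# 12345 is a placeholder ID for illustration only.
print(build_split_url(12345, "C/DH"))
```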

Sum the home PA and K totals for the players present in both years

In [10]:
total2016_pa, total2017_pa = 0,0
total2016_k, total2017_k = 0,0
for player,df in home_splits.items():
    d2016 = df[df["Season"] == "2016"]
    total2016_k += d2016.iloc[0]["SO"]
    total2016_pa += d2016.iloc[0]["PA"]
    
    d2017 = df[df["Season"] == "2017"]
    if player != "Colin Moran":
        total2017_k += d2017.iloc[0]["SO"]
        total2017_pa += d2017.iloc[0]["PA"]
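The per-season summing above could be written as one function that skips any player without a row for the requested season, rather than checking for a specific name. This is a sketch on toy frames with made-up numbers; `season_totals` is a hypothetical helper, and I'm assuming the `Season` column holds strings, as the comparisons above do.

```python
import pandas as pd


def season_totals(splits, season):
    """Sum SO and PA over every player who has a row for `season`."""
    total_k, total_pa = 0, 0
    for player, df in splits.items():
        rows = df[df["Season"] == season]
        if rows.empty:
            # No row for this season, so there is nothing to add for this player
            continue
        total_k += int(rows.iloc[0]["SO"])
        total_pa += int(rows.iloc[0]["PA"])
    return total_k, total_pa


# Hypothetical toy data: Player B has no 2017 row.
toy_splits = {
    "Player A": pd.DataFrame({"Season": ["2016", "2017"], "SO": [50, 40], "PA": [300, 310]}),
    "Player B": pd.DataFrame({"Season": ["2016"], "SO": [10], "PA": [40]}),
}

print(season_totals(toy_splits, "2016"))
print(season_totals(toy_splits, "2017"))
```

Note that a player missing from one season still contributes to the other season's totals here, so the two years are not summed over an identical set of players whenever that happens.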


In [11]:
krate2017 = (total2017_k/total2017_pa) * 100
krate2016 = (total2016_k/total2016_pa) * 100

In [16]:
print("2016 K%: {0:.2f}".format(krate2016))
print("2017 K%: {0:.2f}".format(krate2017))
print("year-over-year change: {0:.2f} percentage points".format(krate2017-krate2016))
print("percent change: {0:.2f}%".format(100*(krate2017-krate2016)/krate2016))

2016 K%: 22.17
2017 K%: 16.46
year-over-year change: -5.71 percentage points
percent change: -25.75%
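The two change figures above measure different things: the first is a difference between two rates (percentage points), the second is that difference relative to the starting rate (percent). A short sketch with round, made-up counts keeps the two apart; the function names are hypothetical.

```python
def k_rate(strikeouts, plate_appearances):
    """Strikeout rate as a percentage of plate appearances."""
    return 100 * strikeouts / plate_appearances


def point_change(old_rate, new_rate):
    """Absolute change between two rates, in percentage points."""
    return new_rate - old_rate


def percent_change(old_rate, new_rate):
    """Change relative to the old rate, in percent."""
    return 100 * (new_rate - old_rate) / old_rate


# Made-up counts: 20 K in 100 PA one year, 15 K in 100 PA the next.
old = k_rate(20, 100)
new = k_rate(15, 100)

print("change: {0:.2f} percentage points".format(point_change(old, new)))
print("percent change: {0:.2f}%".format(percent_change(old, new)))
```

In this toy case a 5-point drop from 20% is a 25% relative drop, the same relationship as between the two figures printed above.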


In [13]:
print("Full list of position players from both 2016+2017:")
for player in player_pos.keys():
    print(player)

Full list of position players from both 2016+2017:
Jose Altuve
Alex Bregman
Carlos Correa
Evan Gattis
Marwin Gonzalez
Yuli Gurriel
Tony Kemp
Jake Marisnick
Colin Moran
AJ Reed
George Springer
Max Stassi
Tyler White
