# **ANALYSIS OF THE HOUSTON ASTROS' SIGN STEALING SCANDAL**
* This database is copyright 1996-2021 by Sean Lahman. 
* The datasets used in this notebook were downloaded from: http://www.seanlahman.com/baseball-archive/statistics/

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
infile = "../input/houston-astros-scandal/Teams2020.csv"

bball = pd.read_csv(infile)

def select_year_range(df, start_year, end_year=10000):
    """
    Returns rows of dataframe df whose yearID lies in given range of years.
    """
    return df[(df['yearID'] >= start_year) & (df['yearID'] <= end_year)]

bball_df = select_year_range(bball, 1970)
bball_df.head(5)

In [None]:
# Take an explicit copy so the columns added later do not write into a view of bball.
bball_df = bball_df[['yearID', 'franchID', 'H', '2B', '3B', 'HR', 'R', 'RA', 'AB', 'SO', 'BB', 'HBP', 'SF', 'W', 'L', 'LgWin']].copy()
bball_df.head(20)

In [None]:
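# dropna() returns a new DataFrame rather than modifying bball_df in place, so count
# the complete rows here and keep every season since 1970 for the later cells.
n_complete = len(bball_df.dropna(how='any'))
print('Nbr records with complete data since 1970: {0} of {1}'.format(n_complete, len(bball_df)))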

In [None]:
pd.options.mode.chained_assignment = None
bball_df['WinPct'] = bball_df['W'] / (bball_df['W'] + bball_df['L']).astype(float)
bball_df.head(20)

In [None]:
# Singles (1B): hits that were not doubles, triples, or home runs.
bball_df['1B'] = bball_df['H'] - (bball_df['2B'] + bball_df['3B'] + bball_df['HR'])
# Slugging percentage: total bases per at-bat.
bball_df['SLG'] = (bball_df['1B'] + 2 * bball_df['2B'] + 3 * bball_df['3B'] + 4 * bball_df['HR']) / bball_df['AB']
# Batting average: hits per at-bat.
bball_df['BAT_AVG'] = bball_df['H'] / bball_df['AB']
# On-base percentage: times on base over at-bats, walks, hit-by-pitches, and sacrifice flies.
bball_df['OBP'] = (bball_df['H'] + bball_df['BB'] + bball_df['HBP']) / (bball_df['AB'] + bball_df['BB'] + bball_df['HBP'] + bball_df['SF'])
# On-base plus slugging.
bball_df['OPS'] = bball_df['OBP'] + bball_df['SLG']
# R is runs scored; RA is runs allowed (the opponents' runs scored).
# Run differential (RUN_DIFF = R - RA) is the main parameter used below to explore team success.
bball_df['RUN_DIFF'] = bball_df['R'] - bball_df['RA']
                                                                           
                                                               

In [None]:
bball_df['RUN_DIFF']   
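
As a quick sanity check on the formulas above, the sketch below applies them to a single made-up team season. The numbers and the `derived_stats` helper are hypothetical, chosen only so the arithmetic is easy to follow by hand; they are not taken from the Lahman data.

```python
def derived_stats(H, doubles, triples, HR, AB, BB, HBP, SF, R, RA):
    """Apply the notebook's formulas to one team-season (hypothetical helper)."""
    singles = H - (doubles + triples + HR)
    total_bases = singles + 2 * doubles + 3 * triples + 4 * HR
    slg = total_bases / AB
    bat_avg = H / AB
    obp = (H + BB + HBP) / (AB + BB + HBP + SF)
    return {
        '1B': singles,
        'SLG': slg,
        'BAT_AVG': bat_avg,
        'OBP': obp,
        'OPS': obp + slg,
        'RUN_DIFF': R - RA,
    }

# Made-up season: 1500 hits in 5500 at-bats, 300 doubles, 30 triples, 200 home runs.
stats = derived_stats(H=1500, doubles=300, triples=30, HR=200, AB=5500,
                      BB=500, HBP=50, SF=50, R=800, RA=700)
for name, value in stats.items():
    print(name, round(value, 3))
```

Working through it by hand: 970 singles give 2460 total bases, so SLG is 2460 / 5500, OBP is 2050 / 6100, and OPS is their sum.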

In [None]:
columns = ['yearID', 'franchID', '1B', '2B', '3B', 'HR', 'R', 'RA', 'RUN_DIFF', 'AB', 'SLG', 'BAT_AVG', 'OBP', 'OPS', 'SO', 'BB', 'W', 'L', 'WinPct', 'LgWin']
bball_df = bball_df[columns]
bball_df2 = bball_df.set_index('franchID')
bball_df2.head(50)

In [None]:
rundiff_plus_200 = bball_df2.loc[bball_df2['RUN_DIFF'] >= 200, ['yearID', 'RUN_DIFF', 'LgWin']]
rundiff_plus_200

In [None]:
rundiff_plus_300 = bball_df2.loc[bball_df2['RUN_DIFF'] >= 300, ['yearID', 'RUN_DIFF', 'LgWin']]
rundiff_plus_300

In [None]:
negative_rundiff = bball_df2.loc[bball_df2['RUN_DIFF'] < 0, ['yearID', 'RUN_DIFF', 'LgWin']]
negative_rundiff

**VISUALIZATION # 1**

In [None]:
def vis1():
    # Split team-seasons by whether the team won its league that year.
    won_league = bball_df[bball_df['LgWin'] == 'Y']
    lost_league = bball_df[bball_df['LgWin'] == 'N']
    fig, axes = plt.subplots(figsize=(8, 6))
    axes.scatter(lost_league['RUN_DIFF'], lost_league['WinPct'], c='r', s=180, alpha=0.3, label='Lost League')
    axes.scatter(won_league['RUN_DIFF'], won_league['WinPct'], c='g', s=180, alpha=0.4, label='Won League')
    # Least-squares line y = m*x + b through all team-seasons, drawn across the visible x-range.
    m, b = np.polyfit(bball_df['RUN_DIFF'], bball_df['WinPct'], 1)
    xmin, xmax = axes.get_xlim()
    x_plot = np.linspace(xmin, xmax, 100)
    axes.plot(x_plot, m * x_plot + b, 'bo')
    # Labels and legend
    axes.legend(loc='upper left')
    axes.set_ylabel('Winning Percent')
    axes.set_xlabel('Run Differential')
    plt.show()

vis1()

The plot shows that teams with a higher run differential (`RUN_DIFF`) tend to be more successful: they score more runs than they allow. No team with a negative run differential has won the championship from 2000 to the present, and only one team in MLB history has done so: the 1987 Twins, with a run differential of -20.
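
One way to put a number on that tendency is to compare how often teams with a positive and a non-positive run differential win their league. The sketch below does this on a tiny made-up table; the rows are hypothetical, and only the column names (`RUN_DIFF`, `LgWin`) follow the notebook.

```python
import pandas as pd

# Hypothetical team-seasons; only the column names match the notebook.
toy = pd.DataFrame({
    'RUN_DIFF': [150, 90, 40, 10, -5, -20, -80, -130],
    'LgWin':    ['Y', 'Y', 'N', 'N', 'N', 'Y', 'N', 'N'],
})

# Label each row by the sign of its run differential.
toy['RD_SIGN'] = toy['RUN_DIFF'].apply(lambda rd: 'positive' if rd > 0 else 'non-positive')

# Fraction of team-seasons in each group that ended with a league title.
win_share = (toy['LgWin'] == 'Y').groupby(toy['RD_SIGN']).mean()
print(win_share)
```

Swapping `toy` for `bball_df` should give the same kind of summary on the real data. Rows where `LgWin` is missing count as non-winners in this comparison.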

In [None]:
negative_rundiff.loc[lambda df: df['LgWin'] == 'Y', :]

One interesting data point is a team that had a run differential of 300 and yet did not win the title that year: the 2001 Seattle Mariners.

In [None]:
rundiff_plus_300.loc[lambda df: df['LgWin'] == 'N', :]

The plot also shows that winning percentage alone does not decide the title. Although a higher winning percentage indicates a better team, some teams below the blue line, with a lower winning percentage than their run differential predicts, won the title, while some teams above the line, with a higher winning percentage than predicted, did not.
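
Which side of the blue line a team falls on is the sign of its residual: the actual winning percentage minus the value the fitted line predicts from run differential. A minimal sketch with made-up numbers (the arrays below are hypothetical, not Lahman data):

```python
import numpy as np

# Hypothetical run differentials and winning percentages for five teams.
run_diff = np.array([-100.0, -50.0, 0.0, 50.0, 100.0])
win_pct = np.array([0.44, 0.46, 0.51, 0.53, 0.56])

# Same degree-1 least-squares fit used for the blue line in the plot.
m, b = np.polyfit(run_diff, win_pct, 1)

# Positive residual: the team won more often than its run differential suggests.
residuals = win_pct - (m * run_diff + b)
above_line = residuals > 0
print(np.round(residuals, 3), above_line)
```

With an intercept in the fit, the residuals sum to zero, so there are always teams on both sides of the line.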

**VISUALIZATION # 2**

In [None]:
def vis2():
    # Seasons from 2015 onward, indexed by franchID.
    bball_df3 = select_year_range(bball_df2, 2015)
    astros = bball_df3.loc['HOU']
    rest_of_the_league = bball_df3.drop(['HOU'])
    fig, axes = plt.subplots(3, 2, sharey=True, figsize=(15, 10), constrained_layout=True)

    # (column, x-axis label) for each panel, in row-major order.
    panels = [
        ('SO', 'Strike Outs'),
        ('SLG', 'Slugging Percentage'),
        ('RUN_DIFF', 'Run Differential'),
        ('BB', 'Walks by batter'),
        ('BAT_AVG', 'Batting Average'),
        ('OPS', 'On Base Plus Slugging'),
    ]

    for ax, (col, label) in zip(axes.flat, panels):
        ax.scatter(rest_of_the_league[col], rest_of_the_league['WinPct'], c='mediumaquamarine', s=180, alpha=0.4, label='Rest of the league')
        ax.scatter(astros[col], astros['WinPct'], c='red', s=180, alpha=0.6, label='Astros')

        # Least-squares line y = m*x + b over all teams, drawn across the panel's x-range.
        m, b = np.polyfit(bball_df3[col], bball_df3['WinPct'], 1)
        xmin, xmax = ax.get_xlim()
        x_plot = np.linspace(xmin, xmax, 100)
        ax.plot(x_plot, m * x_plot + b, 'bo')

        ax.legend(loc='upper left')
        ax.set_ylabel('Winning Percent')
        ax.set_xlabel(label)
        ax.set_title('{0} vs Winning Percentage'.format(label))

    plt.show()

vis2()

The Houston Astros had three spectacular seasons in a row (2017, 2018, and 2019), with more than 100 wins in each. They reached three straight American League Championship Series (ALCS) and won two of them, then won one World Series and lost the other. That is not an easy feat, especially in consecutive years. MLB later determined that the Astros had cheated during the 2017 and 2018 seasons by using different methods to steal signs.

The graphs show that the Astros were an offensive force in every category compared with the other teams in the league. In 2017 and 2018 their run differential was 196 and 263, with a winning percentage of 0.62 and 0.63, respectively. The run differential plot is very informative because it measures the difference between runs scored and runs allowed, and the Astros sit near the top of the league for three consecutive seasons, with a winning percentage that is also very high relative to the competition. The On Base Plus Slugging (OPS) plot tells the same story. OPS combines a team's ability to get on base with its hitting power, and the Astros were one of the few teams in the league to reach 0.82 in 2017. They beat that mark in 2019 with 0.84. In both years they won the ALCS. The simpler slugging percentage plot supports the same conclusion: in 2017 and 2019 the Astros were also very dominant, at 0.47 and 0.49.

I also plotted walks by batter to test my initial hypothesis: if the Astros were getting signs ahead of time, opposing pitchers would have more trouble with Astros batters and would walk them more often. According to the data, that appears to be the case for the 2018 and 2019 seasons.

From 2017 to 2019 the Astros were also very hard to strike out. In the strikeout plot they sit toward the left in these three seasons, meaning they had fewer strikeouts than most of the league. The strikeout and walk plots show a gap in values because both stats are season totals and the data include 2020, a season shortened by the coronavirus pandemic in which teams played only 60 regular-season games.

The batting average vs winning percentage plot shows that in 2017 and 2019 the Astros were, if not leading the league, at least one of the three best-performing teams over that three-year period.

In 2019 the Astros had a truly remarkable year that surpassed the previous two. They beat their own marks from those two seasons in SLG (0.49), run differential (280), wins (107), and OPS (0.84). The intriguing question is: if the cheating was confirmed only for 2017 and 2018, why did the Astros have such an outstanding 2019 (when they again reached the World Series, although this time they lost), even better than the two seasons in which they cheated? One answer could be that the Astros were already an excellent team, good at cheating but also good at baseball; otherwise they could not have improved their batting and winning performance in 2019, when, allegedly, they did not cheat. Or at least, the league found no proof of any wrongdoing.
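
The gap in the strikeout and walk panels comes from comparing season totals across seasons of different lengths. One possible way to remove it is to divide each total by games played before plotting. The sketch below assumes a games-played column named `G`, which this notebook does not select from the source file; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: one full 162-game season and one 60-game season.
toy = pd.DataFrame({
    'yearID': [2019, 2020],
    'G': [162, 60],      # assumed games-played column, not selected in this notebook
    'SO': [1296, 480],
    'BB': [648, 240],
})

# Per-game rates put seasons of different lengths on the same scale.
toy['SO_PER_G'] = toy['SO'] / toy['G']
toy['BB_PER_G'] = toy['BB'] / toy['G']
print(toy[['yearID', 'SO_PER_G', 'BB_PER_G']])
```

In this toy table the two seasons have very different totals but identical per-game rates, which is the effect a per-game axis would have on the 2020 cluster.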