# MLB playoff teams: dimensionality analysis, principal component analysis (PCA), and playoff qualifier predictive model
A version of this notebook was originally submitted as part of my MSc Data Science grad school module, Data Science Techniques and Applications (DSTA). The assignment was:
* Part 1: analyze a Kaggle dataset, consider the main aggregate measures of the dataset (range, quality, distribution), and select a small number of key dimensions
* Part 2: select three dimensions as predictors of a fourth predicted dimension, discuss whether the three dimensions could become a good predictor, demo PCA similar to [this one](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html#sphx-glr-auto-examples-datasets-plot-iris-dataset-py), display the results graphically, and comment on the effectiveness of PCA and its dimensionality reduction

To this I intend to add:
* Part 3: based on the findings from Part 1 and Part 2, build a model for predicting MLB playoff qualifiers using dimensions like runs scored, runs allowed, OPS, and others.

#### Thanks/references
I found these Kaggle notebooks useful for similar analyses and background: [Feature selection for predicting wins](https://www.kaggle.com/michaelmtz20/feature-selection-for-predicting-wins), [Worst World Series winners since 1900](https://www.kaggle.com/cm1291/worst-world-series-winners-since-1900), [Linear regression Moneyball](https://www.kaggle.com/maglionejm/linear-regression-moneyball), and others.


#### Random interestings

* Leagues (AA, AL, FL, NL, PL, UA)
* 1962 - 165 games
* Ties
* 1981 - playoff qualifiers
* Number of teams (increase from 16 to 24 for 1914-1915)
* May need to impute who would win wild cards (under current rules in previous years). 
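
The last note (imputing wild cards for seasons played before the wild card existed) could be approached roughly as follows. This is only a sketch on a made-up season, not something the notebook does: the helper `impute_wild_cards`, its `n_wild_cards` parameter, and the toy team ids are all hypothetical, and it assumes `div_win` has already been filled with "Y"/"N".

```python
import pandas as pd

def impute_wild_cards(teams, n_wild_cards=1):
    """Flag the best non-division-winners per (year, league) as hypothetical wild cards."""
    non_winners = teams[teams["div_win"] != "Y"]
    # stable sort by wins (descending), then keep the top n rows of each season/league
    ranked = non_winners.sort_values("w", ascending=False, kind="stable")
    picks = ranked.groupby(["year", "league_id"]).head(n_wild_cards)
    return teams.index.isin(picks.index)

# made-up season: two leagues, division winners already flagged
toy = pd.DataFrame({
    "year": [1990] * 7,
    "league_id": ["AL", "AL", "AL", "AL", "NL", "NL", "NL"],
    "team_id": ["A", "B", "C", "D", "E", "F", "G"],
    "div_win": ["Y", "N", "N", "Y", "Y", "N", "N"],
    "w": [103, 95, 88, 91, 95, 91, 85],
})
toy["wc_imputed"] = impute_wild_cards(toy)
print(toy.loc[toy["wc_imputed"], "team_id"].tolist())
```

Ties in wins are broken by row order here, which is an arbitrary choice; a real imputation would need the actual tie-breaking rules.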

## Part 1: dataset and dimension analysis
This analysis focuses on **team.csv**. The full "History of Baseball" data set includes 704 columns over 29 .csv files, 1 .txt file, and 1 .sqlite file. The **team.csv** contains 2805 rows and 48 columns (in other words, 2805 observations of 48 features). These 48 features are made up of 13 categorical and 35 numerical variables (supporting documentation can be found in the [R docs for this database](https://rdrr.io/cran/Lahman/man/Teams.html)):

#### Categorical/Class Variables (exclude 9-13)
1. `league_id`: unique league identifier (6 possible classes or null)
2. `team_id`: unique team identifier (149 classes)
3. `franchise_id`: unique franchise identifier (120 classes)
4. `div_id`: unique division identifier (3 classes)
5. `div_win`: whether team won its division (Y, N, or null)
6. `wc_win`: whether team won a wild card spot (Y, N, or null)
7. `lg_win`: whether team won its league (Y, N, or null)
8. `ws_win`: whether team won the World Series (Y, N, or null)
9. `name`: team name (139 classes)
10. `park`: team home ballpark (212 classes)
11. `team_id_br`: unique Baseball Reference team identifier (101 classes)
12. `team_id_lahman45`: unique Lahman database team identifier (148 classes)
13. `team_id_retro`: unique Retrosheet team identifier (149 classes)


#### Numerical/Continuous Variables
14. `year`: calendar year (season)
15. `rank`: ordered ranking of team's finish for its division (league_id and div_id)

*Game outcomes*

16. `g`: games
17. `ghome`: home games
18. `w`: wins
19. `l`: losses

*Hitting*

20. `r`: runs
21. `ab`: at bats
22. `h`: hits
23. `double`: doubles
24. `triple`: triples
25. `hr`: home runs
26. `bb`: walks
27. `so`: strikeouts
28. `sb`: stolen bases
29. `cs`: caught stealing
30. `hbp`: hit by pitch
31. `sf`: sacrifice flies

*Pitching*

32. `ra`: runs allowed
33. `er`: earned runs
34. `era`: earned run average
35. `cg`: complete games
36. `sho`: shutouts
37. `sv`: saves
38. `ipouts`: outs pitched
39. `ha`: hits allowed
40. `hra`: home runs allowed
41. `bba`: walks allowed
42. `soa`: strikeouts by the team's pitchers

*Fielding*

43. `e`: errors
44. `dp`: double plays
45. `fp`: fielding percentage

*Home ballpark*

46. `attendance`: total attendance for home games
47. `bpf`: park factor for batters
48. `ppf`: park factor for pitchers
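
As a rough check of the categorical/numerical split described above, pandas can separate columns by dtype. The sketch below runs on a toy frame with made-up values, and the helper name `split_columns_by_kind` is hypothetical. On the real **team.csv** the same call should group text columns such as `league_id` as categorical, assuming pandas reads them as non-numeric.

```python
import pandas as pd

def split_columns_by_kind(frame):
    """Return (categorical, numerical) column names based on dtype."""
    numerical = frame.select_dtypes(include="number").columns.tolist()
    categorical = frame.select_dtypes(exclude="number").columns.tolist()
    return categorical, numerical

# toy rows with made-up values; column names follow team.csv
toy = pd.DataFrame({
    "year": [2001, 2001],
    "team_id": ["AAA", "BBB"],
    "w": [90, 72],
    "era": [3.75, 4.50],
    "div_win": ["Y", "N"],
})
categorical, numerical = split_columns_by_kind(toy)
print(len(categorical), "categorical:", categorical)
print(len(numerical), "numerical:", numerical)
```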

In [None]:
# read data, review dimensions

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('/kaggle/input/the-history-of-baseball/team.csv')

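from IPython.display import display  # a cell only renders its last bare expression, so show the others explicitly

display(df[['rank', 'g', 'w', 'l', 'r', 'ra']].describe())
df.info()
display(df[df["g"]==6])    # shortest season
display(df[df["g"]==165])  # longest season
df.head()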

# observations
#  - huge range of games
#     - min 6 games for the Baltimore Marylands in 1873 (https://en.wikipedia.org/wiki/1873_Baltimore_Marylands_season)
#     - max 165 games for the LA Dodgers and SF Giants in 1962 (https://en.wikipedia.org/wiki/1962_National_League_tie-breaker_series)
#     - need to adjust to per-game metrics and perhaps remove old data (World Series era, modern era?)
#  - null values
#     - need to update div_win, wc_win, lg_win, ws_win, so, sb, cs, hbp, sf
#     - need to drop old pre-World Series 1873-1902 data (https://en.wikipedia.org/wiki/World_Series#Modern_World_Series_(1903%E2%80%93present))

In [None]:
# remove pre-1903 teams
df = df[df.year >= 1903].copy()  # copy so later column assignments do not act on a view of the original frame
df.head()
# df.info()

In [None]:
# fill null values (reassign instead of calling fillna(inplace=True) on a single column,
# which may act on a temporary copy in recent pandas versions)
df = df.fillna({
    "league_id": "*None", "div_id": "*None", "park": "*Unknown",
    "ghome": 0, "so": 0, "sb": 0, "cs": 0, "hbp": 0, "sf": 0,
    "div_win": "N", "wc_win": "N", "lg_win": "N", "ws_win": "N",
})

# impute attendance -- use median attendance for the given year
median_attendance_by_year = df.groupby("year")["attendance"].transform("median")  # the median skips missing values
df["attendance"] = df["attendance"].fillna(median_attendance_by_year)
    
df.info()

In [None]:
plt.hist(df["g"], bins=50)
plt.xlabel('Games (season length)')
plt.ylabel('Number of observations')
plt.title("Histogram of season lengths from 1903-2015")

In [None]:
# add slugging percentage
df["single"] = df["h"] - df["hr"] - df["triple"] - df["double"]
df["total_bases"] = df["single"]+(df["double"]*2) + (df["triple"]*3) + (df["hr"]*4)
df["slg"] = df["total_bases"]/df["ab"]

df[["h", "single", "double", "triple", "hr", "total_bases"]].head()
df.columns

In [None]:
# obp + ops
# hbp and sf must be non-null for the OBP formula (reassign rather than fillna(inplace=True) on a column)
df["hbp"] = df["hbp"].fillna(0)
df["sf"] = df["sf"].fillna(0)

df["obp"] = (df["h"] + df["bb"] + df["hbp"]) / (df["ab"] + df["bb"] + df["hbp"] + df["sf"])

df["ops"] = df["slg"] + df["obp"]

df.head()

In [None]:
# spot check a few data points
df[df["year"]==1927]["slg"].max() # 1927 NY Yankees: https://www.baseball-reference.com/leagues/MLB/1927-standard-batting.shtml
df[df["year"]==2001]["obp"].max() # 2001 SEA Mariners: https://www.baseball-reference.com/leagues/MLB/2001-standard-batting.shtml
df[df["year"]==1998]["ops"].max() # 1998 NY Yankees: https://www.baseball-reference.com/leagues/MLB/1998-standard-batting.shtml

df[["slg","obp","ops"]].describe()

In [None]:
for col in ['obp', 'slg', 'ops']:
    plt.hist(df[col])
    plt.xlabel(col)
    plt.ylabel('Number of observations')
plt.title(str("Histogram of OBP (blue), SLG (orange), and OPS (green)"))

In [None]:
# checking the math -- min/max obp/slg/ops match: https://www.baseball-reference.com/leagues/MLB/2012.shtml
df[df["year"].eq(2012)][["obp", "slg", "ops"]].describe()

In [None]:
# add ties
df[df["w"] + df["l"] != df["g"]] # years where w + l != g
df["ties"] = df["g"] - df["w"] - df["l"]
df[df["w"] + df["l"] + df["ties"] != df["g"]] # years where w + l + ties != g

# transform "counting stats" (like wins, runs, home runs) to "rate stats" (winning pct, runs/game)
for col in ['w', 'l', 'ties', 'r', 'ab', 'h', 'so', 'sb', 'cs', 'sf', 'ra', 'er', 'cg', 'sho', 'sv', 'ipouts', 'ha', 'hra', 'bba', 'soa', 'e', 'dp']:
    if (col=='w') or (col=='l') or (col=='ties'):
        col_name = str(col + "_pct")
    else:
        col_name = str(col + "_per_game")
    df[col_name] = df[col] / df["g"]

In [None]:
df[['w_pct', 'l_pct', 'ties_pct', 'r_per_game', 'ra_per_game']].describe()

In [None]:
for col in ['w_pct', 'l_pct', 'ties_pct', 'r_per_game', 'ab_per_game', 'h_per_game', 'so_per_game', 'sb_per_game', 'cs_per_game', 'ra_per_game', 'er_per_game', 'cg_per_game', 'sho_per_game', 'sv_per_game', 'ipouts_per_game', 'ha_per_game', 'hra_per_game', 'bba_per_game', 'soa_per_game', 'e_per_game', 'dp_per_game']:
    plt.hist(df[col], bins=50)
    plt.xlabel(col)
    plt.ylabel('Number of observations')
    plt.title(str("Histogram of " + col))
    plt.show()

In [None]:
ws_winners = df[df["ws_win"]=="Y"]["franchise_id"]
plt.subplots(figsize=(10,5))
plt.hist(ws_winners, bins=50)
plt.title("Histogram of WS winning franchises (1903-2015)")

In [None]:
# make playoffs
df["make_playoffs_rank_1"] = df["rank"]==1
df["make_playoffs_wild_card"] = df["wc_win"]=="Y"
df["make_playoffs_win_division"] = df["div_win"]=="Y"

df["make_playoffs"] = df["make_playoffs_rank_1"] | df["make_playoffs_wild_card"] | df["make_playoffs_win_division"]

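# 1981 strike-shortened split season: playoff spots went to the winners of each half (flagged via div_win even when not rank 1),
# while CIN and STL finished with rank 1 overall records but did not qualify, so they are reset to False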
df['make_playoffs'] = np.where( ( (df['year'] == 1981) & ((df['franchise_id'] == "CIN") | (df['franchise_id'] == "STL")) ), False, df["make_playoffs"])


df.query('year == 1981')[["franchise_id","make_playoffs"]]


In [None]:
playoff_qualifiers = df[df["make_playoffs"]==True]["franchise_id"]
plt.subplots(figsize=(15,5))
plt.hist(playoff_qualifiers, bins=100)
plt.title("Histogram of playoff qualifying franchises (1903-2015)")

In [None]:
# number of playoff teams / year
year_teams_qualifiers = []
n_teams = []
n_playoff_qualifiers = []
# median_season_lengths = []
years = range(1903, 2016)
year_labels = []
for y in years:
    year_labels.append("'" + str(y)[-2] + str(y)[-1])

for year in years:
    teams_in_year = df[df["year"].eq(year)]
    n_teams_in_year = len(teams_in_year)
    playoff_qualifiers_in_year = teams_in_year[teams_in_year["make_playoffs"].eq(True)]
    n_playoff_qualifiers_in_year = len(playoff_qualifiers_in_year)
    n_teams.append(n_teams_in_year)
    n_playoff_qualifiers.append(n_playoff_qualifiers_in_year)
#     median_season_lengths.append(teams_in_year[["g"]].median())
    
# print(median_season_lengths)

x = np.arange(len(years))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(40,10))
rects1 = ax.bar(x - width/2, n_teams, width, label="Overall")
rects2 = ax.bar(x + width/2, n_playoff_qualifiers, width, label="Playoff Qualifiers")
# rects3 = ax.bar(x + width/3, median_season_lengths, width, label="Season length")

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Number of Teams')
ax.set_title('Number of MLB teams and playoff qualifiers by year')
ax.set_xticks(x)
ax.set_xticklabels(year_labels)
ax.legend()


def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')


autolabel(rects1)
autolabel(rects2)
# autolabel(rects3)

fig.tight_layout()

plt.show()

In [None]:
# h/t https://www.kaggle.com/maglionejm/linear-regression-moneyball
# plot suggesting that playoff qualifiers typically need to win 55-65% of games
import seaborn as sns
sns.lmplot(x="w_pct", y="year", fit_reg=False, hue="make_playoffs", data=df, height=5, aspect=1.5)
plt.xlabel("Winning Percentage", fontsize = 15)
plt.ylabel("Year", fontsize = 15)
plt.axvline(0.575, 0, 1, color = "Black", ls = '--')
plt.show()

In [None]:
n_teams = len(df)
playoff_qualifiers = df[df["make_playoffs"] == True]
n_playoff_qualifiers = len(playoff_qualifiers) # 412
for winning_pct in [0.5, 0.55, 0.575, 0.6, 0.65]:
    exceeds = df["w_pct"] >= winning_pct
    made_playoffs = df["make_playoffs"].astype(bool)
    n_exceed_win_pct = exceeds.sum() # 647
    n_playoff_qualifiers_exceed_win_pct = (exceeds & made_playoffs).sum() # 368
    n_miss_playoffs_exceed_win_pct = (exceeds & ~made_playoffs).sum() # 279

    # counts are printed as-is; only ratios are formatted as percentages
    print("--------")
    print(f"Winning percentage of {winning_pct}:")
    print("--------")
    print(f"{n_exceed_win_pct} of {n_teams} teams won at least {winning_pct:.1%} of games in a season ({n_exceed_win_pct/n_teams:.1%})")
    print(f"{n_playoff_qualifiers_exceed_win_pct} of those {n_exceed_win_pct} teams qualified for the playoffs ({n_playoff_qualifiers_exceed_win_pct/n_exceed_win_pct:.1%})")
    print(f"{n_miss_playoffs_exceed_win_pct} teams won at least {winning_pct:.1%} of games and still missed the playoffs ({n_miss_playoffs_exceed_win_pct/n_exceed_win_pct:.1%})")
    print(f"{n_playoff_qualifiers_exceed_win_pct} out of {n_playoff_qualifiers} playoff qualifiers won at least {winning_pct:.1%} of games to qualify ({n_playoff_qualifiers_exceed_win_pct/n_playoff_qualifiers:.1%})")
    print()

In [None]:
import seaborn as sns

plt.subplots(figsize=(15,15))
numeric_correlations = df.corr(numeric_only=True) # correlations between numeric variables (non-numeric columns raise an error in pandas >= 2.0)
sns.heatmap(numeric_correlations, xticklabels=1, yticklabels=1)

In [None]:
sns.pairplot(df[["w_pct", "obp", "slg", "ops", "r_per_game", "ra_per_game", "make_playoffs"]], hue="make_playoffs")

## Part 2: Principal Component Analysis (PCA)

* Select three dimensions as predictors of a fourth predicted dimension
* Discuss whether the three dimensions could become a good predictor
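* Demo Principal Component Analysis similar to [this one](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html#sphx-glr-auto-examples-datasets-plot-iris-dataset-py)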
* Display the results graphically
* Comment on the effectiveness of PCA and its dimensionality reduction
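
To make the effect of scaling on PCA concrete, here is a minimal sketch that computes the first principal axis directly from the covariance matrix with NumPy. It uses synthetic data rather than team.csv: the two columns only mimic the scales of OPS (small spread) and runs per game (larger spread), and the helper `first_principal_axis` is a hypothetical name.

```python
import numpy as np

def first_principal_axis(X):
    """Unit vector of the first principal component (rows of X are observations)."""
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigenvectors[:, -1]  # eigenvector with the largest eigenvalue

# synthetic, correlated features on very different scales
rng = np.random.default_rng(0)
runs = rng.normal(4.5, 0.5, size=500)
ops = 0.70 + 0.05 * (runs - 4.5) + rng.normal(0, 0.02, size=500)
X = np.column_stack([ops, runs])

raw_axis = first_principal_axis(X)
scaled_axis = first_principal_axis((X - X.mean(axis=0)) / X.std(axis=0))

print("loadings without scaling:", np.abs(raw_axis))
print("loadings with scaling:   ", np.abs(scaled_axis))
```

Without scaling, the first component points almost entirely along the column with the larger variance; after standardization both columns contribute equally. This is the reason the scaled and unscaled pipelines are compared below.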

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

# ------------------------------------------------------------------------------------------------
# Load data and set up three dimensions that could be predictor of a fourth predicted dimension
# ------------------------------------------------------------------------------------------------

df = pd.read_csv('/kaggle/input/the-history-of-baseball/team.csv')

df = df[df["year"] >= 1903] # remove pre-1903 observations (before consistent playoffs)
df = df[df["year"] != 1904] # remove non-playoff season
df = df[df["year"] != 1994].copy() # remove non-playoff season; copy so new columns can be added safely

# ------------------------------------
# add OPS dimension (OBP + SLG)
# ------------------------------------
df["hbp"] = df["hbp"].fillna(0)  # reassign: fillna(inplace=True) on a column may act on a temporary copy
df["sf"] = df["sf"].fillna(0)
df["single"] = df["h"] - df["hr"] - df["triple"] - df["double"]
df["total_bases"] = df["single"] + (df["double"] * 2) + (df["triple"] * 3) + (df["hr"] * 4)
df["slg"] = df["total_bases"] / df["ab"]
df["obp"] = (df["h"] + df["bb"] + df["hbp"]) / (df["ab"] + df["bb"] + df["hbp"] + df["sf"])
df["ops"] = df["slg"] + df["obp"]

# ------------------------------------
# add make_playoffs dimension (rank 1, division winner, or wild card)
# ------------------------------------
df["make_playoffs_rank"] = df["rank"]==1
df["make_playoffs_wc"] = df["wc_win"]=="Y"
df["make_playoffs_div"] = df["div_win"]=="Y"
df["make_playoffs"] = df["make_playoffs_rank"] | df["make_playoffs_wc"] | df["make_playoffs_div"]
# (1981 strike-shortened split season: CIN and STL had rank 1 overall records but did not qualify)
df["make_playoffs"] = np.where( ( (df["year"] == 1981) & ((df["franchise_id"] == "CIN") | (df["franchise_id"] == "STL")) ), False, df["make_playoffs"])

# ------------------------------------
# convert features to per-game rates and select 3+1 dimensions
# ------------------------------------
for col in ['w', 'l', 'r', 'ab', 'h', 'so', 'sb', 'cs', 'sf', 'ra', 'er', 'cg', 'sho', 'sv', 'ipouts', 'ha', 'hra', 'bba', 'soa', 'e', 'dp']:
    df[str(col + "_per_game")] = df[col] / df["g"]
dimensions = ["ops", "r_per_game", "ra_per_game", "make_playoffs"]
predictor_dimensions = ["ops", "r_per_game", "ra_per_game"]
_df = df[dimensions]

# ------------------------------------
# generate pairplot
# ------------------------------------
sns.pairplot(_df, hue="make_playoffs")

# ------------------------------------
# standardize predictors (zero mean, unit variance)
# ------------------------------------
scaler = StandardScaler()
df_scaled = _df.copy()
df_scaled[predictor_dimensions] = scaler.fit_transform(_df[predictor_dimensions])

# ------------------------------------------------------------------------------------------------
# function to generate 3D plots of predictor dimensions
# ------------------------------------------------------------------------------------------------
def generate_3d_plot(x, y, z, title, axis_labels=("OPS", "Runs per game", "Runs allowed per game")):
    fig = plt.figure(figsize=(6,6))
    ax = fig.add_subplot(projection="3d")  # Axes3D(fig) no longer attaches itself to the figure in recent matplotlib
    # orange = qualified for playoffs, blue = did not qualify
    colors = ["tab:orange" if mp else "tab:blue" for mp in df["make_playoffs"]]

    ax.scatter(x, y, z, s=40, c=colors)
    ax.set_xlabel(axis_labels[0])
    ax.set_ylabel(axis_labels[1])
    ax.set_zlabel(axis_labels[2])
    ax.set_title(title)
    plt.savefig("scatter_hue", bbox_inches='tight')  # note: every call overwrites the same file

# ------------------------------------------------------------------------------------------------
# generate 3D plots with **unscaled** and **scaled** data for three dimensions
# (default axis labels match the argument order: ops, r_per_game, ra_per_game)
# ------------------------------------------------------------------------------------------------
generate_3d_plot(df["ops"], df["r_per_game"], df["ra_per_game"], "3D plot of predictor dimensions")
generate_3d_plot(df_scaled["ops"], df_scaled["r_per_game"], df_scaled["ra_per_game"], "3D plot of *scaled* predictor dimensions")

# ------------------------------------------------------------------------------------------------
# **PCA**
# generate plots with first three PCA dimensions, as in this example:
# https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html#sphx-glr-auto-examples-datasets-plot-iris-dataset-py
# ------------------------------------------------------------------------------------------------
pca_labels = ("1st principal component", "2nd principal component", "3rd principal component")
X_reduced_unscaled = PCA(n_components=3).fit_transform(df[predictor_dimensions])
generate_3d_plot(X_reduced_unscaled[:, 0], X_reduced_unscaled[:, 1], X_reduced_unscaled[:, 2], "3D plot of unscaled PCA dimensions", axis_labels=pca_labels)
X_reduced_scaled = PCA(n_components=3).fit_transform(df_scaled[predictor_dimensions])
generate_3d_plot(X_reduced_scaled[:, 0], X_reduced_scaled[:, 1], X_reduced_scaled[:, 2], "3D plot of *scaled* PCA dimensions", axis_labels=pca_labels)

# ------------------------------------------------------------------------------------------------
# **PCA**
# importance of feature scaling, as demonstrated here:
# https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
# ------------------------------------------------------------------------------------------------

X_train, X_test, y_train, y_test = train_test_split(df[predictor_dimensions], df["make_playoffs"],
                                                    test_size=0.30,
                                                    random_state=42)

unscaled_clf = make_pipeline(PCA(n_components=2), GaussianNB())
unscaled_clf.fit(X_train, y_train)
pred_test = unscaled_clf.predict(X_test)

# Fit to data and predict using pipelined scaling, PCA, and GNB.
std_clf = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())
std_clf.fit(X_train, y_train)
pred_test_std = std_clf.predict(X_test)

# Show prediction accuracies in scaled and unscaled data.
print('\nPrediction accuracy for the unscaled test dataset with PCA')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))

print('\nPrediction accuracy for the standardized test dataset with PCA')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test_std)))

# Extract PCA from pipeline
pca = unscaled_clf.named_steps['pca']
pca_std = std_clf.named_steps['pca']

# Show principal components
print('\nPC 1 without scaling:\n', pca.components_[0])
print('\nPC 1 with scaling:\n', pca_std.components_[0])
print('\nPC 2 without scaling:\n', pca.components_[1])
print('\nPC 2 with scaling:\n', pca_std.components_[1])

# Use PCA without and with scale on X_train data for visualization.
X_train_transformed = pca.transform(X_train)
scaler = std_clf.named_steps['standardscaler']
X_train_std_transformed = pca_std.transform(scaler.transform(X_train))

# visualize standardized vs. untouched dataset with PCA performed
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 7))

def get_label_for_class(val):
    if val==0:
        return "False (failed to qualify for playoffs)"
    else:
        return "True (qualified for playoffs)"

for l, c, m in zip(range(0, 2), ('tab:blue', 'tab:orange'), ('^', 's')):
    ax1.scatter(X_train_transformed[y_train == l, 0],
                X_train_transformed[y_train == l, 1],
                color=c,
                label=get_label_for_class(l),
                alpha=0.5,
                marker=m)

for l, c, m in zip(range(0, 2), ('tab:blue', 'tab:orange'), ('^', 's')):
    ax2.scatter(X_train_std_transformed[y_train == l, 0],
                X_train_std_transformed[y_train == l, 1],
                color=c,
                label=get_label_for_class(l),
                alpha=0.5,
                marker=m)

ax1.set_title('Training dataset after PCA')
ax2.set_title('Standardized training dataset after PCA')

for ax in (ax1, ax2):
    ax.set_xlabel('1st principal component')
    ax.set_ylabel('2nd principal component')
    ax.legend(loc='upper right')
    ax.grid()

plt.tight_layout()

plt.show()

In [None]:
# h/t https://www.kaggle.com/michaelmtz20/feature-selection-for-predicting-wins
import pylab as pl

N = 3

pca = PCA(n_components=N)
X_pca = pca.fit_transform(X_train)
vals = pca.explained_variance_ratio_
print(vals)

ind = np.arange(N)  # the x locations for the groups

pl.figure(figsize=(10, 6), dpi=250)
ax = pl.subplot()
ax.bar(ind, pca.explained_variance_ratio_)
for i in range(N):
    ax.annotate(r"%d%%" % (int(vals[i]*100)), (ind[i], vals[i]), va="bottom", ha="center", fontsize=12)
    ax.annotate("PC " + str(i + 1), (ind[i], vals[i] + 0.05), va="bottom", ha="center", fontsize=12)  # number components from 1

ax.set_xlabel("Principal Component", fontsize=12)
ax.set_ylabel("Proportion of variance explained", fontsize=12)
ax.set_ylim(0, .80)
pl.title("% variance explained by " + str(N) + " principal components")


## Part 3: Predict playoff qualifiers for test data

The Gaussian Naive Bayes model (standardized features reduced to two principal components) predicts playoff qualifiers relatively well, but conservatively: it rarely names a team that missed the playoffs, and it often names fewer teams than there were playoff spots. Because the model was fit on a random 70% sample of all seasons, some of the 2000-2015 team-seasons predicted below were also part of its training data, so these results are likely optimistic.

Takeaways:
* The model needs to be refined to predict the **correct number of qualifiers for a given year**. It does not always predict X qualifiers in years with X playoff spots. For example, in 2000 the model predicts a single playoff qualifier (SFG), even though 8 teams qualified (OAK, ATL, STL, NYY, SFG, SEA, NYM, and CHW).
* Low false positive rate (type I error). In 11 of the 16 predicted years (2000-2001, 2006-2010, 2012-2015) there are 0 false positives; in the other 5 (2002-2005, 2011) there are 6 false positives in total, where a predicted playoff qualifier did not actually qualify.
* The best predictive performance is in 2002, where the model predicted the full 8 qualifiers and correctly identified 7 of them. Its only mistake was predicting BOS (93-69, 6 games behind ANA for the wild card) instead of MIN (won the AL Central at 94-67). See the [2002 standings](https://www.baseball-reference.com/leagues/MLB/2002-standings.shtml).
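
One possible way to address the first takeaway is to rank teams within each year by their predicted probability and keep exactly as many as there were playoff spots, instead of using the classifier's default decision threshold. The sketch below uses made-up probabilities; the helper `pick_top_k_per_year` and the `spots` mapping are hypothetical. With the fitted pipeline, the `prob` column could presumably come from `std_clf.predict_proba(X)[:, 1]`, assuming the second column corresponds to the `True` class.

```python
import pandas as pd

def pick_top_k_per_year(scores, spots):
    """Mark the k highest-probability teams in each year, where k = spots[year]."""
    # rank 1 = highest probability within the year; ties broken by row order
    ranks = scores.groupby("year")["prob"].rank(method="first", ascending=False)
    k = scores["year"].map(spots)
    return ranks <= k

# made-up probabilities for four teams in two seasons
scores = pd.DataFrame({
    "year": [2000] * 4 + [2001] * 4,
    "franchise_id": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "prob": [0.90, 0.40, 0.30, 0.10, 0.20, 0.45, 0.60, 0.05],
})
spots = {2000: 2, 2001: 2}  # playoff spots available per year (toy values)

picked = pick_top_k_per_year(scores, spots)
print(scores[picked])
```

In this toy data a 0.5 threshold would select only one team per year, while the ranking approach always fills both spots.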

In [None]:
# as shown in previous section
std_clf = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())
std_clf.fit(X_train, y_train)

# predict playoff qualifiers for a given year (2000-2015)
for year in range(2000, 2016):
    _df = df[df["year"]==year].copy()  # copy so the prediction column can be added without a SettingWithCopyWarning
    X = _df[predictor_dimensions]

    _df["predicted_playoff_qualifier"] = std_clf.predict(X)
    print("--------")
    print(year)
    print("--------")
    print("Actual playoff qualifiers in " + str(year) + ":")
    actual = set(_df[_df["make_playoffs"]==True]["franchise_id"])
    print(actual)
    print("Predicted playoff qualifiers in " + str(year) + ":")
    predicted = set(_df[_df["predicted_playoff_qualifier"]==True]["franchise_id"])
    print(predicted)
    print()
    incorrect = predicted.difference(actual)
    print("Incorrect predictions (false positives) " + str(len(incorrect)) + ":")
    print(incorrect)
    exclusions = actual.difference(predicted)
    print("Incorrect exclusions from prediction (false negatives) " + str(len(exclusions)) + ":")
    print(exclusions)
    print()
    print()