# MLB World Series winners: offense and pitching

Taking a cue from [this "Worst World Series Winners Since 1900" notebook](https://www.kaggle.com/cm1291/worst-world-series-winners-since-1900) and continuing my ["MLB Playoff Teams: Dimensionality, PCA, prediction" notebook](https://www.kaggle.com/cekohlbrenner/mlb-playoff-teams-dimensionality-pca-prediction), I took a closer look at team performances that culminate in a World Series win.

A few takeaways:
* Many of the best teams of all time (appearing on the top-200 list for both relative runs per game and runs allowed per game) failed to win the World Series. Only 10 of the 21 teams in the top 200 of both lists won it all.
* Historic greatness on offense OR pitching/defense alone does not guarantee World Series wins. Four of the top five offenses failed to win the World Series. Four of the top five defensive/pitching teams failed to win it all.
* Only 19 teams have won a World Series with below average runs scored or runs allowed. Most of those (16) had below average offense; only 3 won the World Series with a runs allowed metric worse than league average.

**In short, baseball is hard and has lots of randomness -- regular season success does not guarantee a World Series win.**



To begin the analysis, the code below looks at some of the World Series winners with the most and fewest **runs scored per game** and **runs allowed per game**:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("/kaggle/input/the-history-of-baseball/team.csv")
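# modern era only (seasons after 1900); copy so the new columns below are
# added to an independent frame rather than to a slice of the raw data
df = df[df["year"] > 1900].copy()

# per-game rates for wins, runs scored, and runs allowed
for col in ["w", "r", "ra"]:
    df[col + "_per_game"] = df[col] / df["g"]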

# WS winners in the modern era (since 1900)
df_ws = df[df["ws_win"]=="Y"]

plt.xlabel("Runs Scored Per Game")
plt.ylabel("Runs Allowed Per Game")
plt.title("WS winners in the modern era")

plt.scatter(df_ws["r_per_game"], df_ws["ra_per_game"])

plt.annotate('1936 NY Yankees', xy=(6.87, 4.7), xycoords='data', xytext=(6.5, 4.25), size=10, color="navy", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
plt.annotate('1916 Boston Red Sox', xy=(3.5, 3.07), xycoords='data', xytext=(2.5, 2.7), size=10, color="navy", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
plt.annotate('2000 NY Yankees', xy=(5.41, 5.06), xycoords='data', xytext=(6, 5), size=10, color="navy", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
plt.annotate('1907 Chicago Cubs', xy=(3.7, 2.5), xycoords='data', xytext=(4.25, 2.5), size=10, color="navy", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
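
The annotation coordinates in the cell above are typed in by hand. As a rough sketch, the same extremes could be looked up programmatically with `idxmin` and `idxmax`. The toy frame below stands in for `df_ws`; its values, the `name` column, and the helper `find_extremes` are assumptions for illustration and are not taken from `team.csv`.

```python
import pandas as pd

# Toy stand-in for df_ws; values are illustrative, not taken from team.csv.
toy_ws = pd.DataFrame({
    "year": [2001, 2002, 2003, 2004],
    "name": ["Team A", "Team B", "Team C", "Team D"],  # hypothetical column
    "r_per_game": [4.2, 6.1, 3.4, 5.0],
    "ra_per_game": [3.9, 4.8, 2.6, 5.3],
})

def find_extremes(frame, col):
    """Return the (lowest, highest) rows of `frame` for column `col`."""
    return frame.loc[frame[col].idxmin()], frame.loc[frame[col].idxmax()]

extremes = {}
for col in ["r_per_game", "ra_per_game"]:
    low, high = find_extremes(toy_ws, col)
    extremes[col] = (low["name"], high["name"])
    print(f"{col}: lowest {low['year']} {low['name']}, "
          f"highest {high['year']} {high['name']}")
```

The returned rows carry the x/y values needed for `plt.annotate`, so the labels would follow the data if the input file changes.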

# Hitting and pitching performance relative to league

Rather than the absolute metrics above, we now look at runs scored and runs allowed *relative* to the league average in a given year. This normalizes for the scoring environment: if scoring is historically low in a given year, a team's runs scored or runs allowed are more meaningful when expressed *relative* to that year's league average.

Taking the `df[["r_per_game_over_avg", "ra_per_game_over_avg"]].describe()` output below, we see:

||r_per_game_over_avg|ra_per_game_over_avg|
|-----|-------------------|--------------------|
|count|2.422000e+03|2.422000e+03|
|mean|-1.392363e-18|-6.302876e-19|
|std|1.126390e-01|1.164116e-01|
|min|-3.605343e-01|-5.118569e-01|
|25%|-7.416388e-02|-7.464466e-02|
|50%|-6.051719e-03|9.055717e-03|
|75%|7.549824e-02|8.056344e-02|
|max|4.310457e-01|3.209090e-01|

* Mean values are effectively 0
* The worst `r_per_game_over_avg` is -0.36, meaning the worst team in history scored 36% fewer runs/game than league average
* The best `r_per_game_over_avg` is 0.43, meaning the best team scored 43% more runs/game than league average
* The worst `ra_per_game_over_avg` is -0.51, meaning the worst team allowed 51% more runs/game than league average
* The best `ra_per_game_over_avg` is 0.32, meaning the best team allowed 32% fewer runs/game than league average
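
To make the sign convention concrete, here is a small worked example on a made-up four-team season (the numbers are assumptions chosen so the league average is a round 4.0). Runs scored are measured as (team - league) / league, while runs allowed are flipped to (league - team) / league so that a positive value is good in both columns.

```python
import pandas as pd

# Toy single-season league; values are made up for illustration.
toy = pd.DataFrame({
    "year": [1950, 1950, 1950, 1950],
    "team": ["A", "B", "C", "D"],
    "r_per_game": [5.0, 4.0, 4.5, 2.5],
    "ra_per_game": [3.0, 4.0, 5.0, 4.0],
})

league_r = toy["r_per_game"].mean()    # league average runs scored per game
league_ra = toy["ra_per_game"].mean()  # league average runs allowed per game

# Positive = better than league average in both columns.
toy["r_per_game_over_avg"] = (toy["r_per_game"] - league_r) / league_r
toy["ra_per_game_over_avg"] = (league_ra - toy["ra_per_game"]) / league_ra

print(toy[["team", "r_per_game_over_avg", "ra_per_game_over_avg"]])
```

Team A scores 25% more and allows 25% fewer runs than average, so it gets 0.25 in both columns; team C allows 25% more than average and gets -0.25 for runs allowed.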

In [None]:
# League-average runs scored / allowed per game for each season, broadcast
# back onto every team row of that season
df["r_per_game_year_avg"] = df.groupby("year")["r_per_game"].transform("mean")
df["ra_per_game_year_avg"] = df.groupby("year")["ra_per_game"].transform("mean")

df["r_per_game_over_avg"] = (df["r_per_game"] - df["r_per_game_year_avg"]) / df["r_per_game_year_avg"]
# invert order of ra_per_game and average, so that larger values are better performance
df["ra_per_game_over_avg"] = (df["ra_per_game_year_avg"] - df["ra_per_game"]) / df["ra_per_game_year_avg"]

df[["r_per_game_over_avg", "ra_per_game_over_avg"]].describe()

Graphing every team's `r_per_game_over_avg` against `ra_per_game_over_avg` surfaces a few outliers and raises some questions:
* Percentage of World Series winners with below average runs/game?
* Percentage of World Series winners with below average runs allowed/game?
* Strong non-WS winners (1902 PIT, 1906 CHC, 1931 Yankees)
* Weak outliers overall (1942 PHI, 1915 PHA)
* Weak WS winners (1987 MIN, 2006 STL)
* Strongest overall?
    - Can I identify teams in top 10% of both runs and runs allowed?

In [None]:
df_ws = df[df["ws_win"]=="Y"]
df_non_ws = df[df["ws_win"]!="Y"]

plt.figure( figsize=(12, 12))

plt.xlabel("Hitting Performance (Runs/Game, relative to year average)")
plt.ylabel("Pitching/Fielding Performance (Runs Allowed/Game, relative to year average)")
plt.title("WS winners in the modern era")

plt.scatter(df_non_ws["r_per_game_over_avg"], df_non_ws["ra_per_game_over_avg"], c=(0.5, 0.5, 0.5, 0.3))

plt.scatter(df_ws["r_per_game_over_avg"], df_ws["ra_per_game_over_avg"], c="tab:blue")

plt.ylim(top=0.4)
plt.xlim(right=0.5)

plt.annotate("1933 NY Giants", xy=(-0.091, 0.264), xycoords="data", xytext=(-0.22, 0.33), size=10, color="tab:blue", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
plt.annotate("1995 ATL Braves", xy=(-0.076, 0.226), xycoords="data", xytext=(-0.3, 0.3), size=10, color="tab:blue", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))

plt.annotate("1939 NY Yankees", xy=(0.319, 0.242), xycoords="data", xytext=(0.33, 0.30), size=10, color="tab:blue", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
plt.annotate("1927 NY Yankees", xy=(0.325, 0.186), xycoords="data", xytext=(0.34, 0.21), size=10, color="tab:blue", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
plt.annotate("1936 NY Yankees", xy=(0.324, 0.091), xycoords="data", xytext=(0.35, 0.08), size=10, color="tab:blue", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
plt.annotate("1976 CIN Reds", xy=(0.324, 0.022), xycoords="data", xytext=(0.35, 0.03), size=10, color="tab:blue", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))

plt.annotate("1907 CHI Cubs", xy=(0.05, 0.287), xycoords="data", xytext=(-0.02, 0.32), size=10, color="tab:blue", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))

# weak WS winners
plt.annotate("2006 STL Cardinals", xy=(-0.001, 0.026), xycoords="data", xytext=(-0.3, 0.1), size=10, color="tab:blue", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
plt.annotate("1987 MIN Twins", xy=(0.027, -0.053), xycoords="data", xytext=(0, -0.2), size=10, color="tab:blue", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))

# strong teams that did not win the WS
plt.annotate("1931 NY Yankees", xy=(0.431, -0.019), xycoords="data", xytext=(0.375, -0.07), size=10, color="tab:gray", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
plt.annotate("1902 PIT Pirates", xy=(0.231, 0.302), xycoords="data", xytext=(0.14, 0.37), size=10, color="tab:gray", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
plt.annotate("1906 CHI Cubs", xy=(0.258, 0.321), xycoords="data", xytext=(0.24, 0.35), size=10, color="tab:gray", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))

# weakest all-time
plt.annotate("1942 PHI Phillies", xy=(-0.361, -0.144), xycoords="data", xytext=(-0.38, -0.26), size=10, color="tab:gray", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))
plt.annotate("1915 PHA Athletics", xy=(-0.072, -0.512), xycoords="data", xytext=(-0.28, -0.45), size=10, color="tab:gray", arrowprops=dict(arrowstyle="simple",fc="0", ec="none"))

plt.annotate("Strong Offense / Strong Pitching", xy=(0.28, 0.385))
plt.annotate("Weak Offense / Strong Pitching", xy=(-0.4, 0.385))
plt.annotate("Strong Offense / Weak Pitching", xy=(0.29, -0.55))
plt.annotate("Weak Offense / Weak Pitching", xy=(-0.4, -0.55))

plt.scatter(0, 0, c="black")

plt.annotate("League average", xy=(0, 0), xycoords="data", xytext=(-0.11, 0), size=10, color="black")

plt.axvline(x=0, color="tab:gray", ls="--")
plt.axhline(y=0, color="tab:gray", ls="--")
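
The four quadrants labelled in the plot can also be tabulated rather than read off by eye. The sketch below uses a toy frame with invented values. It assumes a team exactly at league average counts as "strong", and that missing `ws_win` values should be treated as non-winners.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "r_per_game_over_avg":  [0.20, -0.05, 0.10, -0.15, 0.03, -0.02],
    "ra_per_game_over_avg": [0.15, 0.25, -0.05, -0.30, 0.08, 0.02],
    "ws_win": ["Y", "Y", "Y", "N", "N", "Y"],
})

strong_offense = toy["r_per_game_over_avg"] >= 0
strong_pitching = toy["ra_per_game_over_avg"] >= 0

# Label each team with the quadrant names used on the plot.
toy["quadrant"] = np.select(
    [strong_offense & strong_pitching,
     ~strong_offense & strong_pitching,
     strong_offense & ~strong_pitching],
    ["Strong Offense / Strong Pitching",
     "Weak Offense / Strong Pitching",
     "Strong Offense / Weak Pitching"],
    default="Weak Offense / Weak Pitching",
)

# Count World Series winners and non-winners per quadrant.
quadrant_counts = pd.crosstab(toy["quadrant"], toy["ws_win"].fillna("N"))
print(quadrant_counts)
```

Run against the full `df`, the "Y" column of this table would answer the two percentage questions in the list above directly.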

Digging a bit deeper, we confirm that a top offense OR a top defense/pitching staff *alone* is not predictive of World Series victory. That is, when we consider the top 5, 10, 20, 50, or 100 teams by `r_per_game_over_avg` or `ra_per_game_over_avg`, neither one indicates a >50% likelihood of winning the World Series:

In [None]:
# best N offenses in history
df_sorted_r_per_game_over_avg = df.copy()
df_sorted_r_per_game_over_avg.sort_values(by=['r_per_game_over_avg'], inplace=True, ascending=False)

print("World Series likelihood for teams in the top N of r_per_game_over_avg")
for i in [5, 10, 20, 50, 100]:
    num_ws_winners_among_top_r_scorers = len(df_sorted_r_per_game_over_avg.head(i)[df_sorted_r_per_game_over_avg.head(i)["ws_win"]=="Y"])
    pct = num_ws_winners_among_top_r_scorers / i * 100
    print(str(num_ws_winners_among_top_r_scorers) + " of " + str(i) + " (" + str(pct) + "%) top teams by r_per_game_over_avg won the World Series")

# best N pitching/defenses in history
df_sorted_ra_per_game_over_avg = df.copy()
df_sorted_ra_per_game_over_avg.sort_values(by=['ra_per_game_over_avg'], inplace=True, ascending=False)

print()
print("World Series likelihood for teams in the top N of ra_per_game_over_avg")
for i in [5, 10, 20, 50, 100]:
    num_ws_winners_among_top_ra = len(df_sorted_ra_per_game_over_avg.head(i)[df_sorted_ra_per_game_over_avg.head(i)["ws_win"]=="Y"])
    pct = num_ws_winners_among_top_ra / i * 100
    print(str(num_ws_winners_among_top_ra) + " of " + str(i) + " (" + str(pct) + "%) top teams by ra_per_game_over_avg won the World Series")
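
The two loops above differ only in the column they sort by, so the counting step could be factored into one helper. This is a sketch on toy data; the function name `ws_share_in_top_n` is my own, and it assumes `n` does not exceed the number of rows.

```python
import pandas as pd

def ws_share_in_top_n(frame, metric, n):
    """Return (winners, share) among the n best rows of `frame` by `metric`."""
    top = frame.nlargest(n, metric)
    winners = int((top["ws_win"] == "Y").sum())
    return winners, winners / n

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "r_per_game_over_avg": [0.40, 0.30, 0.20, 0.10, 0.00, -0.10],
    "ws_win":              ["N",  "Y",  "N",  "Y",  "N",  "Y"],
})

for n in [2, 4]:
    winners, share = ws_share_in_top_n(toy, "r_per_game_over_avg", n)
    print(f"{winners} of {n} ({share:.1%}) top teams won the World Series")
```

Calling it once per metric would also avoid the copy-and-paste slip of printing the wrong column name in the second header.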




Combining a top offense (`r_per_game_over_avg`) and a top defense/pitching staff (`ra_per_game_over_avg`) *together* doesn't guarantee a World Series win either.

For example, only 2 teams in history have finished in the top 50 of both `r_per_game_over_avg` AND `ra_per_game_over_avg`, but one of those teams failed to win the World Series (sorry, 1906 Chicago Cubs; congrats, 1939 NY Yankees).

Likewise, only 6 teams in history have finished in the top 100 of both `r_per_game_over_avg` AND `ra_per_game_over_avg`, and only two of those teams won the World Series: the 1927 Yankees join the 1939 Yankees as winners, the 1902 Pirates and 1904 Giants missed out (no World Series those years), and the 1942 Yankees join the 1906 Cubs as historically great teams who failed to win it all.

The pattern holds for teams in the top 200 (only 10 out of 21 won it all, or 47.6% of teams in the top 200 of both lists), top 300 (18/42, 42.9%), top 500 (43/113, 38.1%), or top 1000 (81/452, 17.9%).

### Teams in top-200 for both `r_per_game_over_avg` AND `ra_per_game_over_avg`
(11 non-World Series winners should be in contention for best teams all-time that failed to win it all)

|year|franchise_id|ws_win|
|----|------------|------|
|1902|PIT||
|1904|SFG||
|1905|SFG|Y|
|1906|CHC||
|1909|PIT|Y|
|1910|CHC||
|1912|SFG||
|1917|CHW|Y|
|1927|NYY|Y|
|1939|NYY|Y|
|1942|LAD||
|1942|NYY||
|1942|STL|Y|
|1944|STL|Y|
|1947|NYY|Y|
|1948|CLE|Y|
|1954|NYY||
|1969|BAL||
|1974|LAD||
|1998|NYY|Y|
|2001|SEA||



In [None]:
df["year_str"] = df["year"]
df.year_str = df.year_str.astype(str)
df["team_year"] = df["franchise_id"] + df["year_str"]

# best N offenses in history
df_sorted_r_per_game_over_avg = df.copy()
df_sorted_r_per_game_over_avg.sort_values(by=['r_per_game_over_avg'], inplace=True, ascending=False)

# best N pitching/defenses in history
df_sorted_ra_per_game_over_avg = df.copy()
df_sorted_ra_per_game_over_avg.sort_values(by=['ra_per_game_over_avg'], inplace=True, ascending=False)


print("World Series likelihood for teams in the top N of BOTH r_per_game_over_avg and ra_per_game_over_avg")
for i in [50, 100, 200, 300, 500, 1000, 1500, 2000]:
    top_offenses = set(df_sorted_r_per_game_over_avg.head(i)["team_year"])
    top_defenses = set(df_sorted_ra_per_game_over_avg.head(i)["team_year"])
    n_top_offense_and_defense = len(top_offenses.intersection(top_defenses))
    # print(top_offenses.intersection(top_defenses))
    
    print("There are " + str(n_top_offense_and_defense) + " teams in history with both a top-" + str(i) + " offense AND defense")
    offense_winners = set(df_sorted_r_per_game_over_avg.head(i)[df_sorted_r_per_game_over_avg.head(i)["ws_win"]=="Y"]["team_year"])
    defense_winners = set(df_sorted_ra_per_game_over_avg.head(i)[df_sorted_ra_per_game_over_avg.head(i)["ws_win"]=="Y"]["team_year"])
    
    n_top_offense_and_defense_winners = len(offense_winners.intersection(defense_winners))
    pct = n_top_offense_and_defense_winners / n_top_offense_and_defense * 100
    print(str(n_top_offense_and_defense_winners) + " out of " + str(n_top_offense_and_defense) + " (" + str(pct) + "%) won the World Series")
    print()
    
df[df["team_year"].str.contains("NYY1947|CHC1906|LAD1974|NYY1927|BAL1969|NYY1942|STL1944|CLE1948|SFG1905|CHW1917|PIT1902|SFG1904|PIT1909|NYY1998|CHC1910|SFG1912|NYY1954|STL1942|NYY1939|SEA2001|LAD1942", regex=True)][["year", "franchise_id", "ws_win"]]
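
The last line of the cell above lists the 21 teams in a hand-written regex. One possible alternative is to derive the rows directly from the two top-N lists, so the table stays in sync if N changes. The helper `top_n_in_both` and the toy values below are assumptions for illustration; the sketch also assumes the frame has a unique index.

```python
import pandas as pd

def top_n_in_both(frame, n, cols=("r_per_game_over_avg", "ra_per_game_over_avg")):
    """Rows of `frame` that rank in the top n of every column in `cols`."""
    keep = pd.Series(True, index=frame.index)
    for col in cols:
        # True where the row is among the n largest values of this column
        keep &= frame.index.isin(frame.nlargest(n, col).index)
    return frame[keep]

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "year": [1991, 1992, 1993, 1994, 1995],
    "franchise_id": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "ws_win": ["Y", "N", "N", "Y", "N"],
    "r_per_game_over_avg": [0.30, 0.25, 0.10, -0.05, 0.20],
    "ra_per_game_over_avg": [0.28, -0.10, 0.10, 0.30, 0.15],
})

both = top_n_in_both(toy, 3)
print(both[["year", "franchise_id", "ws_win"]])
```

In the toy data only AAA and EEE are in the top 3 of both columns, and one of the two won the World Series.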

## World Series winners with below average metrics

Of the 111 World-Series-winning teams, only 16 won with an offense below league average (`r_per_game_over_avg` < 0) and only 3 won with a defense/pitching performance below league average (`ra_per_game_over_avg` < 0).

The teams at the top of each table could make a claim for the weakest of all-time, including:
* 2006 STL Cardinals (offense slightly **below** league average and runs allowed just 2.6% better than league average)
* 1987 MIN Twins (2.7% better offense than league average, 5.3% **worse** defense/pitching)

In [None]:
df_ws_winners = df[df["ws_win"] == "Y"]
print(str(len(df_ws_winners)) + " World Series winners all time")

In [None]:
df_below_average_offense_ws_winners = df_ws_winners[df_ws_winners["r_per_game_over_avg"] < 0]
print(str(len(df_below_average_offense_ws_winners)) + " World Series winners with offense below league average")

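# reassign instead of sorting a filtered slice in place (avoids SettingWithCopyWarning)
df_below_average_offense_ws_winners = df_below_average_offense_ws_winners.sort_values(by=['ra_per_game_over_avg'])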
df_below_average_offense_ws_winners

In [None]:
df_below_average_defense_ws_winners = df_ws_winners[df_ws_winners["ra_per_game_over_avg"] < 0]
print(str(len(df_below_average_defense_ws_winners)) + " World Series winners with defense/pitching below league average")

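# reassign instead of sorting a filtered slice in place (avoids SettingWithCopyWarning)
df_below_average_defense_ws_winners = df_below_average_defense_ws_winners.sort_values(by=['r_per_game_over_avg'])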
df_below_average_defense_ws_winners