# Progression of the NL career home run record

"Who the heck is Cy Williams?"

"I think he used to be the NL's career home run leader at some point."

OK, let's see.

In [1]:
import pandas as pd
import boxball_loader as bbl
import numpy as np

In [2]:
# get the count of HR for each player-season in the NL
df = bbl.load_batting(coalesce_type=bbl.CoalesceMode.PLAYER_SEASON_LEAGUE)
nl_hr = df[df.index.get_level_values('lg_id')=='NL']['hr']
nl_hr

player_id  yr    lg_id
aardsda01  2004  NL        0
           2006  NL        0
           2013  NL        0
           2015  NL        0
aaronha01  1954  NL       13
                          ..
zuvelpa01  1983  NL        0
           1984  NL        0
           1985  NL        0
zuverge01  1954  NL        0
zwilldu01  1916  NL        1
Name: hr, Length: 52360, dtype: int16

In [3]:
# running career total for each player after each of his NL seasons
totals = nl_hr.groupby(level=0).cumsum()
totals

player_id  yr    lg_id
aardsda01  2004  NL        0
           2006  NL        0
           2013  NL        0
           2015  NL        0
aaronha01  1954  NL       13
                          ..
zuvelpa01  1983  NL        0
           1984  NL        0
           1985  NL        0
zuverge01  1954  NL        0
zwilldu01  1916  NL        1
Name: hr, Length: 52360, dtype: int16
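To make the grouped cumulative sum concrete, here is a minimal sketch on made-up data. The `toy_hr` series and its values are hypothetical; only the index layout (`player_id`, `yr`, `lg_id`) mirrors `nl_hr`. Note that `cumsum` runs in row order, so it assumes each player's seasons are already sorted by year.

```python
import pandas as pd

# Hypothetical player-season data with the same index levels as nl_hr.
idx = pd.MultiIndex.from_tuples(
    [("a", 2001, "NL"), ("a", 2002, "NL"), ("b", 2001, "NL"), ("b", 2003, "NL")],
    names=["player_id", "yr", "lg_id"],
)
toy_hr = pd.Series([10, 5, 7, 20], index=idx, name="hr")

# Running total within each player; the sum restarts at every new player_id.
toy_totals = toy_hr.groupby(level=0).cumsum()
print(toy_totals.tolist())  # [10, 15, 7, 27]
```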

In [4]:
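# here's the highest career total among players who appeared in each year
# two problems: it only counts players active that year (a retired record-holder drops out),
# and it is missing the player id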
totals.reset_index().groupby('yr')['hr'].max()

yr
1876      5
1877      6
1878      9
1879     18
1880     23
       ... 
2016    382
2017    302
2018    322
2019    344
2020    352
Name: hr, Length: 145, dtype: int16
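The drop from 382 in 2016 to 302 in 2017 above is what you would expect when the active leader leaves the pool. One possible way to address both problems at once is to pivot to a year-by-player table, so that career totals carry forward after a player's last season. This is only a sketch on hypothetical toy data, not the approach used in the rest of this notebook; unlike a max-with-ties filter, `idxmax` reports only the first player on a tie.

```python
import pandas as pd

# Hypothetical data: player "a" retires after 2002, player "b" passes him in 2004.
toy = pd.DataFrame({
    "player_id": ["a", "a", "b", "b", "b"],
    "yr": [2001, 2002, 2002, 2003, 2004],
    "hr": [10, 5, 7, 3, 9],
})

# Active-only view: the 2003 value falls because "a" no longer appears.
active_max = (
    toy.assign(total=toy.groupby("player_id")["hr"].cumsum())
       .groupby("yr")["total"].max()
)

# One row per year, one column per player, 0 where a player did not play.
wide = toy.pivot_table(index="yr", columns="player_id", values="hr",
                       aggfunc="sum", fill_value=0)
career = wide.cumsum()  # totals persist after retirement

# Record holder and record value for each year (first player wins a tie).
record = pd.DataFrame({
    "player_id": career.idxmax(axis=1),
    "hr": career.max(axis=1),
})
print(active_max.tolist())  # [10, 15, 10, 19]
print(record)
```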

In [5]:
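# ok, this is what we really need to do for each year:
# sum every player's NL HR through that year (retired players included)
# and keep whoever has the max (ties return more than one player)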
def get_yearly_leader(yr):
    yrly = nl_hr[nl_hr.index.get_level_values('yr')<=yr].reset_index().groupby('player_id')['hr'].sum()
    leader = yrly[yrly == yrly.max()]
    return leader

get_yearly_leader(1925)

player_id
willicy01    186
Name: hr, dtype: int16

In [6]:
# combine years into a df (years before 1876 have no NL data, so they add no rows)
yrly_leaders = pd.concat({yr: get_yearly_leader(yr) for yr in range(1870, 2021)}).reset_index(level=1)
yrly_leaders

      player_id   hr
1876   hallge01    5
1877  jonesch01    6
1878  jonesch01    9
1879  jonesch01   18
1880  jonesch01   23
...         ...  ...
2016  bondsba01  762
2017  bondsba01  762
2018  bondsba01  762
2019  bondsba01  762


In [7]:
print(yrly_leaders.drop_duplicates().to_string())

      player_id   hr
1876   hallge01    5
1877  jonesch01    6
1878  jonesch01    9
1879  jonesch01   18
1880  jonesch01   23
1884  broutda01   35
1884  willine01   35
1885  broutda01   42
1886  broutda01   53
1887  broutda01   65
1888  broutda01   74
1889  broutda01   81
1891  ansonca01   84
1892  broutda01   86
1893  connoro01   96
1894  connoro01  104
1895  thompsa01  113
1896  thompsa01  125
1898  thompsa01  126
1923  willicy01  149
1924  willicy01  173
1925  willicy01  186
1926  willicy01  204
1927  willicy01  234
1928  willicy01  246
1929  hornsro01  277
1930  hornsro01  279
1931  hornsro01  295
1932  hornsro01  296
1933  hornsro01  298
1937    ottme01  306
1938    ottme01  342
1939    ottme01  369
1940    ottme01  388
1941    ottme01  415
1942    ottme01  445
1943    ottme01  463
1944    ottme01  489
1945    ottme01  510
1946    ottme01  511
1966   mayswi01  542
1967   mayswi01  564
1968   mayswi01  587
1969   mayswi01  600
1970   mayswi01  628
1971   mayswi01  646
1972  aaronha

In [8]:
yrly_leaders['player_id'].value_counts()

aaronha01    34
ottme01      29
thompsa01    28
bondsba01    15
broutda01     8
hornsro01     8
jonesch01     7
mayswi01      6
willicy01     6
connoro01     2
ansonca01     1
hallge01      1
willine01     1
Name: player_id, dtype: int64

In [9]:
# ok, now let's visualize the progression
import plotly.express as px

In [10]:
px.line(yrly_leaders['hr'], color=yrly_leaders['player_id'])

OK, that's a start for the visualization. I'd like to try a few more things:
* one line for the leader, but use shading to indicate the record-holder
* show career progression for each player who ever held the record (e.g., you see Aaron's climb towards the record)
* show the progression of the top n, not just the leader, to give a feel for how far the record is from the rest of the field
    * To that end, de-emphasize or hide retired players, since active players are the ones who threaten a record
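
As a starting point for the top-n idea, here is a rough sketch on hypothetical toy data. The helper `top_n_by_year` and the `active` flag are assumptions made for illustration: a player is treated as active through his last season in the data, which is only an approximation of "not retired". The resulting long table could then be handed to a plotting library, for example with `active` controlling line opacity.

```python
import pandas as pd

# Hypothetical player-season home run counts.
toy = pd.DataFrame({
    "player_id": ["a", "a", "b", "b", "b", "c", "c"],
    "yr": [2001, 2002, 2002, 2003, 2004, 2001, 2003],
    "hr": [10, 5, 7, 3, 9, 2, 20],
})

# Career totals through each year, carried forward after a player's last season.
career = toy.pivot_table(index="yr", columns="player_id", values="hr",
                         aggfunc="sum", fill_value=0).cumsum()
last_season = toy.groupby("player_id")["yr"].max()

def top_n_by_year(career, n, last_season):
    """Return the n highest career totals for each year in long format."""
    long = career.reset_index().melt(id_vars="yr", var_name="player_id",
                                     value_name="hr")
    long = long[long["hr"] > 0]  # ignore players who have not homered yet
    long = long.sort_values(["yr", "hr"], ascending=[True, False])
    long["rank"] = long.groupby("yr").cumcount() + 1
    # Assumption: "active" means the player's last season is this year or later.
    long["active"] = long["yr"] <= long["player_id"].map(last_season)
    return long[long["rank"] <= n].reset_index(drop=True)

top2 = top_n_by_year(career, 2, last_season)
print(top2)
```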