## Which rotations have gotten the most starts from their top-5 starters?

There was a Reddit thread about teams that kept their 5-man rotations healthy and got
a high fraction of their starts from those top 5 starters.  Many people chimed in
with anecdotal examples.

Cool, but let's generate a leaderboard: the teams, since integration, that have gotten the greatest
fraction of their starts from 5 pitchers.  Better yet, let's include the names and GS totals for those
pitchers.

*(Next, we'll generalize this away from 5 to any number, and away from GS to any stat.)*
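
To pin down the metric before touching the real data, here is a minimal pure-Python sketch. `top_n_fraction` is a hypothetical helper (not part of `boxball_loader` or `baseball_stats_utils`). The sample figures are the 2012 Reds' GS totals from the leaderboard later in this notebook, with the one remaining start inferred from the team total of 162.

```python
import heapq


def top_n_fraction(values, n=5):
    """Fraction of the total contributed by the n largest values."""
    total = sum(values)
    if total == 0:
        raise ValueError("total must be positive")
    return sum(heapq.nlargest(n, values)) / total


# 2012 CIN: five starters made 161 of 162 starts; one other pitcher made the last one.
reds_2012 = [33, 33, 33, 32, 30, 1]
print(round(top_n_fraction(reds_2012, n=5), 6))  # 0.993827
print(top_n_fraction(reds_2012, n=3))            # top-3 share: 99 / 162
```

Because `n` and the list of values are both parameters, the same function covers the "any number, any stat" generalization; the pandas cells below do the equivalent per team-season.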

In [1]:
import pandas as pd
import boxball_loader as bbl
import baseball_stats_utils as bsu

In [2]:
category = 'gs'
top_n = 5
col_top_n = f'top{top_n}'
col_top_frac = f'top{top_n}_frac'

# Load GS for all player-seasons since integration (GS>0 only); ranking within each team-season happens in the next cell

stat = bbl.load_pitching(seasons=bbl.Eras.Integration, coalesce_type=bbl.CoalesceMode.PLAYER_SEASON_TEAM).query(f'{category}>0')[category]
stat.describe()


count    19119.000000
mean        14.987813
std         11.831355
min          1.000000
25%          4.000000
50%         12.000000
75%         26.000000
max         49.000000
Name: gs, dtype: float64

In [3]:
starts_from_topn = stat.groupby(['yr', 'team_id']).nlargest(top_n).groupby(['yr', 'team_id']).sum().rename(col_top_n)
starts_total = stat.groupby(['yr', 'team_id']).sum().rename('total')
teams = pd.concat([starts_total, starts_from_topn], axis=1)
teams[col_top_frac] = teams[col_top_n]/teams['total']
teams.sample(10)

yr,team_id,total,top5,top5_frac
1993,BOS,162,138,0.851852
1968,LAN,162,150,0.925926
1997,MON,162,138,0.851852
1958,CLE,153,115,0.751634
1999,SFN,162,138,0.851852
1968,BAL,162,139,0.858025
2017,BOS,162,136,0.839506
1994,MON,114,102,0.894737
1965,CLE,162,134,0.82716
1993,ML4,162,136,0.839506
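
The groupby logic in the cell above is easier to check by hand on a toy Series. The sketch below uses made-up pitchers and teams (`AAA` and `BBB` are not real team IDs) and assumes the `(player_id, yr, team_id)` index layout that the later `reset_index()` output suggests. It computes the top-n sum with `groupby(...).apply(...)`, a slightly different spelling from the notebook's chained `nlargest`.

```python
import pandas as pd

top_n = 2  # small n so the arithmetic is easy to verify by eye

# Hypothetical data: games started per pitcher-season-team
toy = pd.Series(
    [30, 25, 20, 5, 40, 10],
    index=pd.MultiIndex.from_tuples(
        [
            ("p1", 2000, "AAA"), ("p2", 2000, "AAA"),
            ("p3", 2000, "AAA"), ("p4", 2000, "AAA"),
            ("p5", 2000, "BBB"), ("p6", 2000, "BBB"),
        ],
        names=["player_id", "yr", "team_id"],
    ),
    name="gs",
)

by_team = toy.groupby(level=["yr", "team_id"])
total = by_team.sum().rename("total")
# Sum of the n largest values within each team-season
top = by_team.apply(lambda s: s.nlargest(top_n).sum()).rename(f"top{top_n}")

toy_teams = pd.concat([total, top], axis=1)
toy_teams[f"top{top_n}_frac"] = toy_teams[f"top{top_n}"] / toy_teams["total"]
print(toy_teams)
```

Team `AAA` should come out at 55 of 80 starts (0.6875) and `BBB` at 50 of 50 (1.0), which matches what you get by adding the two largest values per team by hand.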


In [4]:
teams.sort_values(by=col_top_frac, ascending=False)

yr,team_id,total,top5,top5_frac
2003,SEA,162,162,1.000000
1966,LAN,162,162,1.000000
2012,CIN,162,161,0.993827
1994,LAN,114,113,0.991228
2012,SFN,162,160,0.987654
...,...,...,...,...
1993,COL,162,93,0.574074
1996,PIT,162,92,0.567901
1993,CLE,162,90,0.555556
2017,CIN,162,90,0.555556


In [5]:
# Now let's get the names and figures for the top n
# .copy() makes topn an independent frame, so adding columns to it later is unambiguous
topn = stat.reset_index().sort_values(['yr', 'team_id', 'gs'], ascending=(True, True, False)).groupby(['yr', 'team_id']).head(top_n).copy()
topn

,player_id,yr,team_id,gs
4142,dobsojo01,1947,BOS,31
4940,ferrida01,1947,BOS,28
7790,hughste01,1947,BOS,26
5470,galehde01,1947,BOS,21
8191,johnsea01,1947,BOS,17
...,...,...,...,...
15474,scherma01,2020,WAS,12
3383,corbipa01,2020,WAS,11
15252,sanchan01,2020,WAS,11
17849,vothau01,2020,WAS,11


In [6]:
topn['name'] = bsu.get_player_names_col(topn['player_id'], idx_fld='player_id')
topn['display'] = topn['name'] + ' (' + topn['gs'].astype(str) + ')'
topn

,player_id,yr,team_id,gs,name,display
4142,dobsojo01,1947,BOS,31,Joe Dobson,Joe Dobson (31)
4940,ferrida01,1947,BOS,28,Dave Ferriss,Dave Ferriss (28)
7790,hughste01,1947,BOS,26,Tex Hughson,Tex Hughson (26)
5470,galehde01,1947,BOS,21,Denny Galehouse,Denny Galehouse (21)
8191,johnsea01,1947,BOS,17,Earl Johnson,Earl Johnson (17)
...,...,...,...,...,...,...
15474,scherma01,2020,WAS,12,Max Scherzer,Max Scherzer (12)
3383,corbipa01,2020,WAS,11,Patrick Corbin,Patrick Corbin (11)
15252,sanchan01,2020,WAS,11,Anibal Sanchez,Anibal Sanchez (11)
17849,vothau01,2020,WAS,11,Austin Voth,Austin Voth (11)


In [7]:
teams['pitchers'] = topn.groupby(['yr', 'team_id'])['display'].agg(lambda x: ', '.join(x))
teams.sample(10)

yr,team_id,total,top5,top5_frac,pitchers
1976,HOU,162,121,0.746914,"J. R. Richard (39), Larry Dierker (28), Joaqui..."
1961,ML1,155,133,0.858065,"Lew Burdette (36), Warren Spahn (34), Bob Buhl..."
2007,PHI,162,123,0.759259,"Jamie Moyer (33), Adam Eaton (30), Cole Hamels..."
2007,LAA,162,135,0.833333,"John Lackey (33), Kelvim Escobar (30), Jered W..."
1954,CHA,155,124,0.8,"Virgil Trucks (33), Bob Keegan (27), Billy Pie..."
1966,CHN,162,119,0.734568,"Dick Ellsworth (37), Ken Holtzman (33), Bill H..."
1972,DET,156,132,0.846154,"Mickey Lolich (41), Joe Coleman (39), Tom Timm..."
2020,TBA,60,45,0.75,"Tyler Glasnow (11), Blake Snell (11), Charlie ..."
2012,MIA,162,132,0.814815,"Mark Buehrle (31), Josh Johnson (31), Ricky No..."
2005,DET,162,144,0.888889,"Mike Maroth (34), Jason Johnson (33), Nate Rob..."


In [10]:
# Printable/shareable table of all teams at 98% or better
threshold = .98
print(teams.reset_index().sort_values(col_top_frac, ascending=False).query(f'{col_top_frac}>=@threshold').to_markdown(index=False))

|   yr | team_id   |   total |   top5 |   top5_frac | pitchers                                                                                            |
|-----:|:----------|--------:|-------:|------------:|:----------------------------------------------------------------------------------------------------|
| 2003 | SEA       |     162 |    162 |    1        | Freddy Garcia (33), Jamie Moyer (33), Ryan Franklin (32), Gil Meche (32), Joel Pineiro (32)         |
| 1966 | LAN       |     162 |    162 |    1        | Sandy Koufax (41), Don Drysdale (40), Claude Osteen (38), Don Sutton (35), Joe Moeller (8)          |
| 2012 | CIN       |     162 |    161 |    0.993827 | Homer Bailey (33), Johnny Cueto (33), Mat Latos (33), Bronson Arroyo (32), Mike Leake (30)          |
| 1994 | LAN       |     114 |    113 |    0.991228 | Ramon Martinez (24), Pedro Astacio (23), Kevin Gross (23), Tom Candiotti (22), Orel Hershiser (21)  |
| 2012 | SFN       |     162 |    160 |    0.987654 | Tim Lincec