Around 2015, I spent a lot of time entering my game attendance history into Hardball Passport (HBP).  The sources were ticket stubs, photo history, and my memory.  After entering the data, I threw out a lot of paper ticket stubs.  Hardball Passport eventually vanished, taking my history with it.  I did, however, save a couple of sheets of player stats from the games I attended.  The goal here is to reconstruct the attendance history from these player stats.

Remember, the goal is to reconstruct the set of games that was in Hardball Passport.  This is different from reconstructing my actual attendance history.  For example, if I know I attended a game but it was missing from HBP, including it here will break the reconstruction, because its stats are not part of the saved totals.

The general design is to iterate a loop, maintaining this data:
* a set of games that are known to be in the HBP history
* a set of games that are known to not be in the HBP history
* a set of games that may be in the HBP history (note that these and the previous two represent the universe of games)
* the HBP player stats for all players in the "possible" games (i.e., with the stats from "known" games removed)

The loop is:
* Identify a game that is known
** Either by memory
** Or by taking a player with 1 game of stats and matching that to a game from their career
* Add that game to the "known" games
* Deduct the stats from that game for all players who played in that game
* For any player who drops to zero games/stats remaining, all the other games they played in go from possible to impossible
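The loop above can be sketched end to end on made-up data. This is a minimal illustration, not the code used later in this notebook: the four games, three pitchers, and the single stat `K` are invented, and `remaining`, `known_in`, and `known_out` are hypothetical names. It includes the last step (ruling games out once a pitcher drops to zero), which is what lets the second pass resolve pitcher `c`.

```python
import pandas as pd

# Toy dailies: one row per pitcher appearance (all values invented).
dailies = pd.DataFrame(
    [("g1", "a", 1, 5), ("g1", "b", 1, 2), ("g2", "a", 1, 3), ("g2", "c", 1, 1),
     ("g3", "b", 1, 2), ("g3", "c", 1, 4), ("g4", "c", 1, 1)],
    columns=["game_id", "retro_id", "G", "K"],
)
# Toy saved totals, consistent with having attended g1 and g4.
remaining = pd.DataFrame(
    {"G": [1, 1, 1], "K": [5, 2, 1]},
    index=pd.Index(["a", "b", "c"], name="retro_id"),
)

known_in, known_out = set(), set()
while True:
    # Only games not yet decided are candidates.
    possible = dailies[~dailies["game_id"].isin(list(known_in | known_out))]
    singles = remaining[remaining["G"] == 1].reset_index()
    if singles.empty:
        break
    matches = singles.merge(possible, on=["retro_id", "G", "K"])
    if matches.empty:
        break
    counts = matches.groupby("retro_id")["game_id"].agg(["count", "first"])
    new_games = set(counts.loc[counts["count"] == 1, "first"])
    if not new_games:
        break
    known_in |= new_games

    # Deduct the stats from the newly known games for every tracked pitcher.
    played = dailies[dailies["game_id"].isin(list(new_games))
                     & dailies["retro_id"].isin(remaining.index)]
    delta = played.groupby("retro_id")[["G", "K"]].sum()
    remaining = remaining.sub(delta, fill_value=0).astype(int)

    # Pitchers with nothing left rule out every other game they appeared in.
    done = remaining.index[remaining["G"] == 0]
    ruled_out = dailies[dailies["retro_id"].isin(done)
                        & ~dailies["game_id"].isin(list(known_in))]
    known_out |= set(ruled_out["game_id"])

print(sorted(known_in), sorted(known_out))
```

On this toy data the first pass can only place pitcher `a`; ruling out `g2` and `g3` afterwards leaves `g4` as the only candidate for `c`.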

In [1]:
import pandas as pd

In [2]:
bat = pd.read_csv('~/Dropbox/personal/baseball/hardball_passport/passportplus-20150523-0147-B.csv')
pit = pd.read_csv('~/Dropbox/personal/baseball/hardball_passport/passportplus-20150523-0148-P.csv')

bat

Unnamed: 0,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,BA,OPS
0,Ryan Klesko,SD SF,69,226,58,36,29,13,3,9,104,0.257,0.717
1,Phil Nevin,SD,67,244,76,40,47,14,0,16,138,0.311,0.877
2,Brian Giles,PIT SD,61,219,60,38,26,11,3,8,101,0.274,0.735
3,Trevor Hoffman,SD,58,2,0,0,0,0,0,0,0,0.000,0.000
4,Chase Headley,SD,56,193,49,24,17,19,0,2,74,0.254,0.637
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1738,Ehire Adrianza,SF,1,0,0,0,0,0,0,0,0,0.000,0.000
1739,Juan Acevedo,MIL,1,0,0,0,0,0,0,0,0,0.000,0.000
1740,Jeremy Accardo,SF,1,1,0,0,0,0,0,0,0,0.000,0.000
1741,Tony Abreu,SF,1,4,0,0,0,0,0,0,0,0.000,0.000


In [3]:
bat['SLG'] = bat['TB']/bat['AB']
bat['OBP'] = bat['OPS'] - bat['SLG']
bat

Unnamed: 0,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,BA,OPS,SLG,OBP
0,Ryan Klesko,SD SF,69,226,58,36,29,13,3,9,104,0.257,0.717,0.460177,0.256823
1,Phil Nevin,SD,67,244,76,40,47,14,0,16,138,0.311,0.877,0.565574,0.311426
2,Brian Giles,PIT SD,61,219,60,38,26,11,3,8,101,0.274,0.735,0.461187,0.273813
3,Trevor Hoffman,SD,58,2,0,0,0,0,0,0,0,0.000,0.000,0.000000,0.000000
4,Chase Headley,SD,56,193,49,24,17,19,0,2,74,0.254,0.637,0.383420,0.253580
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1738,Ehire Adrianza,SF,1,0,0,0,0,0,0,0,0,0.000,0.000,,
1739,Juan Acevedo,MIL,1,0,0,0,0,0,0,0,0,0.000,0.000,,
1740,Jeremy Accardo,SF,1,1,0,0,0,0,0,0,0,0.000,0.000,0.000000,0.000000
1741,Tony Abreu,SF,1,4,0,0,0,0,0,0,0,0.000,0.000,0.000000,0.000000


In [4]:
bat.sum()

Name      Ryan KleskoPhil NevinBrian GilesTrevor Hoffman...
Teams     SD SFSDPIT SDSDSDSDSD SDNMIL SDSDSDSD MIL ATLS...
G                                                      5979
AB                                                    14077
H                                                      3519
R                                                      1665
RBI                                                    1600
TwoB                                                    638
ThreeB                                                   98
HR                                                      367
TB                                                     5454
BA                                                  289.327
OPS                                                 725.802
SLG                                              436.457239
OBP                                              289.344761
dtype: object

In [5]:
pit.sum()

Name       Trevor HoffmanScott LinebrinkLuke GregersonJoe...
Teams      SDSD MIL ATLSDSD ARZSDSDSD SFNSD MIASD SDNSDSD...
G                                                       1653
IP                                                    3576.6
W                                                        206
L                                                        206
SV                                                       110
H                                                       3598
ER                                                      1545
K                                                       3034
BB                                                      1385
Pitches                                                44740
ERA                                                 63561.09
WHIP                                                1136.057
dtype: object

In [6]:
44740/3576.6

12.509086842252419

# Data Cleaning

Both the HBP data and the retrosheet/bd data need cleaning: matching column names, etc.

In [7]:
yrs = range(1986, 2016)
len(yrs)

30

In [8]:
def get_pitchers(yrs):
    """ Retrieve a DF of all pitchers and their names/IDs who pitched in a range of years"""
    df = pd.read_parquet('../data/bd/pitching.parquet')
    pitcher_list = df[df.year_id.isin(yrs)].player_id.unique()
    
    ppl = pd.read_parquet('../data/bd/people.parquet')[['player_id', 'name_first', 'name_last', 'retro_id']]
    pitchers = ppl[ppl.player_id.isin(pitcher_list)].copy()
    pitchers['display_name'] = pitchers['name_first'] + ' ' + pitchers['name_last']
    name_counts = pitchers.groupby('display_name')['retro_id'].count()
    dup_names = name_counts[name_counts>1].index.values
    pitchers['dup_name'] = pitchers.display_name.isin(dup_names)
    return pitchers

In [9]:
pitchers = get_pitchers(yrs).query('retro_id != "alfoa001"')
len(pitchers)

3750

In [10]:
pitchers

Unnamed: 0,player_id,name_first,name_last,retro_id,display_name,dup_name
0,aardsda01,David,Aardsma,aardd001,David Aardsma,False
3,aasedo01,Don,Aase,aased001,Don Aase,False
5,abadfe01,Fernando,Abad,abadf001,Fernando Abad,False
14,abbotji01,Jim,Abbott,abboj001,Jim Abbott,False
16,abbotky01,Kyle,Abbott,abbok001,Kyle Abbott,False
...,...,...,...,...,...,...
19847,zimmejo02,Jordan,Zimmermann,zimmj003,Jordan Zimmermann,False
19851,zinkch01,Charlie,Zink,zinkc001,Charlie Zink,False
19860,zitoba01,Barry,Zito,zitob001,Barry Zito,False
19870,zumayjo01,Joel,Zumaya,zumaj001,Joel Zumaya,False


In [11]:
pit[~pit.Name.isin(pitchers.display_name)]

Unnamed: 0,Name,Teams,G,IP,W,L,SV,H,ER,K,BB,Pitches,ERA,WHIP
137,Tom Layne,SD,3,1.1,0,0,0,2,1,2,1,22,6.75,2.25
138,Hong-Chih Kuo,LAD,3,4.0,0,0,0,1,0,2,3,54,0.0,1.0
177,Alexander Torres,SD,2,0.2,0,0,0,1,1,0,2,18,13.5,4.5
249,J.P. Howell,LAD TB,2,3.0,0,1,0,1,0,5,0,15,0.0,0.333
285,Andrew Carpenter,SD,2,2.0,0,0,0,2,2,4,2,24,9.0,2.0
307,Antonio Alfonseca,ATL,2,1.2,0,0,0,4,0,2,1,18,0.0,3.0
352,J.J. Trujillo,SD,1,1.1,0,0,0,2,2,1,3,33,13.5,3.75
426,J.J. Putz,ARZ,1,1.0,0,0,1,1,0,1,0,9,0.0,1.0
458,Leo Nunez,FLA,1,0.2,0,0,1,1,0,0,0,8,0.0,1.5
495,T.J. Mathews,STL,1,1.0,1,0,0,1,0,2,0,3,0.0,1.0
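Several of the unmatched names contain initials or hyphens. One possible way to recover some of them is to compare on a normalized key rather than the raw string. This is only a sketch: it assumes the mismatch is down to punctuation or spacing (for example `J.P.` versus `J. P.`), which this notebook does not confirm, and `name_key` is a hypothetical helper. It also drops any non-ASCII letters and cannot help with players listed under a different name.

```python
import re

def name_key(name):
    """Lowercase a name and drop everything except the letters a-z."""
    return re.sub(r"[^a-z]", "", name.lower())

# Two spellings that differ only in spacing collapse to the same key.
print(name_key("J.P. Howell"), name_key("J. P. Howell"))
print(name_key("Hong-Chih Kuo"))
```

If this were used for the merge, the `dup_name` check would need to be recomputed on the key, since normalizing can create new collisions.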


In [12]:
(pit.Name.isin(pitchers.display_name)).value_counts()

True     678
False     18
Name: Name, dtype: int64

In [13]:
pitchers[~pitchers['dup_name']]

Unnamed: 0,player_id,name_first,name_last,retro_id,display_name,dup_name
0,aardsda01,David,Aardsma,aardd001,David Aardsma,False
3,aasedo01,Don,Aase,aased001,Don Aase,False
5,abadfe01,Fernando,Abad,abadf001,Fernando Abad,False
14,abbotji01,Jim,Abbott,abboj001,Jim Abbott,False
16,abbotky01,Kyle,Abbott,abbok001,Kyle Abbott,False
...,...,...,...,...,...,...
19847,zimmejo02,Jordan,Zimmermann,zimmj003,Jordan Zimmermann,False
19851,zinkch01,Charlie,Zink,zinkc001,Charlie Zink,False
19860,zitoba01,Barry,Zito,zitob001,Barry Zito,False
19870,zumayjo01,Joel,Zumaya,zumaj001,Joel Zumaya,False


In [14]:
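def compute_IPouts(IP):
    """Convert innings pitched in baseball notation (8.1 = 8 innings + 1 out) to outs."""
    # IP // 1 is the whole innings; the outer round absorbs float error in IP % 1
    return round((IP // 1) * 3 + 10 * (IP % 1))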

[int(compute_IPouts(ip)) for ip in [8.1, 0.2, 1.2, 38.1]]

[25, 2, 5, 115]
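The conversion relies on rounding because `8.1 % 1` is not exactly `0.1` in floating point. A minimal alternative sketch, assuming every IP value ends in .0, .1, or .2: scale to integer tenths first, then split into innings and extra outs. `ip_to_outs` is a hypothetical name for illustration and works on scalars only, not on a pandas Series.

```python
def ip_to_outs(ip):
    """Convert baseball-notation innings (8.1 = 8 innings + 1 out) to outs."""
    tenths = round(ip * 10)  # e.g. 8.1 -> 81, absorbing float error
    innings, extra_outs = divmod(tenths, 10)
    if extra_outs > 2:
        raise ValueError(f"not a valid innings-pitched value: {ip}")
    return innings * 3 + extra_outs

print([ip_to_outs(ip) for ip in [8.1, 0.2, 1.2, 38.1]])
```

The explicit check makes a malformed value such as 1.5 fail loudly instead of silently producing a number of outs.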

In [15]:
cols = ['player_id', 'retro_id', 'Name', 'Teams', 'G', 'IPouts', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB', 'Pitches']
pit['IPouts'] = compute_IPouts(pit['IP']).apply(int)
pit_stats = pd.merge(left=pitchers[~pitchers['dup_name']], right=pit, left_on='display_name', right_on='Name')[cols]

pit_stats

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
0,aardsda01,aardd001,David Aardsma,SEA,1,2,0,1,0,2,2,0,1,21
1,accarje01,accaj001,Jeremy Accardo,SF,1,4,0,0,0,0,0,1,2,16
2,acevejo01,acevj002,Jose Acevedo,CIN COL,3,20,0,2,0,13,10,3,7,126
3,aceveju01,acevj001,Juan Acevedo,MIL,1,6,0,0,0,2,0,2,0,16
4,adamsmi03,adamm001,Mike Adams,MIL SD,12,39,1,0,0,7,1,12,5,152
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
662,wrighja02,wrigj002,Jaret Wright,SD ATL,2,5,0,2,0,5,6,1,6,62
663,youngch03,younc003,Chris Young,SD,6,114,3,1,0,13,7,38,13,376
664,zieglbr01,ziegb001,Brad Ziegler,ARZ,2,4,0,0,0,2,1,1,1,16
665,zimmejo01,zimmj002,Jordan Zimmerman,SEA,1,0,0,0,0,0,0,0,1,5


OK, this is our starting point: a table of pitching counting stats that match the retro dailies,
with IDs.  It covers only the pitchers whose names resolved easily, but the algorithm
will work the same (the pitchers who didn't resolve are essentially just missing from our universe).

Still to clean up:
* column headers match retro
* Teams
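For the Teams item, one option is a lookup from HBP abbreviations to retro team IDs. The sketch below is only a starting point under stated assumptions: the three pairs are the ones that can be read off tables in this notebook (`SD`/`SDN`, `SF`/`SFN`, `ARZ`/`ARI`), `HBP_TO_RETRO` and `to_retro_teams` are hypothetical names, and any code not in the lookup passes through unchanged, which is only right for teams like `SEA` or `HOU` whose codes already agree.

```python
# Partial lookup; extend it as more mismatched codes turn up.
HBP_TO_RETRO = {"SD": "SDN", "SF": "SFN", "ARZ": "ARI"}

def to_retro_teams(teams):
    """Map a space-separated HBP Teams string to a list of retro team IDs."""
    return [HBP_TO_RETRO.get(code, code) for code in teams.split()]

print(to_retro_teams("SD SF"))
print(to_retro_teams("SEA"))
```

With team IDs in the same vocabulary, a candidate daily could also be required to match one of the pitcher's listed teams, which might cut down the ambiguous matches.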

In [16]:
import boxball_loader as bbl

df_dailies = bbl.load_dailies(game_types=bbl.GameType.ALL)
df_dailies

Unnamed: 0,game_id,game_dt,game_ct,appearance_dt,team_id,player_id,slot_ct,seq_ct,home_fl,opponent_id,...,f_rf_out,f_rf_tc,f_rf_po,f_rf_a,f_rf_e,f_rf_dp,f_rf_tp,yr,game_type,team_game_number
0,ALS193307060,1933-07-06,0,1933-07-06,ALS,avere101,9,3,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1933,ASG,1
1,ALS193307060,1933-07-06,0,1933-07-06,ALS,chapb102,1,1,True,NLS,...,3.0,1.0,1.0,0.0,0.0,0,0.0,1933,ASG,1
2,ALS193307060,1933-07-06,0,1933-07-06,ALS,cronj101,7,1,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1933,ASG,1
3,ALS193307060,1933-07-06,0,1933-07-06,ALS,crowg102,9,2,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1933,ASG,1
4,ALS193307060,1933-07-06,0,1933-07-06,ALS,dykej101,6,1,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1933,ASG,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5198794,WS4187205170,1872-05-17,0,1872-05-17,WS4,hollh101,1,1,True,WS3,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1872,RS,7
5198795,WS4187205170,1872-05-17,0,1872-05-17,WS4,lennb101,3,1,True,WS3,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1872,RS,7
5198796,WS4187205170,1872-05-17,0,1872-05-17,WS4,mince101,2,1,True,WS3,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1872,RS,7
5198797,WS4187205170,1872-05-17,0,1872-05-17,WS4,steab101,9,1,True,WS3,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1872,RS,7


In [17]:
pit_stats

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
0,aardsda01,aardd001,David Aardsma,SEA,1,2,0,1,0,2,2,0,1,21
1,accarje01,accaj001,Jeremy Accardo,SF,1,4,0,0,0,0,0,1,2,16
2,acevejo01,acevj002,Jose Acevedo,CIN COL,3,20,0,2,0,13,10,3,7,126
3,aceveju01,acevj001,Juan Acevedo,MIL,1,6,0,0,0,2,0,2,0,16
4,adamsmi03,adamm001,Mike Adams,MIL SD,12,39,1,0,0,7,1,12,5,152
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
662,wrighja02,wrigj002,Jaret Wright,SD ATL,2,5,0,2,0,5,6,1,6,62
663,youngch03,younc003,Chris Young,SD,6,114,3,1,0,13,7,38,13,376
664,zieglbr01,ziegb001,Brad Ziegler,ARZ,2,4,0,0,0,2,1,1,1,16
665,zimmejo01,zimmj002,Jordan Zimmerman,SEA,1,0,0,0,0,0,0,0,1,5


In [18]:
pit_stats.set_index('retro_id')

Unnamed: 0_level_0,player_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
aardd001,aardsda01,David Aardsma,SEA,1,2,0,1,0,2,2,0,1,21
accaj001,accarje01,Jeremy Accardo,SF,1,4,0,0,0,0,0,1,2,16
acevj002,acevejo01,Jose Acevedo,CIN COL,3,20,0,2,0,13,10,3,7,126
acevj001,aceveju01,Juan Acevedo,MIL,1,6,0,0,0,2,0,2,0,16
adamm001,adamsmi03,Mike Adams,MIL SD,12,39,1,0,0,7,1,12,5,152
...,...,...,...,...,...,...,...,...,...,...,...,...,...
wrigj002,wrighja02,Jaret Wright,SD ATL,2,5,0,2,0,5,6,1,6,62
younc003,youngch03,Chris Young,SD,6,114,3,1,0,13,7,38,13,376
ziegb001,zieglbr01,Brad Ziegler,ARZ,2,4,0,0,0,2,1,1,1,16
zimmj002,zimmejo01,Jordan Zimmerman,SEA,1,0,0,0,0,0,0,0,1,5


In [19]:
pit_col_mapper = \
{'p_g': 'G',
 'p_w': 'W',
 'p_l': 'L',
 'p_sv': 'SV',
 'p_out': 'IPouts',
 'p_er': 'ER',
 'p_h': 'H',
 'p_bb': 'BB',
 'p_so': 'K',
 'p_pitch': 'Pitches'}
pit_col_mapper

{'p_g': 'G',
 'p_w': 'W',
 'p_l': 'L',
 'p_sv': 'SV',
 'p_out': 'IPouts',
 'p_er': 'ER',
 'p_h': 'H',
 'p_bb': 'BB',
 'p_so': 'K',
 'p_pitch': 'Pitches'}

In [20]:
def get_dailies(yrs):
    df = df_dailies
    dailies = df[(df['yr'].isin(yrs)) & (df['p_g']>0)]

    pit_col_mapper = \
        {'p_g': 'G',
         'p_w': 'W',
         'p_l': 'L',
         'p_sv': 'SV',
         'p_out': 'IPouts',
         'p_er': 'ER',
         'p_h': 'H',
         'p_bb': 'BB',
         'p_so': 'K',
         'p_pitch': 'Pitches',
         'player_id': 'retro_id'}
    cols=['game_id', 'team_id'] + list(pit_col_mapper.values())
    
    return dailies.rename(columns=pit_col_mapper)[cols]

In [21]:
dailies = get_dailies(yrs)
dailies

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
1049,ALS198707140,ALS,1,0,0,0,8,0.0,2,0.0,1.0,,henkt001
1050,ALS198707140,ALS,1,0,1,0,6,2.0,3,0.0,3.0,,howej001
1052,ALS198707140,ALS,1,0,0,0,6,0.0,0,0.0,3.0,,langm001
1055,ALS198707140,ALS,1,0,0,0,6,0.0,1,1.0,2.0,,morrj001
1058,ALS198707140,ALS,1,0,0,0,3,0.0,0,0.0,1.0,,plesd001
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5061322,WAS201509280,CIN,1,0,0,0,3,1.0,2,0.0,2.0,22.0,diazj005
5061323,WAS201509280,CIN,1,0,1,0,15,3.0,8,2.0,3.0,83.0,finnb001
5061325,WAS201509280,CIN,1,0,0,0,3,1.0,1,0.0,1.0,15.0,parrm001
5061336,WAS201509280,WAS,1,1,0,0,24,1.0,2,3.0,10.0,113.0,schem001


# Building Blocks for the algorithm

### Find a game (in dailies) from a statline

In [22]:
pit_stats[pit_stats['G']==1].sort_values('K')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
0,aardsda01,aardd001,David Aardsma,SEA,1,2,0,1,0,2,2,0,1,21
294,krolia01,kroli001,Ian Krol,DET,1,2,0,0,0,1,1,0,0,10
308,leipeda01,leipd001,Dave Leiper,OAK,1,3,0,0,0,1,0,0,1,10
311,lewisji02,lewij002,Jim Lewis,SD,1,2,0,0,0,0,0,0,2,20
319,loeweca01,loewc001,Carlton Loewer,SD,1,6,0,1,0,7,6,0,1,44
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
347,matzety01,matzt001,Tyler Matzek,COL,1,18,0,1,0,9,5,9,2,56
523,scherma01,schem001,Max Scherzer,DET,1,16,0,1,0,4,4,10,3,58
239,hernafe02,hernf002,Felix Hernandez,SEA,1,21,1,0,0,7,1,10,1,49
510,sabatcc01,sabac001,CC Sabathia,CLE,1,24,1,0,0,4,0,11,2,68


In [23]:
pit_stats.loc[530]

player_id        scribev01
retro_id          scrie001
Name         Evan Scribner
Teams               SD OAK
G                        2
IPouts                   9
W                        0
L                        0
SV                       0
H                        0
ER                       0
K                        1
BB                       0
Pitches                 28
Name: 530, dtype: object

In [24]:
mike_scott = pit_stats[pit_stats['player_id']=='scottmi03']
mike_scott

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
529,scottmi03,scotm001,Mike Scott,HOU,1,25,0,1,0,8,3,14,0,0


In [25]:
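def find_daily(dailies, stat_line):
    """Return every daily whose counting stats exactly match a stat line (IPouts and Pitches are not compared)."""
    match_cols = ['retro_id', 'G', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB']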
    matches = pd.merge(left=stat_line, right=dailies, on=match_cols )
    return matches

In [26]:
find_daily(dailies, mike_scott)

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts_x,W,L,SV,H,ER,K,BB,Pitches_x,game_id,team_id,IPouts_y,Pitches_y
0,scottmi03,scotm001,Mike Scott,HOU,1,25,0,1,0,8,3,14,0,0,SDN198609140,HOU,25,


### Subtract a game's dailies from a stat DF
#### Start by subtracting a player's daily from their own statline
#### then scale

In [27]:
dailies[dailies['game_id']=='SDN198609140']

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4133033,SDN198609140,SDN,1,0,0,0,21,2.0,7,4.0,1.0,,hawka001
4133036,SDN198609140,SDN,1,1,0,0,6,0.0,1,1.0,1.0,,leffc001
4133050,SDN198609140,HOU,1,0,1,0,25,3.0,8,0.0,14.0,,scotm001


In [28]:
dailies.loc[4059874]

game_id     PIT200608270
team_id              PIT
G                      1
W                      0
L                      1
SV                     0
IPouts                13
ER                   5.0
H                      6
BB                   4.0
K                    3.0
Pitches             78.0
retro_id        chacs001
Name: 4059874, dtype: object

In [29]:
hawk = pit_stats[pit_stats['retro_id']=='hawka001']
hawk

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
228,hawkian01,hawka001,Andy Hawkins,SD,2,40,0,0,0,10,3,7,5,0


In [30]:
d = dailies[(dailies['game_id']=='SDN198609140')&(dailies['retro_id']=='hawka001')]
d

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4133033,SDN198609140,SDN,1,0,0,0,21,2.0,7,4.0,1.0,,hawka001


In [31]:
match_cols = ['G', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB']
d.set_index('retro_id')[match_cols]

Unnamed: 0_level_0,G,W,L,SV,H,ER,K,BB
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
hawka001,1,0,0,0,7,2.0,1.0,4.0


In [32]:
hawk.set_index('retro_id')

Unnamed: 0_level_0,player_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
hawka001,hawkian01,Andy Hawkins,SD,2,40,0,0,0,10,3,7,5,0


In [33]:
hawk.set_index('retro_id')[match_cols] - d.set_index('retro_id')[match_cols]

Unnamed: 0_level_0,G,W,L,SV,H,ER,K,BB
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
hawka001,1,0,0,0,3,1.0,6.0,1.0


# Execute the Algorithm

### Likely starting out manually, iterating manually, but calling the building block functions

In [34]:
pit_stats[pit_stats['G']==1].sort_values('K')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
0,aardsda01,aardd001,David Aardsma,SEA,1,2,0,1,0,2,2,0,1,21
294,krolia01,kroli001,Ian Krol,DET,1,2,0,0,0,1,1,0,0,10
308,leipeda01,leipd001,Dave Leiper,OAK,1,3,0,0,0,1,0,0,1,10
311,lewisji02,lewij002,Jim Lewis,SD,1,2,0,0,0,0,0,0,2,20
319,loeweca01,loewc001,Carlton Loewer,SD,1,6,0,1,0,7,6,0,1,44
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
347,matzety01,matzt001,Tyler Matzek,COL,1,18,0,1,0,9,5,9,2,56
523,scherma01,schem001,Max Scherzer,DET,1,16,0,1,0,4,4,10,3,58
239,hernafe02,hernf002,Felix Hernandez,SEA,1,21,1,0,0,7,1,10,1,49
510,sabatcc01,sabac001,CC Sabathia,CLE,1,24,1,0,0,4,0,11,2,68


In [35]:
pit_stats[pit_stats['G']==1].head(10).apply(lambda row: find_daily(dailies, pd.DataFrame(row).T), axis=1)

0        player_id  retro_id           Name Teams  G...
1        player_id  retro_id            Name Teams  ...
3        player_id  retro_id          Name Teams  G ...
6        player_id  retro_id        Name Teams  G IP...
8        player_id  retro_id         Name Teams  G I...
9         player_id  retro_id             Name Teams...
10       player_id  retro_id        Name Teams  G IP...
11        player_id  retro_id             Name Teams...
13       player_id  retro_id            Name Teams  ...
14       player_id  retro_id            Name Teams  ...
dtype: object

In [36]:
find_daily(dailies, pit_stats[pit_stats['G']==1].head(10))

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts_x,W,L,SV,H,ER,K,BB,Pitches_x,game_id,team_id,IPouts_y,Pitches_y
0,aardsda01,aardd001,David Aardsma,SEA,1,2,0,1,0,2,2,0,1,21,SDN201006110,SEA,2,21.0
1,accarje01,accaj001,Jeremy Accardo,SF,1,4,0,0,0,0,0,1,2,16,NYA200907040,TOR,2,26.0
2,accarje01,accaj001,Jeremy Accardo,SF,1,4,0,0,0,0,0,1,2,16,SDN200509270,SFN,4,19.0
3,accarje01,accaj001,Jeremy Accardo,SF,1,4,0,0,0,0,0,1,2,16,TOR200609180,TOR,1,15.0
4,accarje01,accaj001,Jeremy Accardo,SF,1,4,0,0,0,0,0,1,2,16,TOR200704120,TOR,3,19.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69,almanar01,almaa001,Armando Almanza,ARZ,1,1,0,0,0,0,0,0,0,2,SDN200507150,ARI,1,2.0
70,almanar01,almaa001,Armando Almanza,ARZ,1,1,0,0,0,0,0,0,0,2,TBA200206290,FLO,2,2.0
71,anderbr02,andeb002,Brian Anderson,ARZ,1,12,0,1,0,7,3,3,1,59,KCA200504160,KCA,23,92.0
72,anderbr04,andeb004,Brett Anderson,OAK,1,18,1,0,0,6,0,4,0,91,CHA200906040,OAK,21,109.0


In [37]:
find_daily(dailies, pd.DataFrame(pit_stats[pit_stats['G']==1].loc[0]).T)

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts_x,W,L,SV,H,ER,K,BB,Pitches_x,game_id,team_id,IPouts_y,Pitches_y
0,aardsda01,aardd001,David Aardsma,SEA,1,2,0,1,0,2,2,0,1,21,SDN201006110,SEA,2,21.0


In [38]:
matches = find_daily(dailies, pit_stats[pit_stats['G']==1])
match_counts = matches.groupby('retro_id').agg({'player_id': len, 'game_id': min}).rename(columns={'player_id': 'count'})
match_counts


Unnamed: 0_level_0,count,game_id
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1
aardd001,1,SDN201006110
accaj001,5,NYA200907040
acevj001,5,CIN200004050
adkij001,6,ARI200609290
albem001,2,BAL201006080
...,...,...
willr003,11,ARI200508060
wisem001,1,SDN200406060
wolfr001,1,SDN201005020
worrt001,21,ATL199406050
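Only the rows with `count == 1` pin down a game; for the others, the `min` of `game_id` is just one of several candidates. A small sketch of the same selection on made-up data, using `duplicated(keep=False)` instead of a groupby (the IDs below are invented):

```python
import pandas as pd

# Toy candidate matches: pitcher "a" matches one game, "b" matches two.
matches_toy = pd.DataFrame({
    "retro_id": ["a", "b", "b"],
    "game_id": ["G1", "G2", "G3"],
})

# Flag every pitcher that appears more than once, then keep the rest.
ambiguous = matches_toy["retro_id"].duplicated(keep=False)
unique_matches = matches_toy[~ambiguous]
print(unique_matches)
```

This keeps the full matched row for each uniquely resolved pitcher, which can be handy when checking a deduced game by hand.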


In [39]:
known_attended = match_counts[match_counts['count']==1].game_id.unique()
known_attended

array(['SDN201006110', 'DET201308310', 'KCA200504160', 'SDN201009240',
       'SDN200007010', 'SDN200906300', 'SDN200405160', 'SDN199308062',
       'SDN201005260', 'LAN199807120', 'SDN200604300', 'SDN200505160',
       'SDN200908010', 'SDN200409080', 'SEA201209080', 'SFN199609180',
       'TOR200508050', 'SDN201205040', 'DET198808040', 'SDN200509270',
       'SDN200207130', 'SDN200104250', 'OAK199505080', 'SDN200110070',
       'SDN201306240', 'SDN201008260', 'SDN201407050', 'TOR200508060',
       'SDN200010010', 'SDN200408170', 'SDN200609230', 'SDN201307140',
       'SDN201008270', 'SDN201206050', 'SDN199407060', 'SDN201105180',
       'SDN200705110', 'SDN200309050', 'SDN200108050', 'SDN201304220',
       'SDN201109050', 'SDN200007160', 'SDN201005020', 'SDN200008060',
       'CHN199806230', 'SDN200409240', 'SDN199906080', 'SDN200106160',
       'SDN199707050', 'SDN201405250', 'ATL201008060', 'SDN201308180',
       'SDN200409290', 'SEA200905240', 'SDN201209150', 'SDN201104230',
      

In [40]:
def aggregate_dailies(dailies, game_ids, pitchers):
    """Sum the dailies of the given pitchers across the given games, one row per pitcher."""
    # IPouts is left out of the summed columns
    match_cols = ['G', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB']
    known_gm_stats = dailies.query('game_id in @game_ids and retro_id in @pitchers').groupby('retro_id')[match_cols].sum()
    return known_gm_stats

known_gm_stats = aggregate_dailies(dailies, known_attended, pit_stats.retro_id)
known_gm_stats

Unnamed: 0_level_0,G,W,L,SV,H,ER,K,BB
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
aardd001,2,0,1,1,3,2.0,2.0,1.0
accaj001,1,0,0,0,0,0.0,1.0,2.0
acevj001,1,0,0,0,2,0.0,2.0,0.0
acevj002,2,0,1,0,12,9.0,3.0,6.0
adamm001,7,0,0,0,3,1.0,7.0,2.0
...,...,...,...,...,...,...,...,...
worrt002,6,0,0,2,8,2.0,4.0,4.0
wrigj002,2,0,2,0,5,6.0,1.0,6.0
younc003,3,1,1,0,8,6.0,9.0,7.0
ziegb001,1,0,0,0,2,1.0,1.0,1.0


In [41]:
these_dailies = dailies[dailies.game_id.isin(known_attended)]
these_dailies

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
208183,ATL201008060,ATL,1,0,0,0,1,0.0,0,0.0,0.0,5.0,dunnm002
208186,ATL201008060,ATL,1,0,0,0,21,1.0,3,2.0,3.0,102.0,hanst001
208192,ATL201008060,ATL,1,0,1,0,5,1.0,1,4.0,0.0,33.0,moylp001
208193,ATL201008060,ATL,1,0,0,0,3,0.0,0,1.0,1.0,10.0,ventj001
208194,ATL201008060,ATL,1,0,0,0,3,0.0,0,1.0,0.0,16.0,wagnb001
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5000554,TOR200508060,TOR,1,1,0,0,7,0.0,1,2.0,1.0,34.0,walkp001
5000559,TOR200508060,NYA,1,0,0,0,3,0.0,0,0.0,0.0,14.0,franw001
5000562,TOR200508060,NYA,1,0,1,0,12,5.0,10,1.0,3.0,75.0,johnr005
5000565,TOR200508060,NYA,1,0,0,0,8,2.0,4,0.0,3.0,49.0,procs001


In [42]:
these_dailies.retro_id.isin(pit_stats.retro_id).value_counts()

True     883
False     22
Name: retro_id, dtype: int64

Why are there 22 entries that don't show up in my pitchers seen, when looking at games that are *known* to be in the dataset?

Oh, could these be the guys with ambiguous name matching, who we removed from our "universe"?

Yep.  Maybe we should trim the dailies earlier, for the pitchers in our known universe.
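Trimming earlier could look like the sketch below, shown on a toy frame. In this notebook it would presumably mean filtering `dailies` by `pit_stats.retro_id` right after `get_dailies`, but where exactly to put it is an assumption; `dailies_toy` and `known_pitchers` are made-up stand-ins.

```python
import pandas as pd

# Toy dailies: pitcher "x" is not in our universe of resolved names.
dailies_toy = pd.DataFrame({
    "game_id": ["G1", "G1", "G2"],
    "retro_id": ["a", "x", "b"],
    "K": [3, 1, 2],
})
known_pitchers = pd.Series(["a", "b"], name="retro_id")

# Keep only appearances by pitchers we are tracking.
trimmed = dailies_toy[dailies_toy["retro_id"].isin(known_pitchers)]
print(trimmed)
```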

In [43]:
these_dailies[~these_dailies.retro_id.isin(pit_stats.retro_id)]

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
208183,ATL201008060,ATL,1,0,0,0,1,0.0,0,0.0,0.0,5.0,dunnm002
2047763,DET198808040,DET,1,0,0,0,2,0.0,0,0.0,0.0,6.0,hernw001
2397979,KCA200504160,DET,1,1,0,0,24,1.0,6,0.0,5.0,103.0,bondj001
4031884,PIT199405160,PIT,1,0,0,0,6,1.0,2,0.0,1.0,23.0,penaa001
4031887,PIT199405160,SLN,1,0,0,0,3,1.0,2,0.0,0.0,10.0,everb001
4147954,SDN199308062,SDN,1,0,0,0,1,0.0,0,0.0,0.0,1.0,martp002
4154655,SDN199610050,SLN,1,1,0,0,3,0.0,1,0.0,2.0,10.0,matht002
4165122,SDN200106160,SDN,1,0,0,0,11,1.0,4,1.0,0.0,46.0,nunej002
4165840,SDN200108050,SDN,1,0,0,0,6,0.0,0,1.0,4.0,33.0,nunej002
4166133,SDN200109030,SDN,1,0,1,0,20,3.0,7,3.0,3.0,107.0,joneb003


In [44]:
known_gm_stats = these_dailies[these_dailies.retro_id.isin(pit_stats.retro_id)].groupby('retro_id')[match_cols].sum()
known_gm_stats

Unnamed: 0_level_0,G,W,L,SV,H,ER,K,BB
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
aardd001,2,0,1,1,3,2.0,2.0,1.0
accaj001,1,0,0,0,0,0.0,1.0,2.0
acevj001,1,0,0,0,2,0.0,2.0,0.0
acevj002,2,0,1,0,12,9.0,3.0,6.0
adamm001,7,0,0,0,3,1.0,7.0,2.0
...,...,...,...,...,...,...,...,...
worrt002,6,0,0,2,8,2.0,4.0,4.0
wrigj002,2,0,2,0,5,6.0,1.0,6.0
younc003,3,1,1,0,8,6.0,9.0,7.0
ziegb001,1,0,0,0,2,1.0,1.0,1.0


In [45]:
match_cols = ['G', 'IPouts', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB'] # note this doesn't include 'retro_id', since that is now the index
unaccounted_stats = pit_stats.set_index('retro_id')[match_cols]
unaccounted_stats

Unnamed: 0_level_0,G,IPouts,W,L,SV,H,ER,K,BB
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
aardd001,1,2,0,1,0,2,2,0,1
accaj001,1,4,0,0,0,0,0,1,2
acevj002,3,20,0,2,0,13,10,3,7
acevj001,1,6,0,0,0,2,0,2,0
adamm001,12,39,1,0,0,7,1,12,5
...,...,...,...,...,...,...,...,...,...
wrigj002,2,5,0,2,0,5,6,1,6
younc003,6,114,3,1,0,13,7,38,13
ziegb001,2,4,0,0,0,2,1,1,1
zimmj002,1,0,0,0,0,0,0,0,1


In [46]:
# when subtracting the data frames, need to subtract only the intersection of the sets
def subtract_stats(start, delta):
    unchanging = start[~start.index.isin(delta.index)]
    changing   = start[ start.index.isin(delta.index)]

    after = pd.concat([unchanging, changing-delta])
    return after

subtract_stats(unaccounted_stats, known_gm_stats)

Unnamed: 0_level_0,G,IPouts,W,L,SV,H,ER,K,BB
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
adkij001,1,3.0,0,0,0,2,1.0,0.0,0.0
almaa001,1,1.0,0,0,0,0,0.0,0.0,0.0
almac001,2,9.0,0,1,0,7,5.0,4.0,2.0
balfg001,1,1.0,0,0,0,2,0.0,1.0,0.0
beckj002,1,24.0,1,0,0,7,2.0,8.0,1.0
...,...,...,...,...,...,...,...,...,...
worrt002,3,,0,1,0,1,1.0,0.0,0.0
wrigj002,0,,0,0,0,0,0.0,0.0,0.0
younc003,3,,2,0,0,5,1.0,29.0,6.0
ziegb001,1,,0,0,0,0,0.0,0.0,0.0
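In the output above, `IPouts` comes back as NaN for every pitcher who had stats subtracted, because `known_gm_stats` has no `IPouts` column. One possible alternative, sketched here on two stat lines taken from earlier tables (Andy Hawkins and Dave Leiper, with only a few columns kept), is `DataFrame.sub` with `fill_value=0`, which treats anything missing on one side as zero. The variable names are illustrative.

```python
import pandas as pd

start = pd.DataFrame(
    {"G": [2, 1], "IPouts": [40, 3], "K": [7, 0]},
    index=pd.Index(["hawka001", "leipd001"], name="retro_id"),
)
# Delta covers one pitcher only and has no IPouts column.
delta = pd.DataFrame(
    {"G": [1], "K": [1.0]},
    index=pd.Index(["hawka001"], name="retro_id"),
)

# Missing rows and columns in delta count as zero, so nothing turns into NaN.
after = start.sub(delta, fill_value=0).astype(int)
print(after)
```

Note the trade-off: `IPouts` stays at its original value rather than being reduced, so it is only safe if `IPouts` is not used for matching later, and any pitcher present in `delta` but not in `start` would show up with negative totals.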


In [47]:
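def run_iteration(dailies, unaccounted_stats, games_known_in, known_out, _):
    """Run one deduction pass.  Returns a tuple shaped like the arguments so it can be
    fed back in with *; the third element holds only the games deduced in this pass."""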
    # find players who have only one game in the set
    singletons = unaccounted_stats[unaccounted_stats['G']==1]

    # find possible matches for all of those single games, from all of each player's career
    matches = find_daily(dailies, singletons)
    print(matches.query('game_id=="SDN200408170"'))
                                   
    # count the matches for each player, and any games that resolve uniquely are now known to be IN
    match_counts = matches.groupby('retro_id').agg({'G': len, 'game_id': min}).rename(columns={'G': 'count'})
    games_deduced_in = match_counts[match_counts['count']==1].game_id.unique()
    
    # aggregate the stats of pitchers across these games deduced in
    # these stats are now accounted for, so subtract them
    deduced_gm_stats = aggregate_dailies(dailies, games_deduced_in, unaccounted_stats.index)
    still_unaccounted = subtract_stats(unaccounted_stats, deduced_gm_stats)
    return (dailies, still_unaccounted, games_deduced_in, known_out, match_counts)

iter1 = run_iteration(dailies, unaccounted_stats, [], [], None)
iter1, len(iter1[2])

     retro_id  G  IPouts_x  W  L  SV  H  ER  K  BB       game_id team_id  \
485  cruzj005  1         3  0  0   0  0   0  0   1  SDN200408170     ATL   
607  drewt001  1        10  0  0   0  9   6  2   3  SDN200408170     ATL   

     IPouts_y  Pitches  
485         3     13.0  
607        10     79.0  


((              game_id team_id  G  W  L  SV  IPouts   ER  H   BB     K  \
  1049     ALS198707140     ALS  1  0  0   0       8  0.0  2  0.0   1.0   
  1050     ALS198707140     ALS  1  0  1   0       6  2.0  3  0.0   3.0   
  1052     ALS198707140     ALS  1  0  0   0       6  0.0  0  0.0   3.0   
  1055     ALS198707140     ALS  1  0  0   0       6  0.0  1  1.0   2.0   
  1058     ALS198707140     ALS  1  0  0   0       3  0.0  0  0.0   1.0   
  ...               ...     ... .. .. ..  ..     ...  ... ..  ...   ...   
  5061322  WAS201509280     CIN  1  0  0   0       3  1.0  2  0.0   2.0   
  5061323  WAS201509280     CIN  1  0  1   0      15  3.0  8  2.0   3.0   
  5061325  WAS201509280     CIN  1  0  0   0       3  1.0  1  0.0   1.0   
  5061336  WAS201509280     WAS  1  1  0   0      24  1.0  2  3.0  10.0   
  5061338  WAS201509280     WAS  1  0  0   0       3  0.0  1  0.0   0.0   
  
           Pitches  retro_id  
  1049         NaN  henkt001  
  1050         NaN  howej001  
  10

In [48]:
iter2 = run_iteration(*iter1)
iter2, len(iter2[2])

Empty DataFrame
Columns: [retro_id, G, IPouts_x, W, L, SV, H, ER, K, BB, game_id, team_id, IPouts_y, Pitches]
Index: []


((              game_id team_id  G  W  L  SV  IPouts   ER  H   BB     K  \
  1049     ALS198707140     ALS  1  0  0   0       8  0.0  2  0.0   1.0   
  1050     ALS198707140     ALS  1  0  1   0       6  2.0  3  0.0   3.0   
  1052     ALS198707140     ALS  1  0  0   0       6  0.0  0  0.0   3.0   
  1055     ALS198707140     ALS  1  0  0   0       6  0.0  1  1.0   2.0   
  1058     ALS198707140     ALS  1  0  0   0       3  0.0  0  0.0   1.0   
  ...               ...     ... .. .. ..  ..     ...  ... ..  ...   ...   
  5061322  WAS201509280     CIN  1  0  0   0       3  1.0  2  0.0   2.0   
  5061323  WAS201509280     CIN  1  0  1   0      15  3.0  8  2.0   3.0   
  5061325  WAS201509280     CIN  1  0  0   0       3  1.0  1  0.0   1.0   
  5061336  WAS201509280     WAS  1  1  0   0      24  1.0  2  3.0  10.0   
  5061338  WAS201509280     WAS  1  0  0   0       3  0.0  1  0.0   0.0   
  
           Pitches  retro_id  
  1049         NaN  henkt001  
  1050         NaN  howej001  
  10

In [49]:
pit_stats.query('retro_id=="alfoa001"')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches


In [50]:
dailies.query('retro_id=="alfoa001" and game_id in @iter1[2]')

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4172941,SDN200408170,ATL,1,0,0,0,3,0.0,3,0.0,1.0,10.0,alfoa001


In [51]:
dailies.query('retro_id=="alfoa001" and IPouts==2 and H==1 and BB==1')

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
192649,ATL200405070,ATL,1,0,0,0,2,0.0,1,1.0,1.0,14.0,alfoa001
1314720,CHN200508260,FLO,1,0,0,0,2,0.0,1,1.0,0.0,16.0,alfoa001
2254573,HOU200410090,ATL,1,0,0,0,2,2.0,1,1.0,0.0,14.0,alfoa001
3374062,NYN200404150,ATL,1,0,0,0,2,2.0,1,1.0,0.0,15.0,alfoa001


In [52]:
iter2[4].query('game_id == "SDN201306140"')

retro_id,count,game_id


In [53]:
dailies[(dailies['retro_id']=='cahit001') & dailies['game_id'].isin(iter1[2])]

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4190101,SDN201204100,ARI,1,0,0,0,18,1.0,2,6.0,5.0,104.0,cahit001


In [54]:
dailies[(dailies['retro_id']=='harrw002') & dailies['game_id'].isin(iter1[2])]

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4192071,SDN201209150,COL,1,0,0,0,2,0.0,2,0.0,1.0,21.0,harrw002


In [55]:
iter1[4][iter1[4]['game_id'] =='SDN201309260']

retro_id,count,game_id


In [56]:
iter1[4][iter1[4]['game_id'] =='SDN201209150']

retro_id,count,game_id
kellc001,1,SDN201209150


In [57]:
iter2[1].sort_values(by='IPouts')

retro_id,G,IPouts,W,L,SV,H,ER,K,BB
zimmj002,1,0.0,0,0,0,0,0.0,0.0,1.0
strih001,1,1.0,0,0,0,0,0.0,0.0,0.0
samaj001,1,1.0,0,0,0,1,0.0,1.0,1.0
myerr001,1,1.0,0,0,0,0,0.0,1.0,1.0
huffd001,1,1.0,0,0,0,1,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
wilsp001,0,,0,0,0,1,0.0,0.0,0.0
witaj001,7,,0,0,0,15,3.0,20.0,7.0
worrt002,2,,0,1,0,-1,0.0,-1.0,-1.0
wrigj001,2,,0,0,0,1,0.0,1.0,0.0


In [58]:
dailies.query('retro_id=="murrh001" and game_id in @iter1[2]')

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4160270,SDN199906080,SDN,1,0,0,0,17,1.0,5,3.0,2.0,94.0,murrh001


In [59]:
pit_stats.query('retro_id=="murrh001"')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
388,murrahe01,murrh001,Heath Murray,SD,2,21,0,0,0,6,1,4,4,100


The ATL200405070 match for alfoa001 looks suspicious.  I didn't go to a game in Atlanta in May 2004.  This explains the discrepancy in the totals, but how did this game show up in my known attended games?

In [60]:
dailies.query('game_id=="ATL200405070"')

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
192649,ATL200405070,ATL,1,0,0,0,2,0.0,1,1.0,1.0,14.0,alfoa001
192650,ATL200405070,ATL,1,0,0,0,3,0.0,0,1.0,2.0,17.0,cunnw001
192660,ATL200405070,ATL,1,0,0,0,3,0.0,1,0.0,0.0,14.0,reitc001
192661,ATL200405070,ATL,1,0,1,0,19,5.0,10,1.0,5.0,90.0,thomj005
192664,ATL200405070,HOU,1,0,0,0,2,1.0,1,0.0,1.0,14.0,backb001
192668,ATL200405070,HOU,1,0,0,1,3,0.0,1,0.0,1.0,21.0,doteo001
192673,ATL200405070,HOU,1,0,0,0,4,0.0,1,0.0,3.0,20.0,lidgb001
192674,ATL200405070,HOU,1,0,0,0,0,0.0,2,0.0,0.0,3.0,miced001
192675,ATL200405070,HOU,1,1,0,0,18,1.0,4,1.0,1.0,85.0,reddt001


In [61]:
pit_stats.query('retro_id=="thomj005"')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
583,thomsjo01,thomj005,John Thomson,COL ATL,2,35,1,0,0,11,4,15,3,108
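
A game like ATL200405070 could be sanity-checked before it is trusted as "known".  The sketch below is not part of the original notebook: the helper name `inconsistent_pitchers` and the choice of columns to compare are assumptions, and the toy frames only copy a few rows shown above.  It flags any pitcher in a game who is either missing from the HBP totals or whose single-game line is larger than his HBP totals.

```python
import pandas as pd

# Assumption: these columns are enough to show the idea; the real check
# could compare every counting stat in the HBP sheet.
STAT_COLS = ["G", "IPouts", "H", "BB"]


def inconsistent_pitchers(game_id, dailies, totals, stat_cols=STAT_COLS):
    """Return retro_ids from `game_id` whose line cannot fit in `totals`.

    Hypothetical helper.  `dailies` has one row per pitcher per game and
    `totals` has one row per pitcher, both with a `retro_id` column.
    """
    game = dailies[dailies["game_id"] == game_id]
    merged = game.merge(
        totals,
        on="retro_id",
        how="left",
        suffixes=("_game", "_total"),
        indicator=True,
    )
    # A pitcher with no totals row cannot have appeared in an HBP game.
    missing = merged["_merge"] == "left_only"
    # A single-game stat larger than the total is also impossible.
    exceeds = pd.Series(False, index=merged.index)
    for col in stat_cols:
        exceeds |= merged[f"{col}_game"] > merged[f"{col}_total"]
    return merged.loc[missing | exceeds, "retro_id"].tolist()


# Toy subset of the rows displayed in this notebook.
toy_dailies = pd.DataFrame(
    {
        "game_id": ["ATL200405070", "ATL200405070", "SDN199906080"],
        "retro_id": ["alfoa001", "thomj005", "murrh001"],
        "G": [1, 1, 1],
        "IPouts": [2, 19, 17],
        "H": [1, 10, 5],
        "BB": [1, 1, 3],
    }
)
toy_totals = pd.DataFrame(
    {
        "retro_id": ["thomj005", "murrh001"],
        "G": [2, 2],
        "IPouts": [35, 21],
        "H": [11, 6],
        "BB": [3, 4],
    }
)

suspect = inconsistent_pitchers("ATL200405070", toy_dailies, toy_totals)
clean = inconsistent_pitchers("SDN199906080", toy_dailies, toy_totals)
print(suspect)  # ['alfoa001']
print(clean)    # []
```

On the toy data, alfoa001 is flagged because he has no row in the totals, which matches the empty `pit_stats` query for him earlier.  A non-empty result only says the game is inconsistent with the stats sheet; it does not say how the game got into the known set.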


In [62]:
iter3 = run_iteration(*iter2)
iter3

Empty DataFrame
Columns: [retro_id, G, IPouts_x, W, L, SV, H, ER, K, BB, game_id, team_id, IPouts_y, Pitches]
Index: []


(              game_id team_id  G  W  L  SV  IPouts   ER  H   BB     K  \
 1049     ALS198707140     ALS  1  0  0   0       8  0.0  2  0.0   1.0   
 1050     ALS198707140     ALS  1  0  1   0       6  2.0  3  0.0   3.0   
 1052     ALS198707140     ALS  1  0  0   0       6  0.0  0  0.0   3.0   
 1055     ALS198707140     ALS  1  0  0   0       6  0.0  1  1.0   2.0   
 1058     ALS198707140     ALS  1  0  0   0       3  0.0  0  0.0   1.0   
 ...               ...     ... .. .. ..  ..     ...  ... ..  ...   ...   
 5061322  WAS201509280     CIN  1  0  0   0       3  1.0  2  0.0   2.0   
 5061323  WAS201509280     CIN  1  0  1   0      15  3.0  8  2.0   3.0   
 5061325  WAS201509280     CIN  1  0  0   0       3  1.0  1  0.0   1.0   
 5061336  WAS201509280     WAS  1  1  0   0      24  1.0  2  3.0  10.0   
 5061338  WAS201509280     WAS  1  0  0   0       3  0.0  1  0.0   0.0   
 
          Pitches  retro_id  
 1049         NaN  henkt001  
 1050         NaN  howej001  
 1052         NaN  l

In [63]:
dailies[(dailies['retro_id']=='ziegb001') & dailies['game_id'].isin(iter1[2])]

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4190114,SDN201204100,ARI,1,0,0,0,2,1.0,2,1.0,1.0,15.0,ziegb001


In [64]:
dailies[(dailies['retro_id']=='ziegb001') & dailies['game_id'].isin(iter2[2])]

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id


In [65]:
iter3[1].describe()

Unnamed: 0,G,IPouts,W,L,SV,H,ER,K,BB
count,667.0,79.0,667.0,667.0,667.0,667.0,667.0,667.0,667.0
mean,0.667166,9.658228,0.091454,0.08096,0.052474,1.512744,0.53973,1.226387,0.512744
std,1.374429,11.890403,0.358134,0.358568,0.430035,4.387164,1.875116,3.205854,1.576213
min,-1.0,0.0,-1.0,-1.0,-1.0,-6.0,-6.0,-6.0,-5.0
25%,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,15.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
max,17.0,60.0,3.0,3.0,9.0,42.0,18.0,28.0,12.0
