In 2015ish, I spent a lot of time entering my game attendance history into Hardball Passport (HBP).  The sources were ticket stubs, photo history, and my memory.  After entering the data, I threw out a lot of paper ticket stubs.  Hardball Passport eventually vanished, taking my history with it.  I did, however, save a couple of sheets of player stats for the games I attended.  The goal here is to reconstruct the attendance history from these player stats.

Remember, the goal is to reconstruct the set of games that was in Hardball Passport.  This is different from reconstructing my actual attendance history.  For example, if a game was missing from HBP but I know I attended it, including it here would break the reconstruction, because its stats were never part of the saved HBP totals.

The general design is to iterate a loop, maintaining this data:
* a set of games that are known to be in the HBP history
* a set of games that are known to not be in the HBP history
* a set of games that may be in the HBP history (together with the previous two sets, this covers the whole universe of games)
* the HBP player stats for all players in the "possible" games (i.e., with the stats from "known" games removed)

The loop is:
* Identify a game that is known
** Either by memory
** Or by taking a player with 1 game of stats and matching that to a game from their career
* Add that game to the "known" games
* Deduct the stats from that game for all players who played in that game
* For any player who drops to zero games/stats remaining, all of the remaining games they played in go from possible to impossible
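
The loop above can be sketched in plain Python on toy data. Everything in this block is a hypothetical illustration (the `reconstruct` name, the dict layout, the two-number stat lines, the `g1`..`g4` game IDs); it is not the pandas code used later in this notebook, and it skips the "by memory" step.

```python
def reconstruct(hbp, career):
    """hbp: player -> {"G": games, "stats": totals}; career: player -> {game_id: stat line}."""
    remaining = {p: {"G": s["G"], "stats": list(s["stats"])} for p, s in hbp.items()}
    all_games = {g for games in career.values() for g in games}
    known, impossible = set(), set()

    progress = True
    while progress:
        progress = False
        for player, rem in remaining.items():
            if rem["G"] != 1:
                continue
            # a player with one game left: look for the career game with that exact line
            candidates = [g for g, line in career[player].items()
                          if g not in known and g not in impossible
                          and list(line) == rem["stats"]]
            if len(candidates) != 1:
                continue  # no match or ambiguous, so nothing can be deduced yet
            game = candidates[0]
            known.add(game)
            progress = True
            # deduct this game's stats from every tracked player who appeared in it
            for other, games in career.items():
                if game in games and other in remaining:
                    r = remaining[other]
                    r["G"] -= 1
                    r["stats"] = [a - b for a, b in zip(r["stats"], games[game])]
            # players with nothing left: their other career games become impossible
            for other, r in remaining.items():
                if r["G"] == 0:
                    impossible |= set(career[other]) - known

    possible = all_games - known - impossible
    return known, impossible, possible


# toy universe: the HBP history contained g1 and g3
career = {
    "a": {"g1": (3, 1), "g2": (6, 2)},
    "b": {"g1": (9, 4), "g2": (3, 0), "g3": (9, 5)},
    "c": {"g3": (3, 1), "g4": (3, 1)},
}
hbp = {
    "a": {"G": 1, "stats": (3, 1)},
    "b": {"G": 2, "stats": (18, 9)},
    "c": {"G": 1, "stats": (3, 1)},   # ambiguous on its own: g3 and g4 look identical
}
known, impossible, possible = reconstruct(hbp, career)
```

Player `c` cannot be resolved directly, but once `g3` is deduced through `b`, `c` drops to zero games and `g4` becomes impossible.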

In [1]:
import pandas as pd

In [2]:
bat = pd.read_csv('/Users/vkumar/Dropbox/personal/baseball/hardball_passport/passportplus-20150523-0147-B.csv')
pit = pd.read_csv('/Users/vkumar/Dropbox/personal/baseball/hardball_passport/passportplus-20150523-0148-P.csv')

bat

Unnamed: 0,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,BA,OPS
0,Ryan Klesko,SD SF,69,226,58,36,29,13,3,9,104,0.257,0.717
1,Phil Nevin,SD,67,244,76,40,47,14,0,16,138,0.311,0.877
2,Brian Giles,PIT SD,61,219,60,38,26,11,3,8,101,0.274,0.735
3,Trevor Hoffman,SD,58,2,0,0,0,0,0,0,0,0.000,0.000
4,Chase Headley,SD,56,193,49,24,17,19,0,2,74,0.254,0.637
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1738,Ehire Adrianza,SF,1,0,0,0,0,0,0,0,0,0.000,0.000
1739,Juan Acevedo,MIL,1,0,0,0,0,0,0,0,0,0.000,0.000
1740,Jeremy Accardo,SF,1,1,0,0,0,0,0,0,0,0.000,0.000
1741,Tony Abreu,SF,1,4,0,0,0,0,0,0,0,0.000,0.000


In [50]:
bat['SLG'] = bat['TB']/bat['AB']
bat['OBP'] = bat['OPS'] - bat['SLG']
bat

Unnamed: 0,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,BA,OPS,SLG,OBP
0,Ryan Klesko,SD SF,69,226,58,36,29,13,3,9,104,0.257,0.717,0.460177,0.256823
1,Phil Nevin,SD,67,244,76,40,47,14,0,16,138,0.311,0.877,0.565574,0.311426
2,Brian Giles,PIT SD,61,219,60,38,26,11,3,8,101,0.274,0.735,0.461187,0.273813
3,Trevor Hoffman,SD,58,2,0,0,0,0,0,0,0,0.000,0.000,0.000000,0.000000
4,Chase Headley,SD,56,193,49,24,17,19,0,2,74,0.254,0.637,0.383420,0.253580
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1738,Ehire Adrianza,SF,1,0,0,0,0,0,0,0,0,0.000,0.000,,
1739,Juan Acevedo,MIL,1,0,0,0,0,0,0,0,0,0.000,0.000,,
1740,Jeremy Accardo,SF,1,1,0,0,0,0,0,0,0,0.000,0.000,0.000000,0.000000
1741,Tony Abreu,SF,1,4,0,0,0,0,0,0,0,0.000,0.000,0.000000,0.000000
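
The two derived columns rest on SLG = TB / AB and OPS = OBP + SLG, which is why rows with zero at-bats come out empty. Below is a small sketch of the same arithmetic with the zero-AB case made explicit; `derive_slg_obp` is a hypothetical helper, not part of the notebook. One thing worth noticing in the rows shown: the derived OBP equals BA to three decimals (0.257, 0.311, 0.274, 0.254), which suggests, though I haven't confirmed it, that HBP's OPS was really BA + SLG.

```python
def derive_slg_obp(tb, ab, ops):
    """Return (SLG, OBP) derived from total bases, at-bats and OPS."""
    if ab == 0:
        # no at-bats: SLG is undefined, so OBP can't be backed out either
        return None, None
    slg = tb / ab
    return slg, ops - slg


# Ryan Klesko's row from the table above: TB=104, AB=226, OPS=0.717
klesko_slg, klesko_obp = derive_slg_obp(104, 226, 0.717)
```

Because OPS is only given to three decimals, the derived OBP carries that rounding error with it.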


In [42]:
bat.sum()

Name      Ryan KleskoPhil NevinBrian GilesTrevor Hoffman...
Teams     SD SFSDPIT SDSDSDSDSD SDNMIL SDSDSDSD MIL ATLS...
G                                                      5979
AB                                                    14077
H                                                      3519
R                                                      1665
RBI                                                    1600
TwoB                                                    638
ThreeB                                                   98
HR                                                      367
TB                                                     5454
BA                                                  289.327
OPS                                                 725.802
dtype: object

In [43]:
pit.sum()

Name       Trevor HoffmanScott LinebrinkLuke GregersonJoe...
Teams      SDSD MIL ATLSDSD ARZSDSDSD SFNSD MIASD SDNSDSD...
G                                                       1653
IP                                                    3576.6
W                                                        206
L                                                        206
SV                                                       110
H                                                       3598
ER                                                      1545
K                                                       3034
BB                                                      1385
Pitches                                                44740
ERA                                                  63561.1
WHIP                                                 1136.06
IPouts                                                 11091
dtype: object

In [49]:
44740/3576.6

12.509086842252419
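
The IP column is in baseball notation (.1 and .2 mean one and two outs), so the summed 3576.6 is not a true inning count and the ratio above comes out too high. The `pit.sum()` output also lists the `IPouts` total (11091), so a cleaner version of the same check, using only those two totals, might look like this:

```python
total_pitches = 44740   # 'Pitches' total from pit.sum()
total_ipouts = 11091    # 'IPouts' total from pit.sum()

innings = total_ipouts / 3              # three outs per inning
pitches_per_inning = total_pitches / innings
print(round(pitches_per_inning, 2))     # roughly 12.1
```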

# Data Cleaning

Both the HBP data and the retrosheet/bd data need cleaning: matching column names, etc.

In [3]:
yrs = range(1986, 2016)
len(yrs)

30

In [4]:
def get_pitchers(yrs):
    """Return a DF of all pitchers (names and IDs) who pitched in a range of years."""
    df = pd.read_parquet('../data/bd/pitching.parquet')
    pitcher_list = df[df.year_id.isin(yrs)].player_id.unique()

    ppl = pd.read_parquet('../data/bd/people.parquet')[['player_id', 'name_first', 'name_last', 'retro_id']]
    pitchers = ppl[ppl.player_id.isin(pitcher_list)].copy()
    pitchers['display_name'] = pitchers['name_first'] + ' ' + pitchers['name_last']

    # flag display names shared by more than one pitcher; these can't be matched to HBP by name alone
    name_counts = pitchers.groupby('display_name')['retro_id'].count()
    dup_names = name_counts[name_counts>1].index.values
    pitchers['dup_name'] = pitchers.display_name.isin(dup_names)
    return pitchers

In [5]:
pitchers = get_pitchers(yrs)
len(pitchers)

3751

In [6]:
pitchers

Unnamed: 0,player_id,name_first,name_last,retro_id,display_name,dup_name
0,aardsda01,David,Aardsma,aardd001,David Aardsma,False
3,aasedo01,Don,Aase,aased001,Don Aase,False
5,abadfe01,Fernando,Abad,abadf001,Fernando Abad,False
14,abbotji01,Jim,Abbott,abboj001,Jim Abbott,False
16,abbotky01,Kyle,Abbott,abbok001,Kyle Abbott,False
...,...,...,...,...,...,...
19847,zimmejo02,Jordan,Zimmermann,zimmj003,Jordan Zimmermann,False
19851,zinkch01,Charlie,Zink,zinkc001,Charlie Zink,False
19860,zitoba01,Barry,Zito,zitob001,Barry Zito,False
19870,zumayjo01,Joel,Zumaya,zumaj001,Joel Zumaya,False


In [7]:
pit[~pit.Name.isin(pitchers.display_name)]

Unnamed: 0,Name,Teams,G,IP,W,L,SV,H,ER,K,BB,Pitches,ERA,WHIP
137,Tom Layne,SD,3,1.1,0,0,0,2,1,2,1,22,6.75,2.25
138,Hong-Chih Kuo,LAD,3,4.0,0,0,0,1,0,2,3,54,0.0,1.0
177,Alexander Torres,SD,2,0.2,0,0,0,1,1,0,2,18,13.5,4.5
249,J.P. Howell,LAD TB,2,3.0,0,1,0,1,0,5,0,15,0.0,0.333
285,Andrew Carpenter,SD,2,2.0,0,0,0,2,2,4,2,24,9.0,2.0
352,J.J. Trujillo,SD,1,1.1,0,0,0,2,2,1,3,33,13.5,3.75
426,J.J. Putz,ARZ,1,1.0,0,0,1,1,0,1,0,9,0.0,1.0
458,Leo Nunez,FLA,1,0.2,0,0,1,1,0,0,0,8,0.0,1.5
495,T.J. Mathews,STL,1,1.0,1,0,0,1,0,2,0,3,0.0,1.0
554,Guillermo Hernandez,DET,1,0.2,0,0,0,0,0,0,0,6,0.0,0.0
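
Several of the unmatched names above contain initials (J.P. Howell, J.J. Putz, T.J. Mathews). One plausible cause, which is an assumption I haven't checked against the people table, is a difference in punctuation or spacing such as "J.P." versus "J. P.". A sketch of a more forgiving join key follows; `normalize_name` is a hypothetical helper.

```python
import re
import unicodedata


def normalize_name(name):
    """Reduce a display name to lowercase ASCII letters only, for use as a join key."""
    # split accented characters into base letter + accent, then drop the accents
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # drop periods, spaces, hyphens and anything else that isn't a letter
    return re.sub(r"[^a-z]", "", ascii_name.lower())


key_hbp = normalize_name("J.P. Howell")
key_spaced = normalize_name("J. P. Howell")   # assumed alternate spelling
```

A key this loose will create a few more duplicate names, so the `dup_name` check would need to run on the normalized key. Players listed under a different first name altogether would still need a manual mapping.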


In [11]:
(pit.Name.isin(pitchers.display_name)).value_counts()

True     679
False     17
Name: Name, dtype: int64

In [12]:
pitchers[~pitchers['dup_name']]

Unnamed: 0,player_id,name_first,name_last,retro_id,display_name,dup_name
0,aardsda01,David,Aardsma,aardd001,David Aardsma,False
3,aasedo01,Don,Aase,aased001,Don Aase,False
5,abadfe01,Fernando,Abad,abadf001,Fernando Abad,False
14,abbotji01,Jim,Abbott,abboj001,Jim Abbott,False
16,abbotky01,Kyle,Abbott,abbok001,Kyle Abbott,False
...,...,...,...,...,...,...
19847,zimmejo02,Jordan,Zimmermann,zimmj003,Jordan Zimmermann,False
19851,zinkch01,Charlie,Zink,zinkc001,Charlie Zink,False
19860,zitoba01,Barry,Zito,zitob001,Barry Zito,False
19870,zumayjo01,Joel,Zumaya,zumaj001,Joel Zumaya,False


In [8]:
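def compute_IPouts(IP):
    """Convert innings pitched in baseball notation (8.1 = 8 innings + 1 out) to total outs."""
    # round(IP) acts as a floor here because the fractional part is only ever .0, .1 or .2;
    # the outer round() absorbs the float error in 10*(IP%1)
    return round(round(IP)*3 + 10*(IP%1))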

[int(compute_IPouts(ip)) for ip in [8.1, 0.2, 1.2, 38.1]]

[25, 2, 5, 115]
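
The conversion above leans on rounding to clean up float error. An alternative sketch works in integer tenths and rejects values that are not valid baseball notation; `ip_to_outs` is a hypothetical name, and it handles single values only, not a pandas Series.

```python
def ip_to_outs(ip):
    """Convert innings pitched in baseball notation (e.g. 8.1) to total outs."""
    tenths = round(ip * 10)                  # 8.1 -> 81, as an exact integer
    whole_innings, extra_outs = divmod(tenths, 10)
    if extra_outs not in (0, 1, 2):
        raise ValueError(f"not valid innings-pitched notation: {ip}")
    return whole_innings * 3 + extra_outs


outs = [ip_to_outs(ip) for ip in [8.1, 0.2, 1.2, 38.1]]   # same inputs as the cell above
```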

In [9]:
cols = ['player_id', 'retro_id', 'Name', 'Teams', 'G', 'IPouts', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB', 'Pitches']
pit['IPouts'] = compute_IPouts(pit['IP']).apply(int)
pit_stats = pd.merge(left=pitchers[~pitchers['dup_name']], right=pit, left_on='display_name', right_on='Name')[cols]

pit_stats

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
0,aardsda01,aardd001,David Aardsma,SEA,1,2,0,1,0,2,2,0,1,21
1,accarje01,accaj001,Jeremy Accardo,SF,1,4,0,0,0,0,0,1,2,16
2,acevejo01,acevj002,Jose Acevedo,CIN COL,3,20,0,2,0,13,10,3,7,126
3,aceveju01,acevj001,Juan Acevedo,MIL,1,6,0,0,0,2,0,2,0,16
4,adamsmi03,adamm001,Mike Adams,MIL SD,12,39,1,0,0,7,1,12,5,152
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
663,wrighja02,wrigj002,Jaret Wright,SD ATL,2,5,0,2,0,5,6,1,6,62
664,youngch03,younc003,Chris Young,SD,6,114,3,1,0,13,7,38,13,376
665,zieglbr01,ziegb001,Brad Ziegler,ARZ,2,4,0,0,0,2,1,1,1,16
666,zimmejo01,zimmj002,Jordan Zimmerman,SEA,1,0,0,0,0,0,0,0,1,5


OK, this is our starting point: a table of pitching counting stats, with IDs, that matches the retro dailies.
It's a subset of pitchers, only those whose names resolved easily, but the algorithm
will work the same (the unresolved pitchers are essentially just missing from our universe).

Still to clean up:
* column headers match retro
* ipOuts
* Teams

In [10]:
df_dailies = pd.read_parquet('../data/mine/daily.parquet')
df_dailies

Unnamed: 0,game_id,game_dt,game_ct,appearance_dt,team_id,player_id,slot_ct,seq_ct,home_fl,opponent_id,...,f_rf_out,f_rf_tc,f_rf_po,f_rf_a,f_rf_e,f_rf_dp,f_rf_tp,yr,game_type,team_game_number
0,ALS193307060,1933-07-06,0,1933-07-06,ALS,avere101,9,3,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1933,ASG,1
1,ALS193307060,1933-07-06,0,1933-07-06,ALS,chapb102,1,1,True,NLS,...,3.0,1.0,1.0,0.0,0.0,0,0.0,1933,ASG,1
2,ALS193307060,1933-07-06,0,1933-07-06,ALS,cronj101,7,1,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1933,ASG,1
3,ALS193307060,1933-07-06,0,1933-07-06,ALS,crowg102,9,2,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1933,ASG,1
4,ALS193307060,1933-07-06,0,1933-07-06,ALS,dykej101,6,1,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1933,ASG,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5107371,WS4187205170,1872-05-17,0,1872-05-17,WS4,hollh101,1,1,True,WS3,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1872,RS,7
5107372,WS4187205170,1872-05-17,0,1872-05-17,WS4,lennb101,3,1,True,WS3,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1872,RS,7
5107373,WS4187205170,1872-05-17,0,1872-05-17,WS4,mince101,2,1,True,WS3,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1872,RS,7
5107374,WS4187205170,1872-05-17,0,1872-05-17,WS4,steab101,9,1,True,WS3,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1872,RS,7


In [11]:
pit_stats

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
0,aardsda01,aardd001,David Aardsma,SEA,1,2,0,1,0,2,2,0,1,21
1,accarje01,accaj001,Jeremy Accardo,SF,1,4,0,0,0,0,0,1,2,16
2,acevejo01,acevj002,Jose Acevedo,CIN COL,3,20,0,2,0,13,10,3,7,126
3,aceveju01,acevj001,Juan Acevedo,MIL,1,6,0,0,0,2,0,2,0,16
4,adamsmi03,adamm001,Mike Adams,MIL SD,12,39,1,0,0,7,1,12,5,152
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
663,wrighja02,wrigj002,Jaret Wright,SD ATL,2,5,0,2,0,5,6,1,6,62
664,youngch03,younc003,Chris Young,SD,6,114,3,1,0,13,7,38,13,376
665,zieglbr01,ziegb001,Brad Ziegler,ARZ,2,4,0,0,0,2,1,1,1,16
666,zimmejo01,zimmj002,Jordan Zimmerman,SEA,1,0,0,0,0,0,0,0,1,5


In [12]:
pit_stats.set_index('retro_id')

retro_id,player_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,Pitches
aardd001,aardsda01,David Aardsma,SEA,1,2,0,1,0,2,2,0,1,21
accaj001,accarje01,Jeremy Accardo,SF,1,4,0,0,0,0,0,1,2,16
acevj002,acevejo01,Jose Acevedo,CIN COL,3,20,0,2,0,13,10,3,7,126
acevj001,aceveju01,Juan Acevedo,MIL,1,6,0,0,0,2,0,2,0,16
adamm001,adamsmi03,Mike Adams,MIL SD,12,39,1,0,0,7,1,12,5,152
...,...,...,...,...,...,...,...,...,...,...,...,...,...
wrigj002,wrighja02,Jaret Wright,SD ATL,2,5,0,2,0,5,6,1,6,62
younc003,youngch03,Chris Young,SD,6,114,3,1,0,13,7,38,13,376
ziegb001,zieglbr01,Brad Ziegler,ARZ,2,4,0,0,0,2,1,1,1,16
zimmj002,zimmejo01,Jordan Zimmerman,SEA,1,0,0,0,0,0,0,0,1,5


In [13]:
pit_col_mapper = \
{'p_g': 'G',
 'p_w': 'W',
 'p_l': 'L',
 'p_sv': 'SV',
 'p_out': 'IPouts',
 'p_er': 'ER',
 'p_h': 'H',
 'p_bb': 'BB',
 'p_so': 'K',
 'p_pitch': 'Pitches'}
pit_col_mapper

{'p_g': 'G',
 'p_w': 'W',
 'p_l': 'L',
 'p_sv': 'SV',
 'p_out': 'IPouts',
 'p_er': 'ER',
 'p_h': 'H',
 'p_bb': 'BB',
 'p_so': 'K',
 'p_pitch': 'Pitches'}

In [14]:
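def get_dailies(yrs):
    """Return per-game pitching lines for the given years, with columns renamed to match HBP."""
    df = df_dailies  # module-level DF loaded above
    # keep only the rows where the player actually pitched
    dailies = df[(df['yr'].isin(yrs)) & (df['p_g']>0)]

    # retro daily column -> HBP column; the dailies' player_id holds the retro ID
    pit_col_mapper = \
        {'p_g': 'G',
         'p_w': 'W',
         'p_l': 'L',
         'p_sv': 'SV',
         'p_out': 'IPouts',
         'p_er': 'ER',
         'p_h': 'H',
         'p_bb': 'BB',
         'p_so': 'K',
         'p_pitch': 'Pitches',
         'player_id': 'retro_id'}
    cols=['game_id', 'team_id'] + list(pit_col_mapper.values())

    return dailies.rename(columns=pit_col_mapper)[cols]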

In [15]:
dailies = get_dailies(yrs)
dailies

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
1049,ALS198707140,ALS,1,0,0,0,8,0.0,2,0.0,1.0,,henkt001
1050,ALS198707140,ALS,1,0,1,0,6,2.0,3,0.0,3.0,,howej001
1052,ALS198707140,ALS,1,0,0,0,6,0.0,0,0.0,3.0,,langm001
1055,ALS198707140,ALS,1,0,0,0,6,0.0,1,1.0,2.0,,morrj001
1058,ALS198707140,ALS,1,0,0,0,3,0.0,0,0.0,1.0,,plesd001
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4974679,WAS201509280,CIN,1,0,0,0,3,1.0,2,0.0,2.0,22.0,diazj005
4974680,WAS201509280,CIN,1,0,1,0,15,3.0,8,2.0,3.0,83.0,finnb001
4974682,WAS201509280,CIN,1,0,0,0,3,1.0,1,0.0,1.0,15.0,parrm001
4974693,WAS201509280,WAS,1,1,0,0,24,1.0,2,3.0,10.0,113.0,schem001


# Building Blocks for the algorithm

### Find a game (in dailies) from a statline

In [None]:
pit_stats[pit_stats['G']==1].sort_values('K')

In [None]:
pit_stats.loc[530]

In [None]:
mike_scott = pit_stats[pit_stats['player_id']=='scottmi03']
mike_scott

In [20]:
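def find_daily(dailies, stat_line):
    """Return the daily rows whose player and counting stats exactly match the given stat line(s)."""
    # 'Pitches' is not part of the match; it is NaN for some older games in the dailies
    match_cols = ['retro_id', 'G', 'IPouts', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB']
    matches = pd.merge(left=stat_line, right=dailies, on=match_cols )
    return matches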
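
To make the matching idea concrete without the full dataset, here is the same exact-match lookup in plain Python on made-up rows. The `find_games` name, the `toyp001` ID, and the `GAME_A`..`GAME_C` game IDs are all hypothetical, and the key list is shortened for readability.

```python
MATCH_KEYS = ("retro_id", "IPouts", "H", "ER", "K", "BB")


def find_games(stat_line, game_logs, keys=MATCH_KEYS):
    """Return the game_ids whose log row equals stat_line on every key."""
    return [log["game_id"] for log in game_logs
            if all(log[k] == stat_line[k] for k in keys)]


game_logs = [
    {"game_id": "GAME_A", "retro_id": "toyp001", "IPouts": 3, "H": 1, "ER": 0, "K": 2, "BB": 0},
    {"game_id": "GAME_B", "retro_id": "toyp001", "IPouts": 3, "H": 0, "ER": 0, "K": 1, "BB": 0},
    {"game_id": "GAME_C", "retro_id": "toyp001", "IPouts": 3, "H": 0, "ER": 0, "K": 1, "BB": 0},
]

unique_line = {"retro_id": "toyp001", "IPouts": 3, "H": 1, "ER": 0, "K": 2, "BB": 0}
ambiguous_line = {"retro_id": "toyp001", "IPouts": 3, "H": 0, "ER": 0, "K": 1, "BB": 0}

unique_matches = find_games(unique_line, game_logs)        # one candidate: usable
ambiguous_matches = find_games(ambiguous_line, game_logs)  # two candidates: not usable yet
```

Only a single-candidate result pins down a game; a clean one-inning relief line will often match several games in a career.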

In [None]:
find_daily(dailies, mike_scott)

### Subtract a game's dailies from a stat DF
#### Start by subtracting a player's daily from their own statline
#### then scale to every player in the game

In [None]:
dailies[dailies['game_id']=='SDN198609140']

In [None]:
dailies.loc[4059874]

In [None]:
hawk = pit_stats[pit_stats['retro_id']=='hawka001']
hawk

In [None]:
d = dailies[(dailies['game_id']=='SDN198609140')&(dailies['retro_id']=='hawka001')]
d

In [None]:
match_cols = ['G', 'IPouts', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB']
d.set_index('retro_id')[match_cols]

In [None]:
hawk.set_index('retro_id')

In [None]:
hawk.set_index('retro_id')[match_cols] - d.set_index('retro_id')[match_cols]

# Execute the Algorithm

### Likely starting out manually, iterating manually, but calling the building block functions

In [None]:
pit_stats[pit_stats['G']==1].sort_values('K')

In [None]:
pit_stats[pit_stats['G']==1].head(10).apply(lambda row: find_daily(dailies, pd.DataFrame(row).T), axis=1)

In [None]:
find_daily(dailies, pit_stats[pit_stats['G']==1].head(1))

In [None]:
find_daily(dailies, pd.DataFrame(pit_stats[pit_stats['G']==1].loc[0]).T)

In [None]:
matches = find_daily(dailies, pit_stats[pit_stats['G']==1])
match_counts = matches.groupby('retro_id').agg({'player_id': len, 'game_id': min}).rename(columns={'player_id': 'count'})


In [None]:
known_attended = match_counts[match_counts['count']==1].game_id.unique()
known_attended

In [16]:
def aggregate_dailies(dailies, games, pitchers):
    """Sum each pitcher's stats across the given games, keeping only pitchers in our universe."""
    included_dailies = dailies[dailies.game_id.isin(games)]
    match_cols = ['G', 'IPouts', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB']
    known_gm_stats = included_dailies[included_dailies.retro_id.isin(pitchers)].groupby('retro_id')[match_cols].sum()
    return known_gm_stats

#known_gm_stats = aggregate_dailies(dailies, known_attended, pit_stats.retro_id)
#known_gm_stats

In [None]:
these_dailies = dailies[dailies.game_id.isin(known_attended)]
these_dailies

In [None]:
these_dailies.retro_id.isin(pit_stats.retro_id).value_counts()

Why are there 25 entries that don't show up among the pitchers I've seen, when these are games that are *known* to be in the dataset?

Oh, could these be the guys with ambiguous name matching, who we removed from our "universe"?

Yep.  Maybe we should trim the dailies earlier, for the pitchers in our known universe.
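
Trimming early would mean filtering the dailies once, up front, to the pitchers whose names resolved. A plain-Python sketch of that filter on made-up rows (`trim_logs` and the IDs are hypothetical):

```python
def trim_logs(game_logs, universe_ids):
    """Keep only the rows for players in the known universe; also report how many were dropped."""
    universe = set(universe_ids)
    kept = [row for row in game_logs if row["retro_id"] in universe]
    return kept, len(game_logs) - len(kept)


toy_logs = [
    {"game_id": "GAME_A", "retro_id": "toyp001"},
    {"game_id": "GAME_A", "retro_id": "toyp002"},   # ambiguous name, not in the universe
    {"game_id": "GAME_B", "retro_id": "toyp001"},
]
kept_rows, dropped_count = trim_logs(toy_logs, ["toyp001"])
```

Reporting the dropped count keeps the trade-off visible: every dropped row is a stat line that can no longer help confirm or rule out a game.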

In [None]:
these_dailies[~these_dailies.retro_id.isin(pit_stats.retro_id)]

In [None]:
known_gm_stats = these_dailies[these_dailies.retro_id.isin(pit_stats.retro_id)].groupby('retro_id')[match_cols].sum()
known_gm_stats

In [17]:
match_cols = ['G', 'IPouts', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB'] # note this doesn't include 'retro_id', since that is now the index
unaccounted_stats = pit_stats.set_index('retro_id')[match_cols]
unaccounted_stats

retro_id,G,IPouts,W,L,SV,H,ER,K,BB
aardd001,1,2,0,1,0,2,2,0,1
accaj001,1,4,0,0,0,0,0,1,2
acevj002,3,20,0,2,0,13,10,3,7
acevj001,1,6,0,0,0,2,0,2,0
adamm001,12,39,1,0,0,7,1,12,5
...,...,...,...,...,...,...,...,...,...
wrigj002,2,5,0,2,0,5,6,1,6
younc003,6,114,3,1,0,13,7,38,13
ziegb001,2,4,0,0,0,2,1,1,1
zimmj002,1,0,0,0,0,0,0,0,1


In [18]:
# when subtracting the data frames, need to subtract only the intersection of the sets
def subtract_stats(start, delta):
    unchanging = start[~start.index.isin(delta.index)]
    changing   = start[ start.index.isin(delta.index)]

    after = pd.concat([unchanging, changing-delta])
    return after

#subtract_stats(unaccounted_stats, known_gm_stats)
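
The reason for splitting into `unchanging` and `changing` is that subtracting two DataFrames aligns them on the union of their indexes, so any player missing from `delta` would come out as NaN. The same rule in plain Python, on made-up numbers (`subtract_stats_dict` is a hypothetical stand-in, not a replacement for the function above):

```python
def subtract_stats_dict(start, delta):
    """Subtract delta from start for shared players; everyone else passes through unchanged."""
    after = {}
    for player, line in start.items():
        deduction = delta.get(player)
        if deduction is None:
            after[player] = dict(line)               # not in any deduced game
        else:
            after[player] = {stat: value - deduction.get(stat, 0)
                             for stat, value in line.items()}
    return after


start = {"p1": {"G": 3, "K": 10}, "p2": {"G": 1, "K": 2}}
delta = {"p1": {"G": 1, "K": 4}, "p9": {"G": 1, "K": 1}}   # p9 is not tracked, so it is ignored
after = subtract_stats_dict(start, delta)
```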

In [21]:
def run_iteration(dailies, unaccounted_stats, games_known_in, known_out):
    """Run one deduction pass.

    Returns (dailies, remaining stats, games deduced in this pass, known_out, match_counts).
    Note: games_known_in is accepted but not used yet, so the returned game list
    holds only this pass's deductions, not a running total.
    """
    # find players who have only one game in the set
    singletons = unaccounted_stats[unaccounted_stats['G']==1]

    # find possible matches for all of those single games, from all of each player's career
    matches = find_daily(dailies, singletons)

    # count the matches for each player, and any games that resolve uniquely are now known to be IN
    match_counts = matches.groupby('retro_id').agg({'G': len, 'game_id': min}).rename(columns={'G': 'count'})
    games_deduced_in = match_counts[match_counts['count']==1].game_id.unique()

    # aggregate the stats of pitchers across these games deduced in
    # these stats are now accounted for, so subtract them
    deduced_gm_stats = aggregate_dailies(dailies, games_deduced_in, unaccounted_stats.index)
    still_unaccounted = subtract_stats(unaccounted_stats, deduced_gm_stats)
    return (dailies, still_unaccounted, games_deduced_in, known_out, match_counts)

iter1 = run_iteration(dailies, unaccounted_stats, [], [])
iter1

(              game_id team_id  G  W  L  SV  IPouts   ER  H   BB     K  \
 1049     ALS198707140     ALS  1  0  0   0       8  0.0  2  0.0   1.0   
 1050     ALS198707140     ALS  1  0  1   0       6  2.0  3  0.0   3.0   
 1052     ALS198707140     ALS  1  0  0   0       6  0.0  0  0.0   3.0   
 1055     ALS198707140     ALS  1  0  0   0       6  0.0  1  1.0   2.0   
 1058     ALS198707140     ALS  1  0  0   0       3  0.0  0  0.0   1.0   
 ...               ...     ... .. .. ..  ..     ...  ... ..  ...   ...   
 4974679  WAS201509280     CIN  1  0  0   0       3  1.0  2  0.0   2.0   
 4974680  WAS201509280     CIN  1  0  1   0      15  3.0  8  2.0   3.0   
 4974682  WAS201509280     CIN  1  0  0   0       3  1.0  1  0.0   1.0   
 4974693  WAS201509280     WAS  1  1  0   0      24  1.0  2  3.0  10.0   
 4974695  WAS201509280     WAS  1  0  0   0       3  0.0  1  0.0   0.0   
 
          Pitches  retro_id  
 1049         NaN  henkt001  
 1050         NaN  howej001  
 1052         NaN  l

In [22]:
iter2 = run_iteration(iter1[0], iter1[1], iter1[2], iter1[3])
iter2

In [26]:
iter2[4][iter2[4]['game_id'] =='SDN201306140']

retro_id,count,game_id
cahit001,1,SDN201306140
harrw002,1,SDN201306140


In [29]:
dailies[(dailies['retro_id']=='cahit001') & dailies['game_id'].isin(iter1[2])]

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4116942,SDN201204100,ARI,1,0,0,0,18,1.0,2,6.0,5.0,104.0,cahit001
4121453,SDN201309260,ARI,1,0,0,0,17,2.0,5,4.0,3.0,100.0,cahit001


In [31]:
dailies[(dailies['retro_id']=='harrw002') & dailies['game_id'].isin(iter1[2])]

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4118912,SDN201209150,COL,1,0,0,0,2,0.0,2,0.0,1.0,21.0,harrw002
4121459,SDN201309260,ARI,1,0,0,0,3,0.0,0,0.0,1.0,13.0,harrw002


In [30]:
iter1[4][iter1[4]['game_id'] =='SDN201309260']

retro_id,count,game_id
roe-c001,1,SDN201309260


In [32]:
iter1[4][iter1[4]['game_id'] =='SDN201209150']

retro_id,count,game_id
kellc001,1,SDN201209150


In [40]:
iter2[1].sort_values(by='IPouts')

retro_id,G,IPouts,W,L,SV,H,ER,K,BB
thomj005,-1,-18,0,-1,0,-9,-5.0,-5.0,-1.0
lidgb001,-1,-4,0,0,0,-1,0.0,-3.0,0.0
doteo001,-1,-3,0,0,-1,-1,0.0,-1.0,0.0
murrh001,0,-2,0,0,0,0,0.0,0.0,0.0
leblw001,0,-2,0,0,0,3,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
haraa001,3,57,1,1,0,16,3.0,15.0,6.0
tomkb001,3,59,0,2,0,20,15.0,10.0,11.0
cooka002,3,60,1,2,0,20,6.0,6.0,6.0
welld001,4,70,0,1,0,35,16.0,12.0,3.0
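
Negative remainders like the ones at the top of this table mean more was deducted for a player than HBP ever recorded, so either a deduced game is wrong or the name-to-ID match is. A small check that flags those players after each pass could look like the sketch below; `find_overdrawn` is a hypothetical helper, and the rows are copied from the table above.

```python
STAT_COLS = ("G", "IPouts", "W", "L", "SV", "H", "ER", "K", "BB")


def find_overdrawn(remaining):
    """Return {player: [stat names that went negative]} for players with any negative stat."""
    flagged = {}
    for player, values in remaining.items():
        negatives = [col for col, value in zip(STAT_COLS, values) if value < 0]
        if negatives:
            flagged[player] = negatives
    return flagged


remaining = {
    "thomj005": (-1, -18, 0, -1, 0, -9, -5, -5, -1),
    "murrh001": (0, -2, 0, 0, 0, 0, 0, 0, 0),
    "leblw001": (0, -2, 0, 0, 0, 3, 0, 0, 0),
    "haraa001": (3, 57, 1, 1, 0, 16, 3, 15, 6),
}
overdrawn = find_overdrawn(remaining)
```

Rows like leblw001, with zero games left but hits still unaccounted for, point to the same kind of mismatch and could be flagged with a second check.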


In [41]:
dailies[(dailies['retro_id']=='thomj005') & dailies['game_id'].isin(iter1[2])]

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4093324,SDN200110070,COL,1,1,0,0,21,3.0,7,0.0,12.0,105.0,thomj005
4100894,SDN200505160,ATL,1,0,0,0,13,1.0,3,3.0,3.0,76.0,thomj005


In [40]:
iter3 = run_iteration(iter2[0], iter2[1], iter2[2], iter2[3])
iter3

In [43]:
dailies[(dailies['retro_id']=='ziegb001') & dailies['game_id'].isin(iter1[2])]

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4116955,SDN201204100,ARI,1,0,0,0,2,1.0,2,1.0,1.0,15.0,ziegb001


In [44]:
dailies[(dailies['retro_id']=='ziegb001') & dailies['game_id'].isin(iter2[2])]

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,Pitches,retro_id
4120117,SDN201306140,ARI,1,0,0,0,3,0.0,0,0.0,0.0,7.0,ziegb001
