In 2015ish, I spent a lot of time entering my game attendance history into Hardball Passport.  The sources were ticket stubs, photo history, and my memory.  After entering the data, I threw out a lot of paper ticket stubs.  Hardball Passport eventually vanished, losing my history with it.  I did, however, save a couple sheets of player stats in games I attended.  The goal here is to reconstruct the attendance history from these player stats.

Remember, the goal is to reconstruct the set of games that was in Hardball Passport.  This is different from trying to reconstruct my actual attendance history.  For example, if a game was missing in HBP, but I know I attended it, including it here will break the reconstruction algorithm, because the HBP stat totals would not include that game.

The general design is to iterate a loop, maintaining this data:
* a set of games that are known to be in the HBP history
* a set of games that are known to not be in the HBP history
* a set of games that may be in the HBP history (note that these and the previous two represent the universe of games)
* the HBP player stats for all players in the "possible" games (i.e., with the stats in "known" games removed)

The loop is:
* Identify a game that is known
** Either by memory
** Or by taking a player with 1 game of stats and matching that stat line to a game from their career
* Add that game to the "known" games
* Deduct the stats from that game for all players who played in that game
* For any player who drops to zero games/stats remaining, all of the other games they played in go from possible to impossible
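
The loop above can be sketched on toy data.  This is only an illustration under stated assumptions, not the code used later in this notebook: the game IDs, player names, the two-stat (AB, H) lines, and the `reconstruct` helper are all hypothetical, and memory-based identification is left out.

```python
# Hypothetical toy universe: game_id -> {player: (AB, H) in that game}.
GAMES = {
    "g1": {"ann": (4, 2), "bob": (3, 0)},
    "g2": {"ann": (4, 1), "cat": (5, 3)},
    "g3": {"bob": (3, 1), "cat": (4, 1)},
    "g4": {"cat": (4, 1), "dan": (2, 2)},
}
# Hypothetical saved totals: player -> (G, AB, H), built from games g1 and g3.
TOTALS = {"ann": (1, 4, 2), "bob": (2, 6, 1), "cat": (1, 4, 1)}


def reconstruct(games, totals):
    """Return (known, impossible, possible) sets of game ids."""
    remaining = {p: list(t) for p, t in totals.items()}  # unaccounted [G, AB, H]
    known, impossible, possible = set(), set(), set(games)
    while True:
        # A player with exactly one game left whose line matches exactly one
        # still-possible game identifies that game.
        deduced = set()
        for player, (g, ab, h) in remaining.items():
            if g != 1:
                continue
            hits = [gid for gid in possible if games[gid].get(player) == (ab, h)]
            if len(hits) == 1:
                deduced.add(hits[0])
        if not deduced:
            break
        # Move deduced games to "known" and deduct every tracked player's line.
        for gid in deduced:
            known.add(gid)
            possible.discard(gid)
            for player, (ab, h) in games[gid].items():
                if player in remaining:
                    remaining[player][0] -= 1
                    remaining[player][1] -= ab
                    remaining[player][2] -= h
        # Players with zero games left rule out every other game they played in.
        done = {p for p, r in remaining.items() if r[0] == 0}
        ruled_out = {gid for gid in possible if done & set(games[gid])}
        impossible |= ruled_out
        possible -= ruled_out
    return known, impossible, possible


known, impossible, possible = reconstruct(GAMES, TOTALS)
print(sorted(known), sorted(impossible), sorted(possible))
```

In this toy, players missing from the totals (like `dan`) are simply ignored rather than treated as having zero games, since the real stat sheets only cover the players whose names could be resolved.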

In [1]:
import pandas as pd
import boxball_loader as bbl

In [2]:
yrs = range(1986, 2016)

df_dailies = bbl.load_dailies(game_types=bbl.GameType.ALL).query('yr in @yrs')
df_dailies

Unnamed: 0,game_id,game_dt,game_ct,appearance_dt,team_id,player_id,slot_ct,seq_ct,home_fl,opponent_id,...,f_rf_out,f_rf_tc,f_rf_po,f_rf_a,f_rf_e,f_rf_dp,f_rf_tp,yr,game_type,team_game_number
1043,ALS198707140,1987-07-14,0,1987-07-14,ALS,bainh001,3,4,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1987,ASG,1
1044,ALS198707140,1987-07-14,0,1987-07-14,ALS,bellg001,4,1,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1987,ASG,1
1045,ALS198707140,1987-07-14,0,1987-07-14,ALS,boggw001,3,1,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1987,ASG,1
1046,ALS198707140,1987-07-14,0,1987-07-14,ALS,evand002,7,2,True,NLS,...,21.0,2.0,2.0,0.0,0.0,0,0.0,1987,ASG,1
1047,ALS198707140,1987-07-14,0,1987-07-14,ALS,fernt001,6,2,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1987,ASG,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5061336,WAS201509280,2015-09-28,0,2015-09-28,WAS,schem001,9,1,True,CIN,...,0.0,0.0,0.0,0.0,0.0,0,0.0,2015,RS,156
5061337,WAS201509280,2015-09-28,0,2015-09-28,WAS,taylm002,1,1,True,CIN,...,0.0,0.0,0.0,0.0,0.0,0,0.0,2015,RS,156
5061338,WAS201509280,2015-09-28,0,2015-09-28,WAS,thorm001,9,3,True,CIN,...,0.0,0.0,0.0,0.0,0.0,0,0.0,2015,RS,156
5061339,WAS201509280,2015-09-28,0,2015-09-28,WAS,turnt001,2,1,True,CIN,...,0.0,0.0,0.0,0.0,0.0,0,0.0,2015,RS,156


In [3]:
bat = pd.read_csv('~/Dropbox/personal/baseball/hardball_passport/passportplus-20150523-0147-B.csv')
pit = pd.read_csv('~/Dropbox/personal/baseball/hardball_passport/passportplus-20150523-0148-P.csv')

bat

Unnamed: 0,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,BA,OPS
0,Ryan Klesko,SD SF,69,226,58,36,29,13,3,9,104,0.257,0.717
1,Phil Nevin,SD,67,244,76,40,47,14,0,16,138,0.311,0.877
2,Brian Giles,PIT SD,61,219,60,38,26,11,3,8,101,0.274,0.735
3,Trevor Hoffman,SD,58,2,0,0,0,0,0,0,0,0.000,0.000
4,Chase Headley,SD,56,193,49,24,17,19,0,2,74,0.254,0.637
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1738,Ehire Adrianza,SF,1,0,0,0,0,0,0,0,0,0.000,0.000
1739,Juan Acevedo,MIL,1,0,0,0,0,0,0,0,0,0.000,0.000
1740,Jeremy Accardo,SF,1,1,0,0,0,0,0,0,0,0.000,0.000
1741,Tony Abreu,SF,1,4,0,0,0,0,0,0,0,0.000,0.000


In [4]:
# Fix known errors in the HBP data

bat.loc[bat['Name']=='Denny Walling', 'AB'] = 4
bat.loc[bat['Name']=='Frank Menechino', 'RBI'] = 1
bat.loc[bat['Name']=='Ryan Howard', 'R'] = 1
pit.loc[pit['Name']=='Denny Neagle', 'H'] = 2
pit.loc[pit['Name']=='Randy Myers', 'G'] = 0 # I don't know what Myers' correct data is, so just set games to 0
bat.loc[bat['Name']=='Glenallen Hill', 'G'] = 0 # ditto
pit.loc[pit['Name']=='Heathcliff Slocumb', 'BB'] = 1


bat.query('Name == "Frank Menechino"')

Unnamed: 0,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,BA,OPS
1292,Frank Menechino,TOR,1,2,1,1,1,0,0,1,4,0.5,2.5


In [5]:
bat.sum()

Name      Ryan KleskoPhil NevinBrian GilesTrevor Hoffman...
Teams     SD SFSDPIT SDSDSDSDSD SDNMIL SDSDSDSD MIL ATLS...
G                                                      5975
AB                                                    14076
H                                                      3519
R                                                      1666
RBI                                                    1599
TwoB                                                    638
ThreeB                                                   98
HR                                                      367
TB                                                     5454
BA                                                  289.327
OPS                                                 725.802
dtype: object

In [6]:
pit.sum()

Name       Trevor HoffmanScott LinebrinkLuke GregersonJoe...
Teams      SDSD MIL ATLSDSD ARZSDSDSD SFNSD MIASD SDNSDSD...
G                                                       1652
IP                                                    3576.6
W                                                        206
L                                                        206
SV                                                       110
H                                                       3597
ER                                                      1545
K                                                       3034
BB                                                      1384
Pitches                                                44740
ERA                                                 63561.09
WHIP                                                1136.057
dtype: object

# Data Cleaning

Both the HBP data and the retrosheet/bd data need cleaning: matching column names, etc.

In [7]:

len(yrs)

30

In [8]:
def get_players(yrs):
    """ Retrieve a DF of all batters and their names/IDs who played in a range of years"""
    player_list = bbl.load_batting(yrs, coalesce_type=bbl.CoalesceMode.PLAYER_CAREER).index.values
    
    players = bbl.load_people().query('player_id in @player_list')[['player_id', 'name_first', 'name_last', 'retro_id']]
    players['display_name'] = players['name_first'] + ' ' + players['name_last']
    name_counts = players.groupby('display_name')['retro_id'].count()
    dup_names = name_counts[name_counts>1].index.values
    players['dup_name'] = players.display_name.isin(dup_names)
    return players

players = get_players(yrs)
len(players)

6804

In [9]:
players.sample(10)

Unnamed: 0,player_id,name_first,name_last,retro_id,display_name,dup_name
19164,wellsja02,Jared,Wells,wellj001,Jared Wells,False
1037,beachbr01,Brandon,Beachy,beacb001,Brandon Beachy,False
14439,pooleji02,Jim,Poole,poolj001,Jim Poole,False
7956,hernara02,Ramon,Hernandez,hernr002,Ramon Hernandez,False
1508,blancgr01,Gregor,Blanco,blang001,Gregor Blanco,False
13737,palacvi01,Vicente,Palacios,palav001,Vicente Palacios,False
17312,stefejo01,John,Stefero,stefj001,John Stefero,False
4273,davisru01,Russ,Davis,davir002,Russ Davis,False
13912,paulsbe01,Ben,Paulsen,paulb001,Ben Paulsen,False
6981,greense01,Sean,Green,grees004,Sean Green,False


In [10]:
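def compute_IPouts(IP):
    """Convert innings pitched in box-score notation (8.1 = 8 innings + 1 out) to total outs."""
    # whole innings * 3, plus the digit after the decimal point (0, 1, or 2 outs)
    return round(round(IP)*3 + 10*(IP%1))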

[int(compute_IPouts(ip)) for ip in [8.1, 0.2, 1.2, 38.1]]

[25, 2, 5, 115]

In [11]:
pit['IPouts'] = compute_IPouts(pit['IP']).apply(int)

In [12]:
standard_cols = ['player_id', 'retro_id', 'Name', 'Teams']
pit_cols = standard_cols + ['G', 'IPouts', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB']
bat_cols = standard_cols + ['G', 'AB', 'H', 'R', 'RBI', 'TwoB', 'ThreeB', 'HR', 'TB']

pit_stats = pd.merge(left=players.query('not dup_name'), right=pit, left_on='display_name', right_on='Name')[pit_cols]
bat_stats = pd.merge(left=players.query('not dup_name'), right=bat, left_on='display_name', right_on='Name')[bat_cols]


bat_stats, pit_stats

(      player_id  retro_id              Name    Teams  G  AB  H  R  RBI  TwoB  \
 0     aardsda01  aardd001     David Aardsma      SEA  1   0  0  0    0     0   
 1     abreubo01  abreb001       Bobby Abreu  PHI NYY  5  19  4  3    2     1   
 2     abreuto01  abret001        Tony Abreu       SF  1   4  0  0    0     0   
 3     accarje01  accaj001    Jeremy Accardo       SF  1   1  0  0    0     0   
 4     acevejo01  acevj002      Jose Acevedo  CIN COL  3   2  0  0    0     0   
 ...         ...       ...               ...      ... ..  .. .. ..  ...   ...   
 1645  zeileto01  zeilt001        Todd Zeile  LAD NYM  2   7  3  2    2     2   
 1646  zieglbr01  ziegb001      Brad Ziegler      ARZ  2   0  0  0    0     0   
 1647  zimmejo01  zimmj002  Jordan Zimmerman      SEA  1   0  0  0    0     0   
 1648   zitoba01  zitob001        Barry Zito       SF  2   2  0  0    0     0   
 1649  zobribe01  zobrb001       Ben Zobrist       TB  1   4  2  1    4     1   
 
       ThreeB  HR  TB  
 0

OK, this is our starting point.  A table with pitching counting stats matching the retro dailies,
with IDs.  It's a subset of pitchers, only those whose names resolved easily; but the algorithm 
will work the same (the pitchers who didn't resolve are essentially just missing from our universe).
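
One way to see which names did not resolve is a left merge with `indicator=True`.  The sketch below uses made-up miniature versions of `players` and `pit` (all names and IDs are hypothetical); it is not part of the original workflow.

```python
import pandas as pd

# Hypothetical stand-in for the `players` frame built above.
players = pd.DataFrame({
    "retro_id": ["aaaa001", "bbbb001", "bbbb002"],
    "display_name": ["Al Alpha", "Bo Beta", "Bo Beta"],
})
players["dup_name"] = players["display_name"].duplicated(keep=False)

# Hypothetical stand-in for the HBP pitching sheet.
pit = pd.DataFrame({"Name": ["Al Alpha", "Bo Beta", "Cy Gamma"], "G": [1, 2, 1]})

# Left merge keeps every HBP row; `_merge` marks rows with no unique player match.
merged = pd.merge(
    pit,
    players.query("not dup_name"),
    how="left",
    left_on="Name",
    right_on="display_name",
    indicator=True,
)
unresolved = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
print(unresolved)
```

Here "Bo Beta" is unresolved because the name is shared by two players, and "Cy Gamma" because no player has that name.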

Still to clean up:
* column headers match retro
* Teams

In [13]:
pit_col_mapper = \
{'p_g': 'G',
 'p_w': 'W',
 'p_l': 'L',
 'p_sv': 'SV',
 'p_out': 'IPouts',
 'p_er': 'ER',
 'p_h': 'H',
 'p_bb': 'BB',
 'p_so': 'K',
 'player_id': 'retro_id'}

bat_col_mapper = \
{'b_g': 'G',
 'b_ab': 'AB',
 'b_h': 'H',
 'b_r': 'R',
 'b_rbi': 'RBI',
 'b_2b': 'TwoB',
 'b_3b': 'ThreeB',
 'b_hr': 'HR',
 'b_tb': 'TB',
 'player_id': 'retro_id'}

pit_col_mapper, bat_col_mapper

({'p_g': 'G',
  'p_w': 'W',
  'p_l': 'L',
  'p_sv': 'SV',
  'p_out': 'IPouts',
  'p_er': 'ER',
  'p_h': 'H',
  'p_bb': 'BB',
  'p_so': 'K',
  'player_id': 'retro_id'},
 {'b_g': 'G',
  'b_ab': 'AB',
  'b_h': 'H',
  'b_r': 'R',
  'b_rbi': 'RBI',
  'b_2b': 'TwoB',
  'b_3b': 'ThreeB',
  'b_hr': 'HR',
  'b_tb': 'TB',
  'player_id': 'retro_id'})

In [14]:
def get_dailies(col_mapper):
    cols=['game_id', 'team_id'] + list(col_mapper.values())
    
    return df_dailies.rename(columns=col_mapper)[cols]

In [15]:
bat_dailies = get_dailies(bat_col_mapper)
pit_dailies = get_dailies(pit_col_mapper)
bat_dailies, pit_dailies

(              game_id team_id  G  AB  H  R  RBI  TwoB  ThreeB   HR   TB  \
 1043     ALS198707140     ALS  1   1  0  0  0.0   0.0     0.0  0.0  0.0   
 1044     ALS198707140     ALS  1   3  0  0  0.0   0.0     0.0  0.0  0.0   
 1045     ALS198707140     ALS  1   3  0  0  0.0   0.0     0.0  0.0  0.0   
 1046     ALS198707140     ALS  1   2  2  0  0.0   0.0     0.0  0.0  2.0   
 1047     ALS198707140     ALS  1   2  0  0  0.0   0.0     0.0  0.0  0.0   
 ...               ...     ... ..  .. .. ..  ...   ...     ...  ...  ...   
 5061336  WAS201509280     WAS  1   3  2  0  0.0   0.0     0.0  0.0  2.0   
 5061337  WAS201509280     WAS  1   5  2  0  1.0   0.0     0.0  0.0  2.0   
 5061338  WAS201509280     WAS  1   0  0  0  0.0   0.0     0.0  0.0  0.0   
 5061339  WAS201509280     WAS  1   3  1  0  0.0   0.0     0.0  0.0  1.0   
 5061340  WAS201509280     WAS  1   3  0  1  0.0   0.0     0.0  0.0  0.0   
 
          retro_id  
 1043     bainh001  
 1044     bellg001  
 1045     boggw001  
 1

# Building Blocks for the algorithm

### Find a game (in dailies) from a statline

In [16]:
pit_stats[pit_stats['G']==1].sort_values('K')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB
0,aardsda01,aardd001,David Aardsma,SEA,1,2,0,1,0,2,2,0,1
305,leipeda01,leipd001,Dave Leiper,OAK,1,3,0,0,0,1,0,0,1
308,lewisji02,lewij002,Jim Lewis,SD,1,2,0,0,0,0,0,0,2
316,loeweca01,loewc001,Carlton Loewer,SD,1,6,0,1,0,7,6,0,1
321,louxsh01,louxs001,Shane Loux,SF,1,3,1,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,matsuda01,matsd001,Daisuke Matsuzaka,BOS,1,18,1,0,0,5,1,9,5
518,scherma01,schem001,Max Scherzer,DET,1,16,0,1,0,4,4,10,3
236,hernafe02,hernf002,Felix Hernandez,SEA,1,21,1,0,0,7,1,10,1
505,sabatcc01,sabac001,CC Sabathia,CLE,1,24,1,0,0,4,0,11,2


In [17]:
bat_stats.query('G==1').sort_values('RBI')

Unnamed: 0,player_id,retro_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB
0,aardsda01,aardd001,David Aardsma,SEA,1,0,0,0,0,0,0,0,0
1084,olivefr01,olivf001,Francisco Oliveras,SF,1,0,0,0,0,0,0,0,0
1086,olmedra01,olmer001,Ray Olmedo,CIN,1,0,0,0,0,0,0,0,0
1087,olsonti01,olsot001,Tim Olson,ARZ,1,3,1,0,0,0,0,0,1
1088,oltmi01,olt-m001,Mike Olt,CHC,1,4,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
573,guerrpe01,guerp001,Pedro Guerrero,LAD,1,5,3,0,3,1,0,0,4
1328,santaca01,santc002,Carlos Santana,CLE,1,4,2,1,3,0,0,1,5
1649,zobribe01,zobrb001,Ben Zobrist,TB,1,4,2,1,4,1,0,1,6
1489,trammal01,trama001,Alan Trammell,DET,1,4,2,0,5,2,0,0,4


In [18]:
bat_stats.query('retro_id=="nettg001"')

Unnamed: 0,player_id,retro_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB
1048,nettlgr01,nettg001,Graig Nettles,SD,1,4,2,2,5,0,0,2,8


In [19]:
mike_scott = pit_stats[pit_stats['player_id']=='scottmi03']
mike_scott

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB
524,scottmi03,scotm001,Mike Scott,HOU,1,25,0,1,0,8,3,14,0


In [20]:
match_cols_pit = ['retro_id', 'G', 'IPouts', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB']
match_cols_bat = ['retro_id', 'G', 'AB', 'H', 'R', 'RBI', 'TwoB', 'ThreeB', 'HR', 'TB']
def find_daily(dailies, stat_line, match_cols):

    matches = pd.merge(left=stat_line, right=dailies, on=match_cols)
    return matches

In [21]:
find_daily(bat_dailies, bat_stats.query('retro_id=="nettg001"'), match_cols_bat)

Unnamed: 0,player_id,retro_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,game_id,team_id
0,nettlgr01,nettg001,Graig Nettles,SD,1,4,2,2,5,0,0,2,8,SDN198607300,SDN


### Subtract a game's dailies from a stat DF
#### Start by subtracting a player's daily from their own statline
#### then scale
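
A minimal sketch of the subtraction step on toy data, assuming stat totals indexed by `retro_id` and one row per player per game.  The `subtract_game` helper and all values are hypothetical; it uses `reindex` rather than the concat approach in `subtract_stats` further down.

```python
import pandas as pd

# Hypothetical remaining totals, indexed by retro_id.
totals = pd.DataFrame(
    {"G": [2, 1], "AB": [8, 3], "H": [3, 0]},
    index=pd.Index(["aaaa001", "bbbb001"], name="retro_id"),
)

# Hypothetical daily lines for one known game; cccc001 is not in the totals.
game_lines = pd.DataFrame(
    {"retro_id": ["aaaa001", "cccc001"], "G": [1, 1], "AB": [4, 3], "H": [2, 0]}
)


def subtract_game(totals, game_lines):
    """Deduct a game's lines from the totals, touching only players in the totals."""
    delta = game_lines.groupby("retro_id").sum()
    # Align to the totals' index: players not in the game get 0,
    # players not in the totals are dropped.
    delta = delta.reindex(totals.index, fill_value=0)
    return totals - delta


remaining = subtract_game(totals, game_lines)
print(remaining)
```

Scaling this from one game to a set of known games only changes what goes into `game_lines`, since the `groupby(...).sum()` already aggregates multiple rows per player.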

# Execute the Algorithm

### Likely starting out manually, iterating manually, but calling the building block functions

In [22]:
matches = find_daily(bat_dailies, bat_stats.query('G==1'), match_cols_bat)
match_counts = matches.groupby('retro_id').agg({'player_id': len, 'game_id': min}).rename(columns={'player_id': 'count'})
match_counts


Unnamed: 0_level_0,count,game_id
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1
aardd001,327,ANA200705060
abret001,19,ARI201005080
accaj001,6,FLO200605310
acevj001,326,ANA200205070
adkij001,118,ANA200308140
...,...,...
younj001,24,CHN198707050
younm003,55,ANA200704030
zawal001,1,SDN201005020
zimmj002,12,COL199906090


In [23]:
match_counts.query('count==1').reset_index().groupby(['game_id'])['retro_id'].count().sort_values()

game_id
SDN200408080    1
SDN200408220    1
SDN200409080    1
SDN200409240    1
SDN200706190    1
SDN200706240    1
SDN200708040    1
SDN200910040    1
SDN201005020    1
SDN201007310    1
SDN201105230    1
SDN201109240    1
SDN201109270    1
SDN201205200    1
SDN201209300    1
SDN201304230    1
SDN201405250    1
SDN201409230    1
SFN199609180    1
SFN199804140    1
SDN200407020    1
SDN200406230    1
TOR200508060    1
CIN200407160    1
SDN200308100    1
SDN200306200    1
SDN200204080    1
SDN200109030    1
SDN200106300    1
DET201308310    1
SDN200104250    1
SDN199909200    1
OAK199505080    1
SDN199308062    1
SDN198609140    1
SDN198806130    1
SDN200405020    2
SEA201209080    2
CHN199806230    2
SDN200406190    2
SDN199407060    2
SDN200904260    2
SDN200110070    2
SLN200707270    2
SDN198607300    3
SEA199608100    3
SDN199109220    3
DET198808040    4
SDN201205040    4
SDN198704180    6
Name: retro_id, dtype: int64

In [25]:
matches['num_matches'] = matches.groupby('retro_id').transform(len)['player_id']

In [26]:
known_attended = sorted(match_counts.query('count==1').game_id.unique())
known_attended, len(known_attended)

(['CHN199806230',
  'CIN200407160',
  'DET198808040',
  'DET201308310',
  'OAK199505080',
  'SDN198607300',
  'SDN198609140',
  'SDN198704180',
  'SDN198806130',
  'SDN199109220',
  'SDN199308062',
  'SDN199407060',
  'SDN199909200',
  'SDN200104250',
  'SDN200106300',
  'SDN200109030',
  'SDN200110070',
  'SDN200204080',
  'SDN200306200',
  'SDN200308100',
  'SDN200405020',
  'SDN200406190',
  'SDN200406230',
  'SDN200407020',
  'SDN200408080',
  'SDN200408220',
  'SDN200409080',
  'SDN200409240',
  'SDN200706190',
  'SDN200706240',
  'SDN200708040',
  'SDN200904260',
  'SDN200910040',
  'SDN201005020',
  'SDN201007310',
  'SDN201105230',
  'SDN201109240',
  'SDN201109270',
  'SDN201205040',
  'SDN201205200',
  'SDN201209300',
  'SDN201304230',
  'SDN201405250',
  'SDN201409230',
  'SEA199608100',
  'SEA201209080',
  'SFN199609180',
  'SFN199804140',
  'SLN200707270',
  'TOR200508060'],
 50)

In [27]:
def aggregate_dailies(dailies, game_ids, players, match_cols):
    known_gm_stats = dailies.query('game_id in @game_ids and retro_id in @players').groupby('retro_id')[match_cols].sum()
    return known_gm_stats

known_gm_stats_bat = aggregate_dailies(bat_dailies, known_attended, bat_stats.retro_id, match_cols_bat)
known_gm_stats_bat

Unnamed: 0_level_0,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
abreb001,1,4,1,0,0.0,0.0,0.0,0.0,1.0
ackld001,1,5,1,0,0.0,0.0,0.0,0.0,1.0
adamm001,3,0,0,0,0.0,0.0,0.0,0.0,0.0
adamr002,1,4,1,3,0.0,0.0,0.0,0.0,1.0
adamt001,1,0,0,0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
yound001,1,4,2,1,0.0,1.0,0.0,0.0,3.0
younj001,1,0,0,0,0.0,0.0,0.0,0.0,0.0
zaung001,2,8,3,0,0.0,1.0,0.0,0.0,4.0
zawal001,1,3,1,1,0.0,0.0,0.0,0.0,1.0


In [28]:
unaccounted_stats_bat = bat_stats[match_cols_bat].set_index('retro_id')
unaccounted_stats_pit = pit_stats[match_cols_pit].set_index('retro_id')
unaccounted_stats_bat, unaccounted_stats_pit

(          G  AB  H  R  RBI  TwoB  ThreeB  HR  TB
 retro_id                                        
 aardd001  1   0  0  0    0     0       0   0   0
 abreb001  5  19  4  3    2     1       0   0   5
 abret001  1   4  0  0    0     0       0   0   0
 accaj001  1   1  0  0    0     0       0   0   0
 acevj002  3   2  0  0    0     0       0   0   0
 ...      ..  .. .. ..  ...   ...     ...  ..  ..
 zeilt001  2   7  3  2    2     2       0   0   5
 ziegb001  2   0  0  0    0     0       0   0   0
 zimmj002  1   0  0  0    0     0       0   0   0
 zitob001  2   2  0  0    0     0       0   0   0
 zobrb001  1   4  2  1    4     1       0   1   6
 
 [1650 rows x 9 columns],
            G  IPouts  W  L  SV   H  ER   K  BB
 retro_id                                      
 aardd001   1       2  0  1   0   2   2   0   1
 accaj001   1       4  0  0   0   0   0   1   2
 acevj002   3      20  0  2   0  13  10   3   7
 acevj001   1       6  0  0   0   2   0   2   0
 adamm001  12      39  1  0   0   

In [29]:
# when subtracting the data frames, need to subtract only the intersection of the sets
def subtract_stats(start, delta):
    unchanging = start[~start.index.isin(delta.index)]
    changing   = start[ start.index.isin(delta.index)]

    after = pd.concat([unchanging, changing-delta])
    return after



In [30]:
def run_iteration(dailies, stats, match_cols, games_known_in = [], _ = None, _2 = None):
    # Compute the unaccounted stats
    accounted_for = aggregate_dailies(dailies, games_known_in, stats.index, match_cols)
    unaccounted_stats = subtract_stats(stats, accounted_for)

    # find players who have only one game in the set
    singletons = unaccounted_stats[unaccounted_stats['G']==1]
    print(len(singletons))

    # find possible matches for all of those single games, from all of each player's career
    matches = find_daily(dailies, singletons, match_cols)
                                   
    # count the matches for each player, and any games that resolve uniquely are now known to be IN
    matches['num_matches'] = matches.groupby('retro_id').transform(len)['player_id']
    games_deduced_in = matches.query('num_matches==1').game_id.unique()
    games_now_known = sorted(set(games_known_in) | set(games_deduced_in))
    
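    # the stats for the newly deduced games are not subtracted here; the caller passes
    # games_now_known (index 3 of the returned tuple) into the next iteration, which subtracts them
    # index 4 is the games deduced in this iteration, index 5 is the candidate matches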
    return (dailies, stats, match_cols, games_now_known, games_deduced_in, matches)

iter1 = run_iteration(bat_dailies, bat_stats.set_index('retro_id'), match_cols_bat)
len(iter1[4])

751


50

In [31]:
iter1_pit = run_iteration(pit_dailies, pit_stats.set_index('retro_id'), match_cols_pit, iter1[3])
len(iter1_pit[4]), len(iter1_pit[3])

306


(104, 154)

In [32]:
iter2 = run_iteration(bat_dailies, bat_stats.set_index('retro_id'), match_cols_bat, iter1_pit[3])
len(iter2[4]), len(iter2[3])

403


(21, 175)

In [33]:
sorted(iter2[4])

['SDN199806090',
 'SDN200005290',
 'SDN200009060',
 'SDN200009260',
 'SDN200206200',
 'SDN200404140',
 'SDN200405120',
 'SDN200405140',
 'SDN200407070',
 'SDN200408180',
 'SDN200506070',
 'SDN200508130',
 'SDN201004290',
 'SDN201107140',
 'SDN201109040',
 'SDN201206230',
 'SDN201304100',
 'SDN201408130',
 'SDN201409200',
 'SDN201504100',
 'SFN200109300']

In [34]:
iter2_pit = run_iteration(pit_dailies, pit_stats.set_index('retro_id'), match_cols_pit, iter2[3])
len(iter2_pit[4]), len(iter2_pit[3])

109


(17, 192)

In [35]:
iter3 = run_iteration(bat_dailies, bat_stats.set_index('retro_id'), match_cols_bat, iter2_pit[3])
len(iter3[4]), len(iter3[3]), sorted(iter3[4])

200


(3, 195, ['SDN199809160', 'SDN200208030', 'SDN201305050'])

In [36]:
iter3_pit = run_iteration(pit_dailies, pit_stats.set_index('retro_id'), match_cols_pit, iter3[3])
len(iter3_pit[4]), len(iter3_pit[3]), sorted(iter3_pit[4])

43


(3, 198, ['SDN199906180', 'SDN200407090', 'SDN200605300'])

In [37]:
iter4 = run_iteration(bat_dailies, bat_stats.set_index('retro_id'), match_cols_bat, iter3_pit[3])
len(iter4[4]), len(iter4[3]), sorted(iter4[4])

120


(1, 199, ['SDN201208030'])

In [73]:
iter4_pit = run_iteration(pit_dailies, pit_stats.set_index('retro_id'), match_cols_pit, iter4[3])
len(iter4_pit[4]), len(iter4_pit[3]), sorted(iter4_pit[4])

29


(0, 199, [])

In [74]:
iter5 = run_iteration(bat_dailies, bat_stats.set_index('retro_id'), match_cols_bat, iter4_pit[3])
len(iter5[4]), len(iter5[3]), sorted(iter5[4])

107


(0, 199, [])

In [77]:
known_games = iter5[3]

In [78]:
iter5[5].query('num_matches==2 and game_id not in @known_games')

Unnamed: 0,retro_id,player_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,game_id,team_id,num_matches
2031,browb003,,,,1,4,2,0,0.0,1.0,0.0,0.0,3.0,PIT199904230,PIT,2
2032,browb003,,,,1,4,2,0,0.0,1.0,0.0,0.0,3.0,SDN199904210,PIT,2
9128,peavj001,,,,1,1,1,1,0.0,0.0,0.0,0.0,1.0,HOU200804220,SDN,2
9129,peavj001,,,,1,1,1,1,0.0,0.0,0.0,0.0,1.0,SDN200407260,SDN,2


In [82]:
known_games= sorted(known_games + ['SDN199904210', 'SDN200407260'])
len(known_games)

203

In [84]:
iter5_pit = run_iteration(pit_dailies, pit_stats.set_index('retro_id'), match_cols_pit, known_games)
known_games = iter5_pit[3]
len(iter5_pit[4]), len(known_games), sorted(iter5_pit[4])

26


(1, 202, ['SDN200109210'])

In [85]:
iter = iter6 = run_iteration(bat_dailies, bat_stats.set_index('retro_id'), match_cols_bat, known_games)
known_games = iter[3]
len(iter[4]), len(known_games), sorted(iter[4])

67


(0, 202, [])

In [87]:
iter = iter6_pit = run_iteration(pit_dailies, pit_stats.set_index('retro_id'), match_cols_pit, known_games)
known_games = iter[3]
len(iter[4]), len(known_games), sorted(iter[4])

19


(0, 202, [])

In [96]:
iter[5].query('num_matches==2 and game_id not in @known_games')

Unnamed: 0,retro_id,player_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,game_id,team_id,num_matches,gm_match_ct
3,kroli001,krolia01,Ian Krol,DET,1,2,0,0,0,1,1.0,0.0,0.0,DET201404230,DET,2,1
4,kroli001,krolia01,Ian Krol,DET,1,2,0,0,0,1,1.0,0.0,0.0,SDN201404130,DET,2,5
5,masao001,masaoon01,Onan Masaoka,LAD,1,3,0,0,0,1,0.0,2.0,0.0,COL200009080,LAN,2,1
6,masao001,masaoon01,Onan Masaoka,LAD,1,3,0,0,0,1,0.0,2.0,0.0,SDN199904160,LAN,2,2


In [95]:
iter[5].query('game_id =="SDN201404130"')

Unnamed: 0,retro_id,player_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,game_id,team_id,num_matches,gm_match_ct
2,albua001,albural01,Al Alburquerque,DET,1,1,0,0,0,0,0.0,1.0,0.0,SDN201404130,DET,3,5
4,kroli001,krolia01,Ian Krol,DET,1,2,0,0,0,1,1.0,0.0,0.0,SDN201404130,DET,2,5
20,benoj001,,,,1,3,0,0,0,0,0.0,0.0,1.0,SDN201404130,SDN,8,5
36,chamj002,,,,1,3,0,0,0,1,0.0,2.0,0.0,SDN201404130,DET,15,5
52,cokep001,,,,1,3,0,0,0,0,0.0,0.0,0.0,SDN201404130,DET,16,5


In [90]:
iter[5].query('game_id =="SDN199904160"')

Unnamed: 0,retro_id,player_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,game_id,team_id,num_matches
6,masao001,masaoon01,Onan Masaoka,LAD,1,3,0,0,0,1,0.0,2.0,0.0,SDN199904160,LAN,2
13,milla001,millsal01,Alan Mills,LAD,1,2,0,0,0,0,0.0,1.0,0.0,SDN199904160,LAN,7


In [94]:
iter[5]['gm_match_ct'] = iter[5].groupby('game_id')['retro_id'].transform(len)
iter[5].sort_values(by='gm_match_ct')

Unnamed: 0,retro_id,player_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,game_id,team_id,num_matches,gm_match_ct
0,albua001,albural01,Al Alburquerque,DET,1,1,0,0,0,0,0.0,1.0,0.0,DET201405250,DET,3,1
81,marmc001,,,,1,3,0,0,0,0,0.0,1.0,0.0,MIA201404160,MIA,33,1
80,marmc001,,,,1,3,0,0,0,0,0.0,1.0,0.0,MIA201404060,MIA,33,1
79,marmc001,,,,1,3,0,0,0,0,0.0,1.0,0.0,LAN201309270,LAN,33,1
78,marmc001,,,,1,3,0,0,0,0,0.0,1.0,0.0,LAN201308110,LAN,33,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36,chamj002,,,,1,3,0,0,0,1,0.0,2.0,0.0,SDN201404130,DET,15,5
20,benoj001,,,,1,3,0,0,0,0,0.0,0.0,1.0,SDN201404130,SDN,8,5
4,kroli001,krolia01,Ian Krol,DET,1,2,0,0,0,1,1.0,0.0,0.0,SDN201404130,DET,2,5
52,cokep001,,,,1,3,0,0,0,0,0.0,0.0,0.0,SDN201404130,DET,16,5


In [38]:
# Where does MIL199909030 come from?
gm_id= "MIL199909030"
iter2_pit[5].query('game_id==@gm_id and num_matches==1')

Unnamed: 0,retro_id,player_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB,game_id,team_id,num_matches


In [39]:
player = 'sloch001'
pit_stats.query('retro_id==@player')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB
536,slocuhe01,sloch001,Heathcliff Slocumb,PHI SD,5,14,0,0,0,10,3,4,1


In [40]:
gms = iter2_pit[3]
pit_dailies.query('retro_id==@player and game_id in @gms')

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,retro_id
4149986,SDN199407060,PHI,1,0,0,0,6,1.0,3,1.0,2.0,sloch001
4163422,SDN200008060,SDN,1,0,0,0,3,1.0,2,0.0,0.0,sloch001
4163791,SDN200009060,SDN,1,0,0,0,2,0.0,1,0.0,1.0,sloch001
4163923,SDN200009160,SDN,1,0,0,0,2,0.0,2,0.0,1.0,sloch001
4164006,SDN200009260,SDN,1,0,0,0,1,1.0,2,0.0,0.0,sloch001


In [41]:
# OK, this works out well.  We know MIL199909030 can't work because Slocumb was a Cardinal, and the other games
# mostly add up to the totals, off by 1.

In [42]:
# OK, not enough information to figure out what the error is in Hill's case.  The other three games all seem legit, and he's not yet showing up in other known games.
# So we'll just delete him for now, and possibly revisit later

In [43]:
# Where does PHI201006220 come from?
gm_id='PHI201006220'
iter2[5].query('game_id==@gm_id')

Unnamed: 0,retro_id,player_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,game_id,team_id,num_matches
31549,lidgb001,,,,1,0,0,0,0.0,0.0,0.0,0.0,0.0,PHI201006220,PHI,638


In [44]:
player = 'howar001'
bat_stats.query('retro_id==@player')

Unnamed: 0,player_id,retro_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB
671,howarry01,howar001,Ryan Howard,PHI,3,13,4,1,3,1,0,0,5


In [45]:
gms = iter2[3]
bat_dailies.query('retro_id==@player and game_id in @gms')

Unnamed: 0,game_id,team_id,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,retro_id
4175173,SDN200508130,PHI,1,3,2,1,1.0,0.0,0.0,0.0,2.0,howar001
4187058,SDN201008270,PHI,1,5,1,0,0.0,0.0,0.0,0.0,1.0,howar001
4187967,SDN201104230,PHI,1,5,1,0,2.0,1.0,0.0,0.0,2.0,howar001


In [46]:
# OK, we have something.  Howard played in SDN200508130, which was implied by another player(s), and
# his line in that game nearly matches PHI201006220.  Let's first validate SDN200508130

gm_id='SDN200508130'
iter2[5].query('game_id==@gm_id')

Unnamed: 0,retro_id,player_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,game_id,team_id,num_matches
11278,urbiu001,urbinug01,Ugueth Urbina,PHI,1,0,0,0,0.0,0.0,0.0,0.0,0.0,SDN200508130,PHI,558
13443,abreb001,,,,1,4,1,1,0.0,0.0,0.0,0.0,1.0,SDN200508130,PHI,56
14265,astap001,,,,1,1,0,0,0.0,0.0,0.0,0.0,0.0,SDN200508130,SDN,88
15349,belld002,,,,1,3,1,0,1.0,0.0,0.0,0.0,1.0,SDN200508130,PHI,17
27320,howar001,,,,1,3,2,1,1.0,0.0,0.0,0.0,2.0,SDN200508130,PHI,3
31695,liebm001,,,,1,2,1,0,1.0,0.0,0.0,0.0,1.0,SDN200508130,PHI,2
31804,loftk001,,,,1,0,0,1,0.0,0.0,0.0,0.0,0.0,SDN200508130,PHI,12
35671,michj001,,,,1,1,1,1,1.0,0.0,1.0,0.0,3.0,SDN200508130,PHI,1
39120,padiv001,,,,1,2,0,0,0.0,0.0,0.0,0.0,0.0,SDN200508130,PHI,49
42696,rollj001,,,,1,5,0,0,0.0,0.0,0.0,0.0,0.0,SDN200508130,PHI,96


In [47]:
# OK, this game is uniquely matched only by Jason Michaels.  Check some more players (Lieberthal, Astacio ...) to
# see if those corroborate

plyrs = ['liebm001']
iter2[5].query('retro_id in @plyrs')

Unnamed: 0,retro_id,player_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,game_id,team_id,num_matches
31694,liebm001,,,,1,2,1,0,1.0,0.0,0.0,0.0,1.0,PHI200504300,PHI,2
31695,liebm001,,,,1,2,1,0,1.0,0.0,0.0,0.0,1.0,SDN200508130,PHI,2


In [48]:
# OK, Lieberthal checks out.  Now Astacio
player = 'astap001'
pit_stats.query('retro_id==@player')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB
21,astacpe01,astap001,Pedro Astacio,COL NYM SD,4,61,0,1,0,19,10,15,6


In [49]:
gms = iter2[3]
pit_dailies.query('retro_id==@player and game_id in @gms')

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,retro_id
4158161,SDN199807050,COL,1,0,1,0,18,6.0,8,2.0,8.0,astap001
4167079,SDN200205190,NYN,1,0,0,0,21,2.0,6,1.0,2.0,astap001
4174821,SDN200507150,SDN,1,0,0,0,1,1.0,1,1.0,0.0,astap001
4175182,SDN200508130,SDN,1,0,0,0,21,1.0,4,2.0,5.0,astap001


In [50]:
# Astacio checks out.  Let's assume that Howard's RBI total is wrong.  Should be 

In [51]:
known_games = sorted(set(iter1_pit[2]) )
known_games

['BB', 'ER', 'G', 'H', 'IPouts', 'K', 'L', 'SV', 'W', 'retro_id']

In [52]:
known_games = set(iter1[2])|set(iter1_pit[2])
sorted(known_games)

['AB',
 'BB',
 'ER',
 'G',
 'H',
 'HR',
 'IPouts',
 'K',
 'L',
 'R',
 'RBI',
 'SV',
 'TB',
 'ThreeB',
 'TwoB',
 'W',
 'retro_id']

In [53]:
# where did ATL199709160 come from?

singletons = unaccounted_stats_bat.query('G==1')

# find possible matches for all of those single games, from all of each player's career
matches = find_daily(bat_dailies, singletons, match_cols_bat)
matches.query('game_id=="HOU198607010"')


Unnamed: 0,retro_id,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,game_id,team_id
67535,waltg001,1,1,0,0,0,0,0,0,0,HOU198607010,SDN


In [54]:
bat_stats.query('retro_id in ("walld001", "waltg001")')

Unnamed: 0,player_id,retro_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB
1563,wallide01,walld001,Denny Walling,HOU,1,4,1,0,1,0,0,0,1
1564,waltege01,waltg001,Gene Walter,SD,1,1,0,0,0,0,0,0,0


In [55]:
bat_dailies.query('retro_id in ("walld001", "waltg001") and game_id in @known_games')

Unnamed: 0,game_id,team_id,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,retro_id


In [56]:
# Walling went 1-for-4 with a walk in HOU198607010.  Is AB really PA?  Seems like there would be way more mistakes if it were consistently wrong like that.

# Walter seems to be a red herring; he had two games matching, one is confirmed known, and the other one happens to be HOU198607010.

# Let's look at other players who walked in that game.  Doran went 2-for-3 with 2 BB, Cruz went 2-for-2 with 2 BB
plrs = ("dorab001", "cruzj001")
bat_dailies.query('retro_id in @plrs and game_id in @known_games'), bat_stats.query('retro_id in @plrs')


(Empty DataFrame
 Columns: [game_id, team_id, G, AB, H, R, RBI, TwoB, ThreeB, HR, TB, retro_id]
 Index: [],
      player_id  retro_id        Name Teams  G  AB  H  R  RBI  TwoB  ThreeB  \
 378  doranbi02  dorab001  Bill Doran   HOU  1   3  2  1    0     2       0   
 
      HR  TB  
 378   0   4  )

In [57]:
# How many matches in a game I'm sure I went to?
matches.query('game_id=="SDN198609140"')

Unnamed: 0,retro_id,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,game_id,team_id
1530,asadr001,1,3,0,0,0,0,0,0,0,SDN198609140,SDN
14733,davig001,1,4,1,0,0,1,0,0,2,SDN198609140,HOU
16435,dorab001,1,3,2,1,0,2,0,0,4,SDN198609140,HOU
21908,greeg001,1,3,1,1,0,1,0,0,2,SDN198609140,SDN
28273,iorgd001,1,1,0,0,0,0,0,0,0,SDN198609140,SDN
40494,mizej001,1,4,1,0,0,0,0,0,1,SDN198609140,HOU
51891,pyznt001,1,3,0,0,0,0,0,0,0,SDN198609140,SDN
53172,reync001,1,3,0,0,0,0,0,0,0,SDN198609140,HOU
59128,scotm001,1,3,0,0,0,0,0,0,0,SDN198609140,HOU
67532,walld001,1,4,1,0,1,0,0,0,1,SDN198609140,HOU


In [58]:
# where did NYN198704250 come from?

singletons = unaccounted_stats_pit.query('G==1')

# find possible matches for all of those single games, from all of each player's career
matches = find_daily(pit_dailies, singletons, match_cols_pit)
matches.query('game_id=="NYN198704250"')

Unnamed: 0,retro_id,G,IPouts,W,L,SV,H,ER,K,BB,game_id,team_id


In [59]:
pit_stats.query('retro_id in ("myerr001")')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB
387,myersra01,myerr001,Randy Myers,SD,0,1,0,0,0,0,0,1,1


In [60]:

pit_dailies.query('retro_id in ("myerr001") and G==1 and SV==0 and IPouts==1')

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,retro_id
58102,ARI199809250,SDN,1,0,1,0,1,3.0,1,1.0,1.0,myerr001
166879,ATL199304120,CHN,1,0,0,0,1,0.0,0,0.0,0.0,myerr001
179706,ATL199808200,SDN,1,0,0,0,1,0.0,2,0.0,0.0,myerr001
319123,BAL199604280,BAL,1,0,0,0,1,1.0,2,1.0,0.0,myerr001
1275080,CHN198806250,NYN,1,0,0,0,1,0.0,1,0.0,0.0,myerr001
1280485,CHN199008310,CIN,1,0,1,0,1,2.0,4,1.0,0.0,myerr001
1289806,CHN199505140,CHN,1,0,0,0,1,2.0,1,1.0,1.0,myerr001
1290520,CHN199507140,CHN,1,0,0,0,1,0.0,1,0.0,1.0,myerr001
1290763,CHN199507290,CHN,1,0,0,0,1,3.0,3,1.0,0.0,myerr001
1291530,CHN199509280,CHN,1,0,0,0,1,1.0,2,0.0,0.0,myerr001


In [61]:
plrs = ('menef001',)  # trailing comma makes this a one-element tuple rather than a plain string
bat_dailies.query('retro_id in @plrs and game_id in @known_games'), bat_stats.query('retro_id in @plrs')

(Empty DataFrame
 Columns: [game_id, team_id, G, AB, H, R, RBI, TwoB, ThreeB, HR, TB, retro_id]
 Index: [],
      player_id  retro_id             Name Teams  G  AB  H  R  RBI  TwoB  \
 964  menecfr01  menef001  Frank Menechino   TOR  1   2  1  1    1     0   
 
      ThreeB  HR  TB  
 964       0   1   4  )

In [62]:
# So HBP has Menechino with 2 RBI, while he actually had 1
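
When a recorded line is off by a single stat, an exact match finds nothing. One possible way to surface such near misses is to count, for each candidate game, how many stat columns disagree with the recorded line. This is only a sketch: the numbers are invented (they are not Menechino's real lines) and the names `recorded`, `candidates`, and `off_by` are hypothetical.

```python
import pandas as pd

stat_cols = ["G", "AB", "H", "R", "RBI", "HR"]

# Invented lines for illustration; not taken from the real data.
recorded = pd.Series({"G": 1, "AB": 2, "H": 1, "R": 1, "RBI": 2, "HR": 1})
candidates = pd.DataFrame(
    [[1, 2, 1, 1, 1, 1], [1, 4, 0, 0, 0, 0]],
    columns=stat_cols,
    index=["GAME_X", "GAME_Y"],
)

# Per candidate game, count the stat columns that disagree with the recorded line.
n_diff = candidates[stat_cols].ne(recorded[stat_cols], axis="columns").sum(axis=1)

# A near miss differs in exactly one column.
near_misses = n_diff[n_diff == 1].index.tolist()

# Name the single disagreeing column for each near miss.
off_by = {
    g: [c for c in stat_cols if candidates.at[g, c] != recorded[c]]
    for g in near_misses
}
print(off_by)
```

Games that differ in only one column are candidates for a data-entry error on one side or the other; games that differ in many columns are simply different games.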

In [63]:
# OK, now when we find a game that appears, see how many players match in that game.
# The hypothesis is that most games will match multiple players; a game matched by only one player is a possible error.
singletons = unaccounted_stats_bat.query('G==1')

# find possible matches for all of those single games, from all of each player's career
matches = find_daily(bat_dailies, singletons, match_cols_bat)
cts = matches.groupby('retro_id')['G'].count()
plrs = cts[cts==1].index
plrs


Index(['beckj002', 'benzt001', 'bergd002', 'berrg001', 'bixlb001', 'blacb001',
       'bordp001', 'campt001', 'cardj001', 'cespy001', 'cishs001', 'davib004',
       'descd001', 'dobbg001', 'esasn001', 'farip001', 'garcd002', 'garck002',
       'greeg001', 'guerp001', 'harvk001', 'hayeb001', 'higgk001', 'holtb001',
       'howek001', 'iglej001', 'jennd003', 'keisr001', 'kendh001', 'knobc001',
       'kreuc001', 'krueb001', 'krukj001', 'krukm001', 'lakej001', 'laroa002',
       'lee-t002', 'markn001', 'marta002', 'mater001', 'mckac001', 'mechg001',
       'menck001', 'menef001', 'molip001', 'neagd001', 'nettg001', 'nokem001',
       'olsot001', 'parkd001', 'pavac001', 'peguf001', 'petrb001', 'ramsm002',
       'robit001', 'rodrg001', 'sancr001', 'sandj002', 'sax-s001', 'sherp001',
       'shumt001', 'simor001', 'smitb004', 'terrl001', 'trama001', 'velee001',
       'walbm001', 'wardk001', 'webet001', 'wille001', 'wilsd002', 'wilsp001',
       'woodc001', 'ynoar001', 'zawal001'],
      dt

In [64]:
implied_gms = matches.query('retro_id in @plrs')['game_id'].unique()
implied_gms, len(implied_gms)

(array(['SDN200706240', 'DET198808040', 'SDN200406190', 'OAK199505080',
        'SDN200904260', 'SDN199109220', 'CHN199806230', 'SDN201109270',
        'SDN200204080', 'SEA201209080', 'SDN201205040', 'SDN201007310',
        'SDN201105230', 'SDN198607300', 'SDN200405020', 'SDN198609140',
        'SDN198704180', 'SDN200407020', 'SDN199308062', 'DET201308310',
        'SLN200707270', 'SDN201205200', 'SEA199608100', 'SDN200106300',
        'SDN199407060', 'SDN198806130', 'SDN201405250', 'SDN200104250',
        'SDN200706190', 'SDN200308100', 'SDN200409080', 'SDN200306200',
        'TOR200508060', 'SDN199909200', 'SDN200406230', 'SDN200408220',
        'SDN201209300', 'SDN200110070', 'SDN200708040', 'SFN199804140',
        'SDN201109240', 'SDN200408080', 'SDN200109030', 'SDN200409240',
        'SDN200910040', 'SDN201304230', 'SFN199609180', 'CIN200407160',
        'SDN201409230', 'SDN201005020'], dtype=object),
 50)

In [65]:

# How many matches in each game?
matches.query('game_id in @implied_gms')['game_id'].value_counts()



DET198808040    18
CHN199806230    16
SDN199109220    16
SDN198704180    15
SDN198607300    15
SDN201205200    14
SEA199608100    14
DET201308310    13
OAK199505080    12
SDN198609140    10
SDN198806130    10
SDN201405250     9
SFN199609180     9
SDN201205040     8
SDN201005020     7
SDN199407060     7
SEA201209080     7
SDN199308062     6
SDN201109270     6
SDN200904260     6
SDN200706190     5
SDN200405020     5
SDN200407020     5
SDN200409240     5
SDN199909200     4
SDN200910040     4
TOR200508060     4
SDN200708040     4
SDN200308100     4
SDN200110070     4
SDN200406190     4
SDN200104250     4
SDN201209300     4
SDN200204080     4
SDN200409080     3
SDN200306200     3
SDN201007310     3
SDN201109240     3
SLN200707270     3
SFN199804140     3
SDN201105230     3
SDN201304230     2
SDN201409230     2
SDN200406230     1
SDN200109030     1
CIN200407160     1
SDN200106300     1
SDN200408220     1
SDN200408080     1
SDN200706240     1
Name: game_id, dtype: int64
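
The counts above can be turned into an explicit list of games to double-check. The sketch below uses an invented match table and flags implied games supported by fewer than `min_support` one-game players. The threshold of 2 is an assumption chosen for illustration, not something the notebook establishes.

```python
import pandas as pd

# Toy match table: one row per (player, candidate game). Invented data.
toy_matches = pd.DataFrame({
    "retro_id": ["p1", "p2", "p3", "p3", "p4"],
    "game_id":  ["G1", "G1", "G1", "G2", "G3"],
})

# Players whose one-game line fits exactly one game in their career.
n_candidates = toy_matches.groupby("retro_id")["game_id"].nunique()
unique_players = n_candidates[n_candidates == 1].index

# Games implied by those players.
implied = toy_matches.loc[
    toy_matches["retro_id"].isin(unique_players), "game_id"
].unique()

# Support = number of one-game players (unique or not) whose line fits the game.
support = toy_matches[toy_matches["game_id"].isin(implied)]["game_id"].value_counts()

min_support = 2  # hypothetical threshold, not taken from the notebook
suspect_games = sorted(support[support < min_support].index)
print(suspect_games)
```

A low count does not prove a game is wrong, since most players in a frequently attended park will have more than one game in the totals; it only ranks which implied games deserve a second look.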

In [66]:
# OK, even better, are there any players whose stats *don't* match?

# First, any singletons showing up in multiple games?
plrs = singletons.index
apprcs = bat_dailies.query('retro_id in @plrs and game_id in @implied_gms').groupby('retro_id')['game_id'].count()
multi_singletons_whoops = apprcs[apprcs>1].index
multi_singletons_whoops


Index(['jackm001'], dtype='object', name='retro_id')

In [67]:
bat_dailies.query('retro_id in @multi_singletons_whoops and game_id in @implied_gms')['game_id'].value_counts()

CHN199806230    1
SEA199608100    1
Name: game_id, dtype: int64
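
A player with one game in the totals who appears in two implied games is a contradiction: at most one of those games can be in the history. A minimal sketch of that check, with invented data and hypothetical names (`toy_appearances`, `one_game_players`):

```python
import pandas as pd

# Invented appearance table: which players appeared in which implied games.
toy_appearances = pd.DataFrame({
    "retro_id": ["p1", "p1", "p2", "p3"],
    "game_id":  ["G1", "G2", "G1", "G3"],
})
one_game_players = {"p1", "p2", "p3"}  # players with G == 1 in the toy totals

# Restrict to one-game players and count distinct implied games for each.
subset = toy_appearances[toy_appearances["retro_id"].isin(one_game_players)]
n_games = subset.groupby("retro_id")["game_id"].nunique()
conflicted_players = n_games[n_games > 1].index

# The games involved in a conflict cannot all be in the attended set.
conflicting_games = sorted(
    subset.loc[subset["retro_id"].isin(conflicted_players), "game_id"].unique()
)
print(list(conflicted_players), conflicting_games)
```

Each conflict narrows the search: one of the listed games was implied by a coincidental or erroneous match and should be re-examined before it is treated as known.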

In [68]:
                                
# count the matches for each player; any player whose line resolves to a single game puts that game IN
match_counts = matches.groupby('retro_id').agg(count=('G', 'size'), game_id=('game_id', 'min'))
games_deduced_in = match_counts.query('count==1').game_id.unique()

In [69]:
iter2 = run_iteration(*iter1)
iter2, len(iter2[2])

583


((              game_id team_id  G  AB  H  R  RBI  TwoB  ThreeB   HR   TB  \
  1043     ALS198707140     ALS  1   1  0  0  0.0   0.0     0.0  0.0  0.0   
  1044     ALS198707140     ALS  1   3  0  0  0.0   0.0     0.0  0.0  0.0   
  1045     ALS198707140     ALS  1   3  0  0  0.0   0.0     0.0  0.0  0.0   
  1046     ALS198707140     ALS  1   2  2  0  0.0   0.0     0.0  0.0  2.0   
  1047     ALS198707140     ALS  1   2  0  0  0.0   0.0     0.0  0.0  0.0   
  ...               ...     ... ..  .. .. ..  ...   ...     ...  ...  ...   
  5061336  WAS201509280     WAS  1   3  2  0  0.0   0.0     0.0  0.0  2.0   
  5061337  WAS201509280     WAS  1   5  2  0  1.0   0.0     0.0  0.0  2.0   
  5061338  WAS201509280     WAS  1   0  0  0  0.0   0.0     0.0  0.0  0.0   
  5061339  WAS201509280     WAS  1   3  1  0  0.0   0.0     0.0  0.0  1.0   
  5061340  WAS201509280     WAS  1   3  0  1  0.0   0.0     0.0  0.0  0.0   
  
           retro_id  
  1043     bainh001  
  1044     bellg001  
  1045 

In [70]:
iter2[2]

['retro_id', 'G', 'AB', 'H', 'R', 'RBI', 'TwoB', 'ThreeB', 'HR', 'TB']

In [71]:
pit_stats.query('retro_id=="alfoa001"')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB
10,alfonan01,alfoa001,Antonio Alfonseca,ATL,2,5,0,0,0,4,0,2,1


In [72]:
# `dailies` is not defined in this session; the pitcher frame used above is pit_dailies
pit_dailies.query('retro_id=="alfoa001" and game_id in @iter1[2]')

In [None]:
pit_dailies.query('retro_id=="alfoa001" and IPouts==2 and H==1 and BB==1')

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,retro_id
192649,ATL200405070,ATL,1,0,0,0,2,0.0,1,1.0,1.0,alfoa001
1314720,CHN200508260,FLO,1,0,0,0,2,0.0,1,1.0,0.0,alfoa001
2254573,HOU200410090,ATL,1,0,0,0,2,2.0,1,1.0,0.0,alfoa001
3374062,NYN200404150,ATL,1,0,0,0,2,2.0,1,1.0,0.0,alfoa001
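
The query above appears to search for the line that is left over once an already-known game is subtracted from a two-game total. That step can be sketched generically as below. All numbers are invented, and `totals`, `known_lines`, and `career` are hypothetical stand-ins for the notebook's frames.

```python
import pandas as pd

stat_cols = ["G", "IPouts", "H", "ER", "K", "BB"]

# Invented numbers for illustration only.
totals = pd.Series({"G": 2, "IPouts": 5, "H": 4, "ER": 0, "K": 2, "BB": 1})
known_lines = pd.DataFrame(
    [[1, 3, 3, 0, 1, 0]], columns=stat_cols, index=["GAME_KNOWN"]
)
career = pd.DataFrame(
    [[1, 3, 3, 0, 1, 0], [1, 2, 1, 0, 1, 1], [1, 2, 1, 2, 0, 1]],
    columns=stat_cols,
    index=["GAME_KNOWN", "GAME_P", "GAME_Q"],
)

# Residual = totals minus everything already explained by known games.
residual = totals[stat_cols] - known_lines[stat_cols].sum()

# Only search when exactly one game remains, so the residual is a single game line.
if residual["G"] == 1:
    remaining = career.drop(index=known_lines.index)
    exact = remaining[stat_cols].eq(residual, axis="columns").all(axis=1)
    candidate_games = remaining[exact].index.tolist()
else:
    candidate_games = []
print(candidate_games)
```

If the residual matches several games, as in the output above, the remaining candidates have to be separated by other evidence, such as which of them already appear among the known games.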


In [None]:
# iter2[4] is a NumPy array, so test membership directly instead of calling .query
"SDN201306140" in iter2[4]

In [None]:
pit_dailies[(pit_dailies['retro_id']=='cahit001') & pit_dailies['game_id'].isin(iter1[2])]

In [None]:
pit_dailies[(pit_dailies['retro_id']=='harrw002') & pit_dailies['game_id'].isin(iter1[2])]

In [None]:
iter1[4][iter1[4]['game_id'] =='SDN201309260']

In [None]:
iter1[4][iter1[4]['game_id'] =='SDN201209150']

In [None]:
iter2[1].sort_values(by='IPouts')

In [None]:
pit_dailies.query('retro_id=="murrh001" and game_id in @iter1[2]')

In [None]:
pit_stats.query('retro_id=="murrh001"')

This looks suspicious.  I didn't go to a game in Atlanta in May 2004.  This explains the discrepancy in the totals - but how did this game show up in my known attended games?

In [None]:
pit_dailies.query('game_id=="ATL200405070"')

In [None]:
pit_stats.query('retro_id=="thomj005"')

In [None]:
iter3 = run_iteration(*iter2)
iter3[2]

In [None]:
iter4 = run_iteration(*iter3)
iter4[2]
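
Rather than calling the iteration by hand for `iter3`, `iter4`, and so on, the step can be repeated until nothing new is learned. The sketch below is generic: `iterate_until_stable` and `toy_step` are hypothetical, and the real `run_iteration` takes and returns a different, larger tuple than the two-element toy state used here.

```python
def iterate_until_stable(step, state, n_known, max_iters=50):
    """Apply `step` repeatedly until the number of known games stops growing."""
    for i in range(max_iters):
        new_state = step(*state)
        if n_known(new_state) == n_known(state):
            return new_state, i + 1
        state = new_state
    return state, max_iters

# Toy step: each pass "discovers" games whose prerequisite is already known.
prereq = {"G2": "G1", "G3": "G2", "G4": "G9"}  # invented dependency chain

def toy_step(known, possible):
    newly = {g for g in possible if prereq.get(g) in known}
    return known | newly, possible - newly

final_state, n_passes = iterate_until_stable(
    toy_step,
    ({"G1"}, {"G2", "G3", "G4"}),
    n_known=lambda s: len(s[0]),
)
print(sorted(final_state[0]), sorted(final_state[1]), n_passes)
```

The `max_iters` cap is a safety net in case a step keeps adding games because of a bad match; the last pass is the one that confirms nothing changed.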

In [None]:
pit_dailies[(pit_dailies['retro_id']=='ziegb001') & pit_dailies['game_id'].isin(iter1[2])]

In [None]:
pit_dailies[(pit_dailies['retro_id']=='ziegb001') & pit_dailies['game_id'].isin(iter2[2])]

In [None]:
iter3[1].describe()