In 2015ish, I spent a lot of time entering my game attendance history into Hardball Passport.  The sources were ticket stubs, photo history, and my memory.  After entering the data, I threw out a lot of paper ticket stubs.  Hardball Passport eventually vanished, losing my history with it.  I did, however, save a couple sheets of player stats in games I attended.  The goal here is to reconstruct the attendance history from these player stats.

Remember, the goal is to reconstruct the set of games that was in Hardball Passport.  This is different than trying to reconstruct my actual attendance history.  For example, if a game was missing in HBP, but I know I attended it, including it here will break the reconstruction algorithm, because its stats are not part of the saved totals.

The general design is to iterate a loop, maintaining this data:
* a set of games that are known to be in the HBP history
* a set of games that are known to not be in the HBP history
* a set of games that may be in the HBP history (note that these and the previous two represent the universe of games)
* the HBP player stats for all players in the "possible" games (i.e., with the stats in "known" games removed)
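
A minimal sketch of that bookkeeping, assuming plain Python sets of game IDs. The `GameState` class and its method names are hypothetical (they are not part of the notebook), and the remaining-stats table from the last bullet is left out here.

```python
from dataclasses import dataclass, field


@dataclass
class GameState:
    """Hypothetical container for the three-way split of the game universe."""
    possible: set
    known_in: set = field(default_factory=set)
    known_out: set = field(default_factory=set)

    def mark_in(self, game_id):
        # A game can only become "known" if it was still possible
        self.possible.remove(game_id)
        self.known_in.add(game_id)

    def mark_out(self, game_id):
        # A game ruled out moves from possible to known-not-in-HBP
        self.possible.remove(game_id)
        self.known_out.add(game_id)

    def is_partition(self, universe):
        # The three sets must be disjoint and together cover the universe
        parts = [self.known_in, self.known_out, self.possible]
        union = set().union(*parts)
        disjoint = sum(len(p) for p in parts) == len(union)
        return disjoint and union == set(universe)


universe = {"SDN198607300", "SDN198609140", "HOU198607010"}
state = GameState(possible=set(universe))
state.mark_in("SDN198607300")
state.mark_out("HOU198607010")
print(state.is_partition(universe), state.possible)
```

Checking `is_partition` after every move is a cheap guard against a game being counted twice or silently dropped.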

The loop is:
* Identify a game that is known
** Either by memory
** Or by taking a player with 1 game of stats and matching that to a game from their career
* Add that game to the "known" games
* Deduct the stats from that game for all players who played in that game
* For any player who drops to zero games/stats remaining, all of the remaining games they played in go from possible to impossible
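
The loop above can be sketched on a toy universe. Everything here is invented for illustration: four games, one stat (hits), and totals that are consistent with having attended G1 and G2. The function name `reconstruct` and the dict layout are assumptions, not the notebook's implementation, which works on pandas frames.

```python
# game_id -> {player: hits in that game}; all values invented
box_scores = {
    "G1": {"ann": 2, "bob": 0},
    "G2": {"ann": 1, "cat": 3},
    "G3": {"bob": 1, "cat": 3},
    "G4": {"cat": 0, "dan": 1},
}
# Saved sheet: player -> (games, hits)
totals = {"ann": (2, 3), "bob": (1, 0), "cat": (1, 3)}


def reconstruct(box_scores, totals):
    possible = set(box_scores)
    known_in, known_out = set(), set()
    remaining = {p: list(t) for p, t in totals.items()}  # player -> [games, hits]
    progress = True
    while progress:
        progress = False
        # Step 1: a player with exactly one unexplained game pins down that game
        # if only one still-possible game matches the leftover statline
        for player, (games, hits) in list(remaining.items()):
            if games != 1:
                continue
            candidates = [g for g in sorted(possible)
                          if box_scores[g].get(player) == hits]
            if len(candidates) != 1:
                continue
            game = candidates[0]
            possible.remove(game)
            known_in.add(game)
            # Step 2: deduct that game's stats from everyone who played in it
            for p, h in box_scores[game].items():
                if p in remaining:
                    remaining[p][0] -= 1
                    remaining[p][1] -= h
            progress = True
        # Step 3: a player with zero games left rules out their other games
        for player, (games, _) in remaining.items():
            if games == 0:
                dead = {g for g in possible if player in box_scores[g]}
                if dead:
                    progress = True
                possible -= dead
                known_out |= dead
    return known_in, known_out, possible


known_in, known_out, possible = reconstruct(box_scores, totals)
print(sorted(known_in), sorted(known_out), sorted(possible))
```

In this toy case bob's single hitless game resolves G1 first, which then rules out G3 and makes cat's otherwise ambiguous 3-hit game resolve to G2.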

In [1]:
import pandas as pd
import boxball_loader as bbl

In [2]:
bat = pd.read_csv('~/Dropbox/personal/baseball/hardball_passport/passportplus-20150523-0147-B.csv')
pit = pd.read_csv('~/Dropbox/personal/baseball/hardball_passport/passportplus-20150523-0148-P.csv')

bat

Unnamed: 0,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,BA,OPS
0,Ryan Klesko,SD SF,69,226,58,36,29,13,3,9,104,0.257,0.717
1,Phil Nevin,SD,67,244,76,40,47,14,0,16,138,0.311,0.877
2,Brian Giles,PIT SD,61,219,60,38,26,11,3,8,101,0.274,0.735
3,Trevor Hoffman,SD,58,2,0,0,0,0,0,0,0,0.000,0.000
4,Chase Headley,SD,56,193,49,24,17,19,0,2,74,0.254,0.637
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1738,Ehire Adrianza,SF,1,0,0,0,0,0,0,0,0,0.000,0.000
1739,Juan Acevedo,MIL,1,0,0,0,0,0,0,0,0,0.000,0.000
1740,Jeremy Accardo,SF,1,1,0,0,0,0,0,0,0,0.000,0.000
1741,Tony Abreu,SF,1,4,0,0,0,0,0,0,0,0.000,0.000


In [3]:
bat.sum()

Name      Ryan KleskoPhil NevinBrian GilesTrevor Hoffman...
Teams     SD SFSDPIT SDSDSDSDSD SDNMIL SDSDSDSD MIL ATLS...
G                                                      5979
AB                                                    14077
H                                                      3519
R                                                      1665
RBI                                                    1600
TwoB                                                    638
ThreeB                                                   98
HR                                                      367
TB                                                     5454
BA                                                  289.327
OPS                                                 725.802
dtype: object

In [4]:
pit.sum()

Name       Trevor HoffmanScott LinebrinkLuke GregersonJoe...
Teams      SDSD MIL ATLSDSD ARZSDSDSD SFNSD MIASD SDNSDSD...
G                                                       1653
IP                                                    3576.6
W                                                        206
L                                                        206
SV                                                       110
H                                                       3598
ER                                                      1545
K                                                       3034
BB                                                      1385
Pitches                                                44740
ERA                                                 63561.09
WHIP                                                1136.057
dtype: object

# Data Cleaning

Need to clean both the HBP data and the retrosheet/bd data.  Column names matching, etc.

In [5]:
yrs = range(1986, 2016)
len(yrs)

30

In [6]:
def get_players(yrs):
    """ Retrieve a DF of all batters and their names/IDs who played in a range of years"""
    player_list = bbl.load_batting(yrs, coalesce_type=bbl.CoalesceMode.PLAYER_CAREER).index.values
    
    players = bbl.load_people().query('player_id in @player_list')[['player_id', 'name_first', 'name_last', 'retro_id']]
    players['display_name'] = players['name_first'] + ' ' + players['name_last']
    name_counts = players.groupby('display_name')['retro_id'].count()
    dup_names = name_counts[name_counts>1].index.values
    players['dup_name'] = players.display_name.isin(dup_names)
    return players

players = get_players(yrs)
len(players)

6804

In [7]:
players.sample(10)

Unnamed: 0,player_id,name_first,name_last,retro_id,display_name,dup_name
447,aquingr01,Greg,Aquino,aquig001,Greg Aquino,False
1212,benavfr01,Freddie,Benavides,benaf001,Freddie Benavides,False
12838,munozbo01,Bobby,Munoz,munob001,Bobby Munoz,False
17494,stranpa02,Pat,Strange,strap001,Pat Strange,False
14917,reedda01,Darren,Reed,reedd001,Darren Reed,False
864,barnesc01,Scott,Barnes,barns002,Scott Barnes,False
4982,dunnto01,Todd,Dunn,dunnt001,Todd Dunn,False
3768,cowlejo01,Joe,Cowley,cowlj001,Joe Cowley,False
10433,lewisje01,Jensen,Lewis,lewij003,Jensen Lewis,False
6724,gonzade01,Denny,Gonzalez,gonzd001,Denny Gonzalez,False


In [8]:
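def compute_IPouts(IP):
    """Convert innings-pitched notation (8.1 means 8 innings plus 1 out) to a count of outs."""
    # round(IP) drops the .1/.2 suffix to give whole innings; the tenths digit is the extra outs
    return round(round(IP)*3 + 10*(IP%1))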

[int(compute_IPouts(ip)) for ip in [8.1, 0.2, 1.2, 38.1]]

[25, 2, 5, 115]

In [9]:
pit['IPouts'] = compute_IPouts(pit['IP']).apply(int)
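
For scalar values, the same conversion can be sketched with integer arithmetic on tenths, which avoids leaning on `IP % 1` and rejects values that are not valid innings-pitched notation. The name `ip_to_outs` is hypothetical, and this is an alternative sketch rather than what the notebook runs.

```python
def ip_to_outs(ip):
    """Convert innings-pitched notation (8.1 = 8 innings + 1 out) to outs."""
    tenths = round(ip * 10)                  # 38.1 -> 381
    innings, extra_outs = divmod(tenths, 10)
    if extra_outs > 2:
        # .3 through .9 cannot occur in real innings-pitched notation
        raise ValueError(f"not a valid innings-pitched value: {ip}")
    return innings * 3 + extra_outs


print([ip_to_outs(ip) for ip in [8.1, 0.2, 1.2, 38.1]])
```

A value like the 3576.6 from `pit.sum()` raises here, which is a reminder that summing the IP column directly does not produce a meaningful innings total.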

In [10]:
standard_cols = ['player_id', 'retro_id', 'Name', 'Teams']
pit_cols = standard_cols + ['G', 'IPouts', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB']
bat_cols = standard_cols + ['G', 'AB', 'H', 'R', 'RBI', 'TwoB', 'ThreeB', 'HR', 'TB']

pit_stats = pd.merge(left=players.query('not dup_name'), right=pit, left_on='display_name', right_on='Name')[pit_cols]
bat_stats = pd.merge(left=players.query('not dup_name'), right=bat, left_on='display_name', right_on='Name')[bat_cols]


bat_stats, pit_stats

(      player_id  retro_id              Name    Teams  G  AB  H  R  RBI  TwoB  \
 0     aardsda01  aardd001     David Aardsma      SEA  1   0  0  0    0     0   
 1     abreubo01  abreb001       Bobby Abreu  PHI NYY  5  19  4  3    2     1   
 2     abreuto01  abret001        Tony Abreu       SF  1   4  0  0    0     0   
 3     accarje01  accaj001    Jeremy Accardo       SF  1   1  0  0    0     0   
 4     acevejo01  acevj002      Jose Acevedo  CIN COL  3   2  0  0    0     0   
 ...         ...       ...               ...      ... ..  .. .. ..  ...   ...   
 1645  zeileto01  zeilt001        Todd Zeile  LAD NYM  2   7  3  2    2     2   
 1646  zieglbr01  ziegb001      Brad Ziegler      ARZ  2   0  0  0    0     0   
 1647  zimmejo01  zimmj002  Jordan Zimmerman      SEA  1   0  0  0    0     0   
 1648   zitoba01  zitob001        Barry Zito       SF  2   2  0  0    0     0   
 1649  zobribe01  zobrb001       Ben Zobrist       TB  1   4  2  1    4     1   
 
       ThreeB  HR  TB  
 0

OK, this is our starting point.  A table with pitching counting stats matching the retro dailies,
with IDs.  It's a subset of pitchers, only those whose names resolved easily; but the algorithm 
will work the same (the pitchers who didn't resolve are essentially just missing from our universe).
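
To see which names fell out of the universe, one option is a left merge with `indicator=True`. This is a sketch on invented rows (the names and IDs below are made up); the real tables are `players` and the HBP sheets.

```python
import pandas as pd

# Invented stand-ins for the players table and an HBP sheet
players = pd.DataFrame({
    "retro_id": ["smitp001", "smitp002", "jonel001"],
    "display_name": ["Pat Smith", "Pat Smith", "Lee Jones"],
})
sheet = pd.DataFrame({"Name": ["Pat Smith", "Lee Jones", "Sam Doe"],
                      "G": [3, 40, 1]})

# Flag every name shared by more than one player
players["dup_name"] = players["display_name"].duplicated(keep=False)

# Left merge keeps every sheet row; _merge says whether an ID was found
merged = sheet.merge(players[~players["dup_name"]], how="left",
                     left_on="Name", right_on="display_name", indicator=True)
unresolved = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
print(unresolved)
```

Here "Pat Smith" is unresolved because the name is ambiguous, and "Sam Doe" because no player by that name exists in the table; both kinds end up missing from the universe.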

Still to clean up:
* column headers match retro
* Teams

In [11]:
df_dailies = bbl.load_dailies(game_types=bbl.GameType.ALL).query('yr in @yrs')
df_dailies

Unnamed: 0,game_id,game_dt,game_ct,appearance_dt,team_id,player_id,slot_ct,seq_ct,home_fl,opponent_id,...,f_rf_out,f_rf_tc,f_rf_po,f_rf_a,f_rf_e,f_rf_dp,f_rf_tp,yr,game_type,team_game_number
1043,ALS198707140,1987-07-14,0,1987-07-14,ALS,bainh001,3,4,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1987,ASG,1
1044,ALS198707140,1987-07-14,0,1987-07-14,ALS,bellg001,4,1,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1987,ASG,1
1045,ALS198707140,1987-07-14,0,1987-07-14,ALS,boggw001,3,1,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1987,ASG,1
1046,ALS198707140,1987-07-14,0,1987-07-14,ALS,evand002,7,2,True,NLS,...,21.0,2.0,2.0,0.0,0.0,0,0.0,1987,ASG,1
1047,ALS198707140,1987-07-14,0,1987-07-14,ALS,fernt001,6,2,True,NLS,...,0.0,0.0,0.0,0.0,0.0,0,0.0,1987,ASG,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5061336,WAS201509280,2015-09-28,0,2015-09-28,WAS,schem001,9,1,True,CIN,...,0.0,0.0,0.0,0.0,0.0,0,0.0,2015,RS,156
5061337,WAS201509280,2015-09-28,0,2015-09-28,WAS,taylm002,1,1,True,CIN,...,0.0,0.0,0.0,0.0,0.0,0,0.0,2015,RS,156
5061338,WAS201509280,2015-09-28,0,2015-09-28,WAS,thorm001,9,3,True,CIN,...,0.0,0.0,0.0,0.0,0.0,0,0.0,2015,RS,156
5061339,WAS201509280,2015-09-28,0,2015-09-28,WAS,turnt001,2,1,True,CIN,...,0.0,0.0,0.0,0.0,0.0,0,0.0,2015,RS,156


In [12]:
pit_col_mapper = \
{'p_g': 'G',
 'p_w': 'W',
 'p_l': 'L',
 'p_sv': 'SV',
 'p_out': 'IPouts',
 'p_er': 'ER',
 'p_h': 'H',
 'p_bb': 'BB',
 'p_so': 'K',
 'player_id': 'retro_id'}

bat_col_mapper = \
{'b_g': 'G',
 'b_ab': 'AB',
 'b_h': 'H',
 'b_r': 'R',
 'b_rbi': 'RBI',
 'b_2b': 'TwoB',
 'b_3b': 'ThreeB',
 'b_hr': 'HR',
 'b_tb': 'TB',
 'player_id': 'retro_id'}

pit_col_mapper, bat_col_mapper

({'p_g': 'G',
  'p_w': 'W',
  'p_l': 'L',
  'p_sv': 'SV',
  'p_out': 'IPouts',
  'p_er': 'ER',
  'p_h': 'H',
  'p_bb': 'BB',
  'p_so': 'K',
  'player_id': 'retro_id'},
 {'b_g': 'G',
  'b_ab': 'AB',
  'b_h': 'H',
  'b_r': 'R',
  'b_rbi': 'RBI',
  'b_2b': 'TwoB',
  'b_3b': 'ThreeB',
  'b_hr': 'HR',
  'b_tb': 'TB',
  'player_id': 'retro_id'})

In [21]:
def get_dailies(col_mapper):
    cols=['game_id', 'team_id'] + list(col_mapper.values())
    
    return df_dailies.rename(columns=col_mapper)[cols]

In [23]:
bat_dailies = get_dailies(bat_col_mapper)
pit_dailies = get_dailies(pit_col_mapper)
bat_dailies, pit_dailies

(              game_id team_id  G  AB  H  R  RBI  TwoB  ThreeB   HR   TB  \
 1043     ALS198707140     ALS  1   1  0  0  0.0   0.0     0.0  0.0  0.0   
 1044     ALS198707140     ALS  1   3  0  0  0.0   0.0     0.0  0.0  0.0   
 1045     ALS198707140     ALS  1   3  0  0  0.0   0.0     0.0  0.0  0.0   
 1046     ALS198707140     ALS  1   2  2  0  0.0   0.0     0.0  0.0  2.0   
 1047     ALS198707140     ALS  1   2  0  0  0.0   0.0     0.0  0.0  0.0   
 ...               ...     ... ..  .. .. ..  ...   ...     ...  ...  ...   
 5061336  WAS201509280     WAS  1   3  2  0  0.0   0.0     0.0  0.0  2.0   
 5061337  WAS201509280     WAS  1   5  2  0  1.0   0.0     0.0  0.0  2.0   
 5061338  WAS201509280     WAS  1   0  0  0  0.0   0.0     0.0  0.0  0.0   
 5061339  WAS201509280     WAS  1   3  1  0  0.0   0.0     0.0  0.0  1.0   
 5061340  WAS201509280     WAS  1   3  0  1  0.0   0.0     0.0  0.0  0.0   
 
          retro_id  
 1043     bainh001  
 1044     bellg001  
 1045     boggw001  
 1

# Building Blocks for the algorithm

### Find a game (in dailies) from a statline

In [15]:
pit_stats[pit_stats['G']==1].sort_values('K')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB
0,aardsda01,aardd001,David Aardsma,SEA,1,2,0,1,0,2,2,0,1
305,leipeda01,leipd001,Dave Leiper,OAK,1,3,0,0,0,1,0,0,1
308,lewisji02,lewij002,Jim Lewis,SD,1,2,0,0,0,0,0,0,2
316,loeweca01,loewc001,Carlton Loewer,SD,1,6,0,1,0,7,6,0,1
321,louxsh01,louxs001,Shane Loux,SF,1,3,1,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,matsuda01,matsd001,Daisuke Matsuzaka,BOS,1,18,1,0,0,5,1,9,5
518,scherma01,schem001,Max Scherzer,DET,1,16,0,1,0,4,4,10,3
236,hernafe02,hernf002,Felix Hernandez,SEA,1,21,1,0,0,7,1,10,1
505,sabatcc01,sabac001,CC Sabathia,CLE,1,24,1,0,0,4,0,11,2


In [16]:
bat_stats.query('G==1').sort_values('RBI')

Unnamed: 0,player_id,retro_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB
0,aardsda01,aardd001,David Aardsma,SEA,1,0,0,0,0,0,0,0,0
1084,olivefr01,olivf001,Francisco Oliveras,SF,1,0,0,0,0,0,0,0,0
1086,olmedra01,olmer001,Ray Olmedo,CIN,1,0,0,0,0,0,0,0,0
1087,olsonti01,olsot001,Tim Olson,ARZ,1,3,1,0,0,0,0,0,1
1088,oltmi01,olt-m001,Mike Olt,CHC,1,4,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1328,santaca01,santc002,Carlos Santana,CLE,1,4,2,1,3,0,0,1,5
997,molitpa01,molip001,Paul Molitor,MIN,1,5,2,0,3,1,0,0,3
1649,zobribe01,zobrb001,Ben Zobrist,TB,1,4,2,1,4,1,0,1,6
1489,trammal01,trama001,Alan Trammell,DET,1,4,2,0,5,2,0,0,4


In [17]:
bat_stats.query('retro_id=="nettg001"')

Unnamed: 0,player_id,retro_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB
1048,nettlgr01,nettg001,Graig Nettles,SD,1,4,2,2,5,0,0,2,8


In [18]:
mike_scott = pit_stats[pit_stats['player_id']=='scottmi03']
mike_scott

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB
524,scottmi03,scotm001,Mike Scott,HOU,1,25,0,1,0,8,3,14,0


In [74]:
#match_cols_pit = ['retro_id', 'G', 'W', 'L', 'SV', 'H', 'ER', 'K', 'BB']
match_cols_pit = ['retro_id', 'G', 'W', 'L', 'SV', 'ER', 'K', 'BB', 'IPouts']
#match_cols_bat = ['retro_id', 'G', 'AB', 'H', 'R', 'RBI', 'TwoB', 'ThreeB', 'HR', 'TB']
match_cols_bat = ['retro_id', 'G', 'H', 'R', 'RBI', 'TwoB', 'ThreeB', 'HR', 'TB']
def find_daily(dailies, stat_line, match_cols):

    matches = pd.merge(left=stat_line, right=dailies, on=match_cols)
    return matches

In [75]:
find_daily(bat_dailies, bat_stats.query('retro_id=="nettg001"'), match_cols_bat)

Unnamed: 0,player_id,retro_id,Name,Teams,G,AB_x,H,R,RBI,TwoB,ThreeB,HR,TB,game_id,team_id,AB_y
0,nettlgr01,nettg001,Graig Nettles,SD,1,4,2,2,5,0,0,2,8,SDN198607300,SDN,4


### Subtract a game's dailies from a stat DF
#### Start by subtracting a player's daily from their own statline
#### then scale

# Execute the Algorithm

### Likely starting out manually, iterating manually, but calling the building block functions

In [76]:
matches = find_daily(bat_dailies, bat_stats.query('G==1'), match_cols_bat)
match_counts = matches.groupby('retro_id').agg({'player_id': len, 'game_id': min}).rename(columns={'player_id': 'count'})
match_counts


Unnamed: 0_level_0,count,game_id
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1
aardd001,331,ANA200705060
abret001,103,ARI201004070
accaj001,261,ANA200708230
acevj001,359,ANA200205070
adkij001,119,ANA200308140
...,...,...
younj001,201,ATL198607020
younm003,445,ANA200106270
zawal001,2,SDN201005020
zimmj002,12,COL199906090


In [77]:
match_counts.query('count==2')

Unnamed: 0_level_0,count,game_id
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1
asadr001,2,SDN198609140
branm003,2,CLE201009100
campt001,2,SDN201109270
crawc002,2,SDN201304100
diaza001,2,MIL199708270
diazr001,2,PIT200905300
higgk001,2,ATL199306030
holld001,2,NYN199209281
knobc001,2,MIN199705160
lanej001,2,HOU200506040


In [78]:
known_attended = match_counts[match_counts['count']==1].game_id.unique()
sorted(known_attended), len(known_attended)

(['DET198808040',
  'DET201308310',
  'OAK200004050',
  'SDN198607300',
  'SDN198704180',
  'SDN198806130',
  'SDN199109220',
  'SDN199407060',
  'SDN200106300',
  'SDN200110070',
  'SDN200204080',
  'SDN200306200',
  'SDN200308100',
  'SDN200405020',
  'SDN200406190',
  'SDN200407020',
  'SDN200408220',
  'SDN200409080',
  'SDN200706190',
  'SDN200904260',
  'SDN200910040',
  'SDN201007310',
  'SDN201209300',
  'SEA199608100',
  'SEA201209080',
  'SFN199609180',
  'SFN199804140',
  'SLN200707270'],
 28)

In [30]:
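def aggregate_dailies(dailies, game_ids, players, match_cols):
    """Sum each player's stats over the given games, indexed by retro_id."""
    # retro_id is the groupby key, so keep it out of the columns being summed
    stat_cols = [c for c in match_cols if c != 'retro_id']
    known_gm_stats = dailies.query('game_id in @game_ids and retro_id in @players').groupby('retro_id')[stat_cols].sum()
    return known_gm_stats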

known_gm_stats_bat = aggregate_dailies(bat_dailies, known_attended, bat_stats.retro_id, match_cols_bat)
known_gm_stats_bat

Unnamed: 0_level_0,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
abreb001,1,4,1,0,0.0,0.0,0.0,0.0,1.0
ackld001,1,5,1,0,0.0,0.0,0.0,0.0,1.0
adamm001,3,0,0,0,0.0,0.0,0.0,0.0,0.0
adamt001,1,0,0,0,0.0,0.0,0.0,0.0,0.0
affej001,1,0,0,0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
yound001,1,4,2,1,0.0,1.0,0.0,0.0,3.0
younj001,1,0,0,0,0.0,0.0,0.0,0.0,0.0
zaung001,1,4,2,0,0.0,1.0,0.0,0.0,3.0
zawal001,1,3,1,1,0.0,0.0,0.0,0.0,1.0


In [None]:
these_dailies = bat_dailies[bat_dailies.game_id.isin(known_attended)]
these_dailies

In [None]:
these_dailies.retro_id.isin(bat_stats.retro_id).value_counts()

Why are there 25 entries that don't show up in my pitchers seen, when looking at games that are *known* to be in the dataset?

Oh, could these be the guys with ambiguous name matching, who we removed from our "universe"?

Yep.  Maybe we should trim the dailies earlier, for the pitchers in our known universe.
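
A sketch of that early trim on invented rows: keep only the dailies for players who are in the resolved stats table, and set the rest aside for inspection. The frame names here are toy stand-ins for the real dailies and stats frames.

```python
import pandas as pd

# Invented toy rows
dailies = pd.DataFrame({
    "game_id":  ["G1", "G1", "G2", "G2"],
    "retro_id": ["aaaa001", "zzzz001", "aaaa001", "bbbb001"],
    "H": [1, 2, 0, 3],
})
stats = pd.DataFrame({"retro_id": ["aaaa001", "bbbb001"], "G": [2, 1]})

# Boolean mask: is this daily line for a player in our known universe?
in_universe = dailies["retro_id"].isin(stats["retro_id"])
trimmed = dailies[in_universe]
dropped = dailies[~in_universe]
print(len(trimmed), dropped["retro_id"].tolist())
```

Doing this once, right after building the dailies frames, means every later aggregation sees the same universe as the stats tables.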

In [None]:
these_dailies[~these_dailies.retro_id.isin(bat_stats.retro_id)]

In [None]:
known_gm_stats = these_dailies[these_dailies.retro_id.isin(bat_stats.retro_id)].groupby('retro_id')[match_cols_bat].sum()
known_gm_stats

In [35]:
unaccounted_stats_bat = bat_stats[match_cols_bat].set_index('retro_id')
unaccounted_stats_pit = pit_stats[match_cols_pit].set_index('retro_id')
unaccounted_stats_bat, unaccounted_stats_pit

(          G  AB  H  R  RBI  TwoB  ThreeB  HR  TB
 retro_id                                        
 aardd001  1   0  0  0    0     0       0   0   0
 abreb001  5  19  4  3    2     1       0   0   5
 abret001  1   4  0  0    0     0       0   0   0
 accaj001  1   1  0  0    0     0       0   0   0
 acevj002  3   2  0  0    0     0       0   0   0
 ...      ..  .. .. ..  ...   ...     ...  ..  ..
 zeilt001  2   7  3  2    2     2       0   0   5
 ziegb001  2   0  0  0    0     0       0   0   0
 zimmj002  1   0  0  0    0     0       0   0   0
 zitob001  2   2  0  0    0     0       0   0   0
 zobrb001  1   4  2  1    4     1       0   1   6
 
 [1650 rows x 9 columns],
            G  W  L  SV   H  ER   K  BB
 retro_id                              
 aardd001   1  0  1   0   2   2   0   1
 accaj001   1  0  0   0   0   0   1   2
 acevj002   3  0  2   0  13  10   3   7
 acevj001   1  0  0   0   2   0   2   0
 adamm001  12  1  0   0   7   1  12   5
 ...       .. .. ..  ..  ..  ..  ..  ..
 w

In [32]:
# when subtracting the data frames, need to subtract only the intersection of the sets
def subtract_stats(start, delta):
    shared = start.index.intersection(delta.index)
    unchanging = start.drop(index=shared)
    # restrict delta to the shared players and to start's columns so nothing aligns to NaN
    changing = start.loc[shared] - delta.loc[shared, start.columns]

    after = pd.concat([unchanging, changing])
    return after

subtract_stats(unaccounted_stats_bat, known_gm_stats_bat)

Unnamed: 0_level_0,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB
retro_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
aardd001,1,0,0,0,0.0,0.0,0.0,0.0,0.0
abret001,1,4,0,0,0.0,0.0,0.0,0.0,0.0
accaj001,1,1,0,0,0.0,0.0,0.0,0.0,0.0
acevj002,3,2,0,0,0.0,0.0,0.0,0.0,0.0
acevj001,1,0,0,0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
yound001,3,8,1,1,1.0,0.0,0.0,0.0,1.0
younj001,0,0,0,0,0.0,0.0,0.0,0.0,0.0
zaung001,3,9,1,1,0.0,0.0,0.0,0.0,1.0
zawal001,0,0,0,0,0.0,0.0,0.0,0.0,0.0
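
An alternative way to subtract over just the intersection is to reindex the delta onto the starting frame and fill the gaps with zero. This is a sketch on invented numbers, not the notebook's function; unlike the concat approach it keeps the original row order.

```python
import pandas as pd

# Invented remaining totals for three players
start = pd.DataFrame({"G": [3, 1, 2], "H": [4, 0, 1]},
                     index=pd.Index(["a", "b", "c"], name="retro_id"))
# Invented stats from known games; player "c" did not appear in them
delta = pd.DataFrame({"G": [1, 1], "H": [2, 0]},
                     index=pd.Index(["a", "b"], name="retro_id"))

# Players and columns missing from delta become 0, so they pass through unchanged
aligned = delta.reindex(index=start.index, columns=start.columns, fill_value=0)
after = start - aligned
print(after)
```

Because the fill value is an integer, the result stays integer-typed instead of picking up NaN and turning into floats.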


In [38]:
def run_iteration(dailies, unaccounted_stats, games_known_in, match_cols):
    # find players who have only one game in the set
    singletons = unaccounted_stats[unaccounted_stats['G']==1]

    # find possible matches for all of those single games, from all of each player's career
    matches = find_daily(dailies, singletons, match_cols)
                                   
    # count the matches for each player, and any games that resolve uniquely are now known to be IN
    match_counts = matches.groupby('retro_id').agg({'G': len, 'game_id': min}).rename(columns={'G': 'count'})
    games_deduced_in = match_counts.query('count==1').game_id.unique()
    
    # aggregate the stats of all players (batters or pitchers) across these games deduced in
    # these stats are now accounted for, so subtract them
    deduced_gm_stats = aggregate_dailies(dailies, games_deduced_in, unaccounted_stats.index, match_cols)
    still_unaccounted = subtract_stats(unaccounted_stats, deduced_gm_stats)
    return (dailies, still_unaccounted, games_deduced_in, match_cols)

iter1 = run_iteration(bat_dailies, unaccounted_stats_bat, [], match_cols_bat)
iter1, len(iter1[2])

((              game_id team_id  G  AB  H  R  RBI  TwoB  ThreeB   HR   TB  \
  1043     ALS198707140     ALS  1   1  0  0  0.0   0.0     0.0  0.0  0.0   
  1044     ALS198707140     ALS  1   3  0  0  0.0   0.0     0.0  0.0  0.0   
  1045     ALS198707140     ALS  1   3  0  0  0.0   0.0     0.0  0.0  0.0   
  1046     ALS198707140     ALS  1   2  2  0  0.0   0.0     0.0  0.0  2.0   
  1047     ALS198707140     ALS  1   2  0  0  0.0   0.0     0.0  0.0  0.0   
  ...               ...     ... ..  .. .. ..  ...   ...     ...  ...  ...   
  5061336  WAS201509280     WAS  1   3  2  0  0.0   0.0     0.0  0.0  2.0   
  5061337  WAS201509280     WAS  1   5  2  0  1.0   0.0     0.0  0.0  2.0   
  5061338  WAS201509280     WAS  1   0  0  0  0.0   0.0     0.0  0.0  0.0   
  5061339  WAS201509280     WAS  1   3  1  0  0.0   0.0     0.0  0.0  1.0   
  5061340  WAS201509280     WAS  1   3  0  1  0.0   0.0     0.0  0.0  0.0   
  
           retro_id  
  1043     bainh001  
  1044     bellg001  
  1045 

In [39]:
iter1_pit = run_iteration(pit_dailies, unaccounted_stats_pit, [], match_cols_pit)
iter1_pit, len(iter1_pit[2])

((              game_id team_id  G  W  L  SV  IPouts   ER  H   BB     K  \
  1043     ALS198707140     ALS  0  0  0   0       0  0.0  0  0.0   0.0   
  1044     ALS198707140     ALS  0  0  0   0       0  0.0  0  0.0   0.0   
  1045     ALS198707140     ALS  0  0  0   0       0  0.0  0  0.0   0.0   
  1046     ALS198707140     ALS  0  0  0   0       0  0.0  0  0.0   0.0   
  1047     ALS198707140     ALS  0  0  0   0       0  0.0  0  0.0   0.0   
  ...               ...     ... .. .. ..  ..     ...  ... ..  ...   ...   
  5061336  WAS201509280     WAS  1  1  0   0      24  1.0  2  3.0  10.0   
  5061337  WAS201509280     WAS  0  0  0   0       0  0.0  0  0.0   0.0   
  5061338  WAS201509280     WAS  1  0  0   0       3  0.0  1  0.0   0.0   
  5061339  WAS201509280     WAS  0  0  0   0       0  0.0  0  0.0   0.0   
  5061340  WAS201509280     WAS  0  0  0   0       0  0.0  0  0.0   0.0   
  
           retro_id  
  1043     bainh001  
  1044     bellg001  
  1045     boggw001  
  1046   

In [56]:
sorted(set(iter1_pit[2]) )

['ATL201008060',
 'CHN199806230',
 'CIN200407180',
 'CLE200806270',
 'DET198808040',
 'DET199806180',
 'DET201308310',
 'LAN199807120',
 'NYN200809131',
 'NYN200809132',
 'OAK199505080',
 'PIT199405160',
 'SDN198607300',
 'SDN198609140',
 'SDN198806130',
 'SDN199109220',
 'SDN199308061',
 'SDN199308062',
 'SDN199407060',
 'SDN199609210',
 'SDN199610050',
 'SDN199707050',
 'SDN199807050',
 'SDN199904190',
 'SDN199906080',
 'SDN200006110',
 'SDN200006160',
 'SDN200007010',
 'SDN200007160',
 'SDN200008060',
 'SDN200009060',
 'SDN200010010',
 'SDN200104250',
 'SDN200104290',
 'SDN200106160',
 'SDN200108050',
 'SDN200109030',
 'SDN200110070',
 'SDN200206210',
 'SDN200207130',
 'SDN200304060',
 'SDN200305310',
 'SDN200309050',
 'SDN200309250',
 'SDN200404180',
 'SDN200405160',
 'SDN200406040',
 'SDN200406060',
 'SDN200407110',
 'SDN200408040',
 'SDN200408050',
 'SDN200408060',
 'SDN200408170',
 'SDN200409080',
 'SDN200409220',
 'SDN200409240',
 'SDN200409290',
 'SDN200504070',
 'SDN200504190

In [70]:
# where did HOU198607010 come from?

singletons = unaccounted_stats_bat.query('G==1')

# find possible matches for all of those single games, from all of each player's career
matches = find_daily(bat_dailies, singletons, match_cols_bat)
matches.query('game_id=="HOU198607010"')


Unnamed: 0,retro_id,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,game_id,team_id
67528,walld001,1,5,1,0,1,0,0,0,1,HOU198607010,HOU
67530,waltg001,1,1,0,0,0,0,0,0,0,HOU198607010,SDN


In [55]:
bat_stats.query('retro_id in ("walld001", "waltg001")')

Unnamed: 0,player_id,retro_id,Name,Teams,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB
1563,wallide01,walld001,Denny Walling,HOU,1,5,1,0,1,0,0,0,1
1564,waltege01,waltg001,Gene Walter,SD,1,1,0,0,0,0,0,0,0


In [62]:
bat_dailies.query('retro_id in ("walld001", "waltg001") and game_id in @known_games')

Unnamed: 0,game_id,team_id,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,retro_id
2212949,HOU198607010,HOU,1,5,1,0,1.0,0.0,0.0,0.0,1.0,walld001
2212961,HOU198607010,SDN,1,1,0,0,0.0,0.0,0.0,0.0,0.0,waltg001
4132500,SDN198607300,SDN,1,1,0,0,0.0,0.0,0.0,0.0,0.0,waltg001
4133052,SDN198609140,HOU,1,4,1,0,1.0,0.0,0.0,0.0,1.0,walld001


In [72]:
# Walling went 1-for-4 with a walk in HOU198607010.  Is AB really PA?  Seems like there would be way more mistakes if it were consistently wrong like that.

# Walter seems to be a red herring; he had two games matching, one is confirmed known, and the other one happens to be HOU198607010.

# Let's look at other players who walked in that game.  Doran went 2-for-3 with 2 BB, Cruz went 2-for-2 with 2 BB
plrs = ("dorab001", "cruzj001")
bat_dailies.query('retro_id in @plrs and game_id in @known_games'), bat_stats.query('retro_id in @plrs')


(              game_id team_id  G  AB  H  R  RBI  TwoB  ThreeB   HR   TB  \
 2212935  HOU198607010     HOU  1   4  0  0  0.0   0.0     0.0  0.0  0.0   
 2212938  HOU198607010     HOU  1   4  1  0  1.0   1.0     0.0  0.0  2.0   
 4133044  SDN198609140     HOU  1   2  2  0  0.0   1.0     0.0  0.0  3.0   
 4133046  SDN198609140     HOU  1   3  2  1  0.0   2.0     0.0  0.0  4.0   
 
          retro_id  
 2212935  cruzj001  
 2212938  dorab001  
 4133044  cruzj001  
 4133046  dorab001  ,
      player_id  retro_id        Name Teams  G  AB  H  R  RBI  TwoB  ThreeB  \
 378  doranbi02  dorab001  Bill Doran   HOU  1   3  2  1    0     2       0   
 
      HR  TB  
 378   0   4  )

In [71]:
# How many matches in a game I'm sure I went to?
matches.query('game_id=="SDN198609140"')

Unnamed: 0,retro_id,G,AB,H,R,RBI,TwoB,ThreeB,HR,TB,game_id,team_id
1530,asadr001,1,3,0,0,0,0,0,0,0,SDN198609140,SDN
14733,davig001,1,4,1,0,0,1,0,0,2,SDN198609140,HOU
16435,dorab001,1,3,2,1,0,2,0,0,4,SDN198609140,HOU
21908,greeg001,1,3,1,1,0,1,0,0,2,SDN198609140,SDN
28273,iorgd001,1,1,0,0,0,0,0,0,0,SDN198609140,SDN
40493,mizej001,1,4,1,0,0,0,0,0,1,SDN198609140,HOU
51890,pyznt001,1,3,0,0,0,0,0,0,0,SDN198609140,SDN
53171,reync001,1,3,0,0,0,0,0,0,0,SDN198609140,HOU
59127,scotm001,1,3,0,0,0,0,0,0,0,SDN198609140,HOU


In [57]:
# where did PIT199405160 come from?

singletons = unaccounted_stats_pit.query('G==1')

# find possible matches for all of those single games, from all of each player's career
matches = find_daily(pit_dailies, singletons, match_cols_pit)
matches.query('game_id=="PIT199405160"')

Unnamed: 0,retro_id,G,W,L,SV,H,ER,K,BB,game_id,team_id,IPouts
1341,neagd001,1,1,0,0,3,1,6,3,PIT199405160,PIT,21


In [58]:
pit_stats.query('retro_id in ("neagd001")')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB
390,neaglde01,neagd001,Denny Neagle,CIN,1,24,1,0,0,3,1,6,3


In [60]:
known_games = set(iter1[2])|set(iter1_pit[2])
pit_dailies.query('retro_id in ("neagd001") and game_id in @known_games')

Unnamed: 0,game_id,team_id,G,W,L,SV,IPouts,ER,H,BB,K,retro_id
4031882,PIT199405160,PIT,1,1,0,0,21,1.0,3,3.0,6.0,neagd001
4161779,SDN199909200,CIN,1,1,0,0,24,1.0,2,3.0,6.0,neagd001


OK, it's probably that 1999 game in SD that I actually attended.  Everything matches except for the hits (retrosheet says 2, HBP says 3).  Why could that be?
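
A quick way to surface this kind of near-miss is to list which columns disagree for each candidate game. The numbers below are copied from the Neagle outputs above; the dict layout and the name `diffs` are just for this sketch.

```python
stat_cols = ["W", "L", "SV", "H", "ER", "K", "BB", "IPouts"]

# Neagle's line from the HBP sheet (pit_stats)
sheet_line = {"W": 1, "L": 0, "SV": 0, "H": 3, "ER": 1, "K": 6, "BB": 3, "IPouts": 24}

# His two candidate lines from the dailies
candidates = {
    "PIT199405160": {"W": 1, "L": 0, "SV": 0, "H": 3, "ER": 1, "K": 6, "BB": 3, "IPouts": 21},
    "SDN199909200": {"W": 1, "L": 0, "SV": 0, "H": 2, "ER": 1, "K": 6, "BB": 3, "IPouts": 24},
}

# For each candidate game, the columns where the sheet and the daily line differ
diffs = {game: [c for c in stat_cols if line[c] != sheet_line[c]]
         for game, line in candidates.items()}
print(diffs)
```

Each candidate is off in exactly one column (outs for the 1994 game, hits for the 1999 game), so which game "matches" depends entirely on which of those two columns is in the match list.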

In [66]:
pits_used = pit_dailies.query('game_id in ("SDN199909200")')['retro_id']
pit_stats.query('retro_id in @pits_used')

Unnamed: 0,player_id,retro_id,Name,Teams,G,IPouts,W,L,SV,H,ER,K,BB
13,almanca01,almac001,Carlos Almanzar,SD,2,9,0,1,0,7,5,4,2
35,belinst01,belis001,Stan Belinda,CIN COL,2,6,0,0,0,1,0,3,0
89,carlybu01,carlb001,Buddy Carlyle,SD ATL,3,24,0,1,0,9,4,12,6
206,guzmado01,guzmd001,Domingo Guzman,SD,1,0,0,0,0,4,4,0,0
385,murrahe01,murrh001,Heath Murray,SD,2,21,0,0,0,6,1,4,4
390,neaglde01,neagd001,Denny Neagle,CIN,1,24,1,0,0,3,1,6,3
637,whitema02,whitm002,Matt Whiteside,SD TEX,5,18,0,0,0,4,1,9,5


In [90]:

# where did OAK200004050 come from?

singletons = unaccounted_stats_bat.query('G==1')

# find possible matches for all of those single games, from all of each player's career
matches = find_daily(bat_dailies, singletons, match_cols_bat)
cts = matches.groupby('retro_id')['G'].count()
plrs = cts[cts==1].index
gm_ids = ("OAK200004050",)  # trailing comma makes this a one-element tuple, not a bare string
matches.query('retro_id in @plrs and game_id in @gm_ids')


Unnamed: 0,retro_id,G,AB_x,H,R,RBI,TwoB,ThreeB,HR,TB,game_id,team_id,AB_y
65955,menef001,1,2,1,1,2,0,0,1,4,OAK200004050,OAK,4


In [91]:
plrs = ('menef001',)  # trailing comma makes this a one-element tuple, not a bare string
bat_dailies.query('retro_id in @plrs and game_id in @known_games'), bat_stats.query('retro_id in @plrs')

(              game_id team_id  G  AB  H  R  RBI  TwoB  ThreeB   HR   TB  \
 5000550  TOR200508060     TOR  1   2  1  1  1.0   0.0     0.0  1.0  4.0   
 
          retro_id  
 5000550  menef001  ,
      player_id  retro_id             Name Teams  G  AB  H  R  RBI  TwoB  \
 964  menecfr01  menef001  Frank Menechino   TOR  1   2  1  1    2     0   
 
      ThreeB  HR  TB  
 964       0   1   4  )

In [None]:
# So HBP has Menechino with 2 RBI, while he actually had 1

In [None]:
                                
# count the matches for each player, and any games that resolve uniquely are now known to be IN
match_counts = matches.groupby('retro_id').agg({'G': len, 'game_id': min}).rename(columns={'G': 'count'})
games_deduced_in = match_counts.query('count==1').game_id.unique()

In [None]:
iter2 = run_iteration(*iter1)
iter2, len(iter2[2])

In [None]:
iter2[2]

In [None]:
pit_stats.query('retro_id=="alfoa001"')

In [None]:
dailies.query('retro_id=="alfoa001" and game_id in @iter1[2]')

In [None]:
dailies.query('retro_id=="alfoa001" and IPouts==2 and H==1 and BB==1')

In [None]:
iter2[4].query('game_id == "SDN201306140"')

In [None]:
dailies[(dailies['retro_id']=='cahit001') & dailies['game_id'].isin(iter1[2])]

In [None]:
dailies[(dailies['retro_id']=='harrw002') & dailies['game_id'].isin(iter1[2])]

In [None]:
iter1[4][iter1[4]['game_id'] =='SDN201309260']

In [None]:
iter1[4][iter1[4]['game_id'] =='SDN201209150']

In [None]:
iter2[1].sort_values(by='IPouts')

In [None]:
dailies.query('retro_id=="murrh001" and game_id in @iter1[2]')

In [None]:
pit_stats.query('retro_id=="murrh001"')

This looks suspicious.  I didn't go to a game in Atlanta in May 2004.  This explains the discrepancy in the totals - but how did this game show up in my known attended games?

In [None]:
dailies.query('game_id=="ATL200405070"')

In [None]:
pit_stats.query('retro_id=="thomj005"')

In [None]:
iter3 = run_iteration(*iter2)
iter3[2]

In [None]:
iter4 = run_iteration(*iter3)
iter4[2]

In [None]:
dailies[(dailies['retro_id']=='ziegb001') & dailies['game_id'].isin(iter1[2])]

In [None]:
dailies[(dailies['retro_id']=='ziegb001') & dailies['game_id'].isin(iter2[2])]

In [None]:
iter3[1].describe()