# Retrosheet and Lahman Data Consistency

The Retrosheet and Lahman data sets are not perfectly accurate.

A good way to understand the accuracy of the data, and the data in general, is to run different queries that should produce the same results and verify that they do.

There are three primary types of data consistency tests that will be performed:
* The Retrosheet and Lahman data for batting/pitching/fielding will be aggregated to the same level and compared against each other
* The Retrosheet and Lahman individual stats will be aggregated to the team level and compared against their respective team stats
* The Retrosheet and Lahman hitting stats will be compared against their respective pitching stats
  * for example, for every home run hit by a batter, a pitcher must have given up a home run
  
The data will also be checked to verify that the fields intended to uniquely identify a record actually do so.

Furthermore, unusual data will be spot checked against Baseball-Reference.com to verify it.
  
The pytest test suite supplied in this GitHub repo automatically runs a superset of the queries in this notebook and verifies that the results differ by no more than 1%, and in most cases by no more than 0.1%.
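
A tolerance check of this kind can be sketched as follows. The helper name `max_relative_diff` and the toy totals are assumptions made for illustration; they are not taken from the test suite.

```python
import numpy as np
import pandas as pd


def max_relative_diff(left: pd.Series, right: pd.Series) -> float:
    """Largest relative difference between two Series of totals (hypothetical helper)."""
    # division aligns on the index labels, so the order of the stats does not matter
    return float(np.abs(1.0 - left / right).max())


# toy totals, not real Lahman or Retrosheet numbers
lahman_totals = pd.Series({"ab": 1000, "h": 250, "hr": 30})
retro_totals = pd.Series({"hr": 30, "ab": 1001, "h": 250})

worst = max_relative_diff(lahman_totals, retro_totals)
print(f"{worst:8.6f}")
print(worst < 0.01, worst < 0.001)  # within 1%, within 0.1%
```

Reporting the single worst stat keeps the assertion simple while still showing how close the two sources are.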

All data between 1974 and 2019 inclusive will be used.  Retrosheet has all play-by-play data over this time period.

TODO:
* Lahman Batting vs Lahman Pitching (aggregated over all players and teams 1974 through 2019)
* Lahman Batting vs Lahman Teams Batting
* Lahman Pitching vs Lahman Teams Pitching
* Lahman Fielding vs Lahman Teams Fielding

## Setup
[Preliminaries](#Preliminaries)  
[Imports and Setup](#Imports-and-Setup)  
[Load Data](#Load-the-Data)   

## Preliminaries

This notebook assumes that the Lahman and Retrosheet data sets have been downloaded and wrangled using the scripts in the `../download_scripts` directory of this repo.

For this notebook, Retrosheet data from 1974 through 2019 inclusive is used.

The `from_csv_with_types()` function in `../download_scripts/data_helper.py` calls `pd.read_csv()` with the column dtypes set from the type information stored in `<filename>_types.csv`.  This allows Pandas to reuse the previously optimized data types.
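
The idea of persisting dtypes next to a CSV file can be sketched as below. This is not the code in `data_helper.py`: the function names ending in `_sketch`, the two-column layout of the types file (`column`, `dtype`), and the toy data are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def to_csv_with_types_sketch(df: pd.DataFrame, csv_path: Path) -> None:
    """Write df to csv_path and its dtypes to <name>_types.csv (assumed layout)."""
    types_path = csv_path.with_name(csv_path.stem + "_types.csv")
    df.dtypes.astype(str).rename("dtype").rename_axis("column").to_csv(types_path)
    df.to_csv(csv_path, index=False)


def from_csv_with_types_sketch(csv_path: Path) -> pd.DataFrame:
    """Read csv_path using the dtypes stored in <name>_types.csv (assumed layout)."""
    types_path = csv_path.with_name(csv_path.stem + "_types.csv")
    types = pd.read_csv(types_path, index_col="column")["dtype"].to_dict()
    # datetime columns are parsed via parse_dates rather than dtype=
    date_cols = [c for c, t in types.items() if t.startswith("datetime")]
    dtypes = {c: t for c, t in types.items() if c not in date_cols}
    return pd.read_csv(csv_path, dtype=dtypes, parse_dates=date_cols)


# toy data, not real Retrosheet rows
df = pd.DataFrame({
    "hr": pd.Series([0, 2, 1], dtype="uint8"),
    "ab": pd.Series([4, 5, 3], dtype="uint8"),
    "game_start_dt": pd.to_datetime(["1976-04-18", "1976-04-19", "1976-04-20"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "batting.csv"
    to_csv_with_types_sketch(df, path)
    restored = from_csv_with_types_sketch(path)

print(restored.dtypes)
```

Without the types file, `pd.read_csv()` would read the two count columns as 64-bit integers.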

## MLB Data Summary

The most frequently used CSV files:

**Lahman**  
* Stats per Player per Year:
  * batting.csv
  * pitching.csv
  * fielding.csv
* Stats per Team per Year:
  * teams.csv -- contains team_id for both Lahman and Retrosheet
* Other
  * people.csv -- contains player_id for both Lahman and Retrosheet
  
**Retrosheet**  
* Stats per Player per Game:
  * batting.csv.gz
  * pitching.csv.gz
  * fielding.csv.gz
* Stats per Team per Game:
  * team_game.csv.gz
* Stats per Game:
  * game.csv.gz 

## Imports and Setup

In [1]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import re
from scipy.stats import linregress

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [3]:
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 100 # increase dpi, will make figures larger and clearer

In [4]:
import sys

# import data_helper.py from download_scripts directory
sys.path.append('../download_scripts')
import data_helper as dh

In [5]:
data_dir = Path('../data')
lahman_data = data_dir.joinpath('lahman/wrangled').resolve()
retrosheet_data = data_dir.joinpath('retrosheet/wrangled').resolve()

In [6]:
pd.set_option("display.max_columns", 50)

# Load the Data

Loading all the data up front makes the code clearer, but uses more memory.

Because the optimized Pandas data types were persisted when the data was wrangled, total memory usage is about one third of what it would be if Pandas inferred the data types with pd.read_csv().
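
The effect of narrower data types on memory can be illustrated with a small sketch. The columns and values below are synthetic, and the ratio printed here depends entirely on this toy data; it is not the factor measured for the real data sets.

```python
import numpy as np
import pandas as pd

n = 10_000
rng = np.random.default_rng(0)

# small non-negative counts, stored as 64-bit integers
inferred = pd.DataFrame({
    "hr": rng.integers(0, 5, n, dtype=np.int64),
    "ab": rng.integers(0, 7, n, dtype=np.int64),
})

# the same values fit in a single unsigned byte per cell
optimized = inferred.astype("uint8")

inferred_bytes = inferred.memory_usage(deep=True).sum()
optimized_bytes = optimized.memory_usage(deep=True).sum()
print(inferred_bytes, optimized_bytes)
```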

Notes:
* every Retrosheet player who appears in a game has a batting record, even if they had no plate appearance
* every Retrosheet team appears in retro_team_game
* the Lahman stint value is incremented each time a player changes teams within the same year (for example, when traded)

In [7]:
lahman_people = dh.from_csv_with_types(lahman_data / 'people.csv')
lahman_teams = dh.from_csv_with_types(lahman_data / 'teams.csv')
lahman_batting = dh.from_csv_with_types(lahman_data / 'batting.csv')
lahman_pitching = dh.from_csv_with_types(lahman_data / 'pitching.csv')
lahman_fielding = dh.from_csv_with_types(lahman_data / 'fielding.csv')

In [8]:
# restrict Lahman data to be between 1974 and 2019 inclusive
lahman_batting = lahman_batting.query('1974 <= year <= 2019')
lahman_pitching = lahman_pitching.query('1974 <= year <= 2019')
lahman_fielding = lahman_fielding.query('1974 <= year <= 2019')

In [9]:
retro_batting = dh.from_csv_with_types(retrosheet_data / 'batting.csv.gz')
retro_pitching = dh.from_csv_with_types(retrosheet_data / 'pitching.csv.gz')
retro_fielding = dh.from_csv_with_types(retrosheet_data / 'fielding.csv.gz')
retro_team_game = dh.from_csv_with_types(retrosheet_data / 'team_game.csv.gz')
retro_game = dh.from_csv_with_types(retrosheet_data / 'game.csv.gz')

In [10]:
# restrict Retrosheet data to be between 1974 and 2019 inclusive
retro_batting = retro_batting.query('1974 <= game_start_dt.dt.year <= 2019')
retro_pitching = retro_pitching.query('1974 <= game_start_dt.dt.year <= 2019')
retro_fielding = retro_fielding.query('1974 <= game_start_dt.dt.year <= 2019')
retro_team_game = retro_team_game.query('1974 <= game_start_dt.dt.year <= 2019')
retro_game = retro_game.query('1974 <= game_start_dt.dt.year <= 2019')

In [11]:
# verify these years are in the downloaded data
(retro_batting['game_start_dt'].agg(['min', 'max']).dt.year == (1974, 2019)).all()

True

In [12]:
# verify all the years are in the downloaded data
retro_batting['game_start_dt'].dt.year.nunique() == (2019 - 1974) + 1

True

## Primary and Foreign Key Tests
Perform these tests before checking for data consistency.
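
A uniqueness check like the ones used in this section can be sketched as follows. The function `is_unique_sketch` is an illustrative stand-in, not the implementation of `dh.is_unique()`, and the toy IDs are made up.

```python
import pandas as pd


def is_unique_sketch(df: pd.DataFrame, cols: list, ignore_null: bool = False) -> bool:
    """True if no two rows share the same values in cols (hypothetical helper)."""
    subset = df[cols]
    if ignore_null:
        # rows with a missing key are not treated as duplicates of each other
        subset = subset.dropna()
    return not subset.duplicated().any()


# toy data: two players have no retro_id
people = pd.DataFrame({
    "player_id": ["a01", "b01", "c01"],
    "retro_id": ["a001", None, None],
})

print(is_unique_sketch(people, ["player_id"]))
print(is_unique_sketch(people, ["retro_id"]))
print(is_unique_sketch(people, ["retro_id"], ignore_null=True))
```

Ignoring nulls matters for optional foreign keys, where several rows can legitimately have no value.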

### Lahman People and Lahman/Retrosheet Player IDs

In [13]:
# Lahman People pkey: player_id
dh.is_unique(lahman_people, ['player_id'])

True

In [14]:
# Lahman People fkey: retro_id
# The mapping between player_id and retro_id must be one-to-one (or missing)
# if player_id is unique and retro_id is unique (or missing), then the mapping is one-to-one
dh.is_unique(lahman_people, ['retro_id'], ignore_null=True)

True

In [15]:
lahman_players = set(lahman_people['retro_id'].unique())
retro_players = set(retro_batting['player_id'].unique())

In [16]:
# Every retrosheet player_id is in Lahman
retro_players.issubset(lahman_players)

True

### Lahman Teams and Lahman/Retrosheet Team IDs

In [17]:
# Lahman Teams pkey: team_id, year
dh.is_unique(lahman_teams, ['team_id', 'year'])

True

In [18]:
# Lahman Teams fkey: team_id_retro, year
# The mapping between (team_id, year) and (team_id_retro, year) must be one-to-one
dh.is_unique(lahman_teams, ['team_id_retro', 'year'])

True

In [19]:
lahman_team_ids = set(zip(lahman_teams['team_id_retro'], lahman_teams['year']))
retro_team_ids = set(zip(retro_team_game['team_id'], retro_team_game['game_start_dt'].dt.year))

In [20]:
# every retrosheet (team_id, year) is in lahman
retro_team_ids.issubset(lahman_team_ids)

True
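
When a subset check like this fails, it helps to see which keys are missing. A minimal sketch using a left merge with `indicator=True` is shown below; the helper name `missing_keys` and the toy rows (including the team `XXX`) are assumptions for illustration.

```python
import pandas as pd


def missing_keys(child: pd.DataFrame, parent: pd.DataFrame, on: list) -> pd.DataFrame:
    """Distinct key combinations in child with no match in parent (hypothetical helper)."""
    child_keys = child[on].drop_duplicates()
    parent_keys = parent[on].drop_duplicates()
    merged = child_keys.merge(parent_keys, on=on, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only", on].reset_index(drop=True)


# toy data: 'XXX' is a made-up team that has no parent row
teams = pd.DataFrame({"team_id": ["BOS", "MIN"], "year": [1976, 1976]})
games = pd.DataFrame({"team_id": ["BOS", "MIN", "XXX"], "year": [1976, 1976, 1976]})

orphans = missing_keys(games, teams, ["team_id", "year"])
print(orphans)
```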

### Lahman Batting/Pitching/Fielding

In [21]:
dh.is_unique(lahman_batting, ['player_id', 'year', 'stint'])

True

In [22]:
dh.is_unique(lahman_pitching, ['player_id', 'year', 'stint'])

True

In [23]:
dh.is_unique(lahman_fielding, ['player_id', 'year', 'stint', 'pos'])

True

### Retrosheet Batting/Pitching/Fielding/Team_Game/Game

In [24]:
dh.is_unique(retro_batting, ['player_id', 'game_id'])

True

In [25]:
dh.is_unique(retro_pitching, ['player_id', 'game_id'])

True

In [26]:
dh.is_unique(retro_fielding, ['player_id', 'game_id', 'pos'])

True

In [27]:
dh.is_unique(retro_team_game, ['team_id', 'game_id'])

True

In [28]:
dh.is_unique(retro_game, ['game_id'])

True

# Data Consistency Tests
The download scripts ensured that columns with the same meaning were given the same name.

## Retrosheet Batting vs Lahman Batting Players

In [29]:
# TODO add this to pytest tests
# verify that Lahman and Retrosheet have stats on the same set of batters
lahman_batters = pd.merge(lahman_batting['player_id'], lahman_people[['player_id', 'retro_id']])
r_batters = set(retro_batting['player_id'].unique())
l_batters = set(lahman_batters['retro_id'].unique())
r_batters == l_batters

True

## Retrosheet Batting vs Lahman Batting Stats

In [30]:
# batting columns to compare
cols = set(retro_batting.columns) & set(lahman_batting.columns)
cols -= {'player_id', 'team_id'}
len(cols)

17

In [31]:
cols

{'ab',
 'bb',
 'cs',
 'double',
 'g',
 'gidp',
 'h',
 'hbp',
 'hr',
 'ibb',
 'r',
 'rbi',
 'sb',
 'sf',
 'sh',
 'so',
 'triple'}

In [32]:
# aggregate the stats in common over all players over all years (1974 thru 2019)
# select with a sorted list: recent pandas versions reject a set as an indexer
l = lahman_batting[sorted(cols)]
r = retro_batting[sorted(cols)]

l_sums = l.agg('sum')
l_sums.sort_index(inplace=True)

r_sums = r.agg('sum')
r_sums.sort_index(inplace=True)

In [33]:
# compute the relative differences
np.abs(1.0 - (l_sums / r_sums))

ab        1.426074e-07
bb        1.487730e-06
cs        3.295273e-05
double    2.922456e-06
g         1.397775e-06
gidp      8.906306e-05
h         0.000000e+00
hbp       0.000000e+00
hr        0.000000e+00
ibb       1.776294e-05
r         0.000000e+00
rbi       5.724236e-06
sb        2.964764e-05
sf        0.000000e+00
sh        1.411652e-05
so        0.000000e+00
triple    0.000000e+00
dtype: float64

In [34]:
# find the largest relative difference
print(f'{np.abs(1.0 - (l_sums / r_sums)).max():8.6f}')

0.000089


In [35]:
# all 17 batting attributes from 1974-2019
# are within plus/minus 0.01% of each other when summed
(np.abs(1.0 - (l_sums / r_sums)) < .0001).all()

True
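
If an aggregate comparison like this one were to fall outside the tolerance, breaking the totals down by year would help locate the discrepancy. The sketch below uses toy rows and assumes only the column names `year` (Lahman) and `game_start_dt` (Retrosheet) that appear in this notebook.

```python
import numpy as np
import pandas as pd

# toy per-season and per-game rows, not real data
lahman = pd.DataFrame({"year": [1974, 1974, 1975], "hr": [10, 5, 20]})
retro = pd.DataFrame({
    "game_start_dt": pd.to_datetime(["1974-04-05", "1974-04-06", "1975-04-08"]),
    "hr": [9, 5, 20],
})

l_by_year = lahman.groupby("year")[["hr"]].sum()

# derive the year from the game date so both sides are indexed the same way
retro_year = retro["game_start_dt"].dt.year.rename("year")
r_by_year = retro.groupby(retro_year)[["hr"]].sum()

rel_diff = np.abs(1.0 - l_by_year / r_by_year)
print(rel_diff)
```

In this toy data the only disagreement is in 1974, which the per-year table makes visible at a glance.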

## Retrosheet Pitching vs Lahman Pitching Players

In [36]:
# TODO add this to pytest tests
# verify that Lahman and Retrosheet have stats on exactly the same set of pitchers
lahman_pitchers = pd.merge(lahman_pitching['player_id'], lahman_people[['player_id', 'retro_id']])
r_pitchers = set(retro_pitching['player_id'].unique())
l_pitchers = set(lahman_pitchers['retro_id'].unique())
r_pitchers == l_pitchers

True

## Retrosheet Pitching vs Lahman Pitching Stats

In [37]:
# pitching columns to compare
cols = set(retro_pitching.columns) & set(lahman_pitching.columns)
cols -= {'player_id', 'team_id', 'year'}
len(cols)

21

In [38]:
cols

{'bb',
 'bk',
 'cg',
 'er',
 'g',
 'gf',
 'gidp',
 'gs',
 'h',
 'hbp',
 'hr',
 'ibb',
 'l',
 'r',
 'sf',
 'sh',
 'sho',
 'so',
 'sv',
 'w',
 'wp'}

In [39]:
# aggregate the stats in common over all players over all years (1974 thru 2019)
# select with a sorted list: recent pandas versions reject a set as an indexer
l = lahman_pitching[sorted(cols)]
r = retro_pitching[sorted(cols)]

l_sums = l.agg('sum')
l_sums.sort_index(inplace=True)

r_sums = r.agg('sum')
r_sums.sort_index(inplace=True)

In [40]:
# compute the relative differences
np.abs(1.0 - (l_sums / r_sums))

bb      0.000001
bk      0.000403
cg      0.000327
er      0.000002
g       0.000105
gf      0.000032
gidp    0.000006
gs      0.000000
h       0.000000
hbp     0.000000
hr      0.000005
ibb     0.000018
l       0.000000
r       0.000000
sf      0.000000
sh      0.000014
sho     0.000598
so      0.000000
sv      0.000081
w       0.000000
wp      0.000355
dtype: float64

In [41]:
# find the largest relative difference
print(f'{np.abs(1.0 - (l_sums / r_sums)).max():8.6f}')

0.000598


In [42]:
# verify all values between 1974 and 2019 are within plus/minus 0.06% of each other
(np.abs(1.0 - (l_sums / r_sums)) < .0006).all()

True

## Retrosheet Fielding vs Lahman Fielding Players

In [43]:
# TODO add this to pytest tests
# verify that Lahman and Retrosheet have stats on exactly the same set of fielders
lahman_fielders = pd.merge(lahman_fielding['player_id'], lahman_people[['player_id', 'retro_id']])
r_fielders = set(retro_fielding['player_id'].unique())
l_fielders = set(lahman_fielders['retro_id'].unique())
r_fielders == l_fielders

False

In [44]:
l_fielders - r_fielders

set()

In [45]:
r_fielders - l_fielders

{'olivt102'}

In [46]:
# Retrosheet has a fielder not in Lahman, what are the fielding stats for this player?
retro_fielding.query('player_id == "olivt102"')

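| | game_id | player_id | pos | team_id | g | gs | inn_outs | tc | po | a | e | dp | tp | pb | xi | game_start_dt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 791764 | BOS197604190 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-19 11:05:00 |
| 791786 | BOS197604200 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-20 15:03:00 |
| 816021 | NYA197604180 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-18 14:07:00 |
| 831226 | TEX197604110 | olivt102 | 2B | MIN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1976-04-11 00:00:00 |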


Tony Oliva (olivt102) started 4 games as a second baseman.  
He had zero total chances (tc) and no outs were recorded during the time he played second base.  

This sounds like a pinch hitter or a designated hitter, rather than a fielder.  Let's check the box scores and player information on Baseball-Reference.

In [47]:
# right click on each generated link and open in a new tab
dh.game_id_to_url('BOS197604190')
dh.game_id_to_url('BOS197604200')
dh.game_id_to_url('NYA197604180')
dh.game_id_to_url('TEX197604110')

bb_player_id = lahman_people.query('retro_id == "olivt102"')['bb_ref_id'].values[0]
dh.player_id_to_url(bb_player_id)

Tony Oliva led off in each of these four away games and was immediately replaced, whether or not he reached base.  He was in the game only to hit once.  He may have been unable to run or field well in these games, but he could still hit, so he was in the lineup.

Tony was listed as the starting second baseman even though he was a career right fielder.  He was never going to play second base; his replacement, Jerry Terrell, was.

The cwdaily parser created a fielding record for Tony as he was on the lineup card as the starting second baseman.

It would be hard to argue that either the Lahman or Retrosheet data is wrong in this scenario.  The only difference is that Retrosheet shows 4 starts as a second baseman for Tony Oliva that Lahman does not.

The fielding players are essentially the same in Lahman and Retrosheet (1974 through 2019).
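
Records like Tony Oliva's can be found with a simple filter: a start at a position with no outs played and no total chances. The sketch below runs on toy rows shaped like the Retrosheet fielding data; the player ID `subst001` and its values are made up for illustration.

```python
import pandas as pd

# toy rows: the olivt102 rows mirror the pattern shown above, subst001 is hypothetical
fielding = pd.DataFrame({
    "game_id": ["BOS197604190", "BOS197604190", "NYA197604180"],
    "player_id": ["olivt102", "subst001", "olivt102"],
    "pos": ["2B", "2B", "2B"],
    "gs": [1, 0, 1],
    "inn_outs": [0, 27, 0],
    "tc": [0, 5, 0],
})

# started at a position, but never on the field while an out was recorded
never_fielded = fielding.query("gs == 1 and inn_outs == 0 and tc == 0")
print(never_fielded.groupby(["player_id", "pos"]).size())
```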

## Retrosheet Fielding vs Lahman Fielding Stats

In [48]:
# fielding columns to compare
cols = set(retro_fielding.columns) & set(lahman_fielding.columns)
cols -= {'player_id', 'team_id', 'year'}
len(cols)

8

In [49]:
cols

{'a', 'dp', 'e', 'g', 'gs', 'inn_outs', 'po', 'pos'}

In [50]:
# aggregate the stats in common per position over all players over all years (1974 thru 2019)
# select with a sorted list: recent pandas versions reject a set as an indexer
l = lahman_fielding[sorted(cols)]
r = retro_fielding[sorted(cols)]

l_sums = l.groupby('pos').agg('sum')
l_sums.sort_index(inplace=True)

r_sums = r.groupby('pos').agg('sum')
r_sums.sort_index(inplace=True)

In [51]:
np.abs(1.0 - (l_sums / r_sums))

| pos | po | a | e | gs | dp | g | inn_outs |
|---|---|---|---|---|---|---|---|
| 1B | 0.000482 | 0.006393 | 0.000833 | 2.9e-05 | 0.000737 | 8e-06 | 2.470157e-05 |
| 2B | 0.002122 | 0.001008 | 0.000309 | 6.3e-05 | 7e-06 | 1.7e-05 | 2.179551e-06 |
| 3B | 0.00185 | 0.001483 | 0.000109 | 1.9e-05 | 0.000908 | 0.000177 | 9.807978e-06 |
| C | 0.000668 | 0.00781 | 0.000683 | 5e-06 | 0.000982 | 5.9e-05 | 5.448877e-07 |
| CF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| OF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| P | 0.005532 | 0.00189 | 0.001415 | 0.0 | 0.002302 | 0.000105 | 4.72236e-06 |
| RF | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| SS | 0.001301 | 0.001009 | 0.000102 | 6.8e-05 | 0.00073 | 4e-06 | 8.354944e-06 |


In [52]:
# Lahman uses OF for sum of LF, CF, RF -- account for this
r_sums.loc['OF'] = r_sums.loc['LF'] + r_sums.loc['CF'] + r_sums.loc['RF']
r_sums = r_sums.drop(['LF', 'CF', 'RF'])
r_sums.sort_index(inplace=True)
r_sums

| pos | po | a | e | gs | dp | g | inn_outs |
|---|---|---|---|---|---|---|---|
| 1B | 1798401.0 | 138119.0 | 14409.0 | 205280.0 | 171005.0 | 237097.0 | 5505722.0 |
| 2B | 422753.0 | 599404.0 | 19443.0 | 205280.0 | 135428.0 | 235233.0 | 5505722.0 |
| 3B | 149770.0 | 404527.0 | 27552.0 | 205280.0 | 37465.0 | 237078.0 | 5505722.0 |
| C | 1349560.0 | 106279.0 | 14642.0 | 205280.0 | 13236.0 | 236247.0 | 5505722.0 |
| OF | 1352944.0 | 38885.0 | 24381.0 | 615840.0 | 8256.0 | 735430.0 | 16517166.0 |
| P | 112264.0 | 243922.0 | 16956.0 | 205280.0 | 17807.0 | 695259.0 | 5505722.0 |
| SS | 319687.0 | 610365.0 | 29333.0 | 205280.0 | 127447.0 | 232003.0 | 5505722.0 |


In [53]:
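# Summing LF, CF and RF overcounts outfield games: Lahman counts a player once per game as an OF,
# whereas Retrosheet may list him in the same game as both a LF and a CF, for example.
# Subtract one game for each (player, game) pair that has more than one outfield record.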
filt = retro_fielding['pos'].isin(['LF', 'CF', 'RF'])
r_of = retro_fielding[filt]

total_dups = r_of.duplicated(subset=['player_id', 'game_id'], keep=False).sum()
counted_dups = r_of.duplicated(subset=['player_id', 'game_id'], keep='first').sum()

r_sums.loc['OF', 'g'] -= (total_dups - counted_dups)
r_sums

| pos | po | a | e | gs | dp | g | inn_outs |
|---|---|---|---|---|---|---|---|
| 1B | 1798401.0 | 138119.0 | 14409.0 | 205280.0 | 171005.0 | 237097.0 | 5505722.0 |
| 2B | 422753.0 | 599404.0 | 19443.0 | 205280.0 | 135428.0 | 235233.0 | 5505722.0 |
| 3B | 149770.0 | 404527.0 | 27552.0 | 205280.0 | 37465.0 | 237078.0 | 5505722.0 |
| C | 1349560.0 | 106279.0 | 14642.0 | 205280.0 | 13236.0 | 236247.0 | 5505722.0 |
| OF | 1352944.0 | 38885.0 | 24381.0 | 615840.0 | 8256.0 | 711605.0 | 16517166.0 |
| P | 112264.0 | 243922.0 | 16956.0 | 205280.0 | 17807.0 | 695259.0 | 5505722.0 |
| SS | 319687.0 | 610365.0 | 29333.0 | 205280.0 | 127447.0 | 232003.0 | 5505722.0 |
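
The two `duplicated()` calls used for the outfield adjustment behave differently, which the toy example below makes concrete. The rows are made up: `p1` plays two outfield positions in one game, `p2` plays one, and `p3` plays all three. Whether a case like `p3` occurs in the real data is not checked here.

```python
import pandas as pd

# toy outfield records for a single game (hypothetical players)
r_of = pd.DataFrame({
    "player_id": ["p1", "p1", "p2", "p3", "p3", "p3"],
    "game_id": ["g1"] * 6,
    "pos": ["LF", "CF", "RF", "LF", "CF", "RF"],
    "g": 1,
})
key = ["player_id", "game_id"]

# keep=False flags every row that belongs to a repeated (player, game) pair
total_dups = r_of.duplicated(subset=key, keep=False).sum()

# keep='first' flags every repeated row except the first one of each pair
counted_dups = r_of.duplicated(subset=key, keep="first").sum()

# the difference is the number of (player, game) pairs that repeat
pairs_with_dups = total_dups - counted_dups

# number of distinct (player, game) pairs, i.e. outfield games counted once per player
of_games = len(r_of.drop_duplicates(subset=key))

print(total_dups, counted_dups, pairs_with_dups, of_games)
```

In this toy data, subtracting `pairs_with_dups` from the summed `g` leaves 4, while there are only 3 distinct (player, game) pairs. The difference comes from `p3`, who has three records but is adjusted only once, so counting distinct pairs is the stricter alternative.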


In [54]:
np.abs(1.0 - l_sums / r_sums)

| pos | po | a | e | gs | dp | g | inn_outs |
|---|---|---|---|---|---|---|---|
| 1B | 0.000482 | 0.006393 | 0.000833 | 2.9e-05 | 0.000737 | 8e-06 | 2.470157e-05 |
| 2B | 0.002122 | 0.001008 | 0.000309 | 6.3e-05 | 7e-06 | 1.7e-05 | 2.179551e-06 |
| 3B | 0.00185 | 0.001483 | 0.000109 | 1.9e-05 | 0.000908 | 0.000177 | 9.807978e-06 |
| C | 0.000668 | 0.00781 | 0.000683 | 5e-06 | 0.000982 | 5.9e-05 | 5.448877e-07 |
| OF | 7.2e-05 | 0.002546 | 0.000656 | 1.3e-05 | 0.001453 | 0.000311 | 6.599195e-06 |
| P | 0.005532 | 0.00189 | 0.001415 | 0.0 | 0.002302 | 0.000105 | 4.72236e-06 |
| SS | 0.001301 | 0.001009 | 0.000102 | 6.8e-05 | 0.00073 | 4e-06 | 8.354944e-06 |


In [55]:
# find the largest relative difference
print(f'{np.abs(1.0 - (l_sums / r_sums)).max().max():8.6f}')

0.007810


In [56]:
# verify all values between 1974 and 2019 are within plus/minus 0.8% of each other
np.abs(1.0 - (l_sums / r_sums)).max().max() < 0.008

True

# Retrosheet Pitching (allowed) vs Retrosheet Hitting Per Game

In [57]:
#TODO add this to pytest
exclude = ['game_id', 'team_id', 'player_id', 'g', 'game_start_dt']
cols = (set(retro_pitching.columns) & set(retro_batting.columns)) - set(exclude)
cols = list(cols)
cols

['hr',
 'hbp',
 'ibb',
 'bb',
 'sf',
 'triple',
 'h',
 'tb',
 'xi',
 'sh',
 'so',
 'hr4',
 'double',
 'gidp',
 'ab',
 'r']

In [58]:
p = retro_pitching[cols].agg('sum')
p

hr         195985
hbp         58097
ibb         56297
bb         672165
sf          58275
triple      41532
h         1826823
tb        2840020
xi           1060
sh          70839
so        1301700
hr4          4710
double     342178
gidp       157192
ab        7012260
r          923811
dtype: int64

In [59]:
b = retro_batting[cols].agg('sum')
b

hr         195985
hbp         58097
ibb         56297
bb         672165
sf          58275
triple      41532
h         1826823
tb        2840020
xi           1060
sh          70839
so        1301700
hr4          4710
double     342178
gidp       157192
ab        7012260
r          923811
dtype: int64

In [60]:
p.equals(b)

True
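
The same comparison can also be made game by game rather than over the whole period, which would pinpoint any game where batting and pitching disagree. The sketch below uses toy rows; the IDs and values are made up, and only the column names `game_id`, `hr` and `so` come from this notebook.

```python
import pandas as pd

# toy batting and pitching lines for two games (hypothetical IDs)
batting = pd.DataFrame({
    "game_id": ["G1", "G1", "G1", "G2", "G2"],
    "player_id": ["b1", "b2", "b3", "b1", "b4"],
    "hr": [1, 0, 2, 0, 1],
    "so": [1, 2, 0, 3, 1],
})
pitching = pd.DataFrame({
    "game_id": ["G1", "G1", "G2"],
    "player_id": ["p1", "p2", "p3"],
    "hr": [2, 1, 1],
    "so": [3, 0, 4],
})

cols = ["hr", "so"]
b_game = batting.groupby("game_id")[cols].sum()
p_game = pitching.groupby("game_id")[cols].sum()

# games where any stat hit by batters differs from what pitchers allowed
mismatched = b_game.index[(b_game != p_game).any(axis=1)]
print(list(mismatched))
```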

# Retrosheet Batting vs Retrosheet Team Batting Per Game

In [61]:
exclude = ['game_id', 'team_id', 'player_id', 'game_start_dt']
cols = (set(retro_batting.columns) & set(retro_team_game.columns)) - set(exclude)
cols = list(cols)
len(cols)

17

In [62]:
cols

['hr',
 'hbp',
 'cs',
 'rbi',
 'bb',
 'ibb',
 'sf',
 'triple',
 'h',
 'xi',
 'sh',
 'so',
 'double',
 'gidp',
 'ab',
 'r',
 'sb']

In [63]:
b = retro_batting[['game_id', 'team_id'] + cols].groupby(['game_id', 'team_id']).agg('sum')
b = b.reset_index().sort_index()
b.head(4)

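| | game_id | team_id | hr | hbp | cs | rbi | bb | ibb | sf | triple | h | xi | sh | so | double | gidp | ab | r | sb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ANA199704020 | ANA | 1 | 0 | 0 | 5 | 5 | 0 | 1 | 0 | 12 | 0 | 0 | 7 | 0 | 0 | 38 | 5 | 2 |
| 1 | ANA199704020 | BOS | 1 | 1 | 0 | 6 | 8 | 0 | 0 | 0 | 9 | 0 | 0 | 12 | 2 | 0 | 36 | 6 | 0 |
| 2 | ANA199704030 | ANA | 0 | 0 | 0 | 2 | 4 | 1 | 1 | 0 | 7 | 0 | 0 | 1 | 2 | 2 | 27 | 2 | 0 |
| 3 | ANA199704030 | BOS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 5 | 0 | 3 | 29 | 0 | 0 |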


In [64]:
tg = retro_team_game[['game_id', 'team_id'] + cols].sort_values(['game_id', 'team_id']).reset_index(drop=True)
tg.head(4)

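| | game_id | team_id | hr | hbp | cs | rbi | bb | ibb | sf | triple | h | xi | sh | so | double | gidp | ab | r | sb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ANA199704020 | ANA | 1 | 0 | 0 | 5 | 5 | 0 | 1 | 0 | 12 | 0 | 0 | 7 | 0 | 0 | 38 | 5 | 2 |
| 1 | ANA199704020 | BOS | 1 | 1 | 0 | 6 | 8 | 0 | 0 | 0 | 9 | 0 | 0 | 12 | 2 | 0 | 36 | 6 | 0 |
| 2 | ANA199704030 | ANA | 0 | 0 | 0 | 2 | 4 | 1 | 1 | 0 | 7 | 0 | 0 | 1 | 2 | 2 | 27 | 2 | 0 |
| 3 | ANA199704030 | BOS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 5 | 0 | 3 | 29 | 0 | 0 |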


In [65]:
b.equals(tg)

True
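
`equals()` only reports whether two frames match. If it returned False, `DataFrame.compare()` could be used to list the differing cells, provided both frames have identical row and column labels. The sketch below uses toy rows with one deliberately introduced mismatch.

```python
import pandas as pd

# toy aggregated batting rows (values are illustrative)
b = pd.DataFrame({
    "game_id": ["G1", "G1"],
    "team_id": ["ANA", "BOS"],
    "hr": [1, 1],
    "r": [5, 6],
})

# copy of the same rows with one deliberately changed value
tg = b.copy()
tg.loc[1, "r"] = 7

print(b.equals(tg))

# only rows and columns that differ are kept; 'self' is b and 'other' is tg
diff = b.compare(tg)
print(diff)
```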

# Retrosheet Fielding vs Retrosheet Team Fielding Per Game

In [66]:
cols = ['a', 'e', 'po', 'pb']
cols

['a', 'e', 'po', 'pb']

In [67]:
f = retro_fielding[['game_id', 'team_id'] + cols].groupby(['game_id', 'team_id']).agg('sum')
f = f.reset_index().sort_index()
f.head(4)

| | game_id | team_id | a | e | po | pb |
|---|---|---|---|---|---|---|
| 0 | ANA199704020 | ANA | 6 | 1 | 27 | 0 |
| 1 | ANA199704020 | BOS | 8 | 1 | 27 | 0 |
| 2 | ANA199704030 | ANA | 13 | 0 | 27 | 0 |
| 3 | ANA199704030 | BOS | 9 | 0 | 24 | 0 |


In [68]:
tg = retro_team_game[['game_id', 'team_id'] + cols].sort_values(
    ['game_id', 'team_id']).reset_index(drop=True)
tg.head(4)

| | game_id | team_id | a | e | po | pb |
|---|---|---|---|---|---|---|
| 0 | ANA199704020 | ANA | 6 | 1 | 27 | 0 |
| 1 | ANA199704020 | BOS | 8 | 1 | 27 | 0 |
| 2 | ANA199704030 | ANA | 13 | 0 | 27 | 0 |
| 3 | ANA199704030 | BOS | 9 | 0 | 24 | 0 |


In [69]:
f.equals(tg)

True

# Retrosheet Pitching vs Retrosheet Team Pitching Per Game

In [70]:
cols = ['wp', 'bk', 'er']
cols

['wp', 'bk', 'er']

In [71]:
p = retro_pitching[cols].agg('sum')
p

wp     64799
bk      9919
er    841197
dtype: int64

In [72]:
tg = retro_team_game[cols].agg('sum')
tg

wp     64799
bk      9919
er    841197
dtype: int64

In [73]:
p.equals(tg)

True

## Summary
Real-world data is often messy and inaccurate.  Cross-checking the data in different ways may seem like a lot of work, but it needs to be done before any data analysis; otherwise the results of that analysis could be meaningless.