## Baseball-specific analysis

D) List the 20 most dangerous pitchers derived from their career total statistics. Specifically, those with the highest career rates of hitting batters. (Join People & Pitching tables. Use the formula total “Batters Hit By Pitch” / total “Outs Pitched”)

In [1]:
import pandas as pd
import numpy as np
import doctest

Read the data files with the pitching statistics and the player names.

In [2]:
pitching = pd.read_csv('baseballdatabank-2019.2/baseballdatabank-2019.2/core/Pitching.csv')

In [3]:
people = pd.read_csv('baseballdatabank-2019.2/baseballdatabank-2019.2/core/People.csv')

For each player, compute the career totals of outs pitched (`IPouts`) and batters hit (`HBP`). Then add a column holding the rate at which batters are hit.

In [4]:
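# Career totals per pitcher; string aggregator names avoid the pandas warning about builtin callables
pitches_by_playerid = pitching.groupby('playerID').agg({'HBP': 'sum', 'IPouts': 'sum'})
# Batters hit per out pitched
pitches_by_playerid['batter_hit_freq'] = pitches_by_playerid.HBP / pitches_by_playerid.IPouts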
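
As a small illustration of the aggregation above, the following sketch applies the same group-and-divide step to a made-up table. The player IDs and numbers are invented for the example and are not taken from Pitching.csv. It also shows why zero `IPouts` needs care: the division does not raise an error, it yields `inf` or `NaN`.

```python
import numpy as np
import pandas as pd

# Toy season-level rows; IDs and values are invented for illustration.
toy_pitching = pd.DataFrame({
    'playerID': ['a01', 'a01', 'b01', 'c01', 'd01'],
    'HBP':      [1, 2, 0, 1, 0],
    'IPouts':   [30, 60, 45, 0, 0],
})

# Career totals per player, then hit batters per out pitched.
toy_totals = toy_pitching.groupby('playerID')[['HBP', 'IPouts']].sum()
toy_totals['batter_hit_freq'] = toy_totals['HBP'] / toy_totals['IPouts']

# c01 (1 hit, 0 outs) becomes inf and d01 (0 hits, 0 outs) becomes NaN.
print(toy_totals)
```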

The top 20 most dangerous pitchers can be identified by sorting on this frequency.

In [5]:

def top_n(df:pd.DataFrame, column:str, n:int=20):
    """Return the top n rows of the dataframe df when sorted by column.
    
    df: dataframe whose data needs to be sorted.
    column: column on which the rows are to be ranked
    n: number of rows to be extracted
    
    >>> df = pd.DataFrame([['a',1], ['b', 2],['c', 3],['d', 4]], columns=['alpha', 'num'])
    >>> out = top_n(df, 'num', 2)
    >>> list(out.alpha.values)
    ['d', 'c']
    >>> list(out.num.values)
    [4, 3]
    """
    return df.sort_values(by=column, ascending=False).head(n)
doctest.testmod()

TestResults(failed=0, attempted=4)

We ignore players with zero `IPouts`, because dividing by zero gives an infinite or undefined hit frequency that would distort the ranking.

In [6]:
top_20_pitchers = top_n(pitches_by_playerid.loc[pitches_by_playerid.IPouts>0], 'batter_hit_freq', 20)

In [7]:
top_20_pitchers

playerID,HBP,IPouts,batter_hit_freq
sborzja01,2.0,2,1.0
cathete01,1.0,1,1.0
bleicje01,1.0,1,1.0
brownpe01,1.0,1,1.0
wilshte01,1.0,1,1.0
youngjb01,3.0,6,0.5
osikke01,3.0,6,0.5
moorety01,1.0,2,0.5
jonesga02,1.0,2,0.5
craigge01,2.0,5,0.4
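
The ranking above is led by players who recorded only a handful of outs. If one wanted to restrict it to pitchers with a meaningful workload, a minimum-outs threshold is one option. The sketch below is an assumption, not part of the assignment: the helper name `top_n_with_min_outs`, the parameter `min_outs`, and the value 300 are all hypothetical, and the data is invented.

```python
import pandas as pd

def top_n_with_min_outs(totals: pd.DataFrame, n: int = 20, min_outs: int = 1) -> pd.DataFrame:
    """Rank pitchers by batter_hit_freq, keeping only those with at least min_outs outs."""
    eligible = totals.loc[totals['IPouts'] >= min_outs]
    return eligible.sort_values(by='batter_hit_freq', ascending=False).head(n)

# Invented career totals, indexed by playerID as in the notebook.
toy_totals = pd.DataFrame(
    {'HBP': [1, 3, 40, 10], 'IPouts': [1, 6, 3000, 1500]},
    index=pd.Index(['p1', 'p2', 'p3', 'p4'], name='playerID'),
)
toy_totals['batter_hit_freq'] = toy_totals['HBP'] / toy_totals['IPouts']

# With the default threshold the tiny samples lead the ranking.
print(top_n_with_min_outs(toy_totals, n=2))
# With a hypothetical 300-out minimum only the high-workload pitchers remain.
print(top_n_with_min_outs(toy_totals, n=2, min_outs=300))
```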


Join the `top_20_pitchers` dataframe with the `people` dataframe to attach names to the most dangerous pitchers.

In [8]:
top_20_pitchers.merge(people, on='playerID')[['nameFirst', 'nameLast', 'HBP', 'IPouts', 'batter_hit_freq']]

,nameFirst,nameLast,HBP,IPouts,batter_hit_freq
0,Jay,Sborz,2.0,2,1.0
1,Ted,Cather,1.0,1,1.0
2,Jeremy,Bleich,1.0,1,1.0
3,Pete,Browning,1.0,1,1.0
4,Terry,Wilshusen,1.0,1,1.0
5,J. B.,Young,3.0,6,0.5
6,Keith,Osik,3.0,6,0.5
7,Tyler,Moore,1.0,2,0.5
8,Garrett,Jones,1.0,2,0.5
9,George,Craig,2.0,5,0.4
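
The join can be made a little more defensive. The sketch below, on invented data, resets the index so that `playerID` is an ordinary column, uses a left join so every ranked pitcher is kept in ranking order, and passes `validate='one_to_one'` so that pandas raises an error if a `playerID` is duplicated on either side. These choices are suggestions, not what the notebook cell does.

```python
import pandas as pd

# Invented ranking, indexed by playerID like top_20_pitchers.
toy_top = pd.DataFrame(
    {'batter_hit_freq': [1.0, 0.5]},
    index=pd.Index(['x01', 'y01'], name='playerID'),
)
# Invented people table with one extra, unranked player.
toy_people = pd.DataFrame({
    'playerID': ['y01', 'x01', 'z01'],
    'nameFirst': ['Yan', 'Xavi', 'Zed'],
    'nameLast': ['Why', 'Ex', 'Zee'],
})

named = (toy_top.reset_index()
                .merge(toy_people, on='playerID', how='left', validate='one_to_one'))
print(named[['nameFirst', 'nameLast', 'batter_hit_freq']])
```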


# Part 2: Cross-database integration analysis

List all player given names that occur at a statistically higher frequency among baseball players than among males in the general population for their year of birth. Sort the results by the proportion of commonality.

Read the people file to get the given names and birth years of the players.

In [9]:
people = pd.read_csv('baseballdatabank-2019.2/baseballdatabank-2019.2/core/People.csv')

Keep only the required columns.

In [10]:
names = people[['playerID', 'birthYear', 'nameGiven']].copy()

We use the `nameGiven` column to extract the real first name. The `str` accessor vectorizes the split operation over the whole column.

In [11]:
names.loc[:,'first_name'] = names.nameGiven.str.split(' ', expand=True).loc[:,0]
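
To see what the split does on edge cases, here is a small sketch on invented names. It uses `.str.split().str[0]`, an equivalent way to take the first whitespace-separated token. Missing values stay missing, and initials such as `J.` are kept as they are.

```python
import numpy as np
import pandas as pd

# Invented examples: two given names, one given name, a missing value, initials.
toy_names = pd.DataFrame({'nameGiven': ['Henry Louis', 'Jay', np.nan, 'J. B.']})

# .str.split() splits on whitespace; .str[0] takes the first token of each list.
toy_names['first_name'] = toy_names['nameGiven'].str.split().str[0]
print(toy_names)
```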

We drop any rows with a missing `birthYear` or `nameGiven`.

In [12]:
names.dropna(how = 'any', subset = ['birthYear', 'nameGiven'], inplace=True)

With the NaNs dropped, `birthYear` can now be converted to integers.

In [13]:
names.birthYear = names.birthYear.astype(int)

To obtain the yearly frequency of a given first name amongst baseball players, we first obtain how many players born in a given year had a particular first name.

In [14]:
player_yearly_name_count = (names.groupby(['first_name', 'birthYear'])
                              .first_name
                              .count()
                              .rename('name_count')
                              .reset_index()
                             )


We also need how many players were born in any given year.

In [15]:
players_born_per_year = (player_yearly_name_count
                        .groupby('birthYear')
                        .name_count
                        .sum()
                        .to_frame()
                        .rename(columns={'name_count':'nbirths'})
                        .reset_index()
                        )

We can now compute the yearly frequency of each first name among players and store it in the `player_name_freq` column.

In [16]:
player_name_stats = player_yearly_name_count.merge(players_born_per_year, on='birthYear')

player_name_stats.loc[:,'player_name_freq'] = (player_name_stats.name_count
                                               /player_name_stats.nbirths)
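
The count, total, and divide steps above can also be written more compactly. The sketch below, on invented data, uses `groupby(...).transform('sum')` to broadcast each year's total back onto its rows, which avoids building a separate per-year table and merging it. It is an alternative formulation and should give the same frequencies.

```python
import pandas as pd

# Invented players: three born in 1885, two born in 1886.
toy = pd.DataFrame({
    'first_name': ['John', 'John', 'William', 'John', 'Robert'],
    'birthYear':  [1885, 1885, 1885, 1886, 1886],
})

# Number of players per (year, name).
counts = (toy.groupby(['birthYear', 'first_name'])
             .size()
             .rename('name_count')
             .reset_index())

# Total players per year, aligned row by row with counts.
counts['nbirths'] = counts.groupby('birthYear')['name_count'].transform('sum')
counts['player_name_freq'] = counts['name_count'] / counts['nbirths']
print(counts)
```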

We write some helper functions to process the Social Security Administration (SSA) baby-name data files, which the code refers to as "ssn" data.

In [17]:
def process_name_file(year:int, gender:str='M'):
    """Obtain frequency of names of a given gender for a given year.
    
    >>> df = process_name_file(1880)
    >>> df.shape
    (1058, 3)
    >>> '{:.1f}'.format(df.pop_name_freq.sum())
    '1.0'
    """
    file = 'names/yob{}.txt'.format(year)
    df = pd.read_csv(file, header=None, names=['Name', 'Sex', 'noccur'])
    df = df.loc[df.Sex == gender, ['Name', 'noccur']].copy()
    df.loc[:,'pop_name_freq']=df.noccur/df.noccur.sum()
    df.loc[:,'birthYear'] = year
    return df.drop(['noccur'], axis=1)    

doctest.testmod()

TestResults(failed=0, attempted=7)
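
The helper reads header-less rows of name, sex, and count. The following sketch runs the same filter-and-normalize steps on an in-memory sample, so it can be tried without the data files. The sample rows are invented and only mimic the three-column layout that the helper reads.

```python
import io
import pandas as pd

# Invented sample in the three-column, header-less layout: name, sex, count.
sample = io.StringIO("Mary,F,70\nAnna,F,30\nJohn,M,60\nWilliam,M,30\nJames,M,10\n")

df = pd.read_csv(sample, header=None, names=['Name', 'Sex', 'noccur'])

# Keep male names only and normalize counts so the frequencies sum to 1.
males = df.loc[df['Sex'] == 'M', ['Name', 'noccur']].copy()
males['pop_name_freq'] = males['noccur'] / males['noccur'].sum()
print(males)
```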

In [18]:
def process_ssn_db(min_year:int, max_year:int):
    """Obtain population name statistics from ssn data files for given year range.
    
    Wrapper that loops process_name_file around required year range.
    """
    names_accum = []
    for year in range(min_year, max_year+1):        
        df = process_name_file(year, gender='M')
        names_accum.append(df)
    return pd.concat(names_accum, axis=0, ignore_index=True)

Use the helper functions to process the SSA data.

In [19]:
min_year, max_year = names.birthYear.min(), names.birthYear.max()
# The name files read here start at 1880, so earlier birth years cannot be matched
name_frequencies = process_ssn_db(max(min_year, 1880), max_year)

Merge the player name stats with the population name frequencies.

In [20]:
merged = player_name_stats.merge(name_frequencies,
                           left_on=['birthYear','first_name'],
                           right_on=['birthYear', 'Name'])

We are interested in names that are more common among baseball players than in the general population for a given birth year. So we filter the merged table and present the results, sorted by player name frequency.

In [21]:
more_frequent_player_names = (merged
                              .loc[merged['player_name_freq'] > merged['pop_name_freq']]
                              .sort_values(by=['player_name_freq', 'pop_name_freq'],
                                           ascending=False)
                              [['Name', 'birthYear', 'player_name_freq', 'pop_name_freq']]
                              .reset_index(drop=True)
                             )
more_frequent_player_names

,Name,birthYear,player_name_freq,pop_name_freq
0,Juan,1998,1.000000,0.003852
1,John,1885,0.119048,0.081225
2,Robert,1925,0.118280,0.054569
3,Michael,1961,0.114504,0.040950
4,John,1882,0.112245,0.084065
5,Robert,1935,0.110000,0.054292
6,William,1886,0.108696,0.074487
7,Robert,1930,0.103448,0.056638
8,James,1881,0.103448,0.054009
9,William,1882,0.102041,0.081787
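
The filter above compares raw frequencies. One possible way to judge whether a difference is statistically meaningful is a one-sided exact binomial test: given `n` players born in a year and a population frequency `p`, how likely is it to see at least `k` players with that name by chance? The sketch below is an assumption about how this could be done, not part of the notebook. The function name `binom_upper_tail` and the counts are hypothetical.

```python
from math import comb

def binom_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical case 1: 3 of 20 players (15%) bear a name held by 8% of the population.
small_sample = binom_upper_tail(3, 20, 0.08)
# Hypothetical case 2: 12 of 100 players (12%) bear a name held by 5% of the population.
large_sample = binom_upper_tail(12, 100, 0.05)

print(small_sample, large_sample)
```

A small tail probability suggests the name is over-represented among players beyond what chance would explain. In the first case the player frequency is higher than the population frequency, yet the tail probability stays well above common significance thresholds.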
