# 01 - Bulletproofing your data

[Nameparser library](https://pypi.org/project/nameparser/)

In [7]:
# import libraries
import pandas as pd
from nameparser import HumanName
import numpy as np

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)

### Practice
#### FED board of directors 
##### 1. Download
- [Diversity in the FED - Brookings Institute](https://www.brookings.edu/research/diversity-within-the-federal-reserve-system/): download the file named "Biographical database by unique position" on the left-hand side
- [Documentation](https://www.brookings.edu/wp-content/uploads/2021/04/Biographical-Database-Overview.pdf): go through each column to understand what's there
- [Brookings Analysis](https://www.brookings.edu/wp-content/uploads/2021/04/Appendices-Directors-by-race-gender-and-bank.pdf)

In [3]:
# load the Excel file from the downloaded_data/ folder
df_bod = pd.read_excel('downloaded_data/Biographical-database-BoD-Unique-Positions.xlsx')

In [4]:
# check row and column lengths
df_bod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2607 entries, 0 to 2606
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Name                  2607 non-null   object 
 1   District Number       2607 non-null   int64  
 2   District Name         2607 non-null   object 
 3   Class                 2607 non-null   object 
 4   Group                 1393 non-null   float64
 5   TD1: Degree           2607 non-null   object 
 6   TD1: Major/Field      2607 non-null   object 
 7   TD1: School           2607 non-null   object 
 8   TD1: Year             2607 non-null   object 
 9   TD2: Degree           2607 non-null   object 
 10  TD2: Major/Field      2607 non-null   object 
 11  TD2: School           2607 non-null   object 
 12  TD2: Year             2606 non-null   object 
 13  City                  2606 non-null   object 
 14  State                 2607 non-null   object 
 15  Job Title            

In [5]:
# trim whitespace from strings and convert all other values to strings
df_bod = df_bod.applymap(lambda x: x.strip() if isinstance(x, str) else str(x))

In [8]:
# standardize null values
df_bod = df_bod.applymap(lambda x: np.nan if str(x) in ['-', '?', 'None', 'nan'] else x)
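
The two cleaning cells above can be tried on a small frame to see which values end up as NaN. This is a minimal sketch: the `toy` frame, the `NULL_MARKERS` set, and the `clean_cell` helper are illustrative names that are not part of the notebook, and the values are made up rather than taken from the Brookings file. It maps column by column with `Series.map` because `DataFrame.applymap` is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative, not from the Brookings file.
toy = pd.DataFrame({
    "Name": ["  Jane Doe ", "John Roe"],
    "Group": [3.0, np.nan],
    "City": ["-", "Boston"],
})

# Placeholder strings treated as missing, mirroring the list used in the notebook.
NULL_MARKERS = {"-", "?", "None", "nan"}

def clean_cell(value):
    """Strip strings, stringify everything else, and map placeholders to NaN."""
    text = value.strip() if isinstance(value, str) else str(value)
    return np.nan if text in NULL_MARKERS else text

# Apply the cell-level cleaner to every column.
cleaned = toy.apply(lambda col: col.map(clean_cell))
print(cleaned)
```

Because every non-string value is stringified first, a float NaN becomes the string 'nan' and is then mapped back to a real NaN, which is why 'nan' appears in the placeholder list.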

In [9]:
# inspect the unique values in Group (which has missing values) and Class
df_bod['Group'].unique(), df_bod['Class'].unique()

(array(['3.0', nan, '2.0', '1.0'], dtype=object),
 array(['A', 'T', 'B', 'C', 'C Chair', 'C Dep. Chair', 'C chair'],
       dtype=object))

**_According to the documentation, Class A and B directors are elected by banks in Group 1, 2, or 3. However, 10 entries have a group number but are not Class A or B. Ask about this._**

In [10]:
# according to the documentation, Class A and B directors are assigned a group. Does the data match?
len(df_bod[df_bod['Class'].isin(['A', 'B'])])

1383

In [11]:
# 1383 Class A/B rows != 1393 non-null Group values, so count rows with a group but no A/B class
len(df_bod[df_bod['Group'].notna() & (~df_bod['Class'].isin(['A', 'B']))])

10
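
A cross-tabulation is one way to see at a glance which classes carry a group number, instead of comparing two separate counts. The sketch below is an assumption about how that check could look: `toy` is a made-up stand-in for `df_bod`, and the "has group" / "no group" labels are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_bod with only the two relevant columns.
toy = pd.DataFrame({
    "Class": ["A", "B", "C", "C Chair", "A", "T"],
    "Group": ["1.0", "2.0", "3.0", np.nan, "3.0", np.nan],
})

# Label each row by whether a group number is present.
group_status = (
    toy["Group"].notna()
    .map({True: "has group", False: "no group"})
    .rename("Group status")
)

# Rows = Class, columns = group present or not.
summary = pd.crosstab(toy["Class"], group_status)
print(summary)

# Rows that contradict the documented rule: a group number without Class A or B.
unexpected = toy[toy["Group"].notna() & ~toy["Class"].isin(["A", "B"])]
print(unexpected)
```

On the real data, listing the `unexpected` rows rather than only counting them gives concrete examples to include when asking about the discrepancy.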

Each row represents one position served, not one person, so the same person can appear in several rows. Footnote 18 of the research article says there are 1,957 unique people. Does this match?

In [12]:
# find rows that are duplicated across every column
df_bod[df_bod.duplicated()]

Unnamed: 0,Name,District Number,District Name,Class,Group,TD1: Degree,TD1: Major/Field,TD1: School,TD1: Year,TD2: Degree,TD2: Major/Field,TD2: School,TD2: Year,City,State,Job Title,Organization,Sector,Race,Gender,Birth Year,Age at Start,FOMC Combined,FOMC Pre-reorg,FOMC Pre Start Year,FOMC Pre End Year,FOMC Post-reorg,FOMC Post Start Year,FOMC Post End Year,Start Year (pos.),End Year (pos.)


In [13]:
# find duplicates: count rows whose Name repeats an earlier row
len(df_bod[df_bod.duplicated(subset=['Name'])].sort_values('Name'))

651

Counting by name alone gives 2607 - 651 = 1956 people, one short of 1,957. Let's standardize the names (remove periods and make everything uppercase) and then match on Name plus Birth Year.

In [14]:
# remove periods (literal match, not a regex) and uppercase the names
df_bod['Name'] = df_bod['Name'].str.replace('.', '', regex=False).str.upper()


In [15]:
# count rows that repeat an earlier Name + Birth Year combination
len(df_bod[df_bod.duplicated(subset=['Name', 'Birth Year'])])

650

In [16]:
# total rows minus repeat rows = unique people
2607 - 650

1957
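
The subtraction above can also be expressed directly with `drop_duplicates`, which avoids hard-coding the row counts. The following is a small sketch on made-up data (the `toy` frame and its names are hypothetical) showing that the two approaches agree, and that two people who share a name but not a birth year are counted separately.

```python
import pandas as pd

# Hypothetical rows: one person written two ways, plus two different people
# who share a name but have different birth years.
toy = pd.DataFrame({
    "Name": ["Ann B. Smith", "ANN B SMITH", "Jane Doe", "Jane Doe"],
    "Birth Year": ["1904", "1904", "1950", "1962"],
})

# Same standardization as the notebook: drop literal periods, then uppercase.
toy["Name"] = toy["Name"].str.replace(".", "", regex=False).str.upper()

keys = ["Name", "Birth Year"]
n_rows = len(toy)
n_repeat = int(toy.duplicated(subset=keys).sum())   # rows repeating an earlier key
n_people = len(toy.drop_duplicates(subset=keys))    # one row kept per person

print(n_rows, n_repeat, n_people)
```

Matching on name alone would merge the two Jane Doe rows into one person, which is the kind of undercount that adding Birth Year to the key guards against.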

In [17]:
# find rows with a missing City value
df_bod[df_bod['City'].isnull()]

Unnamed: 0,Name,District Number,District Name,Class,Group,TD1: Degree,TD1: Major/Field,TD1: School,TD1: Year,TD2: Degree,TD2: Major/Field,TD2: School,TD2: Year,City,State,Job Title,Organization,Sector,Race,Gender,Birth Year,Age at Start,FOMC Combined,FOMC Pre-reorg,FOMC Pre Start Year,FOMC Pre End Year,FOMC Post-reorg,FOMC Post Start Year,FOMC Post End Year,Start Year (pos.),End Year (pos.)
838,ERWIN DAIN CANHAM,1,Boston,C Dep. Chair,,Bachelor's,,Bates College,1925,,,,,,Massachusetts,Editor,Christian Science Monitor,Publishing,W,M,1904,57,,,,,,,,1961,1962


In [18]:
df_bod[df_bod['Name'] == 'ERWIN DAIN CANHAM']

Unnamed: 0,Name,District Number,District Name,Class,Group,TD1: Degree,TD1: Major/Field,TD1: School,TD1: Year,TD2: Degree,TD2: Major/Field,TD2: School,TD2: Year,City,State,Job Title,Organization,Sector,Race,Gender,Birth Year,Age at Start,FOMC Combined,FOMC Pre-reorg,FOMC Pre Start Year,FOMC Pre End Year,FOMC Post-reorg,FOMC Post Start Year,FOMC Post End Year,Start Year (pos.),End Year (pos.)
771,ERWIN DAIN CANHAM,1,Boston,C,,Bachelor's,,Bates College,1925,,,,,Boston,Massachusetts,Editor,Christian Science Monitor,Publishing,W,M,1904,55,,,,,,,,1959,1960
838,ERWIN DAIN CANHAM,1,Boston,C Dep. Chair,,Bachelor's,,Bates College,1925,,,,,,Massachusetts,Editor,Christian Science Monitor,Publishing,W,M,1904,57,,,,,,,,1961,1962
889,ERWIN DAIN CANHAM,1,Boston,C Chair,,Bachelor's,,Bates College,1925,,,,,Boston,Massachusetts,Editor,Christian Science Monitor,Publishing,W,M,1904,59,,,,,,,,1963,1967


**Fill the missing City with Boston, since the other two rows for the same person list Boston**

In [19]:
df_bod.loc[838, 'City'] = 'Boston'
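
With a single missing value, filling by index label as above is the simplest fix. If more gaps turned up, the same reasoning (borrow the city from the person's other rows) could be generalized. The sketch below is one possible approach, not the notebook's method: `toy` and `single_city` are hypothetical, and a gap is filled only when the person's other rows agree on one city.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first person always lists Boston, the second lists two cities.
toy = pd.DataFrame({
    "Name": ["ANN B SMITH"] * 3 + ["JANE DOE"] * 3,
    "Birth Year": ["1904"] * 3 + ["1950"] * 3,
    "City": ["Boston", np.nan, "Boston", "Dallas", np.nan, "Houston"],
})

def single_city(cities):
    """Return the only known city in a group, or NaN if there is none or several."""
    known = cities.dropna().unique()
    return known[0] if len(known) == 1 else np.nan

# Broadcast each person's unambiguous city (if any) back onto their rows.
candidate = toy.groupby(["Name", "Birth Year"])["City"].transform(single_city)

# Fill only the gaps; existing values are left untouched.
toy["City"] = toy["City"].fillna(candidate)
print(toy)
```

The ambiguous case is deliberately left as NaN so that it still shows up in a later `isnull()` check and can be resolved by hand.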

In [20]:
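# the Name column is too coarse, so break each name into its parts
def format_names(row):
    # parse the name into title, first, middle, last, suffix and nickname
    name_dict = HumanName(row['Name']).as_dict()
    for key in name_dict:
        row[key] = name_dict[key]
    # first character of the middle name, or NaN when there is no middle name
    row['middle_initial'] = row['middle'][0].replace('.', '') if row['middle'] != "" else np.nan
    return row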

In [21]:
# Split "Name" column
df_bod = df_bod.apply(format_names, axis=1)

In [22]:
df_bod

Unnamed: 0,Name,District Number,District Name,Class,Group,TD1: Degree,TD1: Major/Field,TD1: School,TD1: Year,TD2: Degree,TD2: Major/Field,TD2: School,TD2: Year,City,State,Job Title,Organization,Sector,Race,Gender,Birth Year,Age at Start,FOMC Combined,FOMC Pre-reorg,FOMC Pre Start Year,FOMC Pre End Year,FOMC Post-reorg,FOMC Post Start Year,FOMC Post End Year,Start Year (pos.),End Year (pos.),title,first,middle,last,suffix,nickname,middle_initial
0,BUCKNER A MCKINNEY,11,Dallas,A,3.0,JD,Law,,,,,,,Durant,Oklahoma,Vice President & Cashier,Durant National Bank,Banking,W,M,1872,42,"President, FRB Dallas & Governor, FRB Dallas","Governor, FRB Dallas",1922,1925,"President, FRB Dallas",1931,1939,1914,1921,,BUCKNER,A,MCKINNEY,,,A
1,DAVID C WILLS,4,Cleveland,T,,,,,,,,,,Bellevue,Pennsylvania,President,Citizens National Bank of Bellevue,Banking,W,M,1872,42,"Member, Federal Reserve Board","Member, Federal Reserve Board",1920,1921,,,,1914,1919,,DAVID,C,WILLS,,,C
2,JAMES K LYNCH,12,San Francisco,A,2.0,,,,,,,,,San Francisco,California,Vice President,First National Bank of San Francisco,Banking,W,M,1857,57,"Governor, FRB San Francisco","Governor, FRB San Francisco",1917,1919,,,,1914,1916,,JAMES,K,LYNCH,,,K
3,GEORGE J SEAY,5,Richmond,B,1.0,,,,,,,,,Richmond,Virginia,Partner,Scott and Stringfellow,Banking,W,M,1862,52,"Governor, FRB Richmond","Governor, FRB Richmond",1914,1936,"President, FRB Richmond",1914,1936,1914,1914,,GEORGE,J,SEAY,,,J
4,GEORGE W NORRIS,3,Philadelphia,C,,,,,,,,,,Philadelphia,Pennsylvania,"Director of Department of Wharves, Docks and F...",City of Philadelphia,Transportation,W,M,1864,50,"Governor, FRB Philadelphia","Governor, FRB Philadelphia",1920,1936,"President, FRB Philadelphia",1920,1936,1914,1914,,GEORGE,W,NORRIS,,,W
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2602,MARGARET G LEWIS,5,Richmond,C Dep. Chair,,Master's,Business,Averett University,,,,,,Virginia,Richmond,President,HCA Capital Divison,Health Care Delivery,W,F,1955,64,,,,,,,,2019,,,MARGARET,G,LEWIS,,,G
2603,HELENE D GAYLE,7,Chicago,B,3.0,Master's,Public Health,Johns Hopkins University,,,,,,Chicago,Illinois,Chief Executive Officer & President,Chicago Community Trust,Banking,NW,F,1955,64,,,,,,,,2019,,,HELENE,D,GAYLE,,,D
2604,CLAUDIA AGUIRRE,11,Dallas,C,,Master's,Education,University of Houston,2002,,,,,Houston,Texas,Chief Executive Officer & President,BakerRipley,Consumer/Community,NW,F,1969,50,,,,,,,,2019,,,CLAUDIA,,AGUIRRE,,,
2605,ROSA M GIL,2,New York,C Dep. Chair,,,,,,,,,,New York,New York,"Chief Executive Officer, President, & Founder","Comunilife, Inc.",Nonprofit/Business Groups,NW,F,1940,79,,,,,,,,2019,,,ROSA,M,GIL,,,M


In [None]:
df_bod.to_csv('formatted_data/2021-06-28_fed-board-of-directors.csv', index=False)