# Processing Baby Names Data by Year

This example uses the baby names by year data from the Social Security Administration, available at https://www.ssa.gov/oact/babynames/limits.html

The national data file was downloaded and extracted into a babynames directory. That directory contains files named yob####.txt, where #### is the year of birth. Each file is a comma-delimited list of name,sex,occurrences with no header row. Rows are ordered by number of occurrences, with ties sorted alphabetically.
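
To make the file layout concrete, here is a small sketch that parses a few lines in the same name,sex,occurrences layout from an in-memory string. The sample rows are made-up placeholders, not actual SSA counts, and the column names are the ones chosen later in this notebook.

```python
import io

import pandas as pd

# Made-up rows in the yob####.txt layout (no header line); not real SSA data.
sample = io.StringIO("Alice,F,120\nBeth,F,95\nAdam,M,110\n")

# The files carry no header row, so column names are supplied explicitly.
sample_df = pd.read_csv(sample, names=["name", "sex", "births"])
print(sample_df)
```

Passing `names=` tells pandas to treat the first line as data rather than as a header.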



## Outline

1. Get list of file names to process
2. Read in a file into a dataframe
3. Add the year as a column to the dataframe
4. Accumulate all files, each as a dataframe
5. Merge the dataframes into one for further operations

### Get input file listing

In [1]:
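from pathlib import Path

# Directory holding the downloaded yob####.txt files
p = Path('./babynames')
# Keep regular files whose name contains 'yob'; sort so the years come out in order
raw_list = [pfile for pfile in p.iterdir() if pfile.is_file()]
file_list = sorted(i for i in raw_list if 'yob' in i.name)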

In [2]:
file_list[:5]

[WindowsPath('babynames/yob1880.txt'),
 WindowsPath('babynames/yob1881.txt'),
 WindowsPath('babynames/yob1882.txt'),
 WindowsPath('babynames/yob1883.txt'),
 WindowsPath('babynames/yob1884.txt')]

### Iterate through files reading them in

In [43]:
import pandas as pd
import numpy as np
df_list = []
header = ['name', 'sex', 'births']

In [44]:
for nfile in file_list:
    # File names look like 'yob1880.txt'; the stem is 'yob1880', so drop the 'yob' prefix.
    year = int(nfile.stem[3:])
    # assign() adds the year column in the same step as the read.
    df_list.append(pd.read_csv(nfile, names=header).assign(yob=year))
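
The year comes from the file name, so a stricter parser can help catch stray files. The following is a minimal sketch, assuming every input is named exactly yob####.txt; `YOB_PATTERN` and `year_from_name` are hypothetical names introduced here for illustration. Note that `str.strip` removes any of the listed characters from both ends rather than a literal prefix or suffix, which is why a pattern match is the safer tool for this job.

```python
import re
from pathlib import Path

# Hypothetical pattern: 'yob', exactly four digits, then '.txt'.
YOB_PATTERN = re.compile(r"yob(\d{4})\.txt")


def year_from_name(path: Path) -> int:
    """Return the birth year encoded in a file name such as 'yob1880.txt'."""
    match = YOB_PATTERN.fullmatch(path.name)
    if match is None:
        raise ValueError(f"unexpected file name: {path.name}")
    return int(match.group(1))


print(year_from_name(Path("babynames/yob1880.txt")))
```

Raising on an unexpected name makes a misplaced file fail loudly instead of producing a wrong year.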

In [45]:
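# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's own index
df = pd.concat(df_list, ignore_index=True)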
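
`pd.concat` stacks the per-year frames row-wise. A small sketch of how the row index behaves, using two made-up frames in place of real files (the names and counts are placeholders):

```python
import pandas as pd

# Two tiny made-up frames standing in for per-year files.
a = pd.DataFrame({"name": ["Alice", "Adam"], "births": [10, 8]}).assign(yob=1900)
b = pd.DataFrame({"name": ["Alice", "Beth"], "births": [12, 7]}).assign(yob=1901)

stacked = pd.concat([a, b])                        # each frame keeps its own labels
renumbered = pd.concat([a, b], ignore_index=True)  # labels are rebuilt from 0

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated index labels do not affect the group-by totals computed later, but they matter for label-based lookups such as `df.loc[0]`, which would return one row per file.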

In [46]:
boys = df[df['sex']=='M']

In [47]:
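# numeric_only=True keeps the non-numeric 'sex' column out of the aggregated result
boy_names = boys.groupby('name').sum(numeric_only=True).sort_values('births', ascending=False)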

In [48]:
boy_names = boy_names.reset_index().drop('yob', axis=1)
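
An alternative sketch of the same aggregation: selecting the births column before summing avoids adding up the yob values and dropping them afterwards. The rows below are made up for illustration and are not SSA counts.

```python
import pandas as pd

# Made-up rows; not real SSA counts.
toy = pd.DataFrame({
    "name": ["Adam", "Ben", "Adam", "Ben"],
    "sex": ["M", "M", "M", "M"],
    "births": [10, 4, 12, 5],
    "yob": [1900, 1900, 1901, 1901],
})

# Sum births per name, largest total first, then turn 'name' back into a column.
totals = (
    toy.groupby("name")["births"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)
print(totals)
```

The result has only the name and births columns, which matches the shape of the table this notebook ends with.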

In [55]:
boy_names.head()

|   | name | births |
|---|------|--------|
| 0 | James | 5164280 |
| 1 | John | 5124817 |
| 2 | Robert | 4820129 |
| 3 | Michael | 4362731 |
| 4 | William | 4117369 |
