# Exploring Baby Names in the United States

You can download this data from the Social Security Administration [here](https://www.ssa.gov/OACT/babynames/limits.html).  There are additional data files by state and territory that could be combined or analyzed on their own.  After downloading and unzipping the files, we begin by exploring the results with bash commands, run from the notebook using the `%%bash` cell magic.
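If bash is not available, the unzip-and-list step can also be done from Python. The following is only a sketch under stated assumptions: the archive name `names.zip`, the temporary working directory, and the two tiny files with made-up rows are all hypothetical stand-ins for the real download.

```python
import tempfile
import zipfile
from pathlib import Path

# Hypothetical stand-in for the downloaded archive: a tiny zip holding two
# files in the same "name,sex,births" layout. The rows are made up.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "names.zip"  # assumed file name
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("yob1880.txt", "Ann,F,10\nBob,M,20\n")
    zf.writestr("yob1881.txt", "Ann,F,30\nBob,M,40\n")

# Unzip into a names/ directory, the layout the rest of the notebook expects.
target = workdir / "names"
with zipfile.ZipFile(archive) as zf:
    zf.extractall(target)

# Rough equivalent of `ls names/*.txt | head -n 10`.
extracted = sorted(p.name for p in target.glob("*.txt"))
print(extracted[:10])
```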


In [None]:
%%bash
ls names/*.txt | head -n 10

In [None]:
%%bash
head -n 10 names/yob1880.txt

In [None]:
import pandas as pd

In [None]:
names1880 = pd.read_csv('names/yob1880.txt', names = ['name', 'sex', 'births'])

In [None]:
names1880.head()

In [None]:
names1880.groupby('sex')['births'].sum()

In [None]:
%%bash
ls names/ | tail -n 5

In [None]:
pieces = []
columns = ['name', 'sex', 'births']

In [None]:
years = range(1880, 2018)
pieces = []
for year in years:
    # Build the path for this year by substituting the value of `year` as a
    # digit (%d). Read that file with read_csv, tag each row with its year,
    # and append the resulting DataFrame to our list each time through the loop.
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names = columns)
    frame['year'] = year
    pieces.append(frame)

In [None]:
names = pd.concat(pieces, ignore_index = True)

In [None]:
names.head()

In [None]:
import numpy as np

### `pivot_table`

Just as in Microsoft Excel and Google Sheets, we have a `pivot_table` method in pandas.  It takes the distinct values of a given column and creates a DataFrame with those values as its columns.  For example, we can create a table that pivots on the `sex` column and sums the `births` values.  This in effect gives us the total births per year by sex.
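To see what the pivot does before running it on the full data, here is a minimal sketch on a toy DataFrame. The names and counts are made up for illustration; only the column layout mirrors the notebook's data.

```python
import pandas as pd

# Made-up rows in the same layout as the baby names data.
toy = pd.DataFrame({
    "name":   ["Ann", "Eve", "Bob", "Ann", "Bob"],
    "sex":    ["F",   "F",   "M",   "F",   "M"],
    "births": [10,    5,     20,    30,    40],
    "year":   [2000,  2000,  2000,  2001,  2001],
})

# Each distinct value of `sex` becomes a column, each year becomes a row,
# and the births that share a (year, sex) pair are summed.
totals = toy.pivot_table(values="births", index="year", columns="sex", aggfunc="sum")
print(totals)
```

The two female rows for 2000 collapse into a single cell holding their sum, which is the same aggregation applied to the full data.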

In [None]:
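# Passing the aggregation by name avoids the deprecation warning that recent pandas raises for np.sum.
total_births = names.pivot_table(values = 'births', index = 'year', columns = 'sex', aggfunc = 'sum')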

In [None]:
total_births.head()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
total_births.plot(title = 'Total Births by Sex and Year', figsize = (13, 6))

In [None]:
def get_top1000(group):
    return group.sort_values(by = 'births', ascending = False)[:1000]

In [None]:
grouped = names.groupby(['year', 'sex'])

In [None]:
top1000 = grouped.apply(get_top1000)
top1000.index = np.arange(len(top1000))

In [None]:
top1000.head()

In [None]:
boys = top1000[top1000.sex == 'M']

In [None]:
girls = top1000[top1000.sex == 'F']

In [None]:
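# One column per name, one row per year; pass the aggregation by name rather than the builtin sum.
total_births = top1000.pivot_table('births', index = 'year', columns = 'name', aggfunc = 'sum')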

In [None]:
total_births.info()

In [None]:
total_births.head()

In [None]:
subset = total_births[['Jacob', 'Erika', 'Valentino', 'Michael', 'Tina', 'Vincent']]

In [None]:
subset.plot(subplots = True, figsize = (12, 10));

### Diversity of Names

We saw that the number of births was fairly steady; however, the names we plotted seem to be falling out of favor.  This may be because new ones have overtaken them; Valentino has seen a rise in popularity only recently.  It may also be that the diversity of names is increasing.  One way to explore this is to add a column giving each name's proportion of the births within its year and sex, and then explore where a single name takes a high or low share.
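One possible way to put a number on diversity, which the notebook itself does not do, is to count how many of the most popular names are needed to cover half of all births in a given year and sex. The sketch below uses made-up data; the helper name `names_to_reach` and the 0.5 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up data: 1900 is dominated by one name, 2000 is more evenly spread.
toy = pd.DataFrame({
    "name":   ["A", "B", "C", "A", "B", "C", "D"],
    "sex":    ["F"] * 7,
    "births": [80, 15, 5, 40, 30, 20, 10],
    "year":   [1900] * 3 + [2000] * 4,
})

# Share of births each name holds within its (year, sex) group.
group_totals = toy.groupby(["year", "sex"])["births"].transform("sum")
toy["prop"] = toy["births"] / group_totals

def names_to_reach(props, q=0.5):
    """Count the top names needed for their cumulative share to reach q."""
    cumulative = props.sort_values(ascending=False).cumsum().to_numpy()
    # searchsorted returns the first position whose cumulative share is >= q;
    # add 1 to turn the zero-based position into a count of names.
    return int(cumulative.searchsorted(q)) + 1

diversity = toy.groupby(["year", "sex"])["prop"].apply(names_to_reach)
print(diversity)
```

A larger count means the births are spread over more names, so a rising value over the years would be consistent with increasing diversity.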

In [None]:
def add_prop(group):
    births = group.births.astype(float)
    group['prop'] = births/births.sum()
    return group

In [None]:
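# group_keys=False keeps the original row index, so 'year' and 'sex' are not
# also added as index levels (which would make the later groupby ambiguous).
names = names.groupby(['year', 'sex'], group_keys = False).apply(add_prop)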

In [None]:
names.head()

In [None]:
def get_top(group):
    return group.sort_values(by = 'prop', ascending = False)[:1]

In [None]:
names.columns

In [None]:
top_prop = names.groupby(['year', 'sex']).apply(get_top)

In [None]:
type(top_prop)

In [None]:
top_prop.nlargest(10, 'prop')

In [None]:
top_prop.nsmallest(10, 'prop')