In [None]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Homework 2: Analyzing US baby name trends

The US Social Security Administration (SSA) has made available data on the frequency of baby names from 1880 through 2018 (at the time of this writing).
The raw data can be obtained from [the SSA webpage](https://www.ssa.gov/oact/babynames/limits.html) (there is one file per year).

**Part 0:** Download the [National Data](https://www.ssa.gov/oact/babynames/names.zip) file *names.zip* and unzip it.

**Part 1:** Assemble all of the data into a single DataFrame and add a *year* field.
You can do this using [pandas.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html).

In [None]:
years = list(range(1880,2019))

In [None]:
df_list = []
for year in years:
    # load the dataset into a dataframe
    # forward slashes work on every OS and avoid the invalid '\y' escape sequence
    df = pd.read_csv(f'names/yob{year}.txt', header=None, names=['name','sex','births'])
    # add year column
    df['year'] = year
    # put dataframe in df_list
    df_list.append(df)
# ignore_index=True gives the combined frame a unique 0..n-1 index
names = pd.concat(df_list, ignore_index=True)
names.head()

**Part 2:** Plot the total births by sex and year

In [None]:
# use a pivot table
names.pivot_table('births', columns='sex',index='year',aggfunc='sum').plot(figsize=(12,5))

**Part 3:** Plot the number of babies given a particular name (your own, or another name) by year.

In [None]:
my_name = 'Javier'
names[(names.name==my_name)&(names.sex=='M')].set_index('year').births.plot(figsize=(12,5))

**Part 4:** Insert a column 'prop' with the relative frequency of each name in each of the years.

In [None]:
# relative frequency of each name within its year/sex group;
# transform('sum') broadcasts each group's total back onto its rows
group_totals = names.groupby(['year','sex'])['births'].transform('sum')
names['prop'] = names['births'] / group_totals
names.head()

**Part 5**: Create a DataFrame 'top1000_names' that contains the top 1000 names for each sex/year combination.
You will use this top 1000 dataset in the following investigations into the data.
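
One possible way to select the top rows within each group is to sort by `births` and then take the first rows of each year/sex group. The sketch below is only an illustration on a made-up toy table (the names and counts are invented, and `top_n` is a hypothetical helper, not part of pandas); with the real data you would use `n=1000`.

```python
import pandas as pd

# Toy stand-in for the `names` DataFrame (invented counts, illustration only)
toy = pd.DataFrame({
    'name':   ['Ann', 'Bea', 'Cat', 'Al', 'Bo', 'Cy'],
    'sex':    ['F', 'F', 'F', 'M', 'M', 'M'],
    'births': [30, 10, 20, 5, 50, 25],
    'year':   [2000] * 6,
})

def top_n(df, n):
    """Return the n rows with the most births within each year/sex group."""
    ranked = df.sort_values('births', ascending=False)
    # head(n) on a groupby keeps the first n rows of every group
    top = ranked.groupby(['year', 'sex']).head(n)
    # restore a readable order: by year, then sex, then descending births
    top = top.sort_values(['year', 'sex', 'births'], ascending=[True, True, False])
    return top.reset_index(drop=True)

top2 = top_n(toy, 2)
```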

**Part 6**: Plot the number of Johns, Harrys, Marys, and Marilyns by year.

Looking at your plots, you might conclude that these names have fallen out of favor with the American population.
But the story is more complicated than that, as you will explore in the next part.

## Measuring the increase in naming diversity

One explanation for the decline seen in these plots is that fewer parents are choosing common names for their children.
One measure of naming diversity is the proportion of births represented by the top 1000 most popular names.
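
As a rough illustration of this measure, the sketch below computes the share of births covered by the most popular names in a made-up toy table (invented names and counts, and a top 2 instead of a top 1000). It assumes a `prop` column defined as in Part 4; summing `prop` over the top names in each year/sex group gives the coverage.

```python
import pandas as pd

# Toy data with invented counts, two years of female names
toy = pd.DataFrame({
    'name':   ['Ann', 'Bea', 'Cat', 'Ann', 'Bea', 'Cat'],
    'sex':    ['F'] * 6,
    'births': [60, 30, 10, 40, 35, 25],
    'year':   [1900] * 3 + [2000] * 3,
})

# share of each name within its year/sex group
toy['prop'] = toy['births'] / toy.groupby(['year', 'sex'])['births'].transform('sum')

# keep the 2 most popular names per year/sex (stand-in for the top 1000)
top = toy.sort_values('births', ascending=False).groupby(['year', 'sex']).head(2)

# total share of births covered by those top names, one row per year
coverage = top.pivot_table('prop', index='year', columns='sex', aggfunc='sum')
```

In this toy example the coverage drops between the two years, which is the pattern a rise in naming diversity would produce.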

**Part 7**: Plot the proportion of the top 1000 names by year and sex

## 10 most popular 2017 names through the ages

**Part 8**: Find the 10 most popular female names in 2017

**Part 9**: Plot the proportions of the 10 most popular female names in 2017 by year

## Similarity between decades

Here, you will explore the similarity between the set of names given in one particular year and the set of names given 10 years previously.

The **Jaccard similarity** between sets A and B is the number of
elements in both A and B relative to the number of elements in either A or
B. 
If we let |A| denote the number of elements in the set A, then the Jaccard
similarity is

$$
J(A,B)=\frac{|A \cap B|}{|A\cup B|}
$$
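
The definition maps directly onto Python's set operators. Below is a minimal sketch on placeholder sets; `jaccard` is a hypothetical helper name, and returning 1.0 for two empty sets is a convention assumed here, since the formula itself is undefined in that case.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # assumption: treat two empty sets as identical
        return 1.0
    return len(a & b) / len(union)

# 2 shared elements ('y', 'z') out of 4 distinct elements overall
example = jaccard({'x', 'y', 'z'}, {'y', 'z', 'w'})
```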

**Part 10**: Find the Jaccard similarity between the following two sets

In [None]:
set1 = {'John','Daniel','Drogo'}
set2 = {'Robert', 'John'}

**Part 11**: Compute the Jaccard similarity between the set of male names given in 2017 and the set of male names given in 2007

**Part 12**: Plot, by year, the Jaccard similarity between the set of male names given in that year and the set of male names given 10 years previously

## Extra: The last letter revolution

It has been argued (see [here](https://www.babynamewizard.com/archives/2007/7/where-all-boys-end-up-nowadays), for example) that the distribution of boy names by final letter has changed significantly over the last 100 years.

**Extra part 1:** Extract the last letter from the "name" column
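
One way to do this is with the pandas string accessor, which supports indexing into each string. The sketch below uses a made-up toy table, and the column name `last_letter` is an arbitrary choice.

```python
import pandas as pd

# Toy table with invented counts
toy = pd.DataFrame({
    'name':   ['John', 'Mary', 'Jayden'],
    'sex':    ['M', 'F', 'M'],
    'births': [10, 20, 30],
})

# .str[-1] takes the final character of every name
toy['last_letter'] = toy['name'].str[-1]
```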

**Extra part 2**: Plot the proportion of male names by the last letter for the years 1910, 1960, and 2010

**Extra part 3**: Plot the proportions of male names ending in "e", "n", "d", "s" and "y" by year.