# Case Study - Babies' names in the US from 1880 to 2015

## Learning Objectives:
1. Perform group-wise operations using Pandas
2. Become familiar with Pandas’s groupby objects
3. Practice aggregate, filter and apply functions in Pandas  

In [None]:
import numpy as np
import pandas as pd

<i><b>Background</b></i>: The dataset, `babynames.csv`, records the male and female baby names given in the US from 1880 to 2015, together with each name's count ("n") and proportion ("prop") among all the newborns in that year. We will use this dataset to practice group-wise operations using Pandas.

### Load data

In [None]:
babynames = pd.read_csv("babynames.csv")

In [None]:
babynames.head(10)

### Task 1. On Hilary

Let's focus on a particular baby name first.

In [None]:
fil_hilary = (babynames['name'] == 'Hilary')
hilary = babynames.loc[fil_hilary,:]

In [None]:
hilary.head(10)

### Task 1-1. List the number of male and female babies named Hilary for each year

In [None]:
# Sum the counts in "n"; size() would only count the rows in each group
hilary.groupby(['year', 'sex'])['n'].sum()
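To see how counting rows differs from counting babies, here is a minimal sketch on made-up rows that mimic the columns of `babynames.csv`. The numbers and the `toy` frame are assumptions for illustration only, not values from the dataset.

```python
import pandas as pd

# Made-up toy rows with the same column names as babynames.csv
toy = pd.DataFrame({
    "year": [1990, 1990, 1991, 1991],
    "sex":  ["F", "M", "F", "M"],
    "name": ["Hilary"] * 4,
    "n":    [30, 2, 40, 3],
})

# size() counts rows per group: one row per (year, sex) here
rows_per_group = toy.groupby(["year", "sex"]).size()

# Summing "n" gives the number of babies per (year, sex)
babies_per_group = toy.groupby(["year", "sex"])["n"].sum()

# unstack() moves the "sex" index level into columns for a side-by-side view
wide = babies_per_group.unstack("sex", fill_value=0)
print(wide)
```

The wide layout has one row per year and one column per sex, which makes the two series easy to compare or plot.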

### Task 2. Group-wise operations

### Task 2-1. Count the number of names by year and sex

In [None]:
babynames.groupby(['year', 'sex']).size()

### Task 2-2. Calculate the ranking of each name for each year and sex combination. Which names were most popular in 1999? (Hint: a ranking can be calculated with argsort().)

In [None]:
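# A single argsort() returns the sort order, not the rank; applying it twice gives the 0-based rank (0 = most popular)
babynames['rank'] = babynames.groupby(['year', 'sex'])['prop'].transform(lambda x: (-x).to_numpy().argsort(kind='stable').argsort())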
babynames.head(20)

In [None]:
babynames.groupby(['year', 'sex']).get_group((1999, 'F'))

In [None]:
babynames[(babynames['year'] == 1999) & (babynames['rank'] == 0)]
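The ranking step in this task can be checked on a small example. The sketch below uses made-up, deliberately unsorted proportions to show that a single `argsort()` returns the sort order, while a second `argsort()` turns that order into a 0-based rank. It also shows the built-in `rank()` method as one possible alternative. The names and values in `toy` are assumptions for illustration.

```python
import pandas as pd

# Made-up proportions, deliberately NOT sorted
prop = pd.Series([0.10, 0.40, 0.20, 0.30], index=["a", "b", "c", "d"])

values = (-prop).to_numpy()
order = values.argsort()   # positions that sort descending: [1, 3, 2, 0]
rank = order.argsort()     # 0-based rank of each element:   [3, 0, 2, 1]

# Built-in alternative: rank() is 1-based, so subtract 1 to match
rank_builtin = prop.rank(ascending=False, method="first").astype(int) - 1

# The same idea per group, on a made-up frame shaped like babynames
toy = pd.DataFrame({
    "year": [1999, 1999, 1999, 1999],
    "sex":  ["F", "F", "M", "M"],
    "name": ["Emily", "Hannah", "Michael", "Jacob"],
    "prop": [0.2, 0.3, 0.1, 0.4],
})
toy["rank"] = (
    toy.groupby(["year", "sex"])["prop"]
       .rank(ascending=False, method="first")
       .astype(int) - 1
)
print(toy)
```

When the rows within each group are already sorted by descending `prop`, a single `argsort()` happens to equal the rank, which is why the difference is easy to miss.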

### Task 2-3. What are the top 10 names in overall popularity (in terms of total "n"), by "sex"?

In [None]:
babynames.groupby(['name', 'sex'])[['n']].agg('sum')

In [None]:
babynames.groupby(['name', 'sex'])[['n']].agg('sum').sort_values(by='n', ascending=False).head(10)
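If the question is read as asking for a separate top list for each sex, one possible approach is to sort the totals once and then keep the first rows of each group. The sketch below assumes made-up data and uses `k = 2` instead of 10 to keep the output short.

```python
import pandas as pd

# Made-up toy rows; several rows per name stand in for several years
toy = pd.DataFrame({
    "name": ["Mary", "Mary", "Anna", "John", "John", "James", "Emma"],
    "sex":  ["F", "F", "F", "M", "M", "M", "F"],
    "n":    [10, 12, 8, 20, 15, 9, 3],
})

# Total count per (name, sex)
totals = toy.groupby(["name", "sex"])["n"].sum().reset_index()

# Sort once, then keep the first k rows of each sex
k = 2
top_k_by_sex = (
    totals.sort_values("n", ascending=False)
          .groupby("sex")
          .head(k)
)
print(top_k_by_sex)
```

Because `head(k)` preserves the row order of the sorted frame, the result stays ordered by total count across both sexes.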

### Task 2-4. What is the proportion of babies having the top 100 names for each year and sex?

In [None]:
top100 = babynames[babynames['rank'] < 100]
top100.head()

In [None]:
top100prop = top100.groupby(['year', 'sex'])[['prop']].agg('sum').reset_index()
top100prop.head(10)
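The same kind of total can also be computed without a precomputed `rank` column. The sketch below is one possible alternative on made-up data: it takes the `k` largest proportions in each group with `nlargest()` and sums them, using `k = 2` in place of 100.

```python
import pandas as pd

# Made-up proportions for two years
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "sex":  ["F"] * 6,
    "name": ["A", "B", "C", "A", "B", "C"],
    "prop": [0.5, 0.25, 0.125, 0.125, 0.5, 0.0625],
})

k = 2
top_k_share = (
    toy.groupby(["year", "sex"])["prop"]
       .nlargest(k)                        # k largest values per group
       .groupby(level=["year", "sex"])     # regroup on the group keys
       .sum()
)
print(top_k_share)
```

`nlargest()` returns a Series indexed by the group keys plus the original row label, so the second `groupby` collapses it back to one value per year and sex.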

### Task 2-5. For each name, find the year in which it was ranked highest and the rank in that year.

In [None]:
# Flag the rows where each (name, sex) group reaches its best (numerically smallest) rank
babynames['most_pop'] = babynames.groupby(['name', 'sex'])['rank'].transform(lambda x: x == x.min())

In [None]:
# Inspect a single group with get_group(), here female babies named "Mary"
babynames_gb = babynames.groupby(['name', 'sex'])
babynames_gb.get_group(("Mary", "F"))

In [None]:
# Select a comparable subset, female babies named "Anna", with a Boolean filter instead
fil = (babynames["name"] == "Anna") & (babynames["sex"] == "F")
babynames.loc[fil,:]

In [None]:
# head(1) keeps the first flagged row for each (name, sex)
babynames[babynames['most_pop']].groupby(['name', 'sex']).head(1)
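A more compact way to reach a similar result is `idxmin()`, which returns the row label of the first minimum in each group. The sketch below runs on made-up data and assumes the frame has a unique index (the default `RangeIndex` here), since the labels are passed back to `.loc`.

```python
import pandas as pd

# Made-up ranks; Emma ties at rank 1 in 2001 and 2002
toy = pd.DataFrame({
    "year": [2000, 2001, 2002, 2000, 2003],
    "sex":  ["F", "F", "F", "M", "M"],
    "name": ["Emma", "Emma", "Emma", "Noah", "Noah"],
    "rank": [5, 1, 1, 7, 3],
})

# Row label of the first occurrence of the smallest rank per (name, sex)
best_rows = toy.groupby(["name", "sex"])["rank"].idxmin()

# Look those rows up to get the year and the rank together
best = toy.loc[best_rows, ["name", "sex", "year", "rank"]]
print(best)
```

When a name ties its best rank in several years, `idxmin()` keeps the first matching row, so the result depends on the row order of the input.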

### Task 2-6. Which name has been in the top 10 most often?

In [None]:
top10 = babynames[babynames['rank'] < 10]
top10.head()

In [None]:
top10_count = top10.groupby(['name', 'sex']).size().reset_index()
top10_count.columns = ['name', 'sex', 'top10_count']
top10_count.sort_values(by = 'top10_count', ascending = False).head(30)
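The count-and-rename steps can also be chained. The sketch below uses a made-up stand-in for `top10` and names the count column directly through the `name` argument of `reset_index()`.

```python
import pandas as pd

# Made-up stand-in for the top10 frame: one row per top-10 appearance
top10 = pd.DataFrame({
    "name": ["Mary", "Mary", "Mary", "John", "John", "Anna"],
    "sex":  ["F", "F", "F", "M", "M", "F"],
})

top10_count = (
    top10.groupby(["name", "sex"])
         .size()                               # appearances per (name, sex)
         .reset_index(name="top10_count")      # name the count column directly
         .sort_values("top10_count", ascending=False)
)
print(top10_count)
```

Note that this counts a name once for every year and sex combination in which it reached the top 10, so a name used for both sexes appears as two separate rows.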