# Case Study - Babies' names in the US from 1880 to 2015

## Learning Objectives:
1. Perform group-wise operations using Pandas
2. Become familiar with Pandas's groupby objects
3. Practice aggregate, filter and apply functions in Pandas  

In [1]:
import numpy as np
import pandas as pd

<i><b>Background</b></i>: The dataset, `babynames.csv`, records all male and female baby names in the US from 1880 to 2015, together with each name's count ("n") and its proportion ("prop") among all newborns in that year. We will use this dataset to practice group-wise operations using Pandas.

### Load data

In [7]:
babynames = pd.read_csv("babynames.csv")
babynames.shape

(1858689, 5)

In [74]:
babynames.head(10)

Unnamed: 0,year,sex,name,n,prop,rank,top100,top10
0,1880,F,Mary,7065,0.072384,0,Top100,1
1,1880,F,Anna,2604,0.026679,1,Top100,1
2,1880,F,Emma,2003,0.020522,2,Top100,1
3,1880,F,Elizabeth,1939,0.019866,3,Top100,1
4,1880,F,Minnie,1746,0.017889,4,Top100,1
5,1880,F,Margaret,1578,0.016167,5,Top100,1
6,1880,F,Ida,1472,0.015081,6,Top100,1
7,1880,F,Alice,1414,0.014487,7,Top100,1
8,1880,F,Bertha,1320,0.013524,8,Top100,1
9,1880,F,Sarah,1288,0.013196,9,Top100,1


### Task 1. On Hilary

Let's focus on a particular baby name first.

In [75]:
Hilary_c = babynames["name"] == "Hilary"

In [76]:
Hilary_df = babynames.loc[Hilary_c,:]

In [77]:
Hilary_df

Unnamed: 0,year,sex,name,n,prop,rank,top100,top10
5757,1882,M,Hilary,7,0.000057,828,Rest,0
7952,1883,M,Hilary,6,0.000053,876,Rest,0
17221,1887,M,Hilary,7,0.000064,783,Rest,0
27703,1891,M,Hilary,8,0.000073,730,Rest,0
42705,1896,M,Hilary,6,0.000046,1060,Rest,0
...,...,...,...,...,...,...,...,...
1694105,2011,F,Hilary,79,0.000041,2355,Rest,0
1728086,2012,F,Hilary,75,0.000039,2456,Rest,0
1761989,2013,F,Hilary,66,0.000034,2659,Rest,0
1795403,2014,F,Hilary,60,0.000031,2867,Rest,0


### Task 1-1. List the number of male and female babies named Hilary for each year

In [78]:
Hilary_df.groupby(["year","sex"])["n"].sum().reset_index()

Unnamed: 0,year,sex,n
0,1882,M,7
1,1883,M,6
2,1887,M,7
3,1891,M,8
4,1896,M,6
...,...,...,...
186,2011,F,79
187,2012,F,75
188,2013,F,66
189,2014,F,60
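
The long table above has one row per (year, sex) pair. As a minimal sketch, assuming a small made-up frame `toy` rather than the real `babynames.csv`, the same grouped sum can also be reshaped with `unstack` so that each year is one row with the two sexes side by side:

```python
import pandas as pd

# Toy data, made up for illustration (not taken from babynames.csv)
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "sex":  ["F", "M", "F", "M"],
    "name": ["Hilary"] * 4,
    "n":    [50, 5, 40, 6],
})

# Long format: one row per (year, sex), as in the task above
long_counts = toy.groupby(["year", "sex"])["n"].sum().reset_index()

# Wide format: one row per year, one column per sex.
# fill_value=0 fills (year, sex) pairs that have no rows.
wide_counts = toy.groupby(["year", "sex"])["n"].sum().unstack("sex", fill_value=0)
```

The wide form may be easier to read when comparing the two sexes year by year; the long form is usually more convenient for further grouping.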


### Task 2. Group-wise operations

### Task 2-1. Count the number of names by year and sex

In [79]:
babynames.groupby(["year","sex"]).size().reset_index()

Unnamed: 0,year,sex,0
0,1880,F,942
1,1880,M,1058
2,1881,F,938
3,1881,M,997
4,1882,F,1028
...,...,...,...
267,2013,M,14026
268,2014,F,19150
269,2014,M,14026
270,2015,F,18993


In [80]:
# alternative: use .count() on a column instead (counts non-null values only)
babynames.groupby(["year","sex"])["name"].count().reset_index()

Unnamed: 0,year,sex,name
0,1880,F,942
1,1880,M,1058
2,1881,F,938
3,1881,M,997
4,1882,F,1028
...,...,...,...
267,2013,M,14026
268,2014,F,19150
269,2014,M,14026
270,2015,F,18993
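
The two methods agree here, presumably because the `name` column has no missing values. A small sketch on made-up data (the frame `toy` and the column label `n_names` are assumptions for illustration) shows where they can differ: `size()` counts rows, while `count()` skips missing values. Passing `name=` to `reset_index` also avoids the default column label `0` seen in the first output.

```python
import pandas as pd

# Toy data with one missing name, made up for illustration
toy = pd.DataFrame({
    "year": [1880, 1880, 1880, 1881],
    "sex":  ["F", "F", "M", "F"],
    "name": ["Mary", None, "John", "Anna"],
})

# size() counts every row in the group, including rows with a missing name
sizes = toy.groupby(["year", "sex"]).size().reset_index(name="n_names")

# count() counts only the non-null values of the selected column
counts = toy.groupby(["year", "sex"])["name"].count().reset_index(name="n_names")
```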


### Task 2-2. Calculate ranking of each name for each year and sex combination. Which names were most popular in 1999? (Hint: ranking can be calculated with argsort())

In [81]:
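# Rank 0 is the most popular name within each (year, sex) group.
# Note: argsort() returns the positions that would sort the values. That matches
# the rank here only because the rows appear to be pre-sorted by popularity
# within each group; ties are broken arbitrarily.
# group_keys=False keeps the original row index so the result aligns on assignment.
babynames["rank"] = babynames.groupby(["year","sex"], group_keys=False)["prop"].apply(lambda x: (-x).argsort())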

In [82]:
babynames


Unnamed: 0,year,sex,name,n,prop,rank,top100,top10
0,1880,F,Mary,7065,0.072384,0,Top100,1
1,1880,F,Anna,2604,0.026679,1,Top100,1
2,1880,F,Emma,2003,0.020522,2,Top100,1
3,1880,F,Elizabeth,1939,0.019866,3,Top100,1
4,1880,F,Minnie,1746,0.017889,4,Top100,1
...,...,...,...,...,...,...,...,...
1858684,2015,M,Zykell,5,0.000002,12650,Rest,0
1858685,2015,M,Zyking,5,0.000002,12649,Rest,0
1858686,2015,M,Zykir,5,0.000002,12648,Rest,0
1858687,2015,M,Zyrus,5,0.000002,12646,Rest,0
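
A single `argsort()` gives the positions that sort the data, which coincides with the rank only when the values are already in sorted order. The sketch below, on made-up values, shows the general case: applying `argsort` twice, or using pandas' `rank` method, gives the rank of each element. The frame `toy` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

prop = pd.Series([0.2, 0.5, 0.1, 0.4])  # deliberately NOT sorted

# Positions that would sort the values in descending order
order = np.argsort(-prop.to_numpy())

# Applying argsort again turns the sort order into 0-based ranks
rank_double = np.argsort(order)

# The same ranks via pandas; method="first" breaks ties by order of appearance
rank_pandas = prop.rank(method="first", ascending=False).astype(int) - 1

# Group-wise version on a toy frame (values made up for illustration)
toy = pd.DataFrame({
    "year": [1999, 1999, 1999, 2000, 2000],
    "sex":  ["F"] * 5,
    "prop": [0.1, 0.3, 0.2, 0.6, 0.4],
})
toy["rank"] = (
    toy.groupby(["year", "sex"])["prop"]
       .rank(method="first", ascending=False)
       .astype(int) - 1
)
```

The group-wise `rank` call does not depend on the row order within each group, so it may be the safer choice when the input is not known to be sorted.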


In [83]:
babynames[babynames["year"] == 1999]

Unnamed: 0,year,sex,name,n,prop,rank,top100,top10
1304060,1999,F,Emily,26537,0.013638,0,Top100,1
1304061,1999,F,Hannah,21669,0.011136,1,Top100,1
1304062,1999,F,Alexis,19232,0.009884,2,Top100,1
1304063,1999,F,Sarah,19088,0.009810,3,Top100,1
1304064,1999,F,Samantha,19034,0.009782,4,Top100,1
...,...,...,...,...,...,...,...,...
1332601,1999,M,Zyier,5,0.000002,10468,Rest,0
1332602,1999,M,Zyquan,5,0.000002,10467,Rest,0
1332603,1999,M,Zyquez,5,0.000002,10466,Rest,0
1332604,1999,M,Zyron,5,0.000002,10464,Rest,0
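
Filtering on the year alone returns every 1999 row. To answer the question directly, one option is to also filter on `rank == 0`, which keeps the top name for each sex. The sketch below uses a tiny made-up frame; the male names `BoyA` and `BoyB` are placeholders, not real results from the dataset.

```python
import pandas as pd

# Toy data: the F rows mirror the output above, the M rows are placeholders
toy = pd.DataFrame({
    "year": [1999, 1999, 1999, 1999, 1998],
    "sex":  ["F", "F", "M", "M", "F"],
    "name": ["Emily", "Hannah", "BoyA", "BoyB", "Emily"],
    "rank": [0, 1, 0, 1, 0],
})

# Combine the two conditions with & (each condition needs its own parentheses)
top_1999 = toy[(toy["year"] == 1999) & (toy["rank"] == 0)]
```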


### Task 2-3. What are the top 10 names by overall popularity (in terms of total "n") for each "sex"?

In [84]:
names_grp = babynames.groupby(["sex","name"])["n"].sum().reset_index()

In [85]:
sex_grp = names_grp.groupby("sex").apply(lambda x: x.sort_values(["n"],ascending = False)).reset_index(drop = True)

In [86]:
sex_grp

Unnamed: 0,sex,name,n
0,F,Mary,4118058
1,F,Elizabeth,1610948
2,F,Patricia,1570954
3,F,Jennifer,1464067
4,F,Linda,1451331
...,...,...,...
105381,M,Luvender,5
105382,M,Dominquie,5
105383,M,Luvert,5
105384,M,Domino,5


In [87]:
sex_grp.groupby("sex").head(10)

Unnamed: 0,sex,name,n
0,F,Mary,4118058
1,F,Elizabeth,1610948
2,F,Patricia,1570954
3,F,Jennifer,1464067
4,F,Linda,1451331
5,F,Barbara,1433339
6,F,Margaret,1242141
7,F,Susan,1120810
8,F,Dorothy,1106106
9,F,Sarah,1065265


In [88]:
# alternative method
nam_grp = babynames.groupby(["sex", "name"])["n"].sum().reset_index()

# use the pandas nlargest method within each group

nam_grp.groupby("sex").apply(lambda x: x.nlargest(10, 'n')).reset_index(drop = True) 


Unnamed: 0,sex,name,n
0,F,Mary,4118058
1,F,Elizabeth,1610948
2,F,Patricia,1570954
3,F,Jennifer,1464067
4,F,Linda,1451331
5,F,Barbara,1433339
6,F,Margaret,1242141
7,F,Susan,1120810
8,F,Dorothy,1106106
9,F,Sarah,1065265
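
Both methods above call `apply` with a lambda on each group. A possible alternative, sketched here on made-up data, is to sort the whole frame once by group key and count, then take the first rows of each group with `groupby(...).head()`. The frame `toy` and the cutoff of 2 are assumptions for illustration.

```python
import pandas as pd

# Toy totals per (sex, name), made up for illustration
toy = pd.DataFrame({
    "sex":  ["F", "F", "F", "M", "M", "M"],
    "name": ["Mary", "Anna", "Emma", "John", "James", "Robert"],
    "n":    [30, 10, 20, 5, 15, 25],
})

# Sort by sex ascending and n descending, then keep the first 2 rows per sex
top2 = (
    toy.sort_values(["sex", "n"], ascending=[True, False])
       .groupby("sex")
       .head(2)
       .reset_index(drop=True)
)
```

This avoids a Python-level function call per group, which can matter on larger frames.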


### Task 2-4. What is the proportion of babies having the top 100 names for each year and sex?

In [89]:
# use rank column created above

# create a new column first to separate top 100 names from the rest
babynames["top100"] = np.where(babynames["rank"]<100,"Top100","Rest")


In [90]:
babynames

Unnamed: 0,year,sex,name,n,prop,rank,top100,top10
0,1880,F,Mary,7065,0.072384,0,Top100,1
1,1880,F,Anna,2604,0.026679,1,Top100,1
2,1880,F,Emma,2003,0.020522,2,Top100,1
3,1880,F,Elizabeth,1939,0.019866,3,Top100,1
4,1880,F,Minnie,1746,0.017889,4,Top100,1
...,...,...,...,...,...,...,...,...
1858684,2015,M,Zykell,5,0.000002,12650,Rest,0
1858685,2015,M,Zyking,5,0.000002,12649,Rest,0
1858686,2015,M,Zykir,5,0.000002,12648,Rest,0
1858687,2015,M,Zyrus,5,0.000002,12646,Rest,0


In [91]:
top100_df = (babynames.groupby(["year","sex","top100"])["n"].sum()/babynames.groupby(["year","sex"])["n"].sum()*100).reset_index()

In [92]:
# show the percentage for the top 100 names and for the rest, within the same year and sex
top100_df

Unnamed: 0,year,sex,top100,n
0,1880,F,Rest,23.453710
1,1880,F,Top100,76.546290
2,1880,M,Rest,19.862431
3,1880,M,Top100,80.137569
4,1881,F,Rest,23.422836
...,...,...,...,...
539,2014,M,Top100,44.426954
540,2015,F,Rest,65.959504
541,2015,F,Top100,34.040496
542,2015,M,Rest,56.264028


In [93]:
# show the percentage for the top 100 names only
top100_df.loc[top100_df["top100"] == "Top100",:]

Unnamed: 0,year,sex,top100,n
1,1880,F,Top100,76.546290
3,1880,M,Top100,80.137569
5,1881,F,Top100,76.577164
7,1881,M,Top100,80.253715
9,1882,F,Top100,76.069097
...,...,...,...,...
535,2013,M,Top100,45.284655
537,2014,F,Top100,33.994583
539,2014,M,Top100,44.426954
541,2015,F,Top100,34.040496


### Task 2-5. For each name, find the year in which it was ranked highest and the rank in that year.

In [None]:
# define a helper function and use it with apply

In [94]:
def highest_rank(x):
    i = x['rank'].idxmin()
    return x.loc[i, ['rank', 'year']]

In [95]:
babynames.groupby(['sex', 'name']).apply(highest_rank)

sex,name,rank,year
F,Aabha,11552,2014
F,Aabriella,18101,2015
F,Aada,18100,2015
F,Aaden,19195,2009
F,Aadhira,8771,2014
...,...,...,...
M,Zyus,13958,2015
M,Zyvion,14519,2009
M,Zyvon,11046,2015
M,Zyyon,11130,2014


In [96]:
# the same calculation, ignoring the sex of the baby
babynames.groupby('name').apply(highest_rank)

name,rank,year
Aaban,5622,2014
Aabha,11552,2014
Aabid,12133,2003
Aabriella,18101,2015
Aada,18100,2015
...,...,...
Zyvion,14519,2009
Zyvon,11046,2015
Zyyanna,15601,2010
Zyyon,11130,2014
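
Calling `apply` with a Python function on every (sex, name) group can be slow on a frame of this size. One possible alternative, sketched on made-up data, is to let `idxmin` find the row label of the best rank in each group and then select those rows with a single `loc` call. The frame `toy` is an assumption for illustration.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "year": [1990, 1991, 1992, 1990, 1991],
    "sex":  ["F", "F", "F", "M", "M"],
    "name": ["Ann", "Ann", "Ann", "Bob", "Bob"],
    "rank": [5, 2, 7, 9, 3],
})

# Row label of the smallest (best) rank within each (sex, name) group
best_rows = toy.groupby(["sex", "name"])["rank"].idxmin()

# Select those rows once, keeping the rank and the year it was achieved
best = toy.loc[best_rows, ["sex", "name", "rank", "year"]].set_index(["sex", "name"])
```

When a name reaches its best rank in several years, `idxmin` returns the first such row, which matches the behaviour of the helper function above.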


### Task 2-6. Which name has been in the top 10 most often?

In [None]:
# make use of the rank column

In [97]:
babynames["top10"] = np.where(babynames["rank"]<10,1,0)

In [98]:
babynames

Unnamed: 0,year,sex,name,n,prop,rank,top100,top10
0,1880,F,Mary,7065,0.072384,0,Top100,1
1,1880,F,Anna,2604,0.026679,1,Top100,1
2,1880,F,Emma,2003,0.020522,2,Top100,1
3,1880,F,Elizabeth,1939,0.019866,3,Top100,1
4,1880,F,Minnie,1746,0.017889,4,Top100,1
...,...,...,...,...,...,...,...,...
1858684,2015,M,Zykell,5,0.000002,12650,Rest,0
1858685,2015,M,Zyking,5,0.000002,12649,Rest,0
1858686,2015,M,Zykir,5,0.000002,12648,Rest,0
1858687,2015,M,Zyrus,5,0.000002,12646,Rest,0


In [99]:
babynames.groupby(["sex","name"])["top10"].sum().reset_index().sort_values(by='top10', ascending=False)
# James has been in the male top 10 for the most years (115).
# Mary has been in the female top 10 for the most years (92).

Unnamed: 0,sex,name,top10
82254,M,James,115
84287,M,John,108
96839,M,Robert,108
103690,M,William,108
40885,F,Mary,92
...,...,...,...
35149,F,Lashunta,0
35148,F,Lashunna,0
35147,F,Lashune,0
35146,F,Lashundria,0
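
The sorted table mixes both sexes, so the female leader has to be picked out by eye. As a minimal sketch on made-up data, the leader for each sex can be extracted by sorting the counts and keeping the first row per sex. The frame `toy` and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy data, made up for illustration: one row per (year, sex, name) appearance
toy = pd.DataFrame({
    "sex":  ["M", "M", "M", "F", "F"],
    "name": ["James", "James", "John", "Mary", "Mary"],
    "rank": [0, 3, 12, 1, 25],
})

# A boolean comparison cast to int gives the same 0/1 flag as np.where
toy["top10"] = (toy["rank"] < 10).astype(int)

# Number of top-10 appearances per (sex, name)
years_in_top10 = toy.groupby(["sex", "name"])["top10"].sum().reset_index()

# Sort by count and keep the leading name for each sex
leaders = (
    years_in_top10.sort_values("top10", ascending=False)
                  .groupby("sex")
                  .head(1)
                  .set_index("sex")
)
```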
