# Case Study - Babies' names in the US from 1880 to 2015

## Learning Objectives:
1. Perform group-wise operations using Pandas
2. Become familiar with Pandas’s groupby objects
3. Practice aggregate, filter and apply functions in Pandas  
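
As a warm-up, the three operations named above can be sketched on a tiny table. The data and variable names below (`toy`, `total_per_year`, `big_years`, `largest_per_year`) are invented for illustration and are not taken from `babynames.csv`; this is a minimal sketch rather than part of the case study itself.

```python
import pandas as pd

# Toy data invented for illustration; it is not taken from babynames.csv.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "sex":  ["F", "F", "M", "F", "M"],
    "name": ["Ann", "Bea", "Cal", "Ann", "Cal"],
    "n":    [10, 5, 8, 12, 3],
})

grouped = toy.groupby("year")

# aggregate: reduce each group to a single summary value
total_per_year = grouped["n"].sum()

# filter: keep only the rows of groups that satisfy a condition
big_years = grouped.filter(lambda g: g["n"].sum() > 20)

# apply: run an arbitrary function on each group
# (here: the largest single count within each year)
largest_per_year = grouped["n"].apply(lambda s: s.max())

print(total_per_year)
print(big_years)
print(largest_per_year)
```

Aggregation returns one row per group, whereas `filter` returns rows of the original table, so the two are often combined: filter first, then aggregate what remains.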

In [491]:
import numpy as np
import pandas as pd

<i><b>Background</b></i>: The dataset, `babynames.csv`, records male and female baby names in the US from 1880 to 2015, together with each name's count (`n`) and proportion (`prop`) among the newborns of that sex in that year. We will use this dataset to practice group-wise operations using Pandas.

### Load data

In [492]:
babynames = pd.read_csv("babynames.csv")

In [493]:
babynames.head(10)

Unnamed: 0,year,sex,name,n,prop
0,1880,F,Mary,7065,0.072384
1,1880,F,Anna,2604,0.026679
2,1880,F,Emma,2003,0.020522
3,1880,F,Elizabeth,1939,0.019866
4,1880,F,Minnie,1746,0.017889
5,1880,F,Margaret,1578,0.016167
6,1880,F,Ida,1472,0.015081
7,1880,F,Alice,1414,0.014487
8,1880,F,Bertha,1320,0.013524
9,1880,F,Sarah,1288,0.013196


In [494]:
bb = pd.DataFrame(babynames)
bb

Unnamed: 0,year,sex,name,n,prop
0,1880,F,Mary,7065,0.072384
1,1880,F,Anna,2604,0.026679
2,1880,F,Emma,2003,0.020522
3,1880,F,Elizabeth,1939,0.019866
4,1880,F,Minnie,1746,0.017889
...,...,...,...,...,...
1858684,2015,M,Zykell,5,0.000002
1858685,2015,M,Zyking,5,0.000002
1858686,2015,M,Zykir,5,0.000002
1858687,2015,M,Zyrus,5,0.000002


### Task 1. On Hilary

Let's focus on a particular baby name first.

In [495]:
Hilary = bb['name'] == 'Hilary'  # boolean mask: rows whose name is Hilary
Hbb = bb.loc[Hilary, :]          # subset containing only the Hilary rows
Hbb.head(5)

Unnamed: 0,year,sex,name,n,prop
5757,1882,M,Hilary,7,5.7e-05
7952,1883,M,Hilary,6,5.3e-05
17221,1887,M,Hilary,7,6.4e-05
27703,1891,M,Hilary,8,7.3e-05
42705,1896,M,Hilary,6,4.6e-05


### Task 1-1. List the numbers of male and female Hilary for each year

In [496]:
Hsex = Hbb.groupby(['year', 'sex'])  # group the Hilary rows by year and sex
Hsex[['n']].agg(['sum'])             # total count per (year, sex) group

year,sex,n (sum)
1882,M,7
1883,M,6
1887,M,7
1891,M,8
1896,M,6
...,...,...
2011,F,79
2012,F,75
2013,F,66
2014,F,60
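
The long output above can be hard to scan when comparing the two sexes. One option, sketched here on invented rows (the counts are not real data), is to sum `n` per year and sex and then move `sex` into columns with `unstack`, filling years where one sex is missing with 0.

```python
import pandas as pd

# Toy rows invented for illustration (not real counts).
toy = pd.DataFrame({
    "year": [1990, 1990, 1991, 1992],
    "sex":  ["F", "M", "F", "F"],
    "name": ["Hilary"] * 4,
    "n":    [100, 7, 120, 90],
})

# Total count per (year, sex), as in the task
by_year_sex = toy.groupby(["year", "sex"])["n"].sum()

# One row per year, one column per sex; missing combinations become 0
wide = by_year_sex.unstack("sex", fill_value=0)
print(wide)
```

On the real data, the same two lines applied to `Hbb` should give one row per year with an `F` and an `M` column.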


### Task 2. Group-wise operations

### Task 2-1. Count the number of names by year and sex

In [497]:
bb.groupby(['year', 'sex']).size()

year  sex
1880  F        942
      M       1058
1881  F        938
      M        997
1882  F       1028
             ...  
2013  M      14026
2014  F      19150
      M      14026
2015  F      18993
      M      13959
Length: 272, dtype: int64

### Task 2-2. Calculate the ranking of each name for each year and sex combination. Which names were most popular in 1999? (Hint: ranking can be calculated with `argsort()`.)

In [498]:
# Sort so that, within each year and sex, the most common names come first
bsort = bb.sort_values(by=['year', 'sex', 'n'], ascending=[True, True, False])
# Rank names within each year and sex by count (1 = most common)
bsort['rank'] = bsort.groupby(['year', 'sex'])['n'].rank(method='min', ascending=False)
bsort.head(5)

In [499]:
f99 = bb['year'] == 1999
b99 = bb.loc[f99, :]  # rows for 1999 only
b_yr_sex = b99.groupby(['year', 'sex', 'name'])

# Each (year, sex, name) combination is expected to appear once, so max returns that row's count
b1999 = b_yr_sex.agg({'n': 'max'})
b1999.sort_values(by=['n'], ascending=False)

# Most popular names in 1999: Jacob (M) and Emily (F)

year,sex,name,n
1999,M,Jacob,35346
1999,M,Michael,33906
1999,M,Matthew,30417
1999,M,Joshua,27254
1999,F,Emily,26537
1999,F,...,...
1999,F,Meilynn,5
1999,M,Arno,5
1999,M,Lexie,5
1999,F,Meika,5
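
The hint for this task mentions `argsort()`. The sketch below, on invented toy data without ties, shows two ways to obtain a within-group rank: `groupby(...).rank()`, and applying `argsort` twice inside `transform`. The column names `rank` and `rank_argsort` and the helper `rank_desc` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; counts are not real and contain no ties.
toy = pd.DataFrame({
    "year": [1999, 1999, 1999, 1999, 1999],
    "sex":  ["F", "F", "F", "M", "M"],
    "name": ["Ann", "Bea", "Cat", "Cal", "Dan"],
    "n":    [30, 50, 20, 40, 60],
})

groups = toy.groupby(["year", "sex"])["n"]

# Option 1: built-in rank (1 = largest n within each year/sex group)
toy["rank"] = groups.rank(method="min", ascending=False).astype(int)

# Option 2: argsort twice. The first argsort gives the sort order,
# the second turns that order into 0-based positions, hence the +1.
def rank_desc(s):
    order = np.argsort(-s.to_numpy(), kind="stable")
    return pd.Series(np.argsort(order, kind="stable") + 1, index=s.index)

toy["rank_argsort"] = groups.transform(rank_desc)

# Most popular name per year and sex
most_popular = toy[toy["rank"] == 1]
print(toy)
print(most_popular)
```

The two columns agree when there are no ties. With ties, `rank(method="min")` gives tied names the same rank, while the double `argsort` assigns them consecutive ranks, which is one reason to prefer `rank` for this dataset.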


### Task 2-3. What are the Top 10 in overall name popularity (in terms of total "n") by "sex"?

In [500]:
F = bb['sex'] == 'F'
F_top10 = bb.loc[F, :]  # all female rows

Fgroup = F_top10.groupby(['sex', 'name'])
Fgrpby = Fgroup.agg({'n': 'sum'})  # total count per name across all years

Fsort = Fgrpby.sort_values(by=['n'], ascending=False)
Fsort.head(10)  # top 10 female names

sex,name,n
F,Mary,4118058
F,Elizabeth,1610948
F,Patricia,1570954
F,Jennifer,1464067
F,Linda,1451331
F,Barbara,1433339
F,Margaret,1242141
F,Susan,1120810
F,Dorothy,1106106
F,Sarah,1065265


In [501]:
M = bb['sex'] == 'M'
M_top10 = bb.loc[M, :]  # all male rows

Mgroup = M_top10.groupby(['sex', 'name'])
Mgrpby = Mgroup.agg({'n': 'sum'})  # total count per name across all years

Msort = Mgrpby.sort_values(by=['n'], ascending=False)
Msort.head(10)  # top 10 male names


sex,name,n
M,James,5120990
M,John,5095674
M,Robert,4803068
M,Michael,4323928
M,William,4071645
M,David,3589754
M,Joseph,2581785
M,Richard,2558165
M,Charles,2371621
M,Thomas,2290364
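
The two cells above repeat the same steps once per sex. A possible single-pass alternative, sketched on invented toy data, sums `n` per sex and name, sorts, and keeps the first `TOP_K` rows of each sex with `groupby(...).head()`. `TOP_K` and the other names are illustrative; on the full dataset `TOP_K` would be 10.

```python
import pandas as pd

# Toy data invented for illustration (not real counts).
toy = pd.DataFrame({
    "year": [2000, 2001, 2000, 2001, 2000, 2001],
    "sex":  ["F", "F", "F", "M", "M", "M"],
    "name": ["Ann", "Ann", "Bea", "Cal", "Cal", "Dan"],
    "n":    [10, 12, 5, 3, 8, 20],
})

TOP_K = 1  # use 10 on the full dataset

# Total count per (sex, name) across all years
totals = toy.groupby(["sex", "name"], as_index=False)["n"].sum()

# Sort by total within each sex, then keep the first TOP_K rows per sex
top_by_sex = (
    totals.sort_values(["sex", "n"], ascending=[True, False])
          .groupby("sex")
          .head(TOP_K)
)
print(top_by_sex)
```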


### Task 2-4. What is the proportion of babies having the top 100 names for each year and sex?

In [502]:
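# Rank names within each year and sex by proportion (1 = most popular)
bb['rank'] = bb.groupby(['year', 'sex'])['prop'].rank(method='min', ascending=False)
# Note: `< 100` keeps ranks 1 to 99; `<= 100` would also include the 100th name.
# The output below was produced with `< 100`.
top100b = bb[bb['rank'] < 100]

# Sum the proportions of the selected names per year and sex
top100 = top100b.groupby(['year', 'sex'])[['prop']].agg('sum').reset_index()
top100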

Unnamed: 0,year,sex,prop
0,1880,F,0.711456
1,1880,M,0.746712
2,1881,F,0.710171
3,1881,M,0.745479
4,1882,F,0.706870
...,...,...,...
267,2013,M,0.421771
268,2014,F,0.308675
269,2014,M,0.416191
270,2015,F,0.309699
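
The cutoff comparison decides whether the name at the boundary rank is counted. The sketch below uses invented proportions and a cutoff `TOP_K = 2` (100 in the task) to show how `<=` and `<` differ. All names here are illustrative.

```python
import pandas as pd

# Toy proportions invented for illustration.
toy = pd.DataFrame({
    "year": [2000] * 4,
    "sex":  ["F"] * 4,
    "name": ["Ann", "Bea", "Cat", "Dee"],
    "prop": [0.4, 0.3, 0.2, 0.1],
})

TOP_K = 2  # 100 in the task

toy["rank"] = toy.groupby(["year", "sex"])["prop"].rank(method="min", ascending=False)

# `<=` includes the name ranked exactly TOP_K; `<` stops one rank earlier
share_inclusive = toy[toy["rank"] <= TOP_K].groupby(["year", "sex"])["prop"].sum()
share_exclusive = toy[toy["rank"] < TOP_K].groupby(["year", "sex"])["prop"].sum()

print(share_inclusive)  # about 0.7 for (2000, F)
print(share_exclusive)  # about 0.4 for (2000, F)
```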


### Task 2-5. For each name, find the year in which it was ranked highest and the rank in that year.

In [528]:
bsum = bb.copy()  # work on a copy so that later changes to bsum do not modify bb

bsum.head(100)


Unnamed: 0,year,sex,name,n,prop,rank
0,1880,F,Mary,7065,0.072384,1.0
1,1880,F,Anna,2604,0.026679,2.0
2,1880,F,Emma,2003,0.020522,3.0
3,1880,F,Elizabeth,1939,0.019866,4.0
4,1880,F,Minnie,1746,0.017889,5.0
...,...,...,...,...,...,...
95,1880,F,Amelia,221,0.002264,96.0
96,1880,F,Hannah,221,0.002264,96.0
97,1880,F,Jane,215,0.002203,98.0
98,1880,F,Virginia,213,0.002182,99.0


In [529]:
bsum['rank'] = bsum['rank'].astype(int)
bsum.head()

Unnamed: 0,year,sex,name,n,prop,rank
0,1880,F,Mary,7065,0.072384,1
1,1880,F,Anna,2604,0.026679,2
2,1880,F,Emma,2003,0.020522,3
3,1880,F,Elizabeth,1939,0.019866,4
4,1880,F,Minnie,1746,0.017889,5


In [531]:
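# Within each name, rank the yearly rows by proportion (1 = the row with the highest prop).
# Note: this uses prop as a proxy for the yearly rank and groups by name only, so both sexes are pooled.
bsum['top rank'] = bsum.groupby(['name'])['prop'].rank(method='min', ascending=False)
topname = bsum.loc[bsum['top rank'] == 1, :]  # each name's peak row
topname.loc[topname['name'] == 'Mary', :]     # example: Mary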

Unnamed: 0,year,sex,name,n,prop,rank,top rank
0,1880,F,Mary,7065,0.072384,1,1.0


### Task 2-6. Which name has been in the top 10 most often?

In [533]:
top10 = bsum['rank'] < 11    # mask: yearly rank of 10 or better
btop10 = bsum.loc[top10, :]  # rows that made a yearly top 10
btop10.head(100)

Unnamed: 0,year,sex,name,n,prop,rank,top rank
0,1880,F,Mary,7065,0.072384,1,1.0
1,1880,F,Anna,2604,0.026679,2,9.0
2,1880,F,Emma,2003,0.020522,3,2.0
3,1880,F,Elizabeth,1939,0.019866,4,1.0
4,1880,F,Minnie,1746,0.017889,5,1.0
...,...,...,...,...,...,...,...
9323,1884,M,Frank,3218,0.026218,6,7.0
9324,1884,M,Joseph,2707,0.022055,7,41.0
9325,1884,M,Thomas,2572,0.020955,8,19.0
9326,1884,M,Henry,2474,0.020157,9,8.0


In [534]:
# Count how often each name appears in a yearly top 10, sort by that frequency,
# and take the first row (the most frequent name)
btop10.set_index(['name', 'n']).groupby(level='name').count().sort_values(by=['rank'], ascending=False).iloc[0]

year        115
sex         115
prop        115
rank        115
top rank    115
Name: James, dtype: int64
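
A shorter way to get the same kind of answer is `value_counts()` on the `name` column of the top-10 rows. The sketch below uses invented toy data. Like the cell above, it counts rows, so a name that reaches the top 10 for both sexes in the same year would be counted twice.

```python
import pandas as pd

# Toy ranks invented for illustration.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2002],
    "sex":  ["M", "M", "M", "M", "M"],
    "name": ["Cal", "Dan", "Cal", "Eli", "Cal"],
    "rank": [1, 2, 1, 2, 1],
})

TOP_K = 10

# Number of top-TOP_K appearances per name, most frequent first
top_counts = toy.loc[toy["rank"] <= TOP_K, "name"].value_counts()

# Name with the most appearances
most_frequent_name = top_counts.idxmax()
print(top_counts)
print(most_frequent_name)
```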