# Case Study - Babies' names in the US from 1880 to 2015

## Learning Objectives:
1. Perform group-wise operations using Pandas
2. Become familiar with Pandas's groupby objects
3. Practice the aggregate, filter, and apply functions in Pandas

In [1]:
import numpy as np
import pandas as pd

<i><b>Background</b></i>: The dataset, `babynames.csv`, records the male and female baby names used in the US from 1880 to 2015, together with each name's count ("n") and proportion ("prop") among all the newborns in that year. We will use this dataset to practice group-wise operations using Pandas.

### Load data

In [2]:
babynames = pd.read_csv("babynames.csv")

In [3]:
babynames.head(10)

Unnamed: 0,year,sex,name,n,prop
0,1880,F,Mary,7065,0.072384
1,1880,F,Anna,2604,0.026679
2,1880,F,Emma,2003,0.020522
3,1880,F,Elizabeth,1939,0.019866
4,1880,F,Minnie,1746,0.017889
5,1880,F,Margaret,1578,0.016167
6,1880,F,Ida,1472,0.015081
7,1880,F,Alice,1414,0.014487
8,1880,F,Bertha,1320,0.013524
9,1880,F,Sarah,1288,0.013196


### Task 1. On Hilary

Let's focus on a particular baby name first.

In [4]:
# Group by a single column using a plain string key, so get_group() accepts a scalar
name_group = babynames.groupby('name')
name_group.get_group('Hilary')


Unnamed: 0,year,sex,name,n,prop
5757,1882,M,Hilary,7,0.000057
7952,1883,M,Hilary,6,0.000053
17221,1887,M,Hilary,7,0.000064
27703,1891,M,Hilary,8,0.000073
42705,1896,M,Hilary,6,0.000046
...,...,...,...,...,...
1694105,2011,F,Hilary,79,0.000041
1728086,2012,F,Hilary,75,0.000039
1761989,2013,F,Hilary,66,0.000034
1795403,2014,F,Hilary,60,0.000031


In [5]:
# Number of rows (year/sex records) for the name "Hilary"
len(babynames.groupby('name').groups['Hilary'])


191

### Task 1-1. List the numbers of male and female Hilary for each year

In [6]:
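babyname_group = babynames.groupby(['year', 'sex'])
# Note: this counts the rows in each (year, sex) group whose name contains "Hilary",
# so each value is a row count, not the number of babies ("n")
babyname_group['name'].apply(lambda x: x.str.contains("Hilary").sum())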


year  sex
1880  F      0
      M      0
1881  F      0
      M      0
1882  F      0
            ..
2013  M      0
2014  F      1
      M      0
2015  F      1
      M      0
Name: name, Length: 272, dtype: int64
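
The result above is a count of matching rows per group. To list the number of babies instead, one option is to filter on the exact name and sum "n" per year and sex. The following is a minimal sketch on a small hypothetical frame `toy` (the values are invented for illustration; only the column names match `babynames`):

```python
import pandas as pd

# Hypothetical toy data with the same columns as babynames (values invented)
toy = pd.DataFrame({
    "year": [1990, 1990, 1990, 1991, 1991],
    "sex":  ["F", "M", "F", "F", "M"],
    "name": ["Hilary", "Hilary", "Mary", "Hilary", "John"],
    "n":    [100, 5, 900, 80, 700],
})

# Keep only the rows for the exact name, then sum "n" per (year, sex)
hilary = toy[toy["name"] == "Hilary"]
hilary_counts = hilary.groupby(["year", "sex"])["n"].sum()

# Move sex into columns; (year, sex) combinations with no record become 0
hilary_table = hilary_counts.unstack("sex", fill_value=0)
print(hilary_table)
```

On the full dataset the same steps would apply with `babynames` in place of `toy`.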

### Task 2. Group-wise operations

### Task 2-1. Count the number of names by year and sex

In [7]:
# Select the column first so that only "name" is counted
babynames.groupby(['year', 'sex'])['name'].nunique()

year  sex
1880  F        942
      M       1058
1881  F        938
      M        997
1882  F       1028
             ...  
2013  M      14026
2014  F      19150
      M      14026
2015  F      18993
      M      13959
Name: name, Length: 272, dtype: int64

### Task 2-2. Calculate the ranking of each name for each year and sex combination. Which names were most popular in 1999? (Hint: ranking can be calculated with argsort())

In [8]:
# Most popular name in 1999 - method 1: sort the 1999 rows by count
babyname_group = babynames.groupby('year')
babyname_group1 = babyname_group.get_group(1999).sort_values(by='n', ascending=False)
babyname_group1.head(1)

Unnamed: 0,year,sex,name,n,prop
1320998,1999,M,Jacob,35346,0.017344


In [15]:
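# Ranking for each name
# Note: this assignment creates an alias, not a copy, so the new column is also added to babynames
babynames2_group = babynames
# Caution: argsort() returns the row positions that would sort "n" in ascending order
# over the whole table; it is not a per-year, per-sex rank
babynames2_group["ranking"] = babynames2_group["n"].values.argsort()
babynames2_group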

Unnamed: 0,year,sex,name,n,prop,ranking
0,1880,F,Mary,7065,0.072384,1858688
1,1880,F,Anna,2604,0.026679,819156
2,1880,F,Emma,2003,0.020522,819155
3,1880,F,Elizabeth,1939,0.019866,819154
4,1880,F,Minnie,1746,0.017889,819153
5,1880,F,Margaret,1578,0.016167,819152
6,1880,F,Ida,1472,0.015081,819151
7,1880,F,Alice,1414,0.014487,819157
8,1880,F,Bertha,1320,0.013524,819150
9,1880,F,Sarah,1288,0.013196,819148
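
The values in the "ranking" column above are sort positions rather than ranks. The difference between `argsort()` and a rank can be seen on a short, hypothetical list of counts. This is only an illustrative sketch: applying `argsort()` twice to the negated values is one common way to turn sort positions into descending ranks, and `Series.rank` gives the same result directly.

```python
import numpy as np
import pandas as pd

# Hypothetical counts, invented for illustration
n = pd.Series([50, 300, 10, 200, 120])

# argsort: positions of the elements in ascending sorted order (not ranks)
order = n.to_numpy().argsort()

# argsort applied twice to the negated values: 1-based descending rank of each element
rank_from_argsort = (-n.to_numpy()).argsort().argsort() + 1

# Series.rank computes the same descending ranks directly
rank_direct = n.rank(ascending=False, method="first").astype(int)

print(order)
print(rank_from_argsort)
print(rank_direct.tolist())
```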


In [22]:
# Most popular name in 1999 - method 2
babynames3_group = babynames2_group.groupby('year')
babynames3_group.get_group(1999).sort_values(by='n', ascending=False).head(1)

Unnamed: 0,year,sex,name,n,prop,ranking
1320998,1999,M,Jacob,35346,0.017344,1663017
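
Both methods above return a single name for the whole year. Since the task asks for a ranking per year and sex combination, one possible approach is to rank within each (year, sex) group and then select rank 1 for 1999. The sketch below uses a hypothetical frame `toy` with invented counts, and the column name `rank_in_group` is an assumption introduced here:

```python
import pandas as pd

# Hypothetical toy data (counts invented for illustration)
toy = pd.DataFrame({
    "year": [1999, 1999, 1999, 1999, 2000, 2000],
    "sex":  ["F", "F", "M", "M", "F", "M"],
    "name": ["Emily", "Hannah", "Jacob", "Michael", "Emily", "Jacob"],
    "n":    [300, 250, 400, 390, 280, 410],
})

# Rank names within each (year, sex) group; rank 1 is the largest count
toy["rank_in_group"] = (
    toy.groupby(["year", "sex"])["n"]
       .rank(ascending=False, method="min")
       .astype(int)
)

# Most popular name for each sex in 1999
top_1999 = toy[(toy["year"] == 1999) & (toy["rank_in_group"] == 1)]
print(top_1999[["sex", "name", "n"]])
```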


### Task 2-3. What are the Top 10 in overall name popularity (in terms of total "n") by "sex"?

In [10]:
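# Total each numeric column per (sex, name) over all years
babyname2 = babynames.groupby(['sex', 'name']).sum()
babyname3 = babyname2.sort_values(by=['sex', 'n'], ascending=[False, False])
# Summed years and proportions are not meaningful here, so keep only "n"
babyname3 = babyname3.drop(columns=['year', 'prop'])
babyname3.groupby('sex').head(10)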

sex,name,n
M,James,5120990
M,John,5095674
M,Robert,4803068
M,Michael,4323928
M,William,4071645
M,David,3589754
M,Joseph,2581785
M,Richard,2558165
M,Charles,2371621
M,Thomas,2290364


### Task 2-4. What is the proportion of babies having the top 100 names for each year and sex?

In [37]:
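# Note: babyname2 holds totals per (sex, name) summed over all years, so "prop" here is
# the sum of yearly proportions for the 100 names with the largest overall counts,
# not a per-year, per-sex share
babyname4 = babyname2.sort_values(by=['sex', 'n'], ascending=[False, False])
top100_prop = babyname4.groupby('sex').head(100)
top100_prop = top100_prop.drop(columns=['year', 'n'])
top100_prop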

sex,name,prop
M,James,4.622352
M,John,5.337247
M,Robert,3.845090
M,Michael,2.421536
M,William,4.467286
...,...,...
F,Diana,0.201313
F,Irene,0.415330
F,Annie,0.539515
F,Ruby,0.359253
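
The table above sums "prop" per name across all years. A per-year, per-sex answer would instead take the top names within each (year, sex) group and add up their proportions. The following is a minimal sketch under stated assumptions: `toy` is a hypothetical frame with invented counts, "prop" is recomputed as the share within the toy data, and `TOP_K` is set to 2 here (it would be 100 on the full dataset).

```python
import pandas as pd

# Hypothetical toy data (counts invented for illustration)
toy = pd.DataFrame({
    "year": [1990] * 6,
    "sex":  ["F", "F", "F", "M", "M", "M"],
    "name": ["Mary", "Anna", "Emma", "John", "James", "Robert"],
    "n":    [50, 30, 20, 60, 30, 10],
})

# Assumption: "prop" is each name's share of the births for that year and sex
toy["prop"] = toy["n"] / toy.groupby(["year", "sex"])["n"].transform("sum")

TOP_K = 2  # would be 100 on the full dataset

# Take the TOP_K largest names per (year, sex), then sum their proportions
top_share = (
    toy.sort_values("n", ascending=False)
       .groupby(["year", "sex"])
       .head(TOP_K)
       .groupby(["year", "sex"])["prop"]
       .sum()
)
print(top_share)
```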


### Task 2-5. For each name, find the year in which it was ranked highest and the rank in that year.

In [24]:
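# Caution: "ranking" holds the argsort() positions computed earlier, not true ranks,
# so these maxima do not identify each name's best year
namegroup = babynames.groupby(["name", "year"])["ranking"].max()
namegroup.head(100)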

name      year
Aaban     2007    1334780
          2009    1506898
          2010     697488
          2011     357834
          2012      91831
                   ...   
Aadhavan  2013     447728
          2015     526271
Aadhi     2013    1393091
          2014      73160
          2015     526302
Name: ranking, Length: 100, dtype: int64
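
One way to answer this task with true ranks is to rank within each (year, sex) group and then, for each name, pick the row with the smallest rank using `idxmin()`. The sketch below assumes a hypothetical frame `toy` with invented counts; `rank_in_group` is a column name introduced here. Names used for both sexes are pooled in this version, and ties go to the first matching row.

```python
import pandas as pd

# Hypothetical toy data (counts invented for illustration)
toy = pd.DataFrame({
    "year": [1990, 1990, 1991, 1991, 1990, 1990, 1991, 1991],
    "sex":  ["F", "F", "F", "F", "M", "M", "M", "M"],
    "name": ["Mary", "Anna", "Mary", "Anna", "John", "James", "John", "James"],
    "n":    [50, 30, 40, 45, 60, 40, 50, 55],
})

# Rank within each (year, sex) group; rank 1 is the largest count
toy["rank_in_group"] = (
    toy.groupby(["year", "sex"])["n"]
       .rank(ascending=False, method="min")
       .astype(int)
)

# For each name, idxmin() gives the row label of its best (smallest) rank
best_rows = toy.loc[toy.groupby("name")["rank_in_group"].idxmin()]
best = best_rows.set_index("name")[["year", "rank_in_group"]]
print(best)
```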

### Task 2-6. Which name has been in the top 10 most often?

In [11]:
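# Total count per name and year (both sexes combined), plus an overall total per name.
# Note: this orders names by total count, which is not the same as time spent in the top 10
babygroup_year = babynames.pivot_table(index='name', columns='year', values='n', aggfunc='sum')
babygroup_year['Total'] = babygroup_year.iloc[:, 0:136].sum(axis=1)
babygroup_year.sort_values(by=['Total'], ascending=False).head(10)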

name,1880,1881,1882,1883,1884,1885,1886,1887,1888,1889,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Total
James,5949.0,5465.0,5910.0,5248.0,5726.0,5201.0,5384.0,4787.0,5607.0,5046.0,...,15981.0,15190.0,14232.0,13890.0,13266.0,13421.0,13555.0,14425.0,14743.0,5144205.0
John,9701.0,8795.0,9597.0,8934.0,9428.0,8801.0,9074.0,8166.0,9299.0,8600.0,...,14446.0,13331.0,12114.0,11553.0,11048.0,10618.0,10703.0,10673.0,10314.0,5117331.0
Robert,2426.0,2149.0,2512.0,2345.0,2475.0,2333.0,2460.0,2133.0,2833.0,2527.0,...,9376.0,8828.0,7826.0,7554.0,6968.0,6936.0,6698.0,6619.0,6084.0,4823167.0
Michael,354.0,298.0,321.0,307.0,373.0,370.0,348.0,345.0,466.0,377.0,...,22023.0,20656.0,18958.0,17361.0,16796.0,16151.0,15492.0,15435.0,14357.0,4345569.0
Mary,7092.0,6948.0,8178.0,8044.0,9253.0,9166.0,9921.0,9935.0,11804.0,11689.0,...,3684.0,3493.0,3155.0,2863.0,2704.0,2572.0,2639.0,2629.0,2602.0,4133216.0
William,9561.0,8554.0,9329.0,8427.0,8931.0,8077.0,8287.0,7514.0,8747.0,7818.0,...,18902.0,18411.0,17920.0,17059.0,17353.0,16875.0,16609.0,16799.0,15824.0,4087556.0
David,869.0,750.0,838.0,740.0,761.0,717.0,674.0,682.0,801.0,757.0,...,17545.0,16329.0,15440.0,14201.0,13227.0,12525.0,12349.0,12172.0,11709.0,3602623.0
Joseph,2642.0,2466.0,2676.0,2532.0,2716.0,2554.0,2602.0,2469.0,3011.0,2736.0,...,17352.0,16580.0,14922.0,13822.0,12959.0,12540.0,12215.0,12086.0,11386.0,2592388.0
Richard,728.0,641.0,746.0,649.0,749.0,672.0,728.0,629.0,780.0,716.0,...,4427.0,4058.0,3612.0,3234.0,3159.0,3019.0,2788.0,2870.0,2659.0,2567700.0
Charles,5359.0,4653.0,5114.0,4844.0,4821.0,4623.0,4555.0,4064.0,4619.0,4227.0,...,7462.0,7297.0,7279.0,7101.0,6983.0,6940.0,7015.0,7330.0,7134.0,2383998.0


In [40]:
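# babyname_group is the per-year grouping defined earlier, so names are ranked within each
# year across both sexes; method='first' breaks ties by order of appearance
ranking = babyname_group["n"].rank(ascending=False, method='first')
babynames["Ranking"] = ranking
babynames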

Unnamed: 0,year,sex,name,n,prop,ranking,Ranking
0,1880,F,Mary,7065,0.072384,1858688,3.0
1,1880,F,Anna,2604,0.026679,819156,9.0
2,1880,F,Emma,2003,0.020522,819155,15.0
3,1880,F,Elizabeth,1939,0.019866,819154,16.0
4,1880,F,Minnie,1746,0.017889,819153,18.0
...,...,...,...,...,...,...,...
1858684,2015,M,Zykell,5,0.000002,437157,32948.0
1858685,2015,M,Zyking,5,0.000002,544606,32949.0
1858686,2015,M,Zykir,5,0.000002,437156,32950.0
1858687,2015,M,Zyrus,5,0.000002,441423,32951.0


In [39]:
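filt_top10 = (babynames['Ranking'] <= 10)
babytop10 = babynames.loc[filt_top10, :].copy()
# Caution: max() on a string column returns the alphabetically last name,
# not the name that appears most often
babytop10["name"].max()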

'William'
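
Because `max()` on a column of strings returns the alphabetically last value, it does not measure how often a name appears. Counting appearances with `value_counts()` and taking `idxmax()` is one way to find the most frequent name. The sketch below uses a hypothetical frame `toy` with invented counts and a cutoff `TOP_K` of 1 so that the example stays small (it would be 10 on the full dataset):

```python
import pandas as pd

# Hypothetical toy data (counts invented for illustration)
toy = pd.DataFrame({
    "year": [1990, 1990, 1991, 1991, 1992, 1992],
    "sex":  ["F"] * 6,
    "name": ["Mary", "Zoe", "Mary", "Zoe", "Mary", "Zoe"],
    "n":    [50, 30, 40, 45, 60, 20],
})

# Rank within each (year, sex) group; rank 1 is the largest count
toy["rank_in_group"] = (
    toy.groupby(["year", "sex"])["n"]
       .rank(ascending=False, method="first")
       .astype(int)
)

TOP_K = 1  # would be 10 on the full dataset
in_top = toy[toy["rank_in_group"] <= TOP_K]

# Count how many times each name made the cut, then take the most frequent one
appearances = in_top["name"].value_counts()
most_frequent = appearances.idxmax()

print(most_frequent)         # name with the most top-K appearances
print(in_top["name"].max())  # alphabetically last name, for comparison
```

In this toy data the two results differ, which shows why the frequency count is needed to answer the question.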