# Case Study - Babies' names in the US from 1880 to 2015

## Learning Objectives:
1. Perform group-wise operations using Pandas
2. Become familiar with Pandas groupby objects
3. Practice aggregate, filter and apply functions in Pandas  

In [1]:
import numpy as np
import pandas as pd

<i><b>Background</b></i>: The dataset, `babynames.csv`, records male and female baby names in the US from 1880 to 2015, together with each name's count (`n`) and proportion (`prop`) among all newborns in that year. We will use this dataset to practice group-wise operations in Pandas.

### Load data

In [2]:
babynames = pd.read_csv("babynames.csv")

In [3]:
babynames.head(10)

Unnamed: 0,year,sex,name,n,prop
0,1880,F,Mary,7065,0.072384
1,1880,F,Anna,2604,0.026679
2,1880,F,Emma,2003,0.020522
3,1880,F,Elizabeth,1939,0.019866
4,1880,F,Minnie,1746,0.017889
5,1880,F,Margaret,1578,0.016167
6,1880,F,Ida,1472,0.015081
7,1880,F,Alice,1414,0.014487
8,1880,F,Bertha,1320,0.013524
9,1880,F,Sarah,1288,0.013196


In [4]:
babynames.describe()  # summary statistics for the numeric columns

Unnamed: 0,year,n,prop
count,1858689.0,1858689.0,1858689.0
mean,1973.376,183.383,0.0001391443
std,33.69788,1555.357,0.0011702
min,1880.0,5.0,2.259872e-06
25%,1950.0,7.0,3.900959e-06
50%,1983.0,12.0,7.348183e-06
75%,2002.0,32.0,2.324258e-05
max,2015.0,99680.0,0.0815463


In [5]:
babynames.info()  # column dtypes and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1858689 entries, 0 to 1858688
Data columns (total 5 columns):
 #   Column  Dtype  
---  ------  -----  
 0   year    int64  
 1   sex     object 
 2   name    object 
 3   n       int64  
 4   prop    float64
dtypes: float64(1), int64(2), object(2)
memory usage: 70.9+ MB


In [6]:
babynames.index

RangeIndex(start=0, stop=1858689, step=1)

In [7]:
babynames.sort_values(['year','sex','n']).head(10)

Unnamed: 0,year,sex,name,n,prop
835,1880,F,Adelle,5,5.1e-05
836,1880,F,Adina,5,5.1e-05
837,1880,F,Adrienne,5,5.1e-05
838,1880,F,Albertine,5,5.1e-05
839,1880,F,Alys,5,5.1e-05
840,1880,F,Ana,5,5.1e-05
841,1880,F,Araminta,5,5.1e-05
842,1880,F,Arthur,5,5.1e-05
843,1880,F,Birtha,5,5.1e-05
844,1880,F,Bulah,5,5.1e-05


In [8]:
babynames[babynames["n"] == 5]

Unnamed: 0,year,sex,name,n,prop
835,1880,F,Adelle,5,0.000051
836,1880,F,Adina,5,0.000051
837,1880,F,Adrienne,5,0.000051
838,1880,F,Albertine,5,0.000051
839,1880,F,Alys,5,0.000051
...,...,...,...,...,...
1858684,2015,M,Zykell,5,0.000002
1858685,2015,M,Zyking,5,0.000002
1858686,2015,M,Zykir,5,0.000002
1858687,2015,M,Zyrus,5,0.000002


In [9]:
babynames['n'].value_counts()

5        259256
6        184886
7        139578
8        110343
9         89109
          ...  
90623         1
44486         1
14208         1
11734         1
7669          1
Name: n, Length: 13604, dtype: int64

### Task 1. On Hilary

Let's focus on a particular baby name first.

In [10]:
hilary_name_filter = babynames[babynames["name"] == "Hilary"]
hilary_name_filter.head(10)

Unnamed: 0,year,sex,name,n,prop
5757,1882,M,Hilary,7,5.7e-05
7952,1883,M,Hilary,6,5.3e-05
17221,1887,M,Hilary,7,6.4e-05
27703,1891,M,Hilary,8,7.3e-05
42705,1896,M,Hilary,6,4.6e-05
45867,1897,M,Hilary,5,4.1e-05
49142,1898,M,Hilary,5,3.8e-05
62078,1902,M,Hilary,8,6e-05
69355,1904,M,Hilary,5,3.6e-05
72854,1905,M,Hilary,6,4.2e-05


In [11]:
hilary_name_filter.describe()


Unnamed: 0,year,n,prop
count,191.0,191.0,191.0
mean,1956.0,136.471204,8e-05
std,31.021215,250.161165,0.000132
min,1882.0,5.0,2e-06
25%,1932.5,12.0,1e-05
50%,1957.0,35.0,3.4e-05
75%,1980.5,112.0,6.2e-05
max,2015.0,1216.0,0.000592


### Task 1-1. List the numbers of male and female Hilary for each year

In [12]:
hilary_name_filter.groupby(['year','sex'])['n'].sum()

year  sex
1882  M       7
1883  M       6
1887  M       7
1891  M       8
1896  M       6
             ..
2011  F      79
2012  F      75
2013  F      66
2014  F      60
2015  F      53
Name: n, Length: 191, dtype: int64

In [13]:
hilary_name_filter.groupby(['year','sex'])['n'].max()

year  sex
1882  M       7
1883  M       6
1887  M       7
1891  M       8
1896  M       6
             ..
2011  F      79
2012  F      75
2013  F      66
2014  F      60
2015  F      53
Name: n, Length: 191, dtype: int64
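
The `sum()` and `max()` results above are identical because each (year, sex) pair has exactly one row for this name (191 rows, 191 groups). To compare male and female counts side by side, one option is to pivot `sex` into columns. The sketch below uses a small made-up frame (`toy`, hypothetical values) in place of `hilary_name_filter`.

```python
import pandas as pd

# Toy stand-in for the rows where name == "Hilary"; all counts are made up.
toy = pd.DataFrame({
    "year": [1990, 1990, 1991, 1992],
    "sex":  ["F", "M", "F", "F"],
    "name": ["Hilary"] * 4,
    "n":    [100, 5, 120, 90],
})

# Sum counts per (year, sex), then move `sex` from the index into columns.
# (year, sex) pairs that do not occur in the data are filled with 0.
by_year_sex = toy.groupby(["year", "sex"])["n"].sum().unstack("sex", fill_value=0)
print(by_year_sex)
```

Each row is then one year, with one column per sex, which makes trends for the two sexes easier to compare.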

### Task 2. Group-wise operations

### Task 2-1. Count the number of names by year and sex

In [14]:
babynames.groupby(['year','sex'])['name'].count().reset_index()


Unnamed: 0,year,sex,name
0,1880,F,942
1,1880,M,1058
2,1881,F,938
3,1881,M,997
4,1882,F,1028
...,...,...,...
267,2013,M,14026
268,2014,F,19150
269,2014,M,14026
270,2015,F,18993


In [15]:
# Sort the per-(year, sex) name counts in ascending order
babynames.groupby(['year','sex'])['name'].count().sort_values(ascending=True)


year  sex
1881  F        938
1880  F        942
1881  M        997
1882  F       1028
1883  M       1030
             ...  
2010  F      19804
2006  F      20045
2009  F      20169
2008  F      20443
2007  F      20554
Name: name, Length: 272, dtype: int64

In [16]:
# Top 10 (year, sex) groups by number of names
babynames.groupby(['year','sex'])['name'].count().nlargest(10).reset_index()

Unnamed: 0,year,sex,name
0,2007,F,20554
1,2008,F,20443
2,2009,F,20169
3,2006,F,20045
4,2010,F,19804
5,2011,F,19549
6,2012,F,19477
7,2013,F,19203
8,2005,F,19178
9,2014,F,19150
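
In the outputs above, the count column keeps the label `name`, which can be confusing. A possible alternative, sketched here on a made-up frame (`toy` and the column label `n_names` are illustrative assumptions), is to use `size()` and name the result column explicitly.

```python
import pandas as pd

# Toy data with the same columns as `babynames`; values are illustrative only.
toy = pd.DataFrame({
    "year": [1880, 1880, 1880, 1881, 1881],
    "sex":  ["F", "F", "M", "F", "M"],
    "name": ["Ann", "Bea", "Cal", "Ann", "Cal"],
    "n":    [50, 40, 60, 45, 55],
})

# size() counts the rows in each group; reset_index(name=...) labels the count column.
name_counts = toy.groupby(["year", "sex"]).size().reset_index(name="n_names")
print(name_counts)
```

Note that `size()` counts all rows in a group, while `count()` counts only non-null values of the selected column; the two agree when `name` has no missing values.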


### Task 2-2. Calculate the rank of each name within each year and sex combination. Which names were most popular in 1999? (Hint: ranks can be calculated with argsort().)

In [60]:
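# Note: argsort() on the whole column returns the row positions that would sort `n`
# across the entire table; it is not a rank within each (year, sex) group.
babynames["rank_order"] = babynames['n'].values.argsort()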
babynames

Unnamed: 0,year,sex,name,n,prop,rank_order
0,1880,F,Mary,7065,0.072384,1858688
1,1880,F,Anna,2604,0.026679,819156
2,1880,F,Emma,2003,0.020522,819155
3,1880,F,Elizabeth,1939,0.019866,819154
4,1880,F,Minnie,1746,0.017889,819153
...,...,...,...,...,...,...
1858684,2015,M,Zykell,5,0.000002,437157
1858685,2015,M,Zyking,5,0.000002,544606
1858686,2015,M,Zykir,5,0.000002,437156
1858687,2015,M,Zyrus,5,0.000002,441423


In [29]:
babynames[babynames["year"] == 1999].head(1)


Unnamed: 0,year,sex,name,n,prop,rank_order
1304060,1999,F,Emily,26537,0.013638,0
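
A rank within each (year, sex) group needs a group-wise operation. One way to do this, sketched below on a made-up frame (`toy`, hypothetical names and counts), is `groupby(...).rank()`, where rank 1 is the most common name in its group. Applying `argsort()` twice within a group would give equivalent zero-based ranks when there are no ties.

```python
import pandas as pd

# Toy data; names and counts are made up for illustration.
toy = pd.DataFrame({
    "year": [1999, 1999, 1999, 1999, 2000, 2000],
    "sex":  ["F", "F", "M", "M", "F", "F"],
    "name": ["Ann", "Bea", "Cal", "Dan", "Ann", "Bea"],
    "n":    [300, 200, 150, 400, 100, 250],
})

# Rank names by count within each (year, sex) group; the largest n gets rank 1.
# method="min" gives tied names the same (lowest) rank.
toy["rank"] = (
    toy.groupby(["year", "sex"])["n"]
       .rank(method="min", ascending=False)
       .astype(int)
)

# Most popular name for each sex in 1999.
most_popular_1999 = toy[(toy["year"] == 1999) & (toy["rank"] == 1)]
print(most_popular_1999[["sex", "name", "n"]])
```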


### Task 2-3. What are the Top 10 in overall name popularity (in terms of total "n") by "sex"?

In [20]:
all_name_grp_sum = babynames.groupby(["sex","name"])["n"].sum().reset_index()
print(type(all_name_grp_sum))

# apply() passes each group to the lambda as a DataFrame, not as a Series
names_grp_sex = all_name_grp_sum.groupby(["sex"]).apply(lambda x: x.sort_values(["n"],ascending = False)).reset_index(drop=True)
names_grp_sex.groupby(["sex"]).head(10)

# reset_index() turns the grouped result back into a flat DataFrame

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,sex,name,n
0,F,Mary,4118058
1,F,Elizabeth,1610948
2,F,Patricia,1570954
3,F,Jennifer,1464067
4,F,Linda,1451331
5,F,Barbara,1433339
6,F,Margaret,1242141
7,F,Susan,1120810
8,F,Dorothy,1106106
9,F,Sarah,1065265


In [23]:
#df1 =  babynames.groupby(["sex","name"])
#df2= df1.apply(lambda x: x.sort_values(["n"])) # sort with in the group
#df2.reset_index(drop=True)
#df2.head(10)
#all_name_grp_sum.groupby("sex").apply(lambda x: x.nlargest(10, 'n')).reset_index(drop = True) 

Unnamed: 0,sex,name,n
0,F,Mary,4118058
1,F,Elizabeth,1610948
2,F,Patricia,1570954
3,F,Jennifer,1464067
4,F,Linda,1451331
5,F,Barbara,1433339
6,F,Margaret,1242141
7,F,Susan,1120810
8,F,Dorothy,1106106
9,F,Sarah,1065265
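
The same top-N-per-group idea can also be written without `apply`: sort once, then take `head(N)` inside each group. This is a minimal sketch on made-up data (`toy` and its values are assumptions), using N = 2 to keep the output small.

```python
import pandas as pd

# Toy data spanning two years; values are illustrative only.
toy = pd.DataFrame({
    "year": [2000, 2001, 2000, 2001, 2000, 2000, 2001, 2000],
    "sex":  ["F", "F", "F", "F", "F", "M", "M", "M"],
    "name": ["Ann", "Ann", "Bea", "Bea", "Cat", "Dan", "Dan", "Eli"],
    "n":    [10, 20, 5, 5, 50, 7, 8, 30],
})

# Total count per (sex, name) over all years.
totals = toy.groupby(["sex", "name"], as_index=False)["n"].sum()

# Sort by sex, then by total count descending, and keep the first 2 rows per sex.
top2 = (
    totals.sort_values(["sex", "n"], ascending=[True, False])
          .groupby("sex")
          .head(2)
          .reset_index(drop=True)
)
print(top2)
```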


### Task 2-4. What is the proportion of babies having the top 100 names for each year and sex?

In [81]:
# Sort names by rank_order (descending) within each sex/year group
name_100 = babynames.groupby(["sex","year"]).apply(lambda x: x.sort_values(["rank_order"], ascending = False)).reset_index(drop=True)

# Select the top 100 rows within each sex/year group.
# Note: the head(100) result is not assigned, so the sum on the last line
# still covers every name in each group, not only the top 100.
type(name_100)
name_100.groupby(["sex","year"]).head(100)
name_100.groupby(['year', 'sex'])[['prop']].agg(np.sum).reset_index()
#pop_name_100["year"] == "1880"

Unnamed: 0,year,sex,prop
0,1880,F,0.932257
1,1880,M,0.933200
2,1881,F,0.930181
3,1881,M,0.930376
4,1882,F,0.932167
...,...,...,...
267,2013,M,0.935629
268,2014,F,0.912551
269,2014,M,0.936799
270,2015,F,0.914328
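
To restrict the sum to the top names only, the filtered rows have to be kept and then aggregated. The sketch below shows one way to do that on a made-up frame (`toy`, hypothetical proportions), with `top_n = 2` standing in for 100.

```python
import pandas as pd

# Toy data; proportions are made up for illustration.
toy = pd.DataFrame({
    "year": [1880] * 6,
    "sex":  ["F", "F", "F", "M", "M", "M"],
    "name": ["Ann", "Bea", "Cat", "Dan", "Eli", "Fox"],
    "prop": [0.5, 0.3, 0.1, 0.2, 0.4, 0.35],
})

top_n = 2  # would be 100 for the task above

# Sort by proportion, then keep the first top_n rows of each (year, sex) group.
top_rows = (
    toy.sort_values("prop", ascending=False)
       .groupby(["year", "sex"])
       .head(top_n)
)

# Sum the proportions of the retained rows within each group.
top_share = top_rows.groupby(["year", "sex"])["prop"].sum().reset_index()
print(top_share)
```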


### Task 2-5. For each name, find the year in which it was ranked highest and the rank in that year.

In [86]:
namegroup = babynames.groupby(["name", "year"])["rank_order"].max()
namegroup.head(10)

name   year
Aaban  2007    1334780
       2009    1506898
       2010     697488
       2011     357834
       2012      91831
       2013     164862
       2014    1691992
       2015     421393
Aabha  2011     223862
       2012     954422
Name: rank_order, dtype: int64
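
Given a proper per-year rank column (1 = most popular), the best year for each name can be found with `idxmin()`. The following is a sketch on made-up data; the `rank` column and its values are assumptions, and on the full dataset one would probably group by `sex` and `name` together.

```python
import pandas as pd

# Toy data with a precomputed rank per year (1 = most popular); values are made up.
toy = pd.DataFrame({
    "name": ["Ann", "Ann", "Ann", "Bea", "Bea"],
    "year": [1990, 1991, 1992, 1990, 1991],
    "rank": [3, 1, 2, 5, 7],
})

# idxmin() returns, for each name, the row label where its rank is lowest (best).
best_idx = toy.groupby("name")["rank"].idxmin()

# Look those rows up to get the year and the rank achieved in that year.
best = toy.loc[best_idx, ["name", "year", "rank"]].reset_index(drop=True)
print(best)
```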

### Task 2-6. Which name has been in the top 10 most often?

In [92]:
top_10 = babynames[babynames["rank_order"] < 10]
top_10.shape
top_10


Unnamed: 0,year,sex,name,n,prop,rank_order
1816083,2014,M,Yamin,23,1.1e-05,9
1816864,2014,M,Kobie,18,9e-06,8
1818940,2014,M,Janoah,11,5e-06,7
1820146,2014,M,Kamarian,9,4e-06,6
1822066,2014,M,Neomiah,7,3e-06,5
1824811,2014,M,Khylar,5,2e-06,4
1827497,2015,F,Rhiannon,118,6.1e-05,3
1828301,2015,F,Tova,70,3.6e-05,2
1834503,2015,F,Preslyn,13,7e-06,1
1849942,2015,M,Riyad,18,9e-06,0


In [91]:
# Count how many times each (sex, name) pair appears in top_10
top_10.groupby(["sex","name"]).size().reset_index(name="top10").sort_values(by='top10', ascending=False)
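
Counting top-10 appearances depends on a rank computed within each (year, sex) group. A minimal sketch of the counting step, assuming such a `rank` column already exists (the frame `toy`, its values, and `top_k = 2` in place of 10 are all illustrative):

```python
import pandas as pd

# Toy data with a per-(year, sex) rank (1 = most popular); values are made up.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002],
    "sex":  ["F"] * 9,
    "name": ["Ann", "Bea", "Cat", "Ann", "Cat", "Bea", "Bea", "Ann", "Cat"],
    "rank": [1, 2, 3, 1, 2, 3, 1, 2, 3],
})

top_k = 2  # would be 10 for the task above

# Keep only rows where the name was within the top_k of its group.
in_top = toy[toy["rank"] <= top_k]

# Count how many times each (sex, name) pair made the cut.
counts = in_top.groupby(["sex", "name"]).size()

# The pair with the largest number of appearances.
most_frequent = counts.idxmax()
print(counts.sort_values(ascending=False))
print(most_frequent)
```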