# Case Study - Babies' names in the US from 1880 to 2015

## Learning Objectives:
1. Perform group-wise operations using Pandas
2. Become familiar with Pandas's groupby objects
3. Practice the aggregate, filter and apply functions in Pandas

In [1]:
import numpy as np
import pandas as pd

<i><b>Background</b></i>: The dataset, `babynames.csv`, records all the male and female baby names in the US from 1880 to 2015, together with each name's count ("n") and proportion ("prop") among all the newborns in that year. We will use this dataset to practice group-wise operations using Pandas.

### Load data

In [2]:
babynames = pd.read_csv("babynames.csv")

df = babynames

In [3]:
df.head(10)

,year,sex,name,n,prop
0,1880,F,Mary,7065,0.072384
1,1880,F,Anna,2604,0.026679
2,1880,F,Emma,2003,0.020522
3,1880,F,Elizabeth,1939,0.019866
4,1880,F,Minnie,1746,0.017889
5,1880,F,Margaret,1578,0.016167
6,1880,F,Ida,1472,0.015081
7,1880,F,Alice,1414,0.014487
8,1880,F,Bertha,1320,0.013524
9,1880,F,Sarah,1288,0.013196


### Task 1. On Hilary

Let's focus on a particular baby name first.

In [4]:
babynames_Hilary = babynames.loc[babynames['name'].str.contains('Hilary', regex=True)]



### Task 1-1. List the number of male and female babies named Hilary for each year

In [5]:
babynames_Hilary

,year,sex,name,n,prop
5757,1882,M,Hilary,7,0.000057
7952,1883,M,Hilary,6,0.000053
17221,1887,M,Hilary,7,0.000064
27703,1891,M,Hilary,8,0.000073
42705,1896,M,Hilary,6,0.000046
...,...,...,...,...,...
1694105,2011,F,Hilary,79,0.000041
1728086,2012,F,Hilary,75,0.000039
1761989,2013,F,Hilary,66,0.000034
1795403,2014,F,Hilary,60,0.000031


In [6]:
babynames_Hilary["sex"].value_counts()

M    97
F    94
Name: sex, dtype: int64
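The cells above display the matching rows but do not yet put the male and female counts side by side. One possible way to do that is a pivot table, sketched here on a few invented rows (the counts and the name `Hilaryann` are made up for illustration, not taken from `babynames.csv`). The sketch also uses an exact match, since `str.contains` would keep any longer name that merely contains "Hilary".

```python
import pandas as pd

# Toy rows with invented counts, laid out like babynames.csv.
toy = pd.DataFrame({
    "year": [1990, 1990, 1991, 1991, 1991],
    "sex":  ["F", "M", "F", "M", "F"],
    "name": ["Hilary", "Hilary", "Hilary", "Hilary", "Hilaryann"],
    "n":    [100, 10, 120, 8, 5],
})

# Exact match: keeps "Hilary" only, drops the hypothetical "Hilaryann".
hilary = toy[toy["name"] == "Hilary"]

# One row per year, one column per sex, each cell holding the count "n".
hilary_by_year = hilary.pivot_table(
    index="year", columns="sex", values="n", aggfunc="sum", fill_value=0
)
print(hilary_by_year)
```

On the full dataset the same two steps would apply to `babynames` in place of `toy`.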

### Task 2. Group-wise operations

### Task 2-1. Count the number of names by year and sex

In [7]:
df.groupby(['name','year', 'sex']).sum()



name,year,sex,n,prop
Aaban,2007,M,5,0.000002
Aaban,2009,M,6,0.000003
Aaban,2010,M,9,0.000004
Aaban,2011,M,11,0.000005
Aaban,2012,M,11,0.000005
...,...,...,...,...
Zyvion,2009,M,5,0.000002
Zyvon,2015,M,6,0.000003
Zyyanna,2010,F,6,0.000003
Zyyon,2014,M,6,0.000003
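Grouping by `name`, `year` and `sex` and summing returns one row per original record, so it does not count names. A minimal sketch of one way to count distinct names per year and sex, using a small invented frame (the names and counts are placeholders, not real data):

```python
import pandas as pd

# Toy rows with invented values, laid out like babynames.csv.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "sex":  ["F", "F", "M", "F", "M"],
    "name": ["Ann", "Bea", "Cal", "Ann", "Cal"],
    "n":    [50, 30, 40, 45, 42],
})

# Number of distinct names in each (year, sex) group.
names_per_group = toy.groupby(["year", "sex"])["name"].nunique()
print(names_per_group)

# Optional: years as rows, sexes as columns.
wide = names_per_group.unstack("sex")
print(wide)
```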


### Task 2-2. Calculate ranking of each name for each year and sex combination. Which names were most popular in 1999? (Hint: ranking can be calculated with argsort())

In [8]:
df1 = df  # df1 is another name for df (not a copy), so df gains the column too
df1["ranking"] = df1["n"].values.argsort()
df1

,year,sex,name,n,prop,ranking
0,1880,F,Mary,7065,0.072384,1858688
1,1880,F,Anna,2604,0.026679,819156
2,1880,F,Emma,2003,0.020522,819155
3,1880,F,Elizabeth,1939,0.019866,819154
4,1880,F,Minnie,1746,0.017889,819153
...,...,...,...,...,...,...
1858684,2015,M,Zykell,5,0.000002,437157
1858685,2015,M,Zyking,5,0.000002,544606
1858686,2015,M,Zykir,5,0.000002,437156
1858687,2015,M,Zyrus,5,0.000002,441423
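Note that `argsort()` returns the positions that would sort the array, not the rank of each element, and the call above is applied to the whole column rather than within each year and sex. The sketch below shows, on invented numbers, how a second `argsort()` turns sort positions into ranks, and how `groupby(...).rank()` might be used to rank within each group. The toy names and counts are placeholders.

```python
import numpy as np
import pandas as pd

n = np.array([50, 80, 40, 70])
order = n.argsort()                          # positions that sort n ascending
ranks_desc = (-n).argsort().argsort() + 1    # rank 1 = largest value

# Toy rows with invented values, laid out like babynames.csv.
toy = pd.DataFrame({
    "year": [1999, 1999, 1999, 1999, 2000, 2000],
    "sex":  ["F", "F", "M", "M", "F", "F"],
    "name": ["Ann", "Bea", "Cal", "Dan", "Ann", "Bea"],
    "n":    [50, 80, 40, 70, 20, 10],
})

# Rank within each (year, sex) group; rank 1 is the most common name.
toy["rank"] = (
    toy.groupby(["year", "sex"])["n"]
       .rank(method="min", ascending=False)
       .astype(int)
)

# Most popular name of each sex in 1999.
top_1999 = toy[(toy["year"] == 1999) & (toy["rank"] == 1)]
print(top_1999)
```

With `method="min"`, tied names share the better rank; other tie rules are available through the same parameter.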


### Task 2-3. What are the top 10 names in overall popularity (in terms of total "n") for each "sex"?

In [14]:
df1_Female = df["sex"] == "F"   # Boolean mask selecting the female rows
df1_F = df[df1_Female].head(10)  # first 10 female rows in file order

df1_F

,year,sex,name,n,prop,ranking
0,1880,F,Mary,7065,0.072384,1858688
1,1880,F,Anna,2604,0.026679,819156
2,1880,F,Emma,2003,0.020522,819155
3,1880,F,Elizabeth,1939,0.019866,819154
4,1880,F,Minnie,1746,0.017889,819153


In [15]:
df1_Male = df["sex"] == "M"   # Boolean mask selecting the male rows
df1_M = df[df1_Male].head(10)  # first 10 male rows in file order

df1_M

,year,sex,name,n,prop,ranking
942,1880,M,John,9655,0.081546,818935
943,1880,M,William,9531,0.080499,818941
944,1880,M,James,5927,0.05006,818934
945,1880,M,Charles,5348,0.045169,818932
946,1880,M,George,5126,0.043294,818931
947,1880,M,Frank,3242,0.027382,818930
948,1880,M,Joseph,2632,0.02223,818929
949,1880,M,Thomas,2534,0.021402,818928
950,1880,M,Henry,2444,0.020642,818927
951,1880,M,Robert,2415,0.020397,818933
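`head(10)` returns the first ten rows in file order, which here are the 1880 records, not the names with the largest totals across all years. A possible approach is to sum `n` per sex and name first and then keep the largest totals within each sex. The sketch below uses invented rows and a hypothetical `TOP_K = 2` so the result is easy to check by hand.

```python
import pandas as pd

TOP_K = 2  # assumption for the toy example; use 10 on the full dataset

# Toy rows with invented values, laid out like babynames.csv.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2000, 2000],
    "sex":  ["F", "F", "M", "M", "F", "F", "M", "M", "F", "M"],
    "name": ["Ann", "Bea", "Cal", "Dan", "Ann", "Bea", "Cal", "Dan", "Eve", "Fox"],
    "n":    [50, 30, 40, 70, 45, 60, 42, 10, 5, 3],
})

# Total count per (sex, name) over all years.
totals = toy.groupby(["sex", "name"], as_index=False)["n"].sum()

# Sort by total within each sex, then keep the first TOP_K rows per sex.
top = (
    totals.sort_values(["sex", "n"], ascending=[True, False])
          .groupby("sex")
          .head(TOP_K)
)
print(top)
```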


### Task 2-4. What is the proportion of babies having the top 100 names for each year and sex?

In [26]:
# df1_F holds only 10 rows, so this slice returns all of them;
# on a longer frame, 0:99 would select 99 rows, not 100
Female_Top_100 = df1_F.iloc[0:99, :]
Female_Top_100["prop"].sum()


0.2297959100036884

In [61]:
# df1_M holds only 10 rows, so this slice returns all of them;
# on a longer frame, 0:99 would select 99 rows, not 100
Male_Top_100 = df1_M.iloc[0:99, :]
Male_Top_100["prop"].sum()

0.41262172822405596
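The two sums above cover the first ten 1880 rows of each sex, so they are not yet a per-year answer. One way to get the share held by the top names in every year and sex is `groupby` followed by `apply`, sketched here with invented proportions and a hypothetical `TOP_K = 2`:

```python
import pandas as pd

TOP_K = 2  # assumption for the toy example; use 100 on the full dataset

# Toy rows with invented proportions, laid out like babynames.csv.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000, 2000],
    "sex":  ["F", "F", "F", "M", "M"],
    "name": ["Ann", "Bea", "Eve", "Cal", "Dan"],
    "prop": [0.5, 0.3, 0.2, 0.6, 0.4],
})

# For each (year, sex) group, add up the TOP_K largest proportions.
top_share = (
    toy.groupby(["year", "sex"])["prop"]
       .apply(lambda s: s.nlargest(TOP_K).sum())
)
print(top_share)
```

The result is a Series indexed by year and sex, so it can be unstacked or plotted against year directly.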

### Task 2-5. For each name, find the year in which it was ranked highest and the rank in that year.

In [19]:
namegroup = df.groupby(["name", "year"])["ranking"].max()

In [20]:
namegroup.head(100)

name      year
Aaban     2007    1334780
          2009    1506898
          2010     697488
          2011     357834
          2012      91831
                   ...   
Aadhavan  2013     447728
          2015     526271
Aadhi     2013    1393091
          2014      73160
          2015     526302
Name: ranking, Length: 100, dtype: int64

In [21]:
namegroup["Mary"]

year
1880    1858688
1881     819641
1882     818763
1883    1537830
1884     902525
         ...   
2011    1321920
2012    1826524
2013      12828
2014    1392817
2015    1657874
Name: ranking, Length: 136, dtype: int64
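The `ranking` column used above comes from a global `argsort()`, so its maximum per name and year is not a popularity rank. A minimal sketch of an alternative, on invented rows: compute the rank within each year and sex, then use `idxmin()` to pick, for each name, the row where that rank is best. Treating a name separately for each sex is an assumption made here, since the task statement does not say how to handle names given to both sexes.

```python
import pandas as pd

# Toy rows with invented values, laid out like babynames.csv.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2002, 2002],
    "sex":  ["F", "F", "F", "F", "F", "F"],
    "name": ["Ann", "Bea", "Ann", "Bea", "Ann", "Bea"],
    "n":    [50, 80, 90, 60, 10, 20],
})

# Rank within each (year, sex) group; rank 1 is the most common name.
toy["rank"] = (
    toy.groupby(["year", "sex"])["n"]
       .rank(method="min", ascending=False)
       .astype(int)
)

# Row label of the best (smallest) rank for each (sex, name).
best_idx = toy.groupby(["sex", "name"])["rank"].idxmin()
best_rows = toy.loc[best_idx, ["sex", "name", "year", "rank"]]
print(best_rows)
```

When a name reaches its best rank in several years, `idxmin()` keeps the first such row, which is the earliest year if the data are sorted by year.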

### Task 2-6. Which name has been in the top 10 most often?

In [44]:
top_10 = df.groupby(["year", "name"])["ranking"].max()
top_10


year  name  
1880  Aaron      212257
      Ab        1210174
      Abbie      819023
      Abbott    1210180
      Abby       211998
                 ...   
2015  Zyrion    1222790
      Zyron     1711352
      Zyrus      441423
      Zyus       431053
      Zyvon     1441624
Name: ranking, Length: 1695502, dtype: int64

In [60]:
df3 = df1
df3['name'].value_counts().argmax()  # integer position of the largest count, not the name


0

In [None]:
## popular_name = df['name'].value_counts().argmax()
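`value_counts()` sorts its result in descending order, so `argmax()` on it returns position 0, and counting rows per name measures how many records a name has, not how often it reached the top 10. A possible approach, sketched on invented rows with a hypothetical `TOP_K = 1`: rank within each year and sex, keep the rows inside the top `TOP_K`, and count how often each name appears. `idxmax()` then returns the name itself.

```python
import pandas as pd

TOP_K = 1  # assumption for the toy example; use 10 on the full dataset

# Toy rows with invented values, laid out like babynames.csv.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2002, 2002],
    "sex":  ["F", "F", "F", "F", "F", "F"],
    "name": ["Ann", "Bea", "Ann", "Bea", "Ann", "Bea"],
    "n":    [50, 80, 90, 60, 10, 20],
})

# Rank within each (year, sex) group; rank 1 is the most common name.
toy["rank"] = (
    toy.groupby(["year", "sex"])["n"]
       .rank(method="min", ascending=False)
       .astype(int)
)

# Keep only the rows that made the top TOP_K, then count appearances per name.
in_top = toy[toy["rank"] <= TOP_K]
times_in_top = in_top["name"].value_counts()

# idxmax() returns the index label (the name), unlike argmax().
most_frequent = times_in_top.idxmax()
print(most_frequent, times_in_top[most_frequent])
```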