# Case Study - Babies' names in the US from 1880 to 2015

## Learning Objectives:
1. Perform group-wise operations using Pandas
2. Become familiar with Pandas groupby objects
3. Practice the aggregate, filter, and apply functions in Pandas

In [None]:
import numpy as np
import pandas as pd

<i><b>Background</b></i>: The dataset, `babynames.csv`, records every male and female baby name in the US from 1880 to 2015, together with its count ("n") and its proportion ("prop") among all the newborns in that year. We will use this dataset to practice group-wise operations in Pandas.

### Load data

In [17]:
babynames = pd.read_csv("babynames.csv")
babynames.head(10)

Unnamed: 0,year,sex,name,n,prop
0,1880,F,Mary,7065,0.072384
1,1880,F,Anna,2604,0.026679
2,1880,F,Emma,2003,0.020522
3,1880,F,Elizabeth,1939,0.019866
4,1880,F,Minnie,1746,0.017889
5,1880,F,Margaret,1578,0.016167
6,1880,F,Ida,1472,0.015081
7,1880,F,Alice,1414,0.014487
8,1880,F,Bertha,1320,0.013524
9,1880,F,Sarah,1288,0.013196


### Task 1. On Hilary

Let's focus on a particular baby name first.

In [19]:
Hilary = babynames["name"] == "Hilary"
Hilarydb = babynames.loc[Hilary]
Hilarydb

Unnamed: 0,year,sex,name,n,prop
5757,1882,M,Hilary,7,0.000057
7952,1883,M,Hilary,6,0.000053
17221,1887,M,Hilary,7,0.000064
27703,1891,M,Hilary,8,0.000073
42705,1896,M,Hilary,6,0.000046
45867,1897,M,Hilary,5,0.000041
49142,1898,M,Hilary,5,0.000038
62078,1902,M,Hilary,8,0.000060
69355,1904,M,Hilary,5,0.000036
72854,1905,M,Hilary,6,0.000042


### Task 1-1. List the numbers of male and female Hilary for each year

In [20]:
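# Note: value_counts() counts rows, i.e. the number of years in which "Hilary"
# was recorded for each sex, not the number of babies per year.
Hilarydb["sex"].value_counts()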

M    97
F    94
Name: sex, dtype: int64
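The task asks for counts per year, so one way to get there is to sum `n` within each (year, sex) pair and spread `sex` across the columns. The sketch below runs on a small invented frame (`toy`, with made-up counts) rather than on `babynames.csv`; the column names follow the dataset.

```python
import pandas as pd

# Toy rows in the same layout as babynames.csv; the counts are invented.
toy = pd.DataFrame({
    "year": [1990, 1990, 1991, 1992, 1992],
    "sex": ["F", "M", "F", "F", "M"],
    "name": ["Hilary"] * 5,
    "n": [120, 7, 150, 90, 5],
})

hilary = toy[toy["name"] == "Hilary"]

# Sum n within each (year, sex) pair, then move sex into the columns.
# Years with no record for a sex are filled with 0.
hilary_by_year = hilary.groupby(["year", "sex"])["n"].sum().unstack(fill_value=0)
print(hilary_by_year)
```

On the real data, the same two lines applied to `Hilarydb` should give one row per year with an `F` and an `M` column.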

### Task 2. Group-wise operations

### Task 2-1. Count the number of names by year and sex

In [23]:
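# Sum n and prop within each (year, sex) group; selecting the numeric columns
# explicitly keeps the string column "name" out of the sum.
BabyNames_year_sex = babynames.groupby(["year", "sex"])[["n", "prop"]].sum()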
BabyNames_year_sex

year,sex,n,prop
1880,F,90992,0.932257
1880,M,110490,0.933200
1881,F,91953,0.930181
1881,M,100743,0.930376
1882,F,107848,0.932167
1882,M,113686,0.931616
1883,F,112318,0.935523
1883,M,104627,0.930200
1884,F,129020,0.937741
1884,M,114443,0.932409
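The table above holds the total number of babies (`n`) per group. If the task is read literally as counting names, a possible variant is `nunique()` on the `name` column. The sketch below uses an invented frame (`toy`) to show both quantities side by side.

```python
import pandas as pd

# Toy rows in the same layout as babynames.csv; the counts are invented.
toy = pd.DataFrame({
    "year": [1880, 1880, 1880, 1881, 1881],
    "sex": ["F", "F", "M", "F", "M"],
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "n": [70, 26, 96, 69, 87],
})

grouped = toy.groupby(["year", "sex"])

# Number of distinct names recorded in each (year, sex) group.
name_counts = grouped["name"].nunique()

# Total number of babies in each (year, sex) group.
total_births = grouped["n"].sum()

print(pd.DataFrame({"names": name_counts, "births": total_births}))
```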


### Task 2-2. Calculate the ranking of each name for each year and sex combination. Which names were most popular in 1999? (Hint: ranking can be calculated with argsort())

In [24]:
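# Caution: no copy is made here, so the new column is also added to babynames.
babynames2 = babynames
# Caution: argsort() gives the positions that would sort the whole "n" column,
# not a rank within each (year, sex) group.
babynames2["ranking"] = babynames2["n"].values.argsort()
babynames2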

Unnamed: 0,year,sex,name,n,prop,ranking
0,1880,F,Mary,7065,0.072384,1858688
1,1880,F,Anna,2604,0.026679,819156
2,1880,F,Emma,2003,0.020522,819155
3,1880,F,Elizabeth,1939,0.019866,819154
4,1880,F,Minnie,1746,0.017889,819153
5,1880,F,Margaret,1578,0.016167,819152
6,1880,F,Ida,1472,0.015081,819151
7,1880,F,Alice,1414,0.014487,819157
8,1880,F,Bertha,1320,0.013524,819150
9,1880,F,Sarah,1288,0.013196,819148
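A per-group ranking can also be obtained with the groupby `rank` method instead of `argsort()`. This is a swapped-in technique, not the one the hint suggests. The sketch below uses an invented frame (`toy`) and ranks names by `n` within each (year, sex) group, with rank 1 for the largest count.

```python
import pandas as pd

# Toy rows in the same layout as babynames.csv; the counts are invented.
toy = pd.DataFrame({
    "year": [1999, 1999, 1999, 1999, 1999, 2000, 2000],
    "sex": ["F", "F", "F", "M", "M", "F", "F"],
    "name": ["Emily", "Hannah", "Alexis", "Jacob", "Michael", "Emily", "Hannah"],
    "n": [300, 200, 100, 350, 340, 250, 260],
})

# Rank n within each (year, sex) group; the largest n gets rank 1.
# method="min" gives tied names the same (best) rank.
toy["ranking"] = (
    toy.groupby(["year", "sex"])["n"]
    .rank(method="min", ascending=False)
    .astype(int)
)

# Most popular name of each sex in 1999.
top_1999 = toy[(toy["year"] == 1999) & (toy["ranking"] == 1)]
print(top_1999[["sex", "name", "n"]])
```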


### Task 2-3. What are the top 10 names in overall popularity (in terms of total "n") for each "sex"?

In [25]:
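# Caution: head(10) keeps the first ten female rows (all from 1880);
# it does not total n per name.
babynames2_F = babynames2["sex"] == "F"
babynames2_FDB = babynames2[babynames2_F].head(10)
babynames2_FDB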

Unnamed: 0,year,sex,name,n,prop,ranking
0,1880,F,Mary,7065,0.072384,1858688
1,1880,F,Anna,2604,0.026679,819156
2,1880,F,Emma,2003,0.020522,819155
3,1880,F,Elizabeth,1939,0.019866,819154
4,1880,F,Minnie,1746,0.017889,819153
5,1880,F,Margaret,1578,0.016167,819152
6,1880,F,Ida,1472,0.015081,819151
7,1880,F,Alice,1414,0.014487,819157
8,1880,F,Bertha,1320,0.013524,819150
9,1880,F,Sarah,1288,0.013196,819148


In [27]:
# Caution: head(10) keeps the first ten male rows (all from 1880);
# it does not total n per name.
babynames2_M = babynames2["sex"] == "M"
babynames2_MDB = babynames2[babynames2_M].head(10)
babynames2_MDB

Unnamed: 0,year,sex,name,n,prop,ranking
942,1880,M,John,9655,0.081546,818935
943,1880,M,William,9531,0.080499,818941
944,1880,M,James,5927,0.05006,818934
945,1880,M,Charles,5348,0.045169,818932
946,1880,M,George,5126,0.043294,818931
947,1880,M,Frank,3242,0.027382,818930
948,1880,M,Joseph,2632,0.02223,818929
949,1880,M,Thomas,2534,0.021402,818928
950,1880,M,Henry,2444,0.020642,818927
951,1880,M,Robert,2415,0.020397,818933
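Overall popularity needs the total of `n` across all years for each name, which the two cells above do not compute. One possible approach is to sum `n` per (sex, name), sort, and keep the first rows of each sex. The sketch uses an invented frame (`toy`) and a hypothetical `TOP_K` of 2 so the result is easy to check; on the real data `TOP_K` would be 10.

```python
import pandas as pd

# Toy rows in the same layout as babynames.csv; the counts are invented.
toy = pd.DataFrame({
    "year": [1880] * 6 + [1881] * 6,
    "sex": ["F", "F", "F", "M", "M", "M"] * 2,
    "name": ["Mary", "Anna", "Emma", "John", "William", "James"] * 2,
    "n": [70, 26, 20, 96, 95, 59, 65, 27, 60, 87, 80, 50],
})

TOP_K = 2  # hypothetical value for the toy data; use 10 for the task

# Total n per (sex, name) over all years.
totals = toy.groupby(["sex", "name"], as_index=False)["n"].sum()

# Sort by total within each sex, then keep the first TOP_K rows per sex.
top_names = (
    totals.sort_values(["sex", "n"], ascending=[True, False])
    .groupby("sex")
    .head(TOP_K)
)
print(top_names)
```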


### Task 2-4. What is the proportion of babies having the top 100 names for each year and sex?

In [28]:
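# Caution: babynames2_FDB holds only ten rows from 1880, so this sums prop
# over those ten names, not over the top 100 names of each year.
newf = babynames2_FDB.iloc[0:100, :]
newf["prop"].sum()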

0.2297959100036884

In [29]:
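# Caution: babynames2_MDB holds only ten rows from 1880, so this sums prop
# over those ten names, not over the top 100 names of each year.
newm = babynames2_MDB.iloc[0:100, :]
newm["prop"].sum()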

0.41262172822405596
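To get one figure per year and sex, the sum has to be taken inside each group. A minimal sketch with `apply`, on an invented frame (`toy`) and a hypothetical `TOP_K` of 2 (100 for the actual task), could look like this:

```python
import pandas as pd

# Toy rows in the same layout as babynames.csv; the proportions are invented.
toy = pd.DataFrame({
    "year": [1880, 1880, 1880, 1880, 1880, 1880, 1881, 1881, 1881],
    "sex": ["F", "F", "F", "M", "M", "M", "F", "F", "F"],
    "name": ["Mary", "Anna", "Emma", "John", "William", "James",
             "Mary", "Anna", "Emma"],
    "prop": [0.07, 0.03, 0.02, 0.08, 0.08, 0.05, 0.06, 0.04, 0.01],
})

TOP_K = 2  # hypothetical value for the toy data; use 100 for the task

# Within each (year, sex) group, take the TOP_K largest proportions and add them.
top_share = toy.groupby(["year", "sex"])["prop"].apply(
    lambda s: s.nlargest(TOP_K).sum()
)
print(top_share)
```

The result is a Series indexed by (year, sex), so `top_share.unstack()` would place the two sexes side by side for each year.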

### Task 2-5. For each name, find the year in which it was ranked highest and the rank in that year.

In [30]:
babynames3 = babynames
babynames3

Unnamed: 0,year,sex,name,n,prop,ranking
0,1880,F,Mary,7065,0.072384,1858688
1,1880,F,Anna,2604,0.026679,819156
2,1880,F,Emma,2003,0.020522,819155
3,1880,F,Elizabeth,1939,0.019866,819154
4,1880,F,Minnie,1746,0.017889,819153
5,1880,F,Margaret,1578,0.016167,819152
6,1880,F,Ida,1472,0.015081,819151
7,1880,F,Alice,1414,0.014487,819157
8,1880,F,Bertha,1320,0.013524,819150
9,1880,F,Sarah,1288,0.013196,819148


In [33]:
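# Caution: this takes the maximum of the argsort-based "ranking" column per
# (name, year); it does not pick out the best year of each name.
namegroup = babynames3.groupby(["name", "year"])["ranking"].max()
namegroup.head(100)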

name       year
Aaban      2007    1334780
           2009    1506898
           2010     697488
           2011     357834
           2012      91831
           2013     164862
           2014    1691992
           2015     421393
Aabha      2011     223862
           2012     954422
           2014     580050
           2015    1362472
Aabid      2003     562444
Aabriella  2008      10660
           2014    1410998
           2015     115992
Aada       2015     362379
Aadam      1987    1088820
           1988    1229282
           1993    1525948
           1994    1338507
           1995    1398377
           1996    1138146
           1997     453679
           1998    1027231
           1999     955964
           2000     102715
           2002      11924
           2003    1147841
           2004    1426573
                    ...   
Aaden      2005    1382481
           2006     904563
           2007     573792
           2008    1213737
           2009    1351504
           2

In [37]:
namegroup["Annie"]

year
1880    1210158
1881     819634
1882     818790
1883    1538232
1884     436672
1885    1184955
1886    1186149
1887     901925
1888    1504460
1889    1505036
1890    1238113
1891    1221613
1892    1221484
1893    1211388
1894    1517594
1895     913740
1896    1606183
1897    1608148
1898    1608328
1899    1133022
1900     801003
1901    1586210
1902    1570853
1903    1573374
1904     471742
1905    1587599
1906    1409100
1907     370752
1908    1302858
1909    1409908
         ...   
1986    1177679
1987     291682
1988    1229426
1989    1800323
1990    1369382
1991     946650
1992     350450
1993    1242637
1994     332375
1995     945884
1996    1384375
1997     605911
1998     271813
1999     140297
2000    1026984
2001    1491071
2002     506053
2003     610738
2004     198009
2005    1352461
2006     769161
2007    1025347
2008      62882
2009    1610079
2010     585367
2011     175702
2012     472784
2013    1826490
2014     482997
2015     560872
Name: ranking, Leng
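The values above come from the `argsort()`-based column, so they are not ranks. One possible way to answer the task is to compute a rank within each (year, sex) group and then use `idxmin()` to find the row where each name reaches its best (smallest) rank. The sketch uses an invented frame (`toy`) and treats the same name under different sexes as separate entries, which is an assumption.

```python
import pandas as pd

# Toy rows in the same layout as babynames.csv; the counts are invented.
toy = pd.DataFrame({
    "year": [1880, 1880, 1881, 1881, 1882, 1882],
    "sex": ["F"] * 6,
    "name": ["Mary", "Anna", "Mary", "Anna", "Mary", "Anna"],
    "n": [70, 26, 60, 65, 80, 30],
})

# Rank within each (year, sex) group; rank 1 is the most common name.
toy["ranking"] = (
    toy.groupby(["year", "sex"])["n"]
    .rank(method="min", ascending=False)
    .astype(int)
)

# Index label of the row with the smallest rank for each (sex, name).
# If a name reaches its best rank in several years, the first one is kept.
best_idx = toy.groupby(["sex", "name"])["ranking"].idxmin()
best = toy.loc[best_idx, ["sex", "name", "year", "ranking"]]
print(best)
```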

### Task 2-6. Which name has been in the top 10 most often?

In [35]:
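# Caution: this takes the maximum of the argsort-based "ranking" column per
# (year, name); it does not select the top 10 names of each year.
top10 = babynames3.groupby(["year", "name"])["ranking"].max()
top10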

year  name    
1880  Aaron        212257
      Ab          1210174
      Abbie        819023
      Abbott      1210180
      Abby         211998
      Abe          437059
      Abel        1538765
      Abigail      819278
      Abner        437042
      Abraham      212369
      Abram       1210587
      Ada          819161
      Adah         819408
      Adaline      819402
      Adam         212265
      Adda         819286
      Addie       1538994
      Addison     1210547
      Adela        212138
      Adelaide     818972
      Adelbert     437037
      Adele        819098
      Adelia       819526
      Adeline      818986
      Adella       819461
      Adelle       212033
      Aden        1538877
      Adina        212026
      Adline       212097
      Adolf       1210166
                   ...   
2015  Zymarion     196869
      Zymeir      1441625
      Zymere      1657733
      Zymiah       169529
      Zymier       468070
      Zymir        784672
      Zymira       6211

In [36]:
# idxmax() returns the label (the name) with the most rows in the dataset.
babynames3['name'].value_counts().idxmax()

'Jean'
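The result above is the name with the most rows in the dataset, which is not the same as the name that was most often in the top 10. A possible sketch for the latter: rank within each (year, sex) group, keep the rows whose rank is within the cutoff, and count how often each name remains. The frame (`toy`) is invented and the cutoff `TOP_K` is set to 1 to keep the example small; the task itself would use 10.

```python
import pandas as pd

# Toy rows in the same layout as babynames.csv; the counts are invented.
toy = pd.DataFrame({
    "year": [1880, 1880, 1881, 1881, 1882, 1882],
    "sex": ["F"] * 6,
    "name": ["Mary", "Anna", "Mary", "Anna", "Mary", "Anna"],
    "n": [70, 26, 60, 65, 80, 30],
})

TOP_K = 1  # hypothetical value for the toy data; use 10 for the task

# Rank within each (year, sex) group; rank 1 is the most common name.
toy["ranking"] = toy.groupby(["year", "sex"])["n"].rank(
    method="min", ascending=False
)

# Keep only the rows inside the cutoff, then count appearances per name.
in_top = toy[toy["ranking"] <= TOP_K]
appearances = in_top["name"].value_counts()

print(appearances)
print(appearances.idxmax())
```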