# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [2]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [3]:
baby_names = pd.read_csv("https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv")

In [4]:
baby_names.head()

Unnamed: 0.1,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41


### Step 4. See the first 10 entries

In [None]:
baby_names[:10]

Unnamed: 0.1,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [None]:
# drop() returns a new DataFrame; baby_names itself is unchanged unless the result is reassigned
baby_names.drop(['Unnamed: 0', 'Id'], axis=1)

Unnamed: 0,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
...,...,...,...,...,...
1016390,Seth,2014,M,WY,5
1016391,Spencer,2014,M,WY,5
1016392,Tyce,2014,M,WY,5
1016393,Victor,2014,M,WY,5
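
Because `drop` returns a new DataFrame, the call above displays the trimmed table but does not modify `baby_names`. A minimal sketch of keeping the result, using a small made-up frame (`toy` and `trimmed` are hypothetical names, not part of the exercise):

```python
import pandas as pd

# Hypothetical miniature frame reusing the dataset's column names
toy = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Count": [62, 48],
})

# drop() returns a new object; assign it to keep the change
trimmed = toy.drop(columns=["Unnamed: 0", "Id"])

print(list(toy.columns))      # the original frame still has all four columns
print(list(trimmed.columns))  # only Name and Count remain
```

Reassigning to the same variable (`toy = toy.drop(...)`) would work equally well when the original columns are no longer needed.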


### Step 6. Are there more male or female names in the dataset?

In [None]:
# Alternative: baby_names.Gender.value_counts()

baby_names.groupby("Gender").Gender.count()

Gender
F    558846
M    457549
Name: Gender, dtype: int64

In [None]:
baby_names.Gender.describe()

count     1016395
unique          2
top             F
freq       558846
Name: Gender, dtype: object
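
The figures above count rows, and each row appears to be one name, year, gender and state combination rather than one baby. If the question were about babies instead of rows, summing the `Count` column per gender would be one option. A sketch on made-up data (the values in `toy` are hypothetical):

```python
import pandas as pd

# Hypothetical rows: two female entries, one male entry
toy = pd.DataFrame({
    "Name": ["Emma", "Grace", "Seth"],
    "Gender": ["F", "F", "M"],
    "Count": [62, 44, 200],
})

# Number of rows per gender (what Step 6 measures)
rows_per_gender = toy["Gender"].value_counts()

# Number of babies per gender (sum of the Count column)
babies_per_gender = toy.groupby("Gender")["Count"].sum()

print(rows_per_gender)
print(babies_per_gender)
```

In this toy frame the two measures disagree: there are more female rows but more male babies.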

### Step 7. Group the dataset by name and assign it to a variable called names

In [None]:
names = baby_names.groupby("Name")

In [None]:
# On a GroupBy object, head() returns the first 5 rows of every group, not 5 rows overall
names.head()

Unnamed: 0.1,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
...,...,...,...,...,...,...,...
1004923,5546649,5546650,Gryffin,2014,M,WI,5
1004950,5546676,5546677,Kroy,2014,M,WI,5
1004973,5546699,5546700,Owyn,2014,M,WI,5
1005707,5583654,5583655,Haylea,2005,F,WV,5
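
The long output above follows from how `head()` behaves on a GroupBy object: it keeps the first rows of each group with their original index, instead of producing one row per name. A small sketch with made-up data and `n=1`, contrasted with an aggregation (all names and values below are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Emma", "Grace", "Emma", "Grace", "Emma"],
    "Count": [62, 44, 50, 40, 45],
})
grouped = toy.groupby("Name")

# First row of each group; the original row labels are preserved
first_rows = grouped.head(1)

# One aggregated value per name
totals = grouped["Count"].sum()

print(first_rows)
print(totals)
```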


### Step 8. How many different names exist in the dataset?

In [None]:
# Drop duplicate names, count the remaining values, and print the length
unique_names = baby_names.drop_duplicates(subset="Name")
unique_names_count = unique_names["Name"].value_counts()
print(len(unique_names_count))

17632


In [None]:
# Using the unique() method
unique_names = baby_names.Name.unique()
print(len(unique_names))

17632


In [None]:
# Using nunique()
baby_names.Name.nunique()

17632

### Step 9. What is the name with most occurrences?

In [None]:
# First count how many rows each name appears in
name_count = baby_names["Name"].value_counts()

# Sort the counts in descending order and print the top value
print(name_count.sort_values(ascending=False).head(1))

Riley    1112
Name: Name, dtype: int64


In [None]:
# Another method, using groupby()
baby_names.groupby("Name").Name.count().sort_values(ascending=False).head(1)


Name
Riley    1112
Name: Name, dtype: int64

### Step 10. How many different names have the least occurrences?

In [23]:
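# Lists the five least frequent names; it shows the minimum but does not count how many names share it
baby_names.groupby("Name").Name.count().sort_values().head()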

Name
Katherina    1
Breyona      1
Greidy       1
Shriyan      1
Briah        1
Name: Name, dtype: int64
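
The output above shows that the smallest occurrence is 1, but not how many names have it. One possible way to get that number is to compare every per-name count with the minimum and sum the matches. A sketch on a made-up frame (the variable names and data are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Name": ["Emma", "Emma", "Briah", "Greidy", "Grace", "Grace"]}
)

# Occurrences per name
occurrences = toy["Name"].value_counts()

# Smallest occurrence, then the number of names that share it
least = occurrences.min()
n_least = (occurrences == least).sum()

print(least, n_least)
```

Applied to the real data, the same pattern would use `baby_names["Name"]` in place of `toy["Name"]`.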

### Step 11. What is the median name occurrence?

In [None]:
md_name_occurrence = baby_names.groupby("Name").Name.count().median()

In [None]:
md_name_occurrence

8.0

### Step 12. What is the standard deviation of name occurrences?

In [None]:
std_name_occurrence = baby_names.groupby("Name").Name.count().std()

In [None]:
std_name_occurrence

122.0299635081389

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [None]:
# Summarizes every numeric column of baby_names, including 'Unnamed: 0' and 'Id'
baby_names.describe()

Unnamed: 0.1,Unnamed: 0,Id,Year,Count
count,1016395.0,1016395.0,1016395.0,1016395.0
mean,2830990.0,2830991.0,2009.053,34.85012
std,1652476.0,1652476.0,3.138293,97.39735
min,11349.0,11350.0,2004.0,5.0
25%,1317326.0,1317328.0,2006.0,7.0
50%,2811920.0,2811921.0,2009.0,11.0
75%,4242554.0,4242556.0,2012.0,26.0
max,5647425.0,5647426.0,2014.0,4167.0
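
The summary above covers the raw columns. If the goal is a summary of name occurrences, in line with Steps 11 and 12, `describe()` can be called on the per-name counts instead. A minimal sketch on made-up data (`toy`, `occurrences` and `summary` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame: Emma appears 3 times, Grace twice, Briah once
toy = pd.DataFrame({"Name": ["Emma"] * 3 + ["Grace"] * 2 + ["Briah"]})

# Occurrences per name, computed the same way as in the earlier steps
occurrences = toy.groupby("Name")["Name"].count()

# count, mean, std, min, quartiles and max of those occurrences
summary = occurrences.describe()

print(summary)
```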
