# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

In [2]:
baby_names=pd.read_csv('US_Baby_Names_right.csv')
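
The cell above reads a local copy of the file. As a minimal sketch, the same CSV could be read straight from the address given in Step 2. The helper name `load_baby_names` is hypothetical, and `index_col=0` assumes the first column of the file is an unnamed index (which is what would produce the `Unnamed: 0` column seen later). The demonstration runs on a two-row in-memory stand-in, so it works offline.

```python
import io
import pandas as pd

# Address taken from Step 2.
URL = ("https://raw.githubusercontent.com/guipsamora/pandas_exercises/"
       "master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv")

def load_baby_names(source=URL):
    """Read the CSV, using its first (unnamed) column as the index."""
    return pd.read_csv(source, index_col=0)

# Offline stand-in with the assumed layout of the real file.
sample_csv = io.StringIO(
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
)
sample = load_baby_names(sample_csv)
print(sample.columns.tolist())
```

Calling `load_baby_names()` with no argument would fetch the full file over the network.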

### Step 3. Assign it to a variable called baby_names.

In [3]:
baby_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Unnamed: 0  1016395 non-null  int64 
 1   Id          1016395 non-null  int64 
 2   Name        1016395 non-null  object
 3   Year        1016395 non-null  int64 
 4   Gender      1016395 non-null  object
 5   State       1016395 non-null  object
 6   Count       1016395 non-null  int64 
dtypes: int64(4), object(3)
memory usage: 54.3+ MB


### Step 4. See the first 10 entries

In [4]:
baby_names.head()  # head() returns 5 rows by default; use head(10) for the first 10

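| | Unnamed: 0 | Id | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|---|---|
| 0 | 11349 | 11350 | Emma | 2004 | F | AK | 62 |
| 1 | 11350 | 11351 | Madison | 2004 | F | AK | 48 |
| 2 | 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
| 3 | 11352 | 11353 | Grace | 2004 | F | AK | 44 |
| 4 | 11353 | 11354 | Emily | 2004 | F | AK | 41 |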


### Step 5. Delete the column 'Unnamed: 0' and 'Id'

In [5]:
baby_names.drop(['Unnamed: 0','Id'], axis=1,inplace=True)

### Step 6. Are there more male or female names in the dataset?

In [6]:
baby_names['Gender'].value_counts()

F    558846
M    457549
Name: Gender, dtype: int64
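
`value_counts()` counts rows, that is, name/year/state entries per gender. Depending on how the question is read, one might instead want distinct names or total babies per gender. A small sketch of the three readings on made-up toy rows (the values below are not from the dataset):

```python
import pandas as pd

# Toy rows in the same layout as baby_names (values are made up).
toy = pd.DataFrame({
    "Name":   ["Emma", "Emma", "Riley", "Riley", "Jacob"],
    "Gender": ["F", "F", "F", "M", "M"],
    "State":  ["AK", "CA", "CA", "CA", "CA"],
    "Count":  [10, 20, 5, 7, 30],
})

rows_per_gender = toy["Gender"].value_counts()                # number of entries
names_per_gender = toy.groupby("Gender")["Name"].nunique()    # distinct names
births_per_gender = toy.groupby("Gender")["Count"].sum()      # babies recorded

print(rows_per_gender, names_per_gender, births_per_gender, sep="\n\n")
```

In this toy example there are more female rows but more male births, so the three readings can disagree.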

### Step 7. Group the dataset by name and assign to names

In [7]:
# Total count per name, most frequent first, stored as `names` as the step asks
names = baby_names.groupby('Name')[['Count']].sum().sort_values(by='Count', ascending=False)
names.head()

| Name | Count |
|---|---|
| Jacob | 242874 |
| Emma | 214852 |
| Michael | 214405 |
| Ethan | 209277 |
| Isabella | 204798 |


### Step 8. How many different names exist in the dataset?

In [8]:
baby_names['Name'].nunique()

17632

### Step 9. What is the name with most occurrences?

In [9]:
baby_names.groupby('Name')['Count'].sum().nlargest(1)

Name
Jacob    242874
Name: Count, dtype: int64

### Step 10. How many different names have the least occurrences?

In [10]:
# Lists five of the least frequent names; it shows examples rather than counting them
baby_names.groupby('Name')['Count'].sum().nsmallest(5)

Name
Aadarsh    5
Aadin      5
Aaima      5
Aalaya     5
Aaminah    5
Name: Count, dtype: int64
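
The output above lists five names tied at the minimum, but the step asks how many such names there are. One possible way to count them, sketched on made-up toy rows (not the dataset's values): sum the counts per name, then count the names whose total equals the smallest total.

```python
import pandas as pd

# Toy rows (made up); 'Count' is the per-row number of babies.
toy = pd.DataFrame({
    "Name":  ["Emma", "Emma", "Aadin", "Aaima", "Jacob"],
    "Count": [62, 48, 5, 5, 30],
})

totals = toy.groupby("Name")["Count"].sum()        # total occurrences per name
n_rarest = int((totals == totals.min()).sum())     # names tied at the minimum

print(n_rarest)
```

On the real data the same two lines would be applied to `baby_names` instead of `toy`.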

### Step 11. What is the median name occurrence?

In [11]:
# Per-name median of the per-row counts, top 5 names
baby_names.groupby('Name')['Count'].median().nlargest(5)

Name
Emma       279.0
William    263.0
Jacob      254.5
Ethan      245.0
Olivia     228.0
Name: Count, dtype: float64
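
The cell above ranks names by the median of their per-row counts. If the step is read as asking for a single number, the median of the total occurrences per name, a sketch on made-up toy rows could look like this (this reading is an assumption, not necessarily the intended answer):

```python
import pandas as pd

# Toy rows (made up).
toy = pd.DataFrame({
    "Name":  ["Emma", "Emma", "Aadin", "Aaima", "Jacob"],
    "Count": [62, 48, 5, 5, 30],
})

totals = toy.groupby("Name")["Count"].sum()   # one total per name
median_total = totals.median()                # median across names

print(median_total)
```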

### Step 12. What is the standard deviation of names?

In [12]:
# Per-name standard deviation of the per-row counts, top 5 names
baby_names.groupby('Name')['Count'].std().nlargest(5)

Name
Daniel      555.800180
Jacob       541.322290
Anthony     521.744796
Isabella    510.344833
Sophia      490.273144
Name: Count, dtype: float64
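
Similarly, the step could be read as asking for one standard deviation across the per-name totals rather than a ranking. A sketch under that assumption, on made-up toy rows; note that pandas `std()` uses the sample formula (`ddof=1`) by default:

```python
import pandas as pd

# Toy rows (made up).
toy = pd.DataFrame({
    "Name":  ["Emma", "Emma", "Jacob", "Ethan"],
    "Count": [4, 6, 20, 30],
})

totals = toy.groupby("Name")["Count"].sum()   # Emma 10, Ethan 30, Jacob 20
std_total = totals.std()                      # sample standard deviation (ddof=1)

print(std_total)
```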

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [13]:
baby_names.describe(include='all')

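| | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|
| count | 1016395 | 1016395.0 | 1016395 | 1016395 | 1016395.0 |
| unique | 17632 | | 2 | 51 | |
| top | Riley | | F | CA | |
| freq | 1112 | | 558846 | 76781 | |
| mean | | 2009.053 | | | 34.85012 |
| std | | 3.138293 | | | 97.39735 |
| min | | 2004.0 | | | 5.0 |
| 25% | | 2006.0 | | | 7.0 |
| 50% | | 2009.0 | | | 11.0 |
| 75% | | 2012.0 | | | 26.0 |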


In [14]:
baby_names.describe().loc[['mean','min','max','std','25%','50%','75%']]

| | Year | Count |
|---|---|---|
| mean | 2009.05319 | 34.850124 |
| min | 2004.0 | 5.0 |
| max | 2014.0 | 4167.0 |
| std | 3.138293 | 97.397346 |
| 25% | 2006.0 | 7.0 |
| 50% | 2009.0 | 11.0 |
| 75% | 2012.0 | 26.0 |
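
`describe()` summarises every numeric column, including `Year`, where a mean is not very informative. As an alternative sketch, the same statistics can be requested for `Count` alone with `agg` and `quantile`. The toy values below are made up:

```python
import pandas as pd

# Toy counts (made up).
counts = pd.Series([5, 7, 11, 26, 101], name="Count")

basic = counts.agg(["mean", "min", "max", "std"])   # selected summary statistics
quartiles = counts.quantile([0.25, 0.5, 0.75])      # 25%, 50%, 75%

print(basic)
print(quartiles)
```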


In [15]:
baby_names['Year'].value_counts()

2008    94970
2009    94609
2007    94332
2014    94148
2010    93307
2012    93024
2013    92743
2011    92545
2006    91803
2005    88494
2004    86420
Name: Year, dtype: int64

In [51]:
res=baby_names.groupby(['State','Name'])['Count'].sum().sort_values(ascending=False)

In [55]:
st=pd.DataFrame(res)

In [71]:
st.reset_index(inplace=True)

In [79]:
st[['State','Name','Count']]

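| | State | Name | Count |
|---|---|---|---|
| 0 | CA | Daniel | 36772 |
| 1 | CA | Anthony | 34113 |
| 2 | CA | Jacob | 33819 |
| 3 | CA | Angel | 31881 |
| 4 | CA | Isabella | 31492 |
| ... | ... | ... | ... |
| 167623 | FL | Kaniah | 5 |
| 167624 | CO | Judson | 5 |
| 167625 | MN | Abdishakur | 5 |
| 167626 | TX | Ary | 5 |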
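
The table above ranks every State/Name pair by total count. If the goal is the single most frequent name in each state, one possible approach is to take the row with the largest total within each state via `idxmax`. A sketch on made-up toy rows (not the dataset's values):

```python
import pandas as pd

# Toy rows (made up).
toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "TX", "TX"],
    "Name":  ["Daniel", "Daniel", "Jacob", "Jose", "Jacob"],
    "Count": [20, 10, 20, 25, 40],
})

# Total per State/Name pair, kept as ordinary columns.
per_state = toy.groupby(["State", "Name"], as_index=False)["Count"].sum()

# Row label of the largest total within each state, then select those rows.
top_per_state = per_state.loc[per_state.groupby("State")["Count"].idxmax()]

print(top_per_state)
```

If two names tie within a state, `idxmax` keeps only the first one it encounters.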


In [85]:
# State of the largest State/Name total ('CA' for the table above)
st.loc[st['Count'].idxmax(), 'State']