# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [None]:
!pip install pandas 
!pip install numpy

In [1]:
import pandas as pd 
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/Laxminarayen/Inceptz-Batch13-Analytics_and_Python/master/08%20-%20Day%20-%208%20-%20Python%20Quiz%20Session/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [2]:
baby_names = pd.read_csv("https://raw.githubusercontent.com/Laxminarayen/Inceptz-Batch13-Analytics_and_Python/master/08%20-%20Day%20-%208%20-%20Python%20Quiz%20Session/06_Stats/US_Baby_Names/US_Baby_Names_right.csv")
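
If the index-like columns are not needed at all, they can be skipped while loading instead of being dropped afterwards. The sketch below assumes the raw file has a blank first header cell (which pandas names `Unnamed: 0`) followed by `Id`; the in-memory `sample_csv` is a hypothetical stand-in for the remote file, built from rows shown in the Step 4 output.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the remote CSV. The rows come from the
# head() output in Step 4; the blank first header cell is an assumption.
sample_csv = io.StringIO(
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
    "11351,11352,Hannah,2004,F,AK,46\n"
)

# usecols keeps only the listed columns, so 'Unnamed: 0' and 'Id' never load.
wanted = ["Name", "Year", "Gender", "State", "Count"]
sample = pd.read_csv(sample_csv, usecols=wanted)
print(sample.columns.tolist())
```

The same `usecols` argument should work with the URL used in this step, provided the column names match.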

### Step 4. See the first 10 entries

In [5]:
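# head(10) would show exactly the first 10 entries; head(100) was run here, so the output below is truncated
baby_names.head(100)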

Unnamed: 0.1,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
...,...,...,...,...,...,...,...
95,11444,11445,Christina,2004,F,AK,8
96,11445,11446,Clara,2004,F,AK,8
97,11446,11447,Delaney,2004,F,AK,8
98,11447,11448,Gabrielle,2004,F,AK,8


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [6]:
baby_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Unnamed: 0  1016395 non-null  int64 
 1   Id          1016395 non-null  int64 
 2   Name        1016395 non-null  object
 3   Year        1016395 non-null  int64 
 4   Gender      1016395 non-null  object
 5   State       1016395 non-null  object
 6   Count       1016395 non-null  int64 
dtypes: int64(4), object(3)
memory usage: 54.3+ MB


In [10]:
# one way to drop a column: DataFrame.drop
baby_names.drop(columns=['Id'], inplace=True)

In [11]:
baby_names

Unnamed: 0.1,Unnamed: 0,Name,Year,Gender,State,Count
0,11349,Emma,2004,F,AK,62
1,11350,Madison,2004,F,AK,48
2,11351,Hannah,2004,F,AK,46
3,11352,Grace,2004,F,AK,44
4,11353,Emily,2004,F,AK,41
...,...,...,...,...,...,...
1016390,5647421,Seth,2014,M,WY,5
1016391,5647422,Spencer,2014,M,WY,5
1016392,5647423,Tyce,2014,M,WY,5
1016393,5647424,Victor,2014,M,WY,5


In [14]:
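# second way to drop a column: the del statement
# (running this cell a second time raises the KeyError shown below, because the column is already gone)
del baby_names['Unnamed: 0']

KeyError: 'Unnamed: 0'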

In [15]:
baby_names

Unnamed: 0,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
...,...,...,...,...,...
1016390,Seth,2014,M,WY,5
1016391,Spencer,2014,M,WY,5
1016392,Tyce,2014,M,WY,5
1016393,Victor,2014,M,WY,5
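
The `KeyError` in this step comes from deleting a column that no longer exists. One way to make the cell safe to re-run is to pass `errors="ignore"` to `DataFrame.drop`. The frame `demo` below is a small hypothetical stand-in for `baby_names`, not the real data.

```python
import pandas as pd

# Hypothetical miniature of baby_names with the two columns to remove.
demo = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Count": [62, 48],
})

# Drop both columns in one call; errors="ignore" skips labels that are missing.
cleaned = demo.drop(columns=["Unnamed: 0", "Id"], errors="ignore")

# Running the same drop again is a no-op instead of raising KeyError.
cleaned_again = cleaned.drop(columns=["Unnamed: 0", "Id"], errors="ignore")
print(cleaned_again.columns.tolist())
```

Returning a new frame instead of using `inplace=True` also keeps the original available if a later cell needs it.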


### Step 6. Are there more male or female names in the dataset?

In [17]:
baby_names['Gender'].value_counts()

F    558846
M    457549
Name: Gender, dtype: int64

In [19]:
baby_names['Year'].value_counts()

2008    94970
2009    94609
2007    94332
2014    94148
2010    93307
2012    93024
2013    92743
2011    92545
2006    91803
2005    88494
2004    86420
Name: Year, dtype: int64

In [30]:
# value_counts() already returns the counts in descending order
baby_names['Name'].value_counts()

Riley      1112
Avery      1080
Jordan     1073
Peyton     1064
Hayden     1049
           ... 
Brittni       1
Lilou         1
Surveen       1
Miral         1
Brennyn       1
Name: Name, Length: 17632, dtype: int64
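
Note that `value_counts()` on `Gender` counts rows, and each row is one name in one state and year. Depending on how the question is read, the number of distinct names per gender or the total of `Count` per gender may be the more relevant figure. The sketch below contrasts the three on a hypothetical toy frame; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data in the same shape as baby_names (values are invented).
demo = pd.DataFrame({
    "Name":   ["Emma", "Emma", "Riley", "Riley", "Jacob"],
    "Gender": ["F",    "F",    "F",     "M",     "M"],
    "State":  ["AK",   "WY",   "AK",    "AK",    "AK"],
    "Count":  [62,     10,     7,       9,       80],
})

# Rows per gender: what value_counts() reports.
rows_per_gender = demo["Gender"].value_counts()

# Distinct names per gender: each name counted once per gender.
names_per_gender = demo.groupby("Gender")["Name"].nunique()

# Babies per gender: sum of the Count column.
births_per_gender = demo.groupby("Gender")["Count"].sum()

print(rows_per_gender, names_per_gender, births_per_gender, sep="\n\n")
```

In this toy frame the three measures disagree: there are more female rows, the distinct-name counts are equal, and more male births.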

### Step 7. Group the dataset by name and assign to names

In [24]:
baby_names.drop(columns=['Year'],inplace=True)

In [34]:
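# sum only the numeric Count column; in recent pandas versions, summing the
# Gender and State columns as well would concatenate their strings
names = baby_names.groupby('Name')[['Count']].sum()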

In [35]:
names.sort_values("Count",ascending = False)

Unnamed: 0_level_0,Count
Name,Unnamed: 1_level_1
Jacob,242874
Emma,214852
Michael,214405
Ethan,209277
Isabella,204798
...,...
Eniola,5
Atlantis,5
Marci,5
Simarpreet,5


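In [33]:
# inspect the per-name maximum without overwriting names (the later steps use the summed counts)
# max() works column by column, so Gender and State are the alphabetically last values,
# not those of the row with the highest Count
baby_names.groupby(by=['Name']).max()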

Unnamed: 0_level_0,Gender,State,Count
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Aaban,M,NY,6
Aadan,M,TX,7
Aadarsh,M,IL,5
Aaden,M,WV,158
Aadhav,M,CA,6
...,...,...,...
Zyra,F,TX,8
Zyrah,F,TX,6
Zyren,M,TX,6
Zyria,F,TX,7
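
When both the total and the largest single entry per name are of interest, they can be computed together by aggregating only the `Count` column. This is a minimal sketch on hypothetical data; the output column names `total` and `largest` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data (values are invented).
demo = pd.DataFrame({
    "Name":  ["Emma", "Emma", "Jacob", "Jacob"],
    "State": ["AK",   "WY",   "AK",    "WY"],
    "Count": [62,     10,     80,      5],
})

# Named aggregation: each keyword becomes an output column.
per_name = demo.groupby("Name")["Count"].agg(total="sum", largest="max")
print(per_name)
```

Selecting `Count` before aggregating keeps the text columns out of the result.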


### Step 8. How many different names exist in the dataset?

In [36]:
len(names)

17632

In [37]:
baby_names['Name'].nunique()

17632

### Step 9. What is the name with the most occurrences?

In [42]:
names.loc[names.Count==names.Count.max()]

Unnamed: 0_level_0,Count
Name,Unnamed: 1_level_1
Jacob,242874


In [44]:
names['Count'].idxmax()

'Jacob'
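
`idxmax()` returns a single label, so it reports only the first name when several share the maximum. `Series.nlargest` is an alternative that can return the top few names or, with `keep="all"`, every tied name. The sketch below uses the five largest totals shown in Step 7; the `tied` series is hypothetical.

```python
import pandas as pd

# Totals copied from the sorted output in Step 7.
totals = pd.Series(
    {"Jacob": 242874, "Emma": 214852, "Michael": 214405,
     "Ethan": 209277, "Isabella": 204798},
    name="Count",
)
top3 = totals.nlargest(3)
print(top3)

# Hypothetical series with a tie at the top.
tied = pd.Series({"A": 5, "B": 5, "C": 3})
first_only = tied.idxmax()                 # first label among the tied ones
all_tied = tied.nlargest(1, keep="all")    # every label sharing the maximum
print(first_only, all_tied.index.tolist())
```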

### Step 10. How many different names have the least occurrences?

In [48]:
len(names.loc[names.Count==names.Count.min()])

2578
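
The same count can be obtained by summing a boolean mask, which also makes it easy to list the names themselves. A minimal sketch, using a few totals that appear in the Step 7 output as a small stand-in for `names['Count']`:

```python
import pandas as pd

# Small stand-in for names['Count'], with values taken from the Step 7 output.
counts = pd.Series(
    {"Eniola": 5, "Atlantis": 5, "Marci": 5, "Simarpreet": 5, "Jacob": 242874},
    name="Count",
)

# True where a name's total equals the minimum; True counts as 1 when summed.
is_min = counts == counts.min()
n_min = int(is_min.sum())
least_common = counts[is_min].index.tolist()
print(n_min, least_common)
```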

### Step 11. What is the median name occurrence?

In [49]:
names.loc[names['Count']==names['Count'].median()]

Unnamed: 0_level_0,Count
Name,Unnamed: 1_level_1
Aishani,49
Alara,49
Alysse,49
Ameir,49
Anely,49
...,...
Sriram,49
Trinton,49
Vita,49
Yoni,49
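
Filtering on equality with the median works here because 49 happens to be a value present in the data. With an even number of rows the median is the average of the two middle values and may match no row at all. The sketch below shows that case on a hypothetical series, together with one possible fallback that selects the values closest to the median.

```python
import pandas as pd

# Hypothetical series with an even number of values.
even = pd.Series([5, 6, 7, 8])

median = even.median()                  # average of 6 and 7
matches = int((even == median).sum())   # no value equals the median

# Fallback: keep the values with the smallest distance to the median.
distance = (even - median).abs()
nearest = even[distance == distance.min()]
print(median, matches, nearest.tolist())
```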


### Step 12. What is the standard deviation of the name counts?

In [50]:
names.Count.std()

11006.069467891111

In [51]:
names.Count.mean()

2008.932168784029

In [52]:
names.Count.max()

242874

In [53]:
names.Count.median()

49.0
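
pandas computes the sample standard deviation by default (`ddof=1`), while `numpy.std` defaults to the population version (`ddof=0`), so the two libraries can give slightly different numbers for the same data. A short sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration.
values = pd.Series([5, 11, 49, 337])

sample_std = values.std()                 # pandas default: ddof=1
population_std = values.std(ddof=0)       # divide by n instead of n - 1
numpy_std = np.std(values.to_numpy())     # numpy default: ddof=0

print(sample_std, population_std, numpy_std)
```

With more than seventeen thousand names the difference is negligible, but it becomes visible on small samples like this one.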

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [54]:
names.describe()

Unnamed: 0,Count
count,17632.0
mean,2008.932169
std,11006.069468
min,5.0
25%,11.0
50%,49.0
75%,337.0
max,242874.0
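
`describe()` reports the quartiles by default; other percentiles can be requested through its `percentiles` argument. A minimal sketch on a toy series, whose values are taken from the summary above and used only as sample numbers:

```python
import pandas as pd

# Toy series; the numbers are reused from the summary as sample values only.
values = pd.Series([5, 11, 49, 337, 242874], name="Count")

# Ask for the 10th, 50th and 90th percentiles instead of the quartiles.
summary = values.describe(percentiles=[0.1, 0.5, 0.9])
print(summary.index.tolist())
```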
