# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [5]:
baby_names = pd.read_csv('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv')
baby_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
Unnamed: 0    1016395 non-null int64
Id            1016395 non-null int64
Name          1016395 non-null object
Year          1016395 non-null int64
Gender        1016395 non-null object
State         1016395 non-null object
Count         1016395 non-null int64
dtypes: int64(4), object(3)
memory usage: 54.3+ MB
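
The `Unnamed: 0` column listed above is the label pandas gives a column whose header field is empty, which usually means a saved index was written to the file. As a minimal sketch on a tiny in-memory CSV (`sample_csv` and `sample` are hypothetical names, and the layout of the header is an assumption about the real file), passing `index_col=0` to `pd.read_csv` uses that first column as the row index instead of loading it as data:

```python
import io

import pandas as pd

# Hypothetical miniature of the file: the first header field is empty,
# which pandas would label 'Unnamed: 0' by default.
sample_csv = io.StringIO(
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
)

# index_col=0 tells pandas to use the first column as the row index.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

With this approach only `Id` would remain to be dropped later.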


### Step 4. See the first 10 entries

In [6]:
baby_names.head(10)

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [8]:
del baby_names['Unnamed: 0']

In [12]:
baby_names.head()

,Id,Name,Year,Gender,State,Count
0,11350,Emma,2004,F,AK,62
1,11351,Madison,2004,F,AK,48
2,11352,Hannah,2004,F,AK,46
3,11353,Grace,2004,F,AK,44
4,11354,Emily,2004,F,AK,41


In [19]:
baby_names.drop(columns='Id', inplace=True)

In [20]:
baby_names.head()

,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
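
Both columns can also be removed in a single call. The following is a minimal sketch on a hypothetical two-row frame (`sample` and `trimmed` are illustrative names) with the same column layout as `baby_names`; it uses the non-mutating form of `drop`, which returns a new DataFrame:

```python
import pandas as pd

# Hypothetical two-row frame with the same columns as baby_names.
sample = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Year": [2004, 2004],
    "Gender": ["F", "F"],
    "State": ["AK", "AK"],
    "Count": [62, 48],
})

# drop returns a new frame; the original is left untouched.
trimmed = sample.drop(columns=["Unnamed: 0", "Id"])
print(trimmed.columns.tolist())
```

Reassigning the result (`baby_names = baby_names.drop(...)`) has the same effect as `inplace=True` and makes the cell safe to reason about when chained with other operations.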


### Step 6. Are there more male or female names in the dataset?

In [27]:
baby_names['Gender'].value_counts()

F    558846
M    457549
Name: Gender, dtype: int64
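
Note that `value_counts()` counts rows (one per name, year, and state), not babies. If the question is read as "were more boys or girls recorded?", the `Count` column has to be summed per gender instead. A minimal sketch on a hypothetical three-row frame with illustrative values shows how the two readings can disagree:

```python
import pandas as pd

# Hypothetical rows; the counts are illustrative, not taken from the dataset.
sample = pd.DataFrame({
    "Name": ["Emma", "Madison", "Jacob"],
    "Gender": ["F", "F", "M"],
    "Count": [62, 48, 150],
})

# Number of rows per gender (what value_counts reports).
rows_per_gender = sample["Gender"].value_counts()

# Number of recorded births per gender (sum of the Count column).
births_per_gender = sample.groupby("Gender")["Count"].sum()

print(rows_per_gender)
print(births_per_gender)
```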

### Step 7. Group the dataset by name and assign to names

In [37]:
# Select 'Count' before summing so that only this column is aggregated.
names = baby_names.groupby('Name')[['Count']].sum()
names.sort_values('Count', ascending=False).head()

Name,Count
Jacob,242874
Emma,214852
Michael,214405
Ethan,209277
Isabella,204798
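
To see what the grouping does, here is a minimal sketch on a hypothetical five-row frame (illustrative values): rows for the same name in different states collapse into a single total, and `nlargest` is used as an alternative to sorting followed by `head`:

```python
import pandas as pd

# Hypothetical rows: the same name appears once per state.
sample = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Jacob", "Aadin"],
    "State": ["AK", "AL", "AK", "AL", "AK"],
    "Count": [62, 100, 80, 120, 5],
})

# One row per name, holding the total Count across all states.
totals = sample.groupby("Name")[["Count"]].sum()

# The two names with the highest totals, in descending order.
top = totals.nlargest(2, "Count")
print(top)
```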


### Step 8. How many different names exist in the dataset?

In [39]:
len(names)

17632
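
The same number can be obtained without the grouped frame. As a minimal sketch on hypothetical data, `Series.nunique()` counts the distinct values in the `Name` column directly:

```python
import pandas as pd

# Hypothetical rows with repeated names.
sample = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Jacob", "Aadin"],
    "Count": [62, 100, 80, 120, 5],
})

# Length of the grouped frame: one row per distinct name.
n_grouped = len(sample.groupby("Name")[["Count"]].sum())

# Direct count of distinct names.
n_unique = sample["Name"].nunique()
print(n_grouped, n_unique)
```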

### Step 9. What is the name with most occurrences?

In [43]:
# names.Count.idxmax() returns the row label of the maximum value.
# If multiple values equal the maximum, the first row label with that
# value is returned.

In [44]:
names.Count.idxmax()

'Jacob'
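
Because `idxmax()` returns only the first label when several values share the maximum, a tie would go unnoticed. A minimal sketch on a hypothetical Series (`totals` and its values are illustrative) shows how to list every name that reaches the maximum:

```python
import pandas as pd

# Hypothetical totals in which two names tie for the maximum.
totals = pd.Series({"Emma": 200, "Jacob": 200, "Aadin": 5}, name="Count")

# idxmax reports only the first label holding the maximum value.
first = totals.idxmax()

# A boolean mask against the maximum keeps every tied label.
tied = totals[totals == totals.max()].index.tolist()
print(first, tied)
```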

### Step 10. How many different names have the least occurrences?

In [48]:
len(names[names.Count == names.Count.min()])

2578

In [52]:
names[names.Count == names.Count.min()].head()

Name,Count
Aadarsh,5
Aadin,5
Aaima,5
Aalaya,5
Aaminah,5


### Step 11. What is the median name occurrence?

In [54]:
names[names.Count == names.Count.median()].index

Index(['Aishani', 'Alara', 'Alysse', 'Ameir', 'Anely', 'Antonina', 'Aveline',
       'Aziah', 'Baily', 'Caleah', 'Carlota', 'Cristine', 'Dahlila', 'Darvin',
       'Deante', 'Deserae', 'Devean', 'Elizah', 'Emmaly', 'Emmanuela', 'Envy',
       'Esli', 'Fay', 'Gurshaan', 'Hareem', 'Iven', 'Jaice', 'Jaiyana',
       'Jamiracle', 'Jelissa', 'Jeovany', 'Jkwon', 'Kaedence', 'Kaelee',
       'Kailana', 'Kaio', 'Kyndle', 'Kynsley', 'Leylanie', 'Maisha',
       'Malillany', 'Mariann', 'Marquell', 'Maurilio', 'Mckynzie', 'Mehdi',
       'Nabeel', 'Nalleli', 'Nassir', 'Nazier', 'Nishant', 'Rebecka', 'Reghan',
       'Ridwan', 'Riot', 'Rubin', 'Ryatt', 'Sameera', 'Sanjuanita', 'Shalyn',
       'Skylie', 'Sriram', 'Trinton', 'Vita', 'Yoni', 'Zuleima'],
      dtype='object', name='Name')
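
The cell above lists the names whose total equals the median; the median value itself comes from `names.Count.median()`. The equality filter only finds matches when the median is an actual value in the data. With an even number of entries the median is the mean of the two middle values and may match nothing, as this minimal sketch on hypothetical Series shows:

```python
import pandas as pd

# Hypothetical totals: odd length, so the median is an actual entry.
odd = pd.Series([5, 11, 49, 337, 1000])
# Even length: the median is the mean of the two middle entries (11 and 49).
even = pd.Series([5, 11, 49, 337])

matches_odd = odd[odd == odd.median()]
matches_even = even[even == even.median()]

print(odd.median(), len(matches_odd))
print(even.median(), len(matches_even))
```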

### Step 12. What is the standard deviation of the name counts?

In [57]:
names.Count.std()

11006.069467891111
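
pandas computes the sample standard deviation (`ddof=1`) by default, whereas `np.std` defaults to the population form (`ddof=0`), so the two libraries give slightly different numbers for the same data. A minimal sketch on a small hypothetical Series makes the difference visible:

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32.
counts = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# pandas default: divide by n - 1 (sample standard deviation).
sample_std = counts.std()

# NumPy default: divide by n (population standard deviation).
population_std = np.std(counts.to_numpy())

# Passing ddof=0 to pandas reproduces the NumPy result.
pandas_population_std = counts.std(ddof=0)
print(sample_std, population_std, pandas_population_std)
```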

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [58]:
names.describe()

,Count
count,17632.0
mean,2008.932169
std,11006.069468
min,5.0
25%,11.0
50%,49.0
75%,337.0
max,242874.0
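
With a distribution this skewed (median 49, maximum 242874), other percentiles can be more informative than the default quartiles. As a minimal sketch on a hypothetical Series, the `percentiles` argument of `describe` selects which ones are reported:

```python
import pandas as pd

# Hypothetical totals, skewed toward small values.
counts = pd.Series([5, 11, 49, 337, 1000], name="Count")

# Request the 10th, 50th, and 90th percentiles instead of the quartiles.
summary = counts.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```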
