# US Baby Names Analysis

This notebook analyzes baby names registered with the US Social Security Administration (SSA) to find the most popular names over the period covered by the data.

## 1. Dataset

- Source: Social Security Administration (SSA)
- Popular Baby Names (State-specific data)
- Contains only the top 1000 names collected through their forms. To safeguard privacy, SSA excludes from these files any name that would indicate, or would make it possible to determine, names with fewer than 5 occurrences in any geographic area.

## 2. Data Dictionary

- state: state where the names were registered
- sex: baby's sex (binary: Female or Male)
- year: year when the names were registered
- name: registered name
- occurences: how many times the name was registered

## 3. Import files

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
folder_path = 'namesbystate'

In [3]:
# Create an empty list to store individual dataframes
dfs = []

In [4]:
# Import each file using a for loop
for filename in os.listdir(folder_path):
    if filename.endswith('.TXT'):
        # Create the full file path
        file_path = os.path.join(folder_path, filename)

        # Read the text file into a pandas dataframe
        df = pd.read_csv(file_path, sep=',', header = None)

        # Append the dataframe to the list
        dfs.append(df)

In [5]:
# Combine all dataframes into one
combined_df = pd.concat(dfs, ignore_index=True)

In [6]:
# Check the data
combined_df.head()

Unnamed: 0,0,1,2,3,4
0,AK,F,1910,Mary,14
1,AK,F,1910,Annie,12
2,AK,F,1910,Anna,10
3,AK,F,1910,Margaret,8
4,AK,F,1910,Helen,7


In [7]:
# Rename columns
combined_df.columns = ['state', 'sex', 'year', 'name', 'occurences']
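
As a side note, the column names can also be assigned while reading, which makes the separate rename step unnecessary. The following is a minimal sketch, assuming a hypothetical two-row in-memory sample (`sample_txt`) in the same layout as the SSA state files; it is not the notebook's actual loading code.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the SSA state files
sample_txt = "AK,F,1910,Mary,14\nAK,F,1910,Annie,12\n"

columns = ['state', 'sex', 'year', 'name', 'occurences']

# header=None marks the input as headerless; names= labels the columns on read
sample_df = pd.read_csv(io.StringIO(sample_txt), header=None, names=columns)
print(sample_df.dtypes)
```

In the loading loop, the same `names=columns` argument could be passed to `pd.read_csv(file_path, ...)`.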

In [8]:
combined_df.head()

Unnamed: 0,state,sex,year,name,occurences
0,AK,F,1910,Mary,14
1,AK,F,1910,Annie,12
2,AK,F,1910,Anna,10
3,AK,F,1910,Margaret,8
4,AK,F,1910,Helen,7


## 4. EDA

In [9]:
combined_df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6408041 entries, 0 to 6408040
Data columns (total 5 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   state       6408041 non-null  object
 1   sex         6408041 non-null  object
 2   year        6408041 non-null  int64 
 3   name        6408041 non-null  object
 4   occurences  6408041 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 244.4+ MB


In [10]:
combined_df.isnull().sum()

state         0
sex           0
year          0
name          0
occurences    0
dtype: int64

There are no null values in the dataframe.

In [11]:
# # Save the combined dataframe as a csv file
# combined_df.to_csv('BabyNamesUS.csv')

### 1) Unique names for each sex during the full period

In [12]:
name_df = combined_df.pivot_table(index = 'name', values = 'occurences', columns = 'sex', aggfunc='sum')
name_df.head()

sex,F,M
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Aaban,,12.0
Aadam,,6.0
Aadan,,23.0
Aadarsh,,11.0
Aaden,,4174.0


In [13]:
name_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32722 entries, Aaban to Zyshonne
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   F       21570 non-null  float64
 1   M       14336 non-null  float64
dtypes: float64(2)
memory usage: 766.9+ KB


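There are 21,570 unique female names and 14,336 unique male names. Because some names are used for both sexes, the two counts add up to more than the 32,722 distinct names in the index.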
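
The per-sex counts can also be computed directly, without reading them off the non-null counts of the pivot table. The sketch below uses a hypothetical miniature dataframe (`sample`) in place of `combined_df`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of combined_df; the real frame has about 6.4 million rows
sample = pd.DataFrame({
    'sex':  ['F', 'F', 'F', 'M', 'M'],
    'name': ['Mary', 'Mary', 'Anna', 'James', 'Mary'],
    'occurences': [14, 9, 10, 12, 5],
})

# Count distinct names per sex, regardless of state or year
unique_per_sex = sample.groupby('sex')['name'].nunique()

# Count distinct names overall; a name used for both sexes is counted once here
unique_total = sample['name'].nunique()

print(unique_per_sex)
print(unique_total)
```

A name that appears under both sexes (here 'Mary') is counted once in each group, which is why the per-sex counts can sum to more than the overall total.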

### 2) Transform 'year' to 'year_decade' to simplify the dataset for analysis

In [14]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)
combined_df.describe()

Unnamed: 0,year,occurences
count,6408041,6408041
mean,1978,50
std,32,172
min,1910,5
25%,1954,7
50%,1983,12
75%,2005,33
max,2022,10027


This dataset contains baby names reported to the SSA (Social Security Administration) from the 50 states and the District of Columbia (51 state codes in total) from 1910 to 2022. Since 113 individual years are too many to compare directly, I will create a new column called 'year_decade' to aggregate years by decade.

In [15]:
combined_df['year_decade'] = (combined_df['year'] // 10) * 10

In [16]:
combined_df['year_decade'] = combined_df['year_decade'].apply(lambda x: str(x) + 's')
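
The two steps above can also be written as a single vectorized expression, which avoids calling a Python lambda on every row. This is a small sketch on a hypothetical `years` series rather than the full dataframe.

```python
import pandas as pd

# Hypothetical sample of year values
years = pd.Series([1910, 1919, 1985, 2022], name='year')

# Integer-divide by 10 to drop the last digit, scale back up, then append 's'
decade_labels = ((years // 10) * 10).astype(str) + 's'
print(decade_labels.tolist())  # ['1910s', '1910s', '1980s', '2020s']
```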

In [17]:
combined_df

Unnamed: 0,state,sex,year,name,occurences,year_decade
0,AK,F,1910,Mary,14,1910s
1,AK,F,1910,Annie,12,1910s
2,AK,F,1910,Anna,10,1910s
3,AK,F,1910,Margaret,8,1910s
4,AK,F,1910,Helen,7,1910s
...,...,...,...,...,...,...
6408036,WY,M,2022,Lane,5,2020s
6408037,WY,M,2022,Michael,5,2020s
6408038,WY,M,2022,Nicholas,5,2020s
6408039,WY,M,2022,River,5,2020s


### 3) The most popular baby names during the full period

In [18]:
# Group the dataframe by sex and name and sum the occurrences
top_names = combined_df.groupby(['sex', 'name'])['occurences'].sum().reset_index()

# Find the name with the highest total for each sex
top_names.loc[top_names.groupby('sex')['occurences'].idxmax()].reset_index(drop = True)

Unnamed: 0,sex,name,occurences
0,F,Mary,3750176
1,M,James,5047892


From 1910 to 2022, the most popular female name is 'Mary' and the most popular male name is 'James'.

### 4) The most popular baby names by each decade

In [19]:
top_names_by_decade = combined_df.groupby(['sex', 'name', 'year_decade'])['occurences'].sum().reset_index()

# Find the index of the maximum value in each group
max_occurrences_index = top_names_by_decade.groupby(['sex', 'year_decade'])['occurences'].idxmax()

# Get the most popular name for each sex within each decade
most_popular_names_decade = top_names_by_decade.loc[max_occurrences_index].reset_index(drop = True)

most_popular_names_decade

Unnamed: 0,sex,name,year_decade,occurences
0,F,Mary,1910s,478637
1,F,Mary,1920s,701755
2,F,Mary,1930s,572987
3,F,Mary,1940s,640066
4,F,Mary,1950s,625601
5,F,Lisa,1960s,496975
6,F,Jennifer,1970s,581753
7,F,Jessica,1980s,469518
8,F,Jessica,1990s,303118
9,F,Emily,2000s,223734


The analysis shows that 'Mary' was the most popular female name for five consecutive decades, from the 1910s through the 1950s, before the trend changed in the 1960s. For males, Robert, James, and Michael were the most popular names in many decades.
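
To compare the two sexes decade by decade, the long result can be reshaped so that each decade is one row with one column per sex. The sketch below is an assumption about how this could be done; it uses a hypothetical stand-in dataframe, and the male entries are placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for most_popular_names_decade; the male rows are
# placeholders for illustration, not values from the real dataset
sample = pd.DataFrame({
    'sex': ['F', 'F', 'M', 'M'],
    'name': ['Mary', 'Lisa', 'Name_A', 'Name_B'],
    'year_decade': ['1950s', '1960s', '1950s', '1960s'],
})

# One row per decade, one column per sex
side_by_side = sample.pivot(index='year_decade', columns='sex', values='name')
print(side_by_side)
```

`pivot` requires each (decade, sex) pair to appear only once, which holds here because the table keeps a single top name per sex and decade.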

### 5) Top 10 popular baby names for each sex during the past 5 years

When my niece was born, my family took about a week to decide on her name. We wanted to give her a pretty but also trendy name. I believe this analysis might help parents of a newborn who want to give their baby a trendy name.

In [20]:
# Extract the most recent 5 years of data
recent5years = combined_df[combined_df.year > combined_df.year.max() - 5]
recent5years.head()

Unnamed: 0,state,sex,year,name,occurences,year_decade
14421,AK,F,2018,Aurora,46,2010s
14422,AK,F,2018,Amelia,45,2010s
14423,AK,F,2018,Charlotte,44,2010s
14424,AK,F,2018,Olivia,44,2010s
14425,AK,F,2018,Sophia,41,2010s
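
The filter `year > max - 5` keeps exactly five calendar years, because the comparison is strict. The following sketch checks this on a hypothetical `years` series that ends in 2022, like the dataset.

```python
import pandas as pd

# Hypothetical year values ending in 2022, the last year in the dataset
years = pd.Series(range(2015, 2023))

# Strict inequality: with max = 2022 the cutoff is 2017, so 2017 itself is excluded
cutoff = years.max() - 5
recent = years[years > cutoff]
print(recent.tolist())  # [2018, 2019, 2020, 2021, 2022]
```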


In [21]:
grouped_df = recent5years.groupby(['sex', 'name'])['occurences'].sum().reset_index()

# Sort the dataframe by occurrences within each group
sorted_df = grouped_df.sort_values(by=['sex', 'occurences'], ascending=[True, False])

# Get the top 10 names for each sex
top5_names = sorted_df.groupby(['sex']).head(10).reset_index(drop = True)

# Display the result
top5_names

Unnamed: 0,sex,name,occurences
0,F,Olivia,88623
1,F,Emma,81620
2,F,Ava,66586
3,F,Sophia,65705
4,F,Charlotte,65550
5,F,Amelia,63437
6,F,Isabella,63030
7,F,Mia,58613
8,F,Evelyn,49127
9,F,Harper,46564
