## A) Descriptive analysis

* Please describe the format of the data files. Can you identify any limitations or distortions of the data?

In [None]:
# Importing necessary libraries

import glob
import pandas as pd

In [11]:
# Data ingestion

folder = r'datasets/*.TXT'
files = glob.glob(folder)
dfs = []

for path in files:
    df = pd.read_csv(path, sep=',', header=None, names=['state', 'gender', 'year', 'name', 'count'])
    # Normalize Windows-style separators first so only the file name is kept
    df['source_file'] = path.replace('\\', '/').split('/')[-1]
    dfs.append(df)

original_df = pd.concat(dfs, ignore_index=True)

In [13]:
original_df.head()

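|  | state | gender | year | name | count | source_file |
|---|---|---|---|---|---|---|
| 0 | IN | F | 1910 | Mary | 619 | IN.TXT |
| 1 | IN | F | 1910 | Helen | 324 | IN.TXT |
| 2 | IN | F | 1910 | Ruth | 238 | IN.TXT |
| 3 | IN | F | 1910 | Dorothy | 215 | IN.TXT |
| 4 | IN | F | 1910 | Mildred | 200 | IN.TXT |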


In [15]:
original_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6311504 entries, 0 to 6311503
Data columns (total 6 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   state        object
 1   gender       object
 2   year         int64 
 3   name         object
 4   count        int64 
 5   source_file  object
dtypes: int64(2), object(4)
memory usage: 288.9+ MB


In [25]:
original_df.sample(5)

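|  | state | gender | year | name | count | source_file |
|---|---|---|---|---|---|---|
| 2605115 | NM | F | 1999 | Alexandria | 32 | NM.TXT |
| 5860702 | FL | F | 1986 | Bianca | 41 | FL.TXT |
| 2594985 | NM | F | 1976 | Erin | 30 | NM.TXT |
| 1352074 | UT | M | 1968 | Monte | 5 | UT.TXT |
| 3739881 | MN | M | 1928 | Orval | 5 | MN.TXT |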


In [17]:
original_df.describe()

|  | year | count |
|---|---|---|
| count | 6311504.0 | 6311504.0 |
| mean | 1977.182 | 50.65856 |
| std | 31.27894 | 173.1193 |
| min | 1910.0 | 5.0 |
| 25% | 1953.0 | 7.0 |
| 50% | 1983.0 | 12.0 |
| 75% | 2005.0 | 33.0 |
| max | 2021.0 | 10026.0 |


**Format of data files:**
* The data consists of one text file per state, each with the same comma-separated layout and no header row: state, gender, year, name, count.
* Each record is the number of babies given a first name for one combination of state, gender, and year. Combined, the files hold 6,311,504 records covering 1910 to 2021.
* Because counts are already aggregated to the state, gender, year, and name level, the consolidated data is well suited to temporal trend analysis and gender-based comparisons.

**Limitations:**
* The data is aggregated rather than individual-level, which restricts longitudinal or cohort-based analysis.
* Gender is recorded as a binary value (F or M), so the data cannot represent anything outside those two categories.
* The minimum `count` in the summary statistics is 5, which suggests that name, state, gender, and year combinations with fewer occurrences are omitted. Rare names are therefore undercounted, and state-level totals understate actual births.
* Raw counts are not normalized by population size, so comparisons across states or years may be skewed without adjustment.
* The data has no demographic attributes such as ethnicity or socioeconomic indicators, so it is not possible to separate changes in naming preference from changes in population composition or reporting practices.
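The normalization issue in the limitations above can be illustrated with a minimal sketch. It converts raw counts into each record's share of the births recorded for that year. The toy rows and the helper `add_yearly_share` are made up for illustration and are not part of the notebook; note also that the denominator is only the births present in the data, not true population births.

```python
import pandas as pd

# Made-up toy records in the notebook's schema (not real data).
toy = pd.DataFrame({
    'state':  ['IN', 'IN', 'IN', 'IN'],
    'gender': ['F', 'F', 'F', 'F'],
    'year':   [1910, 1910, 2000, 2000],
    'name':   ['Mary', 'Ruth', 'Mary', 'Ruth'],
    'count':  [60, 40, 150, 50],
})

def add_yearly_share(df, by=('year',)):
    """Return a copy with each record's share of the recorded births in its group.

    `by` is a hypothetical parameter: pass ('state', 'year') to normalize
    within each state instead of nationally.
    """
    out = df.copy()
    group_total = out.groupby(list(by))['count'].transform('sum')
    out['share'] = out['count'] / group_total
    return out

shares = add_yearly_share(toy)
print(shares[['year', 'name', 'count', 'share']])
```

In this toy data Mary's raw count rises from 60 to 150, but her share of recorded births rises only from 0.60 to 0.75, which is the kind of distortion raw counts can hide.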

-  What is the most popular name of all time? (Of either gender.)

In [83]:
original_df.groupby('name')['count'].sum().sort_values(ascending=False).head(1)

name
James    5054074
Name: count, dtype: int64

**Insights:** Since the dataset already contains aggregated birth counts, I grouped the data by name and summed the count column across all states, years, and both genders to measure overall popularity. By this measure, James is the most popular name of all time, with a cumulative count of 5,054,074 recorded births.

-  What is the most gender ambiguous name in 2013? 1945?

In [161]:
# gender ambiguous name in 2013

df_2013=original_df[original_df['year']==2013] 
gender_aggregated_names_2013=df_2013.groupby(['name', 'gender'])['count'].sum().reset_index()
pivot_names_2013 = gender_aggregated_names_2013.pivot(index='name', columns='gender', values='count')
pivot_names_2013 = pivot_names_2013.dropna()
pivot_names_2013['ambiguity'] = (pivot_names_2013['M'] - pivot_names_2013['F']).abs()
pivot_names_2013=pivot_names_2013.sort_values('ambiguity').head(1)
print("gender ambiguous name in 2013:")
print(pivot_names_2013)

# gender ambiguous name in 1945

df_1945=original_df[original_df['year']==1945] 
gender_aggregated_names_1945=df_1945.groupby(['name', 'gender'])['count'].sum().reset_index()
pivot_names_1945 = gender_aggregated_names_1945.pivot(index='name', columns='gender', values='count')
pivot_names_1945 = pivot_names_1945.dropna()
pivot_names_1945['ambiguity'] = (pivot_names_1945['M'] - pivot_names_1945['F']).abs()
pivot_names_1945=pivot_names_1945.sort_values('ambiguity').head(1)
print("gender ambiguous name in 1945:")
print(pivot_names_1945)

gender ambiguous name in 2013:
gender    F    M  ambiguity
name                       
Sonam   5.0  5.0        0.0
gender ambiguous name in 1945:
gender     F     M  ambiguity
name                         
Maxie   19.0  19.0        0.0


**Insights:** I defined gender ambiguity as the absolute difference between male and female birth counts for each name within a given year, restricted to names recorded for both genders. After aggregating counts by name and gender, the result for 2013 was Sonam, with equal male and female counts (5 each), and the result for 1945 was Maxie (19 each). Two caveats apply. First, an absolute difference ignores volume, so rare names reach a difference of zero easily. Second, `head(1)` returns only one row, so any other names that also tie at zero are not shown.
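One possible alternative to the absolute difference, sketched here under my own assumptions, is a ratio-based balance score combined with a minimum total count so that very rare names do not dominate. The toy names and counts, the helper `rank_ambiguity`, and the `min_total=100` cutoff are all hypothetical; the notebook does not use them.

```python
import pandas as pd

# Made-up counts for a single year (not real data): female and male births per name.
toy = pd.DataFrame(
    {'F': [5, 480, 900], 'M': [5, 520, 100]},
    index=pd.Index(['RareName', 'CommonName', 'SkewedName'], name='name'),
)

def rank_ambiguity(pivot, min_total=100):
    """Rank names by how evenly they split between F and M.

    balance = min(F, M) / max(F, M), so 1.0 is a perfect split and values
    near 0 are one-sided. min_total is an assumed cutoff for rare names.
    """
    out = pivot.copy()
    out['total'] = out['F'] + out['M']
    out['balance'] = out[['F', 'M']].min(axis=1) / out[['F', 'M']].max(axis=1)
    out = out[out['total'] >= min_total]
    # Most balanced first; larger totals break ties.
    return out.sort_values(['balance', 'total'], ascending=False)

ranked = rank_ambiguity(toy)
print(ranked)
```

With this scoring the perfectly split but very rare `RareName` is filtered out, and `CommonName` ranks first even though its absolute difference (40) is larger. The right cutoff is a judgment call and would need to be justified against the real data.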

-  Of the names represented in the data, find the name that has had the largest percentage increase in popularity since 1980. Largest decrease?

In [192]:
df_after1980=original_df[original_df['year']>=1980] 
df_before1980=original_df[original_df['year']<1980] 

popularity_before1980=df_before1980.groupby('name')['count'].sum()
popularity_after1980=df_after1980.groupby('name')['count'].sum()

popularity_change=pd.concat({'before_1980':popularity_before1980,'after_1980':popularity_after1980}, axis=1).dropna()

popularity_change['percent_change']=((popularity_change['after_1980']-popularity_change['before_1980'])/popularity_change['before_1980'])*100.0

percent_increase=popularity_change['percent_change'].sort_values(ascending=False).head(1)
print(percent_increase)

percent_decrease=popularity_change['percent_change'].sort_values(ascending=True).head(1)
print(percent_decrease)

name
Zoey    1974480.0
Name: percent_change, dtype: float64
name
Gertrude   -99.991656
Name: percent_change, dtype: float64


**Insights:** I measured name popularity using total birth counts and compared aggregate usage before 1980 (1910 to 1979) with usage from 1980 onward (1980 to 2021), using percentage change to normalize for scale differences. On this measure, Zoey shows the largest percentage increase (about 1,974,480%), while Gertrude shows the largest percentage decrease (about -99.99%). Two details affect the interpretation: the two periods differ in length (70 years versus 42 years), and `dropna()` keeps only names that appear in both periods. With those caveats, the results are consistent with generational shifts in naming preferences.

-  Can you identify names that may have had an even larger increase or decrease in popularity?

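Yes. Percentage change highlights relative growth, but it overemphasizes names with very small historical baselines, so other names may have had larger absolute increases or decreases that this metric understates. In addition, the `dropna()` step above removes names that appear in only one period. A name first recorded after 1980 has no baseline, so its percentage increase is undefined and effectively unbounded, and a name that no longer appears after 1980 has declined by 100%, which exceeds the decline measured for Gertrude. Evaluating absolute change, applying a minimum baseline threshold, and treating one-period names separately would surface these cases.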
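A minimal sketch of comparing percentage change with absolute change and a minimum baseline follows. The three names and their totals are made up for illustration, and `summarize_change` and the `min_baseline=1000` cutoff are assumptions rather than part of the analysis above.

```python
import pandas as pd

# Made-up before/after totals per name (not real data).
toy = pd.DataFrame(
    {'before_1980': [5, 200000, 50000], 'after_1980': [5000, 20000, 400000]},
    index=pd.Index(['TinyBase', 'BigDecline', 'BigRise'], name='name'),
)

def summarize_change(df, min_baseline=1000):
    """Add absolute change, percent change, and a baseline flag per name."""
    out = df.copy()
    out['abs_change'] = out['after_1980'] - out['before_1980']
    out['percent_change'] = out['abs_change'] / out['before_1980'] * 100.0
    # Assumed cutoff: percent change is only trusted above this baseline.
    out['meets_baseline'] = out['before_1980'] >= min_baseline
    return out

change = summarize_change(toy)
largest_pct = change['percent_change'].idxmax()
largest_abs_rise = change['abs_change'].idxmax()
largest_abs_drop = change['abs_change'].idxmin()
filtered_pct = change.loc[change['meets_baseline'], 'percent_change'].idxmax()

print(change)
print(largest_pct, largest_abs_rise, largest_abs_drop, filtered_pct)
```

In the toy data the name with a baseline of 5 wins on percentage change, while a different name has by far the largest absolute rise and also wins once the baseline filter is applied.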

## B) Onward to Insight!

* What insight can you extract from this dataset? Feel free to combine the baby names data with other publicly available datasets or APIs, but be sure to include code for accessing any alternative data that you use.

This dataset shows that baby name choices change significantly over time. Many names that were very popular in earlier decades have steadily declined, while newer names have grown quickly, especially after 1980. Names used for both boys and girls also appear to be more common in recent years, which would point to a shift toward gender-neutral naming, although the analysis above does not measure this directly. Overall, these patterns suggest that naming trends reflect changing social norms and generational preferences, not just changes in population size.
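One way the gender-neutral naming claim could be checked is to compute, for each year, the fraction of distinct names recorded under both genders. The sketch below uses made-up toy rows, and `shared_name_rate` is a hypothetical helper. It measures only whether a name appears under both genders, not how evenly it is split.

```python
import pandas as pd

# Made-up toy records (not real data) in the notebook's column layout.
toy = pd.DataFrame({
    'year':   [1950, 1950, 1950, 2010, 2010, 2010, 2010],
    'gender': ['F', 'M', 'M', 'F', 'M', 'F', 'M'],
    'name':   ['Mary', 'John', 'Lee', 'Riley', 'Riley', 'Emma', 'Noah'],
    'count':  [100, 100, 10, 40, 30, 90, 80],
})

def shared_name_rate(df):
    """Per year, the fraction of distinct names recorded for both genders."""
    genders_per_name = df.groupby(['year', 'name'])['gender'].nunique()
    is_shared = (genders_per_name > 1).astype(float)
    return is_shared.groupby(level='year').mean()

rate = shared_name_rate(toy)
print(rate)
```

Applied to the full data, a rising series would support the claim. Because rare combinations appear to be omitted from the files, a threshold on counts would likely be needed before drawing conclusions.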