# Exploring trends in baby names

When my wife and I were expecting our firstborn, I naturally turned to data visualization to help us choose a name. [Exploring baby name trends](http://hint.fm/papers/final-baby-margin-nocomments.pdf) has already received attention from the visualization community, and the visualization system [NameGrapher](https://namerology.com/baby-name-grapher) (previously known as Name Voyager) is still running and still popular. However, this tool was either too clunky or unable to answer some of the questions I had about this data:

* What are the most popular names within a given time period?
* Which names are becoming more popular?
* Which names are going out of style?
* What are the most gender neutral baby names?

So I went straight to the source, the [Popular Baby Name](https://www.ssa.gov/oact/babynames/index.html) dataset, from the US Social Security Administration (SSA). This notebook does a little exploratory data analysis based on this dataset to answer these questions.

## Notebook Settings
This notebook deals with a lot of data, 7 MB, and can be slow to run on some laptops. Modifying `year_range` to filter the data can help with this issue. Remember that the end of a `range` in Python is not inclusive, so `range(2010, 2021)` covers 2010 through 2020.

*Note: This notebook expects that you've already downloaded the national data product and uncompressed it into a directory called `names` that contains all of the year of birth (yob) `.txt` files in your current working directory.*

In [18]:
import pandas as pd
import altair as alt
from altair import datum
import numpy as np
import glob, re
from IPython.display import IFrame

# Filter data by range
year_range = range(2010, 2021)

# Setup some color settings used in charts
# Classic pink and blue for girls and boys, respectively
color_domain = ['Female', 'Male']
color_range = ['#ff9da7', '#4e79a7']

## Data Preparation

I make some slight modifications to what is already a very clean dataset. 

1. Take data from files in the `names` directory into a single Pandas DataFrame. 
2. While each file sorts the names by rank, there are many ties. Therefore, I use [`rank` set to *dense*](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html).
3. Map the dataset's code for sex, (*M*, *F*) to more verbose categories, (*Male*, *Female*).
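
To illustrate the second step, here is a minimal sketch of how *dense* ranking treats ties, compared with the *min* method. The names and counts are made up for the example and are not taken from the SSA data.

```python
import pandas as pd

# Toy data (hypothetical counts): two names tie for second place
toy = pd.DataFrame({
    'name': ['Ava', 'Mia', 'Zoe', 'Lia'],
    'total': [30, 20, 20, 10],
})

# 'dense' gives tied rows the same rank and leaves no gap after the tie
toy['dense_rank'] = toy['total'].rank(method='dense', ascending=False)

# 'min' also shares a rank across ties, but skips the rank that follows
toy['min_rank'] = toy['total'].rank(method='min', ascending=False)

print(toy)
```

With dense ranking, the name after the tie is ranked third rather than fourth, so the highest rank equals the number of distinct totals.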

In [2]:
files = 'names/*.txt'

data = []
for fn in glob.glob(files):
    year = int(re.search(r"\d{4}", fn)[0])  # Raw string avoids an invalid escape warning
    if year not in year_range:  # Skip files for years outside of our range
        continue
    df = pd.read_csv(fn, index_col=False, names=['name', 'sex', 'total'])
    df['year'] = year
    data.append(df)
    
# Union these datasets together
data = pd.concat(data, axis=0, ignore_index=True)

# Calculate ranks for each group within year and sex
data['rank'] = data.groupby(['year', 'sex'])['total'].rank(method='dense', ascending=False)

# Expand categorical values for sex
data['sex'] = data.sex.apply(lambda s: 'Male' if s == 'M' else 'Female')

data.describe(include='all')

Unnamed: 0,name,sex,total,year,rank
count,362627,362627,362627.0,362627.0,362627.0
unique,56013,2,,,
top,Isabella,Female,,,
freq,22,207141,,,
mean,,,109.0335,2014.922915,877.053614
std,,,658.574663,3.155369,134.670436
min,,,5.0,2010.0,1.0
25%,,,7.0,2012.0,881.0
50%,,,11.0,2015.0,911.0
75%,,,30.0,2018.0,936.0


### Quality Assurance

This cell performs these quality assurance checks with the raw dataset:

- No rank within a given year and sex should reach 1000.

In [3]:
assert data['rank'].max() < 1000

print('All quality tests were passed')

All quality tests were passed


## Exploratory Data Analysis
In this section, I explore a few questions related to this data.

### Which names are popular?

We can answer this question with a humble bar plot of the top $n$ names. Changing the value of `name_limit` changes the value of $n$.

In [4]:
name_limit = 10

In [5]:
def multiColBarChart(df, x_field, y_field, grp_field, color_field, limit=10, ascending=False):
    # Aggregate the DataFrame passed in as `df` rather than the global `data`
    grp = df.groupby([grp_field, y_field]).agg({x_field: 'sum'}).sort_values(x_field, ascending=ascending).reset_index().groupby(grp_field)
    grps = [ grp.get_group(x) for x in grp.groups ]
    
    charts = []
    for g in grps:
        g = g.head(limit)
        title = '{order} popular {limit} {sex} names ({min_year} - {max_year})'.format(**{
            'order': 'Most' if not ascending else 'Least',
            'limit': limit,
            'sex': 'male' if g.sex.unique()[0] == 'Male' else 'female',
            'min_year': min(df.year.unique()),
            'max_year': max(df.year.unique())
        })
        chart = alt.Chart(g).mark_bar().encode(
            x=x_field,
            y=alt.Y(y_field, sort=g[y_field].to_list()),
            color=alt.Color(color_field, scale=alt.Scale(domain=color_domain, range=color_range))
        ).properties(title=title)
        charts.append(chart)
    
    return alt.hconcat(*charts)

My wife and I knew we were having a boy, and at one time *Liam* was on our shortlist. Knowing that it was one of the most popular names definitely made it less appealing to us.

In [6]:
multiColBarChart(data, x_field='total', y_field='name', color_field='sex', grp_field='sex', limit=name_limit)

### Which names are unpopular?

I'm also interested in the least popular names. "Least popular" is a bit of a misnomer, since the dataset leaves out the rarest names by design: the smallest yearly `total` in the summary above is 5.

In [7]:
multiColBarChart(data, x_field='total', y_field='name', color_field='sex', grp_field='sex', limit=name_limit, ascending=True)

### Which names are on the rise?

To determine which names are increasing in popularity the most, we first need to remove names that are missing from any year in our time frame. I count the number of years each name appears in, and keep only the names whose count equals the number of years in the current `year_range`. I do this by creating a separate DataFrame of all valid names, then doing an *inner join* with the full data so that only the rows for those names remain.
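
As an aside, the same filter can be sketched without a join by using `groupby(...).transform`. This is only an alternative illustration on made-up data; the `toy` frame and `toy_years` range are hypothetical and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy data (hypothetical): 'Ava' appears in all three years, 'Zoe' misses 2011
toy = pd.DataFrame({
    'sex': ['Female'] * 5,
    'name': ['Ava', 'Ava', 'Ava', 'Zoe', 'Zoe'],
    'year': [2010, 2011, 2012, 2010, 2012],
})
toy_years = range(2010, 2013)

# Count the distinct years for each (sex, name) pair and broadcast it to every row
years_present = toy.groupby(['sex', 'name'])['year'].transform('nunique')

# Keep only the rows for names present in every year of the range
toy_continuous = toy[years_present == len(toy_years)]

print(toy_continuous['name'].unique())
```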

In [8]:
name_freq_by_year = data.sort_values('year', ascending=True) \
    .groupby(['sex', 'name'], sort=False) \
    .agg({'year': lambda x: len(x.tolist()) }) \
    .reset_index() \
    .rename(columns={'year': 'count'})
name_freq_by_year = name_freq_by_year[name_freq_by_year['count'] == len(year_range)]

# Merge data
data_continuous = pd.merge(name_freq_by_year, data, left_on=['sex', 'name'], right_on=['sex', 'name'])

# Remove extraneous columns
data_continuous.drop(columns=['count'], inplace=True)

# Peek at the data
data_continuous.head(5)

Unnamed: 0,sex,name,total,year,rank
0,Female,Isabella,22924,2010,1.0
1,Female,Isabella,19919,2011,2.0
2,Female,Isabella,19113,2012,3.0
3,Female,Isabella,17654,2013,4.0
4,Female,Isabella,17108,2014,4.0


Next, I sanity check the group order. For some reason, sorting before grouping is waaaaaay faster than sorting after grouping. This cell performs a few quick sanity checks with the group object:

- Test that the values within the groups are in chronological order.
- Test that there are no gap years in results, e.g. 2000, 2001, 2003, ...
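
For reference, both checks can also be expressed with vectorized pandas operations instead of explicit loops. The sketch below uses hypothetical helper names (`is_strictly_ascending`, `is_sequential`) and toy series; it is not the code this notebook runs.

```python
import pandas as pd

def is_strictly_ascending(s):
    # Monotonic increasing allows repeats, so also require unique values
    return bool(s.is_monotonic_increasing and s.is_unique)

def is_sequential(s):
    # Consecutive years differ by exactly 1; diff() leaves NaN in the first slot
    return bool(s.diff().dropna().eq(1).all())

years_ok = pd.Series([2010, 2011, 2012])
years_gap = pd.Series([2010, 2011, 2013])

print(is_strictly_ascending(years_ok), is_sequential(years_ok))
print(is_strictly_ascending(years_gap), is_sequential(years_gap))
```

The series with a gap is still ascending, which is why the two checks are kept separate.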

In [9]:
def testOrder(df):
    def isAscending(s):
        s = s.to_list()
        for i in range(len(s) - 1):
            if s[i] >= s[i + 1]:
                return False
        return True
    
    def isSequential(s):
        s = s.to_list()
        for i in range(len(s) - 1):
            if s[i] != s[i + 1] - 1:
                return False
        return True
    
    def test(results):
        exp = results.shape[0]
        reality = sum(results.year)
        assert (exp == reality)

    grp = df.sort_values('year', ascending=True).groupby(['sex', 'name'], sort=False)
    
    # Test ascending
    yearCount = grp.agg({'year': isAscending })
    test(yearCount)
    
    # Test sequential
    yearSeq = grp.agg({'year': isSequential })
    test(yearSeq)
    
    print('All sanity checks passed')


testOrder(data_continuous)

All sanity checks passed


Next, I calculate the slope for each name within this time frame by fitting a straight line, $y=mx+b$, to rank as a function of year. I also flip the sign of the slope because lower ranks are better than higher ranks: first place is better than second place.
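
To make the sign flip concrete, here is a small worked example with made-up ranks for a single name. The numbers are hypothetical; only `np.polyfit` with degree 1 matches what the notebook uses.

```python
import numpy as np

# Hypothetical ranks for one name over five consecutive years:
# the name climbs from rank 50 to rank 10
ranks = [50, 40, 30, 20, 10]

# Fit rank = m * year_index + b; polyfit returns coefficients highest degree first
m, b = np.polyfit(range(len(ranks)), ranks, 1)

# The rank number falls as the name gets more popular, so negate the slope
rising_score = -m

print(round(m, 6), round(b, 6), round(rising_score, 6))
```

Here the fitted slope is about -10 ranks per year, so the flipped score is about +10: a positive score means the name is on the rise.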

In [10]:
def calcSlope(x):
    slope = np.polyfit(range(len(x)), x, 1)[0]
    return slope * -1

data_slopes = data_continuous \
    .sort_values('year', ascending=True) \
    .groupby(['sex', 'name'], sort=False) \
    .agg({'rank': calcSlope}) \
    .reset_index() \
    .rename(columns={'rank': 'slope'})

data_slopes = pd.merge(data_slopes, data, left_on=['sex', 'name'], right_on=['sex', 'name'])
data_slopes.sort_values(['slope', 'year'], ascending=False)

Unnamed: 0,sex,name,slope,total,year,rank
92597,Female,Alaia,82.090909,2254,2020,118.0
92596,Female,Alaia,82.090909,1600,2019,179.0
92595,Female,Alaia,82.090909,526,2018,493.0
92594,Female,Alaia,82.090909,460,2017,544.0
92593,Female,Alaia,82.090909,487,2016,520.0
...,...,...,...,...,...,...
99752,Female,Jayden,-60.372727,575,2014,482.0
99751,Female,Jayden,-60.372727,697,2013,399.0
99750,Female,Jayden,-60.372727,825,2012,354.0
99749,Female,Jayden,-60.372727,1075,2011,276.0


Finally, we can see these trends with a multi-series line chart. The legend is sorted by slope, with the greatest slope first. Both charts support interactive zooming and show tooltips over the dots in each line.

In [11]:
def plotTrends(df, limit=10, top=True):
    nameFilter = df.groupby(['sex', 'name']) \
                 .slope.unique().apply(lambda x: x[0]) \
                 .sort_values(ascending=not top) \
                 .groupby('sex') \
                 .head(limit) \
                 .to_frame().reset_index()[['sex', 'name']]
    df = pd.merge(nameFilter, df, on=['sex', 'name'])
    df_grp = df.groupby('sex')    
    charts = []
    rankDomain = [df['rank'].max(), 1]
    for grp in [ df_grp.get_group(x) for x in df_grp.groups ]:
        name_order = list(grp.sort_values('slope', ascending=False)['name'].unique())
        chart = alt.Chart(grp).mark_line(point=True).encode(
            x='year:O',
            y=alt.Y('rank:Q', scale=alt.Scale(domain=rankDomain), axis=alt.Axis(tickCount=10)),
            color=alt.Color('name', sort=name_order),
            tooltip=['name', 'rank'],
        ).properties(title='Top {} {} {} Names'.format(limit, 'Rising' if top else 'Falling', grp.sex.unique()[0]), width=750).interactive()
        charts.append(chart)
    
    return alt.vconcat(*charts) \
              .resolve_scale(color='independent')

In [12]:
plotTrends(data_slopes, limit=name_limit)

Looking at this data by rank makes the most sense to me: rank is relative, which makes year-to-year comparisons easier. I imagine you'd find different results if you looked at the total number of babies born, but those numbers might be affected by other factors.
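
One possible middle ground, which this notebook does not pursue, is to normalize each name's count by the total within its year and sex. The sketch below uses made-up numbers; the `share` column is an assumption for illustration, and it is relative only to the births recorded in the dataset, not to all births.

```python
import pandas as pd

# Toy data (hypothetical counts)
toy = pd.DataFrame({
    'year': [2010, 2010, 2011, 2011],
    'sex': ['Male'] * 4,
    'name': ['Liam', 'Noah', 'Liam', 'Noah'],
    'total': [100, 300, 150, 150],
})

# Divide each count by the sum of counts for the same year and sex
toy['share'] = toy['total'] / toy.groupby(['year', 'sex'])['total'].transform('sum')

print(toy)
```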

### Which names are going out of style?

We can easily explore which names are decreasing in popularity by sorting the data the opposite way. You could include names that drop out of the rankings in any year within the selected range, but leaving them out provides a more consistent view of the data.

Results will vary depending on the year range you've selected, but with the default settings, you should see that the female name *Isis* really took a popularity hit when ISIS leader Abu Bakr al-Baghdadi announced the formation of a caliphate in 2014.

In [13]:
plotTrends(data_slopes, limit=name_limit, top=False)

### Which names are the most gender neutral?

Some gender neutral names are less neutral than others. Johnny Cash articulates a good case in point.

In [21]:
IFrame('https://www.youtube.com/embed/WOHPuY88Ry4', 560, 315)

So I look at names that were assigned to both male and female babies, where the proportion of male to female babies doesn't lean too far towards one end or the other. I also filter out some of the less popular names.
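
As a rough illustration of the idea, the sketch below computes the female share for a few made-up names and keeps those within a 25% to 75% band. The names, counts, and the `wide` and `neutral` variables are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical counts); 'Emma' has no male entries
toy = pd.DataFrame({
    'name': ['Riley', 'Riley', 'Liam', 'Liam', 'Emma'],
    'sex': ['Female', 'Male', 'Female', 'Male', 'Female'],
    'total': [600, 400, 10, 990, 800],
})

# One row per name, one column per sex; drop names missing either sex
wide = toy.pivot_table(index='name', columns='sex', values='total', aggfunc='mean').dropna()

# Share of babies with this name who were recorded as female
wide['percent'] = wide['Female'] / (wide['Female'] + wide['Male'])

# Keep names that do not lean too far either way
neutral = wide[(wide['percent'] > 0.25) & (wide['percent'] < 0.75)]

print(neutral)
```

Only *Riley* survives here: *Emma* is dropped for having no male entries, and *Liam* is dropped because its female share is far below the lower bound.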

In [15]:
# Create a DataFrame that only includes names that have been ranked in both the Male and Female categories
neutral_names = pd.pivot_table(data, index=['name'], columns=['sex'], values=['total'], aggfunc=np.mean).droplevel(0, axis=1).reset_index().dropna(axis=0)

# Calculate the female share of each name's average yearly total, stored as 'percent'
neutral_names['percent'] = neutral_names['Female'] / (neutral_names['Female'] + neutral_names['Male'])

# Filter out low count names
min_count = 50
neutral_names = neutral_names[(neutral_names['Female'] > min_count) & (neutral_names['Male'] > min_count)]

# Filter out names predominantly male or female
min_percent, max_percent = 0.25, 0.75
neutral_names = neutral_names[(neutral_names['percent'] > min_percent) & (neutral_names['percent'] < max_percent)]

# Classify each name as more girl or more boy
neutral_names['leaning'] = neutral_names['percent'].apply(lambda x: 'Female' if x > 0.5 else 'Male')

# Tidy data
neutral_names = pd.melt(neutral_names, id_vars=['name', 'leaning', 'percent'], value_vars=['Male', 'Female'], value_name='total')

neutral_names

Unnamed: 0,name,leaning,percent,sex,total
0,Amari,Male,0.366044,Male,1249.181818
1,Arden,Female,0.706303,Male,100.818182
2,Ari,Male,0.295511,Male,716.272727
3,Arie,Female,0.641457,Male,58.181818
4,Aries,Male,0.308666,Male,163.909091
...,...,...,...,...,...
285,Tru,Male,0.354113,Female,63.000000
286,True,Male,0.465826,Female,63.818182
287,Unknown,Male,0.428273,Female,55.909091
288,Yael,Male,0.350042,Female,113.636364


Within these two groups, the most gender neutral names are at the top of the chart. Remember that these plots don't say anything about the popularity of these names, just whether they tend to be given to more babies who were assigned male or female at birth.

In [16]:
def plotDistribution(df, grp_var):
    charts = []
    groups = df.groupby(grp_var)
    for grp in [ groups.get_group(x) for x in groups.groups ]:
        sex = grp['leaning'].unique()[0]
        asc = sex == 'Female'
        # Order names by female share, using the DataFrame passed in as `df`
        name_order = list(df.sort_values('percent', ascending=asc).name.unique())
        chart = alt.Chart(grp).mark_bar().encode(
            x=alt.X('sum(total)', stack='normalize'),
            y=alt.Y('name', sort=name_order),
            color=alt.Color('sex', scale=alt.Scale(domain=color_domain, range=color_range)),
        ).properties(title='Distribution of predominantly {} names'.format(sex.lower()))
        text = alt.Chart(grp).mark_text(dx=-15, dy=0, color='white').encode(
            x=alt.X('percent'),
            y=alt.Y('name', sort=name_order),
            detail='sex',
            text=alt.Text('percent', format='.2f')
        ).transform_filter(
            (datum.sex == 'Female')
        )
        
        charts.append(chart + text)
    
    return alt.hconcat(*charts)

plotDistribution(neutral_names, 'leaning')