This notebook explores gender-neutral names across states and over the years. It's got maps. Every kernel is better with a map, right? A summary of the notebook's content is given below.

* Gender-neutral names of the first half of the 20th century were mostly used in southeastern states such as Texas, Mississippi, Arkansas, Alabama and Georgia. The names that were popular there for both boys and girls included Willie, Jessie, Johnnie, Billie, Tommie and so on; you see the pattern. This group of names has been declining in popularity since the 1930s and is rather rare now. The most popular remaining name of this kind is Charlie.

* Today's distribution of names' gender ambiguity over the states is much more uniform. 

* Phonetically, the gender-neutral names seem to end either with some form of -ie or -y (Kelly, Tracy, Jamie) or with a consonant (Taylor, Jordan, Alexis).

* The most recent popular and gender neutral name is Riley.

Techniques-wise I tried some pandas MultiIndex slicing with `pd.IndexSlice` here. It's a handy way of selecting a subset of names in a subset of states, for example. 
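
As a minimal sketch of that kind of selection, here is `pd.IndexSlice` on a toy table. The data values are made up for illustration; only the (Name, Year, State) index layout mirrors what is built later in the notebook.

```python
import pandas as pd

# Toy data with a (Name, Year, State) MultiIndex; the numbers are invented.
toy = pd.DataFrame(
    {
        "Name": ["Willie", "Willie", "Willie", "Jessie", "Jessie", "Mary"],
        "Year": [1920, 1921, 1920, 1920, 1921, 1920],
        "State": ["TX", "TX", "NY", "TX", "GA", "TX"],
        "Popularity": [9.0, 8.5, 2.0, 4.0, 3.5, 30.0],
    }
).set_index(["Name", "Year", "State"]).sort_index()  # slicing needs a sorted index

idx = pd.IndexSlice
# One selector per index level: these names, all years, these states.
subset = toy.loc[idx[["Jessie", "Willie"], :, ["GA", "TX"]], "Popularity"]
print(subset)
```

Each position in `idx[...]` addresses one index level, so a list picks specific labels and `:` keeps everything on that level.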

I've also tried making some interactive charts using Plotly. Setting up a choropleth (colored areas) map turned out to be pretty easy with Plotly. There's also a line chart and a scatterplot where hover tooltips make it possible to include much more data than would be readable on a static chart. I've found Plotly easier to work with than Bokeh. Examples are easy to google and the charts are nice. No, they didn't pay me to write this =).

*Plotly graphics don't show up when editing kernels, but they do show in the rendered version.*

*markdown cells from uploaded notebooks lose their text*

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import cm
from plotly import tools
from plotly.offline import init_notebook_mode, iplot
# `import plotly.plotly as py` dropped: it is unused below, and the module moved to the chart_studio package in plotly 4
import plotly.graph_objs as go
init_notebook_mode(connected=True)

## Load data

In [None]:
dn = pd.read_csv('../input/NationalNames.csv',index_col='Id')
ds = pd.read_csv('../input/StateNames.csv',index_col='Id')

As a measure of ambiguity I'm using the probability of guessing a baby's gender wrong knowing their name, assuming we always guess the more common gender for that name. It ranges from 0 (strictly single-gender names) to 0.5 (names given equally to boys and girls). I also compute weighted ambiguity by multiplying it by the name's popularity in a given year. If we sum the weighted ambiguity values in a given year and divide by 1000 (popularity is expressed in babies per thousand), we get the probability of guessing the gender wrong for all babies of that year.

In [None]:
dn = dn.set_index(['Name','Year','Gender']).unstack().fillna(0).astype(int)
dn.columns = ['CountF','CountM']
dn['CountTotal'] = dn.CountF + dn.CountM
dn['CountYear'] = dn.groupby(level=['Year'])['CountTotal'].transform('sum')
dn['Popularity'] = 1000*dn.CountTotal.values / dn.CountYear.values #babies per thousand
dn['Ambiguity'] = dn[['CountF','CountM']].min(axis=1).values/dn.CountTotal.values
dn['AmbiguityWeighted'] = dn.Ambiguity * dn.Popularity
dn.head()
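
The ambiguity measure defined above can be checked by hand on a tiny table. This is a sketch with made-up counts for a single year; the column names follow the cell above, and `p_wrong` and `direct` are illustrative names.

```python
import pandas as pd

# Invented counts for one year: a mostly-female name, an evenly split name,
# and a strictly male name.
toy = pd.DataFrame(
    {"CountF": [90, 50, 0], "CountM": [10, 50, 100]},
    index=["Kelly", "Riley", "John"],
)
toy["CountTotal"] = toy.CountF + toy.CountM
toy["Popularity"] = 1000 * toy.CountTotal / toy.CountTotal.sum()  # babies per thousand
toy["Ambiguity"] = toy[["CountF", "CountM"]].min(axis=1) / toy.CountTotal
toy["AmbiguityWeighted"] = toy.Ambiguity * toy.Popularity

# Summing the weighted values and undoing the per-thousand scaling ...
p_wrong = toy.AmbiguityWeighted.sum() / 1000
# ... matches the share of babies in the minority gender of their name.
direct = toy[["CountF", "CountM"]].min(axis=1).sum() / toy.CountTotal.sum()
print(toy[["Ambiguity", "AmbiguityWeighted"]])
print(p_wrong, direct)
```

Here 60 of the 300 babies carry the less common gender for their name, so both ways of computing the yearly value agree.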

## Overall ambiguity trend

The gender ambiguity of baby names for each year is calculated as a weighted mean of the individual names' ambiguity, with weights proportional to name popularity.

From the plot we can see that there is an upward trend in ambiguity with a dip around 1950. The probability of guessing a baby's gender wrong from their name went up from 1% in 1880 to about 2.5% in 2014.

In [None]:
amb = dn.groupby(level='Year')['AmbiguityWeighted'].sum()/1000
amb.plot()
plt.title('Gender ambiguity of baby names');

## Gender ambiguity by state

Let's make the same plot for every state separately to see if the previous picture is the same everywhere or different.

In [None]:
ds = ds.set_index(['Name','Year','State','Gender']).unstack().fillna(0).astype(int)
ds.columns = ['CountF','CountM']
ds['CountTotal'] = ds.CountF + ds.CountM
ds['CountYearState'] = ds.groupby(level=['Year','State'])['CountTotal'].transform('sum')
ds['Popularity'] = 1000*ds.CountTotal.values / ds.CountYearState.values
ds['Ambiguity'] = ds[['CountF','CountM']].min(axis=1).values/ds.CountTotal.values
ds['AmbiguityWeighted'] = ds.Ambiguity * ds.Popularity
ds.head()

In [None]:
ambs = ds.groupby(level=['Year','State'])\
         ['AmbiguityWeighted'].sum()\
         .unstack().fillna(0)/1000

In [None]:
ambs.head(2)

Now the `ambs` dataframe contains a time series of gender ambiguity for every state in 1910-2014. While examining these data I noticed some really strange patterns that I attribute to errors in data collection.

1. DC in 1989 and 1990 - a lot of girls got written down as boys.
2. KY in 2004 - general boy/girl confusion, every name seems gender neutral here.

Details can be found in [another notebook](); here I'll just remove the outliers and use linear interpolation to fill the missing values.

In [None]:
ambs.loc[2004,'KY'] = np.nan  # np.NaN was removed in NumPy 2.0
ambs.loc[[1989,1990],'DC'] = np.nan
ambs = ambs.interpolate()
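
A minimal sketch of this remove-and-interpolate step on made-up numbers (the values and the small `toy` frame are invented; only the pattern matches the cell above):

```python
import numpy as np
import pandas as pd

# Invented ambiguity values; 0.200 for KY in 2004 plays the role of the outlier.
toy = pd.DataFrame(
    {"KY": [0.020, 0.200, 0.024], "OH": [0.021, 0.022, 0.023]},
    index=pd.Index([2003, 2004, 2005], name="Year"),
)
toy.loc[2004, "KY"] = np.nan   # blank out the suspicious value
filled = toy.interpolate()     # default method is linear, applied column by column
print(filled)
```

The default linear method treats rows as equally spaced, which is fine here because the index has one row per consecutive year.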

Let's plot all states and the national trend on the same chart:

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
amb.plot(ax=ax,linewidth=5,zorder=100)
ambs.plot(color='#d7472f',alpha=0.6, ax=ax)
ax.legend(['Overall','States'])
ax.set_xlim(1920,2015)
ax.set_xticks(np.arange(1920,2020,10))
ax.set_title('Gender ambiguity over time');

Things I note here:

1. All states follow more or less the same trend since about 1960. Prior to that some states have much more gender neutral names than others. Which states are those?

1. State trends pass below the overall trend. This could be the case if gender-neutral names tended to be rare. Names are not reported in the state data if there are fewer than 5 births per state-year-gender. So for some rare names we may get full counts in the national data but not in individual states.
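
The reporting-threshold effect in the second point can be illustrated with a small sketch. Everything here is an assumption made for illustration: the counts, the state labels A-D, and the helper `p_wrong`. It only shows that the mechanism can push state-level ambiguity below the national value, not that this is what happens in the real data.

```python
import pandas as pd

THRESHOLD = 5  # counts below this are not reported

# Invented births for one year in four hypothetical states: a rare name used
# for both genders and a common single-gender name.
births = pd.DataFrame(
    [(s, "Avery", 6, 3) for s in "ABCD"] + [(s, "John", 0, 100) for s in "ABCD"],
    columns=["State", "Name", "CountF", "CountM"],
)

def p_wrong(df):
    """Share of babies whose gender is guessed wrong from the name."""
    minority = df[["CountF", "CountM"]].min(axis=1).sum()
    return minority / (df.CountF + df.CountM).sum()

# National view: counts are summed over states before the threshold applies.
national = births.groupby("Name")[["CountF", "CountM"]].sum()
national = national.where(national >= THRESHOLD, 0)

# State view: the threshold applies to each state-gender count separately,
# so the 3 boys named Avery in each state disappear.
counts = births.set_index(["State", "Name"])
state = counts.where(counts >= THRESHOLD, 0)

print(p_wrong(national), p_wrong(state))
```

Nationally the 12 boys named Avery are counted, while in every state the name looks strictly female, so the state-level ambiguity drops to zero.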

Let's make a map reflecting gender ambiguity of baby names in the 1920s and look at which states had a lot of gender-neutral names.

In [None]:
def ambiguity_map(year):
    minyear = max(year-5, ambs.index.min())
    maxyear = min(year+5, ambs.index.max())
    df = pd.DataFrame(ambs.loc[minyear:maxyear,:].mean(),columns=['Ambiguity']).reset_index()
    data = [ dict(
            type='choropleth',
            autocolorscale = True,
            locations = df['State'],
            z = df['Ambiguity'].astype(float),
            zmax = ambs.max().max(),
            zmin = ambs.min().min(),
            locationmode = 'USA-states',
            text = df['State'],
            marker = dict(
                line = dict (
                    color = 'rgb(255,255,255)',
                    width = 1
                ) ),
            colorbar = dict(
                title = "Gender Ambiguity")
            ) ]

    layout = dict(
            title = 'Gender Ambiguity by state {}-{}'.format(minyear, maxyear),
            geo = dict(
                scope='usa',
                projection=dict( type='albers usa' ),
                showlakes = True,
                lakecolor = 'rgb(255, 255, 255)'),
                 )

    fig = dict( data=data, layout=layout )
    iplot( fig, filename='d3-cloropleth-map' )
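
To make the averaging step inside `ambiguity_map` concrete, here is a minimal sketch on made-up numbers. `window_mean` and the toy frame are hypothetical names used only for this illustration.

```python
import pandas as pd

# Invented yearly values for two states, 1910-1914 only.
toy = pd.DataFrame(
    {"AL": [0.05, 0.06, 0.07, 0.06, 0.05], "NY": [0.01, 0.01, 0.02, 0.02, 0.01]},
    index=pd.Index(range(1910, 1915), name="Year"),
)

def window_mean(frame, year, half_width=5):
    """Per-column mean over [year - half_width, year + half_width], clipped to the data."""
    lo = max(year - half_width, frame.index.min())
    hi = min(year + half_width, frame.index.max())
    # .loc slicing on labels includes both endpoints
    return lo, hi, frame.loc[lo:hi].mean()

lo, hi, means = window_mean(toy, 1911)
print(lo, hi)
print(means)
```

Because both endpoints are included, a call such as `ambiguity_map(1925)` on the full data averages 1920 through 1930, which is eleven years.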

In [None]:
ambiguity_map(1925)

It looks like the states with the most gender neutral names are in the southeastern part of the USA: Texas, Mississippi, Arkansas, Alabama, Georgia.

What about more recent years? The color scale is the same on both plots.

In [None]:
ambiguity_map(2009)

The latest map looks much more uniform. Everyone watches the same TV shows nowadays?

## The gender neutral names of the southeast

Which names had the most impact on the high values of ambiguity in the southeastern states?

In [None]:
ds.loc[pd.IndexSlice[:,list(range(1920,1930)),
       ['MS','TX','AL','GA','AR','TN']],'AmbiguityWeighted']\
  .groupby(level='Name').mean().sort_values(ascending=False).head(20)

There are a lot of names ending with '-ie' here:

In [None]:
# `_` is IPython's last cell output, i.e. the Series displayed by the previous cell
names = [n for n in _.index if n.endswith('ie')]
names

And also some Spanish-sounding names: Guadalupe, Santos, Trinidad, Lupe.

## Where were -ie names popular?

In [None]:
df = ds.loc[pd.IndexSlice[names,list(range(1920,1930)),:],'Popularity'].groupby(level='State').sum().reset_index()
df.Popularity/=10 # mean over 10 years
data = [ dict(
        type='choropleth',
        autocolorscale = True,
        locations = df['State'],
        z = df['Popularity'].astype(float),
        zmax = df.Popularity.max(),
        zmin = 0.,
        locationmode = 'USA-states',
        text = df['State'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 1
            ) ),
        colorbar = dict(
            title = "Popularity<br>babies per thousand")
        ) ]

layout = dict(
        title = 'Popularity of "-ie" names by state in 1920-1929<br>'+', '.join(names),
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )

fig = dict( data=data, layout=layout )
iplot( fig, filename='d3-cloropleth-map' )

The names ending in '-ie' were more popular in the southeast.

We can't get a good estimate of ambiguity in states where the name is not popular enough, so comparing ambiguities is kind of pointless here.

## -ie names over the years

These names started out as truly gender neutral but then became more masculine. They have been declining in popularity since the 1930s and are pretty rare now.

In [None]:
fig, ax = plt.subplots(1,2,figsize=(12,5))
fig.suptitle(', '.join(names),fontsize='large')
dn.loc[names].groupby(level='Year')[['CountM','CountF']].sum().plot(ax=ax[0]);  # a list of columns is required; tuple-style selection was removed from pandas
dn.loc[names].groupby(level='Year')['Popularity'].sum().plot(ax=ax[1]).legend();

## Other gender neutral names

What other names were reasonably popular and gender ambiguous over time?

In [None]:
n = dn.groupby(level='Name')['AmbiguityWeighted'].max().sort_values(ascending=False)
n.head(20)

In [None]:
df = dn.loc[list(n.head(12).index),'AmbiguityWeighted'].unstack().T.fillna(0)
data = []
for col in df.columns:
    data.append(
        go.Scatter(
        x=df.index,
        y=df[col].values,
        name=col,
        text=col
        ))

layout = go.Layout(
    title='Popular gender ambiguous names over time<br>(hover over line to see the name)',
    hovermode= 'closest',
    xaxis=dict(title='Year'),
    yaxis=dict(title='Ambiguity * Popularity')
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='gender-ambiguous-names')

The name Willie dominated the stage before 1945 with Jessie as runner-up, then came a diversity of gender neutral names including Terry, Tracy, Shannon, Jamie, Casey, Taylor, Jordan, Alexis and Riley. New names go in and out of fashion much faster.

Phonetically, the gender-neutral names seem to end either with some form of -ie or -y or with a consonant.
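
A rough way to check this observation is to group names by their spelling. This is only a sketch: spelling is a crude proxy for pronunciation, and the helper `ending_group` and its three labels are assumptions made for this illustration.

```python
VOWELS = set("aeiou")

def ending_group(name):
    """Spelling-based grouping; a final 'y' is treated as the -ie/-y sound."""
    lower = name.lower()
    if lower.endswith(("ie", "y")):
        return "-ie/-y"
    if lower[-1] not in VOWELS:
        return "consonant"
    return "other vowel"

# Names mentioned in the text above.
names_mentioned = ["Willie", "Jessie", "Terry", "Tracy", "Shannon", "Jamie",
                   "Casey", "Taylor", "Jordan", "Alexis", "Riley"]
groups = {n: ending_group(n) for n in names_mentioned}
for name, group in groups.items():
    print(name, group)
```

All of the names listed above fall into one of the first two groups, while a name such as Guadalupe from the southeastern list would land in the third.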

## Gender neutral names of 2014

In [None]:
df = dn.loc[pd.IndexSlice[:,2014],:]\
        .sort_values(by='AmbiguityWeighted',ascending=False).head(100).reset_index()
df1 = df.head(30)
df2 = df.tail(70)
trace1 = go.Scatter(
    x=df1.CountF,
    y=df1.CountM,
    marker = dict(color=df1.CountF.values/(df1.CountTotal).values,
                  colorscale = 'Viridis'),
    text = df1.Name,
    mode='markers+text',
    textposition = 'top center'
)
trace2 = go.Scatter(
    x=df2.CountF,
    y=df2.CountM,
    marker = dict(color=df2.CountF.values/(df2.CountTotal).values,
                  colorscale = 'Viridis'),
    text = df2.Name,
    mode='markers',
)
line = [df.CountTotal.min()/2, df.CountTotal.max()/2]
trace3 = go.Scatter(
    x = line, y=line, 
    mode='lines',
    line=dict(color='rgba(0,0,0,0.1)',width=1))
data = [trace1, trace2,trace3]
layout = go.Layout(
    title='Gender neutral names of 2014',
    autosize=False,
    width=800,
    height=800,
    showlegend=False,
    hovermode='closest',
    xaxis=dict(type='log',
               title='Count of girls'),
    yaxis=dict(type='log',
               title='Count of boys'),
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='style-annotation')