# Regional and Ethnic Baby Name Popularity Variations

People's names vary from place to place and by ethnicity. By combining the NYC Baby Names dataset with the USA Names dataset, we can learn a couple of interesting things on this subject:

* What names are more or less popular than the United States national average in New York City?
* What names are more or less popular than the United States national average amongst people of a specific ethnicity?

## Data Munging

Before we do any analysis we will need to munge the data into the proper format.

In [None]:
import pandas as pd

nyc_names = pd.read_csv("../input/nyc-baby-names/Most_Popular_Baby_Names_by_Sex_and_Mother_s_Ethnic_Group__New_York_City.csv")
nat_names = pd.read_csv("../input/us-baby-names/NationalNames.csv")

In [None]:
nyc_names.head()

In [None]:
nat_names.head()

The ethnicity field needs a bit of mapping work to consolidate its values, since some categories appear under both a truncated and a full label.

In [None]:
nyc_names['Ethnicity'].value_counts()

In [None]:
ethmap = {
    'WHITE NON HISP': 'WHITE NON HISPANIC',
    'ASIAN AND PACI': 'ASIAN AND PACIFIC ISLANDER',
    'BLACK NON HISP': 'BLACK NON HISPANIC'
}

nyc_names['Ethnicity'] = nyc_names['Ethnicity'].map(lambda n: ethmap[n] if n in ethmap else n)

In [None]:
nyc_names['Ethnicity'].value_counts()
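
As an aside, the same consolidation can probably be written with `Series.replace`, which swaps exact matches and leaves every other value alone. This is only a sketch on a toy Series; `toy_eth` and `toy_ethmap` are made-up names, and the values are illustrative rather than taken from the real file.

```python
import pandas as pd

# Toy stand-in for the NYC 'Ethnicity' column (illustrative values only).
toy_eth = pd.Series(
    ['WHITE NON HISP', 'HISPANIC', 'ASIAN AND PACI', 'BLACK NON HISPANIC']
)

# Truncated label -> full label, mirroring the mapping used above.
toy_ethmap = {
    'WHITE NON HISP': 'WHITE NON HISPANIC',
    'ASIAN AND PACI': 'ASIAN AND PACIFIC ISLANDER',
    'BLACK NON HISP': 'BLACK NON HISPANIC',
}

# A dict passed to Series.replace matches whole values, not substrings,
# so labels that are already complete pass through unchanged.
toy_eth_clean = toy_eth.replace(toy_ethmap)
print(toy_eth_clean.value_counts())
```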

I should note here that I find the breakdown of ethnicities used by this dataset rather puzzling. It leaves out Native American, for example, which is usually a shoo-in for these things, as well as a few other possible options: "Mixed", "Other", and so on. I wonder how this data was collected?

Anyway, national names are relatively easy. We'll pick a year, normalize the counts into probabilities (each name's share of all births recorded that year), and throw out the extraneous columns.

In [None]:
sel_nat_names = nat_names[nat_names['Year'] == 2008]\
                    .pipe(lambda df: df.assign(PCount = df['Count'] / df['Count'].sum()))\
                    .drop(['Year', 'Id', 'Count'], axis='columns')

In [None]:
sel_nat_names.head()
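
To make the normalization concrete, here is a small sketch of the same steps on a made-up frame (`toy_nat` and its counts are invented for illustration). Note that the denominator pools both genders, so `PCount` is a share of all babies in the selected year, not a share within one gender.

```python
import pandas as pd

# Made-up rows with the same columns as the national file.
toy_nat = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Name': ['Emma', 'Olivia', 'Noah', 'Emma'],
    'Year': [2008, 2008, 2008, 2007],
    'Gender': ['F', 'F', 'M', 'F'],
    'Count': [60, 30, 10, 99],
})

toy_sel = (
    toy_nat[toy_nat['Year'] == 2008]
        # Divide each count by the total for the selected year.
        .assign(PCount=lambda df: df['Count'] / df['Count'].sum())
        .drop(['Year', 'Id', 'Count'], axis='columns')
)
print(toy_sel)
```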

Getting the NYC names ready takes a bit more work. We want to format this data so that it can be keyed against the national data in a merge operation, which requires massaging a few of the fields into the "proper" format.

In [None]:
import numpy as np

In [None]:
sel_nyc_names = (
    nyc_names
        .pipe(lambda df: df.assign(PCount = df['Count'] / df['Count'].sum()))
        .drop('Rank', axis='columns')
        .rename(columns={"Child's First Name": "Name", "Year of Birth": "Year"})
        .pipe(lambda df: df.assign(Name=df['Name'].str.title()))
        .groupby(['Ethnicity', 'Gender', 'Name'])
        .sum()
        .drop('Year', axis='columns')
        .reset_index()
        .pipe(lambda df: df.assign(
            Gender=df['Gender'].map(lambda n: 'F' if 'F' in n else 'M')
        ))
        .drop('Count', axis='columns')
)
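
The key-massaging steps are easier to see on a tiny example. The sketch below uses an invented frame (`toy_nyc`) that already has the renamed columns; it shows how title-casing lets rows that differ only in capitalization collapse into one group, and how the gender labels are reduced to the one-letter codes used by the national data.

```python
import pandas as pd

# Invented rows; the real file has more columns and many more names.
toy_nyc = pd.DataFrame({
    'Name': ['SOPHIA', 'Sophia', 'JAYDEN'],
    'Gender': ['FEMALE', 'FEMALE', 'MALE'],
    'Ethnicity': ['HISPANIC', 'HISPANIC', 'HISPANIC'],
    'Count': [30, 10, 60],
})

toy_keyed = (
    toy_nyc
        .assign(
            # Share of all rows in the frame, as in the cell above.
            PCount=lambda df: df['Count'] / df['Count'].sum(),
            # 'SOPHIA' and 'Sophia' both become 'Sophia'.
            Name=lambda df: df['Name'].str.title(),
            # 'FEMALE' -> 'F', anything else -> 'M'.
            Gender=lambda df: df['Gender'].map(lambda g: 'F' if 'F' in g else 'M'),
        )
        .groupby(['Ethnicity', 'Gender', 'Name'], as_index=False)['PCount']
        .sum()
)
print(toy_keyed)
```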

Now the join.

In [None]:
joined_names = (sel_nyc_names.merge(sel_nat_names, 
                                    on=['Name', 'Gender'], 
                                    suffixes=('_NYC', '_USA'),
                                    how='outer')
                    .fillna({'PCount_NYC': 0, 'PCount_USA': 0, 'Ethnicity': 'ANY'})
               )

In [None]:
joined_names.head()

Note that in the resulting frame, `PCount_USA` is the probability for these names for *all* of the United States, not just the part of it that happens to be of the ethnicity in question!
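
A minimal sketch of what the outer join does, on two invented frames (`toy_left` and `toy_right`, with made-up probabilities). The `indicator=True` argument is an addition for illustration only: it adds a `_merge` column showing which side each row came from, which makes the effect of the `fillna` call easy to check.

```python
import pandas as pd

# Invented NYC-side rows (ethnicity-specific).
toy_left = pd.DataFrame({
    'Ethnicity': ['HISPANIC', 'WHITE NON HISPANIC'],
    'Gender': ['F', 'M'],
    'Name': ['Sophia', 'Moshe'],
    'PCount': [0.4, 0.6],
})

# Invented national-side rows (no ethnicity column).
toy_right = pd.DataFrame({
    'Name': ['Sophia', 'Brooklyn'],
    'Gender': ['F', 'F'],
    'PCount': [0.7, 0.3],
})

toy_joined = (
    toy_left.merge(toy_right, on=['Name', 'Gender'],
                   suffixes=('_NYC', '_USA'), how='outer', indicator=True)
        # Names missing on one side get probability 0 there;
        # national-only names get the placeholder ethnicity 'ANY'.
        .fillna({'PCount_NYC': 0, 'PCount_USA': 0, 'Ethnicity': 'ANY'})
)
print(toy_joined)
```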

## Exploring Regional Differences

With the munging done we are free to explore some results.

Let's first examine regional differences in name occurrence. The code below calculates a probability ratio (`USA_NYC_Ratio`) and a probability difference (`USA_NYC_Diff`) between the United States and NYC for each name.

In [None]:
d_usa_nyc = (
    joined_names
        .groupby(['Gender', 'Name'])
        # PCount_USA is the same on every row of a (Gender, Name) group, so take the first value.
        .agg({'PCount_USA': 'first', 'PCount_NYC': 'sum'})
        .pipe(lambda df: df.assign(USA_NYC_Ratio=df['PCount_USA'] / df['PCount_NYC']))
        .pipe(lambda df: df.assign(USA_NYC_Diff=df['PCount_USA'] - df['PCount_NYC']))
        .reset_index()
        .rename(columns={
            'PCount_USA': 'P(USA)',
            'PCount_NYC': 'P(NYC)',
            'USA_NYC_Ratio': 'P(USA)/P(NYC)',
            'USA_NYC_Diff': 'P(USA) - P(NYC)'
        })
)
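
One thing to keep in mind when reading the ratio column: because missing names were filled with a probability of 0, the ratio is infinite for names that never appear in the NYC data and 0 for names that never appear nationally. The sketch below shows this on made-up numbers (`toy_p` is hypothetical) and one possible way to keep only rows where the ratio is meaningful.

```python
import numpy as np
import pandas as pd

# Made-up probabilities, using the same column labels as d_usa_nyc.
toy_p = pd.DataFrame({
    'Name': ['Addison', 'Brooklyn', 'Moshe'],
    'P(USA)': [0.013, 0.004, 0.0],
    'P(NYC)': [0.002, 0.0, 0.003],
})

# pandas float division by zero yields inf rather than raising.
toy_p['P(USA)/P(NYC)'] = toy_p['P(USA)'] / toy_p['P(NYC)']
toy_p['P(USA) - P(NYC)'] = toy_p['P(USA)'] - toy_p['P(NYC)']

# Keep only names present in both datasets (finite, non-zero ratio).
ratio = toy_p['P(USA)/P(NYC)']
toy_finite = toy_p[np.isfinite(ratio) & (ratio > 0)]
print(toy_finite)
```

The difference column does not have this problem, which is one reason the rankings that follow sort on the difference rather than the ratio.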

### Nationally Popular, Locally Not

In [None]:
d_usa_nyc.sort_values(by='P(USA) - P(NYC)', ascending=False).head(20)

These tend to be popular Christian names. The differences can be *quite large*, given that we are looking at thousands of names. For example, the name "Addison" is 6.5 times as likely nationally as it is in New York City. The names "Peyton" and "Colton" are a whopping 20 and 30 times as likely nationally, respectively.

It's also interesting to note the presence of Brooklyn, a relatively popular name nationally that is far less likely to be found on babies born in New York City. Probably for no better reason than that it's kinda tacky to name your baby after a borough of your own city; I wouldn't do it...

### Locally Popular, Nationally Not

In [None]:
d_usa_nyc.sort_values(by='P(USA) - P(NYC)').head(20)

On the flip side, here are some names that are popular in New York City which are much less well-known nationally. It's important to note that many of these are ethnic names. New York City is after all an Eastern Seaboard city, and hence much less white, on average, than the United States at large.

This difference accounts for most of the positive difference between NYC baby names and national names. The rest of the names on this list are all-around popular names (like "Mathew") that are actually more popular in New York City than you would expect. These variations are probably true regional variations: "Mathew" just happens to be *even more popular than average* in large cities on the Eastern Seaboard.

Note that a little bit of the variation comes from sampling differences between our datasets: the United States baby names are from 2008, while the NYC baby names are from 2011-2014. Names get more or less popular over time, but this effect typically takes closer to a decade than a fistful of years to manifest.

## Exploring Ethnic Differences

Now we turn our attention to the ethnic factor.

It's important to state that the breakdown we are using below is just a hack. The problem is that we are not comparing apples to apples: we only have a breakdown of names by ethnicity for New York City; we do not have one nationally. Hence it's difficult to normalize the New York City ethnic breakdown (which is ~45% white) against the United States ethnic breakdown (which is ~65% white). So I didn't even try.

All in all, take the results that follow with a grain of salt. But still, there should be some relatively interesting output.
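
For reference, one partial step toward a fairer comparison would be to normalize within each ethnicity, so that each name's probability is its share of births in that group rather than its share of all NYC births. This is only a sketch on invented counts (`toy_counts` and `P_within` are hypothetical names); it is not what the cells below do, and it still would not solve the lack of a national ethnic breakdown.

```python
import pandas as pd

# Invented counts for two placeholder groups.
toy_counts = pd.DataFrame({
    'Ethnicity': ['X', 'X', 'Y', 'Y'],
    'Name': ['A', 'B', 'A', 'C'],
    'Count': [30, 10, 20, 40],
})

# transform('sum') broadcasts each group's total back onto its rows,
# so the division gives a within-group share.
toy_counts['P_within'] = (
    toy_counts['Count']
    / toy_counts.groupby('Ethnicity')['Count'].transform('sum')
)
print(toy_counts)
```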

In [None]:
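delta_ethnicity = (
    joined_names
        # After the merge, each (Gender, Name, Ethnicity) combination should sit on exactly
        # one row, so the difference can be taken row-wise without a groupby/apply.
        .assign(Ethnic_Diff=lambda df: df['PCount_NYC'] - df['PCount_USA'])
        .sort_values(['Gender', 'Name', 'Ethnicity'])
        .loc[:, ['Gender', 'Name', 'Ethnicity', 'Ethnic_Diff']]
        .reset_index(drop=True)
)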

In [None]:
delta_ethnicity.columns = ['Gender', 'Name', 'Ethnicity', 'Δ']
delta_ethnicity.index.name = None

In our ethnic breakdown, we are looking for names that are disproportionately popular amongst members of a particular ethnicity. In the printouts that follow, the `Δ` variable is the difference between the New York City incidence rate for a name within that ethnicity and its overall national rate. The higher the difference, the more "uniquely ethnic" that name is.
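
As a side note, the per-ethnicity top lists can also be pulled in a single pass by sorting on `Δ` and taking the first few rows of each group. The sketch below uses an invented frame (`toy_delta`, with placeholder names and groups) rather than the real `delta_ethnicity`.

```python
import pandas as pd

# Invented rows with the same columns as delta_ethnicity.
toy_delta = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F', 'M'],
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Ethnicity': ['X', 'X', 'Y', 'Y', 'X'],
    'Δ': [0.03, 0.01, 0.02, 0.05, 0.04],
})

toy_top = (
    toy_delta
        .sort_values(by='Δ', ascending=False)
        # head(n) on a groupby keeps the first n rows of each group,
        # preserving the sorted order of the frame.
        .groupby('Ethnicity')
        .head(2)
)
print(toy_top)
```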

### Identifiably White Names in NYC

In [None]:
(
    delta_ethnicity
        .query('Ethnicity == "WHITE NON HISPANIC"')
        .sort_values(by='Δ', ascending=False)
        .head(10)
)

The `WHITE NON HISPANIC` breakdown is *totally dominated* by strongly Jewish names. It's not surprising to me that names like "Moshe" are going to appear only on self-identified white people; what is surprising to me is the strength of the effect. It seems that, within the confines of New York City, super white means super Jewish. That's pretty interesting!

### Identifiably Hispanic Names in NYC

In [None]:
(
    delta_ethnicity
        .query('Ethnicity == "HISPANIC"')
        .sort_values(by='Δ', ascending=False)
        .head(10)
)

Next, here are some uniquely Hispanic names. These make sense, though a couple of them are surprising.

### Identifiably African-American Names in NYC

In [None]:
(
    delta_ethnicity
        .query('Ethnicity == "BLACK NON HISPANIC"')
        .sort_values(by='Δ', ascending=False)
        .head(10)
)

The effect is weaker for highly African-American names than it is for Hispanic names or for those of the (insular) New York City Jewish community, which suggests that fewer names are uniquely chosen for African-American children. Notably, some of these names (particularly Amir and Mohamed) are Muslim names popular amongst those who practice the religion.

### Identifiably Asian/Pacific Islander Names in NYC

In [None]:
(
    delta_ethnicity
        .query('Ethnicity == "ASIAN AND PACIFIC ISLANDER"')
        .sort_values(by='Δ', ascending=False)
        .head(10)
)

Finally, the most identifiably Asian and Pacific Islander names. The top names here are again, and to a much larger extent, Muslim names (including three variations on Mohammad!).