This dataset is nice and clean, but the underlying data appears to contain some typos.

In the [notebook on spike-fade names](https://www.kaggle.com/dvasyukova/d/kaggle/us-baby-names/persistent-vs-spike-fade-names) I found an extreme example: the boy's name Christop. It was given to more than a thousand babies in one year and never seen again. Can we find out whether it is a typo or a genuine name?

My exploration shows that 29 one-shot names like Christop come from NY in 1989. Whatever happened there (a fire in the archives? a coffee spill?) ate the last 1-3 letters of a bunch of names.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm

In [None]:
dn = pd.read_csv('../input/NationalNames.csv')
ds = pd.read_csv('../input/StateNames.csv')

## Christop

In [None]:
dn[dn.Name=='Christop']

In which states does this name appear?

In [None]:
ds[ds.Name=='Christop']

Only in NY.

What are some similar names that this could be a typo of?

In [None]:
import Levenshtein
Levenshtein.distance('Christop','Christopher')
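
For readers who do not have the `Levenshtein` package installed, here is a minimal pure-Python sketch of the same edit distance. The function name `edit_distance` is my own, not part of any library, and it is meant only to show what the number means: the fewest single-character insertions, deletions, and substitutions needed to turn one string into the other.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (illustrative helper)."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i chars of a into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if the chars match
            ))
        prev = curr
    return prev[-1]

print(edit_distance('Christop', 'Christopher'))  # 3: append 'h', 'e', 'r'
```

Because "Christop" is a prefix of "Christopher", the distance is simply the number of missing trailing letters.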

In [None]:
names = dn.groupby(['Name','Gender'])['Count'].sum().reset_index()
names['Distance'] = names.Name.apply(lambda x: Levenshtein.distance(x,'Christop'))

In [None]:
names.loc[(names.Gender=='M')&(names.Distance<=3)]\
     .sort_values(by=['Count','Distance'], ascending=[False,True]).head(10)

In [None]:
data = ds[(ds.State=='NY')&(ds.Name=='Christopher')&(ds.Gender=='M')]
fig, ax = plt.subplots()
ax.plot(data.Year, data.Count)
c = data.loc[data.Year==1989,'Count'].values[0]
ax.vlines(1989, c, c+1082)
ax.set_xlim(1970,2014);
ax.set_title('Boys named Christopher in NY')
ax.arrow(1985,2500,4,1300)
ax.text(1985,2200,'Boys named Christop this year',
        horizontalalignment='center')

I'd say this is pretty strong evidence that "Christop" is a typo:

- It appears in only one year and only one state (NY, 1989).
- The number of boys named "Christopher" in NY that year shows a dip of the same size.

## Other likely typos

Select names that, like Christop, appeared in only one state and in only one year.

In [None]:
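# Total count and number of yearly rows for each (Name, Gender) nationally
names = dn.groupby(['Name','Gender'])['Count']\
          .agg(['sum','count'])\
          .rename(columns={'sum':'Count','count':'YearsActive'})
# Number of distinct states in which each (Name, Gender) appears
names['StatesActive'] = ds.groupby(['Name','Gender'])['State'].nunique()
# One-year, one-state names first, largest counts at the top
names = names.sort_values(by=['YearsActive','StatesActive','Count'],
                          ascending=[True,True,False])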

In [None]:
typos = names[(names.YearsActive==1)&(names.StatesActive==1)]
typos = typos.merge(ds[['Name','Gender','Year','State']], how='left', 
                    left_index=True, right_on=['Name','Gender'])
typos.head()

In [None]:
typos.groupby(['State','Year'])['Name'].count().sort_values(ascending=False).head()

## Typos from NY in 1989

It looks like the Christophers of NY in 1989 were not the only ones misreported here. Let's look at other likely typos from NY.

In [None]:
ny = typos.loc[(typos.State=='NY')&(typos.Year==1989),['Count','Name','Gender']]
ny.head(10)

In [None]:
print('total babies affected: {}'.format(ny.Count.sum()))

It looks to me like these names mostly lost their last letters. We need to find the most likely full variants.

In [None]:
names = names.reset_index()

In [None]:
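def find_full_name(typoname, gender):
    # Most common name of this gender that starts with the truncated name.
    # The truncated name matches itself, so there is always at least one candidate.
    data = names.loc[(names.Gender==gender)&(names.Name.str.startswith(typoname))]
    return data.loc[data.Count.idxmax(),'Name']
find_full_name('Alexandr','F')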

In [None]:
ny['FullName'] = ''
for i in ny.index:
    ny.loc[i, 'FullName'] = find_full_name(ny.loc[i,'Name'],ny.loc[i,'Gender'])
ny.head(10)
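
As a quick sanity check on the "lost their last letters" idea, the sketch below counts how many trailing letters a truncated name is missing relative to a guessed full name. The helper `letters_lost` is hypothetical (my own name), and the pairs are the ones examined by hand in this notebook.

```python
def letters_lost(truncated, full):
    """Number of trailing letters missing from `truncated`,
    or None if `truncated` is not a proper prefix of `full`."""
    if full.startswith(truncated) and len(full) > len(truncated):
        return len(full) - len(truncated)
    return None

# Truncated name / guessed full name pairs looked at in this notebook
pairs = [('Christop', 'Christopher'), ('Alexandr', 'Alexandra'),
         ('Jacquely', 'Jacquelyn'), ('Cassandr', 'Cassandra')]
for truncated, full in pairs:
    print(truncated, full, letters_lost(truncated, full))
```

A result of `None` would flag a row where the guessed full name is just the truncated name itself, which would deserve a closer look.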

Now we know the likely actual names. We can estimate how many babies would be expected to receive each name in 1989 by averaging the counts from 1988 and 1990. Then we can compare the gap between the expected and actual counts with the count of the typo, as we did for Christop.

In [None]:
# under construction
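
Until that cell is filled in, here is a rough sketch of how the comparison could work, not a finished implementation. The function name `missing_count` and the toy frame are my own assumptions. The Christopher counts in the toy frame are invented for illustration; only the 1082 figure for Christop comes from the plot above. On the real data, the NY rows of `ds` would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for the NY rows of StateNames.csv (Christopher counts are made up)
toy = pd.DataFrame({
    'Name':   ['Christopher', 'Christopher', 'Christopher', 'Christop'],
    'Gender': ['M', 'M', 'M', 'M'],
    'Year':   [1988, 1989, 1990, 1989],
    'Count':  [4000, 3000, 4200, 1082],
})

def missing_count(state_rows, full_name, gender, year=1989):
    """Expected minus actual count of full_name in `year`.

    Expected is the mean of the counts in the two neighbouring years.
    Returns None if any of the three years is absent for this name.
    """
    rows = state_rows[(state_rows.Name == full_name) & (state_rows.Gender == gender)]
    by_year = rows.set_index('Year')['Count']
    if not all(y in by_year.index for y in (year - 1, year, year + 1)):
        return None
    expected = (by_year.loc[year - 1] + by_year.loc[year + 1]) / 2
    return expected - by_year.loc[year]

gap = missing_count(toy, 'Christopher', 'M')
typo_count = toy.loc[toy.Name == 'Christop', 'Count'].sum()
print(gap, typo_count)  # a gap close to the typo count supports the typo hypothesis
```

Averaging the two neighbouring years is a crude baseline; it ignores any trend curvature, so the gap should only be expected to match the typo count approximately.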

## Old and draft stuff below this line


----------


## Alexandr (F)


In [None]:
ds[(ds.Name=='Alexandr')&(ds.Gender=='F')]

It's NY again, same year. Did they have a fire in the archives?

In [None]:
#names = names.reset_index()
names['Distance'] = names.Name.apply((lambda x: Levenshtein.distance(x,'Alexandr')))
names.loc[(names.Gender=='F')&(names.Distance<=3)]\
     .sort_values(by=['Count','Distance'], ascending=[False,True]).head(10)

In [None]:
data = ds[(ds.State=='NY')&(ds.Name=='Alexandra')&(ds.Gender=='F')]
fig, ax = plt.subplots()
ax.plot(data.Year, data.Count)
c = data.loc[data.Year==1989,'Count'].values[0]
ax.vlines(1989, c, c+301)
ax.set_xlim(1970,2014);
ax.set_title('Girls named Alexandra in NY')
ax.arrow(1995,400,-6,300)
ax.text(1995,350,'Girls named Alexandr this year',
        horizontalalignment='center')

## Dalary (F)

Looks like it's just a very new name, not a typo.

In [None]:
ds[(ds.Name=='Dalary')&(ds.Gender=='F')]

## Jacquely (F)

In [None]:
ds[(ds.Name=='Jacquely')&(ds.Gender=='F')]

In [None]:
names['Distance'] = names.Name.apply((lambda x: Levenshtein.distance(x,'Jacquely')))
names.loc[(names.Gender=='F')&(names.Distance<=3)]\
     .sort_values(by=['Count','Distance'], ascending=[False,True]).head(10)

In [None]:
data = ds[(ds.State=='NY')&(ds.Name=='Jacquelyn')&(ds.Gender=='F')]
fig, ax = plt.subplots()
ax.plot(data.Year, data.Count)
c = data.loc[data.Year==1989,'Count'].values[0]
ax.vlines(1989, c, c+50)
ax.set_xlim(1970,2014);
ax.set_title('Girls named Jacquelyn in NY')
ax.arrow(1985,40,4,40)
ax.text(1985,35,'Girls named Jacquely this year',
        horizontalalignment='center')

## Cassandr (F)

In [None]:
data = ds[(ds.State=='NY')&(ds.Name=='Cassandra')&(ds.Gender=='F')]
fig, ax = plt.subplots()
ax.plot(data.Year, data.Count)
c = data.loc[data.Year==1989,'Count'].values[0]
ax.vlines(1989, c, c+152)
ax.set_xlim(1970,2014);
ax.set_title('Girls named Cassandra in NY')
ax.arrow(1995,200,-6,200)
ax.text(1995,180,'Girls named Cassandr this year',
        horizontalalignment='center')