# Exploring First Names and Rankings - 1880-2018


## Overview

The focus of this project is to use data from the U.S. Social Security Administration (SSA) to get some hands-on practice with Pandas, NumPy and Bokeh. The data set comes courtesy of the SSA, and you can read more about it and access it [here](https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data).

### The Data
The data is split up into one file per year spanning 1880-2018 and includes the frequency of first names for U.S. births over the specified time frame. Each data file includes a list of records in the format of "name,sex,number," with the "number" corresponding to the number of occurrences of the name & gender combination for the given year. The data is formatted nicely (sorted by sex, then count), so there isn't really much cleaning to do and there shouldn't be any missing values. Here's a quick sample from the first file:

```
Mary,F,7065
Anna,F,2604
Emma,F,2003
...
```
You may be wondering... is this an exhaustive set of names over that time period? In short, the answer is 'No'. For starters, names with fewer than 5 occurrences for a given year are excluded from the list to protect privacy. In addition, only names that are between 2 and 15 characters long are included, and sorry Prince... symbols are not allowed. (And yes, I do know Prince's birth name is Prince Rogers Nelson, so cool your jets people.)

If you are interested in more background info, try these pages out:
- Background - https://www.ssa.gov/oact/babynames/background.html
- Main Baby Names Page for SSA - https://www.ssa.gov/oact/babynames/index.html (includes some nice interactive functionality to play with the data, limited to the top 1000 names for each year)


### Analysis Ideas

In this notebook, I'll primarily be using Pandas and Numpy for data manipulation and analysis and Bokeh for visualizations. Starting out, I'm seeking to answer the following questions:

1. What are the most popular names of all time - overall and split by gender? And which names are popular for both males and females (i.e., gender neutral)?
2. What's the most popular male and female name for each decade?
3. Which names consistently rank the highest? And how does this compare to overall popularity?
4. Which letters of the alphabet get the most love when it comes to first initials?

That seems like enough for now... we'll see if any other interesting ideas pop up along the way. To get started, we'll need to read in the data, so let's do that now.

In [28]:
# Import libraries
import pandas as pd
import numpy as np

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
#from bokeh.models import ColumnDataSource, HoverTool

In [29]:
%%html
<style>
    table {
        display: inline-block
    }
</style>

## Import the Data
To start, let's import a single data file to see what we're working with. Based on the above, we know each data file only has 3 columns - `name`, `gender` and `number`. There are no headers in the data, so we'll have to create those as part of creating the dataframe.
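
As a quick illustration of reading a headerless file, here is a minimal sketch that parses the three sample records shown earlier from an in-memory string instead of a real `yob` file. The `sample` and `sample_df` names are made up for this example.

```python
import io

import pandas as pd

# Three records copied from the sample above, standing in for a real yob file
sample = io.StringIO("Mary,F,7065\nAnna,F,2604\nEmma,F,2003\n")

# Passing names= tells pandas there is no header row, so the first record stays data
sample_df = pd.read_csv(sample, names=["name", "gender", "year_count"])
print(sample_df)
```

Without `names=`, pandas would treat the first record (Mary) as the header row and we would silently lose it.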

### Create the Initial Dataframe

In [2]:
# Create column headers and read in the first file for 1880
cols = ["name", "gender", "year_count"]
names = pd.read_csv("data/yob1880.txt", names=cols)

# Preview what the data looks like 
print(names.shape)
print(sum(names.year_count))
names.head()

(2000, 3)
201484


|   | name | gender | year_count |
| --- | --- | --- | --- |
| 0 | Mary | F | 7065 |
| 1 | Anna | F | 2604 |
| 2 | Emma | F | 2003 |
| 3 | Elizabeth | F | 1939 |
| 4 | Minnie | F | 1746 |


So in our initial file, we have 2000 total rows and our data looks as expected. In order to combine the files from all years into a single dataframe, we'll need to add a column for the year so we can keep things straight.

In [3]:
# Add a new column for year
year = 1880
names['year'] = year
names.head()

|   | name | gender | year_count | year |
| --- | --- | --- | --- | --- |
| 0 | Mary | F | 7065 | 1880 |
| 1 | Anna | F | 2604 | 1880 |
| 2 | Emma | F | 2003 | 1880 |
| 3 | Elizabeth | F | 1939 | 1880 |
| 4 | Minnie | F | 1746 | 1880 |


### Read in the Rest of the Data 
Now that we have an initial dataframe ready to go, let's read in the rest of the files and add them to our dataset for analysis. We'll have to account for the new `year` column as we go, which we can get from the file name itself. 
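
If we wanted to pull the year out of the file name itself rather than from a loop counter, one option is a small helper like the sketch below. The `read_year_files` function is hypothetical (it is not part of this notebook), and the demo rows are written to a temporary folder purely for illustration.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def read_year_files(data_dir, cols=("name", "gender", "year_count")):
    """Read every yobYYYY.txt file in data_dir into one dataframe (hypothetical helper)."""
    frames = []
    for path in sorted(Path(data_dir).glob("yob*.txt")):
        match = re.fullmatch(r"yob(\d{4})\.txt", path.name)
        if match is None:
            continue  # skip anything that doesn't follow the naming pattern
        frame = pd.read_csv(path, names=list(cols))
        frame["year"] = int(match.group(1))  # the year comes from the file name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny illustrative files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "yob1880.txt").write_text("Mary,F,7065\nJohn,M,9655\n")
    Path(tmp, "yob1881.txt").write_text("Mary,F,6919\n")
    demo = read_year_files(tmp)

print(demo)
```

Collecting the frames in a list and calling `pd.concat` once also avoids copying the growing dataframe on every iteration.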

In [4]:
# Concatenate all remaining files into our dataframe

# Set column headers
cols = ["name", "gender", "year_count"]

# Read each remaining file (starting with 1881) and tag its rows with the year
frames = [names]
for yr in range(1881, 2019):
    temp_df = pd.read_csv(f"data/yob{yr}.txt", names=cols)
    temp_df["year"] = yr
    frames.append(temp_df)

# Concatenate once and reset the index
names = pd.concat(frames, ignore_index=True)

# Preview the shape of the updated dataframe
print(names.shape)
print(sum(names.year_count))
names.tail()

(1957046, 4)
351653025


|   | name | gender | year_count | year |
| --- | --- | --- | --- | --- |
| 1957041 | Zylas | M | 5 | 2018 |
| 1957042 | Zyran | M | 5 | 2018 |
| 1957043 | Zyrie | M | 5 | 2018 |
| 1957044 | Zyron | M | 5 | 2018 |
| 1957045 | Zzyzx | M | 5 | 2018 |


After combining all the files into one, we can see that we now have nearly **2 million** rows (one per name, gender and year combination) and a total count of over **350 million**, which represents the number of people accounted for in the data rather than the number of unique names. And that last name looks pretty interesting... ***Zzyzx***. If you want to learn more (I did), you can take a look [here](https://en.wikipedia.org/wiki/Zzyzx,_California), or [here](https://www.babynamewizard.com/baby-name/boy/zzyzx). Moving on...

## Data Analysis
Next, let's do some analysis on our data to answer some of those burning questions we listed out above. Keep in mind, this isn't an exhaustive record containing the names of ***every*** U.S. citizen from 1880-2018. That said, as we saw above, it does account for quite a few people. So hopefully it will be enough to help us spot some interesting trends.

Let's start our analysis with some basics - things like gender split, most popular names (male, female, combined), etc. Before we do that though, let's do a quick check to ensure we don't have any missing values.

### Check for Missing Values

In [5]:
names.isna().sum()

name          0
gender        0
year_count    0
year          0
dtype: int64

OK, looks like we're good to move forward. Next, let's check out the gender split.

### Gender Split - Total
First, let's look at the total gender split to see what that turns up. To do the work, we'll use the pivot_table function to get the view we're after - setting the index to `gender` and the values to `year_count`. 
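
For anyone new to `pivot_table`, here is a minimal sketch on a made-up four-row frame (the `toy` data is invented for illustration) showing that this kind of pivot is essentially a group-and-sum, written two ways.

```python
import pandas as pd

# Made-up rows in the same layout as the names dataframe
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "William"],
    "gender": ["F", "F", "M", "M"],
    "year_count": [10, 5, 8, 6],
})

# One row per gender, with the counts summed
by_pivot = toy.pivot_table(index="gender", values="year_count", aggfunc="sum")

# The same aggregation expressed with groupby
by_group = toy.groupby("gender")[["year_count"]].sum()

print(by_pivot)
print(by_group)
```

Passing the string `"sum"` rather than the built-in `sum` lets pandas use its own optimized aggregation.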

In [6]:
# Determine the split between Males and Females in the data
names.pivot_table(index="gender", values=["year_count"], aggfunc="sum")

| gender | year_count |
| --- | --- |
| F | 174079232 |
| M | 177573793 |


### Gender Split - Over Time
Looks like a pretty even split in the overall numbers. Before we move forward, let's take a look at the split between male and female records for each year to see if there are any interesting trends. Since this data is based on U.S. births where the child has a social security number, we might expect to see a higher number of males in the earlier years with the number of females catching up and surpassing males in the last hundred years. 

First, let's set up Bokeh to output to our notebook so we can view the results inline, then we'll plot it out and see what we get.

In [7]:
# Setup Bokeh to output directly to the notebook
output_notebook(resources=None, verbose=False, hide_banner=True, load_timeout=5000, notebook_type='jupyter')

In [8]:
# Function to display a line chart in Bokeh
def createLinePlot(title, x_values, y_values, x_label, y_label, colors, legend_labels, legend_location):
    """Creates a single or multiple line plot using bokeh
    
    Args:
        title (string): 
            Title of plot
        x_values (list or Series): 
            Values for the x axis
        y_values (list of lists): 
            List of lists, each list containing the y-axis values for a separate line
        x_label (string): 
            Title of x axis
        y_label (string): 
            Title of y axis
        colors (list): 
            List of one or more strings corresponding with the color to use for each line
        legend_labels (list): 
            List of one or more strings to use as legend labels for each corresponding set of y_values
        legend_location (string):
            String indicating the location of the legend
    
    Returns:
        Single or multi-line Bokeh line plot depending on the input, displayed inline
    """
    
    # Create the figure 
    p = figure(width=800, height=600, toolbar_location=None, tools="")
    p.background_fill_color = "lightslategray"
    p.background_fill_alpha = 0.3

    # Style the figure
    p.title.text=title
    p.title.text_color="black"
    p.title.text_font="helvetica"
    p.title.text_font_style="bold"
    p.xaxis.minor_tick_line_color=None
    p.yaxis.minor_tick_line_color=None
    p.xgrid.grid_line_color = None
    p.y_range.start = 0
    p.xaxis.axis_label=x_label
    p.xaxis.axis_label_text_color="gray"
    p.xaxis.axis_label_text_font="helvetica"
    p.xaxis.axis_label_text_font_style="bold"
    p.yaxis.axis_label=y_label
    p.yaxis.axis_label_text_color="gray"
    p.yaxis.axis_label_text_font="helvetica"
    p.yaxis.axis_label_text_font_style="bold"

    # Create a line for each set of y_values
    for i in range(len(y_values)):
        p.line(x_values, y_values[i], color=colors[i], line_width=2, alpha=0.8, legend_label=legend_labels[i])
    
    # Add legend
    p.legend.location = legend_location
    
    # Display plot
    show(p)
    

# Function to display a vbar chart in Bokeh
def createVBarPlot(title, x_values, y_values, x_label, y_label, colors, x_range=None):
    """Creates a bar plot using bokeh
    
    Args:
        title (string): 
            Title of plot
        x_values (list or Series): 
            Values for the x axis
        y_values (list of lists): 
            List of lists, each list containing the y-axis values for a separate set of bars
        x_label (string): 
            Title of x axis
        y_label (string): 
            Title of y axis
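        colors (list): 
            List of one or more strings corresponding with the color to use for each set of bars
        x_range (list, optional):
            Factors for a categorical x axis (e.g., letters); leave as None for a numeric axis
    
    Returns:
        Bokeh bar plot depending on the input, displayed inline
    """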
    
    # Create the figure 
    p = figure(x_range=x_range, width=800, height=600, toolbar_location=None, tools="")
    p.background_fill_color = "lightslategray"
    p.background_fill_alpha = 0.3

    # Style the figure
    p.title.text=title
    p.title.text_color="black"
    p.title.text_font="helvetica"
    p.title.text_font_style="bold"
    p.xaxis.minor_tick_line_color=None
    p.yaxis.minor_tick_line_color=None
    p.xgrid.grid_line_color = None
    p.xaxis.axis_label=x_label
    p.xaxis.axis_label_text_color="gray"
    p.xaxis.axis_label_text_font="helvetica"
    p.xaxis.axis_label_text_font_style="bold"
    p.yaxis.axis_label=y_label
    p.yaxis.axis_label_text_color="gray"
    p.yaxis.axis_label_text_font="helvetica"
    p.yaxis.axis_label_text_font_style="bold"

    # Create a set of bars for each set of y_values
    for i in range(len(y_values)):
        p.vbar(x=x_values, top=y_values[i], width=0.6, color=colors[i], alpha=0.8)
        #TODO - Consider adding legend labels and hover tooltips
    
    # Display plot
    show(p)

In [9]:
# Create list of distinct years for x values
x_values = names['year'].unique().tolist()

# Create y values as the sum of the count for each year, grouped by gender 
y_male = names[names['gender']=='M'].groupby(['year', 'gender']).year_count.sum() / 1000000
y_female = names[names['gender']=='F'].groupby(['year', 'gender']).year_count.sum() / 1000000
y_values = [y_male, y_female]

# Assign other variables
title = "Gender Split Over Time"
x_label = "Year"
y_label = "Count, M"
colors =  ["cornflowerblue", "indianred"]
legend_labels = ["Male", "Female"]
legend_location = "top_left"

createLinePlot(title, x_values, y_values, x_label, y_label, colors, legend_labels, legend_location)

Interesting... my hypothesis was pretty much dead wrong. In fact, the data shows almost the complete opposite being true with women outnumbering men for most of the first 60 years and then men overtaking women starting around 1950. It's also interesting that the two lines mirror each other in terms of shape almost perfectly. 

Let's take a look at the difference in gender numbers over time before we move on to name popularity. First, we'll create a series for the difference between the male and female counts for each year and then we'll plot it out.

In [10]:
# Add a chart plotting the difference between men and women for each year

# Calculate the difference for each year (in thousands), aligned on the year index
male_count = names[names['gender']=='M'].groupby('year').year_count.sum()
female_count = names[names['gender']=='F'].groupby('year').year_count.sum()
gender_gaps = (male_count - female_count) / 1000

# Create lists of values for males and females relative to a difference of 0
# (positive gaps go to the male list, negative gaps to the female list)
more_males = gender_gaps.clip(lower=0).tolist()
more_females = gender_gaps.clip(upper=0).tolist()

# x values are the series of unique years
x_values = names['year'].unique().tolist()

# y values are the values for males and females relative to a difference of 0
y_values = [more_males, more_females]

# Assign other variables
title = "Gender Difference Over Time"
x_label = "Year"
y_label = "Difference, in thousands"
colors =  ["cornflowerblue", "indianred"]

# Create plot using function
createVBarPlot(title, x_values, y_values, x_label, y_label, colors)

### Most Popular Names - Overall
Now let's move on to name popularity. First, we'll look at which names are most popular over time for each gender. We can use the `DataFrame.pivot_table()` method again to achieve this result.

In [11]:
# Create a pivot table view for the top 10 names based on total count
names.pivot_table(index=["name", "gender"], values="year_count", aggfunc="sum").sort_values("year_count", ascending=False).head(10)

| name | gender | year_count |
| --- | --- | --- |
| James | M | 5164280 |
| John | M | 5124817 |
| Robert | M | 4820129 |
| Michael | M | 4362731 |
| Mary | F | 4125675 |
| William | M | 4117369 |
| David | M | 3621322 |
| Joseph | M | 2613304 |
| Richard | M | 2565301 |
| Charles | M | 2392779 |


Looks like male names dominate the list of overall popularity. Not sure exactly what that says about male names... maybe they tend to be more traditional, or maybe it's a more frequent occurrence that male names are handed down between generations. In any case, we should split this out by gender next so we can see the separate lists.

### Most Popular Names - By Gender
First, let's create a new dataframe to make this a bit easier. We can use `DataFrame.copy()` to make a deep copy of our initial dataframe and then we can collapse the year_count for each name & gender combo.

In [12]:
# Create a copy of the dataframe, drop the year column & rename year_count to total_count
names_copy = names.copy(deep=True)
names_copy = names_copy.drop(columns='year')
names_copy = names_copy.rename(columns={'year_count': 'total_count'})

# Collapse the data by name and gender and sum up the total_count column
names_copy = names_copy.groupby(['name', 'gender'], as_index=False)\
            .agg({'total_count': 'sum'}).reindex(columns=names_copy.columns)

#### Top 10 Male Names - All Time

In [13]:
# Filter the new dataframe by gender and sort by total_count
# Top 10 male names based on total count
names_copy[names_copy["gender"]=="M"].sort_values("total_count", ascending=False, ignore_index=True).head(10)

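|   | name | gender | total_count |
| --- | --- | --- | --- |
| 0 | James | M | 5164280 |
| 1 | John | M | 5124817 |
| 2 | Robert | M | 4820129 |
| 3 | Michael | M | 4362731 |
| 4 | William | M | 4117369 |
| 5 | David | M | 3621322 |
| 6 | Joseph | M | 2613304 |
| 7 | Richard | M | 2565301 |
| 8 | Charles | M | 2392779 |
| 9 | Thomas | M | 2311849 |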


#### Top 10 Female Names - All Time

In [14]:
# Top 10 female names based on total count
names_copy[names_copy["gender"]=="F"].sort_values("total_count", ascending=False, ignore_index=True).head(10)

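|   | name | gender | total_count |
| --- | --- | --- | --- |
| 0 | Mary | F | 4125675 |
| 1 | Elizabeth | F | 1638349 |
| 2 | Patricia | F | 1572016 |
| 3 | Jennifer | F | 1467207 |
| 4 | Linda | F | 1452668 |
| 5 | Barbara | F | 1434397 |
| 6 | Margaret | F | 1248985 |
| 7 | Susan | F | 1121703 |
| 8 | Dorothy | F | 1107635 |
| 9 | Sarah | F | 1077746 |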


### Popular Gender Neutral Names
There's nothing too surprising in the above lists. Let's dig a little deeper and see if we can figure out the most common gender neutral names. In other words, names that are relatively common for both males and females. We can start by looking at the total count for names that appear ***at least*** once for each gender and see what that looks like. I suspect it won't give us exactly what we're after, but let's give it a try. 

To do this analysis, first we'll isolate the list of unique names for each gender into separate NumPy ndarrays and then we'll use `numpy.intersect1d()` to find the intersection of the two arrays. Once we have the intersection, we can use it to filter our dataframe to the gender neutral names.
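
Here is a tiny sketch of what the intersection step does, using two made-up arrays (`male` and `female` are invented for illustration). The result comes back sorted with duplicates removed.

```python
import numpy as np

# Made-up name arrays standing in for the per-gender lists of unique names
male = np.array(["Jordan", "James", "Taylor"])
female = np.array(["Taylor", "Mary", "Jordan"])

# Names that appear in both arrays
shared = np.intersect1d(male, female)
print(shared)
```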

In [15]:
# Isolate the unique male names & view the total
male_names = names_copy[names_copy['gender'] == 'M'].name.unique()
print("Number of male names: " + str(len(male_names)))

# Isolate the unique female names & view the total
female_names = names_copy[names_copy['gender'] == 'F'].name.unique()                          
print("Number of female names: " + str(len(female_names)))

# Find the intersection of the two arrays & view the total
gn_names = np.intersect1d(male_names, female_names)
print("Number of gender neutral names: " + str(len(gn_names)))

Number of male names: 41475
Number of female names: 67698
Number of gender neutral names: 10773


In [16]:
# Top 10 gender neutral names based on total count
names_copy[names_copy.name.isin(gn_names)]\
    .pivot_table(index="name", values="total_count", aggfunc="sum").sort_values("total_count", ascending=False).head(10)

| name | total_count |
| --- | --- |
| James | 5187679 |
| John | 5146508 |
| Robert | 4840228 |
| Michael | 4384463 |
| Mary | 4140840 |
| William | 4133327 |
| David | 3634229 |
| Joseph | 2623958 |
| Richard | 2574832 |
| Charles | 2405197 |


Clearly, this result is unexpected as none of the names in the above list is a common ***gender neutral*** name. In fact, if you're keeping score at home, you'll notice this list looks suspiciously similar to the list of the most popular names overall. I suspect the result is skewed for names that have a disproportionately high number for one gender and a much smaller number for the other. We can confirm this to be the case by looking at the numbers for one of the names and checking the count for each gender.

In [17]:
# Create a list of the top10 names from the list above
top10_count = names_copy[names_copy.name.isin(gn_names)]\
    .pivot_table(index="name", values="total_count", aggfunc="sum").sort_values("total_count", ascending=False).head(10)
top10_names = top10_count.index
top10_names

Index(['James', 'John', 'Robert', 'Michael', 'Mary', 'William', 'David',
       'Joseph', 'Richard', 'Charles'],
      dtype='object', name='name')

In [18]:
# Use the list to filter out the counts of these names for each gender
names_copy[names_copy.name.isin(top10_names)]\
    .pivot_table(index=["name","gender"], values="total_count").sort_values(["name","gender"]).head(20)

| name | gender | total_count |
| --- | --- | --- |
| Charles | F | 12418 |
| Charles | M | 2392779 |
| David | F | 12907 |
| David | M | 3621322 |
| James | F | 23399 |
| James | M | 5164280 |
| John | F | 21691 |
| John | M | 5124817 |
| Joseph | F | 10654 |
| Joseph | M | 2613304 |


Just as I suspected, each name in the list has a disproportionately high count for one gender and a fairly small count for the other. In order to find a more representative list, we'll have to do some more clean up. Let's start by creating a new df with the gender neutral names and then filtering it to a threshold count for each gender. Then, we can revisit the top 10 list to see if we get a different result.
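
The threshold idea can also be sketched as a small reusable function. This is only one possible approach, tried on a made-up frame: `gender_neutral_names` and the `toy` data are hypothetical, and the function assumes the same `name`, `gender` and `total_count` columns as `names_copy`. It also insists that a name reaches the threshold under **both** genders, which is my assumption about what "gender neutral" should mean here.

```python
import pandas as pd


def gender_neutral_names(totals, threshold):
    """Return combined totals for names whose F and M counts both reach the threshold."""
    wide = totals.pivot_table(index="name", columns="gender",
                              values="total_count", aggfunc="sum", fill_value=0)
    # Make sure both gender columns exist even if one is absent from the input
    wide = wide.reindex(columns=["F", "M"], fill_value=0)
    keep = wide[(wide["F"] >= threshold) & (wide["M"] >= threshold)]
    return keep.sum(axis=1).sort_values(ascending=False)


# Made-up totals: James is lopsided, Kelly is balanced, Mary has no male rows
toy = pd.DataFrame({
    "name": ["James", "James", "Kelly", "Kelly", "Mary"],
    "gender": ["F", "M", "F", "M", "F"],
    "total_count": [20, 5000, 400, 300, 4000],
})

print(gender_neutral_names(toy, threshold=100))
```

Changing `threshold` makes it easy to see how sensitive the list is to the cutoff.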

In [19]:
###TODO###
# Refactor to avoid creating a new df
# Create a function that takes the threshold as an int and returns the resulting list

# Compile a list of names that are not likely to be common 
# gender neutral names, i.e., names where the count for either gender is < 30k
non_gn_names = names_copy.name[names_copy['total_count'] < 30000]

# Create a copy of our names_copy dataframe 
gn_name_counts = names_copy.copy(deep=True)
print("Rows before: " + str(gn_name_counts.shape[0]))

# Drop rows that don't meet the threshold
gn_name_counts.drop(gn_name_counts[gn_name_counts['name'].isin(non_gn_names)].index, inplace = True) 

print("Rows after: " + str(gn_name_counts.shape[0]))

Rows before: 109173
Rows after: 79


In [20]:
# And now, pivot and display the top 10 based on total count
gn_name_counts.pivot_table(index="name", values="total_count", aggfunc="sum").sort_values("total_count", ascending=False).head(10)

| name | total_count |
| --- | --- |
| Willie | 595102 |
| Kelly | 553154 |
| Terry | 519811 |
| Jordan | 505517 |
| Taylor | 430836 |
| Alexis | 401937 |
| Leslie | 379807 |
| Jamie | 353733 |
| Shannon | 347023 |
| Shawn | 335706 |


This looks a little bit more reasonable for a list of gender neutral names. Next, let's take a look at the most popular male and female names by decade.


### Most Popular Names by Decade 
In order to accomplish this, we'll work through the following steps:
1. Create a function to generate the decade from the year for each row
2. Apply the function to the df to populate the new "decade" column
3. Collapse the data by decade to get the total count per decade for each name
4. Clean up our interim df
5. Create our final df by sorting on decade, gender & decade count and displaying the max decade count for each grouping 
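
To make the steps above concrete, here is a minimal sketch on a made-up frame. It swaps in integer arithmetic and `idxmax` for the function-plus-sort approach, so treat it as an alternative illustration rather than the notebook's actual method; the `toy` data is invented.

```python
import pandas as pd

# Made-up yearly counts covering two decades
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "Mary", "Anna", "John", "John"],
    "gender": ["F", "F", "F", "F", "M", "M"],
    "year": [1880, 1880, 1889, 1889, 1880, 1891],
    "year_count": [70, 60, 10, 30, 90, 80],
})

# Steps 1-2: integer division maps each year to the start of its decade
toy["decade"] = toy["year"] // 10 * 10

# Steps 3-4: total count per decade, gender and name, with a clearer column name
per_decade = (
    toy.groupby(["decade", "gender", "name"], as_index=False)["year_count"].sum()
    .rename(columns={"year_count": "decade_count"})
)

# Step 5: keep the row with the largest count in each decade/gender group
top_idx = per_decade.groupby(["decade", "gender"])["decade_count"].idxmax()
top = per_decade.loc[top_idx].reset_index(drop=True)
print(top)
```

In this toy data Anna edges out Mary for the 1880s (90 vs. 80) even though Mary wins the single year 1880, which is exactly the kind of thing the decade roll-up is meant to surface.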

In [21]:
# Create function to return decade for a given year
def decadeColumn(year):
    """Returns 4 digit decade when passed 4 digit year
    
    Args:
        year (int): 4 digit year (e.g. 2020)
        
    Returns:
        decade for year as 4 digit integer
    
    """
    # Subtract the last digit of the year to land on the start of the decade
    return year - (year % 10)


# Create new column for decade and populate with function
names["decade"] = names.year.apply(decadeColumn)

# Collapse the data by decade, name and gender and sum up the year_count column
decade_names = names.groupby(['decade', 'name', 'gender'], as_index=False)\
            .agg({'year_count': 'sum'})

# Clean up df cols - reorder, rename
decade_names = decade_names.reindex(columns=['decade', 'gender', 'name', 'year_count'])\
    .rename(columns={'year_count': 'decade_count'})

# Create new df limited to most popular male and female names by decade
decade_popular_names = decade_names.sort_values(['decade','gender','decade_count'])\
    .groupby(['decade','gender']).tail(1)
decade_popular_names = decade_popular_names.reset_index(drop=True)
decade_popular_names

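|   | decade | gender | name | decade_count |
| --- | --- | --- | --- | --- |
| 0 | 1880 | F | Mary | 91668 |
| 1 | 1880 | M | John | 89950 |
| 2 | 1890 | F | Mary | 131136 |
| 3 | 1890 | M | John | 80665 |
| 4 | 1900 | F | Mary | 161505 |
| 5 | 1900 | M | John | 84593 |
| 6 | 1910 | F | Mary | 478639 |
| 7 | 1910 | M | John | 376318 |
| 8 | 1920 | F | Mary | 701754 |
| 9 | 1920 | M | Robert | 576364 |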


For the most part, this list is not too surprising based on our lists of overall popularity by gender. It is interesting to see that some variation starts appearing in the more recent decades and especially since 2000.

### Most Consistent Highly Ranked Names
Next up, let's take a look at the names that are the most consistently popular over the years. There are a few different ways we could look at this...
- Most consistently in the top ten per year/decade
- Highest rank on average per year/decade

For the sake of simplicity, let's start by looking at the highest ranking names on average for each year and then based on that we can decide next steps.
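
For reference, the other option listed above (most consistently in the top ten per year) could be sketched roughly as follows. The `top_n_appearances` helper and the `toy` data are hypothetical, and the demo uses the top 2 instead of the top 10 to keep the frame small.

```python
import pandas as pd


def top_n_appearances(df, n=10):
    """Count, per name and gender, the years in which the name ranked in the top n."""
    # Rank names within each year and gender; rank 1 is the highest count
    ranks = df.groupby(["year", "gender"])["year_count"].rank(method="min", ascending=False)
    in_top = df[ranks <= n]
    return in_top.groupby(["name", "gender"]).size().sort_values(ascending=False)


# Made-up counts for two years
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "Mary", "Emma", "Anna"],
    "gender": ["F"] * 6,
    "year": [1880, 1880, 1880, 1881, 1881, 1881],
    "year_count": [70, 26, 20, 69, 30, 27],
})

appearances = top_n_appearances(toy, n=2)
print(appearances)
```

Here Mary makes the top 2 in both years, while Anna and Emma each make it once.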

In [22]:
# Calculate top 10 male names of all time based on average per year
names[names["gender"]=="M"].groupby(['name', 'gender']).agg({'year_count': 'mean'})\
    .astype(int).sort_values("year_count", ascending=False).head(10)

| name | gender | year_count |
| --- | --- | --- |
| James | M | 37153 |
| John | M | 36869 |
| Robert | M | 34677 |
| Michael | M | 31386 |
| William | M | 29621 |
| David | M | 26052 |
| Joseph | M | 18800 |
| Richard | M | 18455 |
| Charles | M | 17214 |
| Thomas | M | 16632 |


In [23]:
# Calculate top 10 male names of all time based on average per decade
decade_names[decade_names["gender"]=="M"].groupby(['name', 'gender']).agg({'decade_count': 'mean'})\
    .astype(int).sort_values("decade_count", ascending=False).head(10)

| name | gender | decade_count |
| --- | --- | --- |
| James | M | 368877 |
| John | M | 366058 |
| Robert | M | 344294 |
| Michael | M | 311623 |
| William | M | 294097 |
| David | M | 258665 |
| Joseph | M | 186664 |
| Richard | M | 183235 |
| Charles | M | 170912 |
| Thomas | M | 165132 |


In [24]:
# Calculate top 10 female names of all time based on average per year
names[names["gender"]=="F"].groupby(['name', 'gender']).agg({'year_count': 'mean'})\
    .astype(int).sort_values("year_count", ascending=False).head(10)

| name | gender | year_count |
| --- | --- | --- |
| Mary | F | 29681 |
| Jennifer | F | 14526 |
| Elizabeth | F | 11786 |
| Patricia | F | 11644 |
| Ashley | F | 10576 |
| Linda | F | 10450 |
| Barbara | F | 10319 |
| Kimberly | F | 9730 |
| Madison | F | 9519 |
| Margaret | F | 8985 |


In [25]:
# Calculate top 10 female names of all time based on average per decade
decade_names[decade_names["gender"]=="F"].groupby(['name', 'gender']).agg({'decade_count': 'mean'})\
    .astype(int).sort_values("decade_count", ascending=False).head(10)

| name | gender | decade_count |
| --- | --- | --- |
| Mary | F | 294691 |
| Jennifer | F | 133382 |
| Elizabeth | F | 117024 |
| Patricia | F | 112286 |
| Linda | F | 103762 |
| Barbara | F | 102456 |
| Kimberly | F | 92985 |
| Margaret | F | 89213 |
| Ashley | F | 84612 |
| Susan | F | 80121 |


#### Results - Name Popularity Over Time
Let's take a look at the results for name popularity over time. It's a bit easier to see it when it's all in one place. 

For male names, there is no difference at all in the lists when comparing overall popularity (i.e., total count) with highest average counts per year and decade.

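For female names, there is a bit more variation, but overall the lists are pretty consistent. Keep in mind that the averages only cover the years (or decades) in which a name actually appears in the data, which favors names that became popular more recently, like Jennifer, Ashley and Madison. When it comes to the masses, I guess we aren't very creative with our name choices.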


| MALE |  # | Overall    | Avg by Year | Avg by Dec  | FEMALE |  # | Overall     | Avg by Year | Avg by Dec  |
| ---- |:--:|:---------- |:----------- |:----------- | ------ |:--:|:----------- |:----------- |:----------- |
|      |  1 | James      | James       | James       |        |  1 | Mary        | Mary        | Mary        |
|      |  2 | John       | John        | John        |        |  2 | Elizabeth   | Jennifer    | Jennifer    |
|      |  3 | Robert     | Robert      | Robert      |        |  3 | Patricia    | Elizabeth   | Elizabeth   |  
|      |  4 | Michael    | Michael     | Michael     |        |  4 | Jennifer    | Patricia    | Patricia    |
|      |  5 | William    | William     | William     |        |  5 | Linda       | Ashley      | Linda       |
|      |  6 | David      | David       | David       |        |  6 | Barbara     | Linda       | Barbara     |
|      |  7 | Joseph     | Joseph      | Joseph      |        |  7 | Margaret    | Barbara     | Kimberly    |
|      |  8 | Richard    | Richard     | Richard     |        |  8 | Susan       | Kimberly    | Margaret    |
|      |  9 | Charles    | Charles     | Charles     |        |  9 | Dorothy     | Madison     | Ashley      |
|      | 10 | Thomas     | Thomas      | Thomas      |        | 10 | Sarah       | Margaret    | Susan       |


### Most Popular First Initial
Now let's go back to our original dataframe and see if we can figure out the most popular first initial by gender and also combined.
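
As a rough sketch of the by-gender and combined view, the made-up example below uses the vectorized `.str[0]` accessor and a pivot instead of a helper function. The `toy` data and the `combined` column are invented for illustration.

```python
import pandas as pd

# Made-up counts in the same layout as the names dataframe
toy = pd.DataFrame({
    "name": ["Mary", "Minnie", "Anna", "John", "James", "Albert"],
    "gender": ["F", "F", "F", "M", "M", "M"],
    "year_count": [70, 17, 26, 96, 59, 15],
})

# .str[0] takes the first character of every name in one vectorized step
toy["letter"] = toy["name"].str[0]

# Rows are letters, columns are genders, cells are summed counts
by_gender = toy.pivot_table(index="letter", columns="gender",
                            values="year_count", aggfunc="sum", fill_value=0)
by_gender["combined"] = by_gender.sum(axis=1)

print(by_gender)
print(by_gender.idxmax())  # most popular initial per column
```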

In [30]:
# Create a simple function to get the first letter of each name
def getFirstLetter(string):
    """Function to return first letter of a string
    
    Args:
        string (string): name or other string

    Returns: 
        first letter of string
            
    """
    return string[0]

# Apply function to dataframe to get first letter for each name
names["letter"] = names.name.apply(getFirstLetter)
names.head()

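|   | name | gender | year_count | year | decade | letter |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Mary | F | 7065 | 1880 | 1880 | M |
| 1 | Anna | F | 2604 | 1880 | 1880 | A |
| 2 | Emma | F | 2003 | 1880 | 1880 | E |
| 3 | Elizabeth | F | 1939 | 1880 | 1880 | E |
| 4 | Minnie | F | 1746 | 1880 | 1880 | M |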


In [31]:
# x values are the unique letters 
x_values = names['letter'].unique().tolist()
x_values.sort()

# y values are the sum of the count for each letter 
y_values = round((names.groupby(['letter']).year_count.sum() / 1000000), 2).to_list()
# y_values = y_values.tolist()

# Create remaining variables for plot
title = "First Initial - Letter Counts Over Time"
x_label = "Letter"
y_label = "Count, M"
colors = ["mediumpurple"]
x_range = x_values

# Create plot using function
createVBarPlot(title, x_values, [y_values], x_label, y_label, colors, x_range)

### Write to csv
Let's write this data to a single csv to make it easier to import in the future.

In [32]:
# Write new dataframe to single csv file (index=False avoids an extra unnamed index column on re-import)
names.to_csv('names.csv', index=False)

## TODO
- Create function to plot the popularity of a single name over time
    - Call with my name
    - Check for popularity bump of unique names (e.g. Barack, Kanye, etc.)
- Add function to generate the following stats for a specified name and yob:
    - Rank in yob
    - Highest ranking year / rank
    - Lowest ranking year / rank
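
As a starting point for the rank stats above, here is a minimal sketch. The `name_rank_stats` function is hypothetical and only tried on a made-up frame; it assumes the lowercase `name`, `gender`, `year` and `year_count` columns used in this notebook and ranks each name within its gender for each year.

```python
import pandas as pd


def name_rank_stats(df, name, gender):
    """Return the yearly rank of a name within its gender, plus its best and worst years."""
    same_gender = df[df["gender"] == gender].copy()
    # Rank 1 is the most common name of that gender in a given year
    same_gender["rank"] = (
        same_gender.groupby("year")["year_count"]
        .rank(method="min", ascending=False)
        .astype(int)
    )
    history = same_gender[same_gender["name"] == name].set_index("year")["rank"]
    if history.empty:
        return None  # the name never appears for this gender
    return {
        "rank_by_year": history.to_dict(),
        "best_year": int(history.idxmin()),
        "worst_year": int(history.idxmax()),
    }


# Made-up counts for two years
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "Mary", "Emma", "Anna"],
    "gender": ["F"] * 6,
    "year": [1880, 1880, 1880, 1881, 1881, 1881],
    "year_count": [70, 26, 20, 69, 30, 27],
})

stats = name_rank_stats(toy, "Emma", "F")
print(stats)
```

When a name ties for its best or worst rank in several years, `idxmin` and `idxmax` return the first such year.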


**Some Ideas for an Interactive UI**
1. Enter a name and year of birth to get statistics:
    1. How popular was the name during the specified yob?
    2. What were the most popular names for yob (M/F)?
    3. Which years did the name rank highest and lowest? 
2. Baby name generator ideas:
    1. Throwback name ideas (i.e. popular names from a specified period - n years in the past)
    2. Popular gender neutral names - all time, last 10 years, etc.
    3. Specify a first initial and gender to get a list of popular or random names
    4. "I'm feeling lucky" - Generates a random first & middle name combo (could include some controls for specifying the acceptable popularity levels)

In [None]:
# #TODO - Needs to be cleaned up


# # Plot the popularity of a single name over time

# # Assign variables for name and gender (simulation of user input)
# name = 'Lisa'
# gender = 'F'

# # Create a temp df based on the above variables
# temp_df = names_df[(names_df.Name == name) & (names_df.Gender == gender)]

# # Assign x and y values
# x = temp_df["Year"]
# y = temp_df["Count"]

# # Create a figure object
# #f = figure(plot_width=800,plot_height=600, toolbar_location=None, tools="")
# f = figure()

# # Style the plot
# f.title.text="Name Popularity 1880-2018"
# f.title.text_color="Gray"
# f.title.text_font="helvetica"
# f.title.text_font_style="bold"
# f.xaxis.minor_tick_line_color=None
# f.yaxis.minor_tick_line_color=None

# ## Line plot
# # f.line(x, y, color="darkblue", line_width=2, alpha=0.9)

# ## Bar plot
# f.vbar(x=x, top=y, width=0.2, alpha=0.9)

# f.xgrid.grid_line_color = None
# f.y_range.start = 0

# output_file("name-popularity.html")
# show(f)

In [None]:
# #TODO - Needs to be cleaned up or removed


# # Plot the popularity of two names over time


# # Assign variables for name and gender (simulation of user input)
# name1 = 'Steven'
# gender1 = 'M'

# name2 = 'Lisa'
# gender2 = 'F'

# # Create temp dfs based on the above variables
# ###TODO###
# # Fill in any missing years with 0 count
# temp_df1 = names_df[(names_df.Name == name1) & (names_df.Gender == gender1)]
# temp_df2 = names_df[(names_df.Name == name2) & (names_df.Gender == gender2)]

# # Assign x and y values
# x1 = temp_df1["Year"]
# y1 = temp_df1["Count"]

# x2 = temp_df2["Year"]
# y2 = temp_df2["Count"]

# # Create a figure object
# #f = figure(plot_width=800,plot_height=600, toolbar_location=None, tools="")
# f = figure()

# # Style the plot
# f.title.text="Name Popularity - " + name1 + " vs. " + name2 + " - 1880-2018"
# f.title.text_color="Black"
# f.title.text_font="helvetica"
# f.title.text_font_style="bold"
# f.xaxis.minor_tick_line_color=None
# f.yaxis.minor_tick_line_color=None

# # Bar plot
# f.vbar(x=x1, top=y1, width=0.3, color="cornflowerblue", alpha=0.9,legend_label=(name1))
# f.vbar(x=x2, top=y2, width=0.3, color="indianred", alpha=0.9, legend_label=(name2))

# # # Line plot
# # f.line(x1, y1, color="darkblue", line_width=2, alpha=0.8, legend_label=(name1))
# # f.line(x2, y2, color="darkred", line_width=2, alpha=0.8, legend_label=(name2))
# # f.multi_line([x1, x2], [y1, y2],
# #              color=["darkblue", "darkred"], alpha=[0.8, 0.8], line_width=2, legend_label=([name1,name2]))

# f.xgrid.grid_line_color = None
# f.y_range.start = 0

# f.legend.location = "top_left"
# f.legend.click_policy="hide"

# output_file("name-popularity-vs.html")
# show(f)

In [None]:
# #TODO - Needs to be cleaned up or removed

# ###
# # TODO - Refactor code to be less repetitive 
# # Assign variables to simulate user input
# name1 = 'Steven'
# gender1 = 'M'

# name2 = 'Lisa'
# gender2 = 'F'

# names = [name1, name2]
# genders = [gender1, gender2]

# p = figure(plot_width=800, plot_height=600)
# p.title.text = "Name Popularity - " + name1 + " vs. " + name2 + " - 1880-2018"

# ### Example that needs to be updated
# for data, name, color in zip([AAPL, IBM, MSFT, GOOG], ["AAPL", "IBM", "MSFT", "GOOG"], Spectral4):
#     df = pd.DataFrame(data)
#     df['date'] = pd.to_datetime(df['date'])
#     p.line(df['date'], df['close'], line_width=2, color=color, alpha=0.8, legend_label=name)

# p.legend.location = "top_left"
# p.legend.click_policy="hide"

# output_file("interactive_legend.html", title="interactive_legend.py example")

# show(p)
# ### End example to be updated

In [None]:
# #TODO - Needs to be cleaned up or removed


# ###
# # TODO
# # Refactor all to accept a list of names, e.g., [Steve, Steven, Stephen]
# # How popular was your name during your year of birth?

# def yob_rank(name, gender, year, df):
#     df_copy = df.copy(deep=True)
#     df_copy = df_copy[(df_copy.Year == year) & (df_copy.Gender == gender)].reset_index()
#     rank = df_copy[df_copy.Name == name].index.tolist()[0] + 1
#     return rank
    
# # rank_Steven = yob_rank('Steven', 'M', 1974, names_df)
# # print(rank_Steven)
# # rank_Lisa = yob_rank('Lisa', 'F', 1971, names_df)
# # print(rank_Lisa)


# # What were the most popular names?
# def year_top10(year, df):
#     df_copy = df.copy(deep=True)
#     df_copy = df_copy[(df_copy.Year == year)].reset_index()
#     top_boys = df_copy[df_copy.Gender == 'M'].nlargest(10, columns=['Count'])
#     boys = []
#     for boy in top_boys.Name:
#         boys.append(boy)
#     top_girls = df_copy[df_copy.Gender == 'F'].nlargest(10, columns=['Count'])
#     girls = []
#     for girl in top_girls.Name:
#         girls.append(girl)
#     return boys, girls

# # pop_74 = year_top10(1974, names_df)
# # print(pop_74[0])
# # print(pop_74[1])


# # Throwback name ideas
# def throwback(num,df):
#     year = 2018 - num
#     tb_list = year_top10(year, df)
#     return tb_list

# # throwback_20 = throwback(20, names_df)
# # print(throwback_20[0])
# # print(throwback_20[1])
# # throwback_50 = throwback(50, names_df)
# # print(throwback_50[0])
# # print(throwback_50[1])
# # throwback_100 = throwback(100, names_df)
# # print(throwback_100[0])
# # print(throwback_100[1])