## KEY NOTE

This notebook is a little different from what we usually do here on Kaggle. I took a dataset of names, encoded how each name sounds using the `fuzzy` package, and used those encodings to predict the gender of the authors in a second dataset.

Dataset used: [Name Phonics Dataset](https://www.kaggle.com/amritvirsinghx/gender-prediction-from-name-pronunciation)

<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">Notebook Navigation</h3>

[1. Introduction: Sound It Out!](#1)   
[2. Authoring The Authors](#2)  
[3. It's Time to Bring on The Phonics!](#3)   
[4. The Inbetweeners](#4)    
[5. Playing Matchmaker](#5)       
[6. Tally Up](#6)     
[7. Foreign-Born Authors?](#7)     
[8. Raising The Bar](#8)     

<a id="1"></a>
## 1. Introduction: Sound It Out!
<p>Grey and Gray. Colour and Color. Words like these have been the cause of many heated arguments between Brits and Americans. Accents (and jokes) aside, there are many words that are pronounced the same way but spelled differently. While it is easy for us to recognize their equivalence, a basic string comparison will fail to equate two such strings. </p>
<p>More extreme than word spellings are names because people have more flexibility in choosing to spell a name in a certain way. To some extent, tradition sometimes governs the way a name is spelled, which limits the number of variations of any given English name. But if we consider global names and their associated English spellings, you can only imagine how many ways they can be spelled out. </p>
<p>One way to tackle this challenge is to write a program that checks if two strings sound the same, instead of checking for equivalence in spellings. We'll do that here using fuzzy name matching.</p>
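
To make the idea concrete before bringing in a library, here is a minimal sketch of a phonetic key. It uses American Soundex, a simpler and older algorithm than the NYSIIS encoding used in the rest of this notebook, so treat it as an illustration of the technique rather than what `fuzzy` does internally. The function name `soundex` is my own.

```python
def soundex(word):
    """Return the four-character American Soundex code for a word."""
    # Consonant groups that share a digit; vowels, H, W and Y get no digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""

    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        # Keep a digit only if it differs from the previous letter's digit.
        if code and code != prev:
            result += code
        # H and W do not separate equal digits; vowels do.
        if ch not in "HW":
            prev = code
    # Pad with zeros and truncate to one letter plus three digits.
    return (result + "000")[:4]


print("Grey" == "Gray")                    # False: the spellings differ
print(soundex("Grey") == soundex("Gray"))  # True: both encode to G600
print(soundex("Colour"), soundex("Color")) # C460 C460
```

A plain `==` compares spellings, while comparing the keys compares approximate sounds.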

In [None]:
#installing the package in the container
!pip install fuzzy

In [None]:
# Importing the fuzzy package
import fuzzy
# Exploring the output of fuzzy.nysiis on a sample name
fuzzy.nysiis('tufool')

So we have loaded the base library we'll work with.

<a id="2"></a>
## 2. Authoring The Authors
<p>The New York Times publishes a weekly list of best-selling books in different genres, and has done so since the 1930s. We’ll focus on Children’s Picture Books and analyze the gender distribution of authors to see whether it has changed over time. We'll begin by reading in the data on the best-selling authors from 2008 to 2017.</p>

In [None]:
import pandas as pd
author_df = pd.read_csv("../input/gender-prediction-from-name-pronunciation/nytkids_yearly.csv", delimiter=';')

# Looping through author_df['Author'] to extract the authors first names
first_name = []
for name in author_df['Author']:
    first_name.append(name.split()[0])
    
#extracting first name
author_df['first_name'] = first_name
author_df.head()

<a id="3"></a>
## 3. It's Time to Bring on The Phonics!
<p>When we were young children, we were taught to read using phonics: sounding out the letters that make up words. So let's relive history and do that again, but using Python this time. We will now create a new column that contains the phonetic equivalent of every first name that we just extracted. </p>
<p>To make sure we're on the right track, let's compare the number of unique values in the <code>first_name</code> column and the number of unique values in the nysiis coded column. As a rule of thumb, the number of unique nysiis first names should be less than or equal to the number of actual first names.</p>
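
The rule of thumb holds because every name maps to exactly one code, so the codes can never outnumber the names they came from. A small sketch with a toy encoder shows the effect; `toy_key` is a hypothetical stand-in for a real phonetic encoder, not NYSIIS, and the names are made up.

```python
def toy_key(name):
    """Hypothetical stand-in for a phonetic encoder: uppercase and drop every 'H'."""
    return name.upper().replace("H", "")


first_names = ["Jon", "John", "Sara", "Sarah", "Mo"]
keys = [toy_key(name) for name in first_names]

n_names = len(set(first_names))  # distinct spellings
n_keys = len(set(keys))          # distinct codes

print(n_names, n_keys)  # 5 3
```

If the number of unique codes ever came out larger than the number of unique names, that would point to a bug in how the column was built.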

In [None]:
import numpy as np

# Looping through the authors' first names to create the NYSIIS (fuzzy) equivalent
nysiis_name = []
for name in author_df['first_name']:
    tmp = fuzzy.nysiis(name)
    nysiis_name.append(tmp.split()[0])

# Adding nysiis_name as a column to author_df
# (first_name is already a column, so it is not reassigned here)
author_df['nysiis_name'] = nysiis_name

# Counting the unique values in each column
num_bananas_one = len(np.unique(author_df['first_name']))
num_bananas_two = len(np.unique(author_df['nysiis_name']))

# Printing out the number of unique first names and unique NYSIIS names
print("Difference is " + str(num_bananas_one) + ", " + str(num_bananas_two) + ".")

<a id="4"></a>
## 4. The Inbetweeners
<p>We'll use <code>babynames_nysiis.csv</code> to identify author genders. The dataset contains unique NYSIIS versions of baby names, and also includes the percentage of times the name appeared as a female name (<code>perc_female</code>) and the percentage of times it appeared as a male name (<code>perc_male</code>). </p>
<p>We'll use this data to create a list of <code>gender</code>. Let's make the following simplifying assumption: For each name, if <code>perc_female</code> is greater than <code>perc_male</code> then assume the name is female, if <code>perc_female</code> is less than <code>perc_male</code> then assume it is a male name, and if the percentages are equal then it's a "neutral" name.</p>
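
The same rule can be written without an explicit loop. The sketch below uses `numpy.select` on a small made-up table; the three rows are illustrative values, not taken from `babynames_nysiis.csv`, and the column names follow the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, for illustration only
toy_babies = pd.DataFrame({
    "babynysiis": ["SAR", "JAN", "ALAX"],
    "perc_female": [99.0, 1.5, 50.0],
    "perc_male": [1.0, 98.5, 50.0],
})

conditions = [
    toy_babies["perc_female"] > toy_babies["perc_male"],
    toy_babies["perc_female"] < toy_babies["perc_male"],
]
# Rows matching neither condition (equal percentages) fall through to 'N'
toy_babies["gender"] = np.select(conditions, ["F", "M"], default="N")

print(toy_babies)
```

Note that a row with a missing percentage would also fall through to the default, so it is worth checking for missing values before relying on the "N" label.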

In [None]:
import pandas as pd
babies_df = pd.read_csv('../input/gender-prediction-from-name-pronunciation/babynames_nysiis.csv', delimiter = ';')

# Labelling each name by whichever percentage is larger
# (column 1 is perc_female, column 2 is perc_male)
gender = []
for idx, row in babies_df.iterrows():
    if row.iloc[1] > row.iloc[2]:
        gender.append('F')
    elif row.iloc[1] < row.iloc[2]:
        gender.append('M')
    else:
        # Equal percentages are neutral; every row gets exactly one label
        gender.append('N')
# Adding a gender column to babies_df
babies_df['gender'] = gender
print(babies_df.head(10))

<a id="5"></a>
## 5. Playing Matchmaker
<p>Now that we have identified the likely genders of different names, let's find author genders by searching for each author's name in the <code>babies_df</code> DataFrame, and extracting the associated gender. </p>
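
One possible way to express this lookup in pandas is `Series.map`, which looks each value up in an index and returns a missing value when there is no match. This is a sketch on made-up rows, not the notebook's own method; the column names `babynysiis`, `gender` and `nysiis_name` match the ones used in this notebook, while `toy_babies`, `toy_authors` and `lookup` are names I introduced.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
toy_babies = pd.DataFrame({
    "babynysiis": ["JAN", "SAR", "ALAX"],
    "gender": ["M", "F", "N"],
})
toy_authors = pd.DataFrame({"nysiis_name": ["SAR", "JAN", "XYZ"]})

# Build a code -> gender lookup, keeping the first row for any repeated code
lookup = toy_babies.drop_duplicates("babynysiis").set_index("babynysiis")["gender"]

# Codes with no match come back as NaN and are relabelled 'Unknown'
toy_authors["author_gender"] = toy_authors["nysiis_name"].map(lookup).fillna("Unknown")

print(toy_authors)
```

Compared with searching a list for every author, the lookup is built once and reused for every row.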

In [None]:
def locate_in_list(a_list, element):
    # Return the position of element in a_list, or -1 if it is absent
    loc_of_name = a_list.index(element) if element in a_list else -1
    return loc_of_name

# Converting the columns to lists once, rather than on every iteration
baby_names = list(babies_df['babynysiis'])
baby_genders = list(babies_df['gender'])

author_gender = []
for name in author_df['nysiis_name']:
    index = locate_in_list(baby_names, name)
    if index == -1:
        author_gender.append('Unknown')
    else:
        author_gender.append(baby_genders[index])

author_df['author_gender'] = author_gender
author_df['author_gender'].value_counts()

<a id="6"></a>
## 6. Tally Up
<p>From the results above, we see that there are more female authors than male authors on the New York Times best-seller list. Our dataset spans 2008 to 2017. Let's find out if there have been changes over time.</p>
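
A per-year tally of this kind can also be sketched with `pandas.crosstab`, which counts every combination of two columns in one call. The rows below are made up for illustration; only the column names `Year` and `author_gender` come from this notebook.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
toy_authors = pd.DataFrame({
    "Year": [2008, 2008, 2009, 2009, 2009],
    "author_gender": ["F", "M", "F", "F", "Unknown"],
})

# Rows are years, columns are gender labels, cells are counts
counts = pd.crosstab(toy_authors["Year"], toy_authors["author_gender"])

print(counts)
```

Combinations that never occur, such as male authors in 2009 in this toy table, show up as zero rather than being dropped.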

In [None]:
# Creating a list of unique years, sorted in ascending order.
years = np.unique(author_df['Year'])

males_by_yr = []
females_by_yr = []
unknown_by_yr = []

for yy in years:   
   males_by_yr.append(len( author_df[ (author_df['Year']==yy) & (author_df['author_gender']=='M')  ] ))
   females_by_yr.append(len( author_df[ (author_df['Year']==yy) & (author_df['author_gender']=='F')  ] ))
   unknown_by_yr.append(len( author_df[ (author_df['Year']==yy) & (author_df['author_gender']=='Unknown')  ] ))

# Printing out yearly values to examine changes over time
print(males_by_yr)
print(females_by_yr)
print(unknown_by_yr)

<a id="7"></a>
## 7. Foreign-Born Authors?
<p>Our gender data comes from social security applications of individuals born in the US. Hence, one possible explanation for the "unknown" genders associated with some author names is that these authors were foreign-born. While making this assumption, we should note that these are only a subset of foreign-born authors, as others will have names that do have a match in <code>babies_df</code> (and in the social security dataset). </p>
<p>Using a bar chart, let's explore the trend of foreign-born authors with no name matches in the social security dataset.</p>

In [None]:
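import matplotlib.pyplot as plt
%matplotlib inline

# Plotting the number of authors with unknown gender for each year
plt.bar(years, unknown_by_yr)
plt.title('Authors with unknown gender by year')
plt.xlabel('Year')
plt.ylabel('Number of authors')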


<a id="8"></a>
## 8. Raising The Bar
<p>What’s more exciting than a bar chart is a grouped bar chart. This type of chart is good for displaying <em>changes</em> over time while also <em>comparing</em> two or more groups. Let’s use a grouped bar chart to look at the distribution of male and female authors over time.</p>

In [None]:
years_shifted = [year + 0.25 for year in years]

# Plotting males_by_yr by year
plt.bar(years, males_by_yr, width = 0.25, color = 'lightblue', label = 'Male')

# Plotting females_by_yr by years_shifted
plt.bar(years_shifted, females_by_yr, width = 0.25, color = 'pink', label = 'Female')

plt.title('Male and female authors by year')
plt.xlabel('Year')
plt.ylabel('Number of authors')
plt.legend()

If you want to replicate something similar, please check out this dataset:
[Name Phonics Dataset](https://www.kaggle.com/amritvirsinghx/gender-prediction-from-name-pronunciation)

Your feedback is appreciated :)