<img src="../../../images/banners/pandas-cropped.jpeg" width="600"/>

<a class="anchor" id="intro_to_data_structures"></a>
# <img src="../../../images/logos/pandas.png" width="23"/> DataFrame Mini Project: Predict Gender from Name

## <img src="../../../images/logos/toc.png" width="20"/> Table of Contents
* [Required Libraries](#required-libraries)
* [Generate Fake Data](#generate-fake-data)
* [Extract First Name and Last Name](#extract-firstname-and-lastname)
* [Predict Gender](#predict-gender)
* [Your Turn!](#your-turn)

---

In this project, we generate a fake dataset of names and predict each person's gender from their name. For example, a name like 'سیامک' (Siamak) is most likely male.

<a class="anchor" id="required-libraries"></a>

## Required Libraries

To do so, we use the following libraries:
- [`Faker`](https://github.com/joke2k/faker): To generate fake names
- [`names-dataset`](https://github.com/philipperemy/name-dataset): To get gender and country info

Let's first install the libraries by running:

In [1]:
!pip install Faker
!pip install names-dataset



Using Faker, we can generate fake names in different languages:

In [18]:
from faker import Faker

In [3]:
en_fake = Faker()         # default locale (English)
fa_fake = Faker('fa_IR')  # Persian (Iran) locale

In [4]:
en_fake.name()

'Alyssa Vargas'

In [7]:
fa_fake.name()

'آرمین جلیلی'

Using `names-dataset`, you can look up gender and country information for a name:

In [10]:
from names_dataset import NameDataset

# This line takes some time because the dataset is massive.
nd = NameDataset()

In [11]:
nd.search('سیامک')

{'first_name': {'country': {'United Arab Emirates': 0.005,
   'Afghanistan': 0.019,
   'Austria': 0.005,
   'Germany': 0.009,
   'Georgia': 0.005,
   'Iraq': 0.042,
   'Iran, Islamic Republic of': 0.87,
   'Sweden': 0.005,
   'Syrian Arab Republic': 0.005,
   'Turkey': 0.037},
  'gender': {'Female': 0.025, 'Male': 0.975},
  'rank': {'Afghanistan': 9456,
   'Georgia': 2689,
   'Iran, Islamic Republic of': 1200,
   'United Arab Emirates': None,
   'Austria': None,
   'Germany': None,
   'Iraq': None,
   'Sweden': None,
   'Syrian Arab Republic': None,
   'Turkey': None}},
 'last_name': None}

Note that for any input, `names-dataset` reports statistics separately for its use as a first name and as a last name. Here 'سیامک' is not known as a last name, so `last_name` is `None`. If your input is a first name, you can extract the first-name results:

In [13]:
# سیامک is 97.5% likely to be Male
nd.search('سیامک')['first_name']['gender']

{'Female': 0.025, 'Male': 0.975}

<a class="anchor" id="generate-fake-data"></a>

## Generate Fake Data

Let's first define a function that randomly generates an English or an Iranian name.

In [19]:
import pandas as pd
import numpy as np

In [20]:
def make_name():
    """
    Return an English name about half of the time and
    an Iranian name the other half.
    """
    # np.random.rand() is uniform on [0, 1), so each branch has probability 0.5
    if np.random.rand() > 0.5:
        return en_fake.name()

    return fa_fake.name()

Now it is easy to generate random Iranian and English names.

In [30]:
[make_name() for _ in range(10)]

['فاطمه عبدالعلی',
 'Nicholas Mendez',
 'دانيال هدایت',
 'نرگس زارعی',
 'محمد محمدی',
 'معصومه رودگر',
 'ثنا سعیدی',
 'Kayla Fleming DDS',
 'مبينا بهمنی',
 'Kimberly Castillo']
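Because both the 50/50 choice and the name generators are random, every run produces a different list. The sketch below shows one way to make the choice reproducible with a seeded NumPy generator. It is only an illustration: `EN_NAMES`, `FA_NAMES`, and `make_names` are hypothetical stand-ins for the Faker calls, used so that the snippet runs without Faker installed.

```python
import numpy as np

# Hypothetical stand-ins for en_fake.name() and fa_fake.name()
EN_NAMES = ["Emily Gray", "Michael Frost", "Tonya Kirby"]
FA_NAMES = ["ماهان شمشیری", "سينا عقیلی", "محمدحسین سلطانی"]


def make_names(n, seed=None):
    """Return n names, each drawn from the English or Persian pool with equal probability."""
    rng = np.random.default_rng(seed)
    names = []
    for _ in range(n):
        pool = EN_NAMES if rng.random() > 0.5 else FA_NAMES
        names.append(str(rng.choice(pool)))
    return names


# The same seed always yields the same list
first_run = make_names(5, seed=42)
second_run = make_names(5, seed=42)
print(first_run == second_run)  # True
```

With the real generators, seeding would also need to cover Faker itself (for example via `Faker.seed(...)`) and NumPy's global random state (`np.random.seed(...)`).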

So let's build our dataset now:

In [47]:
df = pd.DataFrame({
    'Name': [make_name() for _ in range(10)]
})

In [48]:
df

| | Name |
|---|---|
| 0 | محمدمهدي جنتی |
| 1 | Emily Gray |
| 2 | Nicole Atkins |
| 3 | ماهان شمشیری |
| 4 | Michael Frost |
| 5 | Tonya Kirby |
| 6 | محمدحسین سلطانی |
| 7 | Anthony Elliott |
| 8 | سرکار خانم زینب شمشیری |
| 9 | سينا عقیلی |


<a class="anchor" id="extract-firstname-and-lastname"></a>

## Extract First Name and Last Name

To predict gender and country from a name, we need the first name and the last name separately. As a simple approximation, we assume that the first word of a name is the first name and the last word is the last name. For example:

In [53]:
full_name = fa_fake.name()
full_name

'امیررضا فرجی'

In [54]:
first_name, *_, last_name = full_name.split()

In [55]:
first_name

'امیررضا'

In [56]:
last_name

'فرجی'
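The starred unpacking above needs at least two words; a single-word name raises a `ValueError`. A small sketch of a more forgiving helper follows. `split_name` is a hypothetical name chosen for this example, not part of any library.

```python
def split_name(full_name):
    """Split a full name into (first, last); last is None for single-word names."""
    parts = full_name.split()
    if not parts:
        return None, None
    if len(parts) == 1:
        return parts[0], None
    # Everything between the first and last word is ignored
    return parts[0], parts[-1]


print(split_name("امیررضا فرجی"))       # ('امیررضا', 'فرجی')
print(split_name("Kayla Fleming DDS"))  # ('Kayla', 'DDS')
print(split_name("Cher"))               # ('Cher', None)
```

The second call shows the limit of this approach: a trailing suffix such as "DDS" is mistaken for the last name.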

So let's split the names using `apply`.

The `apply` method of a pandas Series calls a function on every value of a column and returns the results as a new Series. (On a DataFrame, `apply` works on whole columns or rows instead.) It is very useful when you want to create a new column from the result of a function applied to an existing one.
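As a small illustration, the sketch below uses a toy column of three names, stores the result of `apply` in a new column, and compares `apply` with the pandas `.str` accessor, which gives the same result for this kind of string operation. The frame `toy` and its `Length` column are made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Emily Gray", "Michael Frost", "Tonya Kirby"]})

# apply calls the function once per value and returns a new Series
toy["Length"] = toy["Name"].apply(len)

# Two equivalent ways to take the first word of each name
first_with_apply = toy["Name"].apply(lambda full_name: full_name.split()[0])
first_with_str = toy["Name"].str.split().str[0]

print(toy["Length"].tolist())                               # [10, 13, 11]
print(first_with_apply.tolist() == first_with_str.tolist())  # True
```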

In [58]:
df['First Name'] = df['Name'].apply(lambda full_name: full_name.split()[0])

In [59]:
df['Last Name'] = df['Name'].apply(lambda full_name: full_name.split()[-1])

In [60]:
df

| | Name | First Name | Last Name |
|---|---|---|---|
| 0 | محمدمهدي جنتی | محمدمهدي | جنتی |
| 1 | Emily Gray | Emily | Gray |
| 2 | Nicole Atkins | Nicole | Atkins |
| 3 | ماهان شمشیری | ماهان | شمشیری |
| 4 | Michael Frost | Michael | Frost |
| 5 | Tonya Kirby | Tonya | Kirby |
| 6 | محمدحسین سلطانی | محمدحسین | سلطانی |
| 7 | Anthony Elliott | Anthony | Elliott |
| 8 | سرکار خانم زینب شمشیری | سرکار | شمشیری |
| 9 | سينا عقیلی | سينا | عقیلی |


Of course, this separation is not perfect, and you can see that some first name and last name splits are incorrect. This happens because some generated names include titles such as سرکار خانم or Mr., and the first word is then a title rather than a name.

You can write a function to fix these errors, but for now we will just ignore them.
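One possible shape for such a function is sketched below. It drops known title words from the start of a name and known suffixes from the end before taking the first and last word. The `TITLES` and `SUFFIXES` sets are illustrative assumptions rather than complete lists, and `clean_split` is a hypothetical helper name.

```python
# Illustrative, incomplete sets of words to ignore (assumptions for this sketch)
TITLES = {"سرکار", "خانم", "جناب", "آقای", "Mr.", "Mrs.", "Ms.", "Dr."}
SUFFIXES = {"DDS", "MD", "PhD", "Jr.", "II", "III"}


def clean_split(full_name):
    """Return (first_name, last_name) after dropping leading titles and trailing suffixes."""
    parts = full_name.split()
    if not parts:
        return None, None
    # Remove title words from the front, keeping at least one word
    while len(parts) > 1 and parts[0] in TITLES:
        parts.pop(0)
    # Remove suffixes from the end, keeping at least one word
    while len(parts) > 1 and parts[-1] in SUFFIXES:
        parts.pop()
    return parts[0], parts[-1]


print(clean_split("سرکار خانم زینب شمشیری"))  # ('زینب', 'شمشیری')
print(clean_split("Kayla Fleming DDS"))        # ('Kayla', 'Fleming')
```

A lookup set keeps the sketch short, but it only catches the titles and suffixes that are listed in it.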

<a class="anchor" id="predict-gender"></a>

## Predict Gender

Now let's predict gender from a name. We write a function that, given a first name, returns its most likely gender.

**Note:** We use the first name to predict gender because it is more informative than the last name. Last names can be used for other purposes, such as finding the person's country of origin. You can try the prediction with the last name and compare the results.

In [111]:
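def name_to_gender(first_name):
    """Return the most likely gender for a first name, or None if it is unknown."""
    info = nd.search(first_name)['first_name']
    # Unknown names have no first-name entry (or no gender scores)
    if not info or not info.get('gender'):
        return None

    # Pick the gender label with the highest score
    return max(info['gender'], key=info['gender'].get)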

In [87]:
name_to_gender('سیامک')

'Male'

In [88]:
# For unknown names, it returns None
print(name_to_gender('abc'))

None
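The `max(..., key=....get)` call in `name_to_gender` may look unusual. Iterating over a dict yields its keys, and passing the dict's `get` method as `key` tells `max` to compare those keys by their values, so the label with the highest score wins. A minimal illustration using the gender scores shown earlier for 'سیامک':

```python
# Gender scores copied from the nd.search('سیامک') output shown earlier
gender = {"Female": 0.025, "Male": 0.975}

# max iterates over the keys and ranks them by gender.get(key)
most_likely = max(gender, key=gender.get)

print(most_likely)          # Male
print(gender[most_likely])  # 0.975
```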


Let's apply this function to our dataframe and extract the gender:

In [89]:
df['Gender'] = df['First Name'].apply(name_to_gender)

In [90]:
df

| | Name | First Name | Last Name | Gender |
|---|---|---|---|---|
| 0 | محمدمهدي جنتی | محمدمهدي | جنتی | Male |
| 1 | Emily Gray | Emily | Gray | Female |
| 2 | Nicole Atkins | Nicole | Atkins | Female |
| 3 | ماهان شمشیری | ماهان | شمشیری | Male |
| 4 | Michael Frost | Michael | Frost | Male |
| 5 | Tonya Kirby | Tonya | Kirby | Female |
| 6 | محمدحسین سلطانی | محمدحسین | سلطانی | Male |
| 7 | Anthony Elliott | Anthony | Elliott | Male |
| 8 | سرکار خانم زینب شمشیری | سرکار | شمشیری | None |
| 9 | سينا عقیلی | سينا | عقیلی | Male |
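First names that the dataset cannot resolve end up as missing values in the `Gender` column, as in row 8 above, where the title سرکار was taken as the first name. A short sketch of how such rows could be found and counted follows; it uses a hand-built three-row frame named `sample` that mirrors part of the table, so it runs without loading the dataset.

```python
import pandas as pd

sample = pd.DataFrame({
    "First Name": ["Emily", "سرکار", "سينا"],
    "Gender": ["Female", None, "Male"],
})

# Boolean mask marking rows where no gender was predicted
missing = sample["Gender"].isna()

print(int(missing.sum()))                          # 1
print(sample.loc[missing, "First Name"].tolist())  # ['سرکار']
```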


<a class="anchor" id="your-turn"></a>

## Your Turn!

Now it's your turn to do some data extraction.

Write code that adds three more columns to the dataframe:
- Gender Probability
- Country
- Country Probability

The final dataframe should look like this:

<img src="../images/pandas/gender-country-from-name.png" width="800"/>