# Constructing a Gender Identifier - Exercise

Gender identification is another interesting NLP problem. 

In this notebook, we will use a **heuristic** to construct a feature vector and train a classifier on it. The heuristic is the last N letters of a given name. For example, a name ending in "ly" is most likely a female name, such as "Holly", "Kelly" or "Emely". A name ending in "rk", on the other hand, is likely a male name, such as "Clark", "Mark" or "Dirk". Since we do not know in advance how many letters work best, we will vary N and compare the results.

We will use the same approach as for the sentiment analyzer: build and train a Naive Bayes classifier.

Side note: combining letters or words into patterns is another popular way to tokenize text. Such a pattern is called an **n-gram**, where n is the number of items (characters or words) combined into one token.

<img src="./resources/ngram-words.png"  style="height: 250px"/>
<img src="./resources/ngrams_char.jpg"  style="height: 250px"/>
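
To make the n-gram idea concrete, here is a minimal sketch in plain Python. The helper names `char_ngrams` and `word_ngrams` are hypothetical and not part of NLTK or of this notebook. Note that the heuristic used in this exercise only looks at the *last* character n-gram of each name.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(words, n):
    """Return all overlapping word n-grams of a list of words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(char_ngrams("mark", 2))
# ['ma', 'ar', 'rk']
print(word_ngrams("the quick brown fox".split(), 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```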

## 1. Read the names

First import the NLTK package.

In [1]:
import nltk

In the resources you will find two text documents, `male.txt` and `female.txt`, with the most common names in the Netherlands and Belgium over the last two years. Read the names from these documents and create two variables: *male_names* (a list of all the male names) and *female_names* (a list of all the female names). For *male_names* you should get:

```
['Aad', 'Aalbert', 'Aaldert', 'Aaldrik', 'Aalt', 'Aarnoud', ... ]
```
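
One possible way to read the names is sketched below. It assumes each file contains one name per line and is UTF-8 encoded, and it uses plain line splitting rather than `word_tokenize`. The helper name `read_names` is hypothetical.

```python
from pathlib import Path


def read_names(path):
    """Read a text file and return its non-empty lines as a list of names.

    Assumes one name per line and UTF-8 encoding (not confirmed by the notebook).
    """
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Assuming the files live in `./resources/`, the two variables could then be created with `male_names = read_names("./resources/male.txt")` and `female_names = read_names("./resources/female.txt")`.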

In [None]:
from nltk.tokenize import word_tokenize




In [None]:
print(male_names)

## 2. Shuffle the names

Now create a new list *data* from the previous two lists, containing all the male and female names in random order. Label each name as male or female. You should get something like this:

```
[('Jitske', 'female'), ('Etienne', 'male'), ('Danischa', 'female'), ... ]
```
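
The labelling and shuffling step could look roughly like this. The function name `label_and_shuffle` and its optional `seed` parameter are assumptions made for illustration, and the short name lists are stand-ins for the real data.

```python
import random


def label_and_shuffle(male_names, female_names, seed=None):
    """Pair every name with its gender label and return the pairs in random order."""
    data = [(name, "male") for name in male_names]
    data += [(name, "female") for name in female_names]
    # A private Random instance keeps the global random state untouched;
    # passing a seed makes the order reproducible.
    random.Random(seed).shuffle(data)
    return data


# Tiny stand-in lists; in the notebook these come from the previous step.
sample = label_and_shuffle(["Etienne", "Johan"], ["Jitske", "Danischa"], seed=1)
print(sample)
```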

In [None]:
print(data)

## 3. Create the featureset

In the sentiment analyzer example, we used each of the top 3,000 words as an input feature for our classifier. The value of each feature was whether the word occurred in the document (true or false). As the output feature, or label, we used pos or neg. The featureset used to train the classifier therefore looked something like this (showing only the top 3 words and the first 5 documents):

```
[({'plot': True, 'bothered': False, 'annual': False}, 'pos'), ({'plot': False, 'bothered': False, 'annual': False}, 'pos'), ({'plot': False, 'bothered': False, 'annual': False}, 'pos'), ({'plot': True, 'bothered': True, 'annual': False}, 'pos'), ({'plot': True, 'bothered': True, 'annual': False}, 'neg')]
```

In this exercise there is only one input feature: the last N letters of the name, in __lowercase__. The output feature is male or female. The list we create to train our classifier must therefore have this format:

```
[({'letters': 'rissa'}, 'female'), ({'letters': 'rigje'}, 'female'), ({'letters': 'amos'}, 'male'), ({'letters': 'kbule'}, 'female'), ({'letters': 'nady'}, 'female'), ({'letters': 'ienna'}, 'female'), ({'letters': 'lérie'}, 'female'), ({'letters': 'berta'}, 'female'), ({'letters': 'rigje'}, 'female'), ({'letters': 'tiny'}, 'female')]
```

Create a function `create_featureset` with two parameters: the first is N (the number of letters) and the second is data (the shuffled male and female names). The function returns a featureset like the one above, which shows the case N=5.
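
A minimal sketch of such a function is shown below, as one possible solution rather than the reference one. It assumes N is at least 1. Names shorter than N are kept whole, which matches entries such as `'amos'` in the example above.

```python
def create_featureset(n, data):
    """Map each (name, gender) pair to ({'letters': last n letters, lowercased}, gender).

    Assumes n >= 1; for names shorter than n, the whole name is used.
    """
    return [({"letters": name[-n:].lower()}, gender) for name, gender in data]


print(create_featureset(5, [("Larissa", "female"), ("Amos", "male")]))
# [({'letters': 'rissa'}, 'female'), ({'letters': 'amos'}, 'male')]
```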

In [None]:
print(create_featureset(5,data))

## 4. Train and classify

Create a list with the input names to classify. Write a loop (1 to 5) that trains a classifier on the last 1, 2, ... 5 letters. For each N, print the accuracy of the classifier and the predicted gender of each input name. The output should look something like this:

```
Number of end letters: 1
Accuracy = 77.34%
Yvonne - {'letters': 'e'} ==> female
Johan - {'letters': 'n'} ==> male
Yvette - {'letters': 'e'} ==> female
Patrick - {'letters': 'k'} ==> male
Heidi - {'letters': 'i'} ==> female
Jos - {'letters': 's'} ==> male
Carine - {'letters': 'e'} ==> female

Number of end letters: 2
Accuracy = 78.73%
Yvonne - {'letters': 'ne'} ==> female
Johan - {'letters': 'an'} ==> male
Yvette - {'letters': 'te'} ==> female
Patrick - {'letters': 'ck'} ==> male
...
```

Typically the accuracy peaks at two letters and starts decreasing after that, although the exact numbers depend on how the names were shuffled.
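
The training step for a single value of N might be sketched as follows, using `nltk.NaiveBayesClassifier.train` and `nltk.classify.accuracy`. The function name `train_and_report` and the 20% test split are assumptions, since the exercise does not state how to divide the data. `create_featureset` is repeated here only to keep the block self-contained.

```python
import nltk


def create_featureset(n, data):
    """Last n letters of each name (lowercased) as the single feature."""
    return [({"letters": name[-n:].lower()}, gender) for name, gender in data]


def train_and_report(n, data, input_names, test_fraction=0.2):
    """Train a Naive Bayes classifier on the last n letters and print its results.

    The first `test_fraction` of the (already shuffled) data is held out for
    testing; this split is an assumption, not something the exercise prescribes.
    """
    featureset = create_featureset(n, data)
    cutoff = int(len(featureset) * test_fraction)
    test_set, train_set = featureset[:cutoff], featureset[cutoff:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    accuracy = nltk.classify.accuracy(classifier, test_set)

    print(f"Number of end letters: {n}")
    print(f"Accuracy = {accuracy * 100:.2f}%")
    for name in input_names:
        features = {"letters": name[-n:].lower()}
        print(f"{name} - {features} ==> {classifier.classify(features)}")
    print()
    return classifier, accuracy
```

With `data` and `input_names` defined, the loop from the exercise then reduces to `for n in range(1, 6): train_and_report(n, data, input_names)`.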

In [None]:
input_names = ['Yvonne', 'Johan', 'Yvette', 'Patrick', 'Heidi', 'Jos', 'Carine']