In [2]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import random
from collections import Counter
from nltk import word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier, classify
import pandas as pd
%matplotlib inline


Some examples of classification tasks are:

* Deciding whether an email is spam or not.
* Deciding what the topic of a news article is, from a fixed list of topic areas such as "politics," "technology," and "sport."

### Guessing gender from names
We know that male and female names have distinctive characteristics. Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male.  We aim to build a classifier to model these differences.



## Example 1
Let us define the probabilistic model.

Variables and domains:

$$
G \in \{M,F\}
$$
$$
LL \in \{a,b,c,d,\dots,z\} %last letter in the name
$$

where LL stands for last letter in the name.

We define
$$
\begin{aligned}
p(G=1) &=\theta\\
p(G=0) &=1-\theta\\
\end{aligned}
$$

where $1$ denotes female and $0$ male. We can write this in one line
using the Bernoulli distribution:

$$
p(G=g) =\theta^g(1-\theta)^{1-g} \text{ with } g \in \{0,1\}
$$

Let us now denote by $\theta_{lg}$ the probability that LL is equal to the letter $l$ given that
the gender $G$ is $g$, that is

$$
p(LL=l|G=g)=\theta_{lg} 
$$
This gives $26\times 2$ parameters.

It is convenient to binarize the letters with a one-hot (one-vs-all) encoding, that is
$$
\begin{aligned}
a &\rightarrow & [1,0,\dots,0]\\
b &\rightarrow & [0,1,\dots,0]\\
... & ... & ....
\end{aligned}
$$

With this encoding, let $y_{l}$ denote the vector that has a one in the $l$-th position and zeros elsewhere,
and let $y_{al}$ be its first component, $y_{bl}$ its second component, and so on. Then

$$
p(LL=l|G=g)=\theta_{ag}^{y_{al}}\theta_{bg}^{y_{bl}}\theta_{cg}^{y_{cl}}\cdots\theta_{zg}^{y_{zl}}
$$

which is a Categorical distribution. The problem therefore has $m=26$ binary features.
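
A minimal sketch of this encoding in plain Python may help. The function names `one_hot` and `categorical_prob` are hypothetical, and the uniform parameter vector `theta_g` is made up for illustration; it is not estimated from any data.

```python
import string

letters = string.ascii_lowercase  # 'a'..'z', so m = 26

def one_hot(letter):
    """Return the one-hot vector y_l for a lowercase letter."""
    return [1 if c == letter else 0 for c in letters]

def categorical_prob(theta_g, letter):
    """Evaluate prod_k theta_{kg}^{y_{kl}} for a single class g."""
    y = one_hot(letter)
    prob = 1.0
    for theta_k, y_k in zip(theta_g, y):
        prob *= theta_k ** y_k  # factors with y_k = 0 contribute 1
    return prob

# Hypothetical parameters: uniform over the 26 letters (illustration only).
theta_g = [1.0 / 26] * 26

print(one_hot('b')[:4])                # first four components of y_b
print(categorical_prob(theta_g, 'b'))  # picks out theta_g[1]
```

Only the factor whose exponent is one survives, so the product simply selects the parameter of the observed letter.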

We are interested in computing (Bayes' rule)
$$
p(G=g|LL=l)=\dfrac{p(LL=l|G=g)p(G=g)}{p(LL=l)}
$$

that is, the posterior probability that the gender is $g$ when the last letter in the name is $l$.

For instance, writing $\theta_1=\theta$ and $\theta_0=1-\theta$ for the class priors,

$$
P(G=1|LL=a)=\dfrac{P(LL=a|G=1)p(G=1)}{p(LL=a)}=\dfrac{\theta_{a1}\theta_1}{\theta_{a0}\theta_0+\theta_{a1}\theta_1}
$$
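
As a rough numerical illustration of this formula, the sketch below plugs in made-up values. The function name `posterior_female` and the three numbers are assumptions chosen only to show the arithmetic; they are not estimates from the names corpus.

```python
def posterior_female(theta_a1, theta_a0, theta_1):
    """P(G=1 | LL=a) from the two likelihoods and the prior theta_1 = P(G=1)."""
    theta_0 = 1.0 - theta_1
    numerator = theta_a1 * theta_1
    evidence = theta_a0 * theta_0 + theta_a1 * theta_1  # p(LL=a)
    return numerator / evidence

# Hypothetical values, for illustration only.
p = posterior_female(theta_a1=0.3, theta_a0=0.01, theta_1=0.6)
print(p)
```

With these values the numerator is $0.3\times 0.6=0.18$ and the evidence is $0.01\times 0.4+0.18=0.184$, so the posterior is $0.18/0.184$.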

Problem: we do not know the thetas. What do we do?

We can estimate them from a dataset of male and female English given names:
$$
\mathcal{D}=\{(a,1),(n,0),(o,0),(a,1),\dots\}
$$

### General recipe ML
Maximum likelihood estimation, assuming i.i.d. observations:
$$
\arg \max_{\theta_{lg},\theta_{g}} \prod_{i=1}^N p(LL=l(i)|G=g(i))P(G=g(i)) =  \arg \max_{\theta_{lg},\theta_{g}} \prod_{i=1}^N \theta_{l(i)g(i)}  \theta_{g(i)}
$$

Example: assume that $\mathcal{D}=\{(a,1),(n,0),(o,0),(a,1)\}$, that is, $N=4$ observations. Then
$$
\prod_{i=1}^4 p(LL=l(i)|G=g(i))P(G=g(i))=\theta_{a1}\theta_1\,\theta_{n0}\theta_0\,\theta_{o0}\theta_0\,\theta_{a1}\theta_1 =\theta_{a1}^2\theta_1^2\,\theta_{n0}\theta_0\,\theta_{o0}\theta_0
$$
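Note that, by collecting equal factors and summing their exponents, the likelihood depends on the data only through the counts of each (letter, gender) pair and of each gender, which makes its computation **much faster**.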
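
The following sketch checks this on the four-observation example: the likelihood computed from counts agrees with the direct product over observations. The function name `log_likelihood` and the parameter values in `theta_lg` and `theta_g` are hypothetical, picked only for illustration.

```python
import math
from collections import Counter

D = [('a', 1), ('n', 0), ('o', 0), ('a', 1)]

def log_likelihood(data, theta_lg, theta_g):
    """Log of prod_i theta_{l(i)g(i)} * theta_{g(i)}, computed from counts."""
    pair_counts = Counter(data)                 # n_{lg}
    class_counts = Counter(g for _, g in data)  # n_g
    ll = sum(n * math.log(theta_lg[pair]) for pair, n in pair_counts.items())
    ll += sum(n * math.log(theta_g[g]) for g, n in class_counts.items())
    return ll

# Hypothetical parameter values (illustration only).
theta_lg = {('a', 1): 0.4, ('n', 0): 0.2, ('o', 0): 0.1}
theta_g = {0: 0.5, 1: 0.5}

# Direct product over the N = 4 observations, for comparison.
direct = math.prod(theta_lg[(l, g)] * theta_g[g] for l, g in D)
from_counts = math.exp(log_likelihood(D, theta_lg, theta_g))
print(direct, from_counts)
```

Working with the logarithm is a design choice of this sketch: it turns exponents into multipliers and avoids underflow when $N$ is large.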


The resulting classifier is called the **Multinomial Naive Bayes** classifier. Note, however, that the parameters are estimated using MLE and, therefore, the uncertainty in these estimates is not considered.

Although the name of the classifier contains "Bayes", it is a general recipe ML approach because the thetas
are estimated via MLE.

The MLE estimate is

$$
\theta_{g}=\dfrac{n_{g}}{N} ~~\textit{for } g=0,1
$$

where $n_{g}$ is the number of instances (rows) in the dataset where the class variable is equal to $g$ and $N$
is the total number of instances.

Similarly,

$$
\theta_{lg}=\dfrac{n_{lg}}{n_g}
$$
where $n_{lg}$ is the number of instances where the letter is $l$ and the gender is $g$.
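
These two estimates can be sketched in a few lines for the toy dataset $\mathcal{D}=\{(a,1),(n,0),(o,0),(a,1)\}$ used earlier. The function name `mle` is hypothetical.

```python
from collections import Counter

D = [('a', 1), ('n', 0), ('o', 0), ('a', 1)]

def mle(data):
    """Return (theta_g, theta_lg) estimated by counting."""
    N = len(data)
    n_g = Counter(g for _, g in data)  # instances per class
    n_lg = Counter(data)               # instances per (letter, class) pair
    theta_g = {g: n / N for g, n in n_g.items()}
    theta_lg = {(l, g): n / n_g[g] for (l, g), n in n_lg.items()}
    return theta_g, theta_lg

theta_g, theta_lg = mle(D)
print(theta_g)   # {1: 0.5, 0: 0.5}
print(theta_lg)  # {('a', 1): 1.0, ('n', 0): 0.5, ('o', 0): 0.5}
```

Pairs that never occur in the data, such as `('a', 0)`, are absent from `theta_lg`, which corresponds to an estimate of zero.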

**Regularisation**
It may happen that $n_{lg}=0$ and, therefore, $\theta_{lg}=0$. To avoid this problem, it is common to add a regularisation term (see https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes)

$$
\theta_{lg}=\dfrac{n_{lg}+\alpha}{n_g+\alpha\, m}
$$

where $m$ is the number of features. $\alpha=1$ is called Laplace smoothing.
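
A small sketch of the smoothed estimate follows. The function name `smoothed_theta` and the counts are hypothetical, used only to show the effect of $\alpha$.

```python
def smoothed_theta(n_lg, n_g, alpha=1.0, m=26):
    """Smoothed estimate (n_lg + alpha) / (n_g + alpha * m)."""
    return (n_lg + alpha) / (n_g + alpha * m)

# Hypothetical counts: a letter never seen among 100 names of one class.
print(smoothed_theta(0, 100))             # Laplace smoothing: 1 / 126
print(smoothed_theta(0, 100, alpha=0.0))  # no smoothing: the MLE, 0.0

# The smoothed estimates still sum to one over the m letters.
counts = [100] + [0] * 25  # all 100 names end in the same letter
total = sum(smoothed_theta(c, 100) for c in counts)
print(total)
```

The denominator adds $\alpha$ once per feature, which is what keeps the smoothed probabilities normalised.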

In [3]:
def gender_features(word):
    return word[-1:].lower() #last_letter
gender_features('Alessio')


'o'

In [4]:
#we download the names
import nltk
nltk.download('names')

[nltk_data] Downloading package names to /home/shravan/nltk_data...
[nltk_data]   Package names is already up-to-date!


True

In [5]:
# we build the dataset: (name, label) pairs from the two NLTK name lists
from nltk.corpus import names
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
labeled_names = np.array(labeled_names)
labeled_names[0:10, :]

array([['Daisy', 'female'],
       ['Lorry', 'female'],
       ['Rycca', 'female'],
       ['Carson', 'male'],
       ['Kassia', 'female'],
       ['Agnes', 'female'],
       ['Thaine', 'male'],
       ['Tamera', 'female'],
       ['Janeen', 'female'],
       ['Anallise', 'female']], dtype='<U15')

We build our input and output variables

In [6]:

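name_letters = [gender_features(name) for name in labeled_names[:, 0]]

# drop names whose last character is a space rather than a letter
name_letters = np.array(name_letters)
ind = np.where(name_letters != ' ')[0]
name_letters = name_letters[ind]

# one feature per name: its last letter, as a column vector
X = name_letters.reshape(-1, 1)
# encode the labels: 0 = male, 1 = female
y = np.where(labeled_names[ind, 1] == 'male', 0, 1)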


In [7]:
X

array([['y'],
       ['y'],
       ['a'],
       ...,
       ['h'],
       ['e'],
       ['n']], dtype='<U1')

How do we know if our classifier is doing a good job?

We cannot check against outputs that are genuinely unknown, and checking against the outputs we trained on tells us little, because the classifier should get those right.

In general recipe ML, we evaluate the generalisation error, that is, we test the algorithm on unseen examples.

In [8]:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(X)
X2=  lb.transform(X)

In [9]:
X2

array([[0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [10]:
# we split the one-hot encoded dataset into training and test sets
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y, test_size=0.4, random_state=42)
print(X_train2.shape)
print(X_test2.shape)

# we also split the original dataset (same random_state, so the rows match)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

(4765, 25)
(3178, 25)


In [11]:
## General ML Recipe algorithm
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
clf = MultinomialNB(alpha=0.1,fit_prior=True) # Note alpha
clf.fit(X_train2, y_train2)
y_train_pred = clf.predict(X_train2)
y_test_pred  = clf.predict(X_test2)

In [12]:
np.set_printoptions(suppress=True)
print(clf.predict_proba(X_test2))
X_test

[[0.01520287 0.98479713]
 [0.74223286 0.25776714]
 [0.52411601 0.47588399]
 ...
 [0.73783307 0.26216693]
 [0.2551471  0.7448529 ]
 [0.52411601 0.47588399]]


array([['a'],
       ['s'],
       ['l'],
       ...,
       ['t'],
       ['e'],
       ['l']], dtype='<U1')

In [13]:
clf.predict_proba(lb.transform(np.array([[gender_features("Anna")]])))

array([[0.01520287, 0.98479713]])

### Evaluation metrics
We evaluate the accuracy of the classifier.

$$
\textit{Accuracy}= \dfrac{\textit{Number of correct predictions}}{\textit{Total number of predictions}}
$$


In [15]:
# accuracy_score expects (y_true, y_pred); the value is the same either way
print("accuracy training set ", accuracy_score(y_train2, y_train_pred))
print("accuracy test set ", accuracy_score(y_test2, y_test_pred))

accuracy training set  0.7672612801678909
accuracy test set  0.7561359345500315


Another interesting metric is the **confusion matrix**.
By definition, a confusion matrix $C$ is such that $C_{ij}$ is equal to the number of observations known to be in group $i$ but predicted to be in group $j$.

In [16]:
CC=confusion_matrix(y_test2,y_test_pred)
CC

array([[ 755,  374],
       [ 401, 1648]])

The classifier also returns the probability of each of the two classes, through `predict_proba`.

### Questions

* What happens if we change the feature, for instance to the letter before the last?
* What happens if we use the last two letters instead of the last one? How should you modify the code, and what is the accuracy?
* What happens if we use the last three letters? How should you modify the code, and what is the accuracy?
* What happens if we use the last four letters? How should you modify the code, and what is the accuracy?

Try it yourself!
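
As a possible starting point, here is a hedged sketch of alternative feature functions. The names `suffix_features` and `penultimate_letter` are hypothetical and not part of the notebook; they only generalise `gender_features`.

```python
def suffix_features(word, n=1):
    """Return the last n characters of word, lower-cased (n >= 1).

    Names shorter than n characters are returned whole.
    """
    return word[-n:].lower()

def penultimate_letter(word):
    """Return the letter before the last one ('' for one-letter names)."""
    return word[-2:-1].lower()

print(suffix_features('Alessio'))      # same as gender_features
print(suffix_features('Alessio', 2))   # last two letters
print(penultimate_letter('Alessio'))   # letter before the last
```

Each distinct suffix becomes its own one-hot column, so longer suffixes give many more features and more zero counts $n_{lg}$, which makes the choice of $\alpha$ more important. The filter on `' '` in the preprocessing cell would also need adapting for suffixes longer than one character.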