# Name Classification with Naive Bayes

## Overview

This project builds a Python module that determines whether a person's name is Japanese, based on the name string alone. 
The final product is a class file containing the `NameClassifier` class, which can
- load and preprocess training and test data
- train the model
- predict with and evaluate the model
- save and load a trained model for future use

The name data, covering names of various origins from around the world, was generated with Python's `Faker` library. 100,000 fake names were created for each class for training and testing purposes. 

## Libraries / Dependencies

A few Python libraries were used to build this class:
- scikit-learn
- pandas
- pickle

To use the class, scikit-learn, pandas, and their dependencies need to be installed on your system; `pickle` is part of the Python standard library.

## Setup and Locations

This class has only been tested on Ubuntu Linux 18.04, and is used by importing it. The class file `model.py` needs to be located in the directory where you intend to use it. The data and saved model files can be located anywhere, as long as you have a relative path to them from the class file. However, it's generally a good idea to keep everything in the same directory or one of its subdirectories. 

Now that the basics are out of the way, let's get started!

## Taking a look at data

This module loads the name data from CSV files using pandas. You should have separate CSV files for Japanese and non-Japanese names. 

In a file, this would look like

>Country, Address, name, other col..<br>
value1, value2, John Smith, value4

**As long as there is a column named `name` containing the name data, other columns can also be present.<br> 
There should always be a whitespace character between the first and last name, since the model splits each name into first and last name and analyzes them.**
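
The layout described above can be checked before training. The following is a minimal sketch and not part of `NameClassifier`; the inline CSV text and the variable names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; only the `name` column matters to the classifier.
csv_text = """Country,Address,name,other_col
value1,value2,John Smith,value4
value1,value2,伊藤 加奈,value4
"""

df = pd.read_csv(io.StringIO(csv_text))

# Fail early if the required column is missing.
if "name" not in df.columns:
    raise ValueError("expected a column named 'name'")

# Split each name on whitespace and flag rows with fewer than two parts.
parts = df["name"].str.split()
has_two_parts = parts.str.len() >= 2

print(parts.tolist())
print("rows missing a separator:", int((~has_two_parts).sum()))
```

Rows flagged here could be fixed or dropped before they reach the classifier.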

For example, loaded into a pandas DataFrame, the data might look like

In [1]:
import pandas as pd
j_name = pd.read_csv('data/jp_names.csv')
f_name = pd.read_csv('data/f_names.csv')

print("Japanese Names: \n", j_name.sample(10))
print("\nnon-Japanese Names:\n", f_name.sample(10))

Japanese Names: 
         code    name
28320  jp_JP   伊藤 加奈
93197  jp_JP   中島 加奈
83717  jp_JP   田辺 直人
26733  jp_JP    鈴木 舞
58254  jp_JP  野村 あすか
77565  jp_JP   吉本 千代
58693  jp_JP   青田 直子
38966  jp_JP   山口 千代
77998  jp_JP  大垣 聡太郎
73452  jp_JP    小林 幹

non-Japanese Names:
         code                           name
28559  en_US                     Roger Kemp
80017  tr_TR  Zebirce Ayasun Hançer Dumanlı
838    ar_EG                 Tyler Reynolds
71633  ro_RO                  Adriana Dobre
49043  it_IT                    Danny Ricci
56978  no_NO                    Aud Arnesen
36306  es_MX     Wilfrido Claudia Contreras
25864  en_US                Christina Ellis
68169  pt_BR          Maria Cecília da Rosa
74183  ro_RO              Georgian Ardelean


## Preprocessing

Preprocessing is one of the most important aspects of machine learning. It can boost or ruin a model's performance. 
Here, since we're dealing with text data, it needs to be encoded as numbers.

### Splitting Dataset
The dataset is split into train and test datasets for model training and testing.
The default ratio is 
> train : test = 70% : 30%

This ratio can be modified if necessary.

We use the `NameClassifier.load_data()` method to load the data and split it into two datasets like this (here with `test_size=0.4`, i.e. a 60% : 40% split instead of the default):





In [2]:
## import the class
from model import NameClassifier

clf = NameClassifier()
# the method returns x_train, x_test, y_train, y_test, in that order
x_train, x_test, y_train, y_test = clf.load_data(['data/jp_names.csv', 'data/f_names.csv'], test_size=0.4)
print('name data: \n', x_train.sample(10))
print('\nlabels: ', y_train)

name data: 
 52714               이 영숙
16092     Matthew Wilson
89334    Юстим Кириленко
55628               山本 淳
52764               박 정식
546          David Bowen
54600              青山 拓真
24620              村山 裕樹
92000               詹 金凤
37699               杉山 舞
Name: name, dtype: object

labels:  [0. 1. 0. ... 1. 0. 0.]
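
A labelled split like this can be sketched with pandas and scikit-learn's `train_test_split`. This is only an illustration under assumptions: the internals of `load_data()` are not shown in this README, and the tiny in-memory frames below stand in for the two CSV files.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-ins for data/jp_names.csv and data/f_names.csv.
jp = pd.DataFrame({"name": ["伊藤 加奈", "中島 加奈", "田辺 直人", "鈴木 舞", "山口 千代"]})
other = pd.DataFrame({"name": ["Roger Kemp", "Adriana Dobre", "Danny Ricci",
                               "Aud Arnesen", "Christina Ellis"]})

# Label Japanese names 1.0 and all others 0.0, then stack the two frames.
jp["label"] = 1.0
other["label"] = 0.0
data = pd.concat([jp, other], ignore_index=True)

# Shuffle and split; test_size=0.3 mirrors the default 70% : 30% ratio.
x_train, x_test, y_train, y_test = train_test_split(
    data["name"],
    data["label"].to_numpy(),
    test_size=0.3,
    random_state=0,  # fixed seed so the split is reproducible
)

print(len(x_train), len(x_test))
```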


## Bag of Words Model
In this simple technique, each word that appears in the dataset is assigned a unique number, so that each text can be expressed as a sequence of those numbers.
Each sequence is then converted into a vector in which each position (index) represents one word, and the value is the number of times that word occurs in the text.

<br>
Specifically, this class uses word counts computed with scikit-learn's `CountVectorizer`. 
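
To make the encoding concrete, here is a small sketch of `CountVectorizer` on a few names. It uses the vectorizer's default settings, which is an assumption; the README does not state how `NameClassifier` configures it.

```python
from sklearn.feature_extraction.text import CountVectorizer

names = ["伊藤 加奈", "中島 加奈", "Roger Kemp", "鈴木 舞"]

# Default settings (an assumption about the class): lowercase the text,
# split on word boundaries, keep tokens of two or more characters.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(names)

# vocabulary_ maps each token to its column index in the count matrix.
for token, column in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(column, token)

# One row per name, one column per token, values are occurrence counts.
print(X.toarray())

# With the default token pattern, single-character tokens are dropped.
print("舞" in vectorizer.vocabulary_)
```

If single-character given names should count as features, passing a pattern such as `token_pattern=r"(?u)\b\w+\b"` keeps them.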



<br>
The data is encoded as a SciPy sparse matrix, ready to be fed into the Naive Bayes model.
<br>

The classifier itself is a Naive Bayes model, implemented with the `sklearn.naive_bayes.MultinomialNB` class.

Encoding the names as word counts and training the Naive Bayes model are both done in the `NameClassifier.train()` method, like this:



In [3]:
# First, instantiate the NameClassifier under a variable name of your choice.
clf = NameClassifier()

# then start training with the training data
clf.train(x_train, y_train)

Fitting the vectorizer and training the model...
training completed!
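
What happens during training can be approximated with the two scikit-learn pieces named above. This is a sketch on a toy dataset, not the actual body of `train()`; the names and labels are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_names = ["伊藤 加奈", "中島 直人", "田辺 直子",
               "Roger Kemp", "Adriana Dobre", "Danny Ricci"]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = Japanese, 0 = non-Japanese

# Step 1: learn the vocabulary and encode the names as word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_names)

# Step 2: fit multinomial Naive Bayes on the count matrix.
model = MultinomialNB()
model.fit(X_train, train_labels)

# New names must go through the same fitted vectorizer before prediction.
new_names = ["伊藤 直人", "Danny Kemp"]
pred = model.predict(vectorizer.transform(new_names))
print(pred)
```

Both new names are built from tokens seen during training, so the per-class word counts decide the prediction.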


## Model Evaluation

The model was evaluated on the test set using three metrics. Below, TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives, with Japanese as the positive class.
### accuracy
The fraction of data points the model predicted correctly, regardless of class.

$$\text{accuracy} = \frac{TP + TN}{\text{Total Data}}$$

### precision
Out of all predicted Japanese names, how many were actual Japanese names?

$$\text{precision} = \frac{TP}{TP + FP}$$
### recall
Out of all actual Japanese names, how many did we predict as Japanese?
$$\text{recall} = \frac{TP}{TP + FN}$$
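
As a worked example of the three formulas, the following sketch counts TP, TN, FP, and FN by hand for a small set of made-up labels (1 = Japanese, 0 = non-Japanese) and plugs them in.

```python
# Made-up ground truth and predictions for eight names.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # Japanese, predicted Japanese
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-Japanese, predicted non-Japanese
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-Japanese, predicted Japanese
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Japanese, predicted non-Japanese

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn)
print(accuracy, precision, recall)
```

Here the counts are TP = 2, TN = 3, FP = 1, FN = 2, giving an accuracy of 5/8, a precision of 2/3, and a recall of 1/2.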

These metrics are computed with `NameClassifier.evaluate()`:


In [6]:
# with test data, this calculates each metric and returns a dictionary
metrics = clf.evaluate(x_test, y_test)
#print(metrics)
print('accuracy: {}%\nprecision: {}\nrecall: {}'.format(metrics['accuracy']*100, metrics['precision'][0], metrics['recall'][0]))

accuracy: 98.35875%
precision: 1.0
recall: 0.967300077204692


## Prediction

**Random Names**<br>
Now let us predict some names using the trained model. This is done with the `NameClassifier.predict()` method, which accepts a Python list of name strings. <br>
Let's take a look.

In [8]:
some_names = ['渡辺　謙', '木村　拓哉', 'Jack Nicholson', ' 陳　港生']
# Only the first 2 are Japanese names, so the output should ideally be [1, 1, 0, 0]
pred = clf.predict(some_names)
print(pred)

[1. 1. 0. 1.]


The output is 1 for a Japanese name and 0 for a non-Japanese name.<br> 
As you can see, it gets everything right except the last name, which is Chinese but is predicted as Japanese. 


### What about unseen names?


Here, let's try predicting names that did not exist in the training dataset, such as
- unseen last / first names
- Japanese names in Roman letters
- non-Japanese names in Katakana

We'll use the name 安倍晋三 as a test name, which is not included in the training data.


In [7]:
# First we'll get the word dictionary built from the model's training data
trained_dict = clf.get_word_dict()
print('安倍' in trained_dict.keys())
print('晋三' in trained_dict.keys())

print('\nSo the name 安倍　晋三 was not present in the training set.\nLet us try it along with a Katakana name.')
pred_anom = clf.predict(['安倍　晋三', 'ジェニファー　ローレンス', 'Jennifer Lawrence', 'Wataru Takahashi'])
print(pred_anom)

False
False

So the name 安倍　晋三 was not present in the training set.
Let us try it along with a Katakana name.
[1. 1. 0. 1.]


As the output shows, the model labels the unseen name 安倍　晋三 (index 0) as Japanese, but it also wrongly classifies the Katakana name ジェニファー　ローレンス (index 1) as Japanese. 
Interestingly, it classified Wataru Takahashi as Japanese even though it is written in Roman letters.
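
One thing worth keeping in mind when reading results on unseen names: with word-level counts, a name whose tokens are all missing from the vocabulary is encoded as an all-zero vector, and `MultinomialNB` then falls back on the class priors alone. The sketch below shows this on a toy dataset. It assumes default `CountVectorizer` and `MultinomialNB` settings and is not a diagnosis of `NameClassifier` itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_names = ["伊藤 加奈", "中島 直人", "田辺 直子", "Roger Kemp", "Danny Ricci"]
train_labels = [1, 1, 1, 0, 0]  # class 1 is slightly more frequent (3 vs. 2)

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_names), train_labels)

# Neither token of this name appears in the training vocabulary.
unseen = vectorizer.transform(["安倍 晋三"])

print(unseen.nnz)                   # number of non-zero features in the row
print(model.predict_proba(unseen))  # equals the class priors, here 2/5 and 3/5
print(model.predict(unseen))        # the more frequent training class wins
```

In this toy setup any fully unseen name, Japanese or not, receives the same prediction, so a "correct" label on such a name says little about what the model has learned.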

Possible improvements would be to 
- add more variety to the training data, namely more Katakana names
- break the name down into individual characters if it's in Kanji
- try a different ML algorithm
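
The character-level idea from the list above can be sketched with `CountVectorizer(analyzer="char")`. This is an illustration of the technique on made-up names, not something the current class does.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_names = ["山口 千代", "吉本 直子"]
query = ["山本 直代"]  # unseen words built from characters seen in training

# Word-level features: both tokens of the query are unknown.
word_vec = CountVectorizer().fit(train_names)
word_hits = word_vec.transform(query).nnz

# Character-level features: every character becomes its own feature
# (the space between the two name parts is counted as a character too).
char_vec = CountVectorizer(analyzer="char").fit(train_names)
char_hits = char_vec.transform(query).nnz

print("word-level features matched:", word_hits)
print("char-level features matched:", char_hits)
```

With character features the query is no longer an all-zero vector, so the classifier has some evidence to work with even when the full surname or given name is new.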

## Saving and Loading the Model

- Saving the trained model is easy: just use `NameClassifier.save_model()`
- Loading is done with `NameClassifier.load_model()`

Both methods accept `path/to/modelFile/fileName.pickle` as their argument.

In [18]:
clf.save_model('saved_model.pickle')
clf.load_model('saved_model.pickle')
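
Since the model file is a pickle, saving and loading presumably come down to `pickle.dump` and `pickle.load`. The sketch below shows that round trip with a stand-in dictionary; what `NameClassifier` actually stores in the file is an assumption here.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained state (vectorizer + model in practice).
trained_state = {"vocabulary": {"伊藤": 0, "加奈": 1}, "class_prior": [0.5, 0.5]}

# Write to a temporary directory so the example leaves no files behind in the project.
path = os.path.join(tempfile.mkdtemp(), "saved_model.pickle")

# Save: serialize the object to a binary file.
with open(path, "wb") as f:
    pickle.dump(trained_state, f)

# Load: read the file back into an equivalent object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == trained_state)
```

Only load pickle files you trust, since unpickling can execute arbitrary code.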

## To do / future features

- document the code better
- multi-class classification of names by nationality/origin -> on another branch
- try out different algorithms, such as 
    - Neural nets (RNN?)
    - random forest
    - SVM
- input the names as image data and train a CNN on them