# Name Classification with Naive Bayes

## Overview

The goal of this project is to build a Python module that can determine whether a person is Japanese or not, based on their name string. 
The final product is a class file containing the `NameClassifier` class, which is capable of
- loading and preprocessing training and test data
- training the model
- predicting with and evaluating the model
- saving and loading the trained model for future use

The name data, covering various origins from around the world, was generated with the `Faker` library in Python. 100,000 fake names were created for each class for training and testing purposes. 

## Libraries / Dependencies

A few Python libraries were used to build this class:
- scikit-learn
- pandas
- pickle

In order to use the class, scikit-learn and pandas, along with their dependencies, need to be installed on your system; `pickle` is part of the Python standard library.

## Setup and Locations

This class has only been tested on Ubuntu Linux 18.04, and it is used by importing it. The class file `model.py` needs to be located in the directory where you intend to use it. The data and saved model files can be located anywhere, as long as you have a relative path to them from the class file. However, it is generally a good idea to keep everything within the same directory or one of its subdirectories. 

Now that the basics are out of the way, let's get started!

## Taking a look at the data

This module loads the name data from CSV files using pandas. You should have separate CSV files, one for Japanese names and one for non-Japanese names. 

In a file, this would look like

>Country, Address, name, other col..<br>
value1, value2, John Smith, value4

**As long as there is a column named `name`, other columns won't be a problem.<br> 
There should always be a white space between the first and last name, though.**
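
As a minimal sketch of this file layout, the snippet below reads a small in-memory CSV with pandas and picks out the `name` column. The CSV text is made up for illustration (its rows are borrowed from the sample output shown in this document), and only the `name` column is assumed to matter.

```python
import io

import pandas as pd

# Illustrative CSV content; in practice this would be a file on disk.
csv_text = "code,name\njp_JP,中村 晃\nen_AU,Calvin Hall\n"

df = pd.read_csv(io.StringIO(csv_text))

# Extra columns such as `code` are ignored by selecting only `name`.
names = df["name"].tolist()
print(names)
```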

For example, loaded into DataFrames, the data might look like this:

In [1]:
import pandas as pd
j_name = pd.read_csv('data/jp_names.csv')
f_name = pd.read_csv('data/f_names.csv')

print("Japanese Names: \n", j_name.sample(10))
print("\nnon-Japanese Names:\n", f_name.sample(10))

Japanese Names: 
         code    name
65886  jp_JP    中村 晃
56486  jp_JP  青山 くみ子
73021  jp_JP   加藤 知実
73239  jp_JP  野村 美加子
60836  jp_JP   石田 結衣
29346  jp_JP   青山 春香
39545  jp_JP    小林 晃
38880  jp_JP  小泉 聡太郎
22614  jp_JP   田辺 英樹
74925  jp_JP   大垣 洋介

non-Japanese Names:
         code                  name
66591  pt_BR  Srta. Agatha Freitas
17437  en_AU           Calvin Hall
94248  zh_CN                  陈 桂荣
67145  pt_BR        Rodrigo Duarte
44152  fr_FR      Agnès Carpentier
68337  pt_BR           Caio Barros
97603  zh_TW                  魏 佩君
89041  uk_UA          Мілена Палій
42054  fr_FR           Paul Hubert
44291  fr_FR        Franck Jacques


## Preprocessing

Preprocessing the data is one of the most important aspects of machine learning. It can boost or ruin a model's performance. 
Here, since we're dealing with text data, it needs to be encoded into numbers. 

### Splitting the Dataset
The dataset is split into training and test sets, for model training and testing.
The default ratio is
> train : test = 70% : 30%

This ratio can be modified if necessary.

We use the `NameClassifier.load_data()` method to load the data and split it into two datasets, like this:





In [2]:
# import the class
from model import NameClassifier

# the method returns x_train, x_test, y_train, y_test, in that order
# (test_size=0.4 overrides the default 30% test share)
x_train, x_test, y_train, y_test = NameClassifier.load_data('data/jp_names.csv', 'data/f_names.csv', test_size=0.4)
print('name data: \n', x_train.sample(10))
print('\nlabels: ', y_train)

name data: 
 49782                              廣川 香織
36503    Rodrigo Héctor Alejandro Zamora
64099                 Dariusz Frankowicz
23230                    Cassandra Allen
90242                               董 玉华
70934                          Mona Niță
60380                              渚 くみ子
61170                        Piotr Dybaś
26908                              笹田 亮介
44995                              三宅 健一
Name: name, dtype: object

labels:  [0 1 1 ... 0 0 1]
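
The internals of `load_data()` are not shown in this document, but the splitting step could plausibly be sketched as follows. The function name `load_data_sketch`, the labelling (1 = Japanese, 0 = non-Japanese), the use of `train_test_split`, and the `random_state` parameter are all assumptions made for illustration; the toy DataFrames stand in for the two CSV files.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def load_data_sketch(jp_df, f_df, test_size=0.3, random_state=0):
    """Hypothetical stand-in for NameClassifier.load_data()."""
    # Stack the two `name` columns into a single Series.
    names = pd.concat([jp_df["name"], f_df["name"]], ignore_index=True)
    # Assumed labelling: 1 = Japanese, 0 = non-Japanese.
    labels = np.array([1] * len(jp_df) + [0] * len(f_df))
    # Shuffle and split; returns x_train, x_test, y_train, y_test.
    return train_test_split(
        names, labels, test_size=test_size, random_state=random_state
    )


jp_df = pd.DataFrame(
    {"name": ["中村 晃", "加藤 知実", "石田 結衣", "青山 春香", "田辺 英樹"]}
)
f_df = pd.DataFrame(
    {"name": ["Calvin Hall", "Paul Hubert", "Caio Barros", "Rodrigo Duarte", "Franck Jacques"]}
)

x_tr, x_te, y_tr, y_te = load_data_sketch(jp_df, f_df)
print(len(x_tr), len(x_te))
```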


## Bag of Words Model
In this simple technique, each word that appears in the dataset is assigned a unique index, so that each text can be expressed as a sequence of numbers.
Each sequence is then converted into a vector in which each position (index) represents one word and the value is the number of times that word occurs.

<br>
Specifically, this class uses word counts, computed with scikit-learn's `CountVectorizer`. 



<br>
The data is encoded into a SciPy sparse matrix, which is ready to be fed into the Naive Bayes model.
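
To make the encoding concrete, here is a minimal sketch that runs `CountVectorizer` with its default settings on three toy names. Whether `NameClassifier` uses the defaults or custom settings is not stated here, so treat the configuration as an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

names = ["中村 晃", "John Smith", "中村 春香"]

# Default settings: lowercases text and keeps tokens of 2+ word characters.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(names)  # SciPy sparse matrix, one row per name

# Vocabulary terms ordered by their column index in the matrix.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(X.toarray())
```

Note that with the default `token_pattern`, single-character tokens such as 晃 are dropped from the vocabulary, which may matter for short given names or family names.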


## How Naive Bayes works...

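Naive Bayes applies Bayes' theorem under the assumption that the features (here, the word counts of a name) are independent of each other given the class, and it predicts the class with the highest resulting probability.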

This class uses the `sklearn.naive_bayes.MultinomialNB` class.
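
For reference, the standard multinomial Naive Bayes decision rule can be written as below. The symbols are introduced here for illustration: $x_i$ is the count of word $w_i$ in a name, $N_{ci}$ is the count of $w_i$ in the training names of class $c$, $N_c = \sum_i N_{ci}$, $|V|$ is the vocabulary size, and $\alpha$ is a smoothing parameter.

$$\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{i} x_i \log P(w_i \mid c) \right], \qquad P(w_i \mid c) = \frac{N_{ci} + \alpha}{N_c + \alpha |V|}$$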

Encoding the names as word counts and training the Naive Bayes model are both done in the `NameClassifier.train()` method, like this:



In [3]:
# First, instantiate the NameClassifier and assign it to a name of your choice.
clf = NameClassifier()

# then start training with the training data
clf.train(x_train, y_train)

Fitting the vectorizer and training the model...
training completed!
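
The two steps that `train()` reports (fitting the vectorizer, then training the model) could look roughly like the sketch below. It is an illustration on four toy names using default `CountVectorizer` and `MultinomialNB` settings, not the actual body of `train()`; the labelling 1 = Japanese, 0 = non-Japanese is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_names = ["中村 春香", "加藤 知実", "John Smith", "Paul Hubert"]
train_labels = [1, 1, 0, 0]  # assumed: 1 = Japanese, 0 = non-Japanese

# Step 1: learn the vocabulary and encode the names as word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_names)

# Step 2: fit the Naive Bayes model on the count matrix.
model = MultinomialNB()
model.fit(X_train, train_labels)

# New names are encoded with the already fitted vectorizer (transform, not fit).
X_new = vectorizer.transform(["加藤 春香", "Paul Smith"])
pred = model.predict(X_new).tolist()
print(pred)
```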


## Evaluation Metrics

Model evaluation was done on the test set, with 3 metrics. Below, TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, with Japanese as the positive class.
### Accuracy
How many data points did the model predict correctly, regardless of class?

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

### Precision
Out of all predicted Japanese names, how many were actual Japanese names?

$$\text{precision} = \frac{TP}{TP + FP}$$
### Recall
Out of all actual Japanese names, how many did we predict as Japanese?
$$\text{recall} = \frac{TP}{TP + FN}$$

These metrics are computed with `NameClassifier.evaluate()`:


In [4]:
# with the test data, this calculates each metric and returns a dictionary
metrics = clf.evaluate(x_test, y_test)

print('accuracy: {}%\nprecision: {}\nrecall: {}'.format(metrics['accuracy']*100, metrics['precision'], metrics['recall']))

accuracy: 99.8075%
precision: 1.0
recall: 0.9961563420356412
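
As a sketch of how such a metrics dictionary could be produced, the snippet below uses scikit-learn's metric functions on toy labels. The function name `evaluate_sketch` is hypothetical, and it is an assumption that `evaluate()` relies on these particular functions; the dictionary keys mirror the ones used above.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score


def evaluate_sketch(y_true, y_pred):
    """Hypothetical stand-in for the metric calculation in evaluate()."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),  # positive class = 1
        "recall": recall_score(y_true, y_pred),
    }


# Toy labels chosen so that TP = 3, TN = 3, FP = 1, FN = 1.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]

toy_metrics = evaluate_sketch(y_true, y_pred)
print(toy_metrics)
```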


## Prediction

Now let us predict some names using the trained model. This can be done with the `NameClassifier.predict()` method, which accepts a Python list of name strings. <br>
Let's take a look.

In [11]:
some_names = ['渡辺　謙', '木村　拓哉', 'Jack Nicholson', ' 陳　港生']
# Only the first 2 are Japanese names, so the output should look like [1, 1, 0, 0]
pred = clf.predict(some_names)
print(pred)

[1, 1, 0, 0]
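
The example names above contain full-width (ideographic) spaces, and one has a leading space. If you want every name to follow the single-space convention described earlier before calling `predict()`, a small normalization step like the following could help. `normalize_name` is a hypothetical helper, not part of `NameClassifier`, and it is an assumption that the class does not already do this internally.

```python
def normalize_name(name):
    """Hypothetical helper: collapse any whitespace into single ASCII spaces."""
    # str.split() with no arguments splits on all Unicode whitespace,
    # including the full-width space U+3000, and drops leading/trailing runs.
    return " ".join(name.split())


some_names = ["渡辺\u3000謙", "木村\u3000拓哉", "Jack Nicholson", " 陳\u3000港生"]
cleaned = [normalize_name(n) for n in some_names]
print(cleaned)
```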


## Saving and Loading the Model

- Saving the trained model is easy: just use `NameClassifier.save_model()`.
- Loading is done with `NameClassifier.load_model()`.

Both methods accept a path such as `path/to/modelFile/fileName.pickle` as their argument.

In [18]:
clf.save_model('saved_model.pickle')
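
Since the class depends on `pickle`, saving and loading presumably boil down to a pickle round trip. The sketch below shows that round trip with hypothetical helper functions and a placeholder dictionary standing in for the trained objects; the real methods may store different objects.

```python
import os
import pickle
import tempfile


def save_model_sketch(obj, path):
    """Hypothetical stand-in for save_model(): pickle `obj` to `path`."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_model_sketch(path):
    """Hypothetical stand-in for load_model(): unpickle from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Placeholder for the trained objects (e.g. a fitted vectorizer and model).
trained = {"vectorizer": "fitted vectorizer here", "model": "fitted model here"}

path = os.path.join(tempfile.mkdtemp(), "saved_model.pickle")
save_model_sketch(trained, path)
restored = load_model_sketch(path)
print(restored == trained)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.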