# Name Classifier Project Report

## Overview

With the foreign population in Japan increasing, there is a need to collect statistics on the number of foreigners in organizations in Japan.
In many cases, databases do not include a person's nationality. The objective of this project is to automate the classification of a person's nationality based on the provided name string and address.

## Objective

**Business**<br>
Still vague, but at this point:
* Automate the task of marking foreigners' records in a database based on their name and address strings.
* The concrete measurement metric to use (accuracy, precision, or recall) is unclear at this point.

**Machine Learning**<br>

* Binary classification of foreigner vs. Japanese
* Input data, either or both of
    * Person's name string
    * Address string
* Output: 1 for foreigners, 0 for Japanese
* Evaluation metrics
    * Recall will be used as the model evaluation metric, since foreign residents make up only 2.16% of Japan's population, making them an extreme minority class.
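
To illustrate why recall is preferred over accuracy for such an imbalanced problem, here is a minimal sketch. The 10,000-record sample and the "always Japanese" model are hypothetical and only mirror the 2.16% share quoted above.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating 1 (foreigner) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    # Recall = share of actual foreigners that the model found.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical sample of 10,000 records with a 2.16% foreigner share.
y_true = [1] * 216 + [0] * 9784
always_japanese = [0] * len(y_true)  # a useless model that never flags anyone

print(accuracy(y_true, always_japanese))  # 0.9784
print(recall(y_true, always_japanese))    # 0.0
```

A model that never flags anyone still reaches about 97.8% accuracy, while its recall is zero, which is why accuracy alone would be misleading here.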
    
    
## Background 

### Foreign residents' population in Japan

Based on the two sources below, the total population of Japan in 2018 was 126,443,000, while foreign residents (those staying longer than 3 months) numbered 2,731,093, which is approximately 2.16%.
- [Numbers of Foreign residents in Japan - Ministry of Justice](https://www.e-stat.go.jp/stat-search/files?page=1&layout=datalist&toukei=00250012&tstat=000001018034&cycle=1&year=20180&month=24101212&tclass1=000001060399)
- [Japanese Population - Statistics Bureau of Japan](https://www.stat.go.jp/data/jinsui/2018np/index.html)

The table below shows the share and ranking of the nationalities that make up the foreign residents.

nationality|population|percentage (%)|cumulative percentage (%)
-----------|----------|--------------|-------------------------
China|764,720|28.01|28.01
Korea|449,634|16.47|44.48
Vietnam|330,835|12.12|56.59
Philippines|271,289|9.94|66.53
Brazil|201,865|7.39|73.92
Nepal|88,951|3.26|77.18
Taiwan|60,684|2.22|79.40
U.S.|57,500|2.11|81.51


### Precedents by others
A few Google searches turned up some precedents, such as the ones below.
- [クロスエントロピーで名前から国籍判定する](https://ai-lab.lapras.com/nlp/nationality-judgement/) ("Determining nationality from a name with cross-entropy", in Japanese): creates a language model (n-gram) from a collection of each country's names, and calculates the probability of a name occurring under each language model.
**Further investigation of this method is needed.**

- [ベイジアン・フィルターを使って患者氏名情報だけで日本人か外国人かの判別をする](http://healthcareit-interpreter.hatenablog.com/entry/2017/07/16/110514) ("Using a Bayesian filter to tell Japanese from foreigners based only on patient names"): creates a naive Bayes filter from scratch in Python and trains it with name strings collected from Wikipedia.
    - A very simple implementation, but the lack of training data was a problem, as was classifying Asian names written in Kanji characters.
    
### Previous Version

One can create a simple name classifier using regex to implement the following logic.

```
If name is written only in Hiragana and/or commonly used Kanji:
    person is Japanese
Else:
    person is a foreigner
```
This approach fails when
- the name is written in Roman or Katakana letters.
- the name is an East Asian name that also uses Kanji, such as a Chinese, Taiwanese, or Korean name.

The Katakana representation is especially problematic, because foreign residents' names are written in Katakana for official use most of the time, and some Japanese names (those using unusual Kanji, or those of elderly people) are sometimes written in Katakana as well.
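
The rule described in this section could look roughly like the sketch below. It is not the actual previous implementation: the function name is hypothetical, and the whole CJK Unified Ideographs block stands in for the list of commonly used Kanji, which is not reproduced here.

```python
import re

# Hiragana (U+3041-U+309F), the iteration mark 々 (U+3005, as in 佐々木),
# and CJK Unified Ideographs (U+4E00-U+9FFF), plus whitespace.
# Assumption: the full CJK block approximates "commonly used Kanji".
JAPANESE_ONLY = re.compile(r"[\u3041-\u309F\u3005\u4E00-\u9FFF\s]+")


def classify_by_script(name: str) -> int:
    """Return 0 (Japanese) if the name is only Hiragana/Kanji, else 1 (foreigner)."""
    stripped = name.strip()
    if not stripped:
        raise ValueError("empty name")
    return 0 if JAPANESE_ONLY.fullmatch(stripped) else 1


print(classify_by_script("山田 太郎"))    # 0
print(classify_by_script("John Smith"))   # 1
# The two failure modes listed above:
print(classify_by_script("ヤマダ タロウ"))  # 1, although the name is Japanese
print(classify_by_script("王 偉"))         # 0, although the name is Chinese
```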

## Data

### Issue
Because both a person's name and address are Personally Identifiable Information (PII), it is extremely difficult to collect data that accurately represents the real-world demographics.

### Solution
I am trying to overcome this problem by instead generating fake name data that reflects the actual demographics of Japanese and foreign residents' names.
The [Faker Library](https://github.com/joke2k/faker) was used for this task. The following are some characteristics of this tool.
- It keeps a list of unique first and last names for each nationality in its source code, and generates fake names from that list.
- The Japanese locale contains only 51 unique first names and 51 unique last names, 102 in total.
- The Japanese locale can generate names in Kanji, Katakana, and Roman letters.
- Other locales do not support generating names in Katakana.

This means that I need to
- collect unique first and last names for Japanese people
- convert non-Japanese names into their Katakana representation

#### Japanese
_Last Names_<br>
Last names were collected through [全国名字ランキング](https://myoji-yurai.net/prefectureRanking.htm) (a nationwide surname ranking), which includes
- a list of unique last names
- the ranking and estimated population of each last name in Japan

Data on this website is based on annual phone books, as well as data from the ministry of statistics.

_First Names_<br>

First names were sourced from two data sources:
1.  [Meiji Yasuda Insurance Website](https://www.meijiyasuda.co.jp/sp/enjoy/ranking/index.html#/year/2018n/9)
    * lists of the top 10 most popular first names from 1912 to 2018
2. [Facebook](https://www.facebook.com/)
    * The search function was used with the collected last names, and first names were scraped from the results.


#### Non-Japanese

The Faker Library was used to generate names for the chosen nationalities, which were selected based on the actual demographics of foreign residents in Japan (see "Foreign residents' population in Japan" in the Background section).

Among these nationalities, the ones within a cumulative percentage of 90% were included in the data generation process, namely
1. China
2. Korea
3. Brazil
4. Nepal
5. Taiwan
6. U.S.

Some countries are not included because the Faker library does not support them so far.
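
One way to keep the generated data close to the real demographics is to draw each record's nationality in proportion to its population in the Background table. The sketch below is an assumption about how this could be done, not the project's actual generation code; the step that hands each sampled nationality to the name generator is left out.

```python
import random
from collections import Counter

# Populations of the six supported nationalities, from the Background table.
SUPPORTED_POPULATIONS = {
    "China": 764_720,
    "Korea": 449_634,
    "Brazil": 201_865,
    "Nepal": 88_951,
    "Taiwan": 60_684,
    "U.S.": 57_500,
}


def sample_nationalities(n, seed=0):
    """Draw n nationalities, weighted by resident population."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    nationalities = list(SUPPORTED_POPULATIONS)
    weights = list(SUPPORTED_POPULATIONS.values())
    return rng.choices(nationalities, weights=weights, k=n)


sample = sample_nationalities(10_000)
print(Counter(sample).most_common())
```

Because Vietnam and the Philippines are missing, the remaining six nationalities are over-represented relative to the real population, which is worth keeping in mind when reading evaluation results.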

## Potential Approaches / Models

### 1. Regex and Logic

Use regex to classify each letter as alphabet, Katakana, Hiragana, or another language's script. This approach only works when all strings are written in their native characters.
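
As a variation on the regex idea, the script of each character can also be looked up from its Unicode name. This is a minimal sketch using the standard `unicodedata` module instead of hand-written ranges; the label names and the prefix table are my own choices.

```python
import unicodedata
from collections import Counter

# Prefix of the official Unicode character name -> script label.
SCRIPT_PREFIXES = {
    "HIRAGANA": "hiragana",
    "KATAKANA": "katakana",
    "CJK": "kanji",
    "LATIN": "latin",
    "HANGUL": "hangul",
}


def script_of(ch):
    """Return a script label for a single character, or 'other'."""
    name = unicodedata.name(ch, "")
    for prefix, label in SCRIPT_PREFIXES.items():
        if name.startswith(prefix):
            return label
    return "other"


def script_profile(text):
    """Count characters per script, ignoring whitespace."""
    # NFKC folds half-width Katakana and full-width Latin into standard forms.
    normalized = unicodedata.normalize("NFKC", text)
    return Counter(script_of(ch) for ch in normalized if not ch.isspace())


print(script_profile("山田 タロウ"))
print(script_profile("John Smith"))
```

The resulting counts could feed simple rules (for example, "contains Hangul") or serve as features for the models below.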

### 2. Naive Bayes

This approach is simple and can handle various kinds of language representations.

**1. Word Level Tokenization**<br>
With this approach, we can provide as many unique first and last names as possible in the dataset to cover the entire population.
The problem with this approach is that its predictive power is limited to the words (names) seen in the training dataset, so it cannot handle unseen ones.

**2. Character Level Tokenization**<br>
The advantage of this approach is that I don't have to provide every single unique name, just its components (characters). Another advantage is that it might be able to predict unseen names correctly, as long as their components are included in the training dataset (a different name that uses the same set of characters in a different order). I could use `CountVectorizer` or other methods.

Open question: is the text really tokenized at the character level, or by its ASCII or UTF-8 encoding? This needs a deeper look.
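
To make the character-level idea concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes over characters with add-one smoothing. It uses only the standard library rather than `CountVectorizer`, the class name is hypothetical, and the six training names are toy examples, not the project's dataset. Note that iterating over a Python `str` yields Unicode code points, not bytes, so "山田" is 2 tokens even though it is 6 bytes in UTF-8.

```python
import math
from collections import Counter


class CharNaiveBayes:
    """Multinomial Naive Bayes over single characters (add-one smoothing)."""

    def fit(self, names, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.char_counts = {c: Counter() for c in self.classes}
        for name, label in zip(names, labels):
            self.char_counts[label].update(name)  # one token per code point
        self.vocab = set().union(*self.char_counts.values())
        return self

    def log_scores(self, name):
        total = sum(self.class_counts.values())
        v = len(self.vocab) + 1  # extra slot for characters never seen in training
        scores = {}
        for c in self.classes:
            n_chars = sum(self.char_counts[c].values())
            score = math.log(self.class_counts[c] / total)  # log prior
            for ch in name:
                score += math.log((self.char_counts[c][ch] + 1) / (n_chars + v))
            scores[c] = score
        return scores

    def predict(self, name):
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


# Toy data: 0 = Japanese, 1 = foreigner.
names = ["田中太郎", "山田花子", "鈴木一郎",
         "ジョン・スミス", "マリア・シルバ", "グエン・ヴァン"]
labels = [0, 0, 0, 1, 1, 1]
model = CharNaiveBayes().fit(names, labels)

# Neither name appears in the training data, but their characters do.
print(model.predict("山中花郎"))        # 0
print(model.predict("スミス・マリア"))  # 1
```

This shows the advantage described above: unseen names are still scored, because the model only needs to have seen their characters.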

### 3. Character Level Language Model

There are two approaches here.

**1. Alphabet Only**<br>
In this approach, all name / address strings are first converted into the alphabet.
In the Roman representation of Japanese,
- one Hiragana / Katakana character is almost always represented by 2 letters
- a consonant is followed by a vowel as a pair, and this pattern repeats.
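
The consonant-vowel pattern above can be checked roughly with a regular expression. This is a sketch under simplifying assumptions: the onset list is a reduced set of Hepburn-style spellings that I chose for illustration, and doubled consonants (as in "Hattori") and long-vowel marks are not handled.

```python
import re

# One unit: an optional consonant onset followed by a vowel, or a lone "n".
UNIT = r"(?:(?:ky|gy|sh|ch|ny|hy|by|py|my|ry|ts|[kgsztdnhbpmyrwjf])?[aiueo]|n)"
ROMAJI_LIKE = re.compile(rf"(?:{UNIT})+")


def looks_like_romaji(name: str) -> bool:
    """True if the letters of the name split entirely into consonant-vowel units."""
    letters = re.sub(r"[^a-z]", "", name.lower())
    return bool(letters) and ROMAJI_LIKE.fullmatch(letters) is not None


print(looks_like_romaji("Tanaka Taro"))  # True
print(looks_like_romaji("John Smith"))   # False
print(looks_like_romaji("Maria"))        # True: a non-Japanese name that fits the pattern
```

As the last line shows, the pattern alone produces false positives, so it is better treated as one feature among others than as a classifier.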

This structure is probably a useful feature. I can possibly use language representation methods that
- retain information about the ordering of characters (N-gram, RNN, etc.)

The disadvantage is that romanizing various languages might be a very difficult task in itself. (Use Mechanical Turk? And train another model for it? jk).
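
As an example of an order-preserving method, below is a minimal sketch of a character bigram language model per class, scored by cross-entropy, in the spirit of the n-gram precedent mentioned in the Background. The class and function names are hypothetical, smoothing is plain add-one, and the ten training names are toy examples rather than real data.

```python
import math
from collections import Counter


class CharBigramModel:
    """Character bigram language model with add-one smoothing."""

    def __init__(self, names):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for name in names:
            padded = "^" + name + "$"  # mark the start and end of the name
            self.vocab.update(padded)
            for a, b in zip(padded, padded[1:]):
                self.bigrams[(a, b)] += 1
                self.contexts[a] += 1

    def cross_entropy(self, name):
        """Average negative log2 probability per bigram; lower means a better fit."""
        padded = "^" + name + "$"
        pairs = list(zip(padded, padded[1:]))
        v = len(self.vocab) + 1  # extra slot for unseen characters
        log_prob = 0.0
        for a, b in pairs:
            p = (self.bigrams[(a, b)] + 1) / (self.contexts[a] + v)
            log_prob += math.log2(p)
        return -log_prob / len(pairs)


def classify(name, models):
    """Return the label whose model gives the lowest cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(name))


# Toy romanized names: 0 = Japanese, 1 = foreigner.
models = {
    0: CharBigramModel(["tanaka", "yamada", "suzuki", "takahashi", "watanabe"]),
    1: CharBigramModel(["smith", "johnson", "williams", "brown", "miller"]),
}

print(classify("yamanaka", models))  # 0
print(classify("wilson", models))    # 1
```

Unlike the bag-of-characters Naive Bayes, this model is sensitive to ordering: "an" followed by "ak" fits the Japanese model because those transitions were seen in training, not merely because the letters occur.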

**2. Native Character**<br>
I'm not sure about its feasibility or disadvantages, but it's an idea.


