# Name data analysis

Here I'll perform EDA on the name dataset generated by the Faker module.

This is a binary classification problem: determining whether a name is of Japanese origin or not.
Therefore I'll start by analyzing the Japanese names.

## Japanese Names

### Full name

How many unique names are in the dataset (`jp_names.csv`)?


In [1]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [10]:
jp_name = pd.read_csv('data/jp_names.csv')
print(jp_name.head())
print("# of unique full names in whole dataset: ", len(jp_name.name.unique()))
print('number of data points: ', jp_name.shape[0])

    code    name
0  jp_JP  青山 あすか
1  jp_JP  西之園 修平
2  jp_JP   三宅 裕樹
3  jp_JP   喜嶋 直子
4  jp_JP   加納 桃子
# of unique full names in whole dataset:  2601
number of data points:  100000


Only about 2.6% of the rows (2,601 of 100,000) are distinct names. No wonder the model performed so well on the test data:
the test set contains names that appear verbatim in the training set.
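
A minimal sketch of why the split leaks, using a hypothetical toy frame rather than the real `jp_names.csv`. Splitting the raw rows puts copies of the same name on both sides, while splitting the distinct names does not. The toy names, split sizes, and variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 4 distinct names, each repeated 25 times.
toy = pd.DataFrame(
    {"name": ["青山 あすか", "西之園 修平", "三宅 裕樹", "喜嶋 直子"] * 25}
)

# Naive split on the raw rows: duplicates land in both train and test.
train, test = train_test_split(toy, test_size=0.2, random_state=0)
leaked_share = test["name"].isin(set(train["name"])).mean()

# Split on the distinct names instead, so no name is shared between the sets.
unique_names = toy.drop_duplicates(subset="name")
train_u, test_u = train_test_split(unique_names, test_size=0.25, random_state=0)
shared = set(train_u["name"]) & set(test_u["name"])

print("share of test rows also present in train:", leaked_share)
print("names shared after splitting on unique names:", shared)
```

With this toy data every test row also appears in the training rows, whereas the split on distinct names shares none. Test accuracy measured on the first kind of split mostly reflects memorization.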

### First and Last name

Now split each name into first and last name. The names are written last name first, so the first token is the last name.

In [14]:
# names are stored as "<last> <first>", separated by a single space
jp_name['first'] = jp_name['name'].str.split(" ").str[1]
jp_name['last'] = jp_name['name'].str.split(" ").str[0]
jp_name.head()

|   | code | name | first | last |
|---|---|---|---|---|
| 0 | jp_JP | 青山 あすか | あすか | 青山 |
| 1 | jp_JP | 西之園 修平 | 修平 | 西之園 |
| 2 | jp_JP | 三宅 裕樹 | 裕樹 | 三宅 |
| 3 | jp_JP | 喜嶋 直子 | 直子 | 喜嶋 |
| 4 | jp_JP | 加納 桃子 | 桃子 | 加納 |


### Unique first and last names

In [29]:
first_names = jp_name['first'].unique() 
last_names = jp_name['last'].unique() 

print('first names: , total {}\n'.format(len(first_names)), first_names)
print('\nlast names: , total {}\n'.format(len(last_names)), last_names)

first names: , total 51
 ['あすか' '修平' '裕樹' '直子' '桃子' '春香' '京助' '治' '明美' '翼' '知実' '舞' '直人' '稔' '香織'
 '美加子' '太郎' '千代' '零' '智也' '太一' '七夏' '翔太' '聡太郎' '花子' '健一' '英樹' '裕美子' '結衣'
 '真綾' '学' '加奈' '和也' 'くみ子' '充' '陽子' '直樹' '幹' '里佳' '浩' '亮介' '裕太' '康弘' '洋介'
 '晃' '涼平' '拓真' '陽一' '淳' 'さゆり' '篤司']

last names: , total 51
 ['青山' '西之園' '三宅' '喜嶋' '加納' '村山' '山田' '江古田' '大垣' '加藤' '田辺' '鈴木' '井高' '桐山'
 '伊藤' '廣川' '青田' '中津川' '原田' '木村' '渚' '小林' '渡辺' '松本' '藤本' '宮沢' '小泉' '笹田'
 '田中' '佐々木' '浜田' '坂本' '山岸' '野村' '津田' '石田' '山本' '吉本' '近藤' '吉田' '高橋' '斉藤'
 '中島' '宇野' '中村' '山口' '井上' '若松' '工藤' '杉山' '佐藤']


## Insights 

Compared to the total number of names in the dataset (100,000), the numbers of unique first and last names are ridiculously small: 51 each. Since 51 × 51 = 2,601, the unique full names are exactly all combinations of those two lists.

This means that the vast majority of the Japanese names in the dataset are duplicates, which explains the high test-set accuracy (the test set is split off from the same Faker-generated dataset) as well as the model's inability to generalize to unseen names.
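
One way to check whether the full names are just every pairing of a small first-name list and a small last-name list is to compare the unique counts. This is a sketch on a hypothetical toy frame; the column name `name` matches the notebook, everything else is assumed.

```python
import pandas as pd

# Toy stand-in for jp_name: 2 last names x 2 first names, with repeats.
toy = pd.DataFrame(
    {"name": ["青山 あすか", "青山 修平", "三宅 あすか", "三宅 修平",
              "青山 あすか", "三宅 修平"]}
)

# Split "<last> <first>" into two columns (0 = last name, 1 = first name).
parts = toy["name"].str.split(" ", n=1, expand=True)
n_last = parts[0].nunique()
n_first = parts[1].nunique()
n_full = toy["name"].nunique()

# True when every last/first pairing occurs at least once.
is_full_cross_product = n_full == n_last * n_first
# Share of rows that repeat an already-seen full name.
duplicate_share = 1 - n_full / len(toy)

print(n_last, n_first, n_full, is_full_cross_product, round(duplicate_share, 3))
```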


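I'm assuming that sklearn's `CountVectorizer` tokenizes each name into first and last name at the whitespace (**check if this is correct**). Note that its default `token_pattern` only keeps tokens of two or more characters, so one-character names such as 治 or 渚 would be dropped.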
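
The tokenization can be checked directly with `build_analyzer()`, which returns the function `CountVectorizer` applies to each document. This is a small sketch; the example name `山田 治` is a hypothetical pairing of a last and a first name from the lists above, and the alternative `token_pattern` is just one possible choice.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default settings: token_pattern keeps only tokens of 2+ word characters.
default_analyzer = CountVectorizer().build_analyzer()
tokens_two = default_analyzer("青山 あすか")    # both parts have 2+ characters
tokens_single = default_analyzer("山田 治")     # the one-character first name is lost

# A pattern that also keeps one-character tokens (an assumed alternative).
keep_single = CountVectorizer(token_pattern=r"(?u)\b\w+\b").build_analyzer()
tokens_kept = keep_single("山田 治")

print(tokens_two)
print(tokens_single)
print(tokens_kept)
```

If single-character names matter for the classifier, the relaxed pattern (or a custom `tokenizer`) keeps them in the vocabulary.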

With Naive Bayes, there is little point in training the model on duplicate names if the number of unique names is small and limited.


**Important**<br>
The name data just needs to cover the most frequent Japanese first and last names, e.g. the names that together account for the top 90% of occurrences. It doesn't have to be a dataset of full names; a dictionary or list of names would do.
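
A rough sketch of how such a coverage target could be turned into a name list, assuming frequency counts per name are available (for example from a ranking site). The counts below are made up for illustration, and `names_for_coverage` is a hypothetical helper, not part of any library.

```python
from collections import Counter

def names_for_coverage(counts, target=0.9):
    """Return the most frequent names whose counts add up to `target` of the total."""
    total = sum(counts.values())
    covered = 0
    selected = []
    for name, n in counts.most_common():
        if covered >= target * total:
            break
        selected.append(name)
        covered += n
    return selected

# Made-up counts, NOT real population figures.
toy_counts = Counter({"佐藤": 50, "鈴木": 30, "高橋": 15, "田中": 4, "渚": 1})
top_names = names_for_coverage(toy_counts, target=0.9)
print(top_names)
```

With these toy counts the first three names already cover 95% of occurrences, so the two rarest names are left out.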

**Unseen Names**<br>
To deal with unseen names:
1. Recognize whether the name is unseen; if so, proceed.
2. Add it to the Japanese or foreign name list according to the correct answer.
    * Is human intervention needed here?
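
The two steps above could be sketched as follows, assuming the name lists are plain Python sets and that the correct label comes from a human reviewer. All function and variable names here are hypothetical.

```python
def lookup_or_queue(name, japanese_names, foreign_names, review_queue):
    """Step 1: return the known label, or queue the name for review if unseen."""
    if name in japanese_names:
        return "japanese"
    if name in foreign_names:
        return "foreign"
    review_queue.append(name)
    return "unseen"

def record_label(name, is_japanese, japanese_names, foreign_names):
    """Step 2: store a reviewed name in the list matching the correct answer."""
    target = japanese_names if is_japanese else foreign_names
    target.add(name)

japanese_names = {"佐藤", "鈴木"}
foreign_names = {"Smith"}
review_queue = []

first_result = lookup_or_queue("工藤", japanese_names, foreign_names, review_queue)
# A reviewer confirms the queued name is Japanese.
record_label("工藤", True, japanese_names, foreign_names)
second_result = lookup_or_queue("工藤", japanese_names, foreign_names, review_queue)
print(first_result, second_result, review_queue)
```

Keeping the review queue separate leaves the question of human intervention open: the labels could come from a person or from some other trusted source.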
    
**Variant character forms**

The same kanji sometimes has several written forms, so a single name gets counted as several different names.
> For example: 斎藤、斉藤、齊藤、齋藤

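* A possible solution is `CountVectorizer()`'s `strip_accents` option (`'ascii'`, `'unicode'`, or `None`) to normalize characters, but this needs checking: the option targets accented characters, `'ascii'` drops every non-ASCII character (kanji included), and Unicode normalization by itself may not merge these variants. An explicit variant-to-canonical mapping is a likely fallback.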
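
A minimal sketch of the explicit-mapping approach, using only the standard library. The mapping below is hypothetical and deliberately tiny: it covers just the four spellings listed above and picks 斉 as the canonical form only because 斉藤 is the spelling that appears in the dataset. Whether these variants should really be merged is an assumption carried over from the note above.

```python
import unicodedata

# Hypothetical, incomplete variant table: maps each variant to one chosen form.
VARIANT_MAP = str.maketrans({"斎": "斉", "齊": "斉", "齋": "斉"})

def normalize_variants(name):
    """Apply Unicode NFKC normalization, then fold known kanji variants."""
    return unicodedata.normalize("NFKC", name).translate(VARIANT_MAP)

spellings = ["斎藤", "斉藤", "齊藤", "齋藤"]
normalized = {normalize_variants(s) for s in spellings}

# NFKC on its own leaves the variant untouched, hence the explicit table.
nfkc_only = unicodedata.normalize("NFKC", "齋藤")

print(normalized)
print(nfkc_only)
```

The function could be passed to `CountVectorizer` through its `preprocessor` argument, with the caveat that a custom preprocessor replaces the built-in lowercasing and accent-stripping step.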

## Data collection

### Japanese

**Last Name**
* Scrape last names from [here](https://myoji-yurai.net/prefectureRanking.htm;jsessionid=C13440C475D5A9E10ACD0C8C63AF6E6C.jvm1)

**First Name**

- [Meiji Yasuda Insurance](https://www.meijiyasuda.co.jp/sp/enjoy/ranking/index.html#/year/2018n/9): a list of newer, unusual first names.

- [Yearly popular name ranking](http://www.tonsuke.com/nebin.html): yearly rankings, based on the Meiji Yasuda Insurance data.

**Kanji**
- [List of kanji for personal names and regular-use kanji (人名、常用漢字一覧表)](https://kanji.jitenon.jp/cat/jimmei.html)