# Analyzing Names from Facebook 

Using the last-name lists collected previously, I searched for people's names through the public search on Facebook.
In total, 1101 last names were searched on FB, and 164075 names (valid and invalid, including non-Japanese names) were collected. 

To make use of this data, we first need to clean it up. Concretely:
1. extract valid names only
2. classify them into Japanese and non-Japanese names
3. isolate the first names from the Japanese names, and create a list of unique first names

That gives us a unique list of first names that can be added to the Faker library. The other names collected here can be used for testing purposes as well.

Now let's get started!

In [33]:
import pandas as pd
import os
import glob

cur = os.getcwd()
PATH = os.path.join(cur, "data/scraper/fb_names")
# build a list of file names and read all the names into a single DataFrame, for pretty display
file_names = glob.glob(os.path.join(PATH, "*_names.txt"))
frame_generator = (pd.read_csv(f, header=0, names=['name']) for f in file_names)
df = pd.concat(frame_generator, axis=0, ignore_index=True)
print("names are loaded!")

names are loaded!


In [36]:
# now the df is loaded, let's take a look!
from IPython.display import HTML, display
pd.set_option("display.max_rows", None)
display(df.sample(500))

Unnamed: 0,name
144330,小坂田 加代
107265,利根田 一朗
162733,Dian'ga Diageo
124631,Shan Shan
130343,Yi Yuan
88367,山下 潤
138258,片山泉
64561,Dao Cao Van
28756,木下裕子
161076,田代 アクリサイン


As you can see, most of them are valid names (Japanese and non-Japanese), but some include a () entry that seems to hold a second name or a nickname. 
Most Japanese names consist of Kanji, with or without white space between the last and first name.

## EDA on names
This section explores the characteristics of the names collected here. We'll start with the size of the dataset and then dive into further details.

First, let's remove the records with a () entry.

In [38]:
# drop the records that contain "(" and compare the dataset sizes
names = df[~df['name'].str.contains("(", regex=False)]
display(names.sample(100))
print("Dataset Size: ", df.shape[0])
print("Dataset without () size: ", names.shape[0])

Unnamed: 0,name
148425,江口富美雄
163532,工藤幾子
62511,田中優輝
26946,牧野 時夫
163393,木ノ下 拓弥
13674,宮本貴文
30108,北村 信明
143392,Xiao Bao
63448,上田 昌宏
81315,水岡 崇


Dataset Size:  164044
Dataset without () size:  152425


That leaves us with 152425 records out of 164044. These still contain both Japanese and non-Japanese names. 

### Letter-wise Analysis

Now we can analyze the records in terms of the types of characters they contain. Since most Japanese names in this dataset are composed solely of Kanji, let's look individually at the ones that contain
- Hiragana
- Katakana

We start by keeping only the names that contain at least one Japanese character, for now. 

In [42]:
# regex pattern for Japanese characters, thanks to 
# https://gist.github.com/ryanmcgrath/982242 
pattern = "[\u3000-\u303F]|[\u3040-\u309F]|[\u30A0-\u30FF]|[\uFF00-\uFFEF]|[\u4E00-\u9FAF]|[\u2605-\u2606]|[\u2190-\u2195]|\u203B"
j_names = names[names['name'].str.contains(pattern, regex=True)]
display(j_names.sample(100))

Unnamed: 0,name
34991,成塚幸子
47916,大石 広美
77340,青木美穂
151471,小谷田 稔
122771,伊藤 達也
13782,宮本綾恵
115073,三条恵理奈
7447,早川 奈穂子
121721,河野広一
34545,河村英子
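
Note that `str.contains` keeps any name with at least one matching character, so a hypothetical mixed name such as `田代 Taro` would pass this filter too. If a stricter check is wanted, one option is to require that the whole string be Japanese. The sketch below uses only the standard `re` module; the helper name `is_japanese_only` and the exact character ranges are assumptions for illustration, deliberately narrower than the gist pattern above.

```python
import re

# Assumed character set: hiragana, katakana, common kanji, the iteration mark 々,
# plus ASCII and full-width spaces. Punctuation and symbols are left out on purpose.
JP_ONLY = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\u3005 \u3000]+")

def is_japanese_only(name: str) -> bool:
    """Return True when every character of `name` is Japanese script or a space."""
    return JP_ONLY.fullmatch(name) is not None

samples = ["山下 潤", "片山泉", "Shan Shan", "田代 Taro", "菅野 えいじ"]
# Keep only the names written entirely in Japanese characters.
jp_only = [n for n in samples if is_japanese_only(n)]
```

On the DataFrame this could be applied as `names[names['name'].map(is_japanese_only)]`.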


Now, let's take a look at names that include Hiragana and Katakana.

In [44]:
## Set up regex
hiragana = "[\u3040-\u309F]"
katakana = "[\u30A0-\u30FF]"

hira_names = names[names['name'].str.contains(hiragana, regex=True)]
display(hira_names.sample(100))

Unnamed: 0,name
133448,菅野 えいじ
11043,西山 あかね
133876,森下 りお
51507,建石 せいじろう
26192,男澤 あけみ
53192,菅原 ゆう子
25786,大塚 りえこ
103005,中島 かよ
59042,渡部 めぐみ
63051,上田 まゆみ


Now we can observe that the majority of names that include Hiragana have the structure _LastName HiraganaFirstName_, with charming exceptions of weird nicknames and gibberish that makes no sense, here and there. 

Now on to names with Katakana.

In [47]:
kata_names = names[names['name'].str.contains(katakana, regex=True)]
display(kata_names.sample(300))

Unnamed: 0,name
115528,ワッツ 三条堺町
163205,木ノ下隆文
163283,木ノ下 忍
830,大野 アイリ
130393,戸ケ崎 泰史
158761,しち ゼロ ラマ
1769,矢ノ下 伸吾
58324,渡部 アサ子
162992,木ノ下 結衣
98547,アキミト アキラ


Just like the Hiragana names, the basic format of the names seems to be _LastName KatakanaFirstName_. 
Some exceptions are
- last names that include Katakana, such as 木ノ下
- fake or made-up first names in Katakana, such as ジュニア and オルゴール
- non-Japanese names written in Katakana, e.g. グェン フ- ン
- and complete gibberish.
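
To tell these cases apart without reading every row, one could label each whitespace-separated part of a name with the scripts it contains. This is only a rough sketch: `script_profile` and `token_profiles` are hypothetical helpers, and the Unicode ranges cover the common cases rather than every possible character.

```python
import re

# One pattern per script; the ranges are assumptions covering the common cases only.
SCRIPTS = {
    "hiragana": re.compile(r"[\u3040-\u309F]"),
    "katakana": re.compile(r"[\u30A0-\u30FF]"),
    "kanji": re.compile(r"[\u4E00-\u9FFF\u3005]"),
    "latin": re.compile(r"[A-Za-z]"),
}

def script_profile(text: str) -> frozenset:
    """Return the set of scripts that occur at least once in `text`."""
    return frozenset(s for s, pat in SCRIPTS.items() if pat.search(text))

def token_profiles(name: str) -> list:
    """Profile each whitespace-separated part of the name separately."""
    return [script_profile(tok) for tok in name.split()]

basic = token_profiles("大野 アイリ")         # kanji last name, katakana first name
kana_in_last = token_profiles("木ノ下 忍")    # katakana sits inside the last name
all_kana = token_profiles("アキミト アキラ")  # every part is katakana
```

A name whose first part is kanji-only and whose second part is katakana-only fits the basic format; anything else can be set aside for manual review.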

### Names that include ()s

In [24]:
nick_names = df[df['name'].str.contains("(", regex=False)]
print("total numbers of name records: ", df.shape[0])
print("numbers of the name records with (): ", nick_names.shape[0])
print("{} % of the records contains ().".format(round((nick_names.shape[0]/df.shape[0])*100)))
nick_names.sample(500)

total numbers of name records:  164044
numbers of the name records with ():  11619
7 % of the records contains ().


Unnamed: 0,name
149575,米子 (Mizu)
94675,及川 敬子 (森川)
140311,早戸和希 (Kazuki Hayato)
132895,Lu Cas Mu (E Li Sha)
149083,江口 知佳 (浦本知佳)
131039,加賀野 ケイ (Keiji Nakano)
90196,三浦 宏昭 (ミウ)
136803,Thuỳ An (Trùm Bán Lẻ)
62738,田中久子 (ひさごん)
105396,渡辺 熱 (Atsushi Watanabe)


The parentheses mainly contain the following:
- syllabic characters (to indicate pronunciation)
- a different last name (possibly because of marriage)
- a nickname
- the name in the person's native language, when the main name is written in English (or whatever language is predominant in the country they live in)

Many of them are valid names, so simply removing the () portion should be enough, and we can then treat them like the rest of the names. 
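
Removing the () portion could look something like the sketch below. The helper name `strip_parenthetical` is hypothetical, and handling full-width parentheses `（）` is an assumption, since the filter above only looks for the ASCII `(`.

```python
import re

# A parenthesised group (ASCII or full-width) together with the whitespace around it.
PAREN = re.compile(r"\s*[(（][^)）]*[)）]\s*")

def strip_parenthetical(name: str) -> str:
    """Remove every parenthesised part of `name` and trim the result."""
    return PAREN.sub(" ", name).strip()

cleaned = [strip_parenthetical(n) for n in
           ["米子 (Mizu)", "及川 敬子 (森川)", "Lu Cas Mu (E Li Sha)"]]
```

With pandas, the same function can be mapped over the column, e.g. `nick_names['name'].map(strip_parenthetical)`.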

## Conclusion

While the majority of the collected names are valid, the data contains a lot of anomalies. Probably the best way to clean up and organize the data is to
1. extract the names that contain the searched last name (a possible problem, though, is that this might miss other representations of the last name, in Hiragana, Katakana and Roman letters)
2. check the index of the last name within the name to determine the order of last and first name (and swap them back if reversed)
3. check for the presence of white space between the last and first name
4. separate the name into last and first name
5. extract the unique first names and attach syllabic characters (using this [website](https://kanji.reader.bz/))
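
Steps 1 to 4 and the first half of step 5 could be sketched roughly as follows. This assumes the searched last name is known for each record (for example from the source file it was read from, which is an assumption about the data layout), and `split_name` is a hypothetical helper, not part of any library.

```python
def split_name(full_name: str, last_name: str):
    """Split `full_name` into (last, first) using the searched, non-empty `last_name`.

    Returns None when the last name is not found at either end of the name,
    or when nothing is left over for the first name.
    """
    compact = "".join(full_name.split())  # ignore white space between the parts
    if compact.startswith(last_name):
        first = compact[len(last_name):]
    elif compact.endswith(last_name):
        first = compact[:-len(last_name)]  # reversed order: first name came first
    else:
        return None
    return (last_name, first) if first else None

# (collected name, searched last name) pairs, made up for illustration
records = [("山下 潤", "山下"), ("片山泉", "片山"), ("潤 山下", "山下"), ("Shan Shan", "山下")]
pairs = [split_name(name, last) for name, last in records]
unique_first_names = sorted({p[1] for p in pairs if p is not None})
```

Records that come back as `None` are the ones that do not contain the searched last name in this exact spelling, which is where the Hiragana, Katakana and Roman variants mentioned in step 1 would end up.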

Then add the unique name lists in Hiragana, Kanji and Katakana to the Faker module.

Alongside that, for the records with ()s, we can simply remove the () and its contents and feed them into the pipeline above. 

Also, try using the Japanese-character regex to extract only the Japanese letters, for a simpler version of the name.
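
A minimal sketch of that idea, assuming a narrower character set than the gist pattern (hiragana, katakana, common kanji and 々 only); `japanese_part` is a hypothetical helper name.

```python
import re

# Runs of hiragana, katakana, common kanji, or the iteration mark 々 (assumed ranges).
JP_RUN = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\u3005]+")

def japanese_part(name: str) -> str:
    """Join the runs of Japanese characters found in `name` with single spaces."""
    return " ".join(JP_RUN.findall(name))

romanized_alias = japanese_part("加賀野 ケイ (Keiji Nakano)")  # Latin alias is dropped
japanese_alias = japanese_part("及川 敬子 (森川)")             # Japanese alias is kept
no_japanese = japanese_part("Xiao Bao")                        # nothing to extract
```

Because a Japanese alias inside the () survives this extraction, the () contents should be removed before it is applied.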