<a href="https://colab.research.google.com/github/ychu19/dissertation-classifier/blob/master/citizenship_random_forest_presentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classifying Citizenship of Naturalized Individuals in Japan

My dissertation project seeks to explain why permanent immigrants in Japan refused to acquire Japanese citizenship even when they were born and raised there [1]. I hypothesize that their home country attachment through diasporic organizations affects their propensity to naturalize.

I pulled the naturalization records from the Japanese Government Gazette ([官報](https://search.npb.go.jp/kanpou/)), which lists every individual naturalized in Japan since the 1950s. I focus on the period from 1971 to 1980, during which a total of 72,416 individuals naturalized.

This document presents a smaller project within my dissertation: **classifying the country of origin for each naturalized individual**. The Gazette records for naturalized citizens include (a) their names, (b) their names before naturalization [2], (c) their residential addresses, (d) their dates of birth, and (e) their dates of approval. While the Gazette provides a rare and valuable opportunity to look into the individual-level features of naturalized citizens, it does not include their countries of origin. Fortunately, the Gazette did record the original citizenship of those who naturalized in the 1950s. **This project uses the data from 1954 and 1955 as labeled training data to predict the countries of origin of those who naturalized in the 1970s.**


> [1]: Japan is not governed by *jus soli*, meaning that there is no birthright citizenship in Japan. [See Japan MOJ](http://www.moj.go.jp/ENGLISH/information/tnl-01.html)

> [2]: Prior to 1983, most applicants for naturalization were implicitly asked to adopt a Japanese-sounding name. [See Wikipedia](https://en.wikipedia.org/wiki/Japanese_nationality_law#Naturalization).



## Data Source

Japanese Government Gazette ([官報](https://search.npb.go.jp/kanpou/)) in 1954 and 1955: 

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn import tree

In [None]:
data_1954_demo = pd.read_excel("/content/drive/My Drive/UIUC Grad School/0_dissertation/dat_1954_demo.xlsx", sheet_name='anonym')
print(data_1954_demo.head().to_markdown())

|    | name    | citizenship   | address_anonym   | birthdate                |   household |   date_approval | betsume.1   |   betsume.2 |   betsume.3 |
|---:|:--------|:--------------|:-----------------|:-------------------------|------------:|----------------:|:------------|------------:|------------:|
|  0 | * 光 *  | 無国籍        | 東京都           | 大正十四年十二月十一日生 |           1 |        19540105 | nan         |         nan |         nan |
|  1 | * 鎮 *  | 朝鮮          | 同県山           | 昭和十四年三月十六日生   |           2 |        19540105 | *城正*      |         nan |         nan |
|  2 | *本 万* | 朝鮮          | 高知県           | 明治四十二年七月七日生   |           3 |        19540105 | *万*        |         nan |         nan |
|  3 | *本 又* | 朝鮮          | 高知県           | 大正六年三月二十四日生   |           4 |        19540105 | *又*        |         nan |         nan |
|  4 | *本 玉* | 朝鮮          | 同県同           | 昭和十三年九月二十九日生 |           5 |        19540105 | *玉*        |         nan |         nan |


## Random Forest Classification from `sklearn`

The main task here is to identify Koreans and Chinese among all the naturalized citizens (my dissertation focuses on [zainichi Koreans](https://en.wikipedia.org/wiki/Koreans_in_Japan) and [overseas Chinese](https://en.wikipedia.org/wiki/Overseas_Chinese) in Japan). I do so using the following three features to predict the label, citizenship:

*   Number of names each individual has before naturalization ("betsume")
*   Last name, and whether the last names are common in Korea and China
*   First name, and whether the first names are common in Korea and China

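To handle names that are common in both Korea and China (like "Lee" and "Kim"), I also include crossed features of first and last names: the product of the Korean and Chinese indicators, which equals 1 only when a name appears on both lists.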
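
As a rough illustration of why the cross helps, the sketch below builds the three surname features for a few names. The two surname sets are hypothetical stand-ins chosen for this example, not the 2015 lists loaded later, and `surname_features` is an illustrative helper rather than a function from the notebook.

```python
# Hypothetical surname sets, for illustration only.
KR_SURNAMES = {"金", "李", "朴"}
CH_SURNAMES = {"李", "王", "陳"}

def surname_features(surname):
    """Return (kr_last_name, ch_last_name, last_name_cross) as 0/1 integers."""
    kr = int(surname in KR_SURNAMES)
    ch = int(surname in CH_SURNAMES)
    return kr, ch, kr * ch  # the cross is 1 only when both indicators are 1

for s in ["金", "王", "李", "田中"]:
    print(s, surname_features(s))
```

A surname found on only one list gets a cross of 0, so the model can treat shared surnames separately from Korea-only or China-only ones.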


### Import data

In [None]:
# reading data from 1954 and 1955
data_1954 = pd.read_excel("/content/dat_1954.xlsx")
data_1955 = pd.read_excel("/content/dat_1955.xlsx")

frames = [data_1954, data_1955]
data = pd.concat(frames).reset_index(drop=True)
data = data.reindex(np.random.permutation(data.index)) # shuffle the rows



### Setting up X and y

In [None]:
# create y
def nationality(string):
  """Map a citizenship string from the archival data to a code:
     (1) Korean, (2) Chinese, or (0) other national origin."""
  if string == '朝鮮':
    return 1
  elif string in ['無国籍', '中華民国']:  # stateless and Republic of China are both coded as Chinese
    return 2
  return 0  # default: other

# recode the citizenship column of the combined 1954-1955 data
data['citizenship'] = data['citizenship'].apply(nationality)

In [None]:
# create X

## 1st criterion: number of betsume
def betsume_numbers(betsume1, betsume2, betsume3):
  """Count the non-missing betsume, regardless of which columns they occupy."""
  b1 = pd.notna(betsume1)
  b2 = pd.notna(betsume2)
  b3 = pd.notna(betsume3)
  return int(b1) + int(b2) + int(b3)  # each non-missing value counts as 1

betsume_columns = ['betsume.1','betsume.2','betsume.3']
data['betsume'] = data[betsume_columns].apply(lambda x:betsume_numbers(*x), axis=1)

In [None]:
### 2nd criterion: most common last names in Korea and China

## korean last names
kr_last_names = pd.read_excel('/content/Korea_last_name_2015.xlsx')
kr_last_names = kr_last_names[kr_last_names.columns[1]]
kr_last_names = kr_last_names.dropna(axis=0)

data['kr_last_name'] = data['last.name'].isin(kr_last_names).astype(int)

## chinese last names
ch_last_names = pd.read_excel('/content/China_last_names_2015.xlsx')
ch_last_names = ch_last_names[ch_last_names.columns[2]]
ch_last_names = ch_last_names.dropna(axis=0)

data['ch_last_name'] = data['last.name'].isin(ch_last_names).astype(int)

In [None]:
data['last_name_cross'] = data['kr_last_name'] * data['ch_last_name']

In [None]:
### 3rd criterion: common first names in Korea and China/Taiwan

data['first.name.1'] = data['first.name'].str.slice(start=0, stop=1).astype("string")
data['first.name.2'] = data['first.name'].str.slice(start=1, stop=2).astype("string")

In [None]:
## Taiwanese first names
tw = pd.read_excel('/content/Taiwan_common_first_names_2015.xlsx')
tw_men = tw['男性取用名字'].str.split('、')
tw_women = tw['女性取用名字'].str.split('、')

tw_first_names = pd.concat([tw_men, tw_women], ignore_index=True)  # Series.append was removed in pandas 2.0
tw_first_names = np.concatenate(tw_first_names)
tw_first_names = pd.Series(tw_first_names, dtype='string')

In [None]:
## Chinese first names
ch_first_names = pd.Series(["偉", "芳", "娜", "敏", "靜", "秀英", "麗", "強", "磊", 
                  "洋", "艷", "勇", "軍", "杰", "娟", "英", "華", "文", "明", "蘭", 
                  "國", "春", "紅", "小", "梅", "平", "海", "珍", "榮", "建国", "国強",
                  "志華","国慶", "抗美", "援朝", "衛華", "保国", "躍進", "勝天", "超英",
                  "超美", "躍華", "自力", "更生", "図強","宝勤"])#, dtype='string') # save as a list first
ch_first_names = pd.concat([ch_first_names, tw_first_names], ignore_index=True)

name_list = ch_first_names.str.cat()  # all listed names joined into one string
def name_in_list(name):
    """Return True if `name` occurs as a substring of the global `name_list`.
       Missing values and empty strings return False."""
    if pd.notna(name) and len(name) > 0:
      return name in name_list
    else:
      return False

data['ch_first_name_1'] = data['first.name.1'].apply(name_in_list)
data['ch_first_name_2'] = data['first.name.2'].apply(name_in_list)
# 1 if either character of the first name appears in the Chinese list
data['ch_first_name'] = (data['ch_first_name_1'] | data['ch_first_name_2']).astype(int)
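
Because `name_list` is a single concatenated string, `name in name_list` is a substring test rather than a lookup of whole names. The sketch below shows the difference on a three-name list; `char_in_joined` is a hypothetical stand-in for `name_in_list`, and `"".join` plays the role of `Series.str.cat()`.

```python
names = ["秀英", "建国", "偉"]
joined = "".join(names)  # the three names joined into one string

def char_in_joined(text):
    """Substring test against the joined string; empty input is False."""
    return bool(text) and text in joined

print(char_in_joined("英"))    # True: 英 occurs inside 秀英
print("英" in set(names))      # False: no listed name is exactly 英
print(char_in_joined("英建"))  # True: the match spans two adjacent names
```

For the single characters stored in `first.name.1` and `first.name.2`, the test therefore asks whether the character appears anywhere in any listed name.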



In [None]:
## Korean names
kr_first_names = pd.Series(["榮秀","英秀", "英鎬", "榮鎬", "英植", "榮植", 
                "榮子", "英子", "靜子", "貞子", "順子", "純子", 
                "英澈", "榮澈", "洙",
                "榮淑", "英淑", "宰", "載",
                "道勇", "道永", "夏江", "忠吉"])
# name_in_list (defined above) reads the global name_list, so rebinding it switches the lookup to Korean names
name_list = kr_first_names.str.cat()

data['kr_first_name_1'] = data['first.name.1'].apply(name_in_list)
data['kr_first_name_2'] = data['first.name.2'].apply(name_in_list)
# 1 if either character of the first name appears in the Korean list
data['kr_first_name'] = (data['kr_first_name_1'] | data['kr_first_name_2']).astype(int)



In [None]:
data['first_name_cross'] = data['kr_first_name'] * data['ch_first_name']

### Separate Training and Test Data

In [None]:
features = ['kr_last_name', 'ch_last_name', 'last_name_cross', 'first_name_cross', 'betsume']
# features = ['kr_last_name', 'ch_last_name', 'last_name_cross', 'ch_first_name', 'kr_first_name', 'betsume']
X = data[features]

In [None]:
train_set = data.sample(frac=0.75, random_state=0)
test_set = data.drop(train_set.index)

test_y = test_set.citizenship
test_X = test_set[features]

#### Downsampling and Upweighting the Majority Group (Koreans)
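
The idea is to shrink the majority class until the minority class reaches a target share of the training data, and then to give the majority class a larger weight in the model. The size arithmetic can be sketched as follows; the row count of 140 is hypothetical, and `downsample_sizes` is an illustrative helper, not part of the notebook.

```python
def downsample_sizes(n_minority, minority_ratio):
    """Return (total, n_majority) so the minority makes up `minority_ratio` of the total."""
    total = int(n_minority / minority_ratio)  # truncated to an integer, as in the notebook
    n_majority = total - n_minority
    return total, n_majority

# Hypothetical count: 140 Chinese rows in the training split, target share 0.35.
total, n_korean = downsample_sizes(140, 0.35)
print(total, n_korean)   # 400 260
print(140 / total)       # 0.35
```

All minority rows are kept, and only `n_majority` rows are sampled from the majority class.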



In [None]:
# deal with imbalanced data
target_chinese_sample_size = len(train_set[train_set['citizenship']==2]) 
target_chinese_sample_ratio = 0.35
adjusted_sample_total = int(target_chinese_sample_size/target_chinese_sample_ratio)

target_korean_sample_size = adjusted_sample_total - target_chinese_sample_size

adjusted_sample_koreans = train_set[train_set['citizenship']==1].sample(
    n=target_korean_sample_size, 
    random_state=0)
adjusted_sample_chinese = train_set[train_set['citizenship']==2]

frames = [adjusted_sample_koreans, adjusted_sample_chinese]
adjusted_sample = pd.concat(frames)
adjusted_sample = adjusted_sample.reindex(np.random.permutation(adjusted_sample.index)) # shuffle the rows

y = adjusted_sample.citizenship
X = adjusted_sample[features]

In [None]:
adjusted_sample.head() # check if rows were shuffled

Unnamed: 0,name,citizenship,birthplace,address,birthdate,V3,date_approval,betsume.1,betsume.2,betsume.3,last.name,first.name,betsume,kr_last_name,ch_last_name,last_name_cross,first.name.1,first.name.2,ch_first_name_1,ch_first_name_2,ch_first_name,kr_first_name_1,kr_first_name_2,kr_first_name,first_name_cross
2350,全 光 博,1,大阪府中河内郡巽村大字西足代四百九十番地,岡山県津山市西寺町十二番地,昭和十九年九月五日生,1753,19541126,金成光博,,,全,光博,1,1,0,0,光,博,False,False,0,False,False,0,0
2862,清水 光子,1,宮城県仙台市原町清水沼上六番地,同県黒川郡落合村松坂字内問答山二十三番地,大正八年三月六日生,213,19550222,,,,清水,光子,0,0,0,0,光,子,False,True,1,False,True,1,1
3720,呉 秀 三,2,住　所 同区下目黒一丁目百五番地 東京都目黒区三田町百十九番地,呉　秀　三 同区下目黒一丁目百五番地,昭和十九年九月十日生,893,19550621,,,,呉,秀三,0,0,0,0,秀,三,True,False,1,True,False,1,1
368,元 金 子,1,神奈川県横浜市鶴見区生麦町三百十二番地,同県同市西区藤棚町二丁目百九十七番地,大正五年一月二十日生,277,19540218,徳原キン,,,元,金子,1,1,0,0,金,子,True,True,1,False,True,1,1
4114,韓 君 子,1,福岡県遠賀郡水巻村吉田以下不詳,東京都北多摩郡村山町大字中藤三千二百六十番地,昭和八年七月三十日生,1238,19550820,金山君子,,,韓,君子,1,1,1,1,君,子,True,True,1,False,True,1,1


In [None]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1) # train data and validation data

#### Model 1: RandomForestClassifier

In [None]:
model = RandomForestClassifier(random_state=1,  class_weight={1:5,2:1})
model = model.fit(train_X, train_y)
model_predict = model.predict_proba(val_X)  # columns follow model.classes_, here [1, 2]

threshold = 0.85

# recode the validation labels as binary: 1 = Korean, 0 = Chinese
val_y = pd.Series(val_y.values == 1).astype(int)
# predict Korean only when P(Korean), the first column, reaches the threshold
y_pred = pd.Series(model_predict[:, 0] >= threshold).astype(int)

train_pred = model.predict_proba(train_X)

from sklearn.metrics import accuracy_score
val_accuracy = accuracy_score(val_y, y_pred)
print("validation accuracy: {:.3f}".format(val_accuracy))

validation accuracy: 0.708
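
The thresholding step relies on the column order of `predict_proba`, which in scikit-learn follows the sorted class labels in `classes_`. A minimal pure-Python sketch of the same logic, using made-up probabilities and a hypothetical helper `positive_by_threshold`:

```python
def positive_by_threshold(proba_rows, classes, positive_class, threshold):
    """Return 1 where the positive class's probability reaches the threshold, else 0."""
    col = list(classes).index(positive_class)  # column that holds the positive class
    return [int(row[col] >= threshold) for row in proba_rows]

classes = [1, 2]  # 1 = Korean, 2 = Chinese, in sorted order
proba = [[0.90, 0.10], [0.60, 0.40], [0.85, 0.15]]  # made-up probabilities
print(positive_by_threshold(proba, classes, 1, 0.85))  # [1, 0, 1]
```

Looking the column up from the class list, rather than hard-coding index 0, keeps the code correct if the set of labels changes.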


In [None]:
from sklearn.metrics import roc_auc_score
y_scores = model_predict[:, 0]  # P(Korean) for each validation row
roc_auc_score(val_y, y_scores)

0.7352642276422763

In [None]:
rf_model_on_full_data = RandomForestClassifier(random_state = 1,class_weight={1:5,2:1})

rf_model_on_full_data.fit(X, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight={1: 5, 2: 1},
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [None]:
# Cross-validation
from sklearn.model_selection import cross_validate
cv = cross_validate(rf_model_on_full_data, X, y, cv=10)
# print(cv['test_score'])
print("mean test scores: ", cv['test_score'].mean())

mean test scores:  0.6961538461538461


#### Model 2: BaggingClassifier

In [None]:
from sklearn.ensemble import BaggingClassifier
model_2 = BaggingClassifier(base_estimator = RandomForestClassifier(), 
                            random_state = 0)
model_2 = model_2.fit(train_X, train_y, sample_weight=(5))
model_2_pred = model_2.predict_proba(val_X)

y_pred = [model_2_pred[i][0]>=threshold for i in range(len(model_2_pred))]
y_pred = pd.Series(y_pred).astype(int)

val_accuracy = accuracy_score(val_y, y_pred)
print("validation accuracy: {:.3f}".format(val_accuracy))

# bagging_model_on_full_data = RandomForestClassifier(random_state = 1,class_weight={1:5,2:1})

# rf_model_on_full_data.fit(X, y)

validation accuracy: 0.423


### Run on Test Data to Predict Citizenship

In [None]:
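test_preds = rf_model_on_full_data.predict_proba(test_X)

# predict Korean (1) when P(Korean), the first column, reaches the threshold
test_y_pred = pd.Series(test_preds[:, 0] >= threshold).astype(int)
# Caveat: test_y keeps the 0/1/2 codes while test_y_pred is binary,
# so rows coded 2 (Chinese) always count as misses in this score
test_accuracy = accuracy_score(test_y, test_y_pred)
print("accuracy in test data: ", test_accuracy)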

accuracy in test data:  0.8934817170111288


## Maybe try it with 1971 - 1980 data?

In [None]:
import os

In [None]:
path = os.getcwd()
files = os.listdir(path)
files_xls = [f for f in files if (f[:2] == '19')]

In [None]:
files_xls

['1977.dat.xlsx',
 '1979.dat.xlsx',
 '1978.dat.xlsx',
 '1974.dat.xlsx',
 '1975.dat.test.xlsx',
 '1976.dat.xlsx',
 '1973.dat.xlsx',
 '1980.dat.xlsx']

In [None]:
# read every yearly file, then stack them into a single frame
yearly_frames = [pd.read_excel(f, sheet_name='Sheet1') for f in files_xls]
real_data = pd.concat(yearly_frames, ignore_index=True)

In [None]:
real_data.head()

Unnamed: 0,id,namae,betsume1,betsume2,betsume3,jyuusyo,year,getsu,nichi,age,date_approval,surname,firstname,citizenship
0,1,金和子,金順＃,金谷和子,隅谷和子,兵庫県明石市林一丁目八番三十六号,1950,7,28.0,29.0,19800104,金,和子,kr
1,2,朴慶南,藤本時子,星本時子,,兵庫県姫路市阿保六百九十二番地,1948,6,9.0,32.0,19800104,朴,慶南,kr
2,2,李成昌,星本成昌,,,兵庫県姫路市阿保六百九十二番地,1950,2,8.0,30.0,19800104,李,成昌,kr
3,2,李政樹,星本政樹,,,兵庫県姫路市阿保六百九十二番地,1972,1,1.0,8.0,19800104,李,政樹,kr
4,2,李昌俊,星本昌俊,,,兵庫県姫路市阿保六百九十二番地,1970,9,18.0,9.0,19800104,李,昌俊,kr


In [None]:
# create X

## numbers of betsume

betsume_columns = ['betsume1','betsume2','betsume3']
real_data['betsume'] = real_data[betsume_columns].apply(lambda x:betsume_numbers(*x), axis=1)

In [None]:
## last names

real_data['kr_last_name'] = real_data['surname'].isin(kr_last_names).astype(int)
real_data['ch_last_name'] = real_data['surname'].isin(ch_last_names).astype(int)

real_data['last_name_cross'] = real_data['kr_last_name']*real_data['ch_last_name']

In [None]:
## first names

real_data['first.name.1'] = real_data['firstname'].str.slice(start=0, stop=1).astype("string")
real_data['first.name.2'] = real_data['firstname'].str.slice(start=1, stop=2).astype("string")

# name_in_list (defined earlier) reads the global name_list, so rebind it before each lookup
name_list = ch_first_names.str.cat()
real_data['ch_first_name_1'] = real_data['first.name.1'].apply(name_in_list)
real_data['ch_first_name_2'] = real_data['first.name.2'].apply(name_in_list)
real_data['ch_first_name'] = (real_data['ch_first_name_1'] | real_data['ch_first_name_2']).astype(int)

name_list = kr_first_names.str.cat()
real_data['kr_first_name_1'] = real_data['first.name.1'].apply(name_in_list)
real_data['kr_first_name_2'] = real_data['first.name.2'].apply(name_in_list)
real_data['kr_first_name'] = (real_data['kr_first_name_1'] | real_data['kr_first_name_2']).astype(int)

real_data['first_name_cross'] = real_data['ch_first_name'] * real_data['kr_first_name']



In [None]:
X = real_data[features]

In [None]:
real_model = rf_model_on_full_data.predict_proba(X)

real_y_pred = [real_model[i][0]>=threshold for i in range(len(real_model))]
real_y_pred = pd.Series(real_y_pred).astype(int)

def return_citizenship(x):
  """transforming binary categories back to citizenship"""
  if x == 1:
    return "Korea"
  else:
    return "China" 

real_y_pred = real_y_pred.apply(lambda x:return_citizenship(x))

real_data['citizenship_predict']=real_y_pred

In [None]:
real_data.citizenship_predict.value_counts()

Korea    12904
China     3058
Name: citizenship_predict, dtype: int64

In [None]:
real_data.to_excel("predicted_citizenship.xlsx")