#### NCCU, Fall 2020
#### **PyTorch and Machine Learning**
##### 107363015 郭育丞 (Morton Kuo)

---
#### **ML18: NLP - Surname Classification**
##### https://medium.com/analytics-vidhya/ml18-6e9b1b66c30e
##### @author: Morton Kuo (2021 / 01 / 05)


---
### PART I: Introduction & Pre-processing

*   The result presented here is ***the best*** RNN classifier I found.
*   The code mainly comes from the book ***Rao, D. & McMahan, B. (2019). Natural Language Processing with PyTorch. Sebastopol, CA: O’Reilly Media***. Building on it, I tune the model and visualize the results.


---
### Outline
*   (1) Data Source
*   (2) Import & Settings
*   (3) Input & Inspecting
*   (4) Splitting the Surname Dataset
*   (5) Saving the New Surname Dataset




---
### Reference

1. Rao, D. & McMahan, B. (2019). Natural Language Processing with PyTorch. Sebastopol, CA: O’Reilly Media.




---
### (1) Data Source
To run the code, download the "surnames.csv" dataset from the link below and upload it to Colab:\
https://bit.ly/3bbMEwX

---
### (2) Import & Settings

In [None]:
import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace

In [None]:
args = Namespace(
    raw_dataset_csv="surnames.csv",
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="surnames_with_splits.csv",
    seed=1337
)
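
The three proportions are meant to cover the whole dataset, so they should sum to 1. A minimal sanity check, assuming the same values as in the settings cell above (the variable `proportions_ok` is introduced here only for illustration):

```python
import math
from argparse import Namespace

# Same proportions as in the settings cell above
args = Namespace(train_proportion=0.7, val_proportion=0.15, test_proportion=0.15)

total = args.train_proportion + args.val_proportion + args.test_proportion

# math.isclose tolerates floating-point rounding in the sum
proportions_ok = math.isclose(total, 1.0)
print(proportions_ok)
```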

---
### (3) Input & Inspecting

In [None]:
# Read raw data
surnames = pd.read_csv(args.raw_dataset_csv, header=0)

In [None]:
surnames.head()

Unnamed: 0,surname,nationality
0,Woodford,English
1,Coté,French
2,Kore,English
3,Koury,Arabic
4,Lebzak,Russian


In [None]:
# Unique classes
set(surnames.nationality)

{'Arabic',
 'Chinese',
 'Czech',
 'Dutch',
 'English',
 'French',
 'German',
 'Greek',
 'Irish',
 'Italian',
 'Japanese',
 'Korean',
 'Polish',
 'Portuguese',
 'Russian',
 'Scottish',
 'Spanish',
 'Vietnamese'}
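
The set above lists the classes but not how many surnames each one has. As a rough sketch, `value_counts()` gives the per-class counts; the toy frame below reuses only the five rows shown by `head()` and stands in for the full `surnames` DataFrame:

```python
import pandas as pd

# Toy stand-in for the surnames DataFrame (the five rows printed by head())
toy = pd.DataFrame({
    "surname": ["Woodford", "Coté", "Kore", "Koury", "Lebzak"],
    "nationality": ["English", "French", "English", "Arabic", "Russian"],
})

# Number of surnames per nationality, most frequent first
class_counts = toy["nationality"].value_counts()
print(class_counts)
```

On the real data the same call would be `surnames.nationality.value_counts()`, which helps reveal any class imbalance before splitting.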

---
### (4) Splitting the Surname Dataset

In [None]:
# Group the rows by nationality
# (dict mapping each nationality to a list of row dicts)
by_nationality = collections.defaultdict(list)
for _, row in surnames.iterrows():
    by_nationality[row.nationality].append(row.to_dict())
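
The loop above builds a dict that maps each nationality to a list of row dicts. As an alternative sketch (not the book's code), the same structure can probably be obtained with `groupby` and `to_dict("records")`; the toy frame is illustrative only:

```python
import collections
import pandas as pd

# Toy stand-in for the surnames DataFrame
toy = pd.DataFrame({
    "surname": ["Woodford", "Coté", "Kore", "Koury", "Lebzak"],
    "nationality": ["English", "French", "English", "Arabic", "Russian"],
})

# Loop-based grouping, mirroring the cell above
by_nationality = collections.defaultdict(list)
for _, row in toy.iterrows():
    by_nationality[row.nationality].append(row.to_dict())

# groupby-based grouping: one list of row dicts per nationality
by_nationality_gb = {
    nationality: group.to_dict("records")
    for nationality, group in toy.groupby("nationality")
}

# Both approaches should yield the same mapping
same = dict(by_nationality) == by_nationality_gb
print(same)
```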

In [None]:
# Create split data
final_list = []
np.random.seed(args.seed)
for _, item_list in sorted(by_nationality.items()):
    np.random.shuffle(item_list)
    n = len(item_list)
    n_train = int(args.train_proportion*n)
    n_val = int(args.val_proportion*n)
    # n_test is not used below: the test slice takes all remaining items
    n_test = int(args.test_proportion*n)
    
    # Give each data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'
    for item in item_list[n_train+n_val:]:
        item['split'] = 'test'  
    
    # Add to final list
    final_list.extend(item_list)
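
Because `int()` truncates and the last slice runs to the end of the list, any rounding remainder ends up in the test split. The arithmetic can be sketched as follows; `split_sizes` is a hypothetical helper written for this illustration, with the proportions from `args` as defaults:

```python
def split_sizes(n, train_proportion=0.7, val_proportion=0.15):
    """Return (n_train, n_val, n_test) as the slicing above assigns them."""
    n_train = int(train_proportion * n)  # int() truncates toward zero
    n_val = int(val_proportion * n)
    n_test = n - n_train - n_val         # the last slice takes everything left
    return n_train, n_val, n_test

# A class with 10 surnames: 7 train, 1 val, 2 test
print(split_sizes(10))  # (7, 1, 2)
```

This is consistent with the test split (1660) being slightly larger than the validation split (1640) in the counts reported for the full dataset.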

In [None]:
# Collect the split data into a DataFrame
final_surnames = pd.DataFrame(final_list)

In [None]:
final_surnames.split.value_counts()

train    7680
test     1660
val      1640
Name: split, dtype: int64
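
The overall counts do not show whether every nationality was split in roughly the same proportions. One way to check, sketched on a toy frame with made-up rows (two classes of 10 items each), is a row-normalized `pd.crosstab`:

```python
import pandas as pd

# Toy stand-in for final_surnames: two classes, each split 7 / 1 / 2
toy = pd.DataFrame({
    "nationality": ["Arabic"] * 10 + ["English"] * 10,
    "split": (["train"] * 7 + ["val"] * 1 + ["test"] * 2) * 2,
})

# Share of each split within every nationality (each row sums to 1)
per_class = pd.crosstab(toy["nationality"], toy["split"], normalize="index")
print(per_class)
```

On the real data the call would be `pd.crosstab(final_surnames.nationality, final_surnames.split, normalize="index")`.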

In [None]:
final_surnames.head()

Unnamed: 0,surname,nationality,split
0,Totah,Arabic,train
1,Abboud,Arabic,train
2,Fakhoury,Arabic,train
3,Srour,Arabic,train
4,Sayegh,Arabic,train


---
### (5) Saving the New Surname Dataset

In [None]:
# Write munged data to CSV
final_surnames.to_csv(args.output_munged_csv, index=False)
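
Passing `index=False` keeps the row index out of the file, so reloading the CSV should give back the same columns. A small round-trip sketch on toy rows, assuming a temporary directory and a hypothetical file name:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for final_surnames
toy = pd.DataFrame({
    "surname": ["Totah", "Abboud"],
    "nationality": ["Arabic", "Arabic"],
    "split": ["train", "train"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_with_splits.csv")  # hypothetical file name
    toy.to_csv(path, index=False)   # no index column is written
    reloaded = pd.read_csv(path)

# Same columns and same rows after the round trip
round_trip_ok = toy.to_dict("records") == reloaded.to_dict("records")
print(list(reloaded.columns), round_trip_ok)
```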