# Homework: classify the origin of names using a character-level RNN

In this homework we will use an RNN-based model to perform classification. The goal is threefold:

1. Get more hands-on practice with the preprocessing needed for text classification, from A to Z. No preprocessing is done for you!
2. Use embeddings and RNNs together at the character level to perform classification.
3. Write a function that takes a string as input and outputs the name of the predicted class.

Here are some guidelines to help you through the steps:

1. Figure out the number of classes, and map the classes to integers (or one-hot vectors). The model needs numeric targets to be trained for classification.
2. Use the Keras tokenizer at the character level to tokenize your input into integer sequences.
3. Pad your sequences using the Keras preprocessing tools.
4. Build a model that uses, minimally, an embedding layer, an RNN (of your choice) and a dense layer to output the logits or probabilities for the target classes (name origins).
5. Fit the model and evaluate on the test set.
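
To make guidelines 2 and 3 concrete, here is a minimal plain-Python sketch of what character-level tokenization and padding produce. It deliberately does not use Keras: the helpers `build_char_index`, `encode` and `pad` are hypothetical names introduced for illustration, and the three names are a toy sample. Like the Keras tokenizer, it numbers characters from 1 (most frequent first) and keeps 0 for padding. Unlike `pad_sequences`, which pads on the left by default, this sketch pads on the right.

```python
from collections import Counter

def build_char_index(texts):
    """Map each character to an integer id, most frequent first, starting at 1."""
    counts = Counter(ch for text in texts for ch in text.lower())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i for i, ch in enumerate(ordered, start=1)}

def encode(text, char_index):
    """Turn a string into a list of integer ids, skipping unknown characters."""
    return [char_index[ch] for ch in text.lower() if ch in char_index]

def pad(sequences, pad_value=0):
    """Right-pad every sequence with pad_value to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

toy_names = ["Ahn", "Baik", "Bang"]  # illustrative sample only
char_index = build_char_index(toy_names)
encoded = [encode(name, char_index) for name in toy_names]
padded = pad(encoded)
print(padded)
```

After padding, every row has the same length, which is what lets the batch be stored as a single rectangular integer array.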

In [27]:
%tensorflow_version 2.x
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
import tensorflow as tf
import pandas as pd
from tensorflow import keras

In [2]:
# Download the data
!wget https://download.pytorch.org/tutorial/data.zip
!unzip data.zip

--2022-05-13 15:16:50--  https://download.pytorch.org/tutorial/data.zip
Resolving download.pytorch.org (download.pytorch.org)... 18.65.229.14, 18.65.229.67, 18.65.229.105, ...
Connecting to download.pytorch.org (download.pytorch.org)|18.65.229.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2882130 (2.7M) [application/zip]
Saving to: ‘data.zip’


2022-05-13 15:16:50 (53.6 MB/s) - ‘data.zip’ saved [2882130/2882130]

Archive:  data.zip
   creating: data/
  inflating: data/eng-fra.txt        
   creating: data/names/
  inflating: data/names/Arabic.txt   
  inflating: data/names/Chinese.txt  
  inflating: data/names/Czech.txt    
  inflating: data/names/Dutch.txt    
  inflating: data/names/English.txt  
  inflating: data/names/French.txt   
  inflating: data/names/German.txt   
  inflating: data/names/Greek.txt    
  inflating: data/names/Irish.txt    
  inflating: data/names/Italian.txt  
  inflating: data/names/Japanese.txt  
  inflating: data/names/Korean.t

In [3]:
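data = []
for filename in glob('data/names/*.txt'):
  # The file name without its extension is the origin, e.g. data/names/Korean.txt -> Korean
  origin = filename.split('/')[-1].split('.txt')[0]
  # Read explicitly as UTF-8 (the names contain accented characters) and close the file when done
  with open(filename, encoding='utf-8') as f:
    for name in f:
      data.append((name.strip(), origin))

names, origins = zip(*data)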

In [16]:
# Build the frame from named columns instead of passing origins as the index and resetting it
df = pd.DataFrame({"origin": origins, "name": names})
df.head()

Unnamed: 0,origin,name
0,Korean,Ahn
1,Korean,Baik
2,Korean,Bang
3,Korean,Byon
4,Korean,Cha


In [21]:
categories = df.origin.unique() #unique origins

In [22]:
n_categories = df.origin.nunique() #number of unique origins
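
Guideline 1 asks for numeric labels. A minimal sketch of one way to build them, assuming the origins are available as a list of strings: the short `origins` list below is a stand-in for `df.origin`, and `origin_to_id`, `id_to_origin` and `one_hot` are hypothetical names introduced here.

```python
origins = ["Korean", "Korean", "Japanese", "Arabic", "Japanese"]  # stand-in for df.origin

# Sort the unique origins so the mapping is reproducible across runs.
categories = sorted(set(origins))
origin_to_id = {origin: i for i, origin in enumerate(categories)}
id_to_origin = {i: origin for origin, i in origin_to_id.items()}

# Integer labels, one per name.
labels = [origin_to_id[origin] for origin in origins]

def one_hot(label, n_classes):
    """Return a list of zeros with a single 1 at position `label`."""
    return [1 if i == label else 0 for i in range(n_classes)]

one_hot_labels = [one_hot(label, len(categories)) for label in labels]
print(labels)
```

Integer labels pair with the `sparse_categorical_crossentropy` loss in Keras, while one-hot labels pair with `categorical_crossentropy`. Keeping `id_to_origin` around makes it easy to turn a predicted class index back into a name later.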

In [None]:
df["name"] = df["name"].str.lower()

# Let's look at the data

In [33]:
names_text = "".join(df.name)
names_text

"ahnbaikbangbyonchachangchichinchochoechoichongchouchuchunchungchwehgilgugwanghahanhohonghunghwanghyunjangjeonjeongjojonjongjungkangkimkokookukwakkwangleelilimmamomoonnamngainohohpaepakparkrarheerheemririmronryomryooryusanseoseokshimshinshonsisinsosonsongsooksuhsuksunsungtsaiwangwooyangyeoyeonyiyimyooyoonyouyoujyounyuyunabeabukaraadachiaidaaiharaaizawaajibanaakaikeakamatsuakatsukaakechiakeraakimotoakitaakiyamaakutagawaamagawaamayaamorianamiandoanzaiaokiaraiarakawaarakiarakidaaratoarihyoshiarishimaaritaariwaariwaraasaharaasahiasaiasanoasanumaasariashiaashidaashikagaasuharaatshushiayabitoayugaibababaisoteibandobunyachibachikamatsuchikanatsuchinochishuchoshidaishidandatedazaideguchideushidoiebinaebisawaedaegamieguchiekiguchiendoendosoenokienomotoerizawaetoetsukoezakiyafuchidafugunagafujikagefujimakifujimotofujiokafujishimafujitafujiwarafukaofukayamafukudafukumitsufukunakafukuokafukusakufukushimafukuyamafukuzawafumihikofunabashifunakifunakoshifurusawafuschidafusefutabateifuwagakushagendage

In [34]:
#preprocessing
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(names_text)

In [35]:
tokenizer.texts_to_sequences(["Ahn"])

[[1, 8, 5]]
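
A quick sanity check on an encoding like this is to map the ids back to characters. The sketch below is an assumption-laden illustration: `word_index` is only a partial character index, reconstructed from the encoded outputs shown in this notebook rather than read from the fitted tokenizer, and `decode` is a hypothetical helper.

```python
# Partial character index, reconstructed from the outputs shown in this notebook.
word_index = {"a": 1, "o": 2, "i": 4, "n": 5, "h": 8, "k": 9,
              "b": 16, "y": 17, "g": 18, "c": 19}

# Invert the mapping: integer id -> character.
index_word = {i: ch for ch, i in word_index.items()}

def decode(sequence):
    """Turn a list of ids back into a string, ignoring padding zeros."""
    return "".join(index_word.get(i, "?") for i in sequence if i != 0)

print(decode([1, 8, 5]))         # ids from the "Ahn" example
print(decode([16, 1, 4, 9, 0]))  # a padded sequence
```

With the real tokenizer, the same lookup is available through `tokenizer.index_word`. Since id 0 is reserved for padding, an embedding layer fed with these ids needs room for `len(tokenizer.word_index) + 1` distinct values.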

In [36]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters

In [47]:
# texts_to_sequences returns one sequence per input text; take [0] to store a flat list per name
df['encoded'] = df['name'].apply(lambda x: tokenizer.texts_to_sequences([x])[0])

In [50]:
df.head()

Unnamed: 0,origin,name,encoded
0,Korean,ahn,"[1, 8, 5]"
1,Korean,baik,"[16, 1, 4, 9]"
2,Korean,bang,"[16, 1, 5, 18]"
3,Korean,byon,"[16, 17, 2, 5]"
4,Korean,cha,"[19, 8, 1]"


In [48]:
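train_size = df.shape[0] * 90 // 100
# from_tensor_slices raises a ValueError on a DataFrame holding strings and ragged lists, so build a padded integer array first
padded = keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(df['name']), padding='post')
dataset = tf.data.Dataset.from_tensor_slices(padded[:train_size])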
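
Note that the rows are grouped by origin (the first rows are all Korean), so taking the first 90% of the rows as the training set would leave some origins out of training entirely. A minimal sketch of a stratified split in plain Python follows; `stratified_split` is a hypothetical helper and the labels are toy data grouped like the real frame. The already-imported `train_test_split` from scikit-learn offers the same idea through its `stratify` argument.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_percent=90, seed=42):
    """Return (train_idx, test_idx) so each label is split in roughly the same ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)  # shuffle within each label
        cut = len(indices) * train_percent // 100
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

labels = ["Korean"] * 10 + ["Japanese"] * 20  # toy data, grouped by origin
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The returned index lists can then be used to select the matching rows of the padded inputs and of the label array.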

In [42]:
encoded

array([0, 7, 4, ..., 8, 3, 6])

In [None]:
def predict_origin(name):
  assert isinstance(name, str)
  # do something with the model
  # do something with model output
  the_origin = None
  return the_origin
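
One possible shape for the finished function is sketched below. It is not the expected solution: to stay runnable without a trained model, the encoder and the model are passed in as arguments, and `toy_encode`, `toy_model` and the three-item `categories` list are hypothetical stand-ins. With a real Keras model, the scores would instead come from calling `model.predict` on the padded, encoded name.

```python
def predict_origin(name, encode, predict_proba, categories):
    """Return the most probable origin for `name`.

    encode: maps a string to a list of integer ids (hypothetical helper).
    predict_proba: maps that list to one score per category (stand-in for the model).
    categories: origin names, in the same order as the model outputs.
    """
    assert isinstance(name, str)
    # Apply the same preprocessing as at training time (lowercasing, encoding).
    scores = predict_proba(encode(name.lower()))
    # Index of the highest score is the predicted class.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return categories[best]

# Toy stand-ins so the sketch runs without a trained model.
categories = ["Arabic", "Japanese", "Korean"]

def toy_encode(text):
    return [ord(ch) for ch in text]

def toy_model(ids):
    # Pretend short names look Korean and longer ones Japanese.
    return [0.1, 0.2, 0.7] if len(ids) <= 3 else [0.1, 0.8, 0.1]

print(predict_origin("Ahn", toy_encode, toy_model, categories))
```

The important design point is that the function must repeat the training-time preprocessing exactly and use the same class ordering as the labels the model was fitted on.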