# Homework: classify the origin of names using a character-level RNN

In this homework we will use an RNN-based model to perform classification. The goal is threefold:

1. Get hands-on practice with the preprocessing needed for text classification, from A to Z. No preprocessing is done for you!
2. Use embeddings and RNNs in conjunction at the character level to perform classification.
3. Write a function that takes as input a string, and outputs the name of the predicted class.

That said, here are some guidelines to help you through all the steps:

1. Figure out the number of classes, and map the classes to integers (or one-hot vectors). This is needed for fitting the model and training it to do classification.
2. Use the keras tokenizer at the character level to tokenize your input into integer sequences.
3. Pad your sequences using the keras preprocessing tools.
4. Build a model that uses, minimally, an embedding layer, an RNN (of your choice) and a dense layer to output the logits or probabilities for the target classes (name origins).
5. Fit the model and evaluate on the test set.
6. Write a function that takes a string as input and predicts the origin (as its original string value).

In [None]:
%tensorflow_version 2.x
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
import tensorflow as tf

TensorFlow 2.x selected.


In [None]:
# Download the data
!wget https://download.pytorch.org/tutorial/data.zip
!unzip data.zip

--2020-03-26 02:24:10--  https://download.pytorch.org/tutorial/data.zip
Resolving download.pytorch.org (download.pytorch.org)... 54.192.151.68, 54.192.151.109, 54.192.151.98, ...
Connecting to download.pytorch.org (download.pytorch.org)|54.192.151.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2882130 (2.7M) [application/zip]
Saving to: ‘data.zip’


2020-03-26 02:24:10 (86.9 MB/s) - ‘data.zip’ saved [2882130/2882130]

Archive:  data.zip
   creating: data/
  inflating: data/eng-fra.txt        
   creating: data/names/
  inflating: data/names/Arabic.txt   
  inflating: data/names/Chinese.txt  
  inflating: data/names/Czech.txt    
  inflating: data/names/Dutch.txt    
  inflating: data/names/English.txt  
  inflating: data/names/French.txt   
  inflating: data/names/German.txt   
  inflating: data/names/Greek.txt    
  inflating: data/names/Irish.txt    
  inflating: data/names/Italian.txt  
  inflating: data/names/Japanese.txt  
  inflating: data/names/Kore

In [None]:

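data = []
for filename in glob('data/names/*.txt'):
  # The file name without its extension is the class label, e.g. 'Russian'
  origin = filename.split('/')[-1].split('.txt')[0]
  # Read as UTF-8 and close the file once all names are collected
  with open(filename, encoding='utf-8') as f:
    for name in f:
      data.append((name.strip(), origin))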

names, origins = zip(*data)
names_train, names_test, origins_train, origins_test = train_test_split(names, origins, test_size=0.25, shuffle=True, random_state=123)

# Let's look at the data

In [None]:
for name, origin in zip(names_train[:20], origins_train[:20]):
  print(name.ljust(20), origin)

Adashik              Russian
Farina               Italian
Pirumov              Russian
Ridge                English
Babyuk               Russian
Monet                French
Ukhabin              Russian
Agaltsov             Russian
Marfelev             Russian
Evelson              Russian
Gulko                Russian
Finyagin             Russian
Rogatko              Russian
Albani               Italian
Colombo              Italian
Katoaka              Japanese
Nowak                Czech
Nahas                Arabic
Koury                Arabic
Pakholkov            Russian
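
Eleven of the twenty names printed above are Russian, so it may be worth checking how balanced the classes are before reading too much into an accuracy figure. The sketch below counts labels on a small made-up list (`toy_origins` is illustrative, not the real data); in the notebook the same idea would presumably be applied to `origins_train`.

```python
from collections import Counter

# Toy labels for illustration only; not taken from the dataset
toy_origins = ['Russian', 'Italian', 'Russian', 'English', 'Russian', 'French']

counts = Counter(toy_origins)
for origin, n in counts.most_common():
    # Print each class, its count and its share of the data
    print(origin.ljust(20), n, '{:.0%}'.format(n / len(toy_origins)))

# Share of the most frequent class: the accuracy of always guessing it
majority_share = counts.most_common(1)[0][1] / len(toy_origins)
print('majority baseline:', majority_share)
```

The majority share is a useful reference point: a classifier only adds value to the extent that it beats it.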


In [None]:
def predict_origin(name):
  assert isinstance(name, str)
  # do something with the model
  # do something with model output
  the_origin = None
  return the_origin

In [None]:
%tensorflow_version 2.x
import tensorflow as tf
from tensorflow import keras

**Task 1:**<br>
Figure out the number of classes, and map the classes to integers (or one-hot vectors). This is needed for fitting the model and training it to do classification.

In [None]:
len(set(origins)), len(origins)

(18, 20074)

In [None]:
origins_train[:7]

['Russian', 'Italian', 'Russian', 'English', 'Russian', 'French', 'Russian']

In [None]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(origins_train)
y_train = le.transform(origins_train)
y_test = le.transform(origins_test)

y_train[:10], y_test[:10]

(array([14,  9, 14,  4, 14,  5, 14, 14, 14, 14]),
 array([14,  0,  3, 16,  4,  0, 14, 14, 14, 14]))

In [None]:
le.classes_

array(['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French',
       'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean',
       'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish',
       'Vietnamese'], dtype='<U10')
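
The cell above uses integer labels. For the one-hot alternative mentioned in the task, here is a minimal pure-Python sketch of both mappings. The helper names `build_label_maps` and `one_hot` are hypothetical, and the classes are sorted alphabetically to mirror the ordering seen in `le.classes_`.

```python
def build_label_maps(labels):
    """Return the sorted distinct labels and a label -> integer mapping."""
    classes = sorted(set(labels))
    label_to_id = {label: i for i, label in enumerate(classes)}
    return classes, label_to_id


def one_hot(ids, num_classes):
    """Turn each integer id into a vector with a single 1 at that position."""
    return [[1 if j == i else 0 for j in range(num_classes)] for i in ids]


# Toy labels for illustration only
toy_origins = ['Russian', 'Italian', 'Russian', 'English']
classes, label_to_id = build_label_maps(toy_origins)
ids = [label_to_id[o] for o in toy_origins]
vectors = one_hot(ids, len(classes))

print(classes)  # distinct labels in sorted order
print(ids)      # one integer per name
print(vectors)  # one row per name, one column per class
```

The choice affects the loss: integer ids pair with `SparseCategoricalCrossentropy` (as used later in this notebook), while one-hot vectors pair with `CategoricalCrossentropy`. In Keras, `tf.keras.utils.to_categorical` performs the one-hot step.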

**Task 2:**<br>
Use the keras tokenizer at the character level to tokenize your input into integer sequences.

In [None]:
encoder = tf.keras.preprocessing.text.Tokenizer(num_words=None, 
                                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                   lower=True, 
                                   split=' ', 
                                   char_level=True,
                                   oov_token=None, 
                                   document_count=0)

In [None]:
encoder.fit_on_texts(names_train)
sequences = encoder.texts_to_sequences(names_train)
for i in range(5):
  print('{} ---> {}'.format(names_train[i], sequences[i]))

Adashik ---> [1, 15, 1, 7, 8, 4, 9]
Farina ---> [21, 1, 6, 4, 5, 1]
Pirumov ---> [22, 4, 6, 13, 14, 2, 11]
Ridge ---> [6, 4, 15, 18, 3]
Babyuk ---> [16, 1, 16, 17, 13, 9]


In [None]:
sequences_test = encoder.texts_to_sequences(names_test)
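
To make explicit what character-level tokenization does, here is a simplified pure-Python sketch: it lowercases the text, ranks characters by frequency, and assigns indices starting at 1 so that 0 stays free for padding. It is an approximation of the Keras tokenizer (no `filters` handling, for instance), and `fit_char_index` and `texts_to_ids` are hypothetical names.

```python
from collections import Counter


def fit_char_index(texts):
    """Assign index 1 to the most frequent character, 2 to the next, etc."""
    counts = Counter(ch for text in texts for ch in text.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(texts, char_index):
    """Encode each text as a list of indices, skipping unseen characters."""
    return [[char_index[ch] for ch in text.lower() if ch in char_index]
            for text in texts]


# Toy names for illustration only
toy_names = ['Anna', 'Nina']
char_index = fit_char_index(toy_names)
encoded = texts_to_ids(['Anna', 'Ian', 'Bo'], char_index)

print(char_index)
print(encoded)
```

Note that the index is fitted on the training names only and then reused for the test names, as in the cells above. Characters that never appeared during fitting are dropped here, which is why `'Bo'` encodes to an empty list.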

**Task 3:** <br>
Pad your sequences using the keras preprocessing tools.

In [None]:
sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding='post')
sequences_test = tf.keras.preprocessing.sequence.pad_sequences(sequences_test, padding='post')
for i in range(5):
  print('{} ---> {}'.format(names_train[i], sequences[i]))

Adashik ---> [ 1 15  1  7  8  4  9  0  0  0  0  0  0  0  0  0  0  0  0  0]
Farina ---> [21  1  6  4  5  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
Pirumov ---> [22  4  6 13 14  2 11  0  0  0  0  0  0  0  0  0  0  0  0  0]
Ridge ---> [ 6  4 15 18  3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
Babyuk ---> [16  1 16 17 13  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
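
Post-padding as shown above amounts to appending zeros until every sequence reaches a common length. The sketch below illustrates this in pure Python; `pad_post` is a hypothetical helper, and unlike the Keras default it truncates over-long sequences from the end.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Right-pad (and, if needed, right-truncate) sequences to maxlen."""
    if maxlen is None:
        # Default to the longest sequence in this batch
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])  # drop anything beyond maxlen
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


# The first and fourth training sequences from the output above
toy = [[1, 15, 1, 7, 8, 4, 9], [6, 4, 15, 18, 3]]
same_length = pad_post(toy)
fixed_length = pad_post(toy, maxlen=6)

print(same_length)
print(fixed_length)
```

When `maxlen` is left unset, the target length depends on the batch being padded, so the training and test arrays can end up with different widths. Passing the training length explicitly is one way to keep them aligned.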


**Task 4 & 5** :<br>
- Build a model that uses, minimally, an embedding layer, an RNN (of your choice) and a dense layer to output the logits or probabilities for the target classes (name origins). <br>
- Fit the model and evaluate on the test set.

In [None]:
embedding_input_dim = max(encoder.index_word) + 1
embedding_output_dim = 32

In [None]:
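model = tf.keras.models.Sequential(layers=[
  # Map each character index to a dense vector
  tf.keras.layers.Embedding(input_dim=embedding_input_dim,
                            output_dim=embedding_output_dim),
  # Summarize the character sequence into a single 64-dimensional vector
  tf.keras.layers.LSTM(units=64),
  tf.keras.layers.Dense(23, activation='tanh'),
  # One output probability per name origin (18 classes)
  tf.keras.layers.Dense(len(le.classes_), activation='softmax'),
])

# The labels are integers, so use the sparse variant of cross-entropy
model.compile(optimizer='adam', metrics=['accuracy'],
              loss=tf.keras.losses.SparseCategoricalCrossentropy())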

In [None]:
history = model.fit(np.array(sequences),y_train, epochs=20)

**Evaluation**

In [None]:
model.evaluate(sequences_test, y_test)



[0.7233220070547068, 0.79537755]

Accuracy on the test set: about **80%** (0.795), with a loss of 0.72.
Not bad!
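
The two numbers returned by `model.evaluate` are the loss followed by the accuracy metric. As a rough illustration of what they measure, the sketch below computes both on a made-up batch of softmax outputs. It ignores details such as the numerical clipping Keras applies, and all names and values in it are illustrative.

```python
import math


def sparse_categorical_crossentropy(probs, labels):
    """Mean of -log(probability assigned to the true class)."""
    return sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def accuracy(probs, labels):
    """Fraction of rows whose most probable class is the true class."""
    hits = sum(1 for p, y in zip(probs, labels) if p.index(max(p)) == y)
    return hits / len(labels)


# Toy softmax outputs for three names and three classes
toy_probs = [[0.7, 0.2, 0.1],
             [0.1, 0.8, 0.1],
             [0.3, 0.4, 0.3]]
toy_labels = [0, 1, 2]  # integer class ids, as produced by a label encoder

loss = sparse_categorical_crossentropy(toy_probs, toy_labels)
acc = accuracy(toy_probs, toy_labels)
print(loss, acc)
```

The third row is misclassified and contributes the largest term to the loss, which shows how the loss penalizes low confidence in the true class even when accuracy only counts hits and misses.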

**Task 6** :<br>
Write a function that takes a string as input and predicts the origin (as its original string value)

In [None]:
def predict_origin(name):
  assert isinstance(name, str)

  # Tokenize the whole name as a single text. Passing the bare string would
  # treat every character as its own text and fail on unknown characters.
  sequence = encoder.texts_to_sequences([name])
  # Pad to the same length as the training sequences
  sequence = tf.keras.preprocessing.sequence.pad_sequences(
      sequence, maxlen=sequences.shape[1], padding='post', truncating='post')

  # The model outputs softmax probabilities; pick the most probable class
  p = model.predict(sequence)
  label = np.argmax(p)
  the_origin = le.inverse_transform([label])[0]
  return the_origin

names_list = ['Micheal', 'YU', 'Julio', 'ahmad','stillitano']
for x in names_list:
  print('name -->{},   Model_prediction --> {}'.format(x, predict_origin(x)))  

name -->Micheal,   Model_prediction --> English
name -->YU,   Model_prediction --> Chinese
name -->Julio,   Model_prediction --> Russian
name -->ahmad,   Model_prediction --> Arabic
name -->stillitano,   Model_prediction --> Italian
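
A single predicted label hides how close the runner-up classes were, which can be informative for a surprising result such as `Julio` above. The sketch below ranks classes by probability. `top_k_origins` is a hypothetical helper and the probabilities are made up; in the notebook it would presumably be fed one row of `model.predict` output together with `le.classes_`.

```python
def top_k_origins(probs, classes, k=3):
    """Return the k (class, probability) pairs with the highest probability."""
    ranked = sorted(zip(classes, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy class names and softmax output for illustration only
toy_classes = ['Italian', 'Russian', 'Spanish']
toy_probs = [0.30, 0.45, 0.25]

best_two = top_k_origins(toy_probs, toy_classes, k=2)
print(best_two)
```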
