<b>Summary: </b>Build a character-level RNN classifier that predicts the language of origin of a given name.

<b>Datasets: </b>https://download.pytorch.org/tutorial/data.zip

<b>Ref: </b>https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

In [1]:
import pandas as pd
import glob

We will use the glob library to list all the .txt files in the file_path directory.

In [2]:
file_path =  'datasets/name_by_language/names'
allFiles = glob.glob(file_path + "/*.txt")

In [3]:
allFiles

['datasets/name_by_language/names/Czech.txt',
 'datasets/name_by_language/names/German.txt',
 'datasets/name_by_language/names/Arabic.txt',
 'datasets/name_by_language/names/Japanese.txt',
 'datasets/name_by_language/names/Chinese.txt',
 'datasets/name_by_language/names/Vietnamese.txt',
 'datasets/name_by_language/names/French.txt',
 'datasets/name_by_language/names/Irish.txt',
 'datasets/name_by_language/names/Spanish.txt',
 'datasets/name_by_language/names/Greek.txt',
 'datasets/name_by_language/names/Italian.txt',
 'datasets/name_by_language/names/Scottish.txt',
 'datasets/name_by_language/names/Dutch.txt',
 'datasets/name_by_language/names/Korean.txt',
 'datasets/name_by_language/names/Polish.txt']

In [4]:
names_data = pd.DataFrame()
list_ = []

* We read the files one at a time.
* The data in the txt files is not comma-separated; each line holds one name.<br>
    The code passes sep = '/n', a string that never occurs in the names (the newline character would be '\n'), so every line is read whole into a single column. error_bad_lines=False skips any malformed lines; from pandas 1.3 on, this option is replaced by on_bad_lines='skip'.
* We use the **split** function to build the target label for every name.<br>
    The language name is extracted from the filename.
* This gives one DataFrame per language, which we then merge into a single DataFrame.<br>
    We do this by collecting the DataFrames in a list and concatenating them.

In [5]:
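for file_ in allFiles:
    # '/n' never occurs in the names, so each line is read whole into the 'Name' column
    df = pd.read_csv(file_,
                     sep='/n',
                     names=['Name'],
                     engine="python",
                     on_bad_lines='skip')  # replaces error_bad_lines=False (pandas >= 1.3)

    # 'datasets/.../Czech.txt' -> 'Czech'
    df['language'] = str(file_.split('/')[-1].split('.')[0])

    list_.append(df)

# Concatenate once, after all files have been read
names_data = pd.concat(list_)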
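
The same loading logic can also be sketched without read_csv, which sidesteps the separator question entirely. The helper below is hypothetical (`load_names` is not part of this notebook); it assumes one name per line and UTF-8 encoded files, and takes the label from the file name, as the loop above does.

```python
from pathlib import Path
import tempfile

def load_names(directory):
    """Return a list of (name, language) pairs from <directory>/*.txt.

    Hypothetical helper: assumes one name per line and UTF-8 encoding.
    """
    rows = []
    # sorted() makes the file order deterministic (glob order is not guaranteed)
    for path in sorted(Path(directory).glob("*.txt")):
        language = path.stem  # 'Czech.txt' -> 'Czech'
        for line in path.read_text(encoding="utf-8").splitlines():
            name = line.strip()
            if name:  # skip blank lines
                rows.append((name, language))
    return rows

# Tiny demonstration with two made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "Korean.txt").write_text("Jeong\nYou\n", encoding="utf-8")
    Path(tmp, "Dutch.txt").write_text("Peerenboom\n\n", encoding="utf-8")
    demo_rows = load_names(tmp)

print(demo_rows)
```

Passing the resulting pairs to `pd.DataFrame(rows, columns=['Name', 'language'])` would give the same two-column layout as names_data.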

In [14]:
names_data.sample(10)

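|     | Name       | language |
|-----|------------|----------|
| 29  | Jeong      | Korean   |
| 83  | Fenyo      | Czech    |
| 47  | Guo        | Chinese  |
| 173 | Peerenboom | Dutch    |
| 89  | You        | Korean   |
| 113 | Gallego    | Spanish  |
| 374 | Kurzmann   | German   |
| 449 | Muhlfeld   | German   |
| 221 | Rocha      | Spanish  |
| 52  | Agani      | Italian  |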


List all languages present in the dataset

In [7]:
languages = names_data['language'].unique()

In [8]:
languages

array(['Czech', 'German', 'Arabic', 'Japanese', 'Chinese', 'Vietnamese',
       'French', 'Irish', 'Spanish', 'Greek', 'Italian', 'Scottish',
       'Dutch', 'Korean', 'Polish'], dtype=object)

In [9]:
len(languages)

15

#### Remove the duplicate names
The drop_duplicates() function in pandas removes rows that are exact duplicates, i.e. the same name with the same language.

In [10]:
len(names_data)

4994

In [11]:
names_data = names_data.drop_duplicates()

In [12]:
len(names_data)

4931
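
A small sketch on made-up rows shows what drop_duplicates() does and does not remove. Called without arguments it compares all columns, so a name that occurs in two languages is kept under both; the `subset="Name"` variant is shown only for contrast and is not what this notebook uses.

```python
import pandas as pd

# Made-up rows: 'Lee' appears twice as Korean and once as Chinese
toy = pd.DataFrame({
    "Name": ["Lee", "Lee", "Lee", "Guo"],
    "language": ["Korean", "Chinese", "Korean", "Chinese"],
})

exact = toy.drop_duplicates()                 # drops only the repeated (Lee, Korean) row
by_name = toy.drop_duplicates(subset="Name")  # keeps the first row for each name

print(len(toy), len(exact), len(by_name))
```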

#### Getting all possible letters
This will allow us to create a one-hot-encoded tensor for each name.

In [12]:
import string
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)
all_letters

"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,;'"

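#### Function to convert a name to a tensor
This effectively performs one-hot encoding: each letter becomes a vector of length n_letters with a single 1 at the letter's index in all_letters.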

In [13]:
import torch

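def name_to_tensor(name):
    # Shape: (name length, batch size of 1, n_letters)
    name_in_tensor = torch.zeros(len(name), 1, n_letters)
    for i, letter in enumerate(name):
        index = all_letters.find(letter)
        # find() returns -1 for characters outside all_letters (e.g. 'ó');
        # leave that row as zeros instead of setting the last position
        if index >= 0:
            name_in_tensor[i][0][index] = 1

    return name_in_tensor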
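
Names such as 'Abelló' contain letters that are not in all_letters, and `str.find` returns -1 for them. The referenced tutorial normalizes names to plain ASCII before encoding them; a minimal sketch of that idea, using only the standard library, could look like this. The function name `unicode_to_ascii` is an assumption made for this example and is not defined elsewhere in the notebook.

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'"

def unicode_to_ascii(name):
    """Strip accents and drop any character that is not in all_letters."""
    # NFD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn" and ch in all_letters
    )

cleaned = unicode_to_ascii("Abell\u00f3")   # 'Abelló' with an accented o
missing = all_letters.find("\u00f3")        # the accented letter itself is not in all_letters
print(cleaned, missing)
```

Applying such a step to the Name column before calling name_to_tensor would ensure that every character has a valid index.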

#### Check what names may look like when converted to tensors
Lowercase 'a' is the first element (index 0), while uppercase 'A' sits at index 26, after the 26 lowercase letters. The space character is at index 52.

In [14]:
name_to_tensor('a')

tensor([[[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.]]])

In [15]:
name_to_tensor('a A')

tensor([[[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.]],

        [[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
           0.,  0.]],

        [[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
           0.,  0.]]])

#### Define the RNN
The model is explained in the reference tutorial:
https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html#creating-the-network

To run a step of this network we need to pass an input (in our case, the Tensor for the current letter) and a previous hidden state (which we initialize as zeros at first). We’ll get back the output (the log-probability of each language, since the last layer is LogSoftmax) and a next hidden state (which we keep for the next step).

* the <b>i2h</b> layer is an input-to-hidden layer while <b>i2o</b> is input-to-output <br />
* <b>combined</b> is not a layer but the concatenation of the current input letter with the previous hidden state; both layers take it as input
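
Written out, one step of the network described above looks as follows. The symbols are introduced here for illustration only: x_t is the one-hot vector of the current letter, h_{t-1} the previous hidden state, [ ; ] denotes concatenation, and W and b are the weights and biases of the i2h and i2o layers. This reading assumes both layers are plain linear maps with no nonlinearity applied to the hidden state, which matches the class definition that follows.

```latex
\begin{aligned}
c_t &= [\,x_t \,;\, h_{t-1}\,] \\
h_t &= W_{i2h}\, c_t + b_{i2h} \\
o_t &= \operatorname{log\,softmax}\!\left(W_{i2o}\, c_t + b_{i2o}\right)
\end{aligned}
```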


In [16]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)    
        output = self.i2o(combined)    
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
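
To make the tensor shapes concrete, here is a stand-alone sketch of a single step using the sizes from this notebook (57 letters, 128 hidden units, 15 languages). It rebuilds the two linear layers outside the class purely for illustration; the weights are random, so only the shapes and the normalization are meaningful.

```python
import torch
import torch.nn as nn

n_letters, n_hidden, n_languages = 57, 128, 15  # sizes used in this notebook

# Stand-alone layers mirroring i2h and i2o from the class above
i2h = nn.Linear(n_letters + n_hidden, n_hidden)
i2o = nn.Linear(n_letters + n_hidden, n_languages)
log_softmax = nn.LogSoftmax(dim=1)

letter = torch.zeros(1, n_letters)
letter[0][0] = 1                      # one-hot vector for 'a'
hidden = torch.zeros(1, n_hidden)     # initial hidden state

combined = torch.cat((letter, hidden), 1)   # shape (1, 185)
next_hidden = i2h(combined)                 # shape (1, 128)
output = log_softmax(i2o(combined))         # shape (1, 15)

print(combined.shape, next_hidden.shape, output.shape)
total_probability = output.exp().sum().item()  # close to 1: outputs are log-probabilities
print(total_probability)
```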


#### Create an RNN
We specify the size of the hidden state (128 units) along with the output size (one output per language)

In [17]:
n_hidden = 128
n_languages= len(languages)

rnn = RNN(n_letters, n_hidden, output_size = n_languages)
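
As a quick sanity check on the model size, the number of trainable parameters follows directly from the layer shapes. The sketch below is plain arithmetic under the sizes used here (57 letters, 128 hidden units, 15 languages); `linear_params` is a helper introduced only for this example.

```python
n_letters, n_hidden, n_languages = 57, 128, 15

def linear_params(n_in, n_out):
    # weight matrix (n_out x n_in) plus one bias per output
    return n_in * n_out + n_out

i2h_params = linear_params(n_letters + n_hidden, n_hidden)
i2o_params = linear_params(n_letters + n_hidden, n_languages)
total_params = i2h_params + i2o_params   # LogSoftmax has no parameters
print(i2h_params, i2o_params, total_params)
```

The same figure can be cross-checked on the real model with `sum(p.numel() for p in rnn.parameters())`.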

#### Define parameters for training the model
* we perform 100,000 iterations, each on one randomly drawn name; with 4,931 names in the dataset after removing duplicates, each name is drawn about 20 times on average
* the loss function is negative log likelihood loss
* we use a learning rate of 0.005, which stays constant throughout training
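
For a single example, the negative log likelihood loss is simply the negated log-probability that the model assigns to the correct class. The following dependency-free sketch illustrates this with made-up scores; `log_softmax` and `nll_loss` are re-implemented here by hand for illustration and are not the PyTorch functions themselves.

```python
import math

def log_softmax(scores):
    # subtract the maximum for numerical stability
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll_loss(log_probs, target_index):
    # loss for one example: minus the log-probability of the true class
    return -log_probs[target_index]

# All 15 languages equally likely: the model is guessing
uniform = log_softmax([0.0] * 15)
chance_loss = nll_loss(uniform, 3)

# A much higher score for the true class (index 3) gives a small loss
confident = log_softmax([5.0 if i == 3 else 0.0 for i in range(15)])
confident_loss = nll_loss(confident, 3)

print(chance_loss, confident_loss)
```

The chance-level value, ln 15 ≈ 2.71, is a useful reference point when reading the losses printed during training.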

In [18]:
iterations = 100000
criterion = nn.NLLLoss()
learning_rate = 0.005

#### Convert a prediction to the string label for language
The output holds one log-probability per language; we take the index of the largest value and look up the language name at that index

In [19]:
def output_to_language(output):
    # topk(1) returns the largest value and its index along the last dimension
    top_n, top_index = output.topk(1)
    pred_i = top_index[0].item()
    pred = languages[pred_i]

    return pred
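
A small stand-alone illustration of the topk call, using a made-up output for three languages. The label list and the values are invented for this example.

```python
import torch

labels = ['Czech', 'German', 'Arabic']           # shortened, illustrative label list
log_probs = torch.tensor([[-2.3, -0.2, -1.9]])   # made-up output for one name

# topk(1) returns the largest value and its index for each row
top_value, top_index = log_probs.topk(1)
predicted = labels[top_index[0].item()]
print(top_value, top_index, predicted)
```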

#### Import the random module
We will be picking names randomly from our dataset for which we will use the random module

In [20]:
import random

#### Perform the training 
* we pick a name randomly from the dataset and convert it to a tensor
* we get the actual label for that name
* the hidden state is initialized to zeros and the gradients of the RNN are reset to zero
* for each character in the name:
 * we feed the character and the current hidden state to the RNN, so the output after the last character depends on the whole name
* we calculate the loss from the final output and the actual language
* we perform backpropagation to compute the gradient of the loss with respect to each weight
* we update each parameter by subtracting its gradient multiplied by the learning rate (the learning rate keeps each step small)

Finally, every 5000th iteration we print the name, the prediction and the actual label along with the loss; every 1000 iterations we also record the average loss for plotting
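
The parameter update used in the training loop is plain stochastic gradient descent: each parameter moves against its gradient, scaled by the learning rate. A toy sketch on a one-parameter loss shows the rule in isolation; the loss function and the learning rate of 0.1 are chosen only for illustration.

```python
def sgd_step(w, grad, learning_rate):
    # the same rule the training loop applies to every parameter
    return w - learning_rate * grad

w = 0.0
learning_rate = 0.1   # illustrative value, larger than the notebook's 0.005
for _ in range(100):
    grad = 2 * (w - 3.0)   # derivative of the toy loss (w - 3)^2
    w = sgd_step(w, grad, learning_rate)

print(w)  # approaches 3, the minimum of the toy loss
```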

In [21]:
# To Keep track of losses for plotting
current_loss = 0
all_losses = []

In [None]:
for iteration in range(1, iterations + 1):

    i = random.randint(0, len(names_data) - 1)

    # Column 0 holds the name, column 1 the language
    name = names_data.iloc[i, 0]
    name_in_tensor = name_to_tensor(name)

    language = names_data.iloc[i, 1]
    language_in_tensor = torch.tensor([list(languages).index(language)], dtype=torch.long)

    hidden = rnn.initHidden()
    rnn.zero_grad()

    # Feed the name one letter at a time; the last output is the prediction
    for j in range(name_in_tensor.size()[0]):
        output, hidden = rnn(name_in_tensor[j], hidden)

    loss = criterion(output, language_in_tensor)
    loss.backward()

    current_loss += loss.item()

    # Plain SGD step: p <- p - learning_rate * grad
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    if iteration % 5000 == 0:

        pred = output_to_language(output)

        correct = '✓' if pred == language else '✗ (%s)' % language
        print('iters- %d %d%% (%s) Name- %s Language- %s %s' %
              (iteration, iteration / iterations * 100, loss.item(), name, pred, correct))

    # Record the average loss over the last 1000 iterations for plotting
    if iteration % 1000 == 0:
        all_losses.append(current_loss / 1000)
        current_loss = 0

iters- 5000 5% (3.724365711212158) Name- Macleod Language- German ✗ (Scottish)
iters- 10000 10% (2.8470005989074707) Name- Favager Language- German ✗ (French)
iters- 15000 15% (1.484479308128357) Name- Rosales Language- Spanish ✓
iters- 20000 20% (1.1251862049102783) Name- Brune Language- German ✓
iters- 25000 25% (0.0027055158279836178) Name- Tsukamoto Language- Japanese ✓
iters- 30000 30% (0.25187188386917114) Name- Ansaldi Language- Italian ✓
iters- 35000 35% (0.10041199624538422) Name- Mazuka Language- Japanese ✓
iters- 40000 40% (0.06594379246234894) Name- Alexandropoulos Language- Greek ✓
iters- 45000 45% (1.5990803241729736) Name- Blanco Language- Italian ✗ (Spanish)
iters- 50000 50% (1.411132574081421) Name- Abelló Language- Spanish ✗ (Italian)
iters- 55000 55% (0.7573774456977844) Name- Leitz Language- German ✓
iters- 60000 60% (0.34390515089035034) Name- Di caprio Language- Italian ✓
iters- 65000 65% (1.3341048955917358) Name- Cearbhall Language- German ✗ (Irish)
iters- 70000

In [None]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)

#### Perform a test using 10,000 randomly selected names
We perform steps similar to those used in training. 
* pick a name at random and convert it to a tensor
* convert the label to a tensor
* initialize the RNN's hidden state
* feed the name to the RNN one character at a time and take the prediction from the output after the last character

After this, we create lists with the real and predicted values of language. We will use these to plot a confusion matrix and compute the accuracy. Note that the names are drawn from the same dataset that was used for training, so the accuracy is measured on names the model has already seen.

In [None]:
n_confusion = 10000

prediction = []
actual = []

for _ in range(n_confusion):

    i = random.randint(0, len(names_data) - 1)

    # Column 0 holds the name, column 1 the language
    name = names_data.iloc[i, 0]
    name_in_tensor = name_to_tensor(name)

    language = names_data.iloc[i, 1]
    language_in_tensor = torch.tensor([list(languages).index(language)], dtype=torch.long)

    hidden = rnn.initHidden()

    # No gradients are needed for evaluation
    with torch.no_grad():
        for j in range(name_in_tensor.size()[0]):
            output, hidden = rnn(name_in_tensor[j], hidden)

    pred = output_to_language(output)

    prediction.append(pred)
    actual.append(language)

#### Install pandas_ml
pandas_ml provides the ConfusionMatrix class used below; install it (for example with pip install pandas_ml) before running the import.

In [None]:
from pandas_ml import ConfusionMatrix
import numpy as np

In [None]:
actual,prediction

In [None]:
confusion_matrix = ConfusionMatrix(actual, prediction)
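
If pandas_ml is not available in your environment, the same information can be tallied with the standard library. The sketch below counts (actual, predicted) pairs and derives a per-language hit rate; the function names and the demo labels are invented for this example and are not part of pandas_ml.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

def per_class_recall(actual, predicted):
    """Fraction of each actual class that was predicted correctly."""
    totals = Counter(actual)
    counts = confusion_counts(actual, predicted)
    return {label: counts[(label, label)] / totals[label] for label in totals}

# Made-up labels for illustration
demo_actual = ['Irish', 'Irish', 'Spanish', 'Spanish', 'Spanish', 'Greek']
demo_predicted = ['Irish', 'German', 'Spanish', 'Italian', 'Spanish', 'Greek']

counts = confusion_counts(demo_actual, demo_predicted)
recall = per_class_recall(demo_actual, demo_predicted)
print(counts[('Spanish', 'Italian')], recall)
```

With the lists built in the test loop, `confusion_counts(actual, prediction)` gives the number of names of each language that were assigned to each predicted language.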

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt

confusion_matrix.plot()

In [None]:
# Count the positions where the prediction matches the actual language
correct = sum(a == p for a, p in zip(actual, prediction))

print('Accuracy of this language classifier is ', correct / n_confusion)

In [None]:
print("Confusion matrix:\n%s" % confusion_matrix)