Machine Learning Final Project
=================================

This project is the first leg of effort to translate to and from colloquial Arabic. In this project file, using a dataset from [Tatoeba](https://tatoeba.org/eng/downloads), we accomplished a model to translate to and from Modern Standard Arabic in an effort to lay the groundwork for colloquial corpora.



## Imports and Intitializations ##

Below are imports and intitializations for the project including the device: which determines which device the algorithm will be run on.

In [1]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random
import pandas as pd

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Below we are creating a language class, it holds attributes like the number of words, but also methods that convert words to and from indices. 




In [2]:
SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

Since encoding with Arabic typically gets rid of the characters or augments text in a non-ideal fashion, we are just removing punctuation here.


In [3]:
import unicodedata as ud
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Standardize strings and remove punctuation

def normalizeString(s):
    s = s.lower().strip()
    s = re.sub(r"([?.!,؟'،])", r" ", s)
    s = re.sub(r'[" "]+', " ", s)
    return s.strip()

Here we read and preprocess the data we receive from the file. The reverse flag indicates Arabic to English if false and the opposite if true.




In [4]:
def readLines():
    lines = open('./ara.txt', encoding='UTF-8').read().strip().split('\n')
    num_examples = len(lines)
    pairs = [[w for w in l.split('\t')][:2] for l in lines[:num_examples]]
    data = pd.DataFrame(pairs, columns=["eng", "ar"])
    pairs = [[normalizeString(s) for s in l.split('\t')][:2] for l in lines]
    return data, pairs

def readLangs(reverse=True):
    data, pairs = readLines()

    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(data['eng'])
        output_lang = Lang(data['ar'])
    else:
        input_lang = Lang(data['ar'])
        output_lang = Lang(data['eng'])

    return input_lang, output_lang, pairs

The logic below prints test data to visualize what's loaded in the above step. It also prepares data for next step of model process.




In [5]:
def prepareData(lang1, lang2, reverse=True):
    input_lang, output_lang, pairs = readLangs(reverse)
    print("Read %s sentence pairs" % len(pairs))
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Number of words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs


input_lang, output_lang, pairs = prepareData('ar', 'eng')

Read 11584 sentence pairs
Number of words:
0                                                      Hi.
1                                                     Run!
2                                                    Help!
3                                                    Jump!
4                                                    Stop!
                               ...                        
11579    I'll be playing tennis with Tom this afternoon...
11580    The mobile phone you have dialed is either swi...
11581    A man touched down on the moon. A wall came do...
11582    Ladies and gentlemen, please stand for the nat...
11583    There are mothers and fathers who will lie awa...
Name: eng, Length: 11584, dtype: object 11982
0                                                  مرحبًا.
1                                                    اركض!
2                                                  النجدة!
3                                                    اقفز!
4                         

## Seq2Seq Modeling ##

With this Seq2Seq model (or encoder and decoder model), the encoder is designed for an input sequence to output a vector with logic, and the decoder reads that vector and produces the output sentence or "guess". A Seq2Seq model encoder encodes the "meaning" of the sentence unlike a single RNN. This fact is ideal for langauge translation.




### The Encoder ###

The encoder outputs the vector stated above alongside the hidden state, a variable used for the next word and iteration through the encoder.




In [6]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

### Decoder ###
As opposed to a simple decoder, we will use an attention decoder to decipher encoder vectors. With an attention decoder, the network is able to "attend to" specific portions of encoder returns and formulate a more context driven result. Below is logic and attention weights recommended by pytorch in an example they have given.


In [7]:
MAX_LENGTH = 11584

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

In [8]:
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

## Training the Model ##

Here, we train out model and keep track of outputs and hidden states. The decoder first receives the ``<SOS>`` token as first input na dhidden state of encoder as first hidden state. We move along in the training process by teacher forcing, decided by an if-statement below. Teacher forcing is using the actual target output as enxt input instead of value that the encoder produces. It should speed up model convergence.




In [9]:
teacher_forcing_ratio = 0.5


def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]

    else:
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length

These helper methods just aid the output produce estimated and elapsed time values.



In [10]:
import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

Below is a pretty simple iterator that loops through our training data and displays progress information on time and loss values.




In [11]:
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0 
    plot_loss_total = 0 

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)

## Evaluation ##

In the evaluation stage, logic is quite similar to training other than the lack of targets. Here we geed decoder predictions back to model in each step to evaluate accuracy. This portion also allows us to visualize tangible results.



In [12]:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]

In [13]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np

def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)

The method below allows results to be visualized by user.

In [14]:
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('original >', pair[0])
        print('correct =', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('prediction <', output_sentence[:-6])
        print('')

## Training and Evaluating ##

To mitigate long training times and lack of access to great hardware, we use the parameters below to evaluate the model. After an initial run, the first 3 lines may be commented out to further train the same model.



In [18]:
# hidden_size = 256
# encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
# attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)

trainIters(encoder1, attn_decoder1, 100000, print_every=5000)

1m 47s (- 34m 11s) (5000 5%) 4.5446
3m 22s (- 30m 23s) (10000 10%) 4.1693
5m 1s (- 28m 25s) (15000 15%) 3.8820
6m 39s (- 26m 39s) (20000 20%) 3.6218
8m 19s (- 24m 58s) (25000 25%) 3.3905
9m 59s (- 23m 18s) (30000 30%) 3.2034
11m 40s (- 21m 40s) (35000 35%) 2.9978
13m 23s (- 20m 5s) (40000 40%) 2.8222
15m 7s (- 18m 29s) (45000 45%) 2.6367
16m 53s (- 16m 53s) (50000 50%) 2.5064
18m 35s (- 15m 13s) (55000 55%) 2.3312
20m 17s (- 13m 31s) (60000 60%) 2.2024
21m 56s (- 11m 48s) (65000 65%) 2.0966
23m 39s (- 10m 8s) (70000 70%) 2.0333
25m 19s (- 8m 26s) (75000 75%) 1.8791
27m 8s (- 6m 47s) (80000 80%) 1.7753
29m 29s (- 5m 12s) (85000 85%) 1.7247
32m 8s (- 3m 34s) (90000 90%) 1.6205
34m 41s (- 1m 49s) (95000 95%) 1.5317
37m 17s (- 0m 0s) (100000 100%) 1.5143


Below are actual predictions by machine as indicated in print.

In [19]:
evaluateRandomly(encoder1, attn_decoder1)

original > صومطرة جزيرةٌ
correct = sumatra is an island
prediction < sumatra is an island

original > أُنظر إلى هذا
correct = look at this
prediction < look at this

original > يجب عليك الانضمام لي
correct = you must join me
prediction < you must join me

original > بإمكان هذا الطفل أن يعد إلى مئة مع أنه ما زال لديه أربع سنوات
correct = that child is only four but he can already count to 100
prediction < the only is is but is is is he is that he is

original > كل ما عليك فعله هو أن تسمع بعناية
correct = all you need to do is listen carefully
prediction < all all you need is is is to you

original > أنت لم تتغير مُطلقاً
correct = you haven t changed at all
prediction < you haven t changed all all all

original > لا تنس إخراج القمامة
correct = don t forget to take out the garbage
prediction < don t forget the card

original > أنا طالب
correct = i m a student
prediction < i m a student

original > أسعدتني رسالتك
correct = your letter made me happy
prediction < your letter made me

origina