# Assignment #04 - Sequential Models

Deep Learning / Fall 1399, Khatam University



---



**Please pay attention to these notes:**
<br><br>



- **Assignment Due:** <b><font color='red'>1399.11.20</font></b> 23:59:00
- If you need any additional information, please review the assignment page on the course website.
- The items you need to answer are highlighted in <font color="purple">**bold purple**</font> and the coding parts you need to implement are denoted by:
```
########################################
#     Put your implementation here     #
########################################
```
- We always recommend co-operation and discussion in groups for assignments. However, **each student has to finish all the questions by him/herself**. If our matching system identifies any sort of copying, you'll be responsible for consequences.
- Students who audit this course should submit their assignments like other students to be qualified for attending the rest of the sessions.
- If you have any questions about this assignment, feel free to drop us a line. You may also post your questions on the course Microsoft Teams channel.
- You must run this notebook on the Google Colab platform, because some of its dependencies rely on the Google Colab VM.
- You can double click on collapsed code cells to expand them.
- <b><font color='red'>When you are ready to submit, please follow the instructions at the end of this notebook.</font></b>


<br>



# Introduction

In this assignment we are going to see some sequential models in practice.

# 1. NER

**Named Entity Recognition (NER)** is the task of locating named entities mentioned in a given document. It can be framed as a simple sequence tagging task, in which the model must assign a class to every element of the sequence (in this case words or, to be more specific, tokens).

In this assignment, we are going to implement and train a model for NER. We are going to get familiar with embedding layers, learn a few practical tricks for implementing RNN models, and address an issue caused by the imbalanced dataset.

For this assignment, we are going to use the **CoNLL 2003** dataset. Let's first download the data:

In [None]:
#@title Download the dataset
from IPython.display import clear_output

!gdown https://drive.google.com/uc?id=1f4UfZdTnwQrgnJ_ppzk0vAEpQFEJDX9b
!unzip NER_data.zip
!rm NER_data.zip

clear_output()

print ("Done!")

Let's see a few lines of the training set file:

In [None]:
!head -n 20 "/content/NER_data/train.txt"

As you can see, each file is a collection of documents, each document is a collection of sentences, and each sentence is split into several lines, with each line representing a word and its corresponding info. Documents are separated by `-DOCSTART- -X- -X- O` lines and sentences are separated by blank lines (double newline characters). Each word comes with some additional information, but the parts we are interested in are the first part (the word itself) and the last part (the word's NER class). Let's parse and extract this information from the raw files:
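Before parsing the real files, it may help to see the layout on a tiny in-memory sample. The snippet below is only a sketch: the sample lines and their tags are illustrative assumptions (the downloaded files may use a different tagging scheme), and the middle columns are simply ignored.

```python
# A tiny in-memory sample in the CoNLL 2003 layout (illustrative, not read from the real files).
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER"""

sentences = []
for block in sample.strip().split("\n\n"):   # blank lines separate sentences
    if "-DOCSTART-" in block:                # document separator, carries no words
        continue
    pairs = []
    for line in block.split("\n"):
        parts = line.split()
        pairs.append((parts[0], parts[-1]))  # first column: word, last column: NER tag
    sentences.append(pairs)

print(sentences[0])
print(sentences[1])
```

Each sentence ends up as a list of `(word, tag)` tuples, which is the structure the rest of this part works with.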

In [None]:
#@title Parse raw data files

from tqdm.notebook import tqdm

def parse_input_file (file_path):
  parsed_docs = []
  with open (file_path, "r") as f:
    docs = f.read().strip().split("\n\n")
    for doc in docs:
      if "-DOCSTART-" in doc: continue
      parsed_doc = []
      words = doc.split("\n")
      for word in words:
        parts = word.split()
        parsed_doc.append((parts[0], parts[-1]))
      parsed_docs.append(parsed_doc)
  return parsed_docs


parsed_train= parse_input_file("/content/NER_data/train.txt")
parsed_test= parse_input_file("/content/NER_data/test.txt")
parsed_valid= parse_input_file("/content/NER_data/valid.txt")
  
print ("Done!")
  

Now let's take a look at parsed datasets:

In [None]:
print ("element 0 of parsed trainset:\n")
parsed_train[0]

As you can observe, each element in the parsed datasets is a list of tuples, and each tuple represents a word and its NER tag. Now, to feed the words and classes to a neural network, we need to map them to integer values. Let's create the mappings:

In [None]:
#@title Create mappings

import string

vocab = set()
classes = set()


for doc in parsed_train:
  for word, label in doc:
    vocab.add(word) #we could use python <set> datastructure
    classes.add(label)

word2id = {w:i for i, w in enumerate(vocab, 2)}
word2id["<PAD>"] = 0
word2id["<UNK>"] = 1

tag2id = {c:i for i,c in enumerate(classes)}
id2tag = {tag2id[c]:c for c in tag2id}

english_chars = string.printable

char2id = {c:i for i, c in enumerate(english_chars, 2)}
char2id["<PAD>"] = 0
char2id["<UNK>"] = 1

print ("Done!")
  

Note that we added two special tokens `"<PAD>"` and `"<UNK>"` to our mapping files. `"<UNK>"` is useful when we are dealing with out of vocabulary words, and `"<PAD>"` is going to be used later to make all input sequences in a batch have the same length.
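A minimal sketch of how the `"<UNK>"` fallback behaves, using a hypothetical three-word vocabulary (`toy_vocab` and `toy_word2id` are made-up names for illustration, not part of the notebook):

```python
# Hypothetical toy vocabulary; ids 0 and 1 are reserved for the special tokens.
toy_vocab = ["London", "is", "rainy"]
toy_word2id = {w: i for i, w in enumerate(toy_vocab, 2)}
toy_word2id["<PAD>"] = 0
toy_word2id["<UNK>"] = 1

sentence = ["London", "is", "foggy"]   # "foggy" was never seen in the toy vocabulary
ids = [toy_word2id.get(w, toy_word2id["<UNK>"]) for w in sentence]
print(ids)  # [2, 3, 1]
```

Any word missing from the mapping falls back to id 1, so unseen words at test time never raise a `KeyError`.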

Also, note that we have a mapping for characters as well as a mapping for words. This is because we are going to use a technique called **Character Embedding** in our implementation later.

Now let's tokenize (map) our parsed datasets using these mappings:

In [None]:
#@title Tokenize

def tokenize(parsed_data):

  unk_tok = word2id["<UNK>"]
  unk_char = char2id["<UNK>"]
  word_tokenized_data = [ [word2id.get(w, unk_tok) for w,l in s] for s in parsed_data]
  char_tokenized_data = [ [ [char2id.get(c, unk_char) for c in w] for w,l in s] for s in parsed_data]
  tags = [ [tag2id[l] for w,l in s] for s in parsed_data]

  return (word_tokenized_data,
          char_tokenized_data,
          tags)
  
tokenized_train = tokenize(parsed_train)
tokenized_test = tokenize(parsed_test)
tokenized_valid = tokenize(parsed_valid)

print ("Done!")

We created two different tokenized datasets: (1) a word-level tokenized dataset and (2) a char-level tokenized dataset. Let's take a look and see how they are constructed: 

In [None]:
word_tokenized_data, char_tokenized_data, tags = tokenized_train

print(f"Word level tokenized train dataset has {len(word_tokenized_data)} elements.")
print(f"Each element represents a sentence and is stored as a {type(word_tokenized_data[0])}.")
print(f"The first sentence contains {len(word_tokenized_data[0])} inner elements, each representing a word (token).")
print(f"Each of these tokens is a {type(word_tokenized_data[0][0])} representing a word id.")

print("\n"+55*"-"+"\n")

print(f"Char level tokenized train dataset has {len(char_tokenized_data)} elements as well.")
print(f"Also, here each element represents a sentence and is stored as a {type(char_tokenized_data[0])}.")
print(f"Also, here the first sentence contains {len(char_tokenized_data[0])} inner elements, each representing a word (token).")
print(f"However, here each word is represented by a {type(char_tokenized_data[0][0])} of character ids.")


The char-level tokenized dataset is going to be used later in our char-embedding representations.

As we mentioned earlier, we have to make all sequences in a batch have the same length so we can feed them to a neural network. Therefore, we need to trim long sentences to `MAX_SENT_LEN` and pad the short ones to the same length. Likewise, for our char-level dataset, we have to decide on a `MAX_WORD_LEN` to pad and trim the words. We also need to pad our labels. However, we do not have a dedicated class for padding, so we use the "O" class for this purpose. As we will see, this choice is not important; we just need our labels to have the same length as our sentences.
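To make the trimming and padding concrete, here is a small pure-Python sketch. The helper `pad_or_truncate` is a hypothetical stand-in written for illustration; it pads and cuts at the end of the sequence, which is what the `'post'` options of `pad_sequences` are meant to do.

```python
def pad_or_truncate(seq, max_len, pad_value=0):
    """Cut seq to max_len items, then right-pad it with pad_value."""
    trimmed = list(seq[:max_len])
    return trimmed + [pad_value] * (max_len - len(trimmed))

short_sent = [5, 8, 3]
long_sent = [5, 8, 3, 9, 4, 7]
print(pad_or_truncate(short_sent, 4))  # [5, 8, 3, 0]
print(pad_or_truncate(long_sent, 4))   # [5, 8, 3, 9]

# Char level: a sentence of two words, padded to 3 words of 3 characters each.
chars = [[11, 12], [13, 14, 15, 16]]
words = pad_or_truncate(chars, 3, pad_value=[])   # missing words become empty lists
grid = [pad_or_truncate(w, 3) for w in words]     # then every word is padded or cut
print(grid)  # [[11, 12, 0], [13, 14, 15], [0, 0, 0]]
```

The char-level case needs two passes: one over the words of the sentence and one over the characters of each word.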

In [None]:
#@title pad and truncate


import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences


MAX_SENT_LEN = 48 #@param {type:"integer"}
MAX_WORD_LEN = 16 #@param {type:"integer"}

def pad_and_truncate(tokenized_data, sent_len=50, word_len=20):
  word_tokenized, char_tokenized, tags = tokenized_data
  pad_tok = word2id["<PAD>"]
  pad_char = char2id["<PAD>"]
  pad_cls = tag2id["O"]
  words_paded = pad_sequences(word_tokenized, 
                             maxlen=sent_len, 
                             padding='post', 
                             truncating='post',
                             value=pad_tok)
  
  chars_paded = []

  for doc in char_tokenized:
    _temp = []
    for word_index in range(sent_len):
      if word_index < len(doc):
        _temp.append(doc[word_index])
      else:
        _temp.append([])
    doc_paded = pad_sequences(_temp, 
                              maxlen=word_len, 
                              padding='post', 
                              truncating='post',
                              value=pad_char) 
    chars_paded.append(doc_paded)
  chars_paded = np.array(chars_paded, "int32")

      
    
  tags_paded = pad_sequences(tags, 
                              maxlen=sent_len, 
                              padding='post', 
                              truncating='post',
                              value=pad_cls)
 

  return words_paded, chars_paded, tags_paded

paded_train = pad_and_truncate(tokenized_train, MAX_SENT_LEN, MAX_WORD_LEN)
paded_test = pad_and_truncate(tokenized_test, MAX_SENT_LEN, MAX_WORD_LEN)
paded_valid = pad_and_truncate(tokenized_valid, MAX_SENT_LEN, MAX_WORD_LEN)

print ("Done!")
  

Let's see the result:

In [None]:
word_paded_data, char_paded_data, tags_paded = paded_train

print(f"Padded word-level train dataset shape is: {word_paded_data.shape}")
print(f"Padded char-level train dataset shape is: {char_paded_data.shape}")
print(f"Padded train tags shape is: {tags_paded.shape}")

Now let's make TensorFlow datasets! A regular TensorFlow dataset returns a tuple of (inputs, label) per iteration. Here, however, we have a third element, `sample_weight`, which indicates how much each sample contributes to our loss function. To obtain the `sample_weight`, we pass a dictionary mapping each class to its corresponding weight to our dataset generation function. The goal is to prevent the more frequent classes from dominating the loss and to let the less frequent classes be as impactful. For now, we pass a default weight mapping in which every class has a weight of one, but later on you need to construct a proper weight mapping based on the frequency of each class. Therefore, notice how the default weight dictionary is made here:
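As a side illustration of what a per-token `sample_weight` does, the NumPy sketch below turns a tag matrix into a weight matrix with a lookup array. The tags and the weights in `toy_class_weight` are made-up values, and the vectorized lookup is just one possible alternative to a nested loop.

```python
import numpy as np

# Two padded toy sentences, four tokens each; the values are class ids.
toy_tags = np.array([[0, 0, 2, 0],
                     [1, 0, 0, 0]])
toy_class_weight = {0: 0.2, 1: 1.0, 2: 3.0}   # hypothetical weights, for illustration only

# Array whose i-th entry is the weight of class i, indexed by the tag matrix.
lookup = np.array([toy_class_weight[i] for i in range(len(toy_class_weight))])
toy_sample_weights = lookup[toy_tags]          # same shape as toy_tags

# Pretend every token has a loss of 1 to see how the weights rescale it.
per_token_loss = np.ones_like(toy_sample_weights)
weighted = per_token_loss * toy_sample_weights
print(toy_sample_weights)
print(weighted.sum(axis=1))
```

Tokens of the frequent class contribute little to each sentence's total, while a single rare-class token can dominate it.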

In [None]:
#@title Tf datasets
from tensorflow.data import Dataset
import tensorflow as tf

def make_tf_dataset (paded_data, class_weight):
  words_paded, chars_paded, tags_paded = paded_data

  words_dataset = Dataset.from_tensor_slices(words_paded)
  chars_dataset = Dataset.from_tensor_slices(chars_paded)

  X = Dataset.from_tensor_slices((words_paded, chars_paded))
  tags_dataset = Dataset.from_tensor_slices(tf.one_hot(tags_paded, len(tag2id)))

  sample_weights = []
  for sentence in tags_paded:
    sentence_weights = []
    for tag in sentence:
      sentence_weights.append(class_weight[tag])
    sample_weights.append(sentence_weights)
  sample_weights = Dataset.from_tensor_slices(sample_weights)

  dataset = Dataset.zip ((X, tags_dataset, sample_weights)).batch(64).prefetch(tf.data.AUTOTUNE)

  return dataset

## Default weight mapping
default_class_weights = {i:1 for i in id2tag}

dataset_train = make_tf_dataset(paded_train, default_class_weights)
dataset_test = make_tf_dataset(paded_test, default_class_weights)
dataset_valid = make_tf_dataset(paded_valid, default_class_weights)

print ("Done!")

Now it's time to make our model! As a baseline, we use a simple bi-LSTM neural network. We use only the word-level dataset here: the char-level dataset is also passed to the model by the TensorFlow dataset we created, but we ignore it for this part. We follow these steps:


1.   We take our input with shape `<batch, max_sent_length>` and feed it to an `Embedding` layer.
2.   The `Embedding` layer maps each token id to a vector representation of that word. At first, these are random vectors, but as training goes on, the model learns meaningful representations for each token. The output shape will be `<batch, max_sent_length, word_embedding_out>`. Note that we pass `True` to the `mask_zero` parameter of our `Embedding` layer. As you may recall, some layers take a mask tensor which indicates which inputs they should take into consideration when they do their computations. By activating `mask_zero`, our `Embedding` layer automatically generates a mask for tokens with id=0. This is the reason we mapped the `<PAD>` token to 0. The generated mask is then automatically passed to subsequent layers, and this is why we could arbitrarily pad our tags with class "O": the mask is already generated and these padded values are masked regardless.

3.   We feed the raw word representations to a `bi-LSTM` layer to create context-aware representations for our words. The resulting shape will be `<batch, max_sent_length, bi_LSTM_hidden_shape>`.

4.   We finally pass these contextual representations to a `Dense` layer to tag our classes. The shape will be `<batch, max_sent_length, class_num>`.
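The effect of the padding mask can be sketched conceptually in NumPy. This is not Keras code, and the exact normalization Keras applies may differ; the toy ids and loss values are assumptions chosen only to show that padded positions contribute nothing.

```python
import numpy as np

# Two toy sentences padded with id 0 to length 5.
toy_ids = np.array([[7, 4, 9, 0, 0],
                    [3, 5, 0, 0, 0]])
mask = toy_ids != 0                       # True for real tokens, False for padding

# Made-up per-token losses; the 9.0 entries sit on padded positions.
per_token_loss = np.array([[0.5, 1.0, 1.5, 9.0, 9.0],
                           [2.0, 4.0, 9.0, 9.0, 9.0]])

# Average the loss over real tokens only.
masked_mean = (per_token_loss * mask).sum() / mask.sum()
print(mask.sum(axis=1))  # [3 2]
print(masked_mean)       # 1.8
```

Whatever label is stored at a padded position, its loss is multiplied by zero, which is why the choice of padding class for the tags does not matter.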

In [None]:
#@title base model architecture

import tensorflow as tf
from tensorflow.keras import layers

class BaseNERModel(tf.keras.Model):
    def __init__(self,
                 max_sent_len,
                 word_embed_input_dim, 
                 word_embed_output_dim,
                 num_classes,
                 hidden_size=75):
      
        super().__init__()
        self.word_embedding = layers.Embedding(input_dim=word_embed_input_dim,
                                               output_dim=word_embed_output_dim,
                                               input_length=max_sent_len,
                                               trainable=True,
                                               mask_zero=True)


        self.bilstm = layers.Bidirectional(layers.LSTM(hidden_size, return_sequences=True))
        
        self.dense = layers.Dense(num_classes, activation="softmax")
    
    def call(self, inputs):
        word_input, char_input = inputs
        word_vectors = self.word_embedding(word_input)        
  
        lstm = self.bilstm(word_vectors) #batchsize, max_seq_len, hidden_dim_bilstm
        logits = self.dense(lstm)

        return logits

base_model = BaseNERModel(max_sent_len=MAX_SENT_LEN,
                          word_embed_input_dim=len(word2id), 
                          word_embed_output_dim=75,
                          num_classes=len(tag2id))
    

Let's train our model:

In [None]:
#@title train the base model

base_model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])


result = base_model.fit(dataset_train, epochs=8, validation_data=dataset_valid)

Now let's see how our baseline performed on the test set:

In [None]:
#@title predictions on test set
sample_index = 200 #@param {type:"integer"}

pred = base_model.predict(dataset_test)

n = sample_index
len_sent = min(len(parsed_test[n]), MAX_SENT_LEN)  # predictions are truncated to MAX_SENT_LEN
print("{:15} | {:5} | {}".format("Word", "True", "Pred"))
print(32 * "=")
for i in range (len_sent):
  print("{:15} : {:5}   {}".format(parsed_test[n][i][0], parsed_test[n][i][1], id2tag[np.argmax(pred[n][i], axis=-1)]))


Evidently, **accuracy** is not a good performance metric in this case, since most of our instances are tagged as "O" and the dataset is heavily imbalanced. Hence, we use the **F-score** to measure the performance:
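To see why accuracy misleads on imbalanced data, consider a made-up example with 100 tokens where 90 belong to class 0, and a degenerate predictor that always outputs class 0. The numbers and the helper `f1_for_class` are illustrative assumptions, not results from this notebook.

```python
import numpy as np

y_true = np.array([0] * 90 + [1] * 6 + [2] * 4)   # made-up, heavily imbalanced labels
y_pred = np.zeros_like(y_true)                    # always predicts class 0

accuracy = (y_true == y_pred).mean()

def f1_for_class(y_true, y_pred, c):
    """F1 of class c computed from true positives, false positives and false negatives."""
    tp = np.sum((y_pred == c) & (y_true == c))
    fp = np.sum((y_pred == c) & (y_true != c))
    fn = np.sum((y_pred != c) & (y_true == c))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

f1s = [f1_for_class(y_true, y_pred, c) for c in range(3)]
macro_f1 = float(np.mean(f1s))
print(f"accuracy = {accuracy:.2f}")   # accuracy = 0.90
print(f"macro F1 = {macro_f1:.2f}")   # macro F1 = 0.32
```

The predictor never finds a single entity, yet its accuracy is 90%; the unweighted average of per-class F-scores exposes the failure.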

In [None]:
#@title f-score
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix, accuracy_score

## flatten the predictions and labels 
sent_lens = [min(len(s), MAX_SENT_LEN) for s in parsed_test] 

flat_test_preds = []
for i, sent in enumerate(pred):
  sent_preds = np.argmax(sent[:sent_lens[i]], axis=-1)
  flat_test_preds += list(sent_preds)

  
flat_test_labels = []
for i, sent in enumerate(parsed_test):
  sent_labels = [tag2id[l] for w,l in sent]
  flat_test_labels += sent_labels[:sent_lens[i]]

## compute metric
f1 = f1_score(flat_test_labels, flat_test_preds, average=None)

print("{:12} |  {:5}".format("Class", "F-score"))
print(25 * "=")
for i in range (len(id2tag)):
  print("{:12} : {:5.1f} ".format(id2tag[i], 100*f1[i]))
print(25 * "-")
print("{:12} : {:5.1f} ".format("AVG.", 100*np.mean(f1)))

<font color="purple"><b>Now you must implement a model that uses the char-level dataset as well. Follow these steps:</b></font>

1.   <font color="purple"><b>Generate raw word representations like before.</b></font>

2.   <font color="purple"><b>Use another `Embedding` layer which takes char-level dataset entries as input and generates raw character representations. Remember to set `mask_zero=True` to generate a mask for padded characters. The output shape will be `<batch, max_sent_length, max_word_length, char_embedding_out>`.</b></font>

3.   <font color="purple"><b>Use a 1D-convolutional layer on the char representations. This generates context-aware representations based on local features of consecutive raw character representations. The output shape will be `<batch, max_sent_length, num_conv_features, conv_feature_map>`.</b></font>

4.   <font color="purple"><b>We want to generate word representations using these context-aware character representations. Use a proper reduction method on the third axis (`num_conv_features`, i.e. `axis=2`) to generate an output with shape `<batch, max_sent_length, conv_feature_map>`.</b></font>

5. <font color="purple"><b>Concatenate these newly generated word representations with the raw word representations you already had, along the third (last) axis. The output shape will be `<batch, max_sent_length, conv_feature_map + word_embedding_out>`.</b></font>

6.  <font color="purple"><b>Pass to `bi-LSTM` and `Dense` as before.</b></font>

In [None]:
#@title YOUR PART#1 - char level model architecture

import tensorflow as tf
from tensorflow.keras import layers

class CharLevelNERModel(tf.keras.Model):
  def __init__(self,
                max_sent_len,
                word_embed_input_dim, 
                word_embed_output_dim,
                max_word_len,
                char_embed_input_dim, 
                char_embed_output_dim,
                num_classes,
                conv_filters = 20,
                hidden_size=75):
    
    ########################################
    #     Put your implementation here     #
    ########################################
 
  def call(self, inputs):

    ########################################
    #     Put your implementation here     #
    ########################################
    
char_level_model = CharLevelNERModel(max_sent_len=MAX_SENT_LEN,
                          word_embed_input_dim=len(word2id), 
                          word_embed_output_dim=75,
                          max_word_len=MAX_WORD_LEN,
                          char_embed_input_dim=len(char2id), 
                          char_embed_output_dim=25,
                          conv_filters = 20,
                          num_classes=len(tag2id))
    

In [None]:
#@title train the char level model

char_level_model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])


result = char_level_model.fit(dataset_train, epochs=8, validation_data=dataset_valid)

Test the performance:

In [None]:
#@title f-score
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix, accuracy_score


pred = char_level_model.predict(dataset_test)

## flatten the predictions and labels 
sent_lens = [min(len(s), MAX_SENT_LEN) for s in parsed_test] 

flat_test_preds = []
for i, sent in enumerate(pred):
  sent_preds = np.argmax(sent[:sent_lens[i]], axis=-1)
  flat_test_preds += list(sent_preds)

  
flat_test_labels = []
for i, sent in enumerate(parsed_test):
  sent_labels = [tag2id[l] for w,l in sent]
  flat_test_labels += sent_labels[:sent_lens[i]]

## compute metric
f1 = f1_score(flat_test_labels, flat_test_preds, average=None)

print("{:12} |  {:5}".format("Class", "F-score"))
print(25 * "=")
for i in range (len(id2tag)):
  print("{:12} : {:5.1f} ".format(id2tag[i], 100*f1[i]))
print(25 * "-")
print("{:12} : {:5.1f} ".format("AVG.", 100*np.mean(f1)))

If you did everything right, the average F-score should increase by roughly 4 to 5 percentage points.

<font color="purple"><b>Explain why. What kind of information do you think is encoded in the new word representations generated from the character representations?</b></font>

<font color="purple"><b>##### PUT YOUR ANSWER HERE! #####</b></font>

As you can see, our dataset is dominated by class "O". Here you must create a new class weight mapping based on the frequency of each class in our train set. The more frequent a class is, the lower its weight should be.

<font color="purple"><b>Generate a proper class weights mapping dictionary:</b></font>

In [None]:
#@title YOUR PART#2 - class weights



train_class_weight  = {} #?

########################################
#     Put your implementation here     #
########################################


Now let's initialize a new model and train again using weighted train set:

In [None]:
#@title train the char level model with class weights

dataset_train = make_tf_dataset(paded_train, train_class_weight)

char_level_model = CharLevelNERModel(max_sent_len=MAX_SENT_LEN,
                                      word_embed_input_dim=len(word2id), 
                                      word_embed_output_dim=75,
                                      max_word_len=MAX_WORD_LEN,
                                      char_embed_input_dim=len(char2id), 
                                      char_embed_output_dim=25,
                                      num_classes=len(tag2id))

char_level_model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])


result = char_level_model.fit(dataset_train, epochs=8, validation_data=dataset_valid)

In [None]:
#@title f-score
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix, accuracy_score


pred = char_level_model.predict(dataset_test)

## flatten the predictions and labels 
sent_lens = [min(len(s), MAX_SENT_LEN) for s in parsed_test] 

flat_test_preds = []
for i, sent in enumerate(pred):
  sent_preds = np.argmax(sent[:sent_lens[i]], axis=-1)
  flat_test_preds += list(sent_preds)

  
flat_test_labels = []
for i, sent in enumerate(parsed_test):
  sent_labels = [tag2id[l] for w,l in sent]
  flat_test_labels += sent_labels[:sent_lens[i]]

## compute metric
f1 = f1_score(flat_test_labels, flat_test_preds, average=None)

print("{:12} |  {:5}".format("Class", "F-score"))
print(25 * "=")
for i in range (len(id2tag)):
  print("{:12} : {:5.1f} ".format(id2tag[i], 100*f1[i]))
print(25 * "-")
print("{:12} : {:5.1f} ".format("AVG.", 100*np.mean(f1)))

<font color="purple"><b>Compare the results: which classes benefit and why?</b></font>

<font color="purple"><b>##### PUT YOUR ANSWER HERE! #####</b></font>

# 2. Sequence-to-Sequence Spelling Correction

In the final part of the last assignment, we are going to try a different task type: sequence-to-sequence, or **seq2seq** for short. A seq2seq model simply gets a sequence of items as input and generates another sequence as its output. Its most common application is machine translation, where seq2seq models were born. You can think of other applications or search for them!

Seq2seq tasks are different from sequence tagging because in seq2seq tasks, input and output items are not explicitly mapped to each other, and the length of input and output may not be equal at all.

![](https://miro.medium.com/max/2400/1*1nERP8YPd-0DkpVC4Fi2pg.png)

As you probably guessed, it is hard* to train a seq2seq model for a real-world problem like translation (why?). So we are going to apply a seq2seq model on a semi-toy problem: **Spelling Correction**. The input will be a text with spelling mistakes (sequence of characters) and the model should generate the corrected text (again as a sequence of chars). To prevent boredom and to exhilarate ourselves with literary delicacy and wisdom, the original texts are selected from Masnavi-e-Ma'navi. We will follow these steps to train our seq2seq model:

1. Generate texts with controlled random spelling mistakes from original texts (mesra's of the poems)
2. Prepare our data examples as pairs of input (noisy) and target (original) sequences
3. Define our seq2seq model architecture
4. Train the model on our generated data

*requiring lots of computation resources, data and design effort

## Prepare Data

### Generate spelling mistakes

The first question is "how do we make spelling mistakes as humans do?" But we don't take this question too seriously! We could probably use probabilistic models that imitate human mistakes, but we use a simplistic method instead.

In Persian, one main cause of spelling mistakes is **homophonic characters**: for example, four different characters share the same sound *z*! Other than that, typing errors are very frequent:
* missing chars
* inserting extra chars
* hitting adjacent keys on keyboard
* swapping the order of adjacent chars
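The adjacent-key error in the list above is not implemented in this notebook, but a rough sketch of it could look as follows. Everything here is an assumption made for illustration: the names `toy_kb`, `adjacent_keys` and `adjacent_key_perturb` are hypothetical, the rows are taken to be in keyboard order, and only left and right neighbours on the same row are considered.

```python
from random import Random

# Three keyboard rows, assumed to be listed in physical key order.
toy_kb = ['ضصثقفغعهخحجچپ',
          'شسیبلاتنمکگ',
          'ظطزرذدئو']

def adjacent_keys(ch, kb=toy_kb):
    """Left and right neighbours of ch on its row (empty list if ch is not on the layout)."""
    for row in kb:
        i = row.find(ch)
        if i != -1:
            return [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
    return []

def adjacent_key_perturb(word, rng):
    """Replace one random character of word with a neighbouring key."""
    candidates = [i for i, ch in enumerate(word) if adjacent_keys(ch)]
    if not candidates:
        return word                       # nothing on the layout, leave the word as is
    idx = rng.choice(candidates)
    chars = list(word)
    chars[idx] = rng.choice(adjacent_keys(chars[idx]))
    return ''.join(chars)

rng = Random(0)                           # seeded so the example is repeatable
noisy = adjacent_key_perturb('درست', rng)
print(noisy)
```

The result always has the same length as the input and differs from it in exactly one character.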

Let's make it easier by taking only homophonic characters and, among the typos, swapping adjacent characters. We are going to define a function that makes random spelling mistakes of these two types...

### Define Text Perturbation


Perturbation means making small changes to something, like adding noise to a signal. We are going to define perturbation functions that imitate spelling mistakes.

It's not your turn yet :), so if you are not interested, you can skip reading and understanding the code, just run the following cells, and move forward...

In [None]:
#@title import libs
from IPython.display import clear_output

import re
import os
from random import random, choices, choice
from collections import defaultdict, Counter
import numpy as np
from tqdm.notebook import tqdm

In [None]:
# @title Word-level perturbations

# persian keyboard layout, seems useless now!
# but we could use it to model one of typing errors above...
persian_kb = ['ضصثقفغعهخحجچپ',
              'شسیبلاتنمکگ',
              'ظطزرذدئو']

persian_chars = ''.join(persian_kb) + 'ژآ'
homophonic_groups = ['زذضظ', 'سصث', 'تط', 'غق', 'هح'] # + ['اآع']
homophonic_chars = ''.join(homophonic_groups)

def get_random_homophonic_char(ch):
  for group in homophonic_groups:
    if ch in group:
      return choice(group.replace(ch, ''))
  return ch

def count_homophonic_chars(word):
  homo_counts = Counter(homophonic_chars * 2)
  char_counts = Counter(word)
  return sum((homo_counts & char_counts).values())

def homophonic_perturb(word):
  is_homo = [1 if ch in homophonic_chars else 0 for ch in word]
  if not any(is_homo):
    # nothing to replace; random.choices rejects all-zero weights on Python 3.9+
    return word
  new_word = list(word)
  indexes = list(range(len(word)))
  target_index = choices(indexes, weights=is_homo)[0]
  new_char = get_random_homophonic_char(new_word[target_index])
  new_word[target_index] = new_char
  return ''.join(new_word)

def swap_chars_perturb(word):
  if len(word) < 2:
    return word
  new_word = list(word)
  indexes = list(range(len(word)))
  target_index = choice(indexes[:-1])
  new_word[target_index], new_word[target_index+1] = new_word[target_index+1], new_word[target_index]
  return ''.join(new_word)

print('homophonic chars:', homophonic_perturb('صحیح'))
print('swap adjacent chars:', swap_chars_perturb('صحیح'))

homophonic chars: ثحیح
swap adjacent chars: حصیح


In [None]:
# @title Text perturbation
def perturb_words(words, weights, k, mode):
  """
  perturbs words with given mode, k times.
  
  Arguments:
  words: list of strings, the words to perturb
  weights: list of weights for each word, used to select random word indexes
  k: number of words to perturb
  mode: perturbation mode, one of 'homophonic', 'swap'

  Returns:
  perturbed words as a list,
  weight update which will be used to reduce probability of 
  perturbing a single word again in future perturbations
  """
  if sum(weights) <= 0:
    # no eligible word; random.choices rejects all-zero weights on Python 3.9+
    return words, np.ones(len(words))
  target_indexes = choices(range(len(words)), weights=weights, k=k)
  weight_updates = np.ones(len(words))

  for idx in target_indexes:
    if mode == 'homophonic':
      words[idx] = homophonic_perturb(words[idx])
    elif mode == 'swap':
      words[idx] = swap_chars_perturb(words[idx])
    
    weight_updates[idx] *= 0.1
  return words, weight_updates

def perturb_text(text,
                 homophonic_prob=0.5, swap_prob=0.5):
  """
  perturbs a given text, based on probabilities for each mode
  """
  words = text.split(' ')

  indexes = list(range(len(words)))
  homophonic_weights = [count_homophonic_chars(word) for word in words]
    
  words, weights = perturb_words(words, homophonic_weights,
                                 int(len(words)*homophonic_prob), mode='homophonic')
  words, weights = perturb_words(words, weights,
                                 int(len(words)*swap_prob), mode='swap')
  return ' '.join(words)

perturb_text('بیایید همیشه درست بنویسیم')

'بیایید حمیشه درست بنویثیم'

Now we have a `perturb_text` function that applies spelling mistakes to the input text, controlled by the probability of each perturbation mode.
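The "probabilities" translate into a number of random draws per mode, computed as `int(len(words) * prob)` in `perturb_text`. A short worked example of that arithmetic, using the four-word sentence from the cell above:

```python
text = 'بیایید همیشه درست بنویسیم'
words = text.split(' ')
homophonic_prob, swap_prob = 0.5, 0.5

# Number of draws per mode, as computed inside perturb_text.
k_homophonic = int(len(words) * homophonic_prob)
k_swap = int(len(words) * swap_prob)
print(len(words), k_homophonic, k_swap)  # 4 2 2

# int() truncates, so a three-word text with prob 0.5 gets a single draw per mode.
print(int(3 * 0.5))  # 1
```

Because the draws are made with replacement, the same word can be picked more than once, so these counts are upper bounds on the number of distinct words changed by each mode.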

### Download, Read Texts

Let's get our text data and check how our perturbation works on it...

In [None]:
#@title Download texts
!git clone https://github.com/ganjoor/ganjoor-tex
# !git clone https://github.com/UniversalDependencies/UD_Persian-Seraji
# !git clone https://github.com/tihu-nlp/normalized_bijankhan
# !7z x /content/normalized_bijankhan/bijankhan.7z
clear_output()
print('Done!')

Done!


In [None]:
#@title Read texts

import os, re

def preprocess_text(text):
  """
  cleans input text from non-Persian characters, ...
  """
  oov_chars = re.compile(f'[^{persian_chars} ]+')
  multispace = re.compile(r'\s+')

  text = text.replace('\u200c', '')       # drop the zero-width non-joiner (semi-space)
  text = multispace.sub(' ', text)        # collapse whitespace runs into one space
  text = oov_chars.sub('', text)          # remove out-of-vocab chars
  return text.strip()

def read_masnavi(data_dir):

  for dname in os.listdir(data_dir):
    directory = os.path.join(data_dir, dname)
    if directory[-1] not in '12':
      continue
    for fname in os.listdir(directory):
      with open(os.path.join(directory, fname)) as txt_file:
        for i, text in enumerate(txt_file):
          if i == 0:
            continue
          text = text.strip()
          if len(text) > 10:
            yield preprocess_text(text)

def read_seraji(data_path):
  with open(data_path) as txt_file:
    for i, text in enumerate(txt_file):
      if text.startswith('# text'):
        text = text[8:]
        if 10 < len(text) < 100:
          yield preprocess_text(text)

def read_bijankhan(data_path, max_char_len=64):
  separators = '[#.؟؛)()]'
  with open(data_path) as txt_file:
    text = ''
    for i, line in enumerate(txt_file):
      if line.startswith('!'):
        continue
      elif line[0] in separators:
        if len(preprocess_text(text)) > max_char_len / 2:
          yield preprocess_text(text)
        text = ''
      else:
        new_word = line.partition('\t')[0] + ' '
        if len(preprocess_text(text + new_word)) > max_char_len:
          yield preprocess_text(text)
          text = ''
        text += new_word

DATA_DIR = '/content/ganjoor-tex/txt/moulavi/masnavi' #@param {type: "string"}
# DATA_DIR = '/content/bijankhan.txt'  #@param {type: "string"}
MAX_LEN = 36  #@param {type: "integer"}
# texts = list(read_bijankhan(DATA_DIR, MAX_LEN-2))
texts = list(read_masnavi(DATA_DIR))

counter = Counter()
for text in texts:
  counter.update(text.split())

print(len(texts), 'texts')
print(sum(map(lambda t: len(t.split()), texts)), 'word tokens')
print(len(counter), 'unique words')
print(sum(map(len, texts)), 'chars')
print('\nexample 0:\n', texts[0])

15664 texts
95897 word tokens
13278 unique words
397853 chars

example 0:
 سوی مکه شیخ امت بایزید


In [None]:
print(texts[1], '\t',texts[0])
print(perturb_text(texts[1]), '\t', perturb_text(texts[0]))

از برای حج و عمره میدوید 	 سوی مکه شیخ امت بایزید
اض برای حج و عرمح مدیوید 	 وسی مکح یشخ امط بایزید


### Make TF data

Now we can build our training data as pairs of perturbed (noisy) and original texts. Before proceeding, we need to define three special characters:
* **start** of sequence, marking where a sequence begins
* **end** of sequence, marking where it ends
* **pad** character, used to extend sequences to the maximum length and ignored later (for example, in the loss).
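
As a minimal sketch of how these three characters are used, the snippet below wraps a string, pads it to a fixed length and maps it to integer ids. The names `toy_alphabet`, `toy_chars`, `toy_ch2id`, `toy_max_len` and `encode` are hypothetical stand-ins (for `persian_chars`, `all_chars`, `ch2id`, `MAX_LEN` and the generator logic); only the three special characters are taken from the notebook.

```python
# Toy illustration of start/end/pad characters (all toy_* names are hypothetical).
pad_char, start_char, end_char = '_', '<', '>'
toy_alphabet = 'abc'
toy_chars = pad_char + start_char + end_char + ' ' + toy_alphabet
toy_ch2id = {ch: i for i, ch in enumerate(toy_chars)}
toy_max_len = 10

def encode(text, max_len=toy_max_len):
    """Wrap text in start/end markers, right-pad with pad_char, map chars to ids."""
    wrapped = (start_char + text + end_char).ljust(max_len, pad_char)
    return [toy_ch2id[ch] for ch in wrapped]

ids = encode('ab ca')
print(ids)
```

Because the pad character comes first in the vocabulary string, its id is 0, which is the value the loss mask later compares against.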

In [None]:
#@title Make char vocab, Define special chars
pad_char = '_'
start_char = '<'
end_char = '>'
all_chars = pad_char + start_char + end_char + ' ' + persian_chars

inp_vocab_size = len(all_chars)
trg_vocab_size = len(all_chars)

id2ch = dict(enumerate(all_chars))
ch2id = {ch:cid for cid, ch in id2ch.items()}

Because our noisy input data is randomly generated, we want to generate new random inputs for each epoch. So we make our tf dataset using `tf.data.Dataset.from_generator`:

In [None]:
#@title Make tf data

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

train_texts, val_texts = train_test_split(texts, test_size=0.1)

BUFFER_SIZE = len(texts)
MAX_LEN = 36  #@param {type: "integer"}
BATCH_SIZE = 256 #@param {type: "integer"}

steps_per_epoch = len(train_texts) // BATCH_SIZE

def get_data_generator(texts):
  def noisy_text_generator():
    for text in texts:
      noisy_text = perturb_text(text)

      # add special characters and pad to max length
      text = (start_char + text + end_char).ljust(MAX_LEN, pad_char)
      noisy_text = (start_char + noisy_text + end_char).ljust(MAX_LEN, pad_char)

      # char -> id
      target_ids = [ch2id[ch] for ch in text]
      noisy_ids = [ch2id[ch] for ch in noisy_text]
      yield (tf.convert_to_tensor(noisy_ids, dtype=tf.int64),
             tf.convert_to_tensor(target_ids, dtype=tf.int64))
  return noisy_text_generator

# training set
dataset = tf.data.Dataset.from_generator(
    get_data_generator(train_texts), 
    output_signature=(tf.TensorSpec((MAX_LEN,), dtype=tf.int64), tf.TensorSpec((MAX_LEN,), dtype=tf.int64)))
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # let tf.data pick the prefetch depth instead of buffering every batch

# validation set
val_dataset = tf.data.Dataset.from_generator(
    get_data_generator(val_texts), 
    output_signature=(tf.TensorSpec((MAX_LEN,), dtype=tf.int64), tf.TensorSpec((MAX_LEN,), dtype=tf.int64)))
val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)
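
The reason for using a generator can be seen in a framework-free sketch: the generator function is called afresh every time the data is iterated, so the noise is re-sampled on each pass. Here `toy_perturb` is a hypothetical stand-in for `perturb_text` (it only swaps one random pair of adjacent characters), and `make_generator` mirrors the shape of `get_data_generator` without any TensorFlow.

```python
import random

def toy_perturb(text, rng):
    """Hypothetical noise: swap one random pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

def make_generator(texts, rng):
    """Return a zero-argument generator function, similar to get_data_generator."""
    def gen():
        for text in texts:
            yield toy_perturb(text, rng), text
    return gen

rng = random.Random(0)
gen = make_generator(['abcdef', 'ghijkl'], rng)
epoch_1 = list(gen())   # first pass over the data
epoch_2 = list(gen())   # second pass draws fresh random noise (may coincide by chance)
```

The targets are identical in both passes; only the noisy inputs are re-drawn.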

In [None]:
example_input_batch, example_target_batch = next(iter(dataset))
print('input batch shape (batch_size, input_seq_len):', example_input_batch.shape)
print('target batch shape (batch_size, input_seq_len):', example_target_batch.shape)

input batch shape (batch_size, input_seq_len): (256, 36)
target batch shape (batch_size, input_seq_len): (256, 36)


## Define, Train Model

We are going to implement a seq2seq model including:
* Bidirectional GRU Encoder
* Attention layer based on [Luong's](https://arxiv.org/pdf/1508.04025.pdf) Global attention
* GRU Decoder 

The picture below shows the final model. We will implement these three parts step by step, and the attention layer is up to you!

![](https://github.com/teias-courses/dl99/raw/gh-pages/assets/img/luong_att.png)

The image is adapted from [here](http://cnyah.com/2017/08/01/attention-variants/) and customized to fit our model.

In [None]:
#@title model hyperparams (do not change!)

EMBEDDING_DIM = 100 #@param {type: "integer"}
UNITS = 300  #@param {type: "integer"}

### Define Seq2Seq Model

The encoder has nothing new: a char embedding layer and a bidirectional GRU. We need the encoder's final state (the concatenation of the forward and backward states) to initialize the decoder later, and all output vectors $\overline{h}_s$ as input to the attention layer.

In [None]:
class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, units):
    super(Encoder, self).__init__()
    self.units = units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    gru_fw = tf.keras.layers.GRU(self.units//2,
                                 return_sequences=True,
                                 return_state=True)
    gru_bw = tf.keras.layers.GRU(self.units//2,
                                 return_sequences=True,
                                 return_state=True,
                                 go_backwards=True)
    self.bi_gru = tf.keras.layers.Bidirectional(gru_fw, backward_layer=gru_bw)

  def call(self, x, hidden):
    x = self.embedding(x)

    # split hidden state of forward and backward GRU
    hidden = tf.split(hidden, num_or_size_splits=2, axis=-1)

    output, state_fw, state_bw = self.bi_gru(x, initial_state=hidden)

    # merge (concat) hidden state of forward and backward GRU
    return output, tf.concat([state_fw, state_bw], axis=-1)

  def initialize_hidden_state(self, batch_size):
    return tf.zeros((batch_size, self.units))

In [None]:
encoder = Encoder(inp_vocab_size, EMBEDDING_DIM, UNITS)

# sample input batch
sample_hidden = encoder.initialize_hidden_state(BATCH_SIZE)
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)

print('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

Encoder output shape: (batch size, sequence length, units) (256, 36, 300)
Encoder Hidden state shape: (batch size, units) (256, 300)


It's time for you to take part and implement the attention layer.

Attention mechanism generally has 3 main parts:
* query vector
* key vectors
* value vectors

The output is a weighted average of the **value vectors**, with weights that we call attention scores (or attention weights). The attention scores are calculated by a scoring function over the **query** and the **key vectors**.

In our case the **query** is the decoder state at the current time step, $h_t$, and the **keys and values** are the same: both equal the encoder outputs $\overline{h}_s$.

The score function proposed by Luong et al. is:

$score(h_t, \overline{h}_s)=h_t^\top W_a\overline{h}_s$

where $W_a$ is a learnable parameter. The scores are then normalized over the source positions $s$ using softmax:

$a_t=\mathrm{softmax}(score(h_t, \overline{h}_s))$

The output of our attention layer (known as the context vector) is the weighted average of the $\overline{h}_{s}$:

$c_t = \sum\limits_s a_{ts}\overline{h}_{s}$
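
To make the arithmetic concrete, here is a minimal sketch of the three formulas for a single, unbatched example in plain Python. It is not the Keras layer asked for below: `luong_general_attention` is a hypothetical helper, vectors are plain lists, and $W_a$ is passed in as a fixed matrix rather than learned.

```python
import math

def luong_general_attention(h_t, h_s, W_a):
    """h_t: query (list of d floats); h_s: list of S source vectors (d floats each);
    W_a: d x d matrix as nested lists. Returns (context, weights)."""
    d = len(h_t)
    # score_s = h_t^T W_a h_s: compute q = h_t^T W_a once, then dot it with each h_s
    q = [sum(h_t[i] * W_a[i][j] for i in range(d)) for j in range(d)]
    scores = [sum(q[j] * h[j] for j in range(d)) for h in h_s]
    # softmax over source positions (subtract the max for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # context = sum_s a_ts * h_s
    context = [sum(w * h[j] for w, h in zip(weights, h_s)) for j in range(d)]
    return context, weights

# toy example: an identity W_a reduces the score to a plain dot product
h_t = [1.0, 0.0]
h_s = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W_a = [[1.0, 0.0], [0.0, 1.0]]
context, weights = luong_general_attention(h_t, h_s, W_a)
```

The source vector most aligned with the query gets the largest weight, and the context vector has the same size as a single encoder output.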

<font color="purple"><b>Now implement the `LuongAttention` layer which gets `query` and `values` (as a batch input)
and returns both context vector $c_t$ and attention scores $a_t$ (for the input batch).</b></font>


In [None]:
#@title YOUR PART #3, Implement attention layer
class LuongAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super().__init__()
    # units == #units of encoder == #units of decoder (hidden size)
    # and also size of the context vector we want to calculate
    # define the learnable parameter here.
    ########################################
    #     Put your implementation here     #
    ########################################

  def call(self, query, values):
    # query (decoder hidden state) shape: (batch size, hidden size)
    # values shape: (batch size, sequence len, hidden size)
    ########################################
    #     Put your implementation here     #
    ########################################
    # context_vector shape: (batch size, hidden size)
    # attention_scores shape: (batch size, sequence len)
    return context_vector, attention_scores

In [None]:
attention_layer = LuongAttention(UNITS)
context_vector, attention_scores = attention_layer(sample_hidden, sample_output)

print("Attention result shape: (batch size, units) {}".format(context_vector.shape))
print("Attention weights shape: (batch_size, sequence_length) {}".format(attention_scores.shape))

Attention result shape: (batch size, units) (256, 300)
Attention weights shape: (batch_size, sequence_length) (256, 36)


The decoder is trickier! It includes:
* a char embedding
* a unidirectional GRU
* the attention layer
* a Dense layer 

On each `call`, the decoder computes one time step and returns the scores (logits) for the next character as its main output, but there is much more to consider in this single step.

The main part of the decoder is a GRU. At each step, a GRU accepts an input vector and a hidden state (typically its own previous state) and returns an output vector (which equals its new hidden state). In the diagram below, you can see the flow of state vectors (horizontal) and of input to output (vertical).

![](https://github.com/teias-courses/dl99/raw/gh-pages/assets/img/luong_att.png)

The decoder GRU receives the previous character from the decoder embedding, along with its own previous state. It then produces an output vector and a new state.

The state is fed back to the GRU in future steps and is also used as the query to get the context vector $c_t$ from the attention layer.

The GRU output is concatenated with the context vector so that both are available; we call the result $\tilde{h}_t$. This vector is fed to the dense layer to predict the next character.

Finally, to make the GRU aware of the decision made with the help of the attention mechanism, $\tilde{h}_t$ is concatenated to the input of the GRU at the next time step (dotted lines in the diagram).

**NOTE: The flow of our decoder is mostly determined by Luong's global attention mechanism. It may differ for other methods, such as the older [Bahdanau](https://arxiv.org/abs/1409.0473) attention.**
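
The shape bookkeeping of one decoder step can be sketched without any tensors. `decoder_step_shapes` is a hypothetical helper; it assumes the encoder outputs and the decoder state both have size `units`, and the example numbers are the hyperparameters and vocabulary size that appear in this notebook.

```python
def decoder_step_shapes(batch_size, embedding_dim, units, vocab_size):
    """Shape bookkeeping for one decoder step with input feeding (no tensors involved)."""
    shapes = {}
    shapes['embedded_char'] = (batch_size, 1, embedding_dim)
    # previous attentional vector = [GRU output ; context vector], hence 2 * units
    shapes['previous_context'] = (batch_size, 2 * units)
    # the embedding and the previous attentional vector are concatenated on the last axis
    shapes['gru_input'] = (batch_size, 1, embedding_dim + 2 * units)
    shapes['gru_state'] = (batch_size, units)
    shapes['context_vector'] = (batch_size, units)
    shapes['full_context'] = (batch_size, 2 * units)
    shapes['logits'] = (batch_size, vocab_size)
    return shapes

shapes = decoder_step_shapes(batch_size=256, embedding_dim=100, units=300, vocab_size=38)
for name, shape in shapes.items():
    print(name, shape)
```

This is why the initial input context is a zero tensor of width `2 * units`: it stands in for $\tilde{h}_t$ before the first step.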

In [None]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, units):
    super(Decoder, self).__init__()
    self.units = units
    self.vocab_size = vocab_size

    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.units,
                                   return_sequences=True,
                                   return_state=True)
    self.attention = LuongAttention(self.units)
    self.fc = tf.keras.layers.Dense(vocab_size)


  def initialize_input_context(self, batch_size):
    return tf.zeros((batch_size, self.units * 2))

  def call(self, x, hidden, enc_output, previous_context):
    # enc_output shape == (batch_size, max_length, hidden_size)

    # x shape after passing through embedding: (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation: (batch_size, 1, embedding_dim + 2 * hidden_size)
    x = tf.concat([tf.expand_dims(previous_context, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    gru_out, gru_state = self.gru(x, initial_state=hidden)
    gru_out = gru_state   # when len(x) == 1 these have equal values, but output has time axis of size 1

    context_vector, attention_weights = self.attention(gru_state, enc_output)

    # full_context shape: (batch_size, 2 * hidden_size)
    full_context = tf.concat([gru_out, context_vector], axis=1)

    # y (logits) shape: (batch_size, vocab_size)
    y = self.fc(full_context)

    return y, full_context, gru_state, attention_weights

In [None]:
decoder = Decoder(trg_vocab_size, EMBEDDING_DIM, UNITS)

sample_decoder_output, _,  _, _ = decoder(
    tf.zeros((BATCH_SIZE, 1)),
    sample_hidden, sample_output,
     decoder.initialize_input_context(BATCH_SIZE))

print('Decoder output shape: (batch size, vocab size) {}'.format(sample_decoder_output.shape))

Decoder output shape: (batch size, vocab size) (256, 38)


### Define the optimizer, loss function

To start training, we first need to define our loss function and optimizer. There are a few tricks for both:

* The loss function is **sparse** cross entropy, because we do not convert targets to one-hot vectors.
* There is no need to penalize the model for pad target chars, so we define a custom loss function based on `SparseCategoricalCrossentropy` that masks out pad chars.
* When training a model, it is often recommended to lower the learning rate as training progresses. We do this with the `ExponentialDecay` learning rate scheduler, so that training starts fast and converges slowly at the end.
* Do not try to change the learning rate value!
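
The masking idea can be sketched in plain Python for one time step of a small batch. `masked_sparse_ce` is a hypothetical helper that mirrors the behaviour described above: it computes the cross entropy from logits, zeroes the loss wherever the target is the pad id (assumed to be 0), and then averages over the whole batch.

```python
import math

def masked_sparse_ce(targets, logits, pad_id=0):
    """Mean over the batch of per-example cross entropy, with pad targets zeroed out.
    targets: list of int ids; logits: list of per-class score lists."""
    losses = []
    for y, z in zip(targets, logits):
        m = max(z)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
        ce = log_sum_exp - z[y]              # equals -log softmax(z)[y]
        losses.append(0.0 if y == pad_id else ce)
    return sum(losses) / len(losses)

targets = [2, 0]                             # the second target is the pad id
logits = [[0.0, 0.0, 0.0], [5.0, -1.0, 2.0]]
loss = masked_sparse_ce(targets, logits)
print(loss)
```

Note that the masked positions still count in the denominator of the mean, so batches with more padding report a smaller loss.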

In [None]:
#@title Define optimizer, loss function (do not change LR!)
LR = 0.002 #@param {type: "number"}

sparse_cross_entropy = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(target, pred):
  mask = tf.math.logical_not(tf.math.equal(target, 0))
  loss_ = sparse_cross_entropy(target, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)

# multiplies the learning rate by 'decay_rate' every 'decay_steps' steps (smoothly, since staircase defaults to False)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    LR,
    decay_steps=steps_per_epoch,
    decay_rate=0.95)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
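
To get a feel for the schedule, the decay formula can be evaluated by hand. The sketch below assumes the non-staircase form `initial_lr * decay_rate ** (step / decay_steps)`; `exponential_decay` and `toy_steps_per_epoch` are hypothetical names, and the value 55 is only an illustrative step count, not something computed by the notebook here.

```python
def exponential_decay(initial_lr, step, decay_steps, decay_rate):
    """Continuous exponential decay: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

toy_steps_per_epoch = 55   # hypothetical; the notebook derives steps_per_epoch from the data
epochs = list(range(0, 21, 5))
lrs = [exponential_decay(0.002, epoch * toy_steps_per_epoch, toy_steps_per_epoch, 0.95)
       for epoch in epochs]
for epoch, lr in zip(epochs, lrs):
    print(f'epoch {epoch:2d}: lr = {lr:.6f}')
```

With `decay_steps` equal to the number of steps per epoch, the learning rate shrinks by a factor of 0.95 per epoch.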

### Training

Training has its own story! We train our model following these main steps:

1. Pass the input through the encoder, which returns the **encoder output** and the **encoder hidden state**.
2. The **encoder output**, the **encoder hidden state** and the **decoder input** (which is the `start_char`) are passed to the decoder.
3. The decoder returns the **predictions**, the **context vector** and the **decoder hidden state**.
4. The predictions are used to calculate the loss, and the other decoder outputs are fed back to the decoder to generate the next character. For the following time steps, the decoder input is decided by **teacher forcing**.
5. The final step is to compute the gradients by backpropagation and apply them with the optimizer.
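
The difference between the two ways of choosing the decoder input can be sketched on toy ids, without any model. Both helpers are hypothetical: `teacher_forcing_pairs` pairs each ground-truth input with the label it should predict, while `free_running_inputs` feeds back whatever a stand-in `predict_next` function returns.

```python
def teacher_forcing_pairs(target_ids):
    """Pair each decoder input with the label it should predict.
    With teacher forcing, the input at step t is the ground-truth token t-1."""
    return list(zip(target_ids[:-1], target_ids[1:]))

def free_running_inputs(start_id, predict_next, n_steps):
    """Without teacher forcing, each input is the model's own previous prediction."""
    inputs, current = [], start_id
    for _ in range(n_steps):
        inputs.append(current)
        current = predict_next(current)
    return inputs

target = [1, 4, 5, 6, 2, 0]          # toy ids: 1 = start, 2 = end, 0 = pad
pairs = teacher_forcing_pairs(target)

# hypothetical "model" that always predicts id 4, so it goes wrong after the first step
own_inputs = free_running_inputs(start_id=1, predict_next=lambda _: 4, n_steps=5)
print(pairs)
print(own_inputs)
```

In the first case the inputs always follow the target sequence; in the second, one wrong prediction changes every input that comes after it.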

<font color="purple"><b>Read (and think) about teacher forcing in seq2seq models and explain why it is needed when training them.

(It is associated with a single line in the training step function below.)</b></font>

<font color="purple"><b>##### PUT YOUR ANSWER HERE! #####</b></font>


In [None]:
#@title train step
@tf.function
def train_step(inp, targ, enc_hidden):
  
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims([ch2id[start_char]] * BATCH_SIZE, 1)
    dec_input_context = decoder.initialize_input_context(BATCH_SIZE)
    
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_input_context, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output, dec_input_context)

      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing: feeding the target as the next input
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables
  gradients = tape.gradient(loss, variables)
  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss

In [None]:
#@title validation step
import numpy as np
def val_step(inp, targ, enc_hidden):
  
  loss = 0
  enc_output, enc_hidden = encoder(inp, enc_hidden)

  dec_hidden = enc_hidden

  dec_input = tf.expand_dims([ch2id[start_char]] * BATCH_SIZE, 1)
  dec_input_context = decoder.initialize_input_context(BATCH_SIZE)

  all_predictions = np.zeros((inp.shape))

  for t in range(1, targ.shape[1]):
    # passing enc_output to the decoder
    predictions, dec_input_context, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output, dec_input_context)

    loss += loss_function(targ[:, t], predictions)
    predicted_ids = tf.argmax(predictions, axis=-1)

    # no teacher forcing: feeding its own preds to next step
    dec_input = tf.expand_dims(predicted_ids, 1)

    all_predictions[:, t] = predicted_ids.numpy()

  batch_loss = (loss / int(targ.shape[1]))

  return batch_loss, all_predictions

In [None]:
#@title train loop
EPOCHS = 20 #@param {type: "integer"}

# metrics used for logging
train_loss = tf.keras.metrics.Mean('train_loss')
val_loss = tf.keras.metrics.Mean('val_loss')

for epoch in range(EPOCHS):

  train_loss.reset_states()
  val_loss.reset_states()

  enc_hidden = encoder.initialize_hidden_state(BATCH_SIZE)
  
  progress = tqdm(dataset, desc=f'Epoch {epoch+1}', total=steps_per_epoch)
  for inp, targ in progress:
    batch_loss = train_step(inp, targ, enc_hidden)
    train_loss.update_state(batch_loss)

    progress.set_postfix(loss=batch_loss.numpy())

  for inp, targ in val_dataset:
    batch_loss, _ = val_step(inp, targ, enc_hidden)
    val_loss.update_state(batch_loss)

  progress.set_postfix(
      train_loss=train_loss.result().numpy(),
      val_loss=val_loss.result().numpy(),
      )
  progress.close()

### Evaluation

In [None]:
#@title inference functions
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

def evaluate(text):
  attention_map = np.zeros((MAX_LEN, MAX_LEN))

  text = start_char + text + end_char

  inputs = [ch2id[ch] for ch in text]
  inputs = tf.keras.preprocessing.sequence.pad_sequences(
      [inputs], maxlen=MAX_LEN, padding='post')
  inputs = tf.convert_to_tensor(inputs)

  result = ''

  hidden = tf.zeros((1, UNITS))
  enc_out, enc_hidden = encoder(inputs, hidden)

  dec_hidden = enc_hidden
  dec_input = tf.expand_dims([ch2id[start_char]], 0)
  dec_input_context = decoder.initialize_input_context(1)

  for t in range(MAX_LEN):
    predictions, dec_input_context, dec_hidden, attention_weights = decoder(
        dec_input,
        dec_hidden,
        enc_out,
        dec_input_context)

    # storing the attention weights to plot later on
    attention_weights = tf.reshape(attention_weights, (-1, ))
    attention_map[t] = attention_weights.numpy()

    predicted_id = tf.argmax(predictions[0]).numpy()

    if id2ch[predicted_id] == end_char:
      return result, text, attention_map

    result += id2ch[predicted_id]

    # the predicted ID is fed back into the model
    dec_input = tf.expand_dims([predicted_id], 0)

  return result, text, attention_map

def plot_attention(attentions, inputs, predicted):

  fig = plt.figure(figsize=(10,10))
  ax = fig.add_subplot(1, 1, 1)
  attentions = attentions[:len(predicted),:len(inputs)]
  ax.matshow(attentions, cmap='viridis')

  fontdict = {'fontsize': 14}

  ax.set_xticklabels(['_'] + list(inputs.ljust(MAX_LEN)), fontdict=fontdict)
  ax.set_yticklabels(['_'] + list(predicted.ljust(MAX_LEN)), fontdict=fontdict)

  ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
  ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

  plt.show()

In [None]:
#@title get output, plot attention
text = '\u0628\u0634\u0648\u0646 \u0627\u0632 \u0646\u06CC \u0686\u0646\u0648 \u0647\u06A9\u0627\u06CC\u062A \u0645\u06CC\u0646\u06A9\u062F'  #@param {type:"string"}
text = preprocess_text(text)

result, text, attentions = evaluate(text)
print(result)
plot_attention(attentions, text, result)

<font color="purple"><b>Now that you can feed the model arbitrary input, answer the following questions based on your observations of the outputs and attention maps of the model.

0. How can we make sure that the model does not memorize the training verses instead of learning to correct the input spelling?
1. When correcting homophonic perturbations, does it replace chars with the most frequent homophonic chars, or does it consider the correct form of the words?
2. Does the model fix the spelling of a word based on its context, or has it learned something like a word dictionary?
3. How does the model handle swapped chars?
4. Show some model failures (inputs, outputs) and explain why the model fails on those specific inputs.

If an answer is based on the output you got from the model, write it after a code cell showing your input to the model, the model output, and the attention map.</b></font>

In [None]:
text = ''  #@param {type:"string"}
text = preprocess_text(text)

result, text, attentions = evaluate(text)
print(result)
plot_attention(attentions, text, result)

<font color="purple"><b>##### PUT YOUR ANSWER TO Q0 HERE! #####</b></font>



In [None]:
text = ''  #@param {type:"string"}
text = preprocess_text(text)

result, text, attentions = evaluate(text)
print(result)
plot_attention(attentions, text, result)

<font color="purple"><b>##### PUT YOUR ANSWER TO Q1 HERE! #####</b></font>



In [None]:
text = ''  #@param {type:"string"}
text = preprocess_text(text)

result, text, attentions = evaluate(text)
print(result)
plot_attention(attentions, text, result)

<font color="purple"><b>##### PUT YOUR ANSWER TO Q2 HERE! #####</b></font>



In [None]:
text = ''  #@param {type:"string"}
text = preprocess_text(text)

result, text, attentions = evaluate(text)
print(result)
plot_attention(attentions, text, result)

<font color="purple"><b>##### PUT YOUR ANSWER TO Q3 HERE! #####</b></font>


In [None]:
text = ''  #@param {type:"string"}
text = preprocess_text(text)

result, text, attentions = evaluate(text)
print(result)
plot_attention(attentions, text, result)

<font color="purple"><b>##### PUT YOUR ANSWER TO Q4 HERE! #####</b></font>


# Submission

Congratulations! You finished the assignment & you're ready to submit your work. Please follow the instructions:

1. Check and review your answers. Make sure all of the cell outputs are what you want. 
2. Select File > Save.
3. **Fill in your information** & run the cell below.
4. Run the **Make Submission** cell. It may take several minutes and may ask you for your credentials.
5. Run the **Download Submission** cell to obtain your submission as a zip file.
6. Grab the downloaded file (`dl_asg04__xx__xx.zip`) and hand it in on Microsoft Teams.

## Fill your information (Run the cell)

In [None]:
#@title Enter your information & "RUN the cell!!" { run: "auto" }
student_id = "" #@param {type:"string"}
student_name = "" #@param {type:"string"}

print("your student id:", student_id)
print("your name:", student_name)


from pathlib import Path

ASSIGNMENT_PATH = Path('asg04')
ASSIGNMENT_PATH.mkdir(parents=True, exist_ok=True)

## Make Submission (Run the cell)

In [None]:
#@title Make submission
! pip install -U --quiet PyDrive > /dev/null
! pip install -U --quiet jdatetime > /dev/null

# ! wget -q https://github.com/github/hub/releases/download/v2.10.0/hub-linux-amd64-2.10.0.tgz 


import os
import time
import yaml
import json
import jdatetime

from google.colab import files
from IPython.display import Javascript
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

asg_name = 'Assignment_4'
script_save = '''
require(["base/js/namespace"],function(Jupyter) {
    Jupyter.notebook.save_checkpoint();
});
'''
# repo_name = 'iust-deep-learning-assignments'
submission_file_name = 'dl_asg04__%s__%s.zip'%(student_id, student_name.lower().replace(' ',  '_'))

sub_info = {
    'student_id': student_id,
    'student_name': student_name, 
    'dateime': str(jdatetime.date.today()),
    'asg_name': asg_name
}
json.dump(sub_info, open('info.json', 'w'))

Javascript(script_save)

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
file_id = drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('%s.ipynb'%asg_name) 

! jupyter nbconvert --to script "$asg_name".ipynb > /dev/null
! jupyter nbconvert --to html "$asg_name".ipynb > /dev/null
! zip "$submission_file_name" "$asg_name".ipynb "$asg_name".html "$asg_name".txt info.json > /dev/null

print("##########################################")
print("Done! Submission created. Please download it using the cell below!")

In [None]:
files.download(submission_file_name)
