# Neural subject tagger: seq2seq with attention

This notebook trains a sequence to sequence (seq2seq) model for finding the subject (SBJ) in a sentence. This is an advanced example that assumes some knowledge of sequence to sequence models.

After training the model in this notebook, you will be able to input a Hebrew sentence, such as *"באנציקלופדיה אנחנו משתדלים לדווח כמה שפחות על דברים שקרו לאחרונה"*, and return the subject words: *"אנחנו"*

The tagging quality is reasonable for a toy example, but the generated attention plot is perhaps more interesting. It shows which parts of the input sentence have the model's attention while tagging.

For the sentence:
```"הוא לא יודע מה עובר עליי"```

![example_sentences_tagged](img/result_example.png "sentences_tagged")

Note: This example takes approximately 90 minutes to run on a single GTX 1060 GPU.

In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals
#!pip install tensorflow-gpu==2.0.0-alpha0
import tensorflow as tf

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import io
import time
import pandas as pd
import logging
import gensim
from scipy import spatial

orginal_sentences_oath_file = 'Data/Hebrew/SVLM_Hebrew_Wikipedia_Corpus.txt'
tagged_sentences_path_file = 'Data/Hebrew/parsed.txt'
csv_hebrew_sentences_path_file = 'Data/Hebrew/Hebrew_tagged_sentences.csv'
NUM_EXAMPLES = 50000
EMBEDDING_DIM = 300

## Tagging and Preprocessing the dataset

We'll use a [corpus](https://github.com/NLPH/SVLM-Hebrew-Wikipedia-Corpus/blob/master/SVLM_Hebrew_Wikipedia_Corpus.txt) built by Dr. Vered Silber-Varod and Prof. Ami Moyal as part of their work on this [paper](https://github.com/NLPH/SVLM-Hebrew-Wikipedia-Corpus/blob/master/Phonemes_freqency_Silber-Varod-Latin-Moyal.pdf).

The SVLM Hebrew Wikipedia Corpus is made up of 50,000 Hebrew sentences from the Hebrew Wikipedia, chosen to ensure phoneme coverage for a sentence recording project.

The corpus contains sentences in the format:
```
פרסומים עיקריים מקורות המחשבה הצבאית המודרנית משרד הביטחון ההוצאה לאור
קישורים חיצוניים אתר האינטרנט של המרכז הירושלמי לענייני ציבור ומדינה
```
Because it was generated from Hebrew Wikipedia sources, which are licensed under the [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license, the corpus is necessarily licensed under the same license.

### Tagging

To tag the sentences we'll use the Hebrew Dependency Parser ([Yoav Goldberg, September 2011](https://www.cs.bgu.ac.il/~yoavg/software/hebparsers/hebdepparser/)).

For example, given the input:
```
קורות חייו צמרת למד בחוגים לחינוך ופילוסופיה באוניברסיטה העברית בירושלים
```
the parser produces a parse like the following (written to `Data/Hebrew/parsed.txt`):

![sentences_tagged](img/sentences_tagged.png "sentences_tagged")

We keep only the subject word (SBJ):

![sentences_tagged_SBJ](img/sentences_tagged_sbj.png "sentences_tagged")

### Preprocessing

Here are the steps we'll take to prepare the data:

1. Extract the subject of each tagged sentence.
2. Flatten the list of subjects into one string.
3. Match each subject to its original sentence.
4. Write the pairs to a CSV file.

## Prepare the dataset

The CSV contains subject and sentence pairs. The cells below build it and then display its first rows.

In [3]:
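# Extract the subject (SBJ) tokens of each tagged sentence
def extract_sbj(sentences):
    listoflists = []   # one list of subject strings per sentence
    sublist = []       # subjects of the sentence currently being read
    for line in sentences:
        if line.find("SBJ") != -1:
            # keep only Hebrew letters and double quotes from the tagged line
            line = re.sub(r"[^א-ת\"]+", " ", line)
            if len(line) > 3:
                sublist.append(line)
        elif line == '\n':
            # a blank line closes the current sentence
            listoflists.append(sublist)
            sublist = []
    return listoflists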

# Flatten the list of subjects into one space-separated string
def listToString(s):
    return " ".join(s)

# Match each subject to its original sentence
def make_pairs(orginal_sentences,tagged_sentences,num_examples):
    subjects = extract_sbj(tagged_sentences)
    clear_lines = []
    for tag, sentence in zip(subjects[:num_examples],orginal_sentences[:num_examples]):
        clear_lines.append(listToString(tag) + '\t' + sentence)
    subject = []
    sentences = []
    for line in clear_lines:
        line = line.split('\t')
        sentences.append(line[1])
        subject.append(line[0])
    return subject, sentences

def make_csv_subject_sentence(orginal_sentences,tagged_sentences, csv_path,num_examples):
    subjects, sentences = make_pairs(orginal_sentences,tagged_sentences,num_examples)
    trainDF = pd.DataFrame()
    trainDF['subject'] = subjects[:num_examples]
    trainDF['sentence'] = sentences[:num_examples]
    mask = (trainDF['subject'].str.len()>1)
    trainDF = trainDF.loc[mask]
    trainDF.to_csv(csv_path)

In [7]:
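with open(orginal_sentences_oath_file, encoding='utf-8') as sentences_file:  # UTF-8 is assumed
    orginal_sentences = sentences_file.readlines()
with open(tagged_sentences_path_file, encoding='utf-8') as tagged_file:
    tagged_sentences = tagged_file.readlines()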
make_csv_subject_sentence(orginal_sentences,tagged_sentences,csv_hebrew_sentences_path_file,NUM_EXAMPLES)

In [8]:
df = pd.read_csv(csv_hebrew_sentences_path_file,index_col=[0])
NUM_EXAMPLES = df.shape[0]

In [9]:
df.head()

| | subject | sentence |
|---|---|---|
| 0 | קורות | קורות חייו צמרת למד בחוגים לחינוך ופילוסופיה ב... |
| 3 | פרסומים | פרסומים עיקריים מקורות המחשבה הצבאית המודרנית ... |
| 7 | מרוקו | מרוקו הצטרפה למשפחת האירוויזיון בפעם הראשונה ו... |
| 12 | קו | כתוצאה ממלחמת העצמאות ובעקבות הסכמי שביתת הנשק... |
| 13 | הסבר | הסבר לחשיבות הקריטריון ניתן למצוא בדיון שנערך ... |


### Preparing and cleaning

Here are the steps we'll take to prepare the data:

1. Add a *start* and *end* token to each sentence.
2. Clean the sentences by removing special characters.
3. Create a word index and reverse word index (dictionaries mapping from word → id and id → word).
4. Pad each sentence to a maximum length.
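
Steps 3 and 4 can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's implementation (which takes its word indexes from the Word2Vec vocabulary further below); the helper names `build_word_index` and `pad_post` and the English sample sentences are hypothetical.

```python
# Minimal sketch of steps 3 and 4: build word <-> id mappings and pad sequences.
# build_word_index and pad_post are hypothetical helpers for illustration only.

def build_word_index(tokenized_sentences):
    """Assign ids in order of first appearance; return (word2id, id2word)."""
    word2id = {}
    for sentence in tokenized_sentences:
        for word in sentence:
            if word not in word2id:
                word2id[word] = len(word2id)
    id2word = {idx: word for word, idx in word2id.items()}
    return word2id, id2word


def pad_post(sequence, max_len, pad_value):
    """Pad a list of ids at the end with pad_value up to max_len."""
    return sequence + [pad_value] * (max_len - len(sequence))


sentences = [
    ["<start>", "he", "knows", "<end>"],
    ["<start>", "we", "try", "to", "report", "<end>"],
]
word2id, id2word = build_word_index(sentences)
max_len = max(len(s) for s in sentences)

# Pad with the id of '<end>', mirroring the padding value used in this notebook.
padded = [pad_post([word2id[w] for w in s], max_len, word2id["<end>"]) for s in sentences]
print(padded)
```

Padding with the `<end>` id keeps every row the same length without introducing an extra padding token into the vocabulary.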

In [7]:
# Strip combining marks (Unicode category 'Mn', e.g. accents) after NFD normalization
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')


def preprocess_sentence_Hebrew(w):
    # replace every run of characters other than Hebrew letters (א-ת)
    # and the double quote with a single space
    w = re.sub(r"[^א-ת\"]+", " ", w)

    w = w.strip()

    # add a start and an end token to the sentence
    # so that the model knows when to start and stop predicting
    w = '<start> ' + w + ' <end>'
    return w

In [8]:
def preprocess_sentence_English(w):
    w = unicode_to_ascii(w.lower().strip())

    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)

    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

    w = w.strip()

    # adding a start and an end token to the sentence
    # so that the model knows when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w

In [9]:
example_sentence = preprocess_sentence_Hebrew("מבחינתי אפשר להעלות אותו לרשימת ההמתנה לשמוע הערות ואח\"כ להצבעה")
print(example_sentence)
example_sentence = [subject for subject in example_sentence.split()]

<start> מבחינתי אפשר להעלות אותו לרשימת ההמתנה לשמוע הערות ואח"כ להצבעה <end>


In [10]:
def max_length(tensor):
    return max(len(t) for t in tensor)

In [11]:
def create_dataset_from_csv(csv_path, num_examples):
    df = pd.read_csv(csv_path,index_col=[0])
    subjects = [preprocess_sentence_Hebrew(subject) for subject in df['subject']]  
    sentences = [preprocess_sentence_Hebrew(sentence) for sentence in df['sentence']] 
    subjects = [subject.split() for subject in subjects]
    sentences = [sentence.split() for sentence in sentences]
    return subjects, sentences,df

In [12]:
subjects, sentences,df = create_dataset_from_csv(csv_hebrew_sentences_path_file, NUM_EXAMPLES)

In [13]:
def create_dataset_from_csv_new(csv_path, num_examples):
    df = pd.read_csv(csv_path,index_col=[0])
    subjects = df['subject']
    sentences = [preprocess_sentence_Hebrew(sentence) for sentence in df['sentence']] 
    subjects = [subject.split() for subject in subjects]
    sentences = [sentence.split() for sentence in sentences]
    return subjects, sentences,df

In [14]:
subjects_100, sentences_100,df_100 = create_dataset_from_csv_new(csv_hebrew_sentences_path_file, 100)

In [15]:
new_ss = [a[0] for a in subjects_100]
new_ss

['קורות',
 'פרסומים',
 'מרוקו',
 'קו',
 'הסבר',
 'פקולטה',
 'התנהגות',
 'הוא',
 'עשה',
 'השפעת',
 'שרידי',
 'מספר',
 'דיון',
 'ערכים',
 'אביטל',
 'מאמץ',
 'חוקרי',
 'גאוגרפיה',
 'ריצ',
 'אנו',
 'פנטום',
 'קרטר',
 'בריטניה',
 'מסחר',
 'אינדקס',
 'קישורים',
 'מדינה',
 'טכנולוגיה',
 'דיסקוגרפיה',
 'קיסר',
 'קטגוריית',
 'קישורים',
 'חסידות',
 'ויקיפדיה',
 'היסטוריונים',
 'חלקן',
 'טיל',
 'מחשבים',
 'הקמת',
 'יוצאי',
 'לוח',
 'ערכים',
 'הגדרה',
 'בנים',
 'ערך',
 'עוד',
 'הוא',
 'כלים',
 'יקיחדשות',
 'זאת',
 'יצירה',
 'פתרונות',
 'הקמת',
 'מדיניות',
 'מלחמת',
 'נובגורוד',
 'חנויות',
 'האם',
 'תקופה',
 'אדולף',
 'מנהל',
 'צומת',
 'קישורים',
 'ערך',
 'ג',
 'חפירות',
 'רפורמציה',
 'כנסיות',
 'ביוגרפיה',
 'ערכים',
 'ג',
 'תואר',
 'קשרים',
 'הצבעת',
 'הצעת',
 'לימודי',
 'חובבי',
 'אקראית',
 'הם',
 'משפחה',
 'מערכת',
 'משתמשים',
 'אתה',
 'הוא',
 'מחבר',
 'תשומת',
 'מדיניות',
 'התייחסות',
 'ערכים',
 'מסמך',
 'אלכסנדר',
 'החלטת',
 'הוא',
 'תאוריה',
 'פיגועים',
 'קיימות',
 'קריירה',
 'ארגנטינה',
 'הכ

In [16]:
word_model = gensim.models.Word2Vec(subjects+sentences, size=EMBEDDING_DIM, min_count=1, window=5, iter=100)

In [17]:
vectors = [word_model.wv[a] for a in new_ss]

In [18]:
trainDF = pd.DataFrame()
trainDF['vector'] = vectors
trainDF['subject'] = new_ss

In [35]:
trainDF['vector'][0]

array([ 0.00398569, -1.8901986 , -0.04167079,  0.5108059 , -0.69707596,
        0.40042928, -1.231232  , -1.6347198 ,  0.09509303, -1.2260326 ,
       -0.03563622,  1.1412473 ,  0.9325855 , -0.50731283,  0.73774457,
       -0.18820085, -0.57643193,  0.6506263 ,  0.4773282 , -0.22409096,
       -0.37280095,  0.87427264, -0.22967365, -0.91149324,  0.32480574,
        0.80843794, -0.9488527 ,  1.5277755 ,  1.1899632 , -0.31318113,
        1.2684431 , -0.5371515 ,  0.12485521,  1.4042177 , -0.10658001,
       -1.1739335 ,  0.06208733, -0.27067086,  0.44054776, -0.9853829 ,
        0.30536523,  0.45870048,  1.0198349 ,  0.57741475,  0.45780244,
       -0.39679253, -2.4860334 , -0.4869366 , -0.36467692,  0.11694994,
       -1.1362572 , -0.48502183, -0.7076488 ,  0.09906752, -0.6212556 ,
        0.33545193,  0.51957005,  1.006085  ,  0.18142621, -0.5473227 ,
       -0.4349027 ,  1.7716393 , -0.4017008 ,  0.13678183,  0.81687164,
        0.097845  ,  0.19195746,  0.98078406, -0.50224024, -0.21

In [47]:
# new_csv_path = "Data/Hebrew/Subjects_vector.csv"
# trainDF.to_csv(new_csv_path)

In [66]:
from tqdm import tqdm
Txt_file = "Data/Hebrew/Subjects_vector.txt"
# write each subject and its vector to a text file; the context manager closes the file
with open(Txt_file, "a", encoding="utf-8") as f:
    for idx, i in enumerate(tqdm(vectors[:100])):
        f.write("THE SUBJECT: " + new_ss[idx] + '\n')
        f.write('\n')
        f.write('\n')
        f.write("VECTOR START >>>>>>>>>>>>>>>" + str(i) + "VECTOR END<<<<<<<<<<<<<<" + '\n')
        f.write('\n')


  0%|          | 0/100 [00:00<?, ?it/s]
 44%|████▍     | 44/100 [00:00<00:00, 432.14it/s]
100%|██████████| 100/100 [00:00<00:00, 444.90it/s]


In [27]:
trainDF[:10]

| | vector | subject |
|---|---|---|
| 0 | [1.6848079, -0.013529115, -0.4853551, -1.30793...] | קורות |
| 1 | [-0.3767081, 0.03185713, -0.5402789, -0.347758...] | פרסומים |
| 2 | [-0.042412773, -0.017034844, 0.1894208, -0.136...] | מרוקו |
| 3 | [0.7955116, -0.111858465, -0.1133503, 0.665261...] | קו |
| 4 | [-1.663684, 0.344732, 0.8169256, -0.73074293, ...] | הסבר |
| 5 | [-0.013377348, 0.011749349, -0.051014815, -0.0...] | פקולטה |
| 6 | [-0.4862766, 0.11782753, 0.7233255, 0.610386, ...] | התנהגות |
| 7 | [-0.052913904, -0.62564147, 0.93190145, 0.2648...] | הוא |
| 8 | [0.0073105013, -0.8763988, 0.6765104, 0.274296...] | עשה |
| 9 | [0.53223413, -0.23286606, -0.14494942, -0.1577...] | השפעת |


## Cross-validation: evaluating estimator performance

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that just repeats the labels of the samples it has seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called **overfitting**. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set `x_test, y_test`. Note that the word "experiment" is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally. Here is a flowchart of a typical cross-validation workflow in model training. The best parameters can be determined by grid search techniques.

<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" alt="Drawing" style="width: 500px;"/>

*Grid Search Workflow*

In scikit-learn a random split into training and test sets can be quickly computed with the **`train_test_split`** helper function.

### Parameters

#### arrays:
sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

#### test_size:
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
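
To make the `test_size` rule concrete, the sketch below resolves a `test_size` value into train and test counts. It is a simplified illustration of the behaviour described above rather than scikit-learn's own code; `resolve_split_sizes` is a hypothetical helper, and rounding a fractional test count up is an assumption.

```python
import math


def resolve_split_sizes(n_samples, test_size=None):
    """Return (n_train, n_test) for a hold-out split.

    Hypothetical helper: a float test_size is a proportion, an int is an
    absolute count, and None falls back to 0.25 (train_size is assumed unset).
    """
    if test_size is None:
        test_size = 0.25
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must lie between 0.0 and 1.0")
        # assumption: fractional test counts are rounded up
        n_test = math.ceil(test_size * n_samples)
    else:
        n_test = int(test_size)
    return n_samples - n_test, n_test


print(resolve_split_sizes(100, 0.25))  # proportion of the data
print(resolve_split_sizes(100, 30))    # absolute number of test samples
print(resolve_split_sizes(10))         # default proportion of 0.25
```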

Let's split the tagged sentences into training and test sets:

In [12]:
sentences_train, sentences_test, subjects_train, subjects_test =\
train_test_split(sentences, subjects, test_size = 0.25)

## Training the Word2Vec model

Training the model is fairly straightforward: instantiate `Word2Vec` and pass it the tokenized subjects and sentences prepared above. We are essentially passing a list of lists, where each inner list contains the tokens of one subject or sentence. Word2Vec uses all these tokens to build its vocabulary internally, that is, the set of unique words.

### Understanding some of the parameters
To train the model earlier, we had to set some parameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to train the model.

`word_model = gensim.models.Word2Vec(subjects+sentences, size=EMBEDDING_DIM, min_count=1, window=5, iter=100)`

#### size
The size of the dense vector that represents each token or word. If you have very limited data, size should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me; this notebook uses `EMBEDDING_DIM = 300`.

#### window
The maximum distance between the target word and a neighboring word. Neighbors that lie farther from the target than the window width, to the left or to the right, are not considered related to the target word. In theory, a smaller window should give you terms that are more related. If you have lots of data, the window size should not matter too much, as long as it's a decent-sized window.

<img src="https://miro.medium.com/max/1600/0*1uA0SYcKU_dLTj-V.png" alt="Drawing" style="width: 500px;"/>
<center>C is the window size</center>
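
The effect of the window size can be illustrated with a few lines of plain Python. This is a simplified sketch of the idea, not gensim's implementation; `context_pairs` is a hypothetical helper and the tokens are placeholders.

```python
def context_pairs(tokens, window):
    """Return (target, neighbor) pairs whose distance is at most `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        # clip the window at the sentence boundaries
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["a", "b", "c", "d", "e"]
print(context_pairs(tokens, window=1))
print(len(context_pairs(tokens, window=2)))
```

A larger window produces more pairs per target word, including pairs between words that are further apart and therefore often less closely related.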


#### min_count
Minimum frequency count of words. The model ignores words that do not satisfy `min_count`. Extremely infrequent words are usually unimportant, so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model. This notebook sets `min_count=1`, which keeps every word.
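
As a rough illustration of this filter, the sketch below counts word frequencies and keeps only the words that reach the threshold. It is not gensim's code; `filter_by_min_count` and the toy corpus are hypothetical.

```python
from collections import Counter


def filter_by_min_count(tokenized_sentences, min_count):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
print(sorted(filter_by_min_count(corpus, min_count=1)))  # every word is kept
print(sorted(filter_by_min_count(corpus, min_count=2)))  # rare words are dropped
```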

#### workers
The number of worker threads used for training. The command above does not set it, so the library default applies.

#### iter
Number of iterations (epochs) over the corpus.

### When should you use Word2Vec?
There are many application scenarios for Word2Vec. Imagine if you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that. You have a lexicon for not just sentiment, but for most words in the vocabulary.

Beyond raw unstructured text data, you could also use Word2Vec for more structured data. For example, if you had tags for a million Stack Overflow questions and answers, you could find tags that are related to a given tag and recommend the related ones for exploration. You can do this by treating each set of co-occurring tags as a "sentence" and training a Word2Vec model on this data. Granted, you still need a large number of examples to make it work.

In [13]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
word_model = gensim.models.Word2Vec(subjects+sentences, size=EMBEDDING_DIM, min_count=1, window=5, iter=100)

In [14]:
# Each word in the vocabulary has an index
def word2idx(word, word_model):
    if word in word_model.wv.vocab:
        return word_model.wv.vocab[word].index
    else:
        return 1  # default index for words missing from the Word2Vec vocabulary (same index as the padding value)
    
def idx2word(idx, word_model):
  return word_model.wv.index2word[idx]

In [15]:
def convert(sentence, word_model):
  for w in sentence:
      print ("%d ----> %s" % (word2idx(w,word_model),w))

In [16]:
#padding the sentences with <end> tag index
def sentence_to_indexes(sentence,word_model,max_length_inp):
    data_set = [word2idx(w,word_model) for w in sentence]
    data_set = tf.keras.preprocessing.sequence.pad_sequences([data_set],max_length_inp, padding='post', value=1)
    return data_set

In [17]:
# create a matrix of word indexes for the dataset
def create_dataset_word2vec_matrix(sentences,vec_model,max_length_inp):
    # map every word of every sentence to its index
    data_set = [[word2idx(word,vec_model) for word in sentence] for sentence in sentences]
    # pad each row at the end with index 1 up to max_length_inp
    data_set = tf.keras.preprocessing.sequence.pad_sequences(data_set,max_length_inp, padding='post', value=1)
    return data_set

In [18]:
max_sentence_len, max_subjects_len  = max_length(sentences_train), max_length(subjects_train)
x_train = create_dataset_word2vec_matrix(sentences_train,word_model,max_sentence_len)
y_train = create_dataset_word2vec_matrix(subjects_train,word_model,max_subjects_len)
x_test = sentences_test
y_test = subjects_test
max_length_inp, max_length_targ  = x_train.shape[1], y_train.shape[1]

In [19]:
#example sentence
convert(example_sentence,word_model)
sentence_to_indexes(example_sentence,word_model,max_length_inp)

0 ----> <start>
233 ----> מבחינתי
79 ----> אפשר
821 ----> להעלות
48 ----> אותו
4239 ----> לרשימת
10519 ----> ההמתנה
1130 ----> לשמוע
403 ----> הערות
14250 ----> ואח"כ
1619 ----> להצבעה
1 ----> <end>


array([[    0,   233,    79,   821,    48,  4239, 10519,  1130,   403,
        14250,  1619,     1,     1,     1]], dtype=int32)

In [20]:
#input matrix
x_train, x_train.shape

(array([[   0,   13,    3, ...,    1,    1,    1],
        [   0,  240, 1376, ...,    1,    1,    1],
        [   0,  151, 5459, ...,    1,    1,    1],
        ...,
        [   0,  910, 4251, ...,    1,    1,    1],
        [   0,   18,   12, ...,    1,    1,    1],
        [   0,    2,    4, ...,    1,    1,    1]], dtype=int32), (26876, 14))

In [21]:
BUFFER_SIZE = len(x_train)
BATCH_SIZE = 64
steps_per_epoch = len(x_train)//BATCH_SIZE
embedding_dim = EMBEDDING_DIM
units = 1024
pretrained_weights = word_model.wv.syn0
vocab_size = word_model.wv.syn0.shape[0]
emdedding_size = word_model.wv.syn0.shape[1]

  
  


In [22]:
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(len(y_train))
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [23]:
example_input_batch, _ = next(iter(dataset))
example_input_batch.shape

TensorShape([64, 14])

## Write the encoder and decoder model

Implement an encoder-decoder model with attention, which you can read about in the TensorFlow [Neural Machine Translation (seq2seq) tutorial](https://github.com/tensorflow/nmt). This example uses a more recent set of APIs. The notebook implements the [attention equations](https://github.com/tensorflow/nmt#background-on-the-attention-mechanism) from the seq2seq tutorial. The following diagram shows that each input word is assigned a weight by the attention mechanism, which the decoder then uses to predict the next word in the sentence. The picture and formulas below are an example of an attention mechanism from [Luong's paper](https://arxiv.org/abs/1508.04025v5).

<img src="https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg" width="500" alt="attention mechanism">

The input is put through an encoder model which gives us the encoder output of shape *(batch_size, max_length, hidden_size)* and the encoder hidden state of shape *(batch_size, hidden_size)*.

Here are the equations that are implemented:

<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_0.jpg" alt="attention equation 0" width="800">
<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_1.jpg" alt="attention equation 1" width="800">

This notebook uses [Bahdanau attention](https://arxiv.org/pdf/1409.0473.pdf) in the decoder, computed over the encoder outputs. Let's decide on notation before writing the simplified form:

* FC = Fully connected (dense) layer
* EO = Encoder output
* H = hidden state
* X = input to the decoder

And the pseudo-code:

* `score = FC(tanh(FC(EO) + FC(H)))`
* `attention weights = softmax(score, axis = 1)`. Softmax is applied on the last axis by default, but here we want to apply it on *axis 1*, since the shape of score is *(batch_size, max_length, 1)*. `max_length` is the length of our input. Since we are trying to assign a weight to each input position, softmax should be applied on that axis.
* `context vector = sum(attention weights * EO, axis = 1)`. Same reason as above for choosing axis as 1.
* `embedding output` = The input to the decoder X is passed through an embedding layer.
* `merged vector = concat(embedding output, context vector)`
* This merged vector is then given to the LSTM
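
The first three lines of the pseudo-code can be traced with NumPy to check the shapes. This is a sketch under stated assumptions, not the notebook's TensorFlow model: the three FC layers are replaced by random weight matrices without biases, the sizes are made up, and a single hidden-state vector stands in for the concatenated LSTM states.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_length, hidden_size, attn_units = 2, 5, 4, 3  # made-up sizes

# Random stand-ins for the encoder output (EO) and the hidden state (H).
enc_output = rng.normal(size=(batch_size, max_length, hidden_size))
hidden = rng.normal(size=(batch_size, hidden_size))

# Random weight matrices standing in for the three FC layers (biases omitted).
W1 = rng.normal(size=(hidden_size, attn_units))
W2 = rng.normal(size=(hidden_size, attn_units))
V = rng.normal(size=(attn_units, 1))


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


# score = FC(tanh(FC(EO) + FC(H))); H gets a time axis so it broadcasts over max_length
score = np.tanh(enc_output @ W1 + (hidden @ W2)[:, None, :]) @ V  # (batch, max_length, 1)
attention_weights = softmax(score, axis=1)                         # (batch, max_length, 1)
context_vector = (attention_weights * enc_output).sum(axis=1)      # (batch, hidden_size)

print(score.shape, attention_weights.shape, context_vector.shape)
print(attention_weights.sum(axis=1).ravel())  # weights of each example sum to 1
```

Applying softmax on axis 1 is what makes the weights of each example sum to one across the input positions.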

The shapes of all the vectors at each step have been specified in the comments in the code:

In [24]:
class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim,pretrained_weights, maxlen, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim,weights=[pretrained_weights],input_length=maxlen,trainable=True)
    self.lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(self.enc_units, return_sequences=True, return_state=True))

  def call(self, x, forward_h, forward_c, backward_h, backward_c):
    x = self.embedding(x)
    output, st_forward_h, st_forward_c, st_backward_h, st_backward_c = self.lstm(x, initial_state = [forward_h, forward_c, backward_h, backward_c])
    return output, st_forward_h, st_forward_c, st_backward_h, st_backward_c

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units)), tf.zeros((self.batch_sz, self.enc_units)), tf.zeros((self.batch_sz, self.enc_units)), tf.zeros((self.batch_sz, self.enc_units))

In [25]:
encoder = Encoder(vocab_size, emdedding_size, pretrained_weights, max_length_inp ,units, BATCH_SIZE)
forward_h, forward_c, backward_h, backward_c = encoder.initialize_hidden_state()
sample_output, forward_h, forward_c, backward_h, backward_c = encoder(example_input_batch, forward_h, forward_c, backward_h, backward_c)
print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print ('Encoder Hidden forward_h state shape: (batch size, units) {}'.format(forward_h.shape))
print ('Encoder Hidden forward_c state shape: (batch size, units) {}'.format(forward_c.shape))
print ('Encoder Hidden backward_h state shape: (batch size, units) {}'.format(backward_h.shape))
print ('Encoder Hidden backward_c state shape: (batch size, units) {}'.format(backward_c.shape))

Encoder output shape: (batch size, sequence length, units) (64, 14, 2048)
Encoder Hidden forward_h state shape: (batch size, units) (64, 1024)
Encoder Hidden forward_c state shape: (batch size, units) (64, 1024)
Encoder Hidden backward_h state shape: (batch size, units) (64, 1024)
Encoder Hidden backward_c state shape: (batch size, units) (64, 1024)


In [26]:
class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 =  tf.keras.layers.Dense(units)
    self.W2 =  tf.keras.layers.Dense(units)
    self.V =  tf.keras.layers.Dense(1)

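  def call(self, forward_h, forward_c, backward_h, backward_c, values):
    # each state shape == (batch_size, enc_units)
    # values shape == (batch_size, max_length, 2 * enc_units)
    hidden_h =  tf.keras.layers.Concatenate()([forward_h, backward_h])
    hidden_c =  tf.keras.layers.Concatenate()([forward_c, backward_c])
    # query shape == (batch_size, 4 * enc_units)
    query =  tf.keras.layers.Concatenate()([hidden_h, hidden_c])

    # hidden_with_time_axis shape == (batch_size, 1, 4 * enc_units)
    # the time axis lets the query broadcast over max_length when computing the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, 2 * enc_units)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights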


In [27]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, pretrained_weights, maxlen, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim,weights=[pretrained_weights],input_length=maxlen,trainable=True)
    self.lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(self.dec_units, return_sequences=True, return_state=True))
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, forward_h, forward_c, backward_h, backward_c, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(forward_h, forward_c, backward_h, backward_c, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the bi-LSTM
    output, st_forward_h, st_forward_c, st_backward_h, st_backward_c = self.lstm(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)
    return x, st_forward_h, st_forward_c, st_backward_h, st_backward_c, attention_weights


In [28]:
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(forward_h, forward_c, backward_h, backward_c, sample_output)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

decoder = Decoder(vocab_size, emdedding_size, pretrained_weights, max_length_targ ,units, BATCH_SIZE)

sample_decoder_output, _, _, _, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      forward_h, forward_c, backward_h, backward_c, sample_output)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))

Attention result shape: (batch size, units) (64, 2048)
Attention weights shape: (batch_size, sequence_length, 1) (64, 14, 1)
Decoder output shape: (batch_size, vocab size) (64, 21000)


## Define the optimizer and the loss function

In [29]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

In [30]:
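def loss_function(real, pred):
  # mask out target positions whose index is 0
  # note: in the run shown above index 0 is '<start>' and padding uses index 1,
  # so padded positions are not masked and still contribute to the loss
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)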
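
The masking idea behind `loss_function` can be shown on a toy batch for a single decoding step. The following is a NumPy sketch rather than the TensorFlow code above; `masked_sparse_ce` and its `pad_id` argument are hypothetical names, and the logits are made up.

```python
import numpy as np


def masked_sparse_ce(real, logits, pad_id=0):
    """Mean cross-entropy over the batch, with positions equal to pad_id zeroed."""
    # log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick the log-probability of the true class for every example
    loss = -log_probs[np.arange(len(real)), real]
    mask = (real != pad_id).astype(loss.dtype)
    # as with tf.reduce_mean, the mean divides by the batch size, masked entries included
    return (loss * mask).mean()


real = np.array([2, 0, 1])    # the second target equals pad_id and is masked
logits = np.zeros((3, 4))     # uniform predictions over 4 classes
print(masked_sparse_ce(real, logits))
```

With uniform logits each unmasked example contributes `log(4)`, so the result is `2 * log(4) / 3`: the masked example adds nothing to the sum but still counts in the denominator.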

## Checkpoints (Object-based saving)

In [31]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

## Training

1. Pass the *input* through the *encoder*, which returns the *encoder output* and the *encoder hidden state*.
2. The encoder output, the encoder hidden state and the decoder input (which is the *start token*) are passed to the decoder.
3. The decoder returns the *predictions* and the *decoder hidden state*.
4. The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss.
5. Use *teacher forcing* to decide the next input to the decoder.
6. *Teacher forcing* is the technique where the *target word* is passed as the *next input* to the decoder.
7. The final step is to calculate the gradients by backpropagation and apply them with the optimizer.
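
Teacher forcing (steps 5 and 6) amounts to shifting the target sequence by one position. The plain-Python sketch below shows which token the decoder receives and which token it is expected to predict at each step; `teacher_forcing_pairs` is a hypothetical helper and the target sequence is a placeholder.

```python
def teacher_forcing_pairs(target):
    """Return (decoder_input, expected_output) pairs for one target sequence."""
    # at step t the decoder is fed the true token t-1 and must predict token t
    return [(target[t - 1], target[t]) for t in range(1, len(target))]


target = ["<start>", "we", "<end>"]
for dec_input, expected in teacher_forcing_pairs(target):
    print(dec_input, "->", expected)
```

Feeding the true previous token, rather than the model's own prediction, keeps early mistakes from compounding during training.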

In [32]:
@tf.function
def train_step(inp, targ, forward_h, forward_c, backward_h, backward_c):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_forward_h, enc_forward_c, enc_backward_h,\
                    enc_backward_c = encoder(inp,forward_h, forward_c, backward_h, backward_c)

    dec_forward_h, dec_forward_c, dec_backward_h, dec_backward_c = \
    enc_forward_h, enc_forward_c, enc_backward_h, enc_backward_c

    dec_input = tf.expand_dims([word_model.wv.vocab['<start>'].index] * BATCH_SIZE, 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_forward_h, dec_forward_c, dec_backward_h,dec_backward_c, _ =\
                decoder(dec_input, dec_forward_h, dec_forward_c, dec_backward_h, dec_backward_c, enc_output)

      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss

In [34]:
EPOCHS = 30

for epoch in range(EPOCHS):
  start = time.time()

  enc_forward_h, enc_forward_c, enc_backward_h, enc_backward_c = encoder.initialize_hidden_state()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_forward_h, enc_forward_c, enc_backward_h, enc_backward_c)
    total_loss += batch_loss

    if batch % 100 == 0:
        print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                     batch,
                                                     batch_loss.numpy()))
  # saving (checkpoint) the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix = checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / steps_per_epoch))
  print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 0.7746
Epoch 1 Batch 100 Loss 0.6184
Epoch 1 Batch 200 Loss 0.8386
Epoch 1 Batch 300 Loss 0.7160
Epoch 1 Batch 400 Loss 0.6254
Epoch 1 Loss 0.7371
Time taken for 1 epoch 145.1186227798462 sec

Epoch 2 Batch 0 Loss 0.5638
Epoch 2 Batch 100 Loss 0.4822
Epoch 2 Batch 200 Loss 0.5849
Epoch 2 Batch 300 Loss 0.4553
Epoch 2 Batch 400 Loss 0.4175
Epoch 2 Loss 0.5142
Time taken for 1 epoch 149.01123881340027 sec

Epoch 3 Batch 0 Loss 0.4014
Epoch 3 Batch 100 Loss 0.3248
Epoch 3 Batch 200 Loss 0.3705
Epoch 3 Batch 300 Loss 0.2620
Epoch 3 Batch 400 Loss 0.1939
Epoch 3 Loss 0.3114
Time taken for 1 epoch 144.58724117279053 sec

Epoch 4 Batch 0 Loss 0.2585
Epoch 4 Batch 100 Loss 0.1908
Epoch 4 Batch 200 Loss 0.2030
Epoch 4 Batch 300 Loss 0.1409
Epoch 4 Batch 400 Loss 0.1110
Epoch 4 Loss 0.1796
Time taken for 1 epoch 149.0610339641571 sec

Epoch 5 Batch 0 Loss 0.1738
Epoch 5 Batch 100 Loss 0.0925
Epoch 5 Batch 200 Loss 0.0865
Epoch 5 Batch 300 Loss 0.0784
Epoch 5 Batch 400 Loss 0

In [35]:
def evaluate(sentence):
    attention_plot = np.zeros((max_length_targ, max_length_inp))

    inputs = sentence_to_indexes(sentence,word_model, max_length_inp)
    inputs = tf.convert_to_tensor(inputs)

    result = ''
    forward_h  = tf.zeros((1, units))
    forward_c  = tf.zeros((1, units))
    backward_h = tf.zeros((1, units))
    backward_c = tf.zeros((1, units))

    enc_out, enc_forward_h, enc_forward_c, enc_backward_h, enc_backward_c = encoder(inputs, forward_h, forward_c, backward_h, backward_c)

    dec_forward_h, dec_forward_c, dec_backward_h, dec_backward_c = enc_forward_h, enc_forward_c, enc_backward_h, enc_backward_c


    dec_input = tf.expand_dims([word_model.wv.vocab['<start>'].index], 0)

    for t in range(max_length_targ):
        predictions, dec_forward_h, dec_forward_c, dec_backward_h, dec_backward_c, attention_weights = decoder(dec_input,
                                                                                                               dec_forward_h, dec_forward_c, dec_backward_h, dec_backward_c,
                                                                                                               enc_out)

        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()
        
        predicted_id = tf.argmax(predictions[0]).numpy()
        predicted_word = word_model.wv.index2word[predicted_id]

        # stop as soon as the model predicts the end token
        if predicted_word == '<end>':
            return result, sentence, attention_plot

        result += predicted_word + ' '

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 1)

    return result, sentence, attention_plot

In [36]:
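# function for plotting the attention weights
import matplotlib.ticker as ticker

def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')

    fontdict = {'fontsize': 14}

    # place one tick per row and column so each word lines up with its cell
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)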

    plt.show()

## KDTree - nearest-neighbor

The algorithm used is described in Maneewongvatana and Mount 1999. The general idea is that the kd-tree is a binary tree, each of whose nodes represents an axis-aligned hyperrectangle. Each node specifies an axis and splits the set of points based on whether their coordinate along that axis is greater than or less than a particular value.

During construction, the axis and splitting point are chosen by the “sliding midpoint” rule, which ensures that the cells do not all become long and thin.

The tree can be queried for the r closest neighbors of any given point (optionally returning only those within some maximum distance of the point). It can also be queried, with a substantial gain in efficiency, for the r approximate closest neighbors.

For large dimensions (20 is already large) do not expect this to run significantly faster than brute force. High-dimensional nearest-neighbor queries are a substantial open problem in computer science.
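
Because the word vectors in this notebook are 300-dimensional and a sentence holds only a handful of words, a brute-force scan is a reasonable baseline to compare the tree against. A minimal sketch, assuming NumPy and a hypothetical helper `brute_force_nearest` with toy 3-dimensional vectors (neither is part of the notebook):

```python
import numpy as np


def brute_force_nearest(points, query):
    """Return (index, distance) of the row in `points` closest to `query`.

    Computes the Euclidean distance to every point, which costs O(n * d)
    for n points of dimension d.
    """
    points = np.asarray(points, dtype=float)
    query = np.asarray(query, dtype=float)
    distances = np.linalg.norm(points - query, axis=1)
    index = int(np.argmin(distances))
    return index, float(distances[index])


# Toy stand-ins for word vectors.
toy_points = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]
index, distance = brute_force_nearest(toy_points, [0.9, 1.1, 1.0])
print(index, distance)
```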

The tree also supports all-neighbors queries, both with arrays of points and with other kd-trees. These do use a reasonably efficient algorithm, but the kd-tree is not necessarily the best data structure for this sort of calculation.

![kdtree](img/kdtree-ops.png "kdtree")

`class scipy.spatial.KDTree(data, leafsize=10)`

A kd-tree for quick nearest-neighbor lookup.

This class provides an index into a set of k-dimensional points which can be used to rapidly look up the nearest neighbors of any point.
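
A short usage sketch of the class may help before it is applied to word vectors. It assumes SciPy and NumPy are installed and uses toy 2-dimensional points in place of embeddings; the point values are made up for illustration:

```python
import numpy as np
from scipy import spatial

# Toy 2-D points standing in for word vectors.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
tree = spatial.KDTree(points, leafsize=10)

# query returns (distance, index) of the nearest stored point.
distance, index = tree.query([0.9, 0.2])

# With k=2 it returns arrays holding the two nearest neighbors, closest first.
distances, indices = tree.query([0.9, 0.2], k=2)
print(index, indices)
```

The index returned by `query` refers to the row order of the array passed to the constructor, so it can be used to look up the original item.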

In [37]:
def KDtree_nearest(test_sentence, result):
    # build a kd-tree over the word vectors of the input sentence
    sentence_words = test_sentence.split()
    tree = spatial.KDTree([word_model.wv[word] for word in sentence_words])
    # only the first predicted word is looked up
    first_word = result.split()[0]
    # query returns (distance, index) of the nearest sentence word
    index = tree.query(word_model.wv[first_word])[1]
    return sentence_words[index]

## Tag

* The evaluate function is similar to the training loop, except that it does not use *teacher forcing*. The input to the decoder at each time step is its previous prediction, along with the hidden state and the encoder output.
* Stop predicting when the model predicts the *end token*.
* Store the *attention weights for every time step*.
* Run **KDTree** to find the word in the input sentence that is closest to the predicted word.

Note: The encoder output is calculated only once for one input.
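
The decoding steps listed above can be sketched independently of TensorFlow. This is a minimal illustration, assuming a hypothetical `step_fn` that stands in for the decoder and a toy `toy_step` with made-up token ids; neither is part of the notebook:

```python
def greedy_decode(step_fn, start_id, end_id, max_length):
    """Greedy decoding without teacher forcing.

    `step_fn(token_id, state)` is a hypothetical stand-in for the decoder: it
    returns (scores, new_state, attention_weights) for one time step.
    """
    token_id, state = start_id, None
    output_ids, attention_rows = [], []
    for _ in range(max_length):
        scores, state, attention = step_fn(token_id, state)
        attention_rows.append(attention)  # keep the weights for plotting
        # Pick the highest-scoring token (argmax over the vocabulary).
        token_id = max(range(len(scores)), key=scores.__getitem__)
        if token_id == end_id:
            break  # stop as soon as the end token is predicted
        output_ids.append(token_id)
    return output_ids, attention_rows


def toy_step(token_id, state):
    """Toy decoder: emits 3, then 4, then the end token (id 2)."""
    next_id = {1: 3, 3: 4, 4: 2}[token_id]
    scores = [0.0] * 5
    scores[next_id] = 1.0
    return scores, state, [0.5, 0.5]


ids, attention = greedy_decode(toy_step, start_id=1, end_id=2, max_length=10)
print(ids, len(attention))
```

Each predicted token becomes the next decoder input, so an early mistake can propagate through the rest of the output.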

In [38]:
def tag(sentence):
    result, sentence, attention_plot = evaluate(sentence)
    result = KDtree_nearest(sentence, result)
    print('Input: %s' % (sentence))
    print('Predicted subject: {}'.format(result))
    # matplotlib draws text left to right, so reverse the characters of the
    # Hebrew strings for display only
    display_sentence = sentence[::-1].split(' ')
    # the character reversal also reversed the word order; restore it
    display_sentence.reverse()
    display_result = result[::-1].split(' ')
    attention_plot = attention_plot[:len(display_result), :len(display_sentence)]
    plot_attention(attention_plot, display_sentence, display_result)
    # return the subject in its original character order
    return result

## Restoring the latest checkpoint in checkpoint_dir

In [39]:
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f96d4b83eb8>

In [40]:
# test_sentence = "הוא לא יודע מה עובר עליי"
# result = tag(test_sentence)

In [41]:
def listToString(s):
    # join a list of tokens into one space-separated string
    return " ".join(s)

In [42]:
len(sentences_test)

8959

In [43]:
new_sentences_test = [listToString(sen[1:-1]) for sen in sentences_test]
new_subjects_test = [listToString(sen[1:-1]) for sen in subjects_test]

In [44]:
def new_tag(sentence):
    result, sentence, attention_plot = evaluate(sentence)
    if len(result)>0:
        result = KDtree_nearest(sentence,result)
    return result

In [45]:
from tqdm.notebook import trange, tqdm
counter = 0
for index,sen1 in enumerate(tqdm(new_sentences_test)):
    if new_tag(sen1) == new_subjects_test[index]:
        counter = counter+1

In [46]:
counter/len(new_sentences_test)

0.07299921866279718
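
The score above is the fraction of test sentences whose predicted subject string equals the reference string exactly. A minimal sketch of that metric as a reusable function, assuming a hypothetical helper `exact_match_accuracy` and toy strings; the whitespace normalization is an addition that the loop above does not perform:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming
    and collapsing whitespace. Returns 0.0 for empty input."""
    if len(predictions) != len(references):
        raise ValueError('predictions and references must have equal length')
    if not references:
        return 0.0
    matches = sum(
        ' '.join(pred.split()) == ' '.join(ref.split())
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Toy strings, not taken from the test set.
score = exact_match_accuracy(['he', 'the dog ', 'we'], ['he', 'the dog', 'they'])
print(score)
```

Note that `new_tag` returns a single word, so under exact matching any reference subject with more than one word counts as a miss.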