# Text Summarization

<hr>

***Automatic Text Summarization*** is the process of shortening a text document while preserving its key information. In this notebook we do this with deep learning, more specifically with a sequence-to-sequence (seq2seq) model. Summarization is widely used in industry today, for example in search engines, and the same idea extends to summarizing collections of documents, images, and videos.

<img width="600px" src="assets/intro.png">
<br><br>
In this Jupyter Notebook, we will go through the following chapters:

- **Chapter 1:** Data Preparation
- **Chapter 2:** Data Preprocessing & Embedding
- **Chapter 3:** Training the Seq2Seq Model
- **Chapter 4:** Inference

<br>


# Chapter 1: Data Preparation

<hr>

In this chapter, we make sure the necessary libraries are installed and get our data ready.

<br>

### LESSON 1.1: DOWNLOAD AND IMPORT PACKAGES


<hr>

Now, let's import all the packages and libraries that we are going to use throughout this Jupyter notebook. Make sure that TensorFlow version 1.12 is installed; otherwise the code will not run.

In [None]:
!pip install tqdm
!pip install pandas
!pip install numpy
!pip install contractions
!pip install tensorflow==1.12
!pip install tensorflow_hub
!pip install nltk
!pip install textsearch

In [1]:
# Import the libraries
import time
import re
import html
import os
import tqdm
import pandas as pd
import numpy as np
from collections import Counter
import contractions

# Import the deep learning libraries
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.python.layers.core import Dense
#import seq2seq

# Import the NLP libraries
import nltk
from nltk.translate.bleu_score import sentence_bleu

In [2]:
# Download punkt sentence tokenizer
nltk.download('punkt')

[nltk_data] Error loading punkt: <urlopen error [Errno 8] nodename nor
[nltk_data]     servname provided, or not known>


False

In [3]:
# Check the tensorflow version
print("Tensorflow Version: ", tf.__version__)
assert tf.__version__ == "1.12.0", "Please install TensorFlow version 1.12.0"

Tensorflow Version:  1.12.0


<br>

### LESSON 1.2: DOWNLOAD & LOAD THE DATASET

<hr>

The dataset that we are going to use is called "Amazon Fine Food Reviews". It consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Each review includes product and user information, a rating, and a plain-text review. It also includes reviews from all other Amazon categories. Our aim is to take a review (Text column) as input and automatically create a summary (Summary column) for it.

<img width="400px" src="assets/amazon_food.png">

You can download the dataset from the following link:

https://www.kaggle.com/snap/amazon-fine-food-reviews/data

In [4]:
# Load the dataset
data = pd.read_csv("dataset/Reviews.csv")

In [5]:
# Take a look at first 5 rows of dataset
data.head(n = 5)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [6]:
print("Dataset Shape: ", data.shape)

Dataset Shape:  (568454, 10)


<br>

### LESSON 1.3: EXPLORE THE DATASET

<hr>

In this part, we get our dataset ready for preprocessing. More specifically, we remove the rows with missing summaries, keep only the relevant columns, and drop the texts that are too long or too short.

In [7]:
# Check the total missing values
print("Missing Values: \n\n", data.isnull().sum())

Missing Values: 

 Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64


In [8]:
# Drop the missing values in "Summary" column
data.dropna(subset = ["Summary"], inplace = True)

In [9]:
# Take only the columns "Summary" and "Text"
data = data[['Summary', 'Text']]

In [10]:
# Take a look at the dataset
data.head(n = 5)

Unnamed: 0,Summary,Text
0,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,"""Delight"" says it all",This is a confection that has been around a fe...
3,Cough Medicine,If you are looking for the secret ingredient i...
4,Great taffy,Great taffy at a great price. There was a wid...


In [11]:
# Keep the texts whose length is between 100 and 150 characters (both bounds excluded)

# Initialize the empty lists
texts = []
summaries = []

# Iterate through text and its summary
for i_text, i_summary in zip(data["Text"], data["Summary"]):
    
    # If length is between 100 and 150
    if 100 < len(i_text) < 150:
        
        # Append the text and summary into list
        texts.append(i_text)
        summaries.append(i_summary)
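The three cleaning steps of this lesson (dropping missing summaries, selecting columns, filtering by length) can also be written as a single vectorized pandas pipeline. The following is a minimal sketch on a made-up toy DataFrame; the `toy_*` names and the rows are assumptions for illustration only, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the review data; the rows are made up for illustration
toy = pd.DataFrame({
    "Summary": ["Great taffy", None, "Too short"],
    "Text": ["x" * 120, "y" * 130, "z" * 20],
    "Score": [5, 4, 1],
})

# Drop the rows without a summary and keep only the two relevant columns
toy = toy.dropna(subset=["Summary"])[["Summary", "Text"]]

# Boolean mask: strictly more than 100 and strictly fewer than 150 characters
toy_lengths = toy["Text"].str.len()
toy = toy[(toy_lengths > 100) & (toy_lengths < 150)]

toy_texts = toy["Text"].tolist()
toy_summaries = toy["Summary"].tolist()
print(len(toy_texts), toy_summaries)
```

Only the first toy row survives: the second one has no summary and the third one is too short. Note that the length is measured in characters, not in words.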

In [12]:
print("Length of Texts Lists: ", len(texts))
print("Length of Summaries Lists: ", len(summaries))

Length of Texts Lists:  78862
Length of Summaries Lists:  78862


In [13]:
# Get the first 3 texts and its summary
for i_text, i_summary in zip(texts[:3], summaries[:3]):
    print("Text: ", i_text)
    print("Summary: ", i_summary)
    print("\n")

Text:  Great taffy at a great price.  There was a wide assortment of yummy taffy.  Delivery was very quick.  If your a taffy lover, this is a deal.
Summary:  Great taffy


Text:  This taffy is so good.  It is very soft and chewy.  The flavors are amazing.  I would definitely recommend you buying it.  Very satisfying!!
Summary:  Wonderful, tasty taffy


Text:  Right now I'm mostly just sprouting this so my cats can eat the grass. They love it. I rotate it around with Wheatgrass and Rye too
Summary:  Yay Barley




# Chapter 2: Data Preprocessing & Embedding

<hr>

In this chapter, we apply all the necessary preprocessing steps to our texts. Then we create the embeddings for our vocabulary.

<br>

### LESSON 2.1: CLEAN AND PREPARE THE DATASET

<hr>

In this lesson, we write the functions that clean and tokenize the texts so that our dataset is ready for the neural network. More specifically, we apply the following steps:
1. Lowercase the text
2. Fix the contractions (e.g. she's -> she is, he's -> he is, etc.)
3. Remove the punctuation
4. Remove some specific characters inside the given dataset
5. Remove the extra spaces
6. Tokenize the texts into words
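
To make the effect of these steps concrete, here is a minimal, standard-library-only sketch applied to a single sentence. It is an illustration under stated assumptions: `TOY_CONTRACTIONS` is a tiny hypothetical lookup table standing in for the `contractions` package, and a plain `split()` stands in for the NLTK tokenizer used in this notebook.

```python
import re

# Hypothetical, tiny contraction table used only for this illustration;
# the notebook itself relies on the `contractions` package instead.
TOY_CONTRACTIONS = {"it's": "it is", "i'm": "i am", "don't": "do not"}

def toy_clean(sentence):
    # Step 1: lowercase the text
    text = sentence.lower()
    # Step 2: expand the contractions
    for short, full in TOY_CONTRACTIONS.items():
        text = text.replace(short, full)
    # Steps 3 and 4: replace everything except letters and digits with a space
    text = re.sub(r"[^a-z0-9]", " ", text)
    # Step 5: collapse repeated spaces and trim both ends
    text = re.sub(r" +", " ", text).strip()
    # Step 6: split into word tokens
    return text.split()

toy_tokens = toy_clean("It's GREAT taffy -- I'm a fan!!")
print(toy_tokens)
# ['it', 'is', 'great', 'taffy', 'i', 'am', 'a', 'fan']
```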

In [14]:
def preprocessing_steps(text, keep_most = False):
    """
    Function for preprocessing the sentences.
    
    ARGUMENTS
    ===================
        - text: The text that we want to preprocess.
        - keep_most: If True, keep letters, digits and the characters %!?.,:()/ ;
                     if False, keep only letters and digits.
        
    RETURNS
    ===================
        - text: Preprocessed text.
    """
    # Lower case the text
    text = text.lower()
    
    # Remove and replace some specific words
    text = fixup(text)
    
    # Fix the contractions (e.g. she's -> she is, he's -> he is, etc.)
    text = contractions.fix(text)
    
    # Remove the punctuation 
    if keep_most:
        # Keep letters, digits and the characters % ! ? . , : ( ) /
        text = re.sub(r"[^a-z0-9%!?.,:()/]", " ", text)
    else:
        # Keep only letters and digits
        text = re.sub(r"[^a-z0-9]", " ", text)
        
    # Collapse runs of spaces into a single space and trim both ends
    # (the "<br />" breaks have already been replaced by fixup above)
    text = re.sub(r" +", " ", text)
    text = text.strip()
    
    return text

In [15]:
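def fixup(x):
    """
    Repair leftover HTML entities and markup in the raw text (e.g. "#39;" -> "'",
    "<br />" -> newline), unescape the remaining HTML entities and collapse
    repeated spaces.
    """
    # Matches runs of two or more spaces
    re1 = re.compile(r'  +')
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(
        ' @-@ ', '-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))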

In [16]:
def preprocess_and_tokenize(text, keep_most = False):
    """
    Split the text into sentences. Then preprocess each sentence and tokenize it into words.
    
    ARGUMENTS
    ===================
        - text: The text that we want to preprocess.
        - keep_most: If True, keep letters, digits and the characters %!?.,:()/ ; if False, keep only letters and digits.
        
    RETURNS
    ===================
        - tokenized: Tokenized & preprocessed text.
    """
    # Initialize an empty list
    tokenized = []
    
    # Iterate through tokenized sentences
    for i_sentence in nltk.sent_tokenize(text):
        
        # Apply the preprocessing_steps function
        i_sentence = preprocessing_steps(i_sentence, keep_most)
        
        # Tokenize into words
        i_sentence = nltk.word_tokenize(i_sentence)
        
        # Iterate through tokenized words
        for token in i_sentence:
            
            # Append the tokens into the list
            tokenized.append(token)

    return tokenized

In [17]:
def apply_preprocessing(texts, summaries, keep_most = False): # preprocess_texts_and_summaries
    """
    Apply all the preprocessing steps to our texts and summaries.
    
    ARGUMENTS
    ===================
        - texts: The texts that we want to preprocess.
        - summaries: The summaries that we want to preprocess.
        - keep_most: If True, keep letters, digits and the characters %!?.,:()/ ; if False, keep only letters and digits.
        
    RETURNS
    ===================
        - processed_texts: List of tokenized texts.
        - processed_summaries: List of tokenized summaries.
        - words_counted: List of (word, count) tuples for all the unique words, sorted by count in descending order.
    """
    # Initialize the lists
    processed_texts = []
    processed_summaries = []
    words = []

    # Iterate through texts
    for i_text in tqdm.tqdm(texts):
        
        # Preprocess the text
        i_text = preprocess_and_tokenize(i_text, keep_most)
        
        # Iterate through each words
        for i_word in i_text:
            
            # Append the word into words list
            words.append(i_word)
            
        # Append the preprocessed text into processed_texts
        processed_texts.append(i_text)
        
    # Iterate through summaries
    for i_summary in tqdm.tqdm(summaries):
        
        # Preprocess the summary
        i_summary = preprocess_and_tokenize(i_summary, keep_most)
        
        # Iterate through each words
        for i_word in i_summary:
            
            # Append the word into words
            words.append(i_word)

        # Append the preprocessed summary into processed_summaries
        processed_summaries.append(i_summary)
        
    # Create word count
    words_counted = Counter(words).most_common()

    return processed_texts, processed_summaries, words_counted

In [18]:
# Preprocess the texts and summaries
processed_texts, processed_summaries, words_counted = apply_preprocessing(texts, 
                                                                          summaries, 
                                                                          keep_most = False)

100%|██████████| 78862/78862 [01:34<00:00, 832.90it/s] 
100%|██████████| 78862/78862 [00:23<00:00, 3427.01it/s]


In [19]:
# Take a look at the preprocessed texts and summaries
for i_text, i_summary in zip(processed_texts[:3], processed_summaries[:3]):
    print("Preprocessed Text: ", i_text)
    print("Preprocessed Summary: ", i_summary)
    print("\n")

Preprocessed Text:  ['great', 'taffy', 'at', 'a', 'great', 'price', 'there', 'was', 'a', 'wide', 'assortment', 'of', 'yummy', 'taffy', 'delivery', 'was', 'very', 'quick', 'if', 'your', 'a', 'taffy', 'lover', 'this', 'is', 'a', 'deal']
Preprocessed Summary:  ['great', 'taffy']


Preprocessed Text:  ['this', 'taffy', 'is', 'so', 'good', 'it', 'is', 'very', 'soft', 'and', 'chewy', 'the', 'flavors', 'are', 'amazing', 'i', 'would', 'definitely', 'recommend', 'you', 'buying', 'it', 'very', 'satisfying']
Preprocessed Summary:  ['wonderful', 'tasty', 'taffy']


Preprocessed Text:  ['right', 'now', 'i', 'm', 'mostly', 'just', 'sprouting', 'this', 'so', 'my', 'cats', 'can', 'eat', 'the', 'grass', 'they', 'love', 'it', 'i', 'rotate', 'it', 'around', 'with', 'wheatgrass', 'and', 'rye', 'too']
Preprocessed Summary:  ['yay', 'barley']




<br>

### LESSON 2.2: CREATE LOOKUP DICTIONARIES

<hr>

Now it's time to create two dictionaries: one that converts words into integers (word2ind) and another one that converts integers back into words (ind2word).
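
As a quick illustration of the idea, here is a minimal sketch that builds both dictionaries for a toy vocabulary and encodes one summary. All `toy_*` names and the example words are assumptions made for this sketch; the special tokens are listed in the same order as later in this lesson.

```python
from collections import Counter

toy_specials = ["<EOS>", "<SOS>", "<PAD>", "<UNK>"]
toy_words = ["great", "taffy", "great", "price"]

# The special tokens get the first indexes, then the words follow by frequency
toy_vocab = toy_specials + [word for word, _ in Counter(toy_words).most_common()]
toy_word2ind = {word: index for index, word in enumerate(toy_vocab)}
toy_ind2word = {index: word for word, index in toy_word2ind.items()}

def toy_to_ids(words):
    # Words outside the vocabulary fall back to the <UNK> index
    unk = toy_word2ind["<UNK>"]
    return [toy_word2ind.get(word, unk) for word in words]

toy_ids = toy_to_ids(["<SOS>", "great", "deal", "<EOS>"])
toy_decoded = [toy_ind2word[index] for index in toy_ids]
print(toy_ids)      # [1, 4, 3, 0]
print(toy_decoded)  # ['<SOS>', 'great', '<UNK>', '<EOS>']
```

The word "deal" is not in the toy vocabulary, so it is mapped to `<UNK>` and cannot be recovered when converting back.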

<img width="800px" src="assets/int2word.jpeg">
<p style="font-size:9px; text-align:right; color:gray;" >Image Taken From hackernoon.com</p>

In [20]:
def create_word_inds_dicts(words_counted, specials = None, min_occurences = 0):
    """ 
    Create lookup dicts from word to index and back. 

    ARGUMENTS
    ===================
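        - words_counted: List of (word, count) tuples, e.g. the output of Counter.most_common().
        - specials: Optional list of special tokens (e.g. <PAD>, <UNK>) that get the first indexes.
        - min_occurences: Minimum count a word needs in order to be added to the dictionaries.
        
    RETURNS
    ===================
        - word2ind: A dictionary that maps each word into an integer number.
        - ind2word: A dictionary that maps each integer back into its word.
        - missing_words: List of the words that occur fewer than min_occurences times.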
    """
    # Initialize empty list
    missing_words = []
    
    # Initialize empty dictionary
    word2ind = {}
    ind2word = {}
    
    # Initialize the index
    index = 0

    # If specials has been specified
    if specials is not None:
        
        # Iterate through specials
        for i_special in specials:
            
            # Append the special and its index into word2ind
            word2ind[i_special] = index
            
            # Append the special and its index into ind2word
            ind2word[index] = i_special
            
            # Increment the index
            index += 1

    # Iterate through word counts
    for (i_word, i_count) in words_counted:
        
        # If count is greater than or equal to min_occurences
        if i_count >= min_occurences:
            
            # Append the word and its index into word2ind
            word2ind[i_word] = index
            
            # Append the word and its index into ind2word
            ind2word[index] = i_word
            
            # Increment the index
            index += 1
        
        # If count is less than min_occurences
        else:
            
            # Append the word into missing_words
            missing_words.append(i_word)

    return word2ind, ind2word, missing_words

In [21]:
def convert_sentence(review, word2ind):
    """ 
    Convert the given sentence to integer values corresponding to the given word2ind.
    
    ARGUMENTS
    ===================
        - review: The given text that we want to convert to integer.
        - word2ind: A dictionary that maps each word into an integer number.
        
    RETURNS
    ===================
        - integers: Integers corresponding to the given sentence.
        - unknown_words: Words that were not in word2ind. 
    """
    # Initialize empty lists
    integers = []
    unknown_words = []

    # Iterate through each word in review
    for i_word in review:
        
        # If word is in word2ind
        if i_word in word2ind:
            
            # Append integer corresponding to the word
            integers.append(int(word2ind[i_word]))
            
        # If word is NOT in word2ind
        else:
            
            # Append integer corresponding to '<UNK>' 
            integers.append(int(word2ind['<UNK>']))
            
            # Append the word into unknown_words
            unknown_words.append(i_word)

    return integers, unknown_words

In [22]:
def convert_to_inds(input, word2ind, eos = False, sos = False):
    """
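    Convert a list of tokenized sentences into lists of integers, optionally adding <SOS> and <EOS> tokens.
    
    ARGUMENTS
    ===================
        - input: List of tokenized sentences (each one a list of words).
        - word2ind: A dictionary that maps each word into an integer number.
        - eos: If True, append the <EOS> (end of sentence) index to each sentence.
        - sos: If True, prepend the <SOS> (start of sentence) index to each sentence.
        
    RETURNS
    ===================
        - converted_input: List of integer lists, one per sentence.
        - all_unknown_words: Set of the words that were not found in word2ind.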
    """
    # Initialize an empty list
    converted_input = []
    
    # Initialize a set
    all_unknown_words = set()

    # Iterate through inputs
    for i_input in input:
        
        # Convert the given sentence into integers
        converted_inp, unknown_words = convert_sentence(i_input, word2ind)
        
        # If an <EOS> token is requested
        if eos:
            
            # Append the integer corresponding to <EOS> to converted_inp
            converted_inp.append(word2ind['<EOS>'])
            
        # If an <SOS> token is requested
        if sos:
            
            # Insert the integer corresponding to <SOS> at the start of converted_inp
            converted_inp.insert(0, word2ind['<SOS>'])
            
        # Append converted_inp into converted_input
        converted_input.append(converted_inp)
        
        # Append unknown_words into all_unknown_words
        all_unknown_words.update(unknown_words)

    return converted_input, all_unknown_words

In [23]:
def convert_inds_to_text(inds, ind2word):
    """ 
    Convert the given indexes back to text 
    
    ARGUMENTS
    ===================
        - inds: Integers we want to convert to text.
        - ind2word: A dictionary that maps integers into words.
        
    RETURNS
    ===================
        - words: Converted words
    """
    # Iterate through integers and get the words using ind2word dictionary
    words = [ind2word[word] for word in inds]
    
    return words

In [24]:
# Specials
specials = ["<EOS>", "<SOS>", "<PAD>", "<UNK>"]

# Create lookup dicts from word to index and back
word2ind, ind2word,  missing_words = create_word_inds_dicts(words_counted, specials = specials)

# Print the lengths
print("Length of word2ind: ", len(word2ind))
print("Length of ind2word: ", len(ind2word))
print("Length of Missing Words: ", len(missing_words))

Length of word2ind:  25031
Length of ind2word:  25031
Length of Missing Words:  0


In [25]:
# Convert texts into integers (eos has to be False here)
converted_texts, unknown_words_in_texts = convert_to_inds(processed_texts, 
                                                          word2ind, 
                                                          eos = False)

# Convert summaries into integers
converted_summaries, unknown_words_in_summaries = convert_to_inds(processed_summaries, 
                                                                  word2ind, 
                                                                  eos = True, 
                                                                  sos = True)

In [26]:
print("Integer text: \n", converted_texts[0], "\n")
print("Integer summary: \n", converted_summaries[0], "\n")
print("Converted Text: \n", convert_inds_to_text(converted_texts[0], ind2word), "\n")
print("Converted Summary: \n", convert_inds_to_text(converted_summaries[0], ind2word))

Integer text: 
 [12, 1707, 45, 8, 12, 44, 130, 29, 8, 2701, 1144, 16, 104, 1707, 317, 29, 26, 247, 61, 99, 8, 1707, 647, 10, 9, 8, 192] 

Integer summary: 
 [1, 12, 1707, 0] 

Converted Text: 
 ['great', 'taffy', 'at', 'a', 'great', 'price', 'there', 'was', 'a', 'wide', 'assortment', 'of', 'yummy', 'taffy', 'delivery', 'was', 'very', 'quick', 'if', 'your', 'a', 'taffy', 'lover', 'this', 'is', 'a', 'deal'] 

Converted Summary: 
 ['<SOS>', 'great', 'taffy', '<EOS>']


<br>

### LESSON 2.3: EMBEDDING

<hr>

Word embeddings are a form of word representation that bridges the human understanding of language and that of a machine. They are distributed representations of text in an n-dimensional space and are essential for solving most NLP problems. Optionally, we can use pretrained word embeddings, which have been shown to improve training speed and accuracy. Here we have two options: GloVe embeddings and embeddings from TensorFlow Hub (tf_hub). In our experiments, the ones from tf_hub worked better.
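
Conceptually, an embedding is a matrix with one row per vocabulary entry, and looking up a word means selecting the row that belongs to its integer id. The NumPy sketch below illustrates this with a random toy matrix; the sizes and all `toy_*` names are assumptions for illustration, and real embeddings would be pretrained or learned rather than random.

```python
import numpy as np

toy_rng = np.random.RandomState(0)
toy_vocab_size, toy_embedding_dim = 6, 4

# One row per word id; random values stand in for pretrained or learned vectors
toy_embedding_matrix = toy_rng.uniform(
    -1.0, 1.0, (toy_vocab_size, toy_embedding_dim)).astype(np.float32)

# A batch of 2 sentences with 3 word ids each
toy_batch_ids = np.array([[4, 5, 2], [5, 2, 2]])

# Integer-array indexing picks the matching row for every id
toy_batch_vectors = toy_embedding_matrix[toy_batch_ids]
print(toy_batch_vectors.shape)  # (2, 3, 4)
```

The result has shape (batch size, sentence length, embedding dimension), which is the kind of input a recurrent encoder consumes.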

<img width="300px" src="assets/embeddings.png">
<p style="font-size:9px; text-align:right; color:gray;" >Image Taken From udacity.com</p>

In [27]:
def load_pretrained_embeddings(path):
    """
    Load the pretrained embeddings from a text file into a dictionary that maps each word to its embedding vector.
    """
    # Initialize a dictionary
    embeddings = {}
    
    # Open the file
    with open(path, 'r', encoding='utf-8') as f:
        
        # Iterate through each line
        for line in f:
            
            # Strip the trailing newline and split the line
            values = line.rstrip().split(' ')
            
            # Get the word
            word = values[0]
            
            # Get the embedding vector
            embedding_vector = np.array(values[1:], dtype='float32')
            
            # Append the embedding vector into embeddings dictionary
            embeddings[word] = embedding_vector
            
    return embeddings

In [28]:
def create_and_save_embedding_matrix(word2ind, pretrained_embeddings_path, save_path, embedding_dim = 300):
    """
    Create an embedding matrix with one row for each word in word2ind. If the word is in the
    pretrained embeddings, its pretrained vector is used; otherwise the row is initialized randomly.
    
    ARGUMENTS
    ===================
        - word2ind: A dictionary that maps each word into an integer number.
        - pretrained_embeddings_path: Path to the pre-trained embedding.
        - save_path: Path for saving the embedding matrix.
        - embedding_dim: Embedding dimension.
        
    RETURNS
    ===================
        - embedding_matrix: Embedding matrix
        
    """
    # Load the pre-trained embeddings
    pretrained_embeddings = load_pretrained_embeddings(pretrained_embeddings_path)
    
    # Initialize the embedding matrix with zeros
    embedding_matrix = np.zeros((len(word2ind), embedding_dim), dtype=np.float32)
    
    # Iterate through each word and its index
    for i_word, i_index in word2ind.items():
        
        # If word is inside the pre-trained embeddings key
        if i_word in pretrained_embeddings.keys():
            
            # Append into embedding_matrix 
            embedding_matrix[i_index] = pretrained_embeddings[i_word]
            
        # If word is NOT inside the pre-trained embeddings key
        else:
            
            # Initialize a uniformly random embedding
            embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
            
            # Append the embedding into embedding_matrix
            embedding_matrix[i_index] = embedding
            
    # If save_path does not exist
    if not os.path.exists(os.path.dirname(save_path)):
        
        # Make save_path directory
        os.makedirs(os.path.dirname(save_path))
        
    # Save the embedding matrix
    np.save(save_path, embedding_matrix)
    
    return np.array(embedding_matrix)

In [29]:
# Path to glove embedding
#glove_embeddings_path = './glove/glove.6B.300d.txt'

# Path to save the embedding matrix
#embedding_matrix_save_path = './embeddings/embedding.npy'

# Create an embedding matrix for each word in word2ind
#emb = create_and_save_embedding_matrix(word2ind, glove_embeddings_path, embedding_matrix_save_path)

In [None]:
# Get the embeddings from tf_hub.
embed = hub.Module("https://tfhub.dev/google/Wiki-words-250/1")      # embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")
emb = embed([key for key in word2ind.keys()]) 

# Session
with tf.Session() as sess:
    
    # Initialize the variables
    sess.run(tf.global_variables_initializer())
    
    # Initialize all tables of the default graph
    sess.run(tf.tables_initializer())
    
    # Run the session
    embedding = sess.run(emb)

In [None]:
print("Embedding Size: ", embedding.shape)

In [None]:
# Save the embedding (create the target directory first if it is missing)
os.makedirs('embeddings', exist_ok=True)
np.save('embeddings/embedding.npy', embedding)

<br>

# Chapter 3: Training the Seq2Seq Model

<hr>

In this chapter, we train our seq2seq model for text summarization. We begin by setting the hyperparameters, then we initialize the model, and finally we train it.

<br>

### LESSON 3.1: SET HYPERPARAMETERS & PATHS

<hr>

Before initializing the seq2seq model, let's set our hyperparameters. We will come back to this part often in order to change them. This is an important part because the hyperparameters largely determine the accuracy of the model, so take your time changing the values and monitoring the results.

In [38]:
# Model hyperparameters
num_layers_encoder = 4
num_layers_decoder = 4
rnn_size_encoder = 512
rnn_size_decoder = 512

batch_size = 512
epochs = 15
clip = 5
keep_probability = 0.5
learning_rate = 0.0005
max_lr=0.005
learning_rate_decay_steps = 700
learning_rate_decay = 0.90


pretrained_embeddings_path = 'embeddings/embedding.npy'
summary_dir = os.path.join('./tensorboard', str('Nn_' + str(rnn_size_encoder) + '_Lr_' + str(learning_rate)))


use_cyclic_lr = True
inference_targets=True
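
To get a feeling for `learning_rate_decay` and `learning_rate_decay_steps`, the sketch below evaluates the conventional exponential decay formula, learning_rate * decay ^ (step / decay_steps), with the values set above. This is an assumption about how the two values interact (it is the formula of `tf.train.exponential_decay` without staircase); with `use_cyclic_lr = True` the model is expected to vary the rate between `learning_rate` and `max_lr` instead. The `toy_*` names are made up for this sketch.

```python
toy_learning_rate = 0.0005
toy_decay = 0.90
toy_decay_steps = 700

def toy_decayed_lr(step):
    """Learning rate after `step` updates under continuous exponential decay."""
    return toy_learning_rate * toy_decay ** (step / toy_decay_steps)

# The rate shrinks by a factor of 0.9 every 700 training steps
for toy_step in (0, 700, 1400):
    print(toy_step, toy_decayed_lr(toy_step))
```

After 700 steps the rate is roughly 0.00045, and after 1400 steps roughly 0.000405.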


In [39]:
len(converted_summaries)

78862

In [40]:
round(78862*0.9)

70976
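
The number above is 90% of the available examples, which presumably marks the boundary between training and validation data. A minimal sketch of such a split on toy lists follows; the `toy_*` names and the 90/10 usage are assumptions for illustration.

```python
# Toy stand-ins for the converted texts and summaries
toy_inputs = list(range(10))
toy_targets = [value * 10 for value in range(10)]

# Index that separates the first 90% from the last 10%
toy_split = round(len(toy_inputs) * 0.9)

toy_train_inputs, toy_val_inputs = toy_inputs[:toy_split], toy_inputs[toy_split:]
toy_train_targets, toy_val_targets = toy_targets[:toy_split], toy_targets[toy_split:]

print(len(toy_train_inputs), len(toy_val_inputs))  # 9 1
```

Slicing both lists at the same index keeps every text aligned with its summary.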

<br>

### LESSON 3.2: IMPLEMENT THE SEQ2SEQ MODEL

<hr>

Now let's implement the seq2seq model. Every seq2seq model is divided into two parts: encoder and decoder. The encoder encodes the input sequence into a fixed-length context vector. This vector is an internal representation of the text. This context vector is then decoded into the output sequence by the decoder. 
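
The key property of the encoder, mapping inputs of any length to a vector of fixed size, can be illustrated with a few lines of NumPy. The sketch below uses a plain tanh RNN with random weights as a stand-in; it is not the multi-layer encoder implemented in this notebook, and all `toy_*` names and sizes are assumptions.

```python
import numpy as np

toy_rng = np.random.RandomState(0)
toy_embed_dim, toy_hidden_size = 4, 3

# Random weights stand in for trained parameters
toy_W_x = toy_rng.randn(toy_embed_dim, toy_hidden_size) * 0.1
toy_W_h = toy_rng.randn(toy_hidden_size, toy_hidden_size) * 0.1

def toy_rnn_encode(embedded_sequence):
    """Run a plain tanh RNN over the sequence and return the last hidden state."""
    hidden = np.zeros(toy_hidden_size)
    for x_t in embedded_sequence:
        hidden = np.tanh(x_t @ toy_W_x + hidden @ toy_W_h)
    return hidden

# Two "reviews" of different lengths, already embedded
toy_short_review = toy_rng.randn(5, toy_embed_dim)
toy_long_review = toy_rng.randn(27, toy_embed_dim)

toy_context_short = toy_rnn_encode(toy_short_review)
toy_context_long = toy_rnn_encode(toy_long_review)
print(toy_context_short.shape, toy_context_long.shape)  # (3,) (3,)
```

Both reviews end up as vectors of the same size, which is what allows a decoder to start from either of them.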

<img src="assets/encoder-decoder.png">

In [41]:
def reset_graph(seed = 97):
    """
    Function for resetting the default graph.
    
    """
    # Clear the default graph stack and reset the global default graph
    tf.reset_default_graph()
    
    # Set the random seed in tensorflow
    tf.set_random_seed(seed)
    
    # Set the random seed in numpy
    np.random.seed(seed)

In [42]:
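def minibatches(inputs, targets, minibatch_size):
    """Batch generator. Yields (x_batch, y_batch) pairs with exactly minibatch_size items each.
    If the last batch is incomplete, it is filled up with samples from the start of the data.
    """
    x_batch, y_batch = [], []
    for inp, tgt in zip(inputs, targets):
        # Emit the batch as soon as it is full, then start a new one
        if len(x_batch) == minibatch_size and len(y_batch) == minibatch_size:
            yield x_batch, y_batch
            x_batch, y_batch = [], []
        x_batch.append(inp)
        y_batch.append(tgt)

    # Handle the samples left over for the last batch
    if len(x_batch) != 0:
        # Top up the batch by reusing samples from the beginning
        for inp, tgt in zip(inputs, targets):
            if len(x_batch) != minibatch_size:
                x_batch.append(inp)
                y_batch.append(tgt)
            else:
                break
        yield x_batch, y_batch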


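def pad_sequences(sequences, pad_tok, tail=True):
    """Pads the sentences so that all sentences in a batch have the same length.
    Padding is appended if tail is True and prepended otherwise. Returns the
    padded sequences together with their original lengths.
    """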

    max_length = max(len(x) for x in sequences)

    sequence_padded, sequence_length = [], []

    for seq in sequences:
        seq = list(seq)
        if tail:
            seq_ = seq[:max_length] + [pad_tok] * max(max_length - len(seq), 0)
        else:
            seq_ = [pad_tok] * max(max_length - len(seq), 0) + seq[:max_length]

        sequence_padded += [seq_]
        sequence_length += [min(len(seq), max_length)]

    return sequence_padded, sequence_length
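
The two helpers above can be hard to picture, so here is a compact, self-contained sketch of the same two ideas on toy data: fixed-size batches whose last batch is topped up from the start of the data, and tail padding to the longest sequence in a batch. The `toy_*` functions are simplified stand-ins written for this illustration, not replacements for the functions above; the pad id 2 matches the position of `<PAD>` in the specials list.

```python
def toy_batches(items, size):
    """Yield batches of exactly `size` items; the last one is topped up from the start."""
    batch = []
    for item in items:
        if len(batch) == size:
            yield batch
            batch = []
        batch.append(item)
    if batch:
        for item in items:
            if len(batch) == size:
                break
            batch.append(item)
        yield batch

def toy_pad(sequences, pad_tok):
    """Append pad_tok until every sequence is as long as the longest one."""
    max_length = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_tok] * (max_length - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

toy_data = [[5, 6], [7], [8, 9, 10], [11], [12, 13]]
toy_all_batches = list(toy_batches(toy_data, 2))
print(toy_all_batches[-1])  # [[12, 13], [5, 6]]

toy_padded, toy_lengths = toy_pad(toy_all_batches[1], pad_tok=2)
print(toy_padded, toy_lengths)  # [[8, 9, 10], [11, 2, 2]] [3, 1]
```

Five samples with a batch size of 2 give three batches; the third one reuses the first sample so that every batch has the same size.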

In [46]:
class seq2seq:

    def __init__(self,
                 word2ind,
                 ind2word,
                 save_path,
                 mode='TRAIN',
                 num_layers_encoder=1,
                 num_layers_decoder=1,
                 embedding_dim=300,
                 rnn_size_encoder=256,
                 rnn_size_decoder=256,
                 learning_rate=0.001,
                 learning_rate_decay=0.9,
                 learning_rate_decay_steps=100,
                 max_lr=0.01,
                 keep_probability=0.8,
                 batch_size=64,
                 beam_width=10,
                 epochs=20,
                 eos="<EOS>",
                 sos="<SOS>",
                 pad='<PAD>',
                 clip=5,
                 inference_targets=False,
                 pretrained_embeddings_path=None,
                 summary_dir=None,
                 use_cyclic_lr=False):
        """

        Args:
            word2ind: lookup dict from word to index.
            ind2word: lookup dict from index to word.
            save_path: path to save the tf model to in the end.
            mode: String. 'TRAIN' or 'INFER'. depending on which mode we use
                  a different graph is created.
            num_layers_encoder: Integer. Number of encoder layers. defaults to 1.
            num_layers_decoder: Integer. Number of decoder layers. defaults to 1.
            embedding_dim: dimension of the embedding vectors in the embedding matrix.
                           every word has a embedding_dim 'long' vector.
            rnn_size_encoder: Integer. number of hidden units in encoder. defaults to 256.
            rnn_size_decoder: Integer. number of hidden units in decoder. defaults to 256.
            learning_rate: Float.
            learning_rate_decay: only if exponential learning rate is used.
            learning_rate_decay_steps: Integer.
            max_lr: only used if cyclic learning rate is used.
            keep_probability: Float.
            batch_size: Integer. Size of minibatches.
            beam_width: Integer. Only used in inference, for Beam Search.('INFER'-mode)
            epochs: Integer. Number of times the training is conducted
                    on the whole training data.
            eos: EndOfSentence tag.
            sos: StartOfSentence tag.
            pad: Padding tag.
            clip: Value to clip the gradients to in training process.
            inference_targets:
            pretrained_embeddings_path: Path to pretrained embeddings. Has to be .npy
            summary_dir: Directory the summaries are written to for tensorboard.
            use_cyclic_lr: Boolean.
        """

        self.word2ind = word2ind
        self.ind2word = ind2word
        self.vocab_size = len(word2ind)
        self.num_layers_encoder = num_layers_encoder
        self.num_layers_decoder = num_layers_decoder
        self.rnn_size_encoder = rnn_size_encoder
        self.rnn_size_decoder = rnn_size_decoder
        self.save_path = save_path
        self.embedding_dim = embedding_dim
        self.mode = mode.upper()
        self.learning_rate = learning_rate
        self.learning_rate_decay = learning_rate_decay
        self.learning_rate_decay_steps = learning_rate_decay_steps
        self.keep_probability = keep_probability
        self.batch_size = batch_size
        self.beam_width = beam_width
        self.eos = eos
        self.sos = sos
        self.clip = clip
        self.pad = pad
        self.epochs = epochs
        self.inference_targets = inference_targets
        self.pretrained_embeddings_path = pretrained_embeddings_path
        self.use_cyclic_lr = use_cyclic_lr
        self.max_lr = max_lr
        self.summary_dir = summary_dir

    def build_graph(self):
        self.add_placeholders()
        self.add_embeddings()
        self.add_lookup_ops()
        self.initialize_session()
        self.add_seq2seq()
        self.saver = tf.train.Saver()
        print('Graph built.')

    def add_placeholders(self):
        self.ids_1 = tf.placeholder(tf.int32,
                                    shape=[None, None],
                                    name='ids_source')
        self.ids_2 = tf.placeholder(tf.int32,
                                    shape=[None, None],
                                    name='ids_target')
        self.sequence_lengths_1 = tf.placeholder(tf.int32,
                                                 shape=[None],
                                                 name='sequence_length_source')
        self.sequence_lengths_2 = tf.placeholder(tf.int32,
                                                 shape=[None],
                                                 name='sequence_length_target')
        self.maximum_iterations = tf.reduce_max(self.sequence_lengths_2,
                                                name='max_dec_len')

    def create_word_embedding(self, embed_name, vocab_size, embed_dim):
        """Creates embedding matrix in given shape - [vocab_size, embed_dim].
        """
        embedding = tf.get_variable(embed_name,
                                    shape=[vocab_size, embed_dim],
                                    dtype=tf.float32)
        return embedding

    def add_embeddings(self):
        """Creates the embedding matrix. In case path to pretrained embeddings is given,
           that embedding is loaded. Otherwise created.
        """
        if self.pretrained_embeddings_path is not None:
            self.embedding = tf.Variable(np.load(self.pretrained_embeddings_path),
                                         name='embedding')
            print('Loaded pretrained embeddings.')
        else:
            self.embedding = self.create_word_embedding('embedding',
                                                        self.vocab_size,
                                                        self.embedding_dim)

    def add_lookup_ops(self):
        """Additional lookup operation for both source embedding and target embedding matrix.
        """
        self.word_embeddings_1 = tf.nn.embedding_lookup(self.embedding,
                                                        self.ids_1,
                                                        name='word_embeddings_1')
        self.word_embeddings_2 = tf.nn.embedding_lookup(self.embedding,
                                                        self.ids_2,
                                                        name='word_embeddings_2')

    def make_rnn_cell(self, rnn_size, keep_probability):
        """Creates LSTM cell wrapped with dropout.
        """
        cell = tf.nn.rnn_cell.LSTMCell(rnn_size)
        cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=keep_probability)
        return cell

    def make_attention_cell(self, dec_cell, rnn_size, enc_output, lengths, alignment_history=False):
        """Wraps the given cell with Bahdanau Attention.
        """
        attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(num_units=rnn_size,
                                                                   memory=enc_output,
                                                                   memory_sequence_length=lengths,
                                                                   name='BahdanauAttention')

        return tf.contrib.seq2seq.AttentionWrapper(cell=dec_cell,
                                                   attention_mechanism=attention_mechanism,
                                                   attention_layer_size=None,
                                                   output_attention=False,
                                                   alignment_history=alignment_history)

    def triangular_lr(self, current_step):
        """Cyclic learning rate: a triangular cycle damped exponentially with the step."""
        step_size = self.learning_rate_decay_steps
        base_lr = self.learning_rate
        max_lr = self.max_lr

        cycle = tf.floor(1 + current_step / (2 * step_size))
        x = tf.abs(current_step / step_size - 2 * cycle + 1)
        lr = base_lr + (max_lr - base_lr) * tf.maximum(0.0, tf.cast((1.0 - x), dtype=tf.float32)) * (0.99999 ** tf.cast(
            current_step,
            dtype=tf.float32))
        return lr


    def add_seq2seq(self):
        """Creates the sequence to sequence architecture."""
        with tf.variable_scope('dynamic_seq2seq', dtype=tf.float32):
            # Encoder
            encoder_outputs, encoder_state = self.build_encoder()

            # Decoder
            logits, sample_id, final_context_state = self.build_decoder(encoder_outputs,
                                                                        encoder_state)
            if self.mode == 'TRAIN':

                # Loss
                loss = self.compute_loss(logits)
                self.train_loss = loss
                self.eval_loss = loss
                self.global_step = tf.Variable(0, trainable=False)


                # cyclic learning rate
                if self.use_cyclic_lr:
                    self.learning_rate = self.triangular_lr(self.global_step)

                # exponential learning rate
                else:
                    self.learning_rate = tf.train.exponential_decay(
                        self.learning_rate,
                        self.global_step,
                        decay_steps=self.learning_rate_decay_steps,
                        decay_rate=self.learning_rate_decay,
                        staircase=True)

                # Optimizer
                opt = tf.train.AdamOptimizer(self.learning_rate)


                # Gradients
                if self.clip > 0:
                    grads, vs = zip(*opt.compute_gradients(self.train_loss))
                    grads, _ = tf.clip_by_global_norm(grads, self.clip)
                    self.train_op = opt.apply_gradients(zip(grads, vs),
                                                        global_step=self.global_step)
                else:
                    self.train_op = opt.minimize(self.train_loss,
                                                 global_step=self.global_step)



            elif self.mode == 'INFER':
                self.infer_logits = logits
                self.final_context_state = final_context_state
                self.sample_id = sample_id
                self.sample_words = self.sample_id

    def build_encoder(self):
        """The encoder. Bidirectional LSTM."""

        with tf.variable_scope("encoder"):
            fw_cell = self.make_rnn_cell(self.rnn_size_encoder // 2, self.keep_probability)
            bw_cell = self.make_rnn_cell(self.rnn_size_encoder // 2, self.keep_probability)

            for _ in range(self.num_layers_encoder):
                (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
                    cell_fw=fw_cell,
                    cell_bw=bw_cell,
                    inputs=self.word_embeddings_1,
                    sequence_length=self.sequence_lengths_1,
                    dtype=tf.float32)
                encoder_outputs = tf.concat((out_fw, out_bw), -1)

            bi_state_c = tf.concat((state_fw.c, state_bw.c), -1)
            bi_state_h = tf.concat((state_fw.h, state_bw.h), -1)
            bi_lstm_state = tf.nn.rnn_cell.LSTMStateTuple(c=bi_state_c, h=bi_state_h)
            encoder_state = tuple([bi_lstm_state] * self.num_layers_encoder)

            return encoder_outputs, encoder_state


    def build_decoder(self, encoder_outputs, encoder_state):

        sos_id_2 = tf.cast(self.word2ind[self.sos], tf.int32)
        eos_id_2 = tf.cast(self.word2ind[self.eos], tf.int32)
        self.output_layer = Dense(self.vocab_size, name='output_projection')

        # Decoder.
        with tf.variable_scope("decoder") as decoder_scope:

            cell, decoder_initial_state = self.build_decoder_cell(
                encoder_outputs,
                encoder_state,
                self.sequence_lengths_1)

            # Train
            if self.mode != 'INFER':

                helper = tf.contrib.seq2seq.ScheduledEmbeddingTrainingHelper(
                    inputs=self.word_embeddings_2,
                    sequence_length=self.sequence_lengths_2,
                    embedding=self.embedding,
                    sampling_probability=0.5,
                    time_major=False)

                # Decoder
                my_decoder = tf.contrib.seq2seq.BasicDecoder(cell,
                                                             helper,
                                                             decoder_initial_state,
                                                             output_layer=self.output_layer)

                # Dynamic decoding
                outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(
                    my_decoder,
                    output_time_major=False,
                    maximum_iterations=self.maximum_iterations,
                    swap_memory=False,
                    impute_finished=True,
                    scope=decoder_scope
                )

                sample_id = outputs.sample_id
                logits = outputs.rnn_output


            # Inference
            else:
                start_tokens = tf.fill([self.batch_size], sos_id_2)
                end_token = eos_id_2

                # beam search
                if self.beam_width > 0:
                    my_decoder = tf.contrib.seq2seq.BeamSearchDecoder(
                        cell=cell,
                        embedding=self.embedding,
                        start_tokens=start_tokens,
                        end_token=end_token,
                        initial_state=decoder_initial_state,
                        beam_width=self.beam_width,
                        output_layer=self.output_layer,
                    )

                # greedy
                else:
                    helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(self.embedding,
                                                                      start_tokens,
                                                                      end_token)

                    my_decoder = tf.contrib.seq2seq.BasicDecoder(cell,
                                                                 helper,
                                                                 decoder_initial_state,
                                                                 output_layer=self.output_layer)
                if self.inference_targets:
                    maximum_iterations = self.maximum_iterations
                else:
                    maximum_iterations = None

                # Dynamic decoding
                outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(
                    my_decoder,
                    maximum_iterations=maximum_iterations,
                    output_time_major=False,
                    impute_finished=False,
                    swap_memory=False,
                    scope=decoder_scope)

                if self.beam_width > 0:
                    logits = tf.no_op()
                    sample_id = outputs.predicted_ids
                else:
                    logits = outputs.rnn_output
                    sample_id = outputs.sample_id

        return logits, sample_id, final_context_state

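    def build_decoder_cell(self, encoder_outputs, encoder_state,
                           sequence_lengths_1):
        """Builds the attention decoder cell. For inference with beam search, the encoder
           outputs, encoder state and source lengths are tiled beam_width times.
           The decoder is initialised with the last encoder state.
        """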

        memory = encoder_outputs

        if self.mode == 'INFER' and self.beam_width > 0:
            memory = tf.contrib.seq2seq.tile_batch(memory,
                                                   multiplier=self.beam_width)
            encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state,
                                                          multiplier=self.beam_width)
            sequence_lengths_1 = tf.contrib.seq2seq.tile_batch(sequence_lengths_1,
                                                               multiplier=self.beam_width)
            batch_size = self.batch_size * self.beam_width

        else:
            batch_size = self.batch_size

        # stack several LSTM cells if a number of decoder layers is given
        if self.num_layers_decoder is not None:
            lstm_cell = tf.nn.rnn_cell.MultiRNNCell(
                [self.make_rnn_cell(self.rnn_size_decoder, self.keep_probability) for _ in
                 range(self.num_layers_decoder)])

        else:
            lstm_cell = self.make_rnn_cell(self.rnn_size_decoder, self.keep_probability)

        # attention cell
        cell = self.make_attention_cell(lstm_cell,
                                        self.rnn_size_decoder,
                                        memory,
                                        sequence_lengths_1)

        decoder_initial_state = cell.zero_state(batch_size, tf.float32).clone(cell_state=encoder_state)

        return cell, decoder_initial_state


    def compute_loss(self, logits):
        """Compute the loss during optimization."""
        target_output = self.ids_2
        max_time = self.maximum_iterations

        target_weights = tf.sequence_mask(self.sequence_lengths_2,
                                          max_time,
                                          dtype=tf.float32,
                                          name='mask')

        loss = tf.contrib.seq2seq.sequence_loss(logits=logits,
                                                targets=target_output,
                                                weights=target_weights,
                                                average_across_timesteps=True,
                                                average_across_batch=True, )
        return loss


    def train(self,
              inputs,
              targets,
              restore_path=None,
              validation_inputs=None,
              validation_targets=None):
        """Performs the training process. Runs training step in every epoch.
           Shuffles input data before every epoch.
           Optionally: - add tensorboard summaries.
                       - restoring previous model and retraining on top.
                       - evaluation step.
        """
        assert len(inputs) == len(targets)

        if self.summary_dir is not None:
            self.add_summary()

        self.initialize_session()
        if restore_path is not None:
            self.restore_session(restore_path)

        best_score = np.inf
        nepoch_no_imprv = 0

        inputs = np.array(inputs)
        targets = np.array(targets)

        for epoch in range(self.epochs + 1):
            print("============> STARTING EPOCH {}/{} <============".format(epoch, self.epochs))

            # shuffle the input data before every epoch.
            shuffle_indices = np.random.permutation(len(inputs))
            inputs = inputs[shuffle_indices]
            targets = targets[shuffle_indices]

            # run training epoch
            score = self.run_epoch(inputs, targets, epoch)

            # evaluate model
            if validation_inputs is not None and validation_targets is not None:
                self.run_evaluate(validation_inputs, validation_targets, epoch)


            if score <= best_score:
                nepoch_no_imprv = 0
                if not os.path.exists(self.save_path):
                    os.makedirs(self.save_path)
                self.saver.save(self.sess, self.save_path)
                best_score = score
                print("--- new best score ---\n\n")
            else:
                # warm up epochs for the model
                if epoch > 10:
                    nepoch_no_imprv += 1
                # early stopping
                if nepoch_no_imprv >= 5:
                    print("- early stopping {} epochs without improvement".format(nepoch_no_imprv))
                    break

    def infer(self, inputs, restore_path, targets=None):
        """Runs inference process. No training takes place.
           Returns the predicted ids for every sentence.
        """
        self.initialize_session()
        self.restore_session(restore_path)

        prediction_ids = []
        if targets is not None:
            feed, _, sequence_lengths_2 = self.get_feed_dict(inputs, trgts=targets)
        else:
            feed, _ = self.get_feed_dict(inputs)

        infer_logits, s_ids = self.sess.run([self.infer_logits, self.sample_words], feed_dict=feed)
        prediction_ids.append(s_ids)

        # for (inps, trgts) in summarizer_model_utils.minibatches(inputs, targets, self.batch_size):
        #     feed, _, sequence_lengths= self.get_feed_dict(inps, trgts=trgts)
        #     infer_logits, s_ids = self.sess.run([self.infer_logits, self.sample_words], feed_dict = feed)
        #     prediction_ids.append(s_ids)

        return prediction_ids

    def run_epoch(self, inputs, targets, epoch):
        """Runs a single epoch.
           Returns the average loss value on the epoch."""
        batch_size = self.batch_size
        nbatches = (len(inputs) + batch_size - 1) // batch_size
        losses = []

        for i, (inps, trgts) in enumerate(minibatches(inputs, targets, batch_size)):
            if inps is not None and trgts is not None:
                fd, sl, s2 = self.get_feed_dict(inps,
                                                trgts=trgts)

                if i % 10 == 0 and self.summary_dir is not None:
                    _, train_loss, training_summ = self.sess.run([self.train_op,
                                                                  self.train_loss,
                                                                  self.training_summary],
                                                                 feed_dict=fd)
                    self.training_writer.add_summary(training_summ, epoch*nbatches + i)

                else:
                    _, train_loss = self.sess.run([self.train_op, self.train_loss],
                                                  feed_dict=fd)

                if i % 2 == 0 or i == (nbatches - 1):
                    print('Epoch {}/{}... Iteration: {}/{}... Training Loss: {:.4f}'.format(epoch, self.epochs, i, nbatches - 1, train_loss))
                losses.append(train_loss)

            else:
                print('Minibatch empty.')
                continue

        # average in NumPy so that no new op is added to the graph every epoch
        avg_loss = np.mean(losses)
        print('Average Score for this Epoch: {}'.format(avg_loss))

        return avg_loss

    def run_evaluate(self, inputs, targets, epoch):
        """Runs evaluation on validation inputs and targets.
        Optionally: - writes summary to Tensorboard.
        """
        eval_losses = []
        fd = None
        for inps, trgts in minibatches(inputs, targets, self.batch_size):
            fd, sl, s2 = self.get_feed_dict(inps, trgts)
            eval_loss = self.sess.run(self.eval_loss, feed_dict=fd)
            eval_losses.append(eval_loss)

        # average in NumPy so that no new op is added to the graph on every call
        avg_eval_loss = np.mean(eval_losses)
        print('Eval_loss: {}\n'.format(avg_eval_loss))

        if self.summary_dir is not None and fd is not None:
            # the summary op is evaluated on the last validation batch only;
            # it is fetched directly (not in a list) so add_summary receives the summary itself
            eval_summ = self.sess.run(self.eval_summary, feed_dict=fd)
            self.eval_writer.add_summary(eval_summ, epoch)



    def get_feed_dict(self, inps, trgts=None):
        """Creates the feed_dict that is fed into training or inference network.
           Pads inputs and targets.
           Returns feed_dict and sequence_length(s) depending on whether targets are given.
        """
        pad_id = self.word2ind[self.pad]

        inp_ids, sequence_lengths_1 = pad_sequences(inps, pad_id, tail=False)

        feed = {
            self.ids_1: inp_ids,
            self.sequence_lengths_1: sequence_lengths_1
        }

        if trgts is None:
            return feed, sequence_lengths_1

        trgt_ids, sequence_lengths_2 = pad_sequences(trgts, pad_id, tail=True)

        # the target ids are only fed outside of inference mode;
        # in inference mode only the target lengths are fed
        if self.mode != 'INFER':
            feed[self.ids_2] = trgt_ids
        feed[self.sequence_lengths_2] = sequence_lengths_2

        return feed, sequence_lengths_1, sequence_lengths_2

    def initialize_session(self):
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    def restore_session(self, restore_path):
        self.saver.restore(self.sess, restore_path)
        print('Done.')

    def add_summary(self):
        """Summaries for Tensorboard."""
        self.training_summary = tf.summary.scalar('training_loss', self.train_loss)
        self.eval_summary = tf.summary.scalar('evaluation_loss', self.eval_loss)
        self.training_writer = tf.summary.FileWriter(self.summary_dir,
                                                     tf.get_default_graph())
        self.eval_writer = tf.summary.FileWriter(self.summary_dir)
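
The schedule computed by `triangular_lr` can be checked outside of TensorFlow. The sketch below mirrors the same formula in plain Python; the function name `triangular_lr_value` and the example hyperparameters are assumptions made for illustration, not values used in this notebook.

```python
import math


def triangular_lr_value(step, base_lr, max_lr, step_size, decay=0.99999):
    """Plain-Python mirror of the formula in `triangular_lr` (illustrative only)."""
    # Index of the current cycle; one full cycle spans 2 * step_size steps.
    cycle = math.floor(1 + step / (2 * step_size))
    # Distance from the peak of the current cycle, between 0 and 1.
    x = abs(step / step_size - 2 * cycle + 1)
    # Linear ramp up and down, damped by an exponential factor of the step.
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * decay ** step


# Hypothetical settings, chosen only for the demonstration.
for step in (0, 500, 1000, 1500, 2000):
    print(step, triangular_lr_value(step, base_lr=0.001, max_lr=0.01, step_size=1000))
```

The rate starts at `base_lr`, peaks after `step_size` steps and returns to `base_lr` after `2 * step_size` steps, while the exponential factor slowly lowers each successive peak.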


<br>
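
`compute_loss` builds a mask from the target lengths so that padded positions do not contribute to the loss. A minimal pure-Python sketch of that idea follows. It assumes that, with both averaging flags set, the masked cross-entropy is summed and divided by the number of non-padded tokens; the helper names `sequence_mask`, `token_cross_entropy` and `masked_sequence_loss` are hypothetical and only serve the illustration.

```python
import math


def sequence_mask(lengths, max_len):
    """1.0 for real tokens and 0.0 for padding, one row per sequence."""
    return [[1.0 if t < length else 0.0 for t in range(max_len)] for length in lengths]


def token_cross_entropy(logits, target):
    """Softmax cross-entropy of a single token given its logits and target id."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def masked_sequence_loss(logits, targets, lengths):
    """Average cross-entropy over all non-padded tokens of the batch."""
    max_len = max(lengths)
    mask = sequence_mask(lengths, max_len)
    total, count = 0.0, 0.0
    for b in range(len(targets)):
        for t in range(max_len):
            total += mask[b][t] * token_cross_entropy(logits[b][t], targets[b][t])
            count += mask[b][t]
    return total / count


# Toy batch: 2 sequences, 2 time steps, vocabulary of 3 words.
# The second sequence has length 1, so its second step is padding.
toy_logits = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
              [[0.0, 0.0, 0.0], [5.0, -5.0, 0.0]]]
toy_targets = [[0, 1], [2, 0]]
print(masked_sequence_loss(toy_logits, toy_targets, lengths=[2, 1]))
```

Because the padded step is masked out, its logits have no effect on the result.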

### LESSON 3.3: INITIALIZE THE SEQ2SEQ MODEL

<hr>

Now, let's create the model in `TRAIN` mode. The call below passes the vocabulary mappings (`word2ind` and `ind2word`) and the hyperparameters to the `seq2seq` class defined above. Checkpoints are written to `save_path`.

In [None]:
# reset the graph and create the model in training mode
reset_graph()
summarizer = seq2seq(word2ind,
                     ind2word,
                     save_path = 'saved model',
                     mode = 'TRAIN',
                     num_layers_encoder = num_layers_encoder,
                     num_layers_decoder = num_layers_decoder,
                     rnn_size_encoder = rnn_size_encoder,
                     rnn_size_decoder = rnn_size_decoder,
                     batch_size = batch_size,
                     clip = clip,
                     keep_probability = keep_probability,
                     learning_rate = learning_rate,
                     max_lr = max_lr,
                     learning_rate_decay_steps = learning_rate_decay_steps,
                     learning_rate_decay = learning_rate_decay,
                     epochs = epochs,
                     pretrained_embeddings_path = pretrained_embeddings_path,
                     use_cyclic_lr = use_cyclic_lr,
                     summary_dir = summary_dir)
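
The `clip` argument passed above is the threshold that `add_seq2seq` hands to `tf.clip_by_global_norm`. As a rough illustration of what that does, the sketch below rescales a list of gradients in plain Python, assuming every gradient is multiplied by `clip_norm / max(global_norm, clip_norm)`. The function here is a simplified stand-in written for this example, not the TensorFlow implementation.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so that their joint L2 norm is at most clip_norm."""
    # The global norm treats all gradients as one long vector.
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale is 1.0 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm


# Two toy "gradients" with a joint norm of 5.
clipped, norm = clip_by_global_norm([[3.0], [4.0]], clip_norm=1.0)
print(norm, clipped)
```

All gradients are scaled by the same factor, so the direction of the update is preserved and only its size is limited.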

<br>

### LESSON 3.4: TRAIN THE SEQ2SEQ MODEL

<hr>

Now we build the graph and train the model. The first 70,976 examples are used for training and the remaining ones are held out for validation.

In [None]:
summarizer.build_graph()

summarizer.train(converted_texts[:70976], 
                 converted_summaries[:70976],
                 validation_inputs = converted_texts[70976:],
                 validation_targets = converted_summaries[70976:])

# hidden training output.
# both train and validation loss decrease nicely.
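
The `train` method saves a checkpoint whenever the average training loss of an epoch is at least as good as the best one so far, and it stops early after 5 epochs without improvement, counted only after epoch 10. The sketch below replays that bookkeeping on a list of made-up losses; `simulate_early_stopping` and its parameter names are hypothetical and exist only for this illustration.

```python
def simulate_early_stopping(epoch_losses, warmup_epochs=10, patience=5):
    """Return (best_epoch, last_epoch_run) for a sequence of per-epoch losses."""
    best_score = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, score in enumerate(epoch_losses):
        if score <= best_score:
            # New best score: this is where the model would be saved.
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            # Epochs up to the warm-up limit never count against the model.
            if epoch > warmup_epochs:
                epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch

    return best_epoch, len(epoch_losses) - 1


# Made-up losses: steady improvement for 12 epochs, then a plateau.
losses = [12.0 - i for i in range(12)] + [5.0] * 10
print(simulate_early_stopping(losses))  # (11, 16)
```

Note that the score compared here is the training loss returned by `run_epoch`; the validation loss is only reported.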

# Chapter 4: Inference

<hr>

In this chapter, we will restore the trained model and use it to generate summaries.

<br>

### LESSON 4.1: INITIALIZE THE SEQ2SEQ MODEL FOR INFERENCE

<hr>

For inference, we rebuild the graph in `INFER` mode and restore the weights saved during training. Dropout is switched off (`keep_probability = 1.0`) and beam search with a beam width of 5 is used. The batch size is set to the number of texts we want to summarize, here the first 50.

In [None]:
reset_graph()
summarizer = seq2seq(word2ind,
                     ind2word,
                     save_path = 'saved model',
                     mode = 'INFER',
                     num_layers_encoder = num_layers_encoder,
                     num_layers_decoder = num_layers_decoder,
                     batch_size = len(converted_texts[:50]),
                     clip = clip,
                     keep_probability = 1.0,
                     learning_rate = 0.0,
                     beam_width = 5,
                     rnn_size_encoder = rnn_size_encoder,
                     rnn_size_decoder = rnn_size_decoder,
                     inference_targets = True,
                     pretrained_embeddings_path = pretrained_embeddings_path)

summarizer.build_graph()
preds = summarizer.infer(converted_texts[:50],
                         restore_path = 'saved model',
                         targets = converted_summaries[:50])
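
With `beam_width = 5`, two pieces of bookkeeping happen around the decoder: `build_decoder_cell` tiles the encoder outputs so that every source sequence appears once per beam, and the predicted ids gain an extra beam dimension. The plain-Python sketch below illustrates both steps on toy data. The helpers `tile_batch` and `best_beam` are simplified stand-ins written for this example, and taking index 0 as the best beam is an assumption (beams ordered from best to worst).

```python
def tile_batch(batch, beam_width):
    """Repeat every batch entry beam_width times, keeping the copies adjacent."""
    return [item for item in batch for _ in range(beam_width)]


def best_beam(prediction, stop_ids=()):
    """Pick the first beam at every time step and drop the given special ids.

    `prediction` is a list over time steps; each step holds one id per beam.
    """
    return [step[0] for step in prediction if step[0] not in stop_ids]


# Two source sequences tiled for a beam width of 3.
print(tile_batch(["text_a", "text_b"], beam_width=3))
# ['text_a', 'text_a', 'text_a', 'text_b', 'text_b', 'text_b']

# Toy prediction for one example: 4 time steps, 3 beams, id 2 standing in for <EOS>.
toy_prediction = [[7, 7, 5], [9, 4, 9], [3, 3, 3], [2, 2, 2]]
print(best_beam(toy_prediction, stop_ids={2}))  # [7, 9, 3]
```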


<br>

### LESSON 4.2: PREDICT THE VALIDATION SET

<hr>

The helper below prints, for every example, the original text, the reference summary and the summary created by the model. It handles both beam search and greedy decoding, and can optionally report a BLEU score.

In [None]:
def sample_results(preds, ind2word, word2ind, converted_summaries, converted_texts, use_bleu=False):
    """Prints the actual text and summary and the corresponding created summary.
    Takes care of whether beam search or greedy decoder was used.
    """
    # beam search output has an extra (beam) dimension
    beam = len(np.array(preds).shape) == 4
    special_ids = (word2ind['<SOS>'], word2ind['<EOS>'])

    # Bleu score is not used correctly here, but serves as reference.
    bleu_scores = []

    for pred, summary, text in zip(preds[0], converted_summaries, converted_texts):
        print('\n\n\n', 100 * '-')

        if beam:
            # keep only the first beam at every time step
            pred = [word[0] for word in pred]

        actual_text = [ind2word[word] for word in text if word not in special_ids]
        actual_summary = [ind2word[word] for word in summary if word not in special_ids]
        created_summary = [ind2word[word] for word in pred if word not in special_ids]

        print('Actual Text:\n{}\n'.format(' '.join(actual_text)))
        print('Actual Summary:\n{}\n'.format(' '.join(actual_summary)))
        print('Created Summary:\n{}\n'.format(' '.join(created_summary)))

        if use_bleu:
            bleu_score = sentence_bleu([actual_summary], created_summary)
            bleu_scores.append(bleu_score)
            print('Bleu-score:', bleu_score)

    if use_bleu:
        bleu_score = np.mean(bleu_scores)
        print('\n\n\nTotal Bleu Score:', bleu_score)

In [None]:
# show results
sample_results(preds,
              ind2word,
              word2ind,
              converted_summaries[:50],
              converted_texts[:50])
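
Both branches of `sample_results` apply the same filter: they drop the `<SOS>` and `<EOS>` ids and map the remaining ids back to words. The sketch below isolates that step under a few assumptions: `decode_tokens` is a hypothetical helper that does not exist in this notebook, the toy vocabulary is invented for illustration, and beam output is assumed to be one row of candidate ids per step, as the `word[0]` indexing above suggests.

```python
def decode_tokens(ids, ind2word, word2ind, specials=("<SOS>", "<EOS>")):
    """Map token ids to words, skipping the special start/end markers."""
    skip = {word2ind[tok] for tok in specials if tok in word2ind}
    return [ind2word[i] for i in ids if i not in skip]


# Toy vocabulary, invented for this example only
word2ind = {"<SOS>": 0, "<EOS>": 1, "great": 2, "coffee": 3}
ind2word = {ind: word for word, ind in word2ind.items()}

# Greedy-style output: a flat sequence of ids
print(decode_tokens([0, 2, 3, 1], ind2word, word2ind))

# Beam-style output (assumed layout): one row per step, first entry taken
beam_pred = [[2, 3], [3, 2], [1, 1]]
print(decode_tokens([step[0] for step in beam_pred], ind2word, word2ind))
```

A helper like this could replace the three near-identical list comprehensions in each branch, so the beam and non-beam paths would differ only in how the ids are extracted from `pred`.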

<br>

### LESSON 4.3: CALCULATE THE ACCURACY

<hr>
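
The BLEU score printed by `sample_results` is built from clipped n-gram precisions, which are then combined with a brevity penalty. The sketch below covers only the unigram precision part, to show what "clipping" means; it is not a substitute for `sentence_bleu`, and the function name and sample tokens are made up for this example.

```python
from collections import Counter


def clipped_unigram_precision(reference, candidate):
    """Fraction of candidate tokens that also occur in the reference.

    Each candidate word is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    if not candidate:
        return 0.0
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    matched = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return matched / len(candidate)


# Invented sample tokens: "great" is repeated in the candidate,
# but only one occurrence is credited because the reference has one.
reference = ["great", "coffee", "beans"]
candidate = ["great", "great", "coffee"]
print(clipped_unigram_precision(reference, candidate))
```

Without clipping, a summary that repeats one correct word many times would score perfectly, which is why the repeated `"great"` is only counted once here.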


<br>

### LESSON 4.4: PREDICT UNSEEN SENTENCES

<hr>


##### THE END OF CODE

<br>

# Conclusion

---


<br>

# Improvements

---


<br>

# Resources

---


1. <a>GitHub Blog Series: Text Summarization</a>. Thanks to Blah; many of the ideas in this notebook come from his code.