<a href="https://colab.research.google.com/github/tottenjordan/deep-learning/blob/master/text-summarization/abstract_text_sum_aws_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook uses a bidirectional sequence-to-sequence RNN to generate text summaries. The approach is abstractive: the model learns the semantic and contextual relationship between each review and its summary, then attempts to summarize out-of-sample reviews with words that carry similar meaning, rather than only extracting words from the review itself (the extractive method).

Jordan Totten

## Notebook Setup

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
!ls /content/drive/'My Drive'/'Colab Notebooks'/Amazon-Reviews

models			  Reviews.csv		      Reviews_pickled.pkl
numberbatch-en-17.06.txt  reviews_optimal_sample.csv


*Note: I reverted to TensorFlow 1.1.0 because I had trouble with the "Dynamic Attention" functions in later TF versions. The attention methods used in this model have been deprecated in later TF versions, and I need to understand their additional parameters before upgrading.*

In [0]:
!pip install tensorflow==1.1.0
import pandas as pd
import numpy as np
import tensorflow as tf
import re
from nltk.corpus import stopwords
import time
from tensorflow.python.layers.core import Dense
from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors
print('TensorFlow Version: {}'.format(tf.__version__))

Collecting tensorflow==1.1.0
  Downloading https://files.pythonhosted.org/packages/cd/e4/b2a8bcd1fa689489050386ec70c5c547e4a75d06f2cc2b55f45463cd092c/tensorflow-1.1.0-cp36-cp36m-manylinux1_x86_64.whl (31.4MB)
    100% |████████████████████████████████| 31.4MB 1.1MB/s 
stable-baselines 2.2.1 has requirement tensorflow>=1.5.0, but you'll have tensorflow 1.1.0 which is incompatible.
magenta 0.3.19 has requirement tensorflow>=1.12.0, but you'll have tensorflow 1.1.0 which is incompatible.
Installing collected packages: tensorflow
  Found existing installation: tensorflow 1.13.1
    Uninstalling tensorflow-1.13.1:
      Successfully uninstalled tensorflow-1.13.1
Successfully installed tensorflow-1.1.0
TensorFlow Version: 1.1.0


## Reviews Data
This dataset comes from the [Amazon Fine Food Reviews dataset on Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews). The original dataset has more than 500,000 reviews. The reviews selected for this study were preprocessed locally. I created a "usefulness" index: the number of users who found the review helpful divided by the number of users who indicated whether or not the review was helpful. The reviews with the top 25,001 usefulness index scores were used. The main fields are:



1. Product ID
2. User ID
3. HelpfulnessNumerator is the number of users who found the review helpful
4. HelpfulnessDenominator is the number of users who indicated whether they found the review helpful or not
5. Score is a rating between 1 and 5 for the product being reviewed
6. Time is a timestamp
7. Summary is a brief summary of the review
8. Text is the full text of the review
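
The usefulness index described above can be sketched in a few lines. This is only an illustration under assumptions: the local preprocessing script is not part of this notebook, so the function name `usefulness`, the sample vote pairs, and the choice to skip reviews with no votes are all hypothetical.

```python
def usefulness(helpful_votes, total_votes):
    """Helpful votes divided by total votes; None when no one voted."""
    if total_votes == 0:
        return None
    return helpful_votes / total_votes


# Hypothetical (HelpfulnessNumerator, HelpfulnessDenominator) pairs
votes = [(5, 5), (3, 4), (0, 0), (1, 2)]

scored = [(n, d, usefulness(n, d)) for n, d in votes]
# Rank the reviews that received votes, highest usefulness first
ranked = sorted((row for row in scored if row[2] is not None),
                key=lambda row: row[2], reverse=True)
top_two = ranked[:2]
print(top_two)  # [(5, 5, 1.0), (3, 4, 0.75)]
```

Keeping the top N rows of such a ranking is one way a fixed-size sample could be drawn; how ties and zero-vote reviews were actually handled is not stated here.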



In [0]:
reviews = pd.read_csv("/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/reviews_optimal_sample.csv")
reviews.shape

(25000, 11)

### Data Sanity Checks

In [0]:
reviews.head(5)

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text | usefulness |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64422 | B000MIDROQ | A161DK06JJMCYF | J. E. Stephens "Jeanne" | 3.0 | 1 | 5 | 1224892800 | Bought This for My Son at College | My son loves spaghetti so I didn't hesitate or... | 3.0 |
| 1 | 44737 | B001EQ55RW | A2V0I904FH7ABY | Ram | 3.0 | 2 | 4 | 1212883200 | Pure cocoa taste with crunchy almonds inside | It was almost a 'love at first bite' - the per... | 1.5 |
| 2 | 508796 | B001EO7GFS | A1CRI3DKT18JBX | D. Burck | 5.0 | 5 | 4 | 1251331200 | excellent value | Excellent value for a premium product. As typi... | 1.0 |
| 3 | 386062 | B004AGBYA0 | AA8OP79Z9GMQL | Tudor | 1.0 | 1 | 1 | 1333065600 | Another good product bites the dust? | This was my favourite brand of cod liver, it's... | 1.0 |
| 4 | 8157 | B001IZA8S0 | A1KKE6VX8VPWZK | J. J. Marino "Geekasaurus Rex" | 8.0 | 8 | 5 | 1278201600 | Great tasting tea. | We usually have fresh lemon grass tea at our l... | 1.0 |


In [0]:
# Check for any null values
reviews.isnull().sum()

Id                        0
ProductId                 0
UserId                    0
ProfileName               0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Time                      0
Summary                   0
Text                      0
usefulness                0
dtype: int64

In [0]:
# Drop the features that are not needed (the null check above found no nulls to remove)
# reviews = reviews.dropna()
reviews = reviews.drop(['Id','ProductId','UserId','ProfileName','HelpfulnessNumerator','HelpfulnessDenominator',
                        'Score','Time','usefulness'], axis=1)
reviews = reviews.reset_index(drop=True)

In [0]:
reviews.head()

| | Summary | Text |
|---|---|---|
| 0 | Bought This for My Son at College | My son loves spaghetti so I didn't hesitate or... |
| 1 | Pure cocoa taste with crunchy almonds inside | It was almost a 'love at first bite' - the per... |
| 2 | excellent value | Excellent value for a premium product. As typi... |
| 3 | Another good product bites the dust? | This was my favourite brand of cod liver, it's... |
| 4 | Great tasting tea. | We usually have fresh lemon grass tea at our l... |


In [0]:
# Inspecting some of the reviews
for i in range(5):
    print("Review #",i+1)
    print(reviews.Summary[i])
    print(reviews.Text[i])
    print()

Review # 1
Bought This for My Son at College
My son loves spaghetti so I didn't hesitate ordering this for him. He says they are great. I have tried them myself and they are delicious. Just open and pop them in the microwave. It is very easy. The best thing about ordering from Amazon grocery is that they deliver to your door. If you have a loved one that lives far away and may have limited transportation this is the answer. Just order what you want them to have and Amazon takes care of the rest.

Review # 2
Pure cocoa taste with crunchy almonds inside
It was almost a 'love at first bite' - the perfectly roasted almond with a nice thin layer of pure flavorful cocoa on the top.<br /><br />You can smell the cocoa as soon as you open the canister - making you want to take a bite.<br /><br />You may or may not like the taste of this cocoa roasted almonds depending on your likingness for cocoa.  We are so much used to the taste of chocolate (which is actually cocoa + many other ingredients l

### Natural Language Preprocessing

* Remove unwanted characters and expand contractions in reviews and summaries
* Remove stopwords from the review texts (the model inputs), but leave stopwords in the summaries (the labels)
* Keeping stopwords in the summaries lets the generative model produce more natural-sounding summaries
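
To make these steps concrete, here is a small self-contained sketch. It is not the cleaning function used in this notebook: `simple_clean`, the two-entry `CONTRACTIONS` map, the tiny `STOPWORDS` set, and the single catch-all character filter are simplified stand-ins chosen for illustration.

```python
import re

# Tiny illustrative lookups; the notebook uses a much longer contraction map
# and NLTK's English stopword list instead.
CONTRACTIONS = {"didn't": "did not", "it's": "it is"}
STOPWORDS = {"i", "it", "is", "the", "so", "did", "not", "a"}


def simple_clean(text, remove_stopwords=True):
    """Lower-case, expand contractions, strip punctuation, optionally drop stopwords."""
    words = [CONTRACTIONS.get(w, w) for w in text.lower().split()]
    # Replace anything that is not a lower-case letter, digit, or whitespace
    text = re.sub(r"[^a-z0-9\s]", " ", " ".join(words))
    tokens = text.split()
    if remove_stopwords:
        tokens = [w for w in tokens if w not in STOPWORDS]
    return " ".join(tokens)


review = "It's the best tea, so I didn't hesitate!"
print(simple_clean(review))                          # best tea hesitate
print(simple_clean(review, remove_stopwords=False))  # it is the best tea so i did not hesitate
```

The second call shows why stopwords are kept for the summaries: without them the text reads as a bag of keywords rather than a sentence.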

In [0]:
# A list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}


def clean_text(text, remove_stopwords = True):
    '''Remove unwanted characters and (optionally) stopwords, and normalize the text
    so that fewer words end up without a word embedding'''
    
    # Convert words to lower case
    text = text.lower()
    
    # Replace contractions with their longer forms
    new_text = []
    for word in text.split():
        if word in contractions:
            new_text.append(contractions[word])
        else:
            new_text.append(word)
    text = " ".join(new_text)
    
    # Format words and remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text) 
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    # Note: the character class above has already replaced '/', so '<br />' no longer
    # matches here and '<br >' fragments remain (visible in the cleaned output below)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    
    # Optionally, remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = " ".join(w for w in text.split() if w not in stops)

    return text

In [0]:
import nltk
nltk.download('stopwords')

# Keep stopwords for summaries 
# Clean the summaries and texts
clean_summaries = []
for summary in reviews.Summary:
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
print("Summaries are complete.")

clean_texts = []
for text in reviews.Text:
    clean_texts.append(clean_text(text))
print("Texts are complete.")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Summaries are complete.
Texts are complete.


Sanity check on cleaned summaries and texts

In [0]:
# Inspect the cleaned summaries and texts to ensure they have been cleaned well
for i in range(5):
    print("Clean Review #",i+1)
    print(clean_summaries[i])
    print(clean_texts[i])
    print()

Clean Review # 1
bought this for my son at college
son loves spaghetti hesitate ordering says great tried delicious open pop microwave easy best thing ordering amazon grocery deliver door loved one lives far away may limited transportation answer order want amazon takes care rest

Clean Review # 2
pure cocoa taste with crunchy almonds inside
almost love first bite perfectly roasted almond nice thin layer pure flavorful cocoa top <br ><br >you smell cocoa soon open canister making want take bite <br ><br >you may may like taste cocoa roasted almonds depending likingness cocoa much used taste chocolate actually cocoa many ingredients like milk might never really tasted really cocoa <br ><br >tasting item like tasting enjoying flavorful pure raw cocoa crunchy almonds center get box see real cocoa almonds <br ><br >where product loses star packaging external sleeve kind comes one piece try remove lid external sleeve kind tends come fully careful removing external sleeve canister

Clean Rev

In [0]:
# function to count vocabulary
def count_words(count_dict, text):
    '''Count the number of occurrences of each word in a set of text'''
    for sentence in text:
        for word in sentence.split():
            if word not in count_dict:
                count_dict[word] = 1
            else:
                count_dict[word] += 1

In [0]:
# Find the number of times each word was used and the size of the vocabulary
word_counts = {}

count_words(word_counts, clean_summaries)
count_words(word_counts, clean_texts)
            
print("Size of Vocabulary:", len(word_counts))

Size of Vocabulary: 34217


### Word Embeddings Matrix

[ConceptNet Numberbatch](https://github.com/commonsense/conceptnet-numberbatch) (CN) embeddings are used. CN is an ensemble of several pre-trained embedding sets (including GloVe).


In [0]:
# Load Conceptnet Numberbatch's (CN) embeddings 
# (https://github.com/commonsense/conceptnet-numberbatch)

embeddings_index = {}
with open("/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/numberbatch-en-17.06.txt", encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))

Word embeddings: 417195


* Find the number of words in our reviews corpus that are missing from the CN embeddings
* Set a frequency threshold of 20 (i.e., words occurring fewer than 20 times in the reviews that are not in the CN embeddings will not be counted)
* The threshold ensures that frequently used words missing from CN are still represented in our word embedding matrix, where they receive a randomly initialized vector
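
A toy version of this check may help. The corpus, the `pretrained` lookup (a stand-in for the CN index, where only the keys matter), and the threshold of 2 are made up for the example; the comparison uses `>=`, matching the vocabulary filter further down.

```python
from collections import Counter

# Made-up corpus and a stand-in for the pre-trained embedding index
corpus = ["great tea great price", "yummmy tea", "yummmy tastes great", "tea from costco"]
pretrained = {"great": None, "tea": None, "price": None, "tastes": None, "from": None}
threshold = 2

word_counts = Counter(word for sentence in corpus for word in sentence.split())

# Frequent words without a pre-trained vector: kept, and given a random vector later
missing_frequent = [w for w, c in word_counts.items()
                    if c >= threshold and w not in pretrained]
# Rare words without a pre-trained vector: left out of the vocabulary (they become UNK)
missing_rare = [w for w, c in word_counts.items()
                if c < threshold and w not in pretrained]

print(missing_frequent)  # ['yummmy']
print(missing_rare)      # ['costco']
```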

In [0]:
# Find the number of words that are missing from CN
# and are used more often than our threshold (note the strict ">" below).
missing_words = 0
threshold = 20

for word, count in word_counts.items():
    if count > threshold:
        if word not in embeddings_index:
            missing_words += 1
            
missing_ratio = round(missing_words/len(word_counts),4)*100
            
print("Number of words missing from CN:", missing_words)
print("Percent of words that are missing from vocabulary: {}%".format(missing_ratio))

Number of words missing from CN: 258
Percent of words that are missing from vocabulary: 0.75%


* UNK represents an unknown word
* GO is the first token fed to the decoder
* EOS is the token that marks the "end of sentence"; the decoder treats it as the end of an answer (punctuation is not used for this)
* PAD fills shorter sequences so that every sequence in a training batch has the same length
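
As a quick illustration of how UNK, EOS, and PAD show up in integer-encoded data, consider the sketch below. The six-entry vocabulary and the helper names `encode` and `pad` are hypothetical and exist only for this example.

```python
# Hypothetical miniature vocabulary with the four special tokens
vocab_to_int = {"great": 0, "tea": 1, "<UNK>": 2, "<PAD>": 3, "<EOS>": 4, "<GO>": 5}


def encode(sentence, add_eos=False):
    """Map words to integers, falling back to UNK for out-of-vocabulary words."""
    ids = [vocab_to_int.get(w, vocab_to_int["<UNK>"]) for w in sentence.split()]
    if add_eos:
        ids.append(vocab_to_int["<EOS>"])
    return ids


def pad(batch):
    """Right-pad every sequence with PAD up to the longest sequence in the batch."""
    longest = max(len(s) for s in batch)
    return [s + [vocab_to_int["<PAD>"]] * (longest - len(s)) for s in batch]


batch = pad([encode("great tea", add_eos=True),
             encode("great oolong tea today", add_eos=True)])
print(batch)  # [[0, 1, 4, 3, 3], [0, 2, 1, 2, 4]]
```

"oolong" and "today" are not in the toy vocabulary, so both map to the UNK integer, and the shorter sequence is filled with PAD after its EOS.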


In [0]:
# Limit the vocab that we will use to words that appear ≥ threshold or are in CN

#dictionary to convert words to integers
vocab_to_int = {} 

value = 0
for word, count in word_counts.items():
    if count >= threshold or word in embeddings_index:
        vocab_to_int[word] = value
        value += 1

# Special tokens that will be added to our vocab
codes = ["<UNK>","<PAD>","<EOS>","<GO>"]   

# Add codes to vocab
for code in codes:
    vocab_to_int[code] = len(vocab_to_int)

# Dictionary to convert integers to words
int_to_vocab = {}
for word, value in vocab_to_int.items():
    int_to_vocab[value] = word

usage_ratio = round(len(vocab_to_int) / len(word_counts),4)*100

print("Total number of unique words:", len(word_counts))
print("Number of words we will use:", len(vocab_to_int))
print("Percent of words we will use: {}%".format(usage_ratio))

Total number of unique words: 34217
Number of words we will use: 23912
Percent of words we will use: 69.88%


In [0]:
# Need to use 300 for embedding dimensions to match CN's vectors.
embedding_dim = 300
nb_words = len(vocab_to_int)

# Create matrix with default values of zero
word_embedding_matrix = np.zeros((nb_words, embedding_dim), dtype=np.float32)
for word, i in vocab_to_int.items():
    if word in embeddings_index:
        word_embedding_matrix[i] = embeddings_index[word]
    else:
        # If word not in CN, create a random embedding for it
        new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
        embeddings_index[word] = new_embedding
        word_embedding_matrix[i] = new_embedding

# Check if value matches len(vocab_to_int)
print(len(word_embedding_matrix))

23912


* Convert words in text to an integer 
* If word is not in vocab_to_int, use UNK's integer
* Total the number of words and UNKs
* Add EOS token to the end of the texts



In [0]:
def convert_to_ints(text, word_count, unk_count, eos=False):
    ints = []
    for sentence in text:
        sentence_ints = []
        for word in sentence.split():
            word_count += 1
            if word in vocab_to_int:
                sentence_ints.append(vocab_to_int[word])
            else:
                sentence_ints.append(vocab_to_int["<UNK>"])
                unk_count += 1
        if eos:
            sentence_ints.append(vocab_to_int["<EOS>"])
        ints.append(sentence_ints)
    return ints, word_count, unk_count

In [0]:
# Apply convert_to_ints to clean_summaries and clean_texts
word_count = 0
unk_count = 0

int_summaries, word_count, unk_count = convert_to_ints(clean_summaries, word_count, unk_count)
int_texts, word_count, unk_count = convert_to_ints(clean_texts, word_count, unk_count, eos=True)

unk_percent = round(unk_count/word_count,4)*100

print("Total number of words in headlines:", word_count)
print("Total number of UNKs in headlines:", unk_count)
print("Percent of words that are UNK: {}%".format(unk_percent))

Total number of words in headlines: 1218950
Total number of UNKs in headlines: 20945
Percent of words that are UNK: 1.72%


In [0]:
def create_lengths(text):
    '''Create a data frame of the sentence lengths from a text'''
    lengths = []
    for sentence in text:
        lengths.append(len(sentence))
    return pd.DataFrame(lengths, columns=['counts'])
  
lengths_summaries = create_lengths(int_summaries)
lengths_texts = create_lengths(int_texts)

print("Summaries:")
print(lengths_summaries.describe())
print()
print("Texts:")
print(lengths_texts.describe())

Summaries:
             counts
count  25000.000000
mean       4.250480
std        2.654133
min        0.000000
25%        2.000000
50%        4.000000
75%        6.000000
max       24.000000

Texts:
             counts
count  25000.000000
mean      45.507520
std       42.587654
min        1.000000
25%       20.000000
50%       33.000000
75%       56.000000
max      996.000000


In [0]:
# Inspect the length of texts
print(np.percentile(lengths_texts.counts, 90))
print(np.percentile(lengths_texts.counts, 95))
print(np.percentile(lengths_texts.counts, 99))

90.0
121.0
215.0


In [0]:
# Inspect the length of summaries
print(np.percentile(lengths_summaries.counts, 90))
print(np.percentile(lengths_summaries.counts, 95))
print(np.percentile(lengths_summaries.counts, 99))

8.0
9.0
13.0


In [0]:
def unk_counter(sentence):
    '''Counts the number of times UNK appears in a sentence.'''
    unk_count = 0
    for word in sentence:
        if word == vocab_to_int["<UNK>"]:
            unk_count += 1
    return unk_count

* Sort the summaries and texts by the length of the texts, shortest to longest
* Limit the length of summaries and texts based on the min and max ranges
* Remove reviews that include too many UNKs

In [0]:
sorted_summaries = []
sorted_texts = []
max_text_length = 84
max_summary_length = 13
min_length = 2
unk_text_limit = 1
unk_summary_limit = 0

for length in range(min(lengths_texts.counts), max_text_length): 
    for count, words in enumerate(int_summaries):
        if (len(int_summaries[count]) >= min_length and
            len(int_summaries[count]) <= max_summary_length and
            len(int_texts[count]) >= min_length and
            unk_counter(int_summaries[count]) <= unk_summary_limit and
            unk_counter(int_texts[count]) <= unk_text_limit and
            length == len(int_texts[count])
           ):
            sorted_summaries.append(int_summaries[count])
            sorted_texts.append(int_texts[count])
        
# Compare lengths to ensure they match
print(len(sorted_summaries))
print(len(sorted_texts))

16827
16827
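
The nested loop above rescans every review once per candidate length. An alternative that should select the same pairs in the same order is to filter once and then apply a stable sort on text length. The sketch below assumes that equivalence and is not the code that produced the counts above; `filter_and_sort` and its parameter names are hypothetical, with defaults mirroring the limits set in the cell.

```python
def filter_and_sort(summaries, texts, unk_id, min_length=2, max_text_length=84,
                    max_summary_length=13, unk_text_limit=1, unk_summary_limit=0):
    """Keep pairs that pass the length and UNK limits, ordered by text length."""
    kept = [
        (s, t) for s, t in zip(summaries, texts)
        if min_length <= len(s) <= max_summary_length
        and min_length <= len(t) < max_text_length   # range() above excludes max_text_length
        and s.count(unk_id) <= unk_summary_limit
        and t.count(unk_id) <= unk_text_limit
    ]
    # sort() is stable, so pairs with equal text length keep their original order
    kept.sort(key=lambda pair: len(pair[1]))
    return [s for s, _ in kept], [t for _, t in kept]


# Toy data: 9 plays the role of the UNK integer
toy_summaries = [[1, 2], [3], [4, 5], [6, 9]]
toy_texts = [[1, 2, 3, 4], [1, 2], [1, 2], [1, 2, 3]]
kept_summaries, kept_texts = filter_and_sort(toy_summaries, toy_texts, unk_id=9)
print(kept_summaries)  # [[4, 5], [1, 2]]
print(kept_texts)      # [[1, 2], [1, 2, 3, 4]]
```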


## Building the Model

Function for creating the model's input placeholders

In [0]:
def model_inputs():
    '''Create placeholders for inputs to the model'''
    
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    summary_length = tf.placeholder(tf.int32, (None,), name='summary_length')
    max_summary_length = tf.reduce_max(summary_length, name='max_dec_len')
    text_length = tf.placeholder(tf.int32, (None,), name='text_length')

    return input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length
  

Function for preparing the target data as input to the decoder
* remove the last word ID from each sequence in the batch and concatenate the `<GO>` token to the beginning of each sequence

In [0]:
def process_encoding_input(target_data, vocab_to_int, batch_size):
       
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return dec_input
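
The same shift can be shown on plain Python lists, which makes the effect easy to check by hand. This is only an illustration of the transformation, not a replacement for the TensorFlow ops; `shift_targets` and the integer IDs are hypothetical.

```python
def shift_targets(target_batch, go_id):
    """Drop the last ID of every row and prepend the GO ID."""
    return [[go_id] + row[:-1] for row in target_batch]


GO, PAD = 5, 3
targets = [[10, 11, 12], [20, 21, PAD]]
decoder_inputs = shift_targets(targets, GO)
print(decoder_inputs)  # [[5, 10, 11], [5, 20, 21]]
```

At each step the decoder therefore sees the previous target word as input and is trained to predict the current one.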
  

Function for building encoding layers

In [0]:
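def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    '''Create the bidirectional LSTM encoder'''
    # Note: every layer reads rnn_inputs directly, so the values returned
    # are those of the last layer built in this loop
    for layer in range(num_layers):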
        with tf.variable_scope('encoder_{}'.format(layer)):
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, 
                                                    input_keep_prob = keep_prob)

            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, 
                                                    input_keep_prob = keep_prob)

            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, 
                                                                    cell_bw, 
                                                                    rnn_inputs,
                                                                    sequence_length,
                                                                    dtype=tf.float32)
    # Join outputs since we are using a bidirectional RNN
    enc_output = tf.concat(enc_output,2)
    
    return enc_output, enc_state

Function for creating the training logits

In [0]:
def training_decoding_layer(dec_embed_input, summary_length, dec_cell, initial_state, output_layer, 
                            vocab_size, max_summary_length):
        
    training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                        sequence_length=summary_length,
                                                        time_major=False)

    training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                       training_helper,
                                                       initial_state,
                                                       output_layer) 

    training_logits, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                           output_time_major=False,
                                                           impute_finished=True,
                                                           maximum_iterations=max_summary_length)
    return training_logits

Function for creating the inference logits

In [0]:
def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, initial_state, output_layer,
                             max_summary_length, batch_size):
        
    start_tokens = tf.tile(tf.constant([start_token], dtype=tf.int32), [batch_size], name='start_tokens')
    
    inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings,
                                                                start_tokens,
                                                                end_token)
                
    inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                        inference_helper,
                                                        initial_state,
                                                        output_layer)
                
    inference_logits, _ = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                            output_time_major=False,
                                                            impute_finished=True,
                                                            maximum_iterations=max_summary_length)
    
    return inference_logits

Function for creating the decoding layers
* attention created for training and inference layers

Attention source: https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb 
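
For reference, Bahdanau (additive) attention scores each encoder output against the current decoder state and uses the normalized scores to build a context vector. The notation below is illustrative and does not come from the notebook: $h_i$ is the encoder output at position $i$ (a row of `enc_output`), $s_t$ is the decoder state at step $t$, and $W_1$, $W_2$, $v$ are learned parameters.

```latex
\begin{aligned}
e_{t,i} &= v^{\top} \tanh\left(W_1 h_i + W_2 s_t\right) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})} \\
c_t &= \sum_{i} \alpha_{t,i}\, h_i
\end{aligned}
```

The weights $\alpha_{t,i}$ sum to 1 over the encoder positions, so the context vector $c_t$ is a weighted average of the encoder outputs. The `text_length` argument tells the mechanism how many encoder positions are real for each review.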

In [0]:
def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, text_length, summary_length, 
                   max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers):
        
    # Note: dec_cell is reassigned on each pass, so only the cell from the last layer is used below
    for layer in range(num_layers):
        with tf.variable_scope('decoder_{}'.format(layer)):
            lstm = tf.contrib.rnn.LSTMCell(rnn_size,
                                           initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2)) 
            dec_cell = tf.contrib.rnn.DropoutWrapper(lstm, 
                                                     input_keep_prob = keep_prob)
    
    output_layer = Dense(vocab_size,
                         kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))
    
    attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size,
                                                  enc_output,
                                                  text_length,
                                                  normalize=False,
                                                  name='BahdanauAttention')

    dec_cell = tf.contrib.seq2seq.DynamicAttentionWrapper(dec_cell, attn_mech, rnn_size)
            
    initial_state = tf.contrib.seq2seq.DynamicAttentionWrapperState(enc_state[0],_zero_state_tensors(rnn_size, batch_size, tf.float32))
                                                      
    
    with tf.variable_scope("decode"):
        training_logits = training_decoding_layer(dec_embed_input, 
                                                  summary_length, 
                                                  dec_cell, 
                                                  initial_state,
                                                  output_layer,
                                                  vocab_size, 
                                                  max_summary_length)
    with tf.variable_scope("decode", reuse=True):
        inference_logits = inference_decoding_layer(embeddings,  
                                                    vocab_to_int['<GO>'], 
                                                    vocab_to_int['<EOS>'],
                                                    dec_cell, 
                                                    initial_state, 
                                                    output_layer,
                                                    max_summary_length,
                                                    batch_size)

    return training_logits, inference_logits

Create Sequence to Sequence Model
* uses previous functions to create the training and inference logits

In [0]:
def seq2seq_model(input_data, target_data, keep_prob, text_length, summary_length, max_summary_length, 
                  vocab_size, rnn_size, num_layers, vocab_to_int, batch_size):
        
    # Use Numberbatch's embeddings and the newly created ones as our embeddings
    embeddings = word_embedding_matrix
    
    enc_embed_input = tf.nn.embedding_lookup(embeddings, input_data)
    enc_output, enc_state = encoding_layer(rnn_size, text_length, num_layers, enc_embed_input, keep_prob)
    
    dec_input = process_encoding_input(target_data, vocab_to_int, batch_size)
    dec_embed_input = tf.nn.embedding_lookup(embeddings, dec_input)
    
    training_logits, inference_logits  = decoding_layer(dec_embed_input, 
                                                        embeddings,
                                                        enc_output,
                                                        enc_state, 
                                                        vocab_size, 
                                                        text_length, 
                                                        summary_length, 
                                                        max_summary_length,
                                                        rnn_size, 
                                                        vocab_to_int, 
                                                        keep_prob, 
                                                        batch_size,
                                                        num_layers)
    
    return training_logits, inference_logits

* PAD sentences so that every sequence in a batch has the same length
* Batch the summaries, texts, and their lengths together

In [0]:
def pad_sentence_batch(sentence_batch):
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]
  
def get_batches(summaries, texts, batch_size):
        
    for batch_i in range(0, len(texts)//batch_size):
        start_i = batch_i * batch_size
        summaries_batch = summaries[start_i:start_i + batch_size]
        texts_batch = texts[start_i:start_i + batch_size]
        pad_summaries_batch = np.array(pad_sentence_batch(summaries_batch))
        pad_texts_batch = np.array(pad_sentence_batch(texts_batch))
        
        # Need the lengths for the _lengths parameters
        pad_summaries_lengths = []
        for summary in pad_summaries_batch:
            pad_summaries_lengths.append(len(summary))
        
        pad_texts_lengths = []
        for text in pad_texts_batch:
            pad_texts_lengths.append(len(text))
        
        yield pad_summaries_batch, pad_texts_batch, pad_summaries_lengths, pad_texts_lengths
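
Note that `get_batches` measures the lengths after padding, so every sequence in a batch reports the padded length. If the unpadded lengths are wanted instead, for example so that a mask like the one built with `tf.sequence_mask` skips PAD positions, they can be recorded before padding. The sketch below shows that variant on plain lists; it is not the method used in this notebook, and `pad_with_lengths` and `length_mask` are hypothetical helpers.

```python
def pad_with_lengths(batch, pad_id):
    """Pad rows to the longest row and also return each row's unpadded length."""
    lengths = [len(row) for row in batch]
    longest = max(lengths)
    padded = [row + [pad_id] * (longest - n) for row, n in zip(batch, lengths)]
    return padded, lengths


def length_mask(lengths, max_len):
    """1.0 for real tokens, 0.0 for padding positions."""
    return [[1.0 if i < n else 0.0 for i in range(max_len)] for n in lengths]


padded, lengths = pad_with_lengths([[7, 8], [7, 8, 9, 4]], pad_id=3)
mask = length_mask(lengths, max(lengths))
print(padded)   # [[7, 8, 3, 3], [7, 8, 9, 4]]
print(lengths)  # [2, 4]
print(mask)     # [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
```

With padded lengths, the corresponding mask would be all ones, so PAD positions would count toward the loss.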

Set hyperparameters for model

In [0]:
# Set the Hyperparameters
epochs = 100
batch_size = 64
rnn_size = 256
num_layers = 2
learning_rate = 0.005
keep_probability = 0.75 # 0.50

## Build graph

In [0]:
# Build the graph
train_graph = tf.Graph()
# Set the graph to default to ensure that it is ready for training
with train_graph.as_default():
    
    # Load the model inputs    
    input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length = model_inputs()

    # Create the training and inference logits
    training_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                      targets, 
                                                      keep_prob,   
                                                      text_length,
                                                      summary_length,
                                                      max_summary_length,
                                                      len(vocab_to_int)+1,
                                                      rnn_size, 
                                                      num_layers, 
                                                      vocab_to_int,
                                                      batch_size)
    
    # Create tensors for the training logits and inference logits
    training_logits = tf.identity(training_logits.rnn_output, 'logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')
    
    # Create the weights for sequence_loss
    masks = tf.sequence_mask(summary_length, max_summary_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer
        # Use the fed `lr` placeholder so the decayed learning rate takes effect
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
print("Graph is built.")

Graph is built.
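
The `masks` tensor and the loss above can be illustrated without TensorFlow. This is a rough sketch, assuming the default averaging in `sequence_loss` (a weighted sum of per-token losses divided by the sum of the weights); `sequence_mask` and `masked_mean` here are hypothetical pure-Python stand-ins, and the per-token losses are made-up numbers.

```python
def sequence_mask(lengths, max_len):
    """Row i has 1.0 in its first lengths[i] positions and 0.0 afterwards."""
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]


def masked_mean(per_token_loss, mask):
    """Average the per-token losses over unmasked positions only."""
    total = sum(loss * weight
                for loss_row, mask_row in zip(per_token_loss, mask)
                for loss, weight in zip(loss_row, mask_row))
    count = sum(weight for mask_row in mask for weight in mask_row)
    return total / count


mask = sequence_mask([3, 1], max_len=4)
# Made-up per-token losses; the 9.0 entries sit on masked positions.
losses = [[1.0, 2.0, 3.0, 9.0],
          [4.0, 9.0, 9.0, 9.0]]
loss = masked_mean(losses, mask)  # (1 + 2 + 3 + 4) / 4
```

Positions beyond each sequence length get weight 0, so they do not contribute to the averaged loss.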


## Model Training

In [0]:
from tensorboardcolab import *
import os
from tensorflow.contrib.tensorboard.plugins import projector
PATH = os.getcwd()
import gensim
#LOG_DIR = PATH + 'word_embed-logging'
#metadata = os.path.join(LOG_DIR, 'metadata.tsv')
len(sorted_summaries)

Using TensorFlow backend.


16827

In [0]:
# Subset the data for training
start = 0
end = start + 10000
sorted_summaries_short = sorted_summaries[start:end]
sorted_texts_short = sorted_texts[start:end]
print("The shortest text length:", len(sorted_texts_short[0]))
print("The longest text length:",len(sorted_texts_short[-1]))

The shortest text length: 2
The longest text length: 33


In [0]:
# Train the Model
learning_rate_decay = 0.95
min_learning_rate = 0.0005
display_step = 20 # Check training loss after every 20 batches
stop_early = 0 
stop = 3 # If the update loss does not decrease in 3 consecutive update checks, stop training
per_epoch = 3 # Make 3 update checks per epoch
update_check = (len(sorted_texts_short)//batch_size//per_epoch)-1

update_loss = 0 
batch_loss = 0
summary_update_loss = [] # Record the update losses for saving improvements in the model


checkpoint = "best_model.ckpt" 
path = F"/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/{checkpoint}"

# Create a summary operation to log the progress of the network
with tf.variable_scope('logging_m22'):
    tf.summary.scalar('current_cost', cost)
    summary = tf.summary.merge_all()


with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    
    # If we want to continue training a previous session
    #loader = tf.train.import_meta_graph("./" + checkpoint + '.meta')
    #loader.restore(sess, checkpoint)
    training_writer = tf.summary.FileWriter('./logs/training/m1', sess.graph)
    
    for epoch_i in range(1, epochs+1):
        update_loss = 0
        batch_loss = 0
        for batch_i, (summaries_batch, texts_batch, summaries_lengths, texts_lengths) in enumerate(
                get_batches(sorted_summaries_short, sorted_texts_short, batch_size)):
            start_time = time.time()
            _, loss = sess.run(
                [train_op, cost],
                {input_data: texts_batch,
                 targets: summaries_batch,
                 lr: learning_rate,
                 summary_length: summaries_lengths,
                 text_length: texts_lengths,
                 keep_prob: keep_probability})

            batch_loss += loss
            update_loss += loss
            end_time = time.time()
            batch_time = end_time - start_time

            if batch_i % display_step == 0 and batch_i > 0:
                print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                      .format(epoch_i,
                              epochs, 
                              batch_i, 
                              len(sorted_texts_short) // batch_size, 
                              batch_loss / display_step, 
                              batch_time*display_step))
                batch_loss = 0

            if batch_i % update_check == 0 and batch_i > 0:
                print("Average loss for this update:", round(update_loss/update_check,3))
                summary_update_loss.append(update_loss)
                
                # If the update loss is at a new minimum, save the model
                if update_loss <= min(summary_update_loss):
                    print('New Record!') 
                    stop_early = 0
                    saver = tf.train.Saver() 
                    saver.save(sess, path)

                else:
                    print("No Improvement.")
                    stop_early += 1
                    if stop_early == stop:
                        break
                update_loss = 0
            
                    
        # Reduce learning rate, but not below its minimum value
        learning_rate *= learning_rate_decay
        if learning_rate < min_learning_rate:
            learning_rate = min_learning_rate
        
        if stop_early == stop:
            print("Stopping Training.")
            break

NameError: ignored
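
The bookkeeping in the training cell (update interval, learning-rate decay, early stopping) can be checked in isolation. The sketch below reuses the numbers from the cells above (10,000 samples, batch size 64, 3 checks per epoch, decay 0.95, floor 0.0005); the function names and the list of update losses are hypothetical and only for illustration.

```python
def update_check_interval(num_examples, batch_size, per_epoch):
    """Batches between update checks, as computed in the training cell."""
    return (num_examples // batch_size // per_epoch) - 1


def decayed_learning_rate(initial, decay, minimum, epochs_done):
    """Learning rate after a number of completed epochs, floored at minimum."""
    rate = initial
    for _ in range(epochs_done):
        rate = max(rate * decay, minimum)
    return rate


def check_that_stops(update_losses, patience):
    """Return the 1-based update check at which training stops, or None."""
    history, stale = [], 0
    for i, loss in enumerate(update_losses, start=1):
        history.append(loss)
        if loss <= min(history):
            stale = 0          # new record: reset the counter
        else:
            stale += 1         # no improvement at this check
            if stale == patience:
                return i
    return None


interval = update_check_interval(10000, 64, 3)
rate_44 = decayed_learning_rate(0.005, 0.95, 0.0005, 44)
rate_45 = decayed_learning_rate(0.005, 0.95, 0.0005, 45)
stop_at = check_that_stops([9.0, 8.0, 8.5, 8.2, 8.1, 7.0], patience=3)
```

With these settings the update check fires every 51 batches, the learning rate reaches its floor after 45 epochs, and the made-up loss sequence stops at the fifth check because three checks in a row fail to beat the best loss so far.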

## Save Model in Colab

In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive 
from google.colab import auth 
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()                       
drive = GoogleDrive(gauth)

In [0]:
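# NOTE: `model` is not defined anywhere in this notebook. The trained weights
# are the checkpoint files written by saver.save() during training, so this
# cell raises a NameError unless a Keras model named `model` exists.
model.save('seq2seq-text-gen-train.h5')
# PyDrive reads the file name from the 'title' metadata key
model_file = drive.CreateFile({'title': 'seq2seq-text-gen-train.h5'})
model_file.SetContentFile('seq2seq-text-gen-train.h5')
model_file.Upload()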

## Text Generation
* Create a review or use a review from the dataset
* The summary length is chosen at random (5 to 7 tokens)
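
As a rough sketch of how a single review becomes model inputs in the cells below: the encoded review is repeated `batch_size` times, and the summary length is drawn from a half-open range. `build_inference_inputs` and its dictionary keys are hypothetical; I use the standard library's `random.randrange(5, 8)` here as a stand-in for `np.random.randint(5,8)`, since both exclude the upper bound.

```python
import random


def build_inference_inputs(text_ids, batch_size, rng):
    """Tile one encoded review to a full batch and draw a summary length."""
    return {
        # The graph was built for a fixed batch size, so repeat the review.
        "input": [list(text_ids) for _ in range(batch_size)],
        "text_length": [len(text_ids)] * batch_size,
        # randrange(5, 8) excludes 8, so the result is 5, 6 or 7.
        "summary_length": [rng.randrange(5, 8)],
    }


# Toy word ids, not taken from the real vocabulary.
inputs = build_inference_inputs([4, 8, 15], batch_size=64,
                                rng=random.Random(0))
```

Every row of the batch is identical, which is why the cells below keep only the first row of the prediction.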

In [0]:
def text_to_seq(text):
    '''Prepare the text for the model'''
    
    text = clean_text(text)
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text.split()]
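
A small sketch of the lookup that `text_to_seq` performs, assuming a toy vocabulary. `toy_text_to_seq` and `toy_vocab` are hypothetical, and lowercasing stands in for the notebook's `clean_text`, which does more than that.

```python
def toy_text_to_seq(text, vocab_to_int):
    """Map whitespace-separated words to ids, falling back to <UNK>."""
    unk_id = vocab_to_int["<UNK>"]
    # Stand-in for the notebook's clean_text: lowercase only.
    return [vocab_to_int.get(word, unk_id) for word in text.lower().split()]


toy_vocab = {"<UNK>": 0, "<PAD>": 1, "great": 2, "coffee": 3}
ids = toy_text_to_seq("Great coffee risottos", toy_vocab)
```

Any word missing from the vocabulary maps to the `<UNK>` id, which is why `<UNK>` shows up in some of the "Input Words" outputs below.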

Here I pick a random review from the cleaned training data to use as a test review. 

In [0]:
# Create a review in 'input_sentence' 
# or pick a random review from the training data

#input_sentence = "I can make up my own review for testing"
#text = text_to_seq(input_sentence)
random = np.random.randint(0,len(clean_texts))
input_sentence = clean_texts[random]
text = text_to_seq(clean_texts[random])

#checkpoint = "./best_model.ckpt"
checkpoint = "best_model.ckpt" 
path = F"/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/{checkpoint}"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(path + '.meta')
    loader.restore(sess, path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      summary_length: [np.random.randint(5,8)], 
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0] 

# Remove the padding 
pad = vocab_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

INFO:tensorflow:Restoring parameters from /content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/best_model.ckpt
Original Text: coffe average slight hint pumpkin spice want overwhelmingly flavored coffee <br ><br >works great office would buy

Text
  Word Ids:    [1187, 118, 3333, 1619, 752, 305, 888, 7371, 222, 134, 6903, 6904, 23908, 22, 3204, 103, 161]
  Input Words: coffe average slight hint pumpkin spice want overwhelmingly flavored coffee <br ><br <UNK> great office would buy

Summary
  Word Ids:       [188, 97, 427, 735]
  Response Words: too much decaf mocha
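
The last two print statements filter out padding and map ids back to words. A minimal sketch of that decoding step, with a hypothetical helper `decode_summary` and a toy `int_to_vocab`:

```python
def decode_summary(ids, int_to_vocab, pad_id):
    """Drop padding ids and join the remaining words with spaces."""
    return " ".join(int_to_vocab[i] for i in ids if i != pad_id)


toy_int_to_vocab = {0: "<PAD>", 1: "excellent", 2: "value"}
summary = decode_summary([1, 2, 0, 0], toy_int_to_vocab, pad_id=0)
```

Wrapping the decoding in a function like this would also avoid repeating the same print logic in every cell that follows.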


This is an actual review from the training data. 

Original Summary: "bought this for my son at college"

In [0]:
# Create your own review or use one from the dataset
input_sentence = "My son loves spaghetti so I didn't hesitate ordering this for him.\
  He says they are great. I have tried them myself and they are delicious. Just open \
  and pop them in the microwave. It is very easy. The best thing about ordering from Amazon grocery is that they deliver to your door. \
  If you have a loved one that lives far away and may have limited transportation this is the answer. Just order what you want them to have and Amazon takes care of the rest."

text = text_to_seq(input_sentence)


#checkpoint = "./best_model.ckpt"
checkpoint = "best_model.ckpt" 
path = F"/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/{checkpoint}"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(path + '.meta')
    loader.restore(sess, path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      summary_length: [np.random.randint(5,8)], 
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0] 

# Remove the padding 
pad = vocab_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

INFO:tensorflow:Restoring parameters from /content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/best_model.ckpt
Original Text: My son loves spaghetti so I didn't hesitate ordering this for him.  He says they are great. I have tried them myself and they are delicious. Just open   and pop them in the microwave. It is very easy. The best thing about ordering from Amazon grocery is that they deliver to your door.   If you have a loved one that lives far away and may have limited transportation this is the answer. Just order what you want them to have and Amazon takes care of the rest.

Text
  Word Ids:    [4, 148, 3463, 6901, 2977, 564, 22, 731, 83, 348, 1107, 2737, 396, 48, 77, 2977, 921, 1618, 4094, 848, 331, 265, 3328, 315, 559, 1277, 1869, 6902, 5378, 875, 888, 921, 3529, 264, 1244]
  Input Words: son loves spaghetti hesitate ordering says great tried delicious open pop microwave easy best thing ordering amazon grocery deliver door loved one lives far away may limited transport

Here I create my own review. As seen, the model does not correctly summarize the intent of the review. Perhaps it is being sarcastic!

In [0]:
# Create your own review or use one from the dataset
input_sentence = "Disgusting cheese. I will never order this again"
text = text_to_seq(input_sentence)


#checkpoint = "./best_model.ckpt"
checkpoint = "best_model.ckpt" 
path = F"/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/{checkpoint}"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(path + '.meta')
    loader.restore(sess, path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      summary_length: [np.random.randint(5,8)], 
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0] 

# Remove the padding 
pad = vocab_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

INFO:tensorflow:Restoring parameters from /content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/best_model.ckpt
Original Text: Disgusting cheese. I will never order this again

Text
  Word Ids:    [1410, 689, 779, 875]
  Input Words: disgusting cheese never order

Summary
  Word Ids:       [2446, 466]
  Response Words: utterly fantastic


In [0]:
# Create your own review or use one from the dataset
input_sentence = "Terrible book. I wish to return it"
text = text_to_seq(input_sentence)


#checkpoint = "./best_model.ckpt"
checkpoint = "best_model.ckpt" 
path = F"/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/{checkpoint}"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(path + '.meta')
    loader.restore(sess, path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      summary_length: [np.random.randint(5,8)], 
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0] 

# Remove the padding 
pad = vocab_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

INFO:tensorflow:Restoring parameters from /content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/best_model.ckpt
Original Text: Terrible book. I wish to return it

Text
  Word Ids:    [583, 3115, 491, 484]
  Input Words: terrible book wish return

Summary
  Word Ids:       [281, 281, 1927, 55, 583]
  Response Words: cannot cannot stop it terrible


In [0]:
# Create your own review or use one from the dataset
input_sentence = "almost love first bite perfectly roasted almond nice thin layer pure flavorful cocoa"
text = text_to_seq(input_sentence)


#checkpoint = "./best_model.ckpt"
checkpoint = "best_model.ckpt" 
path = F"/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/{checkpoint}"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(path + '.meta')
    loader.restore(sess, path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      summary_length: [np.random.randint(5,8)], 
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0] 

# Remove the padding 
pad = vocab_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

INFO:tensorflow:Restoring parameters from /content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/best_model.ckpt
Original Text: almost love first bite perfectly roasted almond nice thin layer pure flavorful cocoa

Text
  Word Ids:    [425, 215, 861, 1231, 436, 2082, 587, 78, 4168, 6239, 7, 404, 8]
  Input Words: almost love first bite perfectly roasted almond nice thin layer pure flavorful cocoa

Summary
  Word Ids:       [516, 516]
  Response Words: yum yum


In [0]:
# Create your own review or use one from the dataset
input_sentence = "It was almost a 'love at first bite' - the perfectly roasted \
almond with a nice thin layer of pure flavorful cocoa on the top.<br /><br />You \
can smell the cocoa as soon as you open the canister - making you want to take a \
bite.<br /><br />You may or may not like the taste of this cocoa roasted almonds \
depending on your likingness for cocoa.  We are so much used to the taste of \
chocolate (which is actually cocoa + many other ingredients like milk ...) - that \
you might have never really tasted really cocoa.<br /><br />Tasting this item it \
like tasting and enjoying flavorful pure raw cocoa with crunchy almonds in the \
center.  Get yourself a box and see for yourself what real cocoa + almonds \
is !<br /><br />Where this product "
text = text_to_seq(input_sentence)


#checkpoint = "./best_model.ckpt"
checkpoint = "best_model.ckpt" 
path = F"/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/{checkpoint}"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(path + '.meta')
    loader.restore(sess, path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      summary_length: [np.random.randint(5,8)], 
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0] 

# Remove the padding 
pad = vocab_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

INFO:tensorflow:Restoring parameters from /content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/best_model.ckpt
Original Text: It was almost a 'love at first bite' - the perfectly roasted almond with a nice thin layer of pure flavorful cocoa on the top.<br /><br />You can smell the cocoa as soon as you open the canister - making you want to take a bite.<br /><br />You may or may not like the taste of this cocoa roasted almonds depending on your likingness for cocoa.  We are so much used to the taste of chocolate (which is actually cocoa + many other ingredients like milk ...) - that you might have never really tasted really cocoa.<br /><br />Tasting this item it like tasting and enjoying flavorful pure raw cocoa with crunchy almonds in the center.  Get yourself a box and see for yourself what real cocoa + almonds is !<br /><br />Where this product 

Text
  Word Ids:    [425, 215, 861, 1231, 436, 2082, 587, 78, 4168, 6239, 7, 404, 8, 361, 6903, 6904, 6905, 962, 8, 685, 348, 4092

In [0]:
# Create your own review or use one from the dataset
input_sentence = "Excellent value for a premium product. As typical for dried product, \
as many broken pieces as whole mushrooms, but immaterial for something usually diced after \
reconstituting. Great flavor boost for soups, sauces, risottos - really anything braised, simmered or blended."
text = text_to_seq(input_sentence)


#checkpoint = "./best_model.ckpt"
checkpoint = "best_model.ckpt" 
path = F"/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/{checkpoint}"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(path + '.meta')
    loader.restore(sess, path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      summary_length: [np.random.randint(5,8)], 
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0] 

# Remove the padding 
pad = vocab_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

INFO:tensorflow:Restoring parameters from /content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/best_model.ckpt
Original Text: Excellent value for a premium product. As typical for dried product, as many broken pieces as whole mushrooms, but immaterial for something usually diced after reconstituting. Great flavor boost for soups, sauces, risottos - really anything braised, simmered or blended.

Text
  Word Ids:    [14, 15, 696, 18, 2587, 608, 18, 469, 1169, 1214, 1560, 1045, 6909, 835, 3997, 4776, 6910, 22, 163, 3221, 673, 2600, 23908, 286, 1273, 6911, 6912, 4411]
  Input Words: excellent value premium product typical dried product many broken pieces whole mushrooms immaterial something usually diced reconstituting great flavor boost soups sauces <UNK> really anything braised simmered blended

Summary
  Word Ids:       [14, 15]
  Response Words: excellent value


In [0]:
# Create your own review or use one from the dataset
input_sentence = "This was my favourite brand of cod liver, it's hard to find in Toronto, \
so I buy a dozen at a time when I find it. However, the last purchase, every one of the cans \
I opened was very lightly packed, about one third liver and the rest just oil. Even worse, the liver had \
dark, grey spots that would turn my stomach just looking at them. Perhaps good cod liver has become a \
thing of the past, like many other delicacies. I would prefer if they just raised the price to whatever \
it needs to be to provide the quality that we deserve when we spend our hard earned money, or, if that's \
not possible, just let the fish live and make soyburgers or something."
text = text_to_seq(input_sentence)


#checkpoint = "./best_model.ckpt"
checkpoint = "best_model.ckpt" 
path = F"/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/{checkpoint}"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(path + '.meta')
    loader.restore(sess, path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      summary_length: [np.random.randint(5,8)], 
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0] 

# Remove the padding 
pad = vocab_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

INFO:tensorflow:Restoring parameters from /content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/best_model.ckpt
Original Text: This was my favourite brand of cod liver, it's hard to find in Toronto, so I buy a dozen at a time when I find it. However, the last purchase, every one of the cans I opened was very lightly packed, about one third liver and the rest just oil. Even worse, the liver had dark, grey spots that would turn my stomach just looking at them. Perhaps good cod liver has become a thing of the past, like many other delicacies. I would prefer if they just raised the price to whatever it needs to be to provide the quality that we deserve when we spend our hard earned money, or, if that's not possible, just let the fish live and make soyburgers or something.

Text
  Word Ids:    [3289, 370, 6913, 939, 347, 494, 6914, 161, 6915, 122, 494, 2292, 1230, 535, 949, 265, 237, 6365, 1119, 2563, 265, 5161, 939, 1244, 498, 99, 1779, 939, 201, 147, 251, 103, 3271, 395, 2309, 261

In [0]:
# Create your own review or use one from the dataset
input_sentence = "We usually have fresh lemon grass tea at our local Zen center. \
It is extremely easy to make, put a few blades of lemon grass in a pot to steep \
and you will get a fantastic tasting tea.<br /><br />This tea is here is a fine \
replacement for when you cannot get fresh lemon grass. Lemon grass has many health \
benefits. It is said to detoxify the liver and can lower uric acid levels, this is \
important if you have gout.<br /><br />Because you get such a large quantity here on \
Amazon it should last you quite a while. Caffeine free and light it makes a very \
refreshing afternoon tea.<br /><br />Give it a try!<br /><br />Thank you for reading my review."
text = text_to_seq(input_sentence)


#checkpoint = "./best_model.ckpt"
checkpoint = "best_model.ckpt" 
path = F"/content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/{checkpoint}"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(path + '.meta')
    loader.restore(sess, path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      summary_length: [np.random.randint(5,8)], 
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0] 

# Remove the padding 
pad = vocab_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

INFO:tensorflow:Restoring parameters from /content/drive/My Drive/Colab Notebooks/Amazon-Reviews/models/best_model.ckpt
Original Text: We usually have fresh lemon grass tea at our local Zen center. It is extremely easy to make, put a few blades of lemon grass in a pot to steep and you will get a fantastic tasting tea.<br /><br />This tea is here is a fine replacement for when you cannot get fresh lemon grass. Lemon grass has many health benefits. It is said to detoxify the liver and can lower uric acid levels, this is important if you have gout.<br /><br />Because you get such a large quantity here on Amazon it should last you quite a while. Caffeine free and light it makes a very refreshing afternoon tea.<br /><br />Give it a try!<br /><br />Thank you for reading my review.

Text
  Word Ids:    [3997, 551, 130, 2637, 24, 2353, 3855, 4022, 1184, 396, 397, 768, 6919, 130, 2637, 700, 4529, 42, 466, 23, 24, 6903, 6904, 6920, 24, 763, 1123, 281, 42, 551, 130, 2637, 130, 2637, 469, 1268, 14