This is a seq_to_seq problem. (needs encoder and decoder)<br>
Process:<br>
1: Load the data <br>
2: Preprocess the data <br>
3: Encode the sentences (create the dictionary from words, map words to integers) <br>
4: Build and train the seq2seq model (Using GloVe for the embeddings and Attention with decoder) <br> 
5: Generate the summary (Subjects for email contents) <br>

In [112]:
import numpy as np
import pandas as pd
import tensorflow as tf
import os.path
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from bs4 import BeautifulSoup
import re
import string
from keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords
import nltk
from tensorflow.python.layers.core import Dense


The data is from Enron email dataset. <br>
In this case we consider the subject of an email a few words summary that we need to learn for that email. Therefore, the inputs are email contents and targets are email subjects (or summaries).

Load the data

In [113]:
df = pd.read_csv('/emails_data/enron_emails.csv')
#Extracting only two columns 'Subject' and 'content'
df1 = df[['Subject','content']]

We disregard the forwarded and replied emails Since they do not provide any particular subject

In [114]:
df1=df1[~df1['Subject'].str.contains("FW:", na=False)]
df1=df1[~df1['Subject'].str.contains("Fw:", na=False)]
df1=df1[~df1['Subject'].str.contains("fw:", na=False)]
df1=df1[~df1['Subject'].str.contains("RE:", na=False)]
df1=df1[~df1['Subject'].str.contains("Re:", na=False)]
df1=df1[~df1['Subject'].str.contains("re:", na=False)]

Emails that contain "Forwarded by" in their content are the replied emails and their subjects are changed by the current sender. Therefore, for now we disregard those too. (Although later we can use the replied emails for the test part to give them subjects.)

In [115]:
df1=df1[~df1['content'].str.contains("Forwarded by", na=False)]

Removing NaN subjects

In [116]:
df1 = df1[pd.notnull(df1['Subject'])]
df1

Unnamed: 0,Subject,content
24,San Juan Index,"Liane, As we discussed yesterday, I am concern..."
106,tv on 33,Cash Hehub Chicago PEPL Katy Socal Opal Permia...
126,For Wade,"Wade, I understood your number one priority wa..."
140,assoc. for west desk,"Celeste, I need two assoc./analyst for the wes..."
143,test,testing
224,Priority List,"Will, Here is a list of the top items we need ..."
267,eol,Jeff/Brenda: Please authorize the following pr...
395,Mike Grigsby,Please approve Mike Grigsby for Bloomberg. Tha...
413,San Marcos construction project,Please find attached the pro formas for the pr...
518,Headcount,Financial (6) West Desk (14) Mid Market (16)


Finding out the maximum and minimum length for the content column so we can define a specific length range for emails that we want to include in our dataset

In [117]:
max_len = df1.applymap(lambda x: len(str(x))).max()
print(max_len)
min_len = df1.applymap(lambda x: len(str(x))).min()
print(min_len)

Subject       258
content    737640
dtype: int64
Subject    1
content    1
dtype: int64


We only include the emails with the contents in the range of [500,6000] characters.

In [118]:
#mask = (0<df1['Subject'].str.len()<258) & (500<df1['content'].str.len() <6000)
#df1 = df1.loc[mask]
df1=df1[df1['content'].astype('str').map(len) <= 6000]
df1=df1[df1['content'].astype('str').map(len) >= 500] 
emails=df1

Clean emails contents and subjects 

In [119]:
def load_clean(emails,stop_words):
    '''Clean the data both subjects and content of emails'''
    emails_messages=[]
    subj_messages=[]
    for email_content in emails['content']:
        #Extra celaning of text before tokenization 
        #Removing stopwords                
        email_content=' '.join(i for i in email_content.split() if i not in stop_words)
        #Removing special characters and float numbers
        email_content=re.sub("(\d*\.\d+)|(\d+\.[0-9 ]+)","",email_content)
        email_content=re.sub(r'[^\w]', ' ', email_content)
        '''for word in email_content:
            email_content=" ".join([w for w in email_content.split() if not w.isdigit()])'''
        #Removing all numbers (except for joint numbers to strings such as 27th; we also may later try to keep numbers related to dates and rooms, money , etc such as Sep 27, room numbers 3, 10 cent, etc)
        email_content = " ".join([w for w in email_content.split() if not w.isdigit()])

        emails_messages.append(email_content)
    for subject_messages in emails['Subject']:
        #Extra celaning of text before tokenization 
        #Removing stopwords                
        subject_messages=' '.join(i for i in subject_messages.split() if i not in stop_words)
        #Removing special characters and float numbers
        subject_messages=re.sub("(\d*\.\d+)|(\d+\.[0-9 ]+)","",subject_messages)
        subject_messages=re.sub(r'[^\w]', ' ', subject_messages)
        '''for word in email_content:
            email_content=" ".join([w for w in email_content.split() if not w.isdigit()])'''
        #Removing all numbers (except for joint numbers to strings such as 27th; we also may later try to keep numbers related to dates and rooms, money , etc such as Sep 27, room numbers 3, 10 cent, etc)
        subject_messages = " ".join([w for w in subject_messages.split() if not w.isdigit()])

        subj_messages.append(subject_messages)    

    return subj_messages,emails_messages

In [120]:
#Loading stop words from nltk
nltk.download("stopwords")
stop_words = set(stopwords.words('english'))
#Clean and preprocess the data
subject_messages,emails_messages=load_clean(emails,stop_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [121]:
#Comapring the first email before and after preprocessing
print("First email before:",emails.loc[24,'content'])

print("First email after:",emails_messages[0])


('First email before:', "Liane, As we discussed yesterday, I am concerned there has been an attempt to manipulate the El Paso San Juan monthly index. A single buyer entered the marketplace on both September 26 and 27 and paid above market prices ($4.70-$4.80) for San Juan gas with the intent to distort the index. At the time of these trades, offers for physical gas at significantly (10 to 15 cents) lower prices were bypassed in order to establish higher trades to report into the index calculation. Additionally, these trades are out of line with the associated financial swaps for San Juan. We have compiled a list of financial and physical trades executed from September 25 to September 27. These are the complete list of trades from Enron Online (EOL), Enron's direct phone conversations, and three brokerage firms (Amerex, APB, and Prebon). Please see the attached spreadsheet for a trade by trade list and a summary. We have also included a summary of gas daily prices to illustrate the valu

<EOS> specifies the end of text<br>
<SOS> specifies the beggning of each sequence as well as each batch <br>
<PAD> is used to make sequences to have the same lengths

In [122]:
def encode_words_in_seqences(phrases,sentences):
    '''Convert words to numbers (Create the dictionary of words)'''
    
    #Get all of the words in sentences(content) and phrases(subjects)
    word_list_content = ' '.join(sentences).split(' ')   
    word_list_subj = ' '.join(phrases).split(' ')   
    word_list = word_list_content + word_list_subj
    word_list=set(word_list)
    #Number of unique words in all above
    text_len=len(word_list)
    
    #Initial different sequences for list of contents and list of subjects
    data_seq=[]
    data_seq_subj=[]
    
    word_index=dictionary(word_list)
    #Add EOS and SOS to dictionary (only for decoder,i.e.,subjects not encoder)
    word_index["<SOS>"]=text_len+1
    word_index["<EOS>"]=text_len+2
    word_index["<PAD>"]=text_len+3
    for s in sentences:
            s=s.split(" ")
            s=[word_index[w] for w in s]
            data_seq.append(s)
    for p in phrases:
            p=p.split(" ")
            #Add SOS and EOS to subjects because we only need it for decoder 
            p.insert(0,"<SOS>")
            p.insert(len(p)+1,"<EOS>")
            p=[word_index[w] for w in p]
            data_seq_subj.append(p)
    
    #Choose the maximum number of tokens in all sequences 
    num_tokens = [len(tokens) for tokens in data_seq]
    max_seq_length=np.max(num_tokens)
    
    num_tokens_subj = [len(tokens) for tokens in data_seq_subj]
    max_seq_subj_length=np.max(num_tokens_subj)
    
    #Padding sequences (using Keras)
    #Make sequences to have the same lengths (add extra zeros to the end of the sentences)
    #PAD's value is word_index["<PAD>"]=text_len+3 
    data_seq = pad_sequences(data_seq, maxlen = max_seq_length,
                                padding='post', truncating='pre', value=word_index["<PAD>"])
    data_seq_subj = pad_sequences(data_seq_subj, maxlen = max_seq_subj_length,
                                padding='post', truncating='pre', value=word_index["<PAD>"])
    
    return word_index,data_seq,data_seq_subj

In [123]:
#Create the dictionary from list of words in text
def dictionary(words):
    #Create list of words without their duplications 
    words=set(words)
    #Map word to index
    indx = {key: i for i, key in enumerate(words)}
    return indx

In [124]:
#Convert from index to word
def get_by_key_dict(indx_word,words_dict):
    for word, indx in words_dict.iteritems():    
        if indx == indx_word:
            return word

In [125]:
#Words to numbers
word_index,data_sequences,data_seq_subj=encode_words_in_seqences(subject_messages,emails_messages)


In [126]:
word_index

{'mdbe': 0,
 '': 1,
 'gai': 121841,
 'Craziness': 3,
 'EXPLAIN': 4,
 'chudson': 5,
 'Pront': 6,
 'woods': 7,
 'Derike': 8,
 'spiders': 9,
 'Connerty': 10,
 'woody': 11,
 'Prone': 12,
 'ministration': 160619,
 'O8LG9H': 13,
 'Eury': 14,
 'Y1A': 15,
 'gab': 16,
 'Archuleta': 17,
 'Journey': 152975,
 'Boedecker': 19,
 'patying': 20,
 'Retreat': 21,
 'Euro': 22,
 'Tootie': 23,
 'CORRIDOR': 24,
 'Brianmmorgan': 25,
 'Valle': 26,
 'uietly': 27,
 'Mizell': 28,
 'WATERMELON': 29,
 'gencos': 30,
 'Morten': 31,
 'Statdat': 189,
 'bringing': 33,
 'wooded': 34,
 'Reconcile': 223,
 'wooden': 36,
 'Miers': 37,
 'wednesday': 38,
 'Sack': 39,
 'virtuosos': 40,
 'tepenovitch': 67529,
 'Bonanz2002alpha': 41,
 'Sacr': 42,
 'ToolGraphs': 43,
 'Pinkerton': 44,
 'mv94014': 45,
 'gaskets': 46,
 'convo': 47,
 'Shocked': 48,
 'Hughlette': 50,
 'consenting': 51,
 'Honorable': 52,
 'stohn': 53,
 'snuggled': 54,
 'USAFA': 55,
 'nalysis': 56,
 'dialogs': 57,
 'warmongering': 58,
 'usenet': 60,
 'College': 61,
 'Gl

In [127]:
#Sequences for subjects
data_seq_subj

array([[184969,  36307, 148992, ..., 184971, 184971, 184971],
       [184969,  36307, 148992, ..., 184971, 184971, 184971],
       [184969, 165306,  38393, ..., 184971, 184971, 184971],
       ..., 
       [184969,  43121,  71731, ..., 184971, 184971, 184971],
       [184969,  12222,  52830, ..., 184971, 184971, 184971],
       [184969, 119107,  14178, ..., 184971, 184971, 184971]], dtype=int32)

In [128]:
#Sequences for contents
data_sequences

array([[165068, 178094, 147508, ..., 184971, 184971, 184971],
       [165068, 178094, 147508, ..., 184971, 184971, 184971],
       [165306,  38393, 107130, ..., 184971, 184971, 184971],
       ..., 
       [107130, 145346,  95395, ..., 184971, 184971, 184971],
       [ 57993,  81199,  48884, ..., 184971, 184971, 184971],
       [ 40449,   8915, 138223, ..., 184971, 184971, 184971]], dtype=int32)

In [129]:
#Get the input and target lengths (after padding)
input_len=len(data_sequences[0])
target_len=len(data_seq_subj[0])
input_len

1013

Instead of using the naive approach for embedding (which is initializing the embedding vectors with random numbers and then let our model to further learn the embeddings), we can use GloVe to initialize some of the embeddings with pre_trained data learned from GloVe. In this case, the words in our text or dictionary that exist in GloVe are initialize with the word embeddings learned from GloVe and for the rest of words we may use the embedding vectors with random numbers.

In [130]:
glovefile="/glovedata/glove.6B.50d.txt"
#Load Glove model from https://stackoverflow.com/questions/37793118/load-pretrained-glove-vectors-in-python
def loadGloveModel(gloveFile):
    words_in_glove=[]
    print "Loading Glove Model"
    f = open(gloveFile,'r')
    model = {}
    for line in f:
        splitLine = line.split()
        word = splitLine[0]
        words_in_glove.append(word)
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print "Done.",len(model)," words loaded!"
    return model,words_in_glove
model,words_in_glove=loadGloveModel(glovefile)

print model['hello']

Loading Glove Model
Done. 400000  words loaded!
[-0.38497   0.80092   0.064106 -0.28355  -0.026759 -0.34532  -0.64253
 -0.11729  -0.33257   0.55243  -0.087813  0.9035    0.47102   0.56657
  0.6985   -0.35229  -0.86542   0.90573   0.03576  -0.071705 -0.12327
  0.54923   0.47005   0.35572   1.2611   -0.67581  -0.94983   0.68666
  0.3871   -1.3492    0.63512   0.46416  -0.48814   0.83827  -0.9246
 -0.33722   0.53741  -1.0616   -0.081403 -0.67111   0.30923  -0.3923
 -0.55002  -0.68827   0.58049  -0.11626   0.013139 -0.57654   0.048833
  0.67204 ]


Create embeddings of GloVe

In [131]:
def build_embeddings(vocabulary_size,embedding_size,word_index,words_in_glove):
    #For the words in the emails data that are not in the glove we give them random embeddings vectors
    embedding_all= tf.Variable(tf.random_uniform((vocabulary_size, embedding_size), -1, 1))
    for i in range(0,vocabulary_size):
        if get_by_key_dict(i,word_index) in words_in_glove:
            tf.assign(embedding_all[i],model[get_by_key_dict(i,word_index)])
    print(embedding_all)
    return embedding_all

In [132]:
def  create_model_inputs():
    '''Define model inputs'''
    
    #Model's placeholders for inputs
    encoding_inputs = tf.placeholder(tf.int32, [None, None], name='encode_inputs')
    decoder_targets = tf.placeholder(tf.int32, [None, None], name='targets')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    inputs_length = tf.placeholder(shape=(None,), dtype=tf.int32, name='inputs_length')
    target_length =tf.placeholder(shape=(None,), dtype=tf.int32, name='target_length')

    return encoding_inputs,decoder_targets,keep_prob,inputs_length,target_length

In [133]:
def get_batches(x, y, batch_size):
    '''Using generator to return batches for train'''
    
    n_batches = len(x)//batch_size
    '''In case that the batch_size is not a multiple of data size (number of sequences) in order to create batches with the same sizes, this line will ignore the last sequences that cannot create a full batch'''
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]


Build encoder cells

In [134]:
def build_RNN_encoder_cells(num_hidden,lstm_layer_numbers,keep_prob,batch_size):
    '''Build RNN encoder cells'''

    #Define LSTM layers(Bidirectional need both backward and forward info)
    lstms=[]
    for i in range(lstm_layer_numbers):
        lstms.append(tf.contrib.rnn.BasicLSTMCell(num_hidden))
    # Add regularization dropout to the LSTM cells
    lstm_fw_cell = [tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob) for lstm in lstms]
    lstm_bw_cell = [tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob) for lstm in lstms]

    # Stack up multiple LSTM layers
    stacked_lstm_fw = tf.contrib.rnn.MultiRNNCell(lstm_fw_cell)
    stacked_lstm_bw = tf.contrib.rnn.MultiRNNCell(lstm_bw_cell)
    
    return stacked_lstm_fw,stacked_lstm_bw

Build decoder cells (with attention). Attention is useful for long sentences from encoder that are needed to be paid attention by decoder to their specific words for better prediction. 

In [135]:
def build_RNN_decoder_cells(encoder_outputs,encoder_state,num_hidden,lstm_layer_numbers,batch_size,inputs_length):
    '''Build RNN decoder attention cells'''
    #Define LSTM layers
    lstms=[]
    for i in range(lstm_layer_numbers):
        #The concatenation of backward anf forward of LSTM cells from encoder resulted in num_hidden*2 units instead of num_hidden; thtat's what decoder should expect (num_hidden*2), same goes for attention
        lstms.append(tf.contrib.rnn.BasicLSTMCell(num_hidden*2))
    #Add regularization dropout to the LSTM cells
    decoder_lstm_cells = [tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob) for lstm in lstms]
    #Stack up multiple LSTM layers
    stacked_decoder_cells = tf.contrib.rnn.MultiRNNCell(decoder_lstm_cells)
    
    #Build Attention cells
    #Attention Mechanisms
    attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(num_units = num_hidden*2, memory = encoder_outputs, memory_sequence_length = inputs_length, name='BahdanauAttention')
    # Attention Wrapper
    attention_cell = tf.contrib.seq2seq.AttentionWrapper(cell = stacked_decoder_cells,attention_mechanism = attention_mechanism,attention_layer_size = num_hidden*2, name="attention_wrapper")  
    #Pass the encoder states to attention
    initial_state_attention = attention_cell.zero_state(dtype=tf.float32, batch_size=batch_size).clone(cell_state=encoder_state)
    
    return initial_state_attention,attention_cell

Split the data for train and test parts

In [136]:
#Split the data into train, test 
X_train, X_test, y_train, y_test = train_test_split(data_sequences, data_seq_subj, test_size=0.1, random_state=1)
len(X_train)

96516

In [137]:
def build_encoder(embeds_input,num_hidden,lstm_layer_numbers,keep_prob,batch_size,inputs_length):
    '''Create Bidirectional encoder to make model more precise '''
    
    lstm_cells_encoder_fw, lstm_cells_encoder_bw = build_RNN_encoder_cells(num_hidden,lstm_layer_numbers,keep_prob,batch_size)
    #Need to unstack the sequence of input into a list of tensors
    #seq_input = [tf.squeeze(i,[1]) for i in tf.split(embeds_input,input_len,1)] 
    
    #Decoder needs ecnoders's final states as its initial state (It will pass through attention)
    (encoder_fw_output,encoder_bw_output),(encoder_fw_state,encoder_bw_state) = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_cells_encoder_fw, cell_bw=lstm_cells_encoder_bw, inputs=embeds_input,sequence_length=inputs_length, dtype=tf.float32)
    #Concat the backward forward outputs from encoder
    encoder_output = tf.concat((encoder_fw_output, encoder_bw_output),2)
    #Concat the backward forward states from encoder
    encoder_states = []
    for i in range(lstm_layer_numbers):
        #Basic LSTM state is a tuple contains cell and hidden states 
        state_c = tf.concat(values=(encoder_fw_state[i].c,encoder_bw_state[i].c),axis=1)
        state_h = tf.concat(values=(encoder_fw_state[i].h,encoder_bw_state[i].h),axis=1)
        encoder_state = tf.contrib.rnn.LSTMStateTuple(c=state_c, h=state_h)
        encoder_states.append(encoder_state)
    encoder_states = tuple(encoder_states)
    #From the tuple get the state part not the output
    encoder_states_c = encoder_states
    
    return encoder_output,encoder_states_c

Inference(prediction) has different decoder from traning as expained very well by https://github.com/udacity/deep-learning/blob/master/seq2seq/sequence_to_sequence_implementation.ipynb

In [138]:
def build_decoder(embeds_target,encoder_output,encoder_states_c,num_hidden,lstm_layer_numbers,batch_size,inputs_length,target_length):
    '''Create decoder with attention '''

    initial_state,lstm_cells_decoder=build_RNN_decoder_cells(encoder_output,encoder_states_c,num_hidden,lstm_layer_numbers,batch_size,inputs_length)

    #Decoder setup
    
    #Helps to convert outputs to logits
    output_layer = Dense(vocab_size)
    
    #Training
    #Training helper (helper is used for BasicDecoder)
    helper = tf.contrib.seq2seq.TrainingHelper(embeds_target,sequence_length=target_length,time_major=False)
    #Training decoder
    train_decoder = tf.contrib.seq2seq.BasicDecoder(lstm_cells_decoder, helper, initial_state,output_layer = output_layer)
    print(embeds_target.shape)
    print(target_length.shape)
    train_final_outputs,_,_ = tf.contrib.seq2seq.dynamic_decode(train_decoder,output_time_major=False,impute_finished=True)

    #Inference
    #need for GreedyEmbeddingHelper parameter
    start_tokens = tf.fill([batch_size], word_index["<SOS>"])
    #Inference helper (helper is used for BasicDecoder)
    #For inference we use GreedyEmbeddingHelper because the ground truth is not available as input and it uses the output of the previous timestep instead (First param is embeddings vector)
    helper_inf = tf.contrib.seq2seq.GreedyEmbeddingHelper(emneddings,start_tokens,word_index["<EOS>"])
    #Inference decoder
    inf_decoder = tf.contrib.seq2seq.BasicDecoder(lstm_cells_decoder, helper_inf, initial_state,output_layer = output_layer)
    
    inf_final_outputs,_,_ = tf.contrib.seq2seq.dynamic_decode(inf_decoder,output_time_major=False,impute_finished=True)


    logits_train = train_final_outputs.rnn_output
    logits_inf = inf_final_outputs.rnn_output
    
    return logits_train,logits_inf

In [139]:
#Vocabulary size plus one for 0, the int number that added for padding (still need to think about this!?)
vocab_size = len(word_index)+1
#Number of units
num_hidden = 256
#Encoder and decoder layers 
lstm_layer_numbers=2
#Encoder and decoder embedding size 
embed_size=300
#To avoid ResourceExhaustedError due to batch size use a proper size based on gpu performance (still OOM problem)
batch_size= 24
learning_rate=0.001

Encoding and decoding layers are created. From encoding layers, only the encoding's output state is needed for decoding layer as its intial state.<br> If one uses attention, they also need to use the encode's output for attention cell, otherwise, encode's output is not needed.
We need to create the decoder for both training and inference(prediction) phases.<br> 
The difference is that in training decoder, the inputs of decoder are the sequences of targets that are fed to the decoder to create the model but in inference(prediction) phase the decoder works like language model in which the output of the previous time step in decoder is fed to the input of the next time step in decoder.

In [None]:
#Resert the default graph 
tf.reset_default_graph()
graph0 = tf.Graph()
#There exits a global default graph created by tensorflow, for new graphs we need to set them as a default graph
with graph0.as_default():
    encoding_inputs,decoder_targets,keep_prob,inputs_length,target_length=create_model_inputs()
    #tf.AUTO_REUSE for reuisng the same scope for generating as for training
    #with tf.variable_scope('rnn1', reuse=tf.AUTO_REUSE):
    #Create word embeddings from words using Glove
    emneddings=build_embeddings(vocab_size,embed_size,word_index,words_in_glove)
    
    print("Encoder")
    #ENCODER
    
    #Create embeddings for encoding_inputs (Encoder)
    embeds_input = tf.nn.embedding_lookup(emneddings, encoding_inputs)
    
    encoder_output,encoder_states_c = build_encoder(embeds_input,num_hidden,lstm_layer_numbers,keep_prob,batch_size,inputs_length)

    print("Decoder")
    #DECODER
    
    #As mentioned here: https://github.com/udacity/deep-learning/blob/master/seq2seq/sequence_to_sequence_implementation.ipynb
    #We should remove the last words from each target sequence in the decoder since in the last time step in decoder the target input is the last word from the sequence target and it will be ignored (This word is either <PAD> or <EOS>)
    sliced_targets= tf.strided_slice(decoder_targets, [0, 0], [batch_size, -1], [1, 1])
    #Need to append SOS to each target sequenece
    decoder_targets = tf.concat([tf.fill([batch_size, 1], word_index["<SOS>"]), sliced_targets], 1)
    #Create embeddings for targets (Decoder)
    embeds_target = tf.nn.embedding_lookup(emneddings, decoder_targets)
    
    logits_train,logits_inf = build_decoder(embeds_target,encoder_output,encoder_states_c,num_hidden,lstm_layer_numbers,batch_size,inputs_length,target_length)

    
    masks = tf.sequence_mask(target_length,target_len, dtype=tf.float32)
    loss = tf.contrib.seq2seq.sequence_loss(logits_train,decoder_targets,masks)

    train_op = tf.contrib.layers.optimize_loss(loss=loss,global_step=tf.train.get_global_step(),optimizer=tf.train.AdamOptimizer,learning_rate=learning_rate)

    init_op = tf.global_variables_initializer()
    saver = tf.train.Saver()

<tf.Variable 'Variable:0' shape=(184972, 300) dtype=float32_ref>
Encoder
Decoder
(24, ?, 300)
(?,)


Run the graph for training

In [None]:
#Execute the graph for training
gpu_options = tf.GPUOptions(allow_growth=True)
with tf.Session(graph=graph0,config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    sess = tf.Session(graph=graph0)
    sess.run(init_op)
    no_of_batches_train = int(len(X_train)/batch_size)
    epochs = 20
    text_len=[input_len] * batch_size
    text_len = np.asarray(text_len)
    summary_len=[target_len] * batch_size
    summary_len = np.asarray(summary_len)
    for epoch in range(epochs):
        print(epoch)
        state = sess.run(init_op)
        avg_cost_train = 0 
        for ii, (x, y) in enumerate(get_batches(X_train, y_train, batch_size), 1):
            _, cost= sess.run([train_op, loss], feed_dict={encoding_inputs: x,
                                                            decoder_targets: y,keep_prob: 0.5,inputs_length:text_len,target_length:summary_len})

            avg_cost_train += cost / no_of_batches_train
        print("Epoch:", epoch+1, "cost_train=",avg_cost_train)
    #Save the model into a file 
    checkpoint="./model/savedmodel.ckpt"
    save_path = saver.save(sess, checkpoint)

0


Predict email subjects

In [None]:
#Execute the same graph0 for prediction
with tf.Session(graph=graph0) as sess2:
    # Load the model
    saved = tf.train.import_meta_graph('./model/savedmodel.ckpt.meta')
    saved.restore(sess2, checkpoint)
    
    input_test=np.asarray(batch_size* [X_train[0]])
    print(input_test.shape)
    prediction= sess2.run([logits_inf ], feed_dict={encoding_inputs: input_test,
                                                            keep_prob: 0.5,inputs_length:text_len,target_length:summary_len})
Subject_Email=[]
for i in prediction:
    Subject_Email.append(get_by_key_dict(word_int,words_index))
print("Email Content:",X_train[0])
print("Predicted Subject:",Subject_Email)
print("real Subject:",y_train[0])


Working to solve the problem:<br> There is a OOM(out of memory) problem even when running with a good gpu and reducing the batch_size OOM remains. Since this happens after training step, I belive the memory cannot be released from training step and it runs out of memory in prediction step.