This is a sequence-to-sequence (seq2seq) problem, so the model needs an encoder and a decoder.<br>
Process:<br>
1-Load the data <br>
2-Preprocess the data <br>
3-Create dictionary from the words <br>
4-Build and train the seq2seq model (using GloVe for the embeddings and attention in the decoder) <br>
5-Generate the summary <br>

In [34]:
import numpy as np
import pandas as pd
import tensorflow as tf
import os.path
from keras.preprocessing.text import Tokenizer
from bs4 import BeautifulSoup
import re
import string
from keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords
import nltk

The data comes from the Enron email dataset. <br>
Here we treat the subject of an email as a few-word summary of its content, and this summary is what the model has to learn to generate.

In [35]:
df = pd.read_csv('/emails_data/enron_emails.csv')
#Need only two columns: 'Subject' and 'content'
df1 = df[['Subject','content']]

Unnamed: 0,Subject,content
0,,Here is our forecast
1,Re:,Traveling to have a business meeting takes the...
2,Re: test,test successful. way to go!!!
3,,"Randy, Can you send me a schedule of the salar..."
4,Re: Hello,Let's shoot for Tuesday at 11:45.
5,Re: Hello,"Greg, How about either next Tuesday or Thursda..."
6,,Please cc the following distribution list with...
7,Re: PRC review - phone calls,any morning between 10 and 11:30
8,Re: High Speed Internet Access,1. login: pallen pw: ke9davis I don't think th...
9,FW: fixed forward or other Collar floor gas pr...,---------------------- Forwarded by Phillip K ...


We disregard forwarded emails and replies, identified by the "FW:" and "RE:" markers in their subjects.

In [36]:
df1=df1[~df1['Subject'].str.contains("FW:", na=False)]
df1=df1[~df1['Subject'].str.contains("Fw:", na=False)]
df1=df1[~df1['Subject'].str.contains("fw:", na=False)]
df1=df1[~df1['Subject'].str.contains("RE:", na=False)]
df1=df1[~df1['Subject'].str.contains("Re:", na=False)]
df1=df1[~df1['Subject'].str.contains("re:", na=False)]

Unnamed: 0,Subject,content
0,,Here is our forecast
3,,"Randy, Can you send me a schedule of the salar..."
6,,Please cc the following distribution list with...
11,,"Lucy, Here are the rentrolls: Open them and sa..."
12,Consolidated positions: Issues & To Do list,---------------------- Forwarded by Phillip K ...
13,Consolidated positions: Issues & To Do list,---------------------- Forwarded by Phillip K ...
14,,"Dave, Here are the names of the west desk memb..."
16,"Var, Reporting and Resources Meeting",---------------------- Forwarded by Phillip K ...
17,,"Tim, mike grigsby is having problems with acce..."
18,Westgate,---------------------- Forwarded by Phillip K ...
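
The six filters above drop any subject that contains one of the markers anywhere, so a subject such as "Software: update" is also removed because it contains "re:". As a hedged alternative, the sketch below only matches a marker at the start of the subject and ignores letter case. The helper name `is_reply_or_forward` and the extra "Fwd:" prefix are assumptions for illustration, not part of the original notebook.

```python
import re

# Matches subjects that START with a reply/forward prefix, in any letter case.
# Assumed prefixes: "Re:", "Fw:" and "Fwd:" ("Fwd:" is an addition, not in the original filters).
REPLY_FORWARD_PREFIX = re.compile(r"^\s*(re|fwd?)\s*:", re.IGNORECASE)

def is_reply_or_forward(subject):
    """Return True when the subject starts with a reply/forward prefix.

    Non-string values (for example NaN subjects) return False, which mirrors
    the na=False behaviour of the filters above.
    """
    return isinstance(subject, str) and bool(REPLY_FORWARD_PREFIX.match(subject))
```

With pandas this could be applied as `df1 = df1[~df1['Subject'].map(is_reply_or_forward)]`. Whether matching only at the start is preferable depends on how strictly replies should be excluded.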


Emails that contain "Forwarded by" in their content are forwarded or replied emails whose subjects were changed by the current sender. Therefore, for now we disregard those too (although later we could use them as test data and generate subjects for them).

In [37]:
df1=df1[~df1['content'].str.contains("Forwarded by", na=False)]

Unnamed: 0,Subject,content
0,,Here is our forecast
3,,"Randy, Can you send me a schedule of the salar..."
6,,Please cc the following distribution list with...
11,,"Lucy, Here are the rentrolls: Open them and sa..."
14,,"Dave, Here are the names of the west desk memb..."
17,,"Tim, mike grigsby is having problems with acce..."
20,,"Brenda, Please use the second check as the Oct..."
24,San Juan Index,"Liane, As we discussed yesterday, I am concern..."
28,,"Reagan, Just wanted to give you an update. I h..."
33,,"Chris, What is the latest with PG&E? We have b..."


Removing NaN subjects

In [38]:
df1 = df1[pd.notnull(df1['Subject'])]
df1

Unnamed: 0,Subject,content
24,San Juan Index,"Liane, As we discussed yesterday, I am concern..."
106,tv on 33,Cash Hehub Chicago PEPL Katy Socal Opal Permia...
126,For Wade,"Wade, I understood your number one priority wa..."
140,assoc. for west desk,"Celeste, I need two assoc./analyst for the wes..."
143,test,testing
224,Priority List,"Will, Here is a list of the top items we need ..."
267,eol,Jeff/Brenda: Please authorize the following pr...
395,Mike Grigsby,Please approve Mike Grigsby for Bloomberg. Tha...
413,San Marcos construction project,Please find attached the pro formas for the pr...
518,Headcount,Financial (6) West Desk (14) Mid Market (16)


Find the maximum and minimum lengths (in characters) of each column, so we can define a length range for the email contents that we want to include in our dataset.

In [39]:
#Character length of every cell, computed once and reused
lengths = df1.applymap(lambda x: len(str(x)))
max_len = lengths.max()
print(max_len)
min_len = lengths.min()
print(min_len)

Subject       258
content    737640
dtype: int64
Subject    1
content    1
dtype: int64


We keep only the emails whose content is between 500 and 6000 characters long (inclusive).

In [40]:
#Keep emails whose content length is within [500, 6000] characters
content_len = df1['content'].astype('str').map(len)
df1 = df1[(content_len >= 500) & (content_len <= 6000)]
emails=df1

In [41]:
def load_clean(emails,stop_words):
    '''Clean the email contents before Keras tokenization'''
    emails_messages=[]
    for email_content in emails['content']:
        #Remove stop words (the match is case-sensitive, so capitalized stop words such as "The" are kept)
        email_content=' '.join(i for i in email_content.split() if i not in stop_words)
        #Remove decimal numbers, then replace every non-word character with a space
        email_content=re.sub(r"(\d*\.\d+)|(\d+\.[0-9 ]+)","",email_content)
        email_content=re.sub(r'[^\w]', ' ', email_content)
        #Remove tokens made only of digits; mixed tokens such as 27th are kept
        #(later we may also keep numbers tied to dates, rooms, money, etc., such as Sep 27, room 3, 10 cent)
        email_content = " ".join([w for w in email_content.split() if not w.isdigit()])

        emails_messages.append(email_content)
    return emails_messages

In [43]:
#load stop words from nltk
nltk.download("stopwords")
stop_words = set(stopwords.words('english'))
#clean and preprocess the data
emails_messages=load_clean(emails,stop_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['Liane As discussed yesterday I concerned attempt manipulate El Paso San Juan monthly index A single buyer entered marketplace September paid market prices San Juan gas intent distort index At time trades offers physical gas significantly cents lower prices bypassed order establish higher trades report index calculation Additionally trades line associated financial swaps San Juan We compiled list financial physical trades executed September September These complete list trades Enron Online EOL Enron s direct phone conversations three brokerage firms Amerex APB Prebon Please see attached spreadsheet trade trade list summary We also included summary gas daily prices illustrate value San Juan based several spread relationships The two key points data follows The high physical prices 26th 27th much greater high financial trades days The spread relationship San Juan points Socal Northwest consistent end September October gas daily It make sense monthly indeces dramatically different I unde
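
The cleaned text above still contains capitalized stop words such as "As", "I", "A" and "The". The NLTK English stop-word list is lowercase and the membership test in `load_clean` is case-sensitive, so those tokens pass through. A minimal sketch of a case-insensitive variant follows; the helper name `remove_stop_words` is hypothetical and this is not the notebook's original behaviour.

```python
def remove_stop_words(text, stop_words):
    """Drop whitespace-separated tokens whose lowercase form is in stop_words.

    Tokens with attached punctuation (for example "The,") are not matched,
    because no punctuation stripping happens at this step.
    """
    return ' '.join(tok for tok in text.split() if tok.lower() not in stop_words)


# Toy stop-word set for illustration; the notebook uses stopwords.words('english')
print(remove_stop_words("As discussed yesterday I am concerned", {"as", "i", "am"}))
```

Switching to this variant would change the cleaned text shown above, so it is offered only as an option to evaluate.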

In [None]:
#Comparing the first email before and after preprocessing
emails.loc[24,'content']
emails_messages[0]

In [44]:
def encode_words(sentences):
    '''Convert words to numbers (create a dictionary of words)'''
    
    #Keras tokenization (punctuation removal, lowercasing and splitting on white space)
    tokenize = Tokenizer()
    #Fit the tokenizer to the whole data
    tokenize.fit_on_texts(sentences)
    data_seq=tokenize.texts_to_sequences(sentences)
    word_index = tokenize.word_index
    #Use the length of the longest sequence as the common sequence length
    num_tokens = [len(tokens) for tokens in data_seq]
    max_seq_length=np.max(num_tokens)
    #Give all sequences the same length by appending zeros to the shorter ones
    #(nothing is truncated, because maxlen equals the longest sequence)
    data_seq = pad_sequences(data_seq, maxlen = max_seq_length,
                                padding='post', truncating='pre')
    return data_seq,word_index

In [45]:
#Words to int
data_sequences,word_index=encode_words(emails_messages)

In [46]:
word_index

{'mdbe': 29646,
 'ecipients': 36729,
 '5173mt': 135661,
 'reallocations': 61349,
 'transend': 88704,
 'sowell': 28867,
 'chudson': 69247,
 'sonja': 34782,
 'degussa': 98038,
 'mycoolinternet': 87221,
 'vani': 87495,
 'woods': 7193,
 'spiders': 49816,
 'paolis': 34772,
 'hanging': 6353,
 'woody': 10162,
 'suzana': 39029,
 'cellspacing': 2034,
 'scvwd': 60195,
 'localized': 25173,
 'nordisk': 91343,
 'lenci': 100457,
 'sodikoff': 92833,
 'canes': 45401,
 'canet': 111128,
 'duathalon': 129946,
 'sprague': 37747,
 'mmoran1970': 42016,
 'jairam': 43091,
 'john31': 108780,
 'cfb': 108925,
 'refunding': 28687,
 'jandarma': 121256,
 'svingen': 58321,
 '8bps': 101816,
 'gatx': 34126,
 'kaa24090': 109225,
 'pigment': 129108,
 '3d136': 75238,
 'showt': 115258,
 'tourister': 24109,
 'fantas': 102686,
 'igateway': 102399,
 'equilon': 16598,
 'broward': 16865,
 'badlnd': 6576,
 'bringing': 2796,
 'prizing': 123648,
 'wisemiller': 42232,
 'aichi': 68717,
 'wooded': 93973,
 'inetevents': 37509,
 'endd

In [47]:
data_sequences

array([[26894,    58,   681, ...,     0,     0,     0],
       [26894,    58,   681, ...,     0,     0,     0],
       [   29,  1618,     4, ...,     0,     0,     0],
       ..., 
       [    4,  3755,  1102, ...,     0,     0,     0],
       [    2,   137,   384, ...,     0,     0,     0],
       [   10, 11009,  2116, ...,     0,     0,     0]], dtype=int32)
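
To make the encoding concrete, here is a toy pure-Python sketch of the two steps `encode_words` performs: mapping words to integer indices and post-padding with zeros. It is a simplified stand-in under stated assumptions (lowercasing and whitespace splitting only, indices assigned by descending frequency), not the Keras implementation; the name `toy_encode` is hypothetical.

```python
from collections import Counter

def toy_encode(sentences):
    """Simplified stand-in for Tokenizer + pad_sequences(padding='post')."""
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(w for tokens in tokenized for w in tokens)
    # The most frequent word gets index 1; index 0 is reserved for padding
    word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
    sequences = [[word_index[w] for w in tokens] for tokens in tokenized]
    # Pad every sequence with trailing zeros up to the longest sequence
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, word_index


padded, toy_index = toy_encode(["a b a", "b c"])
print(toy_index)  # {'a': 1, 'b': 2, 'c': 3}
print(padded)     # [[1, 2, 1], [2, 3, 0]]
```

The trailing zeros in `data_sequences` above play the same role as the `0` in the second toy sequence.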

Instead of the naive approach to embeddings (initializing the embedding vectors with random numbers and letting the model learn them from scratch), we can use GloVe: words that exist in GloVe are initialized with its pre-trained vectors, and words that do not exist in GloVe are initialized with random numbers.
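
The initialization described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper names `parse_glove_lines` and `build_embedding_matrix`, the random scale of 0.1 and the fixed seed are all assumptions. It relies only on the GloVe text format, where each line holds a word followed by its vector components separated by spaces.

```python
import numpy as np

def parse_glove_lines(lines):
    """Parse GloVe text lines ("word v1 v2 ...") into a dict of word -> vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return vectors

def build_embedding_matrix(word_index, glove_vectors, embedding_dim, seed=0):
    """Row i holds the vector of the word with index i; row 0 is the padding row."""
    rng = np.random.default_rng(seed)
    # Start with small random vectors for every word (assumed scale of 0.1)
    matrix = rng.normal(scale=0.1, size=(len(word_index) + 1, embedding_dim))
    matrix[0] = 0.0  # index 0 is used for padding
    for word, idx in word_index.items():
        vector = glove_vectors.get(word)
        if vector is not None:
            # Word exists in GloVe: overwrite the random row with the pre-trained vector
            matrix[idx] = vector
    return matrix
```

In practice the lines would come from an opened GloVe file, `word_index` from `encode_words`, and the resulting matrix would serve as the initial weights of the model's embedding layer.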