# Things to do

- Topic Modelling on All Employees' Sent Emails
- What is Topic Modelling?
  - Topic can be defined as “a repeating pattern of co-occurring terms in a corpus”.
- What is the Enron Scandal?
   - Enron Corporation was an American energy, commodities, and services company based in Houston, Texas.
     Before its bankruptcy on December 2, 2001, Enron employed approximately 20,000 staff and was one of the world's 
     major electricity, natural gas, communications, and pulp and paper companies, with claimed revenues of nearly $111 billion 
     during 2000. At the end of 2001, it was revealed that its reported financial condition was sustained substantially 
     by an institutionalized, systematic, and creatively planned accounting fraud, known since as the Enron scandal.

#### Importing All Libraries

In [2]:
import glob
import os
import re
import email
from email.parser import Parser
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
import gensim
from gensim import corpora

#### Step 1 
Data Collection & Pre Processing

- Initially the data was in MIME format.
- Looped through each email and passed it to the email Parser (get_payload()) to extract the text body.
- Stored the text body of each sent email as a txt file in a separate directory [midterm/data/Topic Modeling/Sent Emails/*.txt].
- Read all sent-email txt files into one list.
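
To illustrate the parsing step on its own, here is a minimal sketch that runs `Parser().parsestr()` on an in-memory message instead of a file. The sample message is hypothetical and not taken from the Enron corpus; it only shows how headers and the text body are separated.

```python
from email.parser import Parser

# Hypothetical sample message, not taken from the Enron corpus.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Gas contract\n"
    "\n"
    "Please review the attached contract.\n"
)

message = Parser().parsestr(raw)

# Header fields are available by name; the payload is the text after the blank line.
subject = message["Subject"]
body = message.get_payload()

print(subject)
print(body)
```

For a single-part message like this one, `get_payload()` returns a string. For multipart messages it returns a list of sub-messages, so the body would need to be assembled from the parts.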

In [24]:
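relativePath = os.getcwd()
path = os.path.join(relativePath, 'midterm/data/ernon/maildir/')
sentDir = os.path.join(relativePath, 'midterm/data/Topic Modeling/Sent Emails')
i = 1

# Create the output directory for the extracted text bodies if it is missing.
if os.path.isdir(os.path.join(relativePath, 'midterm/data/')):
    os.makedirs(sentDir, exist_ok=True)

# Using Email Parser to read MIME type emails.

def emailParser(inputFile, i):
    with open(inputFile, "r") as f:
        data = f.read()
    # Named `message` so the imported `email` module is not shadowed.
    message = Parser().parsestr(data)
    body = message.get_payload()
    # get_payload() returns a list for multipart emails; keep only the string parts.
    if not isinstance(body, str):
        body = "\n".join(part.get_payload() for part in message.walk()
                         if isinstance(part.get_payload(), str))
    with open(os.path.join(sentDir, str(i) + '.txt'), 'w', encoding='utf-8') as txtFile:
        txtFile.write(body)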


for directory, subDirectory, fileNames in os.walk(path):
    if 'sent_items' in directory:
        for filename in fileNames:
            emailParser(os.path.join(directory, filename), i)
            i=i+1
    
        

In [17]:
doc_com = []

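def readEmail(inputFile):
    # The files were written as UTF-8 in Step 1, so read them back with the same encoding.
    with open(inputFile, "r", encoding='utf-8') as f:
        data = f.read()
        doc_com.append(data)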


for directory, subDirectory, fileNames in os.walk(relativePath+"/"+'midterm/data/Topic Modeling/Sent Emails/'):
    for files in fileNames:
        readEmail(os.path.join(directory, files))
        

#### Step 2
- Cleaning
- Looped through the list to remove all stopwords using NLTK's stopwords.words('english').
- Removed all punctuation from the list.
- Normalized the list using the NLTK WordNet lemmatizer (lemma.lemmatize(word)).
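
The cleaning steps above can be sketched without NLTK as follows. This is an illustration only: `STOPWORDS` is a tiny hand-written stand-in for NLTK's English stopword list, `clean_tokens` is a hypothetical helper, and lemmatization is omitted. The sketch also strips punctuation before filtering stopwords, which is a different order from the notebook's `clean` function.

```python
import string

# Assumption: a tiny hand-written list standing in for nltk's stopwords.words('english').
STOPWORDS = {"the", "to", "from", "a", "of", "is", "and"}
PUNCTUATION = set(string.punctuation)

def clean_tokens(text):
    """Lowercase, strip punctuation, then drop stopwords (lemmatization omitted)."""
    lowered = text.lower()
    no_punct = "".join(ch for ch in lowered if ch not in PUNCTUATION)
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]

tokens = clean_tokens("To: Bob. Please send the gas contract.")
print(tokens)
```

Stripping punctuation first means a token such as "To:" becomes "to" before the stopword check and is filtered out. When stopwords are removed first, "to:" does not match the stopword "to" and survives into the cleaned text.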

In [18]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
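def clean(doc):
    # Drop stopwords first; tokens with attached punctuation (e.g. "to:") are not matched here.
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Remove every punctuation character.
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word.
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized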

doc_clean = [clean(doc).split() for doc in doc_com]


#### Step 3
- Making Document Term Matrix

- All the text documents combined are known as the corpus. To run a mathematical model on a text corpus, it is good practice to convert it into a matrix representation. The LDA model looks for repeating term patterns across the entire document-term (DT) matrix.

- Created the term dictionary of the data, where every unique term is assigned an index.
- Converted the list of documents (corpus) into a document-term matrix using the prepared dictionary.
- Created an object for the LDA model using the gensim library (Lda = gensim.models.ldamodel.LdaModel).
- Ran and trained the LDA model on the document-term matrix using gensim.
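
To show what the document-term matrix looks like, here is a pure-Python sketch of the bag-of-words idea behind `corpora.Dictionary` and `doc2bow`. It is not gensim's implementation: `docs`, `token2id`, and `to_bow` are hypothetical names, and the ids gensim assigns to terms may differ.

```python
from collections import Counter

# Hypothetical cleaned documents, in the same list-of-token-lists shape as doc_clean.
docs = [
    ["gas", "price", "gas"],
    ["power", "price"],
]

# Assign each unique term an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (term_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

Each document becomes a sparse list of (term id, count) pairs, so terms that do not occur in a document take up no space. This is the shape of input the LDA model is trained on.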

In [19]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index. 
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]


In [20]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=4, id2word = dictionary, passes=50)

tList = ldamodel.print_topics(num_topics=4, num_words=10)

#### Step 4
- Result of Topic Modelling

In [21]:
topicList = []

for x in tList:
    y = re.sub('[^A-Za-z]+', ' ', x[1])
    topicList.append(y)
    

for z in topicList:
    separate = z.split()
    print('Topic', separate)

Topic ['would', 'enron', 'power', 'company', 'market', 'energy', 'price', 'gas', 'contract']
Topic ['intended', 'email', 'recipient', 'omnicalendarentry', 'or', 'corp', 'use', 'affiliate', 'enron', 'omniexcludefromviewdomniexcludefromview']
Topic ['message', 'subject', 'to', 'from', 'original', 'sent', 'please', 'pm', 'cc']
Topic ['message', 'to', 'sent', 'subject', 'from', 'original', 'pm', 're', 'know']


Each line is a topic with its individual topic terms. 
- Topic 1 - It can be termed Business.
- Topic 2 - It can be termed Legalities.
- Topic 3 - It can be termed Meeting.
- Topic 4 - It can be termed Meeting in a casual tone.
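
The Step 4 cell replaces everything except letters with spaces, which discards the term weights and any term made only of digits. As an alternative, here is a minimal sketch that keeps each term together with its weight. The `topic_string` value is hypothetical and only mimics the weight*"term" format of the strings that `print_topics` returns.

```python
import re

# Hypothetical topic string in the weight*"term" format that print_topics returns.
topic_string = '0.030*"enron" + 0.021*"power" + 0.015*"gas"'

# Capture each weight and its quoted term.
pairs = [(term, float(weight))
         for weight, term in re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)]
print(pairs)
```

Keeping the weights makes it possible to see how strongly each term contributes to a topic. gensim's `show_topic` method also returns (term, weight) pairs directly, which avoids parsing strings.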

## Conclusion
- Topic 1 contains words that are directly related to Enron's core business, such as "gas" and "power".
- Topic 2, while related to business, seems to be more about process than about the content of the core business. It has many terms relevant to business legalities.
- Topic 3 contains many meeting-related words. Perhaps they come from emails that were sent as meeting notices.
- Topic 4 also seems to be meeting-related, but in a more casual tone and setting.