# Enron Latent Dirichlet Allocation Analysis
<hr>
**Author: **

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('enron_lda').getOrCreate()

In [2]:
data = spark.read.csv('./user_and_emails.csv', header=True)

In [3]:
data.show()

+-------+--------------------+--------------------+
|   user|                From|          email_body|
+-------+--------------------+--------------------+
|allen-p|phillip.allen@enr...|Here is our forec...|
|allen-p|phillip.allen@enr...|Traveling to have...|
|allen-p|phillip.allen@enr...|test successful  ...|
|allen-p|phillip.allen@enr...|Randy    Can you ...|
|allen-p|phillip.allen@enr...|Let s shoot for T...|
|allen-p|phillip.allen@enr...|Greg    How about...|
|allen-p|phillip.allen@enr...|Please cc the fol...|
|allen-p|phillip.allen@enr...|any morning betwe...|
|allen-p|phillip.allen@enr...|1  login   pallen...|
|allen-p|phillip.allen@enr...|                 ...|
|allen-p|phillip.allen@enr...|Mr  Buckner    Fo...|
|allen-p|phillip.allen@enr...|Lucy    Here are ...|
|allen-p|phillip.allen@enr...|                 ...|
|allen-p|phillip.allen@enr...|                 ...|
|allen-p|phillip.allen@enr...|Dave     Here are...|
|allen-p|phillip.allen@enr...|Paula    35 milli...|
|allen-p|phi

### ELT

In [4]:
from pyspark.ml.feature import (StopWordsRemover, 
                                Tokenizer, 
                                CountVectorizer, 
                                RegexTokenizer,
                                IDF)
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.ml.clustering import LDA

### Adding index column

In [6]:
data_with_id = data.withColumn('id', monotonically_increasing_id())

In [7]:
data_with_id.show()

+-------+--------------------+--------------------+---+
|   user|                From|          email_body| id|
+-------+--------------------+--------------------+---+
|allen-p|phillip.allen@enr...|Here is our forec...|  0|
|allen-p|phillip.allen@enr...|Traveling to have...|  1|
|allen-p|phillip.allen@enr...|test successful  ...|  2|
|allen-p|phillip.allen@enr...|Randy    Can you ...|  3|
|allen-p|phillip.allen@enr...|Let s shoot for T...|  4|
|allen-p|phillip.allen@enr...|Greg    How about...|  5|
|allen-p|phillip.allen@enr...|Please cc the fol...|  6|
|allen-p|phillip.allen@enr...|any morning betwe...|  7|
|allen-p|phillip.allen@enr...|1  login   pallen...|  8|
|allen-p|phillip.allen@enr...|                 ...|  9|
|allen-p|phillip.allen@enr...|Mr  Buckner    Fo...| 10|
|allen-p|phillip.allen@enr...|Lucy    Here are ...| 11|
|allen-p|phillip.allen@enr...|                 ...| 12|
|allen-p|phillip.allen@enr...|                 ...| 13|
|allen-p|phillip.allen@enr...|Dave     Here are.
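
The ids above happen to run 0, 1, 2, ... but `monotonically_increasing_id` only guarantees values that are unique and increasing, not consecutive. Spark documents the current layout as the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The pure-Python sketch below mimics that layout; the function name and the partition sizes are hypothetical and chosen only for illustration.

```python
def spark_style_id(partition_id: int, row_in_partition: int) -> int:
    """Mimic the documented layout: partition ID in the upper 31 bits,
    record number within the partition in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition


# Hypothetical layout: 3 rows in partition 0 and 2 rows in partition 1.
partition_sizes = [(0, 3), (1, 2)]
ids = [spark_style_id(p, r) for p, n in partition_sizes for r in range(n)]
print(ids)
```

With more than one partition the ids jump at each partition boundary, so they are safe to use as unique keys but not as row positions.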

### Tokenizing and removing empty tokens

In [8]:
regex_tokenizer = RegexTokenizer(inputCol='email_body', outputCol='tokens', pattern='\\W')

In [9]:
regex_tokenized = regex_tokenizer.transform(data_with_id)

In [10]:
regex_tokenized.show()

+-------+--------------------+--------------------+---+--------------------+
|   user|                From|          email_body| id|              tokens|
+-------+--------------------+--------------------+---+--------------------+
|allen-p|phillip.allen@enr...|Here is our forec...|  0|[here, is, our, f...|
|allen-p|phillip.allen@enr...|Traveling to have...|  1|[traveling, to, h...|
|allen-p|phillip.allen@enr...|test successful  ...|  2|[test, successful...|
|allen-p|phillip.allen@enr...|Randy    Can you ...|  3|[randy, can, you,...|
|allen-p|phillip.allen@enr...|Let s shoot for T...|  4|[let, s, shoot, f...|
|allen-p|phillip.allen@enr...|Greg    How about...|  5|[greg, how, about...|
|allen-p|phillip.allen@enr...|Please cc the fol...|  6|[please, cc, the,...|
|allen-p|phillip.allen@enr...|any morning betwe...|  7|[any, morning, be...|
|allen-p|phillip.allen@enr...|1  login   pallen...|  8|[1, login, pallen...|
|allen-p|phillip.allen@enr...|                 ...|  9|[forwarded, by, p...|
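
To make the "removing empty tokens" part concrete, here is a rough pure-Python sketch of what `RegexTokenizer` does with `pattern='\\W'` and its default settings (`gaps=True`, `toLowercase=True`, `minTokenLength=1`). It is an illustration, not Spark's implementation: the helper name and the sample sentence are made up, and `re.ASCII` is used on the assumption that Spark applies Java's default ASCII meaning of `\W`.

```python
import re


def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    # Lowercase first, then split on every single non-word character.
    # re.ASCII restricts \w to [A-Za-z0-9_], so \W is everything else.
    pieces = re.split(pattern, text.lower(), flags=re.ASCII)
    # Consecutive separators produce empty strings; the length filter drops them.
    return [p for p in pieces if len(p) >= min_token_length]


sample = "Please call me at 3:30, thanks!"  # hypothetical sentence
tokens = regex_tokenize(sample)
print(tokens)
```

Because the split happens on each individual non-word character, times and numbers with punctuation are broken into separate digit tokens, which is why so many short numeric strings end up in the stop-word list later on.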

In [11]:
stop_words = StopWordsRemover()
new_stop_words = [
    '00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '0', '1', 
    '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '15', 
    '16', '17', '18', '19','20', '25', 'www', 'com', 'pm', '853', 'click',
    'mail', 'e', 'http', 'na', 'ees', 'cc', 'hou', 'etc', '30', 'll', '35',
    'a','cannot','into','our','thus','about','co','is','ours','to','above',
    'could','it','ourselves','together','across','down','its','out','too',
    'after','during','itself','over','toward','afterwards','each','last', '50',
    'own', 'towards','again','eg','latter','per','under','against','either',
    'latterly', 'perhaps','until','all','else','least','rather','up','almost',
    'elsewhere', 'less','same','upon','alone','enough','seem','us', 'said',
    'along','etc', 'many','seemed','very','already','even','may','seeming','via',
    'also','ever', 'me','seems','was','although','every','meanwhile','several',
    'always', 'everyone','might','she','well','among','everything','more','should',
    'were', 'amongst','everywhere','moreover','since','what','an','except','most',
    'whatever','and','few','mostly','some','when','another','first','much',
    'somehow','whence','any','for','must','someone','whenever','anyhow', 'mr', 
    'my','something','where','anyone','formerly','myself','sometime', 'gif', '31',
    'whereafter','anything','from','namely','sometimes','whereas','anywhere', 
    'further','neither','somewhere','whereby','are','had','never','still', 'let',
    'wherein','around','has','nevertheless','such','whereupon','as','have', 'ip',
    'next','than','wherever','at','he','no','that','whether','be','hence',
    'nobody','the','whither','became','her','none','their','which','because', 'send',
    'here','noone','them','while','become','hereafter','nor','themselves','who',
    'becomes','hereby','not','then','whoever','becoming','herein','nothing',
    'thence','whole','been','hereupon','now','there','whom','before','hers', 'ski',
    'nowhere','thereafter','whose','beforehand','herself','of','thereby','why', 
    'especially', 'image', 're', 'we', 'so', 'static', 'width', 'href', '000',
    'behind','him','off','therefore','will','being','himself','often','therein', 
    'with','below','his','on','thereupon','within','beside','how','once', 'try', 
    'these','without','besides','however','one','they','would','between','i', 'far',
    'only','this','yet','beyond','ie','onto','those','you','both','if','or', 'get',
    'though','your','but','in','other','through','yours','by','inc','others', 'suggest', 
    'take', 'throughout','yourself','can','indeed','otherwise','thru','yourselves', 
    'login', 'please', 'forwarded', 'pw', 'k', '-', '+', '|', ' ', 'go', 'takes', 
    'td', 'font', 'br', 'b', 'tr', 'm', 'align', 'net', '3d', '2001', 'new', 
    'said', '11', 'ect', '2000', 'sent', 'know', 'dbcaps97data',
    '12', 'need', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
    'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '12', 'aol',
    '2002', 'mailto', '713', 'error', 'nbsp', 'et', 
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
    ]

# Merge Spark's default English stop words with the custom list,
# then drop duplicates (the custom list repeats several entries).
all_stop_words = stop_words.getStopWords() + new_stop_words
stop_words_set = list(set(all_stop_words))

remover = StopWordsRemover(inputCol='tokens', outputCol='words', stopWords=stop_words_set)
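
As a rough illustration of this step, the sketch below reproduces the filtering in plain Python. It assumes the `StopWordsRemover` default of case-insensitive matching (`caseSensitive=False`); the helper name and the tiny word lists are hypothetical stand-ins for `getStopWords()` and `new_stop_words`.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Drop every token that appears in stop_words."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]


default_words = ["is", "our", "here"]        # stand-in for getStopWords()
extra_words = ["please", "is", "12", "12"]   # custom extras, with duplicates
combined = sorted(set(default_words + extra_words))

kept = remove_stop_words(["here", "is", "our", "forecast"], combined)
print(combined)
print(kept)
```

Duplicates in the custom list are harmless once the lists are merged through a set, so the repeated entries in `new_stop_words` do not change the result.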

In [12]:
cleaned = remover.transform(regex_tokenized)

In [13]:
# cv = CountVectorizer(inputCol='words', outputCol='vectors')  # to use with IDF
cv = CountVectorizer(inputCol='words', outputCol='features')

In [14]:
count_vectorizer_model = cv.fit(cleaned)

In [15]:
result = count_vectorizer_model.transform(cleaned)

In [16]:
vocab = count_vectorizer_model.vocabulary
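
The `vocabulary` list is what later turns term indices back into words. The sketch below is a simplified, pure-Python picture of the idea: terms are ranked by how often they occur across the corpus, and each document becomes a sparse list of `(index, count)` pairs. The function names and toy documents are hypothetical, and tie-breaking between equally frequent terms may differ from Spark's.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size=262144):
    """Rank terms by total corpus frequency, most frequent first."""
    totals = Counter(term for doc in docs for term in doc)
    return [term for term, _ in totals.most_common(vocab_size)]


def to_sparse_counts(doc, vocab):
    """Return sorted (term index, count) pairs for one document."""
    index = {term: i for i, term in enumerate(vocab)}
    counts = Counter(index[t] for t in doc if t in index)
    return sorted(counts.items())


toy_docs = [["gas", "price", "gas"], ["gas", "meeting"], ["price", "gas"]]
toy_vocab = fit_vocabulary(toy_docs)
print(toy_vocab)
print(to_sparse_counts(toy_docs[0], toy_vocab))
```

The default `vocab_size` of 262144 mirrors the vector length shown in the `features` column of the output; in Spark, `CountVectorizer` parameters such as `vocabSize` and `minDF` can shrink this.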

In [17]:
# idf = IDF(inputCol='vectors', outputCol='features')

In [18]:
# idf_model = idf.fit(result)

In [19]:
# rescale_data = idf_model.transform(result)
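
For context on the commented-out IDF cells: Spark's `IDF` rescales each count by log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The sketch below evaluates that formula in plain Python with made-up numbers; the function name is hypothetical.

```python
import math


def spark_style_idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))


# Hypothetical corpus of 3 documents.
idf_everywhere = spark_style_idf(3, 3)  # term appears in every document
idf_rare = spark_style_idf(3, 1)        # term appears in one document

# A raw count of 4 for each term is rescaled very differently.
print(4 * idf_everywhere, 4 * idf_rare)
```

A term that appears in every document is weighted down to zero, while rare terms are boosted. One possible reason IDF hurt topic coherence here, offered as a hypothesis rather than something this notebook tested, is that LDA is formulated over word counts, and the rescaled features are no longer counts.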

In [17]:
# rescale_data.show()
result.show()

+-------+--------------------+--------------------+---+--------------------+--------------------+--------------------+
|   user|                From|          email_body| id|              tokens|               words|            features|
+-------+--------------------+--------------------+---+--------------------+--------------------+--------------------+
|allen-p|phillip.allen@enr...|Here is our forec...|  0|[here, is, our, f...|          [forecast]|(262144,[1917],[1...|
|allen-p|phillip.allen@enr...|Traveling to have...|  1|[traveling, to, h...|[traveling, busin...|(262144,[6,16,29,...|
|allen-p|phillip.allen@enr...|test successful  ...|  2|[test, successful...|[test, successful...|(262144,[59,955,1...|
|allen-p|phillip.allen@enr...|Randy    Can you ...|  3|[randy, can, you,...|[randy, schedule,...|(262144,[31,111,1...|
|allen-p|phillip.allen@enr...|Let s shoot for T...|  4|[let, s, shoot, f...|[shoot, tuesday, 45]|(262144,[81,403,6...|
|allen-p|phillip.allen@enr...|Greg    How about.

### Training the model

In [18]:
lda_model = LDA(k=10, seed=101)

In [19]:
model = lda_model.fit(result)

In [20]:
topics = model.describeTopics()

In [21]:
topics.show()

+-----+--------------------+--------------------+
|topic|         termIndices|         termWeights|
+-----+--------------------+--------------------+
|    0|[0, 1, 4, 8, 38, ...|[0.05377207869946...|
|    1|[0, 5, 1, 13, 154...|[0.02173466966236...|
|    2|[489, 472, 2756, ...|[0.02855617595264...|
|    3|[0, 259, 8, 44, 1...|[0.01020669808007...|
|    4|[59, 102, 24, 0, ...|[0.00609482962220...|
|    5|[520, 1858, 946, ...|[0.01064719697642...|
|    6|[129, 205, 281, 3...|[0.01403991828246...|
|    7|[3, 146, 0, 311, ...|[0.00874891997286...|
|    8|[0, 1, 9, 28, 6, ...|[0.01265285162653...|
|    9|[2, 3, 12, 14, 10...|[0.01177401931700...|
+-----+--------------------+--------------------+
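
Each row of `describeTopics()` pairs term indices with their weights in the topic, so the two arrays can be zipped against the vocabulary to see words and weights together. The sketch below does this in plain Python on made-up rows; `label_topic`, `toy_vocab`, and `toy_rows` are hypothetical and only imitate the column layout shown above.

```python
def label_topic(term_indices, term_weights, vocab):
    """Pair each term index with its vocabulary word and rounded weight."""
    return [(vocab[i], round(w, 4)) for i, w in zip(term_indices, term_weights)]


toy_vocab = ["enron", "subject", "power", "energy"]
toy_rows = [
    {"topic": 0, "termIndices": [0, 1], "termWeights": [0.05, 0.02]},
    {"topic": 1, "termIndices": [2, 3], "termWeights": [0.03, 0.01]},
]

labelled = {
    row["topic"]: label_topic(row["termIndices"], row["termWeights"], toy_vocab)
    for row in toy_rows
}
print(labelled)
```

Note that `describeTopics()` numbers topics from 0, while the printing loop later in the notebook uses `idx+1`, so "Topic: 1" in the printed list corresponds to topic 0 in the table.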



In [22]:
topics_rdd = topics.rdd

In [23]:
# Map each topic's term indices back to the vocabulary words
topics_words = (
    topics_rdd
    .map(lambda row: row['termIndices'])
    .map(lambda idx_list: [vocab[idx] for idx in idx_list])
    .collect()
)

In [24]:
for idx, topic in enumerate(topics_words):
    print(f'Topic: {idx+1}')
    print('===============')
    for word in topic:
        print(word)
    print('\n')

Topic: 1
enron
subject
corp
company
enron_development
thanks
meeting
business
john
gas


Topic: 2
enron
message
subject
original
intended
recipient
doc
information
kay
database


Topic: 3
omni
1999
3dhou
3dect
forney
3djohn
ou
omniexcludefromview
omniduration
omnienddatetime


Topic: 4
enron
edu
company
million
vince
cn
subject
request
business
round


Topic: 5
way
texas
houston
enron
time
like
day
good
tx
year


Topic: 6
dec
travelocity
active
prc
m0
book
position
enron
travel
ft


Topic: 7
size
class
face
images
arial
table
kate
right
border
color


Topic: 8
energy
ca
enron
gov
org
nytimes
html
subject
cpuc
pge


Topic: 9
enron
subject
thanks
agreement
time
email
message
call
attached
information


Topic: 10
power
energy
california
state
market
gas
electricity
prices
price
company




I experimented with several setups to get coherent topics. With IDF weighting, the words in each topic were all over the place. The current setup, raw term counts from CountVectorizer with the extended stop-word list, gave me the most coherent topics. I'm open to learning how to fine-tune the model by passing different parameters.
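
As a starting point for that tuning, the sketch below builds a small grid of candidate settings in plain Python. The dictionary keys match keyword arguments of `pyspark.ml.clustering.LDA` (`k`, `maxIter`, `docConcentration`, `topicConcentration`); the helper name and the specific values are assumptions chosen for illustration, not recommendations.

```python
from itertools import product


def lda_param_grid(ks, max_iters, doc_concentrations, topic_concentrations):
    """Return one dict of LDA keyword arguments per parameter combination."""
    grid = []
    for k, iters, alpha, beta in product(
        ks, max_iters, doc_concentrations, topic_concentrations
    ):
        grid.append({
            "k": k,
            "maxIter": iters,
            "docConcentration": [alpha],  # singleton list, replicated to length k
            "topicConcentration": beta,
        })
    return grid


grid = lda_param_grid(
    ks=[5, 10, 20],
    max_iters=[20, 50],
    doc_concentrations=[0.1, 1.0],
    topic_concentrations=[0.01, 0.1],
)
print(len(grid), grid[0])

# In the notebook, each entry could be tried roughly like this (needs Spark):
#   candidate = LDA(seed=101, **params).fit(result)
#   score = candidate.logPerplexity(result)  # lower is better
```

Comparing `logPerplexity` across candidates gives a numeric signal to set alongside a manual read of the topic words; ideally it would be computed on held-out emails rather than on the training data.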

### Topic: 4
I would classify this topic as general corporate conversation.

### Topic: 3
I would classify this topic as meeting conversations, since it has words such as call, time, attached, and mark.

### Topic: 2
I don't have any classification for this topic.

### Topic: 1
I would classify this topic as stock market.


## TODO:
<hr>
- Gain more domain knowledge to fine-tune the stop-word list.
- Find resources on fine-tuning the model by adjusting the functions' parameters.