# Script content

## In this script we first load the file text_clean (the list of emails processed for LDA modelling), where each list element contains one tokenized sequence of words.

## Then we build a dictionary from this collection and downsize it by filtering out words that appear too often or too seldom, using the filter_extremes function from gensim. The thresholds chosen are a minimum of 100 emails and a maximum of 10% of all emails. These values are chosen based on intuition and logical reasoning.
#### Such choices should be confirmed by a combination of hard statistical tests, where applicable, and business input.

### Over a sequence of iterations towards tweaking the final output for business purposes, input from the various stakeholders should be taken into account, together with continuous reflection on whether the choices indeed match the business need.

### Personally, I would consider this aspect perhaps the most important part of a data scientist's contribution in a corporate environment.

In [11]:
# import and setup modules we'll be using in this notebook
import logging
import itertools
import gensim
import numpy as np



### We can switch logging on as in the cell below, but we will not do that here.


In [None]:
# logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
# logging.root.level = logging.INFO  # ipython sometimes messes up the logging setup; restore

### Load cleaned email texts 

In [23]:

import cPickle 
# f = file('/notebooks/LDA models and data/Data Frames and lists/text_term_matrix_clean.pkl', 'rb')
# text_term_matrix=cPickle.load(f)

# open the pickle inside a with-block so the file handle is closed after loading
with open('/notebooks/LDA models and data/Data Frames and lists/text_clean.pkl', 'rb') as text_clean_file:
    text_clean = cPickle.load(text_clean_file)

# dictfile=file('/notebooks/LDA models and data/Data Frames and lists/Dictionary_tfidf_05V2.pkl', 'rb')
# Dictionary=cPickle.load(dictfile)
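
### The cell above relies on Python 2 (cPickle and the file builtin). As a minimal sketch, assuming a Python 3 environment, the same save/load round trip can be written with the standard pickle module. The toy token lists and the temporary file name below are made up for illustration and are not the real data.

```python
import os
import pickle
import tempfile

# Toy stand-in for text_clean: a list of tokenized emails (hypothetical data).
tokenized_emails = [["gas", "price", "meeting"], ["contract", "gas"]]

# Write to a temporary file so the sketch does not touch the real data folder.
demo_path = os.path.join(tempfile.mkdtemp(), "text_clean_demo.pkl")
with open(demo_path, "wb") as handle:
    pickle.dump(tokenized_emails, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Read it back; the with-block closes the file handle automatically.
with open(demo_path, "rb") as handle:
    restored_emails = pickle.load(handle)

print(restored_emails == tokenized_emails)
```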

### Create and print a dictionary based on our text_clean list

In [49]:
Dictionary = gensim.corpora.Dictionary(text_clean)
print(Dictionary)

Dictionary(158864 unique tokens: [u'fawn', u'verplank', u'fawl', u'percopo', u'vang']...)


### Remove from the Dictionary words that appear in fewer than 100 documents or in more than 10% of the documents.

In [48]:
Dictionary.filter_extremes(no_below=100, no_above=0.1)
print(Dictionary)

Dictionary(12311 unique tokens: [u'woodi', u'yellow', u'doucett', u'elvi', u'prefix']...)
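
### To make the effect of the two thresholds concrete, here is a minimal pure-Python sketch of filtering by document frequency. It is not gensim's implementation (filter_extremes also caps the vocabulary size through keep_n and reassigns token ids); the helper name filter_by_document_frequency and the toy emails are assumptions made for illustration.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return the tokens that appear in at least `no_below` documents
    and in at most a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for tokens in docs:
        # set() so a token counts once per document, however often it repeats
        doc_freq.update(set(tokens))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["gas", "price", "meeting"],
    ["gas", "contract"],
    ["gas", "meeting", "lunch"],
    ["gas", "price"],
]

# "gas" is in all 4 documents (too frequent), "contract" and "lunch" in 1 (too rare)
kept_tokens = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept_tokens))
```

### With the real corpus the same logic is applied with no_below=100 and no_above=0.1.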


### Check whether we have any empty emails now. We create a helper function for it.

In [50]:
def remove_empty_texts(input_text_clean):
    # keep only the emails that still contain at least one token
    output_text_clean = []
    for tokens in input_text_clean:
        if len(tokens) > 0:
            output_text_clean.append(tokens)
    return output_text_clean

### There are none: both counts below are equal.

In [54]:
print len(text_clean)
print len(remove_empty_texts(text_clean))

517401
517401


### This is the same helper function for removing empty bows (bags of words) that we used in the previous script.
### Such bows can come from emails that contain no words from our Dictionary.

In [55]:
def remove_empty_bows(input_text_term_matrix, one_or_two_bowlists=1):
    # note: one_or_two_bowlists is not used in this function
    output_text_term_matrix = []
    for i in range(0, len(input_text_term_matrix)):
        # a non-empty bow converts to an array of shape (n_tokens, 2); an empty bow has shape (0,)
        if len(np.array(input_text_term_matrix[i]).shape) == 2:
            if np.array(input_text_term_matrix[i]).shape[1] == 2:
                output_text_term_matrix.append(input_text_term_matrix[i])
    return output_text_term_matrix
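
### The sketch below illustrates how an email ends up with an empty bow once the dictionary has been filtered. It imitates doc2bow in plain Python rather than calling gensim; the helper to_bow, the tiny token2id mapping and the toy emails are hypothetical.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Count only the tokens that are present in the dictionary,
    returning sorted (token_id, count) pairs like a bag of words."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical dictionary left after filtering: only two words survive.
token2id = {"meeting": 0, "price": 1}

toy_emails = [
    ["gas", "price", "price"],  # one dictionary word, seen twice
    ["gas", "lunch"],           # no dictionary words -> empty bow
    ["meeting"],
]

bows = [to_bow(email, token2id) for email in toy_emails]
non_empty_bows = [bow for bow in bows if bow]

print(bows)
print(len(bows), len(non_empty_bows))
```

### None of the toy emails is empty as a token list, yet one of them yields an empty bow. This is why the bows are checked separately from the texts.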

## To train and test our LDA topic model we proceed as follows:

### Since this is an unsupervised task, there is no hard way to check 'targeting'.

### Instead we train the model on 70% of the emails and subsample the remaining portion (for performance reasons) for two test metrics.
### In addition, each cleaned email from the subsampled test part is split into two halves, and we then compute the following:
### 1) Perplexity of (a sample of) the train and test emails, as our first evaluation method.

### 2) The mean cosine similarity between the two halves of the same email over the whole test subsample, compared with the average cosine similarity between halves taken from randomly paired emails in the test sample.

### These are our methods for evaluating the LDA topic model.
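
### As a rough illustration of the two splitting steps described above, here is a small standard-library sketch. The function names, the fixed seed and the toy emails are assumptions for the example; the notebook itself uses np.random.choice without a seed.

```python
import random

def split_train_test(docs, train_fraction=0.7, seed=42):
    """Shuffle document indices and cut them into train and test parts."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    order = list(range(len(docs)))
    rng.shuffle(order)
    cut = int(len(docs) * train_fraction)
    train = [docs[i] for i in order[:cut]]
    test = [docs[i] for i in order[cut:]]
    return train, test

def split_in_halves(tokens):
    """Split one tokenized email into its first and second half."""
    middle = len(tokens) // 2
    return tokens[:middle], tokens[middle:]

toy_emails = [["email%d" % i, "gas", "price", "meeting"] for i in range(10)]
train_emails, test_emails = split_train_test(toy_emails)
first_half, second_half = split_in_halves(test_emails[0])

print(len(train_emails), len(test_emails))
print(first_half, second_half)
```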


In [56]:
import random

# randomly split tokenized emails 70% vs 30%
p_train = int(len(text_clean) * 0.7)
rc = np.random.choice(len(text_clean), size=len(text_clean), replace=False, p=None)
text_clean_train = [text_clean[i] for i in rc[0:p_train]]
text_clean_test = [text_clean[i] for i in rc[p_train:]]

# create bags of words to train the model and assess perplexity
text_term_matrix_train = [Dictionary.doc2bow(text) for text in text_clean_train]
text_term_matrix_train = remove_empty_bows(text_term_matrix_train)
# subselect 20000 emails from the 30% test portion to evaluate perplexity; this may be too few, but we proceed with it due to very long run times
text_clean_test = random.sample(text_clean_test, 20000)
text_term_matrix_test = [Dictionary.doc2bow(text) for text in text_clean_test]
text_term_matrix_test = remove_empty_bows(text_term_matrix_test)

### Checks for empty bows

In [58]:
print 'len(text_clean) is', len(text_clean)
print 'len(text_clean_train) is', len(text_clean_train)
print 'len(remove_empty_texts(text_clean_train)) is', len(remove_empty_texts(text_clean_train))
print 'len(text_clean_test) is', len(text_clean_test)
print 'len(remove_empty_texts(text_clean_test)) is', len(remove_empty_texts(text_clean_test))

# now check the same for the bows; notice that we already applied the remove_empty_bows function to both objects
print 'len(text_term_matrix_train) is', len(text_term_matrix_train)
print 'len(text_term_matrix_test) is', len(text_term_matrix_test)

# indeed we lost some bows; now we can proceed without worrying about error messages from gensim (on this point)

len(text_clean) is 517401
len(text_clean_train) is 362180
len(remove_empty_texts(text_clean_train)) is 362180
len(text_clean_test) is 20000
len(remove_empty_texts(text_clean_test)) is 20000
len(text_term_matrix_train) is 355916
len(text_term_matrix_test) is 19651


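### Define a function to evaluate cosine similarity scores: split each test email into two halves, then compare the similarity of matching halves with the similarity of randomly paired halves.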

In [114]:
import random
import numpy as np

def intra_inter(dictionary, model, docs, num_pairs=10000):
    # just to be sure nothing strange happened and no email is empty
    docs = remove_empty_texts(docs)
    # split each test document into two halves and compute topics for each half
    part1 = [model[dictionary.doc2bow(email_tokens[: len(email_tokens) // 2])] for email_tokens in docs]
    part2 = [model[dictionary.doc2bow(email_tokens[len(email_tokens) // 2 :])] for email_tokens in docs]

    # average similarity between the two halves of the same email (uses cossim)
    intra = np.mean([gensim.matutils.cossim(p1, p2) for p1, p2 in zip(part1, part2)])
    print("average cosine similarity between corresponding parts (higher is better):")
    print(intra)

    # average similarity between halves of randomly paired emails
    # note: the label below says 10,000, but the actual number of pairs is num_pairs
    random_pairs = np.random.randint(0, len(docs), size=(num_pairs, 2))
    inter = np.mean([gensim.matutils.cossim(part1[i[0]], part2[i[1]]) for i in random_pairs])
    print("average cosine similarity between 10,000 random parts (lower is better):")
    print(inter)

    return [intra, inter]
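
### The function above relies on gensim.matutils.cossim, which compares two sparse vectors given as (topic_id, weight) pairs. The plain-Python sketch below shows the underlying computation on made-up topic distributions; sparse_cosine is a hypothetical stand-in, not the gensim function itself.

```python
import math

def sparse_cosine(vec1, vec2):
    """Cosine similarity between two sparse vectors of (id, weight) pairs."""
    weights1, weights2 = dict(vec1), dict(vec2)
    if not weights1 or not weights2:
        return 0.0  # an empty vector is treated as having no similarity
    norm1 = math.sqrt(sum(w * w for w in weights1.values()))
    norm2 = math.sqrt(sum(w * w for w in weights2.values()))
    dot = sum(w * weights2.get(topic_id, 0.0) for topic_id, w in weights1.items())
    return dot / (norm1 * norm2)

# Made-up topic distributions for two halves of one email and for another email.
half_a = [(0, 0.8), (3, 0.2)]
half_b = [(0, 0.6), (3, 0.4)]
other_email = [(5, 1.0)]

intra_example = sparse_cosine(half_a, half_b)       # halves share topics 0 and 3
inter_example = sparse_cosine(half_a, other_email)  # no topic in common

print(intra_example, inter_example)
```

### A good topic model should assign similar topics to both halves of the same email (high intra score) and mostly different topics to unrelated emails (low inter score).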

### Finally we are ready to train and test our LDA model.

In [115]:
import time
from collections import defaultdict
import numpy as np
import random
grid_train = defaultdict(list)
grid_test = defaultdict(list)
cosine_score=[]

# num topics list to loop through for LDA output evaluation
parameter_list=[10, 20, 30, 50, 60, 75, 100, 125, 150]


# randomly split tokenized emails 70% vs 30%
p_train = int(len(text_clean) * 0.7)
rc = np.random.choice(len(text_clean), size=len(text_clean), replace=False, p=None)
text_clean_train = [text_clean[i] for i in rc[0:p_train]]
text_clean_test = [text_clean[i] for i in rc[p_train:]]

# create bags of words to train the model and assess perplexity
text_term_matrix_train = [Dictionary.doc2bow(text) for text in text_clean_train]
text_term_matrix_train = remove_empty_bows(text_term_matrix_train)

# create a sample from the training part for perplexity evaluation
text_clean_train_perpl = random.sample(text_clean_train, 20000)
text_term_matrix_train_perpl = [Dictionary.doc2bow(text) for text in text_clean_train_perpl]
text_term_matrix_train_perpl = remove_empty_bows(text_term_matrix_train_perpl)

# subselect 20000 emails from the 30% test portion to evaluate perplexity; this may be too few, but we proceed with it due to very long run times
text_clean_test = random.sample(text_clean_test, 20000)
text_term_matrix_test = [Dictionary.doc2bow(text) for text in text_clean_test]
text_term_matrix_test = remove_empty_bows(text_term_matrix_test)

# for num_topics_value in num_topics_list:
for parameter_value in parameter_list:
    # print "starting pass for num_topic = %d" % num_topics_value
    print "starting pass for parameter_value = %.3f" % parameter_value
    start_time = time.time()

    # run model
    model = gensim.models.ldamulticore.LdaMulticore(corpus=text_term_matrix_train, id2word=Dictionary, num_topics=parameter_value, chunksize=3125,\
                                     passes=1, eval_every=None, alpha=None, eta=None, decay=0.5)
    
    # show elapsed time for model
    elapsed = time.time() - start_time
    print "Elapsed time: %s" % elapsed
    
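    # note: model.bound returns the variational bound on the corpus log-likelihood, not a perplexity;
    # the per-word perplexity is derived from it further below
    perplex_test = model.bound(text_term_matrix_test)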
    print "Perplexity test: %s" % perplex_test
    grid_test[parameter_value].append(perplex_test)
    
    perplex_train = model.bound(text_term_matrix_train_perpl)
    print "Perplexity train: %s" % perplex_train
    grid_train[parameter_value].append(perplex_train)
    
    per_word_perplex_test = np.exp2(-perplex_test / sum(cnt for document in text_term_matrix_test for _, cnt in document))
    print "Per-word Perplexity test: %s" % per_word_perplex_test
    grid_test[parameter_value].append(per_word_perplex_test)
    
    per_word_perplex_train = np.exp2(-perplex_train / sum(cnt for document in text_term_matrix_train_perpl for _, cnt in document))
    print "Per-word Perplexity train: %s" % per_word_perplex_train
    # store the train value in grid_train (it was previously appended to grid_test by mistake)
    grid_train[parameter_value].append(per_word_perplex_train)
    
    cosine_score.append(intra_inter(dictionary=Dictionary, model=model, docs=text_clean_test, num_pairs=20000))

starting pass for parameter_value = 10.000
Elapsed time: 121.661648035
Perplexity test: -15126136.5465
Perplexity train: -14976077.1792
Per-word Perplexity test: 484.016500426
Per-word Perplexity train: 480.50003498
average cosine similarity between corresponding parts (higher is better):
0.57738876956
average cosine similarity between 10,000 random parts (lower is better):
0.207897425551
starting pass for parameter_value = 20.000
Elapsed time: 279.906225204
Perplexity test: -16130492.3379
Perplexity train: -15928708.4949
Per-word Perplexity test: 729.67823135
Per-word Perplexity train: 711.666442406
average cosine similarity between corresponding parts (higher is better):
0.472040056749
average cosine similarity between 10,000 random parts (lower is better):
0.113729421633
starting pass for parameter_value = 30.000
Elapsed time: 514.336200953
Perplexity test: -17032550.7964
Perplexity train: -16796631.4162
Per-word Perplexity test: 1054.98195878
Per-word Perplexity train: 1017.8674608

KeyboardInterrupt
[Interleaved, truncated tracebacks from the PoolWorker processes omitted. The workers were interrupted in gensim/models/ldamulticore.py, line 274, in worker_e_step, at `chunk_no, chunk, worker_lda = input_queue.get()`.]

### We manually aborted the execution of the cell above because fitting the models up to 150 topics was taking too long. Furthermore, the results printed so far already showed that models with more than 20-30 topics have worse scores.
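
### For reference, the per-word perplexity values printed above are derived from the bound with the formula used in the training loop: two raised to minus the bound divided by the total number of word occurrences. The sketch below repeats that arithmetic on a made-up bound and a toy corpus; the function name and the numbers are illustrative only.

```python
def per_word_perplexity(bound, bow_corpus):
    """Convert a corpus-level log-likelihood bound into per-word perplexity."""
    # total number of word occurrences across all bags of words
    total_words = sum(cnt for document in bow_corpus for _, cnt in document)
    return 2.0 ** (-bound / total_words)

toy_corpus = [[(0, 3), (1, 2)], [(2, 5)]]  # 3 + 2 + 5 = 10 word occurrences
toy_bound = -80.0                          # hypothetical value, not from the real model

toy_perplexity = per_word_perplexity(toy_bound, toy_corpus)
print(toy_perplexity)  # 2 ** (80 / 10) = 256.0
```

### A bound closer to zero gives a lower per-word perplexity, so lower values are better when models are compared on the same sample.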

### It may, however, be that to select 10-20 topics (a reasonable number given the context: emails from a select group of executives in one corporation, who simply can't be discussing too many topics) it is better to fit a model with, say, 50 topics and then use pyLDAvis to manually pinpoint 15-20 of them.

### I have in fact done this, and I found 10-15 topics whose subject was very obvious and another 5-10 where I had some doubts but could still name a topic.


### Now let us train the LDA model with 20 topics and visualize the results with the pyLDAvis package.
### The input files are the same as above, namely the Dictionary with 12311 words and our text_term_matrix_train list.

In [94]:
lda_20topics = gensim.models.ldamulticore.LdaMulticore(corpus=text_term_matrix_train, id2word=Dictionary, num_topics=20, chunksize=3125,\
                                passes=1, eval_every=None, alpha=None, eta=None, decay=0.5)

In [116]:
lda_20topics

<gensim.models.ldamulticore.LdaMulticore at 0x7f714ad4f6d0>

### Finally, we visualize the LDA model

In [118]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

vis20=pyLDAvis.gensim.prepare(lda_20topics, text_term_matrix_train, Dictionary)
pyLDAvis.display(vis20)