# References

original paper: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

reference1: https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

reference2: https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925

gensim package manual: https://radimrehurek.com/gensim/

## Step 1: Load Data 

Load the 20 Newsgroups data from sklearn.datasets, restricted to two categories.

In [1]:
import logging
from sklearn.datasets import fetch_20newsgroups
logging.basicConfig()  # configure logging; pass level=logging.INFO to see gensim's progress messages



In [2]:
categories = ['sci.med', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True, categories = categories)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True, categories = categories)

print(list(newsgroups_train.target_names))

print(newsgroups_train.filenames.shape, newsgroups_test.filenames.shape)


['sci.med', 'sci.space']
(1187,) (790,)


## Step 2: Data Pre-processing  

For each document:

1. Tokenize (simple_preprocess).
2. Filter out stopwords and short words (STOPWORDS, len(token) > 3).
3. Reduce each word to its stem (lemmatize_stemming).

E.g. 
sentence: "This is a sentence for illustration of the data pre-processing."

    After step 1: ['this', 'is', 'sentence', 'for', 'illustration', 'of', 'the', 'data', 'pre', 'processing']

    After step 2: ['sentence', 'illustration', 'data', 'processing']

    After step 3: ['sentenc', 'illustr', 'data', 'process']
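
Steps 1 and 2 can be sketched in plain Python. This is an illustration only: `toy_tokenize`, `toy_filter`, and `TOY_STOPWORDS` are hypothetical stand-ins for `simple_preprocess` and gensim's much larger `STOPWORDS` set, and they do not reproduce every detail of the real functions. Stemming (step 3) is left to `SnowballStemmer`, as in the notebook.

```python
import re

# Hypothetical, tiny stop-word list for illustration; gensim's STOPWORDS is much larger.
TOY_STOPWORDS = {"this", "is", "a", "for", "of", "the"}

def toy_tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def toy_filter(tokens, min_len=4):
    """Drop stop words and tokens shorter than min_len characters."""
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

sentence = "This is a sentence for illustration of the data pre-processing."
tokens = toy_tokenize(sentence)
kept = toy_filter(tokens)
print(tokens)
print(kept)
```

The hyphen in "pre-processing" splits the word into two tokens, and "pre" is then dropped because it is shorter than four characters.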


In [3]:
'''
Loading Gensim and nltk libraries
'''
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
# WordNetLemmatizer needs the WordNet corpus; run nltk.download('wordnet') once if it is missing.

stemmer = SnowballStemmer("english")


In [4]:
'''
Write a function to perform the pre-processing steps on the entire dataset
'''
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result


In [5]:
processed_docs = []

for doc in newsgroups_train.data:
    processed_docs.append(preprocess(doc))
    

## Step 3: Build a Dictionary 

1. Map each word/stem to an integer id (Dictionary).
2. Filter out words that occur in very few or very many documents (filter_extremes).
3. Convert each document into a bag of words using the dictionary from step 1 (doc2bow), and collect the results in bow_corpus. In a bag of words, each word id is paired with its number of occurrences in the document.
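
As a rough picture of what steps 1 and 3 produce, here is a minimal pure-Python sketch. The names `toy_docs`, `token2id`, and `toy_doc2bow` are hypothetical, and the way ids are assigned here (order of first appearance) is an assumption for illustration; gensim's `Dictionary` may number tokens differently.

```python
from collections import Counter

# Hypothetical, already pre-processed documents (lists of stems).
toy_docs = [
    ["mission", "satellit", "mission"],
    ["patient", "diseas"],
    ["satellit", "patient", "patient"],
]

# Step 1: give each distinct token an integer id, in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def toy_doc2bow(doc):
    """Return sorted (token_id, count) pairs for the tokens known to token2id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Step 3: one bag of words per document.
toy_bow_corpus = [toy_doc2bow(doc) for doc in toy_docs]
print(token2id)
print(toy_bow_corpus)
```

Tokens that are not in the dictionary are simply skipped, which is why filtering the dictionary also shrinks every bag of words built from it.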

In [6]:
from gensim.corpora import Dictionary


In [7]:
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears 
in the training set using gensim.corpora.Dictionary and call it 'dictionary'
'''
dictionary = Dictionary(processed_docs)

In [8]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents

After that, keep only the 100000 most frequent remaining words.
'''
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)
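
The filtering rule can be sketched without gensim. The function `toy_filter_extremes` and the data below are hypothetical; the sketch only mirrors the document-frequency thresholds and leaves out the `keep_n` cap.

```python
# Hypothetical pre-processed documents (lists of stems).
toy_docs = [
    ["space", "orbit", "launch"],
    ["space", "orbit"],
    ["space", "diet"],
    ["space", "orbit", "diet"],
]

def toy_filter_extremes(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = {}
    for doc in docs:
        for token in set(doc):  # count each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    max_docs = no_above * len(docs)
    return {t for t, df in doc_freq.items() if no_below <= df <= max_docs}

kept = toy_filter_extremes(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "space" is dropped because it occurs in every document, and "launch" is dropped because it occurs in only one.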

In [9]:
'''
Create the bag-of-words model for each document, i.e. for each document we create a list of (word id, count) pairs
reporting which words appear and how many times. Save this to 'bow_corpus'
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

## Step 4: LDA Model with Bag of Words

1. Train an LDA model on the BOW corpus (LdaMulticore), which spreads training across several CPU cores for speed.
2. Print out the topics.
3. Save the model to a path (cPickle.dump).
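
The save/load round trip in step 3 can be tried on its own with a stand-in object. In this sketch `toy_model` is a hypothetical placeholder for the trained model, and a temporary folder is used instead of the notebook's `model/` folder so the example leaves nothing behind in the project.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object works the same way.
toy_model = {"num_topics": 10, "passes": 10}

# A temporary directory keeps the sketch self-contained; the notebook uses 'model/'.
model_dir = os.path.join(tempfile.mkdtemp(), "model")
os.makedirs(model_dir, exist_ok=True)  # open() fails if the folder does not exist
path = os.path.join(model_dir, "toy_model.pkl")

# Write the object to disk.
with open(path, "wb") as fn:
    pickle.dump(toy_model, fn)

# Read it back.
with open(path, "rb") as fn:
    restored = pickle.load(fn)

print(restored == toy_model)
```

gensim models also have their own `save` and `load` methods, which may be a better fit than pickle for large models.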

In [10]:
from gensim.models import LdaMulticore
import _pickle as cPickle

In [11]:
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model_bow'
'''
lda_model_bow =  LdaMulticore(bow_corpus, 
                                   num_topics = 10, 
                                   id2word = dictionary,                                    
                                   passes = 10,
                                   workers = 2)

In [12]:
lda_model_bow.print_topics(num_topics = 5, num_words = 5)

[(5,
  '0.010*"rocket" + 0.009*"model" + 0.008*"theori" + 0.008*"scientif" + 0.007*"henri"'),
 (4,
  '0.020*"health" + 0.012*"softwar" + 0.012*"report" + 0.011*"henri" + 0.010*"toronto"'),
 (8,
  '0.018*"fred" + 0.017*"dseg" + 0.016*"higgin" + 0.014*"digex" + 0.014*"mccall"'),
 (0,
  '0.035*"alaska" + 0.021*"aurora" + 0.018*"nsmca" + 0.012*"acad" + 0.012*"astronaut"'),
 (7,
  '0.031*"food" + 0.014*"water" + 0.010*"sensit" + 0.009*"studi" + 0.007*"chines"')]
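
Each printed topic is a string of `weight*"word"` terms joined by `+`. If the pairs are needed as data, one option is to parse the string; the helper `parse_topic` below is a hypothetical sketch that assumes exactly this printed format.

```python
import re

# One topic string copied from the output above.
topic = '0.031*"food" + 0.014*"water" + 0.010*"sensit" + 0.009*"studi" + 0.007*"chines"'

def parse_topic(topic_string):
    """Split a printed topic into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

for word, weight in parse_topic(topic):
    print(word, weight)
```

The model's `show_topic` method returns (word, probability) pairs directly, so parsing is only needed when the printed string is all that is available.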

In [13]:
'''
Save the model to local path
'''
path = 'model/lda_bow.pkl'

# save the model
with open(path, 'wb') as fn:
    cPickle.dump(lda_model_bow, fn)    

## Step 4': LDA Model with TF-IDF

1. Fit a TF-IDF model (TfidfModel) on the bag-of-words corpus and apply it to the documents (tfidf). TF-IDF stands for term frequency-inverse document frequency: each weight is a local component (term frequency) multiplied by a global component (inverse document frequency), and each resulting document vector is normalized to unit length.

2. Train an LDA model on the TF-IDF corpus (LdaMulticore). This uses the same LDA algorithm as before, only with different input data.
3. Print out the topics.
4. Save model to path.
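
The calculation described in step 1 can be worked through by hand on a toy corpus. This sketch assumes the inverse document frequency is log2(N / df) and that normalization means unit Euclidean length; both are assumptions made for illustration, and `toy_bow_corpus` and `toy_tfidf` are hypothetical names.

```python
import math

# Hypothetical bag-of-words corpus: each document is a list of (token_id, count).
toy_bow_corpus = [
    [(0, 2), (1, 1)],
    [(1, 1), (2, 1)],
    [(1, 3), (2, 1), (3, 1)],
    [(0, 1)],
]
num_docs = len(toy_bow_corpus)

# Document frequency: number of documents that contain each token id.
doc_freq = {}
for doc in toy_bow_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def toy_tfidf(doc):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    weighted = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in doc]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted if norm > 0]

for doc in toy_bow_corpus:
    print(toy_tfidf(doc))
```

In the third document, token 3 appears once and token 1 three times, yet token 3 receives the larger weight because it occurs in only one document of the corpus.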

In [14]:
from gensim.models import TfidfModel, LdaMulticore

In [15]:
'''
Create a TF-IDF model, i.e. a transformation from the word-document co-occurrence matrix (integer counts)
into a locally/globally weighted TF-IDF matrix (positive floats).

Save the fitted model to 'tfidf' and the transformed corpus to 'corpus_tfidf'.
'''
# fit model
tfidf = TfidfModel(bow_corpus)
# apply model to the corpus documents
corpus_tfidf = tfidf[bow_corpus]

In [16]:
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model_tfidf'
'''
lda_model_tfidf =  LdaMulticore(corpus_tfidf, 
                                   num_topics = 10, 
                                   id2word = dictionary,                                    
                                   passes = 10,
                                   workers = 2)

In [17]:
lda_model_tfidf.print_topics(num_topics = 5, num_words = 5)

[(5,
  '0.015*"weight" + 0.010*"gene" + 0.010*"command" + 0.010*"needl" + 0.008*"diet"'),
 (4,
  '0.036*"gordon" + 0.014*"chastiti" + 0.014*"cadr" + 0.014*"surrend" + 0.014*"intellect"'),
 (3,
  '0.004*"diseas" + 0.003*"patient" + 0.003*"mail" + 0.003*"satellit" + 0.003*"project"'),
 (0,
  '0.015*"alaska" + 0.012*"dyer" + 0.008*"aurora" + 0.008*"nsmca" + 0.007*"spdcc"'),
 (7,
  '0.016*"vnet" + 0.014*"food" + 0.009*"smoke" + 0.008*"jupit" + 0.008*"portal"')]

In [18]:
'''
Save the model to local path
'''
path = 'model/lda_tfidf.pkl'

# save the model
with open(path, 'wb') as fn:
    cPickle.dump(lda_model_tfidf, fn)    

## Step 5: Apply the Model on Training Data 

This step checks how the models behave on the training data; skip it if you are not interested.

1. Load the model (cPickle.load)
2. Apply the model to a document (model_result)

    a. transform the document into a bag of words (doc2bow)
    
    b. apply the model to the data (model[bow_vector])
    
    c. print the topics and probabilities (print_topic) 
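
Applying the model to a bag of words yields (topic id, probability) pairs, which are then ranked before printing. The ranking can be sketched with made-up numbers; `toy_topic_scores` is hypothetical and does not come from either trained model.

```python
# Hypothetical result of model[bow_vector]: (topic_id, probability) pairs.
toy_topic_scores = [(3, 0.128), (6, 0.516), (1, 0.030), (9, 0.312)]

# Sort by probability, largest first.
ranked = sorted(toy_topic_scores, key=lambda pair: pair[1], reverse=True)

for topic_id, score in ranked:
    print("Score: {:.3f}\t Topic id: {}".format(score, topic_id))

# The first entry is the topic the document is most strongly associated with.
best_topic_id, best_score = ranked[0]
```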

In [19]:
'''
Load the models
'''  
path = 'model/lda_bow.pkl'  # same relative path the model was saved to
with open(path, 'rb') as fn:
    lda_model_bow = cPickle.load(fn)    

path = 'model/lda_tfidf.pkl'  # same relative path the model was saved to
with open(path, 'rb') as fn:
    lda_model_tfidf = cPickle.load(fn)

  

In [20]:
def model_result(model, document):
    bow_vector = dictionary.doc2bow(preprocess(document))
    
    for index, score in sorted(model[bow_vector], key=lambda tup: -1*tup[1]):
        print("Score: {}\t Topic: {}".format(score, model.print_topic(index, 5)))

In [21]:
document_num = 10  # any num within [0, 1186] -- training data
document = newsgroups_train.data[document_num]
#print(document)

In [22]:
'''
Result of LDA Using Bag of Words
'''
model_result(lda_model_bow, document)

Score: 0.979994535446167	 Topic: 0.015*"mission" + 0.011*"satellit" + 0.010*"spacecraft" + 0.009*"shuttl" + 0.009*"mar"


In [23]:
'''
Result of LDA Using TF-IDF
'''
model_result(lda_model_tfidf, document)

Score: 0.5161755681037903	 Topic: 0.010*"stage" + 0.009*"centaur" + 0.008*"mission" + 0.007*"cancer" + 0.007*"appl"
Score: 0.3118633031845093	 Topic: 0.023*"henri" + 0.019*"toronto" + 0.014*"larc" + 0.010*"spencer" + 0.009*"zoolog"
Score: 0.1282937079668045	 Topic: 0.004*"diseas" + 0.003*"patient" + 0.003*"mail" + 0.003*"satellit" + 0.003*"project"
Score: 0.030328504741191864	 Topic: 0.022*"digex" + 0.013*"express" + 0.012*"onlin" + 0.011*"communic" + 0.008*"carl"


## Step 6: Apply the Model to Test Data 

This step checks how the models behave on unseen test data; skip it if you are not interested.

1. Load the model (cPickle.load)
2. Apply the model to a document (model_result)

    a. transform the document into a bag of words (doc2bow)
    
    b. apply the model to the data (model[bow_vector])
    
    c. print the topics and probabilities (print_topic) 

In [25]:
'''
Load the models
'''  
path = 'model/lda_bow.pkl'  # same relative path the model was saved to
with open(path, 'rb') as fn:
    lda_model_bow = cPickle.load(fn)    

path = 'model/lda_tfidf.pkl'  # same relative path the model was saved to
with open(path, 'rb') as fn:
    lda_model_tfidf = cPickle.load(fn)



In [26]:
def model_result(model, document):
    bow_vector = dictionary.doc2bow(preprocess(document))
    
    for index, score in sorted(model[bow_vector], key=lambda tup: -1*tup[1]):
        print("Score: {}\t Topic: {}".format(score, model.print_topic(index, 5)))

In [27]:
document_num = 100 # any num within [0, 789] -- test data
unseen_document = newsgroups_test.data[document_num]
#print(unseen_document)

In [28]:
'''
Result of LDA Using Bag of Words
'''
model_result(lda_model_bow, unseen_document)

Score: 0.5368219614028931	 Topic: 0.015*"mission" + 0.011*"satellit" + 0.010*"spacecraft" + 0.009*"shuttl" + 0.009*"mar"
Score: 0.3186612129211426	 Topic: 0.014*"satellit" + 0.013*"fund" + 0.010*"commerci" + 0.010*"market" + 0.010*"project"
Score: 0.11858393996953964	 Topic: 0.016*"patient" + 0.016*"pain" + 0.015*"diseas" + 0.010*"candida" + 0.008*"infect"


In [29]:
'''
Result of LDA Using TF-IDF
'''
model_result(lda_model_tfidf, unseen_document)

Score: 0.7678722143173218	 Topic: 0.004*"diseas" + 0.003*"patient" + 0.003*"mail" + 0.003*"satellit" + 0.003*"project"
Score: 0.11889021098613739	 Topic: 0.015*"alaska" + 0.012*"dyer" + 0.008*"aurora" + 0.008*"nsmca" + 0.007*"spdcc"
Score: 0.08730076253414154	 Topic: 0.010*"stage" + 0.009*"centaur" + 0.008*"mission" + 0.007*"cancer" + 0.007*"appl"


## Extra: BOW or TF-IDF? 

1. BOW: represents each document by its raw term frequencies.
2. TF-IDF: weights the term frequencies by inverse document frequency. bow_corpus is the input to the TF-IDF model as well.

For feeding LDA, a paper by Blei and Lafferty (2009) suggests that TF-IDF is not necessary but could be helpful.

paper: http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf

Stack Overflow discussion: https://stackoverflow.com/questions/44781047/necessary-to-apply-tf-idf-to-new-documents-in-gensim-lda-model/44789327#44789327

In [31]:
'''
Preview BOW & TF-IDF for our sample preprocessed document
'''
document_num = 20
bow_doc_x = bow_corpus[document_num]
corpus_tfidf_x = corpus_tfidf[document_num]
# print(corpus_tfidf_x[1][0])

for i in range(len(bow_doc_x)):
    print("Word {} (\"{}\") appears {} time, with tf-idf as {}.".format(bow_doc_x[i][0], 
                                                     dictionary[bow_doc_x[i][0]], 
                                                     bow_doc_x[i][1],
                                                     corpus_tfidf_x[i][1]))
    if i > 10:
        break


Word 9 ("human") appears 1 time, with tf-idf as 0.11119258084160516.
Word 39 ("soon") appears 1 time, with tf-idf as 0.1013654407969559.
Word 70 ("consid") appears 1 time, with tf-idf as 0.10688973455146679.
Word 79 ("school") appears 1 time, with tf-idf as 0.11655844119462777.
Word 96 ("mayb") appears 2 time, with tf-idf as 0.21022179855651255.
Word 102 ("shuttl") appears 1 time, with tf-idf as 0.10874765465476678.
Word 123 ("current") appears 1 time, with tf-idf as 0.10176540863607408.
Word 183 ("worth") appears 1 time, with tf-idf as 0.1417022709780529.
Word 204 ("offic") appears 1 time, with tf-idf as 0.11598977500433545.
Word 217 ("administr") appears 1 time, with tf-idf as 0.14384614193604373.
Word 238 ("mar") appears 3 time, with tf-idf as 0.42510681293415875.
Word 254 ("daniel") appears 1 time, with tf-idf as 0.1461059528303259.
