LDA for Topic Modeling


---------------------------

Some well-known topic modeling techniques are:

* Latent Semantic Analysis (LSA)
* Probabilistic Latent Semantic Analysis (PLSA)
* Latent Dirichlet Allocation (LDA)
* Correlated Topic Model (CTM)

In this video, we will use LDA.

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one such topic model: it assigns the text in a document to one or more topics. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

The LDA makes two key assumptions:

- Documents are a mixture of topics, and

- Topics are a mixture of tokens (or words)

In other words, each document is represented as a probability distribution over topics, and each topic as a probability distribution over words.
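The two assumptions can be read as a recipe for generating a document. The sketch below is only an illustration of that recipe, using the standard library: the vocabulary, the two hand-written topics and the helper names (`sample_dirichlet`, `generate_document`) are all hypothetical and are not part of gensim or of the model trained later.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one toy document under LDA's two assumptions."""
    # 1. Draw this document's mixture of topics
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic according to the document's mixture
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word according to that topic's word distribution
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

# Hypothetical toy vocabulary and two hand-written topics (not learned from data)
vocab = ["game", "team", "player", "orbit", "launch", "nasa"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "sports" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "space" topic
]

rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print("topic mixture:", [round(t, 3) for t in theta])
print("document:", words)
```

Training LDA runs this recipe in reverse: only the words are observed, and the topic mixtures and the topics' word distributions have to be inferred.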

## How will LDA optimize the distributions?

The goal of LDA is to find the Document-Topic and Topic-Word matrices, that is, the Document-Topic and Topic-Word distributions, that best explain the observed corpus.

Because LDA assumes that documents are mixtures of topics and topics are mixtures of words, it works backwards from the observed documents to infer which topics would have generated those documents and which word distributions define those topics.
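One common way to write this down is sketched below. The notation is an assumption taken from the usual textbook presentation of LDA, not something defined in this notebook: $K$ topics, $D$ documents, $N_d$ words in document $d$, $\theta_d$ the topic mixture of document $d$, $\beta_k$ the word distribution of topic $k$, $z_{d,n}$ the topic assigned to the $n$-th word of document $d$, $w_{d,n}$ the observed word, and $\alpha$, $\eta$ the Dirichlet priors.

$$
p(\beta, \theta, z, w \mid \alpha, \eta) = \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{z_{d,n}})
$$

$$
p(\beta, \theta, z \mid w, \alpha, \eta) = \frac{p(\beta, \theta, z, w \mid \alpha, \eta)}{p(w \mid \alpha, \eta)}
$$

The denominator $p(w \mid \alpha, \eta)$ cannot be computed exactly for realistic corpora, so implementations approximate the posterior. gensim's `LdaModel` is based on the online variational Bayes method of Hoffman et al., referenced in the parameter notes further down.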



## Load Newsgroups data

As before, let's consider a specific set of categories:

In [1]:
!pip install pyLDAvis -q

In [2]:
from sklearn.datasets import fetch_20newsgroups

def load_dataset(sset, categories):
    """
    Function to load 20 newsgroups dataset from sklearn. The dataset is a collection of approximately 
    20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
    
    Parameters:
    sset (str): The subset of the dataset to load. This can be 'train', 'test', or 'all' 
                to load the training set, the test set, or all dataset.
    
    categories (list of str): The list of categories (newsgroups) to load. If it's an empty list, 
                              all categories will be loaded.
    
    Returns:
    newsgroups_dset (sklearn.utils.Bunch): The loaded dataset. It's a dict-like object with the following 
                                           attributes:
                                           - data: the text data to learn
                                           - target: the classification labels
                                           - target_names: the meaning of the labels
                                           - DESCR: the full description of the dataset.

    """
    
    # An empty list (or None) means "load every category"
    newsgroups_dset = fetch_20newsgroups(subset=sset,
                          categories=categories or None,
                          remove=('headers', 'footers', 'quotes'),
                          shuffle=True)
    return newsgroups_dset

The 20 Newsgroups dataset, as distributed by scikit-learn, is a collection of around 18,000 newsgroup posts on 20 “topics”.

---------------

### Define the list of categories to extract from the data


In [3]:
categories = ["comp.windows.x", "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball", "rec.sport.hockey", "sci.crypt", "sci.med", "sci.space", "talk.politics.mideast"]

newsgroups_all = load_dataset('all', categories) # To access both training and test sets, use “all” as the first argument


print(len(newsgroups_all.data))

9850


In [4]:
import nltk
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

def stem(text):
    return stemmer.stem(text)


In [5]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS as stopwords


def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text, min_len = 4 ):
        if token not in stopwords:
            result.append(stem(token))
            
    return result

In [6]:
doc_sample = newsgroups_all.data[0]

print('Original document: ')
print(doc_sample)

Original document: 
Hi Xperts!

How can I move the cursor with the keyboard (i.e. cursor keys), 
if no mouse is available?

Any hints welcome.

Thanks.


In [7]:
print('\n\nTokenized document: ')
words = []

for token in gensim.utils.tokenize(doc_sample):
    words.append(token)
    
words



Tokenized document: 


['Hi',
 'Xperts',
 'How',
 'can',
 'I',
 'move',
 'the',
 'cursor',
 'with',
 'the',
 'keyboard',
 'i',
 'e',
 'cursor',
 'keys',
 'if',
 'no',
 'mouse',
 'is',
 'available',
 'Any',
 'hints',
 'welcome',
 'Thanks']

In [8]:
print(preprocess(doc_sample))

['xpert', 'cursor', 'keyboard', 'cursor', 'key', 'mous', 'avail', 'hint', 'welcom', 'thank']


In [9]:
for i in range(0, 10):
    print(str(i) + "\t" + ", ".join(preprocess(newsgroups_all.data[i])[:10] ))

0	xpert, cursor, keyboard, cursor, key, mous, avail, hint, welcom, thank
1	obtain, copi, open, look, widget, obtain, need, order, copi, thank
2	right, signal, strong, live, west, philadelphia, perfect, sport, fan, dream
3	canadian, thing, coach, boston, bruin, colorado, rocki, summari, post, gather
4	heck, feel, like, time, includ, cafeteria, work, half, time, headach
5	damn, right, late, climb, meet, morn, bother, right, foot, asleep
6	olympus, stylus, pocket, camera, smallest, class, includ, time, date, stamp
7	includ, follow, chmos, clock, generat, driver, processor, chmos, eras, prom
8	chang, intel, discov, xclient, xload, longer, work, bomb, messag, error
9	termin, like, power, server, run, window, manag, special, client, program


In [10]:
processed_docs = []

for i in range(0, len(newsgroups_all.data)):
    processed_docs.append(preprocess(newsgroups_all.data[i]))
    
len(processed_docs)

9850

In [11]:
dictionary = gensim.corpora.Dictionary(processed_docs)

len(dictionary)

39350

In [12]:
index = 0

for key, value in dictionary.iteritems():
    print(key, value)
    index +=1
    if index > 9:
        break

0 avail
1 cursor
2 hint
3 key
4 keyboard
5 mous
6 thank
7 welcom
8 xpert
9 copi


In [13]:
dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=100000 )

len(dictionary)

5868
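To see why the vocabulary shrinks from 39350 to 5868 entries, it may help to spell out the filtering rule. The sketch below is a simplified, standard-library approximation of what `filter_extremes` does with `no_below`, `no_above` and `keep_n`; the function name `filter_extremes_sketch` and the toy documents are hypothetical, and the real method also reassigns token ids, which is not shown here.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above, keep_n=None):
    """Return the set of tokens that survive document-frequency filtering."""
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # no_above is a fraction of the corpus size
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Optionally keep only the keep_n tokens with the highest document frequency
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

toy_docs = [
    ["cursor", "keyboard", "thank"],
    ["widget", "copi", "thank"],
    ["game", "team", "thank"],
    ["game", "player", "cursor"],
]
kept = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['cursor', 'game']
```

In the toy corpus, "thank" is dropped because it appears in more than half of the documents, and the tokens that appear only once are dropped because they fall below `no_below`.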

In [14]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs ]

len(bow_corpus)

9850


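Each document in the bag-of-words corpus is a list of `(token_id, count)` pairs; the first two documents look like this:

```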
[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(6, 1),
  (9, 2),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 2),
  (15, 1),
  (16, 1),
  (17, 1)],
  ....
```

In [15]:
bow_corpus[0]

[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]

In [16]:
bow_doc = bow_corpus[0]

for i in range(len(bow_doc)):
    print(f"Key {bow_doc[i][0]} =\"{dictionary[bow_doc[i][0]]}\":\
    occurrences={bow_doc[i][1]}")

Key 0 ="avail":    occurrences=1
Key 1 ="cursor":    occurrences=2
Key 2 ="hint":    occurrences=1
Key 3 ="key":    occurrences=1
Key 4 ="keyboard":    occurrences=1
Key 5 ="mous":    occurrences=1
Key 6 ="thank":    occurrences=1
Key 7 ="welcom":    occurrences=1
Key 8 ="xpert":    occurrences=1
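The bag-of-words conversion itself is simple enough to sketch by hand. The following standard-library version is an approximation for illustration only: `build_token2id` and `doc_to_bow` are hypothetical helpers, not gensim functions, and the id assignment (sorted tokens, document by document) is an assumption chosen to reproduce the ids printed above.

```python
from collections import Counter

def build_token2id(docs):
    """Assign integer ids to tokens, document by document, in sorted order."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# The preprocessed first document from the notebook
doc = ['xpert', 'cursor', 'keyboard', 'cursor', 'key', 'mous', 'avail', 'hint', 'welcom', 'thank']
token2id = build_token2id([doc])
bow = doc_to_bow(doc, token2id)
print(bow)
# [(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]
```

Word order is discarded: the only information kept per document is which tokens occur and how often, which is why "cursor" appears once with a count of 2.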


In [17]:
#  Initialize id2word to the dictionary where each word stem is mapped to a unique ID
id2word = dictionary

# Create the corpus with word frequencies
corpus = bow_corpus

# Build the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=1000,
                                           passes=10,
                                           alpha='symmetric',
                                           iterations=100,
                                           per_word_topics=True)

# Output all topics and for each of them print out its index and
# the most informative words identified
for index, topic in lda_model.print_topics(-1):
    print(f"Topic: {index} \nWords: {topic}")

Topic: 0 
Words: 0.021*"encrypt" + 0.018*"secur" + 0.018*"chip" + 0.016*"govern" + 0.013*"clipper" + 0.012*"public" + 0.010*"privaci" + 0.010*"key" + 0.010*"phone" + 0.009*"algorithm"
Topic: 1 
Words: 0.017*"appear" + 0.014*"copi" + 0.013*"cover" + 0.013*"star" + 0.013*"book" + 0.011*"penalti" + 0.010*"black" + 0.009*"comic" + 0.008*"blue" + 0.008*"green"
Topic: 2 
Words: 0.031*"window" + 0.015*"server" + 0.012*"program" + 0.012*"file" + 0.012*"applic" + 0.012*"display" + 0.011*"widget" + 0.010*"version" + 0.010*"motif" + 0.010*"support"
Topic: 3 
Words: 0.015*"space" + 0.007*"launch" + 0.007*"year" + 0.007*"medic" + 0.006*"patient" + 0.006*"orbit" + 0.006*"research" + 0.006*"diseas" + 0.005*"develop" + 0.005*"nasa"
Topic: 4 
Words: 0.018*"armenian" + 0.011*"peopl" + 0.008*"kill" + 0.008*"said" + 0.007*"turkish" + 0.006*"muslim" + 0.006*"jew" + 0.006*"govern" + 0.005*"state" + 0.005*"greek"
Topic: 5 
Words: 0.024*"price" + 0.021*"sale" + 0.020*"offer" + 0.017*"drive" + 0.017*"sell" + 0

## Meaning of LDA Model Params

https://stackoverflow.com/questions/50805556/understanding-parameters-in-gensim-lda-model

The [gensim `LdaModel` documentation][1] describes every parameter in full. The main training parameters are:

 - `random_state` - this serves as a seed (in case you wanted to repeat exactly the training process)

 -  `chunksize` - number of documents to consider at once (affects the memory consumption)

 -  [`update_every`][2] - update the model every `update_every` `chunksize` chunks (essentially, this is for memory consumption optimization)

 - `passes` - how many times the algorithm is supposed to pass over the whole corpus

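 - `alpha` - the prior on each document's topic distribution. The model above uses `'symmetric'`, the default, which is a fixed symmetric prior of `1.0 / num_topics`. To cite the documentation on the alternatives:

    > can be set to an explicit array = prior of your choice. It also
    > support special values of ‘asymmetric’ and ‘auto’: the former uses a
    > fixed normalized asymmetric 1.0/topicno prior, the latter learns an
    > asymmetric prior directly from your data.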

 - `per_word_topics` - setting this to `True` makes the model also return, for each word in a document, the topics sorted from most to least likely, together with phi values (the word's per-topic weights scaled by its count in the document). `minimum_phi_value` is a related parameter: it sets a lower bound on the phi value below which a topic is not reported for a word.
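The effect of `alpha` is easier to see by sampling from the prior directly. The sketch below uses only the standard library and is not how gensim trains; the helper names `sample_symmetric_dirichlet` and `mean_largest_share` are hypothetical. It compares a small symmetric prior (`1.0 / num_topics`, as used above) with a deliberately large one.

```python
import random

def sample_symmetric_dirichlet(alpha, num_topics, rng):
    """Draw a topic mixture from a symmetric Dirichlet(alpha) prior."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, num_topics=10, n_samples=500, seed=100):
    """Average weight of the single largest topic across sampled mixtures."""
    rng = random.Random(seed)
    total = sum(max(sample_symmetric_dirichlet(alpha, num_topics, rng))
                for _ in range(n_samples))
    return total / n_samples

num_topics = 10
symmetric_alpha = 1.0 / num_topics  # the value that 'symmetric' corresponds to
small = mean_largest_share(symmetric_alpha, num_topics)
large = mean_largest_share(10.0, num_topics)
print("alpha = 0.1:", round(small, 2))
print("alpha = 10 :", round(large, 2))
```

With a small `alpha`, a sampled document tends to be dominated by one or two topics; with a large `alpha`, the weight is spread almost evenly across all ten. The prior therefore encodes how focused you expect each document to be.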

Optimal training process parameters are described particularly well in [M. Hoffman et al., Online Learning for Latent Dirichlet Allocation][3].

For memory optimization of the training process or the model see [this blog post][4].


  [1]: https://radimrehurek.com/gensim/models/ldamodel.html
  [2]: https://groups.google.com/forum/#!topic/gensim/ojySenxQHi4
  [3]: https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation
  [4]: https://miningthedetails.com/blog/python/lda/GensimLDA/

for topic_list in lda_model[bow_corpus]:
    print(topic_list)
        
''' 
([(1, 0.1004754), (2, 0.8267648)], [(0, [2]), (1, [2]), (2, [2]), (3, [2]), (4, [2]), (5, [2]), (6, [2]), (7, [2]), (8, [1])], [(0, [(2, 0.99994975)]), (1, [(2, 1.9998709)]), (2, [(2, 0.999867)]), (3, [(2, 0.99986845)]), (4, [(2, 0.9999427)]), (5, [(2, 0.99766123)]), (6, [(2, 0.9959857)]), (7, [(2, 0.998597)]), (8, [(1, 0.99861616)])])
([(2, 0.46435428), (6, 0.47409445)], [(6, [6, 2]), (9, [6, 2]), (10, [6]), (11, [2, 6]), (12, [2, 6]), (13, [2, 6]), (14, [6, 2]), (15, [2, 6]), (16, [2, 6]), (17, [2])], [(6, [(2, 0.4898241), (6, 0.5101444)]), (9, [(2, 0.575622), (6, 1.42426)]), (10, [(6, 0.99996334)]), (11, [(2, 0.6880579), (6, 0.31189892)]), (12, [(2, 0.78450835), (6, 0.21544383)]), (13, [(2, 0.64989084), (6, 0.35005748)]), (14, [(2, 0.39420748), (6, 1.6055375)]), (15, [(2, 0.64683723), (6, 0.35311702)]), (16, [(2, 0.7038416), (6, 0.29598135)]), (17, [(2, 0.99997586)])])
...
...
...

Printing only topic_list[0] for each document gives

[(1, 0.1004752), (2, 0.8267664)]
[(2, 0.4643499), (6, 0.47409886)]
[(7, 0.4230467), (8, 0.4000204), (9, 0.17062153)]

where each tuple has the form

(topic_id, topic_contribution)

'''
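Because the model was built with `per_word_topics=True`, every element of `lda_model[bow_corpus]` is a 3-tuple rather than a plain list. The sketch below unpacks one such element without needing the trained model: the values are copied (and shortened) from the output above, while the helper names `dominant_topic` and `words_by_topic` are hypothetical.

```python
# One element of lda_model[bow_corpus], copied and shortened from the output above
doc_topics = [(1, 0.1004754), (2, 0.8267648)]   # (topic_id, contribution)
word_topics = [(0, [2]), (1, [2]), (8, [1])]    # (word_id, topics sorted by likelihood)
word_phis = [(0, [(2, 0.99994975)]), (1, [(2, 1.9998709)]), (8, [(1, 0.99861616)])]
topic_list = (doc_topics, word_topics, word_phis)

def dominant_topic(doc_topics):
    """Return the (topic_id, contribution) pair with the largest contribution."""
    return max(doc_topics, key=lambda pair: pair[1])

def words_by_topic(word_topics):
    """Group word ids by their single most likely topic."""
    grouped = {}
    for word_id, topics in word_topics:
        if topics:  # a word may have no topic above the phi threshold
            grouped.setdefault(topics[0], []).append(word_id)
    return grouped

doc_topics, word_topics, word_phis = topic_list
print(dominant_topic(doc_topics))   # (2, 0.8267648)
print(words_by_topic(word_topics))  # {2: [0, 1], 1: [8]}
```

Note that the phi value for word id 1 is close to 2 rather than 1: that word ("cursor") occurs twice in the document, and phi values are scaled by the word count.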



In [30]:
def get_topic_param(ldamodel, corpus, texts):
    """
    Function to extract the dominant topic, its contribution, keywords and a snippet of the tokenized text 
    for each document in the corpus.

    Parameters:
    ldamodel (gensim.models.ldamodel.LdaModel): The trained LDA model.
    corpus (list of list of (int, int)): The bag-of-words corpus. Each document is represented 
                                         as a list of (word id, word count) tuples.
    texts (list of list of str): The tokenized documents, in the same order as the corpus.

    Returns:
    main_topic (dict of int): The dominant topic for each document, keyed by document index.
    percentage (dict of float): The contribution of the dominant topic in each document, as a fraction between 0 and 1.
    keywords (dict of str): The top five keywords for the dominant topic in each document.
    text_snippets (dict of list of str): The first eight tokens of each document.
    """
    
    # Initialize dictionaries to hold the results
    main_topic = {}  # to hold the dominant topic for each document
    percentage = {}  # to hold the percentage contribution of the dominant topic in each document
    keywords = {}  # to hold the keywords for the dominant topic in each document
    text_snippets = {}  # to hold a snippet of the original text for each document
    
    # Iterate over all the documents in the corpus
    for i, topic_list in enumerate(ldamodel[corpus]):
        # With per_word_topics=True the model returns a 3-tuple; the first
        # element is the document's topic distribution
        topic = topic_list[0] if ldamodel.per_word_topics else topic_list
        if not topic:
            continue  # no topic was reported for this document

        # The dominant topic is the one with the largest contribution
        topic_num, topic_contribution = max(topic, key=lambda x: x[1])

        # Get the top five keywords for the dominant topic
        wp = ldamodel.show_topic(topic_num, topn=5)
        topic_keywords = ", ".join(word for word, prop in wp)

        # Store the results in the dictionaries
        main_topic[i] = int(topic_num)
        percentage[i] = round(topic_contribution, 4)
        keywords[i] = topic_keywords
        text_snippets[i] = texts[i][:8]  # first eight tokens of the document
    
    # Return the dictionaries
    return main_topic, percentage, keywords, text_snippets


In [31]:
main_topic, percentage, keywords, text_snippets   = get_topic_param(lda_model, bow_corpus, processed_docs )

In [33]:
indexes = list(range(10))

rows = []

rows.append(['ID', 'Main Topic', 'Contribution', 'Keywords', 'Snippet'])

for idx in indexes:
    # No trailing newline in any cell, so that each document stays on one row
    rows.append([str(idx),
                 f"{main_topic.get(idx)}",
                 f"{percentage.get(idx):.4}",
                 f"{keywords.get(idx)}",
                 f"{text_snippets.get(idx)}"])

columns = zip(*rows)

# The width of each column is the length of its longest cell
column_width = [max(len(item) for item in col) for col in columns]

for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width=column_width[i]) for i in range(len(row))))

 ID  Main Topic  Contribution  Keywords                               Snippet
 0   2           0.8268        window, server, program, file, applic  ['xpert', 'cursor', 'keyboard', 'cursor', 'key', 'mous', 'avail', 'hint']
 1   6           0.4742        mail, list, file, inform, send         ['obtain', 'copi', 'open', 'look', 'widget', 'obtain', 'need', 'order']
 2   7           0.4231        like, know, time, look, think          ['right', 'signal', 'strong', 'live', 'west', 'philadelphia', 'perfect', 'sport']
 3   8           0.4159        game, team, play, year, player         ['canadian', 'thing', 'coach', 'boston', 'bruin', 'colorado', 'rocki', 'summari']
 4   9           0.9039        peopl, think, like, time, right        ['heck', 'feel', 'like', 'time', 'includ', 'cafeteria', 'work', 'half']
 5   7           0.6291        like, know, time,

## pyLDAvis

In [None]:
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary = lda_model.id2word )
vis

* Each bubble represents a topic. The larger the bubble, the larger the share of the corpus that is about that topic.

* Blue bars represent the overall frequency of each word in the corpus. If no topic is selected, the blue bars of the most frequently used words will be displayed.

* Red bars give the estimated number of times a given term was generated by the selected topic.

* The farther apart two bubbles are, the more the two topics differ.
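The bubble sizes can be approximated by hand. The sketch below assumes that a topic's size reflects its share of all tokens in the corpus, computed by weighting each document's topic contributions by the document's length. That assumption, the function name `topic_prevalence` and the toy numbers are illustrative and are not taken from the pyLDAvis source.

```python
def topic_prevalence(doc_topic_weights, doc_lengths, num_topics):
    """Estimate each topic's share of all tokens in the corpus."""
    token_mass = [0.0] * num_topics
    for weights, length in zip(doc_topic_weights, doc_lengths):
        for topic_id, weight in weights:
            # A document contributes its length times the topic's weight in it
            token_mass[topic_id] += weight * length
    total = sum(token_mass)
    return [mass / total for mass in token_mass]

# Hypothetical document-topic weights in the (topic_id, weight) form shown earlier
doc_topic_weights = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.5), (1, 0.5)],
    [(1, 1.0)],
]
doc_lengths = [10, 20, 10]  # number of tokens in each toy document
shares = topic_prevalence(doc_topic_weights, doc_lengths, num_topics=2)
print([round(s, 3) for s in shares])  # [0.475, 0.525]
```

Weighting by document length matters: a long document that is half about a topic can contribute more tokens to it than a short document that is almost entirely about it.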