# News Modeling

Topic modeling involves **extracting features from document terms** and using mathematical techniques such as matrix factorization and SVD to generate **clusters or groups of terms** that are distinguishable from each other. These clusters of words form topics or concepts.

Topic modeling is a method for **unsupervised classification** of documents, similar to clustering on numeric data.

These topics can be used to interpret the main **themes** of a corpus and to make **semantic connections among words that co-occur** frequently across documents.

Topic modeling can help in the following areas:
- discovering the **hidden themes** in the collection
- **classifying** the documents into the discovered themes
- using the classification to **organize/summarize/search** the documents

Frameworks and algorithms to build topic models:
- Latent semantic indexing
- Latent Dirichlet allocation
- Non-negative matrix factorization

## Latent Dirichlet Allocation (LDA)
Latent Dirichlet allocation (LDA) is a **generative probabilistic model** in which each **document is assumed to be a mixture of topics**, similar to a probabilistic latent semantic indexing model.

In simple terms, the idea behind LDA is twofold:
- each **document** can be described by a **distribution of topics**
- each **topic** can be described by a **distribution of words**
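
The two distributions above can be illustrated with a toy generative sketch. The topic names, vocabularies, and probabilities below are made-up values for illustration only, and the distributions are fixed by hand, whereas LDA proper draws them from Dirichlet priors.

```python
import random

# Hypothetical topic-word distributions: each topic is a distribution over words.
topic_word = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "space": {"shuttle": 0.5, "launch": 0.3, "orbit": 0.2},
}

# Hypothetical document-topic distribution: the document is a mixture of topics.
doc_topic = {"autos": 0.7, "space": 0.3}


def generate_document(doc_topic, topic_word, n_words, rng):
    """Draw n_words by first picking a topic, then a word from that topic."""
    words = []
    topics = list(doc_topic)
    topic_weights = [doc_topic[t] for t in topics]
    for _ in range(n_words):
        topic = rng.choices(topics, weights=topic_weights, k=1)[0]
        vocab = list(topic_word[topic])
        word_weights = [topic_word[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


rng = random.Random(0)
sample_doc = generate_document(doc_topic, topic_word, n_words=8, rng=rng)
print(sample_doc)
```

Training a topic model runs this process in reverse: given only the words, it estimates the two distributions that could plausibly have produced them.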

### LDA Algorithm

1. For each document, **randomly assign each word to one of the K topics** (K is chosen beforehand)
2. For each document D, go through each word W and compute:
    - **P(T | D)**, the proportion of words in D that are assigned to topic T
    - **P(W | T)**, the proportion of assignments to topic T, over all documents, that come from the word W
3. **Reassign word W to topic T** with probability P(T | D) × P(W | T), considering all other words and their topic assignments
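
The steps above can be sketched as a tiny Gibbs-style sampler in plain Python. This is a teaching sketch under stated assumptions, not the routine used later in this notebook: the function name `toy_lda`, the smoothing constants `alpha` and `beta`, and the three sample documents are all hypothetical, and gensim's `LdaModel` fits the model with online variational Bayes rather than by sampling.

```python
import random
from collections import defaultdict


def toy_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Tiny Gibbs-style sampler that follows the three steps described above."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]                # words in D assigned to T
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times W is assigned to T
    topic_total = [0] * num_topics                              # all words assigned to T

    # Step 1: randomly assign every word to one of the K topics.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            t = rng.randrange(num_topics)
            doc_assign.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment so only the other words count.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # Step 2: unnormalized P(T|D) * P(W|T) per topic, smoothed by alpha and beta.
                weights = []
                for t in range(num_topics):
                    p_t_given_d = doc_topic[d][t] + alpha
                    p_w_given_t = (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size)
                    weights.append(p_t_given_d * p_w_given_t)

                # Step 3: reassign the word by sampling a topic from those weights.
                t_new = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return assignments, doc_topic


docs = [
    ["car", "engine", "door", "car"],
    ["shuttle", "launch", "orbit", "launch"],
    ["car", "door", "engine"],
]
assignments, doc_topic = toy_lda(docs, num_topics=2)
print(doc_topic)  # per-document counts of words assigned to each topic
```

The smoothing terms keep every topic reachable even when its current count is zero; without them a topic that loses all its words could never be chosen again.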

![LDA](https://raw.githubusercontent.com/subashgandyer/datasets/main/images/LDA.png)

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Stop words removal
    - Email removal
    - Non-alphabetic words removal
    - Tokenize
    - Lowercase
    - BiGrams & TriGrams
    - Lemmatization
- Create a dictionary for the document
- Filter low frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Evaluate the Topic Model
    - Model Perplexity
    - Topic Coherence
- Visualize the topics

### Install the necessary library

In [1]:
!pip install pandas



In [2]:
!pip install pyLDAvis gensim spacy


### Import the libraries

In [3]:
import nltk

# Call the downloader from Python; a leading "!" would send the line to the shell instead
nltk.download('stopwords')


In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim import corpora
from gensim.models import LdaModel
from gensim.models import CoherenceModel
from gensim.models.phrases import Phrases, Phraser
import re
import spacy
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import matplotlib.pyplot as plt

### Download the dataset
Dataset: https://raw.githubusercontent.com/subashgandyer/datasets/main/newsgroups.json

#### 20-Newsgroups dataset
- 11K newsgroups posts
- 20 news topics

### Load the dataset

In [5]:
import pandas as pd

# Path to the JSON file
file_path = "newsgroups.json"  # Replace this with the actual path to your JSON file

# Load JSON data into a pandas DataFrame
df = pd.read_json(file_path)


# Display the DataFrame
print(df.head())

                                             content  target  \
0  From: lerxst@wam.umd.edu (where's my thing)\nS...       7   
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...       4   
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...       4   
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...       1   
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...      14   

            target_names  
0              rec.autos  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3          comp.graphics  
4              sci.space  




In [6]:
# 1. Display the first few rows
print("First few rows of the DataFrame:")
print(df.head())

# 2. Check the shape
print("\nShape of the DataFrame:", df.shape)

# 3. Check column names
print("\nColumn names:")
print(df.columns)

# 4. Check data types
print("\nData types of columns:")
print(df.dtypes)

# 5. Check for missing values
print("\nNumber of missing values in each column:")
print(df.isnull().sum())

# 6. Summary statistics
print("\nSummary statistics for numeric columns:")
print(df.describe())

# 7. Unique values
print("\nNumber of unique values in each column:")
print(df.nunique())


First few rows of the DataFrame:
                                             content  target  \
0  From: lerxst@wam.umd.edu (where's my thing)\nS...       7   
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...       4   
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...       4   
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...       1   
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...      14   

            target_names  
0              rec.autos  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3          comp.graphics  
4              sci.space  

Shape of the DataFrame: (11314, 3)

Column names:
Index(['content', 'target', 'target_names'], dtype='object')

Data types of columns:
content         object
target           int64
target_names    object
dtype: object

Number of missing values in each column:
content         0
target          0
target_names    0
dtype: int64

Summary statistics for numeric columns:
             target
count  11314.000000
mean       



### Preprocess the data

### Email Removal

In [7]:
import re

# Function to remove email addresses from text
def remove_emails(text):
    # Regular expression pattern to match email addresses
    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    # Replace email addresses with an empty string
    text_without_emails = re.sub(email_pattern, '', text)
    return text_without_emails

# Remove email addresses from the 'content' column
df['content'] = df['content'].apply(remove_emails)




### Newline Removal

In [8]:
# Function to remove newline characters from text
def remove_newlines(text):
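    # Replace newline characters with an empty string.
    # Note: this joins the words on either side of a line break (e.g. "eduorganization");
    # replacing with ' ' instead would keep them separate.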
    text_without_newlines = text.replace('\n', '')
    return text_without_newlines

# Remove newline characters from the 'content' column
df['content'] = df['content'].apply(remove_newlines)




### Single Quotes Removal

In [9]:
# Function to remove single quotes from text
def remove_single_quotes(text):
    # Replace single quotes with an empty string
    text_without_single_quotes = text.replace("'", "")
    return text_without_single_quotes

# Remove single quotes from the 'content' column
df['content'] = df['content'].apply(remove_single_quotes)



### Tokenize
- Create **sent_to_words()**
    - Use **gensim.utils.simple_preprocess**
    - Use a **generator** instead of a regular function

In [10]:
from gensim.utils import simple_preprocess

# Generator function to tokenize text into words
def sent_to_words(sentences):
    for sentence in sentences:
        # Tokenize each sentence into words
        yield simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; the tokenizer itself drops punctuation

# Tokenize the 'content' column and create a new column 'tokens'
df['tokens'] = list(sent_to_words(df['content']))

# Display the first few rows of the DataFrame with the tokenized text
print(df[['content', 'tokens']].head())




                                             content  \
0  From:  (wheres my thing)Subject: WHAT car is t...   
1  From:  (Guy Kuo)Subject: SI Clock Poll - Final...   
2  From:  (Thomas E Willis)Subject: PB questions....   
3  From: jgreen@amber (Joe Green)Subject: Re: Wei...   
4  From:  (Jonathan McDowell)Subject: Re: Shuttle...   

                                              tokens  
0  [from, wheres, my, thing, subject, what, car, ...  
1  [from, guy, kuo, subject, si, clock, poll, fin...  
2  [from, thomas, willis, subject, pb, questions,...  
3  [from, jgreen, amber, joe, green, subject, re,...  
4  [from, jonathan, mcdowell, subject, re, shuttl...  




### Stop words Removal
- Extend the stop words corpus with the following words
    - from
    - subject
    - re
    - edu
    - use

1. We import the simple_preprocess() function from gensim.utils and the STOPWORDS corpus from gensim.parsing.preprocessing.
2. We define a set of additional stop words to be added to the existing stop words corpus.
3. We extend the existing stop words corpus (STOPWORDS) with the additional stop words.
4. We define a function remove_stopwords() that removes stop words from a list of tokens.
5. We apply the remove_stopwords() function to the 'tokens' column of the DataFrame to create a new column 'tokens_no_stopwords' containing the tokenized text without stop words.
6. Finally, we display the first few rows of the DataFrame with the 'tokens' and 'tokens_no_stopwords' columns.

In [11]:
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

# Define additional stop words to be added to the existing stop words corpus
additional_stopwords = {'from', 'subject', 're', 'edu', 'use'}

# Extend the existing stop words corpus with additional stop words
STOPWORDS |= additional_stopwords

# Function to remove stop words from tokenized text
def remove_stopwords(tokens):
    return [word for word in tokens if word not in STOPWORDS]

# Remove stop words from the 'tokens' column and create a new column 'tokens_no_stopwords'
df['tokens_no_stopwords'] = df['tokens'].apply(remove_stopwords)

# Display the first few rows of the DataFrame with the tokenized text without stop words
print(df[['tokens', 'tokens_no_stopwords']].head())




                                              tokens  \
0  [from, wheres, my, thing, subject, what, car, ...   
1  [from, guy, kuo, subject, si, clock, poll, fin...   
2  [from, thomas, willis, subject, pb, questions,...   
3  [from, jgreen, amber, joe, green, subject, re,...   
4  [from, jonathan, mcdowell, subject, re, shuttl...   

                                 tokens_no_stopwords  
0  [wheres, thing, car, nntp, posting, host, rac,...  
1  [guy, kuo, si, clock, poll, final, callsummary...  
2  [thomas, willis, pb, questions, organization, ...  
3  [jgreen, amber, joe, green, weitek, organizati...  
4  [jonathan, mcdowell, shuttle, launch, smithson...  


#### remove_stopwords( )

In [12]:
def remove_stopwords(texts):
    # Batch version: takes a list of token lists and filters each one against STOPWORDS
    return [[word for word in doc if word not in STOPWORDS] for doc in texts]


### Bigrams
- Use **gensim.models.Phrases**
- Use 100 as the threshold

1. We import the Phrases model from gensim.models.
2. We create a Phrases model, bigram_model, from the tokenized text (df['tokens_no_stopwords']), with min_count=1 so that no word or pair is ignored for being too rare, and threshold=100 so that only candidate pairs scoring above 100 are joined into bigrams.
3. We apply the bigram model to the tokenized text using the bigram_model[] syntax, and store the result in a new column 'tokens_bigrams' in the DataFrame.
4. Finally, we display the first few rows of the DataFrame with the original tokenized text and the tokenized text with bigrams.

In [13]:
from gensim.models import Phrases

# Create bigrams from the tokenized text
bigram_model = Phrases(df['tokens_no_stopwords'], min_count=1, threshold=100)

# Apply the bigram model to the tokenized text
df['tokens_bigrams'] = list(bigram_model[df['tokens_no_stopwords']])

# Display the first few rows of the DataFrame with the tokenized text and bigrams
print(df[['tokens_no_stopwords', 'tokens_bigrams']].head())




                                 tokens_no_stopwords  \
0  [wheres, thing, car, nntp, posting, host, rac,...   
1  [guy, kuo, si, clock, poll, final, callsummary...   
2  [thomas, willis, pb, questions, organization, ...   
3  [jgreen, amber, joe, green, weitek, organizati...   
4  [jonathan, mcdowell, shuttle, launch, smithson...   

                                      tokens_bigrams  
0  [wheres, thing, car, nntp_posting, host_rac, w...  
1  [guy_kuo, si, clock, poll, final, callsummary,...  
2  [thomas, willis, pb, questions, organization, ...  
3  [jgreen_amber, joe, green, weitek, organizatio...  
4  [jonathan_mcdowell, shuttle_launch, smithsonia...  
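
The score that `threshold` is compared against can be sketched in plain Python. This follows the formula documented for gensim's default scorer, `(count(a, b) - min_count) * N / (count(a) * count(b))`. The helper name `bigram_scores` and the sample documents are hypothetical, and N is taken here as the number of distinct words, which is an assumption: gensim's internal vocabulary size may be counted differently, so the absolute values will not match.

```python
from collections import Counter


def bigram_scores(token_lists, min_count=1):
    """Score adjacent word pairs with (count(a,b) - min_count) * N / (count(a) * count(b))."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs only
    vocab_size = len(unigrams)  # N: number of distinct words (assumption, see above)
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


docs = [
    ["space", "shuttle", "launch", "today"],
    ["space", "shuttle", "crew", "today"],
    ["space", "shuttle", "launch", "delayed"],
    ["new", "car", "today"],
]
scores = bigram_scores(docs)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

Pairs that appear together often relative to how common each word is on its own score highest, which is why a high threshold keeps only strongly associated pairs.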


#### make_bigrams( )

In [14]:
def make_bigrams(texts):
    # Apply the trained bigram_model to each tokenized document
    return [bigram_model[doc] for doc in texts]


### Lemmatization
- Use spacy
    - Download spacy en model (if you have not done that before)
    - Load the spacy model

In [15]:
!python -m spacy download en_core_web_sm




In [16]:
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Function to perform lemmatization on text
def lemmatize_text(text):
    doc = nlp(text)
    lemmatized_text = [token.lemma_ for token in doc]
    return lemmatized_text

# Apply lemmatization to the 'tokens_bigrams' column and create a new column 'tokens_lemmatized'
df['tokens_lemmatized'] = df['tokens_bigrams'].apply(lambda x: lemmatize_text(" ".join(x)))

# Display the first few rows of the DataFrame with the lemmatized text
print(df[['tokens_bigrams', 'tokens_lemmatized']].head())




                                      tokens_bigrams  \
0  [wheres, thing, car, nntp_posting, host_rac, w...   
1  [guy_kuo, si, clock, poll, final, callsummary,...   
2  [thomas, willis, pb, questions, organization, ...   
3  [jgreen_amber, joe, green, weitek, organizatio...   
4  [jonathan_mcdowell, shuttle_launch, smithsonia...   

                                   tokens_lemmatized  
0  [where, s, thing, car, nntp_poste, host_rac, w...  
1  [guy_kuo, si, clock, poll, final, callsummary,...  
2  [thomas, willis, pb, question, organization, p...  
3  [jgreen_amber, joe, green, weitek, organizatio...  
4  [jonathan_mcdowell, shuttle_launch, smithsonia...  


#### lemmatization( )

In [17]:
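def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Lemmatize each document, keeping only tokens whose POS tag is in allowed_postags.

    Tag set reference: https://spacy.io/api/annotation
    """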
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out



In [18]:
data_lemmatized = lemmatization(df['tokens_bigrams'], allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])



In [19]:
print(data_lemmatized[:1])

[['thing', 'car', 'nntp_poste', 'host_rac', 'wam_umd', 'eduorganization', 'university', 'parkline', 'wondering_enlighten', 'car', 'sawthe', 'day', 'door_sport', 'car', 'look', 'late_early', 'door', 'small', 'addition_bumper', 'separate', 'rest', 'body', 'know', 'tellme_model', 'engine', 'spec', 'yearsof', 'production', 'car', 'history', 'info', 'youhave', 'funky_looke', 'car', 'mail', 'thank', 'lerxst']]




### Create a Dictionary

In [20]:
from gensim import corpora

# Create a dictionary from the lemmatized data
dictionary = corpora.Dictionary(data_lemmatized)

# Print the dictionary
print(dictionary)




Dictionary<122634 unique tokens: ['addition_bumper', 'body', 'car', 'day', 'door']...>


### Create Corpus

In [21]:
# Create a corpus from the lemmatized data
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]

# Print the first few elements of the corpus
print(corpus[:1])




[[(0, 1), (1, 1), (2, 5), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]]
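
Each pair in the output above is a `(token_id, count)` entry for one document. A minimal plain-Python sketch of the same bag-of-words idea follows. The helper names `build_token2id` and `to_bow` are hypothetical, and ids are assigned in order of first appearance here, which is an assumption that differs from the id order gensim produces.

```python
from collections import Counter


def build_token2id(token_lists):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for tokens in token_lists:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the mapping."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


docs = [["car", "door", "car", "engine"], ["engine", "launch"]]
token2id = build_token2id(docs)
bow = to_bow(docs[0], token2id)
print(bow)  # [(0, 2), (1, 1), (2, 1)]
```

Word order is discarded in this representation; only which tokens occur, and how often, is passed on to the topic model.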


### Filter low-frequency words

In [22]:
# Filter out tokens that appear in fewer than 5 documents or in more than 50% of the documents
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Re-create the corpus after filtering
corpus = [dictionary.doc2bow(text) for text in data_lemmatized]

# Print the first few elements of the filtered corpus
print(corpus[:1])




[[(0, 1), (1, 5), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)]]
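
The filtering rule can be sketched in plain Python as a document-frequency check. The helper name `filter_by_document_frequency` and the sample documents are hypothetical, and the sketch covers only the `no_below` and `no_above` conditions, not the cap on total vocabulary size that gensim's `filter_extremes` also applies through its `keep_n` parameter.

```python
def filter_by_document_frequency(token_lists, no_below=5, no_above=0.5):
    """Keep tokens found in at least no_below documents and at most no_above of all documents."""
    num_docs = len(token_lists)
    doc_freq = {}
    for tokens in token_lists:
        for token in set(tokens):  # count each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {
        token
        for token, df in doc_freq.items()
        if no_below <= df <= no_above * num_docs
    }


docs = [["car", "engine"], ["car", "door"], ["car", "engine"], ["launch", "engine"]]
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['car', 'engine']
```

Very rare tokens are usually typos or names that cannot support a topic, while very common tokens appear everywhere and so do little to tell topics apart.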


### Create an index-to-word dictionary

In [23]:
# Create index to word dictionary
id2word = {v: k for k, v in dictionary.token2id.items()}

# Print the first few elements of the index to word dictionary
print(dict(list(id2word.items())[:10]))

{0: 'body', 1: 'car', 2: 'day', 3: 'door', 4: 'eduorganization', 5: 'engine', 6: 'history', 7: 'info', 8: 'know', 9: 'late_early'}




### Build a News Topic Model

#### LdaModel
- **num_topics**: the number of topics, which you need to define beforehand
- **chunksize**: the number of documents to be used in each training chunk
- **alpha**: the hyperparameter that affects the sparsity of the per-document topic distributions
- **passes**: the total number of training passes over the corpus

1. We import the necessary libraries, including LdaModel from Gensim and visualization tools like pyLDAvis.
2. We define the parameters for the LDA model, including the number of topics, chunk size, alpha value, and number of passes.
3. We train the LDA model on the corpus using the LdaModel class.
4. We print the topics generated by the model.
5. Optionally, we visualize the topics using pyLDAvis.

In [25]:
from gensim.models import LdaModel
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
from gensim.corpora import Dictionary

# Define LDA parameters
num_topics = 10  # Define the number of topics
chunksize = 1000  # Number of documents to be used in each training chunk
alpha = 'auto'  # Learn an asymmetric document-topic prior from the corpus
passes = 20  # Total number of training passes

# Build LDA model
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics,
                     chunksize=chunksize, alpha=alpha, passes=passes)

# Print the topics generated by the model
for topic_id, topic in lda_model.print_topics():
    print(f"Topic {topic_id}: {topic}")


# Create Dictionary object from corpus
dictionary = Dictionary.from_corpus(corpus, id2word=id2word)

# Visualize the topics
lda_display = gensimvis.prepare(lda_model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)




Topic 0: 0.021*"window" + 0.017*"key" + 0.014*"drive" + 0.014*"bit" + 0.013*"run" + 0.013*"card" + 0.011*"chip" + 0.011*"software" + 0.010*"problem" + 0.009*"machine"
Topic 1: 0.020*"think" + 0.018*"know" + 0.018*"people" + 0.018*"say" + 0.014*"go" + 0.014*"time" + 0.013*"come" + 0.012*"article" + 0.012*"thing" + 0.010*"s"
Topic 2: 0.029*"game" + 0.028*"team" + 0.023*"year" + 0.020*"play" + 0.015*"good" + 0.014*"win" + 0.014*"player" + 0.011*"season" + 0.008*"hockey" + 0.008*"well"
Topic 3: 0.017*"space" + 0.010*"year" + 0.009*"program" + 0.007*"new" + 0.006*"information" + 0.006*"research" + 0.005*"technology" + 0.005*"service" + 0.005*"datum" + 0.005*"include"
Topic 4: 0.879*"ax" + 0.060*"max" + 0.002*"ei" + 0.001*"rlk" + 0.001*"tm" + 0.001*"resistor" + 0.001*"pl_pl" + 0.001*"bj" + 0.001*"wm" + 0.001*"bhj_bhj"
Topic 5: 0.020*"article" + 0.019*"organization" + 0.019*"nntp_poste" + 0.018*"line" + 0.014*"host" + 0.013*"good" + 0.012*"know" + 0.012*"car" + 0.011*"need" + 0.011*"look"
Top

### Print the keywords in the 10 topics

In [26]:
# Print the keywords in each topic
topics = lda_model.show_topics(num_topics=num_topics, num_words=10)
for topic in topics:
    print("Topic", topic[0], ":", topic[1])

Topic 0 : 0.021*"window" + 0.017*"key" + 0.014*"drive" + 0.014*"bit" + 0.013*"run" + 0.013*"card" + 0.011*"chip" + 0.011*"software" + 0.010*"problem" + 0.009*"machine"
Topic 1 : 0.020*"think" + 0.018*"know" + 0.018*"people" + 0.018*"say" + 0.014*"go" + 0.014*"time" + 0.013*"come" + 0.012*"article" + 0.012*"thing" + 0.010*"s"
Topic 2 : 0.029*"game" + 0.028*"team" + 0.023*"year" + 0.020*"play" + 0.015*"good" + 0.014*"win" + 0.014*"player" + 0.011*"season" + 0.008*"hockey" + 0.008*"well"
Topic 3 : 0.017*"space" + 0.010*"year" + 0.009*"program" + 0.007*"new" + 0.006*"information" + 0.006*"research" + 0.005*"technology" + 0.005*"service" + 0.005*"datum" + 0.005*"include"
Topic 4 : 0.879*"ax" + 0.060*"max" + 0.002*"ei" + 0.001*"rlk" + 0.001*"tm" + 0.001*"resistor" + 0.001*"pl_pl" + 0.001*"bj" + 0.001*"wm" + 0.001*"bhj_bhj"
Topic 5 : 0.020*"article" + 0.019*"organization" + 0.019*"nntp_poste" + 0.018*"line" + 0.014*"host" + 0.013*"good" + 0.012*"know" + 0.012*"car" + 0.011*"need" + 0.011*"loo



## Evaluation of Topic Models
- Model Perplexity
- Topic Coherence

### Model Perplexity

Model perplexity measures **how well** a **probability distribution** or probability model **predicts a sample**; a lower perplexity indicates a better fit.

In [27]:
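# log_perplexity returns the per-word likelihood bound (a log-scale value),
# not the perplexity itself; perplexity is 2 ** (-bound)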
perplexity = lda_model.log_perplexity(corpus)
print("Perplexity:", perplexity)




Perplexity: -7.340101815294145
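
The value printed above is negative because `log_perplexity` reports a per-word likelihood bound rather than the perplexity itself. A short sketch of the conversion, using the `2 ** (-bound)` relationship given in the gensim documentation; the variable names are illustrative.

```python
# Per-word likelihood bound copied from the output above.
per_word_bound = -7.340101815294145

# Perplexity is 2 raised to the negative bound; lower values indicate a better fit.
perplexity_value = 2 ** (-per_word_bound)
print(f"Perplexity: {perplexity_value:.1f}")
```

Because the mapping is monotonic, comparing models by the bound (closer to zero is better) or by the converted perplexity (lower is better) gives the same ranking.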


### Topic Coherence
A topic coherence measure scores a single topic by the **degree of semantic similarity** between the **high-scoring words** in that topic.

In [33]:
# Use the gensim Dictionary built earlier; it provides the token-to-id mapping
# that CoherenceModel needs to match topic words against the texts

# Compute coherence score
coherence_model = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print("Coherence Score:", coherence_score)



Coherence Score: 0.6315583725846504
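
To make the idea of coherence concrete, here is a plain-Python sketch of a simpler measure than the `c_v` score computed above. It swaps in the UMass measure (available in gensim as `coherence='u_mass'`), which relies only on document co-occurrence counts, so its values are not comparable to the score printed above. The helper name `umass_coherence` and the sample documents are hypothetical, and the sketch assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, token_lists):
    """Sum log((D(wi, wj) + 1) / D(wi)) over word pairs, where D counts documents."""
    doc_sets = [set(tokens) for tokens in token_lists]

    def doc_count(*words):
        # Number of documents that contain all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        # The higher-ranked word of the pair supplies the denominator
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score


docs = [
    ["car", "engine", "door"],
    ["car", "engine"],
    ["launch", "orbit"],
    ["launch", "orbit", "car"],
]
coherent = umass_coherence(["car", "engine"], docs)
mixed = umass_coherence(["car", "orbit"], docs)
print(coherent > mixed)  # True
```

Words that tend to appear in the same documents push the score up, so a topic whose top words co-occur scores higher than one that mixes unrelated words.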


### Visualize the Topic Model
- Use **pyLDAvis**
    - designed to help users **interpret the topics** in a topic model that has been fit to a corpus of text data
    - extracts information from a fitted LDA topic model to inform an interactive web-based visualization

In [36]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

class MyDict:
    def __init__(self, id2word):
        self.id2word = id2word
        self.token2id = {v: k for k, v in id2word.items()}

    def __len__(self):
        return len(self.token2id)

# Create an instance of MyDict
my_dict = MyDict(id2word)

# Prepare the pyLDAvis visualization
lda_display = gensimvis.prepare(lda_model, corpus, my_dict)

# Display the visualization
pyLDAvis.display(lda_display)

