
Models
Models are algorithms that learn patterns from data. In the context of Gensim and topic modeling, a model learns to identify topics within a corpus of text data.

Gensim provides implementations of several popular topic modeling algorithms, such as Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI).

Topic modeling is a technique to extract hidden topics from large volumes of text.

What is Topic Modeling?
Topic modeling is a method for identifying latent themes or subjects in large amounts of text data. It analyzes the words in the documents to find patterns and groups similar documents based on their content.

It is widely used in many fields, including banking, healthcare, marketing, and social media analysis. By analyzing and grouping the words in a text corpus, topic modeling can surface important topics and patterns that would take people a long time to find by reading.



In [1]:
!python --version

Python 3.12.7


In [2]:
from collections.abc import Mapping

In [3]:
!pip install --upgrade gensim==3.8
from collections.abc import Mapping


Collecting gensim==3.8
  Using cached gensim-3.8.0-cp312-cp312-win_amd64.whl
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 4.3.3
    Uninstalling gensim-4.3.3:
      Successfully uninstalled gensim-4.3.3
Successfully installed gensim-3.8.0


In [4]:
pip install --upgrade gensim

Collecting gensim
  Using cached gensim-4.3.3-cp312-cp312-win_amd64.whl.metadata (8.2 kB)
Using cached gensim-4.3.3-cp312-cp312-win_amd64.whl (24.0 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.8.0
    Uninstalling gensim-3.8.0:
      Successfully uninstalled gensim-3.8.0
Successfully installed gensim-4.3.3
Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install --upgrade collections

Note: you may need to restart the kernel to use updated packages.


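ERROR: Could not find a version that satisfies the requirement collections (from versions: none)
ERROR: No matching distribution found for collections

This error is expected: collections is part of the Python standard library, so it cannot be installed or upgraded with pip.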


In [6]:
pip show gensim

Name: gensim
Version: 4.3.3
Summary: Python framework for fast Vector Space Modelling
Home-page: https://radimrehurek.com/gensim/
Author: Radim Rehurek
Author-email: me@radimrehurek.com
License: LGPL-2.1-only
Location: C:\Users\Welcome\anaconda3\Lib\site-packages
Requires: numpy, scipy, smart-open
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [7]:
from gensim.corpora import Dictionary

In [8]:
from collections.abc import Mapping
import gensim
import gensim.corpora as corpora

In [9]:
from gensim.corpora import Dictionary

# Create a corpus from a list of documents
documents = [['this', 'is', 'a', 'document'], ['this', 'is', 'another', 'document']]
dictionary = Dictionary(documents)
corpus = [dictionary.doc2bow(document) for document in documents]

In [10]:
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1)], [(1, 1), (2, 1), (3, 1), (4, 1)]]

In [11]:
from gensim.models import LdaModel

# Train an LDA topic model on a corpus
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)

# Get the topic distribution for a new document
new_document = ['this', 'is', 'a', 'new', 'document']
new_document_bow = dictionary.doc2bow(new_document)
new_document_topics = lda_model[new_document_bow]

In [12]:
new_document_topics

[(0, 0.020000024),
 (1, 0.020000024),
 (2, 0.81999975),
 (3, 0.020000024),
 (4, 0.020000024),
 (5, 0.020000024),
 (6, 0.020000024),
 (7, 0.020000024),
 (8, 0.020000024),
 (9, 0.020000024)]

Preparing Text Data for Topic Modeling
Topic modeling allows us to uncover hidden patterns and themes within the text.  It can be applied to a wide range of text data, including customer feedback, social media posts, news articles, and scientific publications.

Stop words are very common words, such as "the" and "is", that carry little topical meaning. They can be removed from the text data to reduce noise and improve the accuracy of the topic modeling results.

Low-frequency terms are words that infrequently appear in the text data and may not be useful for analysis. These words can be removed from the document-term matrix to reduce noise and improve the accuracy of the topic modeling results.

In [13]:
import gensim
from gensim import corpora
from nltk.corpus import stopwords

# Sample documents
documents = ["This is the first document.", "This is the second document.", "This is the third document."]

# Create a dictionary from the documents
dictionary = corpora.Dictionary([doc.split() for doc in documents])

# Remove stopwords from the dictionary
stop_words = set(stopwords.words('english'))
dictionary.filter_tokens(bad_ids=[dictionary.token2id[stopword] for stopword in stop_words if stopword in dictionary.token2id])

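# Filter by document frequency: no_below=1 keeps every token found in at least
# one document; the default no_above=0.5 drops tokens found in more than half of them
dictionary.filter_extremes(no_below=1)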

print(dictionary)

Dictionary<3 unique tokens: ['first', 'second', 'third']>


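After removing the stop words, we filter the dictionary by document frequency using the filter_extremes() method. We set the no_below parameter to 1, which keeps every term that appears in at least one document, so no rare terms are dropped here; a higher value such as no_below=2 would keep only terms that appear in at least two documents. The filtering in this example comes from the default no_above=0.5, which removes terms that appear in more than half of the documents: 'This' (capitalized, so it does not match the lowercase stop word list) and 'document.' occur in all three documents and are dropped.

Finally, we print the resulting dictionary to verify which terms remain.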

Creating a Bag of Words Model
Creating a bag-of-words (BoW) model is another important step in preparing text data for topic modeling. A BoW model is a simple way to represent text data as a collection of words and their frequency counts.

To create a BoW model using Gensim, we first need to create a corpus object from the tokenized documents. A corpus is a collection of documents represented as a list of lists, where each inner list contains the tokens for a single document.

Once we have the tokenized documents, we can create a BoW model using a corpora.Dictionary object like the one we created earlier. The doc2bow() method of the dictionary converts each document to a BoW representation, which is a list of tuples containing the word id and its frequency count in the document.

In [14]:


# Sample tokenized documents
documents = [["apple", "banana", "orange"], ["orange", "juice"], ["banana", "apple", "juice", "orange"]]

# Create a dictionary from the documents
dictionary = corpora.Dictionary(documents)   # maps each unique token to an integer id

# Create a corpus from the tokenized documents
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Print the BoW representation of each document
print(corpus[0])
print(corpus[1])
print(corpus[2])

[(0, 1), (1, 1), (2, 1)]
[(2, 1), (3, 1)]
[(0, 1), (1, 1), (2, 1), (3, 1)]


In this example, we start by defining a list of tokenized documents. We then create a dictionary object from the documents using the corpora.Dictionary() constructor.

Next, we create a corpus object by applying the doc2bow() method of the dictionary to each document in the list of tokenized documents. This creates a BoW representation for each document in the corpus.

Finally, we print the BoW representation of each document in the corpus using the print() function. Each output line is a list of tuples, where each tuple contains the word id and its frequency count in the document.
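
To make the id and count pairs less opaque, here is a simplified pure-Python sketch of the same idea. It is not Gensim's implementation; the to_bow helper and the token2id mapping are hypothetical names used only for illustration.

```python
from collections import Counter

documents = [
    ["apple", "banana", "orange"],
    ["orange", "juice"],
    ["banana", "apple", "juice", "orange"],
]

token2id = {}  # hypothetical stand-in for the dictionary's token-to-id mapping


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, adding unseen tokens to token2id."""
    counts = Counter(tokens)
    # Unseen tokens get the next free id, in alphabetical order within the document.
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    return sorted((token2id[token], count) for token, count in counts.items())


corpus = [to_bow(doc, token2id) for doc in documents]
for bow in corpus:
    print(bow)
```

Each tuple pairs a token id with the number of times that token occurs in the document; word order is discarded, which is what makes this a "bag" of words.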

Fundamentals of Topic Modeling with Gensim
Topic modeling is a powerful tool for extracting insights and understanding complex datasets. It is a technique used to extract the underlying topics from large volumes of text automatically. It can be applied to various scenarios, such as text classification and trend detection. 

The challenge with topic modeling is extracting high-quality topics that are clear, well separated, and meaningful. This depends heavily on text preprocessing and finding the optimal number of topics.

In this guide, we will explore the fundamentals of topic modeling with Gensim, including the key concepts and techniques used to create accurate and effective models.

Understanding LSA and LDA
Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are two popular techniques for topic modeling. 

LSA uses singular value decomposition to identify patterns in the relationships between terms and concepts in unstructured text data. It then creates a lower-dimensional representation of the documents and terms, which allows for easier comparison and clustering. 
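
The decomposition step can be sketched with NumPy alone. The term-document counts below are made up for illustration, and keeping k=2 dimensions is an arbitrary assumption; Gensim's own LSI implementation works differently internally.

```python
import numpy as np

# Rows are terms, columns are documents (hypothetical counts).
terms = ["cat", "dog", "pet", "stock", "market"]
term_doc = np.array([
    [1, 0, 1, 0],   # cat
    [1, 1, 0, 0],   # dog
    [1, 1, 1, 0],   # pet
    [0, 0, 0, 2],   # stock
    [0, 0, 0, 1],   # market
], dtype=float)

# Singular value decomposition: term_doc = U @ diag(singular_values) @ Vt
U, singular_values, Vt = np.linalg.svd(term_doc, full_matrices=False)

# Keep only the k strongest latent dimensions.
k = 2
term_vectors = U[:, :k] * singular_values[:k]      # one k-dimensional row per term
doc_vectors = Vt[:k, :].T * singular_values[:k]    # one k-dimensional row per document

# Rank-k approximation of the original matrix
approx = U[:, :k] @ np.diag(singular_values[:k]) @ Vt[:k, :]

print(term_vectors.shape, doc_vectors.shape)
```

Documents and terms that co-occur end up close together in the reduced space, which is what makes comparison and clustering easier.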

LDA, on the other hand, is a generative probabilistic model that assumes each document is a mixture of various topics and each word in the document is attributable to one of the document’s topics. It then infers the topic distribution of each document and the word distribution of each topic, enabling the identification of topics within the document corpus.


In this example, we first define a sample corpus of three documents. We then create a dictionary from the corpus and convert the corpus into a bag-of-words representation using the doc2bow function.

Finally, we build the LDA model using the LdaModel function, specifying the number of topics and the number of passes to make over the corpus. We then print the topics and the associated words, which will be displayed in descending order of relevance. 

This example demonstrates the simplicity and power of Gensim's interface for implementing LDA and exploring the topics within a corpus.

Creating a Gensim Dictionary

A Gensim dictionary is a mapping between words and their integer IDs. It is used to create a bag-of-words representation of text documents for use in topic modeling. 

Creating a Gensim dictionary is a crucial step in building a topic model with Gensim. The dictionary maps terms to their corresponding numerical IDs and can also filter out unwanted terms, such as stop words or rare words.

Here are a few different ways to create a Gensim dictionary:

1. From a list of documents: One of the most common ways to create a dictionary is from a list of documents. Here's an example:

In [19]:
from gensim import corpora

# List of documents
documents = [["apple", "banana", "orange", "pear", "peach"],
             ["car", "truck", "bike", "motorcycle", "bus"],
             ["cat", "dog", "bird", "fish", "lizard"]]

# Create the dictionary
dictionary = corpora.Dictionary(documents)

# Print the dictionary
print(dictionary)


Dictionary<15 unique tokens: ['apple', 'banana', 'orange', 'peach', 'pear']...>


In this example, we create a dictionary from a list of three documents. The Dictionary constructor automatically assigns a unique integer ID to each term and returns a dictionary object.

2. From a pre-existing dictionary: If you already have a pre-existing dictionary, you can load it into Gensim using the load_from_text function:

Dictionary.load_from_text() only reads files that were written with dictionary.save_as_text(). A dictionary saved with dictionary.save() is a binary file and must be loaded with Dictionary.load() instead.

Option 1: the file is a binary .dict file created with save():

dictionary = corpora.Dictionary.load(r'C:\Users\Welcome\Documents\dic.dict')

Option 2: the file is a .txt file created with save_as_text():

dictionary = corpora.Dictionary.load_from_text(r'C:\Users\Welcome\Documents\dic.txt')

To tell the two apart, open the file. If it contains tab-separated lines of the form <token_id> <token> <doc_freq>, for example

0  the   10
1  car    5
2  road   3

then it is a text dictionary. If it is unreadable or binary, it was saved using dictionary.save(), and load_from_text() should not be used.

Loading any other kind of file with load_from_text() fails. For example, the error

ValueError: invalid line in dictionary file C:\\Users\Welcome\Documents\dic.txt: car = {

means that dic.txt is not in the format expected by Dictionary.load_from_text(). A file whose content looks like

car = {
  'key': value
}

is a Python object or JSON-like text, not a Gensim dictionary at all. In that case, either parse the file yourself (for example with json.load() if it is valid JSON) or regenerate the dictionary from your tokenized text.

If you still have the original tokenized corpus, you can simply rebuild the dictionary:

In [6]:
from gensim import corpora

texts = [['car', 'road', 'drive'], ['bus', 'road', 'stop']]  # your tokenized texts
dictionary = corpora.Dictionary(texts)
dictionary.save_as_text(r'C:\Users\Welcome\Documents\dictt.txt')  # human-readable text format
dictionary.save('dic.dict')  # binary format, preferred for reuse

# Reload the binary file with Dictionary.load()
dictionary = corpora.Dictionary.load('dic.dict')

In [7]:
from gensim import corpora

# Load the dictionary from the text file written by save_as_text()
dictionary = corpora.Dictionary.load_from_text(r'C:\Users\Welcome\Documents\dictt.txt')

# Print the dictionary
print(dictionary)

Dictionary<5 unique tokens: ['bus', 'car', 'drive', 'road', 'stop']>



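3. From a Gensim corpus: If you have already created a Gensim corpus, you can rebuild a dictionary from it using the corpora.Dictionary.from_corpus method. A bag-of-words corpus stores only token ids, so unless you pass an id2word mapping, the rebuilt dictionary uses the ids themselves (as strings) as its tokens: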

In [56]:
from gensim import corpora

# Build a bag-of-words corpus from a list of tokenized documents
documents = [["apple", "banana", "orange", "pear", "peach"],
             ["car", "truck", "bike", "motorcycle", "bus"],
             ["cat", "dog", "bird", "fish", "lizard"]]
source_dictionary = corpora.Dictionary(documents)
corpus = [source_dictionary.doc2bow(doc) for doc in documents]

# Rebuild a dictionary from the corpus alone (no id2word mapping supplied)
dictionary = corpora.Dictionary.from_corpus(corpus, id2word=None)

# Print the dictionary
print(dictionary)

Dictionary<15 unique tokens: ['0', '1', '2', '3', '4']...>


4. From a DataFrame:

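Once the documents are in a pandas DataFrame, you can tokenize a text column and create a Dictionary from the result. The cells below first write a small sample CSV file and read it back, then load it into a DataFrame.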




In [58]:
import csv

# Data to write into the CSV file
data = [
    ["Name", "City", "Profession"],
    ["Alice", "New York", "Engineer"],
    ["Bob", "London", "Artist"],
    ["Charlie", "Paris", "Chef"]
]

In [59]:
# Writing to a CSV file
with open("string_data.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerows(data)

print("CSV file 'string_data.csv' created successfully!")

CSV file 'string_data.csv' created successfully!


In [60]:
# Reading the CSV file
with open("string_data.csv", mode="r", encoding="utf-8") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)




['Name', 'City', 'Profession']
['Alice', 'New York', 'Engineer']
['Bob', 'London', 'Artist']
['Charlie', 'Paris', 'Chef']


In [67]:
import pandas as pd
from gensim import corpora

# Load the sample CSV written above into a DataFrame
df = pd.read_csv('string_data.csv')

# Tokenize the "Name" column (the first header in the sample data)
tokenized_docs = df['Name'].apply(lambda x: x.split())

# Create the dictionary from the tokenized documents
my_dict = corpora.Dictionary(tokenized_docs)

In [68]:
my_dict

<gensim.corpora.dictionary.Dictionary at 0x1778756a990>

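The same pattern applies to any CSV file with a text column. For example, for a file named my_docs.csv with a column named text:

import pandas as pd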
from gensim import corpora

# Load data into a DataFrame
df = pd.read_csv('my_docs.csv')

# Tokenize the documents in the DataFrame
tokenized_docs = df['text'].apply(lambda x: x.split())

# Create the dictionary from the tokenized documents
my_dict = corpora.Dictionary(tokenized_docs)