What is LdaMulticore in Gensim?
LdaMulticore is a parallelized version of LdaModel in Gensim that trains Latent Dirichlet Allocation (LDA) topic models faster using multiple CPU cores.

It’s ideal for large text corpora where training with LdaModel would be too slow.

✅ When to Use:
You have a large number of documents (thousands+)

You want to speed up topic modeling

You have a multi-core CPU
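
As a rough illustration of the points above, the sketch below shows how LdaMulticore might be called on a small tokenized corpus. The helper names default_workers and train_lda, the toy documents in the trailing comment, and the parameter values are assumptions made for illustration and do not come from this notebook. It assumes gensim is installed.

```python
import os


def default_workers():
    """Leave one core free for the main process (hypothetical helper)."""
    return max(1, (os.cpu_count() or 2) - 1)


def train_lda(texts, num_topics=2, workers=None):
    """Train an LDA model on tokenized documents using several CPU cores."""
    # Imported lazily so default_workers() is usable without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    dictionary = Dictionary(texts)                    # token -> integer id
    corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vectors
    return LdaMulticore(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        workers=workers or default_workers(),
        passes=5,
        random_state=42,
    )


# Example call with toy data. When running as a script on Windows, place it
# under `if __name__ == "__main__":` because worker processes are spawned.
#
#     texts = [["cat", "dog", "pet"], ["stock", "market", "trade"],
#              ["dog", "pet", "vet"], ["market", "price", "trade"]]
#     model = train_lda(texts)
#     print(model.print_topics())
```

On a corpus this small the parallel version brings no benefit; the speed-up only shows on large corpora, as noted above.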

8.2 Topic Modelling using LDA
LDA is a popular topic modelling method that treats each document as a mixture of topics in certain proportions. The goal is to extract good-quality topics, meaning topics that are well separated and meaningful. Topic quality depends on:

The quality of text processing
Finding the optimal number of topics
Tuning parameters of the algorithm

Prepare the Data 
The data is prepared by removing stopwords and then lemmatizing the remaining words. To lemmatize with Gensim, we first need to install the pattern package and download the NLTK stopwords.
Let's install the pattern package and import the nltk library.

In [1]:
pip install pattern

Collecting pattern
  Using cached pattern-0.0.1a0-py3-none-any.whl.metadata (1.7 kB)
Using cached pattern-0.0.1a0-py3-none-any.whl (4.0 kB)
Installing collected packages: pattern
Successfully installed pattern-0.0.1a0
Note: you may need to restart the kernel to use updated packages.


In [1]:
import nltk 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Welcome\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Next, we pre-process the data by removing stopwords and lemmatizing the remaining words.

In [2]:
import gensim
from gensim import corpora
from gensim.models import LdaModel, LdaMulticore
import gensim.downloader as api
from gensim.utils import simple_preprocess, lemmatize
import nltk
from nltk.corpus import stopwords
import re
import logging

logging.basicConfig(format ='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level = logging.INFO)
stop_words = stopwords.words('english')
stop_words = stop_words + ['subject', 'com', 'are', 'edu', 'would', 'could']

dataset = api.load("text8")
data = [w for w in dataset]

# Preparing the data
processed_data = []
for x, doc in enumerate(data[:100]):
    doc_out = []
    for word in doc:
        if word not in stop_words: 
            Lemmatized_Word = lemmatize(word, allowed_tags = re.compile('(NN|JJ|RB)'))  # lemmatize
            if Lemmatized_Word:
                doc_out.append(Lemmatized_Word[0].split(b'/')[0].decode('utf-8'))
        else:
            continue
    processed_data.append(doc_out) 
print(processed_data[0][:10])

ImportError: cannot import name 'lemmatize' from 'gensim.utils' (C:\Users\Welcome\anaconda3\envs\myenv\lib\site-packages\gensim\utils.py)

You're seeing this error:


ImportError: cannot import name 'lemmatize' from 'gensim.utils'
❌ Problem:
lemmatize is no longer part of gensim.utils. It was removed in Gensim 4.0, so the import fails on any recent version.
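
A quick way to confirm this in your own environment is to check whether the module still exposes the name. The helper below, module_has (a hypothetical name), is a minimal sketch that uses only the standard library. It returns False when the module cannot be imported or lacks the attribute.

```python
import importlib


def module_has(module_name, attr):
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


# Prints False if gensim is not installed or no longer provides lemmatize.
print(module_has("gensim.utils", "lemmatize"))
```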

Use spaCy for lemmatization instead.
Let me show you how to correctly preprocess and lemmatize your data using spaCy, keeping everything else the same.

✅ Step-by-Step Fix with spaCy
1. 📦 Install required packages:

In [5]:
!pip install spacy
!python -m spacy download en_core_web_sm


Collecting spacy
  Downloading spacy-3.8.7-cp310-cp310-win_amd64.whl.metadata (28 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Downloading spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading murmurhash-1.0.13-cp310-cp310-win_amd64.whl.metadata (2.2 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading cymem-2.0.11-cp310-cp310-win_amd64.whl.metadata (8.8 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading preshed-3.0.10-cp310-cp310-win_amd64.whl.metadata (2.5 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy)
  Downloading thinc-8.3.6-cp310-cp310-win_amd64.whl.metadata (15 kB)
Collecting wasabi<1.2.0,>=0.9.1 (from spacy)
  Downloading wasabi-1.1.3-py3-none-any.whl.metadata (28 kB)
Collecting srsly<3.0.0,>=2.4.3 (from spacy)
  Downloading srsly-2.5.1-cp310-cp310-win_amd6

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.2.6 which is incompatible.
scipy 1.10.1 requires numpy<1.27.0,>=1.19.5, but you have numpy 2.2.6 which is incompatible.



Successfully installed annotated-types-0.7.0 blis-1.3.0 catalogue-2.0.10 cloudpathlib-0.21.1 confection-0.1.5 cymem-2.0.11 langcodes-3.5.0 language-data-1.3.0 marisa-trie-1.2.1 markdown-it-py-3.0.0 mdurl-0.1.2 murmurhash-1.0.13 numpy-2.2.6 preshed-3.0.10 pydantic-2.11.7 pydantic-core-2.33.2 rich-14.0.0 shellingham-1.5.4 spacy-3.8.7 spacy-legacy-3.0.12 spacy-loggers-1.0.5 srsly-2.5.1 thinc-8.3.6 typer-0.16.0 typing-inspection-0.4.1 wasabi-1.1.3 weasel-0.4.1
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ----------------------------- ---------- 9.4/12.8 MB 45.2 MB/s eta 0:00:01
     --------------------------------------- 12.8/12.8 MB 44.6 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
[+] Download and installation s

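The resolver error in the log above reports that gensim 4.3.3 requires numpy<2.0 while numpy 2.2.6 was installed. A small standard-library check along these lines can flag that mismatch before gensim is imported. The helper names are hypothetical and the version parsing is deliberately simple: it only reads the leading numeric components.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '2.2.6' into (2, 2, 6); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_below(version, upper_bound):
    """True if `version` is strictly lower than `upper_bound`."""
    return version_tuple(version) < version_tuple(upper_bound)


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


numpy_version = installed_version("numpy")
if numpy_version is None:
    print("numpy is not installed")
elif is_below(numpy_version, "2.0"):
    print(f"numpy {numpy_version} satisfies numpy<2.0")
else:
    print(f"numpy {numpy_version} is too new for the numpy<2.0 requirement")
```

If the check reports a version that is too new, reinstalling with a pin such as pip install "numpy<2.0" is one possible way to satisfy the requirement quoted in the log.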
2. Update your preprocessing code:
Replace the old lemmatize(...) logic with this:



In [7]:
import gensim
from gensim import corpora
from gensim.models import LdaMulticore
import gensim.downloader as api
import logging
import nltk
import spacy

# Setup
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english') + ['subject', 'com', 'edu', 'would', 'could']
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(logging.INFO)

# Load text8 data
dataset = api.load("text8")
data = dataset

# Group words into pseudo-documents
docs = [data[i:i+100] for i in range(0, 10000, 100)]

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Preprocess with lemmatization
def spacy_lemmatize(texts):
    processed = []
    for doc in texts:
        tokens = nlp(" ".join(doc))
        processed_doc = [token.lemma_ for token in tokens 
                         if not token.is_stop and token.is_alpha and token.lemma_ not in stop_words]
        processed.append(processed_doc)
    return processed

processed_data = spacy_lemmatize(docs)
print(processed_data[0][:10])  # Show first 10 lemmatized words of the first doc


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Welcome\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


TypeError: 'Dataset' object is not subscriptable
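
This error arises because the object returned by api.load("text8") can be iterated over but does not support indexing or slicing. The sketch below reproduces that behaviour with a stand-in class, IterOnlyDataset (hypothetical, not Gensim's actual class), and shows two standard ways to take a prefix: itertools.islice, or converting to a list first.

```python
from itertools import islice


class IterOnlyDataset:
    """Hypothetical stand-in: iterable, but defines no __getitem__."""

    def __init__(self, documents):
        self._documents = documents

    def __iter__(self):
        return iter(self._documents)


dataset = IterOnlyDataset([["a", "b"], ["c", "d"], ["e", "f"]])

try:
    dataset[0:2]                      # slicing needs __getitem__
except TypeError as err:
    print(err)

first_two = list(islice(dataset, 2))  # lazily take the first two items
everything = list(dataset)            # or materialize the whole iterable
print(first_two)
print(everything[0:2])
```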

In [9]:
import gensim
from gensim import corpora
from gensim.models import LdaMulticore
import gensim.downloader as api
import logging
import nltk
import spacy

# Setup
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english') + ['subject', 'com', 'edu', 'would', 'could']
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(logging.INFO)

# Load text8 data
dataset = api.load("text8")  # Iterable of documents; each document is a list of word tokens
data = list(dataset)  # List of documents (a list of lists), not a flat list of words

# Group words into pseudo-documents
docs = [data[i:i+100] for i in range(0, 10000, 100)]  # Caution: each slice holds up to 100 documents, not 100 words

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Preprocess with lemmatization
def spacy_lemmatize(texts):
    processed = []
    for doc in texts:
        joined = " ".join(doc)  # Fails here: doc is a list of token lists, not a list of strings
        tokens = nlp(joined)
        processed_doc = [token.lemma_ for token in tokens 
                         if not token.is_stop and token.is_alpha and token.lemma_ not in stop_words]
        processed.append(processed_doc)
    return processed

# Run preprocessing
processed_data = spacy_lemmatize(docs)
print(processed_data[0][:10])  # Show sample lemmatized output


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Welcome\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


TypeError: sequence item 0: expected str instance, list found

In [12]:
import gensim
from gensim import corpora
from gensim.models import LdaMulticore
import gensim.downloader as api
import logging
import nltk
import spacy

# Setup
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english') + ['subject', 'com', 'edu', 'would', 'could']

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(logging.INFO)

# Load text8 and prepare documents
dataset = api.load("text8")
data = list(dataset)  # Expected a flat list of words; it is actually a list of documents
print(type(data[0]))  # Prints <class 'list'>: each item is a tokenized document

# Split into pseudo-documents
# Increase doc size to get more valid tokens per doc
docs = [data[i:i+500] for i in range(0, 50000, 500)]  # 100 slices of up to 500 documents each, not 500 words


print(type(docs[0][0]))  # Also prints <class 'list'>, so each slice holds documents, not words
# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Preprocess (lemmatize + filter)
def spacy_lemmatize(texts):
    processed = []
    for doc in texts:
        if not all(isinstance(w, str) for w in doc):
            continue  # skip malformed docs
        tokens = nlp(" ".join(doc))
        processed_doc = [token.lemma_ for token in tokens 
                         if not token.is_stop and token.is_alpha and token.lemma_ not in stop_words]
        processed.append(processed_doc)
    return processed

# Run it
processed_data = spacy_lemmatize(docs)
print(processed_data[0][:10])


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Welcome\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<class 'list'>
<class 'list'>
[]
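
The two <class 'list'> lines show that each element of data, and each element of a slice of data, is itself a list. One plausible reading of the final [] (an interpretation, since the notebook does not explain it) is this: slices that contain documents fail the isinstance check and are skipped, while slices taken past the end of data are empty, pass all() vacuously, and yield empty results. The toy data below is hypothetical and only mirrors that shape.

```python
# Toy stand-in for list(dataset): three documents, each a list of words.
data = [["alpha", "beta"], ["gamma", "delta"], ["epsilon", "zeta"]]

# Slice in steps of 2 over a range that is longer than the data.
docs = [data[i:i + 2] for i in range(0, 8, 2)]

processed = []
for doc in docs:
    if not all(isinstance(w, str) for w in doc):
        continue              # slices holding documents are skipped here
    processed.append(doc)     # only the empty slices get this far

print(len(docs))      # number of slices, including the empty ones
print(processed)      # every surviving entry is an empty list
```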


In [15]:
import gensim
from gensim import corpora
from gensim.models import LdaMulticore
import gensim.downloader as api
import logging
import nltk
import spacy

# Setup logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(logging.INFO)

# Download NLTK stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Load the dataset
dataset = api.load("text8")  # Iterable of tokenized documents
data = list(dataset)         # List of documents (each one a list of words)

# Group into larger pseudo-documents (intended: 500 words per doc)
docs = [data[i:i+500] for i in range(0, 50000, 500)]  # Actually 100 slices of documents, not words

# Preprocessing function
def spacy_lemmatize(texts):
    processed = []
    for doc in texts:
        joined = " ".join(doc)
        tokens = nlp(joined)
        # ❗️Only remove short, numeric, and custom stopwords
        doc_lemmas = [token.lemma_.lower() for token in tokens 
                      if len(token.text) > 2 and token.lemma_.isalpha() 
                      and token.lemma_.lower() not in stop_words]
        processed.append(doc_lemmas)
    return processed

# Run preprocessing
processed_data = spacy_lemmatize(docs)

# Show output
print(f"\n✅ Total processed documents: {len(processed_data)}")
print(f"📄 Sample processed doc (first 10 words):\n{processed_data[0][:10]}")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Welcome\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


TypeError: sequence item 0: expected str instance, list found

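Reason for the error above
The error:


TypeError: sequence item 0: expected str instance, list found
means that " ".join(doc) received a doc that is a list of lists (a nested list) rather than a list of strings.

❌ What went wrong:
The code used:

data = [w for w in dataset]  # A list of documents; each w is itself a list of words
docs = [data[i:i+100] for i in range(0, 10000, 100)]  # Each slice is a list of documents, not of words
When text8 is loaded through gensim.downloader, iterating over dataset yields one document at a time, and each document is already a list of tokens (this is why the type checks above print <class 'list'>). Slicing data therefore groups whole documents, so every doc passed to " ".join() is a list of lists. To build pseudo-documents with a fixed number of words, first flatten the documents into one list of words and then slice that list.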
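
One way to get fixed-size pseudo-documents, sketched under the assumption that each item of the dataset is a list of word strings, is to flatten the documents into a single word stream and then cut that stream into chunks. The function name chunk_words and the toy data are hypothetical.

```python
from itertools import chain, islice


def chunk_words(documents, words_per_doc, max_docs=None):
    """Flatten tokenized documents and regroup them into fixed-size chunks."""
    words = chain.from_iterable(documents)   # lazy stream of single words
    chunks = []
    while max_docs is None or len(chunks) < max_docs:
        chunk = list(islice(words, words_per_doc))
        if not chunk:
            break                            # word stream exhausted
        chunks.append(chunk)
    return chunks


# Toy stand-in for the dataset: each item is a list of words.
documents = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
docs = chunk_words(documents, words_per_doc=4, max_docs=2)

print(docs)
print(" ".join(docs[0]))   # each chunk is a flat list of strings, so join works
```

With the real dataset this could be called as chunk_words(dataset, 500, max_docs=100). Because the word stream is consumed lazily, reading stops once 100 chunks have been collected.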

