What is Gensim?
Gensim is a popular open-source natural language processing (NLP) library specialising in unsupervised topic modeling. Topic modeling is a technique to extract hidden topics from large volumes of text.

The Gensim library is designed to handle large amounts of text data and provide efficient and scalable algorithms for topic modeling and similarity detection. Releases up to 3.8 also shipped a text summarization module, which was removed in Gensim 4.0.

Gensim makes it easy to perform these tasks by providing efficient implementations of popular algorithms such as Latent Dirichlet Allocation (LDA).

If you have experience with the spaCy package, you will find Gensim simple to use in your natural language processing projects.

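To install Gensim, use the command below. The code in this article uses the Gensim 4.x API (for example, the vector_size argument of Word2Vec, which was named size in 3.x), so install a 4.x release:

pip install "gensim>=4.0"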

What is Topic Modeling?
Topic modeling is a method for identifying latent themes or topics in large amounts of text data. It entails analyzing the words in the documents to find patterns and grouping similar documents based on their content.

It is extensively used in many fields, including banking, healthcare, marketing, and social media analysis. By analyzing and grouping words in a text corpus, topic modeling can surface important topics and patterns that would take people a long time to notice by reading alone.

Gensim includes a set of topic modeling tools such as

Latent Semantic Analysis (LSA)
Latent Dirichlet Allocation (LDA)
Hierarchical Dirichlet Process (HDP)

These algorithms are intended to extract topics from a collection of text data and reveal underlying themes and patterns.

Why use Gensim for Topic Modeling?
Gensim has a number of benefits for topic modeling. Scalability is a significant one: it is built to manage large amounts of text data, making it well suited to analyzing very large datasets.

Furthermore, Gensim includes efficient text cleaning, preprocessing, and transformation methods, making deriving insights from raw text data more straightforward.

Aside from topic modeling, it can be used for similarity detection and document classification, and releases before 4.0 also included text summarization. Gensim's outputs are plain Python lists and NumPy arrays, so they can be passed on to other common machine learning frameworks like scikit-learn and TensorFlow.

It also offers fast implementations of well-known methods such as LDA and LSI, which makes it practical to experiment with topic modeling on real-world datasets.

Finally, Gensim has a user-friendly API and extensive documentation, making it accessible to users with varying experience levels.

Gensim Core Concepts
For a beginner in Natural Language Processing (NLP), understanding Gensim's core concepts is essential for comprehending and applying topic modeling techniques.

Documents
In Gensim, a document refers to a single text unit within a collection of texts. It could be a single sentence, a paragraph, or a whole book. To represent a document in Gensim, we usually use a list of words or tokens, where each token is a string representing a word in the text.

In [1]:
# Create a document as a list of words
document = ['this', 'is', 'a', 'document']

# Create a document as a string
document = 'This is a document.'

Corpus
A corpus is a collection of text documents. In Gensim, a corpus usually starts as a list of tokenized documents, each a list of words, and is then converted into bag-of-words vectors: lists of (token ID, count) pairs.

Before building a model, we must preprocess the text data by removing stopwords, punctuation, and other noise, and then convert the text into a numerical representation.

In [2]:
from gensim.corpora import Dictionary

# Create a corpus from a list of documents
documents = [['this', 'is', 'a', 'document'], ['this', 'is', 'another', 'document']]
dictionary = Dictionary(documents)
corpus = [dictionary.doc2bow(document) for document in documents]
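
To make the bag-of-words idea concrete, here is a plain-Python sketch of roughly what the dictionary and doc2bow steps compute. It illustrates the concept and is not Gensim's actual implementation. The names token2id_manual, to_bow, and manual_corpus are ours.

```python
from collections import Counter

documents = [['this', 'is', 'a', 'document'], ['this', 'is', 'another', 'document']]

# Give each previously unseen token the next free integer ID,
# visiting the tokens of each document in sorted order
token2id_manual = {}
for document in documents:
    for token in sorted(set(document)):
        token2id_manual.setdefault(token, len(token2id_manual))

def to_bow(document):
    """Return sorted (token_id, count) pairs for the tokens known to the vocabulary."""
    counts = Counter(token for token in document if token in token2id_manual)
    return sorted((token2id_manual[token], count) for token, count in counts.items())

manual_corpus = [to_bow(document) for document in documents]
print(token2id_manual)  # {'a': 0, 'document': 1, 'is': 2, 'this': 3, 'another': 4}
print(manual_corpus)    # [[(0, 1), (1, 1), (2, 1), (3, 1)], [(1, 1), (2, 1), (3, 1), (4, 1)]]
```

Each document becomes a sparse vector: only tokens that actually occur are listed, and tokens missing from the vocabulary are ignored.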

Vectors
A vector is a mathematical representation of a document or a word in a corpus. In Gensim, vectors are used to represent documents in numerical form. A vector is simply an ordered list of numbers that encodes information about the document it represents. 

Gensim provides several methods for generating document and word vectors. One popular method is the Word2Vec model, which learns word vectors by predicting the context in which a word appears in a corpus. 

In [4]:
from gensim.models import Word2Vec
import numpy as np

# Create a list of tokenized documents
documents = [['this', 'is', 'a', 'document'], ['this', 'is', 'another', 'document']]

# Train a Word2Vec model on the documents
model = Word2Vec(documents, vector_size=100, window=5, min_count=1)

# Get the vector for a word
word_vector = model.wv['document']

# Get the mean vector for a document
new_document = ['this', 'is', 'another', 'document']
document_vector = np.mean([model.wv[word] for word in new_document], axis=0)

In [6]:
word_vector

array([-5.3622725e-04,  2.3643136e-04,  5.1033497e-03,  9.0092728e-03,
       -9.3029495e-03, -7.1168090e-03,  6.4588725e-03,  8.9729885e-03,
       -5.0154282e-03, -3.7633716e-03,  7.3805046e-03, -1.5334714e-03,
       -4.5366134e-03,  6.5540518e-03, -4.8601604e-03, -1.8160177e-03,
        2.8765798e-03,  9.9187379e-04, -8.2852151e-03, -9.4488179e-03,
        7.3117660e-03,  5.0702621e-03,  6.7576934e-03,  7.6286553e-04,
        6.3508903e-03, -3.4053659e-03, -9.4640139e-04,  5.7685734e-03,
       -7.5216377e-03, -3.9361035e-03, -7.5115822e-03, -9.3004224e-04,
        9.5381187e-03, -7.3191668e-03, -2.3337686e-03, -1.9377411e-03,
        8.0774371e-03, -5.9308959e-03,  4.5162440e-05, -4.7537340e-03,
       -9.6035507e-03,  5.0072931e-03, -8.7595852e-03, -4.3918253e-03,
       -3.5099984e-05, -2.9618145e-04, -7.6612402e-03,  9.6147433e-03,
        4.9820580e-03,  9.2331432e-03, -8.1579173e-03,  4.4957981e-03,
       -4.1370760e-03,  8.2453608e-04,  8.4986202e-03, -4.4621765e-03,
      

In [5]:
document_vector

array([-0.00432601,  0.00406971,  0.00082073,  0.00285212,  0.00260904,
       -0.00250835,  0.00165858,  0.00615073, -0.0025268 , -0.00281055,
        0.00292883, -0.00209868,  0.00078521,  0.00207212,  0.00361077,
       -0.00070364,  0.00591563,  0.00327086, -0.00716529, -0.00208044,
        0.00106662, -0.00162748,  0.00522037, -0.00203441,  0.00282427,
       -0.00070741,  0.00217527,  0.00482545, -0.00344711,  0.00179091,
        0.00214926, -0.00413491, -0.0014018 , -0.00439473, -0.00079619,
        0.0020215 ,  0.00595606, -0.00062651,  0.00040572,  0.00335066,
       -0.00096744,  0.00122857, -0.00313262, -0.00018954,  0.00204034,
        0.00373947,  0.00064566,  0.00074721, -0.00025051,  0.00190524,
        0.00134328, -0.00189608, -0.00525017, -0.00329428, -0.00192758,
       -0.00053604,  0.00341957,  0.00121441, -0.00128411,  0.00227522,
       -0.00138613,  0.00118845,  0.00182446, -0.00309547, -0.00020932,
        0.00286489,  0.00342713,  0.00334012, -0.00112963,  0.00

In this example, we first create a list of tokenized documents and train a Word2Vec model on these documents. Then we get the vector for an individual word and compute the mean vector for an entire document by averaging the vectors for each word.
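
A common next step with document vectors is to compare them. The sketch below computes cosine similarity in plain Python. The three-dimensional vectors are hypothetical stand-ins for the 100-dimensional ones above, and the function accepts any sequence of numbers, including NumPy arrays.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical document vectors
doc_a = [0.2, 0.1, 0.4]
doc_b = [0.4, 0.2, 0.8]   # points in the same direction as doc_a
doc_c = [-0.1, 0.3, 0.0]

print(round(cosine_similarity(doc_a, doc_b), 3))  # 1.0
print(round(cosine_similarity(doc_a, doc_c), 3))
```

Averaged word vectors ignore word order, so treat the resulting similarity as a rough signal rather than a precise measure.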

Models
Models are algorithms that learn patterns from data. In the context of Gensim and topic modeling, models learn to identify topics within a corpus of text data.

Gensim provides implementations of several popular topic modeling algorithms, such as Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI).

In [7]:
from gensim.models import LdaModel

# Train an LDA topic model on a corpus
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)

# Get the topic distribution for a new document
new_document = ['this', 'is', 'a', 'new', 'document']
new_document_bow = dictionary.doc2bow(new_document)
new_document_topics = lda_model[new_document_bow]

In [8]:
new_document_bow 

[(0, 1), (1, 1), (2, 1), (3, 1)]

In [9]:
new_document_topics

[(0, 0.020000024),
 (1, 0.020000024),
 (2, 0.020005062),
 (3, 0.020000024),
 (4, 0.020000024),
 (5, 0.020000024),
 (6, 0.020000024),
 (7, 0.8199947),
 (8, 0.020000024),
 (9, 0.020000024)]
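
The result is a list of (topic ID, probability) pairs. As a small sketch of how one might read it, the snippet below ranks the pairs copied from the output above and keeps the topics over a threshold. The 0.1 cut-off is an arbitrary assumption. Since LDA training is randomized, the exact numbers will likely differ from run to run.

```python
# Topic distribution copied from the output above: (topic_id, probability) pairs
new_document_topics = [
    (0, 0.020000024), (1, 0.020000024), (2, 0.020005062), (3, 0.020000024),
    (4, 0.020000024), (5, 0.020000024), (6, 0.020000024), (7, 0.8199947),
    (8, 0.020000024), (9, 0.020000024),
]

# Sort topics from most to least probable
ranked = sorted(new_document_topics, key=lambda pair: pair[1], reverse=True)
dominant_topic, dominant_prob = ranked[0]

# Keep only the topics above a threshold (0.1 is arbitrary here)
significant = [(topic, prob) for topic, prob in ranked if prob >= 0.1]

print(dominant_topic)  # 7
print(significant)     # [(7, 0.8199947)]
print(round(sum(prob for _, prob in new_document_topics), 3))  # 1.0
```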

Preparing Text Data for Topic Modeling
Topic modeling allows us to uncover hidden patterns and themes within the text.  It can be applied to a wide range of text data, including customer feedback, social media posts, news articles, and scientific publications.

However, before we can begin topic modeling, it’s important to prepare our text data properly. This involves several steps:

Cleaning the text
Removing stop words and punctuation
Tokenizing the text into individual words or phrases
Converting the text into a numerical representation
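
The first three steps can be sketched in a few lines of plain Python. The stop word list and sample texts below are small hand-picked assumptions for this sketch; real projects usually rely on a fuller list such as NLTK's. The final step, converting tokens to numbers, is what Dictionary and doc2bow do in the Corpus section.

```python
import re

# A tiny hand-picked stop word list, assumed for this sketch only
STOP_WORDS = {'the', 'is', 'a', 'an', 'and', 'in', 'of', 'to', 'was'}

def prepare(text):
    """Clean, tokenize, and filter one raw text string."""
    text = text.lower()                      # clean: normalize case
    text = re.sub(r'[^a-z\s]', ' ', text)    # drop punctuation and digits
    tokens = text.split()                    # tokenize on whitespace
    return [token for token in tokens if token not in STOP_WORDS]  # remove stop words

raw_texts = [
    'The delivery was late, and the box was damaged!',
    'Great service: the delivery arrived in 2 days.',
]
tokenized = [prepare(text) for text in raw_texts]
print(tokenized)
# [['delivery', 'late', 'box', 'damaged'], ['great', 'service', 'delivery', 'arrived', 'days']]
```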

Removing Stopwords and Low-Frequency Terms
Stopwords are commonly used words such as "the", "and", "is", "in", etc., that frequently occur in a language but do not add much meaning to the text.

These words can be removed from the text data to reduce noise and improve the accuracy of the topic modeling results.

Low-frequency terms are words that infrequently appear in the text data and may not be useful for analysis. These words can be removed from the document-term matrix to reduce noise and improve the accuracy of the topic modeling results.

In [10]:
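import nltk
from gensim import corpora
from nltk.corpus import stopwords

# Download the NLTK stopword list (only needed once per environment)
nltk.download('stopwords', quiet=True)

# Sample documents
documents = ["This is the first document.", "This is the second document.", "This is the third document."]

# Create a dictionary from the lowercased, whitespace-split documents
dictionary = corpora.Dictionary([doc.lower().split() for doc in documents])

# Remove stopwords from the dictionary
stop_words = set(stopwords.words('english'))
dictionary.filter_tokens(bad_ids=[dictionary.token2id[stopword] for stopword in stop_words if stopword in dictionary.token2id])

# Filter tokens by document frequency: no_below=1 keeps even tokens seen in a single
# document, while the default no_above=0.5 drops tokens found in more than half of
# the documents (here "document.", which still carries its trailing period)
dictionary.filter_extremes(no_below=1)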

print(dictionary)

Dictionary<3 unique tokens: ['first', 'second', 'third']>
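
It is worth spelling out why only three tokens survive. In the sample above, "document." appears in all three documents, so filter_extremes drops it through its no_above threshold (0.5 by default) rather than through no_below. The plain-Python sketch below reproduces that document-frequency rule on assumed toy token lists. It mirrors the idea and is not Gensim's own code.

```python
from collections import Counter

# Toy token lists standing in for the sample documents after stopword removal
tokenized_docs = [
    ['first', 'document.'],
    ['second', 'document.'],
    ['third', 'document.'],
]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))

no_below = 1     # keep tokens found in at least this many documents
no_above = 0.5   # keep tokens found in at most this fraction of the documents
max_docs = no_above * len(tokenized_docs)

kept = sorted(token for token, df in doc_freq.items() if no_below <= df <= max_docs)
print(kept)  # ['first', 'second', 'third']
```

To actually remove rare terms, raise no_below. For example, no_below=2 would drop every token that occurs in only one document.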


Topic Modeling Implementation with Gensim
Topic modeling is a technique used in natural language processing and machine learning to identify and extract hidden topics or themes from a collection of documents. Gensim is a popular Python library for this task.

To perform topic modeling in Gensim, text data must first be preprocessed. This typically includes tokenization and stopword removal, and optionally stemming or lemmatization.

Next, Gensim's implementation of the Latent Dirichlet Allocation (LDA) algorithm is used to train a model that identifies the topics present in the corpus. The LDA algorithm uses statistical inference to estimate the distribution of topics in each document and the distribution of words within each topic.

Once the model is trained, it can be used to predict the topic distribution of new documents. Gensim's easy-to-use and flexible implementation of LDA allows you to quickly and easily perform topic modeling on textual data and gain insight into your corpus's underlying themes and topics. 
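
It can help to see the assumption LDA makes about how documents come to exist: every topic is a probability distribution over words, every document is a probability distribution over topics, and each word is produced by first picking a topic and then picking a word from it. The sketch below simulates that generative story with the standard library. The vocabulary, the Dirichlet parameter 0.5, and the sizes are illustrative assumptions. Training an LDA model runs this reasoning in reverse, and this sketch is not Gensim's inference code.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from a Dirichlet via normalized gamma samples."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]

rng = random.Random(42)
vocabulary = ['game', 'team', 'score', 'drive', 'disk', 'file']
num_topics, doc_length = 2, 8

# Each topic is a distribution over words; the document is a distribution over topics
topic_word = [sample_dirichlet([0.5] * len(vocabulary), rng) for _ in range(num_topics)]
doc_topic = sample_dirichlet([0.5] * num_topics, rng)

document = []
for _ in range(doc_length):
    topic = rng.choices(range(num_topics), weights=doc_topic)[0]   # pick a topic
    word = rng.choices(vocabulary, weights=topic_word[topic])[0]   # pick a word from that topic
    document.append(word)

print(doc_topic)
print(document)
```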

Using Gensim for a Real-Life Example
We will use the 20 Newsgroups dataset, a classic dataset for text classification and topic modeling. It comprises approximately 20,000 newsgroup documents across 20 topics, such as sports, politics, and technology.

One of the problems that can be solved using topic modeling on the 20 newsgroup dataset is identifying the most common topics discussed in newsgroups. This may help to understand the interests and concerns of newsgroup participants and identify emerging trends in discussions.

For example, topic modeling can be used to identify the most common topics discussed in the "sci.med" newsgroup. This can reveal the health issues that most concern its participants, which may help inform public health policy and research priorities.

Similarly, topic modeling could be used to identify the most common topics discussed in the "talk.politics.mideast" newsgroup to help understand the political dynamics and tensions in the region.

In [1]:
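from sklearn.datasets import fetch_20newsgroups
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Download the NLTK resources used below (only needed once per environment).
# Newer NLTK releases look for 'punkt_tab' instead of 'punkt' in word_tokenize.
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)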

# Load the 20 Newsgroups dataset
categories = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

# Build the stop word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Preprocess the dataset
def preprocess(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Convert to lowercase
    tokens = [token.lower() for token in tokens]
    
    # Remove stop words
    tokens = [token for token in tokens if token not in STOP_WORDS]
    
    # Remove short words
    tokens = [token for token in tokens if len(token) > 2]
    
    return tokens

# Preprocess each document in the dataset
preprocessed_docs = []
for doc in newsgroups.data:
    preprocessed_docs.append(preprocess(doc))

In [2]:
preprocessed_docs

[['mamatha',
  'devineni',
  'ratnam',
  'mr47andrewcmuedu',
  'subject',
  'pens',
  'fans',
  'reactions',
  'organization',
  'post',
  'office',
  'carnegie',
  'mellon',
  'pittsburgh',
  'lines',
  'nntppostinghost',
  'po4andrewcmuedu',
  'sure',
  'bashers',
  'pens',
  'fans',
  'pretty',
  'confused',
  'lack',
  'kind',
  'posts',
  'recent',
  'pens',
  'massacre',
  'devils',
  'actually',
  'bit',
  'puzzled',
  'bit',
  'relieved',
  'however',
  'going',
  'put',
  'end',
  'nonpittsburghers',
  'relief',
  'bit',
  'praise',
  'pens',
  'man',
  'killing',
  'devils',
  'worse',
  'thought',
  'jagr',
  'showed',
  'much',
  'better',
  'regular',
  'season',
  'stats',
  'also',
  'lot',
  'fun',
  'watch',
  'playoffs',
  'bowman',
  'let',
  'jagr',
  'lot',
  'fun',
  'next',
  'couple',
  'games',
  'since',
  'pens',
  'going',
  'beat',
  'pulp',
  'jersey',
  'anyway',
  'disappointed',
  'see',
  'islanders',
  'lose',
  'final',
  'regular',
  'season',
  'ga

Now we use the preprocessed text data to create a document-term matrix, which represents the frequency of each term in each document. We can then use this matrix as input to a topic modeling algorithm.

In [4]:
from gensim import corpora

# Create a dictionary from the preprocessed documents
dictionary = corpora.Dictionary(preprocessed_docs)

# Create a document-term matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in preprocessed_docs]

# The next step is to use a topic modeling algorithm to extract the underlying
# topics from the document-term matrix. Gensim provides several algorithms for
# topic modeling, including Latent Dirichlet Allocation (LDA) and
# Latent Semantic Analysis (LSA).

from gensim.models import LdaModel

# Train an LDA model on the document-term matrix
num_topics = 10
lda_model = LdaModel(doc_term_matrix, num_topics=num_topics, id2word=dictionary, passes=10)

# Print the topics
for i, topic in lda_model.show_topics(num_topics=num_topics, formatted=False):
    print('Topic {}: {}'.format(i, ', '.join([word[0] for word in topic])))

Topic 0: clayton, cramer, new, 1st, gay, men, crameroptilinkcom, 100, risc, art
Topic 1: maxaxaxaxaxaxaxaxaxaxaxaxaxaxax, space, key, encryption, clipper, information, chip, system, may, use
Topic 2: subject, lines, organization, university, nntppostinghost, writes, would, know, article, distribution
Topic 3: god, one, would, people, subject, jesus, say, dont, think, lines
Topic 4: turkish, armenian, armenians, turkey, turks, armenia, greek, serdar, argic, greece
Topic 5: drive, car, one, writes, article, like, bike, get, hard, speed
Topic 6: 550, period, winnipeg, van, tor, blues, det, vancouver, bos, chi
Topic 7: file, windows, image, use, program, files, software, data, available, using
Topic 8: writes, lines, subject, organization, game, article, team, would, year, university
Topic 9: people, would, one, writes, article, dont, like, think, subject, lines


Now let’s use the trained model on a new document.

In [5]:
# Load the new document
new_doc = 'Darwin fish bumper stickers and assorted other atheist paraphernalia are available from the Freedom From Religion Foundation in the US.'



# Preprocess the new document
new_doc_processed = preprocess(new_doc)

# Create a bag-of-words representation of the new document
new_doc_bow = dictionary.doc2bow(new_doc_processed)

# Use the LDA model to infer the topic distribution of the new document
new_doc_topics = lda_model.get_document_topics(new_doc_bow)

# Print the top topic for the new document
top_topic = max(new_doc_topics, key=lambda x: x[1])[0]
print('Top Topic for New Document: {}'.format(top_topic))

Top Topic for New Document: 3


From here, given a large enough dataset, the same approach can be useful in various real-life scenarios, such as classifying news articles, categorizing customer feedback, or identifying the main topics in social media posts.