Gensim is an open-source library for natural language processing (NLP), primarily used for unsupervised topic modeling and document indexing. It is implemented in Python and Cython, allowing for efficient processing of large text collections using modern statistical machine learning techniques. Gensim enables users to represent documents as semantic vectors, making it easier to analyze and retrieve information from unstructured text data.

Gensim also provides efficient multicore implementations of various algorithms to increase processing speed. It offers more convenient facilities for text processing than other packages such as scikit-learn and R.

Another significant advantage of Gensim is that it lets us handle large text files without loading the whole file into memory.

The name Gensim is short for "Generate Similar". The library uses established academic models and modern statistical machine learning to perform complex tasks such as:

Building document or word vectors
Building corpora
Performing topic identification
Performing document comparison (retrieving semantically similar documents)
Analysing plain-text documents for semantic structure

Apart from performing the above complex tasks, Gensim, implemented in Python and Cython, is designed to handle large text collections using data streaming as well as incremental online algorithms. This makes it different from those machine learning software packages that target only in-memory processing.



Gensim can easily process large and web-scale corpora by using its incremental online training algorithms. It is scalable in nature, as there is no need for the whole input corpus to reside fully in Random Access Memory (RAM) at any one time. In other words, all its algorithms are memory-independent with respect to the corpus size.

We can easily plug in our own input corpus or data stream. Gensim is also easy to extend with other vector space algorithms.
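The streaming idea can be pictured with a plain-Python sketch. The class name `StreamedCorpus` and the two-line sample file are hypothetical and chosen only for illustration; the point is that an object whose `__iter__` reads one line at a time can serve as a corpus without ever holding the whole file in memory.

```python
import os
import tempfile


class StreamedCorpus:
    """Yield one tokenized document per line, without loading the whole file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is re-opened on every pass, so the corpus can be iterated
        # several times while only one line is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


# Hypothetical two-line corpus written to a temporary file for the demo.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Human machine interface\nGraph of trees\n")
    corpus_path = tmp.name

corpus = StreamedCorpus(corpus_path)
streamed_docs = [doc for doc in corpus]  # first pass over the file
second_pass = [doc for doc in corpus]    # second pass re-reads the file
os.remove(corpus_path)
print(streamed_docs)
```

Because the object can be iterated more than once, it suits algorithms that need several passes over the data, for example one pass to build a vocabulary and another to train.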

Gensim provides efficient multicore implementations of popular algorithms such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Random Projections (RP), and Hierarchical Dirichlet Process (HDP).

Uses of Gensim
Gensim has been used and cited in over a thousand commercial and academic applications, as well as in research papers and student theses. It includes streamed, parallelised implementations of the following:

fastText
fastText is a library for learning word embeddings and for text classification, created by Facebook's AI Research (FAIR) lab. It uses a neural network to learn the embeddings, and the model can be trained in a supervised or unsupervised way to obtain vector representations of words.

Word2vec
Word2vec is a group of shallow, two-layer neural network models used to produce word embeddings. The models are trained to reconstruct the linguistic contexts of words.

LSA (Latent Semantic Analysis)
It is a technique in NLP (Natural Language Processing) that allows us to analyse relationships between a set of documents and the terms they contain. It does so by producing a set of concepts related to the documents and terms.

LDA (Latent Dirichlet Allocation)
It is a generative statistical model used in NLP that allows sets of observations to be explained by unobserved groups. These unobserved groups, which correspond to topics, explain why some parts of the data are similar.

tf-idf (term frequency-inverse document frequency)
tf-idf, a numeric statistic in information retrieval, reflects how important a word is to a document in a corpus. It is often used by search engines to score and rank a document's relevance given a user query. It can also be used for stop-word filtering in text summarisation and classification.
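As a small worked example of tf-idf, the sketch below uses one common weighting: raw term count multiplied by log2(N / df), where N is the number of documents and df is the number of documents containing the term. The three toy documents are made up, and libraries differ in details such as the logarithm base and normalisation, so this should be read as an illustration of the idea rather than any particular library's formula.

```python
import math
from collections import Counter

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["dog", "barked"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return {term: weight} using raw term count times log2(N / df)."""
    counts = Counter(doc)
    return {
        term: count * math.log2(n_docs / df[term])
        for term, count in counts.items()
    }


weights = tfidf(docs[1])
print(weights)
```

Under this weighting a term that occurs in every document gets log2(1) = 0, which is why tf-idf can act as a simple stop-word filter.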

All of them will be explained in detail in the next sections.

Advantages
Gensim is an NLP package for topic modeling. Its important advantages are as follows:

Other packages such as scikit-learn and R also offer topic modeling and word embedding, but Gensim's facilities for building topic models and word embeddings are more extensive. It also provides more convenient facilities for text processing.

Gensim doesn't require costly annotations or hand tagging of documents because it uses unsupervised models.

Memory independence: there is no need for the whole training corpus to reside fully in RAM at any one time, so Gensim can process large, web-scale corpora using data streaming.

Data scientists work with text data in various ways to get meaningful results. Algorithms such as word2vec, doc2vec, topic models, and tf-idf play a significant role in natural language processing applications, and we need a Python library that implements them efficiently. Gensim fills that role.

How to Install Gensim?
Installing Gensim is quick and easy. We can install the library with either pip or conda.

pip
pip install --upgrade gensim

conda
conda install -c conda-forge gensim

Gensim is designed to handle large and complex text corpora. It provides an efficient, easy-to-use interface and scalable implementations of popular algorithms such as Latent Dirichlet Allocation (LDA) for topic modeling and similarity detection; releases before 4.0 also included text summarization.

Gensim includes a set of topic modeling tools such as:

Latent Semantic Analysis (LSA)
Latent Dirichlet Allocation (LDA)
Hierarchical Dirichlet Process (HDP)

These algorithms are intended to extract topics from a collection of text and reveal its underlying themes and patterns.

Why use Gensim for Topic Modeling?
Gensim has a number of benefits for topic modeling. Scalability is a significant one: it is built to manage large amounts of text data, making it well suited to analyzing very large datasets.

Furthermore, Gensim includes efficient text cleaning, preprocessing, and transformation methods, which make it more straightforward to derive insights from raw text data.

Aside from topic modeling, it can be used for similarity detection and as part of document categorization pipelines; releases before 4.0 also supported text summarization. The vectors Gensim produces can be passed on to other common machine learning frameworks such as scikit-learn and TensorFlow.

It also offers fast implementations of well-known methods such as LDA and LSI, and because it is designed for large text collections, it scales to real-world datasets.

Documents
In Gensim, a document refers to a single text unit within a collection of texts. It could be a single sentence, a paragraph, or a whole book. To work with a document in Gensim, we usually tokenize it into a list of words or tokens, where each token is a string representing a word in the text.

In [1]:
# Create a document as a list of words
document = ['this', 'is', 'a', 'document']

# Create a document as a string
document = 'This is a document.'

Corpus 
A corpus is a collection of text documents. In Gensim, a corpus is represented as an iterable of documents, for example a list in which each document is a list of words or, after conversion, a list of (token ID, count) pairs.

Before building a model, we must preprocess the text data by removing stopwords, punctuation, and other noise and convert the text into a numerical representation.
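The preprocessing step can be sketched in plain Python as follows. The stopword list is a tiny hypothetical one, and the function name `preprocess` is an assumption made for this illustration; real projects usually rely on a larger stopword list and a more careful tokenizer.

```python
import re

# A tiny, hypothetical stopword list; real projects usually use a larger one.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "this"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


raw_documents = ["This is a document.", "This is another document!"]
tokenized = [preprocess(text) for text in raw_documents]
print(tokenized)  # [['document'], ['another', 'document']]
```

The resulting lists of tokens are the form that gets turned into a numerical representation in the next step.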

In [2]:
from gensim.corpora import Dictionary

# Create a corpus from a list of documents
documents = [['this', 'is', 'a', 'document'], ['this', 'is', 'another', 'document']]
dictionary = Dictionary(documents)
corpus = [dictionary.doc2bow(document) for document in documents]

In [3]:
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1)], [(1, 1), (2, 1), (3, 1), (4, 1)]]

We first import the Dictionary class. Then we define a list of tokenized documents and pass it to Dictionary, which assigns an integer ID to every unique word in the documents.

We then call the doc2bow method on each document to get its bag-of-words representation, a list of (token ID, count) pairs, and collect these lists into the corpus.
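To see where the (token ID, count) pairs come from, here is a plain-Python sketch of the same conversion. It is not Gensim's code; in particular, assigning IDs to new tokens in sorted order within each document is an assumption chosen because it reproduces the corpus printed above.

```python
from collections import Counter

documents = [
    ["this", "is", "a", "document"],
    ["this", "is", "another", "document"],
]

token2id = {}
corpus = []
for document in documents:
    counts = Counter(document)
    # New tokens get the next free ID; sorting makes the assignment deterministic.
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    # Each document becomes a sorted list of (token_id, count) pairs.
    corpus.append(sorted((token2id[token], n) for token, n in counts.items()))

print(token2id)
print(corpus)
```

Each pair records how often a word occurs in the document, while word order is discarded, which is what "bag of words" means.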

Vectors
A vector is a mathematical representation of a document or a word in a corpus. In Gensim, vectors are used to represent documents in numerical form. A vector is simply an ordered list of numbers that encodes information about the document it represents. 

Gensim provides several methods for generating document and word vectors. One popular method is the Word2Vec model, which learns word vectors by predicting the context in which a word appears in a corpus. 
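"Predicting the context" can be made concrete by listing the (target, context) pairs a model would be trained on. The helper `context_pairs` below is a hypothetical, simplified sketch: it uses a fixed window and leaves out the sampling tricks that real Word2Vec training applies.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = context_pairs(["this", "is", "another", "document"], window=1)
print(pairs)
```

Words that often share context words end up with similar vectors, which is what makes the learned vectors useful for similarity queries.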

In [5]:
from gensim.models import Word2Vec
import numpy as np

# Create a list of tokenized documents
documents = [['this', 'is', 'a', 'document'], ['this', 'is', 'another', 'document']]

# Train a Word2Vec model on the documents
model = Word2Vec(documents, vector_size=100, window=5, min_count=1)

# Get the vector for a word
word_vector = model.wv['document']

# Get the mean vector for a document
new_document = ['this', 'is', 'another', 'document']
document_vector = np.mean([model.wv[word] for word in new_document], axis=0)

In [6]:
document_vector

array([-0.00432601,  0.00406971,  0.00082073,  0.00285212,  0.00260904,
       -0.00250835,  0.00165858,  0.00615073, -0.0025268 , -0.00281055,
        0.00292883, -0.00209868,  0.00078521,  0.00207212,  0.00361077,
       -0.00070364,  0.00591563,  0.00327086, -0.00716529, -0.00208044,
        0.00106662, -0.00162748,  0.00522037, -0.00203441,  0.00282427,
       -0.00070741,  0.00217527,  0.00482545, -0.00344711,  0.00179091,
        0.00214926, -0.00413491, -0.0014018 , -0.00439473, -0.00079619,
        0.0020215 ,  0.00595606, -0.00062651,  0.00040572,  0.00335066,
       -0.00096744,  0.00122857, -0.00313262, -0.00018954,  0.00204034,
        0.00373947,  0.00064566,  0.00074721, -0.00025051,  0.00190524,
        0.00134328, -0.00189608, -0.00525017, -0.00329428, -0.00192758,
       -0.00053604,  0.00341957,  0.00121441, -0.00128411,  0.00227522,
       -0.00138613,  0.00118845,  0.00182446, -0.00309547, -0.00020932,
        0.00286489,  0.00342713,  0.00334012, -0.00112963,  0.00

In this example, we first create a list of tokenized documents and train a Word2Vec model on these documents. Then we get the vector for an individual word and compute the mean vector for an entire document by averaging the vectors for each word.
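Once documents are represented as mean vectors, they can be compared with cosine similarity. The sketch below uses made-up 3-dimensional vectors in place of a trained model, so the numbers are purely illustrative; the helper names are assumptions as well.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise average of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 3-dimensional word vectors, made up for illustration.
toy_vectors = {
    "this": [0.1, 0.3, 0.0],
    "is": [0.0, 0.2, 0.1],
    "a": [0.2, 0.0, 0.1],
    "another": [0.2, 0.1, 0.1],
    "document": [0.4, 0.1, 0.3],
}

doc_a = mean_vector([toy_vectors[w] for w in ["this", "is", "a", "document"]])
doc_b = mean_vector([toy_vectors[w] for w in ["this", "is", "another", "document"]])
similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 3))
```

A value close to 1 means the two documents point in nearly the same direction in the vector space, as expected here because they share three of their four words.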

Creating Bigrams and Trigrams
Bigrams and trigrams are pairs and triplets of consecutive words in a text document. They can provide additional context and meaning compared to individual words alone.

For example, the bigram “New York” carries a different meaning than the individual words “New” and “York” considered separately.

In Gensim, we can create bigrams and trigrams using the Phrases and Phraser classes. The Phrases class takes tokenized sentences as input and learns which pairs of adjacent words co-occur often enough to be merged into a single token, such as new_york. Training a second Phrases model on the output of the first yields trigrams.

The trained model can be converted to a Phraser object, a more memory-efficient, read-only version of the Phrases object that can be used to apply the bigram or trigram transformation to new documents. In Gensim 4.x this class is named FrozenPhrases, and Phraser is kept as an alias.

In [7]:
from gensim.models import Phrases
from gensim.utils import simple_preprocess

text = '''The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Title: Pride and Prejudice
Author: Jane Austen
Posting Date: August 26, 2008 [EBook #1342]
Release Date: June, 1998
Language: English
*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***
Produced by Anonymous Volunteers
PRIDE AND PREJUDICE
By Jane Austen
Chapter 1
It is a truth universally acknowledged, that a single man in possession
of a good fortune must be in want of a wife.
However little known the feelings or views of such a man may be on his
first entering a neighborhood, this truth is so well fixed in the minds
of the surrounding families that he is considered as the rightful property
of some one or other of their daughters.'''

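# Lowercase and tokenize the text
tokens = list(simple_preprocess(text))

# Train a bigram model, then a trigram model on the bigram-transformed tokens
bigram = Phrases([tokens], min_count=1, threshold=1)
trigram = Phrases(bigram[[tokens]], min_count=1, threshold=1)

# Keep only merged tokens: one underscore for bigrams, two for trigrams
bigrams = [b for b in bigram[tokens] if b.count('_') == 1]
trigrams = [t for t in trigram[bigram[tokens]] if t.count('_') == 2]

print('Bigrams:', bigrams)
print('Trigrams:', trigrams)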


Bigrams: ['the_project', 'gutenberg_ebook', 'pride_and', 'prejudice_by', 'jane_austen', 'this_ebook', 'of_the', 'project_gutenberg', 'this_ebook', 'pride_and', 'jane_austen', 'project_gutenberg', 'pride_and', 'pride_and', 'prejudice_by', 'jane_austen', 'of_the']
Trigrams: ['pride_and_prejudice', 'pride_and_prejudice']


In the code above, we first import the necessary classes and define the sample text. We then use simple_preprocess to lowercase and tokenize the text.

Next, we train a Phrases model on the tokens to detect bigrams, and train a second Phrases model on the bigram-transformed tokens to detect trigrams. Applying the models to the tokens yields a sequence in which detected phrases are joined by an underscore (_) and all other tokens are left unchanged.

Finally, we keep the tokens with exactly one underscore as bigrams and those with exactly two as trigrams, and print them.
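How min_count and threshold decide whether two words are merged can be illustrated with a co-occurrence score in the style of Mikolov et al. The formula below is assumed to resemble the default scoring used by Phrases and should be checked against the Gensim documentation; the toy sentences and the helper `phrase_score` are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized sentences.
sentences = [
    ["new", "york", "is", "big"],
    ["she", "moved", "to", "new", "york"],
    ["a", "new", "idea", "is", "born"],
]

unigram_counts = Counter(token for sentence in sentences for token in sentence)
bigram_counts = Counter(
    pair for sentence in sentences for pair in zip(sentence, sentence[1:])
)
vocab_size = len(unigram_counts)


def phrase_score(word_a, word_b, min_count=1):
    """High when a pair co-occurs often relative to how frequent its words are."""
    pair_count = bigram_counts[(word_a, word_b)]
    return (
        (pair_count - min_count)
        / (unigram_counts[word_a] * unigram_counts[word_b])
        * vocab_size
    )


score_new_york = phrase_score("new", "york")
score_new_idea = phrase_score("new", "idea")
print(score_new_york, score_new_idea)
```

With a threshold of 1, "new york" would be merged because its score is above the threshold, while "new idea" would not, since it occurs only once and scores zero.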

Summarizing Text Documents
Text summarization is the process of condensing a lengthy piece of text into a succinct version that communicates the essential information. You can use gensim to extract the essential lines from a text document and create a summary that conveys the substance of the original content.  

Gensim's summarize function, found in the gensim.summarization module, employs an extractive summarization technique based on the TextRank algorithm. TextRank ranks the sentences in the text and selects the highest-ranked ones for the summary. Note that this module is available only in Gensim 3.x; it was removed in Gensim 4.0, and 3.x releases may fail to install or import on recent Python versions.

Summarizing text documents with Gensim can be helpful for quickly pulling important information from large quantities of text. This is useful for activities like researching a subject, reviewing books, or taking notes on what you've read.

In addition to its summarization capabilities, Gensim also includes other natural language processing tools, such as topic modeling and word vector representations.

In [8]:
!pip3 install gensim==3.6.0


In [9]:
import gensim
from gensim.summarization import summarize
text = '''Rice Pudding - Poem by Alan Alexander Milne
What is the matter with Mary Jane?
She's crying with all her might and main,
And she won't eat her dinner - rice pudding again -
What is the matter with Mary Jane?
What is the matter with Mary Jane?
I've promised her dolls and a daisy-chain,
And a book about animals - all in vain -
What is the matter with Mary Jane?
What is the matter with Mary Jane?
She's perfectly well, and she hasn't a pain;
But, look at her, now she's beginning again! -
What is the matter with Mary Jane?
What is the matter with Mary Jane?
I've promised her sweets and a ride in the train,
And I've begged her to stop for a bit and explain -
What is the matter with Mary Jane?'''

summary = summarize(text,ratio=0.3)

print(summary)


Rice Pudding - Poem by Alan Alexander Milne
And she won't eat her dinner - rice pudding again -
I've promised her dolls and a daisy-chain,
I've promised her sweets and a ride in the train,
What is the matter with Mary Jane?


In the code above, we first import the summarize function and define the sample text. We then call summarize to generate the summary.

The ratio parameter controls the length of the summary as a fraction of the original text. In this example, we set it to 0.3, so the summary should contain roughly 30% of the original sentences. You can adjust this parameter to get longer or shorter summaries depending on your needs.
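For readers who want to see the ranking idea itself, here is a minimal TextRank-style sketch in plain Python. It is not Gensim's implementation: it treats each line as a sentence, uses the simple word-overlap similarity from the original TextRank paper, and runs a fixed number of PageRank-style iterations. All function names and the sample text are assumptions made for illustration.

```python
import math
import re


def split_sentences(text):
    """Treat every non-empty line as one sentence."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def similarity(sent_a, sent_b):
    """Word overlap, normalised by the log of the sentence lengths."""
    words_a = set(re.findall(r"[a-z']+", sent_a.lower()))
    words_b = set(re.findall(r"[a-z']+", sent_b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, ratio=0.3, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    # Weighted graph: an edge between two sentences is their similarity.
    weights = [
        [similarity(a, b) if i != j else 0.0 for j, b in enumerate(sentences)]
        for i, a in enumerate(sentences)
    ]
    out_weight = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours, in proportion
        # to the share of the neighbour's total edge weight.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    keep = max(1, round(n * ratio))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]  # keep original order


text = """Gensim streams documents from disk.
Streaming keeps memory use low for large corpora.
Topic models describe documents as mixtures of topics.
Gensim trains topic models on streamed documents.
The weather was pleasant yesterday."""

summary = textrank_summary(text, ratio=0.4)
print("\n".join(summary))
```

Sentences that share vocabulary with many other sentences collect the most score, so an unrelated line such as the one about the weather is left out of the summary.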

1. Creating a Dictionary
You can create a dictionary from a list of tokenized documents:

In [3]:
from gensim import corpora

In [4]:
# Example tokenized documents
documents = [["hello", "world"], ["machine", "learning", "world"], ["hello", "gensim"]]

# Create a dictionary
dictionary = corpora.Dictionary(documents)

# View the dictionary
print(dictionary.token2id)  # Mapping of words to IDs

{'hello': 0, 'world': 1, 'learning': 2, 'machine': 3, 'gensim': 4}
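Once the mapping exists, a new document can be converted to (token ID, count) pairs by looking up each word. The plain-Python sketch below mimics that conversion using the mapping printed above; the helper `to_bow` is hypothetical, and it skips words that are not in the mapping, which is also how doc2bow behaves by default.

```python
from collections import Counter

# Mapping copied from the dictionary output shown above.
token2id = {"hello": 0, "world": 1, "learning": 2, "machine": 3, "gensim": 4}


def to_bow(tokens, mapping):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(mapping[token] for token in tokens if token in mapping)
    return sorted(counts.items())


new_document = ["hello", "hello", "python", "world"]
bow = to_bow(new_document, token2id)
print(bow)  # [(0, 2), (1, 1)]
```

The word "python" does not appear in the result because it was never seen when the dictionary was built.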
