<a href="https://colab.research.google.com/github/sivanathvenigalla/Jaya-Venkatasivanath_INFO5731_Fall2024/blob/main/Venigalla_Exercise_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 4**

**This exercise provides hands-on experience in working with text data and extracting features using topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [None]:
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
import spacy

#Dataset
texts = [
    """Siva is the most influential and respected leader in Guntur, shaping the community with his powerful vision.
Known as the Iron Man of India, his presence commands respect and admiration across regions.
Currently residing in the USA, Siva continues to inspire people globally, bridging the cultures of India and America.
His journey from Guntur to the world stage is a testament to his dedication and resilience.
Siva often says, "True strength lies in uplifting others," a mantra he follows passionately.
In both India and abroad, he remains a symbol of strength, unity, and unyielding determination."""
]

# Preprocessing
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
processed_texts = []
for doc in texts:
    tokens = [token.lemma_ for token in nlp(doc.lower()) if token.is_alpha and not token.is_stop]
    processed_texts.append(tokens)

# Create Dictionary and Corpus
dictionary = corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]

# Determine Optimal Number of Topics
coherence_scores = []
for k in range(2, 6):
    lda_model = gensim.models.LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=42)
    coherence_model = CoherenceModel(model=lda_model, texts=processed_texts, dictionary=dictionary, coherence='c_v')
    coherence_scores.append((k, coherence_model.get_coherence()))

# Select the model with the highest coherence score
optimal_k = max(coherence_scores, key=lambda x: x[1])[0]

# Train Final LDA Model
final_lda_model = gensim.models.LdaModel(corpus, num_topics=optimal_k, id2word=dictionary, random_state=42)

# Summarize Topics
topics = final_lda_model.print_topics(num_words=5)
for topic in topics:
    print(f"Topic {topic[0]}: {topic[1]}")




Topic 0: 0.037*"india" + 0.034*"siva" + 0.031*"strength" + 0.029*"guntur" + 0.022*"globally"
Topic 1: 0.044*"siva" + 0.041*"india" + 0.030*"guntur" + 0.028*"strength" + 0.022*"lie"
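
The two topics printed above share most of their top words, which suggests the model has not separated distinct themes. One rough way to quantify this is the Jaccard overlap between the top-word sets. The sketch below copies the words from the output above; the helper name `jaccard` is an assumption introduced for illustration.

```python
# Top-5 words copied from the LDA output above
topic_0 = {"india", "siva", "strength", "guntur", "globally"}
topic_1 = {"siva", "india", "guntur", "strength", "lie"}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    return len(a & b) / len(a | b)

overlap = jaccard(topic_0, topic_1)
print(f"Jaccard overlap between Topic 0 and Topic 1: {overlap:.2f}")  # 0.67
```

An overlap this high (four shared words out of six distinct ones) is a hint that K = 2 may already be more topics than this corpus supports, or that the corpus needs to be split into more documents before fitting.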


## Question 2 (10 Points)

**Generate K topics using LSA. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [None]:
from gensim import corpora, models
from gensim.models import CoherenceModel


# Preprocessing
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
processed_texts = []
for doc in texts:
    tokens = [token.lemma_ for token in nlp(doc.lower()) if token.is_alpha and not token.is_stop]
    processed_texts.append(tokens)

# Create Dictionary and Corpus
dictionary = corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]

# Determine Optimal Number of Topics
coherence_scores = []
for k in range(2, 6):
    lsi_model = models.LsiModel(corpus, num_topics=k, id2word=dictionary)
    coherence_model = CoherenceModel(model=lsi_model, texts=processed_texts, dictionary=dictionary, coherence='c_v')
    coherence_scores.append((k, coherence_model.get_coherence()))

# Select the model with the highest coherence score
optimal_k = max(coherence_scores, key=lambda x: x[1])[0]

# Train Final LSA Model
final_lsi_model = models.LsiModel(corpus, num_topics=optimal_k, id2word=dictionary)

# Summarize Topics
topics = final_lsi_model.print_topics(num_words=5)
for topic in topics:
    print(f"Topic {topic[0]}: {topic[1]}")


Topic 0: 0.359*"india" + 0.359*"siva" + 0.239*"strength" + 0.239*"guntur" + 0.120*"america"
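
Only one topic is printed above even though the loop tries K from 2 to 5. A likely reason is that `texts` holds a single document: LSA rests on a truncated SVD of the term-document matrix, and a matrix with one column has at most one non-zero singular value. The sketch below illustrates this with NumPy on small hypothetical count matrices (the counts are made up for illustration and are not taken from the corpus).

```python
import numpy as np

def usable_topics(term_doc_matrix, tol=1e-10):
    """Count non-zero singular values, i.e. the most latent topics SVD can recover."""
    singular_values = np.linalg.svd(term_doc_matrix, compute_uv=False)
    return int(np.sum(singular_values > tol))

# Hypothetical counts: 5 terms in 1 document (a single column)
single_doc = np.array([[3], [3], [2], [2], [1]], dtype=float)

# Hypothetical counts: the same 5 terms spread over 3 documents
three_docs = np.array(
    [[2, 1, 0],
     [1, 0, 2],
     [0, 2, 0],
     [1, 1, 0],
     [0, 0, 1]],
    dtype=float,
)

rank_single = usable_topics(single_doc)
rank_three = usable_topics(three_docs)
print(rank_single, rank_three)  # 1 3
```

Under this reading, splitting the corpus into several documents (for example, one per sentence or per review) would be needed before values of K above 1 become meaningful for LSA.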


## Question 3 (10 Points)

**Generate K topics using lda2vec. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:

!pip install --upgrade gensim nltk


import nltk
from nltk.corpus import stopwords

# Download the NLTK resources needed for tokenization and stop-word removal
nltk.download('punkt')         # Tokenizer models
nltk.download('stopwords')     # Stop-word lists
nltk.download('punkt_tab')     # Tokenizer tables required by newer NLTK releases

stop_words = set(stopwords.words('english'))  # Set of English stop words

# Lowercase, tokenize, and keep alphanumeric tokens that are not stop words
def preprocess(text):
    tokens = nltk.word_tokenize(text.lower())
    return [word for word in tokens if word.isalnum() and word not in stop_words]

# Example dataset
documents = [
    "Natural language processing allows computers to understand human language.",
    "Deep learning is a subset of machine learning.",
    "Artificial intelligence is changing the world.",
    # Add more documents as needed...
]

processed_docs = [preprocess(doc) for doc in documents]


from gensim import corpora

dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]


from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
import numpy as np

num_topics_range = range(2, 20)
coherence_scores = []

for num_topics in num_topics_range:
    lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
    coherence_model = CoherenceModel(model=lda_model, texts=processed_docs, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append(coherence_score)


best_num_topics = num_topics_range[np.argmax(coherence_scores)]
print(f'Best number of topics: {best_num_topics}')

final_model = LdaModel(corpus, num_topics=best_num_topics, id2word=dictionary, passes=10)


topics = final_model.print_topics(num_words=10)
for i, topic in enumerate(topics):
    print(f'Topic {i+1}: {topic[1]}')
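
One detail worth checking in the selection step above: `np.argmax` treats NaN as the maximum, so a single undefined coherence value would be chosen as the "best" K. Whether NaN scores actually occur depends on the corpus and the coherence measure, so treat that as an assumption to verify against your own run. The sketch below uses hypothetical scores to show the difference between `np.argmax` and `np.nanargmax`.

```python
import numpy as np

num_topics_range = range(2, 6)

# Hypothetical coherence scores for K = 2, 3, 4, 5; one of them is undefined
coherence_scores = [0.41, float("nan"), 0.47, 0.39]

# np.argmax returns the position of the first NaN, not the real maximum
naive_k = num_topics_range[int(np.argmax(coherence_scores))]

# np.nanargmax skips NaN entries and returns the position of the largest valid score
safe_k = num_topics_range[int(np.nanargmax(coherence_scores))]

print(f"argmax picks K={naive_k}, nanargmax picks K={safe_k}")  # K=3 vs K=4
```

Note that `np.nanargmax` raises a `ValueError` when every score is NaN, which is a useful signal that the corpus is too small for the chosen coherence measure.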



## Question 4 (10 Points)

**Generate K topics using BERTopic. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [None]:
# Step 1: Install required libraries (if not already installed)
!pip install --upgrade bertopic

# Step 2: Import libraries
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer  # Installed as a BERTopic dependency
from sklearn.metrics import silhouette_score
import numpy as np

# Step 3: Load a smaller subset of the dataset
docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))['data'][:500]

# Step 4: Embed the documents once so the same vectors are used for fitting and scoring
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=False)

# Step 5: Fit BERTopic for each candidate number of topics and score the clustering
# The silhouette score measures cluster separation; it is used here in place of a coherence score
coherence_scores = []
num_topics_range = range(2, 4)  # Testing only 2 and 3 topics

for num_topics in num_topics_range:
    topic_model = BERTopic(embedding_model=embedding_model, language="english", nr_topics=num_topics)

    # Pass the precomputed embeddings so BERTopic does not re-embed the documents
    topics, probs = topic_model.fit_transform(docs, embeddings)

    if len(set(topics)) > 1:
        # Calculate silhouette score on the document embeddings
        silhouette_avg = silhouette_score(embeddings, topics)
        coherence_scores.append(silhouette_avg)
    else:
        # Silhouette is undefined when every document falls into a single cluster
        coherence_scores.append(-1)

# Step 6: Find the optimal number of topics
best_num_topics = num_topics_range[np.argmax(coherence_scores)]
print(f'Best number of topics based on silhouette score: {best_num_topics}')

# Fit final model with the best number of topics
final_topic_model = BERTopic(embedding_model=embedding_model, language="english", nr_topics=best_num_topics)
final_topics, final_probs = final_topic_model.fit_transform(docs, embeddings)

# Step 7: Summarize the topics
topic_info = final_topic_model.get_topic_info()
print(topic_info.head(10))

# Access and print the representative words for the first few topics
for i in range(best_num_topics):
    print(f'Topic {i}: {final_topic_model.get_topic(i)}')
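
The loop above ranks candidate models by silhouette score, which measures how well separated the clusters are in embedding space rather than how coherent each topic's top words are. As a rough illustration of what a coherence score measures, the sketch below implements a simplified UMass-style coherence in plain Python. The function name `umass_coherence`, the add-one smoothing, and the toy documents are assumptions for illustration; library implementations such as gensim's `CoherenceModel` differ in the details.

```python
import math
from itertools import combinations

def umass_coherence(top_words, tokenized_docs):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words.

    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # j < i, earlier word conditions later word
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_freq(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")

# Hypothetical tokenized documents, not taken from the 20 Newsgroups data
toy_docs = [
    ["space", "nasa", "orbit"],
    ["space", "orbit", "launch"],
    ["hockey", "team", "game"],
    ["team", "game", "season"],
]

coherent = umass_coherence(["space", "orbit"], toy_docs)   # words that co-occur
mixed = umass_coherence(["space", "game"], toy_docs)       # words that never co-occur
print(coherent > mixed)  # True
```

To apply the same idea to BERTopic, the top words of each topic (from `get_topic`) could be scored against the tokenized documents, and the mean over topics compared across candidate values of K.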

## **Question 3 (Alternative) - (10 points)**

If you are unable to do the topic modeling using lda2vec, answer this alternative question.

Provide at least 3 visualizations for the topics generated by the BERTopic or LDA model. Explain each visualization in detail.

In [None]:
# Write your code here
# Then Explain the visualization

# Repeat for the other 2 visualizations as well.

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

When evaluating the four topic modeling algorithms—Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic—several factors come into play, including coherence scores, interpretability, and computational efficiency. LDA typically excels in small datasets, providing coherent and interpretable topics that clearly reflect underlying themes. In contrast, LSA may struggle with clarity and specificity, producing broader topics that can be less meaningful. lda2vec enhances topic representation by integrating word embeddings, yielding nuanced results, although its effectiveness can vary based on the quality of those embeddings. BERTopic stands out for larger, complex datasets, leveraging modern embeddings and clustering techniques to generate high-coherence and interpretable topics.

Ultimately, the choice of algorithm depends on the context. For smaller datasets, LDA is often the preferred option due to its clarity and thematic focus. However, for larger datasets that demand richer semantic structures, BERTopic or lda2vec may be more effective. It's crucial to conduct thorough evaluations across different models and datasets, considering both coherence and interpretability to identify the most suitable approach for specific analytical goals.

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



It was quite useful to work with text data and investigate topic modeling methods such as LDA, LSA, lda2vec, and BERTopic. Each algorithm takes its own approach to identifying themes and patterns in unstructured data, and working through them improved my understanding of feature extraction in text processing. The practical implementation helped me understand the fundamental ideas behind these algorithms and the differences in how they model themes and latent structures in the data. The iterative process of selecting the best K based on coherence scores gave me insight into model tuning and the balance between topic coherence and model complexity.

Since the lda2vec environment required specific installations and configuration, getting it set up was one of the biggest obstacles. In many situations, lda2vec was less practical than the alternative techniques because of its longer-than-anticipated training time and memory requirements. BERTopic was easy to use, but it took more experimentation to tune its settings to match the coherence of the other models. Choosing the best parameters, preparing the text input correctly, and interpreting the results meaningfully were all difficult tasks that required significant thought and patience.