# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics by using LDA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import TfidfModel
from gensim.matutils import Sparse2Corpus

movie_reviews = [
    "This movie is absolutely fantastic! The acting is superb and the storyline is captivating.",
    "I couldn't stand this film. The plot was weak, and the acting was terrible.",
    "The movie was okay. It had some good moments but overall, it was disappointing."
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(movie_reviews)


corpus = Sparse2Corpus(X.T)


coherence_scores = []
for k in range(2, 11): # You can adjust the range for K
   lda_model = LatentDirichletAllocation(n_components=k, random_state=42)
   lda_model.fit(X)
   id2word = {v: k for k, v in vectorizer.vocabulary_.items()}
   lda_gensim = LdaModel(
       corpus,
       num_topics=k,
       id2word=id2word,
       passes=15,
       iterations=100,
       random_state=42
   )
   coherence_model_lda = CoherenceModel(model=lda_gensim, corpus=corpus, coherence='u_mass')
   coherence_lda = coherence_model_lda.get_coherence()
   coherence_scores.append(coherence_lda)

optimal_k = 2 + coherence_scores.index(max(coherence_scores))

lda_model = LatentDirichletAllocation(n_components=optimal_k, random_state=42)
lda_model.fit(X)


feature_names = vectorizer.get_feature_names_out()
topics = []
for topic_idx, topic in enumerate(lda_model.components_):
   top_features_ind = topic.argsort()[:-10 - 1:-1] # Top 10 features for each topic
   topic_words = [feature_names[i] for i in top_features_ind]
   topics.append(topic_words)

for idx, topic in enumerate(topics):
    print(f"Topic {idx+1}: {topic}")

Topic 1: ['is', 'the', 'captivating', 'storyline', 'absolutely', 'fantastic', 'superb', 'this', 'acting', 'and']
Topic 2: ['was', 'it', 'the', 'terrible', 'stand', 'film', 'weak', 'plot', 'couldn', 'some']


## Question 2 (10 Points)

**Generate K topics by using LSA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
from gensim.models import LsiModel
from gensim.matutils import Sparse2Corpus

movie_reviews = [
    "This movie is absolutely fantastic! The acting is superb and the storyline is captivating.",
    "I couldn't stand this film. The plot was weak, and the acting was terrible.",
    "The movie was okay. It had some good moments but overall, it was disappointing."
]


vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(movie_reviews)


corpus = Sparse2Corpus(X.T)


coherence_scores = []
for k in range(2, 11): # You can adjust the range for K
   lsa_model = TruncatedSVD(n_components=k, random_state=42)
   lsa_model.fit(X)
   id2word = {v: k for k, v in vectorizer.vocabulary_.items()}
   lsa_gensim = LsiModel(
       corpus,
       num_topics=k,
       id2word=id2word,
       chunksize=2000,
       decay=0.5,
       onepass=False
   )
   coherence_model_lsa = CoherenceModel(model=lsa_gensim, corpus=corpus, coherence='u_mass')
   coherence_lsa = coherence_model_lsa.get_coherence()
   coherence_scores.append(coherence_lsa)

optimal_k = 2 + coherence_scores.index(max(coherence_scores))


lsa_model = TruncatedSVD(n_components=optimal_k, random_state=42)
lsa_model.fit(X)


feature_names = vectorizer.get_feature_names_out()
topics = []
for topic_idx, topic in enumerate(lsa_model.components_):
   top_features_ind = topic.argsort()[:-10 - 1:-1] # Top 10 features for each topic
   topic_words = [feature_names[i] for i in top_features_ind]
   topics.append(topic_words)


for idx, topic in enumerate(topics):
   print(f"Topic {idx + 1}: {'|'.join(topic)}")


Topic 1: was|the|is|it|this|acting|and|movie|weak|couldn
Topic 2: is|superb|storyline|captivating|fantastic|absolutely|acting|this|and|the
Topic 3: weak|film|terrible|stand|plot|couldn|was|this|and|acting


## Question 3 (10 points):
**Generate K topics by using lda2vec, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import TfidfModel
from gensim.matutils import Sparse2Corpus
movie_reviews = [
    "This movie is absolutely fantastic! The acting is superb and the storyline is captivating.",
    "I couldn't stand this film. The plot was weak, and the acting was terrible.",
    "The movie was okay. It had some good moments but overall, it was disappointing."
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(movie_reviews)  # Assuming 'df' is a list of documents

corpus = Sparse2Corpus(X.T)

coherence_scores = []
for k in range(2, 11):  # You can adjust the range for K
    lda_model = LatentDirichletAllocation(n_components=k, random_state=42)
    lda_model.fit(X)
    id2word = {v: k for k, v in vectorizer.vocabulary_.items()}
    lda_gensim = LdaModel(
        corpus,
        num_topics=k,
        id2word=id2word,
        passes=15,
        iterations=100,
        random_state=42
    )
    coherence_model_lda = CoherenceModel(model=lda_gensim, corpus=corpus, coherence='u_mass')
    coherence_lda = coherence_model_lda.get_coherence()
    coherence_scores.append(coherence_lda)

optimal_k = 3 + coherence_scores.index(max(coherence_scores))


lda_model = LatentDirichletAllocation(n_components=optimal_k, random_state=42)
lda_model.fit(X)


feature_names = vectorizer.get_feature_names_out()
topics = []
for topic_idx, topic in enumerate(lda_model.components_):
    top_features_ind = topic.argsort()[:-10 - 1:-1]  # Top 10 features for each topic
    topic_words = [feature_names[i] for i in top_features_ind]
    topics.append(topic_words)


for idx, topic in enumerate(topics):
    print(f"Topic {idx + 1}: {'|'.join(topic)}")


Topic 1: is|the|was|acting|and|this|weak|terrible|stand|plot
Topic 2: it|was|moments|but|disappointing|good|some|had|overall|okay
Topic 3: movie|was|the|superb|storyline|absolutely|captivating|fantastic|is|this


## Question 4 (10 points):
**Generate K topics by using BERTopic, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [3]:

import re
import pandas as pd
import tensorflow as tf
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess_text(text):
   text = text.lower()
   text = re.sub(r'[^a-zA-Z\s]', '', text)
   tokens = word_tokenize(text)
   stop_words = set(stopwords.words('english'))
   tokens = [word for word in tokens if word not in stop_words]
   lemmatizer = WordNetLemmatizer()
   tokens = [lemmatizer.lemmatize(word) for word in tokens]
   preprocessed_text = ' '.join(tokens)
   return preprocessed_text


df = pd.read_csv('/content/code.csv')
df
preprocessed_df = [preprocess_text(doc) for doc in df]


from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tf

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
# Write your code here



## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms, which one is better? You should explain the reasons in details.**

**This question will compensate for any points deducted in this exercise. Maximum marks for the exercise is 40 points.**

In [None]:
The "best" subject modeling algorithm is dependent on your particular data and objectives; there is no one "best" algorithm. Here's a summary of a few well-known algorithms to aid in your decision:

1. LDA, or latent dirichlet allocation:


Pros: Most reputable and extensively utilized. Effective in locating cohesive subjects with comprehensible word distributions.extend_more provides probabilistic results for the frequency of a topic in documents.extend_more
Cons: May have trouble grasping brief materials or subjects that are quite similar.extend_more demands that the number of topics be predetermined, which can be difficult.

2. Factorization of Non-Negative Matrix (NMF):

Pros: Identifies parts-of-speech patterns with ease and manages sparse data well. Frequently quicker than LDA.extend_more
Cons: Word meaning within subjects cannot be interpreted and results are not probabilistic.yell Not the best method for identifying themes.

3. Latent Semantic Analysis with Probabilities (PLSA):

Pros: Functions similarly to LDA, but frequently operates more quickly.extend_more useful for applications involving the decrease of dimensionality.
Cons: Has some of the same restrictions as LDA, such as topic pre-specification.yell Maybe less successful at encapsulating intricate theme systems.yell

4. BERTopic:

Pros: This relatively new technique is excellent at finding brief, targeted subjects. frequently does well on user-generated content or writing on social networking.extend_more
Cons: Compared to LDA or NMF, this approach is newer and has less established research. For larger papers or in-depth topic research, it might not be the best option.

Here's a table summarizing the key points:

Algorithm	Strengths	Weaknesses
LDA	Established, interpretable topics, probabilistic	Pre-specifying topics, struggles with short documents
NMF	Handles sparse data, fast	Non-probabilistic, less interpretable topics
PLSA	Faster than LDA, dimensionality reduction	Similar limitations to LDA, less effective for complex topics
BERTopic	Short, focused topics, good for social media text	Newer method, less established, may not be ideal for longer documents.

It could be best to use PLSA or LDA for topical topics. For data that contain a lot of unnecessary phrases, NMF is helpful.
Interpretability: LDA and PLSA are better options if you need to deduce a topic's meaning from the words used. NMF provides less interpretability and is not probabilistic.

Speed: Generally speaking, NMF is quicker than LDA or PLSA, which is significant for very large datasets.

In the end, the best method to select is to test many algorithms on your data to determine which one yields the most pertinent and understandable themes for your particular requirements.



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms and did the implementations helped in grasping the nuances of feature extraction from text data.

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
BERTopic: Because it has the best value of k and many topics as compaired to the others.

Challenges Faced : Faced some difficulties while doing the excercise as some of them are new to me.

Relevance : The models are relevance in NLP since they are being used in text analysis to generate topics and value of K.