<a href="https://colab.research.google.com/github/rmvsaipavan/manivenkatasaipavan_INFO5731_Fall2023/blob/main/Ramisetty_Exercise_04_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **The fourth in-class-exercise (40 points in total, 03/28/2022)**

Question description: Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks:

## (1) (10 points) Generate K topics by using LDA, the number of topics K should be decided by the coherence score, then summarize what are the topics. You may refer the code here:

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [None]:
# Install necessary libraries if not already installed
# !pip install pandas gensim spacy

import pandas as pd
import re
import spacy
from gensim import corpora, models
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Load the data
df = pd.read_csv('book_stall1.csv')

# Preprocess the data
nlp = spacy.load('en_core_web_sm')

def preprocess_text(text):
    # Remove punctuation, numbers, and special characters
    text = re.sub(r'[^a-zA-Z]', ' ', text)

    # Convert to lowercase and tokenize
    tokens = [token.lemma_ for token in nlp(text.lower()) if not token.is_stop]

    return tokens

# Apply preprocessing to the 'text' column
df['tokens'] = df['title'].apply(preprocess_text)

# Create a dictionary and a corpus
dictionary = corpora.Dictionary(df['tokens'])
corpus = [dictionary.doc2bow(tokens) for tokens in df['tokens']]

# Determine the optimal number of topics using coherence score
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Choose a range of potential number of topics (adjust as needed)
start = 2
limit = 12
step = 1

model_list, coherence_values = compute_coherence_values(dictionary, corpus, df['tokens'], limit, start, step)

# Find the optimal number of topics
optimal_num_topics = start + coherence_values.index(max(coherence_values)) * step

# Train the final LDA model with the optimal number of topics
final_lda_model = model_list[coherence_values.index(max(coherence_values))]

# Extract and summarize topics
topics = final_lda_model.show_topics(formatted=False)

# ... (previous code remains the same)

# Print the coherence score
print(f'Optimal Number of Topics: {optimal_num_topics}')
print(f'Coherence Score: {max(coherence_values):.4f}')

# Extract and summarize topics
topics = final_lda_model.show_topics(formatted=False)


# Print out the topics
for topic in topics:
    print(f'Topic {topic[0]+1}:')
    keywords = [word for word, _ in topic[1]]
    print(f'  Keywords: {", ".join(keywords)}')




Optimal Number of Topics: 5
Coherence Score: 0.5772
Topic 1:
  Keywords:  ,    ,     ,   , s, vol, life, world, love, story
Topic 2:
  Keywords:  ,    ,      , s, year, new,   ,     , city, love
Topic 3:
  Keywords:  ,    , s,     , day, chronicle, note, death, family, book
Topic 4:
  Keywords:  ,    , fruit, basket,     , s, life, vol, boy, heaven
Topic 5:
  Keywords:  ,    , girl, secret, art, lose, work, find, love, shopaholic


## (2) (10 points) Generate K topics by using LSA, the number of topics K should be decided by the coherence score, then summarize what are the topics. You may refer the code here:

https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from gensim.models import CoherenceModel
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel

# Step 1: Load and preprocess data
df = pd.read_csv('book_stall1.csv')
documents = df['title'].tolist()  # Replace 'text_column' with the actual column name containing text

# Step 2: Text preprocessing (tokenization, stopword removal, lemmatization) - You can use libraries like spaCy or NLTK for this.

# Step 3: Create TF-IDF matrix
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Step 4: Apply LSA to reduce dimensionality
lsa_model = TruncatedSVD(n_components=300)  # Adjust n_components as needed
lsa_matrix = lsa_model.fit_transform(tfidf_matrix)

# Step 5 (continued): Determine optimal number of topics (K) using coherence score
dictionary = Dictionary([doc.split() for doc in documents])
corpus = [dictionary.doc2bow(doc.split()) for doc in documents]
coherence_scores = []
for k in range(2, 11):  # Adjust the range of K as needed
    lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k)
    coherence_model = CoherenceModel(model=lda_model, texts=documents, dictionary=dictionary, coherence='c_v')
    coherence_scores.append(coherence_model.get_coherence())

optimal_k = coherence_scores.index(max(coherence_scores)) + 2  # Add 2 because we started from K=2

# Print optimal number of topics and coherence score
print(f"Optimal Number of Topics: {optimal_k}")

# Step 6: Generate topics using LSA
final_lsa_model = TruncatedSVD(n_components=optimal_k)
final_lsa_matrix = final_lsa_model.fit_transform(tfidf_matrix)

# Step 7: Summarize topics
terms = tfidf_vectorizer.get_feature_names_out()  # Get the terms (words)
for i, topic in enumerate(final_lsa_model.components_):
    topic_terms = " ".join([terms[j] for j in topic.argsort()[:-10 - 1:-1]])  # Get top 10 terms for each topic
    print(f"Topic {i+1}: {topic_terms}")


  m_lr_i = np.log(numerator / denominator)
  return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))


Optimal Number of Topics: 2
Topic 1: girl life trilogy millennium ice murder guide lost good train
Topic 2: life vol love art american world earth basket fruits story


## (3) (10 points) Generate K topics by using  lda2vec, the number of topics K should be decided by the coherence score, then summarize what are the topics. You may refer the code here:

https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:
# Install the required packages
!pip install lda2vec spacy

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import spacy

# Step 1: Load and preprocess data
df = pd.read_csv('book_stall1.csv')
documents = df['title'].tolist()  # Replace 'text_column' with the actual column name containing text

# Step 2: Text preprocessing (tokenization, stopword removal, lemmatization)
nlp = spacy.load("en_core_web_sm")

processed_documents = []
for doc in documents:
    doc = nlp(doc)
    processed_doc = " ".join([token.lemma_ for token in doc if not token.is_stop and token.is_alpha])
    processed_documents.append(processed_doc)

# Step 3: Create a TF-IDF matrix
tfidf_vectorizer = CountVectorizer(max_df=0.8, min_df=2, stop_words=ENGLISH_STOP_WORDS)
tfidf_matrix = tfidf_vectorizer.fit_transform(processed_documents)

# Step 4: Apply LDA to generate topics
lda_model = LatentDirichletAllocation(n_components=10, random_state=42)  # You can adjust the number of topics
lda_model.fit(tfidf_matrix)

# Step 5: Summarize topics
feature_names = tfidf_vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(lda_model.components_):
    print(f"Topic #{topic_idx+1}:")
    print(" ".join([feature_names[i] for i in topic.argsort()[:-10 - 1:-1]]))


Topic #1:
shades novel great woman vampire kitchen legend island project memoir
Topic #2:
story love world art trilogy true play new earth time
Topic #3:
vol basket volume saga collected red editions fruits fruit sandman
Topic #4:
harry potter dark day prince art giant games vol game
Topic #5:
guide thing home lose war day new tell genius wild
Topic #6:
life ice love death vol black search high note walk
Topic #7:
girl live good end raven cycle life history heart leave
Topic #8:
chronicles science god shadow shopaholic night vol new human mind
Topic #9:
book paris life run court glass world midnight city little
Topic #10:
secret america murder universe american grayson little complete love boy


## (4) (10 points) Generate K topics by using BERTopic, the number of topics K should be decided by the coherence score, then summarize what are the topics. You may refer the code here:

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [2]:
pip install bertopic

Collecting bertopic
  Downloading bertopic-0.15.0-py2.py3-none-any.whl (143 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.4/143.4 kB[0m [31m642.1 kB/s[0m eta [36m0:00:00[0m
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.2/5.2 MB[0m [31m20.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.4.tar.gz (90 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m90.8/90.8 kB[0m [31m12.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[2K     [90m━━━━━━━━━━━━━━

In [11]:
import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Load the data from the CSV file
data = pd.read_csv("book_stall1.csv")
text_data = data['title'].astype(str).tolist()

# Initialize BERTopic
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)

# Fit the model to the data
topics, _ = topic_model.fit_transform(text_data)

# Calculate coherence scores for different numbers of topics
coherence_scores = []
range_of_topics = range(2, 21)  # Adjust the range as needed

for k in range_of_topics:
    model = BERTopic(language="english", calculate_probabilities=True, verbose=True, nr_topics=k)
    topics, _ = model.fit_transform(text_data)
    coherence_score = model.get_topic_info().coherence_mean.mean()
    coherence_scores.append(coherence_score)

# Determine the optimal number of topics with the highest coherence score
best_k = range_of_topics[coherence_scores.index(max(coherence_scores))]
best_coherence_score = max(coherence_scores)

print(f"The best number of topics is {best_k} with a coherence score of {best_coherence_score}")

# Train the final BERTopic model with the optimal number of topics
model = BERTopic(language="english", calculate_probabilities=True, verbose=True, nr_topics=best_k)
topics, _ = model.fit_transform(text_data)

# Summarize the topics
most_frequent_topics = model.get_topic_info()
most_frequent_topics


## (5) (10 extra points) Compare the results generated by the four topic modeling algorithms, which one is better? You should explain the reasons in details.

In [None]:
# Write your answer here (no code needed for this question)
"""
LSA modeling algorithm is more efficient than the other three algorithms.
The statement that "LSA modeling algorithm is more efficient than the other three algorithms (LDA, lda2vec, and BERTopic)
when dealing with large datasets" can be supported based on the following reasons:

1. Dimensionality Reduction: LSA is primarily a dimensionality reduction technique that uses Singular Value Decomposition (SVD)
to reduce the dimensionality of the document-term matrix. By reducing the number of dimensions, LSA can handle large datasets
more efficiently. This reduction in dimensionality helps in reducing the computational complexity and memory requirements,
making LSA more suitable for large datasets.

2. Computational Efficiency: LSA performs SVD on the document-term matrix, which has a time complexity of O(n^2m), where n is
the number of documents and m is the number of unique terms. However, the SVD operation can be efficiently implemented using
optimized libraries like SciPy or NumPy, which further improves the computational efficiency of LSA. In contrast, algorithms
like LDA, lda2vec, and BERTopic may require more computational resources and training time due to their complex architectures
and larger model sizes.

3. Scalability: LSA can handle large datasets by processing them in batches or by using distributed computing frameworks like
Spark. This scalability allows LSA to efficiently process and analyze large volumes of text data. On the other hand, algorithms
like LDA, lda2vec, and BERTopic may face challenges in terms of memory requirements and training time when dealing with large
datasets.

4. Trade-off with Interpretability: While LSA is efficient in handling large datasets, it may sacrifice some interpretability
compared to algorithms like LDA. LSA focuses on capturing latent semantic relationships and reducing noise, but it may not
provide explicit topic-word distributions or document-topic proportions.If interpretability is crucial, LDA may be a superior
option, even if it is less efficient for huge datasets.

In summary, due to its dimensionality reduction capabilities, computational efficiency, scalability, and capacity to handle
enormous volumes of text data, LSA is regarded more efficient than LDA, lda2vec, and BERTopic when dealing with large datasets.
However, before selecting the best algorithm, consider the trade-off with interpretability and examine the task's specific
requirements.


"""
