# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [1]:
import re
import nltk
from nltk.corpus import stopwords
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel

# Fetch the stopword list once; without it stopwords.words() raises LookupError
nltk.download('stopwords', quiet=True)

def analyze_text_with_lda_coherence(documents):

  # Preprocess: lowercase, strip punctuation, drop stopwords
  stop_words = set(stopwords.words('english'))
  pDocument = []
  for doc in documents:
    txt = " ".join(doc).lower()  # Join tokens so punctuation is stripped in one pass
    txt = re.sub(r'[^\w\s]', '', txt)
    pDocument.append([word for word in txt.split() if word not in stop_words])

  # Convert text to a bag-of-words corpus
  dictionary = corpora.Dictionary(pDocument)
  corpus = [dictionary.doc2bow(doc) for doc in pDocument]

  # Train one LDA model per K and record its coherence score
  k_values = list(range(5, 15))  # Adjust range as needed
  coherence_scr = []
  for K in k_values:
    lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=K, random_state=42)
    coherence_model = CoherenceModel(model=lda_model, texts=pDocument, dictionary=dictionary, coherence='u_mass')
    coherence_scr.append(coherence_model.get_coherence())

  return k_values, coherence_scr

# Example usage (replace with your actual corpus)
documents = [
    ["Hello", "doctor", "im", "feeling", "sick", "and"],
    ["my", "health", "is", "not", "good", ","],
    ["doctor", "please", "give", "me", "medications"]
]

k_values, coherence_scr = analyze_text_with_lda_coherence(documents)

print("Coherence scores for different K values:")
for K, score in zip(k_values, coherence_scr):  # Pair each score with the K that produced it
  print(f"K={K}: {score}")
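Question 1 asks for K to be chosen by the coherence score. The following is a minimal sketch of that selection step, assuming the candidate K values and their scores are held in two parallel lists. The helper name `pick_best_k` and the scores in the usage line are hypothetical and only illustrate the idea. For `u_mass`, scores are typically negative and values closer to zero indicate more coherent topics, so the sketch simply takes the maximum.

```python
def pick_best_k(k_values, coherence_scores):
    """Return (K, score) for the highest coherence score.

    Hypothetical helper: assumes both lists are aligned by position.
    Ties go to the earliest entry, i.e. the smallest K if k_values is ascending.
    """
    if not k_values or len(k_values) != len(coherence_scores):
        raise ValueError("k_values and coherence_scores must be non-empty and the same length")
    best_index = max(range(len(coherence_scores)), key=lambda i: coherence_scores[i])
    return k_values[best_index], coherence_scores[best_index]


# Made-up scores for illustration only; they are not results from this notebook.
best_k, best_score = pick_best_k([5, 6, 7], [-3.2, -1.1, -2.0])
print(f"Best K: {best_k} (coherence {best_score})")
```

The model trained with the selected K can then be inspected (for example with gensim's `print_topics`) to write the topic summary.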


## Question 2 (10 Points)

**Generate K topics using LSA. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [2]:
# Write your code here
import re
import nltk
from nltk.corpus import stopwords
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel  # Note: CoherenceModel also works for LSA

# Fetch the stopword list once; without it stopwords.words() raises LookupError
nltk.download('stopwords', quiet=True)

def analyze_text_with_lsa_coherence(documents):

    # Preprocess the text data (identical to the LDA example)
    stop_words = set(stopwords.words('english'))
    pDocument = []
    for doc in documents:
        txt = " ".join(doc).lower()  # Join tokens so punctuation is stripped in one pass
        txt = re.sub(r'[^\w\s]', '', txt)
        pDocument.append([word for word in txt.split() if word not in stop_words])

    # Convert text to a bag-of-words corpus
    dictionary = corpora.Dictionary(pDocument)
    corpus = [dictionary.doc2bow(doc) for doc in pDocument]

    # Apply TF-IDF weighting before LSA
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]

    # Train one LSA model per K and record its coherence score
    k_values = list(range(5, 15))  # Adjust range as needed
    coherence_scr = []
    for K in k_values:
        lsa_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=K)
        coherence_model = CoherenceModel(model=lsa_model, texts=pDocument, dictionary=dictionary, coherence='u_mass')
        coherence_scr.append(coherence_model.get_coherence())

    return k_values, coherence_scr

# Example usage (identical to the LDA example)
documents = [
    ["Hello", "doctor", "im", "feeling", "sick", "and"],
    ["my", "health", "is", "not", "good", ","],
    ["doctor", "please", "give", "me", "medications"]
]

k_values, coherence_scr = analyze_text_with_lsa_coherence(documents)

print("Coherence scores for different K values:")
for K, score in zip(k_values, coherence_scr):  # Pair each score with the K that produced it
    print(f"K={K}: {score}")
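Both cells above request `coherence='u_mass'` from gensim. To make clearer what that score measures, here is a rough sketch of a UMass-style coherence for a single topic, computed from document co-occurrence counts. It uses add-one smoothing and a plain average over word pairs, which is an assumption made for illustration; gensim's implementation differs in details such as the smoothing constant and aggregation, so the numbers will not match exactly. The function name `umass_coherence` and the toy documents are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, tokenized_docs):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over ordered word pairs of one topic.

    D(...) counts the documents that contain all of the given words.
    Illustrative sketch only, not gensim's exact computation.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    pair_scores = []
    # j < i: the earlier (higher-ranked) word w_j is the conditioning word
    for j, i in combinations(range(len(topic_words)), 2):
        w_j, w_i = topic_words[j], topic_words[i]
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # Skip words that never occur in the corpus
        pair_scores.append(math.log((doc_freq(w_i, w_j) + 1) / d_j))
    return sum(pair_scores) / len(pair_scores) if pair_scores else float("nan")


# Toy corpus for illustration only
toy_docs = [["doctor", "sick"], ["doctor", "medications"], ["health", "good"]]
print(umass_coherence(["doctor", "sick"], toy_docs))    # words that co-occur
print(umass_coherence(["doctor", "health"], toy_docs))  # words that never co-occur
```

Word pairs that appear together in documents score higher than pairs that never co-occur, which is why a higher (less negative) value suggests a more coherent topic.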


## Question 3 (10 points):
**Generate K topics using lda2vec. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [3]:
from lda2vec import preprocess, Corpus
from lda2vec import LDA2Vec
import numpy as np
import os

# Sample text
text = "hello doctor, im feeling sick and my health is not good, doctor please give me medications"

# Preprocess the text
tokens, vocab = preprocess.tokenize(text)
corpus = Corpus()
corpus.update_word_count(tokens)
corpus.finalize()

# Create a model
model = LDA2Vec(corpus, num_topics=5, num_words=10, word_embedding_size=100, num_epochs=20)

# Train the model
model.fit(corpus, num_epochs=20, log=True)

# Get the topics
topics = model.get_topics(n_words=5)
for i, topic in enumerate(topics):
    print(f"Topic {i+1}: {', '.join(topic)}")


ModuleNotFoundError: No module named 'lda2vec'

In [None]:
pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.5.tar.gz (90 kB)
  Preparing metadata (setup.py) ... done
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-2.6.1-py3-none-any.whl (163 kB)

## Question 4 (10 points):
**Generate K topics using BERTopic. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [4]:
import re
from bertopic import BERTopic  # The package is 'bertopic'; the class is 'BERTopic'
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel

def analyze_text_with_bertopic_coherence(documents):
  # Rebuild each token list into a lowercased, punctuation-free sentence
  pDocument = []
  for doc in documents:
    txt = " ".join(doc).lower()
    txt = re.sub(r'[^\w\s]', '', txt)
    pDocument.append(txt)

  # BERTopic picks the number of topics itself by clustering document embeddings
  topic_model = BERTopic()
  topics, _ = topic_model.fit_transform(pDocument)

  # Collect the top words of every topic, skipping the outlier topic -1
  tokenized = [txt.split() for txt in pDocument]
  dictionary = corpora.Dictionary(tokenized)
  topic_words = {}
  for topic_id in sorted(set(topics)):
    if topic_id == -1:
      continue
    words = [word for word, _ in topic_model.get_topic(topic_id) if word in dictionary.token2id]
    if words:
      topic_words[topic_id] = words

  # BERTopic does not report coherence itself, so score its topics with gensim
  coherence_model = CoherenceModel(topics=list(topic_words.values()), texts=tokenized, dictionary=dictionary, coherence='u_mass')
  coherence_score = coherence_model.get_coherence()

  # Summarize topics: top 5 keywords each, plus the overall coherence
  for topic_id, words in topic_words.items():
    print(f"Topic {topic_id}: {', '.join(words[:5])}")
  print(f"Coherence score (u_mass): {coherence_score}")
  return topic_model, coherence_score

# Example usage (identical to the LSA example).
# BERTopic needs far more than three short documents to form clusters;
# replace this sample with the full corpus from the previous exercise.
documents = [
  ["Hello", "doctor", "im", "feeling", "sick", "and"],
  ["my", "health", "is", "not", "good", ","],
  ["doctor", "please", "give", "me", "medications"]
]

analyze_text_with_bertopic_coherence(documents)
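For the "summarize the topics" part of this question, BERTopic exposes each topic as a list of `(word, score)` pairs, and `get_topics()` returns them as a dictionary keyed by topic id. The sketch below formats such a dictionary into one line per topic. The helper name `summarize_topics` and the sample dictionary are hypothetical, and the sample values are made up rather than produced by a fitted model.

```python
def summarize_topics(topic_word_scores, top_n=5):
    """Return one 'Topic <id>: w1, w2, ...' line per topic.

    Hypothetical helper: expects a dict mapping topic id to a list of
    (word, score) pairs sorted by score. Topic -1 (outliers) is skipped.
    """
    lines = []
    for topic_id in sorted(topic_word_scores):
        if topic_id == -1:
            continue
        words = [word for word, _ in topic_word_scores[topic_id][:top_n] if word]
        lines.append(f"Topic {topic_id}: {', '.join(words)}")
    return lines


# Made-up input in the same shape as BERTopic's get_topics() output
sample_topics = {
    -1: [("and", 0.01)],
    0: [("doctor", 0.42), ("medications", 0.31), ("sick", 0.22)],
    1: [("health", 0.38), ("good", 0.27)],
}
for line in summarize_topics(sample_topics, top_n=2):
    print(line)
```

With a fitted model, the same helper could be called as `summarize_topics(topic_model.get_topics())`.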

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [None]:
# Write your code here
# In my opinion, LDA is the more accurate of the models here. LSA relies on
# dimensionality reduction, whereas LDA is more widely used for topic modeling.


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer to the questions above in as much detail as possible):

'''
Please write your answer here:

I found the BERTopic model the most challenging and was unable to get the expected result. I understood the key concepts.



'''