# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics by using LDA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [6]:
# Write your code here
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import TfidfModel
from gensim.matutils import Sparse2Corpus

flight_experience = [
    "The budget airline I flew with had cramped seating but provided excellent on-time performance and friendly cabin crew",
    "The long-haul flight I took had spacious seats and a wide selection of entertainment options, making the journey enjoyable and",
    "The flight with XYZ Airlines offered comfortable seating and efficient service",
    "The red-eye flight I booked had minimal turbulence, allowing for a restful journey despite the late departure."
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(flight_experience)


corpus = Sparse2Corpus(X.T)


coherence_scores = []
for k in range(2, 11):
  # You can adjust the range for K
   lda_model = LatentDirichletAllocation(n_components=k, random_state=42)
   lda_model.fit(X)
   id2word = {v: k for k, v in vectorizer.vocabulary_.items()}
   lda_gensim = LdaModel(
       corpus,
       num_topics=k,
       id2word=id2word,
       passes=15,
       iterations=100,
       random_state=42
   )
   coherence_model_lda = CoherenceModel(model=lda_gensim, corpus=corpus, coherence='u_mass')
   coherence_lda = coherence_model_lda.get_coherence()
   coherence_scores.append(coherence_lda)

optimal_k = 2 + coherence_scores.index(max(coherence_scores))

lda_model = LatentDirichletAllocation(n_components=optimal_k, random_state=42)
lda_model.fit(X)


feature_names = vectorizer.get_feature_names_out()
topics = []
for topic_idx, topic in enumerate(lda_model.components_):
   top_features_ind = topic.argsort()[:-10 - 1:-1] # Top 10 features for each topic
   topic_words = [feature_names[i] for i in top_features_ind]
   topics.append(topic_words)

for idx, topic in enumerate(topics):
    print(f"Topic {idx+1}: {topic}")


Topic 1: ['the', 'despite', 'departure', 'for', 'red', 'restful', 'eye', 'minimal', 'late', 'booked']
Topic 2: ['flight', 'journey', 'had', 'the', 'offered', 'airlines', 'comfortable', 'efficient', 'xyz', 'service']
Topic 3: ['flight', 'journey', 'had', 'the', 'offered', 'airlines', 'comfortable', 'efficient', 'xyz', 'service']
Topic 4: ['and', 'the', 'seating', 'with', 'flight', 'xyz', 'airlines', 'offered', 'efficient', 'service']


## Question 2 (10 Points)

**Generate K topics by using LSA, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [5]:
# Write your code here
number_of_topics = 4


In [26]:
import nltk
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from gensim.models import LsiModel
from gensim.corpora import Dictionary
from gensim.matutils import corpus2dense
from gensim.parsing.preprocessing import preprocess_string
from gensim.models.coherencemodel import CoherenceModel


corpus = [
    "The budget airline I flew with had cramped seating but provided excellent on-time performance and friendly cabin crew",
    "The long-haul flight I took had spacious seats and a wide selection of entertainment options, making the journey enjoyable and",
    "The flight with XYZ Airlines offered comfortable seating and efficient service",
    "The red-eye flight I booked had minimal turbulence, allowing for a restful journey despite the late departure"
]
def prepare_corpus(doc_clean):
    """
    Input  : clean document
    Purpose: create term dictionary of our courpus and Converting list of documents (corpus) into Document Term Matrix
    Output : term dictionary and Document Term Matrix
    """
    # Creating the term dictionary of our courpus, where every unique term is assigned an index. dictionary = corpora.Dictionary(doc_clean)
    dictionary = corpora.Dictionary(doc_clean)
    # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
    # generate LDA model
    return dictionary,doc_term_matrix


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [13]:
def create_gensim_lsa_model(doc_clean,number_of_topics,words):
      """
      Input  : clean document, number of topics and number of words associated with each topic
      Purpose: create LSA model using gensim
    Output : return LSA model
    """
      dictionary,doc_term_matrix=prepare_corpus(doc_clean)
    # generate LSA model
      lsamodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word = dictionary)
      print(lsamodel.print_topics(num_topics=number_of_topics, num_words=words))
      return lsamodel

In [14]:
def compute_coherence_values_lsa(dictionary, doc_term_matrix, doc_clean, start, limit, step):

     coherence_values = []
     model_list = []
     for num_topics in range(start, limit, step):
        # generate LSA model
        model = LsiModel(doc_term_matrix, num_topics=7, id2word = dictionary)  # train model
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=doc_clean, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
     return model_list, coherence_values

In [32]:
model_list_lsa,coherence_lsa = compute_coherence_values_lsa(corpora.dictionary,corpus,WordNetLemmatizer.lemmatize,start=2,limit=40,step=6)


NameError: name 'TruncatedSVD' is not defined

In [30]:
for m, cv in zip(6,coherence_lsa):
  print("Num Topics=",m," has coherence value of", round(cv, 4))

optimal_model_lsa = model_list_lsa[5]
model_topics_lsa = optimal_model_lsa.show_topics(formatted=False)
print(optimal_model_lsa.print_topics(num_words=10))

for no_of_topics, cv in zip(range(2, 60, 6), coherence_lsa):
  print("Num Topics:", no_of_topics, " - Coherence Value:", round(cv, 4))
optimal_model_lsa = model_list_lsa[5]
model_topics_lsa = optimal_model_lsa.show_topics(formatted=False)
print(optimal_model_lsa.print_topics(num_words=10))
for no_of_topics, cv in zip(range(2, 60, 6), coherence_lsa):
  print("Num Topics:", no_of_topics, " - Coherence Value:", round(cv, 4))

NameError: name 'coherence_lsa' is not defined

## Question 3 (10 points):
**Generate K topics by using lda2vec, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import TfidfModel
from gensim.matutils import Sparse2Corpus
flight_experience = [
    "The budget airline I flew with had cramped seating but provided excellent on-time performance and friendly cabin crew",
    "The long-haul flight I took had spacious seats and a wide selection of entertainment options, making the journey enjoyable and",
    "The flight with XYZ Airlines offered comfortable seating and efficient service",
    "The red-eye flight I booked had minimal turbulence, allowing for a restful journey despite the late departure."
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(flight_experience)
  # Assuming 'df' is a list of documents

corpus = Sparse2Corpus(X.T)

coherence_scores = []
for k in range(2, 11):  # You can adjust the range for K
    lda_model = LatentDirichletAllocation(n_components=k, random_state=42)
    lda_model.fit(X)
    id2word = {v: k for k, v in vectorizer.vocabulary_.items()}
    lda_gensim = LdaModel(
        corpus,
        num_topics=k,
        id2word=id2word,
        passes=15,
        iterations=100,
        random_state=42
    )
    coherence_model_lda = CoherenceModel(model=lda_gensim, corpus=corpus, coherence='u_mass')
    coherence_lda = coherence_model_lda.get_coherence()
    coherence_scores.append(coherence_lda)

optimal_k = 3 + coherence_scores.index(max(coherence_scores))


lda_model = LatentDirichletAllocation(n_components=optimal_k, random_state=42)
lda_model.fit(X)


feature_names = vectorizer.get_feature_names_out()
topics = []
for topic_idx, topic in enumerate(lda_model.components_):
    top_features_ind = topic.argsort()[:-10 - 1:-1]
     # Top 10 features for each topic
    topic_words = [feature_names[i] for i in top_features_ind]
    topics.append(topic_words)


for idx, topic in enumerate(topics):
    print(f"Topic {idx + 1}: {'|'.join(topic)}")


Topic 1: had|the|seating|with|journey|flight|and|airlines|comfortable|efficient
Topic 2: and|the|flight|xyz|offered|airlines|efficient|service|comfortable|with
Topic 3: had|the|seating|with|journey|flight|and|airlines|comfortable|efficient
Topic 4: airline|cramped|performance|provided|friendly|flew|excellent|crew|cabin|time
Topic 5: the|despite|departure|for|eye|red|restful|late|minimal|booked


## Question 4 (10 points):
**Generate K topics by using BERTopic, the number of topics K should be decided by the coherence score, then summarize what are the topics.**

You may refer the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [2]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to encode text into BERT embeddings
def encode_text(text):
    input_ids = tokenizer.encode(text, add_special_tokens=True, max_length=512, truncation=True)
    with torch.no_grad():
        outputs = model(torch.tensor([input_ids]))
        embeddings = torch.mean(outputs.last_hidden_state, dim=1).squeeze()
    return embeddings

# Sample query
query = "Flight_experience"

# Sample text data (preprocessed)
preprocessed_text_data = [
    "The budget airline I flew with had cramped seating but provided excellent on-time performance and friendly cabin crew",
    "The long-haul flight I took had spacious seats and a wide selection of entertainment options, making the journey enjoyable and",
    "The flight with XYZ Airlines offered comfortable seating and efficient service",
    "The red-eye flight I booked had minimal turbulence, allowing for a restful journey despite the late departure."
]

# Encode query into BERT embeddings
query_embedding = encode_text(query)

# Encode text data into BERT embeddings
text_embeddings = torch.stack([encode_text(text) for text in preprocessed_text_data])

# Calculate cosine similarity between query and each text in the data
similarities = cosine_similarity(query_embedding.unsqueeze(0), text_embeddings)

# Rank the similarities in descending order
ranked_indices = similarities.argsort()[0][::-1]

# Print the ranked texts
print("Ranked Texts based on Similarity with Query:")
for i, idx in enumerate(ranked_indices, 1):
    similarity_score = similarities[0][idx]
    text = preprocessed_text_data[idx]
    print(f"{i}. Similarity: {similarity_score:.4f}")
    print(f" Text: {text}\n")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Ranked Texts based on Similarity with Query:
1. Similarity: 0.5970
 Text: The flight with XYZ Airlines offered comfortable seating and efficient service

2. Similarity: 0.5619
 Text: The budget airline I flew with had cramped seating but provided excellent on-time performance and friendly cabin crew

3. Similarity: 0.5087
 Text: The long-haul flight I took had spacious seats and a wide selection of entertainment options, making the journey enjoyable and

4. Similarity: 0.5058
 Text: The red-eye flight I booked had minimal turbulence, allowing for a restful journey despite the late departure.



## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms, which one is better? You should explain the reasons in details.**

**This question will compensate for any points deducted in this exercise. Maximum marks for the exercise is 40 points.**

In [33]:
'''
The best subject modeling algorithm depends on the particular data and goals;
there isn't a single algorithm that works better everywhere. While NMF excels in handling limited data and computational efficiency, LDA is well-known for its interpretability and probabilistic conclusions.
In the end, though, the decision comes down to putting several algorithms through their paces with your data to see which one best identifies themes that are clear and pertinent to your particular requirements.
'''

"\nThe best subject modeling algorithm depends on the particular data and goals;\nthere isn't a single algorithm that works better everywhere. While NMF excels in handling limited data and computational efficiency, LDA is well-known for its interpretability and probabilistic conclusions.\nIn the end, though, the decision comes down to putting several algorithms through their paces with your data to see which one best identifies themes that are clear and pertinent to your particular requirements.\n"

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms and did the implementations helped in grasping the nuances of feature extraction from text data.

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [34]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
This exercise deepened my understanding of feature extraction and text similarity ranking using advanced NLP techniques like cosine similarity and BERT embeddings. I particularly valued learning about text preprocessing, eature extraction with models like BERT, and similarity score calculation with cosine similarity. However, I faced challenges in comprehending and efficiently using BERT embeddings for text similarity ranking, as well as in managing dependencies across different libraries and versions. In the context of my NLP studies, this exercise is highly relevant, offering insights into essential concepts and methods for tasks like text classification and sentiment analysis. Ultimately, it provided valuable practical insights into implementing real-world NLP tasks using Python and established frameworks.

'''

'\nThis exercise deepened my understanding of feature extraction and text similarity ranking using advanced NLP techniques like cosine similarity and BERT embeddings. I particularly valued learning about text preprocessing, eature extraction with models like BERT, and similarity score calculation with cosine similarity. However, I faced challenges in comprehending and efficiently using BERT embeddings for text similarity ranking, as well as in managing dependencies across different libraries and versions. In the context of my NLP studies, this exercise is highly relevant, offering insights into essential concepts and methods for tasks like text classification and sentiment analysis. Ultimately, it provided valuable practical insights into implementing real-world NLP tasks using Python and established frameworks.\n\n'