# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
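Because K is chosen by a coherence score, it can help to see roughly what such a score measures. The following is a minimal sketch of a simplified UMass-style coherence in plain Python. It is a stand-in for illustration, not the `c_v` measure that gensim's `CoherenceModel` computes in this notebook; the function name `umass_coherence` and the toy documents are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(top_words, tokenized_docs):
    """Simplified UMass-style coherence for one topic's ranked top words.

    Sums log((D(w_hi, w_lo) + 1) / D(w_hi)) over word pairs, where D counts
    the documents containing the word(s) and w_hi is the higher-ranked word.
    Scores closer to 0 indicate words that tend to co-occur.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for higher, lower in combinations(top_words, 2):
        denom = doc_freq(higher)
        if denom == 0:
            continue  # skip words that never appear in the corpus
        score += math.log((doc_freq(higher, lower) + 1) / denom)
    return score


# Hypothetical toy corpus, loosely modeled on the shoe reviews
toy_docs = [
    ["shoe", "fit", "comfortable"],
    ["shoe", "fit", "wide"],
    ["box", "damage", "shoe"],
    ["son", "game", "practice"],
]

coherent = umass_coherence(["shoe", "fit"], toy_docs)
scattered = umass_coherence(["shoe", "game"], toy_docs)
print(coherent, scattered)
```

In this toy corpus the pair that co-occurs scores higher than the pair that never appears together. Averaging such scores over all topics of a model gives one number per candidate K, which is the quantity being compared when K is selected.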

In [3]:
!pip install pyLDAvis
import nltk
nltk.download('stopwords')

import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])


new_data = [
    "These Jordan’s are authentic & very well made. All leather! My 10 year old grandson loved them, he put them on immediately upon opening. They fit perfectly & comfortably",
    "My son loves these and only wears them for bball practice or a game. They have held up great and still look new",
    "They a great looking shoes comfortable, & are pretty wide I usually get wide made shoes to fit right, but these are good in the size 13 fit.",
    "The box was slightly smashed and thought shoes were damaged. No damage to the shoes. Very comfortable and no major issues (at this point): hopefully no issues for a long time."
]

# Convert to list
data = new_data

# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Remove new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub("'", "", sent) for sent in data]

pprint(data)

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations

data_words = list(sent_to_words(data))

print(data_words)

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out


# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize the spaCy 'en_core_web_sm' model, disabling the parser and NER components (for efficiency)
# python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized)

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus)

id2word[0]

[[(id2word[id], freq) for id, freq in cp] for cp in corpus]

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

# Print the keywords for each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

# Compute Perplexity
# log_perplexity returns the per-word likelihood bound; perplexity is 2^(-bound), so a higher bound means lower (better) perplexity.
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['These Jordan’s are authentic & very well made. All leather! My 10 year old '
 'grandson loved them, he put them on immediately upon opening. They fit '
 'perfectly & comfortably',
 'My son loves these and only wears them for bball practice or a game. They '
 'have held up great and still look new',
 'They a great looking shoes comfortable, & are pretty wide I usually get wide '
 'made shoes to fit right, but these are good in the size 13 fit.',
 'The box was slightly smashed and thought shoes were damaged. No damage to '
 'the shoes. Very comfortable and no major issues (at this point): hopefully '
 'no issues for a long time.']
[['these', 'jordan', 'are', 'authentic', 'very', 'well', 'made', 'all', 'leather', 'my', 'year', 'old', 'grandson', 'loved', 'them', 'he', 'put', 'them', 'on', 'immediately', 'upon', 'opening', 'they', 'fit', 'perfectly', 'comfortably'], ['my', 'son', 'loves', 'these', 'and', 'only', 'wears', 'them', 'for', 'bball', 'practice', 'or', 'game', 'they', 'have', '

## Question 2 (10 Points)

**Generate K topics using LSA. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [2]:
# Write your code here
import os.path
from gensim import corpora
from gensim.models import LsiModel
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt

def load_data():

    documents_list = [
        "These Jordan’s are authentic & very well made. All leather! My 10 year old grandson loved them, he put them on immediately upon opening. They fit perfectly & comfortably",
        "My son loves these and only wears them for bball practice or a game. They have held up great and still look new",
        "They a great looking shoes comfortable, & are pretty wide I usually get wide made shoes to fit right, but these are good in the size 13 fit.",
        "The box was slightly smashed and thought shoes were damaged. No damage to the shoes. Very comfortable and no major issues (at this point): hopefully no issues for a long time."
    ]
    titles = [text[0:min(len(text), 100)] for text in documents_list]
    print("Total Number of Documents:", len(documents_list))
    return documents_list, titles

# Load data
documents_list, _ = load_data()

def preprocess_data(doc_set):
    # initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    # create English stop words list
    en_stop = set(stopwords.words('english'))
    # Create p_stemmer of class PorterStemmer
    p_stemmer = PorterStemmer()
    # list for tokenized documents in loop
    texts = []
    # loop through document list
    for doc in doc_set:
        # clean and tokenize document string
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [token for token in tokens if token not in en_stop]
        # stem tokens
        stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]
        # add tokens to list
        texts.append(stemmed_tokens)
    return texts

# Preprocess the data
doc_clean = preprocess_data(documents_list)

def prepare_corpus(doc_clean):
    # Creating the term dictionary of our corpus, where every unique term is assigned an index.
    dictionary = corpora.Dictionary(doc_clean)
    # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
    # Return dictionary and Document Term Matrix
    return dictionary, doc_term_matrix

def create_gensim_lsa_model(doc_clean):
    dictionary, doc_term_matrix = prepare_corpus(doc_clean)
    # generate LSA model
    lsamodel = LsiModel(doc_term_matrix, num_topics=2, id2word=dictionary)  # train model
    print(lsamodel.print_topics(num_topics=2, num_words=5))
    return lsamodel

def compute_coherence_values(doc_clean, stop, start=2, step=1):
    dictionary, doc_term_matrix = prepare_corpus(doc_clean)

    coherence_values = []
    model_list = []
    for num_topics in range(start, stop, step):
        # generate LSA model
        model = LsiModel(doc_term_matrix, num_topics=num_topics, id2word=dictionary)  # train model
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=doc_clean, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

# Call the function to create the LSA model
lsa_model = create_gensim_lsa_model(doc_clean)

# Call the function to compute coherence values
model_list, coherence_values = compute_coherence_values(doc_clean, stop=5)
print(coherence_values)


Total Number of Documents: 4
[(0, '0.495*"shoe" + 0.334*"fit" + 0.305*"comfort" + 0.277*"wide" + 0.218*"issu"'), (1, '0.354*"damag" + 0.354*"issu" + -0.288*"fit" + -0.200*"made" + 0.178*"shoe"')]
[0.45854376854875567, 0.29251419963123154, 0.4408805529290713]
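The list printed above holds one coherence value per candidate K, in the order produced by `range(start, stop, step)` with `start=2`, `stop=5`, and `step=1`. A minimal sketch of mapping those values back to K and picking the highest-scoring one follows; the helper name `best_num_topics` is hypothetical, and the values are copied from the output above.

```python
def best_num_topics(coherence_values, start=2, step=1):
    """Pair each coherence value with its topic count and return the best K."""
    candidates = range(start, start + step * len(coherence_values), step)
    scored = dict(zip(candidates, coherence_values))
    best_k = max(scored, key=scored.get)  # K with the highest coherence
    return best_k, scored


# Coherence values copied from the LSA output above (K = 2, 3, 4)
lsa_coherence = [0.45854376854875567, 0.29251419963123154, 0.4408805529290713]

best_k, scored = best_num_topics(lsa_coherence, start=2, step=1)
for k, value in scored.items():
    print(f"K={k}: coherence={value:.4f}")
print("Selected K:", best_k)
```

On these values K=2 scores highest. With only four documents the estimates are likely noisy, so the selected K is best treated as a starting point rather than a firm result.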


## Question 3 (10 points):
**Generate K topics using lda2vec. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [5]:

from gensim import corpora, models
import numpy as np
import pyLDAvis
from gensim.models.coherencemodel import CoherenceModel

try:
    import seaborn
except ImportError:
    pass

pyLDAvis.enable_notebook()

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Note: this trains gensim's standard LdaModel, not lda2vec
        model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values


texts = [
"These Jordan’s are authentic & very well made. All leather! My 10 year old grandson loved them, he put them on immediately upon opening. They fit perfectly & comfortably",
    "My son loves these and only wears them for bball practice or a game. They have held up great and still look new",
    "They a great looking shoes comfortable, & are pretty wide I usually get wide made shoes to fit right, but these are good in the size 13 fit.",
    "The box was slightly smashed and thought shoes were damaged. No damage to the shoes. Very comfortable and no major issues (at this point): hopefully no issues for a long time."
]

# Tokenize and preprocess for the given text
tokenized_text = [text.lower().split() for text in texts]

# Creating the dictionary and corpus
dictionary = corpora.Dictionary(tokenized_text)
corpus = [dictionary.doc2bow(text) for text in tokenized_text]

# Computing coherence values
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=tokenized_text, start=2, limit=10, step=1)

# Print the coherence values
for num_topics, coherence_val in zip(range(2, 10), coherence_values):
    print("Num Topics:", num_topics, " - Coherence Value:", coherence_val)


Num Topics: 2  - Coherence Value: 0.6099060769195582
Num Topics: 3  - Coherence Value: 0.8862701315112353
Num Topics: 4  - Coherence Value: 0.893311249727419
Num Topics: 5  - Coherence Value: 0.5718174074141513
Num Topics: 6  - Coherence Value: 0.7643034153506361
Num Topics: 7  - Coherence Value: 0.6144389832488342
Num Topics: 8  - Coherence Value: 0.7358515356695429
Num Topics: 9  - Coherence Value: 0.6153454761799311
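The cell above tokenizes with `text.lower().split()`, which keeps punctuation attached to words (for example `made.` and `leather!`), so the dictionary can end up with several variants of the same word. A minimal sketch comparing that approach with a simple regex tokenizer follows; the function names are hypothetical, and the regex is one possible choice rather than the method used in the cell.

```python
import re


def split_tokens(text):
    """Whitespace tokenization, as in the cell above."""
    return text.lower().split()


def regex_tokens(text):
    """Keep only runs of ASCII letters and digits, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


sample = "These Jordan’s are authentic & very well made. All leather!"

naive = split_tokens(sample)
clean = regex_tokens(sample)

print(naive)
print(clean)
```

Whether the coherence values change noticeably with cleaner tokens would need to be checked by rerunning the cell; the sketch only shows the difference in the token lists.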


## Question 4 (10 points):
**Generate K topics using BERTopic. Choose the number of topics K based on the coherence score, then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [10]:

!pip install --upgrade bertopic
# Importing the necessary library
from bertopic import BERTopic


data = [
    "These Jordan’s are authentic & very well made.",
    "All leather! My 10-year-old grandson loved them, he put them on immediately upon opening.",
    "They fit perfectly & comfortably.",
    "My son loves these and only wears them for basketball practice or a game.",
    "They have held up great and still look new.",
    "They are great-looking shoes, comfortable, & are pretty wide. I usually get wide shoes to fit right,",
    "but these are good in the size 13 fit.",
    "The box was slightly smashed and thought the shoes were damaged.",
    "No damage to the shoes. Very comfortable and no major issues at this point,",
    "hopefully no issues for a long time."
]

# Initialize BERTopic
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)

# Fit BERTopic on the provided data
topics, probs = topic_model.fit_transform(data)

# Get frequency of topics
freq = topic_model.get_topic_info()
print(freq.head(5))

# Get top words associated with a specific topic
topic_words = topic_model.get_topic(0)
print(topic_words)




2024-03-30 03:04:31,427 - BERTopic - Embedding - Transforming documents to embeddings.


2024-03-30 03:04:32,489 - BERTopic - Embedding - Completed ✓
2024-03-30 03:04:32,491 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-30 03:04:36,748 - BERTopic - Dimensionality - Completed ✓
2024-03-30 03:04:36,750 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-30 03:04:36,760 - BERTopic - Cluster - Completed ✓
2024-03-30 03:04:36,768 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-30 03:04:36,787 - BERTopic - Representation - Completed ✓


   Topic  Count                  Name  \
0     -1     10  -1_the_shoes_and_are   

                                      Representation  \
0  [the, shoes, and, are, them, these, they, fit,...   

                                 Representative_Docs  
0  [My son loves these and only wears them for ba...  
False
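In the output above, all 10 documents fall under topic `-1`, which BERTopic uses for outliers, and `get_topic(0)` returns `False` because no topic 0 exists. A minimal sketch of checking the assignment list before reading individual topics follows. It is plain Python with a hypothetical helper name, and the `topics` list is assumed from the count shown in the table above rather than taken from a rerun.

```python
from collections import Counter


def summarize_assignments(topics):
    """Summarize topic assignments, treating -1 as the outlier label."""
    counts = Counter(topics)
    total = len(topics)
    return {
        "n_topics": len([t for t in counts if t != -1]),  # real topics only
        "outlier_share": counts.get(-1, 0) / total if total else 0.0,
        "counts": dict(counts),
    }


# Assumed from the output above: 10 documents, all labeled -1
topics = [-1] * 10

summary = summarize_assignments(topics)
print(summary)
if summary["n_topics"] == 0:
    print("No topics were found; every document was treated as an outlier.")
```

A result like this suggests the corpus is too small for the default clustering settings, so a larger corpus or adjusted settings would be needed before a coherence-based choice of K is meaningful.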


## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [13]:
'''
Based on the data, the four topic modeling techniques (LDA, NMF, LSA, and BERTopic) each have advantages. LDA provides intelligible topics but may omit intricate relationships. While NMF and LSA uncover hidden patterns, their topics may be difficult to interpret. With the help of BERT, BERTopic can better comprehend context and delineate topics, which makes it suitable for large and diverse datasets. Therefore, because BERTopic understands words in context and generates clear topics, it's probably the best option if you need topics that make sense and demonstrate connections.
'''

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [12]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Working with text data and extracting features using various topic modeling algorithms provided a valuable learning experience. I learned algorithms such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and the BERT model. It was a little difficult to determine the ideal number of topics (K) for each algorithm based on coherence scores because it needed some trial and error and an understanding of the trade-offs between interpretability and model complexity. By generating topics using LDA, LSA, lda2vec, and BERTopic, we gain insights into the underlying structures and themes present in the text data. Understanding and summarizing these topics provides valuable information for tasks such as document clustering, categorization, and summarization, which are essential in various NLP applications such as sentiment analysis, document recommendation, and information retrieval.
'''