# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be chosen based on the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [1]:
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
import nltk

# Download NLTK stopwords
nltk.download('stopwords')

# Sample data
data = [
    "Climate change and its effects on global weather patterns",
    "Renewable energy sources and their impact on sustainability",
    "The importance of biodiversity and ecosystem conservation",
    "Pollution control and waste management strategies",
    "Deforestation and its impact on the environment",
    "Sustainable agriculture and food security",
    "Water scarcity and its effects on communities",
    "The role of environmental policy in mitigating climate change",
    "Urban planning and green infrastructure",
    "Ocean conservation and marine life protection"
]


# Preprocess the data
def preprocess(texts):
    # A set gives constant-time membership checks when filtering tokens
    stop_words = set(stopwords.words('english'))
    return [[word for word in simple_preprocess(doc) if word not in stop_words] for doc in texts]

processed_data = preprocess(data)

# Create Dictionary and Corpus
id2word = corpora.Dictionary(processed_data)
corpus = [id2word.doc2bow(text) for text in processed_data]

# Function to compute coherence values for different number of topics
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, passes=10, workers=2)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Compute coherence values for different numbers of topics
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=processed_data, limit=20)

# Select the model with the highest coherence and print the topics
optimal_model = model_list[coherence_values.index(max(coherence_values))]
optimal_num_topics = optimal_model.num_topics
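# show_topics returns only 10 topics by default, so request all of them explicitly
topics = optimal_model.show_topics(num_topics=optimal_num_topics, num_words=10, formatted=False)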

print(f"Optimal Number of Topics: {optimal_num_topics}")
for topic in topics:
    print(f"Topic {topic[0]}: {[word[0] for word in topic[1]]}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Optimal Number of Topics: 14
Topic 8: ['effects', 'water', 'communities', 'scarcity', 'conservation', 'impact', 'climate', 'environment', 'deforestation', 'urban']
Topic 13: ['change', 'climate', 'role', 'policy', 'mitigating', 'global', 'patterns', 'effects', 'weather', 'environmental']
Topic 3: ['impact', 'climate', 'environment', 'conservation', 'communities', 'scarcity', 'food', 'effects', 'importance', 'security']
Topic 11: ['impact', 'climate', 'scarcity', 'deforestation', 'urban', 'environment', 'conservation', 'change', 'effects', 'planning']
Topic 7: ['biodiversity', 'ecosystem', 'conservation', 'importance', 'climate', 'scarcity', 'environment', 'impact', 'food', 'deforestation']
Topic 1: ['ocean', 'management', 'protection', 'strategies', 'conservation', 'pollution', 'control', 'marine', 'waste', 'life']
Topic 12: ['impact', 'food', 'climate', 'environment', 'conservation', 'deforestation', 'scarcity', 'change', 'effects', 'importance']
Topic 10: ['effects', 'climate', 'envi
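
To make the selection criterion in Question 1 more concrete, here is a minimal sketch of a UMass-style coherence score computed by hand. It is not the `c_v` measure used in the cell above, which is more involved. The function name `umass_coherence`, the smoothing constant `epsilon`, and the toy documents are assumptions made for illustration only.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, tokenized_docs, epsilon=1.0):
    """Average UMass-style coherence of one topic.

    topic_words: top words of the topic, most probable first.
    tokenized_docs: list of token lists (same shape as `processed_data`).
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    # For every word pair, condition on the higher-ranked (earlier) word
    for earlier, later in combinations(topic_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # the word never occurs, so the pair is skipped
        scores.append(math.log((doc_freq(earlier, later) + epsilon) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy documents (hypothetical, not the exercise corpus)
docs = [
    ["climate", "change", "weather"],
    ["climate", "policy", "change"],
    ["ocean", "marine", "conservation"],
    ["ocean", "conservation", "protection"],
]

coherent = umass_coherence(["climate", "change"], docs)
mixed = umass_coherence(["climate", "ocean"], docs)
print(coherent, mixed)
```

Words that co-occur in the same documents score higher than words that never appear together, which is the intuition behind choosing the K whose topics have the highest average coherence.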

## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be chosen based on the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [2]:
import gensim
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Download and set up necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Sample dataset
data = [
    "Advances in quantum computing and its applications",
    "The impact of artificial intelligence on healthcare",
    "Blockchain technology and its use in securing data",
    "The future of augmented reality and virtual reality",
    "Cybersecurity threats and prevention measures",
    "The role of machine learning in data analysis",
    "Internet of Things: Connecting the physical and digital worlds",
    "The evolution of mobile technology and its impact on society",
    "Cloud computing trends and innovations",
    "The significance of big data in modern business"
]
# Preprocess the data
stop_words = set(stopwords.words('english'))
texts = [[word for word in word_tokenize(document.lower()) if word not in stop_words] for document in data]

# Create the Document-Term Matrix
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Function to compute coherence values for different number of topics
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = models.LsiModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Set the range for number of topics
start, limit, step = 2, 10, 1

# Compute coherence values for different numbers of topics
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=texts, start=start, limit=limit, step=step)

# Find the optimal model
max_coherence_value = max(coherence_values)
optimal_model_index = coherence_values.index(max_coherence_value)
optimal_model = model_list[optimal_model_index]
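# Map the list index back to a topic count; multiplying by step keeps this correct when step != 1
optimal_num_topics = start + optimal_model_index * step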

# Summarize the topics
topics = optimal_model.show_topics(num_topics=optimal_num_topics)

print(f"Optimal Number of Topics: {optimal_num_topics}")
print("Topics:")
for topic in topics:
    print(topic)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Optimal Number of Topics: 7
Topics:
(0, '-0.605*"data" + -0.347*"technology" + -0.227*"blockchain" + -0.227*"use" + -0.227*"securing" + -0.189*"learning" + -0.189*"machine" + -0.189*"role" + -0.189*"analysis" + -0.189*"big"')
(1, '-0.740*"reality" + -0.370*"future" + -0.370*"augmented" + -0.370*"virtual" + 0.078*"connecting" + 0.078*"physical" + 0.078*":" + 0.078*"digital" + 0.078*"things" + 0.078*"internet"')
(2, '-0.370*":" + -0.370*"worlds" + -0.370*"connecting" + -0.370*"internet" + -0.370*"physical" + -0.370*"things" + -0.370*"digital" + -0.155*"reality" + -0.078*"virtual" + -0.078*"augmented"')
(3, '-0.498*"impact" + -0.355*"technology" + -0.315*"evolution" + -0.315*"mobile" + -0.315*"society" + 0.245*"data" + -0.184*"artificial" + -0.184*"healthcare" + -0.184*"intelligence" + 0.143*"significance"')
(4, '0.632*"computing" + 0.316*"innovations" + 0.316*"trends" + 0.316*"cloud" + 0.316*"quantum" + 0.316*"advances" + 0.316*"applications" + 0.000*"reality" + 0.000*"augmented" + 0.000
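
LSA topics carry signed weights, and only the relative signs within a component are meaningful, so a summary is usually easier to write when terms are ranked by magnitude and grouped by sign. The sketch below does this for Topic 3 from the output above. The helper name `top_terms_by_magnitude` is hypothetical, and the `(term, weight)` pairs are typed in by hand rather than read from the model.

```python
def top_terms_by_magnitude(term_weights, n=5):
    """Return the n terms with the largest absolute weight, sign preserved."""
    ranked = sorted(term_weights, key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:n]


# Topic 3 from the output above, as (term, weight) pairs
topic_3 = [
    ("impact", -0.498), ("technology", -0.355), ("evolution", -0.315),
    ("mobile", -0.315), ("society", -0.315), ("data", 0.245),
    ("artificial", -0.184), ("healthcare", -0.184),
    ("intelligence", -0.184), ("significance", 0.143),
]

ranked = top_terms_by_magnitude(topic_3, n=10)
negative_pole = [term for term, weight in ranked if weight < 0]
positive_pole = [term for term, weight in ranked if weight > 0]
print("negative pole:", negative_pole)
print("positive pole:", positive_pole)
```

Read this way, Topic 3 contrasts terms about the societal impact of technology with terms about data, which is a more useful summary than the raw weighted string.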

## Question 3 (10 Points)
**Generate K topics using lda2vec. The number of topics K should be chosen based on the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [4]:
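import string

import gensim
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used by the helpers below
nltk.download('stopwords')
nltk.download('punkt')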

# Remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

# Convert to lowercase
def to_lowercase(text):
    return text.lower()

# Tokenize
def tokenize(text):
    return word_tokenize(text)

# Remove stop words
def remove_stopwords(texts):
    stop_words = set(stopwords.words('english'))
    return [[word for word in text if word not in stop_words] for text in texts]

# Lemmatize
def lemmatize(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
    lemmatized_texts = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        lemmatized_sent = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
        lemmatized_texts.append(lemmatized_sent)
    return lemmatized_texts

# Create bigrams
def create_bigrams(texts, id2word):
    bigram = gensim.models.Phrases(texts, min_count=5, threshold=100)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    return [bigram_mod[doc] for doc in texts], id2word

# Create trigrams
def create_trigrams(texts, id2word):
    bigram = gensim.models.Phrases(texts, min_count=5, threshold=100)
    # bigram_mod must be built here; it was previously referenced without being defined
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram = gensim.models.Phrases(bigram[texts], threshold=100)
    trigram_mod = gensim.models.phrases.Phraser(trigram)
    return [trigram_mod[bigram_mod[doc]] for doc in texts], id2word

# Preprocess text data: `data` is a list of raw document strings
def preprocess_text_data(data, id2word):
    # Clean and tokenize each document separately so the result is a list of token lists
    texts = [tokenize(to_lowercase(remove_punctuation(doc))) for doc in data]
    texts = remove_stopwords(texts)
    texts = lemmatize(texts)
    return [id2word.doc2bow(text) for text in texts], id2word





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
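
The cell above only prepares the text. The modeling idea behind lda2vec is that each document vector is a mixture of topic vectors, and this document vector is added to the pivot word vector to form the context used for prediction. The following is a minimal sketch of that composition step only, with no training. All names and the tiny example vectors are hypothetical.

```python
import math


def softmax(weights):
    """Convert unnormalized document weights into topic proportions."""
    shift = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - shift) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def document_vector(doc_weights, topic_vectors):
    """Mix the topic vectors using the document's topic proportions."""
    proportions = softmax(doc_weights)
    dim = len(topic_vectors[0])
    return [
        sum(p * topic[i] for p, topic in zip(proportions, topic_vectors))
        for i in range(dim)
    ]


def context_vector(word_vector, doc_weights, topic_vectors):
    """lda2vec-style context: pivot word vector plus document vector."""
    doc_vec = document_vector(doc_weights, topic_vectors)
    return [w + d for w, d in zip(word_vector, doc_vec)]


# Hypothetical example with 2 topics in a 3-dimensional embedding space
topic_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_weights = [0.0, 0.0]  # equal weights give equal topic proportions
word_vector = [0.2, 0.2, 0.2]

proportions = softmax(doc_weights)
context = context_vector(word_vector, doc_weights, topic_vectors)
print(proportions, context)
```

In a full implementation the document weights, topic vectors, and word vectors would all be learned jointly. The topic proportions are what get summarized as the document's topics.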


## Question 4 (10 Points)
**Generate K topics using BERTopic. The number of topics K should be chosen based on the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [13]:
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Load the dataset with specified categories
categories = ['sci.space', 'comp.graphics']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))
texts = newsgroups.data

# Initialize and fit BERTopic
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(texts)

# Print the top 10 topics
print(topic_model.get_topic_info().head(10))





   Topic  Count                         Name  \
0     -1    598             -1_the_of_to_and   
1      0    231              0_the_to_of_and   
2      1    107           1_image_and_for_of   
3      2     89          2_the_points_den_is   
4      3     87       3_card_vesa_mode_video   
5      4     62           4_for_and_data_the   
6      5     56            5_sky_the_that_to   
7      6     55          6_jpeg_gif_you_file   
8      7     51  7_conference_int_nok_oprows   
9      8     49    8_hst_the_reboost_mission   

                                      Representation  \
0      [the, of, to, and, in, is, it, that, for, on]   
1   [the, to, of, and, that, space, in, is, be, for]   
2   [image, and, for, of, or, it, 3d, is, the, with]   
3  [the, points, den, is, of, problem, this, line...   
4  [card, vesa, mode, video, the, to, it, with, p...   
5  [for, and, data, the, ftp, in, available, is, ...   
6  [sky, the, that, to, advertising, of, it, righ...   
7  [jpeg, gif, you, fil
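
Several of the topic representations above are dominated by stop words such as "the" and "of", which makes the topics hard to summarize. As a minimal sketch, the snippet below filters stop words from the representations after the fact. The stop-word set is a small illustrative assumption, and `content_words` is a hypothetical helper, not part of BERTopic.

```python
# Small illustrative stop-word set (an assumption; a full list such as
# NLTK's English stop words would be used in practice)
STOP_WORDS = {
    "the", "of", "to", "and", "in", "is", "it", "that", "for", "on",
    "be", "or", "with", "this", "you",
}


def content_words(representation, stop_words=STOP_WORDS):
    """Keep only the non-stop-word terms, preserving their order."""
    return [word for word in representation if word not in stop_words]


# Representations copied from rows 0-2 of the table above
representations = {
    -1: ["the", "of", "to", "and", "in", "is", "it", "that", "for", "on"],
    0: ["the", "to", "of", "and", "that", "space", "in", "is", "be", "for"],
    1: ["image", "and", "for", "of", "or", "it", "3d", "is", "the", "with"],
}

cleaned = {tid: content_words(words) for tid, words in representations.items()}
for topic_id, words in cleaned.items():
    print(topic_id, words)
```

An alternative that avoids post-processing is to pass a `vectorizer_model`, for example scikit-learn's `CountVectorizer(stop_words="english")`, when constructing `BERTopic`, so that stop words are excluded when the topic representations are built. That variant is not run here.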

## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**

In [16]:
# LDA (Latent Dirichlet Allocation)
# A good default for large corpora with well-separated themes; its topics are probability distributions over words, which makes them simple to interpret.
# Example usage:
# from sklearn.decomposition import LatentDirichletAllocation
# lda_model = LatentDirichletAllocation(n_components=number_of_topics)
# lda_topic_matrix = lda_model.fit_transform(document_term_matrix)

# LSA (Latent Semantic Analysis)
# Suited to tasks that need dimensionality reduction and fast computation, since it is based on truncated singular value decomposition (SVD).
# Example usage:
# from sklearn.decomposition import TruncatedSVD
# lsa_model = TruncatedSVD(n_components=number_of_topics)
# lsa_topic_matrix = lsa_model.fit_transform(document_term_matrix)

# lda2vec
# Captures both semantic (word-embedding) and topical relationships in text, but requires more computing resources and a larger dataset.
# Note: lda2vec is not available in sklearn; it needs a custom implementation or another library.

# BERTopic
# Well suited to extracting nuanced, context-rich topics from complex text, but requires substantial computational capacity.
# Example usage:
# from bertopic import BERTopic
# bertopic_model = BERTopic()
# topics, probabilities = bertopic_model.fit_transform(docs)
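
One way to back a comparison like this with numbers, alongside coherence, is topic diversity: the fraction of unique words among the top words of all topics. The sketch below computes it for two of the LDA topics printed in Question 1. The function name `topic_diversity` is an assumption, and the metric is offered as one possible comparison aid, not as the grading criterion.

```python
def topic_diversity(topics, top_n=10):
    """Fraction of unique words among the top_n words of all topics."""
    top_words = [word for topic in topics for word in topic[:top_n]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Topics 3 and 12 copied from the Question 1 output
lda_topics = [
    ["impact", "climate", "environment", "conservation", "communities",
     "scarcity", "food", "effects", "importance", "security"],
    ["impact", "food", "climate", "environment", "conservation",
     "deforestation", "scarcity", "change", "effects", "importance"],
]

diversity = topic_diversity(lda_topics)
print(f"LDA topic diversity: {diversity:.2f}")
```

A value close to 1 means the topics share few words, while a low value signals redundant topics, which can indicate that K is too large for the corpus.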


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help you grasp the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [18]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
#Experimenting with various topic modeling methods such as LDA, LSA, lda2vec, and BERTopic gives a well-rounded learning experience in text data analysis. Understanding these techniques is critical to comprehending the complexities of feature extraction from text data.
#Interpreting the topics and their importance can be subjective and sometimes difficult, especially when the topics are not clearly distinct or are too fine-grained.
#Understanding and implementing these methods improves one's capacity to efficiently handle and evaluate large amounts of text data, which is an important skill in the NLP field. This exercise combines theoretical understanding and practical application, offering a good basis for further investigation into complex NLP problems.





'''