# **INFO5731 In-class Exercise 4**

**This exercise will provide a valuable learning experience in working with text data and extracting features using various topic modeling algorithms. Key concepts include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), lda2vec, and BERTopic.**

In [52]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)

**Generate K topics using LDA. The number of topics K should be chosen by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
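
A coherence score rates a topic by how often its top words occur together in the same documents. The sketch below computes the UMass variant by hand on a toy corpus. It is only an illustration of the idea, not the `c_v` measure that gensim's `CoherenceModel` computes in the cells that follow; the function name `umass_coherence` and the toy documents are made up for the example.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic: sum of log((D(wi, wj) + 1) / D(wi)) over word pairs.

    top_words: topic words ordered from most to least probable.
    documents: list of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every one of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    # combinations() keeps the input order, so `earlier` always ranks above `later`
    for earlier, later in combinations(top_words, 2):
        denominator = doc_freq(earlier)
        if denominator == 0:
            continue  # assumption for this sketch: skip words absent from the corpus
        score += math.log((doc_freq(earlier, later) + 1) / denominator)
    return score


# Toy corpus (hypothetical, written for this example)
docs = [
    ["born", "seattl", "washington"],
    ["born", "seattl", "washington", "nisei"],
    ["born", "california", "nisei"],
    ["redress", "movement", "justic"],
]

coherent = umass_coherence(["seattl", "washington"], docs)  # words that co-occur
mixed = umass_coherence(["seattl", "redress"], docs)        # words that never co-occur
print(coherent > mixed)
```

Scores closer to zero indicate words that appear together more often; a topic whose top words never share a document is pushed toward a more negative value.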

In [53]:
import csv
import json
import nltk
import pandas as pd
nltk.download('stopwords')

with open('/root/Info_5731 (1).csv', 'r') as f:
    data = list(csv.reader(f))

df = pd.DataFrame(data)
print(df.head())


   0                                                  1           2  \
0                                             Name_Info  Noun_Count   
1  0  kay aiko abe nisei femal born may selleck wash...          10   
2  1  art abe nisei male born june seattl washington...          10   
3  2  sharon tanagi aburano nisei femal born octob s...          14   
4  3  toshiko aiboshi nisei femal born juli boyl hei...          11   

            3          4          5  
0  Verb_Count  Adj_Count  Adv_Count  
1           3          2          1  
2           2          2          0  
3           1          4          0  
4           2          3          1  


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
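
The printed frame above has the column names (`Name_Info`, `Noun_Count`, and so on) sitting in data row 0, because `csv.reader` returns the header line as an ordinary row. A minimal sketch of separating the header from the records with the standard library follows. The sample rows are abbreviated stand-ins typed in for the example, not the real file.

```python
import csv
import io

# Abbreviated stand-in for the CSV file (hypothetical sample, not the real data)
sample = io.StringIO(
    ",Name_Info,Noun_Count\n"
    "0,kay aiko abe nisei femal born may,10\n"
    "1,art abe nisei male born june,10\n"
)

reader = csv.reader(sample)
header = next(reader)   # first line holds the column names
rows = list(reader)     # remaining lines are the actual records

# Pair each value with its column name so columns can be addressed by name
records = [dict(zip(header, row)) for row in rows]
texts = [rec["Name_Info"] for rec in records]

print(header)
print(texts)
```

`pandas.read_csv` treats the first line as the header by default, which is why the later cells can address `news_articles['Name_Info']` directly.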


In [None]:
!pip install pyLDAvis

Collecting pandas>=2.0.0 (from pyLDAvis)
  Using cached pandas-2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.3
    Uninstalling pandas-1.5.3:
      Successfully uninstalled pandas-1.5.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==1.5.3, but you have pandas 2.2.1 which is incompatible.
Successfully installed pandas-2.2.1


In [None]:
!pip install pandas==1.5.3

Collecting pandas==1.5.3
  Using cached pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.1
    Uninstalling pandas-2.2.1:
      Successfully uninstalled pandas-2.2.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pyldavis 3.4.1 requires pandas>=2.0.0, but you have pandas 1.5.3 which is incompatible.
Successfully installed pandas-1.5.3


In [54]:
# Importing necessary libraries
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from gensim.models import LdaMulticore
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
import re
from gensim.utils import simple_preprocess
import nltk
from nltk.corpus import stopwords
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
import matplotlib.pyplot as plt

# Read data into news_articles DataFrame
news_articles = pd.read_csv('/root/Info_5731 (1).csv')

# Preprocess text data
news_articles['processed_text'] = news_articles['Name_Info'].map(lambda x: re.sub('[,!?]', '', x))
news_articles['processed_text'] = news_articles['processed_text'].map(lambda x: x.lower())

# Tokenize and remove stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')


def tokenize(text):
    return [word for word in simple_preprocess(text) if word not in stop_words]

news_articles['tokenized_text'] = news_articles['processed_text'].map(tokenize)

# Create a dictionary and corpus
id2word = Dictionary(news_articles['tokenized_text'])
corpus = [id2word.doc2bow(text) for text in news_articles['tokenized_text']]

# Determine the optimal number of topics using coherence score
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics, workers=1)  # Set workers to 1
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Set the range of topics to evaluate
start_topic = 2
end_topic = 11
step_topic = 1

# Compute coherence scores
model_list, coherence_values = compute_coherence_values(dictionary=id2word,
                                                        corpus=corpus,
                                                        texts=news_articles['tokenized_text'],
                                                        start=start_topic,
                                                        limit=end_topic,
                                                        step=step_topic)

# Find the optimal number of topics based on coherence score
best_index = coherence_values.index(max(coherence_values))
optimal_num_topics = start_topic + best_index * step_topic

# Reuse the model that achieved the best coherence instead of retraining,
# since a freshly trained LDA model can differ from the one that was scored
lda_model = model_list[best_index]

# Print the topics
topics = lda_model.print_topics()
for topic in topics:
    print(topic)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


(0, '0.038*"washington" + 0.037*"born" + 0.034*"nisei" + 0.024*"seattl" + 0.023*"male" + 0.015*"father" + 0.015*"grew" + 0.014*"california" + 0.012*"famili" + 0.011*"femal"')
(1, '0.045*"born" + 0.033*"nisei" + 0.029*"grew" + 0.027*"parent" + 0.025*"femal" + 0.024*"ran" + 0.023*"world" + 0.022*"war" + 0.021*"california" + 0.020*"ii"')
(2, '0.045*"born" + 0.029*"nisei" + 0.024*"california" + 0.024*"femal" + 0.020*"male" + 0.020*"washington" + 0.018*"grew" + 0.017*"los" + 0.017*"angel" + 0.015*"war"')
(3, '0.053*"born" + 0.049*"california" + 0.041*"nisei" + 0.030*"male" + 0.027*"grew" + 0.024*"world" + 0.024*"war" + 0.022*"femal" + 0.020*"ii" + 0.016*"remov"')
(4, '0.028*"washington" + 0.026*"born" + 0.026*"seattl" + 0.023*"nisei" + 0.021*"male" + 0.010*"grew" + 0.008*"bill" + 0.008*"femal" + 0.007*"california" + 0.007*"bainbridg"')
(5, '0.031*"born" + 0.025*"male" + 0.023*"japan" + 0.013*"school" + 0.012*"nisei" + 0.011*"father" + 0.010*"california" + 0.008*"return" + 0.008*"attend" + 0

## Question 2 (10 Points)

**Generate K topics using LSA. The number of topics K should be chosen by the coherence score. Then summarize the topics.**

You may refer to the code here: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
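
Choosing K by coherence amounts to a small grid search: fit one model per candidate K, score each, and keep the K with the highest score. The helper below sketches that loop independently of any library. `select_k` and the `score_for_k` callback are hypothetical names, and the scores in the usage lines are invented purely to exercise the function.

```python
def select_k(score_for_k, start=2, limit=11, step=1):
    """Return (best_k, scores), where scores maps each candidate K to its score.

    score_for_k: callable that fits a model with K topics and returns its
                 coherence (in this sketch, any function from int to float).
    """
    scores = {k: score_for_k(k) for k in range(start, limit, step)}
    # Rank the candidate K values by their score; ties resolve to the smallest K
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Usage with invented scores that peak at K = 4 (not real coherence values)
fake_scores = {2: 0.31, 4: 0.42, 6: 0.38, 8: 0.35, 10: 0.33}
best_k, scores = select_k(lambda k: fake_scores[k], start=2, limit=11, step=2)
print(best_k)
```

Keeping K as the dictionary key avoids recomputing it from a list position, which is easy to get wrong when `step` is not 1.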

In [58]:
# Write your code here
import pandas as pd
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_short, stem_text, preprocess_string
from gensim import corpora
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# df is the DataFrame loaded in Question 1; its column 1 holds the Name_Info text

# Preprocess text data
def preprocess(text):
    CUSTOM_FILTERS = [lambda x: x.lower(),
                      remove_stopwords,
                      strip_punctuation,
                      strip_short,
                      stem_text]
    text = preprocess_string(text, CUSTOM_FILTERS)
    return text

print(df)
# Preprocess text data
df['Text (Clean)'] = df.iloc[:, 1].apply(lambda x: preprocess(x))




# Create a dictionary with the corpus
corpus = df['Text (Clean)']
dictionary = corpora.Dictionary(corpus)

# Convert corpus into a bag of words
bow = [dictionary.doc2bow(text) for text in corpus]

# Find the optimal number of topics using coherence score
coherence_values = []
for num_topics in range(2, 11):
    lsi = LsiModel(bow, num_topics=num_topics, id2word=dictionary)
    coherence_model = CoherenceModel(model=lsi, texts=df['Text (Clean)'], dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_values.append((num_topics, coherence_score))
    print('Coherence score with {} clusters: {}'.format(num_topics, coherence_score))

# Choose the optimal number of topics based on coherence score
optimal_num_topics = max(coherence_values, key=lambda x: x[1])[0]

# Perform LSA to generate K topics
lsi = LsiModel(bow, num_topics=optimal_num_topics, id2word=dictionary)

# Print the coherence score
# print('Optimal number of topics:', optimal_num_topics)
# print('Coherence score with {} clusters: {}'.format(optimal_num_topics, max(coherence_values, key=lambda x: x[1])[1]))

# The 10 words most strongly associated with each topic are printed in the next cell


       0                                                  1           2  \
0                                                 Name_Info  Noun_Count   
1      0  kay aiko abe nisei femal born may selleck wash...          10   
2      1  art abe nisei male born june seattl washington...          10   
3      2  sharon tanagi aburano nisei femal born octob s...          14   
4      3  toshiko aiboshi nisei femal born juli boyl hei...          11   
..   ...                                                ...         ...   
973  972  karen yoshitomi sansei femal born spokan washi...           6   
974  973  john young chine american male born may los an...           8   
975  974  sharon yuen sansei femal born juli seattl wash...           9   
976  975  loi yuki nisei femal born septemb tule lake co...          11   
977  976  aaron zajic born baltimor maryland redress mov...          11   

              3          4          5  
0    Verb_Count  Adj_Count  Adv_Count  
1             3    

In [59]:
for topic_num, words in lsi.print_topics(num_words=10):
    print('Words in topic {}: {}.'.format(topic_num, words))

# Compute each document's score on every topic
corpus_lsi = lsi[bow]
scores = []
for doc in corpus_lsi:
    scores.append([round(val[1], 2) for val in doc])

Words in topic 0: 0.457*"born" + 0.397*"california" + 0.377*"nisei" + 0.267*"grew" + 0.256*"male" + 0.222*"world" + 0.222*"war" + 0.205*"femal" + 0.156*"remov" + 0.152*"washington".
Words in topic 1: 0.570*"washington" + -0.538*"california" + 0.392*"seattl" + -0.182*"lo" + -0.177*"angel" + 0.151*"born" + 0.123*"nisei" + 0.117*"japan" + -0.113*"war" + -0.111*"world".


In [60]:
# Create a DataFrame of the topic scores for each document
df_topic = pd.DataFrame(scores, columns=['Topic {}'.format(i) for i in range(optimal_num_topics)])
print(df_topic)
df_topic['Text'] = df['Text (Clean)']  # tokenized text produced by preprocess()
df_topic['Dominant Topic'] = df_topic.iloc[:, :optimal_num_topics].idxmax(axis=1)


# Find a sample document from each topic
for i in range(optimal_num_topics):
    df_topic_i = df_topic[df_topic['Dominant Topic'] == 'Topic {}'.format(i)]
    if not df_topic_i.empty:
        sample_text = df_topic_i.sample(1, random_state=2)['Text'].values[0]
        print('Sample text from topic {}:\n{}'.format(i, sample_text))
    else:
        print('No sample text available for topic {}'.format(i))


     Topic 0  Topic 1
0       0.00     0.00
1       1.49     1.12
2       1.87     1.88
3       2.00     1.16
4       1.76    -0.10
..       ...      ...
973     2.34     1.70
974     1.94    -0.99
975     1.00     1.17
976     2.21     0.08
977     0.52     0.20

[978 rows x 2 columns]
Sample text from topic 0:
['makoto', 'otsu', 'nisei', 'male', 'born', 'march', 'steveston', 'british', 'columbia', 'canada', 'grew', 'steveston', 'father', 'fish', 'canneri', 'world', 'war']
Sample text from topic 1:
['masako', 'murakami', 'nisei', 'femal', 'born', 'octob', 'seattl', 'washington', 'incarc', 'puyallup', 'assembl', 'center', 'washington', 'minidoka', 'concentr', 'camp', 'idaho', 'resettl', 'seattl', 'washington']


## Question 3 (10 points):
**Generate K topics using lda2vec. The number of topics K should be chosen by the coherence score. Then summarize the topics.**

You may refer to the code here: https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [None]:
!pip install gensim pandas nltk




In [None]:
!pip install Lda2Vec



In [66]:
# Import necessary libraries
# Note: this cell fits BERTopic; it does not call the lda2vec package installed above
import pandas as pd
from bertopic import BERTopic
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from gensim.models import CoherenceModel
from gensim.corpora import Dictionary

# Read the CSV file
df = pd.read_csv("/root/Info_5731 (1).csv")

# Initialize BERTopic model
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)

# Convert values in 'Name_Info' column to strings
df['Name_Info'] = df['Name_Info'].astype(str)

# Fit BERTopic model on the preprocessed text data
topics, probabilities = topic_model.fit_transform(df['Name_Info'])

# Get top words per topic
topics_info = topic_model.get_topics()
top_words_per_topic = [[word for word, _ in words] for topic_id, words in topics_info.items() if topic_id != -1]

# Tokenize and preprocess documents
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
processed_docs = [[word for word in tokenizer.tokenize(doc.lower()) if word not in stop_words] for doc in df['Name_Info'].astype(str).tolist()]

# Create dictionary and corpus for coherence calculation
dictionary = Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Calculate coherence score
coherence_model = CoherenceModel(topics=top_words_per_topic, texts=processed_docs, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score: {coherence_score}")

# Determine the number of topics
num_topics = len(topics_info) - 1 if -1 in topics_info else len(topics_info)
print(f"Number of Topics (K): {num_topics}")

# Summarize Topics
print("\nTopic Summaries:")
for topic_id, words in topics_info.items():
    if topic_id == -1:
        continue  # skip the outlier topic wherever it appears
    topic_summary = ", ".join([word for word, _ in words])
    print(f"Topic {topic_id}: {topic_summary}\n")


2024-03-29 23:44:41,360 - BERTopic - Embedding - Transforming documents to embeddings.


2024-03-29 23:44:58,818 - BERTopic - Embedding - Completed ✓
2024-03-29 23:44:58,821 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-29 23:45:03,452 - BERTopic - Dimensionality - Completed ✓
2024-03-29 23:45:03,454 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-29 23:45:03,523 - BERTopic - Cluster - Completed ✓
2024-03-29 23:45:03,530 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-29 23:45:03,576 - BERTopic - Representation - Completed ✓


Coherence Score: 0.7319151928449092
Number of Topics (K): 3

Topic Summaries:
Topic 0: born, nisei, california, male, grew, femal, world, war, ii, washington

Topic 1: interview, tashima, bill, led, jacl, panel, elain, akemi, kuros, joy

Topic 2: redress, justic, os, identifi, movement, offic, administr, depart, establish, work



## Question 4 (10 points):
**Generate K topics using BERTopic. The number of topics K should be chosen by the coherence score. Then summarize the topics.**

You may refer to the code here: https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing
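
BERTopic's `get_topics()` returns a dictionary that maps each topic id to a list of `(word, weight)` pairs, with id `-1` reserved for outlier documents. A coherence scorer only needs the word lists, so the outlier entry has to be skipped wherever it appears in the dictionary. The sketch below shows that extraction on a hand-written dictionary. `topics_info` here is a made-up stand-in with invented weights, not output from a fitted model, and `top_words_by_topic` is a hypothetical helper name.

```python
# Hypothetical stand-in for topic_model.get_topics(); the weights are invented
topics_info = {
    -1: [("born", 0.05), ("nisei", 0.04)],
    0: [("born", 0.09), ("nisei", 0.08), ("california", 0.07)],
    1: [("redress", 0.12), ("justic", 0.10), ("movement", 0.09)],
}


def top_words_by_topic(topics, outlier_id=-1):
    """Map each real topic id to its words, dropping the outlier topic."""
    return {
        topic_id: [word for word, _ in word_weights]
        for topic_id, word_weights in topics.items()
        if topic_id != outlier_id
    }


words = top_words_by_topic(topics_info)
num_topics = len(words)  # K excludes the outlier bucket

for topic_id, topic_words in words.items():
    print(f"Topic {topic_id}: {', '.join(topic_words)}")
```

The list `list(words.values())` is the shape that gensim's `CoherenceModel` accepts through its `topics` argument, as used in the cell below.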

In [None]:
! pip install BERTopic

Collecting BERTopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
Collecting hdbscan>=0.8.29 (from BERTopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from BERTopic)
  Downloading umap-learn-0.5.5.tar.gz (90 kB)

In [64]:
from bertopic import BERTopic
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from gensim.models import CoherenceModel
from gensim.corpora import Dictionary

# Initialize BERTopic model
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)

# df is the DataFrame from Question 2
# Its 'Text (Clean)' column (token lists from preprocess()) must already exist
# Tokenize and preprocess documents
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
processed_docs = [" ".join([word for word in tokenizer.tokenize(doc.lower()) if word not in stop_words]) for doc in df['Text (Clean)'].astype(str).tolist()]

# Fit BERTopic model on the preprocessed text data
topics, probabilities = topic_model.fit_transform(processed_docs)

# Get top words per topic
topics_info = topic_model.get_topics()
top_words_per_topic = [[word for word, _ in words] for topic_id, words in topics_info.items() if topic_id != -1]

# Tokenize each document separately
tokenized_docs = [doc.split() for doc in processed_docs]

# Create dictionary and corpus for coherence calculation
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Calculate coherence score
coherence_model = CoherenceModel(topics=top_words_per_topic, texts=tokenized_docs, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score: {coherence_score}")

# Determine the number of topics
num_topics = len(topics_info) - 1 if -1 in topics_info else len(topics_info)
print(f"Number of Topics (K): {num_topics}")

# Summarize Topics
print("\nTopic Summaries:")
for topic_id, words in topics_info.items():
    if topic_id == -1:
        continue  # skip the outlier topic wherever it appears
    topic_summary = ", ".join([word for word, _ in words])
    print(f"Topic {topic_id}: {topic_summary}\n")


2024-03-29 23:41:54,845 - BERTopic - Embedding - Transforming documents to embeddings.


2024-03-29 23:42:17,255 - BERTopic - Embedding - Completed ✓
2024-03-29 23:42:17,258 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-03-29 23:42:21,870 - BERTopic - Dimensionality - Completed ✓
2024-03-29 23:42:21,872 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-03-29 23:42:21,944 - BERTopic - Cluster - Completed ✓
2024-03-29 23:42:21,952 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-03-29 23:42:22,006 - BERTopic - Representation - Completed ✓


Coherence Score: 0.7209625284078601
Number of Topics (K): 3

Topic Summaries:
Topic 0: born, nisei, california, male, grew, femal, world, war, washington, remov

Topic 1: interview, tashima, led, jacl, panel, elain, akemi, joi, kuro, matsumoto

Topic 2: redress, justic, identifi, offic, administr, movement, depart, establish, work, administ



## Extra Question (5 Points)

**Compare the results generated by the four topic modeling algorithms. Which one is better? Explain your reasons in detail.**

**This question will compensate for any points deducted in this exercise. The maximum mark for the exercise is 40 points.**
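
One rough way to compare models is to measure how much their topics overlap: if two topics share most of their top words, the model has not separated them well. The sketch below computes the Jaccard overlap between top-word lists copied from the outputs above (LDA topics 0 and 3 from Question 1, BERTopic topics 0 and 2 from Question 4). The helper name `jaccard` is chosen for the example, and this is a supplementary check rather than a replacement for the coherence score.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two top-word lists have in common (0 to 1)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Top words copied from the LDA output in Question 1
lda_0 = ["washington", "born", "nisei", "seattl", "male",
         "father", "grew", "california", "famili", "femal"]
lda_3 = ["born", "california", "nisei", "male", "grew",
         "world", "war", "femal", "ii", "remov"]

# Top words copied from the BERTopic output in Question 4
bert_0 = ["born", "nisei", "california", "male", "grew",
          "femal", "world", "war", "washington", "remov"]
bert_2 = ["redress", "justic", "identifi", "offic", "administr",
          "movement", "depart", "establish", "work", "administ"]

lda_overlap = jaccard(lda_0, lda_3)
bert_overlap = jaccard(bert_0, bert_2)

print(f"LDA topics 0 vs 3: {lda_overlap:.2f}")
print(f"BERTopic topics 0 vs 2: {bert_overlap:.2f}")
```

A lower overlap suggests more distinct topics, but it says nothing about whether each topic is meaningful on its own, so it is best read alongside the coherence scores.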

In [68]:
# Write your code here
"""
LDA: Each topic is a list of words with non-negative weights. A weight is the probability of the word within that topic, not a sentiment score. In the output above, "born", "nisei" and "california" appear in every topic shown, so the topics overlap heavily and are hard to tell apart.
LSA: Each topic is a list of words with signed weights. A negative weight does not mean negative sentiment; it means the word loads in the opposite direction on that component. Topic 1, for example, contrasts "washington" and "seattl" (positive) with "california", "lo" and "angel" (negative). The coherence score selected K = 2.
lda2vec: The cell for Question 3 was run with BERTopic rather than lda2vec, so there are no separate lda2vec results to compare.
BERTopic: Each topic is summarized with key terms, facilitating easy interpretation. The three topics (biographical terms, interview panel names, redress movement) are easy to tell apart, and the coherence score was about 0.72 to 0.73.
LDA and LSA are faster, and of the two LDA is easier to use. LDA may be preferred when interpretability is crucial, while lda2vec and BERTopic may be suitable for capturing more complex relationships in the data.
"""

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.

Consider the following points in your response:

**Learning Experience:** Describe your overall learning experience in working with text data and extracting features using various topic modeling algorithms. Did you understand these algorithms, and did the implementations help in grasping the nuances of feature extraction from text data?

**Challenges Encountered:** Were there specific difficulties in completing this exercise?

**Relevance to Your Field of Study:** How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [69]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Learning Experience:
Overall, working with text data and implementing various topic modeling algorithms was a valuable learning experience. I gained a deeper understanding of how these algorithms work and their applications in extracting features from text data. The implementations provided clear examples of how to preprocess text, train topic models, and interpret the results. This hands-on approach helped me grasp the nuances of feature extraction from text data and understand the importance of parameter tuning and evaluation metrics in topic modeling.

Challenges Encountered:
One challenge I encountered was in understanding the nuances of each topic modeling algorithm and selecting the most appropriate one for the task at hand.

Relevance to Your Field of Study:
This exercise is highly relevant to the field of NLP, as topic modeling is a fundamental task in analyzing and understanding text data. By extracting features from text data, we can uncover hidden patterns, themes, and relationships that are crucial for various NLP applications such as document classification, information retrieval, and sentiment analysis.
'''
