
# **The fourth in-class-exercise (40 points in total, 03/28/2022)**

Question description: Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks:

## (1) (10 points) Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [31]:
import nltk
nltk.download('stopwords')

import re
import gensim
import pyLDAvis.gensim_models
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from nltk.corpus import stopwords

req_documents = ['By analyzing Google search data using Google Trends, we measured the impact of highly publicized plastic surgery-related events on the interest level of the general population in specific search terms.',
             'Additionally, we investigated seasonal and geographic trends around interest in rhinoplasties, which is information that physicians and small surgical centers can use to optimize marketing decisions.',
             'A noticeable impact was observed in both celebrity cases on search term volume, and a seasonal effect is apparent for rhinoplasty searches. ',
             'As many surgeons already employ aggressive Internet marketing strategies, understanding and utilizing these trends could help optimize their investments, increase social engagement, and increase practice awareness by potential patients.']

stop_word = set(stopwords.words('english'))
def preprocess(text):
    tokens_01 = re.findall(r'\w+', text.lower())
    tokens_01 = [word for word in tokens_01 if word not in stop_word]
    return tokens_01

processed_documents = [preprocess(doc) for doc in req_documents]
dictionary = Dictionary(processed_documents)
corpus = [dictionary.doc2bow(doc) for doc in processed_documents]
# Train one LDA model per candidate K and record its c_v coherence
candidate_ks = list(range(2, 11))
lda_models = []
coherence_scores = []
for k in candidate_ks:
    lda_model = LdaModel(corpus, num_topics=k, id2word=dictionary, passes=15)
    coherence_model = CoherenceModel(model=lda_model, texts=processed_documents, dictionary=dictionary, coherence='c_v')
    lda_models.append(lda_model)
    coherence_scores.append(coherence_model.get_coherence())

# Reuse the model that achieved the best score; retraining without a fixed
# seed could produce different topics from the ones that were scored
best_index = coherence_scores.index(max(coherence_scores))
optimal_k = candidate_ks[best_index]
print(f"The number of optimal topics is: {optimal_k}")
optimal_lda_model = lda_models[best_index]
topics = optimal_lda_model.print_topics(num_words=10)  # top 10 words per topic

for topic in topics:
    print(topic)





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The number of optimal topics is: 4
(0, '0.038*"marketing" + 0.038*"trends" + 0.038*"optimize" + 0.038*"interest" + 0.038*"seasonal" + 0.038*"surgical" + 0.038*"investigated" + 0.038*"geographic" + 0.038*"physicians" + 0.038*"additionally"')
(1, '0.059*"search" + 0.059*"google" + 0.033*"impact" + 0.033*"trends" + 0.033*"plastic" + 0.033*"specific" + 0.033*"general" + 0.033*"events" + 0.033*"publicized" + 0.033*"data"')
(2, '0.043*"impact" + 0.043*"seasonal" + 0.043*"noticeable" + 0.043*"volume" + 0.043*"effect" + 0.043*"rhinoplasty" + 0.043*"searches" + 0.043*"apparent" + 0.043*"observed" + 0.043*"term"')
(3, '0.058*"increase" + 0.032*"engagement" + 0.032*"social" + 0.032*"utilizing" + 0.032*"patients" + 0.032*"investments" + 0.032*"many" + 0.032*"employ" + 0.032*"understanding" + 0.032*"strategies"')
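
Choosing K amounts to taking the arg-max over the coherence scores. A minimal sketch of that step, assuming the scores are kept in a dict keyed by K; `pick_best_k` is a hypothetical helper and the scores are placeholder values, not results from this notebook:

```python
def pick_best_k(scores_by_k):
    """Return the K with the highest coherence score (smallest K wins ties)."""
    if not scores_by_k:
        raise ValueError("no coherence scores supplied")
    # Sort key: highest score first, then the smaller K so ties favour fewer topics.
    return min(scores_by_k, key=lambda k: (-scores_by_k[k], k))


# Placeholder scores for illustration only (assumed values).
example_scores = {2: 0.31, 3: 0.35, 4: 0.42, 5: 0.42}
best_k = pick_best_k(example_scores)
print(f"Best K: {best_k}")
```

Keying the scores by K avoids having to translate a list index back into a topic count, and the tie-break keeps the model with fewer topics when two values of K score the same.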


In [24]:
!pip install pyLDAvis



## (2) (10 points) Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

req_documents = ['By analyzing Google search data using Google Trends, we measured the impact of highly publicized plastic surgery-related events on the interest level of the general population in specific search terms.',
             'Additionally, we investigated seasonal and geographic trends around interest in rhinoplasties, which is information that physicians and small surgical centers can use to optimize marketing decisions.',
             'A noticeable impact was observed in both celebrity cases on search term volume, and a seasonal effect is apparent for rhinoplasty searches. ',
             'As many surgeons already employ aggressive Internet marketing strategies, understanding and utilizing these trends could help optimize their investments, increase social engagement, and increase practice awareness by potential patients.']
tfidf_vectorizer = TfidfVectorizer(max_df=0.85, max_features=5000, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(req_documents)

# K is fixed at 4 in this cell rather than selected by a coherence score
num_topics = 4
lsa = TruncatedSVD(n_components=num_topics)
lsa_topic_matrix = lsa.fit_transform(tfidf_matrix)
terms = tfidf_vectorizer.get_feature_names_out()
# Print the 10 highest-weighted terms of each LSA component
for i, component in enumerate(lsa.components_):
    top_terms = [terms[j] for j in np.argsort(component)[::-1][:10]]
    print(f"Topic {i+1}: {', '.join(top_terms)}")






Topic 1: search, seasonal, google, trends, impact, marketing, optimize, increase, volume, searches
Topic 2: increase, optimize, marketing, aggressive, utilizing, understanding, surgeons, practice, potential, patients
Topic 3: seasonal, additionally, surgical, physicians, small, investigated, information, geographic, rhinoplasties, decisions
Topic 4: increase, apparent, volume, searches, cases, celebrity, effect, noticeable, rhinoplasty, observed
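
The LSA cell above uses a fixed number of components. One rough way to compare candidate values of K is to score each topic's top terms with a coherence measure and average the scores per K. The sketch below is a simplified UMass-style coherence written in plain Python; `umass_coherence` and the toy documents are illustrative assumptions, not part of the original exercise or of any library.

```python
import math
from itertools import combinations


def umass_coherence(top_terms, tokenized_docs):
    """Simplified UMass-style coherence for one topic's ranked top terms."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # combinations() yields (higher-ranked, lower-ranked) pairs in rank order.
    for earlier, later in combinations(top_terms, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # term never occurs in the corpus; skip the pair
        score += math.log((doc_freq(earlier, later) + 1) / denom)
    return score


# Toy corpus (assumed data, for illustration only).
toy_docs = [
    ["google", "search", "trends"],
    ["google", "search", "volume"],
    ["seasonal", "marketing", "trends"],
]
coherent = umass_coherence(["google", "search"], toy_docs)
incoherent = umass_coherence(["google", "marketing"], toy_docs)
print(coherent > incoherent)
```

Terms that co-occur in the same documents score higher than terms that never appear together, so averaging this value over the topics for each candidate K gives a basis for choosing K.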


## (3) (10 points) Generate K topics using lda2vec. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://nbviewer.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec.ipynb

In [41]:
import nltk
import pyLDAvis
pyLDAvis.enable_notebook()
from nltk.tokenize import word_tokenize
from gensim.corpora import Dictionary
nltk.download('punkt')
req_documents = ['By analyzing Google search data using Google Trends, we measured the impact of highly publicized plastic surgery-related events on the interest level of the general population in specific search terms.',
             'Additionally, we investigated seasonal and geographic trends around interest in rhinoplasties, which is information that physicians and small surgical centers can use to optimize marketing decisions.',
             'A noticeable impact was observed in both celebrity cases on search term volume, and a seasonal effect is apparent for rhinoplasty searches. ',
             'As many surgeons already employ aggressive Internet marketing strategies, understanding and utilizing these trends could help optimize their investments, increase social engagement, and increase practice awareness by potential patients.']

# Tokenize with NLTK; punctuation and stop words are not filtered out here
tokenized_docs = [word_tokenize(doc.lower()) for doc in req_documents]

dictionary = Dictionary(tokenized_docs)

corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

from gensim.models.coherencemodel import CoherenceModel
from gensim.models import LdaModel

def compute_coherence_values(dictionary, corpus, tokenized_docs, limit, start=2, step=1):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Note: this is a standard gensim LdaModel, not an lda2vec model
        model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherence_model = CoherenceModel(model=model, texts=tokenized_docs, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherence_model.get_coherence())

    return model_list, coherence_values

model_list, coherence_values = compute_coherence_values(dictionary, corpus, tokenized_docs, limit=10)

optimal_model = model_list[coherence_values.index(max(coherence_values))]
optimal_K = optimal_model.num_topics
def summarize_topics(model):
    topics = model.print_topics(num_words=5)
    for topic in topics:
        print(topic)

summarize_topics(optimal_model)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


(0, '0.063*"," + 0.044*"increase" + 0.043*"and" + 0.024*"engagement" + 0.024*"."')
(1, '0.054*"a" + 0.029*"," + 0.029*"search" + 0.029*"and" + 0.029*"seasonal"')
(2, '0.034*"and" + 0.028*"," + 0.020*"trends" + 0.019*"marketing" + 0.019*"in"')
(3, '0.052*"the" + 0.036*"google" + 0.036*"search" + 0.036*"of" + 0.035*","')
(4, '0.034*"," + 0.033*"and" + 0.025*"seasonal" + 0.023*"decisions" + 0.023*"in"')
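
The topics above are dominated by punctuation and function words such as "," and "the", because the tokens were not filtered before the dictionary was built. A minimal sketch of a stricter tokenizer follows; the `tokenize` helper and the short `STOP_WORDS` set are assumptions made for illustration (NLTK's English stop-word list is much longer).

```python
import re

# Small illustrative stop-word list (assumed); replace with a fuller list in practice.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "for", "we", "to", "by"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


sample = "A noticeable impact was observed, and a seasonal effect is apparent."
tokens = tokenize(sample)
print(tokens)
```

Feeding tokens like these into `Dictionary` and `doc2bow` keeps commas, periods, and common function words out of the vocabulary, so they cannot appear as top topic words.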


## (4) (10 points) Generate K topics using BERTopic. The number of topics K should be decided by the coherence score. Then summarize the topics. You may refer to the code here:

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing

In [46]:

!pip install bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups corpus (headers, footers and quotes are kept)
data = fetch_20newsgroups(subset='all')['data']

topic_model = BERTopic(nr_topics="auto", calculate_probabilities=True, verbose=True)
topics, _ = topic_model.fit_transform(data)

topic_overview = topic_model.get_topic_freq()

for topic_num, freq in topic_overview[1:].values:
    topic_words = topic_model.get_topic(topic_num)
    topic_summary = ", ".join([word[0] for word in topic_words[:5]])
    print(f"Topic {topic_num}: {topic_summary} (Freq: {freq})")





2023-11-06 03:57:12,460 - BERTopic - Transformed documents to Embeddings
2023-11-06 03:58:03,509 - BERTopic - Reduced dimensionality
2023-11-06 04:00:15,852 - BERTopic - Clustered reduced embeddings
2023-11-06 04:00:40,853 - BERTopic - Reduced number of topics from 364 to 113


Topic -1: the, to, of, and, in (Freq: 6221)
Topic 1: game, team, he, games, the (Freq: 1189)
Topic 2: pain, migraine, jb, hernia, is (Freq: 144)
Topic 3: msg, food, sensitivity, chinese, superstition (Freq: 63)
Topic 4: battery, batteries, concrete, acid, lead (Freq: 62)
Topic 5: mary, she, her, sin, immaculate (Freq: 59)
Topic 6: copy, protected, protection, disks, program (Freq: 58)
Topic 7: shift, shifting, manual, automatic, clutch (Freq: 53)
Topic 8: gamma, bursters, oort, ray, cloud (Freq: 50)
Topic 9: dog, dogs, my, bike, springer (Freq: 47)
Topic 10: helmet, helmets, shoei, jacket, fit (Freq: 45)
Topic 11: fonts, font, truetype, atm, tt (Freq: 45)
Topic 12: oil, drain, changing, plug, self (Freq: 44)
Topic 13: candida, yeast, systemic, bloom, infections (Freq: 41)
Topic 14: cpu, fan, heat, fans, sink (Freq: 41)
Topic 15: phone, line, number, onhook, led (Freq: 39)
Topic 16: countersteering, bike, countersteeringfaq, lean, left (Freq: 39)
Topic 17: points, sphere, den, radius, c
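
In BERTopic, topic -1 collects the documents that were not assigned to any cluster, and the listing above shows that slicing off the first row does not reliably exclude it. A minimal sketch of filtering by topic id instead, on plain `(topic, count)` tuples; `drop_outliers` is a hypothetical helper, and the sample rows are copied from the first lines of the output above.

```python
def drop_outliers(topic_freqs, outlier_id=-1):
    """Return the (topic, count) pairs whose topic id is not the outlier id."""
    return [(topic, count) for topic, count in topic_freqs if topic != outlier_id]


# First rows of the output above, as (topic id, frequency) pairs.
rows = [(-1, 6221), (1, 1189), (2, 144), (3, 63)]
kept = drop_outliers(rows)

# Number of documents that landed in the outlier topic for these rows.
outlier_docs = sum(count for topic, count in rows if topic == -1)
print(kept, outlier_docs)
```

Filtering on the id rather than on row position keeps the summary correct regardless of how the frequency table happens to be sorted.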

## (5) (10 extra points) Compare the results generated by the four topic modeling algorithms. Which one is better? You should explain the reasons in detail.

In [28]:
- LDA can handle sparse input. It generally requires fewer topics than word-embedding-based techniques, which makes its results easier to read. Its topics also show adjectives and nouns.

- LSA appears to be the most direct, with distinct subjects, and it is very easy to implement. However, it may not go deep enough to capture more complex interactions.

- lda2vec combines word2vec-style word embeddings with LDA-style topic mixtures, but it is not suitable for small datasets.

- BERTopic provides high versatility and stability across domains and supports hierarchical topic reduction. However, it generates many outliers: in the run above, 6221 documents fell into the outlier topic -1.

- For general interpretability and a balance between complexity and performance, LDA might be the ideal option.

- If computational resources are available and the corpus is diverse, BERTopic can be highly effective.




