# Step 5: Topic Modeling and Classification

In this notebook, we perform topic modeling and classification on a dataset of tweets. The primary objectives are to identify topics within the tweets and classify each tweet into predefined categories. The workflow involves the following steps:

#### Cluster Related Tweets
- Load the codebook, clustered tweets with embeddings, and manually labeled cluster samples.
- Calculate the percentage of tweets related to the event for each cluster.
- Filter tweets based on related clusters.
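
The filtering step above can be sketched on toy data as follows. The two small frames and their values are made up for illustration; only the column names (`id`, `cluster`, `is_related`) and the 0.5 majority threshold follow the code used later in this notebook.

```python
import pandas as pd

# Toy stand-in for the manually labeled cluster samples (values are made up).
labeled = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "cluster": [0, 0, 0, 1, 1, 1],
    "is_related": [1, 1, 0, 0, 0, 1],
})
# Toy stand-in for the full table of clustered tweets.
tweets = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "cluster": [0, 1, 0, 2],
})

# Share of related samples per cluster (the mean of a 0/1 column).
relatedness = labeled.groupby("cluster")["is_related"].mean()

# Keep clusters where more than half of the labeled samples are related.
related_clusters = relatedness[relatedness > 0.5].index.tolist()

# Keep only the tweets that belong to a related cluster.
related_tweets = tweets[tweets["cluster"].isin(related_clusters)]
print(related_tweets)
```

Clusters with no labeled samples (cluster 2 in the toy data) are dropped by this rule, which is also how the notebook's own filter behaves.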

#### Topic Modeling with Guided LDA
- Create a dictionary of codebook categories and associated n-grams.
- Vectorize the related tweets using CountVectorizer with specified parameters.
- Use GuidedLDA for topic modeling, considering seed topics from the codebook.
- Evaluate the model on a test set for different window sizes.

**Note**: See the [GuidedLDA Installation Workaround](https://github.com/dex314/GuidedLDA_WorkAround) for installation instructions.
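
As a rough illustration of how seed topics are derived from the codebook, the sketch below maps each seed n-gram that survives vectorization to the id of its category. The toy codebook and vocabulary are hypothetical; in the notebook the vocabulary comes from the fitted `CountVectorizer`.

```python
# Hypothetical toy codebook: category -> seed n-grams.
codebook_dict = {
    "damage": ["building collapse", "rubble"],
    "aid": ["donation", "blood"],
}
# Hypothetical vocabulary, standing in for vectorizer.get_feature_names_out().
vocabulary = ["blood", "building collapse", "donation", "earthquake"]

word2id = {term: idx for idx, term in enumerate(vocabulary)}
category_ids = {category: idx for idx, category in enumerate(codebook_dict)}

# Map vocabulary id -> category id, skipping seed terms absent from the vocabulary.
seed_topics = {
    word2id[term]: category_ids[category]
    for category, terms in codebook_dict.items()
    for term in terms
    if term in word2id
}
print(seed_topics)
```

Seed terms filtered out by the vectorizer settings (here `rubble`) contribute nothing to the guidance, so it can be worth checking how many seeds per category actually remain.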

#### Topic Modeling with BERTopic
- Utilize the SentenceTransformer model for embedding tweets.
- Apply BERTopic for topic modeling, considering seed topics from the codebook.
- Evaluate the model on a test set for different window sizes.

#### Topic Modeling with BERT and RoBERTa
- Use pre-trained BERT and RoBERTa models for tweet embeddings.
- Apply embeddings to seed topics and tweets for classification.
- Evaluate both models on a test set for different window sizes.
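
The classification step can be sketched as a similarity ranking: each tweet embedding is scored against one embedding per category, and the best-scoring categories become the predictions. The helper name `rank_categories`, the two-dimensional vectors, and the category names are illustrative assumptions; the dot-product scoring mirrors what the notebook does.

```python
import numpy as np

def rank_categories(embedding, seed_embeddings, categories, top_k=5):
    """Return the top_k category names ordered by dot-product similarity."""
    scores = np.array([np.dot(embedding, seed) for seed in seed_embeddings])
    # argsort is ascending, so reverse it to get the best match first.
    order = np.argsort(scores)[::-1][:top_k]
    return [categories[i] for i in order]

# Toy category embeddings (hypothetical values, one vector per category).
categories = ["damage", "aid", "politics"]
seed_embeddings = [
    np.array([1.0, 0.0]),
    np.array([0.0, 1.0]),
    np.array([-1.0, 0.0]),
]

tweet_embedding = np.array([0.9, 0.2])
ranking = rank_categories(tweet_embedding, seed_embeddings, categories, top_k=2)
print(ranking)
```

Dot products favor vectors with larger norms, so normalizing the embeddings first (cosine similarity) is a common alternative; that would be a change from the notebook's approach.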

#### Topic Modeling with GloVe Embeddings
- Load GloVe embeddings and create embeddings for seed topics.
- Apply GloVe embeddings to tweets for classification.
- Evaluate the model on a test set for different window sizes.
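
A minimal sketch of the averaging used for GloVe-based embeddings is shown below. The helper name `average_vector` and the toy two-dimensional vectors are assumptions for illustration, and the zero-vector fallback for texts without any known token is one possible choice, not necessarily what the notebook does.

```python
import numpy as np

def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim, dtype="float32")
    return np.mean(known, axis=0)

# Toy word vectors standing in for the loaded GloVe dictionary.
toy_vectors = {
    "flood": np.array([1.0, 0.0], dtype="float32"),
    "water": np.array([0.0, 1.0], dtype="float32"),
}

avg = average_vector("flood water everywhere".split(), toy_vectors, dim=2)
empty = average_vector(["zzz"], toy_vectors, dim=2)
print(avg, empty)
```

Out-of-vocabulary tokens are simply ignored, so short tweets made up mostly of rare tokens end up with embeddings based on very few words.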

**Note**: The final evaluation scores for each method and model are printed for different window sizes.


In [None]:
import pandas as pd
import torch
from sklearn.feature_extraction.text import CountVectorizer
import pickle
from lda import guidedlda
from transformers import BertModel, BertTokenizer, RobertaModel, RobertaTokenizer

from utils import enhanced_stop_words, preprocess_tweet, calculate_topic_modeling_score
import numpy as np
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer



In [None]:
codebook = pd.read_csv('../data/intermediate/input/step_5_codebook_manually_labeled.csv')
# Use a context manager so the pickle file handle is closed after loading
with open('../data/intermediate/input/step_4_clustered_tweets_with_embeddings.pkl', 'rb') as f:
    tweets_df = pickle.load(f)
labeled_clusters_df = pd.read_csv('../data/intermediate/input/step_4_cluster_samples_manually_labeled.csv')
# for each cluster sum is_related column and divide by the number of tweets in the cluster
# to get the percentage of tweets that are related to the event
cluster_relatedness_df = labeled_clusters_df.groupby('cluster').agg({'is_related': 'sum', 'id': 'count'}).reset_index()
cluster_relatedness_df['relatedness'] = cluster_relatedness_df['is_related'] / cluster_relatedness_df['id']
cluster_relatedness_df['is_related'] = cluster_relatedness_df['relatedness'] > 0.5
cluster_relatedness_df = cluster_relatedness_df[['cluster', 'is_related']]
related_tweets_df = tweets_df[tweets_df['cluster'].isin(cluster_relatedness_df[cluster_relatedness_df['is_related']]['cluster'].tolist())]

In [None]:
# Map each codebook category to its list of n-grams
codebook_dict = {x: codebook.loc[codebook['category'] == x, 'n_gram'].tolist() for x in codebook['category'].unique()}
category_ids = {x: i for i, x in enumerate(codebook['category'].unique())}
category_ids_inv = {i: x for i, x in enumerate(codebook['category'].unique())}

vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=0.001, max_df=0.85, stop_words=enhanced_stop_words)
v_fit = vectorizer.fit_transform(related_tweets_df['processed_text'].tolist())
word2id = dict((v, idx) for idx, v in enumerate(vectorizer.get_feature_names_out()))

seed_topics = {}
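# Map the vocabulary id of each seed n-gram to its category id; skip seeds missing from the vocabulary
for category, seed_words in codebook_dict.items():
    for word in seed_words:
        if word not in word2id:
            continue
        seed_topics[word2id[word]] = category_ids[category]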

In [None]:
TOPIC_NUMBER = 10
NITER = 25
ALPHA = .3
ETA = .05
CONF = 1
IN_OR_OUT = 1
TOP_N_WORDS = 20
window_sizes = [0, 1, 2, 5]

model = guidedlda.GuidedLDA(
        n_topics=TOPIC_NUMBER,
        n_iter=NITER,
        random_state=0,
        alpha=ALPHA,
        eta=ETA
    )
model.fit(v_fit, seed_topics=seed_topics, seed_confidence=CONF)
test_set_df = pd.read_csv('../data/test_set_for_topic_modeling.csv')
test_set_df['processed_text'] = test_set_df['text'].apply(preprocess_tweet)
test_set_df['vectorized_text'] = test_set_df['processed_text'].apply(lambda x: vectorizer.transform([x]))
# get top5 topics for each tweet as cat1_pred, cat2_pred, cat3_pred, cat4_pred, cat5_pred
test_set_df['topic_predictions'] = test_set_df['vectorized_text'].apply(lambda x: model.transform(x)[0])
for i in range(1,6):
    test_set_df[f'cat{i}_pred'] = test_set_df['topic_predictions'].apply(lambda x: category_ids_inv.get(np.argsort(x)[-i]))
    
for window_size in window_sizes:
    accuracy = calculate_topic_modeling_score(test_set_df, window_size)
    print(f'Score for window size {window_size}: {accuracy:.2f}')


In [None]:
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
bertopic_model = BERTopic(seed_topic_list=list(codebook_dict.values()), n_gram_range=(1, 3), embedding_model=sentence_model,
                        min_topic_size=10, calculate_probabilities=True, verbose=True, nr_topics=TOPIC_NUMBER, top_n_words=TOP_N_WORDS)
tweets = tweets_df["processed_text"].tolist()
bertopic_model.fit(tweets)
test_set_df = pd.read_csv('../data/test_set_for_topic_modeling.csv')
test_set_df['processed_text'] = test_set_df['text'].apply(preprocess_tweet)
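# transform returns (topics, probabilities); with calculate_probabilities=True the second item
# holds the per-topic probabilities, which are needed to rank the top 5 topics below
test_set_df['topic_predictions'] = test_set_df['processed_text'].apply(lambda x: bertopic_model.transform([x])[1][0])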
for i in range(1,6):
    test_set_df[f'cat{i}_pred'] = test_set_df['topic_predictions'].apply(lambda x: category_ids_inv.get(np.argsort(x)[-i]))

for window_size in window_sizes:
    accuracy = calculate_topic_modeling_score(test_set_df, window_size)
    print(f'Score for window size {window_size}: {accuracy:.2f}')

In [None]:
from source.utils import get_pretrained_model_and_tokenizer

model_name = 'bert' # options are bert, roberta, sbert, sroberta
model, tokenizer = get_pretrained_model_and_tokenizer(model_name)
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = model.to(device)
test_set_df = pd.read_csv('../data/test_set_for_topic_modeling.csv')
test_set_df['processed_text'] = test_set_df['text'].apply(preprocess_tweet)

def embed(texts):
    # Mean-pool the last hidden state over the batch and token dimensions so that every
    # input (a single tweet or a list of seed n-grams) becomes one hidden-size vector.
    # Padding tokens are included in the mean when a list is passed.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        hidden = model(**inputs)['last_hidden_state']
    return hidden.mean(dim=(0, 1)).cpu().numpy()

seed_topics_embeddings = [embed(seed_words) for seed_words in codebook_dict.values()]
test_set_df['embedding'] = test_set_df['processed_text'].apply(embed)
# Rank categories by dot-product similarity (ascending, so the best match is last)
test_set_df['topic_predictions'] = test_set_df['embedding'].apply(lambda x: np.argsort([np.dot(x, y) for y in seed_topics_embeddings]))
for i in range(1,6):
    test_set_df[f'cat{i}_pred'] = test_set_df['topic_predictions'].apply(lambda x: category_ids_inv.get(x[-i]))
    

for window_size in window_sizes:
    accuracy = calculate_topic_modeling_score(test_set_df, window_size)
    print(f'Score for window size {window_size}: {accuracy:.2f}')

In [None]:
model_name = 'roberta' # options are bert, roberta, sbert, sroberta
model, tokenizer = get_pretrained_model_and_tokenizer(model_name)
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = model.to(device)
test_set_df = pd.read_csv('../data/test_set_for_topic_modeling.csv')
test_set_df['processed_text'] = test_set_df['text'].apply(preprocess_tweet)

def embed(texts):
    # Mean-pool the last hidden state over the batch and token dimensions so that every
    # input (a single tweet or a list of seed n-grams) becomes one hidden-size vector.
    # Padding tokens are included in the mean when a list is passed.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        hidden = model(**inputs)['last_hidden_state']
    return hidden.mean(dim=(0, 1)).cpu().numpy()

seed_topics_embeddings = [embed(seed_words) for seed_words in codebook_dict.values()]
test_set_df['embedding'] = test_set_df['processed_text'].apply(embed)
# Rank categories by dot-product similarity (ascending, so the best match is last)
test_set_df['topic_predictions'] = test_set_df['embedding'].apply(lambda x: np.argsort([np.dot(x, y) for y in seed_topics_embeddings]))
for i in range(1,6):
    test_set_df[f'cat{i}_pred'] = test_set_df['topic_predictions'].apply(lambda x: category_ids_inv.get(x[-i]))
    

for window_size in window_sizes:
    accuracy = calculate_topic_modeling_score(test_set_df, window_size)
    print(f'Score for window size {window_size}: {accuracy:.2f}')

In [None]:
glove_embeddings_dict = {}
with open("../data/glove/glove.twitter.27B.200d.txt", 'r', encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        glove_embeddings_dict[word] = vector

seed_topics_embeddings = []
ngrams = list(codebook_dict.values())
# split ngrams into words per category
ngrams = [[x.split() for x in y] for y in ngrams]
words = []
for x in ngrams:
    flatten_list = [item for sublist in x for item in sublist]
    words.append(flatten_list)
for seed_words in words:
    seed_topics_embeddings.append(np.mean([glove_embeddings_dict.get(x) for x in seed_words if x in glove_embeddings_dict], axis=0))
    
test_set_df = pd.read_csv('../data/test_set_for_topic_modeling.csv')
test_set_df['processed_text'] = test_set_df['text'].apply(preprocess_tweet)
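# Average the GloVe vectors of known tokens; a tweet with no known token gets a 200-d zero vector instead of NaN
test_set_df['embedding'] = test_set_df['processed_text'].apply(lambda x: np.mean([glove_embeddings_dict[y] for y in x.split() if y in glove_embeddings_dict] or [np.zeros(200, dtype="float32")], axis=0))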
test_set_df['topic_predictions'] = test_set_df['embedding'].apply(lambda x: np.argsort([np.dot(x, y) for y in seed_topics_embeddings]))

for i in range(1,6):
    test_set_df[f'cat{i}_pred'] = test_set_df['topic_predictions'].apply(lambda x: category_ids_inv.get(x[-i]))

for window_size in window_sizes:
    accuracy = calculate_topic_modeling_score(test_set_df, window_size)
    print(f'Score for window size {window_size}: {accuracy:.2f}')