In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.lda_model
import pyLDAvis.gensim_models

In [2]:
# re-creating LDA model from notebook 04
df = pd.read_csv('../data/reddit_preprocessed.csv')

# Creating count-vectorizer for pre-processed data
count_text_vectorizer = CountVectorizer(stop_words='english', min_df=5, max_df=0.7)
count_text_vectors = count_text_vectorizer.fit_transform(df["processed_text"])

# Creating TF-IDF vectorizer for pre-processed data
tfidf_text_vectorizer = TfidfVectorizer(stop_words='english', min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['processed_text'])

# Latent Dirichlet Allocation (LDA) model
lda_text_model = LatentDirichletAllocation(n_components=2, random_state=509)
W_lda_text_matrix = lda_text_model.fit_transform(count_text_vectors)
H_lda_text_matrix = lda_text_model.components_

In [3]:
# Creating interactive visualization for LDA Model
lda_display = pyLDAvis.lda_model.prepare(lda_text_model, count_text_vectors, count_text_vectorizer, sort_topics=False)
pyLDAvis.display(lda_display)

#### Exploring the topics with pyLDAvis shows more detail about the terms that define each one. Because `sort_topics=False` keeps the model's own topic order and pyLDAvis numbers topics from 1, Topic 1 and Topic 2 here correspond to `lda_topic` 0 and 1 in the cells below. In Topic 1, the most relevant words are “study,” “new,” “people,” “use,” “risk,” and “age,” which suggests a focus on health, research, or social science themes. In Topic 2, the top terms include “new,” “study,” “trump,” “scientist,” “brain,” “datum,” and “researcher” (“datum” is most likely the lemmatized form of “data”). This topic also leans toward science and research but includes more political or public terms such as “trump” and “social,” which may explain some of the overlap we saw earlier between the subreddits.
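
#### The term lists above can also be read directly from the fitted model rather than from the interactive view. The sketch below is a minimal illustration, not the notebook's own code: `top_terms` is a hypothetical helper, and the vocabulary and weights are made up so the block runs on its own. It ranks terms by raw topic-term weight, which should match the pyLDAvis ordering only when the relevance slider is set to λ = 1.

```python
def top_terms(components, feature_names, n_terms=5):
    """Return the n_terms highest-weighted terms for each topic row."""
    result = []
    for row in components:
        # Indices of the largest weights, highest first
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([feature_names[i] for i in ranked[:n_terms]])
    return result


# Made-up weights for illustration only; not the fitted model's values
toy_vocab = ["age", "brain", "risk", "study", "trump"]
toy_components = [
    [4.0, 0.5, 6.0, 9.0, 0.2],  # stand-in for model topic 0
    [0.3, 5.0, 0.4, 8.0, 7.0],  # stand-in for model topic 1
]

for topic_id, terms in enumerate(top_terms(toy_components, toy_vocab, n_terms=3)):
    # pyLDAvis numbers topics from 1, so shift the index for display
    print(f"Topic {topic_id + 1}: {', '.join(terms)}")
```

#### With the objects from this notebook, the equivalent call would be `top_terms(lda_text_model.components_, count_text_vectorizer.get_feature_names_out(), n_terms=10)`.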

#### In the Intertopic Distance Map on the left, the distance between the circles shows how different the topics are from each other, and the size of each circle shows how prevalent the topic is in the corpus. Since the circles are fairly well separated, the model treats the two topics as distinct, even though they share top terms such as “study” and “new.” This view helped us interpret the topics more clearly and gives context for the accuracy score calculated by comparing the topics to the actual subreddit labels.
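
#### To make the idea of topic distance concrete: by default, pyLDAvis is understood to compare topics using the Jensen-Shannon divergence between their term distributions before projecting them onto the map. The sketch below computes that divergence for two made-up distributions. The helper names and toy weights are assumptions for illustration, not values from the fitted model.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up topic-term weights over a shared five-term vocabulary
topic_a = normalize([4.0, 0.5, 6.0, 9.0, 0.2])
topic_b = normalize([0.3, 5.0, 0.4, 8.0, 7.0])

print(f"JS divergence: {js_divergence(topic_a, topic_b):.3f}")
print(f"Upper bound (no shared terms): {math.log(2):.3f}")
```

#### A value near 0 means the topics use almost the same vocabulary, and a value near log 2 means they share almost none. On the real model, each row of `lda_text_model.components_` would first be normalized in the same way.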

In [4]:
# Assign topic number to each post based on the highest topic weight
df['lda_topic'] = W_lda_text_matrix.argmax(axis=1)

In [5]:
# Crosstab of LDA topic vs actual subreddit
pd.crosstab(df['lda_topic'], df['subreddit'])

subreddit  science  technology
lda_topic
0              429         400
1              325         417


In [6]:
# Get the topic most associated with each subreddit
topic_match = df.groupby('lda_topic')['subreddit'].agg(lambda x: x.mode()[0])

# Map predicted topics to most likely subreddit
df['predicted_subreddit'] = df['lda_topic'].map(topic_match)

# Accuracy of topic label vs actual label
accuracy = (df['predicted_subreddit'] == df['subreddit']).mean()
print(f"LDA topic alignment accuracy: {accuracy:.2f}")

LDA topic alignment accuracy: 0.54


#### We fit the topic model without using the subreddit labels, then compared the assigned topics to the actual subreddits to see whether they matched. Both topics contained a mix of science and technology posts, so the topics did not line up cleanly with the original groups. When we matched each topic to the subreddit it mostly contained, the accuracy was about 54%, only slightly above the 52% we would get by labeling every post as technology (817 of 1,571 posts). This means the topic model grouped posts quite differently from the original labels, which makes sense since science and tech often overlap.
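
#### The 54% figure can be checked by hand from the crosstab counts, and it helps to set it against a simple baseline. The sketch below copies the four counts from the crosstab output into a plain dictionary (so it runs without the dataset) and compares the majority-label accuracy with the accuracy of always predicting the most common subreddit. The variable names are illustrative.

```python
# Counts copied from the crosstab output above
crosstab = {
    0: {"science": 429, "technology": 400},
    1: {"science": 325, "technology": 417},
}

total = sum(sum(row.values()) for row in crosstab.values())

# Majority-label mapping: each topic is credited with its largest cell
matched = sum(max(row.values()) for row in crosstab.values())
alignment_accuracy = matched / total

# Baseline: label every post with the most common subreddit overall
subreddit_totals = {}
for row in crosstab.values():
    for name, count in row.items():
        subreddit_totals[name] = subreddit_totals.get(name, 0) + count
baseline_accuracy = max(subreddit_totals.values()) / total

print(f"alignment accuracy: {alignment_accuracy:.2f}")
print(f"majority baseline:  {baseline_accuracy:.2f}")
```

#### Because each topic is mapped to whichever subreddit it contains most, this score can never fall below the majority baseline, so the small gap between the two numbers is the useful signal here.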