Comparing Topic Models for Each Period

In [1]:
import pandas
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

In [2]:
%store -r reddit_sent_df

In [3]:
nltk.data.find("corpora/stopwords")

FileSystemPathPointer('/home/beherya/nltk_data/corpora/stopwords')

This verifies that the stopwords corpus is installed and returns its location; nltk.data.find raises a LookupError if the corpus is missing. Because the lookup succeeds here, the following cells can use the stopwords for text cleaning.

In [4]:
base_stopwords = set(stopwords.words("english"))

In [5]:
custom_stopwords = set([
    'like', 'get', 'dont', 'im', 'would', 'really', 'one', 'people',
    'time', 'know', 'feel', 'even', 'go', 'want', 'think', 'much',
    'life', 'day', 'days', 'years', 'year', 'something', 'nothing',
    'got', 'make', 'feeling', 'going', 'things', 'way', 'work',
    'help', 'cant', 'need', 'see', 'friends', 'family', 'ive', 'anyone',
    'anything', 'always', 'else', 'getting', 'started'
])

full_stop_words = base_stopwords.union(custom_stopwords)

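This creates a combined stopword list by merging the standard English stopwords with custom ones. The custom stopwords are words that appear frequently in conversational text but carry little topical meaning (such as "like", "really", "things"). Contractions are listed without apostrophes ("dont", "im", "cant", "ive") because the preprocessing step below strips punctuation before stopwords are matched. Contractions that are not in the list, such as "didnt", pass through and show up in the topics below. Removing the listed words lets the model focus on more meaningful content words.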
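A minimal sketch of how the combined set behaves when filtering tokens. The `base` and `custom` sets here are small hand-written stand-ins chosen for illustration, not the real NLTK list or the full custom list from the cell above.

```python
# Stand-in sets; the notebook uses stopwords.words("english") and a longer custom list.
base = {"i", "the", "and", "to"}
custom = {"like", "really", "things"}

# Same set operation used to build full_stop_words
stop = base.union(custom)

tokens = "i really like the quiet and things to read".split()
kept = [t for t in tokens if t not in stop]
print(kept)
```

Only the words outside both sets survive, which is the effect the stopword list has inside the vectorizer.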

In [6]:
def preprocess_text(text):
    if not isinstance(text, str):
        return ""

    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    return text


This function cleans text by removing URLs, stripping out punctuation and numbers, and converting everything to lowercase. This standardizes the text for analysis.
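As a quick check, the function can be run on a made-up sample string (the sample is an assumption for illustration, not taken from the dataset). The function body is repeated so the snippet runs on its own.

```python
import re

def preprocess_text(text):
    # Same logic as the notebook cell above
    if not isinstance(text, str):
        return ""
    text = re.sub(r'http\S+|www\S+', '', text)   # drop URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)      # drop punctuation and digits
    return text.lower()

sample = "I don't know! Visit https://example.com 123"
cleaned = preprocess_text(sample)

# Split on whitespace to ignore the extra spaces left where text was removed
print(cleaned.split())

# Non-string input (for example a missing value) becomes an empty string
print(repr(preprocess_text(None)))
```

Note that the apostrophe in "don't" is deleted rather than replaced by a space, so the word becomes "dont".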


In [7]:
def display_topics(df, text_col, title, n_topics=5, n_top_words=10):
    """
    Runs and prints topic models for a given DataFrame.
    """
    print("\n" + "="*50)
    print(f" {title} (n={len(df)} posts) ")
    print("="*50)

    if len(df) < n_topics:
        print(f"Not enough documents to model {n_topics} topics. Skipping.")
        return

    # 1. Vectorize: Convert text to a word-count matrix
    # We apply our preprocessing and stopword removal here
    vectorizer = CountVectorizer(
        preprocessor=preprocess_text,
        stop_words=list(full_stop_words),
        max_df=0.9,  # Ignore words in > 90% of docs
        min_df=10,   # Ignore words in < 10 docs
        ngram_range=(1, 1) # Only use single words
    )

    try:
        dtm = vectorizer.fit_transform(df[text_col])
    except ValueError as e:
        print(f"Error vectorizing text (maybe all words were stopwords?): {e}")
        return

    # 2. Model: Run Latent Dirichlet Allocation
    lda = LatentDirichletAllocation(
        n_components=n_topics,
        random_state=42, # For reproducible results
        n_jobs=-1
    )
    lda.fit(dtm)

    # 3. Display: Print the top words for each topic
    feature_names = vectorizer.get_feature_names_out()

    for topic_idx, topic in enumerate(lda.components_):
        # Get the indices of the top words
        top_words_indices = topic.argsort()[:-n_top_words - 1:-1]
        # Get the words themselves
        top_words = [feature_names[i] for i in top_words_indices]
        print(f"Topic {topic_idx + 1}: {' '.join(top_words)}")

This is the main function that performs topic modeling using Latent Dirichlet Allocation (LDA). It:

Uses CountVectorizer to convert text into a document-term matrix (word counts)

Filters out very common words (appearing in >90% of documents) and very rare words (appearing in <10 documents)

Fits an LDA model with n_topics topics (5 by default)

Extracts and displays the top n_top_words words for each topic (10 by default)
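The document-frequency filter in the second step can be sketched in plain Python. This is an illustration of the min_df and max_df semantics on four made-up documents, with a smaller min_df than the notebook uses so that the toy corpus is not filtered away entirely. It is not how CountVectorizer is implemented internally.

```python
from collections import Counter

docs = [
    "panic attack sleep",
    "panic sleep job",
    "panic job school",
    "panic school sleep",
]

# Document frequency: number of documents that contain each word at least once
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

n_docs = len(docs)
min_df = 2      # integer: absolute document count (the notebook uses 10)
max_df = 0.9    # float: proportion of documents (same value as the notebook)

vocab = sorted(
    word for word, df in doc_freq.items()
    if df >= min_df and df <= max_df * n_docs
)
print(vocab)
```

Here "panic" is dropped because it appears in every document, and "attack" is dropped because it appears in only one.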

The parameter random_state=42 makes the results reproducible, and n_jobs=-1 uses all available CPU cores for faster processing.
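The top-word extraction in display_topics relies on a reversed slice of argsort, which can be hard to read at first. A small sketch with made-up weights for a single topic (the vocabulary and numbers are assumptions for illustration):

```python
import numpy as np

feature_names = np.array(["anxiety", "job", "panic", "school", "sleep"])
topic = np.array([0.9, 0.1, 0.7, 0.2, 0.4])  # made-up weights for one topic
n_top_words = 3

# argsort returns indices in ascending order of weight; the reversed slice
# walks back from the end and keeps the last n_top_words indices,
# so the highest-weighted word comes first
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [str(feature_names[i]) for i in top_idx]
print(top_words)
```

The same expression is applied to each row of lda.components_ in the function above.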


In [8]:
pre_covid_df = reddit_sent_df[reddit_sent_df["covid_period"] == "Pre-COVID"]
during_covid_df = reddit_sent_df[reddit_sent_df["covid_period"] == "During COVID"]
post_covid_df = reddit_sent_df[reddit_sent_df["covid_period"] == "Post-COVID"]

This splits the Reddit dataset into three time periods to compare how topics changed before, during, and after COVID-19.
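Before modeling, it may be worth confirming how many posts fall into each period and that no row carries an unexpected label. A minimal sketch on a toy DataFrame: the column names covid_period and full_text and the three labels come from the notebook, while the rows and the names `toy_df`, `known`, and `unmatched` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for reddit_sent_df
toy_df = pd.DataFrame({
    "covid_period": ["Pre-COVID", "During COVID", "During COVID", "Post-COVID"],
    "full_text": ["a", "b", "c", "d"],
})

# Number of posts per period
counts = toy_df["covid_period"].value_counts()
print(counts)

# Rows whose label matches none of the three filters would be silently
# excluded from all three subsets, so check that there are none
known = {"Pre-COVID", "During COVID", "Post-COVID"}
unmatched = toy_df[~toy_df["covid_period"].isin(known)]
print(len(unmatched))
```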

In [9]:
display_topics(pre_covid_df, 'full_text', title="Pre-COVID Topics")
display_topics(during_covid_df, 'full_text', title="During-COVID Topics")
display_topics(post_covid_df, 'full_text', title="Post-COVID Topics")



 Pre-COVID Topics (n=576 posts) 
Topic 1: fucking everyone alone someone love nye never hate fuck many
Topic 2: mental anxiety health didnt first school felt back night since
Topic 3: anxiety bad good person happy anxious someone everything tired talk
Topic 4: new happy alone everyone better everything anxiety anymore hope shit
Topic 5: anxiety panic attack sleep never someone told depression job pain

 During-COVID Topics (n=10517 posts) 
Topic 1: anxiety anxious panic also sleep attack bad attacks take heart
Topic 2: job mental health back home since house didnt last told
Topic 3: someone depression thoughts mental talk person love self good thing
Topic 4: anymore everything hate never tired every happy fucking better good
Topic 5: school didnt never deleted user friend parents mom said told

 Post-COVID Topics (n=6218 posts) 
Topic 1: end talk didnt someone say said friend never thing told
Topic 2: job never mental live love mom parents health school hate
Topic 3: depression better anymore everything thoughts good 

## Pre-COVID Topics

(n=576 posts)
Interpretation: With only 576 posts, this is the smallest of the three subsets, so its topics are likely the least stable. The topics show:

Topic 1: Social isolation and loneliness ("alone", "everyone", "nye" suggesting New Year's Eve loneliness)

Topic 2: Mental health struggles, particularly anxiety related to school ("mental", "anxiety", "health", "school")

Topic 3: General anxiety and emotional states ("anxiety", "anxious", "happy", "tired")

Topic 4: Hope and improvement ("new", "happy", "better", "hope") mixed with anxiety

Topic 5: Clinical anxiety symptoms ("panic", "attack", "sleep", "depression")


Key insight: Even before COVID, anxiety and mental health were prominent topics, with distinct themes around panic attacks and social isolation.


## During COVID Topics

(n=10,517 posts)
The topics show:

Topic 1: Physical anxiety symptoms ("panic", "attack", "sleep", "heart") - very clinical

Topic 2: Life disruption ("job", "mental", "health", "home", "house") - likely reflecting lockdown impacts

Topic 3: Deep emotional struggles ("depression", "thoughts", "mental", "self") - more serious mental health concerns

Topic 4: Exhaustion and negativity ("anymore", "hate", "tired", "never", "fucking")

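Topic 5: Youth and family dynamics ("school", "friend", "parents", "mom") - note that "deleted" and "user" are separate tokens that probably come from placeholder text left by removed posts or deleted accounts rather than from what users wrote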


Key insight: The large increase in posts and the shift toward more serious mental health language ("depression", and "thoughts", which may point to suicidal ideation) suggest that COVID intensified mental health struggles, particularly around isolation, disrupted routines, and family stress.

## Post-COVID Topics

(n=6,218 posts)
About 59% of the during-COVID volume, but still roughly 11 times the pre-COVID volume. The topics show:

Topic 1: Communication and relationships ("talk", "friend", "say", "told")

Topic 2: Life circumstances and family ("job", "live", "parents", "school")

Topic 3: Depression and emotional fatigue ("depression", "tired", "thoughts")

Topic 4: Physical anxiety symptoms persist ("anxiety", "heart", "symptoms", "pain", "panic")

Topic 5: Continued panic disorder ("panic", "anxious", "attacks")


Key insight: While volume decreased from the during-COVID period, it remains much higher than pre-COVID levels, suggesting lasting mental health impacts. The persistence of clinical anxiety symptoms (Topics 4 & 5) suggests that anxiety disorders triggered or worsened during COVID may not have fully resolved.
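The volume comparisons used in these interpretations can be checked directly from the post counts printed by display_topics. This is plain arithmetic on raw counts; this section does not state how long each period is, so the ratios are not adjusted for period length.

```python
# Post counts as printed in the output above
pre, during, post = 576, 10517, 6218

print(f"post / during: {post / during:.2f}")
print(f"post / pre:    {post / pre:.1f}")
print(f"during / pre:  {during / pre:.1f}")
```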
