# Understanding Climate Change Discourse on Reddit: A Distributed Analysis of Public Themes, Sentiment, and Recommendations
## ST446 Group Project
### Candidate Nrs: 39884, 48099, 49308, 50250

# LSA Topic Modeling with an LLM Representation Model (Groq llama3-70b-8192)

This notebook applies Latent Semantic Analysis (LSA) to the Reddit comments dataset for topic modeling. LSA uncovers latent semantic patterns in a term-document matrix using Singular Value Decomposition (SVD). 
To improve topic interpretability, topic vectors are further labeled using an LLM-based representation model (Groq's llama3-70b-8192). 
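
For reference, the factorization behind LSA can be written as below. This is the generic textbook form rather than anything specific to this notebook; here the rows of $A$ are comments and the columns are vocabulary terms, and $k$ is the number of topics kept.

```latex
A \approx U_k \Sigma_k V_k^{\top}, \qquad
A \in \mathbb{R}^{n \times m},\quad
U_k \in \mathbb{R}^{n \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{m \times k},\quad
\sigma_1 \ge \dots \ge \sigma_k \ge 0
```

Each column of $V_k$ holds one weight per vocabulary term and is read as a topic; each row of $U_k$ gives one document's coordinates in the $k$-dimensional topic space. Singular vectors are only defined up to sign, so the sign of a weight carries no meaning on its own.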

### Dataproc Cluster Setup and Initialization Actions

Before running PySpark notebooks on Google Cloud Platform (GCP), we perform a series of steps to initialize the environment. This includes creating a GCS bucket to store resources, uploading setup scripts, and launching a Dataproc cluster configured for scalable distributed processing.

#### Create a GCS Bucket

We first create a Google Cloud Storage (GCS) bucket where initialization scripts are stored:

```bash
gsutil mb gs://st446-gp-sm
```
#### Upload Initialization Script

We upload a script (`my_actions.sh`) to the bucket. This script runs during cluster startup to install additional Python packages or configure the environment as needed.

```bash
gsutil cp my_actions.sh gs://st446-gp-sm
```
#### Create Dataproc Cluster with Custom Settings

We launch a Dataproc cluster optimized for parallel NLP workloads. Specs include:

- 2 worker nodes (`n2-standard-4`) for higher memory tasks
- Increased executor and driver memory for large matrix computations (e.g., SVD)
- 64 shuffle partitions for better load distribution
- Initialization script applied automatically on startup

```bash
gcloud dataproc clusters create st446-cluster-gp-sm \
  --enable-component-gateway \
  --public-ip-address \
  --region=europe-west1 \
  --master-machine-type=n2-standard-4 \
  --master-boot-disk-size=100 \
  --num-workers=2 \
  --worker-machine-type=n2-standard-4 \
  --worker-boot-disk-size=200 \
  --image-version=2.2-debian12 \
  --optional-components=JUPYTER \
  --metadata='PIP_PACKAGES=scikit-learn nltk pandas numpy' \
  --initialization-actions='gs://st446-gp-sm/my_actions.sh' \
  --project=capstone-data-1-wto \
  --properties='spark:spark.driver.memory=8g,spark:spark.executor.memory=6g,spark:spark.executor.cores=2,spark:spark.executor.instances=2,spark:spark.yarn.executor.memoryOverhead=1024,spark:spark.sql.shuffle.partitions=64'
```

In [1]:
!pip install gensim
!pip install groq

Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-7.1.0-py3-none-any.whl.metadata (24 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
  Downloading wrapt-1.17.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.4 kB)
Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Downloading smart_open-7.1.0-py3-none-any.whl (61 kB)
Downloading wrapt-1.17.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (83 kB)
Installing collected packages: wrapt, smart-open, gensim
Successful

In [2]:
# Import libraries used in this notebook

# Standard library
import hashlib
import os
import re
import string
import subprocess
import zipfile
from datetime import datetime
from time import time

# Third-party
import groq
import nltk
import numpy as np
import pandas as pd
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.decomposition import TruncatedSVD

# PySpark
import pyspark.sql.functions as sql_f
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import CountVectorizer, IDF, StopWordsRemover, Tokenizer
from pyspark.ml.linalg import DenseVector, Vectors
from pyspark.mllib.linalg import Vectors as OldVectors
from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, regexp_replace, size, udf
from pyspark.sql.types import (
    ArrayType, DoubleType, IntegerType, StringType, StructField, StructType
)

# Shared lemmatizer instance used by the lemmatization UDF
lemmatizer = WordNetLemmatizer()

# Cluster Specifications

In [3]:
spark_conf = spark.sparkContext.getConf()
executor_cores = spark_conf.get("spark.executor.cores")
executor_memory = spark_conf.get("spark.executor.memory")
print(f"vCPUs per executor: {executor_cores}")
print(f"RAM per executor: {executor_memory}")

vCPUs per executor: 2
RAM per executor: 6g


25/05/05 22:48:51 WARN SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.


## Data Loading
1. Download the CSV file from https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset
2. Upload the zip file to a bucket in Cloud Storage.
3. SSH into the master node and copy the file from Cloud Storage to the master node: `gsutil cp gs://<your-bucket-name>/archive.zip /home/freya_nagel`
4. Verify that it is there: `ls -lh ~/archive.zip`

In [4]:
# 1. Download the Kaggle dataset zip
!curl -L -o climate.zip \
    "https://www.kaggle.com/api/v1/datasets/download/pavellexyr/the-reddit-climate-change-dataset"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1536M  100 1536M    0     0  43.2M      0  0:00:35  0:00:35 --:--:-- 43.9MM


In [5]:
# 2. Unzip it (this will extract all files, including the comments CSV)
!unzip -o climate.zip

# 3. Remove any old copy in HDFS and put the comments file there
!hadoop fs -rm -f /the-reddit-climate-change-dataset-comments.csv
!hadoop fs -put the-reddit-climate-change-dataset-comments.csv /

# 4. Verify upload
!hadoop fs -ls /

Archive:  climate.zip
  inflating: the-reddit-climate-change-dataset-comments.csv  
  inflating: the-reddit-climate-change-dataset-posts.csv  
Found 4 items
-rw-r--r--   2 root hadoop 4111000325 2025-05-05 22:51 /the-reddit-climate-change-dataset-comments.csv
drwxrwxrwt   - hdfs hadoop          0 2025-05-05 22:42 /tmp
drwxrwxrwt   - hdfs hadoop          0 2025-05-05 22:47 /user
drwxrwxrwt   - hdfs hadoop          0 2025-05-05 22:40 /var


In [6]:
# point to the new HDFS path
comments_path = "hdfs://st446-cluster-gp-sm-m:8020/the-reddit-climate-change-dataset-comments.csv"

### Creating Corpus Dataframes

In [7]:
schema = StructType([
    StructField("type",           StringType(), True),
    StructField("id",             StringType(), True),
    StructField("subreddit.id",   StringType(), True),
    StructField("subreddit.name", StringType(), True),
    StructField("subreddit.nsfw", StringType(), True),
    StructField("created_utc",    StringType(), True),
    StructField("permalink",      StringType(), True),
    StructField("body",           StringType(), True),
    StructField("sentiment",      DoubleType(), True),
    StructField("score",          IntegerType(),True)
])

df = spark.read \
    .option("header", "true") \
    .option("multiLine", "true") \
    .option("escape", "\"") \
    .schema(schema) \
    .csv(comments_path)

df = df.repartition(64)
df.printSchema()
df.show(5)

root
 |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- subreddit.id: string (nullable = true)
 |-- subreddit.name: string (nullable = true)
 |-- subreddit.nsfw: string (nullable = true)
 |-- created_utc: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- body: string (nullable = true)
 |-- sentiment: double (nullable = true)
 |-- score: integer (nullable = true)



                                                                                

+-------+-------+------------+----------------+--------------+-----------+--------------------+--------------------+---------+-----+
|   type|     id|subreddit.id|  subreddit.name|subreddit.nsfw|created_utc|           permalink|                body|sentiment|score|
+-------+-------+------------+----------------+--------------+-----------+--------------------+--------------------+---------+-----+
|comment|icyjgoj|       2qt49|          berlin|         false| 1655656284|https://old.reddi...|&gt;These days yo...|  -0.1077|   11|
|comment|f0wkbi5|       2qh6p|    conservative|         false| 1569008948|https://old.reddi...|I don't think mos...|  -0.2109|    5|
|comment|hgbpjwd|       2qhw9|        collapse|         false| 1634019096|https://old.reddi...|Energy demand is ...|   0.6326|    2|
|comment|d9ugnl0|       2cneq|        politics|         false| 1478794354|https://old.reddi...|I'm a university ...|  -0.8412|   -1|
|comment|fcxm7ql|       2tk0s|unpopularopinion|         false| 157806

# For 20% Data

## Sample Data Set Creation and Specifications

In [8]:
# Sample
sample_df = df.sample(withReplacement=False, fraction=0.2, seed=42)
sample_df = sample_df.repartition(64)
print(f"Number of partitions: {sample_df.rdd.getNumPartitions()}")
# Print number of comments
num_comments = sample_df.count()
print(f"Number of comments in sample: {num_comments}")



Number of partitions: 64




Number of comments in sample: 920360


                                                                                

In [9]:
temp_path = "hdfs:///tmp/sampled_data_size_check"
sample_df.write.mode("overwrite").parquet(temp_path)
!hadoop fs -du -s -h /tmp/sampled_data_size_check

                                                                                

479.6 M  959.2 M  /tmp/sampled_data_size_check


## Data Pre-Processing

In [10]:
# Start of runtime
start_time = time()

In [11]:
custom_stopwords = set([
    "lt", "gt", "ref", "quot", "cite", "br", "amp", "https", "http", "urlhttps", "urlhttp", 
    "file", "image", "jpg", "png", "gif", "svg", "thumb", "px", "category", "url", "external", 
    "link", "source", "web", "cite", "reference", "reflist", "main", "article", "seealso", 
    "further", "infobox", "template", "navbox", "redirect", "harvnb", "isbn", "doi", "pmid", 
    "ssrn", "jstor", "bibcode", "arxiv", "ol", "hdl", "wikidata", "wiki", "math", "sup", "sub", 
    "nbsp", "equation", "displaystyle", "begin", "end", "left", "right", "sqrt", "frac", "sum", 
    "prod", "int", "lim", "rightarrow", "infty", "alpha", "beta", "gamma", "delta", "epsilon", 
    "zeta", "eta", "theta", "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho", 
    "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega", "mathrm", "mathbb", "mathcal", 
    "mathbf", "cdots", "ldots", "vdots", "ddots", "forall", "exists", "in", "ni", "subset", 
    "subseteq", "supset", "supseteq", "emptyset", "cap", "cup", "setminus", "not", "times", 
    "div", "cdot", "pm", "mp", "oplus", "otimes", "odot", "leq", "geq", "neq", "approx", 
    "aligncenter", "fontsize", "alignright", "alignleft", "textalign", "bold", "italic", 
    "underline", "strikethrough", "lineheight", "padding", "margin", "width", "height", "float", 
    "clear", "border", "background", "color", "font", "family", "size", "weight", "style", 
    "decoration", "verticalalign", "textindent", "pre-wrap", "nowrap", "valign", "bgcolor", 
    "style", "class", "id", "width", "height", "align", "border", "cellpadding", "cellspacing", 
    "colspan", "rowspan", "nowrap", "target", "rel", "hreflang", "title", "alt", "src", "dir", 
    "lang", "type", "name", "value", "readonly", "multiple", "onclick", "onmousedown", 
    "onmouseup", "onmouseover", "onmouseout", "onload", "onunload", "onsubmit", "onreset", 
    "onfocus", "onblur", "onkeydown", "onkeyup", "onkeypress", "onerror", "infobox", "caption", 
    "cite", "dmy", "mdy", "date", "archive", "www", "com", "org", "access", "ndash", "sfn", "dts", "vauthors", "mvar", 
    "ipaslink", "ipa", "iii", "ibn", "first", "last", "also", "html", "use", "publisher", "year", "one", 
    "page", "new", "trek", "ipablink", "similar", "usual", "two", "abbr", "used", "est", "ibm", "first1",
    "first2", "last1", "last2", "free", "pdf"
])

In [12]:
# 1. Clean text
df_clean = (
    sample_df
      .withColumn("body_clean", lower(col("body")))
      .withColumn("body_clean", regexp_replace("body_clean", r"http\S+", ""))  
      .withColumn("body_clean", regexp_replace("body_clean", r"[^a-z\s]", " "))  
      .withColumn("body_clean", regexp_replace("body_clean", r"\s+", " "))     
)

# 2. Tokenize
tokenizer = Tokenizer(inputCol="body_clean", outputCol="tokens")
df_tokens = tokenizer.transform(df_clean)

# 3. Remove stopwords
default_stops = StopWordsRemover.loadDefaultStopWords("english")
combined_stops = list(set(default_stops) | custom_stopwords)

remover = StopWordsRemover(
    inputCol="tokens",
    outputCol="filtered",
    stopWords=combined_stops
)
df_no_stop = remover.transform(df_tokens)

# 4. Lemmatize (tokens of two characters or fewer are dropped before lemmatizing)
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens if len(token) > 2]

lemmatize_udf = udf(lemmatize_tokens, ArrayType(StringType()))

df_lemmatized = df_no_stop.withColumn("lemmatized_tokens", lemmatize_udf(col("filtered")))

# 5. Second pass with the same UDF: drops tokens that lemmatization shortened to
#    two characters or fewer (the remaining tokens are lemmatized again)
df_final = df_lemmatized.withColumn("final_tokens", lemmatize_udf(col("lemmatized_tokens")))

df_final.select("body_clean", "final_tokens").show(5)


+--------------------+--------------------+
|          body_clean|        final_tokens|
+--------------------+--------------------+
|sacrificing the f...|[sacrificing, fut...|
|lol you actually ...|[lol, actually, g...|
|global warming er...|[global, warming,...|
| gt action on cli...|[action, climate,...|
|meterologists hav...|[meterologists, s...|
+--------------------+--------------------+
only showing top 5 rows
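
To make the effect of the three `regexp_replace` steps concrete, here is a minimal plain-Python sketch that applies the same patterns to a single string. It is an analogue for illustration only: it uses Python's `re` rather than Spark, the sample sentence is made up, and `clean_text` and `tokenize` are hypothetical helper names. Stopword removal and lemmatization are left out.

```python
import re


def clean_text(text: str) -> str:
    """Apply the same cleaning patterns as the Spark pipeline to one string."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)    # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)  # every non-letter becomes a space
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text


def tokenize(text: str, min_len: int = 3) -> list[str]:
    """Split on whitespace and keep tokens with at least min_len characters."""
    return [tok for tok in text.split() if len(tok) >= min_len]


# Made-up comment containing an HTML entity, punctuation and a URL
sample = "&gt;Action on Climate: see https://example.com/x NOW!"
cleaned = clean_text(sample)
print(repr(cleaned))
print(tokenize(cleaned))
```

The HTML entity `&gt;` survives cleaning as the token `gt`, which is why entries such as `gt`, `lt` and `amp` appear in `custom_stopwords`; in this sketch it is removed by the length filter instead.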



                                                                                

In [13]:
# Vectorize on the sample
cv = CountVectorizer(
    inputCol="final_tokens",
    outputCol="Features",
    minDF=100,
    vocabSize=10000
)
cv_model = cv.fit(df_final)
df_tf = cv_model.transform(df_final)
df_tf.select("final_tokens", "Features").show(5)
df_tf.cache()

vocab_sample = cv_model.vocabulary
print(f"Sample vocab size: {len(vocab_sample)}")
print("First 20 sample-vocab entries:", vocab_sample[:20])


+--------------------+--------------------+
|        final_tokens|            Features|
+--------------------+--------------------+
|[sacrificing, fut...|(10000,[0,1,3,21,...|
|[lol, actually, g...|(10000,[0,1,2,3,7...|
|[global, warming,...|(10000,[0,1,24,50...|
|[action, climate,...|(10000,[0,1,5,6,7...|
|[meterologists, s...|(10000,[0,1,29,39...|
+--------------------+--------------------+
only showing top 5 rows

Sample vocab size: 10000
First 20 sample-vocab entries: ['climate', 'change', 'people', 'like', 'think', 'thing', 'even', 'get', 'make', 'world', 'need', 'much', 'way', 'year', 'know', 'going', 'want', 'time', 'say', 'issue']
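
The roles of `minDF` and `vocabSize` can be illustrated with a small plain-Python sketch. It is not Spark's implementation; it assumes the documented behaviour that terms appearing in fewer than `minDF` documents are dropped and that the remaining terms are ranked by corpus-wide term frequency. `build_vocab` and `to_sparse_counts` are hypothetical helper names, and the three toy documents are made up.

```python
from collections import Counter


def build_vocab(docs, min_df=1, vocab_size=None):
    """Keep terms present in at least min_df documents, ranked by term frequency."""
    term_freq = Counter()
    doc_freq = Counter()
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    kept = [(t, c) for t, c in term_freq.items() if doc_freq[t] >= min_df]
    kept.sort(key=lambda tc: (-tc[1], tc[0]))  # most frequent first, ties by name
    return [t for t, _ in kept[:vocab_size]]


def to_sparse_counts(tokens, vocab):
    """Map a token list to {vocab index: count}, ignoring out-of-vocabulary terms."""
    index = {t: i for i, t in enumerate(vocab)}
    counts = Counter(index[t] for t in tokens if t in index)
    return dict(sorted(counts.items()))


toy_docs = [
    ["climate", "change", "people"],
    ["climate", "change", "climate"],
    ["climate", "tax"],
]
toy_vocab = build_vocab(toy_docs, min_df=2, vocab_size=10)
print(toy_vocab)
print(to_sparse_counts(toy_docs[1], toy_vocab))
```

The index-to-count mapping corresponds to the sparse vectors shown in the `Features` column, where only the non-zero positions of the 10,000-dimensional vector are stored.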


                                                                                

# Topic Modeling

## LSA

In [14]:
# Convert DataFrame column to RDD of dense vectors and repartition
dense_rdd = df_tf.select("Features") \
    .rdd \
    .map(lambda row: OldVectors.dense(row["Features"].toArray())) \
    .repartition(64)

# Cache
dense_rdd = dense_rdd.cache()
dense_rdd.count()

# Create RowMatrix for SVD
mat = RowMatrix(dense_rdd)

                                                                                

In [15]:
# Number of topics/components
num_topics = 7

# Compute SVD
svd = mat.computeSVD(num_topics, computeU=True)

U = svd.U  # Document-topic matrix
s = svd.s  # Singular values (importance of topics)
V = svd.V  # Word-topic matrix


25/05/05 23:11:07 WARN RowMatrix: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
25/05/05 23:11:07 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.arpack.JNIARPACK
25/05/05 23:44:27 WARN RowMatrix: The input data was not directly cached, which may hurt performance if its parent RDDs are also uncached.


In [16]:
# End of Runtime
end_time = time()
runtime_minutes = (end_time - start_time) / 60
print(f"Runtime: {runtime_minutes:.2f} minutes")

Runtime: 45.64 minutes


In [17]:
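# V has shape (vocab_size, num_topics): one column of term weights per topic
V_np = np.array(V.toArray())
vocab = cv_model.vocabulary

for topic_idx in range(V_np.shape[1]):
    topic_weights = V_np[:, topic_idx]
    # Indices of the 10 largest (signed) weights, in descending order
    top_indices = topic_weights.argsort()[-10:][::-1]
    top_words = [vocab[i] for i in top_indices]
    print(f"\nTopic {topic_idx + 1}: {top_words}")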


Topic 1: ['climate', 'change', 'people', 'like', 'think', 'thing', 'even', 'get', 'make', 'world']

Topic 2: ['climate', 'change', 'warming', 'global', 'temperature', 'scientist', 'science', 'earth', 'carbon', 'emission']

Topic 3: ['people', 'change', 'climate', 'think', 'thing', 'like', 'believe', 'want', 'someone', 'know']

Topic 4: ['human', 'warming', 'like', 'earth', 'temperature', 'thing', 'global', 'science', 'world', 'year']

Topic 5: ['like', 'think', 'trump', 'thing', 'know', 'science', 'biden', 'say', 'really', 'even']

Topic 6: ['biden', 'science', 'people', 'warming', 'global', 'temperature', 'human', 'year', 'scientist', 'earth']

Topic 7: ['science', 'trump', 'government', 'tax', 'state', 'party', 'country', 'policy', 'republican', 'issue']


### LSA Discussion

Topic modeling using LSA had several challenges. Although it effectively reduces dimensionality using Singular Value Decomposition (SVD), the resulting topics showed significant word overlap—terms like *climate*, *change*, *people*, and *like* appeared in multiple topics. This indicates that LSA produced less distinct and more blended themes.
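
The overlap can be quantified with a pairwise Jaccard index over the top-10 word lists printed above. The sketch below is one possible way to do this and was not part of the original analysis; the `jaccard` helper is a hypothetical name, and the word lists are copied from the notebook output.

```python
from itertools import combinations

# Top-10 words per topic, copied from the LSA output above
topics = {
    1: ["climate", "change", "people", "like", "think", "thing", "even", "get", "make", "world"],
    2: ["climate", "change", "warming", "global", "temperature", "scientist", "science", "earth", "carbon", "emission"],
    3: ["people", "change", "climate", "think", "thing", "like", "believe", "want", "someone", "know"],
    4: ["human", "warming", "like", "earth", "temperature", "thing", "global", "science", "world", "year"],
    5: ["like", "think", "trump", "thing", "know", "science", "biden", "say", "really", "even"],
    6: ["biden", "science", "people", "warming", "global", "temperature", "human", "year", "scientist", "earth"],
    7: ["science", "trump", "government", "tax", "state", "party", "country", "policy", "republican", "issue"],
}


def jaccard(words_a, words_b):
    """Share of distinct words that two topics have in common."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)


# All topic pairs, most similar first
overlaps = sorted(
    ((jaccard(topics[i], topics[j]), i, j) for i, j in combinations(topics, 2)),
    reverse=True,
)
for score, i, j in overlaps[:3]:
    print(f"Topics {i} and {j}: Jaccard = {score:.2f}")
```

A value of 0 means two topics share no top words and 1 means their top-10 lists are identical, so high-scoring pairs are candidates for merging or for a lower `num_topics`.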

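From a computational perspective, LSA was not efficient in our setup. Each sparse count vector was converted to a dense vector before being passed to `RowMatrix`, so memory usage scaled with the number of comments times the vocabulary size. Even after capping the vocabulary at 10,000 terms and sampling 20% of the dataset, we encountered memory errors and executor crashes during SVD. Repartitioning into 64 partitions improved stability, but runtime remained high (45.64 minutes from preprocessing through SVD). The dense conversion is a property of this pipeline rather than of LSA itself: `RowMatrix` also accepts sparse `pyspark.mllib` vectors, so keeping the rows sparse would likely reduce memory pressure, although we did not test this.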
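
The memory pressure described above can be estimated with back-of-envelope arithmetic. The sketch below uses the sample size and vocabulary size reported earlier in the notebook; the average number of distinct terms per comment (`avg_nnz`) is a hypothetical value chosen for illustration, and JVM object overhead is ignored.

```python
n_docs = 920_360       # comments in the 20% sample (from the count above)
vocab_size = 10_000    # CountVectorizer vocabSize
bytes_per_double = 8

# Dense storage: one double per (document, term) cell
dense_bytes = n_docs * vocab_size * bytes_per_double
print(f"Dense rows:  {dense_bytes / 1e9:.1f} GB")

# ASSUMPTION: hypothetical average of 30 distinct vocabulary terms per comment
avg_nnz = 30
bytes_per_entry = 4 + 8  # int index + double value per non-zero entry
sparse_bytes = n_docs * avg_nnz * bytes_per_entry
print(f"Sparse rows: {sparse_bytes / 1e9:.2f} GB")

# Executor heap configured at cluster creation: 2 executors x 6g
executor_heap_gb = 2 * 6
print(f"Executor heap: {executor_heap_gb} GB")
```

Under these numbers the dense representation needs about 73.6 GB, several times the 12 GB of configured executor heap, which is consistent with cached partitions being evicted and executors being lost during and after the SVD.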

## Evaluation of Model using different Metrics

In [18]:
sampled_docs = df_final.select("final_tokens").sample(withReplacement=False, fraction=0.01, seed=42)
docs_tokens = sampled_docs.rdd.map(lambda row: row[0]).collect()

# top words per topic from V
top_words = [[vocab[i] for i in topic.argsort()[-10:][::-1]] for topic in V_np.T]

25/05/05 23:44:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_106_52 !
25/05/05 23:44:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_106_58 !
25/05/05 23:44:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_106_9 !
25/05/05 23:44:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_117_29 !
25/05/05 23:44:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_106_5 !
25/05/05 23:44:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_117_15 !
25/05/05 23:44:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_117_44 !
25/05/05 23:44:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_106_44 !
25/05/05 23:44:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_117_33 !
25/05/05 23:44:46 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_106_8 !
25/05/05 23:44:46 WARN BlockManagerMasterEndpoint: No

25/05/05 23:44:49 WARN YarnAllocator: Container from a bad node: container_1746484789411_0001_01_000018 on host: st446-cluster-gp-sm-w-0.europe-west1-b.c.capstone-data-1-wto.internal. Exit status: 143. Diagnostics: [2025-05-05 23:44:49.355]Container killed on request. Exit code is 143
[2025-05-05 23:44:49.355]Container exited with a non-zero exit code 143. 
[2025-05-05 23:44:49.356]Killed by external signal
.
25/05/05 23:44:49 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 7 for reason Container from a bad node: container_1746484789411_0001_01_000018 on host: st446-cluster-gp-sm-w-0.europe-west1-b.c.capstone-data-1-wto.internal. Exit status: 143. Diagnostics: [2025-05-05 23:44:49.355]Container killed on request. Exit code is 143
[2025-05-05 23:44:49.355]Container exited with a non-zero exit code 143. 
[2025-05-05 23:44:49.356]Killed by external signal
.
25/05/05 23:44:49 ERROR YarnScheduler: Lost executor 7 on st446-cluster-gp-sm-w-0.europe-west1-

In [19]:
# Gensim dictionary and corpus from tokenized docs
dictionary = Dictionary(docs_tokens)
corpus = [dictionary.doc2bow(doc) for doc in docs_tokens]

cm = CoherenceModel(topics=top_words, texts=docs_tokens, dictionary=dictionary, coherence='c_v')
cv_score = cm.get_coherence()

print(f"CV Coherence Score: {cv_score:.4f}")

CV Coherence Score: 0.5012


In [20]:
cm_umass = CoherenceModel(
    topics=top_words,
    texts=docs_tokens,
    dictionary=dictionary,
    coherence='u_mass'
)
umass_score = cm_umass.get_coherence()
print(f"UMass Coherence Score: {umass_score:.4f}")

UMass Coherence Score: -1.8095


In [21]:
cm_cnpmi = CoherenceModel(
    topics=top_words,
    texts=docs_tokens,
    dictionary=dictionary,
    coherence='c_npmi'
)
cnpmi_score = cm_cnpmi.get_coherence()
print(f"C_NPMI Coherence Score: {cnpmi_score:.4f}")

C_NPMI Coherence Score: 0.0333
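
To help read the NPMI value, the standard pairwise definition is given below. This is the textbook form; gensim additionally smooths the probability estimates and averages over word pairs and topics, which is omitted here.

```latex
\operatorname{NPMI}(w_i, w_j)
  = \frac{\log \dfrac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)}
  \in [-1, 1]
```

A value of $1$ means the two words always co-occur, $0$ means they co-occur about as often as expected under independence, and $-1$ means they never co-occur. A topic-level average of 0.0333 therefore indicates that the top words of a topic co-occur only slightly more often than chance.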


In [22]:
def topic_diversity(top_words):
    unique_words = set()
    for topic in top_words:
        unique_words.update(topic)
    return len(unique_words) / (len(top_words) * len(top_words[0]))

diversity_score = topic_diversity(top_words)
print(f"Topic Diversity Score: {diversity_score:.4f}")

Topic Diversity Score: 0.5143


In [26]:
# Coded this with the help of ChatGPT
# document-topic matrix U
u_rows = svd.U.rows  

# dominant topic index per document
dominant_rdd = u_rows.map(lambda vec: int(vec.toArray().argmax()))

# Count documents per topic
counts = dominant_rdd.countByValue()

# Compute imbalance ratio
vals = list(counts.values())
imbalance = max(vals) / min(vals)
print(f"Topic Size Imbalance (Max/Min, filtered): {imbalance:.2f}")



Topic Size Imbalance (Max/Min, filtered): 27.47
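
One caveat about the dominant-topic assignment: the entries of `U` are signed, and the sign of each singular vector is arbitrary, so taking `argmax` over the raw values can behave differently from taking it over absolute values. The sketch below contrasts the two on made-up rows. Using absolute values is one common alternative and is an assumption here, not what the notebook did; `dominant_topic` and `imbalance_ratio` are hypothetical helper names.

```python
from collections import Counter


def dominant_topic(row, use_abs=True):
    """Index of the strongest component in one row of the document-topic matrix."""
    scores = [abs(x) for x in row] if use_abs else list(row)
    return max(range(len(scores)), key=scores.__getitem__)


def imbalance_ratio(rows, use_abs=True):
    """Max/min document count over topics that received at least one document."""
    counts = Counter(dominant_topic(row, use_abs) for row in rows)
    return max(counts.values()) / min(counts.values())


# Hypothetical document-topic rows (4 documents, 3 components)
toy_rows = [
    [0.9, -0.1, 0.0],
    [0.8, 0.2, -0.1],
    [-0.1, -0.7, 0.2],
    [0.1, 0.0, 0.6],
]
print(imbalance_ratio(toy_rows, use_abs=False))  # signed argmax, as in the cell above
print(imbalance_ratio(toy_rows, use_abs=True))   # argmax over absolute values
```

In both variants a topic that is never dominant does not appear in the counts, so the ratio is computed over non-empty topics only.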


                                                                                

## LLM (Groq llama3-70b-8192) representation model
To make the topics easier to interpret, we prompt llama3-70b-8192 through the Groq API to generate a short, clean label for each topic from its top keywords.

In [24]:
import os

import groq

# Read the Groq API key from the environment instead of hardcoding it in the notebook
GROQ_API_KEY = os.environ["GROQ_API_KEY"]
client = groq.Groq(api_key=GROQ_API_KEY)

# Define prompt template
prompt_template = """
I have a topic described by the following keywords:
{keywords}

Based on these keywords, generate a short and descriptive topic label of at most 5 words.
Make sure the output follows this format:
topic: <topic label>
"""

In [25]:
# Labeling function using Groq API
def get_groq_label(keywords):
    prompt = prompt_template.format(keywords=", ".join(keywords))
    response = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    raw_text = response.choices[0].message.content.strip()
    match = re.search(r"topic:\s*(.+)", raw_text, re.IGNORECASE)
    return match.group(1).strip() if match else raw_text

# Apply labeling to LSA topics
lsa_topic_labels = {}

print("=== GROQ Labeled LSA Topics ===")
for topic_id, words in enumerate(top_words):
    label = get_groq_label(words[:10])
    lsa_topic_labels[topic_id] = label
    print(f"Topic {topic_id}: {label}")
    print("Keywords:", ", ".join(words[:10]))
    print()

=== GROQ Labeled LSA Topics ===
Topic 0: People's Thoughts on Climate
Keywords: climate, change, people, like, think, thing, even, get, make, world

Topic 1: Global Climate Change Science
Keywords: climate, change, warming, global, temperature, scientist, science, earth, carbon, emission

Topic 2: People's Beliefs on Climate
Keywords: people, change, climate, think, thing, like, believe, want, someone, know

Topic 3: Global Warming Science
Keywords: human, warming, like, earth, temperature, thing, global, science, world, year

Topic 4: Politicians' Views on Science
Keywords: like, think, trump, thing, know, science, biden, say, really, even

Topic 5: Biden on Global Warming
Keywords: biden, science, people, warming, global, temperature, human, year, scientist, earth

Topic 6: Trump's GOP Tax Policy
Keywords: science, trump, government, tax, state, party, country, policy, republican, issue
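
The label extraction in `get_groq_label` relies on a single regular expression with a fallback to the raw reply. The sketch below isolates that step so it can be checked without calling the API. The example replies are hypothetical, not actual model output, and `parse_label` is a hypothetical helper name.

```python
import re

# Same pattern as in get_groq_label: text after "topic:" up to the end of that line
LABEL_PATTERN = re.compile(r"topic:\s*(.+)", re.IGNORECASE)


def parse_label(raw_text: str) -> str:
    """Return the text after 'topic:', or the stripped reply if the marker is missing."""
    raw_text = raw_text.strip()
    match = LABEL_PATTERN.search(raw_text)
    return match.group(1).strip() if match else raw_text


# Hypothetical replies covering the well-formed, chatty and malformed cases
print(parse_label("topic: Global Warming Science"))
print(parse_label("Sure! Here is the label:\nTopic: Global Warming Science\n"))
print(parse_label("Global Warming Science"))
```

Because `.` does not match a newline, only the line containing `topic:` is captured, so any explanation the model adds on later lines is discarded. Quotation marks or trailing punctuation inside the label are kept as returned.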



### LSA + LLM Topic Interpretation

The topics generated through Latent Semantic Analysis (LSA) become easier to read once they are labeled by the Groq-hosted LLM. The labels fall into three groups: public opinion on climate change (Topics 0 and 2), scientific discourse on global warming (Topics 1 and 3), and political framing around figures such as Biden and Trump (Topics 4 to 6). The labels capture the semantic intent of each keyword list and reduce ambiguity in downstream analysis, but they cannot remove the overlap in the underlying topics: several labels are near-duplicates (for example Topics 0 and 2, or Topics 1 and 3), which is consistent with the topic diversity score of 0.5143 reported above.
