# Understanding Climate Change Discourse on Reddit: A Distributed Analysis of Public Themes, Sentiment, and Recommendations

### Candidate numbers: 39884, 48099, 49308, 50250

## Notebook Overview: Sentiment Classification

This notebook continues from the topic modeling analysis and runs sentiment analysis on the dataset using a **Logistic Regression classifier**. This classifier was chosen based on its performance in the earlier comparative study of sentiment analysis models. The dataset has already been enriched with topic labels and LDA features in the previous `combined_topics.ipynb` notebook.

Steps in the notebook:
- Load LDA-transformed data (`df_tf_after_topics`) stored in Parquet format.
- Label sentiments as Positive, Negative, or Neutral using predefined thresholds on the sentiment polarity score.
- Train and evaluate a Logistic Regression model using `rawFeatures` (term frequency vectors) as input.
- Save the main dataframe `df_predicted` as a parquet file in a bucket so that it can be imported for the visualisation part of this section.
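
To make the `rawFeatures` input concrete, the following is a minimal plain-Python sketch of what a term-frequency vector is. It assumes `rawFeatures` holds raw token counts over a fixed vocabulary, as the description above suggests; the helper names `build_vocabulary` and `term_frequency_vector` and the toy comments are hypothetical and are not part of the pipeline, which builds its vectors in Spark.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token to a column index, most frequent tokens first."""
    counts = Counter(token for doc in docs for token in doc)
    # Sort by descending corpus frequency, then alphabetically for a stable order
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {token: i for i, token in enumerate(ordered)}


def term_frequency_vector(tokens, vocabulary):
    """Return raw term counts as a dense list; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector


# Hypothetical tokenised comments, standing in for the `final_tokens` column
docs = [
    ["climate", "change", "real"],
    ["carbon", "tax", "climate", "climate"],
]
vocab = build_vocabulary(docs)
vectors = [term_frequency_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each position in a vector counts how often one vocabulary term appears in the comment, so the classifier sees word frequencies but no word order.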


## Cluster Setup and Initialization Actions
We used Google Cloud Dataproc to create a scalable cluster with the following settings:

#### Create the bucket
```gsutil mb gs://st446-gp-sm```

#### Upload the initialization script
```gsutil cp my_actions.sh gs://st446-gp-sm```

#### Create the Dataproc cluster
```
gcloud dataproc clusters create st446-cluster-project \
  --enable-component-gateway \
  --public-ip-address \
  --region europe-west1 \
  --master-machine-type n2-standard-4 \
  --master-boot-disk-size 100 \
  --num-workers 3 \
  --worker-machine-type n2-standard-4 \
  --worker-boot-disk-size 300 \
  --image-version 2.2-debian12 \
  --optional-components=JUPYTER \
  --metadata 'PIP_PACKAGES=sklearn nltk pandas numpy' \
  --initialization-actions='gs://st446-gp-sm/my_actions.sh' \
  --properties=spark:spark.dynamicAllocation.enabled=true \
  --project=capstone-data-1-wto
```

In [1]:
# Install packages that are not already available on the cluster
!pip install gensim
!pip install groq

# Standard library
import hashlib
import os
import random
import re
import string
import zipfile
from datetime import datetime
from time import time

# Third-party
import groq
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# PySpark
import pyspark.sql.functions as sql_f
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.clustering import LDA
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import CountVectorizer, IDF, StopWordsRemover, Tokenizer
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, desc, lower, monotonically_increasing_id, rand, regexp_replace,
    row_number, udf, when, year,
)
from pyspark.sql.types import (
    ArrayType, DoubleType, IntegerType, StringType, StructField, StructType,
)
from pyspark.sql.window import Window


# Data Frame Loading after LDA

In [2]:
# Path to the LDA-enriched DataFrame saved by the previous notebook
input_path = "gs://st446-gp-sm/processed_data/df_tf_after_topics"

# Load the Parquet data
df_tf = spark.read.parquet(input_path)

# Check schema
df_tf.printSchema()
df_tf.show(5)

                                                                                

root
 |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- subreddit.id: string (nullable = true)
 |-- subreddit.name: string (nullable = true)
 |-- subreddit.nsfw: string (nullable = true)
 |-- created_utc: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- body: string (nullable = true)
 |-- sentiment: double (nullable = true)
 |-- score: integer (nullable = true)
 |-- body_clean: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- final_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- rawFeatures: vector (nullable = true)
 |-- topicDistribution: vector (nullable = true)
 |-- predictedTopic: integer (nullable = true)
 |-- docTopWords: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- PredictedTopicName: string (nullable = true)



                                                                                

+-------+-------+------------+-------------------+--------------+-----------+--------------------+--------------------+---------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+--------------------+--------------------+
|   type|     id|subreddit.id|     subreddit.name|subreddit.nsfw|created_utc|           permalink|                body|sentiment|score|          body_clean|              tokens|            filtered|        final_tokens|         rawFeatures|   topicDistribution|predictedTopic|         docTopWords|  PredictedTopicName|
+-------+-------+------------+-------------------+--------------+-----------+--------------------+--------------------+---------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+--------------------+--------------------+
|comment|fs7inmj|       2qh13|          wor

# Sentiment Analysis

## Logistic Regression

In [3]:
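# Drop rows without a polarity score so they cannot fall through to the Neutral class
df_tf = df_tf.filter(col("sentiment").isNotNull())

# Map the polarity score to a numeric class using thresholds of +/-0.35
df_tf = df_tf.withColumn(
    "sentiment_class",
    when(df_tf["sentiment"] > 0.35, 2)    # positive
    .when(df_tf["sentiment"] < -0.35, 0)  # negative
    .otherwise(1)                         # neutral
)

# Human-readable version of the numeric class
df_tf = df_tf.withColumn(
    "sentiment_label",
    when(df_tf.sentiment_class == 0, "Negative")
    .when(df_tf.sentiment_class == 1, "Neutral")
    .when(df_tf.sentiment_class == 2, "Positive")
)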
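
The thresholding rule in the cell above can be mirrored in plain Python to check how boundary values are treated. This is only an illustrative sketch: the function name `label_sentiment` is hypothetical, and the notebook itself applies the rule with Spark's `when`/`otherwise`.

```python
def label_sentiment(score, threshold=0.35):
    """Return (class_index, label) for a polarity score.

    Mirrors the when/otherwise chain: strictly above +threshold is Positive,
    strictly below -threshold is Negative, everything else is Neutral.
    """
    if score > threshold:
        return 2, "Positive"
    if score < -threshold:
        return 0, "Negative"
    return 1, "Neutral"


for score in (-0.8, -0.35, 0.0, 0.35, 0.36):
    print(score, label_sentiment(score))
```

Because both comparisons are strict, scores of exactly 0.35 and -0.35 are labelled Neutral.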

In [4]:
df_tf.printSchema()

root
 |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- subreddit.id: string (nullable = true)
 |-- subreddit.name: string (nullable = true)
 |-- subreddit.nsfw: string (nullable = true)
 |-- created_utc: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- body: string (nullable = true)
 |-- sentiment: double (nullable = true)
 |-- score: integer (nullable = true)
 |-- body_clean: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- final_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- rawFeatures: vector (nullable = true)
 |-- topicDistribution: vector (nullable = true)
 |-- predictedTopic: integer (nullable = true)
 |-- docTopWords: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- PredictedTopicName: string (nullable = true)
 |-- s

In [5]:
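# Hold out 20% of the rows for evaluation; the seed fixes the random split
train_df, test_df = df_tf.randomSplit([0.8, 0.2], seed=42)
# Repartition the training set so the fit is spread across the workers
train_df = train_df.repartition(8)

lr = LogisticRegression(
    featuresCol="rawFeatures",
    labelCol="sentiment_class",
    predictionCol="PredictionSentiment",
    maxIter=10
)

lr_model = lr.fit(train_df)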

25/05/05 09:49:25 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.5
25/05/05 09:49:26 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.5
25/05/05 09:49:27 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.5
25/05/05 09:49:27 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.25
25/05/05 09:49:27 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.125
25/05/05 09:49:28 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.5
25/05/05 09:49:29 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.25
25/05/05 09:49:29 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.125
25/05/05 09:49:29 
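
With three label classes and the default `family="auto"`, Spark's `LogisticRegression` fits a multinomial (softmax) model. The sketch below shows, in plain Python, how such a model turns a term-frequency vector into class probabilities and a predicted class. The weights, intercepts, and three-term vocabulary are invented for illustration and are not taken from `lr_model`.

```python
import math


def softmax(scores):
    """Convert raw class scores to probabilities (shifted by the max for stability)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, weights, intercepts):
    """Return (predicted_class, probabilities) for one feature vector."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, intercepts)
    ]
    probs = softmax(scores)
    return probs.index(max(probs)), probs


# Hypothetical model over a 3-term vocabulary: ["great", "terrible", "climate"]
weights = [
    [-1.0, 1.5, 0.0],   # class 0: Negative
    [0.0, 0.0, 0.1],    # class 1: Neutral
    [1.2, -1.0, 0.0],   # class 2: Positive
]
intercepts = [0.0, 0.5, 0.0]

features = [2, 0, 1]  # "great" twice, "climate" once
predicted_class, probs = predict(features, weights, intercepts)
print(predicted_class, [round(p, 3) for p in probs])
```

The predicted class is the index with the highest probability, which is what Spark writes to the prediction column (`PredictionSentiment` in this notebook).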

In [6]:
predictions = lr_model.transform(test_df)

In [7]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="sentiment_class",
    predictionCol="PredictionSentiment",
    metricName="accuracy"
)
accuracy = evaluator.evaluate(predictions)

                                                                                

In [8]:
print("Logistic Regression Results")
print(f"Accuracy:  {accuracy:.4f}")

Logistic Regression Results
Accuracy:  0.7237


### Comment on Model Performance

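The logistic regression model achieved an **accuracy of ~72.4%** on the held-out 20% test split. This is a reasonable result for a 3-class sentiment classification task on social media text, given that the model uses only basic term-frequency vectors rather than embeddings. Accuracy alone does not show how performance varies across the Negative, Neutral, and Positive classes, so the figure should be read alongside the class distribution.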
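
One way to put an accuracy figure in context is to compare it with a majority-class baseline and to look at per-class precision and recall. The sketch below does this in plain Python on toy labels; the data and the helper names `per_class_metrics` and `majority_baseline` are hypothetical and do not reflect this notebook's results. In Spark, `MulticlassClassificationEvaluator` also accepts metric names such as `"f1"`, `"weightedPrecision"`, and `"weightedRecall"`.

```python
from collections import Counter


def per_class_metrics(labels, predictions):
    """Return {class: (precision, recall)} from parallel label/prediction lists."""
    pairs = Counter(zip(labels, predictions))
    classes = sorted(set(labels) | set(predictions))
    metrics = {}
    for c in classes:
        true_pos = pairs[(c, c)]
        predicted_c = sum(n for (_, p), n in pairs.items() if p == c)
        actual_c = sum(n for (l, _), n in pairs.items() if l == c)
        precision = true_pos / predicted_c if predicted_c else 0.0
        recall = true_pos / actual_c if actual_c else 0.0
        metrics[c] = (precision, recall)
    return metrics


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Toy example: 0 = Negative, 1 = Neutral, 2 = Positive
labels      = [1, 1, 1, 1, 1, 1, 0, 0, 2, 2]
predictions = [1, 1, 1, 1, 1, 1, 1, 0, 1, 2]

accuracy = sum(l == p for l, p in zip(labels, predictions)) / len(labels)
baseline = majority_baseline(labels)
metrics = per_class_metrics(labels, predictions)
print(accuracy, baseline, metrics)
```

In this toy case the accuracy of 0.8 sits above a baseline of 0.6, while recall for the Negative and Positive classes is only 0.5, a pattern that overall accuracy would hide.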

## Predicting Sentiment for all documents

In [9]:
df_predicted = lr_model.transform(df_tf)

In [11]:
# Add a human-readable label
df_predicted = df_predicted.withColumn(
    "predicted_sentiment_label",
    when(col("PredictionSentiment") == 0, "Negative")
    .when(col("PredictionSentiment") == 1, "Neutral")
    .when(col("PredictionSentiment") == 2, "Positive")
)

In [12]:
df_predicted.printSchema()

root
 |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- subreddit.id: string (nullable = true)
 |-- subreddit.name: string (nullable = true)
 |-- subreddit.nsfw: string (nullable = true)
 |-- created_utc: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- body: string (nullable = true)
 |-- sentiment: double (nullable = true)
 |-- score: integer (nullable = true)
 |-- body_clean: string (nullable = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- final_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- rawFeatures: vector (nullable = true)
 |-- topicDistribution: vector (nullable = true)
 |-- predictedTopic: integer (nullable = true)
 |-- docTopWords: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- PredictedTopicName: string (nullable = true)
 |-- s

# Saving Dataframe with Sentiment Predictions

In [13]:
# Define your output path (change bucket name accordingly)
output_path = "gs://st446-gp-sm/processed_data/df_tf_after_sentiments"

# Save DataFrame as Parquet
df_predicted.write.mode("overwrite").parquet(output_path)

                                                                                