# Analysis of Reddit Comments on Climate Change
This notebook analyzes Reddit comments on climate change. Our team's goal is to: ...

SENG 550 Final Project
- Monmoy Maahdie
- Smitkumar Saraiya
- Farhan LASTNAME
- Kai Ferrer

## Preliminary requirements:
- Download The Reddit Climate Change Dataset (https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset?resource=download) and place `the-reddit-climate-change-dataset-comments.csv` in the project root directory.



## 1. Create an Apache Spark session

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
# import pyspark
# from collections import Counter

In [None]:
# Initialize spark session
spark = SparkSession.builder \
    .appName("Reddit Climate Change Comments") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "8g") \
    .config("spark.sql.shuffle.partitions", "100") \
    .getOrCreate()

## 2. Load Data

In [None]:
# Create dataframe
df = spark.read.csv("the-reddit-climate-change-dataset-comments.csv", header=True, inferSchema=True)
df = df.repartition(100)  # spread the data over 100 partitions; tune this to the dataset size and available cores
df.show(5, truncate=False)  # inspect the first 5 rows
df_original = df # save original dataset
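
One common rule of thumb for choosing the partition count is to aim for partitions of roughly 128 MB each, which is the default value of Spark's `spark.sql.files.maxPartitionBytes` setting. The sketch below is a plain-Python helper, not a Spark API: the name `suggest_partitions` and the 128 MB target are assumptions made for illustration, and the right number for this dataset may differ depending on the machine.

```python
import math
import os


def suggest_partitions(size_bytes, target_mb=128, minimum=1):
    """Return a partition count so each partition holds about target_mb of data."""
    target_bytes = target_mb * 1024 * 1024
    return max(minimum, math.ceil(size_bytes / target_bytes))


# Hypothetical usage with the CSV from this notebook (runs only if the file is present)
csv_path = "the-reddit-climate-change-dataset-comments.csv"
if os.path.exists(csv_path):
    print(suggest_partitions(os.path.getsize(csv_path)))
```

The result could be passed to `df.repartition(...)` in place of the hard-coded 100.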


## 3. Data Cleaning
The dataset is cleaned by:
- Renaming columns
- Removing rows with NULL values
- Checking for rows where the `body` text spills into other columns
- Removing duplicate rows

In [None]:
# The dataset's column names contain periods (e.g. 'subreddit.id'), which Spark
# treats as struct-field access unless the name is wrapped in backticks.
# Replace the periods with underscores (e.g. 'subreddit_id') for readability.
new_columns = [col_name.replace('.', '_') for col_name in df.columns]
df = df.toDF(*new_columns)
df.show(5, truncate=False)

### Remove rows with NULL Values

In [None]:
# Remove rows with NULL values
df_clean = df.dropna()
df_clean.show()


In [None]:
row_count1 = df_clean.count() 
print(f"Cleaned dataset rows: {row_count1}")


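### Fix incorrectly parsed columns
Text from the `body` column spills into other columns, so a column such as `type` contains fragments of comment text instead of its expected values. This is likely because comment bodies contain commas, quotes, and line breaks that the CSV reader does not treat as part of a single quoted field. See the output below.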

In [276]:
df_clean.select("type").distinct().show()




+--------------------+
|                type|
+--------------------+
|&gt; The scientis...|
|Joe is weird with...|
|I agree with that...|
|I'm glad there's ...|
|Those companies y...|
|If we also reduce...|
|What I think I am...|
|  As for Indian food|
|https://www.theve...|
|&gt; We were prom...|
|Markets are prett...|
|Bike infrastructu...|
|I don't mean that...|
|Just look at a gr...|
|The comment I’m r...|
|               Yep."|
|[Here are some id...|
|&gt;They will die...|
|            Normally|
| Isn't there envy...|
+--------------------+
only showing top 20 rows
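
The output above is consistent with quoted fields being split on embedded commas and line breaks. The sketch below reproduces that effect on a tiny made-up sample using only Python's standard `csv` module: a naive split produces extra "rows" whose first column is a fragment of the body, while a quote-aware parser keeps each record intact. The sample data is invented for illustration and is not taken from the dataset.

```python
import csv
import io

# Two logical records; the first body contains a comma, a line break,
# and an escaped quote (written as "" inside a quoted CSV field)
raw = (
    'type,id,body,score\n'
    'comment,a1,"First line, with a comma\nYep.""",3\n'
    'comment,a2,plain text,5\n'
)

# Naive parsing: split on line breaks, then on commas, ignoring quoting
naive_rows = [line.split(",") for line in raw.strip().split("\n")]

# Quote-aware parsing: quoted commas and line breaks stay inside one field
parsed_rows = list(csv.reader(io.StringIO(raw)))

print("naive records: ", len(naive_rows) - 1)
print("parsed records:", len(parsed_rows) - 1)
print("naive 'type' values:", [row[0] for row in naive_rows[1:]])
```

In Spark, the CSV reader options that control this behaviour are `multiLine`, `quote`, and `escape`, for example `spark.read.csv(path, header=True, multiLine=True, quote='"', escape='"')`. Whether these exact settings fix this dataset is an assumption that would need to be checked against the distinct values of `type` after reloading.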


### Remove Duplicate Data
Note: deduplicating on `id` is only reliable once the column parsing issue above is fixed.

In [None]:
# Check for number of duplicates
df_clean.groupBy("id").count().filter("count > 1").show() 


In [None]:
# Remove duplicates: keep one row per 'id', the column expected to be unique for each comment
df_clean = df_clean.dropDuplicates(["id"]) 


In [None]:
row_count2 = df_clean.count() 
print(f"Rows after dropping duplicates: {row_count2}")  # compare with row_count1 to see whether any duplicates were dropped
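
To make the deduplication step concrete, here is a minimal plain-Python sketch of "keep one row per `id`" on made-up rows. The helper name `drop_duplicates_by_key` is hypothetical. One difference worth noting: this sketch always keeps the first row it sees, whereas Spark's `dropDuplicates` does not promise which of the duplicate rows survives.

```python
def drop_duplicates_by_key(rows, key):
    """Keep one row per key value, preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Invented sample rows; the third repeats the id of the first
rows = [
    {"id": "a1", "body": "first"},
    {"id": "a2", "body": "second"},
    {"id": "a1", "body": "first again"},
]
deduped = drop_duplicates_by_key(rows, "id")
print(f"before: {len(rows)}, after: {len(deduped)}, removed: {len(rows) - len(deduped)}")
```

The same before/after comparison applies to `row_count1` and `row_count2` in the cells above.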

In [None]:
# data = pd.read_csv('the-reddit-climate-change-dataset-comments.csv')
# data = data.dropna() # drop any rows with missing values

In [None]:
# counter = Counter(data['subreddit.nsfw'])
# print(counter)
# print(data['score'])
#print(data['body'].iloc[20340])

## 4. Data Transformation 
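Keep only the rows whose `type` is "comment". Because of the parsing issue noted in Section 3, some rows hold fragments of comment text in `type`; this filter drops them.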

In [277]:
df_comments = df_clean.filter(df_clean["type"] == "comment")

In [278]:
df_comments.show(5, truncate=False)  # intermediate check that the filter worked




+-------+-------+------------+--------------------+--------------+-----------+-----------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-----+
|type   |id     |subreddit_id|subreddit_name      |subreddit_nsfw|created_utc|permalink                                                                                                        |body                                                    

In [None]:
# count the number of rows with type "comment"
print(f"Number of rows with type 'comment': {df_comments.count()}") # check to see if any rows were actually removed
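
Before relying on the filter, it can help to see how many rows it discards and why. The sketch below is a plain-Python illustration with a hypothetical helper, `summarize_types`; the sample values reuse two fragments from the distinct `type` output shown in Section 3.

```python
from collections import Counter


def summarize_types(type_values, expected="comment"):
    """Count rows per `type` value and report how many do not match `expected`."""
    counts = Counter(type_values)
    valid = counts[expected]
    return {
        "valid": valid,
        "malformed": len(type_values) - valid,
        "distinct": len(counts),
    }


# Illustrative values only: three well-formed rows and two spilled fragments
sample = ["comment", "comment", 'Yep."', "comment", "Normally"]
print(summarize_types(sample))
```

On the real data, a comparable check would be `df_clean.groupBy("type").count().orderBy("count", ascending=False).show()`.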


## 5. Create Spark tables
Optional step: persist the filtered comments as a Spark SQL table so they can be queried with SQL.

In [None]:
# Create the Spark database; IF NOT EXISTS makes this cell safe to re-run
spark.sql("CREATE DATABASE IF NOT EXISTS reddit_db")

In [None]:
spark.sql("SHOW DATABASES").show() # check that reddit_db is in here


In [None]:
spark.sql("SHOW TABLES IN reddit_db").show() # should list no tables yet

In [None]:
# Drop the table if it already exists so the CREATE TABLE cell below starts from a clean state
# Alternative (not used): df_comments.write.mode("overwrite").saveAsTable("reddit_db.comments")
spark.sql("DROP TABLE IF EXISTS reddit_db.comments")


In [None]:
spark.sql("""
CREATE TABLE IF NOT EXISTS reddit_db.comments (
    `type` STRING,
    `id` STRING,
    `subreddit_id` STRING,
    `subreddit_name` STRING,
    `subreddit_nsfw` STRING,
    `created_utc` STRING,
    `permalink` STRING,
    `body` STRING,
    `sentiment` STRING,
    `score` STRING
)
USING PARQUET
""")

In [None]:
spark.sql("SHOW TABLES IN reddit_db").show() # should now list the comments table

In [None]:
df_comments.show(5, truncate=False)

In [None]:
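# Make sure the column names use '_' rather than '.'.
# The columns were already renamed in Section 3, so these calls are no-ops here
# (withColumnRenamed does nothing when the source column does not exist).
df_aligned = df_comments \
    .withColumnRenamed("subreddit.id", "subreddit_id") \
    .withColumnRenamed("subreddit.name", "subreddit_name") \
    .withColumnRenamed("subreddit.nsfw", "subreddit_nsfw")

# Note: this renaming is arguably part of the transformation step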

In [None]:
df_aligned.printSchema() # double check


In [None]:
spark.sql("SHOW TABLES IN reddit_db").show()

In [None]:
df_aligned.write.insertInto("reddit_db.comments", overwrite=False) # append the DataFrame rows to the Spark table; insertInto matches columns by position, not by name
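
Because `insertInto` resolves columns by position and ignores their names, a DataFrame whose columns are in a different order than the table would be written into the wrong columns without an error. The sketch below is a small plain-Python check for that; `column_mismatches` is a hypothetical helper, and the column list is the one used by the `comments` table in this notebook, written with underscores.

```python
def column_mismatches(df_columns, table_columns):
    """Return (position, df_name, table_name) for every position where the names differ."""
    mismatches = []
    for i in range(max(len(df_columns), len(table_columns))):
        df_name = df_columns[i] if i < len(df_columns) else None
        table_name = table_columns[i] if i < len(table_columns) else None
        if df_name != table_name:
            mismatches.append((i, df_name, table_name))
    return mismatches


table_columns = [
    "type", "id", "subreddit_id", "subreddit_name", "subreddit_nsfw",
    "created_utc", "permalink", "body", "sentiment", "score",
]
# Invented example of a DataFrame whose first two columns are swapped
swapped = ["id", "type"] + table_columns[2:]

print(column_mismatches(table_columns, table_columns))  # identical order
print(column_mismatches(swapped, table_columns))        # first two positions differ
```

In the notebook, the two lists could come from `df_aligned.columns` and `spark.table("reddit_db.comments").columns`.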

In [None]:
spark.sql("SELECT * FROM reddit_db.comments LIMIT 5").show() #validate the table

In [None]:
from pyspark.sql.functions import col, split

In [None]:
df_tokens = df_aligned.withColumn("words", split(col("body"), r"\s+"))
df_tokens = df_tokens.filter(df_tokens["words"].isNotNull())
df_tokens.show(5) #check if words column created

In [None]:
from pyspark.ml.feature import StopWordsRemover

In [None]:
# init stopwordsremover
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")

In [None]:
df_aligned_words = remover.transform(df_tokens)
df_aligned_words.show(5) 

In [None]:
from pyspark.sql.functions import explode

In [None]:
# explode: emit one row per word so that word frequencies can be counted
df_exploded = df_aligned_words.withColumn("word", explode(col("filtered_words")))
df_exploded.show(5) 

In [None]:
df_word_count = df_exploded.groupBy("word").count().orderBy("count", ascending=False)
df_word_count.show(10)
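
The tokenize, remove stop words, and count steps above can be mirrored in plain Python to see what kinds of tokens end up in the counts. This is only a sketch: the stop-word set is a tiny invented list (Spark's `StopWordsRemover` uses a much longer default list), and the sample comments are made up. It uses the same `\s+` split as the cell above and matches stop words case-insensitively, which is the `StopWordsRemover` default.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not Spark's default list)
STOP_WORDS = {"the", "is", "a", "and", "of", "to", "in"}


def top_words(bodies, n=10):
    """Split each body on whitespace, drop stop words, and return the n most common tokens."""
    counts = Counter()
    for body in bodies:
        tokens = re.split(r"\s+", body)  # same regex as the split() cell
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
        counts.update(tokens)
    return counts.most_common(n)


# Invented sample comments; the second starts with a space on purpose
bodies = ["The climate is changing", " climate change is real", "Climate policy"]
print(top_words(bodies, 10))
```

Two things show up in this sketch that may also affect the Spark result: a body that starts with whitespace yields an empty-string token, and "climate" and "Climate" are counted separately. Lowercasing `body` and filtering out empty tokens before counting would address both, if that matches the intended analysis.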

In [None]:
spark.stop()