In [1]:
# Milestone 1 – Exploratory Data Analysis (EDA)
**Project:** Reddit Cost-of-Living Crisis Analysis  
**Course:** DSAN 6000 – Statistical Learning / Big Data Analytics  
**Authors:** Jiachen Gao, Zihao Huang  
**Date:** November 2025  

This notebook establishes the technical pipeline for analyzing Reddit discussions related to the
global cost-of-living crisis. It validates data access, cleaning, feature creation, and keyword
filters using a local sample before scaling to the full dataset in Azure.


In [None]:
## Project Plan: Business Goals & Technical Proposals

### 1. Inflation Concern Over Time
**Business goal:** Understand how concern about inflation and rising prices changes over time across major Reddit communities.  
**Technical proposal:** Use regex/keyword filters (e.g., "inflation", "CPI", "cost of living", "prices", "expensive") on titles and bodies in submissions/comments. Aggregate weekly counts by subreddit with PySpark, normalize by total posts to create a "concern intensity" index, and visualize trends to identify spikes and potential links to real-world inflation events.
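
As a rough illustration of the "concern intensity" index described above, the sketch below uses plain Python rather than PySpark so it runs without a cluster. The pattern, the `week_start` and `concern_intensity` helpers, the subreddit name, and the sample rows are all hypothetical; the production version would presumably express the same grouping with `rlike` and a weekly `groupBy`.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical pattern built from the example keywords in the proposal.
INFLATION_RE = re.compile(
    r"\b(inflation|cpi|cost of living|prices|expensive)\b", re.IGNORECASE
)

def week_start(created_utc: int) -> str:
    """Return the ISO date of the Monday that starts the post's UTC week."""
    day = datetime.fromtimestamp(created_utc, tz=timezone.utc).date()
    return (day - timedelta(days=day.weekday())).isoformat()

def concern_intensity(posts):
    """Map (subreddit, week) -> share of posts whose text matches the pattern."""
    totals, hits = Counter(), Counter()
    for post in posts:
        key = (post["subreddit"], week_start(post["created_utc"]))
        totals[key] += 1
        if INFLATION_RE.search(post["text"]):
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}

# Made-up rows for illustration only.
sample_posts = [
    {"subreddit": "personalfinance", "created_utc": 1640995200, "text": "Inflation is eating my savings"},
    {"subreddit": "personalfinance", "created_utc": 1641081600, "text": "Best budgeting app?"},
    {"subreddit": "personalfinance", "created_utc": 1641168000, "text": "Grocery prices are wild"},
    {"subreddit": "personalfinance", "created_utc": 1641254400, "text": "Rent is too expensive"},
]
intensity = concern_intensity(sample_posts)
```

Dividing by the weekly total keeps a busy subreddit from dominating the trend simply because it posts more.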

### 2. Rent, Housing, and Eviction Stress
**Business goal:** Identify how often and where people discuss rent hikes, housing instability, and evictions, and which communities appear most affected.  
**Technical proposal:** Define regex patterns for housing-related terms ("rent", "landlord", "lease", "eviction", "mortgage", "housing"). Create housing dummy flags, compute per-subreddit and per-month frequencies and proportions, and build summary tables and bar charts to highlight hotspots of housing stress.

### 3. Wage, Salary, and Job Security
**Business goal:** See whether people feel their incomes keep up with costs and where job security fears are most visible.  
**Technical proposal:** Tag posts mentioning wages, salaries, layoffs, or job loss ("raise", "wage", "bonus", "fired", "layoff") and flag their co-occurrence with inflation/cost-of-living terms. Aggregate by subreddit and time to examine patterns and prepare a labeled subset for later NLP/ML classification.

### 4. Emotional Tone of Cost-of-Living Discussions
**Business goal:** Understand if cost-of-living conversations are dominated by fear, anger, frustration, or hope.  
**Technical proposal:** Use simple lexicon-based sentiment and emotion flags (e.g., negative words like "broke", "stress", "anxious", versus positive/hopeful language) on cost-of-living posts. Compute average negative/positive flag rates by subreddit and over time and visualize basic sentiment trends as a foundation for more advanced NLP later.
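
A minimal sketch of the lexicon-based flags, in plain Python. The negative words are the examples from the proposal; the positive words, the function names, and the sample texts are placeholders and not a validated lexicon.

```python
import re

# Negative words come from the proposal; positive words are hypothetical placeholders.
NEGATIVE = {"broke", "stress", "anxious"}
POSITIVE = {"hopeful", "relief", "saved"}

def sentiment_flags(text: str) -> dict:
    """Return 0/1 flags for whether any negative or positive lexicon word appears."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {"neg": int(bool(tokens & NEGATIVE)), "pos": int(bool(tokens & POSITIVE))}

def flag_rates(texts):
    """Average the flags over a collection of posts (0.0 for an empty collection)."""
    flags = [sentiment_flags(t) for t in texts]
    if not flags:
        return {"neg": 0.0, "pos": 0.0}
    return {k: sum(f[k] for f in flags) / len(flags) for k in ("neg", "pos")}

# Made-up texts for illustration only.
sample_texts = [
    "I'm broke and anxious about rent",
    "Finally saved enough for a deposit",
    "Any tips for meal prep?",
    "So much stress this month",
]
rates = flag_rates(sample_texts)
```

Exact token matching means inflected forms such as "stressed" are missed, so a real lexicon would likely need stemming or a broader word list.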

### 5. Geographic Differences in Narratives
**Business goal:** Explore whether cost-of-living concerns differ by country/region where identifiable.  
**Technical proposal:** Use weak signals such as currency symbols ($, £, €), region-specific terms, or subreddit context (e.g., region-specific subs in full data) to assign rough geographic tags. Summarize topic frequencies and sentiment by geo-tag and document limitations of this heuristic labeling.
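
The currency-symbol heuristic could look roughly like the sketch below. The tag names and the rule of abstaining when several symbols appear are assumptions made for illustration; in particular, "$" is shared by many countries, so its tag is deliberately vague.

```python
# Hypothetical mapping from currency symbol to a rough region tag.
CURRENCY_REGION = {"$": "dollar-region", "£": "UK", "€": "Eurozone"}

def geo_tag(text: str) -> str:
    """Return a rough region tag, or 'unknown' when zero or several symbols appear."""
    found = {region for symbol, region in CURRENCY_REGION.items() if symbol in text}
    return found.pop() if len(found) == 1 else "unknown"

# Made-up texts for illustration only.
tags = [geo_tag(t) for t in ["Rent went up £200", "Paid $5 or €4", "No symbols here"]]
```

Abstaining on mixed or missing signals keeps the weak labels conservative, which fits the proposal's plan to document the limits of this heuristic.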

### 6. Linking Reddit Concern to Official Inflation Data
**Business goal:** Check whether spikes in Reddit cost-of-living discussions align with real-world inflation indicators.  
**Technical proposal:** Obtain an external monthly CPI time series and potentially gas price or rent indices. Align Reddit weekly/monthly cost-of-living post volumes and negativity with these indicators using time joins in PySpark or pandas. Compute simple correlations and visualize side-by-side trends to assess alignment.
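
To make the time join and correlation step concrete, here is a small plain-Python sketch. All numbers are invented and the series names are hypothetical; weeks are assigned to the month of their start date, which is a simplification for weeks that straddle a month boundary.

```python
from collections import defaultdict
from math import sqrt

def monthly_totals(weekly_counts):
    """Sum weekly counts keyed by ISO date ('YYYY-MM-DD') into 'YYYY-MM' buckets."""
    totals = defaultdict(int)
    for week, count in weekly_counts.items():
        totals[week[:7]] += count
    return dict(totals)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up numbers for illustration only; not real Reddit or CPI data.
weekly_posts = {"2022-01-03": 10, "2022-01-10": 14, "2022-02-07": 30,
                "2022-03-07": 41, "2022-03-14": 9}
cpi_index = {"2022-01": 1.0, "2022-02": 2.0, "2022-03": 3.0, "2022-04": 4.0}

reddit_monthly = monthly_totals(weekly_posts)
# Inner join on month: keep only months present in both series.
shared_months = sorted(set(reddit_monthly) & set(cpi_index))
r = pearson([reddit_monthly[m] for m in shared_months],
            [cpi_index[m] for m in shared_months])
```

With only a handful of monthly points, a coefficient like this is descriptive at best and should not be read as evidence of a causal link.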

### 7. Predicting High-Engagement Cost-of-Living Posts
**Business goal:** Identify what kinds of cost-of-living posts receive high engagement (upvotes/comments), indicating narratives that resonate strongly.  
**Technical proposal:** Create a labeled dataset of cost-of-living-related submissions with features such as subreddit, posting time, length, topic flags, and simple sentiment features. Define "high engagement" via score/comment quantiles and train baseline classification models in later milestones to predict high vs low engagement.
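
One way to define "high engagement" from score quantiles is sketched below in plain Python. The 0.75 cutoff, the nearest-rank quantile method, and the helper names are assumptions; on the full dataset the threshold would more likely come from something like PySpark's `approxQuantile`.

```python
import math

def quantile_threshold(values, q):
    """Return the value at quantile q (0..1) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def label_high_engagement(scores, q=0.75):
    """Label a post 1 when its score is strictly above the q-quantile, else 0."""
    threshold = quantile_threshold(scores, q)
    return [int(s > threshold) for s in scores]

# Made-up scores for illustration only.
sample_scores = [7, 300, 2, 95, 11, 120, 5, 40]
labels = label_high_engagement(sample_scores)
```

Computing the threshold per subreddit rather than globally may be preferable, since typical scores differ widely between communities.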

### 8. Classifying Personal Help vs Systemic Discussion
**Business goal:** Distinguish between users seeking personal financial help and users discussing systemic/policy issues, to understand different needs and narratives.  
**Technical proposal:** Use heuristics (question patterns like "what should I do", "need advice" vs. "the government", "the economy", "the Fed") to create weak labels for "personal" vs "systemic". Engineer text and metadata features and later train PySpark ML models to classify posts, evaluating performance and refining rules.
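
The weak-labeling rule can be sketched as follows, using only the example phrases from the proposal. The function name and the decision to abstain (return `None`) when both or neither pattern matches are assumptions.

```python
import re

# Phrases taken from the proposal's examples; a real rule set would be longer.
PERSONAL_RE = re.compile(r"\b(what should i do|need advice)\b", re.IGNORECASE)
SYSTEMIC_RE = re.compile(r"\b(the government|the economy|the fed)\b", re.IGNORECASE)

def weak_label(text: str):
    """Return 'personal', 'systemic', or None when the signals are absent or mixed."""
    personal = bool(PERSONAL_RE.search(text))
    systemic = bool(SYSTEMIC_RE.search(text))
    if personal and not systemic:
        return "personal"
    if systemic and not personal:
        return "systemic"
    return None  # abstain rather than guess

# Made-up texts for illustration only.
weak_labels = [weak_label(t) for t in [
    "Need advice: landlord raised my rent",
    "The Fed caused this mess",
    "What should I do now that the economy is tanking?",
    "Nice weather today",
]]
```

Abstaining on ambiguous posts trades coverage for cleaner labels, which tends to matter more when the labels later train a classifier.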

### 9. Early Warning Signals of Financial Distress
**Business goal:** Explore whether changes in Reddit discussions can act as early warning signals of worsening financial stress.  
**Technical proposal:** Construct weekly indicators from cost-of-living posts (mentions of "can't afford", "behind on bills", "credit card debt", "collections", "default") and sentiment volatility. Compare these against external indicators (when available in later milestones) using lagged correlations and baseline forecasting/threshold models.
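
A baseline threshold model for the weekly distress indicator might look like the sketch below. It flags a week when its count exceeds the mean plus `k` sample standard deviations of the preceding `window` weeks. The window size, the multiplier, and the counts are illustrative assumptions, not tuned values.

```python
from statistics import mean, stdev

def distress_alerts(weekly_counts, window=4, k=2.0):
    """Return indices of weeks whose count exceeds mean + k * stdev of the prior window."""
    alerts = []
    for i in range(window, len(weekly_counts)):
        history = weekly_counts[i - window:i]
        threshold = mean(history) + k * stdev(history)
        if weekly_counts[i] > threshold:
            alerts.append(i)
    return alerts

# Made-up weekly counts of distress phrases for illustration only.
counts = [10, 12, 11, 13, 12, 30, 14, 13]
alerts = distress_alerts(counts)
```

Because the spike itself enters the history for later weeks, the threshold rises after an alert; a more robust baseline could use the median and median absolute deviation instead.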

### 10. Cross-Community Narrative Diffusion
**Business goal:** Understand how cost-of-living narratives spread between different Reddit communities (e.g., finance, employment, politics).  
**Technical proposal:** For the full dataset, analyze shared links, repeated phrases, and overlapping topic flags across subreddits over time. Build an aggregated subreddit–subreddit "topic overlap" matrix and visualize as a network to identify origin, bridge, and amplification communities. Use this as a structural lens in advanced analysis.
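
The "topic overlap" matrix is not defined precisely in the proposal. The sketch below assumes Jaccard similarity between the sets of topic flags seen in each subreddit, which is one simple choice among several. Subreddit names and topic sets are made up.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_matrix(topics_by_subreddit):
    """Return {(sub_a, sub_b): Jaccard similarity} for every unordered pair."""
    return {
        (a, b): jaccard(topics_by_subreddit[a], topics_by_subreddit[b])
        for a, b in combinations(sorted(topics_by_subreddit), 2)
    }

# Made-up topic sets for illustration only.
topics = {
    "personalfinance": {"inflation", "housing", "debt"},
    "jobs": {"wages", "layoffs", "inflation"},
    "politics": {"inflation", "housing"},
}
matrix = overlap_matrix(topics)
```

Each pair and its similarity can then serve as a weighted edge when the matrix is drawn as a network.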

In [1]:
from pathlib import Path
from pyspark.sql import SparkSession, functions as F
import pandas as pd
import matplotlib.pyplot as plt

# Repository paths (assumes the notebook runs from a folder one level below the repo root)
REPO_ROOT = Path.cwd().parent
DATA_RAW = REPO_ROOT / "data" / "raw"
DATA_PROCESSED = REPO_ROOT / "data" / "processed"
IMG_DIR = REPO_ROOT / "img"

for p in [DATA_RAW, DATA_PROCESSED, IMG_DIR]:
    p.mkdir(parents=True, exist_ok=True)


In [2]:
spark = (
    SparkSession.builder
    .appName("Reddit EDA")
    .master("local[*]")
    .getOrCreate()
)

spark



In [3]:
submissions_df = spark.read.json(str(DATA_RAW / "submissions-sample.json"))
comments_df = spark.read.json(str(DATA_RAW / "comments-sample.json"))

print("Submissions rows:", submissions_df.count())
print("Comments rows:", comments_df.count())

submissions_df.printSchema()
comments_df.printSchema()



Submissions rows: 1000
Comments rows: 10000
root
 |-- adserver_click_url: string (nullable = true)
 |-- adserver_imp_pixel: string (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- author_id: string (nullable = true)
 |-- brand_safe: boolean (nullable = true)
 |-- contest_mode: boolean (nullable = true)
 |-- created_utc: long (nullable = true)
 |-- crosspost_parent: string (nullable = true)
 |-- crosspost_parent_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- approved_at_utc: string (nullable = true)
 |    |    |-- approved_by: string (nullable = true)
 |    |    |-- archived: boolean (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- author_flair_css_class: string (nullable = true)
 |    |    |-- author_flair_tex

In [4]:
## 1. Dataset Overview

We start with the provided Reddit sample (JSON) to validate our pipeline.  
The full project will use the pre-processed parquet data on the shared cluster; all logic here is written in PySpark so it can scale directly.

Key questions:
- Which subreddits appear?
- What text and metadata fields are available?
- Are there obvious missing or corrupted records?


In [5]:
def missing_summary(df):
    """Return count of missing values per column."""
    return df.select([
        F.count(F.when(F.col(c).isNull(), c)).alias(c)
        for c in df.columns
    ])

print("=== Missing values: submissions ===")
missing_summary(submissions_df).show(truncate=False)

print("=== Missing values: comments ===")
missing_summary(comments_df).show(truncate=False)

# Basic cleaning: keep rows with a subreddit and drop deleted/bot authors.
# Note: rows with a NULL author are dropped as well, since NOT IN evaluates to NULL for them.
EXCLUDED_AUTHORS = ["[deleted]", "AutoModerator"]

def basic_clean(df):
    """Apply the same subreddit/author filters to submissions or comments."""
    return (
        df.filter(F.col("subreddit").isNotNull())
          .filter(~F.col("author").isin(EXCLUDED_AUTHORS))
    )

submissions_clean = basic_clean(submissions_df)
comments_clean = basic_clean(comments_df)

print("Clean submissions rows:", submissions_clean.count())
print("Clean comments rows:", comments_clean.count())

=== Missing values: submissions ===
+------------------+------------------+--------+------+--------------+----------------------+-----------------+---------+----------+------------+-----------+----------------+---------------------+----------------+-------------+------+---------------+------+----------+---------+------+------+----------+--------+---+---------+----------------+----------------------+-------+--------+--------------------+---------------+------+-----+-----------+-------------+------------+--------------+-------------+-------+-----------------------+---------+------+---------+-------+--------+-----------+---------------------+------------+------------+-----+------------+------------------+--------+-------+--------+---------+------------+--------------+--------------------+--------------------+----------------------+---------+----------------+---------------+-----+---+----------------+
|adserver_click_url|adserver_imp_pixel|archived|author|author_cakeday|author_flair_css_cl

In [None]:
### Schema Exploration and Text Preview
We preview a few rows to understand available fields (`title`, `selftext`, `body`, etc.).

In [6]:
submissions_clean.select("subreddit", "title", "selftext").show(5, truncate=100)
comments_clean.select("subreddit", "author", "body").show(5, truncate=100)

+--------------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|           subreddit|                                                            title|                                                                                            selftext|
+--------------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|     PUBATTLEGROUNDS|                                         Just swiming and chillin|                                                                                                    |
|         CFBOffTopic|         Thursday night thread presented by NFL preseason week 4.|                                                                                                    |
|GlobalOffensiveTrade|                            

In [7]:
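# Demo keyword pattern for cost-of-living ("col") topics; (?i) makes it case-insensitive.
# Caveat: rlike matches substrings, so "rent" also hits "current" and "pay" hits "paypal".
demo_kw = r"(?i)(inflation|rent|mortgage|debt|loan|bills|price|cost|money|pay|wage)"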

submissions_feat = (
    submissions_clean
    .withColumn("col_demo_flag", F.col("title").cast("string").rlike(demo_kw).cast("int"))
    .withColumn(
        "selftext_len",
        F.when(F.col("selftext").isNull(), 0).otherwise(F.length("selftext"))
    )
    .withColumn(
        "post_length_bucket",
        F.when(F.col("selftext_len") < 100, "short")
         .when(F.col("selftext_len") < 500, "medium")
         .otherwise("long")
    )
)

print("Demo cost-of-living-ish submissions in sample:",
      submissions_feat.filter(F.col("col_demo_flag") == 1).count())
submissions_feat.select("subreddit", "title", "col_demo_flag").show(5, truncate=100)

Demo cost-of-living-ish submissions in sample: 33
+--------------------+-----------------------------------------------------------------+-------------+
|           subreddit|                                                            title|col_demo_flag|
+--------------------+-----------------------------------------------------------------+-------------+
|     PUBATTLEGROUNDS|                                         Just swiming and chillin|            0|
|         CFBOffTopic|         Thursday night thread presented by NFL preseason week 4.|            0|
|GlobalOffensiveTrade|                                  [H] Gut Knife | Vanilla [W] 25k|            0|
|           Overwatch|                                                And Here. We. Go.|            0|
|                asia|Soccer: Japan Beat Australia to Qualify for 2018 World Cup Finals|            0|
+--------------------+-----------------------------------------------------------------+-------------+
only showing top 5 rows

In [8]:
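# Broader check across titles, selftext, and comment bodies (F is already imported above).
# Unlike demo_kw, this pattern omits "pay" and "wage".
test_kw = r"(?i)(inflation|rent|mortgage|debt|loan|bills|price|cost|money)"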

print("Submissions with any money/CoL-ish keyword:")
submissions_clean.filter(
    (F.col("title").rlike(test_kw)) | 
    (F.col("selftext").rlike(test_kw))
).select("subreddit", "title", "selftext").show(20, truncate=120)

print("Comments with any money/CoL-ish keyword:")
comments_clean.filter(
    F.col("body").rlike(test_kw)
).select("subreddit", "body").show(20, truncate=120)

Submissions with any money/CoL-ish keyword:
+--------------------+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|           subreddit|                                                                                                                   title|                                                                                                                selftext|
+--------------------+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|        megarequests|                                 [REQUEST] Literature into Film: Theory And Practical Approaches by Linda Costanzo Cahir|                  
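
The first matched title in the output above appears to qualify only because "Costanzo" contains "cost". The sketch below compares the substring pattern with a word-bounded variant, using Python's `re` module. The bounded pattern and the second and third sample titles are hypothetical. Java regular expressions, which back Spark's `rlike`, also support `\b`, so the bounded pattern is expected to carry over, though that has not been checked against this dataset.

```python
import re

# Same keywords as test_kw, matched anywhere inside a word.
SUBSTRING_RE = re.compile(
    r"(inflation|rent|mortgage|debt|loan|bills|price|cost|money)", re.IGNORECASE
)
# Hypothetical stricter variant: whole words only, with optional plural on price/cost.
BOUNDED_RE = re.compile(
    r"\b(inflation|rent|mortgage|debt|loan|bills|prices?|costs?|money)\b", re.IGNORECASE
)

titles = [
    "[REQUEST] Literature into Film: Theory And Practical Approaches by Linda Costanzo Cahir",
    "My rent went up again",   # made-up title
    "Current patch notes",     # made-up title
]
substring_hits = [bool(SUBSTRING_RE.search(t)) for t in titles]
bounded_hits = [bool(BOUNDED_RE.search(t)) for t in titles]
```

Word boundaries reduce false positives but also drop inflected forms such as "renting", so the keyword list may need explicit variants.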

In [None]:
spark.stop()