# Exploratory Data Analysis

**Name: Smyan Kapoor**

**Candidate Number: 36745**

<div style="font-family: system-ui; color: #000000; padding: 20px 30px 20px 20px; background-color: #FFFFFF; border-left: 8px solid #47315E; border-radius: 8px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1); max-width:700px">

**OBJECTIVE:** In this notebook we will query the database to uncover insights related to subreddit activity and engagement. After applying relevant reshaping techniques, we will explore the data and come up with preliminary plots. This will be followed by a series of more intricate visualisations to analyse the research question in an intuitive and engaging manner.

**Analysis Focus**

For each subreddit, we will look at:
- Posts per day, normalized by number of subscribers
- Comments per day, normalized by number of subscribers
- Total comment upvotes per day (normalized by number of comments for **average upvotes per comment**)
- Total post upvotes per day (normalized by number of posts for **average upvotes per post**)

- **We will also explore hourly data for these variables to see how activity changes over the course of the day**

**Visualisation**
- We produce several plots, looking at averages, hourly data and daily data
- Our plots aim to be experimental and insightful
- The final plots are saved into the figures folder and can be viewed in ```REPORT.md```

</div>


 ⚙️ Importing libraries:



In [1]:
import os
import json
import pandas as pd 
import numpy as np 
from lets_plot import *
LetsPlot.setup_html()
from IPython.display import Image

from sqlalchemy import create_engine, text

# the engine connection is defined in utils.py (in the notebooks folder) and imported here
from utils import *

## Data Exploration

Let's query the posts table and convert the timestamps to US Eastern Time so we can later track hourly activity during market opening hours. We use ```.tz_localize``` to assign a timezone to values that don't already have one and ```.tz_convert``` to convert them to the timezone we need.

In [2]:
# Query database
query_posts = """
SELECT 
    p.subreddit, 
    p.created_utc, 
    p.post_id, 
    p.ups AS post_upvotes, 
    s.subscribers
FROM posts p
LEFT JOIN subreddits s ON p.subreddit = s.name;
"""

# Load into Pandas
posts_df = pd.read_sql(query_posts, con=engine)

# Convert timestamp to datetime & adjust to US Eastern Time
posts_df["post_created_datetime"] = pd.to_datetime(posts_df["created_utc"], unit="s")
posts_df["post_created_datetime"] = posts_df["post_created_datetime"].dt.tz_localize("UTC").dt.tz_convert('America/New_York')


posts_df



Unnamed: 0,subreddit,created_utc,post_id,post_upvotes,subscribers,post_created_datetime
0,stocks,1743519210,t3_1joxk0b,0,8564957,2025-04-01 10:53:30-04:00
1,stocks,1743518367,t3_1jox7vs,10,8564957,2025-04-01 10:39:27-04:00
2,stocks,1743513487,t3_1jovd2s,4,8564957,2025-04-01 09:18:07-04:00
3,stocks,1743499843,t3_1jorhcw,7,8564957,2025-04-01 05:30:43-04:00
4,stocks,1743497012,t3_1joquum,48,8564957,2025-04-01 04:43:32-04:00
...,...,...,...,...,...,...
3471,investing,1740777191,t3_1j0ihed,2,2990055,2025-02-28 16:13:11-05:00
3472,investing,1740775863,t3_1j0hymn,149,2990055,2025-02-28 15:51:03-05:00
3473,investing,1740774342,t3_1j0hdn5,5,2990055,2025-02-28 15:25:42-05:00
3474,investing,1740768198,t3_1j0ezm5,0,2990055,2025-02-28 13:43:18-05:00
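As a quick sanity check on the timezone handling, here is a minimal sketch using two ```created_utc``` values copied from the output above. The ```demo``` frame and the one-step alternative with ```utc=True``` are illustrative additions, not part of the original pipeline. Note how the offset switches from -05:00 in February to -04:00 in April, since ```America/New_York``` follows daylight saving time.

```python
import pandas as pd

# Two epoch timestamps taken from the posts output above
demo = pd.DataFrame({"created_utc": [1743519210, 1740777191]})

# Two-step form used in the notebook: parse as naive datetimes, label them as UTC, then convert
two_step = (
    pd.to_datetime(demo["created_utc"], unit="s")
    .dt.tz_localize("UTC")
    .dt.tz_convert("America/New_York")
)

# One-step alternative: utc=True makes the parsed values timezone-aware straight away
one_step = pd.to_datetime(demo["created_utc"], unit="s", utc=True).dt.tz_convert("America/New_York")

print(two_step.tolist())
print(two_step.dt.hour.tolist())
```

Both forms should give the same local times, so the choice between them is mostly a matter of readability.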


Below we create columns for the date and hour and use the ```aggregate_posts``` function defined in utils.py, which uses the ```.groupby``` method and returns a grouped data frame.

In [3]:
# DAILY AGGREGATION
posts_df["created_date"] = posts_df["post_created_datetime"].dt.date
daily_agg = aggregate_posts(posts_df, group_by_columns=["subreddit", "created_date"])
grouped_posts_df = daily_agg.rename(columns={"post_id": "posts_per_day"})

# HOURLY AGGREGATION
posts_df["created_hour"] = posts_df["post_created_datetime"].dt.hour
hourly_agg = aggregate_posts(posts_df, group_by_columns=["subreddit", "created_hour"])
grouped_hourly_posts_df = hourly_agg.rename(columns={"post_id": "posts_per_hour"})


In [4]:
grouped_posts_df

Unnamed: 0,subreddit,created_date,posts_per_day,post_upvotes,subscribers
0,StockMarket,2025-02-06,6,595,3509357
1,StockMarket,2025-02-07,15,2858,3509357
2,StockMarket,2025-02-08,13,896,3509357
3,StockMarket,2025-02-09,7,1672,3509357
4,StockMarket,2025-02-10,12,1999,3509357
...,...,...,...,...,...
150,wallstreetbets,2025-03-28,45,88181,18173083
151,wallstreetbets,2025-03-29,16,23629,18173083
152,wallstreetbets,2025-03-30,6,18423,18173083
153,wallstreetbets,2025-03-31,25,52766,18173083
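The ```aggregate_posts``` helper itself lives in utils.py and is not shown in this notebook. Judging by the output columns above, it probably counts ```post_id```, sums ```post_upvotes``` and carries ```subscribers``` through. The sketch below is a hypothetical stand-in under that assumption (the name ```aggregate_posts_sketch``` and the toy data are made up for illustration):

```python
import pandas as pd

def aggregate_posts_sketch(df, group_by_columns):
    """Hypothetical stand-in for utils.aggregate_posts (the real helper is not shown)."""
    return (
        df.groupby(group_by_columns)
        .agg(
            post_id=("post_id", "count"),          # number of posts in the group
            post_upvotes=("post_upvotes", "sum"),  # total upvotes in the group
            subscribers=("subscribers", "first"),  # constant within a subreddit
        )
        .reset_index()
    )

# Toy data, invented for illustration
toy_posts = pd.DataFrame({
    "subreddit": ["stocks", "stocks", "investing"],
    "created_date": ["2025-04-01", "2025-04-01", "2025-02-28"],
    "post_id": ["t3_a", "t3_b", "t3_c"],
    "post_upvotes": [0, 10, 149],
    "subscribers": [8564957, 8564957, 2990055],
})

toy_daily = aggregate_posts_sketch(toy_posts, ["subreddit", "created_date"])
print(toy_daily.rename(columns={"post_id": "posts_per_day"}))
```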


In [5]:
# grouped_hourly_posts_df.head(45)

Here we query the comments table in the same way as posts. After repeating the timezone operations, we aggregate the comments with a similar function (```aggregate_comments```) so that the resulting data frames have the same format and can be merged easily later.

In [6]:
query_comments = """
SELECT 
    c.comment_id, 
    c.post_id, 
    c.created_utc, 
    c.ups AS comment_upvotes,
    p.subreddit  
FROM comments c
LEFT JOIN posts p ON c.post_id = p.post_id;
"""

# Load into Pandas
comments_df = pd.read_sql(query_comments, con=engine)

# Convert UTC timestamp to datetime
comments_df["comment_created_datetime"] = pd.to_datetime(comments_df["created_utc"], unit="s")

# Convert to NY time
comments_df["comment_created_datetime"] = comments_df["comment_created_datetime"].dt.tz_localize('UTC').dt.tz_convert('America/New_York')



In [7]:
# Extract Date and Hour for comments
comments_df["created_date"] = comments_df["comment_created_datetime"].dt.date
comments_df["created_hour"] = comments_df["comment_created_datetime"].dt.hour

# DAILY AGGREGATION
daily_agg_comments = aggregate_comments(comments_df, group_by_columns=["subreddit", "created_date"])
grouped_comments_df = daily_agg_comments.rename(columns={"comment_id": "comments_per_day"})

# HOURLY AGGREGATION
hourly_agg_comments = aggregate_comments(comments_df, group_by_columns=["subreddit", "created_hour"])
grouped_hourly_comments_df = hourly_agg_comments.rename(columns={"comment_id": "comments_per_hour"})

In [8]:
grouped_hourly_comments_df 

Unnamed: 0,subreddit,created_hour,comments_per_hour,comment_upvotes
0,StockMarket,0,943,10301
1,StockMarket,1,833,10495
2,StockMarket,2,738,7237
3,StockMarket,3,661,7398
4,StockMarket,4,559,5906
...,...,...,...,...
91,wallstreetbets,19,2108,83534
92,wallstreetbets,20,2052,156603
93,wallstreetbets,21,1819,82052
94,wallstreetbets,22,1707,62170
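One thing worth checking with the ```LEFT JOIN``` on comments: any comment whose ```post_id``` has no match in the posts table would get a missing ```subreddit```, and pandas ```groupby``` silently drops rows with a missing key by default. We have not verified whether such comments exist in our database, so the sketch below only illustrates the behaviour on invented data (and assumes ```aggregate_comments``` is a plain count/sum ```groupby```):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy_comments = pd.DataFrame({
    "comment_id": ["t1_a", "t1_b", "t1_c"],
    "comment_upvotes": [5, 3, 7],
    "subreddit": ["stocks", "stocks", np.nan],  # last comment had no matching post
    "created_hour": [10, 10, 11],
})

# How many comments could not be matched to a subreddit?
n_unmatched = int(toy_comments["subreddit"].isna().sum())

# groupby drops rows whose key is NaN by default (dropna=True)
hourly = (
    toy_comments.groupby(["subreddit", "created_hour"])
    .agg(comment_id=("comment_id", "count"), comment_upvotes=("comment_upvotes", "sum"))
    .reset_index()
)
print(n_unmatched, len(hourly))
```

If ```n_unmatched``` turned out to be large on the real data, the hourly comment counts would be understated.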


Below, we use ```pd.merge()``` to combine the data into a single table. We then apply a scaling function, which takes a dataframe and a list of columns, to express each metric per 100k subscribers for a more meaningful comparison.

In [9]:
# Merge based on common columns 
daily_combined_df = pd.merge(grouped_posts_df, grouped_comments_df, on=['subreddit', "created_date"], how="outer")

# List of columns to scale
columns_to_scale = ["posts_per_day", "comments_per_day", "comment_upvotes", "post_upvotes"]

# Apply scaling function
daily_final = scale_columns_per_100k(daily_combined_df, columns_to_scale)

# daily_final
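The Analysis Focus also mentions average upvotes per post and per comment. These are not computed in this notebook, but a minimal sketch of how they could be derived from the daily columns is shown below. The post figures are copied from the daily output above, while the comment figures and the ```safe_ratio``` helper are made up for illustration.

```python
import numpy as np
import pandas as pd

# Post columns copied from the daily output above; the comment columns are invented
toy_daily = pd.DataFrame({
    "subreddit": ["StockMarket", "StockMarket"],
    "posts_per_day": [6, 15],
    "post_upvotes": [595, 2858],
    "comments_per_day": [200, 0],   # illustrative only
    "comment_upvotes": [1000, 0],   # illustrative only
})

def safe_ratio(numerator, denominator):
    """Element-wise division that yields NaN instead of dividing by zero."""
    return numerator / denominator.replace(0, np.nan)

toy_daily["avg_upvotes_per_post"] = safe_ratio(toy_daily["post_upvotes"], toy_daily["posts_per_day"])
toy_daily["avg_upvotes_per_comment"] = safe_ratio(toy_daily["comment_upvotes"], toy_daily["comments_per_day"])
print(toy_daily)
```

Days with no posts or no comments end up as ```NaN``` rather than as a misleading zero.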

The merging and scaling process is repeated below for the hourly data:

In [10]:
# outer keeps all data from both datasets in the merged df 
hourly_combined_df = pd.merge(grouped_hourly_comments_df, grouped_hourly_posts_df, on=['subreddit', "created_hour"], how="outer")

# hourly_combined_df.head(45)
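Because ```subscribers``` only comes from the posts side of this merge, any hour that has comments but no posts would end up with a missing subscriber count after the outer join. A small sketch (toy data, invented for illustration) of how ```indicator=True``` can be used to spot such one-sided rows:

```python
import pandas as pd

# Toy data, invented for illustration
toy_hourly_comments = pd.DataFrame({
    "subreddit": ["stocks", "stocks"],
    "created_hour": [3, 4],
    "comments_per_hour": [12, 9],
})
toy_hourly_posts = pd.DataFrame({
    "subreddit": ["stocks"],
    "created_hour": [3],
    "posts_per_hour": [2],
    "subscribers": [8564957],
})

checked = pd.merge(
    toy_hourly_comments, toy_hourly_posts,
    on=["subreddit", "created_hour"], how="outer",
    indicator=True,  # adds a _merge column: both / left_only / right_only
)
print(checked)

# Rows present on one side only carry NaN for the other side's columns
one_sided = checked[checked["_merge"] != "both"]
```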

In [11]:
hourly_columns_to_scale = ["posts_per_hour", "comments_per_hour", "comment_upvotes", "post_upvotes"]

# Apply scaling function
hourly_final = scale_columns_per_100k(hourly_combined_df, hourly_columns_to_scale)

hourly_final

Unnamed: 0,subreddit,created_hour,comments_per_hour,comment_upvotes,posts_per_hour,post_upvotes,subscribers,posts_per_hour_per_100k_subs,comments_per_hour_per_100k_subs,comment_upvotes_per_100k_subs,post_upvotes_per_100k_subs
0,StockMarket,0,943,10301,22,6761,3509357,0.626895,26.871019,293.529555,192.656376
1,StockMarket,1,833,10495,15,4557,3509357,0.427429,23.736542,299.057634,129.852848
2,StockMarket,2,738,7237,19,14296,3509357,0.541410,21.029493,206.220114,407.368073
3,StockMarket,3,661,7398,20,3847,3509357,0.569905,18.835359,210.807849,109.621221
4,StockMarket,4,559,5906,14,697,3509357,0.398933,15.928844,168.292938,19.861188
...,...,...,...,...,...,...,...,...,...,...,...
91,wallstreetbets,19,2108,83534,23,32948,18173083,0.126561,11.599573,459.657836,181.301103
92,wallstreetbets,20,2052,156603,26,87656,18173083,0.143069,11.291425,861.730505,482.339733
93,wallstreetbets,21,1819,82052,17,40224,18173083,0.093545,10.009309,451.502918,221.338339
94,wallstreetbets,22,1707,62170,25,29122,18173083,0.137566,9.393013,342.099357,160.247989
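The ```scale_columns_per_100k``` helper is defined in utils.py and is not shown here. A minimal sketch of what it might look like, assuming it divides each column by ```subscribers``` and multiplies by 100,000 (the ```_sketch``` name and the ```subscribers_col``` parameter are hypothetical). The input values are copied from the first StockMarket row above, and the result agrees with the ```_per_100k_subs``` columns in that row:

```python
import pandas as pd

def scale_columns_per_100k_sketch(df, columns, subscribers_col="subscribers"):
    """Hypothetical stand-in for utils.scale_columns_per_100k."""
    out = df.copy()
    for col in columns:
        # e.g. posts_per_hour -> posts_per_hour_per_100k_subs
        out[f"{col}_per_100k_subs"] = out[col] / out[subscribers_col] * 100_000
    return out

# First StockMarket row of the hourly output (hour 0)
row = pd.DataFrame({
    "subreddit": ["StockMarket"],
    "posts_per_hour": [22],
    "comments_per_hour": [943],
    "subscribers": [3509357],
})

scaled = scale_columns_per_100k_sketch(row, ["posts_per_hour", "comments_per_hour"])
print(scaled.round(6))
```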


We average the hourly data across all subreddits and store the result in the ```avg_final``` dataframe.

In [12]:
# Average each metric across subreddits, hour by hour
avg_final = (
    hourly_final
    .drop(columns=['subreddit', 'subscribers'])  # drop the label and subscriber columns before averaging
    .groupby("created_hour")
    .mean()
    .reset_index()
)

# avg_final
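Note that this is an unweighted mean: every subreddit counts equally for a given hour, regardless of how many subscribers it has. A toy illustration with invented numbers:

```python
import pandas as pd

# Toy data, invented for illustration
toy_hourly = pd.DataFrame({
    "subreddit": ["StockMarket", "wallstreetbets", "StockMarket", "wallstreetbets"],
    "created_hour": [0, 0, 1, 1],
    "comments_per_hour": [900, 2100, 800, 1800],
    "subscribers": [3509357, 18173083, 3509357, 18173083],
})

toy_avg = (
    toy_hourly
    .drop(columns=["subreddit", "subscribers"])
    .groupby("created_hour")
    .mean()          # simple mean over the subreddits present in each hour
    .reset_index()
)
print(toy_avg)
```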

## Visualisation

Let's explore a couple ways we can plot some data, starting with a heatmap!

In [13]:
p = (
    ggplot(hourly_final, aes(x='created_hour', y='subreddit', fill='comment_upvotes'))
    + geom_tile(color='white')  # A classic tile-based heatmap
    + scale_x_continuous(breaks=list(range(24)))  # Ticks from 0 to 23
    + scale_fill_gradient(low='#abd9e9', high='#d7191c')
    + ggsize(1000, 500)
    + labs(
        x='Hour of Day',
        y='Subreddit',
        title='Wallstreetbets has the greatest number of comment upvotes - particularly in the second half of the day'
    )
    + theme_classic()
    + theme(legend_position='bottom')
)
p

Here we plot each subreddit in separate dot plots, allowing us to compare trends. This may be a useful visualisation tool.

In [14]:
p_dot = (
    ggplot(hourly_final, aes(x='created_hour', y='posts_per_hour', color='subreddit'))
    + geom_point(size=3, alpha=0.8)
    + scale_x_continuous(breaks=list(range(24)))
    + facet_wrap('subreddit')
    + labs(x='Hour of Day', y='Posts per hour')
    + ggtitle('Posting activity is greatest around market open and close')
    + ggsize(1000, 500)
    + theme_classic() + theme(legend_position='bottom')
)
p_dot

However, it would be better to include our normalised metrics that account for subscribers! Below we plot posts and comments per 100k subscribers for each subreddit, and use a heatmap-style colour gradient to show where the raw number of posts and comments is greatest.

Although we could wrap most of the plotting in functions, keeping the code inline makes each plot easier to adjust, so it remains here.

In [15]:
# here we define our first bar plot for total comments and normalised comments
p_bar_comments = (
    ggplot(hourly_final, aes(x='created_hour', y='comments_per_hour_per_100k_subs', fill='comments_per_hour'))
    + geom_bar(
        stat='identity',
        color='white',
        size=0.5,
        alpha=0.7,

        # here we can adjust what we see when we hover over 
        tooltips=layer_tooltips()
            .line("Comments/100k: @{comments_per_hour_per_100k_subs}")
            # custom line for fill to make sure we take in the total comments
            .line("Total comments/hour: @{comments_per_hour}") 
            .line("Hour: @created_hour")
            .line("Subreddit: @subreddit")
            
    )

    # plot all hours
    + scale_x_continuous(breaks=list(range(24)))
    # ensure y axis is readable and scaled correctly 
    + scale_y_continuous(breaks=list(range(0, 100, 5)))
    + scale_fill_gradient(low='#abd9e9', high='#d7191c')
    + facet_wrap('subreddit')
    # label our axes and data
    + labs(x='Hour of Day', y='Comments Per 100k Subscribers', fill="Total Comments in an hour")
    + ggsize(2100, 1200)
    + ggtitle('WSB Leads in Total Comments, Trails after normalisation')
    # this helps us adjust the size of the legend since it was running over the chart
    + guides(fill=guide_colorbar(barwidth=200, barheight=20))
    + theme_classic()
    # adjust sizing to make graph clear when combined
    + theme(
          legend_position='bottom',
          axis_text_x=element_text(size=12),
          axis_text_y=element_text(size=12),
          axis_title_x=element_text(size=14),
          axis_title_y=element_text(size=14),
          plot_title=element_text(face='bold',size=12),
          legend_text=element_text(size=8),
          legend_title=element_text(size=11)
      )
)


# Bar chart for post metrics
p_bar_posts = (
    ggplot(hourly_final, aes(x='created_hour', y='posts_per_hour_per_100k_subs', fill='posts_per_hour'))
    + geom_bar(
        stat='identity',
        color='white',
        size=0.5,
        alpha=0.7,
        tooltips=layer_tooltips()
            .line("Posts/100k: @{posts_per_hour_per_100k_subs}")
            .line("Total posts/hour: @{posts_per_hour}")
            .line("Hour: @created_hour")
            .line("Subreddit: @subreddit")
            
    )
    + scale_x_continuous(breaks=list(range(24)))
    + scale_y_continuous(breaks=np.arange(0, 3, 0.1))
    + scale_fill_gradient(name="Posts", low='#abd9e9', high='#d7191c')
    + facet_wrap('subreddit')
    + labs(x='Hour of Day', y='Posts Per 100k Subscribers', fill="Total Posts in an hour")
    + ggtitle('Smaller subs have better normalised post engagement')
    + ggsize(2100, 1200)
    + guides(fill=guide_colorbar(barwidth=200, barheight=20))
    + theme_classic()
    + theme(
          legend_position='bottom',
          axis_text_x=element_text(size=12),
          axis_text_y=element_text(size=12),
          axis_title_x=element_text(size=14),
          axis_title_y=element_text(size=14),
          plot_title=element_text(face='bold',size=12),
          legend_text=element_text(size=8),
          legend_title=element_text(size=11)
      )
)

# final_grid = gggrid([p_bar_comments, p_bar_posts], ncol=2)
# final_grid

Let's do the same for our hourly upvotes data, using both normalised and raw values:

In [16]:
p_bar_comment_upvotes = (
    ggplot(hourly_final, aes(x='created_hour', y='comment_upvotes_per_100k_subs', fill='comment_upvotes'))
    + geom_bar(
          stat='identity',
          color='white',
          size=0.5,
          alpha=0.7,
          tooltips=layer_tooltips()
              .line("Comment Upvotes/100k: @{comment_upvotes_per_100k_subs}")
              .line("Total Comment Upvotes in hour: @{comment_upvotes}")
              .line("Hour: @created_hour")
              .line("Subreddit: @subreddit")
      )
    + scale_x_continuous(breaks=list(range(24)))
    + scale_y_continuous(breaks=list(range(0, 3000, 100)))
    + scale_fill_gradient(low='#abd9e9', high='#d7191c')
    + facet_wrap('subreddit')
    + labs(
          x='Hour of Day',
          y='Comment Upvotes Per 100k Subscribers',
          fill="Total Comment Upvotes/hour"
      )
    + ggsize(2100, 1200)
    + ggtitle('r/StockMarket shines during market hours')
    + guides(fill=guide_colorbar(barwidth=200, barheight=20))
    + theme_classic()
    + theme(
          legend_position='bottom',
          axis_text_x=element_text(size=12),
          axis_text_y=element_text(size=12),
          axis_title_x=element_text(size=14),
          axis_title_y=element_text(size=14),
          plot_title=element_text(face='bold',size=12),
          legend_text=element_text(size=8),
          legend_title=element_text(size=11)
      )
)

# Bar chart for post upvotes
p_bar_post_upvotes = (
    ggplot(hourly_final, aes(x='created_hour', y='post_upvotes_per_100k_subs', fill='post_upvotes'))
    + geom_bar(
        stat='identity',
        color='white',
        size=0.5,
        alpha=0.7,
        tooltips=layer_tooltips()
            .line("Post Upvotes/100k: @{post_upvotes_per_100k_subs}")
            .line("Total Post Upvotes in hour: @{post_upvotes}")
            .line("Hour: @created_hour")
            .line("Subreddit: @subreddit")
    )
    + scale_x_continuous(breaks=list(range(24)))
    + scale_y_continuous(breaks=list(range(0, 3000, 100)))
    + scale_fill_gradient(low='#abd9e9', high='#d7191c')
    + facet_wrap('subreddit')
    + labs(
          x='Hour of Day',
          y='Post Upvotes Per 100k Subscribers',
          fill="Total Post Ups/hour"
      )
    + guides(fill=guide_colorbar(barwidth=200, barheight=20))
    + ggtitle('Redditors rarely upvote posts - except r/StockMarket')
    + ggsize(2100, 1200)
    + theme_classic()
    + theme(
          legend_position='bottom',
          legend_direction='horizontal',
          axis_text_x=element_text(size=12),
          axis_text_y=element_text(size=12),
          axis_title_x=element_text(size=14),
          axis_title_y=element_text(size=14),
          plot_title=element_text(face='bold',size=12),
          legend_text=element_text(size=8),
          legend_title=element_text(size=11)
      )
)

# final_grid = gggrid([p_bar_comment_upvotes, p_bar_post_upvotes], ncol=2)
# final_grid

Let's present our plots in a format that lets us see everything at once. We will analyse this later in our report.

In [17]:
final_grid_hourly_split = gggrid(
    [p_bar_comments, p_bar_posts,p_bar_comment_upvotes, p_bar_post_upvotes],
    ncol=2
)
# Folder for saved figures (created if it does not already exist)
output_dir = "../figures"
os.makedirs(output_dir, exist_ok=True)

# Define the filename for saving
filename = os.path.join(output_dir, "Hourly_Sub_Split_Chart.png") 

# Save the plot with the correct arguments
ggsave(final_grid_hourly_split, filename, w=8, h=6, unit='in', dpi=300, path=output_dir)

'/files/mini-project-2-ds105w-2025-smyea/figures/Hourly_Sub_Split_Chart.png'

Now let's explore a couple of ways of presenting our averaged data. We must first melt the data into a long format so that it is easier to plot.

In [18]:
# every column becomes a metric except 'created_hour', which stays as the identifier (hence id_vars)
avg_final_long_all = avg_final.melt(
    id_vars='created_hour',
    var_name='metric',
    value_name='value'
)
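To make the reshaping concrete, here is a small sketch of ```melt``` on a toy wide table (values invented for illustration). Each metric column is stacked into ```metric```/```value``` pairs, one row per hour and metric:

```python
import pandas as pd

# Toy wide table, invented for illustration
toy_wide = pd.DataFrame({
    "created_hour": [9, 10],
    "post_upvotes": [120.0, 150.0],
    "comment_upvotes": [300.0, 280.0],
})

toy_long = toy_wide.melt(id_vars="created_hour", var_name="metric", value_name="value")
print(toy_long)
```

The long table has one row per (hour, metric) pair, which is the shape that ```fill='metric'``` expects in the plot that follows.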

Using ```geom_rect```, with the minimum and maximum values taken after filtering to our key metrics, we draw a shaded box around the hours the market is open. We take 9am to 5pm to avoid working with half hours. This makes it easy to isolate and visualise how activity changes while the market is open.

In [19]:
filtered_df = avg_final_long_all[
    avg_final_long_all['metric'].isin(['post_upvotes', 'comment_upvotes'])
].reset_index(drop=True)

# Calculate shading boundaries based on filtered data
ymin_shade, ymax_shade = filtered_df['value'].min(), filtered_df['value'].max()
    
plot = (
    ggplot(filtered_df, aes(x='created_hour', y='value', fill='metric'))
    + geom_bar(stat='identity', position='dodge', alpha=0.7, color='white', size=0.5)
    + geom_rect(xmin=9, xmax=17,
                ymin=ymin_shade, ymax=ymax_shade,
                fill='#F0F0F0', alpha=0.3, color='#abd9e9')
    + scale_x_continuous(breaks=list(range(24)), labels=[f"{h:02d}:00" for h in range(24)])
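    # name the breaks explicitly so each label is tied to its metric rather than to data order
    + scale_fill_manual(values=['#3EA0CC', '#FF9933'],
                        breaks=['post_upvotes', 'comment_upvotes'],
                        labels=['Post Upvotes', 'Comment Upvotes'])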
    + labs(title="Post and Comment Upvotes by Hour",
           x="Hour of Day",
           y="Number of Upvotes",
           fill="Upvote Type")
    + theme(
        axis_text_x=element_text(angle=80, size=10),
        axis_text_y=element_text(size=10),
        plot_title=element_text(size=16)
    )
    + ggsize(1000, 500)
)

plot.show()
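The shaded window can also be turned into a number. The sketch below flags the same 9 to 17 window and compares the mean inside and outside it. The values and the ```in_window``` column are invented for illustration, and treating hour 17 as outside the window is an assumption.

```python
import pandas as pd

MARKET_START, MARKET_END = 9, 17  # same window as the shaded rectangle

# Toy data, invented for illustration
toy = pd.DataFrame({
    "created_hour": [8, 9, 12, 16, 17, 20],
    "value": [10.0, 30.0, 40.0, 50.0, 20.0, 10.0],
})

# Hours 9 to 16 fall inside the window; hour 17 is treated as outside
toy["in_window"] = toy["created_hour"].between(MARKET_START, MARKET_END - 1)

inside_mean = toy.loc[toy["in_window"], "value"].mean()
outside_mean = toy.loc[~toy["in_window"], "value"].mean()
print(inside_mean, outside_mean)
```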

Below we plot posts and comments in a single mirrored chart: posts are drawn above the axis and comments below it, so the comment values are made negative. Each series is divided by its own maximum, so posts lie between 0 and 1 and comments between -1 and 0. Interestingly, although both normalised series are high while the market is open, the hour with the most posts is not the hour with the most comments.

In [20]:

# Normalize data separately to scale from 0 to 1
posts_norm = avg_final['posts_per_hour'] / avg_final['posts_per_hour'].max()
comments_norm = avg_final['comments_per_hour'] / avg_final['comments_per_hour'].max()

# Combine into one dataframe for plotting
df_combined = pd.DataFrame({
    'created_hour': avg_final['created_hour'],
    'posts_norm': posts_norm,
    'comments_norm': -comments_norm,  # negative to go below the axis
    'post_upvotes': avg_final['post_upvotes'],
    'comment_upvotes': avg_final['comment_upvotes'],
    'posts_per_hour': avg_final['posts_per_hour'],
    'comments_per_hour': avg_final['comments_per_hour'],
})

combined_plot = (
    ggplot(df_combined, aes(x='created_hour'))
    # Peak hours shading (move this first so it's behind)
    + geom_rect(xmin=9, xmax=17, ymin=-1, ymax=1,
                fill='#F0F0F0', alpha=0.8, color='#204E4A')
    # Posts per hour normalized above axis
    + geom_bar(aes(y='posts_norm', fill='post_upvotes'),
               stat='identity', width=0.8, alpha=1, color='white',
               tooltips=layer_tooltips()
                   .line("Posts/hour: @{posts_per_hour}")
                   .line("Post Upvotes: @{post_upvotes}")
                   .line("Hour: @created_hour"))
    # Comments per hour normalized below axis
    + geom_bar(aes(y='comments_norm', fill='comment_upvotes'),
               stat='identity', width=0.8, alpha=0.8, color='white',
               tooltips=layer_tooltips()
                   .line("Comments/hour: @{comments_per_hour}")
                   .line("Comment Upvotes: @{comment_upvotes}")
                   .line("Hour: @created_hour"))
    # Zero line separating clearly
    + geom_hline(yintercept=0, color='black', size=0.6)
    # Axis labels clearly indicating normalization
    + scale_x_continuous(breaks=list(range(24)), labels=[f"{h:02d}:00" for h in range(24)])
    + scale_fill_gradient(low='#abd9e9', high='#d7191c')
    + labs(
        title='Normalized Posting Decreases over Market hours, but Comments Increase',
        x='Hour of Day',
        y='Normalized Scale: Posts (↑) | Comments (↓)',
        fill='Upvotes'
    )
    + theme_classic()
    + theme(
        axis_text_x=element_text(angle=80, size=10),
        axis_text_y=element_text(size=10),
        axis_title_x=element_text(size=14),
        axis_title_y=element_text(size=14),
        plot_title=element_text(size=16),
        legend_position='right',
        legend_text=element_text(size=12),
        legend_title=element_text(size=12)
    )
    + ggsize(1000, 600)
)

combined_plot.show()
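To check the peak hours numerically rather than by eye, a minimal sketch of the same max-normalisation followed by ```idxmax``` (toy values, invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration
toy_avg = pd.DataFrame({
    "created_hour": [9, 10, 11],
    "posts_per_hour": [20.0, 40.0, 30.0],
    "comments_per_hour": [500.0, 800.0, 1000.0],
})

# Divide by the column maximum so each series peaks at exactly 1
posts_norm = toy_avg["posts_per_hour"] / toy_avg["posts_per_hour"].max()
comments_norm = toy_avg["comments_per_hour"] / toy_avg["comments_per_hour"].max()

# Hour at which each series reaches its peak
peak_posts_hour = int(toy_avg.loc[toy_avg["posts_per_hour"].idxmax(), "created_hour"])
peak_comments_hour = int(toy_avg.loc[toy_avg["comments_per_hour"].idxmax(), "created_hour"])
print(peak_posts_hour, peak_comments_hour)
```

Running the same two ```idxmax``` lines on ```avg_final``` would give the actual peak hours for our data.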

Below, we see that, averaged across all subreddits, every metric peaks during the period when the market is open. Using the normalised data does not change the picture much: because we are already looking at averages, normalising simply divides each subreddit's values by its subscriber count.

In [21]:
y_ranges = avg_final_long_all.groupby('metric')['value'].agg(['min', 'max']).to_dict('index')

# Helper to generate each plot
def create_hourly_bar(metric_y, metric_fill, title, y_label, fill_label):
    return (
        ggplot(avg_final, aes(x='created_hour', y=metric_y, fill=metric_fill))
        + geom_rect(
            xmin=9, xmax=17,
            ymin=y_ranges[metric_y]['min'], ymax=y_ranges[metric_y]['max'],
            fill='#F0F0F0', alpha=0.8, color='#204E4A'
        )
        + geom_bar(
            stat='identity',
            color='white',
            size=0.5,
            alpha=0.8,
            tooltips=layer_tooltips()
                .line(f"{y_label}: @{metric_y}")
                .line(f"{fill_label}: @{metric_fill}")
                .line("Hour: @created_hour")
        )
        + scale_x_continuous(breaks=list(range(24)), labels=[f"{h:02d}:00" for h in range(24)])
        + scale_fill_gradient(low='#abd9e9', high='#d7191c')
        + labs(
            x='Hour of Day',
            y=y_label,
            title=title,
            fill=fill_label
        )
        + theme_classic()
        + theme(
            axis_text_x=element_text(angle=80, size=10),
            axis_text_y=element_text(size=12),
            axis_title_x=element_text(size=14),
            axis_title_y=element_text(size=14),
            plot_title=element_text(size=16),
            legend_position='bottom',
            legend_text=element_text(size=12),
            legend_title=element_text(size=12)
        )
    )

# Create each individual plot
p1 = create_hourly_bar(
    'posts_per_hour', 'post_upvotes',
    'Posts per Hour (Filled by Post Upvotes)',
    'Posts per Hour', 'Total Post Upvotes'
)

p2 = create_hourly_bar(
    'comments_per_hour', 'comment_upvotes',
    'Comments per Hour (Filled by Comment Upvotes)',
    'Comments per Hour', 'Total Comment Upvotes'
)

p3 = create_hourly_bar(
    'posts_per_hour_per_100k_subs', 'post_upvotes_per_100k_subs',
    'Posts per Hour per 100k Subs (Filled by Upvotes per 100k)',
    'Posts/hour/100k Subs', 'Upvotes per 100k Subs'
)

p4 = create_hourly_bar(
    'comments_per_hour_per_100k_subs', 'comment_upvotes_per_100k_subs',
    'Comments per Hour per 100k Subs (Filled by Upvotes per 100k)',
    'Comments/hour/100k Subs', 'Upvotes per 100k Subs'
)

grid_plot = gggrid([p1, p2, p3, p4], ncol=2)  # 2 columns = 2x2 layout
grid_plot.show()

Now, let's take a look at our daily figures and answer our main research question: how has retail investor activity changed during the recent days of market uncertainty under the Trump administration? Note: the graphs are interactive, but unfortunately I am not sure how to include that interactivity in the report.

In [22]:
# Ensure the 'created_date' column is in datetime format
daily_final["created_date"] = pd.to_datetime(daily_final["created_date"])

# Create a discrete, formatted date for plotting in a string format
daily_final["created_date_str"] = daily_final["created_date"].dt.strftime("%b %d")

# Custom titles for each variable
vars_titles = {
    "posts_per_day": "r/WSB Has the Greatest Post Volatility...",
    "posts_per_day_per_100k_subs": "Until We Normalise and r/investing Fluctuates a Lot More",
    "comments_per_day": "This Trend Continues for Comments...",
    "comments_per_day_per_100k_subs": "Where Our Biggest Spikes Are Led by r/investing on March 4th and 10th",
    "post_upvotes": "r/WSB Reaches Almost 100k Post Upvotes in a Day...",
    "post_upvotes_per_100k_subs": "But Redditors from r/StockMarket Love Upvoting Posts the Most",
    "comment_upvotes": "Comment Upvotes Peak at 170k for r/WSB...",
    "comment_upvotes_per_100k_subs": "But Comparatively, r/investing Outperforms Again on March 4th"
}

# Helper function to build each plot kept outside utils.py for easier adjustment
def build_plot(var, title):
    return (
        ggplot(daily_final, aes(x="created_date_str", y=var, color="subreddit", group="subreddit"))
        + geom_line(size=1.5, alpha=0.8)
        # qualitative ColorBrewer palette, one colour per subreddit
        + scale_color_brewer(type='qual', palette='Set2')
        + scale_y_continuous(expand=[0.01, 0.01])
        + labs(
            title=title,
            x="Date",
            # replace underscores with spaces; .title() then capitalises each word
            y=var.replace('_', ' ').title(),
            color="Subreddit",
            caption=""
        )
        + theme_light()
        + theme(
            legend_position="top",
            legend_title=element_text(size=20),
            legend_text=element_text(size=25),
            axis_text_x=element_text(angle=90, hjust=1, size=20),
            axis_text_y=element_text(size=20),
            axis_title_x=element_text(size=18),
            axis_title_y=element_text(size=23),
            plot_title=element_text(size=33, face='bold'),
            panel_background=element_rect(fill="#F9F9F9"),
            panel_grid_minor=element_blank()
        )
        + ggsize(900, 450)
    )

# Build and collect all plots
plots = [build_plot(var, title) for var, title in vars_titles.items()]

# Arrange all plots into a two-column grid
final_grid = gggrid(plots, ncol=2) + ggsize(3000, 2200)

# Define the filename for saving
filename = os.path.join(output_dir, "Daily_Performance_Charts.png")

# Save the plot with the correct arguments
ggsave(final_grid, filename, w=18, h=9, unit='in', dpi=600, path=output_dir)

'/files/mini-project-2-ds105w-2025-smyea/figures/Daily_Performance_Charts.png'

Below are some notable dates I've collected from the past few weeks. It would be interesting to compare these dates with our graphs to see how activity changes during these key moments!

**Key Dates, Market Reactions, and Sources — February–March 2025**

| Date(s)     | Event Summary                                                                                                 | Market Reaction                                                                    | Source |
|-------------|--------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|--------|
| Feb 3       | Global sell-off led by Asian & European markets amid tariff concerns.                                         | S&P 500 -0.8%, Nasdaq -1.2%, Dow -0.3%                                              | [CNBC](https://www.cnbc.com/2025/02/02/stock-market-today-live-updates.html) |
| Feb 21      | Tariff threats, weak economic data, and falling consumer sentiment drove broad market losses.                | S&P 500 -1.7%, Nasdaq -2.2%, Dow -1.69%                                             | [The Guardian](https://www.theguardian.com/us-news/2025/feb/21/stocks-tariffs-prices) |
| Feb 28      | Trump reaffirmed tariffs; Nvidia dropped 8.5%; jobless claims jumped; inflation concerns rose.               | S&P 500 -1.59%, Nasdaq -2.78%, Dow -0.45%                                           | [CNBC](https://www.cnbc.com/2025/02/26/stock-market-today-live-updates.html) |
| Mar 3       | Trump confirmed 25% tariffs on Canada & Mexico; tech and retail led losses.                                  | S&P 500 -1.8%, Nasdaq -2.6%, Dow -1.5%                                              | [MarketWatch](https://www.marketwatch.com/livecoverage/stock-market-today-dow-s-p-and-nasdaq-to-hold-latest-rally-after-bitcoin-surge) |
| Mar 6       | Nasdaq entered correction territory amid tariff fallout and tech sell-off.                                   | S&P 500 -1.8%, Nasdaq -2.6%, Dow -1%                                                | [Yahoo Finance](https://finance.yahoo.com/news/live/stock-market-today-nasdaq-enters-correction-sp-500-sinks-to-lowest-since-november-as-stocks-get-clobbered-on-trump-tariff-whiplash-210544344.html) |
| Mar 10      | Market entered correction territory.                                                                          | S&P 500 down 8.6% from peak, Nasdaq down over 10%                                   | [Reuters](https://www.reuters.com/markets/us/investors-flee-equities-trump-driven-uncertainty-sparks-economic-worry-2025-03-10/) |
| Mar 19–21   | Trump hinted at tariff flexibility.                                          | Dow +1.2% for the week, S&P 500 +0.5%, Nasdaq +0.2%                                 | [WSJ](https://www.wsj.com/livecoverage/stock-market-today-dow-nasdaq-sp500-03-21-2025) |
| Mar 25      | Consumer sentiment and tariff speculation influenced markets.                                                | Markets ended higher; Apple rose, Nvidia dipped                                     | [Reuters](https://www.reuters.com/markets/us/wall-st-futures-slip-trump-led-rally-loses-steam-2025-03-25/) |
| Mar 28      | PCE inflation beat expectations; consumer sentiment hit lowest since 2022; tech & crypto fell sharply.       | S&P 500 -2%, Nasdaq -2.7%, Dow -1.7%; indexes posted weekly losses                  | [Investopedia](https://www.investopedia.com/dow-jones-today-03282025-11704900) |
| Mar 31      | Stocks rebounded late in the day ahead of a key earnings/data week.                                          | S&P 500 recovered but marked worst month since 2022                                 | [Investopedia](https://www.investopedia.com/dow-jones-today-03312025-11705913) |
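
To line these dates up against our daily data, one option is to filter the daily table with ```isin```. The sketch below uses a toy stand-in for ```daily_final``` with invented numbers, and only a few of the dates from the table:

```python
import pandas as pd

# Toy stand-in for the daily table; the numbers are invented for illustration
toy_daily = pd.DataFrame({
    "subreddit": ["investing", "investing", "investing"],
    "created_date": pd.to_datetime(["2025-03-03", "2025-03-04", "2025-03-10"]),
    "comments_per_day": [100, 400, 350],
})

# A few of the dates from the table above
key_dates = pd.to_datetime(["2025-03-03", "2025-03-10", "2025-03-28"])

# Keep only the rows that fall on one of the key dates
on_key_dates = toy_daily[toy_daily["created_date"].isin(key_dates)]
print(on_key_dates)
```

Dates with no matching rows (here March 28th) simply drop out of the result.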

## Next Steps:

- Unfortunately we can only choose two visualisations. Although the average plot is great, it is more important to look at hourly data segmented by subreddit, so we use that alongside our daily plots.

- Head to ```REPORT.md``` for some analysis, interpretation and conclusions!