In [2]:
import pandas as pd

# Sentiment Model Comparison -- Reddit
In this part (3), I'll compare VADER, the lexicon- and rule-based sentiment analyzer that ships with NLTK, against FinBERT, a BERT model fine-tuned on financial text that runs on PyTorch.

The idea is that a sentiment model trained on financial data should pick up financial terms and keywords in the corpora better than the general-purpose VADER model.

I'm not sure yet exactly how I'll measure this effect, but for each model I'll plot the sentiment scores for each month of 2024 and decide from there.

## Data Treatment
In the cleaned data, each row is a post, and the `text` column holds that post's cleaned text.
For each subreddit, I'll combine the texts of all posts in a month into a single monthly corpus.

I'll then run sentiment analysis on the resulting 12 corpora to get one sentiment score per month for that subreddit. The plot will use 4 colors, one per subreddit, so in total there will be 48 sentiment scores (12 months × 4 subreddits) per model.
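
To make the 12 × 4 layout concrete, here is a minimal sketch of how the 48 scores could be arranged for plotting. The `scores` and `wide` names are hypothetical, and every score is a `0.0` placeholder rather than a real result.

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
subreddits = ['cryptocurrency', 'wallstreetbets', 'finance', 'investing']

# One row per (subreddit, month) pair. The score is a placeholder, not a result.
scores = pd.DataFrame(
    [(s, m, 0.0) for s in subreddits for m in months],
    columns=['subreddit', 'month', 'score'],
)

# Keep months in calendar order instead of alphabetical order.
scores['month'] = pd.Categorical(scores['month'], categories=months, ordered=True)

# Months as rows, subreddits as columns: one line (color) per column when plotted.
wide = scores.pivot(index='month', columns='subreddit', values='score')

print(len(scores))  # 48
print(wide.shape)   # (12, 4)
```

With the data in this wide shape, each column maps to one colored line on a month-by-month plot.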

In [3]:
df = pd.read_csv('reddit-cleaned.csv')

In [None]:
df.sample(3)

Unnamed: 0,subreddit,month,post_id,text
4104,investing,Jun,1d5yl57,people start invest year ago sort return avera...
1680,wallstreetbets,May,1cuf3il,straight good time point retire plz pump meme ...
1362,wallstreetbets,Feb,1an95o1,finally free delete sunday ameritrade app spea...


In [7]:
df.drop(columns=['post_id'], inplace=True)

In [36]:
df.sample(3)

Unnamed: 0,subreddit,month,text
3683,investing,Jan,slowly start convert ira roth fence year past ...
1481,wallstreetbets,Mar,nvidia announce aipowered health care agents o...
3678,investing,Jan,reinvesting k hysa tbills best option trying d...


In [37]:
# Ensure that the 'text' column contains only strings
df['text'] = df['text'].apply(lambda x: x if isinstance(x, str) else '')
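
The string guard matters because `read_csv` loads empty cells as `NaN`, which is a float, and `str.join` refuses anything that is not a string. A small sketch on a hypothetical `toy` frame (not the Reddit data) shows the failure and the fix:

```python
import pandas as pd

# Toy frame: the second row mimics an empty 'text' cell read from a CSV.
toy = pd.DataFrame({'text': ['buy dip', float('nan'), 'hold long']})

try:
    ' '.join(toy['text'])
    join_failed = False
except TypeError:
    # str.join rejects the float NaN in the middle of the column.
    join_failed = True

# Same guard as in the cell above: non-string values become empty strings.
toy['text'] = toy['text'].apply(lambda x: x if isinstance(x, str) else '')
joined = ' '.join(toy['text'])

print(join_failed)   # True
print(repr(joined))  # 'buy dip  hold long'
```

The empty string leaves a double space in the joined text, which is harmless for whitespace-based tokenization.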

In [38]:
subreddits = df['subreddit'].unique()
print(subreddits)

months = df['month'].unique()
print(months)

['cryptocurrency' 'wallstreetbets' 'finance' 'investing']
['Jan' 'Feb' 'Mar' 'Apr' 'May' 'Jun' 'Jul' 'Aug' 'Sep' 'Oct' 'Nov' 'Dec']


In [39]:
# Initialize an empty list to store the aggregated dataframes
agg_dfs = []

# Loop through each subreddit and perform the aggregation
for subreddit in subreddits:
    print(subreddit)
    df_sub = df[df['subreddit'] == subreddit]
    df_agg = df_sub.groupby('month', sort=False).agg({
        'subreddit': 'first',
        'text': ' '.join
    }).reset_index()
    agg_dfs.append(df_agg)

# Combine all the aggregated dataframes into one
df_combined = pd.concat(agg_dfs, ignore_index=True)

cryptocurrency
wallstreetbets
finance
investing
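
As an alternative to the per-subreddit loop, the same aggregation can be sketched as a single `groupby` on both keys. This is shown on a hypothetical `toy` frame, not the real data. Note that it differs from the loop in two cosmetic ways: columns come out as `subreddit, month, text`, and rows follow first appearance in the data rather than being grouped by subreddit.

```python
import pandas as pd

# Toy stand-in for the cleaned posts (hypothetical rows).
toy = pd.DataFrame({
    'subreddit': ['finance', 'finance', 'investing', 'finance'],
    'month': ['Jan', 'Jan', 'Jan', 'Feb'],
    'text': ['rate cut', 'bond yield', 'index fund', 'earnings beat'],
})

# Group on both keys at once and join each group's posts into one corpus.
# sort=False keeps groups in order of first appearance.
combined = (
    toy.groupby(['subreddit', 'month'], sort=False)['text']
       .agg(' '.join)
       .reset_index()
)

print(combined)
```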


In [41]:
df_combined

Unnamed: 0,month,subreddit,text
0,Jan,cryptocurrency,bitcoin spot etf finally approve document go c...
1,Feb,cryptocurrency,coinbase block user sell show trust cex selfcu...
2,Mar,cryptocurrency,biden propose tax mining miners move interesti...
3,Apr,cryptocurrency,new theory satoshi nakamoto delete legit impre...
4,May,cryptocurrency,trump good crypto overwhelming number protrump...
5,Jun,cryptocurrency,logan paul sues youtuber coffeezilla cryptozoo...
6,Jul,cryptocurrency,sounds like trump crap bed bitcoin conference ...
7,Aug,cryptocurrency,kamala harris propose tax unrealized gain high...
8,Sep,cryptocurrency,years ago one biggest crypto scams happened sq...
9,Oct,cryptocurrency,bitcoin away break alltime high somebody tell ...


In [53]:
# Average character length of the text in the combined dataframe
avg_length = df_combined['text'].str.len().mean()
print(f"Average character length of text: {avg_length}")

Average character length of text: 106491.45833333333
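
This length is worth keeping in mind for FinBERT: standard BERT models accept at most 512 tokens per input, so a corpus of roughly 106,000 characters cannot be scored in a single pass. One possible workaround is to split each corpus into smaller pieces and score them separately. The sketch below assumes a simple word-count split; `chunk_words` and the chunk size of 200 are hypothetical choices, and words are only a rough proxy for model tokens.

```python
def chunk_words(text, size=200):
    """Split text into consecutive chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]


# Placeholder corpus of 300 words, not real Reddit text.
corpus = ' '.join(['bitcoin', 'etf', 'approve'] * 100)
chunks = chunk_words(corpus, size=200)

print(len(chunks))                                      # 2
print(len(chunks[0].split()), len(chunks[1].split()))   # 200 100
```

Per-chunk scores would then need to be combined (for example, averaged) to get one score per month.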


## NLTK Sentiment Analysis
Now that we have our concatenated data, the next step is to score each monthly corpus with NLTK's VADER model.