In [1]:
import pandas as pd

# Sentiment Models Comparisons -- Reddit
In this part (3), I'll compare NLTK's pretrained VADER model with the FinBERT model (`ProsusAI/finbert`), loaded through the Hugging Face `transformers` library.

The idea is that using a sentiment analysis model trained on financial data will allow it to pick up financial terms and keywords from the corpora better than the general purpose NLTK pretrained sentiment model.

I'm not sure exactly yet how I'll measure this effect, but for each model's results, I'll make a plot of sentiment results for each month of 2024. Then I'll decide from there.

## Data Treatment
The cleaned data consists of lemmatized top comments (`tc0` to `tc9`) and a "headline" column, which holds the post title plus self text.

Here's what I'll do. For each post:
- Get a sentiment score for the post title + self text (headline), call it $s_0$
- Get a sentiment score for each top comment, call them $s_1, s_2, ..., s_{10}$
- Get a weighted aggregate sentiment score for the post, weighting a score $s_i$ less as $i$ increases. I'll start with a simple function, $w(i) = \frac{1}{100} (i - 10)^2 + 0.01$, and multiply each score $s_i$ by $w(i)$. The $0.01$ just keeps the last element from getting a weight of 0.
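
The steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's own implementation: the names `weight` and `weighted_average` and the example scores are made up, and dividing by the total weight is an added assumption so that the result stays on the same scale as the individual scores.

```python
def weight(i: int) -> float:
    """Quadratic decay from the list above: (i - 10)^2 / 100 + 0.01."""
    return (i - 10) ** 2 / 100 + 0.01


def weighted_average(scores: list[float]) -> float:
    """Weighted mean of s_0..s_10, normalized by the total weight (an assumption)."""
    weights = [weight(i) for i in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Made-up scores: s_0 is the headline, s_1..s_10 are the top comments
scores = [0.6, 0.4, 0.0, -0.2, 0.0, 0.1, 0.0, 0.0, -0.5, 0.0, 0.3]
post_score = weighted_average(scores)

print(f"w(0) = {weight(0):.2f}, w(10) = {weight(10):.2f}")
print(f"post score = {post_score:.3f}")
```

With this weighting the headline counts roughly a hundred times as much as the last comment, so the post score is dominated by the headline and the first few comments.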

For each month, I will take the median score of the post scores of that month as the representation for the entire month, as this statistic is more resistant to outliers.
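
To illustrate why the median is the more outlier-resistant choice, here is a small sketch using only the standard library. The `(month, score)` pairs are hypothetical, and the `5.0` is a deliberately extreme value.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (month, post score) pairs; 5.0 is a deliberate outlier
post_scores = [
    ("Jan", 0.2), ("Jan", 0.3), ("Jan", 5.0),
    ("Feb", -0.1), ("Feb", 0.0), ("Feb", 0.1),
]

# Group the post scores by month
by_month: dict[str, list[float]] = defaultdict(list)
for month, score in post_scores:
    by_month[month].append(score)

monthly_median = {m: median(v) for m, v in by_month.items()}
monthly_mean = {m: mean(v) for m, v in by_month.items()}

for m in by_month:
    print(m, "median:", monthly_median[m], "mean:", round(monthly_mean[m], 3))
```

The single outlier drags January's mean well above every typical score, while the median stays at a value that actually occurs in the bulk of the data.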

I have to do it this way because sentiment analysis starts to break down when the text gets too long, whether with NLTK's VADER or with FinBERT. You can see this in my previous commits if you want to look: I tried combining everything associated with each post into one "supertext" and running the analysis on that, but both models failed spectacularly on such large bodies of text.



In [3]:
df = pd.read_csv('reddit-cleaned.csv')

In [21]:
df.sample(3)

Unnamed: 0,subreddit,month,post_id,tc0,tc1,tc2,tc3,tc4,tc5,tc6,tc7,tc8,tc9,headline
4391,investing,Aug,1ev6ov8,mcd dividend stock big yearly return compare b...,comment totally wrong mcd trade nearly bb per ...,nobody buy hold,great point thanks,one word hamburglar,actual answer bid ask spread think basis c spr...,sit mine enjoy dividend,remove,maybe share price around large typical stock c...,sounds like good thing,mcdonald s stock big bidask spread delete
4277,investing,Jul,1e7f6g5,s day week dca etf,individual stock reason happy consumer service...,time market beat time market pick needle hayst...,everybody talk pe ratio already moon analysts ...,keep buying fskax every paycheck keep go every...,every week get pay every week get pay every we...,costco individual stock,solid business management customer adoption s ...,need scratch gamble itch fomo,nice discover undervalue,top reason buy buy stock get general understan...
1938,wallstreetbets,Aug,1ema5ue,actual fuck read,remove,already x investment regard,gt go eventually wind position nt think far ahead,m unrealized gain op please sell m live dividend,either smart stupid thing ve ever read,spend lego even taste good,get m ct avg cost per ct yet total cost basis ...,remove,hank,update spend quarter million dollar rock previ...


In [4]:
df.drop(columns=['post_id'], inplace=True)

In [23]:
df.sample(3)

Unnamed: 0,subreddit,month,tc0,tc1,tc2,tc3,tc4,tc5,tc6,tc7,tc8,tc9,headline
4605,investing,Nov,nvidia main topic thanksgiving diner guess tim...,tesla hardware year ahead nvidia elon spend bi...,say s crap storm room everyone cry,hear ya s consolation stock main thing talk ye...,nt want talk,summary sure nvda super specialize chip first ...,yes make sense give iterative sort chip making...,say cant happen engineer leave nvidia tesla cr...,work ml space close decade nvidia position wel...,heard first folk ai kill turkey thanksgiving t...,talking dad nvidia thanksgiving dad active inv...
4479,investing,Sep,yr treasury pay almost interest rate predict f...,get year tbill even need worry stocksetfs goal,almost everything pretax pre inflation point m...,money market fund pay right,nobody list return pre inflation except seem e...,trinity study use stocksbonds portfolio could ...,low risk specially op timeframe three year,would think investment expensive future compare,poster say put treasury take profit invest inv...,put m vt collect dividend earning k year divid...,st many questions yr passive income m m recent...
3532,finance,Dec,thats one way point,,,,,,,,,,nyse close jan honor late former president jim...


In [5]:
subreddits = df['subreddit'].unique()
print(subreddits)

months = df['month'].unique()
print(months)

['cryptocurrency' 'wallstreetbets' 'finance' 'investing']
['Jan' 'Feb' 'Mar' 'Apr' 'May' 'Jun' 'Jul' 'Aug' 'Sep' 'Oct' 'Nov' 'Dec']


In [6]:
# Average character length of the text in the combined dataframe
cols = ['headline'] + [f'tc{i}' for i in range(10)]
for col in cols:
    avg_length = df[col].str.len().mean()
    print(f"Average character length of '{col}': {avg_length}")

Average character length of 'headline': 251.26746131325805
Average character length of 'tc0': 101.6719512195122
Average character length of 'tc1': 104.11005502751375
Average character length of 'tc2': 101.50442477876106
Average character length of 'tc3': 98.6069779374038
Average character length of 'tc4': 102.36372950819673
Average character length of 'tc5': 98.59127291505293
Average character length of 'tc6': 94.35177968303455
Average character length of 'tc7': 95.81495960385718
Average character length of 'tc8': 97.39321148825066
Average character length of 'tc9': 94.92190775681341


In [7]:
# Empty comment cells are read as NaN (float), so cast every text column to str.
# Note that astype(str) turns NaN into the literal string 'nan'.
df[cols] = df[cols].astype(str)

Okay, these are all reasonable character lengths, so the sentiment analyses should behave much better than in my previous attempt.

## NLTK Sentiment Analysis
In this section I'll use NLTK's sentiment analysis to convert all the columns into sentiment scores.

In [50]:
sample_headline = df['headline'].sample(1).values[0]
print(sample_headline)

sphere las vegas loses million three months million past year


In [51]:
from nltk.sentiment import SentimentIntensityAnalyzer

In [52]:
sia = SentimentIntensityAnalyzer()

In [53]:
sia.polarity_scores(sample_headline)

{'neg': 0.204, 'neu': 0.796, 'pos': 0.0, 'compound': -0.3182}

### Compound Scores
So this mechanism divides its output into negative, neutral, positive, and compound scores.
I'll just use the compound score for now and see what the values look like.
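
For intuition about where the compound score comes from: as far as I understand VADER's reference implementation, it sums the adjusted valence of the words in the text and squashes that sum into [-1, 1] with $x / \sqrt{x^2 + \alpha}$, where $\alpha = 15$ by default. The sketch below reproduces only that normalization step. The function name and the raw sums passed to it are illustrative and do not come from the notebook.

```python
import math


def normalize(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into [-1, 1], VADER-style."""
    norm = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    return max(-1.0, min(1.0, norm))


# Illustrative raw sums: a short mildly negative text vs. a very long one
for raw in (-1.3, 4.0, 50.0):
    print(raw, "->", round(normalize(raw), 4))
```

Because the raw sum grows with the number of sentiment-bearing words, very long texts saturate near -1 or +1, which fits the earlier observation that scoring one huge block of text per post works poorly.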

In [54]:
# Score every text column with VADER and keep only the compound score
for col in cols:
    df[f'nltk_{col}'] = df[col].apply(lambda x: sia.polarity_scores(x)['compound'])

In [55]:
nltk_cols = ['nltk_' + col for col in cols]
df[nltk_cols].describe()

Unnamed: 0,nltk_headline,nltk_tc0,nltk_tc1,nltk_tc2,nltk_tc3,nltk_tc4,nltk_tc5,nltk_tc6,nltk_tc7,nltk_tc8,nltk_tc9
count,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0
mean,0.194423,0.103991,0.117702,0.115965,0.118734,0.117597,0.112533,0.110985,0.110592,0.117524,0.106035
std,0.50361,0.423685,0.423804,0.418796,0.405617,0.406864,0.407285,0.403414,0.407999,0.405583,0.40249
min,-0.9863,-0.984,-0.9985,-0.975,-0.9876,-0.989,-0.9919,-0.9651,-0.9709,-0.9783,-0.9856
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.6249,0.4019,0.4215,0.4215,0.4215,0.4022,0.4215,0.4019,0.4019,0.4019,0.3818
max,0.9977,0.9961,0.9961,0.9908,0.9964,0.9972,0.9966,0.9994,0.9958,0.9944,0.9936


Okay, so the scores skew slightly positive on average (column means around 0.1 to 0.2), although the median for every comment column is 0. But this is per column. Perhaps after I run my weighted aggregate on each post and then aggregate by month, I'll see a different story.

In [20]:
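def weighted_mean(row: pd.Series, cols: list[str]) -> float:
    """Weighted sum of a row's scores; cols[0] is the headline, then tc0..tc9."""
    def w(i: int) -> float:
        # Quadratic decay from 1.0 at i=0 down to 0 at i=10.
        # The +0.01 offset from the write-up is not applied, so tc9 gets weight 0.
        return (1/100) * (i - 10) ** 2

    # Not divided by the total weight (3.85), so the result can fall outside [-1, 1]
    return sum(row[col] * w(i) for i, col in enumerate(cols))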

In [57]:
df['nltk_mu'] = df.apply(lambda row: weighted_mean(row, nltk_cols), axis=1)
# Monthly aggregate: this takes the mean of the post scores, not the median
nltk_month_avg = df.groupby('month', sort=False)['nltk_mu'].mean()

In [58]:
nltk_month_avg

month
Jan    0.606782
Feb    0.574916
Mar    0.469782
Apr    0.480691
May    0.596519
Jun    0.523441
Jul    0.514087
Aug    0.426764
Sep    0.465829
Oct    0.493374
Nov    0.566213
Dec    0.480120
Name: nltk_mu, dtype: float64

Hmm, still all positive! That isn't necessarily a bad thing, and I shouldn't shape the results to fit my expectations. Looking at the actual S&P 500 graph:
![S&P 500 Graph 2024](S&P_500_2024.png)

We can see that there was a dip in March and another in July/August of that year, which corresponds to dips in the sentiment scores around those months, even though the scores stay positive, so perhaps there's something there! Let's standardize to see this in detail.

In [60]:
nltk_month_avgs_standardized = (nltk_month_avg - nltk_month_avg.mean()) / nltk_month_avg.std()
nltk_month_avgs_standardized

month
Jan    1.567940
Feb    1.014258
Mar   -0.812498
Apr   -0.622943
May    1.389622
Jun    0.119856
Jul   -0.042674
Aug   -1.559961
Sep   -0.881187
Oct   -0.402573
Nov    0.863036
Dec   -0.632876
Name: nltk_mu, dtype: float64

Honestly, pretty good! If we think of a standardized monthly score of 0 as a neutral attitude, a negative score seems to correspond to a bearish short-term outlook and a positive score to a bullish one.
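
For reference, the standardization is an ordinary z-score, $z = (x - \bar{x}) / s$, where $s$ is the sample standard deviation (pandas' `Series.std()` uses `ddof=1` by default). Here is a standard-library sketch that recomputes it from the rounded monthly values printed above. The variable names are illustrative, and small differences in the last digits are possible because the inputs are rounded.

```python
from statistics import mean, stdev

# Monthly values copied from the nltk_month_avg output above (rounded)
nltk_month_avg = {
    "Jan": 0.606782, "Feb": 0.574916, "Mar": 0.469782, "Apr": 0.480691,
    "May": 0.596519, "Jun": 0.523441, "Jul": 0.514087, "Aug": 0.426764,
    "Sep": 0.465829, "Oct": 0.493374, "Nov": 0.566213, "Dec": 0.480120,
}

values = list(nltk_month_avg.values())
mu = mean(values)
sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)

z = {month: (v - mu) / sigma for month, v in nltk_month_avg.items()}

for month, score in z.items():
    print(f"{month}: {score:+.3f}")
```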

## FinBERT Sentiment Analysis
Let's see how FinBERT does, given that it is specially trained on financial texts and sources.

In [12]:
from transformers import pipeline

finance_sentiment = pipeline("text-classification", model="ProsusAI/finbert")

Device set to use mps:0


In [13]:
finance_sentiment([sample_headline, sample_headline])

[{'label': 'neutral', 'score': 0.9170143604278564},
 {'label': 'neutral', 'score': 0.9170143604278564}]

In [None]:
from nltk import word_tokenize
from tqdm import tqdm

# Trim every entry to at most 256 NLTK tokens.
# FinBERT accepts at most 512 BERT (WordPiece) tokens, and a single NLTK token can split
# into several of those, so I ran into the limit. A 256-token cap leaves headroom below it.
def trim(text: str) -> str:
    tokens = word_tokenize(text)
    text = ' '.join(tokens[:256])
    return text

for col in tqdm(cols):
    df[col] = df[col].apply(trim)

100%|██████████| 11/11 [00:03<00:00,  3.27it/s]


In [None]:
# Pass each column as one batch; this is faster than calling the model one text at a time.
# The pipeline returns one {'label': ..., 'score': ...} dict per input text.
for col in tqdm(cols):
    df[f'finbert_{col}'] = [result['score'] for result in finance_sentiment(df[col].tolist())]

100%|██████████| 11/11 [12:47<00:00, 69.79s/it]


In [38]:
finbert_cols = ['finbert_' + col for col in cols]
df[finbert_cols].describe()

Unnamed: 0,finbert_headline,finbert_tc0,finbert_tc1,finbert_tc2,finbert_tc3,finbert_tc4,finbert_tc5,finbert_tc6,finbert_tc7,finbert_tc8,finbert_tc9
count,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0
mean,0.833033,0.846037,0.842018,0.842846,0.846565,0.845241,0.84636,0.845257,0.848469,0.849284,0.849421
std,0.118525,0.102189,0.106561,0.105406,0.101344,0.103228,0.10046,0.100523,0.097162,0.099106,0.0978
min,0.339315,0.419712,0.360333,0.345769,0.377977,0.378054,0.381369,0.388704,0.354028,0.385005,0.37675
25%,0.794248,0.828548,0.821627,0.822934,0.834749,0.832425,0.831627,0.831258,0.835944,0.83944,0.842858
50%,0.881537,0.883306,0.883306,0.883306,0.883306,0.883306,0.883306,0.883306,0.883306,0.883306,0.883306
75%,0.914856,0.910056,0.907943,0.906877,0.9074,0.90697,0.905709,0.904408,0.904562,0.908027,0.904848
max,0.974264,0.963359,0.970647,0.962415,0.967508,0.967063,0.966618,0.969753,0.967875,0.960606,0.971689


In [41]:
df['finbert_mu'] = df.apply(lambda row: weighted_mean(row, finbert_cols), axis=1)
fb_month_avg = df.groupby('month', sort=False)['finbert_mu'].mean()

In [42]:
fb_month_avg

month
Jan    3.254800
Feb    3.253643
Mar    3.235013
Apr    3.238001
May    3.238996
Jun    3.215775
Jul    3.250483
Aug    3.230884
Sep    3.245571
Oct    3.246990
Nov    3.237038
Dec    3.235268
Name: finbert_mu, dtype: float64

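All the scores are still quite high, which is expected because each one is the confidence of FinBERT's predicted label rather than a signed sentiment. Still, there seem to be subtle patterns that line up with the March and July/August dips. Standardizing may make the dips easier to see.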
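
One possible alternative, not what this notebook does, is to fold the predicted label into the number so that FinBERT's output becomes a signed score closer in spirit to VADER's compound. This is a minimal sketch assuming the pipeline's `{'label', 'score'}` output format shown earlier. The `SIGN` mapping and `signed_score` helper are hypothetical, and apart from the first entry the example results are invented.

```python
# Hypothetical mapping from FinBERT's labels to a sign
SIGN = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}


def signed_score(result: dict) -> float:
    """Turn one pipeline result into a value in [-1, 1]."""
    return SIGN[result["label"]] * result["score"]


results = [
    {"label": "neutral", "score": 0.9170143604278564},  # sample headline output shown earlier
    {"label": "negative", "score": 0.80},  # invented for illustration
    {"label": "positive", "score": 0.65},  # invented for illustration
]

signed = [signed_score(r) for r in results]
print(signed)
```

Mapping neutral predictions to 0 discards their confidence. Whether that is the right trade-off depends on how the monthly aggregate is meant to be read.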

In [62]:
fb_month_avgs_standardized = (fb_month_avg - fb_month_avg.mean()) / fb_month_avg.std()

In [63]:
pd.concat([nltk_month_avgs_standardized, fb_month_avgs_standardized], axis=1, keys=['nltk', 'finbert'])

month,nltk,finbert
Jan,1.56794,1.329582
Feb,1.014258,1.224191
Mar,-0.812498,-0.47303
Apr,-0.622943,-0.200766
May,1.389622,-0.110183
Jun,0.119856,-2.225564
Jul,-0.042674,0.936322
Aug,-1.559961,-0.849149
Sep,-0.881187,0.488836
Oct,-0.402573,0.618049


![S&P500 2024 Graph](S&P_500_2024.png)
A little better. FinBERT seems to follow trends similar to NLTK's, but FinBERT's June is far too pessimistic and its July far too optimistic. Based on the actual S&P 500 graph, NLTK actually performed better, so I'll use NLTK for my plotting and analysis moving forward.
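
To put a rough number on how closely the two standardized series track each other, one option is a Pearson correlation. The sketch below uses only the ten months listed in the table above (Jan to Oct). The `pearson` helper is illustrative and not part of the notebook, and with so few points the result should be read as a loose indication at best.

```python
from math import sqrt

# Standardized monthly scores copied from the table above (Jan to Oct)
nltk_z = [1.56794, 1.014258, -0.812498, -0.622943, 1.389622,
          0.119856, -0.042674, -1.559961, -0.881187, -0.402573]
finbert_z = [1.329582, 1.224191, -0.47303, -0.200766, -0.110183,
             -2.225564, 0.936322, -0.849149, 0.488836, 0.618049]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


r = pearson(nltk_z, finbert_z)
print(f"Pearson r over Jan-Oct: {r:.2f}")
```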

In [65]:
df.to_csv('reddit-sentiment.csv', index=False)