In [1]:
import pandas as pd

# Sentiment Models Comparisons -- Reddit
In this part (3), I'll compare the pretrained VADER model that ships with NLTK against the FinBERT model (`ProsusAI/finbert`), loaded through the Hugging Face `transformers` library.

The idea is that a sentiment model trained on financial text should pick up financial terms and keywords in the corpus better than the general-purpose pretrained model from NLTK.

I'm not yet sure exactly how I'll measure this effect, but for each model I'll plot the sentiment results for each month of 2024 and decide from there.

## Data Treatment
The cleaned data consists of lemmatized top comments plus a "headline" column, which holds the post title + self text.

Here's what I'll do. For each post:
- Get a sentiment score for the post title + self text (headline), call it $s_0$
- Get a sentiment score for each top comment, call them $s_1, s_2, ..., s_{10}$
- Get a weighted aggregate sentiment score for the post, weighting a score $s_i$ less as $i$ increases. I'll begin with a simple weight function, $w(i) = \frac{1}{100}(i - 10)^2 + 0.01$, and multiply each score $s_i$ by $w(i)$. The $0.01$ just keeps the last element from being weighted at 0.

For each month, I'll take the median of that month's post scores to represent the entire month, since the median is more resistant to outliers than the mean.
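
The scheme above can be sketched in plain Python. This is only a minimal illustration, not the notebook's implementation: the names `weight`, `weighted_post_score`, and `monthly_median` are hypothetical, the toy data is made up, and dividing by the total weight is an added assumption so that a post's score stays on the same scale as the individual scores.

```python
from statistics import median


def weight(i: int) -> float:
    """Weight for the i-th score (0 = headline, 1..10 = top comments)."""
    return (i - 10) ** 2 / 100 + 0.01


def weighted_post_score(scores: list[float]) -> float:
    """Weighted mean of one post's scores, ordered s_0, s_1, ..., s_10.

    Assumption: normalize by the total weight so the result stays
    within the range of the input scores.
    """
    weights = [weight(i) for i in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def monthly_median(posts: list[tuple[str, list[float]]]) -> dict[str, float]:
    """Median post score per month; `posts` holds (month, scores) pairs."""
    by_month: dict[str, list[float]] = {}
    for month, scores in posts:
        by_month.setdefault(month, []).append(weighted_post_score(scores))
    return {month: median(values) for month, values in by_month.items()}


# Toy data, made up purely for illustration
posts = [
    ("Jan", [0.5] * 11),
    ("Jan", [-0.5] * 11),
    ("Jan", [1.0] + [0.0] * 10),
]
print(monthly_median(posts))
```

With the normalization, a post whose eleven scores are all 0.5 gets a post score of 0.5, which makes the monthly values directly comparable to a single model score.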

I have to do it this way because sentiment analysis starts to break down when the text gets too long, with both NLTK's VADER and FinBERT. You can see this in my previous commits: I tried combining everything associated with each post into one "supertext" and running the analysis on that, but both models failed spectacularly on bodies of text that large.



In [20]:
df = pd.read_csv('reddit-cleaned.csv')

In [21]:
df.sample(3)

Unnamed: 0,subreddit,month,post_id,tc0,tc1,tc2,tc3,tc4,tc5,tc6,tc7,tc8,tc9,headline
4391,investing,Aug,1ev6ov8,mcd dividend stock big yearly return compare b...,comment totally wrong mcd trade nearly bb per ...,nobody buy hold,great point thanks,one word hamburglar,actual answer bid ask spread think basis c spr...,sit mine enjoy dividend,remove,maybe share price around large typical stock c...,sounds like good thing,mcdonald s stock big bidask spread delete
4277,investing,Jul,1e7f6g5,s day week dca etf,individual stock reason happy consumer service...,time market beat time market pick needle hayst...,everybody talk pe ratio already moon analysts ...,keep buying fskax every paycheck keep go every...,every week get pay every week get pay every we...,costco individual stock,solid business management customer adoption s ...,need scratch gamble itch fomo,nice discover undervalue,top reason buy buy stock get general understan...
1938,wallstreetbets,Aug,1ema5ue,actual fuck read,remove,already x investment regard,gt go eventually wind position nt think far ahead,m unrealized gain op please sell m live dividend,either smart stupid thing ve ever read,spend lego even taste good,get m ct avg cost per ct yet total cost basis ...,remove,hank,update spend quarter million dollar rock previ...


In [22]:
df.drop(columns=['post_id'], inplace=True)

In [23]:
df.sample(3)

Unnamed: 0,subreddit,month,tc0,tc1,tc2,tc3,tc4,tc5,tc6,tc7,tc8,tc9,headline
4605,investing,Nov,nvidia main topic thanksgiving diner guess tim...,tesla hardware year ahead nvidia elon spend bi...,say s crap storm room everyone cry,hear ya s consolation stock main thing talk ye...,nt want talk,summary sure nvda super specialize chip first ...,yes make sense give iterative sort chip making...,say cant happen engineer leave nvidia tesla cr...,work ml space close decade nvidia position wel...,heard first folk ai kill turkey thanksgiving t...,talking dad nvidia thanksgiving dad active inv...
4479,investing,Sep,yr treasury pay almost interest rate predict f...,get year tbill even need worry stocksetfs goal,almost everything pretax pre inflation point m...,money market fund pay right,nobody list return pre inflation except seem e...,trinity study use stocksbonds portfolio could ...,low risk specially op timeframe three year,would think investment expensive future compare,poster say put treasury take profit invest inv...,put m vt collect dividend earning k year divid...,st many questions yr passive income m m recent...
3532,finance,Dec,thats one way point,,,,,,,,,,nyse close jan honor late former president jim...


In [24]:
subreddits = df['subreddit'].unique()
print(subreddits)

months = df['month'].unique()
print(months)

['cryptocurrency' 'wallstreetbets' 'finance' 'investing']
['Jan' 'Feb' 'Mar' 'Apr' 'May' 'Jun' 'Jul' 'Aug' 'Sep' 'Oct' 'Nov' 'Dec']


In [26]:
# Average character length of the text in the combined dataframe
cols = ['headline'] + [f'tc{i}' for i in range(10)]
for col in cols:
    avg_length = df[col].str.len().mean()
    print(f"Average character length of '{col}': {avg_length}")

Average character length of 'headline': 251.26746131325805
Average character length of 'tc0': 101.6719512195122
Average character length of 'tc1': 104.11005502751375
Average character length of 'tc2': 101.50442477876106
Average character length of 'tc3': 98.6069779374038
Average character length of 'tc4': 102.36372950819673
Average character length of 'tc5': 98.59127291505293
Average character length of 'tc6': 94.35177968303455
Average character length of 'tc7': 95.81495960385718
Average character length of 'tc8': 97.39321148825066
Average character length of 'tc9': 94.92190775681341


In [58]:
# Missing comments are read in as NaN (float), so fill them with empty
# strings and cast every text column to str before scoring
df[cols] = df[cols].fillna("").astype(str)

Okay, these are all reasonable character lengths, so I think the sentiment analyses should behave much better than in my previous attempt.

## NLTK Sentiment Analysis
In this section I'll use NLTK's sentiment analysis to convert all the columns into sentiment scores.

In [53]:
sample_headline = df['headline'].sample(1).values[0]
print(sample_headline)

julian assange free reach plea deal us wikileaks founder julian assange expect free prison agree plea deal us authority see sentence time already serve british prison assange plead guilty us charge one count conspiracy obtain disclose national defence information sky news uk report june


In [54]:
# Requires the VADER lexicon: run nltk.download('vader_lexicon') once beforehand
from nltk.sentiment import SentimentIntensityAnalyzer

In [55]:
sia = SentimentIntensityAnalyzer()

In [56]:
sia.polarity_scores(sample_headline)

{'neg': 0.213, 'neu': 0.55, 'pos': 0.237, 'compound': -0.3818}

### Compound Scores
So the analyzer splits its output into a negative, neutral, positive, and compound score.
I'll just use the compound score for now, and see what the scores look like.
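
To read a compound score as a label, a common convention is the ±0.05 cutoff suggested by VADER's authors. The sketch below assumes that convention; `label_from_compound` is a hypothetical helper, not part of NLTK.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound score of the sample headline above
print(label_from_compound(-0.3818))  # negative
```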

In [59]:
# Score every text column and store its compound score in a new nltk_<col> column
for col in cols:
    df[f'nltk_{col}'] = df[col].apply(lambda x: sia.polarity_scores(x)['compound'])

In [63]:
nltk_cols = ['nltk_' + col for col in cols]
df[nltk_cols].describe()

Unnamed: 0,nltk_headline,nltk_tc0,nltk_tc1,nltk_tc2,nltk_tc3,nltk_tc4,nltk_tc5,nltk_tc6,nltk_tc7,nltk_tc8,nltk_tc9
count,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0,4800.0
mean,0.195921,0.104262,0.117701,0.115909,0.11875,0.117196,0.112485,0.110992,0.110591,0.117524,0.106034
std,0.504399,0.423808,0.423802,0.418869,0.405652,0.406945,0.407351,0.403429,0.407998,0.405584,0.402489
min,-0.9874,-0.984,-0.9992,-0.975,-0.9876,-0.989,-0.9919,-0.9651,-0.9709,-0.9783,-0.9856
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.6369,0.4019,0.4215,0.4215,0.4215,0.4019,0.4215,0.4019,0.4019,0.4019,0.3818
max,0.9999,0.9982,0.9961,0.9902,0.9978,0.9983,0.9966,0.9994,0.9958,0.9965,0.9936


Okay, so the scores skew slightly positive on average (column means of roughly 0.1 to 0.2), even though the median comment score is exactly 0. But this is across the columns. Perhaps after I run my special weighted mean on each post and then take the median grouped by month, I'll see a different story.

In [70]:
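def weighted_mean(row: pd.Series, cols: list[str]) -> float:
    """Weighted sum of a post's scores; earlier columns get larger weights."""
    def w(i: int) -> float:
        # NOTE: the + 0.01 offset from the formula above is omitted here,
        # so the last column (i = 10) gets weight 0.
        return (1/100) * (i - 10) ** 2

    # NOTE: the weights are not normalized, so this is a weighted sum
    # and can fall outside [-1, 1].
    return sum(row[col] * w(i) for i, col in enumerate(cols))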

In [73]:
df['nltk_mu'] = df.apply(lambda row: weighted_mean(row, nltk_cols), axis=1)
# NOTE: this takes the monthly mean, not the median described in the plan above
df.groupby('month', sort=False)['nltk_mu'].mean()

month
Jan    0.607752
Feb    0.576147
Mar    0.470315
Apr    0.482185
May    0.600320
Jun    0.528733
Jul    0.514776
Aug    0.432123
Sep    0.464614
Oct    0.493422
Nov    0.566831
Dec    0.480364
Name: nltk_mu, dtype: float64

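Hmm, still all positive! That isn't a bad thing in itself, and I shouldn't shape the results to fit my expectations. (Keep the scale in mind: `weighted_mean` returns an unnormalized weighted sum whose weights total 3.85, so these values are not on VADER's $[-1, 1]$ scale.) Looking at the actual S&P 500 graph: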
![S&P 500 Graph 2024](S&P_500_2024.png)

You can see dips in March and in July/August of that year, which line up with dips in the sentiment scores around the same time, even though the scores stay positive, so perhaps there's something there!

## FinBERT Sentiment Analysis
Let's see how FinBERT does, given that it is specially trained on financial texts and sources.

In [15]:
from transformers import pipeline

finance_sentiment = pipeline("text-classification", model="ProsusAI/finbert")

Device set to use mps:0


In [16]:
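# df_combined is not defined in this notebook; it appears to be the
# combined-text dataframe from the earlier attempt
finance_sentiment(df_combined['text'].iloc[0])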

Token indices sequence length is longer than the specified maximum sequence length for this model (26201 > 512). Running this sequence through the model will result in indexing errors


RuntimeError: The size of tensor a (26201) must match the size of tensor b (512) at non-singleton dimension 1

In [17]:
finance_sentiment(df_combined['text'].iloc[0][:512])

[{'label': 'neutral', 'score': 0.9233734607696533}]

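Okay, this is an issue. The text is far too long: it tokenizes to 26,201 tokens, while this BERT model accepts at most 512. Note that the limit is counted in tokens, not characters, so slicing with `[:512]` keeps only the first 512 characters. That runs, but it throws away almost all of the text.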
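
One way around the length limit is to split long text into pieces before scoring. The sketch below splits on whitespace as a rough stand-in for tokens. That is an assumption for illustration only: FinBERT's tokenizer breaks words into subword pieces, so a chunk of N words can produce more than N tokens, and `max_words` should be chosen well under 512. `chunk_words` is a hypothetical helper.

```python
def chunk_words(text: str, max_words: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = chunk_words("a b c d e", max_words=2)
print(chunks)  # ['a b', 'c d', 'e']
```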

So I think the next step is to figure out how I want to chunk the data and combine the component scores into an aggregate score for each post, and then combine these post aggregates into a prediction for the month.
I will have to be a bit careful about how I combine scores to achieve the effect I want.
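
One possible way to combine component scores, sketched under assumptions: map each FinBERT result to a signed number (positive gives `+score`, negative gives `-score`, neutral gives 0) so that it sits on the same $[-1, 1]$ axis as VADER's compound score, then average the pieces. The label names are the ones `ProsusAI/finbert` returns; the signed mapping, the plain average, and the helper names `signed_score` and `aggregate` are hypothetical choices, not something the model prescribes.

```python
from statistics import mean

# Assumed mapping from FinBERT labels to a sign
SIGN = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}


def signed_score(result: dict) -> float:
    """Convert one {'label': ..., 'score': ...} result to a value in [-1, 1]."""
    return SIGN[result["label"].lower()] * result["score"]


def aggregate(results: list[dict]) -> float:
    """Average the signed scores of a post's pieces; 0.0 if there are none."""
    if not results:
        return 0.0
    return mean(signed_score(r) for r in results)


results = [
    {"label": "neutral", "score": 0.9233734607696533},  # output shown above
    {"label": "positive", "score": 0.8},  # made up for illustration
    {"label": "negative", "score": 0.2},  # made up for illustration
]
print(aggregate(results))
```

A design caveat: treating neutral as 0 discards how confident the model was in that label, so a weighted combination may be worth trying instead of a plain average.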