In [1]:
import pandas as pd

# Sentiment Models Comparisons -- Reddit
In this part (3), I'll compare the pretrained VADER model that ships with NLTK against the FinBERT model (ProsusAI/finbert), loaded through the Hugging Face transformers library.

The idea is that a sentiment model trained on financial text should pick up financial terms and keywords in the corpora better than NLTK's general-purpose pretrained sentiment model.

I'm not yet sure exactly how I'll measure this effect, but for each model I'll plot the sentiment results for each month of 2024 and decide from there.

## Data Treatment
In the cleaned data, each row is a post, with a column holding the cleaned text for that post.
For each subreddit, I'll combine the texts of all the posts in a month into a single month corpus.

I'll then run sentiment analysis on the resulting 12 corpora to get one sentiment score per month for that subreddit. The plot will have 4 colors, one per subreddit, so in total there will be 48 sentiment scores.
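
As a rough sketch of the aggregation described above, here is the month-corpus step on a tiny made-up DataFrame. The rows and the names `toy` and `month_corpora` are purely illustrative and not part of the real data. It assumes the same `subreddit`, `month`, and `text` columns as the cleaned file.

```python
import pandas as pd

# Toy stand-in for the cleaned data; these rows are invented for illustration.
toy = pd.DataFrame({
    "subreddit": ["finance", "finance", "investing", "finance"],
    "month": ["Jan", "Jan", "Jan", "Feb"],
    "text": ["rates rise", "bank earnings", "index funds", "bond yields"],
})

# One row per (subreddit, month), with all post texts joined by spaces.
# sort=False keeps groups in order of first appearance instead of
# sorting the month names alphabetically.
month_corpora = (
    toy.groupby(["subreddit", "month"], sort=False)["text"]
    .agg(" ".join)
    .reset_index()
)
print(month_corpora)
```

With 4 subreddits and 12 months, the same grouping on the full data would give the 48 month corpora mentioned above.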

In [2]:
df = pd.read_csv('reddit-cleaned.csv')

In [3]:
df.sample(3)

Unnamed: 0,subreddit,month,post_id,text
2550,finance,Feb,1agpwh5,heading finance team uk remove
2043,wallstreetbets,Sep,1f7ikwy,rip intel guy intel looks options escape dire ...
3347,finance,Oct,1ge2dqs,canadian housing bubble brink crash cant crash...


In [4]:
df.drop(columns=['post_id'], inplace=True)

In [5]:
df.sample(3)

Unnamed: 0,subreddit,month,text
4058,investing,May,lulu actually lose long term share lulu s stoc...
422,cryptocurrency,May,year ago today pizza order change history back...
3600,investing,Jan,year actually care ten year ago much stupid to...


In [6]:
# Ensure that the 'text' column contains only strings
df['text'] = df['text'].apply(lambda x: x if isinstance(x, str) else '')

In [7]:
subreddits = df['subreddit'].unique()
print(subreddits)

months = df['month'].unique()
print(months)

['cryptocurrency' 'wallstreetbets' 'finance' 'investing']
['Jan' 'Feb' 'Mar' 'Apr' 'May' 'Jun' 'Jul' 'Aug' 'Sep' 'Oct' 'Nov' 'Dec']


In [8]:
# Initialize an empty list to store the aggregated dataframes
agg_dfs = []

# Loop through each subreddit and perform the aggregation
for subreddit in subreddits:
    print(subreddit)
    df_sub = df[df['subreddit'] == subreddit]
    df_agg = df_sub.groupby('month', sort=False).agg({
        'subreddit': 'first',
        'text': ' '.join
    }).reset_index()
    agg_dfs.append(df_agg)

# Combine all the aggregated dataframes into one
df_combined = pd.concat(agg_dfs, ignore_index=True)

cryptocurrency
wallstreetbets
finance
investing


In [9]:
df_combined

Unnamed: 0,month,subreddit,text
0,Jan,cryptocurrency,bitcoin spot etf finally approve document go c...
1,Feb,cryptocurrency,coinbase block user sell show trust cex selfcu...
2,Mar,cryptocurrency,biden propose tax mining miners move interesti...
3,Apr,cryptocurrency,new theory satoshi nakamoto delete legit impre...
4,May,cryptocurrency,trump good crypto overwhelming number protrump...
5,Jun,cryptocurrency,logan paul sues youtuber coffeezilla cryptozoo...
6,Jul,cryptocurrency,sounds like trump crap bed bitcoin conference ...
7,Aug,cryptocurrency,kamala harris propose tax unrealized gain high...
8,Sep,cryptocurrency,years ago one biggest crypto scams happened sq...
9,Oct,cryptocurrency,bitcoin away break alltime high somebody tell ...


In [10]:
# Average character length of the text in the combined dataframe
avg_length = df_combined['text'].str.len().mean()
print(f"Average character length of text: {avg_length}")

Average character length of text: 106491.45833333333


## NLTK Sentiment Analysis
Now that we have our concatenated data, I'll use the VADER sentiment analyzer built into NLTK.

In [11]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

In [12]:
sia = SentimentIntensityAnalyzer()

In [13]:
sia.polarity_scores(df_combined['text'].iloc[0])

{'neg': 0.106, 'neu': 0.692, 'pos': 0.202, 'compound': 1.0}

### Compound Scores
So this analyzer splits its output into negative, neutral, positive, and compound scores.
I'll just use the compound score for now, and see what the scores are.

In [14]:
# Apply the sentiment analysis and create new columns for each score
df_combined[['nltk_neg', 'nltk_neu', 'nltk_pos', 'nltk_compound']] = df_combined['text'].apply(lambda x: pd.Series(sia.polarity_scores(x)))

# Display the descriptive statistics for the new columns
df_combined[['nltk_neg', 'nltk_neu', 'nltk_pos', 'nltk_compound']].describe()


Unnamed: 0,nltk_neg,nltk_neu,nltk_pos,nltk_compound
count,48.0,48.0,48.0,48.0
mean,0.121563,0.679167,0.199313,0.999983
std,0.022068,0.029364,0.014958,4.3e-05
min,0.078,0.616,0.162,0.9998
25%,0.1055,0.65675,0.19,1.0
50%,0.1245,0.682,0.202,1.0
75%,0.137,0.694,0.208,1.0
max,0.169,0.76,0.244,1.0


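Okay, so there are definitely some issues here. Overall it seems to evaluate words as neutral or positive: on average, pos (0.199) outweighs neg (0.122).
The compound score is a normalized sum of word valences, so on month corpora averaging about 106,000 characters even a slight positive lean adds up and pushes it to the +1 ceiling. That is likely why the NLTK compound scores are always positive with barely any variance.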
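
To see why long inputs pin the compound score at the ceiling, here is a minimal sketch of the squashing step as I understand it from VADER's implementation: the summed valence `x` is mapped to `x / sqrt(x^2 + alpha)` with `alpha = 15`. The function name `vader_normalize` and the sample totals are my own, chosen only for illustration.

```python
import math

def vader_normalize(valence_sum, alpha=15):
    """Squash a summed valence into (-1, 1), mirroring VADER's compound normalization."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

# A short text might sum to a valence of 1 to 5; a month corpus with tens of
# thousands of words can easily sum to hundreds.
for total in (1, 5, 50, 500):
    print(total, round(vader_normalize(total), 4))
```

Once the summed valence reaches the hundreds, the output is indistinguishable from 1.0. So the compound score on a whole month corpus mostly tells us the sign of the sum, not its strength. Scoring smaller units (for example, individual posts) and averaging would likely preserve more variation.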

## FinBERT Sentiment Analysis
Hopefully FinBERT comes to the rescue here, given that it is trained on financial lingo.

In [15]:
from transformers import pipeline

finance_sentiment = pipeline("text-classification", model="ProsusAI/finbert")

Device set to use mps:0


In [16]:
finance_sentiment(df_combined['text'].iloc[0])

Token indices sequence length is longer than the specified maximum sequence length for this model (26201 > 512). Running this sequence through the model will result in indexing errors


RuntimeError: The size of tensor a (26201) must match the size of tensor b (512) at non-singleton dimension 1

In [17]:
finance_sentiment(df_combined['text'].iloc[0][:512])

[{'label': 'neutral', 'score': 0.9233734607696533}]

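Okay, this is currently an issue. The text is far too long: it tokenizes to 26,201 tokens, while this BERT model accepts at most 512. Slicing to the first 512 characters runs, but that scores only a tiny fragment of the month's text.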

So I think the next step is to figure out how I want to chunk the text and combine the chunk scores into an aggregate score for each post, and then combine these post aggregates into a score for the month.
I will have to be a bit careful about how I combine scores to achieve the effect I want.
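
One possible shape for that chunk-then-aggregate step is sketched below. Nothing here is settled; everything in it is an assumption. Chunks are cut by whitespace word count (200 words is a guess at staying under 512 tokens, not a guarantee). Each chunk's label probabilities collapse to `P(positive) - P(negative)`. Chunks are averaged per post, and posts are averaged per month. The helpers `chunk_words`, `signed_score`, `month_sentiment`, and the stand-in `toy_classify` are all hypothetical names. `toy_classify` replaces FinBERT so the sketch runs without downloading a model.

```python
def chunk_words(text, max_words=200):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def signed_score(label_scores):
    """Collapse per-label probabilities into one number: P(positive) - P(negative)."""
    return label_scores.get("positive", 0.0) - label_scores.get("negative", 0.0)

def month_sentiment(posts, classify, max_words=200):
    """Average signed chunk scores within each post, then average across posts."""
    post_scores = []
    for post in posts:
        chunks = chunk_words(post, max_words)
        if not chunks:
            continue  # skip empty posts so they do not drag the mean toward 0
        scores = [signed_score(classify(chunk)) for chunk in chunks]
        post_scores.append(sum(scores) / len(scores))
    return sum(post_scores) / len(post_scores) if post_scores else 0.0

# Hypothetical stand-in classifier (NOT FinBERT) returning label -> probability.
def toy_classify(chunk):
    if "crash" in chunk:
        return {"positive": 0.1, "neutral": 0.1, "negative": 0.8}
    return {"positive": 0.6, "neutral": 0.3, "negative": 0.1}

posts = ["housing bubble crash", "earnings beat estimates", ""]
print(month_sentiment(posts, toy_classify))
```

To use the real model, `classify` would need to wrap `finance_sentiment` and convert its list of `{'label', 'score'}` dicts into a label-to-probability mapping. Averaging per post first gives every post equal weight regardless of length. Weighting by chunk count instead would let long posts dominate, which may or may not be the effect I want.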