In [1]:
import pandas as pd
import nltk
import re
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Requires the NLTK 'stopwords' corpus; on a fresh install, run the nltk.download cell below first
english_stopwords = set(stopwords.words('english'))

In [2]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/upanshuparekh/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/upanshuparekh/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/upanshuparekh/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /Users/upanshuparekh/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/upanshuparekh/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagg

True

# Data Preprocessing -- Reddit
In this part (2), I will be using [NLTK](https://www.nltk.org)'s data preprocessing functions to clean up the data produced by the `data_collection.py` script.

Here's what I'll do. For each post:
- Get a sentiment score for the post title + self text, call it $s_0$
- Get a sentiment score for each top comment, call them $s_1, s_2, ..., s_{10}$
- Get a weighted aggregate sentiment score for the post, weighting a score $s_i$ less as $i$ increases. I will begin with the simple weight $w_i = 1/(i+1)$, which multiplies $s_i$ (a plain $1/i$ would be undefined for the post itself at $i = 0$).

I have to do it this way because sentiment analysis starts to break down when the text gets too long, whether with NLTK's VADER or with FinBERT. You can see this in my previous commits if you want to look.
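
The weighting scheme above can be sketched in a few lines of plain Python. This is a minimal sketch, not the final implementation: the function name `weighted_sentiment` is hypothetical, the weight is taken as $1/(i+1)$ so that the post at $i = 0$ has a defined weight, and dividing by the sum of the weights (to keep the result on the same scale as the inputs) is my own assumption.

```python
from typing import Sequence


def weighted_sentiment(scores: Sequence[float], normalize: bool = True) -> float:
    """Aggregate [s_0, s_1, ..., s_n] with weights 1 / (i + 1).

    scores[0] is the post (title + self text); scores[1:] are the top comments.
    Normalizing by the weight sum is an assumption, not part of the original plan.
    """
    weights = [1.0 / (i + 1) for i in range(len(scores))]
    total = sum(w * s for w, s in zip(weights, scores))
    if not normalize:
        return total
    weight_sum = sum(weights)
    # An empty score list has no weights; return a neutral 0.0 in that case
    return total / weight_sum if weight_sum else 0.0


# Post scored 0.6, first comment scored -0.3
print(weighted_sentiment([0.6, -0.3]))
```

With normalization, a post whose parts all score the same gets exactly that score back, which makes aggregates comparable between posts with different numbers of comments.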


In [3]:
df = pd.concat([
    pd.read_csv('reddit-cryptocurrency-data.csv'),
    pd.read_csv('reddit-wallstreetbets-data.csv'),
    pd.read_csv('reddit-finance-data.csv'),
    pd.read_csv('reddit-investing-data.csv'),
], ignore_index=True)

In [4]:
df.columns

Index(['subreddit', 'month', 'post_id', 'post_title', 'post_selftext', 'tc0',
       'tc1', 'tc2', 'tc3', 'tc4', 'tc5', 'tc6', 'tc7', 'tc8', 'tc9'],
      dtype='object')

In [5]:
df.sample(5)

Unnamed: 0,subreddit,month,post_id,post_title,post_selftext,tc0,tc1,tc2,tc3,tc4,tc5,tc6,tc7,tc8,tc9
3654,investing,Jan,19aoqyv,No one knows anything: Here’s How Barron’s Sto...,&gt; Investors who avoided the tech stocks kn...,&gt; Do your own homework. Dangerous advice t...,This is why you don't listen to 'experts'. Bec...,TSLA is the ONLY one Barron picked? It's the O...,No. If you more than double your money in 12 m...,I let Dunning–Kruger pick my stocks.,"""Do your own homework"" is how you get ""Vaccine...",Vaccines cause Armadillos?? Damn. i owe an ap...,"&gt; Like actual research, get more than one s...",I've made bank over the last 20 years doing ex...,It’s not one way or the other… if I had small ...
1590,wallstreetbets,Apr,1bve49e,Quick synopsis of how the market works for any...,,Most accurate chart I've seen,The 3 body problem,I think they track my moves.,I bought 2 nvidia stocks day befor yesterday. ...,3 regard problem\* ![img](emote|t5_2th52|4260),Sometimes friends complain to me that gas/oil ...,This is what they teach in the final semester ...,If I hold should be going down aswell,It's not even a secret.,3 problems regardless
2183,wallstreetbets,Oct,1geemmu,Up $36k in a day and yet i feel nothing. Is th...,,Just send me $36k bro. It'll be like it never ...,Throw it on black than you’ll feel something f...,Surprisingly it makes sense,"Go fuck yourself regard, then maybe you’ll fee...",Depends. Let us see YTD,doesn’t care at all. Brags to the internet….,"I think I met you once, nice tits, reasonable ...",My name is *Charity*. Send it to me and you'll...,Dude is trying so hard to come off as this lik...,you arent $36k up unless you closed all your p...
1666,wallstreetbets,May,1chwiuo,WTF? I think the market might be rigged.,,That's a middle finger to all of us regards,"Looks like a cock &amp; balls pattern, it's to...",its what the finger is telling us: https://pr...,https://preview.redd.it/q8mj95wnxvxc1.jpeg?wid...,This is why you sell when your call has gained...,Market manipulation to screw everyone out of t...,Sold my calls at $7.50. ![img](emote|t5_2th52|...,"""just a little bit more 150% gain is not enoug...","If anything, an example of how retail and even...",Classic dick in yo ass pattern. It’s a market ...
4058,investing,May,1cy4ews,Lulu actually losing long term share?,"Lulu's stock is down to $300, a 'technically' ...",Go to the lulu forum on reddit and read the la...,Guys clothing has also seen a quality downgrad...,"They have SO…MUCH…CASH. No debt. Sure, the d...",Yes I think you’re correct. Never bet against ...,Agree. I bought about 10 different piece over ...,The US consumer is getting squeezed by inflati...,"Why ask ladies, I am a man and I really love L...",I agree. I bought calls.,First thing I did actually; opinions are split...,Let me know if you find any


In [6]:
# Combine `post_title`, `post_selftext`, and `tc{0-9}` into a single column `text`
cols_to_combine = ['post_title', 'post_selftext'] + [f'tc{i}' for i in range(10)]
df['text'] = df[cols_to_combine].fillna('').agg(' '.join, axis=1)
df = df.drop(columns=cols_to_combine)

In [7]:
df.sample(5)

|  | subreddit | month | post_id | text |
|---|---|---|---|---|
| 3306 | finance | Oct | 1g1b7xe | A Distributional Analysis of Donald Trump’s Ta... |
| 2184 | wallstreetbets | Oct | 1fzw1tk | Only took 4 years but I’m back where I started... |
| 1760 | wallstreetbets | Jun | 1dfpsj2 | Wells Fargo Fires Over a Dozen for ‘Simulation... |
| 2030 | wallstreetbets | Sep | 1fs7isk | Uncle Sam’s gangster economy: Starter pack G... |
| 54 | cryptocurrency | Jan | 19bofot | 100% of my Crypto is on CoinBase. Why? Because... |


## Text Cleaning Function
I researched the tradeoff between stemming and lemmatizing. In general:
- Stemming = rules-based, heuristic removal of common word endings
    - faster on larger datasets, but loses accuracy and context and can produce nonexistent words
- Lemmatizing = more accurate, but more computationally expensive because it requires part-of-speech tagging

But I reason that I'm not training an ML model where accuracy is mission critical, so simple
stemming should suffice.
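
To make "rules-based removal of common word endings" concrete, here is a toy suffix stripper. It is for illustration only and is not the Porter algorithm that NLTK's `PorterStemmer` implements; the suffix list and the name `toy_stem` are made up for this sketch. It shows how blindly chopping endings yields fragments that are not real words.

```python
# Toy suffix rules, for illustration only; this is NOT the Porter algorithm.
TOY_SUFFIXES = ("ies", "ing", "ed", "es", "ly", "y", "e", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least `min_stem_len` letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["treasury", "people", "markets", "running"]:
    print(w, "->", toy_stem(w))
```

Only `markets` comes out as a real word here; the other three become fragments, which is the accuracy cost described above.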

In [8]:
def clean_text_stemmer(text: str) -> str:
    """
    Clean the input text by removing URLs, special characters, digits, and extra whitespace,
    then apply NLTK tokenization, stopword removal, and Porter stemming.
    """
    # Remove URLs (`http\S+` already covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove special characters and digits (this also strips apostrophes, so "I'm" becomes "Im")
    text = re.sub(r'[^A-Za-z\s]', '', text)
    # Collapse extra whitespace and lowercase
    text = re.sub(r'\s+', ' ', text).strip().lower()

    words = word_tokenize(text)
    # Remove stopwords
    words = [word for word in words if word not in english_stopwords]
    # Stemming
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)

In [9]:
example_text = df['text'].iloc[2226]
example_text

"Dow soars 1,300 points to a record, Russell 2000 jumps 4% as Trump defeats Harris  Stock market goes up. Which means more people should have more money to pay more for stuff. Thats how that works, right? [removed] [removed] Treasury yields are running. Mortgages just got that much more expensive. Over leveraged companies are sweating. Not the stocks I pick, they all go down Wait for the inevitable rug pull.  The average middle class citizen will shift to mitigate losses on 401ks.  While the top 10% will capitalize on clearance prices, further widening the gap.  But hey man, at least the fluoride is going to be removed from the tap water Smooth brains about to learn what broad tariffs would do to the US economy So what stocks took off because I'm deep red today? Not exactly. Like it’s usually rich people that in the stock market to begin with. So the rich get richer * **Top 1%**: Own approximately 50% of the stock market, valued at around $21 trillion. * **Top 10%**: Hold about 87% of 

In [10]:
example_text_stemmed = clean_text_stemmer(example_text)
example_text_stemmed

'dow soar point record russel jump trump defeat harri stock market goe mean peopl money pay stuff that work right remov remov treasuri yield run mortgag got much expens leverag compani sweat stock pick go wait inevit rug pull averag middl class citizen shift mitig loss ks top capit clearanc price widen gap hey man least fluorid go remov tap water smooth brain learn broad tariff would us economi stock took im deep red today exactli like usual rich peopl stock market begin rich get richer top approxim stock market valu around trillion top hold stock market total approxim trillion bottom possess stock market worth roughli billion sourc updat sept'

Okay, maybe lemmatizing is the better strategy: there are too many nonsense words here (`peopl`, `treasuri`, `compani`) that could throw off the sentiment analyzer.

In [11]:
def clean_text_lemmatizer(text: str) -> str:
    """
    Clean the input text by removing URLs, special characters, digits, and extra whitespace,
    then apply NLTK tokenization, POS-aware lemmatization, and stopword removal.
    """
    # Remove URLs (`http\S+` already covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)

    # POS tagging for lemmatization (done on the original-case text)
    pos_tags = pos_tag(word_tokenize(text))
    lemmatizer = WordNetLemmatizer()
    # Map the first letter of each Penn Treebank tag to a WordNet POS constant
    tag_dict = {
        "J": wordnet.ADJ,  # Adjective
        "N": wordnet.NOUN, # Noun
        "V": wordnet.VERB, # Verb
        "R": wordnet.ADV   # Adverb
    }
    # Tags outside the map (e.g. determiners, numbers) default to noun
    pos_tags = [(word, tag_dict.get(tag[0], wordnet.NOUN)) for word, tag in pos_tags]

    # Lemmatization. The token is lowercased only afterwards, so capitalized tokens
    # (e.g. a sentence-initial "Mortgages") can come through unlemmatized.
    words = [lemmatizer.lemmatize(word, pos).lower() for word, pos in pos_tags]
    # Remove stopwords
    words = [word for word in words if word not in english_stopwords]
    text = ' '.join(words)

    # Remove special characters and digits
    text = re.sub(r'[^A-Za-z\s]', '', text)
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip().lower()

    return text
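
The tag mapping inside `clean_text_lemmatizer` can be looked at in isolation. The sketch below is a hypothetical helper (`penn_to_wordnet` is not part of this notebook) that uses the plain one-letter strings behind NLTK's `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV` constants, so it runs without any NLTK data. Only the first letter of the Penn Treebank tag matters, and anything unmapped falls back to noun.

```python
# One-letter values behind wordnet.ADJ / NOUN / VERB / ADV
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return WORDNET_POS.get(tag[:1], "n")


# NNS = plural noun, VBZ = 3rd-person verb, JJR = comparative adjective,
# RB = adverb, CD = cardinal number (not in the map, so it falls back to noun)
for tag in ["NNS", "VBZ", "JJR", "RB", "CD"]:
    print(tag, "->", penn_to_wordnet(tag))
```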

In [12]:
example_text

"Dow soars 1,300 points to a record, Russell 2000 jumps 4% as Trump defeats Harris  Stock market goes up. Which means more people should have more money to pay more for stuff. Thats how that works, right? [removed] [removed] Treasury yields are running. Mortgages just got that much more expensive. Over leveraged companies are sweating. Not the stocks I pick, they all go down Wait for the inevitable rug pull.  The average middle class citizen will shift to mitigate losses on 401ks.  While the top 10% will capitalize on clearance prices, further widening the gap.  But hey man, at least the fluoride is going to be removed from the tap water Smooth brains about to learn what broad tariffs would do to the US economy So what stocks took off because I'm deep red today? Not exactly. Like it’s usually rich people that in the stock market to begin with. So the rich get richer * **Top 1%**: Own approximately 50% of the stock market, valued at around $21 trillion. * **Top 10%**: Hold about 87% of 

In [13]:
example_text_lemmatized = clean_text_lemmatizer(example_text)
print('LEMMATIZED: ' + example_text_lemmatized)
print('STEMMED: ' + example_text_stemmed)

LEMMATIZED: dow soar point record russell jump trump defeat harris stock market go mean people money pay stuff thats work right remove remove treasury yield run mortgages get much expensive leveraged company sweat stock pick go wait inevitable rug pull average middle class citizen shift mitigate loss ks top capitalize clearance price far widen gap hey man least fluoride go remove tap water smooth brain learn broad tariff would us economy stock take m deep red today exactly like usually rich people stock market begin rich get rich top approximately stock market value around trillion top hold stock market total approximately trillion bottom possess stock market worth roughly billion source update sept
STEMMED: dow soar point record russel jump trump defeat harri stock market goe mean peopl money pay stuff that work right remov remov treasuri yield run mortgag got much expens leverag compani sweat stock pick go wait inevit rug pull averag middl class citizen shift mitig loss ks top capit 

## Applying Lemmatizer to Whole Text Column
Okay, the lemmatizer **definitely works a lot better**: its output consists of real words, where the stemmer produced fragments like `peopl` and `treasuri`. I'm going to stick with it. Now to apply it to the whole text column.

In [14]:
df['text'] = df['text'].apply(clean_text_lemmatizer)

In [15]:
df.sample(4)

|  | subreddit | month | post_id | text |
|---|---|---|---|---|
| 66 | cryptocurrency | Jan | 19d7lib | trezor support site breach expose personal dat... |
| 4188 | investing | Jun | 1dd1pjd | people use theta gang strategy selling cover c... |
| 782 | cryptocurrency | Aug | 1ercv27 | world large bitcoin miner announce m raise buy... |
| 812 | cryptocurrency | Sep | 1fhwtqs | early ethereum investor sell eth year x gain r... |


In [16]:
# Save the cleaned DataFrame to a new CSV file
df.to_csv('reddit-cleaned.csv', index=False)