# Reddit Finance Data Analysis

This notebook analyzes a dataset of 250,000 finance-related Reddit posts from 43 subreddits. The data is stored in JSONL format; each record holds the post (title, selftext), a comment on it (body), the subreddit name, and several scoring metrics.

The analysis includes:
- Loading and parsing JSONL data
- Exploring the dataset structure
- Analyzing unique subreddits
- Sampling posts from different subreddits
- Filtering high-quality posts based on multiple scoring metrics (z_score, combined_score, and comment_normalized_score)

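The data is parsed with Python's json module and manipulated with pandas. The score-filtered dataset is saved to JSONL and reloaded for further analysis. Later cells add profanity, conversational-tone, and text-quality filters and export the result in a chat-style messages format.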
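
The loading step can be sketched as a small reusable helper. This is a minimal sketch, not the notebook's own code: the name `read_jsonl` is hypothetical, and skipping blank or malformed lines instead of failing is an assumption made for illustration. The demo records are made up.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Return a list of records parsed from a JSONL file (hypothetical helper)."""
    records = []
    with open(path, "r", encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                # Assumption: malformed lines are reported and skipped, not fatal
                print(f"Skipping malformed line {line_no}")
    return records


# Demo on a throwaway file containing made-up records
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"id": "a1", "subreddit": "finance"}\n')
    tmp.write("\n")
    tmp.write("not json\n")
    tmp.write('{"id": "b2", "subreddit": "stocks"}\n')
    demo_path = tmp.name

demo_records = read_jsonl(demo_path)
os.remove(demo_path)
print(len(demo_records))
```

Reading line by line keeps one bad record from aborting a 250,000-line load; whether the real file contains such lines is not shown in the notebook.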

In [8]:
import json
import pandas as pd
# Initialize an empty list to store all the data
all_data = []

# Open and read the JSONL file
with open('/home/zahemen/datasets/reddit-finance-250k/Data.jsonl', 'r') as file:
    for line in file:
        # Parse each JSON line and append to the list
        all_data.append(json.loads(line))

In [3]:
# Total number of samples
print(f"Total samples: {len(all_data)}")

# Access the first sample
print(all_data[0])


Total samples: 250000
{'id': 'utf5u', 'title': 'Where has all the money in the world gone?', 'selftext': "Honest question.\n\nWhere is all the money?  I hear nothing but bad news about financial crisis all over the world, and it seems that there is a shortage of cash - like it is some sort of natural resource.\n\nPeople haven't stopped buying stuff.  They still need food, clothing, medicine, shelter. Taxes are still collected. Fines are still levied. \n\nSo where is all the money?  I mean, labor has been produced to make things and wages paid to the laborers. The things are purchased by other laborers, who were paid for producing goods or services, etc.  It's a closed loop, right? \n\nCan someone explain it like I'm five or something?", 'z_score': 34.13125953284077, 'normalized_score': 1.0, 'subreddit': 'finance', 'body': '(relix already hit on some of this)\n\nIt\'s hard to explain this to a five-year-old, because there are some fairly abstract concepts involved, but here goes... \n\n

In [4]:
from pprint import PrettyPrinter

# Create a PrettyPrinter instance
printer = PrettyPrinter(indent=2, width=100, depth=3)

# Pretty print the first few records
printer.pprint(all_data[:5])

[ { 'body': '(relix already hit on some of this)\n'
            '\n'
            "It's hard to explain this to a five-year-old, because there are some fairly abstract "
            'concepts involved, but here goes... \n'
            '\n'
            'All actual "money" is debt. All of it, including monetary gold, etc. (Don\'t argue '
            "with me yet, I'll get to that.)\n"
            '\n'
            'Imagine a pretend world with no money, some kind of primitive villiage or something. '
            "Now let's invent paper money. You can't just print a bunch of paper that says people "
            'have to give you stuff, because nobody would honor it. But you *could* print IOUs. '
            "Let's walk through this...\n"
            '\n'
            "- Let's say you're an apple-farmer and I'm a hunter. You want some meat but haven't "
            'harvested your crops yet. You say to me, "hey, go hunt me some meat and I\'ll give '
            'you 1/10th of my apple harvest

In [5]:
# Extract all unique subreddits
unique_subreddits = {item['subreddit'] for item in all_data}

# Display the unique subreddits
print(f"Number of unique subreddits: {len(unique_subreddits)}")
print(unique_subreddits)


Number of unique subreddits: 43
{'Economics', 'IndiaInvestments', 'eupersonalfinance', 'dividends', 'UKPersonalFinance', 'economy', 'povertyfinance', 'personalfinance', 'finance', 'fatFIRE', 'Superstonk', 'CryptoMarkets', 'pennystocks', 'Wallstreetbetsnew', 'Forex', 'FinancialPlanning', 'AskEconomics', 'ETFs', 'ASX_Bets', 'Bitcoin', 'CanadianInvestor', 'ValueInvesting', 'options', 'Daytrading', 'AusFinance', 'Money', 'StocksAndTrading', 'Trading', 'Canadapennystocks', 'investing', 'CryptoCurrency', 'ethtrader', 'financialindependence', 'wallstreetbets', 'CryptoMoonShots', 'thetagang', 'stocks', 'algotrading', 'StockMarket', 'UKInvesting', 'realestateinvesting', 'CryptoCurrencyTrading', 'crypto_currency'}
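
Beyond listing the unique names, it can help to see how many samples each subreddit contributes. The following is a minimal sketch using `collections.Counter`; the three records are made-up stand-ins for `all_data`, and the name `sample_data` is hypothetical.

```python
from collections import Counter

# Made-up stand-in for all_data; the real records carry many more fields
sample_data = [
    {"id": "a", "subreddit": "finance"},
    {"id": "b", "subreddit": "stocks"},
    {"id": "c", "subreddit": "finance"},
]

subreddit_counts = Counter(item["subreddit"] for item in sample_data)

# most_common() sorts by descending count
for name, count in subreddit_counts.most_common():
    print(f"{name}: {count}")
```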


In [None]:
from collections import defaultdict

# Dictionary to store two samples per subreddit
subreddit_samples = defaultdict(list)

# Iterate over the data and collect samples
for item in all_data:
    subreddit = item['subreddit']
    if subreddit in unique_subreddits and len(subreddit_samples[subreddit]) < 2:
        subreddit_samples[subreddit].append(item)

# Convert defaultdict to regular dictionary (optional)
subreddit_samples = dict(subreddit_samples)

# Check results
for subreddit, samples in subreddit_samples.items():
    print(f"Subreddit: {subreddit}, Samples: {len(samples)}")


Subreddit: finance, Samples: 2
Subreddit: AskEconomics, Samples: 2
Subreddit: Economics, Samples: 2
Subreddit: wallstreetbets, Samples: 2
Subreddit: Wallstreetbetsnew, Samples: 2
Subreddit: UKInvesting, Samples: 2
Subreddit: CanadianInvestor, Samples: 2
Subreddit: StocksAndTrading, Samples: 2
Subreddit: Superstonk, Samples: 2
Subreddit: eupersonalfinance, Samples: 2
Subreddit: povertyfinance, Samples: 2
Subreddit: UKPersonalFinance, Samples: 2
Subreddit: CryptoCurrencyTrading, Samples: 2
Subreddit: ethtrader, Samples: 2
Subreddit: IndiaInvestments, Samples: 2
Subreddit: Money, Samples: 2
Subreddit: ValueInvesting, Samples: 2
Subreddit: StockMarket, Samples: 2
Subreddit: ASX_Bets, Samples: 2
Subreddit: economy, Samples: 2
Subreddit: CryptoMoonShots, Samples: 2
Subreddit: ETFs, Samples: 2
Subreddit: dividends, Samples: 2
Subreddit: stocks, Samples: 2
Subreddit: fatFIRE, Samples: 2
Subreddit: Daytrading, Samples: 2
Subreddit: FinancialPlanning, Samples: 2
Subreddit: financialindependence,

In [7]:
# subreddit_samples was already filled in the previous cell. Repeating the
# collection loop here would add nothing, and could append duplicates for any
# subreddit that has fewer than two posts.

# Pretty print the results for each subreddit
for subreddit, samples in subreddit_samples.items():
    print(f"Subreddit: {subreddit}")
    printer.pprint(samples)  # Pretty print the two samples
    print("\n" + "=" * 50 + "\n")  # Separator for readability

Subreddit: finance
[ { 'body': '(relix already hit on some of this)\n'
            '\n'
            "It's hard to explain this to a five-year-old, because there are some fairly abstract "
            'concepts involved, but here goes... \n'
            '\n'
            'All actual "money" is debt. All of it, including monetary gold, etc. (Don\'t argue '
            "with me yet, I'll get to that.)\n"
            '\n'
            'Imagine a pretend world with no money, some kind of primitive villiage or something. '
            "Now let's invent paper money. You can't just print a bunch of paper that says people "
            'have to give you stuff, because nobody would honor it. But you *could* print IOUs. '
            "Let's walk through this...\n"
            '\n'
            "- Let's say you're an apple-farmer and I'm a hunter. You want some meat but haven't "
            'harvested your crops yet. You say to me, "hey, go hunt me some meat and I\'ll give '
            'you 1/10th 

In [9]:
df = pd.DataFrame(all_data)
df.head()

Unnamed: 0,id,title,selftext,z_score,normalized_score,subreddit,body,comment_normalized_score,combined_score
0,utf5u,Where has all the money in the world gone?,Honest question.\n\nWhere is all the money? I...,34.13126,1.0,finance,(relix already hit on some of this)\n\nIt's ha...,1.0,2.0
1,m3g13g,Is there a better sub where comments aren’t hi...,"So often someone will ask an amazing question,...",14.812156,1.0,AskEconomics,I someone were to make a good alternative then...,1.0,2.0
2,c84bp,How real-world corruption works.,This is a throwaway account (I'm a longtime re...,15.873294,1.0,Economics,So I said I would talk about the US Military i...,1.0,2.0
3,l6x130,CLASS ACTION AGAINST ROBINHOOD. Allowing peopl...,LEAVE ROBINHOOD. They dont deserve to make mon...,91.556224,1.0,wallstreetbets,Chapman Albin is an investors rights firm that...,1.0,2.0
4,l6i4t3,Wallstreet Bets Set to Private Megathread,The moderators there have made that sub privat...,78.806179,1.0,Wallstreetbetsnew,You there. Yeah you. The person reading this c...,1.0,2.0


In [15]:
# import json
# import pandas as pd

# # Load the data from JSONL
# all_data = []

# with open('/kaggle/input/reddit-finance-43-250k-dataset/Data.jsonl', 'r') as file:
#     for line in file:
#         # Parse each JSON line and append to the list
#         all_data.append(json.loads(line))

# # Convert to a DataFrame for easier processing
# df = pd.DataFrame(all_data)

# Define thresholds
z_score_threshold = df['z_score'].quantile(0.8)  # Top 20% of z-scores
combined_score_threshold = df['combined_score'].quantile(0.8)
comment_normalized_threshold = df['comment_normalized_score'].quantile(0.8)

# Apply filters
filtered_data = df[
    (df['z_score'] > z_score_threshold) & 
    (df['combined_score'] > combined_score_threshold) & 
    (df['comment_normalized_score'] > comment_normalized_threshold)
]

# Display number of filtered samples
print(f"Number of filtered samples: {len(filtered_data)}")

# Save the filtered data if needed
filtered_data.to_json('/home/zahemen/datasets/reddit-finance-250k/filtered_data.jsonl', orient='records', lines=True)


Number of filtered samples: 14731
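
A note on what the strict `>` comparison against a 0.8 quantile does when values are tied. This is an illustrative sketch on made-up scores, not a statement about the real columns: whether ties sit at the threshold in this dataset is not shown in the notebook.

```python
import pandas as pd

# Made-up scores; the last three values are tied at the maximum
scores = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.0, 1.0, 1.0])

# With the default linear interpolation, the 0.8 quantile falls between
# two of the tied values, so the threshold equals the maximum
threshold = scores.quantile(0.8)

kept_strict = scores[scores > threshold]      # what the cell above does
kept_inclusive = scores[scores >= threshold]  # alternative that keeps ties

print(threshold, len(kept_strict), len(kept_inclusive))
```

When a score column is capped (the sample rows show `normalized_score` and `comment_normalized_score` at 1.0), checking how many rows equal the threshold is a cheap sanity test before choosing between `>` and `>=`.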


In [16]:
filtered_data = list()

with open('/home/zahemen/datasets/reddit-finance-250k/filtered_data.jsonl', 'r') as file:
    for line in file:
        # Parse each JSON line and append to the list
        filtered_data.append(json.loads(line))

# Convert to a DataFrame for easier processing
df_filtered = pd.DataFrame(filtered_data)
print(df_filtered.shape)
df_filtered.head()

(14731, 9)


Unnamed: 0,id,title,selftext,z_score,normalized_score,subreddit,body,comment_normalized_score,combined_score
0,utf5u,Where has all the money in the world gone?,Honest question.\n\nWhere is all the money? I...,34.13126,1.0,finance,(relix already hit on some of this)\n\nIt's ha...,1.0,2.0
1,m3g13g,Is there a better sub where comments aren’t hi...,"So often someone will ask an amazing question,...",14.812156,1.0,AskEconomics,I someone were to make a good alternative then...,1.0,2.0
2,c84bp,How real-world corruption works.,This is a throwaway account (I'm a longtime re...,15.873294,1.0,Economics,So I said I would talk about the US Military i...,1.0,2.0
3,l6x130,CLASS ACTION AGAINST ROBINHOOD. Allowing peopl...,LEAVE ROBINHOOD. They dont deserve to make mon...,91.556224,1.0,wallstreetbets,Chapman Albin is an investors rights firm that...,1.0,2.0
4,l6i4t3,Wallstreet Bets Set to Private Megathread,The moderators there have made that sub privat...,78.806179,1.0,Wallstreetbetsnew,You there. Yeah you. The person reading this c...,1.0,2.0


In [17]:
df_filtered.columns.tolist()

['id',
 'title',
 'selftext',
 'z_score',
 'normalized_score',
 'subreddit',
 'body',
 'comment_normalized_score',
 'combined_score']

In [18]:
print(df_filtered.iloc[:20, :]['selftext'].values)

["Honest question.\n\nWhere is all the money?  I hear nothing but bad news about financial crisis all over the world, and it seems that there is a shortage of cash - like it is some sort of natural resource.\n\nPeople haven't stopped buying stuff.  They still need food, clothing, medicine, shelter. Taxes are still collected. Fines are still levied. \n\nSo where is all the money?  I mean, labor has been produced to make things and wages paid to the laborers. The things are purchased by other laborers, who were paid for producing goods or services, etc.  It's a closed loop, right? \n\nCan someone explain it like I'm five or something?"
 'So often someone will ask an amazing question, something I’m really interested in getting a good answer to, or even someone’s opinion, but I always just see that message explaining that comments need to be approved. \n\n99.9% is an exaggeration, but not many people are going to come back to look at a post to see if any comments have been approved.'
 'Thi

In [3]:
import json
import pandas as pd
import re
from better_profanity import profanity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize profanity filter
profanity.load_censor_words()

# Helper functions
def contains_profanity(text):
    """Check if the text contains profanity."""
    return profanity.contains_profanity(text)

def is_conversational(text):
    """Check if the text has a conversational tone."""
    if re.search(r'\b(I|we|you|me|my|our|your|us|yours|they|them)\b', text, re.IGNORECASE) and "?" in text:
        return True
    return False

def is_relevant(text, keywords, vectorizer, keyword_vector):
    """Check if the text is relevant to the given domain using TF-IDF similarity."""
    text_vector = vectorizer.transform([text])
    similarity = cosine_similarity(text_vector, keyword_vector)
    return similarity[0][0] > 0.3  # Adjust threshold as needed

# Load data
all_data = []
with open('/home/zahemen/datasets/reddit-finance-250k/Data.jsonl', 'r') as file:
    for line in file:
        all_data.append(json.loads(line))

# Convert to DataFrame
df = pd.DataFrame(all_data)

# TF-IDF setup for relevance filtering
keywords = ["finance", "investment", "stocks", "bonds", "retirement", "savings", "debt", "credit"]
vectorizer = TfidfVectorizer(stop_words="english")
keyword_vector = vectorizer.fit_transform([" ".join(keywords)])

# Define thresholds
z_score_threshold = df['z_score'].quantile(0.8)  # Top 20% of z-scores
combined_score_threshold = df['combined_score'].quantile(0.8)
comment_normalized_threshold = df['comment_normalized_score'].quantile(0.8)

# Filter dataset
filtered_data = []
for _, row in df.iterrows():
    title = row['title']
    body = row['body']            # the comment replying to the post
    post_text = row['selftext']   # the post's own text

    # Combine title and comment body for the relevance check
    context = f"{title} {body}".strip()

    # Apply filters
    if (
        row['z_score'] > z_score_threshold
        and row['combined_score'] > combined_score_threshold
        and row['comment_normalized_score'] > comment_normalized_threshold
        and not contains_profanity(post_text)
        and is_conversational(post_text)
        and is_relevant(context, keywords, vectorizer, keyword_vector)
        and len(post_text.split()) > 5  # Minimum word count
        and len(post_text.split()) < 200  # Maximum word count
    ):
        filtered_data.append(row)

# Convert filtered data back to DataFrame
filtered_df = pd.DataFrame(filtered_data)

# Save filtered data to a new JSONL file
output_path = '/home/zahemen/datasets/reddit-finance-250k/filtered_data_advanced.jsonl'
filtered_df.to_json(output_path, orient='records', lines=True)

print(f"Filtered dataset saved to {output_path} with {len(filtered_df)} samples.")


KeyboardInterrupt: 
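
The cell above was interrupted before it finished; the notebook does not show why. One possible way to reduce the work is to apply the score filters as a vectorized mask first and run the per-row text checks only on the survivors, instead of visiting every row with `iterrows`. The sketch below uses made-up data, arbitrary thresholds, and a hypothetical `expensive_text_check` standing in for the profanity and TF-IDF checks.

```python
import pandas as pd

# Made-up data standing in for df
toy = pd.DataFrame({
    "z_score": [0.5, 3.0, 4.0, 0.1, 5.0],
    "combined_score": [0.2, 1.8, 1.9, 0.3, 2.0],
    "selftext": [
        "short",
        "How do I start investing my savings?",
        "ok",
        "hm",
        "Should I pay off my debt first?",
    ],
})

call_log = []  # records every text that reaches the slow check


def expensive_text_check(text):
    """Hypothetical stand-in for the profanity / TF-IDF checks."""
    call_log.append(text)
    return "?" in text


# Step 1: cheap, vectorized score mask over the whole frame
# (thresholds of 1.0 are arbitrary values for this toy example)
score_mask = (toy["z_score"] > 1.0) & (toy["combined_score"] > 1.0)
candidates = toy[score_mask]

# Step 2: per-row text checks only on the rows that passed step 1
text_mask = candidates["selftext"].apply(expensive_text_check)
result = candidates[text_mask]

print(len(toy), len(candidates), len(call_log), len(result))
```

The original `if` chain already short-circuits, so the slow checks only ran for rows that passed the score tests; the gain here comes from not iterating over the full frame in Python.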

In [7]:
import json
import pandas as pd
import re
import numpy as np
from typing import List, Dict
import nltk
from nltk.tokenize import sent_tokenize
import spacy

# Download required NLTK data
nltk.download('punkt')
# Load spaCy model for text quality analysis
nlp = spacy.load('en_core_web_sm')

class DatasetCleaner:
    def __init__(self, input_file: str, output_file: str):
        self.input_file = input_file
        self.output_file = output_file
        self.quality_threshold = 0.7
        
        # Common profanity words - this is a basic list, you might want to expand it
        self.profanity_patterns = [
            r'\b(fuck|shit|damn|bitch|crap|ass|dick|porn|nsfw)\b',
            r'\b(wtf|stfu|omfg|btch)\b'
        ]
        
    def contains_profanity(self, text: str) -> bool:
        """Check if text contains profanity using regex patterns."""
        if not isinstance(text, str):
            return True
        
        text = text.lower()
        return any(bool(re.search(pattern, text, re.IGNORECASE)) 
                  for pattern in self.profanity_patterns)

    def load_data(self) -> pd.DataFrame:
        """Load and prepare the dataset."""
        all_data = []
        with open(self.input_file, 'r') as file:
            for line in file:
                all_data.append(json.loads(line))
        return pd.DataFrame(all_data)

    @staticmethod
    def clean_text(text: str) -> str:
        """Clean text by removing Reddit-specific formatting and noise."""
        if not isinstance(text, str):
            return ""
            
        # Remove URLs
        text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
        
        # Remove Markdown links (the link text is dropped too) and decode HTML entities
        text = re.sub(r'\[.*?\]\(.*?\)', '', text)
        text = re.sub(r'&amp;', '&', text)
        text = re.sub(r'&lt;', '<', text)
        text = re.sub(r'&gt;', '>', text)
        
        # Remove ASCII art and box drawings
        text = re.sub(r'[━┃┏┓┗┛│└┘╭╮╯╰▀▄█▌▐░▒▓]', '', text)
        
        # Clean formatting
        text = re.sub(r'\*{1,3}', '', text)  # Remove asterisks (bold/italic markers)
        text = re.sub(r'~{2}.*?~{2}', '', text)  # Remove struck-through text, markers included
        text = re.sub(r'_{1,2}.*?_{1,2}', '', text)  # Remove underscore-delimited spans, enclosed text included
        
        # Remove edit notices
        text = re.sub(r'edit\s*\d*\s*:', '', text, flags=re.IGNORECASE)
        text = re.sub(r'update\s*\d*\s*:', '', text, flags=re.IGNORECASE)
        
        # Remove award speech
        text = re.sub(r'thanks? for (?:the)? (?:gold|silver|platinum|award).*', '', text, flags=re.IGNORECASE)
        
        # Clean up whitespace
        text = ' '.join(text.split())
        return text.strip()

    def assess_text_quality(self, text: str) -> float:
        """
        Assess the quality of text based on multiple factors.
        Returns a score between 0 and 1.
        """
        if not isinstance(text, str) or len(text.strip()) == 0:
            return 0.0

        # Process text with spaCy
        doc = nlp(text)
        
        # Calculate various quality metrics
        avg_word_length = np.mean([len(token.text) for token in doc])
        sentence_count = len(list(doc.sents))
        has_punctuation = any(token.is_punct for token in doc)
        proper_capitalization = text[0].isupper() if text else False
        
        # Check for coherent sentences
        sentences = sent_tokenize(text)
        min_sent_length = 3  # minimum words per sentence
        coherent_sentences = sum(1 for sent in sentences if len(sent.split()) >= min_sent_length)
        
        # Calculate final quality score
        scores = [
            min(avg_word_length / 10, 1.0),  # Scales with average word length, capped at 1.0 (10+ characters)
            min(sentence_count / 3, 1.0),     # Reward multiple sentences up to a point
            float(has_punctuation),
            float(proper_capitalization),
            coherent_sentences / max(len(sentences), 1)
        ]
        
        return np.mean(scores)

    def is_conversational(self, text: str) -> bool:
        """Check if text appears to be conversational."""
        if not isinstance(text, str):
            return False
            
        # Look for conversation indicators
        indicators = [
            r'\b(hi|hello|hey|thanks|thank you)\b',
            r'\?',
            r'\b(you|your|yours)\b',
            r'\b(I|me|my|mine)\b'
        ]
        
        score = sum(bool(re.search(pattern, text, re.IGNORECASE)) for pattern in indicators)
        return score >= 2  # At least two indicators should be present

    def format_for_training(self, row: pd.Series) -> Dict:
        """Format data into the required training format."""
        return {
            "messages": [
                {
                    "role": "system",
                    "content": "You are ETA, an AI financial advisor. Provide helpful, accurate financial guidance while being clear that you're not a licensed professional."
                },
                {
                    "role": "user",
                    "content": self.clean_text(row['title'] + " " + (row['selftext'] or ""))
                },
                {
                    "role": "assistant",
                    "content": self.clean_text(row['body'])
                }
            ]
        }

    def process_dataset(self):
        """Main processing pipeline."""
        # Load data
        df = self.load_data()
        print(f"Initial dataset size: {len(df)}")
        
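        # Apply score-based filtering. Each threshold is recomputed on the
        # already-filtered frame, so the three cuts compound (about 0.2**3 of rows survive).
        for col in ['z_score', 'combined_score', 'comment_normalized_score']:
            threshold = df[col].quantile(0.8)
            df = df[df[col] > threshold]
        df = df.copy()  # Avoid SettingWithCopyWarning on the column assignments below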
        
        print(f"After score filtering: {len(df)}")
        
        # Clean and assess text quality
        df['cleaned_body'] = df['body'].apply(self.clean_text)
        df['body_quality'] = df['cleaned_body'].apply(self.assess_text_quality)
        
        # Filter based on quality and conversation metrics
        df = df[
            (df['body_quality'] > self.quality_threshold) &
            (df['cleaned_body'].apply(self.is_conversational)) &
            (~df['cleaned_body'].apply(self.contains_profanity)) &
            (df['cleaned_body'].str.len() > 50) &  # Minimum length
            (df['cleaned_body'].str.len() < 2000)  # Maximum length
        ]
        
        print(f"After quality filtering: {len(df)}")
        
        # Format for training
        training_data = [self.format_for_training(row) for _, row in df.iterrows()]
        
        # Save processed data
        with open(self.output_file, 'w') as f:
            for item in training_data:
                f.write(json.dumps(item) + '\n')
        
        print(f"Saved {len(training_data)} examples to {self.output_file}")

if __name__ == "__main__":
    # Initialize and run the cleaner
    cleaner = DatasetCleaner(
        input_file='/home/zahemen/datasets/reddit-finance-250k/Data.jsonl',
        output_file='/home/zahemen/datasets/reddit-finance-250k/cleaned_data.jsonl'
    )
    cleaner.process_dataset()

[nltk_data] Downloading package punkt to /home/zahemen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Initial dataset size: 250000
After score filtering: 2000
After quality filtering: 822
Saved 822 examples to /home/zahemen/datasets/reddit-finance-250k/cleaned_data.jsonl
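
The count of 2000 after score filtering is consistent with the thresholds being applied one after another: 250000 × 0.2 × 0.2 × 0.2 = 2000. The earlier cell computed all three thresholds on the full frame and kept 14731 rows. The sketch below contrasts the two approaches on synthetic data; the column names `a`, `b`, `c` and the uniform random scores are assumptions for illustration only, and the real columns are probably correlated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, independent scores with no ties
toy = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])

# Joint: every threshold is taken from the full frame
thresholds = toy.quantile(0.8)
joint = toy[(toy > thresholds).all(axis=1)]

# Sequential: each threshold is recomputed after the previous cut
sequential = toy
for col in ["a", "b", "c"]:
    sequential = sequential[sequential[col] > sequential[col].quantile(0.8)]

print(len(joint), len(sequential))
```

With tie-free scores the sequential result is fixed by the row count alone (1000 → 200 → 40 → 8), whereas the joint result depends on how strongly the three scores overlap.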
