In [1]:
import ast
import json
import os
import re
import sys
import textwrap
from datetime import datetime
from pathlib import Path

import pandas as pd
from dotenv import load_dotenv
from langchain import OpenAI, LLMChain, PromptTemplate
from textblob import TextBlob  # pip install textblob

load_dotenv()  # load environment variables from .env

RAW  = Path('../data/raw/discord_msgs.csv')
OUT  = Path('../data/processed/discord_msgs_clean.parquet')


In [2]:
# ------------------------------------------------------------------ #
# 1. LOAD & BASIC NORMALISATION
df = pd.read_csv(RAW)
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df = df.sort_values('created_at')              # chronological
df = df.drop_duplicates('message_id')          # safety

# ------------------------------------------------------------------ #
# 2. LIGHT FEATURE ENGINEERING
df['char_len']   = df['content'].str.len()
df['word_len']   = df['content'].str.split().str.len()

# Tickers → list ( [] if NaN )
df['tickers'] = (df['tickers_detected']
                   .fillna('')
                   .apply(lambda s: re.findall(r'\$[A-Z]{2,6}', s)))

# Tweet URLs → list ( [] if NaN )
df['tweet_urls'] = (df['tweet_urls']
                      .fillna('')
                      .apply(lambda s: [u for u in re.split(r',\s*', s) if u]))

# Basic sentiment   (polarity ∈ [-1,1])
df['sentiment'] = df['content'].apply(
    lambda t: TextBlob(str(t)).sentiment.polarity
)

# Command flag (e.g. “!history”); missing content counts as False
df['is_command'] = df['content'].str.startswith('!', na=False)

# Keep only useful columns
keep = ['message_id','created_at','channel','author_name',
        'content','tickers','tweet_urls',
        'char_len','word_len','sentiment','is_command']
df = df[keep]

# ------------------------------------------------------------------ #
# 3. SAVE TIDY VERSION
OUT.parent.mkdir(parents=True, exist_ok=True)  # make sure the folder exists
df.to_parquet(OUT, index=False)
print(f'✅ cleaned file saved → {OUT}\n'
      f'   rows: {len(df):,}   cols: {len(df.columns)}')

✅ cleaned file saved → ..\data\processed\discord_msgs_clean.parquet
   rows: 119   cols: 11


C:\Users\qaism\OneDrive - University of Virginia\Documents\Class Documents\DS 6051- Decoding LLMS\Trading Discord Bot\llm_portfolio_project
.
├── .env                                # Environment variables and API keys
├── April_Robinhood.csv                 # Robinhood data for April (ignore) 
├── ProjectContext                      # Project abstract and requirements
├── llm_portfolio_project/              # Main project directory
│   ├── data/                           # Data directory
│   │   ├── processed/                  # Processed data files
│   │   │   └── discord_msgs_clean.parquet  # Cleaned Discord messages
│   │   └── raw/                        # Raw data files
│   │       └── discord_msgs.csv        # Raw Discord messages data
│   ├── notebooks/                      # Jupyter notebooks
│   │   ├── 01_generate_journal.ipynb   # Portfolio journal generator
│   │   └── 02_clean_discord.ipynb      # Discord data cleaning notebook
│   ├── slides/                         # Presentation slides
│   └── src/                            # Source code
│       ├── data_collector.py           # Data collection utilities
│       ├── discord_logger_bot.py       # Discord bot for logging messages
│       └── __pycache__/                # Python cache files
└── js/                                 # JavaScript files
    ├── package.json                    # JS dependencies
    └── snaptrade.js                    # JS implementation for SnapTrade

File Descriptions
1. Environment Configuration
.env
  Contains API keys and credentials for various services:
    Discord bot token
    Robinhood username and password
    Twitter API keys and tokens
    SnapTrade API credentials and user information
    Robinhood account IDs
2. Notebooks
01_generate_journal.ipynb
  Purpose: Generates a portfolio journal using trading data and Discord messages
  Key Components:
    Imports from SnapTrade client for accessing trading data
    Data loading from positions.csv and discord_msgs.csv
    Functions to add Discord messages (for stock analysis)
    Prompt creation for LLM generation
    OpenAI API integration to generate the journal entry
    Saving journal entries to dated text files
02_clean_discord.ipynb
  Purpose: Processes raw Discord message data and exports clean dataset
  Key Components:
    Data loading from discord_msgs.csv
    Basic normalization (timestamp conversion, chronological sorting)
    Feature engineering:
      Character and word length calculations
      Stock ticker extraction using regex
      Tweet URL extraction
      Sentiment analysis using TextBlob
      Command flag detection (!commands)
    Column selection for final dataset
    Export to Parquet format for efficient storage
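
The prompt-creation step of 01_generate_journal.ipynb is not reproduced in these notes. As a rough sketch only, a prompt might be assembled from position rows and recent Discord messages as follows. The function name build_journal_prompt and the field names (symbol, quantity, author_name, content) are assumptions for illustration, not taken from the notebook.

```python
from datetime import date


def build_journal_prompt(positions, messages, as_of=None, max_messages=5):
    """Assemble a plain-text prompt from positions and recent messages.

    positions: list of dicts with hypothetical keys 'symbol' and 'quantity'
    messages:  list of dicts with hypothetical keys 'author_name' and 'content',
               assumed to be in chronological order
    """
    as_of = as_of or date.today()
    position_lines = [f"- {p['symbol']}: {p['quantity']} shares" for p in positions]
    # Keep only the most recent messages so the prompt stays short.
    recent = messages[-max_messages:]
    message_lines = [f"- {m['author_name']}: {m['content']}" for m in recent]
    return "\n".join([
        f"Write a portfolio journal entry for {as_of.isoformat()}.",
        "",
        "Current positions:",
        *(position_lines or ["- (none)"]),
        "",
        "Recent Discord messages:",
        *(message_lines or ["- (none)"]),
    ])
```

Returning a plain string keeps the sketch independent of any particular LLM client; the result could be passed to whichever API call the notebook uses.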
3. Source Code
data_collector.py
  Purpose: Collects trading data from SnapTrade/Robinhood
  Key Components:
    SnapTrade client initialization
    Functions to retrieve account positions, balances, and orders
    CSV export functionality for positions and orders
    Environment variable management for API credentials
    Summary reporting for portfolio data
discord_logger_bot.py
  Purpose: Discord bot for monitoring and logging trading-related messages
  Key Components:
    Discord bot setup and configuration
    Twitter/X API integration for extracting tweet data from shared links
    Message logging to CSV files
    Stock ticker detection using regex
    Tweet URL detection and data extraction
    Command handling (!history to fetch past messages)
    CSV file management for storing Discord messages and tweet data
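
The regex-based detection in discord_logger_bot.py is only described here, not shown. The following is a minimal sketch of how such helpers might look. The function names match those listed for the bot, but the bodies and both patterns are assumptions; the ticker pattern simply mirrors the one used in 02_clean_discord.ipynb.

```python
import re

# Assumed patterns: "$" plus 2-6 uppercase letters for tickers (as in the
# cleaning notebook), and twitter.com / x.com status links for tweets.
TICKER_RE = re.compile(r"\$[A-Z]{2,6}")
TWEET_URL_RE = re.compile(
    r"https?://(?:www\.)?(?:twitter|x)\.com/\w+/status/(\d+)"
)


def detect_tickers(text):
    """Return unique tickers in order of first appearance."""
    seen = []
    for match in TICKER_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def detect_twitter_links(text):
    """Return every Twitter/X status URL found in the text."""
    return [m.group(0) for m in TWEET_URL_RE.finditer(text)]


def extract_tweet_id(url):
    """Return the numeric tweet ID from a status URL, or None."""
    match = TWEET_URL_RE.search(url)
    return match.group(1) if match else None
```

The URL pattern stops at the numeric ID, so query strings such as ?s=20 are dropped from the returned link.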
4. Data Files
discord_msgs.csv
  Purpose: Stores raw Discord messages related to trading
  Structure:
    Message metadata (ID, timestamp, channel, author)
    Message content
    Reply information
    Mentions and ticker detection
    Twitter URL tracking
    Character and word counts
  Contains messages from trading channels with stock analysis and portfolio positions
5. Project Documentation
ProjectContext
  Purpose: Project overview and requirements
  Content:
    Project abstract describing the LLM-powered portfolio journal
    Information about data sources (Discord, Robinhood, Twitter, Reddit)
    API integration details
    SnapTrade account information


1. data_collector.py
This file handles data collection from SnapTrade/Robinhood.

Main Components:

Imports and Setup (lines 1-24): Imports libraries and sets up environment variables and paths
SnapTrade Initialization (lines 26-39): Function to initialize the SnapTrade client
Account Position Functions (lines 42-80): Retrieves portfolio positions via SnapTrade API
Account Balance Function (lines 82-106): Gets account balance information
Order History Function (lines 108-142): Retrieves recent trade orders
Data Export Functions (lines 144-189): Saves positions and orders to CSV files
Main Block (lines 192-201): Example usage when script is run directly

Areas for Improvement: There is no implementation for fetching current market prices for holdings, or historical price movements, either on demand or around the date of a related trade.
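
One way the missing price data might be added is to pass a price lookup into the collector rather than hard-wiring a market-data provider. This is a sketch under stated assumptions: enrich_positions, the get_price callable, and the position keys (symbol, quantity, average_cost) are all hypothetical, and no real market-data API is called.

```python
def enrich_positions(positions, get_price):
    """Add current price, market value and unrealized P/L to each position.

    positions: list of dicts with hypothetical keys
               'symbol', 'quantity', 'average_cost'
    get_price: callable mapping a symbol to its latest price, or None
               when no quote is available (hypothetical interface)
    """
    enriched = []
    for pos in positions:
        price = get_price(pos["symbol"])
        # Copy the position so the caller's data is not modified.
        row = dict(pos, current_price=price)
        if price is not None:
            row["market_value"] = price * pos["quantity"]
            row["unrealized_pl"] = (price - pos["average_cost"]) * pos["quantity"]
        enriched.append(row)
    return enriched
```

Because get_price is injected, the same function could be backed by a live quote service, a cached CSV of historical prices, or a plain dictionary in tests.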

2. discord_logger_bot.py
This file manages the Discord bot for monitoring and logging messages.

Main Components:

Imports and Setup (lines 1-54): Imports libraries, sets up logging, loads environment vars
Twitter API Setup (lines 26-42): Initializes Twitter API client
Utility Functions (lines 56-145):
  detect_tickers(): Extracts stock tickers with regex
  detect_twitter_links(): Finds Twitter/X links
  extract_tweet_id(): Parses tweet IDs from URLs
  fetch_tweet_data(): Retrieves tweet data via Twitter API
  log_tweet_to_file(): Saves tweet data to CSV
  log_message_to_file(): Saves Discord messages to CSV
Bot Event Handlers (lines 148-164):
  on_ready(): Confirms bot is online
  on_message(): Processes and logs new messages
Bot Commands (lines 167-178): The !history command to fetch past messages
Bot Execution (line 181): Starts the bot
Areas for Improvement:
Sentiment analysis is referenced but not implemented in this file; it could be computed here rather than in 02_clean_discord.ipynb
Rows are not validated before being written to CSV
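
The validation gap could be closed with a small check that runs before each row is written. The sketch below is illustrative only: the field list follows the columns kept in 02_clean_discord.ipynb, and whether the bot's CSV uses exactly these names, numeric message IDs, and ISO-formatted timestamps is an assumption. The helper names are hypothetical.

```python
import csv
import io
from datetime import datetime

# Assumed required columns, borrowed from the cleaned dataset.
REQUIRED_FIELDS = ("message_id", "created_at", "channel", "author_name", "content")


def _text(row, name):
    """Return the field as stripped text, treating None as empty."""
    value = row.get(name)
    return "" if value is None else str(value).strip()


def validate_message_row(row):
    """Return a list of problems; an empty list means the row is valid."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not _text(row, name)]
    if not _text(row, "message_id").isdigit():
        problems.append("message_id is not numeric")
    try:
        datetime.fromisoformat(_text(row, "created_at"))
    except ValueError:
        problems.append("created_at is not an ISO timestamp")
    return problems


def write_valid_rows(rows, handle):
    """Write only valid rows to an open text handle; return rejected rows."""
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_FIELDS, extrasaction="ignore")
    writer.writeheader()
    rejected = []
    for row in rows:
        if validate_message_row(row):
            rejected.append(row)
        else:
            writer.writerow(row)
    return rejected


# Usage: an in-memory buffer stands in for the real CSV file.
sample = {"message_id": "123", "created_at": "2025-04-01T12:00:00+00:00",
          "channel": "trades", "author_name": "alice", "content": "Bought $AAPL"}
broken = dict(sample, message_id="", created_at="yesterday")
buffer = io.StringIO()
rejected = write_valid_rows([sample, broken], buffer)
```

Returning the rejected rows instead of raising lets the bot keep running and log the problems separately.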


3. 02_clean_discord.ipynb
This notebook processes raw Discord message data and exports a clean dataset.

Main Components:

Imports and Setup (cell 1): Loads necessary libraries and defines paths
Data Loading and Basic Normalization (cell 2):
  Loads Discord messages from CSV
  Converts timestamps and sorts chronologically
  Removes duplicates
  Feature engineering (message length, sentiment, etc.)
  Creates clean version of the data
  Saves to Parquet format
Areas for Improvement:

Sentiment relies on TextBlob's general-purpose, lexicon-based polarity score, which is not tuned for trading slang or financial language
Limited data validation and no explicit error handling
No visualization of the data processing results
Ticker detection could be enhanced with a more robust financial symbol library
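
As one possible direction for the ticker-detection point, candidates matched by regex could be checked against a set of known symbols. This is a minimal sketch: KNOWN_SYMBOLS is a hypothetical hand-written sample standing in for a real symbol list, and extract_valid_tickers is an illustrative name, not part of the notebook.

```python
import re

# Hypothetical sample; a real list would come from an exchange listing file
# or a financial symbol library.
KNOWN_SYMBOLS = {"AAPL", "NVDA", "TSLA", "F"}

# Unlike the notebook's {2,6} pattern, this also allows one-letter symbols
# and lowercase input, relying on the symbol set to filter false positives.
CANDIDATE_RE = re.compile(r"\$([A-Za-z]{1,6})\b")


def extract_valid_tickers(text, known_symbols=KNOWN_SYMBOLS):
    """Return known tickers mentioned as $SYMBOL, uppercased, de-duplicated."""
    found = []
    for raw in CANDIDATE_RE.findall(text):
        symbol = raw.upper()
        if symbol in known_symbols and symbol not in found:
            found.append(symbol)
    return found
```

Filtering against a symbol set trades recall for precision: unknown or newly listed tickers are dropped until the set is refreshed.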