# Stock Movement Analysis Based on Social Media Sentiment
This notebook scrapes post titles from **Reddit**, scores their sentiment with TextBlob, joins the scores with daily SPY price data from yfinance, and trains a simple random forest classifier to predict stock movement.

## Import Required Libraries

In [1]:
import praw
import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import yfinance as yf
from dotenv import load_dotenv
import os

print("All libraries are installed and imported successfully!")

All libraries are installed and imported successfully!


## Set Up Reddit API Credentials

In [2]:
load_dotenv()

client_id = os.getenv('REDDIT_CLIENT_ID')
client_secret = os.getenv('REDDIT_CLIENT_SECRET')
user_agent = os.getenv('REDDIT_USER_AGENT')

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent
)
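
`os.getenv` returns `None` for a variable that is not set, so a typo in `.env` can go unnoticed at this point. The following is a minimal sketch of an up-front check that names the missing variables. It assumes the same three variable names used above; `require_env` and `REQUIRED_VARS` are hypothetical helpers, not part of PRAW or python-dotenv.

```python
import os

# Variable names taken from the cell above
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")

def require_env(names, environ=None):
    """Return {name: value} for every name, or raise if any is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}

# Demonstration with an explicit mapping (placeholder values, not real credentials)
example_env = {
    "REDDIT_CLIENT_ID": "my-id",
    "REDDIT_CLIENT_SECRET": "my-secret",
    "REDDIT_USER_AGENT": "sentiment-notebook/0.1",
}
creds = require_env(REQUIRED_VARS, environ=example_env)
print(sorted(creds))
```

In the notebook itself, `require_env(REQUIRED_VARS)` could be called right after `load_dotenv()` so that it reads from the real environment.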


## Reddit Data Scraping

In [3]:
def fetch_reddit_posts(subreddit_name, limit=100):
    """Return titles and UTC post dates of a subreddit's current 'hot' posts."""
    subreddit = reddit.subreddit(subreddit_name)
    posts = []
    for post in subreddit.hot(limit=limit):
        posts.append({
            'title': post.title,
            # created_utc is a Unix timestamp in seconds; keep only the UTC calendar date
            'timestamp': pd.to_datetime(post.created_utc, unit='s').date()
        })
    return pd.DataFrame(posts)

reddit_data = fetch_reddit_posts(subreddit_name='stocks', limit=100)
print(reddit_data.head())


                                               title   timestamp
0  r/Stocks Daily Discussion & Fundamentals Frida...  2024-11-15
1  /r/Stocks Weekend Discussion Saturday - Nov 23...  2024-11-23
2  Glancy Prongay & Murray LLP, a Leading Securit...  2024-11-24
3          Evaluation of Companies with Down Revenue  2024-11-23
4  Thoughts on ESTC (Elastic) as the next hot AI ...  2024-11-23
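
The first two rows of the output above are recurring discussion threads rather than posts about a particular stock. The sketch below shows one way to skip such threads, assuming they are pinned by the moderators and therefore carry PRAW's `stickied` attribute. `posts_to_frame` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real submissions so that the example runs without network access.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

import pandas as pd

def posts_to_frame(posts, skip_stickied=True):
    """Build a DataFrame of titles and UTC dates from submission-like objects."""
    rows = []
    for post in posts:
        # Pinned threads are skipped when skip_stickied is True
        if skip_stickied and getattr(post, "stickied", False):
            continue
        rows.append({
            "title": post.title,
            # Unix seconds -> timestamp, truncated to midnight (UTC calendar date)
            "timestamp": pd.to_datetime(post.created_utc, unit="s").normalize(),
        })
    return pd.DataFrame(rows, columns=["title", "timestamp"])

def _utc_seconds(year, month, day, hour=12):
    """Unix timestamp for a UTC date and hour (used only to build the stand-ins)."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp()

# Stand-in posts with made-up titles; real code would pass subreddit.hot(limit=...)
fake_posts = [
    SimpleNamespace(title="Daily Discussion", created_utc=_utc_seconds(2024, 11, 15), stickied=True),
    SimpleNamespace(title="Thoughts on ESTC?", created_utc=_utc_seconds(2024, 11, 23), stickied=False),
]
frame = posts_to_frame(fake_posts)
print(frame)
```

With a live connection, the same function could be called as `posts_to_frame(reddit.subreddit('stocks').hot(limit=100))`.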


## Preprocessing and Sentiment Analysis
We'll preprocess the scraped text data and perform sentiment analysis using TextBlob.

In [4]:
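def preprocess_and_analyze(data, text_column):
    """Add 'cleaned_text' and 'sentiment' columns to data (modified in place) and return it."""
    def clean_text(text):
        # Drop whitespace-separated tokens that start with a URL scheme, mention, or hashtag
        return ' '.join(word for word in text.split() if not word.startswith(('http', '@', '#')))

    def get_sentiment(text):
        # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive)
        return TextBlob(text).sentiment.polarity

    data['cleaned_text'] = data[text_column].apply(clean_text)
    # Analyze sentiment
    data['sentiment'] = data['cleaned_text'].apply(get_sentiment)
    return data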

reddit_data = preprocess_and_analyze(reddit_data, text_column='title')

reddit_data['timestamp'] = pd.to_datetime(reddit_data['timestamp'])  # Make sure it's a datetime object

print(reddit_data[['title', 'timestamp', 'sentiment']].head())


                                               title  timestamp  sentiment
0  r/Stocks Daily Discussion & Fundamentals Frida... 2024-11-15   0.000000
1  /r/Stocks Weekend Discussion Saturday - Nov 23... 2024-11-23   0.000000
2  Glancy Prongay & Murray LLP, a Leading Securit... 2024-11-24  -0.200000
3          Evaluation of Companies with Down Revenue 2024-11-23  -0.155556
4  Thoughts on ESTC (Elastic) as the next hot AI ... 2024-11-23   0.125000
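
The `clean_text` helper above removes a token only when it starts with `http`, `@`, or `#`. As a possible alternative, the sketch below uses regular expressions so that URLs are matched wherever they begin, while cashtags such as `$NVDA` are left in place. `clean_title` and the sample string are illustrative assumptions, not part of the original pipeline.

```python
import re

# URLs starting with http(s):// or www.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# @mentions and #hashtags that are not glued to a preceding word character
MENTION_RE = re.compile(r"(?<!\w)[@#]\w+")

def clean_title(text):
    """Remove URLs, @mentions and #hashtags, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())

# Made-up sample title
sample = "Is $NVDA overvalued? Chart: https://example.com/post @someone #stocks"
cleaned = clean_title(sample)
print(cleaned)
```

The cleaned string can then be passed to `TextBlob(...).sentiment.polarity` exactly as in `get_sentiment`.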


## Fetch Historical Stock Data (SPY or QQQ) Using yfinance

In [5]:
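# Download daily price data for a ticker and keep the adjusted close
def fetch_stock_data(ticker, start_date, end_date):
    # auto_adjust=False keeps a separate 'Adj Close' column; newer yfinance releases
    # adjust prices by default and omit it
    stock_data = yf.download(ticker, start=start_date, end=end_date, auto_adjust=False)
    return stock_data[['Adj Close']].copy()  # Ensure it's a copy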

# Fetch SPY (S&P 500 ETF) data from yfinance
stock_data = fetch_stock_data('SPY', start_date='2023-01-01', end_date='2024-11-30')

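# Daily percentage change of the adjusted close
stock_data['returns'] = stock_data['Adj Close'].pct_change()

# Label each day 'Up' for a positive return, otherwise 'Down' (a zero return counts as 'Down')
stock_data['stock_movement'] = np.where(stock_data['returns'] > 0, 'Up', 'Down')

# Drop the first row, whose return is NaN because it has no previous close
stock_data.dropna(inplace=True)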

print("Stock Data with Returns and Movement:")
print(stock_data.head())


[*********************100%***********************]  1 of 1 completed

Stock Data with Returns and Movement:
             Adj Close   returns stock_movement
Date                                           
2023-01-04  374.483368  0.007720             Up
2023-01-05  370.209229 -0.011413           Down
2023-01-06  378.698914  0.022932             Up
2023-01-09  378.484222 -0.000567           Down
2023-01-10  381.138458  0.007013             Up
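
The labels above describe each day's own return. For a forecasting setup, one option is to label each row with the *next* trading day's movement, so that features observed on day t are paired with the outcome on day t+1. The sketch below illustrates this with `shift(-1)`. The prices are made-up toy values, and `next_return` and `next_movement` are hypothetical column names.

```python
import numpy as np
import pandas as pd

# Toy prices (not real SPY data)
prices = pd.DataFrame(
    {"Adj Close": [100.0, 101.0, 100.5, 102.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
)

# Same-day return, as in the notebook
prices["returns"] = prices["Adj Close"].pct_change()

# Move each return up one row so a row holds the following day's return
prices["next_return"] = prices["returns"].shift(-1)

# The last row has no following day, so it is dropped
prices = prices.dropna(subset=["next_return"]).copy()
prices["next_movement"] = np.where(prices["next_return"] > 0, "Up", "Down")
print(prices[["Adj Close", "next_return", "next_movement"]])
```

This is an alternative labeling, not what the cell above computes; it changes which rows survive and therefore the size of the merged dataset.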





## Combine Reddit Sentiment Data with Stock Data

In [6]:
reddit_data['timestamp'] = pd.to_datetime(reddit_data['timestamp'])

stock_data['timestamp'] = pd.to_datetime(stock_data.index)

# Inner join on date: keeps only trading days that have posts, with one row per post
combined_data = pd.merge(stock_data, reddit_data[['timestamp', 'sentiment']], on='timestamp', how='inner')

print(combined_data.head())


    Adj Close   returns stock_movement  timestamp  sentiment
0  585.750000 -0.012809           Down 2024-11-15        0.0
1  588.150024  0.004097             Up 2024-11-18        0.2
2  588.150024  0.004097             Up 2024-11-18        0.0
3  588.150024  0.004097             Up 2024-11-18        0.0
4  588.150024  0.004097             Up 2024-11-18        0.0
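
In the output above, 2024-11-18 appears several times because the merge produces one row per post, and every one of those rows carries the same price label. One way to get a single row per trading day is to aggregate the sentiment by date before merging. The sketch below assumes that approach; the data are toy values, and `mean_sentiment` and `post_count` are hypothetical column names.

```python
import pandas as pd

# Toy post-level sentiment (several posts on the same day)
posts = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-11-18", "2024-11-18", "2024-11-18", "2024-11-19"]),
    "sentiment": [0.2, 0.0, 0.1, -0.3],
})

# Toy daily stock labels
stock = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-11-18", "2024-11-19"]),
    "stock_movement": ["Up", "Down"],
})

# Collapse posts to one row per day: average polarity and number of posts
daily_sentiment = (
    posts.groupby("timestamp")["sentiment"]
    .agg(mean_sentiment="mean", post_count="count")
    .reset_index()
)

# validate raises an error if either side still has duplicate dates
daily = stock.merge(daily_sentiment, on="timestamp", how="inner", validate="one_to_one")
print(daily)
```

Aggregating first also keeps identical days from landing in both the training and the test set when the data are split later.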


## Training and Predicting Stock Movements

In [7]:
X = combined_data[['sentiment']]  # Feature: Sentiment
y = combined_data['stock_movement']  # Target: Stock Movement

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(f"\nModel Prediction Accuracy: {accuracy * 100:.2f}%")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=0))


Accuracy: 0.9375

Model Prediction Accuracy: 93.75%

Classification Report:
              precision    recall  f1-score   support

        Down       0.00      0.00      0.00         1
          Up       0.94      1.00      0.97        15

    accuracy                           0.94        16
   macro avg       0.47      0.50      0.48        16
weighted avg       0.88      0.94      0.91        16
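
The report shows a recall of 0.00 for `Down` and 1.00 for `Up`, which means the model predicted `Up` for all 16 test rows. The short sketch below works out the accuracy of a trivial rule that always predicts the majority class, using labels reconstructed from the `support` column of the report.

```python
from collections import Counter

# Test-set labels reconstructed from the "support" column: 15 Up, 1 Down
y_true = ["Up"] * 15 + ["Down"] * 1

# The most common label and how often it occurs
counts = Counter(y_true)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy obtained by predicting the majority label for every row
baseline_accuracy = majority_count / len(y_true)
print(f"Always predicting {majority_label!r}: accuracy = {baseline_accuracy:.4f}")
```

The baseline matches the reported 0.9375, so the accuracy figure alone does not show that sentiment carries predictive signal here. On a real split, scikit-learn's `DummyClassifier(strategy="most_frequent")` gives the same kind of reference point.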



## Possible Improvements

- **Data Preprocessing**: Improve text cleaning and handle missing data, such as trading days with no matching posts.
- **Feature Engineering**: Add features such as moving averages and other technical indicators.
- **Modeling**: Tune hyperparameters and try other models such as XGBoost.
- **Sentiment Analysis**: Try tools designed for social media text, such as VADER, or transformer-based models such as BERT.
- **Evaluation**: Address class imbalance (the test set above has 15 `Up` rows and only 1 `Down` row) and report metrics beyond accuracy, such as per-class recall.
- **Time Series**: Try sequence models such as LSTM for temporal analysis.
- **Visualization**: Plot sentiment trends alongside stock prices.
- **Real-Time**: Build a real-time prediction system.
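
As a starting point for the feature engineering and time series items, the sketch below adds rolling-mean features and then splits the rows chronologically instead of at random. It assumes a daily table with one row per trading day. All values are toy data, the `mean_sentiment` column and the window of 3 are illustrative assumptions, and nothing here has been tuned.

```python
import pandas as pd

# Toy daily table: one row per business day (values are made up)
daily = pd.DataFrame(
    {
        "Adj Close": [100.0, 101.0, 100.5, 102.0, 103.0, 102.5, 104.0, 105.0, 104.5, 106.0],
        "mean_sentiment": [0.1, 0.0, -0.2, 0.3, 0.1, 0.0, 0.2, -0.1, 0.0, 0.4],
    },
    index=pd.date_range("2024-01-01", periods=10, freq="B"),
)

window = 3  # illustrative window length, not tuned

# Rolling means over the current row and the two rows before it
daily["price_ma"] = daily["Adj Close"].rolling(window).mean()
daily["sentiment_ma"] = daily["mean_sentiment"].rolling(window).mean()

# The first window-1 rows have incomplete windows and are dropped
features = daily.dropna().copy()

# Chronological split: earliest 75% for training, the rest for testing, no shuffling
split = int(len(features) * 0.75)
train, test = features.iloc[:split], features.iloc[split:]
print(f"train ends {train.index.max().date()}, test starts {test.index.min().date()}")
```

Keeping the test rows strictly later than the training rows mimics how the model would be used, which a shuffled `train_test_split` does not.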
