

Install Necessary Libraries



In [8]:
!pip install praw pandas scikit-learn vaderSentiment matplotlib transformers torch xgboost lightgbm imbalanced-learn




Data Scraping (PRAW for Reddit)

In [9]:
import os
import praw
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Initialize Reddit instance. Credentials are read from environment variables
# instead of being hard-coded (the variable names are an assumed convention).
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='TaeTaeBot:v1.0 (by /u/Equivalent_Let_8310)')

# Scrape Reddit data from the 'stocks' subreddit
subreddit = reddit.subreddit('stocks')

def scrape_reddit_data(limit=500):
    """Collect the newest submissions from the subreddit into a DataFrame."""
    posts = []
    for submission in subreddit.new(limit=limit):
        # created_utc is the submission time as Unix epoch seconds (UTC)
        posts.append([submission.title, submission.selftext, submission.score, submission.num_comments, submission.created_utc])
    return pd.DataFrame(posts, columns=['Title', 'Body', 'Upvotes', 'Comments', 'Created_At'])

reddit_data = scrape_reddit_data(1000)

# Save raw scraped data
reddit_data.to_csv('reddit_stock_data.csv', index=False)
print("Scraped Reddit data successfully!")


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

Scraped Reddit data successfully!
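
The Created_At column holds Unix epoch seconds, which are hard to read and cannot be aligned with trading days as-is. The following is a minimal sketch of the conversion using only the standard library; the helper name `epoch_to_utc` is hypothetical and not part of the notebook.

```python
from datetime import datetime, timezone

def epoch_to_utc(ts):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# The epoch itself maps to midnight on 1 January 1970 (UTC).
print(epoch_to_utc(0).isoformat())  # 1970-01-01T00:00:00+00:00
```

For the whole column, `pd.to_datetime(reddit_data['Created_At'], unit='s', utc=True)` should give the same result in one vectorized call.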


Sentiment Analysis Using Pretrained BERT (FinBERT)

In [10]:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.nn.functional import softmax
import torch

# Load pre-trained FinBERT model for sentiment analysis
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
model = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone')

def get_finbert_sentiment(text):
    # Truncate long posts to BERT's 512-token limit
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)
    probs = softmax(outputs.logits, dim=-1)
    sentiment = torch.argmax(probs).item()
    return sentiment  # finbert-tone label order: 0: Neutral, 1: Positive, 2: Negative

# Apply FinBERT sentiment analysis to the Reddit posts
reddit_data['Content'] = reddit_data['Title'] + ' ' + reddit_data['Body']
reddit_data['Sentiment'] = reddit_data['Content'].apply(get_finbert_sentiment)

# Map sentiment to string labels for interpretability
reddit_data['Sentiment_Label'] = reddit_data['Sentiment'].map({0: 'Neutral', 1: 'Positive', 2: 'Negative'})
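
The softmax and argmax steps in get_finbert_sentiment can be traced by hand. The sketch below reproduces them in plain Python on made-up logits; `predict_label` and the placeholder class names are hypothetical, and for a real checkpoint the id-to-label order is best read from `model.config.id2label` rather than assumed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Placeholder class names; the real order depends on the checkpoint.
example_id2label = {0: "class_0", 1: "class_1", 2: "class_2"}
label, confidence = predict_label([0.2, 2.5, -1.0], example_id2label)
print(label, round(confidence, 3))
```

Keeping the probability alongside the label makes it possible to drop low-confidence predictions later instead of treating every post as equally informative.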




Feature Engineering

In [11]:
import numpy as np

# Adding basic features
reddit_data['Post_Length'] = reddit_data['Content'].apply(len)  # Post length (characters)
reddit_data['Upvote_Comment_Ratio'] = np.where(reddit_data['Comments'] == 0, reddit_data['Upvotes'], reddit_data['Upvotes'] / reddit_data['Comments'])  # Upvotes to comment ratio

# Example stock symbol mentions (add your own relevant symbols)
symbols = ['AAPL', 'TSLA', 'AMZN', 'GOOGL', 'MSFT']
for symbol in symbols:
    reddit_data[f'Mention_{symbol}'] = reddit_data['Content'].apply(lambda x: 1 if symbol in x else 0)

# Save preprocessed data for easy use
reddit_data.to_csv('preprocessed_stock_data_with_bert.csv', index=False)
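
The `symbol in x` check above is a plain substring test, so a symbol also matches when it appears inside a longer token. A possible alternative, sketched here with the standard `re` module, matches the ticker only as a whole word; the helper name `mentions_symbol` is hypothetical.

```python
import re

def mentions_symbol(text, symbol):
    """Return 1 if the ticker appears as a standalone word in text, else 0."""
    pattern = rf"\b{re.escape(symbol)}\b"
    return 1 if re.search(pattern, text) else 0

print(mentions_symbol("Bought AAPL today", "AAPL"))   # 1
print(mentions_symbol("Ticker AAPLX is a fund", "AAPL"))  # 0
print(mentions_symbol("$TSLA to the moon", "TSLA"))   # 1
```

The match is case-sensitive, like the original check, so lowercase mentions such as "tsla" are still missed unless `re.IGNORECASE` is passed.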


Model Training with XGBoost

In [12]:
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Load preprocessed data
data = pd.read_csv('preprocessed_stock_data_with_bert.csv')

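# Target variable. Note: this is a placeholder that labels a post by its upvote count (> 100),
# not by actual price movement; replace it with a price-based label for real stock-movement prediction.
# Upvote_Comment_Ratio is derived from Upvotes, so it carries information about this target.
data['Stock_Movement'] = data['Upvotes'].apply(lambda x: 1 if x > 100 else 0)  # Adjust threshold as per your dataset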

# Define features and target
features = ['Sentiment', 'Post_Length', 'Upvote_Comment_Ratio'] + [f'Mention_{symbol}' for symbol in symbols]
X = data[features]
y = data['Stock_Movement']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize XGBoost classifier
xgb_model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1, subsample=0.8, colsample_bytree=0.8)

# Train the model
xgb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = xgb_model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")


Model Accuracy: 0.7634
Precision: 0.5789
Recall: 0.5641
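
Accuracy, precision, and recall all derive from the four confusion-matrix counts. The sketch below computes them in plain Python; the counts are made up for illustration and are not taken from this dataset, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Illustrative counts only.
acc, prec, rec = classification_metrics(tp=3, fp=1, fn=2, tn=4)
print(f"Accuracy: {acc:.4f}, Precision: {prec:.4f}, Recall: {rec:.4f}")
# Accuracy: 0.7000, Precision: 0.7500, Recall: 0.6000
```

When the positive class is rare, accuracy can look high even if most positives are missed, which is why precision and recall are reported next to it.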


Hyperparameter Tuning for XGBoost

In [13]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1],
    'colsample_bytree': [0.8, 1]
}

grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

# Make predictions with the best model
y_pred_best = best_model.predict(X_test)

# Evaluate best model
accuracy_best = accuracy_score(y_test, y_pred_best)
precision_best = precision_score(y_test, y_pred_best)
recall_best = recall_score(y_test, y_pred_best)

print(f"Optimized Model Accuracy: {accuracy_best:.4f}")
print(f"Optimized Precision: {precision_best:.4f}")
print(f"Optimized Recall: {recall_best:.4f}")


Fitting 5 folds for each of 108 candidates, totalling 540 fits
Optimized Model Accuracy: 0.7993
Optimized Precision: 0.6667
Optimized Recall: 0.5641
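
The "108 candidates, totalling 540 fits" line in the log follows from the size of the parameter grid: every combination of values is one candidate, and each candidate is trained once per fold. A small sketch that reproduces the count with `itertools.product`:

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1],
    'colsample_bytree': [0.8, 1],
}
cv_folds = 5

# Each element of the Cartesian product is one hyperparameter combination.
candidates = list(product(*param_grid.values()))
total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
# 108 candidates x 5 folds = 540 fits
```

Because the count grows multiplicatively, adding one more three-valued parameter would triple the number of fits.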


Ensemble Voting Classifier with SMOTE

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE

# Define stock symbols
symbols = ['AAPL', 'TSLA', 'AMZN', 'GOOGL', 'MSFT']

# Load preprocessed data
data = pd.read_csv('preprocessed_stock_data_with_bert.csv')

# Target variable
data['Stock_Movement'] = data['Upvotes'].apply(lambda x: 1 if x > 100 else 0)

# Define features and target
features = ['Sentiment', 'Post_Length', 'Upvote_Comment_Ratio'] + [f'Mention_{symbol}' for symbol in symbols]
X = data[features]
y = data['Stock_Movement']

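# Handle class imbalance with SMOTE.
# Note: resampling happens before the split here, so the test set also contains
# synthetic samples and the metrics below are likely optimistic.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)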

# Initialize individual models
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
lgb_model = LGBMClassifier()
rf_model = RandomForestClassifier()

# Create a voting classifier
voting_classifier = VotingClassifier(estimators=[
    ('xgb', xgb_model),
    ('lgb', lgb_model),
    ('rf', rf_model)],
    voting='soft'  # Use soft voting
)

# Fit the voting classifier
voting_classifier.fit(X_train, y_train)

# Make predictions
y_pred_voting = voting_classifier.predict(X_test)

# Evaluate model performance
accuracy_voting = accuracy_score(y_test, y_pred_voting)
precision_voting = precision_score(y_test, y_pred_voting)
recall_voting = recall_score(y_test, y_pred_voting)
f1_voting = f1_score(y_test, y_pred_voting)

print(f"Voting Classifier Accuracy: {accuracy_voting:.4f}")
print(f"Voting Classifier Precision: {precision_voting:.4f}")
print(f"Voting Classifier Recall: {recall_voting:.4f}")
print(f"Voting Classifier F1 Score: {f1_voting:.4f}")


Parameters: { "use_label_encoder" } are not used.



[LightGBM] [Info] Number of positive: 468, number of negative: 458
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000136 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 513
[LightGBM] [Info] Number of data points in the train set: 926, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.505400 -> initscore=0.021599
[LightGBM] [Info] Start training from score 0.021599
Voting Classifier Accuracy: 0.8719
Voting Classifier Precision: 0.8265
Voting Classifier Recall: 0.9330
Voting Classifier F1 Score: 0.8765
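
These scores are measured on a test set drawn from data that was already resampled. A common way to get a more conservative estimate is to split first and balance only the training portion. The sketch below shows that ordering in plain Python, using random duplication of minority rows as a simple stand-in for SMOTE (it does not interpolate new points); the function name and toy data are hypothetical.

```python
import random
from collections import Counter

def split_then_oversample(rows, labels, test_size=0.3, seed=42):
    """Split first, then balance classes in the training portion only."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    X_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]

    # Group the original training rows by class before adding duplicates.
    pools = {}
    for x, y in zip(X_train, y_train):
        pools.setdefault(y, []).append(x)
    target = max(len(pool) for pool in pools.values())

    # Duplicate randomly chosen rows until every class reaches the target size.
    for cls, pool in pools.items():
        for _ in range(target - len(pool)):
            X_train.append(rng.choice(pool))
            y_train.append(cls)
    return X_train, X_test, y_train, y_test

# Toy data: 5 positive and 15 negative rows.
rows = [[i] for i in range(20)]
labels = [1] * 5 + [0] * 15
X_tr, X_te, y_tr, y_te = split_then_oversample(rows, labels)
print(Counter(y_tr), len(y_te))
```

With imbalanced-learn, the equivalent would be to call `train_test_split` on the original data and then `SMOTE.fit_resample` on the training split alone, leaving the test split untouched.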
