<a href="https://colab.research.google.com/github/sg2083/independent_study/blob/main/sentiment_analysis_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis of Stock-Related News & Posts and Predicting Stock Market Prices

## Introduction
The stock market is highly influenced by investor sentiment, which is often reflected in news articles, social media discussions, and online forums. This study focuses on analyzing the sentiment of stock-related posts from multiple sources, including Reddit, NewsAPI, and historical stock prices.

The goal is to determine whether online sentiment correlates with stock price movements and if it can be used as a predictive feature for stock performance.

## Research Questions
1. Does investor sentiment expressed in Reddit posts and news articles correlate with stock price movements for Tesla?
2. Can sentiment data extracted from online platforms be used to predict stock price trends in the short term?
3. What is the relative significance of different sentiment sources (Reddit vs. NewsAPI) in predicting stock market performance?
4. How does sentiment change in response to major news events, and does this sentiment shift correlate with subsequent stock price movements?

## Hypotheses
>**H1**: There is a significant positive correlation between positive sentiment in Reddit posts/news articles and an increase in Tesla's stock price.<br>
**H2**: Negative sentiment in Reddit posts/news articles is significantly correlated with a decrease in Tesla's stock price.<br>
**H3**: Sentiment data from Reddit is more predictive of short-term stock price fluctuations than sentiment data from news articles.<br>
**H4**: Major news events (e.g., product launches, regulatory announcements) cause a significant shift in sentiment, which is reflected in short-term stock price movements.
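
One way to examine H1 through H3 is to correlate a daily sentiment score with the stock's return on the same day and on the following days. The block below is a minimal sketch under stated assumptions: the `sentiment` and `close` column names, the `lagged_correlation` helper, and the synthetic data are all hypothetical and only illustrate the mechanics, not results for Tesla.

```python
import numpy as np
import pandas as pd


def lagged_correlation(daily: pd.DataFrame, max_lag: int = 3) -> pd.Series:
    """Pearson correlation between sentiment on day t and the return on day t + lag."""
    returns = daily["close"].pct_change()
    result = {}
    for lag in range(max_lag + 1):
        # shift(-lag) moves the return of day t + lag back onto row t
        result[lag] = daily["sentiment"].corr(returns.shift(-lag))
    return pd.Series(result, name="pearson_r")


# Synthetic illustration only (not real Tesla data): next-day returns are
# constructed to partly follow today's sentiment.
rng = np.random.default_rng(0)
n_days = 60
dates = pd.bdate_range("2025-01-02", periods=n_days)
sentiment = rng.normal(0, 1, size=n_days)
noise = rng.normal(0, 0.01, size=n_days)
ret = np.empty(n_days)
ret[0] = noise[0]
ret[1:] = 0.01 * sentiment[:-1] + noise[1:]
close = 200 * np.cumprod(1 + ret)

daily = pd.DataFrame({"sentiment": sentiment, "close": close}, index=dates)
corr = lagged_correlation(daily)
print(corr)
```

Running the same helper separately on Reddit-derived and news-derived scores would give a first, rough comparison for H3. A correlation alone does not establish predictive value, so a significance test and an out-of-sample check would still be needed.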

## Literature Review
The study by Nti, Adekoya, and Weyori (2020) investigates how public sentiment, derived from web news, Twitter, Google Trends, and forum discussions, influences stock market predictions. Using sentiment analysis with an Artificial Neural Network (ANN) model, the authors predict stock prices on the Ghana Stock Exchange (GSE) over time frames of 1 to 90 days. They find that combining multiple data sources improves prediction accuracy, with the highest accuracy (70.66–77.12%) achieved from a combined dataset. The study highlights a strong link between stock market behavior and social media, suggesting that sentiment data from online platforms can help investors predict future stock price movements and make better investment decisions.
link: https://sciendo.com/article/10.2478/acss-2020-0004

## How It's Different from What's Already Been Done
This study focuses on event-driven sentiment evolution and its impact on stock price prediction.

## Data
The data for this study is collected from three primary sources: **Reddit, NewsAPI, and stock market data**. Reddit posts related to **Tesla** stock are retrieved with the `praw` library from financial discussion subreddits such as r/wallstreetbets, capturing post titles and timestamps. News articles mentioning Tesla are obtained via NewsAPI, extracting headlines and publication dates. Historical stock price data is sourced from Yahoo Finance through the `yfinance` library, including daily open, high, low, and close prices, trading volume, dividends, and stock splits.

Since these datasets originate from different platforms, they contain varying timestamp formats, time zones, and missing values, requiring careful preprocessing and merging to align sentiment data with stock price movements for further analysis.

### Data Preprocessing
The collected data is being cleaned and standardized before merging. Steps include:

1. Date Format Standardization

  * Convert timestamps from different time zones to UTC
  * Convert stock market timestamps (which include hours/minutes) to date-only format

2. Column Renaming for Clarity

  * Title → title_reddit (for Reddit)
  * Title → title_news (for NewsAPI)
  
  This prevents column name conflicts

3. Handling Missing Data

  * Some dates lack both Reddit posts and news articles
  * Missing values must be carefully handled to avoid bias

4. Merging Data

  * An outer join keeps all records from Reddit, NewsAPI, and the stock price data
  * This ensures no loss of important data points

Note: Since data comes from multiple sources, preprocessing is still in progress to handle scattered and missing data.
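
The steps above can be sketched on a few hypothetical rows. Because several posts can share a date, a row-level merge on `Date` pairs every post with every headline from that day, so this sketch first collapses each source to one row per UTC date. The sample data, the `daily_titles` helper, and the `has_reddit`/`has_news` flags are assumptions made for illustration, not part of the study's pipeline.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real API results
reddit_raw = pd.DataFrame({
    "created_utc": [1739232000, 1739235600, 1739404800],  # epoch seconds
    "Title": ["TSLA to the moon", "Tesla puts printing", "Holding Tesla"],
})
news_raw = pd.DataFrame({
    "publishedAt": ["2025-02-11T03:30:00Z", "2025-02-12T14:00:00Z"],
    "Title": ["Tesla sales fall in Europe", "Tesla announces update"],
})

# Step 1: standardize both timestamp formats to timezone-aware UTC
reddit_raw["ts"] = pd.to_datetime(reddit_raw["created_utc"], unit="s", utc=True)
news_raw["ts"] = pd.to_datetime(news_raw["publishedAt"], utc=True)


def daily_titles(df: pd.DataFrame, title_col: str, new_name: str) -> pd.DataFrame:
    """Collapse a source to one row per UTC date, joining that day's titles."""
    out = df.copy()
    out["Date"] = out["ts"].dt.date
    return (
        out.groupby("Date")[title_col]
        .agg(" | ".join)
        .rename(new_name)  # Step 2: source-specific column name
        .reset_index()
    )


reddit_daily = daily_titles(reddit_raw, "Title", "title_reddit")
news_daily = daily_titles(news_raw, "Title", "title_news")

# Step 4: outer join; indicator=True records which source(s) each date came from
merged = pd.merge(reddit_daily, news_daily, on="Date", how="outer", indicator=True)
merged = merged.sort_values("Date").reset_index(drop=True)

# Step 3: flag missing data explicitly instead of imputing it silently
merged["has_reddit"] = merged["title_reddit"].notna()
merged["has_news"] = merged["title_news"].notna()
print(merged)
```

Keeping explicit presence flags makes it possible to decide later whether days without posts should be dropped, treated as neutral, or modeled separately.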

In [None]:
# @title Importing required libraries
!pip install newsapi-python
!pip install praw

import yfinance as yf
from newsapi import NewsApiClient
import praw
import datetime
import pandas as pd



In [None]:
# @title Fetching stock history data for Tesla stocks
tesla = yf.Ticker("TSLA")
tesla_data = tesla.history(period="1y")
print(tesla_data.head())

                                 Open        High         Low       Close  \
Date                                                                        
2024-02-12 00:00:00-05:00  192.110001  194.729996  187.279999  188.130005   
2024-02-13 00:00:00-05:00  183.990005  187.259995  182.110001  184.020004   
2024-02-14 00:00:00-05:00  185.300003  188.889999  183.350006  188.710007   
2024-02-15 00:00:00-05:00  189.160004  200.880005  188.860001  200.449997   
2024-02-16 00:00:00-05:00  202.059998  203.169998  197.399994  199.949997   

                              Volume  Dividends  Stock Splits  
Date                                                           
2024-02-12 00:00:00-05:00   95498600        0.0           0.0  
2024-02-13 00:00:00-05:00   86759500        0.0           0.0  
2024-02-14 00:00:00-05:00   81203000        0.0           0.0  
2024-02-15 00:00:00-05:00  120831800        0.0           0.0  
2024-02-16 00:00:00-05:00  111173600        0.0           0.0  


In [None]:
import os

from tweepy import OAuthHandler
from tweepy import API

# Read credentials from environment variables instead of hardcoding them.
# The variable names below are assumptions; adjust them to your setup.
# Any key that was ever committed to a notebook should be revoked and regenerated.
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

# Consumer key authentication
auth = OAuthHandler(consumer_key, consumer_secret)

# Access key authentication
auth.set_access_token(access_token, access_token_secret)

# Set up the API with the authentication handler
api = API(auth)

In [None]:
import tweepy
import csv
import datetime
import os

# Twitter API credentials: Bearer Token for API v2, read from the environment
# (the variable name is an assumption; adjust it to your setup)
bearer_token = os.environ["TWITTER_BEARER_TOKEN"]

# Authenticate to the Twitter API
client = tweepy.Client(bearer_token=bearer_token)

# Define the query for recent tweets
query = 'Tesla'
# search_recent_tweets only covers roughly the past 7 days;
# pass start_time/end_time to narrow the window further
max_tweets = 10  # Limit to 10 tweets

# Create a function to collect the tweets
def collect_tweets(query, max_tweets=10):
    tweets = []
    # Using search_recent_tweets for recent tweets (within the past 7 days)
    for tweet in tweepy.Paginator(client.search_recent_tweets,  # Using search_recent_tweets for free access
                                  query=query,
                                  tweet_fields=['created_at', 'author_id', 'text'],
                                  max_results=10).flatten(limit=max_tweets):  # Limit to 10 tweets
        tweets.append([tweet.created_at, tweet.author_id, tweet.text])

    return tweets

# Collect the tweets
tweets = collect_tweets(query, max_tweets)

# Save the tweets to a CSV file
with open('tesla_tweets.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Date", "User ID", "Tweet"])
    writer.writerows(tweets)

print(f"Collected {len(tweets)} tweets about Tesla.")



Collected 10 tweets about Tesla.


In [None]:
from google.colab import files
print("\n--- CSV Content ---")
with open('tesla_tweets.csv', 'r', encoding='utf-8') as file:
    csv_content = file.read()
    print(csv_content)

files.download('tesla_tweets.csv')


--- CSV Content ---
Date,User ID,Tweet
2025-02-11 15:09:04+00:00,1266811878879215616,"RT @le20hfrancetele: 🚗📉 Chez Tesla, des ventes en panne sèche : elles se sont effondrées en Europe avec une baisse de 13 % sur un an.

En F…"
2025-02-11 15:09:04+00:00,613671143,"@elonmusk Hey @elonmusk! You should be pro work from home. The 2026 Model Y looks great but I’m not allowed to park that at the shop. It has been made clear to me that a red Ferrari 12Cilindri with a bumper sticker on the back that says, “Jingoism Is Sexy” is fine but no Tesla!"
2025-02-11 15:09:04+00:00,1503692354221338625,"RT @PhDcornerHub: @niccruzpatane Here's your passage refined for punctuation and English:

""Why is India, alongside Southeast countries, co…"
2025-02-11 15:09:03+00:00,1693759037253271552,"@BasedMikeLee @elonmusk Elon has millions of federal contracts and wants to benefit his business (tax exempt and payers) and wants more money for himself and shudder agencies that were investigating him for various th


In [None]:
# query = "(Tesla OR TSLA OR Tesla stock OR Tesla shares) -is:retweet lang:en"

# # # Fetch recent tweets (last 7 days)
# # tweets = client.search_recent_tweets(query=query, max_results=5, tweet_fields=["created_at", "text"])

# # # Store in DataFrame & Save
# # data = [[tweet.created_at, tweet.text] for tweet in tweets.data]
# # df = pd.DataFrame(data, columns=["timestamp", "tweet"])
# # df.to_csv("stock_tweets.csv", index=False)

# # print("Saved tweets to stock_tweets.csv!")
# import requests

# url = "https://api.twitter.com/2/tweets/search/recent"

# params = {
#     "query": "Tesla OR TSLA OR Tesla stock -is:retweet lang:en",
#     "max_results": 10,
#     "tweet.fields": "created_at,text"
# }

# # API headers
# headers = {"Authorization": f"Bearer {bearer_token}"}

# # Make request
# response = requests.get(url, headers=headers, params=params)

# # Check response status
# if response.status_code == 200:
#     tweets = response.json()
#     for tweet in tweets["data"]:
#         print(f"{tweet['created_at']}: {tweet['text']}\n")
# else:
#     print(f"Error {response.status_code}: {response.text}")

In [None]:
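import os

# Read the NewsAPI key from the environment instead of hardcoding it
# (the variable name is an assumption; adjust it to your setup)
newsapi = NewsApiClient(api_key=os.environ["NEWSAPI_KEY"])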

articles = newsapi.get_everything(q="Tesla stock", language="en", page_size=100)

news_data=[]
for article in articles["articles"]:
    news_data.append({'Date': article['publishedAt'], 'Title': article['title']})

In [None]:
import datetime
import os

# Read Reddit credentials from the environment instead of hardcoding them
# (the variable names are assumptions; adjust them to your setup)
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="StockSentimentAnalysis"
)

subreddit = reddit.subreddit("wallstreetbets")
posts = subreddit.search("Tesla stock", limit=100)

reddit_data = []
for post in posts:
    # created_utc is a Unix timestamp; build a timezone-aware UTC datetime
    # (datetime.utcfromtimestamp is deprecated since Python 3.12)
    post_date = datetime.datetime.fromtimestamp(post.created_utc, tz=datetime.timezone.utc)
    reddit_data.append({"Date": post_date, "Title": post.title})




In [None]:
reddit_df = pd.DataFrame(reddit_data)
news_df = pd.DataFrame(news_data)

# Parse Date as UTC, drop the timezone, and keep only the calendar date
for df in [reddit_df, news_df]:
    df["Date"] = pd.to_datetime(df["Date"], utc=True).dt.tz_convert(None).dt.date

reddit_df.rename(columns={"Title": "title_reddit"}, inplace=True)
news_df.rename(columns={"Title": "title_news"}, inplace=True)

# Merge using an outer join to retain all dates
merged_data = pd.merge(
    reddit_df[["Date", "title_reddit"]],
    news_df[["Date", "title_news"]],
    on="Date",
    how="outer"
)

# Sort by Date (in place) so rows appear in chronological order
merged_data.sort_values("Date", inplace=True)

print(merged_data)


           Date                                       title_reddit  \
0    2017-11-03  Tesla stock to rebound to $400 tomorrow? DD in...   
1    2018-07-02  WSB reaction next Tesla quarterly earnings cal...   
2    2019-05-21  Tesla stock worth just $10 in worst case: Morg...   
3    2019-07-02  Tesla delivers 95,200 vehicles. Stock shoots u...   
4    2020-02-03  A request to all the owners of a Tesla (the ca...   
..          ...                                                ...   
193  2025-02-07  More electric cars sold in Europe, but Tesla t...   
192  2025-02-07  Elon Musk's Brother Kimbal Musk And Other Tesl...   
194  2025-02-09                                                NaN   
195  2025-02-10                                                NaN   
196  2025-02-11  Kimbal Musk sells Tesla stock worth $27.6 million   

                                            title_news  
0                                                  NaN  
1                                            

In [None]:
tesla_df = pd.DataFrame(tesla_data).reset_index()

tesla_df["Date"] = pd.to_datetime(tesla_df["Date"]).dt.tz_localize(None).dt.date

# Merge Tesla stock data with sentiment data
final_data = pd.merge(
    merged_data,
    tesla_df,
    on="Date",
    how="outer"
)

# Sort by Date for analysis
final_data.sort_values("Date", inplace=True)

# Display merged dataset
print(final_data.head(5))

         Date                                       title_reddit title_news  \
0  2017-11-03  Tesla stock to rebound to $400 tomorrow? DD in...        NaN   
1  2018-07-02  WSB reaction next Tesla quarterly earnings cal...        NaN   
2  2019-05-21  Tesla stock worth just $10 in worst case: Morg...        NaN   
3  2019-07-02  Tesla delivers 95,200 vehicles. Stock shoots u...        NaN   
4  2020-02-03  A request to all the owners of a Tesla (the ca...        NaN   

   Open  High  Low  Close  Volume  Dividends  Stock Splits  
0   NaN   NaN  NaN    NaN     NaN        NaN           NaN  
1   NaN   NaN  NaN    NaN     NaN        NaN           NaN  
2   NaN   NaN  NaN    NaN     NaN        NaN           NaN  
3   NaN   NaN  NaN    NaN     NaN        NaN           NaN  
4   NaN   NaN  NaN    NaN     NaN        NaN           NaN  
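
The price columns above are `NaN` because these Reddit posts predate the one-year price history; the same happens for posts made on weekends or holidays. One possible way to handle this, sketched below, is to assign each sentiment date to the first trading session on or after it and then join onto the trading calendar. The forward-mapping rule, the `align_to_trading_days` helper, the `n_posts` column, and all sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical daily post counts; 2025-02-08 and 2025-02-09 fall on a weekend
sentiment = pd.DataFrame({
    "Date": pd.to_datetime(["2025-02-07", "2025-02-08", "2025-02-09", "2025-02-10"]),
    "n_posts": [3, 1, 2, 4],
})
# Hypothetical price table containing trading days only (made-up prices)
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2025-02-07", "2025-02-10", "2025-02-11"]),
    "Close": [100.0, 101.0, 99.0],
})


def align_to_trading_days(sentiment: pd.DataFrame, prices: pd.DataFrame) -> pd.DataFrame:
    """Roll each sentiment date forward to the next trading session and aggregate."""
    s = sentiment.sort_values("Date")
    sessions = prices[["Date"]].sort_values("Date").copy()
    sessions["trading_date"] = sessions["Date"]

    # direction="forward" picks the first session on or after each sentiment date
    aligned = pd.merge_asof(s, sessions, on="Date", direction="forward")
    # Sentiment after the last available session has nothing to map to
    aligned = aligned.dropna(subset=["trading_date"])

    per_session = (
        aligned.groupby("trading_date")["n_posts"].sum()
        .rename_axis("Date")
        .reset_index()
    )
    # Left join on the trading calendar: one row per session, no price NaNs
    result = prices.merge(per_session, on="Date", how="left")
    return result.fillna({"n_posts": 0})


aligned_data = align_to_trading_days(sentiment, prices)
print(aligned_data)
```

Joining onto the trading calendar keeps every price row and drops sentiment outside the price window, which is the opposite trade-off from the outer join used above. Which one is appropriate depends on whether the later analysis needs the older posts.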


In [None]:
final_data.to_csv('final_data.csv', index=False)

print("final_data saved to final_data.csv")

final_data saved to final_data.csv


In [None]:

files.download('final_data.csv')
