# Project 1: Social Sentiment Analysis for Stock Prediction

# Introduction & Project Overview

## Purpose
The primary objective of this project is to show how sentiment from social media (e.g., Reddit) and news articles can correlate with financial data (e.g., stock prices). This could help identify whether shifts in sentiment (positive/negative) have a measurable impact on stock price movements, which is crucial for traders, investors, or analysts who seek to incorporate alternative data sources into their decision-making processes.

**Important Note:** This project is aimed at exploring potential correlations between stock price movements and market sentiment.  It does not seek to establish causation or build a predictive model. The analysis will focus on identifying any statistical relationships between these variables, rather than determining if one directly causes the other.

## Skills Showcased
This project showcases my skills in data aggregation, cleaning, transformation, and merging datasets from different sources, as well as my proficiency in Python and SQL and my ability to work with PowerBI.

## Selection of Pharma & Defense Stocks
The pharmaceutical and defense sectors were selected because both are sensitive to public perception and regulatory changes, making them good candidates for sentiment-based analysis. These sectors frequently experience large, volatile price movements influenced by news coverage and social discussions, which provides rich data for correlation analysis.

## Data Sources Selection

### Reddit
Reddit was chosen due to its popularity, extensive discussion threads, and niche-focused subreddits that specifically cover financial news, investment strategies, and sector-specific discussions (e.g., pharma and defense stocks).

### NewsAPI
NewsAPI provides access to diverse, credible, and timely news sources globally. It allows for comprehensive collection of articles relevant to the target companies, enhancing the quality and depth of the sentiment analysis.

### Twitter (Dropped)
Originally, Twitter (X) was intended to be a primary data source. However, due to recent API limitations (restrictive access, high costs, and limits on the volume of data retrieval), Twitter was dropped to ensure the feasibility and sustainability of the project.

## Notebook Objectives
This notebook documents the step-by-step processes implemented during each project phase. Each phase is explained through relevant outcomes.



---


# Phase 1: Data Collection

## **Summary of Accomplishments:**

* Collected news articles for Pfizer, Moderna, Lockheed Martin, and Raytheon using the NewsAPI.

* Scraped social media data from finance-related Reddit subreddits using PRAW.

* Acquired historical OHLCV stock data for the same companies from Yahoo Finance, including 7-day and 14-day moving averages.

* Ensured data integrity and consistency by implementing missing data handling (`dropna()`) and standardizing column formats.

* Structured all collected data into a unified SQLite database (`sentiment.db`) using a defined schema (`schema.sql`) and a data loading script (`load_data.py`).

* Secured sensitive information (API keys) using `.gitignore` and `.env` files.

* Verified the integrity and structure of the database using DB Browser for SQLite.


## **Step 1: Data Acquisition**
Raw data was collected from three primary sources: Reddit, NewsAPI, and Yahoo Finance.

### **Reddit Data Collection**

  * Posts were scraped from the following subreddits, known for discussions on finance, pharmaceuticals, and defense:

 ```
  r/stocks
  r/wallstreetbets
  r/investing
  r/biotechplays
  r/defensestocks
  r/biotech
```

  * The PRAW library was used to access the Reddit API. API access is configured via `.env`.

  * Relevant fields, including title, body, score, number of comments, and timestamp, were extracted from each post.

  * **Output:** `scripts/reddit_scraper.py` generates `data/reddit_posts.csv`.

  * **Code Snippet:**

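  ``` python
  from datetime import datetime

  if __name__ == "__main__":
      try:
          target_subreddits = ['stocks', 'wallstreetbets', 'investing', 'biotechplays', 'defensestocks', 'biotech']
          # Date range to scrape
          start_date = datetime(2025, 4, 11)
          end_date = datetime(2025, 5, 13)
          # scrape_reddit is defined elsewhere in the script (not shown in this excerpt)
          df = scrape_reddit(target_subreddits, start_date, end_date, post_limit=200)  # Get more posts

          if not df.empty:
              print(df.head())
      # The original excerpt ends before its error handling; this clause is an assumption.
      except Exception as exc:
          print(f"Reddit scrape failed: {exc}")
```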
**Rationale:** Reddit provides a valuable source of real-time discussion and sentiment regarding specific stocks and market trends. The selected subreddits were chosen for their high concentration of relevant financial discourse. Subreddits were also chosen for their varied userbases, which may aid in tracking a wider breadth of sentiments.
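
The excerpt above calls `scrape_reddit`, which is not shown. As a minimal sketch of the field-extraction step only, the helpers below flatten one post into a row and check it against the date range. The names `post_to_row` and `in_date_range` and the output column names are hypothetical, and a stand-in object replaces a real Reddit post so the example runs without credentials. The attributes read from the post (`title`, `selftext`, `score`, `num_comments`, `created_utc`) are standard PRAW submission attributes.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def post_to_row(post, subreddit_name):
    """Flatten one PRAW-style submission into a dict (column names are illustrative)."""
    return {
        "subreddit": subreddit_name,
        "title": post.title,
        "body": post.selftext,
        "score": post.score,
        "num_comments": post.num_comments,
        # created_utc is a Unix timestamp in seconds (UTC)
        "created_at": datetime.fromtimestamp(post.created_utc, tz=timezone.utc),
    }


def in_date_range(row, start_date, end_date):
    """Return True when the row's timestamp falls inside [start_date, end_date]."""
    return start_date <= row["created_at"] <= end_date


# Stand-in for a PRAW submission; the values are invented for illustration.
fake_post = SimpleNamespace(
    title="PFE earnings thread",
    selftext="",
    score=42,
    num_comments=7,
    created_utc=1745000000,  # a timestamp in mid-April 2025
)

row = post_to_row(fake_post, "stocks")
start = datetime(2025, 4, 11, tzinfo=timezone.utc)
end = datetime(2025, 5, 13, tzinfo=timezone.utc)
print(row["created_at"], in_date_range(row, start, end))
```

Using timezone-aware datetimes on both sides of the comparison avoids mixing local time with Reddit's UTC timestamps.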

### **NewsAPI Scraper**
News articles are pulled using NewsAPI for the following companies:
- Pfizer
- Moderna
- Lockheed Martin
- Raytheon

The script parses the title, description, publication date, source, and URL of each article. API access is configured via `.env`.

***Output:***
`scripts/newsapi_scraper.py` generates `data/news_articles.csv`

* **Code Snippet:**
``` python
def fetch_news_single_page(company_name, from_date, to_date, page=1):
    # Query parameters for one page of NewsAPI results
    params = {
        'q': company_name,          # Keyword to search for
        'language': 'en',
        'pageSize': 100,            # Max number of results per page
        'sortBy': 'publishedAt',
        'apiKey': API_KEY,          # API key loaded from the .env file
        'from': from_date.strftime('%Y-%m-%d'),
        'to': to_date.strftime('%Y-%m-%d'),
        'page': page
    }
    # ... (the request and response handling are omitted from this excerpt)
```
 **Rationale:** NewsAPI aggregates articles from a wide range of outlets with differing editorial perspectives. It was chosen for both its breadth and its ease of use.

  News articles can significantly impact stock prices by providing information about company performance, market trends, and breaking events. These companies were selected as representative examples in the pharmaceutical and defense sectors.
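
To make the parsing step concrete, here is a minimal sketch that flattens a NewsAPI-style JSON response into rows holding the fields listed above. The function name `parse_articles`, the output keys, and the sample payload are all hypothetical and invented for illustration. The response layout it assumes (an `articles` list whose items carry `title`, `description`, `publishedAt`, `url`, and a nested `source.name`) follows NewsAPI's documented format.

```python
def parse_articles(payload, company_name):
    """Turn a NewsAPI-style response dict into a list of flat row dicts."""
    rows = []
    for article in payload.get("articles", []):
        source = article.get("source") or {}  # guard against a missing source object
        rows.append({
            "company": company_name,
            "title": article.get("title"),
            "description": article.get("description"),
            "published_at": article.get("publishedAt"),
            "source": source.get("name"),
            "url": article.get("url"),
        })
    return rows


# Invented sample payload, shaped like a NewsAPI response.
sample_payload = {
    "status": "ok",
    "totalResults": 1,
    "articles": [
        {
            "source": {"id": None, "name": "Example Wire"},
            "title": "Pfizer reports quarterly results",
            "description": "Illustrative description text.",
            "url": "https://example.com/pfizer-results",
            "publishedAt": "2025-04-29T12:00:00Z",
        }
    ],
}

rows = parse_articles(sample_payload, "Pfizer")
print(rows[0]["source"], rows[0]["published_at"])
```

Reading every field with `.get()` keeps the parser from failing when an article lacks a description or source, which would otherwise raise a `KeyError`.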


### **Stock Price Collection**

Historical OHLCV (Open, High, Low, Close, Volume) data was downloaded from Yahoo Finance using the `yfinance` library for the same companies:
- `PFE` = Pfizer
- `MRNA` = Moderna
- `LMT` = Lockheed Martin
- `RTX` = Raytheon
  
In addition to the standard OHLCV data, 7-day and 14-day moving averages were calculated for the closing prices.

***Output:***
`scripts/stock_fetcher.py` generates `data/stock_data.csv`

**Code Snippet:**

``` python
    # Download historical stock data
    stock = yf.download(ticker, start=start_date, end=end_date)
    stock.reset_index(inplace=True)

    # Filter necessary columns and missing data
    stock = stock[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']].copy()
    stock.dropna(inplace=True)

    # Add moving averages, ticker column, and rearrange
    stock['MA_7'] = stock['Close'].rolling(window=7).mean()
    stock['MA_14'] = stock['Close'].rolling(window=14).mean()
    stock['ticker'] = ticker
    stock = stock[['ticker', 'Date', 'Open', 'High', 'Low', 'Close', 'MA_7', 'MA_14', 'Volume']]
```
**Rationale:** Historical stock price data provides the foundation for analyzing price trends and calculating technical indicators such as moving averages, which are used in trading strategies and financial analysis.

In this project, it is the price series against which online sentiment is compared when looking for correlations.
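
One detail of the moving-average step is worth illustrating: `rolling(window=n)` counts rows (trading days in this dataset), and the first `n - 1` rows have no value because the window is not yet full. The sketch below uses a 3-row window on invented closing prices so the effect is easy to see. The same behavior applies to the 7- and 14-row windows used in the project.

```python
import pandas as pd

# Invented closing prices for illustration.
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="Close")

# Default behavior: NaN until the window holds 3 observations.
ma_3 = close.rolling(window=3).mean()

# Alternative: min_periods=1 averages whatever is available so far.
ma_3_partial = close.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"Close": close, "MA_3": ma_3, "MA_3_partial": ma_3_partial}))
```

With the default settings, a 14-row moving average leaves the first 13 rows per ticker empty, which matters when the date range being analyzed is only about a month long.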



## **Step 2: Database Setup (SQLite)**
A structured SQLite database (`sentiment.db`) was created to store the collected data. SQLite was chosen for its lightweight nature, ease of use, and suitability for this project's data volume.

* The database schema was defined in `scripts/schema.sql`. This file specifies the tables and their columns, ensuring data integrity and organization.

* The `scripts/database.py` script was used to create the database from the schema file. This script reads the SQL commands in `schema.sql` and executes them to generate the database.

**Code Snippet (Illustrative, from `database.py`):**

``` python
import sqlite3

def create_database_from_schema(schema_file, db_file):
    """
    Creates an SQLite database from a given schema file.

    Args:
        schema_file (str): The path to the schema.sql file.
        db_file (str): The name of the database file to create.
    """
    try:
        with sqlite3.connect(db_file) as conn:
            with open(schema_file, 'r') as f:
                conn.executescript(f.read())
            print(f"Database file '{db_file}' created successfully from schema file '{schema_file}'.")
    # The original elides the exception types; catching these two is an assumption.
    except (sqlite3.Error, OSError) as e:
        print(f"Error creating database: {e}")  # Errors are printed for debugging.
```
**Rationale:** Using a database ensures that the collected data is stored in a structured and easily queryable format, facilitating efficient data retrieval and analysis in subsequent phases. A schema definition promotes data consistency.

The schema defines the following tables:
- `reddit_posts`: Contains scraped Reddit content
- `news_articles`: Contains article metadata from NewsAPI
- `stock_prices`: Contains historical OHLCV stock data & MAs for the selected companies
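
The contents of `schema.sql` are not reproduced here, so the following is only a sketch of what such a schema could look like. The three table names come from the project; every column name and type is an assumption. The script is applied to an in-memory database with `executescript()`, the same call used in `database.py`.

```python
import sqlite3

# Hypothetical schema: table names match the project, columns are assumed.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS reddit_posts (
    id INTEGER PRIMARY KEY,
    subreddit TEXT,
    title TEXT,
    body TEXT,
    score INTEGER,
    num_comments INTEGER,
    created_at TEXT
);

CREATE TABLE IF NOT EXISTS news_articles (
    id INTEGER PRIMARY KEY,
    company TEXT,
    title TEXT,
    description TEXT,
    published_at TEXT,
    source TEXT,
    url TEXT
);

CREATE TABLE IF NOT EXISTS stock_prices (
    ticker TEXT NOT NULL,
    date TEXT NOT NULL,
    open REAL,
    high REAL,
    low REAL,
    close REAL,
    ma_7 REAL,
    ma_14 REAL,
    volume INTEGER,
    PRIMARY KEY (ticker, date)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_SQL)
# List the tables that the script created.
tables = sorted(
    name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
)
conn.close()
print(tables)
```

A composite primary key on `(ticker, date)` is one way to stop the same trading day from being loaded twice for a ticker. Whether the project's schema does this is not stated.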


### Creating the Database from Schema

The `scripts/database.py` script reads the schema file and builds the SQLite database.

**Output:**
Creates `data/sentiment.db`

## **Step 3: Data Loading**

The data collected in Step 1 was loaded into the `sentiment.db` database using the `scripts/load_data.py` script.

* This script reads the data from the CSV files generated in Step 1 (`reddit_posts.csv`, `news_articles.csv`, and `stock_data.csv`) and inserts it into the corresponding tables in the database (`reddit_posts`, `news_articles`, and `stock_prices`).

* The `pandas` library was used to read the CSV files, and the `to_sql()` method was used to write the data to the database.

**Code Snippet (Illustrative, from `load_data.py`):**
``` python
import os

# Load data from CSV files into the database
load_csv_to_db(os.path.join("data", "reddit_posts.csv"), "reddit_posts")
load_csv_to_db(os.path.join("data", "news_articles.csv"), "news_articles")
load_csv_to_db(os.path.join("data", "stock_data.csv"), "stock_prices")
```
**Note on `os.path.join()`:** I built file paths with `os.path.join()` throughout the project so that they resolve correctly across operating systems. This keeps the scripts portable and easier to maintain.

**Tables ← CSV sources:**

`reddit_posts` ← `data/reddit_posts.csv`

`news_articles` ← `data/news_articles.csv`

`stock_prices` ← `data/stock_data.csv`

**Rationale:** This step centralizes all the collected data into the SQLite database, making it easier to manage and query for further analysis.
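
The body of `load_csv_to_db` is not shown above. A minimal sketch of how such a helper could combine `pd.read_csv()` and `to_sql()` follows. The `db_path` parameter, its default, the `if_exists="append"` choice, and the returned row count are assumptions rather than the project's actual implementation. The demo writes to a temporary directory so it does not touch real project files.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd

DEFAULT_DB_PATH = os.path.join("data", "sentiment.db")


def load_csv_to_db(csv_path, table_name, db_path=DEFAULT_DB_PATH):
    """Read a CSV with pandas and append its rows to an SQLite table."""
    df = pd.read_csv(csv_path)
    with closing(sqlite3.connect(db_path)) as conn:
        # "append" keeps rows already in the table; index=False skips the pandas index
        df.to_sql(table_name, conn, if_exists="append", index=False)
        conn.commit()
    return len(df)


# Demo with invented data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "stock_data.csv")
    db_path = os.path.join(tmp, "sentiment.db")
    pd.DataFrame({"ticker": ["PFE", "PFE"], "close": [23.1, 23.4]}).to_csv(csv_path, index=False)

    n_loaded = load_csv_to_db(csv_path, "stock_prices", db_path)

    with closing(sqlite3.connect(db_path)) as conn:
        n_in_table = conn.execute("SELECT COUNT(*) FROM stock_prices").fetchone()[0]

print(n_loaded, n_in_table)
```

With `if_exists="append"`, running the loader twice duplicates rows, so `if_exists="replace"` may be the safer choice when the CSV is always a full export.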


## Final Note on Phase 1

**Security Considerations:**

* API keys for NewsAPI and Reddit were stored in a `.env` file.

* The `.gitignore` file was configured to prevent the `.env` file from being committed to version control, ensuring that sensitive credentials are not exposed.
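
As a small illustration of reading a key without hard-coding it, the sketch below looks a variable up in the environment and fails fast when it is missing. It assumes the `.env` values have already been loaded into the process environment (for example by a loader such as `python-dotenv`, which the write-up does not name). The helper `require_env` and the variable name `NEWSAPI_KEY` are hypothetical.

```python
import os


def require_env(name):
    """Return an environment variable, or raise a clear error if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# For demonstration only: set a placeholder value so the lookup succeeds.
os.environ["NEWSAPI_KEY"] = "demo-key-not-real"
api_key = require_env("NEWSAPI_KEY")
print(len(api_key))
```

Failing at startup with a named variable is easier to debug than an authentication error surfacing later from the API.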

**Verification:**

* The contents of the `sentiment.db` database were verified using DB Browser for SQLite to ensure that the data was loaded correctly and the database schema was properly applied.

---




# Phase 2: Data Processing

## **Summary of Accomplishments:**

* Performed initial data cleaning and manipulation with `preprocess_data.py` using the `pandas` library.

* Processed text using natural language processing techniques with `clean_sentiment.db.py`, using the `nltk` library.

* Calculated sentiment scores using the VADER sentiment analysis tool with `calculate_sentiment.py`.

* Stored all processed data, including sentiment scores, back into the SQLite database.

## **Step 1: Initial Cleaning and Transformation**

The `preprocess_data.py` script performs the initial cleaning and transformation of the raw data. It reads data from the database created in *Phase 1*, manipulates it with pandas, and then writes the results back to the database.

#### Data Transformations:
**Reddit:**
```python
reddit_df['body'] = reddit_df['body'].fillna('')  # Fill missing body text with an empty string
reddit_df['combined_text'] = reddit_df['title'] + ' ' + reddit_df['body']  # Combine title and body into a new 'combined_text' column
reddit_df['created_at'] = pd.to_datetime(reddit_df['created_at'])  # Convert to datetime for later analysis
reddit_df.drop(columns=['title', 'body'], inplace=True)  # Title and body are no longer needed
reddit_df.reset_index(drop=True, inplace=True)  # Reset the index
```
Similar transformations are applied to the news data; they can be found in the `data/preprocess_data.py` file.

**Stocks:**

```python
# Date column converted to datetime format
stock_df['date'] = pd.to_datetime(stock_df['date'])

# Calculate daily percentage change, grouped by ticker
stock_df['daily_return'] = stock_df.groupby('ticker')['close'].pct_change()

```

**Rationale:** This step prepares the data for subsequent analysis by addressing missing values, combining relevant text fields, standardizing date formats, and calculating a key financial metric (daily return) for the stock data.  Storing the cleaned data in new tables preserves the original data and makes it clear which data has been processed.
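
The grouped `pct_change()` call deserves a short illustration, because it relies on the rows being in date order within each ticker. The sketch below uses invented prices for two tickers. It sorts first (an added precaution, not shown in the original excerpt) and then computes the daily return per ticker, so the first row of each ticker is `NaN` and one ticker's prices never feed into another's return.

```python
import pandas as pd

# Invented prices for two tickers over three trading days.
stock_df = pd.DataFrame({
    "ticker": ["PFE", "PFE", "PFE", "LMT", "LMT", "LMT"],
    "date": pd.to_datetime(["2025-04-11", "2025-04-14", "2025-04-15"] * 2),
    "close": [20.0, 21.0, 18.9, 400.0, 420.0, 399.0],
})

# Sort so that each ticker's rows are in chronological order.
stock_df = stock_df.sort_values(["ticker", "date"]).reset_index(drop=True)

# Daily return = (close today / close on the previous row for the same ticker) - 1
stock_df["daily_return"] = stock_df.groupby("ticker")["close"].pct_change()

print(stock_df)
```

Without the `groupby`, the first PFE row would be compared against the last LMT row and produce a meaningless return.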

## **Step 2: Text Data Processing**

The `clean_sentiment.db.py` script processed the data using natural language processing techniques. This was necessary in order to calculate accurate sentiment scores using VADER in the next step.

The script performs the following:

* Converts the text to lowercase using `.lower()`.

* Removes URLs and special characters using regular expressions (`re.sub()`).

* Tokenizes the text into individual words using `word_tokenize()`.

* Removes stop words (common words like "the", "and", "is") using NLTK's stop word list.

* Lemmatizes the words (reduces them to their base form) using NLTK's WordNet Lemmatizer.

The data is then loaded into pandas DataFrames and written to the cleaned tables.

**Code Snippet:**
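```python
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# The NLTK tokenizer, stop word, and WordNet resources must be downloaded once with nltk.download()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))  # Built once rather than on every call

def clean_text(text):
    if pd.isna(text):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)  # Remove URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Remove special characters
    tokens = word_tokenize(text)
    filtered = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(filtered)
```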
**Rationale:** This step removes noise and irrelevant information from the text data, standardizing it and improving the accuracy of the sentiment analysis in the next step.  Lemmatization helps to group words with similar meanings.  Retaining the original timestamps is essential for time-series analysis and merging with stock price data.
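
To show what the two regular expressions do in isolation, the sketch below applies only the lowercase, URL-removal, and special-character steps to an invented post title. The function name `strip_noise` is hypothetical, and the NLTK steps (tokenization, stop word removal, lemmatization) are deliberately left out so the example runs with the standard library alone.

```python
import re


def strip_noise(text):
    """Lowercase the text, drop URLs and non-alphanumeric characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # remove URLs
    text = re.sub(r"[^a-z0-9\s]", "", text)      # remove punctuation and symbols
    return " ".join(text.split())                # collapse runs of whitespace


example = "Check https://example.com NOW!!! $PFE up 5%"
cleaned = strip_noise(example)
print(cleaned)
```

Note that the ticker marker `$` and the percent sign are removed along with other punctuation, so `$PFE` becomes `pfe` and `5%` becomes `5`.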

## **Step 3: Sentiment Analysis**

The `calculate_sentiment.py` script calculates sentiment scores for the cleaned text data using the VADER sentiment analysis tool. VADER combines the valence scores of the individual words in the combined text into an overall compound score, normalized to the range -1 (most negative) to +1 (most positive).

The sentiment scores are then stored in the database.

**Code Snippet:**

```python
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize VADER sentiment analyzer (requires nltk.download('vader_lexicon') once)
analyzer = SentimentIntensityAnalyzer()

def get_vader_sentiment(text):
    """Calculates the VADER sentiment scores for the input text."""
    if pd.isna(text):
        # Missing text gets an all-zero score
        return {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}
    return analyzer.polarity_scores(text)
```

The `get_vader_sentiment()` function calculates the negative, neutral, positive, and compound sentiment scores for each text.  The compound score is stored in a new column (`reddit_sentiment` and `news_sentiment` for Reddit and News data, respectively).

The processed DataFrames, now containing the sentiment scores, are written back to the `cleaned_reddit_posts` and `cleaned_news_articles` tables in the `sentiment.db` database.

**Rationale:** This step quantifies the emotional tone of the text data, providing a numerical representation of market sentiment. VADER is well-suited for analyzing text from social media and news articles. Storing the sentiment scores in the database allows for easy integration with other data, such as stock prices, in subsequent analysis.
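
As one possible illustration of that integration, the sketch below averages post-level sentiment per calendar day and joins it to daily returns on the date. It is a pandas sketch under stated assumptions, not the project's method: all values are invented, the column names `created_at`, `reddit_sentiment`, `date`, and `daily_return` are borrowed from the earlier steps, and a single ticker is assumed for simplicity.

```python
import pandas as pd

# Invented post-level sentiment scores.
reddit_df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2025-04-14 09:30", "2025-04-14 15:00", "2025-04-15 10:00", "2025-04-16 12:00",
    ]),
    "reddit_sentiment": [0.6, 0.2, -0.4, 0.1],
})

# Invented daily returns for a single ticker.
stock_df = pd.DataFrame({
    "date": pd.to_datetime(["2025-04-14", "2025-04-15", "2025-04-16"]),
    "daily_return": [0.02, -0.01, 0.005],
})

# Average the compound scores per calendar day (normalize() drops the time of day).
daily_sentiment = (
    reddit_df.assign(date=reddit_df["created_at"].dt.normalize())
    .groupby("date", as_index=False)["reddit_sentiment"]
    .mean()
)

# Left join keeps every trading day, even those without posts.
merged = stock_df.merge(daily_sentiment, on="date", how="left")

# Pearson correlation between same-day sentiment and return.
correlation = merged["daily_return"].corr(merged["reddit_sentiment"])
print(merged)
```

A correlation computed on three invented rows carries no meaning; the point is only the shape of the aggregation and join. Days with no posts end up with a missing sentiment value, which `corr()` skips.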


# ***NEXT STEPS IN POWERBI***

**Key results on my Substack: https://rashidalouat.substack.com/**