# Phase 2: Sentiment Scoring - VADER vs FinBERT Comparison

## Objective

Transform 52,974 raw news headlines into daily, ticker-specific sentiment time series using:
- **VADER** (Lexicon-based approach)
- **FinBERT** (Transformer-based approach)

Scoring the same headlines with both models enables a direct comparison between a lexicon-based baseline (VADER) and a transformer-based, state-of-the-art model (FinBERT) on financial text.
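To make the target shape of a "daily, ticker-specific sentiment time series" concrete, here is a minimal sketch using only the standard library. The sample rows, the `aggregate_daily` helper, and the choice of a plain mean are illustrative assumptions, not the notebook's actual aggregation logic.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored headlines: (date, ticker, sentiment score in [-1, 1]).
scored = [
    ("2024-01-02", "AAPL", 0.50),
    ("2024-01-02", "AAPL", -0.10),
    ("2024-01-02", "MSFT", 0.30),
    ("2024-01-03", "AAPL", 0.20),
]

def aggregate_daily(rows):
    """Average headline scores per (date, ticker) pair.

    Assumption: a simple mean; the notebook may weight or aggregate differently.
    """
    buckets = defaultdict(list)
    for date, ticker, score in rows:
        buckets[(date, ticker)].append(score)
    # Sort keys so the result reads as a time series per ticker.
    return {key: round(mean(vals), 4) for key, vals in sorted(buckets.items())}

daily = aggregate_daily(scored)
for (date, ticker), score in daily.items():
    print(date, ticker, score)
```

Each `(date, ticker)` key ends up with one score, which is the row granularity a per-ticker daily series needs. In a pandas workflow the same idea would typically be expressed as a `groupby` on date and ticker.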

## Pipeline Overview

```
News (52,974 headlines)
    ↓
1. Entity Resolution (yfinance keyword matching)
    ↓
2. News Attribution (assign to tickers or MARKET_GENERAL)
    ↓
3. VADER Scoring (lexicon-based, CPU)
    ↓
4. FinBERT Scoring (transformer-based, GPU)
    ↓
5. Validation & Comparison
    ↓
6. Daily Aggregation by Ticker
    ↓
Output: sentiment_scores_60.csv + market_sentiment_general.csv
```
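Steps 1 and 2 of the pipeline above (keyword matching, then assignment to tickers or `MARKET_GENERAL`) can be sketched as follows. The `TICKER_KEYWORDS` map and the `attribute` function are hypothetical stand-ins: the notebook derives its keywords from yfinance, and its matching rules may differ.

```python
import re

# Hypothetical keyword map; the notebook derives its own keywords via yfinance.
TICKER_KEYWORDS = {
    "AAPL": ["Apple", "iPhone"],
    "MSFT": ["Microsoft", "Azure"],
}

# Pre-compile one case-insensitive, word-bounded pattern per ticker.
PATTERNS = {
    ticker: re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    for ticker, words in TICKER_KEYWORDS.items()
}

def attribute(headline):
    """Return every ticker whose keywords appear in the headline.

    Headlines that match no ticker fall back to the MARKET_GENERAL bucket.
    """
    matched = [t for t, pattern in PATTERNS.items() if pattern.search(headline)]
    return matched or ["MARKET_GENERAL"]

print(attribute("Apple unveils new iPhone"))
print(attribute("Fed holds rates steady"))
```

The word boundaries (`\b`) keep a keyword such as "Apple" from matching inside "Pineapple". A headline that mentions several companies is attributed to each of them, which is one possible design choice among several.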

---

## 1. Environment & Setup

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

import yfinance as yf

import matplotlib.pyplot as plt
import seaborn as sns

import os
import re
from tqdm import tqdm

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f'  GPU detected: {torch.cuda.get_device_name(0)}')
    print(f'  CUDA version: {torch.version.cuda}')
    print(f'  Memory allocated: {torch.cuda.memory_allocated(0) / 1024**2:.2f} MB')
    print(f'  Memory reserved: {torch.cuda.memory_reserved(0) / 1024**2:.2f} MB')
else:
    device = torch.device('cpu')

print(f'\nDevice set to: {device}')

  GPU detected: Tesla T4
  CUDA version: 12.8
  Memory allocated: 0.00 MB
  Memory reserved: 0.00 MB

Device set to: cuda


In [6]:
project_root = '/content/drive/MyDrive/market-sentiment-impact-analysis'

data_processed = os.path.join(project_root, 'data', 'processed')
data_tickers = os.path.join(project_root, 'data', 'tickers')

os.makedirs(data_processed, exist_ok=True)

print(f"Project Root: {project_root}")
print(f"Processed Data: {data_processed}")
print(f"Tickers Data: {data_tickers}")

Project Root: /content/drive/MyDrive/market-sentiment-impact-analysis
Processed Data: /content/drive/MyDrive/market-sentiment-impact-analysis/data/processed
Tickers Data: /content/drive/MyDrive/market-sentiment-impact-analysis/data/tickers
