# Phase 2: Sentiment Scoring - VADER vs FinBERT Comparison

## Objective

Transform 52,974 raw news headlines into daily, ticker-specific sentiment time series using:
- **VADER** (Lexicon-based approach)
- **FinBERT** (Transformer-based approach)

Scoring the same headlines with both models allows a direct comparison between a general-purpose lexicon baseline and a transformer fine-tuned on financial text.
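One practical detail of such a comparison is scale: VADER's compound score lies in [-1, 1], while a FinBERT-style classifier returns logits over positive, negative, and neutral classes. The following is a minimal sketch of one way to put both on a common signed scale, assuming the score is defined as P(positive) - P(negative). That convention, the name `signed_score`, and the default label order are illustrative assumptions, not necessarily what this notebook does; with a real model the label order should be read from `model.config.id2label`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def signed_score(logits, labels=("positive", "negative", "neutral")):
    """Map 3-class logits to a score in [-1, 1] as P(positive) - P(negative).

    `labels` gives the class order of `logits`; the default order is an
    assumption for illustration only.
    """
    probs = dict(zip(labels, softmax(logits)))
    return probs["positive"] - probs["negative"]

# Made-up logits, not real model output.
score_bullish = signed_score([2.0, -1.0, 0.5])  # positive class dominates
score_neutral = signed_score([0.0, 0.0, 3.0])   # neutral class dominates
```

A headline the classifier considers mostly neutral lands near zero under this convention, which mirrors how a neutral headline behaves under VADER's compound score.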

## Pipeline Overview

```
News (53k articles)
    ↓
1. Entity Resolution (yfinance keyword matching)
    ↓
2. News Attribution (assign to tickers or MARKET_GENERAL)
    ↓
3. VADER Scoring (lexicon-based, CPU)
    ↓
4. FinBERT Scoring (transformer-based, GPU)
    ↓
5. Validation & Comparison
    ↓
6. Daily Aggregation by Ticker
    ↓
Output: sentiment_scores_60.csv + market_sentiment_general.csv
```
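Steps 1, 2, and 6 of the pipeline above can be sketched in plain Python. This is a minimal illustration, assuming attribution is a case-insensitive whole-word keyword match and that daily aggregation is a simple mean. The `KEYWORDS` map, the sample rows, and the function names are hypothetical, and the scores are placeholder values rather than real model output.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical keyword map; in the pipeline these would come from the
# yfinance-based entity resolution step.
KEYWORDS = {
    "AAPL": ["apple", "iphone"],
    "TSLA": ["tesla"],
}

def attribute(headline, keywords=KEYWORDS):
    """Return tickers whose keywords appear as whole words, else MARKET_GENERAL."""
    text = headline.lower()
    hits = [
        ticker
        for ticker, words in keywords.items()
        if any(re.search(rf"\b{re.escape(w)}\b", text) for w in words)
    ]
    return hits or ["MARKET_GENERAL"]

def daily_mean(rows):
    """Average scores per (date, ticker); rows are (date, headline, score)."""
    buckets = defaultdict(list)
    for date, headline, score in rows:
        for ticker in attribute(headline):
            buckets[(date, ticker)].append(score)
    return {key: mean(vals) for key, vals in buckets.items()}

# Placeholder scores, not real model output.
rows = [
    ("2024-01-02", "Apple unveils new iPhone", 0.6),
    ("2024-01-02", "Apple faces lawsuit", -0.4),
    ("2024-01-02", "Stocks drift ahead of jobs report", 0.0),
]
daily = daily_mean(rows)
```

The whole-word match keeps a headline about "Pineapple" from being attributed to AAPL. A headline that names several companies contributes its score to each matched ticker, which is one possible design choice among several.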

---

## 1. Environment & Setup

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
# Silence all warnings to keep cell output readable (this also hides deprecation notices)
warnings.filterwarnings('ignore')

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

import yfinance as yf

import matplotlib.pyplot as plt
import seaborn as sns

import os
import re
from tqdm import tqdm

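# Wider pandas display for inspecting DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

# Plot styling (the 'seaborn-v0_8-*' style names require matplotlib >= 3.6)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")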

In [None]:
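# Project root on Google Drive (assumes Drive is already mounted at /content/drive in Colab)
project_root = '/content/drive/MyDrive/market-sentiment-impact-analysis'

data_processed = os.path.join(project_root, 'data', 'processed')
data_tickers = os.path.join(project_root, 'data', 'tickers')

# Create the output folder if needed; data_tickers is not created here
os.makedirs(data_processed, exist_ok=True)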

print(f"Project Root: {project_root}")
print(f"Processed Data: {data_processed}")
print(f"Tickers Data: {data_tickers}")
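
The cell above creates `data_processed` if needed but only prints `data_tickers`. A minimal sketch of an early check for required input folders follows, assuming a missing folder should stop the run before any scoring starts. `require_dirs` is a hypothetical helper, not part of the notebook.

```python
import os

def require_dirs(*paths):
    """Raise FileNotFoundError naming every path that is not an existing directory."""
    missing = [p for p in paths if not os.path.isdir(p)]
    if missing:
        raise FileNotFoundError("Missing directories: " + ", ".join(missing))
    return list(paths)
```

Calling `require_dirs(data_tickers)` right after the setup cell would surface an unmounted Drive or a mistyped path immediately, rather than partway through the pipeline.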