# Phase 1b: Data Ingestion & Schema Harmonization

## Objective

Transform raw, messy news text and price data from an external API into two clean, separate datasets:

1. **`daily_news_cleaned.csv`**: A harmonized list of unique news events with standardized timestamps
2. **`stock_returns_60.csv`**: A clean record of log-returns for the 60 target stocks, preserving sector and beta metadata
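
For reference, the log-return of a price series is conventionally defined as r_t = ln(P_t / P_{t-1}). The sketch below illustrates that definition on made-up closing prices. The helper name `log_returns` and the sample values are assumptions for illustration only, not the notebook's actual code.

```python
import numpy as np

def log_returns(prices):
    """Return ln(P_t / P_{t-1}) for each consecutive pair of prices."""
    prices = np.asarray(prices, dtype=float)
    # log(a / b) == log(a) - log(b), so differencing the logs gives log-returns
    return np.diff(np.log(prices))

# Made-up closing prices, purely illustrative
closes = [100.0, 110.0, 99.0]
r = log_returns(closes)

# Log-returns are additive: their sum equals the log-return over the whole window
total = r.sum()
print(r, total)
```

One reason log-returns are convenient here is that they sum across days, so a multi-day return is the sum of the daily values.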

## Data Quality Challenges

Each news source has distinct irregularities that must be handled individually:

| Source | Issues | Solutions |
|--------|--------|----------|
| **CNBC** | Empty rows, "ET" timezone noise, verbose date format | Drop NaN rows, strip timezone, parse datetime |
| **Guardian** | No description column, `\n\n` in text, DMY format | Fill empty descriptions, clean newlines, parse DMY |
| **Reuters** | Clean format (minimal issues) | Direct date parsing (MMM DD YYYY) |
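
The per-source fixes in the table can be sketched as follows. This is a minimal illustration using only the standard library so that the format strings are explicit. The sample strings and the function names (`parse_cnbc`, `parse_guardian`, `parse_reuters`, `clean_text`) are hypothetical, and the real column values may differ from these assumed formats. The same format codes can be passed to `pd.to_datetime(..., format=...)` when working on whole columns.

```python
import re
from datetime import datetime

# Hypothetical sample values; the actual raw files may use other layouts
CNBC_SAMPLE = "7:51 PM ET Fri, 17 July 2020"
GUARDIAN_SAMPLE = "18-Jul-20"
REUTERS_SAMPLE = "Jul 18 2020"

def parse_cnbc(raw):
    """Strip the 'ET' token, then parse the verbose CNBC-style timestamp."""
    cleaned = re.sub(r"\bET\b", "", raw)
    cleaned = " ".join(cleaned.split())  # collapse the gap left by the removal
    return datetime.strptime(cleaned, "%I:%M %p %a, %d %B %Y")

def parse_guardian(raw):
    """Parse a day-first (DMY) date such as '18-Jul-20'."""
    return datetime.strptime(raw.strip(), "%d-%b-%y")

def parse_reuters(raw):
    """Parse a 'MMM DD YYYY' date such as 'Jul 18 2020'."""
    return datetime.strptime(raw.strip(), "%b %d %Y")

def clean_text(text):
    """Replace newline runs (e.g. '\\n\\n') and repeated spaces with one space."""
    return " ".join(text.split())

cnbc_dt = parse_cnbc(CNBC_SAMPLE)
guardian_dt = parse_guardian(GUARDIAN_SAMPLE)
reuters_dt = parse_reuters(REUTERS_SAMPLE)
headline = clean_text("Markets rally\n\nafter earnings")
print(cnbc_dt, guardian_dt, reuters_dt, headline)
```

Keeping one small parser per source makes each irregularity visible and testable on its own before the results are merged into a single timestamp column.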

---

## Setup: Imports and Path Configuration

In [None]:
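# Core data handling
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Suppress all warnings to keep notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Market data download
import yfinance as yf

# File system helpers
import os
from pathlib import Path

# Show every column and use wide output when inspecting DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)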

In [None]:
# Project root on Google Drive (path as mounted in Colab)
project_root = '/content/drive/MyDrive/market-sentiment-impact-analysis'

# Sub-directories for raw inputs, processed outputs, and ticker lists
data_raw = os.path.join(project_root, 'data', 'raw')
data_processed = os.path.join(project_root, 'data', 'processed')
data_tickers = os.path.join(project_root, 'data', 'tickers')

# Create the directories if they do not exist yet
os.makedirs(data_raw, exist_ok=True)
os.makedirs(data_processed, exist_ok=True)
os.makedirs(data_tickers, exist_ok=True)
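
Since `Path` is already imported, the same directory layout could also be expressed with `pathlib`. The sketch below is one possible equivalent, not the notebook's own code. The helper name `make_data_dirs` is hypothetical, and a temporary directory stands in for the Drive path so the example runs anywhere.

```python
import tempfile
from pathlib import Path

def make_data_dirs(root):
    """Create data/raw, data/processed and data/tickers under root."""
    root = Path(root)
    dirs = {name: root / "data" / name for name in ("raw", "processed", "tickers")}
    for path in dirs.values():
        # parents=True also creates the intermediate 'data' directory
        path.mkdir(parents=True, exist_ok=True)
    return dirs

# Demonstrate against a throwaway directory rather than the real project root
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_data_dirs(tmp)
    created = sorted(p.name for p in (Path(tmp) / "data").iterdir())
    print(created)
```

Returning the paths in a dictionary keeps the three locations together, for example `dirs["processed"] / "daily_news_cleaned.csv"` for the first output file.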