# Phase 1b: Data Ingestion & Schema Harmonization

## Objective

Transform raw, messy text and external API data into two clean, separate datasets:

1. **`daily_news_cleaned.csv`**: A harmonized list of unique news events with standardized timestamps
2. **`stock_returns_60.csv`**: A clean record of log-returns for the 60 target stocks, preserving sector and beta metadata

## Data Quality Challenges

Each news source has distinct irregularities that must be handled individually:

| Source | Issues | Solutions |
|--------|--------|----------|
| **CNBC** | Empty rows, "ET" timezone noise, verbose date format | Drop NaN rows, strip timezone, parse datetime |
| **Guardian** | No description column, `\n\n` in text, DMY format | Fill empty descriptions, clean newlines, parse DMY |
| **Reuters** | Clean format (minimal issues) | Direct date parsing (MMM DD YYYY) |
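
The date formats in the table can be illustrated with a small sketch. The sample strings below are hypothetical (the actual Guardian and Reuters files may use a different layout), and the format codes are assumptions derived from the "DMY" and "MMM DD YYYY" descriptions above.

```python
import pandas as pd

# Hypothetical sample strings; the real source files may differ.
guardian_sample = pd.Series(["18-Jul-20", "03-Jan-19"])     # assumed day-month-year layout
reuters_sample = pd.Series(["Jul 18 2020", "Jan 03 2019"])  # MMM DD YYYY

# An explicit format avoids day/month ambiguity; errors="coerce" turns
# unparseable strings into NaT instead of raising.
guardian_dates = pd.to_datetime(guardian_sample, format="%d-%b-%y", errors="coerce")
reuters_dates = pd.to_datetime(reuters_sample, format="%b %d %Y", errors="coerce")

print(guardian_dates.tolist())
print(reuters_dates.tolist())
```

Passing an explicit `format` is a design choice rather than a requirement: it fails loudly (or yields `NaT` with `errors="coerce"`) when a row deviates from the expected layout, which makes irregular rows easier to spot.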

---

## 1. Setup: Imports and Path Configuration

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

import yfinance as yf

import os
from pathlib import Path

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
project_root = '/content/drive/MyDrive/market-sentiment-impact-analysis'

data_raw = os.path.join(project_root, 'data', 'raw')
data_processed = os.path.join(project_root, 'data', 'processed')
data_tickers = os.path.join(project_root, 'data', 'tickers')

os.makedirs(data_raw, exist_ok=True)
os.makedirs(data_processed, exist_ok=True)
os.makedirs(data_tickers, exist_ok=True)

## 2. News Ingestion & Specific Schema Mapping

Each news source requires individual processing due to distinct data quality issues.


### 2.1 CNBC Data Processing

**Input Schema**: `Headlines`, `Time`, `Description`

**Data Irregularities**:
- Empty rows (`,,` patterns)
- Timezone noise: `"7:51 PM ET Fri, 17 July 2020"` contains "ET" and day names
- Verbose date format requiring custom parsing

**Action Plan**:
1. Load CSV, drop fully empty rows
2. Strip " ET" from time strings
3. Parse datetime with custom logic
4. Rename columns to standard schema
5. Add source tag
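
The five steps above can be sketched as a single function. This is a minimal illustration, not the notebook's implementation: `standardize_cnbc` is a hypothetical helper, the sample rows are placeholders, and the explicit format string is an assumption based on the example timestamp shown above.

```python
import pandas as pd

def standardize_cnbc(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that chains the five steps of the action plan."""
    df = raw.dropna(how="all").copy()                    # 1. drop fully empty rows
    cleaned = (
        df["Time"]
        .str.replace(" ET", "", regex=False)             # 2. strip the timezone marker
        .str.replace(r"\s+", " ", regex=True)            #    collapse repeated spaces
        .str.strip()
    )
    df["date"] = pd.to_datetime(                         # 3. parse the datetime
        cleaned, format="%I:%M %p %a, %d %B %Y", errors="coerce"
    )
    df = df.rename(                                      # 4. rename to the standard schema
        columns={"Headlines": "headline", "Description": "description"}
    )
    df["source"] = "CNBC"                                # 5. add the source tag
    return df[["date", "headline", "description", "source"]]

# Placeholder rows: one populated, one fully empty.
sample = pd.DataFrame({
    "Headlines": ["Example headline", None],
    "Time": [" 7:51  PM ET Fri, 17 July 2020", None],
    "Description": ["Example description", None],
})
result = standardize_cnbc(sample)
print(result)
```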

In [6]:
cnbc_path = os.path.join(data_raw, 'cnbc_headlines.csv')

print(f'Loading CNBC data from {cnbc_path}')
cnbc_raw = pd.read_csv(cnbc_path)

print(f'Initial shape: {cnbc_raw.shape}')
print('\nFirst 5 rows:')
print(cnbc_raw.head())
print(f'\nColumn names: {list(cnbc_raw.columns)}')

Loading CNBC data from /content/drive/MyDrive/market-sentiment-impact-analysis/data/raw/cnbc_headlines.csv
Initial shape: (3080, 3)

First 5 rows:
                                           Headlines                            Time                                        Description
0  Jim Cramer: A better way to invest in the Covi...   7:51  PM ET Fri, 17 July 2020  "Mad Money" host Jim Cramer recommended buying...
1     Cramer's lightning round: I would own Teradyne   7:33  PM ET Fri, 17 July 2020  "Mad Money" host Jim Cramer rings the lightnin...
2                                                NaN                             NaN                                                NaN
3  Cramer's week ahead: Big week for earnings, ev...   7:25  PM ET Fri, 17 July 2020  "We'll pay more for the earnings of the non-Co...
4  IQ Capital CEO Keith Bliss says tech and healt...   4:24  PM ET Fri, 17 July 2020  Keith Bliss, IQ Capital CEO, joins "Closing Be...

Column names: ['Headlines', 'Time', 'Description']

In [7]:
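# Drop completely empty rows; .copy() makes the result independent of cnbc_raw
# so that columns can be added later without a SettingWithCopyWarning
cnbc_clean = cnbc_raw.dropna(how='all').copy()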

rows_removed = len(cnbc_raw) - len(cnbc_clean)
print(f'Removed {rows_removed} completely empty rows')
print(f'Shape after cleaning: {cnbc_clean.shape}')

Removed 280 completely empty rows
Shape after cleaning: (2800, 3)
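
The choice of `how='all'` matters here: it removes only rows in which every column is missing, whereas the default `how='any'` would also discard rows with a headline but no description. A small sketch on toy data (the values are placeholders, not taken from the CNBC file):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Headlines": ["A", np.nan, "C"],
    "Time": ["t1", np.nan, "t3"],
    "Description": ["d1", np.nan, np.nan],  # row 2 lacks only a description
})

only_empty_removed = toy.dropna(how="all")   # drops row 1 only
any_missing_removed = toy.dropna(how="any")  # drops rows 1 and 2

print(only_empty_removed.index.tolist())
print(any_missing_removed.index.tolist())
```

Note that `dropna` keeps the original index labels, which is why later outputs in this notebook skip from index 1 to index 3. Calling `reset_index(drop=True)` afterwards would renumber the rows if a contiguous index is preferred.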


In [8]:
# Clean the Time column: remove the " ET" timezone marker
# Example: "7:51  PM ET Fri, 17 July 2020" -> "7:51  PM Fri, 17 July 2020"

def clean_cnbc_time(time_str):
    if pd.isna(time_str):
        return time_str
    cleaned = str(time_str).replace(' ET', '')
    return cleaned

cnbc_clean['Time_Cleaned'] = cnbc_clean['Time'].apply(clean_cnbc_time)

print('Sample of cleaned timestamps:')
print(cnbc_clean[['Time', 'Time_Cleaned']].head(3))

Sample of cleaned timestamps:
                             Time                 Time_Cleaned
0   7:51  PM ET Fri, 17 July 2020   7:51  PM Fri, 17 July 2020
1   7:33  PM ET Fri, 17 July 2020   7:33  PM Fri, 17 July 2020
3   7:25  PM ET Fri, 17 July 2020   7:25  PM Fri, 17 July 2020
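
As an alternative to the row-wise `apply`, the same cleanup can be sketched with vectorized string methods. The regular expression below is an assumption about what counts as noise: it removes `ET` only when it appears as a standalone token, then collapses the doubled spaces visible in the raw data.

```python
import pandas as pd

times = pd.Series([
    " 7:51  PM ET Fri, 17 July 2020",
    "4:24 PM ET Fri, 17 July 2020",
    None,  # missing values pass through unchanged
])

cleaned = (
    times.str.replace(r"\bET\b", "", regex=True)  # remove the standalone "ET" token
         .str.replace(r"\s+", " ", regex=True)    # collapse runs of whitespace
         .str.strip()                             # trim leading/trailing spaces
)
print(cleaned.tolist())
```

The word boundaries (`\b`) guard against clipping other tokens that merely contain the letters "ET", which a plain substring replacement would not do.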


In [10]:
# Parse datetime

def parse_cnbc_datetime(time_str):
    if pd.isna(time_str):
        return pd.NaT
    try:
        return pd.to_datetime(time_str, format='mixed', utc=False)
    except (ValueError, TypeError):
        return pd.NaT

cnbc_clean['date'] = cnbc_clean['Time_Cleaned'].apply(parse_cnbc_datetime)

print('Date parsing results:')
print(f"Successfully parsed: {cnbc_clean['date'].notna().sum()}")
print(f"Failed to parse: {cnbc_clean['date'].isna().sum()}")
print('\nSample parsed dates:')
print(cnbc_clean[['Time_Cleaned', 'date']].head())

Date parsing results:
Successfully parsed: 2800
Failed to parse: 0

Sample parsed dates:
                  Time_Cleaned                date
0   7:51  PM Fri, 17 July 2020 2020-07-17 19:51:00
1   7:33  PM Fri, 17 July 2020 2020-07-17 19:33:00
3   7:25  PM Fri, 17 July 2020 2020-07-17 19:25:00
4   4:24  PM Fri, 17 July 2020 2020-07-17 16:24:00
5   7:36  PM Thu, 16 July 2020 2020-07-16 19:36:00
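
The parsed timestamps are timezone-naive because the "ET" marker was stripped before parsing. If the timestamps later need to be aligned with market data in another timezone, the zone can be reattached. The sketch below assumes that "ET" means US Eastern time (the IANA zone `America/New_York`); the two input values are taken from the sample output above.

```python
import pandas as pd

# Naive timestamps, as produced by the parsing step.
naive = pd.Series(pd.to_datetime(["2020-07-17 19:51:00", "2020-07-16 19:36:00"]))

# Assumption: "ET" refers to US Eastern time. Ambiguous or nonexistent
# wall-clock times around DST transitions become NaT instead of raising.
eastern = naive.dt.tz_localize(
    "America/New_York", ambiguous="NaT", nonexistent="NaT"
)
utc = eastern.dt.tz_convert("UTC")

print(eastern.iloc[0])
print(utc.iloc[0])
```

Whether to store local or UTC timestamps is a design decision that depends on how the news events are joined to trading days downstream; this sketch only shows that the information can be recovered.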


In [12]:
# Schema standardization
# Time -> date, Headlines -> headline, Description -> description
# Add: source = "CNBC"

cnbc_standard = cnbc_clean[['date', 'Headlines', 'Description']].copy()
cnbc_standard.columns = ['date', 'headline', 'description']
cnbc_standard['source'] = 'CNBC'

cnbc_standard = cnbc_standard.dropna(subset=['date'])

print(f'CNBC standardized shape: {cnbc_standard.shape}')
print(f'\nColumns: {list(cnbc_standard.columns)}')
print('\nSample:')
print(cnbc_standard.head(3))

CNBC standardized shape: (2800, 4)

Columns: ['date', 'headline', 'description', 'source']

Sample:
                 date                                           headline                                        description source
0 2020-07-17 19:51:00  Jim Cramer: A better way to invest in the Covi...  "Mad Money" host Jim Cramer recommended buying...   CNBC
1 2020-07-17 19:33:00     Cramer's lightning round: I would own Teradyne  "Mad Money" host Jim Cramer rings the lightnin...   CNBC
3 2020-07-17 19:25:00  Cramer's week ahead: Big week for earnings, ev...  "We'll pay more for the earnings of the non-Co...   CNBC
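
Because all three sources are meant to share this schema, a lightweight check before concatenation can catch mapping mistakes early. The following is a sketch only: `validate_schema` and `EXPECTED_COLUMNS` are hypothetical names, and the toy row is a placeholder. It also shows an explicit `rename` mapping as an alternative to assigning column names by position.

```python
import pandas as pd

# Column order as shown in the standardized output above.
EXPECTED_COLUMNS = ["date", "headline", "description", "source"]

def validate_schema(df: pd.DataFrame) -> None:
    """Hypothetical check that a frame matches the standard news schema."""
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {list(df.columns)}")
    if not pd.api.types.is_datetime64_any_dtype(df["date"]):
        raise TypeError("'date' column is not a datetime dtype")
    if df["date"].isna().any():
        raise ValueError("'date' column contains missing values")

toy = pd.DataFrame({
    "Time": pd.to_datetime(["2020-07-17 19:51:00"]),
    "Headlines": ["Example headline"],
    "Description": ["Example description"],
})

# A name-based mapping does not depend on the order of the selected columns.
standard = (
    toy.rename(columns={"Time": "date", "Headlines": "headline",
                        "Description": "description"})
       .assign(source="CNBC")[EXPECTED_COLUMNS]
)
validate_schema(standard)  # raises if the schema deviates
print(standard.dtypes)
```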
