# Phase 1b: Data Ingestion & Schema Harmonization

## Objective

Transform raw, messy text and external API data into two clean, separate datasets:

1. **`daily_news_cleaned.csv`**: A harmonized list of unique news events with standardized timestamps
2. **`stock_returns_60.csv`**: A clean record of log-returns for the 60 target stocks, preserving sector and beta metadata
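
For reference, the log-return of a price series is $r_t = \ln(P_t / P_{t-1})$. The sketch below uses made-up prices rather than project data, and only illustrates the definition and its additivity across days.

```python
import math

# Hypothetical closing prices, for illustration only
prices = [100.0, 102.0, 101.0]

# r_t = ln(P_t / P_{t-1}); the first day has no previous price, so it is skipped
log_returns = [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

# Log-returns are additive: their sum equals ln(P_last / P_first)
total = sum(log_returns)
print(log_returns, total)
```

Additivity is one common reason to prefer log-returns over simple percentage changes when aggregating over multi-day windows.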

## Data Quality Challenges

Each news source has distinct irregularities that must be handled individually:

| Source | Issues | Solutions |
|--------|--------|----------|
| **CNBC** | Empty rows, "ET" timezone noise, verbose date format | Drop NaN rows, strip timezone, parse datetime |
| **Guardian** | No description column, `\n\n` in text, DMY format | Fill empty descriptions, clean newlines, parse DMY |
| **Reuters** | Clean format (minimal issues) | Direct date parsing (MMM DD YYYY) |

---

## 1. Setup: Imports and Path Configuration

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

import yfinance as yf

import os
from pathlib import Path

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
project_root = '/content/drive/MyDrive/market-sentiment-impact-analysis'

data_raw = os.path.join(project_root, 'data', 'raw')
data_processed = os.path.join(project_root, 'data', 'processed')
data_tickers = os.path.join(project_root, 'data', 'tickers')

os.makedirs(data_raw, exist_ok=True)
os.makedirs(data_processed, exist_ok=True)
os.makedirs(data_tickers, exist_ok=True)

## 2. News Ingestion & Specific Schema Mapping

Each news source requires individual processing due to distinct data quality issues.


### 2.1 CNBC Data Processing

**Input Schema**: `Headlines`, `Time`, `Description`

**Data Irregularities**:
- Empty rows (`,,` patterns)
- Timezone noise: `"7:51 PM ET Fri, 17 July 2020"` contains "ET" and day names
- Verbose date format requiring custom parsing

**Action Plan**:
1. Load CSV, drop fully empty rows
2. Strip " ET" from time strings
3. Parse datetime with custom logic
4. Rename columns to standard schema
5. Add source tag

In [4]:
cnbc_path = os.path.join(data_raw, 'cnbc_headlines.csv')

print(f'Loading CNBC data from {cnbc_path}')
cnbc_raw = pd.read_csv(cnbc_path)

print(f'Initial shape: {cnbc_raw.shape}')
print('\nFirst 5 rows:')
print(cnbc_raw.head())
print(f'\nColumn names: {list(cnbc_raw.columns)}')

Loading CNBC data from /content/drive/MyDrive/market-sentiment-impact-analysis/data/raw/cnbc_headlines.csv
Initial shape: (3080, 3)

First 5 rows:
                                           Headlines                            Time                                        Description
0  Jim Cramer: A better way to invest in the Covi...   7:51  PM ET Fri, 17 July 2020  "Mad Money" host Jim Cramer recommended buying...
1     Cramer's lightning round: I would own Teradyne   7:33  PM ET Fri, 17 July 2020  "Mad Money" host Jim Cramer rings the lightnin...
2                                                NaN                             NaN                                                NaN
3  Cramer's week ahead: Big week for earnings, ev...   7:25  PM ET Fri, 17 July 2020  "We'll pay more for the earnings of the non-Co...
4  IQ Capital CEO Keith Bliss says tech and healt...   4:24  PM ET Fri, 17 July 2020  Keith Bliss, IQ Capital CEO, joins "Closing Be...

Column names: ['Headlines', 'Time', 'Description']

In [5]:
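# Drop completely empty rows; .copy() makes cnbc_clean an independent frame,
# so the column assignments below do not trigger SettingWithCopyWarning
cnbc_clean = cnbc_raw.dropna(how='all').copy()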

rows_removed = len(cnbc_raw) - len(cnbc_clean)
print(f'Removed {rows_removed} completely empty rows')
print(f'Shape after cleaning: {cnbc_clean.shape}')

Removed 280 completely empty rows
Shape after cleaning: (2800, 3)
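
The choice of `how='all'` matters here: the default `how='any'` would also discard rows that are merely missing one field. A toy illustration (the frame below is invented, not CNBC data):

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 is fully empty, row 2 is only missing its description
toy = pd.DataFrame({
    "Headlines": ["A", np.nan, "C"],
    "Time": ["t1", np.nan, "t3"],
    "Description": ["d1", np.nan, np.nan],
})

only_empty_rows_dropped = toy.dropna(how="all")  # keeps rows 0 and 2
any_gap_dropped = toy.dropna()                   # default how="any": keeps row 0 only

print(len(only_empty_rows_dropped), len(any_gap_dropped))
```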


In [6]:
# Clean the Time column: remove the " ET" timezone marker
# Example: "7:51  PM ET Fri, 17 July 2020" -> "7:51  PM Fri, 17 July 2020"

def clean_cnbc_time(time_str):
    if pd.isna(time_str):
        return time_str
    cleaned = str(time_str).replace(' ET', '')
    return cleaned

cnbc_clean['Time_Cleaned'] = cnbc_clean['Time'].apply(clean_cnbc_time)

print('Sample of cleaned timestamps:')
print(cnbc_clean[['Time', 'Time_Cleaned']].head(3))

Sample of cleaned timestamps:
                             Time                 Time_Cleaned
0   7:51  PM ET Fri, 17 July 2020   7:51  PM Fri, 17 July 2020
1   7:33  PM ET Fri, 17 July 2020   7:33  PM Fri, 17 July 2020
3   7:25  PM ET Fri, 17 July 2020   7:25  PM Fri, 17 July 2020
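
The same cleanup can be expressed with pandas' vectorized string methods instead of a row-wise `apply`. This is a sketch on a toy Series, not the notebook's code. It also collapses the double spaces visible in the raw strings, which the cell above leaves in place.

```python
import pandas as pd

# Toy stand-in for cnbc_clean['Time']; the second value mimics a missing entry
times = pd.Series(["7:51  PM ET Fri, 17 July 2020", None])

cleaned = (
    times.str.replace(r"\bET\b", "", regex=True)  # drop the timezone token only
         .str.replace(r"\s+", " ", regex=True)    # collapse runs of whitespace
         .str.strip()
)
print(cleaned.iloc[0])
```

The word-boundary pattern `\bET\b` is a defensive choice: it removes the standalone token without touching longer words that happen to contain "ET".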


In [7]:
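# Parse datetime (format='mixed' infers the format per value; requires pandas >= 2.0)

def parse_cnbc_datetime(time_str):
    if pd.isna(time_str):
        return pd.NaT
    try:
        return pd.to_datetime(time_str, format='mixed', utc=False)
    except (ValueError, TypeError):
        # Unparseable strings become NaT so they can be counted and dropped later
        return pd.NaT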

cnbc_clean['date'] = cnbc_clean['Time_Cleaned'].apply(parse_cnbc_datetime)

print('Date parsing results:')
print(f"Successfully parsed: {cnbc_clean['date'].notna().sum()}")
print(f"Failed to parse: {cnbc_clean['date'].isna().sum()}")
print('\nSample parsed dates:')
print(cnbc_clean[['Time_Cleaned', 'date']].head())

Date parsing results:
Successfully parsed: 2800
Failed to parse: 0

Sample parsed dates:
                  Time_Cleaned                date
0   7:51  PM Fri, 17 July 2020 2020-07-17 19:51:00
1   7:33  PM Fri, 17 July 2020 2020-07-17 19:33:00
3   7:25  PM Fri, 17 July 2020 2020-07-17 19:25:00
4   4:24  PM Fri, 17 July 2020 2020-07-17 16:24:00
5   7:36  PM Thu, 16 July 2020 2020-07-16 19:36:00
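
`format='mixed'` lets pandas guess the layout of each value. An explicit format string is a stricter alternative. The sketch below assumes every CNBC timestamp follows the `7:51 PM Fri, 17 July 2020` layout with a full month name, which the sample rows suggest but do not prove. It also assumes that "ET" means US Eastern time (`America/New_York`). Stripping the marker leaves the parsed values timezone-naive, so attaching the zone is optional.

```python
import pandas as pd

# Toy inputs: one value in the observed layout, one deliberately malformed
raw = pd.Series(["7:51  PM Fri, 17 July 2020", "not a date"])

# Collapse repeated spaces so the string matches the format exactly
normalized = raw.str.replace(r"\s+", " ", regex=True).str.strip()

# errors="coerce" turns anything that does not match into NaT
parsed = pd.to_datetime(normalized, format="%I:%M %p %a, %d %B %Y", errors="coerce")

# Assumption: "ET" is US Eastern; ambiguous or nonexistent DST times become NaT
eastern = parsed.dt.tz_localize("America/New_York", ambiguous="NaT", nonexistent="NaT")
print(parsed.iloc[0], eastern.iloc[0])
```

An explicit format fails loudly (as `NaT`) on any row that deviates from the expected layout, which makes silent misparses less likely than with per-value inference.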


In [8]:
# Schema standardization
# Time -> date, Headlines -> headline, Description -> description
# Add: source = "CNBC"

cnbc_standard = cnbc_clean[['date', 'Headlines', 'Description']].copy()
cnbc_standard.columns = ['date', 'headline', 'description']
cnbc_standard['source'] = 'CNBC'

cnbc_standard = cnbc_standard.dropna(subset=['date'])

print(f'CNBC standardized shape: {cnbc_standard.shape}')
print(f'\nColumns: {list(cnbc_standard.columns)}')
print('\nSample:')
print(cnbc_standard.head(3))

CNBC standardized shape: (2800, 4)

Columns: ['date', 'headline', 'description', 'source']

Sample:
                 date                                           headline                                        description source
0 2020-07-17 19:51:00  Jim Cramer: A better way to invest in the Covi...  "Mad Money" host Jim Cramer recommended buying...   CNBC
1 2020-07-17 19:33:00     Cramer's lightning round: I would own Teradyne  "Mad Money" host Jim Cramer rings the lightnin...   CNBC
3 2020-07-17 19:25:00  Cramer's week ahead: Big week for earnings, ev...  "We'll pay more for the earnings of the non-Co...   CNBC


### 2.2 Guardian Data Processing

**Input Schema**: `Time`, `Headlines` (note: the column order is reversed relative to CNBC, and there is no `Description` column)

**Data Irregularities**:
- Missing `description` column entirely
- Dirty text: `\n\n` patterns in headlines
- Date format: `18-Jul-20` (Day-Month-Year)

**Action Plan**:
1. Load CSV
2. Clean newlines from Headlines
3. Parse DMY date format
4. Create empty `description` column (use `""`, not NaN)
5. Rename and add source tag

In [9]:
guardian_path = os.path.join(data_raw, 'guardian_headlines.csv')

print(f'Loading Guardian data from {guardian_path}')
guardian_raw = pd.read_csv(guardian_path)

print(f'Initial shape: {guardian_raw.shape}')
print(f'First 10 rows:')
print(guardian_raw.head(10))
print(f'Column names: {list(guardian_raw.columns)}')

Loading Guardian data from /content/drive/MyDrive/market-sentiment-impact-analysis/data/raw/guardian_headlines.csv
Initial shape: (17800, 2)
First 10 rows:
        Time                                          Headlines
0  18-Jul-20   Johnson is asking Santa for a Christmas recovery
1  18-Jul-20  ‘I now fear the worst’: four grim tales of wor...
2  18-Jul-20  Five key areas Sunak must tackle to serve up e...
3  18-Jul-20  Covid-19 leaves firms ‘fatally ill-prepared’ f...
4  18-Jul-20  The Week in Patriarchy  \n\n\n  Bacardi's 'lad...
5  18-Jul-20  English councils call for smoking ban outside ...
6  18-Jul-20              Can Tesla justify a $300bn valuation?
7  18-Jul-20  Empty city centres: 'I’m not sure it will ever...
8  18-Jul-20  Democratising finance for all? An investment a...
9  18-Jul-20  Homebuyer loses £300,000 to fraudsters – but g...
Column names: ['Time', 'Headlines']


In [10]:
# Clean newlines from Headlines

def clean_newlines(text):
    """Replace multiple newlines and tabs with single space."""
    if pd.isna(text):
        return text

    cleaned = str(text).replace('\n', ' ').replace('\r', ' ').replace('\t', ' ')
    cleaned = ' '.join(cleaned.split())
    return cleaned

guardian_raw['Headlines_Cleaned'] = guardian_raw['Headlines'].apply(clean_newlines)

has_newlines = guardian_raw['Headlines'].str.contains('\n', na=False)
print(f'Headlines with newlines: {has_newlines.sum()}')
print('\nBefore/After sample:')
if has_newlines.any():
    sample_idx = guardian_raw[has_newlines].index[0]
    print(f"BEFORE: {repr(guardian_raw.loc[sample_idx, 'Headlines'])}")
    print(f"AFTER:  {guardian_raw.loc[sample_idx, 'Headlines_Cleaned']}")

Headlines with newlines: 1246

Before/After sample:
BEFORE: "The Week in Patriarchy  \n\n\n  Bacardi's 'lady vodka': the latest in a long line of depressing gendered products"
AFTER:  The Week in Patriarchy Bacardi's 'lady vodka': the latest in a long line of depressing gendered products
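
In `clean_newlines`, the `' '.join(cleaned.split())` step does most of the work. `str.split()` with no arguments splits on any run of whitespace, including `\n`, `\r` and `\t`, so the preceding `.replace` chain is harmless but not strictly needed. A small standard-library sketch (the function name `collapse_whitespace` is illustrative, not from the notebook):

```python
def collapse_whitespace(text: str) -> str:
    # split() without arguments treats any whitespace run as one separator
    return " ".join(text.split())

raw = "The Week in Patriarchy  \n\n\n  Bacardi's 'lady vodka'"

# The notebook's approach, reproduced for comparison
chained = " ".join(raw.replace("\n", " ").replace("\r", " ").replace("\t", " ").split())

print(collapse_whitespace(raw))
print(collapse_whitespace(raw) == chained)
```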


In [12]:
# Parse date - format is "18-Jul-20" (DMY)

def parse_guardian_date(date_str):
    if pd.isna(date_str):
        return pd.NaT
    try:
        return pd.to_datetime(date_str, format='%d-%b-%y')
    except (ValueError, TypeError):
        # Values that are not DMY dates are marked NaT and dropped later
        return pd.NaT

guardian_raw['date'] = guardian_raw['Time'].apply(parse_guardian_date)

print('Date parsing results:')
print(f"Successfully parsed: {guardian_raw['date'].notna().sum()}")
print(f"Failed to parse: {guardian_raw['date'].isna().sum()}")
print('\nDate range:')
print(f"Earliest: {guardian_raw['date'].min()}")
print(f"Latest: {guardian_raw['date'].max()}")

Date parsing results:
Successfully parsed: 17760
Failed to parse: 40

Date range:
Earliest: 2017-12-17 00:00:00
Latest: 2020-07-18 00:00:00
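
The 40 failures are dropped in the next cell without being inspected. A quick look at the raw values behind them can confirm that they are junk rather than a second date format. The sketch below uses an invented toy Series, since the actual failing values are not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for guardian_raw['Time']; the malformed entries are invented
raw_time = pd.Series(["18-Jul-20", "17-Jul-20", "Guardian staff", None])

# Vectorized equivalent of parse_guardian_date: non-matching values become NaT
parsed = pd.to_datetime(raw_time, format="%d-%b-%y", errors="coerce")

# Raw values behind the failures, with counts (missing values included)
failed = raw_time[parsed.isna()]
print(failed.value_counts(dropna=False))
```

Applied to the real column, the same pattern would be `guardian_raw.loc[guardian_raw['date'].isna(), 'Time'].value_counts(dropna=False)`.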


In [13]:
# Schema standardization

guardian_standard = pd.DataFrame({
    'date': guardian_raw['date'],
    'headline': guardian_raw['Headlines_Cleaned'],
    'description': '',
    'source': 'Guardian'
})

guardian_standard = guardian_standard.dropna(subset=['date'])

print(f'Guardian standardized shape: {guardian_standard.shape}')
print(f'\nColumns: {list(guardian_standard.columns)}')
print('\nDescription column check:')
print(f"  Type: {type(guardian_standard['description'].iloc[0])}")
print(f"  Is NaN: {guardian_standard['description'].isna().sum()}")
print(f"  Is empty string: {(guardian_standard['description'] == '').sum()}")
print('\nSample:')
print(guardian_standard.head(3))

Guardian standardized shape: (17760, 4)

Columns: ['date', 'headline', 'description', 'source']

Description column check:
  Type: <class 'str'>
  Is NaN: 0
  Is empty string: 17760

Sample:
        date                                           headline description    source
0 2020-07-18   Johnson is asking Santa for a Christmas recovery              Guardian
1 2020-07-18  ‘I now fear the worst’: four grim tales of wor...              Guardian
2 2020-07-18  Five key areas Sunak must tackle to serve up e...              Guardian
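
The notebook does not say why `""` is preferred over NaN. One plausible reason is that missing values propagate through string concatenation. If headline and description are later joined into a single text field, a NaN description would wipe out the headline as well. A sketch of that behaviour on toy data:

```python
import numpy as np
import pandas as pd

headline = pd.Series(["Can Tesla justify a $300bn valuation?"])
desc_nan = pd.Series([np.nan], dtype="object")  # what a missing value would look like
desc_empty = pd.Series([""])                    # the placeholder used above

# With NaN the combined text is lost; with "" the headline survives
text_nan = headline.str.cat(desc_nan, sep=" ")
text_empty = headline.str.cat(desc_empty, sep=" ").str.strip()

print(text_nan.isna().iloc[0], text_empty.iloc[0])
```

Note that `pd.read_csv` reads empty fields back as NaN by default, so the empty strings will not survive a CSV round trip unless the file is reloaded with `keep_default_na=False` or the column is refilled with `fillna('')`.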


### 2.3 Reuters Data Processing

**Input Schema**: `Headlines`, `Time`, `Description`

**Data Irregularities**: Minimal - clean format `"Jul 18 2020"`

**Action Plan**:
1. Load CSV
2. Parse clean date format (MMM DD YYYY)
3. Rename columns to standard schema
4. Add source tag

In [14]:
reuters_path = os.path.join(data_raw, 'reuters_headlines.csv')

print(f'Loading Reuters data from: {reuters_path}')
reuters_raw = pd.read_csv(reuters_path)

print(f'Initial shape: {reuters_raw.shape}')
print('\nFirst 5 rows:')
print(reuters_raw.head())
print(f'\nColumn names: {list(reuters_raw.columns)}')

Loading Reuters data from: /content/drive/MyDrive/market-sentiment-impact-analysis/data/raw/reuters_headlines.csv
Initial shape: (32770, 3)

First 5 rows:
                                           Headlines         Time                                        Description
0  TikTok considers London and other locations fo...  Jul 18 2020  TikTok has been in discussions with the UK gov...
1  Disney cuts ad spending on Facebook amid growi...  Jul 18 2020  Walt Disney  has become the latest company to ...
2  Trail of missing Wirecard executive leads to B...  Jul 18 2020  Former Wirecard  chief operating officer Jan M...
3  Twitter says attackers downloaded data from up...  Jul 18 2020  Twitter Inc said on Saturday that hackers were...
4  U.S. Republicans seek liability protections as...  Jul 17 2020  A battle in the U.S. Congress over a new coron...

Column names: ['Headlines', 'Time', 'Description']


In [15]:
# Parse date, format is "Jul 18 2020" (MMM DD YYYY)

def parse_reuters_date(date_str):
    if pd.isna(date_str):
        return pd.NaT
    try:
        return pd.to_datetime(date_str, format='%b %d %Y')
    except (ValueError, TypeError):
        return pd.NaT

reuters_raw['date'] = reuters_raw['Time'].apply(parse_reuters_date)

print('Date parsing results:')
print(f"Successfully parsed: {reuters_raw['date'].notna().sum()}")
print(f"Failed to parse: {reuters_raw['date'].isna().sum()}")
print('\nDate range:')
print(f"Earliest: {reuters_raw['date'].min()}")
print(f"Latest: {reuters_raw['date'].max()}")
print('\nSample parsed dates:')
print(reuters_raw[['Time', 'date']].head())

Date parsing results:
Successfully parsed: 32770
Failed to parse: 0

Date range:
Earliest: 2018-03-20 00:00:00
Latest: 2020-07-18 00:00:00

Sample parsed dates:
          Time       date
0  Jul 18 2020 2020-07-18
1  Jul 18 2020 2020-07-18
2  Jul 18 2020 2020-07-18
3  Jul 18 2020 2020-07-18
4  Jul 17 2020 2020-07-17


In [16]:
# Schema standardization

reuters_standard = pd.DataFrame({
    'date': reuters_raw['date'],
    'headline': reuters_raw['Headlines'],
    'description': reuters_raw['Description'],
    'source': 'Reuters'
})

reuters_standard = reuters_standard.dropna(subset=['date'])

print(f'Reuters standardized shape: {reuters_standard.shape}')
print(f'Columns: {list(reuters_standard.columns)}')
print('\nSample:')
print(reuters_standard.head(3))

Reuters standardized shape: (32770, 4)
Columns: ['date', 'headline', 'description', 'source']

Sample:
        date                                           headline                                        description   source
0 2020-07-18  TikTok considers London and other locations fo...  TikTok has been in discussions with the UK gov...  Reuters
1 2020-07-18  Disney cuts ad spending on Facebook amid growi...  Walt Disney  has become the latest company to ...  Reuters
2 2020-07-18  Trail of missing Wirecard executive leads to B...  Former Wirecard  chief operating officer Jan M...  Reuters
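
With all three sources mapped to the same four columns, a lightweight check can guard against schema drift before the frames are combined. The helper below is hypothetical (`check_standard_schema` and `EXPECTED_COLUMNS` are not part of the notebook) and is shown on a toy frame.

```python
import pandas as pd

EXPECTED_COLUMNS = ["date", "headline", "description", "source"]

def check_standard_schema(df: pd.DataFrame, name: str) -> None:
    """Hypothetical helper: raise if a frame deviates from the shared schema."""
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"{name}: unexpected columns {list(df.columns)}")
    if not pd.api.types.is_datetime64_any_dtype(df["date"]):
        raise TypeError(f"{name}: 'date' is not a datetime column")
    if df["date"].isna().any():
        raise ValueError(f"{name}: 'date' contains NaT")

# Toy frame in the standardized layout
good = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-18"]),
    "headline": ["Example headline"],
    "description": [""],
    "source": ["Guardian"],
})
bad = good.rename(columns={"headline": "Headlines"})

check_standard_schema(good, "good")  # passes silently
```

In the notebook this could be called once per source, for example `check_standard_schema(cnbc_standard, 'CNBC')`.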
