We need to collect headline data for stocks, starting with a single stock: Google Class C ($GOOG). We also need a variety of sources; the more sources the better, since we can always trim the fat later.

yfinance will give us a good dataset of Yahoo Finance data.
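
A minimal sketch of pulling headlines through yfinance, assuming the `Ticker.news` property returns a list of dicts. The item layout has varied between yfinance releases (a flat `title` key in some, a nested `content` dict in others), so the hypothetical helper `extract_titles` accepts both; the field names are assumptions to check against the installed version.

```python
def extract_titles(news_items):
    """Return headline strings from a list of yfinance-style news dicts."""
    titles = []
    for item in news_items:
        # Assumed layouts: {"title": ...} or {"content": {"title": ...}}
        content = item.get("content")
        nested = content if isinstance(content, dict) else {}
        title = item.get("title") or nested.get("title")
        if title:
            titles.append(title.strip())
    return titles


def fetch_goog_titles(symbol="GOOG"):
    """Fetch news for one ticker (needs network access and yfinance installed)."""
    import yfinance as yf  # imported lazily so extract_titles works without it

    return extract_titles(yf.Ticker(symbol).news)
```

Calling `fetch_goog_titles()` in a notebook cell should give a plain list of headline strings that can be appended to the other sources.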

Since we're using Python, we'll rely on websites that are built in plain HTML so this doesn't get overly complicated. If most of our data sources are built in React or are otherwise JS-heavy, we'll switch to using JS.
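
One rough way to tell the two kinds of site apart is to look at how much paragraph text is already present in the raw HTML, before any JavaScript runs. The sketch below is a heuristic only: the helper name `looks_js_rendered` and the 200-character threshold are assumptions, and it uses the standard-library parser so it runs without extra packages.

```python
from html.parser import HTMLParser


class _ParagraphText(HTMLParser):
    """Count the characters of text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # how many <p> tags are currently open
        self.chars = 0   # length of paragraph text, outer whitespace stripped

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chars += len(data.strip())


def looks_js_rendered(html, min_paragraph_chars=200):
    """Guess that a page is JS-rendered when the raw HTML has little <p> text."""
    parser = _ParagraphText()
    parser.feed(html)
    return parser.chars < min_paragraph_chars
```

Feeding it `requests.get(url).text` for each candidate source gives a quick count of how many sources would need a browser or a JS toolchain.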

News Sources:
Yahoo Finance
CNBC
BizToc
Reuters


Update (06/13/2025):

After developing a decent solution using Python, the problems I'm facing are:
1. A majority of the big websites like Bloomberg or NYT have paywalls. This isn't a real problem, and it doesn't mean the website can't be scraped: these sites typically use a 'soft paywall' that can be turned off, after which the news can be scraped.
2. Most modern websites, especially the large news sites, use React. So the best solution here would be to use JavaScript instead of Python to scrape these sites. We'll move over to collecting_headlines (v2) to continue toward the goal.





In [14]:
import requests
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

In [15]:
# Set options to run Chrome headless
chrome_options = Options()
chrome_options.add_argument("--headless")

In [16]:
url = "https://biztoc.com/wire"

response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, "html.parser")

all_links = [a['href'] for a in soup.find_all('a', href=True)]
print(f"Found {len(all_links)} links:")

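# Keep only absolute http(s) links; the loop variable is named `link` so it isn't confused with `url` above
valid_urls = [link for link in all_links if link.startswith(('http://', 'https://'))]
# Skip the first six absolute links on the page
valid_urls = valid_urls[6:]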

Found 258 links:
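
The fixed slice `valid_urls[6:]` depends on the page layout staying the same. A sketch of an alternative is to filter by domain instead, assuming the links worth keeping are the ones that point away from the aggregator itself. Whether that matches the six links skipped above is an assumption, and `external_article_links` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def external_article_links(links, own_domain="biztoc.com"):
    """Keep absolute http(s) links that point away from the aggregator itself."""
    seen = set()
    kept = []
    for link in links:
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https"):
            continue  # relative paths, mailto:, javascript:, etc.
        host = parsed.netloc.lower()
        # Drop the aggregator's own pages, including subdomains.
        if host == own_domain or host.endswith("." + own_domain):
            continue
        if link not in seen:  # preserve order while removing duplicates
            seen.add(link)
            kept.append(link)
    return kept
```

With the variables from the cell above, `external_article_links(all_links)` could stand in for the two `valid_urls` lines.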


In [17]:


def get_valid_links(url):
    try:
        response = requests.get(url, timeout=10, headers={'User-Agent': 'Mozilla/5.0'})
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        
        links = set()  # Avoid duplicates
        for a in soup.find_all('a', href=True):
            href = a['href'].strip()
            if (href and not href.startswith(('javascript:', '#', 'mailto:', 'tel:'))
                and not any(href.endswith(ext) for ext in ('.png', '.jpg', '.pdf', '.docx'))):
                
                absolute_url = urljoin(url, href)
                if urlparse(absolute_url).scheme in ('http', 'https'):  # Validate URL
                    links.add(absolute_url)
        return list(links)
    
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return []

def get_headline_and_text(url):
    service = Service()  # No driver path given; Selenium Manager (Selenium 4.6+) locates chromedriver
    driver = webdriver.Chrome(service=service, options=chrome_options)
    try:
        driver.get(url)
        # Wait for main content to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, 'body')))
        
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        # Use the page <title>, falling back to the URL when it is missing or empty
        title = soup.title.string.strip() if soup.title and soup.title.string else url
        
        # Prefer <main> or <article> to skip nav/footer; fall back to <body>
        main_content = soup.find('main') or soup.find('article') or soup.find('body')
        text = ' '.join([p.get_text(' ', strip=True) 
                        for p in main_content.find_all(['p', 'h1', 'h2', 'h3']) 
                        if p.get_text(strip=True)])
        
        return title, text
    
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None, None
    
    finally:
        driver.quit()  # Ensure driver closes
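
The scraping loop further down stores the scrape date as a placeholder. Many news pages expose the publication time in an Open Graph style `<meta property="article:published_time">` tag, so a companion helper could read it from the same page source. This is a sketch under that assumption; `find_published_date` is a hypothetical name, and pages without the tag simply return `None`.

```python
from datetime import datetime
from html.parser import HTMLParser


class _MetaFinder(HTMLParser):
    """Remember the first article:published_time meta value on the page."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.published:
            return
        attr = dict(attrs)
        if attr.get("property") == "article:published_time":
            self.published = attr.get("content")


def find_published_date(html):
    """Return the publication date as 'YYYY-MM-DD', or None if unavailable."""
    finder = _MetaFinder()
    finder.feed(html)
    if not finder.published:
        return None
    try:
        # fromisoformat rejects a trailing "Z" before Python 3.11, so normalize it
        stamp = datetime.fromisoformat(finder.published.replace("Z", "+00:00"))
    except ValueError:
        return None
    return stamp.strftime("%Y-%m-%d")
```

Inside `get_headline_and_text`, the helper could be called on `driver.page_source` and its result returned alongside the title and text.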





In [18]:
import time
import random
import pandas as pd
from tqdm import tqdm  # Progress bar
from datetime import datetime

# Configuration
MAX_LINKS = 20  # Prevent runaway scraping
output_data = []

# Scrape with progress tracking
for link in tqdm(valid_urls[:MAX_LINKS]):
    try:
        title, text = get_headline_and_text(link)
        if title and text:  # Only store valid data
            output_data.append({
                'url': link,
                'headline': title,
                'body': text,
                'date': datetime.now().strftime('%Y-%m-%d')  # Placeholder: scrape date, not publication date
            })
        time.sleep(random.uniform(1, 5))  # Be polite
    
    except Exception as e:
        print(f"Skipping {link} due to error: {e}")
        continue

# Convert to DataFrame once
df = pd.DataFrame(output_data)

100%|██████████| 20/20 [03:16<00:00,  9.81s/it]
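
Since the goal is headlines about $GOOG, the collected rows can be narrowed to articles that mention the company. The sketch below works on the `output_data` list of dicts built above; the term list and the helper names `mentions_any` and `filter_rows` are assumptions, and a common word such as "alphabet" will produce some false positives.

```python
import re


def mentions_any(text, terms):
    """True if any term appears in text as a whole word, ignoring case."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def filter_rows(rows, terms=("Google", "Alphabet", "GOOG")):
    """Keep rows whose headline or body mentions one of the terms."""
    return [
        row for row in rows
        if mentions_any(row["headline"], terms) or mentions_any(row["body"], terms)
    ]
```

For example, `pd.DataFrame(filter_rows(output_data))` would build the DataFrame from the matching articles only.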


In [22]:
df.head(20)

Unnamed: 0,url,headline,body,date
0,https://tippinsights.com/israel-strikes-tehran...,"Israel Strikes Tehran As Iran, U.S. Tensions E...","Israel Strikes Tehran As Iran, U.S. Tensions E...",2025-06-23
1,https://abcnews.go.com/Business/wireStory/stoc...,"With its stock in sharp decline, Trump's media...",Stream on ABC News With its stock in sharp dec...,2025-06-23
2,https://www.theglobeandmail.com/business/artic...,"Court approves Hudson’s Bay name change, and s...","Court approves Hudson’s Bay name change, and s...",2025-06-23
3,https://www.forexlive.com/news/trump-everyone-...,"Trump: Everyone keep oil prices down, I'm watc...",Trump is posting in all caps: Careful out ther...,2025-06-23
4,https://www.syracuse.com/news/2025/06/hochul-o...,Hochul orders New York Power Authority to buil...,Hochul orders New York Power Authority to buil...,2025-06-23
5,https://www.newsweek.com/hurricane-season-upda...,Hurricane Season Update: Tropical Cyclone Coul...,Hurricane Season Update: Tropical Cyclone Coul...,2025-06-23
6,https://www.zerohedge.com/political/jd-vance-s...,JD Vance Says Don't Worry About New Forever Wa...,JD Vance Says Don't Worry About New Forever Wa...,2025-06-23
7,https://www.semafor.com/article/06/23/2025/mur...,Murkowski opens up about caucusing with Democr...,Exclusive / Murkowski opens up about pressure ...,2025-06-23
8,https://www.benzinga.com/crypto/cryptocurrency...,Are We Measuring Crypto Resilience All Wrong? ...,Are We Measuring Crypto Resilience All Wrong? ...,2025-06-23
9,https://www.newsweek.com/us-economic-outlook-e...,US Economic Outlook Remains Dark as Major Fore...,US Economic Outlook Remains Dark as Major Fore...,2025-06-23
