# Additional Data Scraping

This notebook explores additional sources of BS and non-BS text to generate training data across multiple domains. A broader dataset should improve the model's ability to generalize across different types of content.

## arXiv

In [1]:
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("Cornell-University/arxiv")

print("Path to dataset files:", path)
print("Files in the directory:")
for filename in os.listdir(path):
    if os.path.isfile(os.path.join(path, filename)):
        print(filename)
        




Path to dataset files: /Users/ssgrummo/.cache/kagglehub/datasets/Cornell-University/arxiv/versions/237
Files in the directory:
arxiv-metadata-oai-snapshot.json


In [2]:
import pandas as pd
arxiv_df = pd.read_json(os.path.join(path, "arxiv-metadata-oai-snapshot.json"), lines=True, nrows=5000)
arxiv_df.head(5)

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"
2,704.0003,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,,physics.gen-ph,,The evolution of Earth-Moon system is descri...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2008-01-13,"[[Pan, Hongjun, ]]"
3,704.0004,David Callan,David Callan,A determinant of Stirling cycle numbers counts...,11 pages,,,,math.CO,,We show that a determinant of Stirling cycle...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2007-05-23,"[[Callan, David, ]]"
4,704.0005,Alberto Torchinsky,Wael Abu-Shammala and Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,,"Illinois J. Math. 52 (2008) no.2, 681-689",,,math.CA math.FA,,In this paper we show how to compute the $\L...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2013-10-15,"[[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]"
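
The `nrows=5000` read above only covers the first records of the snapshot, all from 2007, and the `id` values appear as `704.0001`, which suggests they were parsed as numbers rather than kept as strings. A minimal sketch of an alternative, assuming the snapshot is a JSON-lines file with `id`, `title`, `abstract` and `categories` fields (as the column names above indicate): stream the file line by line and keep only records whose category starts with a chosen prefix. The helper name `iter_arxiv_abstracts` and its parameters are hypothetical, not part of the original notebook.

```python
import json
from typing import Dict, Iterator, Optional


def iter_arxiv_abstracts(
    json_path: str,
    category_prefix: Optional[str] = None,
    limit: int = 5000,
) -> Iterator[Dict[str, str]]:
    """Yield up to `limit` records as {'id', 'title', 'text'} dicts.

    Assumption: one JSON object per line, as in the Kaggle arXiv snapshot.
    """
    yielded = 0
    with open(json_path, "r", encoding="utf-8") as fh:
        for line in fh:
            if yielded >= limit:
                break
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            categories = (record.get("categories") or "").split()
            # Keep the record only if one of its categories has the wanted prefix
            if category_prefix and not any(
                c.startswith(category_prefix) for c in categories
            ):
                continue
            yield {
                "id": str(record.get("id", "")),  # stays a string, e.g. "0704.0001"
                # Collapse the line breaks and repeated spaces found in titles/abstracts
                "title": " ".join((record.get("title") or "").split()),
                "text": " ".join((record.get("abstract") or "").split()),
            }
            yielded += 1


# Usage with the `path` variable from the download cell (not run here):
# records = list(iter_arxiv_abstracts(
#     os.path.join(path, "arxiv-metadata-oai-snapshot.json"),
#     category_prefix="math", limit=1000))
```

Because the function is a generator, only one line is held in memory at a time, so the full snapshot never has to be loaded.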


## Scraping RSS Feeds for BS and Non-BS Data

In [3]:
from newspaper import Article
import feedparser
import pandas as pd

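def scrape_rss_feed(feed_url, max_articles=50):
    """Scrape up to max_articles entries from an RSS/Atom feed into a DataFrame."""
    feed = feedparser.parse(feed_url)
    data = []
    for entry in feed.entries[:max_articles]:
        url = entry.get("link")
        date = entry.get("published")  # not every feed provides a publication date
        if not url:
            continue
        try:
            art = Article(url)
            art.download()
            art.parse()
        except Exception as e:
            # Skip articles that fail to download or parse instead of aborting the feed
            print(f"Failed to scrape {url}: {e}")
            continue
        if len(art.text) > 200:  # keep only articles with substantial text
            data.append({
                "url": url,
                "title": art.title,
                "date": date,
                "text": art.text[:1000]  # keep only a 1,000-character excerpt
            })
    return pd.DataFrame(data)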


In [4]:
legit_feed_urls = ['https://www.sciencenews.org/feed',
             'https://www.pewresearch.org/feed/',
             'https://theconversation.com/topics/economics-488/articles.atom',
             'https://theconversation.com/topics/extreme-weather-3799/articles.atom']

df_list = []
for url in legit_feed_urls: 
    df = scrape_rss_feed(url, 5000)
    df_list.append(df)

legit_df = pd.concat(df_list, ignore_index=True)
legit_df.shape

(79, 4)

In [5]:
bs_feed_url = 'https://www.naturalnews.com/rss.xml'
bs_df = scrape_rss_feed(bs_feed_url, 10000)
bs_df.shape

(31, 4)
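
The two RSS results above are still unlabeled. A minimal sketch of how they could be tagged with the same `is_bs` convention used later in this notebook (1 for BS, 0 for legitimate sources) before being combined. The helper `label_records` is hypothetical and works on plain lists of dicts so it does not depend on pandas.

```python
from typing import Any, Dict, Iterable, List


def label_records(
    records: Iterable[Dict[str, Any]],
    is_bs: int,
    label_name: str = "is_bs",
) -> List[Dict[str, Any]]:
    """Return copies of the records with a label column added."""
    # A new dict is built for each record so the input is left untouched
    return [{**record, label_name: is_bs} for record in records]


# Usage with the DataFrames scraped above (not run here):
# combined = pd.DataFrame(
#     label_records(legit_df.to_dict("records"), 0)
#     + label_records(bs_df.to_dict("records"), 1)
# )
```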

In [6]:
from datasets import load_dataset
dataset = load_dataset("cc_news", split="train")

dataset

Dataset({
    features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
    num_rows: 708241
})

In [7]:
unique_values = dataset['domain']
unique_elements = list(set(unique_values))
len(unique_elements)

8759

In [8]:
item_to_check = "science"

science_domains = [item for item in unique_elements  if item_to_check in item]
science_domains


['scienceblog.com',
 'www.scienceworldreport.com',
 'www.sciencefriday.com',
 'www.sciencedaily.com',
 'www.myscience.org',
 'horizon.scienceblog.com',
 'www.sciencealert.com',
 'thebrainbank.scienceblog.com',
 'joshmitteldorf.scienceblog.com',
 'www.sciencebeing.com',
 'www.sciencespacerobots.com',
 'www.france-science.org',
 'forthesakeofscience.com',
 'www.livescience.com',
 'scienceofmind.com',
 'alankandel.scienceblog.com',
 'science.slashdot.org',
 'www.sciencebase.com',
 'scienceblogs.com',
 'abitofscience.com',
 'science.howstuffworks.com',
 'news.science360.gov']

In [9]:
def filter_dataset(dataset, column_name, target_string):
    return dataset.filter(lambda example: target_string in example[column_name])

domain = "dailymail.co.uk/health"
filtered_dataset = filter_dataset(dataset, 'url', domain)

filtered_dataset

Dataset({
    features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
    num_rows: 237
})

In [10]:
domain = "theconversation.com"
filtered_dataset = filter_dataset(dataset, 'url', domain)

filtered_dataset

Dataset({
    features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
    num_rows: 357
})

In [11]:
for url in science_domains:
    filtered_dataset = filter_dataset(dataset, 'url', url)
    print(f"{url} Row Count: {filtered_dataset.num_rows}")

scienceblog.com Row Count: 71
www.scienceworldreport.com Row Count: 34
www.sciencefriday.com Row Count: 18
www.sciencedaily.com Row Count: 106
www.myscience.org Row Count: 17
horizon.scienceblog.com Row Count: 4
www.sciencealert.com Row Count: 63
thebrainbank.scienceblog.com Row Count: 1
joshmitteldorf.scienceblog.com Row Count: 1
www.sciencebeing.com Row Count: 10
www.sciencespacerobots.com Row Count: 5
www.france-science.org Row Count: 2
forthesakeofscience.com Row Count: 1
www.livescience.com Row Count: 34
scienceofmind.com Row Count: 1
alankandel.scienceblog.com Row Count: 5
science.slashdot.org Row Count: 50
www.sciencebase.com Row Count: 2
scienceblogs.com Row Count: 7
abitofscience.com Row Count: 19
science.howstuffworks.com Row Count: 8
news.science360.gov Row Count: 22


www.sciencedaily.com has the highest row count (106) among the science domains, so use it for non-BS data.
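
The counts above come from a substring match on the `url` column, so a pattern such as `scienceblog.com` can also match subdomains like `horizon.scienceblog.com`. A sketch of a stricter alternative, assuming the `domain` feature listed in the dataset output holds the bare host name: build a predicate that matches the `domain` field exactly. The name `make_domain_filter` is hypothetical.

```python
from typing import Any, Callable, Dict, Iterable


def make_domain_filter(domains: Iterable[str]) -> Callable[[Dict[str, Any]], bool]:
    """Build a predicate that keeps examples whose 'domain' is exactly one of `domains`."""
    wanted = {d.lower() for d in domains}

    def keep(example: Dict[str, Any]) -> bool:
        # Exact, case-insensitive comparison on the domain field
        return str(example.get("domain", "")).lower() in wanted

    return keep


# Usage with the cc_news dataset loaded above (not run here):
# non_bs_dataset = dataset.filter(make_domain_filter(["www.sciencedaily.com"]))
```

Exact matching may return fewer rows than the substring counts, so the numbers are worth rechecking after switching.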

In [12]:
item_to_check = "health"

health_domains = [item for item in unique_elements  if item_to_check in item]
health_domains


['www.diabeteshealth.com',
 'healthyregion.bangordailynews.com',
 'www.dailyhealthneeds.com',
 'www.thehealthsite.com',
 'health.fajar.co.id',
 'www.menshealth.com',
 'www.healthcareitnews.com',
 'www.health.com',
 'www.health24.com',
 'health.good.is',
 'healthcitysun.com',
 'www.healthcanal.com',
 'catchinghealth.bangordailynews.com']

In [13]:
for url in health_domains:
    filtered_dataset = filter_dataset(dataset, 'url', url)
    print(f"{url} Row Count: {filtered_dataset.num_rows}")

www.diabeteshealth.com Row Count: 4
healthyregion.bangordailynews.com Row Count: 1
www.dailyhealthneeds.com Row Count: 1
www.thehealthsite.com Row Count: 55
health.fajar.co.id Row Count: 2
www.menshealth.com Row Count: 10
www.healthcareitnews.com Row Count: 37
www.health.com Row Count: 33
www.health24.com Row Count: 17
health.good.is Row Count: 3
healthcitysun.com Row Count: 10
www.healthcanal.com Row Count: 2
catchinghealth.bangordailynews.com Row Count: 10


In [14]:
bs_dataset = filter_dataset(dataset, 'url', 'www.thehealthsite.com')
bs_dataset['title']

['Sedentary lifestyle and no exercise leading to critical illness',
 'Union Minister Uma Bharti admitted to AIIMS for high blood pressure',
 'Delhi Burari deaths: Did ‘shared psychosis’ lead to the mass suicides in the Bhatia family?',
 '3 bathroom sex positions you should surely try!',
 'Can good bacteria keep gut healthy?',
 'Odisha government to establish 19 hospitals on Public Private Partnership (PPP) mode',
 'Which one is healthier: pasteurised milk, unpasteurised milk, homogenous milk or toned milk?',
 'Can pelvic exams help diagnose STDs in girls?',
 'World No Tobacco Day 2018: Secondhand smoke and its risk to your heart',
 'International Day of Happiness 2018: These yoga asanas will help improve your mental health',
 'World Oral Health Day 2018: 7 harmful habits that are destroying your tooth enamel',
 'World No Tobacco Day 2018: Ditch smoking and tobacco; choose heart health',
 'Lap-band surgery may lower chronic knee pain',
 'Punjab government to impose ban on hookah bars ac

In [15]:
bs_urls = ["dailymail.co.uk/health",
           ]

In [16]:
from collections import Counter
Counter(dataset['domain'])  # count rows per domain from the column, without building every full record


Counter({'uk.reuters.com': 24480,
         'www.dailymail.co.uk': 15452,
         'www.topix.com': 13354,
         'www.reuters.com': 11378,
         'www.which.co.uk': 7411,
         'www.express.co.uk': 4875,
         'indianexpress.com': 4151,
         'www.cbssports.com': 4039,
         'www.mirror.co.uk': 3791,
         'nypost.com': 3646,
         'shepherdexpress.com': 3605,
         'www.channelnewsasia.com': 3462,
         'www.cnn.com': 3403,
         'www.amarujala.com': 3395,
         'nationalpost.com': 3238,
         'www.nigeriatoday.ng': 3148,
         'www.metronews.ca': 3128,
         'www.taiwannews.com.tw': 3103,
         'www.inquisitr.com': 2873,
         'www.seattletimes.com': 2784,
         'allafrica.com': 2661,
         'www.businessinsider.com': 2656,
         'www.theguardian.com': 2610,
         'www.foxnews.com': 2432,
         'www.nzherald.co.nz': 2358,
         'seekingalpha.com': 2339,
         'in.reuters.com': 2332,
         'au.news.yahoo.com': 233

In [None]:
import logging
import time
from collections import defaultdict
from urllib.parse import urlparse

from datasets import load_dataset

target_domains = [
    "www.science.org",
    "www.pewresearch.org",
    "www.naturalnews.com",
    "www.rand.org"
]


# Use streaming mode
ds = load_dataset("mc4", name="en", split="train", streaming=True, trust_remote_code=True)

# Normalize domains
target_domains = set(d.lower().replace("https://", "").replace("http://", "") for d in target_domains)

def is_target_domain(example):
    try:
        domain = urlparse(example["url"]).netloc.lower()
        return domain in target_domains
    except Exception:
        return False

# Filter stream
filtered_ds = filter(is_target_domain, ds)

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

n = 10  # number of samples per domain
per_domain_counts = defaultdict(int)
examples = []

check_interval = 1000  # how often to log progress
seen = 0
start_time = time.time()

for ex in filtered_ds:
    seen += 1
    try:
        domain = urlparse(ex["url"]).netloc.lower()
    except Exception as e:
        logger.warning(f"Skipping malformed URL: {ex.get('url')} | Error: {e}")
        continue

    if domain in target_domains and per_domain_counts[domain] < n:
        examples.append(ex)
        per_domain_counts[domain] += 1

        logger.info(f"Added example from {domain}: {per_domain_counts[domain]}/{n}")

    if seen % check_interval == 0:
        elapsed = time.time() - start_time
        logger.info(f"Processed {seen} examples in {elapsed:.1f} sec")
        for d in target_domains:
            logger.info(f"  {d}: {per_domain_counts[d]}/{n}")

    if all(per_domain_counts[d] >= n for d in target_domains):
        logger.info("✅ Collected enough examples from all domains.")
        break

logger.info(f"Finished. Total processed: {seen}. Final counts:")
for d in target_domains:
    logger.info(f"  {d}: {per_domain_counts[d]}")


In [1]:
import logging
import time
from collections import defaultdict
from urllib.parse import urlparse
from datasets import load_dataset, IterableDataset
import pandas as pd
from typing import List, Dict, Any

logger = logging.getLogger(__name__)


def normalize_domains(domains: List[str]) -> set:
    """Normalize domain names by stripping protocol prefixes and converting to lowercase."""
    return set(d.lower().replace("https://", "").replace("http://", "") for d in domains)


def is_target_domain(example: Dict[str, Any], all_domains: set) -> bool:
    """Check if the example's domain is in the set of target domains."""
    try:
        domain = urlparse(example["url"]).netloc.lower()
        return domain in all_domains
    except Exception:
        return False


def load_filtered_mc4_stream(all_domains: List[str]) -> IterableDataset:
    """Load a streaming MC4 dataset and filter by target domains."""
    ds = load_dataset("mc4", name="en", split="train", streaming=True, trust_remote_code=True)
    normalized_domains = normalize_domains(all_domains)
    # Use the dataset's own lazy filter so the result is an IterableDataset, as annotated
    return ds.filter(lambda ex: is_target_domain(ex, normalized_domains))


def collect_domain_samples(
    filtered_ds: IterableDataset,
    domains: List[str],
    n: int = 10,
    check_interval: int = 1000
) -> List[Dict[str, str]]:
    """Collect up to n samples per specified domain from a filtered dataset."""
    domains = normalize_domains(domains)
    per_domain_counts = defaultdict(int)
    examples: List[Dict[str, str]] = []
    seen = 0
    start_time = time.time()

    for ex in filtered_ds:
        seen += 1
        try:
            domain = urlparse(ex["url"]).netloc.lower()
        except Exception as e:
            logger.warning(f"Skipping malformed URL: {ex.get('url')} | Error: {e}")
            continue

        if domain in domains and per_domain_counts[domain] < n:
            examples.append({"text": ex["text"], "url": ex["url"]})
            per_domain_counts[domain] += 1
            logger.info(f"Added example from {domain}: {per_domain_counts[domain]}/{n}")

        if seen % check_interval == 0:
            elapsed = time.time() - start_time
            logger.info(f"Processed {seen} examples in {elapsed:.1f} sec")
            for d in domains:
                logger.info(f"  {d}: {per_domain_counts[d]}/{n}")

        if all(per_domain_counts[d] >= n for d in domains):
            logger.info("✅ Collected enough examples from all domains.")
            break

    logger.info(f"Finished. Total processed: {seen}. Final counts:")
    for d in domains:
        logger.info(f"  {d}: {per_domain_counts[d]}")

    return examples


def label_domain_samples(
    sample_list: List[Dict[str, str]],
    bs_domains: List[str],
    legit_domains: List[str],
    label_name: str = "is_bs"
) -> List[Dict[str, Any]]:
    """Label samples based on whether their domains are in BS or legit domain lists."""
    bs_domains_set = normalize_domains(bs_domains)
    legit_domains_set = normalize_domains(legit_domains)

    for ex in sample_list:
        domain = urlparse(ex["url"]).netloc.lower()
        if domain in bs_domains_set:
            ex[label_name] = 1
        elif domain in legit_domains_set:
            ex[label_name] = 0
    return sample_list


def samples_to_dataframe(sample_list: List[Dict[str, Any]]) -> pd.DataFrame:
    """Convert a list of labeled samples into a pandas DataFrame."""
    return pd.DataFrame(sample_list)



bs_domains = ["www.naturalnews.com"]
legit_domains = ["theconversation.com"]
n = 3

all_domains = bs_domains + legit_domains
filtered_ds = load_filtered_mc4_stream(all_domains)
samples = collect_domain_samples(filtered_ds, all_domains, n=n)
labeled_samples = label_domain_samples(samples, bs_domains, legit_domains)
samples_df = samples_to_dataframe(labeled_samples)



In [5]:
samples_df

Unnamed: 0,text,url,is_bs
0,New study establishes a link between magnetic ...,http://www.naturalnews.com/036732_magnetic_fie...,1
1,"January 28, 2020 8.46am EST\nJean Frederic Isi...",http://theconversation.com/how-sensors-and-big...,0
2,Earth's magnetic pole shift unleashing poisono...,https://www.naturalnews.com/030996_bird_deaths...,1
3,"Sunday, May 31, 2009 by: Paul FassaTags: serra...",http://www.naturalnews.com/026360_serrapeptase...,1
4,Over the last few years our economic debate ha...,http://theconversation.com/hockeys-budget-mess...,0
5,"December 2, 2014 9.21am EST\nMelanie Klinkner ...",http://theconversation.com/the-icc-can-improve...,0
6,"March 1, 2012 11.46pm EST\nAndrea Carson, Alex...",http://theconversation.com/the-finkelstein-inq...,0
7,"October 16, 2013 7.11pm EDT\nThe end of the tw...",http://theconversation.com/coming-out-on-top-n...,0
8,"Lenka Vodstrcil, Monash University, Catriona B...",https://theconversation.com/we-need-a-cure-for...,0
9,"Even “healthy people,” if they eat sugar, are ...",https://www.naturalnews.com/2017-10-23-even-he...,1


In [29]:
import json
import requests
from urllib.parse import quote

def query_cc_index(domain: str, match_pattern: str = "*", limit: int = 50, index: str = 'CC-MAIN-2025-21-index'):
    """
    Queries the Common Crawl Index for a given domain and pattern.

    Args:
        domain (str): Domain to search (e.g., 'www.naturalnews.com')
        match_pattern (str): Pattern to match after domain (e.g., '*article*')
        limit (int): Maximum number of URLs to return
        index (str): Name of the Common Crawl index to query (e.g., 'CC-MAIN-2025-21-index')

    Returns:
        List[str]: List of matching URLs
    """
    index_url = (
        f"http://index.commoncrawl.org/{index}?"
        f"url={quote(domain + '/' + match_pattern)}&output=json"
    )
    response = requests.get(index_url, stream=True)
    response.raise_for_status()

    urls = []
    for line in response.iter_lines():
        if len(urls) >= limit:
            break  # stop reading the stream once enough URLs are collected
        if line:
            # Each line is one JSON record; json.loads is safe, unlike eval, and handles true/false/null
            record = json.loads(line.decode('utf-8'))
            url = record.get("url")
            if url:
                urls.append(url)

    return urls


In [30]:
urls = query_cc_index("www.naturalnews.com", "202*", limit=20)
print(urls[:5])


['https://naturalnews.com/2020-01-01-fake-male-nonbinary-partner-give-birth-female-sperm-donor.html', 'https://www.naturalnews.com/2020-01-01-fake-male-nonbinary-partner-give-birth-female-sperm-donor.html', 'https://naturalnews.com/2020-01-01-small-doses-of-aspirin-cause-brain-hemorrhage.html', 'https://www.naturalnews.com/2020-01-01-small-doses-of-aspirin-cause-brain-hemorrhage.html', 'https://naturalnews.com/2020-01-01-vaccine-industry-reveals-new-all-in-one-super-injection-multiple-vaccines.html']
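
The output above lists the same article twice, once under `naturalnews.com` and once under `www.naturalnews.com`. A minimal sketch of a deduplication step that could run before scraping, under the assumption that two URLs refer to the same page when they differ only in scheme, a leading `www.`, a trailing slash, or the query string. Dropping the query string is a simplification and would merge pages that are distinguished only by query parameters. The helpers `canonical_key` and `dedupe_urls` are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def canonical_key(url: str) -> str:
    """Reduce a URL to host + path, ignoring scheme, 'www.', trailing slash and query."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def dedupe_urls(urls: Iterable[str]) -> List[str]:
    """Keep the first URL seen for each canonical key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


# Usage (not run here):
# urls = dedupe_urls(query_cc_index("www.naturalnews.com", "202*", limit=20))
```

Deduplicating before scraping also halves the number of downloads for sites that are indexed under both host names.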


In [7]:
import pandas as pd
from newspaper import Article
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def scrape_articles(urls: list[str], max_words: int = 500) -> pd.DataFrame:
    """
    Scrape full text from a list of article URLs using newspaper3k.

    Args:
        urls (list[str]): List of article URLs.
        max_words (int): Max number of words to retain from article text.

    Returns:
        pd.DataFrame: DataFrame with columns ['url', 'title', 'text'].
    """
    data = []
    for i, url in enumerate(urls):
        try:
            logger.info(f"Scraping ({i+1}/{len(urls)}): {url}")
            article = Article(url)
            article.download()
            article.parse()
            words = article.text.split()
            truncated_text = " ".join(words[:max_words])
            data.append({
                "url": url,
                "title": article.title,
                "text": truncated_text
            })
        except Exception as e:
            logger.warning(f"Failed to scrape {url}: {e}")
            continue

    return pd.DataFrame(data)


In [31]:
urls = query_cc_index("www.naturalnews.com", "202*", limit=20)
news_df = scrape_articles(urls, max_words=500)
news_df.head(5)

INFO:__main__:Scraping (1/20): https://naturalnews.com/2020-01-01-fake-male-nonbinary-partner-give-birth-female-sperm-donor.html
INFO:__main__:Scraping (2/20): https://www.naturalnews.com/2020-01-01-fake-male-nonbinary-partner-give-birth-female-sperm-donor.html
INFO:__main__:Scraping (3/20): https://naturalnews.com/2020-01-01-small-doses-of-aspirin-cause-brain-hemorrhage.html
INFO:__main__:Scraping (4/20): https://www.naturalnews.com/2020-01-01-small-doses-of-aspirin-cause-brain-hemorrhage.html
INFO:__main__:Scraping (5/20): https://naturalnews.com/2020-01-01-vaccine-industry-reveals-new-all-in-one-super-injection-multiple-vaccines.html
INFO:__main__:Scraping (6/20): https://naturalnews.com/2020-01-02-7-new-years-resolution-ideas-for-overall-health.html
INFO:__main__:Scraping (7/20): https://naturalnews.com/2020-01-02-california-orders-pastors-push-lgbt-activism-transgender-tyranny.html
INFO:__main__:Scraping (8/20): https://naturalnews.com/2020-01-02-contagious-emotions-study-suggests

Unnamed: 0,url,title,text
0,https://naturalnews.com/2020-01-01-fake-male-n...,Transgender insanity: Fake “male” and “non-bin...,Transgender insanity: Fake “male” and “non-bin...
1,https://www.naturalnews.com/2020-01-01-fake-ma...,Transgender insanity: Fake “male” and “non-bin...,Transgender insanity: Fake “male” and “non-bin...
2,https://naturalnews.com/2020-01-01-small-doses...,Lay off the aspirin: Research says even “small...,Lay off the aspirin: Research says even “small...
3,https://www.naturalnews.com/2020-01-01-small-d...,Lay off the aspirin: Research says even “small...,Lay off the aspirin: Research says even “small...
4,https://naturalnews.com/2020-01-01-vaccine-ind...,FLASHBACK: Vaccine industry reveals new “all-i...,FLASHBACK: Vaccine industry reveals new “all-i...


In [32]:
urls = query_cc_index("www.urban.org", "urban-wire*", limit=20)
mydf = scrape_articles(urls, max_words=500)
mydf.head(5)

INFO:__main__:Scraping (1/20): https://www.urban.org/urban-wire/22-million-renters-and-owners-manufactured-homes-are-mostly-left-out-pandemic-assistance?cm_ven=ExactTarget&cm_cat=HFPC.09.01.2020&cm_pla=All+Subscribers&cm_ite=https%3A%2F%2Fwww.urban.org%2Furban-wire%2F22-million-renters-and-owners-manufactured-homes-are-mostly-left-out-pandemic-assistance&cm_ainfo=&&utm_source=%20urban_newsletters&&utm_medium=news-HFPC&&utm_term=HFPC&&
INFO:__main__:Scraping (2/20): https://www.urban.org/urban-wire/asian-americans-face-systemic-higher-mortgage-denial-rates-despite-having-stronger-credit-profiles
INFO:__main__:Scraping (3/20): https://www.urban.org/urban-wire/building-power-freelance-arts-workers-and-all-independent-contractors-will-make-them-more-resilient-economic-shocks
INFO:__main__:Scraping (4/20): https://www.urban.org/urban-wire/changes-school-lunch-reporting-make-achievement-gaps-harder-measure
INFO:__main__:Scraping (5/20): https://www.urban.org/urban-wire/covid-19-pandemic-unde

Unnamed: 0,url,title,text
0,https://www.urban.org/urban-wire/22-million-re...,22 Million Renters and Owners of Manufactured ...,In analyses of COVID-19’s effects on the housi...
1,https://www.urban.org/urban-wire/asian-america...,Asian Americans Face Systemic Higher Mortgage ...,The Asian American population in the US has in...
2,https://www.urban.org/urban-wire/building-powe...,Building Power for Freelance Arts Workers—and ...,The COVID-19 pandemic has devastated the arts ...
3,https://www.urban.org/urban-wire/changes-schoo...,Changes in School Lunch Reporting Make Achieve...,With today’s release of the 2019 National Asse...
4,https://www.urban.org/urban-wire/covid-19-pand...,The COVID-19 Pandemic Underscored the Child Ta...,"In late December and early January, COVID-19 c..."


In [None]:
urls = query_cc_index("theconversation.com", "*", limit=10, index='CC-MAIN-2023-50-index')
mydf = scrape_articles(urls, max_words=500)
mydf.head(5)

INFO:__main__:Scraping (1/10): https://theconversation.com/
INFO:__main__:Scraping (2/10): http://theconversation.com/
INFO:__main__:Scraping (3/10): https://theconversation.com/1-288-milliards-de-dollars-chiffrer-les-degats-causes-par-les-invasions-biologiques-pour-enfin-agir-158204
INFO:__main__:Scraping (4/10): https://theconversation.com/1-2t-infrastructure-plan-offers-lucrative-target-for-fraud-171453
INFO:__main__:Scraping (5/10): https://theconversation.com/1-4-million-less-than-projected-how-coronavirus-could-hit-australias-population-in-the-next-20-years-143544
INFO:__main__:Scraping (6/10): https://theconversation.com/1-in-10-uni-students-submit-assignments-written-by-someone-else-and-most-are-getting-away-with-it-166410
INFO:__main__:Scraping (7/10): https://theconversation.com/1-in-10-us-students-are-english-learners-143324
INFO:__main__:Scraping (8/10): https://theconversation.com/1-in-10-women-with-endometriosis-report-using-cannabis-to-ease-their-pain-126516
INFO:__main_

Unnamed: 0,url,title,text
0,https://theconversation.com/,News written by experts to help you understand,Trade in a mythical fish is threatening real s...
1,http://theconversation.com/,News written by experts to help you understand,Trade in a mythical fish is threatening real s...
2,https://theconversation.com/1-288-milliards-de...,1 288 milliards de dollars : chiffrer les dégâ...,Elles ont plus d’impacts que le changement cli...
3,https://theconversation.com/1-2t-infrastructur...,$1.2T infrastructure plan offers lucrative tar...,Lawmakers passed the US$1.2 trillion bipartisa...
4,https://theconversation.com/1-4-million-less-t...,1.4 million less than projected: how coronavir...,"In the early stages of COVID-19, much of the f..."


In [34]:
urls = query_cc_index("www.sciencenews.org", "article/*", limit=10)
mydf = scrape_articles(urls, max_words=500)
mydf.head(5)

INFO:__main__:Scraping (1/10): https://www.sciencenews.org/article/1000-genomes-pilot-hit-geneticists
INFO:__main__:Scraping (2/10): https://www.sciencenews.org/article/10000-year-explosion-how-civilization-accelerated-human-evolution-gregory-cochran-and-henry
INFO:__main__:Scraping (3/10): https://www.sciencenews.org/article/101-american-geo-sites-youve-gotta-see-geology-underfoot-albert-b-dickas
INFO:__main__:Scraping (4/10): https://www.sciencenews.org/article/1177-bc-bronze-age-societies-review
INFO:__main__:Scraping (5/10): https://www.sciencenews.org/article/18-new-species-pelican-spiders-discovered
INFO:__main__:Scraping (6/10): https://www.sciencenews.org/article/18903
INFO:__main__:Scraping (7/10): https://www.sciencenews.org/article/18910
INFO:__main__:Scraping (8/10): https://www.sciencenews.org/article/18912
INFO:__main__:Scraping (9/10): https://www.sciencenews.org/article/18917
INFO:__main__:Scraping (10/10): https://www.sciencenews.org/article/18918


Unnamed: 0,url,title,text
0,https://www.sciencenews.org/article/1000-genom...,1000 Genomes pilot a hit with geneticists,The average person walks around with defective...
1,https://www.sciencenews.org/article/10000-year...,"The 10,000 Year Explosion: How Civilization Ac...","Subscribers, enter your e-mail address for ful..."
2,https://www.sciencenews.org/article/101-americ...,101 American Geo-Sites You’ve Gotta See (Geolo...,This handy guide has plenty of labeled photos ...
3,https://www.sciencenews.org/article/1177-bc-br...,‘After 1177 B.C.’ describes how societies fare...,"After 1177 B.C. Eric H. Cline Princeton Univ.,..."
4,https://www.sciencenews.org/article/18-new-spe...,18 new species of pelican spiders discovered,"Despite their name, pelican spiders aren’t mas..."


In [35]:
urls = query_cc_index("arxiv.org", "abs/*", limit=10)
mydf = scrape_articles(urls, max_words=500)
mydf.head(5)

INFO:__main__:Scraping (1/10): https://arxiv.org/abs/0608700
INFO:__main__:Scraping (2/10): http://www.arxiv.org/abs/0704.0095
INFO:__main__:Scraping (3/10): http://arxiv.org/abs/0704.0101
INFO:__main__:Scraping (4/10): https://www.arxiv.org/abs/0704.0103
INFO:__main__:Scraping (5/10): https://www.arxiv.org/abs/0704.0106
INFO:__main__:Scraping (6/10): https://www.arxiv.org/abs/0704.0117
INFO:__main__:Scraping (7/10): http://www.arxiv.org/abs/0704.0129
INFO:__main__:Scraping (8/10): https://www.arxiv.org/abs/0704.0138
INFO:__main__:Scraping (9/10): https://www.arxiv.org/abs/0704.0143
INFO:__main__:Scraping (10/10): https://www.arxiv.org/abs/0704.0167


Unnamed: 0,url,title,text
0,http://www.arxiv.org/abs/0704.0095,[0704.0095] Geometry of Locally Compact Groups...,
1,http://arxiv.org/abs/0704.0101,[0704.0101] The birth of string theory,
2,https://www.arxiv.org/abs/0704.0103,[0704.0103] Generalized regularly discontinuou...,
3,https://www.arxiv.org/abs/0704.0106,[0704.0106] Multiple Parton Scattering in Nucl...,
4,https://www.arxiv.org/abs/0704.0117,[0704.0117] Lower ground state due to counter-...,
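
The `text` column above is empty for the arXiv pages: `scrape_articles` has no length check, unlike `scrape_rss_feed`, which keeps only articles longer than 200 characters. A sketch of applying the same threshold after scraping, with a hypothetical helper `drop_thin_records` that works on a list of dicts:

```python
from typing import Any, Dict, Iterable, List


def drop_thin_records(
    records: Iterable[Dict[str, Any]],
    min_chars: int = 200,
) -> List[Dict[str, Any]]:
    """Keep only records whose 'text' is longer than `min_chars` characters."""
    kept = []
    for record in records:
        text = (record.get("text") or "").strip()
        if len(text) > min_chars:
            kept.append(record)
    return kept


# Usage with a scraped DataFrame (not run here):
# mydf = pd.DataFrame(drop_thin_records(mydf.to_dict("records")))
# or, staying in pandas:
# mydf = mydf[mydf["text"].str.len() > 200]
```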


In [60]:
urls = query_cc_index("gemini.com", "blog/*", limit=10)
mydf = scrape_articles(urls, max_words=500)
mydf.head(5)

INFO:__main__:Scraping (1/10): https://www.gemini.com/blog/a-message-from-cameron-and-tyler?utm_content=null&utm_source=Sailthru&utm_medium=email&utm_campaign=Friday%20Email&utm_term=4ABCD
INFO:__main__:Scraping (2/10): https://www.gemini.com/blog/an-early-btc-ath-switches-up-halving-trend-eth-shows-mighty-market
INFO:__main__:Scraping (3/10): https://www.gemini.com/blog/bitcoin-rallies-back-to-usd94k-paul-atkins-sworn-in-as-sec-chair-and-cantor
INFO:__main__:Scraping (4/10): https://www.gemini.com/blog/bitcoin-rockets-higher-eth-etfs-log-record-inflows-and-dogecoin-continues-to
INFO:__main__:Scraping (5/10): https://www.gemini.com/blog/blackrock-to-offer-usd150-billion-blockchain-treasury-fund-crypto-continues
INFO:__main__:Scraping (6/10): https://www.gemini.com/blog/ethereum-hard-fork-modified-exchange-operations
INFO:__main__:Scraping (7/10): https://www.gemini.com/blog/exploring-gemini-activetrader-a-high-performance-crypto-trading-platform
INFO:__main__:Scraping (8/10): https://g

Unnamed: 0,url,title,text
0,https://www.gemini.com/blog/a-message-from-cam...,A Message from Cameron & Tyler,The following message was shared with Gemini e...
1,https://www.gemini.com/blog/an-early-btc-ath-s...,"An Early BTC ATH Switches Up Halving Trend, ET...",*Percentages reflect trends over the past seve...
2,https://www.gemini.com/blog/bitcoin-rallies-ba...,"Bitcoin Rallies Back to $94K, Paul Atkins Swor...",*Percentages reflect trends over the past seve...
3,https://www.gemini.com/blog/bitcoin-rockets-hi...,"Bitcoin Rockets Higher, ETH ETFs Log Record In...",*Percentages reflect trends over the past seve...
4,https://www.gemini.com/blog/blackrock-to-offer...,BlackRock To Offer $150 Billion Blockchain Tre...,*Percentages reflect trends over the past seve...


In [61]:
urls = query_cc_index("www.icr.org", "article/*", limit=10)
mydf = scrape_articles(urls, max_words=500)
mydf.head(5)

INFO:__main__:Scraping (1/10): http://www.icr.org/article/10015
INFO:__main__:Scraping (2/10): https://www.icr.org/article/10015
INFO:__main__:Scraping (3/10): https://www.icr.org/article/10019
INFO:__main__:Scraping (4/10): http://www.icr.org/article/10039/
INFO:__main__:Scraping (5/10): https://www.icr.org/article/10039/
INFO:__main__:Scraping (6/10): https://www.icr.org/article/10083/389
INFO:__main__:Scraping (7/10): https://www.icr.org/article/1011
INFO:__main__:Scraping (8/10): http://www.icr.org/article/10129
INFO:__main__:Scraping (9/10): https://www.icr.org/article/10129
INFO:__main__:Scraping (10/10): https://www.icr.org/article/10137


Unnamed: 0,url,title,text
0,http://www.icr.org/article/10015,Chicxulub Crater Theory Mostly Smoke,Chicxulub Crater Theory Mostly Smoke In secula...
1,https://www.icr.org/article/10015,Chicxulub Crater Theory Mostly Smoke,Chicxulub Crater Theory Mostly Smoke In secula...
2,https://www.icr.org/article/10019,Engineered Adaptability: Engineering Principle...,Engineered Adaptability: Engineering Principle...
3,http://www.icr.org/article/10039/,The Institute for Creation Research,Faithful Smyrna “And unto the angel of the chu...
4,https://www.icr.org/article/10039/,The Institute for Creation Research,Faithful Smyrna “And unto the angel of the chu...


In [70]:
urls = query_cc_index("thetruthaboutvaccines.com", "*", limit=10)
mydf = scrape_articles(urls, max_words=500)
mydf.head(15)

INFO:__main__:Scraping (1/10): https://thetruthaboutvaccines.com/
INFO:__main__:Scraping (2/10): https://thetruthaboutvaccines.com/admission-epic-proportions-health-canada-confirms-dna-plasmid-contamination-covid-vaccines/
INFO:__main__:Scraping (3/10): https://thetruthaboutvaccines.com/be-brave-pt1-vaccine-science-settled/
INFO:__main__:Scraping (4/10): https://thetruthaboutvaccines.com/bombshell-study-reveals-negative-efficacy-flu-vaxx/
INFO:__main__:Scraping (5/10): https://thetruthaboutvaccines.com/breaking-show-us-papers-vaccine-passport-system-launched-europe/
INFO:__main__:Scraping (6/10): https://thetruthaboutvaccines.com/canadian-detective-punished-probing-baby-deaths/
INFO:__main__:Scraping (7/10): https://thetruthaboutvaccines.com/category/medical-tyranny/
INFO:__main__:Scraping (8/10): https://thetruthaboutvaccines.com/category/medical-tyranny/page/3/
INFO:__main__:Scraping (9/10): https://thetruthaboutvaccines.com/category/news/
INFO:__main__:Scraping (10/10): https://thet

Unnamed: 0,url,title,text
0,https://thetruthaboutvaccines.com/,The Truth About Vaccines,Remember when they told us it was “just two we...
1,https://thetruthaboutvaccines.com/admission-ep...,‘An Admission of Epic Proportions’: Health Can...,TTAV is experiencing heavy censorship on many ...
2,https://thetruthaboutvaccines.com/be-brave-pt1...,Be Brave! – Part 1: Is Vaccine Science Really ...,TTAV is experiencing heavy censorship on many ...
3,https://thetruthaboutvaccines.com/bombshell-st...,Bombshell Study Reveals 27% Negative Efficacy ...,TTAV is experiencing heavy censorship on many ...
4,https://thetruthaboutvaccines.com/breaking-sho...,Vaccine Passport System Launched in Europe,TTAV is experiencing heavy censorship on many ...
5,https://thetruthaboutvaccines.com/canadian-det...,“Safe and Effective”… Or Else! Canadian Detect...,TTAV is experiencing heavy censorship on many ...
6,https://thetruthaboutvaccines.com/category/med...,Medical Tyranny Archives,Remember when they told us it was “just two we...
7,https://thetruthaboutvaccines.com/category/med...,Medical Tyranny Archives,It is a sad reality that many of our trusted i...
8,https://thetruthaboutvaccines.com/category/news/,The Truth About Vaccines,Remember when they told us it was “just two we...
9,https://thetruthaboutvaccines.com/category/new...,The Truth About Vaccines,Religious exemptions for vaccinations have bee...
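
Several rows above are the homepage or category archives rather than individual articles, so their text is a listing of teasers. A minimal sketch of filtering such URLs out before scraping, assuming the site uses WordPress-style paths (`/category/...`, `/page/N/`). The pattern and the helper names `is_listing_page` and `drop_listing_pages` are hypothetical and would need adjusting per site.

```python
import re
from urllib.parse import urlsplit

# Assumed listing-page shapes: site root, taxonomy archives, paginated archives.
LISTING_PATH = re.compile(r"^/?$|^/(category|tag|author)/|/page/\d+/?$")


def is_listing_page(url):
    """Return True when the URL path looks like an index or archive page."""
    return bool(LISTING_PATH.search(urlsplit(url).path))


def drop_listing_pages(urls):
    """Keep only URLs that do not look like index or archive pages."""
    return [url for url in urls if not is_listing_page(url)]
```

This filters on the URL alone, so it costs no extra requests. It will miss listing pages that use other path conventions.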


In [69]:
urls = query_cc_index("cryptopotato.com", "*", limit=10)
mydf = scrape_articles(urls, max_words=500)
mydf.head(15)

INFO:__main__:Scraping (1/10): https://cryptopotato.com/
INFO:__main__:Scraping (2/10): https://cryptopotato.com/
INFO:__main__:Scraping (3/10): https://cryptopotato.com/
INFO:__main__:Scraping (4/10): https://cryptopotato.com/
INFO:__main__:Scraping (5/10): https://cryptopotato.com/
INFO:__main__:Scraping (6/10): https://cryptopotato.com/
INFO:__main__:Scraping (7/10): https://cryptopotato.com/1-34-billion-liquidated-following-the-tesla-driven-bitcoin-rally/
INFO:__main__:Scraping (8/10): https://cryptopotato.com/1-3b-of-bitcoin-withdrawn-from-exchanges-as-miners-reserves-reach-yearly-high/
INFO:__main__:Scraping (9/10): https://cryptopotato.com/1-72b-worth-of-bitcoin-moved-to-accumulation-addresses-after-dip-below-63k-data/
INFO:__main__:Scraping (10/10): https://cryptopotato.com/1-billion-just-sent-from-tether-treasury-to-binance-here-is-why/


Unnamed: 0,url,title,text
0,https://cryptopotato.com/,CryptoPotato,Claimed ownership of cryptocurrency is more co...
1,https://cryptopotato.com/,CryptoPotato,Claimed ownership of cryptocurrency is more co...
2,https://cryptopotato.com/,CryptoPotato,Claimed ownership of cryptocurrency is more co...
3,https://cryptopotato.com/,CryptoPotato,Claimed ownership of cryptocurrency is more co...
4,https://cryptopotato.com/,CryptoPotato,Claimed ownership of cryptocurrency is more co...
5,https://cryptopotato.com/,CryptoPotato,Claimed ownership of cryptocurrency is more co...
6,https://cryptopotato.com/1-34-billion-liquidat...,$1.34 Billion Liquidated Following the Tesla-D...,Today is yet another tumultuous day in the cry...
7,https://cryptopotato.com/1-3b-of-bitcoin-withd...,$1.3B of Bitcoin Withdrawn from Exchanges as M...,After the most recent correction in which BTC ...
8,https://cryptopotato.com/1-72b-worth-of-bitcoi...,$1.72B Worth of Bitcoin Moved to Accumulation ...,A recurring pattern that has been observed dur...
9,https://cryptopotato.com/1-billion-just-sent-f...,$1 Billion Just Sent From Tether Treasury to B...,"Amid the current market crash, $1 billion wort..."


In [73]:
# Pin the lookup to a specific Common Crawl snapshot (CC-MAIN-2023-50).
urls = query_cc_index("blockworks.co", "news/*", limit=10, index='CC-MAIN-2023-50-index')
mydf = scrape_articles(urls, max_words=500)
mydf.head(15)

INFO:__main__:Scraping (1/10): https://blockworks.co/news/0xresearch-bitwise-spot-bitcoin-etf
INFO:__main__:Scraping (2/10): https://blockworks.co/news/1-19-billion-invested-in-crypto-companies-this-week
INFO:__main__:Scraping (3/10): https://blockworks.co/news/1-9t-digital-asset-market-cap-closing-in-on-apples-value
INFO:__main__:Scraping (4/10): https://blockworks.co/news/100m-crypto-etf
INFO:__main__:Scraping (5/10): https://blockworks.co/news/100m-crypto-etf?nocache
INFO:__main__:Scraping (6/10): https://blockworks.co/news/10t-holdings-raises-750m-for-first-growth-equity-funds
INFO:__main__:Scraping (7/10): https://blockworks.co/news/140-billion-asset-management-firm-considers-launching-crypto-fn
INFO:__main__:Scraping (8/10): https://blockworks.co/news/140-billion-asset-management-firm-considers-launching-crypto-fn
INFO:__main__:Scraping (9/10): https://blockworks.co/news/1500-bahamian-ftx-accounts-withdrew-funds
INFO:__main__:Scraping (10/10): https://blockworks.co/news/190k-peop

Unnamed: 0,url,title,text
0,https://blockworks.co/news/0xresearch-bitwise-...,Spot bitcoin ETF approval incoming? Bitwise CI...,Recent “breakthroughs” are reason enough for B...
1,https://blockworks.co/news/1-19-billion-invest...,Funding Roundup: $1.19 Billion Invested in Cry...,key takeaways Big investments closed this week...
2,https://blockworks.co/news/1-9t-digital-asset-...,$1.9T Digital Asset Market Cap Closing in on A...,key takeaways Soaring past the market cap of A...
3,https://blockworks.co/news/100m-crypto-etf,First ‘crypto’ ETF in US eclipses $100M mark,The first US ETF allowed to have “crypto” in i...
4,https://blockworks.co/news/100m-crypto-etf?noc...,First ‘crypto’ ETF in US eclipses $100M mark,The first US ETF allowed to have “crypto” in i...
5,https://blockworks.co/news/10t-holdings-raises...,10T Holdings Raises $750M for First Growth Equ...,key takeaways Billionaire hedge fund manager A...
6,https://blockworks.co/news/140-billion-asset-m...,$140 Billion Asset-Management Firm Considers L...,key takeaways Everybody is talking crypto in o...
7,https://blockworks.co/news/140-billion-asset-m...,$140 Billion Asset-Management Firm Considers L...,key takeaways Everybody is talking crypto in o...
8,https://blockworks.co/news/1500-bahamian-ftx-a...,"FTX May Need To Claw Back $100M From 1,500 Bah...",FTX’s granting of a peculiar withdrawal window...
9,https://blockworks.co/news/190k-people-work-in...,"Nearly 190K people work in crypto, more than 5...","The cryptocurrency industry boasts nearly 190,..."
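
Half of the `text` values above begin with the section label `key takeaways`, which is page furniture rather than article content. A minimal sketch of stripping such a label, assuming the label always appears at the very start of the text. `strip_leading_label` is a hypothetical helper, not part of this notebook, and the label list is taken only from the rows shown here.

```python
def strip_leading_label(text, labels=("key takeaways",)):
    """Remove a known label from the start of the text, case-insensitively.

    Text that does not start with any label is returned with only its
    leading whitespace removed.
    """
    stripped = text.lstrip()
    for label in labels:
        if stripped[:len(label)].lower() == label.lower():
            # Also drop separators left behind after the label.
            return stripped[len(label):].lstrip(" :\n\t")
    return stripped
```

With a DataFrame like `mydf`, this could be applied as `mydf["text"].map(strip_leading_label)`.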


In [74]:
urls = query_cc_index("biologos.org", "common-questions/*", limit=10, index='CC-MAIN-2023-50-index')
mydf = scrape_articles(urls, max_words=500)
mydf.head(15)

INFO:__main__:Scraping (1/10): https://biologos.org/common-questions/are-gaps-in-scientific-knowledge-evidence-for-god
INFO:__main__:Scraping (2/10): https://biologos.org/common-questions/can-science-and-scripture-be-reconciled
INFO:__main__:Scraping (3/10): https://biologos.org/common-questions/can-science-and-scripture-be-reconciled?/
INFO:__main__:Scraping (4/10): https://biologos.org/common-questions/does-modern-science-make-miracles-impossible
INFO:__main__:Scraping (5/10): https://biologos.org/common-questions/does-thermodynamics-disprove-evolution
INFO:__main__:Scraping (6/10): https://biologos.org/common-questions/how-can-evolution-account-for-the-complexity-of-life-on-earth-today
INFO:__main__:Scraping (7/10): https://biologos.org/common-questions/how-is-biologos-different-from-evolutionism-intelligent-design-and-creationism
INFO:__main__:Scraping (8/10): https://biologos.org/common-questions/how-long-are-the-days-of-genesis-1
INFO:__main__:Scraping (9/10): https://biologos.or

Unnamed: 0,url,title,text
0,https://biologos.org/common-questions/are-gaps...,Are gaps in scientific knowledge evidence for ...,In both of these examples — one related to the...
1,https://biologos.org/common-questions/can-scie...,Can science and Scripture be reconciled?,The heavens declare the glory of God…The law o...
2,https://biologos.org/common-questions/can-scie...,Can science and Scripture be reconciled?,The heavens declare the glory of God…The law o...
3,https://biologos.org/common-questions/does-mod...,Does modern science make miracles impossible?,"What is a miracle? In the Bible, events variou..."
4,https://biologos.org/common-questions/does-the...,Does thermodynamics disprove evolution?,Introduction A common argument against evoluti...
5,https://biologos.org/common-questions/how-can-...,How Can Evolution Account for the Complexity o...,Introduction A complex biological structure wi...
6,https://biologos.org/common-questions/how-is-b...,How is Evolutionary Creation different from Ev...,We at BioLogos maintain that the scientific ev...
7,https://biologos.org/common-questions/how-long...,How Long are the Days of Genesis 1?,Introduction Did the author of Genesis 1 inten...
8,https://biologos.org/common-questions/how-was-...,How was the Genesis Account of Creation Interp...,Later Christian thought There are many other n...
9,https://biologos.org/common-questions/human-or...,Were Adam and Eve Historical Figures?,Traditional interpretations of Scripture shoul...
