# Additional Data Scraping

This notebook pulls BS and non-BS text from additional sources to build extra training data across multiple domains. A larger, more varied dataset should help the model generalize across different types of content.

In [2]:
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("Cornell-University/arxiv")

print("Path to dataset files:", path)
print("Files in the directory:")
for filename in os.listdir(path):
    if os.path.isfile(os.path.join(path, filename)):
        print(filename)
        



Path to dataset files: /Users/ssgrummo/.cache/kagglehub/datasets/Cornell-University/arxiv/versions/237
Files in the directory:
arxiv-metadata-oai-snapshot.json


In [3]:
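import pandas as pd

# The snapshot is in JSON Lines format; read only the first 5,000 records.
# Note: pandas infers `id` as a number here, so the arXiv id "0704.0001" is
# displayed as 704.0001 below. Passing e.g. dtype={"id": str} keeps the leading zero.
arxiv_df = pd.read_json(os.path.join(path, "arxiv-metadata-oai-snapshot.json"), lines=True, nrows=5000)
arxiv_df.head(5)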

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"
2,704.0003,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,,physics.gen-ph,,The evolution of Earth-Moon system is descri...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2008-01-13,"[[Pan, Hongjun, ]]"
3,704.0004,David Callan,David Callan,A determinant of Stirling cycle numbers counts...,11 pages,,,,math.CO,,We show that a determinant of Stirling cycle...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2007-05-23,"[[Callan, David, ]]"
4,704.0005,Alberto Torchinsky,Wael Abu-Shammala and Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,,"Illinois J. Math. 52 (2008) no.2, 681-689",,,math.CA math.FA,,In this paper we show how to compute the $\L...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2013-10-15,"[[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]"
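
Reading with `nrows=5000` only covers the first 5,000 lines of the snapshot. If a sample from particular subject areas is wanted instead, one option is to stream the file line by line and keep only matching records. The following is a minimal sketch using only the standard library; the helper name `iter_arxiv_records` and its `category_prefix` and `limit` parameters are hypothetical and not part of the original notebook. It assumes the file has one JSON object per line with a space-separated `categories` field, as the output above suggests.

```python
import json
from itertools import islice


def iter_arxiv_records(snapshot_path, category_prefix=None, limit=None):
    """Lazily yield records from a JSON Lines file (hypothetical helper).

    Only records with at least one category starting with `category_prefix`
    are yielded; `limit` caps how many records are returned in total.
    """
    def _matching_records():
        with open(snapshot_path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                categories = (record.get("categories") or "").split()
                if category_prefix is None or any(
                    cat.startswith(category_prefix) for cat in categories
                ):
                    yield record

    # islice with limit=None simply passes every matching record through
    return islice(_matching_records(), limit)
```

With the `path` variable from the first cell, something like `pd.DataFrame(iter_arxiv_records(os.path.join(path, "arxiv-metadata-oai-snapshot.json"), "math.", limit=5000))` would then build a DataFrame of math papers without loading the whole file into memory. Because `json.loads` does no type inference, the `id` field also stays a string.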


In [8]:
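from newspaper import Article
import feedparser
import pandas as pd

def scrape_rss_feed(feed_url, max_articles=50):
    """Download the articles linked from an RSS feed and return excerpts as a DataFrame."""
    feed = feedparser.parse(feed_url)
    data = []
    for entry in feed.entries[:max_articles]:
        url = entry.get("link")
        if not url:
            continue  # entry has no article link
        art = Article(url)
        try:
            art.download()
            art.parse()
        except Exception as exc:  # skip articles that fail to download or parse
            print(f"Skipping {url}: {exc}")
            continue
        if len(art.text) > 200:  # drop stubs and teaser-only pages
            data.append({
                "url": url,
                "title": art.title,
                "date": entry.get("published"),  # not every feed sets this field
                "text": art.text[:1000],  # keep a short excerpt, not the full article
            })
    return pd.DataFrame(data, columns=["url", "title", "date", "text"])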

# Example usage:
feed_url = 'https://www.sciencenews.org/feed'

df = scrape_rss_feed(feed_url, 1000)

df.head(5)

Unnamed: 0,url,title,date,text
0,https://www.sciencenews.org/article/oldest-nea...,"A 43,000-year-old Neandertal fingerprint has b...","Tue, 10 Jun 2025 16:00:00 +0000","In a rugged landscape in central Spain, archae..."
1,https://www.sciencenews.org/article/climate-ch...,Climate change is coming for your cheese,"Tue, 10 Jun 2025 14:00:00 +0000","By affecting cows’ diets, climate change can a..."
2,https://www.sciencenews.org/article/biggest-sp...,How to get the biggest splash at the pool usin...,"Mon, 09 Jun 2025 16:00:00 +0000","When it comes to making a splash, technique to..."
3,https://www.sciencenews.org/article/milk-fda-f...,"FDA cuts imperil food safety, but not how you ...","Mon, 09 Jun 2025 14:00:00 +0000",A pause in checking milk-testing labs. A withd...
4,https://www.sciencenews.org/article/dwarf-plan...,A possible new dwarf planet skirts the solar s...,"Fri, 06 Jun 2025 15:00:00 +0000",A possible cousin of Pluto seems to be circlin...
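
The `date` column above holds raw RFC 2822 strings from the feed, which do not sort or compare chronologically as text. A small sketch of normalizing them to ISO 8601 with the standard library follows; the helper name `to_iso_date` is hypothetical and not part of the original notebook, and it assumes feeds use the same date format as the Science News output shown above.

```python
from email.utils import parsedate_to_datetime


def to_iso_date(rss_date):
    """Convert an RFC 2822 date string to ISO 8601, or None if it cannot be parsed."""
    if not rss_date:
        return None  # missing or empty date field
    try:
        return parsedate_to_datetime(rss_date).isoformat()
    except (TypeError, ValueError):
        # Different Python versions raise different errors for malformed input
        return None


print(to_iso_date("Tue, 10 Jun 2025 16:00:00 +0000"))  # 2025-06-10T16:00:00+00:00
```

Applied to the scraped frame, this could look like `df["date"] = df["date"].map(to_iso_date)`.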
