# Additional Data Scraping

This notebook explores additional sources of BS and non-BS text to generate training data across multiple domains. Expanding the dataset this way should improve the model's ability to generalize across different types of content.

## arXiv

In [None]:
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("Cornell-University/arxiv")

print("Path to dataset files:", path)
print("Files in the directory:")
for filename in os.listdir(path):
    if os.path.isfile(os.path.join(path, filename)):
        print(filename)
        


In [None]:
import pandas as pd
arxiv_df = pd.read_json(os.path.join(path, "arxiv-metadata-oai-snapshot.json"), lines=True, nrows=5000)
arxiv_df.head(5)

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"
2,704.0003,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,,physics.gen-ph,,The evolution of Earth-Moon system is descri...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2008-01-13,"[[Pan, Hongjun, ]]"
3,704.0004,David Callan,David Callan,A determinant of Stirling cycle numbers counts...,11 pages,,,,math.CO,,We show that a determinant of Stirling cycle...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2007-05-23,"[[Callan, David, ]]"
4,704.0005,Alberto Torchinsky,Wael Abu-Shammala and Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,,"Illinois J. Math. 52 (2008) no.2, 681-689",,,math.CA math.FA,,In this paper we show how to compute the $\L...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2013-10-15,"[[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]"
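
The `id` column in the preview reads `704.0001`, whereas the corresponding arXiv identifier is `0704.0001`, which suggests the value was inferred as a number and lost its leading zero. One way to avoid this, sketched below under the assumption that each line of the snapshot is a standalone JSON object, is to parse the first records with the standard `json` module, which leaves string fields untouched. The helper name `read_jsonl_head` and the sample records are hypothetical.

```python
import json
from itertools import islice

def read_jsonl_head(lines, limit):
    """Parse up to `limit` non-empty JSON-lines records from an iterable of strings."""
    non_empty = (line for line in lines if line.strip())
    return [json.loads(line) for line in islice(non_empty, limit)]

# Tiny in-memory stand-in for the snapshot file (illustrative records only)
sample_lines = [
    '{"id": "0704.0001", "categories": "hep-ph"}\n',
    '\n',
    '{"id": "0704.0002", "categories": "math.CO cs.CG"}\n',
    '{"id": "0704.0003", "categories": "physics.gen-ph"}\n',
]
records = read_jsonl_head(sample_lines, limit=2)
print([r["id"] for r in records])
```

With the real file, `lines` would be the open file handle for `arxiv-metadata-oai-snapshot.json`, and `pd.DataFrame(records)` would give a frame whose `id` column holds strings.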


## Scraping RSS Feeds for BS and Non-BS Data

In [17]:
from newspaper import Article
import feedparser
import pandas as pd

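def scrape_rss_feed(feed_url, max_articles=50):
    """Download the articles linked from a feed and return their excerpts."""
    feed = feedparser.parse(feed_url)
    data = []
    for entry in feed.entries[:max_articles]:
        url = entry.get("link")
        if not url:
            continue
        art = Article(url)
        try:
            art.download()
            art.parse()
        except Exception:
            # Skip articles that fail to download or parse
            continue
        if len(art.text) > 200:  # skip short or empty pages
            data.append({
                "url": url,
                "title": art.title,
                "date": entry.get("published"),  # None if the feed omits it
                "text": art.text[:1000],  # keep only a short excerpt
            })
    # Fixed columns keep the shape consistent even when no article qualifies
    return pd.DataFrame(data, columns=["url", "title", "date", "text"])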
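
The scraper stores each entry's `published` value as the raw string from the feed. RSS feeds conventionally use RFC 822 style dates while Atom feeds use ISO 8601 style timestamps, so the `date` column can end up with mixed formats. A minimal sketch of normalizing both to UTC with the standard library follows; the function name `normalize_feed_date` is hypothetical, and treating timestamps without an offset as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def normalize_feed_date(raw):
    """Return a timezone-aware UTC datetime, or None if the string cannot be parsed."""
    if not raw:
        return None
    text = raw.strip()
    try:
        # RFC 822 style, e.g. "Mon, 02 Apr 2007 19:18:42 GMT"
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        try:
            # ISO 8601 style, e.g. "2024-05-01T12:30:00Z"
            iso = text[:-1] + "+00:00" if text.endswith("Z") else text
            parsed = datetime.fromisoformat(iso)
        except ValueError:
            return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

print(normalize_feed_date("Mon, 02 Apr 2007 19:18:42 GMT"))
print(normalize_feed_date("2024-05-01T14:30:00+02:00"))
```

Applied to the scraped frame, this could be used as `df["date"].map(normalize_feed_date)`.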


In [18]:
legit_feed_urls = [
    'https://www.sciencenews.org/feed',
    'https://www.pewresearch.org/feed/',
    'https://theconversation.com/topics/economics-488/articles.atom',
    'https://theconversation.com/topics/extreme-weather-3799/articles.atom',
]

# The high max_articles value effectively takes every entry each feed offers
df_list = []
for url in legit_feed_urls:
    df = scrape_rss_feed(url, 5000)
    df_list.append(df)

legit_df = pd.concat(df_list, ignore_index=True)
legit_df.shape

(79, 4)

In [20]:
bs_feed_url = 'https://www.naturalnews.com/rss.xml'
bs_df = scrape_rss_feed(bs_feed_url, 10000)
bs_df.shape

(31, 4)

In [90]:
from datasets import load_dataset
dataset = load_dataset("cc_news", split="train")

dataset

Dataset({
    features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
    num_rows: 708241
})

In [91]:
# dataset['domain'] holds one value per row; set() reduces it to the distinct domains
unique_values = dataset['domain']
unique_elements = list(set(unique_values))
len(unique_elements)

8759

In [None]:
item_to_check = "science"

science_domains = [item for item in unique_elements if item_to_check in item]
science_domains


['www.france-science.org',
 'joshmitteldorf.scienceblog.com',
 'science.slashdot.org',
 'scienceofmind.com',
 'scienceblogs.com',
 'horizon.scienceblog.com',
 'www.sciencebase.com',
 'www.scienceworldreport.com',
 'www.sciencealert.com',
 'news.science360.gov',
 'www.sciencefriday.com',
 'alankandel.scienceblog.com',
 'thebrainbank.scienceblog.com',
 'www.sciencespacerobots.com',
 'www.sciencebeing.com',
 'science.howstuffworks.com',
 'forthesakeofscience.com',
 'abitofscience.com',
 'www.sciencedaily.com',
 'www.livescience.com',
 'scienceblog.com',
 'www.myscience.org']

In [63]:
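def filter_dataset(dataset, column_name, target_string):
    """Keep rows whose `column_name` value contains `target_string` as a substring."""
    return dataset.filter(lambda example: target_string in example[column_name])

# Match on the URL rather than the `domain` column so a path prefix can be included
domain = "dailymail.co.uk/health"
filtered_dataset = filter_dataset(dataset, 'url', domain)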

filtered_dataset

Dataset({
    features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
    num_rows: 237
})
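
`filter_dataset` matches by substring, so a target such as `scienceblog.com` would also match URLs hosted on `horizon.scienceblog.com`. If exact host matching is wanted, a predicate along these lines could be used instead. This is a sketch built on `urllib.parse`; the name `url_matches` and its parameters are hypothetical, and the example URLs are made up for illustration.

```python
from urllib.parse import urlparse

def url_matches(url, host, path_prefix=""):
    """True if `url` is on exactly `host` and its path starts with `path_prefix`."""
    parsed = urlparse(url)
    # parsed.hostname is lowercased, and None when the URL has no network location
    return parsed.hostname == host.lower() and parsed.path.startswith(path_prefix)

# Illustrative URLs only
print(url_matches("https://scienceblog.com/123/post", "scienceblog.com"))
print(url_matches("https://horizon.scienceblog.com/123/post", "scienceblog.com"))
print(url_matches("https://www.dailymail.co.uk/health/a.html", "www.dailymail.co.uk", "/health/"))
```

With the dataset loaded, the predicate could be applied as `dataset.filter(lambda ex: url_matches(ex["url"], "www.sciencedaily.com"))`.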

In [68]:
domain = "theconversation.com"
filtered_dataset = filter_dataset(dataset, 'url', domain)

filtered_dataset

Dataset({
    features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
    num_rows: 357
})

In [None]:
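# Count rows per science domain (substring match on the URL, one full pass per domain)
for domain_name in science_domains:
    filtered_dataset = filter_dataset(dataset, 'url', domain_name)
    print(f"{domain_name} Row Count: {filtered_dataset.num_rows}")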

www.france-science.org Row Count: 2
joshmitteldorf.scienceblog.com Row Count: 1
science.slashdot.org Row Count: 50
scienceofmind.com Row Count: 1
scienceblogs.com Row Count: 7
horizon.scienceblog.com Row Count: 4
www.sciencebase.com Row Count: 2
www.scienceworldreport.com Row Count: 34
www.sciencealert.com Row Count: 63
news.science360.gov Row Count: 22
www.sciencefriday.com Row Count: 18
alankandel.scienceblog.com Row Count: 5
thebrainbank.scienceblog.com Row Count: 1
www.sciencespacerobots.com Row Count: 5
www.sciencebeing.com Row Count: 10
science.howstuffworks.com Row Count: 8
forthesakeofscience.com Row Count: 1
abitofscience.com Row Count: 19
www.sciencedaily.com Row Count: 106
www.livescience.com Row Count: 34
scienceblog.com Row Count: 71
www.myscience.org Row Count: 17

Use www.sciencedaily.com for non-BS data; with 106 rows it is the largest of the science domains listed above.
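
Each call to `filter_dataset` above scans all 708,241 rows. Since the dataset already has a `domain` column, a similar tally can be made in a single pass with `collections.Counter`. The sketch below uses a small hypothetical list in place of `dataset['domain']`, and the helper name `count_matching_domains` is also hypothetical. Counts obtained this way are exact matches on `domain`, so they may differ from the substring counts on `url` shown above.

```python
from collections import Counter

def count_matching_domains(domains, keyword):
    """Tally rows per domain, keeping only domains whose name contains `keyword`."""
    counts = Counter(domains)
    # most_common() orders by descending count, so the largest source comes first
    return {d: n for d, n in counts.most_common() if keyword in d}

# Hypothetical stand-in for dataset['domain'] (one entry per row)
sample_domains = [
    "www.sciencedaily.com", "www.bbc.co.uk", "www.sciencedaily.com",
    "scienceblog.com", "www.health.com", "www.sciencedaily.com",
]
science_counts = count_matching_domains(sample_domains, "science")
print(science_counts)
```

The same helper would cover the later keyword searches by changing `keyword`, for example to `"health"`.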

In [92]:
item_to_check = "health"

health_domains = [item for item in unique_elements if item_to_check in item]
health_domains


['www.dailyhealthneeds.com',
 'healthyregion.bangordailynews.com',
 'www.healthcareitnews.com',
 'www.menshealth.com',
 'health.fajar.co.id',
 'health.good.is',
 'www.healthcanal.com',
 'www.health24.com',
 'www.diabeteshealth.com',
 'healthcitysun.com',
 'www.thehealthsite.com',
 'catchinghealth.bangordailynews.com',
 'www.health.com']

In [93]:
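# Count rows per health domain (substring match on the URL, one full pass per domain)
for domain_name in health_domains:
    filtered_dataset = filter_dataset(dataset, 'url', domain_name)
    print(f"{domain_name} Row Count: {filtered_dataset.num_rows}")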

www.dailyhealthneeds.com Row Count: 1
healthyregion.bangordailynews.com Row Count: 1
www.healthcareitnews.com Row Count: 37
www.menshealth.com Row Count: 10
health.fajar.co.id Row Count: 2
health.good.is Row Count: 3
www.healthcanal.com Row Count: 2
www.health24.com Row Count: 17
www.diabeteshealth.com Row Count: 4
healthcitysun.com Row Count: 10
www.thehealthsite.com Row Count: 55
catchinghealth.bangordailynews.com Row Count: 10
www.health.com Row Count: 33

In [95]:
# www.thehealthsite.com has the most rows (55) of the health domains; inspect its headlines
bs_dataset = filter_dataset(dataset, 'url', 'www.thehealthsite.com')
bs_dataset['title']

['Sedentary lifestyle and no exercise leading to critical illness',
 'Union Minister Uma Bharti admitted to AIIMS for high blood pressure',
 'Delhi Burari deaths: Did ‘shared psychosis’ lead to the mass suicides in the Bhatia family?',
 '3 bathroom sex positions you should surely try!',
 'Can good bacteria keep gut healthy?',
 'Odisha government to establish 19 hospitals on Public Private Partnership (PPP) mode',
 'Which one is healthier: pasteurised milk, unpasteurised milk, homogenous milk or toned milk?',
 'Can pelvic exams help diagnose STDs in girls?',
 'World No Tobacco Day 2018: Secondhand smoke and its risk to your heart',
 'International Day of Happiness 2018: These yoga asanas will help improve your mental health',
 'World Oral Health Day 2018: 7 harmful habits that are destroying your tooth enamel',
 'World No Tobacco Day 2018: Ditch smoking and tobacco; choose heart health',
 'Lap-band surgery may lower chronic knee pain',
 'Punjab government to impose ban on hookah bars ac