Run all of these blocks at the beginning of a session (where applicable); every later section of the notebook relies on them.

In [None]:
# run if connecting to google drive
from google.colab import drive

drive.mount('/content/drive')

In [None]:
# used for writing to multiple sheets in a single Excel workbook
!pip install xlsxwriter

In [None]:
import pandas as pd

# Script for Scraping Text Articles from Fake News Sites
## General notes on webscraping
Webscraping: Extracting content from a website using a script or other automated process

Before scraping:
* Review the website's terms and conditions
 * Certain websites do not allow scraping of any kind (e.g. YouTube's ToS does not allow people to download videos hosted on the platform)
 * Check the website's robots.txt file (generally accessible by appending '/robots.txt' to the main page's URL), which specifies which paths each crawler or bot, identified by its user agent, is allowed to access
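
The robots.txt check can also be done in code. Below is a minimal sketch using the standard-library `urllib.robotparser`. The sample rules, the `my-research-bot` user agent, and the `example.com` URLs are made up for illustration; against a real site, `set_url(...)` followed by `read()` would fetch the live file instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt content, inlined so the example runs offline
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(useragent, url) reports whether the rules allow that agent to request the url
allowed_article = parser.can_fetch("my-research-bot", "https://example.com/news/article-1")
allowed_private = parser.can_fetch("my-research-bot", "https://example.com/private/page")

print(allowed_article, allowed_private)
```

A robots.txt file only states the site's wishes for crawlers, so the terms and conditions still need to be reviewed separately.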

Choosing a webscraper:
* Various Python libraries are available, depending on the use case
* With the rise of GenAI models trained on scraped data, scraping is becoming more difficult because websites are blocking webscrapers to prevent LLMs from training on their data
* Building your own webscraping script will generally work better than using existing webscraping software (for the aforementioned AI reason)

## Existing Webscrapers
1. BeautifulSoup

* Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* Tutorial: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
* Basic description: Python library that extracts and presents data from HTML and XML files in a format that can be easily navigated
 * Pros: Works with any website, can extract any type of content with some tweaking
 * Cons: Requires a precise understanding of each page's HTML structure, code needs to be different for every website

2. Newspaper3k

* Documentation: https://newspaper.readthedocs.io/en/latest/
* Tutorial: https://towardsdatascience.com/the-easy-way-to-web-scrape-articles-online-d28947fc5979
* Basic description: Python library that extracts content from news articles online
 * Pros: Relatively easy to use, fairly plug-and-play for any website
 * Cons: Seems to be text-only, slow

3. Newscatcher

* Documentation: https://github.com/kotartemiy/newscatcher
* Tutorial: https://codarium.substack.com/p/build-your-first-news-data-pipeline-with-python-newscatcher-5fa18b8c5083
* Basic description: Python library that extracts metadata from news sites about their articles (titles, summaries, etc.)
 * Pros: Relatively easy to use
 * Cons: Only works for websites formatted as news websites, does not extract article text

## Experimenting with Newspaper3k

* Read articles from the propaganda website [The Counter Signal](https://thecountersignal.com/)
* Save scraped articles in a Pandas dataframe that can be converted to an Excel spreadsheet
 * Headers: title, author, text, date, url

Notes/Observations:
* Easy to set up and get going with minimal work
* Tends to only be compatible with conventionally formatted news sites (it pulls from url/feeds and url/rss, which doesn't work for hoax websites that do not follow these conventions)
 * Would need to look into whether it's possible to tell the library which page(s) to look at
* Need to look into whether it's possible to use this library to download audio, especially from video-hosting platforms like YouTube
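
To see whether a site follows the feed conventions mentioned above, the candidate feed URLs can be built and inspected by hand. This is a rough sketch only: the `FEED_PATHS` list contains just the two paths named in the notes, and `candidate_feed_urls` is a hypothetical helper, not part of any scraping library.

```python
from urllib.parse import urljoin

# feed paths named in the notes above; a given site may use different ones
FEED_PATHS = ["feeds", "rss"]

def candidate_feed_urls(base_url):
  """Return the feed URLs to try for a site, one per known feed path."""
  # make sure the base ends with '/' so urljoin appends instead of replacing the last segment
  if not base_url.endswith("/"):
    base_url += "/"
  return [urljoin(base_url, path) for path in FEED_PATHS]

urls = candidate_feed_urls("https://thecountersignal.com")
print(urls)
```

Requesting each URL (for example with `requests.get`) and checking the status code would show whether the site exposes a feed there; that step is left out so the sketch runs offline.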

In [None]:
!pip install newspaper3k

In [None]:
import newspaper
from newspaper import Article
from newspaper import Source

In [None]:
# build a newspaper Source for the news website
# setting memoize_articles to False to avoid caching articles from each run
source = newspaper.build("https://thecountersignal.com/", memoize_articles=False)

# collect one dict per article; the dataframe is built once at the end
rows = []

# download and parse each article found on the webpage
for article in source.articles:
  try:
    article.download()
    article.parse()
  except newspaper.ArticleException:
    # skip articles that fail to download or parse
    continue

  # run this if we want to get a summary of the articles
  # using the built-in nlp process of the library
  # article.nlp()

  # article.authors is a list, so join it into one string per article;
  # assigning a scalar such as article.title to a column of an empty
  # dataframe leaves that column blank, so rows are built as dicts instead
  rows.append({
      'title': article.title,
      'author': ', '.join(article.authors),
      'text': article.text,
      'url': article.url,
  })

# masterlist dataframe with one row per article
news_df = pd.DataFrame(rows, columns=['title', 'author', 'text', 'url'])

In [None]:
news_df.head()

| | title | author | text | url |
|---|---|---|---|---|
| 0 | | Mike Campbell | Alberta Premier Danielle Smith announced that ... | https://thecountersignal.com/breaking-smith-an... |
| 1 | | Mike Campbell | With the election four weeks out, the BC Conse... | https://thecountersignal.com/bc-conservatives-... |
| 2 | | Mike Campbell | Environment Minister Stephen Guilbeault mostly... | https://thecountersignal.com/guilbeault-says-t... |
| 3 | | Mike Campbell | Prime Minister Justin Trudeau has become so to... | https://thecountersignal.com/justin-trudeau-ha... |
| 4 | | Tcs Wire | NDP leader Jagmeet Singh has voted to keep Pri... | https://thecountersignal.com/singh-votes-to-ke... |


In [None]:
# export dataframe as an excel sheet
output_fp = '/content/drive/My Drive/kickstarter_seed_project/newscatcher_test.xlsx'

news_df.to_excel(output_fp, index=False)

# use the following code to write each dataframe to its own sheet in a single workbook
# (df1, df2, df3 and the sheet names are placeholders)
'''
# create excel writer object to initialize new workbook;
# the with-block saves and closes the output file on exit
with pd.ExcelWriter(output_fp, engine="xlsxwriter") as writer:
  # write dataframes to different worksheets
  df1.to_excel(writer, sheet_name="source1", index=False)
  df2.to_excel(writer, sheet_name="source2", index=False)
  df3.to_excel(writer, sheet_name="source3", index=False)
'''

## Experimenting with Beautiful Soup
* Read articles from the hoax website [The Pulse](https://www.thepulse.one/)
* Save scraped articles in a Pandas dataframe that can be converted to an Excel spreadsheet
 * Headers: title, author (if possible), text, date (if possible), url

Experiment with the front page (latest news) only for now

Notes/Observations:
* More complicated to get started
* More versatile (works with any website; preliminary research suggests it can be combined with other libraries to scrape audio files)
* Website-dependent (the HTML structure of each website has to be inspected manually for the code to work)
* Output may need to be cleaned up on a case-by-case basis (in the first pass, BeautifulSoup produced stray 'None' rows in the article text)
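
The stray 'None' rows come from `Tag.string`, which returns `None` when a tag is empty or has more than one child (for example a paragraph that contains a link). Wrapping that value in `str()` turns it into the literal text 'None'. One possible cleanup, sketched below, filters those entries out before joining; `join_paragraphs` and the sample list are hypothetical and only illustrate the idea.

```python
def join_paragraphs(paragraphs):
  """Join paragraph strings with newlines, skipping None and blank entries."""
  kept = [p.strip() for p in paragraphs if p is not None and p.strip()]
  return "\n".join(kept)

# made-up values standing in for what p.string might return for each <p> tag
raw = [None, "First paragraph.", None, "  ", "Second paragraph."]
print(join_paragraphs(raw))
```

An alternative is to call `get_text()` on each paragraph instead of reading `.string`, which also keeps the text of paragraphs that contain nested tags rather than dropping them.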

In [None]:
from bs4 import BeautifulSoup
import os
import re
import requests

In [None]:
pulse_df = pd.DataFrame(columns = ['title', 'author', 'date', 'text', 'url'])

root = requests.get('https://www.thepulse.one/archive?sort=new', timeout=30)
site_soup = BeautifulSoup(root.content, 'html.parser')

# the class strings below were copied from the site's html and must match exactly,
# so they need to be updated whenever the site's markup changes
links = site_soup.find_all('a', attrs={'class': 'pencraft pc-reset _color-pub-primary-text_3axfk_204 _font-pub-headings_3axfk_140 _clamp_3axfk_259 _clamp-3_3axfk_271 _reset_3axfk_1'})

def tag_text(tag):
  # return the tag's text, or None if the element was not found on the page
  return tag.get_text(strip=True) if tag is not None else None

for link in links:
  url = link.get('href')
  path = requests.get(url, timeout=30)
  article_soup = BeautifulSoup(path.content, 'html.parser')

  # extract article info from soup
  article_title = article_soup.find(attrs={'class': 'post-title unpublished'})
  article_author = article_soup.find(attrs={'class': 'pencraft pc-reset _decoration-hover-underline_3axfk_298 _reset_3axfk_1'})
  article_date = article_soup.find(attrs={'class': 'pencraft pc-reset _color-pub-secondary-text_3axfk_207 _line-height-20_3axfk_95 _font-meta_3axfk_131 _size-11_3axfk_35 _weight-medium_3axfk_162 _transform-uppercase_3axfk_242 _reset_3axfk_1 _meta_3axfk_442'})

  # p.string is None for empty paragraphs and for paragraphs with nested tags
  # (links, italics), which caused the 'None' rows in the first pass;
  # get_text() keeps the nested text, and blank paragraphs are skipped
  paragraphs = [p.get_text() for p in article_soup.find_all('p')]
  article_text = '\n'.join(t for t in paragraphs if t.strip()) # newline separated string

  # create new row of dataframe
  new_row = {
      'title': tag_text(article_title),
      'author': tag_text(article_author),
      'date': tag_text(article_date),
      'text': article_text,
      'url': url
  }

  # add row to dataframe
  pulse_df.loc[len(pulse_df)] = new_row

In [None]:
pulse_df.head()

| | title | author | date | text | url |
|---|---|---|---|---|---|
| 0 | What The Heck Is Going on With The World? | Joe Martino | Sep 26, 2024 | I titled this in jest.\nIt has been a busy mon... | https://www.thepulse.one/p/what-the-heck-is-go... |
| 1 | What If You Heard a Rumor That You Don’t Exist? | Tom Bunzel | Sep 25, 2024 | When I lived in Los Angeles I worked with a te... | https://www.thepulse.one/p/what-if-you-heard-a... |
| 2 | Not Another Video About 9/11... | Joe Martino | Sep 09, 2024 | It simply states we are at a time where many u... | https://www.thepulse.one/p/not-another-video-a... |
| 3 | A Sense of the Sacred | Tom Bunzel | Sep 07, 2024 | It made me consider where the sacred might be ... | https://www.thepulse.one/p/a-sense-of-the-sacred |
| 4 | The Genetic Modification Of Our Food Compared ... | Arjun Walia | Aug 26, 2024 | “As part of the process, they portrayed the va... | https://www.thepulse.one/p/the-genetic-modific... |
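
The scraped dates are plain strings such as 'Sep 26, 2024', which sort alphabetically rather than chronologically. A small sketch for converting them to ISO format is shown below; `parse_pulse_date` is a hypothetical helper, and the `%b %d, %Y` pattern assumes the date format seen in the output above and English month abbreviations.

```python
from datetime import datetime

def parse_pulse_date(date_str):
  """Convert a date like 'Sep 26, 2024' to 'YYYY-MM-DD'; return None if it does not match."""
  try:
    return datetime.strptime(date_str.strip(), "%b %d, %Y").date().isoformat()
  except (ValueError, AttributeError):
    # ValueError: unexpected format; AttributeError: date_str is None
    return None

print(parse_pulse_date("Sep 26, 2024"))  # 2024-09-26
print(parse_pulse_date("yesterday"))     # None
```

The helper could be applied to the whole column with `pulse_df['date'].map(parse_pulse_date)` before exporting.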


In [None]:
# export dataframe as an excel sheet
output_fp = '/content/drive/My Drive/kickstarter_seed_project/beautiful_soup_test.xlsx'

pulse_df.to_excel(output_fp, index=False)

# use the following code to write each dataframe to its own sheet in a single workbook
# (df1, df2, df3 and the sheet names are placeholders)
'''
# create excel writer object to initialize new workbook;
# the with-block saves and closes the output file on exit
with pd.ExcelWriter(output_fp, engine="xlsxwriter") as writer:
  # write dataframes to different worksheets
  df1.to_excel(writer, sheet_name="source1", index=False)
  df2.to_excel(writer, sheet_name="source2", index=False)
  df3.to_excel(writer, sheet_name="source3", index=False)
'''