### **Scraping**

This scraper is not my own work. I adapted the code from https://holwech.github.io/blog/Automatic-news-scraper/ to add more recent articles to my dataset, and I modified it so that it scrapes more articles than the original did.

In [0]:
!pip install feedparser

In [0]:
!pip install newspaper3k

In [0]:
import os
from google.colab import drive

# Mount google drive
DRIVE_MOUNT='/content/gdrive'
drive.mount(DRIVE_MOUNT)

# create folder to write data to
CIS545_FOLDER=os.path.join(DRIVE_MOUNT, 'My Drive', 'CIS545_2020')
HOMEWORK_FOLDER=os.path.join(CIS545_FOLDER, 'Project')
os.makedirs(HOMEWORK_FOLDER, exist_ok=True)

Mounted at /content/gdrive


In [0]:
import json

In [0]:
dictionary = {
  "cnn": {
    "link": "http://edition.cnn.com/"
  },
  "bbc": {
    "rss": "http://feeds.bbci.co.uk/news/rss.xml",
    "link": "http://www.bbc.com/"
  },
  "theguardian": {
    "rss": "https://www.theguardian.com/uk/rss",
    "link": "https://www.theguardian.com/international"
  },
  "breitbart": {
    "link": "http://www.breitbart.com/"
  },
  "infowars": {
    "link": "https://www.infowars.com/"
  },
  "foxnews": {
    "link": "http://www.foxnews.com/"
  },
  "nbcnews": {
    "link": "http://www.nbcnews.com/"
  },
  "washingtonpost": {
    "rss": "http://feeds.washingtonpost.com/rss/world",
    "link": "https://www.washingtonpost.com/"
  },
  "theonion": {
    "link": "http://www.theonion.com/"
  }
}

In [0]:
json_object = json.dumps(dictionary, indent = 4) 

In [0]:
with open("NewsPapers.json", "w") as outfile: 
    outfile.write(json_object) 
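
The scraper further down chooses its method per site from this config: sites with an `rss` entry are read through their feed, and the rest are crawled from `link`. Here is a minimal sketch of that split. `split_by_strategy` is a hypothetical helper, not part of the original scraper, and it assumes an empty `rss` value should also mean "use the fallback" (the scraper itself only checks whether the key is present).

```python
import json


def split_by_strategy(sources):
    """Return (rss_sources, fallback_sources) as two lists of site names."""
    rss, fallback = [], []
    for name, cfg in sources.items():
        # Assumption: a missing OR empty "rss" value selects the fallback crawl.
        if cfg.get("rss"):
            rss.append(name)
        else:
            fallback.append(name)
    return rss, fallback


# Trimmed copy of the dictionary above, round-tripped through JSON
sample = {
    "cnn": {"link": "http://edition.cnn.com/"},
    "bbc": {
        "rss": "http://feeds.bbci.co.uk/news/rss.xml",
        "link": "http://www.bbc.com/",
    },
}
rss_sources, fallback_sources = split_by_strategy(json.loads(json.dumps(sample)))
print(rss_sources, fallback_sources)  # ['bbc'] ['cnn']
```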

In [0]:
import feedparser as fp
import json
import newspaper
from newspaper import Article
from time import mktime
from datetime import datetime

# Maximum number of articles to download per news site
# (the counter is reset to 1 after each site)
LIMIT = 14500

data = {}
data['newspapers'] = {}

# Load the JSON file that lists the news sites
with open('NewsPapers.json') as data_file:
    companies = json.load(data_file)

count = 1

# Iterate through each news company
for company, value in companies.items():
    # If an RSS link is provided in the JSON file, it is the first choice,
    # because RSS feeds often give more consistent and correct data.
    # To skip the RSS feed for a site, omit the "rss" key from its entry in the JSON file.
    if 'rss' in value:
        d = fp.parse(value['rss'])
        print("Downloading articles from ", company)
        newsPaper = {
            "rss": value['rss'],
            "link": value['link'],
            "articles": []
        }
        for entry in d.entries:
            # Check that a parsed publish date is available; if not, the article is skipped.
            # This keeps the data consistent and keeps the script from crashing on mktime().
            if getattr(entry, 'published_parsed', None):
                if count > LIMIT:
                    break
                article = {}
                article['link'] = entry.link
                date = entry.published_parsed
                article['published'] = datetime.fromtimestamp(mktime(date)).isoformat()
                try:
                    content = Article(entry.link)
                    content.download()
                    content.parse()
                except Exception as e:
                    # If the download fails for some reason (e.g. a 404), the script
                    # moves on to the next article.
                    print(e)
                    print("continuing...")
                    continue
                article['title'] = content.title
                article['text'] = content.text
                newsPaper['articles'].append(article)
                print(count, "articles downloaded from", company, ", url: ", entry.link)
                count = count + 1
    else:
        # This is the fallback method when no RSS feed link is provided.
        # It uses the Python newspaper library to extract articles.
        print("Building site for ", company)
        paper = newspaper.build(value['link'], memoize_articles=False)
        newsPaper = {
            "link": value['link'],
            "articles": []
        }
        noneTypeCount = 0
        for content in paper.articles:
            if count > LIMIT:
                break
            try:
                content.download()
                content.parse()
            except Exception as e:
                print(e)
                print("continuing...")
                continue
            # Again, for consistency, an article with no publish date is skipped.
            # After more than 100 consecutive articles without a publish date, the rest of the site is skipped.
            if content.publish_date is None:
                print(count, " Article has date of type None...")
                noneTypeCount = noneTypeCount + 1
                if noneTypeCount > 100:
                    print("Too many noneType dates, aborting...")
                    noneTypeCount = 0
                    break
                count = count + 1
                continue
            article = {}
            article['title'] = content.title
            article['text'] = content.text
            article['link'] = content.url
            article['published'] = content.publish_date.isoformat()
            newsPaper['articles'].append(article)
            print(count, "articles downloaded from", company, " using newspaper, url: ", content.url)
            count = count + 1
            noneTypeCount = 0
    count = 1
    data['newspapers'][company] = newsPaper

# Finally it saves the articles as a JSON-file.
try:
    with open('scraped_articles.json', 'w') as outfile:
        json.dump(data, outfile)
except Exception as e:
    print(e)
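
The skipping rule in the fallback branch is easier to check in isolation. The sketch below is a simplified stand-in, not the scraper's own code. `keep_dated`, `limit`, and `max_missing` are hypothetical names, and `SimpleNamespace` objects stand in for downloaded newspaper articles. Unlike the cell above, it counts only kept articles toward the limit, whereas the cell also increments `count` for undated ones.

```python
from datetime import datetime
from types import SimpleNamespace


def keep_dated(articles, limit, max_missing):
    """Keep articles that have a publish date.

    Stops once `limit` articles are kept, or once more than `max_missing`
    undated articles appear in a row.
    """
    kept = []
    missing_run = 0
    for art in articles:
        if len(kept) >= limit:
            break
        if art.publish_date is None:
            missing_run += 1
            if missing_run > max_missing:
                break  # too many undated articles in a row: give up on this site
            continue
        missing_run = 0  # a dated article resets the run
        kept.append(art)
    return kept


# Stand-ins for parsed articles (hypothetical data)
fake = [
    SimpleNamespace(title="a", publish_date=datetime(2020, 4, 24)),
    SimpleNamespace(title="b", publish_date=None),
    SimpleNamespace(title="c", publish_date=datetime(2020, 4, 17)),
    SimpleNamespace(title="d", publish_date=None),
    SimpleNamespace(title="e", publish_date=None),
    SimpleNamespace(title="f", publish_date=datetime(2020, 4, 16)),
]
titles = [a.title for a in keep_dated(fake, limit=10, max_missing=1)]
print(titles)  # ['a', 'c']
```

With `max_missing=1`, the two undated articles in a row ("d" and "e") end the loop before "f" is reached, which mirrors how the scraper abandons a site that rarely exposes publish dates.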

In [0]:
with open('scraped_articles.json') as json_data:
    d = json.load(json_data)

In [0]:
for i, site in enumerate(d['newspapers']):
    print(i, site)

0 cnn
1 bbc
2 theguardian
3 breitbart
4 infowars
5 foxnews
6 nbcnews
7 washingtonpost
8 theonion


In [0]:
import pandas as pd

# Build one DataFrame per site, tag it with the site name, then concatenate once
frames = []
for site, paper in d['newspapers'].items():
    site_df = pd.DataFrame(paper['articles'])
    site_df["site"] = site
    frames.append(site_df)
df = pd.concat(frames, ignore_index=True)
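
The nested JSON can also be flattened into plain rows before pandas is involved, which makes the shape of the data easy to inspect. This is a minimal sketch. `flatten_articles` is a hypothetical helper, and the sample record is a shortened stand-in for the scraped file rather than real output.

```python
def flatten_articles(data):
    """Turn {'newspapers': {site: {'articles': [...]}}} into a flat list of rows."""
    rows = []
    for site, paper in data["newspapers"].items():
        for article in paper["articles"]:
            # Copy the article fields and tag the row with its site name
            rows.append({**article, "site": site})
    return rows


# Shortened stand-in for the contents of scraped_articles.json
sample = {
    "newspapers": {
        "cnn": {
            "link": "http://edition.cnn.com/",
            "articles": [
                {
                    "title": "t1",
                    "text": "body",
                    "link": "http://edition.cnn.com/",
                    "published": "2020-04-24T00:00:00",
                }
            ],
        },
        "theonion": {"link": "http://www.theonion.com/", "articles": []},
    }
}
rows = flatten_articles(sample)
print(len(rows), sorted(rows[0]))
```

Passing `rows` to `pd.DataFrame` should give the same five columns as the frame built above (`title`, `text`, `link`, `published`, `site`). A site with no articles simply contributes no rows.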

In [0]:
df.shape

(1844, 5)

In [0]:
df

Unnamed: 0,title,text,link,published,site
0,Meghan and Harry dial into London court hearin...,London (CNN) The Duke and Duchess of Sussex di...,http://edition.cnn.com/2020/04/24/uk/harry-and...,2020-04-24T00:00:00,cnn
1,Coronavirus may force the UK to rethink its re...,London (CNN) Coronavirus travel restrictions h...,http://edition.cnn.com/2020/04/17/europe/migra...,2020-04-17T00:00:00,cnn
2,British politicians are concerned a 'virtual p...,London (CNN) The UK Parliament is expected on ...,http://edition.cnn.com/2020/04/16/uk/uk-parlia...,2020-04-16T00:00:00,cnn
3,UK's concern for Boris Johnson overrides politics,Julia Hobsbawm is a social philosopher and aut...,http://edition.cnn.com/2020/04/10/opinions/uks...,2020-04-10T00:00:00,cnn
4,Notable US Spies Fast Facts,(CNN) Here is a look at some US citizens who h...,http://edition.cnn.com/2014/06/09/us/imprisone...,2014-06-09T00:00:00,cnn
...,...,...,...,...,...
1839,Ambush near Congo’s Virunga park kills 12 rang...,Get our Coronavirus Updates newsletter\n\nRece...,https://www.washingtonpost.com/world/ambush-ne...,2020-04-24T05:09:10,washingtonpost
1840,2nd French court orders Amazon to better prote...,"The standoff has drawn global attention, as wo...",https://www.washingtonpost.com/world/europe/2n...,2020-04-24T16:58:21,washingtonpost
1841,No more today: UK’s new testing site closes fo...,"Clicking on the link, aspiring applicants were...",https://www.washingtonpost.com/world/europe/no...,2020-04-24T16:43:20,washingtonpost
1842,Nations back UN plan to speed wide rollout of ...,“This is a landmark collaboration to accelerat...,https://www.washingtonpost.com/world/asia_paci...,2020-04-24T15:48:18,washingtonpost
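
The preview shows that the crawl also picks up older pieces (row 4 is from 2014), so a date filter may be useful when only recent articles are wanted. This is a minimal sketch. `published_on_or_after` and the 2020-01-01 cutoff are hypothetical, and it assumes every `published` value is a timezone-naive ISO string like the ones shown above. A value with a UTC offset would need to be normalized before it can be compared.

```python
from datetime import datetime


def published_on_or_after(rows, cutoff):
    """Keep rows whose ISO-formatted 'published' value is at or after `cutoff`."""
    return [r for r in rows if datetime.fromisoformat(r["published"]) >= cutoff]


# Two rows taken from the preview above (titles and dates only)
rows = [
    {"title": "Notable US Spies Fast Facts", "published": "2014-06-09T00:00:00"},
    {
        "title": "UK's concern for Boris Johnson overrides politics",
        "published": "2020-04-10T00:00:00",
    },
]
recent = published_on_or_after(rows, datetime(2020, 1, 1))
print([r["published"] for r in recent])
```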


In [0]:
!cp scraped_articles.json "/content/gdrive/My Drive/CIS545_2020/Project/"