# Get a single page from CBC and pull out the story

In [1]:
from bs4 import BeautifulSoup
from html.parser import HTMLParser
from urllib import request

url = 'https://www.cbc.ca/news/canada/british-columbia/sooke-wildlife-game-cams-cougar-bears-custodian-1.5497408?cmp=rss'
# url = "https://www.cbc.ca/news/canada/british-columbia/6-new-covid-19-infections-in-b-c-as-virus-spreads-inside-care-home-1.5489921"
res = request.urlopen(url)
html = res.read().decode('utf8')
soup = BeautifulSoup(html, 'html.parser')

# CAVEAT: To get the text we need to home in on the right HTML component
story = soup.find_all("div", {"class": "story"})[0].find_all('p')

#### 
For CBC, the story is contained in the paragraphs inside the `div` with class `story`. We make some character-level substitutions and remove any remaining HTML tags (e.g. paragraphs `<p>` and links `<a>`).

In [9]:
import re, textwrap

class TagRemover(HTMLParser):
  def __init__(self):
    super().__init__()
    self.data = ''
    
  def handle_data(self, data):
    self.data += data
    
  def parse(self, text):
    self.data = ''
    self.feed(text)
    return self.data
  
reSubs = [
  (r'\xa0', ' '),    # replace non-breaking spaces with ordinary spaces
  (r'\.\.\.', '…')   # replace '...' with '…' so its periods are not mistaken for sentence ends
]

def process_story(story, join='\n\n'):
  for pattern, repl in reSubs:
    story = [re.sub(pattern, repl, str(p)) for p in story]
  story = [TagRemover().parse(p.strip()) for p in story]
  return join.join(story)

def print_text(text):
  for p in text.split('\n'): 
    print(textwrap.fill(p))

text = process_story(story) 
print_text(text)

Fear of catching the coronavirus has led to panic buying of products
such as face masks, hand sanitizer and disposable gloves. But health
experts warn some of these items may be ineffective, especially if
they're not used properly.

Here's what you need to know before stocking up on supplies during the
coronavirus pandemic.

The best way to prevent the spread of infections is to keep your hands
clean, so it's no surprise that as the coronavirus spreads, stores are
running out of hand sanitizer. Fortunately, old-fashioned soap and
water will also do the trick.

Microbiologist Keith Warriner said, if you're using a hand sanitizer,
it must contain at least 60 per cent alcohol — the ingredient that
kills the virus.

"If you haven't got enough [alcohol] in there, it doesn't do anything.
It has basically dried the virus onto your hand," said Warriner, a
professor at the University of Guelph.

Even if your sanitizer has the right ingredients, Warriner still
believes washing with soap provides
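
To see what the cleaning step does to a single paragraph, here is a minimal standalone sketch using only the standard library. The `TextOnly` class, the `clean_paragraph` helper and the sample string are made up for illustration; they mirror the substitutions and tag removal described above but are not part of the notebook's code.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text nodes of whatever is fed to the parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_paragraph(html_fragment):
    """Apply the two substitutions, then drop every HTML tag."""
    text = html_fragment.replace('\xa0', ' ')   # non-breaking space -> space
    text = re.sub(r'\.\.\.', '…', text)         # '...' -> single ellipsis character
    parser = TextOnly()
    parser.feed(text.strip())
    parser.close()                               # flush any buffered text
    return ''.join(parser.chunks)


# Hypothetical paragraph, not taken from a CBC page
sample = '<p>Wait...\xa0see the <a href="#">full story</a>.</p>'
print(clean_paragraph(sample))
```

The `<p>` and `<a>` tags disappear, while the link text stays in place as part of the sentence.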

#### 
Next, we use the TextRank summarizer from the `summa` package to summarize the story in roughly 50 words:

In [13]:
from summa.summarizer import summarize
def textrank(text):
  # Put a blank line between the lines of the summary so they print as paragraphs
  return re.sub('\n', '\n\n', summarize(text, words=50))

summary = textrank(text)
print_text(summary)

Fear of catching the coronavirus has led to panic buying of products such as face masks, hand sanitizer and disposable gloves.

But she said a mask won't necessarily protect you from catching the coronavirus because the main way people get it is by touching an infected surface and then touching their mouth, eyes or nose.
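
For some intuition about what a TextRank-style summarizer does, the following toy sketch ranks sentences by how much vocabulary they share with the other sentences and keeps the top ones. It is a simplified illustration, not how `summa` is implemented: the naive sentence splitter, the word-overlap similarity, the fixed iteration count and all function names are assumptions made for this example.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def similarity(a, b):
    # Shared words, normalised by the log of each sentence's vocabulary size
    wa = set(re.findall(r'\w+', a.lower()))
    wb = set(re.findall(r'\w+', b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def toy_textrank(text, n=1, damping=0.85, iterations=50):
    """Return the n highest-ranked sentences, in their original order."""
    sents = split_sentences(text)
    k = len(sents)
    if k == 0:
        return []
    # Weighted graph: weights[i][j] is the similarity of sentences i and j
    weights = [[similarity(a, b) if i != j else 0.0
                for j, b in enumerate(sents)]
               for i, a in enumerate(sents)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * k
    # PageRank-style update: a sentence scores high if high-scoring
    # sentences are similar to it
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(k) if out_sum[j] > 0)
            for i in range(k)
        ]
    top = sorted(range(k), key=lambda i: scores[i], reverse=True)[:n]
    return [sents[i] for i in sorted(top)]


# Hypothetical text: the first two sentences share most of their words
sample_text = ("Cats chase mice in the barn. Dogs chase cats in the yard. "
               "The weather was pleasant today.")
print(toy_textrank(sample_text, n=2))
```

In this sample the two sentences about animals reinforce each other and outrank the unrelated third sentence.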


# Read RSS with `feedparser`
Given the URL of an RSS feed, `feedparser` returns a list of articles together with their URLs.

In [4]:
import feedparser

rss = 'https://www.cbc.ca/cmlink/rss-technology'
feed = feedparser.parse(rss)
print(f'Number of RSS posts : {len(feed.entries)}')

Number of RSS posts : 20


In [5]:
entry = feed.entries[1]

print('Example entry -- Keys:')
print(list(entry.keys()))

print(f'\nPost Title:\n{entry.title}')
print(f'\nPost URL:\n{entry.link}')

Example entry -- Keys:
['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'published', 'published_parsed', 'authors', 'author', 'author_detail', 'tags', 'summary', 'summary_detail']

Post Title:
Send in the trolls: Canada braces for an online disinformation assault on COVID-19

Post URL:
https://www.cbc.ca/news/politics/covid-19-coronavirus-disinformation-trolls-1.5497805?cmp=rss
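
To show where fields such as `title` and `link` come from, here is a rough sketch that reads a tiny RSS 2.0 document with the standard library's `xml.etree.ElementTree`. The feed content and the `read_items` helper are made up for illustration. `feedparser` does considerably more than this (for example, it handles several feed formats and parses dates), so treat this only as a picture of the underlying data.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed, invented for this example
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return one dict per <item>, holding its title and link."""
    root = ET.fromstring(rss_text)
    return [
        {'title': item.findtext('title'), 'link': item.findtext('link')}
        for item in root.iter('item')
    ]


for item in read_items(SAMPLE_RSS):
    print(item['title'], '->', item['link'])
```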


#### 
Finally, we fetch the first few articles from the feed and run the summarizer on each one.

In [16]:
articleLim = 5   # number of articles to summarize
titleLim = 50    # maximum characters printed for a title or URL

def truncate(t, lim):
  return t if len(t) <= lim else t[:lim] + '...'

for entry in feed.entries[:articleLim]:
  res = request.urlopen(entry.link)
  html = res.read().decode('utf8')
  soup = BeautifulSoup(html, 'html.parser')

  # The story text is in the paragraphs of the div with class "story"
  story = soup.find_all("div", {"class": "story"})[0].find_all('p')
  text = process_story(story)

  summary = textrank(text)

  print(truncate(entry.title, titleLim))
  print(truncate(entry.link, titleLim))
  print('---------')
  print_text(summary)
  print('\n')

COVID-19 vaccine research takes on new urgency
https://www.cbc.ca/news/health/covid-19-vaccine-re...
---------
Medical researchers are working on multiple approaches to experimental
vaccines to protect against COVID-19.

Kaushic, an immunologist and HIV vaccine researcher at McMaster
University, said because COVID-19 is a lung infection, any vaccine
needs to protect specifically against the virus getting into the
lungs.


Send in the trolls: Canada braces for an online di...
https://www.cbc.ca/news/politics/covid-19-coronavi...
---------
Vance said he was not prepared to name names on Friday, but suggested
that paying attention to — and trusting — Canada's elected leaders and
government officials is the best inoculation against a viral
disinformation campaign.


Secret life of cougars captured by Sooke man's wil...
https://www.cbc.ca/news/canada/british-columbia/so...
---------
When Paul Homer was putting up a wildlife camera on his Sooke, B.C.,
property years ago, he captured images o
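
The loop above stops at the first article that cannot be downloaded or that has no `div` with class `story`. One possible way to make it more forgiving is sketched below. The `summarize_entries` function and its `fetch_text` and `summarize` parameters are hypothetical names introduced here. Passing the fetching and summarizing steps in as functions is a design choice for this sketch, so that the loop can be tried out without network access.

```python
from itertools import islice


def summarize_entries(entries, fetch_text, summarize, limit=5):
    """Summarize up to `limit` entries, skipping those that fail.

    `fetch_text(url)` should return the cleaned story text and
    `summarize(text)` its summary; both are supplied by the caller.
    """
    results = []
    for entry in islice(entries, limit):
        try:
            text = fetch_text(entry['link'])
        except (OSError, IndexError) as err:
            # OSError covers urllib's network errors; IndexError is what
            # indexing [0] raises when no matching div was found
            print(f"Skipping {entry['link']}: {err!r}")
            continue
        results.append((entry['title'], entry['link'], summarize(text)))
    return results


# Stand-ins for the real feed, fetcher and summarizer (all hypothetical)
fake_entries = [
    {'title': 'A', 'link': 'u1'},
    {'title': 'B', 'link': 'u2'},
    {'title': 'C', 'link': 'u3'},
]


def fake_fetch(url):
    if url == 'u2':
        raise IndexError('no story div')
    return 'text for ' + url


for title, link, summary in summarize_entries(fake_entries, fake_fetch, str.upper):
    print(title, link, summary)
```

In the notebook, `fetch_text` could wrap the download, the `div` lookup and `process_story`, and `summarize` could be `textrank`.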