In [1]:
import urllib.request  # the request submodule must be imported explicitly
import math
from bs4 import BeautifulSoup

In [2]:
# using an example url
urltxt = "https://www.telegraph.co.uk/news/earth/environment/climatechange/10933964/Emperor-Penguins-are-now-endangered-warn-biologists.html"
with urllib.request.urlopen(urltxt) as url:
    html = url.read()
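# name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")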

In [3]:
def getPlainText(soup):
    # remove script and style elements
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.body.get_text()
    
    # strip leading and trailing whitespace from each line and drop blank lines
    lines = (line.strip() for line in text.splitlines())
    text = '\n'.join(line for line in lines if line)
    return text

text = getPlainText(soup)
print(text)

Accessibility links
Skip to article
Skip to navigation
Telegraph.co.uk
Tuesday 24 December 2019
Home
Video
News
World
Sport
Business
Money
Comment
Culture
Travel
Life
Women
Fashion
Luxury
Tech
Film
Politics
Investigations
Obits
Education
Science
Earth
Weather
Health
Royal
Celebrity
Defence
Scotland
News
Environment
Climate Change
Wildlife
Picture Galleries
Earth Video
Tree diseases
Advertisement
Home»
News»
Earth»
Environment»
Climate Change
Emperor Penguins are now endangered, warn biologists
A new study has estimated that by 2100, at least two-thirds of Emperor penguin
colonies will have dramatically declined by more than half if temperatures
rise at the rate predicted by the Intergovernmental Panel on Climate Change
(IPCC)
Emperor Penguins on frozen sea ice in Antarctica. Photo: Paul Souders / Barcroft Media
By
Sarah Knapton, Science Correspondent
5:00PM BST 29 Jun 2014
Follow
Emperor penguins should be classed as an endangered species because the majority of colonies will have lost

In [4]:
# heuristic algorithm for extracting only sentences
# TODO: replace with a semi-supervised learning model
        
def isSentence(s):
    return isStringLong(s) and noLongConsecutiveChars(s)

def isStringLong(string):
    return len(string) > 50
        
def noLongConsecutiveChars(string):
    # True when the longest run of one repeated character is under half the line
    if not string:
        return True  # no characters, so no runs; also avoids dividing by zero
    longest = count = 0
    prev = None
    for c in string:
        count = count + 1 if c == prev else 1
        longest = max(longest, count)
        prev = c
    return longest / len(string) < 0.5
    
sentences = ""
for line in text.splitlines():
    if isSentence(line):
        sentences += line + "\n"
print(sentences)

Emperor Penguins are now endangered, warn biologists
A new study has estimated that by 2100, at least two-thirds of Emperor penguin
colonies will have dramatically declined by more than half if temperatures
rise at the rate predicted by the Intergovernmental Panel on Climate Change
Emperor Penguins on frozen sea ice in Antarctica. Photo: Paul Souders / Barcroft Media
Emperor penguins should be classed as an endangered species because the majority of colonies will have lost half their populations by the end of the century, biologists have warned.  The flightless birds which inhabit Antarctica are threatened by changes to sea ice which are being driving by climate change.  Emperor penguins are heavily dependent on sea ice as it provides krill, one of their primary food sources.  A new study has estimated that by 2100, at least two-thirds of emperor penguin colonies will have dramatically declined by more than half if temperatures rise at the rate predicted by the Interngovernmental Panel
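
The filter above keeps a line only if it is longer than 50 characters and no single repeated character makes up half of it or more. As a minimal sketch of the same idea, the run-length check can be written with `itertools.groupby`. The names `longest_run_fraction` and `is_sentence`, the keyword parameters, and the sample lines are illustrative assumptions, not part of the original notebook.

```python
from itertools import groupby

def longest_run_fraction(s):
    """Length of the longest run of one repeated character, as a fraction of len(s)."""
    if not s:
        return 0.0
    longest = max(len(list(group)) for _, group in groupby(s))
    return longest / len(s)

def is_sentence(s, min_length=50, max_run_fraction=0.5):
    # Same two checks as isSentence: long enough, and not dominated by one repeated character
    return len(s) > min_length and longest_run_fraction(s) < max_run_fraction

samples = [
    "Home",                                                   # too short
    "-" * 60,                                                 # long, but a single repeated character
    "Emperor Penguins are now endangered, warn biologists",   # 52 characters, no long runs
]
kept = [s for s in samples if is_sentence(s)]
print(kept)
```

Only the headline survives: the navigation label fails the length check and the separator line fails the run-length check.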

In [5]:
# estimate read time for the article
# words in the extracted sentences count in full;
# all remaining words (headings, navigation, captions, etc.) are weighted at 10%
# assumes an average reading speed of 200 words per minute (WPM)

# inputs: sentences (extracted sentences only), text (full page text, including headings etc.)
def getReadTime(sentences, text):
    num_words = len(sentences.split())
    read_time = num_words / 200

    num_other_words = len(text.split()) - num_words
    other_read_time = 0.1 * (num_other_words / 200)

    total_read_time = math.ceil(read_time + other_read_time)
    return total_read_time

total_read_time = getReadTime(sentences, text)
print("total read time:", total_read_time, "minutes")

total read time: 4 minutes
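
To make the weighting concrete, here is a worked example of the same formula on made-up numbers. The word counts (700 and 1,000), the function name `estimate_read_time`, and the constant names are hypothetical; only the 200 WPM speed and the 0.1 weight come from the cell above.

```python
import math

WORDS_PER_MINUTE = 200   # reading speed assumed in the cell above
OTHER_TEXT_WEIGHT = 0.1  # weight the cell above gives to non-sentence words

def estimate_read_time(num_sentence_words, num_total_words):
    # Sentence words count in full
    sentence_minutes = num_sentence_words / WORDS_PER_MINUTE
    # Everything else on the page is discounted by OTHER_TEXT_WEIGHT
    other_words = num_total_words - num_sentence_words
    other_minutes = OTHER_TEXT_WEIGHT * other_words / WORDS_PER_MINUTE
    # Round up to whole minutes
    return math.ceil(sentence_minutes + other_minutes)

# Hypothetical page: 700 words in sentences, 1,000 words in total
# 700 / 200 = 3.5 minutes, plus 0.1 * 300 / 200 = 0.15 minutes, rounded up
print(estimate_read_time(700, 1000))
```

With these counts the 300 non-sentence words add only 0.15 minutes, so the estimate is driven almost entirely by the article body.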


# Summarize article text

In [6]:
# requires gensim < 4.0: the gensim.summarization module was removed in gensim 4.0
from gensim.summarization.summarizer import summarize

# TODO: tune and optimize parameters
def summarizeArticle(text):
    return summarize(text, word_count=60)

summary = summarizeArticle(sentences)
print(summary)

A new study has estimated that by 2100, at least two-thirds of emperor penguin colonies will have dramatically declined by more than half if temperatures rise at the rate predicted by the Interngovernmental Panel on Climate Change (IPCC)  The study was conducted by lead author Stephanie Jenouvrier, a biologist with the Woods Hole Oceanographic Institution (WHOI).
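
Because `gensim.summarization` is not available in gensim 4.0 and later, a dependency-free stand-in can be useful for experimenting. The sketch below is not the algorithm gensim used. It is a much simpler word-frequency technique that scores each sentence by the average frequency of its words and keeps the top few. The function name `summarize_by_frequency`, the `max_sentences` parameter, the short-word filter, and the demo text are all assumptions made for illustration.

```python
import re
from collections import Counter

def summarize_by_frequency(text, max_sentences=2):
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore words of three letters or fewer
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    # Pick the top-scoring sentences, then restore their original order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)

demo = ("Penguins depend on sea ice. "
        "Sea ice is shrinking as penguins lose habitat. "
        "The weather was nice.")
print(summarize_by_frequency(demo, max_sentences=2))
```

Averaging by sentence length keeps long sentences from winning on length alone. A real replacement would need proper sentence tokenization and a stop-word list.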
