## Easily Scrape and Summarize News Articles Using Python

We’ll scrape an example article using the requests and BeautifulSoup packages, then we’ll summarize it using the excellent gensim library.

In [14]:
import requests
from bs4 import BeautifulSoup
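# Note: the gensim.summarization module was removed in gensim 4.0, so this import requires gensim < 4.0.
from gensim.summarization import summarize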

We’ll use the page at the URL below as our example article and retrieve its HTML:

In [15]:
# Retrieve page text
url = 'http://www.vit.edu/index.php/institute/director-s-message'
page = requests.get(url).text
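
A bare call to requests.get(url).text will return the body of an error page as if it were the article, and it can wait indefinitely on a slow server. A slightly more defensive version might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on network or HTTP errors.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```

With this in place, the cell above would become `page = fetch_html(url)`.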

### Webscraping:

First, we’ll turn the page content into a BeautifulSoup object, which will allow us to parse the HTML tags.

In [16]:
# Turn page into BeautifulSoup object to access HTML tags
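# Name the parser explicitly; omitting it triggers a warning and can give different results across machines.
soup = BeautifulSoup(page, 'html.parser')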

Then, we’ll need to figure out which HTML tags contain the headline and the main text of the article. On this page, the headline sits in the first h2 tag and the body text sits in p tags.

In [17]:
# Get headline
headline = soup.find('h2').get_text()

In [18]:
# Get text from all <p> tags.
p_tags = soup.find_all('p')
# Get the text from each of the <p> tags and strip surrounding whitespace.
p_tags_text = [tag.get_text().strip() for tag in p_tags]
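
To see what these calls return without touching the network, here is a small sketch that runs the same extraction on a hand-written HTML snippet. The snippet is invented for illustration and is not the real markup of the page. The `sample_` names are hypothetical, chosen so they don't overwrite the notebook's variables.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a downloaded page (not the real markup)
sample_html = """
<html><body>
  <h2> Director's Message </h2>
  <p>First paragraph of the message.</p>
  <p>  Second paragraph, with padding.  </p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns None when no tag matches, so guard before calling get_text()
h2_tag = sample_soup.find("h2")
sample_headline = h2_tag.get_text(strip=True) if h2_tag is not None else ""

# find_all() returns every matching tag in document order
sample_paragraphs = [tag.get_text().strip() for tag in sample_soup.find_all("p")]

print(sample_headline)
print(sample_paragraphs)
```

The guard on `find` matters on real pages: if the site has no h2 tag, calling get_text() directly on the result raises an AttributeError.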

In this article, the image captions contain the newline ‘\n’ character to add whitespace around them. Since we know that an actual sentence from the article wouldn’t have a random line break, we can safely drop these.

Similarly, we can drop bits of text that don’t contain a period, since any proper sentence in an article would contain one. That removes the author’s name and some other irrelevant bits.

In [19]:
# Filter out sentences that contain newline characters '\n' or don't contain periods.
sentence_list = [sentence for sentence in p_tags_text if '\n' not in sentence]
sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
# Combine list items into string.
article = ' '.join(sentence_list)
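
The two filters can also be folded into a single pass. The sketch below does that and runs it on a few made-up strings so the effect is easy to see. The function name `keep_sentence_like` and the sample strings are hypothetical.

```python
def keep_sentence_like(paragraphs):
    """Keep paragraphs that have no line breaks and contain at least one period."""
    return [p for p in paragraphs if "\n" not in p and "." in p]


# Made-up examples: a real sentence, a caption with a line break, a byline
sample = [
    "A complete sentence from the article.",
    "Photo caption\nspread over two lines.",
    "Author Name",
]
print(keep_sentence_like(sample))
# prints ['A complete sentence from the article.']
```

Keep in mind that this is a heuristic: a paragraph consisting only of a question, for example, has no period and would be dropped as well.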

### Summarization:

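Gensim is an excellent Python package for a variety of NLP tasks. It includes an easy-to-use summarization function based on a variation of the TextRank algorithm. The summarizer is extractive: it selects the most important sentences from the text rather than writing new ones, and the ratio argument sets the fraction of sentences to keep. As you’ll see, we can use it in one line of code: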

In [20]:
summary = summarize(article, ratio=0.3)
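
For a rough picture of how a TextRank-style summarizer ranks sentences, here is a simplified sketch in pure Python. It is not gensim's implementation, which differs in its preprocessing and similarity function. The assumptions here are a naive sentence splitter, a plain word-overlap similarity, and a fixed number of PageRank iterations. All function names are hypothetical.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Shared-word count, normalized by the log of each sentence's vocabulary size."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator (log(1) + log(1))
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, ratio=0.3, damping=0.85, iterations=50):
    """Return the top-ranked sentences (a `ratio` fraction), in original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return ""
    word_sets = [set(re.findall(r"\w+", s.lower())) for s in sentences]

    # Weighted, undirected sentence graph (no self-loops)
    weights = [
        [similarity(word_sets[i], word_sets[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]

    # Power iteration of weighted PageRank
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]

    # Pick the best-scoring sentences, then restore document order
    keep = max(1, round(n * ratio))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return "\n".join(sentences[i] for i in sorted(ranked))


sample_text = (
    "Cats chase mice in the barn. Cats and mice live in the barn together. "
    "The weather was sunny yesterday. Mice fear cats in the barn."
)
print(textrank_summary(sample_text, ratio=0.5))
```

In the sample, the sentence about the weather shares almost no words with the others, so it gets the lowest score and is left out. The naive splitter would mis-handle abbreviations such as "U.S.A.", which is one reason to prefer a proper tokenizer on real articles.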

In [21]:
print(f'Length of original document: {len(article)}')
print(f'Length of summary: {len(summary)}\n')
print(f'Headline: {headline}\n')
print(f'Article Summary: \n {summary}')

Length of original document: 2448
Length of summary: 1048

Headline: Director's Message

Article Summary: 
 Vishwakarma Institute of Technology, Pune, was established in year 1983 by Bansilal Ramnath Agarwal Charitable Trust.
It is offering U.G., P.G., Ph.D. programmes of University of Pune in almost all major branches like Mechanical, Computer, E&TC, Chemical, Instrumentation and Industrial Engineering.
For this the Institute has introduced Honors / Minor Streams, General Proficiency Courses, Professional Development Courses, Skill Development Courses, Semester Project, Open Electives (which include Psychology, Sociology, Philosophy, Economics, etc.) and Soft skills as a part of the Curriculum.
The Institute has established academic collaboration, collaborative research, student exchange, Global Internship program with various foreign Universities of repute like Binghamton University from U.S.A, Nanyang Technological University in Singapore, Purdue University U.S.A., Hof University in