# Scrape and Summarise News Articles



### Objective: This program scrapes and summarizes news articles.
We use the following libraries:
- NLTK
- newspaper

## 1. Import Dependencies

In [0]:
# install NLTK and newspaper3k (the PyPI package that provides the `newspaper` module)
!pip install nltk
!pip install newspaper3k

In [0]:
# Import libraries
import nltk
from newspaper import Article

In [0]:
# get the article
url = 'https://www.washingtonpost.com/technology/2019/07/17/you-downloaded-faceapp-heres-what-youve-just-done-your-privacy/'
article = Article(url)

## 2. Data Processing - NLP

In [0]:
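# download, parse, and run NLP on the article
article.download()  # fetch the article HTML
article.parse()  # extract title, authors, text, images, etc.
nltk.download('punkt')  # tokenizer data needed by article.nlp()
article.nlp()  # compute the keywords and the summary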

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 3. Extract Data

### 3a. Title of the article

In [0]:
print(article.title)

You downloaded FaceApp. Here’s what you’ve just done to your privacy.


### 3b. Get authors


In [0]:
# get the authors of the article
article.authors

['Geoffrey A. Fowler',
 'Technology Columnist Based In San Francisco',
 'Technology Columnist']
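
The list above mixes the author's name with byline fragments. A minimal sketch of one way to filter them follows. The helper `clean_authors` and its keyword list are hypothetical and not part of newspaper; the heuristic is crude and would also drop a generic byline such as 'Ht Correspondent'.

```python
# Hypothetical helper: the keyword list is an assumption and will need tuning per site.
TITLE_KEYWORDS = ("columnist", "correspondent", "editor", "reporter", "based in")


def clean_authors(authors):
    """Return the entries that do not look like job titles or byline fragments."""
    return [
        name for name in authors
        if not any(keyword in name.lower() for keyword in TITLE_KEYWORDS)
    ]


authors = [
    "Geoffrey A. Fowler",
    "Technology Columnist Based In San Francisco",
    "Technology Columnist",
]
print(clean_authors(authors))  # ['Geoffrey A. Fowler']
```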

### 3c. Get publish date

In [0]:
# get the publish date
print(article.publish_date)
article.publish_date  # a datetime.datetime object

2019-07-17 00:00:00


datetime.datetime(2019, 7, 17, 0, 0)
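
Because the publish date is a `datetime` object, it can be formatted with the standard library. The sketch below uses a hypothetical helper, `format_publish_date`, and assumes that a missing date arrives as `None` or another empty value.

```python
from datetime import datetime


def format_publish_date(publish_date, fmt="%Y-%m-%d"):
    """Format a publish date, or return a placeholder when no date was found."""
    if not publish_date:  # assumption: a missing date is None or an empty value
        return "unknown"
    return publish_date.strftime(fmt)


sample = datetime(2019, 7, 17, 0, 0)  # the value shown in the output above
print(format_publish_date(sample))  # 2019-07-17
print(format_publish_date(None))  # unknown
```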

### 3d. Get the top image of the article

In [0]:
# get the top image of the article
article.top_image

'https://www.washingtonpost.com/resizer/p4mCbRw3t6nAwEDs5hf3mY7-3Rk=/1440x0/smart/arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/J76RFFMEIVAJ3NTZ4YEXMMBJGQ.jpg'

In [0]:
print(article.top_img)

https://www.washingtonpost.com/resizer/p4mCbRw3t6nAwEDs5hf3mY7-3Rk=/1440x0/smart/arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/J76RFFMEIVAJ3NTZ4YEXMMBJGQ.jpg


### 3e. Get all images from the article

In [0]:
article.images

{'https://www.washingtonpost.com/resizer/p4mCbRw3t6nAwEDs5hf3mY7-3Rk=/1440x0/smart/arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/J76RFFMEIVAJ3NTZ4YEXMMBJGQ.jpg',
 'https://www.washingtonpost.com/wp-apps/imrs.php?src=https://s3.amazonaws.com/arc-authors/washpost/059a0168-736b-43cb-9473-20e8b42e454f.png&w=90&h=90'}

In [0]:
article.imgs

{'https://www.washingtonpost.com/resizer/p4mCbRw3t6nAwEDs5hf3mY7-3Rk=/1440x0/smart/arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/J76RFFMEIVAJ3NTZ4YEXMMBJGQ.jpg',
 'https://www.washingtonpost.com/wp-apps/imrs.php?src=https://s3.amazonaws.com/arc-authors/washpost/059a0168-736b-43cb-9473-20e8b42e454f.png&w=90&h=90'}
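
The image collection is an unordered set that also contains non-photo assets such as the author avatar served through `imrs.php`. As a rough sketch, the hypothetical helper `photo_urls` below keeps only URLs whose path ends in a known image extension and returns them in sorted order. The extension list is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumption: adjust as needed


def photo_urls(image_urls):
    """Return a sorted list of URLs whose path ends in a known image extension."""
    return sorted(
        url for url in image_urls
        if PurePosixPath(urlsplit(url).path).suffix.lower() in PHOTO_EXTENSIONS
    )


images = {
    "https://www.washingtonpost.com/resizer/p4mCbRw3t6nAwEDs5hf3mY7-3Rk=/1440x0/smart/arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/J76RFFMEIVAJ3NTZ4YEXMMBJGQ.jpg",
    "https://www.washingtonpost.com/wp-apps/imrs.php?src=https://s3.amazonaws.com/arc-authors/washpost/059a0168-736b-43cb-9473-20e8b42e454f.png&w=90&h=90",
}
for url in photo_urls(images):
    print(url)  # only the .jpg URL is printed
```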

### 3f. Extract the text of the article

In [0]:
# get the article text
print(article.text)

I got some answers by running my own forensic analysis and talking to the CEO of the company that made the app. But the bigger lesson was how much app-makers and the stores run by Apple and Google leave us flying blind when it comes to privacy.

AD

AD

I raised similar questions a few weeks ago when I ran an experiment to find out what my iPhone did while I slept at night. I found apps sending my personal information to all sorts of tracking companies I’d never heard of.

So what about FaceApp? It was vetted by Apple’s App Store and Google’s Play Store, which even labeled it an “Editors’ Choice.” They both link to its privacy policy — which they know nobody reads.

Looking under the hood of FaceApp with the tools from my iPhone test, I found it sharing information about my phone with Facebook and Google AdMob, which probably help it place ads and check the performance of its ads. The most unsettling part was how much data FaceApp was sending to its own servers, after which … who knows
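
The extracted text above still contains standalone `AD` placeholder lines. One possible clean-up, sketched with a hypothetical helper `strip_ad_markers`, drops lines that consist only of such a marker and rejoins the remaining paragraphs. The marker list is an assumption based on this one article.

```python
def strip_ad_markers(text, markers=("AD",)):
    """Drop blank lines and lines made up solely of an ad marker, then rejoin paragraphs."""
    paragraphs = [
        line.strip() for line in text.splitlines()
        if line.strip() and line.strip() not in markers
    ]
    return "\n\n".join(paragraphs)


sample_text = "First paragraph.\n\nAD\n\nAD\n\nSecond paragraph."
print(strip_ad_markers(sample_text))
```

In the notebook, the call would be `strip_ad_markers(article.text)`.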

### 3g. Summarize the article

In [0]:
# get a summary of the article
print(article.summary)

The most unsettling part was how much data FaceApp was sending to its own servers, after which … who knows what happens.
Goncharov said FaceApp deletes “most” of the photos from its servers after 48 hours.
Just deleting the app won’t get rid of the photos FaceApp may have in the cloud.
“For the fastest processing, we recommend sending the requests from the FaceApp mobile app using ‘Settings->Support->Report a bug’ with the word ‘privacy’ in the subject line.
We’re literally paying them to read the privacy policies — and vet that companies such as FaceApp are telling the truth.
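
The summary printed above has one sentence per line. Assuming that layout, the sketch below turns it into a bullet list for display. `summary_to_bullets` is a hypothetical helper, not a newspaper function.

```python
def summary_to_bullets(summary):
    """Turn a newline-separated summary into a markdown-style bullet list."""
    sentences = [line.strip() for line in summary.splitlines() if line.strip()]
    return "\n".join(f"- {sentence}" for sentence in sentences)


sample_summary = "First key sentence.\nSecond key sentence."
print(summary_to_bullets(sample_summary))
```

In the notebook, the call would be `summary_to_bullets(article.summary)`.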


### 3h. Get article URL & source URL

#### Article URL

In [0]:
print(article.url)

https://www.washingtonpost.com/technology/2019/07/17/you-downloaded-faceapp-heres-what-youve-just-done-your-privacy/


#### Website or Source URL

In [0]:
print(article.source_url)

https://www.washingtonpost.com
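
For this article, the source URL is the scheme and host of the article URL. The same value can be derived with the standard library when only a URL string is available. This is an illustration with a hypothetical helper name, not a statement about how newspaper computes `source_url` internally.

```python
from urllib.parse import urlsplit


def source_url_of(url):
    """Return the scheme and host part of a URL, e.g. 'https://example.com'."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


article_url = "https://www.washingtonpost.com/technology/2019/07/17/you-downloaded-faceapp-heres-what-youve-just-done-your-privacy/"
print(source_url_of(article_url))  # https://www.washingtonpost.com
```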


### 3i. Get article's meta information

#### Meta Data

In [0]:
article.meta_data

defaultdict(dict,
            {'article': {'content_tier': 'metered', 'opinion': 'false'},
             'description': '5 questions we all should have asked before we downloaded the latest viral app that ages your face',
             'og': {'description': '5 questions we all should have asked before we downloaded the latest viral app that ages your face',
              'image': 'https://www.washingtonpost.com/resizer/p4mCbRw3t6nAwEDs5hf3mY7-3Rk=/1440x0/smart/arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/J76RFFMEIVAJ3NTZ4YEXMMBJGQ.jpg',
              'site_name': 'Washington Post',
              'title': 'Perspective | You downloaded FaceApp. Here’s what you’ve just done to your privacy.',
              'type': 'article',
              'url': 'https://www.washingtonpost.com/technology/2019/07/17/you-downloaded-faceapp-heres-what-youve-just-done-your-privacy/'},
             'twitter': {'card': 'summary_large_image',
              'description': '5 questions we all should
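
The meta data is a nested `defaultdict(dict)`, so indexing a missing top-level key silently creates an empty entry. The sketch below walks the nesting without that side effect. `get_meta` is a hypothetical helper, and the sample dictionary reuses only keys visible in the output above.

```python
from collections import defaultdict


def get_meta(meta_data, *keys, default=None):
    """Follow a chain of keys through nested dicts without creating missing entries."""
    node = meta_data
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


meta_data = defaultdict(dict, {
    "article": {"content_tier": "metered", "opinion": "false"},
    "og": {"site_name": "Washington Post", "type": "article"},
})
print(get_meta(meta_data, "og", "site_name"))  # Washington Post
print(get_meta(meta_data, "og", "locale", default="n/a"))  # n/a
```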

#### Meta Description

In [0]:
article.meta_description


'5 questions we all should have asked before we downloaded the latest viral app that ages your face'

#### Favicon

In [0]:
article.meta_favicon


'/pf/resources/images/favicon.ico?d=173'
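
The favicon here is a site-relative path. To get an absolute link, it can be joined with the source URL using `urllib.parse.urljoin`. The values below are copied from the outputs in this notebook.

```python
from urllib.parse import urljoin

source_url = "https://www.washingtonpost.com"  # value of article.source_url
favicon = "/pf/resources/images/favicon.ico?d=173"  # value of article.meta_favicon

favicon_url = urljoin(source_url, favicon)
print(favicon_url)  # https://www.washingtonpost.com/pf/resources/images/favicon.ico?d=173
```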

#### Canonical Link

In [0]:
print(article.canonical_link)

https://www.washingtonpost.com/technology/2019/07/17/you-downloaded-faceapp-heres-what-youve-just-done-your-privacy/


#### Meta Image

In [0]:
article.meta_img


'https://www.washingtonpost.com/resizer/p4mCbRw3t6nAwEDs5hf3mY7-3Rk=/1440x0/smart/arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/J76RFFMEIVAJ3NTZ4YEXMMBJGQ.jpg'

#### Meta Language

In [0]:
article.meta_lang


'en'

## 4. Create a UDF (user-defined function) to get the complete article summary

#### Define UDF

In [0]:
# UDF to get article summary
def get_news_article_summary(url):
  # assumes `nltk` and `Article` are already imported (section 1)

  # define Article object from url of the news webpage
  article = Article(url)

  # process with NLP
  print("[INFO] Processing the article . . .")
  article.download()  # download article
  article.parse()  # parse article
  # 'punkt' was downloaded in section 2; uncomment when running standalone
  # nltk.download('punkt')
  article.nlp()  # process the article with NLP
  print("[INFO] Article processed, printing article information . . .")

  # display article summary
  print("\nTitle :")  # get article's title
  print(article.title)
  print("\nAuthor :")  # get author of the article
  print(article.authors)
  print("\nPublished on :")  # get publish date of the article
  print(article.publish_date)
  print("\nTop Image :")  # get top image of the article
  print(article.top_image)
  print("\nAll Images :")  # get all images from the article
  for image_url in article.images:
    print(image_url)

  print('- - - ')
  print("\nText Content :")  # get text content
  print(article.text)

  print('- - - ')
  print("\nSummary :")  # article summary
  print(article.summary)
  
  print('- - - ')
  print("\nArticle URL :")  # URL of the article
  print(article.url)
  print("\nWebsite URL :")  # URL of the website
  print(article.source_url)

  print('- - - ')
  print("\nArticle's Meta Information :")
  print("\nMeta Data :")  # meta info
  print(article.meta_data)
  print("\nMeta Description :")  # meta description
  print(article.meta_description)
  print("\nMeta Favicon :")  # meta favicon
  print(article.meta_favicon)
  print("\nCanonical Link :")  # Canonical Link
  print(article.canonical_link)
  print("\nMeta Image :")  # meta image
  print(article.meta_img)
  print("\nMeta Language :")  # meta language
  print(article.meta_lang)
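
The function above only prints its results. When the values are needed later, for example to build a table of several articles, a variant that returns a dictionary may be more convenient. The sketch below is an assumption-laden illustration: `article_to_dict` and `FIELDS` are hypothetical names, and a stand-in object replaces a real `Article` so the example runs without network access.

```python
from types import SimpleNamespace

# Attribute names used earlier in this notebook; extend as needed.
FIELDS = ("title", "authors", "publish_date", "top_image",
          "summary", "url", "source_url", "meta_lang")


def article_to_dict(article, fields=FIELDS):
    """Collect selected attributes of an article-like object into a dict."""
    return {name: getattr(article, name, None) for name in fields}


# Stand-in for illustration; in the notebook, pass the Article after article.nlp().
fake_article = SimpleNamespace(title="Example title", authors=["A. Writer"],
                               summary="One line.")
info = article_to_dict(fake_article)
print(info["title"])  # Example title
print(info["publish_date"])  # None, because the stand-in has no such attribute
```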

#### Define URL

In [0]:
url = 'https://www.hindustantimes.com/india-news/army-chief-bipin-rawat-set-to-step-down-tomorrow-named-first-chief-of-defence-staff/story-sdHxA2EVpjYkEQx0iczijN.html'
# article = Article(url)

#### Feed URL to the UDF

In [0]:
# get article summary from UDF
get_news_article_summary(url)

[INFO] Processing the article . . .
[INFO] Article processed, printing article information . . .

Title :
Army Chief Gen Bipin Rawat appointed first Chief of Defence Staff

Author :
['Ht Correspondent']

Published on :
2020-12-30 21:33:31+05:30

Top Image :
https://www.hindustantimes.com/rf/image_size_960x540/HT/p2/2019/12/30/Pictures/atal-bujal-tunnel-yojana_54bb5e34-2ae7-11ea-b337-29936d1a9c86.jpg

All Images :
https://www.hindustantimes.com/rf/image_size_444x250/HT/p2/2019/12/31/Pictures/_0bad45e2-2b35-11ea-96cb-8d9426408fe0.JPG
https://www.hindustantimes.com/images/app-images/ht/default_author.png
https://www.hindustantimes.com/rf/image_size_960x540/HT/p2/2019/12/30/Pictures/atal-bujal-tunnel-yojana_54bb5e34-2ae7-11ea-b337-29936d1a9c86.jpg
https://www.hindustantimes.com/rf/image_size_444x250/HT/p2/2019/12/31/Pictures/_09cd57b6-2b3b-11ea-96cb-8d9426408fe0.png
https://www.hindustantimes.com/res/img/app-images/HomePageV1/zero.gif
https://www.hindustantimes.com/rf/image_size_90x90/HT/p