# News Crawler Application
*This is an interactive Jupyter Notebook application which retrieves top news on a topic (as asked by the user) from trustworthy news sources, powered by News API.*

When we look for news on a topic, we usually search on Google and find results that include *fake or biased news*, much of it from *unverified or unreliable sources*. Google also does not index every part of the internet, so the results are not always accurate, and this sometimes ends up spreading fake news. One solution is to get news only from websites or sources that we can trust, such as BBC, CNN, The Times of India, or The Hindu, which helps us stay away from articles meant to propagate fake news. NewsCrawler is an application that performs a *search within the selected news sources*. It can filter the news by keyword, which is very useful when the user is interested in a particular topic such as cryptocurrency, the Prime Minister of India, or the stock price of a company. It is no longer necessary to visit the website of each source, because their articles can be read from within NewsCrawler. NewsCrawler works by combining *news API queries with web scraping*.

**The following procedure describes the workflow behind the application and how it captures major news sources at any given point of time.**

# Requirements

In [13]:
pip install newsapi-python

Collecting newsapi-python
  Downloading newsapi_python-0.2.6-py2.py3-none-any.whl (7.9 kB)
Installing collected packages: newsapi-python
Successfully installed newsapi-python-0.2.6


# All the imports

### For user input

In [6]:
from ipywidgets import widgets
from ipywidgets import *

**Widgets** are eventful **Python objects** that have a representation in the browser, often as a control like a slider, textbox, etc.
You can use widgets to build **interactive GUIs for your notebooks**.
You can also use widgets to synchronize stateful and stateless information between Python and JavaScript.

In [7]:
from traitlets import *

**Traitlets** is a framework that lets Python classes have attributes with type checking, dynamically calculated default values, and ‘on change’ callbacks.
The package also includes a mechanism to use traitlets for configuration, loading values from files or from command line arguments. This is a distinct layer on top of traitlets, so you can use traitlets in your code without using the configuration machinery.
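
As a rough illustration of those three features, the sketch below defines a hypothetical `SearchSettings` class (it is not part of this notebook) with a type-checked attribute, a dynamically calculated default, and an 'on change' callback. The module-level `changes` list is also just an illustrative way to record what the callback receives.

```python
from traitlets import HasTraits, Int, TraitError, Unicode, default, observe

changes = []  # illustrative log of (old, new) values seen by the callback


class SearchSettings(HasTraits):
    """Hypothetical settings object used only for this example."""
    topic = Unicode()
    page_size = Int()

    @default('page_size')
    def _default_page_size(self):
        # Dynamically calculated default: computed the first time it is read
        return 5

    @observe('topic')
    def _topic_changed(self, change):
        # 'On change' callback: `change` carries the old and new values
        changes.append((change['old'], change['new']))


settings = SearchSettings()
settings.topic = 'bitcoin'      # triggers _topic_changed
print(settings.page_size)       # value comes from _default_page_size
print(changes)

try:
    settings.page_size = 'five'  # wrong type, so the assignment is rejected
except TraitError as error:
    print('Rejected:', error)
```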

In [8]:
from IPython.display import display, Markdown

*To display data using markdown.*

In [9]:
import unicodedata

This module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters.

*To convert unicode to string*
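
A minimal sketch of that conversion is shown below. The helper name `to_ascii` is illustrative and does not appear elsewhere in the notebook; it applies the same normalize, encode, and decode steps the later cells use.

```python
import unicodedata


def to_ascii(text):
    """Return `text` with accents stripped and other non-ASCII characters dropped."""
    # NFKD splits characters such as 'é' into a base letter plus a combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' discards whatever cannot be represented
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii('Café Zürich'))  # Cafe Zurich
```

Characters that have no ASCII decomposition, such as curly quotes, are removed entirely rather than replaced, so this approach suits search keywords better than text meant for display.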

In [14]:
from newsapi import NewsApiClient

Use the **unofficial Python client library** to *integrate News API into your Python application* without having to make HTTP requests directly.

*NewsApiClient is used to find the URLs of all the news on the given topic.*

In [15]:
import requests

Requests **allows you to send HTTP requests** extremely easily. There’s no need to manually add query strings to your URLs or to form-encode your PUT and POST data; nowadays you can simply pass the data through the `json` argument.
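
To see the query-string handling without touching the network, the sketch below builds a request and inspects the URL that Requests would send. The address `https://example.com/search` and the parameter names are placeholders chosen for illustration.

```python
import requests

# Build (but do not send) a GET request; example.com is a placeholder URL
request = requests.Request(
    'GET',
    'https://example.com/search',
    params={'q': 'stock prices', 'page': 1},
)
prepared = request.prepare()

# Requests has encoded the parameters into the query string for us
print(prepared.url)
```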


In [16]:
from bs4 import BeautifulSoup

Beautiful Soup is a library that makes it easy **to scrape information from web pages**. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

# Ask user about their choice of the topic

We'll start by asking the user which topic they're interested in. Run the cell, enter a topic (for example, the name of a company), and press _Enter_.

### Get user choice about the topic

In [None]:
text_input = widgets.Text(value=input("Enter the topic you want to find news about: "))
print("You want to find news on", text_input.value+"! Here you go...")
# display(text_input)


Store the user's topic choice for use in the rest of the notebook. (Note: the text is normalized to plain ASCII so that it is easier to use in the API query later.)

### Convert user's topic preference into a string ready for API query

In [106]:
company_name = unicodedata.normalize('NFKD', text_input.value).encode('ascii','ignore').decode("utf-8")

# Find top news article URLs

Next, we'll use the **News API** to find the top news about the topic from trustworthy sources such as CNBC and BBC News. Selecting multiple sources at a time gives a diversified yet reputable mix of news articles.
Currently, we're only fetching 5 news articles, but that value can be changed through `page_size`.

**Some trustworthy sources**: 
  + bbc-news, cnn, espn,
  + the-hindu, the-times-of-india, news24,
  + google-news, crypto-coins-news, bloomberg etc.
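
To query several of these sources at once, the identifiers can be joined into a single comma-separated string, which is the form the News API `sources` parameter takes. A small sketch, with `trusted_sources` and `sources_param` as illustrative names:

```python
# Source identifiers taken from the list above
trusted_sources = ['bbc-news', 'cnn', 'the-hindu', 'the-times-of-india']

# The API takes one comma-separated string rather than a Python list
sources_param = ','.join(trusted_sources)
print(sources_param)  # bbc-news,cnn,the-hindu,the-times-of-india
```

The resulting string could then be passed as the `sources` argument in the API call in place of a single identifier.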

### Call to NewsApi
A Python wrapper around News API, a JSON API for live news and blog headlines: https://newsapi.org/

In [107]:
#Call to NewsApi
import os

# Read the API key from an environment variable instead of hard-coding it
# (the variable name NEWS_API_KEY is an assumption; use whatever you have set)
newsapi = NewsApiClient(api_key=os.environ['NEWS_API_KEY'])
all_articles = newsapi.get_everything(q=company_name,
                                      language='en',
                                      # bbc-news, cnn, espn, the-hindu, the-times-of-india,
                                      # news24, google-news, crypto-coins-news, bloomberg etc.
                                      sources='bbc-news',
                                      # relevancy, popularity or publishedAt (default)
                                      sort_by='relevancy',
                                      page_size=5)

### Create a list of URLs to crawl through from the API

In [None]:
list_url = []
for item in all_articles['articles']:
    list_url.append(unicodedata.normalize('NFKD', item['url']).encode('ascii','ignore').decode("utf-8"))
# print(list_url)
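
The loop above relies on the response being a dictionary with an `articles` list whose items carry a `url` key. The sketch below runs the same extraction on a hand-written stand-in for the response (the sample data is invented for illustration) and skips entries without a URL, which is an extra precaution rather than something the notebook does.

```python
# Hand-written stand-in for a News API response; real responses carry more fields
sample_response = {
    'status': 'ok',
    'totalResults': 3,
    'articles': [
        {'title': 'First story', 'url': 'https://example.com/first'},
        {'title': 'Second story', 'url': None},
        {'title': 'Third story', 'url': 'https://example.com/third'},
    ],
}

# Keep only the articles that actually have a URL
sample_urls = [item['url'] for item in sample_response['articles'] if item.get('url')]
print(sample_urls)
```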

### Iterate through URLs and capture text

Here we'll use the Beautiful Soup library to parse the HTML content and keep only the article text. For each URL:
1. Fetch the HTML
2. Parse the HTML document using Beautiful Soup
3. Find and store the content of the news article inside the webpage

In [112]:
# Iterate through list of urls, capture text on each and make presentable
article_content = []
for i in list_url:
    page = requests.get(i)
    soup = BeautifulSoup(page.content, 'html.parser')
    content = soup.find_all(['p'])
    for line in content:
        line_string = unicodedata.normalize('NFKD', line.get_text()).encode('ascii','ignore').decode("utf-8")
        if (len(line_string)>20):
            article_content.append(line_string)
    article_content.append(',')
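
Steps 2 and 3 can be tried without any network access by parsing a small HTML string. In the sketch below, `sample_html` is an invented snippet and `extract_paragraphs` is an illustrative helper that mirrors the paragraph filter used in the cell above.

```python
from bs4 import BeautifulSoup

# Invented HTML snippet standing in for a downloaded article page
sample_html = """
<html><body>
  <p>Share</p>
  <p>This paragraph is long enough to be treated as article text.</p>
  <p>Short caption</p>
  <p>Another sentence that clearly belongs to the body of the article.</p>
</body></html>
"""


def extract_paragraphs(html, min_length=20):
    """Return the text of every <p> element longer than `min_length` characters."""
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = []
    for tag in soup.find_all('p'):
        text = tag.get_text().strip()
        if len(text) > min_length:
            paragraphs.append(text)
    return paragraphs


for paragraph in extract_paragraphs(sample_html):
    print(paragraph)
```

The length threshold is a simple heuristic: it drops short fragments such as button labels and captions, at the cost of also dropping genuinely short paragraphs.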

### Display the articles

Use IPython's `display` function with `Markdown` objects to present the articles as Markdown.

In [None]:
counter = 1
display(Markdown("# Article " + str(counter)))
for line in article_content:
    # A lone comma marks the end of one article's text
    if line == ',':
        counter += 1
        # Stop after the last fetched article so no empty heading is shown
        if counter > len(list_url):
            break
        display(Markdown("# Article " + str(counter)))
        continue
    display(Markdown(line))
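
The comma separator works, but the grouping it encodes can also be made explicit. The sketch below is one possible alternative, not the notebook's own approach: a hypothetical `split_articles` helper turns the flat list into one list of paragraphs per article, shown here on invented sample data.

```python
def split_articles(lines, separator=','):
    """Group a flat list of lines into one list per article, splitting on `separator`."""
    articles, current = [], []
    for line in lines:
        if line == separator:
            articles.append(current)
            current = []
        else:
            current.append(line)
    if current:  # keep trailing lines that were not followed by a separator
        articles.append(current)
    return articles


# Invented sample in the same shape as article_content
sample_content = ['First paragraph of article one.', 'Second paragraph.', ',',
                  'Only paragraph of article two.', ',']
for number, paragraphs in enumerate(split_articles(sample_content), start=1):
    print('Article', number, '-', len(paragraphs), 'paragraph(s)')
```

With the articles grouped this way, the heading counter no longer has to be tracked by hand while displaying them.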