# News API Headline Monitor
### Designed for the Disinformation Team at UC Berkeley Human Rights Investigations Lab

This notebook uses [News API](https://newsapi.org) and [newspaper](https://newspaper.readthedocs.io/en/latest/) to monitor current top English-language headlines in a particular country, chosen from among News API's 50 options. It downloads the headlines, full text, and NLP-generated keywords of up to 100 of the most popular current articles from the news sources News API monitors in that country and writes them to a Google Sheet using [pygsheets](https://pygsheets.readthedocs.io/en/stable/spreadsheet.html). It also calculates polarity scores reflecting positive or negative sentiment in the articles' headlines and generates lists of the `n` most frequently appearing words in the headlines and in the article keywords.

This information can help provide a snapshot of current events in a given country and inform further searching to examine the social media conversations surrounding those events. Because it is currently limited to English-language content, it cannot be considered an exhaustive representation of the news, but rather a limited, preliminary summary intended to support additional investigation.

_Created by Sonnet Phelps. [Contact](mailto:sonnet@berkeley.edu) for more details._

#### Note: 
You should have pygsheets credentials saved in a JSON file entitled `pygsheets_authfile.json` and located in the same folder as this notebook in order to access Google Sheets. Read [this walkthrough](https://pygsheets.readthedocs.io/en/stable/authorization.html) for help setting up your API credentials.  

You will also need to set up a NewsAPI account and save your API key in a JSON file entitled `newsapi_key.json`. You can register for your API key [here](https://newsapi.org/register).
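
The setup cell reads the key with `json.load(...)['key']`, so `newsapi_key.json` needs a top-level `"key"` field. A minimal sketch of creating and reading such a file follows. The placeholder value and the temporary folder are assumptions for illustration only; in practice the file sits in the same folder as this notebook.

```python
import json
import os
import tempfile

# Placeholder only: substitute the key from your own News API account.
placeholder_key = 'YOUR_NEWSAPI_KEY'

# A temporary folder stands in for the notebook's folder in this sketch.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'newsapi_key.json')

# Write the file in the shape the setup cell expects: {"key": "..."}
with open(path, 'w') as f:
    json.dump({'key': placeholder_key}, f)

# Read it back the same way the setup cell does.
with open(path) as f:
    loaded_key = json.load(f)['key']

print(loaded_key)
```

Keep the real key file out of any shared or public repository.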

### Quick Setup Instructions:
1. Run the entire notebook `(Cell > Run All)` to set it up.
2. Uncomment the bottom line in the cell below (delete the `#` in front of the last line).
3. Run the cell below. It will take a while.
4. Go to __[this Google Sheet](https://docs.google.com/spreadsheets/d/1z1ZQnSZrWAKbqbNlr9uIJGbVoeaSJrTDh0dPdaR6Zuc/edit#gid=1361559015)__ to view your newly downloaded articles.

In [29]:
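def search_and_download(title, n=20):
    '''
    Runs the full pipeline: searches headlines, scores them, downloads the articles,
    writes the results to the Google Sheet named `title`, and prints the top `n` words
    '''
    top_headlines = headline_search()
    top_headlines['headline_polarity_score'] = score_strings(list(top_headlines['title']))
    top_headlines = download_articles(top_headlines)
    df_to_gsheet(top_headlines, title, gc)
    generate_frequencies(n, top_headlines)
    return top_headlines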

#you can replace the string in the function call below with the title of your own Google Sheet
search_and_download('India News Scraping'); 

The 20 most common words among the headlines are: ['india', 'firstpost', 'attack', 'ndtv', 'pm', 'modi', 'world', 'could', 'news18', '2019', 'new', 'used', 'bjp', 'singh', 'cup', 'hindu', 'hit', '8', 'gwadar', 'men']
The 20 most common words among the keywords are: ['attack', 'election', 'west', 'singh', 'india', 'modi', 'firstpost', 'china', 'used', 'tiny', 'delhi', 'remark', 'state', 'mumbai', 'kings', 'super', 'final', 'ipl', 'runs', 'season']
         Word  Frequency
0       india          9
1   firstpost          7
2      attack          5
3        ndtv          5
4          pm          4
5        modi          4
6       world          3
7       could          3
8      news18          3
9        2019          3
10        new          2
11       used          2
12        bjp          2
13      singh          2
14        cup          2
15      hindu          2
16        hit          2
17          8          2
18     gwadar          2
19        men          2
         Word  Frequency

Run this cell to set up the notebook:

In [17]:
#imports and setup
import sys
import datetime as dt
import re
import pandas as pd
import numpy as np
import json
import csv
import collections
import matplotlib.pyplot as plt
import seaborn as sns
import pprint
pp = pprint.PrettyPrinter(indent=4)

pd.options.display.max_colwidth = 500

try:
    import requests
except ImportError:
    !{sys.executable} -m pip install requests
    import requests

try:
    import newspaper
except ImportError:
    !{sys.executable} -m pip install newspaper3k
    import newspaper

try:
    import pygsheets
except ImportError:
    !{sys.executable} -m pip install pygsheets
    import pygsheets
    
try:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
except ImportError:
    !{sys.executable} -m pip install vaderSentiment
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
    
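try:
    from nltk.corpus import stopwords
    stopwords.words('english')  # raises LookupError if the corpus is missing
except (ImportError, LookupError):
    !{sys.executable} -m pip install nltk
    import nltk
    nltk.download('stopwords')
    from nltk.corpus import stopwords  # re-import so the name exists after installing
stopwords = set(stopwords.words('english'))
stopwords.update(['says', 'today', 'times', 'news'])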

gc = pygsheets.authorize(service_file='pygsheets_authfile.json')
# read the API key and close the file afterwards
with open('newsapi_key.json') as f:
    newsapi_key = json.load(f)['key']

## Step-by-step walkthrough

Run the cell below to download the current top articles in India.

#### To modify your article search:
Enter `keyword='YOUR SEARCH TERMS'` in the `headline_search()` function call below to specify search terms, or leave it blank to get the top headlines in India (up to 100).

Enter `category='CATEGORY'` in the `headline_search()` function call below to specify a category. The options are: `business` `entertainment` `general` `health` `science` `sports` `technology`. The default is `general`.

Source: NewsAPI.org

In [18]:
#top_headlines = headline_search()
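
To see what a modified search sends to News API, the sketch below builds a request URL for a keyword and a category without contacting the service. It reuses the endpoint and some of the parameter names from `headline_search()`. The helper `build_headline_url`, the placeholder key, and the sample search terms are hypothetical, and `urllib.parse.urlencode` is used here so that multi-word keywords are encoded safely.

```python
from urllib.parse import urlencode

def build_headline_url(key, keyword='', country='in', category='general'):
    # Hypothetical helper: assembles the top-headlines URL without sending it.
    params = {
        'q': keyword,
        'country': country,
        'category': category,
        'pageSize': 100,
        'apiKey': key,
    }
    # urlencode escapes spaces and special characters in each value
    return 'https://newsapi.org/v2/top-headlines?' + urlencode(params)

url = build_headline_url('YOUR_NEWSAPI_KEY', keyword='general election', category='business')
print(url)
```

The equivalent call in this notebook would be `headline_search(keyword='general election', category='business')`.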

Run the cell below to calculate polarity scores for the headlines using the VADER sentiment lexicon (from the `vaderSentiment` package) and see a distribution plot of the scores.

Source: John DeNero and DS100 staff

In [19]:
#top_headlines['headline_polarity_score'] = score_strings(list(top_headlines['title']))
#sns.histplot(top_headlines['headline_polarity_score'], kde=True);  # distplot is deprecated in recent seaborn
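
A compound polarity score runs from -1 (most negative) to 1 (most positive). One common way to read the scores, following the thresholds suggested in the VADER documentation, is to treat values of 0.05 and above as positive, values of -0.05 and below as negative, and everything in between as neutral. The sketch below applies that rule to made-up scores; `label_polarity` is a hypothetical helper and is not part of this notebook.

```python
def label_polarity(score, threshold=0.05):
    # Hypothetical helper: buckets a compound score into a coarse label.
    if score >= threshold:
        return 'positive'
    if score <= -threshold:
        return 'negative'
    return 'neutral'

# Made-up compound scores for illustration only
sample_scores = [0.62, -0.48, 0.0, 0.03]
labels = [label_polarity(s) for s in sample_scores]
print(labels)
```

Headlines are short, so individual scores can be noisy; the overall distribution is usually more informative than any single value.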

Run the cell below to download the text and keywords of the articles using `newspaper`. Make sure you have consistent internet connectivity so that it can pull the text from each url. This one may take a while.

In [20]:
#top_headlines = download_articles(top_headlines)

Run the cell below to save your articles to a Google Sheet. To share the sheet with yourself, replace `YOUR EMAIL HERE` with your email address (keep the surrounding `''` quotes) and uncomment the bottom line.

In [21]:
#sh = df_to_gsheet(top_headlines, 'India News Scraping', gc)
#sh.share('YOUR EMAIL HERE')

Run the cell below to get the `n` most popular words among the headlines and keywords.

In [22]:
#generate_frequencies(20, top_headlines);

In [23]:
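def headline_search(key=newsapi_key, keyword='', date=None, 
                    country='in', category='general'):
    '''
    Uses NewsAPI.org to pull up to 100 top headlines in a country (default = India).
    Returns a dataframe of the articles generated in the request
    '''
    if date is None:
        date = str(dt.date.today())  # evaluated on each call, not once at definition time

    params = {'q': keyword, 'country': country, 'category': category,
              'from': date, 'sortBy': 'popularity', 'pageSize': 100, 'apiKey': key}

    # requests URL-encodes the parameters, so multi-word keywords are safe
    search_json = requests.get('https://newsapi.org/v2/top-headlines', params=params).json()
    if search_json.get('status') != 'ok':
        raise RuntimeError(f"News API request failed: {search_json.get('message', search_json)}")
    return pd.DataFrame.from_dict(search_json['articles'], orient='columns')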

In [24]:
def df_to_gsheet(df, sheet_title, gc):
    '''Writes dataframe into a new worksheet in a given Google Sheet'''
    sh = gc.open(sheet_title)
    wks = sh.add_worksheet(f'Download on {dt.datetime.now().strftime("%Y-%m-%d %H:%M")}')
    wks.set_dataframe(df, (1,1))
    return sh

In [25]:
def score_strings(strings):
    '''Takes in a list of strings and returns a list of their polarity scores from -1 to 1'''
    scores = []
    for s in strings:
        score = analyzer.polarity_scores(s)
        scores.append(score['compound'])
    return scores

In [26]:
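def generate_frequencies(n, top_headlines):
    '''
    Prints the n most common words among the headlines and among the keywords
    Returns both full frequency dataframes (headlines, keywords)
    '''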
    wordfreq_headlines = count_frequencies(clean_headlines(top_headlines['title']))
    wordfreq_keywords = count_frequencies(clean_keywords(top_headlines['keywords']))

    print(f"The {n} most common words among the headlines are: " + str(list(wordfreq_headlines['Word'].iloc[0:n])))
    print(f"The {n} most common words among the keywords are: " + str(list(wordfreq_keywords['Word'].iloc[0:n])))
    
    pp.pprint(wordfreq_headlines.head(n))
    pp.pprint(wordfreq_keywords.head(n))
    
    return wordfreq_headlines, wordfreq_keywords



In [27]:
def download_articles(df):
    '''
    Takes in a dataframe of articles of which one column is the article urls
    Uses newspaper to download and parse the text of all articles and generate a list of keywords
    Returns the same dataframe with full_text and keywords columns added
    '''

    full_text, keywords = [], [] 
    for url in df['url']:
        article = newspaper.Article(url)
        try:
            article.download()
            article.parse()
            article.nlp()
            full_text.append(article.text)
            keywords.append(article.keywords)
        except Exception:
            # record the failure so both lists stay aligned with the dataframe rows
            full_text.append('Failed to download')
            keywords.append('Failed to download')

    df['keywords'] = keywords    
    df['full_text'] = full_text

    return df

In [28]:
def count_frequencies(words):
    '''
    Takes in a list of words
    Returns a dataframe with their individual frequencies
    '''   
    return pd.DataFrame(collections.Counter(words).most_common(), columns=['Word', 'Frequency'])

def clean_headlines(headlines_series):
    '''
    Takes in a pandas Series of headlines
    Returns a list of words in the series, excepting stopwords
    '''
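    punct_re = r'[^\w\s]'  # anything that is not a word character or whitespace
    # regex=True is required because recent pandas versions treat the pattern as a literal by default
    headline_list = list(headlines_series.str.lower().str.replace(punct_re, ' ', regex=True))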
    headline_string = ' '.join(headline_list)
    return filter_stopwords(headline_string, stopwords)
  
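def clean_keywords(keywords_series):
    '''
    Takes in a pandas Series of keyword lists
    Returns one flat list of keywords, skipping articles that failed to download
    '''
    words = []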
    for item in keywords_series:
        if item != 'Failed to download':
            words += item
    return words
    
def filter_stopwords(words, stopwords):
    '''
    Filters stopwords out of a string
    (Implemented in clean_headlines)
    '''
    filtered_words = []
    for w in words.split():
        if w not in stopwords:
            filtered_words.append(w)
    return filtered_words
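
To illustrate what the cleaning and counting helpers above do, here is a self-contained sketch that applies the same steps (lowercase, replace punctuation with spaces, drop stopwords, count) to two made-up headlines using only the standard library. The tiny stopword set, the sample headlines, and the `clean` helper are assumptions for illustration; the notebook itself uses NLTK's English stopword list and pandas.

```python
import collections
import re

# Made-up inputs for illustration only
sample_headlines = ['India wins the cup!', 'PM visits India; cup final today']
sample_stopwords = {'the', 'today'}

def clean(headlines, stopwords):
    # Lowercase and join the headlines into one string
    text = ' '.join(h.lower() for h in headlines)
    # Replace anything that is not a word character or whitespace with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    # Split on whitespace and drop stopwords
    return [w for w in text.split() if w not in stopwords]

words = clean(sample_headlines, sample_stopwords)
top_words = collections.Counter(words).most_common(2)
print(top_words)
```

Here `india` and `cup` each appear twice, so they come out as the two most common words.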
    