# Good News
A service to find good news from your usual headlines :)

- Trained on data from https://www.kaggle.com/uciml/news-aggregator-dataset.
- Follows the Microsoft tutorial here: https://notebooks.azure.com/Microsoft/libraries/samples/html/Discover%20Sentiments%20in%20Tweets.ipynb
- Also follows the NLTK sentiment tutorial here: http://www.nltk.org/howto/sentiment.html

## Data Import

### Setup

In [1]:
# Handling data
import pandas as pd

# Cleaning data
import re

# API interaction
import json
import requests

In [26]:
# API Variables
API_KEY = '88d3bf9f4bed4dfabda05fbfa5f3999e'
BASE_URL = 'https://newsapi.org/v2/top-headlines'
default_params = {'country': 'us', 'category': 'general'}

sample_url = 'https://jsonplaceholder.typicode.com/posts/1'

test_mode = False

### API Functions

In [33]:
def run_sample_json_tests(sj):
    '''Since we're going to be running the same API tests, this is helpful'''
    assert sj['userId'] == 1
    assert sj['id'] == 1
    assert sj['title'] == "sunt aut facere repellat provident occaecati excepturi optio reprehenderit"
    assert sj['body'] == "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"

In [34]:
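def get_raw_json(url, api_key, api_params):
    '''API client: GET url with api_params plus the API key and return the parsed JSON'''
    payload = dict(api_params)  # copy, so the caller's dict (e.g. default_params) is not mutated
    payload['apiKey'] = api_key
    return requests.get(url, params=payload).json()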
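
A note on the payload handling in get_raw_json(): assigning a dict to a new name does not copy it, so adding apiKey to the payload can also add it to the dict the caller passed in (here, default_params). The sketch below illustrates the difference without making any network request. The names build_payload_aliased and build_payload_copied, and the key 'demo-key', are hypothetical and used only for illustration.

```python
def build_payload_aliased(api_params, api_key):
    # payload is just another name for the caller's dict
    payload = api_params
    payload['apiKey'] = api_key
    return payload


def build_payload_copied(api_params, api_key):
    # dict(...) makes a shallow copy, leaving the caller's dict untouched
    payload = dict(api_params)
    payload['apiKey'] = api_key
    return payload


params_a = {'country': 'us', 'category': 'general'}
build_payload_aliased(params_a, 'demo-key')
print('apiKey' in params_a)  # the caller's dict now carries the key

params_b = {'country': 'us', 'category': 'general'}
build_payload_copied(params_b, 'demo-key')
print('apiKey' in params_b)  # the caller's dict is unchanged
```

A shallow copy is enough here because the parameter values are plain strings.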

In [35]:
# Test for get_raw_json()
sample_json = get_raw_json(sample_url, None, default_params)
run_sample_json_tests(sample_json)

In [46]:
def import_to_dict(filename=None, url=BASE_URL, api_key=API_KEY, params=default_params):
    '''Imports data as a dict. If test_mode is on, reads the local file; otherwise calls the API.'''
    if test_mode:
        with open(filename) as f:
            return json.load(f)
    return get_raw_json(url, api_key, params)

In [47]:
# Tests for import_to_dict()
previous_test_mode_status = test_mode

# test_mode ON
test_mode = True
imported_from_file = import_to_dict('sample.json')
run_sample_json_tests(imported_from_file)

# test_mode OFF
test_mode = False
imported_from_web = import_to_dict(url=sample_url)
run_sample_json_tests(imported_from_web)

# Restore previous value of test_mode
test_mode = previous_test_mode_status
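
The test_mode ON branch above reads a local sample.json, which this notebook does not show. Below is a minimal sketch of creating such a fixture, assuming it should hold exactly the fields that run_sample_json_tests() checks. The temporary directory and the names sample_post, fixture_path and reloaded are illustrative choices, not part of the original setup.

```python
import json
import os
import tempfile

# Field values copied from the assertions in run_sample_json_tests()
sample_post = {
    'userId': 1,
    'id': 1,
    'title': "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
    'body': "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\n"
            "reprehenderit molestiae ut ut quas totam\n"
            "nostrum rerum est autem sunt rem eveniet architecto",
}

# Write the fixture into a scratch directory (assumption: any writable path works)
fixture_dir = tempfile.mkdtemp()
fixture_path = os.path.join(fixture_dir, 'sample.json')
with open(fixture_path, 'w') as f:
    json.dump(sample_post, f)

# Read it back the same way import_to_dict() does in test mode
with open(fixture_path) as f:
    reloaded = json.load(f)

print(reloaded == sample_post)
```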

## Data Cleaning

In [51]:
data = pd.DataFrame(import_to_dict('news.json'))
print("Shape: {}".format(data.shape))
data.head()

Shape: (20, 3)


|   | articles | status | totalResults |
|---|----------|--------|--------------|
| 0 | {'source': {'id': 'the-wall-street-journal', '... | ok | 30423 |
| 1 | {'source': {'id': 'the-wall-street-journal', '... | ok | 30423 |
| 2 | {'source': {'id': 'the-wall-street-journal', '... | ok | 30423 |
| 3 | {'source': {'id': 'the-wall-street-journal', '... | ok | 30423 |
| 4 | {'source': {'id': 'the-wall-street-journal', '... | ok | 30423 |
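
Each entry in the articles column is still a nested dict, so fields such as url are not yet columns of their own. One way to flatten them, as a sketch, is pandas.json_normalize. The two records below are made-up stand-ins for real API results, and raw and flat are hypothetical names.

```python
import pandas as pd

# Made-up response with the same top-level keys as the output above
raw = {
    'status': 'ok',
    'totalResults': 2,
    'articles': [
        {'source': {'id': 'example-source', 'name': 'Example Source'},
         'author': 'Alice', 'title': 'First headline',
         'url': 'https://example.com/a',
         'urlToImage': 'https://example.com/a.png'},
        {'source': {'id': 'example-source', 'name': 'Example Source'},
         'author': 'Bob', 'title': 'Second headline',
         'url': 'https://example.com/b',
         'urlToImage': 'https://example.com/b.png'},
    ],
}

# One row per article; nested keys become dotted column names
flat = pd.json_normalize(raw['articles'])
print(sorted(flat.columns))
```

With this layout the source fields appear as source.id and source.name, so any later column selection would need to use those dotted names.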


Now we split the original DataFrame into two:
1) training_data: data to be passed directly into the model.
2) articles: per-article data that isn't useful to the model.
Both are indexed by url.

First we create the articles DataFrame:

In [17]:
article_cols = ['url', 'urlToImage']
articles = data.filter(article_cols, axis=1).set_index('url')

In [18]:
assert False

AssertionError: 

And now the training data. First, a small utility function:

In [22]:
def extract_author(author_str):
    '''Naive way to extract an author name from a byline, handle, or URL.'''
    author_str = author_str.split('/')[-1] # If it's a URL, keep only the last path segment
    
    # The following is adapted from the Azure notebook from Microsoft:
    author_str = re.sub(r"@", "", author_str) # strip the @ from Twitter handles
    author_str = re.sub(r"\d", "", author_str) # remove numbers
    author_str = re.sub(r"_+", "", author_str) # remove underscores, keep hyphens
    author_str = author_str.lower() # transform to lower case
    
    return author_str

In [23]:
sample_authors = pd.Series(['Normal Author', '@twitterhandle', 'alice123', 'the_blogger-mommy'])
expected_authors = pd.Series(['normal author', 'twitterhandle', 'alice', 'theblogger-mommy'])

# == on two Series yields a boolean Series, so use .equals() to get a single True/False
assert sample_authors.apply(extract_author).equals(expected_authors)

In [None]:
training_data = data.set_index('url')
training_data.drop(columns='urlToImage', inplace=True)

# Source id and name are redundant. Use id only.
training_data.drop(columns='name', inplace=True)

training_data['author'] = training_data['author'].apply(extract_author)
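
If some records have no author (an assumption, since the output above does not show the author field), calling string methods on the missing value would raise. Below is a sketch of guarding against that. Here clean_author is a simplified, hypothetical stand-in for extract_author(), and clean_author_or_blank is likewise a made-up name.

```python
import re


def clean_author(author_str):
    # Simplified stand-in for extract_author(): keep the last URL segment,
    # drop '@', digits and underscores, then lower-case
    author_str = author_str.split('/')[-1]
    author_str = re.sub(r"[@\d_]", "", author_str)
    return author_str.lower()


def clean_author_or_blank(value):
    # Anything that is not a string (None, NaN, ...) becomes an empty author
    if not isinstance(value, str):
        return ''
    return clean_author(value)


authors = ['Normal Author', None, 'https://example.com/staff/alice123']
cleaned = [clean_author_or_blank(a) for a in authors]
print(cleaned)
```

The same wrapper could be passed to Series.apply in place of the bare cleaning function.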

In [19]:
assert False

AssertionError: 

## Prep for Model

## Model