## Parsing an RSS feed
This is a work-in-progress technical proof of concept (POC) with the following objectives:
1. Taking an RSS feed URL
2. Reading the items on the feed
3. Linking through to the source articles
4. Creating a data structure to hold the article and its metadata

The resulting data structure is intended to be passed to an NL parser that interprets it, a handler that stores the article data structure and its interpretation, and a controller that decides whether action should be taken.

In [None]:
# Set testing to 1 if you want to run absolutely everything
testing = 0

In [3]:
# Before running, use 'pip install feedparser beautifulsoup4 requests'
import feedparser
from bs4 import BeautifulSoup
import requests
import hashlib

In [4]:
# Use the feedparser to get an RSS feed. This is from my personal Google News
feed = 'http://news.google.com/news?cf=all&hl=en&pz=1&ned=au&q=lucapa+diamond+company+kimberlite&output=rss'
parser = feedparser.parse(feed)

In [None]:
# Testing to understand the feed data structure. Don't need to run.
# Check the feed metadata
if testing == 1:
    print(parser['feed'])
    print('\n')
    print(parser['feed']['title'])
    print('\n')
    print(parser['feed']['link'])
    print('\n')
    print(parser['feed']['updated'])

In [6]:
# Testing to understand the feed data structure. Don't need to run.
# How many articles do we have?
if testing == 1:
    print(len(parser['entries']))

10


In [5]:
# Testing to understand the feed data structure. Don't need to run.
# What's in entry 1?
#print(parser['entries'][0])
#if testing == 1:
#print('\nTitle:\n' + parser['entries'][0]['title_detail']['value'] + '\n')
#print('\nPublished:\n' + parser['entries'][0]['published'] + '\n')
print('\nLink:\n' + parser['entries'][0]['link'] + '\n')
#print('\nSummary:\n' + parser['entries'][0]['summary_detail']['value'] + '\n')


Link:
http://news.google.com/news/url?sa=t&fd=R&ct2=au&usg=AFQjCNE_p9Dej1a3iSOVhCNsqDaIo9-YdQ&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779048508106&ei=YKbGVsi4D4GV4ALJ3r9o&url=http://www.forbes.com/sites/trevornace/2016/02/17/largest-diamond-found-angola-flawless-404-carats/



## The data structure we want is probably the following:
### Article
- **id** : an MD5 hash of the article URL; only the real bits not the click-through crap
- **metadata_feed_name** : name of the RSS feed or search criteria that returned this article
- **metadata_feed_link** : URL for the feed that the article was sourced from
- **metadata_feed_accessed** : timestamp when the article was obtained (the first time, duplicates are dropped)
- **article_title** : title
- **article_published** : timestamp when the article was first published
- **article_publisher** : the news source
- **article_link** : link to the article (minus the click-through crap)
- **article_content** : the text of the article (scrape the HTML and keep only the body content)
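
The fields above could be captured in a small dataclass. This is a minimal sketch rather than the notebook's implementation: `Article`, `strip_click_through` and `article_id` are hypothetical names, and it assumes the Google News redirect carries the real address in a `url` query parameter, as in the link printed earlier.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import parse_qs, urlparse


@dataclass
class Article:
    """One scraped article plus the feed metadata it arrived with."""
    id: str
    metadata_feed_name: str
    metadata_feed_link: str
    metadata_feed_accessed: str
    article_title: str
    article_published: str
    article_publisher: str
    article_link: str
    article_content: str


def strip_click_through(link):
    """Return the 'url' query parameter if present, else the link unchanged.

    Assumption: the redirect wrapper stores the target in a 'url' parameter.
    """
    targets = parse_qs(urlparse(link).query).get('url')
    return targets[0] if targets else link


def article_id(link):
    """MD5 hex digest of the cleaned article URL."""
    return hashlib.md5(strip_click_through(link).encode('utf-8')).hexdigest()
```

One caveat: if the target URL contains its own unencoded `&`, `parse_qs` will cut it short, so falling back to the `og:url` or canonical link is still worthwhile.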

We then run natural language processing on the Article and produce a new table of article_ids and tuples that represent the themes of the article. The tuples take a form like (noun, verb, preposition, etc.), e.g. ('BHP', 'dam', 'collapse') or ('Australian company Lucapa', 'finds', 'huge diamond', 'Angola').
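
One possible shape for that table is a list of (article_id, theme) rows that can be grouped per article. This is only a sketch: the ids `article-1` and `article-2` are placeholders, `themes_by_article` is a hypothetical helper, and the NLP step that would produce the tuples is not shown.

```python
from collections import defaultdict

# Placeholder ids; real rows would use the MD5 article ids.
theme_rows = [
    ('article-1', ('BHP', 'dam', 'collapse')),
    ('article-2', ('Australian company Lucapa', 'finds', 'huge diamond', 'Angola')),
]


def themes_by_article(rows):
    """Group theme tuples by the article id they belong to."""
    grouped = defaultdict(list)
    for article_id, theme in rows:
        grouped[article_id].append(theme)
    return dict(grouped)
```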

In [12]:
# Now save the summary of the whole list
summary = []

for item in parser['entries']:
    
    # Initiate the HTML parser for this article
    article_html = requests.get(item.link).text
    soup = BeautifulSoup(article_html, 'html.parser')
    
    # All decent articles should have Facebook OpenGraph tags that look like <meta property="og:type" content="article" />
    # These have useful stuff to describe the article, including site_name and the clean URL
    # If the Facebook OG tags are not present, try the Twitter ones, else leave the publisher blank, use the messy link.
    publisher = ''
       
    # Check for a standard meta-publisher tag
    publisher_search = soup.find('meta', attrs={'property': 'publisher', 'content': True})
    if publisher_search:
        publisher = publisher_search['content']
    else:
        # No meta-publisher, so look for a Facebook publisher tag
        publisher_search = soup.find('meta', attrs={'property': 'og:site_name', 'content': True})
        if publisher_search:
            publisher = publisher_search['content']
    
    link = item.link
    
    # Check for a Facebook URL (og:url)
    link_search = soup.find('meta', attrs={'property': 'og:url', 'content': True})
    if link_search:
        link = link_search['content']
        
    # That failed, so try for a Twitter one (twitter:url)
    if link_search is None:
        link_search = soup.find('meta', attrs={'property': 'twitter:url', 'content': True})
        if link_search:
            link = link_search['content']
            
    # That failed, so try for a 'link rel="canonical"'
    if link_search is None:
        link_search = soup.find('link', attrs={'rel': 'canonical', 'href': True})
        if link_search:
            link = link_search['href']
    # If we found a result (link_search != None), then we've updated 'link', else it still contains the messy link 
    
    summary.append({'id': hashlib.md5(link.encode('utf-8')).hexdigest(),
                    'metadata_feed_name': parser.feed.title, 
                    'metadata_feed_link': parser.feed.link,
                    'metadata_feed_accessed': parser.feed.updated, 
                    'article_title': item.title, 
                    'article_published': item.published, 
                    'article_publisher': publisher,
                    'article_link': link,
                    'article_content': article_html
                   })

# Print it so we know it worked; remember that dictionaries are not ordered
for items in summary:
    print('--- Article Start ---')
    for keys,values in items.items():
        if keys != 'article_content':
            print(keys, ':\t', values)
    print('--- Article End ---\n')

--- Article Start ---

article_title :  A tiny Australian miner just found this huge diamond - Business Insider Australia
article_link :  http://www.businessinsider.com.au/a-tiny-australian-miner-just-found-this-huge-diamond-in-africa-2016-2
metadata_feed_name :  lucapa diamond company kimberlite - Google News
metadata_feed_accessed :  Thu, 18 Feb 2016 13:25:26 GMT
article_published :  Mon, 15 Feb 2016 01:24:27 GMT
metadata_feed_link :  http://news.google.com/news?hl=en&pz=1&ned=au&q=lucapa+diamond+company+kimberlite
id :  8af41cc74cd5330faa7d35db6f890694
article_publisher :  Business Insider Australia
--- Article End ---

--- Article Start ---

article_title :  Largest Diamond Ever Found In Angola: Near Flawless & 404-Carats - Forbes
article_link :  http://www.forbes.com/sites/trevornace/2016/02/17/largest-diamond-found-angola-flawless-404-carats/
metadata_feed_name :  lucapa diamond company kimberlite - Google News
metadata_feed_accessed :  Thu, 18 Feb 2016 13:25:26 GMT
article_publi

Bugs:
- Strip crap out of RSS article links
- Parse timestamps into usable formats
- Parse the publisher out of the URL or maybe the site name?
- Scraping the articles gets super confusing with the amount of junk tags. Maybe take everything from the start of the first 'h1' to the start of the next 'h1'. In between there should be the actual content and not the page comments and advertising crap.
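
The timestamp and publisher items can probably be handled with the standard library alone. This is a minimal sketch with hypothetical helper names (`parse_timestamp`, `publisher_from_url`). It assumes the feed dates are RFC 822 style strings like those in the output above, and that the hostname is an acceptable stand-in for the publisher when no meta tag is found.

```python
from email.utils import parsedate_to_datetime
from urllib.parse import urlparse


def parse_timestamp(value):
    """Parse a date such as 'Mon, 15 Feb 2016 01:24:27 GMT' into a datetime."""
    return parsedate_to_datetime(value)


def publisher_from_url(link):
    """Fall back to the hostname (minus a leading 'www.') as the publisher."""
    host = urlparse(link).hostname or ''
    return host[4:] if host.startswith('www.') else host
```

feedparser entries also carry a `published_parsed` field holding a `time.struct_time`, which may be a simpler starting point than re-parsing the string.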

In [7]:
feed = 'http://news.google.com/news?cf=all&hl=en&pz=1&ned=au&q=analytics&output=rss'
parser = feedparser.parse(feed)
parser['entries'][0]
#for keys,values in parser['entries'][0].items():
#    print(keys, ':\t', values, '\n')

{'guidislink': False,
 'id': 'tag:news.google.com,2005:cluster=http://venturebeat.com/2016/02/18/yahoos-flurry-unveils-redesign-launches-analytics-apps-and-apple-tv-sdk/',
 'link': 'http://news.google.com/news/url?sa=t&fd=R&ct2=au&usg=AFQjCNHbjeV4YOkJzxGmKtHpCEfFXVu3bg&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779049021818&ei=RqzGVsCrMIyh4ALto7GYDA&url=http://venturebeat.com/2016/02/18/yahoos-flurry-unveils-redesign-launches-analytics-apps-and-apple-tv-sdk/',
 'links': [{'href': 'http://news.google.com/news/url?sa=t&fd=R&ct2=au&usg=AFQjCNHbjeV4YOkJzxGmKtHpCEfFXVu3bg&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779049021818&ei=RqzGVsCrMIyh4ALto7GYDA&url=http://venturebeat.com/2016/02/18/yahoos-flurry-unveils-redesign-launches-analytics-apps-and-apple-tv-sdk/',
   'rel': 'alternate',
   'type': 'text/html'}],
 'published': 'Fri, 19 Feb 2016 01:47:31 GMT',
 'published_parsed': time.struct_time(tm_year=2016, tm_mon=2, tm_mday=19, tm_hour=1, tm_min=47, tm_sec=31, tm_wday=4, tm_yday=50, tm_