## Parsing an RSS feed
This is a work in progress technical POC with the objective of:
1. Taking an RSS feed URL
2. Reading the items on the feed
3. Linking through to the source articles
4. Creating a datastructure to hold the atricle and metadata

This is intended to be passed to an NL parser that interprets it, a handler that stores the article datastructure and interpretation and a controller that decides whether action should be taken.

In [2]:
# Before running, use 'pip install feedparser'
import feedparser
from bs4 import BeautifulSoup
import requests
import hashlib

In [3]:
# Use the feedparser to get an RSS feed. This is from my personal Google News
feed = 'https://news.google.com/news?cf=all&hl=en&pz=1&ned=au&q=analytics&output=rss'
parser = feedparser.parse(feed)

In [None]:
# Testing to understand the feed data structure. Don't need to run.
# Check the feed metadata
print(parser['feed'])
print('\n')
print(parser['feed']['title'])
print('\n')
print(parser['feed']['link'])
print('\n')
print(parser['feed']['updated'])

In [6]:
# Testing to understand the feed data structure. Don't need to run.
# How many articles do we have?
print(len(parser['entries']))

10


In [None]:
# Testing to understand the feed data structure. Don't need to run.
# What's in entry 1?
#print(parser['entries'][0])
print('\nTitle:\n' + parser['entries'][0]['title_detail']['value'] + '\n')
print('\nPublished:\n' + parser['entries'][0]['published'] + '\n')
print('\nLink:\n' + parser['entries'][0]['link'] + '\n')
print('\nSummary:\n' + parser['entries'][0]['summary_detail']['value'] + '\n')

## The datastructure we want is probably the following:
### Article
- **id** : a md5 hash of the article URL; only the real bits not the click-through crap
- **metadata_feed_name** : name of the RSS feed or search criteria that returned this article
- **metadata_feed_link** : URL for the feed that teh article was sourced from
- **metadata_feed_accessed** : timestamp when the article was obtained (the first time, duplicates are dropped)
- **article_title** : title
- **article_published** : timestamp when the article was first published
- **article_publisher** : the news source
- **article_link** : link to the article (minus the click-through crap)
- **article_content** : the text of the article (scrape the HTML and keep only the body content)

We then do natural language processing on the Article and produce a new table of article_ids and tuples that represent the themes of the article. E.g. (noun, verb, preposition etc), ('BHP', 'dam', 'collapse'), ('Australian company Lucapa', 'finds', 'huge diamond', 'Angola')

In [4]:
# Now save the summary of the whole list
summary = []

for item in parser['entries']:
    summary.append({'id': hashlib.md5(item.link.encode('utf-8')).hexdigest(), 
                    'metadata_feed_name': parser.feed.title, 
                    'metadata_feed_link': parser.feed.link,
                    'metadata_feed_accessed': parser.feed.updated, 
                    'article_title': item.title, 
                    'article_published': item.published, 
                    'article_publisher': 0, # to be parsed and added
                    'article_link': item.link,
                    'article_content': 0 # to be parsed and added
                   })

# Print it so we know it worked; remember that dictionaries are not ordered
for keys,values in summary[0].items():
    print(keys, ': ', values)

article_title :  6 Ways to Use Google Analytics You Haven't Thought Of - Search Engine Journal
article_publisher :  0
metadata_feed_link :  http://news.google.com/news?hl=en&pz=1&ned=au&q=analytics
metadata_feed_name :  analytics - Google News
article_published :  Tue, 16 Feb 2016 12:07:42 GMT
metadata_feed_accessed :  Wed, 17 Feb 2016 13:28:28 GMT
article_content :  0
id :  5f3852c02d9086f338f535bf397187d3
article_link :  http://news.google.com/news/url?sa=t&fd=R&ct2=au&usg=AFQjCNHgv_12F-QA-h7C8wFQYEh4QCwScw&clid=c3a7d30bb8a4878e06b80cf16b898331&ei=e3XEVvD3O6er4ALTl4LIBw&url=https://www.searchenginejournal.com/6-ways-use-google-analytics-havent-thought/155118/


In [10]:
# testing the link following and html parsing
link = summary[0]['article_link']
print(link) #works

html_doc = requests.get(link).text
#print(html_doc) #works

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find('title').get_text())

#for tag in soup.find_all(True):
#    print(tag.name)

#print('----------------------------------------------------------------')    

soup_index = soup.find("h1")
print(soup_index.get_text())
soup_index_sibling = soup_index.next_sibling
print(soup_index_sibling.get_text())
soup_index_sibling2 = soup_index_sibling.next_sibling
print(soup_index_sibling.get_text())


#html = u""
#for tag in soup.find("h1").next_siblings:
#    if tag.name == "h1":
#        break
#    else:
#        print(tag)

#print(html)
#soup2 = BeautifulSoup(html)

http://news.google.com/news/url?sa=t&fd=R&ct2=au&usg=AFQjCNHgv_12F-QA-h7C8wFQYEh4QCwScw&clid=c3a7d30bb8a4878e06b80cf16b898331&ei=e3XEVvD3O6er4ALTl4LIBw&url=https://www.searchenginejournal.com/6-ways-use-google-analytics-havent-thought/155118/
6 Ways to Use Google Analytics You Haven't Thought Of | SEJ

Search Marketing Advice, News, and Tutorials
Search Marketing Advice, News, and Tutorials


Bugs:
- Strip crap out of RSS article links
- Parse time stamps into useable formats
- Parse the publisher out of the URL or maybe the site name?
- Scraping the articles gets super confusing witht eh amount of shit tags. Maybe take everything from the start of the 1st <h1> to the start of the next <h1>. In between there shoudl be the actual data and not the page comments and advertising crap.