In [4]:
%load_ext autoreload
%autoreload 2

import config
import news_handler as nh
import scrapers
import storage
import spacy

In [8]:
dummy_evt = 1
dummy_contxt = 2

articles = nh.articles_handler(dummy_evt, dummy_contxt, True)

{'_id': ObjectId('5a5028f8a4a4563bdb93f7ac'),
 'author': 'https://www.facebook.com/bbcnews',
 'description': 'State employees will get $260 (£190) a month after 5% VAT and higher fuel prices came into force.',
 'publishedAt': '2018-01-06T01:19:02Z',
 'source': {'id': 'bbc-news', 'name': 'BBC News'},
 'text': "Image copyright Reuters Image caption The Saudi government wants to reduce the country's dependence on oil\n\nState employees in Saudi Arabia are to be given money to compensate for a new sales tax and a rise in fuel prices.\n\nKing Salman has ordered monthly payments of more than $260 (£190) for the next year.\n\nThe kingdom has roughly doubled domestic petrol prices and introduced a 5% tax on most goods and services, including food and utility bills.\n\nThe Saudi government wants to reduce its dependence on oil following recent turbulence in the crude oil market.\n\nThe United Arab Emirates (UAE) has also introduced a 5% sales tax.\n\nThe Saudi royal decree says citizens using p

In [6]:
# Connect to the database collection that stores the articles
store = storage.Storage(config.DB_URI)
store.connect_db(nh.DB_NAME)
store.set_collection(nh.ARTICLE_COLLECTION)

In [9]:
# Load stored articles from document store
stored_articles = store.find_all({})

# Load the English spaCy model and process each article's text
nlp = spacy.load('en')
docs = [nlp(a.get('text')) for a in stored_articles[:10]]

Each text is processed by a model, which is a spaCy pipeline consisting of a tokenizer, tagger, dependency parser, and entity recognizer. The tokenizer takes a text string as input and outputs a `Doc` object; every other step in the pipeline applies its transformations to that `Doc` object.
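To make the data flow concrete, here is a minimal pure-Python sketch of the same idea. It is not spaCy's implementation: `ToyDoc`, `toy_tokenizer`, `toy_tagger`, `toy_entity_recognizer`, and `run_pipeline` are hypothetical names, and the tagging rules are deliberately naive. The only point is that the tokenizer is the single step that sees the raw string, while every later component receives and returns the same document object.

```python
# Toy stand-in for a spaCy-style pipeline (hypothetical names, not spaCy classes).
class ToyDoc:
    """Container created by the tokenizer and annotated by later steps."""

    def __init__(self, words):
        self.words = words
        self.tags = [None] * len(words)
        self.ents = []


def toy_tokenizer(text):
    # The only step that receives a raw string.
    return ToyDoc(text.split())


def toy_tagger(doc):
    # Naive assumption: capitalized words are proper nouns, everything else is "X".
    doc.tags = ["PROPN" if w[0].isupper() else "X" for w in doc.words]
    return doc


def toy_entity_recognizer(doc):
    # Naive assumption: every proper noun counts as an entity.
    doc.ents = [w for w, t in zip(doc.words, doc.tags) if t == "PROPN"]
    return doc


def run_pipeline(text, components):
    # Tokenize once, then pass the same document through each component in order.
    doc = toy_tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline(
    "King Salman ordered monthly payments",
    [toy_tagger, toy_entity_recognizer],
)
print(doc.words)
print(doc.tags)
print(doc.ents)
```

The order of the components matters here in the same way it does in a real pipeline: the toy entity recognizer reads the tags that the tagger wrote.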

The `Doc` object behaves like a sequence of `Token` objects (slicing it yields a `Span`). Each token carries many annotation attributes, including pointers to other tokens in the document, which encode the dependency parse.
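The head pointers are what make tree traversal possible. The sketch below models them in plain Python, without spaCy: `ToyToken`, `build_tree`, and `subtree` are hypothetical helpers, and the head indices are an assumption inferred from the printed parse of the first five tokens in this notebook ("Image copyright Reuters Image caption"). As in spaCy, the root token is its own head.

```python
class ToyToken:
    """Minimal token with a pointer to its syntactic head (not a spaCy Token)."""

    def __init__(self, i, text, dep):
        self.i = i
        self.text = text
        self.dep = dep
        self.head = self  # replaced in build_tree; the root keeps pointing at itself
        self.children = []


def build_tree(words, deps, heads):
    # heads[i] is the index of token i's head; the root refers to its own index.
    tokens = [ToyToken(i, w, d) for i, (w, d) in enumerate(zip(words, deps))]
    for tok, h in zip(tokens, heads):
        tok.head = tokens[h]
        if h != tok.i:
            tokens[h].children.append(tok)
    return tokens


def subtree(token):
    # The token plus all of its descendants, in document order.
    nodes = [token]
    for child in token.children:
        nodes.extend(subtree(child))
    return sorted(nodes, key=lambda t: t.i)


words = ["Image", "copyright", "Reuters", "Image", "caption"]
deps = ["compound", "compound", "compound", "compound", "ROOT"]
heads = [3, 3, 3, 4, 4]  # assumed: inferred from the head and subtree printout

tokens = build_tree(words, deps, heads)
for tok in tokens:
    print(tok.text, tok.dep, tok.head.text, "|", " ".join(t.text for t in subtree(tok)))
```

Filtering for a particular relation, such as nominal subjects of verbs, then only needs a single pass over the tokens that checks each token's `dep` and the attributes of its `head`.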

In [10]:
# Iterate through the first few tokens and display some of their annotations as well as their subtrees
for token in docs[0][:5]:
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text, token.ent_type_)
    print("Subtree for '{}':".format(token.text))
    print(' '.join([t.text for t in token.subtree]))

# We can traverse the dependency parse by iterating through the tokens once and checking
# .dep and .head for the dependency relationship we want
from spacy.symbols import nsubj, VERB

subjects = [t for t in docs[0] if t.dep == nsubj and t.head.pos == VERB]
print("Subjects: ", subjects)

Image NOUN NN compound Image 
Subtree for 'Image':
Image
copyright NOUN NN compound Image 
Subtree for 'copyright':
copyright
Reuters PROPN NNP compound Image ORG
Subtree for 'Reuters':
Reuters
Image PROPN NNP compound caption 
Subtree for 'Image':
Image copyright Reuters Image
caption NOUN NN ROOT caption 
Subtree for 'caption':
Image copyright Reuters Image caption
Subjects:  [government, Salman, kingdom, government, Emirates, decree, citizens, state, It, personnel, Organisations, countries, %, it, Arabia, it, salaries, Arabia, this, tax]


In [19]:
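for n in docs[0].noun_chunks:
    print(n.text)
    # `root` is an attribute of the Span (use n.root); accessing it on a Token,
    # as below, raises the AttributeError shown in the output
    print(n[0].root)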

Image copyright Reuters Image caption


AttributeError: 'spacy.tokens.token.Token' object has no attribute 'root'

## Matching

With `spaCy`, we can also perform rule-based matching that operates on tokens. Matchers accept custom patterns described in terms of their constituent tokens and those tokens' attributes. A pattern is a list of dictionaries, each dictionary being a template for one token in the pattern; a dictionary's key-value pairs are the criteria that the corresponding token's attributes must satisfy.
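The matching rule itself is simple enough to sketch in plain Python. The following is a simplified illustration, not spaCy's `Matcher`: `find_matches` is a hypothetical helper, the tokens are hand-annotated dictionaries rather than real model output, and features such as quantifier operators are ignored. It slides a window over the tokens and reports every position where each token satisfies the dictionary at the same offset in the pattern.

```python
def find_matches(tokens, pattern):
    """Return (start, end) index pairs where each token satisfies its template."""
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # A window matches when every key-value pair of every template
        # equals the corresponding attribute of the token at that offset.
        if all(
            all(tok.get(attr) == value for attr, value in template.items())
            for tok, template in zip(window, pattern)
        ):
            matches.append((start, start + width))
    return matches


# Hand-annotated example tokens (assumed tags, not produced by a model)
tokens = [
    {"TEXT": "The", "POS": "DET"},
    {"TEXT": "government", "POS": "NOUN"},
    {"TEXT": "cut", "POS": "VERB"},
    {"TEXT": "salaries", "POS": "NOUN"},
]

pattern = [{"POS": "NOUN"}, {"POS": "VERB"}, {"POS": "NOUN"}]

for start, end in find_matches(tokens, pattern):
    print(start, end, " ".join(t["TEXT"] for t in tokens[start:end]))
```

As with the real matcher, `start` and `end` are token indices, so the matched text is recovered by slicing the token sequence.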

In [36]:
from spacy.matcher import Matcher

# Matcher instance must be initialized with a vocab
matcher = Matcher(nlp.vocab)

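# A single pattern: one dictionary per token to match
patterns = [{'POS': 'NOUN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}]
predicates = []
def match_callback(matcher, doc, i, matches):
    match_id, start, end = matches[i]  # start and end are token indices into doc
    span = doc[start:end]
    span.merge()
    predicates.append(span)

# spaCy 2.x call signature; spaCy 3 expects matcher.add(name, [pattern], on_match=callback)
matcher.add("Simple sentence", match_callback, patterns)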
matches = matcher(docs[1])

print(predicates)

[]


In [38]:
a = articles[0]
a.get('url')

'http://www.bbc.co.uk/news/world-middle-east-42587037'

In [39]:
from newspaper import Article

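article = Article(a.get('url'))
# build() downloads, parses, and runs newspaper's NLP step; the NLP step relies on
# NLTK's 'punkt' tokenizer data, which has to be downloaded beforehand
article.build()
article.summary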

LookupError: 
**********************************************************************
  Resource 'tokenizers/punkt/PY3/english.pickle' not found.
  Please use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/roland/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

In [2]:
from pyteaser import SummarizeUrl

summary = SummarizeUrl("https://www.nytimes.com/2018/01/22/us/politics/schumer-democrats-shutdown.html")
for s in summary:
    print(s)

Democrats need to protect 10 Senate Democrats up for re-election in states carried by the president and appeal to the swing voters who could flip control of the House.
Yet many of the Senate Democrats considered potential 2020 presidential candidates voted to keep the shutdown going.
Those vulnerable Senate Democrats were key to sparking the bipartisan drive to bring the shutdown to an end.
The agreement to end the shutdown came after weekend negotiations among a bipartisan group of about two dozen lawmakers assembled by Mr.
Republican participation in the bipartisan group was crucial for Democrats seeking assurances from Mr.
