# Text analysis with spaCy

## What is spaCy?
https://spacy.io

spaCy is a library for advanced Natural Language Processing. It provides convolutional neural network models for English, German, Spanish, Portuguese, French, Italian, and Dutch, a multi-language NER model, and tokenization for many other languages.



spaCy is designed for large-scale information extraction and is written in Cython for speed. It also supports deep learning workflows that connect statistical models trained with popular machine learning libraries such as TensorFlow, Keras, scikit-learn, or PyTorch.

## Install required packages and models

In [1]:
import spacy
from spacy import displacy # visualization tools for spaCy

## Model: 'en_core_web_lg'

English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, syntactic dependency parse and named entities.

685k keys, 685k unique vectors (300 dimensions)

In [2]:
#!python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

## Tokenization

spaCy tokenizes text automatically and exposes several context-relevant attributes for each token.

Let's look at the following sentence:


**In downtown Evanston, Rhonda Smith bought 1 iPhone at 8 a.m. on October 5th because they were 30% off at BestBuy.**

In [3]:
# process document with spaCy nlp model
doc = nlp(u'In downtown Evanston, Rhonda Smith bought 1 iPhone at 8 a.m. on October 5th because they were 30% off at BestBuy.')

# get tokenized representation of sentence
tokenized = [token for token in doc]
print(tokenized)

[In, downtown, Evanston, ,, Rhonda, Smith, bought, 1, iPhone, at, 8, a.m., on, October, 5th, because, they, were, 30, %, off, at, BestBuy, .]
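Tokenization itself does not need a trained model. As a minimal sketch, assuming spaCy v3 is installed, the snippet below uses a blank English pipeline (tokenizer only) and prints a few token attributes that are available without any statistical components. The names `nlp_blank` and `doc_blank` are introduced here for illustration only.

```python
import spacy

# A blank English pipeline: tokenizer rules only, no trained components
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Rhonda Smith bought 1 iPhone because they were 30% off.")

for token in doc_blank:
    # token.i is the token index, token.idx the character offset in the text
    print(f"{token.i:>2} {token.idx:>3} {token.text:<8} "
          f"alpha={token.is_alpha} num={token.like_num} punct={token.is_punct}")
```

Attributes such as `pos_`, `dep_` and `ent_type_` stay empty in a blank pipeline, because they are filled in by the trained components of a model like `en_core_web_lg`.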


## Named entity recognition

In [4]:
displacy.render(doc, style='ent', jupyter=True)

## Named entities can be used for disambiguation

In [5]:
doc = nlp(u"Tim Cook, CEO of Apple, has many apple trees on his property.")
displacy.render(doc, style='ent', jupyter=True)
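Entities can also be read programmatically through `doc.ents` and `token.ent_type_`. The sketch below assumes spaCy v3 and, so that it runs without downloading a model, swaps the trained NER component for a hand-written `entity_ruler` with two illustrative patterns. This is rule-based, case-sensitive matching, not the contextual prediction made by `en_core_web_lg`; it only shows how the resulting annotations are accessed.

```python
import spacy

# Assumption: a blank pipeline with an EntityRuler stands in for the trained NER model
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Tim Cook"},
    {"label": "ORG", "pattern": "Apple"},  # string patterns match the exact text
])

doc_rules = nlp_rules("Tim Cook, CEO of Apple, has many apple trees on his property.")

# Entity spans with their labels and character offsets
for ent in doc_rules.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

# Token-level view: compare the two occurrences of "apple"
apple_tokens = [(t.text, t.ent_type_) for t in doc_rules if t.lower_ == "apple"]
print(apple_tokens)
```

Only the capitalised "Apple" carries an entity type here; the lowercase "apple" has an empty `ent_type_`.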

## Token properties

In [6]:
# print properties of each token in sentence

import pandas as pd
from IPython.display import display, HTML

rows = []
for token in doc:
    rows.append({'TEXT':token.text,
                 'LEMMA':token.lemma_,
                 'POS':token.pos_,
                 'TAG':token.tag_,
                 'DEP':token.dep_,
                 'SHAPE':token.shape_,
                 'ALPHA':token.is_alpha,
                 'ENT':token.ent_type_})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df = pd.DataFrame(rows, columns='TEXT LEMMA POS TAG DEP SHAPE ALPHA ENT'.split())

display(HTML(df.to_html(index=False)))


TEXT,LEMMA,POS,TAG,DEP,SHAPE,ALPHA,ENT
Tim,tim,PROPN,NNP,compound,Xxx,True,PERSON
Cook,cook,PROPN,NNP,nsubj,Xxxx,True,PERSON
",",",",PUNCT,",",punct,",",False,
CEO,ceo,PROPN,NNP,appos,XXX,True,
of,of,ADP,IN,prep,xx,True,
Apple,apple,PROPN,NNP,pobj,Xxxxx,True,ORG
",",",",PUNCT,",",punct,",",False,
has,have,VERB,VBZ,ROOT,xxx,True,
many,many,ADJ,JJ,amod,xxxx,True,
apple,apple,NOUN,NN,compound,xxxx,True,


## Syntactic dependency relationships

Syntactic dependencies are the grammatical relationships between words. spaCy can be used to extract this dependency information from sentences in a text. 

In [7]:
# visualization of syntactic dependency 
displacy.render(doc, style='dep', jupyter=True)
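The dependency tree can be navigated in code through `token.head`, `token.children`, `token.lefts` and `token.rights`. As a minimal sketch, assuming spaCy v3 (whose `Doc` constructor accepts `heads` and `deps`), the snippet below builds a tiny document with a hand-annotated parse instead of model output, then walks it. The annotation is an assumption made for illustration.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-annotated parse (illustrative, not predicted by a model)
words = ["Tim", "Cook", "has", "apple", "trees"]
heads = [1, 2, 2, 4, 2]  # absolute index of each token's head; the root points to itself
deps = ["compound", "nsubj", "ROOT", "compound", "dobj"]
toy_doc = Doc(vocab, words=words, heads=heads, deps=deps)

for token in toy_doc:
    children = [child.text for child in token.children]
    print(f"{token.text:<6} --{token.dep_}--> {token.head.text:<6} children={children}")

# The root is the token that is its own head
root = [t for t in toy_doc if t.head.i == t.i][0]
subjects = [t.text for t in root.lefts if t.dep_ == "nsubj"]
objects = [t.text for t in root.rights if t.dep_ == "dobj"]
print(root.text, subjects, objects)
```

The same attributes are what the currency-relation patterns later in this notebook rely on when they look for `nsubj`, `dobj`, `prep` and `pobj` tokens around a `MONEY` entity.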

## Syntactic dependency relationships in practice
### Currency values in press releases and the nouns they refer to

In [8]:
# define functions

class stop_loop(Exception): pass

def qualifier_value(money_txt):
    money_doc = nlp(str(money_txt))
    pos_list = [token.pos_ for token in money_doc]
    money_list = [token.text for token in money_doc]
    money_start = min(loc for loc, pos in enumerate(pos_list) if (pos == 'SYM' or pos == 'NUM'))
    qualifier = ' '.join(money_list[:money_start])
    value = ' '.join(money_list[money_start:])
    return qualifier, value
    

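def extract_currency_relations(doc):
    # merge entities and noun chunks into one token
    # (Span.merge was removed in spaCy v3; Doc.retokenize replaces it, and
    # filter_spans drops overlapping spans, which cannot be merged together)
    from spacy.util import filter_spans
    spans = filter_spans(list(doc.ents) + list(doc.noun_chunks))
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)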

    relations = []
    for money in filter(lambda w: w.ent_type_ == 'MONEY', doc):
        try:
            # syntactic relationship 1
            advcl = [w for w in money.head.children if w.dep_ == 'advcl']
            if advcl:
                for child in advcl[0].children:
                    if child.dep_ == 'dobj':
                        parse_type = 1
                        qual, val = qualifier_value(money.text)
                        relations.append((qual, val, child, parse_type))
                        raise stop_loop()
                        
            # syntactic relationship 2
            cprep = [w for w in money.children if w.dep_ == 'prep']
            if cprep:
                for child in cprep[0].children:
                    if child.dep_ == 'pobj':
                        parse_type = 2
                        qual, val = qualifier_value(money.text)
                        relations.append((qual, val, child, parse_type))
                        raise stop_loop()
            
            # syntactic relationship 3
            hprep = [w for w in money.head.children if w.dep_ == 'prep']
            if hprep:
                for child in hprep[0].children:
                    if child.dep_ == 'pobj':
                        parse_type = 3
                        qual, val = qualifier_value(money.text)
                        relations.append((qual, val, child, parse_type))
                        raise stop_loop()
                        
            # syntactic relationship 4
            if money.dep_ in ('attr', 'dobj'):
                subject = [w for w in money.head.lefts if w.dep_ == 'nsubj']
                if subject:
                    parse_type = 4
                    subject = subject[0]
                    qual, val = qualifier_value(money.text)
                    relations.append((qual, val, subject, parse_type))
                    raise stop_loop()
                    
            # syntactic relationship 5
            elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
                parse_type = 5
                qual, val = qualifier_value(money.text)
                relations.append((qual, val, money.head.head, parse_type))
                raise stop_loop()
                
        except stop_loop:
            pass
                 
    return relations
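`qualifier_value` relies on POS tags (`SYM` or `NUM`) to find where the monetary value starts, and `min()` raises a `ValueError` when no such token exists. The same split can be approximated without a model. The sketch below is a simplified stand-in, not the function above: it assumes the value begins at the first currency symbol or digit, and the names `VALUE_START` and `split_qualifier` are hypothetical.

```python
import re

# Assumption: the value starts at the first currency symbol or digit
VALUE_START = re.compile(r"[$€£¥]|\d")

def split_qualifier(money_text):
    """Split text such as 'more than $10.5 billion' into (qualifier, value)."""
    match = VALUE_START.search(money_text)
    if match is None:
        # mirrors the ValueError raised by min() on an empty sequence
        raise ValueError(f"no currency symbol or digit in {money_text!r}")
    qualifier = money_text[:match.start()].strip()
    value = money_text[match.start():].strip()
    return qualifier, value

print(split_qualifier("more than $10.5 billion"))
print(split_qualifier("an additional $9 billion"))
print(split_qualifier("$1 million"))
```

A character-based rule like this is cruder than the POS-based version: a prefix such as "US" in "US$" ends up in the qualifier rather than the value.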



In [9]:
from bs4 import BeautifulSoup
import zipfile

'''
iterate over zip file of press releases and
extract currency values, the assets they refer to,
and qualifying adjectives
'''

rows = []
data_dir = "../data/press_releases/"

with zipfile.ZipFile(data_dir+"AMZN.zip", "r") as f:
    for filename in f.namelist():
        if not filename.startswith("AMZN") or filename == "AMZN/" or "MACOS" in filename:
            continue
        print(filename)
        html = f.read(filename)
        # name the parser explicitly so the result does not depend on which parsers are installed
        soup = BeautifulSoup(html, "html.parser")
        body = soup.find_all("div", {"class": "caas-body"})
        heading = soup.find_all("h1", {"data-test-locator": "headline"})
        date_time = soup.find_all("time")
        try:
            timestamp = date_time[0]['datetime']
        except IndexError:
            timestamp = ''
        try:
            title = heading[0]
        except IndexError:
            continue
        try:
            text = body[0].text
        except IndexError:
            continue
        doc = nlp(str(text))
        try:
            relations = extract_currency_relations(doc)
        except ValueError:
            continue
        for r0, r1, r2, r3 in relations:
            rows.append({'FILENAME':filename, 'TIMESTAMP':timestamp, 'TITLE':title, 'QUALIFIER':r0, 'VALUE':r1, 'ASSET':r2.text})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df = pd.DataFrame(rows, columns='FILENAME TIMESTAMP TITLE QUALIFIER VALUE ASSET'.split())



AMZN/AMZN103.html
AMZN/AMZN19.html
AMZN/AMZN154.html
AMZN/AMZN6.html
AMZN/AMZN142.html
AMZN/AMZN58.html
AMZN/AMZN115.html
AMZN/AMZN74.html
AMZN/AMZN23.html
AMZN/AMZN139.html
AMZN/AMZN35.html
AMZN/AMZN62.html
AMZN/AMZN15.html
AMZN/AMZN158.html
AMZN/AMZN42.html
AMZN/AMZN54.html
AMZN/AMZN119.html
AMZN/AMZN78.html
AMZN/AMZN97.html
AMZN/AMZN162.html
AMZN/AMZN135.html
AMZN/AMZN39.html
AMZN/AMZN123.html
AMZN/AMZN81.html
AMZN/AMZN80.html
AMZN/AMZN38.html
AMZN/AMZN122.html
AMZN/AMZN134.html
AMZN/AMZN79.html
AMZN/AMZN96.html
AMZN/AMZN163.html
AMZN/AMZN118.html
AMZN/AMZN55.html
AMZN/AMZN159.html
AMZN/AMZN43.html
AMZN/AMZN14.html
AMZN/AMZN63.html
AMZN/AMZN34.html
AMZN/AMZN22.html
AMZN/AMZN138.html
AMZN/AMZN75.html
AMZN/AMZN114.html
AMZN/AMZN143.html
AMZN/AMZN59.html
AMZN/AMZN7.html
AMZN/AMZN155.html
AMZN/AMZN102.html
AMZN/AMZN18.html
AMZN/AMZN109.html
AMZN/AMZN13.html
AMZN/AMZN44.html
AMZN/AMZN148.html
AMZN/AMZN52.html
AMZN/AMZN164.html
AMZN/AMZN91.html
AMZN/AMZN29.html
AMZN/AMZN133.html
AMZN/AMZN

In [10]:
# print dataframe
df = df.sort_values(by=['TIMESTAMP'])
display(HTML(df.to_html(index=False)))

FILENAME,TIMESTAMP,TITLE,QUALIFIER,VALUE,ASSET
AMZN/AMZN169.html,2020-06-30T14:00:00.000Z,[Amazon Announces Plans to Build Second Fulfillment Centre in Ottawa],a,$ 3 million donation,Amazon Canada
AMZN/AMZN169.html,2020-06-30T14:00:00.000Z,[Amazon Announces Plans to Build Second Fulfillment Centre in Ottawa],,$ 7.5 billion,Canada
AMZN/AMZN169.html,2020-06-30T14:00:00.000Z,[Amazon Announces Plans to Build Second Fulfillment Centre in Ottawa],,16,starting
AMZN/AMZN168.html,2020-07-07T14:30:00.000Z,[Amazon Announces First Fulfillment Center and Second Delivery Station in Little Rock],,15,Amazon’s industry-leading minimum starting wage
AMZN/AMZN167.html,2020-07-09T10:01:00.000Z,[Amazon Reveals the Top 10 States with the Most Digital Entrepreneurs Per Capita and the Top 10 States with the Fastest Growing Digital Entrepreneurs: Iowa Tops Both Lists],,$ 1 million,sales
AMZN/AMZN167.html,2020-07-09T10:01:00.000Z,[Amazon Reveals the Top 10 States with the Most Digital Entrepreneurs Per Capita and the Top 10 States with the Fastest Growing Digital Entrepreneurs: Iowa Tops Both Lists],,500000,sales
AMZN/AMZN167.html,2020-07-09T10:01:00.000Z,[Amazon Reveals the Top 10 States with the Most Digital Entrepreneurs Per Capita and the Top 10 States with the Fastest Growing Digital Entrepreneurs: Iowa Tops Both Lists],,$ 15 billion,small and medium-sized businesses
AMZN/AMZN163.html,2020-07-15T07:00:00.000Z,[AWS and HSBC Reach Long-Term Strategic Cloud Agreement],,"US$ 2,918bn",assets
AMZN/AMZN161.html,2020-07-15T13:35:00.000Z,[Amazon Announces New Pflugerville Fulfillment Center],an additional,$ 9 billion,the Texas economy
AMZN/AMZN161.html,2020-07-15T13:35:00.000Z,[Amazon Announces New Pflugerville Fulfillment Center],more than,$ 10.5 billion,its local fulfillment center infrastructure


In [11]:
# Convert monetary values to numbers using regex substitution

import re

values = []
for text_value in df['VALUE']:
    # start from the raw text so that values without a scale word are not
    # evaluated with the previous row's expression
    money_expr = text_value.strip()
    if 'million' in money_expr:
        money_expr = re.sub('million', '*1000000', money_expr)
    elif 'billion' in money_expr:
        money_expr = re.sub('billion', '*1000000000', money_expr)
    elif 'trillion' in money_expr:
        money_expr = re.sub('trillion', '*1000000000000', money_expr)
    money_expr = re.sub(r'\$', '', money_expr)
    try:
        # caution: eval executes arbitrary expressions; use it only on trusted text
        money_value = eval(money_expr)
    except (SyntaxError, NameError):
        # keep the original text when it is not a plain arithmetic expression
        money_value = str(text_value)
    values.append(money_value)
    
df_intvalue = df.assign(VALUE = values)

#pd.set_option('display.float_format', lambda x: '%.nf' % x)
display(HTML(df_intvalue.to_html(index=False)))

FILENAME,TIMESTAMP,TITLE,QUALIFIER,VALUE,ASSET
AMZN/AMZN169.html,2020-06-30T14:00:00.000Z,[Amazon Announces Plans to Build Second Fulfillment Centre in Ottawa],a,$ 3 million donation,Amazon Canada
AMZN/AMZN169.html,2020-06-30T14:00:00.000Z,[Amazon Announces Plans to Build Second Fulfillment Centre in Ottawa],,7.5e+09,Canada
AMZN/AMZN169.html,2020-06-30T14:00:00.000Z,[Amazon Announces Plans to Build Second Fulfillment Centre in Ottawa],,16,starting
AMZN/AMZN168.html,2020-07-07T14:30:00.000Z,[Amazon Announces First Fulfillment Center and Second Delivery Station in Little Rock],,15,Amazon’s industry-leading minimum starting wage
AMZN/AMZN167.html,2020-07-09T10:01:00.000Z,[Amazon Reveals the Top 10 States with the Most Digital Entrepreneurs Per Capita and the Top 10 States with the Fastest Growing Digital Entrepreneurs: Iowa Tops Both Lists],,1000000,sales
AMZN/AMZN167.html,2020-07-09T10:01:00.000Z,[Amazon Reveals the Top 10 States with the Most Digital Entrepreneurs Per Capita and the Top 10 States with the Fastest Growing Digital Entrepreneurs: Iowa Tops Both Lists],,500000,sales
AMZN/AMZN167.html,2020-07-09T10:01:00.000Z,[Amazon Reveals the Top 10 States with the Most Digital Entrepreneurs Per Capita and the Top 10 States with the Fastest Growing Digital Entrepreneurs: Iowa Tops Both Lists],,15000000000,small and medium-sized businesses
AMZN/AMZN163.html,2020-07-15T07:00:00.000Z,[AWS and HSBC Reach Long-Term Strategic Cloud Agreement],,"US$ 2,918bn",assets
AMZN/AMZN161.html,2020-07-15T13:35:00.000Z,[Amazon Announces New Pflugerville Fulfillment Center],an additional,9000000000,the Texas economy
AMZN/AMZN161.html,2020-07-15T13:35:00.000Z,[Amazon Announces New Pflugerville Fulfillment Center],more than,1.05e+10,its local fulfillment center infrastructure
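The conversion cell above uses `eval` on text scraped from press releases. As a hedged alternative, the sketch below parses the same kind of strings with a regular expression and `Decimal`, without evaluating anything. The pattern, the `MULTIPLIERS` table and the function name `parse_money` are assumptions made for this illustration; abbreviations such as "bn" and values followed by extra words are deliberately left unparsed.

```python
import re
from decimal import Decimal

# Assumed scale words; anything else is treated as unparseable
MULTIPLIERS = {"thousand": 10**3, "million": 10**6,
               "billion": 10**9, "trillion": 10**12}

MONEY_RE = re.compile(
    r"^\$?\s*(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion|trillion)?$",
    re.IGNORECASE,
)

def parse_money(text):
    """Return the amount as an int, or None if the text does not fit the pattern."""
    match = MONEY_RE.match(text.strip())
    if match is None:
        return None
    number = Decimal(match.group(1).replace(",", ""))
    scale = MULTIPLIERS[match.group(2).lower()] if match.group(2) else 1
    # int() truncates any fractional remainder, e.g. cents
    return int(number * scale)

for sample in ["$ 7.5 billion", "$ 10.5 billion", "16", "$ 3 million donation", "US$ 2,918bn"]:
    print(repr(sample), "->", parse_money(sample))
```

Returning `None` for unmatched text keeps unparsed values easy to filter out, whereas the `eval`-based cell keeps the original string in the column.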


## And much much more...
https://spacy.io/usage/linguistic-features