# Text analysis with spaCy and textacy

## What is spaCy?
https://spacy.io

spaCy is a library for advanced Natural Language Processing, written in Python and Cython. It provides convolutional neural network models for English, German, Spanish, Portuguese, French, Italian, and Dutch, a multi-language NER model, and tokenization for many other languages.

spaCy is designed for large-scale text extraction and uses Cython for processing speed. It also supports deep learning workflows that connect statistical models trained with popular machine learning libraries such as TensorFlow, Keras, scikit-learn, or PyTorch.

In [1]:
import spacy
from spacy import displacy # visualization tools for spaCy

## Model: 'en_core_web_lg'

English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, syntactic dependency parse and named entities.

685k keys, 685k unique vectors (300 dimensions)
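
Word vectors like these are usually compared with cosine similarity. The sketch below illustrates that computation on made-up 3-dimensional vectors (the model's vectors have 300 dimensions). The `cosine_similarity` helper and the `toy_vectors` values are illustrative assumptions, not functions or data from spaCy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, chosen by hand for illustration only
toy_vectors = {
    "apple":  [0.9, 0.1, 0.3],
    "pear":   [0.8, 0.2, 0.4],
    "laptop": [0.1, 0.9, 0.2],
}

print(cosine_similarity(toy_vectors["apple"], toy_vectors["pear"]))
print(cosine_similarity(toy_vectors["apple"], toy_vectors["laptop"]))
```

With these toy values, "apple" scores closer to "pear" than to "laptop" because their vectors point in nearly the same direction.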

In [None]:
nlp = spacy.load('en_core_web_lg')

## Example: Extracting currency amounts, the nouns they refer to, and location

In [None]:
# syntactic dependency relationships in practice
# currency values and the nouns they refer to

import pandas as pd
from IPython.display import display, HTML

TEXTS = [
    """Google just made another giant move in its Silicon Valley land grab.

The internet company spent $1 billion on a large office park near its headquarters in Mountain View, California, according to the Mercury News, and has now spent at least $2.8 billion on properties in Mountain View, Sunnyvale and San Jose over the last two years.
In this case, Google is purchasing property that it's already been leasing. The company is the main tenant of the 12 buildings that comprise the 51.8-acre Shoreline Technology Park.

Google declined to comment on its purchase.

Earlier this month, Google agreed to pay an additional $110 million for 10.5 acres for a new campus in downtown San Jose, with the possibility of buying about 11 more acres. The city will vote on the plans in early December.
It's also been a big year for Google property purchases outside of Silicon Valley.

In the first quarter, the company spent $2.4 billion to buy New York City's Chelsea Market. Chief Financial Officer Ruth Porat said that the company favors "owning rather than leasing real estate when we see good opportunities."

As for leases, Google just signed on for a massive new space in downtown San Francisco."""
]

class stop_loop(Exception): pass

def qualifier_value(money_txt):
    money_doc = nlp(str(money_txt))
    pos_list = [token.pos_ for token in money_doc]
    money_list = [token.text for token in money_doc]
    # index of the first symbol or number; fall back to 0 if none is found
    money_start = next((loc for loc, pos in enumerate(pos_list) if pos in ('SYM', 'NUM')), 0)
    qualifier = ' '.join(money_list[:money_start])
    value = ' '.join(money_list[money_start:])
    
    return qualifier, value
    
    
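def get_root_verb(token):
    # walk up the dependency tree until a verb is found;
    # the root token is its own head in spaCy, so stop there to avoid looping forever
    while token.pos_ != 'VERB' and token.head.i != token.i:
        token = token.head
    if token.pos_ != 'VERB':
        return None

    return [token.text, token.idx]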


def get_locations(money_ent):
    gpe_list = [gpe for gpe in filter(lambda w: w.ent_type_ == 'GPE', money_ent.sent)]
    location_list = []
    for gpe in gpe_list:
        money_verb_index = get_root_verb(money_ent)
        gpe_verb_index = get_root_verb(gpe)
        if money_verb_index == gpe_verb_index:
            location_list.append(gpe.text)
    if not location_list:
        gpe_doc = nlp(str(money_ent.sent))
        location_list = [gpe.text for gpe in filter(lambda w: w.ent_type_ == 'GPE', gpe_doc)]

    return location_list


def extract_currency_relations(doc):
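    # merge entities and noun chunks into one token
    # (Span.merge() was removed in spaCy v3, so use the retokenizer;
    # filter_spans drops overlapping spans, which cannot be merged together)
    spans = spacy.util.filter_spans(list(doc.ents) + list(doc.noun_chunks))
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)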

    relations = []
    for money in filter(lambda w: w.ent_type_ == 'MONEY', doc):
        try:
            # syntactic relationship 1
            advcl = [w for w in money.head.children if w.dep_ == 'advcl']
            if advcl:
                for child in advcl[0].children:
                    if child.dep_ == 'dobj':
                        parse_type = 1
                        qual, val = qualifier_value(money.text)
                        locations = get_locations(money)
                        relations.append((qual, val, child, locations, parse_type))
                        raise stop_loop()
                        
            # syntactic relationship 2
            cprep = [w for w in money.children if w.dep_ == 'prep']
            if cprep:
                for child in cprep[0].children:
                    if child.dep_ == 'pobj':
                        parse_type = 2
                        qual, val = qualifier_value(money.text)
                        locations = get_locations(money)
                        relations.append((qual, val, child, locations, parse_type))
                        raise stop_loop()
            
            # syntactic relationship 3
            hprep = [w for w in money.head.children if w.dep_ == 'prep']
            if hprep:
                for child in hprep[0].children:
                    if child.dep_ == 'pobj':
                        parse_type = 3
                        qual, val = qualifier_value(money.text)
                        locations = get_locations(money)
                        relations.append((qual, val, child, locations, parse_type))
                        raise stop_loop()
                        
            # syntactic relationship 4
            if money.dep_ in ('attr', 'dobj'):
                subject = [w for w in money.head.lefts if w.dep_ == 'nsubj']
                if subject:
                    parse_type = 4
                    subject = subject[0]
                    qual, val = qualifier_value(money.text)
                    locations = get_locations(money)
                    relations.append((qual, val, subject, locations, parse_type))
                    raise stop_loop()
                    
            # syntactic relationship 5
            elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
                parse_type = 5
                qual, val = qualifier_value(money.text)
                locations = get_locations(money)
                relations.append((qual, val, money.head.head, locations, parse_type))
                raise stop_loop()
                
        except stop_loop:
            pass
                 
    return relations

rows = []

for text in TEXTS:
    print(text)
    doc = nlp(str(text))
    relations = extract_currency_relations(doc)
    for qual, val, asset, locations, parse_type in relations:
        rows.append({'QUALIFIER': qual, 'VALUE': val, 'ASSET': asset.text, 'LOCATION': locations})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df = pd.DataFrame(rows, columns='QUALIFIER VALUE ASSET LOCATION'.split())


display(HTML(df.to_html(index=False)))


In [None]:
# Convert monetary values to integers using regex substitution

import re

MULTIPLIERS = {'million': 10**6, 'billion': 10**9}

int_values = []
for text_value in df['VALUE']:
    # drop the currency symbol and thousands separators: '$ 2.8 billion' -> ['2.8', 'billion']
    parts = re.sub(r'[$,]', '', text_value).split()
    number = float(parts[0])
    unit = parts[1] if len(parts) > 1 else ''
    # scale explicitly instead of calling eval() on text taken from the document
    int_values.append(round(number * MULTIPLIERS.get(unit, 1)))
    
df = df.assign(VALUE=int_values)

display(HTML(df.to_html(index=False)))

## Tokenization

spaCy automatically tokenizes text and provides several context-relevant properties for each token.

Let's look at the following sentence:

**In downtown Evanston, Rhonda Smith bought 1 iPhone at 8 a.m. on October 5th because they were 30% off at BestBuy.**
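
To see why a dedicated tokenizer helps, the sketch below applies two naive approaches from the Python standard library to the same sentence. Neither is spaCy's algorithm; they only illustrate the failure modes that spaCy's language-specific rules and exceptions are meant to handle.

```python
import re

sentence = ("In downtown Evanston, Rhonda Smith bought 1 iPhone at 8 a.m. "
            "on October 5th because they were 30% off at BestBuy.")

# Whitespace splitting leaves punctuation glued to words ("Evanston,", "BestBuy.")
whitespace_tokens = sentence.split()

# A naive regex separates punctuation but also breaks the abbreviation "a.m." apart
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```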

In [None]:
# process document with spaCy nlp model
doc = nlp('In downtown Evanston, Rhonda Smith bought 1 iPhone at 8 a.m. on October 5th because they were 30% off at BestBuy.')

# get tokenized representation of sentence
tokenized = list(doc)
print(tokenized)

## Named entity recognition

In [None]:
displacy.render(doc, style='ent', jupyter=True)

In [None]:
# named entities can be used for disambiguation

doc = nlp("Tim Cook, CEO of Apple, has many apple trees on his property.")
displacy.render(doc, style='ent', jupyter=True)

## Token properties

In [None]:
# print properties of each token in sentence

rows = []

for token in doc:
    rows.append({'TEXT': token.text,
                 'LEMMA': token.lemma_,
                 'POS': token.pos_,
                 'TAG': token.tag_,
                 'DEP': token.dep_,
                 'SHAPE': token.shape_,
                 'ALPHA': token.is_alpha,
                 'ENT': token.ent_type_})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df = pd.DataFrame(rows, columns='TEXT LEMMA POS TAG DEP SHAPE ALPHA ENT'.split())

display(HTML(df.to_html(index=False)))


## Syntactic dependency relationships

Syntactic dependencies are the grammatical relationships between words. spaCy can be used to extract this dependency information from sentences in a text. 
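
A dependency parse can be thought of as a tree in which every word points to its head. The following sketch uses a hand-annotated, simplified parse stored in plain Python objects to show how walking those head pointers finds the verb that governs a word. The `Word` class, the labels, and both helper functions are hypothetical and for illustration only; a real parser may assign different labels.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    pos: str    # coarse part-of-speech tag
    dep: str    # dependency label linking this word to its head
    head: int   # index of the head word; the root points to itself

# Hand-annotated, simplified parse of "The company spent $2.4 billion"
sentence = [
    Word("The", "DET", "det", 1),
    Word("company", "NOUN", "nsubj", 2),
    Word("spent", "VERB", "ROOT", 2),
    Word("$2.4 billion", "NUM", "dobj", 2),
]

def governing_verb(words, index):
    """Follow head pointers upward until a verb or the root is reached."""
    while words[index].pos != "VERB" and words[index].head != index:
        index = words[index].head
    return words[index].text if words[index].pos == "VERB" else None

def children(words, index):
    """Return the texts of the words whose head is the word at `index`."""
    return [w.text for i, w in enumerate(words) if w.head == index and i != index]

print(governing_verb(sentence, 0))
print(children(sentence, 2))
```

Stopping when a word is its own head matters: without that check, a sentence whose root is not a verb would loop forever.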

In [None]:
# visualization of syntactic dependency
doc = nlp("In the first quarter, the company spent $2.4 billion to buy New York City's Chelsea Market.")
displacy.render(doc, style='dep', jupyter=True)

## textacy 
https://chartbeat-labs.github.io/textacy/index.html

textacy builds upon spaCy's framework and provides convenient functions for many advanced NLP tools. textacy also performs basic text feature counts and computes several readability measures. 
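
One widely used readability measure is the Flesch-Kincaid grade level, defined as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below computes it with deliberately crude sentence splitting and a vowel-group syllable estimate, so its numbers will differ from textacy's, which relies on its own tokenization and syllable counting. The function names are illustrative assumptions.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level with naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

simple = "The cat sat on the mat. The dog ran to the car."
dense = "Organizational restructuring initiatives necessitate comprehensive stakeholder communication."

print(flesch_kincaid_grade(simple))
print(flesch_kincaid_grade(dense))
```

Short sentences made of one-syllable words score low (the scale can even go negative), while long sentences with polysyllabic words score high.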

In [None]:
# Load earnings call transcript
# Flex Ltd. Q3 2020

with open('FLEX_earnings_call.txt', 'r') as f:
    transcript = f.read()
    doc = nlp(transcript)

# print the first few lines spoken by the CEO
for line in transcript.splitlines()[23:33]:
    print(line)

In [None]:
# import textacy package

import textacy

# compute counts and readability stats
ts = textacy.TextStats(doc)

print('Unique words')
print(ts.n_unique_words)
print('-----------------')
print('Basic counts')
print(ts.basic_counts)
print('-----------------')
print('Readability stats')
print(ts.readability_stats)



In [None]:
# examination of Presidential inaugural addresses
# download speeches from nltk

import nltk
nltk.download('inaugural')

In [None]:
%matplotlib inline

# plot inaugural speech grade-level over time
import matplotlib.pyplot as pyplot
from nltk.corpus import inaugural

names = inaugural.fileids()
print(names)

years = []
grade_lvls = []
for name in names:
    filetext = inaugural.raw(fileids=name)
    year = int(name.split('-')[0])
    years.append(year)
    
    doc = nlp(filetext)
    ts = textacy.TextStats(doc)
    grade_lvl = ts.readability_stats['flesch_kincaid_grade_level']
    grade_lvls.append(grade_lvl)

pyplot.plot(years, grade_lvls, 'bo')
pyplot.xlabel('year')
pyplot.ylabel('grade level')

## Topic modeling with spaCy and textacy

Topic models can provide a means to analyze and categorize a corpus of texts. Topics often refer to clusters of words that frequently occur together. 
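
Before a topic model is fitted, each document is usually turned into a row of term weights. The sketch below builds such a document-term matrix in plain Python using term frequency multiplied by one common smoothed inverse document frequency, log((1 + N) / (1 + df)) + 1. The `tfidf_matrix` helper and the toy documents are hypothetical, and the exact weighting a library applies may differ.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows) with tf * smoothed-idf weights per document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append([tf[t] * idf[t] for t in vocab])
    return vocab, rows

# Hypothetical, already-lemmatized toy documents
docs = [
    ["nation", "freedom", "people"],
    ["nation", "economy", "work"],
    ["freedom", "people", "people"],
]

vocab, rows = tfidf_matrix(docs)
for term, weight in zip(vocab, rows[2]):
    print(term, round(weight, 3))
```

Terms that appear in fewer documents receive a higher weight, which is what lets a factorization method such as NMF separate documents into topics.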

In [None]:
# Use spaCy to generate a list of terms from a corpus of documents
import textacy
from spacy.lang.en.stop_words import STOP_WORDS

# import the inaugural addresses
from nltk.corpus import inaugural

names = inaugural.fileids()

# create list of terms from token lemmas in texts
terms_list = []

for name in names:
    filetext = inaugural.raw(fileids=name)
    doc = nlp(filetext)
    terms_list.append([token.lemma_ for token in doc if token.text.lower() not in STOP_WORDS and token.text.isalnum()])


In [None]:
%matplotlib inline

# build a TF-IDF weighted document-term matrix from the speech terms
from textacy.tm import TopicModel
from textacy.vsm import Vectorizer

vectorizer = Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
doc_term_matrix = vectorizer.fit_transform(terms_list)

# initialize and train a topic model
model = TopicModel('nmf', n_topics=5)
model.fit(doc_term_matrix)

print("======================model=================")
print(model)

# print the top terms for the first two topics
doc_topic_matrix = model.transform(doc_term_matrix)
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=[0, 1]):
    print('topic', topic_idx, ':', '   '.join(top_terms))

# print the overall weight of each topic across the corpus
for i, val in enumerate(model.topic_weights(doc_topic_matrix)):
    print(i, val)

model.termite_plot(doc_term_matrix, vectorizer.id_to_term, topics=-1, n_terms=25, sort_terms_by='seriation')
# file name reflects the 5 topics fitted above
model.save('nmf-5topics_inaugural.pkl')


## And much much more...
https://spacy.io/usage/linguistic-features
https://spacy.io/usage/examples