# NLP Intro with Spacy

We'll see a few basic NLP ops in this notebook. Specifically:<p>

- Parts-of-Speech tagging or POSTagging <p>

- Chunking Ops for Phrase Detection<p>

- Named Entity Recognition or NER<p>

I will deal with syntactic dependency parsing in a separate notebook tough we'll debut the concept here. 

At some point, Q will arise "So what all can spacy do?" See below.
    
https://spacy.io/usage/spacy-101#features

![spacy%20functionality%20tbl.png](attachment:spacy%20functionality%20tbl.png)

Let's begin, as always, with the setup chunk.

In [1]:
# setup chunk
import spacy

from spacy import displacy
from collections import Counter

import pandas as pd
import en_core_web_sm
nlp = en_core_web_sm.load()

## POSTagging for NLP

English text can roughly be divided into “sentences” which are composed of individual *words*, each of which has a function in expressing the meaning of the sentence. 

The function of a word in a sentence is called its “Part of Speech”—i.e., a word functions as a noun, a verb, an adjective, etc.

Of course, the “part of speech” of a word isn’t a property of the word itself. 

We know this because a single “word” can function as two different parts of speech (due to *polysemy*). 

For instance, consider 2 sentences:

> I love cheese.

versus 

> Love is a battlefield.    

No wonder then that it's hard for computers to accurately determine a word's POS in a sentence. (It’s difficult sometimes even for humans to do this.) 

But NLP procedures do their best, training on massive corpora.

Let's start with some dummy examples and then head towards real world examples. Behold.

In [2]:
# sample sentence
sent0 = "It was when I bought my first real six-string."
print(sent0, "\n")

# annotate the sentence with spaCy's nlp()
ann_sent0 = nlp(sent0)

# print the POSTag results
for token in ann_sent0:
    print(token.text, " ==> ", token.pos_)

It was when I bought my first real six-string. 

It  ==>  PRON
was  ==>  VERB
when  ==>  ADV
I  ==>  PRON
bought  ==>  VERB
my  ==>  DET
first  ==>  ADJ
real  ==>  ADJ
six  ==>  NUM
-  ==>  PUNCT
string  ==>  NOUN
.  ==>  PUNCT


Notice how we preserve *everything* including punctuations (dots, commas, hypohens) and throw out nothing. 

Very unlike the bag-of-words (BOW) model we did in MKTR wherein we discarded stopwords, upper-case, punctuation etc.

### Code a POSTag function

Below I code a func to display: <p>

- the token itself<p>
 
- it's POSTag, of course <p>

- it's **lemma** (i.e. root form of the word. E.g., 'went', 'gone', 'going' all become ==> 'go').<p>

- it's **syntactic dependency** (how each token relates to others in the sentcne - will do later)<p>

Behold.

In [3]:
# routine to display sentence as DF postags
def token_attrib(sent0):
	doc = nlp(sent0)

	text=[]
	lemma=[]
	postag=[]
	depcy=[]

	for token in doc:
		text.append(token.text)
		lemma.append(token.lemma_)
		postag.append(token.pos_)
		depcy.append(token.dep_)

	test_df = pd.DataFrame({'text':text, 'lemma':lemma, 'postag':postag, 'depcy':depcy})
	return(test_df)

# test-drive above func on test data
sent0 = "Donald Trump is a controversial American President"
test_df = token_attrib(sent0)
test_df

Unnamed: 0,text,lemma,postag,depcy
0,Donald,Donald,PROPN,compound
1,Trump,Trump,PROPN,nsubj
2,is,be,VERB,ROOT
3,a,a,DET,det
4,controversial,controversial,ADJ,amod
5,American,american,ADJ,amod
6,President,President,PROPN,attr


Focus on the POSTAg colm above. Time to refresh our Eng grammer (Wren & Martin, anyone?).

Pls open the *Penn Treebank* to see what the POStags mean.

Quick Quiz: Which is the *most important* POSTag in a sentence? W/o which no sentence can grammatically form?
  
Now let's look at a slightly more complex sentence than the simple one above.

In [5]:
# slightly more complex sentence
sent0 = "Universal Studios will not franchise copyrighted content regardless of what the press speculates."

# POStag it and display DF result
token_attrib(sent0)

Unnamed: 0,text,lemma,postag,depcy
0,Universal,Universal,PROPN,compound
1,Studios,Studios,PROPN,nsubj
2,will,will,VERB,aux
3,not,not,ADV,neg
4,franchise,franchise,VERB,ROOT
5,copyrighted,copyright,VERB,amod
6,content,content,NOUN,dobj
7,regardless,regardless,ADV,advmod
8,of,of,ADP,prep
9,what,what,PRON,dobj


Clearly from the above example, a sentence can have multiple nouns and verbs.

However, seems like it can have only one ROOT verb. Which is central to the sentence.

This below is an example of a 2-clause sentence. In which clause do you think the ROOT verb will lie? Behold.

In [6]:
# A 2-clause sentence
sent0="Because I went early, I saw the sunrise."

token_attrib(sent0)  # display anotation as a DF

Unnamed: 0,text,lemma,postag,depcy
0,Because,because,ADP,mark
1,I,-PRON-,PRON,nsubj
2,went,go,VERB,advcl
3,early,early,ADV,advmod
4,",",",",PUNCT,punct
5,I,-PRON-,PRON,nsubj
6,saw,see,VERB,ROOT
7,the,the,DET,det
8,sunrise,sunrise,NOUN,dobj
9,.,.,PUNCT,punct


Clauses are sort of mini-sentences. Those which can stand alone as complete sentences by themselves are 'independent cluases'. E.g., "I saw the sunrise."

A dependent clause makes some sort of sense but is not a complete sentence byu itself. E.g., from above: "Because I went early"

Note that the ROOT verb always shows up in the independent clause only.

In fact, if the word ROOT rings any bells (ROOT nodes in d-trees, remember?), we can:<p>

- **parse** the sentence's syntax into a tree like structure (the parse-tree)<p>

- starting with the root verb as the root node at the top<p> 

- and [progressively mapping (syntactic) dependencies with other tokens in the sentence.<p>

### A Dependency Parse Tree Illustration

To do that,let me first code a small helper routine that will map the parse-tree out for us. Behold.

In [7]:
# def func to display depcy tree. Note recursive struc!!
from nltk import Tree
def to_nltk_tree(node):
	if node.n_lefts + node.n_rights > 0:
		return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
	else:
		return node.orth_

# print tree for seq of sents if needed
sent0 = "Donald Trump is a controversial American President." 
sent = nlp(sent0)
[to_nltk_tree(sent.root).pretty_print() for sent in sent.sents]

            is                       
  __________|________                 
 |  Trump        President           
 |    |      ________|__________      
 .  Donald  a  controversial American



[None]

Note the classic tree structure above. Root node is typically always the ROOT verb, in this case 'is'. From the Root node, the dependencies start. 

Not hard to see from above that the dependency parse tree automatically shows us coherent 'chunks' of phrases, and even clauses if we put our mind to it.

Let me head quickly to the slides to demo what I mean. 

We'll dig into the Syntactic dependency parsing further down.

For now, let me turn my attention to more basic if mundane stuff starting with token-chunks called *phrases*.       

## Chunking Ops with spaCy

spaCy defines a *chunk* entity that detects consecutive groups tokens that coukld likely be NPs or VPs. spaCy is smart like that.

Behold.

In [8]:
## define a func to extract & display chunking ops
def chunkAttrib(sent0):

	doc = nlp(sent0)
	chunk1 = [(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text) for chunk in doc.noun_chunks]
    
	out_df1 = pd.DataFrame(chunk1, columns = ['chText', 'chRootText', 'chRootDep', 'chRootHead'])
	return(out_df1)

# test-drive above func
sent0 = "Donald Trump is a controversial American President."
chunk_df = chunkAttrib(sent0)  # 0.01 secs
print(chunk_df)

                               chText chRootText chRootDep chRootHead
0                        Donald Trump      Trump     nsubj         is
1  a controversial American President  President      attr         is


For example a rule could be that whenever any of a NNP, NNPS, ADJ, DET or xyz occur consecutively together in a sentence, detect such pattern and extract it as a 'noun phrase' chunk.

One could do similar for verb phrases too.

## Named Entity Recognition NER with spaCy

What kind of named entities (people, dates and times, money etc) might businesses be interested in? 

What are the types that standard NER in spaCy supports recognition for?

<img src="https://cdn-images-1.medium.com/max/1000/1*qQggIPMugLcy-ndJ8X_aAA.png" alt="Alt text that describes the graphic" title="Named Entities in Spacy" />

the *token.ents* command run on the annotated document in spaCy is for NER. See below.


In [9]:
# trying entity detection in one sample sentence first
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
print(doc)

# Note the .ents type
ent_text = [X.text for X in doc.ents]
ent_label = [X.label_ for X in doc.ents]

# store and sisplay as panda DF
ent_df = pd.DataFrame({'ent_text':ent_text, 'ent_label':ent_label})
ent_df

European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices


Unnamed: 0,ent_text,ent_label
0,European,NORP
1,Google,ORG
2,$5.1 billion,MONEY
3,Wednesday,DATE


Chances are many entities will be multi-token chunks (n-grams) like that '$5.1 billion' wala entity above. 

How then to ID where the entity chunk begins and where it ends?

### Spacy's NER with BILOU chunking

The old IOB scheme (Inside-Outside-Beginning) for multi-token chunk identification has now given way to an expanded version called BILOU (see below). 

<img src = "https://cdn-images-1.medium.com/max/800/1*_sYTlDj2p_p-pcSRK25h-Q.png" alt="Alt text that describes the graphic" title="BILOU for Named Entities in Spacy" />

We will invoke both IOB and BILOU in spacy to supplement NER info. See below.

In [10]:
# NER with IOB chunking
# using X.ent_iob and X.ent_type
ent_token = [X for X in doc]
ent_iob = [X.ent_iob_ for X in doc]  # for IOB scheme
ent_type = [X.ent_type_ for X in doc]

# store and display as DF
ent_chunk = pd.DataFrame({'ent_token':ent_token, 'ent_iob':ent_iob, 'ent_type':ent_type})
ent_chunk

Unnamed: 0,ent_token,ent_iob,ent_type
0,European,B,NORP
1,authorities,O,
2,fined,O,
3,Google,B,ORG
4,a,O,
5,record,O,
6,$,B,MONEY
7,5.1,I,MONEY
8,billion,I,MONEY
9,on,O,


Too long and messy. And all just for one sentence. The 'O' tags are uninformative anyway. 

In [11]:
# cleaning it up a little bit to drop the Os above
ent_token = [X for X in doc if X.ent_iob_ != 'O']
ent_iob = [X.ent_iob_ for X in doc if X.ent_iob_ != 'O']
ent_type = [X.ent_type_ for X in doc if X.ent_iob_ != 'O']

ent_chunk = pd.DataFrame({'ent_token':ent_token, 'ent_iob':ent_iob, 'ent_type':ent_type})
ent_chunk

Unnamed: 0,ent_token,ent_iob,ent_type
0,European,B,NORP
1,Google,B,ORG
2,$,B,MONEY
3,5.1,I,MONEY
4,billion,I,MONEY
5,Wednesday,B,DATE


Helpfully, spaCy provides different ways to get a view of and an overview of named entities in text.

Below, I import displaCy for one of its display capabilities. 

### Display NER output with displacy

See the use of displacy.render() func below with argument 'jupyter=True'.

Behold.

In [12]:
## displaCy example
from spacy import displacy

text = "Donald Trump is a controversial American president both in the US and abroad."
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

Another, richer  example with more entity types. 

In [13]:
# rendering entities recognized by spacy
doc1 = nlp(u'I went first to Africa to shoot an elephant on camera. Then I went to Australia to shoot kangaroos. Then to Dubai in the Middle East to ride camels. Then I came to India to see the holy cow.')

displacy.render(doc1, style='ent', jupyter=True) # note 'options'

So far, we've dealt only with small, dummy sample stuff. 

Does spaCy NLP scale to larger real-world datasets? Time to try and find out. 

## spaCy NLP on a real world dataset (Nokia)

The 120 reviews wala corpus from 2013 on the Nokia Lumia smartphone is an old favorite of mine and we've seen this before in MKTR.

Let's give it a quick spin here.

In [14]:
# test-drive NLP funcs on Nokia dataset
import urllib.request
url = "https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/text%20analysis%20data/amazon%20nokia%20lumia%20reviews.txt"
nokia = urllib.request.urlopen(url).readlines()

print(type(nokia), "\n")
print(len(nokia), "\n")

# try on 1 doc in nokia dataset
import time
from nltk import sent_tokenize
nokia1 = str(nokia[0]).strip('[]')  # make string
nokia1_sents = sent_tokenize(nokia1)  # sent tokenization

# define empty DF to populate
poso_df = pd.DataFrame(columns=['sent_ind', 'text', 'lemma', 'postag', 'depcy'])

t1 = time.time()
for i in range(len(nokia1_sents)):
    
    sent0 = nokia1_sents[i]
    df0 = token_attrib(sent0)
    df0.insert(0, "sent_ind", i)
    poso_df = poso_df.append(df0)
    
t2 = time.time()
print(round(t2-t1,2))    # 0.25 secs
print("=======\n")
print(poso_df.shape)  # what is size of the DF?
poso_df.iloc[0:9,:]  # view a few rows of the DF

<class 'list'> 

120 

0.18

(482, 5)


Unnamed: 0,sent_ind,text,lemma,postag,depcy
0,0,"b""I","b""I",PROPN,nsubj
1,0,have,have,VERB,aux
2,0,had,have,VERB,ROOT
3,0,Samsung,Samsung,PROPN,amod
4,0,phones,phone,NOUN,dobj
5,0,",",",",PUNCT,punct
6,0,where,where,ADV,advmod
7,0,the,the,DET,det
8,0,screens,screen,NOUN,nsubj


This was just the first review. Took a fraction of a second.

### NER on Nokia data

Time now to run spaCy`s famed NER on Nokia. The entire 120 review corpus this time. See below.

In [15]:
# analyze real data - nokia with spacy
a0 = [i for i in range(len(nokia))]
nokia_str = [str(x).strip('[]') for x in nokia]

start_time = time.clock()

# dropping 'O' entities & enumerating doc_num 
nokia_ents = [[(i, X, X.ent_iob_, X.ent_type_) for X in nlp(nokia_str[i]) if X.ent_iob_ != 'O'] for i in a0] # 6.87 secs
print(time.clock() - start_time, "seconds")  # 5.5 secs

# make DF and display
nokia_ents_df = pd.DataFrame(columns = ['doc_ind', 'token', 'ent_iob', 'ent_type'])

for i in range(len(nokia_ents)):
    df0 = pd.DataFrame(nokia_ents[i], columns =['doc_ind', 'token', 'ent_iob', 'ent_type']) 
    nokia_ents_df = nokia_ents_df.append(df0)
    
nokia_ents_df.iloc[0:12,:]  # view first 12 rows

  """
  if __name__ == '__main__':


3.4001127000000224 seconds


Unnamed: 0,doc_ind,token,ent_iob,ent_type
0,0,Samsung,B,ORG
1,0,iPhones,B,ORG
2,0,money.<br,B,PERSON
3,0,/><br,I,PERSON
4,0,Nokia,B,GPE
5,0,Nokia,B,ORG
6,0,Nokia,B,ORG
7,0,Drive,I,ORG
8,0,Nokia,B,ORG
9,0,Music,I,ORG


Easy, eh?

Now say, we want to find every mention of organizations ('ORG') in the review corpus. Or persons ('PERSON') or facilities ('FAC') or money. Etc.

Then its a simple matter of cleverly using entities to get a superset which we could then further refine.

P.S. Of course, the NER in spaCy or elsewhere for that matter isn't perfect and will make errors of both type I and II (i.e. of ommission or commission).

See below for an example with the ORG entity.

In [17]:
# filter above to retain only 'ORG' named entity type
new_ent_df = nokia_ents_df[(nokia_ents_df['ent_type'] == 'ORG')]
new_ent_df.iloc[:8, :]

Unnamed: 0,doc_ind,token,ent_iob,ent_type
0,0,Samsung,B,ORG
1,0,iPhones,B,ORG
5,0,Nokia,B,ORG
6,0,Nokia,B,ORG
7,0,Drive,I,ORG
8,0,Nokia,B,ORG
9,0,Music,I,ORG
11,0,ESPN,B,ORG


Chalo, back to the slides now.

What follows is a separate foray into dependency parsing but some Q & A before that.

Sudhir