spaCy is an open-source natural language processing (NLP) library for Python.

NLTK (the Natural Language Toolkit) is another open-source NLP library for Python.

#######
SPACY BASICS
#######

In [1]:
import spacy

In [27]:
# Load the model (a trained language pipeline)
# 'en_core_web_sm' is the small version of the core English pipeline
nlp = spacy.load('en_core_web_sm')

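Below, the 'u' prefix marks a Unicode string literal (it is optional in Python 3, where every string is Unicode). spaCy splits the text into tokens: words, punctuation marks and symbols.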

In [28]:
# Here we create a document object
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')
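
As a small aside on the 'u' prefix, the check below uses only built-in Python and assumes Python 3. It is a minimal sketch showing that the prefixed and unprefixed literals are the same kind of object, so either form can be passed to `nlp`.

```python
# In Python 3 the u prefix is accepted for backward compatibility,
# but it produces an ordinary str.
plain = 'Tesla is looking at buying U.S. startup for $6 million'
prefixed = u'Tesla is looking at buying U.S. startup for $6 million'

same_type = type(plain) is type(prefixed)   # both are str
same_value = plain == prefixed              # identical contents
print(same_type, same_value)
```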

Here,

token.text => returns the text of each token

token.pos => returns the integer ID of each token's part-of-speech tag

token.pos_ => returns the name of each token's part-of-speech tag

token.dep_ => returns the syntactic dependency label of each token

In [30]:
# Every part of speech has an integer code in spaCy
for token in doc:
    print(token.text, token.pos, token.pos_, token.dep_)

Tesla 95 PROPN nsubj
is 99 VERB aux
looking 99 VERB ROOT
at 84 ADP prep
buying 99 VERB pcomp
U.S. 95 PROPN compound
startup 91 NOUN dobj
for 84 ADP prep
$ 98 SYM quantmod
6 92 NUM compound
million 92 NUM pobj
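
The link between the integer codes and the tag names printed above can be inspected directly. The sketch below assumes the `spacy.parts_of_speech` module is importable in your installed version; the integer values may differ between spaCy releases, so it compares names rather than numbers.

```python
from spacy.parts_of_speech import IDS, NAMES

# IDS maps a coarse part-of-speech name to its integer ID;
# NAMES is the reverse lookup (ID -> name).
tags = ('PROPN', 'VERB', 'NOUN')
for name in tags:
    pos_id = IDS[name]
    print(name, pos_id, NAMES[pos_id])

# Going name -> ID -> name should give back the same name.
round_trip_ok = all(NAMES[IDS[name]] == name for name in tags)
```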


In [12]:

nlp.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x7f5fe74951d0>),
 ('parser', <spacy.pipeline.DependencyParser at 0x7f5fe6e18590>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x7f5fe6e18b30>)]

Above, `nlp.pipeline` lists the processing pipeline that our text passes through. The text is first broken down into tokens and then goes through a series of operations: tagging, parsing and named-entity recognition.

The basic NLP pipeline has a tagger, a parser and 'ner' (a named-entity recognizer).

In [14]:
# returns list of pipeline operations
nlp.pipe_names

['tagger', 'parser', 'ner']
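
To see how components end up in a pipeline, here is a minimal sketch that assumes spaCy v3, where built-in components are added by their string name. It uses a blank English pipeline under the illustrative name `nlp_blank`, so it needs no downloaded model and does not overwrite the notebook's `nlp`.

```python
import spacy

# A blank English pipeline has a tokenizer but no trained components.
nlp_blank = spacy.blank('en')
names_before = list(nlp_blank.pipe_names)

# spaCy v3 style: add a built-in component by its registered name.
nlp_blank.add_pipe('sentencizer')
names_after = list(nlp_blank.pipe_names)

print(names_before, names_after)
```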

Tokenization:

The very first step in processing any text is to split it into its component parts (words and punctuation), called tokens. These tokens are annotated inside the Doc object with descriptive information.

In [16]:
doc2 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [19]:
# Here we take a span (a slice) of the document
life_quote = doc2[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [20]:
type(life_quote)

spacy.tokens.span.Span

In [21]:
type(doc2)

spacy.tokens.doc.Doc
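
Slicing a Doc works on token positions, not characters. The sketch below is only an illustration on a shorter sentence, using a blank English pipeline (tokenizer only) under the assumed name `nlp_blank`.

```python
import spacy

nlp_blank = spacy.blank('en')            # tokenizer only, no model needed
doc_demo = nlp_blank('Life is what happens to us')

# doc[start:end] returns a Span covering tokens start .. end-1.
middle = doc_demo[2:4]
print(type(middle).__name__, middle.text, middle.start, middle.end, len(middle))
```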

In [22]:
doc3 = nlp(u'This is the first sentence. This is the second sentence. This is the last sentence.')
for sentence in doc3.sents:
    print(sentence)

This is the first sentence.
This is the second sentence.
This is the last sentence.


In [25]:
# is_sent_start is True if the token starts a sentence
# and None here if it does not
print(doc3[6].is_sent_start, doc3[7].is_sent_start)

True None
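
Sentence boundaries can also be set without a trained parser. This is a minimal sketch assuming spaCy v3 and its rule-based `sentencizer` component; it lists the sentences and the index of every token that starts one.

```python
import spacy

nlp_blank = spacy.blank('en')
nlp_blank.add_pipe('sentencizer')   # rule-based sentence splitting on punctuation

doc_demo = nlp_blank('This is the first sentence. This is the second sentence. This is the last sentence.')
sentences = [sent.text for sent in doc_demo.sents]

# Indices of tokens flagged as sentence starts.
starts = [token.i for token in doc_demo if token.is_sent_start]
print(sentences)
print(starts)
```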


TOKENIZATION - part-One

In [31]:
import spacy

In [32]:
nlp = spacy.load('en_core_web_sm')

In [33]:
str1 = '"We\'re moving to L.A.!"'
doc1 = nlp(str1)
for t in doc1:
    print(t.text, t.pos, t.pos_, t.dep_)

" 96 PUNCT punct
We 94 PRON nsubj
're 99 VERB aux
moving 99 VERB ROOT
to 84 ADP prep
L.A. 95 PROPN pobj
! 96 PUNCT punct
" 96 PUNCT punct


-  **Prefix**: Character(s) at the beginning ▸ `$ ( “ ¿`
-  **Suffix**: Character(s) at the end ▸ `km ) , . ! ”`
-  **Infix**: Character(s) in between ▸ `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied ▸ `St. U.S.`

Notice that tokens are pieces of the original text. That is, we don't see any conversion to word stems or lemmas (base forms of words) and we haven't seen anything about organizations/places/money etc. Tokens are the basic building blocks of a Doc object - everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another.
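
The rules listed above can be observed on a blank English pipeline, since tokenization does not depend on a trained model. The sketch below assumes a spaCy version that provides `Tokenizer.explain` (added during the v2.2 releases); the special case for 'gimme' is a hypothetical example, not one of spaCy's built-in exceptions.

```python
import spacy
from spacy.symbols import ORTH

nlp_blank = spacy.blank('en')
text = '"We\'re moving to L.A.!"'
tokens_default = [t.text for t in nlp_blank(text)]

# Show which kind of rule (prefix, suffix, special case, ...) produced each token.
for rule, piece in nlp_blank.tokenizer.explain(text):
    print(rule, piece)

# Hypothetical exception: always split 'gimme' into two tokens.
nlp_blank.tokenizer.add_special_case('gimme', [{ORTH: 'gim'}, {ORTH: 'me'}])
tokens_special = [t.text for t in nlp_blank('gimme that')]
print(tokens_special)
```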

## Prefixes, Suffixes and Infixes
spaCy will isolate punctuation that does *not* form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

In [35]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")
for t in doc2:
    print(t.text, t.pos, t.pos_, t.dep_)

We 94 PRON nsubj
're 99 VERB ROOT
here 85 ADV advmod
to 93 PART aux
help 99 VERB advcl
! 96 PUNCT punct
Send 99 VERB ROOT
snail 91 NOUN compound
- 96 PUNCT punct
mail 91 NOUN dobj
, 96 PUNCT punct
email 91 NOUN conj
support@oursite.com 100 X ROOT
or 88 CCONJ cc
visit 99 VERB conj
us 94 PRON dobj
at 84 ADP prep
http://www.oursite.com 100 X pobj
! 96 PUNCT punct


<font color=green>Note that the exclamation points, comma, and the hyphen in 'snail-mail' are assigned their own tokens, yet both the email address and website are preserved.</font>

In [36]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')
for t in doc3:
    print(t.text, t.pos, t.pos_, t.dep_)

A 89 DET det
5 92 NUM nummod
km 91 NOUN compound
NYC 95 PROPN compound
cab 91 NOUN compound
ride 91 NOUN nsubj
costs 99 VERB ROOT
$ 98 SYM nmod
10.30 92 NUM dobj


<font color=green>Here the distance unit and dollar sign are assigned their own tokens, yet the dollar amount is preserved.</font>

In [37]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")
for t in doc4:
    print(t.text, t.pos, t.pos_, t.dep_)

Let 99 VERB ROOT
's 94 PRON nsubj
visit 99 VERB ccomp
St. 95 PROPN compound
Louis 95 PROPN dobj
in 84 ADP prep
the 89 DET det
U.S. 95 PROPN pobj
next 83 ADJ amod
year 91 NOUN npadvmod
. 96 PUNCT punct


<font color=green>Here the abbreviations for "Saint" and "United States" are both preserved.</font>

## COUNTING TOKENS

In [39]:
# returns the number of tokens in the document
len(doc)

11

In [41]:
# returns the Vocab object of the 'en_core_web_sm' language library
doc.vocab

<spacy.vocab.Vocab at 0x7f5fd3f2bf80>

In [42]:
# returns the number of entries (lexemes) currently in the 'en_core_web_sm' vocabulary
len(doc.vocab)

57852

In [46]:
doc[0]

Tesla

In [48]:
# item assignment is not allowed on a Doc
doc[0] = 'hello'

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment

In [57]:
doc2 = nlp(u'Apple to build a India factory for $6 million')
for entity in doc2.ents:
    print(entity)
    print(entity.label_)
    print(spacy.explain(entity.label_))
    print('\n')

Apple
ORG
Companies, agencies, institutions, etc.


India
GPE
Countries, cities, states


$6 million
MONEY
Monetary values, including unit




LABELS:
    
ORG: organization
GPE: geopolitical entity
MONEY: monetary value

## NOUN CHUNKS
Similar to `Doc.ents`, `Doc.noun_chunks` is another Doc property. Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of a noun chunk as a noun plus the words describing it – for example, in Sheb Wooley's 1958 song, a "one-eyed, one-horned, flying, purple people-eater" would be one long noun chunk.

In [58]:
doc = nlp(u'Autonomous cars shift insurance liability towards manufacturers.')
for nchunks in doc.noun_chunks:
    print(nchunks)

Autonomous cars
insurance liability
manufacturers


We'll look at additional noun_chunks components besides `.text` in an upcoming section.<br>For more info on **noun_chunks** visit https://spacy.io/usage/linguistic-features#noun-chunks

## VISUALIZER

___
# Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

In [2]:
from spacy import displacy

In [61]:
nlp  = spacy.load('en_core_web_sm')

In [62]:
doc = nlp('Apple is going to build a U.K. factory for $6 million.')

In [64]:
# style = 'dep' is syntactic dependency
displacy.render(doc,style='dep',jupyter=True,options={'distance':80})

In [71]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)
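
Outside a notebook, displaCy can return the markup as a string instead of rendering it. The sketch below assumes displaCy's manual mode (`manual=True`), which takes hand-written entity offsets, so no model is required. The labels are typed in by hand for illustration and are not model predictions; `char_span` is a hypothetical helper.

```python
from spacy import displacy

text = 'Apple is going to build a U.K. factory for $6 million.'

def char_span(phrase, label):
    """Build a manual entity entry from the phrase's character offsets."""
    start = text.find(phrase)
    return {'start': start, 'end': start + len(phrase), 'label': label}

example = {
    'text': text,
    'ents': [char_span('Apple', 'ORG'),
             char_span('U.K.', 'GPE'),
             char_span('$6 million', 'MONEY')],
    'title': None,
}

# jupyter=False forces displaCy to return the HTML as a string.
html = displacy.render(example, style='ent', manual=True, jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file and opened in a browser.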

In [3]:
!pip install spacy

In [3]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'he is a nice person')
for token in doc:
    print(token)
    print(token.pos)
    print(token.pos_)
    print(spacy.explain(token.pos_))
    print("\n")
    

he
94
PRON
pronoun


is
99
VERB
verb


a
89
DET
determiner


nice
83
ADJ
adjective


person
91
NOUN
noun




In [4]:
import nltk


In [5]:
!pip install nltk



In [11]:
from nltk.stem.porter import PorterStemmer

In [12]:
ps = PorterStemmer()

In [13]:
words = ['run','runner', 'ran', 'runs', 'easily','fairly', 'fairness']

In [14]:
# The Porter stemmer is a fairly simple rule-based suffix stripper, so it sometimes
# produces odd stems that are not real words, such as 'easili' and 'fairli'.
# It leaves 'runner' and 'ran' unchanged because none of its suffix rules reduce them to 'run'.
for word in words:
    print(word + '------->' + ps.stem(word))

run------->run
runner------->runner
ran------->ran
runs------->run
easily------->easili
fairly------->fairli
fairness------->fair


In [15]:
from nltk.stem.snowball import SnowballStemmer

In [17]:
sbs = SnowballStemmer(language='english')

In [19]:
for word in words:
    print(word + '------->' + sbs.stem(word))

run------->run
runner------->runner
ran------->ran
runs------->run
easily------->easili
fairly------->fair
fairness------->fair
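
To make the difference between the two stemmers easier to spot, the sketch below collects only the words on which they disagree. It reuses the word list from above and assumes NLTK is installed; the names `porter`, `snowball` and `differences` are illustrative.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness']

# Keep only the words where the two stemmers return different stems.
differences = {}
for word in words:
    p_stem, s_stem = porter.stem(word), snowball.stem(word)
    if p_stem != s_stem:
        differences[word] = (p_stem, s_stem)

print(differences)
```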


In [21]:
words2 = ['generation', 'generate', 'generous', 'generously']

In [23]:
for word in words2:
    print(word + '------>' + sbs.stem(word))

generation------>generat
generate------>generat
generous------>generous
generously------>generous


## LEMMATIZATION

In [26]:
def show_lemmas(text):
    for t in text:
        print(f'{t.text:{15}} {t.pos_:{5}} {t.lemma:<{30}} {t.lemma_:{20}}')

In [27]:
doc = nlp(u'I am runner running in a race because I love to run since I ran today')
show_lemmas(doc)

I               PRON  561228191312463089             -PRON-              
am              VERB  10382539506755952630           be                  
runner          ADV   12640964157389618806           runner              
running         VERB  12767647472892411841           run                 
in              ADP   3002984154512732771            in                  
a               DET   11901859001352538922           a                   
race            NOUN  8048469955494714898            race                
because         ADP   16950148841647037698           because             
I               PRON  561228191312463089             -PRON-              
love            VERB  3702023516439754181            love                
to              PART  3791531372978436496            to                  
run             VERB  12767647472892411841           run                 
since           ADP   10066841407251338481           since               
I               PRON  5612281913124630

In [28]:
doc2 = nlp(u'I saw ten mice today!')
show_lemmas(doc2)

I               PRON  561228191312463089             -PRON-              
saw             VERB  11925638236994514241           see                 
ten             NUM   7970704286052693043            ten                 
mice            NOUN  1384165645700560590            mouse               
today           NOUN  11042482332948150395           today               
!               PUNCT 17494803046312582752           !                   
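
The large number in the third column is the hash of the lemma string, stored in the vocabulary's string store. A minimal sketch of that two-way lookup, using a blank English pipeline so that no model is needed:

```python
import spacy

nlp_blank = spacy.blank('en')
strings = nlp_blank.vocab.strings

# Adding a string returns its hash; the hash looks the string up again.
lemma_hash = strings.add('mouse')
lemma_text = strings[lemma_hash]

print(lemma_hash, lemma_text)
```

This is the same relationship as between `t.lemma` and `t.lemma_` in `show_lemmas`.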


## STOP WORDS

In [29]:
# returns a set of all the stop words in the dictionary
print(nlp.Defaults.stop_words)

{'his', 'being', 'onto', 'top', 'ever', 'had', 'serious', 'until', 'back', 'here', 'before', 'please', 'unless', 'six', 'enough', 'everyone', 'side', 'toward', 'her', 'something', 'used', 'which', 'becomes', 'my', 'your', 'once', 'so', 'this', 'very', 'have', 'some', 'me', 'though', 'three', 'hereafter', 'sometime', 'such', 'via', 'whereby', 'afterwards', 'beforehand', 'any', 'nevertheless', 'no', 'moreover', 'somehow', 'seemed', 'hundred', 'may', 'would', 'go', 'own', 'others', 'sometimes', 'ten', 'two', 'itself', 'too', 'yourself', 'make', 'indeed', 'third', 'themselves', 'whereupon', 'must', 'therefore', 'do', 'is', 'could', 'less', 'call', 'during', 'first', 'himself', 'various', 'whom', 'further', 'ca', 'herself', 'show', 'behind', 'these', 'became', 'she', 'thru', 'perhaps', 'does', 'on', 'namely', 'even', 'see', 'off', 'can', 'among', 'someone', 'twenty', 'thereafter', 'well', 'those', 'towards', 'whose', 'why', 'beside', 'quite', 'then', 'one', 'name', 'without', 'cannot', 'as'

In [32]:
len(nlp.Defaults.stop_words)

305

In [33]:
nlp.vocab['an']

<spacy.lexeme.Lexeme at 0x7f352007a7d0>

In [34]:
nlp.vocab['an'].is_stop

True

In [36]:
nlp.vocab['mystery'].is_stop

False

Adding stop words to default set of stop words

In [37]:
nlp.Defaults.stop_words.add('btw')

In [38]:
nlp.vocab['btw'].is_stop = True

In [47]:
len(nlp.Defaults.stop_words)

305

In [40]:
nlp.vocab['btw'].is_stop

True

Removing a stop word

In [51]:
# set.remove raises KeyError when the word is not in the set
nlp.Defaults.stop_words.remove('btw')

KeyError: 'btw'

In [52]:
nlp.vocab['btw'].is_stop = False

In [53]:
nlp.vocab['btw'].is_stop

False
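
A typical use of the `is_stop` flag is filtering tokens. The sketch below uses a blank English pipeline and a made-up sentence; it flags 'btw' on the vocabulary entry, filters, and then clears the flag again. `set.discard` is used for the default set because, unlike `set.remove`, it does not raise when the word is absent.

```python
import spacy

nlp_blank = spacy.blank('en')

# Flag a custom stop word on its vocabulary entry (lexeme).
nlp_blank.vocab['btw'].is_stop = True

doc_demo = nlp_blank('btw this is an unsolved mystery')
kept = [token.text for token in doc_demo if not token.is_stop]
print(kept)

# Undo: clear the flag, and discard from the default set without risking KeyError.
nlp_blank.vocab['btw'].is_stop = False
nlp_blank.Defaults.stop_words.discard('btw')
still_stop = nlp_blank.vocab['btw'].is_stop
```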

## Phrase Matching and Vocabulary

Here we will identify and label specific phrases that match patterns we can define ourselves.

### Rule Based Matching 

spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

In [43]:
import spacy

nlp = spacy.load('en_core_web_sm')

In [44]:
# import matcher library
from spacy.matcher import Matcher
mtchr = Matcher(nlp.vocab)

<font color=green>Here `mtchr` is an object that pairs to the current `Vocab` object. We can add and remove specific named matchers to `mtchr` as needed.</font>

#### Creating Patterns

In literature, the phrase 'solar power' might appear as one word or two, with or without a hyphen. In this section we'll develop a matcher named 'SolarPower' that finds all three:

In [45]:
ptrn1 = [{'LOWER': 'solarpower'}]
ptrn2 = [{'LOWER': 'solar'},{'LOWER':'power'}]
ptrn3 = [{'LOWER': 'solar'},{'IS_PUNCT':True},{'LOWER':'power'}]

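# spaCy v2 signature: add(name, callback, *patterns)
# In spaCy v3 the call is mtchr.add('SolarPower', [ptrn1, ptrn2, ptrn3])
mtchr.add('SolarPower', None, ptrn1, ptrn2, ptrn3)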

Let's break this down:
* `ptrn1` looks for a single token whose lowercase text reads 'solarpower'
* `ptrn2` looks for two adjacent tokens that read 'solar' and 'power' in that order
* `ptrn3` looks for three adjacent tokens, with a middle token that can be any punctuation.<font color=green>*</font>

<font color=green>\* Remember that single spaces are not tokenized, so they don't count as punctuation.</font>
<br>Once we define our patterns, we pass them into `mtchr` with the name 'SolarPower', and set *callbacks* to `None` (more on callbacks later).

In [46]:
doc = nlp(u'The Solar Power industry continues to grow as demand for solarpower increases. Solar-power cars are gaining popularity.')
found_matches = mtchr(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


In [47]:
for match_id, start, end in found_matches:
    str_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, str_id, start, end, span.text)
    print('\n')

8656102463236116519 SolarPower 1 3 Solar Power


8656102463236116519 SolarPower 10 11 solarpower


8656102463236116519 SolarPower 13 16 Solar-power




The `match_id` is simply the hash value of the `str_id` 'SolarPower'
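
For readers on a newer release, the same three patterns can be sketched as follows. This assumes spaCy v3, where `Matcher.add` takes the name and a list of patterns with no callback argument, and uses a blank English pipeline and a made-up sentence so that no model is needed.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank('en')
matcher_demo = Matcher(nlp_blank.vocab)

patterns = [
    [{'LOWER': 'solarpower'}],
    [{'LOWER': 'solar'}, {'LOWER': 'power'}],
    [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}],
]
# spaCy v3 signature: add(name, list_of_patterns)
matcher_demo.add('SolarPower', patterns)

doc_demo = nlp_blank('Solar Power and solarpower and solar-power')
found = sorted(
    (start, end, doc_demo[start:end].text)
    for _, start, end in matcher_demo(doc_demo)
)
print(found)

# Every match_id is the hash of the name the patterns were registered under.
ids_match_name = all(
    nlp_blank.vocab.strings[match_id] == 'SolarPower'
    for match_id, _, _ in matcher_demo(doc_demo)
)
```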

### Setting pattern options and quantifiers
You can make token rules optional by passing an `'OP':'*'` argument. This lets us streamline our patterns list:

In [69]:
ptrn1 = [{'LOWER': 'solarpower'}]
ptrn2 = [{'LOWER':'solar'},{'IS_PUNCT':True, 'OP':'*'},{'LOWER':'power'}]

# remove the old patterns registered under the name 'SolarPower'
mtchr.remove('SolarPower')

# add the new patterns under the name 'Solar_Power'
mtchr.add('Solar_Power', None, ptrn1, ptrn2)

match_found = mtchr(doc)

In [70]:
for mtch_id, start, end in match_found:
    str_id = nlp.vocab.strings[mtch_id]
    span = doc[start:end]
    print(mtch_id, str_id, start, end, span.text)
    print('\n')

9627793059523059485 Solar_Power 1 3 Solar Power


9627793059523059485 Solar_Power 10 11 solarpower


9627793059523059485 Solar_Power 13 16 Solar-power




This found both two-word patterns, with and without the hyphen!

The following quantifiers can be passed to the `'OP'` key:
<table>
<tr><th>OP</th><th>Description</th></tr>
<tr><td>!</td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr><td>?</td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr><td>+</td><td>Require the pattern to match 1 or more times</td></tr>
<tr><td>*</td><td>Allow the pattern to match zero or more times</td></tr>
</table>
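
The difference between `?` and `*` shows up once more than one punctuation token sits between the two words. This is a minimal sketch assuming the spaCy v3 `Matcher.add` signature, with a made-up sentence and illustrative pattern names ('OPTIONAL', 'ANY').

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank('en')
matcher_demo = Matcher(nlp_blank.vocab)

# '?' allows at most one punctuation token, '*' allows any number.
matcher_demo.add('OPTIONAL', [[{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '?'}, {'LOWER': 'power'}]])
matcher_demo.add('ANY', [[{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]])

doc_demo = nlp_blank('solar power and solar-power and solar - - power')

# Group the matched (start, end) token ranges by pattern name.
spans = {'OPTIONAL': set(), 'ANY': set()}
for match_id, start, end in matcher_demo(doc_demo):
    spans[nlp_blank.vocab.strings[match_id]].add((start, end))

print(sorted(spans['OPTIONAL']))
print(sorted(spans['ANY']))
```

Only the `*` pattern reaches across the two separate hyphens in the last phrase.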


In [108]:
# practice - start
import spacy
from spacy.matcher import Matcher

In [109]:
nlp = spacy.load('en_core_web_sm')

In [110]:
mtchr = Matcher(nlp.vocab)

In [111]:
ptrn1 = [{'LOWER': 'solarpower'}]
ptrn2 = [{'LOWER': 'solar'},{'IS_PUNCT': True, 'OP': '*'},{'LOWER': 'power'}]

In [112]:
mtchr.add('SolarPower', None, ptrn1, ptrn2)

In [113]:
doc = nlp(u'Solarpower is the Solar--power of the Solar power')

In [114]:
mtchs_found = mtchr(doc)
print(mtchs_found)

[(8656102463236116519, 0, 1), (8656102463236116519, 3, 6), (8656102463236116519, 8, 10)]


In [115]:
for mtch_id, start, end in mtchs_found:
    match_str = nlp.vocab.strings[mtch_id]
    span = doc[start:end]
    print(match_str, start, end, span.text)

SolarPower 0 1 Solarpower
SolarPower 3 6 Solar--power
SolarPower 8 10 Solar power


In [116]:
mtchr.remove('SolarPower')
# practice - end

### Phrase Matching

In [119]:
from spacy.matcher import PhraseMatcher

In [120]:
nlp = spacy.load('en_core_web_sm')

In [125]:
phrs_mtchr = PhraseMatcher(nlp.vocab)

In [129]:
with open('/home/santosh/Desktop/MASTERS/EXTRA_COURSE_WORK/NLP/UPDATED_NLP_COURSE/TextFiles/reaganomics.txt', encoding='utf8', errors='ignore') as myfile:
    doc = nlp(myfile.read())

In [130]:
# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

phrs_ptrn = [nlp(text) for text in phrase_list]

In [131]:
phrs_mtchr.add('economics', None, *phrs_ptrn)

In [132]:
found_mtchs = phrs_mtchr(doc)
print(found_mtchs)

[(2666819224875146687, 41, 45), (2666819224875146687, 49, 53), (2666819224875146687, 54, 56), (2666819224875146687, 61, 65), (2666819224875146687, 673, 677), (2666819224875146687, 2984, 2988)]


In [134]:
for mtch_id, start, end in found_mtchs:
    mtch_str = nlp.vocab.strings[mtch_id]
    span = doc[start:end]
    print(mtch_id, mtch_str, start, end, span.text)

2666819224875146687 economics 41 45 supply-side economics
2666819224875146687 economics 49 53 trickle-down economics
2666819224875146687 economics 54 56 voodoo economics
2666819224875146687 economics 61 65 free-market economics
2666819224875146687 economics 673 677 supply-side economics
2666819224875146687 economics 2984 2988 trickle-down economics
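
The same idea can be tried without the text file. The sketch below assumes spaCy v3, where `PhraseMatcher.add` takes the name and a list of pattern Docs, and uses a made-up sentence in place of reaganomics.txt. `make_doc` only tokenizes, which is all the phrase patterns need.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank('en')
phrase_matcher = PhraseMatcher(nlp_blank.vocab)

phrase_list = ['voodoo economics', 'supply-side economics']
# Each phrase becomes a tokenized Doc used as a pattern.
phrase_patterns = [nlp_blank.make_doc(text) for text in phrase_list]
phrase_matcher.add('economics', phrase_patterns)   # spaCy v3 signature

doc_demo = nlp_blank('Critics called supply-side economics voodoo economics.')
found = sorted(
    (start, end, doc_demo[start:end].text)
    for _, start, end in phrase_matcher(doc_demo)
)
print(found)
```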
