# Text Analytics with spaCy

![spacypipeline](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

https://spacy.io/usage/processing-pipelines

When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc` object. The `Doc` is then processed in several steps, collectively referred to as the **processing pipeline**. The pipeline used by the default models consists of a tagger, a parser, and an entity recognizer. Each pipeline component returns the processed `Doc`, which is then passed on to the next component.

**How pipelines work**

spaCy makes it easy to create your own pipelines from reusable components. These include spaCy's default tagger, parser, and entity recognizer, as well as your own custom processing functions. A pipeline component can be added to an existing `nlp` object, specified when initializing a `Language` class, or defined within a model package.
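
The "each component receives the Doc and returns it" idea can be illustrated without spaCy. The following is a minimal plain-Python sketch, not spaCy's implementation: `ToyDoc`, `lowercase_component`, `length_component`, and `run_pipeline` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a processed document (not spaCy's Doc)."""
    tokens: List[str]
    annotations: Dict[str, list] = field(default_factory=dict)


def lowercase_component(doc: ToyDoc) -> ToyDoc:
    # Annotate the doc with a lowercased form of every token.
    doc.annotations["lower"] = [t.lower() for t in doc.tokens]
    return doc


def length_component(doc: ToyDoc) -> ToyDoc:
    # Annotate the doc with the character length of every token.
    doc.annotations["length"] = [len(t) for t in doc.tokens]
    return doc


def run_pipeline(text: str, components: List[Callable[[ToyDoc], ToyDoc]]) -> ToyDoc:
    # Step 1: tokenize (here: naive whitespace splitting) to create the doc.
    doc = ToyDoc(tokens=text.split())
    # Step 2: pass the doc through each component in order.
    for component in components:
        doc = component(doc)
    return doc


toy_doc = run_pipeline("Apple is looking", [lowercase_component, length_component])
print(toy_doc.annotations)
```

Because every component has the same signature (doc in, doc out), components can be reordered, removed, or replaced without touching the others, which is the property the description above relies on.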

## Let's Get Started

* Install and import spaCy.
* Install and load a language model.

Another useful feature of spaCy is its set of pre-built trained pipelines; see https://spacy.io/usage/models#download

In [3]:
import spacy  # install first with: pip install spacy (takes a minute or two)
from spacy import displacy

In [4]:
nlp = spacy.load('en_core_web_sm')  # download first with: python -m spacy download en_core_web_sm
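
If the import or the load step fails, it can help to check whether both packages are visible to the current Python environment. The sketch below uses only the standard library and assumes the model was installed as a regular Python package named `en_core_web_sm`, which is how the download command normally installs it. `is_importable` is a hypothetical helper name.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if Python's import system can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Report whether the library and the English model are available.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```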

In [5]:
text = "Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack \"despicable.\""

In [6]:
text

'Netanyahu\'s visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister\'s unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack "despicable."'

In [7]:
doc = nlp(text)

In [8]:
doc

Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack "despicable."

In [9]:
for token in doc:
    print(token)

Netanyahu
's
visit
was
cut
short
by
reports
late
Sunday
that
a
rocket
was
fired
from
Gaza
into
central
Israel
,
wounding
at
least
seven
people
.
Following
criticism
from
political
opponents
over
what
they
consider
the
prime
minister
's
unclear
stance
toward
the
militant
political
group
,
Israel
responded
with
a
series
of
strikes
into
Gaza
against
Hamas
,
which
largely
governs
the
contested
strip
.
President
Donald
Trump
tacitly
endorsed
the
strike
following
his
meetings
with
Netanyahu
,
calling
the
Hamas
attack
"
despicable
.
"
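
Note how the output separates `'s` and punctuation from the neighbouring words. As a rough illustration of that behaviour only, here is a toy regular-expression tokenizer. It is not how spaCy tokenizes (spaCy applies language-specific rules and exceptions), and it would mishandle cases such as `U.K.` or `$1,000`. `TOKEN_PATTERN` and `toy_tokenize` are hypothetical names.

```python
import re

# Try the clitic 's first, then runs of word characters, then any single
# character that is neither a word character nor whitespace (punctuation).
TOKEN_PATTERN = re.compile(r"'s|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(toy_tokenize("Netanyahu's visit was cut short."))
# ['Netanyahu', "'s", 'visit', 'was', 'cut', 'short', '.']
```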


### Removing stop words

In [10]:
from spacy.lang.en.stop_words import STOP_WORDS

In [11]:
stopwords = list(STOP_WORDS)

In [12]:
print(stopwords)

['back', 'other', 'toward', 'go', 'wherein', "'ll", 'move', 'anyhow', 'regarding', 'can', 'seeming', 'everywhere', 'whither', 'around', 'four', 'may', 'him', 'his', "'re", 'something', 'not', 'whereafter', 'those', 'them', 'does', 'twenty', '‘ll', 'also', 'none', "n't", 'becoming', 'without', 'rather', 'my', 'behind', '’s', 'however', 'empty', 'ca', 'show', 'whole', 'whence', 'many', 'wherever', 'another', 'sometimes', 'to', 'had', 'mostly', 'whoever', 'throughout', 'else', 'among', '’ll', 'any', 'here', 'on', 'now', 'thereupon', 'amount', 'upon', 'quite', "'d", 'latterly', 'amongst', 'then', 'within', 'below', 'through', 'whose', 'serious', '’d', 'hundred', 'n’t', 'against', 'seemed', 'keep', 'most', 'name', 'beforehand', 'namely', 'along', 'never', '’m', 'yourselves', 'himself', 'nobody', 'have', 'nowhere', 'ours', 'first', 'herein', 'onto', 'such', 'because', 'how', 'anything', 'since', 'yourself', 'top', 'seems', 'down', 'its', 'over', 'one', 'few', 'nothing', 'thereby', 'herself',

In [13]:
len(stopwords)

326
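
`STOP_WORDS` is itself a Python set, so converting it to a list is only needed for display. For filtering, a set is the better fit because membership tests do not scan the whole collection. The sketch below shows the filtering idea on a small hand-picked subset; `toy_stop_words` and `tokens` are illustrative values, not the full spaCy list.

```python
# A small illustrative subset of stop words (all appear in spaCy's English list).
toy_stop_words = {"was", "by", "that", "a", "from", "into", "at", "least"}

tokens = ["a", "rocket", "was", "fired", "from", "Gaza", "into", "central", "Israel"]

# Keep only tokens whose lowercased form is not a stop word.
content_tokens = [t for t in tokens if t.lower() not in toy_stop_words]

print(content_tokens)
# ['rocket', 'fired', 'Gaza', 'central', 'Israel']
```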

Tokens in the text that are not stop words:

In [14]:
for token in doc:
    if not token.is_stop:
        print(token)

Netanyahu
visit
cut
short
reports
late
Sunday
rocket
fired
Gaza
central
Israel
,
wounding
seven
people
.
Following
criticism
political
opponents
consider
prime
minister
unclear
stance
militant
political
group
,
Israel
responded
series
strikes
Gaza
Hamas
,
largely
governs
contested
strip
.
President
Donald
Trump
tacitly
endorsed
strike
following
meetings
Netanyahu
,
calling
Hamas
attack
"
despicable
.
"


Stop words that appear in the text:

In [15]:
for token in doc:
    if token.is_stop:
        print(token)

's
was
by
that
a
was
from
into
at
least
from
over
what
they
the
's
toward
the
with
a
of
into
against
which
the
the
his
with
the


### Lemmatization 

In [16]:
for token in doc:
    print(token.text, token.lemma_)

Netanyahu Netanyahu
's 's
visit visit
was be
cut cut
short short
by by
reports report
late late
Sunday Sunday
that that
a a
rocket rocket
was be
fired fire
from from
Gaza Gaza
into into
central central
Israel Israel
, ,
wounding wound
at at
least least
seven seven
people people
. .
Following follow
criticism criticism
from from
political political
opponents opponent
over over
what what
they they
consider consider
the the
prime prime
minister minister
's 's
unclear unclear
stance stance
toward toward
the the
militant militant
political political
group group
, ,
Israel Israel
responded respond
with with
a a
series series
of of
strikes strike
into into
Gaza Gaza
against against
Hamas Hamas
, ,
which which
largely largely
governs govern
the the
contested contest
strip strip
. .
President President
Donald Donald
Trump Trump
tacitly tacitly
endorsed endorse
the the
strike strike
following follow
his his
meetings meeting
with with
Netanyahu Netanyahu
, ,
calling call
the the
Hamas Hamas
attac

#### Let's try with another text, this time a bit shorter

In [17]:
doc = nlp('run runs running runner')

In [18]:
for token in doc:
    print(token.text, token.lemma_)

run run
runs run
running run
runner runner
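
One practical use of lemmas is to group different surface forms of the same word. The sketch below does this with the (text, lemma) pairs copied from the output above, so it runs without a model. `pairs` and `forms_by_lemma` are hypothetical names.

```python
from collections import defaultdict

# (token text, lemma) pairs copied from the output above.
pairs = [("run", "run"), ("runs", "run"), ("running", "run"), ("runner", "runner")]

# Collect every surface form under its lemma.
forms_by_lemma = defaultdict(list)
for text, lemma in pairs:
    forms_by_lemma[lemma].append(text)

for lemma, forms in forms_by_lemma.items():
    print(lemma, "->", forms)
```

Here the three verb forms collapse onto `run`, while the noun `runner` keeps its own lemma.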


### Part-of-speech (POS) tagging

In [19]:
doc = nlp('President tacitly endorsed the strike following his meetings with Netanyahu.')

In [20]:
for token in doc:
    print(token.text, token.pos_)

President PROPN
tacitly ADV
endorsed VERB
the DET
strike NOUN
following VERB
his PRON
meetings NOUN
with ADP
Netanyahu PROPN
. PUNCT


**<span class="mark">TODO</span>**: For the same text, print each token's lemma along with its text and POS tag.

In [21]:
# Your code below

for token in doc:
    print(token.text, token.pos_, token.lemma_)

President PROPN President
tacitly ADV tacitly
endorsed VERB endorse
the DET the
strike NOUN strike
following VERB follow
his PRON his
meetings NOUN meeting
with ADP with
Netanyahu PROPN Netanyahu
. PUNCT .
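
Once tokens carry POS tags, simple summaries become easy, such as counting tags or keeping only the nouns. The sketch below works on the (text, POS) pairs copied from the output above rather than calling the model. `tagged`, `pos_counts`, and `nouns` are hypothetical names.

```python
from collections import Counter

# (token text, POS tag) pairs copied from the output above.
tagged = [
    ("President", "PROPN"), ("tacitly", "ADV"), ("endorsed", "VERB"),
    ("the", "DET"), ("strike", "NOUN"), ("following", "VERB"),
    ("his", "PRON"), ("meetings", "NOUN"), ("with", "ADP"),
    ("Netanyahu", "PROPN"), (".", "PUNCT"),
]

# Count how often each tag occurs.
pos_counts = Counter(pos for _, pos in tagged)

# Keep only the common nouns.
nouns = [text for text, pos in tagged if pos == "NOUN"]

print(pos_counts.most_common())
print(nouns)
```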


## Entity Detection

In [22]:
doc = nlp("New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.")

In [23]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

New York City 0 13 GPE
Tuesday 17 24 DATE
At least 285 217 229 CARDINAL
September 279 288 DATE
Brooklyn 300 308 GPE
four 355 359 CARDINAL
Zip 360 363 PERSON
Bill de Blasio 383 397 PERSON
Tuesday 407 414 DATE
Orthodox 501 509 NORP
6 months old 576 588 DATE
up to $1,000 624 636 MONEY
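
The `start_char` and `end_char` values are character offsets into the original text, with the end offset exclusive, so slicing the text with them recovers the entity string. The sketch below checks this on the opening words of the text, using the first two rows of the output above, and then builds a simple inline annotation. `snippet`, `entities`, and `annotate` are hypothetical names.

```python
snippet = "New York City on Tuesday declared a public health emergency"

# (entity text, start_char, end_char, label) rows copied from the output above.
entities = [("New York City", 0, 13, "GPE"), ("Tuesday", 17, 24, "DATE")]

# Slicing with the offsets gives back the entity text.
for ent_text, start, end, label in entities:
    print(repr(snippet[start:end]), label)


def annotate(text, ents):
    """Wrap each entity span as [text | LABEL], working right to left
    so that earlier offsets stay valid while the string grows."""
    for ent_text, start, end, label in sorted(ents, key=lambda e: e[1], reverse=True):
        text = text[:start] + f"[{text[start:end]} | {label}]" + text[end:]
    return text


print(annotate(snippet, entities))
```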


In [24]:
displacy.render(doc, style = 'ent')

### Putting everything together

In [25]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
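
The `shape_` column abstracts each token into a pattern: uppercase letters become `X`, lowercase letters `x`, digits `d`, and other characters are kept. Judging from the output above, runs of the same shape character are also capped at four. The function below is a sketch that reproduces the shapes shown above under those assumptions. It is an approximation written for this notebook, not spaCy's code, and `toy_shape` is a hypothetical name.

```python
def toy_shape(text, max_run=4):
    """Approximate a token's word shape, capping repeated shape characters."""
    shape = []
    last = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        # Track how long the current run of identical shape characters is.
        if shape_char == last:
            run_length += 1
        else:
            last = shape_char
            run_length = 1
        if run_length <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1", "billion"]:
    print(word, toy_shape(word))
```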


**<span class="mark">TODO</span>**: Print all the named entities for the following text:

"WHO recommends use of AstraZeneca Covid-19 vaccine as two-dose shot, 8 to 12 weeks apart"


In [26]:
## Your code below

In [27]:
doc = nlp("Double masking can block over 92% of potentially infectious particles from escaping, CDC study says")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

over 92% 25 33 PERCENT
CDC 85 88 ORG


In [28]:
text = "WHO recommends use of AstraZeneca Covid-19 vaccine as two-dose shot, 8 to 12 weeks apart"

doc = nlp(text)
print('TOKEN, LABEL, start, end\n-----------------------------')
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

TOKEN, LABEL, start, end
-----------------------------
AstraZeneca PRODUCT 22 33
two CARDINAL 54 57
8 to 12 weeks DATE 69 82
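
Because entity texts vary in length, the columns in the output above do not line up. One option is to pad each field with f-string format specifiers. The sketch below formats the rows copied from the output above; `rows` and `format_entity_table` are hypothetical names, and the column widths are arbitrary choices.

```python
# (entity text, label, start_char, end_char) rows copied from the output above.
rows = [
    ("AstraZeneca", "PRODUCT", 22, 33),
    ("two", "CARDINAL", 54, 57),
    ("8 to 12 weeks", "DATE", 69, 82),
]


def format_entity_table(rows):
    """Return the rows as a fixed-width text table."""
    header = f"{'ENTITY':<15}{'LABEL':<10}{'START':>6}{'END':>6}"
    lines = [header, "-" * len(header)]
    for text, label, start, end in rows:
        # Left-align the text columns, right-align the numeric columns.
        lines.append(f"{text:<15}{label:<10}{start:>6}{end:>6}")
    return "\n".join(lines)


print(format_entity_table(rows))
```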


**spaCy tags for POS and NER**

https://spacy.io/api/annotation