## Introduction to spaCy
spaCy is one of the most popular NLP libraries, alongside NLTK. The basic difference between the two is that NLTK offers a wide variety of algorithms for each problem, whereas spaCy typically ships a single, carefully chosen implementation for each task.

NLTK was released back in 2001, while spaCy is relatively new: it was first released in 2015. In this series of articles on NLP, we will mostly work with spaCy because of its state-of-the-art design. However, we will also touch on NLTK when a task is easier to perform with it.

## Installing spaCy
If you use the pip installer to install your Python libraries, go to the command line and execute the following statement:

    pip install -U spacy

In [1]:
!pip install -U spacy

Collecting spacy
  Downloading https://files.pythonhosted.org/packages/b7/f2/052bfe5861761599b5421916aba3eb0064d83145ff3072390ecdc5a836de/spacy-2.2.3.tar.gz (5.9MB)
    100% |████████████████████████████████| 5.9MB 215kB/s
Collecting blis<0.5.0,>=0.4.0 (from spacy)
  Downloading https://files.pythonhosted.org/packages/db/db/bfae863870f79260e57e293dd835e848e8450d2a2c9e273795b13060ff86/blis-0.4.1-cp27-cp27mu-manylinux1_x86_64.whl (3.7MB)
    100% |████████████████████████████████| 3.7MB 260kB/s
Collecting catalogue<1.1.0,>=0.0.7 (from spacy)
  Downloading https://files.pythonhosted.org/packages/4b/4c/0e0fa8b1e193c1e09a6b72807ff4ca17c78f68f0c0f4459bc8043c66d649/catalogue-0.2.0-py2.py3-none-any.whl
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading https://files.pythonhosted.org/packages/ce/8d/d095bbb109a004351c85c83bc853782fc27692693b305dd7b170c36a1262/cymem-2.0.3.tar.gz (51kB)
    100% |████████████████████████████████| 51kB 3.7MB/s

  Running setup.py bdist_wheel for scandir ... done
  Stored in directory: /home/synerzip/.cache/pip/wheels/91/95/75/19c98a91239878abbc7c59970abd3b4e0438a7dd5b61778335
Successfully built spacy cymem wasabi pathlib scandir
Installing collected packages: numpy, blis, six, more-itertools, zipp, configparser, contextlib2, scandir, pathlib2, importlib-metadata, catalogue, cymem, murmurhash, plac, preshed, urllib3, certifi, chardet, idna, requests, setuptools, pathlib, srsly, tqdm, wasabi, thinc, spacy
Successfully installed blis-0.4.1 catalogue-0.2.0 certifi-2019.11.28 chardet-3.0.4 configparser-4.0.2 contextlib2-0.6.0.post1 cymem-2.0.3 idna-2.8 importlib-metadata-1.3.0 more-itertools-5.0.0 murmurhash-1.0.2 numpy-1.16.5 pathlib-1.0.1 pathlib2-2.3.5 plac-1.1.3 preshed-3.0.2 requests-2.22.0 scandir-1.10.0 setuptools-42.0.2 six-1.13.0 spacy-2.2.3 srsly-0.2.0 thinc-7.3.1 tqdm-4.41.0 urllib3-1.25.7 wasabi-0.4.2 zipp-0.6.0


## Downloading the spaCy Model
Once you download and install spaCy, the next step is to download the language model. We will be using the English language model. The language model is used to perform a variety of NLP tasks, which we will see in a later section.

The following command downloads the language model:

    python -m spacy download en_core_web_sm

In [None]:
!python3 -m spacy download en_core_web_sm
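
Before moving on, it can be handy to confirm that both the library and the model package are visible to Python. The following is a minimal sketch that uses only the standard library; the helper name is_installed is our own and not part of spaCy, and it relies on the fact that downloaded spaCy models are installed as regular Python packages.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Check the library itself and the small English model package.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either name is reported as missing, re-run the corresponding install or download command in the same environment that the notebook kernel uses.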

As a first step, you need to import the spacy library as follows:



In [3]:
import spacy

Next, we need to load the spaCy language model.



In [4]:
sp = spacy.load('en_core_web_sm')

In the script above we use the load function from the spacy library to load the core English language model. The model is stored in the sp variable.

Let's now create a small document using this model. A document can be a single sentence or a group of sentences; by default, spaCy accepts texts of up to 1,000,000 characters (the max_length setting of the loaded model object). The following script creates a simple spaCy document.

In [6]:
sentence = sp(u'Hello from Stackabuse. The site with the best Python Tutorials. What are you looking for?')

In [22]:
# sentence tokenization
sentence = sp(u'Hello from Stackabuse. The site with the best Python Tutorials. What are you looking for?')
# use a separate loop variable so the document itself is not overwritten
for sent in sentence.sents:
    print(sent)

Hello from Stackabuse.
The site with the best Python Tutorials.
What are you looking for?
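
To see why sentence tokenization is more than splitting on periods, here is a small sketch of a naive, regex-based splitter. It is not how spaCy works (spaCy derives sentence boundaries from its parsed document); the function name naive_sentences and the second example sentence are our own, made up for illustration.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


simple = ("Hello from Stackabuse. The site with the best Python Tutorials. "
          "What are you looking for?")
tricky = "Apple is looking at buying U.K. startup for $1 billion. The deal is not final."

# Works for plain sentences ...
print(naive_sentences(simple))
# ... but breaks after the abbreviation "U.K.", giving three pieces for two sentences.
print(naive_sentences(tricky))
```

The abbreviation case is exactly the kind of input where a trained pipeline tends to do better than a hand-written rule.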


In [23]:
# word tokenization
sentence = sp(u'Hello from Stackabuse. The site with the best Python Tutorials. What are you looking for?')
for word in sentence:
    print(word.text)

Hello
from
Stackabuse
.
The
site
with
the
best
Python
Tutorials
.
What
are
you
looking
for
?


In [24]:
# iterating over a plain Python string yields characters, not word tokens
sentence = 'Hello from Stackabuse. The site with the best Python Tutorials. What are you looking for?'
for word in sentence:
    print(word)

H
e
l
l
o
 
f
r
o
m
 
S
t
a
c
k
a
b
u
s
e
.
 
T
h
e
 
s
i
t
e
 
w
i
t
h
 
t
h
e
 
b
e
s
t
 
P
y
t
h
o
n
 
T
u
t
o
r
i
a
l
s
.
 
W
h
a
t
 
a
r
e
 
y
o
u
 
l
o
o
k
i
n
g
 
f
o
r
?
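
The output above shows that a plain string has no notion of words. As a rough illustration of what a tokenizer adds, the sketch below compares str.split() with a simple regular expression that separates punctuation from words. This is only an approximation under our own assumptions; spaCy's tokenizer applies many more rules (for example for abbreviations and contractions).

```python
import re

text = ("Hello from Stackabuse. The site with the best Python Tutorials. "
        "What are you looking for?")

# Whitespace splitting leaves punctuation glued to the words ("Stackabuse.").
whitespace_tokens = text.split()

# Runs of word characters, or any single character that is neither word nor space.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_tokens), whitespace_tokens[:3])
print(len(regex_tokens), regex_tokens[:4])
```

For this sentence the regex version yields the same 18 tokens that spaCy printed earlier, while whitespace splitting yields 15.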


In [27]:
sentence = sp(u'Hello from Stackabuse. The site with the best Python Tutorials. What are you looking for?')
for word in sentence:
    print(word.text)

Hello
from
Stackabuse
.
The
site
with
the
best
Python
Tutorials
.
What
are
you
looking
for
?


In [31]:
sentence[4]

The

In [32]:
print(sentence[4].is_sent_start)

True


## Parts of Speech (POS) Tagging
Part-of-speech (POS) tagging assigns a part of speech, such as noun, verb, or adjective, to each word in a sentence. Unlike phrase matching, which operates on whole sentences or multi-word spans, POS tagging is performed at the token level.

In [40]:
doc = sp("Apple is looking at buying U.K. startup for $1 billion")
print(doc.text)
# print each token with its coarse POS tag and its dependency label
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple is looking at buying U.K. startup for $1 billion
Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj
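
Once every token carries a tag, the tags can be used like any other data. The sketch below hard-codes the (text, POS) pairs from the output above so that it runs without a model; with a real document you could build the same list with a comprehension over the tokens. The variable names are our own.

```python
from collections import Counter

# (token text, coarse POS tag) pairs copied from the output above.
tagged = [
    ("Apple", "PROPN"), ("is", "AUX"), ("looking", "VERB"), ("at", "ADP"),
    ("buying", "VERB"), ("U.K.", "PROPN"), ("startup", "NOUN"), ("for", "ADP"),
    ("$", "SYM"), ("1", "NUM"), ("billion", "NUM"),
]

# How often does each part of speech occur?
pos_counts = Counter(pos for _, pos in tagged)
print(pos_counts)

# Keep only nouns, proper nouns, and verbs as rough "content words".
content_words = [text for text, pos in tagged if pos in {"NOUN", "PROPN", "VERB"}]
print(content_words)
```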


In [43]:
import spacy

sp = spacy.load("en_core_web_sm")
doc = sp('How are you man?')

# print the coarse POS tag, then an explanation of the fine-grained tag
for token in doc:
    print(token.text,"--->",token.pos_)
    print(spacy.explain(token.tag_))

How ---> ADV
wh-adverb
are ---> AUX
verb, non-3rd person singular present
you ---> PRON
pronoun, personal
man ---> NOUN
noun, singular or mass
? ---> PUNCT
punctuation mark, sentence closer


In [50]:
import spacy

sp = spacy.load("en_core_web_sm")
doc = sp('Tit for tat')

# compare the coarse tag (pos_) with the fine-grained tag (tag_)

for token in doc:
    print('pos -->', token.text,token.pos_, spacy.explain(token.pos_))
    print('tag-->', token.text,token.tag_, spacy.explain(token.tag_))

pos --> Tit VERB verb
tag--> Tit VB verb, base form
pos --> for ADP adposition
tag--> for IN conjunction, subordinating or preposition
pos --> tat NOUN noun
tag--> tat NN noun, singular or mass


In [44]:
import spacy

sp = spacy.load("en_core_web_sm")
doc = sp("My name is Gayatri123s")


for token in doc:
    print(token.text,token.tag_, spacy.explain(token.tag_))

My PRP$ pronoun, possessive
name NN noun, singular or mass
is VBZ verb, 3rd person singular present
Gayatri123s NNP noun, proper singular


In [37]:
import spacy

sp = spacy.load("en_core_web_sm")
# dep_ holds each token's syntactic dependency label
doc = sp("My name is Gayatri123s")


for token in doc:
    print(token.text,token.dep_)

My poss
name nsubj
is ROOT
Gayatri123s attr


In [54]:
import spacy

sp = spacy.load("en_core_web_sm")
# shape_ abstracts each token's casing, digits, and punctuation
doc = sp("Apple is looking at buying U.K. startup for $1 billion")


for token in doc:
    print(token.text,token.shape_)

Apple Xxxxx
is xx
looking xxxx
at xx
buying xxxx
U.K. X.X.
startup xxxx
for xxx
$ $
1 d
billion xxxx
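
The shape_ values above follow a simple pattern: uppercase letters become X, lowercase letters become x, digits become d, other characters are kept, and long runs of the same class are cut off after four characters. The function below is our own approximation of that behaviour, written to reproduce the output shown here; it is not spaCy's source code.

```python
def word_shape(text):
    """Approximate a token's shape: X/x/d classes, runs capped at 4."""
    shape = []
    last_class = ""
    run_length = 0
    for ch in text:
        if ch.isalpha():
            char_class = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            char_class = "d"
        else:
            char_class = ch  # punctuation and symbols are kept as-is
        run_length = run_length + 1 if char_class == last_class else 1
        last_class = char_class
        if run_length <= 4:
            shape.append(char_class)
    return "".join(shape)


for word in ["Apple", "is", "looking", "U.K.", "$", "1", "billion"]:
    print(word, word_shape(word))
```

Shapes are useful as features because they let a model treat "Apple" and "Google" alike even if it has seen only one of them.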


In [62]:
import spacy

sp = spacy.load("en_core_web_sm")
doc = sp("My name is demo 2534 s89")
# list every attribute and method available on a token
print(dir(doc[0]))

for token in doc:
    print(token.text,token.is_alpha)
    print(token.text,token.is_digit)
    print(token.text, token.is_oov)

['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'morph', 'n_lefts', 'n_rights', 'nb
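
The is_alpha and is_digit flags behave much like Python's built-in string methods, so their meaning can be explored without loading a model. The sketch below is only an analogy on whitespace-separated words, assuming that correspondence; is_oov has no standard-library counterpart because it depends on the model's vocabulary.

```python
words = "My name is demo 2534 s89".split()

# (word, consists only of letters?, consists only of digits?)
flags = [(word, word.isalpha(), word.isdigit()) for word in words]

for word, alpha, digit in flags:
    print(word, alpha, digit)
```

Note that a mixed token such as "s89" is neither purely alphabetic nor purely numeric, so both checks return False for it.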

In [84]:
import spacy

nlp = spacy.load("en_core_web_sm")
# is_stop is True for tokens that appear in the language's stop word list
doc = nlp("My name is Gayatri123s is and the all aditi")


for token in doc:
    print(token.text,token.is_stop)

My True
name True
is True
Gayatri123s False
is True
and True
the True
all True
aditi False


## Chunking
Now that we know the parts of speech, we can do what is called chunking: grouping words into meaningful phrases. One of the main goals of chunking is to extract noun phrases, that is, phrases built around a noun together with the words that describe it, such as determiners and adjectives. The idea is to group nouns with the words that relate to them. In spaCy, these base noun phrases are available through the noun_chunks attribute of a document.

In [75]:
text = "The idea is to group nouns with the words that are in relation to them"
doc = sp(text)
# print all noun chunks on one line, then one chunk per line
print(*doc.noun_chunks)
for nc in doc.noun_chunks:
    print(nc)
    for token in nc:
        # flag personal pronouns (fine-grained tag PRP) inside a chunk
        if token.tag_ == 'PRP':
            print('inside if', token.text,"===",token.tag_)

The idea group nouns the words relation them
The idea
group nouns
the words
relation
them
inside if them === PRP
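
To make the idea of chunking concrete, here is a toy, rule-based chunker that works on already tagged tokens. It is a sketch under our own assumptions, not spaCy's method (spaCy derives noun chunks from the dependency parse): the tags below were assigned by hand for illustration, and the rule simply collects runs of determiners, adjectives, and nouns that end in a noun.

```python
def simple_noun_chunks(tagged_tokens):
    """Group runs of DT/JJ/NN* tokens that end in a noun into chunks."""
    chunks, current = [], []
    # A sentinel entry flushes the last run at the end of the input.
    for word, tag in tagged_tokens + [("", "END")]:
        if tag in ("DT", "JJ") or tag.startswith("NN"):
            current.append((word, tag))
            continue
        # The run has ended: keep it only if its last token is a noun.
        if current and current[-1][1].startswith("NN"):
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks


# Hand-assigned (hypothetical) tags for the start of the example sentence.
tagged = [
    ("The", "DT"), ("idea", "NN"), ("is", "VBZ"), ("to", "TO"),
    ("group", "VB"), ("nouns", "NNS"), ("with", "IN"),
    ("the", "DT"), ("words", "NNS"),
]

print(simple_noun_chunks(tagged))
```

Because "group" is tagged by hand as a verb here, the toy chunker returns "nouns" on its own, whereas the spaCy output above contains "group nouns". The chunks you get depend directly on the tags and the parse underneath.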


## Visualizing Parts of Speech Tags
Visualizing POS tags and the dependency parse graphically is straightforward with the displacy module from the spacy library. To render the visualization inside a Jupyter notebook, call the render method of the displacy module and pass it the spaCy document, the style of the visualization, and jupyter=True, as shown below:

In [81]:
from spacy import displacy

# the document to visualize must be defined before it is rendered
sen = sp("I like to play football. I hated it in my childhood though")

displacy.render(sen, style='dep', jupyter=True, options={'distance': 120})

In [92]:
from spacy import displacy
 
# style='ent' highlights the named entities recognized in the text
doc = sp("On Thursday, Vijayawada police had house-arrested TDP MP Kesineni Srinivas, MLA Buddha Venkanna.Ahead of the meeting, unprecedented security arrangements have been made in Amaravati region, particularly in Mandadam village and the roads leading to the Secretariat at Velagapudi in.I just bought 2 shares at 9 a.m. because the stock went up 30% in 3/4/2019 have $2 just 2 days according to the WSJ'")
displacy.render(doc, style='ent', jupyter=True)
 

## Why do we Need to Remove Stopwords?
Removing stopwords is not a hard and fast rule in NLP; it depends on the task at hand. For tasks like text classification, where the text is assigned to different categories, stopwords are usually removed so that more weight falls on the words that carry the meaning of the text.
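
The effect is easy to see with a word count. In the sketch below, both the sentence and the three-word stop list are made up for illustration (spaCy's English list is much longer); the point is only that the most frequent word changes from a function word to a content word once stop words are filtered out.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat on the rug"
stop_words = {"the", "on", "and"}  # tiny hand-made list, for illustration only

words = text.split()
top_before = Counter(words).most_common(1)

content_words = [word for word in words if word not in stop_words]
top_after = Counter(content_words).most_common(1)

print("Most frequent word before filtering:", top_before)
print("Most frequent word after filtering: ", top_after)
```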

In [95]:
# Stop words collection
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS
print('Number of stop words: %d' % len(spacy_stopwords))
# STOP_WORDS is a set, so the order of this sample is arbitrary
print('First ten stop words: %s' % list(spacy_stopwords)[:10])

Number of stop words: 326
First ten stop words: ['through', 'six', 'next', 'my', 'its', 'yourself', '‘re', 'otherwise', 'own', 'are']


In [107]:
# remove stop words by filtering on token.is_stop
doc = sp("Lemmatization is the process of converting a word to its base form.")
tokens = [token.text for token in doc if not token.is_stop]
print('Original Article: %s' % (doc))
print()
print(tokens)

Original Article: Lemmatization is the process of converting a word to its base form.

['Lemmatization', 'process', 'converting', 'word', 'base', 'form', '.']


## Lemmatization and Stemming
Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just strips the last few characters, often producing incorrect meanings and spelling errors (for example, a stemmer turns "scary" into "scari").
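
The contrast can be sketched in a few lines of plain Python. Both functions below are deliberately simplistic and entirely our own: naive_stem strips a suffix without knowing anything about the word, while lookup_lemma consults a small hand-written table. Real stemmers and lemmatizers are far more elaborate, and a real lemmatizer also takes the part of speech into account, which this toy ignores.

```python
def naive_stem(word):
    """Strip one common suffix, with no knowledge of the word itself."""
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical, hand-written lemma table; a real lemmatizer covers far more.
LEMMAS = {"running": "run", "ran": "run", "was": "be"}


def lookup_lemma(word):
    """Return the base form from the table, or the word itself if unknown."""
    return LEMMAS.get(word, word)


for word in ("running", "ran", "was", "hastily"):
    print(word, "| stem:", naive_stem(word), "| lemma:", lookup_lemma(word))
```

The stemmer produces non-words such as "runn" and "hasti" and cannot relate "ran" to "run", whereas the lookup maps both "running" and "ran" to the same base form.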

In [15]:
from nltk.stem.snowball import SnowballStemmer
# Stemming
stemmer = SnowballStemmer(language='english')

# tokenize with spaCy, then stem each token with NLTK's Snowball stemmer
sentence = sp('She was hastily running away from the scary bear')
for token in sentence:
    print(token.text + ' --> ' + stemmer.stem(token.text))

She --> she
was --> was
hastily --> hastili
running --> run
away --> away
from --> from
the --> the
scary --> scari
bear --> bear


In [16]:
# Lemmatization
sentence = "She was hastily running away from the scary bear"
# Parse the sentence using the loaded model object `sp`
doc = sp(sentence)
for token in doc:
    print(token.lemma_)


-PRON-
be
hastily
run
away
from
the
scary
bear


## Detecting Entities
In addition to tokenizing documents into words, you can also check whether a word is part of a named entity such as a company, place, building, currency, or institution.

Let's see a simple example of named entity recognition:

In [27]:
sentence5 = sp('When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously.')

In [28]:
for word in sentence5:
    print(word.text)

When
Sebastian
Thrun
started
working
on
self
-
driving
cars
at
Google
in
2007
,
few
people
outside
of
the
company
took
him
seriously
.


This is where named entity recognition comes into play. To get the named entities from a document, use the ents attribute. Let's retrieve the named entities from the sentence above:

In [32]:
for entity in sentence5.ents:
    # print the entity span, then its text, label, and the label's explanation
    print(entity)
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Sebastian
Sebastian - NORP - Nationalities or religious or political groups
Google
Google - ORG - Companies, agencies, institutions, etc.
2007
2007 - DATE - Absolute or relative dates or periods
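
Note that the model labelled "Sebastian" as NORP and did not return "Thrun" at all; statistical models make mistakes like this, so entity output is worth inspecting before relying on it. A common next step is to group entities by label. The sketch below hard-codes the (text, label) pairs from the output above so that it runs without a model; the variable names are our own.

```python
from collections import defaultdict

# (entity text, entity label) pairs copied from the output above.
entities = [("Sebastian", "NORP"), ("Google", "ORG"), ("2007", "DATE")]

# Collect the entity texts under each label.
entities_by_label = defaultdict(list)
for text, label in entities:
    entities_by_label[label].append(text)

for label, texts in entities_by_label.items():
    print(label, texts)
```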
