In [8]:
# Import the language class of your choice; I am starting with English
import spacy
from spacy.lang.en import English

In [2]:
#making an nlp object
nlp=English()

In [20]:
# Create a list of sentences; each sentence is wrapped in its own single-item list
sent=[["Howdy Stranger, where have you been?"],["My day was rough today, I was taken down by my friends and it was terrible"],["I also bought a pair of blue glasses but they delivered pink instead and now I am no one but a mermaid"]]

In [30]:
# nlp() returns a Doc, which gives access to the tokens and their attributes.
# sent[0] is a one-item list, so * unpacks it into the single string nlp() expects.
doc=nlp(*sent[0])
# print(*sent[0])

In [31]:
doc.text

'Howdy Stranger, where have you been?'

In [7]:
from spacy.lang.de import German


In [10]:
nlp1=German()
doc1=nlp1('Liebe Grüße!')
doc1.text

'Liebe Grüße!'

## When you call nlp on a string, spaCy first tokenizes the text and creates a document object.


Because nlp has already tokenized the text for us, we can access individual tokens by indexing into the doc.

In [33]:
doc[5]

you

In [35]:
for i in doc:
    print(i)
    print('-----')

Howdy
-----
Stranger
-----
,
-----
where
-----
have
-----
you
-----
been
-----
?
-----
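The indexing described above can be sketched on its own. This is a minimal example, assuming a blank English pipeline created with spacy.blank("en"), which only tokenizes and needs no model download; the variable names are illustrative.

```python
import spacy

# A blank English pipeline contains only the tokenizer; no trained model is needed.
nlp = spacy.blank("en")
doc = nlp("Howdy Stranger, where have you been?")

# len(doc) counts tokens, not characters.
print(len(doc))

# Each Token knows its own position (token.i) and its text.
for token in doc:
    print(token.i, token.text)

# Negative indices work as they do for Python lists.
last_token = doc[-1]
print(last_token.text)
```

Punctuation such as the comma and the question mark becomes a token of its own, which is why doc[5] is "you" and not "been".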


If we slice the doc, the result is a Span object rather than a list of tokens.


In [36]:
# Store the slice under a new name so the original doc is not overwritten
span=doc[1:4]
print(span.text)

Stranger, where
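To see what a slice really is, here is a small sketch (again assuming a blank English pipeline, with illustrative variable names) that checks the type of the result and the positions it remembers.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Howdy Stranger, where have you been?")

# Slicing a Doc returns a Span: a view over tokens 1, 2 and 3 (the end index is exclusive).
span = doc[1:4]
print(type(span).__name__, "|", span.text)

# A Span remembers where it sits in the parent Doc.
print(span.start, span.end)

# Indexing a Span gives back Token objects, counted from the start of the Span,
# while token.i still reports the position in the parent Doc.
first = span[0]
print(first.text, first.i)
```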


In [43]:
# tok.like_num is True for number-like tokens; tok.i is the token's index, so doc3[tok.i+1] is the next token
doc3=nlp('Would you find a number 5 in the sentence?')
[(tok,doc3[tok.i+1]) for tok in doc3 if tok.like_num]

[(5, in)]
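One caveat with the tok.i+1 lookup: it raises an IndexError when the number is the last token. A hedged sketch of a safer version follows; numbers_with_next is a hypothetical helper name, not part of spaCy, and a blank English pipeline is assumed.

```python
import spacy

nlp = spacy.blank("en")

def numbers_with_next(text):
    """Return (number, next_token_text) pairs; next is None when the number ends the doc."""
    doc = nlp(text)
    pairs = []
    for tok in doc:
        if tok.like_num:
            # Only look ahead if a following token actually exists.
            nxt = doc[tok.i + 1].text if tok.i + 1 < len(doc) else None
            pairs.append((tok.text, nxt))
    return pairs

print(numbers_with_next("Would you find a number 5 in the sentence?"))
print(numbers_with_next("I paid 20"))
```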

### Statistical Models

Statistical models enable spaCy to make predictions in context. These usually include part-of-speech tags, syntactic dependencies and named entities.

Models are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

The "en_core_web_sm" package is a small English model that supports all core capabilities and is trained on web text.

In [2]:
# If spacy.load('en_core_web_sm') raises an error, importing the model package directly is a workaround

import en_core_web_sm
nlp = en_core_web_sm.load()

Now that we have loaded the English model and created the nlp object, let's get started with more cool stuff.

For each token in the doc, we can print the text and the .pos_ attribute, the predicted part-of-speech tag.

In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an integer ID value.

In [62]:
doc4=nlp('I was eating a cookie I baked in quarantine and it tasted terrible')

In [64]:
for tok in doc4:
    print(tok,tok.pos_)

I PRON
was VERB
eating VERB
a DET
cookie NOUN
I PRON
baked VERB
in ADP
quarantine NOUN
and CCONJ
it PRON
tasted VERB
terrible ADJ
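The underscore convention is easy to check on attributes that need no trained model. The sketch below assumes a blank English pipeline, so it uses orth and lower instead of pos (which stays empty without a tagger); the variable names are illustrative.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I was eating a cookie")
tok = doc[4]

# Without the underscore: an integer ID (a hash of the string).
print(tok.orth, tok.lower)

# With the underscore: the human-readable string.
print(tok.orth_, tok.lower_)

# The vocab's StringStore maps between the two in both directions.
as_text = nlp.vocab.strings[tok.orth]
as_id = nlp.vocab.strings["cookie"]
print(as_text, as_id == tok.orth)
```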


In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The .dep_ attribute returns the predicted dependency label.

The .head attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.



In [68]:
for tok in doc4:
    print(tok,tok.dep_)
#     print(tok.head)

I nsubj
was aux
eating ROOT
a det
cookie dobj
I nsubj
baked relcl
in prep
quarantine pobj
and cc
it nsubj
tasted conj
terrible acomp
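To make the head relation concrete without loading a model, here is a sketch that builds a tiny Doc with hand-written annotations. It assumes spaCy v3, where the Doc constructor accepts heads as absolute token indices; the sentence, heads and labels are made up for illustration, not predicted.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations for illustration only; a trained pipeline would predict these.
words = ["She", "ate", "the", "cookie"]
heads = [1, 1, 3, 1]  # absolute index of each token's head (spaCy v3 convention)
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

for tok in doc:
    print(tok.text, tok.dep_, "<-", tok.head.text)

# The root is its own head; its children are the tokens attached directly to it.
root = [tok for tok in doc if tok.head.i == tok.i][0]
children = [child.text for child in root.children]
print(root.text, children)
```

Walking from any token through .head eventually reaches the root, which is how the dependency tree hangs together.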


## How about plotting these dependencies?

In [85]:
# !pip install displacy
!pip install "msgpack-numpy<0.4.4.0"

Collecting msgpack-numpy<0.4.4.0
  Downloading msgpack_numpy-0.4.3.2-py2.py3-none-any.whl (5.2 kB)
Installing collected packages: msgpack-numpy
  Attempting uninstall: msgpack-numpy
    Found existing installation: msgpack-numpy 0.4.4.3
    Uninstalling msgpack-numpy-0.4.4.3:
      Successfully uninstalled msgpack-numpy-0.4.4.3
Successfully installed msgpack-numpy-0.4.3.2


In [3]:
from spacy import displacy


In [90]:
# Raises an error in a local Jupyter notebook but works in Colab
# displacy.render(doc4, style='dep', jupyter=True, options={'distance': 90})

# displacy.serve(doc4)

nlp = spacy.load("en_core_web_sm")
doc1 = nlp("This is a sentence.")
doc2 = nlp("This is another sentence.")
displacy.serve([doc1, doc2], style="dep")

In [4]:
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
# nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")


    Serving on port 5000...
    Using the 'ent' visualizer


    Shutting down server on port 5000.
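displacy.serve blocks the notebook until the server is stopped. A possible alternative is to let displacy.render return the markup as a string and save it to a file. This is a sketch under assumptions: the entities are set by hand because a blank pipeline predicts none, and entities.html is a hypothetical output path.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Sebastian Thrun worked at Google in 2007.")

# Entities are set by hand here because a blank pipeline predicts none.
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 4, 5, label="ORG")]

# jupyter=False makes render() return the markup as a string;
# page=True wraps it in a complete HTML page.
html = displacy.render(doc, style="ent", jupyter=False, page=True)

# Hypothetical output path; open the file in a browser to view it.
with open("entities.html", "w", encoding="utf-8") as f:
    f.write(html)
```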



# NER

Each entity in doc.ents is a Span with these attributes:

- Text: The original entity text. Use ent.text
- Start: Character offset of the start of the entity in the Doc's text. Use ent.start_char
- End: Character offset of the end of the entity in the Doc's text. Use ent.end_char
- Label: Entity label, i.e. type. Use ent.label_

In [5]:
#Entities and Labels

for i in doc.ents:
    print(i.text,i.label_)
    

Darshna ORG


In [6]:
for i in doc.ents:
    print(i.text,i.start_char,i.end_char)

Darshna 88 95
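The difference between token indices and character offsets is easy to mix up, so here is a small sketch. It assumes a blank English pipeline and sets one entity by hand for illustration; a trained pipeline would fill doc.ents itself.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
text = "Sebastian Thrun worked at Google in 2007."
doc = nlp(text)

# Set by hand for illustration; a trained pipeline would predict the entities.
doc.ents = [Span(doc, 4, 5, label="ORG")]

for ent in doc.ents:
    # start/end are token indices; start_char/end_char are character offsets.
    print(ent.text, ent.label_, ent.start, ent.end, ent.start_char, ent.end_char)
    # Character offsets slice the raw string, token indices slice the Doc.
    print(text[ent.start_char:ent.end_char], "|", doc[ent.start:ent.end].text)
```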


A quick tip: To get definitions for the most common tags and labels, you can use the spacy.explain helper function.

For example, "GPE" for geopolitical entity isn't exactly intuitive – but spacy.explain can tell you that it refers to countries, cities and states.

The same works for part-of-speech tags and dependency labels.

In [22]:
print(spacy.explain("GPE"))

Countries, cities, states
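The same helper can be tried on a part-of-speech tag and a dependency label. A short sketch; NOT_A_LABEL is a made-up string used only to show what happens with an unknown label.

```python
import spacy

# spacy.explain() looks labels up in a built-in glossary, so no model is required.
for label in ["GPE", "NNP", "dobj", "NOT_A_LABEL"]:
    print(label, "->", spacy.explain(label))
```

For a label that is not in the glossary, the function returns None instead of raising an error.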


# Rule Based Matching


spaCy's Matcher works much like regular expressions, but it operates on Doc and Token objects rather than just strings.

We can search for both text and lexical attributes. The model's predictions can also be used to define the rules.

Let's say we want to match 'Love' only when it is a noun and not a verb.

A matcher pattern is a list of dictionaries, one per token. Each dictionary maps token attribute names (keys) to their expected values.

In [42]:
# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

In [25]:
pat1=[{'TEXT':'Love'},{'TEXT':'Hard'}]
pat2=[{'LEMMA':'enemy'},{'LOWER':'closer'}]  # LEMMA so that 'enemies' can match as well
pat3=[{'LEMMA':'eat'},{'POS':'NOUN'}]

In [29]:
text1="It's hard to love someone but every song says Love Hard"
text2="Keep your frns close but enemies closer"
text3="I dunno how to eat with chopsticks but eating donuts is easier"

We use the matcher.add() method to add a pattern. With the spaCy v2 call signature used here:

- The first argument is a unique ID to identify which pattern was matched.
- The second argument is an optional callback. We don't need one here, so we set it to None.
- The third argument is the pattern.

To match the pattern on a text, we call the matcher on any doc.

This returns the list of matches.
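In spaCy v3 the signature of matcher.add() changed: patterns are passed as a list, and the callback became the optional keyword argument on_match. A minimal sketch of that form, assuming a blank English pipeline (the TEXT attribute needs no trained model):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pat1 = [{"TEXT": "Love"}, {"TEXT": "Hard"}]
# Newer signature: the key, then a list of patterns; a callback would go in on_match=...
matcher.add("love", [pat1])

doc1 = nlp("It's hard to love someone but every song says Love Hard")
matches = matcher(doc1)
matched_texts = [doc1[start:end].text for match_id, start, end in matches]
print(matched_texts)
```

TEXT compares the exact token text, so the lowercase "love" and "hard" earlier in the sentence are not matched.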

In [31]:
matcher.add('love',None,pat1)
doc1=nlp(text1)
matches=matcher(doc1)

In [32]:
matches

[(9223393045180330578, 0, 1),
 (9223397443226579542, 0, 1),
 (9223393045180330578, 1, 2),
 (9223397443226579542, 1, 2),
 (9223393045180330578, 2, 3),
 (9223397443226579542, 2, 3),
 (9223393045180330578, 3, 4),
 (9223397443226579542, 3, 4),
 (9223393045180330578, 4, 5),
 (9223397443226579542, 4, 5),
 (9223393045180330578, 5, 6),
 (9223397443226579542, 5, 6),
 (9223393045180330578, 6, 7),
 (9223397443226579542, 6, 7),
 (9223393045180330578, 7, 8),
 (9223397443226579542, 7, 8),
 (9223393045180330578, 8, 9),
 (9223397443226579542, 8, 9),
 (9223393045180330578, 9, 10),
 (9223397443226579542, 9, 10),
 (9223393045180330578, 10, 11),
 (9223397443226579542, 10, 11),
 (9223393045180330578, 11, 12),
 (9223397443226579542, 11, 12)]

Ew! What does that mean?

Each match is a tuple of (match_id, start index, end index), where the indices are token positions in the doc. A loop makes this easier to read:

In [36]:
for match_id,start,end in matches:
    # Get the matched span
    matched_span = doc1[start:end]
    print(matched_span.text)

It
It
's
's
hard
hard
to
to
love
love
someone
someone
but
but
every
every
song
song
says
says
Love
Love
Hard
Hard
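The long integer in each match is the hash of the pattern's ID, and the vocab's string store can turn it back into the name. The sketch below assumes a blank English pipeline and two made-up single-token rules, and uses the list-of-patterns form of matcher.add().

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("love", [[{"LOWER": "love"}]])
matcher.add("hard", [[{"LOWER": "hard"}]])

doc1 = nlp("It's hard to love someone but every song says Love Hard")

results = []
for match_id, start, end in matcher(doc1):
    # Look the integer ID up in the string store to recover the rule name.
    rule_name = nlp.vocab.strings[match_id]
    results.append((start, rule_name, doc1[start:end].text))

# Sort by start index so the rows read in document order.
for row in sorted(results):
    print(row)
```

Knowing which rule fired is what makes it practical to keep several patterns in one matcher.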


In [46]:
matcher.add('eat',None,pat3)
doc3=nlp(text3)
matches=matcher(doc3)
for match_id,start,end in matches:
    # Get the matched span
    matched_span = doc3[start:end]
    print(matched_span.text)
    break

eating donuts


A pattern can also combine several kinds of token attributes, like this:

In [47]:
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]
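To try this pattern out, here is a sketch on a made-up sentence. It assumes a blank English pipeline, since IS_DIGIT, LOWER and IS_PUNCT are lexical attributes that need no trained model; the rule name FIFA_PATTERN is arbitrary.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]
matcher.add("FIFA_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)
```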

| Example | Description |
| --- | --- |
| {"OP": "!"} | Negation: match 0 times |
| {"OP": "?"} | Optional: match 0 or 1 times |
| {"OP": "+"} | Match 1 or more times |
| {"OP": "*"} | Match 0 or more times |

"OP" can have one of four values:

An "!" negates the token, so it's matched 0 times.

A "?" makes the token optional, and matches it 0 or 1 times.

A "+" matches a token 1 or more times.

And finally, an "*" matches 0 or more times.

Operators can make your patterns a lot more powerful, but they also add more complexity – so use them wisely.

In [48]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]
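The LEMMA and POS keys above need a trained pipeline. To see the "?" operator on its own, here is a sketch that relies only on LOWER and therefore runs on a blank English pipeline; the sentence and the rule name NYC are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "city" is optional, so both "New York" and "New York City" should match.
pattern = [
    {"LOWER": "new"},
    {"LOWER": "york"},
    {"LOWER": "city", "OP": "?"},
]
matcher.add("NYC", [pattern])

doc = nlp("I moved from New York to New York City")
texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(texts)
```

The matcher reports overlapping candidates, so the shorter and the longer span can both show up for the same stretch of text. If only the longest spans are wanted, spacy.util.filter_spans is one way to prune the rest.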