# Introduction to Natural Language Processing

### Tutorial 4

---

This tutorial shows how to use spaCy to obtain the features that we have so far been extracting with a rule-based approach in pure Python, and how to use spaCy to get word vectors.

### spaCy

spaCy is an open-source library for advanced NLP in Python that supports a wide variety of languages. One crucial advantage of spaCy is that it is designed to be integrated into real-world products without serious difficulties.


To begin working with spaCy, we need to specify which language class we want to use. Remember that spaCy was created to be used for several languages (currently 64+ languages and 55 trained pipelines for 17 languages). It can't assume that we want to use English. We need to specify this explicitly.

#### Note

If you haven't installed spaCy, please run the following line in a separate cell:

`!pip install spacy`

In [1]:
import spacy

Let's begin with an example in English. Since we already know how to tokenize a text, let's take a look at how spaCy does this process for us.

In [2]:
sentences = ['I always take tea for breakfast.',
             'It was not my intent to break something.',
             'Bild.de can update content for the channel on-the-fly and thus ensure the neartime presentation of breaking news and information.',
             'We have enough time to take a break.',
             'In case the drive direction is changed the outputs for up and down will be switched off during break time. ']

We want to extract only the sentences that contain the word "break".

In [3]:
def has_break(text):
    return "break" in text

g = (sentence for sentence in sentences if has_break(sentence))
print(type(g))

[next(g) for i in range(5)]

<class 'generator'>


['I always take tea for breakfast.',
 'It was not my intent to break something.',
 'Bild.de can update content for the channel on-the-fly and thus ensure the neartime presentation of breaking news and information.',
 'We have enough time to take a break.',
 'In case the drive direction is changed the outputs for up and down will be switched off during break time. ']
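Note that the plain substring test also accepts _breakfast_ and _breaking_, which is why all five sentences come back. Here is a minimal sketch of a stricter check, assuming that a regular-expression word boundary is an acceptable stand-in for real tokenization. The helper name `has_break_word` is hypothetical and not part of the original tutorial.

```python
import re

sentences = ['I always take tea for breakfast.',
             'It was not my intent to break something.',
             'Bild.de can update content for the channel on-the-fly and thus ensure the neartime presentation of breaking news and information.',
             'We have enough time to take a break.',
             'In case the drive direction is changed the outputs for up and down will be switched off during break time. ']

# \b marks a word boundary, so "breakfast" and "breaking" are rejected
BREAK_WORD = re.compile(r"\bbreak\b", flags=re.IGNORECASE)

def has_break_word(text):
    """Return True only if 'break' occurs as a standalone word."""
    return BREAK_WORD.search(text) is not None

matching = [sentence for sentence in sentences if has_break_word(sentence)]
for sentence in matching:
    print(sentence)
```

A tokenizer such as spaCy's gives us the same word-level view without hand-written patterns, which is what the rest of this tutorial builds on.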

In [4]:
# Import English
from spacy.lang.en import English

nlp = English()
# One sentence/document
raw = "Hard to judge whether, these sides were good. We were grossed " \
      "out by the melted styrofoam and didn't want to eat it for fear of getting sick."

doc = nlp(raw)

In [5]:
print(doc)

Hard to judge whether, these sides were good. We were grossed out by the melted styrofoam and didn't want to eat it for fear of getting sick.


In [6]:
type(doc)

spacy.tokens.doc.Doc

In [7]:
for token in doc:
    print(token.text)

Hard
to
judge
whether
,
these
sides
were
good
.
We
were
grossed
out
by
the
melted
styrofoam
and
did
n't
want
to
eat
it
for
fear
of
getting
sick
.


In [8]:
# Now it's your turn to do the same for the following Spanish text taken from BBC in Spanish.
from spacy.lang.es import Spanish 
 
nlp = Spanish()
 
spanish_raw = '¿Es posible "desconectar" a un país entero de internet? ' \
              'La respuesta corta es "sí".'
 
document = nlp(spanish_raw)
 
for token in document:
    print(token.text)

¿
Es
posible
"
desconectar
"
a
un
país
entero
de
internet
?
La
respuesta
corta
es
"
sí
"
.


### Indexing

spaCy uses the same syntax as Python for indexing. This way you can address specific tokens in your documents.

#### Note
Whitespace can also be a token. The tokenizer splits a document on whitespace (among other rules), and a single space between two words does not become a token of its own. If there are two or more consecutive spaces, however, the extra whitespace is kept as a separate token. In short, what counts as a token depends on how the tokenizer defines it.
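The whitespace behaviour described in this note can be checked directly. The sketch below assumes spaCy v3 is installed and uses `spacy.blank("en")`, a tokenizer-only English pipeline that needs no model download. The variable names are our own.

```python
import spacy

# A blank English pipeline contains only the tokenizer
nlp_blank = spacy.blank("en")

single = nlp_blank("Hello world")   # one space between the words
double = nlp_blank("Hello  world")  # two spaces between the words

print([token.text for token in single])
print([token.text for token in double])
# is_space flags tokens that consist of whitespace only
print([token.is_space for token in double])
```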

In [9]:
raw = "Hard to judge whether these sides were good. We were grossed " \
      "out by the melted styrofoam and didn't want to eat it for fear of getting sick."

nlp = English()
doc = nlp(raw)

In [10]:
last_word = doc[-1]
first_word = doc[0]
print("+" + first_word.text + "+" + "=======" + "+" + last_word.text + "+")



In [11]:
# Type
type(last_word)

spacy.tokens.token.Token

In [12]:
# Properties
first_word.is_bracket

False

Every token in our document has some characteristics that are known in spaCy as **lexical attributes**.

In [13]:
print(first_word.is_digit)
print(last_word.is_punct)

False
True


### Documents and spans

A slice of a document, that is, a sequence of one or more consecutive tokens, is referred to as a **span**. In some NLP tasks, spans are very relevant. For instance, in Question Answering (QA), obtaining the correct span that answers a query is crucial for the task itself. With spaCy, we can also define spans and use their `lexical attributes` in the same way as for a token.

In [14]:
span = doc[4:9]

In [15]:
print(span)

these sides were good.


In [16]:
type(span)

spacy.tokens.span.Span

In [17]:
span.text

'these sides were good.'

In [18]:
doc[3]

whether

In [19]:
# This cell is reserved for you to explore more about lexical attributes on the previous text.
# Check this link: https://spacy.io/api/token for more attributes.
fourth_word = doc[3]  # index 3 is the fourth token, "whether"
# The tag is empty because the blank English() pipeline only tokenizes; it has no tagger component
print("Here is a part-of-speech tag:", fourth_word.pos_)

Here is a part-of-speech tag: 


### Let's get a bit deeper into attributes

Working with language often requires math to solve problems. As an example, we can decide whether the word _tweet_ is used as a noun or as a verb by counting.

Knowing the context of a word and counting how often it appears after a verb or after a noun gives us an estimate of the probability that we are looking for.
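The counting idea can be sketched in a few lines of plain Python. The observations below are hypothetical, hand-labelled examples made up for illustration (they are not taken from a real corpus), and `tag_probability` is a helper name of our own.

```python
from collections import Counter

# Hypothetical observations: (tag of the previous word, tag of "tweet")
observations = [
    ("DET", "NOUN"),   # "a tweet"
    ("DET", "NOUN"),   # "the tweet"
    ("DET", "NOUN"),   # "this tweet"
    ("NOUN", "VERB"),  # "people tweet"
    ("NOUN", "VERB"),  # "users tweet"
    ("NOUN", "NOUN"),  # "celebrity tweet"
    ("PRON", "VERB"),  # "they tweet"
]

pair_counts = Counter(observations)
context_counts = Counter(previous for previous, _ in observations)

def tag_probability(tag, previous_tag):
    """Relative frequency of `tag` for "tweet", given the tag of the previous word."""
    total = context_counts[previous_tag]
    if total == 0:
        return 0.0  # context never observed
    return pair_counts[(previous_tag, tag)] / total

print("P(VERB | previous word is a NOUN):", tag_probability("VERB", "NOUN"))
print("P(NOUN | previous word is a DET): ", tag_probability("NOUN", "DET"))
```

Statistical models do essentially this on a much larger scale, with richer context than just the previous word.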

### How can we include statistics in spaCy?

The good news is that spaCy provides pre-trained models that we can choose according to our needs. Small, medium and large models are available for different languages. With such a model, we can use attributes that depend on context.

But what exactly is contained in a pre-trained model? 

It contains the vocabulary of the words used to train the model, the model's weights, and meta-information useful for spaCy.

Let's download and use a small model for English.

Please run the following line in a separate cell

`!python -m spacy download en_core_web_sm`

### How do we load a spaCy model?

Loading the model is as simple as telling spaCy the name of the model to load.

In [20]:
nlp = spacy.load('en_core_web_sm')

And we already know what to do...

In [21]:
# It's your turn to create a new document of our English text 
# and define a span for its last two words excluding the dot.

raw = "Hard to judge whether these sides were good. We were grossed " \
      "out by the melted styrofoam and didn't want to eat it for fear of getting sick."

# new_doc =
# last_span = 

new_doc = nlp(raw)

word_two = new_doc[1]
last_span = new_doc[-3:-1]
print(last_span.text)

getting sick


In [22]:
print(type(new_doc))
print(type(last_span))
print(type(word_two))

<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.span.Span'>
<class 'spacy.tokens.token.Token'>


In [23]:
fourth_word = new_doc[3]  # index 3 is the fourth token, "whether"
print("Here is a part-of-speech tag:", fourth_word.pos_)

Here is a part-of-speech tag: SCONJ


In [24]:
spacy.explain('SCONJ')

'subordinating conjunction'

In [25]:
word_judge = new_doc[2]
print(word_judge.dep_)

advcl


In [26]:
# Now display part-of-speech tags, dependencies and lemma for them.
# Token has lexical attributes
for token in last_span:
    print(token.text, token.pos_, token.dep_, token.lemma_)

getting VERB pcomp get
sick ADJ acomp sick


In [27]:
spacy.explain('acomp')

'adjectival complement'

### Structure inside spaCy

Up to this point, we have seen how to pass raw text to spaCy and process it into lexical features. However, storing a separate copy of every token for every occurrence in a text is expensive in terms of memory. Therefore, spaCy manages everything in a sort of `internal structure`.

This structure has three components: the document (**doc**), a vocabulary called **vocab**, and a look-up table that spaCy calls the **string store**. The vocab stores words by their ids, which are **hashes** rather than strings. From now on, we will call every entry in the vocab a **lexeme**. The string store translates between each hash and its string, which is how a **token** (from **doc**) is linked to its lexeme (from **vocab**).

### What does it look like in terms of code?

- A document contains tokens with their lexical attributes

In [28]:
for token in last_span:
    print(token.text, token.pos_, token.dep_, token.lemma_)
    print(type(token))
    print('--------------------')

getting VERB pcomp get
<class 'spacy.tokens.token.Token'>
--------------------
sick ADJ acomp sick
<class 'spacy.tokens.token.Token'>
--------------------


- Each object in our vocab is a lexeme

In [29]:
lexeme = nlp.vocab[last_span[1].text]
print(type(nlp.vocab))
print(type(lexeme))
print(lexeme.text, lexeme.orth)
print(type(last_span[1]))
print(last_span[1].text, last_span[1].orth)

<class 'spacy.vocab.Vocab'>
<class 'spacy.lexeme.Lexeme'>
sick 14841597609857081305
<class 'spacy.tokens.token.Token'>
sick 14841597609857081305


- Each hash id can be looked up in the string store to get its string representation, and vice versa.

In [30]:
searched_string = nlp.vocab.strings[lexeme.orth]
searched_hash = nlp.vocab.strings[lexeme.text]

print("This is my desired string:", searched_string)
print("This is my desired hash:", searched_hash)

This is my desired string: sick
This is my desired hash: 14841597609857081305
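The two-way lookup shown in this cell can be pictured with a toy class in plain Python. This is only a sketch of the idea and not spaCy's implementation: the class `ToyStringStore` is hypothetical, and the BLAKE2b-based hash is a stand-in we chose for illustration, so the ids it produces differ from spaCy's.

```python
import hashlib

class ToyStringStore:
    """Toy two-way table between strings and 64-bit integer ids."""

    def __init__(self):
        self._hash_to_string = {}

    @staticmethod
    def _hash(string):
        # Stand-in hash: 8 bytes of BLAKE2b interpreted as an integer
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, string):
        """Register a string and return its id."""
        key = self._hash(string)
        self._hash_to_string[key] = string
        return key

    def __getitem__(self, key):
        # A string is mapped to its id; an integer id is mapped back to its string
        if isinstance(key, str):
            return self._hash(key)
        return self._hash_to_string[key]

store = ToyStringStore()
sick_id = store.add("sick")

print("id for 'sick':", store["sick"])
print("string for that id:", store[sick_id])
```

Because every occurrence of a word shares the same id, a document only needs to store integers, and the string itself is kept once.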


### My Example

In [31]:
token = new_doc[-2]
print(type(new_doc))
print(type(token))
print(token.text)
print(token.orth)

<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.token.Token'>
sick
14841597609857081305


In [32]:
lexeme = nlp.vocab[token.text]
print(type(nlp.vocab))
print(type(lexeme))
print(lexeme.text)
print(lexeme.orth)

<class 'spacy.vocab.Vocab'>
<class 'spacy.lexeme.Lexeme'>
sick
14841597609857081305


In [33]:
# Search for token given lexeme in lookup table
searched_string = nlp.vocab.strings[lexeme.orth]
searched_hash = nlp.vocab.strings[lexeme.text]

print("This is my desired string:", searched_string)
print("This is my desired hash:", searched_hash)

This is my desired string: sick
This is my desired hash: 14841597609857081305


In [34]:
# Search for lexeme given token in lookup table
searched_string = nlp.vocab.strings[token.orth]
searched_hash = nlp.vocab.strings[token.text]

print("This is my desired string:", searched_string)
print("This is my desired hash:", searched_hash)

This is my desired string: sick
This is my desired hash: 14841597609857081305


### Displacy

You can also take a look at visualizations. For instance, the graphs presented in today's slides were created with spaCy.

Let's take a look at an example.

In [35]:
raw_string = "You can also look take a look of visualizations."

In [36]:
this_doc = nlp(raw_string)

In [37]:
from spacy import displacy

displacy.render(this_doc, jupyter=True)

### Searching for specific patterns with Matcher

spaCy provides a `Matcher`, which works similarly to regular expressions in Python. The difference is that you can search not only the text but also other token attributes. In this way we could, for example, differentiate between _break_ as a verb and _break_ as a noun and search only for the noun occurrences.
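To see why matching on token attributes is more expressive than matching on text alone, consider the following toy sketch in plain Python. It is not spaCy's implementation: the hand-annotated token dictionaries and the helper `toy_match` are hypothetical and only mimic the shape of the patterns.

```python
# Hypothetical hand-annotated tokens; in spaCy these attributes come from the pipeline
tokens = [
    {"TEXT": "We", "POS": "PRON"},
    {"TEXT": "take", "POS": "VERB"},
    {"TEXT": "a", "POS": "DET"},
    {"TEXT": "break", "POS": "NOUN"},
    {"TEXT": "so", "POS": "SCONJ"},
    {"TEXT": "nothing", "POS": "PRON"},
    {"TEXT": "can", "POS": "AUX"},
    {"TEXT": "break", "POS": "VERB"},
    {"TEXT": ".", "POS": "PUNCT"},
]

def toy_match(tokens, pattern):
    """Return (start, end) for every window whose tokens satisfy the pattern."""
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # Every attribute listed in a pattern entry must equal the token's value
        if all(
            all(token.get(attr) == value for attr, value in spec.items())
            for token, spec in zip(window, pattern)
        ):
            matches.append((start, start + width))
    return matches

# Text alone cannot separate the two uses of "break"
print(toy_match(tokens, [{"TEXT": "break"}]))
# Adding a part-of-speech constraint keeps only the noun
print(toy_match(tokens, [{"TEXT": "break", "POS": "NOUN"}]))
# Patterns can also span several tokens
print(toy_match(tokens, [{"POS": "DET"}, {"TEXT": "break"}]))
```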

Here, we have examples of searching text, lexical attributes for a specific token and lexical attributes in a more general search.

In [38]:
text = "Google Inc. is a company that has a big development in NLP. "\
       "When users google for a word or any query, their system internally " \
       "runs a pipeline in order to process what the person is querying."

In [39]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [40]:
matcher = Matcher(nlp.vocab)
# The keys in these dictionaries (TEXT, LOWER, LEMMA, IS_PUNCT) are token attributes of the spaCy Token
# Each match is a span of tokens (a span of length 1 is an individual token)
patterns = [
    [{'TEXT': 'Google'}, {'TEXT': 'Inc.'}], # .text = (Google AND Inc.)
    [{'TEXT': 'Google'}], # Google
    [{'TEXT': 'Inc.'}],
    [{'LOWER': 'google'}],
    [{'LEMMA': 'query'}, {'IS_PUNCT': True}]
]

In [41]:
matcher.add("TEST_PATTERNS", patterns)
doc = nlp(text)
matches = matcher(doc)

In [42]:
print(matches)
print("Total of matches found:", len(matches))

[(3004906285683798724, 0, 1), (3004906285683798724, 0, 2), (3004906285683798724, 1, 2), (3004906285683798724, 15, 16), (3004906285683798724, 21, 23), (3004906285683798724, 37, 39)]
Total of matches found: 6


But, what can we do with this output? What does it mean?

`Matcher` returns a list of `(match_id, start, end)` tuples, where `start` and `end` are the token indices that delimit each matched span.

In [43]:
# Display a list of found matches
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['Google', 'Google Inc.', 'Inc.', 'google', 'query,', 'querying.']
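The first element of each tuple is the hash of the name under which the patterns were added, so it can be translated back through the string store. Here is a small sketch, assuming spaCy v3 is installed. It uses a blank pipeline because `LOWER` patterns need only the tokenizer, and the rule name `GOOGLE_ANY_CASE` and the example sentence are our own.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: enough for patterns on TEXT or LOWER
nlp_blank = spacy.blank("en")
toy_matcher = Matcher(nlp_blank.vocab)
toy_matcher.add("GOOGLE_ANY_CASE", [[{"LOWER": "google"}]])

toy_doc = nlp_blank("Users google with Google.")
found = toy_matcher(toy_doc)

for match_id, start, end in found:
    # The string store turns the hash back into the rule name
    rule_name = nlp_blank.vocab.strings[match_id]
    print(rule_name, start, end, toy_doc[start:end].text)
```

Giving each group of patterns its own name makes it possible to tell afterwards which rule produced which match.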


Following what we have seen so far, download a German model and create patterns to find several tokens that occur more than once in the text given in the following cells.

***Hint:*** Notice that models for other languages were trained on news data instead of web data.

In [None]:
de_nlp = spacy.load('de_core_news_sm')

In [None]:
from spacy.lang.de.examples import sentences
raw_german = sentences[0:5]
print(raw_german)

# Word embedding using spaCy
In this notebook you will find out how to use spaCy to get word vectors.

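Loading pre-trained embeddings with spaCy. Note that the cell below loads `en_core_web_sm` even though the variable is called `nlp_lg`. The small model ships without static word vectors: the 96-dimensional values it returns are context-sensitive tensors, and similarity scores computed from them are not reliable. For real word vectors, load `en_core_web_md` or `en_core_web_lg` instead.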

In [44]:
nlp_lg = spacy.load('en_core_web_sm')

In [45]:
text = "The quick brown fox jumps over the lazy dog"
doc = nlp_lg(text)
print(type(nlp_lg))
print(type(doc))
print(doc.text)

<class 'spacy.lang.en.English'>
<class 'spacy.tokens.doc.Doc'>
The quick brown fox jumps over the lazy dog


Task 1: Retrieve the second Token in the Doc object (index 1) and the first 30 dimensions of its vector representation.

In [46]:
second_token = doc[1]
print(type(second_token))
print(second_token.text)
print(second_token.vector.shape)
print(type(second_token.vector))
print(second_token.vector[:30])

<class 'spacy.tokens.token.Token'>
quick
(96,)
<class 'numpy.ndarray'>
[ 0.8233709  -1.1887459   0.9393252   1.1702604   0.26628137 -0.00492318
 -0.36061257 -0.85801125 -0.33466592 -0.6535297   0.18827835  0.17926934
 -1.456599   -0.3641851  -0.45112038  0.3773813   0.6100651  -0.07231146
 -1.2368598  -0.6287472  -0.31162527  0.9800494   0.25633457 -0.04945124
 -0.4890042   0.2708063   0.15849347  0.43778464  0.7010016   0.6340733 ]


Retrieving word vectors for "dog", "fox" and "sun":

In [47]:
nlp_lg('dog').vector

array([-1.6806675 , -1.2663747 , -0.71255565,  0.22143888,  0.28581634,
        0.23924345,  1.2992647 ,  1.0683641 , -0.1666507 , -0.593021  ,
        0.20207635, -0.71124184, -0.5710875 , -0.2685267 , -0.5052826 ,
        0.60505986, -1.5851773 , -1.6874862 ,  0.7026561 ,  0.60366225,
        0.3043416 ,  1.3963956 , -0.056483  , -0.6299367 ,  0.09717859,
        0.5463655 ,  0.36506647,  0.73901325, -0.16906115,  0.35410628,
        0.33770823, -0.7352046 ,  1.6755302 ,  0.48371333,  0.0184648 ,
       -0.92315257,  0.6245377 ,  0.11393103,  0.8193037 , -0.01115507,
       -0.49064368, -0.30198318,  0.43095675, -0.05127436, -0.11000359,
       -0.64060974, -1.4619632 ,  0.85834503, -0.4855454 ,  0.01614086,
       -0.10124743,  1.2471893 ,  0.7936216 , -0.49573362,  1.0994403 ,
       -0.22723481,  1.289906  , -0.8966851 , -0.07580797, -0.6877714 ,
       -0.9954758 , -0.70957065, -0.3484952 , -0.35015944,  0.6045856 ,
        0.21346438, -0.22741812,  0.12088367,  0.8521288 ,  0.11

In [48]:
nlp_lg('dog').vector.shape

(96,)

In spaCy, the vector representation for the entire Doc is calculated by averaging the vectors for each Token in the Doc.

In [49]:
doc.vector.shape

(96,)
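To make the averaging concrete, here is a sketch in plain Python with hypothetical 3-dimensional toy vectors. The values are made up and the helper `average_vectors` is our own; real spaCy vectors have many more dimensions.

```python
# Hypothetical toy vectors, one per token
token_vectors = {
    "quick": [1.0, 0.0, 2.0],
    "brown": [3.0, 2.0, 0.0],
    "fox":   [2.0, 4.0, 1.0],
}

def average_vectors(vectors):
    """Component-wise mean of equally sized vectors."""
    vectors = list(vectors)
    count = len(vectors)
    # zip(*vectors) groups the first components, the second components, and so on
    return [sum(components) / count for components in zip(*vectors)]

doc_vector = average_vectors(token_vectors.values())
print(doc_vector)
```

The result has the same number of dimensions as each token vector, which is why `doc.vector.shape` matches the shape of a single token's vector.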

In [50]:
heeiiyy = nlp_lg("heeiiyy")
heeiiyy.vector

array([-0.88126457, -0.6202806 , -1.5005524 , -0.40187687,  0.36590713,
       -0.30607873, -0.21017855,  1.8757141 , -0.17634499, -0.03765219,
        0.82590777, -0.08941758, -0.99377054, -0.40427023, -0.37538618,
        0.912148  , -1.1020703 ,  0.1130835 ,  0.85248625,  0.6235067 ,
       -2.0766525 ,  0.06002179, -0.45891204, -0.04191741, -0.09031391,
        0.94857234, -0.17983632,  1.2523849 , -0.8432184 ,  0.16701965,
       -0.34979108,  0.59383965,  0.7371594 ,  0.9947243 , -0.339404  ,
       -0.8623344 , -0.2672234 ,  1.7776172 ,  1.0509973 , -0.6767344 ,
        0.02610242, -0.8435878 ,  0.34696946, -0.8810245 ,  0.06217456,
       -0.63664925, -0.12952982, -0.15066275, -0.25706494,  0.6818833 ,
        0.6012949 ,  0.22682005,  0.04043951,  0.71112496,  0.4236604 ,
       -0.9417363 ,  2.9450629 , -0.8021494 ,  0.02782455,  1.0667045 ,
       -0.32455254, -1.1640505 , -0.05386074, -0.91956043,  0.9660482 ,
        0.32777134, -0.54104346,  0.40791127, -0.92241853,  1.67

Task 2: Compare the similarity of words "dog" and "fox" & "dog" and "sun".

In [55]:
print("Similarity of dog and fox: " + str(nlp_lg("dog").similarity(nlp_lg("fox"))))
print("Similarity of dog and sun: " + str(nlp_lg("dog").similarity(nlp_lg("sun"))))

Similarity of dog and fox: 0.7081974719792407
Similarity of dog and sun: 0.7552881594262363
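By default, spaCy's `similarity` is the cosine similarity between the two vectors. The sketch below computes it by hand in plain Python. The 3-dimensional vectors are hypothetical values chosen so that the result is easy to check, and `cosine_similarity` is a helper of our own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors: "fox" points in the same direction as "dog",
# while "sun" is orthogonal to both
dog = [1.0, 2.0, 0.0]
fox = [2.0, 4.0, 0.0]
sun = [0.0, 0.0, 3.0]

print("dog vs fox:", cosine_similarity(dog, fox))
print("dog vs sun:", cosine_similarity(dog, sun))
```

Cosine similarity depends only on the direction of the vectors and not on their length, so `fox` scores as maximally similar to `dog` even though it is twice as long.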



In [56]:
# Requires the whatlies package (install it first with `!pip install whatlies`)
from whatlies.language import SpacyLanguage

In [None]:
# Wrap the spaCy language model under 'nlp_lg' into the
# whatlies SpacyLanguage class and assign the result 
# under the variable 'language_model'
language_model = SpacyLanguage(nlp_lg)

# Call the variable to examine the output
language_model

The result is a SpacyLanguage object that wraps a spaCy Language object.

In [None]:
embeddings = language_model[['fox', 'dog', 'sun']]

# Plot the EmbSet
embeddings.plot(kind='arrow', color='red', x_axis=0, y_axis=1)

# GENSIM

In [68]:
from gensim.models import Word2Vec
import gensim.downloader as api

In [69]:
wv_model = api.load('word2vec-google-news-300')

In [61]:
vec_king = wv_model['king']
len(vec_king)

300

In [62]:
print(wv_model.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))

car
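As far as we can tell, `doesnt_match` picks the word whose normalized vector is least similar to the mean of all the given vectors. Here is a toy sketch of that idea in plain Python. The 2-dimensional vectors are hypothetical, `odd_one_out` is a helper of our own, and gensim's actual implementation may differ in details.

```python
import math

def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def odd_one_out(word_vectors):
    """Return the word whose unit vector is least aligned with the mean vector."""
    units = {word: unit(vec) for word, vec in word_vectors.items()}
    dims = len(next(iter(units.values())))
    mean = [sum(vec[i] for vec in units.values()) / len(units) for i in range(dims)]
    # Smallest dot product with the mean = least central word
    return min(units, key=lambda word: sum(a * b for a, b in zip(units[word], mean)))

# Hypothetical toy vectors: three "element" words cluster together, "car" does not
toy_vectors = {
    "fire":  [0.9, 0.1],
    "water": [1.0, 0.2],
    "air":   [0.8, 0.0],
    "car":   [0.1, 1.0],
}

print(odd_one_out(toy_vectors))
```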


In [63]:
sims = wv_model.most_similar('car', topn=10)  # get similar words
sims

[('vehicle', 0.7821096181869507),
 ('cars', 0.7423831224441528),
 ('SUV', 0.7160962820053101),
 ('minivan', 0.6907036900520325),
 ('truck', 0.6735789775848389),
 ('Car', 0.6677608489990234),
 ('Ford_Focus', 0.6673202514648438),
 ('Honda_Civic', 0.6626849174499512),
 ('Jeep', 0.651133120059967),
 ('pickup_truck', 0.6441438794136047)]

In [64]:
wv_model.similarity(w1='car', w2='ship')

0.16958451

In [65]:
wv_model.similarity(w1='car', w2='airplane')

0.4243558

In [66]:
wv_model.similarity(w1='ship', w2='airplane')

0.325606

In [67]:
wv_model.similarity(w1='car', w2='truck')

0.67357904