# Exercise - 02

Working with __SpaCy__ 

Credit: Adapted from SpaCy.io

This notebook was written for spaCy v2.x, with the 3.0 release imminent. The key changes in 3.0 mostly leave the core API unchanged; they mainly make Transformer pipelines feasible.

We will cover Transformers as a separate module, using Hugging Face, in a later session.

In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm
# Uncomment to download the medium model, which the Document Similarity section needs:
# !python -m spacy download en_core_web_md

Collecting spacy
  Downloading spacy-3.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.2 MB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting wasabi<1.1.0,>=0.8.1
  Downloading wasabi-0.8.2-py3-none-any.whl (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Downloading murmurhash-1.0.6-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21 kB)
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.6-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (130 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x

* Import the English language class
* Create the `nlp` object, which contains:
    * the processing pipeline
    * language-specific rules for tokenization
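As a rough illustration of what "language-specific rules" means here, the sketch below builds a blank English pipeline and tokenizes a sentence containing a contraction. The sample sentence and the names `nlp_blank` and `tokens` are made up for this example; `spacy.blank("en")` is used as another way to create the same kind of object as `English()`.

```python
import spacy

# spacy.blank("en") gives an English tokenizer with an empty processing pipeline.
nlp_blank = spacy.blank("en")

# No trained components (tagger, parser, NER) have been added yet.
print(nlp_blank.pipe_names)

# English tokenizer rules split the contraction "don't" into two tokens.
tokens = [token.text for token in nlp_blank("I don't know.")]
print(tokens)
```

A blank pipeline is enough for tokenization and lexical attributes; predictions such as part-of-speech tags need a trained model package.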

In [2]:
import spacy



In [3]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object that will contain the processing pipeline and includes the language specific rules for tokenization, etc.
nlp = English()

When you process a text with the `nlp` object, spaCy creates a `Doc` object, short for "document". The `Doc` lets you access information about the text in a structured way, and no information is lost.

The `Doc` behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index. More on that later!
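To make the "behaves like a sequence" point concrete, here is a minimal sketch using a blank English pipeline. The names `nlp_blank`, `doc_demo` and `span_demo` are only for illustration.

```python
import spacy

nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("Hello world!")

# len() counts tokens, not characters.
print(len(doc_demo))

# Indexing returns a Token; negative indices work as they do for lists.
print(doc_demo[0].text, doc_demo[-1].text)

# Slicing returns a Span, which is a view over part of the Doc.
span_demo = doc_demo[0:2]
print(span_demo.text)

# The Doc keeps the original text, including whitespace.
print(doc_demo.text == "Hello world!")
```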

In [4]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!


## Token objects, Lexemes, & Hashes

![SpaCy Arch](images/vocab_stringstore.png)

* `Token` objects represent the tokens in a document, for example a word or a punctuation character.
* To get a token at a specific position, index into the `Doc`.
* Tokens also provide attributes that give you more information about them. For example, the `.text` attribute returns the verbatim token text.
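Beyond `.text`, a few lexical attributes are available without any trained model. The sketch below prints some of them for a short, made-up sentence; the variable names are illustrative only.

```python
import spacy

nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("It costs $5.")

# token.i is the token's position in the Doc; the is_* and like_* flags
# are computed from the token text alone.
for token in doc_demo:
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)

texts = [token.text for token in doc_demo]
indices = [token.i for token in doc_demo]
alpha_flags = [token.is_alpha for token in doc_demo]
```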

In [5]:
doc = nlp("Hello world!")

# print the text
print(doc.text)

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)


Hello world!
world


In [6]:
# spaCy stores strings as hash values in a shared string store; lookups work in both directions
print(nlp.vocab.strings["world"])
print(nlp.vocab.strings[1703489418272052182])

# A Lexeme object is an entry in the vocabulary
lexeme = nlp.vocab["world"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha)


1703489418272052182
world
world 1703489418272052182 True
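The two-way lookup can also be tried on a standalone `StringStore`, outside of any pipeline. This is a small sketch; the strings "coffee" and "tea" and the variable names are arbitrary examples.

```python
from spacy.strings import StringStore

# A standalone string store, seeded with one string.
store = StringStore(["world"])

# String -> hash, then hash -> string.
world_hash = store["world"]
print(world_hash, store[world_hash])

# add() stores a new string and returns its hash.
coffee_hash = store.add("coffee")

# Membership tests report whether a string has been stored.
print("coffee" in store, "tea" in store)
```

Storing each string once and passing integers around is what lets the `Doc`, the `Vocab` and the lexemes share data cheaply.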


## Exercise 1

In [None]:
from spacy.lang.__ import _____
nlp_de = ____
doc = nlp_de("Viel Spaß beim Codieren")
print(___)

# print the second token
token2 = ___
print(___)

## Exercise 2

You can use `token.like_num` to find numbers in the text. In this exercise, find the dollar amounts.
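As a quick warm-up on a different, made-up text, the sketch below shows that `like_num` is true for digits and also for English number words. It assumes a blank English pipeline named `nlp_blank`.

```python
import spacy

nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("ten 10 apples")

# like_num looks only at the token text, so no trained model is needed.
flags = [(token.text, token.like_num) for token in doc_demo]
print(flags)
```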

In [None]:
# Process the text
doc = nlp(
    "According to the BigMac price index, the cost of a Big Mac is cheapest in Turkey at $1.74, while the cost in Switzerland is highest at $7.01. In the US, it is about $5.75"
)

# Iterate over the tokens in the doc
for ___ in doc:
    # Check if the token resembles a number
    if token.___:
        # is it a USD currency
        prev_token = doc[token.i - 1]
        if ____.___ == "$":
            print("Price found: $", _____, sep="")

## Exercise 3

Let's move on to pre-trained model packages.


**POS** and **NER** tagging

Models are trained on large corpora of labeled text, and they can be updated with specific examples to fine-tune their predictions.

1. Download the pre-trained model package, `en_core_web_sm` (trained on web corpus)
2. Load using `spacy.load`

In [None]:
nlp = spacy.load("en_core_web_sm") # contains pipeline, vocabulary, and weights

In [None]:
# POS using the pre-trained model

# Process the text - create tokens
doc = nlp("They went to the Eiffel Tower for sight-seeing and ate a pizza that cost $5.54 per slice.")
# Iterate over tokens
for token in doc:
    print(f"{token.___:<12}{token.___:15}")

## Pro-tip

Get explanations of the most common tags & labels using `spacy.explain(tag)`

In [None]:
print("POS:", spacy.explain('POS'))
print("GPE:", spacy.explain('GPE')) # Geopolitical entity

### Exercise 3: Syntactic Dependencies
* The `dep_` attribute returns the predicted dependency label.
* The `head` attribute returns the parent token.
* Predict the entities and their labels.

NOTE: Attributes that return strings end with an underscore `_`, for instance, `ent.label_`. Without the underscore, the same attribute returns an integer ID.
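The underscore convention can be seen without a trained model by using a lexical attribute. The sketch below uses `lower` and `lower_` as a stand-in; the same pairing applies to `pos`/`pos_`, `dep`/`dep_` and `label`/`label_` once a model is loaded. Variable names are illustrative.

```python
import spacy

nlp_blank = spacy.blank("en")
token = nlp_blank("Hello world!")[0]

# Without the underscore: an integer ID (a hash into the string store).
print(token.lower)

# With the underscore: the readable string.
print(token.lower_)

# The shared string store links the two.
roundtrip = nlp_blank.vocab.strings[token.lower]
print(roundtrip)
```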

In [None]:
# print the token text, position, dependencies and parent (head)
for token in doc[:5]:
    print(token.text, token.___, token.___, token.___.text)

In [None]:
for ent in doc.ents:
    print(ent.___, ent.___)

## Rule based matching

### Using `Matcher`

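`Matcher` helps us find patterns in our text. 
* First initialize `Matcher` with the pipeline's shared vocabulary, `nlp.vocab`.
* The `add` method registers patterns under a unique ID. In spaCy v3 it takes two inputs, the ID and a list of patterns, plus an optional `on_match` callback. (In v2 the call was `add(ID, callback, pattern)`, with `None` as the callback in our case.)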
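Here is a minimal sketch of the v3-style `add` call with a callback, run on a blank English pipeline. The rule name `GREETING`, the sample text and the helper `collect_match` are made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
matcher_demo = Matcher(nlp_blank.vocab)

found = []

def collect_match(matcher, doc, i, matches):
    # Called once per match; i is the index of the current match in `matches`.
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)

# One pattern = one list of token descriptions (dictionaries).
pattern_demo = [{"LOWER": "hello"}, {"LOWER": "world"}]

# v3 signature: add(key, list_of_patterns, on_match=optional_callback)
matcher_demo.add("GREETING", [pattern_demo], on_match=collect_match)

doc_demo = nlp_blank("Hello world! I said hello World.")
matches_demo = matcher_demo(doc_demo)

# match_id is a hash; the string store maps it back to the rule name.
rule_names = {nlp_blank.vocab.strings[match_id] for match_id, _, _ in matches_demo}
print(found, rule_names)
```

Matching on `LOWER` rather than `TEXT` makes the pattern case-insensitive, which is often what you want for free text.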

In [None]:
from spacy.matcher import Matcher

# initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# add the pattern to the matcher
pattern = [{"TEXT": "12"}, {"TEXT": "Angry"},{"TEXT":"Men"}]
matcher.add("MOVIE_PATTERN", [pattern])

doc = nlp("We watched 12 Angry Men together")
matches = matcher(doc)

In [None]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

## Exercise 4: Complex Pattern Matching

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [[{"TEXT": "iOS"}, {"IS_DIGIT": True}]]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "A new study on the aerosol and surface stability of SARS-CoV 2 compared to SARS-CoV 1 that was published "
    "by the New England Journal of Medicine reveals that SARS-CoV 2 can stay suspended in the air for three hours, "
    "with a similar reduction in its infectious rate as that of SARS-CoV 1"
)

# Write a pattern to match all instances of SARS-CoV 1 and SARS-CoV 2.
# The tokenizer is expected to split "SARS-CoV" into "SARS", "-", "CoV"
# (the token listing in the next cell lets you check), so there is no hyphen before the digit.
pattern = [{"LOWER": "sars"}, {"TEXT": "-"}, {"LOWER": "cov"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("VIRUS_TYPE_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

In [None]:
for token in doc:
    print(f"{token.text:<10}{token.pos_:<10}")

In [None]:
# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{"LEMMA": "love", "POS": "VERB"}, {"LOWER": "cats"}]
matcher.add("LOVE_CATS", [pattern])

# Operators can specify how often a token should be matched
pattern = [{"TEXT": "very", "OP": "+"}, {"TEXT": "happy"}]
matcher.add("VERY_HAPPY", [pattern])

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)
print(len(matches))

for match_id, start, end in matches:
    print(doc[start:end])
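The `OP` key accepts the quantifiers `!` (negate), `?` (optional), `+` (one or more) and `*` (zero or more). With `+`, the matcher returns every span that fits, so overlapping matches are expected. The sketch below shows this on a blank English pipeline and then keeps only the longest span with `spacy.util.filter_spans`; the variable names are illustrative.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp_blank = spacy.blank("en")
matcher_demo = Matcher(nlp_blank.vocab)

# "+" means: match this token one or more times.
matcher_demo.add("VERY_HAPPY", [[{"LOWER": "very", "OP": "+"}, {"LOWER": "happy"}]])

doc_demo = nlp_blank("I'm very very happy")

# Every span that satisfies the pattern is reported, including overlaps.
spans = [doc_demo[start:end] for _, start, end in matcher_demo(doc_demo)]
print([span.text for span in spans])

# filter_spans drops overlapping spans, preferring the longest one.
longest = filter_spans(spans)
print([span.text for span in longest])
```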

## Phrase Matching

In [None]:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

In [None]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
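By default `PhraseMatcher` compares the exact token text. As a hedged variation on the cell above, the sketch below passes `attr="LOWER"` so that terms match regardless of capitalization. It uses a blank English pipeline, and the term list and sample sentence are made up.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank("en")

# attr="LOWER" compares lowercased token text instead of the exact text.
matcher_demo = PhraseMatcher(nlp_blank.vocab, attr="LOWER")

terms_demo = ["Angela Merkel", "Barack Obama"]
# make_doc only runs the tokenizer, which is all the phrase patterns need.
matcher_demo.add("NAMES", [nlp_blank.make_doc(term) for term in terms_demo])

doc_demo = nlp_blank("angela merkel met BARACK OBAMA in Washington")
found = [doc_demo[start:end].text for _, start, end in matcher_demo(doc_demo)]
print(found)
```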

## Lemma + Proper Noun

In [None]:
matcher = Matcher(nlp.vocab)
doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]
matcher.add("DrivePattern", [pattern])
matches = matcher(doc)

print("Total matches found:", len(matches))

for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

In [None]:
# Let's validate if this makes sense:
# Remember, you can use spacy.explain("TAG")

for token in doc:
    if token.lemma_ == "download":
        next_token = doc[token.i + 1]
        print(token.lemma_, "POS:", next_token.pos_) 

# Document Similarity

* spaCy can compare two documents, spans, or tokens.
* Each has a `.similarity` method that takes another object as input.
* It returns a similarity score, generally between 0 and 1.

NOTE: `en_core_web_sm` does not include word vectors, so its similarity scores are not meaningful. Use a model with word vectors, such as `en_core_web_md`.
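By default, spaCy computes similarity as the cosine similarity between vectors, and a document's vector defaults to the average of its token vectors. The sketch below reproduces that idea in plain Python with invented 3-dimensional vectors; the numbers in `toy_vectors` are hypothetical and do not come from any spaCy model.

```python
import math

def cosine_similarity(vec_a, vec_b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        # Choice made for this sketch: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

def average_vector(vectors):
    # Element-wise mean, analogous to how a document vector is built from tokens.
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Hypothetical toy "word vectors" for illustration only.
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "burger": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

food_doc = average_vector([toy_vectors["pizza"], toy_vectors["burger"]])
sim_food = cosine_similarity(toy_vectors["pizza"], toy_vectors["burger"])
sim_car = cosine_similarity(toy_vectors["pizza"], toy_vectors["car"])
print(sim_food, sim_car, cosine_similarity(food_doc, toy_vectors["car"]))
```

Because the score depends entirely on the vectors, a model without real word vectors cannot give meaningful results here.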


In [None]:
# Load a larger model with vectors
nlp = spacy.load("en_core_web_md")

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
doc3 = nlp("The car was old")
span = nlp("They baked cookies and muffins")[2:5]
print(doc1.similarity(doc2))
print(doc2.similarity(doc3))
print(doc2.similarity(span))

In [None]:
# Print the word vector of a single token (here the second token, "king")
doc = nlp("The king was furious") # would this change if you modified the sentence?
print(doc[1].vector)

# Bigrams with SpaCy

An n-gram is a sequence of n consecutive tokens: unigrams (n = 1), bigrams (n = 2), trigrams (n = 3), and so on.
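To see what these look like before bringing in spaCy, here is a small sketch on a plain list of words. The word list is made up, and `zip` over shifted copies of the list is just one convenient way to pair neighbours for fixed sizes.

```python
words = ["It", "was", "a", "dark", "day"]

# Unigrams: each word on its own.
unigrams = [(word,) for word in words]

# Bigrams: pair each word with the one that follows it.
bigrams = list(zip(words, words[1:]))

# Trigrams: the same idea with a window of three.
trigrams = list(zip(words, words[1:], words[2:]))

print(bigrams)
print(trigrams)
```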

In [None]:
def get_alpha_tokens(doc):
    sentence = list()
    for token in doc:
        if token.is_alpha:
            sentence.append(token)            
    return sentence

def get_bigram(doc):
    bigrams = list()
    sentence = get_alpha_tokens(doc)
    
    for index in range(len(sentence)-1):
        word1 = sentence[index]
        word2 = sentence[index+1]
        bigram = (word1, word2)
        bigrams.append(bigram)
    
    return bigrams

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp("It was a dark and cloudy day yesterday")

bigrams = get_bigram(doc)
for token1, token2 in bigrams:
    print(token1.text, token2.text)

## Exercise 5

Using spaCy, write a function that returns the n-grams of a document. You may use the bigram function as a reference. Helper functions are also available.

In [None]:
def get_ngram(doc, n=2):

    <<YOUR CODE HERE>> # Fill this
    
    return ngrams

def print_ngrams(ngrams):
    n = len(ngrams[0])
    for ngram in ngrams:
        print(ngram)

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp("It was a dark and cloudy day yesterday")

ngrams = get_ngram(doc, n=2)
if len(ngrams) > 0:
    print_ngrams(ngrams)

## Exercise 6

Process the text to only print verbs

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# Process the texts and print the verbs

TEXTS = ['We went kayaking last week', 'It was hard to drive on the road due to rain', 'Can you come here please?', "What's taking you so long?"]
for text in TEXTS:
    doc = nlp(text)
    print([token.text for token in doc if <<YOUR CODE HERE>>]) # Fill this

## Exercise 7

Print the entities

In [None]:
import json
import spacy

json_text = '''
[
    "McDonalds is my favorite restaurant.",
    "Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..",
    "People really still eat McDonalds :(",
    "The McDonalds in Spain has chicken wings. My heart is so happy ",
    "@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P",
    "please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D",
    "This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it"
]'''
TEXTS = json.loads(json_text)

# Process the texts and print the entities
docs = [nlp(text) for text in TEXTS]
entities = [<<YOUR CODE HERE>>] # Fill this
print(*entities)

## Solution 1

In [None]:
from spacy.lang.de import German
nlp_de = German()
doc = nlp_de("Viel Spaß beim Codieren")
print(doc.text)

# print the second token
token2 = doc[1]
print(token2.text)

## Solution 2

In [None]:
# Process the text
doc = nlp(
    "According to the BigMac price index, the cost of a Big Mac is cheapest in Turkey at $1.74, while the cost in Switzerland is highest at $7.01. In the US, it is about $5.75"
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the previous token in the document
        prev_token = doc[token.i - 1]
        # Check if the previous token's text equals "$"
        if prev_token.text == "$":
            print("Price found: $", token.text, sep="")

## Solution 3

In [None]:
for token in doc[:5]:
    print(token.text, token.pos_, token.dep_, token.head.text)

In [None]:
for ent in doc.ents:
    print(f"{ent.text:<35}{ent.label_:<10}")

## Solution 5

In [None]:
def get_ngram(doc, n=2):
    ngrams = list()
    sentence = get_alpha_tokens(doc)
    
    # There are len(sentence) - n + 1 windows of size n
    for index in range(len(sentence) - n + 1):
        ngram = tuple(sentence[index:index+n])
        ngrams.append(ngram)
    
    return ngrams

def print_ngrams(ngrams):
    n = len(ngrams[0])
    for ngram in ngrams:
        print(ngram)

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp("It was a dark and cloudy day yesterday")

ngrams = get_ngram(doc, n=7)
if len(ngrams) > 0:
    print_ngrams(ngrams)

## Solution 6

In [None]:
# Process the texts and print the verbs

nlp = spacy.load('en_core_web_sm')
TEXTS = ['We went kayaking last week', 'It was hard to drive on the road due to rain', 'Can you come here please?', "What's taking you so long?"]
for text in TEXTS:
    doc = nlp(text)
    print([token.text for token in doc if token.pos_ == "VERB"])

## Solution 7

In [None]:
import json
import spacy

json_text = '''
[
    "McDonalds is my favorite restaurant.",
    "Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..",
    "People really still eat McDonalds :(",
    "The McDonalds in Spain has chicken wings. My heart is so happy ",
    "@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P",
    "please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D",
    "This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it"
]'''
TEXTS = json.loads(json_text)

# Process the texts and print the entities
docs = [nlp(text) for text in TEXTS]
entities = [doc.ents for doc in docs]
print(*entities)