# spaCy: A modern NLP pipeline
Learning goals:
 - Understand how a modern NLP pipeline works and what spaCy models typically do
 - Understand the main data structures: Doc, Token, Vocab
 - Understand how cosine similarity between tokens, spans and documents can be computed
 - Understand how rule-based matching with a domain-specific search query language works 
 - Understand how rule-based entity matching can be done
 - Understand the Matcher and DependencyMatcher classes
 - Understand what (de)serialization is and how it applies to spaCy data structures
 - Get to know the displacy visualization
 
See [spaCy 101](https://spacy.io/usage/spacy-101) for an introduction.

## Installation

In [None]:
# Only run this once (restart the kernel if the models are not loaded)
! pip install spacy
! python -m spacy download en_core_web_sm
! python -m spacy download en_core_web_md # with word vectors

## Loading pipeline models

In [None]:
import spacy
from spacy import displacy 
model = 'en_core_web_sm'
model_md = 'en_core_web_md'
nlp = spacy.load(model)
nlp_md = spacy.load(model_md)

The `nlp` object is an instance of the class [Language](https://spacy.io/api/language).

In [None]:
isinstance(nlp, spacy.Language)

## Pipeline processing and dependency parsing visualization

In [None]:
doc = nlp('Mr. Smith founded a healthcare company. He is the CEO of Health Inc.')
displacy.render(doc, style='dep',jupyter=True)

In [None]:
# Displaying entities
displacy.render(doc, style='ent',jupyter=True)

What information does the analysed document contain? A compact way to look at the most relevant information is the [JSON](https://www.youtube.com/watch?v=pTT7HMqDnJw) serialization.

In [None]:
doc.to_json()

Start and end offsets of segments: spaCy uses Python [slice](https://docs.python.org/3.11/library/functions.html#slice) object semantics, so `start` is inclusive and `end` is exclusive:
`text[start:end]`

In [None]:
'Mr. Smith founded a healthcare company. He is the CEO of Health Inc.'[4:9]
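As a quick check of the slice semantics, the following sketch recomputes the offsets of `Smith` in plain Python. The variable names (`sentence`, `smith_start`, `smith_end`) are chosen here for illustration only.

```python
sentence = 'Mr. Smith founded a healthcare company. He is the CEO of Health Inc.'

# Character offsets of the word "Smith": start is inclusive, end is exclusive
smith_start = sentence.index('Smith')
smith_end = smith_start + len('Smith')

print(smith_start, smith_end, repr(sentence[smith_start:smith_end]))

# With half-open spans, the length of a span is simply end - start
span_length = smith_end - smith_start
```

Because the end offset is exclusive, two adjacent segments can share a boundary value without overlapping.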

A spaCy [Token](https://spacy.io/api/token) contains all information that is token-related. 

In [None]:
type(doc[0])

In [None]:
# look at all (inherited) attributes of the first token...
for attribute in dir(doc[0]):
    print(attribute, getattr(doc[0],attribute))

Iterating over a document is iterating over its tokens. Many attributes are integer IDs that index into the pipeline's vocabulary. The string value is typically available under the same attribute name with a trailing underscore (for example `pos` and `pos_`).

In [None]:
for t in doc:
    print(t.i,t.text, t.pos_, t.pos, t.tag, t.lemma_,t.head.i, t.is_sent_start)

The vocabulary object maps strings to integers and vice versa.

In [None]:
nlp.vocab.strings.as_string(96)

In [None]:
nlp.vocab.strings[96]

In [None]:
nlp.vocab.strings.as_int("PROPN")

In [None]:
nlp.vocab.strings["PROPN"]
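To make the idea of a two-way mapping concrete, here is a toy sketch in plain Python. `ToyStringStore` is a hypothetical class written for this notebook. It is not spaCy's `StringStore`, and the hash function used here (BLAKE2 from `hashlib`) is an assumption for illustration, so the IDs differ from spaCy's.

```python
import hashlib


class ToyStringStore:
    """Toy two-way mapping between strings and 64-bit integer IDs."""

    def __init__(self):
        self._strings = {}  # integer ID -> string

    @staticmethod
    def hash_string(string):
        # Derive a stable 64-bit integer from the string (illustrative hash choice)
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, string):
        key = self.hash_string(string)
        self._strings[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return self.hash_string(item)  # string -> ID
        return self._strings[item]         # ID -> string (KeyError if never added)


toy_store = ToyStringStore()
propn_id = toy_store.add("PROPN")
print(propn_id, toy_store[propn_id], toy_store["PROPN"] == propn_id)
```

Storing integers on each token and keeping the strings in one shared table is what makes the underscore and non-underscore attribute pairs cheap to provide.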

## Noun Chunks
The nlp pipeline computes noun chunks for your convenience.

In [None]:
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])

## Word Vectors
Visualization of word vectors for human eyeballing

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def plot_words(words:str, nlp):
    doc = nlp(words)
    vectors = [token.vector for token in doc][:3]
    x = np.arange(len(vectors[0])) 
    
    # Plotting the data
    plt.figure(figsize=(20, 4))
    
    for i, vector in enumerate(vectors):
        plt.plot(x, vector,  linewidth=2, label=f'Word {i+1}', color=['red', 'green', 'blue'][i])
    
    plt.title('Which line represents which word?')
    plt.legend()
    plt.show()

In [None]:
plot_words("love hate",nlp_md)

## Computing Vector-based Similarity
The medium pipeline ships with static word vectors, which the small pipeline lacks, so its similarity scores are more meaningful. If you want to know more about how the cosine similarity is computed, see [this cosine-similarity notebook](cosine-similarity.ipynb).
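For reference, cosine similarity can be sketched in a few lines of plain Python. This is a minimal illustration of the formula, not spaCy's implementation. The function name and the convention of returning `0.0` for zero vectors are choices made here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: no direction, no similarity
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```

The score depends only on the direction of the vectors, not on their length.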

In [None]:
text3= ("Chemical similarity (or molecular similarity) refers to the similarity of chemical elements, molecules or chemical compounds "
        "with respect to either structural or functional qualities, i.e. the effect that the chemical compound has on reaction partners "
        "in inorganic or biological settings. Biological effects and thus also similarity of effects are usually quantified using "
        "the biological activity of a compound. In general terms, function can be related to the chemical activity of compounds (among others).")
doc3 = nlp_md(text3)

How similar is the word "molecular" to the word "molecules"?

In [None]:
print(doc3[4],doc3[15])
doc3[4].similarity(doc3[15])

How similar is the word "or" to the word "molecules"?

In [None]:
print(doc3[3],doc3[15])
doc3[3].similarity(doc3[15])

How similar is the word "molecular" to the overall document?

In [None]:
help(doc3[3].similarity)

In [None]:
doc3[4], doc3[4].similarity(doc3)

How similar is the word "or" to the overall document?

In [None]:
doc3[3], doc3[3].similarity(doc3)

You can check whether a word has its own word vector (that is, it is not OOV) or whether its vector is computed on the fly.

In [None]:
doc3[17].is_oov,doc3[17].vector

## Comparing tokens
Compute for each word its most similar other word in the text and print the pairs in order of their similarity score.

In [None]:
most_similar = {}  # maps each token text to (most similar other token text, similarity score)
for i in doc3:
    for j in doc3:
        # Only consider partners that sort after the word; `<` also excludes identical texts
        if i.text < j.text:
            similarity = i.similarity(j)
            # Keep the stored partner if it is more similar than the current one
            if i.text in most_similar and most_similar[i.text][1] > similarity:
                continue
            most_similar[i.text] = (j.text, similarity)

for (sim, w1, w2) in sorted(((sim, w1, w2) for (w1, (w2, sim)) in most_similar.items()), reverse=True):
    print(w1, w2, f"{sim:.2f}")

## Comparing tokens with documents
Compute a ranking of the words that are most similar to the document vector (the average of all word vectors).

In [None]:
most_similar = {}
for i in doc3:
    similarity = i.similarity(doc3)
    if (i.text) in most_similar:
        if most_similar[i.text] > similarity:
            continue
    most_similar[i.text] = similarity

for (sim, w1) in sorted(((sim, w1) for (w1,sim) in most_similar.items()), reverse=True):
    print(w1, f"{sim:.3f}") 
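The ranking idea can also be reproduced without spaCy. The sketch below averages a few hand-made two-dimensional vectors into a "document vector" and ranks the words against it. The vectors in `toy_vectors` are invented for illustration and are not taken from any model.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise average of a list of equally long vectors."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]


# Invented toy vectors (assumption: real word vectors have hundreds of dimensions)
toy_vectors = {"chemical": [1.0, 0.0], "molecule": [0.8, 0.6], "or": [0.0, 1.0]}

toy_doc_vector = mean_vector(list(toy_vectors.values()))
toy_ranking = sorted(
    ((cosine_similarity(vec, toy_doc_vector), word) for word, vec in toy_vectors.items()),
    reverse=True,
)
for score, word in toy_ranking:
    print(word, f"{score:.3f}")
```

Words whose vectors point in a direction close to the average rank highest, which is why frequent "central" words tend to dominate such a ranking.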


## Matching using token patterns
spaCy has a powerful token-based pattern matching engine that can use [any of the linguistic properties of tokens for searching and adding information](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes). See the online demo for [interactively creating matching patterns](https://explosion.ai/demos/matcher).

Let's implement a simple matcher.
The default NER of this model does not include titles in its PERSON entities. We can write a token matcher to rectify the situation.

In [None]:
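doc2 = nlp('Mr. Smith continues to work for the company for the next 10 years.')  # example document for the matcher
displacy.render(doc2, style='ent',jupyter=True)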

In [None]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab) # initialize the matcher with the current Vocab

A pattern is a list of dictionaries. Each dictionary specifies a token pattern that can include regex-like repetition information. An empty dictionary `{}` matches any token. 

In [None]:
pattern = [{"LOWER":"mr."},{"ENT_TYPE":"PERSON", "OP":"+"}]

In [None]:
matcher.add("TitledPERSON",[pattern])

In [None]:
matches = matcher(doc2)

In [None]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]          # Get string representation
    span = doc2[start:end]                           # The matched span (from the document that was matched)
    print(match_id, string_id, start, end, span.text)
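To see how a list of token dictionaries with an `"OP": "+"` operator can be interpreted, here is a toy re-implementation in plain Python. It is only a sketch under simplifying assumptions: it supports exact attribute comparison and the `+` operator, the helper names (`token_matches`, `match_ends`) are hypothetical, and the token attributes are written by hand rather than produced by a model. It is not how spaCy's `Matcher` is implemented.

```python
def token_matches(token, spec):
    """Check all attribute constraints of one pattern dict (ignoring OP)."""
    return all(token.get(attr) == value for attr, value in spec.items() if attr != "OP")


def match_ends(tokens, pattern, start):
    """Return every end index so that tokens[start:end] satisfies the pattern."""
    ends = set()

    def walk(token_index, pattern_index):
        if pattern_index == len(pattern):  # whole pattern consumed
            ends.add(token_index)
            return
        spec = pattern[pattern_index]
        if token_index < len(tokens) and token_matches(tokens[token_index], spec):
            walk(token_index + 1, pattern_index + 1)  # move on to the next pattern dict
            if spec.get("OP") == "+":
                walk(token_index + 1, pattern_index)  # "+": the same dict may match again

    walk(start, 0)
    return sorted(ends)


# Hand-written token attributes (invented for this example, not model output)
toy_tokens = [
    {"LOWER": "mr.", "ENT_TYPE": ""},
    {"LOWER": "john", "ENT_TYPE": "PERSON"},
    {"LOWER": "smith", "ENT_TYPE": "PERSON"},
    {"LOWER": "founded", "ENT_TYPE": ""},
]
toy_pattern = [{"LOWER": "mr."}, {"ENT_TYPE": "PERSON", "OP": "+"}]

toy_matches = [
    (start, end)
    for start in range(len(toy_tokens))
    for end in match_ends(toy_tokens, toy_pattern, start)
]
print(toy_matches)
```

Note that the `+` operator yields one match per possible length, so a title followed by two PERSON tokens produces both a shorter and a longer span.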

## EntityRuler: A Matcher for Adding Entities from Scratch
The EntityRuler automatically adds matched spans to `Doc.ents`. See the [documentation](https://spacy.io/api/entityruler).

In [None]:
nlp2 = spacy.load("en_core_web_sm")
doc3 = nlp2('"Ah, Lysara," Thalric had said, his voice as deep and mysterious.')

In [None]:
displacy.render(doc3, style='ent',jupyter=True)

In [None]:
ruler = nlp2.add_pipe("entity_ruler", config={"overwrite_ents":True}) 
patterns = [{"label": "PERSON", "pattern": "Thalric"}]  # Add more patterns here
ruler.add_patterns(patterns)

In [None]:
doc3 = nlp2('"Ah, Lysara," Thalric had said, his voice as deep and mysterious.')
displacy.render(doc3, style='ent',jupyter=True)

## Navigating the dependency structure
Apart from following the head attribute of each token, there is more functionality for traversing the dependency relations.
The [children](https://spacy.io/api/token#children) generator yields the immediately dependent words.

In [None]:
doc[2].children

In [None]:
print("Head:", doc[2])
for c in doc[2].children:
    print(c)

Navigating the dependency structure: the `ancestors` generator enumerates the chain of heads above a token.

In [None]:
for t in doc[0].ancestors:
    print(t, type(t))
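The relationship between heads, children and ancestors can be sketched with a plain list of head indices. The parse below is annotated by hand (an assumption for illustration, not parser output), and `children_of` and `ancestors_of` are hypothetical helper names. As in spaCy, the root token is its own head.

```python
# toy_heads[i] is the index of the head of token i; the root points to itself
toy_words = ["Mr.", "Smith", "founded", "a", "healthcare", "company", "."]
toy_heads = [1, 2, 2, 5, 5, 2, 2]


def children_of(index, heads):
    """Indices of all tokens whose immediate head is `index`."""
    return [i for i, head in enumerate(heads) if head == index and i != index]


def ancestors_of(index, heads):
    """Follow the head pointers upwards until the root is reached."""
    chain = []
    while heads[index] != index:
        index = heads[index]
        chain.append(index)
    return chain


print("Children of 'founded':", [toy_words[i] for i in children_of(2, toy_heads)])
print("Ancestors of 'Mr.':", [toy_words[i] for i in ancestors_of(0, toy_heads)])
```

Children are found by scanning for tokens that point at a given head, while ancestors are found by repeatedly following a single token's head pointer.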

## Matching using Dependency Patterns
We can match non-contiguous slices via dependency relations.
See the [usage guide](https://spacy.io/usage/rule-based-matching#dependencymatcher-patterns) for more information.

In [None]:
from spacy.matcher import DependencyMatcher
dep_matcher = DependencyMatcher(nlp.vocab) # initialize the matcher with the current Vocab


Let's define a simple verb-subject pattern!

In [None]:
dep_pattern = [
    {
        "RIGHT_ID": "anchor_founded",      # Introduce ID for anchor token
        "RIGHT_ATTRS": {"ORTH": "founded"} # Specify anchor token
    },
    {
        "LEFT_ID": "anchor_founded",       # Refer to anchor token by its ID
        "REL_OP": ">",                     # Direct dependency
        "RIGHT_ID": "founded_subject",     # Introduce ID for dependent subject 
        "RIGHT_ATTRS": {"DEP": "nsubj"},   # Specify subject token
    } 
]

In [None]:
dep_matcher.add("FOUNDED", [dep_pattern])
doc3 = nlp("Smith, an experienced CEO, has founded two AI startups.")
dep_matches = dep_matcher(doc3)

In [None]:
print(dep_matches)

In [None]:
# Each token_id corresponds to one pattern dict
match_id, token_ids = dep_matches[0]

for i in range(len(token_ids)):
    print(dep_pattern[i]["RIGHT_ID"] + ":", doc3[token_ids[i]].text)
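The meaning of the `>` operator (the anchor is the immediate head of the dependent) can be illustrated with a small plain-Python search. The parse is annotated by hand as an assumption, and `match_head_dependent` is a hypothetical helper, not part of spaCy.

```python
# Hand-annotated toy parse of the example sentence (assumption, not parser output)
dep_words = ["Smith", ",", "an", "experienced", "CEO", ",", "has", "founded", "two", "AI", "startups", "."]
dep_heads = [7, 0, 4, 4, 0, 0, 7, 7, 10, 10, 7, 7]
dep_labels = ["nsubj", "punct", "det", "amod", "appos", "punct", "aux", "ROOT",
              "nummod", "compound", "dobj", "punct"]


def match_head_dependent(words, heads, labels, anchor_orth, dependent_dep):
    """Find [anchor, dependent] index pairs where anchor > dependent holds."""
    found = []
    for anchor, word in enumerate(words):
        if word != anchor_orth:
            continue
        for dependent, head in enumerate(heads):
            # The dependent must hang directly below the anchor with the requested label
            if head == anchor and dependent != anchor and labels[dependent] == dependent_dep:
                found.append([anchor, dependent])
    return found


toy_dep_matches = match_head_dependent(dep_words, dep_heads, dep_labels, "founded", "nsubj")
print(toy_dep_matches)
```

The two matched tokens are not adjacent in the text, which is exactly what a token-sequence pattern cannot express.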

In [None]:
doc3 = nlp("Smith, an experienced CEO, has founded two AI startups. He also launched several other companies.")

## Serialization to disk and reading from disk

In [None]:
doc.to_disk('my_doc.spacy')

In [None]:
from spacy.tokens import Doc
from spacy.vocab import Vocab
my_vocab = Vocab() # create empty vocabulary or use existing compatible one, e.g. `nlp.vocab`
loaded_doc = Doc(my_vocab).from_disk('my_doc.spacy')

In [None]:
loaded_doc.to_json()
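The general idea of (de)serialization, independent of spaCy, is a round trip: turn an in-memory object into bytes or text, store it, and restore an equal object later. The sketch below uses JSON from the standard library only because it is human-readable. The token records are invented for illustration, and spaCy's own on-disk format is a different one.

```python
import json
import tempfile
from pathlib import Path

# Invented token records standing in for an analysed document
toy_token_records = [
    {"text": "Mr.", "start": 0, "end": 3},
    {"text": "Smith", "start": 4, "end": 9},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tokens.json"
    path.write_text(json.dumps(toy_token_records), encoding="utf8")  # serialize to disk
    restored_records = json.loads(path.read_text(encoding="utf8"))   # deserialize again

print(restored_records == toy_token_records)
```

The restored object is equal in content to the original but is a new object in memory.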

## Combining a list of docs
If you want to combine several docs into one document, you can use the static method (a function that is called directly on the class) `from_docs()`.

In [None]:
doc2 = nlp('Mr. Smith continues to work for the company for the next 10 years.')

In [None]:
all_docs = Doc.from_docs([doc, doc2])  # static method: called on the class, no instance needed

In [None]:
all_docs.to_json()