# spaCy: A modern NLP pipeline
Learning goals:
 - Understand how a modern NLP pipeline works and what spaCy models typically do
 - Understand the main data structures: Doc, Token, Vocab
 - Understand how rule-based matching with a domain-specific search query language works 
 - Understand the Matcher and DependencyMatcher classes
 - Understand what (de)serialization is and how it applies to spaCy data structures
 - Get to know the displacy visualization
 
See [spaCy 101](https://spacy.io/usage/spacy-101) for an introduction.

In [None]:
# Only run this once (and maybe restart the kernel)
! pip install spacy
! python -m spacy download en_core_web_sm

In [None]:
import spacy
from spacy import displacy 

nlp = spacy.load('en_core_web_sm')

The `nlp` object is an instance of the class [Language](https://spacy.io/api/language).

In [None]:
isinstance(nlp, spacy.Language)

In [None]:
doc = nlp('Mr. Smith founded a healthcare company. He is the CEO.')
displacy.render(doc, style='dep',jupyter=True)

What information is in the analysed document? A compact way to look at the most relevant information is the [JSON](https://www.youtube.com/watch?v=pTT7HMqDnJw) serialization.

In [None]:
doc.to_json()
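To see the shape of the JSON output without depending on a trained model, here is a minimal sketch that builds a `Doc` by hand. The names `blank_nlp`, `json_doc` and `data` are made up for this illustration, and the blank pipeline is an assumption (it provides only a tokenizer and a vocabulary), so annotation fields such as part-of-speech tags will be absent.

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline: vocabulary and tokenizer only, no trained components.
blank_nlp = spacy.blank("en")

# Build the Doc from explicit words so the result does not depend on a model.
json_doc = Doc(blank_nlp.vocab, words=["Smith", "founded", "a", "company"])
data = json_doc.to_json()

# The result is a plain Python dict that can be passed to json.dumps().
print(sorted(data.keys()))
for tok in data["tokens"]:
    # Each token entry records its index and its character offsets in the text.
    print(tok["id"], tok["start"], tok["end"])
```

With `en_core_web_sm` loaded, the same call adds further keys per token (for example the tag, lemma and dependency label).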

A spaCy [Token](https://spacy.io/api/token) contains all information that is token-related. 

In [None]:
type(doc[0])

In [None]:
for attribute in dir(doc[0]):
    print(attribute, getattr(doc[0],attribute))

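Iterating over a document is iterating over its tokens. Many attributes are integer IDs that index into the pipeline's vocabulary (`nlp.vocab`). The string value is typically available under the same attribute name with a trailing underscore, e.g. `t.pos` (integer) versus `t.pos_` (string).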

In [None]:
for t in doc:
    print(t.i,t.text, t.pos_, t.pos, t.tag, t.lemma_,t.head.i, t.is_sent_start)
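The integer/string pairing can be checked directly. The following is a small sketch, assuming a hand-built `Doc` with manually assigned part-of-speech tags instead of model predictions. `blank_nlp` and `pos_doc` are illustrative names.

```python
import spacy
from spacy.tokens import Doc

blank_nlp = spacy.blank("en")

# Assign coarse part-of-speech tags by hand instead of running a tagger.
pos_doc = Doc(
    blank_nlp.vocab,
    words=["Smith", "founded", "a", "company"],
    pos=["PROPN", "VERB", "DET", "NOUN"],
)

for t in pos_doc:
    # t.pos is the integer ID, t.pos_ the readable string;
    # the string store translates the ID back into the string.
    print(t.text, t.pos, t.pos_, blank_nlp.vocab.strings[t.pos])
```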

The vocabulary object maps strings to integers and vice versa.

In [None]:
nlp.vocab.strings.as_string(96)

In [None]:
nlp.vocab.strings[96]

In [None]:
nlp.vocab.strings.as_int("PROPN")

In [None]:
nlp.vocab.strings["PROPN"]
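The same two-way lookup works for arbitrary strings, not only for built-in labels such as `PROPN`. A minimal sketch, assuming a blank pipeline and an example string (`"coffee"`) chosen purely for illustration:

```python
import spacy

blank_nlp = spacy.blank("en")
strings = blank_nlp.vocab.strings

# add() registers the string and returns its integer hash value.
coffee_id = strings.add("coffee")

print(coffee_id)                        # integer ID
print(strings[coffee_id])               # integer -> string
print(strings["coffee"] == coffee_id)   # string -> integer
print("coffee" in strings)              # membership test on the store
```

An integer can only be translated back into text if the string was added to this particular store beforehand. This is why a `Doc` and the vocabulary it was created with belong together.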

## Serialization to disk and reading from disk

In [None]:
doc.to_disk('my_doc.spacy')

In [None]:
from spacy.tokens import Doc
from spacy.vocab import Vocab
my_vocab = Vocab() # create empty vocabulary or use existing compatible one, e.g. `nlp.vocab`
loaded_doc = Doc(my_vocab).from_disk('my_doc.spacy')

In [None]:
loaded_doc.to_json()
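Serialization does not have to go through the file system: `Doc.to_bytes()` and `Doc.from_bytes()` do the same round trip in memory. The sketch below assumes a hand-built document and uses illustrative names (`source_doc`, `payload`, `restored_doc`).

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built document; spaces[i] says whether token i is followed by a space.
source_doc = Doc(
    Vocab(),
    words=["Mr.", "Smith", "founded", "a", "company", "."],
    spaces=[True, True, True, True, False, False],
)

payload = source_doc.to_bytes()                   # serialize to a bytes object
restored_doc = Doc(Vocab()).from_bytes(payload)   # deserialize into a fresh Doc

print(type(payload))
print(restored_doc.text)
```

The byte string can be stored in a database or sent over a network in the same way the file written by `to_disk()` is stored on disk.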

## Combining a list of docs
If you want to combine several docs into one document, you can use the static method `Doc.from_docs()` (a static method is called on the class itself rather than on an instance).

In [None]:
doc2 = nlp('Mr. Smith continues to work for the company for the next 10 years.')

In [None]:
all_docs = Doc.from_docs([doc, doc2])  # static method: call it on the class, no instance needed

In [None]:
all_docs.to_json()
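A small self-contained check of what combining does to the token sequence, sketched with two hand-built documents. The names `shared_vocab`, `first`, `second` and `merged` are illustrative. Both documents are created with the same vocabulary, which the documents to be combined need to share.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Both documents must be built on the same Vocab object.
shared_vocab = Vocab()
first = Doc(shared_vocab, words=["Mr.", "Smith", "founded", "a", "company", "."])
second = Doc(shared_vocab, words=["He", "is", "the", "CEO", "."])

merged = Doc.from_docs([first, second])

# The merged document contains the tokens of both inputs, in order.
print(len(first), len(second), len(merged))
print([t.text for t in merged])
```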

## Navigating the dependency structure
Apart from following the head attribute of each token, there is more functionality for traversing the dependency relations.
The [children](https://spacy.io/api/token#children) generator yields the immediately dependent words of a token.

In [None]:
doc[2].children

In [None]:
print("Head:", doc[2])
for c in doc[2].children:
    print(c)

Navigating the dependency structure: enumerating the ancestors of a token (its head, the head's head, and so on up to the root)

In [None]:
for t in doc[0].ancestors:
    print(t, type(t))
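To make the traversal attributes easy to verify by hand, the following sketch builds a tiny tree manually instead of relying on the parser. The sentence, the head indices and the dependency labels are assumptions chosen for illustration; `heads` holds the absolute index of each token's head, and the root points to itself.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Smith", "founded", "a", "company"]
heads = [1, 1, 3, 1]                      # "founded" (index 1) is the root
deps = ["nsubj", "ROOT", "det", "dobj"]
tree_doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

founded = tree_doc[1]

# Immediate dependents of the verb
children = [t.text for t in founded.children]
# Path from the determiner "a" up to the root
ancestors = [t.text for t in tree_doc[2].ancestors]
# The token "company" together with everything below it
subtree = [t.text for t in tree_doc[3].subtree]

print(children)
print(ancestors)
print(subtree)
```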

## Matching using token patterns
spaCy has a powerful token-based pattern matching engine that can use [any of the linguistic properties of tokens for searching and adding information](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes). See the online demo for [interactively creating matching patterns](https://explosion.ai/demos/matcher).

Let's implement a simple matcher:
The normal NER of this model does not include titles. We can write a token matcher to rectify the situation.

In [None]:
displacy.render(doc2, style='ent',jupyter=True)

In [None]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab) # initialize the matcher with the current Vocab

A pattern is a list of dictionaries. Each dictionary specifies a token pattern that can include regex-like repetition information. An empty dictionary `{}` matches any token. 

In [None]:
pattern = [{"LOWER":"mr."},{"ENT_TYPE":"PERSON", "OP":"+"}]

In [None]:
matcher.add("TitledPERSON",[pattern])

In [None]:
matches = matcher(doc2)

In [None]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]          # Get string representation
    span = doc2[start:end]                           # The matched span (the matcher was run on doc2)
    print(match_id, string_id, start, end, span.text)
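The pattern mechanics can also be tried without a trained model. The sketch below is a variation on the pattern above, not the same one: since a blank pipeline has no entity recognizer, it assumes the lexical attribute `IS_TITLE` (title-cased token) as a stand-in for `ENT_TYPE`. The names `title_doc`, `title_matcher` and `title_pattern` are illustrative. The `greedy="LONGEST"` argument keeps only the longest of the overlapping matches that the `+` operator would otherwise produce.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

blank_nlp = spacy.blank("en")
title_doc = Doc(blank_nlp.vocab, words=["Mr.", "John", "Smith", "left"])

title_matcher = Matcher(blank_nlp.vocab)
# "mr." (case-insensitive) followed by one or more title-cased tokens
title_pattern = [{"LOWER": "mr."}, {"IS_TITLE": True, "OP": "+"}]
title_matcher.add("TITLED_NAME", [title_pattern], greedy="LONGEST")

matched = [title_doc[start:end].text for _, start, end in title_matcher(title_doc)]
print(matched)
```

Without the `greedy` argument the matcher reports every possible expansion of `+`, so "Mr. John" would be returned in addition to "Mr. John Smith".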

## Matching using Dependency Patterns
We can match non-contiguous slices via dependency relations.
See the [usage guide](https://spacy.io/usage/rule-based-matching#dependencymatcher-patterns) for more information.

In [None]:
from spacy.matcher import DependencyMatcher
dep_matcher = DependencyMatcher(nlp.vocab) # initialize the matcher with the current Vocab


Let's define a simple verb-subject pattern!

In [None]:
dep_pattern = [
    {
        "RIGHT_ID": "anchor_founded",      # Introduce ID for anchor token
        "RIGHT_ATTRS": {"ORTH": "founded"} # Specify anchor token
    },
    {
        "LEFT_ID": "anchor_founded",       # Refer to anchor token by its ID
        "REL_OP": ">",                     # Anchor is the immediate head of the right token
        "RIGHT_ID": "founded_subject",     # Introduce ID for dependent subject 
        "RIGHT_ATTRS": {"DEP": "nsubj"},   # Specify subject token
    } 
]

In [None]:
dep_matcher.add("FOUNDED", [dep_pattern])
doc3 = nlp("Smith, an experienced CEO, has founded two AI startups.")
dep_matches = dep_matcher(doc3)

In [None]:
print(dep_matches)

In [None]:
# Each token_id corresponds to one pattern dict
match_id, token_ids = dep_matches[0]

for pattern_dict, token_id in zip(dep_pattern, token_ids):
    print(pattern_dict["RIGHT_ID"] + ":", doc3[token_id].text)
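The correspondence between pattern dictionaries and returned token indices is easiest to see on a tree whose structure is known in advance. The sketch below assumes a hand-built parse (words, heads and labels are chosen for illustration) and uses illustrative names such as `parsed`, `subject_matcher` and `subject_pattern`.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-built parse: "founded" (index 1) is the root and the head of "Smith".
parsed = Doc(
    vocab,
    words=["Smith", "founded", "two", "startups"],
    heads=[1, 1, 3, 1],
    deps=["nsubj", "ROOT", "nummod", "dobj"],
)

subject_matcher = DependencyMatcher(vocab)
subject_pattern = [
    # Anchor token: the verb itself
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"ORTH": "founded"}},
    # A token directly governed by the verb with the label nsubj
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
subject_matcher.add("FOUNDED_SUBJECT", [subject_pattern])

results = subject_matcher(parsed)
match_id, token_ids = results[0]

# token_ids follows the order of the pattern dicts: verb first, then subject.
for pattern_dict, token_id in zip(subject_pattern, token_ids):
    print(pattern_dict["RIGHT_ID"] + ":", parsed[token_id].text)
```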

In [None]:
doc3 = nlp("Smith, an experienced CEO, has founded two AI startups. He also founded several other companies.")