# Advanced NLP with spaCy (online course)

In [1]:
# Import the English library class
from spacy.lang.en import English

import spacy

## Chapter 1: Finding words, phrases, names and concepts

Basic text processing, data structures, statistical models and how to predict linguistic features in our text.

### Introduction to spaCy

In [2]:
# Create the nlp object (contains processing pipeline)
nlp = English()

# Document object (created when nlp called on a string - text is tokenized)
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Tokens in a document (i.e. words or punctuation)
# Token objects have attributes to access information, e.g. .text
token = doc[3]
print(token.text)
    
# Span object is a slice of the Document (i.e. multiple tokens)
# Only used to view Doc, doesn't contain any data itself
span = doc[1:3]

# Lexical attributes (returns boolean) (refer to vocab, and don't depend on token's context)
print(token.is_alpha, token.is_punct, token.like_num)

# Example search for percentages in text
# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)

more
True False False
Percentage found: 60
Percentage found: 4


### Statistical models

Statistical models allow spaCy to make predictions in context  
e.g. part-of-speech tags, syntactic dependencies, named entities  
Models trained on labelled data  

spaCy provides pretrained model packages  
e.g. en_core_web_sm is a small English model trained on web text  
Package provides binary weights, vocabulary, meta information (language + how to configure processing pipeline)

In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an integer ID value.

.pos_ = predicted part-of-speech tag (i.e. word types in context)  
.dep_ = predicted (syntactic) dependency label  
.head = syntactic head token (i.e. parent token this word is attached to)

In [3]:
# Loads the small English model
nlp = spacy.load("en_core_web_sm")

doc = nlp("She ate the pizza")

for token in doc:
    # Print text and text predictions
    #print(token.text, token.pos_, token.dep_, token.head.text)
    print(f"{token.text:<8}{token.pos_:<8}{token.dep_:<8}{token.head.text:<8}")
    

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted (named) entities (e.g. a person, organization, country)
# doc.ents is a tuple of Span objects
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    
# Use explain() helper function to get definitions for common tags and labels
print(spacy.explain("GPE"))

She     PRON    nsubj   ate     
ate     VERB    ROOT    ate     
the     DET     det     pizza   
pizza   NOUN    dobj    ate     
Apple ORG
U.K. GPE
$1 billion MONEY
Countries, cities, states


### Rule-based matching

Why not just use regular expressions?  
The matcher works with Doc and Token objects, instead of just strings  
You can match on tokens and token attributes  
You can write rules that use the model's predictions

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.  
Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.

Four "OP" values:  
{"OP": "!"}	Negation: match 0 times  
{"OP": "?"}	Optional: match 0 or 1 times  
{"OP": "+"}	Match 1 or more times  
{"OP": "*"}	Match 0 or more times  

In [4]:
# Match exact token texts
pattern1 = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
# Match lexical attributes
pattern2 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
# Match any token attributes
pattern3 = [{"LEMMA": "buy"}, {"POS": "NOUN"}]     # e.g. buying milk, bought flowers

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", None, pattern)

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
    
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
    
# Pattern for matching lexical attributes
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]

# Example text with above pattern
doc = nlp("2018 FIFA World Cup: France won!")

# Pattern for matching other token attributes
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]

# Example text with above pattern
doc = nlp("I loved dogs but now I love cats more.")

# Pattern showing use of operators and quantifiers
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]

# Example text with above pattern
doc = nlp("I bought a smartphone. Now I'm buying apps.")

iPhone X
Matches: ['iPhone X']


## Chapter 2: Large-scale data analysis with spaCy

Extract specific information from large volumes of text. How to make the most of spaCy's data structures. How to effectively combine statistical and rule-based approaches to text analysis.

### Data structures 1 - Vocab, Lexemes and StringStore

Vocab: stores data shared across multiple documents  
To save memory, spaCy encodes all strings to hash values  
Strings are only stored once in the StringStore via nlp.vocab.strings  
String store: lookup table in both directions  

Hash IDs can't be reversed, though. If a word is not in the vocabulary, there's no way to get its string. That's why we always need to pass around the shared vocab.

The doc also exposes the vocab and strings  
i.e. `doc.vocab.strings['coffee']`

Lexemes are context-independent entries in the vocabulary.  
You can get a lexeme by looking up a string or a hash ID in the vocab.  
Lexemes expose attributes, just like tokens.  
They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters.  
Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.

The Doc contains words in context – in this case, the tokens "I", "love" and "coffee" with their part-of-speech tags and dependencies.  
Each token refers to a lexeme, which knows the word's hash ID. To get the string representation of the word, spaCy looks up the hash in the string store.
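
The two-way lookup can be sketched in plain Python. This is only an illustration of the idea: `ToyStringStore` is a hypothetical class, and the SHA-256-based hash is an assumption chosen for convenience, not the hash function spaCy uses.

```python
import hashlib


class ToyStringStore:
    """Two-way lookup table: string -> hash and hash -> string."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(string):
        # Assumption: any stable 64-bit hash is good enough for the sketch
        digest = hashlib.sha256(string.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, string):
        key = self.hash_string(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            # String -> hash can always be computed
            return self.hash_string(item)
        # Hash -> string only works if the string was stored before
        return self._by_hash[item]


strings = ToyStringStore()
coffee_hash = strings.add("coffee")
print(strings[coffee_hash])              # coffee
print(strings["coffee"] == coffee_hash)  # True

try:
    strings[strings.hash_string("tea")]
except KeyError:
    print("'tea' was never added, so its hash cannot be resolved")
```

The failed lookup at the end is the reason the shared vocab has to be passed around: a hash on its own carries no way back to the string.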

In [5]:
#nlp = spacy.load("en_core_web_sm")
nlp = English()

doc = nlp("I love coffee")

coffee_hash = nlp.vocab.strings["coffee"]

# Raises an error if we haven't seen the string before
coffee_string = nlp.vocab.strings[coffee_hash]

print("hash value:", nlp.vocab.strings["coffee"])
print("string value:", nlp.vocab.strings[3197928453018144401])

lexeme = nlp.vocab["coffee"]

# Print the lexical attributes
print('Lexical attributes:', lexeme.text, lexeme.orth, lexeme.is_alpha)

hash value: 3197928453018144401
string value: coffee
Lexical attributes: coffee 3197928453018144401 True


### Data structures 2 - Doc, Span and Token

The Doc is one of the central data structures in spaCy. It's created automatically when you process a text with the nlp object. But you can also instantiate the class manually.  
The Doc class takes three arguments: the shared vocab, the words and the spaces.  
The spaces are a list of boolean values indicating whether each word is followed by a space.

A Span is a slice of a doc consisting of one or more tokens.  
The Span takes at least three arguments: the doc it refers to, and the start and end index of the span.

To create a Span manually, we can also import the class from spacy.tokens. We can then instantiate it with the doc and the span's start and end index, and an optional label argument.  
The doc.ents are writable, so we can add entities manually by overwriting it with a list of spans.
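
How the words and spaces lists encode the original text, and how start and end indices select a slice, can be sketched without spaCy. `join_words` is a hypothetical helper written for this illustration, and the list slice only mimics the index convention of a Span (the end index is exclusive).

```python
def join_words(words, spaces):
    """Rebuild a text from words and per-word trailing-space flags."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


words = ["Hello", "world", "!"]
spaces = [True, False, False]

text = join_words(words, spaces)
print(repr(text))  # 'Hello world!'

# Like Span(doc, 0, 2), a slice from 0 to 2 covers the tokens at index 0 and 1
span_words = words[0:2]
span_text = join_words(span_words, spaces[0:2]).strip()
print(span_words, repr(span_text))  # ['Hello', 'world'] 'Hello world'
```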

Best practice tips:
- The Doc and Span are very powerful and optimized for performance. They give you access to all references and relationships of the words and sentences.
- If your application needs to output strings, make sure to convert the doc as late as possible. If you do it too early, you'll lose all relationships between the tokens.
- To keep things consistent, try to use built-in token attributes wherever possible. For example, token.i for the token index.
- Also, don't forget to always pass in the shared vocab!

In [6]:
nlp = English()

# Import the Doc class
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]

print([(ent.text, ent.label_) for ent in doc.ents])

[('Hello world', 'GREETING')]


In [7]:
# Example to analyse text and collect all proper nouns that are followed by a verb.

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun and not the last token
    if token.pos_ == "PROPN" and token.i + 1 < len(doc):
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin


### Word vectors and semantic similarity

spaCy can compare two objects and predict how similar they are – for example, documents, spans or single tokens.  
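The Doc, Token and Span objects have a .similarity method that takes another object and returns a floating point number, generally between 0 and 1, indicating how similar they are (cosine similarity can in principle drop below 0 for vectors that point in opposite directions).  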
One thing that's very important: In order to use similarity, you need a larger spaCy model that has word vectors included. For example, the medium or large English model – but not the small one. So if you want to use vectors, always go with a model that ends in "md" or "lg".  
You can also use the similarity methods to compare different types of objects.

**Word vectors:**  
Similarity is determined using word vectors, multi-dimensional representations of meanings of words.  
You might have heard of Word2Vec, which is an algorithm that's often used to train word vectors from raw text.  
Vectors can be added to spaCy's statistical models.  
By default, the similarity returned by spaCy is the cosine similarity between two vectors – but this can be adjusted if necessary.  
Vectors for objects consisting of several tokens, like the Doc and Span, default to the average of their token vectors.  
That's also why you usually get more value out of shorter phrases with fewer irrelevant words.
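
Both ideas, cosine similarity and averaging token vectors, fit in a few lines of plain Python. The 3-dimensional vectors below are made up for the illustration (real models use e.g. 300 dimensions), and `cosine_similarity` and `average_vector` are hypothetical helper names, not spaCy functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """Element-wise mean, used here as the vector of a multi-token text."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


# Made-up toy vectors, not taken from any real model
toy_vectors = {
    "I": [1.0, 0.0, 0.0],
    "like": [0.0, 1.0, 0.0],
    "pizza": [0.0, 0.0, 1.0],
    "pasta": [0.0, 0.2, 0.9],
}

doc1 = average_vector([toy_vectors[w] for w in ["I", "like", "pizza"]])
doc2 = average_vector([toy_vectors[w] for w in ["I", "like", "pasta"]])

doc_similarity = cosine_similarity(doc1, doc2)
word_similarity = cosine_similarity(toy_vectors["pizza"], toy_vectors["I"])
print(round(doc_similarity, 3))
print(word_similarity)  # 0.0, because the two toy vectors are orthogonal
```

The two averaged "documents" come out as very similar because they share two of their three words, which is the same effect that makes longer texts with many common words look alike.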

**Similarity depends on the application context:**  
Predicting similarity can be useful for many types of applications. For example, to recommend a user similar texts based on the ones they have read. It can also be helpful to flag duplicate content, like posts on an online platform.  
However, it's important to keep in mind that there's no objective definition of what's similar and what isn't. It always depends on the context and what your application needs to do.  
Here's an example: spaCy's default word vectors assign a very high similarity score to "I like cats" and "I hate cats". This makes sense, because both texts express sentiment about cats. But in a different application context, you might want to consider the phrases as very dissimilar, because they talk about opposite sentiments.

In [8]:
# Load a larger model with vectors
nlp = spacy.load("en_core_web_md")

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")
print(span.similarity(doc))

doc = nlp("I have a banana")
# Access the vector via the token.vector attribute
print(doc[3].vector)
# The result is a 300-dimensional vector of the word "banana".

0.8627204117787385
0.6199091710787739
[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 ...]

### Combining models and rules

Statistical models are useful if your application needs to be able to generalize based on a few examples.  
For instance, detecting product or person names usually benefits from a statistical model. Instead of providing a list of all person names ever, your application will be able to predict whether a span of tokens is a person name. Similarly, you can predict dependency labels to find subject/object relationships.  
To do this, you would use spaCy's entity recognizer, dependency parser or part-of-speech tagger.

Rule-based approaches on the other hand come in handy if there's a more or less finite number of instances you want to find. For example, all countries or cities of the world, drug names or even dog breeds.  
In spaCy, you can achieve this with custom tokenization rules, as well as the matcher and phrase matcher.

**Efficient phrase matching:**  
The phrase matcher is another helpful tool to find sequences of words in your data.  
It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context.  
It takes Doc objects as patterns.  
It's also really fast.  
This makes it very useful for matching large dictionaries and word lists on large volumes of text.  
The phrase matcher can be imported from spacy.matcher and follows the same API as the regular matcher.  
Instead of a list of dictionaries, we pass in a Doc object as the pattern.  
We can then iterate over the matches in the text, which gives us the match ID, and the start and end of the match. This lets us create a Span object for the matched tokens "Golden Retriever" to analyze it in context.

In [9]:
# Adding statistical predictions to rule based matching:

nlp = spacy.load("en_core_web_sm")

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

matcher.add("DOG", None, [{"LOWER": "golden"}, {"LOWER": "retriever"}])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span:", span.text)
    # Get the span's root token and root head token
    print("Root token:", span.root.text)
    print("Root head token:", span.root.head.text)
    # Get the previous token and its POS tag (only if the match isn't at the start of the doc)
    if start > 0:
        print("Previous token:", doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


In [10]:
# Phrase matching example

from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add("DOG", None, pattern)
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print("Matched span:", span.text)

Matched span: Golden Retriever


In [11]:
# Sometimes it’s more efficient to match exact strings instead of writing patterns describing the individual tokens. 

COUNTRIES = ["France", "Czech Republic", "Slovakia"]

nlp = English()
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

[Czech Republic, Slovakia]


## Chapter 3: Processing pipelines

What goes on under the hood when you process a text, how to write your own components and add them to the pipeline, and how to use custom attributes to add your own metadata to the documents, spans and tokens.

### Processing pipelines

Processing pipelines: a series of functions applied to a doc to add attributes like part-of-speech tags, dependency labels or named entities.

**What does the nlp object actually do?**  
First, the tokenizer is applied to turn the string of text into a Doc object. Next, a series of pipeline components is applied to the doc in order. In this case, the tagger, then the parser, then the entity recognizer. Finally, the processed doc is returned, so you can work with it.
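
The order of operations can be sketched in plain Python. Everything here is a stand-in: the "doc" is a dict, the tokenizer just splits on whitespace, and `make_component` and `process` are hypothetical names, so this only mirrors the control flow, not spaCy's implementation.

```python
def tokenizer(text):
    # Stand-in for spaCy's tokenizer: the "doc" here is just a dict
    return {"text": text, "tokens": text.split(), "applied": []}


def make_component(name):
    def component(doc):
        # A real component would set attributes such as tags or entities
        doc["applied"].append(name)
        return doc

    return component


# List of (name, component) tuples, applied in order
pipeline = [(name, make_component(name)) for name in ["tagger", "parser", "ner"]]


def process(text):
    doc = tokenizer(text)          # 1. string -> doc
    for name, component in pipeline:
        doc = component(doc)       # 2. each component takes and returns the doc
    return doc                     # 3. the processed doc is returned


doc = process("She ate the pizza")
print(doc["tokens"])   # ['She', 'ate', 'the', 'pizza']
print(doc["applied"])  # ['tagger', 'parser', 'ner']
```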

**Built in pipeline components:**  

| Name | Description | Creates |
| --- | --- | --- |
| tagger | Part-of-speech tagger | Token.tag, Token.pos |
| parser | Dependency parser | Token.dep, Token.head, Doc.sents, Doc.noun_chunks |
| ner | Named entity recognizer | Doc.ents, Token.ent_iob, Token.ent_type |
| textcat | Text classifier | Doc.cats |

All models you can load into spaCy include several files and a meta.json.  
The meta defines things like the language and pipeline. This tells spaCy which components to instantiate.  
The built-in components that make predictions also need binary data. The data is included in the model package and loaded into the component when you load the model.

`nlp.pipe_names`: list of pipeline component names  
`nlp.pipeline`: list of (name, component) tuples

The component functions are the functions applied to the doc to process it and set attributes – for example, part-of-speech tags or named entities.

In [12]:
nlp = spacy.load("en_core_web_sm")

print(nlp.pipe_names)
print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x000001F2386F6308>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x000001F23B35E5E8>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x000001F23B35E528>)]


### Custom pipeline components

After the text is tokenized and a Doc object has been created, pipeline components are applied in order. spaCy supports a range of built-in components, but also lets you define your own.  
Custom components are executed automatically when you call the nlp object on a text.  
They're especially useful for adding your own custom metadata to documents and tokens.  
You can also use them to update built-in attributes, like the named entity spans.

Fundamentally, a pipeline component is a function or callable that takes a doc, modifies it and returns it, so it can be processed by the next component in the pipeline.  
Components can be added to the pipeline using the nlp.add_pipe method. The method takes at least one argument: the component function.

To specify where to add the component in the pipeline, you can use the following keyword arguments:  
Setting last to True will add the component last in the pipeline. This is the default behavior.  
Setting first to True will add the component first in the pipeline, right after the tokenizer.  
The before and after arguments let you define the name of an existing component to add the new component before or after. For example, before="ner" will add it before the named entity recognizer.  
The component named in before or after must already exist in the pipeline; otherwise spaCy will raise an error.
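
The positioning rules can be illustrated on a plain list of component names. The `add_pipe` function below is a hypothetical stand-in written for this sketch, not spaCy's method; rejecting more than one positioning argument is an assumption made to keep the toy unambiguous.

```python
def add_pipe(pipe_names, name, first=False, last=False, before=None, after=None):
    """Return a new list with `name` inserted at the requested position."""
    chosen = [bool(first), bool(last), before is not None, after is not None]
    if sum(chosen) > 1:
        raise ValueError("Use only one of first, last, before or after")
    names = list(pipe_names)
    if first:
        names.insert(0, name)
    elif before is not None:
        # list.index raises ValueError if the reference component is missing
        names.insert(names.index(before), name)
    elif after is not None:
        names.insert(names.index(after) + 1, name)
    else:
        # Default behavior: add the component last
        names.append(name)
    return names


base = ["tagger", "parser", "ner"]
print(add_pipe(base, "custom"))                  # ['tagger', 'parser', 'ner', 'custom']
print(add_pipe(base, "custom", first=True))      # ['custom', 'tagger', 'parser', 'ner']
print(add_pipe(base, "custom", before="ner"))    # ['tagger', 'parser', 'custom', 'ner']
print(add_pipe(base, "custom", after="tagger"))  # ['tagger', 'custom', 'parser', 'ner']

try:
    add_pipe(base, "custom", before="textcat")
except ValueError:
    print("'textcat' is not in the pipeline")
```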

In [13]:
# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print("Doc length:", len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Print the pipeline component names
print("Pipeline:", nlp.pipe_names)

# Process a text
doc = nlp("Hello world!")

Pipeline: ['custom_component', 'tagger', 'parser', 'ner']
Doc length: 3


In [14]:
# Example custom component to add animals in the text to doc.ents

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the "ner" component
nlp.add_pipe(animal_component, after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


### Extension attributes

Custom attributes let you add any metadata to docs, tokens and spans. The data can be added once, or it can be computed dynamically.  
Custom attributes are available via the ._ (dot underscore) property. This makes it clear that they were added by the user, and not built into spaCy, like token.text.  
Attributes need to be registered on the global Doc, Token and Span classes you can import from spacy.tokens.  
To register a custom attribute on the Doc, Token and Span, you can use the set_extension method.  
The first argument is the attribute name. Keyword arguments let you define how the value should be computed. In this case, it has a default value and can be overwritten.

```
doc._.title = "My document"
token._.is_color = True
span._.has_color = False
```

There are three types of extensions: attribute extensions, property extensions and method extensions.

**Attribute extensions:**  
Attribute extensions set a default value that can be overwritten.  
For example, a custom is_color attribute on the token that defaults to False.  
On individual tokens, its value can be changed by overwriting it – in this case, True for the token "blue".

**Property extensions:**  
Property extensions work like properties in Python: they can define a getter function and an optional setter.  
The getter function is only called when you retrieve the attribute. This lets you compute the value dynamically, and even take other custom attributes into account.  
Getter functions take one argument: the object, in this case, the token. In this example, the function returns whether the token text is in our list of colors.  
We can then provide the function via the getter keyword argument when we register the extension.  
The token "blue" now returns True for `._.is_color.`

If you want to set extension attributes on a span, you almost always want to use a property extension with a getter. Otherwise, you'd have to update every possible span ever by hand to set all the values.  
In this example, the get_has_color function takes the span and returns whether the text of any of the tokens is in the list of colors.  
After we've processed the doc, we can check different slices of the doc and the custom `._.has_color` property returns whether the span contains a color token or not.

**Method extensions:**  
Method extensions make the extension attribute a callable method.

You can then pass one or more arguments to it, and compute attribute values dynamically – for example, based on a certain argument or setting.

In this example, the method function checks whether the doc contains a token with a given text. The first argument of the method is always the object itself – in this case, the doc. It's passed in automatically when the method is called. All other function arguments will be arguments on the method extension. In this case, token_text.

Here, the custom `._.has_token` method returns True for the word "blue" and False for the word "cloud".

In [15]:
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension("title", default=None, force=True)
Token.set_extension("is_color", default=False, force=True)
Span.set_extension("has_color", default=False, force=True)

doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color = True

# Define getter function
def get_is_color(token):
    colors = ["red", "yellow", "blue"]
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension("is_color", getter=get_is_color, force=True)

print(doc[3]._.is_color, "-", doc[3].text)

# Define getter function
def get_has_color(span):
    colors = ["red", "yellow", "blue"]
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension("has_color", getter=get_has_color, force=True)

print(doc[1:4]._.has_color, "-", doc[1:4].text)
print(doc[0:2]._.has_color, "-", doc[0:2].text)

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension("has_token", method=has_token, force=True)

print(doc._.has_token("blue"), "- blue")
print(doc._.has_token("cloud"), "- cloud")

True - blue
True - sky is blue
False - The sky
True - blue
False - cloud


### Scaling and performance

If you need to process a lot of texts and create a lot of Doc objects in a row, the `nlp.pipe` method can speed this up significantly.  
It processes the texts as a stream and yields Doc objects.  
It is much faster than just calling nlp on each text, because it batches up the texts.  
`nlp.pipe` is a generator that yields Doc objects, so to get a list of docs, remember to wrap the call in `list()`.

`nlp.pipe` also supports passing in tuples of text / context if you set as_tuples to True.  
The method will then yield doc / context tuples.  
This is useful for passing in additional metadata, like an ID associated with the text, or a page number.
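
The streaming behavior and the tuple handling can be mimicked with a small generator. `toy_pipe` is a hypothetical stand-in that "processes" a text by splitting it into words; the batch size and the function name are assumptions for the sketch, and it says nothing about how spaCy batches internally.

```python
from itertools import islice


def toy_pipe(items, as_tuples=False, batch_size=2):
    """Lazily yield "docs" (here: lists of words), handling inputs in batches."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        if as_tuples:
            texts = [text for text, context in batch]
            contexts = [context for text, context in batch]
        else:
            texts = batch
        # Stand-in for running the pipeline on the whole batch at once
        docs = [text.split() for text in texts]
        if as_tuples:
            # Pair each result with the context it came in with
            yield from zip(docs, contexts)
        else:
            yield from docs


# A generator yields results one by one; list() collects them
docs = list(toy_pipe(["This is a text", "And another text", "A third one"]))
print(len(docs))  # 3

data = [("This is a text", {"id": 1}), ("And another text", {"id": 2})]
for doc, context in toy_pipe(data, as_tuples=True):
    print(doc, context["id"])
```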

You can even add the context metadata to custom attributes.  
In this example, we're registering two extensions, id and page_number, which default to None.  
After processing the text and passing through the context, we can overwrite the doc extensions with our context metadata.

Another common scenario: Sometimes you already have a model loaded to do other processing, but you only need the tokenizer for one particular text.  
Running the whole pipeline is unnecessarily slow, because you'll be getting a bunch of predictions from the model that you don't need.  
If you only need a tokenized Doc object, you can use the nlp.make_doc method instead, which takes a text and returns a doc.
This is also how spaCy does it behind the scenes: `nlp.make_doc` turns the text into a doc before the pipeline components are called.

spaCy also allows you to temporarily disable pipeline components using the nlp.disable_pipes context manager.  
It takes a variable number of arguments, the string names of the pipeline components to disable. For example, if you only want to use the entity recognizer to process a document, you can temporarily disable the tagger and parser.  
After the with block, the disabled pipeline components are automatically restored.  
In the with block, spaCy will only run the remaining components.

In [16]:
# Efficient way to process lots of texts
LOTS_OF_TEXTS = ['text1', 'text2']
docs = list(nlp.pipe(LOTS_OF_TEXTS))
    
# Passing in context example

from spacy.tokens import Doc

Doc.set_extension("id", default=None)
Doc.set_extension("page_number", default=None)

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]
    print(doc.text, context["page_number"])
    
# Only tokenizes the text
doc = nlp.make_doc("Hello world!")

# Disable tagger and parser
with nlp.disable_pipes("tagger", "parser"):
    # Process the text and print the entities
    doc = nlp('text')
    print(doc.ents)

This is a text 15
And another text 16
()


## Chapter 4: Training a neural network model

Update spaCy's statistical models to customize them for your use case – for example, to predict a new entity type in online comments. You'll write your own training loop from scratch, and understand the basics of how training works, along with tips and tricks that can make your custom NLP projects more successful.

### Training and updating models

Why would we want to update the model with our own examples? Why can't we just rely on pre-trained models?  
Statistical models make predictions based on the examples they were trained on.  
You can usually make the model more accurate by showing it examples from your domain.  
You often also want to predict categories specific to your problem, so the model needs to learn about them.  
This is essential for text classification, very useful for entity recognition and a little less critical for tagging and parsing.

spaCy supports updating existing models with more examples, and training new models.  
If we're not starting with a pre-trained model, we first initialize the weights randomly.  
Next, we call nlp.update, which predicts a batch of examples with the current weights.  
The model then checks the predictions against the correct answers, and decides how to change the weights to achieve better predictions next time.  
Finally, we make a small correction to the current weights and move on to the next batch of examples.  
We continue calling nlp.update for each batch of examples in the data.
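
The predict, compare, correct cycle can be shown on a deliberately tiny model. This is a toy, not spaCy's neural network: the "model" is a single weight that predicts `y = weight * x`, the error is the squared difference, and the `update` function, learning rate and data are all assumptions made up for the sketch.

```python
import random


def update(weight, batch, learning_rate=0.05):
    """One update step for a toy model that predicts y = weight * x."""
    gradient = 0.0
    for x, y_true in batch:
        y_pred = weight * x            # predict with the current weight
        error = y_pred - y_true        # compare with the correct answer
        gradient += 2 * error * x      # gradient of the squared error
    gradient /= len(batch)
    # Make a small correction in the direction that reduces the error
    return weight - learning_rate * gradient


random.seed(0)
weight = random.uniform(-1, 1)  # start from a random weight

# Toy training data following the rule y = 2 * x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
batches = [data[0:2], data[2:4]]

for _ in range(50):
    for batch in batches:
        weight = update(weight, batch)

print(round(weight, 3))  # 2.0
```

After enough small corrections the weight settles at the value that explains the examples, here 2.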

The training data are the examples we want to update the model with.  
The text should be a sentence, paragraph or longer document. For the best results, it should be similar to what the model will see at runtime.  
The label is what we want the model to predict. This can be a text category, or an entity span and its type.  
The gradient is how we should change the model to reduce the current error. It's computed when we compare the predicted label to the true label.  
After training, we can then save out an updated model and use it in our application.

Let's look at an example for a specific component: the entity recognizer.  
The entity recognizer takes a document and predicts phrases and their labels. This means that the training data needs to include texts, the entities they contain, and the entity labels.  
Entities can't overlap, so each token can only be part of one entity.  
Because the entity recognizer predicts entities in context, it also needs to be trained on entities and their surrounding context.  
The easiest way to do this is to show the model a text and a list of character offsets. For example, "iPhone X" is a gadget, starts at character 0 and ends at character 8.  
e.g. `("iPhone X is coming", {"entities": [(0, 8, "GADGET")]})`  
It's also very important for the model to learn words that aren't entities.  
In this case, the list of span annotations will be empty.  
e.g. `("I need a new phone! Any tips?", {"entities": []})`  
Our goal is to teach the model to recognize new entities in similar contexts, even if they weren't in the training data.
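
Counting character offsets by hand is error-prone, so it can help to compute them. The sketch below uses a hypothetical helper, `make_example`, that finds the first occurrence of each phrase with `str.find` and builds a tuple in the format shown above. It is an illustration under those assumptions, not part of spaCy.

```python
def make_example(text, phrases):
    """Build a (text, annotations) tuple from (phrase, label) pairs.

    Hypothetical helper: only the first occurrence of each phrase is annotated.
    """
    entities = []
    for phrase, label in phrases:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in {text!r}")
        # End offset is exclusive, like a Python slice
        entities.append((start, start + len(phrase), label))
    return (text, {"entities": entities})


# A positive example and an example without any entities
example = make_example("iPhone X is coming", [("iPhone X", "GADGET")])
negative = make_example("I need a new phone! Any tips?", [])
print(example)
print(negative)
```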

The training data tells the model what we want it to predict. This could be texts and named entities we want to recognize, or tokens and their correct part-of-speech tags.  
To update an existing model, we can start with a few hundred to a few thousand examples.  
To train a new category, we may need up to a million examples.  
spaCy's pre-trained English models, for instance, were trained on 2 million words labelled with part-of-speech tags, dependencies and named entities.  
Training data is usually created by humans who assign labels to texts.  
This is a lot of work, but can be semi-automated – for example, using spaCy's Matcher.
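
As a rough sketch of semi-automated labelling, the example below generates candidate annotations from a pattern. Note the substitution: it uses the standard library `re` module as a stand-in so that it runs without spaCy, whereas spaCy's Matcher works on tokens rather than raw characters. The pattern, the texts and the `annotate` helper are assumptions for illustration, and the output would still need a human check.

```python
import re

# Hypothetical pattern for the GADGET label: "iPhone", optionally followed by "X" or a number
GADGET_PATTERN = re.compile(r"\biPhone(?: (?:X|\d+))?\b")

texts = [
    "How to preorder the iPhone X",
    "iPhone 8 reviews are here",
    "I need a new phone! Any tips?",
]


def annotate(text):
    """Return a (text, annotations) tuple with one entity per pattern match."""
    entities = [
        (match.start(), match.end(), "GADGET")
        for match in GADGET_PATTERN.finditer(text)
    ]
    return (text, {"entities": entities})


# Candidate training data; texts without a match become examples with no entities
candidates = [annotate(text) for text in texts]
for candidate in candidates:
    print(candidate)
```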

### The training loop

The training loop is a series of steps that's performed to train or update a model.  
We usually need to perform it several times, for multiple iterations, so that the model can learn from it effectively. If we want to train for 10 iterations, we need to loop 10 times.  
To prevent the model from getting stuck in a suboptimal solution, we randomly shuffle the data for each iteration. This is a very common strategy when doing stochastic gradient descent.  
Next, we divide the training data into batches of several examples, also known as minibatching. This increases the reliability of the gradient estimates.  
Finally, we update the model for each batch, and start the loop again until we've reached the last iteration.  
We can then save the model to a directory and use it in spaCy.
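
To make the shuffle-and-batch steps concrete, here is a plain-Python sketch that leaves out the model entirely. The `minibatch` function below is a simplified stand-in written for this example, not spaCy's own utility, and the placeholder examples are invented.

```python
import random


def minibatch(items, size):
    """Yield consecutive batches of at most `size` items (plain-Python stand-in)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


examples = [f"example {i}" for i in range(5)]  # placeholder training examples
random.seed(0)  # fixed seed so the sketch is reproducible

batches_seen = 0
for iteration in range(10):
    # New order on every iteration
    random.shuffle(examples)
    for batch in minibatch(examples, size=2):
        # A real loop would update the model with this batch here
        batches_seen += 1

print(batches_seen)
```

With five examples and a batch size of two, each iteration yields three batches, the last one holding a single example.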

spaCy lets you update an existing pre-trained model with more data – for example, to improve its predictions on different texts.  
This is especially useful if you want to improve categories the model already knows, like "person" or "organization".  
You can also update a model to add new categories.  
Just make sure to always update it with examples of the new category and examples of the other categories it previously predicted correctly. Otherwise improving the new category might hurt the other categories.

In this example, we start off with a blank English model using the spacy.blank function. The blank model doesn't have any pipeline components, only the language data and tokenization rules.  
We then create a blank entity recognizer and add it to the pipeline.  
Using the add_label method, we can add new string labels to the model.  
We can now call nlp.begin_training to initialize the model with random weights.  
To get better accuracy, we want to loop over the examples more than once and randomly shuffle the data on each iteration.  
On each iteration, we divide the examples into batches using spaCy's minibatch utility function. Each example consists of a text and its annotations.  
Finally, we update the model with the texts and annotations and continue the loop.

In [None]:
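# Example training loop / pipeline (spaCy v2 API)
import random

# Training examples in the (text, annotations) format described above
examples = [
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

# Start with blank English model
nlp = spacy.blank("en")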
# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
# Add a new label
ner.add_label("GADGET")

# Start the training
nlp.begin_training()
# Train for 10 iterations
for itn in range(10):
    random.shuffle(examples)
    # Divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

### Training best practices for spaCy models

Statistical models can learn lots of things – but that doesn't mean they won't unlearn them.  
If you're updating an existing model with new data, especially new labels, it can overfit and adjust too much to the new examples.  
For instance, if you're only updating it with examples of "website", it may "forget" other labels it previously predicted correctly – like "person".  
This is also known as the catastrophic forgetting problem.

To prevent this, make sure to always mix in examples of what the model previously got correct.  
If you're training a new category "WEBSITE", also include examples of "PERSON".  
spaCy can help you with this. You can create those additional examples by running the existing model over data and extracting the entity spans you care about.  
You can then mix those examples in with your existing data and update the model with annotations of all labels.
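
A minimal sketch of this mixing step follows. The `docs_to_examples` helper is hypothetical; it only relies on the `text`, `ents`, `start_char`, `end_char` and `label_` attributes that spaCy's `Doc` and `Span` objects provide. So that the sketch runs without a model installed, the processed document is a stand-in object with just those attributes, and the example sentences are invented.

```python
from types import SimpleNamespace


def docs_to_examples(docs, labels):
    """Turn processed documents into (text, annotations) tuples, keeping only `labels`."""
    examples = []
    for doc in docs:
        entities = [
            (ent.start_char, ent.end_char, ent.label_)
            for ent in doc.ents
            if ent.label_ in labels
        ]
        examples.append((doc.text, {"entities": entities}))
    return examples


# Stand-in for a document processed by an existing model, e.g. from nlp.pipe(texts)
processed = SimpleNamespace(
    text="Alexis Ohanian gave away two tickets to fans",
    ents=[SimpleNamespace(start_char=0, end_char=14, label_="PERSON")],
)
revision_examples = docs_to_examples([processed], labels={"PERSON"})

# Hand-labelled examples of the new category
new_examples = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
]

# Update the model with both, so the old label keeps being reinforced
training_data = new_examples + revision_examples
print(training_data)
```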

Another common problem is that your model just won't learn what you want it to.  
spaCy's models make predictions based on the local context – for example, for named entities, the surrounding words are most important.  
If the decision is difficult to make based on the context, the model can struggle to learn it.  
The label scheme also needs to be consistent and not too specific.  
For example, it may be very difficult to teach a model to predict whether something is adult clothing or children's clothing based on the context. However, just predicting the label "clothing" may work better.

Before you start training and updating models, it's worth taking a step back and planning your label scheme.  
Try to pick categories that are reflected in the local context and make them more generic if possible.  
You can always add a rule-based system later to go from generic to specific.  
Generic categories like "clothing" or "band" are both easier to label and easier to learn.
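
Going from generic to specific with rules might look like the sketch below. The keyword list, the label names and the `refine_label` helper are all hypothetical, and the rule is deliberately naive: it only shows where such a step would sit after the model has predicted the generic label.

```python
# Hypothetical keyword rules for refining a generic CLOTHING prediction
CHILD_KEYWORDS = {"kids", "children", "toddler", "baby"}


def refine_label(text, entity):
    """Map a generic (start, end, label) entity to a more specific label using simple rules."""
    start, end, label = entity
    if label != "CLOTHING":
        # Other labels pass through unchanged
        return entity
    words = {word.strip(".,!?").lower() for word in text.split()}
    specific = "CHILDRENS_CLOTHING" if words & CHILD_KEYWORDS else "ADULT_CLOTHING"
    return (start, end, specific)


text = "Warm winter jacket for kids"
refined = refine_label(text, (12, 18, "CLOTHING"))
print(text[12:18], refined)
```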

In [17]:
# Example training data

TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    (
        "PewDiePie smashes YouTube record",
        {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]},
    ),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE"), (15, 29, "PERSON")]},
    ),
    # And so on...
]
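
Since the offsets are written by hand, a quick sanity check before training can catch mistakes. The sketch below uses a hypothetical `check_entities` helper that verifies each span lies inside the text and that no two spans overlap (entities can't overlap, as noted earlier). It repeats one example from the data above so that it runs on its own.

```python
def check_entities(text, entities):
    """Return a list of problems found in one example's entity offsets."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities):
        if not 0 <= start < end <= len(text):
            problems.append(f"out of range: {(start, end, label)}")
            continue
        if start < previous_end:
            problems.append(f"overlaps previous entity: {(start, end, label)}")
        previous_end = max(previous_end, end)
    return problems


sample = (
    "PewDiePie smashes YouTube record",
    {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]},
)
text, annotations = sample

# Print each labelled span so it can be checked by eye
for start, end, label in annotations["entities"]:
    print(label, repr(text[start:end]))

# An empty list means no problems were found
print(check_entities(text, annotations["entities"]))
```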

## Course overview

- Extract linguistic features: part-of-speech tags, dependencies, named entities
- Work with pre-trained statistical models
- Find words and phrases using Matcher and PhraseMatcher match rules
- Best practices for working with data structures Doc, Token, Span, Vocab, Lexeme
- Find semantic similarities using word vectors
- Write custom pipeline components with extension attributes
- Scale up your spaCy pipelines and make them fast
- Create training data for spaCy's statistical models
- Train and update spaCy's neural network models with new data