# Processing Pipelines

<h3>What happens when you call nlp?</h3>
<img src="pipeline.png">
<br>
<section data-markdown="" data-markdown-parsed="true" class="present" style="display: block;"><h3 id="built-in-pipeline-components">Built-in pipeline components</h3>
<table>
<thead>
<tr>
<th>Name</th>
<th align="left">Description</th>
<th align="left">Creates</th>
</tr>
</thead>
<tbody><tr>
<td><strong>tagger</strong></td>
<td align="left">Part-of-speech tagger</td>
<td align="left"><code>Token.tag</code>, <code>Token.pos</code></td>
</tr>
<tr>
<td><strong>parser</strong></td>
<td align="left">Dependency parser</td>
<td align="left"><code>Token.dep</code>, <code>Token.head</code>, <code>Doc.sents</code>, <code>Doc.noun_chunks</code></td>
</tr>
<tr>
<td><strong>ner</strong></td>
<td align="left">Named entity recognizer</td>
<td align="left"><code>Doc.ents</code>, <code>Token.ent_iob</code>, <code>Token.ent_type</code></td>
</tr>
<tr>
<td><strong>textcat</strong></td>
<td align="left">Text classifier</td>
<td align="left"><code>Doc.cats</code></td>
</tr>
</tbody></table>
<aside class="notes"><p>spaCy ships with the following built-in pipeline components.</p>
<p>The part-of-speech tagger sets the <code>token.tag</code> and <code>token.pos</code> attributes.</p>
<p>The dependency parser adds the <code>token.dep</code> and <code>token.head</code> attributes and is
also responsible for detecting sentences and base noun phrases, also known as
noun chunks.</p>
<p>The named entity recognizer adds the detected entities to the <code>doc.ents</code>
property. It also sets entity type attributes on the tokens that indicate if a
token is part of an entity or not.</p>
<p>Finally, the text classifier sets category labels that apply to the whole text,
and adds them to the <code>doc.cats</code> property.</p>
<p>Because text categories are always very specific, the text classifier is not
included in any of the pre-trained models by default. But you can use it to
train your own system.</p>
</aside></section>

<h3>Inspecting a pipeline</h3>
<p>1 Load the en_core_web_sm model and create the nlp object.</p>
<p>2 Print the names of the pipeline components using nlp.pipe_names.</p>
<p>3 Print the full pipeline of (name, component) tuples using nlp.pipeline.</p>

In [1]:
import spacy

# Load the en_core_web_sm model
nlp = spacy.load("en_core_web_sm")

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f7c1cd58790>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f7c4e525210>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f7c1cf73d70>)]
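
The `(name, component)` tuples printed above suggest a simple mental model: the text is tokenized once, then the resulting doc is passed through each component in order. The sketch below is a toy model in plain Python, not spaCy's implementation; `tokenizer`, `make_component` and `process` are hypothetical names, and the "doc" is just a dictionary.

```python
# Toy model of a processing pipeline (plain Python, not spaCy's implementation).

def tokenizer(text):
    # Stand-in for the tokenizer: turn raw text into a dict-based "doc".
    return {"text": text, "tokens": text.split(), "log": []}

def make_component(name):
    # Build a component that records its name on the doc and returns the doc.
    def component(doc):
        doc["log"].append(name)
        return doc
    return component

# The pipeline is an ordered list of (name, component) tuples.
pipeline = [(name, make_component(name)) for name in ("tagger", "parser", "ner")]

def process(text):
    # Tokenize first, then pass the doc through every component in order.
    doc = tokenizer(text)
    for _name, component in pipeline:
        doc = component(doc)
    return doc

pipe_names = [name for name, _component in pipeline]
doc = process("Hello world")
print(pipe_names)  # ['tagger', 'parser', 'ner']
print(doc["log"])  # ['tagger', 'parser', 'ner']
```

Every component takes a doc and returns a doc, which is why a plain function with that shape can be used as a custom component.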


# Custom pipeline components

In [2]:
import spacy

# create the nlp object
nlp = spacy.load("en_core_web_sm")

# define a custom component
def custom_component(doc):
    # print the doc's length
    print("Doc length:", len(doc))
    # return the doc object
    return doc

# add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# print the pipeline component names
print("Pipeline:", nlp.pipe_names)

# process a text
doc = nlp("Hello World!")

Pipeline: ['custom_component', 'tagger', 'parser', 'ner']
Doc length: 3


# Practice

<h3>Simple Components</h3>
<p>1 Complete the component function with the doc’s length.</p>
<p>2 Add the length_component to the existing pipeline as the first component.</p>
<p>3 Try out the new pipeline and process any text with the nlp object – for example “This is a sentence.”.</p>

In [3]:
import spacy

# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc


# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("This is a sentence.")

['length_component', 'tagger', 'parser', 'ner']
This document is 5 tokens long.
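
The placement keywords accepted by `nlp.add_pipe` (`first`, `before`, `after`, with appending at the end as the default) can be pictured as plain list insertion. This is only a sketch on a list of names; the standalone `add_pipe` function below is a hypothetical helper and does not reproduce spaCy's validation or error handling.

```python
def add_pipe(pipeline, name, first=False, before=None, after=None):
    """Insert `name` into the list `pipeline` (toy model of the placement keywords)."""
    if first:
        index = 0
    elif before is not None:
        index = pipeline.index(before)
    elif after is not None:
        index = pipeline.index(after) + 1
    else:
        index = len(pipeline)  # default: append at the end
    pipeline.insert(index, name)
    return pipeline

names = ["tagger", "parser", "ner"]
add_pipe(names, "length_component", first=True)
add_pipe(names, "animal_component", after="ner")
print(names)
# ['length_component', 'tagger', 'parser', 'ner', 'animal_component']
```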


<h3>Complex Components</h3>
<p>1 Define the custom component and apply the matcher to the doc.</p>
<p>2 Create a Span for each match, assign the label ID for "ANIMAL" and overwrite the doc.ents with the new spans.</p>
<p>3 Add the new component to the pipeline after the "ner" component.</p>
<p>4 Process the text and print the entity text and entity label for the entities in doc.ents.</p>

In [4]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Add the matched spans to the entities already in doc.ents
    doc.ents = list(doc.ents) + spans
    return doc


# Add the component to the pipeline after the "ner" component
nlp.add_pipe(animal_component, after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
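
One caveat with `doc.ents = list(doc.ents) + spans`: spaCy does not allow entity spans to overlap, so the assignment raises an error if a phrase match shares a token with an entity already predicted by `ner`. A common guard is to keep the longest span and drop the ones that overlap it. The sketch below shows that idea on plain `(start, end, label)` tuples; `filter_overlaps` is a hypothetical helper and the `PERSON` span is invented for illustration. Recent spaCy versions offer `spacy.util.filter_spans` for the same purpose on real `Span` objects.

```python
def filter_overlaps(spans):
    """Keep the longest spans first and drop any span that overlaps a kept one."""
    # Sort by length (longest first), then by start position for ties.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, seen = [], set()
    for start, end, label in ordered:
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end, label))
            seen.update(range(start, end))
    return sorted(kept)

# Token indices for "I have a cat and a Golden Retriever":
# "cat" is tokens 3-4, "Golden Retriever" is tokens 6-8.
spans = [(3, 4, "ANIMAL"), (6, 8, "ANIMAL"), (7, 8, "PERSON")]
filtered = filter_overlaps(spans)
print(filtered)  # [(3, 4, 'ANIMAL'), (6, 8, 'ANIMAL')]
```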


# Extension attributes

In [5]:
# setting custom attributes
# add custom metadata to documents, tokens and spans
# accessible via the ._ property

# doc._.title = "My document"
# token._.is_color = True
# span._.has_color = False

# Registered on the global Doc, Token or Span using the set_extension method

# Import global classes
# from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
# Doc.set_extension("title", default=None)
# Token.set_extension("is_color", default=False)
# Span.set_extension("has_color", default=False)

# Extension attribute types:
#   - Attribute extensions
#   - Property extensions
#   - Method extensions

# Attribute extensions
import spacy
from spacy.tokens import Token

# set extension on the Token with default value
Token.set_extension("is_color", default=False, force = True)

doc = nlp("The sky is blue")

# overwrite extension attribute value
doc[3]._.is_color = True

# Property extensions
from spacy.tokens import Token

# define getter function
def get_is_color(token):
    colors = ["red", "yellow", "blue"]
    return token.text in colors

# set extension on the Token with getter
Token.set_extension("is_color", getter=get_is_color, force = True)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color, "-", doc[3].text)

# span extensions should almost always use a getter
from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ["red", "yellow", "blue"]
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension("has_color", getter=get_has_color, force=True)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, "-", doc[1:4].text)
print(doc[0:2]._.has_color, "-", doc[0:2].text)

# method extensions
from spacy.tokens import Doc

# define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# set the extension on the Doc with method

Doc.set_extension("has_token", method=has_token)

doc = nlp("The sky is blue.")
print(doc._.has_token("blue"), "- blue")
print(doc._.has_token("cloud"), "- cloud")

True - blue
True - sky is blue
False - The sky
True - blue
False - cloud
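
To make the difference between the three extension types concrete, here is a toy model in plain Python. It is not how spaCy implements `._`; `ToyUnderscore` and `ToyToken` are hypothetical classes, and the registry is shared by every object for simplicity. An attribute extension stores a value with a default, a property extension calls its getter on every access, and a method extension returns a callable that receives the object as its first argument.

```python
from functools import partial


class ToyUnderscore:
    """Toy stand-in for the `._` namespace (not spaCy's implementation)."""

    _extensions = {}  # name -> (default, getter, method), shared by all objects

    def __init__(self, obj):
        # Bypass our own __setattr__ while storing internal state.
        object.__setattr__(self, "_obj", obj)
        object.__setattr__(self, "_values", {})

    @classmethod
    def set_extension(cls, name, default=None, getter=None, method=None):
        cls._extensions[name] = (default, getter, method)

    def __getattr__(self, name):
        if name not in self._extensions:
            raise AttributeError(name)
        default, getter, method = self._extensions[name]
        if getter is not None:
            return getter(self._obj)  # property: computed on every access
        if method is not None:
            return partial(method, self._obj)  # method: bound to the object
        return self._values.get(name, default)  # attribute: value or default

    def __setattr__(self, name, value):
        if name not in self._extensions:
            raise AttributeError(name)
        _default, getter, method = self._extensions[name]
        if getter is not None or method is not None:
            raise AttributeError(f"cannot overwrite computed extension {name!r}")
        self._values[name] = value


class ToyToken:
    def __init__(self, text):
        self.text = text
        self._ = ToyUnderscore(self)


ToyUnderscore.set_extension("is_country", default=False)
ToyUnderscore.set_extension("reversed", getter=lambda t: t.text[::-1])
ToyUnderscore.set_extension("startswith", method=lambda t, prefix: t.text.startswith(prefix))

token = ToyToken("Spain")
token._.is_country = True  # attribute extensions can be overwritten
print(token._.is_country, token._.reversed, token._.startswith("Sp"))
# True niapS True
```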


# Practice

<h3>Setting extension attributes (1)</h3>
<p>1 Use Token.set_extension to register "is_country" (default False).</p>
<p>2 Update it for "Spain" and print it for all tokens.</p>
<p>3 Use Token.set_extension to register "reversed" (getter function get_reversed).</p>
<p>4 Print its value for each token.</p>

In [6]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Register the Token extension attribute "is_country" with the default value False
Token.set_extension("is_country", default=False, force=True)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

nlp = English()

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]


# Register the Token property extension "reversed" with the getter get_reversed
Token.set_extension("reversed", getter=get_reversed)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print("reversed:", token._.reversed)

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]
reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


<h3>Setting extension attributes (2)</h3>
<p>1 Complete the get_has_number function.</p>
<p>2 Use Doc.set_extension to register "has_number" (getter get_has_number) and print its value.</p>
<p>3 Use Span.set_extension to register "to_html" (method to_html).</p>
<p>4 Call it on doc[0:2] with the tag "strong".</p>

In [7]:
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Define the getter function
def get_has_number(doc):
    # Return if any of the tokens in the doc return True for token.like_num
    return any(token.like_num for token in doc)


# Register the Doc property extension "has_number" with the getter get_has_number
Doc.set_extension("has_number", getter=get_has_number)

# Process the text and check the custom has_number attribute
doc = nlp("The museum closed for five years in 2012.")
print("has_number:", doc._.has_number)

from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()

# Define the method
def to_html(span, tag):
    # Wrap the span text in a HTML tag and return it
    return f"<{tag}>{span.text}</{tag}>"


# Register the Span method extension "to_html" with the method to_html
Span.set_extension("to_html", method=to_html)

# Process the text and call the to_html method on the span with the tag name "strong"
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html("strong"))

has_number: True
<strong>Hello world</strong>


<h3>Entities and extensions</h3>
<p>1 Complete the get_wikipedia_url getter so it only returns the URL if the span’s label is in the list of labels.</p>
<p>2 Set the Span extension "wikipedia_url" using the getter get_wikipedia_url.</p>
<p>3 Iterate over the entities in the doc and output their Wikipedia URL.</p>

In [8]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")


def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text


# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

fifty years None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie
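
The getter above replaces spaces with underscores, which works for simple names, but characters such as `&` or `#` in an entity's text would change the meaning of the URL. A more defensive variant, sketched here with only the standard library, percent-encodes the query string. `build_search_url` is a hypothetical helper name, and the base URL is the one used in the getter.

```python
from urllib.parse import urlencode

WIKIPEDIA_SEARCH = "https://en.wikipedia.org/w/index.php"


def build_search_url(entity_text):
    """Return a Wikipedia search URL with the query string percent-encoded."""
    return WIKIPEDIA_SEARCH + "?" + urlencode({"search": entity_text})


print(build_search_url("David Bowie"))
# https://en.wikipedia.org/w/index.php?search=David+Bowie
print(build_search_url("AT&T"))
# https://en.wikipedia.org/w/index.php?search=AT%26T
```

Inside a getter, the same call would take `span.text` as its argument.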


<h3>Components with extension</h3>
<p>1 Complete the countries_component and create a Span with the label "GPE" (geopolitical entity) for all matches.</p>
<p>2 Add the component to the pipeline.</p>
<p>3 Register the Span extension attribute "capital" with the getter get_capital.</p>
<p>4 Process the text and print the entity text, entity label and entity capital for each entity span in doc.ents.</p>

In [9]:
from spacy.lang.en import English
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher


COUNTRIES = ['Afghanistan', 'Åland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia (Plurinational State of)', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'United States Minor Outlying Islands', 'Virgin Islands (British)', 'Virgin Islands (U.S.)', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cabo Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo (Democratic Republic of the)', 'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Heard Island and McDonald Islands', 'Holy See', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', "Côte d'Ivoire", 'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia (the former Yugoslav Republic of)', 'Madagascar', 'Malawi', 
'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia (Federated States of)', 'Moldova (Republic of)', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', "Korea (Democratic People's Republic of)", 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestine, State of', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Kosovo', 'Réunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and the South Sandwich Islands', 'Korea (Republic of)', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-Leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom of Great Britain and Northern Ireland', 'United States of America', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']

CAPITALS = {'Afghanistan': 'Kabul', 'Åland Islands': 'Mariehamn', 'Albania': 'Tirana', 'Algeria': 'Algiers', 'American Samoa': 'Pago Pago', 'Andorra': 'Andorra la Vella', 'Angola': 'Luanda', 'Anguilla': 'The Valley', 'Antarctica': '', 'Antigua and Barbuda': "Saint John's", 'Argentina': 'Buenos Aires', 'Armenia': 'Yerevan', 'Aruba': 'Oranjestad', 'Australia': 'Canberra', 'Austria': 'Vienna', 'Azerbaijan': 'Baku', 'Bahamas': 'Nassau', 'Bahrain': 'Manama', 'Bangladesh': 'Dhaka', 'Barbados': 'Bridgetown', 'Belarus': 'Minsk', 'Belgium': 'Brussels', 'Belize': 'Belmopan', 'Benin': 'Porto-Novo', 'Bermuda': 'Hamilton', 'Bhutan': 'Thimphu', 'Bolivia (Plurinational State of)': 'Sucre', 'Bonaire, Sint Eustatius and Saba': 'Kralendijk', 'Bosnia and Herzegovina': 'Sarajevo', 'Botswana': 'Gaborone', 'Bouvet Island': '', 'Brazil': 'Brasília', 'British Indian Ocean Territory': 'Diego Garcia', 'United States Minor Outlying Islands': '', 'Virgin Islands (British)': 'Road Town', 'Virgin Islands (U.S.)': 'Charlotte Amalie', 'Brunei Darussalam': 'Bandar Seri Begawan', 'Bulgaria': 'Sofia', 'Burkina Faso': 'Ouagadougou', 'Burundi': 'Bujumbura', 'Cambodia': 'Phnom Penh', 'Cameroon': 'Yaoundé', 'Canada': 'Ottawa', 'Cabo Verde': 'Praia', 'Cayman Islands': 'George Town', 'Central African Republic': 'Bangui', 'Chad': "N'Djamena", 'Chile': 'Santiago', 'China': 'Beijing', 'Christmas Island': 'Flying Fish Cove', 'Cocos (Keeling) Islands': 'West Island', 'Colombia': 'Bogotá', 'Comoros': 'Moroni', 'Congo': 'Brazzaville', 'Congo (Democratic Republic of the)': 'Kinshasa', 'Cook Islands': 'Avarua', 'Costa Rica': 'San José', 'Croatia': 'Zagreb', 'Cuba': 'Havana', 'Curaçao': 'Willemstad', 'Cyprus': 'Nicosia', 'Czech Republic': 'Prague', 'Denmark': 'Copenhagen', 'Djibouti': 'Djibouti', 'Dominica': 'Roseau', 'Dominican Republic': 'Santo Domingo', 'Ecuador': 'Quito', 'Egypt': 'Cairo', 'El Salvador': 'San Salvador', 'Equatorial Guinea': 'Malabo', 'Eritrea': 'Asmara', 'Estonia': 'Tallinn', 'Ethiopia': 'Addis Ababa',
'Falkland Islands (Malvinas)': 'Stanley', 'Faroe Islands': 'Tórshavn', 'Fiji': 'Suva', 'Finland': 'Helsinki', 'France': 'Paris', 'French Guiana': 'Cayenne', 'French Polynesia': 'Papeetē', 'French Southern Territories': 'Port-aux-Français', 'Gabon': 'Libreville', 'Gambia': 'Banjul', 'Georgia': 'Tbilisi', 'Germany': 'Berlin', 'Ghana': 'Accra', 'Gibraltar': 'Gibraltar', 'Greece': 'Athens', 'Greenland': 'Nuuk', 'Grenada': "St. George's", 'Guadeloupe': 'Basse-Terre', 'Guam': 'Hagåtña', 'Guatemala': 'Guatemala City', 'Guernsey': 'St. Peter Port', 'Guinea': 'Conakry', 'Guinea-Bissau': 'Bissau', 'Guyana': 'Georgetown', 'Haiti': 'Port-au-Prince', 'Heard Island and McDonald Islands': '', 'Holy See': 'Rome', 'Honduras': 'Tegucigalpa', 'Hong Kong': 'City of Victoria', 'Hungary': 'Budapest', 'Iceland': 'Reykjavík', 'India': 'New Delhi', 'Indonesia': 'Jakarta', "Côte d'Ivoire": 'Yamoussoukro', 'Iran (Islamic Republic of)': 'Tehran', 'Iraq': 'Baghdad', 'Ireland': 'Dublin', 'Isle of Man': 'Douglas', 'Israel': 'Jerusalem', 'Italy': 'Rome', 'Jamaica': 'Kingston', 'Japan': 'Tokyo', 'Jersey': 'Saint Helier', 'Jordan': 'Amman', 'Kazakhstan': 'Astana', 'Kenya': 'Nairobi', 'Kiribati': 'South Tarawa', 'Kuwait': 'Kuwait City', 'Kyrgyzstan': 'Bishkek', "Lao People's Democratic Republic": 'Vientiane', 'Latvia': 'Riga', 'Lebanon': 'Beirut', 'Lesotho': 'Maseru', 'Liberia': 'Monrovia', 'Libya': 'Tripoli', 'Liechtenstein': 'Vaduz', 'Lithuania': 'Vilnius', 'Luxembourg': 'Luxembourg', 'Macao': '', 'Macedonia (the former Yugoslav Republic of)': 'Skopje', 'Madagascar': 'Antananarivo', 'Malawi': 'Lilongwe', 'Malaysia': 'Kuala Lumpur', 'Maldives': 'Malé', 'Mali': 'Bamako', 'Malta': 'Valletta', 'Marshall Islands': 'Majuro', 'Martinique': 'Fort-de-France', 'Mauritania': 'Nouakchott', 'Mauritius': 'Port Louis', 'Mayotte': 'Mamoudzou', 'Mexico': 'Mexico City', 'Micronesia (Federated States of)': 'Palikir', 'Moldova (Republic of)': 'Chișinău', 'Monaco': 'Monaco', 'Mongolia': 'Ulan Bator',
'Montenegro': 'Podgorica', 'Montserrat': 'Plymouth', 'Morocco': 'Rabat', 'Mozambique': 'Maputo', 'Myanmar': 'Naypyidaw', 'Namibia': 'Windhoek', 'Nauru': 'Yaren', 'Nepal': 'Kathmandu', 'Netherlands': 'Amsterdam', 'New Caledonia': 'Nouméa', 'New Zealand': 'Wellington', 'Nicaragua': 'Managua', 'Niger': 'Niamey', 'Nigeria': 'Abuja', 'Niue': 'Alofi', 'Norfolk Island': 'Kingston', "Korea (Democratic People's Republic of)": 'Pyongyang', 'Northern Mariana Islands': 'Saipan', 'Norway': 'Oslo', 'Oman': 'Muscat', 'Pakistan': 'Islamabad', 'Palau': 'Ngerulmud', 'Palestine, State of': 'Ramallah', 'Panama': 'Panama City', 'Papua New Guinea': 'Port Moresby', 'Paraguay': 'Asunción', 'Peru': 'Lima', 'Philippines': 'Manila', 'Pitcairn': 'Adamstown', 'Poland': 'Warsaw', 'Portugal': 'Lisbon', 'Puerto Rico': 'San Juan', 'Qatar': 'Doha', 'Republic of Kosovo': 'Pristina', 'Réunion': 'Saint-Denis', 'Romania': 'Bucharest', 'Russian Federation': 'Moscow', 'Rwanda': 'Kigali', 'Saint Barthélemy': 'Gustavia', 'Saint Helena, Ascension and Tristan da Cunha': 'Jamestown', 'Saint Kitts and Nevis': 'Basseterre', 'Saint Lucia': 'Castries', 'Saint Martin (French part)': 'Marigot', 'Saint Pierre and Miquelon': 'Saint-Pierre', 'Saint Vincent and the Grenadines': 'Kingstown', 'Samoa': 'Apia', 'San Marino': 'City of San Marino', 'Sao Tome and Principe': 'São Tomé', 'Saudi Arabia': 'Riyadh', 'Senegal': 'Dakar', 'Serbia': 'Belgrade', 'Seychelles': 'Victoria', 'Sierra Leone': 'Freetown', 'Singapore': 'Singapore', 'Sint Maarten (Dutch part)': 'Philipsburg', 'Slovakia': 'Bratislava', 'Slovenia': 'Ljubljana', 'Solomon Islands': 'Honiara', 'Somalia': 'Mogadishu', 'South Africa': 'Pretoria', 'South Georgia and the South Sandwich Islands': 'King Edward Point', 'Korea (Republic of)': 'Seoul', 'South Sudan': 'Juba', 'Spain': 'Madrid', 'Sri Lanka': 'Colombo', 'Sudan': 'Khartoum', 'Suriname': 'Paramaribo', 'Svalbard and Jan Mayen': 'Longyearbyen', 'Swaziland': 'Lobamba', 'Sweden': 'Stockholm', 'Switzerland': 'Bern', 
'Syrian Arab Republic': 'Damascus', 'Taiwan': 'Taipei', 'Tajikistan': 'Dushanbe', 'Tanzania, United Republic of': 'Dodoma', 'Thailand': 'Bangkok', 'Timor-Leste': 'Dili', 'Togo': 'Lomé', 'Tokelau': 'Fakaofo', 'Tonga': "Nuku'alofa", 'Trinidad and Tobago': 'Port of Spain', 'Tunisia': 'Tunis', 'Turkey': 'Ankara', 'Turkmenistan': 'Ashgabat', 'Turks and Caicos Islands': 'Cockburn Town', 'Tuvalu': 'Funafuti', 'Uganda': 'Kampala', 'Ukraine': 'Kiev', 'United Arab Emirates': 'Abu Dhabi', 'United Kingdom of Great Britain and Northern Ireland': 'London', 'United States of America': 'Washington, D.C.', 'Uruguay': 'Montevideo', 'Uzbekistan': 'Tashkent', 'Vanuatu': 'Port Vila', 'Venezuela (Bolivarian Republic of)': 'Caracas', 'Viet Nam': 'Hanoi', 'Wallis and Futuna': 'Mata-Utu', 'Western Sahara': 'El Aaiún', 'Yemen': "Sana'a", 'Zambia': 'Lusaka', 'Zimbabwe': 'Harare'}

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", None, *list(nlp.pipe(COUNTRIES)))


def countries_component(doc):
    # Create an entity Span with the label "GPE" for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc


# Add the component to the pipeline
nlp.add_pipe(countries_component)
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute "capital" with the getter get_capital
Span.set_extension("capital", getter=get_capital)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

['countries_component']
[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]


# Scaling and performance

In [10]:
# processing large volumes of text
# use the nlp.pipe method
# processes texts as a stream and yields Doc objects
# much faster than calling nlp on each text
# BAD:
# docs = [nlp(text) for text in LOTS_OF_TEXTS]
# GOOD:
# docs = list(nlp.pipe(LOTS_OF_TEXTS))

# Passing in context (1)

# Setting as_tuples=True on nlp.pipe lets you pass in (text, context) tuples
# Yields (doc, context) tuples
# Useful for associating metadata with the doc

# data = [
#    ("This is a text", {"id": 1, "page_number": 15}),
#    ("And another text", {"id": 2, "page_number": 16}),
#]

#for doc, context in nlp.pipe(data, as_tuples=True):
#    print(doc.text, context["page_number"])

#Passing in context (2)

#from spacy.tokens import Doc

#Doc.set_extension("id", default=None)
#Doc.set_extension("page_number", default=None)

#data = [
#    ("This is a text", {"id": 1, "page_number": 15}),
#    ("And another text", {"id": 2, "page_number": 16}),
#]

#for doc, context in nlp.pipe(data, as_tuples=True):
#    doc._.id = context["id"]
#    doc._.page_number = context["page_number"]

# Using only the tokenizer
# Don't run the whole pipeline when you only need the tokens!
# Use nlp.make_doc to turn a text into a Doc object

# BAD:
# doc = nlp("Hello world!")

# GOOD:
# doc = nlp.make_doc("Hello world!")

# Disabling pipeline components

# Use nlp.disable_pipes to temporarily disable one or more pipes

# Disable tagger and parser
# with nlp.disable_pipes("tagger", "parser"):
#     # Process the text and print the entities
#     doc = nlp(text)
#     print(doc.ents)
# Only the remaining components run inside the with block
# The disabled pipes are restored after the with block

<h3>Processing Streams</h3>
<p>1 Rewrite the example to use nlp.pipe. Instead of iterating over the texts and processing them, iterate over the doc objects yielded by nlp.pipe.</p>
<p>2 Rewrite the example to use nlp.pipe. Don’t forget to call list() around the result to turn it into a list.</p>

In [11]:
import spacy

nlp = spacy.load("en_core_web_sm")

TEXTS = ['McDonalds is my favorite restaurant.', 'Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..', 'People really still eat McDonalds :(', 'The McDonalds in Spain has chicken wings. My heart is so happy ', '@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P', 'please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D', 'This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it']

# Process the texts and print the adjectives
#for text in TEXTS:
#    doc = nlp(text)
#    print([token.text for token in doc if token.pos_ == "ADJ"])
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == "ADJ"])
print()

# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
[]
['terrible']

(McDonalds,) () (McDonalds,) (McDonalds, Spain) (The Arch Deluxe,) () (This morning, gettin mcdonalds)
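
`nlp.pipe` treats the texts as a stream and buffers them in batches internally, which is where most of the speed-up over calling `nlp` on each text comes from. The sketch below shows the streaming-plus-batching pattern in plain Python. `batched` and `toy_pipe` are hypothetical helpers, and `str.split` stands in for the real per-batch model call.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to `batch_size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def toy_pipe(texts, batch_size=2):
    """Lazily yield one processed item per text, working batch by batch."""
    for batch in batched(texts, batch_size):
        # Stand-in for a model call that handles the whole batch at once.
        processed = [text.split() for text in batch]
        yield from processed


docs = toy_pipe(["a b", "c", "d e f"])
first = next(docs)  # nothing beyond the first batch has been processed yet
rest = list(docs)   # consuming the generator processes the remaining batches
print(first)  # ['a', 'b']
print(rest)   # [['c'], ['d', 'e', 'f']]
```

Because the result is a generator, wrapping it in `list()` is what forces all texts to be processed, just as with `list(nlp.pipe(TEXTS))`.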


<h3>Processing data with context</h3>
<p>1 Use the set_extension method to register the custom attributes "author" and "book" on the Doc, which default to None.</p>
<p>2 Process the [text, context] pairs in DATA using nlp.pipe with as_tuples=True.</p>
<p>3 Overwrite the doc._.book and doc._.author with the respective info passed in as the context.</p>

In [12]:
from spacy.lang.en import English
from spacy.tokens import Doc

DATA = [['One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.', {'author': 'Franz Kafka', 'book': 'Metamorphosis'}], ["I know not all that may be coming, but be it what it will, I'll go to it laughing.", {'author': 'Herman Melville', 'book': 'Moby-Dick or, The Whale'}], ['It was the best of times, it was the worst of times.', {'author': 'Charles Dickens', 'book': 'A Tale of Two Cities'}], ['The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.', {'author': 'Jack Kerouac', 'book': 'On the Road'}], ['It was a bright cold day in April, and the clocks were striking thirteen.', {'author': 'George Orwell', 'book': '1984'}], ['Nowadays people know the price of everything and the value of nothing.', {'author': 'Oscar Wilde', 'book': 'The Picture Of Dorian Gray'}]]

nlp = English()

# Register the Doc extension "author" (default None)
Doc.set_extension("author", default=None)

# Register the Doc extension "book" (default None)
Doc.set_extension("book", default=None)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context["book"]
    doc._.author = context["author"]

    # Print the text and custom attribute data
    print(f"{doc.text}\n — '{doc._.book}' by {doc._.author}\n")

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
 — 'Metamorphosis' by Franz Kafka

I know not all that may be coming, but be it what it will, I'll go to it laughing.
 — 'Moby-Dick or, The Whale' by Herman Melville

It was the best of times, it was the worst of times.
 — 'A Tale of Two Cities' by Charles Dickens

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.
 — 'On the Road' by Jack Kerouac

It was a bright cold day in April, and the clocks were striking thirteen.
 — '1984' by George Orwell

Nowadays people know the price of everything and the value of nothing.
 — 'The Picture Of Dorian Gray' by Oscar Wilde
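
The `as_tuples=True` pattern can be read as "process the first element of each pair and hand the second one back unchanged". A minimal plain-Python sketch of that contract follows. `toy_pipe_with_context` is a hypothetical stand-in for `nlp.pipe`, and the sample data is the small two-item example from the notes above.

```python
def toy_pipe_with_context(items, as_tuples=False):
    """Yield processed texts; with as_tuples=True, pass each context through."""
    for item in items:
        if as_tuples:
            text, context = item
            yield text.split(), context
        else:
            yield item.split()


data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

results = [
    (len(tokens), context["page_number"])
    for tokens, context in toy_pipe_with_context(data, as_tuples=True)
]
print(results)  # [(4, 15), (3, 16)]
```

The context never touches the processing step, so it can hold any metadata that should travel alongside the doc.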



<h3>Selective processing</h3>
<p>1 Rewrite the code to only tokenize the text using nlp.make_doc.</p>
<p>2 Disable the tagger and parser using the nlp.disable_pipes method.</p>
<p>3 Process the text and print all entities in the doc.</p>

In [13]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Only tokenize the text
doc = nlp.make_doc(text)
print([token.text for token in doc])


# Disable the tagger and parser
with nlp.disable_pipes("tagger", "parser"):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']
(American, College Park, Georgia)
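
The "disable inside the block, restore afterwards" behavior of `nlp.disable_pipes` maps naturally onto a context manager. The following is a toy sketch on a list of component names, assuming nothing about spaCy's internals; the module-level `pipeline` list and this standalone `disable_pipes` function are hypothetical.

```python
from contextlib import contextmanager

pipeline = ["tagger", "parser", "ner"]


@contextmanager
def disable_pipes(*names):
    """Temporarily remove the named components, then restore the original order."""
    original = list(pipeline)
    pipeline[:] = [name for name in pipeline if name not in names]
    try:
        yield
    finally:
        # Runs even if the block raises, so the pipeline is always restored.
        pipeline[:] = original


with disable_pipes("tagger", "parser"):
    inside = list(pipeline)
after = list(pipeline)
print(inside)  # ['ner']
print(after)   # ['tagger', 'parser', 'ner']
```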
