
# Scaling and Performance

## Processing streams

In [1]:
import spacy

In [2]:
# Create an NLP object
from spacy.lang.en import English
nlp = English()

In [3]:
# Import the Doc, Span and Token classes
from spacy.tokens import Doc, Span, Token

In [4]:
Texts = ['McDonalds is my favorite restaurant.',
 'Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..',
 'People really still eat McDonalds :(',
 'The McDonalds in Spain has chicken wings. My heart is so happy ',
 '@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P',
 'please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D',
 'This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it']

In [6]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [10]:
# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc

# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

['length_component', 'tagger', 'parser', 'ner']


In [11]:
# Process the texts and print the adjectives
for text in Texts:
    doc = nlp(text)
    print([token.text for token in doc if token.pos_ == 'ADJ'])

This document is 6 tokens long.
['favorite']
This document is 27 tokens long.
['sick']
This document is 6 tokens long.
[]
This document is 13 tokens long.
['happy']
This document is 18 tokens long.
['delicious', 'fast']
This document is 15 tokens long.
[]
This document is 18 tokens long.
['terrible']


Rewrite the example to use nlp.pipe. Instead of calling nlp on each text in turn, iterate over the Doc objects yielded by nlp.pipe.

In [13]:
# Process the texts and print the adjectives
for doc in nlp.pipe(Texts):
    print([token.text for token in doc if token.pos_ == 'ADJ'])

This document is 6 tokens long.
This document is 27 tokens long.
This document is 6 tokens long.
This document is 13 tokens long.
This document is 18 tokens long.
This document is 15 tokens long.
This document is 18 tokens long.
['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
[]
['terrible']
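
The output order changes because nlp.pipe works on batches: every text in the batch passes through length_component before the first finished doc is yielded, so all the length messages appear before any adjectives. The sketch below imitates that behaviour with plain Python generators. It is only a toy model, not spaCy's code; `length_stage`, `batched_stage`, `toy_pipe` and `events` are hypothetical names made up for this illustration.

```python
from itertools import islice

events = []  # records the order in which things happen


def length_stage(texts):
    """First stage: note each text as it passes through, then hand it on."""
    for text in texts:
        events.append("length: " + text)
        yield text


def batched_stage(texts, batch_size):
    """Later stage: pull a whole batch from upstream before yielding anything."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        for text in batch:
            yield text.upper()


def toy_pipe(texts, batch_size=1000):
    """Chain the two stages, loosely mirroring a pipeline with two components."""
    return batched_stage(length_stage(texts), batch_size)


for result in toy_pipe(["a b", "c d e", "f"]):
    events.append("result: " + result)

print(events)
```

With a batch size larger than the number of texts, all three "length" events are recorded before the first "result" event, which matches the ordering seen in the cell above.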


Rewrite the example to use nlp.pipe. Don't forget to call list() around the result to turn it into a list.

In [15]:
# Process the texts and print the entities
docs = list(nlp.pipe(Texts))
entities = [doc.ents for doc in docs]
print(*entities)

This document is 6 tokens long.
This document is 27 tokens long.
This document is 6 tokens long.
This document is 13 tokens long.
This document is 18 tokens long.
This document is 15 tokens long.
This document is 18 tokens long.
(McDonalds,) () (McDonalds,) (McDonalds, Spain) (The Arch Deluxe,) () (This morning, gettin mcdonalds)
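
The call to list() matters because nlp.pipe returns a generator, and a generator can be consumed only once. A minimal sketch of the difference, using a hypothetical `toy_pipe` generator as a stand-in for nlp.pipe (splitting on whitespace instead of running a real pipeline):

```python
def toy_pipe(texts):
    """Stand-in for nlp.pipe: lazily yields one processed item per text."""
    for text in texts:
        yield text.split()


# Iterating the generator directly works once, then it is exhausted
stream = toy_pipe(["McDonalds in Spain", "The Arch Deluxe"])
first_pass = [len(tokens) for tokens in stream]
second_pass = [len(tokens) for tokens in stream]

# Wrapping the generator in list() keeps the results for repeated use
docs = list(toy_pipe(["McDonalds in Spain", "The Arch Deluxe"]))
lengths = [len(tokens) for tokens in docs]
lengths_again = [len(tokens) for tokens in docs]

print(first_pass, second_pass)
print(lengths, lengths_again)
```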


Rewrite the example to use nlp.pipe to create the patterns for the PhraseMatcher. Don't forget to call list() around the result to turn it into a list.

In [16]:
people = ['David Bowie', 'Angela Merkel', 'Lady Gaga']

# Create a list of patterns for the PhraseMatcher
patterns = list(nlp.pipe(people))

This document is 2 tokens long.
This document is 2 tokens long.
This document is 2 tokens long.


## Processing data with context

A list of (text, context) examples is available as the variable Data. The texts are quotes from famous books, and the contexts are dictionaries with the keys 'author' and 'book'.

In [20]:
Data = [('One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.',
  {'author': 'Franz Kafka', 'book': 'Metamorphosis'}),
 ("I know not all that may be coming, but be it what it will, I'll go to it laughing.",
  {'author': 'Herman Melville', 'book': 'Moby-Dick or, The Whale'}),
 ('It was the best of times, it was the worst of times.',
  {'author': 'Charles Dickens', 'book': 'A Tale of Two Cities'}),
 ('The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.',
  {'author': 'Jack Kerouac', 'book': 'On the Road'}),
 ('It was a bright cold day in April, and the clocks were striking thirteen.',
  {'author': 'George Orwell', 'book': '1984'}),
 ('Nowadays people know the price of everything and the value of nothing.',
  {'author': 'Oscar Wilde', 'book': 'The Picture Of Dorian Gray'})]

Import the Doc class and use the set_extension method to register the custom attributes 'author' and 'book', which default to None.

In [17]:
# Import the Doc class
from spacy.tokens import Doc

# Register the Doc extension 'author' (default None)
Doc.set_extension('author', default=None)

# Register the Doc extension 'book' (default None)
Doc.set_extension('book', default=None)

1. Process the (text, context) tuples in Data using nlp.pipe with as_tuples=True, which yields (doc, context) tuples.
2. Set doc._.book and doc._.author from the matching values in the context dictionary.

In [21]:
# Import the Doc class and register the extensions 'author' and 'book'
# (force=True overwrites the registrations made in the previous cell)
from spacy.tokens import Doc
Doc.set_extension('book', default=None, force=True)
Doc.set_extension('author', default=None, force=True)

for doc, context in nlp.pipe(Data, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context['book']
    doc._.author = context['author']
    
    # Print the text and custom attribute data
    print(doc.text, '\n', "— '{}' by {}".format(doc._.book, doc._.author), '\n')

This document is 23 tokens long.
This document is 23 tokens long.
This document is 14 tokens long.
This document is 64 tokens long.
This document is 16 tokens long.
This document is 13 tokens long.
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '19
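
One way to picture what as_tuples=True does: the texts are split off from their contexts, the texts alone go through the pipeline, and each result is paired back up with its context in the original order. The sketch below shows that idea in plain Python. It is an illustration under that assumption, not spaCy's implementation; `toy_process` and `toy_pipe_as_tuples` are hypothetical helpers, and whitespace splitting stands in for the real pipeline.

```python
from itertools import tee


def toy_process(texts):
    """Stand-in for the real pipeline: turns each text into a list of tokens."""
    for text in texts:
        yield text.split()


def toy_pipe_as_tuples(pairs):
    """Process only the texts, then re-attach each context in order."""
    pairs_a, pairs_b = tee(pairs)
    texts = (text for text, _ in pairs_a)
    contexts = (context for _, context in pairs_b)
    return zip(toy_process(texts), contexts)


toy_data = [
    ("It was the best of times, it was the worst of times.",
     {"author": "Charles Dickens", "book": "A Tale of Two Cities"}),
    ("It was a bright cold day in April, and the clocks were striking thirteen.",
     {"author": "George Orwell", "book": "1984"}),
]

paired = list(toy_pipe_as_tuples(toy_data))
for tokens, context in paired:
    print(len(tokens), "tokens,", context["book"], "by", context["author"])
```

Because the context never enters the pipeline, it can be any Python object, and the components do not need to know about it.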

## Selective processing

In this exercise, we will use the nlp.make_doc and nlp.disable_pipes methods to only run selected components when processing a text. 

In [24]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


Rewrite the code to only tokenize the text using nlp.make_doc.

In [25]:
text = "Chick-fil-A is an American fast food restaurant chain headquartered in the city of College Park, Georgia, specializing in chicken sandwiches."

# Only tokenize the text
doc = nlp.make_doc(text)

print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']


1. Disable the tagger and parser using the nlp.disable_pipes method.
2. Process the text and print all entities in the doc.

In [26]:
text = "Chick-fil-A is an American fast food restaurant chain headquartered in the city of College Park, Georgia, specializing in chicken sandwiches."

# Disable the tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

This document is 27 tokens long.
(American, College Park, Georgia)
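
Note that length_component still ran in the cell above, since only the tagger and parser were disabled, and that the disabled components come back once the with block ends. The sketch below models that temporary removal with a small context manager. It is a toy under stated assumptions, not spaCy's implementation; `ToyPipeline`, `make_component` and `toy_nlp` are hypothetical names, and each toy component simply appends its own name to a list.

```python
from contextlib import contextmanager


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: ordered (name, function) pairs."""

    def __init__(self, components):
        self.components = list(components)

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def __call__(self, doc):
        # Run every currently enabled component in order
        for _, component in self.components:
            doc = component(doc)
        return doc

    @contextmanager
    def disable_pipes(self, *names):
        saved = list(self.components)
        self.components = [(n, c) for n, c in saved if n not in names]
        try:
            yield self
        finally:
            # Restore the full pipeline when the with block exits
            self.components = saved


def make_component(name):
    """Build a toy component that records its own name on the 'doc'."""
    def component(doc):
        return doc + [name]
    return component


toy_nlp = ToyPipeline(
    (name, make_component(name))
    for name in ["length_component", "tagger", "parser", "ner"]
)

with toy_nlp.disable_pipes("tagger", "parser"):
    inside_names = toy_nlp.pipe_names
    inside_result = toy_nlp([])

after_names = toy_nlp.pipe_names
print(inside_names, after_names)
```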
