## Processing Text
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object and then calls each pipeline component on the Doc, in order. It then returns the processed Doc that you can work with.

In [95]:
# Importing Necessary Libraries and Filtering Warnings
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="spacy.pipeline.lemmatizer")

In [96]:
# Processing a Simple Text
nlp = spacy.load('en_core_web_sm')
doc = nlp("This is raw text")
print(doc)

This is raw text


When processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts. spaCy's nlp.pipe method takes an iterable of texts and yields processed Doc objects. The batching is done internally.

In [97]:
# Processing Multiple Texts (nlp.pipe)
texts = ["This is raw text", "There is lots of text"]
docs = list(nlp.pipe(texts))  # nlp.pipe yields one Doc per input text
print(docs)

[This is raw text, There is lots of text]
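
To make the internal batching a little more concrete, here is a rough pure-Python sketch of the idea. It is not spaCy's implementation: `toy_minibatch`, `toy_pipe` and `split_batch` are hypothetical names invented for illustration, and whitespace splitting stands in for a real model. The point is only that texts are consumed lazily, grouped into batches, and yielded back in their original order.

```python
from itertools import islice


def toy_minibatch(items, size):
    """Lazily yield lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def toy_pipe(texts, process_batch, batch_size=2):
    """Consume texts as a stream, process them per batch, yield results in order."""
    for batch in toy_minibatch(texts, batch_size):
        yield from process_batch(batch)


def split_batch(batch):
    # Hypothetical stand-in for a model that handles a whole batch at once.
    return [text.split() for text in batch]


texts = ["This is raw text", "There is lots of text", "One more"]
results = list(toy_pipe(texts, split_batch))
print(results)
```

Because `toy_pipe` is a generator, nothing is processed until you iterate over it, which is what lets the same pattern work on inputs too large to hold in memory.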


> Tips for efficient processing

*   Process the texts as a stream using nlp.pipe and buffer them in batches, instead of one-by-one. This is usually much more efficient.
*   Only apply the pipeline components you need. Getting predictions from the model that you don't actually need adds up and becomes very inefficient at scale. To prevent this, use the disable keyword argument to disable components you don't need.

> In this example, we're using `nlp.pipe` to process a (potentially very large) iterable of texts as a stream. Because we're only accessing the named entities in `doc.ents` (set by the ner component), we'll disable all other statistical components (the tagger and parser) during processing. `nlp.pipe` yields Doc objects, so we can iterate over them and access the named entity predictions:

In [98]:
# Disabling the tagger and parser (inside nlp.pipe), keeping NER
texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million ",
    "Revenue exceeded twelve billion dollars, with a loss of $1b."
    ]

nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, disable=["tagger", "parser"]))

for doc in docs:
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
    print()

[('$9.4 million', 'MONEY'), ('the prior year', 'DATE'), ('$2.7 million', 'MONEY')]

[('twelve billion dollars', 'MONEY'), ('1b', 'MONEY')]



## How `Pipelines` Work
spaCy makes it very easy to create your own pipelines consisting of reusable components - this includes spaCy's default tagger, parser and entity recognizer, but also your own custom processing functions. A pipeline component can be added to an already existing nlp object, specified when initializing a Language class, or defined within a model package.

When you load a model, spaCy first consults the model's meta.json.

The meta typically includes the model details, the ID of a Language class, and an optional list of pipeline components. spaCy then does the following:


*   Load the Language class and data for the given ID via `get_lang_class` and initialize it. The Language class contains the shared vocabulary, tokenization rules and the language-specific annotation scheme.
*   Iterate over the pipeline names and create each component using `create_pipe`, which looks them up in `Language.factories`.
*   Add each pipeline component to the pipeline in order, using `add_pipe`.
*   Make the model data available to the Language class by calling `from_disk` with the path to the model data directory.


`{`

  `"lang":"en",`

  `"name":"core_web_sm",`

  `"description":"Example model for spaCy",`

  `"pipeline":["tagger","parser","ner"]`

`}`

Fundamentally, a spaCy model consists of three components: the weights, i.e. binary data loaded in from a directory; a pipeline of functions called in order; and language data like the tokenization rules and annotation scheme.
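
The loading steps described in this section can be sketched with a toy model in plain Python. This is only an illustration under loud assumptions: `ToyLanguage`, `FACTORIES`, `make_component` and `toy_load` are hypothetical names, documents are plain dicts, and the step that loads weights from disk is left out entirely. It is not how spaCy itself is implemented.

```python
# Toy model of the loading steps; every name here is hypothetical, not spaCy's API.
meta = {
    "lang": "en",
    "name": "core_web_sm",
    "pipeline": ["tagger", "parser", "ner"],
}


def make_component(name):
    # A real factory would build a statistical component; this one only
    # records on the document that it ran.
    def component(doc):
        doc.setdefault("applied", []).append(name)
        return doc

    return component


# Stand-in for a registry of component factories, keyed by component name.
FACTORIES = {name: make_component for name in ("tagger", "parser", "ner")}


class ToyLanguage:
    def __init__(self, lang):
        self.lang = lang
        self.pipeline = []  # ordered list of (name, component) pairs

    def add_pipe(self, name, component):
        self.pipeline.append((name, component))

    def __call__(self, text):
        doc = {"text": text, "tokens": text.split()}  # "tokenize" first
        for _, component in self.pipeline:
            doc = component(doc)  # then call each component, in order
        return doc


def toy_load(meta):
    nlp = ToyLanguage(meta["lang"])  # 1. initialize the language class
    for name in meta["pipeline"]:
        component = FACTORIES[name](name)  # 2. create each component by name
        nlp.add_pipe(name, component)  # 3. add it to the pipeline, in order
    return nlp  # 4. loading the weights is omitted in this sketch


toy_nlp = toy_load(meta)
print([name for name, _ in toy_nlp.pipeline])
print(toy_nlp("This is raw text")["applied"])
```

The order of names in the `pipeline` list is the order in which components run, which is why the meta file stores it as a list rather than a set.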

## Disabling and Modifying Pipeline Components
If you don't need a particular component of the pipeline, for example the tagger or the parser, you can disable loading it. This can sometimes make a big difference and improve loading speed.

In [99]:
# Disabling the tagger and parser (inside spacy.load), keeping NER
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
nlp

<spacy.lang.en.English at 0x7dab43b9b8d0>

In some cases, you do want to load all pipeline components and their weights, because you need them at different points in your application. However, if you only need a Doc object with named entities, there's no need to run all pipeline components on it. Running them anyway can make processing noticeably slower, so it is better to disable the components you don't need for that step.

In [100]:
doc = nlp("Apple is buying a startup")

for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG


In [101]:
# Disabling NER, Tagger, Parser
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser","ner"])
doc = nlp("Apple is buying a startup")

for ent in doc.ents:
    print(ent.text, ent.label_)

In [102]:
# 1. Use as a context manager
nlp = spacy.load("en_core_web_sm")

with nlp.disable_pipes("tagger", "parser"):
    doc = nlp("I won't be tagged and parsed")
# The disabled components are restored when the block exits
doc = nlp("I will be tagged and parsed")
print(doc)

I will be tagged and parsed


In [103]:
# 2. Restore manually
nlp = spacy.load("en_core_web_sm")
disabled = nlp.disable_pipes("ner")
doc = nlp("I won't have named entities")
disabled.restore()  # put the ner component back into the pipeline
print(doc)

I won't have named entities
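
The two usage patterns above (context manager and manual restore) can be mimicked with a small toy class. This is a sketch of the general pattern only: `ToyDisabledPipes` is a hypothetical helper, the pipeline is a plain list of `(name, function)` pairs, and nothing here reflects how spaCy implements `disable_pipes` internally.

```python
class ToyDisabledPipes:
    """Hypothetical helper: removes named components and can put them back."""

    def __init__(self, pipeline, names):
        self.pipeline = pipeline
        self.original = list(pipeline)  # snapshot to restore from
        # Mutate the list in place so the owner of the list sees the change.
        pipeline[:] = [(n, c) for n, c in pipeline if n not in names]

    def restore(self):
        self.pipeline[:] = self.original

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.restore()


# Placeholder components; the functions themselves are irrelevant here.
pipeline = [("tagger", str.upper), ("parser", str.strip), ("ner", str.title)]

# 1. Context manager: the components come back when the block exits.
with ToyDisabledPipes(pipeline, {"tagger", "parser"}):
    inside = [name for name, _ in pipeline]
after_block = [name for name, _ in pipeline]

# 2. Manual restore.
disabled = ToyDisabledPipes(pipeline, {"ner"})
while_disabled = [name for name, _ in pipeline]
disabled.restore()
after_restore = [name for name, _ in pipeline]

print(inside, after_block, while_disabled, after_restore)
```

The context-manager form is harder to misuse, since the restore happens even if processing inside the block raises an exception.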
