https://spacy.io

## Industrial-Strength Natural Language Processing

### Fast and easy integration with deep learning workflows

#### Make sure you install the package via `pip3 install spacy` and then download the English language model via `python3 -m spacy download en_core_web_sm`

#### Let's see some basic tasks with spaCy

In [1]:
import spacy

WEBSITE_DES = '''
spaCy is the best way to prepare text for deep learning.
'''

# Load the spaCy language model for English
nlp = spacy.load('en_core_web_sm')

# Now let's create a document and see how spaCy processes it for us
document = nlp(WEBSITE_DES.strip())

#### Let's see the tokens and their part of speech (POS) in our text

In [2]:
for word in document:
    # spaCy ships with a POS tagger that works out of the box
    print(word.text, word.pos_, word.tag_, spacy.explain(word.tag_))

spaCy PROPN NNP noun, proper singular
is VERB VBZ verb, 3rd person singular present
the DET DT determiner
best ADJ JJS adjective, superlative
way NOUN NN noun, singular or mass
to PART TO infinitival to
prepare VERB VB verb, base form
text NOUN NN noun, singular or mass
for ADP IN conjunction, subordinating or preposition
deep ADJ JJ adjective
learning NOUN NN noun, singular or mass
. PUNCT . punctuation mark, sentence closer


#### Now let's see the Token type and its properties

In [3]:
# Let's see what else each token has
print(type(document[-1]), document[-1].__dir__())

<class 'spacy.tokens.token.Token'> ['__repr__', '__hash__', '__str__', '__lt__', '__le__', '__eq__', '__ne__', '__gt__', '__ge__', '__len__', '__new__', 'set_extension', 'get_extension', 'has_extension', 'remove_extension', '__unicode__', '__bytes__', '__reduce__', 'check_flag', 'nbor', 'similarity', 'is_ancestor', '_', 'lex_id', 'rank', 'string', 'text', 'text_with_ws', 'prob', 'sentiment', 'lang', 'idx', 'cluster', 'orth', 'lower', 'norm', 'shape', 'prefix', 'suffix', 'lemma', 'pos', 'tag', 'dep', 'has_vector', 'vector', 'vector_norm', 'tensor', 'n_lefts', 'n_rights', 'sent', 'sent_start', 'is_sent_start', 'lefts', 'rights', 'children', 'subtree', 'left_edge', 'right_edge', 'ancestors', 'head', 'conjuncts', 'ent_type', 'ent_type_', 'ent_iob', 'ent_iob_', 'ent_id', 'ent_id_', 'ent_kb_id', 'ent_kb_id_', 'whitespace_', 'orth_', 'lower_', 'norm_', 'shape_', 'prefix_', 'suffix_', 'lang_', 'lemma_', 'pos_', 'tag_', 'dep_', 'is_oov', 'is_stop', 'is_alpha', 'is_ascii', 'is_digit', 'is_lower'
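Most of these attributes are read the same way as `pos_` and `tag_` earlier. Here is a minimal sketch of a few lexical ones (`is_alpha`, `is_punct`, `shape_`). It assumes a blank English pipeline created with `spacy.blank('en')`, which is not used elsewhere in this notebook but avoids loading the statistical model; the sample sentence is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because these attributes are lexical and need no trained model.
blank_nlp = spacy.blank('en')
sample = blank_nlp('spaCy is fast.')

# Collect a few per-token attributes from the list above
rows = [(tok.text, tok.is_alpha, tok.is_punct, tok.shape_) for tok in sample]
for row in rows:
    print(row)
```

Attributes that depend on a trained component, such as `pos_` or `dep_`, would stay empty with a blank pipeline.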

#### Let's check for dependencies

In [4]:
doc_copy = nlp(WEBSITE_DES.replace('is', 'isn\'t').strip())
for word in doc_copy:
    print(word, word.dep_)

spaCy nsubj
is ROOT
n't neg
the det
best amod
way attr
to aux
prepare relcl
text dobj
for prep
deep amod
learning pobj
. punct


#### How about sentences?

In [5]:
WEBSITE_DES = '''spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. Independent research in 2015 found spaCy to be the fastest in the world.'''
doc_copy = nlp(WEBSITE_DES)
for sentence in doc_copy.sents:
    print(sentence)


spaCy excels at large-scale information extraction tasks.
It's written from the ground up in carefully memory-managed Cython.
Independent research in 2015 found spaCy to be the fastest in the world.


#### Drawbacks of spaCy's tokenization

In [6]:
sentence = nlp('I am a non-vegetarian student and zamanias@mcmaster.ca is my email address.')
for word in sentence:
    print(word.text)

I
am
a
non
-
vegetarian
student
and
zamanias@mcmaster.ca
is
my
email
address
.


The tokenizer splits non-vegetarian into three tokens at the
hyphen, yet it is smart enough to keep the email address as a
single token.
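If the hyphenated word needs to be a single token, one option is to merge the pieces after tokenization with `Doc.retokenize()`. The sketch below assumes a blank English pipeline (`spacy.blank('en')`) and a shortened version of the sentence; the token indices are hard-coded for this sentence only.

```python
import spacy

# Assumption: a tokenizer-only pipeline is enough to show the merge
blank_nlp = spacy.blank('en')
doc = blank_nlp('I am a non-vegetarian student.')
print([tok.text for tok in doc])

# Tokens 3..5 are 'non', '-', 'vegetarian'; merge them into one token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:6])

merged_tokens = [tok.text for tok in doc]
print(merged_tokens)
```

For many documents, hard-coded indices are not practical; a custom tokenizer rule would be a more general fix.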

#### Let's dive into entities

In [7]:
doc_copy = nlp('New York is a city in the United States. I have a $3 million apartment there.')
for entity in doc_copy.ents:
    print(' - '.join([entity.text, entity.label_,
                     str(spacy.explain(entity.label_))]))

New York - GPE - Countries, cities, states
the United States - GPE - Countries, cities, states
$3 million - MONEY - Monetary values, including unit


You may also define new entities, e.g. the name of your company, by
creating a `Span` object (from `spacy.tokens`) with the label you
want and adding it to `doc.ents`
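A minimal sketch of adding an entity by hand with `Span`. The company name `Acme Corp` and the sentence are hypothetical, and a blank English pipeline (`spacy.blank('en')`) is assumed so that `doc.ents` starts out empty.

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline, so no entities are predicted automatically
blank_nlp = spacy.blank('en')
doc = blank_nlp('Acme Corp is hiring in New York.')

# Tokens 0 and 1 ('Acme', 'Corp') form the new ORG entity
company = Span(doc, 0, 2, label='ORG')

# Keep any existing entities and append the new one
doc.ents = list(doc.ents) + [company]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained model, the new span must not overlap an entity the model has already predicted, otherwise assigning to `doc.ents` raises an error.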

#### How about names?

In [8]:
doc_copy = nlp('Albert Einstein and Marilyn Monroe married in Royal Albert Hall last night')
for noun in doc_copy.noun_chunks:
    print(noun)

Albert Einstein
Marilyn Monroe
Royal Albert Hall


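Pretty good results, hmm? Note that `noun_chunks` returns base noun
phrases, not names specifically; the names show up here because
each one is a noun phrase.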

#### How about one or two cool visualizations?

In [12]:
from spacy import displacy
doc_copy = nlp('Python is a good programming language.')
displacy.render(doc_copy, style='dep', jupyter=True,
                options={'distance': 100})

In [13]:
doc_copy = nlp('McMaster is a cool university. Hamilton is a cool city near the cooler Toronto city in the coolest country, Canada!')
displacy.render(doc_copy, style='ent', jupyter=True)


#### Let's dive into Lemmatization and Stemming in spaCy
Stemming refers to reducing a word to its root form.
For instance, walk, walking, walker, walked, etc. all
come from a common root. It's usually not a good idea to treat
them as distinct words while doing NLP.

Lemmatization refers to much the same task, with
a slightly different approach.

In Stemming, we usually chop off the ends of words **in
the hope of** reaching the root (a heuristic), while in
Lemmatization we use vocabulary and morphological analysis
of the words to return their dictionary form (the lemma).
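To make the "chop off the ends" idea concrete, here is a deliberately naive suffix stripper in plain Python. It is only an illustration of the heuristic: the suffix list, the `naive_stem` name, and the `min_stem_len` parameter are all made up here, and this is not the Porter algorithm or any library's stemmer.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ('ing', 'ed', 'er', 's')


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[:-len(suffix)]
    return lowered


words = ['walk', 'walked', 'walker', 'walking', 'computer', 'computing', 'has']
stems = [naive_stem(w) for w in words]
for word, stem in zip(words, stems):
    print(word, '->', stem)
```

The heuristic produces stems such as `comput` that are not dictionary words, and it cannot relate `has` to `have`; that kind of mapping needs the vocabulary lookup that lemmatization relies on.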

##### Enough theory, let's see them in action

In [11]:
doc_copy = nlp('walk walked walker walking computer computing has have are You they')
for word in doc_copy:
    print(word.lemma_)

walk
walk
walker
walk
computer
computing
have
have
be
-PRON-
-PRON-


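A little strange! `walker` and `computing` are left unchanged,
and both pronouns are mapped to the placeholder `-PRON-`.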

Also note that there is no direct way of doing *Stemming* in
spaCy, so we need to use another tool, such as NLTK, for that.