In [4]:
# Import spacy
import spacy

In [5]:
nlp = spacy.load("en_core_web_sm")

Construct a `Doc` object. This is a sequence of `Token` objects, each representing a lexical token.
Each `Token` object holds information about a particular piece of text, typically one word. You can instantiate a `Doc` object by calling the `Language` object with the input string as an argument.

In [6]:

introduction_doc = nlp(
    "This tutorial is about Natural Language Processing in spaCy."
)

In [7]:
type(introduction_doc)

spacy.tokens.doc.Doc

Iterate over the `Doc` object with a list comprehension that produces a series of `Token` objects.
On each `Token` object, you read the `.text` attribute to get the text contained within that token.

In [8]:
# Get the text of each token in the Doc
[token.text for token in introduction_doc]

['This',
 'tutorial',
 'is',
 'about',
 'Natural',
 'Language',
 'Processing',
 'in',
 'spaCy',
 '.']
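
Because a `Doc` behaves like a sequence, it also supports `len()`, integer indexing, and slicing. The sketch below is an illustration rather than part of the walkthrough: it assumes a tokenizer-only pipeline from `spacy.blank("en")` (so no trained model is needed), and the names `blank_nlp`, `first_token`, and `topic_span` are made up for the example.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because indexing and slicing depend only on tokenization.
blank_nlp = spacy.blank("en")
doc = blank_nlp("This tutorial is about Natural Language Processing in spaCy.")

first_token = doc[0]   # integer index -> Token
topic_span = doc[4:7]  # slice -> Span, a view over part of the Doc

print(len(doc))
print(first_token.text)
print(topic_span.text)
```

Slicing returns a `Span` rather than a list, so the result keeps its link to the parent `Doc`.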

In [9]:
import pathlib
file_name = "introduction.txt"

introduction_doc = nlp(
    pathlib.Path(file_name).read_text(encoding="utf-8")
)

print([token.text for token in introduction_doc])

['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing', 'in', 'Spacy', '.']


**Sentence Detection**

This is the process of locating where sentences start and end in a given text.

In spaCy, the `.sents` property is used to extract sentences from the `Doc` object.

In [10]:
about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)

about_doc = nlp(about_text)
sentences = list(about_doc.sents)

In [11]:
type(about_doc.sents)

generator

In [12]:
len(sentences)

2

In [13]:
for sentence in sentences:
    print(f'Sentence : {sentence}')

Sentence : Gus Proto is a Python developer currently working for a London-based Fintech company.
Sentence : He is interested in learning Natural Language Processing.
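
With `en_core_web_sm`, sentence boundaries are set by the trained pipeline. When only sentence splitting is needed, spaCy also ships a rule-based `sentencizer` component that splits on punctuation. The following is a minimal sketch under that assumption; it uses a blank English pipeline, and the names `rule_nlp` and `rule_sentences` are illustrative.

```python
import spacy

# Assumption: a blank pipeline plus the built-in rule-based "sentencizer"
# stands in for the trained model used in the cells above.
rule_nlp = spacy.blank("en")
rule_nlp.add_pipe("sentencizer")

text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)

rule_sentences = [sent.text for sent in rule_nlp(text).sents]
for sent in rule_sentences:
    print(sent)
```

The rule-based splitter is quick to set up, but it only looks at punctuation, so it may disagree with the trained pipeline on less tidy text.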


You can also customize sentence detection behavior by using custom delimiters. Here’s an example where an ellipsis (...) is used as a delimiter, in addition to the full stop, or period (.):

In [14]:
ellipsis_text = (
    "Gus, can you, ... never mind, I forgot"
    " what I was saying. So, do you think"
    " we should ..."
)

In [15]:
from spacy.language import Language

In [16]:
@Language.component('set_custom_boundaries')
def set_custom_boundaries(doc):
    """Add support for using `...` as a delimiter for sentence detection."""
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i + 1].is_sent_start = True
    return doc

In [17]:
custom_nlp = spacy.load('en_core_web_sm')
custom_nlp.add_pipe('set_custom_boundaries', before='parser')
custom_ellipsis_doc = custom_nlp(ellipsis_text)
custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)

for sentence in custom_ellipsis_sentences:
    print(sentence)

Gus, can you, ...
never mind, I forgot what I was saying.
So, do you think we should ...


In [18]:
len(custom_ellipsis_sentences)

3
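
A different way to get a similar result, sketched here as an alternative and not as the method used above, is to configure the rule-based `sentencizer` with its `punct_chars` setting so that `...` counts as sentence-final punctuation. The blank pipeline and the names `ellipsis_nlp` and `ellipsis_sentences` are assumptions for the example.

```python
import spacy

# Assumption: a blank pipeline with the rule-based sentencizer, configured
# to treat both "." and "..." as sentence-ending punctuation.
ellipsis_nlp = spacy.blank("en")
ellipsis_nlp.add_pipe("sentencizer", config={"punct_chars": [".", "..."]})

ellipsis_text = (
    "Gus, can you, ... never mind, I forgot"
    " what I was saying. So, do you think"
    " we should ..."
)

ellipsis_sentences = [sent.text for sent in ellipsis_nlp(ellipsis_text).sents]
for sent in ellipsis_sentences:
    print(sent)
```

A custom component, as in the cells above, remains the more flexible option when the boundary rule depends on more than the token text.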

Building the `Doc` container involves tokenizing the text. The process of tokenization breaks a text down into its basic units, or **tokens**, which are represented in spaCy as `Token` objects.

In [19]:
# The token’s original index position in the string is still available as an attribute on Token

about_doc = nlp(about_text)

for token in about_doc:
    print(token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142
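
The `idx` value is a character offset into the original string, so together with the token's length it can be used to slice the token back out of the text. A small sketch, assuming a blank English pipeline and a shortened example sentence (both chosen only for illustration):

```python
import spacy

blank_nlp = spacy.blank("en")
text = "Gus Proto is a Python developer."
doc = blank_nlp(text)

for token in doc:
    start = token.idx
    end = token.idx + len(token)  # len(token) is the number of characters
    # Slicing the original string with these offsets gives back the token text.
    print(f"{token.text:10} {start:>3}:{end:<3} {text[start:end]!r}")

# True when every token can be recovered from its character offsets.
offsets_ok = all(text[t.idx : t.idx + len(t)] == t.text for t in doc)
print(offsets_ok)
```

This is handy when token-level results need to be mapped back onto the raw text, for example to highlight matches.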


spaCy provides various other attributes on the `Token` class.

In [20]:
print(
    f"{'Text with Whitespace':22}"
    f"{'Is Alphabetic?':18}"
    f"{'Is Punctuation?':18}"
    f"{'Is Stop Word?'}"
)
for token in about_doc:
    print(
        f"{str(token.text_with_ws):22}"
        f"{str(token.is_alpha):18}"
        f"{str(token.is_punct):18}"
        f"{str(token.is_stop)}"
    )

Text with Whitespace  Is Alphabetic?    Is Punctuation?   Is Stop Word?
Gus                   True              False             False
Proto                 True              False             False
is                    True              False             True
a                     True              False             True
Python                True              False             False
developer             True              False             False
currently             True              False             False
working               True              False             False
for                   True              False             True
a                     True              False             True
London                True              False             False
-                     False             True              False
based                 True              False             False
Fintech               True              False             False
company               True          
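
One practical consequence of `text_with_ws` is that joining it across all tokens rebuilds the original string exactly. The sketch below checks that, along with a few of the boolean flags, on a short sentence. The blank pipeline and the sentence itself are assumptions for the example; these particular attributes are lexical, so no trained model is needed for them.

```python
import spacy

blank_nlp = spacy.blank("en")
text = "Gus works for a London-based company."
doc = blank_nlp(text)

# Each token carries its own trailing whitespace, so a plain join
# reproduces the input without guessing where the spaces were.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)

hyphen = doc[5]  # the "-" inside "London-based"
print(hyphen.text, hyphen.is_alpha, hyphen.is_punct, hyphen.is_stop)
```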

As with many aspects of spaCy, you can also customize the tokenization process to detect tokens on custom characters. This is often used for hyphenated words such as London-based.

To customize tokenization, you need to update the tokenizer property on the callable Language object with a new Tokenizer object.

To see what’s involved, imagine you had some text that used the @ symbol instead of the usual hyphen (-) as an infix to link words together. So, instead of London-based, you had London@based.

In [21]:
custom_about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London@based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing"
)

print([token.text for token in nlp(custom_about_text)[8:15]])

['for', 'a', 'London@based', 'Fintech', 'company', '.', 'He']


In this example, the default tokenizer reads the London@based text as a single token, but if you used a hyphen instead of the @ symbol, then you’d get three tokens.

In [22]:
custom_about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing"
)

print([token.text for token in nlp(custom_about_text)[8:15]])

['for', 'a', 'London', '-', 'based', 'Fintech', 'company']


In [23]:
# To include the @ symbol as a custom infix, you need to build your own Tokenizer object
import re
from spacy.tokenizer import Tokenizer

custom_nlp = spacy.load("en_core_web_sm")
prefix_re = spacy.util.compile_prefix_regex(
    custom_nlp.Defaults.prefixes
)
suffix_re = spacy.util.compile_suffix_regex(
    custom_nlp.Defaults.suffixes
)

custom_infixes = [r"@"]

infix_re = spacy.util.compile_infix_regex(
    list(custom_nlp.Defaults.infixes) + custom_infixes
)

custom_nlp.tokenizer = Tokenizer(
    custom_nlp.vocab,  # use the vocab of the pipeline this tokenizer belongs to
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
    token_match=None,
)

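# An earlier cell reassigned custom_about_text to the hyphenated version,
# so make sure the text uses the @ infix before tokenizing it.
at_about_text = custom_about_text.replace("London-based", "London@based")
custom_tokenizer_about_doc = custom_nlp(at_about_text)

print([token.text for token in custom_tokenizer_about_doc[8:15]])

['for', 'a', 'London', '@', 'based', 'Fintech', 'company']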
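
As an alternative sketch, the existing tokenizer can be kept and only its infix rule swapped, instead of constructing a new `Tokenizer`. This assumes a blank English pipeline; `at_nlp` and `at_tokens` are illustrative names, and the short input string is made up for the example.

```python
import spacy

at_nlp = spacy.blank("en")

# Extend the default infix patterns with "@" and recompile them.
infix_re = spacy.util.compile_infix_regex(
    list(at_nlp.Defaults.infixes) + [r"@"]
)

# Replace only the infix rule; prefixes, suffixes, and the tokenizer's
# special-case rules stay as they were.
at_nlp.tokenizer.infix_finditer = infix_re.finditer

at_tokens = [token.text for token in at_nlp("a London@based Fintech company")]
print(at_tokens)
```

Reassigning one rule is less code than rebuilding the tokenizer, and it avoids accidentally dropping settings that a hand-built `Tokenizer` would need to be given explicitly.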


**Stop Words**

`Stop words` are typically defined as the most common words in a language. In English, some examples of stop words are `the`, `are`, `but`, and `they`. Most sentences need to contain stop words in order to be full sentences that make grammatical sense.

In NLP, stop words are generally removed because they aren’t significant, and they heavily distort any word frequency analysis. spaCy stores a list of stop words for the English language.

In [24]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

326

In [25]:
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

to
about
otherwise
last
’m
been
‘s
whose
doing
out


In [26]:
# You can remove stop words from the input text by making use of the .is_stop attribute of each token
print([token.text for token in about_doc if not token.is_stop])

['Gus', 'Proto', 'Python', 'developer', 'currently', 'working', 'London', '-', 'based', 'Fintech', 'company', '.', 'interested', 'learning', 'Natural', 'Language', 'Processing', '.']


While you can’t be sure exactly what the sentence is trying to say without stop words, you still have a lot of information about what it’s generally about.
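
The same filtering can be tried without a trained model, since `is_stop` is a lexical attribute backed by the language's stop word list. A minimal sketch, assuming a blank English pipeline and importing `STOP_WORDS` explicitly (the names `blank_nlp` and `content_words` are illustrative):

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

blank_nlp = spacy.blank("en")
doc = blank_nlp("Gus Proto is a Python developer")

# Keep only the tokens that are not on the English stop word list.
content_words = [token.text for token in doc if not token.is_stop]
print(content_words)

# The list itself is a plain set of lowercase strings.
print("is" in STOP_WORDS, "python" in STOP_WORDS)
```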

**Lemmatization**

`Lemmatization` is the process of reducing the inflected forms of a word to a common base form while ensuring that the reduced form is still a valid word in the language. This reduced form, or root word, is called a **lemma**.

For example, `organise` is the lemma for `organises`, `organised`, and `organising`.

In [27]:
# spaCy exposes the lemma through the `lemma_` attribute on the `Token` class
conference_help_text = (
    "Gus is helping organise a developer"
    " conference on Application of Natural Language"
    " Processing. he keeps organising local Python meetups"
    " and several internal talks at his workplace"

)

conference_help_doc = nlp(conference_help_text)
for token in conference_help_doc:
    if str(token) != str(token.lemma_):
        print(f"{str(token):>20} : {str(token.lemma_)}")

                  is : be
             helping : help
               keeps : keep
          organising : organise
             meetups : meetup
               talks : talk


Lemmatization helps you avoid counting words that overlap conceptually as if they were different words.
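
To make that concrete, the sketch below counts a few word forms twice: once as they appear and once after mapping each form to its lemma. The lookup table is hand-written for illustration, based on the forms mentioned above; in a real pipeline the lemma would come from `token.lemma_`.

```python
from collections import Counter

# Hypothetical hand-written lookup standing in for token.lemma_.
lemma_lookup = {
    "organises": "organise",
    "organised": "organise",
    "organising": "organise",
    "talks": "talk",
    "meetups": "meetup",
}

words = ["organise", "organising", "organised", "talks", "talk"]

surface_counts = Counter(words)
lemma_counts = Counter(lemma_lookup.get(word, word) for word in words)

print(surface_counts)  # every surface form is counted separately
print(lemma_counts)    # related forms collapse onto one lemma
```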

**Word Frequency**

You can now convert a given text into tokens and perform statistical analysis on it. This analysis can give you various insights, such as common words or unique words in the text.

In [28]:
from collections import Counter
complete_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech company. He is"
    " interested in learning Natural Language Processing."
    " There is a developer conference happening on 21 July"
    ' 2019 in London. It is titled "Applications of Natural'
    ' Language Processing". There is a helpline number'
    " available at +44-1234567891. Gus is helping organize it."
    " He keeps organizing local Python meetups and several"
    " internal talks at his workplace. Gus is also presenting"
    ' a talk. The talk will introduce the reader about "Use'
    ' cases of Natural Language Processing in Fintech".'
    " Apart from his work, he is very passionate about music."
    " Gus is learning to play the Piano. He has enrolled"
    " himself in the weekend batch of Great Piano Academy."
    " Great Piano Academy is situated in Mayfair or the City"
    " of London and has world-class piano instructors."
    )

complete_doc = nlp(complete_text)

words = [
    token.text for token in complete_doc
    if not token.is_stop and not token.is_punct
]

print(Counter(words).most_common(5))

[('Gus', 4), ('London', 3), ('Natural', 3), ('Language', 3), ('Processing', 3)]


In [29]:
# Without removing stop words

words = [
    token.text for token in complete_doc
    if not token.is_punct
]

print(Counter(words).most_common(5))

[('is', 10), ('a', 5), ('in', 5), ('Gus', 4), ('of', 4)]


Four out of five of the most common words are stop words that don’t really tell you much about the summarized text. This is why stop words are often considered noise for many applications.
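
The same `Counter` can also surface the unique words, meaning those that occur exactly once. The sketch below uses a small hand-written word list in place of the tokens extracted from `complete_doc`, so the list and the `unique_words` name are assumptions for the example.

```python
from collections import Counter

# Illustrative word list standing in for the filtered tokens of a Doc.
words = [
    "Gus", "Proto", "Python", "developer", "London",
    "Fintech", "London", "Python", "Gus",
]

counts = Counter(words)

# Words whose frequency is exactly one, in first-seen order.
unique_words = [word for word, freq in counts.items() if freq == 1]

print(counts.most_common(3))
print(unique_words)
```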

**Part of Speech Tagging**

`Part of speech` or `POS` is a grammatical role that explains how a particular word is used in a sentence. There are typically eight parts of speech:

1. Noun
2. Pronoun
3. Adjective
4. Verb
5. Adverb
6. Preposition
7. Conjunction
8. Interjection

`Part-of-speech tagging` is the process of assigning a POS tag to each token depending on its usage in the sentence. POS tags are useful for assigning a syntactic category like noun or verb to each word.
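
spaCy exposes two levels of labels on each token: the fine-grained `tag_` and the coarse-grained `pos_`. Either kind of label can be passed to `spacy.explain()` for a short description, which works without loading a model. A small sketch (the choice of labels is only for illustration):

```python
import spacy

# Two fine-grained tags followed by two coarse-grained POS labels.
labels = ("NNP", "VBZ", "PROPN", "AUX")

descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:6} {description}")
```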

In [30]:
for token in about_doc:
    print(
        f"""
    TOKEN: {str(token)}
    =======
    TAG: {str(token.tag_):10} POS: {token.pos_}
    EXPLANATION: {spacy.explain(token.tag_)}"""
    )


    TOKEN: Gus
    TAG: NNP        POS: PROPN
    EXPLANATION: noun, proper singular

    TOKEN: Proto
    TAG: NNP        POS: PROPN
    EXPLANATION: noun, proper singular

    TOKEN: is
    TAG: VBZ        POS: AUX
    EXPLANATION: verb, 3rd person singular present

    TOKEN: a
    TAG: DT         POS: DET
    EXPLANATION: determiner

    TOKEN: Python
    TAG: NNP        POS: PROPN
    EXPLANATION: noun, proper singular

    TOKEN: developer
    TAG: NN         POS: NOUN
    EXPLANATION: noun, singular or mass

    TOKEN: currently
    TAG: RB         POS: ADV
    EXPLANATION: adverb

    TOKEN: working
    TAG: VBG        POS: VERB
    EXPLANATION: verb, gerund or present participle

    TOKEN: for
    TAG: IN         POS: ADP
    EXPLANATION: conjunction, subordinating or preposition

    TOKEN: a
    TAG: DT         POS: DET
    EXPLANATION: determiner

    TOKEN: London
    TAG: NNP        POS: PROPN
    EXPLANATION: noun, proper singular

    TOKEN: -
    TAG: HYPH       POS:

In [31]:
# By using POS tags, you can extract a particular category of words:
nouns = []
adjectives = []
for token in about_doc:
    if token.pos_ == "NOUN":
        nouns.append(token)
    if token.pos_ == "ADJ":
        adjectives.append(token)

nouns, adjectives

([developer, company], [interested])

You can use this type of word classification to derive insights. For instance, you could gauge sentiment by analyzing which adjectives are most commonly used alongside nouns.
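
One rough way to approach that idea is to count adjectives that appear directly before a noun. The sketch below works on hand-tagged `(text, POS)` pairs, which are hypothetical data standing in for `(token.text, token.pos_)` from a real `Doc`; it only checks adjacency and is not a substitute for using the dependency parse.

```python
from collections import Counter

# Hypothetical hand-tagged tokens; with spaCy these pairs would come from
# (token.text, token.pos_) for each token in a Doc.
tagged = [
    ("great", "ADJ"), ("piano", "NOUN"), ("and", "CCONJ"),
    ("great", "ADJ"), ("talks", "NOUN"), ("but", "CCONJ"),
    ("boring", "ADJ"), ("meetups", "NOUN"),
]

# Pair each token with its successor and keep ADJ -> NOUN neighbours.
adjective_noun_pairs = [
    (first_text, second_text)
    for (first_text, first_pos), (second_text, second_pos)
    in zip(tagged, tagged[1:])
    if first_pos == "ADJ" and second_pos == "NOUN"
]

adjective_counts = Counter(adjective for adjective, _ in adjective_noun_pairs)

print(adjective_noun_pairs)
print(adjective_counts.most_common(1))
```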

In [32]:
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
about_interested_text = (
    "He is interested in learning Natural Language Processing."
)
about_interest_doc = nlp(about_interested_text)
displacy.serve(about_interest_doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [33]:
displacy.render(about_interest_doc, style="dep", jupyter=True)

**Preprocessing Functions**