In [2]:
# Import spacy
import spacy

In [3]:
nlp = spacy.load("en_core_web_sm")

Construct a `Doc` object. A `Doc` is a sequence of `Token` objects, each representing a lexical token.
Each `Token` object holds information about a particular piece of text, typically one word. You can instantiate a `Doc` object by calling the `Language` object (here, `nlp`) with the input string as an argument.

In [4]:

introduction_doc = nlp(
    "This tutorial is about Natural Language Processing in spaCy."

)

In [5]:
type(introduction_doc)

spacy.tokens.doc.Doc

Iterate over the `Doc` object with a list comprehension to produce a series of `Token` objects.
On each `Token` object, read the `.text` attribute to get the text contained within that token.

In [6]:
# Get the token in Doc
[token.text for token in introduction_doc]

['This',
 'tutorial',
 'is',
 'about',
 'Natural',
 'Language',
 'Processing',
 'in',
 'spaCy',
 '.']
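Besides iteration, a `Doc` behaves like a sequence: it has a length, an integer index returns a `Token`, and a slice returns a `Span`. The sketch below assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) instead of the `en_core_web_sm` model, and the names `blank_nlp`, `sample_doc`, `first_token`, and `phrase` are illustrative rather than part of the notebook.

```python
import spacy
from spacy.tokens import Span, Token

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because indexing and slicing do not need a trained model.
blank_nlp = spacy.blank("en")
sample_doc = blank_nlp(
    "This tutorial is about Natural Language Processing in spaCy."
)

first_token = sample_doc[0]   # integer index -> Token
phrase = sample_doc[4:7]      # slice -> Span (a view over the Doc)

print(len(sample_doc))
print(type(first_token).__name__, first_token.text)
print(type(phrase).__name__, phrase.text)
```

A `Span` keeps a reference to its parent `Doc`, so slicing does not copy the underlying tokens.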

In [7]:
import pathlib
file_name = "introduction.txt"

introduction_doc = nlp(
    pathlib.Path(file_name).read_text(encoding="utf-8")
)

print([token.text for token in introduction_doc])

['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing', 'in', 'Spacy', '.']


**Sentence Detection**

This is the process of locating where sentences start and end in a given text.

In `spaCy`, the `.sents` property of the `Doc` object is used to extract its sentences.

In [8]:
about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)

about_doc = nlp(about_text)
sentences = list(about_doc.sents)

In [9]:
type(about_doc.sents)

generator

In [10]:
len(sentences)

2

In [11]:
for sentence in sentences:
    print(f'Sentence : {sentence}')

Sentence : Gus Proto is a Python developer currently working for a London-based Fintech company.
Sentence : He is interested in learning Natural Language Processing.
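Each sentence yielded by `.sents` is a `Span`, so it also carries its position in the document. The following is a minimal sketch that swaps in spaCy's rule-based `sentencizer` on a blank English pipeline instead of the parser-based detection used by `en_core_web_sm`. The names `rule_nlp`, `sample_text`, and `sample_doc` are assumptions made for this illustration.

```python
import spacy

# Assumption: a blank pipeline plus the built-in rule-based "sentencizer"
# stands in for the model's parser-based sentence detection.
rule_nlp = spacy.blank("en")
rule_nlp.add_pipe("sentencizer")

sample_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
sample_doc = rule_nlp(sample_text)

for sent in sample_doc.sents:
    # start/end are token offsets; start_char/end_char are character offsets.
    print(sent.start, sent.end, sent.start_char, sent.end_char)
    print("   ", sent.text)
```

The character offsets can be used to slice the original string, which is handy when sentences need to be highlighted in the source text.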


You can also customize sentence detection behavior by using custom delimiters. Here’s an example where an ellipsis (...) is used as a delimiter, in addition to the full stop, or period (.):

In [12]:
ellipsis_text = (
     "Gus, can you, ... never mind, I forgot"
     " what I was saying. So, do you think"
     " we should ..."
 )

In [13]:
from spacy.language import Language

In [14]:
@Language.component('set_custom_boundaries')
def set_custom_boundaries(doc):
    """Add support for using `...` as a delimiter for sentence detection."""
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i + 1].is_sent_start = True
    return doc

In [15]:
custom_nlp = spacy.load('en_core_web_sm')
custom_nlp.add_pipe('set_custom_boundaries', before='parser')
custom_ellipsis_doc = custom_nlp(ellipsis_text)
custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)

for sentence in custom_ellipsis_sentences:
    print(sentence)

Gus, can you, ...
never mind, I forgot what I was saying.
So, do you think we should ...


In [16]:
len(custom_ellipsis_sentences)

3
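The same idea can be tried without a trained model. The sketch below is an assumption-laden variant, not the notebook's method: it registers a hypothetical component named `ellipsis_and_period_boundaries` on a blank English pipeline and marks every token explicitly, so the boundaries come only from the two delimiters and not from a parser.

```python
import spacy
from spacy.language import Language


@Language.component("ellipsis_and_period_boundaries")  # hypothetical name
def ellipsis_and_period_boundaries(doc):
    """Start a new sentence after every `...` or `.` token."""
    previous_was_delimiter = False
    for token in doc:
        # Every token gets an explicit True/False, so no parser is needed.
        token.is_sent_start = token.i == 0 or previous_was_delimiter
        previous_was_delimiter = token.text in {"...", "."}
    return doc


rule_nlp = spacy.blank("en")  # tokenizer only, no trained components
rule_nlp.add_pipe("ellipsis_and_period_boundaries")

rule_doc = rule_nlp(
    "Gus, can you, ... never mind, I forgot"
    " what I was saying. So, do you think"
    " we should ..."
)
rule_sentences = [sentence.text for sentence in rule_doc.sents]

for sentence in rule_sentences:
    print(sentence)
```

A purely rule-based splitter like this is predictable but naive: it would also split after abbreviations that the tokenizer separates from their period.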

Building the `Doc` container involves tokenizing the text. Tokenization breaks a text down into its basic units, or **tokens**, which are represented in spaCy as `Token` objects.

In [21]:
# The token’s original index position in the string is still available as an attribute on Token

about_doc = nlp(about_text)

for token in about_doc:
    print(token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142
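Because `token.idx` is a character offset into the original string, it can be combined with the token's length to slice the token back out of the input. A minimal sketch, assuming a blank English pipeline and a shortened sample sentence (`blank_nlp`, `text`, and `recovered` are illustrative names):

```python
import spacy

blank_nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
text = "Gus Proto is a Python developer."
doc = blank_nlp(text)

# len(token) is the number of characters in the token's text.
recovered = [text[token.idx : token.idx + len(token)] for token in doc]

for token, piece in zip(doc, recovered):
    print(f"{token.idx:>3}  {piece}")
```

This is what makes it possible to map any annotation spaCy produces back to an exact position in the source text.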


spaCy provides various other attributes for the `Token` class:

In [29]:
# Single quotes inside the f-strings keep this valid on Python versions before 3.12
print(
    f"{'Text with Whitespace':22}"
    f"{'Is Alphabetic?':18}"
    f"{'Is Punctuation?':18}"
    f"{'Is Stop Word?'}"
)
for token in about_doc:
    print(
        f"{str(token.text_with_ws):22}"
        f"{str(token.is_alpha):18}"
        f"{str(token.is_punct):18}"
        f"{str(token.is_stop)}"
    )

Text with Whitespace  Is Alphabetic?    Is Punctuation?   Is Stop Word?
Gus                   True              False             False
Proto                 True              False             False
is                    True              False             True
a                     True              False             True
Python                True              False             False
developer             True              False             False
currently             True              False             False
working               True              False             False
for                   True              False             True
a                     True              False             True
London                True              False             False
-                     False             True              False
based                 True              False             False
Fintech               True              False             False
company               True              False             False
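The `text_with_ws` attribute shown in the first column keeps each token's trailing whitespace, which means the original string can be rebuilt from the tokens. A short sketch, assuming a blank English pipeline and an illustrative sample string:

```python
import spacy

blank_nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
original = "Gus Proto is a Python developer for a London-based company."
doc = blank_nlp(original)

# Joining text_with_ws restores the input, hyphens and spacing included.
rebuilt = "".join(token.text_with_ws for token in doc)

print(rebuilt == original)
```

Joining on `token.text` with single spaces would not round-trip, because `London-based` is split into three tokens that have no spaces between them in the source.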

As with many aspects of spaCy, you can also customize the tokenization process to detect tokens on custom characters. This is often used for hyphenated words such as London-based.

To customize tokenization, you need to update the tokenizer property on the callable Language object with a new Tokenizer object.

To see what’s involved, imagine you had some text that used the @ symbol instead of the usual hyphen (-) as an infix to link words together. So, instead of London-based, you had London@based.

In [31]:
custom_about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London@based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing"
)

print([token.text for token in nlp(custom_about_text)[8:15]])

['for', 'a', 'London@based', 'Fintech', 'company', '.', 'He']


In this example, the default tokenizer read the London@based text as a single token, but if you used a hyphen instead of the @ symbol, then you’d get three tokens:

In [None]:
# Use a separate name so that custom_about_text (the "@" version) stays
# available for the custom tokenizer cell that follows
hyphenated_about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing"
)

print([token.text for token in nlp(hyphenated_about_text)[8:15]])

['for', 'a', 'London', '-', 'based', 'Fintech', 'company']


In [35]:
# To include the @ symbol as a custom infix, you need to build your own Tokenizer object
import re
from spacy.tokenizer import Tokenizer

custom_nlp = spacy.load("en_core_web_sm")
prefix_re = spacy.util.compile_prefix_regex(
    custom_nlp.Defaults.prefixes
)
suffix_re = spacy.util.compile_suffix_regex(
    custom_nlp.Defaults.suffixes
)

custom_infixes = [r"@"]

infix_re = spacy.util.compile_infix_regex(
    list(custom_nlp.Defaults.infixes) + custom_infixes
)

custom_nlp.tokenizer = Tokenizer(
    custom_nlp.vocab,  # use the vocab of the pipeline this tokenizer belongs to
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
    token_match=None,
)

custom_tokenizer_about_doc = custom_nlp(custom_about_text)

print([token.text for token in custom_tokenizer_about_doc[8:15]])

['for', 'a', 'London', '@', 'based', 'Fintech', 'company']
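An alternative to building a whole new `Tokenizer` is to replace only the infix pattern on the existing one, which leaves the tokenizer's other settings (such as its special-case rules) untouched. This is a sketch under assumptions, not the notebook's approach: it uses a blank English pipeline and a shortened sample phrase, and the names `infix_nlp`, `infixes`, and `infix_tokens` are illustrative.

```python
import spacy
from spacy.util import compile_infix_regex

infix_nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline

# Extend the language's default infix patterns with "@".
infixes = list(infix_nlp.Defaults.infixes) + [r"@"]

# Swap only the infix matcher on the existing tokenizer.
infix_nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

infix_tokens = [
    token.text for token in infix_nlp("a London@based Fintech company")
]
print(infix_tokens)
```

Changing a single attribute keeps the change small and easy to undo, whereas constructing a new `Tokenizer` gives full control at the cost of having to pass every setting explicitly.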


**Stop Words**

`Stop words` are typically defined as the most common words in a language. In English, some examples of stop words are `the`, `are`, `but`, and `they`. Most sentences need to contain stop words in order to be full sentences that make grammatical sense.

In NLP, stop words are often removed because they carry little meaning on their own and heavily distort any word frequency analysis. spaCy stores a list of stop words for the English language:

In [36]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

326

In [37]:
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

with
can
please
’d
take
’ll
noone
i
nobody
’re


In [38]:
# You can remove stop words from the input text by making use of the .is_stop attribute of each token
print([token.text for token in about_doc if not token.is_stop])

['Gus', 'Proto', 'Python', 'developer', 'currently', 'working', 'London', '-', 'based', 'Fintech', 'company', '.', 'interested', 'learning', 'Natural', 'Language', 'Processing', '.']


While you can’t be sure exactly what the sentence is trying to say without stop words, you still have a lot of information about what it’s generally about.
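The effect of stop words on word frequency analysis can be sketched with `collections.Counter`. The example assumes a blank English pipeline (the `is_stop` flag comes from the language data, so no trained model is needed) and a made-up sample sentence. `blank_nlp`, `sample`, `all_counts`, and `content_counts` are illustrative names.

```python
from collections import Counter

import spacy

blank_nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
sample = "The cat and the dog are friends, and the cat likes the dog."
doc = blank_nlp(sample)

# Count every alphabetic token, lowercased.
all_counts = Counter(token.lower_ for token in doc if token.is_alpha)

# Count only alphabetic tokens that are not stop words.
content_counts = Counter(
    token.lower_
    for token in doc
    if token.is_alpha and not token.is_stop
)

print(all_counts.most_common(3))
print(content_counts.most_common(3))
```

With stop words included, the most frequent word is `the`, which says nothing about the topic. Once they are filtered out, the counts are dominated by the words that carry the content.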