In [1]:
import spacy

# Tokenization

Tokenization is the process of breaking a text down into tokens. In English, tokens are typically words, numeric sequences, and punctuation marks, usually delimited by whitespace or punctuation.
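
To make the definition above concrete, here is a minimal sketch that contrasts plain whitespace splitting with a small regular-expression tokenizer. The function name `simple_tokenize` and the pattern are illustrative assumptions, not part of spaCy; spaCy's own tokenizer is rule-based and handles many more cases.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper for illustration only.
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

example = "Pizza is my favourite food."

# Whitespace splitting leaves the full stop glued to the last word.
print(example.split())

# The regex tokenizer separates the punctuation into its own token.
print(simple_tokenize(example))
```

This simple pattern splits `Don't` into `Don`, `'`, and `t`, whereas the spaCy output in this notebook gives `Do` and `n't`. Contractions are one reason to prefer a dedicated tokenizer.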

In [2]:
lmodel = spacy.load("en_core_web_sm")

In [3]:
sent = "Pizza is my favourite food. Don't underestimate the power of italian pizza."


In [4]:
tokens = lmodel(sent)

In [5]:
for i in tokens:
    print(i.pos_, i)

NOUN Pizza
AUX is
DET my
ADJ favourite
NOUN food
PUNCT .
AUX Do
PART n't
VERB underestimate
DET the
NOUN power
ADP of
ADJ italian
NOUN pizza
PUNCT .


# N-grams

N-grams are fixed-length sequences of consecutive tokens occurring in the text.


In [6]:
def get_ngram(text, n=2):
    # text is expected to be a string; lowercase it and tokenize it with spaCy
    tokens = [str(t) for t in lmodel(text.lower())]
    tok_len = len(tokens)
    return [tokens[i:i+n] for i in range(tok_len-n+1)]



Here are the bigrams (n=2) for the text in the variable `sent`:

In [7]:
get_ngram(sent)

[['pizza', 'is'],
 ['is', 'my'],
 ['my', 'favourite'],
 ['favourite', 'food'],
 ['food', '.'],
 ['.', 'do'],
 ['do', "n't"],
 ["n't", 'underestimate'],
 ['underestimate', 'the'],
 ['the', 'power'],
 ['power', 'of'],
 ['of', 'italian'],
 ['italian', 'pizza'],
 ['pizza', '.']]

Here are the trigrams (n=3) for the text in the variable `sent`:

In [8]:
get_ngram(sent, 3)

[['pizza', 'is', 'my'],
 ['is', 'my', 'favourite'],
 ['my', 'favourite', 'food'],
 ['favourite', 'food', '.'],
 ['food', '.', 'do'],
 ['.', 'do', "n't"],
 ['do', "n't", 'underestimate'],
 ["n't", 'underestimate', 'the'],
 ['underestimate', 'the', 'power'],
 ['the', 'power', 'of'],
 ['power', 'of', 'italian'],
 ['of', 'italian', 'pizza'],
 ['italian', 'pizza', '.']]
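
N-grams are often counted once they have been extracted. The sketch below is one possible variant, assuming the tokens are already available as a list of strings. It builds n-grams as tuples (which are hashable, unlike lists) and counts them with `collections.Counter`. The helper name `ngrams_from_tokens` and the sample token list are made up for illustration.

```python
from collections import Counter

def ngrams_from_tokens(tokens, n=2):
    # Hypothetical helper: zip n staggered views of the token list.
    # Each n-gram is a tuple, so it can be used as a dictionary key.
    return list(zip(*(tokens[i:] for i in range(n))))

# Illustrative, already-tokenized input (not produced by spaCy here).
tokens = ["pizza", "is", "my", "favourite", "food", ".",
          "pizza", "is", "great", "."]

bigram_counts = Counter(ngrams_from_tokens(tokens, 2))

# The most frequent bigram and its count.
print(bigram_counts.most_common(1))
```

If the token list is shorter than `n`, the function returns an empty list, as the slice-based version above does.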

# Lemma

A lemma is the base (dictionary) form of a word. For example, the word `fly` also appears in inflected forms such as `flew`, `flown`, and `flying`. All of these forms denote the same action. Lemmas are sometimes used in place of the original words in order to reduce the vocabulary size (or improve inference time).
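
The vocabulary-reduction effect can be illustrated with a toy lookup table. This is only a sketch: `LEMMA_TABLE` and `toy_lemma` are hypothetical names with hand-written entries, and real lemmatizers such as spaCy's rely on much larger rules and dictionaries.

```python
# Hypothetical, hand-written lookup table covering only the forms of "fly".
LEMMA_TABLE = {"flew": "fly", "flown": "fly", "flying": "fly", "flies": "fly"}

def toy_lemma(word):
    # Lowercase the word, then fall back to the word itself if it is unknown.
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

words = ["fly", "flew", "flown", "flying"]
lemmas = [toy_lemma(w) for w in words]

# Four distinct surface forms collapse into a single vocabulary entry.
print(len(set(words)), "distinct forms,", len(set(lemmas)), "distinct lemma")
```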


In [9]:
for i in lmodel(sent):
    print(str(i), i.lemma_)

Pizza pizza
is be
my -PRON-
favourite favourite
food food
. .
Do do
n't not
underestimate underestimate
the the
power power
of of
italian italian
pizza pizza
. .


# Named entity recognition

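spaCy can also be used to perform named entity recognition (NER), which is useful for extracting information from textual data. Note that the example here extracts noun chunks (base noun phrases) through `doc.noun_chunks` rather than named entities; the entities recognized by the model are available through `doc.ents`.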

In [10]:
sentence = "The black witch flew over the amazon river, using her broom."

In [11]:
doc = lmodel(sentence)

In [12]:
for i in doc.noun_chunks:
    print(str(i))

The black witch
the amazon river
her broom
