# Preprocessing

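To use spaCy, we first create a spaCy NLP object for the language specified in the configuration file. The 'en' shortcut used here works in spaCy 2.x; spaCy 3 and later require the full package name, for example 'en_core_web_sm'.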

In [1]:
import spacy
nlp = spacy.load('en')

In [2]:
nlp

<spacy.lang.en.English at 0x7f019efd0fd0>

# spaCy Tokenizer

This step takes each training sample from your training file and splits it into a list of tokens (words). At the end of this step, each sample is represented as a sequence of tokens, loosely a bag of words.

In [3]:
tokens = nlp("Suggest me a chinese food")

Note that calling nlp on a text runs the entire spaCy pipeline by default, which includes part-of-speech tagging, dependency parsing, and named entity recognition, not just tokenization.

In [4]:
tokens

Suggest me a chinese food
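
Printing the result shows the original sentence because nlp returns a Doc object, which displays as its text; iterating over it yields the individual tokens. When only the tokens are needed, the rest of the pipeline can be skipped. The following is a minimal sketch, assuming a blank English pipeline created with spacy.blank (a stand-in for the model loaded earlier, with no tagger, parser, entity recognizer, or word vectors); the names blank_nlp and words are illustrative only.

```python
import spacy

# Assumption: a blank English pipeline. It provides the tokenizer only,
# so no trained model package has to be installed for this sketch.
blank_nlp = spacy.blank("en")

# make_doc applies just the tokenizer and returns a Doc.
doc = blank_nlp.make_doc("Suggest me a chinese food")

# Iterating over a Doc yields Token objects; .text is the token string.
words = [token.text for token in doc]
print(words)  # ['Suggest', 'me', 'a', 'chinese', 'food']
```

A blank pipeline is enough to inspect tokenization, but it cannot supply the word vectors used in the featurizer step, which still needs a trained model.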

# spaCy Featurizer

Now that we have the tokens, the next step is to feed them into an ML algorithm. However, ML algorithms work only with numerical data, so it is the featurizer’s job to convert tokens into word vectors. At the end of this step, we have lists of numbers that are meaningful mainly to ML models. Each spaCy token has a vector attribute, which makes this conversion easy.

In [5]:
features = [token.vector for token in tokens]
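
The list comprehension above yields one vector per token, so the number of vectors varies with sentence length. Many classifiers expect a single fixed-length vector per sample instead. One common way to get there is to average the token vectors, which is also what spaCy's Doc.vector does by default. The following is a minimal sketch with NumPy, using short made-up vectors as hypothetical stand-ins for real token.vector values; the names token_vectors, matrix, and sentence_vector are illustrative only.

```python
import numpy as np

# Hypothetical stand-ins for token.vector values; real vectors are
# much longer, but the pooling step is the same.
token_vectors = [
    np.array([1.0, 2.0, 3.0]),
    np.array([3.0, 2.0, 1.0]),
    np.array([2.0, 2.0, 2.0]),
]

# Stack into a (num_tokens, dim) matrix, then average over the token axis
# to obtain one fixed-length vector for the whole sentence.
matrix = np.vstack(token_vectors)
sentence_vector = matrix.mean(axis=0)

print(matrix.shape)      # (3, 3)
print(sentence_vector)   # [2. 2. 2.]
```

Averaging discards word order, which is often acceptable for short intent-style sentences; whether it suits a given model is a design choice rather than a requirement.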

In [6]:
features

[array([ 5.0428004 ,  0.5848423 ,  0.84865355, -1.4685935 ,  0.03154445,
        -1.355509  ,  1.1335483 ,  2.097792  , -2.648609  ,  0.5992427 ,
         0.5250029 , -6.0911155 , -0.27508432, -0.61268306,  0.82375413,
        -6.6278424 ,  1.0685546 ,  2.9171565 , -4.5007124 , -2.8751345 ,
        -1.9237388 , -3.100243  ,  3.8483026 ,  1.637509  , -0.7808366 ,
        -2.690024  , -0.5483384 ,  0.5666906 , -2.7214477 ,  4.9644704 ,
         2.1093059 ,  6.366849  , -1.3273665 ,  2.1760817 , -2.0642526 ,
        -0.78111356, -0.70352423,  1.937985  ,  2.3383934 ,  1.4124227 ,
        -3.2648463 , -1.9244783 , -3.5789187 ,  0.59029424,  0.27700362,
        -0.02363824, -4.1589155 ,  1.1164593 ,  0.22579789,  2.1304052 ,
        -4.1874447 , -0.45018405, -1.3243811 , -0.68259364, -0.5898883 ,
         2.8102374 ,  4.1099195 , -1.6714737 , -3.5124679 ,  2.0984204 ,
         2.8390021 ,  4.237991  ,  2.6879635 , -0.37433296,  2.366744  ,
        -0.1946837 ,  2.1761866 ,  4.3991795 ,  1.0