# SpaCy Tutorial
#### 02 Intermediate spaCy

## Load model

In [1]:
import spacy
import pandas as pd

In [2]:
# Create an nlp object
nlp = spacy.load("en_core_web_md")

## Word Vectors  
SpaCy models other than the small one ship with word vectors trained with GloVe, an algorithm similar to Word2Vec. SpaCy also lets you load other pretrained word vectors or your own custom vectors, such as ones trained with Gensim, FastText, or TensorFlow.  

The examples below show how to compute the similarity between words and sentences, but you can also feed the vector representation as a feature into downstream tasks, such as a text classifier in your favorite ML framework.

In [7]:
# Print the word vector for "cat" and its dimension size
word1 = nlp("cat")
print(f"{word1.vector} Dimension: {len(word1.vector)}")

[-0.15067   -0.024468  -0.23368   -0.23378   -0.18382    0.32711
 -0.22084   -0.28777    0.12759    1.1656    -0.64163   -0.098455
 -0.62397    0.010431  -0.25653    0.31799    0.037779   1.1904
 -0.17714   -0.2595    -0.31461    0.038825  -0.15713   -0.13484
  0.36936   -0.30562   -0.40619   -0.38965    0.3686     0.013963
 -0.6895     0.004066  -0.1367     0.32564    0.24688   -0.14011
  0.53889   -0.80441   -0.1777    -0.12922    0.16303    0.14917
 -0.068429  -0.33922    0.18495   -0.082544  -0.46892    0.39581
 -0.13742   -0.35132    0.22223   -0.144     -0.048287   0.3379
 -0.31916    0.20526    0.098624  -0.23877    0.045338   0.43941
  0.030385  -0.013821  -0.093273  -0.18178    0.19438   -0.3782
  0.70144    0.16236    0.0059111  0.024898  -0.13613   -0.11425
 -0.31598   -0.14209    0.028194   0.5419    -0.42413   -0.599
  0.24976   -0.27003    0.14964    0.29287   -0.31281    0.16543
 -0.21045   -0.4408     1.2174     0.51236    0.56209    0.14131
  0.092514   0.71396   -0.02

In [8]:
# Cosine similarity of cat and dog
word2 = nlp("dog")
print(word1.similarity(word2))

0.8016855517329495


In [9]:
# Cosine similarity of two documents; each document vector is the average of its word vectors
doc1 = nlp("I like sushi.")
doc2 = nlp("My favorite food is ramen.")
print(doc1.similarity(doc2))

0.8223489933262161


In [10]:
# An unrelated sentence pair returns a low value
doc1 = nlp("I like sushi.")
doc2 = nlp("Jupyter notebook installation guide")
print(doc1.similarity(doc2))

0.31461387624735404
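
To make the two ideas above concrete (cosine similarity between vectors, and a document vector computed as the average of its word vectors), here is a minimal pure-Python sketch. The three-dimensional vectors and the helper names `cosine_similarity` and `mean_vector` are made up for illustration; they are not spaCy's real vectors or API, and returning 0.0 for an all-zero vector is simply a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical 3-dimensional "word vectors", invented for this example only
toy_vectors = {
    "cat": [1.0, 0.0, 1.0],
    "dog": [1.0, 1.0, 0.0],
}

# Word-level similarity
word_sim = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])

# Document-level vector: the average of the word vectors it contains
doc_vector = mean_vector([toy_vectors["cat"], toy_vectors["dog"]])
doc_sim = cosine_similarity(doc_vector, toy_vectors["cat"])

print(word_sim, doc_vector, doc_sim)
```

Because the score depends only on the angle between vectors, scaling a vector by a positive constant does not change its similarity to other vectors.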


## Working with a Big Dataset  
This Twitter data was downloaded from [Kaggle](https://www.kaggle.com/c/twitter-sentiment-analysis2).

In [39]:
df = pd.read_csv('data/train.csv.zip', sep=',', compression='zip', encoding='latin_1')
df = df.sample(1000, random_state=1111)
df.shape

(1000, 3)

In [40]:
df.head(10)

| | ItemID | Sentiment | SentimentText |
|---|---|---|---|
| 1266 | 1267 | 1 | Awesome. &lt;3 TEDDY! &lt;3 |
| 39898 | 39910 | 1 | @Andre_R its been my mission to find other sou... |
| 88879 | 88891 | 1 | @Azura999 Told you |
| 3688 | 3689 | 1 | thanks for the birthday DMs |
| 68437 | 68449 | 0 | @bryandl i know! you should come visit again!! |
| 28556 | 28568 | 0 | @airlanggatwerp bagi link nya dong nce huhu |
| 48974 | 48986 | 1 | @AriaParadiso @ChelseaParadiso nighty night u ... |
| 68698 | 68710 | 0 | @BSBSavedMyLife it won't play |
| 90592 | 90604 | 0 | @chiniehdiaz Im out of it..havent had any in d... |
| 55359 | 55371 | 1 | @barrymoltz Ahhh, yes-- shiny object syndrome.... |


### Typical way (slow)
Calling `nlp` on one text at a time inside `DataFrame.apply` (single-threaded).

In [41]:
def tokenize(text: str) -> list:
    """Run the full spaCy pipeline on one text and return its token strings."""
    doc = nlp(text)
    return [token.text for token in doc]

In [42]:
%%timeit
df['token_list1'] = df.apply(lambda x: tokenize(x.SentimentText), axis=1)

7.89 s ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [43]:
df.head(10)

| | ItemID | Sentiment | SentimentText | token_list |
|---|---|---|---|---|
| 1266 | 1267 | 1 | Awesome. &lt;3 TEDDY! &lt;3 | [ , Awesome, ., &, lt;3, TEDDY, !, &, lt;3] |
| 39898 | 39910 | 1 | @Andre_R its been my mission to find other sou... | [@Andre_R, its, been, my, mission, to, find, o... |
| 88879 | 88891 | 1 | @Azura999 Told you | [@Azura999, Told, you] |
| 3688 | 3689 | 1 | thanks for the birthday DMs | [ , thanks, for, the, birthday, DMs] |
| 68437 | 68449 | 0 | @bryandl i know! you should come visit again!! | [@bryandl, i, know, !, , you, should, come, v... |
| 28556 | 28568 | 0 | @airlanggatwerp bagi link nya dong nce huhu | [@airlanggatwerp, bagi, link, nya, dong, nce, ... |
| 48974 | 48986 | 1 | @AriaParadiso @ChelseaParadiso nighty night u ... | [@AriaParadiso, @ChelseaParadiso, nighty, nigh... |
| 68698 | 68710 | 0 | @BSBSavedMyLife it won't play | [@BSBSavedMyLife, it, wo, n't, play] |
| 90592 | 90604 | 0 | @chiniehdiaz Im out of it..havent had any in d... | [@chiniehdiaz, I, m, out, of, it, .., havent, ... |
| 55359 | 55371 | 1 | @barrymoltz Ahhh, yes-- shiny object syndrome.... | [@barrymoltz, Ahhh, ,, yes--, shiny, object, s... |


### SpaCy way (fast)  
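`nlp.pipe` treats the texts as a stream, processes them in batches, and yields `Doc` objects, which is considerably faster than calling `nlp` on each text separately. Note that in recent spaCy 2.x releases the `n_threads` argument used here is deprecated and has no effect; multiprocessing is requested with the `n_process` argument instead.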

In [44]:
%%timeit
token_list = []

for doc in nlp.pipe(df.SentimentText.astype('unicode').values, batch_size=100, n_threads=40):
    word_list = []
    for token in doc:
        word_list.append(token.text)
        
    token_list.append(word_list)

df['token_list2'] = token_list

2.1 s ± 86.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
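
The batching idea behind a streaming API such as `nlp.pipe` can be sketched in plain Python. This is only a rough illustration of the pattern, not spaCy's implementation: `minibatch`, `pipe`, and `split_batch` are helper names written for this example, and whitespace splitting stands in for a real model.

```python
from itertools import islice


def minibatch(items, size):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def pipe(texts, process_batch, batch_size=100):
    """Lazily yield one result per text, processing the texts batch by batch."""
    for batch in minibatch(texts, batch_size):
        # A real model amortizes per-call overhead across the whole batch here
        yield from process_batch(batch)


def split_batch(batch):
    """Hypothetical stand-in for a model: split each text on whitespace."""
    return [text.split() for text in batch]


texts = ["thanks for the birthday DMs", "Told you", "it won't play"]
results = list(pipe(texts, split_batch, batch_size=2))
print(results)
```

Because `pipe` is a generator, results are produced as each batch finishes, so the input can be an arbitrarily long stream rather than a list held in memory.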


### SpaCy way (faster)  
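When a model is loaded with `spacy.load`, the resulting `nlp` object tokenizes each text and then runs it through a pipeline of components (here a tagger, a dependency parser, and a named-entity recognizer). Disabling the components you do not need makes spaCy even faster. Pipelines can also be customized.    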

![nlp_pipeline](img/nlp_pipeline.png)  

In [51]:
# Print pipeline components
for p in nlp.pipeline:
    print(p)

('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fb268b9fda0>)
('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fb268ab3ca8>)
('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fb268ab3d08>)
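
As a mental model for what temporarily disabling components means, here is a toy pipeline in plain Python. It is a simplified sketch and not how spaCy is implemented: the `ToyPipeline` class, its `disable` method, and `make_step` are hypothetical names written for this example, and each "component" just records that it ran.

```python
from contextlib import contextmanager


class ToyPipeline:
    """An ordered list of (name, function) pairs applied one after another."""

    def __init__(self, components):
        self.components = list(components)

    def __call__(self, doc):
        for _name, func in self.components:
            doc = func(doc)
        return doc

    @contextmanager
    def disable(self, *names):
        """Skip the named components inside a `with` block, then restore them."""
        original = self.components
        self.components = [(n, f) for n, f in original if n not in names]
        try:
            yield self
        finally:
            self.components = original


def make_step(name):
    """Build a dummy component that appends its own name to the document."""
    def step(doc):
        return doc + [name]
    return step


toy = ToyPipeline([(n, make_step(n)) for n in ("tagger", "parser", "ner")])

full_run = toy([])                     # every component runs
with toy.disable("tagger", "parser"):
    partial_run = toy([])              # only "ner" runs inside the block
restored_run = toy([])                 # all components are back afterwards

print(full_run, partial_run, restored_run)
```

The less work each document passes through, the faster the loop, which is why skipping unused components pays off when only tokens are needed.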


In [45]:
%%timeit
token_list = []

# Disable the POS tagger, dependency parser, and NER, since all we need is the tokenizer.
# Alternatively, nlp.make_doc skips every pipeline component if you only need tokenization.
with nlp.disable_pipes('tagger', 'parser', 'ner'):
    for doc in nlp.pipe(df.SentimentText.astype('unicode').values, batch_size=100, n_threads=40):
        word_list = []
        for token in doc:
            word_list.append(token.text)

        token_list.append(word_list)

df['token_list3'] = token_list

139 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
