# Need for Pre-processing:

- Choosing the right kind of preprocessing for the text can reduce inconsistent results from NLP applications.
- Preprocessing that suits one task may not suit another, so the choice is task dependent.
- Say you are trying to discover commonly used words in a news dataset. If your preprocessing removes stop words only because some other task did, you will miss some of the most common words, because you have ALREADY eliminated them. So really, it's not a one-size-fits-all approach.
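To make the stop-word point concrete, here is a minimal sketch. The sentence and the tiny stop-word set are made up for illustration; they are not taken from the dataset or from NLTK.

```python
from collections import Counter

# Hypothetical one-line "corpus" and a tiny hand-written stop-word set.
text = "the bank said the results of the rights issue are in line with the plan"
stop_words = {"the", "of", "are", "in", "with"}

tokens = text.split()

# Word frequencies before and after stop-word removal.
all_counts = Counter(tokens)
filtered_counts = Counter(t for t in tokens if t not in stop_words)

print(all_counts.most_common(1))       # the most frequent word overall
print(filtered_counts.most_common(1))  # the most frequent word once stop words are gone
```

If the task is "find the most common words", the first count is the honest answer; the second silently drops it.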



# Dataset:

- A dataset of what corporations actually talk about on social media. Each statement is classified as information (objective statements about the company or its activities), dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.).
- Our interest is in the text column of the dataset, so that is where we apply preprocessing.

# Types of text preprocessing techniques

- There are different ways to preprocess your text. Here are some of the approaches you should know about, along with why each one matters.

In [1]:
# Import necessary libraries.
import re, string, unicodedata
import pandas as pd
import numpy as np
import nltk                                   # Natural language processing tool-kit
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

!pip install contractions
import contractions


from bs4 import BeautifulSoup                 # Beautiful soup is a parsing library that can use different parsers.
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet    # Stopwords, and wordnet corpus
from nltk.stem import LancasterStemmer, WordNetLemmatizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Load dataset.
dataset = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/corporate_messaging_dfe.csv')

In [5]:
# Check first 5 rows of data.
dataset.head()

Unnamed: 0,unit_id,golden,unit_state,trusted_judgments,last_judgment_at,category,category_confidence,category_gold,id,screenname,text
0,662822308,False,finalized,3,2015-02-18T04:31:00,Information,1.0,,436528000000000000,Barclays,Barclays CEO stresses the importance of regula...
1,662822309,False,finalized,3,2015-02-18T13:55:00,Information,1.0,,386013000000000000,Barclays,Barclays announces result of Rights Issue http...
2,662822310,False,finalized,3,2015-02-18T08:43:00,Information,1.0,,379580000000000000,Barclays,Barclays publishes its prospectus for its �5.8...
3,662822311,False,finalized,3,2015-02-18T09:13:00,Information,1.0,,367530000000000000,Barclays,Barclays Group Finance Director Chris Lucas is...
4,662822312,False,finalized,3,2015-02-18T06:48:00,Information,1.0,,360385000000000000,Barclays,Barclays announces that Irene McDermott Brown ...


In [10]:
# We only work with the text data here, so we keep the identifier columns and the text column in a new dataframe: data
data = dataset.drop(['golden', 'unit_state', 'trusted_judgments', 'last_judgment_at', 'category', 'category_confidence', 'category_gold', 'screenname'], axis=1)

In [11]:
df = data.copy()

## George's Challenge
- A better way to save some typing?
- Difference between `[]` and `[[]]`
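One possible answer to the challenge, as a sketch: instead of listing every column to drop, select the columns to keep. The toy frame below is made up for illustration and only mimics the shape of the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "unit_id": [1, 2],
    "category": ["Information", "Action"],
    "text": ["first message", "second message"],
})

# Selecting the columns to keep is shorter than dropping all the others.
kept = toy[["unit_id", "text"]]

# Single brackets with one label return a Series (one column, 1-D).
as_series = toy["text"]

# Double brackets pass a list of labels and return a DataFrame (2-D),
# even when the list holds a single column.
as_frame = toy[["text"]]

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The same distinction explains `data.loc[[0]]` further down: the list inside `.loc` returns a one-row DataFrame rather than a Series.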

In [12]:
# Check first 5 rows of dataframe.
data.head()

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,Barclays CEO stresses the importance of regula...
1,662822309,386013000000000000,Barclays announces result of Rights Issue http...
2,662822310,379580000000000000,Barclays publishes its prospectus for its �5.8...
3,662822311,367530000000000000,Barclays Group Finance Director Chris Lucas is...
4,662822312,360385000000000000,Barclays announces that Irene McDermott Brown ...


In [13]:
# First row of data.
pd.set_option('display.max_colwidth', None) # Show the full text of each cell instead of truncating it.
data.loc[[0]]

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference http://t.co/Ge9Lp7hpyG


In [14]:
# Removal of the http link using Regular Expression.
for i, row in data.iterrows():
    clean_text = re.sub(r"http\S+", "", data.at[i, 'text'])
    data.at[i,'text'] = clean_text
data.head()

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference
1,662822309,386013000000000000,Barclays announces result of Rights Issue
2,662822310,379580000000000000,Barclays publishes its prospectus for its �5.8bn Rights Issue:
3,662822311,367530000000000000,Barclays Group Finance Director Chris Lucas is to step down at the end of the week due to ill health
4,662822312,360385000000000000,Barclays announces that Irene McDermott Brown has been appointed as Group Human Resources Director


## George's Tip
Don't loop through dataframes; use `.apply()` instead or, better yet, vectorized operations
- https://realpython.com/fast-flexible-pandas/#dont-forget-numpy

Let's time it!

In [15]:
df['tmp'] = np.nan
df.head()

Unnamed: 0,unit_id,id,text,tmp
0,662822308,436528000000000000,Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference http://t.co/Ge9Lp7hpyG,
1,662822309,386013000000000000,Barclays announces result of Rights Issue http://t.co/LbIqqh3wwG,
2,662822310,379580000000000000,Barclays publishes its prospectus for its �5.8bn Rights Issue: http://t.co/YZk24iE8G6,
3,662822311,367530000000000000,Barclays Group Finance Director Chris Lucas is to step down at the end of the week due to ill health http://t.co/nkuHoAfnSD,
4,662822312,360385000000000000,Barclays announces that Irene McDermott Brown has been appointed as Group Human Resources Director http://t.co/c3fNGY6NMT,


In [16]:
%%time
for i, row in df.iterrows():
    # Read from df (not data) so the loop and the .apply() version process the same text.
    df.loc[i, 'tmp'] = re.sub(r"http\S+", "", df.at[i, 'text'])

CPU times: user 1.73 s, sys: 1.14 ms, total: 1.73 s
Wall time: 1.71 s


In [17]:
df['tmp'] = np.nan
df.head()

Unnamed: 0,unit_id,id,text,tmp
0,662822308,436528000000000000,Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference http://t.co/Ge9Lp7hpyG,
1,662822309,386013000000000000,Barclays announces result of Rights Issue http://t.co/LbIqqh3wwG,
2,662822310,379580000000000000,Barclays publishes its prospectus for its �5.8bn Rights Issue: http://t.co/YZk24iE8G6,
3,662822311,367530000000000000,Barclays Group Finance Director Chris Lucas is to step down at the end of the week due to ill health http://t.co/nkuHoAfnSD,
4,662822312,360385000000000000,Barclays announces that Irene McDermott Brown has been appointed as Group Human Resources Director http://t.co/c3fNGY6NMT,


In [18]:
%%time
df['tmp'] = df['text'].apply(lambda x: re.sub(r"http\S+", "", x))

CPU times: user 7.48 ms, sys: 0 ns, total: 7.48 ms
Wall time: 7.33 ms


In [19]:
1.71 * 1000 / 7.33

233.28785811732607
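The tip also mentions vectorized operations. For string columns, pandas offers the `.str` accessor; here is a minimal sketch on a made-up two-row frame (the column name `clean` is just for illustration).

```python
import re
import pandas as pd

# Hypothetical toy frame; the real notebook uses the corporate messaging data.
toy = pd.DataFrame({"text": [
    "Results announced http://t.co/abc123",
    "No link here",
]})

# Series.str.replace applies the regex to the whole column in one call.
toy["clean"] = toy["text"].str.replace(r"http\S+", "", regex=True)

# Same result as the .apply() version used above.
via_apply = toy["text"].apply(lambda x: re.sub(r"http\S+", "", x))

print(toy["clean"].tolist())
```

For text, the regex still runs once per element under the hood, so the gain over `.apply()` is often modest; the big win measured above comes from avoiding the row-by-row `iterrows()` loop.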

# Cleaning of the text

In [20]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

# Perform the above operation over all the rows of text column of the dataframe.
for i, row in data.iterrows():
    text = data.at[i, 'text']
    clean_text = replace_contractions(text)
    data.at[i,'text'] = clean_text
data.head()

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference
1,662822309,386013000000000000,Barclays announces result of Rights Issue
2,662822310,379580000000000000,Barclays publishes its prospectus for its �5.8bn Rights Issue:
3,662822311,367530000000000000,Barclays Group Finance Director Chris Lucas is to step down at the end of the week due to ill health
4,662822312,360385000000000000,Barclays announces that Irene McDermott Brown has been appointed as Group Human Resources Director


In [21]:
# Tokenize the words of whole dataframe.
for i, row in data.iterrows():
    text = data.at[i, 'text']
    words = nltk.word_tokenize(text)
    data.at[i,'text'] = words
data.head()

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,"[Barclays, CEO, stresses, the, importance, of, regulatory, and, cultural, reform, in, financial, services, at, Brussels, conference]"
1,662822309,386013000000000000,"[Barclays, announces, result, of, Rights, Issue]"
2,662822310,379580000000000000,"[Barclays, publishes, its, prospectus, for, its, �5.8bn, Rights, Issue, :]"
3,662822311,367530000000000000,"[Barclays, Group, Finance, Director, Chris, Lucas, is, to, step, down, at, the, end, of, the, week, due, to, ill, health]"
4,662822312,360385000000000000,"[Barclays, announces, that, Irene, McDermott, Brown, has, been, appointed, as, Group, Human, Resources, Director]"


In [22]:
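# Save the stopwords in a list named stopwords.
# Note: this rebinds the name `stopwords` from the imported corpus reader to a plain list,
# so the list is fetched via nltk.corpus to keep the cell safe to re-run.
stopwords = nltk.corpus.stopwords.words('english')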
stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [23]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)        # Append processed words to new list.
    return new_words

## George's Tip
Learn more about the functions
- https://docs.python.org/2/library/unicodedata.html
- https://kite.com/python/docs/unicodedata.normalize
- https://docs.python.org/3/howto/unicode.html
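A quick way to see what the `normalize` / `encode` / `decode` chain does is to run it on a few words. This is a sketch with made-up inputs; `strip_non_ascii` is a hypothetical single-word version of `remove_non_ascii`.

```python
import unicodedata

def strip_non_ascii(word):
    """Decompose accented characters, then drop anything outside ASCII."""
    # NFKD splits a character such as 'é' into 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", word)
    # Encoding to ASCII with errors='ignore' silently drops what cannot be encoded.
    return decomposed.encode("ascii", "ignore").decode("ascii")

for word in ["café", "naïve", "£5.8bn"]:
    print(word, "->", strip_non_ascii(word))
```

Accented letters keep their base letter, while symbols with no ASCII decomposition (such as a currency sign) disappear entirely.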

In [24]:
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = word.lower()           # Converting to lowercase
        new_words.append(new_word)        # Append processed words to new list.
    return new_words

# Lowercasing

- Lowercasing ALL your text data, although commonly overlooked, is one of the simplest and most effective forms of text preprocessing. It is applicable to most text mining and NLP problems, helps most when your dataset is not very large, and significantly improves the consistency of the expected output.

- An example where lowercasing is very useful is search. Imagine you are looking for documents containing “usa”, but no results show up because “usa” was indexed as “USA”.

- An example where lowercasing may result in inaccuracy is in predicting the programming language of a source code file. The word `System` in Java is quite different from `system` in Python. Lowercasing the two makes them identical, causing the classifier to lose important predictive features. While lowercasing is generally helpful, it may not be applicable for all tasks.
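The search example can be sketched in a few lines. The two documents and the `build_index` helper are made up for illustration; a real search engine is far more involved.

```python
# Hypothetical mini-corpus: document id -> text.
documents = {1: "Trade talks in the USA", 2: "usa exports rise"}

def build_index(docs, lowercase):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            key = token.lower() if lowercase else token
            index.setdefault(key, set()).add(doc_id)
    return index

case_sensitive = build_index(documents, lowercase=False)
case_folded = build_index(documents, lowercase=True)

print(case_sensitive.get("usa", set()))  # only the document that spelled it "usa"
print(case_folded.get("usa", set()))     # both documents
```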

In [26]:
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)    # Append processed words to new list.
    return new_words

## George's Tip
Not sure what a regex pattern means? Use a regex tester, e.g. https://regex101.com/
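You can also poke at a pattern directly in Python. Here is a small sketch that runs the pattern from `remove_punctuation` on a few made-up tokens.

```python
import re

# Matches any single character that is neither a word character
# (letter, digit, underscore) nor whitespace.
pattern = r"[^\w\s]"

for token in ["Issue", ":", "5.8bn", "can't", "t.co"]:
    cleaned = re.sub(pattern, "", token)
    print(repr(token), "->", repr(cleaned))
```

Note that a token made only of punctuation becomes an empty string, which is why `remove_punctuation` skips empty results, and that the decimal point inside a number is removed too.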

# Stopword Removal:
- Stop words are a set of commonly used words in a language.
- Examples of stop words in English are “a”, “the”, “is”, and “are”. The intuition behind removing stop words is that, by removing low-information words from the text, we can focus on the important words instead.

In [27]:
def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        if word not in stopwords:
            new_words.append(word)        # Append processed words to new list.
    return new_words
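Checking `word not in stopwords` against a list scans the list for every token. A set makes each lookup constant time on average. Here is a minimal sketch, assuming a small hand-written stop-word set (`toy_stopwords` and `remove_stopwords_fast` are hypothetical names; the notebook itself uses NLTK's English list).

```python
# Small hand-written stop-word set, for illustration only.
toy_stopwords = {"the", "of", "and", "in", "at", "is", "to"}

def remove_stopwords_fast(words, stop_set=toy_stopwords):
    """Keep only the tokens that are not in stop_set."""
    return [word for word in words if word not in stop_set]

tokens = ["barclays", "ceo", "stresses", "the", "importance", "of", "reform"]
print(remove_stopwords_fast(tokens))
```

With the NLTK list, the same idea would be to wrap it once in `set(...)` before filtering.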

# Stemming:

- Stemming is the process of reducing inflected words (e.g. running, runs) to their root form (e.g. run). The “root” in this case may not be a real root word, but just a canonical form of the original word.

In [28]:
def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []                            # Create empty list to store pre-processed words.
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)                # Append processed words to new list.
    return stems
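Different stemmers cut more or less aggressively. The sketch below puts the Lancaster stemmer used in this notebook next to NLTK's Porter stemmer, which is not used here and is shown only for comparison. Neither needs any extra NLTK data download.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

lancaster = LancasterStemmer()
porter = PorterStemmer()   # not used elsewhere in this notebook

# Print each word with its Lancaster stem and its Porter stem side by side.
for word in ["running", "runs", "financial", "regulatory"]:
    print(word, "->", lancaster.stem(word), "|", porter.stem(word))
```

Lancaster tends to produce shorter stems, which are not always readable words.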

# Lemmatization:

- Lemmatization on the surface is very similar to stemming: the goal is to remove inflections and map a word to its root form.
- The difference is that lemmatization tries to do it the proper way.
- It doesn't just chop things off; it maps words to their actual dictionary form (the lemma). For example, the word “better” would map to “good”, provided the lemmatizer is told it is an adjective. The `lemmatize_verbs` function below passes `pos='v'`, so only verb forms are reduced.

In [29]:
def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []                           # Create empty list to store pre-processed words.
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)              # Append processed words to new list.
    return lemmas

### Now it's time to execute the above functions:

### So we define a new function normalize, which processes all the steps together.

In [30]:
def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    return words

In [31]:
# Apply the normalize function to every row of the data.
for i, row in data.iterrows():
    words = data.at[i, 'text']
    words = normalize(words)
    data.at[i,'text'] = words
data.head()

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,"[barclays, ceo, stresses, importance, regulatory, cultural, reform, financial, services, brussels, conference]"
1,662822309,386013000000000000,"[barclays, announces, result, rights, issue]"
2,662822310,379580000000000000,"[barclays, publishes, prospectus, 58bn, rights, issue]"
3,662822311,367530000000000000,"[barclays, group, finance, director, chris, lucas, step, end, week, due, ill, health]"
4,662822312,360385000000000000,"[barclays, announces, irene, mcdermott, brown, appointed, group, human, resources, director]"


In [32]:
def stem_and_lemmatize(words):
    stems = stem_words(words)
    lemmas = lemmatize_verbs(words)
    return stems, lemmas

In [34]:
data['lemma'] = ''
data['stem'] = ''

for i, row in data.iterrows():
    words = data.at[i, 'text']
    stems, lemmas = stem_and_lemmatize(words)
    data.at[i,'stem'] = stems
    data.at[i, 'lemma'] = lemmas
data.head()

Unnamed: 0,unit_id,id,text,lemma,stem
0,662822308,436528000000000000,"[barclays, ceo, stresses, importance, regulatory, cultural, reform, financial, services, brussels, conference]","[barclays, ceo, stress, importance, regulatory, cultural, reform, financial, service, brussels, conference]","[barclay, ceo, stresses, import, reg, cult, reform, fin, serv, brussel, conf]"
1,662822309,386013000000000000,"[barclays, announces, result, rights, issue]","[barclays, announce, result, right, issue]","[barclay, annount, result, right, issu]"
2,662822310,379580000000000000,"[barclays, publishes, prospectus, 58bn, rights, issue]","[barclays, publish, prospectus, 58bn, right, issue]","[barclay, publ, prospect, 58bn, right, issu]"
3,662822311,367530000000000000,"[barclays, group, finance, director, chris, lucas, step, end, week, due, ill, health]","[barclays, group, finance, director, chris, lucas, step, end, week, due, ill, health]","[barclay, group, fin, direct, chris, luca, step, end, week, due, il, heal]"
4,662822312,360385000000000000,"[barclays, announces, irene, mcdermott, brown, appointed, group, human, resources, director]","[barclays, announce, irene, mcdermott, brown, appoint, group, human, resources, director]","[barclay, annount, ir, mcdermott, brown, appoint, group, hum, resourc, direct]"


## George's Challenge
Walk thru the code please :)

- As we can see, the text column contains the normalized tokens, the lemma column contains the lemmatized words, and the stem column contains the stemmed words.
- We can use whichever of these techniques suits the needs of the project.

# So, the tasks are:

- ### Noise removal (Special character, html tags, accented characters, punctuation removal)
- ### Lowercasing (can be task dependent in some cases)
- ### Stop-word removal
- ### Stemming / lemmatization

- ### Now that the text cleaning is done, our text data is ready to be converted into a format that the machine can understand (i.e. numbers).
- ### We will cover that in the next lectures on Vectorization, after which we can perform tasks such as:
  - ### Sentiment Analysis
  - ### Text Classification
### etc. etc.

## George's Bonus $$$
__Word Embedding__
- http://jalammar.github.io/illustrated-word2vec/
- https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
- https://machinelearningmastery.com/what-are-word-embeddings/

In [35]:
#!python -m spacy download en_core_web_md
!python -m spacy download en_core_web_lg

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [37]:
import spacy
import en_core_web_lg
nlp = en_core_web_lg.load()

In [38]:
tokens = nlp("dog cat banana apple human man woman afskfsd")

In [39]:
tokens

dog cat banana apple human man woman afskfsd

In [40]:
tokens[0].similarity(tokens[1])
# dog and cat

0.80168545

In [41]:
tokens[0].similarity(tokens[0])
# dog and dog

1.0

In [42]:
tokens[0].similarity(tokens[2])
# dog and banana

0.24327643

In [43]:
tokens[0].similarity(tokens[4])
# dog and human

0.35814866
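By default, spaCy's `similarity` is a cosine similarity over the word vectors. Here is a minimal NumPy sketch of that measure on made-up vectors (`cosine_similarity` is a hypothetical helper, not a spaCy function).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction, different length
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # perpendicular directions
print(same, orthogonal)
```

Only the direction of the vectors matters, not their length. A word without a vector (an all-zero vector) would make the denominator zero, so this sketch does not handle that case.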

In [44]:
tokens = nlp('queen king man woman boy girl company school animal')

In [45]:
queen, king, man, woman, boy, girl, company, school, animal = (token.vector for token in tokens)

In [46]:
queen

array([ 0.4095   , -0.22693  ,  0.25362  , -0.36055  , -0.37095  ,
       -0.35181  ,  0.50669  , -0.77897  , -0.32571  ,  1.4895   ,
        0.052438 , -0.36751  , -0.074025 ,  0.37078  ,  0.063077 ,
        0.32274  ,  0.346    ,  0.64214  , -0.09583  ,  0.14303  ,
       -0.33826  ,  0.79005  , -0.7136   , -0.050134 , -0.46467  ,
       -0.067917 , -0.32107  ,  0.042919 ,  0.018576 ,  0.59272  ,
       -0.032392 ,  0.72779  ,  0.26002  ,  0.30401  ,  0.43033  ,
        0.25546  , -0.37986  , -0.14398  , -0.54399  , -0.46181  ,
        0.11046  , -0.034391 , -0.10458  , -0.069689 ,  0.091839 ,
       -0.19097  , -0.057108 ,  0.61218  , -0.19544  , -0.31698  ,
       -0.46372  ,  0.088749 , -0.052501 , -0.27969  ,  0.025125 ,
       -0.42097  , -0.069404 , -0.038672 , -0.26489  ,  0.10911  ,
       -0.084848 , -0.23826  ,  0.61538  ,  0.0039223,  0.20285  ,
        0.56085  ,  0.015419 ,  0.30707  ,  0.19435  , -0.20358  ,
       -0.18724  , -0.10311  , -0.46468  , -0.16804  ,  0.2261

In [47]:
vec = queen - woman + man

In [48]:
from scipy import spatial

result = 1 - spatial.distance.cosine(vec, king)
result

0.771614134311676

In [49]:
1 - spatial.distance.cosine(queen, king)

0.7252610325813293
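Why would `queen - woman + man` land closer to `king` than `queen` itself does? A toy picture helps. The 2-D vectors below are invented for illustration (axis 0 stands for "royalty", axis 1 for "gender"); real embeddings have hundreds of dimensions and no such clean axes.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 2-D "embeddings": [royalty, gender].
king = np.array([1.0, 1.0])
queen = np.array([1.0, -1.0])
man = np.array([0.2, 1.0])
woman = np.array([0.2, -1.0])

# Remove the "woman" part of queen and add the "man" part back in.
vec = queen - woman + man

print(cosine_similarity(vec, king))    # similarity after the arithmetic
print(cosine_similarity(queen, king))  # similarity of the raw words
```

In this idealized setup the arithmetic lands exactly on `king`. With real vectors the effect is weaker, which matches the modest gain seen above (about 0.77 versus 0.73).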

In [None]:
1 - spatial.distance.cosine(boy, king)

0.37796980142593384

In [None]:
1 - spatial.distance.cosine(girl, king)

0.2687934339046478

In [None]:
1 - spatial.distance.cosine(school, king)

0.17461009323596954

In [None]:
1 - spatial.distance.cosine(company, king)

0.1744476556777954

In [None]:
1 - spatial.distance.cosine(animal, king)

0.2161560207605362

In [59]:
tokens = nlp('apple apples car cars family families') 

In [51]:
apple, apples, car, cars, family, families = (token.vector for token in tokens)

In [52]:
1 - spatial.distance.cosine(apple, apples)

0.7504112720489502

In [53]:
1 - spatial.distance.cosine(car, cars)

0.8425762057304382

In [54]:
1 - spatial.distance.cosine(family, families)

0.7884814739227295

In [55]:
vec_1 = apple - apples
vec_2 = car - cars
vec_3 = family - families

In [56]:
1 - spatial.distance.cosine(vec_1, vec_2)

0.23837804794311523

In [57]:
1 - spatial.distance.cosine(vec_1, vec_3)

0.24297365546226501

In [58]:
1 - spatial.distance.cosine(vec_2, vec_3)

0.3244694471359253

In [70]:
s1 = nlp('i like apple')
s2 = nlp('like i apple')
s3 = nlp('i know how to drive a car')

In [71]:
1 - spatial.distance.cosine(s1.vector, s2.vector)

1.0

In [69]:
1 - spatial.distance.cosine(s1.vector, s3.vector)

0.7251461148262024
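The similarity of exactly 1.0 between 'i like apple' and 'like i apple' follows from how the document vector is built: by default spaCy averages the token vectors, and an average ignores word order. Here is a sketch with made-up 2-D word vectors (`toy_vectors` and `sentence_vector` are hypothetical names).

```python
import numpy as np

# Made-up word vectors; the spaCy model used above has 300 dimensions.
toy_vectors = {
    "i": np.array([0.1, 0.3]),
    "like": np.array([0.7, 0.2]),
    "apple": np.array([0.4, 0.9]),
}

def sentence_vector(sentence):
    """Average the vectors of the words in the sentence."""
    return np.mean([toy_vectors[word] for word in sentence.split()], axis=0)

v1 = sentence_vector("i like apple")
v2 = sentence_vector("like i apple")

print(np.allclose(v1, v2))  # same words, different order
```

So averaged word vectors capture which words occur, not how they are arranged.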