# Chapter 0: A brief introduction

This is my first public notebook. For now, the topic that interests me most is NLP, so I decided to set up this notebook and update it regularly to share with friends what I have learnt and experienced along the way.

NLP stands for Natural Language Processing. It deals with text and speech so that we can gain insight from large volumes of both. Once text or speech data has been preprocessed into useful representations, a computer can dig information out of it very efficiently. NLP is used in:
* Machine Translation
* Speech Recognition
* Sentiment Analysis
* Question Answering Systems
* Automatic Text Summarization
* Chatbots
* Text Classification
* Character Recognition
* Spell Checking


Since NLP is so important and attractive, there is more than one way to do it. To my knowledge, TensorFlow can do the job, spaCy is also very hot and spicy on this topic, and [NLTK](https://www.nltk.org/) is another powerful tool. Hopefully, in the future, I can pick those treasures up one by one and collect them here, so my friends can find a good starting point in this notebook.

One thing I have to admit here is that I am not a smart man. I am not a genius, but an ordinary person with a genuine interest in data science. People say two species can reach the top of a pyramid: the eagle and the snail. The "eagles" can understand a snippet of code in the blink of an eye. In my "snail" way, I have to type it word by word, test, debug, test more and debug more, until I really understand it. In the same manner, I will build up this notebook little by little, with detailed explanations of the code and complete links to references and other useful resources. I hope my friends can find enjoyment in my community. Let us work together!

## 0.1 Some General Ideas on Machine Learning Script

After reading some of the examples on Kaggle, you will realize that almost all of them, no matter what topic they cover or which specific tools they use, follow similar steps. It is useful to summarize these steps at the very beginning, so that everyone has a simplified map at hand to keep us on the right track.

1. Loading in the data

> A typical machine learning repository is the one from [UCI](https://archive.ics.uci.edu/ml/index.php). For a data scientist, the first thing to care about is the attribute information of the data. The attributes can be divided into a *factors* part and a *labels* part. For example, if you want to study the risk of diabetes, you may take into account factors like age, sex, sudden weight loss, visual blurring, itching, etc. Those factors are then associated with a label: whether the person has diabetes or not.

> When the data is not complex, we use Pandas to handle it.

2. Splitting the whole dataset into training/testing sets
3. Building a model
4. Fitting the model (also called the optimisation process, or model training)
5. Evaluating the model
6. Making prediction based on the model

It seems I have forgotten to mention **data visualization**. No, I haven't. Data visualization is a powerful tool that can be used in almost every step mentioned above, so it does not get a step of its own. I will learn how to use it along the way.

Keep in mind that the six steps listed above are only a simplified road map, a sketch of the "pipeline"; we will meet more details later on. For example, once the data is loaded, we need a **preprocessing** step to remove noise, check whether the data is complete, scale the data, reduce its dimensionality, etc.
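
The six steps can be sketched in a few lines of plain Python. This is only a toy illustration: the sleep data, the `ThresholdModel` class and its method names are all invented for the sketch, and a real project would use Pandas and a proper machine learning library instead.

```python
import random

# 1. Load the data: (hours_of_sleep, is_tired) pairs, invented for this sketch
data = [(4.0, 1), (4.5, 1), (5.0, 1), (5.5, 1), (6.0, 1),
        (7.0, 0), (7.5, 0), (8.0, 0), (8.5, 0), (9.0, 0)]

# 2. Split the whole dataset into training and testing sets
rng = random.Random(0)
rng.shuffle(data)
train, test = data[:8], data[8:]


# 3. Build a model: a single threshold on the one factor we have
class ThresholdModel:
    def __init__(self):
        self.threshold = None

    # 4. Fit the model: put the threshold midway between the two class means
    def fit(self, rows):
        tired = [x for x, y in rows if y == 1]
        rested = [x for x, y in rows if y == 0]
        self.threshold = (sum(tired) / len(tired) + sum(rested) / len(rested)) / 2

    def predict(self, x):
        return 1 if x < self.threshold else 0


model = ThresholdModel()
model.fit(train)

# 5. Evaluate the model on data it has not seen during fitting
accuracy = sum(model.predict(x) == y for x, y in test) / len(test)
print(f"accuracy: {accuracy:.2f}")

# 6. Make a prediction for a new observation
print(model.predict(5.2))
```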


# Chapter 1: NLP with TensorFlow
NLP can be done with TensorFlow; I put the two together in this chapter.
For now, most of the content is from [here](https://medium.com/@saitejaponugoti/nlp-natural-language-processing-with-tensorflow-b2751aa8c460). In the future, I will add more projects so that everyone can get what they want from the "buffet table".

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
print(tf.__version__)

See the complete [official documentation of TensorFlow](https://www.tensorflow.org/api_docs).

## 1.1 Tokenization
Since a computer (TensorFlow) cannot calculate with characters or words directly, we need a process that transforms this non-numeric information into numbers. This process is called **tokenization**.

In [None]:
# List of sample sentences that we want to tokenize
sentences = ['I love my dog',
             'I love my cat',
            ]

# initializing a tokenizer that can build a word index
# num_words is the maximum number of words to keep when converting texts to sequences;
# the tokenizer keeps the most frequent words
tokenizer = Tokenizer(num_words = 100)

# fitting the tokenizer on the sentences
tokenizer.fit_on_texts(sentences)

# the full list of words is available as the tokenizer's word index
word_index = tokenizer.word_index

# the result will be a dictionary, key being the words and the value being the token for that word
print(word_index)

The following code shows that the *tokenizer* is smart enough NOT to add a new token for "dog!": by default it lowercases the text and strips punctuation, so "dog!" is counted as "dog".

In [None]:
sentences = ['I love my dog',
             'I love my cat',
             'you love my dog!'
            ]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Newly generated dictionary doesn't contain new token for "dog!"
print(word_index)
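
To see why "dog!" gets no token of its own, here is a rough pure-Python sketch of what fitting a tokenizer amounts to. It is not the Keras implementation: `build_word_index` is a made-up helper, and the regular expression only approximates the default Keras punctuation filter.

```python
import re
from collections import Counter


def build_word_index(texts):
    """Number every word by descending frequency, starting from 1."""
    counts = Counter()
    for text in texts:
        # lowercase the text and keep only letters, digits and apostrophes
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # most_common() keeps first-seen order among words with equal counts
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


sentences = ['I love my dog',
             'I love my cat',
             'you love my dog!'
            ]
word_index = build_word_index(sentences)
print(word_index)
```

Because the exclamation mark is dropped before counting, "dog!" simply raises the count of "dog".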

## 1.2 Sequencing:
Sequencing turns each sentence into a list of numbers (tokens), in the same order as the words in the sentence.

In [None]:
sentences = ['I love my dog',
             'I love my cat',
             'you love my dog!',
             'Do you think my dog is amazing?'
            ]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# this creates sequence of tokens representing each sentence
sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)

Now we have lists of numbers, which can be turned into the tensors that TensorFlow tools handle.

But before we move on, let us have a test first.

In [None]:
test_data = [ 
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)

print(test_seq)

In the new test_data, new words like "really" and "manatee", which the tokenizer has never seen, do not appear in the sequences at all. In other words, we are losing the length of the sentence. To fix this, we employ a trick named the **OOV (out of vocabulary) token**: a placeholder such as "<OOV>", chosen so that it will never appear as a real word, stands in for every unseen word.

In [None]:
sentences = ['I love my dog',
             'I love my cat',
             'you love my dog!',
             'Do you think my dog is amazing?'
            ]

# adding a "out of vocabulary" word to the tokenizer
tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

test_data = ['i really love my dog',
             'my dog loves my manatee'
            ]

test_seq = tokenizer.texts_to_sequences(test_data)

print(word_index)
print(test_seq)

From the output of the code above, we can see that new words like "really" and "manatee" are all replaced by '<OOV>' and then tokenized as "1". The result can be read like this:
> 'i really love my dog' = [5, 1, 3, 2, 4] = 'i \<OOV\> love my dog'
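
The OOV trick can be imitated in a few lines of plain Python. This is a sketch of the idea, not the Keras code: `words`, `fit` and `to_sequences` are hypothetical helpers, and index 1 is reserved for "<OOV>" as in the output above.

```python
import re
from collections import Counter

OOV = "<OOV>"


def words(text):
    # lowercase and drop punctuation (an approximation of the Keras defaults)
    return re.findall(r"[a-z0-9']+", text.lower())


def fit(texts):
    counts = Counter(w for t in texts for w in words(t))
    index = {OOV: 1}  # the OOV token always takes index 1
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index


def to_sequences(texts, index):
    # every word missing from the index falls back to the OOV index
    return [[index.get(w, index[OOV]) for w in words(t)] for t in texts]


sentences = ['I love my dog',
             'I love my cat',
             'you love my dog!',
             'Do you think my dog is amazing?'
            ]
index = fit(sentences)
test_seq = to_sequences(['i really love my dog', 'my dog loves my manatee'], index)
print(test_seq)
```

Each test sentence keeps its original length, because every unseen word still occupies one position.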

## 1.3 Padding the sequences:
The sentences contain different numbers of words, which means the sequences have different lengths. However, TensorFlow models expect inputs of the same length, so we need padding.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ['I love my dog',
             'I love my cat',
             'you love my dog!',
             'Do you think my dog is amazing?'
            ]

tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

# padding sequences
padded = pad_sequences(sequences)

print(word_index)
print(sequences)
print(padded)

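As users, we have some options for the padding. By default the zeros are placed at the beginning of each sequence; if we want the padding at the end, we can use `padding = "post"`. `maxlen` sets the final length, and `truncating` decides which end of a too-long sequence is cut. With these options, the `pad_sequences` call might look like: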

`padded = pad_sequences(sequences, maxlen = 5, padding = 'post', truncating = 'post')`

Detailed information can be found in [pad_sequences](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) documentation.
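
To make the options concrete, here is a small pure-Python imitation of the padding logic. The `pad` function is a hypothetical stand-in written for this notebook; it returns plain lists rather than the NumPy array that `pad_sequences` returns, and it assumes `maxlen` is at least 1.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or cut every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" cuts from the front, "post" cuts from the back
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
print(pad(sequences))
print(pad(sequences, maxlen=5, padding="post", truncating="post"))
```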

Since the text has been tokenized now, the next step is to use TensorFlow. Let us start with the [dataset of News headlines](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection).

This dataset has 3 fields:
> ___**is_sarcastic**: 1 if the headline is sarcastic and 0 otherwise.___ <br>
> ___**headline**: the headline of the news article.___ <br>
> ___**article_link**: link to the original news article.___

In [None]:
# importing the json library
import json

# The file is read one line at a time, one JSON object per line,
# instead of calling json.load() on the whole file.
# The with-block closes the file for us when we are done.
data = []
with open("../input/news-headlines-dataset-for-sarcasm-detection/Sarcasm_Headlines_Dataset.json", 'r') as f:
    for line in f:
        data.append(json.loads(line))

# creating lists for sentences, labels and urls
sentences = [] # headlines
labels = [] # labels
urls = []

# iterating through the json data and loading
# the requisite values into our python lists
for item in data:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

## 1.4 Splitting data into train and test sets:

In [None]:
training_size = 20000

training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]

training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

If we fitted the tokenizer on the whole dataset, the words of the test set would already be in its vocabulary, which is not what we want. The test set should stand in for headlines the model has never seen, so the tokenizer must be fitted on the training set only; test words missing from the training vocabulary then become `<OOV>`.

The following code does this:

In [None]:
vocab_size = 10000
max_length = 100
trunc_type = 'post'
padding_type = 'post'
oov_tok = "<OOV>"
training_size = 20000

tokenizer = Tokenizer(num_words = vocab_size, oov_token = oov_tok)
# fitting tokenizer only to training set
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

# creating training sequences and padding them 
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen = max_length,
                               padding = padding_type,
                               truncating = trunc_type,
                               )

# creating testing sequences and padding them using same tokenizer
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen = max_length,
                              padding = padding_type,
                              truncating = trunc_type,
                              )

# converting all variables to numpy arrays, to be able to work with tf v2
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)

In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [15, 15]  # make the plot bigger so the subplots don't overlap
training_padded.shape
#training_padded.hist(); # use a semicolon to suppress the return value

## 1.5 Word Embeddings:

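In this step, the meaning of the words is taken into account, since the tokenized words (the numbers) by themselves cannot tell us preference, attitude or things like that.

Well, here is where the concept of **embeddings** comes in: each token is mapped to a vector, and the vectors are learned during training so that words used in similar ways end up close together.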

In [None]:
embedding_dim = 16

# creating a model for sentiment analysis
model = tf.keras.Sequential([
    # adding an Embedding layer so the neural network can learn a vector for each word
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = max_length),
    # Global average pooling averages the word vectors of a headline into one vector
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation = 'relu'),
    tf.keras.layers.Dense(1, activation = 'sigmoid')
])

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

num_epochs = 10

history = model.fit(training_padded, training_labels, epochs = num_epochs,
                   validation_data = (testing_padded, testing_labels))
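
To make the first two layers of the model less of a black box, here is a pure-Python sketch of what an embedding lookup followed by global average pooling computes. The sizes are tiny and the table is filled with random numbers; in the real model the table values are learned during training. The names `table`, `embed` and `global_average_pool` are made up for this sketch.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4  # tiny sizes, chosen only for readability

# Embedding table: one row of embedding_dim numbers per token id
table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]


def embed(token_ids):
    # look up one vector per token id
    return [table[i] for i in token_ids]


def global_average_pool(vectors):
    # average over the sequence axis, one dimension at a time
    return [sum(column) / len(vectors) for column in zip(*vectors)]


padded = [5, 1, 3, 2, 4, 0, 0]  # one padded sequence of token ids
pooled = global_average_pool(embed(padded))
print(len(pooled), pooled)
```

Whatever the length of the sequence, the pooled result always has `embedding_dim` numbers, which is why the dense layers after it can have a fixed size. Note that in this sketch the padding zeros are averaged in like any other token.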

## 1.6 Establishing Sentiment:
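This section uses the trained neural network to make predictions. Since the last layer is a sigmoid, each output is a number between 0 and 1; the closer it is to 1, the more likely the headline is sarcastic.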

In [None]:
# forming new sentences for testing, feel free to experiment
# sentence 1 is bit sarcastic, whereas sentence two is a general statement.

new_sentence = [
    "granny starting to fear spider in the garden might be real",
    "game of thrones season finale showing this sunday night"
]

# Converting the sentences to sequences using tokenizer
new_sequences = tokenizer.texts_to_sequences(new_sentence)
# padding the new sequences to make them have same dimensions
new_padded = pad_sequences(new_sequences, maxlen = max_length, 
                          padding = padding_type,
                          truncating = trunc_type)

new_padded = np.array(new_padded)

print(model.predict(new_padded))

# Chapter 2: NLP with spaCy
The following is from this [tutorial](https://www.kaggle.com/matleonard/intro-to-nlp).

The detailed spaCy API documentation is [here](https://spacy.io/api/language). We may need to refer to it from time to time to explain the following examples.

![From Analytics Vidhya](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/03/spacy_pipeline.png)

<center>Pipeline of spaCy   </center>

spaCy relies on **models** that are language-specific and come in different sizes. You can load a spaCy model with `spacy.load`.

For example, here's how you would load the English language model.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')  # the small English model; the 'en' shortcut only works in spaCy v2

nlp.pipe_names

`nlp.pipe_names` shows the active pipeline components. If we want to disable some of the components, we can use something like `nlp.disable_pipes('tagger', 'parser')`.

## 2.1 Tokenize

This is different from the tokenizing process in the previous chapter about TensorFlow. A token here is still a word, not the number we get after tokenization in TensorFlow. For spaCy, tokenization is the process of splitting raw text into simpler individual units, which will make more sense in later computation.

As you will see in the following example, in spaCy a punctuation mark is a token, and contractions like "don't" are split into "do" and "n't" and counted as two tokens.

The following example also displays POS (Part-of-Speech) tagging. The part of speech tells us the function of a word and how it is used in a sentence: noun, pronoun, adjective, verb, adverb, etc. Punctuation is labelled as "PUNCT".

In [None]:
doc = nlp("Tea is healthy and calming, don't you think?")

for token in doc:
    print(token)
print("~"*40)    
for token in doc:
    # print the token and its part-of-speech tag
    print(token.text, "-->", token.pos_)

In [None]:
spacy.explain("AUX")

Almost all the POS tags shown in the result above are in short form. If you want to know what one stands for, you can use code like `spacy.explain("AUX")` to figure it out.

## 2.1.1 Play with 'ner'

As you have seen in the previous sections, after we load the spaCy model and show the pipe names with `nlp.pipe_names`, one of the results is 'ner'. What is 'ner'? It is short for **N**amed **E**ntity **R**ecognition. The named entities found in a given text can be accessed through `.ents`. See the following example:

In [None]:
doc = nlp("D. Trump will win the 2020 U.S.A. election! I predicted it on 11/27th.")
print(doc.ents)
print("Entity \t\tType \t\tStart \t\t End")
for ent in doc.ents:
    print(f"{ent.text}\t\t{ent.label_}\t\t{ent.start_char}\t\t{ent.end_char}")
    #print(ent.text, ent.label_, ent.start_char, ent.end_char)

From the result above, we can see that not all the words or phrases are labelled as named entities. The types are shown in short form; their meanings can be found [here](https://spacy.io/api/annotation). You can also use `spacy.explain('GPE')` to figure them out.

A fun fact: if you use "Trump", this word will be labelled as "ORG", and if you use "D. Trump", you will get "PERSON".  

In [None]:
from spacy import displacy
displacy.render(doc, style='ent')

## 2.2 Text preprocessing

With a spaCy token, `token.lemma_` returns the lemma, while `token.is_stop` returns a boolean `True` if the token is a stopword (and `False` otherwise).

A lemma is the root of a family of words. All the other members of the family are derived from this root word. For example, "walking" and "walks" both come from "walk". After using `token.lemma_`, the whole word family is represented by "walk" only.

Stopwords are words like "is", "the", "and", "but". They glue the meaningful words together to form a complete sentence. However, they don't carry much information, so they are often removed in the text preprocessing phase.

In [None]:
print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")

## 2.3 Pattern Matching

To match individual tokens, you need to create a `Matcher`. When you want to match a list of terms, it is easier and more efficient to use `PhraseMatcher`.

In [None]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr = 'LOWER')

The matcher is created using the vocabulary of your model. Here we're using the small English model you loaded earlier. Setting `attr='LOWER'` will match the phrases on lowercased text. This provides case insensitive matching.

In [None]:
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", patterns)
print(patterns)
#x = nlp('Galaxy Note')
#for token in x:
#    print(token)

Create a document from the text to search and use the phrase matcher to find where the terms occur in the text.

In [None]:
# Borrowed from https://daringfireball.net/linked/2019/09/21/patel-11-pro
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
              "photography tests pitting the iPhone 11 Pro against the "
              "Galaxy Note 10 Plus and last year's iPhone XS and Google Pixel 3.")
matches = matcher(text_doc)
print(matches)
print("Token \t\t\tIndex")
# enumerate() gives each token together with its index, starting from 0
for n, token in enumerate(text_doc):
    print(f"{str(token)}\t\t\t{n}")

`print(matches)` prints the matches, which are tuples. For example, the first one is `(3766102292120407359, 17, 19)`: the big number `3766102292120407359` is the match ID, `17` is the start index and `19` is the end index.
* The token indices of `text_doc` start from zero, not 1.
* The end index is exclusive: the matched phrase covers tokens 17 and 18, not 19.
* The space after "side-by-side" is important. The string pieces are joined exactly as written, so without it you would get a token like "sidephotography", which is not what we want.

This can be seen from the following code.

In [None]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])
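
The exclusive end index works the same way as ordinary Python slicing. In the sketch below the token list is typed in by hand to mirror the first twenty tokens of the sentence above, so it runs without spaCy; the exact tokens are an assumption based on the indices printed earlier.

```python
# Hand-typed stand-in for the first twenty tokens of text_doc
tokens = ["Glowing", "review", "overall", ",", "and", "some", "really",
          "interesting", "side", "-", "by", "-", "side", "photography",
          "tests", "pitting", "the", "iPhone", "11", "Pro"]

start, end = 17, 19  # taken from the first match tuple

# A slice includes index `start` and stops just before index `end`
matched = tokens[start:end]
print(matched)
print(tokens[end])  # the token at the end index is NOT part of the match
```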

## 2.4 Text Classification with SpaCy

Text classification can be used in areas like spam detection, sentiment analysis, and tagging customer queries.

In [None]:
import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('../input/nlp-course/spam.csv')
spam.head(10)

## 2.5 Bag of Words
Machine learning models don't learn from raw text data. Instead, you need to convert the text to something numeric. In the previous chapter about TensorFlow, we got the numbers via the so-called tokenization process. Here, in the case of spaCy, the text-to-number conversion is done with a bag of words, a variation of one-hot encoding: each document is represented as a vector of term frequencies, one entry per term in the vocabulary.
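
A bag of words can be built by hand in a few lines. The two documents below are invented, and `bow` is a hypothetical helper for this sketch; it only illustrates the representation and says nothing about how spaCy implements it internally.

```python
from collections import Counter

docs = ["free tea free", "tea is healthy"]

# The vocabulary is every distinct word, in a fixed (alphabetical) order
vocab = sorted({word for doc in docs for word in doc.split()})


def bow(doc):
    # one count per vocabulary entry; the word order of the text is thrown away
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


print(vocab)
for doc in docs:
    print(bow(doc), "<-", doc)
```

Every document becomes a vector of the same length, no matter how many words it contains.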



## Building a Bag of Words model

spaCy handles the bag of words conversion and building a simple linear model for you with the `TextCategorizer` class.

The TextCategorizer is a spaCy **pipe**. Pipes are classes for processing and transforming tokens. When you create a spaCy model with `nlp = spacy.load('en_core_web_sm')`, there are default pipes that perform part-of-speech tagging, entity recognition, and other transformations. When you run text through a model with `doc = nlp("some text here")`, the output of the pipes is attached to the tokens in the `doc` object. The lemmas for `token.lemma_` come from one of these pipes.

In [None]:
# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
    "textcat",        # name of the TextCategorizer
    config = {        # configuration information written in a dictionary
        "exclusive_classes": True,  # each text gets exactly one label, here either ham or spam
        "architecture": "bow"    # bow is short for bag of words
    }
)

# Add the TextCategorizer to the end of the pipeline
#nlp.pipe_names
nlp.add_pipe(textcat)
nlp.pipe_names

`['tagger', 'parser', 'ner', 'textcat']`

NER: Named Entity Recognition

In [None]:
# Add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")

## Training a Text Categorizer Model

The `train_labels` list below holds one dictionary per message, such as `{'ham': True, 'spam': False}` for a ham message. Each of these dictionaries is nested in another dictionary under the key `'cats'`.

In [None]:
train_texts = spam["text"].values
train_labels = [{'cats': {'ham': label == 'ham',
                         'spam': label == 'spam'}}
               for label in spam['label']]
print(train_labels[0:4]) # Print out the first four items in the list to take a look.

Then we combine the texts and labels into a single list using `zip()` function.

In [None]:
train_data = list(zip(train_texts, train_labels))
train_data[:3]

Now you are ready to train the model. First create an `optimizer` using `nlp.begin_training()`. This optimizer is used to update the model. `minibatch` returns a generator yielding minibatches for training. Each minibatch is split into texts and labels, which are then passed to `nlp.update` to update the model's parameters.

A detailed description of nlp.update method is here: https://spacy.io/api/language#update
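
The batching and the `zip(*batch)` trick can be tried without spaCy. In this sketch `minibatches` is a hand-written stand-in for `spacy.util.minibatch`, and the five training examples are invented.

```python
from itertools import islice


def minibatches(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Invented (text, label) pairs in the same shape as train_data
train_data = [("text %d" % i, {"cats": {"ham": i % 2 == 0, "spam": i % 2 == 1}})
              for i in range(5)]

batches = list(minibatches(train_data, 2))
print([len(batch) for batch in batches])

# zip(*batch) turns a list of (text, label) pairs into
# one tuple of texts and one tuple of labels
texts, labels = zip(*batches[0])
print(texts)
print(labels)
```

The last batch may be smaller than the others when the data does not divide evenly.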

In [None]:
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# Create the batch generator with batch size = 8
batches = minibatch(train_data, size = 8)
# Iterate through minibatches
for batch in batches:
    # Each batch is a list of (text, label) but we need to
    # send separate lists for texts and labels to update().
    # This is a quick way to split a list of tuples into lists
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd = optimizer)  
    # 

In [None]:
import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

losses = {}
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size = 8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) but we need to 
        # send separate lists for texts and labels to update().
        # This is a quick way to split a list of tuples into lists
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd = optimizer, losses = losses)
    print(losses)

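One sample result from the code above is shown below. Note that `losses` is created once, before the epoch loop, so the printed values are running totals that grow from epoch to epoch rather than per-epoch losses: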

`{'ner': 0.0, 'parser': 0.0, 'textcat': 0.2844166667900331, 'tagger': 431.55187639655196}
{'ner': 0.0, 'parser': 0.0, 'textcat': 0.42443664483654675, 'tagger': 804.3928032657932}
{'ner': 0.0, 'parser': 0.0, 'textcat': 0.49458903677683475, 'tagger': 1079.4695060554511}
{'ner': 0.0, 'parser': 0.0, 'textcat': 0.5406185587745101, 'tagger': 1329.9959928379762}
{'ner': 0.0, 'parser': 0.0, 'textcat': 0.5782408489383685, 'tagger': 1538.3814673529532}
{'ner': 0.0, 'parser': 0.0, 'textcat': 0.611205267520201, 'tagger': 1704.0516921713795}
{'ner': 0.0, 'parser': 0.0, 'textcat': 0.6690017411580279, 'tagger': 1875.847960154827}
{'ner': 0.0, 'parser': 0.0, 'textcat': 0.6983035982691773, 'tagger': 2054.010863930191}
{'ner': 0.0, 'parser': 0.0, 'textcat': 0.7010045384345747, 'tagger': 2187.8098707783442}
{'ner': 0.0, 'parser': 0.0, 'textcat': 0.7010062075913526, 'tagger': 2290.876525081628}`

## Making Predictions
Make predictions with the `predict()` method. The input text needs to be tokenized with `nlp.tokenizer` first.

In [None]:
texts = ["Are you ready for the tea party????? It's gonnabe wild",
        "URGENT Reply to this message for GURANTEED FREE TEA"]
docs = [nlp.tokenizer(text) for text in texts]

# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores)

In [None]:
# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis = 1)
print([textcat.labels[label] for label in predicted_labels])

## Word Embeddings
Word embeddings are another way, besides bag of words, to represent text numerically. They take into account the context in which the words appear, in the form of vectors, so they are also called word vectors. Words that appear in similar contexts have similar vectors. For example, "leopard", "lion" and "tiger" will be close to each other, while they will be far away from "planet" and "castle".

Once the vectors are set up, we can perform mathematical manipulation over them. Magically, those operations, like subtracting, adding, will have some real-world meaning. For example, if we subtract the vector of "man" from that of "woman", we will get a new vector. And if we add this new vector to the vector of "king", the resulting vector will be close to the vector of "queen"! 

![Manipulation of word vectors](https://www.tensorflow.org/images/linear-relationships.png)
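
The arithmetic can be tried on a toy example. The two-dimensional vectors below are made up by hand so that the result is easy to check (real word vectors have hundreds of dimensions and only approximately satisfy such relations), and `nearest` is a hypothetical helper.

```python
import math

# Hand-made 2-D "word vectors": first axis ~ royalty, second axis ~ gender
vectors = {
    "man":    [1.0, 0.0],
    "woman":  [1.0, 1.0],
    "king":   [3.0, 0.0],
    "queen":  [3.0, 1.0],
    "planet": [-2.0, 5.0],
}


def nearest(target, exclude=()):
    # the word whose vector has the smallest Euclidean distance to target
    candidates = (word for word in vectors if word not in exclude)
    return min(candidates, key=lambda word: math.dist(vectors[word], target))


# woman - man gives an offset; adding it to king should land near queen
offset = [w - m for w, m in zip(vectors["woman"], vectors["man"])]
result = [k + o for k, o in zip(vectors["king"], offset)]
print(result, nearest(result, exclude={"king"}))
```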

These vectors can of course be fed into a machine learning model. spaCy ships pretrained word vectors of this kind (Word2Vec is one well-known model for learning them). You can access them by loading a large language model like `en_core_web_lg`; they are then available on tokens through the `.vector` attribute.

In [None]:
import numpy as np
import spacy

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

In [None]:
# Disabling other pipes because we don't need them and it'll speed up this part a bit
text = "These vectors can be used as features for machine learning models."
with nlp.disable_pipes():
    vectors = np.array([token.vector for token in nlp(text)])

In [None]:
vectors.shape
#print(vectors[11])


The result shows 12 vectors of 300 dimensions each: one for every word, plus one for the punctuation ".".

However, what we have now are only word-level embeddings. In practice, for modeling, we need document-level vectors, which can be obtained from the word-level vectors with an easy approach: averaging the vectors of all the words in the document.

spaCy provides this through `doc.vector`.

In [None]:
import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('../input/nlp-course/spam.csv')

with nlp.disable_pipes():
    doc_vectors = np.array([nlp(text).vector for text in spam.text])
    
doc_vectors.shape

## Classification Models

With the document vectors, you can train scikit-learn models, xgboost models, or any other standard approach to modeling.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(doc_vectors, spam.label, test_size = 0.1, random_state = 1)

A linear [SVM](https://scikit-learn.org/stable/modules/svm.html#svm) provided by scikit-learn (`LinearSVC`) is used here.

In [None]:
from sklearn.svm import LinearSVC

# Set dual = False to speed up training, and it's not needed
svc = LinearSVC(random_state = 1, dual = False, max_iter = 10000)
svc.fit(X_train, y_train)
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%")

## Document Similarity

Since documents are now represented as vectors, the question is how to measure the similarity between two vectors. Geometry tells us that if two vectors point in the same direction, the angle between them is zero. If we do not care about the starting point and the length of the vectors, we can treat such vectors as identical: they have the greatest possible similarity. So it is natural to use the angle between two vectors as a metric of how similar they are. The bigger the angle, the smaller the similarity.

A little math here; although I hate this part, I cannot figure out any other way to explain this issue without it. Suppose we have a vector $\mathbf{a}$ and a vector $\mathbf{b}$, and the angle between them is $\theta$. Then $\theta$ can be calculated from:
$$\cos \theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$$
$\|\mathbf{a}\|$ and $\|\mathbf{b}\|$ are the magnitudes of the two vectors. The value $\cos \theta$ is called the cosine similarity: it is 1 when the vectors point the same way and gets smaller as the angle grows. The following snippet computes it:


In [None]:
def cosine_similarity(a, b):
    return a.dot(b)/np.sqrt(a.dot(a) * b.dot(b))

In [None]:
a = nlp("Reply now for free tea").vector
b = nlp("According to legend, Emperor Shen Nung discovered tea when leaves from a wild tree blew into his pot of boiling water.").vector
cosine_similarity(a,b)
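
To get a feeling for the numbers, the same formula can be checked on small vectors where the answer is known in advance. This sketch uses plain Python lists instead of spaCy vectors, and `cosine` is just a list-based rewrite of the function above.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# same direction, different length: the angle is 0, similarity close to 1
print(cosine([1, 2], [2, 4]))
# perpendicular vectors: the angle is 90 degrees, similarity 0
print(cosine([1, 0], [0, 1]))
# opposite directions: the angle is 180 degrees, similarity -1
print(cosine([1, 0], [-1, 0]))
```

The length of a vector does not matter, only its direction, which is why a short message and a long document can still be compared.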