# Word Embeddings

- machine learning on text requires that you first represent the text numerically

    - bag of words representation is one way
    
    - can usually do better with word embeddings

- **Word embeddings** (also called **word vectors**) 

    - represent each word numerically so that its vector reflects how the word is used or what it means

    - vector encodings are learned by considering the context in which the words appear

    - words that appear in similar contexts will have similar vectors
    
        - For example, the vectors for "leopard", "lion", and "tiger" will be close together, while they will be far from the vectors for "planet" and "castle"

- These vectors can be used as features for machine learning models

    - Word vectors will typically improve model performance compared with a bag-of-words encoding
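
To make the contrast concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up values for illustration only (real embeddings are learned and have hundreds of dimensions), and `bag_of_words` and `euclidean` are hypothetical helper names, not part of spaCy.

```python
from collections import Counter
import math

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["lion", "tiger", "planet"]

# Bag of words: every word gets its own independent slot, so "lion" and
# "tiger" look exactly as unrelated as "lion" and "planet".
print(bag_of_words("the lion chased the tiger", vocabulary))  # [1, 1, 0]

# Toy embeddings (made-up numbers, NOT real learned vectors).
toy_vectors = {
    "lion":   [0.9, 0.8, 0.1],
    "tiger":  [0.8, 0.9, 0.2],
    "planet": [0.1, 0.2, 0.9],
}

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# With embeddings, words used in similar contexts sit close together.
print(euclidean(toy_vectors["lion"], toy_vectors["tiger"]))
print(euclidean(toy_vectors["lion"], toy_vectors["planet"]))
```

In the toy data the lion-to-tiger distance is much smaller than the lion-to-planet distance, which is the kind of structure a bag-of-words encoding cannot express.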

### spaCy Word Embeddings

- spaCy provides embeddings learned from a model called `Word2Vec`

- access them by loading a large spaCy model such as `en_core_web_lg`
    
    - each token then exposes its vector through the `.vector` attribute

In [4]:
# import numpy 
import numpy as np

# import spacy
import spacy

In [5]:
# load a large model to get word vectors 
nlp = spacy.load('en_core_web_lg')

In [6]:
# define the text to analyze
text_to_analyze = "These vectors can be used as features for machine learning models."

# for the rest of this exercise, wrap processing in nlp.disable_pipes(): the
# other pipeline components are not needed here, and skipping them saves time
with nlp.disable_pipes():

    # get the word vectors for the tokens in the text to analyze
    vectors = np.array( [ token.vector for token in nlp(text_to_analyze) ] )

# check the shape of the newly created array
vectors.shape

(12, 300)

- These are 300-dimensional vectors, one per token (the 11 words plus the final period)

- However, the labels are at the document level, so a model cannot use the word-level embeddings directly

    - So you need a single vector representation for the entire document

### Document Level Vectors

- there are many ways to combine all the word vectors into a single document vector we can use for model training

- a simple and surprisingly effective approach is to average the vectors of all the words in the document

    - then, you can use these document vectors for modeling

- spaCy calculates the average document vector which you can get with `doc.vector`
    
    - Here is an example loading the spam data and converting it to document vectors

In [7]:
import pandas as pd 

# load the spam/ham data 
spam = pd.read_csv('kaggle_data/spam.csv')

# disable pipes again
with nlp.disable_pipes():

    # generate the document vectors 
    doc_vectors = np.array( [ nlp(text).vector for text in spam.text ] )

doc_vectors.shape

(5572, 300)
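
The averaging step described above can be sketched without spaCy. This is only an illustration of the idea, not spaCy's implementation: `average_vectors` is a hypothetical helper and the three-dimensional token vectors are made-up values.

```python
def average_vectors(word_vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    n = len(word_vectors)
    # zip(*word_vectors) groups the values dimension by dimension
    return [sum(column) / n for column in zip(*word_vectors)]

# Made-up 3-dimensional vectors standing in for the tokens of one document.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

doc_vector = average_vectors(token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```

However many tokens a document has, the average has the same number of dimensions as a single word vector, which is why the 5572 messages above produce an array of shape `(5572, 300)`.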

### Classification Models

- with document vectors in hand, you can train scikit-learn models, XGBoost models, and other standard classifiers

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(doc_vectors, spam.label,
                                                    test_size=0.1, random_state=1)

- scikit-learn provides a linear SVM classifier, `LinearSVC`, which works like other scikit-learn models

In [9]:
from sklearn.svm import LinearSVC

# dual=False solves the primal problem instead of the dual one, which
# scikit-learn recommends when there are more samples than features
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%")

Accuracy: 97.312%
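
For a classifier, `score` reports accuracy: the fraction of test examples predicted correctly. The sketch below works through that arithmetic in plain Python. The `accuracy` helper and the toy labels are hypothetical, and the count of 543 correct predictions is an assumption inferred from the reported percentage, not a number printed by the notebook.

```python
import math

n_samples = 5572   # rows in doc_vectors, from the shape shown earlier
test_size = 0.1    # value passed to train_test_split

# With a fractional test_size, scikit-learn rounds the test-set size up.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 5014 558

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels (made up): three of the four predictions are right.
print(accuracy(["ham", "spam", "ham", "ham"],
               ["ham", "spam", "spam", "ham"]))  # 0.75

# Assumption: 543 correct out of 558 reproduces the reported figure.
n_correct = 543
print(f"Accuracy: {n_correct / n_test * 100:.3f}%")  # Accuracy: 97.312%
```

Under that assumption, the reported accuracy corresponds to about 15 misclassified messages in the held-out set.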


### Document Similarity

- Documents with similar content generally have similar vectors

- So you can find similar documents by measuring the similarity between the vectors

- A common metric for this is cosine similarity, which is the cosine of the angle $\theta$ between two vectors $\mathbf{a}$ and $\mathbf{b}$

$$
\cos{\theta} = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$

- Cosine similarity ranges from -1 to 1, where -1 means the vectors point in exactly opposite directions and 1 means they point in the same direction
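
The range is easy to check on vectors small enough to verify by hand. The following is a plain-Python sketch of the formula; `cosine_plain` is a hypothetical helper name chosen so it does not clash with the notebook's own function.

```python
import math

def cosine_plain(a, b):
    """Cosine of the angle between two equal-length vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_product = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm_product

print(cosine_plain([1, 2, 3], [2, 4, 6]))     # same direction: 1.0
print(cosine_plain([1, 2, 3], [-1, -2, -3]))  # opposite direction: -1.0
print(cosine_plain([1, 0], [0, 1]))           # perpendicular: 0.0
```

Note that scaling a vector does not change the result: `[2, 4, 6]` is twice `[1, 2, 3]` and still scores 1.0, because only the direction matters.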

In [12]:
# define function to compute cosine similarity between two vectors
def cosine_similarity(a, b):
    return a.dot(b)/np.sqrt(a.dot(a) * b.dot(b))

In [13]:
# initialize two document vectors
a = nlp("REPLY NOW FOR FREE TEA").vector
b = nlp("According to legend, Emperor Shen Nung discovered tea when leaves from a wild tree blew into his pot of boiling water.").vector

# compute cosine similarity between the two vectors 
cosine_similarity(a, b)

0.7030031
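
Finding similar documents then amounts to scoring every candidate against a query vector and sorting. A minimal sketch, assuming made-up three-dimensional document vectors and hypothetical names (`cosine_plain`, `most_similar`, and the document labels); with spaCy you would use `nlp(text).vector` instead.

```python
import math

def cosine_plain(a, b):
    """Cosine of the angle between two equal-length vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def most_similar(query, documents):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_plain(query, vec)) for name, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up vectors standing in for real document vectors.
documents = {
    "tea_history": [0.9, 0.1, 0.3],
    "free_prize":  [0.1, 0.9, 0.2],
    "tea_offer":   [0.6, 0.6, 0.2],
}
query = [0.8, 0.2, 0.3]

for name, score in most_similar(query, documents):
    print(f"{name}: {score:.3f}")
```

With these toy values the ranking is `tea_history`, then `tea_offer`, then `free_prize`.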