# spaCy Word Vectors

Based on **Dr. William Mattingly**'s video: https://www.youtube.com/watch?v=dIUTsFT2MeQ&t

and his Jupyter Book: http://spacy.pythonhumanities.com/01_03_word_vectors.html

## Concept

Word vectors, also known as word embeddings, are numerical representations of words in a high-dimensional space. They capture semantic and syntactic relationships between words, allowing NLP models to understand the meaning and context of words based on their vector representations. Word vectors are typically learned from large text corpora using techniques like Word2Vec, GloVe, or FastText.

1. **Word2Vec**: Word2Vec is a widely used algorithm developed by Tomas Mikolov et al. It offers two architectures: Continuous Bag of Words (CBOW), which predicts a target word from its neighboring context words, and Skip-gram, which predicts the context words from a target word. The resulting word vectors capture word similarities and analogies.
2. **GloVe (Global Vectors for Word Representation)**: GloVe is an algorithm developed by Stanford researchers. It combines global matrix factorization and local context window-based methods to learn word embeddings. GloVe learns word vectors by analyzing global word co-occurrence statistics. It considers the probabilities of words appearing together and constructs a co-occurrence matrix to capture word relationships.
3. **FastText**: FastText is an extension of Word2Vec developed by Facebook AI Research. It introduces a subword-level modeling approach by representing each word as a bag of character n-grams. FastText uses this subword information to handle out-of-vocabulary words and can generate embeddings for rare or unseen words based on their character constituents.
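
To make the subword idea concrete, here is a minimal sketch of splitting a word into character n-grams. The function name `char_ngrams` and its parameters are hypothetical, and the `<` and `>` boundary markers and the 3 to 6 length range follow the FastText paper's description rather than any particular library call.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of length n across the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word still shares many of these n-grams with known words, which is why a subword model can assemble a vector for it.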

In the context of spaCy, word vectors play a crucial role in many NLP tasks. spaCy provides pre-trained word vectors for many languages, which can be accessed using the **vector** attribute of a **Token** object. These word vectors enable spaCy models to perform various tasks such as similarity analysis, entity recognition, text classification, and more.

In [None]:
import spacy
# We need to download a larger model than the one we downloaded last time
!python -m spacy download en_core_web_md

In [3]:
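# Load the medium English model, which ships with word vectors
nlp = spacy.load("en_core_web_md")
with open("data/wiki_us.txt", "r") as f:
    text = f.read()
doc = nlp(text)
# Keep the first sentence for the vector examples further down
sentence1 = list(doc.sents)[0]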

## Word Vectors

Word vectors, also known as word embeddings, are numerical representations of words in a continuous vector space. Each word in a given language is assigned a fixed-size vector, typically with hundreds of dimensions, where the values in each dimension capture different aspects of the word's meaning.

Word vectors are derived from training models on large amounts of text data, such as books, articles, or web pages. These models learn to assign similar vector representations to words that have similar contexts or meanings. The underlying assumption is that words appearing in similar contexts are likely to have similar meanings.
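
The assumption that similar words share similar contexts can be illustrated by counting which words appear near each other. The sketch below is a simplified illustration, not how any specific model is trained. The function `cooccurrence_counts`, the window size, and the two-sentence corpus are all assumptions made for the example.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (target, context) pairs within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[(target, tokens[j])] += 1
    return counts


corpus = ["tom loves to eat chocolate", "tom likes to eat chocolate"]
counts = cooccurrence_counts(corpus, window=2)

# Collect the context words seen around each verb
contexts_loves = {c for (t, c) in counts if t == "loves"}
contexts_likes = {c for (t, c) in counts if t == "likes"}
print(contexts_loves == contexts_likes)  # True
```

Because "loves" and "likes" occur in identical contexts here, a model trained on such statistics has grounds to place them close together.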

Word vectors have become a fundamental component of many NLP applications. They enable machines to capture semantic relationships between words, such as analogies or similarities, and perform various language-related tasks. By representing words as dense vectors, it becomes possible to perform mathematical operations on them, such as vector addition and subtraction, to explore relationships between words.
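
As a rough illustration of vector addition and subtraction, the sketch below uses hand-made two-dimensional vectors. The values in `toy_vectors` are invented for the example (the first dimension loosely stands for "royalty", the second for "gender"); real vectors are learned and have hundreds of dimensions.

```python
import math

# Hypothetical 2-D vectors, invented purely for illustration
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# king - man + woman, computed one dimension at a time
result = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Find the word whose vector points in the most similar direction
nearest = max(toy_vectors, key=lambda word: cosine(result, toy_vectors[word]))
print(result, nearest)  # [1.0, -1.0] queen
```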

Word vectors have several advantages in NLP tasks. They can improve the performance of models in tasks like language modeling, sentiment analysis, machine translation, NER, and document classification. They also help in capturing semantic properties of words, allowing models to generalize better and, with subword-based methods such as FastText, handle out-of-vocabulary words.

Popular algorithms for learning word vectors include Word2Vec, GloVe, and FastText. These algorithms have been successful in capturing word semantics and have pre-trained models available for various languages, making it easier to incorporate word vectors into NLP applications.

### Initial method

Initial methods for creating word vectors in a pipeline take all words in a corpus and convert each one into a single, unique number. These words are then stored in a dictionary that looks like this: {"the": 1, "a": 2}, etc. This is known as a bag of words. This approach, however, only allows a computer to identify unique words. It does not allow a computer to understand their meaning.

Sentences:

Tom loves to eat chocolate.

Tom likes to eat chocolate.

List representation:

1, 2, 3, 4, 5

1, 6, 3, 4, 5
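
The list representation above can be reproduced with a small vocabulary lookup. This is a minimal sketch: the helper names `build_vocab` and `encode` are hypothetical, and tokenization is reduced to stripping the final period and splitting on spaces.

```python
def build_vocab(sentences):
    """Assign each new word the next unused integer, starting at 1."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.rstrip(".").split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentence, vocab):
    """Replace each word in the sentence with its integer ID."""
    return [vocab[word] for word in sentence.rstrip(".").split()]


sentences = ["Tom loves to eat chocolate.", "Tom likes to eat chocolate."]
vocab = build_vocab(sentences)
encoded = [encode(s, vocab) for s in sentences]
print(encoded)  # [[1, 2, 3, 4, 5], [1, 6, 3, 4, 5]]
```

The IDs only reflect the order in which words were first seen, so the gap between 2 and 6 says nothing about meaning.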

Both sentences are nearly identical. The only difference is the degree to which Tom appreciates eating chocolate. If we examine the numbers, however, these two sentences seem quite close, but their semantic similarity is impossible to know for certain. How similar is 2 to 6? The number 6 could represent "hates" as much as it could represent "likes". This is where word vectors come in.

Word vectors transform these one-dimensional bag-of-words representations into multi-dimensional ones by mapping words to a higher-dimensional space using machine learning techniques.
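
With multi-dimensional vectors, the question "how similar are these two words?" has a numeric answer, commonly the cosine of the angle between the vectors. The sketch below uses invented two-dimensional values (`toy_vectors` is hypothetical), so only the ordering of the scores is meaningful, not the numbers themselves.

```python
import math

# Hypothetical 2-D vectors; the first dimension loosely encodes sentiment
toy_vectors = {
    "loves": [0.9, 0.8],
    "likes": [0.7, 0.9],
    "hates": [-0.8, 0.7],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_likes = cosine(toy_vectors["loves"], toy_vectors["likes"])
sim_hates = cosine(toy_vectors["loves"], toy_vectors["hates"])
print(sim_likes > sim_hates)  # True
```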


### Why Word Vectors?

The goal of word vectors is to enable computers to comprehend language numerically, so that they can perform more advanced tasks on textual data. To illustrate this, let's consider the scenario mentioned above. One possible solution to help the computer understand that **2** and **6** are synonymous or have similar meanings could be to provide it with a list of synonyms for each word. While this approach may seem logical, it is not a viable solution, as the failed lookups below show.

In [4]:
# Looking for synonyms in PyDictionary
from PyDictionary import PyDictionary

dictionary=PyDictionary()

text = "Tom loves to eat chocolate"

words = text.split()
for word in words:
    syns = dictionary.synonym(word)
    print(f"Word: {word:<10}, synonyms: {syns[0:5]}\n")

Tom has no Synonyms in the API


TypeError: 'NoneType' object is not subscriptable

In [5]:
from PyDictionary import PyDictionary

dictionary = PyDictionary()

words = ["like", "love"]
for word in words:
    syns = dictionary.synonym(word)
    print(f"Word: {word:<10}, synonyms: {syns[0:5]}\n")

like has no Synonyms in the API


TypeError: 'NoneType' object is not subscriptable
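
Both cells fail for the same reason: the lookup returns `None` when it finds no entry, and slicing `None` raises the `TypeError`. A minimal sketch of a guarded version follows. The helper `first_synonyms` and the offline stand-in `fake_lookup` (with its invented synonym table) are hypothetical and exist only so the example runs without network access.

```python
def first_synonyms(word, lookup, limit=5):
    """Return up to `limit` synonyms, or an empty list when none are found."""
    syns = lookup(word)
    if not syns:  # covers both None and an empty list
        return []
    return syns[:limit]


# Hypothetical stand-in for a synonym lookup, so the sketch runs offline
def fake_lookup(word):
    table = {"love": ["adore", "cherish", "treasure"]}
    return table.get(word)  # returns None for unknown words


for word in ["Tom", "love"]:
    print(f"Word: {word:<10}, synonyms: {first_synonyms(word, fake_lookup)}")
```

In the notebook, the same helper could be called as `first_synonyms(word, dictionary.synonym)`. The guard removes the crash, but it does not change the underlying problem that many words have no usable entry.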

### Appearance of Word Vectors

+ Word vectors have a fixed number of dimensions; the values in each dimension are learned through machine learning techniques.
+ Machine learning models consider various factors such as word frequency, co-occurrence of words within a corpus, and contextual similarities to shape the values of word vectors.
+ Word vectors enable the computer to measure semantic and syntactic similarity between words using numerical values.
+ To represent these relationships numerically, word vectors are often structured as a matrix of matrices, commonly known as a tensor.
+ To make the representation more concise, models flatten the matrix into a one-dimensional array of floating-point (decimal) numbers.
+ The number of dimensions in a word vector corresponds to the number of floating-point values in the flattened array, and it influences the richness and complexity of the representation.
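
One way to picture the points above is a lookup table with one flat row of floats per word. The sketch below is an assumption-laden toy: the names `table`, `row_of`, and `vector_for` are hypothetical, and the four-dimensional values are invented, whereas real models use hundreds of dimensions.

```python
# Hypothetical 4-dimensional vectors, invented for illustration
words = ["dog", "cat", "car"]
table = [
    [0.8, 0.1, 0.3, 0.5],  # dog
    [0.7, 0.2, 0.4, 0.5],  # cat
    [0.1, 0.9, 0.6, 0.0],  # car
]

# Map each word to the index of its row in the table
row_of = {word: i for i, word in enumerate(words)}


def vector_for(word):
    """Look up the flat list of floats stored for `word`."""
    return table[row_of[word]]


n_words = len(table)    # number of rows: one per word
n_dims = len(table[0])  # number of floats per row: the dimensionality
print(n_words, n_dims)  # 3 4
```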

In [6]:
# first word
sentence1[0].vector

array([-7.2681e+00, -8.5717e-01,  5.8105e+00,  1.9771e+00,  8.8147e+00,
       -5.8579e+00,  3.7143e+00,  3.5850e+00,  4.7987e+00, -4.4251e+00,
        1.7461e+00, -3.7296e+00, -5.1407e+00, -1.0792e+00, -2.5555e+00,
        3.0755e+00,  5.0141e+00,  5.8525e+00,  7.3378e+00, -2.7689e+00,
       -5.1641e+00, -1.9879e+00,  2.9782e+00,  2.1024e+00,  4.4306e+00,
        8.4355e-01, -6.8742e+00, -4.2949e+00, -1.7294e-01,  3.6074e+00,
        8.4379e-01,  3.3419e-01, -4.8147e+00,  3.5683e-02, -1.3721e+01,
       -4.6528e+00, -1.4021e+00,  4.8342e-01,  1.2549e+00, -4.0644e+00,
        3.3278e+00, -2.1590e-01, -5.1786e+00,  3.5360e+00, -3.1575e+00,
       -3.5273e+00, -3.6753e+00,  1.5863e+00, -8.1594e+00, -3.4657e+00,
        1.5262e+00,  4.8135e+00, -3.8428e+00, -3.9082e+00,  6.7549e-01,
       -3.5787e-01, -1.7806e+00,  3.5284e+00, -5.1114e-02, -9.7150e-01,
       -9.0553e-01, -1.5570e+00,  1.2038e+00,  4.7708e+00,  9.8561e-01,
       -2.3186e+00, -7.4899e+00, -9.5389e+00,  8.5572e+00,  2.74

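Once a word vector model is trained, we can query it for the nearest neighbors of a word very quickly. The quality of the matches depends on the vectors and vocabulary shipped with the model, as the noisy results below show.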

In [12]:
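# Words most closely related to the word dog
import numpy as np

my_word1 = "dog"

# most_similar returns a (keys, best_rows, scores) tuple
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[my_word1]]]), n=10)
# Convert the returned hash keys back into strings
words = [nlp.vocab.strings[w] for w in ms[0][0]]
scores = ms[2]  # similarity scores, not distances
print(words)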

['dogsbody', 'wolfdogs', 'Baeg', 'duppy', 'pet(s', 'postcanine', 'Kebira', 'uppies', 'Toropets', 'moggie']


In [14]:
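my_word2 = "country"

# most_similar returns a (keys, best_rows, scores) tuple
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[my_word2]]]), n=10)
# Convert the returned hash keys back into strings
words = [nlp.vocab.strings[w] for w in ms[0][0]]
scores = ms[2]  # similarity scores, not distances
print(words)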

['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']
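
Conceptually, a nearest-neighbor query ranks every stored vector by its similarity to the query vector. The sketch below is a brute-force illustration of that idea and not spaCy's implementation. The function name `most_similar` is reused only for readability, the vectors in `toy_vectors` are invented, and excluding the query word from its own results is a choice made for this example.

```python
import math

# Hypothetical 3-D vectors, invented for illustration
toy_vectors = {
    "dog": [0.9, 0.1, 0.2],
    "puppy": [0.8, 0.2, 0.3],
    "cat": [0.7, 0.3, 0.1],
    "country": [0.1, 0.9, 0.4],
    "nation": [0.2, 0.8, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_word, vectors, n=2):
    """Return the n words most similar to `query_word`, best first."""
    query = vectors[query_word]
    scored = [
        (word, cosine(query, vec))
        for word, vec in vectors.items()
        if word != query_word  # skip the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


print([word for word, _ in most_similar("dog", toy_vectors)])  # ['puppy', 'cat']
```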


## Doc Similarity

In spaCy we can do this same thing at the document level. Through word vectors we can calculate the similarity between two documents.

In [9]:
nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761


In [15]:
doc3 = nlp("The Empire State Building is in New York.")

print(doc1, "<->", doc3, doc1.similarity(doc3))

I like salty fries and hamburgers. <-> The Empire State Building is in New York. 0.1766669125394067


In [16]:
doc4 = nlp("I enjoy oranges.")
doc5 = nlp("I enjoy apples.")

print(doc4, "<->", doc5, doc4.similarity(doc5))

I enjoy oranges. <-> I enjoy apples. 0.977570143948367


In [17]:
doc6 = nlp("I enjoy burgers.")

print(doc4, "<->", doc6, doc4.similarity(doc6))

I enjoy oranges. <-> I enjoy burgers. 0.9628306772893752


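Both pairs score high partly because the sentences share the words "I enjoy". The apples pair still scores slightly higher because apples and oranges sit close together in the embedding space, in a cluster of fruit words.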
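
A common way to get a document-level score, and the default behavior for spaCy models with static vectors, is to average the token vectors and compare the averages with cosine similarity. The sketch below imitates that idea with invented three-dimensional vectors. The names `toy_vectors`, `doc_vector`, and the whitespace tokenization are assumptions made for the example.

```python
import math

# Hypothetical 3-D vectors, invented for illustration
toy_vectors = {
    "i": [0.5, 0.5, 0.0],
    "enjoy": [0.6, 0.4, 0.1],
    "oranges": [0.1, 0.2, 0.9],
    "apples": [0.1, 0.3, 0.9],
    "burgers": [0.9, 0.1, 0.3],
}


def doc_vector(text):
    """Average the vectors of all tokens in `text`."""
    tokens = text.lower().rstrip(".").split()
    vectors = [toy_vectors[token] for token in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


oranges_doc = doc_vector("I enjoy oranges.")
sim_apples = cosine(oranges_doc, doc_vector("I enjoy apples."))
sim_burgers = cosine(oranges_doc, doc_vector("I enjoy burgers."))
print(sim_apples > sim_burgers)  # True
```

Because two of the three tokens are shared, both averages end up close together, which mirrors why the real scores above are high for both pairs.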

## Word Similarity

We can also calculate the similarity between two given words, or between a span of words and a single token.

In [11]:
# Similarity of tokens and spans
salty_fries = doc1[2:4]  # a Span covering "salty fries"
hamburgers = doc1[5]     # a single Token
print(salty_fries, "<->", hamburgers, salty_fries.similarity(hamburgers))

salty fries <-> hamburgers 0.6938489079475403
