# **Vector Semantics**

In the real world, everything has a meaning. In natural language, we choose words and sentences based on their meaning and the context. For example, in the sentence "He is very lazy", each word contributes to the overall meaning: the person is not just lazy but very lazy. Here the word "very" emphasizes how lazy the person is, so the meaning carried by a particular word can change according to its context.

Vector semantics is the idea of representing words or sentences as vectors that capture their meaning and their relations to other words or sentences.

# **Words and Vectors**

Simply put, a vector is a list of numbers that has a direction and a magnitude (length). In NLP, we use vectors to encode words so that words with similar meanings tend to point in similar directions. A vector that contains rich information about a word in its context is called an **embedding**.

Words can be represented as vectors using various methods. One approach involves counting the occurrences of words within different contexts and storing this information in a matrix. In this matrix, the columns represent different contexts, such as neighboring words within a certain window, while the rows represent individual words from the vocabulary.

Intuitively, words that frequently occur in similar contexts are likely to have similar meanings. Therefore, by storing the counts of word occurrences according to context in this matrix, we can capture the similarities between words. This allows us to represent words as vectors where similar words are closer together in the vector space, reflecting their shared contextual usage.

$$
\begin{array}{|c|c|c|c|}
\hline
 & \text{Context 1} & \text{Context 2} & \text{Context 3} \\
\hline
\text{Word 1} & 5 & 3 & 0 \\
\text{Word 2} & 1 & 0 & 4 \\
\text{Word 3} & 2 & 1 & 3 \\
\text{Word 4} & 0 & 2 & 1 \\
\hline
\end{array}
$$

$$
\begin{array}{|c|c|c|c|}
\hline
 & \text{Fruit} & \text{Color} & \text{Taste} \\
\hline
\text{Apple} & 10 & 2 & 5 \\
\text{Banana} & 8 & 1 & 4 \\
\text{Orange} & 6 & 3 & 2 \\
\text{Grape} & 4 & 2 & 3 \\
\hline
\end{array}
$$

From the co-occurrence matrix provided:

* The term "apple" has a higher count in the context of "Fruit" compared to "Color" or "Taste", indicating that "apple" is more often associated with the concept of fruit.

* The term "orange" has the highest count in the "Color" context among the four words, while still occurring mostly in the "Fruit" context, suggesting that "orange" is associated with both the fruit and the color of the same name.

* The counts in the matrix provide insights into the associations between words and contexts based on their co-occurrence patterns in the corpus. In this case, the higher counts indicate stronger associations between certain words and specific contexts.

Now if we consider the rows of this matrix, each of which represents a word, we get a vector that represents the meaning of that word, at least on a small scale. This is how word embeddings are generated from context. Modern methods such as Word2Vec and GloVe use more advanced techniques to learn dense embeddings.
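To make "similar words are closer together" concrete, the rows of the fruit table above can be compared with cosine similarity. This is a minimal sketch using NumPy; the helper name `cosine_similarity` and the dictionary `vectors` are illustrative, and the counts are simply copied from the table.

```python
import numpy as np

# Rows of the fruit table: counts in the contexts (Fruit, Color, Taste)
vectors = {
    "apple": np.array([10, 2, 5]),
    "banana": np.array([8, 1, 4]),
    "orange": np.array([6, 3, 2]),
    "grape": np.array([4, 2, 3]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_apple_banana = cosine_similarity(vectors["apple"], vectors["banana"])
sim_apple_orange = cosine_similarity(vectors["apple"], vectors["orange"])
print(f"apple vs banana: {sim_apple_banana:.3f}")
print(f"apple vs orange: {sim_apple_orange:.3f}")
```

With these counts, "apple" comes out closer to "banana" than to "orange", because the counts for "apple" and "banana" are spread across the three contexts in nearly the same proportions.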


## **Co-occurrence count as Embeddings**




In [None]:
import re
from collections import defaultdict

def generate_co_occurrence_embeddings(corpus, window_size = 2):
    tokens = re.findall(r'\b\w+\b', corpus.lower())
    co_occurrence_matrix = defaultdict(lambda: defaultdict(int))

    for i, target_word in enumerate(tokens):
        context = tokens[max(0, i - window_size) : i] + tokens[i+1:i+window_size + 1]
        for context_word in context:
            co_occurrence_matrix[target_word][context_word] += 1

    return co_occurrence_matrix


corpus = "The quick brown fox jumps over the lazy dog. The lazy dog barks at the fox."
embeddings = generate_co_occurrence_embeddings(corpus)

for target_word, context_words in embeddings.items():
    print(target_word, context_words)

the defaultdict(<class 'int'>, {'quick': 1, 'brown': 1, 'jumps': 1, 'over': 1, 'lazy': 3, 'dog': 3, 'barks': 1, 'at': 1, 'fox': 1})
quick defaultdict(<class 'int'>, {'the': 1, 'brown': 1, 'fox': 1})
brown defaultdict(<class 'int'>, {'the': 1, 'quick': 1, 'fox': 1, 'jumps': 1})
fox defaultdict(<class 'int'>, {'quick': 1, 'brown': 1, 'jumps': 1, 'over': 1, 'at': 1, 'the': 1})
jumps defaultdict(<class 'int'>, {'brown': 1, 'fox': 1, 'over': 1, 'the': 1})
over defaultdict(<class 'int'>, {'fox': 1, 'jumps': 1, 'the': 1, 'lazy': 1})
lazy defaultdict(<class 'int'>, {'over': 1, 'the': 3, 'dog': 3, 'barks': 1})
dog defaultdict(<class 'int'>, {'the': 3, 'lazy': 3, 'barks': 1, 'at': 1})
barks defaultdict(<class 'int'>, {'lazy': 1, 'dog': 1, 'at': 1, 'the': 1})
at defaultdict(<class 'int'>, {'dog': 1, 'barks': 1, 'the': 1, 'fox': 1})


## **1. Vectorization**

Vectorization is the process of converting non-numeric data (text, images, documents) into vectors so that it can be processed by machine learning models.

Word embeddings are a type of vectorization in which words are converted to vectors that capture their meaning and the contexts in which they are used.

Different vectorization techniques include:

### **1.1 TF-IDF vectorization**

TF-IDF is a common technique for vectorizing words based on how often they appear in documents. The TF part stands for term frequency, which counts how many times a particular word occurs in a document. It is multiplied by the IDF (inverse document frequency). The document frequency (DF) of a word is the number of documents in which it occurs, and the IDF is the log-scaled ratio of the total number of documents to the DF, so it is large for words that occur in only a few documents. Words that occur in fewer documents therefore receive higher weights. This helps identify the rarer, more informative words rather than words like "is", "and", "or", "with", etc., which are very common in language. For example, in a collection of documents about astronomy, we want to give more importance to terms that occur in few documents, like "Cosmic Microwave Background", than to words like "good", "fly", "rocket", etc.


So TF-IDF is the product of term frequency and inverse document frequency.

Here is how we can calculate the Term-Frequency:

$$\text{tf}(t, d) = \begin{cases}
1 + \log_{10}(\text{count}(t, d)) & \text{if } \text{count}(t, d) > 0 \\
0 & \text{otherwise}
\end{cases}$$

Here we need to handle the case where the count is zero, since we cannot take the log of 0.

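Now we can calculate the inverse document frequency, where $N$ is the total number of documents and $\text{df}_{t}$ is the number of documents that contain the term $t$: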

$$
\text{idf}(t) = \log_{10}\left(\frac{N}{\text{df}_{t}}\right)
$$

TF-IDF:

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)
$$
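The three formulas above can be applied directly to a tiny corpus. The following is a minimal sketch in plain Python; the `tokenize` helper and the function names `tf`, `idf`, and `tf_idf` are illustrative choices, not part of any library. Note that scikit-learn's `TfidfVectorizer` uses a different variant by default (raw counts for tf, a smoothed natural-log idf, and L2 normalization of each row), so its numbers will not match the ones computed here.

```python
import math
from collections import Counter

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Lowercase and strip basic punctuation; a deliberately simple tokenizer
    return [tok.strip(".,?!") for tok in text.lower().split()]

tokenized = [tokenize(doc) for doc in documents]
N = len(tokenized)

# df_t: number of documents that contain term t
df = Counter(term for doc in tokenized for term in set(doc))

def tf(term, doc_tokens):
    count = doc_tokens.count(term)
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(term):
    return math.log10(N / df[term])

def tf_idf(term, doc_tokens):
    term_frequency = tf(term, doc_tokens)
    if term_frequency == 0.0:
        return 0.0  # also avoids dividing by df = 0 for unseen terms
    return term_frequency * idf(term)

# Weights of three terms in the second document
for term in ["the", "document", "second"]:
    print(term, round(tf_idf(term, tokenized[1]), 4))
```

In the second document, "the" gets a weight of 0 because it appears in every document, while "second", which appears in only one document, gets the largest weight of the three.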




In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Sklearn TF-IDF Built-in Vectorizer
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)
tfidf_array = tfidf_matrix.toarray()
tfidf_array

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

### **1.2 Pointwise Mutual Information (PMI)**

Pointwise Mutual Information is another weighting scheme for vectorization. It measures how much more often two words co-occur in a large corpus than we would expect if they were independent. This is useful because in some contexts certain terms co-occur far more often than chance, which gives important information about their meaning.

To compute it, we compare the probability of the two words co-occurring with the probability we would expect from their individual probabilities if they were unrelated. This is given by:

\begin{equation}
\text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)P(c)}
\end{equation}

Here $P(w, c)$ is the probability of the two words co-occurring, and $P(w)P(c)$ is the probability of the two words co-occurring if they were unrelated (independent).

PMI values range from negative to positive infinity. But negative PMI values tend to be unreliable unless the corpus is very large. So a better option is to replace negative values with zero, which covers pairs with few or no co-occurrences. This gives Positive PMI (PPMI):

\begin{equation}
\text{PPMI}(w, c) = \max\left(\log_2 \frac{P(w, c)}{P(w)P(c)}, 0\right)
\end{equation}
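As a small worked example (the counts are invented purely for illustration), suppose a corpus yields 100 word-context observations in total, the word $w$ appears in 20 of them, the context $c$ appears in 25 of them, and the pair $(w, c)$ appears together 10 times:

$$
\begin{aligned}
P(w, c) &= \frac{10}{100} = 0.1, \quad P(w) = \frac{20}{100} = 0.2, \quad P(c) = \frac{25}{100} = 0.25 \\
\text{PMI}(w, c) &= \log_2 \frac{0.1}{0.2 \times 0.25} = \log_2 2 = 1
\end{aligned}
$$

The pair co-occurs twice as often as independence would predict, so PMI is positive. If instead the pair appeared together only 2 times, then $P(w, c) = 0.02$ and

$$
\text{PMI}(w, c) = \log_2 \frac{0.02}{0.05} = \log_2 0.4 \approx -1.32, \qquad \text{PPMI}(w, c) = \max(-1.32, 0) = 0
$$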



In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Term-document counts: rows are documents (used as contexts), columns are terms
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(documents).toarray().astype(float)

total_count = counts.sum()
p_wc = counts / total_count                             # joint P(w, c)
p_c = counts.sum(axis=1, keepdims=True) / total_count   # P(c): share of each document
p_w = counts.sum(axis=0, keepdims=True) / total_count   # P(w): share of each term

# PMI = log2(P(w, c) / (P(w) P(c))); zero counts give -inf, which PPMI clips to 0
with np.errstate(divide="ignore"):
    pmi_matrix = np.log2(p_wc / (p_w * p_c))
ppmi_matrix = np.maximum(pmi_matrix, 0)

print(np.round(ppmi_matrix, 2))


### **1.3 Embeddings**

Embeddings are dense vector representations of words with n dimensions that capture the semantic meaning of each word. They are useful in NLP for representing what words mean and the contexts in which they can occur. Embeddings generally work better than sparse representations, since sparse vectors are very long and mostly zero, which makes computation less efficient and makes it harder for models to generalize. Embeddings are rich in the sense that the values in the embedding vector jointly encode information about the word and the contexts in which it can appear.

#### **1.3.1 Word2vec**

Word2vec is one of the most popular word embedding methods in NLP for efficiently capturing the semantic meaning of words. Word2vec produces static embeddings: once it has learned a word's meaning from its contexts, the embedding does not change when we use it in different NLP tasks, unlike the contextual (dynamic) embeddings in BERT.

First we train a model (logistic regression or a simple neural network) on training data consisting of a corpus of text. The model learns in a self-supervised manner: the labels come from the text itself, since each word is paired with the words that surround it within a context window. The model can be trained either to predict a word from its context or to predict the context from a word. These two variants are called Continuous Bag of Words (CBOW) and Skip Gram respectively.

After training, we do not care about the predictions the model makes. Instead, we keep the weights learned from the data and use them as embeddings.
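The difference between the two variants is easiest to see in the training pairs they produce. The sketch below builds both kinds of pairs from one tokenized sentence; the function name `training_pairs` and the example sentence are illustrative assumptions, not part of Word2vec itself.

```python
def training_pairs(tokens, window_size=2):
    """Build (input, label) pairs for CBOW and Skip Gram from one token list."""
    cbow_pairs = []       # (list of context words, target word)
    skipgram_pairs = []   # (target word, single context word)
    for i, target in enumerate(tokens):
        # Words within window_size positions to the left and right of the target
        context = tokens[max(0, i - window_size):i] + tokens[i + 1:i + window_size + 1]
        if not context:
            continue
        cbow_pairs.append((context, target))
        skipgram_pairs.extend((target, c) for c in context)
    return cbow_pairs, skipgram_pairs

tokens = "the quick brown fox jumps".split()
cbow_pairs, skipgram_pairs = training_pairs(tokens, window_size=1)
print(cbow_pairs[1])        # (['the', 'brown'], 'quick')
print(skipgram_pairs[:3])   # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

CBOW gets one example per position (all context words in, one target out), while Skip Gram gets one example per target-context pair.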

##### **1.3.1.1 Skip Gram Method**

In the Skip Gram method, the idea is to learn to predict the context words given the target word. This is done using a classifier that learns to distinguish between the positive class and the negative class of words in context.

* Positive examples pair the target word with the context words inside a fixed context window
* Negative examples pair the target word with randomly chosen words that are not among its context words

So a simple classifier (logistic regression or a simple neural network) can learn to distinguish between positive and negative examples. To do so, it has to learn which words tend to occur near each other. The internal representation (weights) of the model therefore captures the semantic meaning of a word, along with the contexts in which it is likely and unlikely to occur. We use these learned weights as embeddings.

**Skip Gram Mathematical Idea**

Our goal is to train a classifier to differentiate between positive examples (true context words near the target word) and negative examples (words that are not true context words for the target word).

The probability that word $c$ is a real context word for the target word $w$ is written as:

$$P(+|w, c)$$

The probability that word $c$ is not a real context word for $w$ is just 1 minus the probability that it is:

$$P(-|w, c) = 1 - P(+|w, c)$$

To compute the probability, we need a measure of similarity between the words. The simple idea is that a context word which frequently appears near the target word should have an embedding similar to the target's, and a simple way to measure that similarity is the dot product.

$$\text{Similarity}(w, c) \approx c \cdot w$$

The dot product of two word vectors gives a single number, which can be large, small, or negative depending on the embeddings. To turn it into a probability we need to squash it into the range 0 to 1, and a natural choice is the sigmoid function.

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Applying the sigmoid to the dot product of the target and context word gives:

$$P(+|w, c) = \sigma(c \cdot w) = \frac{1}{1 + e^{-c \cdot w}}$$

Similarly, for the negative class:

$$P(-|w, c) = 1 - P(+|w, c) = 1 - \sigma(c \cdot w) = \sigma(-c \cdot w) = \frac{1}{1 + e^{c \cdot w}}$$
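To see these probabilities with actual numbers, here is a minimal NumPy sketch. The three 4-dimensional vectors are hypothetical values chosen by hand for illustration; they are not learned embeddings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 4-dimensional embeddings, chosen by hand for illustration
w = np.array([0.5, -0.2, 0.1, 0.8])        # target word
c_pos = np.array([0.6, -0.1, 0.0, 0.9])    # context word pointing in a similar direction
c_neg = np.array([-0.7, 0.3, 0.2, -0.5])   # word pointing in a roughly opposite direction

for name, c in [("similar", c_pos), ("dissimilar", c_neg)]:
    p_positive = sigmoid(np.dot(c, w))     # P(+ | w, c)
    p_negative = sigmoid(-np.dot(c, w))    # P(- | w, c) = 1 - P(+ | w, c)
    print(f"{name}: P(+)={p_positive:.3f}, P(-)={p_negative:.3f}")
```

The vector that points in a similar direction to `w` has a positive dot product and so a probability above 0.5 of being a true context word, while the other has a negative dot product and a probability below 0.5. In both cases the two probabilities sum to 1.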

In [None]:
# Skip Gram Embedding
import re

import numpy as np
import tensorflow as tf

with open("/content/drive/MyDrive/Natural-Language-Processing/internet_archive_scifi_v3.txt", 'r') as f:
    corpus_text = f.read()

# Split the first 100,000 characters into sentences, then into whitespace tokens
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', corpus_text[:100000])

tokenized_corpus = [sentence.split() for sentence in sentences]

vocab = set()
for sentence in tokenized_corpus:
    vocab.update(sentence)
vocab_size = len(vocab)

word2idx = {word:idx for idx, word in enumerate(vocab)}
idx2word = {idx:word for word, idx in word2idx.items()}

def generate_training_data(tokenized_corpus, word2idx, vocab_size, window_size, num_neg_samples):
    # X_train holds target word ids; each row of y_train is [word id, label],
    # with label 1 for a true context word and 0 for a sampled negative word
    X_train = []
    y_train = []

    for sentence in tokenized_corpus:
        for idx, target_word in enumerate(sentence):
            target_id = word2idx[target_word]
            start = max(0, idx - window_size)
            end = min(len(sentence), idx + window_size + 1)
            for pos in range(start, end):
                if pos == idx:
                    continue  # skip the target position itself
                context_id = word2idx[sentence[pos]]
                X_train.append(target_id)
                y_train.append([context_id, 1])

                # Draw random words (other than the true context word) as negatives
                for _ in range(num_neg_samples):
                    negative_id = np.random.randint(0, vocab_size)
                    while negative_id == context_id:
                        negative_id = np.random.randint(0, vocab_size)

                    X_train.append(target_id)
                    y_train.append([negative_id, 0])

    return np.array(X_train), np.array(y_train)

X_train, y_train = generate_training_data(tokenized_corpus, word2idx, vocab_size, window_size = 2, num_neg_samples = 5)

In [None]:
embedding_dim = 100
learning_rate = 0.001
epochs = 20

inputs = tf.keras.layers.Input(shape=(1, ))
embeddings = tf.keras.layers.Embedding(vocab_size, embedding_dim)(inputs)
flatten = tf.keras.layers.Flatten()(embeddings)

# Output unit j computes sigmoid(c_j . w): the probability that word j is a
# true context word of the target w (column j of the kernel plays the role of c_j)
outputs = tf.keras.layers.Dense(vocab_size, activation='sigmoid', use_bias=False)(flatten)

def negative_sampling_loss(y_true, y_pred):
    # Binary cross-entropy evaluated only on the sampled word of each example
    word_ids = tf.cast(y_true[:, 0], tf.int32)
    labels = tf.cast(y_true[:, 1], y_pred.dtype)
    mask = tf.one_hot(word_ids, depth=vocab_size, dtype=y_pred.dtype)
    prob = tf.clip_by_value(tf.reduce_sum(y_pred * mask, axis=1), 1e-7, 1.0 - 1e-7)
    return -(labels * tf.math.log(prob) + (1.0 - labels) * tf.math.log(1.0 - prob))

model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss=negative_sampling_loss, optimizer=tf.keras.optimizers.Adam(learning_rate))

# Train Word2Vec Skip-gram model
model.fit(X_train, y_train, batch_size=64, epochs=epochs)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x7cc61933b220>

In [None]:
embeddings

<KerasTensor: shape=(None, 1, 100) dtype=float32 (created by layer 'embedding')>

In [None]:
embedding_layer = model.layers[1]  # Assuming the embedding layer is the second layer (index 1) in your model

# Get the learned weights of the embedding layer
embeddings_weights = embedding_layer.get_weights()[0]

embeddings_weights[0]

array([-0.3009966 , -0.7295156 , -0.6519991 , -0.98750925,  0.20136711,
        0.299907  , -0.03875716, -0.98883086, -0.10912559, -0.6644671 ,
       -0.23029837,  0.24951251,  0.5543495 , -0.15299556, -0.62965393,
        0.66754645,  1.2575268 ,  0.03683195, -0.05611789, -1.0504594 ,
        0.69487596, -0.50810206, -0.0940849 , -0.6556955 , -0.5689909 ,
        0.45712554, -0.2510039 ,  0.3444879 , -1.1509255 , -0.06846295,
        0.06436139,  0.3616584 , -1.4464504 , -0.5124545 ,  1.2238147 ,
        0.12293353,  0.5904486 ,  0.14293973,  0.3939653 , -0.40347943,
       -0.31042683, -0.49271733,  1.3926729 ,  0.66801065,  0.47947583,
        0.07961302,  0.2502459 , -0.27600524, -0.839901  ,  0.05820488,
       -0.59180176, -1.3522875 , -0.04737721,  0.9631013 ,  0.6254041 ,
        1.467144  , -1.2824275 ,  0.5162412 , -1.6171275 , -0.08007313,
        0.04449471, -0.35340053,  1.4446166 , -0.79094684,  0.38393384,
        0.794485  ,  0.20135272,  0.9583429 , -0.36904293, -0.36