<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# NLP Basics

**Transformers**

&copy; Dr. Yves J. Hilpisch

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## Imports

In [None]:
!git clone https://github.com/tpq-classes/natural_language_processing.git
import sys
sys.path.append('natural_language_processing')


In [None]:
!pip install gensim

In [None]:
import random
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True)

## Transformers

Transformer models date back to the seminal 2017 paper "Attention Is All You Need":

https://arxiv.org/abs/1706.03762

_From ChatGPT_:

Transformers in Natural Language Processing (NLP) are a type of neural network architecture designed to process sequences of data, like text, by focusing on the relationship between words in a sentence. They use something called self-attention to understand how different words in a sentence are related to each other, regardless of their position. This allows them to capture the context of words in a way that previous models like RNNs or LSTMs struggled with.

**Key Concept: Self-Attention**

Self-attention allows the model to weigh the importance of different words when analyzing each word in a sentence. For example, in the sentence "The cat sat on the mat," the word "mat" is more related to "sat" than to "the." The transformer can focus on those important relationships even if the sentence becomes longer.

**Transformer Architecture Overview (High Level):**

Encoder: It reads the input sentence and creates a representation of each word by looking at the entire sentence (using self-attention).

Decoder: It processes this representation to generate an output (in tasks like machine translation).

Self-Attention: This mechanism helps the transformer model decide which words in the sentence are most important for each word.
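
The overview above does not say how attention is computed inside the encoder and decoder. In the paper linked above, the scores come from query, key, and value projections of the input and are scaled by the square root of the key dimension. The following is a minimal sketch of that idea, not a full transformer layer: the matrices `W_q`, `W_k`, and `W_v` are random stand-ins for weights that would be learned, and the function name `scaled_dot_product_attention` is chosen here for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V with a row-wise softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))      # 3 tokens with 4 features each (made-up data)
W_q = rng.normal(size=(4, 2))    # hypothetical projections; learned in practice
W_k = rng.normal(size=(4, 2))
W_v = rng.normal(size=(4, 2))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(weights.sum(axis=1))  # each row of attention weights sums to 1
print(output.shape)         # one 2-dimensional output vector per token
```

The examples in the rest of this notebook skip the projections and use the input vectors directly as queries, keys, and values.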

## Simple Example

Example input: a simple sentence represented by 3 words (each word is a vector). Let's assume each word is represented by a vector of 4 values (features). In practice, these are learned embeddings, but we'll use small hand-picked vectors here for simplicity.

In [None]:
sentence = np.array([[1, 0, 1, 0],    # Word 1
                     [0, 2, 0, 2],    # Word 2
                     [1, 1, 1, 1]])   # Word 3

Step 1: Compute the dot product of the sentence matrix with its transpose to get attention scores between words. This gives a 3 x 3 matrix in which element (i, j) is the dot product of word i and word j, used here as a measure of the "importance" of word j for word i.

In [None]:
attention_scores = np.dot(sentence, sentence.T)

In [None]:
attention_scores
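
The scores for these three vectors are small enough to check by hand: entry (i, j) is the dot product of row i and row j. The sketch below repeats the computation with the `@` operator, which is equivalent to `np.dot` for 2-D arrays, and shows the values one should expect.

```python
import numpy as np

sentence = np.array([[1, 0, 1, 0],    # Word 1
                     [0, 2, 0, 2],    # Word 2
                     [1, 1, 1, 1]])   # Word 3

scores = sentence @ sentence.T
print(scores)
# [[2 0 2]
#  [0 8 4]
#  [2 4 4]]

# The matrix is symmetric because dot(a, b) == dot(b, a)
print(np.array_equal(scores, scores.T))  # True
```

Words 1 and 2 share no non-zero features, so their score is 0. The diagonal holds each vector's squared length, which is why word 2, the longest vector, gets the largest self-score.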

Step 2: Normalize the attention scores row by row using softmax to get the attention weights. After this step, each row sums to 1.


In [None]:
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)

In [None]:
attention_weights = softmax(attention_scores)

In [None]:
attention_weights

In [None]:
attention_weights.sum(axis=1)
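
A softmax that exponentiates the raw scores works for the small values here, but `np.exp` overflows in float64 once the scores get large. A common remedy is to subtract the row maximum first, which leaves the result unchanged. The following sketch illustrates this; the name `stable_softmax` and the test arrays are made up for the example.

```python
import numpy as np

def stable_softmax(x, axis=-1):
    """Softmax that subtracts the maximum along `axis` before exponentiating."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=axis, keepdims=True)

small = np.array([[2.0, 0.0, 2.0],
                  [0.0, 8.0, 4.0]])
large = small + 1000.0   # np.exp(1000.0) alone would overflow to inf

print(stable_softmax(small).round(3))
# Adding a constant to every score in a row does not change the softmax
print(np.allclose(stable_softmax(small), stable_softmax(large)))  # True
```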

Step 3: Multiply the attention weights by the original sentence vectors to create a new representation for each word based on the context of the entire sentence.

In [None]:
new_sentence_representation = np.dot(attention_weights, sentence)

In [None]:
new_sentence_representation

**Why Self-Attention Is Powerful**

_Context Awareness_: Each word's new representation is a weighted combination of all words in the sentence (including itself), so it carries context from the whole sentence no matter how far apart the related words are.

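_Parallelization_: Unlike RNNs or LSTMs, which process tokens one after another, transformers compute attention for all positions at once with matrix operations. This makes them much easier to parallelize on modern hardware, although the cost of full self-attention grows quadratically with the sequence length.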
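
One consequence of ignoring position is worth seeing in code: the plain self-attention from the simple example has no notion of word order at all. Reordering the input words only reorders the output rows, which is why the original paper adds positional encodings to the embeddings. The sketch below demonstrates this; `simple_self_attention` is a helper name introduced here that bundles the three steps from above.

```python
import numpy as np

def simple_self_attention(X):
    """Scores, row-wise softmax, and weighted sum, as in the simple example."""
    scores = X @ X.T
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ X

X = np.array([[1, 0, 1, 0],
              [0, 2, 0, 2],
              [1, 1, 1, 1]], dtype=float)
perm = [2, 0, 1]   # an arbitrary reordering of the three words

out = simple_self_attention(X)
out_perm = simple_self_attention(X[perm])

# Shuffling the inputs shuffles the outputs in exactly the same way
print(np.allclose(out_perm, out[perm]))  # True
```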

## Another Example

In [None]:
# Define a simple sequence of 4 input vectors (each vector represents a word or token)
# In this case, we are using 3-dimensional vectors for simplicity
inputs = np.array([[1, 0, 1],    # Input 1
                   [0, 1, 0],    # Input 2
                   [1, 1, 1],    # Input 3
                   [0, 0, 1]])   # Input 4

In [None]:
# Step 1: Compute attention scores (similarity of each input with a single query vector)
# Unlike the first example, the inputs are not compared with each other but with one attention vector (query)
attention_vector = np.array([1, 1, 1])  # This can be learned during training

In [None]:
attention_vector

In [None]:
# Compute attention scores for each input by taking the dot product with the attention vector
attention_scores = np.dot(inputs, attention_vector)

In [None]:
attention_scores

In [None]:
# Step 2: Normalize the attention scores using softmax to get the attention weights
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

In [None]:
attention_weights = softmax(attention_scores)

In [None]:
attention_weights

In [None]:
attention_weights.sum()

In [None]:
# Step 3: Compute the weighted sum of the inputs using the attention weights
# This results in a context vector that captures the important information from the sequence
context_vector = np.dot(attention_weights, inputs)

In [None]:
context_vector

In [None]:
# Print the attention scores, weights, and the resulting context vector
print("Inputs:\n", inputs)
print("\nAttention Scores:\n", attention_scores)
print("\nAttention Weights (after softmax):\n", attention_weights)
print("\nContext Vector (weighted sum of inputs):\n", context_vector)
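
Since the attention vector is described as learnable, it can help to see how a different query changes the result. The sketch below wraps the three steps in a helper, `attention_pool` (a name introduced here), and compares the query from the example with a hypothetical second query that only looks at the second feature.

```python
import numpy as np

inputs = np.array([[1, 0, 1],    # Input 1
                   [0, 1, 0],    # Input 2
                   [1, 1, 1],    # Input 3
                   [0, 0, 1]])   # Input 4

def attention_pool(inputs, query):
    """Return softmax weights over the inputs and their weighted sum."""
    scores = inputs @ query
    weights = np.exp(scores - scores.max())  # shift by the max for stability
    weights = weights / weights.sum()
    return weights, weights @ inputs

# Query from the example: the scores are 2, 1, 3, 1, so Input 3 dominates
w_a, c_a = attention_pool(inputs, np.array([1, 1, 1]))
# Hypothetical alternative query: Inputs 2 and 3 tie, Inputs 1 and 4 fade out
w_b, c_b = attention_pool(inputs, np.array([0, 3, 0]))

print(w_a.round(3), c_a.round(3))
print(w_b.round(3), c_b.round(3))
```

With the original query, Input 3 receives roughly 61% of the weight. With the second query, the weight is split evenly between Inputs 2 and 3, the only inputs with a non-zero second feature.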

## TF-IDF Example

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Step 1: Create a small set of documents (texts)
documents = [
    "The cat sat on the mat",
    "The dog barked at the cat",
    "The bird flew over the tree",
    "The dog chased the bird",
    "The cat climbed the tree"
]

In [None]:
# Step 2: Train a TF-IDF model on the documents
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(documents).toarray()

In [None]:
# Print the TF-IDF vectors for reference
print("TF-IDF Vectors:\n", tfidf_vectors.round(3))

In [None]:
# Step 3: Choose one document as the "query" document (we'll calculate attention relative to this)
# Let's take the first document: "The cat sat on the mat"
query_vector = tfidf_vectors[0]  # keep full precision; round only when displaying

In [None]:
query_vector

In [None]:
# Step 4: Calculate attention scores by computing the dot product between the query and every document (including the query itself)
attention_scores = np.dot(tfidf_vectors, query_vector)

In [None]:
attention_scores.round(3)

In [None]:
# Step 5: Normalize the attention scores using softmax to get the attention weights
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

In [None]:
attention_weights = softmax(attention_scores)

In [None]:
attention_weights.round(3)

In [None]:
attention_weights.sum()

In [None]:
# Step 6: Calculate the context vector as the weighted sum of all TF-IDF vectors
context_vector = np.dot(attention_weights, tfidf_vectors)

In [None]:
context_vector

In [None]:
# Print results
print("\nAttention Scores:\n", attention_scores)
print("\nAttention Weights (after softmax):\n", attention_weights)
print("\nContext Vector (weighted sum of TF-IDF vectors):\n", context_vector)
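
Two properties of this example are easy to miss. `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the dot products are cosine similarities between 0 and 1, and a softmax over such a narrow range produces fairly flat weights. One way to sharpen them is to divide the scores by a temperature below 1. The sketch below illustrates both points; `softmax_with_temperature` and the value `0.1` are choices made for this illustration, not part of the original example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat",
    "The dog barked at the cat",
    "The bird flew over the tree",
    "The dog chased the bird",
    "The cat climbed the tree"
]
vectors = TfidfVectorizer().fit_transform(documents).toarray()

# Every row has unit length, so dot products equal cosine similarities
print(np.allclose(np.linalg.norm(vectors, axis=1), 1.0))  # True

scores = vectors @ vectors[0]   # similarity of each document to the first one

def softmax_with_temperature(x, temperature=1.0):
    """Softmax of x / temperature; smaller temperatures give sharper weights."""
    z = (x - x.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

flat = softmax_with_temperature(scores)         # same as a plain softmax
sharp = softmax_with_temperature(scores, 0.1)   # hypothetical temperature
print(flat.round(3))
print(sharp.round(3))
```

In both cases the query document itself receives the largest weight, because its cosine similarity with itself is 1.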

## Word2Vec Example

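In this example, we'll use pre-trained word embeddings loaded through gensim. Note that the model loaded below, `glove-wiki-gigaword-50`, contains 50-dimensional GloVe vectors rather than Word2Vec vectors. gensim exposes both through the same `KeyedVectors` interface, so the code works the same way and the variable name `word2vec_model` is kept. We will follow the same process as before:

1. Load the pre-trained embedding model.
2. Convert a sentence into word embeddings using this model.
3. Apply the self-attention mechanism on the embeddings to create new word representations.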


In [None]:
from pprint import pprint

In [None]:
import gensim.downloader as api
# pprint(api.info())

In [None]:
# Load pre-trained 50-dimensional GloVe vectors via the gensim downloader (returned as KeyedVectors)
word2vec_model = api.load('glove-wiki-gigaword-50')

In [None]:
similar_words = word2vec_model.most_similar('cat', topn=5)
similar_words

In [None]:
# Step 1: Create a sentence and split it into lowercase tokens
sentence = 'The cat sat on the mat'.lower().split()

In [None]:
# Alternative sentence to try (uncomment to use it instead)
# sentence = 'The bird flew over the tree'.lower().split()

In [None]:
# Fetch the pre-trained embedding for each word in the sentence
# Note: If a word is not in the model's vocabulary, we assign a zero vector
word_vectors = []
for word in sentence:
    if word in word2vec_model:
        word_vectors.append(word2vec_model[word])
    else:
        # Zero vector with the model's embedding dimension (50 for this model)
        word_vectors.append(np.zeros(word2vec_model.vector_size))

In [None]:
word_vectors = np.array(word_vectors)

In [None]:
# word_vectors.round(3)

In [None]:
# Step 2: Compute the dot product of the sentence matrix with itself (transpose)
attention_scores = np.dot(word_vectors, word_vectors.T)

In [None]:
attention_scores.round(3)

In [None]:
# Step 3: Normalize the attention scores using softmax
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)

In [None]:
attention_weights = softmax(attention_scores)

In [None]:
attention_weights.round(3)

In [None]:
attention_weights.sum(axis=1)

In [None]:
# Step 4: Multiply the attention weights by the original sentence vectors
new_sentence_representation = np.dot(attention_weights, word_vectors)

In [None]:
# new_sentence_representation.round(3)

In [None]:
# Step 5: Show the sentence, the attention weights, and the new word vectors after self-attention
print(f"Original sentence: {' '.join(sentence)}")
print("\nAttention Weights (after softmax):\n", attention_weights)

print("\nNew Sentence Representation (first 5 values of each vector):")
for i, word in enumerate(sentence):
    print(f"New vector for '{word}': {new_sentence_representation[i][:5]}")
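
With 50-dimensional embeddings, raw dot products can become large, and the softmax then puts almost all weight on the largest score in each row. This is the motivation for dividing the scores by the square root of the dimension in scaled dot-product attention. The sketch below uses random vectors as a stand-in for real embeddings, so that it runs without downloading a model; how strongly the effect shows up with the actual pre-trained vectors depends on their norms.

```python
import numpy as np

rng = np.random.default_rng(42)
d = 50
vectors = rng.normal(size=(6, d))   # stand-in for six 50-dimensional embeddings

def row_softmax(x):
    """Row-wise softmax with the row maximum subtracted for stability."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

scores = vectors @ vectors.T
unscaled = row_softmax(scores)
scaled = row_softmax(scores / np.sqrt(d))

# Weight that each vector places on itself, without and with scaling
print(np.diag(unscaled).round(3))
print(np.diag(scaled).round(3))
```

Scaling lowers the self-weights and leaves more room for other words to contribute to each new representation.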
