<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# NLP Basics

**Word Embeddings**

&copy; Dr. Yves J. Hilpisch

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## Imports

In [None]:
!git clone https://github.com/tpq-classes/natural_language_processing.git
import sys
sys.path.append('natural_language_processing')


In [None]:
import numpy as np
import pandas as pd

## Word2Vec

_From ChatGPT._

Word2Vec is a popular technique used in natural language processing (NLP) to create word embeddings, which are dense vector representations of words. Developed by a team of researchers at Google led by Tomas Mikolov in 2013, Word2Vec models learn to map words into a continuous vector space (typically a few hundred dimensions, far fewer than the vocabulary size) where semantically similar words are located close to each other.

### Key Concepts of Word2Vec

1. **Distributed Representations**: Unlike traditional one-hot encoding, which represents words as sparse vectors with many dimensions (one per unique word) and no meaningful distances between them, Word2Vec creates dense vectors where the semantic relationships between words are captured in the vector space.

2. **Training Approaches**:
   - **Continuous Bag of Words (CBOW)**: Predicts the target word based on the context of surrounding words. It uses a window of words around the target word to predict the target word itself.
   - **Skip-gram**: Predicts the surrounding context words based on the target word. Given a word, it tries to predict the words in its neighborhood.
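
The contrast between one-hot and dense representations can be illustrated with a minimal sketch. The dense vectors below are hand-picked, hypothetical values (not trained embeddings); they only serve to show that distances between one-hot vectors carry no information, while distances between dense vectors can.

```python
import numpy as np

vocab = ["cat", "dog", "car"]

# One-hot encoding: one dimension per vocabulary word.
one_hot = np.eye(len(vocab))

# Hypothetical dense vectors, hand-picked for illustration (not trained).
dense = np.array([
    [0.9, 0.8],   # cat
    [0.8, 0.9],   # dog
    [-0.7, 0.1],  # car
])

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(u - v))

# Every pair of distinct one-hot vectors is equally far apart.
one_hot_cat_dog = euclidean(one_hot[0], one_hot[1])
one_hot_cat_car = euclidean(one_hot[0], one_hot[2])

# Dense vectors can place "cat" nearer to "dog" than to "car".
dense_cat_dog = euclidean(dense[0], dense[1])
dense_cat_car = euclidean(dense[0], dense[2])

print("one-hot:", one_hot_cat_dog, one_hot_cat_car)
print("dense:  ", dense_cat_dog, dense_cat_car)
```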

### How Word2Vec Works

1. **Input**: A large corpus of text.
2. **Training**: The model is trained on the corpus to predict context words from a target word (skip-gram) or a target word from context words (CBOW).
3. **Output**: A set of word vectors where each word is represented by a dense vector of real numbers.

### Benefits of Word2Vec

- **Semantic Relationships**: Words with similar meanings are close together in the vector space. For example, "king" - "man" + "woman" is close to "queen".
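- **Efficient**: Word2Vec can be trained efficiently on large corpora using stochastic gradient descent together with approximations such as negative sampling or hierarchical softmax, which avoid computing a full softmax over the vocabulary.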
- **Generalization**: The vectors can be used in various downstream NLP tasks, improving their performance by providing meaningful word representations.
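
The "king" - "man" + "woman" arithmetic mentioned above can be sketched with toy vectors. The two-dimensional vectors here are hypothetical and hand-constructed (roughly "royalty" and "gender" axes), so the result is exact by design; trained embeddings only approximate this behavior. Excluding the query words from the candidates is an assumption that mirrors common practice.

```python
import numpy as np

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Vector arithmetic for the analogy.
query = vectors["king"] - vectors["man"] + vectors["woman"]

# Rank the remaining words by cosine similarity to the query vector.
exclude = {"king", "man", "woman"}
candidates = {w: v for w, v in vectors.items() if w not in exclude}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)
```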

### Example in Python using Gensim

Here's how you can use the Gensim library to create Word2Vec embeddings:

1. **Install Gensim**:
   ```bash
   pip install gensim
   ```

2. **Train a Word2Vec Model**:
   ```python
   from gensim.models import Word2Vec

   # Sample sentences
   sentences = [
       ["cat", "sat", "on", "the", "mat"],
       ["dog", "barked", "at", "the", "mailman"],
       ["fish", "swims", "in", "the", "water"]
   ]

   # Train Word2Vec model
   model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)

   # Get embeddings for a word
   cat_vector = model.wv['cat']
   print("Embedding for 'cat':", cat_vector)
   
   # Find similar words
   similar_to_cat = model.wv.most_similar('cat')
   print("Words similar to 'cat':", similar_to_cat)
   ```

### Explanation

- **Training Data**: We use a small set of sample sentences.
- **Word2Vec Model**: We create a Word2Vec model using these sentences.
  - `vector_size`: The size of the word vectors.
  - `window`: The maximum distance between the current and predicted word within a sentence.
  - `min_count`: Ignores all words with a total frequency lower than this.
  - `sg`: Training algorithm, 0 for CBOW (Continuous Bag of Words), and 1 for skip-gram.
- **Get Embeddings**: We retrieve the embeddings for specific words like 'cat'.
- **Find Similar Words**: We find words similar to 'cat' based on the trained embeddings.

### Applications of Word2Vec

1. **Text Classification**: Improved feature representations lead to better classification performance.
2. **Semantic Analysis**: Understanding relationships between words.
3. **Machine Translation**: Capturing the semantic meaning of words helps in translating sentences more accurately.
4. **Recommendation Systems**: Finding similar items or content based on word embeddings.
5. **Information Retrieval**: Improving search results by understanding the semantic context of queries.

Word2Vec has been a foundational technique in NLP. It was followed by other embedding methods such as GloVe and FastText, and later by contextual embeddings such as BERT.

[Note: At the time of writing, `gensim` requires `scipy` version 1.12; it no longer works with version 1.13.]

In [None]:
from gensim.models import Word2Vec

In [None]:
# Sample sentences
sentences = [
   ["cat", "sat", "on", "the", "mat"],
   ["dog", "barked", "at", "the", "mailman"],
   ["fish", "swims", "in", "the", "water"]
]

In [None]:
# Train Word2Vec model
model = Word2Vec(sentences, vector_size=3,
                 window=2, min_count=1, sg=0)

In [None]:
# Get embeddings for a word
cat_vector = model.wv['cat']
print("Embedding for 'cat':", cat_vector)

In [None]:
# Find similar words
similar_to_cat = model.wv.most_similar('cat')
print("Words similar to 'cat':", similar_to_cat)

In [None]:
# Get embeddings for a word
dog_vector = model.wv['dog']
print("Embedding for 'dog':", dog_vector)

In [None]:
# Find similar words
similar_to_dog = model.wv.most_similar('dog')
print("Words similar to 'dog':", similar_to_dog)

## Distance and Word Similarity

In [None]:
python_snippets = [
    "Python is a versatile language for web development, data analysis, and automation.",
    "Use Python's libraries like NumPy and Pandas for efficient data manipulation.",
    "Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming.",
    "The Python community offers extensive documentation and a wealth of online resources.",
    "Python's syntax is designed to be readable and straightforward, making it beginner-friendly.",
    "Django and Flask are popular frameworks for developing web applications in Python.",
    "Automate repetitive tasks with Python scripts and save time in your workflow."
]

In [None]:
nlp_snippets = [
    "Natural Language Processing (NLP) enables computers to understand and process human language.",
    "NLP is used in applications like sentiment analysis, chatbots, and machine translation.",
    "Tokenization is a fundamental step in NLP, breaking text into meaningful units.",
    "Named Entity Recognition (NER) identifies proper nouns in text, such as names and locations.",
    "Vectorization converts text data into numerical form for machine learning models.",
    "Popular NLP libraries include NLTK, SpaCy, and Hugging Face Transformers.",
    "NLP combines computational linguistics and machine learning for language understanding."
]

In [None]:
llm_snippets = [
    "Large Language Models (LLMs) are advanced neural networks trained on vast text corpora.",
    "LLMs like GPT-3 generate human-like text based on input prompts.",
    "Applications of LLMs include content creation, code generation, and conversational agents.",
    "LLMs utilize transformers, a deep learning architecture, for efficient processing.",
    "Training LLMs requires substantial computational resources and large datasets.",
    "Fine-tuning LLMs on specific tasks enhances their performance and accuracy.",
    "Ethical considerations in LLMs include bias, misinformation, and data privacy."
]

In [None]:
X = list()
X.extend(python_snippets)
X.extend(nlp_snippets)
X.extend(llm_snippets)

In [None]:
X = [s.lower() for s in X]
X = [s.split() for s in X]

In [None]:
sentences = list()
for s in X:
    sentences.append([w.strip('.,()') for w in s])
sentences[:2]

In [None]:
model = Word2Vec(sentences, min_count=1,
                 vector_size=5, sg=0, window=2)

In [None]:
model.wv['python']

In [None]:
model.wv['django']

In [None]:
model.wv.key_to_index.keys()

In [None]:
model.wv.distance('python', 'django')

In [None]:
model.wv.distance('python', 'chatbots')

In [None]:
model.wv.distance('django', 'chatbots')
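
The `distance` method used above returns the cosine distance, that is, one minus the cosine similarity of the two word vectors. The following is a minimal NumPy sketch of that quantity; the helper name `cosine_distance` and the sample vectors are illustrative assumptions, not part of `gensim`.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine_distance([1, 2], [2, 4])   # parallel vectors
orthogonal = cosine_distance([1, 0], [0, 1])       # vectors at a right angle
opposite = cosine_distance([1, 2], [-1, -2])       # opposite directions
print(same_direction, orthogonal, opposite)
```

The value therefore ranges from 0 (same direction) to 2 (opposite direction). Applied to the model above, `cosine_distance(model.wv['python'], model.wv['django'])` should agree with `model.wv.distance('python', 'django')` up to floating-point error.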

In [None]:
model.wv.most_similar('python')

In [None]:
model.wv.most_similar('django')

In [None]:
model.wv.most_similar('chatbots')

## Document Similarity

In [None]:
d1 = ' '.join(python_snippets).lower()
d2 = ' '.join(nlp_snippets).lower()
d3 = ' '.join(llm_snippets).lower()

In [None]:
d1

In [None]:
d1_tokens = [w.strip('.,()') for w in d1.split()]

In [None]:
d1_tokens[:7]

In [None]:
len(d1_tokens)

In [None]:
[model.wv[t] for t in d1_tokens][:5]

In [None]:
d1_vec = np.mean([model.wv[t] for t in d1_tokens], axis=0)
d1_vec

In [None]:
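def get_doc_vector(doc):
    # Average the word vectors of all tokens in the document.
    # Uses the global `model`, so it follows whichever model was trained last;
    # raises KeyError for tokens missing from that model's vocabulary.
    doc_tokens = [w.strip('.,()') for w in doc.split()]
    dv = np.mean([model.wv[t] for t in doc_tokens], axis=0)
    return dv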
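
The function above fails with a `KeyError` if a document contains a word the model has never seen. A possible variant, sketched here under assumptions, skips unknown tokens and takes the vector lookup as an argument instead of relying on a global model. The name `get_doc_vector_safe`, the added lowercasing, and the zero-vector fallback are illustrative choices, and the dictionary stands in for `model.wv`.

```python
import numpy as np

def get_doc_vector_safe(doc, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    tokens = [w.strip('.,()') for w in doc.lower().split()]
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical stand-in for model.wv: a plain dict of 2-dimensional vectors.
toy_vectors = {
    "python": np.array([1.0, 0.0]),
    "rocks": np.array([0.0, 1.0]),
}

doc_vec = get_doc_vector_safe("Python rocks, obviously.", toy_vectors, dim=2)
empty_vec = get_doc_vector_safe("unknown words only", toy_vectors, dim=2)
print(doc_vec, empty_vec)
```

With the notebook's model, a call along the lines of `get_doc_vector_safe(d2, model.wv, model.wv.vector_size)` should work the same way, since `model.wv` supports both `in` checks and key lookups.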

In [None]:
d2_vec = get_doc_vector(d2)
d2_vec

In [None]:
d3_vec = get_doc_vector(d3)
d3_vec

In [None]:
model.wv.cosine_similarities(d1_vec, [d2_vec])

In [None]:
model.wv.cosine_similarities(d1_vec, [d3_vec])

In [None]:
model.wv.cosine_similarities(d2_vec, [d3_vec])

## Sentiment Analysis

In [None]:
sentences_ = [
    'i am so happy today',
    'i hate rainy days',
    'i love this book',
    'i hate boring movies',
    'i hope for the best',
    'i fear the worst'
]

In [None]:
sentiment = np.array((1, 0, 1, 0, 1, 0))

In [None]:
sentences = [s.split() for s in sentences_]
sentences

In [None]:
model = Word2Vec(sentences, min_count=1,
                 vector_size=5, window=2, sg=0)

In [None]:
sent_vecs = [get_doc_vector(s) for s in sentences_]

In [None]:
sent_vecs

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
classifier = LogisticRegression()

In [None]:
classifier.fit(sent_vecs, sentiment)

In [None]:
classifier.predict(sent_vecs) 

In [None]:
classifier.predict(sent_vecs) == sentiment

In [None]:
test = [
    'i love movies',
    'i hate today',
    'i fear rainy days',
    'i hope for happy days'
]

In [None]:
test_vecs = [get_doc_vector(s) for s in test]

In [None]:
test_vecs

In [None]:
classifier.predict(test_vecs)

## APPENDIX: Word2Vec Algorithm 

_From ChatGPT._

Word2Vec uses two main algorithms to learn vector representations of words: **Continuous Bag of Words (CBOW)** and **Skip-gram**. Both of these are neural network models designed to predict either a word given its context (CBOW) or the context given a word (Skip-gram).

### Continuous Bag of Words (CBOW)

In CBOW, the model predicts the target word (center word) based on the context words (surrounding words). 

**Example:**

Let's consider a simple sentence: "The cat sat on the mat."

- Context window size: 2 (consider 2 words before and 2 words after the target word)

For the word "sat" (target), the context words are ["The", "cat", "on", "the"].

The CBOW model tries to predict "sat" from ["The", "cat", "on", "the"].

### Skip-gram

In Skip-gram, the model predicts the context words given a target word (center word).

**Example:**

Let's use the same sentence: "The cat sat on the mat."

- Context window size: 2

For the word "sat" (target), the model will try to predict ["The", "cat", "on", "the"].
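
Both examples rely on the same context window and differ only in the direction of the prediction. The sketch below builds the training examples for the sentence above with plain Python; the helper `context_windows` is a hypothetical name introduced for illustration and is not how `gensim` generates its training data internally.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = ["The", "cat", "sat", "on", "the", "mat"]

# CBOW: one training example per position, context words -> target word.
cbow_examples = [(context, target)
                 for target, context in context_windows(tokens)]

# Skip-gram: one training pair per (target word, single context word).
skipgram_pairs = [(target, c)
                  for target, context in context_windows(tokens)
                  for c in context]

print(cbow_examples[2])
print([pair for pair in skipgram_pairs if pair[0] == "sat"])
```

Words near the sentence boundaries simply get fewer context words, which is why "The" at position 0 has only two.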

### How the Algorithm Works

1. **Initialization:**
   - Initialize the weights of the neural network randomly.

2. **Training:**
   - For each word occurrence in the training corpus, use the word and its context to update the neural network weights.
   - The training process involves:
     - Converting words into one-hot vectors.
     - Feeding these one-hot vectors into the neural network.
     - Calculating the error between the predicted word(s) and the actual word(s).
     - Backpropagating the error to adjust the weights.
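
Feeding a one-hot vector into the network amounts to a table lookup: multiplying it with the input weight matrix selects a single row. The sketch below checks this with NumPy; the vocabulary, the matrix name `W_in`, and the embedding size of 2 are assumptions made for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

# Hypothetical input weights: one row per word, 2 embedding dimensions.
rng = np.random.default_rng(0)
W_in = rng.normal(size=(len(vocab), 2))

# One-hot vector for "cat".
one_hot_cat = np.zeros(len(vocab))
one_hot_cat[word_to_index["cat"]] = 1.0

via_matmul = one_hot_cat @ W_in            # matrix product with the one-hot vector
via_lookup = W_in[word_to_index["cat"]]    # direct row lookup

print(np.allclose(via_matmul, via_lookup))
```

This is why practical implementations can skip the explicit one-hot vectors and index the weight matrix directly.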

### Simple Example with Skip-gram

Consider the sentence: "I like cats."

- Vocabulary: ["I", "like", "cats"]
- Context window size: 1

For the target word "like":
- Context words: ["I", "cats"]

The Skip-gram model will generate training pairs: 
- ("like", "I")
- ("like", "cats")

**Steps:**

1. **One-hot Encoding:**
   - "like": [0, 1, 0]
   - "I": [1, 0, 0]
   - "cats": [0, 0, 1]

2. **Neural Network Structure:**
   - Input layer: One-hot vector of the target word (e.g., "like" -> [0, 1, 0])
   - Hidden layer: A weight matrix that transforms the one-hot vector into a dense vector.
   - Output layer: A weight matrix that transforms the dense vector back into the vocabulary space.

3. **Forward Pass:**
   - Input: [0, 1, 0] (one-hot for "like")
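   - Hidden layer: Multiply the input with the first (input-to-hidden) weight matrix to get the hidden representation (embedding). Because the input is one-hot, this simply selects the matrix row for "like".
   - Output layer: Multiply the hidden representation with the second (hidden-to-output) weight matrix to get the scores for each word in the vocabulary.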

4. **Prediction and Error Calculation:**
   - Apply softmax to get the probability distribution over the vocabulary.
   - Calculate the error based on the actual context words.

5. **Backpropagation:**
   - Update the weights based on the error using gradient descent.

After training, the hidden layer weights will contain the word embeddings. Words that appear in similar contexts will have similar embeddings.
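
The five steps can be put together in a minimal NumPy sketch for the "I like cats" example. It uses a full softmax over the three-word vocabulary for clarity, whereas practical Word2Vec implementations typically rely on negative sampling or hierarchical softmax. The embedding size, learning rate, number of epochs, and the names `W_in` and `W_out` are assumptions chosen for illustration.

```python
import numpy as np

vocab = ["I", "like", "cats"]
idx = {w: i for i, w in enumerate(vocab)}
pairs = [("like", "I"), ("like", "cats")]  # (target, context) pairs from the text

rng = np.random.default_rng(42)
dim = 2                                                # embedding size (assumed)
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # input -> hidden weights
W_out = rng.normal(scale=0.1, size=(dim, len(vocab)))  # hidden -> output weights
learning_rate = 0.1

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def mean_loss():
    """Average cross-entropy loss over all training pairs."""
    losses = [-np.log(softmax(W_in[idx[t]] @ W_out)[idx[c]]) for t, c in pairs]
    return float(np.mean(losses))

initial_loss = mean_loss()
for epoch in range(500):
    for target, context in pairs:
        t, c = idx[target], idx[context]
        hidden = W_in[t].copy()          # forward pass: one-hot input selects a row
        probs = softmax(hidden @ W_out)  # probability distribution over the vocabulary
        error = probs.copy()
        error[c] -= 1.0                  # gradient of the loss w.r.t. the scores
        grad_hidden = W_out @ error      # backpropagate to the hidden layer
        W_out -= learning_rate * np.outer(hidden, error)
        W_in[t] -= learning_rate * grad_hidden
final_loss = mean_loss()

probs_like = softmax(W_in[idx["like"]] @ W_out)
print("loss before/after:", initial_loss, final_loss)
print(dict(zip(vocab, probs_like.round(2))))
```

Since "like" has two equally frequent context words, the best achievable loss is $\log 2 \approx 0.69$, with the predicted probability split between "I" and "cats". After training, the rows of `W_in` play the role of the word embeddings.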

## APPENDIX: Cosine Similarity

Cosine similarity is a measure that calculates the cosine of the angle between two non-zero vectors in an inner product space. It is often used to measure the similarity between two vectors, particularly in the context of document similarity and word embeddings.

### Cosine Similarity Formula

The cosine similarity between two vectors $A$ and $B$ is calculated as:

$$ \text{Cosine Similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} $$

where:
- $ A \cdot B $ is the dot product of vectors $A$ and $B$.
- $\|A\|$ is the magnitude (length) of vector $A$.
- $\|B\|$ is the magnitude (length) of vector $B$.
- $\theta$ is the angle between the two vectors.

### Step-by-Step Calculation

1. **Dot Product:**
   The dot product of two vectors $A$ and $B$, each of dimension $n$, is calculated as:
   
   $$ A \cdot B = \sum_{i=1}^{n} A_i \times B_i $$

2. **Magnitude of Vectors:**
   The magnitude of a vector $A$ is calculated as:
   
   $$ \|A\| = \sqrt{\sum_{i=1}^{n} A_i^2} $$

   Similarly, the magnitude of vector $B$ is:
   
   $$ \|B\| = \sqrt{\sum_{i=1}^{n} B_i^2} $$

3. **Cosine Similarity Calculation:**
   Combine the dot product and magnitudes to get the cosine similarity:
   
   $$ \cos(\theta) = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}} $$

### Example Calculation

Let's consider two vectors $A$ and $B$:

$$ A = [1, 2, 3] $$
$$ B = [4, 5, 6] $$

1. **Dot Product:**
   
   $$ A \cdot B = (1 \times 4) + (2 \times 5) + (3 \times 6) = 4 + 10 + 18 = 32 $$

2. **Magnitude of Vectors:**

   $$ \|A\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{1 + 4 + 9} = \sqrt{14} \approx 3.74 $$
   
   $$ \|B\| = \sqrt{4^2 + 5^2 + 6^2} = \sqrt{16 + 25 + 36} = \sqrt{77} \approx 8.77 $$

3. **Cosine Similarity:**

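   $$ \cos(\theta) = \frac{32}{\sqrt{14} \times \sqrt{77}} = \frac{32}{\sqrt{1078}} \approx \frac{32}{32.83} \approx 0.97 $$

The cosine similarity between vectors $A$ and $B$ is approximately 0.97, indicating a high degree of similarity. (Multiplying the rounded magnitudes 3.74 and 8.77 instead gives a slightly too large value of about 0.98.)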

### Python Code Example

Here's how you can calculate cosine similarity using Python and the `numpy` library:

```python
import numpy as np

# Define vectors
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

# Calculate dot product
dot_product = np.dot(A, B)

# Calculate magnitudes
magnitude_A = np.linalg.norm(A)
magnitude_B = np.linalg.norm(B)

# Calculate cosine similarity
cosine_similarity = dot_product / (magnitude_A * magnitude_B)
print("Cosine Similarity:", cosine_similarity)
```
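
In practice, one vector is often compared against many at once, for example a query vector against all document vectors. The following sketch vectorizes the calculation with NumPy; the helper name `cosine_to_many` is an assumption introduced for illustration. It also shows that cosine similarity depends only on direction, not on vector length.

```python
import numpy as np

def cosine_to_many(vector, matrix):
    """Cosine similarity between one vector and each row of a matrix."""
    vector = np.asarray(vector, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    dots = matrix @ vector
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return dots / norms

A = np.array([1, 2, 3])
others = np.array([
    [4, 5, 6],     # B from the example above
    [2, 4, 6],     # 2 * A: same direction as A
    [-1, -2, -3],  # -A: opposite direction
])

sims = cosine_to_many(A, others)
print(sims.round(4))
```

The function assumes that no vector has zero length; a zero vector would lead to a division by zero.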

### Applications

Cosine similarity is widely used in:
- **Information Retrieval:** To measure the similarity between documents.
- **Text Mining:** To find similar texts.
- **Recommender Systems:** To recommend items that are similar to user preferences.
- **Word Embeddings:** To measure the similarity between word vectors in NLP tasks.
