# **Word Embeddings**

### 1. **Introduction**

Word embeddings are **dense vector representations** of words that capture their **semantic meaning**.
Unlike one-hot vectors (which are sparse and high-dimensional), embeddings place similar words **closer together in a continuous vector space**.

Example:
$$
\text{king} - \text{man} + \text{woman} \approx \text{queen}
$$

This shows that embeddings capture **relationships and analogies** between words.
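
A toy sketch of this arithmetic, assuming hand-made 3-dimensional vectors. The values and the helper names `cosine` and `analogy` are illustrative only and do not come from any trained model:

```python
import math

# Hand-made toy vectors (dimensions loosely: royalty, maleness, femaleness).
# Real embeddings are learned from data and have hundreds of dimensions.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

Excluding the three query words from the candidates is a common convention when evaluating analogies; without it, the nearest vector is often one of the inputs.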



### 2. **Motivation**

Traditional NLP models represent words using:

* **One-hot encoding:** each word = one dimension in a large vector (e.g., 10,000 words → 10,000-d vector).
* Problems:

  * No notion of similarity (all words equally distant)
  * Huge memory requirements
  * Inefficient learning

**Word embeddings** solve this by learning **lower-dimensional, continuous-valued vectors** that capture **context and meaning**.
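
The "no notion of similarity" problem can be checked directly. The sketch below assumes a four-word vocabulary and made-up 2-dimensional dense vectors; `one_hot` and `euclidean` are illustrative helpers:

```python
import math

vocab = ["cat", "dog", "car", "bus"]

def one_hot(word, vocab):
    """Vector with a single 1 at the word's index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Every pair of distinct one-hot vectors is exactly sqrt(2) apart.
one_hot_distances = {
    (a, b): euclidean(one_hot(a, vocab), one_hot(b, vocab))
    for a in vocab for b in vocab if a < b
}
print(one_hot_distances)

# Made-up dense vectors: animals near each other, vehicles near each other.
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9], "bus": [0.2, 0.8]}
print(euclidean(dense["cat"], dense["dog"]))  # small distance
print(euclidean(dense["cat"], dense["car"]))  # larger distance
```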



### 3. **Core Idea**

Each word is represented as a vector in $\mathbb{R}^n$, where $n$ is typically 100 to 300.
Words that appear in **similar contexts** (e.g., "doctor", "nurse") end up with **similar vectors**.

Embedding spaces are learned **from data** — using models like **Word2Vec**, **GloVe**, or within neural networks (via **Embedding layers**).



### 4. **Mathematical View**

Given a large corpus, embeddings are trained so that **words with similar contexts have similar vectors**.

Formally, embeddings maximize:
$$
P(\text{context} \mid \text{word}) \quad \text{or} \quad P(\text{word} \mid \text{context})
$$
depending on the training model.
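
As one concrete instance, the skip-gram objective of Mikolov et al. (2013) can be written as below. The symbols are notation introduced here for illustration: $T$ is the corpus length, $m$ the window size, $V$ the vocabulary size, and $v_w$, $v'_w$ the input and output vectors of word $w$.

$$
\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t),
\qquad
P(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
$$

In practice the full softmax over $V$ words is expensive, so it is usually approximated, for example with negative sampling or a hierarchical softmax.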



### 5. **Popular Word Embedding Models**

#### a. **Word2Vec** (Mikolov et al., 2013)

Two main architectures:

1. **CBOW (Continuous Bag of Words):**

   * Predicts the current word from its surrounding context.
   * Faster to train, and slightly more accurate for frequent words.
     $$
     P(w_t \mid w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m})
     $$

2. **Skip-Gram:**

   * Predicts the context words given a target word.
   * Works better for rare words.
     $$
     P(w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m} \mid w_t)
     $$
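
The difference between the two architectures shows up in how training examples are built from a sentence. A minimal sketch, assuming whitespace tokenization and a symmetric window; the function names `skipgram_pairs` and `cbow_examples` are hypothetical:

```python
def skipgram_pairs(tokens, window):
    """Yield (target, context) pairs: one pair per context word."""
    for t, target in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                yield target, tokens[j]

def cbow_examples(tokens, window):
    """Yield (context_words, target) examples: the whole window predicts the target."""
    for t, target in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != t]
        yield context, target

tokens = "the quick brown fox jumps".split()
print(list(skipgram_pairs(tokens, window=1)))
print(list(cbow_examples(tokens, window=1)))
```

Skip-gram produces one example per (target, context) pair, while CBOW produces one example per position, which is one reason CBOW tends to train faster.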

#### b. **GloVe (Global Vectors)**

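* Learns embeddings from **global word co-occurrence statistics**.
* Builds a matrix of how often words co-occur and fits word vectors whose dot products approximate the logarithm of those counts:

$$
w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}
$$

where $X_{ij}$ is the number of times word $j$ appears in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, and $b_i$, $\tilde{b}_j$ are bias terms.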
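
A minimal sketch of the counting step only, assuming a tiny made-up corpus, a symmetric window, and unweighted counts. GloVe itself down-weights distant pairs by 1/distance and then fits vectors to the log counts; neither of those steps is shown, and `cooccurrence_counts` is a hypothetical helper:

```python
from collections import Counter
import math

def cooccurrence_counts(sentences, window):
    """Count how often each ordered (word, context_word) pair appears within the window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [
    "ice is cold".split(),
    "steam is hot".split(),
    "ice is solid".split(),
]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("ice", "is")])            # raw co-occurrence count
print(math.log(counts[("ice", "is")]))  # the quantity the vectors are fit to
print(counts[("ice", "hot")])           # pairs never seen together count as 0
```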

#### c. **FastText**

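* Extends Word2Vec by representing each word as a bag of **character n-grams**.
* Captures **morphological features**, so it works better for rare, misspelled, and out-of-vocabulary words.
* Example: with n = 3 and the boundary markers `<` and `>`, “playing” → `<pl`, `pla`, `lay`, `ayi`, `yin`, `ing`, `ng>`.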
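
A sketch of the n-gram extraction, assuming the boundary markers `<` and `>` and an n-gram range of 3 to 6 as described in the FastText paper. The function name `char_ngrams` is hypothetical, and the hashing and vector-summing steps are left out:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in the boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("playing", n_min=3, n_max=3))

# Related word forms share n-grams, which is how subword information is reused.
shared = set(char_ngrams("playing", 3, 3)) & set(char_ngrams("played", 3, 3))
print(sorted(shared))
```

Because a word's vector is built from its n-gram vectors, an unseen word such as a misspelling still gets a representation from the n-grams it shares with known words.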



### 6. **Word Embeddings in Deep Learning Models**

In neural NLP models, embeddings are learned **as part of the training** using an **Embedding layer**.

Example in PyTorch:

```python
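import torch
import torch.nn as nn

# Lookup table with one trainable 300-d vector per vocabulary entry
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=300)

# Four word IDs; each must lie in [0, num_embeddings)
input_word_ids = torch.tensor([1, 5, 8, 12], dtype=torch.long)
output = embedding(input_word_ids)

print(output.shape)  # torch.Size([4, 300])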
```

Here:

* `num_embeddings` = vocabulary size
* `embedding_dim` = vector dimension
* The layer maps word IDs → learned 300-d vectors

---

### 7. **Visualizing Embeddings**

You can project embeddings into 2D using PCA or t-SNE to observe clusters:

* In trained embeddings, “king”, “queen”, “prince”, “princess” tend to cluster together.
* “car”, “bus”, “train” tend to form another cluster.

Example:

```python
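from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# `embedding` is the nn.Embedding layer defined earlier. Its weights are randomly
# initialized, so clusters only appear once the layer is trained or loaded with
# pre-trained vectors.
weights = embedding.weight.detach().cpu().numpy()[:500]  # subset keeps t-SNE fast

tsne = TSNE(n_components=2, random_state=0)
reduced = tsne.fit_transform(weights)

plt.scatter(reduced[:, 0], reduced[:, 1], s=5)
plt.title("Word Embedding Space")
plt.show()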
```

---

### 8. **Applications**

* Text classification
* Sentiment analysis
* Machine translation
* Information retrieval
* Named Entity Recognition (NER)
* Transfer learning (pretrained embeddings)



### 9. **Pre-trained Embeddings**

You can use pre-trained vectors instead of training from scratch:

* **Word2Vec** (Google News, about 100B tokens; vectors for 3M words and phrases)
* **GloVe** (Wikipedia + Gigaword, 6B tokens)
* **FastText** (Common Crawl, 600B tokens)

Using pre-trained embeddings usually helps models converge faster and generalize better when labeled data is limited.
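
Pre-trained GloVe vectors are distributed as plain text, one word per line followed by its vector components. A minimal parsing sketch under that assumption; `load_text_vectors` is a hypothetical helper and the numbers below are made up, not real GloVe values:

```python
import io

def load_text_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Stand-in for an open file handle; replace with open(path, encoding="utf-8").
fake_file = io.StringIO("king 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")
vectors = load_text_vectors(fake_file)
print(vectors["queen"])
```

Word2Vec and FastText text files start with a header line giving the vocabulary size and dimension, which would need to be skipped before using this parser.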



### 10. **Key Takeaways**

* Embeddings capture **semantic relationships** between words.
* Represent words as **dense, low-dimensional vectors**.
* Similar words → similar vectors.
* Trained via **Word2Vec, GloVe, or neural networks**.
* A foundation for most modern NLP models.

# **Contextual Word Embeddings**

### 1. **Why We Need Contextual Embeddings**

Traditional embeddings such as **Word2Vec**, **GloVe**, and **FastText** represent each word with **a single, fixed vector**, regardless of context.

Example:

* Sentence 1: *“He sat by the **bank** of the river.”*
* Sentence 2: *“She went to the **bank** to deposit money.”*

In static embeddings, both “bank” words would have **the same vector**, even though their meanings differ.

**Contextual embeddings** solve this by generating **different vectors for the same word** depending on its **surrounding context**.



### 2. **Core Idea**

Instead of assigning a fixed vector to each word, contextual models generate embeddings **dynamically** based on the sentence.

For example:
$$
\text{Embedding("bank", context="river")} \neq \text{Embedding("bank", context="money")}
$$

This is achieved through **deep neural networks** (usually built with LSTMs or Transformers) that process the entire sentence before producing word representations.



### 3. **Key Models in Contextual Embeddings**

Let’s look at the main models that introduced this idea.

---

### **a. ELMo (Embeddings from Language Models)**

**Developed by:** Allen Institute for AI (2018), released through its AllenNLP library

* Based on **bi-directional LSTM language models**.
* Generates word embeddings that depend on **both left and right context**.
* Produces different embeddings for the same word in different sentences.

**Architecture:**

* Two-layer bidirectional LSTM trained as a language model.
* Embeddings are taken from internal states of the network.

**Mathematically:**
$$
\text{ELMo}(w_t) = \gamma \sum_{l=0}^{L} s_l \, h_{t,l}
$$
where:

* $h_{t,l}$ = hidden state of layer $l$ at position $t$ (layer $l = 0$ is the context-independent token representation)
* $s_l$ = softmax-normalized weights for each layer, learned for the downstream task
* $\gamma$ = task-specific scaling factor

**Strength:** Captures **contextual, syntactic, and semantic information**.
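
The layer combination can be sketched in a few lines. This assumes made-up 2-dimensional layer states for a single token; `softmax` and `elmo_combine` are illustrative helpers, not part of any ELMo implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_combine(layer_states, raw_weights, gamma):
    """Softmax-weighted sum of per-layer vectors for one token, scaled by gamma."""
    s = softmax(raw_weights)
    dim = len(layer_states[0])
    return [gamma * sum(s_l * h[d] for s_l, h in zip(s, layer_states)) for d in range(dim)]

# Made-up states for one token, one vector per layer.
layer_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw weights reduce to a plain average of the layers.
print(elmo_combine(layer_states, raw_weights=[0.0, 0.0, 0.0], gamma=1.0))

# A large weight on the last layer makes the result approach that layer's state.
print(elmo_combine(layer_states, raw_weights=[0.0, 0.0, 10.0], gamma=1.0))
```

In a real model the raw weights and `gamma` are trained together with the downstream task, so each task can emphasize the layers that help it most.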



### **b. BERT (Bidirectional Encoder Representations from Transformers)**

**Developed by:** Google AI (2018)

* Based on the **Transformer encoder** architecture.
* Trained using **Masked Language Modeling (MLM)**, together with a next-sentence prediction objective:

  * Randomly mask a fraction of the tokens (about 15%) and predict them using context from both sides.
* Produces deeply **contextualized embeddings** for each token.

**Example:**
Input: “The man went to the **[MASK]**.”
Model predicts: “store”, “market”, etc., depending on context.
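
A simplified sketch of how MLM training inputs can be built, assuming whitespace tokens and a 15% masking rate. The function `mask_tokens` is hypothetical; BERT's actual procedure works on subword tokens and replaces only most of the selected tokens with `[MASK]`, leaving some unchanged or swapped for random tokens:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels); labels hold the original token at masked positions, else None."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

tokens = "the man went to the store to buy milk".split()
masked, labels = mask_tokens(tokens)
print(masked)  # model input
print(labels)  # prediction targets; the loss is computed only where a label is set
```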

**Advantages:**

* Considers **both left and right context simultaneously**.
* Learns rich semantic and syntactic information.
* Easily fine-tuned for downstream tasks (classification, NER, QA, etc.).



### **c. GPT (Generative Pre-trained Transformer)**

**Developed by:** OpenAI (2018)

* Based on the **Transformer decoder** architecture.
* Trained as a **causal language model** — predicts the next word given previous words.
* Generates embeddings based on **left-to-right context** only.

**Key difference from BERT:**

* BERT = bidirectional (understands full context)
* GPT = unidirectional (good for generation tasks)
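
In Transformer implementations this difference is typically enforced with an attention mask. The sketch below only builds the masks, with 1 meaning "may attend" and 0 meaning "blocked"; the function names are hypothetical:

```python
def bidirectional_mask(n):
    """Every position may attend to every position (BERT-style encoder)."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Position i may attend only to positions 0..i (GPT-style decoder)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = ["she", "went", "to", "the", "bank"]
for name, mask in [("bidirectional", bidirectional_mask(len(tokens))),
                   ("causal", causal_mask(len(tokens)))]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"{tok:>5}", row)
```

With the causal mask, the representation of "went" cannot depend on "bank", which is what makes left-to-right generation possible.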



### **d. RoBERTa, ALBERT, and DistilBERT**

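* Variants of BERT with different training or architecture choices:

  * **RoBERTa:** more data, longer training, dynamic masking, and no next-sentence prediction → better performance
  * **ALBERT:** fewer parameters (parameters shared across layers, factorized embeddings)
  * **DistilBERT:** smaller, faster model trained by knowledge distillation, suited to resource-constrained devices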

All still produce **contextual embeddings** at the token level.



### 4. **How Contextual Embeddings Work**

1. Input sentence is tokenized.
2. Model processes the sentence through **multiple Transformer layers**.
3. Each token’s final hidden state represents its **context-aware embedding**.
4. These embeddings can then be used for:

   * Classification
   * Named Entity Recognition (NER)
   * Translation
   * Sentiment analysis



### 5. **Example: Using BERT for Contextual Embeddings**

Here’s a small example using Hugging Face Transformers:

```python
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()  # make sure dropout is disabled for inference

# Input sentence
sentence = "He sat by the bank of the river"
inputs = tokenizer(sentence, return_tensors='pt')

# Get contextual embeddings (no gradients needed for feature extraction)
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # Shape: [batch, sequence_length, hidden_dim]

print(embeddings.shape)
```

With `bert-base-uncased`, each token (a word or subword, plus the special `[CLS]` and `[SEP]` tokens) gets a **768-dimensional contextual embedding**.

---

### 6. **Advantages Over Static Embeddings**

| Feature                | Static (Word2Vec/GloVe) | Contextual (BERT/ELMo)      |
| ---------------------- | ----------------------- | --------------------------- |
| Representation         | Fixed for each word     | Depends on context          |
| Handles polysemy       | No                      | Yes                         |
| Architecture           | Shallow                 | Deep (LSTM/Transformer)     |
| Language Understanding | Limited                 | Rich semantic understanding |
| Fine-tuning            | Embedding table only    | Whole model, end to end     |

---

### 7. **Applications**

* Sentiment Analysis
* Question Answering
* Named Entity Recognition
* Machine Translation
* Semantic Search
* Text Summarization



### 8. **Key Takeaways**

* Contextual embeddings understand **meaning in context**.
* ELMo popularized the idea using **BiLSTM** language models.
* BERT and GPT improved it using **Transformers**.
* These embeddings form the **foundation of modern NLP**.