# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## 1. What is a Feature Vector in Word2Vec?

In **Word2Vec**, every **word** in your vocabulary is represented by a **dense numerical vector** — called a **feature vector** or **word embedding**.

Example:

| Word    | Feature Vector (size = 5)        |
| ------- | -------------------------------- |
| “king”  | [0.21, 0.58, -0.31, 0.44, 0.12]  |
| “queen” | [0.19, 0.63, -0.27, 0.51, 0.08]  |
| “apple” | [-0.11, 0.44, 0.75, -0.33, 0.19] |

Each word becomes a **point in a high-dimensional space** (often 100–300 dimensions in real models).
The idea is that **semantic relationships** between words are captured as **distances and directions** in this space.
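
As a rough illustration of "distance" in this space, the sketch below computes cosine similarity on the toy vectors from the table above. The numbers are the illustrative values from that table, not output from a trained model, and the helper name `cosine_similarity` is chosen here for convenience.

```python
import numpy as np

# Toy 5-dimensional vectors copied from the table above (illustrative values,
# not taken from a trained model).
vectors = {
    "king":  np.array([0.21, 0.58, -0.31, 0.44, 0.12]),
    "queen": np.array([0.19, 0.63, -0.27, 0.51, 0.08]),
    "apple": np.array([-0.11, 0.44, 0.75, -0.33, 0.19]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_king_queen = cosine_similarity(vectors["king"], vectors["queen"])
sim_king_apple = cosine_similarity(vectors["king"], vectors["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these values, "king" scores much higher against "queen" than against "apple", which is the pattern a trained model is expected to show for related versus unrelated words.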



## 2. What These Vectors Represent

Each **dimension** in a Word2Vec vector doesn’t have a human-readable meaning like “gender” or “royalty.”
But together, the **pattern across dimensions** captures:

* Contextual similarity (words appearing in similar contexts have similar vectors)
* Relationships like:

  * **king − man + woman ≈ queen**
  * **Paris − France + Italy ≈ Rome**

So, these feature vectors encode both **similarity** and **relationships**.



## 3. How Word2Vec Learns These Vectors

Word2Vec is **not just a lookup**; it learns these feature vectors using a shallow neural network trained on large text corpora.

There are two main architectures:

### (a) CBOW (Continuous Bag of Words)

* Predicts the **target word** from the **context words**.
  Example: from “The cat _ on the mat”, predict “sat”.

### (b) Skip-Gram

* Does the opposite: predicts **context words** from the **target word**.
  Example: from “cat”, predict words like “the”, “sat”, “on”.

During training:

* Each word starts as a random vector.
* The model adjusts these vectors so words that appear in similar contexts end up close in vector space.



## 4. The Vector Space Concept

Imagine a geometric space where:

* Similar words cluster together.
  (“good”, “nice”, “great” → nearby)
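* Unrelated words are far apart.
  (“good” vs “carburetor”)
  Note that antonyms such as “good” and “terrible” often end up fairly close to each other, because they appear in similar contexts.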
* Semantic axes naturally form:
  Gender, tense, country–capital, singular–plural, etc.

Example visualization (2D simplification):

```
        woman      king
           \       /
            \     /
             \   /
              \ /
              man
```

In higher dimensions, these relationships are encoded as **vector directions and magnitudes**.



## 5. Why It’s Called a “Feature” Vector

Each number in the vector is a value along a **latent feature** that the model learned automatically rather than one predefined by humans. As noted in Section 2, a single dimension is rarely interpretable on its own.

For example:

* Some dimensions may loosely encode sentiment (positive ↔ negative)
* Some may encode part of speech (noun ↔ verb)
* Others may encode syntactic or semantic patterns

These are **emergent features** — the model discovers them purely from word co-occurrence patterns.



## 6. Example: Training Outcome (Simplified)

Suppose the vocabulary is `[king, queen, man, woman]` and the embedding size is 3:

| Word  | Feature Vector     |
| ----- | ------------------ |
| king  | [0.7, 0.2, 0.9]    |
| queen | [0.69, 0.19, 0.91] |
| man   | [0.5, 0.1, 0.4]    |
| woman | [0.49, 0.09, 0.42] |

Here:

* “king” and “queen” are close because they share similar contexts (“throne”, “crown”, “royal”)
* “man” and “woman” are close because they appear in similar contexts (“person”, “adult”)

The **difference vectors** king − man and queen − woman point in nearly the same direction, which shows the learned relational structure.
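
The relationship can be checked numerically with the toy vectors from the table above. This is a minimal sketch on made-up 3-dimensional values; the names `emb`, `cosine`, and `nearest` are introduced here only for illustration.

```python
import numpy as np

# Toy embeddings from the table above (illustrative, not trained).
emb = {
    "king":  np.array([0.70, 0.20, 0.90]),
    "queen": np.array([0.69, 0.19, 0.91]),
    "man":   np.array([0.50, 0.10, 0.40]),
    "woman": np.array([0.49, 0.09, 0.42]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the two difference vectors.
diff_king_man = emb["king"] - emb["man"]
diff_queen_woman = emb["queen"] - emb["woman"]
direction_match = cosine(diff_king_man, diff_queen_woman)

# Analogy: king - man + woman, then find the closest word by Euclidean distance.
query = emb["king"] - emb["man"] + emb["woman"]
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - query))

print(f"cosine of difference vectors: {direction_match:.5f}")
print("nearest word to king - man + woman:", nearest)
```

Real analogy queries usually exclude the three input words from the candidate set; that step is omitted here because the vocabulary has only four words.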



## 7. How to Use These Feature Vectors

Once trained, these embeddings can be used for:

* **Similarity:**
  `cosine_similarity(vec("good"), vec("great"))`
* **Analogy solving:**
  `vec("king") - vec("man") + vec("woman") ≈ vec("queen")`
* **Downstream tasks:**
  Feed embeddings into classifiers, LSTMs, or transformers.
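
For the downstream-task case, one simple option is to average the word vectors of a sentence into a single fixed-length feature vector that a classifier can consume. The sketch below assumes this mean-pooling approach (it is one common choice, not the only one), and the embeddings are random placeholders standing in for trained Word2Vec vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "movie", "was", "good", "great", "terrible"]

# Placeholder embeddings: random values standing in for trained vectors.
embeddings = {w: rng.normal(size=4) for w in vocab}

def sentence_vector(tokens, embeddings, dim=4):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

features = sentence_vector("the movie was great".split(), embeddings)
print(features.shape)  # one fixed-length vector per sentence
```

Sequence models such as LSTMs or transformers would instead take the per-token vectors in order, without averaging.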



## 8. In Short

| Concept            | Meaning                                              |
| ------------------ | ---------------------------------------------------- |
| **Feature vector** | A numeric representation of a word capturing meaning |
| **Dimensions**     | Latent semantic features learned from data           |
| **Similarity**     | Geometric closeness = contextual similarity          |
| **Learned from**   | Word co-occurrence patterns in text                  |
| **Used for**       | Semantic tasks, NLP pipelines, and transfer learning |

## Continuous Bag of Words (CBOW)

### 1. Introduction
The **Continuous Bag of Words (CBOW)** model is one of the two main architectures used in **Word2Vec** (the other being Skip-Gram).  
It aims to predict a **target word** based on its **surrounding context words**.  

CBOW helps the network learn vector representations of words — called **embeddings** — that capture semantic relationships.



### 2. Intuitive Example

Consider the sentence:  
> "The cat sat on the mat"

If the context window size is 2, and the target word is **"sat"**,  
then the **context words** are:  
`["The", "cat", "on", "the"]`

The CBOW task:  
> Predict "sat" from the context ["The", "cat", "on", "the"].
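
The windowing described above can be sketched as a small helper that turns a token list into (context, target) training pairs. The function name `cbow_pairs` and the whitespace tokenization are assumptions made for this example.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs, one per token position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        pairs.append((left + right, target))
    return pairs

tokens = "The cat sat on the mat".split()
pairs = cbow_pairs(tokens, window=2)
for context, target in pairs:
    print(context, "->", target)
```

Words near the start or end of the sentence simply get a shorter context, which is one of several possible ways to handle sentence boundaries.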



### 3. Model Architecture

CBOW is a **shallow neural network** with three layers:
1. **Input layer** – represents context words as one-hot encoded vectors.
2. **Hidden layer** – shared weight matrix that acts as the word embedding space (a linear projection with no activation function).
3. **Output layer** – predicts the target word using a softmax function.

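The overall flow: context words → one-hot vectors → shared embedding matrix `W` → averaged context vector `h` → output matrix `W'` → softmax → predicted target word.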





### 4. Working Mechanism

#### Step 1: Input Representation
Each context word is converted into a **one-hot vector**.  
For a vocabulary of size `V`, the vector has a single `1` at the index of the word, and `0`s elsewhere.

Example:  
If `V = 10` and the word “cat” is the 3rd in the vocabulary, its one-hot vector is `[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]`.



#### Step 2: Hidden Layer
Each one-hot input vector is multiplied by a shared weight matrix **W** of size `(V × N)`  
(where `N` is the embedding dimension).

This produces an **embedding** for each context word.

CBOW **averages the embeddings** of all context words to get a single context vector `h`.
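
Steps 1 and 2 can be sketched in NumPy as follows. The matrix `W` is filled with random numbers and the context indices are hypothetical; the point is only to show that multiplying by a one-hot vector is a row lookup, and that `h` is the mean of the looked-up rows.

```python
import numpy as np

V, N = 10, 4                       # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input weight matrix; row i is the embedding of word i

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    x = np.zeros(size)
    x[index] = 1.0
    return x

# "cat" as the 3rd word in the vocabulary corresponds to zero-based index 2.
x_cat = one_hot(2, V)
embedding_cat = W.T @ x_cat
assert np.allclose(embedding_cat, W[2])   # the product just selects row 2 of W

# Hypothetical indices for four context words.
context_ids = [0, 2, 5, 7]
h = np.mean([W.T @ one_hot(i, V) for i in context_ids], axis=0)
print(h)
```

Because of this equivalence, practical implementations skip the one-hot multiplication and index into `W` directly.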

#### Step 3: Output Layer
The context vector `h` is multiplied by another matrix `W'` and passed through a **softmax** layer to produce probabilities for all words in the vocabulary.

The word with the highest probability is the **predicted target word**.



### 5. Mathematical Formulation

Given:
- Vocabulary size: `V`
- Embedding dimension: `N`
- Context window size: `n`
- Target word: `w_t`
- Context words: `w_{t-n}, ..., w_{t-1}, w_{t+1}, ..., w_{t+n}`

**Input:**
\[
\text{Context words} = \{ w_{t-n}, ..., w_{t-1}, w_{t+1}, ..., w_{t+n} \}
\]

**Hidden layer (average context vector):**
\[
h = \frac{1}{2n} \sum_{-n \le j \le n, j \ne 0} W^T x_{t+j}
\]

**Output layer (softmax prediction):**
\[
y = \text{softmax}(W'^T h)
\]

**Loss function (cross-entropy):**
\[
E = -\log P(w_t | \text{context})
\]
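
The formulas above can be put together in a small NumPy sketch: a forward pass, the cross-entropy loss, and a plain gradient-descent update with a full softmax. The vocabulary, the learning rate, the initialization scale, and the function name `cbow_step` are assumptions for this toy example, and real Word2Vec implementations use faster approximations than the full softmax.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 3

W = rng.normal(scale=0.1, size=(V, N))       # input embeddings W (V x N)
W_out = rng.normal(scale=0.1, size=(N, V))   # output weights W' (N x V)

def softmax(u):
    e = np.exp(u - u.max())                  # subtract the max for numerical stability
    return e / e.sum()

def cbow_step(context_ids, target_id, lr=0.1):
    """One forward/backward pass; updates W and W_out in place, returns the loss."""
    h = W[context_ids].mean(axis=0)          # hidden layer: average of context embeddings
    y = softmax(W_out.T @ h)                 # predicted distribution over the vocabulary
    loss = -np.log(y[target_id])             # E = -log P(w_t | context)

    err = y.copy()
    err[target_id] -= 1.0                    # gradient of E w.r.t. the scores: y - one_hot
    grad_h = W_out @ err                     # gradient w.r.t. h, taken before W_out changes
    W_out[:] -= lr * np.outer(h, err)
    # Each context row gets an equal share of grad_h; subtract.at accumulates
    # correctly when a word ("the") occurs more than once in the context.
    np.subtract.at(W, context_ids, lr * grad_h / len(context_ids))
    return loss

context_ids = [word_to_id[w] for w in ["the", "cat", "on", "the"]]
target_id = word_to_id["sat"]
losses = [cbow_step(context_ids, target_id) for _ in range(200)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Training on a single example only shows that the update rule reduces the loss; a real run would loop over all (context, target) pairs in a corpus.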



### 6. Why “Continuous Bag of Words”?

- **Continuous** – works with continuous-valued word embeddings.
- **Bag of Words** – the order of context words is ignored (like a bag of words).
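
The "bag" property follows directly from the averaging step: reordering the context words leaves the context vector unchanged. A minimal check, using a random stand-in embedding matrix and hypothetical word indices:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 4))          # toy embedding matrix: 6 words, 4 dimensions

context = [0, 1, 3, 4]               # hypothetical word indices in a context window
shuffled = [4, 0, 3, 1]              # the same words in a different order

h_original = W[context].mean(axis=0)
h_shuffled = W[shuffled].mean(axis=0)

order_is_ignored = bool(np.allclose(h_original, h_shuffled))
print(order_is_ignored)
```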



### 7. Strengths and Weaknesses

| Aspect | Description |
|--------|--------------|
| **Strengths** | Simple, efficient for large datasets, fast to train |
| **Weaknesses** | Ignores word order, less effective for rare words |
| **Best used for** | Capturing meaning from frequent co-occurrences |



### 8. Comparison with Skip-Gram

| Feature | CBOW | Skip-Gram |
|----------|------|-----------|
| Predicts | Target word from context | Context words from target |
| Training speed | Faster | Slower |
| Works better for | Frequent words | Rare words |
| Input | Multiple context words | Single target word |
| Output | One target word | Multiple context words |



### 9. Summary

CBOW helps a model learn **word embeddings** by predicting a missing word from its surrounding context.  
These embeddings capture **semantic similarity**, meaning similar words (like “king” and “queen”) end up with vectors close to each other in the embedding space.

In practice, CBOW embeddings have been used as input features in many NLP models, including language models, text classifiers, and neural translation systems.


## Skip-Gram Model

### 1. Introduction
The **Skip-Gram model** is one of the two main architectures used in **Word2Vec** (the other is CBOW).  
It aims to do the opposite of CBOW — instead of predicting a word from its context, Skip-Gram predicts the **context words** given a **target word**.

This approach helps capture how words are used in different contexts and is especially powerful for learning good representations of **rare words**.



### 2. Intuitive Example

Sentence:
> "The cat sat on the mat"

If the window size is 2, and the target word is **“sat”**,  
then the **context words** are:
`["The", "cat", "on", "the"]`

Skip-Gram’s task:
> Given the target “sat”, predict the surrounding words [“The”, “cat”, “on”, “the”].
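
In training, this task is usually broken into one (target, context) pair per context word. The helper below is a sketch of that pair generation; the name `skipgram_pairs` and the whitespace tokenization are assumptions for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs, one pair per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                      # skip the target position itself
                pairs.append((target, tokens[j]))
    return pairs

tokens = "The cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)
sat_pairs = [p for p in pairs if p[0] == "sat"]
print(sat_pairs)
```

Each pair becomes a separate training example, which is why Skip-Gram produces more updates per sentence than CBOW.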



### 3. Model Architecture

Skip-Gram is a **simple neural network** with three layers:
1. **Input layer** – one-hot encoded target word  
2. **Hidden layer** – weight matrix that acts as the embedding lookup  
3. **Output layer** – predicts probabilities for all words in the vocabulary using a softmax function

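The structure: target word (one-hot) → embedding matrix `W` → target embedding → output matrix `W'` → softmax → probabilities of context words.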



---

### 4. Working Mechanism

#### Step 1: Input Representation
The input word (target word) is represented as a **one-hot vector** of length `V` (the vocabulary size).  
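Example: if `V = 10` and the target word is the 3rd in the vocabulary, the input is `[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]`.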



#### Step 2: Hidden Layer (Embedding Lookup)
The one-hot vector is multiplied by a weight matrix **W** of size `(V × N)`,  
where `N` is the embedding dimension.

This effectively selects the **embedding vector** corresponding to the target word.

#### Step 3: Output Layer
The resulting embedding is multiplied by another matrix **W'** of size `(N × V)` and passed through a **softmax** to predict the probabilities of each word in the vocabulary being a context word.

#### Step 4: Training Objective
The model is trained to **maximize the probability of actual context words** while minimizing the probability of unrelated words.



### 5. Mathematical Formulation

Given:
- Vocabulary size: `V`
- Embedding dimension: `N`
- Context window size: `n`
- Target word: `w_t`
- Context words: `w_{t-n}, ..., w_{t-1}, w_{t+1}, ..., w_{t+n}`

The Skip-Gram objective is to maximize the log probability:

\[
J = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \le j \le n, j \ne 0} \log P(w_{t+j} | w_t)
\]

Where:

\[
P(w_{t+j} | w_t) = \frac{\exp(v_{w_{t+j}}'^{T} v_{w_t})}{\sum_{w=1}^{V} \exp(v_w'^{T} v_{w_t})}
\]

Here:
- \( v_{w_t} \) → vector representation of the target word  
- \( v_{w_{t+j}}' \) → vector representation of a context word  
- The denominator normalizes probabilities across the vocabulary.
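
The two formulas above can be evaluated directly on a toy corpus. In this sketch the weights are random (untrained), the lowercased sentence is used as the whole corpus, and the function names `context_probs` and `objective` are introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in tokens]
V, N, n = len(vocab), 3, 2                   # vocabulary size, embedding dim, window size

W = rng.normal(scale=0.1, size=(V, N))       # rows are target vectors v_w
W_out = rng.normal(scale=0.1, size=(N, V))   # columns are context vectors v'_w

def context_probs(target_id):
    """P(w | w_t) for every word w: softmax over v'_w^T v_{w_t}."""
    scores = W[target_id] @ W_out
    e = np.exp(scores - scores.max())        # subtract the max for numerical stability
    return e / e.sum()

def objective(ids, n):
    """Average log-probability J over all positions t and offsets j != 0."""
    total = 0.0
    for t, target_id in enumerate(ids):
        probs = context_probs(target_id)
        for j in range(-n, n + 1):
            if j != 0 and 0 <= t + j < len(ids):
                total += np.log(probs[ids[t + j]])
    return total / len(ids)

J = objective(ids, n)
print(f"J = {J:.3f}")
```

Training adjusts `W` and `W_out` to push `J` upward (closer to zero), since every term is the log of a probability.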



### 6. Why Use Skip-Gram?

| Feature | Description |
|----------|--------------|
| **Direction** | Predicts context from target |
| **Performance** | Works well with small datasets |
| **Handling rare words** | Learns better representations for infrequent words |
| **Flexibility** | Uses separate target and context vectors, so the predicted probability of “hospital” given “doctor” need not equal that of “doctor” given “hospital” |



### 7. Skip-Gram vs CBOW

| Feature | Skip-Gram | CBOW |
|----------|------------|------|
| **Predicts** | Context words from target word | Target word from context |
| **Training speed** | Slower | Faster |
| **Works better for** | Rare words | Frequent words |
| **Direction** | One → Many | Many → One |
| **Context handling** | Treats each context pair independently | Averages all context embeddings |



### 8. Improving Efficiency: Negative Sampling and Hierarchical Softmax

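Since computing the full softmax over a large vocabulary is expensive for every training pair, Word2Vec implementations of Skip-Gram (and CBOW) typically use one of two approximations:
1. **Negative Sampling** – updates the weights for the true context word plus a few randomly sampled negative words per step, instead of all `V` words.  
2. **Hierarchical Softmax** – organizes the vocabulary as a binary tree, so computing a word's probability takes roughly `log2(V)` steps instead of `V`.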

These techniques make Skip-Gram scalable to massive corpora.
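
A single negative-sampling update might look like the sketch below. It treats the true context word as a positive example and a few sampled words as negatives, each scored with a sigmoid instead of a softmax. The sizes, learning rate, and the name `sgns_step` are assumptions for this example, and negatives are drawn uniformly here for simplicity, whereas the original Word2Vec work samples them from a smoothed unigram distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 8, 4                                  # toy vocabulary size and embedding dimension
W = rng.normal(scale=0.1, size=(V, N))       # target (input) vectors, one row per word
U = rng.normal(scale=0.1, size=(V, N))       # context (output) vectors, one row per word

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target_id, context_id, negative_ids, lr=0.1):
    """One negative-sampling update for a (target, context) pair; returns the loss."""
    ids = np.array([context_id, *negative_ids])
    labels = np.zeros(len(ids))
    labels[0] = 1.0                          # 1 for the true context word, 0 for negatives
    v = W[target_id].copy()
    p = sigmoid(U[ids] @ v)                  # predicted probability that each word is a true context
    loss = -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

    g = p - labels                           # gradient of the loss w.r.t. each score
    grad_v = g @ U[ids]                      # computed before U is modified
    np.subtract.at(U, ids, lr * np.outer(g, v))   # touches only 1 + k rows of U
    W[target_id] -= lr * grad_v
    return loss

# Draw 3 negatives uniformly from words other than the target (0) and context (1).
candidates = [i for i in range(V) if i not in (0, 1)]
negative_ids = rng.choice(candidates, size=3, replace=False)
losses = [sgns_step(0, 1, negative_ids) for _ in range(100)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The cost of each step depends on the number of negatives rather than on `V`, which is the source of the speedup.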



### 9. Summary

The Skip-Gram model learns embeddings by predicting the words that appear around a given target word.  
It captures rich semantic and syntactic relationships — words that appear in similar contexts end up close together in the embedding space.

For example:
- “king” and “queen” → similar vectors  
- “Paris” relates to “France” in the same way that “Tokyo” relates to “Japan”

These learned embeddings are widely used in NLP tasks such as:
- Sentiment analysis  
- Named entity recognition  
- Machine translation  
- Text classification
