<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/04.vswe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/04.vswe.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

# Vector Semantics and Embeddings

📝 SALP chapter 6

## 🤔 **How Do We Determine Word Meanings?**
- Words appearing in `similar contexts` tend to have `similar meanings`.
  - **Synonyms like "oculist" and "eye-doctor"** tend to occur near words like *eye* or *examined*.
  - **Meaning Similarity = Context Similarity:**
    - The difference in meaning between two words corresponds roughly to the difference in the contexts they occur in.
- **Distributional Hypothesis:** Words that occur in similar contexts have similar meanings.


### **Vector Semantics and Representation Learning**
- **Vector Semantics:**
  - Represents word meanings based on their distribution in text.
  - Widely used in NLP tasks that involve understanding word meaning.
- **Representation Learning:** 
  - Automatically learns useful representations of input data, instead of using handcrafted features.


### **Static vs. Dynamic Embeddings**

| **Feature**    | **Static Embeddings**   | **Dynamic Embeddings**    |
|---------|-------------|---------------|
| **Definition**    | Fixed representation for each word   | Context-dependent representation      |
| **Examples**     | Word2Vec, GloVe     | BERT, GPT       |
| **Context Awareness**     | Ignores context differences    | Adapts based on sentence context      |
| **Flexibility**      | Limited to one meaning per word   | Captures multiple meanings            |
| **Use Case**     | Simpler tasks (e.g., word similarity)  | Complex tasks (e.g., sentiment analysis, QA) |

- Dynamic embeddings offer more nuanced and context-sensitive word representations, enhancing performance in complex NLP tasks.

## Lexical Semantics
- `Classical/dictionary representation` of word meaning: 
  - Words as strings of letters or symbols (e.g., DOG for “dog”).
  - This approach is unsatisfactory because it doesn't capture relationships between words.
- **Desirable Features of Word Meaning Representation:**
  - **Similarity:** Cat is similar to dog.
  - **Antonymy:** Cold is the opposite of hot.
  - **Connotation:** Happy (positive) vs. sad (negative).
  - **Relatedness:** Buy, sell, and pay represent different perspectives on the same event.
- **Goal:**
  - Develop models that capture these nuances to handle tasks like question-answering and dialogue.


### **Lemmas, Senses, and Synonymy**
- **Lemmas and Word Senses:**
  - **Lemma:** The base form of a word
    - e.g., *mouse* for both mouse and mice
  - **Word Sense:** Different meanings of a lemma
    - e.g., *mouse* as a rodent vs. a computer device.
- **Synonymy:**
  - Words with `nearly identical meanings` in context 
  - e.g., car/automobile, couch/sofa.
- **Principle of Contrast:** 
  - No two words are exactly identical in meaning. 
  - Even near-synonyms differ in usage or connotation.
- 🍎 **Examples:**
  - Scientific vs. informal context: 
    - *H₂O* (scientific) vs. *water* (informal).
  - Genre differences are part of the meaning.


### **Word Similarity, Relatedness, and Connotation**
- **Word Similarity:**
  - Not necessarily synonyms, but share `related features` (e.g., cat/dog).
  - Important for tasks like paraphrasing and summarization.
- **Word Relatedness:**
  - Words can be related without being similar
    - e.g., coffee/cup, surgeon/scalpel.
  - **Semantic Fields:** Words grouped by topics
    - e.g., hospital: surgeon, nurse, anesthetic.


### **Semantic Frames and Roles**
- A **semantic frame** is a set of words that represent perspectives or participants in a specific event.
- 🍎**Example: Commercial Transaction Frame**
  - **Event:** Trading money for goods/services.
  - **Verbs:** 
    - *Buy* (from the buyer's perspective)
    - *Sell* (from the seller's perspective)
    - *Pay* (focuses on the monetary aspect)
  - **Roles:**
    - *Buyer*: Entity providing money
    - *Seller*: Entity providing goods/services
    - *Goods*: The item being exchanged
    - *Money*: The currency used in the transaction
- **Practical Application:**
  - Understanding that "Sam bought the book from Ling" is equivalent to "Ling sold the book to Sam."
  - Essential for tasks like **question-answering** and **machine translation**.


### **Connotation and Sentiment Analysis**
**Connotation** is the `affective meanings or emotions` associated with a word.
- **Positive Connotations:** positive sentiment
  - *Wonderful*, *Love*, *Great*
- **Negative Connotations:** negative sentiment
  - *Dreary*, *Terrible*, *Hate*
- **Contextual Differences:**
  - Words with similar meanings can have different connotations:
    - *Fake, Knockoff, Forgery* (negative) vs. *Copy, Replica, Reproduction* (neutral/positive).
    - *Innocent* (positive) vs. *Naive* (negative).
- **Application:**
  - Important for **sentiment analysis**, understanding user opinions in reviews, and political language analysis.


### **Dimensions of Affective Meaning**
- **Three Key Dimensions (Book: Osgood et al., The Measurement of Meaning):**
  1. **Valence:** Pleasantness of the word (e.g., *happy* = high, *unhappy* = low).
  2. **Arousal:** Intensity of the emotion (e.g., *excited* = high, *calm* = low).
  3. **Dominance:** Control exerted by the word (e.g., *controlling* = high, *awed* = low).
- **Examples:**
  - **Courageous:** [Valence: 8.05, Arousal: 5.5, Dominance: 7.38]
  - **Heartbreak:** [Valence: 2.45, Arousal: 5.65, Dominance: 3.58]
- **Implications:**
  - Words can be represented as points in a 3D space.
  - Foundation for **vector semantics** and understanding complex emotional nuances in text.

## Vector Semantics
- Each word is represented as a **vector** in a `semantic space`.
  - These vectors are called `embeddings`
- **Types of Vectors:**
  1. **Sparse Vectors:**
     - Derived from traditional methods like 
       - **tf-idf** or **PPMI** (positive pointwise mutual information)
     - Typically very long and contain mostly zeros.
  2. **Dense Vectors:**
     - Generated using models like **word2vec**.
     - Shorter, compact representations with useful semantic properties.

### **Vector/Distributional Models of Meaning**
- Represent word meanings based on `co-occurrence patterns` in text.
- Capture how often words appear together in a **co-occurrence matrix**, which comes in two common forms:
  1. **Term-Document Matrix**
  2. **Term-Term Matrix**


### **Term-Document Matrix**
- Rows represent words, and columns represent documents.
- Each cell shows the frequency of a word in a specific document.
- 🍎 **Example:** a term-document matrix for 4 words across 4 Shakespeare plays.

|               | As You Like It | Twelfth Night | Julius Caesar | Henry V  |
|---------------|----------------|---------------|---------------|---------|
| **battle**    | 1              | 0             | 7             | 13      |
| **good**      | 114            | 80            | 62            | 89      |
| **fool**      | 36             | 58            | 1             | 4       |
| **wit**       | 20             | 15            | 2             | 3       |


### **Document Representation as Vectors**
- Each column is called a document **vector**:
  - an array of numbers giving that document's word frequencies.
  - e.g. *As You Like It* is represented by the vector **[1, 114, 36, 20]**.
- Each document is a point in a high-dimensional vector space.
- This vector space `dimensionality` is equal to the `vocabulary size (|V|)`.


### **Document Similarity**
- Documents with similar vectors have similar content.
- 🍎 **Example:**
  - *As You Like It* and *Twelfth Night* are closer in the vector space because both use words like "fool" and "wit" often and "battle" rarely.
  - In contrast, *Julius Caesar* and *Henry V* have higher frequencies of "battle".



### **Words as Vectors: Document Dimensions**
- Words can be represented as vectors too, based on their distribution across documents.
- Each word vector is a row in the term-document matrix.
- 🍎 **Example:**
  - The vector for "fool" is **[36, 58, 1, 4]**.
  - Dimensions correspond to the four Shakespeare plays.
- **Observation:**
  - Related words (e.g., "fool" and "wit") have similar vectors because they occur in similar documents.
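
The row and column views of the term-document matrix can be sketched in a few lines of plain Python. This is only an illustration using the counts from the Shakespeare table; the helper names `word_vector` and `document_vector` are made up for this example.

```python
# Term-document counts copied from the Shakespeare table.
plays = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
counts = {
    "battle": [1, 0, 7, 13],
    "good":   [114, 80, 62, 89],
    "fool":   [36, 58, 1, 4],
    "wit":    [20, 15, 2, 3],
}

def word_vector(word):
    """Row of the matrix: one count per play."""
    return counts[word]

def document_vector(play):
    """Column of the matrix: one count per word, in row order."""
    j = plays.index(play)
    return [row[j] for row in counts.values()]

print(document_vector("As You Like It"))  # [1, 114, 36, 20]
print(word_vector("fool"))                # [36, 58, 1, 4]
```

The same table therefore yields a $|V|$-dimensional vector per document and a $|D|$-dimensional vector per word.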



### **Words as Vectors: Word Dimensions**
- **Term-Term Matrix (Word-Word Matrix):** Columns and rows represent words instead of documents.
  - Each cell records the number of times `two words co-occur` in a defined context
    - e.g., within a ±4 word window.
- 🍎 **Example:** Four words in the Wikipedia corpus

|               | computer  | data     | result  | pie   | sugar | count(w)  |
|---------------|-----------|----------|---------|-------|-------|------|
| **cherry**    | 2         | 8        | 9       | 442   | 25    | 486  |
| **strawberry**| 0         | 0        | 1       | 60    | 19    | 80  |
| **digital**   | 1670      | 1683     | 85      | 5     | 4     | 3447  |
| **information**| 3325     | 3982     | 378     | 5     | 13    | 7703  |
| **count(context)**| 4997  | 5673     | 473     | 512   | 61    | 11716 |

- **Word Similarity:**
  - Words like "cherry" and "strawberry" are more similar because they share contexts like "pie" and "sugar".
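
One way such a term-term matrix might be built is to slide a window over a tokenized corpus and count neighbors. The sketch below assumes whitespace tokenization and a toy sentence invented for illustration; `cooccurrence_counts` is a hypothetical helper, not part of any library.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=4):
    """Count context words within +/- `window` positions of each target token."""
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                counts[target][tokens[j]] += 1
    return counts

# Toy corpus (made up for this example).
tokens = "a slice of cherry pie with sugar and a slice of strawberry pie".split()
counts = cooccurrence_counts(tokens, window=4)

print(counts["cherry"]["pie"])        # "pie" is 1 token away from "cherry"
print(counts["cherry"]["sugar"])      # "sugar" is 3 tokens away, inside the window
print(counts["strawberry"]["sugar"])  # "sugar" is 5 tokens away, outside the window
```

Each `counts[target]` is one sparse row of the term-term matrix; a narrower window captures more syntactic neighbors, a wider one more topical neighbors.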



### **Dimensionality of Word Vectors**
- The number of dimensions is generally the size of the vocabulary $|V|$
  - often between 10,000 and 50,000 words, based on the most frequent words in the training corpus.
- Most cells are zeros, resulting in `sparse` vector representations.
- There are efficient algorithms for storing and computing with sparse matrices.
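
As a rough illustration of why sparsity helps, a vector can be stored as a dictionary of its non-zero entries only. This is a minimal sketch, not how production libraries implement sparse matrices; the vocabulary size of 10,000 and the helper names are assumptions made for the example.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {index: value}."""
    return {i: x for i, x in enumerate(dense) if x != 0}

def sparse_dot(a, b):
    """Dot product that only visits the entries of the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(x * b[i] for i, x in a.items() if i in b)

# Rows from the term-term table, padded with zeros to a hypothetical |V| = 10,000.
strawberry = to_sparse([0, 0, 1, 60, 19] + [0] * 9995)
cherry = to_sparse([2, 8, 9, 442, 25] + [0] * 9995)

print(len(strawberry), len(cherry))    # stored entries, not 10,000
print(sparse_dot(strawberry, cherry))
```

Storage and computation scale with the number of non-zero cells rather than with $|V|$.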

## Cosine for Measuring Similarity
- `Cosine similarity` measures how similar two vectors are by calculating the cosine of the angle between them.
- Commonly used to measure similarity between word or document vectors in NLP.
- Requires two vectors of the same dimensionality, either:
  - Both with words as dimensions (length |V|).
  - Both with documents as dimensions (length |D|).


### **The Dot Product**
- The **dot product** of two vectors $v$ and $w$ is calculated as:
  - $\displaystyle\text{dot product}(v, w) = v \cdot w = \sum_{i=1}^{N} v_i w_i = v_1w_1 + v_2w_2 + ... + v_N w_N$
- Acts as a `similarity metric`
  - high when vectors have large values in the same dimensions.
  - 0 if vectors are orthogonal (`unrelated`)
- **Issue:**
  - The raw dot product favors vectors with large magnitude (length $|v|$), so frequent words get higher scores regardless of how similar they actually are.
  - This is because frequent words co-occur with more words, and with higher counts, so their vectors have larger values in many dimensions.
- **Solution:**
  - Normalize the dot product by dividing by the lengths of both vectors.
    - $\displaystyle\cos(v, w) = \frac{v \cdot w}{|v||w|}$
    - Where:
      - $\displaystyle|v| = \sqrt{\sum_{i=1}^{N} v_i^2} \quad \text{and} \quad |w| = \sqrt{\sum_{i=1}^{N} w_i^2}$
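
The dot product, vector length, and cosine defined above can be written directly in plain Python. This is a minimal sketch with small made-up vectors, meant only to show that scaling a vector changes the dot product but not the cosine.

```python
import math

def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

def norm(v):
    return math.sqrt(dot(v, v))

def cosine(v, w):
    """Cosine of the angle between v and w (undefined for all-zero vectors)."""
    return dot(v, w) / (norm(v) * norm(w))

v = [1, 2, 0]
print(dot(v, [1, 2, 0]), dot(v, [10, 20, 0]))  # dot product grows with magnitude
print(cosine(v, [10, 20, 0]))                  # cosine ignores the scaling
print(cosine(v, [0, 0, 5]))                    # orthogonal vectors
```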


### **Properties of Cosine Similarity**
- Cosine similarity ranges from 0 to 1 for non-negative frequency vectors (and from -1 to 1 in general):
  - **1:** Vectors point in the same direction (identical up to scaling).
  - **0:** Vectors are orthogonal (no shared non-zero dimensions).
- **Advantages:**
  - Measures similarity regardless of vector length (frequency).
  - Useful in comparing high-dimensional, sparse vectors like word embeddings.


### 🍎 **Example: Comparing Words with Cosine Similarity**
- Given the term-term matrix:

|         | pie | data | computer |
|---------|-----|------|----------|
| **cherry**     | 442 | 8    | 2        |
| **digital**    | 5   | 1683 | 1670     |
| **information**| 5   | 3982 | 3325     |


- **Calculation: Cosine Similarity of Cherry and Information**
  - $\displaystyle\cos(\text{cherry}, \text{information}) = \frac{442 \times 5 + 8 \times 3982 + 2 \times 3325}{\sqrt{442^2 + 8^2 + 2^2} \sqrt{5^2 + 3982^2 + 3325^2}}$
  - $= 0.018$

- **Calculation: Cosine Similarity of Digital and Information**
  - $\displaystyle\cos(\text{digital}, \text{information}) = \frac{5 \times 5 + 1683 \times 3982 + 1670 \times 3325}{\sqrt{5^2 + 1683^2 + 1670^2} \sqrt{5^2 + 3982^2 + 3325^2}}$
  - $= 0.996$

- **Digital** is much closer in meaning to **information** than **cherry** is, based on cosine similarity.
  - This result makes intuitive sense as "digital" and "information" frequently co-occur in similar contexts, unlike "cherry."
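
The two calculations above can be checked with a few lines of Python. The vectors are the rows of the small table in this example, with dimensions ordered as pie, data, computer.

```python
import math

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

# Dimensions: pie, data, computer
cherry = [442, 8, 2]
digital = [5, 1683, 1670]
information = [5, 3982, 3325]

print(round(cosine(cherry, information), 3))   # 0.018
print(round(cosine(digital, information), 3))  # 0.996
```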

## TF-IDF (Term Frequency - Inverse Document Frequency)
- A weighting scheme that scores how important a word is to a document relative to the rest of a collection.
- Addresses the limitations of raw frequency, which is highly skewed and not very discriminative.
  - Words like **the**, **it**, and **they** appear frequently but are not informative.
  - Frequent words across documents can obscure meaningful associations
  - e.g. Words like "good" in Shakespeare's plays are frequent across all plays and don't help in distinguishing between them.


### **Balancing Frequency with TF-IDF**
- **The Paradox:**
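  - Words that occur often in a document say more about it than words that occur once or twice, yet words that are frequent everywhere (like *the* or *good*) are not useful.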
- **TF-IDF Solution:**
  - Combines **Term Frequency (TF)** and **Inverse Document Frequency (IDF)** to balance word frequency across documents.


### **Term Frequency (TF)**
- Measures the frequency of a word $t$ in a document $d$.
  - $\text{tf}_{t, d} = \text{count}(t, d)$
- **Log Scaling** is often applied to squash the frequency values:
  - $\text{tf}_{t, d} = \begin{cases} 
  1 + \log_{10} \text{count}(t, d) & \text{if } \text{count}(t, d) > 0 \\ 
  0 & \text{otherwise} 
  \end{cases}$
  - **Example Values:**
    - 1 occurrence: $\text{tf} = 1$
    - 10 occurrences: $\text{tf} = 2$
    - 100 occurrences: $\text{tf} = 3$
- The **collection frequency** of a term is its total number of occurrences in the whole collection of documents. 
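
The log-scaled term frequency is a one-line function. The sketch below simply mirrors the piecewise definition above; `log_tf` is a name chosen for this example.

```python
import math

def log_tf(count):
    """1 + log10(count) for positive counts, 0 otherwise."""
    return 1 + math.log10(count) if count > 0 else 0.0

for count in (0, 1, 10, 100):
    print(count, log_tf(count))
```

A hundredfold increase in raw count only triples the weight, which is the "squashing" effect.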


### **Inverse Document Frequency (IDF)**
- Measures the importance of a term across all documents.
- $\text{idf}_t = \log_{10} \left( \dfrac{N}{\text{df}_t} \right)$
  - $N$ = Total number of documents.
  - Document frequency $\text{df}_t$ of a term $t$ = Number of documents containing term $t$.
- Gives higher weight to terms appearing in fewer documents.


### **Example of IDF in Shakespeare's Plays**

| Word     | Document Frequency | IDF  |
|----------|--------------------|------|
| Romeo    | 1                  | 1.57 |
| Salad    | 2                  | 1.27 |
| Falstaff | 4                  | 0.967 |
| Forest   | 12                 | 0.489 |
| Battle   | 21                 | 0.246 |
| Wit      | 34                 | 0.037 |
| Fool     | 36                 | 0.012 |
| Good     | 37                 | 0    |

- With $N = 37$ plays in the collection, words like "Romeo" have high IDF and are discriminative, whereas "good" occurs in every play, so its IDF is $\log_{10}(37/37) = 0$ and it is non-informative.
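
The IDF column can be recomputed from the document frequencies. This sketch assumes $N = 37$ plays, which is inferred from the table (a word with document frequency 37 has IDF 0) rather than stated explicitly.

```python
import math

N = 37  # assumed collection size, inferred from the table
df = {"Romeo": 1, "salad": 2, "Falstaff": 4, "forest": 12,
      "battle": 21, "wit": 34, "fool": 36, "good": 37}

def idf(doc_freq, n_docs=N):
    return math.log10(n_docs / doc_freq)

for word, d in df.items():
    print(f"{word:10s} df={d:2d} idf={idf(d):.3f}")
```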


### **Calculating TF-IDF**
- The tf-idf weighted value $w_{t, d}$ for word $t$ in document $d$ combines term frequency $\text{tf}_{t, d}$ with inverse document frequency $\text{idf}_t$:
  - $w_{t, d} = \text{tf}_{t, d} \times \text{idf}_t$
- **Example Calculation:**
  - **Word:** *Wit* in *As You Like It*
  - **Term Frequency:** $1 + \log_{10}(20) = 2.301$
  - **IDF:** $0.037$
  - **TF-IDF Weight:** $2.301 \times 0.037 = 0.085$
- The weight of "wit" is reduced significantly, reflecting its high frequency across many plays.
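
The worked example can be reproduced as follows. This is a sketch assuming log-scaled tf and 37 plays in the collection; because it uses the unrounded IDF, the result differs slightly in the last digit from multiplying the rounded values by hand.

```python
import math

def tf_idf(count, doc_freq, n_docs):
    tf = 1 + math.log10(count) if count > 0 else 0.0
    idf = math.log10(n_docs / doc_freq)
    return tf * idf

wit = tf_idf(count=20, doc_freq=34, n_docs=37)     # "wit" in As You Like It
good = tf_idf(count=114, doc_freq=37, n_docs=37)   # "good" in As You Like It
print(wit, good)
```

Even though "good" occurs 114 times in the play, its weight is zero because it appears in every document.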


### **Benefits of TF-IDF Weighting**
- **Advantages:**
  - Reduces the impact of common, non-informative words.
  - Highlights unique terms that distinguish documents from each other.
- **Applications:**
  - Commonly used in information retrieval and text mining.
  - Provides a strong baseline for NLP tasks.

## **Pointwise Mutual Information (PMI)**
- A statistical measure used to evaluate the association between two events $x$ and $y$ (e.g., words) by comparing their observed co-occurrence to what would be expected if they were independent.
  - $\displaystyle \text{PMI}(x, y) = \log_2 \left( \frac{P(x, y)}{P(x)P(y)} \right)$
- **Intuition:**
  - If two words co-occur much more often than their individual probabilities would predict, PMI is high.
  - For example, if "cat" and "fur" appear together more often than expected under independence, PMI indicates a strong association.
- PMI is frequently used to quantify the relationship between a **target word** $w$ (e.g., "cat") and a **context word** $c$ (e.g., "fur").
  - $\displaystyle \text{PMI}(w, c) = \log_2 \left( \frac{P(w, c)}{P(w)P(c)} \right)$
  1. $P(w, c)$: Probability of observing the target word $w$ and context word $c$ together.
  2. $P(w)P(c)$: Probability of observing them together if $w$ and $c$ occurred independently.
- **Interpretation:**
  - PMI > 0: Words co-occur more often than by chance (positive association).
  - PMI = 0: Words co-occur as expected by chance.
  - PMI < 0: Words co-occur less often than by chance (negative association).



### 🍎**Calculating PMI for 'information' and 'data':**
1. **Co-occurrence Probability:**
   - $\displaystyle P(w=\text{information}, c=\text{data}) = \frac{3982}{11716} = 0.3399$

2. **Individual Probabilities:**
   - $\displaystyle P(w=\text{information}) = \frac{7703}{11716} = 0.6575$
   - $\displaystyle P(c=\text{data}) = \frac{5673}{11716} = 0.4842$

3. **PMI Calculation:**
   - $\displaystyle \text{PMI}(\text{information}, \text{data}) = \log_2 \left( \frac{0.3399}{0.6575 \times 0.4842} \right) = 0.0944$

- The PMI value of 0.0944 suggests a slight positive association between "information" and "data" in this context.
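
The same calculation can be done from the raw counts in the term-term table. The function name `pmi` and its argument names are chosen for this sketch.

```python
import math

def pmi(f_wc, f_w, f_c, total):
    """PMI from a co-occurrence count, the two marginal counts, and the grand total."""
    p_wc = f_wc / total
    p_w = f_w / total
    p_c = f_c / total
    return math.log2(p_wc / (p_w * p_c))

value = pmi(f_wc=3982, f_w=7703, f_c=5673, total=11716)
print(round(value, 3))
```

Note that `math.log2` raises a `ValueError` when the co-occurrence count is zero, which is one practical reason to clip or smooth PMI.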



### **Challenges with PMI**

- **Issues with Negative PMI:**
  - Negative PMI values suggest that two words co-occur less often than expected by chance, but these values can be unreliable unless the corpus is extremely large.
  - For example, distinguishing whether two words with very low probabilities occur together less often than by chance would require vast amounts of data.

- **Bias Towards Infrequent Events:**
  - PMI tends to give high values for rare word pairs even if their co-occurrence is not meaningful.
  - Example: A rare term like "quantum" paired with "entanglement" might have an inflated PMI due to their low individual frequencies.

- **Impact on NLP Models:**
  - This bias can lead to incorrect conclusions about word associations and affect downstream tasks such as topic modeling or semantic similarity.



### **Positive Pointwise Mutual Information (PPMI)**

- **Positive PMI (PPMI)** sets all negative PMI values to zero, making it more robust against unreliable associations.

  - $\displaystyle \text{PPMI}(w, c) = \max \left( \log_2 \left( \frac{P(w, c)}{P(w)P(c)} \right), 0 \right)$

- **Why Use PPMI?**
  - It focuses only on positive associations, eliminating misleading negative values.
  - PPMI is commonly used in word embeddings and vector space models, where negative PMI values do not contribute meaningful information.

- **Example:**
  - If PMI for (word1, word2) is -2, PPMI sets it to 0.
  - If PMI for (word1, word2) is 4, PPMI retains the value of 4.



### **Constructing the PPMI Matrix**

- A co-occurrence matrix $F$ has:
  - **W rows:** Represent words in the vocabulary.
  - **C columns:** Represent contexts (surrounding words).
  - Each cell $f_{ij}$ in $F$ indicates the number of times word $w_i$ appears within context $c_j$.

- **Constructing the PPMI Matrix:**
1. Calculate the joint probability $p_{ij}$ of $w_i$ and $c_j$:
   - $\displaystyle p_{ij}=\frac{f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$
2. Calculate the marginal probabilities of word $p_{i*}$ and context $p_{*j}$:
   - $\displaystyle p_{i*} = \dfrac{\sum_{j=1}^C f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}  \quad  p_{*j} = \dfrac{\sum_{i=1}^W f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$
3. Compute: $\displaystyle \text{PPMI}_{ij} = \max \left( \log_2 \left( \frac{p_{ij}}{p_{i*} \times p_{*j}} \right), 0 \right)$

- The resulting PPMI matrix highlights the strength of association between words and their contexts.



### **Example PPMI Calculations**

| `w\P(w,c)\c`| Computer | Data   | Result | Pie    | Sugar  | p(w)  |
|-------------|----------|--------|--------|--------|--------|-------|
| Cherry      | 0.0002   | 0.0007 | 0.0008 | 0.0377 | 0.0021 | 0.0415|
| Strawberry  | 0.0000   | 0.0000 | 0.0001 | 0.0051 | 0.0016 | 0.0068|
| Digital     | 0.1425   | 0.1436 | 0.0073 | 0.0004 | 0.0003 | 0.2942|
| Information | 0.2838   | 0.3399 | 0.0323 | 0.0004 | 0.0011 | 0.6575|
| p(c)        | 0.4265   | 0.4842 | 0.0404 | 0.0437 | 0.0052 |       |


| Word       | Computer | Data | Result | Pie  | Sugar |
|------------|------|------|--------|------|-------|
| Cherry     | 0    | 0    | 0      | 4.38 | 3.30  |
| Strawberry | 0    | 0    | 0      | 4.10 | 5.51  |
| Digital    | 0.18 | 0.01 | 0      | 0    | 0     |
| Information| 0.02 | 0.09 | 0.28   | 0    | 0     |

- Cherry and strawberry are highly associated with "pie" and "sugar," indicating strong co-occurrence.
- Digital has a weaker association with "data," possibly due to its broader usage context.
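
The PPMI table can be regenerated from the raw co-occurrence counts with plain Python. This is a minimal sketch; `ppmi_matrix` is a hypothetical helper, and cells with a zero count are set to 0 directly because their PMI would be $-\infty$.

```python
import math

contexts = ["computer", "data", "result", "pie", "sugar"]
F = {
    "cherry":      [2, 8, 9, 442, 25],
    "strawberry":  [0, 0, 1, 60, 19],
    "digital":     [1670, 1683, 85, 5, 4],
    "information": [3325, 3982, 378, 5, 13],
}

def ppmi_matrix(F):
    total = sum(sum(row) for row in F.values())
    col_sums = [sum(col) for col in zip(*F.values())]
    out = {}
    for word, row in F.items():
        row_sum = sum(row)
        out[word] = []
        for f, col_sum in zip(row, col_sums):
            if f == 0:
                out[word].append(0.0)  # PMI is -infinity here; PPMI clips it to 0
            else:
                # f * total / (row_sum * col_sum) equals p_ij / (p_i* * p_*j)
                pmi = math.log2(f * total / (row_sum * col_sum))
                out[word].append(max(pmi, 0.0))
    return out

ppmi = ppmi_matrix(F)
for word, row in ppmi.items():
    print(f"{word:12s}", [round(x, 2) for x in row])
```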



### **Handling PMI Bias with $\alpha$-Smoothing**
- Rare contexts can disproportionately inflate PMI values.
- **Solution:**
  - Replace the context probability $P(c)$ with an $\alpha$-smoothed version $P_{\alpha}(c)$:
  - $\displaystyle \text{PPMI}_{\alpha}(w, c) = \max \left( \log_2 \left( \frac{P(w, c)}{P(w)P_{\alpha}(c)} \right), 0 \right)$
  - **$P_{\alpha}(c) = \dfrac{\text{count}(c)^{\alpha}}{\sum_{c'} \text{count}(c')^{\alpha}}$:** Raises each context count to the power $\alpha$ and renormalizes.
  - With $\alpha < 1$ (commonly $\alpha = 0.75$), rare contexts get a slightly higher probability than under $P(c)$, which lowers their PMI.
- Reduces high PMI values for rare word-context pairs, providing more meaningful associations.
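
The effect of $\alpha$-smoothing on the context distribution can be seen with the column totals from the term-term table. This is only a sketch; `smoothed_context_probs` is a name invented for the example.

```python
def smoothed_context_probs(context_counts, alpha=0.75):
    """Raise each context count to the power alpha, then renormalize."""
    powered = {c: n ** alpha for c, n in context_counts.items()}
    z = sum(powered.values())
    return {c: p / z for c, p in powered.items()}

context_counts = {"computer": 4997, "data": 5673, "result": 473, "pie": 512, "sugar": 61}
total = sum(context_counts.values())
p = {c: n / total for c, n in context_counts.items()}
p_alpha = smoothed_context_probs(context_counts, alpha=0.75)

for c in context_counts:
    print(f"{c:9s} P(c)={p[c]:.4f}  smoothed={p_alpha[c]:.4f}")
```

The rare context "sugar" gains probability mass while the frequent context "data" loses some, so PMI values involving rare contexts shrink.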



### **Another Solution: Laplace Smoothing for PMI**
- **Laplace Smoothing**: Add a small constant $k$ (e.g., 0.1 to 3) to each count before calculating PMI.
- **Why Laplace Smoothing?**
  - Helps reduce the impact of zero or low-frequency counts in co-occurrence matrices.
  - Larger $k$ values discount more, reducing PMI bias further.
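- **Modified PMI Formula:** add $k$ to every cell $f_{ij}$ of the $W \times C$ count matrix, then recompute the probabilities from the smoothed counts:
  - $\displaystyle \text{PMI}_k(w_i, c_j) = \log_2 \left( \frac{(f_{ij} + k)\,(N + kWC)}{(f_{i*} + kC)\,(f_{*j} + kW)} \right)$
  - Here $N = \sum_{i}\sum_{j} f_{ij}$, and $f_{i*}$ and $f_{*j}$ are the row and column sums of the original counts.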
- **Choosing $k$:**
  - Smaller values maintain original PMI properties; larger values decrease variability in low-frequency pairs.
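
Add-$k$ smoothing can be sketched by adding $k$ to every cell of the count matrix and then recomputing PPMI from the smoothed counts. The code below reuses the four-word term-term counts and assumes $k > 0$, so no cell is zero; `add_k_ppmi` is a hypothetical helper.

```python
import math

# Columns: computer, data, result, pie, sugar
F = {
    "cherry":      [2, 8, 9, 442, 25],
    "strawberry":  [0, 0, 1, 60, 19],
    "digital":     [1670, 1683, 85, 5, 4],
    "information": [3325, 3982, 378, 5, 13],
}

def add_k_ppmi(F, k):
    """PPMI after adding k (> 0) to every cell of the count matrix."""
    smoothed = {w: [f + k for f in row] for w, row in F.items()}
    total = sum(sum(row) for row in smoothed.values())
    col_sums = [sum(col) for col in zip(*smoothed.values())]
    return {
        w: [max(math.log2(f * total / (sum(row) * c)), 0.0)
            for f, c in zip(row, col_sums)]
        for w, row in smoothed.items()
    }

for k in (0.1, 2):
    value = add_k_ppmi(F, k)["strawberry"][4]  # strawberry / sugar
    print(k, round(value, 2))
```

The strawberry/sugar value, which is large in the unsmoothed matrix because both words are rare, drops as $k$ grows.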

## Applications of tf-idf and PPMI Vector Models

- **Vector Representation:**
  - Target word represented as a vector.
  - Dimensions correspond to either:
    - Documents in a collection (term-document matrix).
    - Counts of neighboring words (term-term matrix).
  - Values in dimensions weighted by:
    - **tf-idf** (for term-document matrices).
    - **PPMI** (for term-term matrices).
  - Vectors are sparse (mostly zeros).

- **Similarity Computation:**
  - Similarity between two words $x$ and $y$ computed using cosine similarity.
  - High cosine value indicates high similarity.
  - Referred to as the **tf-idf model** or **PPMI model** based on the weighting function.
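
A nearest-neighbor lookup by cosine similarity might look like the sketch below. For brevity it uses the raw term-term counts as the vectors; in a tf-idf or PPMI model the weighted values would be used instead. `most_similar` is a name chosen for this example.

```python
import math

# Columns: computer, data, result, pie, sugar
vectors = {
    "cherry":      [2, 8, 9, 442, 25],
    "strawberry":  [0, 0, 1, 60, 19],
    "digital":     [1670, 1683, 85, 5, 4],
    "information": [3325, 3982, 378, 5, 13],
}

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

def most_similar(target, vectors, top_n=10):
    """Rank every other word by cosine similarity to `target`."""
    scores = [(cosine(vectors[target], vec), word)
              for word, vec in vectors.items() if word != target]
    return [(word, score) for score, word in sorted(scores, reverse=True)[:top_n]]

print(most_similar("cherry", vectors, top_n=1))
print(most_similar("digital", vectors, top_n=1))
```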


### **Applications of tf-idf and PPMI Models in Document and Word Similarity**

- **Document Similarity:**
  - **Document Vector:** 
    - Represented by the centroid of word vectors in the document.
    - Centroid minimizes the sum of squared distances to each vector.
    - Formula: $\mathbf{d} = \dfrac{\mathbf{w}_1 + \mathbf{w}_2 + ... + \mathbf{w}_k}{k}$
  - Compute similarity between two documents $\mathbf{d}_1$ and $\mathbf{d}_2$ using cosine similarity.
  - Applications:
    - Information retrieval.
    - Plagiarism detection.
    - News recommendation.
    - Digital humanities (comparing texts).

- **Word Similarity:**
  - Use PPMI or tf-idf models to find word paraphrases.
  - Track changes in word meanings.
  - Discover meanings in different corpora.
  - Find top 10 similar words by cosine similarity.
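
The centroid-based document comparison can be sketched as follows. The three-dimensional word vectors are invented purely for illustration; in practice they would be tf-idf or PPMI vectors.

```python
import math

def centroid(word_vectors):
    """Element-wise mean of a list of equal-length vectors."""
    k = len(word_vectors)
    return [sum(dim) / k for dim in zip(*word_vectors)]

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

# Hypothetical word vectors (not taken from any real corpus).
word_vecs = {
    "cherry":  [4.0, 0.0, 0.1],
    "pie":     [3.5, 0.2, 0.0],
    "data":    [0.1, 3.0, 2.5],
    "digital": [0.0, 2.8, 3.1],
}

d1 = centroid([word_vecs["cherry"], word_vecs["pie"]])                    # "cherry pie"
d2 = centroid([word_vecs["data"], word_vecs["digital"]])                  # "digital data"
d3 = centroid([word_vecs["pie"], word_vecs["cherry"], word_vecs["pie"]])  # "pie cherry pie"

print(cosine(d1, d3))  # documents about the same topic
print(cosine(d1, d2))  # documents about different topics
```

Averaging discards word order, so this representation only captures which kinds of words a document contains.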
