<a href="https://colab.research.google.com/github/roguxivlo/machine-learning-24L/blob/main/hw12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Homework Assignment: Is a *queen* really just a *king*, minus a *man*, plus a *woman*?**

--------------



In class, we dealt with **embeddings** trained for **sentiment classification**. These embeddings are optimized to separate *positive* from *negative* expressions and **do not encode deeper semantic information**.

However, modern natural language processing offers other embeddings, such as those from **BERT**, **word2vec**, or **GloVe**, that **do capture semantic structure**. These models are trained on large corpora, and their embeddings often allow for meaningful **vector arithmetic**, like the famous:

```
embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")
```

This homework explores **semantic vector relationships** using such pretrained embeddings.

## **The Objective**

Your task is to:

1. Construct semantic classes of word pairs.
2. Visualize them using PCA.
3. Explore arithmetic operations in embedding space.

## **Tasks & Deliverables**

### 1. **Semantic Pair Classes**

- You must gather **at least 10 classes** of semantically related word pairs.
- Each class must contain **at least 5 pairs**.
- That gives a **minimum total of 100 unique words** (10 classes × 5 pairs × 2 words per pair).

Two example classes:

**Class 1: Gender**

- (king, queen)
- (man, woman)
- (doctor, nurse)
- (prince, princess)
- *(you must add one more)*

**Class 2: Verb tense (past tense)**

- (bring, brought)
- (get, got)
- (like, liked)
- *(you must add two more)*

**Your job:**

- Invent or search for **at least 10 such classes**, including the examples above.
- Each class must be conceptually coherent.
- Other examples: singular/plural, country/capital, comparative/superlative, tool/user, job/object, etc.
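
The size requirements above can be checked mechanically before any embeddings are loaded. A minimal sketch, assuming the classes are stored as a dict that maps each class name to a list of word pairs; the helper name `check_classes` and its parameters are hypothetical, and the toy input uses deliberately relaxed limits:

```python
def check_classes(classes, min_classes=10, min_pairs=5):
    """Return a list of problems found in a {class_name: [(a, b), ...]} dict.

    Hypothetical helper; an empty list means every requirement is met.
    """
    problems = []
    if len(classes) < min_classes:
        problems.append(f"only {len(classes)} classes, need {min_classes}")

    seen = {}  # word -> name of the class where it first appeared
    for name, pairs in classes.items():
        if len(pairs) < min_pairs:
            problems.append(f"{name}: only {len(pairs)} pairs, need {min_pairs}")
        for pair in pairs:
            for word in pair:
                if word in seen:
                    problems.append(f"{name}: '{word}' already used in {seen[word]}")
                else:
                    seen[word] = name
    return problems


# Toy input with relaxed limits, just to show the kind of report produced.
toy = {
    "Gender": [("king", "queen"), ("man", "woman")],
    "Royalty": [("king", "throne")],
}
for problem in check_classes(toy, min_classes=2, min_pairs=2):
    print(problem)
```

Returning a list of messages instead of raising on the first failure makes it possible to fix all issues in one pass.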

### 2. **Global PCA (Across All Words)**

- Use PCA to reduce the **entire set of 100 word embeddings** to 2D, and plot it.
- Plot an additional **10 separate charts**, one for each class, using the same global PCA projection.
  - Each chart should display only the 10 words (5 pairs) of the given class.
- Points should be labeled with the words themselves.
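
What distinguishes this step from a per-class PCA is that the projection is fitted once, on all words, and each class chart only selects rows from that shared 2D result. A minimal sketch using NumPy's SVD in place of `sklearn.decomposition.PCA` (the two agree up to the sign of each axis), with random vectors standing in for real pretrained embeddings; `pca_2d`, the toy vocabulary, and the 50-dimensional size are assumptions made for illustration:

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


# Stand-in data: random vectors instead of real pretrained embeddings.
rng = np.random.default_rng(0)
toy_classes = {
    "Gender": [("king", "queen"), ("man", "woman")],
    "Plural": [("cat", "cats"), ("dog", "dogs")],
}
vocab = [w for pairs in toy_classes.values() for pair in pairs for w in pair]
embeddings = rng.normal(size=(len(vocab), 50))

# Fit once on ALL words, then reuse the same coordinates for every class chart.
global_coords = pca_2d(embeddings)
row_of = {word: i for i, word in enumerate(vocab)}

for name, pairs in toy_classes.items():
    rows = [row_of[w] for pair in pairs for w in pair]
    class_coords = global_coords[rows]  # a subset of the global plot, not a new PCA
    print(name, class_coords.shape)
```

Because every class chart reuses `global_coords`, the position of a word is identical in the global plot and in its class plot, which makes the two directly comparable.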

### 3. **Local PCA (Per Class)**

- For each class (10 total), perform PCA **only** on the 10 words of that class.
- Plot these class-wise PCA visualizations as separate charts.
- Again, points should be labeled with the words.

**Total: 21 charts**
(1 global plot with 100 words + 10 global-space class plots + 10 local PCA class plots)

Charts should be presented in a self-explanatory manner with clear labels.
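
One way to keep all 21 charts consistent is to route them through a single plotting helper. The sketch below assumes the 2D coordinates have already been computed; `plot_labeled`, its parameters, and the toy coordinates are hypothetical, and the dashed line joining the two words of each pair is an optional design choice, not something the assignment requires:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for this sketch; drop this line in Colab
import matplotlib.pyplot as plt
import numpy as np


def plot_labeled(coords, words, title, pairs=None):
    """Scatter 2D points, label each with its word, optionally link pairs."""
    fig, ax = plt.subplots(figsize=(6, 5))
    ax.scatter(coords[:, 0], coords[:, 1], s=20)
    for (x, y), word in zip(coords, words):
        ax.annotate(word, (x, y), xytext=(3, 3), textcoords="offset points")
    if pairs:
        index = {w: i for i, w in enumerate(words)}
        for a, b in pairs:
            xs = coords[[index[a], index[b]], 0]
            ys = coords[[index[a], index[b]], 1]
            ax.plot(xs, ys, linestyle="--", linewidth=0.8, color="gray")
    ax.set_title(title)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    return fig, ax


# Toy coordinates, chosen by hand only to exercise the helper.
words = ["king", "queen", "man", "woman"]
coords = np.array([[1.0, 2.0], [2.0, 2.5], [1.0, 0.0], [2.0, 0.5]])
fig, ax = plot_labeled(
    coords, words, "Gender (toy coordinates)",
    pairs=[("king", "queen"), ("man", "woman")],
)
```

If the pair lines within a class come out roughly parallel, that is a visual hint that the class relation corresponds to a consistent direction in the projected space.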

### 4. **Embedding Arithmetic**

For each class, choose **one example pair** (e.g., (king, queen)) and perform the operation:

```
embedding(B) - embedding(A) + embedding(C)
```

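Here, (A, B) is a known pair from the class and C is another base word, such as the first word of a different pair in that class.
For example, taking (A, B) = (man, woman) and C = king gives `embedding("woman") - embedding("man") + embedding("king")`, which is the same vector as: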

```
embedding("king") - embedding("man") + embedding("woman")
```

* For each such result vector, find the **5 closest word embeddings** (using cosine similarity or Euclidean distance).
* Print the top 5 neighbors **with their distances**.
* Do this **once per class** (i.e., 10 times).
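
The lookup described in these steps can be sketched as follows. This is a minimal version that assumes the vocabulary embeddings sit in a NumPy matrix with one row per word; `nearest_words` and the hand-made 4-dimensional toy vectors are hypothetical stand-ins for a real pretrained model. It reports cosine distance (1 minus cosine similarity) and can optionally exclude the query words, since the input words themselves are often among the nearest neighbors of the result vector:

```python
import numpy as np


def nearest_words(query, vocab, matrix, k=5, exclude=()):
    """Return the k (word, cosine_distance) pairs closest to the query vector."""
    # Normalise rows and query so that a dot product equals cosine similarity.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    distances = 1.0 - unit_rows @ unit_query
    order = np.argsort(distances)
    hits = [(vocab[i], float(distances[i])) for i in order if vocab[i] not in exclude]
    return hits[:k]


# Hand-made toy vectors with axes (royal, male, female, fruit); NOT real embeddings.
vocab = ["king", "queen", "man", "woman", "prince", "princess", "apple"]
matrix = np.array([
    [1.0, 1.0, 0.0, 0.0],  # king
    [1.0, 0.0, 1.0, 0.0],  # queen
    [0.0, 1.0, 0.0, 0.0],  # man
    [0.0, 0.0, 1.0, 0.0],  # woman
    [0.8, 1.0, 0.0, 0.0],  # prince
    [0.8, 0.0, 1.0, 0.0],  # princess
    [0.0, 0.0, 0.0, 1.0],  # apple
])
emb = dict(zip(vocab, matrix))

# embedding(B) - embedding(A) + embedding(C) with (A, B) = (man, woman), C = king
result = emb["woman"] - emb["man"] + emb["king"]

for word, dist in nearest_words(result, vocab, matrix, k=5):
    print(f"{word:10s} cosine distance = {dist:.4f}")
```

With real embeddings the result vector rarely coincides with any word exactly, so comparing the ranking with and without `exclude={"king", "man", "woman"}` shows how much of the apparent analogy comes from the query words themselves.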

This makes it possible to verify whether

```
embedding("queen") ≈ embedding("king") - embedding("man") + embedding("woman")
```

holds for the *gender*-related class.


### 5. **Discussion**

* Analyze and interpret your 21 plots.
* Discuss whether the vector relationships are preserved.
* Does PCA capture semantic differences?
* Are the closest words from the arithmetic meaningful?
* What kinds of relationships are captured, and what are not?
* Are some classes better behaved than others?


### 6. **Publish on GitHub**

- Place the Colab notebook in your **GitHub repository** for this course.
- In your repository’s **README**, add a **link** to the notebook and also include an **“Open in Colab”** badge at the top of the notebook so it can be launched directly from GitHub.


## Acknowledgments

*This homework assignment was inspired by an idea from my master's student **Andrzej Małek**, to whom I would like to express my thanks.*



# 1. Semantic pair classes

In [11]:
import torch
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

semantic_classes = {
    "Gender": [
        ("king", "queen"),
        ("man", "woman"),
        ("doctor", "nurse"),
        ("prince", "princess"),
        ("actor", "actress"),
    ],
    "Verb Tense (Past)": [
        ("bring", "brought"),
        ("get", "got"),
        ("like", "liked"),
        ("go", "went"),
        ("eat", "ate"),
    ],
    "Singular-Plural": [
        ("cat", "cats"),
        ("dog", "dogs"),
        ("house", "houses"),
        ("car", "cars"),
        ("book", "books"),
    ],
    "Country-Capital": [
        ("france", "paris"),
        ("germany", "berlin"),
        ("italy", "rome"),
        ("japan", "tokyo"),
        ("spain", "madrid"),
    ],
    "Comparative-Superlative": [  # each pair maps the base form to its superlative
        ("good", "best"),  # irregular
        ("bad", "worst"),  # irregular
        ("fast", "fastest"),
        ("tall", "tallest"),
        ("happy", "happiest"),
    ],
    "Tool-User/Profession": [
        ("hammer", "carpenter"),
        ("microphone", "singer"),
        ("scalpel", "surgeon"),
        ("microscope", "scientist"),
        ("paint", "painter"),
    ],
    "Object-Property": [
        ("lemon", "sour"),
        ("sugar", "sweet"),
        ("fire", "hot"),
        ("ice", "cold"),
        ("sun", "bright"),
    ],
    "Animal-Habitat": [
        ("lion", "savanna"),
        ("fish", "water"),
        ("bird", "sky"),
        ("bear", "forest"),
        ("whale", "ocean"),
    ],
    "Part-Whole": [
        ("toe", "foot"),
        ("petal", "flower"),
        ("leaf", "plant"),
        ("engine", "plane"),
        ("branch", "tree"),
    ],
    "Antonyms": [
        ("love", "hate"),
        ("big", "small"),
        ("up", "down"),
        ("open", "close"),
        ("day", "night"),
    ],
}

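# Verify that no word appears twice across classes;
# the tasks below assume 100 unique words.
words = set()

for semantic_class, pairs in semantic_classes.items():
    for pair in pairs:
        for word in pair:
            if word in words:
                # Report where the repeat occurs so it is easy to fix.
                print(f"Duplicate in {semantic_class!r}: {word}")
            words.add(word)

print(f"Total: {len(words)} words")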


Total: 100 words
