## Setup (Dependencies & Environment)

**Python**: 3.12+

**Install**:

```bash
uv add openai python-dotenv ipykernel
uv sync
```

or, with pip:

```bash
pip install openai python-dotenv ipykernel
```

**Environment**:

Set `OPENAI_API_KEY` in your shell, or create a `.env` file in the same folder as this notebook. The notebook calls `load_dotenv(override=True)`, so a value in `.env` takes precedence over one already set in the shell.

Example `.env`:

```bash
OPENAI_API_KEY=your_key_here
```
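
To fail early with a readable message when the key is missing, a small check can run right after loading the environment. This is a minimal sketch: `require_env` is a hypothetical helper name, not part of `python-dotenv` or the OpenAI SDK.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of python-dotenv or openai.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or add it to a .env file."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` after `load_dotenv(override=True)` would surface a missing key before the first API request is made.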


# Embeddings

This notebook uses cosine similarity between embedding vectors to measure semantic closeness.
Words with related meanings should score higher than unrelated words.


In [None]:
from dotenv import load_dotenv

# Load environment variables from .env; override=True lets .env values replace existing ones
load_dotenv(override=True)

## Getting embeddings

In [None]:
from openai import OpenAI
client = OpenAI()

response = client.embeddings.create(
    input="Your text string goes here",
    model="text-embedding-3-small"
)

print(response.data[0].embedding)

In [None]:
# Number of dimensions in the embedding vector (1536 by default for text-embedding-3-small)
len(response.data[0].embedding)

In [None]:
# Request a shorter, 100-dimensional embedding via the `dimensions` parameter
response = client.embeddings.create(
    input="Your text string goes here",
    model="text-embedding-3-small",
    dimensions=100
)

print(response.data[0].embedding)
print(len(response.data[0].embedding))

## Cosine similarity

Cosine similarity measures how close two vectors are in direction (ignoring magnitude).
A score of **1.0** means identical direction, **0** means orthogonal, and **−1** means opposite.
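
Written out, this is the standard cosine formula for two vectors $a$ and $b$ of length $n$ (the symbol names are chosen here for illustration):

$$
\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
= \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
$$

As a worked case, for $a = (1, 0)$ and $b = (1, 1)$ the dot product is $1$ and the norms are $1$ and $\sqrt{2}$, so $\cos\theta = 1/\sqrt{2} \approx 0.707$, which corresponds to a 45° angle.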

In [None]:
import math

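def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # A zero vector has no direction, so return 0.0 instead of dividing by zero
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0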

In [None]:
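# Quick sanity check with simple 2D vectors
print(cosine_similarity([1, 0], [1, 0]))   # same direction: 1.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction: -1.0
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal: 0.0
print(cosine_similarity([1, 0], [1, 1]))   # 45 degrees apart: about 0.707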
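
Two properties of the function are worth checking separately: scaling a vector does not change the score (magnitude is ignored), and a zero vector yields `0.0` rather than a division error. The sketch below repeats the function so it runs on its own; the vectors `base`, `scaled`, and `zero` are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    # Same definition as in the cell above, repeated so this snippet is self-contained.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

base = [1.0, 2.0]
scaled = [10.0, 20.0]  # same direction as base, ten times longer
zero = [0.0, 0.0]      # has no direction at all

same_direction = cosine_similarity(base, scaled)
with_zero = cosine_similarity(base, zero)

# Compare with a tolerance, since floating-point rounding may leave a tiny error
print(math.isclose(same_direction, 1.0))
# The guard in the return statement handles the zero vector
print(with_zero)
```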

## Comparing word embeddings

Let's embed a set of words and see how cosine similarity captures semantic relationships.

In [None]:
words = [
    "cat", "dog", "kitten", "puppy",
    "car", "automobile",
    "banana", "apple",
    "king", "queen", "man", "woman",
    "Paris", "France", "Tokyo", "Japan", "Singapore",
    "汽车",  # Chinese for "car", included to test cross-lingual similarity
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=words,
)

# Map each word to its embedding; response.data follows the order of the input list
embedding_by_word = {
    word: response.data[i].embedding
    for i, word in enumerate(words)
}

In [None]:
pairs = [
    ("cat", "dog"),
    ("kitten", "puppy"),
    ("car", "automobile"),
    ("king", "queen"),
    ("man", "woman"),
    ("Paris", "France"),
    ("Paris", "Singapore"),
    ("Tokyo", "Japan"),
    ("banana", "apple"),
    ("cat", "banana"),
    ("car", "king"),
    ("Paris", "puppy"),
    ("queen", "automobile"),
    ("汽车", "automobile"),
    ("汽车", "car"),
    ("汽车", "dog"),
]

print("Cosine similarity (higher = more similar):\n")
for a, b in pairs:
    score = cosine_similarity(embedding_by_word[a], embedding_by_word[b])
    print(f"  {a:<12} {b:<12} {score:.3f}")
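
Beyond hand-picked pairs, the same scores can rank every other word by closeness to a query word. The following is a minimal sketch: `most_similar` is a hypothetical helper, and `toy_embeddings` holds made-up 3-dimensional vectors so the snippet runs without an API call. With the notebook's real data, `embedding_by_word` would take their place.

```python
import math

def cosine_similarity(a, b):
    # Same definition as earlier in the notebook, repeated so this runs on its own.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(query, embeddings, top_k=3):
    """Return up to top_k (word, score) pairs closest to `query`, excluding the query itself."""
    query_vec = embeddings[query]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    # Highest score first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Made-up vectors for illustration only; real embeddings come from the API.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
    "banana": [0.1, 0.9, 0.1],
}

for word, score in most_similar("cat", toy_embeddings, top_k=2):
    print(f"  {word:<12} {score:.3f}")
```

With these toy vectors, "kitten" ranks first for the query "cat" because the two vectors point in nearly the same direction.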