# Setup

We start with some basic setup to simplify the walk-through.

In [24]:
import os

import numpy as np
import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

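def embedding(word):
    # Look up the precomputed embedding for a word in `df` (built in the next cell).
    return np.array(df.loc[df["word"] == word, "embedding"].item())

def calculate_cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_similar_concepts(query_embedding, top_n=5, filter=None):
    # Score every word against the query, drop the excluded words, and return the best matches.
    excluded = [] if filter is None else filter
    df["similarity"] = df["embedding"].apply(lambda emb: calculate_cosine_similarity(query_embedding, emb))
    result = df[~df["word"].isin(excluded)].copy()
    return result.sort_values("similarity", ascending=False).head(top_n).reset_index(drop=True)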

In [2]:
def calculate_embedding(text):
    # Replace newlines with spaces, then request a single embedding from the API.
    response = client.embeddings.create(input=[text.replace("\n", " ")], model="text-embedding-3-small")
    return np.array(response.data[0].embedding)

words = [
    "man", "woman", "king", "queen", "boy", "girl", "prince", "princess",
    "father", "mother", "husband", "wife", "son", "daughter", "friend",
    "enemy", "teacher", "student", "boss", "employee", "doctor", "nurse",
    "potatoes", "cheese", "hamburger", "french fries", "cheeseburger",
    "banana", "chocolate", "cake", "cheesecake", "bacon", "pancakes",
    "avocado", "salad", "peanut butter", "toast", "burrito", "sushi",
    "pizza", "ice cream", "cookie", "hot dog",
    "soda", "water", "milk", "la croix", "coffee", "latte", "espresso",
    "mocha", "tea", "hot chocolate", "cappuccino", "matcha", "smoothie",
    "kombucha", "wine", "beer", "cocktail", "lemonade",
    "red", "blue", "green", "yellow", "brown", "black", "white", "orange",
    "pink", "purple", "beige", "gray", "neon", "clear", "golden",
    "crispy", "fizzy", "sweet", "salty", "creamy", "hot", "cold", "spicy",
    "refreshing", "bitter", "sour", "smooth", "crunchy", "bubbly", "strong",
    "light", "dark",
]

df = pd.DataFrame({
    "word": words,
    "embedding": [calculate_embedding(word) for word in words],
})
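
The cell above makes one API request per word. As an optional variation, the embeddings endpoint also accepts a list of inputs, so the whole vocabulary could be embedded in a single request. The sketch below assumes that approach; `calculate_embeddings` and `FakeEmbeddings` are hypothetical names that do not appear in this notebook, and the fake client exists only so the example runs offline without an API key.

```python
from types import SimpleNamespace

import numpy as np

def calculate_embeddings(client, texts, model="text-embedding-3-small"):
    # Send the whole list in one request rather than one request per word.
    cleaned = [text.replace("\n", " ") for text in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Each result carries the index of its input; sort on it to keep the order aligned.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [np.array(item.embedding) for item in ordered]

class FakeEmbeddings:
    # Hypothetical offline stand-in that mimics the response shape
    # (.data[i].index and .data[i].embedding). It returns made-up vectors.
    def create(self, input, model):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text)), 0.0])
            for i, text in enumerate(input)
        ]
        # Returned in reverse order to exercise the sort in calculate_embeddings.
        return SimpleNamespace(data=list(reversed(data)))

fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
vectors = calculate_embeddings(fake_client, ["tea", "coffee"])
```

With the real `client` from the setup cell, the call would be `calculate_embeddings(client, words)`, and the returned list could be passed straight to the `"embedding"` column.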

# Embedding Concepts in Space

Embeddings are a technique used in natural language processing (NLP) to represent words, phrases, and even larger blocks of text as vectors of numbers in a high-dimensional space.

* Each vector can be thought of as a coordinate in this high-dimensional space.
* Together, the dimensions encode aspects of semantic meaning; a single dimension rarely maps to one human-readable feature.
* The structure of this space reflects learned relationships from training data.

For example, the idea of a "hot dog" is represented by a vector. That means the model has learned to place the concept of "hot dog" at specific coordinates in its semantic space.

In [3]:
embedding("hot dog")

array([-0.03104307, -0.04807128, -0.01409304, ..., -0.00020377,
        0.0029014 , -0.0072846 ], shape=(1536,))

Where GPS has two dimensions (latitude and longitude) and physical space has three (x, y, z), embedding models represent concepts in **many** dimensions, often hundreds or thousands. The model we're using, `text-embedding-3-small`, produces vectors with 1,536 of them. This is what we mean by a "high-dimensional space."

In [4]:
len(embedding("hot dog"))

1536

# Comparing Similarity by Measuring Direction

A key capability in this space is measuring similarity. One technique is **cosine similarity**. Instead of measuring literal distance, cosine similarity measures the cosine of the **angle** between two vectors:

$$\text{similarity}(a, b) = \cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

The result always falls in the range from -1 to 1:
* 1 indicates that the vectors are in the same direction (meaning very similar)
* 0 indicates that they are orthogonal or have no similarity
* -1 indicates that they are in exactly opposite directions (meaning they are dissimilar)

This allows us to infer semantic relationships, such as closeness in meaning, context, or usage, from direction alone.
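
To make the three reference values concrete, here is a minimal sketch using hand-picked 2D vectors. The vectors are toy values chosen for illustration, not real embeddings, and `cosine_similarity` simply mirrors `calculate_cosine_similarity` from the setup cell.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 2D vectors chosen by hand; the real embeddings here have 1,536 dimensions.
east = np.array([1.0, 0.0])
far_east = np.array([5.0, 0.0])  # same direction as east, five times longer
north = np.array([0.0, 1.0])
west = np.array([-1.0, 0.0])

same = cosine_similarity(east, far_east)      # 1.0: same direction, length is ignored
orthogonal = cosine_similarity(east, north)   # 0.0: vectors at a right angle
opposite = cosine_similarity(east, west)      # -1.0: exactly opposite directions
print(same, orthogonal, opposite)
```

Note that `east` and `far_east` score 1 even though they differ in length: only direction matters.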

In this example, we'll ask: *Which words in our list are most similar to "burrito"?*

In [5]:
find_similar_concepts(embedding("burrito"), filter=["burrito"])

| | word | embedding | similarity |
|---|---|---|---|
| 0 | cheeseburger | [-0.025503935292363167, -0.05670874938368797, ... | 0.512808 |
| 1 | hamburger | [-0.046442754566669464, -0.056426260620355606,... | 0.476006 |
| 2 | hot dog | [-0.031043073162436485, -0.048071280121803284,... | 0.433672 |
| 3 | bacon | [0.027432523667812347, -0.007462325040251017, ... | 0.42324 |
| 4 | sushi | [-0.023778622969985008, -0.025500649586319923,... | 0.408351 |
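
`find_similar_concepts` scores one row at a time with `apply`. Because every row is scored independently, the same ranking can also be computed with a single matrix-vector product. The following is a sketch of that idea rather than the notebook's implementation: `rank_by_cosine` is a hypothetical helper, and the 3D vectors are invented stand-ins for real embeddings.

```python
import numpy as np

def rank_by_cosine(query, matrix, labels, top_n=3):
    # Scale every row and the query to unit length so a dot product equals cosine similarity.
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit
    # argsort is ascending, so reverse it to put the highest scores first.
    order = np.argsort(scores)[::-1][:top_n]
    return [(labels[i], float(scores[i])) for i in order]

# Invented 3D vectors for illustration only; these are not model outputs.
labels = ["cheeseburger", "sushi", "espresso"]
matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.6, 0.4, 0.0],
    [0.0, 0.1, 0.9],
])
query = np.array([1.0, 0.0, 0.0])

ranking = rank_by_cosine(query, matrix, labels)
print(ranking)
```

For a vocabulary of about a hundred words the difference is negligible, but the matrix form scales better as the word list grows.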


# Emergent Meaning through Vector Arithmetic

One of the most fascinating aspects of this semantic space is that we can not only measure similarity — we can **perform math on meaning**.

By adding, subtracting, and blending vectors, we can explore entirely new conceptual combinations.

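Take these two examples: blending "milk" with "espresso", and then starting from "father", subtracting "man", and adding "woman". In both cases the input words are filtered out of the results.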

In [35]:
find_similar_concepts(embedding("milk") + embedding("espresso"), filter=["milk", "espresso"])

| | word | embedding | similarity |
|---|---|---|---|
| 0 | cappuccino | [-0.02482164278626442, -0.030885322019457817, ... | 0.642723 |
| 1 | latte | [-0.017921417951583862, -0.03153694421052933, ... | 0.57775 |
| 2 | mocha | [0.02025661990046501, -0.026742229238152504, -... | 0.562984 |
| 3 | chocolate | [0.013673103414475918, -0.04669247195124626, 0... | 0.536867 |
| 4 | hot chocolate | [-0.04974162578582764, -0.04538385942578316, -... | 0.518881 |


In [36]:
find_similar_concepts(embedding("father") - embedding("man") + embedding("woman"), filter=["father", "man", "woman"])

| | word | embedding | similarity |
|---|---|---|---|
| 0 | mother | [0.06384002417325974, 0.002675893483683467, -0... | 0.747128 |
| 1 | daughter | [0.07486723363399506, -0.017612170428037643, -... | 0.61072 |
| 2 | wife | [0.017703169956803322, 0.0015181683702394366, ... | 0.579919 |
| 3 | queen | [0.043817322701215744, -0.03984493762254715, 0... | 0.423459 |
| 4 | friend | [-0.017377346754074097, -0.03304498642683029, ... | 0.411813 |
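
The "father" minus "man" plus "woman" arithmetic is easier to follow on vectors small enough to read. The sketch below uses hand-built 2D vectors where the first axis loosely stands for gender and the second for generation. These axes and values are assumptions made up for illustration; real embedding dimensions are not labeled this way.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical toy vectors: axis 0 ~ gender, axis 1 ~ generation (parent vs. child).
toy = {
    "man":      np.array([1.0, 0.0]),
    "woman":    np.array([-1.0, 0.0]),
    "father":   np.array([1.0, 1.0]),
    "mother":   np.array([-1.0, 1.0]),
    "son":      np.array([1.0, -1.0]),
    "daughter": np.array([-1.0, -1.0]),
}

# Subtracting "man" removes the male component; adding "woman" swaps in the female one.
query = toy["father"] - toy["man"] + toy["woman"]

# Exclude the input words, as the filter argument does in find_similar_concepts.
inputs = ("father", "man", "woman")
candidates = [word for word in toy if word not in inputs]
best = max(candidates, key=lambda word: cosine_similarity(query, toy[word]))
print(best)
```

In this toy space the result lands exactly on "mother". With real embeddings the match is approximate, which is why the notebook ranks candidates by similarity instead of looking for an exact hit.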


# Emergent Meaning through Phrases

Instead of combining known vectors, we can also embed full **natural language phrases** directly.

This allows us to describe concepts in our own words — and let the model retrieve similar ideas based on learned meaning.

In [37]:
find_similar_concepts(calculate_embedding("dessert you eat with a spoon"))

| | word | embedding | similarity |
|---|---|---|---|
| 0 | ice cream | [0.015797361731529236, -0.0515950508415699, -0... | 0.449068 |
| 1 | cheesecake | [0.04396495223045349, -0.02441263012588024, -0... | 0.435389 |
| 2 | smoothie | [-0.022336609661579132, -0.034296080470085144,... | 0.403105 |
| 3 | cake | [0.04641443118453026, -0.009220455773174763, -... | 0.396299 |
| 4 | creamy | [0.049029696732759476, -0.008617117069661617, ... | 0.378161 |


In [38]:
find_similar_concepts(calculate_embedding("members of a family"))

| | word | embedding | similarity |
|---|---|---|---|
| 0 | father | [0.02880435809493065, 0.024050738662481308, 0.... | 0.469758 |
| 1 | daughter | [0.07486723363399506, -0.017612170428037643, -... | 0.44574 |
| 2 | mother | [0.06384002417325974, 0.002675893483683467, -0... | 0.433022 |
| 3 | wife | [0.017703169956803322, 0.0015181683702394366, ... | 0.423573 |
| 4 | husband | [-0.019892802461981773, 0.03318863362073898, -... | 0.39254 |


# Takeaway

A core idea behind large language models (LLMs) is that they act as mathematical models of meaning, learned from human language.

Rather than relying on strict logic or fixed rules, these models represent meaning as positions in a high-dimensional space — where the geometry between points reflects relationships between concepts. This allows computers to interpret meaning, not just match literal words.

And it doesn’t stop at language. Today’s models are being trained to understand meaning across images, audio, and video — enabling multimodal understanding, a kind of modern-day Rosetta Stone for human perception.

I don’t know about you, but as a technologist, I find this to be an incredibly exciting time.

After years of working in a world defined by literal, logical, and deterministic systems, we’re now stepping into an era shaped by probabilistic, semantic, and conceptual ones. Just as exciting is how accessible and cost-effective these capabilities have become — not just for researchers, but for everyday developers and creators.

The very nature of how we interact with computers is evolving — and it’s happening faster, and at a greater scale, than anything we’ve seen before.