# Setup

We start with some basic setup that the rest of the walk-through reuses.

In [1]:
import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

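def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines with spaces, request the embedding, and return it as a NumPy array
    response = client.embeddings.create(input=[text.replace("\n", " ")], model=model)
    return np.array(response.data[0].embedding)

def cosine_similarity(a, b):
    # Cosine of the angle between a and b: dot product divided by the product of the norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))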

# Embedding Concepts in Space

Embeddings are a technique used in natural language processing (NLP) to represent words, phrases, and even larger blocks of text as vectors of numbers in a high-dimensional space.

* Each vector can be thought of as a point in this high-dimensional space.
* Each number in the array represents a distinct dimension that contributes to capturing some aspect of the semantic meaning.
* The numbers are not random. They are learned from vast amounts of text data.
* What we've been able to train into this space is where the real advances have been made.

In [2]:
hot_dog_vector = get_embedding('hot dog')

In [3]:
hot_dog_vector#.tolist()

array([-0.03104656, -0.04807668, -0.01410173, ..., -0.00022234,
        0.0028964 , -0.00729964], shape=(1536,))

The model used here currently represents each point in space with the following number of dimensions, hence the term high-dimensional space.

In [4]:
len(hot_dog_vector)

1536

Every string of text will land on a point in that space.

In [5]:
words = ["red","potatoes","soda","cheese","water","blue","crispy","hamburger","coffee","green","milk","la croix","yellow","chocolate","french fries","latte","cake","brown","cheeseburger","espresso","cheesecake","black","mocha","fizzy","carbon","banana"]

In [6]:
df = pd.DataFrame(words, columns=['word'])
df['embedding'] = df['word'].apply(lambda x: get_embedding(x))
df

Unnamed: 0,word,embedding
0,red,"[-0.02211996167898178, -0.010933708399534225, ..."
1,potatoes,"[-0.02531793899834156, -0.04170173779129982, -..."
2,soda,"[0.01846451684832573, -0.025069599971175194, -..."
3,cheese,"[0.00935442466288805, -0.04611948877573013, 0...."
4,water,"[0.0030297308694571257, 0.017433997243642807, ..."
5,blue,"[-0.0011275681899860501, -0.016529185697436333..."
6,crispy,"[-0.024566395208239555, -0.008102098479866982,..."
7,hamburger,"[-0.04638974741101265, -0.056404534727334976, ..."
8,coffee,"[-0.01013763528317213, 0.0037400354631245136, ..."
9,green,"[0.006280624307692051, -0.0011062989942729473,..."
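
The cell above sends one API request per word. The embeddings endpoint also accepts a list of strings in a single request, so the same table could be built with far fewer calls. Here is a minimal sketch of that idea. The helper name `get_embeddings_batch` is hypothetical and not part of the original notebook, and the client is passed in as an argument so the function is easy to try with a stand-in object.

```python
import numpy as np

def get_embeddings_batch(client, texts, model="text-embedding-3-small"):
    """Embed several strings in one request and return one vector per input, in input order.

    `client` is assumed to behave like the OpenAI client from the setup cell.
    """
    # Same newline clean-up as get_embedding in the setup cell
    cleaned = [t.replace("\n", " ") for t in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Each returned item carries the index of the input it belongs to;
    # sorting by it keeps the vectors aligned with `texts`
    items = sorted(response.data, key=lambda item: item.index)
    return [np.array(item.embedding) for item in items]
```

Called as `get_embeddings_batch(client, words)`, this would return one vector per entry of `words`, in the same order. Very long lists may still need to be split into chunks to stay within request limits.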


The distance and direction between points (vectors) in this space can be used to infer semantic relationships, such as similarity in meaning, context, or usage, between words or text snippets.

Rather than explicitly measuring the distance between two points, we can measure how similar two vectors are by calculating the cosine of the angle between them (a technique called cosine similarity).

These values of similarity range from -1 to 1:
* 1 indicates that the vectors are in the same direction (meaning very similar)
* 0 indicates that they are orthogonal or have no similarity
* -1 indicates that they are in exactly opposite directions (meaning they are dissimilar)
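
To make the three reference values concrete, here is a small sketch using hand-picked 2-D vectors rather than real embeddings. The vector names are made up for illustration. Note that the length of a vector does not matter, only its direction.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 2-D vectors chosen by hand (not model output)
east = np.array([1.0, 0.0])
far_east = np.array([3.0, 0.0])  # same direction as east, three times as long
north = np.array([0.0, 1.0])     # perpendicular to east
west = np.array([-1.0, 0.0])     # exactly opposite to east

same_direction = cosine_similarity(east, far_east)  # 1.0
orthogonal = cosine_similarity(east, north)         # 0.0
opposite = cosine_similarity(east, west)            # -1.0

print(same_direction, orthogonal, opposite)
```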

In [7]:
hot_dog_vector

array([-0.03104656, -0.04807668, -0.01410173, ..., -0.00022234,
        0.0028964 , -0.00729964], shape=(1536,))

In [14]:
with np.printoptions(threshold=np.inf):
    print(hot_dog_vector)

[-3.10465582e-02 -4.80766781e-02 -1.41017335e-02 -1.66605152e-02
 -5.27393445e-02 -3.60219702e-02 -4.91854809e-02 -1.46987829e-02
  3.53964902e-02 -3.05632334e-02  4.88158800e-02  3.28066089e-04
  1.73855051e-02  3.20416391e-02 -2.62985956e-02 -4.92210209e-04
 -2.09820140e-02  1.17490757e-02 -4.15375642e-02  3.48847322e-02
 -7.52530759e-04  1.02209141e-02 -4.40394878e-02  2.25670380e-03
  2.79475898e-02  1.09402157e-01 -2.43368633e-02  2.54172385e-02
  2.26025768e-02 -2.80470978e-02 -1.07824244e-02 -4.62286659e-02
  5.55540062e-02 -4.01728824e-02 -2.45749718e-03  5.69329085e-03
 -3.38612199e-02  9.51724872e-03  1.66747309e-02  4.67404239e-02
 -1.54379867e-02  9.21872444e-03  4.18787375e-02  4.36414555e-02
  3.53680588e-02  6.76566910e-04 -1.15216281e-02  2.50334200e-02
  2.34732730e-03 -1.14505505e-02 -1.85689412e-03  3.91493700e-02
  4.22199070e-02  3.73013616e-02  4.58874963e-02 -1.47556448e-02
  3.38150188e-03  5.56108691e-02  1.92477293e-02 -1.35188997e-02
 -8.72118305e-03 -2.33844

In [8]:
hot_dog_df = df.copy()
hot_dog_df["similarities"] = hot_dog_df['embedding'].apply(lambda x: cosine_similarity(x, hot_dog_vector))
hot_dog_df.sort_values("similarities", ascending=False, inplace=True)
hot_dog_df

Unnamed: 0,word,embedding,similarities
7,hamburger,"[-0.04638974741101265, -0.056404534727334976, ...",0.595374
18,cheeseburger,"[-0.025503935292363167, -0.05670874938368797, ...",0.464017
14,french fries,"[-0.038560159504413605, -0.016487879678606987,...",0.360207
3,cheese,"[0.00935442466288805, -0.04611948877573013, 0....",0.358372
1,potatoes,"[-0.02531793899834156, -0.04170173779129982, -...",0.339797
6,crispy,"[-0.024566395208239555, -0.008102098479866982,...",0.332201
16,cake,"[0.04641314595937729, -0.009166176430881023, 6...",0.314139
2,soda,"[0.01846451684832573, -0.025069599971175194, -...",0.310825
8,coffee,"[-0.01013763528317213, 0.0037400354631245136, ...",0.301618
13,chocolate,"[0.013673103414475918, -0.04669247195124626, 0...",0.299343
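
The `.apply` call above computes one cosine similarity at a time. An alternative, sketched below with toy 3-D vectors standing in for real embeddings, is to stack all vectors into a matrix and score them with a single matrix-vector product. The function name `rank_by_cosine` and the toy data are assumptions made for this illustration.

```python
import numpy as np

def rank_by_cosine(query, matrix, labels):
    """Return (label, score) pairs sorted from most to least similar to `query`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    # Scale every row and the query to length 1 so a dot product equals cosine similarity
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_rows @ unit_query
    # Negate so argsort yields descending similarity
    order = np.argsort(-scores)
    return [(labels[i], float(scores[i])) for i in order]

# Hypothetical stand-ins for embeddings
labels = ["a", "b", "c"]
toy_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
ranked = rank_by_cosine([2.0, 0.0, 0.0], toy_vectors, labels)
print(ranked)
```

With the notebook's data, the matrix could be built with `np.stack(df['embedding'].to_list())` and the labels taken from `df['word'].to_list()`.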


What becomes really neat about having concepts mathematically modeled in high-dimensional space is that we can add words (vectors) together, and the resulting point lands at a different location in this concept space. This mirrors how different words and phrasing give you additional context and semantic meaning.

Take the words "milk" and "espresso" for example.

Here is milk's point in space.

In [9]:
milk_vector = np.array(df.loc[df['word'] == 'milk']['embedding'].item())
milk_vector

array([ 0.03961927, -0.00421355, -0.03665198, ...,  0.00503549,
       -0.01055762, -0.01131131], shape=(1536,))

Here is espresso's point in space.

In [10]:
espresso_vector = np.array(df.loc[df['word'] == 'espresso']['embedding'].item())
espresso_vector

array([-0.05628017, -0.05029742, -0.01352074, ..., -0.00888147,
        0.00271341, -0.00932488], shape=(1536,))

Here is the point in space where the concept of milk and espresso exists.

In [11]:
milk_espresso_vector = milk_vector + espresso_vector
milk_espresso_vector

array([-0.0166609 , -0.05451097, -0.05017272, ..., -0.00384597,
       -0.00784421, -0.02063619], shape=(1536,))

Here is how similar each item in our list is to that point in space.

In [12]:
milk_espresso_df = df.copy()
milk_espresso_df["similarities"] = milk_espresso_df['embedding'].apply(lambda x: cosine_similarity(x, milk_espresso_vector))
milk_espresso_df.sort_values("similarities", ascending=False, inplace=True)
milk_espresso_df

Unnamed: 0,word,embedding,similarities
10,milk,"[0.03961927071213722, -0.004213553387671709, -...",0.818834
19,espresso,"[-0.056280165910720825, -0.05029742047190666, ...",0.818834
15,latte,"[-0.017921417951583862, -0.03153694421052933, ...",0.577733
22,mocha,"[0.020180465653538704, -0.02676199935376644, -...",0.563065
13,chocolate,"[0.013673103414475918, -0.04669247195124626, 0...",0.536893
8,coffee,"[-0.01013763528317213, 0.0037400354631245136, ...",0.515901
3,cheese,"[0.00935442466288805, -0.04611948877573013, 0....",0.471408
2,soda,"[0.01846451684832573, -0.025069599971175194, -...",0.469844
7,hamburger,"[-0.04638974741101265, -0.056404534727334976, ...",0.403802
18,cheeseburger,"[-0.025503935292363167, -0.05670874938368797, ...",0.37853
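
In the table above, milk and espresso get exactly the same score (0.818834). That is what one would expect if both embeddings have the same length (OpenAI's documentation describes its embeddings as normalized to length 1): the sum of two equal-length vectors sits at the same angle from each of them. For unit vectors the shared score works out to sqrt((1 + a·b) / 2). The sketch below checks this with two hand-picked vectors, not real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def unit(v):
    """Scale v to length 1."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Two arbitrary unit vectors standing in for the 'milk' and 'espresso' embeddings
a = unit([1.0, 2.0, 2.0])
b = unit([2.0, -1.0, 2.0])

combined = a + b
sim_a = cosine_similarity(combined, a)
sim_b = cosine_similarity(combined, b)

# Closed form for unit vectors: sqrt((1 + a.b) / 2)
expected = np.sqrt((1.0 + np.dot(a, b)) / 2.0)

print(sim_a, sim_b, expected)
```

If the two inputs had different lengths, the sum would lean toward the longer one and the two scores would no longer match.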


What is really important to understand about large language models (LLMs) is that we are effectively creating mathematical models of human understanding, which moves us into a realm where computers and software can interpret everything conceptually instead of literally.

Languages, search terms, everything...

And if your mind isn't already blown by that notion: this isn't just a mathematical model for words and text. We are also on the precipice of these models being trained to become multimodal, meaning they conceptually map text, images, audio, and video into a single unified model (a Rosetta Stone) of human perception.

# Footnotes

Best practices for handling rate limits for getting embeddings:
* https://cookbook.openai.com/examples/using_embeddings
* https://cookbook.openai.com/examples/how_to_handle_rate_limits
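
As a rough illustration of the retry-with-exponential-backoff idea those guides discuss, here is a minimal, dependency-free sketch. The helper `with_backoff` and its parameters are hypothetical names chosen for this example, not something taken from the cookbook or the OpenAI library.

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error, wait and try again, doubling the wait each time.

    The wait before retry number `attempt` is base_delay * 2**attempt, multiplied by a
    random factor in [1, 2) so that many clients do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
```

In this notebook it could wrap the embedding helper, for example `with_backoff(lambda: get_embedding('hot dog'), retry_on=(openai.RateLimitError,))`, assuming `import openai` has been added so that the exception class is available.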

