# Understanding embeddings
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
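
To make "distance" concrete, here is a minimal sketch using cosine distance, which is one common choice of metric (the text above does not commit to a particular one). The three vectors are made-up toy values, not real embeddings, and the function names are just for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance derived from cosine similarity: 0.0 for identical directions."""
    return 1.0 - cosine_similarity(a, b)

# Toy 3-dimensional vectors, invented for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

print("cat vs kitten:", cosine_distance(cat, kitten))  # small distance
print("cat vs car:   ", cosine_distance(cat, car))     # large distance
```

The pair pointing in nearly the same direction gets the smaller distance, which is the sense in which small distances suggest high relatedness.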

The terms "embedding" and "vector" are often used interchangeably in this context. "Embedding" emphasizes that the data is represented in a meaningful and structured way, while "vector" refers to the numerical representation itself.

There are several types of embeddings. Read more: https://www.ibm.com/topics/embedding

This notebook uses text (sentence) embeddings.





## How to create embeddings

To create an embedding, we send our text string to the OpenAI embeddings API endpoint along with the embedding model name. We will use the `text-embedding-3-small` model, one of the newer and more performant embedding models released by OpenAI. Read more: https://openai.com/blog/new-embedding-models-and-api-updates


### Setup
We need to set up our environment and retrieve an API key for OpenAI. \
(Skip this setup if you have already installed the OpenAI Python library and dotenv, and have already created and configured the `.env` file with your API key.)

1. Install the OpenAI Python library and dotenv (for managing our API key) using pip in your terminal:

    ``` bash
    pip install openai python-dotenv
    # This command will install both the OpenAI python library and dotenv.
    ```
2. Verify that the installation was successful by running this command in your terminal:

    ``` bash
    pip list
    # This lists all installed Python packages. Look for openai and python-dotenv to confirm that both were installed.
    ```
3. Create a `.env` file and paste your OpenAI API key from https://platform.openai.com/api-keys:
    ``` bash
    # Your .env file should look like this. Replace the value with your actual OpenAI API key.
    OPENAI_API_KEY=sk-#####
    ```
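
Before calling the API, it can help to confirm that the key is actually visible to your Python process. The following is a minimal sketch, assuming the variable name `OPENAI_API_KEY` from the `.env` file above. `require_api_key` and `mask` are hypothetical helpers, not part of the OpenAI library, and they read the process environment directly, so a `.env` file must already have been loaded (for example with `load_dotenv`).

```python
import os

def require_api_key(name="OPENAI_API_KEY"):
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value

def mask(secret, visible=3):
    """Show only the first few characters so the key never lands in logs."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

try:
    key = require_api_key()
    print("Key loaded:", mask(key))
except RuntimeError as err:
    print("Setup problem:", err)
```

Failing early with a clear message is usually easier to debug than an authentication error from the first API call.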


### Creating embeddings
Let's initialize our connection to OpenAI Embeddings:

In [1]:
import os
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv

# Load environment variables (including OPENAI_API_KEY) from the .env file
load_dotenv(find_dotenv())

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

We can now create embeddings with OpenAI as follows:

In [2]:
MODEL = "text-embedding-3-small"

response = client.embeddings.create(
    input="Hello villagers",
    model=MODEL
)

# Print the embedding vector for the input text
print(response.data[0].embedding)


[0.054231349378824234, -0.01202359702438116, -0.026938024908304214, 0.01760338619351387, -0.012927991338074207, -0.005987573880702257, 0.014922503381967545, 0.012968366034328938, -0.007162478752434254, -0.007477401290088892, -0.020187368616461754, -0.026453528553247452, -0.006916192825883627, 0.0027152011170983315, -0.020009720697999, -0.015673473477363586, -0.034916073083877563, 0.029731957241892815, -0.01577037200331688, 0.06346908956766129, 0.03286503627896309, 0.006956567522138357, -0.022270705550909042, -0.022432204335927963, 0.0037003448233008385, 0.023045901209115982, -0.027180273085832596, -0.007978048175573349, 0.09386318922042847, -0.07538770884275436, -0.016860490664839745, -0.04205432906746864, 0.0005541432765312493, -0.03310728445649147, 0.013993884436786175, -0.0013000665931031108, -0.006447845604270697, -0.06989675015211105, -0.02248065359890461, -0.005285053048282862, 0.02435404248535633, -0.02504848688840866, -0.030813999474048615, -0.007626788225024939, 0.002608207985

Let's print the length (number of dimensions) of the vector:

In [3]:
print("Length:", len(response.data[0].embedding))

Length: 1536


That's a lot of numbers!

### Reducing embedding dimensions
Using larger embeddings, for example when storing them in a *vector database* for retrieval, generally costs more and requires more computing power, memory, and storage than using smaller embeddings.
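
As a rough back-of-the-envelope illustration of the storage side, the sketch below compares raw vector sizes for 1536 and 516 dimensions. It assumes each value is stored as a 32-bit float (4 bytes) and uses a hypothetical corpus of one million vectors. Real vector databases add index and metadata overhead that is not modeled here.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each dimension is stored as a 32-bit float

def raw_storage_bytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of the vectors alone, ignoring index and metadata overhead."""
    return num_vectors * dimensions * bytes_per_value

NUM_VECTORS = 1_000_000  # hypothetical corpus size

for dims in (1536, 516):
    size_mb = raw_storage_bytes(NUM_VECTORS, dims) / 1_000_000
    print(f"{dims} dimensions: {size_mb:,.0f} MB")
# 1536 dimensions: 6,144 MB
# 516 dimensions: 2,064 MB
```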

The new embedding models were trained with a technique that allows us to trade off performance against the cost of using embeddings. Thanks to the `dimensions` API parameter, we can shorten embeddings (by removing some numbers from the end of the sequence) without the embedding losing its ability to represent concepts.

Example usage of the `dimensions` API parameter:

In [None]:
DIMENSIONS = 516

response = client.embeddings.create(
    input="Hello villagers",
    model=MODEL,
    dimensions=DIMENSIONS
)

# Print the length (number of dimensions) of the shortened vector
print("Length:", len(response.data[0].embedding))

Read more about the OpenAI embeddings API parameters: https://platform.openai.com/docs/api-reference/embeddings/create
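
To illustrate what "removing some numbers from the end of the sequence" looks like, here is a minimal client-side sketch. It assumes that a truncated vector should be rescaled to unit length before it is used in distance calculations. `shorten_embedding` is a hypothetical helper, the input is a toy vector rather than a real embedding, and the sketch does not claim to reproduce exactly what the API returns for a given `dimensions` value.

```python
import math

def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` values and rescale the result to unit length."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    truncated = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        return truncated  # an all-zero vector cannot be rescaled
    return [x / norm for x in truncated]

# Toy 6-dimensional vector standing in for a real embedding.
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.1]
short = shorten_embedding(full, 4)

print("Length:", len(short))
print(short)
```

Rescaling matters because dropping values changes the vector's length, and many similarity calculations assume unit-length inputs.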

### What is a vector database?

A vector database is a type of database designed to efficiently store and search vector data. In this type of database, data points are represented as vectors in a high-dimensional space. These vectors can come from various sources, such as machine learning models that generate embeddings for words or images.

Vector databases are designed to support similarity searches, nearest neighbor queries, and other operations that involve measuring the similarity or distance between vectors. Read more: https://www.ibm.com/topics/vector-database
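
The core operation, a nearest neighbor query, can be sketched in a few lines. This is a brute-force scan over an in-memory dictionary, meant only to show the idea. Production vector databases typically rely on specialized indexes to avoid comparing the query against every stored vector. The document IDs and vectors below are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, store, k=2):
    """Return the k (id, score) pairs most similar to `query`, best first."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "database": made-up IDs mapped to made-up 3-dimensional vectors.
store = {
    "doc-cats": [0.9, 0.1, 0.0],
    "doc-dogs": [0.7, 0.3, 0.1],
    "doc-cars": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]

for item_id, score in nearest_neighbors(query, store, k=2):
    print(item_id, round(score, 3))
```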

List of vector databases: https://cookbook.openai.com/examples/vector_databases/readme.