# Lesson 2: Generating Document Embeddings with OpenAI


Welcome back! In the previous lesson, you learned how to load and split documents using LangChain, setting the foundation for more advanced document processing tasks. Today, we will take the next step by exploring **embeddings**, a crucial concept in document processing.

Embeddings are numerical representations of text that capture the semantic meaning of words, phrases, or entire documents. Because texts with similar meanings map to nearby vectors, programs can compare text by meaning rather than by exact wording. This enables tasks such as similarity search, clustering, and classification.

In this lesson, you will:

1. Generate embeddings for document chunks using OpenAI and LangChain.
2. Understand how embeddings power context retrieval systems.
3. Prepare for context retrieval tasks in future lessons.

---

## Embeddings and Language Models

Embeddings play a vital role in context retrieval systems. Think of them as a way to translate human language into a format computers can compare—like giving machines their own secret decoder ring!

Consider these three sentences:

> 1. "The Avengers assembled to fight Thanos"
> 2. "Earth's mightiest heroes united against the Mad Titan"
> 3. "My soufflé collapsed in the oven again"

Despite their different wording, sentences 1 and 2 mean nearly the same thing, so their embeddings (vectors) lie close together in embedding space. Sentence 3, a baking disaster, ends up far away.
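
"Close together" is usually measured with cosine similarity. The sketch below computes it for three hand-made, three-dimensional vectors that stand in for the sentences above. The numbers are invented for illustration and are not real model output; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption: NOT produced by an embedding model).
avengers = [0.9, 0.1, 0.0]  # sentence 1
heroes = [0.8, 0.2, 0.1]    # sentence 2
souffle = [0.0, 0.1, 0.9]   # sentence 3

sim_close = cosine_similarity(avengers, heroes)
sim_far = cosine_similarity(avengers, souffle)
print(f"sentence 1 vs 2: {sim_close:.3f}")
print(f"sentence 1 vs 3: {sim_far:.3f}")
```

A value near 1 means the vectors point in almost the same direction; a value near 0 means they are unrelated.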

---

## Context Retrieval Systems

Here’s how embeddings fit into a practical retrieval pipeline:

1. **Document Processing**
   Break documents into chunks (like slicing a pizza).

2. **Embedding Generation**
   Convert each chunk into a vector (assign each slice its own flavor profile).

3. **Storage**
   Store vectors in a vector database (our digital pizza fridge).

4. **Query Processing**
   Convert a user’s question into an embedding.

5. **Similarity Search**
   Find chunks whose embeddings best match the query embedding.

6. **Response Generation**
   Use the top-matching chunks as context for an LLM to generate an answer.

*Example:*
If you have a huge library of movie scripts and someone asks, “Who said ‘I’ll be back’?”, the system retrieves script passages that embed closest to that question—even if they use synonyms like “Arnold’s famous catchphrase.”
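
The six steps can be sketched end to end without any external service. In the toy version below, `toy_embed` is a hypothetical stand-in that counts words, and a plain list plays the role of the vector database. Unlike a real embedding model, word counts cannot match synonyms, so treat this only as an outline of the data flow.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Steps 1-3: chunks are embedded and stored.
chunks = [
    "The Terminator says I'll be back before leaving the police station.",
    "Holmes examines the blue carbuncle found inside a goose.",
    "The cake collapsed in the oven again.",
]
store = [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(question, k=1):
    # Step 4: embed the query. Step 5: rank stored chunks by similarity.
    query_vec = toy_embed(question)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


top = retrieve("Who said I'll be back?")
# Step 6 would pass `top` to an LLM as context; here we only print it.
print(top[0])
```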

---

## Document Loading and Splitting

Before generating embeddings, load and split your document as learned previously:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# 2. Load the PDF
loader = PyPDFLoader(file_path)
docs = loader.load()

# 3. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)
```

---

## OpenAI Embeddings and LangChain

LangChain provides a unified interface to many embedding models. To use OpenAI’s embeddings, simply import and initialize:

```python
from langchain_openai import OpenAIEmbeddings

# Initialize the OpenAI embedding model
embedding_model = OpenAIEmbeddings()
```

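By default, this class reads your API key from the `OPENAI_API_KEY` environment variable. If you don’t specify a model, it falls back to a built-in default (`text-embedding-ada-002` in `langchain_openai` at the time of writing; check your installed version).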

---

## Configuring Embedding Model Parameters

You can customize the embedding generation:

```python
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-small",  # choice of embedding model
    dimensions=1536,                 # vector length
    chunk_size=1000                  # batch size for text processing
)
```

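* **model**: The embedding model to use. `text-embedding-3-small` is cheaper and faster; `text-embedding-3-large` is generally more accurate.
* **dimensions**: The length of the returned vector. The `text-embedding-3` models accept values up to their native size (1536 for `small`, 3072 for `large`); shorter vectors cost less to store and compare but keep less detail.
* **chunk\_size**: The maximum number of texts sent to the API in one batch. It is unrelated to the `chunk_size` of the text splitter.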

Defaults work well for most projects. Tweak these as you gain experience.
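
To make the batching idea behind `chunk_size` concrete, here is a minimal sketch of splitting a list of texts into batches. The `batched` helper is hypothetical and only illustrates the concept; it is not LangChain’s internal code.

```python
def batched(texts, chunk_size):
    """Yield successive lists of at most `chunk_size` texts."""
    for start in range(0, len(texts), chunk_size):
        yield texts[start:start + chunk_size]


# 2,500 placeholder texts, embedded in batches of at most 1,000.
texts = [f"chunk {i}" for i in range(2500)]
batch_sizes = [len(batch) for batch in batched(texts, chunk_size=1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

Each batch would correspond to one group of texts processed together, so a larger `chunk_size` means fewer, larger batches.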

---

## Generating an Embedding with OpenAI

Convert a single chunk into an embedding:

```python
# Extract the first chunk’s text
document_text = split_docs[0].page_content

# Generate its embedding vector
embedding_vector = embedding_model.embed_query(document_text)
```

`embed_query()` returns a list of floats representing your text in high-dimensional space.
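
LangChain embedding classes also expose `embed_documents()`, which takes a list of texts and returns one vector per text. The sketch below shows how you might embed every chunk at once. `FakeEmbeddings` is a hypothetical stand-in with the same two method names so the example runs offline, and `embed_chunks` is an illustrative helper, not part of LangChain.

```python
class FakeEmbeddings:
    """Hypothetical stand-in exposing the two LangChain embedding methods."""

    def embed_query(self, text):
        # Two made-up features: character count and word count.
        return [float(len(text)), float(len(text.split()))]

    def embed_documents(self, texts):
        return [self.embed_query(text) for text in texts]


def embed_chunks(model, chunk_texts):
    """Embed every chunk in one call and pair each text with its vector."""
    vectors = model.embed_documents(chunk_texts)
    return list(zip(chunk_texts, vectors))


chunk_texts = ["It was a bitter night.", "The goose was gone."]
pairs = embed_chunks(FakeEmbeddings(), chunk_texts)
print(len(pairs), len(pairs[0][1]))
```

With a real model, the call would look like `embed_chunks(embedding_model, [doc.page_content for doc in split_docs])`.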

---

## Inspecting Embedding Vectors

Peek at the first few elements:

```python
print(f"First 5 values: {embedding_vector[:5]}")
```

Example output:

```
First 5 values: [0.01057, -0.00014, 0.00523, -0.02460, -0.01267]
```

Individually, these numbers look random and are not meant to be read one by one. The meaning lies in how vectors compare: similar texts yield similar vectors.
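
Beyond the first few values, a quick sanity check is to look at a vector’s length and its L2 norm; OpenAI documents its embeddings as normalized to length 1. The sketch below uses a made-up three-value vector in place of real model output, and `summarize` is a hypothetical helper name.

```python
import math


def summarize(vector):
    """Return basic statistics for an embedding vector."""
    return {
        "dimensions": len(vector),
        "min": min(vector),
        "max": max(vector),
        "l2_norm": math.sqrt(sum(x * x for x in vector)),
    }


# Toy unit-length vector (assumption: NOT real model output).
sample = [0.6, -0.8, 0.0]
summary = summarize(sample)
print(summary)
```

With a real embedding, you could pass `embedding_vector` to the same helper.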

---

## Vector Databases for Embedding Storage

A full retrieval system must store and search many embeddings, potentially millions, efficiently. Vector stores are built for this:

* **Chroma**: Lightweight open-source vector store.
* **FAISS**: High-performance library by Facebook AI.
* **Pinecone**: Managed vector database service.
* **Weaviate**: Open-source search engine with vector support.

These systems typically use Approximate Nearest Neighbor (ANN) algorithms, which trade a small amount of accuracy for much faster similarity searches than comparing the query against every stored vector.
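
For comparison, here is a sketch of the exact, brute-force search that ANN indexes are designed to avoid at scale: score every stored vector against the query and keep the best `k`. The vectors and chunk IDs are toy values chosen for illustration, and all vectors are unit length so the dot product equals cosine similarity.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, vectors, k=2):
    """Exact search: score every stored vector, return the k best (id, score) pairs."""
    scored = ((dot(query, vec), vec_id) for vec_id, vec in vectors.items())
    return [(vec_id, score) for score, vec_id in heapq.nlargest(k, scored)]


# Toy unit-length vectors keyed by hypothetical chunk IDs.
vectors = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.6, 0.8],
    "chunk-c": [0.0, 1.0],
}
results = top_k([0.8, 0.6], vectors, k=2)
print([vec_id for vec_id, _ in results])  # ['chunk-b', 'chunk-a']
```

The cost of this approach grows linearly with the number of stored vectors, which is why large collections rely on an index instead.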

---

## Summary and Next Steps

In this lesson you:

* Learned what embeddings are and why they matter.
* Saw how embeddings power context retrieval systems.
* Generated and inspected embeddings using OpenAI and LangChain.
* Explored vector database options for large-scale storage.

Next, you’ll apply these concepts in hands-on exercises to generate different embeddings and practice similarity search. Then, we’ll integrate a vector database to build a complete retrieval pipeline. Keep up the great work!


## Creating Document Embeddings with OpenAI

Your first task is to dive into the world of embeddings by generating one for a single document chunk from a PDF.

1. Initialize the `OpenAIEmbeddings` object.
2. Extract the text content from one of the document chunks.
3. Generate the embedding by calling the `embed_query` method on the extracted text.
4. Display the first few elements of the resulting embedding vector.

This exercise will help you understand how to convert text into a numerical representation and see how this step fits into the overall document processing pipeline. Dive in and see how embeddings bring text to life!

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# TODO: Initialize the OpenAIEmbeddings object

# TODO: Generate embeddings for the first document chunk

# TODO: Print the first few elements of the embedding vector

```

Here’s how you can complete the stub to generate and inspect an embedding for the first chunk:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# 1. Load and split the document
loader = PyPDFLoader(file_path)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# 2. Initialize the embeddings model
embeddings = OpenAIEmbeddings()  # you can pass your OpenAI API key here if needed

# 3. Extract text from the first chunk
first_chunk_text = split_docs[0].page_content

# 4. Generate the embedding vector
embedding_vector = embeddings.embed_query(first_chunk_text)

# 5. Display the first few elements
print("First 5 dimensions of the embedding vector:")
print(embedding_vector[:5])
```

**What this does:**

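1. **Loads** your PDF and **splits** it into chunks of up to 1,000 characters with a 100-character overlap (`RecursiveCharacterTextSplitter` measures length in characters by default, not tokens).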
2. **Initializes** the `OpenAIEmbeddings` object (make sure your environment variable `OPENAI_API_KEY` is set, or pass `openai_api_key="..."` to the constructor).
3. **Extracts** the raw text of the first chunk.
4. **Calls** `embed_query(...)`, which returns a list of floats (the embedding).
5. **Prints** the first five dimensions so you can see how the text is numerically represented.

When you run this, you should see output like:

```
First 5 dimensions of the embedding vector:
[0.01234567, -0.00432109, 0.02345678, ...]
```

Those numbers are the start of your embedding vector. With the default model it typically has 1,536 dimensions; the exact size depends on the model and the `dimensions` setting.
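
Step 2 above relies on `OPENAI_API_KEY` being set. A small pre-flight check can catch a missing key before any request is made. This is a sketch: `api_key_is_configured` is a hypothetical helper, and `"sk-example"` is a placeholder, not a real key.

```python
import os


def api_key_is_configured(env=None):
    """Return True if a non-empty OPENAI_API_KEY exists in the given mapping."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


# Demonstrate with explicit mappings instead of the real environment.
print(api_key_is_configured({"OPENAI_API_KEY": "sk-example"}))  # True
print(api_key_is_configured({}))                                # False
```

Calling `api_key_is_configured()` with no argument checks the actual process environment.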


## Experiment with Embedding Parameters


Now, let's take it a step further by experimenting with different configurations for the `OpenAIEmbeddings` object.

Modify the `OpenAIEmbeddings` instance to use:

* `model="text-embedding-3-large"`
* `dimensions=2048`

Observe the printed values to see how these changes affect the results. This exercise will help you understand the flexibility of embeddings and how different settings can impact your document processing tasks.

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# TODO: Initialize the OpenAIEmbeddings object with different parameters, such as:
# - model = "text-embedding-3-large"
# - dimensions = 2048
embedding_model = OpenAIEmbeddings()

# Generate embeddings for the first document chunk
embedding_vector = embedding_model.embed_query(split_docs[0].page_content)

# Print the first few elements of the embedding vector
print(embedding_vector[:5])
```

Here’s the updated snippet showing how to pass in a different model name and override the embedding dimensionality, plus a quick check of the resulting vector size:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# 1. Load and split the document
loader = PyPDFLoader(file_path)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# 2. Initialize the embeddings model with custom settings
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-large",  # switch to the larger embedding model
    dimensions=2048                  # override default dimensionality
    # (ensure your LangChain version supports this parameter)
)

# 3. Generate embeddings for the first chunk
first_chunk_text = split_docs[0].page_content
embedding_vector = embedding_model.embed_query(first_chunk_text)

# 4. Inspect the output
print(f"Embedding vector length: {len(embedding_vector)}")
print("First 5 dimensions:", embedding_vector[:5])
```

**What to look for when you run this:**

* **Vector length** should now report `2048`, confirming the dimensionality change.
* **The numerical values** in `embedding_vector[:5]` will differ from your previous run (and from the smaller-model output), reflecting how “text-embedding-3-large” encodes text in a higher-capacity space.

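Feel free to experiment with other model names or dimensions. Keep in mind that `dimensions` can only shorten a vector: the value must not exceed the model’s native size (1536 for `text-embedding-3-small`, 3072 for `text-embedding-3-large`).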
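
If you already have full-size vectors, a common way to shorten them after the fact is to keep the leading values and rescale to unit length. The sketch below shows that idea on a toy vector; `shorten` is a hypothetical helper, and passing `dimensions` to the model remains the simpler route when you can re-embed.

```python
import math


def shorten(vector, dimensions):
    """Keep the first `dimensions` values and rescale the result to unit length."""
    if dimensions > len(vector):
        raise ValueError("cannot lengthen an embedding")
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


# Toy unit-length vector (assumption: NOT real model output).
full = [0.5, 0.5, 0.5, 0.5]
short = shorten(full, 2)
print(len(short), short)
```

Re-normalizing matters because similarity measures such as the dot product assume vectors of comparable length.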


## Fix the Embedding Model Bug

Great job on learning how to load and split documents! Now, let's troubleshoot some code that raises an error. Your task is to find and fix the bug so that the embeddings are generated correctly.

This exercise should help you understand the difference between embedding models and chat models. Dive in and enhance your skills!

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# Initialize the OpenAIEmbeddings object
embedding_model = OpenAIEmbeddings(
    model="gpt-4",
    dimensions=2048
)

# Generate embeddings for the first document chunk
embedding_vector = embedding_model.embed_query(split_docs[0].page_content)

# Print the first few elements of the embedding vector
print(embedding_vector[:5])
```

The bug is the model name: `"gpt-4"` is a **chat** model, and the embeddings endpoint only serves embedding models, so the call to `embed_query` fails. To fix it, switch to a supported **embedding** model such as `"text-embedding-3-large"` and, if needed, set your desired `dimensions`. Here’s the corrected snippet:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Load and split the document
loader = PyPDFLoader(file_path)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# ❌ Buggy: chat model cannot embed
# embedding_model = OpenAIEmbeddings(
#     model="gpt-4",
#     dimensions=2048
# )

# ✅ Fixed: use an embedding model
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-large",
    dimensions=2048
)

# Generate embeddings for the first chunk
first_chunk_text = split_docs[0].page_content
embedding_vector = embedding_model.embed_query(first_chunk_text)

# Inspect the output
print(f"Embedding vector length: {len(embedding_vector)}")  # should be 2048
print("First 5 dimensions:", embedding_vector[:5])
```

**Key takeaway:**

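* **Chat models** like `gpt-4` generate text in a conversation; they cannot be used through the embeddings endpoint, so `embed_query` fails with them.
* **Embedding models** (e.g. `text-embedding-3-large`, `text-embedding-ada-002`) are designed to return numerical vectors of fixed dimension. Only the `text-embedding-3` models accept the `dimensions` parameter.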

With this change, your call to `embed_query` will succeed, returning a 2048-dimensional vector you can inspect or feed into downstream tasks.
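
One way to catch this class of mistake early is to validate the model name before constructing the embeddings object. The sketch below uses a small hand-maintained allow-list; both the list and the `check_embedding_model` helper are illustrative assumptions, not part of LangChain or the OpenAI SDK.

```python
# Hand-maintained allow-list (assumption: extend it as new models appear).
EMBEDDING_MODELS = {
    "text-embedding-3-small",
    "text-embedding-3-large",
    "text-embedding-ada-002",
}


def check_embedding_model(name):
    """Return `name` if it is a known embedding model, else raise ValueError."""
    if name not in EMBEDDING_MODELS:
        raise ValueError(
            f"{name!r} is not a known embedding model; "
            f"expected one of {sorted(EMBEDDING_MODELS)}"
        )
    return name


for candidate in ("text-embedding-3-large", "gpt-4"):
    try:
        check_embedding_model(candidate)
        print(f"{candidate}: ok")
    except ValueError as err:
        print(f"{candidate}: rejected ({err})")
```

You could then write `OpenAIEmbeddings(model=check_embedding_model(model_name))` to fail fast with a readable message.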


## Exploring Embedding Dimensions

Well done learning about generating embeddings! Now, let's explore the dimensionality of these embeddings to deepen your understanding.

Your task is to print the length of the embedding vector using the `len()` function. The length should match the `dimensions` value passed to `OpenAIEmbeddings`.

This exercise will help you grasp how the dimensions of an embedding relate to its informational capacity and computational needs.

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# Initialize the OpenAIEmbeddings object
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-large",
    dimensions=2048
)

# Generate embeddings for the first document chunk
embedding_vector = embedding_model.embed_query(split_docs[0].page_content)

# TODO: Print the length of the embedding vector to explore its dimensionality
```

One way to complete the exercise is to add a single `print` call after the embedding is generated:


```python
# ... previous code ...

# Generate embeddings for the first document chunk
embedding_vector = embedding_model.embed_query(split_docs[0].page_content)

# Print the length of the embedding vector
print(f"Embedding dimensionality: {len(embedding_vector)}")
```

When you run this, you should see:

```
Embedding dimensionality: 2048
```

confirming that your embedding vector has 2048 dimensions.
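
To see how dimensionality drives storage needs, the sketch below estimates the raw size of one million vectors at several dimensionalities. It assumes each value is stored as a 4-byte float and ignores any index or metadata overhead; `index_size_mb` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats


def index_size_mb(num_vectors, dimensions):
    """Raw vector storage in megabytes (10^6 bytes), without index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1_000_000


for dims in (1536, 2048, 3072):
    size = index_size_mb(1_000_000, dims)
    print(f"{dims} dims: {size:,.0f} MB per million vectors")
```

Comparison cost scales the same way: a dot product over 3072 values does twice the arithmetic of one over 1536.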
