# Lab 24: Document Embeddings for Semantic Search

This lab shows how to create and inspect document embeddings with OpenAI's embedding models through LangChain. You'll learn how to:
- Use `OpenAIEmbeddings` to convert text into vector representations
- Choose an embedding model and understand its characteristics
- Embed multiple documents in a single call
- Inspect embedding dimensions and vector structure
- Prepare embeddings for similarity search and RAG applications

In [None]:
# Import LangChain's interface to OpenAI's embedding models
# OpenAIEmbeddings wraps the OpenAI embeddings API behind embed_documents() and embed_query()
# The resulting vectors place semantically similar texts close together
from langchain_openai import OpenAIEmbeddings

In [None]:
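# Configure the OpenAI API key that the embedding client reads from the environment
# Avoid hardcoding a real key in the notebook; prompt for it only if it is not already set
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")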

In [None]:
# Initialize OpenAI embeddings with the text-embedding-3-large model
# text-embedding-3-large: the larger of OpenAI's two text-embedding-3 models
# - Returns 3072-dimensional vectors by default
# - Suited to semantic similarity and retrieval tasks
# - A common choice for RAG pipelines and knowledge base search
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [None]:
# Create a small sample dataset of sports headlines to embed
# All four share a theme (World Cup events) but split across two sports:
# two headlines cover cricket and two cover football
# This lets us check whether same-sport headlines land closer together
docs = [
    "Thrilling Finale Awaits: The Countdown to the Cricket World Cup Championship",
    "Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns",
    "Record Crowds and Unforgettable Moments: Highlights from the Cricket World Cup",
    "From Underdogs to Contenders: Football World Cup Surprises and Breakout Stars"
]

In [None]:
# Generate embeddings for all documents in the list
# embed_documents() processes multiple texts efficiently in a batch
# Each document is converted to a high-dimensional vector representation
# These vectors capture semantic meaning and enable similarity comparisons
embed_docs = embeddings.embed_documents(docs)
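
The comment above says these vectors enable similarity comparisons. A minimal sketch of what that comparison looks like is a pairwise cosine similarity matrix. The helper names `cosine_similarity` and `similarity_matrix` are illustrative, and `toy_vectors` holds made-up 3-dimensional values standing in for `embed_docs` so the cell runs without an API call.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities for a list of vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical stand-ins for embed_docs (not real embedding values).
# Rows 0 and 2 play the role of the cricket headlines, rows 1 and 3 football.
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.8, 0.2, 0.1],
    [0.2, 0.8, 0.1],
]

matrix = similarity_matrix(toy_vectors)
for row in matrix:
    print([round(score, 3) for score in row])
```

In the notebook, passing `embed_docs` instead of `toy_vectors` should give a 4x4 matrix in which, if the embeddings separate the two sports, the cricket pair and the football pair score higher with each other than across sports.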

In [None]:
# Verify the number of generated embeddings matches input documents
# Should return 4 (one embedding vector per input document)
len(embed_docs)

In [None]:
# Display the first embedding vector (for first document)
# Shows the actual numerical representation as a list of floating-point numbers
# These values represent the document's position in high-dimensional semantic space
embed_docs[0]
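
Printing a full 3072-value vector is hard to read. The sketch below summarizes a vector instead: its length, a short preview, its value range, and its L2 norm. `describe_vector` is a hypothetical helper, and `toy_vector` is a made-up stand-in for `embed_docs[0]`.

```python
import math


def describe_vector(vec, preview=5):
    """Summarize a vector without printing every component."""
    l2_norm = math.sqrt(sum(x * x for x in vec))
    return {
        "dimensions": len(vec),
        "preview": vec[:preview],
        "min": min(vec),
        "max": max(vec),
        "l2_norm": l2_norm,
    }


# Hypothetical unit-length vector standing in for embed_docs[0]
toy_vector = [0.6, -0.8, 0.0, 0.0]

summary = describe_vector(toy_vector, preview=2)
print(summary)
```

OpenAI documents its embeddings as normalized to length 1, so on real output `l2_norm` should come out very close to 1.0. For unit-length vectors, cosine similarity reduces to a plain dot product.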

In [None]:
# Check the dimensionality of the embedding vector
# text-embedding-3-large returns 3072-dimensional vectors by default
# More dimensions can capture finer semantic distinctions, at higher storage and compute cost
# The API's `dimensions` parameter can request shorter vectors when that trade-off matters
len(embed_docs[0])
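
With the document vectors in hand, the similarity search step mentioned in the introduction can be sketched as ranking documents by cosine similarity to a query vector. This is a brute-force illustration, not how a vector store does it internally. `rank_documents` is a hypothetical helper, and the toy vectors and shortened texts are made up; in the notebook the query vector would come from `embeddings.embed_query(...)` and the document vectors from `embed_docs`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, texts, top_k=2):
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for vec, text in zip(doc_vecs, texts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical stand-ins for docs, embed_docs, and an embedded query
toy_texts = [
    "cricket final",
    "football semi-finals",
    "cricket highlights",
    "football surprises",
]
toy_doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.8, 0.2, 0.1],
    [0.2, 0.8, 0.1],
]
toy_query_vec = [1.0, 0.0, 0.0]  # pretend this encodes a cricket question

top_matches = rank_documents(toy_query_vec, toy_doc_vecs, toy_texts, top_k=2)
for score, text in top_matches:
    print(f"{score:.3f}  {text}")
```

A linear scan like this is fine for a handful of documents. Larger collections usually hand the same vectors to a vector store that indexes them for faster lookup.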