# Setup: Installing Required Libraries

Before we begin, we need to install the necessary Python libraries. Run the cell below to install all dependencies for this notebook.

In [None]:
# Install required libraries
!pip install -q chromadb==0.4.22 openai==1.12.0 sentence-transformers==2.3.1

print("✅ All libraries installed successfully!")

# ChromaDB

In this notebook, we'll learn how to use a vector database called ChromaDB. It's open-source and well suited to quick prototyping and developing AI applications. As you will see, it integrates directly with embedding models that convert your data into vectors.

# 1. Creating Embeddings Locally

We’ll start by generating embeddings locally. This means the text will be converted into numerical vectors right on your computer. Let's import the `embedding_functions` module, which provides various embedding functions.


In [None]:
# Import Chroma and its embedding helpers
# (the unused `dotenv` import was dropped: python-dotenv is not among the packages installed above)
import chromadb
from chromadb.utils import embedding_functions
import os

We'll use the default embedding function with the model called `all-MiniLM-L6-v2` which outputs a **384-dimensional vector** for each piece of text.

In [None]:
# Instantiating embedding function
default_emb_function = embedding_functions.DefaultEmbeddingFunction()

Let's create an embedding from the name "Robert":

In [None]:
name = "Robert"

The result will be a list containing a single 384-dimensional vector:

In [None]:
# Creating an embedding
local_emb = default_emb_function([name])

print("Number of input texts processed:", len(local_emb))
print("Size of embedding vector:", len(local_emb[0]))
print("\n")
print("Vector:")
local_emb

### 📝 EXERCISE 1: Create Your Own Embeddings (5 minutes)

**What you'll practice:** Creating embeddings from custom text using the default embedding function.

**Your task:**
1. Create a list with 3-5 words or short phrases related to a topic you're interested in (e.g., programming languages, foods, cities, hobbies)
2. Generate embeddings for your list using `default_emb_function()`
3. Print the number of items processed and the size of one embedding vector

**Hint:** Follow the same pattern as the "Robert" example above. Remember that `default_emb_function()` expects a list as input.

**Expected outcome:** You should see that each item produces a 384-dimensional vector, just like the example.

In [None]:
# YOUR CODE HERE
# Example solution structure:
# my_items = ["item1", "item2", "item3"]
# my_embeddings = default_emb_function(my_items)
# print("Number of items:", len(my_embeddings))
# print("Embedding size:", len(my_embeddings[0]))

## 1.1 Storing the data
This was a very simple example of embedding just one word. In the real world, we have hundreds or thousands of texts for which we want to create embeddings and then search them later, so we need somewhere to keep them. Let's see how ChromaDB handles storing this data.


**1. In-memory Chroma Database**:
- Chroma keeps all your embeddings only while the program is running. As soon as you stop or restart your notebook, **all data is lost automatically**. This is fine for quick experiments where you don’t care about keeping the results, but it’s not practical for building something you’ll revisit later.

To create a temporary, in-memory Chroma database instance, we can use **an ephemeral client** right inside our notebook session. It starts a Chroma server entirely in memory and returns a Python client object that we can interact with as if it's a full database:

```
client = chromadb.EphemeralClient()
```

**2. Persistent storage**
- The second approach is creating **a persistent client that saves everything to files on your disk**. Even if you close your notebook or shut down your computer, you can open the same database later and all your data will still be there.

Let's create a persistent client using the code below.

In [None]:
# Creating a persistent storage
client = chromadb.PersistentClient(path="./db/chroma_persist")

> NOTE:
> There is also an `HttpClient` (“connect-to-a-remote-server”) option in Chroma: instead of running the database inside your notebook, you point a client at a Chroma server over HTTP. Use it when you want a shared, centralized vector DB for multiple notebooks, apps, or users, when deploying behind Docker/Kubernetes, or when you need language-agnostic access and a clearer separation between your app and the database.

## 1.2 Creating a Collection

Now that we have a persistent place to store our data, the next step is organizing that data inside Chroma.

We'll create **a collection that groups together documents (i.e. chunks of text), their embeddings and any associated metadata.** At the moment of creation, the collection is just like an empty database table: it defines the structure and storage space, but no vectors exist yet.

To create a collection, we will use `get_or_create_collection()`:
- It checks if a collection named, for example, "my_documents_locally" already exists in the Chroma database. If it does, it returns that existing collection. If it doesn’t, it creates a new one.

We also have the option to **specify the distance metric that will be used for comparing embeddings**. By default, Chroma uses the squared L2 (squared Euclidean) distance. For text embeddings, we can specify **"cosine"** instead by passing the key `hnsw:space` in the `metadata` parameter. `hnsw` refers to **the Hierarchical Navigable Small World** indexing algorithm that Chroma uses for fast approximate nearest-neighbor search.
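To make the difference between the two metrics concrete, here is a minimal pure-Python sketch. The helper names `squared_l2` and `cosine_distance` are illustrative and not part of Chroma's API; the sketch assumes cosine distance is defined as one minus cosine similarity.

```python
import math

def squared_l2(a, b):
    """Sum of squared coordinate differences (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
w = [2.0, 4.0, 6.0]  # same direction as v, twice the length

l2_result = squared_l2(v, w)
cos_result = cosine_distance(v, w)

print("squared L2:", l2_result)
print("cosine distance:", cos_result)
```

The two vectors point in the same direction, so their cosine distance is essentially zero even though their squared L2 distance is not. Ignoring vector length in this way is one reason cosine is a common choice for text embeddings.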

By default, Chroma assigns its default embedding function (`all-MiniLM-L6-v2`), but this can be changed using the `embedding_function` parameter. In this example, we'll create a collection with this default.


> NOTE: **The embedding function and the distance metric are both set at collection creation and cannot be changed afterwards.** If you need to change them, you’ll have to create a new collection.


In [None]:
# Creating a collection
collection = client.get_or_create_collection("my_documents_locally",
                                             metadata={"hnsw:space": "cosine"})

You can list all created collections using:

In [None]:
client.list_collections()

If you want to delete an entire collection, use the code below:

In [None]:
# This is irreversible
#client.delete_collection("my_documents_locally")

## 1.3 Adding Documents into a Collection

In real projects, we embed PDFs, HTML, or other larger documents, but to clearly see how Chroma works, we’ll start with a few short example texts:

In [None]:
# Documents with metadata that will be embedded and added into Chroma database
documents = [
    {
        "id": "document_1",
        "text": """Public-key cryptography, also known as asymmetric cryptography, represents a monumental paradigm shift from its predecessor, symmetric cryptography. The fundamental challenge it solves is that of key distribution, which was the Achilles' heel of symmetric systems requiring a shared secret key to be exchanged over a secure channel beforehand. Introduced conceptually by Whitfield Diffie and Martin Hellman in their seminal 1976 paper "New Directions in Cryptography," this system utilizes pairs of mathematically linked keys: a public key and a private key. The public key can be shared openly without compromising security, while the private key must be kept secret by its owner. This ingenious design allows for two primary functions: encryption and digital signatures. For encryption, a sender uses the recipient's public key to encrypt a message, which can then only be decrypted by the recipient using their corresponding private key. This ensures confidentiality even when communicating over a hostile, monitored network like the internet. For digital signatures, a sender uses their own private key to sign a message, and anyone with their public key can verify the signature's authenticity, ensuring both integrity and non-repudiation. The security of these systems relies on the computational difficulty of 'one-way functions' with a 'trapdoor,' where it is easy to compute in one direction but infeasible to reverse without the secret information of the private key. The first practical and widely used implementation was the RSA algorithm, developed in 1977 by Ron Rivest, Adi Shamir, and Leonard Adleman, which bases its security on the difficulty of factoring the product of two large prime numbers.
Over time, other algorithms like Elliptic Curve Cryptography (ECC) have gained prominence, offering equivalent security with much smaller key sizes, making them ideal for resource-constrained devices like smartphones. The practical application of public-key cryptography is managed through a Public Key Infrastructure (PKI). A PKI is a framework of policies and procedures that uses a trusted third party, known as a Certificate Authority (CA), to issue digital certificates that bind public keys to specific identities. This is the trust model that underpins the entire secure web, enabling protocols like Transport Layer Security (TLS), which secures HTTPS traffic. Its influence extends to securing email with PGP, remote shell access with SSH, and forming the transactional backbone of virtually all cryptocurrencies, including Bitcoin. However, the rise of quantum computing poses a significant future threat to current public-key algorithms, as they can theoretically solve the underlying mathematical problems efficiently. This has spurred the development of a new field called post-quantum cryptography (PQC), which aims to create a new generation of secure algorithms resistant to attacks from both classical and quantum computers.""",
        "metadata": {"category": "cryptography", "year": 1976, "difficulty": "advanced", "author": "Security Team"}
    },
    {
        "id": "document_2",
        "text": """Stuxnet stands as a watershed moment in the history of digital warfare, a malicious computer worm that transcended the digital realm to cause tangible, physical destruction. First discovered in 2010 by the Belarusian security firm VirusBlokAda, its origins and development are widely attributed to a top-secret joint American-Israeli intelligence operation, codenamed "Olympic Games." Stuxnet's primary target was the Iranian nuclear program, specifically the Siemens industrial control systems (ICS) that managed the gas centrifuges at the Natanz uranium enrichment facility.
What made Stuxnet exceptionally sophisticated was its multi-stage attack methodology and its method of propagation. It was specifically designed to cross the 'air gap,' a security measure where critical networks are physically isolated from the internet. The worm primarily spread through infected USB flash drives, exploiting a Windows Shell vulnerability related to .LNK files. Once inside a network, it would aggressively seek out machines running Siemens Step7 software, the specific platform used to program the programmable logic controllers (PLCs) that automated the centrifuges. The worm's payload was a masterpiece of deception. It would record the normal operating frequencies of the centrifuges and then replay this data back to the human operators, creating the illusion that everything was functioning correctly. Meanwhile, the malware would subtly and intermittently alter the rotational speed of the centrifuges, causing them to spin too fast and then too slow, inducing excessive vibration and stress that led to their eventual mechanical failure and destruction. This cyber-physical attack was unprecedented in its precision and impact. Stuxnet leveraged an astonishing four different zero-day vulnerabilities in Microsoft Windows, a record for a single piece of malware at the time. To further evade detection, it was signed with stolen, legitimate digital certificates from reputable hardware companies, Realtek and JMicron, allowing it to masquerade as trusted software. The discovery of Stuxnet irrevocably changed the landscape of national security, proving that code could be deployed as a precise and effective weapon to sabotage critical infrastructure.
It ushered in a new era of state-sponsored cyber-physical attacks and served as a blueprint for future digital weapons, fundamentally altering the calculus of modern conflict.""",
        "metadata": {"category": "cyberwarfare", "year": 2010, "difficulty": "intermediate", "author": "Security Team"}
    },
    {
        "id": "document_3",
        "text": """The Zero Trust Architecture (ZTA) is a modern cybersecurity strategy built on a profound philosophical shift: 'never trust, always verify.' This model fundamentally rejects the outdated 'castle-and-moat' concept of security, where a hardened perimeter was thought to be sufficient to protect a trusted internal network. That traditional model is no longer viable in an era of cloud computing, remote workforces, and sophisticated persistent threats that often breach the perimeter. Coined by John Kindervag at Forrester Research in 2010, Zero Trust mandates a granular, identity-centric approach to security. It operates on the core principle of 'assume breach,' meaning that an attacker is already present within the network, and therefore, no user or device can be implicitly trusted based on its physical or network location. Access to resources is granted on a per-session, least-privilege basis, strictly enforced through a dynamic policy engine. This policy engine makes access decisions based on a wide array of real-time signals, not just a static password. These signals form the pillars of Zero Trust: strong identity verification, device health, and network context. Every access request must be accompanied by robust authentication, with Multi-Factor Authentication (MFA) considered the absolute baseline. The security posture of the endpoint device is scrutinized; a device that is unpatched, jailbroken, or shows signs of infection may be denied access. To prevent the lateral movement of attackers within a network, Zero Trust heavily relies on micro-segmentation.
This practice involves breaking the network into small, isolated zones, often down to the individual workload level, and enforcing strict access controls between them. The National Institute of Standards and Technology (NIST) has formalized these concepts in its Special Publication 800-207, providing a comprehensive guide for organizations. Implementing Zero Trust is not about deploying a single product but is a strategic journey that involves integrating various technologies and redesigning network and security architecture. The ultimate goal is to create a resilient, adaptable security posture where access is continuously validated, and the potential 'blast radius' of any single breach is minimized. It is a transition from a location-based trust model to an identity- and context-based model fit for the complexities of modern IT environments.""",
        "metadata": {"category": "architecture", "year": 2010, "difficulty": "intermediate", "author": "Security Team"}
    },
    {
        "id": "document_4",
        "text": """Phishing is the most pervasive form of social engineering, an attack vector that targets the weakest link in the security chain: human psychology. Its name, a homophone of 'fishing,' originated in the mid-1990s within the hacker community, where attackers used email lures to 'phish' for passwords from unsuspecting users. This attack vector functions by having a threat actor masquerade as a trustworthy entity to deceive a victim. The ultimate objective is typically to steal sensitive data like login credentials, credit card numbers, or personally identifiable information (PII), or to deploy malware like ransomware. The attack begins with a carefully crafted lure, which most often takes the form of an email, SMS text message (known as smishing), or even a voice call (vishing). These messages are designed to exploit powerful human emotions such as fear, urgency, curiosity, or greed.
For example, a message might falsely claim a user's bank account has been compromised and demand immediate action, or it might offer an unbelievable prize. The sophistication of phishing attacks varies widely. At the low end are bulk phishing campaigns sent to millions of users with generic greetings and obvious grammatical errors. At the high end is spear phishing, a highly targeted attack that uses personal information gathered about an individual or organization to create an extremely convincing and personalized lure. A sub-variant of this, known as whaling, specifically targets senior executives or other high-value individuals within a company. Another common technique is clone phishing, where an attacker copies a legitimate, previously delivered email and replaces its links or attachments with malicious ones. The attack chain relies on the victim taking a specific action, such as clicking a hyperlink that leads to a fraudulent website designed to harvest credentials or downloading an attachment laden with malware. Defending against phishing requires a multi-layered approach. User education and awareness training are critical to help people recognize the tell-tale signs of a phishing attempt, such as mismatched URLs, suspicious sender addresses, and unexpected requests for sensitive information. On the technical side, organizations deploy email security gateways that use protocols like SPF, DKIM, and DMARC to authenticate senders, as well as advanced threat protection systems that scan links and attachments for malicious content.
Despite these defenses, phishing remains the primary initial access vector for a vast majority of all cyberattacks, making it an enduring and critical threat to both individuals and organizations.""",
        "metadata": {"category": "social-engineering", "year": 1995, "difficulty": "beginner", "author": "Security Team"}
    },
    {
        "id": "document_5",
        "text": """Multi-Factor Authentication (MFA) is a security mechanism that requires users to provide two or more verification factors to gain access to a resource, such as an application, online account, or VPN. Rather than just asking for a username and password, MFA requires additional credentials, making it harder for attackers to gain unauthorized access. The factors fall into three categories: something you know (like a password or PIN), something you have (like a smartphone or security token), and something you are (like a fingerprint or facial recognition). A common example is when you log into your bank account: you enter your password (something you know), and then the bank sends a code to your phone (something you have). Even if an attacker steals your password, they would still need access to your phone to complete the login. MFA significantly reduces the risk of account takeovers, which are common in phishing attacks where passwords are compromised. According to Microsoft, MFA can block over 99.9% of account compromise attacks. The most common MFA methods include SMS codes, authenticator apps like Google Authenticator or Microsoft Authenticator, hardware tokens like YubiKey, and biometric verification. While SMS-based MFA is better than nothing, it's vulnerable to SIM-swapping attacks where attackers convince mobile carriers to transfer a victim's phone number to a new SIM card. Authenticator apps and hardware tokens are more secure alternatives.
Organizations should implement MFA for all critical systems, especially email, VPN access, and administrative accounts.""",
        "metadata": {"category": "authentication", "year": 2020, "difficulty": "beginner", "author": "Security Team"}
    },
    {
        "id": "document_6",
        "text": """A Virtual Private Network (VPN) creates a secure, encrypted tunnel between your device and the internet. When you connect to a VPN, all your internet traffic is routed through an encrypted connection to a server operated by the VPN provider. This serves several purposes: it hides your IP address, encrypts your data so that ISPs and other third parties cannot see what you're doing online, and allows you to appear as if you're browsing from a different location. VPNs are essential for remote workers who need to securely access corporate networks from home or public Wi-Fi networks. Without a VPN, data transmitted over public Wi-Fi can be intercepted by attackers using tools like packet sniffers. VPNs use various protocols to establish secure connections, including OpenVPN, WireGuard, and IKEv2/IPSec. Each protocol offers different balances of speed, security, and compatibility. Enterprise VPNs often integrate with corporate authentication systems and enforce security policies, such as requiring devices to have up-to-date antivirus software before allowing connections. While consumer VPNs are marketed for privacy, it's important to choose reputable providers, as a malicious VPN provider could potentially monitor your traffic. For businesses, VPNs are a critical component of remote access security, especially in the context of Zero Trust architectures where every connection is verified regardless of location.""",
        "metadata": {"category": "network-security", "year": 2015, "difficulty": "beginner", "author": "Network Team"}
    }
]

In [None]:
documents[:3]

Now we'll fill the collection with data using the `add()` method, which takes a list of unique string IDs and a list of documents. The documents are embedded automatically using the collection's default embedding function, and Chroma then builds the vector index inside the collection.

In [None]:
# Adding documents into the collection
ids = [doc["id"] for doc in documents]
texts = [doc["text"] for doc in documents]

collection.add(ids = ids, documents = texts)
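Because `add()` relies on unique string IDs, it can help to sanity-check a batch before sending it. The sketch below is one possible pre-flight check in plain Python. `check_records` is a hypothetical helper, not a Chroma function, and the rule that metadata values should be simple scalars (string, integer, float, boolean) is an assumption about the pinned Chroma version.

```python
def check_records(records):
    """Collect problems that would make a batch unsuitable for add() (illustrative checks only)."""
    problems = []
    seen = set()
    for record in records:
        doc_id = record.get("id")
        if not isinstance(doc_id, str):
            problems.append(f"ID {doc_id!r} is not a string")
        elif doc_id in seen:
            problems.append(f"duplicate ID {doc_id!r}")
        else:
            seen.add(doc_id)
        # Assumption: metadata values are limited to simple scalar types
        for key, value in record.get("metadata", {}).items():
            if not isinstance(value, (str, int, float, bool)):
                problems.append(
                    f"{doc_id!r}: metadata field {key!r} has unsupported type {type(value).__name__}"
                )
    return problems

# Deliberately faulty sample batch: a repeated ID and a list-valued metadata field
sample_records = [
    {"id": "document_1", "text": "first text", "metadata": {"year": 1976}},
    {"id": "document_1", "text": "second text", "metadata": {"tags": ["a", "b"]}},
]

problems = check_records(sample_records)
for problem in problems:
    print(problem)
```

Running the same helper over the `documents` list defined earlier should return an empty list, since every ID there is unique and all metadata values are strings or integers.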

## 1.4 Working with Metadata

In the previous example, we added just the text. But ChromaDB also supports **metadata** - additional information about each document that can be used for filtering and organizing your data.

Metadata is extremely useful for:
- **Filtering searches** by category, date, author, difficulty level, etc.
- **Organizing documents** by type or topic
- **Tracking source information** without embedding it

Let's delete our current collection and recreate it with metadata included.

In [None]:
# Delete the old collection
client.delete_collection("my_documents_locally")

# Recreate it
collection = client.get_or_create_collection("my_documents_locally",
                                             metadata={"hnsw:space": "cosine"})

Now let's add documents **with metadata**. Notice how we extract the metadata dictionary from each document:

In [None]:
# Adding documents WITH metadata
ids = [doc["id"] for doc in documents]
texts = [doc["text"] for doc in documents]
metadatas = [doc["metadata"] for doc in documents]

collection.add(
    ids = ids, 
    documents = texts,
    metadatas = metadatas
)

Let's retrieve a document and see its metadata:

In [None]:
# Get document with metadata
result = collection.get(ids=["document_2"], include=["documents", "metadatas"])

print("Document ID:", result["ids"][0])
print("\nMetadata:")
for key, value in result["metadatas"][0].items():
    print(f"  {key}: {value}")
print("\nDocument preview:", result["documents"][0][:150], "...")
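The dictionary returned by `get()` holds parallel lists: position `i` in `ids`, `documents`, and `metadatas` all describe the same record. The sketch below shows one way to regroup those lists into per-document records. The `result` dictionary is a hand-written stand-in with placeholder texts, and `to_records` is a hypothetical helper, not part of Chroma.

```python
# Hand-written stand-in for the shape of a get() result (placeholder texts)
result = {
    "ids": ["document_2", "document_5"],
    "documents": ["text of document 2", "text of document 5"],
    "metadatas": [
        {"category": "cyberwarfare", "year": 2010},
        {"category": "authentication", "year": 2020},
    ],
}

def to_records(result):
    """Zip the parallel lists into one dictionary per document."""
    return [
        {"id": doc_id, "text": text, "metadata": metadata}
        for doc_id, text, metadata in zip(
            result["ids"], result["documents"], result["metadatas"]
        )
    ]

records = to_records(result)
for record in records:
    print(record["id"], record["metadata"]["category"], record["metadata"]["year"])
```

This only works for fields that were requested through `include`, so the helper assumes both `documents` and `metadatas` are present in the result.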

## 1.5 Filtering with Where Clauses

Now that we have metadata, we can use it to **filter our searches**. This is incredibly powerful - you can retrieve only documents that match specific criteria.

Let's say we only want documents from the "cryptography" category:

In [None]:
# Get only cryptography documents
crypto_docs = collection.get(
    where={"category": "cryptography"},
    include=["documents", "metadatas"]
)

print(f"Found {len(crypto_docs['ids'])} cryptography document(s):")
for doc_id, metadata in zip(crypto_docs["ids"], crypto_docs["metadatas"]):
    print(f"  - {doc_id}: {metadata['category']} ({metadata['year']})")

We can filter on any metadata field in the same way. Let's find all beginner-level documents:

In [None]:
# Get beginner documents
beginner_docs = collection.get(
    where={"difficulty": "beginner"},
    include=["metadatas"]
)

print(f"Found {len(beginner_docs['ids'])} beginner-level document(s):")
for doc_id, metadata in zip(beginner_docs["ids"], beginner_docs["metadatas"]):
    print(f"  - {doc_id}: {metadata['category']} (difficulty: {metadata['difficulty']})")

You can also use operators like `$gt` (greater than), `$lt` (less than), `$gte`, `$lte`, `$ne` (not equal), and `$in` (in list):

In [None]:
# Get documents from year 2010 or later
recent_docs = collection.get(
    where={"year": {"$gte": 2010}},
    include=["metadatas"]
)

print(f"Found {len(recent_docs['ids'])} document(s) from 2010 or later:")
for doc_id, metadata in zip(recent_docs["ids"], recent_docs["metadatas"]):
    print(f"  - {doc_id}: {metadata['category']} ({metadata['year']})")
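To combine several conditions, Chroma's filter syntax also offers the logical operators `$and` and `$or`, each taking a list of sub-filters. The sketch below illustrates how such a filter is meant to be read, using a small pure-Python matcher. `matches` is a hypothetical helper written for this explanation and is not how Chroma evaluates filters internally; the sample metadata is copied from four of the documents above.

```python
import operator

# Comparison operators named as in the text above, mapped to Python functions
_OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
}

def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter (illustrative only)."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False
            # A bare value such as {"difficulty": "beginner"} means equality
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            for op_name, expected in condition.items():
                if not _OPS[op_name](metadata[key], expected):
                    return False
    return True

sample = {
    "document_1": {"category": "cryptography", "year": 1976, "difficulty": "advanced"},
    "document_4": {"category": "social-engineering", "year": 1995, "difficulty": "beginner"},
    "document_5": {"category": "authentication", "year": 2020, "difficulty": "beginner"},
    "document_6": {"category": "network-security", "year": 2015, "difficulty": "beginner"},
}

# Beginner-level documents from 2010 or later
where = {"$and": [{"difficulty": "beginner"}, {"year": {"$gte": 2010}}]}
selected = [doc_id for doc_id, meta in sample.items() if matches(meta, where)]
print(selected)
```

The same `where` dictionary can be passed to `collection.get(where=...)` to run the combined filter against the real collection.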

## 1.6 Updating and Deleting Documents

Sometimes you need to modify existing documents or remove them entirely. ChromaDB provides methods for both operations.

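### Updating Documents with `update()` and `upsert()`

The `upsert()` method combines insert and update:
- If a document with the given ID **already exists**, it **updates** it
- If it **doesn't exist**, it **creates** it

In the Chroma version pinned for this notebook, `upsert()` expects the document text (or an embedding) for every ID. When only the metadata of an existing document changes, `update()` is the better fit because it accepts metadata on its own. Let's use it to mark document_5 as reviewed:

In [None]:
# Update metadata for document_5 (update() modifies an existing record in place)
collection.update(
    ids=["document_5"],
    metadatas=[{"category": "authentication", "year": 2020, "difficulty": "beginner", "author": "Security Team", "reviewed": True}]
)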

# Verify the update
result = collection.get(ids=["document_5"], include=["metadatas"])
print("Updated metadata for document_5:")
print(result["metadatas"][0])

You can also use `upsert()` to add a completely new document:

In [None]:
# Add a new document using upsert
collection.upsert(
    ids=["document_7"],
    documents=["""Firewalls are network security devices that monitor and control incoming and outgoing network traffic based on predetermined security rules. A firewall establishes a barrier between trusted internal networks and untrusted external networks, such as the Internet."""],
    metadatas=[{"category": "network-security", "year": 2000, "difficulty": "beginner", "author": "Network Team"}]
)

print("\nTotal documents after upsert:", collection.count())
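The "update if present, otherwise insert" behaviour can be pictured with an ordinary dictionary keyed by ID. The sketch below is a toy model only: `toy_upsert` and `store` are hypothetical names, and nothing here touches Chroma or computes embeddings.

```python
def toy_upsert(store, doc_id, document, metadata):
    """Create the record if the ID is new, otherwise overwrite the existing one."""
    existed = doc_id in store
    store[doc_id] = {"document": document, "metadata": dict(metadata)}
    return "updated" if existed else "created"

store = {}

first = toy_upsert(store, "document_7", "Firewalls monitor network traffic.", {"year": 2000})
second = toy_upsert(
    store, "document_7", "Firewalls filter network traffic.", {"year": 2000, "reviewed": True}
)

print(first, second)
print("records stored:", len(store))
print(store["document_7"])
```

Calling it twice with the same ID leaves a single record holding the latest text and metadata, which is why repeating an upsert does not increase the document count.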

### Deleting Documents

To remove specific documents, use the `delete()` method with document IDs:

In [None]:
# Delete document_7
collection.delete(ids=["document_7"])

print("Documents after deletion:", collection.count())

# Try to get the deleted document
try:
    result = collection.get(ids=["document_7"])
    if not result["ids"]:
        print("Document document_7 has been successfully deleted")
except Exception as e:
    print(f"Document not found: {e}")

You can also delete documents based on metadata filters:

In [None]:
# Example: delete all documents from before year 2000. With our data this would remove
# document_1 (1976) and document_4 (1995), so the call is left commented out.
# collection.delete(where={"year": {"$lt": 2000}})

print("\nNote: The delete with where clause is commented out to preserve our data.")
print("Syntax: collection.delete(where={'year': {'$lt': 2000}})")

## 1.7 Collection Introspection

ChromaDB provides several methods to inspect what's in your collection without retrieving all the data.

In [None]:
# Peek at the first few items
preview = collection.peek(limit=2)

print("Peeking at first 2 documents:")
for doc_id, metadata in zip(preview["ids"], preview["metadatas"]):
    print(f"\n  ID: {doc_id}")
    print(f"  Category: {metadata['category']}")
    print(f"  Year: {metadata['year']}")
    print(f"  Difficulty: {metadata['difficulty']}")

In [None]:
# Get total count
total = collection.count()
print(f"\nTotal documents in collection: {total}")

You can also get all documents (be careful with large collections!):

In [None]:
# Get all documents (IDs and metadata only)
all_docs = collection.get(include=["metadatas"])

print(f"\nAll document IDs: {all_docs['ids']}")
print(f"\nCategories present:")
categories = set(meta['category'] for meta in all_docs['metadatas'])
for cat in sorted(categories):
    print(f"  - {cat}")

> NOTE: `upsert()` works like `add()`, with one difference: if an ID already exists in the collection, the record is updated (text, metadata, embedding); if it doesn’t exist, the record is created.

You can inspect the created collection:

In [None]:
# Number of items
collection.count()

Let's return both the original text and the generated embedding for ID "document_3":

In [None]:
# Retrieving embedding
row = collection.get(ids = ["document_3"], include = ["documents", "embeddings"])

print("Original document:\n", row["documents"][0])
print("\n Embedding length: \n", len(row["embeddings"][0]))
print("\n Embedding vector: \n", row["embeddings"][0])

### 📝 EXERCISE 2: Explore Your Collection (5-7 minutes)

**What you'll practice:** Retrieving and comparing documents and embeddings from a ChromaDB collection.

**Your task:**
1. Retrieve the document with ID `"document_2"` (the Stuxnet article) from the collection
2. Print the first 200 characters of the document text to see what it's about
3. Print the embedding vector length
4. Compare: Is the embedding size the same as for `document_3`? Why or why not?

**Hint:** Use the same `collection.get()` method as shown above, but change the ID.

**Expected outcome:** You should see the Stuxnet article text and confirm that all documents in the same collection have the same embedding size (384 dimensions for the default model).

In [None]:
# YOUR CODE HERE
# Example solution structure:
# row = collection.get(ids=["document_2"], include=["documents", "embeddings"])
# print("First 200 characters:", row["documents"][0][:200])
# print("Embedding length:", len(row["embeddings"][0]))

The embeddings are stored permanently on disk in our Chroma database. Whenever we want to use them again (after restarting or closing the kernel), we don’t need to re-embed the texts. We just point a new client at the same folder and re-open the collection using `get_collection()`.

In [None]:
#client = chromadb.PersistentClient(path="./db/chroma_persist")
#collection = client.get_collection("my_documents_locally")

# 2. Creating Embeddings using OpenAI API

Now we’ll switch to **OpenAI** as the embedding provider. The concept is exactly the same: text gets transformed into numerical vectors, but this time **the computation happens on OpenAI’s servers** instead of locally.

**Authentication using OpenAI API key**:

To access OpenAI’s API, you’ll need an API key. A convenient way is to store it as an environment variable so it is available in every new terminal session. If your shell is zsh, open the terminal and type `nano ~/.zshrc`. At the end of the file, add: `export OPENAI_API_KEY="your_api_key"` (this is the variable name that the code below reads). Save and exit. Then reload your shell config (so the change applies immediately) using `source ~/.zshrc`.

We'll use `text-embedding-3-small` model to create the embeddings. For our tiny demo texts, the number of tokens is so small that the cost of creating embeddings with OpenAI is practically negligible.

Let's load the API key first and then instantiate the model using `OpenAIEmbeddingFunction()`:

In [None]:
import os

# Configure OpenAI API key
OPENAI_API_KEY = None

try:
    from google.colab import userdata  # type: ignore
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    if OPENAI_API_KEY:
        print('✅ API key loaded from Colab secrets')
except Exception:
    pass

if not OPENAI_API_KEY:
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

if not OPENAI_API_KEY:
    try:
        from getpass import getpass
        print('💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY')
        OPENAI_API_KEY = getpass('Enter your OpenAI API Key: ')
    except Exception as exc:
        raise ValueError('❌ ERROR: No API key provided! Set OPENAI_API_KEY as an environment variable or Colab secret.') from exc

if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == '':
    raise ValueError('❌ ERROR: No API key provided!')

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

print('✅ Authentication configured!')

OPENAI_MODEL = 'gpt-5-nano'  # Using gpt-5-nano for cost efficiency
print(f'🤖 Selected Model: {OPENAI_MODEL}')

OPENAI_EMBED_MODEL = 'text-embedding-3-small'
print(f'🧠 Embedding Model: {OPENAI_EMBED_MODEL}')


In [None]:
# Instantiating OpenAI embedding function
openai_embedding_function = embedding_functions.OpenAIEmbeddingFunction(
    model_name = OPENAI_EMBED_MODEL,
    api_key = os.getenv("OPENAI_API_KEY")
)

Now we need to create a new collection (in the same persistent database). Notice that we pass `openai_embedding_function` via the `embedding_function` parameter. This ensures that whenever we add documents to this collection, they’ll be embedded with OpenAI instead of Chroma's default model.

In [None]:
# Creating a new collection
openai_collection = client.get_or_create_collection(
    "my_documents_openai",
    embedding_function = openai_embedding_function,
    metadata={"hnsw:space": "cosine"}
)

Next we'll add the documents into this collection:

In [None]:
# Adding documents
ids = [d["id"] for d in documents]
texts = [d["text"] for d in documents]

openai_collection.add(ids = ids, documents = texts)

For the model we chose, each embedding is a **1536-dimensional vector**. Because this is such a large array, Jupyter Notebook may truncate the displayed result.
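One practical consequence of the larger dimension is storage. The following back-of-the-envelope sketch compares the raw size of the vectors alone for the two models, assuming each component is stored as a 32-bit float and ignoring index overhead, documents, and metadata. `raw_vector_bytes` is an illustrative helper, and the collection size of 100,000 is an arbitrary example.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each vector component is a 32-bit float

def raw_vector_bytes(n_vectors, dims):
    """Bytes needed for the vector components alone (no index or metadata overhead)."""
    return n_vectors * dims * BYTES_PER_FLOAT32

n_vectors = 100_000  # arbitrary example size
sizes = {}
for model_name, dims in [("all-MiniLM-L6-v2", 384), ("text-embedding-3-small", 1536)]:
    sizes[model_name] = raw_vector_bytes(n_vectors, dims)
    print(f"{model_name}: {sizes[model_name] / 2**20:.1f} MiB for {n_vectors} vectors")
```

Since 1536 is four times 384, the OpenAI vectors take four times the raw space of the local ones under these assumptions.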

### 📝 EXERCISE 3: Compare Local vs OpenAI Embeddings (10 minutes)

**What you'll practice:** Understanding the differences between local and cloud-based embedding models.

**Your task:**
1. Retrieve `document_4` (about Phishing) from the `openai_collection`
2. Print the embedding vector length for this OpenAI embedding
3. Compare it to the embedding length from the local collection (384 dimensions)
4. Think about: Why might OpenAI's embeddings be larger? What are the trade-offs?

**Discussion points to consider:**
- Local embeddings (384d): Faster, free, runs on your computer, good for prototyping
- OpenAI embeddings (1536d): Higher dimensional, potentially captures more nuance, costs money, requires API calls
- Both can work well depending on your use case!

**Hint:** Use `openai_collection.get()` method with the appropriate ID.

In [None]:
# YOUR CODE HERE
# Example solution structure:
# row = openai_collection.get(ids=["document_4"], include=["documents", "embeddings"])
# print("Document topic:", row["documents"][0][:100])
# print("OpenAI embedding length:", len(row["embeddings"][0]))
# print("\nComparison:")
# print("  Local model: 384 dimensions")
# print("  OpenAI model:", len(row["embeddings"][0]), "dimensions")

In [None]:
# Retrieving embedding
row = openai_collection.get(ids = ["document_1"], include = ["documents", "embeddings"])

print("Original document:\n", row["documents"][0])
print("\n Embedding length: \n", len(row["embeddings"][0]))
print("\n Embedding vector: \n", row["embeddings"][0])