# Embedding-Based Retrieval with Activeloop and OpenAI

This second component of the RAG pipeline transforms the data prepared by the first component into embeddings and stores the resulting vectors in the vector store.

# Installing the environment

*First, run the following cells and restart the Google Colab session if prompted. Then run the notebook again, cell by cell, to explore the code.*

In [1]:
try:
  import deeplake
except ImportError:
  !pip install deeplake==3.9.18
  import deeplake

Collecting deeplake==3.9.18
  Downloading deeplake-3.9.18.tar.gz (608 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pillow~=10.2.0 (from deeplake==3.9.18)
  Downloading pillow-10.2.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.7 kB)
Collecting boto3 (from deeplake==3.9.18)
  Downloading boto3-1.35.26-py3-none-any.whl.metadata (6.6 kB)
Collecting pathos (from deeplake==3.9.18)
  Downloading pathos-0.3.2-py3-none-any.whl.metadata (11 kB)
Collecting humbug>=0.3.1 (from deeplake==3.9.18)
  Downloading humbug-0.3.2-py3-none-any.whl.metadata (6.8 kB)
Collecting lz4 (from deeplake==3.9.18)
  Downloading lz4-4.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Collecting libdeeplake==0.0.138 (fr



Mount a drive or implement the method that best fits your project to retrieve API tokens.

In [1]:
# Google Drive option to store API keys
# Store your key in a file and read it (you can type it directly in the notebook, but it will be visible to anyone next to you)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


`grequests.py` contains a function to download files from GitHub.

In [2]:
#GitHub grequests.py
#Script to download files from the GitHub repository.

import subprocess

url = "https://raw.githubusercontent.com/Denis2054/RAG-Driven-Generative-AI/main/commons/grequests.py"
output_file = "grequests.py"

# Prepare the curl command
curl_command = [
    "curl",
    "--fail",  # exit with a non-zero status on HTTP errors so check=True can catch them
    "-o", output_file,
    url
]

# Execute the curl command
try:
    subprocess.run(curl_command, check=True)
    print("Download successful.")
except subprocess.CalledProcessError:
    print("Failed to download the file.")


Download successful.


In [3]:
!pip install openai==1.40.3

Collecting openai==1.40.3
  Downloading openai-1.40.3-py3-none-any.whl.metadata (22 kB)
Collecting httpx<1,>=0.23.0 (from openai==1.40.3)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai==1.40.3)
  Downloading jiter-0.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai==1.40.3)
  Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai==1.40.3)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading openai-1.40.3-py3-none-any.whl (360 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
Downloading httpcore-1.0.

In [4]:
# For Google Colab and Activeloop (Deep Lake library)
# The following lines write the string "nameserver 8.8.8.8" to /etc/resolv.conf, which tells the system
# to use the DNS server at 8.8.8.8, one of Google's Public DNS servers.
with open('/etc/resolv.conf', 'w') as file:
   file.write("nameserver 8.8.8.8")

In [5]:
# Retrieving and setting the OpenAI API key
with open("drive/MyDrive/files/api_key.txt", "r") as f:
    API_KEY = f.readline().strip()

# Set the OpenAI API key as an environment variable and on the client
import os
import openai
os.environ['OPENAI_API_KEY'] = API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")

In [6]:
# Retrieving and setting the Activeloop API token
with open("drive/MyDrive/files/activeloop.txt", "r") as f:
    API_token = f.readline().strip()
ACTIVELOOP_TOKEN = API_token
os.environ['ACTIVELOOP_TOKEN'] = ACTIVELOOP_TOKEN
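
For projects that do not use Google Drive, one alternative is to look for the key in an environment variable first and fall back to a key file only if the variable is unset. The following is a minimal sketch of that idea, not part of the original notebook; the helper name `load_secret` and its parameters are hypothetical.

```python
import os
from pathlib import Path


def load_secret(env_var, fallback_path=None):
    """Return a secret from an environment variable, or from the first line of a file.

    Hypothetical helper: the name and signature are illustrative only.
    """
    value = os.environ.get(env_var, "").strip()
    if value:
        return value
    if fallback_path is not None and Path(fallback_path).is_file():
        with open(fallback_path, "r", encoding="utf-8") as f:
            return f.readline().strip()
    raise RuntimeError(f"{env_var} is not set and no key file was found")
```

With such a helper, the two cells above could be reduced to calls such as `load_secret("OPENAI_API_KEY", "drive/MyDrive/files/api_key.txt")`, and the key never has to be typed into the notebook.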

# Embedding and Storage: populating the vector store

## Downloading and preparing the data

In [7]:
from grequests import download
source_text = "llm.txt"

directory = "Chapter02"
filename = "llm.txt"
download(directory, filename)


Downloaded 'llm.txt' successfully.


In [8]:
# Open the file and read the first 20 lines
with open('llm.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
    # Print the first 20 lines
    for line in lines[:20]:
        print(line.strip())

Exploration of space, planets, and moons "Space Exploration" redirects here. For the company, see SpaceX . For broader coverage of this topic, see Exploration . Buzz Aldrin taking a core sample of the Moon during the Apollo 11 mission Self-portrait of Curiosity rover on Mars 's surface Part of a series on Spaceflight History History of spaceflight Space Race Timeline of spaceflight Space probes Lunar missions Mars missions Applications Communications Earth observation Exploration Espionage Military Navigation Settlement Telescopes Tourism Spacecraft Robotic spacecraft Satellite Space probe Cargo spacecraft Crewed spacecraft Apollo Lunar Module Space capsules Space Shuttle Space stations Spaceplanes Vostok Space launch Spaceport Launch pad Expendable and reusable launch vehicles Escape velocity Non-rocket spacelaunch Spaceflight types Sub-orbital Orbital Interplanetary Interstellar Intergalactic List of space organizations Space agencies Space forces Companies Spaceflight portal v t e S

Chunking the data

In [9]:
with open(source_text, 'r', encoding='utf-8') as f:
    text = f.read()

# Split the text into fixed-size chunks of CHUNK_SIZE characters
CHUNK_SIZE = 1000
chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
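
Fixed-size slicing is simple, but it can cut a sentence in two at a chunk boundary. A common variation is to let consecutive chunks share a few characters. The sketch below illustrates this on a short string; the function `chunk_text` and its `overlap` parameter are assumptions added for illustration and are not used elsewhere in this notebook. With `overlap=0` it behaves like the slicing in the cell above.

```python
def chunk_text(text, chunk_size=1000, overlap=0):
    """Split text into chunks of chunk_size characters.

    Consecutive chunks share `overlap` characters (illustrative parameter).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4))             # ['abcd', 'efgh', 'ij']
print(chunk_text(sample, chunk_size=4, overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```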

## Verifying that the vector store exists or creating it

If the vector store doesn't exist, the following code will create it and display a message.

If the vector store exists, only the "Vector store exists" message will be displayed.

Deep Lake

**Replace `hub://denis76/space_exploration_v1` with your organization and dataset name.**

In [10]:
vector_store_path = "hub://denis76/space_exploration_v1"

In [None]:
from deeplake.core.vectorstore.deeplake_vectorstore import VectorStore
import deeplake.util

create_vector_store = False  # set to True below if the store cannot be loaded
try:
    # Attempt to load the vector store
    vector_store = VectorStore(path=vector_store_path)
    print("Vector store exists")
except FileNotFoundError:
    print("Vector store does not exist. You can create it.")
    # Code to create the vector store goes here
    create_vector_store = True


## The embedding function

In [12]:
def embedding_function(texts, model="text-embedding-3-small"):
    # Accept a single string or a list of strings
    if isinstance(texts, str):
        texts = [texts]
    # Replace newlines with spaces before embedding
    texts = [t.replace("\n", " ") for t in texts]
    response = openai.embeddings.create(input=texts, model=model)
    return [data.embedding for data in response.data]
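
The embedding function sends every text it receives in a single request. For a long list of chunks, it can help to split the work into smaller requests. The sketch below shows that pattern under stated assumptions: `embed_in_batches` is a hypothetical helper, the batch size of 100 is arbitrary, and `fake_embed` is a stand-in for `embedding_function` so the example runs without an API key. Deep Lake may already batch calls internally, so treat this as an illustration of the pattern rather than a required step.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on successive slices of texts and concatenate the results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        embeddings.extend(embed_fn(texts[start:start + batch_size]))
    return embeddings


def fake_embed(batch):
    # Stand-in for a real embedding call: one 1-dimensional "vector" per text
    return [[float(len(t))] for t in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```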

## Adding data to the vector store

In [13]:
add_to_vector_store=True

In [None]:
if add_to_vector_store:
    with open(source_text, 'r') as f:
        text = f.read()
    CHUNK_SIZE = 1000
    chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

    # Embed each chunk and store it with its source as metadata
    vector_store.add(text=chunked_text,
                     embedding_function=embedding_function,
                     embedding_data=chunked_text,
                     metadata=[{"source": source_text}] * len(chunked_text))
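
The metadata expression `[{"source": source_text}]*len(chunked_text)` repeats a reference to one dictionary, which is fine as long as every chunk carries identical, read-only metadata. If each chunk needs its own fields, a separate dictionary per chunk is required. The following sketch shows the difference; `build_metadata` and the `chunk_index` field are hypothetical and are not part of the original notebook.

```python
def build_metadata(chunks, source):
    """Return one independent metadata dict per chunk (illustrative helper)."""
    return [{"source": source, "chunk_index": i} for i, _ in enumerate(chunks)]


shared = [{"source": "llm.txt"}] * 3
print(shared[0] is shared[1])  # True: the same dict object is repeated

metadata = build_metadata(["first", "second", "third"], "llm.txt")
print(metadata[1])  # {'source': 'llm.txt', 'chunk_index': 1}
```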


# Vector Store information

Summary

In [None]:
# Print the summary of the Vector Store
print(vector_store.summary())

Visualize

Online:
https://app.activeloop.ai/datasets/mydatasets/

In [None]:
ds = deeplake.load(vector_store_path)

Dataset size

In [17]:
# Estimate the size of the dataset in bytes.
ds_size=ds.size_approx()

In [18]:
# Convert bytes to megabytes and limit to 5 decimal places
ds_size_mb = ds_size / 1048576
print(f"Dataset size in megabytes: {ds_size_mb:.5f} MB")

# Convert bytes to gigabytes and limit to 5 decimal places
ds_size_gb = ds_size / 1073741824
print(f"Dataset size in gigabytes: {ds_size_gb:.5f} GB")

Dataset size in megabytes: 55.31311 MB
Dataset size in gigabytes: 0.05402 GB
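
The two conversions above divide by 1024² and 1024³ respectively. The same arithmetic can be wrapped in a small helper that picks the unit automatically. This is a sketch only: `format_size` is a hypothetical name, and the byte count passed to it is an illustrative value, not the measured size of the dataset.

```python
def format_size(num_bytes):
    """Format a byte count using binary units (1 KB = 1024 B), 5 decimal places."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024, or at the last unit
        if size < 1024 or unit == units[-1]:
            return f"{size:.5f} {unit}"
        size /= 1024


print(format_size(58_000_000))  # 55.31311 MB
```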
