[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/telmo-correa/nebulous-llm-experiment/blob/main/notebooks/1%20-%20Vector%20embedding.ipynb)

### 1. Vector embedding

This notebook demonstrates how to use LangChain, OpenAI, and Chroma to build a Chroma database of vector embeddings for an arbitrary set of text files.

Note that the data scraping for the sources used by the other notebooks in this repository is not included here; instead, pre-computed embeddings are made available for download in the notebooks that use them.

In [None]:
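# If running on Google Colab, install the dependencies.
# chromadb backs the Chroma vector store; tiktoken is used by OpenAIEmbeddings.

%pip install openai langchain chromadb tiktoken tqdm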

Import libraries:

In [None]:
import os
import openai

import getpass
from tqdm import tqdm

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

Set up the OpenAI API key:

In [None]:
if "OPENAI_API_KEY" not in os.environ:
    print("Please enter your OpenAI API key:")
    openai_api_key = getpass.getpass()

    os.environ["OPENAI_API_KEY"] = openai_api_key
    openai.api_key = openai_api_key

Set up the input and output directories, chunk size, and chunk overlap:

In [None]:
INPUT_DIRECTORY = "input_files/"  # directory to read files from
PERSIST_DIRECTORY = "/tmp/chroma"  # directory to persist the chroma DB

CHUNK_SIZE = 1000                 # maximum size of each chunk, in characters
CHUNK_OVERLAP = 150               # number of characters shared by consecutive chunks
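
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph breaks and newlines and treats the chunk size as an upper bound. The helper name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Tiny values so the overlap is easy to see by eye
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so text that straddles a chunk boundary still appears whole in at least one chunk.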

In [None]:
# Create text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

# Create two lists to hold docs in memory
all_texts = []
all_metadata = []

# Iterate through files
for entry in tqdm(os.scandir(INPUT_DIRECTORY)):
    if entry.is_dir():
        continue

    # Read file content (explicit encoding avoids platform-dependent defaults)
    with open(entry.path, encoding="utf-8") as f:
        content = f.read()

    # Split file into chunks
    texts = text_splitter.split_text(content)

    # Save chunks into list
    all_texts += texts

    # Save metadata for each chunk
    all_metadata += [{"source": f"{entry.name}_{i+1}"} for i in range(len(texts))]

print("Number of chunks:", len(all_texts))
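
Before sending the chunks to a paid API, it can help to check how they are distributed across files and roughly how much text they contain. The sketch below runs on small hypothetical stand-ins for `all_texts` and `all_metadata`; the helper name `summarize_chunks` is an assumption, and the figure of about four characters per token is only a rough rule of thumb for English text, not an exact tokenizer count.

```python
from collections import Counter

# Hypothetical stand-ins for the all_texts / all_metadata lists built above
example_texts = ["first chunk of a.txt", "second chunk of a.txt", "only chunk of b.txt"]
example_metadata = [{"source": "a.txt_1"}, {"source": "a.txt_2"}, {"source": "b.txt_1"}]


def summarize_chunks(texts, metadata, chars_per_token=4):
    """Return per-file chunk counts, total characters, and a rough token estimate."""
    if len(texts) != len(metadata):
        raise ValueError("each chunk needs exactly one metadata entry")
    # The source label is "<file name>_<chunk number>", so drop the final "_" part
    per_file = Counter(m["source"].rsplit("_", 1)[0] for m in metadata)
    total_chars = sum(len(t) for t in texts)
    return {
        "chunks_per_file": dict(per_file),
        "total_chars": total_chars,
        "approx_tokens": total_chars // chars_per_token,  # heuristic only
    }


summary = summarize_chunks(example_texts, example_metadata)
print(summary)
```

In the notebook itself, the same function could be called as `summarize_chunks(all_texts, all_metadata)`.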

Create OpenAI embeddings:

In [None]:
embedding = OpenAIEmbeddings()
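
An embedding model maps each chunk to a vector of numbers so that texts with similar meaning end up close together, which is what lets a vector store retrieve relevant chunks for a query. The toy sketch below illustrates the idea with made-up three-dimensional vectors and cosine similarity. It does not call OpenAI or Chroma; real embeddings have far more dimensions, and the distance metric a vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only; a real model would produce these
toy_vectors = {
    "cats are small felines": [0.9, 0.1, 0.0],
    "dogs are loyal pets": [0.7, 0.6, 0.1],
    "stock prices fell today": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.0]  # pretend embedding of a question about cats

# Rank the stored texts from most to least similar to the query
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query_vector, toy_vectors[text]),
    reverse=True,
)
for text in ranked:
    print(text)
```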

Create the Chroma DB and persist it to the configured directory.

**Note**: This will call OpenAI to generate embeddings for each text chunk.

In [None]:
docsearch = Chroma.from_texts(
    texts=all_texts,
    embedding=embedding,
    metadatas=all_metadata,
    persist_directory=PERSIST_DIRECTORY,
)
docsearch.persist()

The embeddings are now saved in the persist directory; we can compress that directory into a single archive for convenience.

In [None]:
!echo Compressing $PERSIST_DIRECTORY into chroma.tar.gz...
!tar -cvzf chroma.tar.gz $PERSIST_DIRECTORY
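
Outside a notebook shell, the same archive can be created, and later restored, with Python's standard `tarfile` module. This is a minimal sketch rather than part of the original workflow: the helper names `compress_directory` and `restore_directory` are hypothetical, and the round trip runs on a throwaway directory with a placeholder file instead of a real Chroma DB.

```python
import os
import tarfile
import tempfile


def compress_directory(source_dir, archive_path):
    """Write source_dir into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores a relative top-level folder instead of the absolute path
        tar.add(source_dir, arcname=os.path.basename(os.path.normpath(source_dir)))


def restore_directory(archive_path, target_dir):
    """Unpack an archive created by compress_directory into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)


# Round trip on a throwaway directory (hypothetical file name and content)
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "chroma")
    os.makedirs(source)
    with open(os.path.join(source, "example.txt"), "w") as f:
        f.write("placeholder")

    archive = os.path.join(workdir, "chroma.tar.gz")
    compress_directory(source, archive)
    restore_directory(archive, os.path.join(workdir, "restored"))
    restored_files = os.listdir(os.path.join(workdir, "restored", "chroma"))
    print(restored_files)
```

Only extract archives you created or trust, since a tar file can contain entries that write outside the target directory.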

Display the archive's file info (make sure to download it or save it somewhere!):

In [None]:
!ls -lh chroma.tar.gz

Remove the temporary persist directory:

In [None]:
!rm -rf $PERSIST_DIRECTORY