# Building a Vector Index

This notebook builds a vector index from pre-fetched Wikipedia data with the `langchain` library. The index is stored in a `Chroma` vector store and persisted to disk for subsequent querying.

In [11]:
import logging

import chromadb
import numpy as np
import pandas as pd
from chromadb.config import Settings
from langchain_chroma import Chroma
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from tqdm import tqdm

In [2]:
logging.basicConfig(level=logging.INFO)

### Configure the model

We use the `all-MiniLM-L6-v2` model in GGUF format (f16) via GPT4All to compute embeddings locally and efficiently.

In [3]:
embeddings = GPT4AllEmbeddings(
    model_name="all-MiniLM-L6-v2.gguf2.f16.gguf",
    n_threads=8,
)

Failed to load libllamamodel-mainline-cuda.so: dlopen: libcudart.so.11.0: cannot open shared object file: No such file or directory
Failed to load libllamamodel-mainline-cuda-avxonly.so: dlopen: libcudart.so.11.0: cannot open shared object file: No such file or directory


### Load pre-fetched Wikipedia data

In [None]:
articles_df = pd.read_parquet("../data/input/wikipedia_articles.parquet")

logging.info(f"Loaded {len(articles_df)} articles from Parquet.")

INFO:root:Loaded 4573 articles from Parquet.


### Split summaries into chunks

To improve retrieval precision, we split each summary into chunks of at most 1000 characters, with a 200-character overlap between consecutive chunks. Each chunk keeps its article's metadata: source index, title, and URL.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""],
    length_function=len,
)
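
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a fixed-width sliding window. This is only a simplified sketch: `RecursiveCharacterTextSplitter` prefers to cut at the listed separators rather than at fixed offsets, and the helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> list[str]:
    """Cut `text` into fixed-width windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 2500-character text (illustrative data, not from the dataset).
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # → [1000, 1000, 900]
```

The last 200 characters of one window are the first 200 of the next, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.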

In [None]:
documents = text_splitter.create_documents(
    articles_df["summary"].tolist(),
    metadatas=[
        {"source": str(row.Index), "title": row.title, "url": row.url}
        for row in articles_df.itertuples()
    ],
)
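
Conceptually, every chunk produced from a summary receives its own copy of that summary's metadata, which is what lets a search hit be traced back to its article. A minimal sketch of that idea in plain Python, assuming a hypothetical helper `attach_metadata`, a naive paragraph splitter, and made-up sample data (none of these come from `langchain`):

```python
import copy
from typing import Callable


def attach_metadata(
    texts: list[str],
    metadatas: list[dict],
    split: Callable[[str], list[str]],
) -> list[dict]:
    """Split each text and give every resulting chunk a copy of its metadata."""
    records = []
    for text, metadata in zip(texts, metadatas):
        for chunk in split(text):
            # Copy so that editing one chunk's metadata does not affect others.
            records.append(
                {"page_content": chunk, "metadata": copy.deepcopy(metadata)}
            )
    return records


# Hypothetical sample data standing in for article summaries.
texts = ["First paragraph.\n\nSecond paragraph.", "Only paragraph."]
metadatas = [
    {"source": "0", "title": "Article A", "url": "https://example.org/a"},
    {"source": "1", "title": "Article B", "url": "https://example.org/b"},
]

records = attach_metadata(texts, metadatas, split=lambda t: t.split("\n\n"))
for record in records:
    print(record["metadata"]["title"], "|", record["page_content"])
```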

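Prepend the article `title` to each chunk so that every chunk names its topic, which helps retrieval. Because this happens after splitting, chunks can slightly exceed the 1000-character limit.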

In [24]:
for doc in documents:
    doc.page_content = f"{doc.metadata['title']}\n{doc.page_content}"

In [32]:
logging.info(f"Split {len(articles_df)} articles into {len(documents)} chunks.")

INFO:root:Split 4573 articles into 7873 chunks.


## Build the index

### Index configuration

Embeddings are computed automatically when documents are added. The resulting vectors are stored in a persistent `Chroma` collection.

In [33]:
chroma_client = chromadb.PersistentClient(
    path="../data/database/wikipedia.db",
    settings=Settings(allow_reset=True),
)

INDEX_NAME = "wikipedia-index"
vector_store = Chroma(
    client=chroma_client,
    collection_name=INDEX_NAME,
    embedding_function=embeddings,
)
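
To make "computed on ingestion" concrete, here is a toy in-memory store that calls its embedding function whenever texts are added. It is a sketch only: `ToyVectorStore` and `toy_embed` are hypothetical names, and the vowel-counting "embedding" is a stand-in for a real model, not how `Chroma` or GPT4All work internally.

```python
from dataclasses import dataclass, field
from typing import Callable


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: counts each vowel."""
    lowered = text.lower()
    return [float(lowered.count(vowel)) for vowel in "aeiou"]


@dataclass
class ToyVectorStore:
    """Keeps texts and their vectors side by side in memory."""

    embed: Callable[[str], list[float]]
    texts: list[str] = field(default_factory=list)
    vectors: list[list[float]] = field(default_factory=list)

    def add_texts(self, texts: list[str]) -> None:
        # The caller passes raw text; the store embeds it at insertion time.
        for text in texts:
            self.texts.append(text)
            self.vectors.append(self.embed(text))


store = ToyVectorStore(embed=toy_embed)
store.add_texts(["Donald Trump", "Wikipedia"])
print(store.vectors)
```

The caller never handles vectors directly, which mirrors why the ingestion code only has to pass documents to the store.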

### Load the documents

In [None]:
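# Uncomment to empty the collection before re-ingesting; otherwise re-running the cell below adds duplicates.
# vector_store.reset_collection()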

In [39]:
document_batch_size = 100
# Slice the list into batches of at most `document_batch_size` documents.
document_batches = [
    documents[i : i + document_batch_size]
    for i in range(0, len(documents), document_batch_size)
]

for document_batch in tqdm(document_batches, desc="Ingesting documents"):
    vector_store.add_documents(document_batch)

Ingesting documents: 100%|██████████| 79/79 [11:09<00:00,  8.48s/it]


In [43]:
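# Count through the public client API instead of the private `_chroma_collection` attribute.
total_vectors = chroma_client.get_collection(INDEX_NAME).count()
logging.info(f"Ingested {total_vectors} vectors into '{INDEX_NAME}'.")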

INFO:root:Ingested 7873 vectors into 'wikipedia-index'.


## Perform a similarity search
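
Conceptually, a similarity search embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below ranks toy vectors by cosine similarity. The metric actually used depends on how the `Chroma` collection is configured, so cosine similarity is an assumption here, as are the helper names `cosine_similarity` and `top_k` and the 3-dimensional sample vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], stored: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the labels of the `k` stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), label)
        for label, vector in stored.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


# Toy vectors standing in for embedded chunks (illustrative values only).
stored = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], stored, k=2))  # → ['chunk-a', 'chunk-b']
```

A real vector store replaces this exhaustive scan with an approximate nearest-neighbour index, so it does not have to compare the query against every stored vector.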

In [44]:
vector_store.search(
    "donald trump",
    search_type="similarity",
    k=5,
)

[Document(id='b98d5e99-5c1f-49d8-bff2-84b566762c33', metadata={'source': '0', 'title': 'Donald Trump', 'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}, page_content="Donald Trump\nDonald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who is the 47th president of the United States. A member of the Republican Party, he served as the 45th president from 2017 to 2021.\nBorn into a wealthy family in the New York City borough of Queens, Trump graduated from the University of Pennsylvania in 1968 with a bachelor's degree in economics. He became the president of his family's real estate business in 1971, renamed it the Trump Organization, and began acquiring and building skyscrapers, hotels, casinos, and golf courses. He launched side ventures, many licensing the Trump name, and filed for six business bankruptcies in the 1990s and 2000s. From 2004 to 2015, he hosted the reality television show The Apprentice, bolstering his fame as a billiona