# How to create and populate a vector search index

This notebook covers the basics of populating a vector index in Redis. If you are new to vector search or RAG with Redis, see [Redis AI Resources](https://github.com/redis-developer/redis-ai-resources) for more recipes to get started.

## Creating chunks from a PDF

If you are starting from a PDF document that you want to make searchable, you can use LangChain to load the file and split it into chunks.

In [32]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

doc = "data/volvo_c30.pdf" # path to pdf or other type of file to load

# set up the file loader/extractor and text splitter to create chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500, chunk_overlap=0
)

loader = UnstructuredFileLoader(
    doc, mode="single", strategy="fast"
)

# extract, load, and make chunks
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)

Done preprocessing. Created 213 chunks of the original pdf data/volvo_c30.pdf
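To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width character chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraph and line breaks). The `chunk_text` helper is hypothetical and only illustrates what the two parameters control.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration only; this is not LangChain's splitting algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"  # 10 characters
no_overlap = chunk_text(sample, chunk_size=4)
with_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, as in the cell above, no text is repeated between chunks. A non-zero overlap repeats the tail of each chunk at the start of the next, which can help when a relevant passage straddles a chunk boundary.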


## (Optional) Save the chunks to a file

In [33]:
import json

output_file = "data/volvo_chunks.json"

with open(output_file, "w") as f:
    json_chunks = [
        {
            "text": chunk.page_content,
            "make": "volvo",
            "model": "c30",
            "item_id": f"volvo_c30:{i}"
        } for i, chunk in enumerate(chunks)
    ]

    json.dump(json_chunks, f)
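Because `item_id` is later passed as the `id_field` when loading into Redis, it can be worth confirming that the saved file round-trips cleanly and that every id is unique. The sketch below uses stand-in texts and a temporary directory rather than the real `data/` folder; the file name and sample texts are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

# Stand-in chunk texts; in the notebook these come from chunk.page_content.
texts = ["first chunk", "second chunk"]

records = [
    {"text": t, "make": "volvo", "model": "c30", "item_id": f"volvo_c30:{i}"}
    for i, t in enumerate(texts)
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "chunks.json"  # hypothetical file name
    path.write_text(json.dumps(records), encoding="utf-8")
    reloaded = json.loads(path.read_text(encoding="utf-8"))

# The round trip should preserve every record, and ids should be unique.
item_ids = [r["item_id"] for r in reloaded]
round_trip_ok = reloaded == records
ids_unique = len(set(item_ids)) == len(item_ids)
print(round_trip_ok, ids_unique)
```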

# Defining and populating an index from existing chunks

Read the existing chunks from the `data` folder.

In [34]:
import json
with open("data/volvo_chunks.json", "r") as f:
    chunks = json.load(f)

In [35]:
chunks

[{'text': "VOLVO C30Owners ManualWeb Edition\n\nDownloaded from www.Manualslib.com manuals search engine\n\nDownloaded from www.Manualslib.com manuals search engine\n\nDEAR VOLVO OWNERTHANK YOU FOR CHOOSING VOLVOWe hope you will enjoy many years of driving pleasure in your Volvo.The car has been designed for the safety and comfort of you and yourpassengers. Volvo is one of the safest cars in the world. Your Volvohas also been designed to satisfy all current safety and environmentalrequirements.In order to increase your enjoyment of the car, we recommend thatyou familiarise yourself with the equipment, instructions and mainte-nance information contained in this owner's manual.\n\nDownloaded from www.Manualslib.com manuals search engine\n\n4\n\nOption/accessory, for more information, see Introduction.\n\n00\n\n02\n\n01SafetySeatbelts...................................................18Airbag system...........................................21Airbags.......................................

In [23]:
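# Note: this cell presumably ran against a list of plain text strings from a
# different manual, not the list of dicts loaded from volvo_chunks.json above.
with_title = [{"text": c, "make": "mazda", "model": "3", "item_id": f"mazda_3:{i}"} for i, c in enumerate(chunks)]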

with open("data/mazda_chunks.json", "w") as f:
    json.dump(with_title, f)

Create vector embeddings of the chunk content.

In [37]:
import os
import warnings

# Disable tokenizer parallelism before the model is loaded to avoid fork warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"
warnings.filterwarnings("ignore")

from redisvl.utils.vectorize import HFTextVectorizer

hf = HFTextVectorizer("sentence-transformers/all-MiniLM-L6-v2")

# Embed each chunk content
embeddings = hf.embed_many([chunk["text"] for chunk in chunks])

# Check to make sure we've created enough embeddings, 1 per document chunk
len(embeddings) == len(chunks)

True
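Beyond the count, the length of each vector matters, because a vector field in a search index declares a fixed number of dimensions. `all-MiniLM-L6-v2` produces 384-dimensional embeddings. The sketch below uses fabricated vectors and a hypothetical `find_bad_dims` helper to show the kind of check that could be run on `embeddings` before loading. Whether the schema file used later declares 384 dimensions is an assumption, since that file is not shown here.

```python
def find_bad_dims(vectors, expected_dims):
    """Return the positions of vectors whose length differs from expected_dims."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dims]


EXPECTED_DIMS = 384  # output size of all-MiniLM-L6-v2

# Fabricated vectors for illustration; the last one is deliberately too short.
fake_embeddings = [[0.0] * 384, [0.1] * 384, [0.2] * 383]

bad = find_bad_dims(fake_embeddings, EXPECTED_DIMS)
print(bad)
```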

In [38]:
from redis import Redis
from redisvl.index import SearchIndex

REDIS_URL = "redis://localhost:6379/0"

# connect to redis
client = Redis.from_url(REDIS_URL)

# path to the schema file
path_to_yaml = "schema/index_schema.yaml"

# create an index from schema and the client
index = SearchIndex.from_yaml(path_to_yaml)
index.set_client(client)
index.create(overwrite=True, drop=True)
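The contents of `schema/index_schema.yaml` are not shown in this notebook. As a rough sketch, a schema for the records used here might look like the dictionary below, which mirrors the YAML layout RedisVL expects. The index name, key prefix, field types, algorithm, and distance metric are all assumptions; only the field names (`text`, `make`, `model`, `item_id`, `text_embedding`) come from the cells in this notebook, and 384 matches the output size of `all-MiniLM-L6-v2`.

```python
# Hypothetical schema; the real index_schema.yaml may differ.
schema = {
    "index": {
        "name": "car_manuals",       # assumed index name
        "prefix": "chunk",           # assumed key prefix
        "storage_type": "hash",      # vectors are stored as bytes in a HASH
    },
    "fields": [
        {"name": "item_id", "type": "tag"},
        {"name": "make", "type": "tag"},
        {"name": "model", "type": "tag"},
        {"name": "text", "type": "text"},
        {
            "name": "text_embedding",
            "type": "vector",
            "attrs": {
                "dims": 384,                  # all-MiniLM-L6-v2 output size
                "distance_metric": "cosine",  # assumed metric
                "algorithm": "flat",          # assumed algorithm
                "datatype": "float32",
            },
        },
    ],
}

# With RedisVL installed, a dict like this can be passed to
# SearchIndex.from_dict(schema) instead of loading a YAML file.
vector_field = next(f for f in schema["fields"] if f["type"] == "vector")
print(vector_field["attrs"]["dims"])
```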

Now that the index is created, we can load documents into it.

In [39]:
from redisvl.redis.utils import array_to_buffer

data = [
    {
        **chunk,
        # For HASH -- must convert embeddings to bytes
        'text_embedding': array_to_buffer(embeddings[i], dtype="float32")
    } for i, chunk in enumerate(chunks)
]

# RedisVL handles batching automatically
keys = index.load(data, id_field="item_id")
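The bytes conversion above exists because a Redis HASH field holds a flat byte string, so each vector is stored as its packed float32 values. The standard-library sketch below shows the idea with hypothetical helper names. It illustrates the byte layout only and is not RedisVL's implementation; it assumes little-endian byte order, which is what most machines use natively.

```python
import struct


def floats_to_float32_bytes(vec):
    """Pack a list of floats as little-endian float32 values (4 bytes each)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def float32_bytes_to_floats(buf):
    """Unpack a little-endian float32 byte string back into a list of floats."""
    return list(struct.unpack(f"<{len(buf) // 4}f", buf))


vec = [0.5, -1.25, 3.0]  # values chosen to be exactly representable in float32
buf = floats_to_float32_bytes(vec)
restored = float32_bytes_to_floats(buf)
print(len(buf), restored)
```

One consequence is that a 384-dimensional float32 embedding occupies 384 × 4 = 1536 bytes per document in the hash.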

Check the number of indexed documents with `index.info()`.

In [40]:
index.info()["num_docs"]

213

In [31]:
import redisvl
redisvl.__version__

'0.3.5'

In [121]:
path_to_yaml = "schema/index_json_schema.yaml"
# create an index from schema and the client
jindex = SearchIndex.from_yaml(path_to_yaml)
jindex.set_client(client)
jindex.create(overwrite=True, drop=True)

11:07:34 redisvl.index.index INFO   Index already exists, overwriting.


In [122]:
jdata = [
    {
        'chunk_id': str(i),
        'content': chunk,
        # For JSON -- embeddings stay as a list of floats (no bytes conversion)
        'text_embedding': embeddings[i]
    } for i, chunk in enumerate(chunks)
]

# RedisVL handles batching automatically
keys = jindex.load(jdata, id_field="chunk_id")
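The JSON index takes the embedding as a plain list of floats because the whole record has to be representable as a JSON document. The sketch below uses made-up values to show that a list of floats serializes cleanly while the bytes form used for HASH storage does not. If an embedding were a NumPy array rather than a list, calling `.tolist()` on it first would be needed for the same reason.

```python
import json

embedding = [0.5, -1.25, 3.0]  # made-up vector for illustration

# JSON-style record: the vector is a list of floats.
json_record = {"chunk_id": "0", "content": "some text", "text_embedding": embedding}
encoded = json.dumps(json_record)
decoded = json.loads(encoded)

# HASH-style record: the vector is raw bytes, which JSON cannot represent.
hash_style_record = {**json_record, "text_embedding": b"\x00\x00\x00?"}
try:
    json.dumps(hash_style_record)
    bytes_serializable = True
except TypeError:
    bytes_serializable = False

print(decoded["text_embedding"], bytes_serializable)
```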

In [125]:
jindex.info()["num_docs"]

251