![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# How to create and populate a vector search index

This notebook covers the basics of populating a vector index in Redis. If you are new to vector search or RAG with Redis and want more detail, see [Redis AI Resources](https://github.com/redis-developer/redis-ai-resources) for more recipes to get started.

## Creating chunks from a PDF

If you are starting from a PDF document that you want to make searchable, you can point LangChain at the file and break it into chunks.

<a href="https://colab.research.google.com/github/redis-applied-ai/retrieval-optimizer/blob/main/examples/getting_started/populate_index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install requirements

In [None]:
!pip install langchain langchain-community pypdf redisvl

## Grab data (if colab)

In [None]:
!git clone https://github.com/redis-applied-ai/retrieval-optimizer.git temp_repo
!mv temp_repo/examples/getting_started/data .
!mv temp_repo/examples/getting_started/schema .
!rm -rf temp_repo

## Use LangChain tools to process the PDF

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

path = "data/volvo_c30.pdf" # path to pdf or other type of file to load

# set up the file loader/extractor and text splitter to create chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500, chunk_overlap=0
)

# load the PDF pages and split them with the splitter configured above
loader = PyPDFLoader(path)
pages = loader.load()
chunks = text_splitter.split_documents(pages)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", path)

Done preprocessing. Created 354 chunks of the original pdf data/volvo_c30.pdf
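
To make the two splitter parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks); `split_fixed` is a hypothetical helper that only illustrates what `chunk_size` and `chunk_overlap` mean.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Hypothetical helper: cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts (chunk_size - chunk_overlap) characters after the last one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = split_fixed(sample, chunk_size=4)
with_overlap = split_fixed(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `chunk_overlap=0`, as in the cell above, no text is repeated between chunks. A non-zero overlap repeats some text at chunk boundaries, which can help when a relevant passage straddles two chunks.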


In [10]:
chunks[0]

Document(metadata={'source': 'data/volvo_c30.pdf', 'page': 0, 'page_label': '1'}, page_content='VOLVO C30\nOwners Manual Web Edition\nDownloaded from www.Manualslib.com  manuals search engine')

## (Optional) Save the chunks to a file

In [11]:
import json

output_file = "data/volvo_chunks.json"

with open(output_file, "w") as f:
    json_chunks = [
        {
            "text": chunk.page_content,
            "make": "volvo",
            "model": "c30",
            "item_id": f"volvo_c30:{i}"
        } for i, chunk in enumerate(chunks)
    ]

    json.dump(json_chunks, f)
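
The records written above carry `text`, `make`, `model`, and `item_id`, while the loading cell later in this notebook also reads `chunk["query_metadata"]`. If you plan to load your own chunk file instead of the provided one, a quick key check can catch a mismatch early. The sketch below is only an illustration: `find_incomplete` and `REQUIRED_KEYS` are hypothetical names, and what `query_metadata` should contain depends on your schema.

```python
# Keys that the loading cell in this notebook reads from every chunk record.
REQUIRED_KEYS = {"item_id", "text", "query_metadata"}


def find_incomplete(records, required=REQUIRED_KEYS):
    """Hypothetical helper: map record position -> sorted list of missing keys."""
    problems = {}
    for i, record in enumerate(records):
        missing = required - set(record)
        if missing:
            problems[i] = sorted(missing)
    return problems


# Made-up records for illustration only.
records = [
    {"item_id": "volvo_c30:0", "text": "VOLVO C30", "query_metadata": {}},
    {"item_id": "volvo_c30:1", "text": "Owners Manual", "make": "volvo", "model": "c30"},
]
problems = find_incomplete(records)
print(problems)  # {1: ['query_metadata']}
```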

# Defining and populating an index from existing chunks

Read the existing chunks from the data folder.

In [12]:
import json
with open("data/combined_chunks.json", "r") as f:
    chunks = json.load(f)

In [13]:
chunks[0]

{'text': "Mazda3_8Y64-EA-08A_Edition1 Page1 Tuesday, November 27 2007 9:0 AM\n\nForm No.8Y64-EA-08A\n\nBlack plate (1,1)\n\nMazda3_8Y64-EA-08A_Edition1 Page2 Tuesday, November 27 2007 9:0 AM\n\nForm No.8Y64-EA-08A\n\nBlack plate (2,1)\n\nMazda3_8Y64-EA-08A_Edition1 Page3 Tuesday, November 27 2007 9:0 AM\n\nBlack plate (3,1)\n\nA Word to Mazda Owners\n\nThank you for choosing a Mazda. We at Mazda design and build vehicles with complete customer satisfaction in mind.\n\nTo help ensure enjoyable and trouble-free operation of your Mazda, read this manual carefully and follow its recommendations.\n\nAn Authorized Mazda Dealer knows your vehicle best. So when maintenance or service is necessary, that's the place to go.\n\nOur nationwide network of Mazda professionals is dedicated to providing you with the best possible service.\n\nWe assure you that all of us at Mazda have an ongoing interest in your motoring pleasure and in your full satisfaction with your Mazda product.\n\nMazda Motor Corp

## Create vector embeddings of chunk content

In [14]:
import os
import warnings

warnings.filterwarnings("ignore")

from redisvl.utils.vectorize import HFTextVectorizer

hf = HFTextVectorizer("sentence-transformers/all-MiniLM-L6-v2")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Embed each chunk content
embeddings = hf.embed_many([chunk["text"] for chunk in chunks])

# Check to make sure we've created enough embeddings, 1 per document chunk
len(embeddings) == len(chunks)

True
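
A slightly stricter check also confirms that every vector has the dimensionality the index expects. The sketch below uses made-up vectors so it runs on its own; `check_embeddings` is a hypothetical helper, and 384 is the output size of `all-MiniLM-L6-v2`.

```python
def check_embeddings(embeddings, n_chunks, dims):
    """Hypothetical helper: raise if the count or any vector length is off."""
    if len(embeddings) != n_chunks:
        raise ValueError(f"expected {n_chunks} embeddings, got {len(embeddings)}")
    for i, vector in enumerate(embeddings):
        if len(vector) != dims:
            raise ValueError(f"embedding {i} has {len(vector)} dims, expected {dims}")
    return True


MINILM_DIMS = 384  # output dimensionality of all-MiniLM-L6-v2

# Stand-in vectors; in the notebook these would come from hf.embed_many(...).
fake_embeddings = [[0.0] * MINILM_DIMS for _ in range(3)]
ok = check_embeddings(fake_embeddings, n_chunks=3, dims=MINILM_DIMS)
print(ok)  # True
```

In this notebook the call would be `check_embeddings(embeddings, len(chunks), 384)`.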

## Run a Redis instance
Later in this tutorial, Redis will be used to store, index, and query vector
embeddings created from PDF document chunks. **We need to make sure we have a Redis
instance available.**

#### For Colab
Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly
from the Redis package archive.

In [None]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

#### For Alternative Environments
There are many ways to get the necessary redis-stack instance running:
1. In the cloud, deploy a [FREE instance of Redis in the cloud](https://redis.com/try-free/). Or, if you have your
own version of Redis Enterprise running, that works too!
2. Per OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)
3. With Docker: `docker run -d --name redis -p 6379:6379 -p 8001:8001 redis/redis-stack:latest`

In [15]:
from redis import Redis
from redisvl.index import SearchIndex

REDIS_URL = "redis://localhost:6379/0"

# connect to redis
client = Redis.from_url(REDIS_URL)

# path to the schema file
path_to_yaml = "schema/mazda_schema.yaml"

# create an index from schema and the client
index = SearchIndex.from_yaml(path_to_yaml)
index.set_client(client)
index.create(overwrite=True, drop=True)
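
The contents of `schema/mazda_schema.yaml` are not shown in this notebook. As a rough illustration only, a RedisVL schema for this kind of data could look like the dictionary below. The index name, key prefix, distance metric, and vector algorithm are assumptions; the field names `item_id`, `text`, and `text_embedding` are the ones used by the loading cell; hash storage follows the "For HASH" comment in that cell; and 384 matches the output size of `all-MiniLM-L6-v2`.

```python
# Illustrative schema, NOT the contents of schema/mazda_schema.yaml.
example_schema = {
    "index": {
        "name": "example_manuals",   # assumed index name
        "prefix": "example_chunk",   # assumed key prefix
        "storage_type": "hash",
    },
    "fields": [
        {"name": "item_id", "type": "tag"},
        {"name": "text", "type": "text"},
        {
            "name": "text_embedding",
            "type": "vector",
            "attrs": {
                "dims": 384,                  # must equal the embedding length
                "distance_metric": "cosine",  # assumed metric
                "algorithm": "flat",          # assumed algorithm
                "datatype": "float32",
            },
        },
    ],
}

# Sanity check: exactly one vector field, with the expected dimensionality.
vector_fields = [f for f in example_schema["fields"] if f["type"] == "vector"]
print(vector_fields[0]["attrs"]["dims"])  # 384
```

A dictionary of this shape can be passed to `SearchIndex.from_dict` in place of `from_yaml`. Any metadata fields you want to filter on would need their own entries under `fields`.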

Now that the index is created, we can load documents into it.

In [16]:
from redisvl.redis.utils import array_to_buffer

data = [
    {
        "item_id": chunk["item_id"],
        "text": chunk["text"],
        **chunk["query_metadata"],
        # For HASH -- must convert embeddings to bytes
        'text_embedding': array_to_buffer(embeddings[i], dtype="float32")
    } for i, chunk in enumerate(chunks)
]

# RedisVL handles batching automatically
keys = index.load(data, id_field="item_id")
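
Redis hashes store field values as byte strings, which is why each embedding is converted to bytes before loading. The sketch below shows what a float32 byte layout looks like using only the standard library. It is an illustration, not RedisVL's implementation of `array_to_buffer`; the helper names are hypothetical, and little-endian byte order is an assumption that matches most common hardware.

```python
import struct


def floats_to_float32_bytes(vector):
    """Hypothetical helper: pack floats as consecutive little-endian float32 values."""
    return struct.pack(f"<{len(vector)}f", *vector)


def float32_bytes_to_floats(buffer):
    """Hypothetical helper: inverse of floats_to_float32_bytes."""
    count = len(buffer) // 4  # each float32 occupies 4 bytes
    return list(struct.unpack(f"<{count}f", buffer))


vector = [0.5, -1.0, 2.25]  # values chosen to be exactly representable in float32
packed = floats_to_float32_bytes(vector)
print(len(packed))                      # 12
print(float32_bytes_to_floats(packed))  # [0.5, -1.0, 2.25]
```

A 384-dimensional float32 embedding therefore occupies 1536 bytes in the hash field.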

Check `index.info()` to see whether the documents were loaded successfully.

In [17]:
index.info()["num_docs"]

464

## You now have a vector index set up!

You can inspect your data and see the populated fields with [RedisInsight](https://redis.io/insight/). If you started redis-stack with the Docker command above, it is already running on localhost:8001.

![r_insight](../../images/r_insight.png)