# RAG2: Working with a local vector database

In the previous chapter we used FAISS as an in-memory vector database. In this chapter we take things further:
* Instead of plain text strings, we load structured data into the database
* Instead of an in-memory index, we use an installed vector database: Weaviate

## Running Weaviate as a Docker container

For LLM generative use cases, we need to run Weaviate with a text2vec vectorizer module enabled (this chapter uses `text2vec-openai`).
To do that, use the docker-compose.yaml file in this directory.

Start the containers with this file:
```
docker compose up -d
```
([Installation directions for Docker compose plugin are here](https://docs.docker.com/compose/install/linux/#install-the-plugin-manually))
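
Once the containers are up, it can help to confirm that Weaviate is accepting requests before moving on. The sketch below polls Weaviate's readiness endpoint using only the standard library. It assumes the default local address `http://localhost:8080`; the function name `weaviate_is_ready` and its `opener` parameter are illustrative choices, not part of any Weaviate API.

```python
import urllib.error
import urllib.request


def weaviate_is_ready(base_url="http://localhost:8080",
                      opener=urllib.request.urlopen,
                      timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    url = f"{base_url}/v1/.well-known/ready"
    try:
        # `opener` is injectable so the function can be exercised without a server
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, ...
        return False


if __name__ == "__main__":
    print("Weaviate ready:", weaviate_is_ready())
```

If this prints `False` right after `docker compose up -d`, waiting a few seconds and trying again is usually enough, since the container may still be starting.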

Next we need to install the Python client:

In [2]:
%pip install weaviate-client

Defaulting to user installation because normal site-packages is not writeable
Collecting protobuf<6.0dev,>=5.26.1 (from grpcio-health-checking<2.0.0,>=1.57.0->weaviate-client)
  Using cached protobuf-5.28.2-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Using cached protobuf-5.28.2-cp38-abi3-manylinux2014_x86_64.whl (316 kB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 4.25.5
    Uninstalling protobuf-4.25.5:
      Successfully uninstalled protobuf-4.25.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-ai-generativelanguage 0.6.6 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 5.28.2 which is incompatible.
Successfully installed protobuf-5.28.2
Note: you may need to restart the kernel to use updated packages.

Check our installation:

In [3]:
%pip show weaviate-client

Name: weaviate-client
Version: 4.9.0
Summary: A python native Weaviate client
Home-page: https://github.com/weaviate/weaviate-python-client
Author: Weaviate
Author-email: hello@weaviate.io,
License: BSD 3-clause
Location: /home/yuvalzukerman/.local/lib/python3.12/site-packages
Requires: authlib, grpcio, grpcio-health-checking, grpcio-tools, httpx, pydantic, requests, validators
Required-by: 
Note: you may need to restart the kernel to use updated packages.


Let's connect to our Weaviate instance running in Docker:

In [4]:
import weaviate
import os

headers = {
    "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
}  # The OpenAI API key is read from the OPENAI_API_KEY environment variable

client = weaviate.connect_to_local(headers=headers)
client.close()
print("Successfully connected to Weaviate! Hooray!")

Successfully connected to Weaviate! Hooray!


In [2]:
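import json

# Reconnect first: the previous cell closed the client
client = weaviate.connect_to_local(headers=headers)
try:
    metainfo = client.get_meta()
    print(json.dumps(metainfo, indent=2))
finally:
    client.close()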

{
  "hostname": "http://[::]:8080",
  "modules": {
    "text2vec-openai": {
      "documentationHref": "https://platform.openai.com/docs/guides/embeddings/what-are-embeddings",
      "name": "OpenAI Module"
    }
  },
  "version": "1.27.0"
}


### Populate the vector database

Following the Weaviate demo, we will create a collection, which plays a role similar to a table schema in a relational database. The collection will hold open source movie data.

In [5]:
import weaviate.classes.config as wc

client = weaviate.connect_to_local(headers=headers)
client.collections.create(
    name="Movie",
    properties=[
        wc.Property(name="title", data_type=wc.DataType.TEXT),
        wc.Property(name="overview", data_type=wc.DataType.TEXT),
        wc.Property(name="vote_average", data_type=wc.DataType.NUMBER),
        wc.Property(name="genre_ids", data_type=wc.DataType.INT_ARRAY),
        wc.Property(name="release_date", data_type=wc.DataType.DATE),
        wc.Property(name="tmdb_id", data_type=wc.DataType.INT),
    ],
    # Define the vectorizer module
    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(),
    # Define the generative module
    generative_config=wc.Configure.Generative.openai()
)

client.close()
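
Creating a collection whose name already exists is likely to fail, which gets in the way when a notebook cell is re-run. A minimal sketch of a guard, assuming the v4 client's `client.collections.exists`, `client.collections.get` and `client.collections.create` calls; the helper name `get_or_create_collection` is hypothetical:

```python
def get_or_create_collection(client, name, **create_kwargs):
    """Return the named collection, creating it only if it does not exist yet.

    `create_kwargs` are passed through unchanged to `client.collections.create`
    (for example `properties=...` and `vectorizer_config=...`).
    """
    if client.collections.exists(name):
        return client.collections.get(name)
    return client.collections.create(name=name, **create_kwargs)
```

It would be called in place of `client.collections.create(...)` with the same keyword arguments, for example `get_or_create_collection(client, "Movie", properties=[...])`.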

With the schema set up, let's load up some data.

In [7]:
# Download movie data
import pandas as pd
import requests
import json

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())

df.head()

| | backdrop_path | genre_ids | id | original_language | original_title | overview | popularity | poster_path | release_date | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /3Nn5BOM1EVw1IYrv6MsbOS6N1Ol.jpg | [14, 18, 10749] | 162 | en | Edward Scissorhands | A small suburban town receives a visit from a ... | 45.694 | /1RFIbuW9Z3eN9Oxw2KaQG5DfLmD.jpg | 1990-12-07 | Edward Scissorhands | False | 7.7 | 12305 |
| 1 | /sw7mordbZxgITU877yTpZCud90M.jpg | [18, 80] | 769 | en | GoodFellas | The true story of Henry Hill, a half-Irish, ha... | 57.228 | /aKuFiU82s5ISJpGZp7YkIr3kCUd.jpg | 1990-09-12 | GoodFellas | False | 8.5 | 12106 |
| 2 | /6uLhSLXzB1ooJ3522ydrBZ2Hh0W.jpg | [35, 10751] | 771 | en | Home Alone | Eight-year-old Kevin McCallister makes the mos... | 3.538 | /onTSipZ8R3bliBdKfPtsDuHTdlL.jpg | 1990-11-16 | Home Alone | False | 7.4 | 10599 |
| 3 | /vKp3NvqBkcjHkCHSGi6EbcP7g4J.jpg | [12, 35, 878] | 196 | en | Back to the Future Part III | The final installment of the Back to the Futur... | 28.896 | /crzoVQnMzIrRfHtQw0tLBirNfVg.jpg | 1990-05-25 | Back to the Future Part III | False | 7.5 | 9918 |
| 4 | /3tuWpnCTe14zZZPt6sI1W9ByOXx.jpg | [35, 10749] | 114 | en | Pretty Woman | When a millionaire wheeler-dealer enters a bus... | 97.953 | /hVHUfT801LQATGd26VPzhorIYza.jpg | 1990-03-23 | Pretty Woman | False | 7.5 | 7671 |


#### Inserting individual rows into the database

In [9]:
import json
import os
from datetime import datetime, timezone

import weaviate
from weaviate.util import generate_uuid5


headers = {
    "X-OpenAI-Api-Key": os.environ.get("OPENAI_API_KEY")
}  # The OpenAI API key is read from the OPENAI_API_KEY environment variable

# Connect to the database
client = weaviate.connect_to_local(headers=headers)
try:
    movies = client.collections.get("Movie")

    # Pick a single row; index 1 is the second row of the DataFrame (GoodFellas)
    movie = df.iloc[1]

    # Convert the date string to a timezone-aware datetime
    release_date = datetime.strptime(movie["release_date"], "%Y-%m-%d").replace(
        tzinfo=timezone.utc
    )
    # genre_ids arrives as a JSON-encoded string, so parse it into a list of integers
    genre_ids = json.loads(movie["genre_ids"])
    # Note: int() truncates the rating (8.5 becomes 8)
    vote_avg = int(movie["vote_average"])
    tmdb_id = int(movie["id"])

    movie_obj = {
        "title": movie["title"],
        "overview": movie["overview"],
        "vote_average": vote_avg,
        "genre_ids": genre_ids,
        "release_date": release_date,
        "tmdb_id": tmdb_id,
    }

    # No explicit uuid is passed, so the object gets a generated one
    uuid = movies.data.insert(
        properties=movie_obj,
        # uuid=generate_uuid5(movie_obj),
    )

    print(uuid)
finally:
    client.close()

dd378331-78b1-4df2-878f-c0140b977fe8
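
The type conversions in the cell above can be pulled into a small helper so the same logic is reused for every row. This is a sketch rather than the notebook's own code: the name `movie_row_to_properties` is hypothetical, the sample row copies the GoodFellas values from the DataFrame preview, and it uses `float()` for `vote_average` (the notebook uses `int()`, which drops the fractional part).

```python
import json
from datetime import datetime, timezone


def movie_row_to_properties(row):
    """Convert one raw movie row (a mapping of strings/numbers) into typed properties."""
    release_date = datetime.strptime(row["release_date"], "%Y-%m-%d").replace(
        tzinfo=timezone.utc  # make the date timezone-aware
    )
    return {
        "title": row["title"],
        "overview": row["overview"],
        "vote_average": float(row["vote_average"]),  # keeps 8.5 as 8.5
        "genre_ids": json.loads(row["genre_ids"]),   # "[18, 80]" -> [18, 80]
        "release_date": release_date,
        "tmdb_id": int(row["id"]),
    }


sample_row = {
    "title": "GoodFellas",
    "overview": "The true story of Henry Hill, a half-Irish, ha...",
    "vote_average": 8.5,
    "genre_ids": "[18, 80]",
    "release_date": "1990-09-12",
    "id": 769,
}

props = movie_row_to_properties(sample_row)
print(props["genre_ids"])                 # [18, 80]
print(props["vote_average"])              # 8.5
print(props["release_date"].isoformat())  # 1990-09-12T00:00:00+00:00
```

Because a pandas row supports the same `row["column"]` access, the helper should work unchanged on `df.iloc[1]`.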


#### Batch data loading

In [10]:
import weaviate
from datetime import datetime, timezone

from weaviate.util import generate_uuid5
from tqdm import tqdm
import os

try:    
    # connect to database
    client = weaviate.connect_to_local(headers=headers)
           
    # Get the collection
    movies = client.collections.get("Movie")
    
    # Enter context manager
    with movies.batch.dynamic() as batch:
        # Loop through the data
        for i, movie in tqdm(df.iterrows(), total=len(df)):
            # Convert data types
            # Convert a JSON date to `datetime` and add time zone information
            release_date = datetime.strptime(movie["release_date"], "%Y-%m-%d").replace(
                tzinfo=timezone.utc
            )
            # Convert a JSON array to a list of integers
            genre_ids = json.loads(movie["genre_ids"])

            vote_avg = int(movie["vote_average"])
            tmdb_id = int(movie["id"])
    
            # Build the object payload
            movie_obj = {
                "title": movie["title"],
                "overview": movie["overview"],
                "vote_average": vote_avg,
                "genre_ids": genre_ids,
                "release_date": release_date,
                "tmdb_id": tmdb_id,
            }

            seed = movie["title"] + str(movie["id"])
    
            # Add object to batch queue
            batch.add_object(
                properties=movie_obj,
                uuid=generate_uuid5(seed)
                # references=reference_obj  # You can add references here
            )
            # Batcher automatically sends batches
    
    # Check for failed objects
    if len(movies.batch.failed_objects) > 0:
        print(f"Failed to import {len(movies.batch.failed_objects)} objects")
finally:
    client.close()

100%|████████████████████████████████████████| 680/680 [00:04<00:00, 168.97it/s]
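
The batch loader passes `generate_uuid5(seed)` so that each movie gets an ID derived from its title and TMDB id instead of a random one. The sketch below illustrates the idea with the standard library's `uuid.uuid5`. The `NAMESPACE_DNS` namespace and the helper name `stable_movie_id` are assumptions made for illustration, so the values it produces are not claimed to match those of `generate_uuid5`.

```python
import uuid


def stable_movie_id(title, tmdb_id):
    """Derive a deterministic version-5 UUID from a title and a TMDB id."""
    seed = title + str(tmdb_id)
    return uuid.uuid5(uuid.NAMESPACE_DNS, seed)


first = stable_movie_id("GoodFellas", 769)
second = stable_movie_id("GoodFellas", 769)

print(first == second)        # True: the same seed always gives the same UUID
print(first.version)          # 5
print(uuid.uuid4().version)   # 4: random UUIDs differ on every call
```

The version is visible in the first digit of the third group of a UUID. In this notebook's outputs the batch-loaded objects carry a `5` there (for example `da7e6b77-f10a-5d9e-...`), while the individually inserted object carries a `4` (`dd378331-78b1-4df2-...`).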


## Searching

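#### Semantic search
This search ranks results by the vector distance between the query and the stored objects, as opposed to a keyword relevance algorithm.
The `near_text` method selects this search type; to return each result's distance, request it in the metadata:
```python
return_metadata=wq.MetadataQuery(distance=True)
```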

In [13]:
import weaviate
import weaviate.classes.query as wq
import os


headers = {
    "X-OpenAI-Api-Key": os.environ.get("OPENAI_API_KEY")
}  # The OpenAI API key is read from the OPENAI_API_KEY environment variable

try:    
    # connect to database
    client = weaviate.connect_to_local(headers=headers)
    movies = client.collections.get("Movie")
    
    
    # Perform query
    response = movies.query.near_text(
        query="dystopian future", 
        limit=5, # maximum number of results
        return_metadata=wq.MetadataQuery(distance=True)
    )
    
    # Inspect the response
    for o in response.objects:
        print(
            o.properties["title"], o.properties["release_date"].year, o.uuid
        )  # Print the title and release year (note the release date is a datetime object)
        print(
            f"Distance to query: {o.metadata.distance:.3f}\n"
        )  # Print the distance of the object from the query

finally:
    client.close()

Gattaca 1997 da7e6b77-f10a-5d9e-b86d-d1152c2825a7
Distance to query: 0.187

Mad Max: Fury Road 2015 9e48bc0d-deb6-5856-96ee-a159362f3292
Distance to query: 0.189

In Time 2011 56bf229a-2e15-5c5e-b740-a5cd35b644a9
Distance to query: 0.190

I, Robot 2004 7920b130-0163-5a3f-af88-3b486b5a74d6
Distance to query: 0.193

Children of Men 2006 6e07b865-2d2a-5029-a140-e46b60a51d19
Distance to query: 0.196
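
To make the "distance to query" values more concrete, here is a small pure-Python sketch of cosine distance, one common vector distance metric. Treating it as the metric behind the numbers above is an assumption, and the two-dimensional vectors are toy values made up for illustration (real embeddings have far more dimensions).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
toy_vectors = {
    "same direction": [2.0, 0.0],
    "somewhat related": [1.0, 1.0],
    "unrelated": [0.0, 3.0],
}

# Smaller distance means a closer match, so sort ascending
ranked = sorted(toy_vectors, key=lambda name: cosine_distance(query, toy_vectors[name]))
for name in ranked:
    print(f"{name}: {cosine_distance(query, toy_vectors[name]):.3f}")
# same direction: 0.000
# somewhat related: 0.293
# unrelated: 1.000
```

As in the search results above, the best match is the one with the smallest distance.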



#### Hybrid search
This combines keyword (BM25) search with semantic/vector search.
The `hybrid` method selects this search type; results carry a combined score rather than a distance, which you request with:
```python
return_metadata=wq.MetadataQuery(score=True)
```

In [14]:
import weaviate
import weaviate.classes.query as wq
import os


headers = {
    "X-OpenAI-Api-Key": os.environ.get("OPENAI_API_KEY")
}  # The OpenAI API key is read from the OPENAI_API_KEY environment variable

try:    
    # connect to database
    client = weaviate.connect_to_local(headers=headers)
    movies = client.collections.get("Movie")

    # Perform query
    response = movies.query.hybrid(
        query="history", limit=5, return_metadata=wq.MetadataQuery(score=True)
    )
    
    # Inspect the response
    for o in response.objects:
        print(
            o.properties["title"], o.properties["release_date"].year, o.uuid
        )  # Print the title and release year (note the release date is a datetime object)
        print(
            f"Hybrid score: {o.metadata.score:.3f}\n"
        )  # Print the hybrid search score of the object from the query

finally:
    client.close()

Legends of the Fall 1994 3d241a94-119e-5a4c-a025-9f52a5a21dae
Hybrid score: 0.822

Hacksaw Ridge 2016 1dbdb868-d8dd-5887-943f-037bea19b953
Hybrid score: 0.617

The Butterfly Effect 2004 95bd7d50-9222-5192-81cf-764e57ad9536
Hybrid score: 0.566

A Beautiful Mind 2001 1d8a3a9d-a527-5af3-b64c-9bdde4f40ba1
Hybrid score: 0.559

Forrest Gump 1994 d4d8da98-b202-5067-aac4-533cbdc63859
Hybrid score: 0.541
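
One way to picture how a hybrid score can arise is a weighted blend of a keyword score and a vector score. The sketch below is a simplified illustration and not Weaviate's actual fusion algorithm: it min-max normalizes each score set and mixes them with a weight `alpha`, where `alpha=0` means keyword only and `alpha=1` means vector only. The function names and the toy scores are made up for the example.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_scores(keyword, vector, alpha=0.5):
    """Blend normalized keyword and vector scores; higher is better for both inputs."""
    kw, vec = min_max(keyword), min_max(vector)
    ids = set(kw) | set(vec)
    return {i: alpha * vec.get(i, 0.0) + (1 - alpha) * kw.get(i, 0.0) for i in ids}


def ranking(scores):
    """Return ids ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


keyword_scores = {"doc_a": 3.0, "doc_b": 1.0, "doc_c": 2.0}  # toy keyword scores
vector_scores = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.5}   # toy similarities

print(ranking(hybrid_scores(keyword_scores, vector_scores, alpha=0.0)))
# ['doc_a', 'doc_c', 'doc_b']  (keyword order)
print(ranking(hybrid_scores(keyword_scores, vector_scores, alpha=1.0)))
# ['doc_b', 'doc_c', 'doc_a']  (vector order)
```

In the Weaviate client, the `hybrid` query accepts an `alpha` argument with a similar role, so it is worth trying a few values to see how the ranking for a query such as "history" shifts.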



#### Search Filters

In [6]:
from datetime import datetime, timezone

try:
    # connect to database
    client = weaviate.connect_to_local(headers=headers)
    movies = client.collections.get("Movie")

    # Perform query
    response = movies.query.near_text(
        query="dystopian future",
        limit=5,
        return_metadata=wq.MetadataQuery(distance=True),
        # Use a timezone-aware datetime so the comparison with release_date is unambiguous
        filters=wq.Filter.by_property("release_date").greater_than(
            datetime(2020, 1, 1, tzinfo=timezone.utc)
        ),
    )

    # Inspect the response
    for o in response.objects:
        print(
            o.properties["title"], o.properties["release_date"].year
        )  # Print the title and release year (note the release date is a datetime object)
        print(
            f"Distance to query: {o.metadata.distance:.3f}\n"
        )  # Print the distance of the object from the query

finally:
    client.close()

Dune 2021
Distance to query: 0.197

Dune 2021
Distance to query: 0.197

Greenland 2020
Distance to query: 0.210

Greenland 2020
Distance to query: 0.210

Tenet 2020
Distance to query: 0.218
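
Date filters compare against a point in time, so it helps to be explicit about the timezone of the value passed to the filter. The following standard-library sketch shows the difference between a naive and a timezone-aware `datetime`, and how to express a different offset:

```python
from datetime import datetime, timedelta, timezone

naive = datetime(2020, 1, 1)                           # no timezone attached
aware_utc = datetime(2020, 1, 1, tzinfo=timezone.utc)  # explicit UTC

print(naive.tzinfo is None)   # True
print(aware_utc.isoformat())  # 2020-01-01T00:00:00+00:00

# A fixed offset of two hours behind UTC
minus_two = datetime(2021, 1, 1, tzinfo=timezone(-timedelta(hours=2)))
print(minus_two.isoformat())  # 2021-01-01T00:00:00-02:00

# Python refuses to order naive and aware values against each other
try:
    naive < aware_utc
except TypeError as exc:
    print("Cannot compare:", exc)
```

Since the objects were loaded with `tzinfo=timezone.utc`, passing a UTC-aware value to the filter keeps both sides of the comparison in the same frame of reference.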



### CRUD Operations
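As with any database, you access individual objects via their ID. Let's find one of the movies in the database and update it.
The first step is to get the movie's UUID, so let's query for it. The keyword query below returns two GoodFellas objects, because the movie was inserted once individually and once more by the batch load.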

In [15]:
import weaviate
import weaviate.classes.query as wq
import os


headers = {
    "X-OpenAI-Api-Key": os.environ.get("OPENAI_API_KEY")
}  # The OpenAI API key is read from the OPENAI_API_KEY environment variable

try:
    # connect to database
    client = weaviate.connect_to_local(headers=headers)
    movies = client.collections.get("Movie")

    # Keyword (BM25) query for the title
    movie_uuid = ""
    response = movies.query.bm25(query="GoodFellas")

    # Inspect the response
    for o in response.objects:
        movie_uuid = str(o.uuid)  # keeps the UUID of the last result
        print(o.uuid)
        print(o.properties["title"])
        print(o.properties["vote_average"])
        print("\n")

    # Fetch the object together with its vector
    data_object = movies.query.fetch_object_by_id(movie_uuid, include_vector=True)

    # print(data_object.vector["default"])

    ######
    # Update
    ######
    movies.data.update(
        uuid=movie_uuid,
        properties={
            "vote_average": 33.92,
        },
    )

    # Check whether the update took place
    response = movies.query.fetch_object_by_id(movie_uuid)

    # Inspect the response
    print(response.properties)

finally:
    client.close()

eb90287c-c1aa-5a81-8071-936ee73df166
GoodFellas
8.0


dd378331-78b1-4df2-878f-c0140b977fe8
GoodFellas
8.0


{'overview': 'The true story of Henry Hill, a half-Irish, half-Sicilian Brooklyn kid who is adopted by neighbourhood gangsters at an early age and climbs the ranks of a Mafia family under the guidance of Jimmy Conway.', 'title': 'GoodFellas', 'release_date': datetime.datetime(1990, 9, 12, 0, 0, tzinfo=datetime.timezone.utc), 'tmdb_id': 769, 'vote_average': 33.92, 'genre_ids': [18, 80]}
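
The cell above stops at the update step. Deleting by UUID could look like the sketch below, which assumes the v4 client's `collection.data.delete_by_id` and `collection.data.exists` calls; the helper name `delete_movie` is hypothetical. It takes the collection as an argument, so it can be called with the `movies` object while the client is still open.

```python
def delete_movie(collection, movie_uuid):
    """Delete one object by UUID and report whether it is gone afterwards."""
    collection.data.delete_by_id(movie_uuid)
    # Verify instead of trusting the call blindly
    return not collection.data.exists(movie_uuid)
```

For example, `delete_movie(movies, movie_uuid)` placed inside the `try` block would remove the GoodFellas object that was just updated, leaving the other copy in place.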
