# Ungraded Lab - Introduction to Weaviate API


<div align="center">
  <img src="images/weaviate.png" alt="RAG Overview" width="10%">
</div>

Welcome to the ungraded lab about the Weaviate API! As you dive into the world of vector databases, you'll discover that there are several options available to help you build your RAG systems. In this course, you'll focus on the [Weaviate API](https://weaviate.io/).

This lab is designed to give you a hands-on introduction to Weaviate, so you'll be well-prepared for the upcoming assignment. You'll explore how Weaviate functions, what it can do, and how to make the most of its features. By the time you reach the assignment, you'll have the tools and knowledge you need to succeed.

Let's go!


# Table of Contents
- [ 1 - Introduction](#1)
  - [ 1.1 Loading the necessary libraries](#1-1)
  - [ 1.2 - The Weaviate Client](#1-2)
- [ 2 - Configuring the database](#2)
  - [ 2.1 Creating a Collection](#2-1)
  - [ 2.2 Configuring the Vectorizer](#2-2)
  - [ 2.3 The Properties](#2-3)
  - [ 2.4 Adding elements into a Collection](#2-4)
- [ 3 - Querying on a collection](#3)
  - [ 3.1 Filters](#3-1)
  - [ 3.2 Semantic Search](#3-2)
  - [ 3.3 BM25 search](#3-3)
  - [ 3.4 Hybrid Search](#3-4)
  - [ 3.5 Reranking](#3-5)


---
<h4 style="color:black; font-weight:bold;">USING THE TABLE OF CONTENTS</h4>

JupyterLab provides an easy way for you to navigate through this lab. The Table of Contents tab is located in the left panel, as shown in the picture below.

![TOC Location](images/toc.png)

---

<a id='1'></a>
## 1 - Introduction
---
<a id='1-1'></a>
### 1.1 Loading the necessary libraries
Run the cell below to load the necessary libraries for this lab.

In [1]:
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import Filter
from typing import List
from tqdm import tqdm
import joblib
import weaviate
import re
from weaviate.util import generate_uuid5
from pprint import pprint
import os

In [2]:
import flask_app
from utils import (
    suppress_subprocess_output, 
    generate_with_single_input, 
    print_object_properties
)

 * Serving Flask app 'flask_app'
 * Debug mode: off


You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


<a id='1-2'></a>
### 1.2 - The Weaviate Client

To start working with the Weaviate API in this environment, you need to create a `client`. In this course, you will use an *embedded client*, which runs Weaviate inside this application instead of relying on a separate, stand-alone Weaviate instance.

When you start `Embedded Weaviate` for the first time, it creates a data storage file at the location you specify in `persistence_data_path`. Even after you close your client and Embedded Weaviate shuts down, your data will still be saved there. 

When creating your Weaviate client, you must pass an embedding model to perform vectorization. You can pass different models and Weaviate has different modules to help you call `OpenAI` models and others. Since OpenAI is a paid system, we will use a local model to perform the vectorization. 

For example, a client that can call OpenAI models is created with this call:


```Python
import weaviate
client = weaviate.connect_to_embedded(
    version="1.26.1",
    headers={
        "X-OpenAI-Api-Key": YOUR_OPENAI_API_KEY
    },
)
```

Let's load our client! 

The `suppress_subprocess_output()` context manager suppresses the Weaviate output logs, which would otherwise clutter the lab. These logs won't be explored in this lab, but feel free to remove it if you are curious about what they look like! The arguments passed to `weaviate.connect_to_embedded` are:

- `persistence_data_path`: The path where the client will look for (and create) the vector databases. Once you create a database, it is stored there and persisted, i.e., it won't be deleted once you close the client!
- `environment_variables`: Variables that must be passed to make the local embedding server work.

In [3]:
with suppress_subprocess_output():
    client = weaviate.connect_to_embedded(
        persistence_data_path="./.collections",
        environment_variables={
            "ENABLE_API_BASED_MODULES": "true", # Enable API based modules 
            "ENABLE_MODULES": 'text2vec-transformers, reranker-transformers', # We will be using a transformer model
            "TRANSFORMERS_INFERENCE_API":"http://127.0.0.1:5000/", # The endpoint the weaviate API will be using to vectorize
            "RERANKER_INFERENCE_API":"http://127.0.0.1:5000/" # The endpoint the weaviate API will be using to rerank
        }
    )

With the defined `client`, your primary usage is creating a collection, adding elements to it and querying over it.

<a id='2'></a>
## 2 - Configuring the database

---

In this section, you will explore the central object in this lab and in this assignment: [the collection](https://weaviate.io/developers/weaviate/manage-data/collections) - this is the name Weaviate gives to a group of data objects which will be indexed for retrieval. Remember the workflow from the lectures:

<div align="center">
  <img src="images/workflow.png" alt="RAG Overview" width="60%">
</div>


<a id='2-1'></a>
### 2.1 Creating a Collection

To create a collection, there are some parameters that must be set. The most important for our purposes are:

- `name`: the collection name. This is the name under which the collection is saved and the one you will need to load it.
- `vectorizer_config`: a list with vectorizer configurations. You can pass more than one vectorizer configuration, which means that in the same vector database, you can vectorize your datapoints with different embedding models. In your context, you will be using only one.

Let's load a dataset to illustrate this section.

In [4]:
data = joblib.load("data.joblib")
print_object_properties(data[0])

place: Grand Canyon
state: Arizona
description: A stunning canyon with vast vistas and incredible geology.
best_season_to_visit: Spring, Fall
attractions: South Rim, Havasu Falls, Skywalk
budget: Moderate
user_ratings: 4.8
last_updated: 2023-10-01T00:00:00Z



The dataset is a set of places to visit, with some properties describing each location. The properties here are `place, state, description, best_season_to_visit, attractions, budget, user_ratings, last_updated`. When creating a collection, you must create one property for each key in this dictionary and add the expected datatype. 
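To make the mapping from dictionary keys to data types concrete, here is a small sketch in plain Python. The helper `infer_type_label` and the labels it returns are hypothetical: they only mirror the names `TEXT`, `NUMBER` and `DATE` used later in this lab and are not part of the Weaviate API. The sample record is copied from the output above, assuming `last_updated` is stored as a string.

```python
from datetime import datetime

# Sample record copied from the dataset output above
sample = {
    "place": "Grand Canyon",
    "state": "Arizona",
    "description": "A stunning canyon with vast vistas and incredible geology.",
    "best_season_to_visit": "Spring, Fall",
    "attractions": "South Rim, Havasu Falls, Skywalk",
    "budget": "Moderate",
    "user_ratings": 4.8,
    "last_updated": "2023-10-01T00:00:00Z",
}


def infer_type_label(value):
    """Hypothetical helper: guess a type label for a single value."""
    if isinstance(value, bool):
        return "BOOL"
    if isinstance(value, (int, float)):
        return "NUMBER"
    if isinstance(value, str):
        try:
            # Timestamps in this dataset look like 2023-10-01T00:00:00Z
            datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
            return "DATE"
        except ValueError:
            return "TEXT"
    raise TypeError(f"Unsupported value type: {type(value).__name__}")


# One label per key, which is the information each property definition needs
schema = {key: infer_type_label(value) for key, value in sample.items()}
for key, label in schema.items():
    print(f"{key}: {label}")
```

A quick pass like this is a convenient way to double check that no key is forgotten before writing the property definitions by hand.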

<a id='2-2'></a>
### 2.2 Configuring the Vectorizer

As mentioned before, you will use the `text2vec_transformers` embedding model to vectorize your data. To configure it, you must pass the corresponding `Configure` object. When configuring the vectorizer, you can pass a list of different vectorizers, so your collection can store several vectorizations for the same object. You can also choose which properties each vectorizer uses. In this course you will stick with one vectorizer. Not every property must be vectorized; it depends on the data and the information you want to retrieve.

In this case, let's use the following properties to be vectorized:

`place, state, description, best_season_to_visit, attractions, budget`

These properties will be appended to each other and then vectorized. When defining a property, you can choose whether to include its name in the vectorized text. Including the name makes sense for budget, for example, since the word "Moderate" alone would not convey what it describes.
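The effect of including property names can be sketched in plain Python. This is only an illustration: `build_text_to_vectorize` is a hypothetical helper, and the exact string Weaviate builds internally (ordering, casing, separators) may differ.

```python
def build_text_to_vectorize(obj, source_properties, include_property_names=True):
    """Hypothetical helper: join selected properties into one string."""
    parts = []
    for name in source_properties:
        value = obj[name]
        if include_property_names:
            # Replace underscores so the name reads like natural text
            parts.append(f"{name.replace('_', ' ')} {value}")
        else:
            parts.append(str(value))
    return " ".join(parts)


obj = {"place": "Grand Canyon", "budget": "Moderate", "user_ratings": 4.8}

with_names = build_text_to_vectorize(obj, ["place", "budget"])
without_names = build_text_to_vectorize(
    obj, ["place", "budget"], include_property_names=False
)

print(with_names)     # place Grand Canyon budget Moderate
print(without_names)  # Grand Canyon Moderate
```

With the names included, the embedding model sees "budget Moderate" instead of a bare "Moderate", which gives it more context to work with.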

In [5]:
vectorizer_config = [
    Configure.NamedVectors.text2vec_transformers(
        name="vector",  # This is the name you will need to access the vectors of the objects in your collection
        # Which properties should be used to generate a vector; they will be appended to each other when vectorizing
        source_properties=['place', 'state', 'description', 'best_season_to_visit', 'attractions', 'budget'],
        # This tells the client to not vectorize the collection name.
        # If True, it will be appended at the beginning of the text to be vectorized
        vectorize_collection_name=False,
        # Since we are using an API-based vectorizer, you need to pass the URL used to make the calls.
        # This was set up in our Flask application
        inference_url="http://127.0.0.1:5000",
    )
]

<a id='2-3'></a>
### 2.3 The Properties

In a collection, the features of each data point are called Properties.

In [6]:
# Delete the collection in case it exists
if client.collections.exists("example_collection"):
    client.collections.delete("example_collection")
    

In [8]:
if not client.collections.exists('example_collection'): # Creates only if the collection does not exist
    collection = client.collections.create(
            name='example_collection',
            vectorizer_config=vectorizer_config, # The config we defined before,
            reranker_config=Configure.Reranker.transformers(), # The reranker config

            properties=[  # Define properties
            Property(name="place",vectorize_property_name=True,data_type= DataType.TEXT),
            Property(name="state",vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="description",vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="best_season_to_visit",vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="attractions",vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="budget",vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="user_ratings", data_type=DataType.NUMBER),
            Property(name="last_updated", data_type=DataType.DATE),

        ]
        )
else:
    collection = client.collections.get("example_collection")

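Running this cell creates the collection and returns it. Printing it shows the collection configuration. Note that Weaviate capitalizes the first letter of the collection name, so `example_collection` appears as `Example_collection`.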

In [9]:
print(collection)

<weaviate.Collection config={
  "name": "Example_collection",
  "description": null,
  "generative_config": null,
  "inverted_index_config": {
    "bm25": {
      "b": 0.75,
      "k1": 1.2
    },
    "cleanup_interval_seconds": 60,
    "index_null_state": false,
    "index_property_length": false,
    "index_timestamps": false,
    "stopwords": {
      "preset": "en",
      "additions": null,
      "removals": null
    }
  },
  "multi_tenancy_config": {
    "enabled": false,
    "auto_tenant_creation": false,
    "auto_tenant_activation": false
  },
  "properties": [
    {
      "name": "place",
      "description": null,
      "data_type": "text",
      "index_filterable": true,
      "index_range_filters": false,
      "index_searchable": true,
      "nested_properties": null,
      "tokenization": "word",
      "vectorizer_config": null,
      "vectorizer": null,
      "vectorizer_configs": {
        "text2vec-transformers": {
          "skip": false,
          "vectorize_property_

If you try to create a collection that already exists, an exception will be thrown:

In [10]:
try:
    collection = client.collections.create(
        name='example_collection',

        vectorizer_config=vectorizer_config, # The config we defined before,
    
        properties=[  # Define properties
        Property(name="place",vectorize_property_name=True,data_type= DataType.TEXT),
        Property(name="state",vectorize_property_name=True, data_type=DataType.TEXT),
        Property(name="description",vectorize_property_name=True, data_type=DataType.TEXT),
        Property(name="best_season_to_visit",vectorize_property_name=True, data_type=DataType.TEXT),
        Property(name="attractions",vectorize_property_name=True, data_type=DataType.TEXT),
        Property(name="budget",vectorize_property_name=True, data_type=DataType.TEXT),
        Property(name="user_ratings", data_type=DataType.NUMBER),
        Property(name="last_updated", data_type=DataType.DATE),
                 
    ]
    )
except Exception as e:
    print(e)

Collection may not have been created properly.! Unexpected status code: 422, with response body: {'error': [{'message': 'class name Example_collection already exists'}]}.


You can also retrieve all the collections saved:

In [11]:
client.collections.list_all().keys()

dict_keys(['Example_collection'])

The result of `.list_all()` is a dictionary with the collection names as keys and their configurations as values.

<a id='2-4'></a>
### 2.4 Adding elements into a Collection

Once you create a collection, you get an empty collection. Now you need to add elements to it. When you add an element, two important steps happen in the background:

1. The information is vectorized (as configured in the collection definition)
2. The HNSW index is updated to optimize search (as you saw in the lectures). This happens in the backend, so you don't see it, but it can make the process take a bit of time.

Elements are added through `collection.batch`, which provides useful extra features. For example, it lets you decide how many objects to send in each batch, handles errors during import, and improves performance by reducing the number of individual network calls. In this example, one element is added per batch, with a single concurrent request.
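The idea of fixed-size batching can be sketched without Weaviate at all. The generator `fixed_size_batches` below is a hypothetical helper, not the client's implementation; it only shows how a list of objects is cut into groups of at most `batch_size` elements, each of which would correspond to one request.

```python
def fixed_size_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 20 placeholder documents, matching the size of the dataset in this lab
documents = [f"doc-{i}" for i in range(20)]

batches_of_one = list(fixed_size_batches(documents, 1))
batches_of_eight = list(fixed_size_batches(documents, 8))

# A batch size of 1 means one request per object; larger batches mean fewer requests
print(len(batches_of_one), len(batches_of_eight))
```

Larger batches reduce the number of round trips, at the cost of sending more data per request.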

You can add a UUID (universally unique identifier) to each element you add, which prevents duplicate entries in your database.
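Deduplication works because the identifier is derived from the content, so the same content always yields the same identifier. The sketch below illustrates this with the standard library's `uuid.uuid5`. The function `content_uuid` and the dictionary `store` are hypothetical stand-ins and are not the Weaviate `generate_uuid5` helper, whose internals are not shown in this lab.

```python
import json
import uuid


def content_uuid(document):
    """Hypothetical helper: derive a deterministic UUID from a dictionary."""
    # Serialize with sorted keys so key order does not change the result
    canonical = json.dumps(document, sort_keys=True, default=str)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, canonical))


doc_a = {"place": "Hollywood", "state": "California"}
doc_b = {"state": "California", "place": "Hollywood"}  # same content, different key order
doc_c = {"place": "Cape Cod", "state": "Massachusetts"}

# A dictionary keyed by UUID stands in for the database
store = {}
for doc in (doc_a, doc_b, doc_c):
    store[content_uuid(doc)] = doc  # identical content overwrites instead of duplicating

print(len(store))
```

Inserting the same document twice therefore updates a single entry rather than creating a second one.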

Let's see it in practice!

In [13]:
# Set up a batch process with a fixed batch size and concurrency
with collection.batch.fixed_size(batch_size=1, concurrent_requests=1) as batch:
    # Iterate over the dataset
    for document in tqdm(data): # tqdm is a library to show progress bars
        # Generate a UUID from the document's content for unique identification
        uuid = generate_uuid5(document)

        # Add the object to the batch with properties and UUID.
        # properties expects a dictionary with the keys being the properties.
        batch.add_object(
            properties=document,
            uuid=uuid,
        )

100%|██████████| 20/20 [00:00<00:00, 52.37it/s]


Awesome! Now you have a collection with vectors! You can check the number of objects it contains using `len(collection)`:

In [14]:
len(collection)

20

<a id='3'></a>
## 3 - Querying on a collection

In this section, you will learn how to query a collection. You can:

- Filter on properties
- Query with semantic search
- Query with BM25
- Query with hybrid search
- Rerank the results

Let's see some examples.

<a id='3-1'></a>
### 3.1 Filters

Before diving into querying, let's understand filters. Filters restrict your search to the objects that meet some criteria. They are very flexible, and you usually pass them as an argument in a query. Here is an example.

In [15]:
# Fetch 2 objects, keeping only those whose 'user_ratings' value is greater than or equal to 3.5
result = collection.query.fetch_objects(limit = 2, filters = Filter.by_property('user_ratings').greater_or_equal(3.5))

The result is an object called QueryReturn:

In [16]:
result

QueryReturn(objects=[Object(uuid=_WeaviateUUIDInt('c99763a3-46a0-59d4-831b-af9bc290260c'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'state': 'California', 'last_updated': datetime.datetime(2023, 10, 1, 0, 0, tzinfo=datetime.timezone.utc), 'best_season_to_visit': 'Spring', 'budget': 'Moderate', 'place': 'Hollywood', 'description': 'Famous district in Los Angeles known as the entertainment capital of the world.', 'attractions': 'Walk of Fame, Hollywood Sign', 'user_ratings': 4.2}, references=None, vector={}, collection='Example_collection'), Object(uuid=_WeaviateUUIDInt('9e5ba590-8c75-5b53-9b0a-8a9c161004ad'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'state': 'New York', 'last_updated': datetime.datetime(2023, 

You can access its objects with `result.objects`:

In [17]:
result.objects

[Object(uuid=_WeaviateUUIDInt('c99763a3-46a0-59d4-831b-af9bc290260c'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'state': 'California', 'last_updated': datetime.datetime(2023, 10, 1, 0, 0, tzinfo=datetime.timezone.utc), 'best_season_to_visit': 'Spring', 'budget': 'Moderate', 'place': 'Hollywood', 'description': 'Famous district in Los Angeles known as the entertainment capital of the world.', 'attractions': 'Walk of Fame, Hollywood Sign', 'user_ratings': 4.2}, references=None, vector={}, collection='Example_collection'),
 Object(uuid=_WeaviateUUIDInt('9e5ba590-8c75-5b53-9b0a-8a9c161004ad'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'state': 'New York', 'last_updated': datetime.datetime(2023, 10, 1, 0, 0, tzinfo

So, each element in the list is an element of the collection. 

In [18]:
obj = result.objects[0]

You can check its properties, which are stored in a dictionary.

In [19]:
obj.properties

{'state': 'California',
 'last_updated': datetime.datetime(2023, 10, 1, 0, 0, tzinfo=datetime.timezone.utc),
 'best_season_to_visit': 'Spring',
 'budget': 'Moderate',
 'place': 'Hollywood',
 'description': 'Famous district in Los Angeles known as the entertainment capital of the world.',
 'attractions': 'Walk of Fame, Hollywood Sign',
 'user_ratings': 4.2}

In this course, you will filter with `Filter.by_property`. You will see more filtering examples as other query methods are explained.
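The meaning of the filter operators used in this lab (`greater_or_equal` above, plus `equal` and `contains_any` in later cells) can be sketched with plain Python predicates. The functions below are hypothetical stand-ins that only mimic the semantics on a list of dictionaries; they are not the Weaviate `Filter` class. The three records reuse values that appear in this lab's outputs.

```python
places = [
    {"place": "Hollywood", "budget": "Moderate", "user_ratings": 4.2},
    {"place": "Times Square", "budget": "Low", "user_ratings": 4.3},
    {"place": "Glacier National Park", "budget": "Moderate", "user_ratings": 4.8},
]


def greater_or_equal(prop, threshold):
    """Keep objects whose property is at least `threshold`."""
    return lambda obj: obj[prop] >= threshold


def equal(prop, value):
    """Keep objects whose property equals `value`."""
    return lambda obj: obj[prop] == value


def contains_any(prop, values):
    """Keep objects whose property matches any of `values`."""
    allowed = set(values)
    return lambda obj: obj[prop] in allowed


def apply_filter(objects, predicate, limit=None):
    """Return the objects that satisfy the predicate, optionally truncated."""
    matches = [obj for obj in objects if predicate(obj)]
    return matches if limit is None else matches[:limit]


high_rated = apply_filter(places, greater_or_equal("user_ratings", 4.3))
low_budget = apply_filter(places, equal("budget", "Low"))
low_or_moderate = apply_filter(
    places, contains_any("budget", ["Low", "Moderate"]), limit=2
)

print([p["place"] for p in high_rated])
print([p["place"] for p in low_budget])
print([p["place"] for p in low_or_moderate])
```

In a real query the filter is evaluated by the database, so only matching objects are considered by the search.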

<a id='3-2'></a>
### 3.2 Semantic Search

You can use semantic search to query your collection. The query you pass is vectorized, its vector is compared with the vectors of the elements in your collection, and the closest elements are returned. The method is `.near_text`.
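As a rough illustration of what "closest" means, the sketch below ranks a few toy vectors by cosine distance to a query vector. Everything here is an assumption made for illustration: the 3-dimensional vectors are invented (real embeddings have hundreds of dimensions), cosine distance is only one of the metrics a vector database can use, and the brute-force `nearest` function stands in for the index-based search the database performs.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Invented toy "embeddings", one per place
toy_vectors = {
    "Times Square": [0.9, 0.1, 0.0],
    "Glacier National Park": [0.1, 0.9, 0.2],
    "Cape Cod": [0.0, 0.3, 0.9],
}
query_vector = [1.0, 0.2, 0.0]


def nearest(query, vectors, limit):
    """Brute-force search: sort every item by its distance to the query."""
    ranked = sorted(vectors, key=lambda name: cosine_distance(query, vectors[name]))
    return ranked[:limit]


top = nearest(query_vector, toy_vectors, limit=2)
print(top)  # ['Times Square', 'Glacier National Park']
```

The `limit` argument plays the same role as in `.near_text`: it caps how many of the closest elements are returned.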

In [20]:
result = collection.query.near_text(query = 'I want suggestions to travel during Winter. I want cheap places.', limit = 4)

In [21]:
# Let's iterate over the result objects and return their properties
for obj in result.objects:
    print_object_properties(obj.properties)

state: New York
place: Times Square
best_season_to_visit: Winter
budget: Low
last_updated: 2023-10-01 00:00:00+00:00
description: Bustling pedestrian intersection and major commercial hub.
attractions: Broadway Theaters, New Year’s Eve Ball Drop
user_ratings: 4.3

state: Montana
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Summer
budget: Moderate
place: Glacier National Park
description: Park known for its rugged mountains and alpine forests.
attractions: Going-to-the-Sun Road, Grinnell Glacier
user_ratings: 4.8

state: Utah
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Spring, Fall
budget: Moderate
place: Zion National Park
description: Beautiful park known for its impressive canyons and towering cliffs.
attractions: The Narrows, Angels Landing
user_ratings: 4.7

state: Massachusetts
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Summer
budget: Moderate
place: Cape Cod
description: Popular tourist destination known for its beaches and qu

You can also restrict the semantic search to the elements with `budget = Low`:

In [22]:
result = collection.query.near_text(query = 'I want suggestions to travel during Winter. I want cheap places.', 
                                    filters = Filter.by_property('budget').equal('Low'),
                                    limit = 4)

In [24]:
# Let's iterate over the result objects and return their properties
for obj in result.objects:
    print_object_properties(obj.properties)

state: New York
place: Times Square
best_season_to_visit: Winter
budget: Low
last_updated: 2023-10-01 00:00:00+00:00
description: Bustling pedestrian intersection and major commercial hub.
attractions: Broadway Theaters, New Year’s Eve Ball Drop
user_ratings: 4.3

state: California
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Spring, Summer
budget: Low
place: Alcatraz Island
description: Famed former prison island located in San Francisco Bay.
attractions: Cellhouse Tour, Alcatraz Lighthouse
user_ratings: 4.4

state: Pennsylvania
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Spring, Fall
budget: Low
place: Gettysburg National Military Park
description: Historic site of a major Civil War battle.
attractions: Gettysburg Museum, Battlefield Tours
user_ratings: 4.6

state: New York
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Spring, Fall
budget: Low
place: Statue of Liberty
description: Iconic symbol of freedom and democracy in the United 

You can also pass a list of possible values to a filter by using `.contains_any`:

In [25]:
result = collection.query.near_text(query = 'I want suggestions to travel during Winter. I want cheap places.', 
                                    filters = Filter.by_property('budget').contains_any(['Low','Moderate']),
                                    limit = 4)

In [26]:
# Let's iterate over the result objects and return their properties
for obj in result.objects:
    print_object_properties(obj.properties)

state: New York
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Winter
budget: Low
place: Times Square
description: Bustling pedestrian intersection and major commercial hub.
attractions: Broadway Theaters, New Year’s Eve Ball Drop
user_ratings: 4.3

state: Montana
place: Glacier National Park
best_season_to_visit: Summer
budget: Moderate
last_updated: 2023-10-01 00:00:00+00:00
description: Park known for its rugged mountains and alpine forests.
attractions: Going-to-the-Sun Road, Grinnell Glacier
user_ratings: 4.8

state: Utah
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Spring, Fall
budget: Moderate
place: Zion National Park
description: Beautiful park known for its impressive canyons and towering cliffs.
attractions: The Narrows, Angels Landing
user_ratings: 4.7

state: Massachusetts
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Summer
budget: Moderate
place: Cape Cod
description: Popular tourist destination known for its beaches and qu

<a id='3-3'></a>
### 3.3 BM25 search

To perform a BM25 search, just run `collection.query.bm25`. The usual parameters `query`, `limit` and `filters` can be passed.
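To see why a keyword search can return fewer results than a semantic one, here is a minimal sketch of the textbook BM25 formula, using the `k1 = 1.2` and `b = 0.75` values shown in the collection configuration printed earlier. The tokenizer, the toy documents and the function names are assumptions for illustration; Weaviate's own implementation (tokenization, stopwords, per-property handling) is not reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with the textbook BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy documents built from a few property values seen in this lab
docs = [
    "Times Square Winter Low",
    "Glacier National Park Summer Moderate",
    "Zion National Park Spring Fall Moderate",
]
scores = bm25_scores("cheap places to travel during Winter", docs)
print(scores)
```

Only the first toy document shares a word with the query ("Winter"), so it is the only one with a non-zero score. Documents that share no keyword with the query cannot be matched, no matter how related they are in meaning.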

In [27]:
result = collection.query.bm25(query = 'I want suggestions to travel during Winter. I want cheap places.', 
                                    filters = Filter.by_property('budget').contains_any(['Low','Moderate']),
                                    limit = 4)

In [28]:
# Let's iterate over the result objects and return their properties
for obj in result.objects:
    print_object_properties(obj.properties)

state: New York
place: Times Square
best_season_to_visit: Winter
budget: Low
last_updated: 2023-10-01 00:00:00+00:00
description: Bustling pedestrian intersection and major commercial hub.
attractions: Broadway Theaters, New Year’s Eve Ball Drop
user_ratings: 4.3



<a id='3-4'></a>
### 3.4 Hybrid Search

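Hybrid search combines BM25 and semantic search by fusing their results, as in the RRF search you saw in the lectures. Apart from the standard query parameters, you can pass an `alpha` to control the balance between the two: `alpha = 0` is a pure BM25 (keyword) search, `alpha = 1` is a pure vector search, and values in between mix them.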
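A weighted form of reciprocal rank fusion can be sketched as follows. This is a sketch under stated assumptions, not Weaviate's implementation: the function `weighted_rrf`, the constant `k = 60`, and the choice to weight the vector ranking by `alpha` and the keyword ranking by `1 - alpha` are all illustrative, and the two input rankings are invented.

```python
def weighted_rrf(vector_ranking, keyword_ranking, alpha=0.5, k=60):
    """Fuse two ranked lists; each item earns weight / (k + rank) per list."""
    weighted_lists = ((vector_ranking, alpha), (keyword_ranking, 1.0 - alpha))
    scores = {}
    for ranking, weight in weighted_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Invented rankings, best match first
vector_ranking = ["Times Square", "Glacier National Park", "Zion National Park"]
keyword_ranking = ["Zion National Park", "Times Square"]

pure_keyword = weighted_rrf(vector_ranking, keyword_ranking, alpha=0.0)
pure_vector = weighted_rrf(vector_ranking, keyword_ranking, alpha=1.0)
balanced = weighted_rrf(vector_ranking, keyword_ranking, alpha=0.5)

print(pure_keyword)
print(pure_vector)
print(balanced)
```

With `alpha = 0.0` the keyword ranking decides the order, with `alpha = 1.0` the vector ranking does, and with `alpha = 0.5` an item that ranks well in both lists ("Times Square") comes out on top.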

In [29]:
result = collection.query.hybrid(query = 'I want suggestions to travel during Winter. I want cheap places.', 
                                    filters = Filter.by_property('budget').contains_any(['Low','Moderate']),
                                    alpha = 0.3,
                                    limit = 4)

In [30]:
# Let's iterate over the result objects and return their properties
for obj in result.objects:
    print_object_properties(obj.properties)

state: New York
place: Times Square
best_season_to_visit: Winter
budget: Low
last_updated: 2023-10-01 00:00:00+00:00
description: Bustling pedestrian intersection and major commercial hub.
attractions: Broadway Theaters, New Year’s Eve Ball Drop
user_ratings: 4.3

state: Montana
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Summer
budget: Moderate
place: Glacier National Park
description: Park known for its rugged mountains and alpine forests.
attractions: Going-to-the-Sun Road, Grinnell Glacier
user_ratings: 4.8

state: Utah
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Spring, Fall
budget: Moderate
place: Zion National Park
description: Beautiful park known for its impressive canyons and towering cliffs.
attractions: The Narrows, Angels Landing
user_ratings: 4.7

state: Massachusetts
last_updated: 2023-10-01 00:00:00+00:00
best_season_to_visit: Summer
budget: Moderate
place: Cape Cod
description: Popular tourist destination known for its beaches and qu

<a id='3-5'></a>
### 3.5 Reranking

You can easily perform reranking with Weaviate by passing a new argument to a search. Let's try with semantic search!
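Conceptually, reranking is a second pass: a first search retrieves candidates, and a second scoring function reorders them using one property and, optionally, a different query. The sketch below shows that two-stage shape in plain Python. The word-overlap scorer is a deliberately crude, hypothetical stand-in for the reranker model used in this lab, and the candidate records reuse attraction values from earlier outputs.

```python
import re


def _words(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(query, text):
    """Toy relevance score: number of distinct query words present in the text."""
    return len(_words(query) & _words(text))


def rerank(candidates, prop, query, scorer=overlap_score):
    """Reorder candidates by the scorer applied to one property, best first."""
    return sorted(candidates, key=lambda obj: scorer(query, obj[prop]), reverse=True)


# Candidates as a first-stage search might return them
candidates = [
    {"place": "Gettysburg National Military Park",
     "attractions": "Gettysburg Museum, Battlefield Tours"},
    {"place": "Times Square",
     "attractions": "Broadway Theaters, New Year's Eve Ball Drop"},
    {"place": "Alcatraz Island",
     "attractions": "Cellhouse Tour, Alcatraz Lighthouse"},
]

reranked = rerank(candidates, prop="attractions", query="Broadway theaters and shows")
print([obj["place"] for obj in reranked])
```

Reranking only reorders the candidates it receives, so the first-stage `limit` still determines which objects can appear in the final result.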

In [31]:
from weaviate.classes.query import Rerank

response = collection.query.near_text(
    query="I want suggestions to travel during Winter. I want cheap and fun places.",
    limit=5,
    rerank=Rerank(
        prop="attractions",  # The property to rerank on
        query="Fun places"   # If not provided, the original query will be used
    )
)


In [32]:
# Let's iterate over the result objects and return their properties
for obj in response.objects:
    print_object_properties(obj.properties)


In [None]:
# Don't forget to close the client!
client.close()

Keep it up! You just finished the basics of the Weaviate API you'll use in this course!