![Introduction to Weaviate](./img/01-cover.png)

![Introduction to Weaviate](./img/02-about-weaviate.png)

### Agenda:

#### What you will see:

- Examples of AI-powered searches
- Create and build a vector database
- Search with a vector database
- Retrieval augmented generation (RAG)
- Scalability considerations

#### You will learn:

- About vector, keyword & hybrid searches
    - When to use each one
- How to perform RAG
- How to build a scalable vector DB

## Search: An Introduction

Try searches using this (pre-populated) toy dataset. 

```python
animal_objs = [
    {"description": "brown dog"},
    {"description": "small domestic black cat"},
    {"description": "orange cheetah"},
    {"description": "black bear"},
    {"description": "large white seagull"},
    {"description": "yellow canary"},
]
```

In [1]:
# Prep script: No need to show

import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import Filter, MetadataQuery
import os

# Recommended: save sensitive data as environment variables
cohere_key = os.getenv("COHERE_APIKEY")
headers = {
    "X-Cohere-Api-Key": cohere_key,
}

client = weaviate.connect_to_local(
    headers=headers
)

# Work with Weaviate

client.collections.delete("Animals")  # Remove any collection left over from a previous run

animals = client.collections.create(
    name="Animals",
    properties=[
        Property(name="description", data_type=DataType.TEXT),
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_ollama(
            name="description",
            source_properties=["description"],
            api_endpoint="http://host.docker.internal:11434",  # If using Docker, use this to contact your local Ollama instance
            model="nomic-embed-text",  # The model to use, e.g. "nomic-embed-text"
        )
    ],
    generative_config=Configure.Generative.ollama(
        api_endpoint="http://host.docker.internal:11434",  # If using Docker, use this to contact your local Ollama instance
        model="gemma2:2b"
    ),
    # reranker_config=Configure.Reranker.cohere()
)

animal_objs = [
    {"description": "brown dog"},
    {"description": "small domestic black cat"},
    {"description": "orange cheetah"},
    {"description": "black bear"},
    {"description": "large white seagull"},
    {"description": "yellow canary"},
]

animals.data.insert_many(animal_objs)

BatchObjectReturn(_all_responses=[UUID('096c8376-76f4-4181-9f69-9c57c35b9636'), UUID('4de77d18-4931-4c90-90fd-333be908a661'), UUID('5ffb4a6a-8dd2-4da4-ba1a-179a4b15c3ee'), UUID('b2bc5ccd-45d3-421e-a943-bda0f01d37ac'), UUID('08467083-2f20-4d91-8b1d-e4b720fbe1fc'), UUID('6393ff1f-574c-49ab-a513-94e697ea6ff4')], elapsed_seconds=0.9247109889984131, errors={}, uuids={0: UUID('096c8376-76f4-4181-9f69-9c57c35b9636'), 1: UUID('4de77d18-4931-4c90-90fd-333be908a661'), 2: UUID('5ffb4a6a-8dd2-4da4-ba1a-179a4b15c3ee'), 3: UUID('b2bc5ccd-45d3-421e-a943-bda0f01d37ac'), 4: UUID('08467083-2f20-4d91-8b1d-e4b720fbe1fc'), 5: UUID('6393ff1f-574c-49ab-a513-94e697ea6ff4')}, has_errors=False)

### Traditional search

In [2]:
query = "cat"

response = animals.query.bm25(query)

print(f"{len(response.objects)} results returned:")
for o in response.objects:
    print(o.properties)

1 results returned:
{'description': 'small domestic black cat'}


But traditional keyword searches are not very robust: they only match terms that literally appear in the data.

In [3]:
query = "kitty"  # Try synonyms or even typos

response = animals.query.bm25(query)

print(f"{len(response.objects)} results returned:")
for o in response.objects:
    print(o.properties)

0 results returned:
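
Why does "kitty" come back empty? A keyword index can only match tokens that literally occur in the stored text. The sketch below imitates that behavior with a plain whitespace tokenizer. This is an assumption made for illustration; Weaviate's actual tokenization and BM25 scoring are more involved.

```python
# Toy illustration of exact-token keyword matching (not Weaviate's implementation).
descriptions = [
    "brown dog",
    "small domestic black cat",
    "orange cheetah",
    "black bear",
    "large white seagull",
    "yellow canary",
]


def keyword_match(query: str, docs: list[str]) -> list[str]:
    """Return the docs that share at least one whole token with the query."""
    query_tokens = set(query.lower().split())
    return [d for d in docs if query_tokens & set(d.lower().split())]


print(keyword_match("cat", descriptions))    # ['small domestic black cat']
print(keyword_match("kitty", descriptions))  # []
```

No description contains the token "kitty", so nothing matches, even though a human reader would see the connection to "cat".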


### Vector search

Vector search, in contrast, ranks results by similarity in meaning, which makes it more forgiving and nuanced:

In [4]:
query = "cat"

response = animals.query.near_text(query)

print(f"{len(response.objects)} results returned:")
for o in response.objects:
    print(o.properties)

6 results returned:
{'description': 'small domestic black cat'}
{'description': 'orange cheetah'}
{'description': 'yellow canary'}
{'description': 'black bear'}
{'description': 'large white seagull'}
{'description': 'brown dog'}


In [5]:
query = "cat"  # Try replacing this with a synonym (e.g. "kitty") or even a typo

response = animals.query.near_text(query)

print(f"{len(response.objects)} results returned:")
for o in response.objects:
    print(o.properties)

6 results returned:
{'description': 'small domestic black cat'}
{'description': 'orange cheetah'}
{'description': 'yellow canary'}
{'description': 'black bear'}
{'description': 'large white seagull'}
{'description': 'brown dog'}


Vector searches provide forgiving, nuanced, meaning-based similarity search. 

But what is a vector?

## Introduction to Vectors

![Introduction to Vectors](./img/04-vectors-intro-01.png)

![Introduction to Vectors](./img/04-vectors-intro-02.png)

![Introduction to Vectors](./img/04-vectors-intro-03.png)

![Introduction to Vectors](./img/04-vectors-intro-04.png)

![Introduction to vectors](./img/04-vectors-intro-05.png)
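
To make the idea concrete, here is a minimal sketch that compares hand-made 3-dimensional vectors using cosine similarity. The vector values are hypothetical and chosen only for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

# Hand-made 3-dimensional "embeddings" (hypothetical values, for illustration only).
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitty": [0.85, 0.75, 0.2],
    "seagull": [0.1, 0.3, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_kitty = cosine_similarity(toy_vectors["cat"], toy_vectors["kitty"])
sim_seagull = cosine_similarity(toy_vectors["cat"], toy_vectors["seagull"])
print(f"cat vs kitty:   {sim_kitty:.3f}")
print(f"cat vs seagull: {sim_seagull:.3f}")
```

Vectors that point in nearly the same direction ("cat" and "kitty" here) score close to 1, while unrelated ones score much lower. That is the property a vector search relies on.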

## Why use vector search?

- Better search
    - Find contextually relevant info
    - Allow synonyms, different languages
    - More value from data
- Work together with generative AI models
    - Overcome hallucinations or lack of specific / proprietary information

![Introduction to RAG](./img/06-rag-intro-01.png)

![Introduction to RAG](./img/06-rag-intro-02.png)

![Introduction to RAG](./img/06-rag-intro-03.png)

![Introduction to RAG](./img/06-rag-intro-04.png)

### Example RAG prompts:

- Summarise the corporate strategy of ACME Co for FY2024-25.
- What is our internal policy on food expenses?
- What smartphone issues do users commonly complain about?

#### 🤔 How can we find data for these prompts with *just* keyword searches?

It's very difficult!

### Example:

#### `What smartphone issues do users commonly complain about?`

How would you search for "smartphone" issues in your data?

- "*phone*"?
- "tablet"?
- "android" and "iphone"?
- Include every smartphone maker, model and name?

> With vector DBs, you get **flexibility**, because semantic search matches on meaning rather than on exact terms.

![Introduction to RAG](./img/06-rag-intro-05.png)
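
The RAG flow can be sketched in a few lines: retrieve relevant passages first, then pass them to an LLM inside the prompt. In this sketch, `stub_retrieve` and `stub_generate` are hypothetical stand-ins for a vector search and a model call, so that it runs without a database or an LLM. The prompt wording is likewise just one possible choice.

```python
from typing import Callable


def build_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved passages and the user question into a single LLM prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


def rag_answer(
    question: str,
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str], str],
    limit: int = 2,
) -> str:
    """Retrieve first, then generate: the two steps of RAG."""
    passages = retrieve(question, limit)
    return generate(build_prompt(question, passages))


# Hypothetical stand-ins so the sketch runs on its own.
def stub_retrieve(question: str, limit: int) -> list[str]:
    passages = [
        "Placeholder passage A returned by a vector search.",
        "Placeholder passage B returned by a vector search.",
        "Placeholder passage C returned by a vector search.",
    ]
    return passages[:limit]


def stub_generate(prompt: str) -> str:
    return f"[LLM output for a prompt of {len(prompt)} characters]"


question = "What is our internal policy on food expenses?"
print(build_prompt(question, stub_retrieve(question, 2)))
print(rag_answer(question, stub_retrieve, stub_generate))
```

The quality of the final answer depends on what the retrieval step returns, which is why the search method matters so much for RAG.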

# Weaviate in practice

## Build a database

### Preparation: Get the data

We'll use a dataset of movies from TMDB. Let's download the data, and preview it.

In [6]:
import pandas as pd

# movie_df = pd.read_csv("./data/movies.csv")
movie_df = pd.read_csv("https://raw.githubusercontent.com/weaviate-tutorials/intro-workshop/main/data/movies.csv")
movie_df.head()

| | backdrop_path | genre_ids | id | original_language | original_title | overview | popularity | poster_path | release_date | title | video | vote_average | vote_count | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /rH0DPF7pB35jxLxKb3JRUgCrrnp.jpg | [10751, 14, 16, 10749] | 11224 | en | Cinderella | Cinderella has faith her dreams of a better li... | 100.819 | /avz6S9HYWs4O8Oe4PenBFNX4uDi.jpg | 1950-02-22 | Cinderella | False | 7.044 | 6523 | 1950 |
| 1 | /p47ihFj4A7EpBjmPHdTj4ipyq1S.jpg | [18] | 599 | en | Sunset Boulevard | A hack screenwriter writes a screenplay for a ... | 57.74 | /sC4Dpmn87oz9AuxZ15Lmip0Ftgr.jpg | 1950-08-10 | Sunset Boulevard | False | 8.312 | 2485 | 1950 |
| 2 | /zyO6j74DKMWfp5snWg6Hwo0T3Mz.jpg | [80, 18, 9648] | 548 | ja | 羅生門 | Brimming with action while incisively examinin... | 21.011 | /vL7Xw04nFMHwnvXRFCmYYAzMUvY.jpg | 1950-08-26 | Rashomon | False | 8.091 | 2121 | 1950 |
| 3 | /b4yiLlIFuiULuuLTxT0Pt1QyT6J.jpg | [16, 10751, 14, 12] | 12092 | en | Alice in Wonderland | On a golden afternoon, young Alice follows a W... | 75.465 | /20cvfwfaFqNbe9Fc3VEHJuPRxmn.jpg | 1951-07-28 | Alice in Wonderland | False | 7.2 | 5697 | 1951 |
| 4 | /mxf8hJJkHTCqZP3m4o8E1TtwHHs.jpg | [35, 10749] | 872 | en | Singin' in the Rain | In 1927 Hollywood, a silent film production co... | 31.407 | /w03EiJVHP8Un77boQeE7hg9DVdU.jpg | 1952-04-09 | Singin' in the Rain | False | 8.2 | 3036 | 1952 |


### Step 1: Connect to Weaviate

You can use a hosted Weaviate instance on Weaviate Cloud, or run the open-source distribution yourself, for example on AWS or GCP, or locally.

We will use a local Docker instance in the workshop.

In [7]:
import weaviate

# If you have Weaviate running locally with Docker:
client = weaviate.connect_to_local()

Check that the Weaviate instance is up and ready to accept requests.

In [9]:
client.is_ready()

True

### Step 2: Add data to Weaviate

#### Add collection definition

The equivalent of a SQL "table" is called a "collection" in Weaviate.

We'll create a new collection definition here for "Movie", with:
- Two "named vectors", which capture different "meanings" of the data (one built from the title, one from the title and overview together),
- A "generative" module, which will allow us to use LLMs with our data, and
- Properties to save our movie data (which are like SQL columns).
    - Just the title, overview, year and popularity for now.

In [12]:
from weaviate.classes.config import Configure, DataType, Property

# DO NOT DO THIS IN PRODUCTION - THIS IS TO DELETE DATA FROM MY PREVIOUS DEMOS
if client.collections.exists("Movie"):
    client.collections.delete("Movie")

# Create a collection
client.collections.create(
    name="Movie",
    # ================================================================================
    # Using our Ollama integration: https://weaviate.io/developers/weaviate/model-providers/ollama
    # Many other integrations available. See https://weaviate.io/developers/weaviate/model-providers/
    # ================================================================================
    vectorizer_config=[
        Configure.NamedVectors.text2vec_ollama(
            name="title",
            source_properties=["title"],
            api_endpoint="http://host.docker.internal:11434",  # If using Docker, use this to contact your local Ollama instance
            model="nomic-embed-text",  # The model to use, e.g. "snowflake-arctic-embed"
        ),
        Configure.NamedVectors.text2vec_ollama(
            name="all_text",
            source_properties=["title", "overview"],
            api_endpoint="http://host.docker.internal:11434",  # If using Docker, use this to contact your local Ollama instance
            model="nomic-embed-text",  # The model to use, e.g. "snowflake-arctic-embed"
        ),
    ],
    generative_config=Configure.Generative.ollama(
        api_endpoint="http://host.docker.internal:11434",
        model="gemma2:2b"
    ),
    # ================================================================================
    # OPTIONAL - SPECIFY YOUR DATA SCHEMA OR HAVE IT INFERRED BY WEAVIATE
    # ================================================================================
    # properties=[
    #     Property(
    #         name="title",
    #         data_type=DataType.TEXT,
    #     ),
    #     Property(
    #         name="overview",
    #         data_type=DataType.TEXT,
    #     ),
    #     Property(
    #         name="popularity",
    #         data_type=DataType.NUMBER,
    #     ),
    #     Property(
    #         name="year",
    #         data_type=DataType.INT,
    #     ),
    # ],    
)

<weaviate.collections.collection.sync.Collection at 0x110db5cd0>

Was our collection created successfully? Let's take a look.

In [13]:
client.collections.exists("Movie")

True

#### Add data

We'll add actual objects (the equivalent of SQL rows) to our collection.

First, let's build objects to add - and take a look at a couple.

In [14]:
data_columns = ['title', 'overview', 'year', 'popularity']

df = movie_df[data_columns]

df.head()

| | title | overview | year | popularity |
|---|---|---|---|---|
| 0 | Cinderella | Cinderella has faith her dreams of a better li... | 1950 | 100.819 |
| 1 | Sunset Boulevard | A hack screenwriter writes a screenplay for a ... | 1950 | 57.74 |
| 2 | Rashomon | Brimming with action while incisively examinin... | 1950 | 21.011 |
| 3 | Alice in Wonderland | On a golden afternoon, young Alice follows a W... | 1951 | 75.465 |
| 4 | Singin' in the Rain | In 1927 Hollywood, a silent film production co... | 1952 | 31.407 |


> If it all looks fine - let's add objects:
> - https://weaviate.io/developers/weaviate/manage-data/import

In [15]:
from tqdm import tqdm

movies = client.collections.get("Movie")

with movies.batch.fixed_size(200) as batch:
    for i, row in tqdm(df.iterrows()):
        obj_body = {
            c: row[c] for c in data_columns
        }
        batch.add_object(
            properties=obj_body
        )

1322it [00:19, 69.05it/s]
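
Conceptually, a fixed-size batch groups objects into chunks and sends each chunk in one request instead of making one request per object. The sketch below shows only that chunking idea in plain Python. It is not the client's implementation, which also handles sending the requests and tracking errors. `fixed_size_batches` and the placeholder objects are hypothetical names used for illustration.

```python
from typing import Iterable, Iterator


def fixed_size_batches(items: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `batch_size` items, preserving order."""
    batch: list[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


# 1322 placeholder objects, matching the number of rows imported above.
objects = [{"title": f"movie {i}"} for i in range(1322)]
batch_sizes = [len(b) for b in fixed_size_batches(objects, 200)]
print(len(batch_sizes), "batches:", batch_sizes)
```

With 1322 objects and a batch size of 200, this yields six full batches and a final batch of 122.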


#### Confirm data load

Do we have data? 

Let's get an object count.

In [16]:
print(len(movies))

1322


Does the data look right?

Let's grab a few objects from Weaviate!

In [17]:
response = movies.query.fetch_objects(limit=3)
for o in response.objects:
    print(o.properties)

{'year': 1950.0, 'title': 'Sunset Boulevard', 'overview': 'A hack screenwriter writes a screenplay for a former silent film star who has faded into Hollywood obscurity.', 'popularity': 57.74}
{'year': 2005.0, 'title': 'Robots', 'overview': "Rodney Copperbottom is a young robot inventor who dreams of making the world a better place, until the evil Ratchet takes over Big Weld Industries. Now, Rodney's dreams – and those of his friends – are in danger of becoming obsolete.", 'popularity': 45.208}
{'title': 'The Irishman', 'overview': 'Pennsylvania, 1956. Frank Sheeran, a war veteran of Irish origin who works as a truck driver, accidentally meets mobster Russell Bufalino. Once Frank becomes his trusted man, Bufalino sends him to Chicago with the task of helping Jimmy Hoffa, a powerful union leader related to organized crime, with whom Frank will maintain a close friendship for nearly twenty years.', 'year': 2019.0, 'popularity': 99.531}


Let's pause for a second - because we've done a lot!

#### What did we just do?

Here is a conceptual diagram:

![img](https://github.com/weaviate-tutorials/intro-workshop/blob/main/images/object_import_process_full.png?raw=1)

### Step 3: Work with the data

Let's try a few more involved queries.

#### Filtering (similar to WHERE filter in SQL)

A filter reduces the number of objects based on specific criteria.

In [18]:
from weaviate.classes.query import Filter

response = movies.query.fetch_objects(
    filters=Filter.by_property("year").greater_than(2015),
    limit=3
)

for o in response.objects:
    print(o.properties["title"])

Captain America: Civil War
Doctor Strange
Ghostbusters


But this does not rank the results in any meaningful way.

For that, we need a keyword search (as opposed to a keyword *filter*).

#### Keyword search

Keyword search ranks results based on keyword match "scores", according to the BM25 algorithm. These scores are based on how often tokens in the query appear in each data object, weighted by how rare each token is across the collection and adjusted for the length of the text.
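
To show where such scores come from, here is a minimal pure-Python sketch of the standard Okapi BM25 formula. The whitespace tokenizer, the toy documents and the parameter values `k1=1.2` and `b=0.75` are assumptions made for illustration, so the numbers will not match the scores Weaviate returns.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each doc against the query with the Okapi BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            # Number of docs containing the term: rarer terms get a higher weight.
            df = sum(1 for t in tokenized if term in t)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturates (k1) and is normalized by document length (b).
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents (hypothetical, loosely based on the movie data).
docs = [
    "guardians of the galaxy",
    "a plot that could change the galaxy forever in a long overview text",
    "singin in the rain",
]
for doc, score in zip(docs, bm25_scores("galaxy", docs)):
    print(f"{score:.3f}  {doc}")
```

Both of the first two documents contain "galaxy" once, but the shorter one scores higher because of the length normalization. The third document has no matching token and scores zero.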

In [19]:
from weaviate.classes.query import MetadataQuery

response = movies.query.bm25(
    query="galaxy",
    limit=5,
    return_metadata=MetadataQuery(score=True, last_update_time=True)
)

for o in response.objects:
    print(o.metadata.score)
    print(o.metadata.last_update_time)
    print(o.properties)

3.2621753215789795
2024-09-16 13:14:33.652000+00:00
{'year': 2017.0, 'title': 'Guardians of the Galaxy Vol. 2', 'overview': "The Guardians must fight to keep their newfound family together as they unravel the mysteries of Peter Quill's true parentage.", 'popularity': 142.267}
3.2621753215789795
2024-09-16 13:14:42.314000+00:00
{'year': 2023.0, 'title': 'Guardians of the Galaxy Vol. 3', 'overview': 'Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mission that, if not completed successfully, could quite possibly lead to the end of the Guardians as we know them.', 'popularity': 165.416}
2.128286361694336
2024-09-16 13:14:24.323000+00:00
{'title': 'Star Wars: Episode II - Attack of the Clones', 'overview': 'Following an assassination attempt on Senator Padmé Amidala, Jedi Knights Anakin Skywalker and Obi-Wan Kenobi investigate a mysterious plot that could change the galaxy forever.', 'year':

#### Semantic search

A semantic search, on the other hand, ranks objects by how similar their meaning is to the query.

In [20]:
import json

response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",
)

for o in response.objects:
    print(json.dumps(o.properties, indent=2))

{
  "year": 1999.0,
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "popularity": 62.01
}
{
  "year": 1977.0,
  "title": "Star Wars",
  "overview": "Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful princess and restore p

#### How does this work?

- Under the hood, this uses a vector search. It looks for objects which are the most similar to a text input.
- We can inspect the similarity along with the results.

In [21]:
import json

response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",
    return_metadata=MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata)
    print(json.dumps(o.properties, indent=2))

MetadataReturn(creation_time=None, last_update_time=None, distance=0.2529594898223877, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None)
{
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "year": 1999.0,
  "popularity": 62.01
}
MetadataReturn(creation_time=None, last_update_time=None, distance=0.4024823307991028, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_scor

This is where "vectors" come in. 

Each object in Weaviate includes a vector - like so:

In [28]:
response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",  # or "all_text"
    include_vector=True,
    return_metadata=MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata.distance)
    print(json.dumps(o.properties, indent=2))
    print(o.vector["title"][:5])


These vector representations come from deep learning models similar to those that power LLMs. They capture meaning, and are called vector "embeddings".
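
The following sketch shows how named vectors and `target_vector` fit together: each object carries one vector per name, and the search ranks objects by distance on whichever one is targeted. The 2-dimensional vectors are hypothetical toy values, `toy_nearest` is an illustrative helper rather than a Weaviate function, and the scan over every object is a simplification. A real vector database uses an index instead of comparing against every object.

```python
import math

# Each object stores one vector per named vector (toy 2-D values, hypothetical).
objects = [
    {"title": "Galaxy Quest", "vectors": {"title": [0.9, 0.1], "all_text": [0.6, 0.4]}},
    {"title": "Star Wars", "vectors": {"title": [0.7, 0.3], "all_text": [0.8, 0.2]}},
    {"title": "Cinderella", "vectors": {"title": [0.1, 0.9], "all_text": [0.2, 0.8]}},
]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, so 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))


def toy_nearest(query_vector: list[float], target_vector: str, limit: int) -> list[tuple[str, float]]:
    """Brute-force search: rank every object by distance on the chosen named vector."""
    scored = [
        (o["title"], cosine_distance(query_vector, o["vectors"][target_vector]))
        for o in objects
    ]
    return sorted(scored, key=lambda pair: pair[1])[:limit]


query_vector = [1.0, 0.0]  # stand-in for the embedding of the query "galaxy"
print(toy_nearest(query_vector, target_vector="title", limit=2))
print(toy_nearest(query_vector, target_vector="all_text", limit=2))
```

The same query can produce a different ranking depending on the targeted vector, because each named vector encodes a different view of the object.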

#### Generative search

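A generative search passes the retrieved objects to an LLM together with a prompt, transforming your data at retrieval time. Here, `single_prompt` is applied to each object separately, while `grouped_task` is applied once to the whole set of results.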

In [24]:
response = movies.generate.near_text(
    query="galaxy",
    limit=5,
    target_vector="title",
    single_prompt="Write a tweet promoting the movie with TITLE: {title} and OVERVIEW: {overview}.",
    grouped_task="What audience demographic might enjoy this group of movies?"
)

In [25]:
print(response.generated)

Based on the movie descriptions and titles, this audience demographic might enjoy these movies:

**Primary Demographic:** **Millennials & Gen Xers (ages 25-45)**  who have a fondness for:

* **Sci-Fi**: All the films feature space travel, aliens, and fantastical worlds.
* **Humor**: "Galaxy Quest" is a hilarious sendup of classic sci-fi tropes and actors dealing with their past fame.  
* **Action/Adventure**: The movies offer exciting storylines with good vs. evil battles, heroic rescues, and action sequences. 
* **Pop Culture:** "Star Wars," in particular, has become deeply embedded in pop culture, attracting fans from all generations.

**Secondary Demographics:**

* **Younger audiences (18-24):** Some of the films may appeal to younger viewers due to their entertaining storylines and references to classic sci-fi themes. 
* **Adults who enjoy nostalgic entertainment:** The movies offer a fun escape to simpler times and are sure to draw in those looking for something lighthearted and e

In [26]:
for o in response.objects:
    print(o.generated)
    print(json.dumps(o.properties, indent=2))

🚀 Calling all Star Trek fans! 💥 

Remember that hilariously epic series *Galaxy Quest*?  🤣 Get ready for a trip back in time with Tim Allen & Sigourney Weaver in this sci-fi classic! 🍿 

[Link to trailer/movie] #GalaxyQuest #SciFiComedy #ThrowbackThursday 

{
  "year": 1999.0,
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "popularity": 62.01
}
🚀💫 **Join the Rebellion!** 💫🚀

Princess Leia is captured by the tyrannical Empire,

Each object has been transformed into a tweet by the LLM based on our prompt!

Remember to close the client connection with `client.close()` when you are done, to release sockets and other resources.

In [27]:
client.close()
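
One way to make sure a connection is always closed, even when a query raises an error, is to wrap its use in `contextlib.closing` (or a `try`/`finally` block). The sketch below uses a hypothetical `DemoClient` stand-in so that it runs without a Weaviate instance. It also shows why any query issued after closing fails until you reconnect.

```python
from contextlib import closing


class DemoClient:
    """Hypothetical stand-in for a database client that holds open connections."""

    def __init__(self) -> None:
        self.is_open = True

    def query(self, text: str) -> str:
        if not self.is_open:
            raise RuntimeError("client is closed; reconnect before querying")
        return f"results for {text!r}"

    def close(self) -> None:
        self.is_open = False


demo_client = DemoClient()
with closing(demo_client):  # close() runs on exit, even if the body raises
    print(demo_client.query("galaxy"))

print("open after the block?", demo_client.is_open)
```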