## Weaviate workshop

<a target="_blank" href="https://colab.research.google.com/github/weaviate-tutorials/intro-workshop/blob/main/workshop.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Goals:

#### What you will see:


- Create a vector database with Weaviate,
- Add data to the database, and
- Interact with the data, including searching it and using LLMs with your data in Weaviate.

#### What you will learn today:

- What Weaviate is,
- How it stores the data (based on its "meaning"), and
- What you can do with Weaviate, such as semantic search and using LLMs to transform data.

Install the Weaviate Python client if your environment doesn't have it yet.

In [1]:
# !pip install -U weaviate-client
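
If you are not sure whether the client is already installed, a quick check like the one below can help before running the install cell. This is a minimal sketch that uses only the Python standard library; the helper name `client_version` is ours, not part of the Weaviate client.

```python
import importlib.util
from importlib import metadata


def client_version(package: str = "weaviate-client", module: str = "weaviate"):
    """Return the installed version string, or None if the module is not importable."""
    if importlib.util.find_spec(module) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the client is installed, otherwise None.
print(client_version())
```

If this prints `None`, uncomment and run the install cell above.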

## Preparation: Get the data

We'll use a dataset of movies from TMDB. 

Pre-processed version: "./data/movies.csv"


Load (or download) the data and preview it.

In [2]:
import pandas as pd

# movie_df = pd.read_csv("./data/movies.csv")
movie_df = pd.read_csv("https://raw.githubusercontent.com/weaviate-tutorials/intro-workshop/main/data/movies.csv")
movie_df.head()

Unnamed: 0,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count,year
0,/rH0DPF7pB35jxLxKb3JRUgCrrnp.jpg,"[10751, 14, 16, 10749]",11224,en,Cinderella,Cinderella has faith her dreams of a better li...,100.819,/avz6S9HYWs4O8Oe4PenBFNX4uDi.jpg,1950-02-22,Cinderella,False,7.044,6523,1950
1,/p47ihFj4A7EpBjmPHdTj4ipyq1S.jpg,[18],599,en,Sunset Boulevard,A hack screenwriter writes a screenplay for a ...,57.74,/sC4Dpmn87oz9AuxZ15Lmip0Ftgr.jpg,1950-08-10,Sunset Boulevard,False,8.312,2485,1950
2,/zyO6j74DKMWfp5snWg6Hwo0T3Mz.jpg,"[80, 18, 9648]",548,ja,羅生門,Brimming with action while incisively examinin...,21.011,/vL7Xw04nFMHwnvXRFCmYYAzMUvY.jpg,1950-08-26,Rashomon,False,8.091,2121,1950
3,/b4yiLlIFuiULuuLTxT0Pt1QyT6J.jpg,"[16, 10751, 14, 12]",12092,en,Alice in Wonderland,"On a golden afternoon, young Alice follows a W...",75.465,/20cvfwfaFqNbe9Fc3VEHJuPRxmn.jpg,1951-07-28,Alice in Wonderland,False,7.2,5697,1951
4,/mxf8hJJkHTCqZP3m4o8E1TtwHHs.jpg,"[35, 10749]",872,en,Singin' in the Rain,"In 1927 Hollywood, a silent film production co...",31.407,/w03EiJVHP8Un77boQeE7hg9DVdU.jpg,1952-04-09,Singin' in the Rain,False,8.2,3036,1952


## Step 1: Create a Weaviate instance (database)

Embedded Weaviate is a quick way to create a Weaviate database. Note that it is suitable for evaluation use only, and is currently not compatible with Windows (we are working on it 😉).

You can also use:
- A free sandbox with Weaviate Cloud Services
- Open-source Weaviate directly, available cross-platform with Docker

In [3]:
!pip show weaviate-client

Name: weaviate-client
Version: 4.9.0
Summary: A python native Weaviate client
Home-page: https://github.com/weaviate/weaviate-python-client
Author: Weaviate
Author-email: hello@weaviate.io,
License: BSD 3-clause
Location: /Users/jphwang/code/weaviate-tutorials/intro-workshop/.venv/lib/python3.11/site-packages
Requires: authlib, grpcio, grpcio-health-checking, grpcio-tools, httpx, pydantic, requests, validators
Required-by: 


In [4]:
import weaviate
import os
import json

your_cohere_apikey = os.environ["COHERE_APIKEY"]

client = weaviate.connect_to_embedded(
    version="1.27.0",
    headers={
        "X-Cohere-Api-Key": your_cohere_apikey,  # Replace this with your actual key
    },
    environment_variables={
        "ENABLE_API_BASED_MODULES": "true"
    }
)

{"action":"startup","build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-10-29T12:34:24Z"}
{"action":"startup","auto_schema_enabled":true,"build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-10-29T12:34:24Z"}
{"build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-10-29T12:34:24Z"}
{"build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","le

Retrieve Weaviate instance information to check our configuration.

In [5]:
client.get_meta()

{'hostname': 'http://127.0.0.1:8079',
 'modules': {'generative-anthropic': {'documentationHref': 'https://docs.anthropic.com/en/api/getting-started',
   'name': 'Generative Search - Anthropic'},
  'generative-anyscale': {'documentationHref': 'https://docs.anyscale.com/endpoints/overview',
   'name': 'Generative Search - Anyscale'},
  'generative-aws': {'documentationHref': 'https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html',
   'name': 'Generative Search - AWS'},
  'generative-cohere': {'documentationHref': 'https://docs.cohere.com/reference/chat',
   'name': 'Generative Search - Cohere'},
  'generative-databricks': {'documentationHref': 'https://docs.databricks.com/en/machine-learning/foundation-models/api-reference.html#completion-task',
   'name': 'Generative Search - Databricks'},
  'generative-friendliai': {'documentationHref': 'https://docs.friendli.ai/openapi/create-chat-completions',
   'name': 'Generative Search - FriendliAI'},
  'generative-google': {'docum

## Step 2: Add data to Weaviate

### Add collection definition

The equivalent of a SQL "table" is called a "collection" in Weaviate, as it is in many NoSQL databases.

In case a demo collection called "Movie" already exists, let's delete it first.

In [6]:
client.collections.delete("Movie")

{"action":"load_all_shards","build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","level":"error","msg":"failed to load all shards: context canceled","time":"2024-10-29T12:34:27Z"}


And create a new collection definition here.
We'll set up a collection called "Movie" with:
- Two "named vectors" -> which will save different "meanings" of the data,
- A "generative" module -> which will allow us to use LLMs with our data, and
- Properties to save our movie data (which are like SQL columns).
    - Just the title, overview, year and popularity for now.

In [7]:
from weaviate.classes.config import Configure, DataType, Property

client.collections.create(
    name="Movie",
    vectorizer_config=[
        Configure.NamedVectors.text2vec_cohere(
            name="title",
            source_properties=["title"]
        ),
        Configure.NamedVectors.text2vec_cohere(
            name="overview",
            source_properties=["title", "overview"]
        ),
    ],
    generative_config=Configure.Generative.cohere(),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
        ),
        Property(
            name="overview",
            data_type=DataType.TEXT,
        ),
        Property(
            name="year",
            data_type=DataType.INT,
        ),
        Property(
            name="popularity",
            data_type=DataType.NUMBER,
        ),
    ]
)

<weaviate.collections.collection.sync.Collection at 0x11a19fc50>

> Tip: You can get example collection definitions in our documentation:
> - https://weaviate.io/developers/weaviate/manage-data/collections
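
To make the two named vectors more concrete: each one is computed from its own set of source properties, so the "title" vector only sees the title, while the "overview" vector sees the title and the overview. The sketch below illustrates that idea in plain Python. It assumes the source properties are simply joined with a space; the exact text Weaviate sends to the vectorizer is an implementation detail and may differ. `NAMED_VECTOR_SOURCES` and `text_for_vector` are hypothetical names used only for this illustration.

```python
# Source properties per named vector, mirroring the collection definition above.
NAMED_VECTOR_SOURCES = {
    "title": ["title"],
    "overview": ["title", "overview"],
}


def text_for_vector(obj: dict, vector_name: str) -> str:
    """Join the source properties that feed one named vector (illustrative only)."""
    return " ".join(str(obj[p]) for p in NAMED_VECTOR_SOURCES[vector_name])


movie = {
    "title": "Guardians of the Galaxy Vol. 2",
    "overview": "The Guardians must fight to keep their newfound family together "
                "as they unravel the mysteries of Peter Quill's true parentage.",
    "year": 2017,
}

print(text_for_vector(movie, "title"))     # only the title
print(text_for_vector(movie, "overview"))  # the title followed by the overview
```

Because the inputs differ, the same movie ends up with two different vectors, and a search can target whichever "meaning" is more relevant.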

Was our collection created successfully? Let's take a look.

In [8]:
client.collections.exists("Movie")

True

### Add data

Now we'll add actual objects (the equivalent of SQL rows) to our collection.

First, let's build objects to add - and take a look at a couple.

In [9]:
data_columns = ['title', 'overview', 'year', 'popularity']

df = movie_df[data_columns]

df.head()

Unnamed: 0,title,overview,year,popularity
0,Cinderella,Cinderella has faith her dreams of a better li...,1950,100.819
1,Sunset Boulevard,A hack screenwriter writes a screenplay for a ...,1950,57.74
2,Rashomon,Brimming with action while incisively examinin...,1950,21.011
3,Alice in Wonderland,"On a golden afternoon, young Alice follows a W...",1951,75.465
4,Singin' in the Rain,"In 1927 Hollywood, a silent film production co...",1952,31.407


> If it all looks fine - let's add objects:
> - https://weaviate.io/developers/weaviate/manage-data/import

In [10]:
movies = client.collections.get("Movie")

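# Dynamic batching sends objects to Weaviate in automatically sized batches
with movies.batch.dynamic() as batch:
    for _, row in df.iterrows():
        # Build the properties dict from the selected columns
        obj_body = {
            c: row[c] for c in data_columns
        }
        batch.add_object(
            properties=obj_body
        )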

{"action":"hnsw_prefill_cache_async","build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-10-29T12:34:27Z","wait_for_cache_prefill":false}
{"action":"hnsw_vector_cache_prefill","build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","count":1000,"index_id":"vectors_overview","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-10-29T12:34:27Z","took":35375}
{"action":"hnsw_prefill_cache_async","build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-10-29T12:34:27Z","wait_for_cache_prefill":false}
{"build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","level":"info","msg":"Crea

In [11]:
print(len(movies.batch.failed_objects))

0


#### Confirm data load

Do we have data? 

Let's get an object count.

In [12]:
movies = client.collections.get("Movie")

movies.aggregate.over_all(total_count=True)

AggregateReturn(properties={}, total_count=1322)

Does the data look right?

Let's grab a few objects from Weaviate!

In [13]:
response = movies.query.fetch_objects(limit=3)
for o in response.objects:
    print(o.properties)

{'title': 'The Departed', 'overview': "To take down South Boston's Irish Mafia, the police send in one of their own to infiltrate the underworld, not realizing the syndicate has done likewise. While an undercover cop curries favor with the mob kingpin, a career criminal rises through the police ranks. But both sides soon discover there's a mole among them.", 'year': 2006, 'popularity': 99.772}
{'title': 'The Rock', 'overview': 'When vengeful General Francis X. Hummel seizes control of Alcatraz Island and threatens to launch missiles loaded with deadly chemical weapons into San Francisco, only a young FBI chemical weapons expert and notorious Federal prisoner have the stills to penetrate the impregnable island fortress and take him down.', 'year': 1996, 'popularity': 115.51}
{'title': 'World War Z', 'overview': 'Life for former United Nations investigator Gerry Lane and his family seems content. Suddenly, the world is plagued by a mysterious infection turning whole human populations int

Let's pause for a second - because we've done a lot!

#### What did we just do?

Here is a conceptual diagram:

![img](https://github.com/weaviate-tutorials/intro-workshop/blob/main/images/object_import_process_full.png?raw=1)

## Step 3: Work with the data

Let's try a few more involved queries.

### Filtering (similar to WHERE filter in SQL)

Let's find objects that meet a particular condition.

In [14]:
import weaviate.classes.query as wq

response = movies.query.fetch_objects(
    filters=wq.Filter.by_property("year").greater_than(2015),
    limit=3
)

for o in response.objects:
    print(o.properties["title"])

Ghostbusters
Arrival
Doctor Strange
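
For comparison, here is roughly what the same condition looks like as a SQL `WHERE` clause. This is a small sketch using Python's built-in `sqlite3` module and an in-memory table; the table name `movie` is an assumption for illustration, and the rows are a handful of (title, year) pairs taken from outputs shown in this notebook. The `ORDER BY` is there only to make the toy output deterministic; the Weaviate filter above does not sort.

```python
import sqlite3

# A few (title, year) pairs taken from the outputs shown in this notebook.
rows = [
    ("Cinderella", 1950),
    ("The Rock", 1996),
    ("Galaxy Quest", 1999),
    ("The Departed", 2006),
    ("Guardians of the Galaxy Vol. 2", 2017),
    ("Guardians of the Galaxy Vol. 3", 2023),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, year INTEGER)")
conn.executemany("INSERT INTO movie VALUES (?, ?)", rows)

# Same condition as Filter.by_property("year").greater_than(2015), with limit=3.
recent = [
    title
    for (title,) in conn.execute(
        "SELECT title FROM movie WHERE year > ? ORDER BY year LIMIT 3", (2015,)
    )
]
conn.close()

print(recent)
```

In both cases an object either matches the condition or it doesn't; there is no notion of one match being "better" than another.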


But this does not rank the results in any meaningful way. 

For that, we need a keyword search (as opposed to a keyword *filter*).

### Keyword search

Unlike a keyword filter, a keyword search ranks the matching results. Here we use the BM25 algorithm, which scores each object based largely on how often the keyword appears in it.

In [15]:
response = movies.query.bm25(
    query="galaxy",
    limit=5,
    return_metadata=wq.MetadataQuery(score=True, last_update_time=True)
)

for o in response.objects:
    print(o.metadata.score)
    print(o.metadata.last_update_time)
    print(o.properties)

3.2621753215789795
2024-10-29 12:34:34.957000+00:00
{'title': 'Guardians of the Galaxy Vol. 3', 'overview': 'Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mission that, if not completed successfully, could quite possibly lead to the end of the Guardians as we know them.', 'year': 2023, 'popularity': 165.416}
3.2621753215789795
2024-10-29 12:34:34.086000+00:00
{'overview': "The Guardians must fight to keep their newfound family together as they unravel the mysteries of Peter Quill's true parentage.", 'year': 2017, 'title': 'Guardians of the Galaxy Vol. 2', 'popularity': 142.267}
2.128286361694336
2024-10-29 12:34:31.492000+00:00
{'title': 'Star Wars: Episode II - Attack of the Clones', 'overview': 'Following an assassination attempt on Senator Padmé Amidala, Jedi Knights Anakin Skywalker and Obi-Wan Kenobi investigate a mysterious plot that could change the galaxy forever.', 'year': 200
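
To get a feel for where those scores come from, here is a simplified, textbook version of the BM25 formula in plain Python. It is a sketch for intuition only, not Weaviate's implementation: the tokenizer is a naive regex split, the parameters `k1=1.2` and `b=0.75` are commonly used defaults that we assume here, and the function names are ours.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with the textbook Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for t in tokenized if term in t)  # documents containing the term
            if df == 0:
                continue
            # Rarer terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = counts[term]
            # Term frequency saturates, and longer documents are penalized.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


titles = [
    "Guardians of the Galaxy Vol. 3",
    "Galaxy Quest",
    "Singin' in the Rain",
]
for title, score in zip(titles, bm25_scores("galaxy", titles)):
    print(f"{score:.3f}  {title}")
```

A document without the keyword scores zero, and among documents that contain it, the shorter one scores higher because the keyword makes up a larger share of its text.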

### Semantic search

A semantic search, on the other hand, finds objects based on how similar their meaning is to the query.

In [16]:
response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata)
    print(json.dumps(o.properties, indent=2))

MetadataReturn(creation_time=None, last_update_time=None, distance=0.46619880199432373, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None)
{
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "year": 1999,
  "popularity": 62.01
}
MetadataReturn(creation_time=None, last_update_time=None, distance=0.523431658744812, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=

#### How does this work?

- Under the hood, this uses a vector search. It looks for objects which are the most similar to a text input.
- We can inspect the similarity along with the results.

In [17]:
response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",  # or "overview"
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata.distance)
    print(json.dumps(o.properties, indent=2))

0.46619880199432373
{
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "year": 1999,
  "popularity": 62.01
}
0.523431658744812
{
  "title": "Guardians of the Galaxy Vol. 2",
  "overview": "The Guardians must fight to keep their newfound family together as they unravel the mysteries of Peter Quill's true parentage.",
  "year": 2017,
  "popularity": 142.267
}
0.5245879888534546
{
  "title": "Stargate",
  "overview": "An interstel

This is where "vectors" come in. 

Each object in Weaviate includes a vector - like so:

In [18]:
response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",
    include_vector=True,
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata.distance)
    print(o.vector)
    print(json.dumps(o.properties, indent=2))

0.46619880199432373
{'overview': [0.0171356201171875, 0.06304931640625, -0.04339599609375, 0.014739990234375, -0.019317626953125, -0.0550537109375, -0.0297393798828125, -0.0567626953125, -0.050079345703125, 0.034088134765625, 0.0841064453125, -0.04034423828125, 0.03570556640625, -0.034820556640625, 0.05108642578125, -0.024871826171875, -0.01035308837890625, -0.025909423828125, -0.01329803466796875, -0.00933837890625, -0.00658416748046875, 0.054718017578125, 0.037445068359375, -0.004016876220703125, 0.02593994140625, 0.01413726806640625, 0.044647216796875, 0.0019817352294921875, 0.0692138671875, 0.021026611328125, 0.0177154541015625, -0.043731689453125, 0.031219482421875, 0.038055419921875, 0.0030345916748046875, 0.018157958984375, 0.0193328857421875, -0.00823974609375, 0.01282501220703125, 0.0137481689453125, -0.003551483154296875, -0.01511383056640625, -0.035003662109375, 0.011962890625, -0.04052734375, 0.00507354736328125, 0.037628173828125, 0.0237579345703125, 0.066162109375, -0.004

These vector representations come from deep learning models similar to those that power LLMs. They capture meaning, and are called vector "embeddings".
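
The `distance` values we printed earlier measure how far apart two of these vectors are. The sketch below shows the idea with made-up 3-dimensional vectors, assuming cosine distance (Weaviate's default metric unless configured otherwise). It compares the query against every vector by brute force; Weaviate itself uses an approximate index (HNSW, as the startup logs suggest) to avoid that. All names and numbers here are invented for illustration.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions or more.
toy_vectors = {
    "space comedy": [0.9, 0.1, 0.3],
    "space opera": [0.8, 0.2, 0.1],
    "period drama": [0.1, 0.9, 0.2],
}
query_vector = [1.0, 0.0, 0.2]  # stands in for the embedding of the query text

# Rank by distance to the query: the smallest distance is the best match.
ranked = sorted(toy_vectors, key=lambda name: cosine_distance(query_vector, toy_vectors[name]))
for name in ranked:
    print(f"{cosine_distance(query_vector, toy_vectors[name]):.3f}  {name}")
```

The vectors pointing in nearly the same direction as the query come first, which is exactly how "Galaxy Quest" ended up ahead of the other results for the query "galaxy".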

### Generative search

A generative search transforms your data at retrieval time. 

In [19]:
response = movies.generate.near_text(
    query="galaxy",
    limit=5,
    target_vector="title",
    single_prompt="Write a tweet promoting the movie with TITLE: {title} and OVERVIEW: {overview}.",
    grouped_task="What audience demographic might enjoy this group of movies?"
)

print(response.generated)
for o in response.objects:
    print(o.generated)
    print(json.dumps(o.properties, indent=2))

Science fiction fans would most likely enjoy this selection of movies. Movies like Star Wars, Stargate and Interstellar are classic sci-fi films that have gained cult followings over the years. The other two films, Galaxy Quest and Guardians of the Galaxy Vol. 2, also fall into the science fiction genre and would appeal to fans of the respective series.

Furthermore, these movies could appeal to fans of space opera, a sub-genre of science fiction that focuses on dramatic stories set in space, usually involving conflict between planets, galaxies or even different universes. The epic and adventurous tone of these movies fits well within the space opera category.

Also, the comedy elements in Galaxy Quest and the Marvel Cinematic Universe's Guardians of the Galaxy Vol. 2 could extend the demographic reach of these movies to fans of science fiction humor.
"Galactic adventures await! Buckle up and embark on a cosmic journey with the iconic crew of the NSEA Protector in Galaxy Quest. From th

You can see here ⬆️ that each object has been transformed into a tweet by the LLM based on our prompt.
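
Conceptually, the `{title}` and `{overview}` placeholders in `single_prompt` are filled in with each retrieved object's properties, and the resulting text is what the LLM receives. Weaviate does this for us on the server side; the sketch below only mimics the substitution step with Python's `str.format`, and `fill_prompt` is a hypothetical helper, not a Weaviate function.

```python
def fill_prompt(template: str, properties: dict) -> str:
    """Replace each {name} placeholder with the matching property value."""
    return template.format(**properties)


single_prompt = "Write a tweet promoting the movie with TITLE: {title} and OVERVIEW: {overview}."
movie = {
    "title": "Guardians of the Galaxy Vol. 2",
    "overview": "The Guardians must fight to keep their newfound family together "
                "as they unravel the mysteries of Peter Quill's true parentage.",
    "year": 2017,
}

# One filled-in prompt is produced per retrieved object.
print(fill_prompt(single_prompt, movie))
```

Properties that the template does not mention (like `year` here) are simply left out of the prompt.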

You can ask the LLM to perform all sorts of tasks.

In [20]:
response = movies.generate.near_text(
    query="galaxy",
    target_vector="title",
    limit=3,
    single_prompt="Summarise the following movie overview into a short French sentence: {overview}."
)

for o in response.objects:
    print(o.generated)
    print(json.dumps(o.properties, indent=2))

Une équipe d'acteurs télé oubliés sont enlevés par des aliens qui ont mal interprété leurs émissions télévisées comme des documents historiques.
{
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "year": 1999,
  "popularity": 62.01
}
Les Gardiens doivent se battre pour protéger leur nouvelle famille et découvrir les secrets de la parenté de Peter Quill.
{
  "title": "Guardians of the Galaxy Vol. 2",
  "overview": "The Guardians

The LLM is multi-lingual!

You can also send groups of results to the LLM with Weaviate.

In [21]:
response = movies.generate.near_text(
    query="galaxy",
    target_vector="title",
    limit=3,
    grouped_task="Write a poem about these movies"
)

print(response.generated)
for o in response.objects:
    print(json.dumps(o.properties, indent=2))

Three stories, one a comedy gem,
A space adventure with a unique twist,
A crew of actors, beamed up to exist,
In a galaxy where their fame had expired.

Guardians, a band of misfits, bold,
Uniting to fight, a family's bond,
Unraveling secrets, Quill's true origin,
In the vastness of space, their mission unfolds.

Through the Stargate, a portal's open,
To a world of ancient gods and sand,
Ra, the mighty, rules with a strong hand,
A mysterious journey, the explorers venture.

Three tales, each a epic voyage,
In the cosmos, with courage, they dare,
Defying dangers, seeking truth there,
In the vast unknown, their destinies interlace. 

So, from the funny to the epic, 
Adventures in space, our hearts capture,
The wonder of worlds, a treasure to capture,
In the deep of the night, the stars inspire. 

Let the movies take flight,
And our imaginations soar,
With laughter and wonder, a cosmic voyage,
To new universes, our minds explore.
{
  "title": "Galaxy Quest",
  "overview": "For four years,

In [22]:
client.close()

{"action":"restapi_management","build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","level":"info","msg":"Shutting down... ","time":"2024-10-29T12:34:41Z","version":"1.27.0"}
{"action":"restapi_management","build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:8079","time":"2024-10-29T12:34:41Z","version":"1.27.0"}
{"action":"telemetry_push","build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","level":"info","msg":"telemetry terminated","payload":"\u0026{MachineID:20f9dd85-ea84-4a3a-947a-0c4c828d628e Type:TERMINATE Version:1.27.0 NumObjects:1322 OS:darwin Arch:arm64 UsedModules:[generative-cohere text2vec-cohere]}","time":"2024-10-29T12:34:41Z"}
{"build_git_commit":"6c571ff13","build_go_version":"go1.23.2","build_image_tag":"","build_wv_version":"","level":"info","msg":"c