## Weaviate workshop

<a target="_blank" href="https://colab.research.google.com/github/weaviate-tutorials/intro-workshop/blob/main/workshop.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Goals:

#### What you will see:


- How to create a vector database with Weaviate,
- How to add data to the database, and
- How to interact with the data, including searching it and using LLMs with it.

#### What you will learn:

- What Weaviate is,
- How it stores the data (based on its "meaning"), and
- What you can do with Weaviate, like semantic searches, and using LLMs to transform data.

Install the Weaviate Python client if your environment doesn't already have it.

In [1]:
# !pip install -U weaviate-client

## Preparation: Get the data

We'll use a dataset of movies from TMDB. 

Pre-processed version: "./data/movies.csv"


Load (or download) the data and preview it.

In [2]:
import pandas as pd

# movie_df = pd.read_csv("./data/movies.csv")
movie_df = pd.read_csv("https://raw.githubusercontent.com/weaviate-tutorials/intro-workshop/main/data/movies.csv")
movie_df.head()

| | backdrop_path | genre_ids | id | original_language | original_title | overview | popularity | poster_path | release_date | title | video | vote_average | vote_count | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /rH0DPF7pB35jxLxKb3JRUgCrrnp.jpg | [10751, 14, 16, 10749] | 11224 | en | Cinderella | Cinderella has faith her dreams of a better li... | 100.819 | /avz6S9HYWs4O8Oe4PenBFNX4uDi.jpg | 1950-02-22 | Cinderella | False | 7.044 | 6523 | 1950 |
| 1 | /p47ihFj4A7EpBjmPHdTj4ipyq1S.jpg | [18] | 599 | en | Sunset Boulevard | A hack screenwriter writes a screenplay for a ... | 57.74 | /sC4Dpmn87oz9AuxZ15Lmip0Ftgr.jpg | 1950-08-10 | Sunset Boulevard | False | 8.312 | 2485 | 1950 |
| 2 | /zyO6j74DKMWfp5snWg6Hwo0T3Mz.jpg | [80, 18, 9648] | 548 | ja | 羅生門 | Brimming with action while incisively examinin... | 21.011 | /vL7Xw04nFMHwnvXRFCmYYAzMUvY.jpg | 1950-08-26 | Rashomon | False | 8.091 | 2121 | 1950 |
| 3 | /b4yiLlIFuiULuuLTxT0Pt1QyT6J.jpg | [16, 10751, 14, 12] | 12092 | en | Alice in Wonderland | On a golden afternoon, young Alice follows a W... | 75.465 | /20cvfwfaFqNbe9Fc3VEHJuPRxmn.jpg | 1951-07-28 | Alice in Wonderland | False | 7.2 | 5697 | 1951 |
| 4 | /mxf8hJJkHTCqZP3m4o8E1TtwHHs.jpg | [35, 10749] | 872 | en | Singin' in the Rain | In 1927 Hollywood, a silent film production co... | 31.407 | /w03EiJVHP8Un77boQeE7hg9DVdU.jpg | 1952-04-09 | Singin' in the Rain | False | 8.2 | 3036 | 1952 |


## Step 1: Create a Weaviate instance (database)

Embedded Weaviate is a quick way to create a Weaviate database. Note that it is suitable for evaluation use only, and is currently not compatible with Windows (we are working on it 😉).

You can also use:
- A free sandbox with Weaviate Cloud Services
- Open-source Weaviate directly, available cross-platform with Docker

In [3]:
!pip show weaviate-client

Name: weaviate-client
Version: 4.7.1
Summary: A python native Weaviate client
Home-page: https://github.com/weaviate/weaviate-python-client
Author: Weaviate
Author-email: hello@weaviate.io,
License: BSD 3-clause
Location: /Users/jphwang/code/weaviate-tutorials/intro-workshop/venv/lib/python3.10/site-packages
Requires: authlib, grpcio, grpcio-health-checking, grpcio-tools, httpx, pydantic, requests, validators
Required-by: 


In [4]:
import weaviate
import os
import json

# Read API keys from environment variables (raises KeyError if one is not set)
your_cohere_apikey = os.environ["COHERE_APIKEY"]
your_openai_apikey = os.environ["OPENAI_APIKEY"]
your_anthropic_apikey = os.environ["ANTHROPIC_APIKEY"]

client = weaviate.connect_to_embedded(
    version="1.26.1",
    headers={
        "X-Cohere-Api-Key": your_cohere_apikey,  # Replace this with your actual key
        "X-OpenAI-Api-Key": your_openai_apikey,  # Replace this with your actual key
        "X-Anthropic-Api-Key": your_anthropic_apikey  # Replace this with your actual key
    },
    environment_variables={
        "ENABLE_API_BASED_MODULES": "true"
    }
)

{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-08-20T13:37:57+01:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-08-20T13:37:57+01:00"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-08-20T13:37:57+01:00"}
{"level":"info","msg":"module offload-s3 is enabled","time":"2024-08-20T13:37:57+01:00"}
{"level":"info","msg":"open cluster service","servers":{"Embedded_at_8079":50471},"time":"2024-08-20T13:37:57+01:00"}
{"address":"192.168.1.140:50472","level":"info","msg":"starting cloud rpc server ...","time":"2024-08-20T13:37:57+01:00"}
{"level":"info","msg":"starting raft sub-system ...","time":"2024-08-20T13:3

Retrieve Weaviate instance information to check our configuration.

In [5]:
client.get_meta()

{'hostname': 'http://127.0.0.1:8079',
 'modules': {'generative-anthropic': {'documentationHref': 'https://docs.anthropic.com/en/api/getting-started',
   'name': 'Generative Search - Anthropic'},
  'generative-anyscale': {'documentationHref': 'https://docs.anyscale.com/endpoints/overview',
   'name': 'Generative Search - Anyscale'},
  'generative-aws': {'documentationHref': 'https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html',
   'name': 'Generative Search - AWS'},
  'generative-cohere': {'documentationHref': 'https://docs.cohere.com/reference/chat',
   'name': 'Generative Search - Cohere'},
  'generative-mistral': {'documentationHref': 'https://docs.mistral.ai/api/',
   'name': 'Generative Search - Mistral'},
  'generative-octoai': {'documentationHref': 'https://octo.ai/docs/text-gen-solution/getting-started',
   'name': 'Generative Search - OctoAI'},
  'generative-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 

## Step 2: Add data to Weaviate

### Add collection definition

The equivalent of a SQL "table" is called a "collection" in Weaviate, as it is in many NoSQL databases.

In case a demo collection already exists, let's delete it first.

In [6]:
client.collections.delete("Movie")

Now let's create a new collection definition.
We'll set up a collection called "Movie" with:
- Two "named vectors" -> which will save different "meanings" of the data,
- A "generative" module -> which will allow us to use LLMs with our data, and
- Properties to save our movie data (which are like SQL columns).
    - Just the title, overview, year and popularity for now.

In [7]:
from weaviate.classes.config import Configure, DataType, Property

client.collections.create(
    name="Movie",
    vectorizer_config=[
        Configure.NamedVectors.text2vec_openai(
            name="title",
            source_properties=["title"]
        ),
        Configure.NamedVectors.text2vec_cohere(
            name="overview",
            source_properties=["title", "overview"]
        ),
    ],
    generative_config=Configure.Generative.anthropic(),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
        ),
        Property(
            name="overview",
            data_type=DataType.TEXT,
        ),
        Property(
            name="year",
            data_type=DataType.INT,
        ),
        Property(
            name="popularity",
            data_type=DataType.NUMBER,
        ),
    ]
)

{"action":"telemetry_push","level":"info","msg":"telemetry started","payload":"\u0026{MachineID:7e66353c-8e5d-4b1f-86c8-fad5ff708fa7 Type:INIT Version:1.26.1 NumObjects:0 OS:darwin Arch:arm64 UsedModules:[]}","time":"2024-08-20T13:38:00+01:00"}


<weaviate.collections.collection.sync.Collection at 0x1315cd0f0>

> Tip: You can get example collection definitions in our documentation:
> - https://weaviate.io/developers/weaviate/manage-data/collections

Was our collection created successfully? Let's take a look

In [8]:
client.collections.exists("Movie")

True

### Add data

We'll add actual objects (the equivalent of SQL rows) to our collection.

First, let's build objects to add - and take a look at a couple.

In [9]:
data_columns = ['title', 'overview', 'year', 'popularity']

df = movie_df[data_columns]

df.head()

| | title | overview | year | popularity |
|---|---|---|---|---|
| 0 | Cinderella | Cinderella has faith her dreams of a better li... | 1950 | 100.819 |
| 1 | Sunset Boulevard | A hack screenwriter writes a screenplay for a ... | 1950 | 57.74 |
| 2 | Rashomon | Brimming with action while incisively examinin... | 1950 | 21.011 |
| 3 | Alice in Wonderland | On a golden afternoon, young Alice follows a W... | 1951 | 75.465 |
| 4 | Singin' in the Rain | In 1927 Hollywood, a silent film production co... | 1952 | 31.407 |
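The collection definition gives each property a type (TEXT, INT, NUMBER), so the values in each object should match those types. As a minimal sketch of that idea, without pandas or Weaviate, the snippet below turns raw CSV rows into typed property dictionaries. `PROPERTY_TYPES` and `row_to_properties` are hypothetical names introduced for illustration, and the sample rows are copied from the preview.

```python
import csv
import io

# Two rows copied from the preview (the overviews are truncated there too).
SAMPLE_CSV = """title,overview,year,popularity
Cinderella,Cinderella has faith her dreams of a better li...,1950,100.819
Rashomon,Brimming with action while incisively examinin...,1950,21.011
"""

# Hypothetical mapping from property name to the Python type matching
# the collection definition (TEXT -> str, INT -> int, NUMBER -> float).
PROPERTY_TYPES = {"title": str, "overview": str, "year": int, "popularity": float}


def row_to_properties(row):
    """Convert one CSV row (all strings) into a typed property dict."""
    return {name: cast(row[name]) for name, cast in PROPERTY_TYPES.items()}


objects = [row_to_properties(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for obj in objects:
    print(obj)
```

Each resulting dictionary has the same shape as the `properties` argument used when adding objects in the next cell.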


> If it all looks fine - let's add objects:
> - https://weaviate.io/developers/weaviate/manage-data/import

In [10]:
movies = client.collections.get("Movie")

with movies.batch.dynamic() as batch:
    for i, row in df.iterrows():
        obj_body = {
            c: row[c] for c in data_columns
        }
        batch.add_object(
            properties=obj_body
        )

{"action":"hnsw_prefill_cache_async","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-08-20T13:38:00+01:00","wait_for_cache_prefill":false}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"vectors_overview","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-08-20T13:38:00+01:00","took":67208}
{"action":"hnsw_prefill_cache_async","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-08-20T13:38:00+01:00","wait_for_cache_prefill":false}
{"level":"info","msg":"Created shard movie_h6OXTYt85gPw in 2.676667ms","time":"2024-08-20T13:38:00+01:00"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"vectors_title","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-08-20T13:38:00+01:00","took":45667}
{"action":"bootstrap","level":"info","msg":"node reporting ready, node has probably recovered cluster from raft config. Exitin

In [11]:
print(len(movies.batch.failed_objects))

0


#### Confirm data load

Do we have data? 

Let's get an object count

In [12]:
movies = client.collections.get("Movie")

movies.aggregate.over_all(total_count=True)

AggregateReturn(properties={}, total_count=1322)

Does the data look right?

Let's grab a few objects from Weaviate!

In [13]:
response = movies.query.fetch_objects(limit=3)
for o in response.objects:
    print(o.properties)

{'title': 'Doctor Strange in the Multiverse of Madness', 'overview': 'Doctor Strange, with the help of mystical allies both old and new, traverses the mind-bending and dangerous alternate realities of the Multiverse to confront a mysterious new adversary.', 'year': 2022, 'popularity': 122.959}
{'title': 'Sinister', 'overview': "True-crime writer Ellison Oswald is in a slump; he hasn't had a best seller in more than 10 years and is becoming increasingly desperate for a hit. So, when he discovers the existence of a snuff film showing the deaths of a family, he vows to solve the mystery. He moves his own family into the victims' home and gets to work. However, when old film footage and other clues hint at the presence of a supernatural force, Ellison learns that living in the house may be fatal.", 'year': 2012, 'popularity': 52.57}
{'overview': 'With computer genius Luther Stickell at his side and a beautiful thief on his mind, agent Ethan Hunt races across Australia and Spain to stop a f

Let's pause for a second - because we've done a lot!

#### What did we just do?

Here is a conceptual diagram:

![img](https://github.com/weaviate-tutorials/intro-workshop/blob/main/images/object_import_process_full.png?raw=1)

## Step 3: Work with the data

Let's try a few more involved queries

### Filtering (similar to WHERE filter in SQL)

Let's find objects that meet a particular condition.

In [14]:
import weaviate.classes.query as wq

response = movies.query.fetch_objects(
    filters=wq.Filter.by_property("year").greater_than(2015),
    limit=3
)

for o in response.objects:
    print(o.properties["title"])

Doctor Strange
Hacksaw Ridge
Silence
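For intuition, a filter behaves much like a condition applied to a plain Python list. The sketch below is an analogy only, not how Weaviate evaluates filters internally; `filter_greater_than` is a hypothetical helper, and the titles and years are copied from outputs in this notebook.

```python
# Titles and years copied from outputs shown in this notebook.
sample_movies = [
    {"title": "Galaxy Quest", "year": 1999},
    {"title": "Sinister", "year": 2012},
    {"title": "Guardians of the Galaxy Vol. 2", "year": 2017},
    {"title": "Doctor Strange in the Multiverse of Madness", "year": 2022},
    {"title": "Guardians of the Galaxy Vol. 3", "year": 2023},
]


def filter_greater_than(objects, prop, threshold, limit):
    """Keep objects whose `prop` exceeds `threshold`, in stored order."""
    matches = [o for o in objects if o[prop] > threshold]
    return matches[:limit]


filtered = filter_greater_than(sample_movies, "year", 2015, limit=3)
for movie in filtered:
    print(movie["title"])
```

Each object either passes the condition or it doesn't; the order of the matches carries no information about relevance.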


But this does not rank the results in any meaningful way.

For that, we need a search (as opposed to a *filter*), such as a keyword search.

### Keyword search

Unlike a filter, a keyword search scores each matching object, based on factors such as how often the keyword appears, and ranks the results by that score.

In [15]:
response = movies.query.bm25(
    query="galaxy",
    limit=5,
    return_metadata=wq.MetadataQuery(score=True, last_update_time=True)
)

for o in response.objects:
    print(o.metadata.score)
    print(o.metadata.last_update_time)
    print(o.properties)

3.2621753215789795
2024-08-20 12:38:07.476000+00:00
{'title': 'Guardians of the Galaxy Vol. 3', 'overview': 'Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mission that, if not completed successfully, could quite possibly lead to the end of the Guardians as we know them.', 'year': 2023, 'popularity': 165.416}
3.2621753215789795
2024-08-20 12:38:06.825000+00:00
{'title': 'Guardians of the Galaxy Vol. 2', 'overview': "The Guardians must fight to keep their newfound family together as they unravel the mysteries of Peter Quill's true parentage.", 'year': 2017, 'popularity': 142.267}
2.128286361694336
2024-08-20 12:38:05.320000+00:00
{'title': 'Star Wars: Episode II - Attack of the Clones', 'overview': 'Following an assassination attempt on Senator Padmé Amidala, Jedi Knights Anakin Skywalker and Obi-Wan Kenobi investigate a mysterious plot that could change the galaxy forever.', 'year': 200
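To make the scores less mysterious, here is a minimal sketch of the textbook BM25 formula in plain Python. It is an illustration only: Weaviate's own implementation (for example its tokenization and handling of multiple properties) may differ, and the `k1` and `b` values are commonly used defaults assumed here. The helper names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real tokenizers do more than this."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with the textbook BM25 formula."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency (IDF).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalised via `b`.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


titles = [
    "Guardians of the Galaxy Vol. 3",
    "Galaxy Quest",
    "Sunset Boulevard",
]
scores = bm25_scores("galaxy", titles)
for title, score in zip(titles, scores):
    print(f"{score:.3f}  {title}")
```

In this toy example the shorter title that contains the keyword scores higher than the longer one, and a title without the keyword scores zero.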

### Semantic search

A semantic search, on the other hand, finds objects based on similarity of meaning rather than exact keyword matches.

In [16]:
response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata)
    print(json.dumps(o.properties, indent=2))

MetadataReturn(creation_time=None, last_update_time=None, distance=0.13486772775650024, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None)
{
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "year": 1999,
  "popularity": 62.01
}
MetadataReturn(creation_time=None, last_update_time=None, distance=0.16071271896362305, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_scor

#### How does this work?

- Under the hood, this uses a vector search. It looks for objects which are the most similar to a text input.
- We can inspect the distance (lower means more similar) along with the results.

In [17]:
response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",  # or "overview"
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata.distance)
    print(json.dumps(o.properties, indent=2))

0.13486772775650024
{
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "year": 1999,
  "popularity": 62.01
}
0.16071271896362305
{
  "title": "Star Wars",
  "overview": "Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful pr

This is where "vectors" come in. 

Each object in Weaviate includes one or more vectors (here, one per named vector) - like so:

In [18]:
response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",
    include_vector=True,
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata.distance)
    print(o.vector)
    print(json.dumps(o.properties, indent=2))

0.13486772775650024
{'title': [0.001771763782016933, -0.02627941593527794, -0.018018173053860664, -0.04696746543049812, -0.03687505051493645, 0.029885845258831978, -0.023357927799224854, -0.046659938991069794, -0.023623516783118248, 0.0002955123782157898, 0.030500896275043488, -0.004354275297373533, 0.03293313831090927, -0.0005744253867305815, 0.01859128847718239, -0.0036378817167133093, 0.02812456525862217, -0.023315992206335068, 0.012426808476448059, -0.017570864409208298, 0.02307835780084133, 0.029075097292661667, 0.005797546356916428, -0.023022444918751717, 0.02534286119043827, 0.0018311720341444016, 0.006968238390982151, -0.02359556034207344, 0.0023763300850987434, 0.005510989110916853, 0.017724627628922462, -0.004937874153256416, -0.01195154245942831, -0.006524424068629742, -0.016061196103692055, -0.0026209522038698196, 0.016899900510907173, -0.0037427200004458427, -0.0037462145555764437, -0.012867128476500511, 0.02647511288523674, 0.010190262459218502, -0.003429953008890152, -0.

These vector representations come from deep learning models similar to those that power LLMs. They capture meaning, and are called vector "embeddings".
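The `distance` values printed earlier measure how far apart two vectors are; smaller means more similar. As a rough sketch, assuming cosine distance (a common choice for text embeddings; the metric actually configured for this collection is not shown in this notebook), it can be computed as follows. The three-dimensional vectors are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors for illustration only; real embeddings
# have hundreds or thousands of dimensions and come from a model.
query_vector = [0.9, 0.1, 0.0]
toy_vectors = {
    "space movie": [0.8, 0.2, 0.1],
    "musical": [0.1, 0.2, 0.9],
}

# Rank the stored vectors from closest to furthest from the query.
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_distance(query_vector, toy_vectors[name]),
)
for name in ranked:
    print(name, round(cosine_distance(query_vector, toy_vectors[name]), 4))
```

A vector search does conceptually the same thing at scale: it embeds the query, then returns the stored objects whose vectors are closest to it.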

### Generative search

A generative search transforms your data at retrieval time. 

In [19]:
response = movies.generate.near_text(
    query="galaxy",
    limit=5,
    target_vector="title",
    single_prompt="Write a tweet promoting the movie with TITLE: {title} and OVERVIEW: {overview}.",
    grouped_task="What audience demographic might enjoy this group of movies?"
)

print(response.generated)
for o in response.objects:
    print(o.generated)
    print(json.dumps(o.properties, indent=2))

I apologize, but there seems to be an issue with the input data format. The provided string appears to be a byte array encoding of JSON data rather than properly formatted JSON. Without being able to parse the actual movie information, I can't provide a specific analysis of the audience demographic.

However, if you can provide the movie information in a readable format, I'd be happy to analyze it and suggest potential audience demographics that might enjoy those films. Generally, factors like genre, themes, release dates, and content ratings help determine which demographics a group of movies might appeal to.
Here's a tweet promoting Galaxy Quest without reproducing copyrighted material:

"Blast off with the hilarious sci-fi comedy Galaxy Quest! Join a washed-up TV crew thrust into a real space adventure. Can these actors become the heroes they once portrayed? Stars Tim Allen, Sigourney Weaver & Alan Rickman. A laugh-out-loud journey across the galaxy! #GalaxyQuest"
{
  "overview": "F

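You can see here ⬆️ that each object has been transformed into a tweet by the LLM based on our prompt. The first part of the output is the response to the grouped task; in this run, the LLM reported that it could not read the grouped data, so focus on the per-object results.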
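Conceptually, a single prompt is a template that gets filled in once per retrieved object, with `{title}` and `{overview}` replaced by that object's property values. The sketch below illustrates the idea with plain string formatting; it is not Weaviate's internal implementation, and `fill_prompt` is a hypothetical helper. The properties are copied from results shown in this notebook.

```python
SINGLE_PROMPT = (
    "Write a tweet promoting the movie with TITLE: {title} "
    "and OVERVIEW: {overview}."
)

# Properties copied from results shown in this notebook.
retrieved = [
    {
        "title": "Guardians of the Galaxy Vol. 2",
        "overview": "The Guardians must fight to keep their newfound family "
        "together as they unravel the mysteries of Peter Quill's true parentage.",
    },
    {
        "title": "Guardians of the Galaxy Vol. 3",
        "overview": "Peter Quill, still reeling from the loss of Gamora, must "
        "rally his team around him to defend the universe along with "
        "protecting one of their own. A mission that, if not completed "
        "successfully, could quite possibly lead to the end of the Guardians "
        "as we know them.",
    },
]


def fill_prompt(template, properties):
    """Substitute {property} placeholders with one object's values."""
    return template.format(**properties)


# One prompt per object, so the LLM is called once per object.
prompts = [fill_prompt(SINGLE_PROMPT, obj) for obj in retrieved]
for prompt in prompts:
    print(prompt)
    print()
```

Because each object yields its own prompt, each object also gets its own generated text.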

You can ask LLMs to perform all sorts of tasks, such as summarising and translating.

In [20]:
response = movies.generate.near_text(
    query="galaxy",
    target_vector="title",
    limit=3,
    single_prompt="Summarise the following movie overview into a short French sentence: {overview}."
)

for o in response.objects:
    print(o.generated)
    print(json.dumps(o.properties, indent=2))



D'anciens acteurs d'une série de science-fiction sont recrutés par des extraterrestres pour sauver l'univers, croyant que leurs aventures télévisées étaient réelles.
{
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "year": 1999,
  "popularity": 62.01
}
Voici un résumé en français :

Luke Skywalker et Han Solo s'allient pour sauver la princesse Leia et combattre l'Empire galactique.

This short French sentence summarizes the k

The LLM is multilingual!

You can also send groups of results to the LLM with Weaviate.
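The difference from a single prompt is that a grouped task produces one prompt for the whole result set rather than one per object. The sketch below shows that idea only; the exact format Weaviate uses when passing the objects to the LLM is not shown in this notebook, so the JSON layout and the `build_grouped_prompt` helper are assumptions made for illustration.

```python
import json

GROUPED_TASK = "Write a poem about these movies"

# Properties copied from results shown in this notebook.
retrieved = [
    {"title": "Galaxy Quest", "year": 1999, "popularity": 62.01},
    {"title": "Guardians of the Galaxy Vol. 2", "year": 2017, "popularity": 142.267},
]


def build_grouped_prompt(task, objects):
    """Combine the task and every retrieved object into one prompt string."""
    # Assumed layout: the task, a blank line, then the objects as JSON.
    context = json.dumps(objects, indent=2)
    return f"{task}\n\n{context}"


grouped_prompt = build_grouped_prompt(GROUPED_TASK, retrieved)
print(grouped_prompt)
```

Since there is only one prompt, there is only one generated response for the whole group, which is why the cell below prints `response.generated` once, outside the loop.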

In [21]:
response = movies.generate.near_text(
    query="galaxy",
    target_vector="title",
    limit=3,
    grouped_task="Write a poem about these movies"
)

print(response.generated)
for o in response.objects:
    print(json.dumps(o.properties, indent=2))

I apologize, but it seems there was an error in the input you provided. The data appears to be a byte array that likely contains encoded JSON information about movies, but it's not in a readable format for me to directly use.

To write a poem about movies, I would need the actual movie titles and descriptions in plain text. If you could provide the movie information in a readable format, I'd be happy to compose a poem based on that content.
{
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no cl

In [22]:
client.close()

{"action":"restapi_management","docker_image_tag":"unknown","level":"info","msg":"Shutting down... ","time":"2024-08-20T13:38:34+01:00"}
{"action":"restapi_management","docker_image_tag":"unknown","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:8079","time":"2024-08-20T13:38:34+01:00"}
{"action":"telemetry_push","level":"info","msg":"telemetry terminated","payload":"\u0026{MachineID:7e66353c-8e5d-4b1f-86c8-fad5ff708fa7 Type:TERMINATE Version:1.26.1 NumObjects:1130 OS:darwin Arch:arm64 UsedModules:[generative-anthropic text2vec-cohere text2vec-openai]}","time":"2024-08-20T13:38:34+01:00"}
{"level":"info","msg":"closing raft FSM store ...","time":"2024-08-20T13:38:34+01:00"}
{"level":"info","msg":"shutting down raft sub-system ...","time":"2024-08-20T13:38:34+01:00"}
{"level":"info","msg":"transferring leadership to another server","time":"2024-08-20T13:38:34+01:00"}
{"error":"cannot find peer","level":"error","msg":"transferring leadership","time":"2024-08-20T13:38:34