## Weaviate workshop

<a target="_blank" href="https://colab.research.google.com/github/weaviate-tutorials/intro-workshop/blob/main/workshop.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Goals:

#### What you will see:


- Create a vector database with Weaviate,
- Add data to the database, and
- Interact with the data, including searching, and using LLMs with your data in Weaviate

#### What you will learn today:

- What Weaviate is,
- How it stores the data (based on its "meaning"), and
- What you can do with Weaviate, like semantic searches, and using LLMs to transform data.

Install the Weaviate Python client, for environments that don't yet have it.

In [1]:
# !pip install -U --pre weaviate-client

## Preparation: Get the data

We'll use a dataset of movies from TMDB. 

Pre-processed version: "./data/movies.csv"


Load (or download) the data, and preview it

In [2]:
import pandas as pd

movie_df = pd.read_csv("./data/movies.csv")
movie_df.head()

Unnamed: 0,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count,year
0,/rH0DPF7pB35jxLxKb3JRUgCrrnp.jpg,"[10751, 14, 16, 10749]",11224,en,Cinderella,Cinderella has faith her dreams of a better li...,100.819,/avz6S9HYWs4O8Oe4PenBFNX4uDi.jpg,1950-02-22,Cinderella,False,7.044,6523,1950
1,/p47ihFj4A7EpBjmPHdTj4ipyq1S.jpg,[18],599,en,Sunset Boulevard,A hack screenwriter writes a screenplay for a ...,57.74,/sC4Dpmn87oz9AuxZ15Lmip0Ftgr.jpg,1950-08-10,Sunset Boulevard,False,8.312,2485,1950
2,/zyO6j74DKMWfp5snWg6Hwo0T3Mz.jpg,"[80, 18, 9648]",548,ja,羅生門,Brimming with action while incisively examinin...,21.011,/vL7Xw04nFMHwnvXRFCmYYAzMUvY.jpg,1950-08-26,Rashomon,False,8.091,2121,1950
3,/b4yiLlIFuiULuuLTxT0Pt1QyT6J.jpg,"[16, 10751, 14, 12]",12092,en,Alice in Wonderland,"On a golden afternoon, young Alice follows a W...",75.465,/20cvfwfaFqNbe9Fc3VEHJuPRxmn.jpg,1951-07-28,Alice in Wonderland,False,7.2,5697,1951
4,/mxf8hJJkHTCqZP3m4o8E1TtwHHs.jpg,"[35, 10749]",872,en,Singin' in the Rain,"In 1927 Hollywood, a silent film production co...",31.407,/w03EiJVHP8Un77boQeE7hg9DVdU.jpg,1952-04-09,Singin' in the Rain,False,8.2,3036,1952


## Step 1: Create a Weaviate instance (database)

Embedded Weaviate, used here, is a quick way to create a Weaviate database. Note that it is suitable for evaluation use only, and is currently not compatible with Windows (we are working on it 😉).

You can also use:
- A free sandbox with Weaviate Cloud Services
- Open-source Weaviate directly, available cross-platform with Docker
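
The connection cell below reads its API keys from environment variables, and a missing one surfaces as a bare `KeyError`. As a minimal sketch, assuming you want a friendlier message, a small helper can check for all keys up front. The name `require_env` and the placeholder values are hypothetical, not part of the Weaviate client.

```python
import os


def require_env(names, env=None):
    """Return {name: value} for every name, or raise listing all missing ones."""
    env = os.environ if env is None else env
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {n: env[n] for n in names}


# Demo with a stand-in mapping; the values are placeholders, not real keys.
demo_env = {
    "COHERE_APIKEY": "cohere-placeholder",
    "OPENAI_APIKEY": "openai-placeholder",
}
keys = require_env(["COHERE_APIKEY", "OPENAI_APIKEY"], env=demo_env)
print(sorted(keys))
```

In a real session you would call `require_env([...])` without the `env` argument so that it reads from `os.environ`.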

In [3]:
import weaviate
import os
import json

your_cohere_apikey = os.environ["COHERE_APIKEY"]
your_openai_apikey = os.environ["OPENAI_APIKEY"]

client = weaviate.connect_to_embedded(
    version="1.25.5",
    headers={
        "X-Cohere-Api-Key": your_cohere_apikey,  # Replace this with your actual key
        "X-OpenAI-Api-Key": your_openai_apikey,  # Replace this with your actual key
    },
    environment_variables={
        "ENABLE_MODULES": "text2vec-cohere, generative-cohere, text2vec-openai, generative-openai, text2vec-ollama, generative-ollama"
    }
)

Started /Users/jphwang/.cache/weaviate-embedded: process ID 14361
listen tcp :6060: bind: address already in use


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-06-25T20:08:07+01:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-06-25T20:08:07+01:00"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-06-25T20:08:07+01:00"}
{"level":"info","msg":"open cluster service","servers":{"Embedded_at_8079":8300},"time":"2024-06-25T20:08:07+01:00"}
{"address":"192.168.1.140:8301","level":"info","msg":"starting cloud rpc server ...","time":"2024-06-25T20:08:07+01:00"}
{"level":"info","msg":"starting raft sub-system ...","time":"2024-06-25T20:08:07+01:00"}
{"address":"192.168.1.140:8300","level":"info","msg":"tcp transport","tcpMaxPo

Retrieve Weaviate instance information to check our configuration.

In [4]:
client.get_meta()

{'hostname': 'http://127.0.0.1:8079',
 'modules': {'generative-cohere': {'documentationHref': 'https://docs.cohere.com/reference/chat',
   'name': 'Generative Search - Cohere'},
  'generative-ollama': {'documentationHref': 'https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion',
   'name': 'Generative Search - Ollama'},
  'generative-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'Generative Search - OpenAI'},
  'text2vec-cohere': {'documentationHref': 'https://docs.cohere.ai/embedding-wiki/',
   'name': 'Cohere Module'},
  'text2vec-ollama': {'documentationHref': 'https://github.com/ollama/ollama/blob/main/docs/api.md#generate-embeddings',
   'name': 'Ollama Module'},
  'text2vec-openai': {'documentationHref': 'https://platform.openai.com/docs/guides/embeddings/what-are-embeddings',
   'name': 'OpenAI Module'}},
 'version': '1.25.5'}

## Step 2: Add data to Weaviate

### Add collection definition

The equivalent of a SQL "table" is called a "collection" in Weaviate, as in many NoSQL databases.

In case I already created a demo collection, let's delete it first.

In [5]:
client.collections.delete("Movie")

{"action":"load_all_shards","level":"error","msg":"failed to load all shards: context canceled","time":"2024-06-25T20:08:09+01:00"}


And create a new collection definition here.
We'll set up a collection called "Movie" with:
- Two "named vectors" -> which will save different "meanings" of the data,
- A "generative" module -> which will allow us to use LLMs with our data, and
- Properties to save our movie data (which are like SQL columns).
    - Just the title, overview, year and popularity for now.
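
One rough way to picture the two named vectors: each one is built from a different set of source properties, so the same movie ends up with two different "meanings". The sketch below only shows which text would feed each vector. Joining values with a space is a simplification, `text_for_vector` is a hypothetical helper, and the exact text Weaviate sends to the embedding model may differ.

```python
# Source properties per named vector, mirroring the collection definition.
NAMED_VECTOR_SOURCES = {
    "title": ["title"],
    "overview": ["title", "overview"],
}


def text_for_vector(obj, vector_name):
    """Join the source properties for one named vector (simplified illustration)."""
    return " ".join(str(obj[p]) for p in NAMED_VECTOR_SOURCES[vector_name])


# A made-up object, not a row from the dataset.
movie = {
    "title": "Example Movie",
    "overview": "A made-up overview.",
    "year": 2000,
    "popularity": 1.0,
}

for name in NAMED_VECTOR_SOURCES:
    print(name, "->", text_for_vector(movie, name))
```

Note that `year` and `popularity` feed neither vector here; they are stored as properties only.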

In [6]:
from weaviate.classes.config import Configure, DataType, Property

client.collections.create(
    name="Movie",
    vectorizer_config=[
        Configure.NamedVectors.text2vec_openai(
            name="title",
            source_properties=["title"]
        ),
        Configure.NamedVectors.text2vec_cohere(
            name="overview",
            source_properties=["title", "overview"]
        ),
    ],
    generative_config=Configure.Generative.cohere(model="command-r"),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
        ),
        Property(
            name="overview",
            data_type=DataType.TEXT,
        ),
        Property(
            name="year",
            data_type=DataType.INT,
        ),
        Property(
            name="popularity",
            data_type=DataType.NUMBER,
        ),
    ]
)

{"action":"hnsw_prefill_cache_async","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-06-25T20:08:10+01:00","wait_for_cache_prefill":false}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"vectors_overview","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-06-25T20:08:10+01:00","took":38333}
{"action":"hnsw_prefill_cache_async","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-06-25T20:08:10+01:00","wait_for_cache_prefill":false}
{"level":"info","msg":"Created shard movie_264zw3nbR7uB in 1.911333ms","time":"2024-06-25T20:08:10+01:00"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"vectors_title","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-06-25T20:08:10+01:00","took":36625}


<weaviate.collections.collection.Collection at 0x104f15270>

{"action":"bootstrap","level":"info","msg":"node reporting ready, node has probably recovered cluster from raft config. Exiting bootstrap process","time":"2024-06-25T20:08:10+01:00"}


> Tip: You can get example collection definitions in our documentation:
> - https://weaviate.io/developers/weaviate/manage-data/collections

Was our collection created successfully? Let's take a look.

In [7]:
client.collections.exists("Movie")

True

### Add data

Next, we'll add actual objects (the equivalent of SQL rows) to our collection.

First, let's build objects to add - and take a look at a couple.

In [8]:
data_columns = ['title', 'overview', 'year', 'popularity']

df = movie_df[data_columns]

df.head()

Unnamed: 0,title,overview,year,popularity
0,Cinderella,Cinderella has faith her dreams of a better li...,1950,100.819
1,Sunset Boulevard,A hack screenwriter writes a screenplay for a ...,1950,57.74
2,Rashomon,Brimming with action while incisively examinin...,1950,21.011
3,Alice in Wonderland,"On a golden afternoon, young Alice follows a W...",1951,75.465
4,Singin' in the Rain,"In 1927 Hollywood, a silent film production co...",1952,31.407


> If it all looks fine - let's add objects:
> - https://weaviate.io/developers/weaviate/manage-data/import
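
If your own data arrives as strings or mixed types, one option is to coerce each value to the Python type that matches the property's declared data type before importing. This is a minimal sketch under that assumption. `to_properties` and `PROPERTY_TYPES` are hypothetical names, and the row is a hand-written stand-in with a placeholder overview.

```python
# Python types matching the collection's declared data types:
# TEXT -> str, INT -> int, NUMBER -> float.
PROPERTY_TYPES = {"title": str, "overview": str, "year": int, "popularity": float}


def to_properties(row):
    """Build a properties dict from a row-like mapping, coercing each value."""
    return {name: cast(row[name]) for name, cast in PROPERTY_TYPES.items()}


# Stand-in row with string values and one extra column that should be dropped.
raw_row = {
    "title": "Cinderella",
    "overview": "placeholder overview",
    "year": "1950",
    "popularity": "100.819",
    "video": False,
}
props = to_properties(raw_row)
print(props)
```

Columns that are not listed in `PROPERTY_TYPES` are simply left out of the resulting object.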

In [9]:
movies = client.collections.get("Movie")

with movies.batch.dynamic() as batch:
    for i, row in df.iterrows():
        obj_body = {
            c: row[c] for c in data_columns
        }
        batch.add_object(
            properties=obj_body
        )

{"action":"telemetry_push","level":"info","msg":"telemetry started","payload":"\u0026{MachineID:d41e48d3-1900-486d-998f-ec8c49cdc67d Type:INIT Version:1.25.5 NumObjects:0 OS:darwin Arch:arm64 UsedModules:[generative-cohere text2vec-cohere text2vec-openai]}","time":"2024-06-25T20:08:10+01:00"}


In [10]:
print(len(movies.batch.failed_objects))

0


#### Confirm data load

Do we have data? 

Let's get an object count

In [11]:
movies = client.collections.get("Movie")

movies.aggregate.over_all(total_count=True)

AggregateReturn(properties={}, total_count=1322)

Does the data look right?

Let's grab a few objects from Weaviate!

In [12]:
response = movies.query.fetch_objects(limit=3)
for o in response.objects:
    print(o.properties)

{'year': 2009, 'title': 'Coraline', 'overview': 'A young girl discovers an idealized parallel universe behind a secret door in her new home, unaware that it contains a sinister secret.', 'popularity': 215.445}
{'title': 'Shark Tale', 'overview': "Oscar is a small fish whose big aspirations often get him into trouble. Meanwhile, Lenny is a great white shark with a surprising secret that no sea creature would guess: He's a vegetarian. When a lie turns Oscar into an improbable hero and Lenny becomes an outcast, the two form an unlikely friendship.", 'year': 2004, 'popularity': 55.359}
{'title': 'True Romance', 'overview': 'Clarence marries hooker Alabama, steals cocaine from her pimp, and tries to sell it in Hollywood, while the owners of the coke try to reclaim it.', 'year': 1993, 'popularity': 102.252}


Let's pause for a second - because we've done a lot!

#### What did we just do?

Here is a conceptual diagram

![img](https://github.com/weaviate-tutorials/intro-workshop/blob/main/images/object_import_process_full.png?raw=1)

## Step 3: Work with the data

Let's try a few more involved queries

### Filtering (similar to WHERE filter in SQL)

Let's find objects that meet a particular condition.
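
Conceptually, a filter is a yes/no test applied to each object. The plain-Python sketch below mimics a "greater than" filter with a limit over a small hand-written list; `filter_greater_than` is a hypothetical helper, and a real database would use indexes rather than scanning every object.

```python
# A tiny hand-written sample; titles and years are for illustration only.
movies_sample = [
    {"title": "Cinderella", "year": 1950},
    {"title": "Galaxy Quest", "year": 1999},
    {"title": "Zootopia", "year": 2016},
    {"title": "Guardians of the Galaxy Vol. 2", "year": 2017},
    {"title": "Guardians of the Galaxy Vol. 3", "year": 2023},
]


def filter_greater_than(objects, prop, threshold, limit):
    """Keep objects whose property exceeds the threshold, up to `limit` results."""
    matches = [o for o in objects if o[prop] > threshold]
    return matches[:limit]


recent = filter_greater_than(movies_sample, "year", 2015, limit=3)
print([m["title"] for m in recent])
```

The matches come back in stored order: every object either passes or fails, and nothing says which match is "better".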

In [13]:
import weaviate.classes.query as wq

response = movies.query.fetch_objects(
    filters=wq.Filter.by_property("year").greater_than(2015),
    limit=3
)

for o in response.objects:
    print(o.properties["title"])

Zootopia
The Jungle Book
Miss Peregrine's Home for Peculiar Children


But a filter does not rank the results in any meaningful way.

For that, we need a keyword search (as opposed to a *filter*).

### Keyword search

Unlike a filter, a keyword search (BM25) finds objects that contain the keyword and ranks them, based largely on how frequently the keyword appears.
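
To make the ranking idea concrete, here is a simplified BM25 scorer in plain Python. It is a sketch of the general formula, not Weaviate's implementation: the tokenizer, the `k1` and `b` values (common defaults), and the three short documents are all assumptions chosen for illustration.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = [0.0] * n_docs
    for term in set(tokenize(query)):
        # Rarer terms get a higher inverse document frequency.
        doc_freq = sum(1 for t in tokenized if term in t)
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        for i, tokens in enumerate(tokenized):
            tf = tokens.count(term)
            # Length normalisation: longer documents need more hits to score the same.
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            scores[i] += idf * tf * (k1 + 1) / (tf + norm)
    return scores


# Made-up documents of equal length, so only term frequency differs.
docs = [
    "guardians of the galaxy",
    "galaxy quest galaxy crew",
    "singin in the rain",
]
scores = bm25_scores("galaxy", docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked, scores)
```

The document that mentions the keyword twice ranks first, and the one without it scores zero.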

In [14]:
response = movies.query.bm25(
    query="galaxy",
    limit=5,
    return_metadata=wq.MetadataQuery(score=True, last_update_time=True)
)

for o in response.objects:
    print(o.metadata.score)
    print(o.metadata.last_update_time)
    print(o.properties)

3.2621753215789795
2024-06-25 19:08:24.432000+00:00
{'title': 'Guardians of the Galaxy Vol. 3', 'overview': 'Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mission that, if not completed successfully, could quite possibly lead to the end of the Guardians as we know them.', 'year': 2023, 'popularity': 165.416}
3.2621753215789795
2024-06-25 19:08:22.713000+00:00
{'title': 'Guardians of the Galaxy Vol. 2', 'overview': "The Guardians must fight to keep their newfound family together as they unravel the mysteries of Peter Quill's true parentage.", 'year': 2017, 'popularity': 142.267}
2.128286361694336
2024-06-25 19:08:16.304000+00:00
{'title': 'Star Wars: Episode II - Attack of the Clones', 'overview': 'Following an assassination attempt on Senator Padmé Amidala, Jedi Knights Anakin Skywalker and Obi-Wan Kenobi investigate a mysterious plot that could change the galaxy forever.', 'year': 200

### Semantic search

A semantic search, on the other hand, finds objects based on similarity of meaning.

In [15]:
response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata)
    print(json.dumps(o.properties, indent=2))

MetadataReturn(creation_time=None, last_update_time=None, distance=0.13486772775650024, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None)
{
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "year": 1999,
  "popularity": 62.01
}
MetadataReturn(creation_time=None, last_update_time=None, distance=0.16076624393463135, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_scor

#### How does this work?

- Under the hood, this uses a vector search. It looks for objects which are the most similar to a text input.
- We can inspect the similarity along with the results.
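
The idea can be sketched in a few lines of plain Python, assuming cosine distance as the similarity measure (a common choice for text embeddings). The three-dimensional vectors and their labels are made up for illustration, and a real vector database uses an index rather than comparing against every object.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; real ones have far more dimensions.
toy_vectors = {
    "space comedy": [0.9, 0.1, 0.3],
    "space opera": [0.8, 0.2, 0.1],
    "musical": [0.1, 0.9, 0.2],
}
query_vector = [1.0, 0.0, 0.2]

# Brute-force search: sort every object by its distance to the query vector.
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_distance(query_vector, toy_vectors[name]),
)
for name in ranked:
    print(name, round(cosine_distance(query_vector, toy_vectors[name]), 4))
```

A smaller distance means a closer match, which is why the results come back ordered by increasing distance.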

In [16]:
response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",  # or "overview"
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata.distance)
    print(json.dumps(o.properties, indent=2))

0.13486772775650024
{
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "year": 1999,
  "popularity": 62.01
}
0.16076624393463135
{
  "title": "Star Wars",
  "overview": "Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful pr

This is where "vectors" come in. 

Each object in Weaviate includes a vector - like so:

In [17]:
response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",
    include_vector=True,
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata.distance)
    print(o.vector)
    print(json.dumps(o.properties, indent=2))

0.1343761682510376
{'overview': [0.0200653076171875, 0.057373046875, -0.03131103515625, 0.0007500648498535156, -0.0080413818359375, -0.0251007080078125, -0.01678466796875, -0.0626220703125, -0.051025390625, 0.050018310546875, 0.06610107421875, -0.04693603515625, 0.04107666015625, -0.0214080810546875, 0.056304931640625, -0.01153564453125, 0.00043272972106933594, -0.0296783447265625, -0.01198577880859375, -0.0273895263671875, -0.01030731201171875, 0.046417236328125, 0.050933837890625, -0.001316070556640625, 0.0172576904296875, 0.040924072265625, 0.048431396484375, -0.03338623046875, 0.07720947265625, 0.0149078369140625, 0.022430419921875, -0.05145263671875, 0.019561767578125, 0.0277252197265625, 0.00977325439453125, 0.03497314453125, 0.01788330078125, -0.01306915283203125, 0.00879669189453125, 0.02496337890625, -0.04534912109375, -0.00921630859375, -0.01788330078125, 0.0167236328125, -0.041534423828125, -0.0199432373046875, 0.03887939453125, 0.0200653076171875, 0.056915283203125, -0.0019

These vector representations come from deep learning models similar to those that power LLMs. They capture meaning, and are called vector "embeddings".

### Generative search

A generative search transforms your data at retrieval time. 

In [18]:
response = movies.generate.near_text(
    query="galaxy",
    limit=5,
    target_vector="title",
    single_prompt="Write a tweet promoting the movie with TITLE: {title} and OVERVIEW: {overview}.",
    grouped_task="What audience demographic might enjoy this group of movies?"
)

print(response.generated)
for o in response.objects:
    print(o.generated)
    print(json.dumps(o.properties, indent=2))

This group of movies seems to largely appeal to fans of science fiction and space operas. The collection includes beloved classics like Star Wars: A New Hope, which has a broad appeal due to its iconic status in popular culture, spanning multiple generations of fans. Alien, Interstellar, and Galaxy Quest also fit within the science fiction genre, attracting fans of thrilling and imaginative storytelling set in outer space.
 
The Star Wars sequel, The Rise of Skywalker, could also entice fans of fantasy and action-adventure movies, as it combines elements of both genres and continues a well-known and loved story. The cast of characters also offers a wide range of appeals, with a mix of iconic and new heroes, villains, and quirky supporting roles.

In addition, these movies could appeal to specific fanbases. For example, the presence of Tim Allen and Sigourney Weaver in Galaxy Quest could attract fans of their respective bodies of work. Similarly, the iconic nature of the Star Wars franc

You can see here ⬆️ that each object has been transformed into a tweet by the LLM based on our prompt.

You can ask LLMs to perform all sorts of tasks.

In [19]:
response = movies.generate.near_text(
    query="galaxy",
    target_vector="title",
    limit=3,
    single_prompt="Summarise the following movie overview into a short French sentence: {overview}."
)

for o in response.objects:
    print(o.generated)
    print(json.dumps(o.properties, indent=2))

Une équipe d'acteurs de série spatiale oubliés sont enlevés par des aliens qui ont mal interprété leurs émissions télévisées comme des documents historiques.
{
  "year": 1999,
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "popularity": 62.01
}
Deux robots aidant, Luke Skywalker et Han Solo libèrent la princesse Leia des forces impitoyables de l'Empire galactique.
{
  "title": "Star Wars",
  "overview": "Princess Leia is capt

The LLM is multi-lingual!

You can also send groups of results to the LLM with Weaviate.

In [20]:
response = movies.generate.near_text(
    query="galaxy",
    target_vector="title",
    limit=3,
    grouped_task="Write a poem about these movies"
)

print(response.generated)
for o in response.objects:
    print(json.dumps(o.properties, indent=2))

In the realm of reels, three tales unfurl,
Space adventures, the stuff of legend,
Galaxy Quest, a comedy gem,
With a crew of actors, bold and brave,
Their mission, to save the universe,
But all was not as it seemed,
For the aliens, misinterpreting signals,
Beamed up the stars of the screen.

In Star Wars, a galaxy far, far away,
Leia, Luke, and Han Solo brave the dark,
Battling evil, restoring balance,
With droids by their side, they fought the fight,
An epic saga, a timeless classic,
That captured hearts, a shining gem.

Then, in the depths of space, the Nostromo flew,
A distress signal, a harbinger of doom,
Eggs of terror, an alien birth,
A parasitic horror, a crewman's doom,
The creature, a nightmare incarnate,
In the darkness, it lurked and waited. 

Three films, three journeys, one sky,
Each with a tale that's gripped our minds,
A galaxy of stars, a universe vast,
Where heroes and villains alike, dare to fly. 

So let the credits roll,
And the movies forever hold,
Their place in o

In [21]:
client.close()

{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2024-06-25T20:08:38+01:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:8079","time":"2024-06-25T20:08:38+01:00"}
{"action":"telemetry_push","level":"info","msg":"telemetry terminated","payload":"\u0026{MachineID:d41e48d3-1900-486d-998f-ec8c49cdc67d Type:TERMINATE Version:1.25.5 NumObjects:2967 OS:darwin Arch:arm64 UsedModules:[generative-cohere text2vec-cohere text2vec-openai]}","time":"2024-06-25T20:08:39+01:00"}
{"level":"info","msg":"closing raft FSM store ...","time":"2024-06-25T20:08:39+01:00"}
{"level":"info","msg":"shutting down raft sub-system ...","time":"2024-06-25T20:08:39+01:00"}
{"level":"info","msg":"transferring leadership to another server","time":"2024-06-25T20:08:39+01:00"}
{"error":"cannot find peer","level":"error","msg":"transferring leadership","time":"2024-06-25T20:08:39+01:00"}
{"level":"info","msg":"closing raft-net ...","time":