## Weaviate workshop

<a target="_blank" href="https://colab.research.google.com/github/weaviate-tutorials/intro-workshop/blob/main/workshop.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Goals:

#### What you will see:


- Create a vector database with Weaviate,
- Add data to the database, and
- Interact with the data, including searching it and using LLMs with your data in Weaviate.

#### What you will learn today:

- What Weaviate is,
- How it stores the data (based on its "meaning"), and
- What you can do with Weaviate, like semantic searches, and using LLMs to transform data.

Install the Weaviate Python client if your environment doesn't have it yet.

In [1]:
# !pip install -U --pre weaviate-client

## Preparation: Get the data

We'll use a subset of the Jeopardy! quiz library:
> https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions

Pre-processed version:
> https://raw.githubusercontent.com/databyjp/wv_demo_uploader/main/weaviate_datasets/data/jeopardy_1k.json


Download the data and preview it.

In [2]:
import requests
import json

# Download the data
response = requests.get('https://raw.githubusercontent.com/databyjp/wv_demo_uploader/main/weaviate_datasets/data/jeopardy_1k.json')
raw_data = response.text

# Parse the JSON and preview it
data = json.loads(raw_data)
print(type(data), len(data))
print(json.dumps(data[0], indent=2))

<class 'list'> 1000
{
  "Air Date": "2006-11-08",
  "Round": "Double Jeopardy!",
  "Value": 800,
  "Category": "AMERICAN HISTORY",
  "Question": "Abraham Lincoln died across the street from this theatre on April 15, 1865",
  "Answer": "Ford's Theatre (the Ford Theatre accepted)"
}
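
Before importing, it can be worth checking that every record carries the fields we plan to store. The sketch below is a small, optional check. `find_incomplete_records` is a hypothetical helper name, and the `sample` list stands in for the `data` list loaded above.

```python
REQUIRED_KEYS = ("Question", "Answer")


def find_incomplete_records(records, required_keys=REQUIRED_KEYS):
    """Return the indices of records that lack a required key or hold an empty value."""
    incomplete = []
    for index, record in enumerate(records):
        if any(not record.get(key) for key in required_keys):
            incomplete.append(index)
    return incomplete


# Stand-in for the downloaded `data` list (the second record is deliberately broken).
sample = [
    {"Question": "Any pigment on the wall so faded you can barely see it", "Answer": "faint paint"},
    {"Question": "A record with no answer"},
]

incomplete = find_incomplete_records(sample)
print(incomplete)
```

With the real dataset you would call `find_incomplete_records(data)` and inspect any indices it returns before building objects.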


## Step 1: Create a Weaviate instance (database)

Embedded Weaviate is a quick way to create a Weaviate database from Python. Note that it is suitable for evaluation use only, and is currently not compatible with Windows (we are working on it 😉).

You can also use:
- A free sandbox with Weaviate Cloud Services
- Open-source Weaviate directly, available cross-platform with Docker

In [3]:
import weaviate
import os

client = weaviate.connect_to_embedded(
    version="1.24.8",
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_APIKEY")  # Replace this with your actual key
    }
)

Started /Users/jphwang/.cache/weaviate-embedded: process ID 79207


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-04-18T13:50:34+01:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-04-18T13:50:34+01:00"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-04-18T13:50:34+01:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50050","time":"2024-04-18T13:50:34+01:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:8079","time":"2024-04-18T13:50:34+01:00"}


Retrieve Weaviate instance information to check our configuration.

In [4]:
client.get_meta()

{'hostname': 'http://127.0.0.1:8079',
 'modules': {'generative-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'Generative Search - OpenAI'},
  'qna-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'OpenAI Question & Answering Module'},
  'ref2vec-centroid': {},
  'reranker-cohere': {'documentationHref': 'https://txt.cohere.com/rerank/',
   'name': 'Reranker - Cohere'},
  'text2vec-cohere': {'documentationHref': 'https://docs.cohere.ai/embedding-wiki/',
   'name': 'Cohere Module'},
  'text2vec-huggingface': {'documentationHref': 'https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task',
   'name': 'Hugging Face Module'},
  'text2vec-openai': {'documentationHref': 'https://platform.openai.com/docs/guides/embeddings/what-are-embeddings',
   'name': 'OpenAI Module'}},
 'version': '1.24.8'}

## Step 2: Add data to Weaviate

### Add class definition

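The equivalent of a SQL "table" or a NoSQL "collection" is called a "class" in Weaviate. The Python client used here refers to the same thing as a "collection", which is why the calls below go through `client.collections`.

In case a demo class with the same name already exists, let's delete it first.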

In [5]:
client.collections.delete("Question")

Now let's create a new class definition.
We'll set up a class called "Question" with:
- A "vectorizer" -> which will convert data to vectors, which represent meaning,
- A "generative" module -> which will allow us to use LLMs with our data, and
- Properties to save our quiz data (which are like SQL columns).
    - Just the question and answer for now

In [6]:
import weaviate.classes.config as wc

client.collections.create(
    name="Question",
    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(),
    generative_config=wc.Configure.Generative.openai(),
    properties=[
        wc.Property(
            name="Question",
            data_type=wc.DataType.TEXT,
        ),
        wc.Property(
            name="Answer",
            data_type=wc.DataType.TEXT,
        )
    ]
)

{"level":"info","msg":"Created shard question_8KEPm47nIhoR in 3.057ms","time":"2024-04-18T13:50:34+01:00"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-04-18T13:50:34+01:00","took":268583}


<weaviate.collections.collection.Collection at 0x11a471780>

> Tip: You can get example class definitions in our documentation:
> - https://weaviate.io/developers/weaviate/manage-data/classes#example-class-configurations

Was our class created successfully? Let's take a look

In [7]:
client.collections.list_all()

{'EmbeddedBackupTest': _CollectionConfigSimple(name='EmbeddedBackupTest', description=None, generative_config=None, properties=[_Property(name='title', description=None, data_type=<DataType.TEXT: 'text'>, index_filterable=True, index_searchable=True, nested_properties=None, tokenization=<Tokenization.WORD: 'word'>, vectorizer_config=_PropertyVectorizerConfig(skip=False, vectorize_property_name=True), vectorizer='text2vec-openai')], references=[], reranker_config=None, vectorizer_config=_VectorizerConfig(vectorizer=<Vectorizers.TEXT2VEC_OPENAI: 'text2vec-openai'>, model={'baseURL': 'https://api.openai.com', 'model': 'ada'}, vectorize_collection_name=True), vectorizer=<Vectorizers.TEXT2VEC_OPENAI: 'text2vec-openai'>, vector_config=None),
 'Movie': _CollectionConfigSimple(name='Movie', description=None, generative_config=_GenerativeConfig(generative=<GenerativeSearches.COHERE: 'generative-cohere'>, model={}), properties=[_Property(name='title', description=None, data_type=<DataType.TEXT: 

### Add data

We'll add actual objects (like SQL rows) to our class.

First, let's build objects to add - and take a look at a couple.

In [8]:
for o in data[:2]:
    obj_body = {
        "question": o["Question"],
        "answer": o["Answer"],
    }
    print(obj_body)

{'question': 'Abraham Lincoln died across the street from this theatre on April 15, 1865', 'answer': "Ford's Theatre (the Ford Theatre accepted)"}
{'question': 'Any pigment on the wall so faded you can barely see it', 'answer': 'faint paint'}


> If it all looks fine - let's add objects:
> - https://weaviate.io/developers/weaviate/manage-data/import

In [9]:
from weaviate.util import generate_uuid5
import weaviate.classes.data as wd

question = client.collections.get("Question")

object_list = list()
for o in data:
    obj_body = {
        "question": o["Question"],
        "answer": o["Answer"],
    }
    wv_obj = wd.DataObject(
        properties=obj_body,
        uuid=generate_uuid5(obj_body)
    )
    object_list.append(wv_obj)

response = question.data.insert_many(object_list)

{"action":"telemetry_push","level":"info","msg":"telemetry started","payload":"\u0026{MachineID:6e233afc-2daa-4830-9d66-c13584df449e Type:INIT Version:1.24.8 Modules:generative-openai,qna-openai,ref2vec-centroid,reranker-cohere,text2vec-cohere,text2vec-huggingface,text2vec-openai NumObjects:0 OS:darwin Arch:arm64}","time":"2024-04-18T13:50:35+01:00"}
{"level":"info","msg":"Completed loading shard embeddedbackuptest_zNbzMxQgICMT in 4.038667ms","time":"2024-04-18T13:50:35+01:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-04-18T13:50:35+01:00","took":40666}
{"level":"info","msg":"Completed loading shard movie_kPp8m3LWxr1C in 9.44025ms","time":"2024-04-18T13:50:35+01:00"}
{"level":"info","msg":"Completed loading shard review_w0zHx7QBU6hy in 10.76ms","time":"2024-04-18T13:50:35+01:00"}
{"action":"hnsw_vector_cache_prefill","count":5000,"index_id":"main","level":"info","limit":10000000

In [10]:
print(response.has_errors)

False


In [11]:
print(response)

BatchObjectReturn(all_responses=[UUID('f1819cbc-5a70-52be-adf9-4afcff2c3195'), UUID('a8a00cb0-bee8-5bde-b7d1-245939c8778a'), UUID('29fc6b7f-71d7-5671-b86b-42062b09e042'), UUID('7bbf8fca-f31b-5a9c-b7fe-122d2d3b8f9a'), UUID('5bf5514d-a7a1-519f-aa05-2e8bb815d1f5'), UUID('cdd58b59-88d1-50e9-9efa-06002150a205'), UUID('8520461b-2d60-590a-938c-59bc65ff4d3c'), UUID('f06f25c4-f9bd-5843-8af2-6e4fb6d43516'), UUID('d174b6d2-609d-5e68-b6a8-2d520a9bcfda'), UUID('db5d5a31-6728-566f-90a3-6e3bc42113eb'), UUID('250914c3-a317-5fd1-accd-5888bf109fa2'), UUID('6139dff0-b18f-5188-a888-85ddc4851482'), UUID('06065ece-2e17-5a82-a8d4-ae4245957773'), UUID('c1167274-21ca-570a-9e55-cc4fab3c1513'), UUID('53986f4f-6283-5470-9571-1c8a27a034ad'), UUID('1d5ed28a-2bba-5d73-a8e5-324d3079265b'), UUID('ab841cc2-ecc2-50ab-aa67-e77d90f51601'), UUID('95bbf75c-7bc7-5b20-87af-8c91b9a1a94f'), UUID('dab9f9fa-8c4f-5d93-baae-e3ea4062867d'), UUID('bf64ea39-b55d-5962-8dd6-c51c09416e65'), UUID('df1dece0-d6b5-55fd-abe8-3c3be13f6e24'), U
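
The IDs in this response are version 5 UUIDs (note the `5` that starts the third group of each one). Version 5 UUIDs are derived from a hash of their input rather than generated randomly, so the same properties always map to the same ID. The standard-library sketch below shows the idea. `content_uuid` is a hypothetical helper, and its output is not expected to match the IDs that `generate_uuid5` produces.

```python
import uuid


def content_uuid(properties: dict) -> str:
    """Derive a deterministic version 5 UUID from an object's properties."""
    # Sort the items so that key order does not change the resulting ID.
    canonical = repr(sorted(properties.items()))
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, canonical))


first = content_uuid({
    "question": "Any pigment on the wall so faded you can barely see it",
    "answer": "faint paint",
})
same = content_uuid({
    "answer": "faint paint",
    "question": "Any pigment on the wall so faded you can barely see it",
})
different = content_uuid({
    "question": "Any pigment on the wall so faded you can barely see it",
    "answer": "faded paint",
})

print(first == same)       # same content, same ID
print(first == different)  # different content, different ID
```

Deterministic IDs make it easier to spot or avoid duplicates when the same data is imported more than once.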

#### Confirm data load

Do we have data? 

Let's get an object count

In [12]:
questions = client.collections.get("Question")

questions.aggregate.over_all(total_count=True)

AggregateReturn(properties={}, total_count=1000)

Does the data look right?

Let's grab a few objects from Weaviate!

In [13]:
response = questions.query.fetch_objects(limit=3)
for o in response.objects:
    print(o.properties)

{'answer': 'a stun gun', 'question': "An immobilizer used by the police (don't taze me, bro!)"}
{'answer': 'Brussels', 'question': 'NATO is headquartered on Blvd. Leopold III in this European capital'}
{'answer': 'Dutch', 'question': 'The standard form of this language developed from that spoken in Amsterdam & nearby cities'}


Let's pause for a second - because we've done a lot!

#### What did we just do?

Here is a conceptual diagram

![img](https://github.com/weaviate-tutorials/intro-workshop/blob/main/images/object_import_process_full.png?raw=1)

## Step 3: Work with the data

Let's try a few more involved queries

### Filtering (similar to WHERE filter in SQL)

Let's find objects that meet a particular condition.

In [14]:
import weaviate.classes.query as wq

response = questions.query.fetch_objects(
    filters=wq.Filter.by_property("question").like("*history*"),
    limit=3
    )

for o in response.objects:
    print(o.properties["question"])

What was once the Chicago Natural History Museum is now called this, after its founder
A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles
The Drake Well Museum in Titusville, Penn. is dedicated to the history of this industry


But this does not rank the result in any meaningful way. 

For that, we need a keyword search (as opposed to a keyword *filter*).
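
To make the distinction concrete: a filter is a yes/no test. Each object either matches the pattern or it doesn't, and nothing says which match is best. The sketch below imitates the `*history*` wildcard with Python's `fnmatch` module on a few questions from the dataset. It is an illustration only, assumes case-insensitive matching, and is not how Weaviate evaluates filters internally.

```python
from fnmatch import fnmatchcase

questions_text = [
    "The Drake Well Museum in Titusville, Penn. is dedicated to the history of this industry",
    "What was once the Chicago Natural History Museum is now called this, after its founder",
    "NATO is headquartered on Blvd. Leopold III in this European capital",
]


def like(text: str, pattern: str) -> bool:
    """Return True if `text` matches a wildcard pattern, where * means any run of characters."""
    return fnmatchcase(text.lower(), pattern.lower())


# Every match is equally "true"; the order is simply the order of the input.
matches = [q for q in questions_text if like(q, "*history*")]
for q in matches:
    print(q)
```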

### Keyword search

Unlike a keyword filter, a keyword search ranks the matching results. It uses the BM25 algorithm, which scores each object based on how often the keyword appears in it.

In [15]:
response = questions.query.bm25(
    query="history",
    limit=10,
    return_metadata=wq.MetadataQuery(score=True, last_update_time=True)
)

for o in response.objects:
    print(o.metadata.score)
    print(o.metadata.last_update_time)
    print(o.properties)

2.520803213119507
2024-04-18 12:50:40.634000+00:00
{'answer': '"A Brief History Of Time In A Bottle"', 'question': "Stephen Hawking's 1988 bio of the universe that was a No. 1 hit for Jim Croce"}
1.8396204710006714
2024-04-18 12:50:42.697000+00:00
{'answer': 'Oil', 'question': 'The Drake Well Museum in Titusville, Penn. is dedicated to the history of this industry'}
1.7712442874908447
2024-04-18 12:50:38.190000+00:00
{'answer': 'the Field Museum', 'question': 'What was once the Chicago Natural History Museum is now called this, after its founder'}
1.6486854553222656
2024-04-18 12:50:39.263000+00:00
{'answer': 'Greyhound', 'question': 'A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles'}
1.6486854553222656
2024-04-18 12:50:44.070000+00:00
{'answer': 'the draft', 'question': "You're in the Army now--in 1940 FDR instituted the first peacetime one of these in U.S. history"}
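
The scores above come from BM25, a ranking function that rewards objects where the query term appears more often, dampens repeated occurrences, and normalizes for text length. The sketch below is a simplified, whitespace-tokenized version of the textbook formula, meant for intuition only. The `k1` and `b` values are commonly used defaults and an assumption here, and the resulting numbers will not match Weaviate's scores.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a simplified BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term at all.
            df = sum(1 for other in tokenized if term in other)
            if df == 0:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = counts[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "the history of this industry",
    "a history museum traces the history of a bus company",
    "an immobilizer used by the police",
]
scores = bm25_scores("history", docs)
for doc, score in zip(docs, scores):
    print(round(score, 3), doc)
```

A document that does not contain the keyword scores zero, which is why a keyword search can return fewer objects than the requested `limit`.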


### Semantic search

A semantic search, on the other hand, finds objects based on how similar their meaning is to the query.

In [16]:
response = questions.query.near_text(
    query="history",
    limit=3,
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata)
    print(json.dumps(o.properties, indent=2))

MetadataReturn(creation_time=None, last_update_time=None, distance=0.2034582495689392, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None)
{
  "answer": "Greyhound",
  "question": "A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles"
}
MetadataReturn(creation_time=None, last_update_time=None, distance=0.20675033330917358, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None)
{
  "answer": "Shinto",
  "question": "Compiled in 712, the Kojiki, \"Records of Ancient Matters\", is one of this religion's oldest texts"
}
MetadataReturn(creation_time=None, last_update_time=None, distance=0.21075522899627686, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None)
{
  "answer": "\"A Brief History Of Time In A Bottle\"",
  "question": "Stephen Hawking's 1988 bio of the universe that was a No. 1 hit for Jim Croce"
}


#### How does this work?

- Under the hood, this uses a vector search. It looks for objects which are the most similar to a text input.
- We can inspect the similarity along with the results.

In [17]:
response = questions.query.near_text(
    query="history",
    limit=3,
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata.distance)
    print(json.dumps(o.properties, indent=2))

0.20352506637573242
{
  "answer": "Greyhound",
  "question": "A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles"
}
0.2067863941192627
{
  "answer": "Shinto",
  "question": "Compiled in 712, the Kojiki, \"Records of Ancient Matters\", is one of this religion's oldest texts"
}
0.21077150106430054
{
  "answer": "\"A Brief History Of Time In A Bottle\"",
  "question": "Stephen Hawking's 1988 bio of the universe that was a No. 1 hit for Jim Croce"
}


This is where "vectors" come in. 

Each object in Weaviate includes a vector - like so:

In [18]:
response = questions.query.near_text(
    query="history",
    limit=3,
    include_vector=True,
    return_metadata=wq.MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata.distance)
    print(o.vector)
    print(json.dumps(o.properties, indent=2))

0.20352506637573242
{'default': [-0.015432987362146378, -0.010958095081150532, 0.00245478842407465, -0.038090500980615616, -0.02602716162800789, 0.011025488376617432, 0.006988649722188711, -0.014462528750300407, -0.013512288220226765, -0.028601571917533875, -0.020972689613699913, 0.015513858757913113, -0.007413225714117289, -0.020649202167987823, -0.028116341680288315, 0.01103222742676735, 0.014125564135611057, 0.014071649871766567, 0.013397720642387867, 0.006961692590266466, -0.004613047931343317, -0.006105802021920681, 0.010068507865071297, -0.009926983155310154, 0.00633830763399601, 0.004373803269118071, 0.027873728424310684, -0.011106359772384167, -0.0020824423991143703, 0.013222498819231987, -0.005344261880964041, -0.0022913606371730566, -0.016174310818314552, 0.0007792310789227486, 0.0049769701436161995, -0.032159920781850815, 0.0012383456341922283, 0.009165442548692226, 0.018843071535229683, 0.01337750256061554, -0.011638764292001724, 0.003290461143478751, -0.0038481380324810743

These vector representations come from deep learning models similar to those that power LLMs. They capture meaning, and are called vector "embeddings".
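
The `distance` values returned earlier measure how far apart two vectors are, so a smaller distance means a closer meaning. Assuming the class uses cosine distance (which, as far as we know, is the default metric), the calculation can be sketched with a few toy three-dimensional vectors. Real embeddings have many more dimensions, and the vectors and labels below are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up vectors: the first two point in similar directions, the third does not.
query_vector = [0.9, 0.1, 0.0]
toy_objects = {
    "museum history question": [0.8, 0.2, 0.1],
    "ancient texts question": [0.6, 0.5, 0.2],
    "stun gun question": [0.0, 0.1, 0.9],
}

# Rank objects from nearest to farthest, as a vector search would.
ranked = sorted(toy_objects, key=lambda name: cosine_distance(query_vector, toy_objects[name]))
for name in ranked:
    print(round(cosine_distance(query_vector, toy_objects[name]), 4), name)
```

Vectors pointing in the same direction have a distance of 0, perpendicular vectors have a distance of 1, and opposite vectors have a distance of 2.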

### Generative search

A generative search transforms your data at retrieval time. 

In [19]:
response = questions.generate.near_text(
    query="history",
    limit=5,
    single_prompt="Write a tweet about {question} as an interesting factoid.",
    grouped_task="Summarize this into bullet points"
)

print(response.generated)
for o in response.objects:
    print(o.generated)
    print(json.dumps(o.properties, indent=2))

- Greyhound: Founded in Hibbing, Minn. in 1914 using Hupmobiles
- Shinto: Kojiki, one of the religion's oldest texts, compiled in 712
- "A Brief History Of Time In A Bottle": Stephen Hawking's 1988 bio of the universe, No. 1 hit for Jim Croce
- The Rijksmuseum: Dutch national art museum founded by Louis Bonaparte in 1808
- John James Audubon: Artist and naturalist who began writing "The Ornithological Biography" in 1831
Did you know that a museum in Hibbing, Minn. traces the history of a bus company founded there in 1914 using Hupmobiles? 🚌 #FunFact #HistoryBuff #HibbingMN
{
  "answer": "Greyhound",
  "question": "A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles"
}
Did you know that the Kojiki, "Records of Ancient Matters", was compiled in 712 and is one of the oldest texts of Shintoism? #religion #history #factoid 📜🇯🇵
{
  "answer": "Shinto",
  "question": "Compiled in 712, the Kojiki, \"Records of Ancient Matters\", is one of this r

You can see here ⬆️ that each object has been transformed into a tweet by the LLM based on our prompt.
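
Conceptually, `single_prompt` is a template that is filled in once per object, with `{question}` replaced by that object's `question` property, while `grouped_task` is sent once together with all of the retrieved objects. The sketch below mimics this with plain string formatting. The helper names and the layout of the grouped prompt are assumptions for illustration, not what Weaviate actually sends to the LLM.

```python
import json

retrieved = [
    {
        "answer": "Greyhound",
        "question": "A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles",
    },
    {
        "answer": "Shinto",
        "question": "Compiled in 712, the Kojiki, \"Records of Ancient Matters\", is one of this religion's oldest texts",
    },
]


def build_single_prompts(template, objects):
    """Fill the template once per object, producing one prompt for each."""
    return [template.format(**obj) for obj in objects]


def build_grouped_prompt(task, objects):
    """Combine the task with all objects, producing a single prompt."""
    return task + "\n\n" + json.dumps(objects, indent=2)


single_prompts = build_single_prompts(
    "Write a tweet about {question} as an interesting factoid.", retrieved
)
grouped_prompt = build_grouped_prompt("Summarize this into bullet points", retrieved)

print(single_prompts[0])
print(grouped_prompt)
```

This is why `o.generated` holds one result per object, while `response.generated` holds a single result for the whole set.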

You can ask LLMs to perform all sorts of tasks, such as translation.

In [20]:
response = questions.generate.near_text(
    query="history",
    limit=3,
    single_prompt="Translate {question} into French."
)

for o in response.objects:
    print(o.generated)
    print(json.dumps(o.properties, indent=2))

Un musée de Hibbing, dans le Minnesota, retrace l'histoire de cette compagnie de bus fondée en 1914 en utilisant des Hupmobiles.
{
  "answer": "Greyhound",
  "question": "A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles"
}
Compilé en 712, le Kojiki, "Chroniques des temps anciens", est l'un des plus anciens textes de cette religion.
{
  "answer": "Shinto",
  "question": "Compiled in 712, the Kojiki, \"Records of Ancient Matters\", is one of this religion's oldest texts"
}
Translate Stephen Hawking's 1988 bio of the universe that was a No. 1 hit for Jim Croce into French.

La biographie de l'univers de Stephen Hawking en 1988 qui a été un succès numéro 1 pour Jim Croce.
{
  "answer": "\"A Brief History Of Time In A Bottle\"",
  "question": "Stephen Hawking's 1988 bio of the universe that was a No. 1 hit for Jim Croce"
}


The LLM is multilingual!

You can also send groups of results to the LLM with Weaviate.

In [21]:
response = questions.generate.near_text(
    query="history",
    limit=3,
    grouped_task="Write a poem about these facts"
)

print(response.generated)
for o in response.objects:
    print(json.dumps(o.properties, indent=2))

In a small town in Minnesota, a museum stands tall
Tracing the history of a bus company, Greyhound, for all
Founded in 1914 with Hupmobiles as its start
Connecting people near and far, it holds a special part

In the ancient land of Japan, a religion called Shinto
With the Kojiki as one of its oldest texts, a treasure to show
Compiled in 712, the "Records of Ancient Matters" it holds
A glimpse into the beliefs and traditions that unfold

And in the world of science, a brilliant mind did write
Stephen Hawking's "A Brief History Of Time In A Bottle" took flight
Exploring the universe and its mysteries so vast
A No. 1 hit for Jim Croce, a journey that will forever last

So let us marvel at the wonders of history and time
From Greyhound buses to Shinto beliefs, a world so sublime
With knowledge and curiosity, we continue to explore
The stories and facts that make our world so much more.
{
  "answer": "Greyhound",
  "question": "A Hibbing, Minn. museum traces the history of this bus company

In [22]:
client.close()

{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2024-04-18T13:51:03+01:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:8079","time":"2024-04-18T13:51:03+01:00"}
{"action":"telemetry_push","level":"info","msg":"telemetry terminated","payload":"\u0026{MachineID:6e233afc-2daa-4830-9d66-c13584df449e Type:TERMINATE Version:1.24.8 Modules:generative-openai,qna-openai,ref2vec-centroid,reranker-cohere,text2vec-cohere,text2vec-huggingface,text2vec-openai NumObjects:1325 OS:darwin Arch:arm64}","time":"2024-04-18T13:51:04+01:00"}
