## Weaviate workshop

<a target="_blank" href="https://colab.research.google.com/github/weaviate-tutorials/intro-workshop/blob/main/workshop.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Goals:

#### What you will see:


- How to create a vector database with Weaviate,
- How to add data to the database, and
- How to interact with the data, including searching and using LLMs with your data in Weaviate.

#### What you will learn today:

- What Weaviate is,
- How it stores the data (based on its "meaning"), and
- What you can do with Weaviate, like semantic searches, and using LLMs to transform data.

Install the Weaviate Python client if your environment does not have it yet.

In [1]:
# !pip install -U weaviate-client

## Preparation: Get the data

We'll use a subset of the Jeopardy! quiz library:
> https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions

Pre-processed version:
> https://raw.githubusercontent.com/databyjp/wv_demo_uploader/main/weaviate_datasets/data/jeopardy_1k.json


Load (or download) the data, and preview it

In [1]:
import requests
import json

# Download the data
response = requests.get('https://raw.githubusercontent.com/databyjp/wv_demo_uploader/main/weaviate_datasets/data/jeopardy_1k.json')
raw_data = response.text

# Parse the JSON and preview it
data = json.loads(raw_data)
print(type(data), len(data))
print(json.dumps(data[0], indent=2))

<class 'list'> 1000
{
  "Air Date": "2006-11-08",
  "Round": "Double Jeopardy!",
  "Value": 800,
  "Category": "AMERICAN HISTORY",
  "Question": "Abraham Lincoln died across the street from this theatre on April 15, 1865",
  "Answer": "Ford's Theatre (the Ford Theatre accepted)"
}


## Step 1: Create a Weaviate instance (database)

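The cell below connects to a Weaviate instance that is already running locally (HTTP on port 8080, gRPC on port 50051). Embedded Weaviate is another quick way to create a Weaviate database, but note that it is suitable for evaluation use only, and is currently not compatible with Windows (we are working on it 😉).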

You can also use:
- A free sandbox with Weaviate Cloud Services
- Open-source Weaviate directly, available cross-platform with Docker

In [2]:
import weaviate
import os

client = weaviate.connect_to_local(
    port=8080,
    grpc_port=50051,
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace this with your actual key
    }
)
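
Note that `os.environ["OPENAI_APIKEY"]` raises a bare `KeyError` when the variable is not set. A minimal sketch of a helper that fails with a clearer message instead; `build_headers` is a hypothetical name for illustration, not part of the Weaviate client:

```python
import os


def build_headers(env_var: str = "OPENAI_APIKEY") -> dict:
    """Return the additional headers for the client, or fail with a clear message."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "export it before creating the client."
        )
    # Same header name as used in the connection cell above.
    return {"X-OpenAI-Api-Key": api_key}
```

The result could then be passed as `headers=build_headers()` in the connection call.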

Retrieve Weaviate instance information to check our configuration.

In [3]:
client.get_meta()

{'hostname': 'http://[::]:8080',
 'modules': {'generative-cohere': {'documentationHref': 'https://docs.cohere.com/reference/generate',
   'name': 'Generative Search - Cohere'},
  'generative-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'Generative Search - OpenAI'},
  'text2vec-cohere': {'documentationHref': 'https://docs.cohere.ai/embedding-wiki/',
   'name': 'Cohere Module'},
  'text2vec-huggingface': {'documentationHref': 'https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task',
   'name': 'Hugging Face Module'},
  'text2vec-openai': {'documentationHref': 'https://platform.openai.com/docs/guides/embeddings/what-are-embeddings',
   'name': 'OpenAI Module'},
  'text2vec-palm': {'documentationHref': 'https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings',
   'name': 'Google PaLM Module'}},
 'version': '1.22.2'}
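
The class defined in Step 2 relies on the `text2vec-openai` and `generative-openai` modules, so it can be worth confirming that both appear in this metadata. A small sketch, assuming the metadata is a plain dictionary shaped like the output above; `missing_modules` and `example_meta` are hypothetical names:

```python
def missing_modules(meta: dict, required: list) -> list:
    """Return the required module names that the instance does not report."""
    enabled = meta.get("modules", {})
    return [name for name in required if name not in enabled]


# Trimmed-down copy of the metadata shown above.
example_meta = {
    "modules": {
        "generative-openai": {"name": "Generative Search - OpenAI"},
        "text2vec-openai": {"name": "OpenAI Module"},
    },
    "version": "1.22.2",
}

# An empty list means every required module is enabled.
print(missing_modules(example_meta, ["text2vec-openai", "generative-openai"]))
```

In the workshop itself you would pass the result of `client.get_meta()` rather than `example_meta`.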

## Step 2: Add data to Weaviate

### Add class definition

The equivalent of a SQL "table" or a NoSQL "collection" is called a "class" in Weaviate (the Python client refers to it as a collection).

In case a demo class already exists from an earlier run, let's delete it first.

In [4]:
client.collections.delete("QuestionAndAnswer")  # Same name as the class created below

And create a new class definition here.
We'll set up a class called "QuestionAndAnswer" with:
- A "vectorizer" -> which will convert data to vectors, which represent meaning,
- A "generative" module -> which will allow us to use LLMs with our data, and
- Properties to save our quiz data (which are like SQL columns).
    - Just the question and answer for now

In [5]:
import weaviate.classes as wvc

questions = client.collections.create(
    name="QuestionAndAnswer",
    vectorizer_config=wvc.Configure.Vectorizer.text2vec_openai(),
    vector_index_config=wvc.Configure.vector_index(
        distance_metric=wvc.VectorDistance.COSINE
    ),
    generative_config=wvc.Configure.Generative.openai(),
    properties=[
        wvc.Property(
            name="question",
            data_type=wvc.DataType.TEXT
        ),
        wvc.Property(
            name="answer",
            data_type=wvc.DataType.TEXT
        ),
    ]
)

> Tip: You can get example class definitions in our documentation:
> - https://weaviate.io/developers/weaviate/manage-data/classes#example-class-configurations

Was our class created successfully? Let's take a look

In [13]:
collections = client.collections.list_all()
print(collections.keys())

dict_keys(['QuestionAndAnswer'])


### Add data

Now we'll add the actual objects (like SQL rows) to our class.

First, let's build objects to add - and take a look at a couple.

In [9]:
for o in data[:2]:
    obj_body = {
        "question": o["Question"],
        "answer": o["Answer"],
    }
    print(obj_body)

{'question': 'Abraham Lincoln died across the street from this theatre on April 15, 1865', 'answer': "Ford's Theatre (the Ford Theatre accepted)"}
{'question': 'Any pigment on the wall so faded you can barely see it', 'answer': 'faint paint'}
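
The same mapping can be wrapped in a small function so that the preview and the import share one code path. This is a sketch only: `build_objects` is a hypothetical helper, and skipping records with a missing question or answer is an assumption, not something the dataset is known to require.

```python
def build_objects(records: list) -> list:
    """Map raw Jeopardy! records to the two properties defined on the class."""
    objects = []
    for record in records:
        question = record.get("Question")
        answer = record.get("Answer")
        # Assumption: skip incomplete records rather than importing empty properties.
        if not question or not answer:
            continue
        objects.append({"question": question, "answer": answer})
    return objects


sample = [
    {"Question": "Any pigment on the wall so faded you can barely see it", "Answer": "faint paint"},
    {"Question": "A record with no answer"},
]
print(build_objects(sample))
```

Only the first sample record survives, because the second has no `Answer` key.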


> If it all looks fine - let's add objects:
> - https://weaviate.io/developers/weaviate/manage-data/import

In [15]:
data_objects = list()
for o in data:
    obj_body = {
        "question": o["Question"],
        "answer": o["Answer"],
    }
    data_objects.append(obj_body)

questions.data.insert_many(data_objects)

BatchObjectReturn(all_responses=[UUID('a15b33d3-0650-43bb-9ea8-acdcce5406f3'), UUID('c4a5a296-5509-4def-9df0-da08ebd36ddf'), UUID('e7d09ba1-05fe-4da0-8e62-e645bd3703f1'), UUID('971c1cce-8a38-4e8d-9609-2577be9d5968'), UUID('68f7daf2-a2b6-4ed7-84d0-ddd36888471b'), UUID('3bb7537d-c144-4fb8-bf06-61999d88ba8d'), UUID('dbe86917-a9b7-45e2-9f08-d2006563b873'), UUID('d747fe07-b245-40e4-a3b7-0f809a1bbab3'), UUID('d3772462-c7eb-43fc-b7fc-f622a7670421'), UUID('eade1630-e26d-4012-835a-172b7a66a172'), UUID('162c9d5d-72a2-4eb8-9a29-ae7f6681ba73'), UUID('41d26add-8433-4d2d-a9ad-3803e1ceb40d'), UUID('aa25b89c-3257-465b-9e53-26894de718d8'), UUID('5325a058-5391-4076-97bd-b5128be758e4'), UUID('4fbf034f-66c1-42d5-b57e-2e0ac0fd2062'), UUID('a3c515d8-d731-47c3-8896-82d06f74cc6a'), UUID('774d4e21-6270-4645-b301-b8273b80a687'), UUID('5cc791a9-71d0-47c3-979a-616fa77f6a70'), UUID('0a61d83d-00bc-4963-9b19-b3d0d12dd615'), UUID('8920a91b-bc32-49f8-b416-92899c321d5e'), UUID('acf27cab-c931-4c4a-b64f-2c82fbd46326'), U

#### Confirm data load

Do we have data? 

Let's get an object count

In [17]:
questions.aggregate.over_all(total_count=True)

_AggregateReturn(properties={}, total_count=1000)

Does the data look right?

Let's grab a few objects from Weaviate!

In [26]:
response = questions.query.fetch_objects(limit=2)
response.objects

[_Object(properties={'answer': 'Casablanca', 'question': 'In January 1943 FDR & Winston Churchill met in this North African city to discuss the war effort'}, metadata=_MetadataReturn(uuid=UUID('00ad75b8-2fe4-48ba-935d-b3649b6f301f'), vector=None, creation_time_unix=1699028039851, last_update_time_unix=1699028039851, distance=None, certainty=None, score=0.0, explain_score='', is_consistent=False)),
 _Object(properties={'answer': 'Swiss Miss', 'question': 'This cocoa brand will "warm the heart"'}, metadata=_MetadataReturn(uuid=UUID('00df873c-0632-481d-8616-e1a82114851e'), vector=None, creation_time_unix=1699028043264, last_update_time_unix=1699028043264, distance=None, certainty=None, score=0.0, explain_score='', is_consistent=False))]

Let's pause for a second - because we've done a lot!

#### What did we just do?

Here is a conceptual diagram

![img](https://github.com/weaviate-tutorials/intro-workshop/blob/main/images/object_import_process_full.png?raw=1)

## Step 3: Work with the data

Let's try a few more involved queries

### Filtering (similar to WHERE filter in SQL)

Let's find objects that meet a particular condition.

In [35]:
response = questions.query.fetch_objects(
    limit=2,
    filters=wvc.Filter("question").like("history")
)
response.objects

[_Object(properties={'answer': 'the Field Museum', 'question': 'What was once the Chicago Natural History Museum is now called this, after its founder'}, metadata=_MetadataReturn(uuid=UUID('57c2bc90-99bd-434a-9ba8-d45f1ec65931'), vector=None, creation_time_unix=1699028033805, last_update_time_unix=1699028033805, distance=None, certainty=None, score=0.0, explain_score='', is_consistent=False)),
 _Object(properties={'answer': 'Greyhound', 'question': 'A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles'}, metadata=_MetadataReturn(uuid=UUID('cbf51871-5e1f-4b04-a3f1-1dbdf4dcbcff'), vector=None, creation_time_unix=1699028036328, last_update_time_unix=1699028036328, distance=None, certainty=None, score=0.0, explain_score='', is_consistent=False))]

But this does not rank the results in any meaningful way.

For that, we need a keyword search (as opposed to a keyword *filter*).

### Keyword search

Unlike a keyword filter, a keyword search ranks the matching results, based on how often the keyword appears (using the BM25 algorithm).

In [36]:
response = questions.query.bm25(
    limit=2,
    query="history"
)
response.objects

[_Object(properties={'answer': '"A Brief History Of Time In A Bottle"', 'question': "Stephen Hawking's 1988 bio of the universe that was a No. 1 hit for Jim Croce"}, metadata=_MetadataReturn(uuid=UUID('e9d78df6-6c3d-4f50-8426-28a984cac4b7'), vector=None, creation_time_unix=1699028039316, last_update_time_unix=1699028039316, distance=None, certainty=None, score=2.520803213119507, explain_score=', BM25F_history_frequency:1, BM25F_history_propLength:7', is_consistent=False)),
 _Object(properties={'answer': 'Oil', 'question': 'The Drake Well Museum in Titusville, Penn. is dedicated to the history of this industry'}, metadata=_MetadataReturn(uuid=UUID('4cddec5b-24df-4df7-aee5-e539ad6fcc5a'), vector=None, creation_time_unix=1699028043885, last_update_time_unix=1699028043885, distance=None, certainty=None, score=1.8396204710006714, explain_score=', BM25F_history_propLength:14, BM25F_history_frequency:1', is_consistent=False))]
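
The `explain_score` values above report two inputs for each hit: how often the keyword occurs (`frequency`) and how long the property is (`propLength`). A minimal sketch of the classic BM25 formula shows why, with equal frequency, the shorter text ranks higher. This is a simplified illustration, not Weaviate's implementation (which reports a BM25F variant); `bm25_score`, the toy corpus, and the `k1` and `b` values are assumptions chosen for the example.

```python
import math


def bm25_score(term, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one document for a single query term with the classic BM25 formula."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    docs_with_term = sum(1 for d in corpus if term in d)
    # Rarer terms get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    freq = doc_tokens.count(term)
    # Longer documents are penalised through the length normalisation term.
    norm = freq + k1 * (1 - b + b * len(doc_tokens) / avg_len)
    return idf * freq * (k1 + 1) / norm


# Toy corpus of pre-tokenised texts (made up for this example).
corpus = [
    "a brief history of time".split(),
    "the museum is dedicated to the long history of this industry".split(),
    "this cocoa brand will warm the heart".split(),
]
for doc in corpus:
    print(round(bm25_score("history", doc, corpus), 3), " ".join(doc))
```

Both of the first two texts contain "history" once, but the shorter one receives the higher score, and the text without the keyword scores zero.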

### Semantic search

A semantic search, on the other hand, finds objects based on similarity of meaning.

In [37]:
response = questions.query.near_text(
    limit=2,
    query="history"
)
response.objects

[_Object(properties={'answer': 'Greyhound', 'question': 'A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles'}, metadata=_MetadataReturn(uuid=UUID('cbf51871-5e1f-4b04-a3f1-1dbdf4dcbcff'), vector=None, creation_time_unix=1699028036328, last_update_time_unix=1699028036328, distance=0.2003852128982544, certainty=0.8998073935508728, score=0.0, explain_score='', is_consistent=False)),
 _Object(properties={'answer': 'Shinto', 'question': 'Compiled in 712, the Kojiki, "Records of Ancient Matters", is one of this religion\'s oldest texts'}, metadata=_MetadataReturn(uuid=UUID('02095eaa-3bd4-45f5-90b8-acb1ac058595'), vector=None, creation_time_unix=1699028046837, last_update_time_unix=1699028046837, distance=0.20662283897399902, certainty=0.8966885805130005, score=0.0, explain_score='', is_consistent=False))]

#### How does this work?

- Under the hood, this uses a vector search. It looks for objects which are the most similar to a text input.
- We can inspect the similarity along with the results.

In [53]:
response = questions.query.near_text(
    limit=2,
    query="history",
    return_metadata=wvc.MetadataQuery(
        distance=True
    )
)
response.objects

[_Object(properties={}, metadata=_MetadataReturn(uuid=None, vector=None, creation_time_unix=None, last_update_time_unix=None, distance=0.20026379823684692, certainty=None, score=None, explain_score=None, is_consistent=False)),
 _Object(properties={}, metadata=_MetadataReturn(uuid=None, vector=None, creation_time_unix=None, last_update_time_unix=None, distance=0.20650076866149902, certainty=None, score=None, explain_score=None, is_consistent=False))]

This is where "vectors" come in. 

Each object in Weaviate includes a vector - like so:

In [54]:
response = questions.query.near_text(
    limit=2,
    query="history",
    return_metadata=wvc.MetadataQuery(
        distance=True,
        vector=True
    )
)
response.objects

[_Object(properties={}, metadata=_MetadataReturn(uuid=None, vector=None, creation_time_unix=None, last_update_time_unix=None, distance=0.20038992166519165, certainty=None, score=None, explain_score=None, is_consistent=False)),
 _Object(properties={}, metadata=_MetadataReturn(uuid=None, vector=None, creation_time_unix=None, last_update_time_unix=None, distance=0.20658022165298462, certainty=None, score=None, explain_score=None, is_consistent=False))]

These vector representations come from deep learning models similar to those that power LLMs. They capture meaning, and are called vector "embeddings".
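
The class was configured with the cosine distance metric, so the `distance` values in the results measure the angle between the query vector and each object vector. A minimal sketch of that calculation, using made-up 3-dimensional vectors (real embeddings have far more dimensions); `cosine_distance` and the toy data are illustrative only:

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)


# Made-up "embeddings" for a query and three stored objects.
query = [0.9, 0.1, 0.0]
objects = {
    "museum history": [0.8, 0.2, 0.1],
    "ancient texts": [0.6, 0.5, 0.2],
    "cocoa brand": [0.0, 0.2, 0.9],
}

# A vector search returns the objects with the smallest distance first.
ranked = sorted(objects, key=lambda name: cosine_distance(query, objects[name]))
for name in ranked:
    print(name, round(cosine_distance(query, objects[name]), 3))
```

In the earlier semantic search output, the reported `certainty` equals `1 - distance / 2` (for example, a distance of about 0.2004 alongside a certainty of about 0.8998).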

### Generative search

A generative search transforms your data at retrieval time. 

In [56]:
response = questions.generate.near_text(
    limit=2,
    query="history",
    single_prompt="Write a short tweet to promote {question}"
)
response.objects

[_GenerativeObject(properties={'answer': 'Greyhound', 'question': 'A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles'}, metadata=_MetadataReturn(uuid=UUID('cbf51871-5e1f-4b04-a3f1-1dbdf4dcbcff'), vector=None, creation_time_unix=1699028036328, last_update_time_unix=1699028036328, distance=0.20038992166519165, certainty=0.8998050689697266, score=0.0, explain_score='', is_consistent=False), generated='"Discover the fascinating history of a bus company that started it all in Hibbing, MN! 🚌🌟 Explore the museum and dive into the legacy of this iconic transportation company founded in 1914, featuring the remarkable Hupmobiles. Don\'t miss this unique journey through time! #HibbingMuseum #TransportationHistory"'),
 _GenerativeObject(properties={'answer': 'Shinto', 'question': 'Compiled in 712, the Kojiki, "Records of Ancient Matters", is one of this religion\'s oldest texts'}, metadata=_MetadataReturn(uuid=UUID('02095eaa-3bd4-45f5-90b8-acb1ac

In [57]:
for o in response.objects:
    print(o.generated)

"Discover the fascinating history of a bus company that started it all in Hibbing, MN! 🚌🌟 Explore the museum and dive into the legacy of this iconic transportation company founded in 1914, featuring the remarkable Hupmobiles. Don't miss this unique journey through time! #HibbingMuseum #TransportationHistory"
"📜 Discover the ancient wisdom of #Kojiki, the 'Records of Ancient Matters' compiled in 712! 🏯 Unveiling the roots of our beliefs, this sacred text holds the key to understanding our religion's origins. 🌟 Join us on a journey through time and embrace the wisdom of our ancestors! #ReligiousHeritage"


You can see here ⬆️ that each object has been transformed into a tweet by the LLM based on our prompt.
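
Conceptually, `single_prompt` is a template: the `{question}` placeholder is filled with each object's `question` property, and one prompt per object is sent to the LLM. A rough sketch of that substitution step, as a mental model only (Weaviate performs this server-side, and `fill_prompt` is a hypothetical helper):

```python
def fill_prompt(template: str, properties: dict) -> str:
    """Substitute {property} placeholders with one object's property values."""
    prompt = template
    for name, value in properties.items():
        prompt = prompt.replace("{" + name + "}", str(value))
    return prompt


template = "Write a short tweet to promote {question}"
obj = {"question": 'This cocoa brand will "warm the heart"', "answer": "Swiss Miss"}
print(fill_prompt(template, obj))
```

Because each object produces its own prompt, a query with `limit=2` yields two separate generated texts.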

You can ask LLMs to perform all sorts of tasks

In [58]:
response = questions.generate.near_text(
    limit=2,
    query="history",
    single_prompt="Translate {question} into German."
)

for o in response.objects:
    print(o.generated)

Ein Museum in Hibbing, Minnesota, verfolgt die Geschichte dieses Busunternehmens, das dort 1914 mit Hupmobiles gegründet wurde.
Übersetzt in 712, ist das Kojiki, "Aufzeichnungen über alte Angelegenheiten", eines der ältesten Texte dieser Religion.


The LLM is multi-lingual!

You can also send groups of results to the LLM with Weaviate.

In [60]:
response = questions.generate.near_text(
    limit=2,
    query="history",
    grouped_task="Write a poem based on these results"
)

In [None]:
print(response.generated)
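
In contrast to `single_prompt`, a `grouped_task` produces one generated text for the whole result set, which is why it is read from `response.generated` rather than from each object. A rough mental model, assuming the retrieved objects are simply combined with the task into a single prompt; the exact prompt Weaviate builds is not shown here, and `build_grouped_prompt` is a hypothetical helper:

```python
def build_grouped_prompt(task: str, objects: list) -> str:
    """Combine a task with all retrieved objects into a single prompt."""
    lines = [task, ""]
    for i, properties in enumerate(objects, start=1):
        fields = ", ".join(f"{k}: {v}" for k, v in properties.items())
        lines.append(f"{i}. {fields}")
    return "\n".join(lines)


results = [
    {"question": "A museum traces the history of this bus company", "answer": "Greyhound"},
    {"question": "The Kojiki is one of this religion's oldest texts", "answer": "Shinto"},
]
print(build_grouped_prompt("Write a poem based on these results", results))
```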

Look how far we've got in a short time - we can do much more than that! 

Here's something I prepared earlier.

## What more can we do with Weaviate?

Here is a demo instance that you can connect to and try out. 

Like many of our production clusters, we have a read-only API key set up that you can use.

In [22]:
# TODO: update this to use either a local instance, or WCS once its clusters are updated

# api_headers = {
#     "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"],
# }

# # Instantiate the client with the auth config
# client = weaviate.Client(
#     url="https://edu-demo.weaviate.network",
#     auth_client_secret=weaviate.AuthApiKey(
#         api_key="learn-weaviate"
#     ),
#     additional_headers=api_headers
# )

This instance is populated with the first two chapters of the "Pro Git" book.
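
The book is stored as chunks rather than as one long document: the `GitBookChunk` class used in the following cells has `chunk` and `chunk_index` properties. How the demo data was actually chunked is not stated here, so the following is only a sketch of one common approach, overlapping word windows; `chunk_words` and its default sizes are assumptions.

```python
def chunk_words(text: str, chunk_size: int = 100, overlap: int = 20) -> list:
    """Split text into overlapping word windows, each tagged with its index."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        chunks.append({"chunk_index": index, "chunk": " ".join(words[start:start + chunk_size])})
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


text = "Git began with a bit of creative destruction and fiery controversy"
for c in chunk_words(text, chunk_size=5, overlap=2):
    print(c)
```

The overlap means neighbouring chunks share a few words, which helps keep sentences that straddle a boundary searchable.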

In [23]:
# response = (
#     client.query
#     .get("GitBookChunk", ["chunk", "chunk_index", "chapter_title"])
#     .with_limit(2)
#     .do()
# )

# jprint(response)

{
  "data": {
    "Get": {
      "GitBookChunk": [
        {
          "chapter_title": "01-introduction",
          "chunk": "== Distributed Version Control Systems\n\n(((version control,distributed)))\nThis is where Distributed Version Control Systems (DVCSs) step in.\nIn a DVCS (such as Git, Mercurial, Bazaar or Darcs), clients don't just check out the latest snapshot of the files; rather, they fully mirror the repository, including its full history.\nThus, if any server dies, and these systems were collaborating via that server, any of the client repositories can be copied back up to the server to restore it.\nEvery clone is really a full backup of all the data.\n\n.Distributed version control diagram\nimage::images/distributed.png[Distributed version control diagram]\n\nFurthermore, many of these systems deal pretty well with having several remote repositories they can work with, so you can collaborate with different groups of people in different ways simultaneously within the sam

Using Weaviate, we can talk to this book!

Let's see what the book says about ways of undoing commits.

In [24]:
# response = (
#     client.query
#     .get("GitBookChunk", ["chunk", "chunk_index", "chapter_title"])
#     .with_near_text({"concepts": ["undo a git commit"]})
#     .with_generate(grouped_task="key concepts contained here in bullet points")
#     .with_limit(3)
#     .do()
# )

Take a look at the results as we've done before

In [25]:
# print(response["data"]["Get"]["GitBookChunk"][0]["_additional"]["generate"]["groupedResult"])

- The concept of undoing changes in Git
- The use of the `git commit --amend` command to redo a commit
- The use of the `git reset` command to undo changes in the working directory


And the information that this is based on:

In [26]:
# for o in response["data"]["Get"]["GitBookChunk"]:
#     print(f"========== Chunk: {o['chunk_index']} ==========")
#     print(o["chunk"])

===

[[_undoing]]= Undoing Things

At any stage, you may want to undo something.
Here, we'll review a few basic tools for undoing changes that you've made.
Be careful, because you can't always undo some of these undos.
This is one of the few areas in Git where you may lose some work if you do it wrong.

One of the common undos takes place when you commit too early and possibly forget to add some files, or you mess up your commit message.
If you want to redo that commit, make the additional changes you forgot, stage them, and commit again using the `--amend` option:

[source,console]
----
$ git commit --amend
----

This command takes your staging area and uses it for the commit.
If you've made no changes since your last commit (for instance, you run this command immediately after your previous commit), then your snapshot will look exactly the same, and all you'll change is your commit message.

The same commit-message editor fires up, but it already contains the message of your previous

You can do strange and wonderful things - like this:

In [28]:
# response = (
#     client.query
#     .get("GitBookChunk", ["chunk", "chunk_index", "chapter_title"])
#     .with_near_text({"concepts": ["history of git"]})
#     .with_generate(grouped_task="explain these results in a short children's story, with emojis.")
#     .with_limit(3)
#     .do()
# )

In [29]:
# print(response["data"]["Get"]["GitBookChunk"][0]["_additional"]["generate"]["groupedResult"])

Once upon a time, there was a little penguin named Linux 🐧. Linux had a big project called the Linux kernel, which was a special kind of software. But Linux needed a way to keep track of all the changes and updates to the project.

At first, Linux used patches and archived files to share the changes with others. But then, in 2002, a new tool called BitKeeper came along. BitKeeper helped Linux and its friends work together on the project 🤝.

But, uh-oh! In 2005, BitKeeper had some problems with the company that made it 😱. The company took away the tool for free, and Linux and its friends were sad 😢.

But Linux, being a smart penguin, decided to create its own tool called Git 🐧🔧. Git was super fast and could handle big projects like the Linux kernel. It also had a cool branching system for different versions of the project 🌳.

With Git, Linux and its friends could work on the project without needing help from other computers. Everything they needed was right there on their own computers 

In [30]:
# for o in response["data"]["Get"]["GitBookChunk"]:
#     print(f"========== Chunk: {o['chunk_index']} ==========")
#     print(o["chunk"])

=== A Short History of Git

As with many great things in life, Git began with a bit of creative destruction and fiery controversy.

The Linux kernel is an open source software project of fairly large scope.(((Linux)))
During the early years of the Linux kernel maintenance (1991–2002), changes to the software were passed around as patches and archived files.
In 2002, the Linux kernel project began using a proprietary DVCS called BitKeeper.(((BitKeeper)))

In 2005, the relationship between the community that developed the Linux kernel and the commercial company that developed BitKeeper broke down, and the tool's free-of-charge status was revoked.
This prompted the Linux development community (and in particular Linus Torvalds, the creator of Linux) to develop their own tool based on some of the lessons they learned while using BitKeeper.(((Linus Torvalds)))
Some of the goals of the new system were as follows:

* Speed
* Simple design
* Strong support for non-linear development (thousands 

And a lot more. 

Weaviate makes it easy for you to work with your data and these AI models, at scale. As a vector database, we deal with data stores holding tens or hundreds of millions of objects!