
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Weaviate

In this notebook, we will use Weaviate as our vector database. We will write the embedding vectors out to Weaviate and then query for similar documents. Weaviate provides customization options, such as whether to use Product Quantization (see the [vector index documentation](https://weaviate.io/developers/weaviate/concepts/vector-index#hnsw-with-product-quantizationpq)).

Weaviate also offers a managed service, [Weaviate Cloud Services](https://console.weaviate.cloud/), which this notebook uses for its free sandbox cluster.

## Library pre-requisites

- weaviate-client
  - pip install below
- Spark connector jar file
  - **IMPORTANT!!** Since we will be interacting with Spark by writing a Spark dataframe out to Weaviate, we need a Spark Connector.
  - [Download the Spark connector jar file](https://github.com/weaviate/spark-connector#download-jar-from-github) and [upload to your Databricks cluster](https://github.com/weaviate/spark-connector#using-the-jar-in-databricks).

In [0]:
%pip install weaviate-client==3.19.1

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting weaviate-client==3.19.1
  Downloading weaviate_client-3.19.1-py3-none-any.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.6/99.6 kB 1.5 MB/s eta 0:00:00
Collecting validators<=0.21.0,>=0.18.2
  Downloading validators-0.21.0-py3-none-any.whl (27 kB)
Collecting authlib>=1.1.0
  Downloading Authlib-1.2.1-py2.py3-none-any.whl (215 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 215.3/215.3 kB 13.9 MB/s eta 0:00:00
Installing collected packages: validators, authlib, weaviate-client
Successfully installed authlib-1.2.1 validators-0.21.0 weaviate-client-3.19.1


## Classroom Setup

In [0]:
%run ../Includes/Classroom-Setup



Resetting the learning environment:
| enumerating serving endpoints...found 0...(0 seconds)
| No action taken

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/large-language-models/v01"


Importing lab testing framework.



Using the "default" schema.

Predefined paths variables:
| DA.paths.working_dir: /dbfs/mnt/dbacademy-users/vijaymohire@bhadaleit.onmicrosoft.com/large-language-models
| DA.paths.user_db:     /dbfs/mnt/dbacademy-users/vijaymohire@bhadaleit.onmicrosoft.com/large-language-models/database.db
| DA.paths.datasets:    /dbfs/mnt/dbacademy-datasets/large-language-models/v01

Setup completed (8 seconds)

The models developed or used in this course are for demonstration and learning purposes only.
Models may occasionally output offensive, inaccurate, biased information, or harmful instructions.


## Weaviate

[Weaviate](https://weaviate.io/) is an open-source persistent and fault-tolerant [vector database](https://weaviate.io/developers/weaviate/concepts/storage). It integrates with a variety of tools, including OpenAI and Hugging Face Transformers. You can refer to their [documentation here](https://weaviate.io/developers/weaviate/quickstart).

### Setting up your Weaviate Network

Before we can proceed, you need your own Weaviate Network. To start your own network, visit the [homepage](https://weaviate.io/).

Step 1: Click on `Start Free` 

<img src="https://files.training.databricks.com/images/weaviate_homepage.png" width=500>

Step 2: You will be brought to this [Console page](https://console.weaviate.cloud/). If this is your first time using Weaviate, click `Register here` and pass in your credentials.

<img src="https://files.training.databricks.com/images/weaviate_register.png" width=500>

Step 3: Click on `Create cluster` and select `Free sandbox`. Provide your cluster name. For simplicity, we will toggle `enable authentication` to be `No`. Then, hit `Create`. 

<img src="https://files.training.databricks.com/images/weaviate_create_cluster.png" width=900>

Step 4: Click on `Details`, copy the `Cluster URL`, and paste it into the `WEAVIATE_NETWORK` variable in the cell below.

We will use embeddings from OpenAI, so we will also need an OpenAI API key.

Steps:
1. You need to [create an account](https://platform.openai.com/signup) on OpenAI. 
2. Generate an OpenAI [API key here](https://platform.openai.com/account/api-keys). 

Note: OpenAI does not have a free option, but it gives you $5 as credit. Once you have exhausted your $5 credit, you will need to add your payment method. You will be [charged per token usage](https://openai.com/pricing). **IMPORTANT**: It's crucial that you keep your OpenAI API key to yourself. If others have access to your OpenAI key, they will be able to charge their usage to your account!
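
One way to avoid leaving a key in the notebook source is to read it from the environment and fail early when it is missing. The sketch below is a minimal illustration using only the standard library; `require_env` is a hypothetical helper name, not part of the course code. On Databricks, a secret scope read through `dbutils.secrets.get` is another option.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Example (commented out so the sketch runs without real credentials):
# openai_key = require_env("OPENAI_API_KEY")
# cluster_url = require_env("WEAVIATE_NETWORK")
```

Failing at this point gives a clearer error than an authentication failure later in the notebook.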

In [0]:
# TODO
import os

os.environ["OPENAI_API_KEY"] = ""
os.environ["WEAVIATE_NETWORK"] = ""

In [0]:
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]
weaviate_network = os.environ["WEAVIATE_NETWORK"]

In [0]:
import weaviate

client = weaviate.Client(
    weaviate_network, additional_headers={"X-OpenAI-Api-Key": openai.api_key}
)
client.is_ready()

True

### Dataset


In this section, we are going to use the data on <a href="https://newscatcherapi.com/" target="_blank">news topics collected by the NewsCatcher team</a>, who collect and index news articles and release them to the open-source community. The dataset can be downloaded from <a href="https://www.kaggle.com/kotartemiy/topic-labeled-news-dataset" target="_blank">Kaggle</a>.

In [0]:
df = (
    spark.read.option("header", True)
    .option("sep", ";")
    .format("csv")
    .load(
        "dbfs:/mnt/dbacademy-datasets/large-language-models/v01/news/labelled_newscatcher_dataset.csv"
    )
)
display(df)

topic,link,domain,published_date,title,lang
SCIENCE,https://www.eurekalert.org/pub_releases/2020-08/dbnl-acl080620.php,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel potential,en
SCIENCE,https://www.pulse.ng/news/world/an-irresistible-scent-makes-locusts-swarm-study-finds/jy784jw,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, study finds",en
SCIENCE,https://www.express.co.uk/news/science/1322607/artificial-intelligence-warning-machine-learning-algorithm-social-media-data,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know us better than we know ourselves,en
SCIENCE,https://www.ndtv.com/world-news/glaciers-could-have-sculpted-mars-valleys-study-2273648,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en
SCIENCE,https://www.thesun.ie/tech/5742187/perseid-meteor-shower-tonight-time-uk-see/,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how to see the huge bright FIREBALLS over UK again tonight,en
SCIENCE,https://interestingengineering.com/nasa-releases-in-depth-map-of-beirut-explosion-damage,interestingengineering.com,2020-08-08 11:05:45,NASA Releases In-Depth Map of Beirut Explosion Damage,en
SCIENCE,https://www.thequint.com/tech-and-auto/spacex-nasa-demo-2-rocket-launch-set-for-saturday-how-to-watch,thequint.com,2020-05-28 09:09:46,"SpaceX, NASA Demo-2 Rocket Launch Set for Saturday: How to Watch",en
SCIENCE,https://www.thespacereview.com/article/4003/1,thespacereview.com,2020-08-10 22:48:23,Orbital space tourism set for rebirth in 2021,en
SCIENCE,https://www.businessinsider.com/greenland-melting-ice-sheet-past-tipping-point-2020-8,businessinsider.com,2020-08-16 00:28:54,Greenland's melting ice sheet has 'passed the point of no return',en
SCIENCE,https://www.thehindubusinessline.com/news/science/nasa-invites-engineering-students-to-help-harvest-water-on-mars-moon/article32352915.ece,thehindubusinessline.com,2020-08-14 07:43:25,"NASA invites engineering students to help harvest water on Mars, Moon",en


We are going to store this dataset in the Weaviate database. To do that, we first need to define a schema. A schema is where we define classes, class properties, data types, and vectorizer modules we would like to use. 

In the schema below, notice that:

- We capitalize the first letter of `class_name`. This is Weaviate's rule. 
- We specify data types within `properties`
- We use `text2vec-openai` as the vectorizer. 
  - You can also upload your own vectors (refer to [docs here](https://weaviate.io/developers/weaviate/api/rest/objects#with-a-custom-vector)) or create a class without any vectors, in which case similarity search will not be available for that class.

[Reference documentation here](https://weaviate.io/developers/weaviate/tutorials/schema)

In [0]:
class_name = "News"
class_obj = {
    "class": class_name,
    "description": "News topics collected by NewsCatcher",
    "properties": [
        {"name": "topic", "dataType": ["string"]},
        {"name": "link", "dataType": ["string"]},
        {"name": "domain", "dataType": ["string"]},
        {"name": "published_date", "dataType": ["string"]},
        {"name": "title", "dataType": ["string"]},
        {"name": "lang", "dataType": ["string"]},
    ],
    "vectorizer": "text2vec-openai",
}
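
The hand-written definition above could also be generated from the list of column names, which helps when the DataFrame has many columns. This is a minimal sketch, not the course's method: `build_class_schema` is a hypothetical helper, and it uses the `text` data type because that is what the stored schema reports for these properties.

```python
def build_class_schema(class_name, description, columns, vectorizer="text2vec-openai"):
    """Build a Weaviate class definition with one text property per column."""
    # Weaviate requires class names to start with a capital letter.
    if not class_name or not class_name[0].isupper():
        raise ValueError("Weaviate class names must start with a capital letter")
    return {
        "class": class_name,
        "description": description,
        "properties": [{"name": col, "dataType": ["text"]} for col in columns],
        "vectorizer": vectorizer,
    }


columns = ["topic", "link", "domain", "published_date", "title", "lang"]
schema = build_class_schema("News", "News topics collected by NewsCatcher", columns)
print(schema["properties"][0])
```

In the notebook, `columns` could be taken from `df.columns` instead of being typed out.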

In [0]:
# If the class already exists, delete it before recreating it
if client.schema.exists(class_name):
    print("Deleting existing class...")
    client.schema.delete_class(class_name)

print(f"Creating class: '{class_name}'")
client.schema.create_class(class_obj)

Creating class: 'News'


If you are curious what the schema looks like for your class, run the following command.

In [0]:
import json

print(json.dumps(client.schema.get(class_name), indent=4))

{
    "class": "News",
    "description": "News topics collected by NewsCatcher",
    "invertedIndexConfig": {
        "bm25": {
            "b": 0.75,
            "k1": 1.2
        },
        "cleanupIntervalSeconds": 60,
        "stopwords": {
            "additions": null,
            "preset": "en",
            "removals": null
        }
    },
    "moduleConfig": {
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "vectorizeClassName": true
        }
    },
    "multiTenancyConfig": {
        "enabled": false
    },
    "properties": [
        {
            "dataType": [
                "text"
            ],
            "indexFilterable": true,
            "indexSearchable": true,
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": false,
                    "vectorizePropertyName": false
                }
            },
            "name": "topic",
        

Now that the class is created, we are going to write our dataframe to the class. 

**IMPORTANT!!** Since we are writing a Spark DataFrame out, we need a Spark Connector to Weaviate. You need to [download the Spark connector jar file](https://github.com/weaviate/spark-connector#download-jar-from-github) and [upload to your Databricks cluster](https://github.com/weaviate/spark-connector#using-the-jar-in-databricks) before running the next cell. If you do not do this, the next cell *will fail*.

In [0]:
(
    df.limit(100)
    .write.format("io.weaviate.spark.Weaviate")
    .option("scheme", "http")
    .option("host", weaviate_network.split("https://")[1])
    .option("header:X-OpenAI-Api-Key", openai.api_key)
    .option("className", class_name)
    .mode("append")
    .save()
)
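
The `host` option above is derived by splitting the cluster URL on `"https://"`, which raises an `IndexError` for an `http://` URL and keeps any trailing slash. A slightly more defensive sketch, using only the standard library, is shown here. `split_cluster_url` is a hypothetical helper and the example URL is made up; which `scheme` value the connector needs for your cluster should be checked against the connector's README.

```python
from urllib.parse import urlsplit


def split_cluster_url(cluster_url):
    """Split a cluster URL into the (scheme, host) pair used by the connector options."""
    parts = urlsplit(cluster_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(
            f"Expected a URL such as https://<cluster-host>, got {cluster_url!r}"
        )
    # netloc keeps an explicit port (for example localhost:8080) and drops any path.
    return parts.scheme, parts.netloc


# Made-up URL for illustration only.
scheme, host = split_cluster_url("https://my-sandbox.example.weaviate.network/")
print(scheme, host)
```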

Let's check if the data is indeed populated. You can run either the following command or go to 
`https://{insert_your_cluster_url_here}/v1/objects` 

You should be able to see the data records, rather than null objects.

In [0]:
client.query.get("News", ["topic"]).do()

{'data': {'Get': {'News': [{'topic': 'TECHNOLOGY'},
    {'topic': 'SCIENCE'},
    {'topic': 'SCIENCE'},
    {'topic': 'SCIENCE'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'SCIENCE'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'SCIENCE'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'SCIENCE'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'SCIENCE'},
    {'topic': 'SCIENCE'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'SCIENCE'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'SCIENCE'},
    {'topic': 'SCIENCE'},
    {'topic': 'SCIENCE'},
    {'topic': 'SCIENCE'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'SCIENCE'},
    {'topic': 'SCIENCE'},
    {'topic': 'SCIENCE'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'SCIENCE'},
    {'topic': 'TECHNOLOGY'},
    {'topic': 'SCIENCE'},
    {'topic': 'SCIENCE'},
    {'topic': 'SCIE
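
Listing every object gets unwieldy quickly, so a count is often a more convenient check. The sketch below assumes the v3 Python client's `client.query.aggregate(...).with_meta_count()` builder and the response shape noted in the comment; `count_objects` is a hypothetical helper name.

```python
def count_objects(client, class_name):
    """Return the number of objects stored in `class_name` using an Aggregate query."""
    response = client.query.aggregate(class_name).with_meta_count().do()
    # Assumed response shape:
    # {"data": {"Aggregate": {"<ClassName>": [{"meta": {"count": <int>}}]}}}
    return response["data"]["Aggregate"][class_name][0]["meta"]["count"]
```

Because the write above is limited to 100 rows, `count_objects(client, class_name)` would be expected to return 100 for a freshly created class.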

Yay! Looks like the data is populated. We can now run a similarity query. We are going to search for news titles related to `locusts`, and add a filter that restricts the results to the `SCIENCE` topic. Notice that we don't have to convert the query text into an embedding ourselves, because the class was created with a vectorizer.

We will use `with_near_text` to specify the text we would like to query similar titles for. By default, Weaviate uses cosine distance to determine similar objects. Refer to [distance documentation here](https://weaviate.io/developers/weaviate/config-refs/distances#available-distance-metrics).
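
For intuition, cosine distance is one minus the cosine similarity of two vectors, so it ranges from 0 (same direction) to 2 (opposite direction). The following is a plain-Python sketch of that definition for small vectors; it illustrates the metric only and is not how Weaviate computes it internally.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Only the direction of the vectors matters here: scaling either vector by a positive constant leaves the distance unchanged.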

In [0]:
where_filter = {
    "path": ["topic"],
    "operator": "Equal",
    "valueString": "SCIENCE",
}

# Search for titles related to locusts; `concepts` takes a list of query strings
near_text = {"concepts": ["locust"]}
(
    client.query.get(class_name, ["topic", "domain", "title"])
    .with_where(where_filter)
    .with_near_text(near_text)
    .with_limit(2)
    .do()
)

{'data': {'Get': {'News': [{'domain': 'pulse.ng',
     'title': 'An irresistible scent makes locusts swarm, study finds',
     'topic': 'SCIENCE'},
    {'domain': 'cnn.com',
     'title': "'Zombie cicadas' under the influence of a mind controlling fungus have returned to West Virginia",
     'topic': 'SCIENCE'}]}}}
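
The result is a nested dictionary, so downstream code usually unpacks it before use. A minimal sketch follows, using the response printed above as input. `extract_hits` is a hypothetical helper, and the check for a top-level `errors` key assumes the usual GraphQL convention for reporting failures.

```python
def extract_hits(response, class_name):
    """Return the list of matching objects from a GraphQL Get response."""
    # Assumption: failed queries carry a top-level "errors" key, as is usual for GraphQL.
    if "errors" in response:
        raise RuntimeError(f"Query failed: {response['errors']}")
    return response["data"]["Get"][class_name]


# Response copied from the query output above.
response = {
    "data": {
        "Get": {
            "News": [
                {
                    "domain": "pulse.ng",
                    "title": "An irresistible scent makes locusts swarm, study finds",
                    "topic": "SCIENCE",
                },
                {
                    "domain": "cnn.com",
                    "title": "'Zombie cicadas' under the influence of a mind controlling fungus have returned to West Virginia",
                    "topic": "SCIENCE",
                },
            ]
        }
    }
}

for hit in extract_hits(response, "News"):
    print(f"{hit['domain']}: {hit['title']}")
```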

Alternatively, if you wish to supply your own embeddings at query time, you can do that too. Since embeddings are vectors, we will use `with_near_vector` instead.

In the code block below, we additionally introduce a `distance` parameter. The lower the distance score is, the closer the vectors are to each other. Read more about the distance thresholds [here](https://weaviate.io/developers/weaviate/config-refs/distances#available-distance-metrics).

In [0]:
import openai

model = "text-embedding-ada-002"
openai_object = openai.Embedding.create(input=["locusts"], model=model)

openai_embedding = openai_object["data"][0]["embedding"]

(
    client.query.get("News", ["topic", "domain", "title"])
    .with_where(where_filter)
    .with_near_vector(
        {
            "vector": openai_embedding,
            "distance": 0.7,  # this sets a threshold for distance metric
        }
    )
    .with_limit(2)
    .do()
)



{'data': {'Get': {'News': [{'domain': 'pulse.ng',
     'title': 'An irresistible scent makes locusts swarm, study finds',
     'topic': 'SCIENCE'},
    {'domain': 'cnn.com',
     'title': "'Zombie cicadas' under the influence of a mind controlling fungus have returned to West Virginia",
     'topic': 'SCIENCE'}]}}}
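
Conceptually, the `distance` threshold and `with_limit` act as two successive cuts: drop every candidate farther than the threshold, then keep the closest few. The sketch below mimics that on made-up data. The titles and distances are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def nearest_within(candidates, max_distance, limit):
    """Keep candidates with distance <= max_distance, closest first, at most `limit`."""
    kept = [c for c in candidates if c[1] <= max_distance]
    kept.sort(key=lambda c: c[1])
    return kept[:limit]


# Hypothetical (title, distance) pairs; the distances are invented for illustration.
candidates = [
    ("title A", 0.82),
    ("title B", 0.15),
    ("title C", 0.64),
    ("title D", 0.40),
]
print(nearest_within(candidates, max_distance=0.7, limit=2))
```

A tighter threshold can return fewer results than the limit, and possibly none, when no stored vector is close enough to the query.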

&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>