
# Introduction

This notebook shows how to use the Confluent-Weaviate connector to stream data from a Kafka topic in Confluent Cloud into [Weaviate Cloud Services](https://weaviate.io/pricing) (WCS), using Spark Structured Streaming on [Databricks](https://databricks.com/).

# Dependencies

Install the Weaviate Python client:

In [None]:
!pip install weaviate-client

Collecting weaviate-client
  Using cached weaviate_client-3.24.1-py3-none-any.whl (107 kB)
Collecting requests<3.0.0,>=2.30.0
  Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Collecting validators<1.0.0,>=0.21.2
  Using cached validators-0.22.0-py3-none-any.whl (26 kB)
Collecting authlib<2.0.0,>=1.2.1
  Using cached Authlib-1.2.1-py2.py3-none-any.whl (215 kB)
Installing collected packages: validators, requests, authlib, weaviate-client
  Attempting uninstall: requests
    Found existing installation: requests 2.27.1
    Not uninstalling requests at /databricks/python3/lib/python3.9/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-a730a752-2104-4686-8b68-5d410fc80cc2
    Can't uninstall 'requests'. No files were found to uninstall.
Successfully installed authlib-1.2.1 requests-2.31.0 validators-0.22.0 weaviate-client-3.24.1
You should consider upgrading via the '/local_disk0/.ephemeral_nfs/envs/pythonEnv-a730a752-2104-4686-8b68-5d410fc80

# Imports

In [None]:
import json
import os
import time

import weaviate
from pyspark.sql import SparkSession

# Setup

Set up the Weaviate client to connect to Weaviate Cloud Services (WCS):

In [None]:
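# Read the WCS endpoint and API key from the Databricks secret scope.
wcs_url = dbutils.secrets.get("demo-confluent-connector", "WCS_URL")
wcs_api_key = dbutils.secrets.get("demo-confluent-connector", "WCS_API_KEY")

client = weaviate.Client(
    url=wcs_url,
    auth_client_secret=weaviate.AuthApiKey(wcs_api_key),
)

# WARNING: this deletes every class, and all of its objects, in the instance.
client.schema.delete_all()

# The Spark connector takes the host without the URL scheme.
# `_connection` is a private attribute of the client and may change between versions.
weaviate_url = client._connection.url
weaviate_host = weaviate_url.split("://")[1]

# Recover the API key from the client's "Bearer <key>" authorization header.
token = client._connection._headers["authorization"]
weaviate_api_key = token.split("Bearer ")[1]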
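
The host extraction above splits the URL on `://` by hand. A minimal sketch of the same step with the standard library, assuming the WCS URL has the usual `scheme://host` shape. `split_weaviate_url` and the example URL are hypothetical, introduced only for illustration, and are not part of the connector or the Weaviate client:

```python
from urllib.parse import urlparse


def split_weaviate_url(url):
    """Return (scheme, host) for a URL such as 'https://my-cluster.weaviate.network'."""
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"Expected a URL of the form scheme://host, got {url!r}")
    return parsed.scheme, parsed.netloc


# Hypothetical URL, used only to show the shape of the result.
scheme, host = split_weaviate_url("https://my-cluster.weaviate.network")
print(scheme, host)
```

Applied to `wcs_url` directly, a helper like this would also avoid reading the client's private `_connection` attribute.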

Set up the Spark session. On Databricks, the `spark` session is already available in the notebook, so the only step is to install the following libraries on your cluster:

1. org.apache.spark:spark-avro_2.12:3.4.1
2. org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1
3. confluent-connector_2.12-3.4.0_0.0.1.jar

Reference:
* [Libraries](https://docs.databricks.com/en/libraries/index.html)


Read the Confluent Cloud credentials from the secret scope:

In [None]:
confluentClusterName = dbutils.secrets.get("demo-confluent-connector", "CONFLUENT_CLUSTER_NAME")
confluentBootstrapServers = dbutils.secrets.get("demo-confluent-connector", "CONFLUENT_BOOTSTRAP_SERVERS")
confluentTopicName = dbutils.secrets.get("demo-confluent-connector", "CONFLUENT_TOPIC_NAME")
schemaRegistryUrl = dbutils.secrets.get("demo-confluent-connector", "CONFLUENT_SCHEMA_REGISTRY_URL")
confluentApiKey = dbutils.secrets.get("demo-confluent-connector", "CONFLUENT_API_KEY")
confluentSecret = dbutils.secrets.get("demo-confluent-connector", "CONFLUENT_SECRET")
confluentRegistryApiKey = dbutils.secrets.get("demo-confluent-connector", "CONFLUENT_REGISTRY_API_KEY")
confluentRegistrySecret = dbutils.secrets.get("demo-confluent-connector", "CONFLUENT_REGISTRY_SECRET")
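
The eight lookups above all read from the same secret scope. One possible way to make a missing key fail early is sketched below with a hypothetical `load_secrets` helper. `get_secret` stands in for `dbutils.secrets.get`, which is only available inside Databricks, and the fake getter with its made-up value exists only so the sketch runs anywhere:

```python
def load_secrets(get_secret, scope, keys):
    """Fetch every key from one scope and report all missing keys at once."""
    values, missing = {}, []
    for key in keys:
        try:
            values[key] = get_secret(scope, key)
        except Exception:  # the real getter's exception type is not assumed here
            missing.append(key)
    if missing:
        raise KeyError(f"Missing secrets in scope {scope!r}: {', '.join(missing)}")
    return values


# Stand-in for dbutils.secrets.get, holding a made-up value.
_fake_store = {("demo-confluent-connector", "CONFLUENT_TOPIC_NAME"): "clickstream"}


def fake_get(scope, key):
    return _fake_store[(scope, key)]


secrets = load_secrets(fake_get, "demo-confluent-connector", ["CONFLUENT_TOPIC_NAME"])
print(secrets["CONFLUENT_TOPIC_NAME"])
```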

# Demo

Define the schema of the `Clickstream` class, then create it in WCS:

In [None]:
weaviate_schema = {
    "class": "Clickstream",
    "description": "A record of user clicks on a website",
    "properties": [
        {
            "name": "_kafka_key",
            "dataType": [
                "string"
            ],
            "description": "The key of the Kafka message"
        },
        {
            "name": "_kafka_topic",
            "dataType": [
                "string"
            ],
            "description": "The topic of the Kafka message"
        },
        {
            "name": "_kafka_partition",
            "dataType": [
                "int"
            ],
            "description": "The partition of the Kafka message"
        },
        {
            "name": "_kafka_offset",
            "dataType": [
                "int"
            ],
            "description": "The offset of the Kafka message"
        },
        {
            "name": "_kafka_timestamp",
            "dataType": [
                "date"
            ],
            "description": "The timestamp of the Kafka message"
        },
        {
            "name": "_kafka_timestampType",
            "dataType": [
                "int"
            ],
            "description": "The timestamp type of the Kafka message"
        },
        {
            "name": "_kafka_schemaId",
            "dataType": [
                "int"
            ],
            "description": "The schema ID of the Kafka message value"
        },
        {
            "name": "user_id",
            "dataType": [
                "int"
            ],
            "description": "The ID of the user who clicked"
        },
        {
            "name": "username",
            "dataType": [
                "string"
            ],
            "description": "The username of the user who clicked"
        },
        {
            "name": "registered_at",
            "dataType": [
                "int"
            ],
            "description": "The timestamp when the user registered"
        },
        {
            "name": "first_name",
            "dataType": [
                "string"
            ],
            "description": "The first name of the user who clicked"
        },
        {
            "name": "last_name",
            "dataType": [
                "string"
            ],
            "description": "The last name of the user who clicked"
        },
        {
            "name": "city",
            "dataType": [
                "string"
            ],
            "description": "The city where the user who clicked is located"
        },
        {
            "name": "level",
            "dataType": [
                "string"
            ],
            "description": "The level of the user who clicked"
        }
    ]
}

In [None]:
client.schema.create_class(weaviate_schema)
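
Every property in the schema above has the same three fields, so the definition could also be generated from a compact table. A sketch, assuming a hypothetical `build_properties` helper. Only a few of the properties are listed to keep it short:

```python
def build_properties(specs):
    """Expand (name, data_type, description) tuples into Weaviate property dicts."""
    names = [name for name, _, _ in specs]
    duplicates = {n for n in names if names.count(n) > 1}
    if duplicates:
        raise ValueError(f"Duplicate property names: {sorted(duplicates)}")
    return [
        {"name": name, "dataType": [data_type], "description": description}
        for name, data_type, description in specs
    ]


specs = [
    ("_kafka_key", "string", "The key of the Kafka message"),
    ("_kafka_partition", "int", "The partition of the Kafka message"),
    ("user_id", "int", "The ID of the user who clicked"),
]

schema_sketch = {
    "class": "Clickstream",
    "description": "A record of user clicks on a website",
    "properties": build_properties(specs),
}
print(schema_sketch["properties"][0])
```

The duplicate check is a design choice of this sketch: a long hand-written property list is easy to extend with a repeated name by accident.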

Create a Spark Structured Streaming `DataFrame` that reads from the Kafka topic in Confluent Cloud:

In [None]:
clickstreamDF = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", confluentBootstrapServers)
    .option("subscribe", confluentTopicName)
    # Only read messages that arrive after the stream starts.
    .option("startingOffsets", "latest")
    # Authenticate to Confluent Cloud with SASL/PLAIN over TLS.
    .option("kafka.security.protocol", "SASL_SSL")
    .option(
        "kafka.sasl.jaas.config",
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(
            confluentApiKey, confluentSecret
        ),
    )
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    # Keep the query running even if some offsets are no longer available.
    .option("failOnDataLoss", "false")
    .option("name", "clickStreamReadFromConfluent")
    .load()
)
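
The `kafka.sasl.jaas.config` value is easy to get wrong, because the whole login-module line is a single string. A sketch of building it in one place, assuming a hypothetical `jaas_config` helper and made-up credentials. The `kafkashaded.` prefix is the one used in the cell above. The `shaded` flag is an assumption of this sketch, for environments that expect the unprefixed class name:

```python
def jaas_config(api_key, secret, shaded=True):
    """Build the SASL/PLAIN JAAS line used by the Kafka source."""
    prefix = "kafkashaded." if shaded else ""
    module = f"{prefix}org.apache.kafka.common.security.plain.PlainLoginModule"
    return f"{module} required username='{api_key}' password='{secret}';"


# Made-up credentials, used only to show the resulting string.
print(jaas_config("MY_KEY", "MY_SECRET"))
```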

Define a function to run on each microbatch. It counts the rows in the batch and writes them to Weaviate through the connector:

In [None]:
# Running total of rows seen across all microbatches.
total_rows_processed = 0


def f(df, batch_id):
    """Count the rows in a microbatch and write them to Weaviate."""
    global total_rows_processed
    row_count = df.count()
    total_rows_processed += row_count

    print(f"Number of rows in the batch with batch id {batch_id}: {row_count}")
    (
        df.write.format("io.weaviate.confluent.Weaviate")
        .option("batchsize", 200)
        .option("scheme", "http")
        .option("host", weaviate_host)
        .option("apiKey", weaviate_api_key)
        .option("className", weaviate_schema["class"])
        .option("schemaRegistryUrl", schemaRegistryUrl)
        .option("schemaRegistryApiKey", confluentRegistryApiKey)
        .option("schemaRegistryApiSecret", confluentRegistrySecret)
        .mode("append")
        .save()
    )
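
The function above keeps its running total in a module-level global. An alternative sketch keeps the counter in a closure instead. `make_batch_handler`, `write_batch`, and `FakeBatch` are hypothetical names, and the fake batch only mimics `DataFrame.count()` so the example runs without Spark:

```python
def make_batch_handler(write_batch):
    """Return (handler, stats); handler has the (df, batch_id) foreachBatch signature."""
    stats = {"rows": 0, "batches": 0}

    def handler(df, batch_id):
        row_count = df.count()
        stats["rows"] += row_count
        stats["batches"] += 1
        print(f"Number of rows in the batch with batch id {batch_id}: {row_count}")
        write_batch(df)  # in the notebook this would be the df.write ... .save() chain

    return handler, stats


class FakeBatch:
    """Stand-in for a Spark DataFrame; only count() is implemented."""

    def __init__(self, n):
        self._n = n

    def count(self):
        return self._n


written = []
handler, stats = make_batch_handler(written.append)
handler(FakeBatch(1), 0)
handler(FakeBatch(13), 1)
print(stats)
```

Returning `stats` alongside the handler lets a later cell read the totals without relying on a global name.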

Start writing the stream:

In [None]:
query = (
    clickstreamDF.writeStream.foreachBatch(f)
    .queryName("write_stream_to_weaviate")
    .start()
)

Let the stream run for 15 seconds, then stop it:

In [None]:
time.sleep(15)
query.stop()

Number of rows in the batch with batch id 0: 1
Number of rows in the batch with batch id 1: 13


ERROR:py4j.clientserver:There was an exception while executing the Python Proxy on the Python Side.
Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 617, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/databricks/spark/python/pyspark/sql/utils.py", line 119, in call
    raise e
  File "/databricks/spark/python/pyspark/sql/utils.py", line 116, in call
    self.func(DataFrame(jdf, wrapped_session_jdf), batch_id)
  File "<command-3510017430435041>", line 6, in f
    row_count = df.count()
  File "/databricks/spark/python/pyspark/instrumentation_utils.py", line 48, in wrapper
    res = func(*args, **kwargs)
  File "/databricks/spark/python/pyspark/sql/dataframe.py", line 1214, in count
    return int(self._jdf.count())
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/databricks/s

In [None]:
query.stop()
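
The cells above stop the stream with `time.sleep` followed by `query.stop()`. A possible alternative is `StreamingQuery.awaitTermination` with a timeout in seconds, which returns early if the query ends on its own and re-raises a failure of the query. The sketch below assumes a hypothetical `run_for` helper. `FakeQuery` is a stand-in that only mimics the two methods used, so the example runs without Spark:

```python
def run_for(query, seconds):
    """Wait up to `seconds` for the query to end on its own, then stop it."""
    try:
        finished = query.awaitTermination(seconds)
    finally:
        # Always stop the query, even if awaitTermination raised.
        query.stop()
    return finished


class FakeQuery:
    """Stand-in for a StreamingQuery that never terminates by itself."""

    def __init__(self):
        self.stopped = False

    def awaitTermination(self, timeout=None):
        return False  # mimics "still running when the timeout expired"

    def stop(self):
        self.stopped = True


fake_query = FakeQuery()
finished = run_for(fake_query, 15)
print(finished, fake_query.stopped)
```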

Compare the number of rows processed and the number of objects in Weaviate:

In [None]:
results = client.query.aggregate(weaviate_schema["class"]).with_meta_count().do()
total_objects_in_weaviate = results["data"]["Aggregate"][weaviate_schema["class"]][0][
    "meta"
]["count"]

assert (
    total_rows_processed == total_objects_in_weaviate
), f"Total rows processed {total_rows_processed} does not match total objects in weaviate {total_objects_in_weaviate}"
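
The nested indexing above raises a bare `KeyError` if the response has a different shape. A sketch of a more defensive lookup, assuming a hypothetical `meta_count` helper and assuming that a failed query reports its problems under a top-level `"errors"` key. The sample response is made up and only mirrors the shape indexed in the cell above:

```python
def meta_count(response, class_name):
    """Pull the object count out of an Aggregate response, or raise with context."""
    if "errors" in response:
        raise RuntimeError(f"Aggregate query failed: {response['errors']}")
    try:
        return response["data"]["Aggregate"][class_name][0]["meta"]["count"]
    except (KeyError, IndexError, TypeError) as exc:
        raise RuntimeError(f"Unexpected Aggregate response: {response!r}") from exc


# Made-up response with the same nesting as the real one.
sample = {"data": {"Aggregate": {"Clickstream": [{"meta": {"count": 14}}]}}}
print(meta_count(sample, "Clickstream"))
```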

Look at some of the objects in Weaviate:

In [None]:
client.data_object.get(class_name=weaviate_schema["class"], limit=3)

Out[13]: {'deprecations': [],
 'objects': [{'class': 'Clickstream',
   'creationTimeUnix': 1694802167098,
   'id': '1450d859-f98f-453e-aaf2-4ecad0877345',
   'lastUpdateTimeUnix': 1694802167098,
   'properties': {'_kafka_key': '203414',
    '_kafka_offset': 33782,
    '_kafka_partition': 3,
    '_kafka_schemaId': 100002,
    '_kafka_timestamp': '2023-09-15T18:22:39.691Z',
    '_kafka_timestampType': 0,
    '_kafka_topic': '[REDACTED]',
    'city': 'Frankfurt',
    'first_name': 'Reeva',
    'last_name': 'Vanyard',
    'level': 'Gold',
    'registered_at': 1442165147142,
    'user_id': 203414,
    'username': 'LukeWaters_23'},
   'vectorWeights': None},
  {'class': 'Clickstream',
   'creationTimeUnix': 1694802162605,
   'id': '45609b6d-d2fb-4476-833f-c5c4a3d411de',
   'lastUpdateTimeUnix': 1694802162605,
   'properties': {'_kafka_key': '203412',
    '_kafka_offset': 34137,
    '_kafka_partition': 4,
    '_kafka_schemaId': 100002,
    '_kafka_timestamp': '2023-09-15T18:22:38.702Z',
    '