[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/data-platforms/spark/spark-connector.ipynb)

# ⚡ Using the Spark Connector for Weaviate
Welcome to this recipe notebook! 

Here, we'll walk you through a small example of how to take data from a Spark DataFrame and write it into Weaviate.

Virtual Environment and Dependencies:
To ensure smooth execution and prevent potential conflicts with your global Python environment, we recommend running the code in a virtual environment. Later in this notebook, we'll guide you through setting up this environment and installing the necessary dependencies.

With these points in mind, let's get started!

## Dependencies
Before proceeding with the notebook content, it's essential to set up an isolated Python environment. This helps avoid any potential package conflicts and ensures that you have a clean workspace.

**You will also need Java 8+ and Scala 2.12 installed.**
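
As a quick sanity check before going further, the sketch below looks for the `java` and `scala` executables on your `PATH`. The helper name `find_missing_tools` is hypothetical, and the check only confirms that the executables can be found; it does not verify their versions.

```python
import shutil


def find_missing_tools(tools=("java", "scala")):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("java and scala were both found on PATH")
```

If something is reported as missing, install it (or fix your `PATH`) before starting the Spark session.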

## Virtual Environment Setup:
If you haven't created a virtual environment before, here's how you can do it:

Using `virtualenv`:
```bash
pip install virtualenv
python -m virtualenv venv
```

Using `venv` (built-in with Python 3.3+):

```bash
python -m venv venv
```

After creating the virtual environment, you need to activate it:

Windows:
```bash
.\venv\Scripts\activate
```
macOS and Linux:
```bash
source venv/bin/activate
```

## Installing Dependencies:
With the virtual environment active, run the following code to install all the required dependencies for this notebook:

**Please note that you will also need Java 8+ and Scala 2.12 installed.**

In [None]:
!python -m pip install weaviate-client==3.25.3 pyspark==3.5.0

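## Obtain the Spark Connector JAR File

You can download the latest JAR file from the [releases page](https://github.com/weaviate/spark-connector/releases/latest). Place it in the same directory as this notebook, because the Spark session below references it by a relative path (`spark-connector-assembly-1.3.1.jar`). If you download a different version, adjust the file name in that cell accordingly.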
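
If you would rather not hard-code the version number, a small helper can look the JAR up by name. This is only a sketch: the function `find_connector_jar` is hypothetical, and the `spark-connector-assembly-*.jar` pattern is an assumption based on the file name used in this notebook.

```python
from pathlib import Path


def find_connector_jar(directory="."):
    """Return the path of a connector JAR in `directory`, or None if absent."""
    # Assumed naming pattern, based on the file name used in this notebook.
    candidates = sorted(Path(directory).glob("spark-connector-assembly-*.jar"))
    # Sorting is lexicographic, not a true version comparison, so keep only
    # one JAR in the directory if you want a predictable result.
    return str(candidates[-1]) if candidates else None


jar_path = find_connector_jar()
print(jar_path or "No connector JAR found in the current directory")
```

The returned path could then be passed as the `spark.jars` value when building the Spark session.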

## Start the Spark Session

In [1]:
from pyspark.sql import SparkSession
import os, json
import warnings
warnings.filterwarnings('ignore')

In [2]:
spark = (
    SparkSession.builder.config(
        "spark.jars",
        "spark-connector-assembly-1.3.1.jar",
    )
    .master("local[*]")
    .appName("weaviate")
    .getOrCreate()
)


spark.sparkContext.setLogLevel("WARN")

23/12/09 01:11:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [3]:
spark

In [4]:
df = spark.read.json('tiny_Jeopardy.json')

                                                                                

In [5]:
df.show()

+--------------------+--------+--------------------+
|              Answer|Category|            Question|
+--------------------+--------+--------------------+
|               Liver| SCIENCE|This organ remove...|
|            Elephant| ANIMALS|It's the only liv...|
|   the nose or snout| ANIMALS|The gavial looks ...|
|            Antelope| ANIMALS|Weighing around a...|
|the diamondback r...| ANIMALS|Heaviest of all p...|
|             species| SCIENCE|2000 news: the Gu...|
|                wire| SCIENCE|A metal that is d...|
|                 DNA| SCIENCE|In 1953 Watson & ...|
|      the atmosphere| SCIENCE|Changes in the tr...|
|       Sound barrier| SCIENCE|In 70-degree air,...|
+--------------------+--------+--------------------+



## Initialize Weaviate Instance

Here we will:
- Create the Weaviate Client
- Define the Schema

In [6]:
import weaviate
from weaviate.embedded import EmbeddedOptions

# Start an embedded Weaviate instance and pass the OpenAI key used by the vectorizer
client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)

client.is_ready()

Started /Users/zainhasan/.cache/weaviate-embedded: process ID 11789


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-12-09T01:11:46-05:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-12-09T01:11:46-05:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"article_ep5yZpA4vsfT","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-12-09T01:11:47-05:00","took":661500}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"jeopardy_I7Kbovb9KTXv","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-12-09T01:11:47-05:00","took":44125}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50060","time":"2023-12-09T01:11:47-05:00"}
{"action":"restapi_management","level":"info","msg":"Ser

True

In [7]:
if client.schema.exists("Jeopardy"):
    client.schema.delete_class("Jeopardy")

client.schema.create_class(
    {
        "class": "Jeopardy",
        "properties": [
            {"name": "Answer", "dataType": ["string"]},
            {"name": "Category", "dataType": ["string"]},
            {"name": "Question", "dataType": ["string"]},
            ],
        "vectorizer": "text2vec-openai",
    }
)

{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"jeopardy_nOUABh0TzMfQ","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-12-09T01:11:49-05:00","took":62333}
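
For DataFrames with more columns, writing the `properties` list by hand gets tedious. The sketch below derives it from `(column, type)` pairs in the shape that `df.dtypes` returns. The helper `properties_from_dtypes` and the type mapping are assumptions made for illustration, not part of the connector. Note that the mapping uses Weaviate's `text` type rather than the `string` type in the cell above.

```python
# Assumed mapping from Spark SQL type names (as reported by df.dtypes)
# to Weaviate data types; extend it for the types in your own data.
SPARK_TO_WEAVIATE = {
    "string": "text",
    "int": "int",
    "bigint": "int",
    "float": "number",
    "double": "number",
    "boolean": "boolean",
}


def properties_from_dtypes(dtypes):
    """Build a Weaviate `properties` list from (column, spark_type) pairs."""
    properties = []
    for name, spark_type in dtypes:
        if spark_type not in SPARK_TO_WEAVIATE:
            raise ValueError(f"No mapping for column {name!r} of type {spark_type!r}")
        properties.append({"name": name, "dataType": [SPARK_TO_WEAVIATE[spark_type]]})
    return properties


# These pairs mirror what df.dtypes reports for the Jeopardy DataFrame.
example_dtypes = [("Answer", "string"), ("Category", "string"), ("Question", "string")]
print(properties_from_dtypes(example_dtypes))
```

In the notebook you would call `properties_from_dtypes(df.dtypes)` and use the result as the `"properties"` value in the class definition.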


## Move data from Spark to Weaviate

In [8]:
df.write.format("io.weaviate.spark.Weaviate") \
    .option("batchSize", 200) \
    .option("scheme", "http") \
    .option("host", "localhost:8079") \
    .option("header:X-OpenAI-Api-Key", os.getenv("OPENAI_API_KEY")) \
    .option("className", "Jeopardy") \
    .mode("append").save()
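
One thing to watch in the cell above: `os.getenv` returns `None` when `OPENAI_API_KEY` is not set, and that only surfaces once the write starts. A minimal sketch of collecting the same options in a dictionary and failing early follows. The helper `build_write_options` and its parameter names are hypothetical; the option keys are the ones used in the cell above.

```python
import os


def build_write_options(class_name, host="localhost:8079", scheme="http",
                        batch_size=200, api_key_env="OPENAI_API_KEY"):
    """Collect the connector options used above, failing early if the key is unset."""
    api_key = os.getenv(api_key_env)
    if not api_key:
        raise RuntimeError(f"Environment variable {api_key_env} is not set")
    return {
        "batchSize": str(batch_size),
        "scheme": scheme,
        "host": host,
        "className": class_name,
        "header:X-OpenAI-Api-Key": api_key,
    }


# Usage with the DataFrame from this notebook (requires the Spark session above):
# df.write.format("io.weaviate.spark.Weaviate") \
#     .options(**build_write_options("Jeopardy")) \
#     .mode("append").save()
```

Keeping the options in one place also makes it easier to reuse them if you write several DataFrames to the same instance.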

                                                                                

## Verify data has been written and query Weaviate

In [9]:
print(json.dumps(client.query.aggregate("Jeopardy").with_meta_count().do(), indent=2))

{
  "data": {
    "Aggregate": {
      "Jeopardy": [
        {
          "meta": {
            "count": 10
          }
        }
      ]
    }
  }
}
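
To check the count programmatically rather than by eye, you can pull it out of the nested response. The helper `object_count` below is a hypothetical name, and the sample dictionary is copied from the output above.

```python
def object_count(aggregate_response, class_name):
    """Pull the meta count for `class_name` out of an Aggregate response dict."""
    return aggregate_response["data"]["Aggregate"][class_name][0]["meta"]["count"]


# Response copied from the output above.
sample = {"data": {"Aggregate": {"Jeopardy": [{"meta": {"count": 10}}]}}}
print(object_count(sample, "Jeopardy"))  # prints 10
```

In the notebook, comparing this value with `df.count()` is a simple way to confirm that every row was written.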


In [10]:
response = (
    client.query
    .get("Jeopardy", ["question", "answer", "category"])
    .with_near_text({"concepts": ["biology"]})
    .with_additional(["distance"])
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=2))

{
  "data": {
    "Get": {
      "Jeopardy": [
        {
          "_additional": {
            "distance": 0.1876005
          },
          "answer": "DNA",
          "category": "SCIENCE",
          "question": "In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance"
        },
        {
          "_additional": {
            "distance": 0.20415491
          },
          "answer": "species",
          "category": "SCIENCE",
          "question": "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification"
        }
      ]
    }
  }
}
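
The nested response is awkward to work with directly, so here is a sketch that flattens each hit into a single dictionary, with the `_additional` fields merged into the row. The helper `flatten_results` is hypothetical, and the sample below is an abbreviated copy of the output above, with the question text left out for brevity.

```python
def flatten_results(response, class_name):
    """Turn a Get response into a list of flat dicts, one per returned object."""
    rows = []
    for obj in response["data"]["Get"][class_name]:
        # Copy the regular properties, then merge in the `_additional` fields.
        row = {key: value for key, value in obj.items() if key != "_additional"}
        row.update(obj.get("_additional", {}))
        rows.append(row)
    return rows


# Abbreviated copy of the response shown above.
sample = {
    "data": {
        "Get": {
            "Jeopardy": [
                {"_additional": {"distance": 0.1876005}, "answer": "DNA", "category": "SCIENCE"},
                {"_additional": {"distance": 0.20415491}, "answer": "species", "category": "SCIENCE"},
            ]
        }
    }
}

for row in flatten_results(sample, "Jeopardy"):
    print(row)
```

Flat rows like these are convenient for printing, and they could also be passed to `spark.createDataFrame` if you want the results back in Spark.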
