## Data import with Weaviate

> Note: This Jupyter notebook is an accompaniment to our video tutorial on the topic, which can be found at https://www.weaviate.io/developers/weaviate/tutorials/vector-provision-options

To import data into Weaviate, you can:

- Have Weaviate obtain a vector (e.g. through an inference API) with a `vectorizer` module

- Or provide your own vector

We will show you an example of both here.

**Note**: This notebook uses Python, but the same principles apply to all Weaviate client libraries.

### Instantiate Weaviate client

In [1]:
import weaviate
import os

api_key = os.environ["OPENAI_API_KEY"]
weaviate_url = os.environ["TEMP_WEAVIATE_URL"]

client = weaviate.Client(
    url=weaviate_url,  # Replace with your actual instance (e.g. url="http://localhost:8080/", or "https://your-endpoint.weaviate.network/")
    additional_headers={
        "X-OpenAI-API-Key": api_key
    }
)

In [2]:
# client.schema.delete_all()  # To wipe your instance clean

In [3]:
def jprint(json_in):
    import json
    print(json.dumps(json_in, indent=2))

jprint(client.get_meta())

{
  "hostname": "http://[::]:8080",
  "modules": {
    "text2vec-openai": {
      "documentationHref": "https://beta.openai.com/docs/guides/embeddings/what-are-embeddings",
      "name": "OpenAI Module"
    }
  },
  "version": "1.18.0"
}


In [4]:
jprint(client.schema.get())

{
  "classes": []
}


### Load data

**Note**: The details of this function are not important.
- It yields dictionaries of data to be imported.

Example:

```
{
  "review_body": "A soft, creamy wine full of apple and pear fruits that are edged with crisp acidity. It is ripe, but still young; drink this attractive wine from 2017.",
  "title": "DFJ Vinhos 2015 Grand'Arte Alvarinho (Lisboa)",
  "country": "Portugal",
  "points": 87,
  "price": 13.0
}
```

In [5]:
def dataloader():
    import pandas as pd
    winedata_path = "winemag_tiny.csv"
    df = pd.read_csv(winedata_path)
    for _, row in df.iterrows():
        data_obj = {
            "review_body": row["description"],
            "title": row["title"],
            "country": row["country"],
            "points": row["points"],
            "price": row["price"],
        }
        yield data_obj
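To see the shape of what the generator yields without needing the CSV file, here is a minimal sketch of the same idea using only the standard library. The two sample rows, the name `SAMPLE_CSV`, and the function `dataloader_from_text` are made up for illustration; they are not part of `winemag_tiny.csv` or of the notebook's `dataloader`.

```python
import csv
import io

# Hypothetical two-row sample standing in for winemag_tiny.csv
SAMPLE_CSV = """description,title,country,points,price
"Crisp and fresh.",Example Wine A,Portugal,87,13.0
"Ripe red fruit.",Example Wine B,France,90,
"""

def dataloader_from_text(csv_text):
    """Yield one dictionary per CSV row, using the property names from the notebook."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield {
            "review_body": row["description"],
            "title": row["title"],
            "country": row["country"],
            "points": int(row["points"]),
            # An empty price cell becomes None
            "price": float(row["price"]) if row["price"] else None,
        }

sample_objects = list(dataloader_from_text(SAMPLE_CSV))
print(sample_objects[0])
```

Because it is a generator, each dictionary is produced only when the import loop asks for it, so the full dataset never has to be converted up front.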

### Import objects - with Weaviate-obtained vectors

#### Add the object class to the schema

In [6]:
class_name = "WineReview"

In [7]:
class_obj = {
    "class": class_name,
    "vectorizer": "text2vec-openai",
}

client.schema.create_class(class_obj)

In [8]:
jprint(client.schema.get())

{
  "classes": [
    {
      "class": "WineReview",
      "invertedIndexConfig": {
        "bm25": {
          "b": 0.75,
          "k1": 1.2
        },
        "cleanupIntervalSeconds": 60,
        "stopwords": {
          "additions": null,
          "preset": "en",
          "removals": null
        }
      },
      "moduleConfig": {
        "text2vec-openai": {
          "model": "ada",
          "modelVersion": "002",
          "type": "text",
          "vectorizeClassName": true
        }
      },
      "properties": [],
      "replicationConfig": {
        "factor": 1
      },
      "shardingConfig": {
        "virtualPerPhysical": 128,
        "desiredCount": 1,
        "actualCount": 1,
        "desiredVirtualCount": 128,
        "actualVirtualCount": 128,
        "key": "_id",
        "strategy": "hash",
        "function": "murmur3"
      },
      "vectorIndexConfig": {
        "skip": false,
        "cleanupIntervalSeconds": 300,
        "maxConnections": 64,
        "efCon

#### Add objects

In [9]:
with client.batch as batch:  # Context manager manages batch flushing
    batch.batch_size = 50
    batch.dynamic = True
    for data_obj in dataloader():
        batch.add_data_object(
            data_obj,
            class_name
        )

In [10]:
jprint(data_obj)

{
  "review_body": "A soft, creamy wine full of apple and pear fruits that are edged with crisp acidity. It is ripe, but still young; drink this attractive wine from 2017.",
  "title": "DFJ Vinhos 2015 Grand'Arte Alvarinho (Lisboa)",
  "country": "Portugal",
  "points": 87,
  "price": 13.0
}


#### Did it work?

In [11]:
jprint(client.query.aggregate(class_name).with_meta_count().do())

{
  "data": {
    "Aggregate": {
      "WineReview": [
        {
          "meta": {
            "count": 50
          }
        }
      ]
    }
  }
}


👍
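If you would rather check the count in code than read it off the JSON, the value can be pulled out of the nested response. This is a small sketch that works on a copy of the response shown above; the helper name `object_count` is our own and not part of the Weaviate client.

```python
# Response copied from the aggregate query output above
response = {"data": {"Aggregate": {"WineReview": [{"meta": {"count": 50}}]}}}

def object_count(aggregate_response, class_name):
    """Extract the meta count for one class from an Aggregate response."""
    return aggregate_response["data"]["Aggregate"][class_name][0]["meta"]["count"]

count = object_count(response, "WineReview")
print(count)
```

In a live notebook, you could pass the result of the aggregate query straight into `object_count` and compare it with the number of rows you expected to import.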

Let's try a vector search

In [12]:
results = client.query.get(
    class_name, ["title", "country", "review_body"]
).with_near_text(
    {"concepts": ["fruity French wine"]}
).with_additional(
    ["distance", "id"]
).with_limit(5).do()
jprint(results)

{
  "data": {
    "Get": {
      "WineReview": [
        {
          "_additional": {
            "distance": 0.10594118,
            "id": "f9a081b2-08a8-46fd-99f7-a0b1b64c0607"
          },
          "country": "France",
          "review_body": "Definite juiciness here, although the wine has a light and fresh character. The acidity and tannins are there to give the berry fruit a boost.",
          "title": "Ch\u00e2teau Saint-Sernin 2007 Malbec (Cahors)"
        },
        {
          "_additional": {
            "distance": 0.11196965,
            "id": "6db7e884-da6f-459f-9c83-b31656a0b17f"
          },
          "country": "France",
          "review_body": "This ripe wine with its red fruits and well-balanced texture is beautifully ready to drink, with just the right crisp acidity to balance the soft richness. There is no sign yet of maturity, the fruit from this great vintage showing at its perfumed best. Drink the wine now.",
          "title": "Pol Roger 2008 Ros\u00e9 Brut  

👍🥳

### Import objects - with our own vectors

#### Add the object class to the schema

In [13]:
class_name = "OtherWineReview"

In [14]:
class_obj = {
    "class": class_name,
    "vectorizer": "none",
}

client.schema.create_class(class_obj)

#### Add objects

(We will use arbitrary, meaningless vectors just for demonstration.)

In [15]:
with client.batch as batch:  # Context manager manages batch flushing
    batch.batch_size = 50
    batch.dynamic = True
    for i, data_obj in enumerate(dataloader()):
        batch.add_data_object(
            data_obj,
            class_name,
            # ⬇️ Arbitrary, meaningless vectors
            vector=[0.1 + (i + j) * 0.01 for j in range(5)]
        )

Let's try a vector search again.

But we cannot use `nearText` as we don't have a vectorizer configured. So we will use `nearVector`.

In [16]:
results = client.query.get(
    class_name, ["title", "country", "review_body"]
).with_near_vector(
    {"vector": [0.6] * 5}
).with_additional(
    ["distance", "id", "vector"]
).with_limit(10).do()  # The video uses a limit of 10; reduce it for shorter output
jprint(results)

{
  "data": {
    "Get": {
      "OtherWineReview": [
        {
          "_additional": {
            "distance": 0.00026863813,
            "id": "0e34c52f-feef-46ab-a2a1-603b4af2fca5",
            "vector": [
              0.59,
              0.6,
              0.61,
              0.62,
              0.63
            ]
          },
          "country": "Portugal",
          "review_body": "A soft, creamy wine full of apple and pear fruits that are edged with crisp acidity. It is ripe, but still young; drink this attractive wine from 2017.",
          "title": "DFJ Vinhos 2015 Grand'Arte Alvarinho (Lisboa)"
        },
        {
          "_additional": {
            "distance": 0.00027763844,
            "id": "391deb37-d56c-419d-a70a-91b71a27d773",
            "vector": [
              0.58,
              0.59,
              0.6,
              0.61,
              0.62
            ]
          },
          "country": "France",
          "review_body": "Classic Viognier apricot flavors

👍🥳
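The distances in this output can be reproduced by hand. The sketch below rebuilds the arbitrary vectors from the import loop and compares them with the query vector `[0.6] * 5`. It assumes the class uses cosine distance (Weaviate's default when no other metric is configured) and that 50 objects were imported, as with the first class. The helper names `demo_vector` and `cosine_distance` are our own.

```python
import math

def demo_vector(i, dims=5):
    """Same expression as in the import loop: an arbitrary vector for object i."""
    return [0.1 + (i + j) * 0.01 for j in range(dims)]

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

query = [0.6] * 5
last = demo_vector(49)         # 50 objects, so the last index is 49
second_last = demo_vector(48)

print([round(x, 2) for x in last])
print(cosine_distance(query, last))
print(cosine_distance(query, second_last))
```

The last imported object has the vector closest in direction to the query, which is why it appears first in the results. The computed values agree with the reported distances up to floating-point rounding.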

### Import objects - specify vectors AND use a vectorizer!

But - how? And why? 🤔

#### Add the object class to the schema

In [17]:
class_name = "OtherOtherWineReview"

In [18]:
class_obj = {
    "class": class_name,
    "vectorizer": "text2vec-openai",
}

client.schema.create_class(class_obj)

#### Add objects

In [19]:
import openai
openai.api_key = api_key

with client.batch as batch:  # Context manager manages batch flushing
    batch.batch_size = 50
    batch.dynamic = True
    for i, data_obj in enumerate(dataloader()):
        # Concatenate the fields to be embedded into a single string
        input_txt = " || ".join([data_obj["title"], data_obj["country"], data_obj["review_body"]])
        openai_resp = openai.Embedding.create(
            model="text-embedding-ada-002",
            input=input_txt
        )
        vector = openai_resp["data"][0]["embedding"]
        batch.add_data_object(
            data_obj,
            class_name,
            vector=vector  # Provide the OpenAI vector
        )
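The loop above sends one embedding request per object. A possible variation is to group the input strings so that each request carries several of them. The sketch below shows only the grouping logic. `embed_fn` is a placeholder for whatever embedding call you use, and we assume it accepts a list of strings and returns one vector per string in the same order. `fake_embed` is a stand-in for demonstration, not a real model, and the three sample objects are invented.

```python
def build_input(data_obj):
    """Join the fields to embed, as in the import loop above."""
    return " || ".join([data_obj["title"], data_obj["country"], data_obj["review_body"]])

def embed_in_chunks(texts, embed_fn, chunk_size=16):
    """Call embed_fn once per chunk of texts and return the vectors in input order."""
    vectors = []
    for start in range(0, len(texts), chunk_size):
        vectors.extend(embed_fn(texts[start:start + chunk_size]))
    return vectors

# Stand-in embedder for demonstration only: NOT a real embedding model
def fake_embed(chunk):
    return [[float(len(text))] for text in chunk]

# Invented sample objects
objs = [
    {"title": "A", "country": "France", "review_body": "Fruity."},
    {"title": "B", "country": "Portugal", "review_body": "Crisp."},
    {"title": "C", "country": "Spain", "review_body": "Dry."},
]
texts = [build_input(o) for o in objs]
vectors = embed_in_chunks(texts, fake_embed, chunk_size=2)
print(texts[0])
print(len(vectors))
```

Each returned vector could then be passed to `batch.add_data_object` together with its object, exactly as in the loop above.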

As before, we can still perform vector searches. Here's one using the last vector from the `for` loop:

In [20]:
jprint(data_obj)

{
  "review_body": "A soft, creamy wine full of apple and pear fruits that are edged with crisp acidity. It is ripe, but still young; drink this attractive wine from 2017.",
  "title": "DFJ Vinhos 2015 Grand'Arte Alvarinho (Lisboa)",
  "country": "Portugal",
  "points": 87,
  "price": 13.0
}


In [21]:
results = client.query.get(
    class_name, ["title", "country", "review_body"]
).with_near_vector(
    {"vector": vector}
).with_additional(
    ["distance", "id"]
).with_limit(5).do()
jprint(results)

{
  "data": {
    "Get": {
      "OtherOtherWineReview": [
        {
          "_additional": {
            "distance": 1.1920929e-07,
            "id": "0ad6ece9-50ef-4370-957c-bcec6bcf7c9c"
          },
          "country": "Portugal",
          "review_body": "A soft, creamy wine full of apple and pear fruits that are edged with crisp acidity. It is ripe, but still young; drink this attractive wine from 2017.",
          "title": "DFJ Vinhos 2015 Grand'Arte Alvarinho (Lisboa)"
        },
        {
          "_additional": {
            "distance": 0.11368734,
            "id": "d6beadb6-8527-4e69-b66c-f533d2f2c79a"
          },
          "country": "Portugal",
          "review_body": "In a medium sweet style, the wine has a light prickle and fine citrus acidity to balance the soft freshness. It is a refreshing wine, totally drinkable immediately.",
          "title": "Messias 2015 Santola White (Vinho Verde)"
        },
        {
          "_additional": {
            "distance": 0

But perhaps more importantly...

We can still use `nearText` to vectorize inputs, while making use of our own vectors!

In [22]:
results = client.query.get(
    class_name, ["title", "country", "review_body"]
).with_near_text(
    {"concepts": ["fruity French wine"]}
).with_additional(
    ["distance", "id"]
).with_limit(5).do()
jprint(results)

{
  "data": {
    "Get": {
      "OtherOtherWineReview": [
        {
          "_additional": {
            "distance": 0.123701274,
            "id": "ae1e1fdb-a006-4f3c-a8c0-d83119a6c565"
          },
          "country": "France",
          "review_body": "Classic Viognier apricot flavors are prominent in this ripe, full wine. It's as poised as it is rich\u2014full of fruit, with a warm, rounded aftertaste.",
          "title": "Maison de la Villette 2016 Viognier (Vin de France)"
        },
        {
          "_additional": {
            "distance": 0.1310789,
            "id": "0bd5c5dd-accf-4019-9092-bdae682643dd"
          },
          "country": "France",
          "review_body": "Firm tannins make this wine solid and dry. It has banana and bright cherry fruits well integrated with the acidity. The aftertaste suggests the wine could benefit from a few more months before drinking, so wait until 2017.",
          "title": "Domaine Foretal 2015  Beaujolais-Villages"
        },
  

👍🥳👍🥳

## Wrap-up

- You can provide your own vectors to Weaviate **or** have Weaviate handle vectorization. It depends on your situation.

- If a vector is provided with the object, Weaviate will use that vector at import time *regardless of* whether a vectorizer is specified.

- If a vectorizer is not specified, you cannot use `nearText` as Weaviate has no way of turning text into a vector.

- Using a vectorizer and providing your own vectors can give you the best of both worlds in some cases.
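The import-time rule from these points can be written down as a small decision function. This is only a simplified model of the behavior described in this notebook, not Weaviate's implementation. The function name `vector_source` is our own, and the final "no vector" branch is an assumption about a case the notebook does not cover.

```python
def vector_source(vector_provided, vectorizer):
    """Return where an object's stored vector comes from (a simplified model)."""
    if vector_provided:
        return "provided vector"  # used regardless of the vectorizer setting
    if vectorizer != "none":
        return "vectorizer module"
    return "no vector"  # assumption: this case is not covered in the notebook

# The three classes created in this notebook: (vector provided?, vectorizer)
cases = {
    "WineReview": (False, "text2vec-openai"),
    "OtherWineReview": (True, "none"),
    "OtherOtherWineReview": (True, "text2vec-openai"),
}
summary = {name: vector_source(*args) for name, args in cases.items()}
for name, source in summary.items():
    print(f"{name}: {source}")
```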