## Populate Weaviate instance

<a target="_blank" href="https://colab.research.google.com/github/weaviate-tutorials/intro-workshop/blob/main/2_build_new_db.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Add env vars (Colab only)

In [None]:
# import os

# os.environ["COHERE_APIKEY"] = "YOUR_COHERE_KEY"
# os.environ["OPENAI_APIKEY"] = "YOUR_OPENAI_KEY"

## Prep

Install libraries as needed (uncomment for Colab)

In [None]:
# !pip install -Uqq weaviate-client weaviate-demo-datasets

In [None]:
def jprint(obj):
    """Pretty-print a JSON-serializable object, such as a response dict."""
    import json
    print(json.dumps(obj, indent=2))

# Fun with Weaviate 😁🚀

Instantiate Weaviate client

In [None]:
import os

api_headers = {
    # You *ONLY* need the API key for the inference service that you are using
    # You will define further below (at Weaviate class-level) which API inference service to use for your data    
    "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"],  # *Your* OpenAI key
    "X-Cohere-Api-Key": os.environ["COHERE_APIKEY"],  # *Your* Cohere key
}
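
The cell above raises a `KeyError` if either environment variable is missing, even though only one key is needed. A minimal sketch of a more forgiving approach follows; the helper name `build_api_headers` and the `ENV_TO_HEADER` mapping are assumptions made for illustration and are not part of the Weaviate client.

```python
import os
from typing import Mapping, Optional

# Maps the environment variable names used in this notebook to the
# request headers that carry each inference service's API key.
ENV_TO_HEADER = {
    "OPENAI_APIKEY": "X-OpenAI-Api-Key",
    "COHERE_APIKEY": "X-Cohere-Api-Key",
}


def build_api_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return headers only for the API keys that are actually set."""
    env = os.environ if env is None else env
    return {
        header: env[var]
        for var, header in ENV_TO_HEADER.items()
        if env.get(var)  # skip missing or empty values
    }


# Example with an explicit mapping instead of the real environment
print(build_api_headers({"COHERE_APIKEY": "my-cohere-key"}))
# → {'X-Cohere-Api-Key': 'my-cohere-key'}
```

Calling `build_api_headers()` with no arguments reads from `os.environ`, so the result could be passed as `additional_headers` when creating the client.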

### Weaviate instantiation options

There are many ways to run Weaviate! The easiest ways are:

- Embedded Weaviate (Linux only; macOS support coming soon)
  - No additional step required
- WCS (SaaS option)
  - Go to https://console.weaviate.cloud and create an instance

### Also available:

If you would like further flexibility, you can also run Weaviate using:
- Docker-Compose
  - Go to https://weaviate.io/developers/weaviate/installation/docker-compose to get the config file

For production use, you can also use:
- Kubernetes, and
- Hybrid SaaS (On-premise deployment, managed by Weaviate)

In [None]:
# # Option 1 - If using Colab:
# import weaviate
# from weaviate.embedded import EmbeddedOptions

# client = weaviate.Client(
#     embedded_options=EmbeddedOptions(version="latest"),
#     additional_headers=api_headers
# )

In [None]:
# Option 2 - If using WCS sandbox:
import weaviate

client = weaviate.Client(
    "https://workshop-demo-0xwl314q.weaviate.network",  # Replace this with your sandbox URL
    auth_client_secret=weaviate.AuthApiKey("KQBu0wOvoWd70rXJIf1hs1oFkSmnxiupA7rm"),  # Replace this with your API Key
    additional_headers=api_headers
)

In [None]:
# # Option 3 - If using Docker-Compose:
# import weaviate
#
# client = weaviate.Client(
#     "http://localhost:8080",
#     additional_headers=api_headers
# )

Let's confirm that we are ready to go.

In [None]:
client.is_ready()
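
A freshly started instance (especially Embedded Weaviate or Docker-Compose) may need a few seconds before it reports ready. The sketch below polls any readiness check until it succeeds or a timeout passes. `wait_until_ready` is a hypothetical helper written for this notebook, not a client method.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool], timeout: float = 30.0, interval: float = 1.0
) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # Treat errors raised during start-up as "not ready yet"
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Intended usage with the client created above:
# wait_until_ready(client.is_ready)
```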

## What's available on my instance?

In [None]:
jprint(client.get_meta())

### Version

Note the Weaviate version

### Modules

What are these `modules`?

Modules extend Weaviate so that you can work with your data in a way that suits you. The most commonly used modules are:
- Vectorizers (e.g. `text2vec-cohere`, `text2vec-huggingface`, `text2vec-openai`, `text2vec-palm`, etc.)
- Generative modules (e.g. `generative-xxx`)
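
As a rough sketch, the version and the enabled modules can be pulled out of the metadata dictionary like this. The `sample_meta` values are made up for illustration, and the helper assumes the response has top-level `version` and `modules` keys.

```python
def summarize_meta(meta: dict) -> dict:
    """Group the enabled modules in a metadata dict by type."""
    modules = sorted(meta.get("modules", {}))
    return {
        "version": meta.get("version", "unknown"),
        # Vectorizer modules follow the "<input>2vec-<provider>" naming pattern
        "vectorizers": [m for m in modules if "2vec-" in m],
        "generative": [m for m in modules if m.startswith("generative-")],
    }


# Hypothetical metadata, for illustration only
sample_meta = {
    "version": "1.19.0",
    "modules": {
        "text2vec-openai": {},
        "text2vec-cohere": {},
        "generative-openai": {},
    },
}

print(summarize_meta(sample_meta))
```

With a live instance, `summarize_meta(client.get_meta())` would show which vectorizers and generative modules are available to choose from.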

## Populating Weaviate

### Import process

The process for importing data into Weaviate is:

```
Add class definition to the schema
                ⬇
        (Batch) import data
```

**Discussion**: Why batch import?
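
One way to think about the answer: each request carries network overhead, so grouping objects means far fewer round trips. The toy sketch below only counts requests and does not talk to Weaviate. The batch size of 100 is an arbitrary assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `batch_size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


objects = [{"title": f"Wine {i}"} for i in range(250)]
batches = list(chunked(objects, batch_size=100))

print(len(objects), "single-object requests vs", len(batches), "batch requests")
# → 250 single-object requests vs 3 batch requests
```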

### What data to import?

In [None]:
import weaviate_datasets as wd

dataset = wd.WineReviews()
dataset.get_sample()

In [None]:
class_name = "WineReview"

### Add class definition

In [None]:
class_objs = dataset.get_class_definitions()
class_objs[0].keys()

### Required information

Notice above that we have `class`, `vectorizer`, `moduleConfig`, `properties` data:

- `class`: The class name (like a SQL table name)
- `vectorizer`: Module to be used to generate vectors
- `moduleConfig`: Configure various modules to be used with the class
- `properties`: Define object properties (like a SQL column)
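
To make the four keys concrete, here is a hand-written class definition in the same shape. The property list and data types are assumptions based on the fields queried later in this notebook. The definition returned by `dataset.get_class_definitions()` may differ.

```python
# Illustrative class definition; not the dataset's actual definition
example_class = {
    "class": "WineReview",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "generative-openai": {},  # enable a generative module for this class
    },
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "country", "dataType": ["text"]},
        {"name": "review_body", "dataType": ["text"]},
        {"name": "points", "dataType": ["int"]},
        {"name": "price", "dataType": ["number"]},
    ],
}

REQUIRED_KEYS = ("class", "vectorizer", "moduleConfig", "properties")


def missing_keys(class_obj: dict) -> list:
    """Return the keys discussed above that are absent from a definition."""
    return [key for key in REQUIRED_KEYS if key not in class_obj]


print(missing_keys(example_class))  # → []
```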

In [None]:
class_objs

### Option 1: Bring your own vector

![img](https://github.com/weaviate-tutorials/intro-workshop/blob/main/images/object_import_process_simple.png?raw=1)

In this use case, you import your data together with its vector. Weaviate indexes both so that you can search through the data quickly and effectively.

In [None]:
for class_obj in class_objs:
    if not client.schema.contains(class_obj):
        print(f"Adding {class_obj['class']}")
        client.schema.create_class(class_obj)

In [None]:
from weaviate.util import generate_uuid5

loader = dataset._class_dataloader(class_name)

with client.batch() as batch:
    for data_objs in loader:
        batch.add_data_object(
            data_object=data_objs[0],
            class_name=class_name,
            vector=[1, 2, 3],  # Specify your own vector here
            uuid=generate_uuid5(data_objs[0]["title"]),  # Deterministic UUID derived from the object's title
        )
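
The idea behind a deterministic UUID can be illustrated with the standard library alone. This sketch uses `uuid.uuid5` directly. It is not necessarily identical to what `generate_uuid5` produces, and the `deterministic_uuid` helper is a stand-in for illustration.

```python
import uuid


def deterministic_uuid(identifier: str) -> str:
    """Derive a UUID from a string; the same input always gives the same UUID."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, identifier))


first = deterministic_uuid("Some wine title")
again = deterministic_uuid("Some wine title")
other = deterministic_uuid("Another wine title")

print(first == again)  # → True
print(first == other)  # → False
```

Because the UUID here depends only on the title, re-running the import should map each object to the same ID rather than create duplicates. The flip side is that two different reviews sharing a title would map to one ID.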

In [None]:
res = (
    client.query.aggregate("WineReview")
    .with_meta_count()
    .do()
)

print(res)

In [None]:
client.schema.delete_class(class_name)

### Option 2: Use a vectorizer module

![img](https://github.com/weaviate-tutorials/intro-workshop/blob/main/images/object_import_process_full.png?raw=1)

In this use case, you task Weaviate with the additional work of converting your data into vectors. The `vectorizer` module performs this job, either with a local model or by calling an external inference API.

In [None]:
class_objs[0]["vectorizer"]

In [None]:
# # If you want to change the vectorizer:

# class_objs[0]["vectorizer"] = "text2vec-cohere"
# class_objs

### Add class definition to the schema

In [None]:
for class_obj in class_objs:
    if not client.schema.contains(class_obj):
        print(f"Adding {class_obj['class']}")
        client.schema.create_class(class_obj)

### Batch import

Note: you should almost always use batch imports for speed.

In [None]:
from weaviate.util import generate_uuid5

loader = dataset._class_dataloader(class_name)

with client.batch() as batch:
    for data_objs in loader:
        batch.add_data_object(
            data_object=data_objs[0],
            class_name=class_name,
            # vector=[1, 2, 3],  # Uncomment to supply your own vector instead
            uuid=generate_uuid5(data_objs[0]["title"]),  # Deterministic UUID derived from the object's title
        )

Confirm import by getting an object count.

In [None]:
client.query.aggregate(class_name).with_meta_count().do()
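
The aggregate response is a nested dictionary. A small sketch for pulling the count out of it is shown below. The `sample_response` and its count of 50 are invented for illustration, and `get_object_count` is a hypothetical helper.

```python
def get_object_count(response: dict, class_name: str) -> int:
    """Extract the object count from an aggregate-with-meta-count response."""
    return response["data"]["Aggregate"][class_name][0]["meta"]["count"]


# Hypothetical response, for illustration only
sample_response = {
    "data": {"Aggregate": {"WineReview": [{"meta": {"count": 50}}]}}
}

print(get_object_count(sample_response, "WineReview"))  # → 50
```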

Check that we have generated vectors.

In [None]:
res = client.query.get(
    class_name,
    ["title", "country", "review_body", "points"]
).with_additional("vector").with_limit(2).do()

In [None]:
jprint(res)

## Try queries

### Similarity-based searches

#### Against a text input

In [None]:
res = client.query.get(
    class_name, ["title", "country", "review_body", "points"]
).with_near_text(
    {"concepts": ["a fruity white wine"]}
).with_limit(5).do()

In [None]:
for r in res["data"]["Get"][class_name]:
    print(r)
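
Conceptually, a similarity search compares the query's vector with each object's vector and returns the closest ones. The brute-force sketch below illustrates the idea with made-up three-dimensional vectors and cosine similarity. Real embeddings have far more dimensions, and Weaviate uses a vector index rather than comparing against every object.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: List[float], vectors: Dict[str, List[float]], limit: int
) -> List[str]:
    """Return the names of the `limit` vectors most similar to `query`."""
    ranked = sorted(
        vectors, key=lambda name: cosine_similarity(query, vectors[name]), reverse=True
    )
    return ranked[:limit]


# Toy vectors, invented for illustration only
toy_vectors = {
    "fruity white": [0.9, 0.1, 0.0],
    "earthy red": [0.1, 0.9, 0.2],
    "sparkling": [0.6, 0.2, 0.7],
}
toy_query = [1.0, 0.0, 0.1]

print(rank_by_similarity(toy_query, toy_vectors, limit=2))
# → ['fruity white', 'sparkling']
```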

#### Move "away" from certain types of objects

In [None]:
res = client.query.get(
    class_name, ["title", "country", "review_body", "points"]
).with_near_text(
    {"concepts": ["earthy European wine"],
     "moveAwayFrom": {"concepts": ["white wine"], "force": 2.0}
     }
).with_limit(5).do()

In [None]:
for r in res["data"]["Get"][class_name]:
    jprint(r)

### Filter objects

In [None]:
res = client.query.get(
    class_name, ["title", "country", "review_body", "points"]
).with_near_text(
    {"concepts": ["earthy European wine"],
     "moveAwayFrom": {"concepts": ["white wine"], "force": 2.0}
     }
).with_where(
    {"path": ["price"],
     "operator": "GreaterThan",
     "valueNumber": 10}
).with_limit(5).do()

In [None]:
jprint(res)
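
Filters can also be combined. The sketch below builds a price-range filter by nesting two conditions under an `And` operator. `price_between` is a hypothetical helper, and the bounds are arbitrary.

```python
def price_between(low: float, high: float) -> dict:
    """Build a where-filter matching objects with low < price < high."""
    return {
        "operator": "And",
        "operands": [
            {"path": ["price"], "operator": "GreaterThan", "valueNumber": low},
            {"path": ["price"], "operator": "LessThan", "valueNumber": high},
        ],
    }


where_filter = price_between(10, 50)
print(where_filter["operator"], len(where_filter["operands"]))  # → And 2
```

The resulting dictionary could be passed to `.with_where(...)` in place of the single-condition filter used above.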

### Generative search

In [None]:
res = client.query.get(
    class_name, ["title", "country", "review_body", "points"]
).with_near_text(
    {"concepts": ["earthy European wine"],
     "moveAwayFrom": {"concepts": ["white wine"], "force": 2.0}
     }
).with_where(
    {"path": ["price"],
     "operator": "GreaterThan",
     "valueNumber": 10}
).with_generate(
    grouped_task="Are there any commonalities between these?",
).with_limit(3).do()

In [None]:
jprint(res)
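
The generated text sits fairly deep in the response. The sketch below assumes the output appears under each object's `_additional.generate` field, as `groupedResult` for a grouped task or `singleResult` for a per-object prompt. The sample response is invented, so check it against your own `jprint(res)` output.

```python
def extract_generated(response: dict, class_name: str) -> dict:
    """Collect grouped and per-object generated text from a query response."""
    grouped = None
    singles = []
    for obj in response["data"]["Get"][class_name]:
        generate = (obj.get("_additional") or {}).get("generate") or {}
        if generate.get("groupedResult"):
            grouped = generate["groupedResult"]
        if generate.get("singleResult"):
            singles.append(generate["singleResult"])
    return {"grouped": grouped, "singles": singles}


# Hypothetical response shape, for illustration only
sample_response = {
    "data": {
        "Get": {
            "WineReview": [
                {
                    "title": "Wine A",
                    "_additional": {
                        "generate": {"groupedResult": "All three are red wines."}
                    },
                },
                {"title": "Wine B", "_additional": {"generate": None}},
            ]
        }
    }
}

print(extract_generated(sample_response, "WineReview")["grouped"])
# → All three are red wines.
```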

## What if we try a different model?

In [None]:
client.schema.delete_class("WineReview")

In [None]:
# To change the vectorizer:

class_objs[0]["vectorizer"] = "text2vec-cohere"

In [None]:
from weaviate.util import generate_uuid5

for class_obj in class_objs:
    if not client.schema.contains(class_obj):
        print(f"Adding {class_obj['class']}")
        client.schema.create_class(class_obj)

loader = dataset._class_dataloader(class_name)

with client.batch() as batch:
    for data_objs in loader:
        batch.add_data_object(
            data_object=data_objs[0],
            class_name=class_name,
            uuid=generate_uuid5(data_objs[0]["title"]),  # This will generate a deterministic UUID based on the data object's content 
        )

In [None]:
res = client.query.get(
    class_name, ["title", "country", "review_body", "points"]
).with_near_text(
    {"concepts": [""]}
).with_limit(5).do()

In [None]:
res

In [None]:
res = client.query.get(
    class_name, ["title", "country", "review_body", "points"]
).with_near_text(
    {"concepts": ["Vino argentino fácil de beber"]}
).with_generate(
    single_prompt="Translate this to French {review_body}"
).with_limit(5).do()

In [None]:
res

## You can also use cross-references

The syntax looks like this:

```python
sf_id = "00ff6900-e64f-5d94-90db-c8cfa3fc851b"
us_cities_id = "20ffc68d-986b-5e71-a680-228dba18d7ef"

client.data_object.reference.add(
    from_class_name="JeopardyQuestion",
    from_uuid=sf_id,
    from_property_name="hasCategory",
    to_class_name="JeopardyCategory",
    to_uuid=us_cities_id,
)
```

Cross-references allow you to link data objects. You can:

- Retrieve cross-referenced properties
- Filter data with cross-references

However, they do *not* affect object vectors.
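
For a reference like the one above to be accepted, the source class needs a property whose data type is the target class name. A minimal sketch of such a property definition follows, reusing the class and property names from the snippet above. The `is_cross_reference` check is a hypothetical helper relying on the convention that class names are capitalized while primitive types such as `text` are lowercase.

```python
# Property definition for the "JeopardyQuestion" class (illustrative)
has_category_property = {
    "name": "hasCategory",
    "dataType": ["JeopardyCategory"],  # a class name, not a primitive type
}


def is_cross_reference(prop: dict) -> bool:
    """Guess whether a property is a cross-reference from its data type."""
    return all(dtype[:1].isupper() for dtype in prop["dataType"])


print(is_cross_reference(has_category_property))  # → True
print(is_cross_reference({"name": "title", "dataType": ["text"]}))  # → False
```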

### For more examples: See our "how-to" search pages

- [How to: Search](https://weaviate.io/developers/weaviate/search)
- [How to: Manage data](https://weaviate.io/developers/weaviate/manage-data)

### If you want to clean up the data

**This will delete all of your data in the *WineReview* class!**

In [None]:
client.schema.delete_class("WineReview")