# Using Vector Databases for Embeddings Search

This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.

### What is a Vector Database

A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.
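To make the idea of "searching embedding vectors" concrete, here is a minimal sketch of a nearest-neighbour lookup using cosine similarity in plain Python. The three-dimensional `toy_vectors` and the helper names are illustrative assumptions; real embeddings have many more dimensions, and a vector database uses specialised indexes rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Return the k keys whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Hypothetical toy "embeddings", made up for illustration only
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

results = nearest([1.0, 0.0, 0.0], toy_vectors, k=2)
print(results)  # ['cat', 'dog']
```

A vector database performs the same kind of ranking, but at a scale and speed that a brute-force loop like this cannot match.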

### Why use a Vector Database

Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbots and recommendation services, for example) and run them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security concerns hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.


### Demo Flow
The demo flow is:
- **Prerequisites Setup**: Create a Weaviate instance and install required libraries
- **Connect**: Connect to your Weaviate instance 
- **Schema Configuration**: Configure the schema of your data
    - *Note*: Here we can define which OpenAI Embedding Model to use
    - *Note*: Here we can configure which properties to index on
- **Import data**: Load a demo dataset and import it into Weaviate
    - *Note*: The import process will automatically index your data - based on the configuration in the schema
    - *Note*: You don't need to explicitly vectorize your data; Weaviate will communicate with OpenAI to do it for you.
- **Run Queries**: Query the imported data
    - *Note*: You don't need to explicitly vectorize your queries; Weaviate will communicate with OpenAI to do it for you.

Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings.

# OpenAI Module in Weaviate
All Weaviate instances come equipped with the [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module.

This module is responsible for handling vectorization at import (or during any CRUD operation) and when you run a query.

## No need to manually vectorize data
This is great news for you. With [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) you don't need to manually vectorize your data, as Weaviate will call OpenAI for you whenever necessary.

All you need to do is:
1. provide your OpenAI API Key when you connect with the Weaviate client
2. define which OpenAI vectorizer to use in your Schema

Both steps are covered in this cookbook.

# Prerequisites

Before we start this project, we need to set up the following:

* create a `Weaviate` instance
* install the `weaviate-client` and `datasets`

## Create a Weaviate instance

To create a Weaviate instance we have 2 options:

1. (Recommended path) [Weaviate Cloud Service](https://console.weaviate.io/) – to host your Weaviate instance in the cloud. The free sandbox should be more than enough for this cookbook.
2. Install and run Weaviate locally with Docker.


### Option 1 – WCS Installation Steps

Use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.
1. create a free account and/or log in to [WCS](https://console.weaviate.io/)
2. create a `Weaviate Cluster` with the following settings:
    * Sandbox: `Sandbox Free`
    * Weaviate Version: Use default (latest)
    * OIDC Authentication: `Disabled`
3. your instance should be ready in a minute or two
4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name.weaviate.network` 

### Option 2 – local Weaviate instance with Docker

Install and run Weaviate locally with Docker.
1. Get the [./docker-compose.yml](./docker-compose.yml) file
2. Then start docker with 
```bash
docker-compose up -d 
```
Once this is ready, your instance should be available at [http://localhost:8080](http://localhost:8080)


> To shut down, you can call:
>```bash
>docker-compose down
>```

To learn more about using Weaviate with Docker, see the [installation documentation](https://weaviate.io/developers/weaviate/installation/docker-compose).

    
## Install required libraries

Before running this project make sure to have the following libraries:

### Weaviate Python Client

The [Weaviate Python Client](https://weaviate.io/developers/weaviate/client-libraries/python) allows you to communicate with your Weaviate instance from your Python project.

To install the Python client, run the following command:

```
pip install weaviate-client
```

### Datasets

The datasets library is used to load the sample dataset.

If you don't have the datasets library, run the following command:

```
pip install datasets
```

# Setup

Import the required libraries and set the embedding model that we'd like to use.

In [31]:
from datasets import load_dataset

# Weaviate's client library for Python
import weaviate

# The embedding model used in this notebook; this can be changed to the embedding model of your choice
EMBEDDING_MODEL = "text-embedding-ada-002"

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Connect to your Weaviate instance

In this section, we will:

1. connect to your Weaviate
2. provide your `OpenAI API Key`
3. and test the client connection

### The client 

After this step, the `client` object will be used to perform all Weaviate-related operations.

In [3]:
import weaviate
from datasets import load_dataset

# Connect to your Weaviate instance
# Make sure to provide your OpenAI API key; it is needed during data import and query
client = weaviate.Client(
    url="https://your-project-name.weaviate.network",
    additional_headers={
        "X-OpenAI-Api-Key": "<YOUR-OPENAI-API-KEY>"
    }
)

# Check if your instance is live and ready
# This should return `True`
client.is_ready()

True
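Rather than pasting the key into the notebook, one option is to read it from an environment variable. The sketch below builds the `additional_headers` dictionary that way; the helper name `build_openai_headers` and the variable name `OPENAI_API_KEY` are assumptions for illustration, not part of the Weaviate client.

```python
import os


def build_openai_headers(env_var="OPENAI_API_KEY"):
    """Build the header dict for weaviate.Client from an environment variable.

    The environment variable name is an assumption; use whatever your setup defines.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before connecting.")
    return {"X-OpenAI-Api-Key": api_key}


# Possible usage with the client shown in this section:
# client = weaviate.Client(
#     url="https://your-project-name.weaviate.network",
#     additional_headers=build_openai_headers(),
# )
```

This keeps the secret out of the notebook file and out of version control.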

# Schema

In this section, we will:
1. configure the data schema for your data
2. select OpenAI module

> This is the second and final step that requires OpenAI-specific configuration.
> After this step, the rest of the instructions will only touch on Weaviate, as the OpenAI tasks will be handled automatically.


## What is a schema

In Weaviate you create __schemas__ to capture each of the entities you will be searching.

A schema is how you tell Weaviate:
* what embedding model should be used to vectorize the data
* what your data is made of (property names and types)
* which properties should be vectorized and indexed

In this cookbook we will use a dataset for `Articles`, which contains:
* `title`
* `content`
* `url`

We want to vectorize `title` and `content`, but not the `url`.

To vectorize and query the data, we will use `text-embedding-ada-002`.

In [4]:
# Clear up the schema, so that we can recreate it
client.schema.delete_all()
client.schema.get()

# Define the Schema object to use `text-embedding-ada-002` on `title` and `content`, but skip it for `url`
article_schema = {
    "class": "Article",
    "description": "A collection of articles",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
          "model": "ada",
          "modelVersion": "002",
          "type": "text"
        }
    },
    "properties": [{
        "name": "title",
        "description": "Title of the article",
        "dataType": ["string"]
    },
    {
        "name": "content",
        "description": "Contents of the article",
        "dataType": ["text"]
    },
    {
        "name": "url",
        "description": "URL to the article",
        "dataType": ["string"],
        "moduleConfig": { "text2vec-openai": { "skip": True } }
    }]
}

# add the Article schema
client.schema.create_class(article_schema)

# get the schema to make sure it worked
client.schema.get()

{'classes': [{'class': 'Article',
   'description': 'A collection of articles',
   'invertedIndexConfig': {'bm25': {'b': 0.75, 'k1': 1.2},
    'cleanupIntervalSeconds': 60,
    'stopwords': {'additions': None, 'preset': 'en', 'removals': None}},
   'moduleConfig': {'text2vec-openai': {'model': 'ada',
     'modelVersion': '002',
     'type': 'text',
     'vectorizeClassName': True}},
   'properties': [{'dataType': ['string'],
     'description': 'Title of the article',
     'moduleConfig': {'text2vec-openai': {'skip': False,
       'vectorizePropertyName': False}},
     'name': 'title',
     'tokenization': 'word'},
    {'dataType': ['text'],
     'description': 'Contents of the article',
     'moduleConfig': {'text2vec-openai': {'skip': False,
       'vectorizePropertyName': False}},
     'name': 'content',
     'tokenization': 'word'},
    {'dataType': ['string'],
     'description': 'URL to the article',
     'moduleConfig': {'text2vec-openai': {'skip': True,
       'vectorizePropert
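To double-check which properties will be vectorized, the class definition returned by `client.schema.get()` can be inspected programmatically. The helper below is a small sketch (its name and the trimmed `example_class` dictionary are illustrative assumptions); it treats a property as vectorized unless its `text2vec-openai` module config sets `skip` to `True`, mirroring the schema defined in this section.

```python
def vectorized_properties(class_schema, module="text2vec-openai"):
    """Return the names of properties that are not marked with skip=True."""
    names = []
    for prop in class_schema.get("properties", []):
        module_config = prop.get("moduleConfig", {}).get(module, {})
        if not module_config.get("skip", False):
            names.append(prop["name"])
    return names


# A trimmed-down class definition in the same shape as the schema output
example_class = {
    "class": "Article",
    "properties": [
        {"name": "title", "dataType": ["string"],
         "moduleConfig": {"text2vec-openai": {"skip": False}}},
        {"name": "content", "dataType": ["text"]},
        {"name": "url", "dataType": ["string"],
         "moduleConfig": {"text2vec-openai": {"skip": True}}},
    ],
}

print(vectorized_properties(example_class))  # ['title', 'content']
```

With a live client, the same function could be applied to each entry of `client.schema.get()['classes']`.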

## Import data

In this section we will:
1. load the Simple Wikipedia dataset
2. configure Weaviate Batch import (to make the import more efficient)
3. import the data into Weaviate

> Note: <br/>
> As mentioned before, we don't need to manually vectorize the data.<br/>
> The [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module will take care of that.

In [5]:
### STEP 1 - load the dataset

from datasets import load_dataset
from typing import List, Iterator

# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding
dataset = list(load_dataset("wikipedia", "20220301.simple")["train"])

# Limited to 250 articles for demo purposes
# (for a larger run, use: dataset = dataset[:25_000])
dataset = dataset[:250]

Found cached dataset wikipedia (/Users/sebawita/.cache/huggingface/datasets/wikipedia/20220301.simple/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)


  0%|          | 0/1 [00:00<?, ?it/s]

In [6]:
### Step 2 - configure Weaviate Batch, with
# - starting batch size of 100
# - dynamically increase/decrease based on performance
# - add timeout retries if something goes wrong

client.batch.configure(
    batch_size=100, 
    dynamic=True,
    timeout_retries=3,
#   callback=None,
)

<weaviate.batch.crud_batch.Batch at 0x7fbd11f952e0>

In [8]:
### Step 3 - import data

print("importing Articles")

with client.batch as batch:
    for article in dataset:

        properties = {
            "title": article["title"],
            "content": article["text"],
            "url": article["url"]
        }
        
        batch.add_data_object(properties, "Article")


importing Articles


[ERROR] Batch ReadTimeout Exception occurred! Retrying in 2s. [1/3]


{'error': [{'message': 'update vector: failed with status: 429 error: Rate limit reached for default-global-with-image-limits in organization org-lKjq4nGuGDGZgdPhDiF4ndjt on requests per min. Limit: 60.000000 / min. Current: 190.000000 / min. Contact support@openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://beta.openai.com/account/billing to add a payment method.'}]}
{'error': [{'message': 'update vector: failed with status: 429 error: Rate limit reached for default-global-with-image-limits in organization org-lKjq4nGuGDGZgdPhDiF4ndjt on requests per min. Limit: 60.000000 / min. Current: 180.000000 / min. Contact support@openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://beta.openai.com/account/billing to add a payment method.'}]}
{'error': [{'message': 'update vector: failed with status: 429 error: Rate limit reached for default
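The 429 responses above indicate that the OpenAI requests-per-minute limit was exceeded during import. One way to stay under such a limit is to send the data in smaller chunks and pause between them. The sketch below shows that pattern in plain Python; `import_in_chunks`, `send_chunk` and the default sizes are hypothetical, and how many OpenAI requests a given chunk triggers depends on your Weaviate configuration.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def import_in_chunks(items, send_chunk, chunk_size=50, pause_seconds=60.0,
                     sleep=time.sleep):
    """Send items chunk by chunk, pausing between chunks to respect a rate limit.

    `send_chunk` is a caller-supplied function (hypothetical here) that imports
    one chunk, for example by looping over it inside `with client.batch as batch:`.
    """
    sent = 0
    for index, chunk in enumerate(chunked(items, chunk_size)):
        if index > 0:
            sleep(pause_seconds)  # wait before every chunk except the first
        send_chunk(chunk)
        sent += len(chunk)
    return sent
```

The `sleep` parameter is injectable so the function can be exercised without actually waiting. Adding a payment method to raise the limit, as the error message suggests, is the other obvious remedy.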

In [9]:
# Test that our insert has worked by checking one object
first_object = client.data_object.get()['objects'][0]
print(first_object['properties']['title'])
print(first_object['properties']['content'])

# Test that all data has loaded
result = client.query.aggregate("Article") \
    .with_fields('meta { count }') \
    .do()
result['data']

Church (building)
A church is a building that was constructed to allow people to meet to worship together. These people are usually Christians, or influenced by Christianity. Some other non-Christian religious groups also call their religious buildings churches, most notably Scientology.

The following description is about Roman Catholic churches, although some parts are the same in Episcopalian and Lutheran churches. Depending on the number of people that are in a community, the churches come in different sizes. Small churches are called chapels. The churches in a particular geographical area form a group called the diocese.  Each diocese has a cathedral.  In most cases, the cathedral is a very big church. Cathedrals are the seat of bishops.

History of church buildings 

 

In the early days of Christianity people had to worship in secret. Christian worship was not allowed in the Roman empire, so Christians had to meet in a secret place. Sometimes they met in people’s houses or barns

{'Aggregate': {'Article': [{'meta': {'count': 120}}]}}
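The count above (120) is lower than the 250 articles sent, which is consistent with the rate-limit errors shown during import. If the import is re-run to fill the gap, objects that already succeeded could end up stored twice. A common way to guard against that is to give each object a deterministic ID derived from its content. The sketch below uses the standard library's `uuid.uuid5` on the article URL; the helper name and the example URL are illustrative assumptions.

```python
import uuid


def article_uuid(url):
    """Derive a stable UUID from an article URL (same URL, same UUID)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))


# Hypothetical URL, for illustration only
first = article_uuid("https://simple.wikipedia.org/wiki/Art")
second = article_uuid("https://simple.wikipedia.org/wiki/Art")
print(first == second)  # True

# Inside the import loop, the ID could then be passed along, assuming the
# client's batch method accepts a `uuid` argument:
# batch.add_data_object(properties, "Article", uuid=article_uuid(article["url"]))
```

Whether a repeated ID overwrites the existing object or is rejected depends on the Weaviate version, so it is worth verifying the behaviour on a small sample first.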

### Search Data

We'll now run some queries against our new index and get back results ranked by how close the stored vectors are to the vector of each query.

In [27]:
def query_weaviate(query, collection_name):
    
    nearText = {
        "concepts": [query],
        "distance": 0.7,
    }

    properties = [
        "title", "content", "url",
        "_additional {certainty distance}"
    ]

    result = (
        client.query
        .get(collection_name, properties)
        .with_near_text(nearText)
        .with_limit(10)
        .do()
    )

    return result['data']['Get'][collection_name]
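The function above indexes directly into `result['data']['Get']`, which raises a `KeyError` if the query fails. A slightly more defensive way to unpack the response is sketched below. It assumes that a failed query is reported under a top-level `errors` list, as is conventional for GraphQL responses; the helper name `extract_results` is hypothetical.

```python
def extract_results(response, collection_name):
    """Return the list of hits for a collection, or raise with the error messages."""
    if "errors" in response:
        messages = [err.get("message", str(err)) for err in response["errors"]]
        raise RuntimeError("Weaviate query failed: " + "; ".join(messages))
    # Fall back to an empty list when the collection has no hits
    return response.get("data", {}).get("Get", {}).get(collection_name) or []


# Illustrative responses in the shape used by query_weaviate
ok_response = {"data": {"Get": {"Article": [{"title": "Art"}]}}}
print(extract_results(ok_response, "Article"))  # [{'title': 'Art'}]
```

In `query_weaviate`, the final line could then become `return extract_results(result, collection_name)`.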

In [28]:
query_result = query_weaviate('modern art in Europe','Article')
counter = 0
for article in query_result:
    counter += 1
    print(f"{counter}. {article['title']} (Score: {round(article['_additional']['certainty'], 3)})")

1. Art (Score: 0.896)
2. Architecture (Score: 0.893)
3. Einstein on the Beach (Score: 0.886)
4. City (Score: 0.883)
5. Austria (Score: 0.882)
6. Continent (Score: 0.877)
7. Armenia (Score: 0.876)
8. History (Score: 0.876)
9. Cartography (Score: 0.874)
10. Catharism (Score: 0.873)


In [29]:
query_result = query_weaviate('Famous battles in Scottish history','Article')
counter = 0
for article in query_result:
    counter += 1
    print(f"{counter}. {article['title']} (Score: {round(article['_additional']['certainty'], 3)})")

1. Ireland (Score: 0.882)
2. Alan Turing (Score: 0.867)
3. Alan Turing (Score: 0.867)
4. History (Score: 0.864)
5. Colchester (Score: 0.858)
6. Black pudding (Score: 0.858)
7. Black pudding (Score: 0.858)
8. China (Score: 0.857)
9. China (Score: 0.857)
10. Historian (Score: 0.857)
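The second result list contains repeated titles (Alan Turing, Black pudding, China). One plausible cause, which the notebook does not confirm, is that some articles were stored more than once. Until the underlying data is cleaned up, the results can be de-duplicated on the client side. The sketch below keeps the first hit for each title; the helper name and the `sample` list are illustrative.

```python
def dedupe_by_title(articles):
    """Keep only the first (highest-ranked) article for each title."""
    seen = set()
    unique = []
    for article in articles:
        title = article["title"]
        if title not in seen:
            seen.add(title)
            unique.append(article)
    return unique


# A shortened sample in the same shape as the query results
sample = [
    {"title": "Ireland"},
    {"title": "Alan Turing"},
    {"title": "Alan Turing"},
    {"title": "History"},
]

print([a["title"] for a in dedupe_by_title(sample)])  # ['Ireland', 'Alan Turing', 'History']
```

Because the input is already sorted by score, keeping the first occurrence preserves the ranking.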


Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo.