We'll move on now to working with **Weaviate** to perform the following operations:
- Class Configuration
- Data Indexing
- Keyword search
- Vector search

In [1]:
## Installing Libraries ##

!pip install python-dotenv --quiet
!pip install loguru==0.7.0 --quiet 
!pip install weaviate-client==3.25.3 --quiet
!pip install openai --quiet
#workhorse for converting text into embeddings/vectors
!pip install sentence-transformers==2.2.2 --quiet

In [2]:
#external files
from preprocessing import FileIO
from weaviate_interface import WeaviateClient, WeaviateIndexer

#standards
import os
import time
import json
from typing import List
from tqdm.notebook import tqdm

from rich import print

#load from local .env file
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

### Instantiate Weaviate Client

The `WeaviateClient` Class is a convenient wrapper around the Weaviate Python API.  We'll use it to create a Weaviate client that connects to our Weaviate Cloud instance.  
Instantiating the Class requires three main pieces of information that will change from user to user:
- a model name or path for use with vector searches - this is the model that will create an embedding from the query string
- the Weaviate instance endpoint
- your personal Weaviate api key

In a production setting you'd have to account for security layers.
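
As one small step in that direction, credentials can be read from the environment instead of being typed into the notebook. The sketch below assumes the variable names `WEAVIATE_API_KEY` and `WEAVIATE_ENDPOINT` that appear in the commented-out lines of the next cell; the helper name `load_weaviate_credentials` is hypothetical and not part of `weaviate_interface`.

```python
import os
from typing import Mapping, Tuple


def load_weaviate_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (api_key, url), failing fast if either value is missing or empty."""
    # Variable names are assumed to match the entries in your local .env file.
    required = ("WEAVIATE_API_KEY", "WEAVIATE_ENDPOINT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return env["WEAVIATE_API_KEY"], env["WEAVIATE_ENDPOINT"]
```

Failing early with a clear message is usually easier to debug than a client that is created with empty strings and only errors on the first request.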

In [3]:
#read env vars from local .env file
# api_key = os.environ['WEAVIATE_API_KEY']
# url = os.environ['WEAVIATE_ENDPOINT']

api_key = ""
url = ""

#instantiate client
client = WeaviateClient(api_key, url)

#check if WCS instance is live and ready
client.is_live(), client.is_ready()


(True, True)

In [4]:
#Load Saved Data from Disk (from previous notebook)
data_path = "impact-theory-minilmL6-256.parquet"
data = FileIO().load_parquet(data_path)

Shape of data: (26448, 12)
Memory Usage: 2.42+ MB


## Intuition
***

Now that we've instantiated our Weaviate client as a connection with our Weaviate Host in the cloud, as well as loaded our data in memory, we're prepared to Index our data.  We'll follow the below steps:  

1. **Define a schema of Class properties**: This step allows us to be precise about data types, filterability, and indexability.
2. **Define a Class configuration**: This is our chance to configure how we want our Class (index) to run.  As part of this configuration we'll insert our schema of properties.
3. **Index Data on Weaviate**: Index our data using batch uploads.


***

### Step 1 --> Define a Schema of Class Properties

Weaviate supports an auto-schema option wherein properties are defined during data ingestion. For greater precision, however, we are going to define each of our properties manually.  We'll define the following parameters for each property:
   - `name`: the name of the property, for simplicity we'll ensure that each name corresponds with each key/field of each entry in our data
   - `dataType`: the type of data i.e. `text`, `number`, `date`, etc.
   - `indexFilterable`: should we be able to filter on this property?
   - `indexSearchable`: should we be able to search over this property?  Do not set this property to "true" if you do not intend to search over this property.
   - Property example (a list of property dictionaries):
        ```
           [
             {
              'name': 'video_id',
              'dataType': ['text'],
              'indexFilterable': True,
              'indexSearchable': True
             },
             {
              'name': 'length',
              'dataType': ['number'],
              'indexFilterable': True,
              'indexSearchable': False
             },
           ]
        ```
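
Before uploading a hand-written schema it can help to sanity-check it locally. The sketch below checks only the four fields this walkthrough sets explicitly; `validate_properties` is a hypothetical helper, not part of Weaviate or of `class_templates`.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = {"name", "dataType", "indexFilterable", "indexSearchable"}


def validate_properties(properties: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    seen_names = set()
    for i, prop in enumerate(properties):
        missing = REQUIRED_KEYS - prop.keys()
        if missing:
            problems.append(f"property {i}: missing keys {sorted(missing)}")
            continue
        if prop["name"] in seen_names:
            problems.append(f"property {i}: duplicate name {prop['name']!r}")
        seen_names.add(prop["name"])
        if not isinstance(prop["dataType"], list):
            problems.append(f"property {i}: dataType must be a list, e.g. ['text']")
        for flag in ("indexFilterable", "indexSearchable"):
            if not isinstance(prop[flag], bool):
                problems.append(f"property {i}: {flag} must be True or False")
    return problems


example_properties = [
    {"name": "video_id", "dataType": ["text"], "indexFilterable": True, "indexSearchable": True},
    {"name": "length", "dataType": ["number"], "indexFilterable": True, "indexSearchable": False},
]
print(validate_properties(example_properties))  # an empty list means no problems were found
```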

In [5]:
from class_templates import impact_theory_class_properties
print(impact_theory_class_properties)

Every use case will be different, so we may not want to index or create a filter for every single piece of metadata.  Being selective and deciding in advance how you want to configure your index is an important step.  On the other hand, depending on the size of the data you're working with, you are sometimes better off including metadata you aren't sure you'll use, because the cost of adding it later or reindexing all of your data can be prohibitive.  \
One other thing to note, which is particular to Weaviate, is that the schema contains no `content_embedding` property in which a user could declare a list of floats as the vector representation.  The vector is supplied alongside each object during data indexing and is handled as a separate action.  This is likely the case because Weaviate is a vector database first, with the additional benefit of being able to filter and search using keywords.
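
To make that separation concrete, the sketch below splits one record into its property payload and its vector before upload. The field name `content_embedding` and the helper `split_record` are assumptions for illustration; the actual column name in your data may differ.

```python
from typing import Any, Dict, List, Tuple


def split_record(record: Dict[str, Any],
                 vector_key: str = "content_embedding") -> Tuple[Dict[str, Any], List[float]]:
    """Separate the embedding from the properties that the schema describes."""
    properties = {k: v for k, v in record.items() if k != vector_key}
    vector = [float(x) for x in record[vector_key]]
    return properties, vector


record = {"doc_id": "abc_0", "content": "some text", "content_embedding": [0.1, 0.2, 0.3]}
properties, vector = split_record(record)
# `properties` matches the schema; `vector` travels next to it at upload time.
```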

### Step 2 --> Define a Class Configuration

A Class configuration is a blueprint of how our data is to be organized and stored on the Weaviate cluster. 

The code below constructs a class configuration.  The primary variables that will change from one class to another are the `class` (name of the class) and the `properties` fields.  The other config fields to consider are the ones that tune the HNSW graph, which is built at data indexing time.  Also note that the `vectorizer` field in the config below is set to `"none"`.  Weaviate supports several built-in vectorization models; however, we are bringing our own embeddings, so it's important to set this field to `none` so that the database knows not to vectorize any of our incoming data unnecessarily.

- `class`: The name of the class in string format
- `description`: Human-readable class description for your reference
- `vectorIndexType`: ANN algorithm to use (we use `hnsw`, Weaviate's default)
- `ef`: Balance search speed and recall. The ef parameter controls the size of the approximate nearest neighbors (ANN) list at query time. Search is more accurate when ef is higher, but it is also slower.  Default value is `-1`, which means Weaviate will dynamically alter the list size at runtime.
- `efConstruction`: Balance index search speed and build speed. A high efConstruction value means you can lower your ef settings, but importing is slower. Default value is `128`.
- `maxConnections`: Maximum number of connections per element. maxConnections is the connection limit per layer for layers above the zero layer. Default value is `64`.
- `vectorizer`: Vectorizer to use for data objects added to this class. We are providing the vectors ourselves through our SentenceTransformer model, so this field is "none"
- `properties`: Property values to add to the class.  These are previously defined in the `impact_theory_class_properties`
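
Because only `class` and `properties` usually change between classes, the fields above can be wrapped in a small builder. This is a sketch: `build_class_config` is a hypothetical helper, and its defaults simply mirror the default values listed above.

```python
from typing import Any, Dict, List


def build_class_config(class_name: str,
                       properties: List[Dict[str, Any]],
                       description: str = "",
                       ef: int = -1,
                       ef_construction: int = 128,
                       max_connections: int = 64) -> Dict[str, Any]:
    """Assemble a single-class schema dict for a bring-your-own-vectors class."""
    if ef != -1 and ef < 1:
        raise ValueError("ef must be -1 (dynamic) or a positive integer")
    return {"classes": [{
        "class": class_name,
        "description": description,
        "vectorIndexType": "hnsw",
        "vectorIndexConfig": {
            "ef": ef,
            "efConstruction": ef_construction,
            "maxConnections": max_connections,
        },
        "vectorizer": "none",  # vectors are supplied by us, not computed by Weaviate
        "properties": properties,
    }]}
```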

In [6]:
class_name = "Impact_theory_minilm_256"
#Review Indexing Body
class_config = {'classes': [

                      {"class": class_name,

                       "description": "Episodes of Impact Theory up to Nov 2023",

                       "vectorIndexType": "hnsw",

                       # Vector index specific settings
                       "vectorIndexConfig": {

                            "ef": 64,
                            "efConstruction": 128,
                            "maxConnections": 32,
                                            },

                       "vectorizer": "none",

                       # pre-defined property mappings
                       "properties": impact_theory_class_properties }
                      ]
               }

In [7]:
print (class_config)

In [20]:
#After you've defined your Class properties and defined your Class configuration, 
# you can upload the entire schema to your Weaviate instance using the client.schema.create method.


# client.schema.create(class_config) 

In [None]:
client.show_classes()

In [9]:
#Execute this call to see that your class was successfully configured on Weaviate
print(client.show_class_config(class_name))

A couple of points about the class configuration now that it's successfully uploaded to Weaviate.  You'll note that an inverted index was created. This is the index Weaviate will use when executing keyword search through the `.with_bm25` method.

There is also a section under the `vectorIndexConfig` called `pq` which stands for Product Quantization. PQ is a form of data compression that reduces the memory footprint of the index. HNSW is an in-memory index, so enabling PQ lets you work with larger datasets.  
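
A rough back-of-the-envelope estimate shows why this matters. The sketch assumes vectors are stored as 4-byte floats and, for the compressed case, a hypothetical PQ setting of 96 one-byte codes per vector. The vector count (26,448) and dimension (384) come from the dataset and model used in this notebook, and the figures ignore the memory used by the HNSW graph itself.

```python
def vector_memory_mb(num_vectors: int, bytes_per_vector: int) -> float:
    """Raw storage for the vectors alone, in megabytes (1 MB = 1024 * 1024 bytes)."""
    return num_vectors * bytes_per_vector / (1024 * 1024)


num_vectors = 26_448          # rows in our dataset
dimensions = 384              # embedding size of the model used here
uncompressed = vector_memory_mb(num_vectors, dimensions * 4)  # float32 assumed
compressed = vector_memory_mb(num_vectors, 96)                # hypothetical: 96 one-byte PQ codes

print(f"uncompressed: {uncompressed:.1f} MB, with PQ: {compressed:.1f} MB")
```

Under these assumptions the compressed vectors take one sixteenth of the space. Our dataset is small enough that compression is not needed, but the same ratio applied to hundreds of millions of vectors decides whether an index fits in memory at all.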


### Step 3 --> Data Indexing

To get our data indexed, we'll use the `WeaviateIndexer` Class.

The WeaviateIndexer is a wrapper around Weaviate's batch upload functions.  Under the hood, instantiating the WeaviateIndexer configures the batching client with sensible default values.  One could sequentially add entries into the class/index through the `client.batch.add_data_object` method, and that is likely the method to use when adding updates to your class/index.  But for this initial data push, it's best to use Weaviate's underlying batching mechanism to speed up the process.
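
Conceptually, batching groups records into fixed-size chunks so that each network request carries many objects instead of one. The sketch below is a plain-Python illustration of that grouping, not the `WeaviateIndexer` implementation; `chunked` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


records = list(range(26_448))            # stand-in for our 26,448 records
batches = list(chunked(records, 200))
print(len(batches), len(batches[-1]))    # 133 batches; the last one holds 48 records
```

With a batch size of 200, the full dataset goes over the wire in 133 requests rather than 26,448.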

In [10]:
from weaviate_interface import WeaviateIndexer

indexer = WeaviateIndexer(client=client, batch_size=200, num_workers=2)
indexer.batch_index_data(data=data, class_name=class_name)

100%|██████████| 26448/26448 [01:49<00:00, 241.60it/s]


Batch job completed in 1.9 minutes.


In [None]:
client.show_classes()

### Step 4 --> Searching

***

We are going to cover the two primary text-based search methods in Weaviate:
- Keyword Search
- Vector Search

Weaviate's basic query language is GraphQL. \
Use `client.query.get` as an entry point to make a search request.  The GraphQL example below is using the BM25 algorithm to conduct a keyword search across a given class.

### Anatomy of a Weaviate GraphQL Search query
***

```
.get(class_name = "name of class to search across",
     properties = "properties to display in response value")

.with_bm25(query="user query", properties= "properties to search across, users can include multiple properties i.e. content, and title, and guest")

.with_additional( "user can specify additional fields to return as part of the response, these additional fields are particular to each search method" --> ['score', "id"])

.with_limit("restrict size of returned results to no more than this number")

.do("this call executes the query request")
```

Using the above code block as a guide, we are going to execute a manual keyword search using the GraphQL syntax.  Then we'll execute the same search using the pre-built client `keyword_search` method.

#### Get the properties that we want to display as part of the returned results.  

Deciding on which properties to display in the search results is a non-trivial matter.  From a human-readable perspective it's nice to see the additional metadata attached to each result, so that you have a better orientation of where the response came from i.e. who the `guest` was on the show, how long the show lasted (`length`), perhaps even include the `summary` of the show as part of the response, etc.  These displayed properties will become very important to the visual characteristics of the response in the UI.  For instance the `thumbnail_url` provides a link to a graphic depiction of the show, and the `episode_url` provides a convenient link directly to the show on YouTube.  Other properties such as `guest` and `summary` can be used in a response post-processing step to provide recommendations to the end user for other shows or guests to search for.  These search enrichments are only possible if the properties themselves are returned as part of the response results.

In [11]:
#get properties that are part of the class
display_properties = [property['name'] for property in client.show_class_properties(class_name)]

#we don't want to see the summary or playlist_id, so remove them
display_properties.remove('summary')
display_properties.remove('playlist_id')
display_properties

['title',
 'video_id',
 'length',
 'thumbnail_url',
 'views',
 'episode_url',
 'doc_id',
 'guest',
 'content']

In [12]:
# We'll set the query to be a universal question that everyone should have an interest in
query="How can I avoid paying taxes"

### Keyword Search
***
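
To build some intuition for the `score` that a BM25 search returns, here is a minimal pure-Python sketch of the Okapi BM25 formula. It is not Weaviate's implementation: whitespace tokenization and the parameter values `k1 = 1.2` and `b = 0.75` are assumptions made for illustration, and `bm25_scores` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger inverse document frequency weight.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = term_freq[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["taxes are high this year", "how to train a puppy", "avoid taxes legally with planning"]
print(bm25_scores("avoid paying taxes", docs))
```

The third document matches two query terms, one of them rare in this tiny corpus, so it scores highest; the second matches none and scores zero.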

In [13]:
response = (client.query.get(class_name, display_properties)
 .with_bm25(query=query, properties=['content'])
 .with_additional(['score', 'id'])
 .with_limit(3)
 .do()
)
print(client.format_response(response, class_name))

In [14]:
print(client.keyword_search(query, class_name, limit=3))
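
The `format_response` helper lives in `weaviate_interface` and is not shown in this notebook. As a rough sketch of what such a helper might do, the code below assumes the usual GraphQL response layout of `data` -> `Get` -> class name, with search metadata nested under `_additional`. The name `flatten_response` and the sample response are hypothetical.

```python
from typing import Any, Dict, List


def flatten_response(response: Dict[str, Any], class_name: str) -> List[Dict[str, Any]]:
    """Merge each hit's `_additional` fields into the same dict as its properties."""
    hits = response.get("data", {}).get("Get", {}).get(class_name) or []
    results = []
    for hit in hits:
        flat = {k: v for k, v in hit.items() if k != "_additional"}
        flat.update(hit.get("_additional", {}))
        results.append(flat)
    return results


# Hand-written stand-in for a raw response; the values are illustrative only.
raw = {"data": {"Get": {"Impact_theory_minilm_256": [
    {"title": "Example episode", "_additional": {"score": "3.2", "id": "0000"}},
]}}}
print(flatten_response(raw, "Impact_theory_minilm_256"))
```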

### Vector Search

Let's run through the GraphQL syntax for a vector-based search.  The biggest difference here is that, as part of the search execution, we need to embed the user query at runtime and supply it as one of the search parameters.
***

In [15]:
client.model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [16]:
#Create an embedding for the user query
query = "How can I avoid paying taxes"
query_embedding = client.model.encode(query)

In [17]:
response = (client.query

 # search over our class and display the properties that we created earlier
 .get(class_name, display_properties)

 # use near_vector as our search method: the query embedding is compared with each object's stored vector
 .with_near_vector({'vector': query_embedding})

 # instead of "score", vector search can return a "distance" for ranking; the smaller the distance, the more semantically similar the result
 .with_additional(['distance'])
 .with_limit(3)
 .do())

print(client.format_response(response, class_name))

In [18]:
print(client.vector_search(query, class_name, limit=3))
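
To make the `distance` value less abstract, the sketch below computes cosine distance by hand and ranks a few toy vectors exhaustively. It assumes the class uses cosine distance, which is Weaviate's default metric unless configured otherwise; `cosine_distance` and `nearest` are hypothetical helpers, and the 2-d vectors stand in for real 384-d embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vec: Sequence[float],
            doc_vecs: List[Sequence[float]],
            limit: int = 3) -> List[Tuple[int, float]]:
    """Rank every document by distance to the query and keep the closest `limit`."""
    ranked = sorted(((i, cosine_distance(query_vec, v)) for i, v in enumerate(doc_vecs)),
                    key=lambda pair: pair[1])
    return ranked[:limit]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # toy 2-d "embeddings"
print(nearest([1.0, 0.1], doc_vecs, limit=2))
```

This exhaustive scan is what the HNSW index approximates: it returns a similar ranking while visiting only a small part of the stored vectors, which is where the `ef` and `maxConnections` settings come into play.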