## Scenario 1 - Single collection RAG

### **SupportPatterns** - Support Training & Education Platform

- Develops training materials and courses for customer support professionals
- Uses aggregated, anonymized support conversations to create realistic training scenarios

### Solution

Collect as much conversation data between support agents and customers as possible. 

Analyse this data to identify common patterns and develop training materials based on these patterns.

### Helper functions for downloads

In [1]:
from pathlib import Path
from typing import Literal


def download_datafiles(setup: Literal["ollama", "cohere"]):
    """Download the pre-vectorized dataset for the chosen setup, unless it is already on disk."""
    filepaths_set = {
        "ollama": (
            "https://weaviate-workshops.s3.eu-west-2.amazonaws.com/odsc-europe-2024/twitter_customer_support_weaviate_export_50000_nomic.h5",
            Path("data/twitter_customer_support_nomic.h5")
        ),
        "cohere": (
            "https://weaviate-workshops.s3.eu-west-2.amazonaws.com/odsc-europe-2024/twitter_customer_support_weaviate_export_50000_cohere-embed-multilingual-light-v3.0.h5",
            Path("data/twitter_customer_support_cohere.h5"),
        )
    }

    filepaths = filepaths_set[setup]

    if not filepaths[1].exists():
        print(f"Downloading {filepaths[0]}")
        filepaths[1].parent.mkdir(parents=True, exist_ok=True)
        import urllib.request
        urllib.request.urlretrieve(filepaths[0], filepaths[1])
    else:
        print(f"File already exists: {filepaths[1]}")
    return True
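
One caveat with the helper above: it treats any existing file as complete, so an interrupted download would be skipped on the next run. A minimal sketch of one way around this is to download to a temporary `.part` file and rename it only once the transfer finishes. The function name `download_if_missing` and the `.part` suffix are hypothetical and not part of the workshop code; the demo uses a local `file://` URL so that it runs offline.

```python
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: Path) -> bool:
    """Download `url` to `dest` unless `dest` exists. Returns True if a download happened."""
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a sibling ".part" file first (hypothetical naming convention),
    # so a partial download is never mistaken for the finished file.
    part = dest.with_name(dest.name + ".part")
    urllib.request.urlretrieve(url, part)
    part.replace(dest)
    return True


# Offline demo: "download" a local file through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.bin"
    src.write_bytes(b"example")
    dest = Path(tmp) / "data" / "copy.bin"
    first = download_if_missing(src.as_uri(), dest)   # performs the copy
    second = download_if_missing(src.as_uri(), dest)  # skipped, file exists
    copied = dest.read_bytes()
```
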

## AI Models

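This workshop is set up to work with either local Ollama models or API-based Cohere models. Follow either the [Ollama](#ollama) or the [Cohere](#cohere) instructions below.

### Ollama

To use local Ollama models for this workshop, run the code cells below to pull the models and configure the variables:
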
In [None]:
!ollama pull nomic-embed-text && ollama pull gemma2:2b

In [None]:
download_datafiles("ollama")

model_type = "ollama"

### Cohere

To use the Cohere API for this workshop, run the code cell below to configure the variables:

In [None]:
download_datafiles("cohere")

model_type = "cohere"


### Create the collection


In [5]:
from weaviate.classes.config import Configure

if model_type == "ollama":
    vectorizer_config = Configure.NamedVectors.text2vec_ollama(
        name="text_with_metadata",
        source_properties=["text", "company_author"],
        vector_index_config=Configure.VectorIndex.hnsw(),
        api_endpoint="http://host.docker.internal:11434",
        model="nomic-embed-text",
    )
    generative_config = Configure.Generative.ollama(
        api_endpoint="http://host.docker.internal:11434",
        model="gemma2:2b"
    )
else:
    vectorizer_config = Configure.NamedVectors.text2vec_cohere(
        name="text_with_metadata",
        source_properties=["text", "company_author"],
        vector_index_config=Configure.VectorIndex.hnsw(),
        model="embed-multilingual-light-v3.0",
    )

    generative_config = Configure.Generative.cohere(
        model="command-r-plus"
    )


In [6]:
import os
import weaviate
from weaviate.classes.config import Property, DataType, Configure
from dotenv import load_dotenv

load_dotenv()

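# Only send the Cohere API key header if the key is set (it is not needed for Ollama)
cohere_key = os.getenv("WORKSHOP_COHERE_KEY")
headers = {"X-Cohere-Api-Key": cohere_key} if cohere_key else None

client = weaviate.connect_to_local(headers=headers)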

collection_name = "SupportChat"

# For re-running the demo only: Delete existing collection if it exists
client.collections.delete(collection_name)

# Create a new collection with specified properties and vectorizer configuration
chunks = client.collections.create(
    name=collection_name,
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        # STUDENT TODO:
        # Create properties for 'dialogue_id', 'company_author' and 'created_at' - with data types 'int', 'text' and 'date' respectively
    ],
    vectorizer_config=[vectorizer_config],
    generative_config=generative_config,
)

### Helper functions for loading data

In [7]:
import h5py
import json
import numpy as np
from typing import Literal
from pathlib import Path


def get_hdf5_obj(file_path):
    with h5py.File(file_path, "r") as hf:
        for uuid in hf.keys():
            src_obj = hf[uuid]

            # Get the object properties
            properties = json.loads(src_obj["object"][()])

            # Get the vector(s)
            vectors = {}
            for key in src_obj.keys():
                if key.startswith("vector_"):
                    vector_name = key.split("_", 1)[1]
                    vectors[vector_name] = np.asarray(src_obj[key])

            yield uuid, properties, vectors


def get_data_obj(model_type: Literal["ollama", "cohere"]):
    file_path = Path("data/twitter_customer_support_nomic.h5")
    if model_type == "cohere":
        file_path = Path("data/twitter_customer_support_cohere.h5")

    # Delegate to the HDF5 reader, yielding (uuid, properties, vectors) tuples
    yield from get_hdf5_obj(file_path)
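
The loader above recovers each named vector's name from its HDF5 key by stripping the `vector_` prefix. The short sketch below isolates that step to show why the split is limited to one occurrence: the vector name itself contains underscores. The helper name `vector_name_from_key` is hypothetical, and the example key is inferred from the `text_with_metadata` vector name used in this notebook.

```python
from typing import Optional


def vector_name_from_key(key: str) -> Optional[str]:
    """Return the named-vector name for a 'vector_<name>' key, else None."""
    if not key.startswith("vector_"):
        return None
    # maxsplit=1 keeps any underscores inside the vector name intact
    return key.split("_", 1)[1]


# Example keys as they are assumed to appear under each object's group
example_keys = ["object", "vector_text_with_metadata"]
names = {k: vector_name_from_key(k) for k in example_keys}
print(names)
```
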

### Load data

In [None]:
from tqdm import tqdm

with client.batch.fixed_size(batch_size=200) as batch:
    for uuid, properties, vectors in tqdm(get_data_obj(model_type)):
        batch.add_object(
            # STUDENT TODO:
            # Define the object to be added - specify the collection name, uuid and properties - the "vector" property is pre-defined for you
            vector={"text_with_metadata": vectors["text_with_metadata"]},
        )

Check that our data loaded correctly:

In [None]:
print(f"Processed {len(client.batch.results.objs.all_responses)} objects.")

In [10]:
if len(client.batch.failed_objects) > 0:
    print("*" * 80)
    print(f"***** Failed to add {len(client.batch.failed_objects)} objects *****")
    print("*" * 80)
    print(client.batch.failed_objects[:3])

### Retrieve some arbitrary objects

In [11]:
# Instantiate a collection object to interact with the collection
support_chats = client.collections.get(collection_name)

In [12]:
# STUDENT TODO:
# Fetch the first two objects from the collection with the vector included
# Hint - use the 'query.fetch_objects' method with the 'limit' and 'include_vector' parameters

In [None]:
# STUDENT TODO:
# Print the UUID of the first object in the response
# Hint - The response will have an `.objects` attribute which is a list of objects

In [None]:
# STUDENT TODO:
# Inspect the properties of the first object in the response
# Hint - the object will have a 'properties' attribute which is a dictionary of properties

In [None]:
# STUDENT TODO:
# Inspect the first few dimensions of the object's vector
# Hint - the object will have a 'vector' attribute which is a dictionary of vectors

### Queries

#### Helper function for displaying objects

In [16]:
def display_objects(response):
    for o in response.objects:
        print(o.uuid, "\n")
        print(o.properties["text"][:100], "\n")

In [None]:
# Near text search: Semantic search example
response = support_chats.query.near_text("return process", limit=3)
display_objects(response)

In [None]:
# STUDENT TODO:
# Run a `bm25` query with the search term "return process" and a limit of 3, and display the results
# Hint - start with the previous cell, and vary the query method

In [None]:
# STUDENT TODO:
# Run a `hybrid` query with the same parameters and display the results
# Hint - start with the previous cell, and vary the query method
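
As background for the `hybrid` exercise above: a hybrid query blends a keyword (BM25) ranking with a vector ranking, and Weaviate's `alpha` parameter sets the balance (`alpha=1` is pure vector search, `alpha=0` is pure keyword search). The toy sketch below illustrates the idea with min-max normalization followed by a weighted sum. It is an illustration only, not Weaviate's actual implementation; the function names and the scores are made up.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to the 0..1 range (all 1.0 if every score is equal)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(keyword_scores: dict, vector_scores: dict, alpha: float = 0.5) -> list:
    """Blend two score sets; objects missing from one set contribute 0 for it."""
    kw = min_max(keyword_scores)
    vec = min_max(vector_scores)
    ids = set(kw) | set(vec)
    fused = {i: alpha * vec.get(i, 0.0) + (1 - alpha) * kw.get(i, 0.0) for i in ids}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for four hypothetical objects
keyword_scores = {"a": 2.0, "b": 1.0, "c": 0.0}
vector_scores = {"b": 0.9, "c": 0.8, "d": 0.1}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, fuse(keyword_scores, vector_scores, alpha=alpha))
```

Object `b` scores reasonably well in both rankings, so it comes out on top at `alpha=0.5`, even though `a` wins on keywords alone.
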

In [20]:
# Generative search (RAG) example
response = support_chats.generate.fetch_objects(
    limit=20,
    grouped_task="What patterns are we seeing here in these issues?"
)

In [None]:
print(response.generated)
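
Conceptually, a `grouped_task` sends the model one prompt built from all retrieved objects, whereas a `single_prompt` runs once per object. The sketch below shows the grouped idea in plain Python. The function `build_grouped_prompt`, the prompt layout, and the example texts are all hypothetical; Weaviate's actual prompt template may differ.

```python
def build_grouped_prompt(task: str, texts: list[str]) -> str:
    """Combine one task with all retrieved texts into a single prompt string."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(texts, start=1))
    return f"{task}\n\nContext:\n{numbered}"


# Made-up stand-ins for the 'text' property of retrieved objects
example_texts = ["My order arrived damaged.", "How do I return an item?"]
prompt = build_grouped_prompt(
    "What patterns are we seeing here in these issues?", example_texts
)
print(prompt)
```

Because every retrieved object lands in the same prompt, the `limit` on the query directly controls how much context the model receives.
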

## Example use cases

- Develop training materials
    - Investigate common patterns in support conversations
    - Identify common issues and resolutions

In [22]:
# How might our example business use these capabilities?
# What types of RAG queries would be useful for them?

In [None]:
print(response.generated)

In [24]:
# STUDENT TODO:
# Try your own `grouped_task` query with a different question

In [None]:
print(response.generated)

### Resource management

- How much memory are we using?
- How will this scale with more data?
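
A rough starting point for both questions is the size of the raw vectors alone: number of objects × dimensions × 4 bytes per float32 value. The sketch below assumes 50,000 objects (from the "50000" in the data file names) and the commonly cited dimensions of 768 for `nomic-embed-text` and 384 for `embed-multilingual-light-v3.0`; verify both against the vectors you inspected earlier. The helper name `raw_vector_bytes` is hypothetical.

```python
def raw_vector_bytes(num_objects: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the raw vectors only (no index or object data)."""
    return num_objects * dimensions * bytes_per_value


NUM_OBJECTS = 50_000  # assumption: taken from the data file names
ASSUMED_DIMS = {  # assumption: check against len() of a loaded vector
    "nomic-embed-text": 768,
    "embed-multilingual-light-v3.0": 384,
}

for model, dims in ASSUMED_DIMS.items():
    mib = raw_vector_bytes(NUM_OBJECTS, dims) / 2**20
    print(f"{model}: ~{mib:.1f} MiB of raw vectors")
```

This estimate grows linearly with the number of objects and is a lower bound: the HNSW index and the stored object properties add further overhead that is not counted here.
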

## When to use this pattern

- Is any of the data isolated from the others?
- What use cases might not be covered by this architecture?


## Demo application

- Outside of the notebook
