* Vector Databases:

Vector databases store data as embeddings (numeric vectors) and index them so that items with similar meaning can be found quickly. This is particularly useful for tasks like recommendation systems and semantic search.
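
A minimal sketch of what a similarity search does, assuming cosine similarity as the metric and a brute-force scan. The three-dimensional vectors and the `nearest` helper are made up for illustration; real embeddings have hundreds of dimensions, and real vector databases typically use approximate indexes rather than scanning every item.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for this example.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def nearest(query_vector, k=2):
    """Return the k stored keys most similar to query_vector (brute force)."""
    ranked = sorted(
        embeddings,
        key=lambda name: cosine_similarity(query_vector, embeddings[name]),
        reverse=True,
    )
    return ranked[:k]

print(nearest([1.0, 0.0, 0.0]))  # ['cat', 'dog']
```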

* Recommender Systems:

These systems suggest items to users based on similarities between users or between items. For example, an e-commerce site might recommend products based on what similar users have purchased.
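
A small sketch of the "similar users" idea, assuming Jaccard overlap of purchase sets as the similarity measure. The users, products, and the `recommend` helper are hypothetical; production systems usually compare learned embeddings instead of raw sets.

```python
# Hypothetical purchase histories: user -> set of product names.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse", "monitor"},
    "carol": {"novel", "bookmark"},
}

def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user):
    """Suggest items bought by the most similar other user but not yet by `user`."""
    own = purchases[user]
    others = [u for u in purchases if u != user]
    most_similar = max(others, key=lambda u: jaccard(own, purchases[u]))
    return sorted(purchases[most_similar] - own)

print(recommend("alice"))  # ['monitor']
```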

* Hybrid Search:

Hybrid search combines traditional keyword search with vector search. Balancing exact keyword matches against semantic similarity can give more relevant results than either method on its own.
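
One way to picture the balancing is a weighted sum of the two scores. This is a simplified sketch, not the fusion algorithm of any particular database; the `alpha` weight, the document names, and the scores are assumptions made for the example.

```python
def hybrid_score(keyword_score, vector_score, alpha=0.5):
    """Blend two scores in [0, 1]; alpha=0 is pure keyword, alpha=1 is pure vector."""
    return (1 - alpha) * keyword_score + alpha * vector_score

# Hypothetical, already-normalized scores per document: (keyword, vector).
scores = {
    "doc_exact_match": (0.9, 0.3),
    "doc_semantic_match": (0.2, 0.95),
}

def rank(alpha):
    """Order documents by blended score, best first."""
    return sorted(
        scores,
        key=lambda doc: hybrid_score(*scores[doc], alpha=alpha),
        reverse=True,
    )

print(rank(alpha=0.25))  # keyword-leaning: exact match ranks first
print(rank(alpha=0.75))  # vector-leaning: semantic match ranks first
```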

* Retrieval-Augmented Generation (RAG):

RAG is a technique that improves the responses of large language models (LLMs), such as those behind ChatGPT, by supplying them with relevant context retrieved from a vector database before they generate a response.

Steps in RAG:
- Query the Vector Database: Retrieve relevant information based on a concept (e.g., animals).
- Build a Prompt: Use the retrieved information to create a prompt for the LLM.
- Generate Response: The LLM uses the prompt to generate a more accurate and contextually relevant answer.
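
The three steps can be sketched without any external service. Everything here is a stand-in: `retrieve` uses naive word overlap instead of vector search, and `generate` is a placeholder where a real LLM call would go. The function names and the prompt wording are assumptions for illustration.

```python
# Hypothetical in-memory "database" standing in for a vector store.
DOCUMENTS = [
    "The elephant is the only living mammal in the order Proboscidea.",
    "The liver removes excess glucose from the blood and stores it as glycogen.",
]

def retrieve(query, limit=1):
    """Step 1: naive word-overlap retrieval, a stand-in for vector search."""
    query_words = set(query.lower().split())
    ranked = sorted(
        DOCUMENTS,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:limit]

def build_prompt(question, context):
    """Step 2: put the retrieved text in front of the question."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer using only the context."

def generate(prompt):
    """Step 3: placeholder for an LLM call; it only reports the prompt length."""
    return f"<LLM response to a {len(prompt)}-character prompt>"

question = "Which organ stores glycogen?"
context = retrieve(question)
prompt = build_prompt(question, context)
print(prompt)
print(generate(prompt))
```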



Example Workflow:
- Query: Extract three animal-related questions from the database.
- Generate Query: Provide these questions to an LLM.
- Evaluate: Check how many questions the LLM answered correctly by comparing them to the correct answers in the database.
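
The Evaluate step can be sketched as a simple string comparison. The pairs below are copied from the first two results in this notebook's output (the third result is truncated there, so it is omitted). The `is_correct` helper and its loose substring rule are assumptions; a stricter evaluation would need more careful matching.

```python
def is_correct(generated, expected):
    """Loose check: does the expected answer appear in the generated text?"""
    return expected.lower() in generated.lower()

# (generated text, expected answer) pairs taken from the notebook output.
results = [
    ("The only living mammal in the order Proboscidea is the elephant.", "Elephant"),
    (
        "The largest species of animal in Africa weighing around a ton "
        "would be the African elephant.",
        "Antelope",
    ),
]

correct = sum(is_correct(generated, expected) for generated, expected in results)
print(f"{correct} of {len(results)} answers matched")  # 1 of 2 answers matched
```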

Simplified Analogy:
Think of RAG as doing research before writing an essay. You gather relevant information (retrieval), then use that information to write a well-informed essay (generation).

Practical Application:
In your role, understanding how to use vector databases and RAG can help you build more intelligent data analysis tools, enhancing the accuracy and relevance of your insights.

In [1]:
import requests
import os
import json
import weaviate
from weaviate.embedded import EmbeddedOptions



In [2]:
# Download the data
response = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
response.raise_for_status()  # Fail early if the download did not succeed
data = json.loads(response.text)

# Helper that pretty-prints JSON-serializable data
def json_print(data):
    print(json.dumps(data, indent=2))

# Preview the parsed data
print(type(data))
print(len(data))
json_print(data[0])

<class 'list'>
10
{
  "Category": "SCIENCE",
  "Question": "This organ removes excess glucose from the blood & stores it as glycogen",
  "Answer": "Liver"
}


In [3]:
# Read API keys from the environment
cohere_api_key = os.getenv("COHERE_APIKEY")
weaviate_api_key = os.getenv("WEAVIATE_API_KEY")
openai_api_key = os.getenv("OPENAI_API")

# Note: auth_config is not passed to the client below
auth_config = weaviate.AuthApiKey(api_key=weaviate_api_key)

# Connect to the locally launched instance of Weaviate
client = weaviate.Client(
    url="http://localhost:8080",
    additional_headers={
        "X-Cohere-Api-Key": cohere_api_key
    }
)

print(f"Client is ready: {client.is_ready()}")

Client is ready: True


[...] your code to use Python client v4 `weaviate.WeaviateClient` connections and methods.

For Python Client v4 usage, see: https://weaviate.io/developers/weaviate/client-libraries/python
For code migration, see: https://weaviate.io/developers/weaviate/client-libraries/python/v3_v4_migration


In [4]:
# Delete the "Question" class if it already exists, so the notebook can be re-run
if client.schema.exists("Question"):
    print("The schema is already present in the Weaviate instance")
    client.schema.delete_class("Question")
    print("Schema deleted.")

The schema is already present in the Weaviate instance
Schema deleted.


In [5]:
# Create class object
class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-cohere",
}

client.schema.create_class(class_obj)
print("Class created")

Class created


In [6]:
# Batch import the data
with client.batch.configure() as batch:
    for i, d in enumerate(data):
        print(f"importing question: {i+1}")

        # Map the dataset's keys to the property names stored in Weaviate
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }

        batch.add_data_object(
            data_object=properties,
            class_name="Question"
        )

importing question: 1
importing question: 2
importing question: 3
importing question: 4
importing question: 5
importing question: 6
importing question: 7
importing question: 8
importing question: 9
importing question: 10


In [7]:
json_print(client.query.aggregate('Question').with_meta_count().do())

{
  "data": {
    "Aggregate": {
      "Question": [
        {
          "meta": {
            "count": 10
          }
        }
      ]
    }
  }
}


In [11]:
# Q1 - Query: Extract three animal-related questions from the database.
response = (
    client.query
    .get('Question', ['question'])
    .with_near_text({"concepts": ["animal"]})
    .with_limit(3)
    .do()
)
json_print(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "question": "It's the only living mammal in the order Proboseidea"
        },
        {
          "question": "Weighing around a ton, the eland is the largest species of this animal in Africa"
        },
        {
          "question": "The gavial looks very much like a crocodile except for this bodily feature"
        }
      ]
    }
  }
}


In [12]:
# Q2 - Generate Query: Provide these questions to an LLM.
# {question} is filled in with each retrieved object's "question" property.
prompt = "Answer these questions {question}. If you don't know the answer, you can say 'I don't know'."

response = (
    client.query
    .get('Question', ['question'])
    .with_near_text({"concepts": ["animal"]})
    .with_generate(single_prompt=prompt)
    .with_limit(3)
    .do()
)

json_print(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "The only living mammal in the order Proboscidea is the elephant."
            }
          },
          "question": "It's the only living mammal in the order Proboseidea"
        },
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "The largest species of animal in Africa weighing around a ton would be the African elephant."
            }
          },
          "question": "Weighing around a ton, the eland is the largest species of this animal in Africa"
        },
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "Here are the questions I can answer about the gavial:\n\n1. What bodily feature distinguishes the gavial from a crocodile? \nThe long, slender snout is a distinguishi

In [13]:
# Q3 - Evaluate: Also fetch the stored answer so each generated response can be compared with it.
response = (
    client.query
    .get('Question', ['question', 'answer'])
    .with_near_text({"concepts": ["animal"]})
    .with_generate(single_prompt=prompt)
    .with_limit(3)
    .do()
)

json_print(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "The only living mammal in the order Proboscidea is the elephant."
            }
          },
          "answer": "Elephant",
          "question": "It's the only living mammal in the order Proboseidea"
        },
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "The largest species of animal in Africa weighing around a ton would be the African elephant."
            }
          },
          "answer": "Antelope",
          "question": "Weighing around a ton, the eland is the largest species of this animal in Africa"
        },
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "Here are the questions I can answer about the gavial:\n\n1. What bodily feature distinguishes the gavi