# ReadTheDocs Retrieval Augmented Generation (RAG) using Zilliz Free Tier

In this notebook, we use the Milvus documentation pages to build a chatbot about our product.  The chatbot follows the RAG steps: it retrieves chunks of data using semantic vector search, then the question plus the retrieved context is fed as a prompt to an LLM to generate an answer.

Many RAG demos use OpenAI for the Embedding Model and ChatGPT for the Generative AI model.  **In this notebook, we will demo a fully open source RAG stack.**

Using open-source Q&A with retrieval saves money because almost all calls (retrieval, evaluation, and development iterations) run for free on our own data.  We make only one paid call to OpenAI, for the final chat generation step.

<div>
<img src="./rag_image.png" width="80%"/>
</div>

Let's get started!

In [1]:
# For colab install these libraries in this order:
# !pip install pymilvus langchain torch transformers sentence-transformers 
# !pip install python-dotenv unstructured openai

In [2]:
# Import common libraries.
import sys, os, time, pprint
import numpy as np

# Import custom functions for splitting and search.
sys.path.append("..")  # Adds higher directory to python modules path.
import milvus_utilities as _utils

## Download Milvus documentation to a local directory.

The data we’ll use is our own product documentation web pages.  ReadTheDocs is an open-source free software documentation hosting platform, where documentation is written with the Sphinx document generator.

The code block below downloads the web pages into a local directory called `rtdocs`.  

I've already uploaded the `rtdocs` data folder to GitHub, so you should see it if you cloned my repo.

In [3]:
# # Uncomment to download readthedocs pages locally.

# DOCS_PAGE="https://pymilvus.readthedocs.io/en/latest/"
# !echo $DOCS_PAGE

# # Specify encoding to handle non-unicode characters in documentation.
# !wget -r -A.html -P rtdocs --header="Accept-Charset: UTF-8" $DOCS_PAGE

## Start up a Zilliz free tier cluster.

Code in this notebook uses fully-managed Milvus on [Zilliz Cloud free trial](https://cloud.zilliz.com/login).  
  1. Choose the default "Starter" option when you provision > Create collection > Give it a name > Create cluster and collection.  
  2. On the Cluster main page, copy your `API Key` and store it locally as a variable in a .env file.  See the note below for how to do that.
  3. Also on the Cluster main page, copy the `Public Endpoint URI`.

💡 Note: To keep your tokens private, best practice is to use an **env variable**.  See [how to save api key in env variable](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety). <br>

👉🏼 In Jupyter, you need a .env file (in same dir as notebooks) containing lines like this:
- ZILLIZ_API_KEY=f370c...
- OPENAI_API_KEY=sk-H...
- VARIABLE_NAME=value...
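
A small helper can make a missing key fail early with a clear message, instead of surfacing later as an authentication error. The sketch below is only an illustration: `require_env` and `mask` are hypothetical helper names (not part of python-dotenv or pymilvus), and it assumes `load_dotenv()` has already copied the `.env` values into `os.environ`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Show only the first few characters so a key can be logged safely."""
    return secret[:visible] + "..." if len(secret) > visible else "..."


# Demo with a made-up variable name and value (not a real key).
os.environ["DEMO_API_KEY"] = "abcdef123456"
print(mask(require_env("DEMO_API_KEY")))  # abcd...
```

Printing only the masked value keeps the full token out of notebook outputs that may be committed to a repo.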

In [4]:
# STEP 1. CONNECT TO MILVUS

# !pip install pymilvus #python sdk for milvus
from pymilvus import connections, utility
from dotenv import load_dotenv
load_dotenv()
TOKEN = os.getenv("ZILLIZ_API_KEY")

# Connect to Zilliz cloud using endpoint URI and API key TOKEN.
# TODO: replace the endpoint below with your own cluster's Public Endpoint URI.
# Template: "https://in03-xxxx.api.gcp-us-west1.zillizcloud.com:443"
CLUSTER_ENDPOINT="https://in03-48a5b11fae525c9.api.gcp-us-west1.zillizcloud.com:443"
connections.connect(
  alias='default',
  #  Public endpoint obtained from Zilliz Cloud
  uri=CLUSTER_ENDPOINT,
  # API key or a colon-separated cluster username and password
  token=TOKEN,
)

# Check that the server is ready by printing its version.
print(f"Type of server: {utility.get_server_version()}")

Type of server: Zilliz Cloud Vector Database(Compatible with Milvus 2.3)


## Load the Embedding Model checkpoint and use it to create vector embeddings
**Embedding model:**  We will use the open-source [sentence transformers](https://www.sbert.net/docs/pretrained_models.html) available on HuggingFace to encode the documentation text.  We will download the model from HuggingFace and run it locally. 

💡Tip:  A good way to choose a sentence transformer model is to check the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).  Sort descending by column "Retrieval Average" and choose the best-performing small model.

Two model parameters of note below:
1. EMBEDDING_DIM refers to the dimensionality or length of the embedding vector. In this case, the encoder produces a hidden vector of length 1024 for each token, and the pooling layer averages them into a single vector of the SAME length = 1024 for each input text, however long the text is. This size is typical of BERT-large-based models. <br><br>
2. MAX_SEQ_LENGTH is the maximum Context Length the encoder model can handle for input sequences. In this case, if sequences longer than 512 tokens are given to the model, everything longer will be (silently!) chopped off.  This is the reason why a chunking strategy is needed to segment input texts into chunks with lengths that will fit in the model's input.
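
To make the two parameters concrete, here is a toy sketch in plain Python of what mean pooling and truncation do. The numbers are invented for illustration (3 tokens, dimension 4 instead of 1024), and the helper names `mean_pool` and `truncate_tokens` are hypothetical; the real encoder does this internally on tensors.

```python
from typing import List


def truncate_tokens(tokens: List[str], max_seq_length: int) -> List[str]:
    """Keep only the first max_seq_length tokens; the rest are silently dropped."""
    return tokens[:max_seq_length]


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average per-token vectors into one fixed-length vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


# Toy values: 3 tokens, embedding dimension 4 (the model above uses 1024).
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
sentence_vector = mean_pool(token_vectors)
print(sentence_vector)       # [2.0, 1.0, 2.0, 2.0]
print(len(sentence_vector))  # 4, regardless of the number of tokens

# A 600-token input loses its last 88 tokens with a 512-token limit.
tokens = ["tok"] * 600
print(len(truncate_tokens(tokens, 512)))  # 512
```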

In [5]:
# STEP 2. DOWNLOAD AN OPEN SOURCE EMBEDDING MODEL.

# Import torch.
import torch
from sentence_transformers import SentenceTransformer

# Initialize torch settings
torch.backends.cudnn.deterministic = True
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"device: {DEVICE}")

# Load the model from huggingface model hub.
# python -m pip install -U angle-emb
model_name = "WhereIsAI/UAE-Large-V1"
encoder = SentenceTransformer(model_name, device=DEVICE)
print(type(encoder))
print(encoder)

# Get the model parameters and save for later.
EMBEDDING_DIM = encoder.get_sentence_embedding_dimension()
MAX_SEQ_LENGTH_IN_TOKENS = encoder.get_max_seq_length() 
# # Assume tokens are 3 characters long.
# MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS * 3
# HF_EOS_TOKEN_LENGTH = 1 * 3
# Test with 512 sequence length.
MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS
HF_EOS_TOKEN_LENGTH = 1

# Inspect model parameters.
print(f"model_name: {model_name}")
print(f"EMBEDDING_DIM: {EMBEDDING_DIM}")
print(f"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH}")

device: cpu


No sentence-transformers model found with name WhereIsAI/UAE-Large-V1. Creating a new one with MEAN pooling.


<class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
model_name: WhereIsAI/UAE-Large-V1
EMBEDDING_DIM: 1024
MAX_SEQ_LENGTH: 512


## Create a Milvus collection

You can think of a collection in Milvus like a "table" in SQL databases.  The **collection** will contain the 
- **Schema** (or [no-schema Milvus client](https://milvus.io/docs/using_milvusclient.md)).  
💡 You'll need the vector `EMBEDDING_DIM` parameter from your embedding model.
Typical values are:
   - 384, 768, or 1024 for sentence-transformers (sbert) models, depending on the model; 1024 for the model used here
   - 1536 for ada-002 OpenAI embedding models
- **Vector index** for efficient vector search
- **Vector distance metric** for measuring nearest neighbor vectors
- **Consistency level**
In Milvus, transactional consistency is possible; however, according to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), stronger consistency comes at the cost of some latency. 💡 Searching documentation is not mission-critical, so [`eventually`](https://milvus.io/docs/consistency.md) consistent is fine here.

## Add a Vector Index

The vector index determines the vector **search algorithm** used to find the closest vectors in your data to the query a user submits.  

Most vector indexes use different sets of parameters depending on whether the database is:
- **inserting vectors** (creation mode) - vs - 
- **searching vectors** (search mode) 

Scroll down the [docs page](https://milvus.io/docs/index.md) to see a table listing different vector indexes available on Milvus.  For example:
- FLAT - deterministic exhaustive search
- IVF_FLAT or IVF_SQ8 - Inverted file index that clusters the vectors (approximate search)
- HNSW - Graph index (stochastic approximate search)
- AUTOINDEX - Automatically determined based on OSS vs [Zilliz cloud](https://docs.zilliz.com/docs/autoindex-explained), type of GPU, size of data.
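
As a reference point for the index types listed above, the FLAT option amounts to comparing the query against every stored vector. The following is a minimal pure-Python sketch of that exhaustive search, not Milvus's implementation; `flat_search` is a hypothetical name and the vectors are toy values.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flat_search(
    query: Sequence[float], vectors: List[Sequence[float]], top_k: int
) -> List[Tuple[int, float]]:
    """Score every vector against the query and return the top_k (index, score) pairs."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
hits = flat_search([1.0, 0.1], vectors, top_k=2)
print([i for i, _ in hits])  # [0, 2]
```

The cost grows linearly with the number of stored vectors, which is why approximate indexes such as IVF and HNSW trade a little recall for much faster search on large collections.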

Besides a search algorithm, we also need to specify a **distance metric**, that is, a definition of what is considered "close" in vector space.  In the cell below, the [`HNSW`](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md) search index is chosen.  Its possible distance metrics are one of:
- L2 - L2-norm
- IP - Dot-product
- COSINE - Angular distance

💡 Most use cases work better with normalized embeddings, in which case all three metrics rank neighbors identically: IP and COSINE are the same, and squared L2 distance equals 2 - 2·COSINE.  The choice only changes results if you plan to keep your embeddings unnormalized.
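
The relationship between the three metrics on unit-length vectors can be checked numerically. This is a small sketch with toy 2-d vectors; the helper names are illustrative only.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product (the IP metric)."""
    return sum(x * y for x, y in zip(a, b))


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([3.0, 4.0])  # [0.6, 0.8]
b = normalize([4.0, 3.0])  # [0.8, 0.6]

ip = dot(a, b)
cos = ip / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

print(math.isclose(ip, cos))                      # True: IP equals COSINE
print(math.isclose(l2_squared(a, b), 2 - 2 * ip))  # True: L2^2 = 2 - 2*COSINE
```

Because squared L2 is a decreasing function of the cosine on unit vectors, sorting by smallest L2 and sorting by largest cosine give the same neighbors.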

In [6]:
# STEP 3. CREATE A NO-SCHEMA MILVUS COLLECTION AND DEFINE THE DATABASE INDEX.

from pymilvus import MilvusClient

# Set the Milvus collection name.
COLLECTION_NAME = "MilvusDocs"

# Add custom HNSW search index to the collection.
# M = max number graph connections per layer. Large M = denser graph.
# Choice of M: 4~64, larger M for larger data and larger embedding lengths.
M = 16
# efConstruction = num_candidate_nearest_neighbors per layer. 
# Use Rule of thumb: int. 8~512, efConstruction = M * 2.
efConstruction = M * 2
# Create the search index for local Milvus server.
INDEX_PARAMS = dict({
    'M': M,               
    "efConstruction": efConstruction })
index_params = {
    "index_type": "HNSW", 
    "metric_type": "COSINE", 
    "params": INDEX_PARAMS
    }

# Use the no-schema Milvus client, which accepts a flexible json key:value format.
# https://milvus.io/docs/using_milvusclient.md
mc = MilvusClient(
    uri=CLUSTER_ENDPOINT,
    # API key or a colon-separated cluster username and password
    token=TOKEN)

# Check if collection already exists, if so drop it.
has = utility.has_collection(COLLECTION_NAME)
if has:
    drop_result = utility.drop_collection(COLLECTION_NAME)
    print(f"Successfully dropped collection: `{COLLECTION_NAME}`")

# Create the collection.
mc.create_collection(COLLECTION_NAME, 
                     EMBEDDING_DIM,
                     consistency_level="Eventually", 
                     auto_id=True,  
                     overwrite=True,
                     # skip setting params below, if using AUTOINDEX
                     params=index_params
                    )

print(f"Successfully created collection: `{COLLECTION_NAME}`")
print(mc.describe_collection(COLLECTION_NAME))

Successfully created collection: `MilvusDocs`
{'collection_name': 'MilvusDocs', 'auto_id': True, 'num_shards': 1, 'description': '', 'fields': [{'field_id': 100, 'name': 'id', 'description': '', 'type': 5, 'params': {}, 'element_type': 0, 'auto_id': True, 'is_primary': True}, {'field_id': 101, 'name': 'vector', 'description': '', 'type': 101, 'params': {'dim': 1024}, 'element_type': 0}], 'aliases': [], 'collection_id': 448076879561902231, 'consistency_level': 3, 'properties': {}, 'num_partitions': 1, 'enable_dynamic_field': True}


## Chunking

Before embedding, you need to decide on a chunking strategy, chunk size, and chunk overlap.  In this demo, I will use:
- **Strategy** = Use the HTML header hierarchy.  Keep header sections together unless they are too long.
- **Chunk size** = Use the embedding model's parameter `MAX_SEQ_LENGTH`
- **Overlap** = Rule-of-thumb 10-15%
- **Function** = 
  - Langchain's `HTMLHeaderTextSplitter` to split the HTML pages into header sections.
  - Langchain's `RecursiveCharacterTextSplitter` to split up long sections recursively.


Notice below, each chunk is grounded with the document source page.  <br>
In addition, header titles are kept together with the chunk text.
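
To show how chunk size and overlap interact, here is a simplified fixed-window sketch in plain Python. It is not LangChain's algorithm (`RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper that counts characters, using the same 511 / 51 values as the cell below.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 1200-character text, just to see the window sizes.
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text, chunk_size=511, chunk_overlap=51)
print([len(c) for c in chunks])            # [511, 511, 280]
print(chunks[0][-51:] == chunks[1][:51])   # True: neighbors share 51 characters
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one of the two neighboring chunks.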

In [7]:
# STEP 4. PREPARE DATA: CHUNK AND EMBED

# Read docs into LangChain
#!pip install langchain 
from langchain.document_loaders import DirectoryLoader

# Load HTML files from a local directory
path = "rtdocs/pymilvus.readthedocs.io/en/latest/"
loader = DirectoryLoader(path, glob='*.html')
docs = loader.load()

num_documents = len(docs)
print(f"loaded {num_documents} documents")

loaded 8 documents


In [8]:
from langchain.text_splitter import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter
from bs4 import BeautifulSoup

# Define the headers to split on for the HTMLHeaderTextSplitter
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]
# Create an instance of the HTMLHeaderTextSplitter
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Specify chunk size and overlap.
chunk_size = MAX_SEQ_LENGTH - HF_EOS_TOKEN_LENGTH
chunk_overlap = np.round(chunk_size * 0.10, 0)
print(f"chunk_size: {chunk_size}, chunk_overlap: {chunk_overlap}")

# Create an instance of the RecursiveCharacterTextSplitter
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap,
    length_function = len,
)

# Split the HTML text using the HTMLHeaderTextSplitter.
start_time = time.time()
html_header_splits = []
for doc in docs:
    soup = BeautifulSoup(doc.page_content, 'html.parser')
    splits = html_splitter.split_text(str(soup))
    for split in splits:
        # Add the source URL and header values to the metadata
        metadata = {}
        new_text = split.page_content
        for header_name, metadata_header_name in headers_to_split_on:
            # Handle exception if h1 does not exist.
            try:
                header_value = new_text.split("¶ ")[0].strip()[:100]
                metadata[header_name] = header_value
            except:
                break
            # Handle exception if h2 does not exist.
            try:
                new_text = new_text.split("¶ ")[1].strip()[:50]
            except:
                break
        split.metadata = {
            **metadata,
            "source": doc.metadata["source"]
        }
        # Add the header to the text
        split.page_content = split.page_content
    html_header_splits.extend(splits)

# Split the documents further into smaller, recursive chunks.
chunks = child_splitter.split_documents(html_header_splits)

end_time = time.time()
print(f"chunking time: {end_time - start_time}")
print(f"docs: {len(docs)}, split into: {len(html_header_splits)}")
print(f"split into chunks: {len(chunks)}, type: list of {type(chunks[0])}") 

# Inspect a chunk.
print()
print("Looking at a sample chunk...")
print(chunks[0].page_content[:100])
print(chunks[0].metadata)

# # TODO - Uncomment to print child splits with their associated header metadata.
# print()
# for child in chunks:
#     print(f"Content: {child.page_content}")
#     print(f"Metadata: {child.metadata}")
#     print()

chunk_size: 511, chunk_overlap: 51.0
chunking time: 0.013944864273071289
docs: 8, split into: 8
split into chunks: 161, type: list of <class 'langchain_core.documents.base.Document'>

Looking at a sample chunk...
pymilvus latest Table of Contents Installation Installing via pip Installing in a virtual environmen
{'h1': 'pymilvus latest Table of Contents Installation Installing via pip Installing in a virtual environmen', 'h2': 'Installing via pip', 'source': 'rtdocs/pymilvus.readthedocs.io/en/latest/install.html'}


In [9]:
# Clean up the metadata urls
for doc in chunks:
    new_url = doc.metadata["source"]
    new_url = new_url.replace("rtdocs", "https:/")
    doc.metadata.update({"source": new_url})

print(chunks[0].page_content[:100])
print(chunks[0].metadata)

pymilvus latest Table of Contents Installation Installing via pip Installing in a virtual environmen
{'h1': 'pymilvus latest Table of Contents Installation Installing via pip Installing in a virtual environmen', 'h2': 'Installing via pip', 'source': 'https://pymilvus.readthedocs.io/en/latest/install.html'}


## Insert data into Milvus

For each original text chunk, we'll write the five fields (`vector, chunk, source, h1, h2`) into the database.

<div>
<img src="../../images/db_insert.png" width="80%"/>
</div>

**The Milvus Client wrapper can only load data from a list of dictionaries.**

More generally, Milvus supports loading data from:
- pandas dataframes 
- list of dictionaries

Below, we use the embedding model provided by HuggingFace, download its checkpoint, and run it locally as the encoder.  
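
The shape of one row in that list of dictionaries can be sketched without any model or database. In this illustration `make_row` and `l2_normalize` are hypothetical helpers, the 2-d vector stands in for a real 1024-d embedding, and the 50-character header limit mirrors the truncation used in the cell below.

```python
import math
from typing import Any, Dict, List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def make_row(
    vector: Sequence[float],
    chunk: str,
    source: str,
    h1: str,
    h2: str = "",
    max_header_chars: int = 50,
) -> Dict[str, Any]:
    """Assemble one entity: embedding vector, original text, and metadata."""
    return {
        "vector": l2_normalize(vector),
        "chunk": chunk,
        "source": source,
        "h1": h1[:max_header_chars],
        "h2": h2[:max_header_chars],
    }


row = make_row(
    vector=[3.0, 4.0],
    chunk="HNSW is a graph-based indexing algorithm.",
    source="https://pymilvus.readthedocs.io/en/latest/param.html",
    h1="Index",
)
print(row["vector"])  # [0.6, 0.8]
print(sorted(row))    # ['chunk', 'h1', 'h2', 'source', 'vector']
```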

In [10]:
# STEP 5. INSERT CHUNKS AND EMBEDDINGS IN ZILLIZ.

# Convert chunks to a list of dictionaries.
chunk_list = []
for chunk in chunks:

    # Generate a normalized embedding using the encoder from HuggingFace.
    # normalize_embeddings=True scales the vector to unit length.
    converted_values = encoder.encode(
        [chunk.page_content], normalize_embeddings=True)[0].astype(np.float32)
    
    # Only use h1, h2. Truncate the metadata in case too long.
    h2 = chunk.metadata.get('h2', "")[:50]
    # Assemble embedding vector, original text chunk, metadata.
    chunk_dict = {
        'vector': converted_values,
        'chunk': chunk.page_content,
        'source': chunk.metadata['source'],
        'h1': chunk.metadata['h1'][:50],
        'h2': h2,
    }
    chunk_list.append(chunk_dict)

# Insert data into the Milvus collection.
print("Start inserting entities")
start_time = time.time()
insert_result = mc.insert(
    COLLECTION_NAME,
    data=chunk_list,
    progress_bar=True)
end_time = time.time()
print(f"Milvus Client insert time for {len(chunk_list)} vectors: {end_time - start_time} seconds")

# After final entity is inserted, call flush to stop growing segments left in memory.
mc.flush(COLLECTION_NAME)

# Milvus Client insert time for 156 vectors: 1.283660888671875 seconds

Start inserting entities


100%|██████████| 1/1 [00:01<00:00,  1.13s/it]


Milvus Client insert time for 161 vectors: 1.1355090141296387 seconds


In [11]:
# # TODO - Uncomment to print.
# chunk_list[0]

## Ask a question about your data

So far in this demo notebook: 
1. Your custom data has been mapped into a vector embedding space
2. Those vector embeddings have been saved into a vector database

Next, you can ask a question about your custom data!

💡 In LLM vocabulary:
> **Query** is the generic term for user questions.  
A query is a list of multiple individual questions, up to maybe 1000 different questions!

> **Question** usually refers to a single user question.  
In our example below, the user question is "What do the parameters for HNSW mean?"

> **Semantic Search** = very fast search of the entire knowledge base to find the `TOP_K` documentation chunks with the closest embeddings to the user's query.

💡 Always use the same embedding model for the stored data and for the query; vectors produced by different models are not comparable.

In [12]:
# Define a sample question about your data.
QUESTION1 = "What do the parameters for HNSW mean?"
QUESTION2 = "What are good default values for HNSW parameters with 25K vectors dim 1024?"
QUESTION3 = "What is the default AUTOINDEX distance metric in Milvus Client?"
QUERY = [QUESTION1, QUESTION2, QUESTION3]

# Inspect the length (in characters) of one of the questions.
QUERY_LENGTH = len(QUESTION2)
print(f"query length: {QUERY_LENGTH}")

query length: 75


In [13]:
# SELECT A PARTICULAR QUESTION TO ASK.

SAMPLE_QUESTION = QUESTION1

## Execute a vector search

Search Milvus using [PyMilvus API](https://milvus.io/docs/search.md).

💡 By their nature, vector searches are "semantic" searches.  For example, if you were to search for "leaky faucet": 
> **Traditional Key-word Search** - either or both words "leaky", "faucet" would have to match some text in order to return a web page or link text to the document.

> **Semantic search** - results containing words "drippy" "taps" would be returned as well because these words mean the same thing even though they are different words.
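
The difference can be illustrated with a toy example. Everything here is made up for the illustration: the two documents, and especially the hand-written 2-d vectors, which stand in for real embeddings that a model would produce.

```python
import math
from typing import Dict, List, Sequence

docs = {
    "doc_a": "how to fix drippy taps",
    "doc_b": "best pizza dough recipe",
}
# Hypothetical embeddings: first axis ~ "plumbing", second axis ~ "cooking".
toy_vectors = {"doc_a": [0.9, 0.1], "doc_b": [0.1, 0.9]}

query_text = "leaky faucet"
query_vector = [0.8, 0.2]  # hypothetical embedding of the query


def keyword_search(query: str, documents: Dict[str, str]) -> List[str]:
    """Return ids of documents sharing at least one exact word with the query."""
    words = set(query.lower().split())
    return [d for d, text in documents.items() if words & set(text.lower().split())]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_search(query_vec: Sequence[float], vectors: Dict[str, Sequence[float]]) -> str:
    """Return the id of the document whose vector is closest to the query vector."""
    return max(vectors, key=lambda d: cosine(query_vec, vectors[d]))


print(keyword_search(query_text, docs))             # [] (no shared words)
print(semantic_search(query_vector, toy_vectors))   # doc_a
```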

In [14]:
# RETRIEVAL USING MILVUS API.

# # Not needed with Milvus Client API.
# mc.load()

# Embed the question using the same encoder.
query_embeddings = _utils.embed_query(encoder, [SAMPLE_QUESTION])
TOP_K = 2

# Return top k results with HNSW index.
SEARCH_PARAMS = dict({
    # Re-use index param for num_candidate_nearest_neighbors.
    "ef": INDEX_PARAMS['efConstruction']
    })

# Define output fields to return.
OUTPUT_FIELDS = ["id", "h1", "h2", "source", "chunk"]

# Run semantic vector search using your query and the vector database.
start_time = time.time()
results = mc.search(
    COLLECTION_NAME,
    data=query_embeddings, 
    search_params=SEARCH_PARAMS,
    output_fields=OUTPUT_FIELDS, 
    # Milvus can utilize metadata in boolean expressions to filter search.
    # filter="id >= 0 && source == 'https://pymilvus.readthedocs.io/en/latest/param.html'",
    limit=TOP_K,
    consistency_level="Eventually"
    )

elapsed_time = time.time() - start_time
print(f"Milvus Client search time for {len(chunk_list)} vectors: {elapsed_time} seconds")

# Inspect search result.
print(f"type: {type(results[0])}, count: {len(results[0])}")

# Milvus Client search time for 156 vectors: 0.1264362335205078 seconds
# type: <class 'list'>, count: 3

Milvus Client search time for 161 vectors: 0.26346588134765625 seconds
type: <class 'list'>, count: 2


## Assemble and inspect the search result

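The search result for our single question is in the variable `results[0]`.  With the Milvus Client API it is a plain Python `list` of hits (see the printed type above), one hit per retrieved chunk.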
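
For orientation, here is a rough sketch of how such a result could be unpacked into context strings and metadata. It is not the notebook's `_utils.client_assemble_retrieved_context` helper. It assumes each hit is a dictionary with `id`, `distance`, and `entity` keys, which is my understanding of the Milvus Client return format and should be checked against your pymilvus version; the mock data below is invented.

```python
from typing import Any, Dict, List, Tuple


def assemble_context(
    results: List[List[Dict[str, Any]]], text_field: str = "chunk"
) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Split the hits for the first query into text chunks and their metadata."""
    contexts, metadata = [], []
    for hit in results[0]:
        entity = dict(hit["entity"])  # copy so the original hit is not modified
        contexts.append(entity.pop(text_field))
        metadata.append({"id": hit["id"], "distance": hit["distance"], **entity})
    return contexts, metadata


# Mock search result for one query with two hits (assumed structure).
mock_results = [[
    {"id": 1, "distance": 0.70,
     "entity": {"chunk": "HNSW is a graph-based index.", "source": "param.html"}},
    {"id": 2, "distance": 0.69,
     "entity": {"chunk": "M is the maximum degree of a node.", "source": "param.html"}},
]]

contexts, metadata = assemble_context(mock_results)
print(len(contexts))          # 2
print(metadata[0]["source"])  # param.html
```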

In [15]:
# Assemble retrieved context and context metadata.
METADATA_FIELDS = [f for f in OUTPUT_FIELDS if f != 'chunk']
formatted_results, context, context_metadata = _utils.client_assemble_retrieved_context(
    results, metadata_fields=METADATA_FIELDS, num_shot_answers=3)
print(f"Length context: {len(context[0])}, Number of contexts: {len(context)}")

# Loop through each context and metadata and print.
for i in range(len(context)):
    print(f"Retrieved result #{i+1}")
    print(f"distance = {formatted_results[i][0]}")
    print(f"Context: {context[i][:150]}")
    print(f"Metadata: {context_metadata[i]}")
    print()

Length context: 506, Number of contexts: 2
Retrieved result #1
distance = 0.6960254907608032
Context: the next layer to begin another search. After multiple iterations, it can quickly approach the target position. In order to improve performance, HNSW 
Metadata: {'id': 448076879560588669, 'h1': 'pymilvus latest Table of Contents Installation Tut', 'h2': 'Milvus support to create index to accelerate vecto', 'source': 'https://pymilvus.readthedocs.io/en/latest/param.html'}

Retrieved result #2
distance = 0.6890001893043518
Context: count. HNSW¶ HNSW (Hierarchical Navigable Small World Graph) is a graph-based indexing algorithm. It builds a multi-layer navigation structure for an 
Metadata: {'id': 448076879560588668, 'h1': 'pymilvus latest Table of Contents Installation Tut', 'h2': 'Milvus support to create index to accelerate vecto', 'source': 'https://pymilvus.readthedocs.io/en/latest/param.html'}



## Use an LLM to Generate a chat response to the user's question using the Retrieved Context.

Below, we'll use an open, very tiny generative AI model, or LLM, available on HuggingFace.  Many demos use OpenAI as the LLM choice instead.

In [16]:
# Join all the retrieved contexts into one string, separated by spaces.
contexts_combined = ' '.join(context)
# Join all the sources into one string, separated by commas.
source_values = [item['source'] for item in context_metadata]
source_combined = ', '.join(source_values)
print(f"Length long text to summarize: {len(contexts_combined)}")

# Define the question and context
no_context = "The quick brown fox jumped over the lazy dog."
no_prompt = f"{no_context}" + "\n" + f"{SAMPLE_QUESTION}"
full_prompt_baseline = "<human>: " + no_prompt + "\n" + "<bot>:"

my_prompt = f"{contexts_combined}" + "\n" + f"{SAMPLE_QUESTION}"
full_prompt = "<human>: " + my_prompt + "\n" + "<bot>:"

# pprint.pprint(full_prompt_baseline)
# pprint.pprint(full_prompt)


Length long text to summarize: 1016


In [17]:
# BASELINING THE LLM: ASK A QUESTION WITHOUT ANY RETRIEVED CONTEXT.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Using sheared-llama from LLMWare.
# https://huggingface.co/llmware/bling-sheared-llama-1.3b-0.1
# Load the model and tokenizer.
tokenizer = AutoTokenizer.from_pretrained("llmware/bling-sheared-llama-1.3b-0.1")  
model = AutoModelForCausalLM.from_pretrained("llmware/bling-sheared-llama-1.3b-0.1")

# Encode the inputs for question-answering.
inputs = tokenizer(full_prompt_baseline, return_tensors="pt")  
start_of_output = len(inputs.input_ids[0])

# Generate the answer using the model
outputs = model.generate(
        inputs.input_ids.to(DEVICE),
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        temperature=0.3,
        max_new_tokens=100,
        )
output_only = tokenizer.decode(outputs[0][start_of_output:],skip_special_tokens=True)  

# Post-processing due to fine-tuning artifacts. 
eot1 = output_only.find("Хроно")
eot2 = output_only.find("textt")
# print(eot1, eot2)
# Cut at the earliest artifact found; default=-1 avoids a ValueError when none is found.
eot_index = min((i for i in [eot1, eot2] if i >= 0), default=-1)
# print(eot_index)
if eot_index > -1:
    answer = output_only[:eot_index]
else:
    answer = output_only

# Print the generated answer
pprint.pprint(answer)

# Not a good answer.

'The parameters for HNSW are as follows '


In [18]:
# USING THE SAME LLM: ASK THE SAME QUESTION WITH RETRIEVED CONTEXT.

# Encode the inputs for question-answering.
inputs = tokenizer(full_prompt, return_tensors="pt")  
start_of_output = len(inputs.input_ids[0])

# Generate the answer using the model
outputs = model.generate(
        inputs.input_ids.to(DEVICE),
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        temperature=0.3,
        max_new_tokens=150,
        )
output_only = tokenizer.decode(outputs[0][start_of_output:],skip_special_tokens=True)  

# Post-processing due to fine-tuning artifacts. 
eot1 = output_only.find("<textt")
eot2 = output_only.find("Хроно")
eot3 = output_only.find("<|endoftext")
# print(eot1, eot2, eot3)
# Cut at the earliest artifact found; default=-1 avoids a ValueError when none is found.
eot_index = min((i for i in [eot1, eot2, eot3] if i >= 0), default=-1)
# print(eot_index)
if eot_index > -1:
    answer = output_only[:eot_index-1]
else:
    answer = output_only

# Print the generated answer
pprint.pprint(answer)

# Better answer.

('The parameters for HNSW are: M: Maximum degree of the node. efConstruction: '
 'Take the effect in stage of index construction.\r\n')


## Use OpenAI to generate a more human-like chat response to the user's question 

We've practiced retrieval for free on our own data using open-source LLMs.  <br>

Now let's make a call to the paid OpenAI GPT.

💡 Note: For use cases that need to always be factually grounded, use very low temperature values while more creative tasks can benefit from higher temperatures.
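
The effect of temperature can be seen on a toy next-token distribution. This sketch uses the standard softmax-with-temperature formula on three made-up logits; it illustrates the general idea and says nothing about how any particular provider implements sampling.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, temperature=0.1)
high = softmax_with_temperature(logits, temperature=2.0)

# At low temperature almost all probability sits on the top token;
# at high temperature the alternatives get a real chance of being sampled.
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```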

In [19]:
SYSTEM_PROMPT = f"""Use the Context below to answer the user's question. Be clear, factual, complete, concise.
If the answer is not in the Context, say "I don't know". 
Otherwise answer with fewer than 4 sentences and cite the grounding sources.
Context: {contexts_combined}
Answer: The answer to the question.
Grounding sources: {source_combined}
"""

In [20]:
# CAREFUL!! THIS COSTS MONEY!!
import openai, pprint
from openai import OpenAI

# Define the generation llm model to use.
# https://openai.com/blog/new-embedding-models-and-api-updates
# Customers using the pinned gpt-3.5-turbo model alias will be automatically upgraded to gpt-3.5-turbo-0125 two weeks after this model launches.
LLM_NAME = "gpt-3.5-turbo"
TEMPERATURE = 0.1
RANDOM_SEED = 415

# See how to save api key in env variable.
# https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
openai_client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
)

# Generate response using the OpenAI API.
response = openai_client.chat.completions.create(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT,},
        {"role": "user", "content": f"question: {SAMPLE_QUESTION}",}
    ],
    model=LLM_NAME,
    temperature=TEMPERATURE,
    seed=RANDOM_SEED,
    frequency_penalty=2,
)

# Print the question and answer along with grounding sources and citations.
print(f"Question: {SAMPLE_QUESTION}")

# Print all answers in the response.
for i, choice in enumerate(response.choices, 1):
    pprint.pprint(f"Answer: {choice.message.content}")
    print("\n")

# Question1: What do the parameters for HNSW mean?
# Answer:  Looks perfect!
# Best answer:  M: maximum degree of nodes in a layer of the graph. 
# efConstruction: number of nearest neighbors to consider when connecting nodes in the graph.
# ef: number of nearest neighbors to consider when searching for similar vectors. 

# Question2: What are good default values for HNSW parameters with 25K vectors dim 1024?
# Answer: M=16, efConstruction=500, and ef=64
# Best answer:  M=16, efConstruction=32, ef=32

# Question3: what is the default distance metric used in AUTOINDEX in Milvus?
# Answer: L2 
# Trick answer:  IP inner product, not yet updated in documentation still says L2.

Question: What do the parameters for HNSW mean?
('Answer: The parameters for HNSW are as follows:\n'
 '- M: Maximum degree of the node, limiting the maximum number of connections '
 'each node can have on a layer.\n'
 '- efConstruction: Controls search range during index construction.\n'
 '- ef: Controls search range when searching for targets in the graph.\n'
 '\n'
 'Sources:\n'
 'https://pymilvus.readthedocs.io/en/latest/param.html')




In [21]:
# Drop collection
utility.drop_collection(COLLECTION_NAME)

In [22]:
# Props to Sebastian Raschka for this handy watermark.
# !pip install watermark

%load_ext watermark
%watermark -a 'Christy Bergman' -v -p torch,transformers,sentence_transformers,pymilvus,langchain,openai --conda

Author: Christy Bergman

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.21.0

torch                : 2.2.0
transformers         : 4.37.2
sentence_transformers: 2.3.1
pymilvus             : 2.3.6
langchain            : 0.1.5
openai               : 1.11.1

conda environment: py311

