3  RAG 10k Query Using Open Embeddings and Local LLMs

In this notebook, we will run everything locally:

- embeddings are generated by open source models that run locally (see notebook : [rag-10k-a-populate-embeddings-open.ipynb](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/lab-4-rag/rag-10k-a-populate-embeddings-open.ipynb))
- the LLM we query also runs locally

Here is the overall RAG pipeline. In this notebook, we will do steps (2), (3) and (4):
- Step 1: Populate embeddings. This is already done in the notebook [rag-10k-a-populate-embeddings-open.ipynb](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/lab-4-rag/rag-10k-a-populate-embeddings-open.ipynb)
- 👉 Step 2: Calculate the embedding for the user query
- 👉 Step 3: Send the query embedding to Atlas to retrieve relevant documents
- 👉 Step 4: Send the query and the relevant documents (returned by the previous step) to the LLM and get an answer to our query

![image missing](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/rag-overview-3-mistral.png)

### What you need to run this notebook

- a (free) MongoDB Atlas Account
- and connection credentials

### This lab depends on:

- We assume we have processed PDF documents, calculated embeddings and loaded them into Atlas.  Refer to this notebook : [rag-10k-a-populate-embeddings-open.ipynb](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/lab-4-rag/rag-10k-a-populate-embeddings-open.ipynb)

### The Tech Stack

- Language: Python
- Vector database: Atlas
- Embedding Model: open source embedding model (runs locally)
- LLM: Mistral instruct 7B v0.2 (runs locally)

#### local LLM:

- **mistral-7b-instruct-v0.2.Q4_K_M.gguf** (4 bit quantized, medium)
- Learn about this model here: [model card](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF)
- Smallest 'recommended' model
- Model size : 4.37 GB
- When using a GPU, the model takes about 6.5 GB of GPU memory
- We are running it through [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/) with GPU support compiled in.  See below for instructions on how to set it up with GPU support

### How to run

This notebook is designed to run in a local Python environment (not Google Colab).

### ⚠️ Note: GPU is highly recommended 

Running LLMs efficiently requires a GPU.  A GPU can provide a 5x-10x speed-up.  For example, if the LLM answers a question in 5 seconds with a GPU, it might take 50 seconds without one!

We use the `llama-cpp-python` package.  Here is a quick guide to installing it with GPU support enabled:

```bash
conda activate atlas-1 # adjust to whichever conda environment you are using

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --upgrade --force-reinstall --no-cache-dir  llama-cpp-python
```

For a detailed install and setup guide, see [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)

## Step-1: Make sure documents are loaded in Atlas

This is done in this notebook: [rag-10k-a-populate-embeddings-local.ipynb](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/lab-4-rag/rag-10k-a-populate-embeddings-local.ipynb)

Please complete this first.

## Step-2: Configuration

We will set up some common configuration here.

In [1]:
# We will keep all global variables in an object to not pollute the global namespace.
class MyConfig(object):
    pass

MY_CONFIG = MyConfig()

MY_CONFIG.DB_NAME = 'rag1'
MY_CONFIG.COLLECTION_NAME = '10k_local'
MY_CONFIG.EMBEDDING_ATTRIBUTE = 'embedding_local'
MY_CONFIG.INDEX_NAME = 'idx_embedding_local'

## Embedding settings
## Option 1 : small model - about 133 MB size
## Option 2 : large model - about 1.34 GB
## This must match the model used in the populate notebook (see Step-7)

MY_CONFIG.EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"


## Step-3: Load Configuration

We need to configure the following:
- Atlas connection credentials

To do so:
- Set up your local Python environment following this [setup guide](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/setup-python-env.md)
- Create a file named `.env` in the same location as this notebook
- Add the following setting to it

```text
ATLAS_URI=mongodb+srv://<username>:<password>@sandbox.....
```
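
When debugging this configuration, it can help to confirm that `ATLAS_URI` has the expected shape without printing the password. Here is a minimal sketch using only the standard library. The helper name `mask_uri` and the sample URI are made up for illustration and are not part of this lab.

```python
from urllib.parse import urlsplit


def mask_uri(uri: str) -> str:
    """Return the URI with the password replaced by '***' (hypothetical helper)."""
    parts = urlsplit(uri)
    if parts.password is None:
        return uri  # nothing to hide
    # Rebuild the network location with the password masked.
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username}:***@{host}"
    return parts._replace(netloc=netloc).geturl()


# Made-up example value, not a real cluster
sample = "mongodb+srv://alice:s3cret@cluster0.example.mongodb.net/?retryWrites=true"
masked = mask_uri(sample)
print(masked)
```

In the notebook you could call `mask_uri(MY_CONFIG.ATLAS_URI)` after the settings are loaded, instead of printing the raw value.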


## Step-4: Basic Setup

### 4.1 - CPU/GPU

### To ensure only CPU is used, use these settings

```python
os.environ["CUDA_VISIBLE_DEVICES"]=""
```

and you will see this output

```text
using CUDA/GPU:  False
```

And when initializing the LLM,

- set `"n_gpu_layers": 0`

And when the model is running,
- You should not see any mention of CUDA in the output
- You should see CPU usage spike (it will use most available CPU cores)
- If you use a tool like `nvidia-smi`, you will notice that GPU memory is not being used by the Python process
- You may typically see `eval time =   ....   (  195.39 ms per token,     5.12 tokens per second)` . This is pretty slow.


### To use GPU 

Make sure the following code is commented out

```python
# os.environ["CUDA_VISIBLE_DEVICES"]=""
```

and you will see this output

```text
using CUDA/GPU:  True
device  0 NVIDIA GeForce RTX 2070
```

And when initializing the LLM,

- set `"n_gpu_layers": -1` - this will offload all layers to the GPU.  You can also specify a number (1, 10, 20 ...) that your GPU can support.  Higher numbers require more GPU memory

And when the model is running,
- You should see CUDA being used
- You should not see a lot of CPU usage (the GPU is doing the heavy lifting now)
- If you use a tool like `nvidia-smi`, you will notice that GPU memory is being used by the Python process (full offload can take about 6 GB of GPU memory)
- You may typically see `eval time =   ....   (   20.73 ms per token,    48.23 tokens per second)` . This is much faster than before!
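
To make the comparison concrete, here is a small sketch that pulls the `ms per token` figure out of the two timing lines quoted above and derives tokens per second and the speed-up. The helper name `ms_per_token` is made up for illustration. The derived GPU figure can differ from the logged one in the last digit, presumably because the log is computed from unrounded totals.

```python
import re

# Sample lines copied from the CPU and GPU runs described above
cpu_line = "eval time =   ....   (  195.39 ms per token,     5.12 tokens per second)"
gpu_line = "eval time =   ....   (   20.73 ms per token,    48.23 tokens per second)"


def ms_per_token(line: str) -> float:
    """Extract the 'ms per token' figure from a timing line (hypothetical helper)."""
    match = re.search(r"(\d+(?:\.\d+)?)\s+ms per token", line)
    if match is None:
        raise ValueError("no 'ms per token' figure found")
    return float(match.group(1))


cpu_ms = ms_per_token(cpu_line)
gpu_ms = ms_per_token(gpu_line)

# Tokens per second is the reciprocal of seconds per token
cpu_tps = 1000.0 / cpu_ms
gpu_tps = 1000.0 / gpu_ms
speedup = cpu_ms / gpu_ms

print(f"CPU: {cpu_tps:.2f} tokens/s")
print(f"GPU: {gpu_tps:.2f} tokens/s")
print(f"GPU speed-up: {speedup:.1f}x")
```

For these two sample runs the speed-up comes out a little over 9x, which is in line with the 5x-10x range mentioned earlier.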

In [2]:
## Check if GPU is enabled
import os
import torch

## To disable GPU and experiment, uncomment the following line
## Normally, you would want to use GPU, if one is available.
# os.environ["CUDA_VISIBLE_DEVICES"]=""

print ("using CUDA/GPU: ", torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
   print("device ", i , torch.cuda.get_device_properties(i).name)

using CUDA/GPU:  True
device  0 NVIDIA GeForce RTX 2070
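
Rather than editing `n_gpu_layers` by hand each time you switch between CPU and GPU, you could derive it from the same CUDA check. This is only a sketch: the helper `choose_n_gpu_layers` is hypothetical, and PyTorch seeing a CUDA device does not guarantee that `llama-cpp-python` was built with GPU support.

```python
def choose_n_gpu_layers(cuda_available: bool, max_layers: int = -1) -> int:
    """Return 0 (CPU only) or max_layers (-1 offloads all layers)."""
    return max_layers if cuda_available else 0


try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    # torch is not installed: assume CPU only
    cuda_available = False

n_gpu_layers = choose_n_gpu_layers(cuda_available)
print("n_gpu_layers:", n_gpu_layers)
```

The resulting value could then be passed as `model_kwargs={"n_gpu_layers": n_gpu_layers}` when the LLM is initialized.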


### 4.2 - Logging

In [3]:
## Set up logging.  To see more logging, set the level to DEBUG

import sys
import logging

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.basicConfig(stream=sys.stdout, level=logging.WARN)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [4]:
import os, sys

this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append (os.path.abspath (parent_dir))

## Step-5: Load Configurations

In [5]:
import os,sys
## Load Settings from .env file
from dotenv import find_dotenv, dotenv_values

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# debug
# print (config)

MY_CONFIG.ATLAS_URI = config.get('ATLAS_URI')

if  MY_CONFIG.ATLAS_URI:
    print ("✅ config ATLAS_URI found")
else:
    raise Exception ("❌ 'ATLAS_URI' is not set.  Please set it in the .env file to continue...")

## Clear the OpenAI API key, to be sure no OpenAI service is used
os.environ['OPENAI_API_KEY'] = ''

✅ config ATLAS_URI found


## Step-6: Initialize Atlas Client

If this step fails, make sure 'connect from anywhere' is enabled on your Atlas network configuration

![](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/atlas-connect-2.png)

In [6]:
import pymongo

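mongodb_client = pymongo.MongoClient(MY_CONFIG.ATLAS_URI)
# MongoClient connects lazily; a ping forces a round trip so connection problems surface here
mongodb_client.admin.command('ping')
print ('✅ Connected to Atlas instance!')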

✅ Connected to Atlas instance!


## Step-7 : Setup Embeddings


Remember, this embedding model must be the same one used in the `populate` step.

See this notebook: [rag-10k-a-populate-embeddings-open.ipynb](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/lab-4-rag/rag-10k-a-populate-embeddings-open.ipynb)


In [7]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

Settings.embed_model = HuggingFaceEmbedding(
    model_name = MY_CONFIG.EMBEDDING_MODEL
)


In [8]:
## testing
embeddings = Settings.embed_model.get_text_embedding("Hello world!")
print ('embedding len : ', len(embeddings))
print ('first few embeddings : ', embeddings[:10])

embedding len :  384
first few embeddings :  [-0.0032757227309048176, -0.011690807528793812, 0.041559189558029175, -0.03814816102385521, 0.024183066561818123, 0.01364425290375948, 0.011117850430309772, 0.04811973124742508, 0.02140951342880726, 0.01417492888867855]
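
Vector search works by comparing the query embedding against stored document embeddings and returning the closest ones. The sketch below shows the idea with cosine similarity on toy 3-dimensional vectors. Cosine similarity is an assumption here (the metric actually used depends on how the Atlas index was defined in the populate notebook), and the vectors and document names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real 384-dimensional embeddings
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # partially related
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Rank documents from most to least similar to the query
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```

With real embeddings you would pass the lists returned by `Settings.embed_model.get_text_embedding(...)` in place of the toy vectors.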


## Step-8: Initialize Local LLM

We will use an LLM running locally.

- **mistral-7b-instruct-v0.2.Q4_K_M.gguf** (4 bit quantized, medium)
- Learn about this model here: [model card](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF)
- Smallest 'recommended' model
- Model size : 4.37 GB
- When using a GPU, the model takes about 6.5 GB of GPU memory
- We are running it through [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/) with GPU support compiled in.  See above for instructions on how to set it up with GPU support


### How do I know the LLM is using GPU?

See the output of the cell below.  If you see something similar to this, then GPU acceleration is in effect:

```text
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
```

In [9]:

from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)
from llama_index.core import Settings


# small
# model_file_path = '../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_S.gguf'

# medium
model_file_path = '../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf'

llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    # model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=model_file_path,
    temperature=0.1,
    max_new_tokens=256,
    # 3900 leaves some wiggle room below the 4096-token context window of Llama 2;
    # this Mistral model file reports a larger one (llama.context_length = 32768 in the load log below)
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to 0 for no GPU, at least 1 to use GPU, -1 to offload all layers
    # try values such as 1, 10, 20, 30, 40
    # on an Nvidia GeForce RTX 2070 with 8 GB of memory, 40 works well
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

Settings.llm = llm


llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 12
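
The `context_window` and `max_new_tokens` settings above together determine how many tokens are left for the prompt (instructions, retrieved chunks and the question). The sketch below is a simplified view of that budget, not LlamaIndex's exact bookkeeping. The prompt sizes are the token counts reported on the `prompt eval time` lines of the queries later in this notebook and are used as rough illustrations only.

```python
CONTEXT_WINDOW = 3900  # context_window passed to LlamaCPP above
MAX_NEW_TOKENS = 256   # max_new_tokens passed to LlamaCPP above

# Tokens left for the prompt once room for the answer is reserved
prompt_budget = CONTEXT_WINDOW - MAX_NEW_TOKENS


def fits(prompt_tokens: int) -> bool:
    """True if the prompt still leaves room for a full-length answer."""
    return prompt_tokens <= prompt_budget


print("prompt budget:", prompt_budget)
for n in (1632, 1965, 1297, 2182):
    print(n, "fits" if fits(n) else "too long")
```

If you retrieve more or larger chunks per query, this budget is the number to keep an eye on.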

## Step-9: Setup Tokenizers

Set up the tokenizer to match the LLM for best results.

In [10]:
from transformers import AutoTokenizer
from llama_index.core import Settings

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2"
)

# set tokenizer for llama-index : https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/settings/#tokenizer
Settings.tokenizer = tokenizer


In [11]:
## test tokenizer
text = "Tokenizers are essential for natural language processing."
tokens = tokenizer.tokenize(text)
print ("Text words count : ", len (text.split()))
print ('tokens count: ', len(tokens))
print("Tokens:", tokens)

Text words count :  7
tokens count:  9
Tokens: ['▁Token', 'izers', '▁are', '▁essential', '▁for', '▁natural', '▁language', '▁processing', '.']


## Step-10 - Quick testing

Let's see if LLM / tokenizer settings are working well...

In [12]:
## Testing
resp = llm.complete("The capital of the United States is ")
print (resp)


llama_print_timings:        load time =     157.94 ms
llama_print_timings:      sample time =      26.99 ms /    77 runs   (    0.35 ms per token,  2853.33 tokens per second)
llama_print_timings: prompt eval time =     157.79 ms /    70 tokens (    2.25 ms per token,   443.62 tokens per second)
llama_print_timings:        eval time =    1256.74 ms /    76 runs   (   16.54 ms per token,    60.47 tokens per second)
llama_print_timings:       total time =    1573.79 ms /   146 tokens


 The capital city of the United States is Washington, D.C. (District of Columbia). This information is based on fact and is not speculative or made up. I am here to provide accurate and truthful responses to your queries. Let me know if you have any other question or if there's something else I can assist you with. Have a great day!


## Step-11: Connect LlamaIndex and MongoDB Atlas

Let's define MongoDB Atlas as our vector store. This is where the indexed data is stored and then queried.

In [13]:
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex


vector_store = MongoDBAtlasVectorSearch(mongodb_client = mongodb_client,
                                        db_name = MY_CONFIG.DB_NAME,
                                        collection_name = MY_CONFIG.COLLECTION_NAME,
                                        index_name  = MY_CONFIG.INDEX_NAME,
                                        embedding_key = MY_CONFIG.EMBEDDING_ATTRIBUTE,
                                        ## the following columns are set to default values
                                       # text_key = 'text', metadata_key = 'metadata',
                                 )
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store, storage_context=storage_context)
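
For intuition, a vector query against Atlas is expressed as an aggregation pipeline with a `$vectorSearch` stage. The sketch below only builds such a pipeline as a plain Python structure; it does not contact Atlas. It is an illustration, not necessarily the exact pipeline LlamaIndex sends: the helper name and the `limit` and `num_candidates` defaults are assumptions, and the query vector is a placeholder.

```python
def build_vector_search_pipeline(query_vector, index_name, path, limit=2, num_candidates=20):
    """Build an aggregation pipeline with a $vectorSearch stage (illustrative only)."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        # Keep the chunk text and expose the similarity score
        {"$project": {"_id": 0, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


pipeline = build_vector_search_pipeline(
    query_vector=[0.0] * 384,          # placeholder; a real query uses the embedding of the question
    index_name="idx_embedding_local",  # same value as MY_CONFIG.INDEX_NAME
    path="embedding_local",            # same value as MY_CONFIG.EMBEDDING_ATTRIBUTE
)
print(pipeline[0]["$vectorSearch"]["index"])
```

A pipeline like this could be passed to `collection.aggregate(...)` with `pymongo` if you want to inspect raw search results outside LlamaIndex.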

## Step-12: Query Data / Ask Questions

Now that we have everything set up, let's ask some questions.

These are the PDF documents we have loaded into Atlas. You can download and inspect them.

- [10k/lyft_2021.pdf](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/data/10k/lyft_2021.pdf)
- [10k/uber_2021.pdf](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/data/10k/uber_2021.pdf)

In [14]:
%%time

from IPython.display import Markdown
from pprint import pprint

query = "What was Uber's revenue?"
response = index.as_query_engine().query(query)
print (response)
print()
# display(Markdown(f"<b>{response}</b>"))
pprint(response, indent=4)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     157.94 ms
llama_print_timings:      sample time =      51.75 ms /   152 runs   (    0.34 ms per token,  2937.37 tokens per second)
llama_print_timings: prompt eval time =    1338.67 ms /  1632 tokens (    0.82 ms per token,  1219.12 tokens per second)
llama_print_timings:        eval time =    2914.14 ms /   151 runs   (   19.30 ms per token,    51.82 tokens per second)
llama_print_timings:       total time =    4606.03 ms /  1783 tokens


 Based on the provided context from Uber's Annual Report on Form 10-K for the years ended December 31, 2019, 2020, and 2021, the total revenue for Uber was $13,000 million in 2019, $11,139 million in 2020, and $17,455 million in 2021. The revenue was disaggregated into Mobility revenue, Delivery revenue, Freight revenue, and All Other revenue. The revenue figures mentioned in the query refer to the total revenue for Uber across all its offerings and geographical regions.

Response(response=" Based on the provided context from Uber's Annual Report on "
                  'Form 10-K for the years ended December 31, 2019, 2020, and '
                  '2021, the total revenue for Uber was $13,000 million in '
                  '2019, $11,139 million in 2020, and $17,455 million in 2021. '
                  'The revenue was disaggregated into Mobility revenue, '
                  'Delivery revenue, Freight revenue, and All Other revenue. '
                  'The revenue figures mentioned in

In [15]:
%%time

query = "How much money did Lyft make in 2020?"
response = index.as_query_engine().query(query)
print (response)
print()
pprint(response, indent=4)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     157.94 ms
llama_print_timings:      sample time =      19.49 ms /    57 runs   (    0.34 ms per token,  2923.98 tokens per second)
llama_print_timings: prompt eval time =    1565.10 ms /  1965 tokens (    0.80 ms per token,  1255.51 tokens per second)
llama_print_timings:        eval time =    1117.35 ms /    56 runs   (   19.95 ms per token,    50.12 tokens per second)
llama_print_timings:       total time =    2819.19 ms /  2021 tokens


 Based on the provided context information from the Lyft 2021 Annual Report on Form 10-K, Lyft reported revenue of $2,364,681 thousand in the year ended December 31, 2020.

Response(response=' Based on the provided context information from the Lyft '
                  '2021 Annual Report on Form 10-K, Lyft reported revenue of '
                  '$2,364,681 thousand in the year ended December 31, 2020.',
         source_nodes=[   NodeWithScore(node=TextNode(id_='fd763dfb-8f15-42e2-ae92-f6fc96fdef02', embedding=None, metadata={'page_label': '58', 'file_name': 'lyft_2021.pdf', 'file_path': '/content/data/10k/lyft_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1440303, 'creation_date': '2024-03-22', 'last_modified_date': '2024-03-22'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_acces
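
The `pprint` dump is hard to read, but it shows that each entry in `source_nodes` carries a node with `metadata` such as `file_name` and `page_label`. Here is a small sketch that reduces that to one line per source. The helper `summarize_sources` is hypothetical, and the stand-in objects (including the score value) are made up so that the example runs on its own.

```python
from types import SimpleNamespace


def summarize_sources(source_nodes):
    """Return (file_name, page_label, score) for each retrieved node."""
    rows = []
    for item in source_nodes:
        meta = item.node.metadata
        rows.append((meta.get("file_name"), meta.get("page_label"), item.score))
    return rows


# Stand-in objects that mimic the shape seen in the pprint output;
# in the notebook you would pass response.source_nodes instead.
fake_nodes = [
    SimpleNamespace(
        node=SimpleNamespace(metadata={"file_name": "lyft_2021.pdf", "page_label": "58"}),
        score=0.5,  # made-up score
    )
]

for file_name, page, score in summarize_sources(fake_nodes):
    print(f"{file_name} (page {page}), score={score}")
```

Checking the file and page of each source is a quick way to verify an answer against the original PDF.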

In [16]:
%%time

## The answer to this question doesn't exist in the Lyft_10k filing!
## Let's see what we get back
query = "What was Lyft's revenue for 2018?"
response = index.as_query_engine().query(query)
print (response)
print()
pprint(response, indent=4)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     157.94 ms
llama_print_timings:      sample time =      15.00 ms /    44 runs   (    0.34 ms per token,  2934.12 tokens per second)
llama_print_timings: prompt eval time =    1042.85 ms /  1297 tokens (    0.80 ms per token,  1243.70 tokens per second)
llama_print_timings:        eval time =     809.23 ms /    43 runs   (   18.82 ms per token,    53.14 tokens per second)
llama_print_timings:       total time =    2021.59 ms /  1340 tokens


 The context information provided does not contain Lyft's revenue for the year 2018. The provided information only includes revenue data for the years 2019 and 2021.

Response(response=" The context information provided does not contain Lyft's "
                  'revenue for the year 2018. The provided information only '
                  'includes revenue data for the years 2019 and 2021.',
         source_nodes=[   NodeWithScore(node=TextNode(id_='7b9763f5-20ce-49de-ab6d-5ff63ff87bb4', embedding=None, metadata={'page_label': '58', 'file_name': 'lyft_2021.pdf', 'file_path': '/content/data/10k/lyft_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1440303, 'creation_date': '2024-03-22', 'last_modified_date': '2024-03-22'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], 

In [17]:
%%time

query = "When did Uber go IPO?"
response = index.as_query_engine().query(query)
print (response)
print()
pprint(response, indent=4)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     157.94 ms
llama_print_timings:      sample time =       9.44 ms /    27 runs   (    0.35 ms per token,  2859.56 tokens per second)
llama_print_timings: prompt eval time =    1836.13 ms /  2182 tokens (    0.84 ms per token,  1188.37 tokens per second)
llama_print_timings:        eval time =     533.47 ms /    26 runs   (   20.52 ms per token,    48.74 tokens per second)
llama_print_timings:       total time =    2442.87 ms /  2208 tokens


 Uber went public through an Initial Public Offering (IPO) on May 14, 2019.

Response(response=' Uber went public through an Initial Public Offering (IPO) '
                  'on May 14, 2019.',
         source_nodes=[   NodeWithScore(node=TextNode(id_='8b94e298-1153-4b99-bcd5-f0d7e7ce9d4e', embedding=None, metadata={'page_label': '119', 'file_name': 'uber_2021.pdf', 'file_path': '/content/data/10k/uber_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1880483, 'creation_date': '2024-03-22', 'last_modified_date': '2024-03-22'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='6c43260d-f702-46de-a9d9-027090556f09', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '119', 'file_name': 'uber_2

### We can even query data in tables!

Here is some table-based data from [10k/lyft_2021.pdf](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/data/10k/lyft_2021.pdf), page 82.

And the LLM is able to extract the information and answer!

![](../images/rag-3-pdf-table.png)

In [18]:
%%time

query = "What were the Stock-based compensation for Lyft?"
response = index.as_query_engine().query(query)
print (response)
print()
pprint(response, indent=4)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     157.94 ms
llama_print_timings:      sample time =      26.36 ms /    77 runs   (    0.34 ms per token,  2920.98 tokens per second)
llama_print_timings: prompt eval time =    1355.59 ms /  1651 tokens (    0.82 ms per token,  1217.92 tokens per second)
llama_print_timings:        eval time =    1478.23 ms /    76 runs   (   19.45 ms per token,    51.41 tokens per second)
llama_print_timings:       total time =    3006.69 ms /  1727 tokens


 The stock-based compensation for Lyft was $1,599,311 in 2019, $565,807 in 2020, and $721,710 in 2021. (Refer to the 'Stock-based compensation' line in the provided financial statements.)

Response(response=' The stock-based compensation for Lyft was $1,599,311 in '
                  '2019, $565,807 in 2020, and $721,710 in 2021. (Refer to the '
                  "'Stock-based compensation' line in the provided financial "
                  'statements.)',
         source_nodes=[   NodeWithScore(node=TextNode(id_='40cefe84-7a2c-4a04-9bfa-7bf13b18c0e4', embedding=None, metadata={'page_label': '82', 'file_name': 'lyft_2021.pdf', 'file_path': '/content/data/10k/lyft_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1440303, 'creation_date': '2024-03-22', 'last_modified_date': '2024-03-22'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size

## Try your own queries below...

In [19]:
%%time

# query = "Your query goes here"
# response = index.as_query_engine().query(query)
# print (response)
# print()
# pprint(response, indent=4)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 7.63 µs
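
If you plan to try several queries, a small wrapper saves some repetition. This is a sketch under assumptions: `ask` and `EchoEngine` are hypothetical names, and `EchoEngine` is only a stand-in so that the example runs without the index. In the notebook you would create the engine once with `index.as_query_engine()` and reuse it.

```python
import time


def ask(query_engine, query: str):
    """Run one query, print the answer and the elapsed wall time, and return the response."""
    start = time.perf_counter()
    response = query_engine.query(query)
    elapsed = time.perf_counter() - start
    print(response)
    print(f"(answered in {elapsed:.2f} s)")
    return response


class EchoEngine:
    """Stand-in for a real query engine so this sketch runs on its own."""

    def query(self, query: str) -> str:
        return f"echo: {query}"


# In the notebook: engine = index.as_query_engine(); ask(engine, "Your query goes here")
result = ask(EchoEngine(), "What was Uber's revenue?")
```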
