#  RAG 10k Query Using Local LLMs

![](../images/rag-1.svg)

## CPU vs GPU

### To ensure only CPU is used, use these settings

```python
os.environ["CUDA_VISIBLE_DEVICES"]=""
```

and you will see this output

```text
using CUDA/GPU:  False
```

When initializing the LLM:

- set `"n_gpu_layers": 0`

While the model is running:
- You should not see any mention of CUDA in the output.
- You should see CPU usage spike (it will use most of the available CPU cores).
- If you use a tool like `nvidia-smi`, you will see that the Python process is not using any GPU memory.
- You will typically see something like `eval time =   ....   (  195.39 ms per token,     5.12 tokens per second)`. This is pretty slow.
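
The `nvidia-smi` check can also be scripted. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH` and supports the `--query-compute-apps` CSV query; `parse_compute_apps` is a hypothetical helper written for this example, and the memory column is assumed to be in MiB.

```python
import shutil
import subprocess
from typing import List, Optional, Tuple

# Ask nvidia-smi for one CSV row per process that currently holds GPU memory.
QUERY = ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"]


def parse_compute_apps(csv_text: str) -> List[Tuple[int, str, Optional[int]]]:
    """Parse 'pid, process_name, used_memory' rows into (pid, name, memory or None)."""
    rows = []
    for line in csv_text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        # used_memory may be a non-numeric placeholder on some systems
        memory = int(parts[2]) if parts[2].isdigit() else None
        rows.append((int(parts[0]), parts[1], memory))
    return rows


if shutil.which("nvidia-smi") is None:
    print("nvidia-smi not found; nothing to report")
else:
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    for pid, name, memory in parse_compute_apps(result.stdout):
        print(pid, name, memory)
```

In CPU-only mode the notebook's Python process should not appear in this list.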


### To use GPU 

Make sure the following code is commented out

```python
# os.environ["CUDA_VISIBLE_DEVICES"]=""
```

and you will see this output

```text
using CUDA/GPU:  True
device  0 NVIDIA GeForce RTX 2070
```

When initializing the LLM:

- set `"n_gpu_layers": -1` to offload all layers to the GPU. You can also specify a number (1, 10, 20, ...) that your GPU can support. Higher numbers require more GPU memory.

While the model is running:
- You should see CUDA being used.
- You should not see much CPU usage (the GPU is doing the heavy lifting now).
- If you use a tool like `nvidia-smi`, you will see that the Python process is using GPU memory (a full offload can take about 6 GB of GPU memory).
- You will typically see something like `eval time =   ....   (   20.73 ms per token,    48.23 tokens per second)`. This is much faster than before!
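
The tokens-per-second figure in these timing lines is the reciprocal of the per-token latency, so the CPU and GPU runs can be compared directly. Below is a minimal sketch, assuming timing lines shaped like the two quoted above; `parse_eval_timing` is a hypothetical helper, not part of llama.cpp.

```python
import re
from typing import Tuple

# Matches the "(  195.39 ms per token,     5.12 tokens per second)" part of a timing line.
_TIMING_RE = re.compile(r"([\d.]+)\s*ms per token,\s*([\d.]+)\s*tokens per second")


def parse_eval_timing(line: str) -> Tuple[float, float]:
    """Return (ms_per_token, tokens_per_second) parsed from a timing line."""
    match = _TIMING_RE.search(line)
    if match is None:
        raise ValueError(f"no timing information found in: {line!r}")
    return float(match.group(1)), float(match.group(2))


cpu_line = "eval time =   ....   (  195.39 ms per token,     5.12 tokens per second)"
gpu_line = "eval time =   ....   (   20.73 ms per token,    48.23 tokens per second)"

cpu_ms, cpu_tps = parse_eval_timing(cpu_line)
gpu_ms, gpu_tps = parse_eval_timing(gpu_line)

# The ratio of per-token latencies is the speedup for token generation.
speedup = cpu_ms / gpu_ms
print(f"GPU is about {speedup:.1f}x faster per generated token")
```

With the two sample lines above, this reports a speedup of roughly 9.4x on the RTX 2070.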

In [1]:
## Check if GPU is enabled
import os
import torch

## To disable GPU and experiment, uncomment the following line
## Normally, you would want to use GPU, if one is available.
# os.environ["CUDA_VISIBLE_DEVICES"]=""

print ("using CUDA/GPU: ", torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
   print("device ", i , torch.cuda.get_device_properties(i).name)

using CUDA/GPU:  True
device  0 NVIDIA GeForce RTX 2070


In [2]:
## Set up logging.  To see more logging, set the level to DEBUG

import sys
import logging

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.basicConfig(stream=sys.stdout, level=logging.WARNING)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
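
Note that when the root logger starts with no handlers, `basicConfig(stream=...)` already attaches a stdout handler, and the extra `addHandler` call attaches a second one, so each record can be printed twice. One way to avoid this is sketched below, assuming Python 3.8+ for the `force` argument; `configure_logging` is a hypothetical helper.

```python
import logging
import sys


def configure_logging(level: int = logging.WARNING) -> logging.Logger:
    """Send root-logger output to stdout through exactly one handler."""
    # force=True (Python 3.8+) removes handlers left over from earlier cell runs,
    # so re-running the cell does not stack up duplicate handlers.
    logging.basicConfig(stream=sys.stdout, level=level, force=True)
    return logging.getLogger()


root = configure_logging(logging.WARNING)
print("handlers on root logger:", len(root.handlers))
```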

In [3]:
import os, sys

this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append (os.path.abspath (parent_dir))

## Step-1: Load Settings

In [4]:
import os,sys
## Load Settings from .env file
from dotenv import find_dotenv, dotenv_values

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# debug
# print (config)

ATLAS_URI = config.get('ATLAS_URI')

if not ATLAS_URI:
    raise Exception ("'ATLAS_URI' is not set.  Please set it in the .env file to continue...")

## Clear the OpenAI key to be sure no OpenAI calls are made
os.environ['OPENAI_API_KEY'] = ''
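
If more settings are added later, the same check can be generalized. This is a minimal sketch operating on a plain dict such as the one returned by `dotenv_values`; `require_settings` is a hypothetical helper and the URI is a placeholder.

```python
from typing import Mapping, Optional, Sequence


def require_settings(config: Mapping[str, Optional[str]], keys: Sequence[str]) -> dict:
    """Return the requested settings, or raise naming every missing key."""
    missing = [key for key in keys if not config.get(key)]
    if missing:
        raise ValueError(f"missing settings in .env: {', '.join(missing)}")
    return {key: config[key] for key in keys}


# Stand-in for dotenv_values(find_dotenv()); the URI below is a placeholder.
example_config = {"ATLAS_URI": "mongodb+srv://<user>:<password>@<cluster>/"}
settings = require_settings(example_config, ["ATLAS_URI"])
print("loaded settings:", sorted(settings))
```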

In [5]:
DB_NAME = 'rag1'
COLLECTION_NAME = '10k'
INDEX_NAME = 'idx_embedding'

In [6]:
import os
## LlamaIndex will download embeddings models as needed.
## Set the LlamaIndex cache dir to ../llama-index-cache (default is the system tmp dir)
## This way, we can easily see downloaded artifacts
os.environ['LLAMA_INDEX_CACHE_DIR'] = os.path.join(os.path.abspath(''), '..', 'llama-index-cache')

In [7]:
from pymongo import MongoClient

mongodb_client = MongoClient(ATLAS_URI)

print ("Atlas client initialized")

Atlas client initialized


## Step-2 : Setup Embeddings

The default embedding model is OpenAI's. We can always plug in custom embeddings.

### 2.1 : OpenAI Embeddings

This uses the OpenAI embedding model.
You will need an API key (defined in the environment variable `OPENAI_API_KEY`).

In [8]:
# from llama_index import  OpenAIEmbedding
# embed_model = OpenAIEmbedding()

### 2.2 : Using Custom Embeddings

Remember, this embedding model must be the same one used in the `populate` step.

In [9]:
from llama_index.embeddings import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
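
The model has to match because query vectors are only comparable with stored vectors produced by the same model. A cheap sanity check is to compare dimensions, as sketched below; it assumes `BAAI/bge-small-en-v1.5` produces 384-dimensional vectors, and `check_query_embedding` is a hypothetical helper. A matching dimension is necessary but does not prove the models are the same.

```python
from typing import Sequence

# Assumption: BAAI/bge-small-en-v1.5 produces 384-dimensional vectors.
EXPECTED_DIM = 384


def check_query_embedding(vector: Sequence[float], expected_dim: int = EXPECTED_DIM) -> None:
    """Raise if a query vector cannot be compared with the stored vectors."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"query embedding has {len(vector)} dimensions, "
            f"but the index was populated with {expected_dim}"
        )


# With the real model this would be: embed_model.get_query_embedding("some question")
fake_query_vector = [0.0] * 384
check_query_embedding(fake_query_vector)
print("query embedding matches the index dimension")
```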



## Initialize Local LLM

We will use an LLM running locally.

Our local LLM:

- mistral-7b-instruct-v0.2.Q4_K_M.gguf (4 bit quantized, medium).  [Model card here](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF)
- Smallest 'recommended' model
- Model size : 4.37 GB
- When using GPU, the model takes about 6.5 GB of GPU memory
- We are running it through [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/) with GPU support compiled in

In [10]:

from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)
# small
# model_file_path = '../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_S.gguf'

# medium
model_file_path = '../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf'

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    # model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=model_file_path,
    temperature=0.1,
    max_new_tokens=256,
    # this window was sized for Llama 2 (4096 tokens, minus some wiggle room); the Mistral model's
    # metadata reports a context length of 32768, so 3900 is well within its limit
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # n_gpu_layers: 0 = CPU only, N >= 1 = offload N layers to the GPU, -1 = offload all layers
    # try values such as 1, 10, 20, 30, 40
    # on an NVIDIA GeForce RTX 2070 with 8 GB of GPU memory, 40 works well
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32            
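
The `context_window` and `max_new_tokens` settings together bound how large a prompt can be: the retrieved context plus the question plus the longest allowed answer must fit in the window. Here is a minimal sketch of that budget check, using the values passed to `LlamaCPP` above; `fits_context` is a hypothetical helper.

```python
def fits_context(prompt_tokens: int, max_new_tokens: int = 256, context_window: int = 3900) -> bool:
    """True if the prompt plus the longest allowed answer fits in the context window."""
    return prompt_tokens + max_new_tokens <= context_window


# Prompt sizes reported by llama_print_timings for the queries later in this notebook.
for prompt_tokens in (1689, 1963, 1296, 1710, 1634):
    print(prompt_tokens, fits_context(prompt_tokens))
```

All of the prompts in this notebook fit with room to spare.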

In [11]:
from llama_index import  ServiceContext

# The LLM used to generate natural language responses to queries.
# If not provided, defaults to gpt-3.5-turbo from OpenAI
# If your OpenAI key is not set, defaults to llama2-chat-13B from Llama.cpp

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)

## Step-3: Connect LlamaIndex and MongoDB Atlas

Let's define MongoDB Atlas as our vector store. It holds the indexed data that we will query.

In [12]:
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from llama_index.storage.storage_context import StorageContext
from llama_index.indices.vector_store.base import VectorStoreIndex


vector_store = MongoDBAtlasVectorSearch(mongodb_client = mongodb_client,
                                 db_name = DB_NAME, collection_name = COLLECTION_NAME,
                                 index_name = INDEX_NAME,
                                 ## the following keys are left at their default values
                                 # embedding_key = 'embedding', text_key = 'text', metadata_key = 'metadata',
                                 )
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)
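
Conceptually, a vector store answers a query by finding the stored embeddings closest to the query embedding. The sketch below illustrates the idea with a brute-force cosine-similarity search over toy vectors. It is not how Atlas implements the search (Atlas runs it server-side, and the similarity function depends on how the index was defined); cosine similarity and the document ids are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], docs: dict, k: int = 2) -> List[Tuple[str, float]]:
    """Return the k document ids most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real embedding vectors are much longer.
docs = {
    "uber_revenue": [0.9, 0.1, 0.0],
    "lyft_revenue": [0.7, 0.7, 0.0],
    "risk_factors": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], docs))
```

The chunks returned by this kind of search are what the query engine passes to the LLM as context.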

## Step-4: Query Data / Ask Questions

Now that we have everything set up, let's ask some questions.

In [13]:
%%time

from IPython.display import Markdown
from pprint import pprint

response = index.as_query_engine().query("What was Uber's revenue?")
print (response)
print()
# display(Markdown(f"<b>{response}</b>"))
pprint(response, indent=4)

 Based on the provided context information from Uber's 2021 Annual Report on Form 10-K, I can see that Uber generated total revenue of $17,455 million for the year ended December 31, 2021. This revenue is a combination of Mobility revenue, Delivery revenue, Freight revenue, and All Other revenue. The exact breakdown of these revenues by offering and geographical region is presented in the context information as well.

Response(response=" Based on the provided context information from Uber's 2021 "
                  'Annual Report on Form 10-K, I can see that Uber generated '
                  'total revenue of $17,455 million for the year ended '
                  'December 31, 2021. This revenue is a combination of '
                  'Mobility revenue, Delivery revenue, Freight revenue, and '
                  'All Other revenue. The exact breakdown of these revenues by '
                  'offering and geographical region is presented in the '
                  'context information 


llama_print_timings:        load time =     394.11 ms
llama_print_timings:      sample time =      37.53 ms /   106 runs   (    0.35 ms per token,  2824.26 tokens per second)
llama_print_timings: prompt eval time =    1392.78 ms /  1689 tokens (    0.82 ms per token,  1212.69 tokens per second)
llama_print_timings:        eval time =    2402.50 ms /   105 runs   (   22.88 ms per token,    43.70 tokens per second)
llama_print_timings:       total time =    4019.27 ms /  1794 tokens
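
The cells in this section rebuild the query engine on every call and rely on `%%time` for timing. Outside a notebook, the same measurement can be taken with a small wrapper, sketched below; `timed_query` and `fake_query` are hypothetical names, and the stand-in function replaces the real engine so the example runs without a model.

```python
import time
from typing import Callable, Tuple


def timed_query(query_fn: Callable[[str], object], question: str) -> Tuple[object, float]:
    """Run one query and return (response, elapsed_seconds)."""
    start = time.perf_counter()
    response = query_fn(question)
    return response, time.perf_counter() - start


# Stand-in so the sketch runs without a model; in the notebook you would pass
# query_engine.query, where query_engine = index.as_query_engine() is built once.
def fake_query(question: str) -> str:
    return f"(answer to: {question})"


response, seconds = timed_query(fake_query, "What was Uber's revenue?")
print(f"{response} took {seconds:.4f} s")
```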


In [14]:
%%time

response = index.as_query_engine().query("How much money did Lyft make in 2020?")
print (response)
print()
pprint(response, indent=4)

Llama.generate: prefix-match hit


 Based on the provided context information from the Lyft 2021 Annual Report on Form 10-K, I can see that Lyft reported revenue of $2,364,681 thousand for the year ended December 31, 2020. Therefore, Lyft made approximately $2,364,681 thousand in revenue during the year 2020.

Response(response=' Based on the provided context information from the Lyft '
                  '2021 Annual Report on Form 10-K, I can see that Lyft '
                  'reported revenue of $2,364,681 thousand for the year ended '
                  'December 31, 2020. Therefore, Lyft made approximately '
                  '$2,364,681 thousand in revenue during the year 2020.',
         source_nodes=[   NodeWithScore(node=TextNode(id_='e425582e-bf00-4ef2-a52f-8b83959a204c', embedding=None, metadata={'page_label': '58', 'file_name': 'lyft_2021.pdf', 'file_path': '../data/10k/lyft_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1440303, 'creation_date': '2024-01-23', 'last_modified_date': '2024-01-23', 'last


llama_print_timings:        load time =     394.11 ms
llama_print_timings:      sample time =      31.20 ms /    89 runs   (    0.35 ms per token,  2852.47 tokens per second)
llama_print_timings: prompt eval time =    1545.23 ms /  1963 tokens (    0.79 ms per token,  1270.36 tokens per second)
llama_print_timings:        eval time =    2077.26 ms /    88 runs   (   23.61 ms per token,    42.36 tokens per second)
llama_print_timings:       total time =    3830.44 ms /  2051 tokens


In [15]:
%%time

## The answer to this question doesn't exist in the Lyft_10k filing!
## Let's see what we get back
response = index.as_query_engine().query("What was Lyft's revenue for 2018?")
print (response)
print()
pprint(response, indent=4)

Llama.generate: prefix-match hit


 I cannot directly provide you with Lyft's revenue for 2018 from the given context information as it does not contain that specific data. The context only provides financial information for the years 2019 and 2021.

Response(response=" I cannot directly provide you with Lyft's revenue for 2018 "
                  'from the given context information as it does not contain '
                  'that specific data. The context only provides financial '
                  'information for the years 2019 and 2021.',
         source_nodes=[   NodeWithScore(node=TextNode(id_='ad41b16a-c797-47fd-b35c-5d0661d57b35', embedding=None, metadata={'page_label': '58', 'file_name': 'lyft_2021.pdf', 'file_path': '../data/10k/lyft_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1440303, 'creation_date': '2024-01-23', 'last_modified_date': '2024-01-23', 'last_accessed_date': '2024-01-29'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'la


llama_print_timings:        load time =     394.11 ms
llama_print_timings:      sample time =      19.74 ms /    53 runs   (    0.37 ms per token,  2685.31 tokens per second)
llama_print_timings: prompt eval time =    1037.72 ms /  1296 tokens (    0.80 ms per token,  1248.89 tokens per second)
llama_print_timings:        eval time =    1186.85 ms /    52 runs   (   22.82 ms per token,    43.81 tokens per second)
llama_print_timings:       total time =    2346.79 ms /  1348 tokens


In [16]:
%%time

response = index.as_query_engine().query("When did Uber go IPO?")
print (response)
print()
pprint(response, indent=4)

Llama.generate: prefix-match hit


 Uber went public through an initial public offering (IPO) on the date that is not provided in the given context information. However, it is mentioned that there was a proceeds from issuance of common stock upon initial public offering, net of offering costs, which amounted to $7,973 million in 2019. This information alone does not provide an exact date for the IPO.

Response(response=' Uber went public through an initial public offering (IPO) '
                  'on the date that is not provided in the given context '
                  'information. However, it is mentioned that there was a '
                  'proceeds from issuance of common stock upon initial public '
                  'offering, net of offering costs, which amounted to $7,973 '
                  'million in 2019. This information alone does not provide an '
                  'exact date for the IPO.',
         source_nodes=[   NodeWithScore(node=TextNode(id_='5a9c1dd0-44f6-4009-8427-df7e92d1ee0f', embedding=None, 


llama_print_timings:        load time =     394.11 ms
llama_print_timings:      sample time =      32.55 ms /    87 runs   (    0.37 ms per token,  2672.98 tokens per second)
llama_print_timings: prompt eval time =    1388.76 ms /  1710 tokens (    0.81 ms per token,  1231.32 tokens per second)
llama_print_timings:        eval time =    2010.66 ms /    86 runs   (   23.38 ms per token,    42.77 tokens per second)
llama_print_timings:       total time =    3592.48 ms /  1796 tokens


### We can even query data in tables!

Here is some table-based data from [Lyft's 10k document](data/10k/lyft_2021.pdf), page 82:

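The LLM can extract figures from such tables and answer. Still, check its answers against the source: in the output below it reports the same amount for both 2021 and 2019.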

![](../images/rag-3-pdf-table.png)

In [17]:
%%time

response = index.as_query_engine().query("What were the Stock-based compensation for Lyft?")
print (response)
print()
pprint(response, indent=4)

Llama.generate: prefix-match hit


 Based on the provided context information, I see that there are two instances where stock-based compensation is mentioned in each of the given consolidated financial statements for Lyft, Inc. The amounts for these compensation are as follows:

1. For the year ending December 31, 2021: $1,599,311
2. For the year ending December 31, 2019: $1,599,311

Therefore, the total stock-based compensation for Lyft during these two years was $3,198,622 ($1,599,311 * 2).

Response(response=' Based on the provided context information, I see that '
                  'there are two instances where stock-based compensation is '
                  'mentioned in each of the given consolidated financial '
                  'statements for Lyft, Inc. The amounts for these '
                  'compensation are as follows:\n'
                  '\n'
                  '1. For the year ending December 31, 2021: $1,599,311\n'
                  '2. For the year ending December 31, 2019: $1,599,311\n'
             


llama_print_timings:        load time =     394.11 ms
llama_print_timings:      sample time =      52.00 ms /   146 runs   (    0.36 ms per token,  2807.91 tokens per second)
llama_print_timings: prompt eval time =    1333.75 ms /  1634 tokens (    0.82 ms per token,  1225.11 tokens per second)
llama_print_timings:        eval time =    3366.63 ms /   145 runs   (   23.22 ms per token,    43.07 tokens per second)
llama_print_timings:       total time =    5131.95 ms /  1779 tokens
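
To verify an answer like this one, it helps to see which pages the retrieved chunks came from. The sketch below summarizes the `source_nodes` of a response, assuming each entry exposes `.node.metadata` and `.score` as in the `NodeWithScore` output printed above. `summarize_sources` is a hypothetical helper, and the stand-in object and its score are placeholders.

```python
from types import SimpleNamespace
from typing import Iterable, List


def summarize_sources(source_nodes: Iterable) -> List[str]:
    """One line per retrieved chunk: file, page and similarity score."""
    lines = []
    for item in source_nodes:
        meta = item.node.metadata
        # score can be None for some retrievers, so format it defensively
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"{meta.get('file_name', '?')} p.{meta.get('page_label', '?')} score={score}")
    return lines


# Stand-in shaped like a NodeWithScore entry; the score is a made-up placeholder.
# In the notebook you would call summarize_sources(response.source_nodes).
fake_nodes = [
    SimpleNamespace(
        node=SimpleNamespace(metadata={"file_name": "lyft_2021.pdf", "page_label": "58"}),
        score=0.8123,
    ),
]
print("\n".join(summarize_sources(fake_nodes)))
```

Comparing the listed pages with the PDF shows whether the model actually saw the table it is answering from.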
