#  RAG 10k Query Using Local LLMs

## CPU vs GPU

### To ensure only CPU is used, use these settings

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```

You should then see this output:

```text
using CUDA/GPU:  False
```

When initializing the LLM:

- set `"n_gpu_layers": 0`

While the model is running:
- You should not see any mention of CUDA in the output.
- You should see CPU usage spike (it will use most of the available CPU cores).
- If you use a tool like `nvidia-smi`, you will notice that GPU memory is not being used by the Python process.
- You will typically see `eval time =   ....   (  195.39 ms per token,     5.12 tokens per second)`. This is pretty slow.


### To use GPU 

Make sure the following line is commented out:

```python
# os.environ["CUDA_VISIBLE_DEVICES"]=""
```

You should then see this output:

```text
using CUDA/GPU:  True
device  0 NVIDIA GeForce RTX 2070
```

When initializing the LLM:

- set `"n_gpu_layers": -1`. This offloads all layers to the GPU. You can instead specify a number (1, 10, 20, ...) that your GPU can support. Higher numbers require more GPU memory.
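As a minimal sketch, the choice between the CPU and GPU settings can be derived from the CUDA check used in this notebook. The helper name `pick_model_kwargs` and its parameters are hypothetical and are not part of LlamaIndex or llama.cpp. Only the `n_gpu_layers` key comes from the text above.

```python
def pick_model_kwargs(cuda_available: bool, gpu_layers: int = -1) -> dict:
    """Build the model_kwargs dict for the local LLM (hypothetical helper).

    cuda_available: for example, the result of torch.cuda.is_available()
    gpu_layers: layers to offload when a GPU is present; -1 means all layers
    """
    if not cuda_available:
        # CPU only: keep every layer on the CPU
        return {"n_gpu_layers": 0}
    return {"n_gpu_layers": gpu_layers}


print(pick_model_kwargs(False))
print(pick_model_kwargs(True))
print(pick_model_kwargs(True, gpu_layers=20))
```

The returned dict is meant to be passed as `model_kwargs` when the LLM is created, so the same notebook can run on machines with and without a GPU.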

While the model is running:
- You should see CUDA being used.
- You should not see much CPU usage (the GPU is doing the heavy lifting now).
- If you use a tool like `nvidia-smi`, you will notice that GPU memory is being used by the Python process (a full offload can take about 6 GB of GPU memory).
- You will typically see `eval time =   ....   (   20.73 ms per token,    48.23 tokens per second)`. This is much faster than before!
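To put the two `eval time` figures side by side, here is a small arithmetic sketch. It uses only the per-token times quoted above. The 256-token answer length is an illustrative assumption (it matches the `max_new_tokens` value used later in this notebook), and the function name is hypothetical.

```python
# Per-token generation times quoted above, in milliseconds
cpu_ms_per_token = 195.39
gpu_ms_per_token = 20.73


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency into throughput."""
    return 1000.0 / ms_per_token


speedup = cpu_ms_per_token / gpu_ms_per_token

# Assumed answer length, used only to make the difference concrete
answer_tokens = 256
cpu_seconds = answer_tokens * cpu_ms_per_token / 1000.0
gpu_seconds = answer_tokens * gpu_ms_per_token / 1000.0

print(f"CPU: {tokens_per_second(cpu_ms_per_token):.2f} tokens/s")
print(f"GPU: {tokens_per_second(gpu_ms_per_token):.2f} tokens/s")
print(f"Speedup: {speedup:.1f}x")
print(f"{answer_tokens}-token answer: {cpu_seconds:.1f} s on CPU, {gpu_seconds:.1f} s on GPU")
```

With these numbers the GPU is roughly nine times faster at generating tokens. Your own figures will depend on the hardware and on how many layers are offloaded.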

In [1]:
## Check if GPU is enabled
import os
import torch

## To disable GPU and experiment, uncomment the following line
## Normally, you would want to use GPU, if one is available.
# os.environ["CUDA_VISIBLE_DEVICES"]=""

print ("using CUDA/GPU: ", torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
   print("device ", i , torch.cuda.get_device_properties(i).name)

using CUDA/GPU:  True
device  0 NVIDIA GeForce RTX 2070


In [2]:
## Set up logging. To see more logging, set the level to DEBUG

import sys
import logging

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.basicConfig(stream=sys.stdout, level=logging.WARNING)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [3]:
import os, sys

this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append (os.path.abspath (parent_dir))

## Step-1: Load Settings

In [4]:
import os,sys
## Load Settings from .env file
from dotenv import find_dotenv, dotenv_values

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# debug
# print (config)

ATLAS_URI = config.get('ATLAS_URI')

if not ATLAS_URI:
    raise Exception ("'ATLAS_URI' is not set.  Please set it in the .env file to continue...")

## Clear the OpenAI key to be sure it is not used
os.environ['OPENAI_API_KEY'] = ''
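If more settings are added to the `.env` file later, the check above can be generalized. The following is a minimal sketch, assuming the config is a plain dict like the one returned by `dotenv_values`. The helper name `require_settings` and the placeholder URI are hypothetical.

```python
def require_settings(config: dict, keys: list) -> dict:
    """Return the requested settings, or raise if any is missing or empty."""
    missing = [key for key in keys if not config.get(key)]
    if missing:
        raise ValueError(f"Missing settings in .env: {', '.join(missing)}")
    return {key: config[key] for key in keys}


# Stand-in for dotenv_values(find_dotenv()); the URI below is a placeholder
example_config = {"ATLAS_URI": "mongodb+srv://user:password@example.mongodb.net"}
settings = require_settings(example_config, ["ATLAS_URI"])
print(sorted(settings))
```

Reporting every missing key at once saves a round trip when several values are absent.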

In [5]:
DB_NAME = 'rag1'
COLLECTION_NAME = '10k'
INDEX_NAME = 'idx_embedding'

In [6]:
import os
## LlamaIndex will download embeddings models as needed.
## Set llamaindex cache dir to ./cache dir here (Default is system tmp)
## This way, we can easily see downloaded artifacts
os.environ['LLAMA_INDEX_CACHE_DIR'] = os.path.join(os.path.abspath(''), '..', 'llama-index-cache')

In [7]:
from pymongo import MongoClient

mongodb_client = MongoClient(ATLAS_URI)

print ("Atlas client initialized")

Atlas client initialized


## Step-2 : Setup Embeddings

The default embedding is OpenAI. We can always plug in custom embeddings.

### 2.1 : OpenAI Embeddings

This uses the OpenAI embedding model.
You will need an API key (defined in the env variable `OPENAI_API_KEY`).

In [8]:
# from llama_index import  OpenAIEmbedding
# embed_model = OpenAIEmbedding()

### 2.2 : Using Custom Embeddings

Remember: this embedding model must be the same one used in the `populate` step.
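The reason is that query vectors and stored vectors must come from the same embedding space to be comparable. Different models often do not even agree on vector size: `BAAI/bge-small-en-v1.5` produces 384-dimensional vectors, while OpenAI's `text-embedding-ada-002` produces 1536-dimensional ones. The sketch below shows a simple guard against that mismatch. The function name and the dummy vectors are hypothetical, and matching sizes alone do not prove that two models are compatible.

```python
def check_embedding_dim(query_vector: list, stored_vector: list) -> None:
    """Raise if a query embedding cannot be compared with a stored one."""
    if len(query_vector) != len(stored_vector):
        raise ValueError(
            f"Embedding size mismatch: query has {len(query_vector)} values, "
            f"stored document has {len(stored_vector)}"
        )


# Dummy vectors standing in for real embeddings
stored = [0.0] * 384             # e.g. written by the populate step
same_model_query = [0.0] * 384   # query embedded with the same model
other_model_query = [0.0] * 1536  # query embedded with a different model

check_embedding_dim(same_model_query, stored)  # passes silently
try:
    check_embedding_dim(other_model_query, stored)
except ValueError as err:
    print(err)
```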

In [9]:
from llama_index.embeddings import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")



## Initialize Local LLM

We will use an LLM running locally.

In [10]:

from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)
model_file_path = '../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_S.gguf'

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    # model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=model_file_path,
    temperature=0.1,
    max_new_tokens=256,
    # this value was sized for Llama 2's 4096-token context window, with some wiggle room;
    # the Mistral model loaded here reports a 32768-token context length in its metadata
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # n_gpu_layers: 0 for no GPU, at least 1 to use the GPU, -1 to offload all layers
    # try values such as 1, 10, 20, 30, 40
    # for an NVIDIA GeForce RTX 2070 with 8 GB of memory, 40 works well
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ../models/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32            

In [11]:
from llama_index import  ServiceContext

# The LLM used to generate natural language responses to queries.
# If not provided, defaults to gpt-3.5-turbo from OpenAI
# If your OpenAI key is not set, defaults to llama2-chat-13B from Llama.cpp

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)

## Step-3: Connect LlamaIndex and MongoDB Atlas

Let's define MongoDB Atlas as our vector store. It holds the indexed data that we will query.

In [12]:
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from llama_index.storage.storage_context import StorageContext
from llama_index.indices.vector_store.base import VectorStoreIndex


vector_store = MongoDBAtlasVectorSearch(mongodb_client = mongodb_client,
                                 db_name = DB_NAME, collection_name = COLLECTION_NAME,
                                 index_name = INDEX_NAME,
                                 ## the following fields are set to default values
                                 # embedding_key = 'embedding', text_key = 'text', metadata_key = 'metadata',
                                 )
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)

## Step-4: Query Data / Ask Questions

Now that we have everything set up, let's ask some questions.

In [13]:
from IPython.display import Markdown
from pprint import pprint

response = index.as_query_engine().query("What was Uber's revenue?")
print (response)
print()
# display(Markdown(f"<b>{response}</b>"))
pprint(response, indent=4)

 According to the provided context information from Uber's 2021 Annual Report on Form 10-K, Uber's total revenue for the years ended December 31, 2019, 2020, and 2021 were $13,000 million, $11,139 million, and $17,455 million, respectively.

Response(response=" According to the provided context information from Uber's "
                  "2021 Annual Report on Form 10-K, Uber's total revenue for "
                  'the years ended December 31, 2019, 2020, and 2021 were '
                  '$13,000 million, $11,139 million, and $17,455 million, '
                  'respectively.',
         source_nodes=[   NodeWithScore(node=TextNode(id_='97e43c19-36ff-4295-a7e1-cee0e271566e', embedding=None, metadata={'page_label': '54', 'file_name': 'uber_2021.pdf', 'file_path': '../data/10k/uber_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1880483, 'creation_date': '2024-01-23', 'last_modified_date': '2024-01-23', 'last_accessed_date': '2024-01-29'}, excluded_embed_metadata_keys=['file_na


llama_print_timings:        load time =     395.50 ms
llama_print_timings:      sample time =      31.84 ms /    91 runs   (    0.35 ms per token,  2857.86 tokens per second)
llama_print_timings: prompt eval time =    1449.86 ms /  1689 tokens (    0.86 ms per token,  1164.94 tokens per second)
llama_print_timings:        eval time =    1900.27 ms /    90 runs   (   21.11 ms per token,    47.36 tokens per second)
llama_print_timings:       total time =    3541.24 ms /  1779 tokens
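The pretty-printed `Response` is hard to read, so a compact summary of where the answer came from can help. The sketch below relies on the `source_nodes`, `node`, and `metadata` attributes visible in the output above, and assumes each `NodeWithScore` also exposes a `score` attribute. The helper name and the stand-in object (including its score value) are hypothetical, so the code runs without a live index.

```python
from types import SimpleNamespace


def summarize_sources(response) -> list:
    """List (file_name, page_label, score) for each retrieved chunk."""
    rows = []
    for item in response.source_nodes:
        meta = item.node.metadata
        rows.append((meta.get("file_name"), meta.get("page_label"), item.score))
    return rows


# Stand-in object so the sketch runs without a live index; the score is made up
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            node=SimpleNamespace(
                metadata={"file_name": "uber_2021.pdf", "page_label": "54"}
            ),
            score=0.5,
        )
    ]
)

for file_name, page, score in summarize_sources(fake_response):
    print(f"{file_name} p.{page} score={score}")
```

In the notebook, you would pass the real `response` returned by the query engine instead of `fake_response`.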


In [14]:
response = index.as_query_engine().query("How much money Lyft made in 2020?")
print (response)
print()
pprint(response, indent=4)

Llama.generate: prefix-match hit


 According to the provided financial statements, Lyft had revenue of $2,364,681 thousand in the year 2020. This figure represents the total amount of money that Lyft earned through its business operations during that year.

Response(response=' According to the provided financial statements, Lyft had '
                  'revenue of $2,364,681 thousand in the year 2020. This '
                  'figure represents the total amount of money that Lyft '
                  'earned through its business operations during that year.',
         source_nodes=[   NodeWithScore(node=TextNode(id_='b865ebda-59d0-49b2-b9ea-427745b89382', embedding=None, metadata={'page_label': '79', 'file_name': 'lyft_2021.pdf', 'file_path': '../data/10k/lyft_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1440303, 'creation_date': '2024-01-23', 'last_modified_date': '2024-01-23', 'last_accessed_date': '2024-01-29'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_mod


llama_print_timings:        load time =     395.50 ms
llama_print_timings:      sample time =      19.27 ms /    53 runs   (    0.36 ms per token,  2751.10 tokens per second)
llama_print_timings: prompt eval time =    1647.30 ms /  1963 tokens (    0.84 ms per token,  1191.65 tokens per second)
llama_print_timings:        eval time =    1140.49 ms /    52 runs   (   21.93 ms per token,    45.59 tokens per second)
llama_print_timings:       total time =    2906.84 ms /  2015 tokens


In [15]:
## The answer to this question doesn't exist in the Lyft 10-K filing!
## Let's see what we get back
response = index.as_query_engine().query("How much money Lyft made in 2018?")
print (response)
print()
pprint(response, indent=4)

Llama.generate: prefix-match hit


 Based on the provided context information, I cannot directly determine how much money Lyft made in 2018 as the financial statements only provide data for the years 2019 and 2021.

Response(response=' Based on the provided context information, I cannot '
                  'directly determine how much money Lyft made in 2018 as the '
                  'financial statements only provide data for the years 2019 '
                  'and 2021.',
         source_nodes=[   NodeWithScore(node=TextNode(id_='b865ebda-59d0-49b2-b9ea-427745b89382', embedding=None, metadata={'page_label': '79', 'file_name': 'lyft_2021.pdf', 'file_path': '../data/10k/lyft_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1440303, 'creation_date': '2024-01-23', 'last_modified_date': '2024-01-23', 'last_accessed_date': '2024-01-29'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_typ


llama_print_timings:        load time =     395.50 ms
llama_print_timings:      sample time =      16.54 ms /    46 runs   (    0.36 ms per token,  2781.30 tokens per second)
llama_print_timings: prompt eval time =     580.76 ms /   561 tokens (    1.04 ms per token,   965.97 tokens per second)
llama_print_timings:        eval time =     940.91 ms /    45 runs   (   20.91 ms per token,    47.83 tokens per second)
llama_print_timings:       total time =    1621.39 ms /   606 tokens


In [16]:
response = index.as_query_engine().query("When did Uber do IPO?")
print (response)
print()
pprint(response, indent=4)

Llama.generate: prefix-match hit


 Uber completed its Initial Public Offering (IPO) on May 14, 2019.

Response(response=' Uber completed its Initial Public Offering (IPO) on May '
                  '14, 2019.',
         source_nodes=[   NodeWithScore(node=TextNode(id_='c9dcd649-9044-4b74-b410-fb487a83651a', embedding=None, metadata={'page_label': '119', 'file_name': 'uber_2021.pdf', 'file_path': '../data/10k/uber_2021.pdf', 'file_type': 'application/pdf', 'file_size': 1880483, 'creation_date': '2024-01-23', 'last_modified_date': '2024-01-23', 'last_accessed_date': '2024-01-29'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='df5cfc7a-157d-4995-bfe0-6ab6d4116a5e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '119', 'file_na


llama_print_timings:        load time =     395.50 ms
llama_print_timings:      sample time =       9.64 ms /    25 runs   (    0.39 ms per token,  2592.29 tokens per second)
llama_print_timings: prompt eval time =    2094.70 ms /  2366 tokens (    0.89 ms per token,  1129.52 tokens per second)
llama_print_timings:        eval time =     544.27 ms /    24 runs   (   22.68 ms per token,    44.10 tokens per second)
llama_print_timings:       total time =    2705.27 ms /  2390 tokens
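To compare runs without reading the logs by eye, the `llama_print_timings` lines can be parsed. This is a rough sketch that assumes the log format shown above stays the same. The regular expression and the `parse_timings` helper are hypothetical, and the sample text is copied from the first query's output.

```python
import re

# Matches e.g. "llama_print_timings: eval time = 1900.27 ms / 90 runs (...)"
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+(?P<ms>[\d.]+) ms"
    r"(?:\s+/\s+(?P<count>\d+)\s+(?:runs|tokens))?"
)


def parse_timings(log_text: str) -> dict:
    """Map each timing name to (milliseconds, count or None)."""
    timings = {}
    for match in TIMING_RE.finditer(log_text):
        count = match.group("count")
        timings[match.group("name")] = (
            float(match.group("ms")),
            int(count) if count else None,
        )
    return timings


sample_log = """
llama_print_timings:        load time =     395.50 ms
llama_print_timings:      sample time =      31.84 ms /    91 runs   (    0.35 ms per token,  2857.86 tokens per second)
llama_print_timings: prompt eval time =    1449.86 ms /  1689 tokens (    0.86 ms per token,  1164.94 tokens per second)
llama_print_timings:        eval time =    1900.27 ms /    90 runs   (   21.11 ms per token,    47.36 tokens per second)
llama_print_timings:       total time =    3541.24 ms /  1779 tokens
"""

timings = parse_timings(sample_log)
eval_ms, eval_runs = timings["eval"]
print(f"generation: {eval_runs / (eval_ms / 1000):.2f} tokens per second")
```

The `eval` entry is the one to watch when tuning `n_gpu_layers`, since it measures token generation. The `prompt eval` entry reflects how long the retrieved context took to process.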
