

---



## **Notes to run this notebook**
1. If running in Colab, use a T4 GPU runtime with high RAM. The notebook also runs without high RAM, but it will crash after some use because the reranker runs on the CPU rather than the GPU.
2. `HF_TOKEN` is the Hugging Face access token; it is stored in the notebook's secrets and loaded from there.
3. If the serialized FAISS vectorstore file "vecstore_full.pkl" fails to download or is corrupted, use the downloaded processed document files "documents_full_sdk.pkl", "documents_full_blogs.pkl" and "documents_full_forums.pkl" instead. Then split the documents, embed the splits with the sentence transformer, and create the vectorstore.
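
The fallback described in note 3 can be sketched as below. This is a minimal sketch under stated assumptions: `load_pickle_or_none` is a hypothetical helper that is not part of this notebook, and it simply treats a missing, empty, or unreadable pickle file as "not available". The temporary files stand in for the real downloads.

```python
import os
import pickle
import tempfile

def load_pickle_or_none(path):
    """Return the unpickled object, or None if the file is missing or unreadable."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (pickle.UnpicklingError, EOFError, AttributeError, ImportError, IndexError):
        # A truncated or corrupted download typically fails here.
        return None

# Self-contained demonstration with temporary stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pkl")
    bad = os.path.join(tmp, "bad.pkl")
    with open(good, "wb") as f:
        pickle.dump({"ok": True}, f)
    with open(bad, "wb") as f:
        f.write(b"\x00corrupted")
    loaded_good = load_pickle_or_none(good)
    loaded_bad = load_pickle_or_none(bad)
    loaded_missing = load_pickle_or_none(os.path.join(tmp, "missing.pkl"))

print(loaded_good, loaded_bad, loaded_missing)
```

In the notebook, a `None` result for "vecstore_full.pkl" would be the signal to load the three document files and rebuild the vectorstore.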



---



## Setup LLM chain

### Installing Libraries

In [None]:
!pip install accelerate transformers tokenizers
!pip install bitsandbytes einops
!pip install xformers
!pip install langchain
!pip install faiss-gpu
!pip install sentence_transformers
!pip install nest_asyncio
!pip install rank_bm25
!pip install flashrank
!pip install gdown



### Load the model

In [None]:
from torch import cuda, bfloat16, float16
import transformers
from google.colab import userdata


model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, you need an access token
hf_auth = userdata.get('HF_TOKEN')
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    token=hf_auth,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto'
)

# enable evaluation mode to allow model inference
model.eval()

print(f"Model loaded on {device}")

Model loaded on cuda:0
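
As a rough illustration of why the 4-bit configuration above helps on a 16 GB T4, the arithmetic below estimates the memory needed for the weights alone. It is a back-of-the-envelope sketch: the nominal 7 billion parameter count is an assumption, and the estimate ignores activations, the KV cache, and quantization metadata.

```python
# Rough weight-memory estimate for the model weights only.
NUM_PARAMS = 7e9  # assumption: nominal "7B" parameter count

def weight_gib(num_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

fp16_gib = weight_gib(NUM_PARAMS, 16)
nf4_gib = weight_gib(NUM_PARAMS, 4)
print(f"16-bit: {fp16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which leaves room on the GPU for the embedding model and generation.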


### Load the model tokenizer

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    token=hf_auth
)

### Define the stopping criteria for the llm

In [None]:
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

[[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]

In [None]:
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]

In [None]:
from transformers import StoppingCriteria, StoppingCriteriaList

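# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the generated sequence ends with one of the stop sequences.
        # Note: the stop ids printed above start with the BOS token (id 1) that the
        # tokenizer prepends, so a match is unlikely in practice; tokenizing the stop
        # strings with add_special_tokens=False avoids the BOS token.
        for stop_ids in stop_token_ids:
            tail = input_ids[0][-len(stop_ids):]
            # guard against sequences shorter than the stop sequence
            if tail.shape == stop_ids.shape and torch.eq(tail, stop_ids).all():
                return True
        return False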

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
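
The suffix comparison performed by `StopOnTokens` can be sketched in plain Python lists. This is an illustrative sketch, not the notebook's tensor code: `ends_with` is a hypothetical helper, the `generated` ids are made up, and the trimmed stop sequence (BOS and the `29871` prefix token removed) is an assumption about what the model would actually emit, which should be checked against the tokenizer.

```python
def ends_with(sequence, stop_ids):
    """True if `sequence` ends with `stop_ids` (the check StopOnTokens does on tensors)."""
    if len(stop_ids) == 0 or len(sequence) < len(stop_ids):
        return False
    return sequence[-len(stop_ids):] == stop_ids

# Stop ids as printed above, including the leading BOS token (id 1).
stop_with_bos = [1, 29871, 13, 29950, 7889, 29901]
# Assumption: the same ids with the BOS token and the 29871 prefix token dropped.
stop_trimmed = stop_with_bos[2:]

# Hypothetical generated ids: BOS, two arbitrary tokens, then the stop phrase.
generated = [1, 450, 1234, 13, 29950, 7889, 29901]

match_with_bos = ends_with(generated, stop_with_bos)
match_trimmed = ends_with(generated, stop_trimmed)
print(match_with_bos, match_trimmed)
```

In this made-up example the BOS-prefixed sequence does not match the tail of the generated ids, while the trimmed one does, which is why it is worth inspecting the stop ids before relying on them.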

### Define the Hugging Face pipeline chain with the model and tokenizer

In [None]:

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model rambles during chat
    max_new_tokens=1024,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

### Initialize Hugging Face pipeline chain

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

# checking again that everything is working fine
llm(prompt="What is the NVIDIA CUDA Toolkit?")


'\n październik 12, 2022 No Comments by admin\nNVIDIA CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on graphics processing units (GPUs). The CUDA Toolkit is a collection of software development tools and libraries that allow developers to use the GPU to accelerate their applications. It provides a set of programming APIs, sample codes, and documentation to help developers take advantage of the massively parallel processing capabilities of NVIDIA GPUs.\nThe CUDA Toolkit includes several components:\nCUDA Runtime: This is the runtime environment that manages the communication between the host CPU and the GPU. It provides a set of functions for launching kernel code on the GPU, handling data transfer between the GPU and CPU, and managing the memory hierarchy.\nCUDA Samples: These are a set of pre-built examples that demonstrate how to use various CUDA features, such as thread blocking, sh

### Load embeddings

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)



---



## Vectorization of documents   
*(Load the document files "documents_full_sdk.pkl", "documents_full_blogs.pkl" and "documents_full_forums.pkl" attached in the zip if the vectorstore file "vecstore_full.pkl" is not available)*

### Load scraped documents

#### Download the shared scraped document files "documents_full_sdk.pkl", "documents_full_blogs.pkl" and "documents_full_forums.pkl"

In [None]:
!gdown --id 1Iin64B6MwA34APl7Rt_pKFHPZC7qLreD
!gdown --id 1PmNWDXsKV7aaA2ybmxUdPsAyaCSZQcfP
!gdown --id 1v-H54VVZhwSighkVv3uaT8ki0SRFBhhl

Downloading...
From (original): https://drive.google.com/uc?id=1Iin64B6MwA34APl7Rt_pKFHPZC7qLreD
From (redirected): https://drive.google.com/uc?id=1Iin64B6MwA34APl7Rt_pKFHPZC7qLreD&confirm=t&uuid=c426a3e5-ce11-445c-aa8c-66210490ff5c
To: /content/documents_full_sdk.pkl
100% 243M/243M [00:01<00:00, 222MB/s]
Downloading...
From: https://drive.google.com/uc?id=1PmNWDXsKV7aaA2ybmxUdPsAyaCSZQcfP
To: /content/documents_full_forums.pkl
100% 26.5k/26.5k [00:00<00:00, 75.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1v-H54VVZhwSighkVv3uaT8ki0SRFBhhl
To: /content/documents_full_blogs.pkl
100% 4.30M/4.30M [00:00<00:00, 275MB/s]


In [None]:
import pickle

sdk_data_path = "documents_full_sdk.pkl"
with open(sdk_data_path, 'rb') as file:
    documents_sdk = pickle.load(file)

blogs_data_path = "documents_full_blogs.pkl"
with open(blogs_data_path, 'rb') as file:
    documents_blogs = pickle.load(file)

forums_data_path = "documents_full_forums.pkl"
with open(forums_data_path, 'rb') as file:
    documents_forums = pickle.load(file)

documents = documents_sdk + documents_blogs + documents_forums

In [None]:
print(documents[0])
print(len(documents))

9905




---



### Create document splits

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
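
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified fixed-width splitter. It is only a sketch of the overlap idea: `split_with_overlap` is a hypothetical helper, and it does not reproduce how `RecursiveCharacterTextSplitter` chooses split points on separators.

```python
def split_with_overlap(text, chunk_size=2000, chunk_overlap=20):
    """Split text into fixed-width chunks; each chunk starts with the last
    `chunk_overlap` characters of the previous chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks

# Small numbers so the overlap is easy to see.
sample = "".join(str(i % 10) for i in range(50))
chunks = split_with_overlap(sample, chunk_size=20, chunk_overlap=5)
for chunk in chunks:
    print(chunk)
```

With a 2000-character chunk and a 20-character overlap, as used above, consecutive chunks share only a short tail, so most of each chunk is new text.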



---



## Vectorstore
*(Load the vectorstore file "vecstore_full.pkl" attached in the zip)*

### FAISS

#### Download shared vectorstore file "vecstore_full.pkl"

In [None]:
!gdown --id 1fD7l9k7bHksE9htb7qbOMkzJahjJrK9B

Downloading...
From (original): https://drive.google.com/uc?id=1fD7l9k7bHksE9htb7qbOMkzJahjJrK9B
From (redirected): https://drive.google.com/uc?id=1fD7l9k7bHksE9htb7qbOMkzJahjJrK9B&confirm=t&uuid=e08742e4-326e-4ba0-b4e7-dadac7e1248f
To: /content/vecstore_full.pkl
100% 493M/493M [00:04<00:00, 117MB/s]


#### Read from vectorstore

In [None]:
import faiss
import json
import os
import pickle
from langchain.vectorstores import FAISS

# Uncomment if loading from Google Drive
# vecstore_data_path = "/content/drive/MyDrive/datasets/llm/vecstore_full.pkl"
vecstore_data_path = "vecstore_full.pkl"
if os.path.exists(vecstore_data_path):
    with open(vecstore_data_path, 'rb') as f:
        vecstore_data = pickle.load(f)
    vectorstore = FAISS.deserialize_from_bytes(embeddings=embeddings, serialized=vecstore_data)
else:
    # Without the file, `vectorstore` stays undefined and later cells will fail.
    print(f"{vecstore_data_path} not found; build the vectorstore from the document splits instead.")


#### **[DO NOT RUN UNLESS NEEDED]** Write to vectorstore and Save

In [None]:
# storing embeddings in the vector store
# vectorstore = FAISS.from_documents(all_splits, embeddings)

In [None]:
# Write the vector store to disk
# vecstore_data_path = "/content/drive/MyDrive/datasets/llm/vecstore_full_edited.pkl"
# with open(vecstore_data_path, 'wb') as f:
#     # serialize_to_bytes() returns the whole vector store as a bytes object, which is then pickled
#     pickle.dump(vectorstore.serialize_to_bytes(), f)

## Rerank

In [None]:
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain.retrievers import ContextualCompressionRetriever

compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3}),
)

Downloading ms-marco-MultiBERT-L-12...


ms-marco-MultiBERT-L-12.zip: 100%|██████████| 98.7M/98.7M [00:00<00:00, 136MiB/s]
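
The base retriever above uses `search_type="mmr"` (maximal marginal relevance), which trades relevance to the query against redundancy among the documents already picked. The following is a simplified greedy sketch of that idea in plain Python. It is not LangChain's or FAISS's implementation, `mmr_select` and `cosine` are hypothetical helpers, the `lambda_mult=0.5` weighting is an assumption, and the vectors are made up. It also does not cover the FlashRank reranking step that runs afterwards.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedy MMR: repeatedly pick the document that maximizes
    lambda * sim(query, doc) - (1 - lambda) * max sim(doc, already selected)."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
# Documents 0 and 1 are duplicates; document 2 is equally relevant but different.
docs = [[4.0, 3.0], [4.0, 3.0], [4.0, -3.0]]
picked = mmr_select(query, docs, k=2)
print(picked)
```

A plain similarity ranking would return the two duplicates here; the MMR sketch skips the second duplicate in favor of the document that adds new information.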


### Test code for Rerank compression retriever

In [None]:
question = "How can I install the NVIDIA CUDA Toolkit on windows?"
docs = compression_retriever.get_relevant_documents(query=question)
print(f"length - {len(docs)}")
print(f"{docs}")
torch.cuda.empty_cache()

length - 3
[Document(page_content='CUDA Installation Guide for Microsoft Windows\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n1. Introduction\n1.1. System Requirements\n1.2. About This Document\n\n\n2. Installing CUDA Development Tools\n2.1. Verify You Have a CUDA-Capable GPU\n2.2. Download the NVIDIA CUDA Toolkit\n2.3. Install the CUDA Software\n2.3.1. Uninstalling the CUDA Software\n\n\n2.4. Using Conda to Install the CUDA Software\n2.4.1. Conda Overview\n2.4.2. Installation\n2.4.3. Uninstallation\n2.4.4. Installing Previous CUDA Releases\n\n\n2.5. Use a Suitable Driver Model\n2.6. Verify the Installation\n2.6.1. Running the Compiled Examples\n\n\n\n\n3. Pip Wheels\n4. Compiling CUDA Programs\n4.1. Compiling Sample Projects\n4.2. Sample Projects\n4.3. Build Customizations for New Projects\n4.4. Build Customizations for Existing Projects\n\n\n5. Additional Considerations\n6. Notices\n6.1. Notice\n6.2. OpenCL\n6.3. Trademarks\n\n\n\n\n\n\n\n\nInstallation 



---



## RetrievalQA chain

In [None]:
from langchain.prompts import PromptTemplate

custom_template = """
[INST]<<SYS>>
You are a helpful, respectful and honest assistant. Always answer as accurately as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.
If you don't know the answer to a question, please don't share false information.
If there is no context, just say you cannot answer the question and politely apologize.
<</SYS>>

CONTEXT:\n\n {context}\n

Question: {question}[/INST]
"""
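
To show how the `{context}` and `{question}` placeholders get filled, here is a small sketch using plain `str.format`. The template is a shortened stand-in for the one above, the retrieved chunks are made up, and joining them with a blank line is an assumption about how the chain concatenates documents.

```python
# Shortened stand-in for the notebook's template (system prompt abbreviated).
template = (
    "[INST]<<SYS>>\n"
    "Answer using only the context provided.\n"
    "<</SYS>>\n\n"
    "CONTEXT:\n\n{context}\n\n"
    "Question: {question}[/INST]"
)

# Hypothetical retrieved chunks; in the chain these come from the retriever.
retrieved_chunks = ["Chunk one text.", "Chunk two text."]

filled = template.format(
    context="\n\n".join(retrieved_chunks),  # assumption: chunks joined by a blank line
    question="What is the NVIDIA CUDA Toolkit?",
)
print(filled)
```

Printing the filled prompt like this is a quick way to confirm that newlines and the `[INST]` markers land where intended before sending it to the model.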

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(input_variables=["question", "context"], template=custom_template)

chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)



---



## Questions and Answers

### Helper wrappers to provide clean responses

In [None]:
import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    # Print the wrapped answer, then the retrieved source documents and their URLs
    print(wrap_text_preserve_newlines(llm_response['result']))
    if "source_documents" in llm_response:
        print(f'\n\nSources: {llm_response["source_documents"]}')
        for source in llm_response["source_documents"]:
            if 'source' in source.metadata:
                print(source.metadata['source'])
    # Free cached GPU memory between queries
    torch.cuda.empty_cache()
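
Since the helper above prints every source document in full, a compact list of unique sources can be easier to read. The sketch below shows one way to collect them. `FakeDocument` is a hypothetical stand-in for the retrieved document objects (only the fields used here), `unique_sources` is not part of the notebook, and the URLs are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def unique_sources(llm_response):
    """Return de-duplicated `source` metadata values in first-seen order."""
    seen = []
    for doc in llm_response.get("source_documents", []):
        source = doc.metadata.get("source")
        if source is not None and source not in seen:
            seen.append(source)
    return seen

# Made-up response in the same shape as the chain's output dictionary.
response = {
    "result": "example answer",
    "source_documents": [
        FakeDocument("a", {"source": "https://example.com/a"}),
        FakeDocument("b", {"source": "https://example.com/a"}),
        FakeDocument("c", {}),
        FakeDocument("d", {"source": "https://example.com/b"}),
    ],
}
sources = unique_sources(response)
print(sources)
```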

### Example Queries

In [None]:
query = "What is the NVIDIA CUDA Toolkit?"
result = chain(query)
process_llm_response(result)


The NVIDIA CUDA Toolkit is a development environment for creating high-performance, GPU-accelerated
applications. It provides a range of tools and libraries for developing, optimizing, and deploying
applications on various hardware configurations, including desktop workstations, enterprise data centers,
cloud-based platforms, and supercomputers. The toolkit includes GPU-accelerated libraries, debugging and
optimization tools, a C/C++ compiler, and a runtime library.


Sources: [Document(page_content='CUDA Toolkit - Free Tools and Training | NVIDIA Developer\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCUDA Toolkit\nThe NVIDIA® CUDA® Toolkit provides a development environment for creating high-performance, GPU-accelerated applications. With it, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and supercomputers. The toolkit inc

In [None]:
query = "How can I install the NVIDIA CUDA Toolkit on windows?"
result = chain(query)
process_llm_response(result)

The NVIDIA CUDA Toolkit can be installed on Windows using the following steps:

1. Download the NVIDIA CUDA Toolkit from the official website: <https://developer.nvidia.com/cuda-downloads>
2. Verify the download by comparing the MD5 checksum with the one provided on the website.
3. Once the download is complete, run the installation executable (e.g., `cudaminer.exe` for 64-bit systems)
to begin the installation process.
4. Follow the on-screen instructions to complete the installation.

Please note that the installation process may vary depending on your system configuration and preferences.
Additionally, it's important to ensure that you have a CUDA-capable GPU installed in your system to take full
advantage of the CUDA Toolkit.


Sources: [Document(page_content='CUDA Installation Guide for Microsoft Windows\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n1. Introduction\n1.1. System Requirements\n1.2. About This Document\n\n\n2. Installing CUDA Development 

In [None]:
query = "Can you give me some best practices using the NVIDIA CUDA Toolkit?"
result = chain(query)
process_llm_response(result)

I'm happy to help! Here are some best practices for using the NVIDIA CUDA Toolkit:

1. Familiarize yourself with the CUDA programming model and the C++ syntax for accessing the GPU. This will
make it easier to write efficient and correct code.
2. Use the CUDA debugger to identify and fix issues in your code.
3. Take advantage of the various development tools provided with the CUDA Toolkit, such as NVIDIA® Nsight™
Eclipse Edition, NVIDIA Visual Profiler, CUDA-GDB, and CUDA-MEMCHECK. These tools can help you optimize your
code and identify performance bottlenecks.
4. Keep your code simple and easy to read. Avoid complex loops and data structures that can be difficult to
parallelize.
5. Use the CUDA Streams API to split your code into multiple streams and execute them simultaneously on
multiple GPU cores. This can greatly improve performance on multi-GPU systems.
6. Use the CUDA shared memory and constant memory to store data that is used frequently across multiple
threads. This can reduc

In [None]:
query = "What is nvflare?"
result = chain(query)
process_llm_response(result)

NvFlare is a Python package for creating and manipulating video effects using NVIDIA's VFX (Visual Effects)
API. It allows developers to create complex video effects using a variety of built-in filters and models, as
well as custom filters and models through the use of Python scripts.


Sources: [Document(page_content='Once venv is installed, you can use it to create a virtual environment with:\n$ python3 -m venv nvflare-env\n\n\nThis will create the nvflare-env directory in current working directory if it doesn’t exist,\nand also create directories inside it containing a copy of the Python interpreter,\nthe standard library, and various supporting files.\nActivate the virtualenv by running the following command:\n$ source nvflare-env/bin/activate\n\n\nYou may find that the pip and setuptools versions in the venv need updating:\n(nvflare-env) $ python3 -m pip install -U pip\n(nvflare-env) $ python3 -m pip install -U setuptools\n\n\n\n\n\nInstall Stable Release¶\nStable releases are ava

In [None]:
query = "What is the difference between NVIDIA's BioMegatron and Megatron 530B LLM?"
result = chain(query)
process_llm_response(result)

NVIDIA's BioMegatron and Megatron 530B LLM are both large language models (LLMs) developed by NVIDIA. However,
they differ in their architecture and capabilities.

BioMegatron is a transformer-based LLM that uses a combination of CPU and GPU resources to achieve state-of-
the-art results in various natural language processing (NLP) tasks. It was specifically designed for bio-
related tasks such as protein structure prediction, drug discovery, and genomics analysis. BioMegatron has 530
billion parameters and is trained on a dataset of over 1 exabyte of text data.

On the other hand, Megatron 530B is also a transformer-based LLM that is designed for general-purpose language
understanding and generation tasks. It has 530 billion parameters and is trained on a diverse dataset of text
from the internet. While Megatron 530B can be used for a wide range of NLP tasks, it may not be as specialized
or optimized for bio-related tasks as BioMegatron.

In summary, while both BioMegatron and Megatro

In [None]:
query = "How to find the JetPack version of NVIDIA Jetson Device?"
result = chain(query)
process_llm_response(result)

To find the JetPack version of an NVIDIA Jetson device, you can follow these steps:

1. Connect to the device's terminal or command prompt using a serial cable or remotely via SSH.
2. Run the following command to check the JetPack version: `jetpack_version`

This command will display the current version of the JetPack software installed on the device.

Alternatively, you can check the version of the Jetson device by running the following command:
`jetson_version`

This command will display the version number of the Jetson device, which can be used to identify the JetPack
version installed on the device.


Sources: [Document(page_content='NVIDIA JetPack Documentation\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNVIDIANVIDIA JetPack Documentation\n\nSearch In:\nEntire Site\nJust This Document\nclear search\nsearch\n\n\n\nNVIDIA JetPack SDK\n\nIntroduction to JetPack\n\nRelease Notes\n\nRelease Notes\n\nInstallation and Setup\n\nHow to Install JetPack\n\nCopyright And License Notices\n\nJetPack EULA\



---

