## RAG LLM by LangChain - simple Q&A bot

Document: SageMaker FAQ   
Models: Hugging Face sentence-transformers embeddings and the Mistral-7B-Instruct LLM

In [1]:
# Only the model and tokenizer loaders are needed for inference in this notebook.
from transformers import AutoModelForCausalLM, AutoTokenizer

import torch

In [2]:
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

In [3]:
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import ConversationalRetrievalChain
from langchain.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

###### Prepare embedding model for RAG

In [4]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")  

###### Document embedding and vector storage

In [5]:
# Load documents 
loader = PyPDFDirectoryLoader("./data/smdoc")

documents = loader.load()

# Split documents by setting chunk size
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,     
    chunk_overlap = 50,    
)

docs = text_splitter.split_documents(documents)
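The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window splitter. This is only a sketch: `split_fixed_window` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself prefers to break on separators such as paragraphs and newlines rather than at fixed offsets, so its chunks will differ.

```python
from typing import List


def split_fixed_window(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: 1200 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_fixed_window(sample)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])  # neighbouring chunks share the overlap
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.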

In [6]:
# Create the FAISS vector store (embedding every chunk can take a while)
vectorstore_faiss = FAISS.from_documents(
    docs,
    embeddings,
)

wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

# Save store to local disk 
vectorstore_faiss.save_local("smdoc_vectordb")

In [7]:
# load vector store from local disk
vectorstore_faiss = FAISS.load_local("smdoc_vectordb", embeddings)
wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

###### Test doc search 

In [8]:
query = "How can I optimize my SageMaker costs, such as detecting and stopping idle resources to avoid unnecessary charges?"

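# Embed the query with the same model that embedded the documents.
# Calling embeddings.embed_query directly avoids relying on
# vectorstore_faiss.embedding_function being a callable.
query_embedding = embeddings.embed_query(query)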

relevant_documents = vectorstore_faiss.similarity_search_by_vector(query_embedding)
print(f'{len(relevant_documents)} documents are fetched which are relevant to the query.')
print('----')
for i, rel_doc in enumerate(relevant_documents):
    print(f'## Document {i+1}: {rel_doc.page_content}.......')
    print('---')

4 documents are fetched which are relevant to the query.
----
## Document 1: commitments. For more details, see Amazon SageMaker Pricing and the Amazon SageMaker Pricing
Calculator.
Q: How can I optimize my SageMaker costs, such as detecting and stopping idle resources to
avoid unnecessary charges?
There are several best practices that you can adopt to optimize your SageMaker resource usage.
Some approaches involve conﬁguration optimizations; others involve programmatic solutions. A.......
---
## Document 2: inference workloads at any time and automatically continue to pay the Savings Plans price.
Q: Why should I use SageMaker Savings Plans?
If you have a consistent amount of SageMaker instance usage (measured in $/hour) and use
multiple SageMaker components or expect your technology conﬁguration (such as instance family,
or Region) to change over time, SageMaker Savings Plans make it simpler to maximize your savings.......
---
## Document 3: guide for using SageMaker Studio notebooks.
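
The vector search above can be pictured as a nearest-neighbour lookup over embeddings. The following is a toy sketch with hand-made 3-dimensional vectors, assuming Euclidean (L2) distance; the real index holds the sentence-transformers embeddings and its distance metric depends on how the FAISS index was configured. The names `l2_distance`, `top_k`, and `toy_index` are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, index, k=4):
    """Return the k (distance, text) pairs closest to query_vec."""
    scored = sorted((l2_distance(query_vec, vec), text) for text, vec in index)
    return scored[:k]


# Toy "index": (chunk text, embedding) pairs with made-up vectors.
toy_index = [
    ("pricing and savings plans", [0.9, 0.1, 0.0]),
    ("stopping idle resources", [0.8, 0.3, 0.1]),
    ("security and encryption", [0.0, 0.2, 0.9]),
]

query_vec = [0.8, 0.3, 0.0]
results = top_k(query_vec, toy_index, k=2)
for dist, text in results:
    print(f"{dist:.3f}  {text}")
```

A brute-force scan like this is fine for a handful of vectors; FAISS exists to make the same lookup fast over many thousands of chunks.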

##### Set up the LLM for RAG - Mistral-7B-Instruct

In [9]:
BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.1"

device_map="auto"

In [10]:
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [11]:
model = AutoModelForCausalLM.from_pretrained(  
    BASE_MODEL,  
    torch_dtype=torch.float16,
    device_map=device_map,
)

model.resize_token_embeddings(len(tokenizer))

Loading checkpoint shards: 100%|██████████| 2/2 [00:13<00:00,  6.55s/it]


Embedding(32000, 4096)

In [12]:
!nvidia-smi

Wed Oct 25 19:10:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   32C    P0    59W / 300W |   4735MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   32C    P0    61W / 300W |   4755MiB / 16384MiB |      0%      Defaul

|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   32C    P0    64W / 300W |   4755MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P0    64W / 300W |   4125MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID            
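
The per-GPU memory figures in the `nvidia-smi` output above can be checked against a rough estimate of the float16 weight size. This is a back-of-the-envelope sketch: the parameter count of about 7.24e9 is an assumption for Mistral-7B, and the remainder lumps together everything else on the GPUs (CUDA contexts, buffers, any other loaded models).

```python
# Per-GPU memory in MiB, copied from the nvidia-smi output above.
used_mib = [4735, 4755, 4755, 4125]

# Assumption: roughly 7.24e9 parameters, stored as float16 (2 bytes each).
ASSUMED_PARAMS = 7.24e9
BYTES_PER_PARAM_FP16 = 2

total_used_gib = sum(used_mib) / 1024
weights_gib = ASSUMED_PARAMS * BYTES_PER_PARAM_FP16 / 2**30

print(f"total used across GPUs:  {total_used_gib:.2f} GiB")
print(f"fp16 weights (estimate): {weights_gib:.2f} GiB")
print(f"everything else:         {total_used_gib - weights_gib:.2f} GiB")
```

The estimate shows why `device_map="auto"` matters here: the weights alone come close to the 16 GiB of a single V100, so spreading the layers over four GPUs leaves headroom for activations during generation.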

In [13]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

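# Wrap the already-loaded model and tokenizer in a text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    max_new_tokens=256,        # cap on newly generated tokens per call
    num_return_sequences=1,
    repetition_penalty=1.5,    # values > 1 discourage repeated tokens
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    # Optional settings tried during development, left disabled:
    # use_cache=True, do_sample=True, temperature=0.2, top_p=0.92,
    # top_k=5, max_length=1000
)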

llm_mistral = HuggingFacePipeline(pipeline=pipe)
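
The `repetition_penalty = 1.5` setting can be understood with a small sketch of the usual rescaling rule: the score of every token already present in the sequence is pushed down before the next token is chosen. This is a plain-Python illustration on a made-up 4-token vocabulary, not the library's implementation; `apply_repetition_penalty` is a hypothetical helper.

```python
def apply_repetition_penalty(logits, seen_ids, penalty=1.5):
    """Return a copy of logits with already-seen tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(seen_ids):
        score = adjusted[token_id]
        # Dividing a positive logit and multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [3.0, -1.0, 0.5, 2.0]   # made-up scores for token ids 0..3
seen = [0, 1, 0]                 # token ids already in the sequence
print(apply_repetition_penalty(logits, seen))
```

With a penalty as high as 1.5, token 0 drops from clearly preferred (3.0) to a tie with token 3 (2.0), which suppresses loops but can also make the model avoid words it legitimately needs to repeat.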

In [14]:
#response = llm_mistral("<s>[INST] Who is the CEO of AWS? [/INST] </s>")
response = llm_mistral("Who is the CEO of AWS?")
print(response)


AWS does not have a traditional Chief Executive Officer (CEO). Instead, it operates under an executive committee led by Andy Jassy.


##### Create the Q&A bot with LangChain

In [15]:
from langchain.prompts.prompt import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.chains import LLMChain
from langchain.chains import SequentialChain

In [46]:
# Template for Q&A (1)
prompt_template = """<s>[INST] You are a helpful, respectful and honest assistant. Provide a concise answer based on the context.

Answer the question below from context below :

{context}

{question}  [/INST]"""

ANSWER_PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

In [39]:
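# Template for Q&A (2): alternative Human/Assistant format.
# Note: this cell rebinds ANSWER_PROMPT. Going by the execution counts, template (1)
# was run later (In [46]) and is the one used by the chain below.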
prompt_template = """Human: You are a helpful, respectful and honest assistant. Provide a concise answer based on the context.

Answer the question below from context below :

{context}

{question}

Assistant: """

ANSWER_PROMPT = PromptTemplate.from_template(prompt_template)

In [47]:
chain = load_qa_chain(llm=llm_mistral, prompt=ANSWER_PROMPT, chain_type="stuff")
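
Conceptually, the `"stuff"` chain type places all retrieved chunks into a single prompt. A minimal sketch of that idea follows, assuming the chunks are joined with a blank line; `Doc`, `TEMPLATE`, and `stuff_prompt` are hypothetical stand-ins for illustration and not LangChain objects.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str


# Shortened version of the instruct-style template used in this notebook.
TEMPLATE = (
    "[INST] Answer the question below from context below :\n\n"
    "{context}\n\n{question}  [/INST]"
)


def stuff_prompt(docs, question, template=TEMPLATE, separator="\n\n"):
    """Join every chunk into one context string and fill the template."""
    context = separator.join(d.page_content for d in docs)
    return template.format(context=context, question=question)


docs_demo = [
    Doc("SageMaker stores code in ML storage volumes."),
    Doc("Requests are made over a secure (SSL) connection."),
]
prompt_text = stuff_prompt(docs_demo, "How does SageMaker secure my code?")
print(prompt_text)
```

Because every chunk goes into one prompt, the number of retrieved chunks times the chunk size has to fit in the model's context window alongside the question and the generated answer.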

In [48]:
query = "How does SageMaker secure my code?"

In [49]:
# Retrieve the 5 nearest chunks for the query
docs = vectorstore_faiss.similarity_search(query, k=5)

In [50]:
docs

[Document(page_content='Q: What security measures does SageMaker have?\nSageMaker ensures that ML model artifacts and other system artifacts are encrypted in transit and\nat rest. Requests to the SageMaker API and console are made over a secure (SSL) connection. You\npass AWS Identity and Access Management roles to SageMaker to provide permissions to access\nresources on your behalf for training and deployment. You can use encrypted Amazon Simple', metadata={'source': 'data/smdoc/Amazon SageMaker FAQs.pdf', 'page': 1}),
 Document(page_content='downtimes. SageMaker APIs run in Amazon proven high-availability data centers, with service stack\nreplication conﬁgured across three facilities in each Region to provide fault tolerance in the event\nof a server failure or Availability Zone outage.\nQ: How does SageMaker secure my code?\nSageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted\nat rest.\nQ: What security measures does SageMaker have?', met

In [51]:
response = chain({"input_documents": docs, "question": query}, return_only_outputs=True)[
    "output_text"
]
print(response)

 SageMaker secures user code using encryption both during transmission via SSL connections and while it is being processed within its environment. Additionally, all files uploaded into SageMaker Storage Volumes used for storing code are protected by IAM Roles assigned to them which allows fine grained permission management.
