## RAG with LangChain - contextual chat with chat history

Document: SageMaker FAQ   
Models: Hugging Face embeddings (sentence-transformers/all-mpnet-base-v2) and Mistral-7B-Instruct LLM

In [1]:
# Only the model, tokenizer, and torch are needed for inference in this notebook
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


In [2]:
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

In [3]:
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import ConversationalRetrievalChain
from langchain.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

Prepare embedding model for RAG

In [4]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")  

##### Document embedding and vector storage

In [5]:
# Load documents 
loader = PyPDFDirectoryLoader("./data/smdoc")

documents = loader.load()

# Split documents by setting chunk size
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,     
    chunk_overlap = 50,    
)

docs = text_splitter.split_documents(documents)
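
To make the two splitter parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, so its chunks will not line up exactly with these. The function name `sliding_chunks` is hypothetical and not part of LangChain.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 1200-character text, only used to show the window arithmetic
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

A smaller `chunk_size` gives more precise retrieval hits but less context per hit; the overlap trades some index size for fewer facts cut in half.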

In [6]:
# Create FAISS vector store, it will take some time ...  
vectorstore_faiss = FAISS.from_documents(
    docs,
    embeddings,
)

wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

# Save store to local disk 
vectorstore_faiss.save_local("smdoc_vectordb")

In [7]:
# load vector store from local disk
vectorstore_faiss = FAISS.load_local("smdoc_vectordb", embeddings)
wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

##### Test doc search

In [8]:
query = "How can I optimize my SageMaker costs, such as detecting and stopping idle resources to avoid unnecessary charges?"

query_embedding = vectorstore_faiss.embedding_function(query)

relevant_documents = vectorstore_faiss.similarity_search_by_vector(query_embedding)
print(f'{len(relevant_documents)} documents are fetched which are relevant to the query.')
print('----')
for i, rel_doc in enumerate(relevant_documents):
    print(f'## Document {i+1}: {rel_doc.page_content}.......')
    print('---')

4 documents are fetched which are relevant to the query.
----
## Document 1: commitments. For more details, see Amazon SageMaker Pricing and the Amazon SageMaker Pricing
Calculator.
Q: How can I optimize my SageMaker costs, such as detecting and stopping idle resources to
avoid unnecessary charges?
There are several best practices that you can adopt to optimize your SageMaker resource usage.
Some approaches involve conﬁguration optimizations; others involve programmatic solutions. A.......
---
## Document 2: inference workloads at any time and automatically continue to pay the Savings Plans price.
Q: Why should I use SageMaker Savings Plans?
If you have a consistent amount of SageMaker instance usage (measured in $/hour) and use
multiple SageMaker components or expect your technology conﬁguration (such as instance family,
or Region) to change over time, SageMaker Savings Plans make it simpler to maximize your savings.......
---
## Document 3: guide for using SageMaker Studio notebooks.
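
Conceptually, `similarity_search_by_vector` ranks the stored chunk embeddings by their distance to the query embedding and returns the closest few (four by default, which matches the count printed above). The sketch below shows that ranking as a brute-force search in plain Python. It assumes Euclidean (L2) distance and uses made-up 3-dimensional vectors in place of the real 768-dimensional embeddings; `nearest_chunks` and the toy data are hypothetical. FAISS performs the same kind of search with optimized index structures.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(
    query_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int = 4,
) -> List[str]:
    """Return the texts of the k stored vectors closest to query_vec."""
    ranked = sorted(store, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "vector store": (chunk text, embedding) pairs with invented values
toy_store = [
    ("pricing and idle resources", [0.9, 0.1, 0.0]),
    ("savings plans", [0.7, 0.3, 0.1]),
    ("security groups", [0.0, 0.2, 0.9]),
    ("studio notebooks", [0.4, 0.6, 0.2]),
    ("data labeling", [0.1, 0.9, 0.3]),
]
toy_query = [1.0, 0.0, 0.0]

top = nearest_chunks(toy_query, toy_store, k=2)
print(top)
```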

##### Set up the LLM for RAG - Mistral-7B-Instruct

In [9]:
BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.1"

device_map="auto"

In [10]:
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [11]:
model = AutoModelForCausalLM.from_pretrained(  
    BASE_MODEL,  
    torch_dtype=torch.float16,
    device_map=device_map,
)

model.resize_token_embeddings(len(tokenizer))

Loading checkpoint shards: 100%|██████████| 2/2 [00:13<00:00,  6.55s/it]


Embedding(32000, 4096)

In [12]:
!nvidia-smi

Wed Oct 25 19:37:46 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   32C    P0    59W / 300W |   9839MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   32C    P0    60W / 300W |   9897MiB / 16384MiB |      0%      Default |
|       

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [13]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # Optional generation settings, left disabled in this run:
    # use_cache=True, do_sample=True, temperature=0.2, top_p=0.92, top_k=5, max_length=1000
    device_map="auto",
    max_new_tokens=256,
    num_return_sequences=1,
    repetition_penalty=1.5,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

llm_mistral = HuggingFacePipeline(pipeline=pipe)

In [15]:
response = llm_mistral("<s>[INST] Who is the CEO of AWS? [/INST]")
print(response)

 The current Chief Executive Officer (CEO) of Amazon Web Services (AWS), a subsidiary of Amazon, is Andy Jassy.
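
The `<s>[INST] ... [/INST]` wrapper in the prompt above is the instruction format used with Mistral-7B-Instruct throughout this notebook. A small helper keeps that formatting in one place; `build_instruct_prompt` is a hypothetical name and not part of the original notebook or of any library.

```python
def build_instruct_prompt(instruction: str) -> str:
    """Wrap a single-turn instruction in Mistral-Instruct style tags."""
    return f"<s>[INST] {instruction.strip()} [/INST]"


prompt = build_instruct_prompt("Who is the CEO of AWS?")
print(prompt)
```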


##### Create QA Bot by LangChain

In [16]:
from langchain.prompts.prompt import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.chains import LLMChain
from langchain.chains import SequentialChain

In [25]:
# Template for Q&A 
prompt_template = """<s>[INST] You are a helpful, respectful and honest assistant. Provide a concise answer based on the context.

Answer the question below from context below :

{context}

{question}  [/INST]"""

ANSWER_PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
#ANSWER_PROMPT = PromptTemplate.from_template(prompt_template)

In [26]:
chain = load_qa_chain(llm=llm_mistral, prompt=ANSWER_PROMPT, chain_type="stuff")

In [27]:
query = "How does SageMaker secure my code?"

In [28]:
# Retrieve the 5 nearest-neighbor chunks for the query
docs = vectorstore_faiss.similarity_search(query, k=5)

In [29]:
docs

[Document(page_content='Q: What security measures does SageMaker have?\nSageMaker ensures that ML model artifacts and other system artifacts are encrypted in transit and\nat rest. Requests to the SageMaker API and console are made over a secure (SSL) connection. You\npass AWS Identity and Access Management roles to SageMaker to provide permissions to access\nresources on your behalf for training and deployment. You can use encrypted Amazon Simple', metadata={'source': 'data/smdoc/Amazon SageMaker FAQs.pdf', 'page': 1}),
 Document(page_content='downtimes. SageMaker APIs run in Amazon proven high-availability data centers, with service stack\nreplication conﬁgured across three facilities in each Region to provide fault tolerance in the event\nof a server failure or Availability Zone outage.\nQ: How does SageMaker secure my code?\nSageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted\nat rest.\nQ: What security measures does SageMaker have?', met

In [30]:
response = chain({"input_documents": docs, "question": query}, return_only_outputs=True)[
    "output_text"
]
print(response)

 SageMaker secures user code using encryption both during transmission via SSL connections and while it is being processed within its environment. Additionally, all files uploaded into SageMaker Storage Volumes used for storing code are protected by IAM Roles assigned to them which allows fine grained permission management.
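
With `chain_type="stuff"`, the chain concatenates all retrieved chunks into one block of text, substitutes it for `{context}`, and sends the filled prompt to the LLM in a single call. Below is a rough plain-Python sketch of that prompt assembly, with a shortened template and made-up chunk texts. `stuff_prompt` and `QA_TEMPLATE` are hypothetical names, and the real chain also handles document formatting and the model call.

```python
from typing import List

# Shortened stand-in for the Q&A template used in this notebook
QA_TEMPLATE = """<s>[INST] Provide a concise answer based on the context.

{context}

{question} [/INST]"""


def stuff_prompt(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Join every chunk into one context string and fill the template."""
    context = separator.join(chunks)
    return QA_TEMPLATE.format(context=context, question=question)


filled = stuff_prompt(
    ["SageMaker stores code in ML storage volumes.", "Requests are made over SSL."],
    "How does SageMaker secure my code?",
)
print(filled)
```

Because everything goes into a single prompt, the number of chunks (`k`) times the chunk size has to stay within the model's context window.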


##### Method 1: use Conversational QA Chain for chat history

In [32]:
# Template to generate new prompt based on chat history

_template = """<s>[INST]  Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}

Follow Up Input: 
{question}

Assistant: [/INST]"""

#CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)
CONDENSE_QUESTION_PROMPT = PromptTemplate(template=_template, input_variables=["chat_history", "question"])

In [33]:
# Template for Q&A 
template = """<s>[INST]  You are a helpful, respectful and honest assistant. Provide a concise answer based on the context.

Answer the question below from context below :

{context}

{question}  

[/INST]"""

#ANSWER_PROMPT = PromptTemplate.from_template(template)
ANSWER_PROMPT = PromptTemplate(template=template, input_variables=["context","question"])

In [34]:
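from langchain.schema import format_document

# Each retrieved chunk is rendered with this prompt, i.e. as its raw page content
DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")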

def _combine_documents(docs, document_prompt = DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)

In [35]:
from typing import Tuple, List

def _format_chat_history(chat_history: List[Tuple]) -> str:
    buffer = ""
    for dialogue_turn in chat_history:        
        human = "Human: " + dialogue_turn[0]
        ai = "Assistant: " + dialogue_turn[1]
        buffer += "\n" + "\n".join([human, ai])
    return buffer

In [36]:
retriever = vectorstore_faiss.as_retriever(search_kwargs={'k': 5})

In [37]:
from operator import itemgetter

from langchain.schema.runnable import RunnableMap
from langchain.schema import format_document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

In [38]:
_inputs = RunnableMap(
    {
        "standalone_question": {
            "question": lambda x: x["question"],
            "chat_history": lambda x: _format_chat_history(x['chat_history'])
        } | CONDENSE_QUESTION_PROMPT | llm_mistral | StrOutputParser(),
    }
)

_context = {
    "context": itemgetter("standalone_question") | retriever | _combine_documents,
    "question": lambda x: x["standalone_question"]
}

In [39]:
conversational_qa_chain = _inputs | _context | ANSWER_PROMPT | llm_mistral
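
The chain built above runs three stages in order: condense the follow-up into a standalone question, retrieve chunks for that question, then answer from the retrieved context. The sketch below mirrors that data flow with plain functions. `run_conversational_qa`, `fake_llm`, and `fake_retriever` are hypothetical stand-ins with simplified prompts, so it only demonstrates the wiring, not real model behavior.

```python
from typing import Callable, List, Tuple


def run_conversational_qa(
    question: str,
    chat_history: List[Tuple[str, str]],
    llm: Callable[[str], str],
    retriever: Callable[[str], List[str]],
) -> str:
    # Stage 1: rewrite the follow-up as a standalone question
    history_text = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
    standalone = llm(
        f"Rephrase as a standalone question.\n{history_text}\nFollow Up Input: {question}"
    )
    # Stage 2: retrieve context for the standalone question
    context = "\n\n".join(retriever(standalone))
    # Stage 3: answer from the retrieved context only
    return llm(f"Answer from the context.\n{context}\n{standalone}")


calls: List[str] = []  # records every prompt the stub LLM receives


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    # The first call plays the condenser, the second plays the answerer
    if len(calls) == 1:
        return "How does AWS KMS help secure SageMaker code?"
    return "stub answer"


def fake_retriever(query: str) -> List[str]:
    return [f"chunk retrieved for: {query}"]


result = run_conversational_qa(
    "How does AWS KMS help in this case?",
    [("How does SageMaker secure my code?", "It is stored in ML storage volumes.")],
    fake_llm,
    fake_retriever,
)
print(result)
```

Note that the LLM is called twice per turn, which roughly doubles latency compared with the single-shot Q&A chain.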

In [40]:
response = conversational_qa_chain.invoke({
    "question": "How does SageMaker secure my code?",
    "chat_history": [],
})
print(response)

 When using Amazon SageMaker, your code is stored in ML storage volumes, secured by security


In [43]:
response = conversational_qa_chain.invoke({
    "question": "What security does it secure by?",
    "chat_history": [("How does SageMaker secure my code?", 
                      "When using Amazon SageMaker, your code is stored in ML storage volumes, secured by security")],
})
print(response)



 When storing code on AWS SageMaker's ML Storage Volumes, it is stored at rest using encryption options provided through Security Groups and optional encryption keys passed via AWS KMS. Additionally, requests to interact with these resources are done over SSL connections.


In [44]:
response = conversational_qa_chain.invoke({
    "question": "How does AWS KMS help in this case?",
    "chat_history": [("How does SageMaker secure my code?", 
                      "When using Amazon SageMaker, your code is stored in ML storage volumes, secured by security"),
                    ("What security does it secure by?",
                     "When storing code on AWS SageMaker's ML Storage Volumes, it is stored at rest using encryption options provided through Security Groups and optional encryption keys passed via AWS KMS. Additionally, requests to interact with these resources are done over SSL connections.")
                    ],
})
print(response)

 Yes, I would be happy to help! The AWS Key Management Service (KMS) provides encryption keys that protect sensitive information stored in various AWS resources including those related to Amazon SageMaker's machine learning workflows. When working with these types of files, users must first obtain permission by passing their identity via AWS Identity & Access Management (IAM). Once granted this permission, any subsequent requests will utilize the provided KMS key to decipher the content before allowing further processing. This way, only authorized individuals who possess the necessary credentials can view or modify confidential data while ensuring its protection at both rest and during transmission across networks.


##### Method 2: use memory to store the chat history

In [67]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages=True, output_key="answer", input_key="question")

In [68]:
from typing import List

from langchain.schema import BaseMessage

def _format_chat_history_memory(chat_history: List[BaseMessage]) -> str:
    # The memory returns a flat list of alternating Human/AI messages,
    # so each dialogue turn spans two consecutive entries.
    print("Chat history:", len(chat_history)/2)
    buffer = ""
    for i in range(len(chat_history) // 2):
        human = "Human: " + chat_history[i*2].content
        ai = "Assistant: " + chat_history[i*2+1].content
        buffer += "\n" + "\n".join([human, ai])
    return buffer

In [69]:
# First we add a step to load memory
# This needs to be a RunnableMap because it's the first input
loaded_memory = RunnableMap(
    {
        "question": itemgetter("question"),
        "memory": memory.load_memory_variables,
    }
)
# Next we add a step to expand memory into the variables
expanded_memory = {
    "question": itemgetter("question"),
    "chat_history": lambda x: x["memory"]["history"]
}

# Now we calculate the standalone question
standalone_question = {
    "standalone_question": {
        "question": lambda x: x["question"],
        "chat_history": lambda x: _format_chat_history_memory(x['chat_history'])
    } | CONDENSE_QUESTION_PROMPT | llm_mistral | StrOutputParser(),
}
# Now we retrieve the documents
retrieved_documents = {
    "docs": itemgetter("standalone_question") | retriever,
    "question": lambda x: x["standalone_question"]
}
# Now we construct the inputs for the final prompt
final_inputs = {
    "context": lambda x: _combine_documents(x["docs"]),
    "question": itemgetter("question")
}
# And finally, we do the part that returns the answers
answer = {
    "answer": final_inputs | ANSWER_PROMPT | llm_mistral,
    "docs": itemgetter("docs"),
}


In [70]:
# And now we put it all together!
final_chain = loaded_memory | expanded_memory | standalone_question | retrieved_documents | answer

In [71]:
inputs = {"question": "How does SageMaker secure my code?"}
result = final_chain.invoke(inputs)
result["answer"]

Chat history: 0.0


' When using Amazon SageMaker, your code is stored in ML storage volumes, secured by security'

In [72]:
# Note that the memory does not save automatically
# This will be improved in the future
# For now you need to save it yourself
memory.save_context(inputs, {"answer": result["answer"]})
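
Because the chain does not write to memory on its own, each turn needs an `invoke` followed by a `save_context`. One way to avoid repeating both steps is a small wrapper. The sketch below is hypothetical: `ask`, `StubChain`, and `StubMemory` are not part of the notebook or of LangChain, and the stubs only imitate the `invoke` and `save_context` calls used above so that the snippet runs on its own.

```python
from typing import Any, Dict, List, Tuple


def ask(chain: Any, memory: Any, question: str) -> str:
    """Run one chat turn and record the question/answer pair in memory."""
    inputs = {"question": question}
    result = chain.invoke(inputs)
    memory.save_context(inputs, {"answer": result["answer"]})
    return result["answer"]


class StubChain:
    """Stand-in for the real chain: echoes the question back."""

    def invoke(self, inputs: Dict[str, str]) -> Dict[str, str]:
        return {"answer": "echo: " + inputs["question"]}


class StubMemory:
    """Stand-in for the real memory: keeps turns in a plain list."""

    def __init__(self) -> None:
        self.turns: List[Tuple[str, str]] = []

    def save_context(self, inputs: Dict[str, str], outputs: Dict[str, str]) -> None:
        self.turns.append((inputs["question"], outputs["answer"]))


stub_memory = StubMemory()
reply = ask(StubChain(), stub_memory, "How does SageMaker secure my code?")
print(reply)
print(stub_memory.turns)
```

With the objects defined in this notebook, the call would presumably look like `ask(final_chain, memory, "How does SageMaker secure my code?")`.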

In [73]:
# Inspect what the memory now holds
memory.load_memory_variables({})

{'history': [HumanMessage(content='How does SageMaker secure my code?', additional_kwargs={}, example=False),
  AIMessage(content=' When using Amazon SageMaker, your code is stored in ML storage volumes, secured by security', additional_kwargs={}, example=False)]}

In [74]:
inputs = {"question": "What security does it secure by?"}
result = final_chain.invoke(inputs)
result["answer"]

Chat history: 1.0


" When storing code on AWS SageMaker's ML Storage Volumes, it is stored at rest using encryption options provided through Security Groups and optional encryption keys passed via AWS KMS. Additionally, requests to interact with these resources are done over SSL connections."

In [75]:
memory.save_context(inputs, {"answer": result["answer"]})

In [76]:
memory.load_memory_variables({})

{'history': [HumanMessage(content='How does SageMaker secure my code?', additional_kwargs={}, example=False),
  AIMessage(content=' When using Amazon SageMaker, your code is stored in ML storage volumes, secured by security', additional_kwargs={}, example=False),
  HumanMessage(content='What security does it secure by?', additional_kwargs={}, example=False),
  AIMessage(content=" When storing code on AWS SageMaker's ML Storage Volumes, it is stored at rest using encryption options provided through Security Groups and optional encryption keys passed via AWS KMS. Additionally, requests to interact with these resources are done over SSL connections.", additional_kwargs={}, example=False)]}

In [77]:
inputs = {"question": "How does AWS KMS help in this case?"}
result = final_chain.invoke(inputs)
result["answer"]

Chat history: 2.0


" Yes, I would be happy to help! The AWS Key Management Service (KMS) provides encryption keys that protect sensitive information stored in various AWS resources including those related to Amazon SageMaker's machine learning workflows. When working with these types of files, users must first obtain permission by passing their identity via AWS Identity & Access Management (IAM). Once granted this permission, any subsequent requests will utilize the provided KMS key to decipher the content before allowing further processing. This way, only authorized individuals who possess the necessary credentials can view or modify confidential data while ensuring its protection at both rest and during transmission across networks."

In [78]:
memory.save_context(inputs, {"answer": result["answer"]})

In [79]:
memory.load_memory_variables({})

{'history': [HumanMessage(content='How does SageMaker secure my code?', additional_kwargs={}, example=False),
  AIMessage(content=' When using Amazon SageMaker, your code is stored in ML storage volumes, secured by security', additional_kwargs={}, example=False),
  HumanMessage(content='What security does it secure by?', additional_kwargs={}, example=False),
  AIMessage(content=" When storing code on AWS SageMaker's ML Storage Volumes, it is stored at rest using encryption options provided through Security Groups and optional encryption keys passed via AWS KMS. Additionally, requests to interact with these resources are done over SSL connections.", additional_kwargs={}, example=False),
  HumanMessage(content='How does AWS KMS help in this case?', additional_kwargs={}, example=False),
  AIMessage(content=" Yes, I would be happy to help! The AWS Key Management Service (KMS) provides encryption keys that protect sensitive information stored in various AWS resources including those related

In [66]:
# The chat history is now ready to be stored in a file or database for future training
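
A minimal sketch of one way to persist the history, assuming a JSON Lines file with one question/answer pair per line. The file name, the record fields, and the helper names are assumptions for illustration; the turns are passed as plain `(question, answer)` tuples, which can be built from the message objects held in `memory`.

```python
import json
import os
import tempfile
from typing import List, Tuple


def save_chat_history(turns: List[Tuple[str, str]], path: str) -> None:
    """Append each (question, answer) turn as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for question, answer in turns:
            # Model outputs in this notebook start with a space, so strip them
            record = {"question": question, "answer": answer.strip()}
            f.write(json.dumps(record) + "\n")


def load_chat_history(path: str) -> List[Tuple[str, str]]:
    """Read the JSON Lines file back into (question, answer) tuples."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [(r["question"], r["answer"]) for r in records]


# Demo in a temporary directory so no real data is touched
demo_path = os.path.join(tempfile.mkdtemp(), "chat_history.jsonl")
save_chat_history(
    [("How does SageMaker secure my code?", " Code is stored in ML storage volumes.")],
    demo_path,
)
restored = load_chat_history(demo_path)
print(restored)
```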