# Set up the JupyterLab environment

## Install all the required packages

In [203]:
#%pip install -U --no-cache-dir boto3 pypdf langchain-community langchain sqlalchemy
#%pip install -U --no-cache-dir langchain faiss-cpu
#%pip install -U --no-cache-dir pinecone-client tiktoken ipywidgets matplotlib anthropic langchain-aws
#%pip install -U --no-cache-dir transformers
#%pip install python-pptx
#%pip install openpyxl
#%pip install docx2txt



Collecting docx2txt
  Using cached docx2txt-0.8.tar.gz (2.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: docx2txt
  Building wheel for docx2txt (setup.py) ... done
  Created wheel for docx2txt: filename=docx2txt-0.8-py3-none-any.whl size=3959 sha256=73d6bb0df58665a1a3aa7f050d87170a3879c6e544fa2bf2f61632d5aa4a3889
  Stored in directory: /Users/skadthan/Library/Caches/pip/wheels/0f/0e/7a/3094a4ceefe657bff7e12dd9592a9d5b6487ef4338ace0afa6
Successfully built docx2txt
Installing collected packages: docx2txt
Successfully installed docx2txt-0.8
Note: you may need to restart the kernel to use updated packages.


## Setup

⚠️ ⚠️ ⚠️ Before running this notebook, ensure you have run the [Bedrock boto3 setup notebook](../rag_poc/bedrock_basics.ipynb). ⚠️ ⚠️ ⚠️

⚠️ ⚠️ ⚠️ If you have not installed the required packages yet, uncomment and run the pip installs in the install cell. ⚠️ ⚠️ ⚠️


In [151]:
import warnings

from io import StringIO
import sys
import textwrap
import os
from typing import Optional

# External Dependencies:
import boto3
from botocore.config import Config

warnings.filterwarnings('ignore')

def print_ww(*args, width: int = 100, **kwargs):
    """Like print(), but wraps output to `width` characters (default 100)"""
    buffer = StringIO()
    try:
        _stdout = sys.stdout
        sys.stdout = buffer
        print(*args, **kwargs)
        output = buffer.getvalue()
    finally:
        sys.stdout = _stdout
    for line in output.splitlines():
        print("\n".join(textwrap.wrap(line, width=width)))
        



def get_bedrock_client(
    assumed_role: Optional[str] = None,
    region: Optional[str] = None,
    runtime: Optional[bool] = True,
):
    """Create a boto3 client for Amazon Bedrock, with optional configuration overrides

    Parameters
    ----------
    assumed_role :
        Optional ARN of an AWS IAM role to assume for calling the Bedrock service. If not
        specified, the current active credentials will be used.
    region :
        Optional name of the AWS Region in which the service should be called (e.g. "us-east-1").
        If not specified, AWS_REGION or AWS_DEFAULT_REGION environment variable will be used.
    runtime :
        Optional choice of getting different client to perform operations with the Amazon Bedrock service.
    """
    if region is None:
        target_region = os.environ.get("AWS_REGION", os.environ.get("AWS_DEFAULT_REGION"))
    else:
        target_region = region

    print(f"Create new client\n  Using region: {target_region}")
    session_kwargs = {"region_name": target_region}
    client_kwargs = {**session_kwargs}

    profile_name = os.environ.get("AWS_PROFILE")
    if profile_name:
        print(f"  Using profile: {profile_name}")
        session_kwargs["profile_name"] = profile_name

    retry_config = Config(
        region_name=target_region,
        retries={
            "max_attempts": 10,
            "mode": "standard",
        },
    )
    session = boto3.Session(**session_kwargs)

    if assumed_role:
        print(f"  Using role: {assumed_role}", end='')
        sts = session.client("sts")
        response = sts.assume_role(
            RoleArn=str(assumed_role),
            RoleSessionName="langchain-llm-1"
        )
        print(" ... successful!")
        client_kwargs["aws_access_key_id"] = response["Credentials"]["AccessKeyId"]
        client_kwargs["aws_secret_access_key"] = response["Credentials"]["SecretAccessKey"]
        client_kwargs["aws_session_token"] = response["Credentials"]["SessionToken"]

    if runtime:
        service_name='bedrock-runtime'
    else:
        service_name='bedrock'

    bedrock_client = session.client(
        service_name=service_name,
        config=retry_config,
        **client_kwargs
    )

    print("boto3 Bedrock client successfully created!")
    print(bedrock_client._endpoint)
    return bedrock_client

In [153]:
import json
import os
import sys

import boto3


# ---- ⚠️ Un-comment and edit the below lines as needed for your AWS setup ⚠️ ----

# os.environ["AWS_DEFAULT_REGION"] = "<REGION_NAME>"  # E.g. "us-east-1"
# os.environ["AWS_PROFILE"] = "<YOUR_PROFILE>"
# os.environ["BEDROCK_ASSUME_ROLE"] = "<YOUR_ROLE_ARN>"  # E.g. "arn:aws:..."


boto3_bedrock = get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region='us-east-1' #os.environ.get("AWS_DEFAULT_REGION", None)
)

Create new client
  Using region: us-east-1
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-east-1.amazonaws.com)


#### FAISS as VectorStore

To use embeddings for search, we need a store that can perform vector similarity searches efficiently. This notebook uses FAISS, which is an in-memory store. To store vectors persistently, you can use pgvector, Pinecone, or Chroma.

The LangChain VectorStore APIs are documented [here](https://python.langchain.com/en/harrison-docs-refactor-3-24/reference/modules/vectorstore.html).

To learn more about FAISS, refer to this [paper](https://arxiv.org/pdf/1702.08734.pdf).
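
To make the idea of a vector similarity search concrete, here is a minimal pure-Python sketch of an exact, in-memory nearest-neighbour store. The class `TinyVectorStore` and its methods are hypothetical names chosen for illustration; they are not part of FAISS or LangChain. FAISS serves the same purpose with far more efficient index structures.

```python
import math
from typing import List, Tuple


class TinyVectorStore:
    """Hypothetical in-memory store using exact (brute-force) L2 search."""

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._texts: List[str] = []

    def add(self, vector: List[float], text: str) -> None:
        """Store a vector together with the text it represents."""
        self._vectors.append(list(vector))
        self._texts.append(text)

    def search(self, query: List[float], k: int = 2) -> List[Tuple[str, float]]:
        """Return the k stored texts closest to `query`, smallest distance first."""
        scored = [
            (text, math.dist(query, vector))
            for vector, text in zip(self._vectors, self._texts)
        ]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


store = TinyVectorStore()
store.add([0.0, 0.0], "origin")
store.add([1.0, 0.0], "east")
store.add([5.0, 5.0], "far away")

nearest = store.search([0.9, 0.1], k=2)
print(nearest)
```

A brute-force scan like this compares the query against every stored vector, which is fine for a handful of items but is the cost that a dedicated library such as FAISS is designed to reduce.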

#### Titan embeddings Model

Embeddings are a way to represent words, phrases or any other discrete items as vectors in a continuous vector space. This allows machine learning models to perform mathematical operations on these representations and capture semantic relationships between them.

For example, embeddings power the [document search capability](https://labelbox.com/blog/how-vector-similarity-search-works/) in RAG.
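
As a rough illustration of how vectors can capture semantic relationships, the sketch below compares hand-made three-dimensional vectors using cosine similarity. The vector values are assumptions invented for this example; real Titan embeddings are produced by the model and have many more dimensions.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (illustrative values only, not model output).
toy_embeddings: Dict[str, List[float]] = {
    "invoice": [0.9, 0.1, 0.0],
    "receipt": [0.8, 0.2, 0.1],
    "garden": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["invoice"], toy_embeddings["receipt"])
sim_unrelated = cosine_similarity(toy_embeddings["invoice"], toy_embeddings["garden"])
print(f"invoice vs receipt: {sim_related:.3f}")
print(f"invoice vs garden:  {sim_unrelated:.3f}")
```

Items with related meanings point in similar directions and score close to 1, while unrelated items score close to 0.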


In [219]:
from pathlib import Path

# langchain_aws replaces the deprecated langchain.embeddings import
from langchain_aws import BedrockEmbeddings
from langchain_community.document_loaders import (
    CSVLoader,
    Docx2txtLoader,
    PyPDFDirectoryLoader,
    PyPDFLoader,
)
from langchain_community.vectorstores import FAISS
from langchain.schema import Document
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from openpyxl import load_workbook
from pptx import Presentation

br_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=boto3_bedrock)

s3_path = "s3://ashu-nimbus-data"
working_path = "./rag_data/nimbus"
!aws s3 sync $s3_path $working_path
folder_path = Path(working_path)
vectorstore_faiss_aws = None

def load_docx(file):
    loader = Docx2txtLoader(file)
    documents= loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000,chunk_overlap=200,)
    split_data = text_splitter.split_documents(documents)
    print(f"Number of documents={len(documents)}")
    return split_data

def load_pdf(file):
    # Build the loader here instead of relying on a global set elsewhere
    loader = PyPDFLoader(str(file))
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000,chunk_overlap=200,)
    split_data = text_splitter.split_documents(documents)
    print(f"Number of documents={len(documents)}")
    return split_data

def load_csv(file):
    loader = CSVLoader(str(file))  # CSVLoader yields one Document per row
    documents = loader.load()
    print(f"Number of documents={len(documents)}")
    split_data = CharacterTextSplitter(chunk_size=2000, chunk_overlap=400, separator=",").split_documents(documents)
    return split_data

def load_xlsx(file):
    workbook = load_workbook(file)
    full_text = ""
    documents = []
    # Flatten every row of every sheet into one block of text
    for sheet_name in workbook.sheetnames:
        sheet = workbook[sheet_name]
        for row in sheet.iter_rows(values_only=True):
            row_text = " ".join([str(cell) if cell is not None else "" for cell in row])
            full_text += row_text + "\n"
    
    documents.append(Document(page_content=full_text, metadata={"source": str(file)}))
    # Split the text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000,chunk_overlap=200,)
    split_data = text_splitter.split_documents(documents)
    print(f"Processed {len(split_data)} document chunks.")
    return split_data

def load_pptx(file):
    presentation = Presentation(file)
    full_text = ""
    documents = []
    
    # Collect the text of every shape that has a text frame
    for slide in presentation.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                full_text += shape.text + "\n"
    
    documents.append(Document(page_content=full_text, metadata={"source": str(file)}))
    # Split the text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000,chunk_overlap=200,)
    split_data = text_splitter.split_documents(documents)
    print(f"Processed {len(split_data)} document chunks.")
    return split_data
    

for file_path in folder_path.iterdir():
    try:
        if file_path.is_file():
            #print(f"Processing file: {file_path}")
            file_extension = os.path.splitext(file_path)[1].lower()
            # Pick a loader based on the file extension
            
            if file_extension == '.docx':
                file_type = 'Word Document'
                docs=load_docx(file_path)
            elif file_extension == '.pptx':  # python-pptx cannot read legacy .ppt files
                file_type = 'PowerPoint Presentation'
                docs=load_pptx(file_path)
            elif file_extension == '.xlsx':
                file_type = 'Excel Spreadsheet'
                docs=load_xlsx(file_path)
            elif file_extension == '.pdf':
                file_type = 'PDF Document'
                docs=load_pdf(file_path)
            elif file_extension == '.csv':
                file_type = 'CSV File'
                docs=load_csv(file_path)
            else:
                # Unsupported type: report it and skip, so that chunks from the
                # previous file are not indexed a second time
                print(f"{file_path}: Unknown Type")
                continue

            print(f"{file_path}: {file_type}")

            # Update or create the FAISS vectorstore
            if vectorstore_faiss_aws is None:
                vectorstore_faiss_aws = FAISS.from_documents(documents=docs,embedding = br_embeddings)
            else:
                vectorstore_faiss_aws.add_documents(documents=docs)     

    except Exception as e:
        print(f"Error processing {file_path}: {e}")




Number of documents=1
rag_data/nimbus/Relevant_Experience_Project_1.docx: Word Document
Number of documents=1
rag_data/nimbus/Blurbs_relating_to_the_EDLI.docx: Word Document
Number of documents=1
rag_data/nimbus/ORES1_OC_DevOps_Self_Assessment_OY4.docx: Word Document
Number of documents=1
rag_data/nimbus/Nimbus_ISFCS_MarketResearch_Capabilities.docx: Word Document
Number of documents=1
rag_data/nimbus/Nimbus_RFQ #210501_ORES_Volume I_Technical_edited.docx: Word Document
Number of documents=1
rag_data/nimbus/Nimbus_OC_Self_Assessment_OY3.docx: Word Document
Number of documents=1
rag_data/nimbus/Nimbus_DAS_Self_Assessment_OY3.docx: Word Document
Number of documents=1
rag_data/nimbus/ORES2_BY_CPI.docx: Word Document
Processed 28 document chunks.
rag_data/nimbus/Nimbus_OHI_Partnership.pptx: PowerPoint Presentation
Number of documents=1
rag_data/nimbus/RR_Nimbus_Value_to_SMAQ_and_CCSQ_Strategy.docx: Word Document
Processed 5 document chunks.
rag_data/nimbus/FDA_DIMES_Capes_Matrix_Nimbus.xls

#### Automatically add history to the chat with the retriever

Wrap the chain in `RunnableWithMessageHistory`, keyed by session ID, and run the chat conversation.

Adapted from https://github.com/langchain-ai/langchain

In [222]:
print(vectorstore_faiss_aws)
# Save the updated FAISS vectorstore if needed
vectorstore_faiss_aws.save_local("./faiss")


<langchain_community.vectorstores.faiss.FAISS object at 0x1670a1c50>


## Steps to read the vector DB content

In [None]:
# Define a query
query = "What is the purpose of this document?"

# Perform a similarity search
similar_docs = vectorstore_faiss_aws.similarity_search(query, k=5)  # Retrieve top 5 similar documents

# Display the retrieved documents
for i, doc in enumerate(similar_docs):
    print(f"Document {i+1}:")
    print(f"Page Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")
    print("-" * 80)


In [231]:
from langchain_aws import ChatBedrock
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.chat_history import BaseChatMessageHistory, InMemoryChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

model_parameter = {"temperature": 0.0, "top_p": .5, "max_tokens_to_sample": 2000}
modelId = "anthropic.claude-3-sonnet-20240229-v1:0" #"anthropic.claude-v2"
chatbedrock_llm = ChatBedrock(
    model_id=modelId,
    client=boto3_bedrock,
    model_kwargs=model_parameter, 
    beta_use_converse_api=True
)

    
# Each fragment ends with a space so the concatenated sentences do not run together
contextualized_question_system_template = (
    "You are assisting in generating a comprehensive capabilities statement based on user queries and contextual information. "
    "Given the chat history and the latest user query, rewrite the query to be a self-contained question, "
    "including any missing information required to answer it effectively. Ensure the question is framed professionally "
    "and aligns with the goal of constructing a detailed capabilities statement. "
    "Do NOT answer the question; only reformulate it. If no reformulation is needed, return the query as is."
)

contextualized_question_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualized_question_system_template),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)
history_aware_retriever = create_history_aware_retriever(
    chatbedrock_llm, vectorstore_faiss_aws.as_retriever(), contextualized_question_prompt
)


qa_system_prompt = """You are an AI assistant tasked with providing detailed and relevant responses for generating a capabilities statement. \
Using the retrieved documents and context, answer the user query as accurately and professionally as possible. \
If the retrieved context does not contain sufficient information to answer the query, state: \
"I do not have enough context to provide an answer."

Guidelines:
1. Provide answers that are factual and directly relevant to constructing a capabilities statement.
2. Use concise, professional language tailored to the capabilities domain.
3. Reference retrieved context explicitly when possible, such as contracts, proposals, or assessments.

{context}"""

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", qa_system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])
question_answer_chain = create_stuff_documents_chain(chatbedrock_llm, qa_prompt)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

#- Wrap the rag_chain with RunnableWithMessageHistory to automatically handle chat history:
store = {}
def get_session_history(session_id: str) -> BaseChatMessageHistory:
    #print(session_id)
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

chain_with_history = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)


In [233]:
result = chain_with_history.invoke(
    {"input": "What kind of bias can SageMaker detect?"},
    config={"configurable": {"session_id": "session_1"}}
)
result

{'input': 'What kind of bias can SageMaker detect?',
 'chat_history': [],
 'context': [Document(metadata={'source': 'rag_data/nimbus/Nimbus_CPI_Self_Assessment_OY3.docx'}, page_content='PEOG-DSEO CMS.gov Program Integrity Knowledge and Resource Center: in March of 2020, the Nimbus Web Developers designed, built, and edited content for this new CPI web page. The goal was to have a page to serve as a training resource for medical reviewers, program integrity and appeals contractors, CMS staff, state Medicaid agency staff, and other stakeholders.\n\nBenefits to CMS: This page was necessary to resolve a failure on IBM’s part concerning the Extrapolation and Estimation eLearning course; IBM spent approximately one year developing the course, which was unusable when it was time to launch due to the burden on test-takers to submit to an Experian identity check in order to proceed to the course questions. DSEO’s management agreed the Experian questions were unreasonable and too probing for an 

### Ask a follow-up question

1. References such as `it` in the follow-up are resolved using the chat history.
2. The retriever is then invoked with the reformulated question to fetch relevant content.
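
The two steps above can be sketched in plain Python. This is a simplified illustration, not the LangChain implementation: `history_aware_retrieve`, `fake_rewrite`, and `fake_retrieve` are hypothetical names, and the rewriter is a trivial string substitution standing in for the LLM call. It mirrors the documented behaviour of `create_history_aware_retriever`, which passes the input straight to the retriever when there is no chat history and otherwise asks the LLM to generate a search query first.

```python
from typing import Callable, List


def history_aware_retrieve(
    user_input: str,
    chat_history: List[str],
    rewrite_query: Callable[[str, List[str]], str],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Rewrite the query only when history exists, then retrieve with it."""
    if chat_history:
        search_query = rewrite_query(user_input, chat_history)
    else:
        search_query = user_input
    return retrieve(search_query)


# Hypothetical stand-ins for the LLM rewriter and the vector store retriever.
def fake_rewrite(user_input: str, chat_history: List[str]) -> str:
    # A real LLM would resolve the reference; here we substitute the last topic.
    return user_input.replace("it", chat_history[-1])


def fake_retrieve(search_query: str) -> List[str]:
    return [f"documents matching: {search_query}"]


first = history_aware_retrieve("data quality", [], fake_rewrite, fake_retrieve)
follow_up = history_aware_retrieve(
    "expand on it", ["data quality"], fake_rewrite, fake_retrieve
)
print(first)
print(follow_up)
```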

In [235]:
follow_up_result = chain_with_history.invoke(
    {"input": "Please help me write 2 pager of full capability statement for the services/tasks such as Project Management Support, data architecture, data quality, system architecture areas?"},
    config={"configurable": {"session_id": "session_1"}}
)
print_ww(follow_up_result)

{'input': 'Please help me write 2 pager of full capability statement for the services/tasks such as
Project Management Support, data architecture, data quality, system architecture areas?',
'chat_history': [HumanMessage(content='What kind of bias can SageMaker detect?',
additional_kwargs={}, response_metadata={}), AIMessage(content="I'm afraid I don't have enough
context about SageMaker's specific capabilities to provide details on what types of bias it can
detect. SageMaker is Amazon's machine learning platform, but without more specifics on which
SageMaker service or feature you are asking about, I cannot give an accurate answer regarding bias
detection. Could you provide some more context about the particular SageMaker component you have in
mind? That would help me better understand and respond to your query.", additional_kwargs={},
response_metadata={})], 'context': [Document(metadata={'source':
'rag_data/nimbus/RR_CMS_PPM_RFI_Response_Final_Draft_for_Feedback.docx'}, page_content=

In [237]:
follow_up_result = chain_with_history.invoke(
    {"input": "Instead of providing example projects (such as project 1, project 2 etc.,), could you please me fetch the actual project names that was used to pulled the data for past performances?"},
    config={"configurable": {"session_id": "session_1"}}
)
print_ww(follow_up_result)

{'input': 'Instead of providing example projects (such as project 1, project 2 etc.,), could you
please me fetch the actual project names that was used to pulled the data for past performances?',
'chat_history': [HumanMessage(content='What kind of bias can SageMaker detect?',
additional_kwargs={}, response_metadata={}), AIMessage(content="I'm afraid I don't have enough
context about SageMaker's specific capabilities to provide details on what types of bias it can
detect. SageMaker is Amazon's machine learning platform, but without more specifics on which
SageMaker service or feature you are asking about, I cannot give an accurate answer regarding bias
detection. Could you provide some more context about the particular SageMaker component you have in
mind? That would help me better understand and respond to your query.", additional_kwargs={},
response_metadata={}), HumanMessage(content='Please help me write 2 pager of full capability
statement for the services/tasks such as Project Mana

In [239]:
follow_up_result = chain_with_history.invoke(
    {"input": "The data quality section is very narrow, could you please expand on that a bit more? "},
    config={"configurable": {"session_id": "session_1"}}
)
follow_up_result

{'input': 'The data quality section is very narrow, could you please expand on that a bit more? ',
 'chat_history': [HumanMessage(content='What kind of bias can SageMaker detect?', additional_kwargs={}, response_metadata={}),
  AIMessage(content="I'm afraid I don't have enough context about SageMaker's specific capabilities to provide details on what types of bias it can detect. SageMaker is Amazon's machine learning platform, but without more specifics on which SageMaker service or feature you are asking about, I cannot give an accurate answer regarding bias detection. Could you provide some more context about the particular SageMaker component you have in mind? That would help me better understand and respond to your query.", additional_kwargs={}, response_metadata={}),
  HumanMessage(content='Please help me write 2 pager of full capability statement for the services/tasks such as Project Management Support, data architecture, data quality, system architecture areas?', additional_kwa

#### Now ask an unrelated question

In [None]:
follow_up_result = chain_with_history.invoke(
    {"input": "Give me a few tips on how to plant a  new garden."},
    config={"configurable": {"session_id": "session_1"}}
)
follow_up_result

Let's see how the semantic search works:
1. First, we calculate the embedding vector for the query.
2. Then we use this vector to run a similarity search on the store.
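
These two steps can be written as one small helper. The sketch below assumes objects that expose LangChain's `embed_query` and `similarity_search_by_vector` methods. `semantic_search`, `StubEmbeddings`, and `StubVectorStore` are hypothetical names, and the stubs use a toy two-number "embedding" so the example runs without AWS access.

```python
import math
from typing import Any, List


def semantic_search(embeddings: Any, vector_store: Any, query: str, k: int = 4) -> List[Any]:
    # Step 1: turn the query text into an embedding vector.
    query_vector = embeddings.embed_query(query)
    # Step 2: look up the k nearest stored vectors.
    return vector_store.similarity_search_by_vector(query_vector, k=k)


class StubEmbeddings:
    """Toy embedding: [text length, number of spaces]. Not a real model."""

    def embed_query(self, text: str) -> List[float]:
        return [float(len(text)), float(text.count(" "))]


class StubVectorStore:
    """Toy store that ranks its texts by L2 distance to the query vector."""

    def __init__(self, embeddings: StubEmbeddings, texts: List[str]) -> None:
        self._items = [(embeddings.embed_query(t), t) for t in texts]

    def similarity_search_by_vector(self, embedding: List[float], k: int = 4) -> List[str]:
        ranked = sorted(self._items, key=lambda item: math.dist(item[0], embedding))
        return [text for _, text in ranked[:k]]


stub_embeddings = StubEmbeddings()
stub_store = StubVectorStore(
    stub_embeddings,
    ["short", "a medium sentence", "a much longer sentence than the others"],
)
results = semantic_search(stub_embeddings, stub_store, "tiny!", k=2)
print(results)
```

In this notebook, the equivalent call would presumably be `semantic_search(br_embeddings, vectorstore_faiss_aws, query)`, which requires a working Bedrock client.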