# Retrieval-Augmented Generation: Question Answering Based on a Custom Dataset with the Open-Source [LangChain](https://python.langchain.com/en/latest/index.html) Library



In this notebook we demonstrate how to use multiple large language models, such as **Falcon 7b** and **Llama-2 7b Chat**, to answer questions using a library of documents as a reference, by means of document embeddings and retrieval. The embeddings are generated by the **GPT-J-6B** embedding model.

**This notebook serves as a template: you can replace the example dataset with your own to build a custom question answering application.**

## Use a RAG-based approach with [LangChain](https://python.langchain.com/en/latest/index.html) and SageMaker endpoints to build a simplified question answering application


We use document embeddings to fetch the most relevant documents from our knowledge library and combine them with the prompt that we provide to the LLM.

To achieve that, we do the following:

1. **Generate embeddings for each document in the knowledge library with the SageMaker GPT-J-6B embedding model.**
2. **Identify the top K most relevant documents for the user query.**
    - 2.1 **For a query of interest, generate its embedding using the same embedding model.**
    - 2.2 **Search for the indexes of the top K most relevant documents in the embedding space using an in-memory FAISS search.**
    - 2.3 **Use the indexes to retrieve the corresponding documents.**
3. **Combine the retrieved documents with the prompt and the question and send them to the SageMaker LLM.**
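The steps above can be sketched without any endpoint. The following is a minimal illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for the GPT-J-6B embedding endpoint, and a brute-force cosine similarity stands in for the FAISS search. The function names and the prompt wording are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for the embedding endpoint: word counts."""
    return dict(Counter(text.lower().split()))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, documents: List[str], k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, score) pairs of the k documents most similar to the query."""
    doc_vectors = [toy_embed(d) for d in documents]      # step 1
    query_vector = toy_embed(query)                      # step 2.1
    scored = [(i, cosine(query_vector, v)) for i, v in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # step 2.2
    return scored[:k]


def build_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Fetch the top-k documents and combine them with the question."""
    hits = top_k(query, documents, k)
    context = "\n".join(documents[i] for i, _ in hits)   # step 2.3
    return f"Context is\n\n{context}\n\nQuestion is:\n\n{query}"  # step 3


library = [
    "managed spot training works with all instances",
    "a balance sheet lists assets and liabilities",
    "faiss performs similarity search",
]
print(build_prompt("which instances support managed spot training", library, k=1))
```

In the notebook itself, the embedding call goes to the SageMaker endpoint and the search is delegated to FAISS; only the overall flow is the same.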



Note: The retrieved document/text should be large enough to contain the information needed to answer a question, but small enough to fit into the LLM prompt, which has a maximum sequence length of 1024 tokens.
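One way to respect such a limit is to keep retrieved chunks, in ranked order, only while an estimated token count stays within the budget. The sketch below is an assumption-laden illustration: `approx_tokens` counts whitespace-separated words, which is only a rough proxy for a real tokenizer, and both function names are hypothetical.

```python
from typing import List


def approx_tokens(text: str) -> int:
    """Rough token estimate (assumption): number of whitespace-separated words."""
    return len(text.split())


def fit_to_budget(chunks: List[str], question: str, max_tokens: int = 1024) -> List[str]:
    """Keep chunks in ranked order while the estimated prompt fits in max_tokens."""
    budget = max_tokens - approx_tokens(question)
    kept: List[str] = []
    for chunk in chunks:
        cost = approx_tokens(chunk)
        if cost > budget:
            break  # the next chunk would overflow the prompt
        kept.append(chunk)
        budget -= cost
    return kept


ranked_chunks = ["a " * 400, "b " * 400, "c " * 400]
selected = fit_to_budget(ranked_chunks, "what is x")
print(len(selected))
```

A real deployment would count tokens with the tokenizer of the chosen LLM and use that model's actual sequence limit.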

---
To build a simplified QA application with LangChain, we need to:
1. Wrap our SageMaker endpoints for the embedding model and the LLM in `langchain.embeddings.SagemakerEndpointEmbeddings` and `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`. This requires a small override of the `SagemakerEndpointEmbeddings` class to make it compatible with the SageMaker embedding model.
2. Prepare the dataset to build the knowledge database.

---

## Step 1. Deploy large language model (LLM) and embedding model in SageMaker JumpStart

To better illustrate the idea, let's first deploy all the models required for the demo. You can either deploy all the inference models as LLMs to compare their performance, or select a **subset** of the models based on your preference. To do that, modify the `_MODEL_CONFIG_` Python dictionary.

In [173]:
!pip install --upgrade pip


In [4]:
!pip install --upgrade sagemaker --quiet
!pip install ipywidgets==7.0.0 --quiet
!pip install langchain==0.0.148 --quiet
!pip install faiss-cpu --quiet

In [82]:
!pip install transformers -q

In [None]:
!pip install langchain -q

In [5]:
#import the required libraries
import time
import sagemaker, boto3, json
from sagemaker.session import Session
from sagemaker.model import Model
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base
from typing import Any, Dict, List, Optional
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import ContentHandlerBase

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()
model_version = "*"



Deploy SageMaker endpoint(s) for the large language models and the GPT-J 6B embedding model. Uncomment the entries below if you want to deploy multiple LLMs to compare their performance.

In [6]:
_MODEL_CONFIG_ = {
     #"huggingface-text2text-flan-t5-xxl": {
     #    "instance type": "ml.g5.12xlarge",
     #    "env": {"SAGEMAKER_MODEL_SERVER_WORKERS": "1", "TS_DEFAULT_WORKERS_PER_MODEL": "1"},
     #    "parse_function": parse_response_model_flan_t5,
     #    "prompt": """Answer based on context:\n\n{context}\n\n{question}""",
     #    "endpoint_name": "yoar-d3-rag-huggingface-text2text-flan--2023-07-17-15-04-45-378",
     #    "input_key":"text_inputs",
     #},
    "huggingface-textembedding-gpt-j-6b": {
       "instance type": "ml.g5.12xlarge",
        "env": {"SAGEMAKER_MODEL_SERVER_WORKERS": "1", "TS_DEFAULT_WORKERS_PER_MODEL": "1"},
        "endpoint_name":"agupta-d3-rag-huggingface-textembedding-2023-07-31-14-05-13-066",
       
        
    },
    #"huggingface-llm-falcon-40b-instruct-bf16": {
    #    "instance type": "ml.g5.12xlarge",
    #    "env": {"SAGEMAKER_MODEL_SERVER_WORKERS": "1", "TS_DEFAULT_WORKERS_PER_MODEL": "1"},
    #    "parse_function": parse_response_model_falcon,
    #    "endpoint_name":"jumpstart-dft-hf-llm-falcon-40b-instruct-bf16-1",
    #   "prompt": """Please answer the question below based on this context and  If you cannot find reference for the question in the context, please answer that you Dont know:\n\n{context}\n\n{question}""",
    #    "input_key": "inputs"
    #},
    
    "meta-textgeneration-llama-2-7b": {
        "instance type": "ml.g5.2xlarge",
        "env": {"SAGEMAKER_MODEL_SERVER_WORKERS": "1", "TS_DEFAULT_WORKERS_PER_MODEL": "1"},
        "endpoint_name":"jumpstart-dft-agupta-meta-textgeneration-llama-2-7b-f",
        "prompt": """Please answer the question below based on this context:\n\n{context}\n\n{question}""",
    },
    
    
    # "huggingface-llm-falcon-7b-instruct-bf16": {
    #     "instance type": "ml.g5.12xlarge",
    #     "env": {"SAGEMAKER_MODEL_SERVER_WORKERS": "1", "TS_DEFAULT_WORKERS_PER_MODEL": "1"},
    # },
    # "huggingface-textgeneration1-bloomz-7b1-fp16": {
    #     "instance type": "ml.g5.12xlarge",
    #     "env": {},
    #     "parse_function": parse_response_multiple_texts_bloomz,
    #     "prompt": """question: \"{question}"\\n\nContext: \"{context}"\\n\nAnswer:""",
    # },
    # "huggingface-text2text-flan-ul2-bf16": {
    #     "instance type": "ml.g5.24xlarge",
    #     "env": {
    #         "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
    #         "TS_DEFAULT_WORKERS_PER_MODEL": "1"
    #     },
}

In [7]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"

# **<span style="color:red">Do not run the below block if the models are already deployed</span>.**

In [6]:

for model_id in _MODEL_CONFIG_:
    endpoint_name = name_from_base(f"agupta-d3-rag-{model_id}")
    inference_instance_type = _MODEL_CONFIG_[model_id]["instance type"]

    # Retrieve the inference container uri. This is the base HuggingFace container image for the default model above.
    deploy_image_uri = image_uris.retrieve(
        region=None,
        framework=None,  # automatically inferred from model_id
        image_scope="inference",
        model_id=model_id,
        model_version=model_version,
        instance_type=inference_instance_type,
    )
    # Retrieve the model uri.
    model_uri = model_uris.retrieve(
        model_id=model_id, model_version=model_version, model_scope="inference"
    )
    print("Setting up")
    model_inference = Model(
        image_uri=deploy_image_uri,
        model_data=model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=endpoint_name,
        env=_MODEL_CONFIG_[model_id]["env"],
    )
    print("Deploy begin")
    model_predictor_inference = model_inference.deploy(
        initial_instance_count=1,
        instance_type=inference_instance_type,
        predictor_cls=Predictor,
        endpoint_name=endpoint_name,
    )
    print(f"{bold}Model {model_id} has been deployed successfully.{unbold}{newline}")
    _MODEL_CONFIG_[model_id]["endpoint_name"] = endpoint_name


Setting up
Deploy begin
--------------!Model huggingface-textembedding-gpt-j-6b has been deployed successfully.



## Step 2: Ask the LLM a question without providing context

To illustrate why we need a retrieval-augmented generation (RAG) approach to question answering, let's ask the model a question directly and see how it responds.

#### Llama2 Chat: 7b 

In [8]:
# function to create a payload for the Llama-2 Chat Model
def create_payload(query=None,context=None):
    if context and query:
        prompt = """Context is\n\n{context}\n\nQuestion is:\n\n{question}"""
        text_input = prompt.replace("{context}", context)
        text_input = text_input.replace("{question}", query)
        system_content="""You are an expert who answers questions only 
        from the context being provided and use your expertise to extract a relevant and correct answer""" 
    elif query:
        text_input = query
        system_content="You are a chat bot who answers questions"
    else:
        text_input = ""  # or you can set it to None or some default value
        system_content="You are a chat bot who answers questions"
        

    payload = {
        "inputs": [
          [
           {"role": "system", "content": system_content},
           {"role": "user", "content": text_input}
          ]
        ],
        "parameters":{
            "max_new_tokens": 1000,
            # "return_full_text": False,
            # "do_sample": False,
            # "top_k":5
        }
    }
    
    return payload

In [160]:
#query function for the Llama-2 7b Chat model

endpoint_name = _MODEL_CONFIG_["meta-textgeneration-llama-2-7b"]["endpoint_name"]

def query_endpoint(payload):
    client = boto3.client('runtime.sagemaker')
    response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=json.dumps(payload).encode('utf-8'),CustomAttributes='accept_eula=true')
    model_predictions = json.loads(response['Body'].read())
    #print(model_predictions)
    generated_texts = model_predictions[0]['generation']
    generated_text=generated_texts['content']
    print (
        f"{bold}{generated_text}{unbold}{newline}")


In [161]:
#Llama 2 7b Chat
question="Which instances can I use with Managed Spot Training in SageMaker?"
payload=create_payload(question)
query_endpoint(payload)

 Hello! I'm here to help you with your question. Managed Spot Training is a feature in SageMaker that allows you to train machine learning models using spare AWS computing capacity. Here are some instances that you can use with Managed Spot Training in SageMaker:

1. Amazon Elastic Compute Cloud (EC2): You can use EC2 instances with Managed Spot Training to train machine learning models using the spare CPU capacity of the instances.
2. Amazon Elastic Container Service (ECS): ECS is a highly scalable, high-performance container orchestration service that can be used with Managed Spot Training to train machine learning models.
3. Amazon Lambda: Lambda is a serverless compute service that can be used with Managed Spot Training to train machine learning models without provisioning or managing servers.
4. Amazon Elastic Container Service for Kubernetes (EKS): EKS is a managed service that makes it easy to run containerized applications and workloads in a Kubernetes environment. You can 

As you can see, the generated answer is wrong: it lists AWS services rather than SageMaker instance types.

## Step 3: Improve the answer to the same question using **prompt engineering** with insightful context


To answer the question better, we provide extra contextual information, combine it with a prompt, and send it to the model together with the question. Below is an example.

In [15]:
#Answering based on context with the Llama-2 7B Chat model

question="Which instances can I use with Managed Spot Training in SageMaker?"
context="""Managed Spot Training can be used with all instances supported in Amazon SageMaker. Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available."""
payload=create_payload(question,context)

query_endpoint(payload)

 Based on the context provided, you can use any instance type supported by Amazon SageMaker with Managed Spot Training. Since Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available, you can choose any instance type that is available in the region where you are running your SageMaker job.

Therefore, the answer to your question is:

You can use any instance type supported by Amazon SageMaker with Managed Spot Training.



The output above shows that the chance of getting a correct response depends strongly on the context you send to the LLM.

**<span style="color:red">Now the question becomes: where can we find relevant context for a given user query? The answer is to use a pre-stored knowledge database with retrieval-augmented generation, as shown below.</span>**


## Step 4: Use a RAG-based approach with [LangChain](https://python.langchain.com/en/latest/index.html) and SageMaker endpoints to build a simplified question answering application

### Step 4.1: Wrap SageMaker endpoints for embedding and inference models

To use the SageMaker LLM endpoint with LangChain, we use `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`, which abstracts the SageMaker LLM endpoint. The request and response payloads must be transformed for the LangChain SageMaker integration, as shown in the following code. Note that you may need to adjust the code in `ContentHandler` based on the `content_type` and `accepts` format of the LLM you choose.
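Before calling a live endpoint, the two transformations can be checked offline. The sketch below mirrors the Llama-2 Chat payload shape used in this notebook (`inputs` as a list of dialogs, responses as `[{"generation": {"content": ...}}]`). The standalone function names, the fake response body, and its `"role": "assistant"` field are assumptions for illustration; other models will need different shapes.

```python
import io
import json


def transform_input(prompt: str, model_kwargs: dict) -> bytes:
    """Serialize a prompt into the chat-style request body used in this notebook."""
    payload = {
        "inputs": [[{"role": "user", "content": prompt}]],
        "parameters": dict(model_kwargs),
    }
    return json.dumps(payload).encode("utf-8")


def transform_output(output) -> str:
    """Read a file-like response body and extract the generated text."""
    response_json = json.loads(output.read().decode("utf-8"))
    return response_json[0]["generation"]["content"]


# Fake response body (assumption): same shape the Llama-2 Chat endpoint returns here.
fake_body = io.BytesIO(
    json.dumps([{"generation": {"role": "assistant", "content": "Hello"}}]).encode("utf-8")
)

request = transform_input("Hi", {"max_new_tokens": 16})
answer = transform_output(fake_body)
print(json.loads(request)["parameters"], answer)
```

If a different model is deployed, adapting these two functions and re-running such a check is usually quicker than debugging against the endpoint.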

We wrap the SageMaker endpoint for the embedding model in `langchain.embeddings.SagemakerEndpointEmbeddings`. This requires a small override of the `SagemakerEndpointEmbeddings` class to make it compatible with the SageMaker embedding model.

In [163]:
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler


class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int = 5) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: How many input texts are grouped together
                in one request. Defaults to 5.

        Returns:
            List of embeddings, one for each text.
        """
        results = []  # To store the results of embeddings
        if not texts:
            return results  # Nothing to embed; also avoids a zero step in range()
        # Never request more texts per call than are available
        _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size

        for i in range(0, len(texts), _chunk_size):
            response = self._embedding_func(texts[i : i + _chunk_size])
            results.extend(response)
        return results


class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        # Converts input string and model arguments to JSON and encodes it as bytes
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Decodes the JSON response from the model and extracts embeddings
        response_json = json.loads(output.read().decode("utf-8"))
        embeddings = response_json["embedding"]
        return embeddings


content_handler = ContentHandler()

embeddings = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=_MODEL_CONFIG_["huggingface-textembedding-gpt-j-6b"]["endpoint_name"],
    region_name=aws_region,
    content_handler=content_handler,
)

Next, we wrap the SageMaker endpoint for Llama 2 in `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`.

#### **<span style="color:red">The block below only works for Llama 2 Chat. To wrap any other LLM, adjust the `ContentHandler` class accordingly.</span>**

In [11]:
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        input_str = json.dumps({"inputs" : [[{"role" : "system",
        "content" : """You are a helpful, respectful and honest MBA Graduate Teaching Assistant. 
        Always answer as helpfully as possible, while being safe.  
        Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
        Please ensure that your responses are socially unbiased and positive in nature.
        If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
        If you don't know the answer to a question, please don't share false information."""},
                                             
        {"role" : "user", "content" : prompt}]],
        "parameters" : {**model_kwargs}})
        return input_str.encode('utf-8')
    
    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generation"]["content"]
    


#### Instantiate a LangChain SageMaker Endpoint Object

In [162]:
#instantiate the content handler and the endpoint wrapper for the Llama-2 7B Chat model

parameters={ "max_new_tokens": 1500, 
            "top_p": 0.9, 
            "temperature": 0.6
            
    }

content_handler = ContentHandler()

llm=SagemakerEndpoint(
     endpoint_name=_MODEL_CONFIG_["meta-textgeneration-llama-2-7b"]["endpoint_name"], 
     region_name=aws_region, 
     model_kwargs=parameters,
     endpoint_kwargs={"CustomAttributes": 'accept_eula=true'},
     content_handler=content_handler
 )

#### Create a Prompt Template

In [164]:
from langchain import PromptTemplate
template = "{content}"

prompt = PromptTemplate.from_template(template)

#### <b>Combine your SageMaker endpoint and prompt template to create an LLM chain</b>

The most basic type of chain in LangChain is the LLM chain, which combines an LLM with a prompt template. An LLM chain is instantiated with details related to your LLM and the prompt template you would like to use. You can then run the LLM chain by passing it text. The LLM chain will format that text based on the associated prompt template, and then pass the formatted text to the LLM, and provide the response of the LLM back to you.
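The behaviour described above can be sketched in a few lines. This is not LangChain's implementation: `ToyLLMChain` and `echo_llm` are hypothetical names, and the "model" is a stub that upper-cases its prompt so the example runs without an endpoint.

```python
from typing import Callable


class ToyLLMChain:
    """Minimal stand-in for an LLM chain: format a template, then call the model."""

    def __init__(self, llm: Callable[[str], str], template: str):
        self.llm = llm
        self.template = template

    def run(self, **variables: str) -> str:
        prompt = self.template.format(**variables)  # fill the prompt template
        return self.llm(prompt)                     # pass the formatted text to the model


def echo_llm(prompt: str) -> str:
    """Hypothetical model stub: returns the prompt in upper case."""
    return prompt.upper()


chain = ToyLLMChain(llm=echo_llm, template="Question: {content}")
result = chain.run(content="What is a balance sheet?")
print(result)
```

The real `LLMChain` adds input validation, callbacks, and output keys on top of this format-then-call pattern.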

In [165]:
from langchain import LLMChain
llm_chain = LLMChain(
     llm=llm,
     prompt=prompt
 )

In [166]:
result=llm_chain.run("What factors do you think are important for a company to consider when determining the appropriate amount of leverage (debt) to use? How do lenders and borrowers view this factor differently?")
print(result)

 As a responsible and ethical MBA Graduate Teaching Assistant, I must first emphasize that the decision to use leverage (debt) in a company's operations is a complex and sensitive issue that requires careful consideration of various factors. When determining the appropriate amount of leverage to use, companies must weigh the potential benefits of debt financing against the potential risks and drawbacks.

From a lender's perspective, the most important factors to consider when determining the appropriate amount of leverage for a borrower are:

1. Creditworthiness: The borrower's credit score and credit history are crucial in determining their ability to repay the loan. Lenders will typically require a higher level of leverage for borrowers with lower credit scores or limited credit history.
2. Cash flow: Lenders will want to assess the borrower's ability to generate sufficient cash flow to service their debt obligations. A borrower with a strong cash flow may be able to secure lower int

#### <b>Test the LLM hosted on the SageMaker Endpoint</b>

In [13]:
result=llm_chain.run("What is a balance sheet?")
print(result)

 Of course! As an accounting professor, I'd be happy to help you with your question. Based on your lecture notes, a balance sheet is a financial statement that presents the financial position of a company at a specific point in time. It provides a snapshot of the company's assets, liabilities, and equity at a given date.

The balance sheet is composed of three main sections:

1. Assets: These are the resources owned by the company, including cash, accounts receivable, inventory, property, plant, and equipment, and investments.
2. Liabilities: These are the debts or obligations of the company, including accounts payable, notes payable, and long-term debt.
3. Equity: This represents the ownership interest in the company, including common stock, preferred stock, and retained earnings.

The balance sheet is important because it provides a comprehensive view of a company's financial position, including its financial strengths and weaknesses. It can help investors, creditors, and other stake

## Step 4.2: Ingesting the knowledge database

### Initiate a boto3 client to connect to S3 for getting the data

In [196]:
import boto3
import os

In [194]:
def load_s3_data(bucket, s3_dir, local_dir):
    """Download every object under s3_dir in bucket into local_dir."""
    s3 = boto3.client('s3')  # Configure AWS credentials using the AWS CLI

    paginator = s3.get_paginator('list_objects_v2')

    for page in paginator.paginate(Bucket=bucket, Prefix=s3_dir):
        # 'Contents' is absent when a page has no matching objects
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.endswith('/') or key.endswith('.DS_Store'):  # Skip directory markers and .DS_Store files
                continue
            target = os.path.join(local_dir, os.path.relpath(key, s3_dir))

            # make sure all necessary directories exist
            os.makedirs(os.path.dirname(target), exist_ok=True)

            # download file
            s3.download_file(bucket, key, target)
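The mapping from an S3 key to a local file path in the download loop above can be checked without touching S3. This is a small sketch: `local_target` is a hypothetical helper, the `cases/` subfolder in the example key is made up, and `posixpath` is used instead of `os.path` so the result is the same on every platform.

```python
import posixpath


def local_target(key: str, prefix: str, local_dir: str) -> str:
    """Map an S3 object key under prefix to a path inside local_dir."""
    return posixpath.join(local_dir, posixpath.relpath(key, prefix))


example = local_target(
    "data/processed/curated_data/cases/Molex (B).pdf",  # hypothetical key
    "data/processed/curated_data/",
    "../accounting_data/",
)
print(example)
```

The sub-directory structure below the prefix is preserved, which is why the loop creates intermediate directories before each download.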

In [197]:
#Uncomment the line below if the data has not been loaded from S3 into the local directory yet.
#load_s3_data("d3-generative-ai","data/processed/curated_data/", "../accounting_data/") #load accounting data

In [None]:
!pip install unstructured

In [13]:
from langchain.document_loaders import DirectoryLoader

In [14]:
#Specify the path of the folder containing the data
directory="../accounting_data"

In [15]:
loader = DirectoryLoader(directory)

In [16]:
documents = loader.load()

The PDF <_io.BufferedReader name='../accounting_data/Letter from Prison Teaching Note.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
The PDF <_io.BufferedReader name='../accounting_data/Molex (B).pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
The PDF <_io.BufferedReader name='../accounting_data/LIFO vs FIFO technical note.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
The PDF <_io.BufferedReader name='../accounting_data/Class 11 - Target Corporation Ackman versus the Board.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring

In [186]:
len(documents)

102

## Step 4.3: Feeding data into Vector Database and building the context based Question Answering Application

In [None]:
!pip install tokenizers
!pip install tiktoken -q

In [18]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import Chroma, AtlasDB, FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain

In [19]:
# split the documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

Created a chunk of size 1243, which is longer than the specified 1000
Created a chunk of size 1046, which is longer than the specified 1000
Created a chunk of size 3518, which is longer than the specified 1000
Created a chunk of size 1429, which is longer than the specified 1000
Created a chunk of size 1405, which is longer than the specified 1000
Created a chunk of size 1199, which is longer than the specified 1000
Created a chunk of size 1240, which is longer than the specified 1000
Created a chunk of size 1162, which is longer than the specified 1000
Created a chunk of size 1104, which is longer than the specified 1000
Created a chunk of size 1180, which is longer than the specified 1000
Created a chunk of size 1068, which is longer than the specified 1000
Created a chunk of size 1012, which is longer than the specified 1000
Created a chunk of size 1328, which is longer than the specified 1000
Created a chunk of size 1381, which is longer than the specified 1000
Created a chunk of s
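The warnings above arise because, in its default configuration, `CharacterTextSplitter` only cuts at its separator (a blank line, `"\n\n"`), so a single paragraph longer than `chunk_size` is emitted whole. The sketch below is a simplified model of that behaviour, not LangChain's code; `split_on_separator` is a hypothetical name.

```python
from typing import List


def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces; a piece is never cut internally."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate  # still fits, or a lone oversized piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = "a" * 30 + "\n\n" + "b" * 10 + "\n\n" + "c" * 5
sample_chunks = split_on_separator(sample_text, chunk_size=20)
print([len(c) for c in sample_chunks])
```

The first chunk exceeds the requested size because its paragraph contains no separator. If strict size limits matter, `RecursiveCharacterTextSplitter` from `langchain.text_splitter`, which falls back to finer separators, may be worth trying.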

In [20]:
# First, we generate embeddings for each document in the knowledge library with the SageMaker GPT-J-6B embedding model
docsearch = FAISS.from_documents(docs, embeddings)

In [34]:
#add note about retriever

In [21]:
# expose the index in a retriever interface
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k":10})

In [112]:
#create a prompt template for generating questions on the list of summaries
from langchain.prompts import PromptTemplate

prompt_template = """Generate 10 questions from the provided context for an accounting exam on these topics: {question}\n Context is: \n{context}"""
Question_Prompt = PromptTemplate.from_template(prompt_template)

In [129]:
prompt_template="""
Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
\n\n{context}\n\n
Question: Generate 10 questions from the provided context for an accounting exam on these topics: {question}
\nHelpful Answer:
"""
Question_Prompt = PromptTemplate.from_template(prompt_template)

In [130]:
# create a chain to generate questions
q = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True,chain_type_kwargs={"prompt":Question_Prompt})

In [131]:
#print out the template of the question answering chain
print(q.combine_documents_chain.llm_chain.prompt.template)


Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.


{context}


Question: Generate 10 questions from the provided context for an accounting exam on these topics: {question}

Helpful Answer:



### Extracting the list of summaries which will be used to iterate over the chunks of documents

In [198]:
#Uncomment the below line to load summary excel file from s3
#load_s3_data("d3-generative-ai","data/processed/Summary/", "../Summary/") #Load the summary excel sheet

In [193]:
import pandas as pd

excel_file = '../Summary/Summary_Per_Class.xlsx'

# Read the Excel sheet into a DataFrame
data = pd.read_excel(excel_file)

In [27]:
#print out the datatypes of the columns in the dataframe
data.dtypes

Folder Name    object
Assignee       object
Summary        object
Unnamed: 3      int64
dtype: object

In [28]:
#drop the null values
data = data.dropna()

In [29]:
#convert the Summary column from Object type to string
data['Summary'] = data['Summary'].astype('string')

In [30]:
data.dtypes

Folder Name    object
Assignee       object
Summary        string
Unnamed: 3      int64
dtype: object

#### The Summary column contains comma-separated values, each corresponding to one topic/term. To iterate over the data, we create a new list of summaries in which each entry groups four of these values.

In [31]:
# Initialize a list to store grouped values
summaries = []

# Process each row in the DataFrame
for index, row in data.iterrows():
    comma_values = row['Summary'].split(',')  # Split the Summary column on commas
    
    # Group the comma-separated values into chunks of four
    for i in range(0, len(comma_values), 4):
        # Join the values and append to the list
        summaries.append(','.join(comma_values[i:i+4]))

In [32]:
len(summaries)

127
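A small worked example of the grouping step, using a made-up summary string (the terms are illustrative, and `group_terms` is a hypothetical helper that repeats the slicing logic of the loop above):

```python
from typing import List


def group_terms(summary: str, group_size: int = 4) -> List[str]:
    """Split on commas and re-join the terms in groups of group_size."""
    terms = summary.split(",")
    return [",".join(terms[i:i + group_size]) for i in range(0, len(terms), group_size)]


example_summary = "roe, dupont, ratio analysis, leverage, trade credit, liquidity"
groups = group_terms(example_summary)
print(groups)
# ['roe, dupont, ratio analysis, leverage', ' trade credit, liquidity']
```

The last group holds fewer than four terms when the count is not a multiple of four, and every group after the first keeps the leading space that followed the comma. Calling `.strip()` on each group would remove it.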

### Generate the question by iterating over the summaries 

In [35]:
#create a function to add the generated questions with sources in a dataframe
def add_to_dataframe(df,s,i,sources):
    # Split the string by lines and filter the lines that start with numbered bullet points
    rows = [line.split('. ', 1)[-1] for line in s.split('\n') if line.strip() and line.split(' ')[0].replace('.', '').isdigit()]
    # Append rows to the 'Summary', 'Question' and 'Question_Sources' columns
    df = pd.concat([df, pd.DataFrame({'Summary': [i]*len(rows),'Question': rows,'Question_Sources': [sources]*len(rows)})], ignore_index=True)
    # Return the dataframe
    return df
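The line filter inside `add_to_dataframe` can be tried on its own. The sketch below isolates that list comprehension in a hypothetical helper, `extract_numbered_lines`, and applies it to a made-up model response.

```python
from typing import List


def extract_numbered_lines(text: str) -> List[str]:
    """Keep lines whose first token is a number such as '1.' and drop that prefix."""
    return [
        line.split('. ', 1)[-1]
        for line in text.split('\n')
        if line.strip() and line.split(' ')[0].replace('.', '').isdigit()
    ]


sample_response = (
    "Sure, here are the questions:\n"
    "\n"
    "1. What is ROE?\n"
    "2. How is leverage measured?\n"
    "Not a question line"
)
parsed = extract_numbered_lines(sample_response)
print(parsed)
# ['What is ROE?', 'How is leverage measured?']
```

Note that a numbered line with leading whitespace (for example `" 1. ..."`) is skipped, because its first space-separated token is empty.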

### Testing with a subset of summaries

In [47]:
sub=summaries[:3]

In [99]:
sub

['M&A deal, revenues per square foot, revenues/sf, roe',
 ' return on equity, dupont, adjusted dupont, ratio analysis',
 ' adjust financials to make firms comparable, corporate governance practices of boards, selecting the right CEO, social issues']

In [152]:
import re
df = pd.DataFrame(columns=['Summary','Question','Question_Sources'])
questions=[]
question_sources=[]
summary=[]
for i in sub:
    result = q({"query": i})
    response = result['result']
    print(response)
    sources = result['source_documents']
    # Split the text into lines
    lines = response.split('\n')
    # Extract lines that contain a question mark
    rows = [line for line in lines if '?' in line]
    # Strip leading numbering, bullets, and any "Question N" label so that each row starts at its first letter
    cleaned_rows = [re.sub(r'^[^a-zA-Z]*(?:Question\s+\d+)?[^a-zA-Z]*', '', row, flags=re.IGNORECASE) for row in rows]
    if cleaned_rows:
        for row in cleaned_rows:
            questions.append(row)
            summary.append(i)
            question_sources.append(sources)
    else:
        # No question-like line was found: keep the raw response instead
        questions.append(response)
        summary.append(i)
        question_sources.append(sources)

df['Summary'] = summary
df['Question'] = questions
df['Question_Sources']= question_sources

 Sure, here are 10 questions based on the provided context for an accounting exam:

1. What factors does a company consider when determining the appropriate amount of leverage for a merger or acquisition? How do these factors impact the deal?
2. How does the sales growth of Family Dollar compare to Dollar General in 2013? What can be inferred from this comparison?
3. What is the difference between revenues per square foot and revenues per square foot? How do these metrics impact a company's performance?
4. How does the operating ROA of Family Dollar compare to Dollar General in 2011? What does this indicate about the financial health of the two companies?
5. What is the significance of the increase in inventory and receivables for GMCR in 2011? How does this impact the company's financial performance?
6. How does the use of trade credit impact a company's liquidity? What are the potential benefits and drawbacks of using trade credit?
7. What is the impact of Target's cash holdings on i
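
The cleaning step can be checked in isolation. The sketch below applies the same regular expression as the cell above to a made-up response; the sample text and the variable names are assumptions for illustration.

```python
import re

# Same pattern as in the cell above: drop leading non-letters and an optional "Question N" label
PREFIX = re.compile(r'^[^a-zA-Z]*(?:Question\s+\d+)?[^a-zA-Z]*', flags=re.IGNORECASE)

# Hypothetical model response mixing three formatting styles
sample_response = """Sure, here are 3 questions:

1. What is ROE?
Question 2: How is the DuPont formula used?
- Why does leverage matter?
This line has no question mark."""

# Keep only lines containing '?', then strip the leading formatting
cleaned = [PREFIX.sub('', line) for line in sample_response.split('\n') if '?' in line]
print(cleaned)
# ['What is ROE?', 'How is the DuPont formula used?', 'Why does leverage matter?']
```

Filtering on `'?'` means that a line holding two questions stays as a single row, which matches how the numbered items in the output above are stored.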

In [175]:
df

| | Summary | Question | Question_Sources |
|---|---|---|---|
| 0 | M&A deal, revenues per square foot, revenues/s... | What factors does a company consider when dete... | [page_content='Same store sales: Family Dollar... |
| 1 | M&A deal, revenues per square foot, revenues/s... | How does the sales growth of Family Dollar com... | [page_content='Same store sales: Family Dollar... |
| 2 | M&A deal, revenues per square foot, revenues/s... | What is the difference between revenues per sq... | [page_content='Same store sales: Family Dollar... |
| 3 | M&A deal, revenues per square foot, revenues/s... | How does the operating ROA of Family Dollar co... | [page_content='Same store sales: Family Dollar... |
| 4 | M&A deal, revenues per square foot, revenues/s... | What is the significance of the increase in in... | [page_content='Same store sales: Family Dollar... |
| 5 | M&A deal, revenues per square foot, revenues/s... | How does the use of trade credit impact a comp... | [page_content='Same store sales: Family Dollar... |
| 6 | M&A deal, revenues per square foot, revenues/s... | What is the impact of Target's cash holdings o... | [page_content='Same store sales: Family Dollar... |
| 7 | M&A deal, revenues per square foot, revenues/s... | How does the tapered banking model of Souqalma... | [page_content='Same store sales: Family Dollar... |
| 8 | M&A deal, revenues per square foot, revenues/s... | What is the significance of the net commission... | [page_content='Same store sales: Family Dollar... |
| 9 | M&A deal, revenues per square foot, revenues/s... | How does the ROE of GMCR compare to its peers ... | [page_content='Same store sales: Family Dollar... |


In [176]:
df.shape

(30, 3)

In [198]:
#store the dataframe in a csv
df.to_csv("questions.csv")

### Generate answers to the questions produced by the model

In [156]:
#create a new retriever or reuse the existing one to fetch chunks for answering the generated questions
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k":10})


In [177]:
# create a chain to answer questions 
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)

In [158]:
#print the template 
print(qa.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
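
To show how this template is filled, here is a minimal sketch in plain Python. It assumes that the `stuff` chain joins the retrieved chunks with a blank line before substituting them for `{context}`; the helper name `build_stuff_prompt` and the sample chunks are hypothetical.

```python
from typing import List

# Template text copied from the printed output above
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

def build_stuff_prompt(chunks: List[str], question: str) -> str:
    """Join all retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return TEMPLATE.format(context=context, question=question)

# Hypothetical retrieved chunks
prompt = build_stuff_prompt(
    ["ROE is net income divided by shareholder equity.",
     "The DuPont formula decomposes ROE into three ratios."],
    "What is ROE?",
)
print(prompt)
```

Because every retrieved chunk lands in a single prompt, the prompt length grows with `k`, which is why the number of retrieved documents and the chunk size need to be chosen with the model's context limit in mind.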


In [180]:
#testing with a subset; take an explicit copy so that new columns can be added safely
df=df.iloc[0:5].copy()

In [181]:
# Create empty lists to store the results
results_col = []
response_times_col=[]
sources_col=[]

# Iterate through each question
for question in df['Question']:

    #  Measure the response time
    start_time = time.time()

    # Call the llm chain
    result = qa({"query": question})
    response = result['result']
    
    # Calculate the response time
    response_time = time.time() - start_time
    
    sources=result['source_documents']
    # Store the response, the response time, and the source documents
    results_col.append(response)
    response_times_col.append(response_time)
    sources_col.append(sources)
    
    
df['Answer_With_Context'] = results_col
df['Response_Time_Answers_With_Context'] = response_times_col
df['Answer_Sources']=sources_col



In [182]:
df

| | Summary | Question | Question_Sources | Answer_With_Context | Response_Time_Answers_With_Context | Answer_Sources |
|---|---|---|---|---|---|---|
| 0 | M&A deal, revenues per square foot, revenues/s... | What factors does a company consider when dete... | [page_content='Same store sales: Family Dollar... | As a helpful and respectful MBA Graduate Teac... | 19.507791 | [page_content='In considering which M&A target... |
| 1 | M&A deal, revenues per square foot, revenues/s... | How does the sales growth of Family Dollar com... | [page_content='Same store sales: Family Dollar... | Thank you for the question. To answer your qu... | 9.053071 | [page_content='2. How is Dollar General perfor... |
| 2 | M&A deal, revenues per square foot, revenues/s... | What is the difference between revenues per sq... | [page_content='Same store sales: Family Dollar... | Great, I'm glad you asked! Revenues per squar... | 15.427165 | [page_content='I ask students to debate the tr... |
| 3 | M&A deal, revenues per square foot, revenues/s... | How does the operating ROA of Family Dollar co... | [page_content='Same store sales: Family Dollar... | Thank you for the question! To answer your qu... | 10.43178 | [page_content='2. How is Dollar General perfor... |
| 4 | M&A deal, revenues per square foot, revenues/s... | What is the significance of the increase in in... | [page_content='Same store sales: Family Dollar... | Thank you for asking! The increase in invento... | 9.324757 | [page_content='This discussion should highligh... |


In [171]:
#store the results in csv
df.to_csv("prompt_responses.csv")

### Generating Answers without Context

In [183]:
template = "{content}"
prompt = PromptTemplate.from_template(template)
llm_chain = LLMChain(
     llm=llm,
     prompt=prompt
 )
#Create empty lists to store the results
general_answers = []
response_times_col=[]

for question in df['Question']:
    #  Measure the response time
    start_time = time.time()
    # Call the llm chain, passing the question as the template's `content` variable
    response = llm_chain.run(content=question)

    # Calculate the response time
    response_time = time.time() - start_time

    # Store the response and the response time
    general_answers.append(response)
    response_times_col.append(response_time)

#Add the lists as new columns in the dataframe
df['General_Answers'] = general_answers
df['Response_Times_General_Answers'] = response_times_col



In [184]:
df

| | Summary | Question | Question_Sources | Answer_With_Context | Response_Time_Answers_With_Context | Answer_Sources | General_Answers | Response_Times_General_Answers |
|---|---|---|---|---|---|---|---|---|
| 0 | M&A deal, revenues per square foot, revenues/s... | What factors does a company consider when dete... | [page_content='Same store sales: Family Dollar... | As a helpful and respectful MBA Graduate Teac... | 19.507791 | [page_content='In considering which M&A target... | As an MBA Graduate Teaching Assistant, I'm ha... | 20.855622 |
| 1 | M&A deal, revenues per square foot, revenues/s... | How does the sales growth of Family Dollar com... | [page_content='Same store sales: Family Dollar... | Thank you for the question. To answer your qu... | 9.053071 | [page_content='2. How is Dollar General perfor... | As a responsible and ethical MBA Graduate Tea... | 8.791997 |
| 2 | M&A deal, revenues per square foot, revenues/s... | What is the difference between revenues per sq... | [page_content='Same store sales: Family Dollar... | Great, I'm glad you asked! Revenues per squar... | 15.427165 | [page_content='I ask students to debate the tr... | Hello! As an MBA Graduate Teaching Assistant,... | 15.144954 |
| 3 | M&A deal, revenues per square foot, revenues/s... | How does the operating ROA of Family Dollar co... | [page_content='Same store sales: Family Dollar... | Thank you for the question! To answer your qu... | 10.43178 | [page_content='2. How is Dollar General perfor... | As a responsible and ethical MBA Graduate Tea... | 9.214614 |
| 4 | M&A deal, revenues per square foot, revenues/s... | What is the significance of the increase in in... | [page_content='Same store sales: Family Dollar... | Thank you for asking! The increase in invento... | 9.324757 | [page_content='This discussion should highligh... | Thank you for your question. I'm glad to help... | 13.759195 |


In [None]:
#store the results in csv
df.to_csv("all_prompt_responses.csv")

## **<span style="color:red"> This marks the end of the notebook. The following blocks of code are part of the experimentation process </span>**

### Query Function for Falcon 40B Model

#### Falcon 40B Model

In [136]:
# Build a request payload for the Falcon 40B model
def create_payload_falcon(query=None,context=None):
    if context and query:
        # Ask the model to answer from the supplied context only
        prompt = (
            "Please answer the question below based on the provided context. "
            "If you cannot find a reference for the question in the context, "
            "please answer that you don't know:"
            "\n\nContext is: \n\n{context}\n\nQuestion is:\n\n{question}"
        )
        text_input = prompt.replace("{context}", context)
        text_input = text_input.replace("{question}", query)
    elif query:
        # No context supplied: send the bare question
        text_input = query
    else:
        text_input = ""  # or you can set it to None or some default value

    payload = {
        "inputs": text_input,
        "parameters": {
            "max_new_tokens": 100,
            # "return_full_text": False,
            # "do_sample": False,
            # "top_k": 5
        },
    }

    return payload

In [168]:
#query function for falcon model

endpoint_name = 'jumpstart-dft-hf-llm-falcon-40b-instruct-bf16-1'

def query_endpoint_falcon(payload):
    client = boto3.client('runtime.sagemaker')
    response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=json.dumps(payload).encode('utf-8'))
    model_predictions = json.loads(response['Body'].read())
    generated_text = model_predictions[0]['generated_text']
    print (
        f"{bold}{generated_text}{unbold}{newline}")


In [110]:
question="Which instances can I use with Managed Spot Training in SageMaker?"
payload=create_payload_falcon(question)
query_endpoint_falcon(payload)

You can use Managed Spot Training in SageMaker with the following instances:
- ml.m5.xlarge
- ml.m5.2xlarge
- ml.m5.4xlarge
- ml.m5.8xlarge
- ml.m5.16xlarge
- ml.m5d.xlarge
- ml.m5d.2xlarge
- ml.m5d.4xlarge
- ml.



**<span style="color:red">Running this section will overwrite the `documents` variable defined in the code above. </span>**

### Documents in .csv format

Now, let's download the example data and prepare it for the demonstration. We will use [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/faqs/) as the knowledge library. The data is formatted as a CSV file with two columns, Question and Answer. We use the Answer column as the documents of the knowledge library, from which relevant documents are retrieved based on a query.


In [None]:

original_data = "s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/"

!mkdir -p rag_data
!aws s3 cp --recursive $original_data rag_data


download: s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv to rag_data/Amazon_SageMaker_FAQs.csv


If your data is saved in multiple subsets, the following code reads all files that end with `.csv` and concatenates them. Please ensure that each `csv` file has the same format.

In [None]:

import glob
import os
import pandas as pd

all_files = glob.glob(os.path.join("rag_data/", "*.csv"))

df_knowledge = pd.concat(
    (pd.read_csv(f, header=None, names=["Question", "Answer"]) for f in all_files),
    axis=0,
    ignore_index=True,
)


Drop the `Question` column as it is not used in this demonstration.

In [None]:
df_knowledge.drop(["Question"], axis=1, inplace=True)

In [None]:
df_knowledge.head(5)

In [None]:
df_knowledge.to_csv("rag_data/processed_data.csv", header=False, index=False)

In [None]:
loader = CSVLoader(file_path="rag_data/processed_data.csv")

In [None]:
documents = loader.load()

### Alternate approach to creating a FAISS Index

### Method 3: VectorstoreIndexCreator

`VectorstoreIndexCreator` exposes a higher-level interface that lets you get started in a few lines of code. The following code shows how the class is used to create a concise implementation of question answering with RAG. Next, we call the `query` method on the created index and pass the user's question and the SageMaker endpoint LLM. LangChain selects the top four closest documents (K=4) and passes the relevant context extracted from those documents to the LLM to generate a response.

In [29]:
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=embeddings,
    text_splitter=CharacterTextSplitter(chunk_size=800, chunk_overlap=50),
)

In [30]:
index = index_creator.from_loaders([loader])

In [31]:
question="What is a Balance Sheet"

In [32]:
question

'What is a Balance Sheet'

In [33]:
index.query(question=question, llm=llm)

" Based on the provided context, a Balance Sheet is a financial statement that presents the financial position of a business at a specific point in time. It provides a snapshot of the company's assets, liabilities, and equity, and is used to assess the business's financial health and performance. The Balance Sheet is one of the two main financial statements used in accounting, the other being the Income Statement.\n\nThe Balance Sheet is structured to show the following components:\n\n1. Assets: These are the resources owned or controlled by the business, such as cash, accounts receivable, inventory, property, and equipment.\n2. Liabilities: These are the debts or obligations of the business, such as accounts payable, loans, and taxes owed.\n3. Equity: This represents the ownership interest in the business, including common stock, retained earnings, and other reserves.\n\nThe Balance Sheet is important because it helps stakeholders, such as investors and creditors, understand the finan

## **<span style="color:red">Run this section only if you want to use Pinecone vector database for testing with test data</span>** ##

### Testing Pinecone as our Vector database

In [None]:
!pip install pinecone-client -q

In [37]:
#importing libraries
import os
import pinecone
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import Chroma, AtlasDB, FAISS
from langchain.text_splitter import CharacterTextSplitter

In [64]:
#splitting the documents into chunks before storing in the database
from langchain.text_splitter import RecursiveCharacterTextSplitter
def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs

docs = split_docs(documents)
print(len(docs))

732
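
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window splitter. This is only a sketch under stated assumptions: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph breaks, whereas the hypothetical `chunk_text` helper below cuts at fixed character offsets.

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 20) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks

# Toy input of 100 characters with small windows so the overlap is easy to see
toy_text = "abcdefghij" * 10
toy_chunks = chunk_text(toy_text, chunk_size=40, chunk_overlap=10)
print([len(c) for c in toy_chunks])
# [40, 40, 40]
```

Each window repeats the last 10 characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk, provided it is shorter than the overlap.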


In [39]:
#check the dimensionality of the embeddings, which is needed when creating an index on Pinecone.
query_result = embeddings.embed_query("Hello world")
len(query_result)

4096

In [None]:
!pip install python-dotenv

In [None]:
import os
from dotenv import load_dotenv

In [None]:
load_dotenv()  # load variables such as PINECONE_API_KEY from a local .env file, if one exists
pinecone_key=os.getenv("PINECONE_API_KEY")

Before running the following cell, you need to create an index on Pinecone. Provide a name for the index and the dimensionality of the embeddings being stored (4096, as computed above).

In [65]:
pinecone.init(
    api_key=pinecone_key,
    environment="us-west4-gcp-free" #change the environment according to your Pinecone index
)

index_name = "qna" #this must match the name you provided when creating the index.

index_p = Pinecone.from_documents(docs, embeddings, index_name=index_name)

In [128]:
#this function performs a search for chunks of documents which might be relevant to answer the question being asked.
def get_similar_docs(query, k=4, score=False):
    if score:
        similar_docs = index_p.similarity_search_with_score(query, k=k)
    else:
        similar_docs = index_p.similarity_search(query, k=k)
    return similar_docs

In [129]:
query="What are the objectives of accounting"
similar_docs = get_similar_docs(query,score=True)

In [154]:
similar_docs

[(Document(page_content='The answers to these questions are to be found continuously and the best way to find them is to record all the business activities. Recording of business activities has to be done in a scientific manner so that they reveal correct outcome. The science of book-keeping and accounting provides an effective solution. It is a branch of social science. This study material aims at giving a platform to the students to understand basic principles and concepts, which can be applied to accurately measure performance of business. After studying the various chapters included herein, the student should be able to apply the principles, rules, conventions and practices to different business situations like trading, manufacturing or service.\n\nDEFINITIONS\n\nDefinition of Accounting\n\nDefinition by the American Institute of Certified Public Accountants (Year 1961):', metadata={'source': 'raw/Fundamentals-of-accounting.pdf'}),
  0.73249197),
 (Document(page_content='Under this

In [130]:
context=""
for doc in similar_docs:
    # Extract the 'page_content' from the Document object
    page_content = doc[0].page_content

    # Append the 'page_content' to the context_variable
    context += page_content + "\n"

In [152]:
context

'The answers to these questions are to be found continuously and the best way to find them is to record all the business activities. Recording of business activities has to be done in a scientific manner so that they reveal correct outcome. The science of book-keeping and accounting provides an effective solution. It is a branch of social science. This study material aims at giving a platform to the students to understand basic principles and concepts, which can be applied to accurately measure performance of business. After studying the various chapters included herein, the student should be able to apply the principles, rules, conventions and practices to different business situations like trading, manufacturing or service.\n\nDEFINITIONS\n\nDefinition of Accounting\n\nDefinition by the American Institute of Certified Public Accountants (Year 1961):\nUnder this principle, accounting data must be verified. In other words, documentary evidence of transactions must be made which are cap

In [132]:
import tiktoken
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [133]:
num_tokens=num_tokens_from_string(context,"cl100k_base")

In [134]:
num_tokens

386

In [135]:
#Falcon 40b Model
query="What are the objectives of accounting"
payload=create_payload_falcon(query,context)
query_endpoint_falcon(payload)

?

The objectives of accounting are:

1. To provide information to the users of accounting information.
2. To provide information to the management for planning, controlling and decision-making.
3. To provide information to the government for tax purposes.
4. To provide information to the creditors for assessing the creditworthiness of the business.
5. To provide information to the shareholders for assessing the profitability of the business.
6. To provide information to the employees for assessing



In [153]:
#Llama 2 7b Chat (prompt2)
payload=create_payload(query,context)
query_endpoint(payload)

Based on the provided context, the objectives of accounting are:

1. To provide reliable and verifiable information about business activities.
2. To record all business activities in a scientific manner to ensure accurate measurement of performance.
3. To provide a platform for students to understand basic principles and concepts of accounting.
4. To apply principles, rules, conventions, and practices to different business situations.
5. To ensure uniformity and understandability of accounting



In [158]:
#Llama 2 7b Chat 
query="Give me a list of all question answer pairs from the provided context that capture all the information in the context."
payload=create_payload(query,context)
query_endpoint(payload)

Certainly! Based on the provided context, here are all the question answer pairs that capture the information:

1. What is the definition of accounting?
Answer: Accounting is defined as the science of recording, classifying, and reporting financial transactions and events of a business entity.
2. What is the definition of accounting by the American Institute of Certified Public Accountants (AICPA)?
Answer: According to the AICPA, accounting is the art of recording, classifying, and reporting financial transactions and events of a business entity in a manner that provides a fair presentation of its financial position and performance.
3. What is the purpose of accounting?
Answer: The purpose of accounting is to provide financial information that is useful to users, such as investors, creditors, and management, in making economic decisions.
4. What are the three fundamental assumptions of accounting?
Answer: The three fundamental assumptions of accounting are:
	* The business will co

In [102]:
#llama 2 7b Chat (prompt1)
query_endpoint(payload)

Based on the context provided, the objectives of accounting can be summarized as follows:

1. To provide a platform for students to understand basic principles and concepts of accounting and their application in measuring the performance of business.
2. To verify and validate accounting data through documentary evidence, ensuring reliability and dependability of the information.
3. To ensure uniformity and understandability of accounting practices and procedures.
4. To provide standards for



In [105]:
#llama2 13b chat(prompt1)
query_endpoint_13(payload)

Based on the context provided, the objectives of accounting are:

1. To record and report financial transactions in a scientific manner to provide accurate and reliable information to stakeholders.
2. To provide a platform for students to understand basic principles and concepts of accounting, which can be applied to different business situations.
3. To ensure uniformity and understandability of accounting practices and procedures.
4. To provide standards for accounting practices and procedures to ensure



In [151]:
#llama2 13b chat (prompt2)
payload=create_payload(query,context)
query_endpoint_13(payload)

Based on the provided context, the objectives of accounting are:

1. To provide reliable and verifiable information about the financial activities and performance of a business.
2. To ensure uniformity and understandability of accounting practices and procedures.
3. To record financial facts on a sound basis and logical considerations.
4. To provide a platform for students to understand basic principles and concepts of accounting, which can be applied to different business situations.
5.



In [80]:
prompt_template = """Answer based on context:\n\n{context}\n\n{question}"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

In [81]:
chain = load_qa_chain(llm=sm_llm, prompt=PROMPT)

### Method 1: Load QA Chain

In this section, we show one approach to implementing RAG using SageMaker and LangChain. This approach offers the flexibility to configure the top K parameter for the relevancy search over the documents. It also lets you use LangChain prompt templates, which parameterize prompt creation instead of hard-coding the prompts.

In the following code, we use a FAISS index built over the embeddings that the SageMaker GPT-J-6B embedding model generates for each document in the knowledge library. Then we identify the top K (K=10) most relevant documents based on the query.
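
As a rough picture of what a top K search does, the sketch below ranks a few toy vectors against a query by brute force. It is not how FAISS works internally: FAISS uses its own index structures and distance metric, and cosine similarity is swapped in here purely for illustration. The function names and the 2-dimensional vectors are assumptions.

```python
import math
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 3) -> List[int]:
    """Return the indexes of the k documents most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy embeddings standing in for the 4096-dimensional GPT-J vectors
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_vec = [0.9, 0.1]
best = top_k(query_vec, doc_vecs, k=2)
print(best)
# [0, 2]
```

The returned indexes are then used to look up the corresponding document chunks, which is the step that `similarity_search` performs for you.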

In [17]:
summary="""M&A deal, revenues per square foot, revenues/sf, roe, return on equity, dupont, adjusted dupont, ratio analysis"""

Based on the summary above, we then **identify the top K most relevant documents**.

In [18]:
similar_docs = docsearch.similarity_search(summary, k=10)

In [19]:
def get_context(documents):
    context = [doc.page_content for doc in documents]
    return context
context=get_context(similar_docs)

In [20]:
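#define a method to count the number of tokens retrieved from the documents
#note: cl100k_base is an OpenAI encoding, so the count is only an approximation for other models' tokenizers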

import tiktoken
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
num_of_tokens=0
for doc in similar_docs:
    num_of_tokens+=num_tokens_from_string(doc.page_content,"cl100k_base")
num_of_tokens

2072
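
When the retrieved text is larger than the prompt can hold, one simple option is to keep chunks in ranked order until a token budget is reached. The sketch below assumes a hypothetical helper, `fit_to_budget`, and uses a whitespace word count as a crude stand-in for a real tokenizer; a function such as `num_tokens_from_string` could be passed in instead.

```python
from typing import Callable, List

def fit_to_budget(
    chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Keep chunks in their given (ranked) order until adding one more would exceed max_tokens."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += n
    return kept

# Toy chunks with 3, 2, and 4 "tokens" under the whitespace counter
ranked_chunks = ["a b c", "d e", "f g h i"]
kept = fit_to_budget(ranked_chunks, max_tokens=5)
print(kept)
# ['a b c', 'd e']
```

Stopping at the first chunk that does not fit keeps the most relevant chunks and preserves their order. Lowering `k` or the chunk size at retrieval time is an alternative that avoids retrieving text only to discard it.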

`load_qa_chain` provides the most generic interface for answering questions. It loads a chain that performs QA over the input documents you pass in, and it uses ALL of the text in those documents.
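
The three `chain_type` values used in the cells below differ in how the documents reach the model. The following sketch mimics each strategy with a stub in place of the LLM so that the number of model calls is visible. It is a simplified illustration, not LangChain's implementation: the prompt wording, the function names, and `fake_llm` are all assumptions.

```python
from typing import Callable, List

def stuff(llm: Callable[[str], str], docs: List[str], question: str) -> str:
    """Put every document into one prompt and call the model once."""
    context = "\n\n".join(docs)
    return llm(f"{context}\n\nQuestion: {question}")

def map_reduce(llm: Callable[[str], str], docs: List[str], question: str) -> str:
    """Ask the question of each document separately, then combine the partial answers."""
    partial = [llm(f"{doc}\n\nQuestion: {question}") for doc in docs]
    return llm("\n\n".join(partial) + f"\n\nQuestion: {question}")

def refine(llm: Callable[[str], str], docs: List[str], question: str) -> str:
    """Answer from the first document, then revise the answer with each later document."""
    answer = llm(f"{docs[0]}\n\nQuestion: {question}")  # docs must not be empty
    for doc in docs[1:]:
        answer = llm(f"Existing answer: {answer}\n\nNew context: {doc}\n\nQuestion: {question}")
    return answer

calls: List[str] = []

def fake_llm(prompt: str) -> str:
    """Stub model that records each prompt it receives."""
    calls.append(prompt)
    return f"answer#{len(calls)}"

docs = ["chunk A", "chunk B", "chunk C"]
for strategy in (stuff, map_reduce, refine):
    calls.clear()
    strategy(fake_llm, docs, "What is ROE?")
    print(strategy.__name__, len(calls))
# stuff 1
# map_reduce 4
# refine 3
```

With `stuff`, the single prompt must hold all documents at once, while `map_reduce` and `refine` trade extra model calls for smaller prompts.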

In [23]:
#Using chain_type=stuff
chain = load_qa_chain(llm=llm,chain_type="stuff")
question="Generate 10 questions from the context provided. Also provide answers to those questions from the context only."
result=chain.run(input_documents=similar_docs,question=question)

In [24]:
print(result)

 Sure, I'd be happy to help! Here are 10 questions generated from the context provided:

1. What factors does the manager consider when determining the appropriate amount of leverage for the company?
Answer: The manager considers factors such as collateralizable assets, future profitability, and cash generation when determining the appropriate amount of leverage for the company.
2. How does the manager actions and decisions impact the performance metrics of the company?
Answer: The manager's actions and decisions can impact the performance metrics of the company, such as profitability, turnover, and leverage.
3. What is the difference between ROA and ROE, and how do they relate to the company's performance?
Answer: ROA (Return on Assets) measures a company's profitability in relation to its total assets, while ROE (Return on Equity) measures a company's profitability in relation to its shareholder equity. A higher ROA and ROE indicates better performance for the company.
4. How does th

In [128]:
#Using chain_type=map_reduce
chain = load_qa_chain(llm=llm,chain_type="map_reduce")
question="Generate 10 questions and answer pairs from the context provided."
result=chain.run(input_documents=similar_docs,question=question)
print(result)

Token indices sequence length is longer than the specified maximum sequence length for this model (7439 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1884 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2379 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3001 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2773 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence leng

 Great! Here are the answers to the questions you provided:

1. The president did not mention Michael Jackson.
2. The board of directors has a fiduciary duty of care, loyalty, and candor to their shareholders when it comes to M&A.
3. Historical ratio analysis can help guide M&A considerations by providing a framework for evaluating the financial performance of potential targets.
4. The board of directors has a critical role in M&A, including evaluating potential targets, negotiating deals, and ensuring that the company is getting the best possible price and terms.
5. Some of the risks associated with M&A include the risk of not being able to integrate the target company successfully, the risk of underestimating the costs of the acquisition, and the risk of not being able to achieve the expected synergies.
6. To ensure that they are getting the best possible price and terms in an M&A deal, companies can conduct thorough due diligence, negotiate aggressively, and be transparent about the

In [129]:
#Using chain_type=refine
chain = load_qa_chain(llm=llm,chain_type="refine")
question="Generate 10 questions and answer pairs from the context provided."
result=chain.run(input_documents=similar_docs,question=question)
print(result)

 Sure, I'd be happy to help! Based on the new context provided, here are some refined answers to the original questions:

1. What is the DuPont formula, and how is it used in performance analysis?

The DuPont formula is a way of analyzing a company's return on equity (ROE) into three components: net profit margin, asset turnover, and equity multiplier. The formula is used to provide a more comprehensive picture of a company's financial performance and to identify areas for improvement. In the context of Target's acquisition strategy, the DuPont formula can be used to evaluate the financial performance of Target and its potential acquisition targets, and to identify areas where they may be underperforming.

2. How does the board's role change during an acquisition bid?

During an acquisition bid, the board's role becomes especially emphasized, as the board must carefully evaluate the potential acquisition and its impact on the company. The board must ensure that the acquisition aligns w
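
As a rough illustration of what `chain_type="refine"` does: the chain answers from the first document, then revisits that answer once for each remaining document. The sketch below mimics that loop in plain Python. `refine_answer` and `fake_llm` are hypothetical stand-ins, not LangChain APIs, and the prompt wording is an assumption rather than LangChain's actual template.

```python
from typing import Callable, List


def refine_answer(docs: List[str], question: str, llm: Callable[[str], str]) -> str:
    """Answer from the first document, then refine once per remaining document."""
    if not docs:
        raise ValueError("need at least one document")
    # Initial pass: only the first document is shown to the model.
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}\nAnswer:")
    # Refinement passes: one model call per additional document.
    for doc in docs[1:]:
        prompt = (
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context:\n{doc}\n"
            "Refine the existing answer if the new context helps:"
        )
        answer = llm(prompt)
    return answer


# Hypothetical stand-in for a real LLM endpoint: it only counts calls.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer after {len(calls)} call(s)"


final = refine_answer(["doc A", "doc B", "doc C"], "What is ROE?", fake_llm)
print(final)  # one model call per document, so three calls here
```

Because every call after the first asks the model to revise an existing answer, a response that opens with "here are some refined answers" is plausibly the model echoing the refine prompt rather than following the original question.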

### Method 2: RetrievalQA

The `RetrievalQA` chain uses `load_qa_chain` under the hood. It retrieves the most relevant chunks of text and feeds them to the language model.
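
The retrieve-then-"stuff" flow can be sketched without any LangChain dependency. This is a toy under stated assumptions: `retrieve` and `build_stuff_prompt` are hypothetical helpers, word overlap is swapped in for the embedding similarity search, and the prompt text is illustrative only.

```python
from typing import List


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks sharing the most words with the query (toy scoring)."""
    query_words = set(query.lower().split())
    # sorted() is stable, so chunks with equal scores keep their original order.
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_stuff_prompt(query: str, retrieved: List[str]) -> str:
    """Concatenate ("stuff") all retrieved chunks into a single prompt."""
    context = "\n\n".join(retrieved)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"


chunks = [
    "Return on equity is net income divided by shareholder equity.",
    "The store opened in 1955.",
    "The DuPont formula splits return on equity into three ratios.",
]
query = "What is return on equity?"
top = retrieve(query, chunks, k=2)
prompt = build_stuff_prompt(query, top)
print(prompt)
```

With `chain_type="stuff"`, all retrieved chunks go into one prompt, so a large `k` can exceed the model's maximum sequence length.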

In [27]:
# expose the index in a retriever interface
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k":10})
# create a chain to answer questions 
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)
query = """Generate 10 questions from the provided context on these topics: 
M&A deal, revenues per square foot, revenues/sf, roe, return on equity, dupont, adjusted dupont, ratio analysis.
Also provide answers to those questions from the context only."""
result = qa({"query": query})
result_text = result['result']
print(result_text)

 Sure, I'd be happy to help! Here are 10 questions generated from the provided context, along with answers to those questions based on the information provided in the context:

1. What factors do you think are critical for driving success in the dollar store industry?

Answer: The critical factors for driving success in the dollar store industry include offering competitive prices, maintaining a wide selection of products, providing excellent customer service, and optimizing store operations to maximize efficiency and profitability.

2. What is your assessment of Dollar General's performance over time and relative to its competitors?

Answer: Based on the financial ratios provided in the context, Dollar General's performance appears to have improved over time, with increases in profitability and asset turnover. However, its return on equity (ROE) has declined relative to its competitors, which may be a concern for investors.

3. If you were Rick Dreiling, how do you think an acquisitio

In [28]:
for doc in result['source_documents']:
    print(doc.page_content)
    print('\n')
    print('Source is')
    print(doc.metadata['source'])
    print('\n')

Patent expiration: lower margins and lower sales  Brand partnerships: lower margins since the brands capture most of the value  Licensee acquisition: lower asset turnover  Goodwill write-offs: losses, book value written off  Revenue recognition: channel stuffing, lower sales growth  Growth in CapEx: lower asset turnover

In conclusion, Einhorn exposed management issues and drew attention to inconsistencies between

GMCR’s reports and actions. His presentation damaged the company’s credibility and its valuation.

Q: How should the company respond to Einhorn, and what challenges does it face?

If time permits the instructor can spend some time on this question. The commentary following my fictional case study in the Harvard Business Review, “A Short-Seller Crashes the Party” Harvard Business Review 91, no. 12 (December 2013) addresses this question.

What Happened?


Source is
../accounting_data/GMCR Teaching Note.pdf


What factors would you consider in determining the appropriate 

In [29]:
# expose the index in a retriever interface
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k":10})
# create a chain to answer questions 
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)
query = """Generate 10 questions from the provided context on these topics: 
M&A deal, revenues per square foot, revenues/sf, roe, return on equity, dupont, adjusted dupont, ratio analysis."""
result = qa({"query": query})
result_text = result['result']
print(result_text)

 Sure, here are 10 questions that could be generated from the provided context:

1. How does the company's M&A deal impact its financial performance, particularly in terms of ROE and return on equity?
2. What is the significance of the company's revenue per square foot, and how does it compare to its competitors?
3. How does the company's use of adjusted Dupont analysis affect its financial ratios and overall financial performance?
4. What is the impact of the company's high levels of debt on its financial ratios and overall financial health?
5. How does the company's ratio analysis compare to that of its competitors, and what insights can be gained from this comparison?
6. What is the significance of the company's negative net debt ratio, and how does it impact its financial performance?
7. How does the company's cash flow situation impact its ability to invest in growth opportunities and meet its financial obligations?
8. What is the impact of the company's acquisition of licensees o

In [43]:
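import csv

def slice_string_to_csv(s, filename):
    """Write every numbered line of `s` (e.g. "1. ...") to `filename`, one per row."""
    with open(filename, 'w', newline='') as file:
        writer = csv.writer(file)
        for line in s.split('\n'):
            # Strip first so indented items such as " 1. ..." are not skipped
            stripped = line.strip()
            # Keep the line only if its first token is a number like "1." or "10."
            if stripped and stripped.split(' ')[0].replace('.', '').isdigit():
                writer.writerow([stripped])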

In [42]:
filename = "questions.csv"
slice_string_to_csv(result_text, filename)
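
If the question number and the question text are wanted in separate columns, a regex-based variant is one option. The sketch below is an assumption-laden alternative, not part of the original notebook: `parse_numbered_lines` and the column names are hypothetical, and it writes to an in-memory buffer so it can run without touching the file system.

```python
import csv
import io
import re

# Matches lines such as "1. What is ROE?" (leading spaces allowed).
NUMBERED = re.compile(r"^\s*(\d+)\.\s+(.*\S)\s*$")


def parse_numbered_lines(text):
    """Return (number, text) tuples for every line of the form 'N. text'."""
    rows = []
    for line in text.split("\n"):
        match = NUMBERED.match(line)
        if match:
            rows.append((int(match.group(1)), match.group(2)))
    return rows


sample = " Sure, here are some questions:\n\n1. What is ROE?\n 2. How is it used?\nnot a question"
rows = parse_numbered_lines(sample)

# Write to an in-memory buffer here; pass an open file object to write to disk instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["number", "question"])
writer.writerows(rows)
print(buffer.getvalue())
```

Only lines that start with a number followed by a period are kept, so answers printed on their own unnumbered lines would need separate handling.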