# Lab 3 - App Deployment

In [None]:
!pip install sagemaker==2.163.0 --upgrade

In [None]:
!pip install sagemaker-studio-image-build

In [None]:
%pip install -q sagemaker-studio-image-build aws-sam-cli

In [None]:
import sagemaker
import boto3
import os
import tarfile
import requests
import json
from io import BytesIO
from tqdm import tqdm

## Attach IAM Role policy

We are going to attach an IAM policy to the SageMaker execution role.

In [None]:
# Retrieve SM execution role 
role = sagemaker.get_execution_role()
print(role)

# The role ARN ends in ".../<role-name>", so the last path segment is the role name
role_parts = role.split(':')[-1].split('/')
ExeRole_name = role_parts[-1]

print("Please copy command below and paste in AWS CloudShell")
print("aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AdministratorAccess --role-name " + ExeRole_name)

![CloudShell](../img/d-cs.png)

<span style="color:red">Disclaimer: Attaching the AdministratorAccess policy is not recommended for production. Use it only in this lab environment.</span>

# Frontend

All relevant components for building a dockerized frontend application can be found in the "fe" directory. It consists of the following files: 
- ```app.py```: actual frontend utilizing the popular streamlit framework
- ```Dockerfile```: Dockerfile providing the blueprint for the creation of a Docker image
- ```requirements.txt```: specifying the dependencies required to be installed for hosting the frontend application
- ```setup.sh```: setup script containing all the necessary steps to create an ECR repository, build the Docker image and push it to the repository we created

## Streamlit 

[Streamlit](https://streamlit.io/) is an open-source Python library that makes it easy to create and share custom web apps for machine learning and data science. It is a very popular frontend framework for rapid prototyping in the AI/ML space, since easy-to-use web pages can be built in minutes with nothing more than Python skills.

## UI

The chatbot frontend web application "AWSomeChat" looks as follows:

![chat-frontend](../img/chat-frontend.png)

To chat with the chatbot enter a message into the light grey input box and press ENTER. The chat conversation will appear below.

At the top of the page you can see the session ID assigned to your chat conversation. It is used to map conversation histories to a specific user, since the chatbot backend is stateless. To start a new conversation, press the "Clear Chat" and "Reset Session" buttons at the top right of the page.


## Dockerization and hosting

In order to prepare our frontend application to be hosted as a Docker container, we execute the bash script setup.sh. It looks as follows: 

```bash 
#!/bin/bash

# Get the AWS account ID
aws_account_id=$(aws sts get-caller-identity --query Account --output text)
aws_region=$(aws configure get region)

echo "AccountId = ${aws_account_id}"
echo "Region = ${aws_region}"


# Create a new ECR repository
echo "Creating ECR Repository..."
aws ecr create-repository --repository-name rag-app

# Get the login command for the new repository
echo "Logging into the repository..."
#$(aws ecr get-login --no-include-email)
aws ecr get-login-password --region ${aws_region} | docker login --username AWS --password-stdin ${aws_account_id}.dkr.ecr.${aws_region}.amazonaws.com

# Build and push the Docker image and tag it
echo "Building and pushing Docker image..."
sm-docker build -t "${aws_account_id}.dkr.ecr.${aws_region}.amazonaws.com/rag-app:latest" --repository rag-app:latest .
```

The script performs the following steps in a sequential manner:

1. Retrieve the AWS account ID and region.
2. Create a new ECR repository with the name rag-app. Note: this operation will fail if the repository already exists within your account. This is intended behaviour and can be ignored.
3. Log in to the ECR registry of your account.
4. Build the Docker image and tag it with the "latest" tag using the sagemaker-studio-image-build package we previously installed. The "sm-docker build" command will push the built image into the specified repository automatically. All compute will be carried out in AWS CodeBuild.
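
To make the image naming in the last step more concrete, here is a minimal sketch of how a fully qualified ECR image URI is composed from the account ID, region, repository name and tag. The helper name `ecr_image_uri` and the account ID in the usage line are placeholders for illustration, and the sketch assumes the standard `aws` partition.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build the fully qualified ECR image URI for a repository and tag."""
    # The registry host is specific to the account and the region
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


# Placeholder account ID, not a real account
print(ecr_image_uri("123456789012", "eu-west-1", "rag-app"))
```

Because the region is part of the registry host, the URI passed to the build has to use the same region as the one used for the ECR login.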

In [None]:
# Run setup.sh
!cd fe && bash setup.sh

# Orchestration layer

## Create Lambda function codebase 

We will now look into the orchestrator implementation, which is meant to be hosted on AWS Lambda with a Python runtime. You can find the source code in the ```rag_app``` directory. It consists of the following components:
- ```kendra``` directory: implementation of the Kendra retriever. This can be used as is and does not require further attention.
- ```rag_app.py```: implementation of the orchestration layer as AWS Lambda handler function.
- ```requirements.txt```: specifying the dependencies required to be installed for hosting the orchestration layer.

Let's dive a bit deeper into the code of the AWS Lambda handler function ```rag_app.py```. First, we import the required libraries: 


```python
import json
import os
from langchain.chains import ConversationalRetrievalChain
from langchain import SagemakerEndpoint
from langchain.prompts.prompt import PromptTemplate
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain.llms.sagemaker_endpoint import ContentHandlerBase, LLMContentHandler
from langchain.memory import ConversationBufferWindowMemory
from langchain import PromptTemplate, LLMChain
from langchain.memory.chat_message_histories import DynamoDBChatMessageHistory
from kendra.kendra_index_retriever import KendraIndexRetriever
```


We are using the following libraries:
- json: built-in Python package, which can be used to work with JSON data.
- os: built-in Python package providing miscellaneous operating system interfaces.
- langchain: several classes from this framework for developing applications powered by language models. For a detailed description see above.
- kendra: Kendra retriever module, pointing to the implementation in the ```kendra``` directory.

Next, we retrieve the AWS region and the Kendra index ID from the Lambda function's environment variables. We will need them further down in the implementation.


```python
REGION = os.environ.get('REGION')
KENDRA_INDEX_ID = os.environ.get('KENDRA_INDEX_ID')
```
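
Since `os.environ.get` silently returns `None` for a variable that is not set, a misconfigured Lambda function would only fail later, when the value is first used. A minimal sketch of a stricter lookup is shown below. The helper `require_env` is hypothetical and not part of the lab code.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        # Fail early with a message that names the missing variable
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

With this helper, the two lookups above could be written as `REGION = require_env('REGION')` and `KENDRA_INDEX_ID = require_env('KENDRA_INDEX_ID')`.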


In the next step we define the LLM we want to use through the ```SagemakerEndpoint``` class.


```python
# Generative LLM 
class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt, model_kwargs):
        # model specific implementation
        return ...
    
    def transform_output(self, output):
        # model specific implementation
        return ...

content_handler = ContentHandler()

llm=SagemakerEndpoint(
    endpoint_name=SM_ENDPOINT_NAME,
    model_kwargs={...}, # model specific hyperparameters
    region_name=REGION, 
    content_handler=content_handler, 
)
```


The ```ContentHandler``` transforms the input and output of the model into the required format. Its implementation can differ from model to model, which is why we need to adjust our ```ContentHandler``` according to the model option we chose before. In this step we can also define model-specific parameters such as the temperature or the maximum length of the generated content. In this lab, we stick with the parameter settings provided in the code.

As shown above, the ```SagemakerEndpoint``` class requires the endpoint name. It is passed to the Lambda function through an environment variable, which we will configure further down in the notebook.
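
To illustrate what the two ```ContentHandler``` methods typically do, here is a minimal sketch written as plain functions. The payload layout is an assumption made for this example: the request is taken to be `{"inputs": ..., "parameters": ...}` and the response a list containing a `generated_text` field. The model deployed in this lab may expect a different schema, so treat the field names as placeholders.

```python
import json
from io import BytesIO


def transform_input(prompt: str, model_kwargs: dict) -> bytes:
    """Serialize the prompt and model parameters into the request body."""
    # Assumed request schema, adjust to the deployed model
    payload = {"inputs": prompt, "parameters": model_kwargs}
    return json.dumps(payload).encode("utf-8")


def transform_output(output) -> str:
    """Read the response stream and extract the generated text."""
    # Assumed response schema: [{"generated_text": "..."}]
    body = json.loads(output.read().decode("utf-8"))
    return body[0]["generated_text"]


request = transform_input("What is Amazon EC2?", {"temperature": 0.1})
# BytesIO stands in for the streaming body returned by the endpoint
fake_response = BytesIO(json.dumps([{"generated_text": "A compute service."}]).encode("utf-8"))
print(transform_output(fake_response))
```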


As discussed before, for retrieval-augmented generation with chat memory, the first of two chain steps condenses the prompt and the chat memory into a standalone question for retrieval. We therefore want to adjust the prompt used in this step to the specific model we are using. This can be achieved with the ```PromptTemplate``` class, as shown below.


```python
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language. 

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""

CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)
```
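
To see what the LLM receives in the condensing step, the template can be filled in by hand with plain string formatting. This is only a sketch: the helper `render_condense_prompt` and the "Human"/"Assistant" speaker labels are assumptions for illustration, and the chain may format the chat history differently.

```python
# Same template text as in the lab code above
TEMPLATE = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""


def render_condense_prompt(chat_history, question):
    """Fill the template with a list of (speaker, text) pairs and the new question."""
    # The speaker labels are an assumption made for this illustration
    history_text = "\n".join(f"{speaker}: {text}" for speaker, text in chat_history)
    return TEMPLATE.format(chat_history=history_text, question=question)


history = [
    ("Human", "What is Amazon EC2?"),
    ("Assistant", "Amazon EC2 provides resizable compute capacity in the cloud."),
]
print(render_condense_prompt(history, "Okay. How can I create one?"))
```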


Within the Lambda handler function, which is executed once per chat conversation, we specify a ```ConversationBufferWindowMemory``` with ```k=3```, instructing the memory to keep track of only the past 3 conversation turns. To persist this data in the "MemoryTable" DynamoDB table, we use a ```DynamoDBChatMessageHistory``` with a ```session_id``` matching the table's partition key.


```python
message_history = DynamoDBChatMessageHistory(table_name="MemoryTable", session_id=uuid)
memory = ConversationBufferWindowMemory(memory_key="chat_history", chat_memory=message_history, return_messages=True, k=3)
```
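
The effect of the window size can be sketched with a plain list. This is not the library's implementation, only an illustration of the idea, under the assumption that one conversation turn consists of one human message followed by one AI reply. The function name `window` is hypothetical.

```python
def window(messages, k=3):
    """Keep only the messages that belong to the last k conversation turns."""
    # One turn is assumed to be two messages: the human input and the AI reply
    return messages[-2 * k:] if k > 0 else []


# Five turns of placeholder messages
conversation = [f"{speaker} {turn}" for turn in range(1, 6) for speaker in ("human", "ai")]
print(window(conversation, k=3))  # only the messages of turns 3, 4 and 5 remain
```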


Then we initialize the ```KendraIndexRetriever```, pointing it to the Kendra index we created in the region we are operating in.


```python
retriever = KendraIndexRetriever(kendraindex=KENDRA_INDEX_ID, awsregion=REGION, return_source_documents=True)
```


Finally, we assemble the ```ConversationalRetrievalChain``` from all the components specified above and execute it with its ```.run()``` function.


```python
qa = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory, condense_question_prompt=CONDENSE_QUESTION_PROMPT, verbose=True)
response = qa.run(query)   
```
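
The control flow behind the chain can be sketched with plain functions: first the follow-up question is condensed into a standalone question, then documents are retrieved and the answer is generated. The function `answer_with_rag` and the three stub callables are hypothetical stand-ins for the LLM and the retriever. The sketch only shows the order of the steps, not the library's internals.

```python
def answer_with_rag(question, chat_history, condense, retrieve, generate):
    """Run the two chain steps: condense the question, then retrieve and answer."""
    # Step 1: rewrite the follow-up into a standalone question.
    # In this sketch the step is skipped when there is no history yet.
    standalone = condense(chat_history, question) if chat_history else question
    # Step 2: fetch supporting documents and generate the final answer
    documents = retrieve(standalone)
    return generate(standalone, documents)


# Stub components, only for demonstration
def demo_condense(history, question):
    return "How can I create an EC2 instance?"


def demo_retrieve(question):
    return ["(retrieved passage)"]


def demo_generate(question, documents):
    return f"Answer to '{question}' based on {len(documents)} document(s)"


print(answer_with_rag(
    "Okay. How can I create one?",
    [("What is Amazon EC2?", "A compute service.")],
    demo_condense, demo_retrieve, demo_generate,
))
```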

# Application Deployment

Finally, we want to put all pieces together and deploy the LLM-powered chatbot application we have created throughout the lab. 

## Infrastructure as Code: CloudFormation and SAM

In line with AWS and DevOps best practices, we will conduct an Infrastructure as Code deployment for the majority of the application stack. For this we will use the [AWS Serverless Application Model (SAM)](https://aws.amazon.com/serverless/sam/).

## 🚨 Deploy stack with SAM

Before we deploy the AWS SAM stack, we need to adjust the Lambda function's environment variable pointing to the Kendra index.

🚨🚨 **Please overwrite the placeholder \*\*\*KENDRA_INDEX_ID\*\*\* in the file ```template.yml``` (you can search with CTRL/CMD+F) with the index ID of the Kendra index we created.**

Further, we need to adjust the Lambda function's environment variable pointing to the LLM we've deployed.

🚨🚨 **Please overwrite the placeholder \*\*\*SM_ENDPOINT_NAME\*\*\* in the file ```template.yml``` (you can search with CTRL/CMD+F) with the endpoint name of the model we've deployed.**

\*\*\*KENDRA_INDEX_ID\*\*\*
![kendraindex](../img/kendraindex.png)

![get-kendra-index](../img/get-kendra-index.gif)

![edit-template](../img/tem.png)

Now we are ready for deployment. To deploy, we run the following two steps:


In [None]:
# Building the code artifacts
!sam build

In [None]:
# Deploying the stack
!sam deploy --stack-name rag-stack --resolve-s3 --capabilities CAPABILITY_IAM

Once the deployment is done, we can go to the CloudFormation service and select the "Resources" tab of the stack "rag-stack". Click on the "Physical ID" of the LoadBalancer and copy the DNS name from the page you are forwarded to. You can now reach the web application in a browser by using this DNS name as the URL.

![get-url](../img/get-url.gif)

# Application testing

Now that we are in the chat, let us try out some questions for our chatbot, while keeping in mind the resource constraints that we have in the demo accounts.

Let's ask about Amazon EC2: what it is, how we can create an instance, and some more information about it.
Take a look at the conversation below and try to think about why the answers are structured as they are.

<p align="center">
  <img src="../img/ChatEC2.png" alt="A chat with the model about EC2">
</p>

First of all, we can see that the LLM has memory about the previous conversation turn, as we reference EC2 implicitly via "Okay. How can I create one?" 

Secondly, we see the shortcoming of the low number of retrieved characters on the Kendra side. This can be solved by increasing this limit in your own account.

#### Discussions about the patents that we uploaded 
Patents can be among the hardest documents to find and read, and their claims are difficult to investigate. After all, the claims of a patent describe exactly what has been protected. It would therefore be good to have an easier way of interacting with them.
Let's see how far we get when we add a patent database to our system.
<p align="center">
  <img src="../img/PatentChat.png" alt="A chat with the model about one of the patents we downloaded">
</p>

### Conclusion:
We have two main drivers for the quality of the interaction. 
- The retrieval quality of our retriever. For Kendra, there are plenty of options to optimise the retrieval quality through human feedback, metadata, query optimisation and tuning search relevance to name only a few. However, this is out of scope for this workshop. We would like to point the interested reader to the [docs](https://docs.aws.amazon.com/kendra/latest/dg/tuning.html) as well as the [Kendra workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/df64824d-abbe-4b0d-8b31-8752bceabade/en-US). 
- The LLM that we are using for the chat interaction. Here, especially models with larger context windows can be helpful to get wider context. 

To conclude, RAG can be a very helpful approach to augment your company's internal and external search. The retrieval and LLM quality are of high importance to this approach, and the generated load on the systems can be substantial. For this reason in particular, the costs of a token-based and an infrastructure-based pricing model should be compared carefully.
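
A rough way to compare the two pricing models is a break-even calculation. The sketch below is a simplification under stated assumptions: a fixed number of tokens per request, an endpoint that runs around the clock on a fixed number of instances, and no other cost components. All function names are hypothetical and the numbers in the usage lines are illustrative placeholders, not actual prices.

```python
def monthly_token_cost(requests_per_month, tokens_per_request, price_per_1k_tokens):
    """Cost of a token-based model for a given monthly request volume."""
    return requests_per_month * tokens_per_request / 1000 * price_per_1k_tokens


def monthly_endpoint_cost(hourly_instance_price, instance_count=1, hours_per_month=730):
    """Cost of an always-on endpoint, independent of the request volume."""
    return hourly_instance_price * instance_count * hours_per_month


def break_even_requests(tokens_per_request, price_per_1k_tokens,
                        hourly_instance_price, instance_count=1, hours_per_month=730):
    """Monthly request volume above which the always-on endpoint becomes cheaper."""
    cost_per_request = tokens_per_request / 1000 * price_per_1k_tokens
    endpoint_cost = monthly_endpoint_cost(hourly_instance_price, instance_count, hours_per_month)
    return endpoint_cost / cost_per_request


# Illustrative placeholder values, not real prices
print(monthly_token_cost(10_000, 2_000, 0.01))
print(monthly_endpoint_cost(1.0))
print(break_even_requests(2_000, 0.01, 1.0))
```

Below the break-even volume the token-based model is cheaper in this simplified view, above it the always-on endpoint is.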
