# Lab 2 - Building the LLM-powered chatbot "AWSomeChat" with retrieval-augmented generation

## Introduction

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and the similarly sized open-source BLOOM 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

While these models generalize quite well and are therefore capable of performing a vast number of generic tasks without having been specifically trained on them, they lack domain-specific, proprietary and recent knowledge. This knowledge, however, is required in most use cases across organisations. With fine-tuning, in Lab 2 we already learned one option to infuse domain-specific or proprietary knowledge into a Large Language Model. However, this option can become complex and costly, especially if carried out on a regular basis to ensure access to recent information. Luckily, there are several design patterns that wrap Large Language Models into powerful applications able to overcome even these challenges with a lightweight footprint.

In this Lab, we'll explore how to build GenAI-powered applications capable of performing tasks within a specific domain. The application we will build step by step leverages the retrieval-augmented generation (RAG) design pattern and consists of multiple components drawn from the broad service portfolio of AWS.

## Background and Details

We have two primary [types of knowledge for LLMs](https://www.pinecone.io/learn/langchain-retrieval-augmentation/): 
- **Parametric knowledge**: refers to everything the LLM learned during training and acts as a frozen snapshot of the world for the LLM. 
- **Source knowledge**: covers any information fed into the LLM via the input prompt. 

When trying to infuse knowledge into a generative AI-powered application, we need to choose which of these types to target. Lab 2 deals with elevating the parametric knowledge through fine-tuning. Since fine-tuning is a resource-intensive operation, this option is well suited for infusing static domain-specific information like domain-specific language/writing styles (medical domain, science domain, ...) or optimizing performance towards a very specific task (classification, sentiment analysis, RLHF, instruction-finetuning, ...).

In contrast, targeting the source knowledge for a domain-specific performance uplift is very well suited for all kinds of dynamic information, from knowledge bases in structured and unstructured form up to the integration of information from live systems. This Lab is about retrieval-augmented generation, a common design pattern for ingesting domain-specific information through the source knowledge. It is particularly well suited for ingesting information in the form of unstructured text with semi-frequent update cycles.

The application we will be building in this lab is an LLM-powered chatbot infused with domain-specific knowledge on AWS services. You will be able to chat through a chatbot frontend and receive information going beyond the parametric knowledge encoded in the model.


### Retrieval-augmented generation (RAG)

![rag-concept](../img/rag-concept.png)

The design pattern of retrieval-augmented generation is depicted in the above figure. It works as follows:

- Step 0: Knowledge documents / document sequences are encoded and ingested into a vector database. 
- Step 1: Customer e-mail query is pre-processed and/or tokenized
- Step 2: Tokenized input query is encoded
- Step 3: Encoded query is used to retrieve the most similar text passages in the document index using vector similarity search (e.g., Maximum Inner Product Search)
- Step 4: Top-k retrieved documents/text passages in combination with original customer e-mail query and e-mail generation prompt are fed into Generator model (Encoder-Decoder) to generate response e-mail
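
The steps above can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions: the three documents are made up, the "encoder" is a normalised bag-of-words vector rather than a neural embedding model, and the final prompt is printed instead of being sent to a generator model. The function names (`encode`, `retrieve`, `build_prompt`) are hypothetical and not part of the lab code.

```python
import math
from collections import Counter

# Toy knowledge base standing in for the document index (hypothetical content).
DOCUMENTS = [
    "Amazon Kendra is an intelligent search service for enterprise documents.",
    "Amazon DynamoDB is a serverless key-value and document database.",
    "Amazon ECS runs Docker containers on a managed cluster.",
]


def tokenize(text):
    """Step 1: lower-case the text and split it into alphanumeric tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def encode(text):
    """Steps 0 and 2: encode text as an L2-normalised bag-of-words vector."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def inner_product(a, b):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def retrieve(query, index, k=2):
    """Step 3: return the k documents with the highest inner product."""
    query_vector = encode(query)
    scored = sorted(
        index,
        key=lambda item: inner_product(query_vector, item[1]),
        reverse=True,
    )
    return [doc for doc, _ in scored[:k]]


def build_prompt(query, passages):
    """Step 4: combine retrieved passages and the query into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Step 0: encode every document once and keep (text, vector) pairs as the index.
INDEX = [(doc, encode(doc)) for doc in DOCUMENTS]

question = "Which service runs Docker containers?"
top_passages = retrieve(question, INDEX, k=1)
prompt = build_prompt(question, top_passages)
print(prompt)
```

In the lab itself, the encoding and similarity search are handled by the managed retrieval service and the prompt is passed to the generator model; the sketch only shows how the pieces fit together.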

### Architecture

![rag-architecture](../img/rag-architecture.png)

The figure above shows the architecture of the LLM-powered chatbot with the retrieval-augmented generation component we will be implementing in this lab. It consists of the following components:
- Document store & semantic search: We use Amazon Kendra, a semantic document search service, both as a fully managed embeddings/vector store and as a fully managed solution for retrieving documents based on questions/asks in natural language.
- Response generation: For the chatbot response generation, we use the open-source encoder-decoder model FLAN-T5-XXL conveniently deployed in a one-click fashion through Amazon SageMaker JumpStart right into your VPC.
- Orchestration layer: For hosting the orchestration layer, implemented with the popular framework LangChain, we choose a serverless approach. The orchestration layer is exposed as a RESTful API service via an Amazon API Gateway.
- Conversational Memory: To keep track of the different chatbot conversation turns while keeping the orchestration layer stateless, we integrate the chatbot's memory with Amazon DynamoDB as a storage component.
- Frontend: The chatbot frontend is a web application hosted in a Docker container on Amazon ECS. For storing the container image we leverage Amazon ECR. The website is exposed through an Amazon Elastic Load Balancer.
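
To make the interplay of the components above more concrete, here is a minimal sketch of one stateless chat turn. It is not the lab's actual orchestration code: `InMemoryHistoryStore` is a hypothetical stand-in for the DynamoDB table, and `fake_retriever` and `fake_generator` are stubs replacing the Kendra retriever and the SageMaker endpoint, so the example runs without any AWS resources.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryHistoryStore:
    """Hypothetical stand-in for a DynamoDB table keyed by session id."""
    items: dict = field(default_factory=dict)

    def load(self, session_id):
        return list(self.items.get(session_id, []))

    def save(self, session_id, history):
        self.items[session_id] = list(history)


def handle_request(session_id, user_message, store, retriever, generator):
    """Process one chat turn without keeping any state in the handler itself."""
    history = store.load(session_id)              # conversational memory
    passages = retriever(user_message)            # document store & semantic search
    prompt_lines = ["Context:"] + [f"- {p}" for p in passages]
    prompt_lines += [f"{role}: {text}" for role, text in history]
    prompt_lines += [f"Human: {user_message}", "AI:"]
    answer = generator("\n".join(prompt_lines))   # response generation
    history += [("Human", user_message), ("AI", answer)]
    store.save(session_id, history)               # persist the updated turns
    return answer


# Hypothetical stubs so the sketch runs without any AWS resources.
def fake_retriever(query):
    return ["Amazon Kendra is an intelligent search service."]


def fake_generator(prompt):
    return f"(answer based on a prompt with {len(prompt.splitlines())} lines)"


store = InMemoryHistoryStore()
first = handle_request("session-1", "What is Kendra?", store, fake_retriever, fake_generator)
second = handle_request("session-1", "Is it managed?", store, fake_retriever, fake_generator)
print(first)
print(second)
```

Because all state lives in the store, any instance of the handler can serve any turn of a conversation, which is what makes a serverless deployment of the orchestration layer straightforward.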

## Instructions

### Prerequisites

#### To run this workshop...
You need a computer with a web browser, preferably the latest version of Chrome or Firefox.
Read and follow, in order, the instructions described in AWS Hosted Event and Work Environment Set Up.

#### Recommended background
It will be easier for you to run this workshop if you have:

- Experience with deep learning models
- Familiarity with Python or other similar programming languages
- Experience with Jupyter notebooks
- Beginner-level knowledge and experience with SageMaker Hosting/Inference
- Beginner-level knowledge and experience with Large Language Models

#### Target audience
Data Scientists, ML Engineers, ML Infrastructure Engineers, MLOps Engineers, Technical Leaders.
Intended for customers working with large Generative AI models including Language, Computer vision and Multi-modal use-cases.
Customers using EKS/EC2/ECS/On-prem for hosting or experience with SageMaker.

Level of expertise - 300

#### Time to complete
Approximately 30 minutes.

# Installation and import of required dependencies, further setup tasks

For this lab, we will use the following libraries:

 - sagemaker-studio-image-build, a CLI for building and pushing Docker images in SageMaker Studio using AWS CodeBuild and Amazon ECR.
 - aws-sam-cli, an open-source CLI tool that helps you develop IaC-defined serverless applications and deploy them to AWS.
 - SageMaker SDK for interacting with Amazon SageMaker. We especially want to highlight the classes 'HuggingFaceModel' and 'HuggingFacePredictor', which use the built-in HuggingFace integration of the SageMaker SDK. These classes encapsulate functionality around the model and the deployed endpoint we will use. They inherit from the generic 'Model' and 'Predictor' classes of the native SageMaker SDK, but implement additional functionality specific to HuggingFace and the HuggingFace model hub.
 - boto3, the AWS SDK for Python
 - os, a python library implementing miscellaneous operating system interfaces 
 - tarfile, a python library to read and write tar archive files
 - io, native Python library, provides Python’s main facilities for dealing with various types of I/O.
 - tqdm, a utility to easily show a smart progress meter for synchronous operations.

In [None]:
!pip install sagemaker==2.163.0 --upgrade

In [None]:
import sagemaker
import boto3
import os
import tarfile
import requests
import json
from io import BytesIO
from tqdm import tqdm

# Setup of S3 bucket for Amazon Kendra storage of knowledge documents

Amazon Kendra provides multiple built-in adapters for integrating with data sources to build up a document index, for example S3, a web scraper, RDS, Box and Dropbox. In this lab we will store the documents containing the knowledge to be infused into the application in S3. For this purpose, we create a dedicated S3 bucket if it is not already present.

In [None]:
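# Resolve the current region and account ID; both are used in the bucket names below
region = boto3.Session().region_name
account_id = boto3.client('sts').get_caller_identity()['Account']
# Bucket name for Amazon Kendra knowledge document storage
prefix = 'gen-ai-immersion-day-kendra-storage'
model_bucket_name = f'{prefix}-{account_id}-{region}'
model_bucket_name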

In [None]:
# Create S3 bucket
from botocore.exceptions import ClientError

s3_client = boto3.client('s3', region_name=region)
location = {'LocationConstraint': region}

bucket_name = model_bucket_name

# Check if bucket already exists (head_bucket raises ClientError if it is missing or inaccessible)
bucket_exists = True
try:
    s3_client.head_bucket(Bucket=bucket_name)
except ClientError:
    bucket_exists = False

# Create bucket if it does not exist
if not bucket_exists:
    if region == 'us-east-1':
        # us-east-1 does not accept a LocationConstraint
        s3_client.create_bucket(Bucket=bucket_name)
    else:
        s3_client.create_bucket(Bucket=bucket_name,
                                CreateBucketConfiguration=location)
    print(f"Bucket '{bucket_name}' created successfully")

# Document store and retriever component

## Amazon Kendra

[Amazon Kendra](https://docs.aws.amazon.com/kendra/latest/dg/what-is-kendra.html) is an intelligent search service that uses natural language processing and advanced machine learning algorithms to return specific answers to search questions from your data.

Unlike traditional keyword-based search, Amazon Kendra uses its semantic and contextual understanding capabilities to decide whether a document is relevant to a search query. It returns specific answers to questions, giving users an experience that's close to interacting with a human expert.

With Amazon Kendra, you can create a unified search experience by connecting multiple data repositories to an index and ingesting and crawling documents. You can use your document metadata to create a feature-rich and customized search experience for your users, helping them efficiently find the right answers to their queries.
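
As a rough sketch of what a natural-language query against an index can look like from Python, the helper below calls the `query` operation of a Kendra client and extracts title, excerpt and URI from each result item. The helper name `search_kendra`, the index ID and the sample response are hypothetical; `FakeKendraClient` is a stub that only mimics the shape of a query response so the example runs without an index.

```python
def search_kendra(kendra_client, index_id, question, top_k=3):
    """Query a Kendra index and return (title, excerpt, uri) tuples."""
    response = kendra_client.query(IndexId=index_id, QueryText=question)
    results = []
    for item in response.get("ResultItems", [])[:top_k]:
        title = item.get("DocumentTitle", {}).get("Text", "")
        excerpt = item.get("DocumentExcerpt", {}).get("Text", "")
        results.append((title, excerpt, item.get("DocumentURI", "")))
    return results


class FakeKendraClient:
    """Hypothetical stub that mimics the shape of a Kendra query response."""

    def query(self, IndexId, QueryText):
        # Made-up sample data, not real query output.
        return {
            "ResultItems": [
                {
                    "Type": "DOCUMENT",
                    "DocumentTitle": {"Text": "enclaves-user.pdf"},
                    "DocumentExcerpt": {"Text": "AWS Nitro Enclaves is an Amazon EC2 feature..."},
                    "DocumentURI": "s3://example-bucket/enclaves-user.pdf",
                }
            ]
        }


hits = search_kendra(FakeKendraClient(), "example-index-id", "What are Nitro Enclaves?")
for title, excerpt, uri in hits:
    print(f"{title}: {excerpt} ({uri})")
```

Against a real index, you would pass `boto3.client("kendra")` and your own index ID instead of the stub.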

## Uploading knowledge documents into an Amazon Kendra index

Next, we are going to add some more documents from S3 to show how easy it is to integrate different data sources into a Kendra index.
First, we download some PDF files from the internet; feel free to add any other PDFs you find interesting as well.

In [None]:
import os
import boto3
import requests
from io import BytesIO
from tqdm import tqdm

# Create an S3 client
s3 = boto3.client('s3', region_name=region)

# Create the bucket if it doesn't exist yet
bucket_name = f'immersion-day-bucket-{account_id}-{region}'
existing_buckets = [bucket['Name'] for bucket in s3.list_buckets()['Buckets']]
if bucket_name not in existing_buckets:
    if region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        # Outside us-east-1, S3 requires an explicit location constraint
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': region})

# List of URLs to download PDFs from
pdf_urls = [
    "https://patentimages.storage.googleapis.com/bb/0f/5a/6ef847538a6ab5/US10606565.pdf",
    "https://patentimages.storage.googleapis.com/f7/50/e4/81af7ddcbb2773/US9183397.pdf",
    "https://docs.aws.amazon.com/pdfs/enclaves/latest/user/enclaves-user.pdf",
    "https://docs.aws.amazon.com/pdfs/ec2-instance-connect/latest/APIReference/ec2-instance-connect-api.pdf",
]

# Download PDFs from the URLs and upload them to the S3 bucket
for url in tqdm(pdf_urls):
    response = requests.get(url, stream=True)
    response.raise_for_status()  # stop early on a failed download
    filename = os.path.basename(url)
    print(f"Working on {filename}")
    fileobj = BytesIO()
    total_size = int(response.headers.get('content-length', 0))
    block_size = 1024
    progress_bar = tqdm(total=total_size, unit='iB', unit_scale=True)
    # Stream the file into memory in small chunks
    for data in response.iter_content(block_size):
        progress_bar.update(len(data))
        fileobj.write(data)
    progress_bar.close()
    # Rewind the buffer so the upload starts at the first byte
    fileobj.seek(0)
    s3.upload_fileobj(fileobj, bucket_name, filename)

Let's use those documents in Kendra. First, navigate to the Kendra console.

Under "Data Management" you will find the tab "Data Sources". Navigate there and add a new data source via "Add data source".
Take some time to inspect the different connectors that are available out of the box. We will use S3 as our source.

It is worth noting that Kendra respects enterprise-level access attributes. This means it can deny queries if a user is not authorized to retrieve a document.

You can add the sample bucket that is offered at the top of the connector list as a data source, but for the sake of demonstration we will add our downloaded PDFs as well.

The animation below shows how to add an S3 data source to Kendra for indexing. We create a new IAM role and set the indexing frequency to "on-demand".

<p align="center">
  <img src="../img/new_s3_connection.gif" alt="How to add a kendra s3 data source "/>
</p>


After the connection has been established, you can sync your data source by clicking "Sync now".
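
If you prefer to trigger the sync from the notebook rather than the console, a sketch along the following lines could work. It assumes a Kendra client exposing `start_data_source_sync_job` and `list_data_source_sync_jobs`, as boto3's Kendra client does; the helper name, the set of in-progress statuses, the IDs and the `"TIMED_OUT"` label are assumptions made for this illustration, and `FakeKendraClient` is a stub so the example runs without an index.

```python
import time

# Status values treated as "still running" (an assumption for this sketch).
IN_PROGRESS = {"SYNCING", "SYNCING_INDEXING", "STOPPING"}


def sync_data_source(kendra_client, index_id, data_source_id, poll_seconds=30, max_polls=40):
    """Start a sync job and poll until it leaves the in-progress states."""
    started = kendra_client.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)
    execution_id = started["ExecutionId"]
    for _ in range(max_polls):
        jobs = kendra_client.list_data_source_sync_jobs(Id=data_source_id, IndexId=index_id)
        # Find the status of the job we just started
        status = next(
            (job["Status"] for job in jobs.get("History", [])
             if job.get("ExecutionId") == execution_id),
            None,
        )
        if status is not None and status not in IN_PROGRESS:
            return status
        time.sleep(poll_seconds)
    return "TIMED_OUT"  # label invented for this sketch, not a Kendra status


class FakeKendraClient:
    """Hypothetical stub that reports a job moving from syncing to succeeded."""

    def __init__(self):
        self._statuses = iter(["SYNCING", "SYNCING_INDEXING", "SUCCEEDED"])

    def start_data_source_sync_job(self, Id, IndexId):
        return {"ExecutionId": "exec-1"}

    def list_data_source_sync_jobs(self, Id, IndexId):
        return {"History": [{"ExecutionId": "exec-1", "Status": next(self._statuses)}]}


final_status = sync_data_source(
    FakeKendraClient(), "example-index-id", "example-data-source-id", poll_seconds=0
)
print(final_status)
```

With a real client, keep the default polling interval or longer, since syncing and indexing documents can take several minutes.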

It is worth noting that the maximum file size by default in the Developer Edition of Kendra is 5 MB.
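
To spot documents that would exceed the size limit mentioned above before syncing, a small check like the following may help. It is only a sketch: the helper name is hypothetical, the limit is interpreted as 5 × 1024 × 1024 bytes (an assumption), and `FakeS3Client` is a stub returning made-up object sizes in the shape of a `list_objects_v2` response.

```python
# Size limit quoted in the text above, interpreted as mebibytes (an assumption).
MAX_BYTES = 5 * 1024 * 1024


def find_oversized_objects(s3_client, bucket, max_bytes=MAX_BYTES):
    """Return (key, size) pairs for objects larger than max_bytes.

    Only the first page of results (up to 1,000 keys) is inspected.
    """
    response = s3_client.list_objects_v2(Bucket=bucket)
    return [
        (obj["Key"], obj["Size"])
        for obj in response.get("Contents", [])
        if obj["Size"] > max_bytes
    ]


class FakeS3Client:
    """Hypothetical stub with made-up object sizes."""

    def list_objects_v2(self, Bucket):
        return {
            "Contents": [
                {"Key": "small.pdf", "Size": 1024},
                {"Key": "large.pdf", "Size": 6 * 1024 * 1024},
            ]
        }


oversized = find_oversized_objects(FakeS3Client(), "example-bucket")
print(oversized)
```

With real data, pass the `s3` client and `bucket_name` from the upload cell instead of the stub.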

For an improved generative AI experience, we recommend requesting a larger document excerpt to be returned. This is not possible in the AWS-provided workshop accounts. If you are using your own account, navigate to [Service Quotas](https://console.aws.amazon.com/servicequotas/home/services/kendra/quotas/L-196E775D) in the browser window you are using for the AWS Management Console, choose Request quota increase, and change the quota value to a number up to a maximum of 750.