# Lab: Building the LLM-powered chatbot "AWSomeChat" with retrieval-augmented generation

## Introduction

In this lab, we'll explore how to build GenAI-powered applications that perform tasks within a specific domain. The application, which we build step by step, uses the retrieval-augmented generation (RAG) design pattern and combines several services from the broad AWS portfolio.

## Background and Details

LLMs draw on two primary [types of knowledge](https://www.pinecone.io/learn/langchain-retrieval-augmentation/):
- **Parametric knowledge**: everything the LLM learned during training. It acts as a frozen snapshot of the world for the LLM.
- **Source knowledge**: any information fed into the LLM via the input prompt.
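
To make the distinction concrete, the following minimal sketch shows how source knowledge reaches a model: it is simply text placed into the prompt next to the question. The function name `build_prompt`, the prompt wording, and the sample passage are illustrative assumptions, not part of the lab code.

```python
def build_prompt(question: str, context_passages: list[str]) -> str:
    """Combine passages (source knowledge) with the user question.

    Whatever the model knows beyond these passages comes from its
    parametric knowledge, which is fixed at training time.
    """
    context = "\n".join(f"- {passage}" for passage in context_passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical passage standing in for a retrieved document.
passages = ["AWSomeChat uses Amazon Kendra for document retrieval."]
prompt = build_prompt("Which service does AWSomeChat use for retrieval?", passages)
print(prompt)
```

Changing the passages changes what the model can draw on without retraining it, which is the property the RAG pattern builds on.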


### Retrieval-augmented generation (RAG)

![rag-concept](../img/rag-concept.png)

The figure above depicts the retrieval-augmented generation design pattern. It works as follows:

- Step 0: Knowledge documents or document passages are encoded and ingested into a vector database.
- Step 1: The customer e-mail query is pre-processed and/or tokenized.
- Step 2: The tokenized input query is encoded.
- Step 3: The encoded query is used to retrieve the most similar text passages in the document index using vector similarity search (e.g., Maximum Inner Product Search).
- Step 4: The top-k retrieved documents or text passages, together with the original customer e-mail query and the e-mail generation prompt, are fed into the generator model (encoder-decoder) to generate the response e-mail.
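
The steps above can be sketched end to end in a few lines of plain Python. This is a toy illustration under stated assumptions, not the lab's implementation: a normalized bag-of-words vector stands in for a learned encoder, a Python list stands in for the vector database, and the sample documents, function names, and prompt wording are all hypothetical. The generator call itself (Step 4) is left out; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def encode(text: str, vocab: list[str]) -> list[float]:
    """Step 2: toy encoder, an L2-normalized bag-of-words vector."""
    counts = Counter(tokenize(text))
    vec = [float(counts[term]) for term in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical knowledge documents.
documents = [
    "Orders can be returned within 30 days of delivery.",
    "Standard shipping takes three to five business days.",
    "Gift cards cannot be exchanged for cash.",
]

# Step 0: encode the documents and store them in the "index".
vocab = sorted({term for doc in documents for term in tokenize(doc)})
index = [(doc, encode(doc, vocab)) for doc in documents]


def retrieve(query: str, k: int = 2) -> list[str]:
    """Step 3: return the k documents with the largest inner product."""
    query_vec = encode(query, vocab)
    ranked = sorted(index, key=lambda item: inner_product(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_generation_prompt(query: str, passages: list[str]) -> str:
    """Step 4 (input side): combine passages, query, and instruction."""
    context = "\n".join(passages)
    return f"Write a reply e-mail using this context:\n{context}\n\nCustomer e-mail: {query}\nReply:"


query = "How long does shipping take?"
top_passages = retrieve(query, k=1)
generation_prompt = build_generation_prompt(query, top_passages)
print(generation_prompt)
```

A real system replaces `encode` with an embedding model and the linear scan in `retrieve` with an approximate nearest-neighbor index, but the data flow stays the same.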

### Architecture

![rag-architecture](../img/rag-architecture.png)

The figure above shows the architecture of the LLM-powered chatbot with a retrieval-augmented generation component that we will implement in this lab. It consists of the following components:
- Document store & semantic search: We use the semantic document search service Amazon Kendra both as a fully managed embeddings/vector store and as a fully managed solution for retrieving documents based on natural-language questions.
- Response generation: For the chatbot response generation, we use the open-source encoder-decoder model FLAN-T5-XXL, conveniently deployed in a one-click fashion through Amazon SageMaker JumpStart right into your VPC.
- Orchestration layer: For hosting the orchestration layer, implemented with the popular framework LangChain, we choose a serverless approach. The orchestration layer is exposed as a RESTful API through Amazon API Gateway.
- Conversational memory: To keep track of the different chatbot conversation turns while keeping the orchestration layer stateless, we integrate the chatbot's memory with Amazon DynamoDB as a storage component.
- Frontend: The chatbot frontend is a web application hosted in a Docker container on Amazon ECS. For storing the container image, we use Amazon ECR. The website is exposed through an Amazon Elastic Load Balancer.
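
The interplay between a stateless orchestration layer and an external conversational memory can be sketched as follows. This is a simplified illustration, not the lab's LangChain code: `InMemoryHistoryStore` is a hypothetical stand-in for a DynamoDB table keyed by session ID, and `fake_retrieve` and `fake_generate` are stubs standing in for the calls to Amazon Kendra and the SageMaker endpoint.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class InMemoryHistoryStore:
    """Hypothetical stand-in for a DynamoDB table keyed by session ID."""

    items: dict[str, list[tuple[str, str]]] = field(default_factory=dict)

    def load(self, session_id: str) -> list[tuple[str, str]]:
        return list(self.items.get(session_id, []))

    def append(self, session_id: str, user_msg: str, bot_msg: str) -> None:
        self.items.setdefault(session_id, []).append((user_msg, bot_msg))


def handle_request(
    session_id: str,
    message: str,
    store: InMemoryHistoryStore,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """One stateless invocation: all conversation state lives in the store."""
    history = store.load(session_id)
    passages = retrieve(message)
    history_text = "\n".join(f"User: {u}\nBot: {b}" for u, b in history)
    prompt = (
        f"Context:\n{chr(10).join(passages)}\n\n"
        f"Conversation so far:\n{history_text}\n\n"
        f"User: {message}\nBot:"
    )
    answer = generate(prompt)
    store.append(session_id, message, answer)
    return answer


# Stubs standing in for the retrieval service and the hosted model.
seen_prompts: list[str] = []


def fake_retrieve(query: str) -> list[str]:
    return [f"(passage relevant to: {query})"]


def fake_generate(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"answer {len(seen_prompts)}"


store = InMemoryHistoryStore()
first = handle_request("session-1", "What is Kendra?", store, fake_retrieve, fake_generate)
second = handle_request("session-1", "Does it index PDFs?", store, fake_retrieve, fake_generate)
print(first, second, len(store.load("session-1")))
```

Because the handler keeps nothing between calls, any instance of the serverless function can serve any turn of a conversation, provided it can reach the shared store.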

## Instructions

### Prerequisites
#### Recommended background
It will be easier for you to run this workshop if you have:

- Experience with deep learning models
- Familiarity with Python or a similar programming language
- Experience with Jupyter notebooks
- Beginner-level knowledge of and experience with SageMaker Hosting/Inference
- Beginner-level knowledge of and experience with large language models

#### Target audience
Data Scientists, ML Engineers, ML Infrastructure Engineers, MLOps Engineers, and Technical Leaders.
Intended for customers working with large generative AI models, including language, computer vision, and multi-modal use cases.
Also suited to customers who host on EKS, EC2, ECS, or on-premises, or who have experience with SageMaker.

Level of expertise: 400

#### Time to complete
Approximately 1 hour.