# Building a RAG Chat Bot the Hard Way with Google Vertex AI, BigQuery & LangChain
## Overview
In this lab you will build a chat bot that uses documentation to inform its answers. The public LangChain documentation is used in this example. You will also keep track of the conversation history so that you can ask the chat bot follow-up questions. Finally, you'll implement a "Reset Chat" function that clears this chat history so that you can change the subject.

## Objectives
In this tutorial you will learn how to:
 * Load and Chunk Documents using LangChain
 * Create vector embeddings from document chunks using Vertex AI and LangChain
 * Use BigQuery as a LangChain Vector Store
 * Use LangChain to build multiple Chains
   * History Aware Chain
   * Retrieval Question and Answer Chain
 * Manage Message History using SQL
 * Return Answers to Questions including the entire message history.

## Setup and requirements

### Set Your GCP Project
In order to get started you'll need to authenticate to your GCP Project.

In [None]:
from google.colab import auth


PROJECT_ID = 'cody-hill-project-293913' # @param {type:"string"}

auth.authenticate_user(project_id=PROJECT_ID)

!gcloud config set project $PROJECT_ID

Updated property [core/project].


### Python Packages
The following Python Packages will be needed to implement all of these features.

Once this step is done you'll be prompted to "restart the session". Go ahead and do that.

In [None]:
!pip install langchain langchain-google-vertexai langchain-google-community[featurestore] google-cloud-storage google-cloud-bigquery

Collecting langchain
  Downloading langchain-0.2.15-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-google-vertexai
  Downloading langchain_google_vertexai-1.0.10-py3-none-any.whl.metadata (3.8 kB)
Collecting langchain-google-community[featurestore]
  Downloading langchain_google_community-1.0.8-py3-none-any.whl.metadata (3.4 kB)
Collecting langchain-core<0.3.0,>=0.2.35 (from langchain)
  Downloading langchain_core-0.2.35-py3-none-any.whl.metadata (6.2 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_text_splitters-0.2.2-py3-none-any.whl.metadata (2.1 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.106-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting google-cloud-storage
  Downloading google_cloud_storage-2.18.2-py2.py3-none-any.whl.metadata (9.1 kB)
Collecting httpx<0.28.0,>=0.2

### GCP Services
The following GCP Services will need to be enabled:

In [None]:
!gcloud services enable aiplatform.googleapis.com bigquery.googleapis.com storage-api.googleapis.com

Operation "operations/acat.p2-915773244463-809ccc63-3aa4-4006-a8b0-7a5d4e6e586d" finished successfully.


## Part 1: Building the Vector Store
To use your own documentation and dynamically retrieve the pieces most relevant to the question being asked, you need to create vector embeddings of all of your documentation. Several steps are involved in making the documentation usable for Retrieval Augmented Generation (***RAG***):
 * Download the Documentation
 * Load the Documentation
 * Break the Documentation into smaller "Searchable" pieces (***Chunks***)
 * Create embeddings of each of these chunks
 * Store these embeddings along with their content into a vector store.

### Task 1: Setting Variables
Below are all of the variables you will need in order to build out the Vector Store. The only variable that *needs* to be changed is `PROJECT_ID`; everything else can stay as it is.
  * **PROJECT_ID** *(Change Me)*:
    * This is your Google Cloud Project ID and is needed to authenticate to different services in Google Cloud
  * **REGION**:
    * This is the region in Google Cloud that you would like to utilize services within.
  * **EMBEDDING_MODEL**:
    * This is the text embedding model that you will use to create embeddings from your text chunks. As of writing this Colab, `text-embedding-004` is the latest text embedding model.
  * **BQ_DATASET**:
    * This is the name of the BigQuery dataset that will hold the table storing your embeddings and text. (The vector store expects this dataset to exist; Task 6 creates it if it does not.)
  * **BQ_TABLE**:
    * This is the name of the BigQuery table that will be used to store your embeddings and text. (This will be created automatically by LangChain)
  * **CHUNK_SIZE**:
    * This is how large you would like each text chunk to be (in number of characters). Larger chunks carry more content, but the larger a chunk is, the less specific each semantic search result becomes. Striking a balance between good searchability and enough content to inform the LLM is important.
  * **CHUNK_OVERLAP**:
    * This is how much overlap (in number of characters) you would like between consecutive chunks. When a document is broken into chunks, a split may land in the middle of a paragraph, a sentence, or even a word. To make sure each chunk has enough context, some of the text from the end of the previous chunk is repeated at the start of the following chunk.
  * **DOCS_BUCKET_NAME**:
    * This is the GCP Bucket that houses the documentation that you will be downloading, embedding, and storing.
      * Don't update this unless you know what you are doing.
  * **DOCS_DIR**:
    * This is the local directory that you will be downloading the documentation from the GCP Bucket into.
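
To make the interaction between `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification under stated assumptions: the `chunk_text` helper is hypothetical and not part of LangChain, and LangChain's splitter additionally prefers to break at separators rather than at fixed offsets.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last three characters of the chunk before it, which is the same idea the lab applies with 2000-character chunks and a 200-character overlap.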



In [None]:
PROJECT_ID = 'cody-hill-project-293913' # @param {type:"string"}
REGION = 'us-central1' # @param {type:"string"}
EMBEDDING_MODEL = 'text-embedding-004' # @param {type:"string"}
BQ_DATASET = 'doing_it' # @param {type:"string"}
BQ_TABLE = 'the_hard_way' # @param {type:"string"}
CHUNK_SIZE = 2000 # @param {type:"integer"}
CHUNK_OVERLAP = 200 # @param {type:"integer"}
DOCS_BUCKET_NAME = 'ch-langchain-docs' # @param {type:"string"}
DOCS_DIR = 'documentation' # @param {type:"string"}

### Task 2: Python Imports
These are all of the Python modules you'll need in order to create the Vector Store. They will be described in greater detail as you use them.

In [None]:
import os
import random
from time import sleep
from google.cloud import storage, bigquery
import google.api_core.exceptions
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_google_community import BigQueryVectorSearch
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import langchain

### Task 3: Downloading the documentation locally
This cell downloads all of the content from the GCP Bucket `DOCS_BUCKET_NAME` to the local directory `DOCS_DIR` and prints how many files were downloaded.

In [None]:
client = storage.Client()
source_bucket = client.bucket(DOCS_BUCKET_NAME)
blobs = source_bucket.list_blobs()
downloads = 0
for blob in blobs:
    object_path = blob.name
    local_directory = os.path.join(DOCS_DIR, os.path.dirname(object_path))
    os.makedirs(local_directory, exist_ok=True)
    local_path = os.path.join(local_directory, os.path.basename(object_path))
    blob.download_to_filename(local_path)
    downloads += 1
print(f"Number of downloads:  {downloads}")

Number of downloads:  1350


### Task 4: Loading Documents
This cell loads every file under `DOCS_DIR` that matches the `glob` pattern `**/*.md*`. In other words, it searches the directory recursively and loads each file whose extension starts with `.md` (such as `.md` or `.mdx`) as a LangChain "Document" object. It then prints how many documents were loaded.
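
As a rough illustration of what the `**/*.md*` pattern selects, the sketch below applies the same pattern with Python's `pathlib` to a throwaway directory. The file names are hypothetical, and it assumes the loader's pattern matching behaves like `pathlib` globbing.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical files standing in for the downloaded documentation
    for rel in ["intro.md", "guides/rag.mdx", "guides/img/logo.png", "notes.txt"]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder")

    # "**" descends into every subdirectory; "*.md*" matches .md, .mdx, etc.
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.md*"))

print(matched)
```

Only the Markdown-style files are selected; the image and the `.txt` file are ignored.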

In [None]:
text_loader_kwargs = {'autodetect_encoding': True}
loader = DirectoryLoader(DOCS_DIR,
                          glob="**/*.md*",
                          use_multithreading=True,
                          loader_cls=TextLoader,
                          loader_kwargs=text_loader_kwargs)
docs = loader.load()
print(f"Number of Docs:  {len(docs)}")

Number of Docs:  1346


### Task 5: Chunking Documents
This function will take each document and break it into "Chunks" based on our `CHUNK_SIZE`, `CHUNK_OVERLAP`, & `separators`.

The `CHUNK_SIZE` & `CHUNK_OVERLAP` were described in the `Setting Variables` section above, but let's talk about the `separators`.

Separators are what you would like your document "Split" by. In the example below:
```python
["\n\n", "\n", ".", "!", "?", ",", " ", ""]
```
You are telling the splitter to try and find a place to split the document around the `CHUNK_SIZE` that ends with these characters in order of priority:
  * `"\n\n"`: Double Line break normally indicates the end of a section
  * `"\n"`: Single line break normally indicates the end of a paragraph.
  * `"."`, `"!"`, & `"?"`: A period, exclamation mark, or question mark normally indicates the end of a sentence.
  * `","`: Splitting at a comma is better than splitting between random words.
  * `" "`: Splitting at a space is better than splitting in the middle of a word.
  * `""`: Finally if no other option is available, then split the document at any character.
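
The priority order above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the `split_by_priority` helper is hypothetical, it drops the separator at each cut, and it ignores overlap entirely.

```python
def split_by_priority(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Simplified recursive splitting: try separators in order of priority."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # Nothing left to try: cut at fixed offsets
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep == "" or sep not in text:
        return split_by_priority(text, chunk_size, rest)
    pieces = []
    for part in text.split(sep):
        if part:
            # Parts that are still too long fall through to lower-priority separators
            pieces.extend(split_by_priority(part, chunk_size, rest))
    # Greedily merge neighbouring pieces back together while they still fit
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = "First paragraph.\n\nSecond paragraph is a bit longer. It has two sentences."
chunks_by_priority = split_by_priority(
    sample_text, 40, ["\n\n", "\n", ".", "!", "?", ",", " ", ""]
)
for piece in chunks_by_priority:
    print(repr(piece))
```

The first cut happens at the blank line. Because the second paragraph is still longer than 40 characters, it is cut again at the sentence boundary instead of in the middle of a word.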

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]
)
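chunked_docs = []  # one list of chunks per source document
for doc in docs:
    chunks = text_splitter.split_documents([doc])
    # Record each chunk's position within its source document
    for idx, split in enumerate(chunks):
        split.metadata["chunk"] = idx
    chunked_docs.append(chunks)
# Note: len(chunked_docs) counts per-document chunk lists, not individual chunks
print(f"Number of Document Chunks:  {len(chunked_docs)}")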

Number of Document Chunks:  1346


### Task 6a: Embedding and Storing the Documents
#### **!!!Only run this if you have over 1 hour to create the Vector Store. If you do not, skip this step and run Task 6b instead!!!**
In this step you define the embedding model and the vector store, and let LangChain embed and store your documents in BigQuery.

The `EMBEDDING_MODEL` variable defines which model the `embedding_engine` uses.

You then provide the `embedding_engine` to `BigQueryVectorSearch`, along with the `BQ_DATASET` and `BQ_TABLE` where the embeddings will be stored.

Next, you loop through each document's list of chunks and use `store.add_documents(doc)` to save the documentation content along with its embeddings into the BigQuery table.

The rest of the code handles "slowing things down" in case you exceed your quota or rate limit when creating these embeddings or saving them to BigQuery.
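
The "slowing things down" logic is a form of exponential backoff with jitter. The sketch below isolates just the delay calculation; the `backoff_delay` helper is hypothetical, and its default values mirror the `initial_delay` and `max_delay` used in this task.

```python
import random


def backoff_delay(attempt: int, initial_delay: float = 1.0, max_delay: float = 30.0) -> float:
    """Delay before retry number `attempt` (0-based): exponential growth plus jitter, capped."""
    base = min(initial_delay * 2 ** attempt, max_delay)
    jittered = base + random.uniform(0, base)  # random extra wait spreads out retries
    return min(jittered, max_delay)


random.seed(0)  # fixed seed only so repeated runs print the same values
delays = [backoff_delay(attempt) for attempt in range(10)]
print([round(d, 1) for d in delays])
```

The delay roughly doubles with each failed attempt until it reaches the cap, so a burst of rate-limit errors backs off quickly without ever waiting longer than `max_delay` seconds per retry.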

#### **NOTE:**
To run this, you'll need to edit the code in the try block and uncomment `store.add_documents(doc)` and comment out `pass`

In [None]:
# Create the dataset in REGION before building the vector store, which
# expects the dataset to already exist in that location.
client = bigquery.Client(project=PROJECT_ID)
dataset = bigquery.Dataset(f"{PROJECT_ID}.{BQ_DATASET}")
dataset.location = REGION
client.create_dataset(dataset, exists_ok=True)

embedding_engine = VertexAIEmbeddings(
    model_name=EMBEDDING_MODEL,
    project=PROJECT_ID,
    location=REGION,
    request_parallelism=250,
    max_retries=10
)
store = BigQueryVectorSearch(
    project_id=PROJECT_ID,
    dataset_name=BQ_DATASET,
    table_name=BQ_TABLE,
    location=REGION,
    embedding=embedding_engine
)

max_retries = 10
initial_delay = 1
max_delay = 30
for doc in chunked_docs:
    for attempt in range(max_retries):
        try:
            # store.add_documents(doc)
            # uncomment the line above if you actually would like to create the
            # BigQuery Vector Store
            pass
            break
        except google.api_core.exceptions.Forbidden as e:
            if "Exceeded rate limits" in str(e) or "quotaExceeded" in str(e):
                # Exponential backoff with jitter, capped at max_delay
                delay = min(initial_delay * 2**attempt, max_delay)
                delay = min(delay + random.uniform(0, delay), max_delay)
                print(f"Retrying in {delay:.1f} seconds...")
                sleep(delay)
            else:
                print(f"Error: {e}")
                raise
        except Exception as e:
            print(f"Error: {e}")
            raise
    else:
        # Runs only when every attempt failed, so the loop never hit `break`
        print("Hit maximum retries, giving up...")

NotFound: 404 POST https://bigquery.googleapis.com/bigquery/v2/projects/cody-hill-project-293913/queries?prettyPrint=false: Not found: Dataset cody-hill-project-293913:doing_it was not found in location us-central1

### Task 6b: Import Vector Store to BigQuery
Because it takes so long to create the Vector Embeddings and push them to BigQuery, this step will allow you to simply import the vector store instead of having to create it from scratch.

In [None]:
client = bigquery.Client(project=PROJECT_ID)

dataset_ref = client.dataset(BQ_DATASET)
dataset = bigquery.Dataset(dataset_ref)
dataset.location = REGION
client.create_dataset(dataset, exists_ok=True)

avro_file_path = f'{DOCS_DIR}/bq_the_hard_way.avro'

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO
)

with open(avro_file_path, 'rb') as source_file:
    job = client.load_table_from_file(
        source_file,
        f'{BQ_DATASET}.{BQ_TABLE}',
        job_config=job_config,
    )

job.result()

print(f'Loaded {job.output_rows} rows into {BQ_DATASET}.{BQ_TABLE}')

Loaded 13553 rows into doing_it.the_hard_way


## Part 2: Building the Chat Bot
Now that all of the content you would like the chat bot to use is stored in BigQuery, you need to build out the front end of the chat bot.
This consists of the following:
  * Fetching message history if this isn't the first message in the chat.
  * Embedding the user's message
  * Using these embeddings to search for the documentation that is most similar to the user's message
  * Providing the chat history, retrieved documentation, and the user's message to the LLM for a response.
  * Storing the conversation history.
  * Displaying the conversation to the user
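
The similarity search step can be pictured with a toy example. The sketch below ranks a few hand-made three-dimensional vectors by cosine similarity to a query vector. The vectors, the texts, and the `top_k` helper are all hypothetical, and cosine similarity is only one common measure: the distance metric BigQuery actually applies depends on how the vector store is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings"; real embeddings have hundreds of dimensions
toy_store = [
    ("How to build a retriever", [0.9, 0.1, 0.0]),
    ("Chat message history", [0.1, 0.9, 0.1]),
    ("Installing packages", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]
results = top_k(query_vec, toy_store, k=2)
print(results)
```

In the lab, the retriever plays the role of `top_k`, with `k` set through `search_kwargs`.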



### Task 1: Setting Variables
Below are all of the variables you will need in order to build out the chat bot. The only variable that *needs* to be changed is `PROJECT_ID`; everything else can remain the same.
  * **PROJECT_ID** *(Change Me)*:
    * This is your Google Cloud Project ID and is needed to authenticate to different services in Google Cloud
  * **REGION**:
    * This is the region in Google Cloud that you would like to utilize services within.
  * **LLM_MODEL**:
    * This is the Large Language Model that you will use to respond to the user's messages. As of writing this Colab, `gemini-1.5-pro-001` is Google's most capable model.
  * **EMBEDDING_MODEL**:
    * This is the text embedding model that you will use to create embeddings from the user's messages. It should match the model used to build the Vector Store. As of writing this Colab, `text-embedding-004` is the latest text embedding model.
  * **BQ_DATASET**:
    * This is the name of the BigQuery dataset that holds the table containing your embeddings and text. (This BQ_DATASET must already exist)
  * **BQ_TABLE**:
    * This is the name of the BigQuery table that stores your embeddings and text. (This is the table populated in Part 1)
  * **MAX_OUTPUT_TOKENS**:
    * This parameter tells the Large Language Model the maximum number of tokens it is allowed in its response.
  * **TEMPERATURE**:
    * This is the model's temperature, which controls how much randomness goes into choosing each token. A temperature of 0.0 makes responses close to deterministic and tends to keep them grounded in the provided context, while a temperature of 1.0 allows more varied, creative responses that are more likely to stray from the facts.
  * **SQL_CONNECTION_STRING**:
    * Here you are defining a `SQLAlchemy` compatible connection string to be used for message chat memory.
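
To build some intuition for `TEMPERATURE`, the sketch below applies the textbook temperature-scaled softmax to three made-up token scores. How the hosted model applies temperature internally is not something this lab covers, so treat this as an illustration of the general idea. The `softmax_with_temperature` helper and the scores are hypothetical.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.1)
high = softmax_with_temperature(logits, 1.0)
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At a low temperature nearly all of the probability lands on the top-scoring token, while at 1.0 the other candidates keep a realistic chance of being sampled.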

In [None]:
PROJECT_ID = 'cody-hill-project-293913' # @param {type:"string"}
REGION = 'us-central1' # @param {type:"string"}
LLM_MODEL = 'gemini-1.5-pro-001' # @param {type:"string"}
EMBEDDING_MODEL = 'text-embedding-004' # @param {type:"string"}
BQ_DATASET = 'doing_it' # @param {type:"string"}
BQ_TABLE = 'the_hard_way' # @param {type:"string"}
MAX_OUTPUT_TOKENS = 8192 # @param {type:"integer"}
TEMPERATURE = 0.1 # @param {type:"number"}
SQL_CONNECTION_STRING = 'sqlite:///sqlite.db' # @param {type:"string"}


### Task 2: Python Imports
These are all of the Python modules you'll need in order to build the chat bot. They will be described in greater detail as you use them.

In [None]:
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings, HarmBlockThreshold, HarmCategory
from langchain_google_community import BigQueryVectorSearch
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import SQLChatMessageHistory
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain.chains import create_history_aware_retriever
import langchain

### Task 3: Define Embedding Model and Vector Store
Just like when you created the Vector Store, you need to define how embeddings are created and where they are stored. The difference this time around is that instead of storing the embeddings, you'll be using them to search for similar content.

In [None]:
def get_retriever():
    embedding_engine = VertexAIEmbeddings(
          model_name=EMBEDDING_MODEL,
          project=PROJECT_ID,
          location=REGION,
          request_parallelism=250
    )
    store = BigQueryVectorSearch(
        project_id=PROJECT_ID,
        dataset_name=BQ_DATASET,
        table_name=BQ_TABLE,
        embedding=embedding_engine
    )
    retriever = store.as_retriever(search_kwargs={"k": 10})
    return retriever

#get_retriever()

### Task 4: Define Large Language Model
Here you are defining the large language model you want to use to respond to the user's message. You are setting the `safety_settings` for each category to `BLOCK_NONE`, the least restrictive threshold. This isn't recommended for production, but it shows an example of how you can modify these settings.

You are then using the variables `LLM_MODEL`, `MAX_OUTPUT_TOKENS`, and `TEMPERATURE`, along with the `safety_settings`, to define the Large Language Model.

In [None]:
def get_llm():
    safety_settings = {
        HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
    }

    llm_model = ChatVertexAI(
        model_name=LLM_MODEL,
        safety_settings=safety_settings,
        max_tokens=MAX_OUTPUT_TOKENS,
        temperature=TEMPERATURE
    )
    return llm_model

### Task 5: Define History Aware Chain
Now you're starting to utilize the "chaining" in LangChain.

In this step you are creating a "chain" to take into account the previous conversation that may have taken place, along with the latest question. Both of these will be used together along with instructions to create a "History Aware Retriever" to be used to answer the question in a later step.
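
Conceptually, a history-aware retriever decides what text to search with. When there is no chat history, the question is passed to the retriever unchanged. Otherwise the LLM is asked to rewrite it as a standalone question first. The sketch below mimics that decision with a stand-in for the LLM; `history_aware_query` and `fake_llm` are hypothetical names used only for illustration.

```python
from typing import Callable


def history_aware_query(
    question: str,
    chat_history: list[tuple[str, str]],
    rewrite: Callable[[str], str],
) -> str:
    """Return the text that should be sent to the retriever."""
    if not chat_history:
        return question  # first message: nothing to contextualize
    transcript = "\n".join(f"{role}: {text}" for role, text in chat_history)
    prompt = (
        f"{transcript}\nhuman: {question}\n"
        "Rewrite the last question so that it can be understood on its own."
    )
    return rewrite(prompt)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; always returns a canned rewrite
    return "How do I configure a BigQuery vector store retriever?"


first = history_aware_query("What is a retriever?", [], fake_llm)
follow_up = history_aware_query(
    "How do I configure it?",
    [("human", "What is a BigQuery vector store retriever?"), ("ai", "It searches BigQuery.")],
    fake_llm,
)
print(first)
print(follow_up)
```

The follow-up "How do I configure it?" would retrieve poorly on its own, which is why the rewritten, self-contained question is what gets embedded and searched.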

In [None]:
def create_history_aware_chain(llm_model, retriever):
    contextualize_q_system_prompt = (
        "Given a chat history and the latest user question which might reference context in the chat history, "
        "formulate a standalone question which can be understood without the chat history. Do NOT answer the question, "
        "just reformulate it if needed and otherwise return it as is."
    )
    contextualize_q_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", contextualize_q_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )
    history_aware_retriever = create_history_aware_retriever(
        llm_model, retriever, contextualize_q_prompt
    )

    return history_aware_retriever

### Task 6: Define Question and Answer Chain
In this step you're creating a chain that will answer the user's question or follow up question, based on the conversation history.
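
The "stuff documents" approach used in this task simply places the text of every retrieved chunk into the prompt's `{context}` placeholder. A minimal sketch of that idea follows; the `stuff_documents` helper is hypothetical, and joining chunks with a blank line is an assumption made for this example.

```python
def stuff_documents(system_template: str, documents: list[str], separator: str = "\n\n") -> str:
    """Fill the {context} placeholder with all retrieved chunks joined together."""
    context = separator.join(documents)
    return system_template.format(context=context)


template = "Answer using the documentation below.\n\n{context}"
retrieved_chunks = [
    "Chunk one about retrievers.",
    "Chunk two about chat history.",
]
prompt_text = stuff_documents(template, retrieved_chunks)
print(prompt_text)
```

Because every chunk ends up in a single prompt, the number of chunks retrieved and the `CHUNK_SIZE` chosen in Part 1 together determine how much of the model's context window is consumed.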

In [None]:
def create_question_answer_chain(llm_model):
    qa_system_prompt = (
        "Instructions:  You are a knowledgeable LangChain assistant that answers questions. "
        "Using the following pieces of documentation, explain in great detail how to answer the Human's question. "
        "Provide Code examples with explanations whenever possible."
        "\n\n"
        "{context}"
    )
    qa_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", qa_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )

    question_answer_chain = create_stuff_documents_chain(llm_model, qa_prompt)
    return question_answer_chain, qa_prompt

### Task 7: Define "Full Chain"
In this step you are combining the History Aware Chain with the Question and Answer Chain to create a full "Conversational, RAG, QA Chain!"

You're passing in the `rag_chain` as well as all of the message history, to create the final "Runnable" LangChain object that will call to the LLM and respond to the user's messages.

The conversation history is stored by using a unique `session_id`. In this implementation you're using the [SQLChatMessageHistory](https://python.langchain.com/v0.1/docs/integrations/memory/sql_chat_message_history/). This can use any `SQLAlchemy` compatible `connection_string`. In this example we're using `sqlite` but there are many different [memory integrations available in LangChain](https://python.langchain.com/v0.1/docs/integrations/memory/).
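
To show the idea behind session-scoped history, here is a small sketch that uses Python's built-in `sqlite3` module. The table layout and helper functions are hypothetical and are not the schema that `SQLChatMessageHistory` creates; they only illustrate how a `session_id` keeps conversations separate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "session_id TEXT NOT NULL, role TEXT NOT NULL, content TEXT NOT NULL)"
)


def add_message(session_id: str, role: str, content: str) -> None:
    conn.execute(
        "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )


def get_messages(session_id: str) -> list[tuple[str, str]]:
    # Ordering by id returns the messages in the order they were added
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
        (session_id,),
    )
    return rows.fetchall()


def clear_messages(session_id: str) -> None:
    conn.execute("DELETE FROM messages WHERE session_id = ?", (session_id,))


add_message("session-a", "human", "What is a retriever?")
add_message("session-a", "ai", "A retriever returns relevant documents.")
add_message("session-b", "human", "An unrelated question")

history_a = get_messages("session-a")
print(history_a)

clear_messages("session-a")  # only session-a is wiped
print(get_messages("session-a"), get_messages("session-b"))
```

Switching the `session_id` switches conversations, and clearing one session leaves the others untouched.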

In [None]:
def create_full_chain(history_aware_chain, question_answer_chain, session_id):
    rag_chain = create_retrieval_chain(history_aware_chain, question_answer_chain)

    conversational_rag_chain = RunnableWithMessageHistory(
        rag_chain,
        lambda session_id: SQLChatMessageHistory(
            session_id=session_id, connection_string=SQL_CONNECTION_STRING
        ),
        input_messages_key="input",
        history_messages_key="chat_history",
        output_messages_key="answer",
    )
    return conversational_rag_chain

### Task 8: Define Question Answering
You're finally ready to put all of this together and answer the user's question.

Here you are calling all of the functions that you have previously defined to fetch the correct documentation, taking into account the conversation history, and answering the question with that context.

In [None]:
def answer_question(question, session_id):
    retriever = get_retriever()
    llm_model = get_llm()
    history_aware_chain = create_history_aware_chain(llm_model, retriever)
    question_answer_chain, qa_prompt = create_question_answer_chain(llm_model)
    conversational_rag_chain = create_full_chain(history_aware_chain, question_answer_chain, session_id)

    # Invoking the chain also records the question and answer in the history
    response = conversational_rag_chain.invoke(
        {"input": question},
        config={"configurable": {"session_id": session_id}}
    )
    # Return the answer so callers can use it directly
    return response["answer"]

### Task 9: Ask Questions and Get Answers
Now you can finally use this application by asking questions. You'll get the entire conversation back, including the answer to your latest question.


The `question` parameter is self-explanatory: it is the message you'll send to the chat bot.

The `session_id` parameter identifies a unique chat conversation. If you ask a question with the `session_id` set to `conversation123`, any follow-up question using that same `session_id` will be stored with `conversation123`. If you change the `session_id` to `conversation456`, you start a brand new chat with no chat history. You can then change your `session_id` back to `conversation123` and pick up that conversation where you left off.

If you would like to simply `clear` your chat history, there is a function below for this.

After defining your `question` and `session_id`, you'll pass them both to `answer_question`, which will automatically add your latest question and its response to the `SQLChatMessageHistory`.

You can then fetch the latest `chat_history` and iterate through each message. If the message is of type `human` you'll print `Human:` before the message. If not, you'll print `AI:` before the message.

In [None]:
question = "how can i add \"context\" key to include the sources of my retrieved docs" # @param {type:"string"}
session_id = 'MyUniqueChatSession' # @param {type:"string"}

answer_question(question, session_id)

chat_history = SQLChatMessageHistory(
    session_id=session_id, connection_string=SQL_CONNECTION_STRING
)

for message in chat_history.messages:
    if message.type == 'human':
        print(f"Human:\n  {message.content}\n\n")
    else:
        print(f"AI:\n  {message.content}\n\n")


TypeError: string indices must be integers

### Task 10: Clear Message History
This little function simply deletes the conversation history for a given `session_id`

In [None]:
session_id = 'MyUniqueChatSession' # @param {type:"string"}

chat_history = SQLChatMessageHistory(
    session_id=session_id, connection_string=SQL_CONNECTION_STRING
)
chat_history.clear()

## Congratulations!
You have now completed the lab! In this lab, you utilized BigQuery as a Vector Store by downloading, chunking, embedding, and storing documentation. You then used these embeddings in a RAG architecture to inform a language model's responses. You have also implemented chat history using LangChain's "Memory" to allow for follow-up questions.

### Next Steps
* Check out the [Generative AI on Vertex AI documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview).
* Check out [LangChain's documentation](https://python.langchain.com/v0.2/docs/introduction/)
* Learn more about Generative AI on the [Google Cloud Tech YouTube channel](https://www.youtube.com/@googlecloudtech/).

# Assessment Suggestions for Lab Building Team
### Part 1: Task 6a-b
The only thing that's really measurable is to see if the user has a BigQuery Dataset and Table named:

DATASET = 'doing_it'

TABLE = 'the_hard_way'

### Part 2: Task 9
Beyond that, it may be possible to check the user's API utilization for Vertex AI.

If they have any API or quota usage, we can check that.