## Obtaining and preparing the data 
The first step of any data science project is to obtain, understand, and prepare the data. In this notebook, we will walk through the process of obtaining a practical, real-world dataset and preparing it for downstream purposes, including building a Retrieval Augmented Generation (RAG) pipeline and fine-tuning a large language model to improve the performance of that pipeline.

### Use Case Overview
The dataset we'll be using consists of the [US Federal Banks and Banking Regulations](https://www.ecfr.gov/current/title-12) (Title 12 of the Code of Federal Regulations). Our goal is to build a chatbot that helps users get answers to specific questions and understand US federal banking regulations and policies. To simulate a real-world scenario, we will obtain the raw data directly from the provided API and prepare it for downstream tasks. Once we have the data, we will build and evaluate an initial RAG pipeline and then improve on it by fine-tuning our own model.

### Download and wrangle the data
The first step is to download the data from the API and wrangle it into a format that can be used for downstream tasks. We will use the `requests` library to download the data. The data comes in XML format, so we will use the `BeautifulSoup` library to parse it and extract the relevant information.

In [None]:
import sys
import os
import subprocess
module_path = "../.."
sys.path.append(os.path.abspath(module_path))
from utils.environment_validation import validate_environment, validate_model_access
validate_environment()

In [None]:
required_models = [
    "amazon.titan-embed-text-v2:0",
    "mistral.mixtral-8x7b-instruct-v0:1",
    "mistral.mistral-7b-instruct-v0:2",
]
validate_model_access(required_models)

In [None]:
# create an mlflow tracking server that can later be used to log experiments
from mlflow_utils import create_mlflow_tracking_server
subprocess.Popen([sys.executable, "mlflow_utils.py", "--tracking-server-name", "workshop-mlflow-1"])

In [None]:
import requests
import json
from pathlib import Path
from bs4 import BeautifulSoup
from rich import print as rprint
import re
import uuid

INCLUDED_CHAPTERS = ["I", "II", "III"]

data_path = Path("data")

# Download the raw data and save it to a file

api_url = "https://www.ecfr.gov"
api_path = "/api/versioner/v1/full/2024-04-04/title-12.xml?chapter=I"
raw_data_path = data_path / "raw"
raw_data_path.mkdir(exist_ok=True, parents=True)

response = requests.get(api_url + api_path, timeout=300)

if response.status_code != 200:
    raise RuntimeError(f"Failed to download the data. Status code: {response.status_code}")

rprint("[green]Data downloaded successfully[/green]")
# Write with an explicit encoding so the file reads back identically on any platform
with (raw_data_path / "raw_data.xml").open("w", encoding="utf-8") as f:
    f.write(response.text)

The heading for each chapter is captured in a `<div3>` tag. We can use that information to split the XML into chapters. As this is a sizeable dataset, we will keep only the first three chapters, which cover the Comptroller of the Currency, the Federal Reserve System, and the Federal Deposit Insurance Corporation.
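
As a minimal sketch of this filtering idea, the snippet below applies the same chapter selection to a toy XML string. It uses the standard-library `xml.etree.ElementTree` rather than `BeautifulSoup` so that it runs without extra dependencies. The sample document, the agency names, and the `select_chapters` helper are hypothetical and only mimic the `<div3 n="...">` layout described above.

```python
import xml.etree.ElementTree as ET

# Toy document that mimics the chapter layout (not real eCFR data).
SAMPLE_XML = """
<title>
  <div3 n="I" type="CHAPTER"><head>CHAPTER I - SAMPLE AGENCY A</head></div3>
  <div3 n="II" type="CHAPTER"><head>CHAPTER II - SAMPLE AGENCY B</head></div3>
  <div3 n="IV" type="CHAPTER"><head>CHAPTER IV - SAMPLE AGENCY C</head></div3>
</title>
"""

INCLUDED = {"I", "II", "III"}


def select_chapters(xml_text, included):
    """Return (chapter_id, heading) pairs for chapters whose `n` attribute is in `included`."""
    root = ET.fromstring(xml_text.strip())
    selected = []
    for chapter in root.iter("div3"):
        chapter_id = chapter.get("n")  # None when the attribute is missing
        if chapter_id in included:
            heading = chapter.findtext("head", default="").strip()
            selected.append((chapter_id, heading))
    return selected


chapter_pairs = select_chapters(SAMPLE_XML, INCLUDED)
print(chapter_pairs)
```

Reading the attribute with `.get("n")` instead of indexing means an element without that attribute is skipped rather than raising a `KeyError`.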

In [None]:
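with (raw_data_path / "raw_data.xml").open("r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "lxml")

chapters = soup.find_all("div3")
# .get() skips any <div3> element that lacks an "n" attribute instead of raising KeyError
filtered_chapters = [chapter for chapter in chapters if chapter.get("n") in INCLUDED_CHAPTERS]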

The next step is to iterate through the chapter hierarchy and extract the headings and the text for each section. The image below illustrates the hierarchy of the data and the corresponding XML tags.

![Hierarchy](images/BankingRegHierarchy.png)

For each section, we'll also capture metadata such as the title, part, chapter, and section. We will then save the data in JSON format for further processing.
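
One possible way to persist the extracted records is sketched below. It assumes a JSON Lines layout (one record per line), which is a choice made for this illustration and not something the notebook prescribes. The sample records, the `save_jsonl` and `load_jsonl` helpers, and the output path are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records shaped like the section dictionaries built in this notebook.
sample_sections = [
    {"metadata": {"chapter_id": "I", "section_id": "1.1", "unique_id": "I.1.1"}, "text": "Sample text A."},
    {"metadata": {"chapter_id": "I", "section_id": "1.2", "unique_id": "I.1.2"}, "text": "Sample text B."},
]


def save_jsonl(records, path):
    """Write one JSON object per line, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read the records back, ignoring blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A temporary directory keeps the demonstration from touching real project files.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "processed" / "sections.jsonl"
    save_jsonl(sample_sections, out_path)
    reloaded_sections = load_jsonl(out_path)

print(reloaded_sections == sample_sections)
```

A line-oriented layout makes it easy to append records or stream a large file, at the cost of the file no longer being a single valid JSON document.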

In [None]:
sections = []

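for chapter in filtered_chapters:
    # The first non-empty string inside an element is its heading text
    chapter_title = next(chapter.stripped_strings)
    # <div5> elements group the sections within a chapter
    volumes = chapter.find_all("div5")
    for volume in volumes:
        volume_title = next(volume.stripped_strings)
        # <div8> elements hold the individual regulation sections
        for section in volume.find_all("div8"):
            section_title = next(section.stripped_strings)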
            section_attributes = {
                "metadata": {
                    "chapter_title": chapter_title,
                    "chapter_id": chapter["n"],
                    "volume_title": volume_title,
                    "volume_id": volume["n"],
                    "section_id": section["n"],
                    "section_title": section_title,
                    "unique_id": f"{chapter['n']}.{section['n']}"
                }
            }
            section_text = section.get_text()
            section_attributes["text"] = section_text
            sections.append(section_attributes)

In [None]:
rprint(f"Found {len(sections)} sections")
rprint(f"Sample section: {sections[0]}")

### Synthetic Data Generation
We now have raw data that we can use immediately to build a RAG pipeline and to fine-tune a model via [Continued Pre-training](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models.html). Continued pre-training helps adapt the model to the vocabulary and context of the banking regulations, but it does not necessarily give the model the ability to reason about the regulations and answer questions about them. For that, we need an instruction-tuning dataset with examples of questions and answers pertaining to the regulations. Obtaining such a dataset can be expensive and time-consuming because it requires domain experts to create the data. Instead, we will synthetically generate a dataset of question-and-answer pairs using LLMs available on Amazon Bedrock.

Synthetic data generation can be a powerful way to create large amounts of training data for a variety of tasks. Subject matter experts (SMEs) can further refine the synthetic data and filter out irrelevant examples using tools such as [SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/). In summary, the advantages of synthetic data generation are:
- It is cost effective and quick to generate large amounts of data
- It can be used to create data for tasks where labeled data is scarce
- It can be refined and filtered by SMEs to ensure quality
- It can be used to evaluate the performance of Retrieval Augmented Generation (RAG) pipelines 

We will use the Mixtral 8x7B model from Mistral AI, hosted on Amazon Bedrock, to generate the data. The model will be invoked using the `langchain` library.
<div style="color: #415a77; background-color: #ff9f1c; padding: 10px; margin-bottom: 10px;">
    <strong>Important Note:</strong> Review the End User License Agreement (EULA) of the model before using it to generate synthetic data to ensure compliance with the terms of use.
</div>


In [None]:
from langchain_aws.chat_models import ChatBedrockConverse
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables.config import RunnableConfig

import boto3

import asyncio
import nest_asyncio
nest_asyncio.apply()

boto3_session = boto3.session.Session()
bedrock_runtime = boto3_session.client("bedrock-runtime")

llm_modelId = "mistral.mixtral-8x7b-instruct-v0:1"


llm = ChatBedrockConverse(
    model_id=llm_modelId,
    temperature=0.3,
    max_tokens=500,
    top_p=1,
    client=bedrock_runtime,
)



Below is the prompt template that we will use to generate the synthetic data. The prompt assigns a role to the model (law professor specializing in Banking Regulatory Compliance) and asks the model to generate a question and answer pair based on the provided context. The context is a section of the banking regulations that we obtained from the API.
You can get creative with the data generation process including:
- using different prompts to generate the questions and answers
- using different models for questions and answers
- using other models to refine the generated questions, and so on

To speed up the process, we will create an async function to generate the data in parallel by making concurrent API calls.
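
When many requests are launched at once, it can help to cap how many are in flight so that service rate limits are not exceeded. The sketch below illustrates one way to do that with `asyncio.Semaphore`. The `fake_generate` coroutine is a stand-in for a real model call, and `MAX_CONCURRENCY`, `bounded_gather`, and the `tracker` dictionary are hypothetical names introduced only for this illustration.

```python
import asyncio

MAX_CONCURRENCY = 3  # hypothetical cap; tune it to the rate limits of your account


async def fake_generate(section_id, tracker):
    """Stand-in for a model call; records how many calls overlap."""
    tracker["active"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["active"])
    await asyncio.sleep(0.01)  # simulates network latency
    tracker["active"] -= 1
    return f"questions for {section_id}"


async def bounded_gather(section_ids, limit):
    """Run one task per section, with at most `limit` running concurrently."""
    semaphore = asyncio.Semaphore(limit)
    tracker = {"active": 0, "peak": 0}

    async def worker(section_id):
        async with semaphore:  # waits here once `limit` calls are in flight
            return await fake_generate(section_id, tracker)

    # gather returns results in the same order as the inputs
    results = await asyncio.gather(*(worker(s) for s in section_ids))
    return results, tracker["peak"]


# In a plain script asyncio.run works; inside a notebook, `await bounded_gather(...)`
# is an alternative when an event loop is already running.
demo_results, peak_in_flight = asyncio.run(
    bounded_gather([f"sec-{i}" for i in range(10)], MAX_CONCURRENCY)
)
print(len(demo_results), peak_in_flight)
```

The same pattern could wrap a real chain invocation by replacing the body of `worker` with the actual asynchronous call.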

In [None]:
QUESTION_GENERATION_TEMPLATE = """You are a law professor specializing in Banking Regulatory Compliance.
You are preparing an exam for your students. You need to generate questions and answers based on the following regulation.
---------------------
{context}
----------------------
Here are some guidelines your generated question should adhere to
- Question should not be multiple choice
- The question should be answerable based only on the information provided in the regulation text above
- The question should be pointed and specific and not broad such as asking to summarize or explain an entire regulation section
- Question can be about a specific detail or a concept mentioned in the regulation
- Question may also ask for implications or consequences of a specific detail or concept mentioned in the regulation
- Question may require interpretation or analysis of the regulation text
- Question should be stand-alone and not part of a series of questions
- The question should not include the clause identifier as students should be able to identify the relevant clause based on the question

Below are guidelines for the answer
- Answer should include an explanation or reasoning for why the answer is correct
- If referencing a specific part of the regulation text, please include the full reference to the paragraph section number and the regulation itself

Generate {num_questions} questions and a correct answer only using 'Answer' and 'Question' keys as per the format below:

[Question 1]
Question: 
Answer:

[Question 2]
Question:
Answer:

Do not include any additional keys or information in the response. 

[/INST]
"""


async def generate_questions(section, num_questions=3):

    question_generation_prompt = ChatPromptTemplate.from_template(
        QUESTION_GENERATION_TEMPLATE, partial_variables={"num_questions": num_questions}
    )

    question_generation_chain = question_generation_prompt | llm

    response = await question_generation_chain.ainvoke({"context": section["text"]})

    result = {
        "ref_doc_id": section["metadata"]["unique_id"],
        "questions": response.content,
    }

    return result


def run_question_generation(sections, num_questions=3):

    tasks = [
        generate_questions(section, num_questions=num_questions) for section in sections
    ]
    event_loop = asyncio.get_event_loop()
    results = event_loop.run_until_complete(asyncio.gather(*tasks))

    return results

It can take a considerable amount of time to generate the question and answer pairs for the entire dataset. For demonstration purposes, we will generate the data for the first 10 sections of the regulations. A larger dataset has already been generated and is included in this lab.  

In [None]:
SECTIONS_TO_SAMPLE = 10
generated_questions = run_question_generation(sections[:SECTIONS_TO_SAMPLE], num_questions=3)

In [None]:
# print a sample question
rprint(generated_questions[0]["questions"])

The question and answer pairs can be extracted using a regular expression. They can then be combined with the original context data, giving us a dataset that consists of the context, question, and answer.
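
To make the extraction step concrete, here is a small worked example on a hand-written response string. The pattern is a variant, written for this illustration, that also keeps answers spanning several lines, which a `.`-based pattern without `re.DOTALL` would cut off at the first line break. The sample response, `QA_PATTERN`, and `parse_qa_pairs` are hypothetical and assume the model follows the `[Question N]` format requested in the prompt.

```python
import re

# Hand-written text in the format requested by the prompt (not real model output).
SAMPLE_RESPONSE = """
[Question 1]
Question: What is the minimum notice period?
Answer: Thirty days, as stated in paragraph (a).
The notice must be in writing.

[Question 2]
Question: Who approves the application?
Answer: The appropriate agency.
"""

# DOTALL lets the answer run across line breaks; the lookahead stops each answer
# at the next "[Question N]" header or at the end of the text.
QA_PATTERN = re.compile(
    r"\[Question \d+\]\s*Question:\s*(.*?)\s*Answer:\s*(.*?)\s*(?=\[Question \d+\]|\Z)",
    re.DOTALL,
)


def parse_qa_pairs(raw_text):
    """Return a list of {"question", "answer"} dictionaries found in raw_text."""
    return [
        {"question": question.strip(), "answer": answer.strip()}
        for question, answer in QA_PATTERN.findall(raw_text)
    ]


parsed_pairs = parse_qa_pairs(SAMPLE_RESPONSE)
for pair in parsed_pairs:
    print(pair)
```

Responses that deviate from the requested format simply yield no matches, so it is worth checking how many pairs were recovered per section before relying on the parsed output.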

In [None]:

question_re = re.compile(r"\[Question \d+\]\nQuestion: (.*)\nAnswer: (.*)\n?")
prepared_data = []
for qa in generated_questions:
    question_answers = question_re.findall(qa["questions"])
    relevant_section = next(section for section in sections if section["metadata"]["unique_id"] == qa["ref_doc_id"])
    
    for question in question_answers:
        prepared_data.append({
            "example_id": str(uuid.uuid4()),
            "ref_doc_id": qa["ref_doc_id"],
            "question": question[0],
            "answer": question[1],
            "context" : relevant_section["text"],
            "section_metadata": relevant_section["metadata"],
        })

In [None]:
rprint(f"Generated {len(prepared_data)} questions")
rprint(prepared_data[0])

### Conclusion
In this notebook, we obtained the raw data from the API, wrangled it into a format that can be used for downstream tasks, and generated synthetic data using a large language model. We have prepared the data for building and evaluating a RAG pipeline and fine-tuning a model to improve the performance of the pipeline.