## Generate Synthetic Dataset

In [None]:
import requests
from bs4 import BeautifulSoup
import re

# Fetch the page
url = 'https://www.consumerfinance.gov/rules-policy/regulations/1024/17/'
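response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
html_content = response.text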

# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract all paragraphs
paragraphs = soup.find_all('p')
# Combine all text into a single string for easier processing
full_text = '\n'.join([p.get_text() for p in paragraphs])

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1250, chunk_overlap=20)

chunks = text_splitter.split_text(full_text)
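
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification and is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraph and line breaks). The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return windows


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

Each window repeats the last `chunk_overlap` characters of the previous one, which is why neighbouring chunks share a little context.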

In [None]:
print("Number of chunks: ", len(chunks))
print("Chunk 1: ", chunks[0])
print("Chunk 2: ", chunks[1])

In [None]:
# save the chunks to a json file for future traceability
import json
with open('chunks.json', 'w') as f:
    json.dump(chunks, f)

In [7]:
from openai import OpenAI
from utility.utils import get_openai_api_key

# pass the key to the client explicitly
client = OpenAI(api_key=get_openai_api_key())

In [None]:
SYSTEM_PROMPT = """
# Role
you are a worldclass llm training data generator. 

# Task
You are generating q/a pairs for finetuning our own llm model.
create a json of questions and answers. 

# Specifics
Context relevance for the Q/A is really important, else our business will lose money on computation resources

         
# Example format:
{ data: [
    {
        "q": <question>,
        "a": <answer>
    },
    {
        "q": <question>,
        "a": <answer>
    }, ...
]}

# Notes
Make sure to generate JSON format for the Q/A pairs
Generate 10 Q/A pairs
          
"""

In [None]:
def generate_qa_pairs(system_prompt, text):

    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": text
            }
        ],
        model="gpt-3.5-turbo",
        response_format={"type": "json_object"}
    )

    return chat_completion.choices[0].message.content
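
Even with `response_format={"type": "json_object"}`, the returned JSON may not match the shape requested in the prompt (for example a missing `data` key or entries without an answer). The following is a minimal sketch of a defensive parser, assuming the `q`/`a` keys from the prompt above. The helper name `parse_qa_pairs` is hypothetical and not part of the notebook.

```python
import json


def parse_qa_pairs(raw_response):
    """Return a list of {"q": ..., "a": ...} dicts, skipping malformed entries."""
    try:
        payload = json.loads(raw_response)
    except (TypeError, ValueError):
        return []
    if not isinstance(payload, dict):
        return []
    items = payload.get("data")
    if not isinstance(items, list):
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        question, answer = item.get("q"), item.get("a")
        # keep only entries where both fields are non-empty strings
        if (isinstance(question, str) and isinstance(answer, str)
                and question.strip() and answer.strip()):
            pairs.append({"q": question.strip(), "a": answer.strip()})
    return pairs


sample = '{"data": [{"q": "What is escrow?", "a": "An account."}, {"q": "No answer"}]}'
print(parse_qa_pairs(sample))  # [{'q': 'What is escrow?', 'a': 'An account.'}]
```

Returning an empty list instead of raising keeps a long generation loop running when a single response is unusable.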

### Test out the generation

In [None]:
# let's do a test run (named sample_text to avoid shadowing the built-in input)
sample_text = "(a) General. This section sets out the requirements for an escrow account that a lender establishes in connection with a federally related mortgage loan. It sets limits for escrow accounts using calculations based on monthly payments and disbursements within a calendar year. If an escrow account involves biweekly or any other payment period, the requirements in this section shall be modified accordingly. A Public Guidance Document entitled “Biweekly Payments - Example” provides examples of biweekly accounting and a Public Guidance Document entitled “Annual Escrow Account Disclosure Statement - Example” provides examples of a 3-year accounting cycle that may be used in accordance with paragraph (c)(9) of this section. A Public Guidance Document entitled “Consumer Disclosure for Voluntary Escrow Account Payments” provides a model disclosure format that originators and servicers are encouraged, but not required, to provide to consumers when the originator or servicer anticipates a substantial increase in disbursements from the escrow account after the first year of the loan. The disclosures in that model format may be combined with or included in the Initial Escrow Account Statement required in § 1024.17(g)."
qa_pairs = generate_qa_pairs(SYSTEM_PROMPT, sample_text)
qa_pairs

Given the number of chunks and our generation of 10 Q/A pairs per chunk, we will have about 410 Q/A pairs from the 41 chunks. This is a good start for the training data; we can always add more later.

In [None]:
qa_chunks = {}
qa_chunks["data"] = []

In [None]:
# WARNING: This cell takes a long time to run (about 7 min) and uses a lot of OpenAI credits
# WARNING: This cell overwrites the existing qa_chunks.json file
if False:
    import json
    count = 0
    last_chunk_ending_index = 0
    for chunk in chunks:
        qa_pairs = generate_qa_pairs(SYSTEM_PROMPT, chunk)
        if isinstance(qa_pairs, str):
            qa_pairs = json.loads(qa_pairs)
        qa_chunk = {
            "id": count,
            "metadata": {
                "text": chunk,
                "length": len(chunk),
                # offsets are approximate: they sum chunk lengths and ignore chunk_overlap
                "start_index": last_chunk_ending_index,
                "end_index": last_chunk_ending_index + len(chunk)
            },
            "data": qa_pairs.get('data')}
        count += 1
        last_chunk_ending_index += len(chunk)
        qa_chunks.get('data').append(qa_chunk)

        # save qa_chunks to a json file after every chunk so we don't lose the data
        with open('qa_chunks.json', 'w') as f:
            json.dump(qa_chunks, f)
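
Because the chunks overlap (`chunk_overlap=20`), summing chunk lengths only approximates where each chunk sits in the source text. A minimal sketch of recovering the offsets by searching for each chunk in order is shown below. It assumes every chunk appears verbatim in the text, which may not hold if the splitter strips or normalizes whitespace. The helper name `locate_chunks` is hypothetical.

```python
def locate_chunks(full_text, chunks):
    """Return (start, end) character offsets of each chunk inside full_text.

    Assumes chunks appear verbatim and in order; a chunk that cannot be
    found gets (None, None) instead of a guessed position.
    """
    offsets = []
    search_from = 0
    for chunk in chunks:
        start = full_text.find(chunk, search_from)
        if start == -1:
            offsets.append((None, None))
            continue
        offsets.append((start, start + len(chunk)))
        # the next chunk may overlap this one, so only move past its first character
        search_from = start + 1
    return offsets


demo_text = "abcdefghij"
print(locate_chunks(demo_text, ["abcd", "defg", "ghij"]))  # [(0, 4), (3, 7), (6, 10)]
```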
    

In [3]:
# read the qa_chunks from the json file
import json
with open('generated_qa_data/qa_chunks.json') as f:
    generated_qa_chunks = json.load(f)

total_qa_generated = 0
for chunk in generated_qa_chunks.get('data'):
    total_qa_generated += len(chunk.get('data'))
print("Total Q/A pairs generated: ", total_qa_generated)
print("Last chunk id: ", generated_qa_chunks.get('data')[-1].get('id'))

Total Q/A pairs generated:  368
Last chunk id:  40


In [None]:
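# since we didn't get the expected number of Q/A pairs, let's see what happened
# (use generated_qa_chunks, loaded above; qa_chunks is empty unless the generation cell was run)
for chunk in generated_qa_chunks.get('data'):
    print("Chunk id: ", chunk.get('id'), "Q/A pairs: ", len(chunk.get('data')))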

In [None]:
generated_qa_chunks.get('data')[1]
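
To see at a glance which chunks fell short, a small helper can report only the chunks whose pair count differs from the target of 10. This is a sketch on toy data, assuming the `qa_chunks.json` structure used above (a `data` list of chunks, each with an `id` and its own `data` list). The name `find_short_chunks` is hypothetical.

```python
def find_short_chunks(qa_chunks, expected=10):
    """Map chunk id -> number of Q/A pairs for chunks that missed the target."""
    shortfalls = {}
    for chunk in qa_chunks.get("data", []):
        pairs = chunk.get("data") or []  # treat a missing/None list as empty
        if len(pairs) != expected:
            shortfalls[chunk.get("id")] = len(pairs)
    return shortfalls


toy = {"data": [
    {"id": 0, "data": [{"q": "q", "a": "a"}] * 10},
    {"id": 1, "data": [{"q": "q", "a": "a"}] * 7},
    {"id": 2, "data": None},
]}
print(find_short_chunks(toy))  # {1: 7, 2: 0}
```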

## Generating additional data

The generated Q/A pairs number fewer than 400, so we will generate more data (we need at least 1000 sample points).

This time, rather than using GPT-3.5 Turbo, we can use a smaller model: we are no longer creating Q/A pairs from a piece of text, only producing variations of the pairs that the more capable model has already generated.

For this part we will use the llama3-8b-instruct model to generate the Q/A pairs.

Let's pull the model using Ollama:
```ollama pull llama3```

In [None]:
!ollama pull llama3

In [None]:
import time
import ollama
# since we are using a smaller model (8b) and, unlike with OpenAI's API, we don't have a guarantee of JSON output, we will generate the similar questions one at a time

def generate_similar_question(question, additional_context=None):
    
    message = f'please rewrite the question in other words: {question}\n note: only respond with the question, no answer needed. You can simplify the question as well if needed.'
    if additional_context:
        message = f'{additional_context} {message}'
        
    response = ollama.chat(
        model='llama3',
        messages=[
            {
                'role': 'user',
                'content': message
            }
        ],
        
        )
    new_question = response['message']['content']
    return new_question


def generate_10_similar_questions(original_question, additional_context=None, max_attempts=50):

    questions = []
    print(f"Generating similar questions like '{original_question}'")
    # cap the number of attempts so repeated duplicates cannot loop forever
    for _ in range(max_attempts):
        new_question = generate_similar_question(
            original_question, additional_context)

        # we see some weird behaviour from the ollama api: when the request text is the same as the previous one, it may return the same text
        # hence we skip questions we have already collected
        if new_question in questions:
            continue
        questions.append(new_question)
        time.sleep(0.5)
        print(".", end="")
        if len(questions) == 10:
            break

    return questions


In [None]:
generate_10_similar_questions("Why is the sky blue?")

We can see above how the local LLM generates questions similar to the original. Now we will use the same function to generate more data.

In [None]:
generate_10_similar_questions(
    "What is the definition of Aggregate analysis in the context of escrow account analysis?",
    additional_context="remove the term escrow account from your response")
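
Exact string comparison misses rephrasings that differ only in case, punctuation, or spacing. Below is a minimal sketch of a stricter duplicate check with a bounded number of attempts. The `rephrase` argument is any callable that returns a new question (in this notebook it could wrap `generate_similar_question`); here a canned stand-in replaces the model so the example runs offline. The names `normalize` and `collect_unique_rephrasings` are hypothetical.

```python
import itertools
import string


def normalize(question):
    """Lowercase, drop punctuation, and collapse whitespace for comparison."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(question.lower().translate(table).split())


def collect_unique_rephrasings(original, rephrase, target=10, max_attempts=50):
    """Call rephrase(original) until target distinct questions are collected.

    Stops after max_attempts calls so a repetitive model cannot loop forever.
    """
    seen = {normalize(original)}
    unique = []
    for _ in range(max_attempts):
        candidate = rephrase(original).strip()
        key = normalize(candidate)
        if not key or key in seen:
            continue  # empty answer, echo of the original, or near-duplicate
        seen.add(key)
        unique.append(candidate)
        if len(unique) == target:
            break
    return unique


# stand-in for the model: cycles through canned answers, including near-duplicates
canned = itertools.cycle([
    "Why is the sky blue?",
    "What makes the sky blue?",
    "what makes the sky blue",
])
result = collect_unique_rephrasings(
    "Why is the sky blue?", lambda q: next(canned), target=3, max_attempts=9)
print(result)  # ['What makes the sky blue?']
```

The function may return fewer than `target` questions, so callers should not assume a fixed count per original question.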

In [None]:
# WARNING: This cell takes a long time to run (about 60 min on a MacBook M2 Pro with 32 GB) and uses a lot of local compute
# now let's do the same for all the generated Q/A pairs, and save them to a new json file
if False:
    for chunk in generated_qa_chunks.get('data'):
        for qa_pair in chunk.get('data'):
            new_questions = generate_10_similar_questions(qa_pair.get('q'))
            qa_pair['similar_questions'] = new_questions
        
        # save qa_chunks to a json file so we don't lose the data
        with open('qa_chunks_with_similar_questions.json', 'w') as f:
            json.dump(generated_qa_chunks, f)

In [None]:
generated_qa_chunks.get('data')[0].get('data')[0]

Now we have 3000+ Q/A pairs after combining the llama3 rephrasings with the GPT-3.5-turbo generated data. Ideally we would generate similar answers as well, but in the interest of time we will only use the questions generated by the llama3-8b-instruct model.
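
As a rough sanity check on dataset size, the counts can be worked out from the 368 pairs reported earlier. This is a back-of-the-envelope sketch that assumes every original question received exactly 10 rephrasings and that each question is later stored twice, once with its relevant text and once without. The real totals will be lower if any question ended up with fewer rephrasings.

```python
original_pairs = 368           # total reported when loading qa_chunks.json
rephrasings_per_question = 10  # target of generate_10_similar_questions

# each original question plus its rephrasings, all sharing the original answer
questions_with_context = original_pairs * (1 + rephrasings_per_question)
# every question also gets a "no relevant information" row
rows_with_negatives = questions_with_context * 2

print(questions_with_context)  # 4048
print(rows_with_negatives)     # 8096
```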

In [128]:
# load the saved qa_chunks_with_similar_questions.json file
import json
with open('generated_qa_data/qa_chunks_with_similar_questions.json') as f:
    generated_qa_chunks = json.load(f)

# let's bring everything together and generate the final dataset for fine-tuning the model
# we need a csv file with columns: question, answer, chunk_id, relevent_text
qa = []
for chunk in generated_qa_chunks.get('data'):
    for qa_pair in chunk.get('data'):
        # original q/a pair (generated using gpt-3.5-turbo)
        relevant_pair = {
            "question": qa_pair.get('q'),
            "answer": qa_pair.get('a'),
            "chunk_id": chunk.get('id'),
            "relevent_text": chunk.get('metadata').get('text'),
        }
        irrelevant_pair = {
            "question": qa_pair.get('q'),
            "answer": "The document does not contain and relevant information for your query",
            "chunk_id": chunk.get('id'),
            "relevent_text": "No relevant information found",
        }
        qa.append(relevant_pair)
        qa.append(irrelevant_pair)
        
        # add the rephrasings of the original question to the dataset
        # these questions were generated using llama3
        # add positive examples
        for similar_question in qa_pair.get('similar_questions'):
            new_pair = {
                "question": similar_question,
                "answer": qa_pair.get('a'),
                "chunk_id": chunk.get('id'),
                "relevent_text": chunk.get('metadata').get('text'),
            }
            qa.append(new_pair)
        # add no relevent_text examples
        for similar_question in qa_pair.get('similar_questions'):
            new_pair = {
                "question": similar_question,
                "answer": "No relevant information found in the document",
                "chunk_id": chunk.get('id'),
                "relevent_text": "No relevant information found",
            }
            qa.append(new_pair)
        
print(qa[0])
print(len(qa))

        

{'question': 'What are the requirements for an escrow account established by a lender in connection with a federally related mortgage loan?', 'answer': 'The requirements for an escrow account set limits based on monthly payments and disbursements within a calendar year. If the escrow account involves a different payment period like biweekly, the requirements are modified accordingly.', 'chunk_id': 0, 'relevent_text': '(a) General. This section sets out the requirements for an escrow account that a lender establishes in connection with a federally related mortgage loan. It sets limits for escrow accounts using calculations based on monthly payments and disbursements within a calendar year. If an escrow account involves biweekly or any other payment period, the requirements in this section shall be modified accordingly. A Public Guidance Document entitled “Biweekly Payments - Example” provides examples of biweekly accounting and a Public Guidance Document entitled “Annual Escrow Account 

In [129]:
# save the pairs to a csv file
import pandas as pd
df = pd.DataFrame(qa)

# note: the file is written in CSV format even though the path ends in .json
data_save_path = 'generated_qa_data/relevant_non_relevant_context_data.json'

df.to_csv(data_save_path, index=False)

In [130]:
# load the data to make sure it was saved correctly
df = pd.read_csv(data_save_path)
# drop the chunk_id column (the question/answer columns are not renamed to input/output in this cell)
df = df.drop(columns=['chunk_id'])

In [131]:
df.head(50)

Unnamed: 0,question,answer,relevent_text
0,What are the requirements for an escrow accoun...,The requirements for an escrow account set lim...,(a) General. This section sets out the require...
1,What are the requirements for an escrow accoun...,The document does not contain and relevant inf...,No relevant information found
2,What are the regulatory standards and guidelin...,The requirements for an escrow account set lim...,(a) General. This section sets out the require...
3,What are the regulations or guidelines that go...,The requirements for an escrow account set lim...,(a) General. This section sets out the require...
4,What are the key criteria that must be met whe...,The requirements for an escrow account set lim...,(a) General. This section sets out the require...
5,What are the key guidelines or rules that a le...,The requirements for an escrow account set lim...,(a) General. This section sets out the require...
6,What are the key conditions and guidelines tha...,The requirements for an escrow account set lim...,(a) General. This section sets out the require...
7,What are the essential guidelines or standards...,The requirements for an escrow account set lim...,(a) General. This section sets out the require...
8,What are the key factors that a lender must co...,The requirements for an escrow account set lim...,(a) General. This section sets out the require...
9,What are the key criteria for setting up an es...,The requirements for an escrow account set lim...,(a) General. This section sets out the require...


# Experimental (not used for training, but could be used in the future)

In [98]:
NEGATIVE_SYSTEM_PROMPT = """
# Role
you are a worldclass llm training data generator. 

# Task
You are generating negative examples for finetuning our own llm model.
User gives you a piece of text and you have to generate a question that is NOT relevant to the text.
Your job is to generate 10 questions that are not relevant to the text.

# Specifics
Context irrelevance for the Q/A is really important, else our business will lose money and reputation as you are serving our customers

         
# Example format:
{ data: [ irrelevant_question1, irrelevant_question2, ...]}

# Notes
- Make sure to generate JSON format for the irrelevant questions
- Generate 10 irrelevant questions 
- The irrelevant question can be anything that is NOT DIRECTLY related to the user prompt.
         
"""

In [102]:
# generate negative examples for the dataset
# given a piece of text, generate questions that are not relevant to the text

def generate_irrelevant_questions(system_prompt, text, temperature=0.5):

    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": text
            }
        ],
        model="gpt-3.5-turbo",
        response_format={"type": "json_object"},
        temperature=temperature
    )

    data = chat_completion.choices[0].message.content
    try:
        data = json.loads(data)
    except ValueError:
        data = None
        
    return data
        

In [103]:

irrelivant_questions = generate_irrelevant_questions(NEGATIVE_SYSTEM_PROMPT,"Aggregate (or) composite analysis, hereafter called aggregate analysis, means an accounting method a servicer uses in conducting an escrow account analysis by computing the sufficiency of escrow account funds by analyzing the account as a whole. Appendix E to this part sets forth examples of aggregate escrow account analyses.\n\nAnnual escrow account statement means a statement containing all of the information set forth in § 1024.17(i). As noted in § 1024.17(i), a servicer shall submit an annual escrow account statement to the borrower within 30 calendar days of the end of the escrow account computation year, after conducting an escrow account analysis.")

In [104]:
print(irrelivant_questions)


{'data': ['What is the best type of cheese for making a grilled cheese sandwich?', 'Have you ever tried bungee jumping?', 'Do you prefer to drink tea or coffee in the morning?', 'How many hours of sleep do you typically get each night?', 'What is your favorite TV show to binge-watch?', 'Have you ever traveled to Antarctica?', 'Do you believe in aliens?', 'What is the capital of Australia?', 'What is your opinion on pineapple pizza?', 'Have you ever participated in a hot dog eating contest?']}


In [105]:
context_relevance_prompt= """# Role
Your job is to judge whether the user query is relevant to the context provided.

# Instruction
Respond only with a number between 0.0 and 1.0,
0.0 being not relevant at all and 1.0 being very relevant.
Note: the relevance of the user query to the context provided should be your ONLY concern.

# Output Example (json)
{{relevance: <float>}}

# Context
{text} """

def validate_context_relevance(question, text):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": context_relevance_prompt.format(text=text)
            },
            {
                "role": "user",
                "content": question
            }
        ],
        response_format={"type": "json_object"}
    )
    data =  response.choices[0].message.content
    try:
        if isinstance(data, str):
            data = json.loads(data)
        elif isinstance(data, dict):
            pass
        else: 
            raise ValueError
    
    except ValueError:
        data = None
    
    return data
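
The model is asked for a float between 0.0 and 1.0, but nothing guarantees it returns one. The sketch below is one possible way to turn the raw response into a usable score, assuming the `relevance` key from the prompt above. The helper name `parse_relevance` is hypothetical.

```python
import json


def parse_relevance(raw_response):
    """Return the relevance as a float in [0.0, 1.0], or None if unusable."""
    try:
        if isinstance(raw_response, str):
            payload = json.loads(raw_response)
        else:
            payload = raw_response
        score = float(payload["relevance"])
    except (TypeError, ValueError, KeyError):
        return None
    if score != score:  # NaN never equals itself
        return None
    # clamp out-of-range values instead of trusting the model
    return min(1.0, max(0.0, score))


print(parse_relevance('{"relevance": 0.2}'))    # 0.2
print(parse_relevance('{"relevance": "0.9"}'))  # 0.9
print(parse_relevance('{"relevance": 7}'))      # 1.0
print(parse_relevance('{}'))                    # None
```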

In [106]:
print(irrelivant_questions)
valid = validate_context_relevance(irrelivant_questions["data"][0], "Aggregate (or) composite analysis, hereafter called aggregate analysis, means an accounting method a servicer uses in conducting an escrow account analysis by computing the sufficiency of escrow account funds by analyzing the account as a whole. Appendix E to this part sets forth examples of aggregate escrow account analyses.\n\nAnnual escrow account statement means a statement containing all of the information set forth in § 1024.17(i). As noted in § 1024.17(i), a servicer shall submit an annual escrow account statement to the borrower within 30 calendar days of the end of the escrow account computation year, after conducting an escrow account analysis.")
print(valid)

{'data': ['What is the best type of cheese for making a grilled cheese sandwich?', 'Have you ever tried bungee jumping?', 'Do you prefer to drink tea or coffee in the morning?', 'How many hours of sleep do you typically get each night?', 'What is your favorite TV show to binge-watch?', 'Have you ever traveled to Antarctica?', 'Do you believe in aliens?', 'What is the capital of Australia?', 'What is your opinion on pineapple pizza?', 'Have you ever participated in a hot dog eating contest?']}
{'relevance': 0.0}


In [107]:
def validate_irrilevant_generated_questions(irrelivant_questions):
    valid_irrilevant_q=[] 
    for question in irrelivant_questions:
        print("validating question: ", question)
        response = validate_context_relevance(question, "Aggregate (or) composite analysis, hereafter called aggregate analysis, means an accounting method a servicer uses in conducting an escrow account analysis by computing the sufficiency of escrow account funds by analyzing the account as a whole. Appendix E to this part sets forth examples of aggregate escrow account analyses.\n\nAnnual escrow account statement means a statement containing all of the information set forth in § 1024.17(i). As noted in § 1024.17(i), a servicer shall submit an annual escrow account statement to the borrower within 30 calendar days of the end of the escrow account computation year, after conducting an escrow account analysis.")
        
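        # skip questions whose relevance could not be parsed from the model output
        relevance = response.get('relevance') if response else None
        if relevance is not None and relevance < 0.5:
            valid_irrilevant_q.append({"question": question, "relevance": relevance})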
    
    return valid_irrilevant_q
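
The function above hardcodes the context text. A more reusable variant could take the context and the scoring function as arguments, as in the sketch below. The names `filter_irrelevant` and `fake_scorer` are hypothetical; `fake_scorer` is a toy stand-in for `validate_context_relevance` so the example runs without calling the API.

```python
def filter_irrelevant(questions, context, scorer, threshold=0.5):
    """Keep questions whose relevance to context is below threshold.

    scorer(question, context) should return a dict like {"relevance": 0.1},
    or None when the model output could not be parsed.
    """
    kept, unscored = [], []
    for question in questions:
        result = scorer(question, context)
        relevance = result.get("relevance") if isinstance(result, dict) else None
        if not isinstance(relevance, (int, float)):
            unscored.append(question)  # keep track instead of failing
        elif relevance < threshold:
            kept.append({"question": question, "relevance": relevance})
    return kept, unscored


def fake_scorer(question, context):
    """Toy stand-in for validate_context_relevance: keyword match only."""
    if "pizza" in question:
        return None  # simulate an unparseable response
    return {"relevance": 1.0 if "escrow" in question.lower() else 0.0}


kept, unscored = filter_irrelevant(
    ["What is an escrow account?", "Do you believe in aliens?", "Pineapple on pizza?"],
    "Aggregate analysis is an escrow accounting method.",
    fake_scorer,
)
print(kept)      # [{'question': 'Do you believe in aliens?', 'relevance': 0.0}]
print(unscored)  # ['Pineapple on pizza?']
```

Returning the unscored questions separately makes it easy to retry them later rather than silently dropping them.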
        

In [108]:
print(len(irrelivant_questions))
print(irrelivant_questions)
valid_irrilevant_q = validate_irrilevant_generated_questions(irrelivant_questions["data"])
print("len of irrilevant questions: ", len(irrelivant_questions["data"]))
print("len of valid irrilevant questions: ", len(valid_irrilevant_q))
print(valid_irrilevant_q)

1
{'data': ['What is the best type of cheese for making a grilled cheese sandwich?', 'Have you ever tried bungee jumping?', 'Do you prefer to drink tea or coffee in the morning?', 'How many hours of sleep do you typically get each night?', 'What is your favorite TV show to binge-watch?', 'Have you ever traveled to Antarctica?', 'Do you believe in aliens?', 'What is the capital of Australia?', 'What is your opinion on pineapple pizza?', 'Have you ever participated in a hot dog eating contest?']}
validating question:  What is the best type of cheese for making a grilled cheese sandwich?
validating question:  Have you ever tried bungee jumping?
validating question:  Do you prefer to drink tea or coffee in the morning?
validating question:  How many hours of sleep do you typically get each night?
validating question:  What is your favorite TV show to binge-watch?
validating question:  Have you ever traveled to Antarctica?
validating question:  Do you believe in aliens?
validating question: