## Generate Synthetic Dataset

In [1]:
import requests
from bs4 import BeautifulSoup
import re

# Fetch the page
url = 'https://www.consumerfinance.gov/rules-policy/regulations/1024/17/'
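response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
html_content = response.text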

# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract all paragraphs
paragraphs = soup.find_all('p')
# Combine all text into a single string for easier processing
full_text = '\n'.join([p.get_text() for p in paragraphs])

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1250, chunk_overlap=20)

chunks = text_splitter.split_text(full_text)
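
To make the two splitter parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses (which also tries to break on separators such as newlines). The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    it never looks for natural separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return pieces


demo = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share their boundary characters.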

In [3]:
print("Number of chunks: ", len(chunks))
print("Chunk 1: ", chunks[0])
print("Chunk 2: ", chunks[1])

Number of chunks:  41
Chunk 1:  (a) General. This section sets out the requirements for an escrow account that a lender establishes in connection with a federally related mortgage loan. It sets limits for escrow accounts using calculations based on monthly payments and disbursements within a calendar year. If an escrow account involves biweekly or any other payment period, the requirements in this section shall be modified accordingly. A Public Guidance Document entitled “Biweekly Payments - Example” provides examples of biweekly accounting and a Public Guidance Document entitled “Annual Escrow Account Disclosure Statement - Example” provides examples of a 3-year accounting cycle that may be used in accordance with paragraph (c)(9) of this section. A Public Guidance Document entitled “Consumer Disclosure for Voluntary Escrow Account Payments” provides a model disclosure format that originators and servicers are encouraged, but not required, to provide to consumers when the originator o

In [181]:
# Save the chunks to a JSON file for future traceability
import json
with open('chunks.json', 'w') as f:
    json.dump(chunks, f)

In [4]:
from openai import OpenAI
from utility.utils import get_openai_api_key

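# The v1 client takes the key as a constructor argument (or reads the
# OPENAI_API_KEY environment variable); assigning OpenAI.api_key on the class is not used.
client = OpenAI(api_key=get_openai_api_key())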

In [5]:
SYSTEM_PROMPT = """
# Role
You are a world-class LLM training data generator.

# Task
You are generating Q/A pairs for fine-tuning our own LLM.
Create a JSON object of questions and answers.

# Specifics
Context relevance for the Q/A pairs is really important; otherwise our business will lose money on computation resources.

         
# Example format:
{ "data": [
    {
        "q": <question>,
        "a": <answer>
    },
    {
        "q": <question>,
        "a": <answer>
    }, ...
]}

# Notes
Make sure to generate JSON format for the Q/A pairs
Generate 10 Q/A pairs
          
"""

In [9]:
def generate_qa_pairs(system_prompt, text):

    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": text
            }
        ],
        model="gpt-3.5-turbo",
        response_format={"type": "json_object"}
    )

    return chat_completion.choices[0].message.content
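
Setting `response_format={"type": "json_object"}` asks the model for syntactically valid JSON, but it does not by itself enforce the `data`/`q`/`a` layout requested in the prompt. A minimal sketch of a validation step follows. The helper name `parse_qa_pairs` is hypothetical (it is not part of the OpenAI client) and the sample string is made up for illustration.

```python
import json


def parse_qa_pairs(raw: str) -> list[dict]:
    """Parse a model response and keep only well-formed Q/A items."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # e.g. a response that was cut off
    if not isinstance(payload, dict):
        return []
    items = payload.get("data")
    if not isinstance(items, list):
        return []
    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        question, answer = item.get("q"), item.get("a")
        if not (isinstance(question, str) and isinstance(answer, str)):
            continue
        if question.strip() and answer.strip():
            valid.append({"q": question.strip(), "a": answer.strip()})
    return valid


# Made-up sample: the second item lacks an answer and is dropped
sample = '{"data": [{"q": "What is a cushion?", "a": "Extra escrow funds."}, {"q": "No answer"}]}'
parsed = parse_qa_pairs(sample)
print(parsed)
```

Returning an empty list instead of raising keeps a long generation loop running when a single response is malformed.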

### Test out the generation

In [10]:
# Let's do a test run
input = "(a) General. This section sets out the requirements for an escrow account that a lender establishes in connection with a federally related mortgage loan. It sets limits for escrow accounts using calculations based on monthly payments and disbursements within a calendar year. If an escrow account involves biweekly or any other payment period, the requirements in this section shall be modified accordingly. A Public Guidance Document entitled “Biweekly Payments - Example” provides examples of biweekly accounting and a Public Guidance Document entitled “Annual Escrow Account Disclosure Statement - Example” provides examples of a 3-year accounting cycle that may be used in accordance with paragraph (c)(9) of this section. A Public Guidance Document entitled “Consumer Disclosure for Voluntary Escrow Account Payments” provides a model disclosure format that originators and servicers are encouraged, but not required, to provide to consumers when the originator or servicer anticipates a substantial increase in disbursements from the escrow account after the first year of the loan. The disclosures in that model format may be combined with or included in the Initial Escrow Account Statement required in § 1024.17(g)."
qa_pairs = generate_qa_pairs(SYSTEM_PROMPT, input)
qa_pairs

{
    "data": [
        {
            "q": "What are the requirements set out for an escrow account established in connection with a federally related mortgage loan?",
            "a": "The requirements for an escrow account established in connection with a federally related mortgage loan include limits based on monthly payments and disbursements within a calendar year, with modifications for various payment periods."
        },
        {
            "q": "How are limits for escrow accounts calculated in relation to monthly payments and disbursements?",
            "a": "Escrow account limits are calculated based on monthly payments and disbursements within a calendar year for federally related mortgage loans."
        },
        {
            "q": "What guidance documents provide examples related to biweekly payments and annual escrow account disclosures?",
            "a": "The Public Guidance Document entitled 'Biweekly Payments - Example' provides examples of biweekly accounting, w

Given the number of chunks and our target of 10 Q/A pairs per chunk, we should end up with about 410 Q/A pairs from the 41 chunks. This is a good start for the training data, and we can always add more later.

In [28]:
qa_chunks = {}
qa_chunks["data"] = []

In [None]:
# WARNING: This cell takes a long time to run (about 7 minutes) and consumes OpenAI credits.
# WARNING: This cell overwrites the existing qa_chunks.json file.
import json

count = 0
last_chunk_ending_index = 0
for chunk in chunks:
    qa_pairs = generate_qa_pairs(SYSTEM_PROMPT, chunk)
    if isinstance(qa_pairs, str):
        qa_pairs = json.loads(qa_pairs)
    qa_chunk = {
        "id": count,
        "metadata": {
            "text": chunk,
            "length": len(chunk),
            # Offsets are cumulative chunk lengths; they ignore the splitter's
            # overlap, so treat them as approximate positions in full_text.
            "start_index": last_chunk_ending_index,
            "end_index": last_chunk_ending_index + len(chunk),
        },
        "data": qa_pairs.get('data'),
    }
    count += 1
    last_chunk_ending_index += len(chunk)
    qa_chunks.get('data').append(qa_chunk)

    # Save after every chunk so we don't lose data if the run is interrupted
    with open('qa_chunks.json', 'w') as f:
        json.dump(qa_chunks, f)
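
Because the loop above overwrites `qa_chunks.json` and every call costs credits, a resumable variant can be handy. The sketch below is one possible approach, not the notebook's method: it reloads an existing file and skips chunk ids that are already present. `run_resumable` is a hypothetical helper, `generate` stands in for `generate_qa_pairs` so the logic can be tried without API calls, and the stub output is made up.

```python
import json
import os
import tempfile
from typing import Callable


def run_resumable(chunks: list[str], generate: Callable[[str], str], path: str) -> dict:
    """Generate Q/A pairs per chunk, skipping chunk ids already saved in `path`."""
    if os.path.exists(path):
        with open(path) as f:
            results = json.load(f)
    else:
        results = {"data": []}
    done_ids = {entry["id"] for entry in results["data"]}

    for chunk_id, chunk in enumerate(chunks):
        if chunk_id in done_ids:
            continue  # already generated in an earlier run
        pairs = json.loads(generate(chunk)).get("data") or []
        results["data"].append({"id": chunk_id, "metadata": {"text": chunk}, "data": pairs})
        # Persist after every chunk so an interruption loses at most one call
        with open(path, "w") as f:
            json.dump(results, f)
    return results


# Demo with a stub generator (no API calls)
calls = []


def fake_generate(text: str) -> str:
    calls.append(text)
    return json.dumps({"data": [{"q": f"What does '{text}' say?", "a": "placeholder"}]})


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "qa_chunks.json")
    first = run_resumable(["chunk A", "chunk B"], fake_generate, out_path)
    second = run_resumable(["chunk A", "chunk B", "chunk C"], fake_generate, out_path)

print(len(second["data"]), len(calls))  # 3 3
```

The second run only calls the generator for the new chunk, which is the behaviour you would want after an interrupted or partially failed run.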
    

In [125]:
# read the qa_chunks from the json file
with open('qa_chunks.json') as f:
    generated_qa_chunks = json.load(f)

total_qa_generated = 0
for chunk in generated_qa_chunks.get('data'):
    total_qa_generated += len(chunk.get('data'))
print("Total Q/A pairs generated: ", total_qa_generated)
print("Last chunk id: ", generated_qa_chunks.get('data')[-1].get('id'))

Total Q/A pairs generated:  368
Last chunk id:  40


In [124]:
# Since we didn't get the expected number of Q/A pairs, let's see what happened
for chunk in qa_chunks.get('data'):
    print("Chunk id: ", chunk.get('id'), "Q/A pairs: ", len(chunk.get('data')))

Chunk id:  0 Q/A pairs:  10
Chunk id:  1 Q/A pairs:  4
Chunk id:  2 Q/A pairs:  10
Chunk id:  3 Q/A pairs:  10
Chunk id:  4 Q/A pairs:  6
Chunk id:  5 Q/A pairs:  10
Chunk id:  6 Q/A pairs:  10
Chunk id:  7 Q/A pairs:  3
Chunk id:  8 Q/A pairs:  10
Chunk id:  9 Q/A pairs:  10
Chunk id:  10 Q/A pairs:  10
Chunk id:  11 Q/A pairs:  3
Chunk id:  12 Q/A pairs:  10
Chunk id:  13 Q/A pairs:  6
Chunk id:  14 Q/A pairs:  10
Chunk id:  15 Q/A pairs:  10
Chunk id:  16 Q/A pairs:  5
Chunk id:  17 Q/A pairs:  6
Chunk id:  18 Q/A pairs:  5
Chunk id:  19 Q/A pairs:  10
Chunk id:  20 Q/A pairs:  10
Chunk id:  21 Q/A pairs:  10
Chunk id:  22 Q/A pairs:  10
Chunk id:  23 Q/A pairs:  10
Chunk id:  24 Q/A pairs:  10
Chunk id:  25 Q/A pairs:  10
Chunk id:  26 Q/A pairs:  10
Chunk id:  27 Q/A pairs:  10
Chunk id:  28 Q/A pairs:  10
Chunk id:  29 Q/A pairs:  10
Chunk id:  30 Q/A pairs:  10
Chunk id:  31 Q/A pairs:  10
Chunk id:  32 Q/A pairs:  10
Chunk id:  33 Q/A pairs:  10
Chunk id:  34 Q/A pairs:  10
Chu
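
The per-chunk counts explain the shortfall: several chunks returned fewer than the 10 pairs requested. A small sketch for listing those chunks, so they could be regenerated, follows. `find_short_chunks` is a hypothetical helper and the sample records are made up to mirror the structure of `qa_chunks`.

```python
def find_short_chunks(qa_chunks: dict, expected: int = 10) -> list[tuple[int, int]]:
    """Return (chunk_id, pair_count) for chunks with fewer than `expected` pairs."""
    short = []
    for chunk in qa_chunks.get("data", []):
        pairs = chunk.get("data") or []  # tolerate a missing or null "data" key
        if len(pairs) < expected:
            short.append((chunk.get("id"), len(pairs)))
    return short


# Made-up records with the same shape as the entries in qa_chunks
sample_chunks = {"data": [
    {"id": 0, "data": [{"q": "q", "a": "a"}] * 10},
    {"id": 1, "data": [{"q": "q", "a": "a"}] * 4},
    {"id": 2, "data": None},
]}
short_chunks = find_short_chunks(sample_chunks)
print(short_chunks)  # [(1, 4), (2, 0)]
```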

In [106]:
qa_chunks.get('data')[1]

{'id': 1,
 'metadata': {'text': "(b) Definitions. As used in this section:\nAggregate (or) composite analysis, hereafter called aggregate analysis, means an accounting method a servicer uses in conducting an escrow account analysis by computing the sufficiency of escrow account funds by analyzing the account as a whole. Appendix E to this part sets forth examples of aggregate escrow account analyses.\nAnnual escrow account statement means a statement containing all of the information set forth in §\xa01024.17(i).  As noted in §\xa01024.17(i),  a servicer shall submit an annual escrow account statement to the borrower within 30 calendar days of the end of the escrow account computation year, after conducting an escrow account analysis.\nCushion or reserve (hereafter cushion) means funds that a servicer may require a borrower to pay into an escrow account to cover unanticipated disbursements or disbursements made before the borrower's payments are available in the account, as limited by 

## Generating additional data

We generated fewer than 400 Q/A pairs, so we will generate more data (we need at least 1,000 samples).

This time, rather than using GPT-3.5 Turbo, we can use a smaller model. We are no longer creating Q/A pairs from a piece of text; we are only generating similar questions from the pairs already produced by the more capable model.

For this part we will use the llama3-8b-instruct model to generate the similar questions.

Let's pull the model using Ollama:
```ollama pull llama3```

In [36]:
!ollama pull llama3

pulling manifest
pulling 00e1317cbf74... 100% ▕████████████████▏ 4.7 GB
pulling 4fa551d4f938... 100% ▕████████████████▏  12 KB
pulling 8ab4849b038c... 100% ▕████████████████▏  254 B
pulling 577073ffcc6c... 100% ▕████████████████▏  110 B
pulling ad1518640c43... 100% ▕████████████████▏  483 B
verifying sha256 digest


In [174]:
import time
import ollama

# We are using a smaller (8B) model and, unlike with OpenAI's API, we have no
# guarantee of JSON output, so we generate the similar questions one at a time.


def generate_similar_question(question, additional_context=None):
    message = (
        f'Please rewrite the question in other words: {question}\n'
        'Note: only respond with the question, no answer needed. '
        'You can simplify the question as well if needed.'
    )
    if additional_context:
        message = f'{additional_context} {message}'

    response = ollama.chat(
        model='llama3',
        messages=[{'role': 'user', 'content': message}],
    )
    return response['message']['content']


def generate_10_similar_questions(original_question, additional_context=None):
    questions = []
    print(f"Generating similar questions like '{original_question}'")
    while len(questions) < 10:
        new_question = generate_similar_question(
            original_question, additional_context)

        # The Ollama API sometimes returns the same text when the request is
        # identical to the previous one, so skip duplicates.
        if new_question in questions:
            continue
        questions.append(new_question)
        time.sleep(0.5)
        print(".", end="")

    return questions
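
The loop in `generate_10_similar_questions` only stops once it has collected 10 distinct strings, so it could run indefinitely if the model keeps repeating itself. One possible safeguard, sketched below, caps the number of attempts and also treats outputs that differ only in case or whitespace as repeats (that normalisation is an assumption, not something the notebook does). `collect_unique` and `max_attempts` are hypothetical names, and `generate` stands in for a call such as `generate_similar_question`.

```python
from typing import Callable


def collect_unique(generate: Callable[[], str], target: int = 10, max_attempts: int = 50) -> list[str]:
    """Call `generate` until `target` distinct results are collected or attempts run out."""
    seen = set()
    results = []
    for _ in range(max_attempts):
        candidate = generate().strip()
        key = " ".join(candidate.lower().split())  # normalise case and whitespace
        if not key or key in seen:
            continue
        seen.add(key)
        results.append(candidate)
        if len(results) == target:
            break
    return results


# Demo with a stub that cycles through three canned outputs
canned = iter(["Why is the sky blue?", "why is  the sky blue?", "What makes the sky blue?"] * 5)
unique = collect_unique(lambda: next(canned), target=3, max_attempts=10)
print(unique)  # ['Why is the sky blue?', 'What makes the sky blue?']
```

In the demo the stub can only produce two distinct questions, so the function returns after 10 attempts with fewer results than requested instead of looping forever.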


In [163]:
generate_10_similar_questions("Why is the sky blue?")

Generating similar questions like 'Why is the sky blue?'
..........

['What makes the sky appear blue to our eyes?',
 'What makes the sky appear a specific shade of blue?',
 'What makes the sky appear blue?',
 'What makes the sky appear to be blue to our eyes?',
 'What makes the daytime sky appear a certain shade of blue?',
 "What's behind the beautiful blue color of the daytime sky?",
 'Why does the atmosphere appear blue to our eyes?',
 'Why does the atmosphere scatter sunlight to make the sky appear blue?',
 'What makes the daytime sky appear blue?',
 'What color is the sky most of the time?']

The output above shows how the local LLM generates questions similar to the original. Now we will use the same function to generate more data.

In [175]:
generate_10_similar_questions(
    "What is the definition of Aggregate analysis in the context of escrow account analysis?",
    additional_context="remove the term escrow account from your response")

Generating similar questions like 'What is the definition of Aggregate analysis in the context of escrow account analysis?'
..........

['What does "Aggregate Analysis" mean when examining an escrow account\'s transactions?',
 'What does "Aggregate Analysis" refer to when used in the context of analyzing an escrow account\'s performance?',
 'What does "Aggregate Analysis" mean when discussing escrow account transactions?',
 'What does "aggregate" mean when referring to the analysis of an escrow account?',
 'What does "Aggregate" refer to when analyzing data related to an escrow account?',
 'What does "Aggregate Analysis" refer to in relation to analyzing an escrow account\'s transactions and activity?',
 'What does "Aggregate Analysis" refer to in the context of analyzing an escrow account?',
 'What does "Aggregate Analysis" mean when analyzing an escrow account\'s financial data?',
 'What does "Aggregate Analysis" refer to in relation to analyzing data for an escrow account?',
 'What does "Aggregate Analysis" mean when applied to analyzing an escrow account?']

In [None]:
# WARNING: This cell takes a long time to run (about 60 minutes on a MacBook Pro M2 with 32 GB) and uses a lot of local compute.
# Now let's do the same for all the generated Q/A pairs and save them to a new JSON file.
for chunk in generated_qa_chunks.get('data'):
    for qa_pair in chunk.get('data'):
        new_questions = generate_10_similar_questions(qa_pair.get('q'))
        qa_pair['similar_questions'] = new_questions

    # Save after every chunk so we don't lose the data
    with open('qa_chunks_with_similar_questions.json', 'w') as f:
        json.dump(generated_qa_chunks, f)

In [178]:
generated_qa_chunks.get('data')[0].get('data')[0]

{'q': 'What are the requirements for an escrow account established by a lender in connection with a federally related mortgage loan?',
 'a': 'The requirements for an escrow account set limits based on monthly payments and disbursements within a calendar year. If the escrow account involves a different payment period like biweekly, the requirements are modified accordingly.',
 'similar_questions': ['What are the regulatory standards and guidelines that govern the establishment of an escrow account by a lender when originating a federally related mortgage loan?',
  'What are the regulations or guidelines that govern the establishment of an escrow account by a lender when issuing a federally related mortgage loan?',
  'What are the key criteria that must be met when setting up an escrow account by a lender, in relation to federal regulations for mortgages?',
  'What are the key guidelines or rules that a lender must follow when setting up an escrow account to manage taxes and insurance pa

Now we have more than 4,000 Q/A pairs once we combine the paraphrased questions with the GPT-3.5 Turbo generated data. Ideally we would generate paraphrased answers as well, but in the interest of time we will reuse each original answer for the questions generated by the llama3-8b-instruct model.
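
The total follows from simple arithmetic: every original pair contributes itself plus one row per paraphrased question. A quick check, using the 368 pairs reported earlier and 10 paraphrases per question (`expected_rows` is a hypothetical helper):

```python
def expected_rows(original_pairs: int, paraphrases_per_question: int) -> int:
    """Rows in the flattened dataset: each pair plus one row per paraphrase."""
    return original_pairs * (1 + paraphrases_per_question)


total_rows = expected_rows(368, 10)
print(total_rows)  # 4048
```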

In [184]:
### Let's bring everything together and generate the final dataset for fine-tuning the model.
# We need a CSV file with columns: question, answer, chunk_id
qa = []
for chunk in generated_qa_chunks.get('data'):
    for qa_pair in chunk.get('data'):
        new_pair = {
            "question": qa_pair.get('q'),
            "answer": qa_pair.get('a'),
            "chunk_id": chunk.get('id')
        }
        qa.append(new_pair)
        for similar_question in qa_pair.get('similar_questions'):
            new_pair = {
                "question": similar_question,
                "answer": qa_pair.get('a'),
                "chunk_id": chunk.get('id')
            }
            qa.append(new_pair)
        
print(qa[0])
print(len(qa))

        


{'question': 'What are the requirements for an escrow account established by a lender in connection with a federally related mortgage loan?', 'answer': 'The requirements for an escrow account set limits based on monthly payments and disbursements within a calendar year. If the escrow account involves a different payment period like biweekly, the requirements are modified accordingly.', 'chunk_id': 0}
4048


In [185]:
# Save the pairs to a CSV file
import pandas as pd
df = pd.DataFrame(qa)
df.to_csv('qa_finetuning_dataset.csv', index=False)

In [191]:
# Load the data to make sure it was saved correctly
# (same path as the file written in the previous cell)
df = pd.read_csv('qa_finetuning_dataset.csv')
# Rename "question" to "instruction" and "answer" to "output", and drop "chunk_id"
df = df.rename(columns={'question': 'instruction', 'answer': 'output'})
df = df.drop(columns=['chunk_id'])
# Add an empty "input" column
df['input'] = ""

In [192]:
df.head()

| | instruction | output | input |
|---|---|---|---|
| 0 | What are the requirements for an escrow accoun... | The requirements for an escrow account set lim... | |
| 1 | What are the regulatory standards and guidelin... | The requirements for an escrow account set lim... | |
| 2 | What are the regulations or guidelines that go... | The requirements for an escrow account set lim... | |
| 3 | What are the key criteria that must be met whe... | The requirements for an escrow account set lim... | |
| 4 | What are the key guidelines or rules that a le... | The requirements for an escrow account set lim... | |
