### Set API keys

Before running this notebook:
1. Create a file called `.env` in this directory.

2. Write `OPENAI_API_KEY="Your Api Key"` in the `.env` file.

3. Add `.env` to your `.gitignore` so the key is never committed.


In [1]:
from dotenv import load_dotenv
import os

load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
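
`os.getenv` returns `None` when the variable is missing, so a misnamed or absent `.env` entry would only surface later as an authentication error. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of `python-dotenv`.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adjust to taste.
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

After `load_dotenv()`, calling `require_env("OPENAI_API_KEY")` in place of `os.getenv('OPENAI_API_KEY')` would stop the notebook at the first cell if the key is missing.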

In [2]:
from openai import OpenAI
import re
import json

In [3]:
client = OpenAI(api_key=OPENAI_API_KEY)

In [4]:
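def append_txt_files(directory):
    """Concatenate every .txt file under `directory` (recursively) into one string."""
    result = ""
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".txt"):
                file_path = os.path.join(root, file)
                with open(file_path, 'r') as f:
                    # Separate files with a newline so their text does not run together
                    result += f.read() + "\n"
    return result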

In [5]:
SYSTEM_PROMPT = """
You are a helpful assistant used for data processing. You have expert understanding of taking in
chunks of textual data, first identifying all key facts in the text and then generating question-answer pairs for supervised fine-tuning. You do not incorporate your own knowledge
and only use factual information from the provided knowledge text."""

In [29]:
USER_PROMPT = """
I want to create question-answer pairs for the following knowledge text. First identify all key facts in the knowledge text
(there is no set number, knowledge text might be short or long, make sure facts you have identified are strong and comprehensively cover the text).

Next, for each fact, propose all plausible questions a person might commonly ask a knowledgeable assistant, and give answers grounded in the facts you have identified and the knowledge text.
Generate questions that people will commonly ask about the given knowledge base.
Make sure your answers to the questions are clear and comprehensive; the answers must simulate an expert clearly yet succinctly providing a factual answer.
Make your answers medium length, but do not leave out important information; cover nuances relating to the question.


Follow this line of reasoning and formatting:
<example>
<example_knowledge_text>
1.035 Co-Owners (VC §§4150.5 and 9852.5)
A vehicle or vessel may be owned by two or more co-owners. Co-owner names may be joined by “and”, “and/or”, or “or”. All owners must endorse the title or registration application to register the vehicle/vessel, but the requirements for releasing ownership vary. Refer to Chapter 11.
Certificates issued for applications not indicating “and” or “or” between the names will show “and” as represented by a slash (/) between the names.
* The signatures of all owners are required to transfer ownership when the co-owner names are joined by “and”. Ownership passes to the surviving co-owner upon the death of a co-owner or, with the surviving co-owner’s release, to a new owner. A deceased co-owner’s interest may only be released by one of the following:
   * Heir of the deceased with an Affidavit for Transfer Without Probate California Titled Vehicle or Vessels Only (REG 5) form.
   * Administrator with Letters of Administration.
   * Executor with Letters Testamentary.
* The signature of only one owner is required to transfer ownership when the co-owner names are joined by “and/or” or “or”. A surviving co-owner’s signature on the title releases all owners’ interest unless “Tenants in Common” or “COMPRO” follows the co-owner’s names.
* A REG 5 cannot be used to circumvent the interest of a surviving owner when the vehicles are jointly owned by two or more persons and one of the owners is deceased. However, the surviving owner (if they are the heir) may complete a REG 5 to release the interest of the deceased owner. The California Certificate of Title must be signed twice, once by surviving owner and once for the deceased owner countersigned by the heir. If owned jointly by two or more deceased owners, a REG 5 for the most recently deceased owner and a death certificate for each owner is required.
Tenants in Common—When “Tenants in Common” follows the names of co-owners, the interest of a deceased co-owner reverts to the deceased co-owner’s estate, not to the surviving co-owner. Ownership may be transferred with Letters Testamentary, Letters of Administration, or an Affidavit for Transfer Without Probate California Titled Vehicle or Vessels Only (REG 5) form.
COMPRO—When “COMPRO” (community property) follows the names of co-owners, ownership passes to the surviving owner after the deceased co-owner’s interest is executed by the:
* Heir of the deceased with a REG 5, if the estate was not probated.
* Administrator of the estate with Letters of Administration.
* Executor of the estate with Letters Testamentary.
Ownership may be transferred to a new owner with the surviving co-owner’s release.
JTRS—When “JTRS” (joint tenants with right of survivorship) follows the names of co-owners:
* All owners must release interest during the lifetime of the co-owners.
* Upon the death of an owner, interest is released by the surviving co-owner.
* A copy of the deceased owner’s death certificate must accompany the application.
* The signature of the surviving co-owner(s) on the title releases all owner interests.
TOD (Transfer on Death)—Refer to Chapter 11.
<end_example_knowledge_text>
<example_qa_pair>
FACT 1: A vehicle or vessel may be owned by two or more co-owners
[{{"role": "user", "content": "Can I register a car to multiple users in California?"}},
{{"role": "assistant", "content": "Yes, you can register a car under multiple owners in California. The state allows co-ownership with different options like 'and', 'or', or 'and/or' between names, which affects how ownership transfers work. All co-owners need to sign the initial registration, but transfer requirements vary. There are also special rules for situations like when an owner passes away or for specific ownership types like community property. The key is to choose the right co-ownership option that fits your situation, as it'll determine things like whether you need all owners to sign off on a sale or just one."}}]
<end_example_qa_pair>
<end_example>

Follow the template provided in the example, but give as many questions as needed for each fact; this can range from one to many. You may also propose follow-up questions.
Make sure you fully cover all fundamental questions about each fact, including different options, clarification of terms, explanation of complex points, etc.

The goal is to identify as many key facts as possible and generate as many question-answer pairs for each fact as possible, to be very comprehensive.

Outputs will follow this structure:
FACT 1:
[{{"role": "user", "content": "..."}}, {{"role": "assistant", "content": "..."}}]
[{{"role": "user", "content": "..."}}, {{"role": "assistant", "content": "..."}}]
(possibly many more question answer pairs)
...

FACT 2:
....
(possibly many more facts)

<knowledge_text>
"""

In [30]:
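def user_prompt_for(scrape_file):
    """Build the full user prompt for one scraped text file."""
    with open(scrape_file, 'r') as f:
        knowledge_base = f.read() + "\n"
    # USER_PROMPT has no placeholders; format() only turns the escaped {{ }} into literal braces
    return USER_PROMPT.format() + knowledge_base + "\n<end_knowledge_text>"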
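
The doubled braces in `USER_PROMPT` are `str.format` escapes: `{{` and `}}` are rendered as single literal braces, which is why the JSON-like examples in the prompt survive the `.format(...)` call. A small illustration, using a hypothetical `template` string rather than the real prompt:

```python
# Hypothetical one-line template standing in for USER_PROMPT.
template = '[{{"role": "user", "content": "..."}}]'

# With no placeholders present, format() only unescapes the doubled braces.
rendered = template.format()
print(rendered)  # [{"role": "user", "content": "..."}]
```

If the prompt never needs placeholders, writing it with single braces and skipping `.format` altogether would give the same result.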

In [31]:
print(user_prompt_for("../dataset/ch01/sec08.txt"))


I want to create question-answer pairs for the following knowledge text. First identify all key facts in the knowledge text
(there is no set number, knowledge text might be short or long, make sure facts you have identified are strong and comprehensively cover the text).

Next, for each fact, propose all plausible questions a person might commonly ask a knowledgeable assistant, and give answers grounded in the facts you have identified and the knowledge text.
Generate questions that people will commonly ask about the given knowledge base.
Make sure your answers to the questions are clear and comprehensive; the answers must simulate an expert clearly yet succinctly providing a factual answer.
Make your answers medium length, but do not leave out important information; cover nuances relating to the question.


Follow this line of reasoning and formatting:
<example>
<example_knowledge_text>
1.035 Co-Owners (VC §§4150.5 and 9852.5)
A vehicle or vessel may be owned by two or more co-owners. 

In [59]:
def sft_qa_pairs(scrape_file):
    """Generate Q&A pairs for one file; returns the raw completion and the parsed pairs."""
    user_prompt = user_prompt_for(scrape_file)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    content = completion.choices[0].message.content
    # Each pair is emitted as a JSON list: [{"role": "user", ...}, {"role": "assistant", ...}]
    pattern = r'(\[\{.*?\}\])'
    matches = re.findall(pattern, content, re.DOTALL)
    # json.loads raises json.JSONDecodeError if the model emits invalid JSON (e.g. single quotes)
    parsed_result = [json.loads(match.strip()) for match in matches]
    return completion, parsed_result
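
Because the pairs are recovered with a regex and `json.loads`, a single malformed pair in the model output aborts the whole parse. A more tolerant variant might look like the sketch below. `parse_qa_pairs` and `PAIR_PATTERN` are hypothetical names, and the `ast.literal_eval` fallback for single-quoted, Python-style dicts is an assumption about how the model might deviate, not something observed in this notebook.

```python
import ast
import json
import re

# Same idea as the pattern in sft_qa_pairs: a bracketed list of {...} objects.
PAIR_PATTERN = re.compile(r"\[\{.*?\}\]", re.DOTALL)


def parse_qa_pairs(content):
    """Extract [user, assistant] message pairs from model output, skipping malformed ones."""
    pairs = []
    for raw in PAIR_PATTERN.findall(content):
        try:
            pair = json.loads(raw)
        except json.JSONDecodeError:
            try:
                # Fallback for Python-style literals such as {'role': 'user', ...}
                pair = ast.literal_eval(raw)
            except (ValueError, SyntaxError):
                continue  # drop pairs that cannot be parsed either way
        is_pair = (
            isinstance(pair, list)
            and len(pair) == 2
            and all(isinstance(m, dict) for m in pair)
            and pair[0].get("role") == "user"
            and pair[1].get("role") == "assistant"
        )
        if is_pair:
            pairs.append(pair)
    return pairs
```

Like the original pattern, this still breaks if an answer itself contains the characters `}]`, so it is a mitigation rather than a full parser.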

Large text example: 2,007 input tokens and 1,100 output tokens.

In [60]:
completion, parsed_result = sft_qa_pairs("../dataset/ch01/sec08.txt")
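
Token counts like the ones quoted for this example can be read from the completion object itself, since chat completion responses carry a `usage` field with `prompt_tokens` and `completion_tokens`. A minimal sketch, where `token_summary` is a hypothetical helper name:

```python
def token_summary(completion):
    """Return input/output token counts from a chat completion's `usage` field."""
    usage = completion.usage
    if usage is None:  # defensive: treat missing usage data as unknown
        return None
    return {
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
    }
```

Calling `token_summary(completion)` on the result of the cell above would report the counts for that specific run; the numbers vary between runs because the output length is not fixed.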

In [61]:
print(completion.choices[0].message.content)

FACT 1: A vehicle or vessel may be owned by two or more co-owners.  
[{"role": "user", "content": "Can multiple people own a vehicle in California?"}, {"role": "assistant", "content": "Yes, in California, a vehicle or vessel can be owned by two or more co-owners. The names of the co-owners can be connected by 'and', 'and/or', or 'or', which influences how ownership is managed and transferred."}]  
[{"role": "user", "content": "What are the ways I can register a car with multiple owners?"}, {"role": "assistant", "content": "You can register a car with multiple owners by using the terms 'and', 'and/or', or 'or' between the names on the registration application. Each choice has different implications for how ownership rights operate."}]  

FACT 2: All owners must endorse the title or registration application to register the vehicle/vessel.  
[{"role": "user", "content": "Do all owners need to sign to register a vehicle?"}, {"role": "assistant", "content": "Yes, all co-owners are required 

In [57]:
print(parsed_result)

[[{'role': 'user', 'content': 'Can I register a car under multiple names in California?'}, {'role': 'assistant', 'content': "Yes, in California, a vehicle or vessel can be owned by two or more co-owners. The names of the co-owners can be joined by 'and', 'and/or', or 'or', depending on how you want ownership to be structured."}], [{'role': 'user', 'content': 'What happens if I register a vehicle with co-owners?'}, {'role': 'assistant', 'content': "When you register a vehicle with co-owners, all co-owners must sign the title or registration application. The type of conjunction you use between their names ('and', 'and/or', or 'or') determines the rules for transferring ownership."}], [{'role': 'user', 'content': "What do the terms 'and', 'and/or', and 'or' mean in a co-ownership context?"}, {'role': 'assistant', 'content': "'And' means all owners must sign for any transfer of ownership. 'And/or' or 'or' means that only one owner's signature is needed for a transfer. This affects how owne

In [53]:
print(parsed_result[0][0]["content"])

Can I register a car under multiple names in California?


In [56]:
for pair in parsed_result:
    print("QUESTION:", pair[0]["content"])
    print("ANSWER:", pair[1]["content"])
    print()
    print()

QUESTION: Can I register a car under multiple names in California?
ANSWER: Yes, in California, a vehicle or vessel can be owned by two or more co-owners. The names of the co-owners can be joined by 'and', 'and/or', or 'or', depending on how you want ownership to be structured.


QUESTION: What happens if I register a vehicle with co-owners?
ANSWER: When you register a vehicle with co-owners, all co-owners must sign the title or registration application. The type of conjunction you use between their names ('and', 'and/or', or 'or') determines the rules for transferring ownership.


QUESTION: What do the terms 'and', 'and/or', and 'or' mean in a co-ownership context?
ANSWER: 'And' means all owners must sign for any transfer of ownership. 'And/or' or 'or' means that only one owner's signature is needed for a transfer. This affects how ownership is released upon the death of a co-owner as well.


QUESTION: Do the different conjunctions change the ownership transfer rules?
ANSWER: Yes, they

Small text example

In [58]:
completion, parsed_result = sft_qa_pairs("../dataset/ch01/sec02.txt")
for pair in parsed_result:
    print("QUESTION:", pair[0]["content"])
    print("ANSWER:", pair[1]["content"])
    print()
    print()

QUESTION: Can I submit vehicle registration documents with a sticker or label on them?
ANSWER: No, you cannot submit vehicle registration documents that have any kind of adhesive label. It is required to remove the label before submission.


QUESTION: What should I do if my registration document has an adhesive label?
ANSWER: If your registration document has an adhesive label, you should remove it and either submit a new document or a correct form depending on the condition of the title after the label is removed.


QUESTION: What forms do I need to submit if I remove a label from a registration document?
ANSWER: You will need to submit either a Statement to Record Ownership / Statement of Error or Erasure (REG 101) form or an Application for Replacement or Transfer of Title (REG 227) form, based on the condition of the title after the label is removed.


QUESTION: What happens if I submit a registration form with an adhesive label?
ANSWER: Submitting a registration document that cont
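
Since the prompts describe the pairs as data for supervised fine-tuning, one plausible next step is to write `parsed_result` to a JSONL file. The sketch below assumes the chat-style layout with one `{"messages": [...]}` object per line, as used by OpenAI's fine-tuning data format; the notebook does not state which target format is intended, and `write_sft_jsonl` is a hypothetical helper.

```python
import json


def write_sft_jsonl(pairs, path, system_prompt=None):
    """Write [user, assistant] pairs to `path`, one {"messages": [...]} object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for user_msg, assistant_msg in pairs:
            messages = [user_msg, assistant_msg]
            if system_prompt:
                # Optionally prepend a system message to every training example
                messages.insert(0, {"role": "system", "content": system_prompt})
            f.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")
    return len(pairs)
```

For example, `write_sft_jsonl(parsed_result, "sec02_sft.jsonl")` would store the pairs from the cell above; the output file name here is only a placeholder.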