# Scenario

## **Aim**

We have a knowledge base for A Level students covering the following subjects:

- Biology
- Chemistry
- Physics

We have to develop a chatbot that assists these students with their studies.
<br>
<br>

## **Problem Faced**

We have to use a RAG-based approach because the knowledge base is continuously updated.

However, when all the knowledge is stored in a single vector database, retrieving the relevant documents takes a long time.
<br>
<br>

## **Proposed Solution**

Split the vector database into three, one per subject (Biology, Chemistry, and Physics), so that each query searches only the store for its subject.
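A minimal sketch of the partitioning step, assuming each knowledge-base document carries a `subject` metadata field (an assumption; the document schema is not described here). Plain Python lists stand in for the three vector databases, and `partition_by_subject` is a hypothetical helper:

```python
from collections import defaultdict

SUBJECTS = ("biology", "chemistry", "physics")

def partition_by_subject(documents):
    """Group documents into one bucket per subject.

    Each document is assumed to be a dict with "subject" and "text" keys.
    Documents with an unknown subject are skipped.
    """
    stores = defaultdict(list)
    for doc in documents:
        subject = doc.get("subject", "").strip().lower()
        if subject in SUBJECTS:
            stores[subject].append(doc["text"])
    # Return a plain dict that always contains all three subjects.
    return {s: stores[s] for s in SUBJECTS}

docs = [
    {"subject": "Biology", "text": "Enzymes lower activation energy."},
    {"subject": "physics", "text": "Momentum is conserved in collisions."},
    {"subject": "history", "text": "Out of scope."},
]
stores = partition_by_subject(docs)
print({s: len(texts) for s, texts in stores.items()})
```

In a real pipeline each bucket would be embedded and written to its own vector database instead of kept in a list.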
<br>
<br>

## **New Problem Faced**

We now have to identify the subject from the student's question and then route the RAG pipeline to the vector database for that subject.
<br>
<br>

## **New Proposed Solution**

Fine-tune GPT-3.5 to identify the subject of a complex question asked by a student and return a one-word response: 'biology', 'chemistry', or 'physics'.

Then use the response to route the RAG pipeline to the vector database for that subject.
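The routing step can be sketched as a dictionary lookup from the classifier's label to a retriever. This is only an illustration: `make_router` is a hypothetical helper, and the lambdas are stand-ins for real per-subject vector database queries.

```python
def make_router(retrievers):
    """Return a function that sends a question to the retriever for its subject.

    `retrievers` maps a subject label to a callable that takes the question.
    """
    def route(subject_label, question):
        label = subject_label.strip().lower()
        if label not in retrievers:
            raise ValueError(f"Unexpected subject label: {subject_label!r}")
        return retrievers[label](question)
    return route

# Hypothetical stand-ins for per-subject vector database lookups.
retrievers = {
    "biology": lambda q: f"[biology store] {q}",
    "chemistry": lambda q: f"[chemistry store] {q}",
    "physics": lambda q: f"[physics store] {q}",
}
route = make_router(retrievers)
print(route("physics", "What is wave-particle duality?"))
```

Raising on an unknown label keeps a malformed classifier response from silently querying the wrong store.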
<br>
<br>

## **Functional Requirement**

- The model must output exactly one of 'biology', 'chemistry', or 'physics'
- Out-of-scope questions are not handled
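A small guard can enforce the first requirement on the consuming side. The sketch below is an assumption about how the output might be cleaned, not part of the original pipeline; `normalize_label` is a hypothetical helper that returns `None` for anything outside the three labels.

```python
ALLOWED_LABELS = {"biology", "chemistry", "physics"}

def normalize_label(raw_output):
    """Return the cleaned label, or None if it is not one of the three subjects."""
    # Drop surrounding whitespace, quotes, and a trailing full stop.
    cleaned = raw_output.strip().strip("'\".").lower()
    return cleaned if cleaned in ALLOWED_LABELS else None

print(normalize_label(" 'Physics'. "))  # physics
print(normalize_label("geography"))     # None
```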
<br>
<br>

# Data generation

In [None]:
prompt = "A model that takes in a complex paragraph of AS & A Level science question, and answer with subject related to the question as 'biology' or 'chemistry' or 'physics'. Give an output 'biology' or 'chemistry' or 'physics'"
temperature = .3
number_of_examples = 60

In [None]:
!pip install openai==0.28 tenacity

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.5/76.5 kB 2.4 MB/s eta 0:00:00
Installing collected packages: openai
Successfully installed openai-0.28.0


In [None]:
import os
import openai
import random
from tenacity import retry, stop_after_attempt, wait_exponential

In [None]:
openai.api_key = 'your-api-key-here'

Generate training data

In [None]:
N_RETRIES = 3

@retry(stop=stop_after_attempt(N_RETRIES), wait=wait_exponential(multiplier=1, min=4, max=70))
def generate_example(prompt, prev_examples, temperature=.5):
    messages=[
        {
            "role": "system",
            "content": f"You are generating data which will be used to train a machine learning model.\n\nYou will be given a high-level description of the model we want to train, and from that, you will generate data samples, each with a prompt/response pair.\n\nYou will do so in this format:\n```\nprompt\n-----------\n$prompt_goes_here\n-----------\n\nresponse\n-----------\n$response_goes_here\n-----------\n```\n\nOnly one prompt/response pair should be generated per turn.\n\nFor each turn, make the example slightly more complex than the last, while ensuring diversity.\n\nMake sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model.\n\nHere is the type of model we want to train:\n`{prompt}`"
        }
    ]

    if len(prev_examples) > 0:
        if len(prev_examples) > 8:
            prev_examples = random.sample(prev_examples, 8)
        for example in prev_examples:
            messages.append({
                "role": "assistant",
                "content": example
            })

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        temperature=temperature,
        max_tokens=1000,
    )

    return response.choices[0].message['content']

# Generate examples
prev_examples = []
for i in range(number_of_examples):
    print(f'Generating example {i}')
    example = generate_example(prompt, prev_examples, temperature)
    prev_examples.append(example)

print(prev_examples)

Generating example 0
Generating example 1
Generating example 2
Generating example 3
Generating example 4
Generating example 5
Generating example 6
Generating example 7
Generating example 8
Generating example 9
Generating example 10
Generating example 11
Generating example 12
Generating example 13
Generating example 14
Generating example 15
Generating example 16
Generating example 17
Generating example 18
Generating example 19
Generating example 20
Generating example 21
Generating example 22
Generating example 23
Generating example 24
Generating example 25
Generating example 26
Generating example 27
Generating example 28
Generating example 29
Generating example 30
Generating example 31
Generating example 32
Generating example 33
Generating example 34
Generating example 35
Generating example 36
Generating example 37
Generating example 38
Generating example 39
Generating example 40
Generating example 41
Generating example 42
Generating example 43
Generating example 44
Generating example 4

Generate evaluation data

In [None]:
eval_examples = []
for i in range(50):
    print(f'Generating example {i}')
    example = generate_example(prompt, eval_examples, temperature)
    eval_examples.append(example)

print(eval_examples)

Generating example 0
Generating example 1
Generating example 2
Generating example 3
Generating example 4
Generating example 5
Generating example 6
Generating example 7
Generating example 8
Generating example 9
Generating example 10
Generating example 11
Generating example 12
Generating example 13
Generating example 14
Generating example 15
Generating example 16
Generating example 17
Generating example 18
Generating example 19
Generating example 20
Generating example 21
Generating example 22
Generating example 23
Generating example 24
Generating example 25
Generating example 26
Generating example 27
Generating example 28
Generating example 29
Generating example 30
Generating example 31
Generating example 32
Generating example 33
Generating example 34
Generating example 35
Generating example 36
Generating example 37
Generating example 38
Generating example 39
Generating example 40
Generating example 41
Generating example 42
Generating example 43
Generating example 44
Generating example 4

Generate a system message.

In [None]:
def generate_system_message(prompt):

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
          {
            "role": "system",
            "content": "You will be given a high-level description of the model we are training, and from that, you will generate a simple system prompt for that model to use. Remember, you are not generating the system message for data generation -- you are generating the system message to use for inference. A good format to follow is `Given $INPUT_DATA, you will $WHAT_THE_MODEL_SHOULD_DO.`.\n\nMake it as concise as possible. Include nothing but the system prompt in your response.\n\nFor example, never write: `\"$SYSTEM_PROMPT_HERE\"`.\n\nIt should be like: `$SYSTEM_PROMPT_HERE`."
          },
          {
              "role": "user",
              "content": prompt.strip(),
          }
        ],
        temperature=temperature,
        max_tokens=500,
    )

    return response.choices[0].message['content']

system_message = generate_system_message(prompt)

print(f'The system message is: `{system_message}`. Feel free to re-run this cell if you want a better result.')

The system message is: `Given a complex paragraph of AS & A Level science question, you will identify and output the relevant subject as 'biology', 'chemistry', or 'physics'.`. Feel free to re-run this cell if you want a better result.


Create final pair of datasets.

In [None]:
import json
import pandas as pd

# Initialize lists to store prompts and responses
prompts = []
responses = []

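# Parse out prompts and responses from examples
for example in prev_examples:
    split_example = example.split('-----------')
    # A well-formed example has the prompt at index 1 and the response at index 3
    if len(split_example) < 4:
        continue  # skip malformed examples so both lists stay the same length
    prompts.append(split_example[1].strip())
    responses.append(split_example[3].strip())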

# Create a DataFrame
df = pd.DataFrame({
    'prompt': prompts,
    'response': responses
})

# Remove duplicates
df = df.drop_duplicates()

print('There are ' + str(len(df)) + ' successfully-generated examples.')

# Initialize list to store training examples
training_examples = []

# Create training examples in the format required for GPT-3.5 fine-tuning
for index, row in df.iterrows():
    training_example = {
        "messages": [
            {"role": "system", "content": system_message.strip()},
            {"role": "user", "content": row['prompt']},
            {"role": "assistant", "content": row['response']}
        ]
    }
    training_examples.append(training_example)

# Save training examples to a .jsonl file
with open('training_examples.jsonl', 'w') as f:
    for example in training_examples:
        f.write(json.dumps(example) + '\n')

There are 60 successfully-generated examples.
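Before uploading, it can help to confirm that every line of the `.jsonl` file has the three-message chat structure built above. A minimal sketch: `validate_chat_jsonl` is a hypothetical helper, and its checks reflect only the system/user/assistant layout used in this notebook, not the full set of rules OpenAI applies.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples for malformed records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
            problems.append((number, "missing or malformed 'messages' list"))
            continue
        roles = [m.get("role") for m in messages]
        if roles != EXPECTED_ROLES:
            problems.append((number, f"unexpected roles: {roles}"))
        elif not all(isinstance(m.get("content"), str) and m["content"].strip() for m in messages):
            problems.append((number, "empty or non-string content"))
    return problems

sample = [
    json.dumps({"messages": [
        {"role": "system", "content": "Classify the subject."},
        {"role": "user", "content": "Describe mitosis."},
        {"role": "assistant", "content": "biology"},
    ]}),
    json.dumps({"messages": [{"role": "user", "content": "No answer"}]}),
    "not json",
]
problems = validate_chat_jsonl(sample)
print(problems)
```

To check the real file, pass an open file handle, for example `validate_chat_jsonl(open('training_examples.jsonl'))`; an empty list means no problems were found.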


Create evaluation dataset.

In [None]:
import json
import pandas as pd

# Initialize lists to store prompts and responses
eval_prompts = []
eval_responses = []

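# Parse out prompts and responses from examples
for example in eval_examples:
    split_example = example.split('-----------')
    # A well-formed example has the prompt at index 1 and the response at index 3
    if len(split_example) < 4:
        continue  # skip malformed examples so both lists stay the same length
    eval_prompts.append(split_example[1].strip())
    eval_responses.append(split_example[3].strip())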

# Create a DataFrame
eval_df = pd.DataFrame({
    'inputs': eval_prompts,
    'ground_truth': eval_responses
})

# Remove duplicates
eval_df = eval_df.drop_duplicates()

print('There are ' + str(len(eval_df)) + ' successfully-generated examples.')

# Initialize list to store evaluation examples
evaluation_examples = []

# Create evaluation examples in the same chat format as the training data
for index, row in eval_df.iterrows():
    evaluation_example = {
        "messages": [
            {"role": "system", "content": system_message.strip()},
            {"role": "user", "content": row['inputs']},
            {"role": "assistant", "content": row['ground_truth']}
        ]
    }
    evaluation_examples.append(evaluation_example)

# Save evaluation examples to a .jsonl file
with open('evaluation_examples.jsonl', 'w') as f:
    for example in evaluation_examples:
        f.write(json.dumps(example) + '\n')

There are 50 successfully-generated examples.


# Upload the file to OpenAI

In [None]:
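# Upload the training file; the context manager closes the handle afterwards
with open("/content/training_examples.jsonl", "rb") as training_file:
    file_id = openai.File.create(file=training_file, purpose='fine-tune').id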

# Train the model

In [None]:
job = openai.FineTuningJob.create(training_file=file_id, model="gpt-3.5-turbo")

job_id = job.id
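The fine-tuning job runs asynchronously, so the notebook has to wait before the model name is available. A minimal polling sketch, with some assumptions: `wait_for_job` is a hypothetical helper, and the terminal status names (`succeeded`, `failed`, `cancelled`) are what the fine-tuning API is understood to report. The status source is passed in as a callable so the sketch runs without an API call.

```python
import time

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal state or polls run out."""
    status = None
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Job still '{status}' after {max_polls} polls")

# Demo with a fake status source instead of a live fine-tuning job.
fake_statuses = iter(["validating_files", "running", "succeeded"])
final = wait_for_job(lambda: next(fake_statuses), poll_seconds=0)
print(final)
```

With the legacy `openai==0.28` client used in this notebook, the callable could be `lambda: openai.FineTuningJob.retrieve(job_id).status`.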

In [None]:
openai.FineTuningJob.list_events(id=job_id, limit=10)

<OpenAIObject list at 0x7d7a20863ab0> JSON: {
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-KGHdotxfzsDMlUdnpZAxdEPF",
      "created_at": 1724413894,
      "level": "info",
      "message": "The job has successfully completed",
      "data": {},
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-ynPcrJuU2n4PIXyPQC5SFvQZ",
      "created_at": 1724413891,
      "level": "info",
      "message": "New fine-tuned model created",
      "data": {},
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-2FPLamejO5oI8ga2UY5BqOIL",
      "created_at": 1724413891,
      "level": "info",
      "message": "Checkpoint created at step 120",
      "data": {},
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-Av2sXUCS8idNiwMzqdKKHwhI",
      "created_at": 1724413891,
      "level": "info",
      "message": "Ch

# Get the fine-tuned model name.

In [None]:
model_name_pre_object = openai.FineTuningJob.retrieve(job_id)
model_name = model_name_pre_object.fine_tuned_model
print(model_name)

ft:gpt-3.5-turbo-0125:personal::9zN7DncB


# Inference Example

In [None]:
def generate_response(question):
    """Classify a question with the fine-tuned model and return the subject label."""
    response = openai.ChatCompletion.create(
        model=model_name,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message['content']

In [None]:
generate_response("Explain the concept of wave-particle duality, providing examples of its experimental observation. Discuss how this phenomenon challenges classical Newtonian mechanics and how it is incorporated into the quantum mechanical framework.")

'physics'

In [None]:
generate_response("Consider the mechanisms of epigenetic modification (DNA methylation, histone acetylation, and non-coding RNA), their impact on gene expression, and how these modifications can be influenced by environmental factors and inherited. Explore the potential for epigenetic therapies and the ethical implications of manipulating epigenetic processes.")

'biology'

In [None]:
generate_response("""A compound, X, with the molecular formula C₅H₈O₂ undergoes the following reactions:
Reaction with Tollens' reagent: No reaction. Reaction with 2,4-dinitrophenylhydrazine (DNPH): A yellow precipitate forms.
Reaction with sodium hydroxide solution: A salt, Y, and an alcohol, Z, are formed.
Oxidation of Z with acidified potassium dichromate: A carboxylic acid, W, is formed.
Deduce the structural formula of compound X and explain the reactions involved.""")

'chemistry'

# Evaluate Model

In [None]:
!pip install mlflow tiktoken

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 11.4 MB/s eta 0:00:00
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0


In [None]:
import os

os.environ["OPENAI_API_KEY"] = 'your-api-key-here'

In [None]:
import mlflow
import openai
import os
import pandas as pd
from getpass import getpass

eval_data = eval_df

with mlflow.start_run() as run:
    system_prompt = system_message
    logged_model_info = mlflow.openai.log_model(
        model=model_name,  # fine-tuned model ID retrieved from the job above
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")


2024/08/23 13:02:39 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/08/23 13:02:41 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


See aggregated evaluation results below: 
{'exact_match/v1': 0.96}



See evaluation table below: 
                                               inputs ground_truth    outputs  \
0   Explain how the process of photosynthesis conv...      biology    biology   
1   Describe the process of cellular respiration, ...      biology    biology   
2   Discuss the principles of Newton's laws of mot...      physics    physics   
3   Explain the concept of chemical equilibrium an...    chemistry  chemistry   
4   Analyze the structure and function of the huma...      biology    biology   
5   Evaluate the impact of temperature on the rate...    chemistry  chemistry   
6   Illustrate the concept of energy conservation ...      physics    physics   
7   Examine the role of enzymes as biological cata...      biology    biology   
8   Discuss the principles of thermodynamics as th...    chemistry  chemistry   
9   Evaluate the principles of thermodynamics as t...    chemistry  chemistry   
10  Discuss the electromagnetic spectrum and its v...      physics    physics   
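The `exact_match/v1` score reported above is, as the name suggests, the share of rows whose output equals the ground truth. A minimal sketch that recomputes such a score and lists the misclassified rows; `exact_match` is a hypothetical helper and the sample labels are illustrative, not taken from the evaluation run.

```python
def exact_match(ground_truth, outputs):
    """Return (score, mismatched row indices) for two equal-length label lists."""
    if len(ground_truth) != len(outputs):
        raise ValueError("ground_truth and outputs must have the same length")
    if not ground_truth:
        raise ValueError("cannot score an empty evaluation set")
    mismatches = [i for i, (t, o) in enumerate(zip(ground_truth, outputs)) if t != o]
    score = 1 - len(mismatches) / len(ground_truth)
    return score, mismatches

truth = ["biology", "physics", "chemistry", "chemistry"]
preds = ["biology", "physics", "chemistry", "physics"]
score, wrong = exact_match(truth, preds)
print(score, wrong)
```

Applied to the table above, the inputs would be `eval_table["ground_truth"].tolist()` and `eval_table["outputs"].tolist()`; the returned indices point at the rows worth inspecting by hand.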