# Automated Construction of a Topic-Specific Fine-Tuned Large Language Model (LLM)

Before using the OpenAI library, you need to set up your OpenAI API token. This token grants access to OpenAI's services.

In [1]:
import os
import openai
from dotenv import load_dotenv, find_dotenv

# Load environment variables from .env file
_ = load_dotenv(find_dotenv()) # read local .env file

# Set OpenAI API key from environment variable
openai.api_key  = os.environ['OPENAI_API_KEY']
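
`os.environ['OPENAI_API_KEY']` raises a bare `KeyError` when the variable is missing. The following is a minimal sketch of a friendlier check. The helper name `require_env` is hypothetical and is not part of the OpenAI or python-dotenv libraries.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a readable message.

    `require_env` is a hypothetical helper used only for illustration.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in the shell."
        )
    return value
```

With this helper, the assignment above could read `openai.api_key = require_env("OPENAI_API_KEY")`.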

# Dataset generation

In this step, we automatically generate the dataset used to fine-tune the model. We initialize the following parameters:

1. Temperature: We choose a low value so that the generated answers stay close to the documentation.
2. Number of examples: We leave this parameter empty, because the number of examples generated depends on the given content.
3. Reference document: For our use case, we give the model a reference document from which to generate data samples. If your use case needs no reference data, so that the model must generate examples from its own knowledge, leave the reference document field empty. Here, we automatically scrape the content of the corresponding web documentation and pass the extracted text to the model. You can find the web-scraped data [here](link_here).




In [18]:
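User_prompt = "Develop a fine-tuned LLM capable of responding to queries sourced from Tracified documentation"
temperature = 0.4  # a low value keeps answers close to the reference text
# Note: defined here but never passed to the chain below, which leaves the sample count empty.
number_of_examples = 50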


Load the reference data:

If your use case needs no reference data, leave the `reference_doc` variable empty.

In [20]:
import pandas as pd
def readCsvFile(fileName):
    return pd.read_csv(fileName)

data = readCsvFile('tracified_website_data.csv')
reference_doc=data["Document Content"][0]
print(data["Document Content"][0])

Introduction to Tracified
What is Tracified#

Tracified powered by Blockchain technology facilitates a tamper proof platform that streamlines the data flow within a supply chain, introducing a novel crypto-economic model based on a reward/penalty concept, ensuring fair distribution of gains across the chain. The originality of the solution is further enhanced by its ability to get customized to suit the needs of a business.

A blockchain based platform that adds the crucial element ‘trust’ to traceability information. This applies to any buyer – seller scenario that occurs in a supply chain (not only the end-consumer). Trust is achieved by 3 pillars in tracified.1) blockchain based direct proofs 2) web of trust 3) security deposits by improving trust, tracified enables smarter purchasing decisions from the buyers side while making it possible for sellers who sell genuinely high quality products to prove their value.

WHAT IS INCLUDED IN THE TRACIFIED PRODUCT LINE?

Tracified Web Portal
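
The text above says to leave `reference_doc` empty when there is no reference data. A minimal sketch of that fallback follows. It uses the standard-library `csv` module instead of pandas to stay dependency-free, the helper name `load_reference` is hypothetical, and it assumes the CSV has a `Document Content` column like the file loaded above.

```python
import csv
from pathlib import Path


def load_reference(path: str, column: str = "Document Content", row: int = 0) -> str:
    """Return the text stored in `column` at `row`, or "" when no reference is available."""
    file = Path(path)
    if not file.is_file():
        return ""  # no reference document: the model relies on its own knowledge
    with file.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    if row >= len(rows):
        return ""
    return rows[row].get(column) or ""
```

An empty string returned here plays the same role as an empty `reference_doc` in the prompt.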

Next, we define the data generation chain, which comprises the following components:

1. Prompt: We establish a clear prompt with input variables.
2. Model: We use the 'gpt-4-1106-preview' model.
3. Parser: We specify a parser to ensure that the output is structured and consistent at all times.

The output of this chain will consist of the generated data samples as outlined in the parser.

In [7]:
prompt_template = """
    You are generating data to train a machine learning model. \
    You will receive a high-level description of the model we want to train. If reference data is provided, use only that. \
    If not, rely on your knowledge. \
    From this, generate data samples, each with a prompt/response pair. \
    Ensure your samples are unique, diverse, and of high quality to train a well-performing model. \

    If a specific number of samples is requested, generate that number; otherwise, generate as many as possible.  		
    
    Model type: {prompt}
    Number of samples: {examples}
    Reference data: {reference}
    
    {format_instructions}
    """


In [8]:
from typing import List
from langchain_core.pydantic_v1 import BaseModel, Field


class DataSample(BaseModel):
    """Data samples generated for fine-tuning a large language model."""

    prompt: str = Field(description="Text provided as the prompt.")
    response: str = Field(description="The corresponding response to the prompt.")


class ListDataSamples(BaseModel):
    """Information to extract."""
    datasamples: List[DataSample] = Field(description="List of generated data samples")    

In [9]:
from langchain.output_parsers import PydanticOutputParser
#  Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=ListDataSamples)
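
To make the parser's role concrete, here is a standard-library sketch of the kind of validation it performs on the model's JSON output. This is a stand-in for illustration, not how `PydanticOutputParser` is implemented. `Sample` and `parse_samples` are hypothetical names, and the field names mirror the `ListDataSamples` schema defined above.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    prompt: str
    response: str


def parse_samples(raw: str) -> List[Sample]:
    """Parse text shaped like {"datasamples": [{"prompt": ..., "response": ...}]}."""
    payload = json.loads(raw)
    samples = []
    for item in payload["datasamples"]:
        prompt, response = item["prompt"], item["response"]
        if not isinstance(prompt, str) or not isinstance(response, str):
            raise ValueError("prompt and response must both be strings")
        samples.append(Sample(prompt=prompt, response=response))
    return samples
```

Malformed output fails loudly (a `KeyError`, `ValueError`, or JSON decoding error) instead of silently producing incomplete samples, which is the main reason to put a parser at the end of the chain.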

In [10]:
from langchain_openai import ChatOpenAI
# Define the chat model, reusing the temperature set earlier
model = ChatOpenAI(model_name='gpt-4-1106-preview', temperature=temperature)

In [15]:
from langchain.prompts import PromptTemplate

PROMPT = PromptTemplate(
        template=prompt_template, 
        input_variables=["prompt", "reference"],
        partial_variables={"examples":"","format_instructions": parser.get_format_instructions()},
    )
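
The `partial_variables` above fix `examples` to an empty string, so no sample count reaches the model. The sketch below shows one way an optional count could be rendered into the prompt. It uses plain `str.format` rather than `PromptTemplate`, `TEMPLATE` is an abbreviated stand-in for the full prompt, and `render` is a hypothetical helper.

```python
from typing import Optional

# Abbreviated stand-in for the variable section of the prompt template.
TEMPLATE = (
    "Model type: {prompt}\n"
    "Number of samples: {examples}\n"
    "Reference data: {reference}\n"
)


def render(prompt: str, reference: str = "", examples: Optional[int] = None) -> str:
    """Fill the template; an unset count becomes an empty field."""
    return TEMPLATE.format(
        prompt=prompt,
        examples="" if examples is None else str(examples),
        reference=reference,
    )
```

Leaving `examples` as `None` reproduces the behaviour of the notebook, where the model decides how many samples the content supports.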


In [12]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
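# Define the data generation chain. The prompt template reads "prompt" and
# "reference" directly from the dict passed to invoke(); wrapping each key in
# RunnablePassthrough() would hand the whole input dict to both variables.
data_generation_chain = PROMPT | model | parser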


In [16]:


# Execute the data generation chain
response1 = data_generation_chain.invoke({
    "prompt":User_prompt,
    "reference": reference_doc
})


In [17]:
print(response1)

datasamples=[DataSample(prompt='Can you explain how Tracified uses blockchain technology?', response='Tracified utilizes blockchain technology to create a tamper-proof platform that enhances the traceability and trust in supply chains. It does this by providing direct proofs on the blockchain, establishing a web of trust, and using security deposits to incentivize the fair distribution of gains and ensure the authenticity of products.'), DataSample(prompt='What are the main components of the Tracified product line?', response='The Tracified product line includes Tracified Web Portals such as the Admin Portal, Configs Portal, and Insights Portal, as well as Tracified Mobile Applications including the Field Officer Application, Consumer Application, and Wallet Application.'), DataSample(prompt="What is meant by 'Artifacts' or 'Master data' in the context of Tracified?", response="Within Tracified, 'Artifacts' or 'Master data' refer to static data that are entered into the system and rema

In [57]:
#Generated data samples
for s in response1.datasamples:
    print("Prompt:", s.prompt)
    print("Response:", s.response)
    print()  # Adding an empty line for clarity

Prompt: Can you explain how Tracified uses blockchain technology?
Response: Tracified utilizes blockchain technology to create a tamper-proof platform that enhances the traceability and trust in supply chains. It does this by providing direct proofs on the blockchain, establishing a web of trust, and using security deposits to incentivize the fair distribution of gains and ensure the authenticity of products.

Prompt: What are the main components of the Tracified product line?
Response: The Tracified product line includes Tracified Web Portals such as the Admin Portal, Configs Portal, and Insights Portal, as well as Tracified Mobile Applications including the Field Officer Application, Consumer Application, and Wallet Application.

Prompt: What is meant by 'Artifacts' or 'Master data' in the context of Tracified?
Response: Within Tracified, 'Artifacts' or 'Master data' refer to static data that are entered into the system and remain unchanged throughout the entire supply chain. This da
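
The prompt asks for unique samples, but nothing in the chain enforces that. A small post-processing check can be sketched as follows. `dedupe_samples` is a hypothetical helper, and comparing prompts case-insensitively after collapsing whitespace is an assumption about what counts as a duplicate.

```python
from typing import Iterable, List, Tuple


def dedupe_samples(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop pairs with an empty field or with a prompt that was already seen."""
    seen = set()
    kept = []
    for prompt, response in pairs:
        # Normalise case and whitespace so trivially different prompts match.
        key = " ".join(prompt.lower().split())
        if not key or not response.strip() or key in seen:
            continue
        seen.add(key)
        kept.append((prompt, response))
    return kept
```

It could be applied as `dedupe_samples((s.prompt, s.response) for s in response1.datasamples)` before the data is written out.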

# System message generation

In this stage, we prompt the model to generate the system message that the fine-tuned model will use at inference time. As before, an output parser structures the result so that it is consistent.

In [23]:
prompt_template2 = """You will be given a high-level description of the model we are training, \
and from that, you will generate a simple system prompt for that model to use. \
Remember, you are not generating the system message for data generation -- \
you are generating the system message to use for inference. \
A good format to follow is `Given $INPUT_DATA, you will $WHAT_THE_MODEL_SHOULD_DO.`. \
Make it as concise as possible. Include nothing but the system prompt in your response.
Here is the high-level description of the model: {prompt}

\n{format_instructions2}\n
"""

In [24]:
class SystemMessage(BaseModel):
    """Generated system message."""

    system_message: str = Field(description="system message to use for inference.")
    
parser2 = PydanticOutputParser(pydantic_object=SystemMessage)


In [25]:
PROMPT2 = PromptTemplate(
        template=prompt_template2, 
        input_variables=["prompt"],
        partial_variables={"format_instructions2": parser2.get_format_instructions()},
    )

In [26]:
# The prompt template reads "prompt" directly from the dict passed to invoke()
extraction_chain2 = PROMPT2 | model | parser2


In [31]:

# Execute the system message generation chain
response2 = extraction_chain2.invoke({"prompt": User_prompt})


In [32]:
print(response2)


system_message='Given a query sourced from Tracified documentation, provide a detailed and accurate response.'



# Fine-Tuning

1. Prepare training data:

For the OpenAI API, the data must be stored in JSONL format, with one training example per line in the following structure:

{"messages": [{"role": "system", "content": "system message here"}, {"role": "user", "content": "user prompt should be here"}, {"role": "assistant", "content": "response should be here"}]}
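
Before writing or uploading the file, each line can be checked against this structure. The sketch below is a hypothetical validator (`validate_jsonl_line` is not an OpenAI function). It enforces the exact system, user, assistant layout used in this notebook, which is an assumption stricter than what the API itself accepts.

```python
import json
from typing import List

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_jsonl_line(line: str) -> List[str]:
    """Return the problems found in one JSONL line; an empty list means it is fine."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list):
        return ["missing 'messages' list"]
    problems = []
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles != EXPECTED_ROLES:
        problems.append(f"expected roles {EXPECTED_ROLES}, got {roles}")
    for m in messages:
        content = m.get("content") if isinstance(m, dict) else None
        if not isinstance(content, str) or not content.strip():
            problems.append("every message needs non-empty string content")
            break
    return problems
```

Running it over every line of `dataset.jsonl` and stopping on the first non-empty result catches formatting mistakes locally, before the upload step.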


In [39]:
def extract_user_assistant_content(datasamples):
    user_content_list = []
    assistant_content_list = []
    
    for sample in datasamples:
        user_content_list.append(sample.prompt)
        assistant_content_list.append(sample.response)
    
    return user_content_list, assistant_content_list

# Extract user and assistant content
user_content, assistant_content = extract_user_assistant_content(response1.datasamples)
print(user_content)
print()
print(assistant_content)

['Can you explain how Tracified uses blockchain technology?', 'What are the main components of the Tracified product line?', "What is meant by 'Artifacts' or 'Master data' in the context of Tracified?", 'How does Tracified collect Stage Data or Tracking Data?', 'What is a TDP in Tracified?', 'What does Change of Custody (CoC) mean in Tracified?', 'How are batch IDs used in Tracified?', 'What is the purpose of the consumer-map in Tracified?', 'What are geotagged photos in the context of Tracified?']

['Tracified utilizes blockchain technology to create a tamper-proof platform that enhances the traceability and trust in supply chains. It does this by providing direct proofs on the blockchain, establishing a web of trust, and using security deposits to incentivize the fair distribution of gains and ensure the authenticity of products.', 'The Tracified product line includes Tracified Web Portals such as the Admin Portal, Configs Portal, and Insights Portal, as well as Tracified Mobile Appl

In [40]:
import json
# Write data to JSONL file
with open('dataset.jsonl', mode='w') as writer:
    for user_prompt, assistant_response in zip(user_content, assistant_content):
        data = {
            "messages": [
                {"role": "system", "content": response2.system_message},
                {"role": "user", "content": user_prompt},
                {"role": "assistant", "content": assistant_response}
            ]
        }
        writer.write(json.dumps(data) + '\n')

2. Upload the training file

Your training file must be in JSONL format. Once you've uploaded the file, processing might take a while. The maximum size for file uploads is 1 GB. To upload a file to the OpenAI server:

In [48]:
from openai import OpenAI
client = OpenAI()


client.files.create(
  file=open("dataset.jsonl", "rb"),
  purpose="fine-tune"
)


FileObject(id='file-vaE1yci8M4d6Ejq4iXNK5FYw', bytes=4826, created_at=1711602441, filename='dataset.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)
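
The size limit mentioned above can be checked locally before uploading. The sketch below is a hypothetical helper. It interprets 1 GB as 1024**3 bytes, which is an assumption, so check the current OpenAI documentation for the exact limit.

```python
import os

# Assumption: "1 GB" is read as 1024**3 bytes; the real limit may differ.
MAX_UPLOAD_BYTES = 1 * 1024**3


def check_upload_size(path: str, limit: int = MAX_UPLOAD_BYTES) -> int:
    """Return the file size in bytes, raising ValueError if it is empty or above `limit`."""
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(f"{path} is empty")
    if size > limit:
        raise ValueError(f"{path} is {size} bytes, above the {limit}-byte limit")
    return size
```

For the dataset in this notebook, `check_upload_size("dataset.jsonl")` would simply return the size, which the output above reports as 4826 bytes.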

In [49]:
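# Upload again and keep the returned file ID for the fine-tuning job.
# (This creates a second copy of the file; alternatively, store the result of the call above.)
with open("dataset.jsonl", "rb") as f:
    file_id = client.files.create(file=f, purpose="fine-tune").id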

3. Create a fine-tuned model (training)

After confirming that the file was uploaded successfully, the next step is to create a fine-tuning job. `training_file` is the file ID that was returned when the training file was uploaded to the OpenAI API. To start a fine-tuning job:



In [52]:
from openai import OpenAI
client = OpenAI()


job=client.fine_tuning.jobs.create(
  training_file=file_id,
  model="gpt-3.5-turbo" #change to gpt-4-0613 if you have access
)


In [54]:
job_id = job.id
print(job_id)

ftjob-HhQ0CW8UKumR7mj1gY8wo4g3


In [55]:
from openai import OpenAI
client = OpenAI()
# Retrieve the job; fine_tuned_model stays None until the job has succeeded
fine_tuning_job = client.fine_tuning.jobs.retrieve(job_id)

model_name = fine_tuning_job.fine_tuned_model
print(model_name)

ft:gpt-3.5-turbo-0125:personal::97gpX82a
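
The model name is only available once the job has finished, so in practice the retrieve call is repeated until the job reaches a final state. A minimal polling sketch follows. `wait_for_job` is a hypothetical helper, the poll interval and limit are arbitrary assumptions, and `retrieve` is expected to be a callable such as `client.fine_tuning.jobs.retrieve` that returns an object with a `status` attribute.

```python
import time
from typing import Any, Callable

# Final states of a fine-tuning job; other states mean the job is still in progress.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(
    retrieve: Callable[[str], Any],
    job_id: str,
    poll_seconds: float = 30.0,
    max_polls: int = 240,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `retrieve(job_id)` repeatedly until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Usage would look like `job = wait_for_job(client.fine_tuning.jobs.retrieve, job_id)`, followed by a check that `job.status == "succeeded"` before reading `job.fine_tuned_model`. Passing `sleep` as a parameter keeps the helper easy to test without real waiting.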


4. Use the fine-tuned model

Once a job completes successfully, the job details include the `fine_tuned_model` field, which holds the name of the new model. You can pass this name to the API to get responses from the model we just tuned.

In [56]:
from openai import OpenAI
client = OpenAI()


completion = client.chat.completions.create(
  model=model_name,
  messages=[
    {"role": "system", "content": response2.system_message},
    {"role": "user", "content": "What is tracified?"}
  ]
)
print(completion.choices[0].message)

ChatCompletionMessage(content="Tracified is a platform developed by Sri Lanka Institute of Nanotechnology (SLINTEC) that provides supply chain visibility by tracking products through their supply chains. This is done through a unique QR code assigned to each product which allows users to trace the product's journey from its origin to the end consumer.", role='assistant', function_call=None, tool_calls=None)
