<a href="https://colab.research.google.com/github/syoooooung/capstone_design/blob/main/Query_Decomposition/4type_classification_using_LLM_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q groq
!pip install -U accelerate bitsandbytes datasets evaluate
!pip install -U peft transformers trl
# Pin openai to 0.28 before importing it: the code below uses the pre-1.0 openai.ChatCompletion API.
!pip install -q openai==0.28
import openai
import json
import time
from tqdm import tqdm



In [None]:
# For Google Colab settings
from google.colab import userdata, drive

# This will prompt for authorization
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
openai.api_key = userdata.get('OPENAI_API_KEY')

In [None]:
# Load the hotpotqa_simple.json and decomposed_langchain.json
with open('/content/hotpotqa_simple.json', 'r') as f:
    hotpotqa_simple = json.load(f)

with open('/content/decomposed_langchain.json', 'r') as f:
    decomposed_langchain = json.load(f)

In [None]:
system_message = """
    I will give you a question and several sub-queries into which the question has been split. Classify the type of problem these sub-queries have.
    There are four types in total.
    type1: Generating unnecessary questions
    type2: Omitting content from the original question
    type3: Missing content in the process of creating the sub-queries
    type4: Sub-queries in which no additional words are generated or omitted, but the question is misinterpreted and deviates from the intention of the original question.

    Answer with only the type number(s). If you think there is no problem, print null.

    Ex)
    origin Q:What was the father of Kasper Schmeichel voted to be by the IFFHS in 1992?
    sub queries: 1. Who is the father of Kasper Schmeichel\n 2. What was Kasper Schmeichel's father voted to be by the IFFHS in 1992?
    You:1

    origin Q:Alvaro Mexia had a diplomatic mission with which tribe of indigenous people?
    sub queries: 1. Who is Alvaro Mexia\n 2. What diplomatic missions did Alvaro Mexia undertake\n 3. Which tribe of indigenous people did Alvaro Mexia have a diplomatic mission with
    You:1,3
"""

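The system message asks the model to reply with type numbers such as `1` or `1,3`, or with `null` when there is no problem. A minimal sketch of how such a reply could be normalized into a list of integers is shown here. `parse_types` is a hypothetical helper that is not part of the original notebook, and it assumes replies may carry stray text such as `Type3`.

```python
import re


def parse_types(reply):
    """Return the sorted, de-duplicated type numbers (1-4) found in a model reply.

    Hypothetical helper: a reply of 'null', or one with no standalone
    digit in the range 1-4, yields an empty list.
    """
    text = reply.strip().lower()
    if text == "null":
        return []
    # Match a single digit 1-4 that is not part of a longer number.
    found = {int(d) for d in re.findall(r"(?<!\d)[1-4](?!\d)", text)}
    return sorted(found)
```

Storing the parsed list instead of the raw string would make later filtering by type simpler, although the notebook itself keeps the raw reply.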
In [None]:
# Prepare data for LLM input
def prepare_llm_input(hotpotqa_simple, decomposed_langchain):
    llm_inputs = []
    for idx, item in enumerate(hotpotqa_simple):
        original_question = item['question']
        sub_queries = decomposed_langchain['decomposed_questions'][idx]
        numbered_sub_queries = "\n".join([f"{i+1}. {sq}" for i, sq in enumerate(sub_queries)])

        prompt = f"origin Q:{original_question}\nsub queries:{numbered_sub_queries}\nYou:"
        llm_inputs.append({"id": item["id"], "prompt": prompt})
    return llm_inputs
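
`prepare_llm_input` pairs each question with `decomposed_langchain['decomposed_questions'][idx]`, so the two files have to line up by position. The sketch below illustrates the data shape this implies, using made-up records, together with a hypothetical `check_alignment` helper that fails early when the lengths differ. Neither the helper nor the toy data comes from the original notebook.

```python
def check_alignment(questions, decomposed):
    """Raise ValueError if the question list and the sub-query lists differ in length."""
    sub_query_lists = decomposed["decomposed_questions"]
    if len(questions) != len(sub_query_lists):
        raise ValueError(
            f"{len(questions)} questions but {len(sub_query_lists)} sub-query lists"
        )
    return len(questions)


# Toy data for illustration only; not taken from the real JSON files.
toy_questions = [{"id": "q1", "question": "Who directed the film?"}]
toy_decomposed = {"decomposed_questions": [["Which film?", "Who directed it?"]]}
n_pairs = check_alignment(toy_questions, toy_decomposed)
```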

In [None]:
# Call the chat model (gpt-4o-mini, despite the function name)
def process_data_gpt4(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_message},
                  {"role": "user", "content": prompt}],
        temperature=0.5,
        max_tokens=128,
        top_p=1,
        stop=None,
    )
    return response['choices'][0]['message']['content']
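
API calls can fail transiently, for example on rate limits or timeouts, and a single failure would abort the whole run. One possible mitigation is a retry wrapper with exponential backoff. The sketch below is an assumption, not the notebook's method: `with_retries` is a hypothetical helper that accepts any zero-argument callable, and it catches `Exception` broadly, which in practice should be narrowed to the client's own error types.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry.

    Re-raises the last exception once max_attempts is exhausted.
    The sleep function is injectable so the delays can be observed in tests.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

It could be used as `with_retries(lambda: process_data_gpt4(prompt))`.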

In [None]:
# Send prompts in batches to the LLM
def send_messages_gpt4(messages):
    batch_size = 10
    answers = []

    for i in tqdm(range(0, len(messages), batch_size)):
        batch = messages[i:i + batch_size]

        for message in batch:
            output = process_data_gpt4(message["prompt"])
            answers.append({"id": message["id"], "type": output.strip()})

        if i + batch_size < len(messages):
            time.sleep(10)

    return answers

In [None]:
# Prepare input data
llm_inputs = prepare_llm_input(hotpotqa_simple, decomposed_langchain)

# Send to LLM and get responses
results = send_messages_gpt4(llm_inputs)

100%|██████████| 2/2 [00:15<00:00,  7.63s/it]


In [None]:
# Create the new JSON with question and type fields
output_data = []
for item, result in zip(hotpotqa_simple, results):
    output_data.append({
        "question": item['question'],
        "type": result['type']
    })

# Save the result to a new JSON file
with open('/content/output_questions_with_types.json', 'w') as outfile:
    json.dump(output_data, outfile, indent=4)

print("Processing complete. Results saved to output_questions_with_types.json")

Processing complete. Results saved to output_questions_with_types.json
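
Once the labels are saved, a quick tally shows how often each type occurs. The following is a minimal sketch under the assumption that the `type` field holds comma-separated labels such as `1,3` or `null`. `count_types` and the sample records are hypothetical and only mirror the shape of `output_questions_with_types.json`.

```python
from collections import Counter


def count_types(records):
    """Count how often each comma-separated label appears in the 'type' field."""
    counts = Counter()
    for record in records:
        for label in record["type"].split(","):
            label = label.strip()
            if label:
                counts[label] += 1
    return counts


# Made-up records in the same shape as the saved output file.
sample = [
    {"question": "Q1", "type": "1"},
    {"question": "Q2", "type": "1,3"},
    {"question": "Q3", "type": "null"},
]
type_counts = count_types(sample)
```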
