### Create queries

We want users of our application to be able to search new entries on the autobid platform automatically, using natural language. We therefore need to fine-tune a model to handle search queries such as: "I am looking for a fairly new red sports car." After several other ideas and attempts, we generated the queries using the OpenAI API with GPT-4, because those queries were of the highest quality among all our trials.

In the end, we obtained queries for about 550 of our 2,500 scraped vehicle information texts (for an investment of about $30). For each of the 550 vehicle texts, we generated 10 queries: 5 matching ("true") and 5 non-matching ("false"). This gave us a total of about 5,500 queries to fine-tune our BERT models on.
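
To make the resulting dataset shape concrete, the following is a minimal sketch of how the generated queries could be flattened into labelled training rows. The helper name `flatten_queries`, the row layout, and the toy data are assumptions for illustration; they are not part of the original pipeline.

```python
def flatten_queries(vehicle_texts, generated):
    """Turn {url: {query: bool}} into a flat list of (vehicle_text, query, label) rows.

    Hypothetical helper: the (text, query, label) layout is an assumption,
    not the format used for the actual fine-tuning.
    """
    rows = []
    for url, queries in generated.items():
        text = vehicle_texts.get(url)
        if text is None:
            continue  # skip URLs without a scraped vehicle text
        for query, matches in queries.items():
            rows.append((text, query, int(matches)))  # 1 = match, 0 = no match
    return rows


# Toy data, made up for illustration only
vehicle_texts = {"https://example.com/car/1": "make: Audi\ncolor: red\n"}
generated = {
    "https://example.com/car/1": {
        "I am looking for a red Audi.": True,
        "I need a blue Audi.": False,
    }
}

rows = flatten_queries(vehicle_texts, generated)
print(len(rows))
```

With 550 vehicles and 10 queries each, a structure like this would hold roughly 5,500 rows.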

Step 1: Load the scraped vehicle information from the YAML file, since our queries will either match the vehicles or, in the false cases, partly contradict them.

In [None]:
import yaml

def get_vehicles_as_dict(file_path):
    """
    Parse YAML file and return a dictionary with URLs as keys and YAML text as values.

    Args:
        file_path (str): Path to the YAML file

    Returns:
        dict: Dictionary where keys are URLs and values are YAML text for each vehicle
    """
    # Load the YAML file
    with open(file_path, 'r', encoding='utf-8') as file:
        data = yaml.safe_load(file)

    # Create dictionary with URLs as keys and YAML text as values
    vehicles_dict = {}
    for url, vehicle_info in data.items():
        # Convert just the vehicle info (not including the URL) to YAML text
        yaml_text = yaml.dump(vehicle_info, default_flow_style=False, allow_unicode=True)
        vehicles_dict[url] = yaml_text

    return vehicles_dict

In [None]:
vehicle_data_file = "../../data/final_vehicles_data.yaml"
vehicle_data = get_vehicles_as_dict(vehicle_data_file)

Step 2: Prompt engineering. Find a prompt combination that reliably yields 10 queries per car (5 true, 5 false) in JSON format. The queries should sound as natural as possible and refer to the scraped car in different ways.
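
The target output format can be checked programmatically. The sketch below is one possible validator for the "5 true, 5 false" requirement; the function name `is_balanced_query_set` and its parameters are hypothetical and only illustrate the idea.

```python
import json


def is_balanced_query_set(parsed, n_true=5, n_false=5):
    """Return True if `parsed` maps query strings to booleans with the expected split.

    Hypothetical helper, not part of the original notebook.
    """
    if not isinstance(parsed, dict):
        return False
    if not all(isinstance(k, str) and isinstance(v, bool) for k, v in parsed.items()):
        return False
    labels = list(parsed.values())
    return labels.count(True) == n_true and labels.count(False) == n_false


# Small made-up response with 2 matching and 1 non-matching query
example = json.loads('{"q1": true, "q2": true, "q3": false}')
print(is_balanced_query_set(example, n_true=2, n_false=1))
```

A check like this is stricter than counting entries alone, because a response with 10 queries but a 6/4 split would be rejected.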

In [None]:
system_prompt = """
You are a helpful assistant that evaluates search queries based on detailed car descriptions.
The user provides vehicle details in structured format. You must generate five realistic search
queries that match the car and another five that don't (but aren't absurd), and return them as a
JSON object where each key is a query and the value is a boolean indicating whether the query matches
the car (true) or not (false). Respond only with the JSON object, nothing else.
"""

In [None]:
question = """
Please create realistic search queries of someone who is searching for a car. Keep in mind that, when searching for a vehicle,
they won't already know details like exact read mileage and horse power. Mileage and horse power are important but a car dealer would
ask for a broad range. When asking for kilometers, hp or the number of previous owners always pick a specific number
(not the one from the text) and ask if the car has more or less of it. Mileage is important so some queries should ask if it has more and some if it has less.
The same is true for registration date and horsepower/kW.
Also try to be specific and put in numbers for details that need them. Something like "low mileage" and "powerful engine" could be interpreted differently.
The search won't contain every detail of the car. Also vary the wording
and the chosen details of the search question for the queries and use synonyms for some of them. Make the queries multiple sentences
long and detailed. The car dealer has specific requirements.
When looking at the car details, the "information_dict:" contains the most valuable information about the car.

After that generate also 5 similar queries that don't match the car completely. The negative queries should contain some of the real vehicle details
but differ in some (some should be closer and some more different). The negative examples
shouldn't be too absurd and fit to searches a second hand car dealer could have. They should be detailed. Include
other details this car hasn't but are common for other cars. The negative queries should be close to the original car (and the true queries), but
differ in a few important details.
Make the negative queries multiple sentences long and detailed. Add enough details so the overall negative questions are similar in length to the positive ones.

Some queries should be longer and some should be shorter. Some should be more detailed than others. Also the writing style should differ slightly.
This is true for the positive and negative queries. When there is already a question related to a detail, try to focus on another detail or use a synonym.

To JSON:

Please put the queries into a json item.
Like this:
{
"This is a matching query": true,
"This is another matching query": true,
"This is not a matching query": false,
}

Only respond with the json item you created.

"""

url, yaml_text = list(vehicle_data.items())[4]
user_prompt = yaml_text + question
print(user_prompt)

In [None]:
import os

import openai

# Read the API key from the environment instead of hard-coding it in the notebook
api_key = os.environ["OPENAI_API_KEY"]

client = openai.OpenAI(api_key=api_key)

In [None]:
import json

updated_data = {}
generated_question_file = "../../data/generated_questions"

retries = 3
save_interval = 40

for idx, (url, yaml_text) in enumerate(vehicle_data.items(), start=1):
    user_prompt = yaml_text + question
    parsed_result = None

    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                temperature=0.7
            )

            answer = response.choices[0].message.content.strip()

            parsed = json.loads(answer)

            # Validate structure: dict with 10 string:bool pairs
            if (
                isinstance(parsed, dict)
                and len(parsed) == 10
                and all(isinstance(k, str) and isinstance(v, bool) for k, v in parsed.items())
            ):
                parsed_result = parsed
                break  # Success
            else:
                print(f"[Attempt {attempt+1}] Invalid format or length for URL: {url}")

        except Exception as e:  # also covers json.JSONDecodeError
            print(f"[Attempt {attempt+1}] Error processing URL {url}: {e}")

    if parsed_result is not None:
        updated_data[url] = parsed_result
    else:
        print(f"Failed to process {url} after {retries} attempts.")

    # Periodic save in case of crash
    if idx % save_interval == 0:
        with open(f"{generated_question_file}_{idx}.json", "w", encoding="utf-8") as f:
            json.dump(updated_data, f, indent=2, ensure_ascii=False)
        print(f"[Checkpoint] Saved progress at {idx} items.")

with open(generated_question_file + ".json", "w", encoding="utf-8") as f:
    json.dump(updated_data, f, indent=2, ensure_ascii=False)
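
The loop above passes the raw model answer straight to `json.loads`. Chat models may occasionally wrap their JSON in Markdown code fences despite the instructions, which would cost a retry. The following is a minimal sketch of a more tolerant parser; `extract_json_object` is a hypothetical helper, and whether such wrapping actually occurred in our runs is not documented here.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def extract_json_object(answer):
    """Parse a model answer as JSON, tolerating an optional Markdown code fence.

    Hypothetical helper, not part of the original notebook.
    Raises json.JSONDecodeError if the remaining text is not valid JSON.
    """
    text = answer.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)]
        # Drop an optional language tag such as "json" after the opening fence
        if text.lower().startswith("json"):
            text = text[len("json"):]
    return json.loads(text.strip())


wrapped = FENCE + 'json\n{"A matching query": true}\n' + FENCE
print(extract_json_object(wrapped))
print(extract_json_object('{"A non-matching query": false}'))
```

In the generation loop, `json.loads(answer)` could be swapped for a call to a helper like this, leaving the retry and validation logic unchanged.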