<a href="https://colab.research.google.com/github/rajilsaj/FICOchallenge/blob/main/notebooks/Week_3_Synthetic_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FICO Educational Analytics Challenge © Fair Isaac 2025**

Copyright 2025 FICO licensed under CC BY-NC-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/

# Week 3: Synthetic Data Generation

This notebook walks through **best practices and practical code** for generating high-quality synthetic data to fine-tune a banking chatbot.

**Why synthetic data?**

- Privacy: avoid storing or exposing real customer data.

- Coverage: fill rare/edge cases and long-tail requests.

- Control: balance class distributions and ensure label quality.


**What we'll cover:**

1. Model(ing) Choices

2. Intents schema

3. Data augmentation (Temperature, Sentiments, Speech Style)

4. Negative/OOS & ambiguous examples

### Expected File Structure

This notebook expects you to have the following file structure inside of **MyDrive**:

```
MyDrive
    └── FICO Analytic Challenge
        └── Data
            └── collections_intents.csv
            └── seed_scenarios.csv
            └── LLM_Data_Generation.ipynb
```
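
Before generating anything, it can help to confirm that the two input CSVs are where the notebook expects them. The snippet below is a minimal sketch: `missing_inputs` and `EXPECTED_FILES` are hypothetical names introduced here, and the example path assumes Google Drive has already been mounted in Colab.

```python
from pathlib import Path

# Input files the notebook reads from the Data folder (names taken from the tree above).
EXPECTED_FILES = ["collections_intents.csv", "seed_scenarios.csv"]


def missing_inputs(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


# Example call (hypothetical usage, run after mounting Drive):
# missing_inputs("/content/drive/MyDrive/FICO Analytic Challenge/Data")
```

An empty list means both inputs were found; any names returned are the files to upload before running the generation cells.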

## What is "unsloth"?

Unsloth is an open-source library designed to make fine-tuning large language models (LLMs) faster and more memory-efficient, often by leveraging optimizations like parameter-efficient tuning. From their website:

> By manually deriving all compute heavy maths steps and handwriting GPU kernels, Unsloth magically makes training faster without any hardware changes.
> [Models run] 10x faster on a single GPU and up to 30x faster on multiple GPU systems compared to Flash Attention 2 (FA2). We support NVIDIA GPUs from Tesla T4 to H100, and we're portable to AMD and Intel GPUs.

This is especially convenient on Google Colab, which offers free access to a T4 GPU.

In [None]:
!pip install unsloth



In [None]:
import os
import sys
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive/', force_remount=True)

# Base path for your project
path = '/content/drive/MyDrive/FICO Analytic Challenge/'

# Folder that's holding dataset
data = 'Data'

# Path to the "Data" folder
data_path = os.path.join(path, data)

Mounted at /content/drive/


In [None]:
import pandas as pd
import json
import random
import sys
import os
import re
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from unsloth import FastModel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
# Pytorch libraries
import torch
import torch.nn as nn

In [None]:
# Checking GPU compatibility
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)} is available.")
else:
    print("No GPU available. Training will run on CPU.")

GPU: Tesla T4 is available.


In [None]:
DEVICE_TYPE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(DEVICE_TYPE)

cuda


## While there are many models to choose from for this task, `Llama-3.1-8B-Instruct` is ideal for many reasons:

* Instruction-Tuned for Following Prompts

  - The Instruct variant is optimized to follow natural-language instructions, making it well-suited for tasks like "Generate 10 variations of a user asking to check their account balance." This reduces the need for convoluted prompt engineering.

* Balanced Size vs. Capability

  - At 8B parameters, it is large enough to capture rich linguistic variation and domain knowledge, but still lightweight compared to 70B+ models. This balance makes it practical for synthetic data generation while being deployable on our modest hardware.

* High-Quality Language Generation

  - Produces diverse, coherent, and natural-sounding outputs. This helps simulate realistic user queries, including informal phrasing and colloquial style.

* Open and Customizable

  - Being open-source, it can be fine-tuned or adapted to banking-specific terminology and intents if needed. This flexibility allows tailoring the synthetic data pipeline directly to our chatbot's domain!


In [None]:
model_name="unsloth/Meta-Llama-3.1-8B-Instruct"
model, tokenizer = FastLanguageModel.from_pretrained(model_name)
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1"
)
FastLanguageModel.for_inference(model)

==((====))==  Unsloth 2026.1.4: Fast Llama patching. Transformers: 4.57.6.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0): LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRM

## Intent Schema
We start with a **controlled taxonomy** to keep intents mutually exclusive and consistent.

**Tip:** Decide early on the *granularity* of intents. Too fine-grained leads to sparse data; too coarse leads to ambiguous routing. We've included a pre-set list of intents in ```/content/drive/MyDrive/FICO Analytic Challenge/Data/collections_intents.csv```, since the evaluation set in week 10 will include only this set of intents.
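
A quick sanity check on the taxonomy can catch duplicate intent names or missing descriptions before any generation happens. The sketch below is illustrative only: `check_intents` is a hypothetical helper, the sample rows use made-up intent names, and the only assumption about the real file is that it has `intent` and `description` columns, as the generation code later in this notebook reads.

```python
import csv
import io
from collections import Counter


def check_intents(csv_file):
    """Return (duplicate intent names, intents with an empty description)."""
    rows = list(csv.DictReader(csv_file))
    counts = Counter(row["intent"].strip() for row in rows)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    undescribed = [
        row["intent"].strip()
        for row in rows
        if not (row.get("description") or "").strip()
    ]
    return duplicates, undescribed


# Toy data with made-up intent names, only for illustration.
sample = io.StringIO(
    "intent,description\n"
    "MAKE_PAYMENT,Customer wants to pay now\n"
    "MAKE_PAYMENT,Customer pays the balance\n"
    "FALLBACK,\n"
)
duplicates, undescribed = check_intents(sample)
print("Duplicate intents:", duplicates)
print("Intents without a description:", undescribed)
```

To run it on the real file, pass an open file handle, for example `check_intents(open(path, newline=""))`.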

### Scenario Generation

The below code block generates scenarios from the input intents CSV and seed scenarios CSV in your ```FICO Analytic Challenge/Data/``` folder. <br> <br>
The scenarios are saved to ```FICO Analytic Challenge/Data/``` with filenames like ```scenarios_1.csv```. We will use these scenarios to generate conversations. <br><br>

Feel free to play around with the following parameters: <br>
<ul>
    <li> <b>TEMPERATURE</b> - Controls the "creativity" of outputs. Higher values (>0.9) mean more diverse outputs but risk inaccuracies. Lower values (<0.5) mean more predictable, stable outputs but less creativity.
    <li> <b>NUM_SCENARIOS</b> - Controls the number of scenarios generated for each intent. Higher NUM_SCENARIOS will generate more data but could lead to formatting issues and lower quality scenarios due to the limited <b>context length</b> of smaller LLMs.
</ul>

In [None]:
TEMPERATURE = 0.7
NUM_SCENARIOS = 5
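
To build intuition for what `TEMPERATURE` does, the sketch below applies temperature scaling to a softmax over a few made-up token scores. It is a simplified illustration of the general technique, not the model's actual sampling code; `softmax_with_temperature` and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.3, 0.7, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, which gives more predictable output; higher temperatures flatten the distribution, which gives more varied output.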

In [None]:
def generate_scenarios(intent, description, seed_df):

    seed_df = seed_df[seed_df["intent"] == intent]

    examples = f"EXAMPLE:\n**Input**: Generate 3 realistic and varied collections outreach scenarios where the resulting customer intent is {intent}.\n**Output**:\n"

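    # Use up to 3 seed scenarios for this intent as few-shot examples
    random_sample = seed_df.sample(n=min(3, len(seed_df)))
    for i, seed_scenario in enumerate(random_sample["scenario"], start=1):
        examples += f"{i}. {seed_scenario}\n"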

    system_prompt = (
        "You are a scenario generation engine for outbound banking chatbot conversations."
        " The bank initiates contact with the customer about collection on a past-due balance."
        " Your task is to generate realistic and logically consistent scenarios in which this outbound contact results in the customer taking a specific action or forming a final intent."
        " Each scenario should describe the relevant situation, context, and reasoning that leads from the outbound topic to the customer's concluding intent. The scenarios must reflect believable situations that could occur in real-world banking conversations, including appropriate motivations, financial circumstances, or customer behavior.\n"
        "Output strictly as a numbered list with each scenario on a new line. Do not include any commentary or explanation or additional lines.\n\n"
        f"{examples}"
    )
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Generate {NUM_SCENARIOS} realistic and varied scenarios where the bank reaches out to the customer about a collection and the resulting customer intent is {intent}. The intent is defined here: {description}."}
    ]

    inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to(DEVICE_TYPE)
    # do_sample=True makes sure the temperature setting is actually used
    output_ids = model.generate(input_ids = inputs, max_new_tokens = 1024, do_sample = True, temperature=TEMPERATURE, use_cache = True)
    # Decode only the newly generated tokens so the prompt is never parsed as output
    output = tokenizer.decode(output_ids[0][inputs.shape[1]:], skip_special_tokens=True)

    scenarios = []
    for line in output.split("\n"):
        if line.strip().startswith(tuple(f"{i}." for i in range(1, NUM_SCENARIOS + 1))):
            scenario_text = line.split(".", 1)[1].strip()
            if scenario_text:
                scenarios.append(scenario_text)
    return scenarios

## Out-of-scope (OOS) examples are user queries that **do not belong to any of the chatbot's supported intents**. For a banking chatbot, this might include questions like "What's the weather today?" or "Book me a flight to New York".

Including OOS examples in your dataset is valuable for many reasons:

* Improves Robustness

  - Without OOS data, the model may try to force-fit every input into one of the known intents, even when it's irrelevant. By training with OOS samples labeled as "unknown" or "fallback", the model learns to gracefully reject unsupported queries.

* Enhances User Trust

  - It's better for the chatbot to say "I'm not sure I can help with that" than to give an incorrect or misleading banking response. This prevents confusion and builds user confidence in the system.

* Supports Fallback Strategies

  - OOS detection enables escalation (e.g., handoff to a human agent) or a redirect (e.g., "I can help with account balances and transfers. Would you like to try one of those?").

* Reflects Real-World Behavior

  - In production, users will always type unexpected, noisy, or unrelated inputs. Training only on in-domain intents creates a brittle system. OOS examples simulate this reality and make the chatbot resilient.

* Helps Reduce Overfitting

  - By introducing "negative" data, the model learns sharper decision boundaries between supported and unsupported queries.

In [None]:
def generate_oos_scenarios(all_intents, seed_df):

    seed_df = seed_df[seed_df["intent"] == "FALLBACK"]

    examples = f"EXAMPLE:\n**Input**: Generate 3 realistic and varied collections outreach scenarios where the resulting customer intent is FALLBACK.\n**Output**:\n"

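    # Use up to 3 FALLBACK seed scenarios as few-shot examples
    random_sample = seed_df.sample(n=min(3, len(seed_df)))
    for i, seed_scenario in enumerate(random_sample["scenario"], start=1):
        examples += f"{i}. {seed_scenario}\n"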

    system_prompt = (
        "You are a scenario generation engine for outbound banking chatbot conversations."
        " The bank initiates contact with the customer about collection on a past-due balance."
        " Your task is to generate realistic scenarios in which this outbound contact results in the customer making a request that is outside of the scope of the bank chatbot's capabilities."
        " Each scenario should describe the relevant situation and context for the outreach as well as the customer's out-of-scope request. In some scenarios, the customer's request can be loosely related to banking as long as it doesn't touch any of the below intents. In other scenarios, the customer request can be completely unrelated to banking.\n"
        # Single quotes inside the f-string keep this valid on Python versions before 3.12
        f"Defined intents to avoid: {', '.join(all_intents)}\n\n"
        "Output strictly as a numbered list with each scenario on a new line. Do not include any commentary or explanation or additional lines.\n\n"
        f"{examples}"
    )
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Generate {NUM_SCENARIOS} varied scenarios."}
    ]

    inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to(DEVICE_TYPE)
    # do_sample=True makes sure the temperature setting is actually used
    output_ids = model.generate(input_ids = inputs, max_new_tokens = 1024, do_sample = True, temperature=TEMPERATURE, use_cache = True)
    # Decode only the newly generated tokens so the prompt is never parsed as output
    output = tokenizer.decode(output_ids[0][inputs.shape[1]:], skip_special_tokens=True)

    scenarios = []
    for line in output.split("\n"):
        if line.strip().startswith(tuple(f"{i}." for i in range(1, NUM_SCENARIOS + 1))):
            scenario_text = line.split(".", 1)[1].strip()
            if scenario_text:
                scenarios.append(scenario_text)
    return scenarios

### This function calls both functions we've defined above: `generate_scenarios` and `generate_oos_scenarios`

In [None]:
def generate_all_scenarios(input_csv="collections_intents.csv", seed_csv="seed_scenarios.csv"):
    running_total = 0
    input_csv = os.path.join(data_path, input_csv)
    seed_data = os.path.join(data_path, seed_csv)
    seed_df = pd.read_csv(seed_data)

    i = 1
    while True:
        output_csv = os.path.join(data_path, f"scenarios_{i}.csv")
        if not os.path.exists(output_csv):
            break
        i += 1

    df = pd.read_csv(input_csv)

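    # Every supported intent except FALLBACK, regardless of its row position
    all_intents = [name for name in df["intent"] if name != "FALLBACK"]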

    scenario_data = []

    for idx, row in df.iterrows():
        intent = row['intent']
        description = row['description']
        if intent == "FALLBACK":
            scenarios = generate_oos_scenarios(all_intents, seed_df)
        else:
            scenarios = generate_scenarios(intent, description, seed_df)

        for sc in scenarios:
            new_row = {
                "intent": intent,
                "scenario": sc
            }
            seed_df = pd.concat([seed_df, pd.DataFrame([new_row])], ignore_index=True)
            scenario_data.append(new_row)

        # Count the scenarios that were actually parsed, not the number requested
        running_total += len(scenarios)
        sys.stdout.write(f"\r**GENERATING** Total Scenarios Generated: {running_total}")
        sys.stdout.flush()

        pd.DataFrame(scenario_data).to_csv(output_csv, index=False)
    print(f"Saved scenario CSV: {output_csv}")

### Now we can call this function to generate all scenarios based on our input intents, which will then be used to generate full conversations!


In [None]:
generate_all_scenarios()

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


**GENERATING** Total Scenarios Generated: 45Saved scenario CSV: /content/drive/MyDrive/FICO Analytic Challenge/Data/scenarios_1.csv


### Conversation Generation

The below code block generates conversations from the scenario CSVs (```scenarios_1.csv```, ```scenarios_2.csv```, and so on) in ```FICO Analytic Challenge/Data/```. <br> <br>
The conversations are saved to ```FICO Analytic Challenge/Data/``` with filenames like ```conversations_1.csv```.<br><br>

Feel free to play around with the following parameters: <br>
<ul>
    <li> <b>TEMPERATURE</b> - As explained earlier, this controls the creativity of outputs. Higher values mean more diversity and creativity at the risk of unpredictable hallucinations. Lower values are safer but less diverse.
    <li> <b>SENTIMENTS</b> - This defines the possible sentiments that the user/customer will express. You can change this list to have any number of sentiments you want. Just make sure to update <b>SENTIMENT_PROBS</b> accordingly.
    <li> <b>SENTIMENT_PROBS</b> - This is a list of probabilities that defines how likely each sentiment is. For example, if SENTIMENTS is ["angry", "neutral"] and SENTIMENT_PROBS is [0.3, 0.7], then there is a 30% chance the user is angry and a 70% chance the user is neutral. Make sure your probabilities add up to 1.
    <li> <b>USER_SPEECH</b> - This defines the possible user speech types. You can change this list to have any number of different speech types (like "professional" or "slang"). Just make sure to update <b>USER_SPEECH_PROBS</b> accordingly.
    <li> <b>USER_SPEECH_PROBS</b> - This is a list of probabilities that defines how likely each user speech type is. For example, if USER_SPEECH is ["professional", "slang"] and USER_SPEECH_PROBS is [0.4, 0.6], then there is a 40% chance the user speaks professionally and a 60% chance the user uses slang. Make sure your probabilities add up to 1.
    <li> <b>NUM_VARIANTS</b> - This controls how many conversations are generated for each scenario. A higher value will generate more data, but you risk having repetitive conversations if you don't tune the other parameters.
</ul>

In [None]:
# Define sentiments and probabilities
TEMPERATURE = 0.7
SENTIMENTS = ["angry", "confused", "neutral"]
SENTIMENT_PROBS = [0.15, 0.3, 0.55]
USER_SPEECH = ["casual", "professional", "slang", "typos"]
USER_SPEECH_PROBS = [0.5, 0.2, 0.1, 0.2]
NUM_VARIANTS = 5

In [None]:
assert len(SENTIMENT_PROBS) == len(SENTIMENTS)
assert len(USER_SPEECH_PROBS) == len(USER_SPEECH)
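
The length checks above do not verify that the probabilities themselves form a valid distribution. The sketch below shows one way to check that and to preview the resulting mix; `check_distribution` and `sample_counts` are hypothetical helpers, not part of the original notebook. Note that `random.choices` treats `weights` as relative weights, so values that do not sum to 1 would be silently rescaled rather than rejected.

```python
import math
import random
from collections import Counter


def check_distribution(labels, probs):
    """Raise ValueError if probs is not a valid distribution over labels."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError(f"probabilities sum to {sum(probs)}, expected 1")


def sample_counts(labels, probs, n, seed=0):
    """Draw n labels with random.choices and count how often each appears."""
    rng = random.Random(seed)
    return Counter(rng.choices(labels, weights=probs, k=n))


SENTIMENTS = ["angry", "confused", "neutral"]
SENTIMENT_PROBS = [0.15, 0.3, 0.55]

check_distribution(SENTIMENTS, SENTIMENT_PROBS)
counts = sample_counts(SENTIMENTS, SENTIMENT_PROBS, n=1000)
print({s: counts[s] / 1000 for s in SENTIMENTS})
```

The printed frequencies should land close to the configured probabilities; large gaps usually mean the two lists are out of sync.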

In [None]:
all_intents = pd.read_csv(os.path.join(data_path, "collections_intents.csv"))
all_intents = all_intents["intent"].tolist()

def generate_conversation(scenario, intent, sentiment, user_speech):

    system_prompt = (
        "You are a conversation generation engine for outbound banking chatbot interactions. You simulate realistic conversations where the bank initiates contact with the customer about collections on a past-due balance.\n"
        "Generate natural, multi-turn dialogues between a helpful, professional banking chatbot and a realistic customer. "
        "The chatbot should initiate the conversation and stay focused on the context provided in the scenario. "
        f"The customer messages should express the sentiment '{sentiment}' and speech type '{user_speech}'. However, the customer messages should align with the provided scenario and intent '{intent}' above all else.\n"
        # Single quotes inside the f-string keep this valid on Python versions before 3.12
        f"The conversation must avoid overlapping with any other intents. Specifically, avoid: {', '.join([i for i in all_intents if i != intent])}\n"
        "Avoid including any account numbers or Social Security numbers.\n"
        "The chatbot messages should consistently be professional and friendly regardless of the customer's tone.\n"
        "Alternate between chatbot and customer messages, prefacing each line with 'Bot:' or 'User:'. Do not include summaries, commentary, or headings; output only the dialogue."
    )

    user_prompt = (
        f"Customer Intent: {intent}\n"
        f"Customer Sentiment: {sentiment}\n"
        f"Customer Speech Type: {user_speech}\n"
        f"Scenario: {scenario}\n"
        f"Generate a realistic conversation."
    )

    messages = [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": user_prompt
        }
    ]

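    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt = True,  # end the prompt with the assistant header
        tokenize = True,
        return_dict = True,
        return_tensors = "pt",
    ).to(DEVICE_TYPE)
    output_ids = model.generate(
        **inputs,
        max_new_tokens = 1024,
        do_sample = True,  # sampling must be on for temperature, top_p and top_k to apply
        temperature = TEMPERATURE,
        top_p = 0.95,
        top_k = 128
    )
    # Decode only the newly generated tokens (everything after the prompt)
    prompt_len = inputs["input_ids"].shape[1]
    output = tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True).strip()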

    return output

def generate_oos_conversation(scenario, sentiment, user_speech):


    system_prompt = (
        "You are a conversation generation engine for outbound banking chatbot interactions. You simulate realistic conversations where the bank initiates contact with the customer about collections on a past-due balance and the customer makes an out-of-scope request in response.\n"
        "Generate natural, multi-turn dialogues between a helpful, professional banking chatbot and a customer. "
        "The chatbot should initiate the conversation and stay focused on the context provided in the scenario. "
        f"The customer messages should express the sentiment '{sentiment}' and speech type '{user_speech}'. However, the customer messages should align with the provided scenario above all else. The purpose of the scenario is to describe a customer that is making an out-of-scope request that the banking chatbot can't fulfill.\n"
        "The chatbot messages should consistently be professional and friendly regardless of the customer's tone.\n"
        "Alternate between chatbot and customer messages, prefacing each line with 'Bot:' or 'User:'. Do not include summaries, commentary, or headings; output only the dialogue."
    )

    user_prompt = (
        # The trailing newline keeps the intent and sentiment on separate lines
        f"Customer Intent: OUT_OF_SCOPE\n"
        f"Customer Sentiment: {sentiment}\n"
        f"Customer Speech Type: {user_speech}\n"
        f"Scenario: {scenario}\n"
        f"Generate a realistic conversation."
    )

    messages = [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": user_prompt
        }
    ]

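    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt = True,  # end the prompt with the assistant header
        tokenize = True,
        return_dict = True,
        return_tensors = "pt",
    ).to(DEVICE_TYPE)
    output_ids = model.generate(
        **inputs,
        max_new_tokens = 1024,
        do_sample = True,  # sampling must be on for temperature, top_p and top_k to apply
        temperature = TEMPERATURE,
        top_p = 0.95,
        top_k = 128
    )
    # Decode only the newly generated tokens (everything after the prompt)
    prompt_len = inputs["input_ids"].shape[1]
    output = tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True).strip()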

    return output


def generate_all_conversations():

    running_total = 0
    pattern = re.compile(r"^scenarios_(\d+)\.csv$")

    csv_files = []
    for f in os.listdir(data_path):
        if pattern.match(f):
            csv_files.append(os.path.join(data_path, f))

    if csv_files:
        df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)
    else:
        raise FileNotFoundError("No matching scenarios files found in the specified data_path.")

    i = 1
    while True:
        output_csv = os.path.join(data_path, f"conversations_{i}.csv")
        if not os.path.exists(output_csv):
            break
        i += 1

    all_convos = []

    for idx, row in df.sample(frac=1).iterrows():
        intent = row['intent']
        scenario = row['scenario']

        for variant_id in range(1, NUM_VARIANTS + 1):

            sentiment = random.choices(SENTIMENTS, weights=SENTIMENT_PROBS, k=1)[0]
            user_speech = random.choices(USER_SPEECH, weights=USER_SPEECH_PROBS, k=1)[0]
            if intent == "FALLBACK":
                conv_text = generate_oos_conversation(scenario, sentiment, user_speech)
            else:
                conv_text = generate_conversation(scenario, intent, sentiment, user_speech)

            num_turns = len(re.findall(r"(User:|Bot:)", conv_text))
            if num_turns <= 1:
                continue

            all_convos.append({
                "intent": intent,
                "scenario": scenario,
                "conversation_text": conv_text,
                "sentiment": sentiment,
                "user_speech_type": user_speech
            })
            running_total += 1

            sys.stdout.write(f"\r**GENERATING** Total Conversations Generated: {running_total}")
            sys.stdout.flush()

        pd.DataFrame(all_convos).to_csv(output_csv, index=False)

### Now we're ready to start generating our synthetic data based on our previously defined scenarios!

Pay attention to the `while True` loop above: every time you generate a new set of conversations, it will generate a new .csv file with a new index. Make sure to keep track of your experimental parameters as you're creating new sets of data when trying to optimize for synthetic data quality and intent fidelity.
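
One lightweight way to keep track of experimental parameters is to write them to a small JSON file next to each generated CSV. The sketch below is only a suggestion: `save_run_params` and the `_params.json` naming are hypothetical and are not part of the original pipeline. The demo writes to a temporary folder so it can run anywhere.

```python
import json
import os
import tempfile


def save_run_params(output_csv, params):
    """Write params to a JSON file next to output_csv and return its path."""
    sidecar = os.path.splitext(output_csv)[0] + "_params.json"
    with open(sidecar, "w", encoding="utf-8") as f:
        json.dump(params, f, indent=2)
    return sidecar


# Demo in a temporary folder; in the notebook you would pass the real output path.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_csv = os.path.join(tmp_dir, "conversations_1.csv")
    sidecar_path = save_run_params(
        demo_csv,
        {
            "TEMPERATURE": 0.7,
            "SENTIMENTS": ["angry", "confused", "neutral"],
            "SENTIMENT_PROBS": [0.15, 0.3, 0.55],
            "NUM_VARIANTS": 5,
        },
    )
    print(os.path.basename(sidecar_path))
```

With a file like this saved per run, each `conversations_{i}.csv` can be matched to the settings that produced it when comparing data quality across runs.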

In [None]:
generate_all_conversations()

**GENERATING** Total Conversations Generated: 225