# **Using LLMs to Generate Synthetic Data for Fine-Tuning GLiNER (NER + Relation Extraction)**

This notebook demonstrates a straightforward approach to generating synthetic training data for fine-tuning GLiNER on both Named Entity Recognition (NER) and Relation Extraction (RE).

We focus on creating **fully synthetic examples**, including both the input text and corresponding annotations. However, if you already have high-quality text from your target domain, you can also use an LLM to **annotate your existing data** instead.

In [1]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
AWS_ACCESS_KEY_ID = user_secrets.get_secret("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = user_secrets.get_secret("AWS_SECRET_ACCESS_KEY")
HF_TOKEN = user_secrets.get_secret("HF_TOKEN")
OPENAI_API_KEY = user_secrets.get_secret("OPENAI_API_KEY")

In [2]:
import boto3
import os
import json
import re
from typing import List, Dict, Tuple, Optional, Union, Any

# ── S3 CONFIG ─────────────────────────────────────────────────────────────────
s3 = boto3.client(
    "s3",
    aws_access_key_id     = AWS_ACCESS_KEY_ID,
    aws_secret_access_key = AWS_SECRET_ACCESS_KEY,
    region_name           = "ap-southeast-2",
)
BUCKET       = "doccano-processed"
INIT_KEY     = "gradio/initial_data_train.json"
VALID_PREFIX = "validated_records/"

In [3]:
def load_initial_data() -> List[Dict]:
    obj = s3.get_object(Bucket=BUCKET, Key=INIT_KEY)
    return json.loads(obj['Body'].read())

def load_all_validations() -> Dict[int, Dict]:
    records = {}
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=BUCKET, Prefix=VALID_PREFIX
    )
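    for page in pages:
        for obj in page.get("Contents", []):
            key = obj["Key"]
            stem = os.path.splitext(os.path.basename(key))[0]
            # Skip keys whose file name is not a plain integer index
            # (for example a bare "validated_records/" placeholder object)
            if not stem.isdigit():
                continue
            data = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            records[int(stem)] = json.loads(data)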
    return records

In [4]:
data             = load_initial_data()
validated_store  = load_all_validations()

In [5]:
# Keep only the records with index <= 150 (validated_store was loaded in the previous cell)
validated_store = {idx: doc for idx, doc in validated_store.items() if idx <= 150}

In [6]:
len(validated_store)

150

In [7]:
len(data)

783

## Create the DataFrame

In [9]:
import pandas as pd
df = pd.DataFrame.from_dict(validated_store, orient='index')
df.index.name = 'idx'                        
df.reset_index(inplace=True) 

In [10]:
df.loc[df['validated'] == True].shape

(148, 5)

In [11]:
validated_df = df.loc[df['validated'] == True]

In [12]:
validated_df = validated_df.loc[validated_df['ner'].apply(lambda x: x != [])]

In [13]:
filtered_df = validated_df.copy()

In [14]:
filtered_df.head()

Unnamed: 0,idx,text,tokenized_text,ner,validated
0,0,displacement has been decreasing slowly in con...,"[displacement, has, been, decreasing, slowly, ...","[[19, 19, match_named], [24, 28, RUV <> descri...",True
1,1,non-cooperative behaviors that lead to the eme...,"[non-cooperative, behaviors, that, lead, to, t...","[[41, 51, unique dataset on refugee camps <> d...",True
2,10,3 Figure 1 Emigration is highest for middle-in...,"[3, Figure, 1, Emigration, is, highest, for, m...","[[15, 16, labelled_unnamed], [18, 19, migratio...",True
3,100,displaced children and adolescents residing in...,"[displaced, children, and, adolescents, residi...","[[17, 17, match_named], [40, 68, VenRePs-Kids ...",True
4,101,"9 and the Syrian crisis , the UNHCR and the Wo...","[9, and, the, Syrian, crisis, ,, the, UNHCR, a...","[[51, 51, proGres <> geography], [52, 52, matc...",True


In [15]:
def has_document_in_ner(ner_list):
    for entity in ner_list:
        if len(entity) >= 3 and "document" in entity[2].lower():
            return True
    return False

In [16]:
filtered_df[filtered_df["ner"].apply(has_document_in_ner)]

Unnamed: 0,idx,text,tokenized_text,ner,validated


In [17]:
def join_tokens(tokens):
    # Joining tokens with space, but handling special characters correctly
    text = ""
    for token in tokens:
        if token in {",", ".", "!", "?", ":", ";", "..."}:
            text = text.rstrip() + token
        else:
            text += " " + token
    return text.strip()
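The `ner` spans index tokens, while `join_tokens` drops the space before punctuation, so character positions in the joined sentence no longer line up with a plain `" ".join(tokens)`. If character offsets are ever needed, they can be derived from the tokens with the same spacing rule. The sketch below is an illustration only; `token_char_offsets` and `span_to_chars` are hypothetical helpers that are not used elsewhere in this notebook.

```python
from typing import List, Tuple

# Same punctuation set as join_tokens: these tokens attach to the previous token
PUNCT = {",", ".", "!", "?", ":", ";", "..."}

def token_char_offsets(tokens: List[str]) -> List[Tuple[int, int]]:
    """Character (start, end) of every token in the detokenized sentence."""
    offsets = []
    pos = 0
    for i, tok in enumerate(tokens):
        if i > 0 and tok not in PUNCT:
            pos += 1  # one space precedes every non-punctuation token
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    return offsets

def span_to_chars(tokens: List[str], start: int, end: int) -> Tuple[int, int]:
    """Map an inclusive token span [start, end] to a character slice."""
    offsets = token_char_offsets(tokens)
    return offsets[start][0], offsets[end][1]

tokens = ["the", "Syrian", "crisis", ",", "the", "UNHCR"]
sentence = "the Syrian crisis, the UNHCR"  # what join_tokens yields for these tokens
s, e = span_to_chars(tokens, 5, 5)
print(sentence[s:e])
```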

In [18]:
train_df = filtered_df.copy()
train_df.head()

Unnamed: 0,idx,text,tokenized_text,ner,validated
0,0,displacement has been decreasing slowly in con...,"[displacement, has, been, decreasing, slowly, ...","[[19, 19, match_named], [24, 28, RUV <> descri...",True
1,1,non-cooperative behaviors that lead to the eme...,"[non-cooperative, behaviors, that, lead, to, t...","[[41, 51, unique dataset on refugee camps <> d...",True
2,10,3 Figure 1 Emigration is highest for middle-in...,"[3, Figure, 1, Emigration, is, highest, for, m...","[[15, 16, labelled_unnamed], [18, 19, migratio...",True
3,100,displaced children and adolescents residing in...,"[displaced, children, and, adolescents, residi...","[[17, 17, match_named], [40, 68, VenRePs-Kids ...",True
4,101,"9 and the Syrian crisis , the UNHCR and the Wo...","[9, and, the, Syrian, crisis, ,, the, UNHCR, a...","[[51, 51, proGres <> geography], [52, 52, matc...",True


In [19]:
del filtered_df

In [21]:
dataset_map = {
    # named variants
    'match_named':   'named dataset',
    'pred_named':    'named dataset',
    'actual_named':  'named dataset',
    # unnamed variants
    'match_unnamed': 'unnamed dataset',
    'pred_unnamed':  'unnamed dataset',
    'actual_unnamed':'unnamed dataset',
    # vague variants
    'match_vague':   'vague dataset',
    'pred_vague':    'vague dataset',
    'actual_vague':  'vague dataset',
    # labelled override
    'labelled_named':   'named dataset',
    'labelled_vague':   'vague dataset',
    'labelled_unnamed': 'unnamed dataset',
}

# map wrongly labelled relations
rel_map = {
    'year':       'reference year',
    'years':      'reference year',
    'demography': 'geography',
}

def normalize_ner(ner_list):
    normalized = []
    for start, end, label in ner_list:
        # if it’s a dataset‐tag, collapse it immediately
        if label in dataset_map:
            normalized.append((start, end, dataset_map[label]))
            continue

        # if it’s a relation, normalize only the right side
        if '<>' in label:
            parts = label.split(',')
            out_pieces = []
            for part in parts:
                part = part.strip()
                if '<>' not in part:
                    continue
                subj, rel = part.split('<>', 1)
                subj = subj.strip()
                rel  = rel.strip().lower()
                rel  = rel_map.get(rel, rel)
                out_pieces.append(f"{subj} <> {rel}")
            if out_pieces:
                normalized.append((start, end, ', '.join(out_pieces)))
            continue

        # otherwise, leave it
        normalized.append((start, end, label))

    return normalized
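Both maps above only fix the variants that were already spotted, so it can help to scan the raw annotations for anything they do not cover before normalizing. The following is a minimal sketch under that assumption; `unrecognized_labels` and the small `known_tags` / `known_relations` sets are illustrative and would need to be filled from your own label inventory.

```python
from collections import Counter
from typing import Set

def unrecognized_labels(ner_lists, known_tags: Set[str], known_relations: Set[str]) -> Counter:
    """Count plain tags and relation types that are not in the known sets."""
    unknown = Counter()
    for ner_list in ner_lists:
        for _start, _end, label in ner_list:
            if "<>" not in label:
                # Plain dataset tag such as "match_named"
                if label not in known_tags:
                    unknown[label] += 1
                continue
            # Relation label, possibly several comma-separated "subject <> relation" parts
            for part in label.split(","):
                if "<>" not in part:
                    continue
                rel = part.split("<>", 1)[1].strip().lower()
                if rel not in known_relations:
                    unknown[rel] += 1
    return unknown

sample = [
    [[19, 19, "match_named"], [24, 28, "RUV <> description"]],
    [[5, 5, "proGres <> Years"], [7, 7, "matched_named"]],
]
unknown = unrecognized_labels(
    sample,
    known_tags={"match_named"},
    known_relations={"description", "reference year"},
)
print(unknown)
```

Anything reported here is a candidate for a new entry in `dataset_map` or `rel_map`.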

In [22]:
train_df['ner_normalized'] = train_df['ner'].apply(normalize_ner)

In [23]:
def extract_text(row):
    tokenized = row['tokenized_text']
    ner_entries = row['ner_normalized']
    extracted = []

    for start_idx, end_idx, label in ner_entries:
        # rebuild the actual text span
        span_text = " ".join(tokenized[start_idx : end_idx + 1])

        # split out multiple labels if they’re comma-separated
        pieces = [piece.strip() for piece in label.split(", ") if piece.strip()]
        for piece in pieces:
            extracted.append({
                "text":  span_text,
                "label": piece
            })

    return extracted

train_df["extracted_text"] = train_df.apply(extract_text, axis=1)

In [24]:
train_df.iloc[4]['text']

'9 and the Syrian crisis , the UNHCR and the World Food Program ( WFP ) have been conducting a variety of surveys as well as extensive home visits that allowed researchers to analyze refugee conditions as had never been done before . The paper uses two data sets : the Jordan proGres registration system ( PG for short ) as of December 2014 and the Jordan Home Visits survey , round II data ( HV for short ) collected between November 2013 and September 2014 . Both data sets were provided by the UNHCR in the context of the joint World Bank-UNHCR study on the welfare of Syrian refugees ( Verme et al . , 2016 ) . These comprehensive data sets have the distinct advantage that they can be linked by a common identification number . We can therefore trace the same individuals and households across the two sources of data . The proGres registration system is what we consider the “ census ” of refugees . This data set has no information on consumption but contains socio-economic characteristics fo

In [25]:
train_df.iloc[4]['ner_normalized']

[(51, 51, 'proGres <> geography'),
 (52, 52, 'named dataset'),
 (53, 54, 'proGres <> data type'),
 (56, 56, 'proGres <> acronym'),
 (63, 63, 'proGres <> reference year'),
 (66, 66, 'Home Visits survey <> geography'),
 (67, 69, 'named dataset'),
 (71, 72, 'Home Visits survey <> version'),
 (75, 75, 'Home Visits survey <> acronym'),
 (82, 82, 'Home Visits survey <> reference year'),
 (85, 85, 'Home Visits survey <> reference year'),
 (94, 94, 'proGres <> publisher, Home Visits survey <> publisher'),
 (99, 103, 'proGres <> data source, Home Visits survey <> data source'),
 (155, 155, 'named dataset'),
 (156, 157, 'proGres <> data type'),
 (163, 167, 'proGres <> description'),
 (178, 186, 'proGres <> description'),
 (192, 192, 'named dataset')]

In [26]:
train_df.iloc[4]['extracted_text']

[{'text': 'Jordan', 'label': 'proGres <> geography'},
 {'text': 'proGres', 'label': 'named dataset'},
 {'text': 'registration system', 'label': 'proGres <> data type'},
 {'text': 'PG', 'label': 'proGres <> acronym'},
 {'text': '2014', 'label': 'proGres <> reference year'},
 {'text': 'Jordan', 'label': 'Home Visits survey <> geography'},
 {'text': 'Home Visits survey', 'label': 'named dataset'},
 {'text': 'round II', 'label': 'Home Visits survey <> version'},
 {'text': 'HV', 'label': 'Home Visits survey <> acronym'},
 {'text': '2013', 'label': 'Home Visits survey <> reference year'},
 {'text': '2014', 'label': 'Home Visits survey <> reference year'},
 {'text': 'UNHCR', 'label': 'proGres <> publisher'},
 {'text': 'UNHCR', 'label': 'Home Visits survey <> publisher'},
 {'text': 'the joint World Bank-UNHCR study',
  'label': 'proGres <> data source'},
 {'text': 'the joint World Bank-UNHCR study',
  'label': 'Home Visits survey <> data source'},
 {'text': 'proGres', 'label': 'named dataset'}
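In the output above, every relation label names its dataset on the left of `<>`, and that dataset also appears as its own entity. A quick consistency check can flag records where a relation points at a dataset that was never tagged. This is a sketch, not part of the original pipeline: `orphan_relation_subjects` is a hypothetical helper, and exact string matching between the subject and the dataset span is an assumption that may be too strict when annotators shorten the dataset name.

```python
from typing import Dict, List

DATASET_LABELS = {"named dataset", "unnamed dataset", "vague dataset"}

def orphan_relation_subjects(entities: List[Dict[str, str]]) -> List[str]:
    """Relation subjects that never occur as a dataset mention in the same record."""
    dataset_texts = {e["text"] for e in entities if e["label"] in DATASET_LABELS}
    orphans: List[str] = []
    for e in entities:
        if "<>" not in e["label"]:
            continue
        subject = e["label"].split("<>", 1)[0].strip()
        if subject not in dataset_texts and subject not in orphans:
            orphans.append(subject)
    return orphans

entities = [
    {"text": "Jordan", "label": "proGres <> geography"},
    {"text": "proGres", "label": "named dataset"},
    {"text": "HV", "label": "Home Visits survey <> acronym"},
]
print(orphan_relation_subjects(entities))

# Once the missing dataset mention is present, nothing is flagged
complete = entities + [{"text": "Home Visits survey", "label": "named dataset"}]
print(orphan_relation_subjects(complete))
```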

## Synthetic data generation

In [27]:
from openai import OpenAI
from kaggle_secrets import UserSecretsClient
import os
import json
from typing import List, Set, Dict

# Expose the API keys as environment variables for the OpenAI and Hugging Face clients
user_secrets = UserSecretsClient()
os.environ['OPENAI_API_KEY'] = user_secrets.get_secret("OPENAI_API_KEY")
os.environ['HF_TOKEN'] = user_secrets.get_secret("HF_TOKEN")

# Initialize the OpenAI client (it reads OPENAI_API_KEY from the environment)
client = OpenAI()
model = 'gpt-4o-mini'  # gpt-4o-mini-2024-07-18

In [28]:
from huggingface_hub import login
login(token=os.environ['HF_TOKEN'])

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


## Prompting function

In [29]:
train_df.sample(10)

Unnamed: 0,idx,text,tokenized_text,ner,validated,ner_normalized,extracted_text
19,115,model of expectations in the following section...,"[model, of, expectations, in, the, following, ...","[[11, 14, labelled_named], [15, 18, Foreign In...",True,"[(11, 14, named dataset), (15, 18, Foreign Inv...",[{'text': 'Foreign Investment During Recovery'...
139,9,) Source : Authors ’ calculations with data fr...,"[), Source, :, Authors, ’, calculations, with,...","[[9, 9, LFS <> geography], [10, 10, match_name...",True,"[(9, 9, LFS <> geography), (10, 10, named data...","[{'text': 'Jordan', 'label': 'LFS <> geography..."
121,73,which to believe that diﬀerent types of interv...,"[which, to, believe, that, diﬀerent, types, of...","[[74, 74, socio-economic household surveys <> ...",True,"[(74, 74, socio-economic household surveys <> ...","[{'text': 'UNHCR', 'label': 'socio-economic ho..."
4,101,"9 and the Syrian crisis , the UNHCR and the Wo...","[9, and, the, Syrian, crisis, ,, the, UNHCR, a...","[[51, 51, proGres <> geography], [52, 52, matc...",True,"[(51, 51, proGres <> geography), (52, 52, name...","[{'text': 'Jordan', 'label': 'proGres <> geogr..."
105,59,Table 1 : The fraction of messages from return...,"[Table, 1, :, The, fraction, of, messages, fro...","[[142, 157, Researcher Coded dataset <> descri...",True,"[(142, 157, Researcher Coded dataset <> descri...",[{'text': 'set of groups which mention locatio...
15,111,4 Data and descriptive statistics Primary data...,"[4, Data, and, descriptive, statistics, Primar...","[[14, 14, Refugee and Host Communities Househo...",True,"[(14, 14, Refugee and Host Communities Househo...","[{'text': 'Uganda', 'label': 'Refugee and Host..."
82,38,", with 55 % falling between 20 and 24 years . ...","[,, with, 55, %, falling, between, 20, and, 24...","[[107, 108, DHS survey <> data type], [111, 11...",True,"[(107, 108, DHS survey <> data type), (111, 11...","[{'text': 'national figures', 'label': 'DHS su..."
28,123,into 47 sub-national units . Locations in Ugan...,"[into, 47, sub-national, units, ., Locations, ...","[[7, 7, GWP <> geography], [88, 88, match_named]]",True,"[(7, 7, GWP <> geography), (88, 88, named data...","[{'text': 'Uganda', 'label': 'GWP <> geography..."
9,106,that were surveyed only in the ESS1 are exclud...,"[that, were, surveyed, only, in, the, ESS1, ar...","[[6, 6, match_named], [21, 21, match_named], [...",True,"[(6, 6, named dataset), (21, 21, named dataset...","[{'text': 'ESS1', 'label': 'named dataset'}, {..."
53,146,"12 Figure 2 . Poverty Estimates , by year and ...","[12, Figure, 2, ., Poverty, Estimates, ,, by, ...","[[64, 64, LFS <> geography], [88, 88, match_na...",True,"[(64, 64, LFS <> geography), (88, 88, named da...","[{'text': 'Turkey', 'label': 'LFS <> geography..."


In [30]:
train_df.shape

(136, 7)

In [31]:
train_df['entities'] = train_df['extracted_text']

In [32]:
train_df['ner'] = train_df['ner_normalized']

In [33]:
train_df.head()

Unnamed: 0,idx,text,tokenized_text,ner,validated,ner_normalized,extracted_text,entities
0,0,displacement has been decreasing slowly in con...,"[displacement, has, been, decreasing, slowly, ...","[(19, 19, named dataset), (24, 28, RUV <> desc...",True,"[(19, 19, named dataset), (24, 28, RUV <> desc...","[{'text': 'RUV', 'label': 'named dataset'}, {'...","[{'text': 'RUV', 'label': 'named dataset'}, {'..."
1,1,non-cooperative behaviors that lead to the eme...,"[non-cooperative, behaviors, that, lead, to, t...","[(41, 51, unique dataset on refugee camps <> d...",True,"[(41, 51, unique dataset on refugee camps <> d...",[{'text': 'assess the consequences of forced m...,[{'text': 'assess the consequences of forced m...
2,10,3 Figure 1 Emigration is highest for middle-in...,"[3, Figure, 1, Emigration, is, highest, for, m...","[(15, 16, unnamed dataset), (18, 19, migration...",True,"[(15, 16, unnamed dataset), (18, 19, migration...","[{'text': 'migration data', 'label': 'unnamed ...","[{'text': 'migration data', 'label': 'unnamed ..."
3,100,displaced children and adolescents residing in...,"[displaced, children, and, adolescents, residi...","[(17, 17, named dataset), (40, 68, VenRePs-Kid...",True,"[(17, 17, named dataset), (40, 68, VenRePs-Kid...","[{'text': 'VenRePs-Kids', 'label': 'named data...","[{'text': 'VenRePs-Kids', 'label': 'named data..."
4,101,"9 and the Syrian crisis , the UNHCR and the Wo...","[9, and, the, Syrian, crisis, ,, the, UNHCR, a...","[(51, 51, proGres <> geography), (52, 52, name...",True,"[(51, 51, proGres <> geography), (52, 52, name...","[{'text': 'Jordan', 'label': 'proGres <> geogr...","[{'text': 'Jordan', 'label': 'proGres <> geogr..."


In [34]:
from collections import Counter

label_counter = Counter()

for item in train_df[['ner']].to_dict(orient='records'):
    for ner in item.get('ner', []):
        label_counter[ner[2]] += 1

# Print sorted by frequency
for label, count in label_counter.most_common():
    print(f"{label}: {count}")

named dataset: 267
unnamed dataset: 32
vague dataset: 19
RDHS <> reference year: 12
FDP datasets <> geography: 10
Kagera Health and Development Survey <> geography: 10
survey data <> reference year: 8
Armed Conflict Survey <> geography: 8
UNHCR Camp Mapping database <> geography: 8
LFS <> reference year: 7
International Migration Organization Displacement Tracking Matrix dataset <> geography: 7
PISA <> geography: 7
Cadastro ´ Unico <> geography: 6
Poll 14 <> geography: 5
LFS <> geography: 5
ESS <> reference year: 5
PWT <> reference year: 4
Demographic and Health Survey <> reference year: 4
SYPJ <> reference year: 4
Listening to Displaced People Survey <> geography: 4
refugee registration data <> geography: 4
2009 census <> geography: 4
European Social Survey <> geography: 4
Penn World Tables <> data type: 3
World Bank survey <> geography: 3
Demographic and Health Survey <> data type: 3
JLMPS <> reference year: 3
Gallup World Poll <> reference year: 3
Gallup World Poll <> geography: 3
S
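Because each relation label carries its dataset name, the counts above are spread over many near-unique labels. Collapsing them to the relation type gives a clearer view of class balance. A minimal sketch, assuming labels follow the `subject <> relation` pattern shown above; `collapse_to_relation_types` is a hypothetical helper and the demo counts are a handful of rows copied from the printed output.

```python
from collections import Counter
from typing import Mapping

def collapse_to_relation_types(label_counts: Mapping[str, int]) -> Counter:
    """Sum counts per relation type; plain dataset labels keep their own key."""
    collapsed = Counter()
    for label, count in label_counts.items():
        # A label may hold several comma-separated "subject <> relation" parts
        for part in label.split(","):
            part = part.strip()
            if not part:
                continue
            key = part.split("<>", 1)[1].strip() if "<>" in part else part
            collapsed[key] += count
    return collapsed

demo_counts = {
    "named dataset": 267,
    "RDHS <> reference year": 12,
    "LFS <> reference year": 7,
    "FDP datasets <> geography": 10,
    "LFS <> geography": 5,
}
for label, count in collapse_to_relation_types(demo_counts).most_common():
    print(f"{label}: {count}")
```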

In [35]:
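from pydantic import BaseModel
from typing import List
import random

class Entity(BaseModel):
    text: str    # entity mention as it appears in the generated text
    label: str   # dataset label, or a "<dataset> <> <relation>" label

class SyntheticDatasetResponse(BaseModel):
    text: str
    entities: List[Entity]

class SyntheticDatasetBatch(BaseModel):
    # The list of examples is wrapped in one field. "RootModel" here is just the
    # field name (read later via model_dump()['RootModel']), not pydantic's RootModel class.
    RootModel: List[SyntheticDatasetResponse]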

In [36]:
train_df.head()

Unnamed: 0,idx,text,tokenized_text,ner,validated,ner_normalized,extracted_text,entities
0,0,displacement has been decreasing slowly in con...,"[displacement, has, been, decreasing, slowly, ...","[(19, 19, named dataset), (24, 28, RUV <> desc...",True,"[(19, 19, named dataset), (24, 28, RUV <> desc...","[{'text': 'RUV', 'label': 'named dataset'}, {'...","[{'text': 'RUV', 'label': 'named dataset'}, {'..."
1,1,non-cooperative behaviors that lead to the eme...,"[non-cooperative, behaviors, that, lead, to, t...","[(41, 51, unique dataset on refugee camps <> d...",True,"[(41, 51, unique dataset on refugee camps <> d...",[{'text': 'assess the consequences of forced m...,[{'text': 'assess the consequences of forced m...
2,10,3 Figure 1 Emigration is highest for middle-in...,"[3, Figure, 1, Emigration, is, highest, for, m...","[(15, 16, unnamed dataset), (18, 19, migration...",True,"[(15, 16, unnamed dataset), (18, 19, migration...","[{'text': 'migration data', 'label': 'unnamed ...","[{'text': 'migration data', 'label': 'unnamed ..."
3,100,displaced children and adolescents residing in...,"[displaced, children, and, adolescents, residi...","[(17, 17, named dataset), (40, 68, VenRePs-Kid...",True,"[(17, 17, named dataset), (40, 68, VenRePs-Kid...","[{'text': 'VenRePs-Kids', 'label': 'named data...","[{'text': 'VenRePs-Kids', 'label': 'named data..."
4,101,"9 and the Syrian crisis , the UNHCR and the Wo...","[9, and, the, Syrian, crisis, ,, the, UNHCR, a...","[(51, 51, proGres <> geography), (52, 52, name...",True,"[(51, 51, proGres <> geography), (52, 52, name...","[{'text': 'Jordan', 'label': 'proGres <> geogr...","[{'text': 'Jordan', 'label': 'proGres <> geogr..."


In [40]:
train_df['sentence'] = train_df['tokenized_text'].map(join_tokens)

In [41]:
train_df.to_csv("150idx_train_data.csv", index=False)

In [42]:
# Parameters
SAMPLES_PER_DOMAIN = 1   # how many synthetic samples to generate per domain
N_SHOTS = 1               # how many few-shot examples per prompt
BATCH_SIZE = 50           # for downstream processing

# List of domains
domains = [
    "climate change",
   # "meteorology & weather modeling",
    "geospatial & remote sensing",
   # "oceanography & marine science",
   # "environmental science",
   # "hydrology & water resource management",
   # "biodiversity & ecosystem analysis",
   # "air pollution & emissions monitoring",
   # "energy & sustainability studies",
    "forestry & deforestation monitoring",
    "agriculture & food security",
    "carbon footprint & greenhouse gas inventories",
   # "macroeconomics & development studies",
    "global trade & commerce",
    "poverty & inequality research",
    "public finance & taxation",
    "monetary policy & inflation",
   # "sustainable finance & green investment",
    "microfinance & small business development",
    "labor markets & employment studies",
    "public policy & governance",
    #"urban development & city planning",
   # "disaster response & climate resilience",
   # "education & human capital development",
   # "social protection & welfare programs",
    "migration & refugee studies",
    "international aid & development policy",
   # "infrastructure development & transport economics",
   # "sustainable development goals (SDGs) monitoring",
    "World Bank & IMF policy research",
    #"refugee"
]
# domains = [
#     "meteorology & weather modeling",
#     "oceanography & marine science",
#     "environmental science",
#     "hydrology & water resource management",
#     "biodiversity & ecosystem analysis",
#     "air pollution & emissions monitoring",
#     "energy & sustainability studies",
#     "macroeconomics & development studies",
#     "sustainable finance & green investment",
#     "urban development & city planning",
#     "disaster response & climate resilience",
#     "education & human capital development",
#     "social protection & welfare programs",
#     "infrastructure development & transport economics",
#     "sustainable development goals (SDGs) monitoring",
#     "refugee",
# ]

In [43]:
import random

all_prompts = []

for _, row in train_df.iterrows():
    text = row["sentence"].strip()
    entities = row["entities"]

    # Sample 3 distinct domains
    sampled_domains = random.sample(domains, k=3)

    # Build entity display lines
    entity_lines = "\n".join(
        f"{i+1}. \"{ent['text']}\" ({ent['label']})"
        for i, ent in enumerate(entities)
    )

    # Build the entity JSON block (same across all)
    json_entities_example = ",\n        ".join(
        '{ "text": "...", "label": "' + ent["label"] + '" }'
        for ent in entities
    )

    # Example block
    example_section = f"""Example
---------
Text:
{text}

Entities (MUST have exactly {len(entities)} entries, in this order and format):
{entity_lines}
"""

    # Format sampled domains nicely
    domain_section = "\n".join(f"- {domain}" for domain in sampled_domains)

    prompt = f"""
Given the following example data, generate **three** synthetic examples using three different domains sampled below.
**IMPORTANT** The goal of the generation of synthetic examples is to extract dataset mentions in the text. Make sure to comply with the instructions below.

### Instructions:
1. Each generated text should match the structure and length of the original input.
2. Use exactly the same number of entity mentions, and the same labels in the same order.
3. Each output should reflect the theme of the domain it was generated for.
4. **Embed each entity string verbatim** in your generated text, matching its usage.
5. **Do not invent new dataset names** for `named dataset` entries — use only real-world or plausible names.
6. Output a list of 3 items, each with a `"text"` and `"entities"`.

### Allowed Labels:
- named dataset
- unnamed dataset
- vague dataset
- citation data source
- document data source

### Sampled Domains:
{domain_section}

{example_section}

### Output Format:
Return a list of 3 examples like this:

[
  {{
    "text": "...",  // related to: {sampled_domains[0]}
    "entities": [
        {json_entities_example}
    ]
  }},
  {{
    "text": "...",  // related to: {sampled_domains[1]}
    "entities": [
        {json_entities_example}
    ]
  }},
  {{
    "text": "...",  // related to: {sampled_domains[2]}
    "entities": [
        {json_entities_example}
    ]
  }}
]
""".strip()

    all_prompts.append(prompt)
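The loop above calls `random.sample` on the global generator without a seed, so every run pairs each seed example with different domains. If reproducible prompts matter, one option is a dedicated seeded generator per row. The sketch below assumes that approach; `sample_domains` and the choice of seed are illustrative, not part of the original notebook.

```python
import random
from typing import List

def sample_domains(domains: List[str], k: int, seed: int) -> List[str]:
    """Draw k distinct domains from a private, seeded generator."""
    return random.Random(seed).sample(domains, k)

demo_domains = [
    "climate change",
    "global trade & commerce",
    "public finance & taxation",
    "migration & refugee studies",
]

# Using the same seed (for example the row's idx) reproduces the same draw
first = sample_domains(demo_domains, k=3, seed=101)
second = sample_domains(demo_domains, k=3, seed=101)
print(first == second)
```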

In [44]:
len(all_prompts)

136

In [45]:
all_prompts[2]

'Given the following example data, generate **three** synthetic examples using three different domains sampled below.\n**IMPORTANT** The goal of the generation of synthetic examples is to extract dataset mentions in the text. Make sure to comply with the instructions below.\n\n### Instructions:\n1. Each generated text should match the structure and length of the original input.\n2. Use exactly the same number of entity mentions, and the same labels in the same order.\n3. Each output should reflect the theme of the domain it was generated for.\n4. **Embed each entity string verbatim** in your generated text, matching its usage.\n5. **Do not invent new dataset names** for `named dataset` entries — use only real-world or plausible names.\n6. Output a list of 3 items, each with a `"text"` and `"entities"`.\n\n### Allowed Labels:\n- named dataset\n- unnamed dataset\n- vague dataset\n- citation data source\n- document data source\n\n### Sampled Domains:\n- carbon footprint & greenhouse gas i

In [46]:
from tqdm import tqdm

def generate_from_prompts(prompts, model="gpt-4o-mini", temperature=0.2, top_p=0.95):
    """
    Sends each prompt to the model (one request per prompt) and parses the structured outputs.

    Args:
        prompts (list): List of generated prompts.
        model (str): Model to use for generation.
        temperature (float): Controls randomness in output.
        top_p (float): Nucleus-sampling cutoff; controls diversity in sampling.

    Returns:
        tuple: (parsed SyntheticDatasetResponse objects, the same examples as plain dicts)
    """
    all_outputs = []
    processed_output = []
    with tqdm(total=len(prompts), desc="Generating Synthetic Data") as pbar:
        for prompt in prompts:
            try:
                completion = client.beta.chat.completions.parse(
                    model=model,
                    temperature=temperature,  # use the caller's sampling settings
                    top_p=top_p,
                    messages=[{"role": "system", "content": prompt}],
                    response_format=SyntheticDatasetBatch,
                )

                # `parsed` is a SyntheticDatasetBatch; its examples live in the RootModel field
                batch = completion.choices[0].message.parsed
                all_outputs.extend(batch.RootModel)
                processed_output.extend(extract_entities(batch.model_dump()['RootModel']))

            except Exception as e:
                print(f"Error processing prompt: {e}")
            finally:
                pbar.update(1)  # advance the bar even when a prompt fails

    return all_outputs, processed_output

In [47]:
def extract_entities(parsed_batch):
    processed = []
    for item in parsed_batch:
        # Handle either Pydantic models or plain dicts
        text = item.text if hasattr(item, "text") else item["text"]
        entities = item.entities if hasattr(item, "entities") else item["entities"]
        processed.append({
            "text": text,
            # model_dump() is the pydantic v2 replacement for the deprecated dict()
            "entities": [ent.model_dump() if hasattr(ent, "model_dump") else ent for ent in entities]
        })
    return processed
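The generation loop above logs a failed request and moves on, so a transient API error drops that seed example for good. One common mitigation is to retry with exponential backoff before giving up. The wrapper below is a generic sketch, not something the original notebook does; `call_with_retries` and its parameters are hypothetical, and the flaky function stands in for the API call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base_delay * 2**(attempt - 1) seconds and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")

# Stand-in for an API call that fails twice and then succeeds
calls = {"n": 0}
def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky, max_attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In the loop, the request could be wrapped as `call_with_retries(lambda: client.beta.chat.completions.parse(...))`, keeping the existing `except` as the final fallback.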

In [48]:
len(all_prompts)

136

In [49]:
# Generate and parse using your structured LLM pipeline
all_outputs, processed_outputs = generate_from_prompts(
    all_prompts,
    model="gpt-4o-mini",
    temperature=0.2,
    top_p=0.95
)

Generating Synthetic Data: 100%|██████████| 1/1 [00:18<00:00, 18.03s/it]


In [50]:
all_prompts[0]

'Given the following example data, generate **three** synthetic examples using three different domains sampled below.\n**IMPORTANT** The goal of the generation of synthetic examples is to extract dataset mentions in the text. Make sure to comply with the instructions below.\n\n### Instructions:\n1. Each generated text should match the structure and length of the original input.\n2. Use exactly the same number of entity mentions, and the same labels in the same order.\n3. Each output should reflect the theme of the domain it was generated for.\n4. **Embed each entity string verbatim** in your generated text, matching its usage.\n5. **Do not invent new dataset names** for `named dataset` entries — use only real-world or plausible names.\n6. Output a list of 3 items, each with a `"text"` and `"entities"`.\n\n### Allowed Labels:\n- named dataset\n- unnamed dataset\n- vague dataset\n- citation data source\n- document data source\n\n### Sampled Domains:\n- labor markets & employment studies\

In [51]:
len(processed_outputs)

3

In [52]:
print(processed_outputs[0])

{'text': 'The recent analysis of labor trends indicates that the employment rate has been steadily increasing, particularly among younger demographics. 7 Data from the BLS shows that the number of job openings has reached an all-time high, with a significant portion of these positions being filled by individuals aged 18 to 24. In particular, the BLS report highlights that 45 percent of new hires in the last quarter were from this age group, which is a notable increase compared to previous years. Furthermore, 12 percent of these new employees were hired for roles in technology and healthcare sectors ( U.S. Bureau of Labor Statistics, 2022 ). Previous studies, utilizing data from job placement agencies, also indicate that this demographic is increasingly seeking flexible work arrangements. Many of these young workers are prioritizing jobs that offer remote work options, as indicated by a survey conducted in 2021.', 'entities': [{'text': 'BLS', 'label': 'named dataset'}, {'text': 'number 

In [53]:
print(processed_outputs[1])

{'text': "Recent climate models indicate that global temperatures are set to rise significantly over the next few decades. 9 Data from the IPCC suggests that the average global temperature could increase by 1.5 degrees Celsius by 2030 if current trends continue. This alarming projection has raised concerns among scientists, particularly regarding the impact on polar ice caps and sea levels. Notably, the IPCC report emphasizes that 70 percent of the world's glaciers are retreating at unprecedented rates ( Intergovernmental Panel on Climate Change, 2021 ). Previous assessments, based on satellite imagery, also reveal that coastal cities are at high risk of flooding due to rising sea levels. Many regions are now prioritizing climate resilience strategies, as highlighted in a study conducted in 2020.", 'entities': [{'text': 'IPCC', 'label': 'named dataset'}, {'text': 'average global temperature', 'label': 'IPCC <> description'}, {'text': 'polar ice caps', 'label': 'IPCC <> geography'}, {'t
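The prompt asks the model to embed every entity string verbatim and to keep the number and order of labels from the seed example, but nothing above verifies that it did. A lightweight check before saving might look like the sketch below. It is an assumption-laden illustration: `validate_sample` and `label_kind` are hypothetical helpers, and relation labels are compared by relation type only, since the generated dataset name differs from the seed's.

```python
from typing import Dict, List

def label_kind(label: str) -> str:
    """Dataset labels stay as-is; relation labels reduce to their relation type."""
    return label.split("<>", 1)[1].strip() if "<>" in label else label.strip()

def validate_sample(sample: Dict, expected_labels: List[str]) -> List[str]:
    """Return a list of problems found in one generated example (empty if none)."""
    problems = []
    entities = sample["entities"]
    if len(entities) != len(expected_labels):
        problems.append(f"expected {len(expected_labels)} entities, got {len(entities)}")
    for ent in entities:
        if ent["text"] not in sample["text"]:
            problems.append(f"entity not found verbatim: {ent['text']!r}")
    got = [label_kind(e["label"]) for e in entities]
    want = [label_kind(label) for label in expected_labels]
    if got != want:
        problems.append("label kinds differ from the seed example")
    return problems

seed_labels = ["named dataset", "RUV <> description"]
good = {
    "text": "Data from the IPCC suggests that the average global temperature could increase.",
    "entities": [
        {"text": "IPCC", "label": "named dataset"},
        {"text": "average global temperature", "label": "IPCC <> description"},
    ],
}
bad = {
    "text": "Data from the IPCC.",
    "entities": [
        {"text": "IPCC", "label": "named dataset"},
        {"text": "global temperature", "label": "IPCC <> description"},
    ],
}
print(validate_sample(good, seed_labels))
print(validate_sample(bad, seed_labels))
```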

### Save for training

In [54]:
import json
output_file = "synthetic_data_refugee_relations_150idx.json"

with open(output_file, "w", encoding="utf-8") as f:
    json.dump(processed_outputs, f, indent=2, ensure_ascii=False)
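The saved records hold entity strings, whereas the seed records earlier in this notebook use `tokenized_text` plus `ner` spans with inclusive token indices. To bring the synthetic data into that same layout, each entity string has to be located in the tokenized text. The sketch below shows one way to do that; the regex tokenizer and the choice to label every occurrence of an entity string are assumptions, and `to_token_spans` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Assumed tokenizer: words (optionally joined by - or _) and single non-space symbols
TOKEN_RE = re.compile(r"\w+(?:[-_]\w+)*|\S")

def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text)

def to_token_spans(sample: Dict) -> Dict:
    """Convert {"text", "entities"} into {"tokenized_text", "ner"} with inclusive spans."""
    tokens = tokenize(sample["text"])
    ner: List[list] = []
    for ent in sample["entities"]:
        ent_tokens = tokenize(ent["text"])
        n = len(ent_tokens)
        if n == 0:
            continue
        # Label every position where the entity's tokens occur in the text
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == ent_tokens:
                span = [i, i + n - 1, ent["label"]]
                if span not in ner:
                    ner.append(span)
    return {"tokenized_text": tokens, "ner": ner}

example = {
    "text": "Data from the IPCC suggests that sea levels rose in 2020.",
    "entities": [
        {"text": "IPCC", "label": "named dataset"},
        {"text": "2020", "label": "IPCC <> reference year"},
    ],
}
converted = to_token_spans(example)
print(converted["ner"])
```

Entities that produce no span were not embedded verbatim by the model and are worth inspecting before training.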