In [1]:
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain_core.output_parsers import PydanticOutputParser

  from .autonotebook import tqdm as notebook_tqdm


## STEP 0: Helpers
Here I define some of the helper classes/functions for use in downstream tasks.

In [2]:
class Response(BaseModel):
    answer: str

parser = PydanticOutputParser(pydantic_object=Response)
format_instructions = parser.get_format_instructions()


def parse(output_text: str) -> str:
    try:
        parsed: Response = parser.parse(output_text)
        return parsed.answer
    except Exception as e:
        raise e(f"Could not parse from: {output_text!r}")
    
def propose_domain_name(description, prompt, model, tokenizer):
    prompt_specific = prompt.format(format_instructions=format_instructions)
    messages = [
        {"role" : "system", "content" : prompt_specific},
        {"role" : "user", "content" : f"**business_description**:\n{description}"}
    ]
    tokenized_text_processed = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
    tokenized_text_processed = tokenizer([tokenized_text_processed], return_tensors="pt")
    tokenized_text_processed = {k: v.to(model.device) for k, v in tokenized_text_processed.items()}
    generated_ids = model.generate(**tokenized_text_processed, max_new_tokens=1024, do_sample=False)
    output_ids = generated_ids[0][len(tokenized_text_processed["input_ids"][0]):]
    output_text = tokenizer.decode(output_ids, skip_special_tokens=True)
    output = parse(output_text)
    return output

## STEP 1: setting up a baseline 

For this tutorial, I decided to take on open-source route. Obviously, in production, it would be important to experiment with different models, but for the purpose of this preview, I decided to roll with one of the cheaper open-source flagmans, `Qwen3-4b-Instruct-2507`. Ideally, would have played with `Thinking` version but generation time is much longer so left that out too. Finally, I opted to use `transformers/langgraph` packages as they are often encountered in such scenarios. I will be asking the model to suggest single name to have it more consistent with finetuning part in further stages, to have better comparability. 

In [3]:
NAME = "Qwen/Qwen3-4B-Instruct-2507"
model = AutoModelForCausalLM.from_pretrained(NAME, dtype = torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(NAME)

Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 37.27it/s]


In [4]:
prompt_base = """You are a creative but conceptual thinking tool whose task is to take the user provided *business_description** and try to summarize it and propose a potential domain name for their website.
    If the description includes references to area where the company operates, take it into consideration.

    IMPORTANT:
    - Return ONLY ONE suggested domain name.
    - Return ONLY the final answer in valid JSON format matching the schema below:

    {format_instructions}"""

here are 10 gpt-generated business-description examples on which I will test different prompts

In [5]:
manual_testing_bd = [
    "A small studio in Barcelona that designs custom lamps and light fixtures. We mostly work with reclaimed wood and metal. Everything is handmade and produced in small batches.",
    "We’re building an AI tool that helps small online shops write product descriptions automatically. It analyzes the photos and creates short, SEO-friendly copy. Not everything works perfectly yet, but early testers seem to like it.",
    "Local dog walking service in Edinburgh. Nothing fancy — just reliable walkers and fair prices. We also send the owners small picture updates after each walk.",
    "A new app that lets people rent out unused musical instruments to others in their neighborhood. Think of it like Airbnb but for guitars, violins, and keyboards. Still figuring out pricing though.",
    "I’m starting an accounting service for freelancers in the US. Mostly bookkeeping, invoicing help, tax reminders, stuff like that. Everything is online and appointments can be booked through the website.",
    "A boutique tea brand blending organic loose-leaf teas inspired by traditional Moroccan flavors. Each blend is hand-mixed and sold in small decorative tins.",
    "We make a browser extension that automatically finds coupon codes for online stores. It applies the best one at checkout. It’s aimed at college students who want quick savings.",
    "A family-owned bike repair shop located in Amsterdam. We fix all kinds of bikes and offer annual tune-up memberships. Customers can drop off their bikes or request mobile repairs.",
    "An online course platform focused on teaching retirees how to use smartphones, social media, and basic digital tools. Lessons are short, friendly, and designed for absolute beginners.",
    "Trying to build a marketplace for local artists to sell prints, stickers, and small handmade items. Not sure yet how to handle the shipping side, but the idea is to help them reach more people."
]

In [6]:
for bd in manual_testing_bd:
    print(f"User Description:\n{bd}")
    print(f"LLM suggestion: {propose_domain_name(bd, prompt_base, model, tokenizer)}")
    print("-----------------------------------------------------\n")

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


User Description:
A small studio in Barcelona that designs custom lamps and light fixtures. We mostly work with reclaimed wood and metal. Everything is handmade and produced in small batches.
LLM suggestion: barcelona-lampstudio.com
-----------------------------------------------------

User Description:
We’re building an AI tool that helps small online shops write product descriptions automatically. It analyzes the photos and creates short, SEO-friendly copy. Not everything works perfectly yet, but early testers seem to like it.
LLM suggestion: shopdesc.ai
-----------------------------------------------------

User Description:
Local dog walking service in Edinburgh. Nothing fancy — just reliable walkers and fair prices. We also send the owners small picture updates after each walk.
LLM suggestion: edinburghdogwalks.com
-----------------------------------------------------

User Description:
A new app that lets people rent out unused musical instruments to others in their neighborhood

## STEP 2: Improving via prompt-engineering
Okay, looking at the suggestions on the test set, we can easily notice that 2 of the examples actually are not even proper domain names: `neighborhoodinstrument`, `CouponSavvy` and `LocalArtMarket`. This is clearly not good and should be addressed in prompt engineering. Furthermore, other descriptions, while relevant, have two evident issues: 

1) they are straight-forward 
2) they disregard suggesting appropriate TLD in most of the cases where it could be relevant.

In [7]:
prompt_improved = """You are a creative but conceptual thinking tool whose task is to take the user provided *business_description** and try to summarize it and propose a potential domain name for their website.
    If the description includes references to area where the company operates, take it into consideration.

    IMPORTANT:
    - Return ONLY ONE suggested domain name.
    - If it is possible to identify the relevant location, suggest TLD based on it so that it targets customers from relevant region.
    - Ensure that domain name is of proper format: NAME.TLD.
    - Return ONLY the final answer in valid JSON format matching the schema below:

    {format_instructions}"""

In [8]:
for bd in manual_testing_bd:
    print(f"User Description:\n{bd}")
    print(f"LLM suggestion: {propose_domain_name(bd, prompt_improved, model, tokenizer)}")
    print("-----------------------------------------------------\n")

User Description:
A small studio in Barcelona that designs custom lamps and light fixtures. We mostly work with reclaimed wood and metal. Everything is handmade and produced in small batches.
LLM suggestion: lumina-barca.com
-----------------------------------------------------

User Description:
We’re building an AI tool that helps small online shops write product descriptions automatically. It analyzes the photos and creates short, SEO-friendly copy. Not everything works perfectly yet, but early testers seem to like it.
LLM suggestion: shopdesc.ai
-----------------------------------------------------

User Description:
Local dog walking service in Edinburgh. Nothing fancy — just reliable walkers and fair prices. We also send the owners small picture updates after each walk.
LLM suggestion: walkedin.educ
-----------------------------------------------------

User Description:
A new app that lets people rent out unused musical instruments to others in their neighborhood. Think of it li

Now, from the first look, it seems that there are no more improperly formatted domain name suggestions, however, after a quick look at: https://data.iana.org/TLD/tlds-alpha-by-domain.txt one can see that `.amsterdam`, `.tea` or `.educ` not only sound weird, they simply do not exist. Also, some of the suggestions still do not respect the area of operation based on the description. Small detail - there is a mix of lower and upper case letters in the suggestions, which does not look consistent but this could be addressed not only through prompt engineering but also just simply through the function definition:

In [10]:
def propose_domain_name(description, prompt, model, tokenizer):
    prompt_specific = prompt.format(format_instructions=format_instructions)
    messages = [
        {"role" : "system", "content" : prompt_specific},
        {"role" : "user", "content" : f"**business_description**:\n{description}"}
    ]
    tokenized_text_processed = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
    tokenized_text_processed = tokenizer([tokenized_text_processed], return_tensors="pt")
    tokenized_text_processed = {k: v.to(model.device) for k, v in tokenized_text_processed.items()}
    with torch.no_grad():
        generated_ids = model.generate(**tokenized_text_processed, max_new_tokens=1024, do_sample=False)
    output_ids = generated_ids[0][len(tokenized_text_processed["input_ids"][0]):]
    output_text = tokenizer.decode(output_ids, skip_special_tokens=True)
    print(output_text)
    output = parse(output_text)
    return output.lower()

In [11]:
prompt_improved_further = """
You are a domain-name generation assistant. Your job is to read the user-provided 
*business_description* and propose ONE realistic, human-feeling domain name that fits the business.

Follow these rules carefully:

1. If the business description mentions a specific city or country, use the correct ccTLD 
   for that location (e.g., .uk, .de, .fr, .nl, .es, .pt, .it, .lt, .ca, .us, .au, .jp, etc...).

2. If no location is mentioned, choose from REAL modern TLDs ONLY:
   .com, .io, .co, .app, .ai, .net, .org, etc...

3. The domain MUST:
   - sound natural and brandable
   - accurately reflect the type of business
   - use ONLY real TLDs (no invented extensions)

4. Make sure the domain is in the exact format: `name.tld`

5. Return ONLY one domain name.

6. The final output MUST be valid JSON that matches the schema:

{format_instructions}

Do NOT include explanations, reasoning, or extra text. Only return valid JSON.
"""

In [12]:
for bd in manual_testing_bd:
    print(f"User Description:\n{bd}")
    print(f"LLM suggestion: {propose_domain_name(bd, prompt_improved_further, model, tokenizer)}")
    print("-----------------------------------------------------\n")

User Description:
A small studio in Barcelona that designs custom lamps and light fixtures. We mostly work with reclaimed wood and metal. Everything is handmade and produced in small batches.
{"answer": "lumina-barca.com"}
LLM suggestion: lumina-barca.com
-----------------------------------------------------

User Description:
We’re building an AI tool that helps small online shops write product descriptions automatically. It analyzes the photos and creates short, SEO-friendly copy. Not everything works perfectly yet, but early testers seem to like it.
{"answer": "shopdesc.ai"}
LLM suggestion: shopdesc.ai
-----------------------------------------------------

User Description:
Local dog walking service in Edinburgh. Nothing fancy — just reliable walkers and fair prices. We also send the owners small picture updates after each walk.
{"answer": "pawwalk.edinburgh"}
LLM suggestion: pawwalk.edinburgh
-----------------------------------------------------

User Description:
A new app that le

Unfortunately, this prompt still doesnt solve some of the issues like `.edinburgh`/`.amsterdam` or suggesting `.com` too often. While this could probably be further addressed by including more examples in the prompt (or simply a choice of a better/bigger model), I will now turn my attention to fine-tuning.

## STEP 3: Fine-tuning

First things first, to do fine tuning I need to have some dataset. For simplicity, I used chat-gpt to come up with example data. This is reasonable as small experiments with gpt revealed that it can already generate domain suggestions pretty well and thus I am using its knowledge to transfer it into `Qwen3-4b-Instruct-2507`. The dataset contains 214 samples, ±90 of which are non-english. There is a huge focus on having the right TLD based on the description/language while other names I prompted to sound more innovative. Also created evaluation dataset to track training dynamics. 

For fine-tuning, standard approach in smaller/resource limited situations is to utilize LoRA (`peft` and `trl` packages).

In [15]:
from peft import LoraConfig, TaskType
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

We will be using `prompt_improved_further` as a system prompt to set the model for fine-tuning on the right rails. To do this, chat complete has to be changed to include `{% generation %}` tag for SFTConfig to recognize and enable `completions_only_loss`.

In [16]:
tokenizer_ft = AutoTokenizer.from_pretrained(NAME)

# Overwrite the chat template with a simple one that supports assistant masks
tokenizer_ft.chat_template = """\
{% for message in messages %}
{% if message['role'] == 'system' %}
<|im_start|>system
{{ message['content'] }}<|im_end|>

{% elif message['role'] == 'user' %}
<|im_start|>user
{{ message['content'] }}<|im_end|>

{% elif message['role'] == 'assistant' %}
<|im_start|>assistant
{{ message['content'] }}<|im_end|>

{% endif %}
{% endfor %}
{% if add_generation_prompt %}
<|im_start|>assistant
{% generation %}
{% endgeneration %}
{% endif %}
"""

`assistant` portion has to be formated as this to ensure that system prompt is actually respected and followed and model does not learn to ignore it 

spoiler: (in my first attempt I actually forgot this and the model would predict string instead of JSON as pretrained instruct model even though system prompt clearly asks for it -> not good).

In [17]:
def formatting_example(example):   
    result = {
    "messages": [
            {"role": "system", "content": prompt_improved_further},
            {"role": "user", "content": f"**business_description**:\n{example['business_description']}"},
            {"role": "assistant", "content": f'{{"answer": "{example["domain_name"]}"}}'}
        ]
    }
    return result


train_raw = load_dataset("json", data_files=['ft.jsonl'], split="train")
eval_raw  = load_dataset("json", data_files=['ft_eval.jsonl'],  split="train")

train_dataset = train_raw.map(formatting_example, remove_columns=train_raw.column_names)
eval_dataset  = eval_raw.map(formatting_example,  remove_columns=eval_raw.column_names)

Map: 100%|██████████| 214/214 [00:00<00:00, 13882.62 examples/s]
Map: 100%|██████████| 67/67 [00:00<00:00, 13568.56 examples/s]


In [18]:
peft_config = LoraConfig(
    r = 4,
    lora_alpha = 8, # sensible default, lora_alpha=2*lora_rank
    lora_dropout = 0.1,
    target_modules = 'all-linear',
    task_type = TaskType.CAUSAL_LM,
)

args = SFTConfig(
    output_dir='ft_model/',
    overwrite_output_dir=True,
    num_train_epochs=3,
    eval_strategy='epoch',
    max_length=1024,
    bf16=True,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    lr_scheduler_type = 'cosine_with_min_lr',
    lr_scheduler_kwargs = {'min_lr_rate' : 0.1},
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    weight_decay=0.01,
    gradient_accumulation_steps=2,
    packing=False,
    completion_only_loss=True
)

In [21]:
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer_ft,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=args,
    peft_config=peft_config,
)



In [22]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Epoch,Training Loss,Validation Loss,Entropy,Num Tokens,Mean Token Accuracy
1,2.2699,1.592847,1.48114,65152.0,0.660346
2,0.8905,0.622565,0.621541,130304.0,0.855931
3,0.3945,0.417203,0.413008,195456.0,0.917512




TrainOutput(global_step=81, training_loss=1.2922264874717335, metrics={'train_runtime': 3953.8868, 'train_samples_per_second': 0.162, 'train_steps_per_second': 0.02, 'total_flos': 4421166407319552.0, 'train_loss': 1.2922264874717335, 'epoch': 3.0})

Looking at the table, both the training and validation loss seems to be decreasing reasonably, thus indicating that model has not overfit yet and the improvements (in terms of cross-entropy) are made. Additionally, looking at mean token accuracy, we can see that it grows significantly and thus finetuning has had positive effect on the models capabilities to recommend the domain name (comparing to the simulated datasets). The results are promising, and probably could be increased even further but for the purpose of this preview, they will be left as-is.

Now lets load the finetuned model and tokenizer to run through manual testing on the sample of data to compare to previous prompt-engineering approach.

In [23]:
model_ft = AutoModelForCausalLM.from_pretrained('ft_model/checkpoint-81')
tokenizer_ft = AutoTokenizer.from_pretrained('ft_model/checkpoint-81')

Loading checkpoint shards: 100%|██████████| 3/3 [00:18<00:00,  6.24s/it]


In [24]:
for bd in manual_testing_bd:
    print(f"User Description:\n{bd}")
    print(f"LLM suggestion: {propose_domain_name(bd, prompt_improved_further, model_ft, tokenizer_ft)}")
    print("-----------------------------------------------------\n")

User Description:
A small studio in Barcelona that designs custom lamps and light fixtures. We mostly work with reclaimed wood and metal. Everything is handmade and produced in small batches.
{"answer": "lumina-barcelona.es"}
LLM suggestion: lumina-barcelona.es
-----------------------------------------------------

User Description:
We’re building an AI tool that helps small online shops write product descriptions automatically. It analyzes the photos and creates short, SEO-friendly copy. Not everything works perfectly yet, but early testers seem to like it.
{"answer": "shopdesc.ai"}
LLM suggestion: shopdesc.ai
-----------------------------------------------------

User Description:
Local dog walking service in Edinburgh. Nothing fancy — just reliable walkers and fair prices. We also send the owners small picture updates after each walk.
{"answer": "edinburghdogwalks.co.uk"}
LLM suggestion: edinburghdogwalks.co.uk
-----------------------------------------------------

User Description:

There is the glaring issue for 4th prompt, where model predicts "string" which is not good. Probably finetuning could be improved with more examples, better adjusted training parameters or fallback options, like "if the prediction is not valid domain name -> use base model, if that is not valid -> use base model to summarize into one word the description and manually add '.com'. Obviously, in this experiment I did not pay attention to model railguards and rely on the knowledge of the base model to control it. 

However, there is a clear improvement in model's ability to suggest relevant TLD based on the area mentioned in the description (e.g. `.es`, `.co.uk`). Also, to my taste, the suggested domain names are a bit more creative.

## STEP 4: FAST API implementation
Finally, putting this all back together, here is the code to create an API endpoint for the model/agent. The logic is exported to a separate file, `app.py` for simplicity of launching it. To do it, simply run:
```bash
uvicorn app:app --reload --port 8000
```
and to access it, load it on your browser via `http://127.0.0.1:8000/docs#` and under `POST /suggest` experiment with how the API works.




## STEP 5: Final remarks
Overall, this repository gives a preview of how I would tacke this task. It took around 4 hours of work (+2-3 hours for model training and inference) to create this. Here are pros and cons of **promp-engineering** and **fine-tuning** approaches:

**prompt-engineering**

*pros*:
- very fast to implement
- very flexible in terms of the possibility to produce different results
- great for experimentation and setting up a baseline
- internal knowledge could be improved via providing additional information or examples

*cons*:
- while additional information could be added in the prompt, the more different things you put there, the higher the risk of the model being unable to focus/perform on the task
- small changes in the prompt can have significant changes in the model output without clear understanding of why that happens, i.e. stability


**fine-tuning**

*cons*:
- needs dataset
- takes much longer to set-up, including hyperparameter tuning and the training itself
- risk of overfitting
- potential for catastrophic forgetting on previous knowledge
- with standard implementations, model is bound to become somewhat more restrictve and more task-focused


*pros*:
- can achieve much better results than prompt-engineering
- much more customizable control over what and how the model outputs should look like

As for the future suggestions:
1) to transform the llm into an agent with tools; one of the tools could check if the website that is recommended already exists and if so, request the model to regenerate another suggestion.
2) output list of suggestions so that the user can choose.
3) spend more time to identify from the description the target population of the website and take it into account when proposing the domain name.
4) explore slightly bigger models and play with the hyperparameters.
5) expand the training dataset with more examples, or, even better, have human writters come up with a new, human-like dataset that would give a much more authentic flavour to the finetuned model, as some of the suggestions are still bland and predictable. Ofcourse, this could be also done using LLM to generate the data but the level of control should be very high.