Baseline Domain Name Generator 🚀

This notebook documents how we build our baseline model for domain name suggestions. We fine-tune a small, open-source LLM to generate domain names from a business description. This baseline is the reference point for future iterations and improvements.

Objective:

Fine-tune a small, open-source LLM (TinyLlama-1.1B) for domain name generation.

Use a simple, reproducible fine-tuning recipe.

Produce a model that serves as the benchmark for all future evaluations.



1. Setup and Library Installation
First, we install the required libraries. Installing them in the first cell keeps the notebook self-contained. The command does not pin versions, so pin them if you need the environment to be exactly reproducible.

In [None]:
# Install necessary libraries
!pip install -q transformers datasets torch accelerate
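
Because the install command above does not pin versions, it can help to record what was actually installed alongside any results. The snippet below is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a mapping of package name to installed version (None if missing)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            found[name] = None
    return found


# Report the versions of the libraries this notebook installs.
for name, ver in installed_versions(["transformers", "datasets", "torch", "accelerate"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Saving this output with the trained model makes it easier to recreate the environment later.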

2. Data Preparation
Our dataset is in JSONL format; each example has a "prompt" and a "completion" field. We load the data and join each pair into a single text string for the LLM to learn from. The "### Business Description:" and "### Domain Suggestions:" headers mark where the input ends and the output begins.

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
import json
from datasets import Dataset

DATA_FILE_PATH = "domain_name_training.jsonl"

# Load the dataset from the JSONL file
data = []
try:
    with open(DATA_FILE_PATH, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # Skip blank lines
                data.append(json.loads(line))
except FileNotFoundError:
    # Raise a clear error so the cell stops here without shutting down the kernel
    raise FileNotFoundError(
        f"The file '{DATA_FILE_PATH}' was not found. Upload it with the cell above or check the path."
    ) from None
except json.JSONDecodeError as e:
    raise ValueError(f"Error decoding JSONL file: {e}") from e

# Format the data for training
processed_data = []
for item in data:
    text = f"### Business Description: {item['prompt']}\n### Domain Suggestions: {item['completion']}"
    processed_data.append({"text": text})

dataset = Dataset.from_list(processed_data)
print(f"Loaded and formatted dataset with {len(dataset)} examples.")
print("Sample formatted data:", dataset[0])
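
The loading loop assumes every record has usable "prompt" and "completion" fields. As a hedged sketch, the check below validates one JSONL line and builds the same text layout. The helper names `parse_record` and `format_example` and the sample record are hypothetical, for illustration only.

```python
import json

REQUIRED_FIELDS = ("prompt", "completion")


def parse_record(line):
    """Parse one JSONL line and check that the required fields are non-empty strings."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("Each JSONL line must be a JSON object")
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"Missing or empty field: {field!r}")
    return record


def format_example(record):
    """Build the training text in the same layout the notebook uses."""
    return (
        f"### Business Description: {record['prompt']}\n"
        f"### Domain Suggestions: {record['completion']}"
    )


# Hypothetical record, not taken from the real dataset.
sample_line = '{"prompt": "Organic coffee shop in Portland", "completion": "pdxorganicbrew.com"}'
print(format_example(parse_record(sample_line)))
```

Running such a check before training surfaces malformed rows early, rather than as a `KeyError` in the formatting loop.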

3. Model and Tokenizer Initialization
We'll use TinyLlama-1.1B as our baseline model, specifically the TinyLlama-1.1B-Chat-v1.0 checkpoint. Its small size allows for quick fine-tuning and iteration. We load the model and its tokenizer from the Hugging Face Hub.

In [None]:
import torch  # Needed for torch.bfloat16 below
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print("Loading model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Causal LMs often ship without a padding token; reuse EOS so batches can be padded.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id
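
To make the "small size" point concrete, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes about 1.1 billion parameters (taken from the model name) and ignores gradients, optimizer state, and activations, which add substantially more during training. The helper `weight_memory_gib` is hypothetical.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params, dtype):
    """Estimate memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


approx_params = 1.1e9  # Assumption: read off the "1.1B" in the model name
for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gib(approx_params, dtype):.1f} GiB for weights alone")
```

Loading in `bfloat16`, as the cell above does, halves the weight footprint relative to `float32`.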

4. Tokenization and Fine-Tuning
The dataset needs to be tokenized before we can train the model. We'll then use the Trainer API from Hugging Face, which simplifies the fine-tuning process. We'll use a basic set of hyperparameters for our baseline.

In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
print("Dataset tokenized.")
print(tokenized_datasets)

# A data collator is used to handle padding of sequences during training
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Define training arguments
OUTPUT_DIR = "./tinyllama_domain_suggestions_baseline"
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    bf16=True,  # Mixed precision matching the bfloat16 weights loaded above; needs a GPU with bfloat16 support
    logging_dir='./logs',
    logging_steps=10,
    save_strategy="epoch",
    # No evaluation during training (the Trainer default), so no evaluation strategy is set
    report_to="none"
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    data_collator=data_collator,
)

print("Starting fine-tuning...")
trainer.train()

print("Fine-tuning complete.")
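
To get a feel for how long training runs and how often it logs, the number of optimizer steps can be estimated from the hyperparameters above. This is a sketch that assumes a single device, no gradient accumulation, and a kept final partial batch. The dataset size of 1000 is a made-up figure, and `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps, counting the last partial batch of each epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Batch size 8 and 3 epochs come from the training arguments above;
# 1000 examples is an illustrative assumption.
steps = total_optimizer_steps(num_examples=1000, batch_size=8, num_epochs=3)
print(f"Estimated optimizer steps: {steps}")
print(f"Estimated log entries at logging_steps=10: {steps // 10}")
```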

5. Model Saving and Next Steps
After training, the model and tokenizer files are saved to a local directory. We then use Python's zipfile library to compress this entire directory into a single archive, which is easier to share and manage.

In [None]:
import os
import zipfile

# Save the fine-tuned model and tokenizer to a directory
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print(f"Baseline model and tokenizer saved to {OUTPUT_DIR}")

# Create a zip file from the saved directory
zip_file_name = f"{OUTPUT_DIR}.zip"
print(f"Creating a zip archive: {zip_file_name}")

with zipfile.ZipFile(zip_file_name, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Use "filenames" so the google.colab "files" module imported earlier is not shadowed
    for root, _dirs, filenames in os.walk(OUTPUT_DIR):
        for filename in filenames:
            # Create a full file path
            file_path = os.path.join(root, filename)
            # Add file to zip, preserving directory structure
            zipf.write(file_path, os.path.relpath(file_path, os.path.dirname(OUTPUT_DIR)))

print(f"Model successfully archived in {zip_file_name}")
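
As a possible next step, the saved model can be sanity-checked by prompting it with the same layout used in training and reading back what follows the "### Domain Suggestions:" header. The sketch below is not part of the original notebook. The helper names are hypothetical, and `max_new_tokens=32` with greedy decoding is an arbitrary illustrative choice.

```python
PROMPT_TEMPLATE = "### Business Description: {description}\n### Domain Suggestions:"
SUGGESTIONS_MARKER = "### Domain Suggestions:"


def build_prompt(description):
    """Format a business description the same way as the training examples."""
    return PROMPT_TEMPLATE.format(description=description)


def extract_suggestions(generated_text):
    """Return the text after the first suggestions header, cut at any later '###'."""
    _, found, tail = generated_text.partition(SUGGESTIONS_MARKER)
    if not found:
        return ""
    return tail.split("###", 1)[0].strip()


def suggest_domains(model, tokenizer, description, max_new_tokens=32):
    """Generate suggestions with an already loaded model and tokenizer."""
    inputs = tokenizer(build_prompt(description), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # Greedy decoding for a repeatable check
        pad_token_id=tokenizer.pad_token_id,
    )
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return extract_suggestions(text)
```

With the model and tokenizer reloaded from `OUTPUT_DIR` through `from_pretrained`, a call such as `suggest_domains(model, tokenizer, "Organic coffee shop in Portland")` would return the generated suggestion text.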