# Baseline Model Training

This notebook trains a baseline model for the Domain Name Suggestion LLM using full fine-tuning on the synthetic dataset created in the previous notebook.

## Objectives
- Load the synthetic dataset.
- Fine-tune an open-source LLM (e.g., LLaMA-7B) on the dataset.
- Save model checkpoints and track performance.
- Generate initial domain suggestions for evaluation.

## Setup
Ensure the environment is set up with GPU support and the necessary dependencies installed.

In [None]:
import sys
import os
from pathlib import Path

# Add src to path
sys.path.append(str(Path.cwd().parent / 'src'))

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
from utils.config import load_config
import logging

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

## Load Configuration

Load the model configuration from `config/model_config.yaml`.

In [None]:
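# Relative paths here resolve against the current working directory, whereas src/ above
# is resolved from the working directory's parent; adjust them if the notebook is run elsewhere.
CONFIG_PATH = 'config/model_config.yaml'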
config = load_config(CONFIG_PATH)

MODEL_NAME = config['base_model']['name']
TRAINING_CONFIG = config['versions']['baseline']
OUTPUT_DIR = 'models/baseline'

logger.info(f'Loading model: {MODEL_NAME}')
logger.info(f'Output directory: {OUTPUT_DIR}')
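
The cells in this notebook read several keys from the loaded configuration. As a rough sketch, the key paths below are collected from the notebook's own cells and checked against a config dict before training starts. The helper name `missing_keys` and the placeholder values in `example_cfg` are assumptions for illustration, not the project's actual settings.

```python
# Key paths this notebook reads from the loaded config (taken from its own cells).
REQUIRED_KEYS = [
    ('base_model', 'name'),
    ('base_model', 'use_auth_token'),
    ('versions', 'baseline'),
    ('data', 'val_split'),
    ('data', 'max_input_length'),
    ('data', 'max_output_length'),
    ('hardware', 'mixed_precision'),
    ('hardware', 'gradient_checkpointing'),
    ('generation', 'max_length'),
    ('generation', 'temperature'),
    ('generation', 'top_p'),
]


def missing_keys(cfg, required=REQUIRED_KEYS):
    """Return the dotted key paths from `required` that are absent from `cfg`."""
    missing = []
    for path in required:
        node = cfg
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append('.'.join(path))
                break
            node = node[key]
    return missing


# Hypothetical config with placeholder values, only to exercise the check.
example_cfg = {
    'base_model': {'name': 'placeholder-model', 'use_auth_token': False},
    'versions': {'baseline': {}},
    'data': {'val_split': 0.1, 'max_input_length': 128, 'max_output_length': 32},
    'hardware': {'mixed_precision': 'fp16', 'gradient_checkpointing': True},
}
print(missing_keys(example_cfg))
# ['generation.max_length', 'generation.temperature', 'generation.top_p']
```

Running a check like this right after `load_config` would surface a missing section before any model weights are downloaded.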

## Load Dataset

Load the synthetic dataset and prepare it for training.

In [None]:
DATASET_PATH = 'data/synthetic/training_data.json'
dataset = load_dataset('json', data_files=DATASET_PATH)

# Split dataset
train_val = dataset['train'].train_test_split(test_size=config['data']['val_split'])
train_dataset = train_val['train']
val_dataset = train_val['test']

logger.info(f'Training dataset size: {len(train_dataset)}')
logger.info(f'Validation dataset size: {len(val_dataset)}')
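
Later cells read the `input` and `output` fields of each record, so it may help to confirm that every record carries both before tokenizing. The following is a minimal sketch using only the standard library; `validate_records` and the sample records are hypothetical and not part of the project.

```python
import json


def validate_records(records):
    """Return indices of records lacking non-empty string 'input' and 'output' fields."""
    bad = []
    for i, rec in enumerate(records):
        ok = isinstance(rec, dict) and all(
            isinstance(rec.get(key), str) and rec[key].strip()
            for key in ('input', 'output')
        )
        if not ok:
            bad.append(i)
    return bad


# Hypothetical records, for illustration only.
sample = json.loads('''[
  {"input": "Organic coffee shop in downtown area", "output": "downtownorganicbrew.com"},
  {"input": "Tech startup focused on AI solutions", "output": ""},
  {"input": "Missing output field"}
]''')
print(validate_records(sample))  # [1, 2]
```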

## Load Model and Tokenizer

Load the pre-trained model and tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_auth_token=config['base_model']['use_auth_token'])
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, use_auth_token=config['base_model']['use_auth_token'])

# Set padding token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

logger.info('Model and tokenizer loaded successfully')

## Preprocess Dataset

Tokenize the dataset for training.

In [None]:
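def preprocess_function(examples):
    # A causal LM is trained on one sequence (prompt + target), and its labels must
    # have the same length as input_ids, so prompt and target are tokenized together.
    prompts = [f'Business Description: {ex} -> Domain: ' for ex in examples['input']]
    texts = [p + t + tokenizer.eos_token for p, t in zip(prompts, examples['output'])]
    max_length = config['data']['max_input_length'] + config['data']['max_output_length']
    model_inputs = tokenizer(texts, max_length=max_length, truncation=True, padding='max_length')

    # Labels are a copy of input_ids with padding positions set to -100 so the loss ignores them.
    # The attention mask is used because the pad token is the same as the EOS token.
    model_inputs['labels'] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(model_inputs['input_ids'], model_inputs['attention_mask'])
    ]
    return model_inputs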

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)

logger.info('Dataset preprocessing completed')
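
One common variant for causal-LM fine-tuning also masks the prompt tokens in the labels, so the loss is computed only on the target. The sketch below shows the idea on plain lists of token ids, without a tokenizer. The function name `build_causal_lm_example` and the toy ids are assumptions for illustration; `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_causal_lm_example(prompt_ids, target_ids, max_length, pad_id, eos_id):
    """Concatenate prompt and target ids, pad to max_length, and mask labels."""
    input_ids = (prompt_ids + target_ids + [eos_id])[:max_length]
    n_real = len(input_ids)
    n_prompt = min(len(prompt_ids), n_real)
    n_pad = max_length - n_real

    attention_mask = [1] * n_real + [0] * n_pad
    # Loss is computed on target tokens and the closing EOS, not on prompt or padding.
    labels = [IGNORE_INDEX] * n_prompt + input_ids[n_prompt:] + [IGNORE_INDEX] * n_pad
    input_ids = input_ids + [pad_id] * n_pad
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}


# Toy token ids, chosen arbitrarily for illustration.
example = build_causal_lm_example(
    prompt_ids=[5, 6, 7], target_ids=[8, 9], max_length=8, pad_id=0, eos_id=2
)
print(example['input_ids'])  # [5, 6, 7, 8, 9, 2, 0, 0]
print(example['labels'])     # [-100, -100, -100, 8, 9, 2, -100, -100]
```

Whether to mask the prompt is a design choice: training on the full sequence is simpler, while masking the prompt focuses the loss on the domain names themselves.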

## Setup Training

Configure the training arguments and initialize the trainer.

In [None]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=TRAINING_CONFIG['num_epochs'],
    per_device_train_batch_size=TRAINING_CONFIG['batch_size'],
    per_device_eval_batch_size=TRAINING_CONFIG['batch_size'],
    gradient_accumulation_steps=TRAINING_CONFIG['gradient_accumulation_steps'],
    learning_rate=TRAINING_CONFIG['learning_rate'],
    weight_decay=TRAINING_CONFIG['weight_decay'],
    warmup_steps=TRAINING_CONFIG['warmup_steps'],
    evaluation_strategy='steps',
    eval_steps=TRAINING_CONFIG['eval_steps'],
    save_steps=TRAINING_CONFIG['save_steps'],
    logging_steps=TRAINING_CONFIG['logging_steps'],
    fp16=config['hardware']['mixed_precision'] == 'fp16',
    gradient_checkpointing=config['hardware']['gradient_checkpointing'],
    save_total_limit=2,
    report_to=['tensorboard']
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val
)

logger.info('Training configuration completed')
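
When choosing `warmup_steps`, `eval_steps`, and `save_steps`, it helps to know roughly how many optimizer steps a run will take. The sketch below estimates this from the dataset size, per-device batch size, and gradient accumulation. The function `training_schedule` and the example numbers are hypothetical, and `Trainer`'s exact rounding may differ between library versions, so treat the result as an approximation.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_devices=1):
    """Estimate effective batch size and optimizer-step counts for a run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        'effective_batch_size': effective_batch,
        'steps_per_epoch': steps_per_epoch,
        'total_steps': steps_per_epoch * num_epochs,
    }


# Hypothetical numbers, not taken from the project's config.
plan = training_schedule(num_examples=1000, per_device_batch_size=4,
                         grad_accum_steps=8, num_epochs=3)
print(plan)
# {'effective_batch_size': 32, 'steps_per_epoch': 32, 'total_steps': 96}
```

With these example numbers, an `eval_steps` value larger than 96 would mean evaluation never runs during training.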

## Train Model

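Run training, then save the final model and tokenizer. Intermediate checkpoints are written every `save_steps` steps, and `save_total_limit=2` keeps only the two most recent.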

In [None]:
logger.info('Starting model training...')
trainer.train()
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
logger.info(f'Model and tokenizer saved to {OUTPUT_DIR}')
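
To track performance beyond the raw loss, the validation loss can be converted to perplexity. `trainer.evaluate()` returns a metrics dict that includes `eval_loss`; the sketch below assumes that value is a mean token-level cross-entropy in nats. The helper `perplexity_from_loss` and the example loss value are hypothetical.

```python
import math


def perplexity_from_loss(eval_loss):
    """Convert a mean token-level cross-entropy loss (in nats) to perplexity."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        # Very large losses overflow a float; report infinity instead of raising.
        return float('inf')


# Hypothetical metrics dict, shaped like the output of trainer.evaluate().
metrics = {'eval_loss': 2.0}
print(round(perplexity_from_loss(metrics['eval_loss']), 3))  # 7.389
```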

## Generate Sample Predictions

Test the trained model with sample business descriptions.

In [None]:
def generate_domains(model, tokenizer, description, num_suggestions=3):
    input_text = f'Business Description: {description} -> Domain: '
    inputs = tokenizer(input_text, return_tensors='pt', padding=True).to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=config['generation']['max_length'],
        num_return_sequences=num_suggestions,
        temperature=config['generation']['temperature'],
        top_p=config['generation']['top_p'],
        do_sample=True
    )
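    # Split on 'Domain:' (no trailing space) and strip, so parsing still works if decoding drops whitespace.
    decoded = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    return [text.split('Domain:')[-1].strip() for text in decoded]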

sample_descriptions = [
    'Professional consulting firm specializing in business optimization',
    'Tech startup focused on AI solutions',
    'Organic coffee shop in downtown area'
]

model.eval()
for desc in sample_descriptions:
    domains = generate_domains(model, tokenizer, desc)
    print(f'\nBusiness Description: {desc}')
    print('Generated Domains:')
    for i, domain in enumerate(domains, 1):
        print(f'  {i}. {domain}')
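
Sampled generations can contain duplicates, trailing text, or strings that are not domain names at all. The sketch below shows one possible post-processing step that keeps the first token of each generation, normalizes case, filters with a simplified domain pattern, and removes duplicates. The function `clean_suggestions`, the regular expression, and the sample strings are assumptions for illustration, and the pattern is not a full validity check for domain names.

```python
import re

# Simplified pattern: labels of letters, digits, and inner hyphens, then an alphabetic TLD.
DOMAIN_RE = re.compile(r'^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}$')


def clean_suggestions(raw_outputs):
    """Normalize raw generations, keep plausible domain names, and drop duplicates."""
    seen, cleaned = set(), []
    for text in raw_outputs:
        parts = text.strip().split()
        if not parts:
            continue
        candidate = parts[0].lower().rstrip('.,;')
        if DOMAIN_RE.match(candidate) and candidate not in seen:
            seen.add(candidate)
            cleaned.append(candidate)
    return cleaned


# Hypothetical raw generations, for illustration only.
raw = ['BrewHaven.com', ' brewhaven.com\n', 'not a domain', 'ai-forge.io and more text', '']
print(clean_suggestions(raw))  # ['brewhaven.com', 'ai-forge.io']
```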

## Conclusion

This notebook fine-tunes the baseline model and generates sample domain suggestions from it. The next steps are to evaluate the model and to iterate with LoRA/QLoRA fine-tuning.