# Dataset Creation

This notebook creates a synthetic dataset for the Domain Name Suggestion LLM project. The dataset pairs business descriptions with domain name suggestions and varies the examples by industry, business type, tone, and complexity level.

## Objectives
- Generate 1,000 diverse training examples (configurable).
- Ensure diversity across industries, business types, and tones.
- Validate dataset quality and generate summary statistics.
- Save the dataset in JSON format for training and evaluation.

## Setup
Before running the notebook, set up the project environment with `bash scripts/setup_environment.sh` and install the dependencies listed in `requirements.txt`.

In [None]:
import json
import logging
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Make the project's src/ directory importable.
# This assumes the notebook runs from a directory one level below the project root.
sys.path.append(str(Path.cwd().parent / 'src'))

from data.dataset_creation import SyntheticDatasetCreator

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

## Configuration

Define the parameters for dataset creation, including the number of samples and output path.

In [None]:
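# Configuration
# Relative paths resolve against the current working directory.
# Adjust them if the notebook is not started from the project root.
NUM_SAMPLES = 1000
OUTPUT_PATH = 'data/synthetic/training_data.json'
CONFIG_PATH = 'config/model_config.yaml'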

# Initialize dataset creator
creator = SyntheticDatasetCreator(CONFIG_PATH)
logger.info(f'Initialized dataset creator with config: {CONFIG_PATH}')

## Generate Sample Business Descriptions

Let's generate a few sample business descriptions to verify the diversity and quality of the generated data.

In [None]:
# Generate 5 sample business descriptions
for i in range(5):
    desc = creator.generate_business_description()
    print(f'\nSample {i+1}:')
    print(f'Description: {desc.description}')
    print(f'Industry: {desc.industry}')
    print(f'Business Type: {desc.business_type}')
    print(f'Tone: {desc.tone}')
    print(f'Keywords: {desc.keywords[:5]}')

## Generate Full Dataset

Generate the complete dataset with the specified number of samples and save it to the output path.

In [None]:
# Generate dataset
logger.info(f'Generating dataset with {NUM_SAMPLES} samples...')
creator.create_training_dataset(num_samples=NUM_SAMPLES, output_path=OUTPUT_PATH)
logger.info(f'Dataset saved to {OUTPUT_PATH}')
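
Whether `create_training_dataset` creates missing output directories is not shown in this notebook. A minimal sketch, assuming it does not, is to create the parent directory before generating the dataset. `ensure_parent_dir` is a hypothetical helper, not part of the project code.

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it does not exist yet.

    Hypothetical helper for illustration; safe to call repeatedly.
    """
    parent = Path(file_path).parent
    parent.mkdir(parents=True, exist_ok=True)
    return parent
```

Calling `ensure_parent_dir(OUTPUT_PATH)` before `creator.create_training_dataset(...)` would avoid a missing-directory error under that assumption.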

## Dataset Validation

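Check the structure of the generated file: it must be a non-empty list, every example must have `input`, `output`, and `metadata` fields, and every output must contain a dot.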

In [None]:
def validate_dataset(dataset_path):
    """Run basic structural checks on the saved dataset file."""
    with open(dataset_path, 'r', encoding='utf-8') as f:
        dataset = json.load(f)

    # Check the container type before using it as a list
    assert isinstance(dataset, list), 'Dataset must be a list'
    assert len(dataset) > 0, 'Dataset is empty'

    logger.info(f'Validating dataset with {len(dataset)} examples')

    # Every example needs the three expected fields and a dot in its output
    for i, example in enumerate(dataset):
        assert 'input' in example, f'Example {i} missing input'
        assert 'output' in example, f'Example {i} missing output'
        assert 'metadata' in example, f'Example {i} missing metadata'
        assert '.' in example['output'], f'Example {i} output is not a valid domain'

    logger.info('Dataset validation passed')

validate_dataset(OUTPUT_PATH)
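
The `'.' in example['output']` check only confirms that a dot is present, so a string such as `hello world.` would pass. A stricter syntactic check could look like the sketch below. It assumes each `output` is a single ASCII domain string, which this notebook does not confirm, and `is_plausible_domain` is a hypothetical helper rather than part of the project.

```python
import re

# One label: 1-63 chars of letters, digits, or hyphens, with no leading or trailing hyphen.
_LABEL = r'[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?'
# One or more labels followed by an alphabetic TLD of at least two characters.
# Punycode TLDs (xn--...) are deliberately out of scope for this sketch.
_DOMAIN_RE = re.compile(rf'^(?:{_LABEL}\.)+[a-z]{{2,}}$')


def is_plausible_domain(value):
    """Return True if value looks like a syntactically valid domain name."""
    if not isinstance(value, str) or len(value) > 253:
        return False
    return _DOMAIN_RE.fullmatch(value.lower()) is not None
```

This only checks syntax. It says nothing about whether the domain is registered or available.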

## Dataset Analysis

Analyze the generated dataset to understand its distribution and characteristics.

In [None]:
# Load dataset
with open(OUTPUT_PATH, 'r') as f:
    dataset = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(dataset)
metadata_df = pd.json_normalize(df['metadata'])

# Plot industry distribution
plt.figure(figsize=(12, 6))
sns.countplot(data=metadata_df, x='industry', order=metadata_df['industry'].value_counts().index)
plt.xticks(rotation=45, ha='right')
plt.title('Industry Distribution')
plt.tight_layout()
plt.show()

# Plot TLD distribution
plt.figure(figsize=(8, 6))
sns.countplot(data=metadata_df, x='tld')
plt.title('TLD Distribution')
plt.tight_layout()
plt.show()

# Summary statistics
print('Dataset Summary:')
print(f'Total Examples: {len(dataset)}')
print(f'Unique Inputs: {df["input"].nunique()}')
print(f'Unique Outputs: {df["output"].nunique()}')
print(f'Average Input Length: {df["input"].str.len().mean():.1f} chars')
print(f'Average Output Length: {df["output"].str.len().mean():.1f} chars')
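
The plots show the distribution visually. To check the diversity objective numerically, one option is to compute each category's share and flag those below a threshold. The sketch below assumes records shaped like the ones validated earlier, with a `metadata` dict per example. The helper names and the `min_share` threshold are illustrative choices, and the sample records are hand-made rather than real dataset output.

```python
from collections import Counter


def category_shares(records, field):
    """Return {category: share of records} for one metadata field."""
    counts = Counter(r['metadata'].get(field, 'unknown') for r in records)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def underrepresented(shares, min_share=0.05):
    """List categories whose share falls below min_share, sorted by name."""
    return sorted(c for c, s in shares.items() if s < min_share)


# Tiny hand-made records for illustration only; not real dataset output.
sample = (
    [{'metadata': {'industry': 'technology'}}] * 12
    + [{'metadata': {'industry': 'healthcare'}}] * 7
    + [{'metadata': {'industry': 'agriculture'}}] * 1
)
shares = category_shares(sample, 'industry')
print(underrepresented(shares, min_share=0.10))
```

With the real data, `category_shares(dataset, 'industry')` would give the same numbers as the industry count plot, expressed as fractions of the dataset.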

## Conclusion

If the cells above run without assertion errors, the dataset has passed the basic structural checks and can be used for model training. The plots and summary statistics indicate how evenly the examples cover industries and TLDs, and how many inputs and outputs are unique. The next step is to train the baseline model on this dataset.