# 📓 The GenAI Revolution Cookbook

**Title:** Mastering Domain-Specific LLM Customization: Techniques and Tools Unveiled

**Description:** Discover how to tailor Large Language Models for specific domains using Retrieval-Augmented Generation, fine-tuning, and prompt engineering to boost relevance and accuracy.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## Introduction
When I first started working with Generative AI, I quickly realized that getting a language model to truly understand your specific domain isn't just about throwing data at it. It's about carefully crafting an approach that combines the right tools, the right data, and honestly, a lot of patience. This tutorial walks through customizing a large language model using Hugging Face Transformers, LangChain, and ChromaDB - tools I've come to rely on after countless experiments and more than a few failed attempts.

The thing is, building domain-specific models isn't just technically challenging; it requires thinking about your actual use case from day one. Are you building for healthcare? Finance? Customer service? Each domain has its quirks, and I've learned (sometimes the hard way) that what works for one rarely translates directly to another.

## Setup & Installation
Let's start with the basics. Setting up your environment properly will save you hours of debugging later - trust me on this one. We need three main components: Hugging Face Transformers for the heavy lifting, LangChain to make our model actually useful in applications, and ChromaDB for when you need to retrieve information quickly.

In [None]:
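# Install required libraries for LLM customization
# transformers: Hugging Face library for pre-trained models and fine-tuning
# datasets, torch, accelerate: needed by the Trainer-based fine-tuning cells below
# scikit-learn, pandas: evaluation metrics and data preprocessing
# langchain: Framework for building LLM-powered applications with RAG capabilities
# chromadb: Vector database for efficient similarity search and retrieval

!pip install transformers datasets torch accelerate
!pip install scikit-learn pandas
!pip install langchain
!pip install chromadb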
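
Before moving on, it can help to confirm that the environment actually sees these packages. The snippet below is a minimal sketch using only the standard library; the `REQUIRED_MODULES` list and the `missing_packages` helper are my own illustrative names, not part of any of the libraries above.

```python
from importlib.util import find_spec

# Import names used by the cells in this notebook (an assumed list; adjust it
# to your own setup). Note that scikit-learn is imported as "sklearn".
REQUIRED_MODULES = [
    "transformers", "datasets", "torch", "sklearn",
    "pandas", "langchain", "chromadb",
]

def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]

missing = missing_packages(REQUIRED_MODULES)
if missing:
    print(f"Missing modules: {', '.join(missing)}")
else:
    print("All required modules are importable.")
```

`find_spec` only looks the module up, so the check stays fast even for heavy packages like `torch`.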

## Data Collection and Preparation
Here's where things get interesting. I've seen too many projects fail because people underestimate how crucial data preparation is. You can have the most sophisticated model architecture, but if your data is messy, your results will be too.

Actually, let me be more specific about this. When I was working on a customer service chatbot last year, we had what seemed like great data - thousands of support tickets. But when we actually looked closely, half of them were duplicates, and another quarter had missing context. The model we trained on that initial dataset was, predictably, terrible.
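
A quick audit like the one below would have caught those problems before any training run. It is a minimal sketch in plain Python so it runs without the real dataset; the `audit_text_records` helper and the sample tickets are made up for illustration.

```python
from collections import Counter

def audit_text_records(records, text_key="text"):
    """Count missing and duplicated texts in a list of dict records."""
    texts = [record.get(text_key) for record in records]
    # Treat None and whitespace-only strings as missing; normalize the rest
    present = [t.strip().lower() for t in texts if isinstance(t, str) and t.strip()]
    counts = Counter(present)
    # Every occurrence beyond the first counts as a duplicate
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    return {
        "total": len(records),
        "missing": len(texts) - len(present),
        "duplicates": duplicates,
        "usable": len(counts),
    }

# Hypothetical support tickets
tickets = [
    {"text": "Cannot reset my password"},
    {"text": "cannot reset my password "},  # duplicate after normalization
    {"text": None},                          # missing text
    {"text": "Billing page shows an error"},
    {"text": "   "},                         # whitespace only
]
report = audit_text_records(tickets)
print(report)  # {'total': 5, 'missing': 2, 'duplicates': 1, 'usable': 2}
```

If the `usable` count is far below `total`, that is a signal to fix the data before spending any compute on training.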

In [None]:
import pandas as pd
import logging

# Configure logging to track data preprocessing steps
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def load_and_preprocess_data(file_path, text_column='text'):
    """
    Load and preprocess domain-specific dataset for LLM training.
    
    Args:
        file_path (str): Path to the CSV file containing domain-specific data
        text_column (str): Name of the column containing text data (default: 'text')
    
    Returns:
        pd.DataFrame: Preprocessed dataframe with cleaned text
    
    Raises:
        FileNotFoundError: If the specified file doesn't exist
        KeyError: If the text column is not found in the dataset
    """
    try:
        # Load the domain-specific dataset from CSV
        logging.info(f"Loading dataset from {file_path}")
        data = pd.read_csv(file_path)
        
        # Validate that the text column exists
        if text_column not in data.columns:
            raise KeyError(f"Column '{text_column}' not found in dataset. Available columns: {data.columns.tolist()}")
        
        # Remove rows with missing text values to ensure data quality
        initial_rows = len(data)
        data = data.dropna(subset=[text_column])
        logging.info(f"Removed {initial_rows - len(data)} rows with missing text values")
        
        # Normalize text: convert to lowercase and remove leading/trailing whitespace
        # This ensures consistency in text representation for better model training
        # astype(str) guards against non-string values such as numbers
        data[text_column] = data[text_column].astype(str).str.lower().str.strip()
        
        # Remove duplicate entries to prevent model bias toward repeated examples
        data = data.drop_duplicates(subset=[text_column])
        logging.info(f"Final dataset size: {len(data)} rows")
        
        return data
    
    except FileNotFoundError:
        logging.error(f"File not found: {file_path}")
        raise
    except Exception as e:
        logging.error(f"Error during data preprocessing: {str(e)}")
        raise

# Load and preprocess the domain-specific dataset
data = load_and_preprocess_data('domain_specific_data.csv')

## Model Training and Fine-Tuning
Now we get to the meat of it. Training a model is where you'll spend most of your compute resources and, honestly, most of your time waiting. But there are ways to be smart about it.

One thing I've learned: don't immediately jump to the largest model you can find. Start with something like BERT-base, get your pipeline working, then scale up if needed. I once spent three days training a massive model only to realize I had a bug in my data preprocessing. Starting smaller would have saved me 2.5 days of compute time.
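
One way to apply the same idea to the data is to run the whole pipeline on a small, balanced subset first. The helper below is a sketch under my own assumptions: `smoke_test_subset` is a hypothetical name, and it expects each example to be a dict with a `label` key.

```python
import random
from collections import defaultdict

def smoke_test_subset(examples, per_label=50, seed=0):
    """Pick up to `per_label` examples for each label, for a quick end-to-end run."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example["label"]].append(example)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        # Take the whole group when a label has fewer than `per_label` examples
        subset.extend(rng.sample(group, min(per_label, len(group))))
    return subset

# Hypothetical dataset: 1000 tickets with alternating binary labels
examples = [{"text": f"ticket {i}", "label": i % 2} for i in range(1000)]
subset = smoke_test_subset(examples, per_label=50)
print(len(subset))  # 100
```

If training, evaluation, and saving all succeed on the subset, a preprocessing bug is much less likely to surface three days into the full run.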

In [None]:
from transformers import (
    AutoModelForSequenceClassification, 
    AutoTokenizer,
    Trainer, 
    TrainingArguments,
    DataCollatorWithPadding
)
from datasets import Dataset
import torch
import logging

# Configure logging for training process
logging.basicConfig(level=logging.INFO)

def prepare_dataset_for_training(data, text_column='text', label_column='label', max_length=512):
    """
    Tokenize and prepare dataset for model training.
    
    Args:
        data (pd.DataFrame): Preprocessed dataframe with text and labels
        text_column (str): Name of the text column
        label_column (str): Name of the label column
        max_length (int): Maximum sequence length for tokenization (default: 512)
    
    Returns:
        tuple: (tokenized Dataset ready for training, tokenizer used to create it)
    """
    # Initialize tokenizer for the pre-trained model
    # AutoTokenizer automatically selects the correct tokenizer for the model
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    
    def tokenize_function(examples):
        """
        Tokenize text examples with padding and truncation.
        
        Args:
            examples (dict): Batch of examples containing text
        
        Returns:
            dict: Tokenized examples with input_ids, attention_mask, etc.
        """
        # Tokenize with truncation to handle long sequences
        # Padding is left to DataCollatorWithPadding at training time, which pads
        # each batch only to the length of its longest sequence
        return tokenizer(
            examples[text_column],
            truncation=True,  # Truncate sequences longer than max_length
            max_length=max_length
        )
    
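    # Convert pandas DataFrame to Hugging Face Dataset format
    dataset = Dataset.from_pandas(data, preserve_index=False)

    # Trainer looks for a 'label' or 'labels' column, so rename any other label column
    if label_column not in ('label', 'labels'):
        dataset = dataset.rename_column(label_column, 'labels')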
    
    # Apply tokenization to the entire dataset
    # batched=True processes multiple examples at once for efficiency
    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    
    return tokenized_dataset, tokenizer

def train_domain_specific_model(tokenized_dataset, num_labels=2, output_dir='./results'):
    """
    Fine-tune a pre-trained model on domain-specific data.
    
    Args:
        tokenized_dataset (Dataset): Tokenized training dataset
        num_labels (int): Number of classification labels (default: 2 for binary)
        output_dir (str): Directory to save model checkpoints and results
    
    Returns:
        Trainer: Trained model trainer object
    """
    # Load pre-trained BERT model for sequence classification
    # num_labels specifies the number of output classes for the task
    model = AutoModelForSequenceClassification.from_pretrained(
        'bert-base-uncased',
        num_labels=num_labels
    )
    
    # Hold out 20% of the examples: per-epoch evaluation and
    # load_best_model_at_end both need an evaluation dataset
    split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)

    def accuracy_metric(eval_pred):
        """Return accuracy so that metric_for_best_model='accuracy' can be resolved."""
        logits, labels = eval_pred
        return {'accuracy': float((logits.argmax(-1) == labels).mean())}

    # Define training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,  # Directory for saving checkpoints
        num_train_epochs=3,  # Number of complete passes through the dataset
        per_device_train_batch_size=8,  # Batch size per GPU/CPU (adjust based on memory)
        per_device_eval_batch_size=16,  # Larger batch size for evaluation (no gradients)
        warmup_steps=500,  # Gradual learning rate increase (reduce for small datasets)
        weight_decay=0.01,  # Weight decay regularization to limit overfitting
        logging_dir='./logs',  # Directory for TensorBoard logs
        logging_steps=100,  # Log metrics every 100 steps
        evaluation_strategy='epoch',  # Evaluate each epoch (renamed to eval_strategy in newer transformers releases)
        save_strategy='epoch',  # Save checkpoint at the end of each epoch
        load_best_model_at_end=True,  # Load the best model based on evaluation metric
        metric_for_best_model='accuracy',  # Metric to determine the best model
        save_total_limit=2,  # Limit how many checkpoints are kept on disk
        fp16=torch.cuda.is_available(),  # Use mixed precision training if GPU available
    )
    
    # Initialize data collator for dynamic padding
    # This pads batches to the length of the longest sequence in each batch
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    
    # Initialize Trainer with model, arguments, datasets, and metric function
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=split['train'],
        eval_dataset=split['test'],
        data_collator=data_collator,
        compute_metrics=accuracy_metric,
    )
    
    # Start the training process
    logging.info("Starting model training...")
    trainer.train()
    logging.info("Training completed successfully")
    
    # Save the final model and tokenizer
    model.save_pretrained(f"{output_dir}/final_model")
    tokenizer.save_pretrained(f"{output_dir}/final_model")
    
    return trainer

# Prepare dataset and train the model
tokenized_data, tokenizer = prepare_dataset_for_training(data)
trainer = train_domain_specific_model(tokenized_data)

## Evaluation and Optimization
Here's the thing about model evaluation - the metrics that look good on paper don't always translate to real-world performance. I learned this the hard way when a model with 95% accuracy was actually terrible in production because it was just really good at predicting the majority class.

Let me put that more concretely. When you're looking at your evaluation metrics, you need to think about which kind of error matters for your specific use case. Is a false negative worse than a false positive? In medical screening, usually yes: a missed condition tends to cost far more than a false alarm. In content recommendation? Maybe not so much.
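
To see how a high accuracy can hide a useless model, here is a small sketch in plain Python. The labels are made up (95 negatives, 5 positives) and `binary_report` is a hypothetical helper, not a library function; the "model" simply predicts the majority class every time.

```python
def binary_report(labels, preds, positive=1):
    """Compute accuracy, precision, recall, and error counts for one positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when a class is never predicted or present
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }

labels = [0] * 95 + [1] * 5   # imbalanced ground truth
preds = [0] * 100             # always predict the majority class
report = binary_report(labels, preds)
print(report)
# {'accuracy': 0.95, 'precision': 0.0, 'recall': 0.0, 'false_positives': 0, 'false_negatives': 5}
```

The accuracy looks excellent, yet every positive case is missed. That is why the metrics in the next cell include precision, recall, and F1 alongside accuracy.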

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import numpy as np
import logging

def compute_metrics(eval_pred):
    """
    Compute comprehensive evaluation metrics for model performance.
    
    Args:
        eval_pred (tuple): Tuple containing predictions and labels
            - predictions (np.ndarray): Model predictions (logits)
            - labels (np.ndarray): Ground truth labels
    
    Returns:
        dict: Dictionary containing accuracy, precision, recall, and F1 score
    """
    # Unpack predictions and labels from the evaluation prediction object
    predictions, labels = eval_pred
    
    # Convert logits to predicted class labels by taking argmax
    # argmax(-1) finds the index of the maximum value along the last dimension
    preds = predictions.argmax(-1)
    
    # Calculate precision, recall, and F1 score with macro averaging
    # macro averaging treats all classes equally, useful for imbalanced datasets
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, 
        preds, 
        average='macro',  # Compute metric for each label and find unweighted mean
        zero_division=0  # Return 0 instead of undefined for zero division cases
    )
    
    # Calculate overall accuracy
    acc = accuracy_score(labels, preds)
    
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

def evaluate_model(trainer, eval_dataset, output_detailed_report=True):
    """
    Evaluate the trained model on a validation/test dataset.
    
    Args:
        trainer (Trainer): Trained model trainer object
        eval_dataset (Dataset): Tokenized evaluation dataset
        output_detailed_report (bool): Whether to print detailed classification report
    
    Returns:
        dict: Evaluation metrics including accuracy, precision, recall, F1
    """
    try:
        logging.info("Starting model evaluation...")
        
        # Run inference once; predict() returns the logits, the label ids,
        # and a metrics dict that includes the evaluation loss
        predictions = trainer.predict(eval_dataset, metric_key_prefix="eval")
        labels = predictions.label_ids
        
        # Apply compute_metrics directly so the results do not depend on
        # whether the Trainer was constructed with a compute_metrics function
        eval_results = dict(predictions.metrics)
        eval_results.update({
            f"eval_{name}": value
            for name, value in compute_metrics((predictions.predictions, labels)).items()
        })
        
        # Log evaluation results
        logging.info(f"Evaluation Results: {eval_results}")
        
        # Generate detailed classification report if requested
        if output_detailed_report:
            # Reuse the predictions computed above
            preds = predictions.predictions.argmax(-1)
            
            # Generate and print classification report
            # This shows per-class precision, recall, and F1 scores
            report = classification_report(labels, preds)
            logging.info(f"\nDetailed Classification Report:\n{report}")
        
        return eval_results
    
    except Exception as e:
        logging.error(f"Error during model evaluation: {str(e)}")
        raise

def optimize_hyperparameters(data, param_grid):
    """
    Perform hyperparameter optimization using grid search.
    
    Args:
        data (Dataset): Training dataset
        param_grid (dict): Dictionary of hyperparameters to search
            Example: {'learning_rate': [1e-5, 2e-5], 'batch_size': [8, 16]}
    
    Returns:
        dict: Best hyperparameters found during search
    """
    best_score = 0
    best_params = {}
    
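    # Hold out part of the data so each configuration is scored on examples
    # it was not trained on
    split = data.train_test_split(test_size=0.2, seed=42)
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    
    # Iterate through all combinations of hyperparameters
    for learning_rate in param_grid.get('learning_rate', [2e-5]):
        for batch_size in param_grid.get('batch_size', [8]):
            logging.info(f"Testing: lr={learning_rate}, batch_size={batch_size}")
            
            # Create training arguments with current hyperparameters
            training_args = TrainingArguments(
                output_dir='./hp_search',
                learning_rate=learning_rate,
                per_device_train_batch_size=batch_size,
                num_train_epochs=2,  # Fewer epochs for faster search
                evaluation_strategy='epoch'  # Renamed to eval_strategy in newer transformers releases
            )
            
            # Train a fresh model (default: 2 labels) with current hyperparameters
            model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
            trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=split['train'],
                eval_dataset=split['test'],
                data_collator=data_collator,
                compute_metrics=compute_metrics,  # Provides the eval_accuracy key used below
            )
            trainer.train()
            
            # Evaluate on the held-out split and track best performance
            results = trainer.evaluate()
            if results['eval_accuracy'] > best_score:
                best_score = results['eval_accuracy']
                best_params = {'learning_rate': learning_rate, 'batch_size': batch_size}
    
    logging.info(f"Best hyperparameters: {best_params} with accuracy: {best_score}")
    return best_params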

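# Evaluate the trained model with comprehensive metrics
# Note: tokenized_data includes the examples the model was trained on, so treat
# these scores as an optimistic sanity check; pass a held-out dataset for a real estimate
eval_results = evaluate_model(trainer, tokenized_data, output_detailed_report=True)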

## Incorporating Human-in-the-Loop Feedback
And this is where things get really interesting. No matter how good your model is initially, it's going to make mistakes. The question is: how do you learn from those mistakes systematically?

I've found that the best approach is to build in feedback mechanisms from day one. Don't wait until your model is in production to think about this. The code below shows a basic framework, but honestly, in real applications, you'll want something more sophisticated - maybe a web interface where domain experts can quickly correct misclassifications.

In [None]:
import json
from datetime import datetime
import logging

class HumanInTheLoopFeedback:
    """
    Manage human-in-the-loop feedback for continuous model improvement.
    
    This class handles feedback collection, storage, and model retraining
    based on expert corrections and annotations.
    """
    
    def __init__(self, model, tokenizer, feedback_file='feedback_log.json'):
        """
        Initialize the feedback system