# SmolTalk Everyday Conversations Dataset

This notebook loads and explores the "everyday-conversations" configuration of the **HuggingFaceTB/smoltalk** dataset with the Hugging Face `datasets` library, rewrites the assistant responses in the style of Homer Simpson, and publishes the result to the Hugging Face Hub.

In [None]:
# !pip install "distilabel[hf-transformers,outlines,instructor]"

from distilabel.llms import TransformersLLM
from distilabel.steps.tasks import TextGeneration
from huggingface_hub import login

# Authenticate to Hugging Face (uncomment and pass your access token)
# login("")



In [2]:
from datasets import load_dataset

# Load the smoltalk dataset with everyday-conversations configuration
ds = load_dataset("HuggingFaceTB/smoltalk", "everyday-conversations", split="train")

print("Dataset loaded successfully!")
print(f"Number of examples: {len(ds)}")
print(f"Dataset features: {ds.features}")

Dataset loaded successfully!
Number of examples: 2260
Dataset features: {'full_topic': Value('string'), 'messages': List({'content': Value('string'), 'role': Value('string')})}


In [3]:
# Explore the dataset structure
print("Dataset info:")
print(ds)
print()

# Look at the first few examples
print("First example:")
print(ds[0])
print()

# Check the keys/columns in the dataset
print("Dataset columns:")
for key in ds.features.keys():
    print(f"  - {key}: {ds.features[key]}")

Dataset info:
Dataset({
    features: ['full_topic', 'messages'],
    num_rows: 2260
})

First example:
{'full_topic': 'Travel/Vacation destinations/Beach resorts', 'messages': [{'content': 'Hi there', 'role': 'user'}, {'content': 'Hello! How can I help you today?', 'role': 'assistant'}, {'content': "I'm looking for a beach resort for my next vacation. Can you recommend some popular ones?", 'role': 'user'}, {'content': "Some popular beach resorts include Maui in Hawaii, the Maldives, and the Bahamas. They're known for their beautiful beaches and crystal-clear waters.", 'role': 'assistant'}, {'content': 'That sounds great. Are there any resorts in the Caribbean that are good for families?', 'role': 'user'}, {'content': 'Yes, the Turks and Caicos Islands and Barbados are excellent choices for family-friendly resorts in the Caribbean. They offer a range of activities and amenities suitable for all ages.', 'role': 'assistant'}, {'content': "Okay, I'll look into those. Thanks for the recomm

# Create Instruction

In [4]:
# Configuration for consistent generation settings across all LLMs
GENERATION_CONFIG = {
    "max_new_tokens": 5000,  # Increase from default (usually 128)
}

# Other models that can be assigned to llm_model:
# HuggingFaceTB/SmolLM2-135M-Instruct
# HuggingFaceTB/SmolLM2-360M-Instruct
# HuggingFaceTB/SmolLM2-1.7B-Instruct
# The <think/> part can be removed!!
# HuggingFaceTB/SmolLM3-3B
# Qwen/Qwen2.5-1.5B-Instruct
# Qwen/Qwen3-4B-Instruct-2507
# Qwen/Qwen2.5-0.5B-Instruct
llm_model = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
llm = TransformersLLM(
    model=llm_model,
    generation_kwargs=GENERATION_CONFIG
)
gen = TextGeneration(llm=llm)
gen.load()

Step 'None' hasn't received a pipeline, and it hasn't been created within a `Pipeline` context. Please, use `with Pipeline() as pipeline:` and create the step within the context.


Device set to use cuda:0


## Create synthetic prompt

In [5]:
# Original basic prompt
# prompt_for_instruction_tune = "Generate a question about the Hugging Face Smol-Course on small AI models."
prompt_for_instruction_tune = """

Generate a list of Instructions to convert text into the style of a specific character called Homer Simpson from The Simpsons.
Only generate the instructions, do not add comments.
Make sure to include something like "convert the text into the style of Homer Simpson using the following guidelines...".


Speak in first person ("I", "me") with your typical tone: clumsy, lazy, sometimes clueless, but full of silly humor and occasional heartfelt wisdom.
Use short, simple sentences that often start or end with catchphrases like "D’oh!", "Woo-hoo!", or "Mmm… donuts".
You love: food (especially donuts, bacon, beer), TV, avoiding work, and spending time with your family even if you mess things up.
Make occasional funny asides to the listener, show impulsive decisions, and admit mistakes with self-deprecating charm.
Ignore deep technical language — think like an everyday guy who enjoys comfort over complexity.
Incorporate slapstick mishaps, misunderstandings, and childlike enthusiasm for simple pleasures.
Even when giving advice, keep it goofy but somehow oddly insightful in an accidental way.
Stay in character no matter what — respond how Homer would in ordinary life and unusual situations.
Ensure it sounds like it could be a direct quote from an episode script.

Key Homer Traits to Encode
Catchphrases → “D’oh!”, “Woo-hoo!”, “Mmm…” (food items)
Obsession with Food/Beer → Donuts as motivation for everything.
Lovable Fool Persona → Wrong logic but delivered confidently.
Family-Oriented in his Own Way → Loves Marge & the kids, sometimes expresses it clumsily.
Low Attention Span / Tangents → Goes off-topic mid-sentence.
Impulsive → Changes his mind in the middle of talking.
Comedy Timing → Words that set up comedic beats and misunderstandings.
Misinterpretation of Complex Topics → Turns serious things into something silly.
"""



# Generate a set of instructions.
# The LLM's output will serve as the prompt for *instruction tuning*.
result_prompt_for_instruction_tune = next(gen.process([{"instruction": prompt_for_instruction_tune}]))
print("Generated prompt:\n", result_prompt_for_instruction_tune[0]["generation"], "\n\n\n")



Generated prompt:
 Convert the text into the style of Homer Simpson using the following guidelines:

1. Start each sentence with a catchphrase: D'oh!, Woo-hoo!, Mmm... 
2. Use slapstick mishaps and misunderstandings to create humorous moments.
3. Admit mistakes with self-deprecating charm.
4. Show impulsive decisions and ignore deep technical language.
5. Incorporate food and drink references, especially donuts and beer.
6. Keep it goofy but somehow oddly insightful in an accidental way.
7. Stay in character no matter what – respond how Homer would in ordinary life and unusual situations.
8. Ensure it sounds like it could be a direct quote from an episode script. 





## Generate completion for synthetic prompt

In [6]:

# Generate a completion for the synthetic prompt.
# The instructions generated above are reused as the input instruction.
prompt_for_completion = result_prompt_for_instruction_tune[0]["generation"]
completion_result = next(gen.process([{"instruction": prompt_for_completion}]))

print("Synthetic prompt:\n", prompt_for_completion, "\n\n\n")
print("Generated completion:\n", completion_result[0]["generation"], "\n\n\n")
# Example output: The Smol-Course is a platform designed for learning computer science concepts.

# Cool! We have generated a synthetic prompt and a corresponding completion.

Synthetic prompt:
 Convert the text into the style of Homer Simpson using the following guidelines:

1. Start each sentence with a catchphrase: D'oh!, Woo-hoo!, Mmm... 
2. Use slapstick mishaps and misunderstandings to create humorous moments.
3. Admit mistakes with self-deprecating charm.
4. Show impulsive decisions and ignore deep technical language.
5. Incorporate food and drink references, especially donuts and beer.
6. Keep it goofy but somehow oddly insightful in an accidental way.
7. Stay in character no matter what – respond how Homer would in ordinary life and unusual situations.
8. Ensure it sounds like it could be a direct quote from an episode script. 



Generated completion:
 "D'oh! I've been so busy trying to figure out this new technology that I forgot to eat my lunch. Woo-hoo! Time for some donut action!" 





## Convert Dataset to Homer Style

Now we'll process each row of the dataset and convert all assistant responses to Homer Simpson's style using the generated instructions.
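
Before running the full conversion on a GPU, the role-filtering step can be checked with a stand-in converter. The sketch below is only an illustration under stated assumptions, not the pipeline used in this notebook: `restyle_assistant_turns` and `fake_homer` are hypothetical names, and `fake_homer` merely prefixes a catchphrase instead of calling a model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def restyle_assistant_turns(
    messages: List[Message], convert: Callable[[str], str]
) -> List[Message]:
    """Return a new message list in which only assistant turns pass through `convert`."""
    restyled = []
    for message in messages:
        if message.get("role") == "assistant":
            restyled.append(
                {"content": convert(message.get("content", "")), "role": "assistant"}
            )
        else:
            # User (and any other) turns are copied unchanged.
            restyled.append(dict(message))
    return restyled


def fake_homer(text: str) -> str:
    """Hypothetical stand-in for the LLM call: just prefixes a catchphrase."""
    return f"D'oh! {text}"


sample = [
    {"content": "Hi there", "role": "user"},
    {"content": "Hello! How can I help you today?", "role": "assistant"},
]
restyled_sample = restyle_assistant_turns(sample, fake_homer)
print(restyled_sample)
```

Passing the converter in as a plain callable keeps the role logic testable on its own; a real converter would wrap the model call instead.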

In [7]:
from typing import List, Dict, Any
import logging

# Set up logging for better progress tracking
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class HomerDatasetPipeline:
    """Pipeline for converting datasets to Homer Simpson style"""
    
    def __init__(self, generator, homer_instructions: str):
        self.generator = generator
        self.homer_instructions = homer_instructions
        self.processed_count = 0
        
    def convert_message_to_homer(self, original_content: str) -> str:
        """Convert a single message to Homer Simpson style"""
        conversion_prompt = f"{self.homer_instructions}\n\nConvert this text: \"{original_content}\""
        
        try:
            homer_result = next(self.generator.process([{"instruction": conversion_prompt}]))
            return homer_result[0]["generation"]
        except Exception as e:
            logger.error(f"Error converting message: {e}")
            return original_content  # Fall back to original
    
    def process_messages(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Process all messages in a conversation, converting assistant messages to Homer style"""
        homer_messages = []
        
        for message in messages:
            if message.get('role') == 'assistant':
                original_content = message.get('content', '')
                homer_content = self.convert_message_to_homer(original_content)
                
                homer_message = {
                    "content": homer_content,
                    "role": "assistant"
                }
                homer_messages.append(homer_message)
            else:
                # Keep user messages unchanged
                homer_messages.append(message.copy())
        
        return homer_messages
    
    def process_example(self, example: Dict[str, Any]) -> Dict[str, Any]:
        """Process a single dataset example"""
        messages = example.get('messages', [])
        homer_messages = self.process_messages(messages)
        
        # Create new example with Homer-style messages
        homer_example = example.copy()
        homer_example['messages'] = homer_messages
        
        self.processed_count += 1
        return homer_example
    
    def process_dataset(self, dataset, max_examples: int = None) -> List[Dict[str, Any]]:
        """Process the entire dataset with progress tracking"""
        logger.info("🍩 Converting dataset to Homer Simpson style...")
        logger.info("=" * 60)
        
        # Determine number of examples to process
        total_examples = len(dataset) if max_examples is None else min(max_examples, len(dataset))
        logger.info(f"Processing {total_examples} examples...")
        
        homer_dataset = []
        
        # Access dataset examples properly using select() method
        selected_dataset = dataset.select(range(total_examples))
        
        for idx in range(total_examples):
            if idx % 50 == 0 or idx == total_examples - 1:  # More frequent progress updates for larger datasets
                logger.info(f"Processing example {idx + 1}/{total_examples} ({(idx+1)/total_examples*100:.1f}%)")
            
            # Get the example at index idx
            example = selected_dataset[idx]
            homer_example = self.process_example(example)
            homer_dataset.append(homer_example)
        
        logger.info(f"🎉 Completed! Converted {len(homer_dataset)} examples to Homer style")
        return homer_dataset
    
    def show_comparison(self, original_dataset, homer_dataset):
        """Show a comparison between original and Homer-converted examples"""
        print("\n📊 Example Conversion:")
        print("=" * 60)
        
        if len(original_dataset) > 0 and len(homer_dataset) > 0:
            # Find first assistant messages for comparison
            original_assistant_msg = self._find_assistant_message(original_dataset[0])
            homer_assistant_msg = self._find_assistant_message(homer_dataset[0])
            
            if original_assistant_msg and homer_assistant_msg:
                print("🤖 Original Assistant:")
                print(f'"{original_assistant_msg}"')
                print()
                print("🍩 Homer Simpson Version:")
                print(f'"{homer_assistant_msg}"')
                print()
    
    def print_sample_dataset(self, homer_dataset: List[Dict[str, Any]], num_examples: int = 3):
        """Print a sample of the converted dataset (to avoid overwhelming output)"""
        print("\n" + "=" * 80)
        print(f"🍩 SAMPLE HOMER SIMPSON DATASET (showing {min(num_examples, len(homer_dataset))} examples) 🍩")
        print("=" * 80)
        
        for idx in range(min(num_examples, len(homer_dataset))):
            example = homer_dataset[idx]
            print(f"\n📚 EXAMPLE {idx + 1}")
            print("-" * 40)
            print(f"Topic: {example.get('full_topic', 'N/A')}")
            print("\n💬 CONVERSATION:")
            
            for msg_idx, message in enumerate(example.get('messages', [])):
                role = message.get('role', 'unknown')
                content = message.get('content', '')
                
                # Use different emojis for different roles
                role_emoji = "👤" if role == "user" else "🍩" if role == "assistant" else "❓"
                
                print(f"\n  {role_emoji} {role.upper()}:")
                # Limit content display for readability
                display_content = content if len(content) <= 200 else content[:200] + "..."
                print(f"  {display_content}")
            
            print("\n" + "-" * 40)
        
        print(f"\n✅ Total examples processed: {len(homer_dataset)}")
        print("🎉 All done! D'oh-n't you love it? Woo-hoo!")
        print("=" * 80)
    
    def _find_assistant_message(self, example: Dict[str, Any]) -> str:
        """Helper to find the first assistant message in an example"""
        for msg in example.get('messages', []):
            if msg.get('role') == 'assistant':
                return msg.get('content', '')
        return None

# Initialize the pipeline
print("Initializing Homer Dataset Pipeline...")
pipeline = HomerDatasetPipeline(gen, prompt_for_completion)

# Process the entire dataset (pass max_examples=N to process only the first N examples)
print(f"📊 Dataset info: {len(ds)} total examples available")
print("🚀 Processing entire dataset - this may take a while...")

homer_dataset = pipeline.process_dataset(ds)  # Process ALL examples

# Show comparison
pipeline.show_comparison(ds, homer_dataset)

# Print a sample of the converted dataset (instead of the entire thing to avoid overwhelming output)
pipeline.print_sample_dataset(homer_dataset, num_examples=3)

print("✅ Homer dataset pipeline completed and ready for use!")
print(f"📈 Processed {len(homer_dataset)} examples total")

2025-09-24 02:15:12,438 - INFO - 🍩 Converting dataset to Homer Simpson style...
2025-09-24 02:15:12,439 - INFO - Processing 2260 examples...
2025-09-24 02:15:12,440 - INFO - Processing example 1/2260 (0.0%)


Initializing Homer Dataset Pipeline...
📊 Dataset info: 2260 total examples available
🚀 Processing entire dataset - this may take a while...


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
2025-09-24 02:17:34,335 - INFO - Processing example 51/2260 (2.3%)
2025-09-24 02:19:53,659 - INFO - Processing example 101/2260 (4.5%)
2025-09-24 02:22:17,083 - INFO - Processing example 151/2260 (6.7%)
2025-09-24 02:24:34,515 - INFO - Processing example 201/2260 (8.9%)
2025-09-24 02:27:49,849 - INFO - Processing example 251/2260 (11.1%)
2025-09-24 02:30:35,334 - INFO - Processing example 301/2260 (13.3%)
2025-09-24 02:33:09,226 - INFO - Processing example 351/2260 (15.5%)
2025


📊 Example Conversion:
🤖 Original Assistant:
"Hello! How can I help you today?"

🍩 Homer Simpson Version:
""D'oh! Hello! How can I help you today? Woo-hoo! Don't mind me, just trying to make sense of all these wires and circuits. Mmm... I'm not sure if I'm on the right track here. Oh well, at least I didn't burn down the lab again. And speaking of which, have you tried that new donut place downtown? It's pretty good.""


🍩 SAMPLE HOMER SIMPSON DATASET (showing 3 examples) 🍩

📚 EXAMPLE 1
----------------------------------------
Topic: Travel/Vacation destinations/Beach resorts

💬 CONVERSATION:

  👤 USER:
  Hi there

  🍩 ASSISTANT:
  "D'oh! Hello! How can I help you today? Woo-hoo! Don't mind me, just trying to make sense of all these wires and circuits. Mmm... I'm not sure if I'm on the right track here. Oh well, at least I didn'...

  👤 USER:
  I'm looking for a beach resort for my next vacation. Can you recommend some popular ones?

  🍩 ASSISTANT:
  "D'oh! Beach resorts, you say? Well

## Publish Homer Dataset to Hugging Face Hub

Now we'll publish the converted Homer Simpson dataset to Hugging Face Hub with the same structure as the original smoltalk dataset.
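
Since the published dataset is meant to mirror the original structure, a per-example check before uploading can catch malformed rows that a first-row comparison would miss. The following is a rough sketch under assumptions: `find_structure_problems` is a hypothetical helper (not part of the `datasets` library), and the expected keys are taken from the dataset features printed earlier in this notebook.

```python
from typing import Any, Dict, List

# Column and message keys as shown in the dataset features printed earlier.
EXPECTED_KEYS = {"full_topic", "messages"}
EXPECTED_MESSAGE_KEYS = {"content", "role"}


def find_structure_problems(examples: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable descriptions of examples that deviate from the expected layout."""
    problems = []
    for idx, example in enumerate(examples):
        if set(example.keys()) != EXPECTED_KEYS:
            problems.append(f"example {idx}: unexpected keys {sorted(example.keys())}")
            continue
        for msg_idx, message in enumerate(example["messages"]):
            if set(message.keys()) != EXPECTED_MESSAGE_KEYS:
                problems.append(
                    f"example {idx}, message {msg_idx}: unexpected keys {sorted(message.keys())}"
                )
            elif not isinstance(message["content"], str) or not message["content"].strip():
                problems.append(
                    f"example {idx}, message {msg_idx}: empty or non-string content"
                )
    return problems


# Made-up rows for illustration only: one valid, one with blank content, one missing a column.
sample_examples = [
    {"full_topic": "Travel", "messages": [{"content": "Hi there", "role": "user"}]},
    {"full_topic": "Travel", "messages": [{"content": "  ", "role": "assistant"}]},
    {"messages": []},
]
problems = find_structure_problems(sample_examples)
for problem in problems:
    print(problem)
```

An empty list from `find_structure_problems(homer_dataset)` would suggest every row has the expected keys; it says nothing about the quality of the generated text.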

In [None]:
from datasets import Dataset

def publish_homer_dataset(homer_dataset: List[Dict[str, Any]], 
                         repo_name: str = "homer-simpson-smoltalk-everyday-conversations", 
                         config_name: str = "homer-conversations"):
    """
    Convert homer_dataset to Hugging Face Dataset format and publish it
    """
    
    print("🍩 Preparing Homer dataset for Hugging Face Hub...")
    print("=" * 60)
    
    # Convert the list of dictionaries to a Dataset
    # Make sure we have the same structure as the original smoltalk dataset
    hf_dataset = Dataset.from_list(homer_dataset)
    
    print(f"✅ Dataset created with {len(hf_dataset)} examples")
    print(f"📋 Dataset features: {list(hf_dataset.features.keys())}")
    
    # Show a sample to verify structure
    print("\n📊 Sample data structure:")
    print(f"Keys: {list(hf_dataset[0].keys())}")
    if 'messages' in hf_dataset[0]:
        print(f"First conversation has {len(hf_dataset[0]['messages'])} messages")
        print(f"Message roles: {[msg['role'] for msg in hf_dataset[0]['messages']]}")
    
    # Push to hub
    try:
        print(f"\n🚀 Pushing dataset to Hub as '{repo_name}'...")
        
        # Push the dataset to the hub with the specified configuration name
        hf_dataset.push_to_hub(
            repo_id=repo_name,
            config_name=config_name,
            commit_message="Add Homer Simpson style conversations dataset",
            private=False  # Set to True if you want a private dataset
        )
        
        print(f"🎉 Success! Dataset published to: https://huggingface.co/datasets/{repo_name}")
        print(f"📚 Configuration name: {config_name}")
        print(f"💡 Load with: load_dataset('{repo_name}', '{config_name}', split='train')")
        
        return True
        
    except Exception as e:
        print(f"❌ Error publishing dataset: {e}")
        print("💡 Make sure you're authenticated with Hugging Face Hub and have write permissions")
        return False

def verify_dataset_structure(homer_dataset: List[Dict[str, Any]], original_dataset):
    """
    Verify that our homer dataset has the same structure as the original
    """
    print("\n🔍 Verifying dataset structure compatibility...")
    print("=" * 50)
    
    # Check original dataset structure
    orig_features = original_dataset.features
    print(f"Original dataset features: {list(orig_features.keys())}")
    
    if homer_dataset:
        # Check homer dataset structure
        homer_keys = list(homer_dataset[0].keys())
        print(f"Homer dataset keys: {homer_keys}")
        
        # Compare structures
        orig_keys = list(orig_features.keys())
        missing_keys = set(orig_keys) - set(homer_keys)
        extra_keys = set(homer_keys) - set(orig_keys)
        
        if missing_keys:
            print(f"⚠️  Missing keys: {missing_keys}")
        if extra_keys:
            print(f"ℹ️  Extra keys: {extra_keys}")
        if not missing_keys and not extra_keys:
            print("✅ Structure matches perfectly!")
        
        # Check messages structure if present
        if 'messages' in homer_dataset[0] and 'messages' in orig_features:
            orig_msg = original_dataset[0]['messages'][0] if original_dataset[0]['messages'] else {}
            homer_msg = homer_dataset[0]['messages'][0] if homer_dataset[0]['messages'] else {}
            
            print("\nMessage structure comparison:")
            print(f"  Original message keys: {list(orig_msg.keys()) if orig_msg else 'N/A'}")
            print(f"  Homer message keys: {list(homer_msg.keys()) if homer_msg else 'N/A'}")

# First, verify our dataset structure matches the original
verify_dataset_structure(homer_dataset, ds)

# Publish the dataset
success = publish_homer_dataset(
    homer_dataset, 
    repo_name="homer-simpson-smoltalk-everyday-conversations",  # Change this to your desired repository name; a bare name is created under the logged-in user's namespace
    config_name="homer-conversations"     # This matches the config pattern from smoltalk
)

if success:
    print("\n🎊 All done! Your Homer Simpson dataset is now available on Hugging Face Hub!")
    print("D'oh! I mean... Woo-hoo! 🍩")
else:
    print("\n😞 Publishing failed. Please check the error messages above.")


🔍 Verifying dataset structure compatibility...
Original dataset features: ['full_topic', 'messages']
Homer dataset keys: ['full_topic', 'messages']
✅ Structure matches perfectly!

Message structure comparison:
  Original message keys: ['content', 'role']
  Homer message keys: ['content', 'role']
🍩 Preparing Homer dataset for Hugging Face Hub...
✅ Dataset created with 2260 examples
📋 Dataset features: ['full_topic', 'messages']

📊 Sample data structure:
Keys: ['full_topic', 'messages']
First conversation has 8 messages
Message roles: ['user', 'assistant', 'user', 'assistant', 'user', 'assistant', 'user', 'assistant']

🚀 Pushing dataset to Hub as 'homer-simpson-smoltalk-everyday-conversations'...


Creating parquet from Arrow format: 100%|██████████| 3/3 [00:00<00:00, 328.61ba/s]

Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  2.92 shards/s]

No files have been modified since last commit. Skipping to prevent empty commit.


🎉 Success! Dataset published to: https://huggingface.co/datasets/homer-simpson-smoltalk-everyday-conversations
📚 Configuration name: homer-conversations
💡 Load with: load_dataset('homer-simpson-smoltalk-everyday-conversations', 'homer-conversations', split='train')

🎊 All done! Your Homer Simpson dataset is now available on Hugging Face Hub!
D'oh! I mean... Woo-hoo! 🍩

