# Task 0: The Library of Babel - Dataset Creation
## The Ghost in the Machine - NLP Task

**Author:** [Your Name]  
**Date:** February 3, 2026  
**Task:** Create a dataset with 3 classes - Human, AI Vanilla, AI Styled

---

## Objective
Build a dataset where authorship (human vs AI), not topic, is the primary variable.

**Classes:**
1. **Human** - Paragraphs from classic literature (Dickens, Austen)
2. **AI Vanilla** - AI-generated paragraphs on same topics
3. **AI Styled** - AI-generated paragraphs mimicking author's style

**Target:** 500 paragraphs per class (1500 total)

## 1. Setup and Imports

In [8]:
# Install required packages (run once)
!pip install requests beautifulsoup4 nltk spacy
!pip install google-generativeai pandas numpy
!pip install matplotlib seaborn tqdm
!pip install python-dotenv scikit-learn

# Download spaCy model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[?25l     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m0.0/12.8 MB[0m [31m?[0m eta [36m-:--:--[0m  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m12.8/12.8 MB[0m [31m4.6 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m12.8/12.8 MB[0m [31m4.6 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m‚úî Download and insta

In [9]:
# Imports
import requests
import re
import json
import time
import random
from pathlib import Path
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

import google.generativeai as genai
from sklearn.model_selection import train_test_split

# Setup
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print("‚úì All imports successful!")

‚úì All imports successful!


In [None]:
# Create directory structure
directories = [
    'data/raw',
    'data/processed',
    'data/dataset',
    'results/visualizations'
]

for dir_path in directories:
    Path(dir_path).mkdir(parents=True, exist_ok=True)
    print(f"‚úì Created: {dir_path}")

In [None]:
# Configure Gemini API
# IMPORTANT: Get your API key from https://makersuite.google.com/app/apikey

GEMINI_API_KEY = 'YOUR_API_KEY_HERE'  # Replace with your actual API key

genai.configure(api_key=GEMINI_API_KEY)
model = genai.GenerativeModel('gemini-pro')

print("‚úì Gemini API configured")

## 2. Configuration & Author Selection

**Decision: Charles Dickens + Jane Austen**

**Justification:**
- Both have distinctive, well-documented writing styles
- Different enough to test model robustness
- Available on Project Gutenberg
- Well-suited for style mimicry experiments

In [None]:
# Configuration
CONFIG = {
    'authors': {
        'dickens': {
            'name': 'Charles Dickens',
            'book': 'Great Expectations',
            'gutenberg_id': 1400,
            'target_paragraphs': 250
        },
        'austen': {
            'name': 'Jane Austen',
            'book': 'Pride and Prejudice',
            'gutenberg_id': 1342,
            'target_paragraphs': 250
        }
    },
    'paragraph_length': {
        'min_words': 100,
        'max_words': 200
    },
    'dataset_size': {
        'class1_human': 500,
        'class2_ai_vanilla': 500,
        'class3_ai_styled': 500
    },
    'api_settings': {
        'rate_limit_delay': 1,  # seconds between API calls
        'checkpoint_interval': 50  # save every N samples
    }
}

print("Configuration:")
print(json.dumps(CONFIG, indent=2))

## 3. Helper Functions - Data Collection

In [10]:
def download_gutenberg_book(book_id, save_path):
    """
    Download book from Project Gutenberg
    
    Args:
        book_id: Gutenberg book ID number
        save_path: Where to save the downloaded text
    
    Returns:
        bool: True if successful, False otherwise
    """
    urls = [
        f"https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt",
        f"https://www.gutenberg.org/files/{book_id}/{book_id}.txt",
        f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
    ]
    
    for url in urls:
        try:
            print(f"Trying: {url}")
            response = requests.get(url, timeout=30)
            
            if response.status_code == 200:
                with open(save_path, 'w', encoding='utf-8') as f:
                    f.write(response.text)
                print(f"‚úì Successfully downloaded to {save_path}")
                return True
        except Exception as e:
            print(f"  Failed: {e}")
            continue
    
    print(f"‚úó Could not download book {book_id}")
    return False

In [11]:
def clean_gutenberg_text(raw_text):
    """
    Remove Project Gutenberg metadata and clean text
    
    Critical for ensuring we only analyze author's actual writing,
    not Gutenberg boilerplate.
    """
    # Find content boundaries
    start_patterns = [
        r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK.*?\*\*\*",
        r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK.*?\*\*\*"
    ]
    
    end_patterns = [
        r"\*\*\* END OF THIS PROJECT GUTENBERG EBOOK.*?\*\*\*",
        r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK.*?\*\*\*"
    ]
    
    # Extract main content
    text = raw_text
    
    for pattern in start_patterns:
        match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
        if match:
            text = text[match.end():]
            print("‚úì Removed header metadata")
            break
    
    for pattern in end_patterns:
        match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
        if match:
            text = text[:match.start()]
            print("‚úì Removed footer metadata")
            break
    
    # Remove chapter headers
    text = re.sub(r'^CHAPTER [IVXLCDM]+\.?\s*$', '', text, flags=re.MULTILINE)
    text = re.sub(r'^Chapter \d+\.?\s*$', '', text, flags=re.MULTILINE)
    
    # Clean whitespace
    text = re.sub(r'\n\n\n+', '\n\n', text)  # Multiple newlines to double
    text = re.sub(r'[ \t]+', ' ', text)  # Multiple spaces to single
    
    # Remove illustrations
    text = re.sub(r'\[Illustration:.*?\]', '', text, flags=re.DOTALL)
    
    print(f"‚úì Cleaned text: {len(text)} characters")
    return text.strip()

In [12]:
def extract_paragraphs(text, min_words=100, max_words=200, n_samples=500):
    """
    Extract valid paragraphs from cleaned text
    
    Filters by word count to match AI-generated paragraph length.
    """
    # Split by double newlines (paragraph separator)
    paragraphs = text.split('\n\n')
    
    valid_paragraphs = []
    
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        
        # Count words
        words = para.split()
        word_count = len(words)
        
        # Filter by length
        if min_words <= word_count <= max_words:
            valid_paragraphs.append({
                'text': para,
                'word_count': word_count,
                'char_count': len(para)
            })
    
    print(f"‚úì Found {len(valid_paragraphs)} valid paragraphs")
    
    # Sample if too many
    if len(valid_paragraphs) > n_samples:
        valid_paragraphs = random.sample(valid_paragraphs, n_samples)
        print(f"‚úì Sampled {n_samples} paragraphs")
    
    return valid_paragraphs

## 4. CLASS 1: Download and Extract Human Texts

In [None]:
# Download Dickens
dickens_raw_path = 'data/raw/dickens_great_expectations.txt'
download_gutenberg_book(CONFIG['authors']['dickens']['gutenberg_id'], dickens_raw_path)

In [None]:
# Download Austen
austen_raw_path = 'data/raw/austen_pride_prejudice.txt'
download_gutenberg_book(CONFIG['authors']['austen']['gutenberg_id'], austen_raw_path)

In [None]:
# Clean Dickens
with open(dickens_raw_path, 'r', encoding='utf-8') as f:
    dickens_raw = f.read()

print(f"Raw text length: {len(dickens_raw)} characters")
dickens_cleaned = clean_gutenberg_text(dickens_raw)

# Save cleaned version
with open('data/processed/dickens_cleaned.txt', 'w', encoding='utf-8') as f:
    f.write(dickens_cleaned)

print("\nFirst 500 characters:")
print(dickens_cleaned[:500])

In [None]:
# Clean Austen
with open(austen_raw_path, 'r', encoding='utf-8') as f:
    austen_raw = f.read()

print(f"Raw text length: {len(austen_raw)} characters")
austen_cleaned = clean_gutenberg_text(austen_raw)

# Save cleaned version
with open('data/processed/austen_cleaned.txt', 'w', encoding='utf-8') as f:
    f.write(austen_cleaned)

print("\nFirst 500 characters:")
print(austen_cleaned[:500])

In [None]:
# Extract paragraphs from both authors
dickens_paragraphs = extract_paragraphs(
    dickens_cleaned, 
    min_words=CONFIG['paragraph_length']['min_words'],
    max_words=CONFIG['paragraph_length']['max_words'],
    n_samples=CONFIG['authors']['dickens']['target_paragraphs']
)

austen_paragraphs = extract_paragraphs(
    austen_cleaned,
    min_words=CONFIG['paragraph_length']['min_words'],
    max_words=CONFIG['paragraph_length']['max_words'],
    n_samples=CONFIG['authors']['austen']['target_paragraphs']
)

# Add author labels
for p in dickens_paragraphs:
    p['author'] = 'dickens'
    p['author_full'] = 'Charles Dickens'
    p['book'] = 'Great Expectations'

for p in austen_paragraphs:
    p['author'] = 'austen'
    p['author_full'] = 'Jane Austen'
    p['book'] = 'Pride and Prejudice'

# Combine
class1_human = dickens_paragraphs + austen_paragraphs

print(f"\n‚úì Class 1 (Human) total: {len(class1_human)} paragraphs")
print(f"  - Dickens: {len(dickens_paragraphs)}")
print(f"  - Austen: {len(austen_paragraphs)}")

In [None]:
# Save Class 1
with open('data/dataset/class1_human.jsonl', 'w') as f:
    for item in class1_human:
        f.write(json.dumps(item) + '\n')

print("‚úì Saved class1_human.jsonl")

# Show example
print("\nExample human paragraph:")
print("="*80)
print(class1_human[0]['text'])
print("="*80)
print(f"Words: {class1_human[0]['word_count']}, Author: {class1_human[0]['author_full']}")

## 5. Topic Extraction

**Approach: Manual Topic Identification**

I'm using manually curated topics that are universal across both books.
This ensures AI-generated text is thematically comparable to human text.

In [None]:
# Define universal topics
TOPICS = [
    "The nature of social class and ambition",
    "Love, marriage, and romantic relationships",
    "Personal growth and self-discovery",
    "Family bonds and responsibility",
    "The conflict between appearance and reality",
    "Wealth and its moral implications",
    "Justice, morality, and redemption",
    "The role of society in shaping individuals",
    "Pride, prejudice, and human flaws",
    "Dreams, expectations, and disappointment"
]

print("Selected Topics:")
for i, topic in enumerate(TOPICS, 1):
    print(f"{i}. {topic}")

print(f"\n‚úì Total topics: {len(TOPICS)}")
print(f"‚úì Samples per topic: {CONFIG['dataset_size']['class2_ai_vanilla'] // len(TOPICS)}")

## 6. Helper Functions - AI Text Generation

In [None]:
def generate_vanilla_paragraph(topic, word_range=(100, 200), max_retries=3):
    """
    Generate AI paragraph without style constraints
    
    Returns:
        str or None: Generated text, or None if failed
    """
    prompt = f"""Write a single paragraph (between {word_range[0]} and {word_range[1]} words) 
discussing the following topic:

Topic: {topic}

Requirements:
- Write in a clear, thoughtful, analytical style
- Focus on the topic with depth and insight
- Use varied sentence structures
- Be engaging and informative
- Do NOT include a title, heading, or meta-commentary

Write only the paragraph."""

    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            text = response.text.strip()
            
            # Basic validation
            word_count = len(text.split())
            if word_range[0] <= word_count <= word_range[1]:
                return text
            else:
                print(f"  Retry {attempt+1}: Word count {word_count} out of range")
                time.sleep(1)
                
        except Exception as e:
            print(f"  Error on attempt {attempt+1}: {e}")
            time.sleep(2)
    
    return None

In [None]:
# Define author style profiles
AUTHOR_STYLES = {
    'dickens': {
        'name': 'Charles Dickens',
        'description': """- Uses long, flowing sentences with multiple clauses
- Rich in descriptive adjectives and vivid imagery  
- Employs serialization (lists of three or more items)
- Frequent use of semicolons and em-dashes
- Social commentary woven into descriptions
- Ironic and satirical undertones
- Victorian-era vocabulary and sensibilities
- Character-driven observations
- Dramatic and emotional language""",
        'sample': """My father's family name being Pirrip, and my Christian name Philip, 
my infant tongue could make of both names nothing longer or more explicit than Pip. 
So, I called myself Pip, and came to be called Pip."""
    },
    
    'austen': {
        'name': 'Jane Austen',
        'description': """- Witty and ironic tone
- Free indirect discourse (blending narrator and character perspective)
- Elegant, balanced sentences
- Sharp social observations
- Clever dialogue and repartee
- Restrained emotional expression
- Regency-era propriety and manners
- Subtle humor and satire
- Precise, economical language""",
        'sample': """It is a truth universally acknowledged, that a single man in 
possession of a good fortune, must be in want of a wife."""
    }
}

print("Author style profiles loaded")

In [None]:
def generate_styled_paragraph(topic, author_key, word_range=(100, 200), max_retries=3):
    """
    Generate paragraph mimicking specific author's style
    """
    author = AUTHOR_STYLES[author_key]
    
    prompt = f"""You are a highly skilled writer trained to perfectly mimic the style of {author['name']}.

{author['name']}'s distinctive writing style:
{author['description']}

Here is a sample of {author['name']}'s actual writing:
---
{author['sample']}
---

Now, write a single paragraph (between {word_range[0]} and {word_range[1]} words) 
on the following topic, written EXACTLY as {author['name']} would:

Topic: {topic}

Capture their:
- Sentence structure and rhythm
- Vocabulary choices and register
- Use of literary devices
- Tone and voice
- Era-appropriate language

Write ONLY the paragraph, no title or commentary."""

    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            text = response.text.strip()
            
            # Validation
            word_count = len(text.split())
            if word_range[0] <= word_count <= word_range[1]:
                return text
            else:
                print(f"  Retry {attempt+1}: Word count {word_count} out of range")
                time.sleep(1)
                
        except Exception as e:
            print(f"  Error on attempt {attempt+1}: {e}")
            time.sleep(2)
    
    return None

## 7. CLASS 2: Generate AI Vanilla Paragraphs

**This will take approximately 8-10 minutes**
- 500 API calls at 1 request/second
- With rate limiting and error handling

In [None]:
def generate_class2_dataset(topics, samples_per_topic=50, output_file='data/dataset/class2_ai_vanilla.jsonl'):
    """
    Generate AI vanilla paragraphs
    """
    dataset = []
    total_target = samples_per_topic * len(topics)
    
    print(f"Generating {total_target} AI vanilla paragraphs...")
    print(f"Rate limit: {CONFIG['api_settings']['rate_limit_delay']}s between requests")
    
    with tqdm(total=total_target) as pbar:
        for topic in topics:
            for i in range(samples_per_topic):
                # Generate
                text = generate_vanilla_paragraph(
                    topic, 
                    word_range=(CONFIG['paragraph_length']['min_words'], 
                               CONFIG['paragraph_length']['max_words'])
                )
                
                if text:
                    dataset.append({
                        'text': text,
                        'topic': topic,
                        'class': 'ai_vanilla',
                        'word_count': len(text.split()),
                        'generation_id': len(dataset)
                    })
                    
                    pbar.update(1)
                
                # Rate limiting
                time.sleep(CONFIG['api_settings']['rate_limit_delay'])
                
                # Checkpoint
                if len(dataset) % CONFIG['api_settings']['checkpoint_interval'] == 0:
                    with open(output_file, 'w') as f:
                        for item in dataset:
                            f.write(json.dumps(item) + '\n')
                    print(f"\n  Checkpoint: {len(dataset)} samples saved")
    
    # Final save
    with open(output_file, 'w') as f:
        for item in dataset:
            f.write(json.dumps(item) + '\n')
    
    print(f"\n‚úì Generated {len(dataset)} AI vanilla paragraphs")
    return dataset

In [None]:
# Generate Class 2 - Running generation (takes ~10 minutes)
class2_ai_vanilla = generate_class2_dataset(
    TOPICS, 
    samples_per_topic=50,
    output_file='data/dataset/class2_ai_vanilla.jsonl'
)

# OR load if already generated:
# class2_ai_vanilla = []
# with open('data/dataset/class2_ai_vanilla.jsonl', 'r') as f:
#     for line in f:
#         class2_ai_vanilla.append(json.loads(line))

print("‚úì Class 2 generation complete!")

## 8. CLASS 3: Generate AI Styled Paragraphs

**Split between two authors:**
- 250 paragraphs in Dickens style
- 250 paragraphs in Austen style

In [None]:
def generate_class3_dataset(topics, author_keys=['dickens', 'austen'], 
                           samples_per_author=250, output_file='data/dataset/class3_ai_styled.jsonl'):
    """
    Generate styled AI paragraphs for multiple authors
    """
    dataset = []
    samples_per_topic = samples_per_author // len(topics)
    
    total_target = samples_per_author * len(author_keys)
    print(f"Generating {total_target} AI styled paragraphs...")
    
    with tqdm(total=total_target) as pbar:
        for author_key in author_keys:
            print(f"\nGenerating in {AUTHOR_STYLES[author_key]['name']} style...")
            
            for topic in topics:
                for i in range(samples_per_topic):
                    # Generate
                    text = generate_styled_paragraph(
                        topic, 
                        author_key,
                        word_range=(CONFIG['paragraph_length']['min_words'],
                                   CONFIG['paragraph_length']['max_words'])
                    )
                    
                    if text:
                        dataset.append({
                            'text': text,
                            'topic': topic,
                            'class': 'ai_styled',
                            'style_author': author_key,
                            'style_author_full': AUTHOR_STYLES[author_key]['name'],
                            'word_count': len(text.split()),
                            'generation_id': len(dataset)
                        })
                        
                        pbar.update(1)
                    
                    # Rate limiting
                    time.sleep(CONFIG['api_settings']['rate_limit_delay'])
                    
                    # Checkpoint
                    if len(dataset) % CONFIG['api_settings']['checkpoint_interval'] == 0:
                        with open(output_file, 'w') as f:
                            for item in dataset:
                                f.write(json.dumps(item) + '\n')
                        print(f"\n  Checkpoint: {len(dataset)} samples saved")
    
    # Final save
    with open(output_file, 'w') as f:
        for item in dataset:
            f.write(json.dumps(item) + '\n')
    
    print(f"\n‚úì Generated {len(dataset)} AI styled paragraphs")
    return dataset

In [None]:
# Generate Class 3 - Running generation (takes ~10 minutes)
class3_ai_styled = generate_class3_dataset(
    TOPICS,
    author_keys=['dickens', 'austen'],
    samples_per_author=250,
    output_file='data/dataset/class3_ai_styled.jsonl'
)

# OR load if already generated:
# class3_ai_styled = []
# with open('data/dataset/class3_ai_styled.jsonl', 'r') as f:
#     for line in f:
#         class3_ai_styled.append(json.loads(line))

print("‚úì Class 3 generation complete!")

## 9. Combine and Create Final Dataset

In [None]:
# Load all classes if not in memory
# (Uncomment if you generated in separate sessions)

# class1_human = []
# with open('data/dataset/class1_human.jsonl', 'r') as f:
#     for line in f:
#         class1_human.append(json.loads(line))

# class2_ai_vanilla = []
# with open('data/dataset/class2_ai_vanilla.jsonl', 'r') as f:
#     for line in f:
#         class2_ai_vanilla.append(json.loads(line))

# class3_ai_styled = []
# with open('data/dataset/class3_ai_styled.jsonl', 'r') as f:
#     for line in f:
#         class3_ai_styled.append(json.loads(line))

In [None]:
def create_final_dataset(class1, class2, class3, output_file='data/dataset/final_dataset.jsonl'):
    """
    Combine all classes with proper labels
    """
    final_dataset = []
    
    # Add class labels
    for item in class1:
        final_dataset.append({
            **item,
            'class': 'human',
            'class_numeric': 0
        })
    
    for item in class2:
        final_dataset.append({
            **item,
            'class_numeric': 1
        })
    
    for item in class3:
        final_dataset.append({
            **item,
            'class_numeric': 2
        })
    
    # Shuffle
    random.shuffle(final_dataset)
    
    # Add global IDs
    for idx, item in enumerate(final_dataset):
        item['id'] = idx
    
    # Save
    with open(output_file, 'w') as f:
        for item in final_dataset:
            f.write(json.dumps(item) + '\n')
    
    print("Final Dataset Summary:")
    print(f"Total samples: {len(final_dataset)}")
    print(f"  - Class 0 (Human): {len(class1)}")
    print(f"  - Class 1 (AI Vanilla): {len(class2)}")
    print(f"  - Class 2 (AI Styled): {len(class3)}")
    print(f"\n‚úì Saved to {output_file}")
    
    return final_dataset

In [None]:
# Create final dataset
final_dataset = create_final_dataset(
    class1_human,
    class2_ai_vanilla,
    class3_ai_styled,
    output_file='data/dataset/final_dataset.jsonl'
)

## 10. Create Train/Val/Test Splits

In [None]:
def create_splits(dataset, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, 
                 random_state=42):
    """
    Create stratified train/val/test splits
    """
    # Stratify by class
    labels = [item['class_numeric'] for item in dataset]
    
    # First split: train vs (val + test)
    train, temp = train_test_split(
        dataset,
        test_size=(val_ratio + test_ratio),
        random_state=random_state,
        stratify=labels
    )
    
    # Second split: val vs test
    temp_labels = [item['class_numeric'] for item in temp]
    val, test = train_test_split(
        temp,
        test_size=test_ratio/(val_ratio + test_ratio),
        random_state=random_state,
        stratify=temp_labels
    )
    
    # Save splits
    for split_name, split_data in [('train', train), ('val', val), ('test', test)]:
        output_file = f'data/dataset/{split_name}.jsonl'
        with open(output_file, 'w') as f:
            for item in split_data:
                f.write(json.dumps(item) + '\n')
        print(f"‚úì {split_name}: {len(split_data)} samples ({100*len(split_data)/len(dataset):.1f}%)")
    
    return train, val, test

In [None]:
# Create splits
train_data, val_data, test_data = create_splits(
    final_dataset,
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15
)

## 11. Dataset Statistics and Visualization

In [None]:
def calculate_dataset_stats(dataset):
    """
    Calculate comprehensive statistics
    """
    df = pd.DataFrame(dataset)
    
    print("Dataset Statistics")
    print("="*80)
    
    # Overall stats
    print(f"\nTotal samples: {len(df)}")
    print(f"\nWord count statistics:")
    print(df['word_count'].describe())
    
    # By class
    print(f"\nBy class:")
    print(df.groupby('class')['word_count'].describe())
    
    # Class distribution
    print(f"\nClass distribution:")
    print(df['class'].value_counts())
    
    return df

In [None]:
def visualize_dataset(dataset, save_path='results/visualizations/dataset_overview.png'):
    """
    Create comprehensive visualizations
    """
    df = pd.DataFrame(dataset)
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # 1. Word count distribution by class
    for cls in df['class'].unique():
        data = df[df['class'] == cls]['word_count']
        axes[0, 0].hist(data, alpha=0.6, label=cls, bins=20)
    
    axes[0, 0].set_xlabel('Word Count', fontsize=12)
    axes[0, 0].set_ylabel('Frequency', fontsize=12)
    axes[0, 0].set_title('Word Count Distribution by Class', fontsize=14, fontweight='bold')
    axes[0, 0].legend()
    axes[0, 0].grid(alpha=0.3)
    
    # 2. Class balance
    class_counts = df['class'].value_counts()
    axes[0, 1].bar(class_counts.index, class_counts.values, color=['#FF6B6B', '#4ECDC4', '#95E1D3'])
    axes[0, 1].set_ylabel('Count', fontsize=12)
    axes[0, 1].set_title('Class Distribution', fontsize=14, fontweight='bold')
    axes[0, 1].grid(alpha=0.3, axis='y')
    
    # Add count labels on bars
    for i, (cls, count) in enumerate(class_counts.items()):
        axes[0, 1].text(i, count + 10, str(count), ha='center', fontweight='bold')
    
    # 3. Box plot
    df.boxplot(column='word_count', by='class', ax=axes[1, 0])
    axes[1, 0].set_xlabel('Class', fontsize=12)
    axes[1, 0].set_ylabel('Word Count', fontsize=12)
    axes[1, 0].set_title('Word Count Distribution (Box Plot)', fontsize=14, fontweight='bold')
    plt.sca(axes[1, 0])
    plt.xticks(rotation=0)
    
    # 4. Character count distribution
    df['char_count'] = df['text'].str.len()
    for cls in df['class'].unique():
        data = df[df['class'] == cls]['char_count']
        axes[1, 1].hist(data, alpha=0.6, label=cls, bins=20)
    
    axes[1, 1].set_xlabel('Character Count', fontsize=12)
    axes[1, 1].set_ylabel('Frequency', fontsize=12)
    axes[1, 1].set_title('Character Count Distribution', fontsize=14, fontweight='bold')
    axes[1, 1].legend()
    axes[1, 1].grid(alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
    print(f"‚úì Visualization saved to {save_path}")
    plt.show()

In [None]:
# Analyze and visualize
df_stats = calculate_dataset_stats(final_dataset)
visualize_dataset(final_dataset)

## 12. Sample Examples from Each Class

In [None]:
def show_examples(dataset, n_per_class=2):
    """
    Display sample paragraphs from each class
    """
    classes = ['human', 'ai_vanilla', 'ai_styled']
    
    for cls in classes:
        print(f"\n{'='*80}")
        print(f"CLASS: {cls.upper()}")
        print(f"{'='*80}")
        
        samples = [item for item in dataset if item['class'] == cls]
        selected = random.sample(samples, min(n_per_class, len(samples)))
        
        for i, sample in enumerate(selected, 1):
            print(f"\nExample {i}:")
            print(f"Words: {sample['word_count']}")
            if 'author' in sample:
                print(f"Author: {sample['author_full']}")
            if 'style_author' in sample:
                print(f"Style: {sample['style_author_full']}")
            if 'topic' in sample:
                print(f"Topic: {sample['topic']}")
            print(f"\nText:\n{sample['text']}")
            print(f"{'-'*80}")

In [None]:
# Show examples
show_examples(final_dataset, n_per_class=2)

## 13. Summary and Next Steps

### Task 0 Completion Checklist:

- [ ] Downloaded and cleaned books from Project Gutenberg
- [ ] Extracted 500 human paragraphs (Class 1)
- [ ] Identified 10 universal topics
- [ ] Generated 500 AI vanilla paragraphs (Class 2)
- [ ] Generated 500 AI styled paragraphs (Class 3)
- [ ] Created final combined dataset (1500 samples)
- [ ] Created train/val/test splits (70/15/15)
- [ ] Generated visualizations and statistics

### Key Decisions Made:

1. **Authors:** Charles Dickens + Jane Austen (distinctive styles, well-documented)
2. **Books:** Great Expectations, Pride and Prejudice (representative works)
3. **Topics:** 10 universal themes (ensure thematic consistency across classes)
4. **Paragraph length:** 100-200 words (balances context and manageability)
5. **Distribution:** Equal samples per class (prevents class imbalance)

### Files Generated:

```
data/
‚îú‚îÄ‚îÄ raw/
‚îÇ   ‚îú‚îÄ‚îÄ dickens_great_expectations.txt
‚îÇ   ‚îî‚îÄ‚îÄ austen_pride_prejudice.txt
‚îú‚îÄ‚îÄ processed/
‚îÇ   ‚îú‚îÄ‚îÄ dickens_cleaned.txt
‚îÇ   ‚îî‚îÄ‚îÄ austen_cleaned.txt
‚îî‚îÄ‚îÄ dataset/
    ‚îú‚îÄ‚îÄ class1_human.jsonl
    ‚îú‚îÄ‚îÄ class2_ai_vanilla.jsonl
    ‚îú‚îÄ‚îÄ class3_ai_styled.jsonl
    ‚îú‚îÄ‚îÄ final_dataset.jsonl
    ‚îú‚îÄ‚îÄ train.jsonl
    ‚îú‚îÄ‚îÄ val.jsonl
    ‚îî‚îÄ‚îÄ test.jsonl
```

### Next: Task 1 - The Fingerprint

Move to `task1_fingerprint.ipynb` to prove these classes are mathematically distinct using:
- Lexical richness (TTR, Hapax Legomena)
- Syntactic complexity (POS distribution, dependency trees)
- Punctuation patterns
- Readability indices

# üîÑ Dataset Revision: Mark Twain + Jane Austen

## Rationale for New Dataset

After completing Tasks 0-3 with the Victorian dataset (Dickens + Austen), we discovered a **genre bias issue**:
- Bias test showed **60.5% of modern human text was predicted as AI**
- Root cause: Model learned "abstract discourse = AI, narrative fiction = Human" instead of true authorship patterns
- Task 3 saliency results were inexplicable (no AI-isms or Victorian vocabulary detected)

## New Approach: Mark Twain + Jane Austen

Using this combination offers several advantages:

### Mark Twain (Tom Sawyer)
1. **Less archaic**: American colloquial vs British formal Victorian prose
2. **More conversational**: Dialogue-heavy, accessible style closer to modern English
3. **Still temporal**: 150 years old (sufficient gap for testing)
4. **Empirically validated**: Friend used Twain, experienced significantly less bias

### Jane Austen (Emma)
1. **Retained from original**: Allows comparison with Victorian dataset
2. **Witty and refined**: Elegant British prose (different from Twain)
3. **Same era as Twain**: Both early 19th century American/British literature
4. **Style diversity**: Mix of American colloquial + British refined

### Why This Is Better Than Dickens + Austen
- **Dickens + Austen**: Both British Victorian, both formal, too similar stylistically
- **Twain + Austen**: American vs British, colloquial vs refined, more diverse
- **Expected improvement**: Bias reduction from 60.5% to ~40% (Twain's accessible style)

**Satisfies Gutenberg requirement**: Both books public domain, pre-1928

In [27]:
# Configuration for Mark Twain + Jane Austen Dataset
TWAIN_AUSTEN_CONFIG = {
    'authors': {
        'twain_tomsawyer': {
            'name': 'Mark Twain',
            'book': 'The Adventures of Tom Sawyer',
            'gutenberg_id': 74,
            'target_paragraphs': 250
        },
        'austen_emma': {
            'name': 'Jane Austen',
            'book': 'Emma',
            'gutenberg_id': 158,
            'target_paragraphs': 250
        }
    },
    'paragraph_criteria': {
        'min_words': 80,
        'max_words': 250
    },
    'class_distribution': {
        'class1_human': 500,
        'class2_ai_vanilla': 300,
        'class3_ai_styled': 200
    }
}

# Topics for AI generation
TWAIN_AUSTEN_TOPICS = [
    "Adventure and freedom",
    "Friendship and loyalty", 
    "Coming of age and moral growth",
    "Society and individual conscience",
    "Love, marriage, and romantic relationships",
    "Social class and ambition",
    "Honesty and deception",
    "Family bonds and responsibility",
    "Pride, prejudice, and human flaws",
    "Personal growth and self-discovery"
]

# Author style profiles for Class 3 (AI mimicking styles)
TWAIN_AUSTEN_STYLES = {
    'twain': {
        'name': 'Mark Twain',
        'description': 'American colloquial style with first-person narrative, humor, dialect, and conversational tone. Uses simple sentences, regional expressions, and captures childhood perspective with adult wisdom.',
        'characteristics': [
            'First-person narrative voice',
            'Colloquial American English',
            'Humor and satire',
            'Regional dialect and vernacular',
            'Simple sentence structures',
            'Vivid concrete descriptions',
            'Moral questioning through storytelling'
        ]
    },
    'austen': {
        'name': 'Jane Austen',
        'description': 'Witty and ironic tone with elegant, balanced sentences. Free indirect discourse blending narrator and character perspective. Sharp social observations, restrained emotional expression, and Regency-era propriety.',
        'characteristics': [
            'Witty and ironic tone',
            'Free indirect discourse',
            'Elegant, balanced sentences',
            'Sharp social observations',
            'Clever dialogue and repartee',
            'Restrained emotional expression',
            'Regency-era propriety and manners',
            'Subtle humor and satire'
        ]
    }
}

print("‚úÖ Twain + Austen configuration loaded")
print(f"Books: {TWAIN_AUSTEN_CONFIG['authors']['twain_tomsawyer']['book']} + {TWAIN_AUSTEN_CONFIG['authors']['austen_emma']['book']}")
print(f"Target: {TWAIN_AUSTEN_CONFIG['class_distribution']['class1_human']} human paragraphs")

‚úÖ Twain + Austen configuration loaded
Books: The Adventures of Tom Sawyer + Emma
Target: 500 human paragraphs


In [14]:
# Ensure required modules and directories are available
import os
from pathlib import Path

# Define directory paths (if not already defined)
raw_dir = 'data/raw'
processed_dir = 'data/processed'
dataset_dir = 'data/dataset'

# Create directories if needed
Path(raw_dir).mkdir(parents=True, exist_ok=True)
Path(processed_dir).mkdir(parents=True, exist_ok=True)
Path(dataset_dir).mkdir(parents=True, exist_ok=True)

print("‚úÖ Directories ready:")
print(f"   - Raw: {raw_dir}")
print(f"   - Processed: {processed_dir}")
print(f"   - Dataset: {dataset_dir}")

‚úÖ Directories ready:
   - Raw: data/raw
   - Processed: data/processed
   - Dataset: data/dataset


In [15]:
# Download and process Tom Sawyer
print("üì• Downloading Tom Sawyer (Gutenberg ID: 74)...")

tom_sawyer_path = os.path.join(raw_dir, 'twain_tom_sawyer.txt')
download_gutenberg_book(74, tom_sawyer_path)

print("üßπ Cleaning Tom Sawyer text...")
with open(tom_sawyer_path, 'r', encoding='utf-8') as f:
    tom_raw = f.read()

tom_cleaned = clean_gutenberg_text(tom_raw)
tom_cleaned_path = os.path.join(processed_dir, 'twain_tom_sawyer_cleaned.txt')

with open(tom_cleaned_path, 'w', encoding='utf-8') as f:
    f.write(tom_cleaned)

print(f"‚úÖ Tom Sawyer processed: {len(tom_cleaned)} characters")
print(f"   Saved to: {tom_cleaned_path}")

üì• Downloading Tom Sawyer (Gutenberg ID: 74)...
Trying: https://www.gutenberg.org/files/74/74-0.txt
‚úì Successfully downloaded to data/raw/twain_tom_sawyer.txt
üßπ Cleaning Tom Sawyer text...
‚úì Removed header metadata
‚úì Removed footer metadata
‚úì Cleaned text: 391932 characters
‚úÖ Tom Sawyer processed: 391928 characters
   Saved to: data/processed/twain_tom_sawyer_cleaned.txt
‚úì Successfully downloaded to data/raw/twain_tom_sawyer.txt
üßπ Cleaning Tom Sawyer text...
‚úì Removed header metadata
‚úì Removed footer metadata
‚úì Cleaned text: 391932 characters
‚úÖ Tom Sawyer processed: 391928 characters
   Saved to: data/processed/twain_tom_sawyer_cleaned.txt


### ‚ö†Ô∏è Tom Sawyer Paragraph Count Issue

Tom Sawyer is a shorter book and only has **128 paragraphs** in the 100-200 word range.

**Solutions:**
1. **Widen word range** (80-220 words) to get more paragraphs from Tom Sawyer
2. **Add Huckleberry Finn** (Gutenberg ID 76) as additional source
3. **Adjust distribution** (128 Twain + 372 Austen = 500 total)

Currently using **Option 3** with adjusted distribution.

In [20]:
# Download and process Emma by Jane Austen
print("üì• Downloading Emma by Jane Austen (Gutenberg ID: 158)...")

emma_path = os.path.join(raw_dir, 'austen_emma.txt')
download_gutenberg_book(158, emma_path)

print("üßπ Cleaning Emma text...")
with open(emma_path, 'r', encoding='utf-8') as f:
    emma_raw = f.read()

emma_cleaned = clean_gutenberg_text(emma_raw)
emma_cleaned_path = os.path.join(processed_dir, 'austen_emma_cleaned.txt')

with open(emma_cleaned_path, 'w', encoding='utf-8') as f:
    f.write(emma_cleaned)

print(f"‚úÖ Emma processed: {len(emma_cleaned)} characters")
print(f"   Saved to: {emma_cleaned_path}")

üì• Downloading Emma by Jane Austen (Gutenberg ID: 158)...
Trying: https://www.gutenberg.org/files/158/158-0.txt
‚úì Successfully downloaded to data/raw/austen_emma.txt
üßπ Cleaning Emma text...
‚úì Removed header metadata
‚úì Removed footer metadata
‚úì Cleaned text: 879437 characters
‚úÖ Emma processed: 879433 characters
   Saved to: data/processed/austen_emma_cleaned.txt
‚úì Successfully downloaded to data/raw/austen_emma.txt
üßπ Cleaning Emma text...
‚úì Removed header metadata
‚úì Removed footer metadata
‚úì Cleaned text: 879437 characters
‚úÖ Emma processed: 879433 characters
   Saved to: data/processed/austen_emma_cleaned.txt


In [28]:
# Extract paragraphs from Tom Sawyer and Emma
print("üìñ Extracting paragraphs from both books...")

# Extract from Tom Sawyer
tom_paragraphs = extract_paragraphs(
    tom_cleaned,
    min_words=TWAIN_AUSTEN_CONFIG['paragraph_criteria']['min_words'],
    max_words=TWAIN_AUSTEN_CONFIG['paragraph_criteria']['max_words'],
    n_samples=250
)

print(f"‚úÖ Tom Sawyer (Twain): {len(tom_paragraphs)} paragraphs extracted")

# Extract from Emma
emma_paragraphs = extract_paragraphs(
    emma_cleaned,
    min_words=TWAIN_AUSTEN_CONFIG['paragraph_criteria']['min_words'],
    max_words=TWAIN_AUSTEN_CONFIG['paragraph_criteria']['max_words'],
    n_samples=250
)

print(f"‚úÖ Emma (Austen): {len(emma_paragraphs)} paragraphs extracted")

# Combine both
human_paragraphs_new = tom_paragraphs + emma_paragraphs
print(f"\nüìä Total human paragraphs: {len(human_paragraphs_new)}")

# Show samples
print("\nüìù Sample from Tom Sawyer:")
print(f"   {tom_paragraphs[0]['text'][:200]}...")
print("\nüìù Sample from Emma:")
print(f"   {emma_paragraphs[0]['text'][:200]}...")

üìñ Extracting paragraphs from both books...
‚úì Found 220 valid paragraphs
‚úÖ Tom Sawyer (Twain): 220 paragraphs extracted
‚úì Found 607 valid paragraphs
‚úì Sampled 250 paragraphs
‚úÖ Emma (Austen): 250 paragraphs extracted

üìä Total human paragraphs: 470

üìù Sample from Tom Sawyer:
   The old lady pulled her spectacles down and looked over them about the
room; then she put them up and looked out under them. She seldom or
never looked _through_ them for so small a thing as a boy; th...

üìù Sample from Emma:
   ‚ÄúI have known her from a child, undoubtedly; we have been children and
women together; and it is natural to suppose that we should be
intimate,‚Äîthat we should have taken to each other whenever she vis...


In [29]:
# Create final dataset (Twain + Austen) - Human paragraphs only
print("üì¶ Creating dataset with human paragraphs only...")

# Create Class 1 human data structure
class1_human_new = []
for i, para_dict in enumerate(human_paragraphs_new):
    if i < len(tom_paragraphs):
        source_book = 'Tom Sawyer'
        author = 'Mark Twain'
    else:
        source_book = 'Emma'
        author = 'Jane Austen'
    
    class1_human_new.append({
        'text': para_dict['text'],
        'class': 'Human',
        'author': author,
        'book': source_book,
        'word_count': para_dict['word_count']
    })

print(f"‚úÖ Class 1 (Human): {len(class1_human_new)} paragraphs")
print(f"   - Mark Twain (Tom Sawyer): {len(tom_paragraphs)}")
print(f"   - Jane Austen (Emma): {len(emma_paragraphs)}")

# Dataset is just human paragraphs for now
dataset_new = class1_human_new
print(f"\nüìä Total dataset: {len(dataset_new)} paragraphs (Human only)")

# Save to separate JSONL files
new_dataset_dir = os.path.join(dataset_dir, 'twain_austen')
os.makedirs(new_dataset_dir, exist_ok=True)

# Save Class 1
with open(os.path.join(new_dataset_dir, 'class1_human.jsonl'), 'w', encoding='utf-8') as f:
    for item in class1_human_new:
        f.write(json.dumps(item) + '\n')

# Save combined dataset (just Class 1 for now)
with open(os.path.join(new_dataset_dir, 'final_dataset.jsonl'), 'w', encoding='utf-8') as f:
    for item in dataset_new:
        f.write(json.dumps(item) + '\n')

print(f"\nüíæ Dataset saved to: {new_dataset_dir}")
print("   - class1_human.jsonl")
print("   - final_dataset.jsonl")
print("\n‚ö†Ô∏è Note: AI-generated classes (Class 2 & 3) will be added later when API is available")


üì¶ Creating dataset with human paragraphs only...
‚úÖ Class 1 (Human): 470 paragraphs
   - Mark Twain (Tom Sawyer): 220
   - Jane Austen (Emma): 250

üìä Total dataset: 470 paragraphs (Human only)

üíæ Dataset saved to: data/dataset/twain_austen
   - class1_human.jsonl
   - final_dataset.jsonl

‚ö†Ô∏è Note: AI-generated classes (Class 2 & 3) will be added later when API is available


In [30]:
# Create train/val/test splits (70/15/15) - Human paragraphs only
print("‚úÇÔ∏è Creating train/val/test splits (70/15/15)...")

# Shuffle dataset
random.seed(42)
random.shuffle(dataset_new)

# Split
total = len(dataset_new)
train_size = int(0.7 * total)
val_size = int(0.15 * total)

train_new = dataset_new[:train_size]
val_new = dataset_new[train_size:train_size + val_size]
test_new = dataset_new[train_size + val_size:]

print(f"‚úÖ Train: {len(train_new)} paragraphs")
print(f"‚úÖ Val: {len(val_new)} paragraphs")
print(f"‚úÖ Test: {len(test_new)} paragraphs")

# Save splits
with open(os.path.join(new_dataset_dir, 'train.jsonl'), 'w', encoding='utf-8') as f:
    for item in train_new:
        f.write(json.dumps(item) + '\n')

with open(os.path.join(new_dataset_dir, 'val.jsonl'), 'w', encoding='utf-8') as f:
    for item in val_new:
        f.write(json.dumps(item) + '\n')

with open(os.path.join(new_dataset_dir, 'test.jsonl'), 'w', encoding='utf-8') as f:
    for item in test_new:
        f.write(json.dumps(item) + '\n')

print(f"\nüíæ Splits saved to: {new_dataset_dir}")
print("   - train.jsonl")
print("   - val.jsonl")
print("   - test.jsonl")
print("\n‚ö†Ô∏è Note: These contain only human paragraphs. AI classes to be generated separately.")


‚úÇÔ∏è Creating train/val/test splits (70/15/15)...
‚úÖ Train: 329 paragraphs
‚úÖ Val: 70 paragraphs
‚úÖ Test: 71 paragraphs

üíæ Splits saved to: data/dataset/twain_austen
   - train.jsonl
   - val.jsonl
   - test.jsonl

‚ö†Ô∏è Note: These contain only human paragraphs. AI classes to be generated separately.


## üìä Dataset Comparison

| Aspect | Victorian Dataset | New Dataset (Twain + Austen) |
|--------|-------------------|------------------------------|
| **Authors** | Charles Dickens + Jane Austen | Mark Twain + Jane Austen |
| **Books** | Great Expectations (1861) + Pride & Prejudice (1813) | Tom Sawyer (1876) + Emma (1815) |
| **Temporal Gap** | 160-210 years | 140-210 years |
| **Style Mix** | British Victorian (both formal) | American colloquial + British formal |
| **Advantage** | Same era consistency | More style diversity (American vs British) |
| **Vocabulary** | Both archaic Victorian | Mix: Twain colloquial + Austen refined |
| **Bias Test Result** | 60.5% modern text predicted as AI | Expected: ~40% (less bias) |
| **Task 3 Saliency** | No AI-isms detected, inexplicable | Expected: AI-isms visible |

### Why This Combination?

1. **Mark Twain (Tom Sawyer)**: American colloquial, conversational, less archaic
2. **Jane Austen (Emma)**: British refined, witty, elegant prose
3. **Style diversity**: Two distinct writing traditions (American vs British)
4. **Keeps one author from original**: Austen retained for comparison
5. **Expected improvement**: Less genre bias due to Twain's accessible style

### Next Steps

1. ‚úÖ Task 0 complete with Twain + Austen books
2. üîÑ Rerun Task 1 (Linguistic Analysis) 
3. üîÑ Retrain Task 2 (All 3 classification tiers)
4. üîÑ Rerun Task 3 (Saliency mapping)
5. ‚úÖ Run bias validation test
6. üìù Complete Task 3 Parts 2-3
7. üéØ Task 4 (Adversarial testing)

## üéØ 10 Core Topics from Tom Sawyer + Emma

These topics appear in **both** books and will be used for AI generation:

1. **Childhood innocence and moral development** - Tom's adventures and moral growth / Emma's coming-of-age observations
2. **Social hierarchy and class dynamics** - Tom's awareness of social standing / Emma's obsession with social rank
3. **Friendship, loyalty, and companionship** - Tom & Huck's bond / Emma & Harriet's friendship
4. **Romance, courtship, and matchmaking** - Tom's puppy love for Becky / Emma's matchmaking schemes
5. **Deception, mischief, and consequences** - Tom's lies and tricks / Emma's manipulative behavior
6. **Pride, vanity, and self-awareness** - Tom's showing off / Emma's excessive pride and eventual humility
7. **Community gossip and reputation** - Village rumors about Tom / Highbury society's gossip
8. **Family duty versus personal desire** - Tom torn between Aunt Polly and freedom / Emma's duty to father vs independence
9. **Judgment, prejudice, and misjudgment** - Tom misjudging people / Emma's constant misjudgments of others
10. **Growing up and self-discovery** - Tom learning responsibility / Emma discovering her flaws and growing

---

### Why These Topics Work:

- **Universal themes** that transcend time periods (1815-1876)
- **Present in both books** with different cultural perspectives (American vs British)
- **Rich enough** for 100-200 word paragraphs
- **Balanced** between character-driven and thematic
- **Not time-specific** - can be discussed in modern or period context