%%html

<style>
    @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap');
    
    .ph-header {
        background: linear-gradient(135deg, #DA552F 0%, #FF6154 100%);
        padding: 60px 40px;
        border-radius: 12px;
        text-align: center;
        margin: 30px 0;
        box-shadow: 0 8px 32px rgba(218, 85, 47, 0.2);
    }
    
    .ph-header h1 {
        color: white;
        font-family: 'Inter', sans-serif;
        font-size: 52px;
        font-weight: 700;
        margin: 0 0 12px 0;
        letter-spacing: -0.5px;
    }
    
    .ph-header p {
        color: rgba(255, 255, 255, 0.95);
        font-family: 'Inter', sans-serif;
        font-size: 18px;
        font-weight: 400;
        margin: 0;
        line-height: 1.6;
    }
    
    .info-card {
        background: #f8f9fa;
        border-left: 3px solid #DA552F;
        padding: 16px 20px;
        border-radius: 6px;
        margin: 20px 0;
    }
    
    .info-card h3 {
        color: #DA552F;
        margin: 0 0 12px 0;
        font-size: 16px;
        font-weight: 600;
    }
</style>

<div class="ph-header">
    <h1>🚀 ProductHuntDB</h1>
    <p>Product Hunt GraphQL API Data Sink & Kaggle Dataset Manager</p>
</div>


# 📖 Overview

This notebook manages a Product Hunt dataset on Kaggle with automatic daily updates.

## ✨ What This Notebook Does

1. **Install** ProductHuntDB package from GitHub
2. **Initialize** SQLite database for Product Hunt data
3. **Sync** data from Product Hunt GraphQL API (posts, users, topics, comments, votes)
4. **Export** data to CSV files
5. **Publish** updated dataset to Kaggle (optional)

## 📅 Usage Strategy

**First Run (Initial Data Extraction):**
- Run with `--full-refresh` flag to download all historical data (2-4 hours)
- This creates the baseline dataset

**Scheduled Daily Updates (Kaggle Automation):**
- Run without flags for incremental sync (3-5 minutes)
- Only fetches new data since last run
- Schedule via: Notebook → Schedule → Daily
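
The two modes above could be selected automatically rather than by editing the sync cell. The following is a minimal sketch, not part of ProductHuntDB: `build_sync_command` is a hypothetical helper, and it assumes that a missing or empty database file means no baseline exists yet. The check would have to run before `producthuntdb init` creates the file.

```python
from pathlib import Path


def build_sync_command(db_path: Path) -> list[str]:
    """Return the sync command for this run (hypothetical helper)."""
    command = ["producthuntdb", "sync"]
    # Assumption: no database file (or an empty one) means this is the first run,
    # so the full historical download is needed.
    if not db_path.exists() or db_path.stat().st_size == 0:
        command.append("--full-refresh")
    return command
```

On a scheduled run where the database already exists, this returns the plain incremental command.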

**📚 Resources:** [GitHub](https://github.com/wyattowalsh/producthuntdb) • [Product Hunt API](https://api.producthunt.com/v2/docs)

# 1️⃣ Installation & Setup

Install ProductHuntDB and configure the environment. This cell automatically detects whether you're running on Kaggle or locally and uses the appropriate installation method.


In [None]:
# Install ProductHuntDB with comprehensive error handling
# Works in Kaggle notebooks and standard Python environments
import subprocess
import sys
import os
from pathlib import Path

print("📦 Installing ProductHuntDB and dependencies...")

# Check if we're in a Kaggle environment or standard Python environment
is_kaggle = Path("/kaggle/working").exists()

try:
    if is_kaggle or "pip" in subprocess.run(
        [sys.executable, "-m", "pip", "--version"], 
        capture_output=True, text=True
    ).stdout:
        # Standard pip installation (works in Kaggle and most Python envs)
        try:
            subprocess.check_call([
                sys.executable, "-m", "pip", "install", "-q",
                "git+https://github.com/wyattowalsh/producthuntdb.git"
            ], stderr=subprocess.DEVNULL)
            print("✅ Installed ProductHuntDB from GitHub")
        except subprocess.CalledProcessError:
            # Fallback to PyPI (if published)
            try:
                subprocess.check_call([
                    sys.executable, "-m", "pip", "install", "-q", "producthuntdb"
                ], stderr=subprocess.DEVNULL)
                print("✅ Installed ProductHuntDB from PyPI")
            except subprocess.CalledProcessError as e:
                print("❌ Installation failed. Please install manually.")
                print("   Run: pip install git+https://github.com/wyattowalsh/producthuntdb.git")
                raise RuntimeError("Failed to install ProductHuntDB") from e
        
        # Install additional dependencies for notebook (not in core package)
        print("📦 Installing notebook-specific dependencies...")
        try:
            subprocess.check_call([
                sys.executable, "-m", "pip", "install", "-q",
                "plotly", "kaleido"  # plotly for interactive viz, kaleido for static export
            ], stderr=subprocess.DEVNULL)
            print("✅ Installed plotly and kaleido")
        except subprocess.CalledProcessError:
            print("⚠️  Optional dependencies (plotly, kaleido) failed to install")
            print("   Visualizations may not work, but core functionality is OK")
    else:
        # For local development with uv or similar package managers
        print("⚠️  Detected non-standard Python environment.")
        print("   If using uv, run: uv sync --group notebook")
        print("   Otherwise, run: pip install git+https://github.com/wyattowalsh/producthuntdb.git plotly")

    # Configure paths
    print("\n🔧 Configuring environment...")
    WORKING_DIR = Path("/kaggle/working") if is_kaggle else Path.cwd()
    os.environ["DB_PATH"] = str(WORKING_DIR / "producthunt.db")
    os.environ["EXPORT_DIR"] = str(WORKING_DIR / "export")

    # Load secrets from Kaggle Secrets or environment
    token_loaded = False
    kaggle_configured = False
    
    try:
        from kaggle_secrets import UserSecretsClient
        user_secrets = UserSecretsClient()
        
        print("🔐 Attempting to load secrets from Kaggle Secrets...")
        
        # Get Product Hunt token (required)
        try:
            producthunt_token = user_secrets.get_secret("PRODUCTHUNT_TOKEN")
            
            # Verify the token is valid (not None, not empty)
            if producthunt_token and len(producthunt_token.strip()) > 0:
                os.environ["PRODUCTHUNT_TOKEN"] = producthunt_token.strip()
                token_length = len(producthunt_token.strip())
                print(f"✅ PRODUCTHUNT_TOKEN loaded from Kaggle Secrets ({token_length} chars)")
                token_loaded = True
            else:
                print("⚠️  PRODUCTHUNT_TOKEN retrieved but is empty or invalid")
                print(f"   Token value type: {type(producthunt_token)}")
                print(f"   Token value repr: {repr(producthunt_token)}")
        except Exception as e:
            print("⚠️  Failed to retrieve PRODUCTHUNT_TOKEN from Kaggle Secrets")
            print(f"   Error: {type(e).__name__}: {e}")
        
        # Get Kaggle publishing credentials (optional)
        try:
            kaggle_username = user_secrets.get_secret("KAGGLE_USERNAME")
            kaggle_key = user_secrets.get_secret("KAGGLE_KEY")
            kaggle_slug = user_secrets.get_secret("KAGGLE_DATASET_SLUG")
            
            # Only set if all three are valid
            if kaggle_username and kaggle_key and kaggle_slug:
                os.environ["KAGGLE_USERNAME"] = kaggle_username
                os.environ["KAGGLE_KEY"] = kaggle_key
                os.environ["KAGGLE_DATASET_SLUG"] = kaggle_slug
                print("✅ Kaggle publishing credentials loaded from Secrets")
                kaggle_configured = True
            else:
                print("ℹ️  Kaggle publishing credentials incomplete (optional)")
        except Exception:
            print("ℹ️  Kaggle publishing credentials not configured (optional)")
            
    except ImportError:
        # Not in Kaggle environment, try environment variables
        print("ℹ️  Not in Kaggle environment, checking environment variables...")
        
        producthunt_token = os.getenv("PRODUCTHUNT_TOKEN")
        if producthunt_token and len(producthunt_token.strip()) > 0:
            # Store the stripped value so stray whitespace cannot break validation
            os.environ["PRODUCTHUNT_TOKEN"] = producthunt_token.strip()
            token_length = len(producthunt_token.strip())
            print(f"✅ PRODUCTHUNT_TOKEN loaded from environment ({token_length} chars)")
            token_loaded = True
        else:
            print("⚠️  PRODUCTHUNT_TOKEN not found in environment")

    # Summary of configuration
    print(f"\n📂 Working directory: {WORKING_DIR}")
    print(f"💾 Database: {os.environ['DB_PATH']}")
    print(f"📤 Export: {os.environ['EXPORT_DIR']}")
    
    if not token_loaded:
        print("\n🚨 CRITICAL: No PRODUCTHUNT_TOKEN configured!")
        print("   → On Kaggle: Add secret in Notebook Settings → Add-ons → Secrets")
        print("   → Secret name: PRODUCTHUNT_TOKEN")
        print("   → Get token at: https://api.producthunt.com/v2/oauth/applications")
        print("   ⚠️  Pipeline will fail without this token!")
    
    if not kaggle_configured:
        print("\nℹ️  Kaggle dataset publishing not configured (optional)")
        print("   To enable, add these secrets:")
        print("   • KAGGLE_USERNAME")
        print("   • KAGGLE_KEY")
        print("   • KAGGLE_DATASET_SLUG")
    
    # Verify key imports work
    print("\n🔍 Verifying installation...")
    try:
        import producthuntdb
        print("✅ producthuntdb module imported successfully")
    except ImportError as e:
        print(f"❌ Failed to import producthuntdb: {e}")
        raise
    
    try:
        import plotly
        print("✅ plotly module imported successfully")
    except ImportError:
        print("⚠️  plotly not available - visualizations will be limited")
    
    # Final verification - try to instantiate settings to catch any validation errors
    if token_loaded:
        try:
            print("\n🔍 Validating configuration...")
            from producthuntdb.config import settings
            print("✅ Configuration validated successfully")
            print(f"   Token: {settings.redact_token()}")
            print(f"   Endpoint: {settings.graphql_endpoint}")
        except Exception as e:
            print(f"❌ Configuration validation failed: {e}")
            print("   Your PRODUCTHUNT_TOKEN may be invalid or too short (min 10 chars)")
            token_loaded = False
    
    print("\n" + "=" * 60)
    if token_loaded:
        print("✅ SETUP COMPLETE - Ready to proceed!")
    else:
        print("⚠️  SETUP INCOMPLETE - Configure PRODUCTHUNT_TOKEN before proceeding")
    print("=" * 60)
        
except Exception as e:
    print(f"\n❌ Setup failed with error: {str(e)}")
    print("   Check your environment configuration and try again.")
    raise

📦 Installing ProductHuntDB...
⚠️  Detected non-standard Python environment.
   If using uv, run: uv sync
   Otherwise, run: pip install git+https://github.com/wyattowalsh/producthuntdb.git

🔧 Configuring environment...
✅ API token loaded from environment

📂 Working directory: /Users/ww/dev/projects/producthuntdb/notebooks
💾 Database: /Users/ww/dev/projects/producthuntdb/notebooks/producthunt.db
📤 Export: /Users/ww/dev/projects/producthuntdb/notebooks/export


# 2️⃣ Configuration

## 🔐 Required: Product Hunt API Token

Get your API token from [api.producthunt.com/v2/oauth/applications](https://api.producthunt.com/v2/oauth/applications)

**On Kaggle:**
1. Go to **Notebook Settings** → **Add-ons** → **Secrets**
2. Add secret: `PRODUCTHUNT_TOKEN` = your API token

**For local development:** Set `PRODUCTHUNT_TOKEN` environment variable

## 📤 Optional: Kaggle Dataset Publishing

To auto-publish datasets to Kaggle, add these additional secrets:

- `KAGGLE_USERNAME` - Your Kaggle username
- `KAGGLE_KEY` - API key from [kaggle.com/settings](https://www.kaggle.com/settings)
- `KAGGLE_DATASET_SLUG` - Dataset slug (format: `username/dataset-name`)
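
A quick pre-flight check can confirm which of these variables are set before running the pipeline. The sketch below is illustrative only: `missing_variables` and `is_valid_slug` are hypothetical helpers, and the slug check only enforces the `username/dataset-name` shape described above, not anything Kaggle itself validates.

```python
import os
from collections.abc import Iterable, Mapping

REQUIRED = ("PRODUCTHUNT_TOKEN",)
PUBLISHING = ("KAGGLE_USERNAME", "KAGGLE_KEY", "KAGGLE_DATASET_SLUG")


def missing_variables(names: Iterable[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or contain only whitespace."""
    return [name for name in names if not env.get(name, "").strip()]


def is_valid_slug(slug: str) -> bool:
    """Check that the slug has exactly two non-empty parts separated by '/'."""
    parts = slug.split("/")
    return len(parts) == 2 and all(part.strip() for part in parts)
```

For example, `missing_variables(REQUIRED)` returning an empty list means the token is present in the current environment.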

⚠️ **Security Note:** Never commit API tokens to version control!

# 3️⃣ Initialize Database & Verify Connection

Let's initialize the database and verify our API authentication works correctly.


In [None]:
# Initialize database and verify authentication with error handling
import subprocess
import sys

print("⏱️  Expected runtime: ~10-30 seconds\n")

try:
    # Initialize database
    print("🔧 Initializing database...")
    result = subprocess.run(
        ["producthuntdb", "init"],
        capture_output=True,
        text=True,
        check=False
    )
    
    if result.returncode == 0:
        print("✅ Database initialized successfully")
        if result.stdout.strip():
            print(result.stdout)
    else:
        print("⚠️  Database initialization encountered issues:")
        print(result.stderr if result.stderr else result.stdout)
        if "already exists" in (result.stderr + result.stdout).lower():
            print("   (Database may already be initialized - this is usually fine)")
        else:
            raise RuntimeError(f"Database init failed: {result.stderr}")
    
    # Verify API authentication
    print("\n🔐 Verifying API authentication...")
    result = subprocess.run(
        ["producthuntdb", "verify"],
        capture_output=True,
        text=True,
        check=False
    )
    
    if result.returncode == 0:
        print("✅ API authentication verified successfully")
        # Show STDOUT which contains the Rich table with user info
        if result.stdout.strip():
            print(result.stdout)
        else:
            print("   API token is valid and authentication successful")
    else:
        print("❌ API authentication failed:")
        # On failure, show both stdout and stderr
        if result.stdout.strip():
            print("Output:", result.stdout)
        if result.stderr.strip():
            print("Error:", result.stderr)
        print("\n💡 Troubleshooting:")
        print("   1. Check your PRODUCTHUNT_TOKEN is valid")
        print("   2. Get a new token at: https://api.producthunt.com/v2/oauth/applications")
        print("   3. Verify the token is correctly set in Kaggle Secrets or environment")
        raise RuntimeError(f"API verification failed (exit code {result.returncode})")
        
except FileNotFoundError:
    print("❌ 'producthuntdb' command not found!")
    print("   The package may not be installed correctly.")
    print("   Try re-running the installation cell above.")
    raise
except Exception as e:
    print(f"\n❌ Initialization failed: {str(e)}")
    print("   Check logs above for specific error details.")
    raise

# 4️⃣ Sync Data from Product Hunt

Fetch posts, users, topics, comments, and votes from the Product Hunt API.

**⏱️ Expected Runtime:**
- **First Run (Full Refresh)**: 2-4 hours for complete historical data
- **Incremental Updates**: 3-5 minutes for new data only

**📝 Configuration:**
- **First time?** Uncomment `--full-refresh` below to get all historical data
- **Daily updates?** Comment out `--full-refresh` to only fetch new data since last run

In [None]:
# Sync data from Product Hunt with comprehensive error handling
# Configure sync strategy based on your needs
import subprocess
import sys
from datetime import datetime
from pathlib import Path

print("⏱️  Expected runtime:")
print("   • Full refresh: 2-4 hours (first run)")
print("   • Incremental: 3-5 minutes (daily updates)")
print("   • Limited test: 1-2 minutes (--max-pages 10)\n")

# Track sync timing
start_time = datetime.now()
print(f"🚀 Starting sync at {start_time.strftime('%Y-%m-%d %H:%M:%S')}\n")

try:
    # 🎯 FOR FIRST RUN: Uncomment this line to get all historical data
    # sync_command = ["producthuntdb", "sync", "--full-refresh"]
    
    # 🔄 FOR SCHEDULED DAILY UPDATES: Use this (default, fast incremental updates)
    sync_command = ["producthuntdb", "sync"]
    
    # 🧪 FOR TESTING: Limit to a few pages (uncomment to use)
    # sync_command = ["producthuntdb", "sync", "--max-pages", "10"]
    
    # 📊 POSTS ONLY: Skip topics and collections (faster, uncomment to use)
    # sync_command = ["producthuntdb", "sync", "--posts-only"]
    
    print(f"📡 Running command: {' '.join(sync_command)}\n")
    
    result = subprocess.run(
        sync_command,
        capture_output=True,
        text=True,
        check=False,
        timeout=14400  # 4-hour timeout (Kaggle limit is 12 hours)
    )
    
    # Calculate elapsed time
    end_time = datetime.now()
    elapsed = end_time - start_time
    
    if result.returncode == 0:
        print(result.stdout)
        print(f"\n✅ Sync completed successfully in {elapsed.total_seconds():.1f} seconds")
        print(f"   ({elapsed.total_seconds() / 60:.1f} minutes)")
    else:
        print("⚠️  Sync encountered errors:")
        print(result.stderr)
        
        # Provide context-specific troubleshooting
        if "rate limit" in result.stderr.lower():
            print("\n💡 Rate Limit Hit - Troubleshooting:")
            print("   • The API has rate limits that reset periodically")
            print("   • Built-in retry logic will handle this automatically")
            print("   • For faster testing, use --max-pages option")
            print("   • Consider running sync during off-peak hours")
        elif "timeout" in result.stderr.lower():
            print("\n💡 Timeout - Troubleshooting:")
            print("   • Full refresh can take several hours")
            print("   • Use incremental sync for daily updates")
            print("   • Data collected before timeout is safely stored")
            print("   • Re-run to continue from where it left off")
        elif "authentication" in result.stderr.lower() or "token" in result.stderr.lower():
            print("\n💡 Authentication Error - Troubleshooting:")
            print("   • Verify PRODUCTHUNT_TOKEN is set correctly")
            print("   • Token may have expired - get new one from api.producthunt.com")
            print("   • Check Kaggle Secrets configuration")
        else:
            print("\n💡 General Troubleshooting:")
            print("   • Check database file is not corrupted")
            print("   • Verify sufficient disk space available")
            print("   • Review full error message above")
            print("   • Try running 'producthuntdb status' to check database state")
        
        # Don't raise if partial success (some data may have been synced)
        if "error" in result.stderr.lower() and result.stdout:
            print("\n⚠️  Partial sync completed - some data was saved before error")
        else:
            raise RuntimeError(f"Sync failed: {result.stderr}")
    
    # Save sync timing for performance monitoring
    try:
        kaggle_dir = Path("/kaggle/working")
        history_dir = kaggle_dir if kaggle_dir.exists() else Path.cwd()
        with open(history_dir / "sync_history.txt", "a") as f:
            f.write(f"{start_time.isoformat()},{elapsed.total_seconds()},{result.returncode}\n")
    except Exception:
        pass  # Non-critical if we can't save timing data
        
except subprocess.TimeoutExpired:
    end_time = datetime.now()
    elapsed = end_time - start_time
    print(f"\n⏱️  Sync timed out after {elapsed.total_seconds() / 3600:.1f} hours")
    print("   Data collected up to this point has been saved to the database.")
    print("   You can re-run sync to continue where it left off.")
    print("\n💡 To avoid timeouts:")
    print("   • Use incremental sync instead of --full-refresh")
    print("   • Run during off-peak hours")
    print("   • Consider splitting into smaller batches with --max-pages")
except FileNotFoundError:
    print("❌ 'producthuntdb' command not found!")
    print("   Re-run the installation cell to fix this.")
    raise
except Exception as e:
    end_time = datetime.now()
    elapsed = end_time - start_time
    print(f"\n❌ Sync failed after {elapsed.total_seconds():.1f} seconds")
    print(f"   Error: {str(e)}")
    raise

# 5️⃣ Database Statistics & Status

Let's examine what we've collected and view key statistics about the database.
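
For a view that does not depend on the CLI, the SQLite file can also be inspected directly. This is a minimal sketch using only the standard library: `table_row_counts` is a hypothetical helper, and it discovers table names from `sqlite_master` rather than assuming ProductHuntDB's schema.

```python
import sqlite3
from contextlib import closing


def table_row_counts(db_path: str) -> dict[str, int]:
    """Return a {table name: row count} mapping for every user table."""
    # Note: sqlite3.connect creates an empty database if db_path does not exist.
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Table names come from sqlite_master itself, so quoting them is sufficient here.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
```

Calling `table_row_counts(os.environ["DB_PATH"])` after a sync gives a rough cross-check against the `status` output.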


In [None]:
# View database statistics
!producthuntdb status

# 6️⃣ Export to CSV

Export the database tables to CSV files for easy analysis and sharing.
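
Once the export has run, a short sanity check on the resulting files can catch empty or truncated tables. The sketch below assumes each CSV starts with a header row (an assumption about the export format, not something confirmed here); `csv_row_counts` is a hypothetical helper.

```python
import csv
from pathlib import Path


def csv_row_counts(export_dir: Path) -> dict[str, int]:
    """Return a {file name: data row count} mapping for each CSV in export_dir."""
    counts: dict[str, int] = {}
    for csv_file in sorted(export_dir.glob("*.csv")):
        with csv_file.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # Skip the assumed header row
            counts[csv_file.name] = sum(1 for _ in reader)
    return counts
```

Using `csv.reader` instead of counting lines keeps quoted multi-line fields, such as long descriptions, from inflating the totals.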


In [None]:
# Export database to CSV files
import subprocess
import os
from pathlib import Path

print("⏱️  Expected runtime: ~1-2 minutes\n")

export_dir = Path(os.environ.get('EXPORT_DIR', '/kaggle/working/export'))

try:
    print("📤 Exporting to CSV format...")
    result = subprocess.run(
        ["producthuntdb", "export"],
        capture_output=True,
        text=True,
        check=False
    )
    
    if result.returncode == 0:
        print("✅ CSV export completed")
        print(result.stdout)
    else:
        print("⚠️  CSV export encountered issues:")
        print(result.stderr)
        if "database is locked" in result.stderr.lower():
            print("\n💡 Database is locked - close other connections and retry")
        raise RuntimeError(f"Export failed: {result.stderr}")
    
    # List exported files
    print("\nExported files:")
    if export_dir.exists():
        for csv_file in sorted(export_dir.glob("*.csv")):
            size_mb = csv_file.stat().st_size / (1024 * 1024)
            print(f"   • {csv_file.name} ({size_mb:.2f} MB)")
    else:
        print(f"⚠️  Export directory not found: {export_dir}")
    
    print("\n✅ Export complete!")
    
except FileNotFoundError:
    print("❌ 'producthuntdb' command not found!")
    print("   Re-run the installation cell to fix this.")
    raise
except Exception as e:
    # check=False above means CalledProcessError is never raised; failures arrive here
    print(f"❌ Export failed: {str(e)}")
    raise

# 7️⃣ Publish to Kaggle

Publish your dataset to Kaggle! This will create a new dataset or update an existing one.

**Prerequisites:**

1. `KAGGLE_USERNAME` and `KAGGLE_KEY` - Your Kaggle API credentials
2. `KAGGLE_DATASET_SLUG` - Dataset identifier (e.g., `yourusername/product-hunt-database`)

**Setup:**

1. Go to **Notebook Settings** → **Add-ons** → **Secrets**
2. Add the three secrets listed above

<div class="info-card">
    <h3>📝 Note</h3>
    Publishing from a Kaggle notebook to Kaggle may have limitations. For production use, consider running the publish command from a local environment or CI/CD pipeline.
</div>


In [None]:
# Publish to Kaggle (requires credentials to be configured)
import os
import subprocess

try:
    # Check if credentials are already set from installation cell
    kaggle_username = os.getenv("KAGGLE_USERNAME")
    kaggle_key = os.getenv("KAGGLE_KEY")
    kaggle_slug = os.getenv("KAGGLE_DATASET_SLUG")
    
    if not all([kaggle_username, kaggle_key, kaggle_slug]):
        print("⚠️  Kaggle credentials not configured.")
        print("   Publishing to Kaggle requires:")
        print("   • KAGGLE_USERNAME")
        print("   • KAGGLE_KEY (from kaggle.com/settings)")
        print("   • KAGGLE_DATASET_SLUG (format: username/dataset-name)")
        print("\n   Add these as Kaggle Secrets and re-run the installation cell.")
    else:
        print(f"✅ Publishing to Kaggle dataset: {kaggle_slug}\n")
        
        result = subprocess.run(
            ["producthuntdb", "publish"],
            capture_output=True,
            text=True,
            check=False
        )
        
        if result.returncode == 0:
            print(result.stdout)
            print("\n✅ Dataset published successfully!")
            print(f"   View at: https://www.kaggle.com/datasets/{kaggle_slug}")
        else:
            print("⚠️  Publishing encountered issues:")
            print(result.stderr)
            print("\n💡 Troubleshooting:")
            print("   • Verify Kaggle credentials are correct")
            print("   • Ensure dataset exists or CLI can create it")
            print("   • Check you have write permissions")

except Exception as e:
    print(f"❌ Publishing failed: {str(e)}")
    print("   This is optional - core pipeline functionality is not affected.")

# 8️⃣ Schedule Automatic Updates

This notebook is ready for Kaggle's scheduling feature to keep your dataset current automatically.

## 🚀 Setup Instructions

1. **First Run**: Execute all cells once with `--full-refresh` in the sync cell to get historical data
2. **Enable Scheduling**:
   - Click **Notebook** → **Schedule Run**
   - Select **Daily** (recommended) or your preferred frequency
3. **That's it!** The sync cell is already configured for incremental updates

## ⚙️ How It Works

- **First run**: Use `--full-refresh` to populate database (2-4 hours)
- **Daily updates**: Default `sync` command fetches only new data (3-5 minutes)
- **Safety**: Built-in 5-minute lookback prevents data loss
- **Resilience**: Automatic retry logic handles API rate limits
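
The lookback and retry ideas above can be illustrated in a few lines. This is a sketch of the general technique, not ProductHuntDB's actual implementation: the function names, the exponential backoff schedule, and its `base` and `cap` values are all assumptions. Only the 5-minute lookback comes from the description above.

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(minutes=5)


def incremental_cutoff(last_sync: datetime) -> datetime:
    """Start slightly before the last sync so records near the boundary are re-fetched."""
    return last_sync - LOOKBACK


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Seconds to wait before each retry: doubling from base, limited to cap (assumed schedule)."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]
```

Re-fetching a small overlap is only safe if writes are idempotent, for example when rows are upserted by their ID.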

## 📊 Expected Performance

| Operation    | Duration    | Data                     |
| ------------ | ----------- | ------------------------ |
| Full Refresh | 2-4 hours   | All historical data      |
| Daily Update | 3-5 minutes | New posts since last run |
| Export       | 1-2 minutes | All tables to CSV        |
| Publish      | 1-2 minutes | Update Kaggle dataset    |

**Total daily runtime: ~10 minutes**
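
To compare these estimates with real runs, the `sync_history.txt` file written by the sync cell can be summarized. Each line there has the form `start_time,elapsed_seconds,returncode`. The helper below, `summarize_sync_history`, is a hypothetical sketch that skips any line not matching that shape.

```python
import csv
from pathlib import Path
from statistics import mean


def summarize_sync_history(path: Path) -> dict[str, float]:
    """Summarize run count, nonzero exit codes, and mean duration in minutes."""
    durations: list[float] = []
    failures = 0
    with path.open(newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 3:
                continue  # Ignore blank or malformed lines
            _started, seconds, returncode = row
            durations.append(float(seconds))
            if int(returncode) != 0:
                failures += 1
    return {
        "runs": len(durations),
        "failures": failures,
        "mean_minutes": mean(durations) / 60 if durations else 0.0,
    }
```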

# 🎬 Complete Workflow

Here's the complete workflow for managing your Product Hunt dataset:

```bash
# 1. Initialize database
producthuntdb init

# 2. Verify API authentication
producthuntdb verify

# 3. Sync data (choose one)
producthuntdb sync --full-refresh   # Full historical harvest (first run)
producthuntdb sync                  # Incremental update (daily runs)

# 4. Check database status
producthuntdb status

# 5. Export to CSV
producthuntdb export

# 6. Publish to Kaggle (optional, requires credentials)
producthuntdb publish
```
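
The same sequence can be driven from Python so that a failing step stops the run. This is a minimal sketch: `run_pipeline` and `PIPELINE` are hypothetical names, the commands are the ones listed above, and the `runner` parameter exists only so the function can be exercised without the CLI installed.

```python
import subprocess
from collections.abc import Callable, Sequence

PIPELINE: list[list[str]] = [
    ["producthuntdb", "init"],
    ["producthuntdb", "verify"],
    ["producthuntdb", "sync"],
    ["producthuntdb", "status"],
    ["producthuntdb", "export"],
]


def run_pipeline(
    steps: Sequence[Sequence[str]],
    runner: Callable = subprocess.run,
) -> list[str]:
    """Run each step in order and stop at the first nonzero exit code."""
    completed: list[str] = []
    for step in steps:
        result = runner(list(step), capture_output=True, text=True, check=False)
        if result.returncode != 0:
            raise RuntimeError(f"{' '.join(step)} failed with exit code {result.returncode}")
        completed.append(" ".join(step))
    return completed
```

Publishing is left out of `PIPELINE` because it is optional and needs credentials; it could be appended as one more step.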

## 📚 CLI Help

For detailed help on any command:

```bash
producthuntdb --help
producthuntdb sync --help
producthuntdb export --help
```

# Troubleshooting

## Common Issues

### Authentication Error
**Error**: `Authentication failed` or `Invalid token`

**Solution**:
- Get new token at: https://api.producthunt.com/v2/oauth/applications
- Add to Kaggle Secrets: Settings → Add-ons → Secrets → `PRODUCTHUNT_TOKEN`
- Verify no extra spaces or newlines in token

### Database Locked
**Error**: `database is locked`

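**Solution**: Close other connections to the database and retry. If the lock persists, reset the database as a last resort: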
```bash
# Reset database (warning: deletes all data)
!rm -f /kaggle/working/producthunt.db*
!producthuntdb init
```
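
Before deleting anything, it may be worth checking whether the lock clears on its own. The sketch below uses the standard `sqlite3` module's `timeout` parameter to wait for a write lock; `can_acquire_write_lock` is a hypothetical helper and is not part of ProductHuntDB.

```python
import sqlite3
from contextlib import closing


def can_acquire_write_lock(db_path: str, wait_seconds: float = 30.0) -> bool:
    """Return True if a write lock can be taken within wait_seconds."""
    try:
        with closing(sqlite3.connect(db_path, timeout=wait_seconds)) as conn:
            # BEGIN IMMEDIATE requests the write lock up front instead of on first write.
            conn.execute("BEGIN IMMEDIATE")
            conn.rollback()
        return True
    except sqlite3.OperationalError:
        # Raised for "database is locked", and also for other open or access errors.
        return False
```

If this returns `False` even after a long wait, another process is probably still holding the database open.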

### Rate Limiting
**Error**: `rate limit exceeded` or `429 Too Many Requests`

**Solution**:
- Built-in retry logic handles this automatically
- Reduce testing size with `--max-pages 10`
- Run during off-peak hours (late night/early morning UTC)

### Timeout (>12 hours)
**Error**: Sync takes too long

**Solution**:
- Use incremental sync instead of `--full-refresh`
- Data is saved progressively - re-run to continue

### Import Error
**Error**: `ModuleNotFoundError: No module named 'producthuntdb'`

**Solution**:
```bash
# Reinstall package
!pip install -q git+https://github.com/wyattowalsh/producthuntdb.git
```

## 📚 Resources

- **GitHub**: [github.com/wyattowalsh/producthuntdb](https://github.com/wyattowalsh/producthuntdb)
- **Issues**: [Report problems](https://github.com/wyattowalsh/producthuntdb/issues)
- **Product Hunt API**: [api.producthunt.com/v2/docs](https://api.producthunt.com/v2/docs)

# ✅ Pre-Execution Checklist

Before running this notebook on Kaggle, verify:

- [ ] **API Token Configured** - `PRODUCTHUNT_TOKEN` added in Kaggle Secrets
- [ ] **First Run Setup** - Uncomment `--full-refresh` in sync cell for initial data harvest
- [ ] **Subsequent Runs** - Re-comment `--full-refresh` after first successful run
- [ ] **(Optional) Publishing Setup** - Add `KAGGLE_USERNAME`, `KAGGLE_KEY`, `KAGGLE_DATASET_SLUG` if publishing

Once configured, schedule the notebook to run daily for automatic dataset updates! 🎉