%%html

<style>
    @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap');
    
    .ph-header {
        background: linear-gradient(135deg, #DA552F 0%, #FF6154 100%);
        padding: 60px 40px;
        border-radius: 12px;
        text-align: center;
        margin: 30px 0;
        box-shadow: 0 8px 32px rgba(218, 85, 47, 0.2);
    }
    
    .ph-header h1 {
        color: white;
        font-family: 'Inter', sans-serif;
        font-size: 52px;
        font-weight: 700;
        margin: 0 0 12px 0;
        letter-spacing: -0.5px;
    }
    
    .ph-header p {
        color: rgba(255, 255, 255, 0.95);
        font-family: 'Inter', sans-serif;
        font-size: 18px;
        font-weight: 400;
        margin: 0;
        line-height: 1.6;
    }
    
    .info-card {
        background: #f8f9fa;
        border-left: 3px solid #DA552F;
        padding: 16px 20px;
        border-radius: 6px;
        margin: 20px 0;
    }
    
    .info-card h3 {
        color: #DA552F;
        margin: 0 0 12px 0;
        font-size: 16px;
        font-weight: 600;
    }
</style>

<div class="ph-header">
    <h1>🚀 ProductHuntDB</h1>
    <p>Product Hunt GraphQL API Data Sink & Kaggle Dataset Manager</p>
</div>


# 📖 Overview

This notebook demonstrates how to use **ProductHuntDB** to create, update, and manage a comprehensive Product Hunt dataset on Kaggle.

## ✨ Key Features

- 🔍 **Harvest** data from Product Hunt GraphQL API (posts, users, topics, collections, comments, votes)
- 💾 **Store** in optimized SQLite database with normalized schema
- 🔄 **Sync** incrementally with safety margins to avoid data loss
- 📤 **Publish** to Kaggle with automatic versioning
- ✅ **Validate** data with Pydantic v2 type-safe models

<div class="info-card">
    <h3>🎯 What You'll Learn</h3>
    <ul style="margin: 0; padding-left: 20px;">
        <li>Configure ProductHuntDB for Kaggle Notebooks</li>
        <li>Initialize database and verify API connections</li>
        <li>Perform full and incremental data syncs</li>
        <li>Export data to CSV and publish to Kaggle</li>
        <li>Query and analyze Product Hunt data</li>
    </ul>
</div>

**📚 Resources:** [GitHub](https://github.com/wyattowalsh/producthuntdb) • [Product Hunt API Docs](https://api.producthunt.com/v2/docs)


# 1️⃣ Installation & Setup

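Install ProductHuntDB and configure the environment. The cell below checks whether it is running on Kaggle, installs the package with pip from GitHub (falling back to PyPI), and prints manual instructions if pip is unavailable. It then sets the database and export paths and loads the API token.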


In [2]:
# Install ProductHuntDB
# Works in Kaggle notebooks and standard Python environments
import subprocess
import sys
import os
from pathlib import Path

print("📦 Installing ProductHuntDB...")

# Check if we're in a Kaggle environment or standard Python environment
is_kaggle = Path("/kaggle/working").exists()

if is_kaggle or "pip" in subprocess.run(
    [sys.executable, "-m", "pip", "--version"], 
    capture_output=True, text=True
).stdout:
    # Standard pip installation (works in Kaggle and most Python envs)
    try:
        subprocess.check_call([
            sys.executable, "-m", "pip", "install", "-q",
            "git+https://github.com/wyattowalsh/producthuntdb.git"
        ], stderr=subprocess.DEVNULL)
        print("✅ Installed from GitHub")
    except subprocess.CalledProcessError:
        # Fallback to PyPI (if published)
        try:
            subprocess.check_call([
                sys.executable, "-m", "pip", "install", "-q", "producthuntdb"
            ], stderr=subprocess.DEVNULL)
            print("✅ Installed from PyPI")
        except subprocess.CalledProcessError:
            print("❌ Installation failed. Please install manually.")
            print("   Run: pip install git+https://github.com/wyattowalsh/producthuntdb.git")
else:
    # For local development with uv or similar package managers
    print("⚠️  Detected non-standard Python environment.")
    print("   If using uv, run: uv sync")
    print("   Otherwise, run: pip install git+https://github.com/wyattowalsh/producthuntdb.git")

# Configure paths
print("\n🔧 Configuring environment...")
WORKING_DIR = Path("/kaggle/working") if is_kaggle else Path.cwd()
os.environ["DB_PATH"] = str(WORKING_DIR / "producthunt.db")
os.environ["EXPORT_DIR"] = str(WORKING_DIR / "export")

# Load API token from Kaggle Secrets or environment
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    os.environ["PRODUCTHUNT_TOKEN"] = user_secrets.get_secret("PRODUCTHUNT_TOKEN")
    print("✅ API token loaded from Kaggle Secrets")
except Exception:
    token = os.getenv("PRODUCTHUNT_TOKEN")
    if token:
        print("✅ API token loaded from environment")
    else:
        print("⚠️  No API token found. Please configure PRODUCTHUNT_TOKEN.")

print(f"\n📂 Working directory: {WORKING_DIR}")
print(f"💾 Database: {os.environ['DB_PATH']}")
print(f"📤 Export: {os.environ['EXPORT_DIR']}")

📦 Installing ProductHuntDB...
⚠️  Detected non-standard Python environment.
   If using uv, run: uv sync
   Otherwise, run: pip install git+https://github.com/wyattowalsh/producthuntdb.git

🔧 Configuring environment...
✅ API token loaded from environment

📂 Working directory: /Users/ww/dev/projects/producthuntdb/notebooks
💾 Database: /Users/ww/dev/projects/producthuntdb/notebooks/producthunt.db
📤 Export: /Users/ww/dev/projects/producthuntdb/notebooks/export


## 🔐 Configuration

### Required: Product Hunt API Token

ProductHuntDB requires a Product Hunt API token to access the GraphQL API.

**On Kaggle:**

1. Go to **Notebook Settings** → **Add-ons** → **Secrets**
2. Add secret: `PRODUCTHUNT_TOKEN` = Your API token from [api.producthunt.com](https://api.producthunt.com/v2/oauth/applications)

**For local development:** Create a `.env` file containing `PRODUCTHUNT_TOKEN=your_token_here`.

<div class="info-card">
    <h3>⚠️ Security Note</h3>
    Never commit API tokens to version control! Always use environment variables or Kaggle Secrets.
</div>
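
The setup cell above only reads `PRODUCTHUNT_TOKEN` from the process environment when Kaggle Secrets are unavailable. If your local setup does not pick up the `.env` file on its own, a minimal loader can be sketched as follows. The helper names `load_env_file` and `apply_env` are hypothetical and not part of ProductHuntDB, and the parser only handles plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blanks, # comments, and malformed lines."""
    values: dict[str, str] = {}
    if not path.exists():
        return values
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values: dict[str, str]) -> None:
    """Copy parsed values into os.environ without overriding existing ones."""
    for key, value in values.items():
        os.environ.setdefault(key, value)


# Assumes a .env file next to the notebook; a missing file is simply ignored.
apply_env(load_env_file(Path(".env")))
print("PRODUCTHUNT_TOKEN set:", bool(os.getenv("PRODUCTHUNT_TOKEN")))
```

The final line reports only whether the token is set, so the secret itself never appears in the notebook output.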

### Optional: Kaggle Publishing

To publish datasets to Kaggle, add these additional secrets:

- `KAGGLE_USERNAME` - Your Kaggle username
- `KAGGLE_KEY` - Your Kaggle API key (from kaggle.com/settings)
- `KAGGLE_DATASET_SLUG` - Dataset slug (format: `username/dataset-name`)


# 2️⃣ Initialize Database & Verify Connection

Let's initialize the database and verify our API authentication works correctly.


In [None]:
# Initialize database and verify authentication
!producthuntdb init
!producthuntdb verify
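
To double-check the result of `init` from Python, you can list the tables in the SQLite file. This is a small sketch that assumes the database lives at `DB_PATH` (set in the setup cell); `list_tables` is a hypothetical helper, and it makes no assumption about which tables ProductHuntDB creates.

```python
import os
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path: Path) -> list[str]:
    """Return the names of user tables in the SQLite file, sorted."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    # Skip SQLite's internal bookkeeping tables.
    return [name for (name,) in rows if not name.startswith("sqlite_")]


db_path = Path(os.environ.get("DB_PATH", "producthunt.db"))
if db_path.exists():
    print(f"{db_path} contains {len(list_tables(db_path))} tables")
else:
    print(f"{db_path} not found; run `producthuntdb init` first")
```

The existence check matters because `sqlite3.connect` would otherwise create an empty file at that path.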

# 3️⃣ Sync Data from Product Hunt

Sync data from the Product Hunt API. Start with a limited sync to test, then perform a full sync when ready.

**Sync Options:**

- **Full Refresh** - Downloads all historical data (can take several hours)
- **Incremental Update** - Only syncs new/updated data since last run (fast)
- **Limited Sync** - Fetches a specific number of pages (good for testing)

<div class="info-card">
    <h3>💡 Recommended Approach</h3>
    Start with <code>--max-pages 10</code> to test the setup, then run a full sync: <code>producthuntdb sync --full-refresh</code>
</div>
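
The idea behind an incremental update with a safety margin can be sketched like this. It illustrates the general technique, not ProductHuntDB's actual implementation: the function name `next_sync_start` and the five-minute margin are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical margin; the value ProductHuntDB actually uses is not stated here.
SAFETY_MARGIN = timedelta(minutes=5)


def next_sync_start(last_synced_at: Optional[datetime]) -> Optional[datetime]:
    """Return the timestamp to fetch from, or None when everything must be fetched."""
    if last_synced_at is None:
        return None  # nothing stored yet: fetch the full history
    # Re-fetch a small overlap so records updated near the boundary are not missed.
    return last_synced_at - SAFETY_MARGIN


last_run = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(next_sync_start(last_run))  # 2024-01-01 11:55:00+00:00
```

Because the overlap re-fetches some records, this approach only works when writes are idempotent, for example an upsert keyed on the record ID.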


In [None]:
# Sync data from Product Hunt
# Options:
#   --full-refresh, -f    : Perform full refresh instead of incremental update
#   --max-pages, -n N     : Maximum pages to fetch (for testing)
#   --posts-only          : Only sync posts (skip topics/collections)

# Example: Limited sync for testing (10 pages)
!producthuntdb sync --max-pages 10

# For full refresh (uncomment):
# !producthuntdb sync --full-refresh

# For incremental update (uncomment):
# !producthuntdb sync

# 4️⃣ Database Statistics & Status

Let's examine what we've collected and view key statistics about the database.


In [None]:
# View database statistics
!producthuntdb status

# 5️⃣ Query & Analyze Data

Let's explore the data we've collected with some SQL queries using pandas.


In [None]:
import pandas as pd
import sqlite3
import os
from pathlib import Path

# Connect to the database
db_path = Path(os.environ.get('DB_PATH', '/kaggle/working/producthunt.db'))
conn = sqlite3.connect(db_path)

print("🏆 Top 10 Products by Votes\n")
top_posts = pd.read_sql_query(
    """
    SELECT 
        name,
        tagline,
        votes_count,
        comments_count,
        featured_at,
        url
    FROM post_row
    ORDER BY votes_count DESC
    LIMIT 10
""",
    conn,
)
display(top_posts)

print("\n👤 Top 10 Most Active Makers\n")
active_makers = pd.read_sql_query(
    """
    SELECT 
        u.name,
        u.username,
        COUNT(DISTINCT mpl.post_id) as products_made,
        u.url
    FROM user_row u
    JOIN maker_post_link mpl ON u.id = mpl.maker_id
    GROUP BY u.id
    ORDER BY products_made DESC
    LIMIT 10
""",
    conn,
)
display(active_makers)

print("\n🏷️  Top 10 Popular Topics\n")
popular_topics = pd.read_sql_query(
    """
    SELECT 
        t.name,
        t.slug,
        COUNT(DISTINCT ptl.post_id) as product_count
    FROM topic_row t
    JOIN post_topic_link ptl ON t.id = ptl.topic_id
    GROUP BY t.id
    ORDER BY product_count DESC
    LIMIT 10
""",
    conn,
)
display(popular_topics)

conn.close()
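
If you want to reuse these queries with different limits, a parameterized helper keeps the SQL fixed and passes the value separately. The sketch below uses the `post_row` table and columns from the cell above; the function name `top_posts_by_votes` is hypothetical, and the demo rows are made up for illustration rather than taken from Product Hunt.

```python
import sqlite3
from contextlib import closing


def top_posts_by_votes(conn: sqlite3.Connection, limit: int = 10) -> list[tuple]:
    """Return (name, votes_count) for the highest-voted posts."""
    query = """
        SELECT name, votes_count
        FROM post_row
        ORDER BY votes_count DESC
        LIMIT ?
    """
    # The limit is bound as a parameter instead of being formatted into the SQL.
    return conn.execute(query, (limit,)).fetchall()


# Demo against an in-memory table with made-up rows (not real Product Hunt data).
with closing(sqlite3.connect(":memory:")) as demo:
    demo.execute("CREATE TABLE post_row (name TEXT, votes_count INTEGER)")
    demo.executemany(
        "INSERT INTO post_row VALUES (?, ?)",
        [("Alpha", 120), ("Beta", 340), ("Gamma", 75)],
    )
    print(top_posts_by_votes(demo, limit=2))  # [('Beta', 340), ('Alpha', 120)]
```

Wrapping the connection in `contextlib.closing` also guarantees it is closed if a query raises, which the explicit `conn.close()` at the end of a cell does not.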

## 📈 Visualize Trends

Let's create some visualizations to understand Product Hunt trends better.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import sqlite3
import os
from pathlib import Path

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)

# Connect to database
db_path = Path(os.environ.get('DB_PATH', '/kaggle/working/producthunt.db'))
conn = sqlite3.connect(db_path)

# Posts by day of week
print("📅 Product Launches by Day of Week\n")
posts_by_day = pd.read_sql_query(
    """
    SELECT 
        CASE CAST(strftime('%w', featured_at) AS INTEGER)
            WHEN 0 THEN 'Sunday'
            WHEN 1 THEN 'Monday'
            WHEN 2 THEN 'Tuesday'
            WHEN 3 THEN 'Wednesday'
            WHEN 4 THEN 'Thursday'
            WHEN 5 THEN 'Friday'
            WHEN 6 THEN 'Saturday'
        END as day_of_week,
        COUNT(*) as count
    FROM post_row
    WHERE featured_at IS NOT NULL
    GROUP BY day_of_week
    ORDER BY CAST(strftime('%w', featured_at) AS INTEGER)
""",
    conn,
)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Day of week distribution
ax1 = axes[0]
sns.barplot(data=posts_by_day, x='day_of_week', y='count', palette='viridis', ax=ax1)
ax1.set_title('Product Launches by Day of Week', fontsize=16, fontweight='bold')
ax1.set_xlabel('Day of Week', fontsize=12)
ax1.set_ylabel('Number of Products', fontsize=12)
ax1.tick_params(axis='x', rotation=45)

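# Votes vs Comments correlation (up to 1,000 posts with non-zero counts;
# without an ORDER BY, the rows returned are an arbitrary sample, not a random one)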
posts_metrics = pd.read_sql_query(
    """
    SELECT 
        votes_count,
        comments_count
    FROM post_row
    WHERE votes_count > 0 AND comments_count > 0
    LIMIT 1000
""",
    conn,
)

ax2 = axes[1]
ax2.scatter(
    posts_metrics['votes_count'], posts_metrics['comments_count'], alpha=0.5, c='#DA552F', s=30
)
ax2.set_title('Votes vs Comments Correlation', fontsize=16, fontweight='bold')
ax2.set_xlabel('Votes Count', fontsize=12)
ax2.set_ylabel('Comments Count', fontsize=12)
ax2.set_xscale('log')
ax2.set_yscale('log')

plt.tight_layout()
plt.show()

conn.close()
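
The scatter plot shows the relationship visually; to put a number on it, you can compute the Pearson correlation of the log-transformed counts, which matches the log-log axes used above. This is a standard-library sketch; `log_pearson` is a hypothetical helper, and it requires strictly positive inputs, which the `votes_count > 0 AND comments_count > 0` filter in the query provides.

```python
import math


def log_pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of log10(x) and log10(y); inputs must be positive."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    # Sums of centered products and squares; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var_x = sum((a - mean_x) ** 2 for a in lx)
    var_y = sum((b - mean_y) ** 2 for b in ly)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

With the DataFrame from the cell above, the call would be `log_pearson(posts_metrics['votes_count'].tolist(), posts_metrics['comments_count'].tolist())`.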

# 6️⃣ Export to CSV

Export the database tables to CSV files for easy analysis and sharing.


In [None]:
# Export database to CSV files
!producthuntdb export

# List exported files
import os
from pathlib import Path

export_dir = Path(os.environ.get('EXPORT_DIR', '/kaggle/working/export'))
if export_dir.exists():
    print("\n📁 Exported files:")
    for csv_file in sorted(export_dir.glob("*.csv")):
        size_kb = csv_file.stat().st_size / 1024
        print(f"   • {csv_file.name} ({size_kb:.1f} KB)")
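
File sizes alone do not tell you whether an export is complete. A quick sanity check is to count the data rows in each CSV and compare them with the table counts reported by `producthuntdb status`. The sketch below assumes each exported file is UTF-8 encoded and starts with a header row; `csv_row_counts` is a hypothetical helper.

```python
import csv
from pathlib import Path


def csv_row_counts(export_dir: Path) -> dict[str, int]:
    """Map each CSV file name to its number of data rows (header excluded)."""
    counts: dict[str, int] = {}
    for csv_file in sorted(export_dir.glob("*.csv")):
        with csv_file.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row, if the file has one
            counts[csv_file.name] = sum(1 for _ in reader)
    return counts
```

Using `csv.reader` rather than counting lines keeps the result correct when a field, such as a product description, contains embedded newlines.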

# 7️⃣ Publish to Kaggle

Publish your dataset to Kaggle! This will create a new dataset or update an existing one.

**Prerequisites:**

1. `KAGGLE_USERNAME` and `KAGGLE_KEY` - Your Kaggle API credentials
2. `KAGGLE_DATASET_SLUG` - Dataset identifier (e.g., `yourusername/product-hunt-database`)

**Setup:**

1. Go to **Notebook Settings** → **Add-ons** → **Secrets**
2. Add the three secrets listed above

<div class="info-card">
    <h3>📝 Note</h3>
    Publishing from a Kaggle notebook to Kaggle may have limitations. For production use, consider running the publish command from a local environment or CI/CD pipeline.
</div>


In [None]:
# Configure Kaggle publishing credentials (if not already set)
import os

try:
    from kaggle_secrets import UserSecretsClient

    user_secrets = UserSecretsClient()

    os.environ["KAGGLE_USERNAME"] = user_secrets.get_secret("KAGGLE_USERNAME")
    os.environ["KAGGLE_KEY"] = user_secrets.get_secret("KAGGLE_KEY")
    os.environ["KAGGLE_DATASET_SLUG"] = user_secrets.get_secret("KAGGLE_DATASET_SLUG")

    print("✅ Kaggle credentials loaded from secrets")
    print(f"   Dataset: {os.environ['KAGGLE_DATASET_SLUG']}")

    # Publish to Kaggle using CLI
    print("\n📤 Publishing to Kaggle...")
    !producthuntdb publish

    print(
        f"\n✅ View your dataset at: https://www.kaggle.com/datasets/{os.environ['KAGGLE_DATASET_SLUG']}"
    )

except Exception as e:
    print(f"⚠️  Could not load Kaggle credentials: {e}")
    print("   Configure KAGGLE_USERNAME, KAGGLE_KEY, and KAGGLE_DATASET_SLUG in Kaggle Secrets.")
    print("   Or run manually: !producthuntdb publish")
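
Before publishing, it can help to check that all three settings are present instead of waiting for the publish step to fail. The following sketch is one way to do that; `missing_kaggle_settings` and `slug_is_well_formed` are hypothetical helpers, and the slug check only enforces the `username/dataset-name` shape described above.

```python
import os
from collections.abc import Mapping

REQUIRED_KAGGLE_SETTINGS = ("KAGGLE_USERNAME", "KAGGLE_KEY", "KAGGLE_DATASET_SLUG")


def missing_kaggle_settings(env: Mapping[str, str]) -> list[str]:
    """Return the required setting names that are unset or empty in env."""
    return [name for name in REQUIRED_KAGGLE_SETTINGS if not env.get(name)]


def slug_is_well_formed(slug: str) -> bool:
    """True if the slug has exactly two non-empty parts separated by one slash."""
    parts = slug.split("/")
    return len(parts) == 2 and all(parts)


missing = missing_kaggle_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
elif not slug_is_well_formed(os.environ["KAGGLE_DATASET_SLUG"]):
    print("KAGGLE_DATASET_SLUG should look like username/dataset-name")
else:
    print("All Kaggle publishing settings are present")
```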

# 8️⃣ Schedule Regular Updates

Schedule this notebook to run periodically on Kaggle for automatic dataset updates.

**Setup Scheduling:**

1. Click **Notebook** → **Schedule**
2. Choose frequency (daily, weekly, etc.)
3. The notebook will then run automatically on that schedule and update your dataset

**Recommended Settings for Scheduled Runs:**

```python
# In section 3 (Sync Data from Product Hunt), change the sync command to:
!producthuntdb sync  # Incremental update (no --max-pages limit)
```

This fetches only new data since the last run, making updates fast and efficient!

<div class="info-card">
    <h3>💡 Best Practices</h3>
    <ul style="margin: 0; padding-left: 20px;">
        <li>Run a full refresh once initially: <code>producthuntdb sync --full-refresh</code></li>
        <li>Use incremental updates for scheduled runs: <code>producthuntdb sync</code></li>
        <li>Monitor execution logs to ensure successful runs</li>
        <li>Keep API tokens secure using Kaggle Secrets</li>
    </ul>
</div>
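
For a scheduled run, the choice between a full refresh and an incremental update can be made automatically. The sketch below picks `--full-refresh` when the database has no posts yet and a plain `sync` otherwise. It assumes the `post_row` table name used in the queries earlier in this notebook, and `has_synced_posts` and `sync_command` are hypothetical helpers rather than part of the ProductHuntDB CLI.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def has_synced_posts(db_path: Path) -> bool:
    """True if the database exists and post_row holds at least one row."""
    if not db_path.exists():
        return False
    with closing(sqlite3.connect(db_path)) as conn:
        try:
            row = conn.execute("SELECT 1 FROM post_row LIMIT 1").fetchone()
        except sqlite3.OperationalError:
            return False  # table not created yet
    return row is not None


def sync_command(db_path: Path) -> list[str]:
    """Full refresh for an empty database, incremental update otherwise."""
    command = ["producthuntdb", "sync"]
    if not has_synced_posts(db_path):
        command.append("--full-refresh")
    return command
```

You could then run the result with `subprocess.run(sync_command(db_path), check=True)`. This assumes the database file from the previous run is still available at `DB_PATH` when the scheduled run starts; if it is not, the sketch falls back to a full refresh.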


# 🎬 Complete Workflow Summary

Here's the complete ProductHuntDB workflow using the CLI:

```bash
# 1. Initialize database
producthuntdb init

# 2. Verify API authentication
producthuntdb verify

# 3. Sync data (choose one)
producthuntdb sync --max-pages 10        # Limited sync (testing)
producthuntdb sync --full-refresh        # Full historical harvest
producthuntdb sync                       # Incremental update

# 4. Check database status
producthuntdb status

# 5. Export to CSV
producthuntdb export

# 6. Publish to Kaggle (requires credentials)
producthuntdb publish

# Advanced: Database migrations
producthuntdb migration-history          # View migration history
producthuntdb migrate "description"      # Create new migration
producthuntdb upgrade head               # Apply migrations
producthuntdb downgrade -1               # Rollback one revision
```

## 📚 CLI Help

For detailed help on any command:

```bash
producthuntdb --help
producthuntdb sync --help
producthuntdb export --help
```


# 🎓 Additional Resources

## 📖 Documentation

- **Full Documentation**: [GitHub Repository](https://github.com/wyattowalsh/producthuntdb)
- **API Reference**: [Product Hunt GraphQL API](https://api.producthunt.com/v2/docs)
- **Database Schema**: See `producthuntdb/models.py` for complete schema
- **Configuration Options**: See `producthuntdb/config.py` for all settings

## 🛠️ Troubleshooting

### Database Issues

```python
# Reset database (warning: deletes all data)
# Adjust the path if DB_PATH points somewhere other than /kaggle/working
!rm -f /kaggle/working/producthunt.db*
!producthuntdb init
```

### API Rate Limiting

The pipeline includes automatic retry logic with exponential backoff. If you hit rate limits:

- Reduce `--max-pages` for testing
- Use incremental updates instead of full refresh
- Check your API token is valid with `producthuntdb verify`
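
For readers unfamiliar with the technique, retry with exponential backoff can be sketched as below. This shows the general pattern, not ProductHuntDB's internal retry code: the function names, the default delays, and the use of full jitter are assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delays of base * 2**attempt seconds, each capped at `cap`."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def call_with_backoff(
    func: Callable[[], T],
    retries: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with jittered exponential delays."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return func()
        except Exception:
            # Full jitter: wait a random time up to the current delay.
            sleep(random.uniform(0, delay))
    return func()  # final attempt; let any exception propagate


print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

The doubling delay gives a rate-limited API time to recover, and the random jitter keeps several clients from retrying at the same instant. A production version would normally retry only on specific errors, such as HTTP 429 responses, instead of on every exception.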

### Import Errors

```python
# Reinstall the package (run in a notebook cell, then restart the kernel)
!pip uninstall -y producthuntdb
!pip install -q git+https://github.com/wyattowalsh/producthuntdb.git
```

## 🤝 Contributing

Found a bug or have a feature request?

- **Issues**: [GitHub Issues](https://github.com/wyattowalsh/producthuntdb/issues)
- **Pull Requests**: Contributions welcome!

## 📄 License

MIT License - see [LICENSE](https://github.com/wyattowalsh/producthuntdb/blob/main/LICENSE) for details.
