<a href="https://colab.research.google.com/github/trancethehuman/ai-workshop-code/blob/main/Don't_wait_and_poll_crawl_jobs_use_webhooks_to_get_notified_of_when_they're_done.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install Firecrawl's Python package.

In [None]:
pip install firecrawl-py -q

Set up Firecrawl's Python client.

In [None]:
import getpass

FIRECRAWL_API_KEY = getpass.getpass("Firecrawl API Key: ")

In [None]:
from firecrawl import FirecrawlApp

crawler = FirecrawlApp(api_key=FIRECRAWL_API_KEY)

Let's pick two competitors: Stripe and Paddle. Both process payments, but they differ in subtle ways that I don't want to spend time reading about.

The good news is that their differences show up on their pricing pages. So we're going to scrape those pages, give them to an LLM, and compare.


In [None]:
pip install fastapi uvicorn pyngrok -q

Set up our FastAPI server.

In [None]:
import threading

import uvicorn
from fastapi import FastAPI, Request
from pyngrok import ngrok, conf

# Initialize FastAPI
app = FastAPI()

# Signals that the crawl has finished. The server runs in a separate thread,
# so this is a thread-safe threading.Event rather than an asyncio.Event,
# whose wait() only returns a coroutine and never blocks the notebook cell.
crawl_completed = threading.Event()

@app.post("/webhook")
async def webhook(request: Request):
    data = await request.json()
    print(data)

    # Both terminal events release anything waiting on crawl_completed
    event_type = data.get('type')
    if event_type == 'crawl.completed':
        print("\nCrawling completed!")
        crawl_completed.set()
    elif event_type == 'crawl.failed':
        print("Crawl failed")
        crawl_completed.set()

    # Acknowledge receipt with a small JSON body
    return {"status": "ok"}
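The handler above only prints each payload. If the scraped pages are needed afterwards, one option is to collect them as the events arrive. The following is a minimal sketch, assuming the payload shape seen in the printed events (`type`, `id`, and a `data` list whose items carry a `markdown` field); `pages_by_crawl` and `handle_event` are hypothetical names, not part of Firecrawl's client.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical in-memory store: crawl id -> markdown of each scraped page.
pages_by_crawl: Dict[str, List[str]] = defaultdict(list)


def handle_event(event: Dict[str, Any]) -> bool:
    """Record pages from one webhook payload.

    Returns True when the event is terminal (completed or failed).
    """
    event_type = event.get("type")
    if event_type == "crawl.page":
        crawl_id = event.get("id", "unknown")
        for doc in event.get("data", []):
            markdown = doc.get("markdown")
            if markdown is not None:
                pages_by_crawl[crawl_id].append(markdown)
    return event_type in ("crawl.completed", "crawl.failed")
```

Inside the webhook route this could be called as `if handle_event(data): crawl_completed.set()`, which keeps the route itself short and makes the event logic testable without a running server.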





In [None]:
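# Keep the Server object at module level so it can be stopped later
# by setting server.should_exit = True.
config = uvicorn.Config(app, host="0.0.0.0", port=8000, log_level="info")
server = uvicorn.Server(config)

def run_server():
    server.run()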

In [None]:
ngrok_token = getpass.getpass("Enter your ngrok auth token: ")

In [None]:
# Set up ngrok
conf.get_default().auth_token = ngrok_token

# Start FastAPI server in a separate thread
server_thread = threading.Thread(target=run_server)
server_thread.daemon = True
server_thread.start()

# Start ngrok tunnel
public_url = ngrok.connect(8000).public_url
webhook_url = f"{public_url}/webhook"
print(f"Webhook URL: {webhook_url}")
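Before starting a real crawl, it can help to confirm that the `/webhook` route is reachable. Here is a minimal sketch that builds a test POST with the standard library. The helper `build_test_event` and the `test-id` value are hypothetical, and the payload only imitates the shape of the events printed later in this notebook.

```python
import json
import urllib.request


def build_test_event(webhook_url: str, event_type: str = "crawl.started") -> urllib.request.Request:
    """Build a POST request carrying a fake webhook payload (for smoke tests only)."""
    payload = {"success": True, "type": event_type, "id": "test-id", "data": []}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_test_event("http://localhost:8000/webhook")

# To actually send it (the FastAPI server must already be running):
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```

Sending the default `crawl.started` type exercises the route without marking the crawl as finished. Passing the public ngrok URL instead of `localhost` also checks the tunnel.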


In [None]:
# Single news site to crawl
news_site = "https://news.ycombinator.com"

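# Start the crawl without blocking. async_crawl_url (firecrawl-py 1.x) returns
# as soon as the job is queued instead of polling until it finishes;
# progress then arrives through the webhook.
print(f"\nStarting crawl for: {news_site}")
crawl_status = crawler.async_crawl_url(
    news_site,
    params={
        'limit': 10,
        'webhook': webhook_url,
        'scrapeOptions': {
            'formats': ['markdown']
        }
    }
)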

# Wait for crawl to complete
crawl_completed.wait()


Starting crawl for: https://news.ycombinator.com
{'success': True, 'type': 'crawl.started', 'id': 'bcd9a2bc-3d13-4a7a-abd5-3d1b3abec17e', 'data': []}
INFO:     34.48.34.118:0 - "POST /webhook HTTP/1.1" 200 OK
{'success': True, 'type': 'crawl.page', 'id': 'bcd9a2bc-3d13-4a7a-abd5-3d1b3abec17e', 'data': [{'markdown': "|     |     |     |\n| --- | --- | --- |\n| [![](https://news.ycombinator.com/y18.svg)](https://news.ycombinator.com) | **[Hacker News](news)<br>** [new](newest)<br> \\| [past](front)<br> \\| [comments](newcomments)<br> \\| [ask](ask)<br> \\| [show](show)<br> \\| [jobs](jobs)<br> \\| [submit](submit) | [login](login?goto=news) |\n\n|     |     |     |\n| --- | --- | --- |\n| 1.  | [](vote?id=42000784&how=up&goto=news) | [OpenZFS deduplication is good now and you shouldn't use it](https://despairlabs.com/blog/posts/2024-10-27-openzfs-dedup-is-good-dont-use-it/)<br> ([despairlabs.com](from?site=despairlabs.com)<br>) |\n|     |     | 233 points by [type0](user?id=type0)<br> [

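The webhook arrives on the server thread while the notebook cell waits on the main thread, so the completion signal has to cross threads. Here is a minimal sketch of that handoff using the standard library's `threading.Event`. A timer stands in for the webhook delivery, and the names `crawl_done` and `fake_webhook_delivery` are hypothetical. The timeout keeps the cell from hanging forever if no webhook ever arrives.

```python
import threading

# Thread-safe flag shared between the "server" thread and the waiting thread.
crawl_done = threading.Event()


def fake_webhook_delivery() -> None:
    """Stand-in for the webhook route receiving a terminal crawl event."""
    crawl_done.set()


# Simulate the webhook arriving from another thread after a short delay.
timer = threading.Timer(0.2, fake_webhook_delivery)
timer.start()

# wait() blocks the calling thread and returns True if the event was set,
# or False if the timeout expired first.
finished = crawl_done.wait(timeout=5)
print("crawl finished" if finished else "timed out waiting for the webhook")

timer.join()
```

An `asyncio.Event` behaves differently: its `wait()` returns a coroutine that has to be awaited on the event loop that owns it, and it is not designed to be set from another thread.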

In [None]:
def shutdown_server():
    """Close the ngrok tunnel and ask the uvicorn server to stop."""
    try:
        print("Shutting down the server...")
        # Stop the ngrok process started by pyngrok, which closes the tunnel.
        ngrok.kill()
        # uvicorn runs in a thread of this notebook process, so there is no
        # separate "uvicorn" process to kill. If the uvicorn.Server object is
        # available as a module-level `server`, ask it to exit gracefully;
        # otherwise the daemon thread ends when the kernel stops.
        uvicorn_server = globals().get("server")
        if uvicorn_server is not None:
            uvicorn_server.should_exit = True
    except Exception as e:
        print(f"Error shutting down server: {e}")


# Call this function to shut down the server
shutdown_server()