
[Bug] firecrawl-py has outdated types for Python #1295

Open · nicholas-johnson-techxcel opened this issue Mar 5, 2025 · 8 comments

Labels: bug (Something isn't working), question (Further information is requested)

Comments

@nicholas-johnson-techxcel

Describe the Bug
Many fields, when used, cause the server to return a 400 status code with a body listing the unrecognised fields. For instance:

    class ExtractParams(pydantic.BaseModel):
        """
        Parameters for the extract operation.
        """
        prompt: Optional[str] = None
        schema_: Optional[Any] = pydantic.Field(None, alias='schema')
        system_prompt: Optional[str] = None
        allow_external_links: Optional[bool] = False
        enable_web_search: Optional[bool] = False
        # Just for backwards compatibility
        enableWebSearch: Optional[bool] = False
        show_sources: Optional[bool] = False

I can only get schema and prompt to work.

For instance, despite system_prompt being defined in the latest 1.12.0 package version:

Error during extraction: ("Unexpected error during extract: Status code 400. Bad Request - [{'code': 'unrecognized_keys', 'keys': ['system_prompt'], 'path': [], 'message': 'Unrecognized key in body -- please review the v1 API documentation for request body changes'}]", 500)

def scrape_url does not even have a params type at all.

To Reproduce
Use firecrawl-py and try to use all of the parameters defined in the package in requests to the Firecrawl API. You will get 400 BAD REQUEST, as in the sketch below.
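
A minimal reproduction sketch, assuming the FirecrawlApp.extract(urls, params) entry point from firecrawl-py 1.12.0 (the API key is a placeholder):

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR-KEY")

# system_prompt is defined on the package's ExtractParams, but the v1 API
# rejects it with 400 unrecognized_keys:
app.extract(
    ["https://example.com"],
    {
        "prompt": "Extract the page title",
        "system_prompt": "You are a careful extractor.",
    },
)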

Expected Behavior
The API packages need to be kept up to date with the API schema. Perhaps some Swagger / OpenAPI / GraphQL introspection could let the Python package update its type definitions without requiring a new release. Perhaps there should also be a check that warns when the package is no longer consistent with the API, so that developers know it is time to upgrade the package; a sketch of such a check follows.
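
Hypothetically, if the API published an OpenAPI document, the package could compare its known field list against it at startup. A sketch (the spec URL, the endpoint layout, and the SDK_EXTRACT_FIELDS set are all illustrative assumptions, not a real Firecrawl surface):

from typing import Any

import httpx

# Fields this SDK version believes /v1/extract accepts (illustrative).
SDK_EXTRACT_FIELDS = {"prompt", "schema", "systemPrompt", "allowExternalLinks",
                      "enableWebSearch", "showSources"}

def warn_on_drift(spec_url: str) -> None:
    spec: dict[str, Any] = httpx.get(spec_url).json()
    request_schema = spec["paths"]["/v1/extract"]["post"]["requestBody"][
        "content"]["application/json"]["schema"]
    server_fields = set(request_schema["properties"])
    stale = SDK_EXTRACT_FIELDS - server_fields
    if stale:
        print(f"warning: SDK fields no longer accepted by the API: {sorted(stale)}")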

The package should also have a stub file so that it can be used in a type-safe manner. My import of this library has an error:

Stub file not found for "firecrawl" Pylance [reportMissingTypeStubs](https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportMissingTypeStubs)

| Method     | Request Definitions | Response Definitions |
| ---------- | ------------------- | -------------------- |
| search     | ✓                   | ✓                    |
| extract    | ✓                   | ✓                    |
| crawl_url  | ✗                   | ✗                    |
| scrape_url | ✗                   | ✗                    |

The above table shows that types are not present for crawl_url and scrape_url, and the ones that are present are out of date.

Please can you make these changes? A stub could start as small as the sketch below.
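
A sketch of what a firecrawl/__init__.pyi stub entry might look like (these signatures are illustrative, not your actual API surface):

from typing import Any, Optional

class FirecrawlApp:
    def __init__(self, api_key: Optional[str] = None, api_url: Optional[str] = None) -> None: ...
    def scrape_url(self, url: str, params: Optional[dict[str, Any]] = None) -> dict[str, Any]: ...
    def crawl_url(self, url: str, params: Optional[dict[str, Any]] = None) -> dict[str, Any]: ...
    def extract(self, urls: list[str], params: Optional[dict[str, Any]] = None) -> dict[str, Any]: ...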

To be honest, the state of Python development is a little horrifying because so few people enable strict type checking.

Screenshots
N/A

Logs
N/A

Additional Context
I have not checked the TypeScript API but it could well need updating too.

nicholas-johnson-techxcel added the bug (Something isn't working) label on Mar 5, 2025
@nicholas-johnson-techxcel (Author)

I just noticed that version 1.13.2 is out, but it does not seem to have changed anything, other than adding deep research functions.

@nickscamara (Member) commented Mar 5, 2025

Hey @nicholas-johnson-techxcel, we're re-working the Python SDK this week! Just pushed an update that fixes the 400 status code when using the system prompt. v1.13.5 should be working as expected.

nickscamara added the question (Further information is requested) label on Mar 5, 2025
@nicholas-johnson-techxcel (Author)

Even on v1.13.5 it rejects valid schemas:

{
        "type": "object",
        "properties": {
            "urls": {
                "type": "object",
                "properties": {
                    "text": {"$ref": "#/definitions/stringOrUrl"},
                    "link": {"$ref": "#/definitions/stringOrUrl"},
                },
                "required": ["text", "link"],
            },
            "images": {
                "type": "object",
                "properties": {
                    "url": {"$ref": "#/definitions/stringOrUrl"},
                    "alt": {"$ref": "#/definitions/stringOrUrl"},
                },
                "required": ["url", "alt"],
            },
            "pdfs": {
                "type": "object",
                "properties": {
                    "url": {"$ref": "#/definitions/stringOrUrl"},
                    "title": {"$ref": "#/definitions/stringOrUrl"},
                    "text": {"$ref": "#/definitions/stringOrUrl"},
                },
                "required": ["url", "title", "text"],
            },
            "tables": {
                "type": "object",
                "properties": {
                    "headers": {
                        "type": "array",
                        "items": {"$ref": "#/definitions/stringOrUrl"},
                    },
                    "rows": {
                        "type": "array",
                        "items": {
                            "type": "array",
                            "items": {"$ref": "#/definitions/stringOrUrl"},
                        },
                    },
                },
                "required": ["headers", "rows"],
            },
            "forms": {
                "type": "object",
                "properties": {
                    "inputs": {
                        "type": "array",
                        "items": {"$ref": "#/definitions/stringOrUrl"},
                    },
                    "values": {
                        "type": "array",
                        "items": {"$ref": "#/definitions/stringOrUrl"},
                    },
                },
                "required": ["inputs", "values"],
            },
            "buttons": {
                "type": "object",
                "properties": {
                    "text": {"$ref": "#/definitions/stringOrUrl"},
                    "ref": {"$ref": "#/definitions/stringOrUrl"},
                },
                "required": ["text", "ref"],
            },
        },
        "required": ["urls", "images", "pdfs", "tables", "forms", "buttons"],
        "definitions": {
            "stringOrUrl": {
                "oneOf": [
                    {"type": "string"},
                    {
                        "type": "object",
                        "properties": {
                            "text": {"type": "string"},
                            "href": {"type": "string", "format": "uri"},
                        },
                        "required": ["text", "href"],
                    },
                ],
            },
        },
    }

@nicholas-johnson-techxcel (Author)

The API does not return the reason why it thinks the schema is invalid, either. This is an issue.
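
For what it's worth, one can at least confirm locally that a schema is well-formed before blaming the API, by checking it against the Draft 7 meta-schema with the jsonschema package (a sketch; my_schema_from_above refers to the schema posted earlier):

import jsonschema

# my_schema_from_above is the schema from the previous comment.
# Raises jsonschema.exceptions.SchemaError if the schema itself is malformed:
jsonschema.Draft7Validator.check_schema(my_schema_from_above)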

@nicholas-johnson-techxcel (Author)

That was weird: the last schema was fine, but the new one did not work until I changed the keyword to schema_. I don't understand why one schema works with schema while the other requires schema_. I do not understand why you added this field alias. In my experience, aliases in Pydantic are nothing but buggy; a sketch of the pitfall is below.
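
A sketch of the alias pitfall I mean, using the package's own field definition (Pydantic v2 behaviour):

from typing import Any, Optional

import pydantic

class ExtractParams(pydantic.BaseModel):
    schema_: Optional[Any] = pydantic.Field(None, alias="schema")

# Validation accepts only the alias by default...
ok = ExtractParams.model_validate({"schema": {"type": "object"}})
print(ok.schema_)  # {'type': 'object'}

# ...while the field name itself is silently dropped as an extra key
# (model_config extra defaults to 'ignore'), with no error raised:
silent = ExtractParams(schema_={"type": "object"})
print(silent.schema_)  # None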

@nicholas-johnson-techxcel (Author)

Okay so if I use schema_ it ignores the schema, and if I use schema, it fails validation of a valid schema and does not tell me why. How should I proceed?

@nicholas-johnson-techxcel (Author)

We are passing schema=json.dumps(my_schema_from_above) and it fails with an invalid-schema error, but it does not tell me why.

@nicholas-johnson-techxcel (Author)

I have had to discard your library because it was unusable in the end. Just to give you an idea of what you could do for type safety (two files follow: the client, then api/firecrawl_client_types.py):
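
# api/firecrawl_client.py (assumed path, matching the import of
# api.firecrawl_client_types below)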

import asyncio
import json
from datetime import UTC, datetime
from http import HTTPStatus
from typing import Any

import pydantic
from httpx import AsyncClient

from api.firecrawl_client_types import ExtractParams, FirecrawlExtractResponse, InternalFirecrawlResponse
from config import FIRECRAWL_API_KEY


class FirecrawlClient:
    def __init__(
        self,
        api_key: str | None = None,
        api_url: str = "https://api.firecrawl.dev",
    ) -> None:
        self.api_key = api_key or FIRECRAWL_API_KEY
        self.api_url = api_url
        self.headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}",
        }

    async def extract[T: pydantic.BaseModel](self, urls: list[str], params: ExtractParams[type[T]] | None) -> FirecrawlExtractResponse[T | dict[str, Any]]:
        if not params or (not params.prompt and not params.output_schema):
            raise ValueError("Either prompt or schema is required")

        request_data: dict[str, Any] = {
            "urls": urls,
            "origin": "api-sdk",
        }

        if params.output_schema:
            request_data["schema"] = params.output_schema.model_json_schema()

        if params.prompt:
            request_data["prompt"] = params.prompt

        if params.system_prompt:
            request_data["systemPrompt"] = params.system_prompt

        try:
            async with AsyncClient() as client:
                response = await client.post(
                    f"{self.api_url}/v1/extract",
                    json=request_data,
                    headers=self.headers,
                )
                if response.status_code == HTTPStatus.OK:
                    try:
                        data = InternalFirecrawlResponse(**response.json())
                    except Exception as exc:
                        raise Exception("Failed to parse Firecrawl internal response as JSON.") from exc
                    if data.success:
                        job_id = data.id
                        if not job_id:
                            raise Exception("Job ID not returned from extract request.")

                        while True:
                            status_response = await client.get(
                                f"{self.api_url}/v1/extract/{job_id}",
                                headers=self.headers,
                            )
                            if status_response.status_code == HTTPStatus.OK:
                                try:
                                    print(json.dumps(status_response.json(), indent=2))
                                    status_data = FirecrawlExtractResponse[params.output_schema or dict[str, Any]].model_validate(status_response.json())
                                except Exception as exc:
                                    raise Exception("Failed to parse Firecrawl response as JSON.") from exc
                                if status_data.status == "completed":
                                    if status_data.success:
                                        return status_data
                                    raise Exception(f"Failed to extract. Error: {status_data.error}")
                                if status_data.status in ["failed", "cancelled"]:
                                    raise Exception(f"Extract job {status_data.status}. Error: {status_data.error}")

                            await asyncio.sleep(2)
                    else:
                        raise Exception(f"Failed to extract. Error: {data.error}")
        except Exception as e:
            raise ValueError(str(e), HTTPStatus.INTERNAL_SERVER_ERROR) from e

        return FirecrawlExtractResponse[params.output_schema or dict[str, Any]](
            success=False,
            error="Internal server error.",
            status="failed",
            expires_at=datetime.now(UTC),
        )
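
# api/firecrawl_client_types.py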
from datetime import datetime
from typing import Any, Literal

import pydantic


class InternalFirecrawlError(pydantic.BaseModel):
    code: Literal["unrecognized_keys"]
    keys: list[str]
    path: list[str]
    message: str
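

class InternalFirecrawlResponse(pydantic.BaseModel):
    # Not part of the original paste, but imported and used by the client
    # above; reconstructed from that usage (success flag, optional job id,
    # optional error message from the initial POST /v1/extract response).
    success: bool
    id: str | None = None
    error: str | None = None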


class ExtractParams[T](pydantic.BaseModel):
    prompt: str | None = None
    output_schema: T | None = None
    system_prompt: str | None = None


class FirecrawlExtractResponse[T](pydantic.BaseModel):
    success: bool
    data: T | None = None
    status: Literal[
        "completed",
        "cancelled",
        "failed",
        "processing",
    ]
    expires_at: datetime = pydantic.Field(...)
    error: str | None = None
    details: InternalFirecrawlError | None = None

    @pydantic.field_validator("data", mode="before")
    @classmethod
    def validate_data_presence(cls, data: list[Any] | T | None) -> T | None:
        if isinstance(data, list):
            return None
        return data

    @pydantic.model_validator(mode="before")
    @classmethod
    def validate_expires_at(cls, data: dict[str, Any]) -> dict[str, Any]:
        if "expiresAt" in data:
            data["expires_at"] = data["expiresAt"]
            del data["expiresAt"]
        return data

This uses generic types so that the output is typed according to the output_schema param. The server response is typed according to the actual responses, with data typed as T | None. If you could please consider adopting this approach for your library, it would save your users a lot of time.
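
For illustration, a usage sketch of the client above (PageLinks is an illustrative target model, the module paths are the assumed ones from the file comments above, and the API key is a placeholder):

import asyncio

import pydantic

from api.firecrawl_client import FirecrawlClient
from api.firecrawl_client_types import ExtractParams


class PageLinks(pydantic.BaseModel):
    text: str
    link: str


async def main() -> None:
    client = FirecrawlClient(api_key="fc-YOUR-KEY")
    result = await client.extract(
        ["https://example.com"],
        ExtractParams[type[PageLinks]](
            prompt="Extract the main link on the page",
            output_schema=PageLinks,
        ),
    )
    # result.data is typed as PageLinks | dict[str, Any] | None here.
    print(result.data)


asyncio.run(main())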

I recommend avoiding Pydantic field aliasing, as it is extremely buggy in my experience. Instead I just used a model validator. That way, constructing an instance does not produce a type error saying that the aliased field is missing from your arguments.
