[Bug] firecrawl-py has outdated types for Python #1295
Comments
I just noticed that version
Hey @nicholas-johnson-techxcel, we're re-working the Python SDK this week! Just pushed an update that fixes the 400 status code when using the system prompt. v1.13.5 should be working as expected.
Even on v1.13.5 it rejects valid schemas:

```json
{
  "type": "object",
  "properties": {
    "urls": {
      "type": "object",
      "properties": {
        "text": {"$ref": "#/definitions/stringOrUrl"},
        "link": {"$ref": "#/definitions/stringOrUrl"}
      },
      "required": ["text", "link"]
    },
    "images": {
      "type": "object",
      "properties": {
        "url": {"$ref": "#/definitions/stringOrUrl"},
        "alt": {"$ref": "#/definitions/stringOrUrl"}
      },
      "required": ["url", "alt"]
    },
    "pdfs": {
      "type": "object",
      "properties": {
        "url": {"$ref": "#/definitions/stringOrUrl"},
        "title": {"$ref": "#/definitions/stringOrUrl"},
        "text": {"$ref": "#/definitions/stringOrUrl"}
      },
      "required": ["url", "title", "text"]
    },
    "tables": {
      "type": "object",
      "properties": {
        "headers": {
          "type": "array",
          "items": {"$ref": "#/definitions/stringOrUrl"}
        },
        "rows": {
          "type": "array",
          "items": {
            "type": "array",
            "items": {"$ref": "#/definitions/stringOrUrl"}
          }
        }
      },
      "required": ["headers", "rows"]
    },
    "forms": {
      "type": "object",
      "properties": {
        "inputs": {
          "type": "array",
          "items": {"$ref": "#/definitions/stringOrUrl"}
        },
        "values": {
          "type": "array",
          "items": {"$ref": "#/definitions/stringOrUrl"}
        }
      },
      "required": ["inputs", "values"]
    },
    "buttons": {
      "type": "object",
      "properties": {
        "text": {"$ref": "#/definitions/stringOrUrl"},
        "ref": {"$ref": "#/definitions/stringOrUrl"}
      },
      "required": ["text", "ref"]
    }
  },
  "required": ["urls", "images", "pdfs", "tables", "forms", "buttons"],
  "definitions": {
    "stringOrUrl": {
      "oneOf": [
        {"type": "string"},
        {
          "type": "object",
          "properties": {
            "text": {"type": "string"},
            "href": {"type": "string", "format": "uri"}
          },
          "required": ["text", "href"]
        }
      ]
    }
  }
}
```
The API does not return the reason why it thinks the schema is invalid, either. This is an issue.
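If the rejection is triggered by the `$ref`/`definitions` indirection itself (an assumption, since the API gives no reason), one workaround is to inline the local refs before sending the schema. A minimal sketch, assuming non-recursive definitions like the ones above:

```python
from typing import Any


def inline_local_refs(node: Any, definitions: dict[str, Any]) -> Any:
    """Recursively replace local '#/definitions/...' refs with their bodies.

    Assumes definitions are not self-referential, otherwise this recurses
    forever.
    """
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/definitions/"):
            name = ref.rsplit("/", 1)[-1]
            return inline_local_refs(definitions[name], definitions)
        # Drop the now-unused "definitions" block while copying.
        return {
            k: inline_local_refs(v, definitions)
            for k, v in node.items()
            if k != "definitions"
        }
    if isinstance(node, list):
        return [inline_local_refs(item, definitions) for item in node]
    return node


def dereference(schema: dict[str, Any]) -> dict[str, Any]:
    return inline_local_refs(schema, schema.get("definitions", {}))
```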
That was weird - the last schema was fine, but the new one did not work. Until I changed it to

Okay so if I use

We are loading
I have had to discard your library because it was unusable in the end. Just to give you an idea of what you could do for type safety:

```python
import asyncio
import json
from datetime import UTC, datetime
from http import HTTPStatus
from typing import Any

import pydantic
from httpx import AsyncClient

from api.firecrawl_client_types import (
    ExtractParams,
    FirecrawlExtractResponse,
    InternalFirecrawlResponse,
)
from config import FIRECRAWL_API_KEY


class FirecrawlClient:
    def __init__(
        self,
        api_key: str | None = None,
        api_url: str = "https://api.firecrawl.dev",
    ) -> None:
        self.api_key = api_key or FIRECRAWL_API_KEY
        self.api_url = api_url
        self.headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}",
        }

    async def extract[T: pydantic.BaseModel](
        self,
        urls: list[str],
        params: ExtractParams[type[T]] | None,
    ) -> FirecrawlExtractResponse[T | dict[str, Any]]:
        if not params or (not params.prompt and not params.output_schema):
            raise ValueError("Either prompt or schema is required")
        # The v1 API expects camelCase keys, e.g. systemPrompt.
        request_data: dict[str, Any] = {
            "urls": urls,
            "origin": "api-sdk",
        }
        if params.output_schema:
            request_data["schema"] = params.output_schema.model_json_schema()
        if params.prompt:
            request_data["prompt"] = params.prompt
        if params.system_prompt:
            request_data["systemPrompt"] = params.system_prompt
        try:
            async with AsyncClient() as client:
                response = await client.post(
                    f"{self.api_url}/v1/extract",
                    json=request_data,
                    headers=self.headers,
                )
                if response.status_code == HTTPStatus.OK:
                    try:
                        data = InternalFirecrawlResponse(**response.json())
                    except Exception as e:
                        raise Exception(
                            "Failed to parse Firecrawl internal response as JSON."
                        ) from e
                    if data.success:
                        job_id = data.id
                        if not job_id:
                            raise Exception("Job ID not returned from extract request.")
                        # Poll the job until it reaches a terminal state.
                        while True:
                            status_response = await client.get(
                                f"{self.api_url}/v1/extract/{job_id}",
                                headers=self.headers,
                            )
                            if status_response.status_code == HTTPStatus.OK:
                                try:
                                    # Debug output of the raw job status.
                                    print(json.dumps(status_response.json(), indent=2))
                                    status_data = FirecrawlExtractResponse[
                                        params.output_schema or dict[str, Any]
                                    ].model_validate(status_response.json())
                                except Exception as e:
                                    raise Exception(
                                        "Failed to parse Firecrawl response as JSON."
                                    ) from e
                                if status_data.status == "completed":
                                    if status_data.success:
                                        return status_data
                                    raise Exception(
                                        f"Failed to extract. Error: {status_data.error}"
                                    )
                                if status_data.status in ["failed", "cancelled"]:
                                    raise Exception(
                                        f"Extract job {status_data.status}. "
                                        f"Error: {status_data.error}"
                                    )
                            # Job still processing; wait before polling again.
                            await asyncio.sleep(2)
                    else:
                        raise Exception(f"Failed to extract. Error: {data.error}")
        except Exception as e:
            raise ValueError(str(e), HTTPStatus.INTERNAL_SERVER_ERROR) from e
        # Reached only if the initial POST did not return 200.
        return FirecrawlExtractResponse[params.output_schema or dict[str, Any]](
            success=False,
            error="Internal server error.",
            status="failed",
            expires_at=datetime.now(UTC),
        )
```

And the parameter/response types (a separate module, e.g. `api/firecrawl_client_types.py`):

```python
from datetime import datetime
from typing import Any, Literal

import pydantic


class InternalFirecrawlError(pydantic.BaseModel):
    code: Literal["unrecognized_keys"]
    keys: list[str]
    path: list[str]
    message: str


# Not shown in the original snippet; shape inferred from its usage in the
# client above (the initial POST /v1/extract response).
class InternalFirecrawlResponse(pydantic.BaseModel):
    success: bool
    id: str | None = None
    error: str | None = None


class ExtractParams[T](pydantic.BaseModel):
    prompt: str | None = None
    output_schema: T | None = None
    system_prompt: str | None = None


class FirecrawlExtractResponse[T](pydantic.BaseModel):
    success: bool
    data: T | None = None
    status: Literal[
        "completed",
        "cancelled",
        "failed",
        "processing",
    ]
    expires_at: datetime = pydantic.Field(...)
    error: str | None = None
    details: InternalFirecrawlError | None = None

    @pydantic.field_validator("data", mode="before")
    @classmethod
    def validate_data_presence(cls, data: list[Any] | T | None) -> T | None:
        # The API sometimes returns an empty list instead of null for data.
        if isinstance(data, list):
            return None
        return data

    @pydantic.model_validator(mode="before")
    @classmethod
    def validate_expires_at(cls, data: dict[str, Any]) -> dict[str, Any]:
        # Map the API's camelCase key onto the snake_case field.
        if "expiresAt" in data:
            data["expires_at"] = data["expiresAt"]
            del data["expiresAt"]
        return data
```

This uses generic types so that the output is typed according to the output_schema param. The server response is typed according to the actual responses, with `data: T | None`. If you could please consider adopting this approach for your library, it would save your users a lot of time. I recommend avoiding the use of Pydantic field aliasing as it is extremely buggy. Instead I just used a model validator. That way, constructing a class does not produce a type error saying that your aliased field is not provided in your args.
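For contrast, here is a minimal sketch of the alias-based approach the comment above argues against, using the `expiresAt` key from the code above; `AliasedResponse` is an illustrative name, not part of any SDK:

```python
from datetime import datetime

import pydantic


class AliasedResponse(pydantic.BaseModel):
    # populate_by_name=True is needed for the field-name constructor call
    # below to validate at runtime.
    model_config = pydantic.ConfigDict(populate_by_name=True)

    expires_at: datetime = pydantic.Field(alias="expiresAt")


# Both calls validate at runtime, but static checkers may flag the second,
# because with an alias set they expect "expiresAt" in the constructor --
# exactly the friction described above.
AliasedResponse.model_validate({"expiresAt": "2025-01-01T00:00:00Z"})
AliasedResponse(expires_at=datetime(2025, 1, 1))
```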
Describe the Bug
Many fields, when used, make the server return a 400 status code with a body listing the unrecognised fields. I can only get `schema` and `prompt` to work.
For instance, despite `system_prompt` being defined in the latest 1.12.0 package version:
```
Error during extraction: ("Unexpected error during extract: Status code 400. Bad Request - [{'code': 'unrecognized_keys', 'keys': ['system_prompt'], 'path': [], 'message': 'Unrecognized key in body -- please review the v1 API documentation for request body changes'}]", 500)
```
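The error suggests the server rejects the SDK's snake_case key; the client code posted later in this thread shows the API accepting `systemPrompt` instead. A minimal sketch of calling the endpoint directly with the camelCase key (endpoint and key names taken from this thread, the prompt values are placeholders):

```python
import httpx

payload = {
    "urls": ["https://example.com"],
    "prompt": "Extract the page title.",
    "systemPrompt": "You are a careful extractor.",  # camelCase, not system_prompt
}

response = httpx.post(
    "https://api.firecrawl.dev/v1/extract",
    json=payload,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
print(response.status_code, response.json())
```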
`def scrape_url` does not even have a params type schema at all.

To Reproduce
Use `firecrawl-py` and try to use all of the parameters defined in the package in requests to the Firecrawl API. You will get 400 Bad Request.

Expected Behavior
The API packages need to be kept up to date with the API schema. Perhaps some Swagger / OpenAPI / GraphQL introspection would let the Python package update its type definitions without requiring a new release. Perhaps there should also be a check that warns when the package is no longer consistent with the API, so that developers know it is time to upgrade the package.
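For example, such a drift check could fetch the API's spec at startup and compare it with the SDK's parameter names. A minimal sketch, assuming a hypothetical OpenAPI spec URL (Firecrawl may not publish one) and a hand-maintained set of SDK param names:

```python
import httpx


def check_sdk_drift(spec_url: str, sdk_param_names: set[str]) -> set[str]:
    """Return SDK parameter names the live API spec no longer documents."""
    spec = httpx.get(spec_url).json()
    body_schema = (
        spec["paths"]["/v1/extract"]["post"]["requestBody"]["content"]
        ["application/json"]["schema"]
    )
    api_keys = set(body_schema.get("properties", {}))
    return sdk_param_names - api_keys


# Usage: warn instead of sending keys the server will 400 on.
stale = check_sdk_drift(
    "https://api.firecrawl.dev/openapi.json",  # assumed URL, for illustration
    {"urls", "prompt", "systemPrompt", "schema", "system_prompt"},
)
if stale:
    print(f"Warning: firecrawl-py params not in API spec: {sorted(stale)}")
```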
The package should also have a stub file so that it can be used in a typesafe manner. My import of this library has an error.

| Method | Params type |
| --- | --- |
| `search` | present, but outdated |
| `extract` | present, but outdated |
| `crawl_url` | missing |
| `scrape_url` | missing |

The table above shows that types are not present for `crawl_url` and `scrape_url`, and the ones that are present are out of date. A sketch of what such a stub could look like follows the table.
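A minimal sketch of a stub file for the four methods in the table; the signatures are illustrative assumptions, not the SDK's actual API surface:

```python
# firecrawl.pyi -- illustrative stub; signatures below are assumptions,
# not firecrawl-py's actual definitions.
from typing import Any, TypedDict


class ExtractParams(TypedDict, total=False):
    prompt: str
    schema: dict[str, Any]
    systemPrompt: str


class FirecrawlApp:
    def __init__(self, api_key: str | None = ..., api_url: str = ...) -> None: ...
    def search(self, query: str, params: dict[str, Any] | None = ...) -> dict[str, Any]: ...
    def extract(self, urls: list[str], params: ExtractParams | None = ...) -> dict[str, Any]: ...
    def crawl_url(self, url: str, params: dict[str, Any] | None = ...) -> dict[str, Any]: ...
    def scrape_url(self, url: str, params: dict[str, Any] | None = ...) -> dict[str, Any]: ...
```

With a stub shipped alongside the package (or a `py.typed` marker plus inline annotations), strict type checkers could catch the outdated-parameter errors reported above at edit time instead of at request time.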
Please can you make these changes?
To be honest, the state of Python development is a little horrifying because so few people enable strict type checking.
Screenshots
N/A
Logs
N/A
Additional Context
I have not checked the TypeScript API but it could well need updating too.