Web Archive (WARC) Support #3023

Open
@Ristellise

Description

Which package is the feature request for? If unsure which one to select, leave blank

None

Feature

Crawlee supports export formats such as JSON, but it doesn't seem to support Web Archive (WARC) formats.

Motivation

Saving crawl results as a web archive lets developers post-process everything the crawler has gathered.

Ideal solution or implementation, and any additional constraints

The preferred approach is an optional integration with the fastwarc library, enabled through a feature switch. Crawlee should pass the raw response contents to fastwarc, and fastwarc should take care of the rest of the WARC output.
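To make "the rest" a bit more concrete, here is a rough sketch of what serializing a single WARC `response` record involves. It uses only the standard library and does not use fastwarc or any Crawlee API. The helper names `build_warc_response_record` and `append_record` are hypothetical and exist only for illustration; in the proposed integration this work would be delegated to fastwarc.

```python
import gzip
import io
import uuid
from datetime import datetime, timezone


def build_warc_response_record(url: str, raw_http_response: bytes) -> bytes:
    """Serialize one WARC/1.1 'response' record.

    Hypothetical helper for illustration only; not a Crawlee or fastwarc API.
    `raw_http_response` is the full HTTP response (status line, headers, body).
    """
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", url),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(raw_http_response))),
    ]
    # Version line, then the named fields, then a blank line before the block.
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Every record is terminated by two CRLF sequences.
    return head.encode("utf-8") + raw_http_response + b"\r\n\r\n"


def append_record(stream, record: bytes) -> None:
    """Append a record as its own gzip member, the usual .warc.gz layout."""
    stream.write(gzip.compress(record))


if __name__ == "__main__":
    body = b"<h1>Hello</h1>"
    raw = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"\r\n" + body
    )
    out = io.BytesIO()
    append_record(out, build_warc_response_record("https://example.com/", raw))
    print(len(out.getvalue()), "compressed bytes written")
```

A real integration would also need request and `warcinfo` records, payload digests, and streaming of large bodies, which is why delegating to an existing library is preferable to hand-rolling this.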

Alternative solutions or implementations

warcio is another library worth considering. However, it is slower because it is written in pure Python, whereas fastwarc is written mainly in C++ and Cython.

Other context

I was introduced to Crawlee at PyCon SG 2025.

Metadata

Labels

feature: Issues that represent new features or improvements to existing features.
t-tooling: Issues with this label are in the ownership of the tooling team.