warc

Here are 30 public repositories matching this topic...

ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Updated Nov 18, 2024
Python

Rhizome-Conifer / conifer

Collect and revisit web pages.

python docker archives warc web-archiving wayback webrecorder pywb

Updated Nov 8, 2023
Python

ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

crawler spider archiving crawl warc

Updated Jul 7, 2024
Python

oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

python docker service-worker ipfs memento warc web-archiving wayback memento-rfc

Updated Nov 14, 2024
Python

Florents-Tselai / WarcDB

WarcDB: Web crawl data as SQLite databases.

cli database sqlite crawling warc web-archiving web-data

Updated Jul 13, 2024
Python

webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO

python warc web-archiving web-archives pywb

Updated Nov 12, 2024
Python

bitextor / bitextor

Bitextor generates translation memories from multilingual websites

Updated Nov 11, 2024
Python

harvard-lil / warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

ai warc webarchiving rag

Updated Oct 30, 2024
Python

cocrawler / cocrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

screenshot crawler concurrency async-python python3 aiohttp warc aiohttp-client pluggable-modules

Updated Apr 29, 2022
Python

cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

python warc web-archiving cdx web-archives commoncrawl cdx-api

Updated Oct 5, 2024
Python

mikwielgus / forum-dl

Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC

python scraper forum discourse phpbb warc data-fetching simplemachines internet-archiving

Updated Jun 27, 2024
Python

internetarchive / cdx-summary

Summarize web archive capture index (CDX) files.

nodejs python statistics collection webcomponents archive report summary warc cdx web-archive

Updated Jul 29, 2022
Python

openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format

scraper warc zim

Updated Nov 15, 2024
Python

PromyLOPh / crocoite

Web archiving using Google Chrome

devtools archiving chrome-browser warc

Updated Dec 30, 2019
Python

datacoon / metawarc

metawarc: a command-line tool for metadata extraction from files from WARC (Web ARChive)

metadata osint warc webarchiving warc-files osint-python

Updated Aug 19, 2024
Python

webrecorder / cdxj-indexer

CDXJ Indexing of WARC/ARCs

warc web-archiving

Updated Nov 13, 2024
Python

internetarchive / scrapy-warcio

Support for writing WARC files with Scrapy

python scrapy warc web-archiving

Updated Dec 21, 2019
Python

ArchiveTeam / WebArchiver

Decentralized web archiving

python crawler web decentralized archiving archiver warc webarchiving

Updated Aug 7, 2018
Python

commoncrawl / whirlwind-python

A whirlwind tour of Common Crawl's data using Python

python tutorial archive warc

Updated Nov 12, 2024
Python

natliblux / warc-safe

A tool for detecting viruses and NSFW material in WARC files

antivirus warc webarchiving nsfw-classifier warc-safe

Updated Aug 16, 2024
Python
