🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Updated Nov 18, 2024 - Python
Collect and revisit web pages.
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Streaming WARC/ARC library for fast web archive IO
Bitextor generates translation memories from multilingual websites
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
CoCrawler is a versatile web crawler built using modern tools and concurrency.
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Scrape posts and threads from forums, news aggregators, and mail archives; export to JSONL, mailbox, or WARC
Summarize web archive capture index (CDX) files.
Web archiving using Google Chrome
metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives
Support for writing WARC files with Scrapy
Decentralized web archiving
A tool for detecting viruses and NSFW material in WARC files