warc
Here are 26 public repositories matching this topic...
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
Updated Jun 13, 2024 - Python
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Updated Jun 12, 2024 - Python
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Updated Jun 10, 2024 - Python
metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives
Updated May 31, 2024 - Python
Streaming WARC/ARC library for fast web archive IO
Updated May 27, 2024 - Python
Bitextor generates translation memories from multilingual websites
Updated May 21, 2024 - Python
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Updated May 20, 2024 - Python
A tool for detecting viruses and NSFW material in WARC files
Updated May 3, 2024 - Python
Collect and revisit web pages.
Updated Nov 8, 2023 - Python
Minimalistic crawler
Updated Oct 15, 2023 - Python
Scrape posts and threads from forums, news aggregators, and mail archives; export to JSONL, mailbox, or WARC
Updated Sep 19, 2023 - Python
Summarize web archive capture index (CDX) files.
Updated Jul 29, 2022 - Python
CoCrawler is a versatile web crawler built using modern tools and concurrency.
Updated Apr 29, 2022 - Python
Web archiving using Google Chrome
Updated Dec 30, 2019 - Python