warc
Here are 26 public repositories matching this topic...
Hadoop streaming EMR job
Updated Dec 18, 2018 - Python
This system evaluates a series of mementos (archived web pages) to determine which are off topic. The series can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.
Updated Nov 7, 2017 - Python
Minimalistic crawler
Updated Oct 15, 2023 - Python
A tool for detecting viruses and NSFW material in WARC files
Updated May 3, 2024 - Python
ARCHIVED: Docker app to crawl URLs and generate WARCs
Updated Apr 11, 2017 - Python
Support for writing WARC files with Scrapy
Updated Dec 21, 2019 - Python
Decentralized web archiving
Updated Aug 7, 2018 - Python
metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives
Updated May 31, 2024 - Python
Web archiving using Google Chrome
Updated Dec 30, 2019 - Python
Summarize web archive capture index (CDX) files.
Updated Jul 29, 2022 - Python
WARC + AI: an experimental Retrieval Augmented Generation pipeline for web archive collections.
Updated Jun 7, 2024 - Python
Scrape posts and threads from forums, news aggregators, and mail archives; export to JSONL, mailbox, or WARC
Updated Sep 19, 2023 - Python
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Updated May 20, 2024 - Python
CoCrawler is a versatile web crawler built using modern tools and concurrency.
Updated Apr 29, 2022 - Python
Bitextor generates translation memories from multilingual websites
Updated May 21, 2024 - Python