ARCHIVED--Docker app to crawl URLs and generate WARCs
Updated Apr 11, 2017 - Python
This system evaluates a series of mementos (archived web pages) to determine which are off topic. The series can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.
Decentralized web archiving
Hadoop streaming EMR job
Support for writing WARC files with Scrapy
Web archiving using Google Chrome
CoCrawler is a versatile web crawler built using modern tools and concurrency.
Summarize web archive capture index (CDX) files.
Scrape posts and threads from forums, news aggregators, and mail archives; export to JSONL, mailbox, or WARC
Minimalistic crawler
Collect and revisit web pages.
A tool for detecting viruses and NSFW material in WARC files
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Bitextor generates translation memories from multilingual websites