# warc

Here are 26 public repositories matching this topic...

edgi-govdata-archiving / eis-WARC-archiver

ARCHIVED--Docker app to crawl URLs and generate WARCs

docker warc warc-format

Updated Apr 11, 2017
Python

shawnmjones / OffTopic-Detection

This system evaluates a series of mementos (archived web pages) to determine which are off topic. The series can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.

python memento warc web-archiving timemap memento-rfc archive-it

Updated Nov 7, 2017
Python

ArchiveTeam / WebArchiver

Decentralized web archiving

python crawler web decentralized archiving archiver warc webarchiving

Updated Aug 7, 2018
Python

ggodreau / huhdewp

Hadoop streaming EMR job

big-data hadoop bigdata warc hadoop-streaming common-crawl

Updated Dec 18, 2018
Python

info-labs / owlbot

WARC archive crawler

Updated Apr 9, 2019
Python

internetarchive / scrapy-warcio

Support for writing WARC files with Scrapy

python scrapy warc web-archiving

Updated Dec 21, 2019
Python

PromyLOPh / crocoite

Web archiving using Google Chrome

devtools archiving chrome-browser warc

Updated Dec 30, 2019
Python

marinoandrea / wikidata-entity-linking

CLI to extract named entities from web pages and link them to potential entity candidates in the WikiData knowledge base.

nlp web wikidata trident warc entity-linking

Updated Nov 26, 2021
Python

oduwsdl / off-topic-memento-toolkit

This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.

measure topic simhash memento warc cosine timemap

Updated Dec 19, 2021
Python

cocrawler / cocrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

screenshot crawler concurrency async-python python3 aiohttp warc aiohttp-client pluggable-modules

Updated Apr 29, 2022
Python

webrecorder / cdxj-indexer

CDXJ Indexing of WARC/ARCs

warc web-archiving

Updated May 22, 2024
Python

internetarchive / cdx-summary

Summarize web archive capture index (CDX) files.

nodejs python statistics collection webcomponents archive report summary warc cdx web-archive

Updated Jul 29, 2022
Python

mikwielgus / forum-dl

Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC

python scraper forum discourse phpbb warc data-fetching simplemachines internet-archiving

Updated Sep 19, 2023
Python

wsdookadr / femtocrawl

minimalistic crawler

firefox crawler offline scraping http-archive warc zim web-archives

Updated Oct 15, 2023
Python

Rhizome-Conifer / conifer

Collect and revisit web pages.

python docker archives warc web-archiving wayback webrecorder pywb

Updated Nov 8, 2023
Python

Florents-Tselai / WarcDB

WarcDB: Web crawl data as SQLite databases.

cli database sqlite crawling warc web-archiving web-data

Updated Feb 22, 2024
Python

ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

crawler spider archiving crawl warc

Updated Mar 22, 2024
Python

natliblux / warc-safe

A tool for detecting viruses and NSFW material in WARC files

antivirus warc webarchiving nsfw-classifier warc-safe

Updated May 3, 2024
Python

cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

python warc web-archiving cdx web-archives commoncrawl cdx-api

Updated May 20, 2024
Python

bitextor / bitextor

Bitextor generates translation memories from multilingual websites

Updated May 21, 2024
Python
