warc

Here are 14 public repositories matching this topic...

bottomless-archive-project / common-crawl-client

This library is a very lightweight client to Common Crawl's WARC files.

warc common-crawl

Updated Jan 16, 2020
Java

helgeho / WarcPartitioner

Partition (W)ARC Files by MIME Type and Year

hadoop warc web-archiving webarchive

Updated Feb 13, 2017
Java

This project turns a WARC file into a sitemap, or into a graph description from which a sitemap could be built. The first release only creates a Graphviz file, which can then be rendered, for example into SVG.

graphviz warc graphviz-dot

Updated Dec 19, 2023
Java

cldellow / gzip

A fork of java.util.zip.GZIPInputStream that emits the offsets of nested streams.

compression gzip warc

Updated Apr 28, 2019
Java
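
Why member offsets matter for WARC: a .warc.gz file is usually a concatenation of gzip members, one per record, so a reader that knows where a member starts can decompress that record alone. The following is a minimal sketch of that idea using only the standard library (Java 9+). It is not the API of the fork described above; the class and method names (`GzipMemberOffsets`, `gzip`, `gunzipAt`) are hypothetical, and the offsets are recorded while writing rather than discovered while reading.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class, not part of any listed repository.
public class GzipMemberOffsets {

    /** Compresses one payload into a single, complete gzip member. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    /** Decompresses the gzip member occupying [offset, offset + length) of the file. */
    static String gunzipAt(byte[] file, int offset, int length) throws IOException {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(file, offset, length))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Two independent members written back to back, like two WARC records.
        byte[] first = gzip("first record");
        byte[] second = gzip("second record");

        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(first);
        file.write(second);
        byte[] bytes = file.toByteArray();

        // The second member starts where the first one ends.
        int secondOffset = first.length;
        System.out.println(gunzipAt(bytes, secondOffset, second.length));
    }
}
```

When the offsets are not known in advance, they have to be found while decompressing, which is the gap a stream that reports nested-stream offsets is meant to fill.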

pierlauro / MDBubing

From WARC records to MongoDB documents

crawler crawling warc webarchive webarchiving warc-files warc-format warc-record bubing

Updated Nov 3, 2020
Java

ukwa / waybacks

This module builds our Waybacks in the various configurations we require.

warc web-archiving webarchive web-archives

Updated Jun 30, 2018
Java

laxika / java-warc

Read Web ARChive (WARC) files in Java.

java library warc web-archive

Updated Dec 4, 2019
Java
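
For orientation, an uncompressed WARC record is a version line, a set of `Name: value` header lines, a blank line, a block of `Content-Length` bytes, and two trailing CRLFs. The sketch below reads records of that shape with the standard library only (Java 9+). It does not reproduce the API of any library listed here; `WarcRecordSketch` and its members are hypothetical names, the sample field values are made up, and folded (multi-line) header values are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of any listed repository.
public class WarcRecordSketch {
    final String version;
    final Map<String, String> headers;
    final byte[] block;

    WarcRecordSketch(String version, Map<String, String> headers, byte[] block) {
        this.version = version;
        this.headers = headers;
        this.block = block;
    }

    /** Reads one line ending in LF (CR is dropped); returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') line.write(b);
        }
        if (b == -1 && line.size() == 0) return null;
        return new String(line.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Reads the next record, or returns null when the stream is exhausted. */
    static WarcRecordSketch read(InputStream in) throws IOException {
        String version = readLine(in);
        // Skip the blank lines that terminate the previous record.
        while (version != null && version.isEmpty()) version = readLine(in);
        if (version == null) return null;
        if (!version.startsWith("WARC/")) throw new IOException("Not a WARC record: " + version);

        // Field names are matched case-insensitively.
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        String line;
        while ((line = readLine(in)) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon < 0) throw new IOException("Malformed header line: " + line);
            headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }

        String length = headers.get("Content-Length");
        if (length == null) throw new IOException("Missing Content-Length");
        byte[] block = new byte[Integer.parseInt(length)];
        if (in.readNBytes(block, 0, block.length) != block.length) {
            throw new EOFException("Truncated record block");
        }
        return new WarcRecordSketch(version, headers, block);
    }

    public static void main(String[] args) throws IOException {
        String body = "hello";
        // Made-up sample record for illustration only.
        String raw = "WARC/1.0\r\n"
            + "WARC-Type: resource\r\n"
            + "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
            + "WARC-Date: 2020-01-01T00:00:00Z\r\n"
            + "Content-Length: " + body.length() + "\r\n"
            + "\r\n" + body + "\r\n\r\n";
        InputStream in = new ByteArrayInputStream(raw.getBytes(StandardCharsets.UTF_8));
        WarcRecordSketch record = read(in);
        System.out.println(record.headers.get("WARC-Type") + ": "
            + new String(record.block, StandardCharsets.UTF_8));
    }
}
```

For compressed files, the same reader could be wrapped around a `GZIPInputStream`; the libraries in this list handle that and the remaining edge cases of the format.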

bottomless-archive-project / java-warc

Read Web ARChive (WARC) files in Java.

java library warc web-archive

Updated Sep 12, 2021
Java

VAle512 / WarcExtractor

A simple WARC extractor that extracts HTML from WARC files!

html extraction archive warc

Updated Oct 16, 2017
Java

helgeho / HadoopConcatGz

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

spark hadoop warc web-archiving webarchive

Updated Feb 7, 2018
Java

Mixnode / mixnode-warcreader-java

Read Web ARChive (WARC) files in Java.

java warc

Updated Mar 18, 2017
Java

centic9 / CommonCrawlDocumentDownload

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks like Apache POI and Apache Tika.

java mime-types warc cdx-files commoncrawl