The primary goal of this project is to provide full-text search for our web archives. To achieve this, the warc-indexer component parses the (W)ARC files and, for each resource, posts a record to one or more Apache Solr servers. We then use client-facing tools that allow researchers to query the Solr index and explore the collections.
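As a rough illustration of the query side of this pipeline, the sketch below builds a request URL for Solr's standard `/select` search handler. The host, the core name `discovery`, and the field name `content` are assumptions made for the example and are not taken from this project's configuration; substitute the values from your own Solr setup.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, query, rows=10):
    """Build a URL for Solr's /select handler that returns JSON results.

    base_url, core, and query are supplied by the caller; the values used
    below are hypothetical placeholders, not this project's defaults.
    """
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical local Solr instance, core name, and field name.
url = build_solr_query_url("http://localhost:8983/solr", "discovery", "content:archive")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client once a Solr server holding indexed records is running.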
- Quick Start
- Source Code Project Structure
- Similar Systems
- Dataset Generation
- IIPC Solr Workshop (Jan 2014)
- Version 3 Solr 7 notes
IIPC Solr Training Workshop (Jan 2014)
The schedule for this event is here.
Getting started with webarchive-discovery (Quick Start)
- Setting up a test Solr service
- Indexing your ARCs or WARCs
- Browsing the results, basic queries, the schema browser, etc.
- Using the Solr UI:
- Installing SolrCloud (towards a production environment)
- Benchmarking and performance analysis