Spacip

Introduction

Spacip prepares web archive container files in the ARC format, stored in a Hadoop Distributed File System (HDFS), so that the individual files they contain can be processed with the SCAPE Platform tool Tomar.

It unpacks flat ARC container files to HDFS, creates a mapping that identifies which unpacked file corresponds to which ARC record, and creates an input file to be used with Tomar.

Usage

Invoke the application using hadoop without parameters to print the help output, which lists the available parameters:

hadoop jar ./target/spacip-1.0-SNAPSHOT-jar-with-dependencies.jar

The following parameters can be used in any order:

-d,--dir <arg>   HDFS directory containing (the) text file(s) with HDFS
                 paths to container files (e.g. '/user/name/inputdir/').
                 [required].
-n,--npt <arg>   Number of items to be processed per invocation (e.g. 50).
                 [optional, default: 50].

Important: The input directory (parameter -d) should not contain the ARC files themselves. It must contain one or more text files listing the absolute HDFS paths of the ARC container files. For example:

hadoop jar ./target/spacip-1.0-SNAPSHOT-jar-with-dependencies.jar -d /user/name/inputpaths/
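The text file with the ARC paths has to be created before running the job. A minimal sketch of one way to do this is shown below, assuming the containers end in .arc or .arc.gz and that the path is the last whitespace-separated field of the `hadoop fs -ls` output. The function name `filter_arc_paths`, the directory /user/name/arcfiles/ and the file name arcpaths.txt are placeholders for illustration and are not part of Spacip.

```shell
# filter_arc_paths: read `hadoop fs -ls` output on stdin and print only
# the paths that end in .arc or .arc.gz, one per line.
# Assumption: the path is the last whitespace-separated field, so paths
# containing spaces are not handled.
filter_arc_paths() {
    awk '$NF ~ /\.arc(\.gz)?$/ { print $NF }'
}

# Hypothetical usage (directory and file names are placeholders):
#   hadoop fs -ls /user/name/arcfiles/ | filter_arc_paths > arcpaths.txt
#   hadoop fs -put arcpaths.txt /user/name/inputpaths/
```

Listing an absolute directory makes `hadoop fs -ls` print absolute paths, which is what the input file requires.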

If the Hadoop job runs successfully, files are created in two separate output directories.

First there is the directory where unpacked files are copied to:

/user/name/spacip_unpacked/

Second, there is the job output directory (default: spacip_joboutput), which contains the output files inside a subdirectory named after a timestamp:

./spacip_joboutput/1385143157862/_SUCCESS
./spacip_joboutput/1385143157862/_logs
./spacip_joboutput/1385143157862/keyfilmapping-m-00000
./spacip_joboutput/1385143157862/part-r-00000
./spacip_joboutput/1385143157862/tomarinput-m-00000

The 'keyfilemapping-*' file maps each container/record identifier (key) to a file name (value), so that each unpacked file in HDFS can be unambiguously assigned to its record and ARC container.

The 'tomarinput-*' file is the input file that can be passed to Tomar (-i parameter).

The 'part-r-00000' file is an empty reducer output file and can be ignored.
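As an illustration of how the key-to-file mapping described above might be queried, the sketch below looks up one key in a local copy of the mapping file. It assumes one tab-separated key/value pair per line, which is the default text output layout in Hadoop but is not confirmed here for Spacip. The function name `lookup_file`, the file name mapping.txt and the key shown in the usage comment are hypothetical.

```shell
# lookup_file KEY MAPPING_FILE
# Print the value stored for KEY in a local copy of the mapping file.
# Assumption: each line has the form "key<TAB>value".
lookup_file() {
    awk -F '\t' -v key="$1" '$1 == key { print $2 }' "$2"
}

# Hypothetical usage (key and local file name are placeholders):
#   hadoop fs -cat './spacip_joboutput/1385143157862/keyfilmapping-m-*' > mapping.txt
#   lookup_file 'some-container/some-record' mapping.txt
```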

Note that, depending on the Hadoop configuration, failed Hadoop tasks might be re-scheduled to run on another node. This can lead to more unpacked files than there are records in the container, because Hadoop does not clean up files created by a failed task. The output files listed above only track the files generated by successful tasks, so any additional files left behind by failed tasks are ignored. The only consequence of task failures is that some additional storage is required.
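If the extra storage matters, the leftover files could be identified by comparing a listing of the unpacked directory with the file names recorded in the mapping. The following is only a sketch: it assumes both lists are available as local text files with one entry per line, written in the same form (both full paths or both bare file names). The function name `list_orphans` and the file names in the usage comment are hypothetical.

```shell
# list_orphans ALL_FILES MAPPED_FILES
# Print every line of ALL_FILES that does not appear verbatim in
# MAPPED_FILES. Both arguments are local text files, one entry per line.
list_orphans() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
    # grep exits with status 1 when nothing is printed; that is not an error here.
    grep -F -x -v -f "$2" "$1" || true
}

# Hypothetical usage (file names are placeholders):
#   list_orphans unpacked-listing.txt mapped-files.txt
```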

The job produces one output file per map task for the 'keyfilemapping' and 'tomarinput' output types.

These files can easily be merged. For the 'keyfilemapping' output, for example, use the following command:

hadoop fs -cat ./spacip_joboutput/1385143157862/keyfilmapping-m-* | hadoop fs -put - ./spacip_joboutput/1385143157862/keyfilmapping-aggregated.txt

And for the Tomar input file correspondingly:

hadoop fs -cat ./spacip_joboutput/1385143157862/tomarinput-m-* | hadoop fs -put - ./spacip_joboutput/1385143157862/tomarinput-aggregated.txt
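The two merge commands above differ only in the output prefix, so they can be wrapped in a small helper. This is a convenience sketch, not part of Spacip; the function name `merge_parts` is hypothetical, and the prefix has to be given exactly as it appears in the job output listing.

```shell
# merge_parts JOB_DIR PREFIX
# Concatenate all JOB_DIR/PREFIX-m-* part files into
# JOB_DIR/PREFIX-aggregated.txt inside HDFS.
merge_parts() {
    # The glob is quoted so that `hadoop fs` expands it against HDFS
    # instead of the local shell expanding it against the local file system.
    hadoop fs -cat "$1/$2-m-*" | hadoop fs -put - "$1/$2-aggregated.txt"
}

# Usage with the timestamp directory from the listing above:
#   merge_parts ./spacip_joboutput/1385143157862 tomarinput
```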

This way, Tomar can be invoked with the aggregated input file:

hadoop jar tomar.jar -i /user/name/spacip_joboutput/1385143157862/tomarinput-aggregated.txt -r /user/name/scape-toolspecs

where the value of the parameter -i is the aggregated Tomar input file and the value of the -r parameter indicates the directory where the tool specification files for Tomar can be found.

Dependencies

This application uses the Cloudera CDH3u5 Hadoop distribution; see the corresponding dependency in the Maven file (pom.xml):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2-cdh3u5</version>
</dependency>

The application produces an input file for the SCAPE Platform tool Tomar.

Installation

Maven is used to build the application. Change to the directory of the application and run:

mvn install
