
Using Amazon S3 for Storage

This is the documentation for Faunus 0.4.
Faunus was merged into Titan and renamed Titan-Hadoop in version 0.5.
Documentation for the latest Titan version is available at http://s3.thinkaurelius.com/docs/titan/current.

Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers. Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers. — Amazon S3 webpage

HDFS is one implementation of Hadoop's FileSystem API. Other implementations include LocalFileSystem, S3FileSystem, and NativeS3FileSystem. Faunus is able to leverage these file systems like any other file system in Hadoop.
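As a sketch of the generic route (assuming Hadoop 1.x property names and the same aurelius-data bucket used in the examples below), the S3 filesystem can also be obtained through the FileSystem API by placing the AWS credentials in the Hadoop Configuration and letting the URI scheme select the implementation:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

// Credentials are read from the configuration; the values below are placeholders.
conf = new Configuration()
conf.set('fs.s3n.awsAccessKeyId', 'AWS_ACCESS_KEY_ID')
conf.set('fs.s3n.awsSecretAccessKey', 'AWS_SECRET_ACCESS_KEY')

// FileSystem.get() returns the implementation registered for the URI scheme:
// s3n:// yields NativeS3FileSystem, hdfs:// an HDFS client, file:// the local filesystem.
s3fs = FileSystem.get(new URI('s3n://aurelius-data'), conf)
s3fs.listStatus(new Path('/friendster')).each { println it.path }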

Interacting with S3 via the Gremlin REPL

gremlin> import org.apache.hadoop.fs.s3native.*
...
gremlin> id = // AWS_ACCESS_KEY_ID as a String
==>*******
gremlin> key = // AWS_SECRET_ACCESS_KEY as a String
==>*******
gremlin> s3fs = new NativeS3FileSystem()
==>org.apache.hadoop.fs.s3native.NativeS3FileSystem@7433c78b
gremlin> s3fs.initialize(new URI("s3n://${id}:${key}@aurelius-data"), new Configuration())
==>null
gremlin> s3fs.ls('/friendster')
==>rwxrwxrwx   0 _SUCCESS
==>rwxrwxrwx   0 (D) _logs
==>rwxrwxrwx   71400361 part-m-00000
==>rwxrwxrwx   70471619 part-m-00001
==>rwxrwxrwx   69870835 part-m-00002
==>rwxrwxrwx   69452656 part-m-00003
...
gremlin> s3fs.ls('/friendster')._().count()
==>962

At this point, s3fs behaves like any other Hadoop file system and can be used to store and retrieve data. In fact, it is possible to set up S3 as the source and sink of Faunus jobs (see the documentation for more information). Given that Gremlin provides full access to the Java/Groovy API, connecting to arbitrary FileSystem implementations is relatively simple.
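For example (a minimal sketch; the /tmp and /friendster-copy paths below are hypothetical), the standard FileSystem copy methods work against S3 just as they do against HDFS:

import org.apache.hadoop.fs.Path

// retrieve: pull an object from the S3 bucket down to the local filesystem
s3fs.copyToLocalFile(new Path('/friendster/part-m-00000'), new Path('/tmp/part-m-00000'))
// store: push a local file back up into the bucket under a new prefix
s3fs.copyFromLocalFile(new Path('/tmp/part-m-00000'), new Path('/friendster-copy/part-m-00000'))

Likewise, an s3n:// URI of the form s3n://ID:KEY@bucket/path can generally be supplied wherever Faunus expects an input or output location, which is what makes S3 usable as a job source and sink.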

Parallel Uploading and Downloading of Data

The parallel copy tool DistCp can be used from the command line to execute a MapReduce-based parallel copy between any two Hadoop-supported filesystems. An example of pulling the friendster data from S3 into HDFS is provided below.

$ hadoop distcp s3n://$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY@aurelius-data/friendster hdfs://ec2-204-236-188-243.us-west-1.compute.amazonaws.com:8020/user/ubuntu/
13/04/25 14:32:38 INFO tools.DistCp: srcPaths=[s3n://*****:*****@aurelius-data/friendster]
13/04/25 14:32:38 INFO tools.DistCp: destPath=hdfs://ec2-204-236-188-243.us-west-1.compute.amazonaws.com:8020/user/ubuntu
13/04/25 14:32:45 INFO tools.DistCp: sourcePathsCount=966
13/04/25 14:32:45 INFO tools.DistCp: filesToCopyCount=963
13/04/25 14:32:45 INFO tools.DistCp: bytesToCopyCount=60.0g
13/04/25 14:32:46 INFO mapred.JobClient: Running job: job_201304251356_0003
13/04/25 14:32:47 INFO mapred.JobClient:  map 0% reduce 0%
13/04/25 14:33:28 INFO mapred.JobClient:  map 1% reduce 0%
...
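Uploading works the same way with the source and destination reversed. A sketch reusing the hostnames above (the friendster-backup prefix is hypothetical and the log output is omitted):

$ hadoop distcp hdfs://ec2-204-236-188-243.us-west-1.compute.amazonaws.com:8020/user/ubuntu/friendster s3n://$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY@aurelius-data/friendster-backup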