GitHub - softwaremill/emr-scalding-tutorial: Simple Scalding task template for Amazon EMR

Command to get the size of the bucket

aws s3api list-objects --bucket YOUR_BUCKET_NAME --output json --query "[sum(Contents[].Size), length(Contents[])]" | awk 'NR!=2 {print $0;next} NR==2 {print $0/1024/1024/1024" GB"}'

Run job locally:

hadoop jar target/scala-2.11/emr-scalding-tutorial-assembly-0.1.jar com.softwaremill.AgeCounterJob — hdfs — input “data/*” — output data-output

or

hadoop jar target/scala-2.11/emr-scalding-tutorial-assembly-0.1.jar com.softwaremill.AgeCounterJob — local — input “data/hello.txt” — output data-output.txt

Once your task definition is ready and you have tested it locally, you can start pushing your assembled jar to the S3 service.

Copy the jar file to S3 location:

aws s3 cp target/scala-2.11/emr-scalding-tutorial-assembly-0.1.jar s3://grajo001log/emr-scalding-tutorial-assembly-0.1.jar

Run the cluster with the task

aws emr create-cluster --name "Scalding test cluster" --ami-version 3.9.0 --use-default-roles --instance-type m1.medium --instance-count 3 --log-uri s3://grajo001out/2 --steps Type=CUSTOM_JAR,Name="EMR Tutorial",ActionOnFailure=TERMINATE_CLUSTER,Jar=s3://grajo001log/emr-scalding-tutorial-assembly-0.1.jar ,Args=["com.softwaremill.AgeCounterJob","--hdfs","--input","s3n://grajo001log/users/*","--output","s3n://grajo001log/output"] --auto-terminate

Push task to EMR

aws emr add-steps --cluster-id=YOUR-CLUSTER_ID --steps=file://./task.json

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
project		project
src/main/scala/com/softwaremill		src/main/scala/com/softwaremill
.gitignore		.gitignore
README.md		README.md
build.sbt		build.sbt
prepare-data.sh		prepare-data.sh
print-graph.py		print-graph.py
sites.txt		sites.txt
task.json		task.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

project

project

src/main/scala/com/softwaremill

src/main/scala/com/softwaremill

.gitignore

.gitignore

README.md

README.md

build.sbt

build.sbt

prepare-data.sh

prepare-data.sh

print-graph.py

print-graph.py

sites.txt

sites.txt

task.json

task.json

Repository files navigation

Command to get the size of the bucket

Run job locally:

or

Copy the jar file to S3 location:

Run the cluster with the task

Push task to EMR

About

Releases

Packages

Languages

softwaremill/emr-scalding-tutorial

Folders and files

Latest commit

History

Repository files navigation

Command to get the size of the bucket

Run job locally:

or

Copy the jar file to S3 location:

Run the cluster with the task

Push task to EMR

About

Resources

Stars

Watchers

Forks

Languages