MapReduce

Distributed sort and merge implementation with fault tolerance (also supports proactive fault tolerance)

How To Compile:

To compile this project simply run compile.sh script found in root directory.
After the build is successful, use the following instructions to run the program

Scripts for launching different components

Following 3 scripts are used for launching different services. Use the scripts in same following order

master
slave
client

Options for master

master script should be run first before running any scripts. The usage for master script is

usage: master -cs Chunk size in bytes (default: 1048576) -hi Heartbeat interval in milliseconds (default: 500) -bs Batch size for merge operation (default: 8) -tr Task redundancy for proactive fault tolerance (default: 1) -nfp Fail probability for nodes (default: 0.0) -tfp Fail probability for tasks (default: 0.1) -h Help

To run with defaults,

> ./scripts/master

To run with different chunk size and different merge batch size

> ./scripts/master -cs 2000000 -bs 4

To run with different fail probability for nodes (Assumption here is all nodes have same probability of failure)

> ./scripts/master -nfp 0.05

To enable proactive fault tolerance with 2 task for each sort and merge

> ./scripts/master -tr 2

To use different heartbeat interval and node fail probability NOTE: Keeping the interval short may result in nodes dying quickly before starting the client and running the tests. So its recommended to increase the hearbeat interval or decrease the fail probability to give some time for the nodes to run.

> ./scripts/master -hi 3000 -nfp 0.1

Since using automatic node fail probability can result in a condition where all nodes are dead, to avoid that task fail probability can be specified. By default, 10% of all tasks (sort and merge) could fail to simulate fault tolerance. Using task fail probability is more reliable than using node fail probability as with node failures the nodes won't come back. To run with task fail probability and task redundancy

> ./scripts/master -tr 2 -tfp 0.2

Options for slave

slave script accepts one optional argument i.e, hostname of master. If no argument is specified, by default localhost will be used for master. This script can be run multiple times on a single host (will use different ports for every run) or can be run on different machines. When master and slave scripts are run on different machines it is mandatory to provide the hostname for master

To run (master running on same host),

> ./scripts/slave

To run when master running on different host,

> ./scripts/slave <hostname-for-master>

Options for client

client script used to trigger the job that sorts and merges the input file.

Following is the usage guide for client script.

usage: client
 -i <arg>  Input file to be sorted
 -h <arg>  Hostname for master (default: localhost)
 --help    Help

To run the client, specify the required input file argument

> ./scripts/client -i <path-to-input-file>

To run the client with master running on different host

> ./scripts/client -i <path-to-input-file> -h <hostname-for-master>

Running the client will print sample output like below

./scripts/client -i input_dir/20000000

17:04:54.350 [main] INFO edu.umn.cs.mapreduce.client.Client - Submitted request to master: JobRequest(inputFile:input_dir/20000000)

--------------------------------------------------------------------------
                                  JOB STATISTICS
--------------------------------------------------------------------------
Input file:                             input_dir/20000000
Ouput file:                             ./output_dir/output_sorted
Input file size:                        93.25 MB
Output file size:                       93.25 MB
Requested chunk size:                   1 MB
Task redundancy:                        2
Node fail probability:                  0.05
Task fail probability:                  0.2
Number of splits:                       94
Total sort tasks:                       94
Total scheduled sort tasks:             192
Total successful sort tasks:            94
Total failed sort tasks:                2
Total killed sort tasks:                94
Total merge tasks:                      15
Total scheduled merge tasks:            36
Total successful merge tasks:           18
Total failed merge tasks:               3
Total killed merge tasks:               16
Average time to sort:                   195 ms
Average time to merge:                  23898 ms
Overall execution time:                 55310 ms
--------------------------------------------------------------------------

Explanation of task counts

Input file: Specified input file

Output file: Location of sorted output file

Input file size: Input file size in MB

Output file size: Sorted output file size in MB

Requested chunk size: Chunk size requested for split computation in MB

Task redundancy: Total tasks to launch for unit of work. In the above example task redundancy 2 means, 2 tasks will be launched for each split or merge job, whichever comes first will be used and other will be killed.

Node fail probability: Probability with which a node can fail. Nodes never come back after failure.

Task fail probability: Probability with which a submitted task (sort or merge) will fail.

Number of splits: Input file size divided by chunk size

Total sort tasks: Equal to number of splits as we need to sort all splits

Total scheduled sort tasks: If proactive fault tolerance is enabled, this is product of task redundancy and total sort tasks (in the above example the task redundancy is 2) and failed tasks that are rescheduled. In above example, 2 (redundancy) * 94 (sort tasks) + 2 (redundancy) * 2 (failed sort task) = 192

Total successful sort tasks: Successfully completed sort tasks

Total failed sort tasks: Failed sort tasks (due to task or node failure)

Total killed sort tasks: Redundant tasks killed by master after receiving output from one node

Similarly for merge tasks.

Average time to sort: Average execution time for sort tasks

Average time to merge: Average execution time for merge tasks

Overall execution time: Time between when client submits job to end of job

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
scripts		scripts
src		src
thrift		thrift
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MapReduce

How To Compile:

Scripts for launching different components

Options for master

Options for slave

Options for client

Explanation of task counts

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MapReduce

How To Compile:

Scripts for launching different components

Options for master

Options for slave

Options for client

Explanation of task counts

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages