# Elly.jl
### Plug in to HDFS & Yarn from Julia

_Tanmay Mohapatra (@tanmaykm)_

_JuliaCon 2015, Bangalore_

# Elly

- Interface to:
    - HDFS: familiar Julia filesystem APIs
    - Yarn: register application, allocate/deallocate containers
        - familiar Julia cluster manager interface
- Pure Julia => easy to install
- Protocol buffers for communication (ProtoBuf.jl)

# HDFS with Elly

In [2]:
using Elly
dfs = HDFSClient("localhost", 9000)

HDFSClient: tan@localhost:9000/
    pwd: /
    id: e2444a63-07f3-42
    connected: false

## Navigating the filesystem

In [3]:
pwd(dfs)

"/"

In [4]:
readdir(dfs)

15-element Array{AbstractString,1}:
 "colsuminp.csv"    
 "maxvalinp.csv"    
 "sort"             
 "sorted"           
 "sortinp"          
 "sortout"          
 "sortsamp"         
 "sortval"          
 "tan_data"         
 "test"             
 "testdir"          
 "testfile.txt"     
 "tmp"              
 "twitter_small.csv"
 "user"             

In [5]:
cd(dfs, "tmp")

"/tmp"

In [6]:
mkdir(dfs, "foo")

true

In [7]:
cd(dfs, "foo")

"/tmp/foo"

## Files

In [8]:
stat(dfs, "/user/tan/t.avi")

HDFSFileInfo: /user/tan/t.avi
    type: file
    size: 2449213436
    block_sz: 134217728
    owner: tan
    group: supergroup


In [9]:
hdfs_blocks(dfs, "/user/tan/t.avi")

19-element Array{Tuple{UInt64,Array{T,N}},1}:
 (0x0000000000000000,AbstractString["127.0.0.1"])
 (0x0000000008000000,AbstractString["127.0.0.1"])
 (0x0000000010000000,AbstractString["127.0.0.1"])
 (0x0000000018000000,AbstractString["127.0.0.1"])
 (0x0000000020000000,AbstractString["127.0.0.1"])
 (0x0000000028000000,AbstractString["127.0.0.1"])
 (0x0000000030000000,AbstractString["127.0.0.1"])
 (0x0000000038000000,AbstractString["127.0.0.1"])
 (0x0000000040000000,AbstractString["127.0.0.1"])
 (0x0000000048000000,AbstractString["127.0.0.1"])
 (0x0000000050000000,AbstractString["127.0.0.1"])
 (0x0000000058000000,AbstractString["127.0.0.1"])
 (0x0000000060000000,AbstractString["127.0.0.1"])
 (0x0000000068000000,AbstractString["127.0.0.1"])
 (0x0000000070000000,AbstractString["127.0.0.1"])
 (0x0000000078000000,AbstractString["127.0.0.1"])
 (0x0000000080000000,AbstractString["127.0.0.1"])
 (0x0000000088000000,AbstractString["127.0.0.1"])
 (0x0000000090000000,AbstractString["127.0.0.1"])
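
The 19 entries follow from the `size` and `block_sz` values that `stat` reported for the same file. A minimal sketch of that arithmetic in plain Julia (no Elly calls; the variable names are illustrative, and it assumes every block except the last is full):

```julia
# Values copied from the stat output for /user/tan/t.avi
file_size = 2449213436   # bytes
block_sz  = 134217728    # bytes (128 MiB, 0x08000000)

# Number of blocks: ceiling division of file size by block size
nblocks = cld(file_size, block_sz)

# Starting byte offset of each block: 0, block_sz, 2*block_sz, ...
offsets = collect(0:block_sz:file_size-1)

# Bytes stored in the final, partially filled block
last_block_bytes = file_size - last(offsets)

println(nblocks)                         # 19
println(string(last(offsets), base=16))  # 90000000
println(last_block_bytes)                # 33294332
```

The computed offsets match the first element of each tuple returned by `hdfs_blocks`; the second element lists the hosts holding that block.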

## File IO

In [10]:
baz_file = HDFSFile(dfs, "baz.txt")

HDFSFile: hdfs://tan@localhost:9000/tmp/foo/baz.txt


In [11]:
open(baz_file, "w") do f
    write(f, b"hello world")
end

11

In [12]:
open(baz_file, "r") do f
    # allocate a buffer the size of the file, then fill it from HDFS
    bytes = Array(UInt8, filesize(f))
    read!(f, bytes)
    println(bytestring(bytes))
end

hello world
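
For comparison, the same write-then-read pattern on a local file, using only Julia Base. This sketch does not use Elly. It is written for current Julia (1.x), where `Vector{UInt8}(undef, n)` and `String(bytes)` replace the `Array(UInt8, n)` and `bytestring` calls from the Julia version used in this talk. The temporary path is an assumption made for the example.

```julia
# Scratch file in a fresh temporary directory (local filesystem, not HDFS)
path = joinpath(mktempdir(), "baz.txt")

# write returns the number of bytes written
nwritten = open(path, "w") do f
    write(f, b"hello world")
end

# Read the whole file back into a preallocated byte buffer
text = open(path, "r") do f
    bytes = Vector{UInt8}(undef, filesize(f))
    read!(f, bytes)
    String(bytes)
end

println(nwritten)  # 11
println(text)      # hello world
```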


In [13]:
cd(dfs, "/")
rm(dfs, "/tmp/foo", true)

true

# Yarn

In [18]:
ugi = UserGroupInformation()
yarnclnt = YarnClient("localhost", 8032, ugi)
nnodes = nodecount(yarnclnt)

1

In [19]:
nlist = Elly.nodes(yarnclnt)

YarnNodes: 1 (connected to 0)
YarnNode: /default-rack/tanlt:43485 running, Used mem: 0/8192, cores: 0/8


## Native Julia Yarn Application

- powerful
    - fine-grained and dynamic resource allocation
    - optimize cluster resources
- but... a lot of boilerplate code, complex

In [21]:
yarnam = YarnAppMaster("localhost", 8030, ugi)
function on_alloc(cid)
    # probably start container process here
    println("allocated $cid")
end
function on_finish(cid)
    # release the container or start a new process here
    println("finished $cid")
end
# register both handlers with the application master
callback(yarnam, Nullable(on_alloc), Nullable(on_finish))

In [22]:
yarnapp = submit(yarnclnt, yarnam)

YarnApp YARN (EllyApp/4): accepted-0.0
    location: tan@N/A:0/default


- request containers for the application

In [23]:
container_allocate(yarnam, 1)

allocated Elly.hadoop.yarn.ContainerIdProto(#undef,Elly.hadoop.yarn.ApplicationAttemptIdProto(Elly.hadoop.yarn.ApplicationIdProto(4,1444129433715),1),1)


- use allocated container to launch applications
- stop and release containers when done
- finally unregister the application

In [24]:
unregister(yarnam, true)

true

## Cluster Manager for Yarn

In [25]:
yarncm = YarnManager(yarnhost="localhost")

YarnManager for YarnClient: tan@localhost:8032/
    id: 7154d9df-57da-43
    connected: true


In [26]:
addprocs(yarncm; np=1, env=Dict("JULIA_PKGDIR"=>Pkg.dir()));
@everywhere println(myid())

1
	From worker 2:	2


In [27]:
rmprocs(workers())
Elly.disconnect(yarncm)

true

# End

_Next: processing large files on HDFS/Yarn_