# Elly.jl
### Use HDFS & Yarn from Julia

_Tanmay Mohapatra (@tanmaykm)_

_JuliaCon 2015, Bangalore_

# Elly

- Julia filesystem APIs for HDFS
- Julia cluster manager interface for Yarn
- Yarn/HDFS-specific APIs where appropriate
- Pure Julia
    - easy to install
    - Protocol buffers for communication (ProtoBuf.jl)
- https://github.com/JuliaParallel/Elly.jl

# HDFS with Elly

## Connecting to HDFS
- load package
- create an HDFS client

In [13]:
using Elly

# client connection to namenode
dfs = HDFSClient("localhost", 9000)

HDFSClient: tan@localhost:9000/
    pwd: /
    id: 058e072c-584e-4f
    connected: false


## Navigating the filesystem

In [14]:
pwd(dfs)

"/"

In [15]:
readdir(dfs)

15-element Array{AbstractString,1}:
 "colsuminp.csv"    
 "maxvalinp.csv"    
 "sort"             
 "sorted"           
 "sortinp"          
 "sortout"          
 "sortsamp"         
 "sortval"          
 "tan_data"         
 "test"             
 "testdir"          
 "testfile.txt"     
 "tmp"              
 "twitter_small.csv"
 "user"             

In [16]:
cd(dfs, "tmp")

"/tmp"

In [17]:
mkdir(dfs, "foo")

true

In [18]:
cd(dfs, "foo")

"/tmp/foo"

## Files

In [19]:
stat(dfs, "/user/tan/t.avi")

HDFSFileInfo: /user/tan/t.avi
    type: file
    size: 2449213436
    block_sz: 134217728
    owner: tan
    group: supergroup


In [20]:
hdfs_blocks(dfs, "/user/tan/t.avi")

19-element Array{Tuple{UInt64,Array{T,N}},1}:
 (0x0000000000000000,AbstractString["127.0.0.1"])
 (0x0000000008000000,AbstractString["127.0.0.1"])
 (0x0000000010000000,AbstractString["127.0.0.1"])
 (0x0000000018000000,AbstractString["127.0.0.1"])
 (0x0000000020000000,AbstractString["127.0.0.1"])
 (0x0000000028000000,AbstractString["127.0.0.1"])
 (0x0000000030000000,AbstractString["127.0.0.1"])
 (0x0000000038000000,AbstractString["127.0.0.1"])
 (0x0000000040000000,AbstractString["127.0.0.1"])
 (0x0000000048000000,AbstractString["127.0.0.1"])
 (0x0000000050000000,AbstractString["127.0.0.1"])
 (0x0000000058000000,AbstractString["127.0.0.1"])
 (0x0000000060000000,AbstractString["127.0.0.1"])
 (0x0000000068000000,AbstractString["127.0.0.1"])
 (0x0000000070000000,AbstractString["127.0.0.1"])
 (0x0000000078000000,AbstractString["127.0.0.1"])
 (0x0000000080000000,AbstractString["127.0.0.1"])
 (0x0000000088000000,AbstractString["127.0.0.1"])
 (0x0000000090000000,AbstractString["127.0.0.1"])
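
The block listing follows directly from the `size` and `block_sz` fields in the `stat` output. As a minimal sketch in plain Julia (no Elly calls, written for Julia 1.x, with variable names chosen here for illustration), the block count and the starting offsets can be recomputed like this:

```julia
# Values copied from the stat output above.
file_size  = 2449213436   # bytes
block_size = 134217728    # bytes (128 MiB)

# Ceiling division: the last block may be only partially filled.
nblocks = cld(file_size, block_size)

# Starting byte offset of each block, as UInt64 to match the listing.
offsets = [UInt64(i * block_size) for i in 0:(nblocks - 1)]

# Bytes held by the final, partially filled block.
last_block_size = file_size - (nblocks - 1) * block_size

println(nblocks)                          # 19
println(string(offsets[end], base = 16))  # 90000000
println(last_block_size)                  # 33294332
```

The 19 offsets match the first element of each tuple returned by `hdfs_blocks`; the second element lists the datanodes holding that block.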

## File IO
- an HDFSFile refers to a file on HDFS
- open an HDFSFile to get an IO stream

In [21]:
baz_file = HDFSFile(dfs, "baz.txt")

HDFSFile: hdfs://tan@localhost:9000/tmp/foo/baz.txt


In [22]:
open(baz_file, "w") do f
    write(f, b"hello world")
end

11

In [23]:
open(baz_file, "r") do f
    bytes = Array(UInt8, filesize(f))
    read!(f, bytes)
    println(bytestring(bytes))
end

hello world


In [24]:
cd(dfs, "/")
rm(dfs, "/tmp/foo", true)

true

# Yarn with Elly

## Cluster Manager for Yarn

In [26]:
yarncm = YarnManager(yarnhost="localhost")

YarnManager for YarnClient: tan@localhost:8032/
    id: f3a1cbb0-c2cb-46
    connected: true


In [27]:
addprocs(yarncm; np=1, env=Dict("JULIA_PKGDIR"=>Pkg.dir()));
@everywhere println(myid())

1
	From worker 2:	2


In [28]:
rmprocs(workers())
Elly.disconnect(yarncm)

true

## Native Julia Yarn Application

- powerful
    - fine grained and dynamic resource allocation
    - optimize cluster resources
- but complex, with a lot of boilerplate code

## Connecting to Yarn

In [29]:
ugi = UserGroupInformation()

# connect to resource manager
yarnclnt = YarnClient("localhost", 8032, ugi)
nnodes = nodecount(yarnclnt)

1

In [30]:
nlist = Elly.nodes(yarnclnt)

YarnNodes: 1 (connected to 0)
YarnNode: /default-rack/tanlt:58465 running, Used mem: 0/8192, cores: 0/8
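
The memory and core figures in the node listing give a rough upper bound on how many containers of a given size one node can host. The sketch below is plain Julia arithmetic, not an Elly API; the per-container request is a made-up example, and the estimate ignores scheduler settings and the resources taken by the application master itself:

```julia
# Node capacity copied from the listing above.
node_mem_mb = 8192
node_cores  = 8

# Hypothetical per-container request (illustrative numbers only).
req_mem_mb = 1536
req_cores  = 1

# A container needs both memory and cores, so the scarcer resource sets the limit.
max_containers = min(div(node_mem_mb, req_mem_mb), div(node_cores, req_cores))

println(max_containers)  # 5
```

Here memory is the limiting resource: five requests of 1536 MB fit into 8192 MB, while the cores would allow eight.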


In [31]:
yarnam = YarnAppMaster("localhost", 8030, ugi)
function on_alloc(cid)
    # probably start container process here
    println("allocated $cid")
end
function on_finish(cid)
    # release the container or start a new process here
    println("finished $cid")
end
callback(yarnam, Nullable(on_alloc), Nullable(on_finish))
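
The callbacks above only print the container ID. A real application would usually keep some state as well. The following is a minimal sketch of that idea in plain Julia (Julia 1.x syntax): the `running` set is a hypothetical piece of bookkeeping, not part of Elly, and the callbacks are invoked by hand with made-up IDs to simulate the events the application master would deliver.

```julia
# Hypothetical bookkeeping: IDs of containers currently allocated to us.
running = Set{String}()

function on_alloc(cid)
    # A container process would probably be started here.
    push!(running, string(cid))
    println("allocated $cid, running: $(length(running))")
end

function on_finish(cid)
    # The container could be released, or a new process started, here.
    delete!(running, string(cid))
    println("finished $cid, running: $(length(running))")
end

# Simulated events with made-up container IDs.
on_alloc("container_1")
on_alloc("container_2")
on_finish("container_1")
```

After these three events only `container_2` remains in `running`. With Elly the same two functions would be registered through `callback` as in the cell above rather than called directly.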

In [32]:
yarnapp = submit(yarnclnt, yarnam)

YarnApp YARN (EllyApp/2): accepted-0.0
    location: tan@N/A:0/default


- request containers for the application

In [33]:
container_allocate(yarnam, 1)

allocated Elly.hadoop.yarn.ContainerIdProto(#undef,Elly.hadoop.yarn.ApplicationAttemptIdProto(Elly.hadoop.yarn.ApplicationIdProto(2,1444374287227),1),1)


- use allocated container to launch applications
- stop and release containers when done
- finally unregister the application

In [34]:
unregister(yarnam, true)

true

# End

_Next: processing large files on HDFS/Yarn_

<https://github.com/tanmaykm/TwitterLinks.jl>