This client library can access the variety of REST APIs provided by Hadoop, either directly or through Apache Knox. Each protocol is wrapped in a class that knows how to interact with the service and simplifies access. Usage is uniform regardless of whether you access a service directly or through a Knox gateway. The library is also "proxy aware" in case you have additional network proxies.
A simple command-line client allows you to access the services through the Knox gateway. The command-line client can be run by:

```
python -m pyox
```
Currently, four commands are supported:

 * `hdfs` - commands for interacting with WebHDFS
 * `oozie` - commands for interacting with the Oozie service for scheduling jobs
 * `submit` - a simplified single-action submit command for Oozie
 * `cluster` - cluster status and queue information
A Knox gateway must be specified or it defaults to `localhost:50070`:

 * `--base` - the base URI of the Knox service
 * `--host` - the host and port of the Knox service
 * `--secure` - indicates TLS (https) should be used
 * `--gateway` - the Knox gateway name
 * `--auth` - the username and password (colon separated)
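As a sketch of how the colon-separated `--auth` value can be split (an illustration, not the library's actual parsing code), `str.partition` splits only on the first colon, so passwords that themselves contain colons survive intact:

```python
def parse_auth(auth):
    """Split a colon-separated 'username:password' credential.

    Only the first colon separates the fields, so a password
    containing colons is preserved.
    """
    username, sep, password = auth.partition(':')
    if not sep:
        raise ValueError('expected username:password')
    return username, password

# example: the password itself contains a colon
print(parse_auth('jane:xy:zzy'))  # → ('jane', 'xy:zzy')
```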
The Knox gateway can either be completely specified by the `--base` option or specified in parts by `--secure`, `--host`, and `--gateway`.

A proxy for a protocol can be specified by the `-p` option, which requires a protocol scheme (e.g., `https`) and the proxy URL.
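How the part-wise options combine into a single service URI can be sketched as follows (a hypothetical helper mirroring the documented defaults and Knox's conventional `/gateway/<name>/` path, not pyox's own code):

```python
def knox_base(secure=False, host='localhost', port=50070, gateway=None):
    """Compose a base URI from the part-wise options.

    Mirrors the documented defaults: http, localhost, port 50070.
    """
    scheme = 'https' if secure else 'http'
    base = f'{scheme}://{host}:{port}/'
    if gateway is not None:
        # Knox conventionally exposes topologies under /gateway/<name>/
        base += f'gateway/{gateway}/'
    return base

print(knox_base())  # → http://localhost:50070/
print(knox_base(secure=True, host='knox.example.com', port=8443,
                gateway='bigdata'))
```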
```
python -m pyox hdfs *command* ...
```
Outputs the contents of the files to stdout.

```
python -m pyox hdfs cat [--offset N] [--length N] path ...
```

Options:

 * `--length N` - output N bytes of the file
 * `--offset N` - start at N bytes offset into the file
Downloads a file.

```
python -m pyox hdfs download [-v] [--chunk-size N] [-o file] file
```

Options:

 * `--chunk-size N` - download the file in chunks of size N bytes
 * `-v` - verbose (show download status)
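What `--chunk-size` amounts to can be illustrated with a generic chunked-copy loop (a sketch over in-memory streams, not the library's actual download code):

```python
import io

def copy_in_chunks(src, dst, chunk_size=4096, verbose=False):
    """Copy src to dst chunk_size bytes at a time; return total bytes."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
        if verbose:
            # analogous to the -v download status output
            print(f'{total} bytes copied')
    return total

src = io.BytesIO(b'x' * 10000)
dst = io.BytesIO()
print(copy_in_chunks(src, dst, chunk_size=4096))  # → 10000
```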
A directory or file listing.

```
python -m pyox hdfs ls [-b] [-l] path ...
```

Options:

 * `-b` - show the file sizes in bytes
 * `-l` - show the file details (long format)
Create directories.

```
python -m pyox hdfs mkdir path ...
```
Move a file.

```
python -m pyox hdfs mv source destination
```
Remove files.

```
python -m pyox hdfs rm [-r] path ...
```

Options:

 * `-r` - recursively remove files
Copy a set of files/directories to the target destination:

```
python -m pyox hdfs upload [-f] [-r] [-s] [-v] source ... destination/
```

Copy a single file to a destination:

```
python -m pyox hdfs upload [-f] [-s] [-v] source destination
```

Options:

 * `-f` - force (overwrite files)
 * `-r` - recursively upload
 * `-s` - send file size
 * `-v` - verbose (show upload status)
 * `ls` - list jobs (by status, detailed, etc.)
 * `start` - start a job
 * `status` - show the job status
```
python -m pyox oozie ls -h
```
To start jobs on Oozie you can:

 * specify a JSON properties file for job properties via `-P`
 * specify a single property via `-p name value` or `--property name value`
 * specify the workflow definition via `-d file.xml`
 * copy resources to the job path via `-cp`
 * specify the name node (`--namenode`) or job tracker (`--tracker`) to override what is in the properties
 * `info` - shows basic cluster information such as versions, status, etc.
 * `metrics` - shows information about applications, containers, cores, memory, and nodes
 * `scheduler` - shows the queues and their utilization
Options for `info`:

 * `-r` - output the raw JSON response
 * `-p` - pretty print the JSON
 * `--status` - output cluster status
 * `--version` - output only the hadoop version
 * `-a` - output all information
Options for `metrics`:

 * `-r` - output the raw JSON response
 * `-p` - pretty print the JSON
Options for `scheduler`:

 * `-r` - output the raw JSON response
 * `-p` - pretty print the JSON
 * `--users` - show user utilization of queues
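The difference between `-r` and `-p` is only how the JSON response is rendered; in Python terms it amounts to the following (an illustration with a made-up response, not the CLI's own code):

```python
import json

# a hypothetical cluster-info response for illustration
response = {'clusterInfo': {'state': 'STARTED', 'haState': 'ACTIVE'}}

# -r : the raw (compact) JSON response
print(json.dumps(response))

# -p : the same response, pretty printed with indentation
print(json.dumps(response, indent=2, sort_keys=True))
```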
The CLI uses a simple API that you can embed directly in your application. Every client object has the same parameters (all keywords):

 * `base` - the base URI of the Knox service
 * `secure` - whether SSL transport is to be used (defaults to `False`, mutually exclusive with base)
 * `host` - the host name of the Knox service (defaults to `localhost`, mutually exclusive with base)
 * `port` - the port of the Knox service (defaults to `50070`, mutually exclusive with base)
 * `gateway` - the gateway name to use
 * `username` - the authentication user
 * `password` - the authentication password
A simple HDFS client example:

```python
from pyox import WebHDFS

hdfs = WebHDFS(base='https://knox.example.com/', gateway='bigdata',
               username='jane', password='xyzzy')
if not hdfs.make_directory('/user/bob/data/'):
   print('Cannot make directory!')
```
There are three main API classes:

 * `WebHDFS` - an HDFS client
 * `Oozie` - an Oozie workflow client
 * `ClusterInformation` - a cluster information client
(more documentation is to come!)
A workflow for a job can be constructed by a DSL. For example, a simple shell action to copy yarn logs:
```python
from pyox import Oozie, Workflow
from io import StringIO

# create the oozie client
oozie = Oozie(base='https://knox.example.com/', gateway='bigdata',
              username='jane', password='xyzzy')

# create the job directory
oozie.createHDFSClient().make_directory('/user/jane/shell/')

# a workflow with a single shell action
workflow = \
   Workflow.start('invoke-shell', 'shell') \
           .action(
              'shell',
              Workflow.shell(
                 'my-job-tracker', 'hdfs://sandbox',
                 'copy.sh',  # the command to run (the script copied below)
                 configuration=Workflow.configuration({
                    'mapred.job.queue.name': 'my-queue'
                 }),
                 argument=['application_1500977774979_2776'],
                 file='/user/jane/shell/copy.sh'
              )
           ).kill('error', 'Cannot run workflow shell')

# the script to execute
script = StringIO("""#!/bin/bash
yarn logs -applicationId $1 | hdfs dfs -put - /user/jane/shell/job.log
""")

# copy whatever is necessary and submit the job via Oozie
jobid = oozie.submit(
   '/user/jane/shell/',
   properties={
      'oozie.use.system.libpath': True,
      'user.name': 'jane'
   },
   workflow=workflow,
   copy=[(script, 'copy.sh')]
)
```
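The DSL above builds an Oozie workflow definition; the generated XML would look roughly like the following (an approximation based on Oozie's standard shell-action schema, not the library's exact output):

```
<workflow-app name="invoke-shell" xmlns="uri:oozie:workflow:0.5">
  <start to="shell"/>
  <action name="shell">
    <shell xmlns="uri:oozie:shell-action:0.3">
      <job-tracker>my-job-tracker</job-tracker>
      <name-node>hdfs://sandbox</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>my-queue</value>
        </property>
      </configuration>
      <exec>copy.sh</exec>
      <argument>application_1500977774979_2776</argument>
      <file>/user/jane/shell/copy.sh</file>
    </shell>
    <ok to="end"/>
    <error to="error"/>
  </action>
  <kill name="error">
    <message>Cannot run workflow shell</message>
  </kill>
  <end name="end"/>
</workflow-app>
```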
A simple Flask application can provide a web UI and proxy to the cluster information and scheduler queues. The application can be run by:

```
python -m pyox.apps.monitor conf
```

where `conf.py` is in your Python import path and contains the application configuration. Alternatively, you can set the environment variable `WEB_CONF` to the location of this file.
The configuration can contain any of the standard Flask configuration options. The variable `KNOX` must be present for the configuration of the Apache Knox gateway.

For example, `conf.py` might contain:
```python
DEBUG=True
KNOX={
   'base' : 'https://knox.example.com',
   'gateway': 'bigdata'
}
```
Any of the client configuration keywords are available (e.g., `service`, `base`, `secure`, `host`, `port`) except the user authentication. The user authentication for both the API and service is passed through to the Knox web service. You must have authentication credentials for Knox to use the web application.

Once you have the application running, you can access it at the address you have configured. By default, this is `http://localhost:5000/`.
The Tracker microservice tracks Oozie jobs via Knox. Often, local policies for the retention of logs cause application container logs to disappear before a developer can view or inspect them.

The Tracker service provides a simple API and UI for tracking jobs via Apache Knox. If a job fails, it will submit a job to inspect and copy various application logs to your home directory on HDFS.

The web UI displays the current status of jobs and the cluster, and allows various interactions and operations to be performed. Alternatively, the Tracker API provides a programmatic way to interact with the microservice.

The microservice can be deployed easily on Kubernetes.
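A minimal deployment manifest might look like the following (a hypothetical sketch: the image name, port, and configuration path are assumptions, not part of the project):

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pyox-tracker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pyox-tracker
  template:
    metadata:
      labels:
        app: pyox-tracker
    spec:
      containers:
      - name: tracker
        image: example/pyox-tracker:latest   # hypothetical image name
        ports:
        - containerPort: 5000                # default Flask port
        env:
        - name: WEB_CONF                     # documented configuration variable
          value: /etc/pyox/conf.py           # assumed mount location
```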