Bulk Ingest

The ingest.sh shell script is used to ingest data onto a PNDA cluster.

Install

To install the dependencies and configure the tool for an HttpFS endpoint, run the install command with the IP address of a node running the HttpFS role and its port (14000 is the HttpFS default).

 ingest.sh install http://host:port
 e.g.
 ingest.sh install http://192.168.0.0:14000

Check with your cluster administrator for the correct IP address and port, as they may differ in production deployments.

Upload

To upload a file or directory onto a cluster, run the upload command with the file or directory name. Use the -f flag to overwrite existing files, and -t followed by a number of threads to upload in parallel.

ingest.sh upload [-f] [-t threads] local_file_or_directory
e.g.
ingest.sh upload Readme.txt
ingest.sh upload -f -t 10 /user/data

Once the upload completes, verify that the transferred files are present in the /user/pnda/PNDA_datasets/bulk/ folder in HDFS.
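
One way to run this check without logging in to a cluster node is to query the same HttpFS endpoint through the WebHDFS REST API that HttpFS exposes. The sketch below is not part of ingest.sh: the function names are hypothetical, the `pnda` user name is an assumption, and it assumes the endpoint accepts simple `user.name` authentication rather than Kerberos.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of ingest.sh.

# Build a WebHDFS LISTSTATUS URL for an HttpFS endpoint.
#   $1: endpoint, e.g. http://192.168.0.0:14000
#   $2: absolute HDFS path to list
#   $3: user name passed as user.name (assumed: pnda)
liststatus_url() {
    endpoint=${1%/}   # drop one trailing slash from the endpoint, if present
    printf '%s/webhdfs/v1%s?op=LISTSTATUS&user.name=%s\n' "$endpoint" "$2" "$3"
}

# Print the JSON listing of the bulk folder; returns non-zero on HTTP errors.
#   $1: endpoint  $2: optional user name (defaults to the assumed pnda user)
list_bulk_dir() {
    curl --silent --show-error --fail \
        "$(liststatus_url "$1" /user/pnda/PNDA_datasets/bulk "${2:-pnda}")"
}
```

For example, `list_bulk_dir http://192.168.0.0:14000` should return a JSON document whose FileStatuses entries include the uploaded files, provided the endpoint and user name match your cluster.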

Dependencies

The tool depends on the hdfs Python package, which is installed with pip. The install command installs this package and populates its CLI configuration.
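
For orientation, the hdfs package's command-line client reads an INI-style file, by default ~/.hdfscli.cfg, or the path given in the HDFSCLI_CONFIG environment variable. The sketch below writes a file in that documented layout. It is only an illustration: the helper name, the `pnda` alias, and the user name are assumptions, and the file that ingest.sh actually generates may differ.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of ingest.sh.
# Writes an HdfsCLI-style configuration file.
#   $1: output file (use a scratch path to avoid clobbering ~/.hdfscli.cfg)
#   $2: HttpFS URL, e.g. http://192.168.0.0:14000
#   $3: HDFS user name (assumed)
write_hdfscli_config() {
    cat > "$1" <<EOF
[global]
default.alias = pnda

[pnda.alias]
url = $2
user = $3
EOF
}
```

Comparing such a file with the one created by the install command is a quick way to confirm which endpoint the tool is configured to use.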

Known bugs

  • Sometimes files or nested directories are not overwritten when uploading large folders. In that case, rerun the upload command with the -f switch.
  • Appending a trailing slash ('/') to the directory name can have unexpected effects, in some cases even overwriting the directory. To upload a directory, use just its name as the argument.
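
Until the trailing-slash issue is fixed, the path can be normalized before it reaches the script. The following is a minimal sketch of such a guard; the function name is hypothetical and it is not part of ingest.sh.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of ingest.sh.
# Prints the given path with any trailing slashes removed.
strip_trailing_slashes() {
    path=$1
    # Remove one slash per iteration, but leave a bare "/" untouched.
    while [ "${#path}" -gt 1 ] && [ "${path%/}" != "$path" ]; do
        path=${path%/}
    done
    printf '%s\n' "$path"
}
```

It could be used as `ingest.sh upload -f "$(strip_trailing_slashes /user/data/)"`, which passes /user/data to the upload command.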