The ingest.sh shell script is used to ingest data into a PNDA cluster.
To install dependencies and configure the tool for an HttpFS endpoint, run the install command with the IP address of a node running the HttpFS role and port 14000.
ingest.sh install http://host:port
e.g.
ingest.sh install http://192.168.0.0:14000
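Before running install, it can help to sanity-check that the argument has the expected http://host:port shape. The helper function below is a sketch, not part of ingest.sh; the commented-out curl call against the WebHDFS-compatible HttpFS REST API (and the pnda user name in it) are assumptions about a typical deployment.

```shell
# Check that an endpoint argument looks like http://host:port before
# passing it to "ingest.sh install". Purely a local format check.
valid_endpoint() {
  case "$1" in
    http://*:[0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (network call commented out; endpoint and user are assumptions):
# valid_endpoint "http://192.168.0.0:14000" && \
#   curl -s "http://192.168.0.0:14000/webhdfs/v1/?op=GETHOMEDIRECTORY&user.name=pnda"
```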
Please check with your cluster administrator for the correct IP address and port, as it may change in production deployments.
To upload a file or directory to the cluster, run the upload command with the file or directory name. Use the -f flag to overwrite existing files and the -t flag to set the number of threads used for parallelism.
ingest.sh upload local_file_or_directory
e.g.
ingest.sh upload Readme.txt
ingest.sh upload -f -t 10 /user/data
Once the upload completes, verify that the transferred files are present in the /user/pnda/PNDA_datasets/bulk/ directory in HDFS.
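One way to verify the upload is to list the target directory through the same HttpFS endpoint, which exposes the standard WebHDFS REST API. The helper below only builds the LISTSTATUS URL; the endpoint address and the pnda user name in the example are assumptions for your cluster.

```shell
# Build a WebHDFS/HttpFS URL that lists the contents of an HDFS path.
# Arguments: endpoint (http://host:port), HDFS path, optional user (default: pnda).
hdfs_list_url() {
  local endpoint="$1" path="$2" user="${3:-pnda}"
  printf '%s/webhdfs/v1%s?op=LISTSTATUS&user.name=%s\n' "$endpoint" "$path" "$user"
}

# Example (requires network access to the cluster):
# curl -s "$(hdfs_list_url http://192.168.0.0:14000 /user/pnda/PNDA_datasets/bulk)"
```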
The tool depends on the hdfs Python pip package. Running the install command also sets up this package and populates the CLI configuration.
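If you want to confirm that the hdfs package is importable (for example, after install has run), a quick check from the shell is sufficient. The helper name below is an assumption, not part of the tool.

```shell
# Return success if the given Python module can be imported
# by the python3 on PATH (e.g. has_py_pkg hdfs).
has_py_pkg() {
  python3 -c "import $1" >/dev/null 2>&1
}

# Example:
# has_py_pkg hdfs || echo "hdfs package missing; run: ingest.sh install http://host:port"
```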
- Sometimes files or nested directories are not overwritten when uploading large folders. In that case, rerun the upload command with the -f switch.
- Appending a trailing '/' to the directory name can have unexpected effects, in some cases even overwriting the directory. To upload a directory, pass just its name as the argument, without a trailing slash.
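To sidestep the trailing-slash issue, you can normalize the path before passing it to upload. This is a minimal sketch; the helper name is an assumption and it strips only a single trailing slash.

```shell
# Remove one trailing slash from a path so that
# "ingest.sh upload $(strip_trailing_slash "$dir")" gets a bare directory name.
strip_trailing_slash() {
  printf '%s\n' "${1%/}"
}

# Example:
# ingest.sh upload "$(strip_trailing_slash /user/data/)"
```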