# Know your environment

In [1]:
!cat /etc/os-release

PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"


In [2]:
!python --version

Python 2.7.14 :: Anaconda, Inc.


In [3]:
!hadoop version | head -1

Hadoop 2.9.2


# Working with folders on HDFS
It's quite similar to working with folders on any \*nix system; just remember to prefix the common commands with `hdfs dfs -`:
* `ls`
* `mkdir`
* `rmdir`
* `mv`

Usually it's a good idea to work inside your own user directory: `/user/%username%`
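
The prefixing rule can be sketched as a tiny Python helper that turns a familiar command name into the argument list for `hdfs dfs`. This is only an illustration: the names `hdfs_command` and `user_home` are hypothetical and not part of Hadoop, Python 3 is assumed (the kernel above reports 2.7), and the helper only builds the command without running it.

```python
def hdfs_command(name, *args):
    """Build the argv list for `hdfs dfs -<name> <args...>` (nothing is executed)."""
    return ["hdfs", "dfs", "-" + name] + list(args)


def user_home(username):
    """Conventional per-user directory on HDFS."""
    return "/user/" + username


# Example: the command that creates a `data` folder for the user `qlr`.
print(hdfs_command("mkdir", user_home("qlr") + "/data"))
```

On a machine with the Hadoop client installed, the resulting list could be handed to `subprocess.run`.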

In [11]:
!hdfs dfs -ls /user

Found 9 items
drwxrwxrwt   - hdfs hadoop          0 2020-11-19 13:20 /user/hbase
drwxrwxrwt   - hdfs hadoop          0 2020-11-19 13:20 /user/hdfs
drwxrwxrwt   - hdfs hadoop          0 2020-11-19 13:20 /user/hive
drwxrwxrwt   - hdfs hadoop          0 2020-11-19 13:20 /user/mapred
drwxrwxrwt   - hdfs hadoop          0 2020-11-19 13:20 /user/pig
drwxr-xr-x   - root hadoop          0 2020-11-19 13:47 /user/qlr
drwxrwxrwt   - hdfs hadoop          0 2020-11-19 13:20 /user/spark
drwxrwxrwt   - hdfs hadoop          0 2020-11-19 13:20 /user/yarn
drwxrwxrwt   - hdfs hadoop          0 2020-11-19 13:20 /user/zookeeper


In [10]:
!hdfs dfs -mkdir /user/qlr

In [12]:
!hdfs dfs -mkdir /user/qlr/data

In [13]:
!hdfs dfs -ls /user/qlr/data

In [14]:
!hdfs dfs -mkdir /user/qlr/data/tmp

In [15]:
!hdfs dfs -ls /user/qlr/data

Found 1 items
drwxr-xr-x   - root hadoop          0 2020-11-19 13:47 /user/qlr/data/tmp


In [16]:
!hdfs dfs -mv /user/qlr/data/tmp /user/qlr/data/tmp1

In [17]:
!hdfs dfs -ls /user/qlr/data

Found 1 items
drwxr-xr-x   - root hadoop          0 2020-11-19 13:47 /user/qlr/data/tmp1


In [18]:
!hdfs dfs -rmdir /user/qlr/data/tmp1

In [19]:
!hdfs dfs -ls /user/qlr/data

Mind the difference between your local filesystem and HDFS:

In [20]:
!ls /

bin   dev  hadoop  lib	  lib64   lost+found  mnt  proc  run   srv  tmp  var
boot  etc  home    lib32  libx32  media       opt  root  sbin  sys  usr


In [21]:
!!hdfs dfs -ls /

['Found 3 items',
 'drwx------   - mapred hadoop          0 2020-11-19 13:20 /hadoop',
 'drwxrwxrwt   - hdfs   hadoop          0 2020-11-19 13:20 /tmp',
 'drwxrwxrwt   - hdfs   hadoop          0 2020-11-19 13:47 /user']
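
Since `!!` returns the listing as a Python list of strings, it can be turned into structured records. The sketch below assumes Python 3 and the eight-column layout seen in the outputs above (permissions, replication, owner, group, size, date, time, path); `Entry` and `parse_ls` are hypothetical names introduced for illustration.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "permissions replication owner group size modified path")


def parse_ls(lines):
    """Parse `hdfs dfs -ls` output lines, skipping the 'Found N items' header."""
    entries = []
    for line in lines:
        parts = line.split(None, 7)  # at most 8 fields; the path keeps any spaces
        if len(parts) != 8:
            continue
        perms, repl, owner, group, size, date, time, path = parts
        entries.append(Entry(perms, repl, owner, group, int(size), date + " " + time, path))
    return entries


listing = [
    "Found 3 items",
    "drwx------   - mapred hadoop          0 2020-11-19 13:20 /hadoop",
    "drwxrwxrwt   - hdfs   hadoop          0 2020-11-19 13:20 /tmp",
    "drwxrwxrwt   - hdfs   hadoop          0 2020-11-19 13:47 /user",
]
# Directories are the entries whose permission string starts with "d".
print([e.path for e in parse_ls(listing) if e.permissions.startswith("d")])
```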

<hr>

In [23]:
!hdfs dfs -ls /user/qlr

Found 1 items
drwxr-xr-x   - root hadoop          0 2020-11-19 13:47 /user/qlr/data


In [24]:
!hdfs dfs -ls /user/qlr/data

# Uploading data from your local filesystem to HDFS and getting it back
Uploading data to your HDFS cluster and downloading it back uses an `ftp`-like vocabulary:
* `put`
* `get`

In [26]:
!hdfs dfs -put file:///home/user/Downloads/yelp_academic_dataset_review.json /user/qlr/data
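
The `file://` prefix in the command above marks the source as a path on the local filesystem. As a small sketch (Python 3 assumed; `local_uri` is a hypothetical helper name), such a URI can be built from an absolute local path with the standard library:

```python
from pathlib import PurePosixPath


def local_uri(path):
    """Turn an absolute POSIX path into a file:// URI (raises ValueError if relative)."""
    return PurePosixPath(path).as_uri()


print(local_uri("/home/user/Downloads/mapper.py"))
```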

You can also use the `du` command to measure the actual size of your data on HDFS:

In [27]:
!hdfs dfs -du -h /user/qlr/data

5.9 G  /user/qlr/data/yelp_academic_dataset_review.json
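
If the human-readable size is needed as a number, it can be converted back to bytes. The sketch below assumes Python 3 and that the `-h` units are binary (1 K = 1024 bytes), which is my understanding of Hadoop's formatting but worth verifying on your cluster; `parse_size` is a hypothetical helper. The result is approximate, because `-h` rounds the displayed value.

```python
# Exponent of 1024 for each unit letter; "" means plain bytes.
UNITS = {"": 0, "K": 1, "M": 2, "G": 3, "T": 4, "P": 5}


def parse_size(text):
    """Convert a size such as '5.9 G' or '139' into an approximate byte count."""
    parts = text.split()
    value = float(parts[0])
    unit = parts[1].upper() if len(parts) > 1 else ""
    return int(value * 1024 ** UNITS[unit])


print(parse_size("5.9 G"))
```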


In [18]:
!mkdir -p /home/user/Downloads/from_hdfs

In [28]:
!hdfs dfs -get /user/qlr/data/* file:///home/user/Downloads/from_hdfs

In [29]:
!ls -lh /home/user/Downloads/from_hdfs

-rw-r--r-- 1 root root 5.9G Nov 19 13:55 /home/user/Downloads/from_hdfs


Mind that since storage and processing on HDFS are distributed, a data set is often kept as a directory of part files (for example, the output of a distributed job), so you can end up `get`-ing a bunch of files instead of one.

If you want to get a single file, use the `getmerge` command.
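
Conceptually, `getmerge` concatenates the files of a source directory into one local file. The following is a local-filesystem sketch of that idea only, not Hadoop's implementation: it assumes Python 3 and that parts are merged in name order, and `merge_parts` with its `add_newline` parameter (loosely mirroring the `-nl` option) is a hypothetical name.

```python
import os
import tempfile


def merge_parts(src_dir, dst_path, add_newline=False):
    """Concatenate every regular file in src_dir, in name order, into dst_path."""
    with open(dst_path, "wb") as dst:
        for name in sorted(os.listdir(src_dir)):
            part = os.path.join(src_dir, name)
            if not os.path.isfile(part):
                continue  # ignore subdirectories
            with open(part, "rb") as src:
                dst.write(src.read())
            if add_newline:
                dst.write(b"\n")
    return dst_path


# Demo with two small part files; the destination lives outside the source folder.
with tempfile.TemporaryDirectory() as tmp:
    parts_dir = os.path.join(tmp, "parts")
    os.mkdir(parts_dir)
    for name, text in [("part-00000", b"a\n"), ("part-00001", b"b\n")]:
        with open(os.path.join(parts_dir, name), "wb") as f:
            f.write(text)
    merged = merge_parts(parts_dir, os.path.join(tmp, "merged.txt"))
    with open(merged, "rb") as f:
        print(f.read())
```

For really large parts, copying in chunks (for instance with `shutil.copyfileobj`) would avoid reading a whole file into memory.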

If you have several HDFS clusters at your disposal, you can move data from one to another with the same pair of commands (`get`/`put`).

# Do It Yourself
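* run `hdfs dfs -help` to get an idea of the other available commands (with a double dash, `--help` is reported as an unknown command, although the usage summary is still printed)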

In [30]:
!hdfs dfs --help

--help: Unknown command
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
	[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] <path> ...]
	[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-x] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u]

* try to create a subfolder in your `/user/%username%` folder

In [31]:
!hdfs dfs -mkdir /user/qlr/test_folder

* create a new, empty file on HDFS using `touchz` (the HDFS counterpart of `touch`)

In [36]:
!hdfs dfs -touchz /user/qlr/test_folder/test.py

In [46]:
!hdfs dfs -ls /user/qlr/test_folder

Found 2 items
-rw-r--r--   2 root hadoop        139 2020-11-19 14:10 /user/qlr/test_folder/mapper.py
-rw-r--r--   2 root hadoop          0 2020-11-19 14:04 /user/qlr/test_folder/test.py


* `cat` the newly created file (don't do this in the future when working with huge files :))

In [38]:
!hdfs dfs -cat /user/qlr/test_folder/test.py

* upload some files from your system to the cluster

In [39]:
!hdfs dfs -put file:///home/user/Downloads/mapper.py /user/qlr/test_folder

* use `head`/`tail` to look into the uploaded files (here: `-cat` piped into the local `head`, and `hdfs dfs -tail`)

In [45]:
!hdfs dfs -cat /user/qlr/test_folder/mapper.py | head

#!/usr/bin/python

counter = 0
while True:
    try:
        counter += 1
        input()
    except EOFError:
        break
print(counter)


In [41]:
!hdfs dfs -tail /user/qlr/test_folder/mapper.py

#!/usr/bin/python

counter = 0
while True:
    try:
        counter += 1
        input()
    except EOFError:
        break
print(counter)
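
A note on the uploaded `mapper.py` as listed: `counter` is incremented before each `input()` call, so the final failed read is counted as well and the printed value would be one more than the number of input lines; under Python 2, `input()` would additionally try to evaluate each line as an expression. A minimal alternative sketch, assuming Python 3 and that the intent is simply to count input lines (`count_lines` is a hypothetical name):

```python
import io


def count_lines(stream):
    """Count the lines of a text stream without holding them in memory."""
    return sum(1 for _ in stream)


# Demo on an in-memory stream with three lines.
print(count_lines(io.StringIO("a\nb\nc\n")))
```

In an actual mapper script the stream would be `sys.stdin`.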


* create a full copy of your working subfolder on HDFS (use `cp`)

In [48]:
!hdfs dfs -cp /user/qlr/test_folder /user/qlr/test_folder_copy

In [49]:
!hdfs dfs -ls /user/qlr/test_folder_copy

Found 2 items
-rw-r--r--   2 root hadoop        139 2020-11-19 14:14 /user/qlr/test_folder_copy/mapper.py
-rw-r--r--   2 root hadoop          0 2020-11-19 14:14 /user/qlr/test_folder_copy/test.py


* remove the redundant copy

In [52]:
!hdfs dfs -rm -R /user/qlr/test_folder_copy

Deleted /user/qlr/test_folder_copy


* compare the results of `get` and `getmerge` commands

* compare the results of `put` and `moveFromLocal` commands

* feel free to play with other HDFS commands