# Know your environment

In [2]:
!cat /etc/os-release

PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"


In [3]:
!python --version

Python 2.7.14 :: Anaconda, Inc.


In [4]:
!hadoop version | head -1

Hadoop 2.9.2


# Working with folders on HDFS
It's quite similar to working with folders on any \*nix system; just don't forget to prefix the common commands with `hdfs dfs -`:
* `ls`
* `mkdir`
* `rmdir`
* `mv`

Usually it's a good idea to work in your user's home directory: `/user/%username%`
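
If you prefer to call these commands from Python rather than through `!`, the prefix rule can be wrapped in a small helper. This is only a sketch: `hdfs_cmd`, `hdfs_home` and `run_hdfs` are hypothetical names, not part of Hadoop, and `run_hdfs` assumes that the `hdfs` client is on your `PATH`. The code avoids Python 3-only syntax so that it also runs on the Python 2.7 shown above.

```python
import getpass
import subprocess


def hdfs_cmd(operation, *args):
    """Build the argument list for `hdfs dfs -<operation> <args...>`."""
    return ["hdfs", "dfs", "-" + operation] + list(args)


def hdfs_home(username=None):
    """Return the conventional HDFS home directory, /user/<username>."""
    return "/user/" + (username or getpass.getuser())


def run_hdfs(operation, *args):
    """Run the command and return its output (assumes `hdfs` is on PATH)."""
    return subprocess.check_output(hdfs_cmd(operation, *args))


# Only builds and prints the command line; nothing is executed here
print(" ".join(hdfs_cmd("mkdir", hdfs_home("borisshminke") + "/data")))
# prints: hdfs dfs -mkdir /user/borisshminke/data
```

Passing the arguments as a list (instead of one shell string) means that paths with spaces need no extra quoting.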

In [5]:
!hdfs dfs -ls /user/borisshminke

In [6]:
!hdfs dfs -mkdir /user/borisshminke/data

In [7]:
!hdfs dfs -ls /user/borisshminke/data

In [8]:
!hdfs dfs -mkdir /user/borisshminke/data/tmp

In [9]:
!hdfs dfs -ls /user/borisshminke/data

Found 1 items
drwxr-xr-x   - root hadoop          0 2020-11-17 09:21 /user/borisshminke/data/tmp


In [10]:
!hdfs dfs -mv /user/borisshminke/data/tmp /user/borisshminke/data/tmp1

In [11]:
!hdfs dfs -ls /user/borisshminke/data

Found 1 items
drwxr-xr-x   - root hadoop          0 2020-11-17 09:21 /user/borisshminke/data/tmp1


In [12]:
!hdfs dfs -rmdir /user/borisshminke/data/tmp1

In [13]:
!hdfs dfs -ls /user/borisshminke/data

Mind the difference between your local file system and HDFS:

In [14]:
!ls /

bin   dev  hadoop  lib	  lib64   lost+found  mnt  proc  run   srv  tmp  var
boot  etc  home    lib32  libx32  media       opt  root  sbin  sys  usr


In [15]:
!!hdfs dfs -ls /

['Found 3 items',
 'drwx------   - mapred hadoop          0 2020-11-17 09:10 /hadoop',
 'drwxrwxrwt   - hdfs   hadoop          0 2020-11-17 09:10 /tmp',
 'drwxrwxrwt   - hdfs   hadoop          0 2020-11-17 09:10 /user']
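
Since `!!` returns the listing as a list of strings, it can be turned into structured records. The sketch below assumes the eight-column layout visible in the output above (permissions, replication, owner, group, size, date, time, path); `HdfsEntry` and `parse_ls_line` are hypothetical names introduced for illustration.

```python
from collections import namedtuple

HdfsEntry = namedtuple(
    "HdfsEntry",
    ["permissions", "replication", "owner", "group", "size", "modified", "path"],
)


def parse_ls_line(line):
    """Parse one entry line of `hdfs dfs -ls` output; return None otherwise."""
    # Splitting at most 7 times keeps a path with spaces in one piece
    fields = line.split(None, 7)
    if len(fields) != 8:
        return None  # e.g. the "Found 3 items" header
    permissions, replication, owner, group, size, date, time, path = fields
    return HdfsEntry(permissions, replication, owner, group, int(size),
                     date + " " + time, path)


listing = [
    "Found 3 items",
    "drwx------   - mapred hadoop          0 2020-11-17 09:10 /hadoop",
    "drwxrwxrwt   - hdfs   hadoop          0 2020-11-17 09:10 /tmp",
    "drwxrwxrwt   - hdfs   hadoop          0 2020-11-17 09:10 /user",
]
entries = [entry for entry in map(parse_ls_line, listing) if entry is not None]
print([entry.path for entry in entries])
# prints: ['/hadoop', '/tmp', '/user']
```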

# Uploading data from your local filesystem to HDFS and getting it back
For uploading data to your HDFS cluster and downloading it back, an `ftp`-like parlance is used:
* `put`
* `get`

In [16]:
!hdfs dfs -put \
    file:///home/borisshminke/Downloads/yelp_academic_dataset_review.json \
    /user/borisshminke/data

You can also use the `du` command to measure the actual size of your data on HDFS:

In [17]:
!hdfs dfs -du -h /user/borisshminke/data

5.9 G  /user/borisshminke/data/yelp_academic_dataset_review.json


In [18]:
!mkdir -p /home/borisshminke/Downloads/from_hdfs

In [19]:
!hdfs dfs -get /user/borisshminke/data/* file:///home/borisshminke/Downloads/from_hdfs

In [20]:
!ls -lh /home/borisshminke/Downloads/from_hdfs

total 5.9G
-rw-r--r-- 1 root root 5.9G Nov 17 09:24 yelp_academic_dataset_review.json
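
Matching sizes suggest that the round trip worked, but they don't prove it. One way to check is to compare checksums of the original file and the downloaded copy. This is a minimal sketch using only the Python standard library; `file_md5` and `same_content` are hypothetical helper names, and the file is read in chunks because it doesn't make sense to load 5.9G into memory.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a local file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Tell whether two local files have identical content."""
    return file_md5(path_a) == file_md5(path_b)
```

For the files above, the check would be `same_content("/home/borisshminke/Downloads/yelp_academic_dataset_review.json", "/home/borisshminke/Downloads/from_hdfs/yelp_academic_dataset_review.json")`.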


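Mind that since work on HDFS is distributed, a dataset is often stored as a directory of part files (one slice per task that wrote it), so you can end up `get`-ing a bunch of files instead of one. A single HDFS file, however, always comes back as a single file, even though its blocks live on different nodes of the cluster.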

If you want to get a single file, use the `getmerge` command.
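
To get a feeling for what merging the slices means, here is a purely local illustration. It is not how `getmerge` is implemented: it only concatenates the files of a local directory, and processing them in name order is an assumption. `merge_parts` and the `part-0000N` file names are hypothetical.

```python
import os
import shutil
import tempfile


def merge_parts(source_dir, target_path):
    """Concatenate every regular file in source_dir, in name order."""
    with open(target_path, "wb") as target:
        for name in sorted(os.listdir(source_dir)):
            part_path = os.path.join(source_dir, name)
            if os.path.isfile(part_path):
                with open(part_path, "rb") as part:
                    shutil.copyfileobj(part, target)


# Hypothetical slices standing in for a directory fetched with `get`
workdir = tempfile.mkdtemp()
parts_dir = os.path.join(workdir, "dataset")
os.mkdir(parts_dir)
for index, text in enumerate(["first slice\n", "second slice\n"]):
    with open(os.path.join(parts_dir, "part-%05d" % index), "w") as part:
        part.write(text)

# The merged file lives outside parts_dir, so it is never merged into itself
merged_path = os.path.join(workdir, "merged.txt")
merge_parts(parts_dir, merged_path)
with open(merged_path) as merged:
    print(merged.read())
```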

If you have several HDFS clusters at your disposal, you can move data from one to another with the same pair of commands (`get`/`put`).

# Do It Yourself
* run `hdfs dfs -help` and get an idea of other possible commands
* try to create a subfolder in your `/user/%username%` folder
* create a new empty file on HDFS using `touchz` (the HDFS counterpart of `touch` in this Hadoop version)
* `cat` the newly created file (don't do that in the future when working with huge files :))
* upload some files from your system to the cluster
* use `head`/`tail` to look into the uploaded files
* create a full copy of your working subfolder on HDFS (use `cp`)
* remove the redundant copy
* compare the results of `get` and `getmerge` commands
* compare the results of `put` and `moveFromLocal` commands
* feel free to play with other HDFS commands