# Demo 2

# Hadoop Compatible File System

Hadoop is not tied to HDFS: its compute layer can read from and write to any Hadoop-compatible file system, for example:

* Amazon S3
* Azure Blob Storage
* Azure Data Lake Storage
* Swift Storage
* **Linux Local File System**

Each file system is addressed with a URI of the form:

```bash
scheme://authority/path
```
For instance:

* **Local file system:** `file://<path>`
* **HDFS:** `hdfs://namenode:port/<path>`
* **Amazon S3:** `s3://<bucket-name>/<key>/<path>`
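
The `scheme://authority/path` structure can be inspected with Python's standard `urllib.parse` module. This is only an illustrative sketch: the host, port, bucket, and file names are placeholders, `split_uri` is a hypothetical helper, and Hadoop performs its own URI handling internally.

```python
from urllib.parse import urlparse

# Hypothetical example URIs; host, port, bucket, and paths are placeholders.
EXAMPLES = [
    "file:///home/matheus/list.txt",
    "hdfs://localhost:9000/user/theo/artists.dat",
    "s3://my-bucket/trip-data/yellow_tripdata_2016-01.csv",
]


def split_uri(uri):
    """Return the (scheme, authority, path) parts of a file system URI."""
    parts = urlparse(uri)
    return parts.scheme, parts.netloc, parts.path


for uri in EXAMPLES:
    scheme, authority, path = split_uri(uri)
    print(f"{scheme:5} authority={authority!r:18} path={path}")
```

Note that a local URI such as `file:///home/...` has an empty authority, which is why three slashes appear in a row.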

In [3]:
!hdfs dfs -ls /

Found 1 items
drwxr-xr-x   - matheus supergroup          0 2019-07-19 04:25 /user


In [5]:
!hdfs dfs -ls file:///home/

Found 1 items
drwxr-xr-x   - matheus matheus       4096 2019-07-19 03:40 file:///home/matheus
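
Each entry line printed by `-ls` has eight whitespace-separated columns: permissions, replication (shown as `-` for directories), owner, group, size in bytes, modification date, modification time, and path. As a rough sketch, the listing can be turned into structured records in Python; `LsEntry` and `parse_ls_line` are hypothetical names introduced here, not part of any Hadoop API.

```python
from typing import NamedTuple


class LsEntry(NamedTuple):
    permissions: str
    replication: str  # "-" for directories
    owner: str
    group: str
    size: int
    date: str
    time: str
    path: str


def parse_ls_line(line):
    """Split one entry line of `hdfs dfs -ls` output into named fields."""
    # maxsplit=7 keeps a path that contains spaces in one piece.
    perms, repl, owner, group, size, date, time, path = line.strip().split(None, 7)
    return LsEntry(perms, repl, owner, group, int(size), date, time, path)


entry = parse_ls_line(
    "drwxr-xr-x   - matheus supergroup          0 2019-07-19 04:25 /user"
)
print(entry.owner, entry.path, entry.permissions.startswith("d"))
```

The `Found N items` header line is not an entry and would need to be skipped before calling the parser.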


Getting the default URI

In [7]:
!hdfs getconf -confKey fs.defaultFS

hdfs://localhost:9000
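
The default URI explains why `/` and `file:///home/` listed different file systems earlier: a path with no scheme is resolved against `fs.defaultFS`. A minimal Python sketch of that rule follows. It assumes absolute paths only, and `qualify` is a hypothetical helper, not a Hadoop function.

```python
from urllib.parse import urlparse

# Value reported by `hdfs getconf -confKey fs.defaultFS` in this demo.
DEFAULT_FS = "hdfs://localhost:9000"


def qualify(path, default_fs=DEFAULT_FS):
    """Return a fully qualified URI for an absolute path (hypothetical helper)."""
    if urlparse(path).scheme:
        # Already has a scheme such as file:// or hdfs://, so leave it alone.
        return path
    return default_fs.rstrip("/") + path


print(qualify("/user"))
print(qualify("file:///home/"))
```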


# Example dataset #2

**Description:** This dataset consists of 1,048,576 trip records for NYC yellow taxis from January 2016, collected by NYC’s Taxi and Limousine Commission. Trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. [Detailed information about this dataset can be accessed at Trip Record Data.](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

**Download URL**: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv

**File Type:** CSV

**File Size:** 1.6 GB

Sample Hadoop Use Cases:

1) Which location had the most yellow taxi pickups in January 2016?

2) Which day of the week had the most yellow taxi trips in January 2016?
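
To make the second question concrete, here is a plain Python sketch of the map-and-reduce logic involved. It does not use Hadoop itself, the three sample rows are made up for illustration, and the column name `tpep_pickup_datetime` is an assumption about the CSV header that should be checked against the downloaded file.

```python
import csv
import io
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Made-up rows for illustration only; the column name is an assumption
# about the CSV header and should be checked against the real file.
SAMPLE = """tpep_pickup_datetime,passenger_count
2016-01-01 00:05:00,1
2016-01-01 18:30:00,2
2016-01-02 09:10:00,1
"""


def map_weekday(row):
    """Map step: emit (weekday, 1) for one trip record."""
    pickup = datetime.strptime(row["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    return DAYS[pickup.weekday()], 1


def count_by_weekday(lines):
    """Reduce step: sum the emitted counts per weekday."""
    totals = Counter()
    for row in csv.DictReader(lines):
        day, one = map_weekday(row)
        totals[day] += one
    return totals


totals = count_by_weekday(io.StringIO(SAMPLE))
print(totals.most_common(1))
```

On a cluster, the map step would run in parallel over blocks of the file and the reduce step would combine the per-day counts.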

## Downloading the dataset

In [None]:
!wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv -q --show-progress

### Copying data to HDFS using URI

#### cp

`cp` takes a source URI and a destination URI. A path given without a scheme is resolved against the default file system (`fs.defaultFS`).

Use this command to copy data in from external data lakes.

In [10]:
!hdfs dfs -put dataset-lastfm/artists.dat /user/theo

put: `dataset-lastfm/artists.dat': No such file or directory


In [None]:
!ls dataset-lastfm

In [None]:
!tail dataset-lastfm/artists.dat

## dfs

### Listing the content of a directory

In [None]:
!hdfs dfs -ls /

In [None]:
!hdfs dfs -ls /user

### Creating a directory

In [None]:
!hdfs dfs -mkdir /user/theo

In [None]:
!hdfs dfs -ls /user

### Copying files from the local system to HDFS

In [None]:
!hdfs dfs -ls /user/theo

In [None]:
# Using put
!hdfs dfs -put dataset-lastfm/artists.dat /user/theo

In [None]:
# Using copyFromLocal
!hdfs dfs -copyFromLocal dataset-lastfm/user_artists.dat /user/theo

In [None]:
!hdfs dfs -ls /user/theo

<img src= "resources/images/fileinfo.png" width="55%">

### Displaying the contents of a file

In [None]:
!touch list.txt
!echo "item1" >  list.txt
!echo "item2" >> list.txt
!echo "item3" >> list.txt
!cat list.txt

In [None]:
!hdfs dfs -put list.txt /user/theo

In [None]:
!hdfs dfs -cat /user/theo/list.txt

In [None]:
!hdfs dfs -tail /user/theo/artists.dat

### Creating an empty file

`touchz` creates a file of zero length. An error is returned if the file exists with non-zero length.
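
The rule can be mimicked on the local file system to see both outcomes. This is a local analogy under stated assumptions, not the Hadoop implementation: `touchz_local` is a hypothetical helper, and it assumes that an existing zero-length file is accepted without error.

```python
import os
import tempfile


def touchz_local(path):
    """Local stand-in for `touchz`: create a zero-length file.

    Raises FileExistsError if the path already holds data. This is not
    the Hadoop implementation.
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        raise FileExistsError(f"{path} exists with non-zero length")
    # Opening in append mode creates the file without truncating anything.
    with open(path, "a"):
        pass


with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "newfile.txt")
    touchz_local(empty)  # creates the file
    touchz_local(empty)  # accepted: it exists but is still empty
    print(os.path.getsize(empty))

    with open(empty, "w") as fh:
        fh.write("item1\n")
    try:
        touchz_local(empty)
    except FileExistsError as err:
        print("error:", err)
```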


In [None]:
!hdfs dfs -touchz /user/theo/newfile.txt

In [None]:
!hdfs dfs -ls /user/theo/

### Copying a file from HDFS to the local system

In [None]:
!hdfs dfs -get /user/theo/newfile.txt newfile.txt

In [None]:
!ls

### Merging files

In [None]:
!hdfs dfs -put dataset-lastfm/tags.dat /user/theo

`getmerge` - Takes one or more source files (or a source directory) and a local destination file as input, and concatenates the sources into that local file.
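
The effect is plain concatenation in the order the sources are given. A local Python sketch of the same idea follows; `merge_files` and the file names are hypothetical, and nothing here talks to HDFS.

```python
import os
import shutil
import tempfile


def merge_files(sources, destination):
    """Concatenate the source files, in order, into one destination file."""
    with open(destination, "wb") as out:
        for src in sources:
            with open(src, "rb") as fh:
                shutil.copyfileobj(fh, out)


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "part-0.txt")
    second = os.path.join(tmp, "part-1.txt")
    merged = os.path.join(tmp, "merged.txt")
    with open(first, "w") as fh:
        fh.write("item1\nitem2\n")
    with open(second, "w") as fh:
        fh.write("item3\n")

    merge_files([first, second], merged)
    with open(merged) as fh:
        merged_text = fh.read()

print(merged_text)
```

If a source file does not end with a newline, its last line runs into the first line of the next file, which is worth keeping in mind when merging record-oriented data.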


In [None]:
!hdfs dfs -getmerge /user/theo/tags.dat /user/theo/artists.dat artist-tags.txt

In [None]:
!tail artist-tags.txt

### Verifying the replication factor

In [None]:
!hdfs dfs -stat %r /user/theo/artists.dat

### Changing the replication factor

In [None]:
!hdfs dfs -setrep 3 /user/theo/artists.dat

In [None]:
!hdfs dfs -ls /user/theo/artists.dat

In [None]:
!hdfs dfs -stat %r /user/theo/artists.dat
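
Raising the replication factor multiplies the raw storage a file occupies, because every block is stored that many times. The arithmetic can be sketched in Python as below. The 128 MiB block size is an assumption (it is configurable through `dfs.blocksize`), the 1 GiB file is hypothetical, and `storage_footprint` is a helper introduced only for this illustration.

```python
import math

BLOCK_SIZE = 128 * 1024 ** 2  # assumed dfs.blocksize of 128 MiB


def storage_footprint(file_size, replication, block_size=BLOCK_SIZE):
    """Return (blocks, block_replicas, raw_bytes) for one file."""
    blocks = math.ceil(file_size / block_size) if file_size else 0
    return blocks, blocks * replication, file_size * replication


size = 1024 ** 3  # a hypothetical 1 GiB file
for rep in (1, 3):
    blocks, replicas, raw = storage_footprint(size, rep)
    print(f"replication={rep}: {blocks} blocks, {replicas} replicas, "
          f"{raw / 1024 ** 3:.0f} GiB raw")
```

On a single-node setup there is only one DataNode to hold replicas, so a factor above 1 can be requested but the extra copies cannot actually be placed.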

### Deleting a file

In [None]:
!hdfs dfs -ls /user/theo

In [None]:
!hdfs dfs -rm /user/theo/tags.dat

In [None]:
!hdfs dfs -ls /user/theo

In [None]:
!hdfs dfs -rm '/user/theo/*'

In [None]:
!hdfs dfs -ls /user/theo

### Deleting a directory

In [None]:
!hdfs dfs -ls /user

In [None]:
!hdfs dfs -rmdir /user/theo/

In [None]:
!hdfs dfs -ls /user

### Getting help

In [None]:
# usage: print the usage string for an individual command
!hdfs dfs -usage chmod

In [None]:
!hdfs dfs -help

## More commands

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html