## Downloading Hadoop

In [1]:
%env HADOOP_VERSION 3.1.2
%env HADOOP_PATH hadoop-3.1.2

env: HADOOP_VERSION=3.1.2
env: HADOOP_PATH=hadoop-3.1.2


In [2]:
!wget http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz -q --show-progress
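
Apache release tarballs are published together with checksum files, so the download can be verified before it is extracted. The following is a minimal sketch and not part of the original notebook: the helper names `sha512_of` and `matches_digest` are illustrative, and it assumes the expected SHA-512 digest is copied by hand from the published checksum file.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large tarball is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare case-insensitively, ignoring whitespace in the published digest."""
    cleaned = "".join(expected_hex.split()).lower()
    return sha512_of(path) == cleaned
```

A call such as `matches_digest("hadoop-3.1.2.tar.gz", expected)` returns `True` only when the file on disk matches the digest passed in as `expected`.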



### Extracting the archive and removing the .tar.gz file

In [3]:
!tar -xzf hadoop-${HADOOP_VERSION}.tar.gz
!rm       hadoop-${HADOOP_VERSION}.tar.gz

### Discovering Java path

In [4]:
!dirname $(dirname $(readlink -f $(which javac)))

/usr/lib/jvm/java-8-openjdk-amd64
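
The same lookup can be done from the Python kernel, which avoids copying the path by hand into the next cell. This is a sketch under the assumption that `javac` sits in `<JDK root>/bin/`, as the shell pipeline above also assumes; the function names are illustrative. `shutil.which` plays the role of `which`, `os.path.realpath` the role of `readlink -f`, and the two `os.path.dirname` calls strip `bin/javac`.

```python
import os
import shutil


def java_home_from(javac_path):
    """Resolve symlinks, then strip the trailing `bin/javac` components."""
    real = os.path.realpath(javac_path)
    return os.path.dirname(os.path.dirname(real))


def discover_java_home():
    """Return the JDK root for the `javac` found on PATH, or None if absent."""
    javac = shutil.which("javac")
    return java_home_from(javac) if javac else None
```

The result of `discover_java_home()` could then be assigned to `os.environ["JAVA_HOME"]` when it is not `None`.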


### Setting Java path envvar

We also add it to the user's .bashrc so that it is loaded whenever the nodes open ssh connections.

In [5]:
%env JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64

env: JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64


In [6]:
!echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
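
A single `>` redirection replaces the target file, while `>>` appends a duplicate line every time the cell is re-run. One way around both is to append only when the line is missing. The helper below is a hypothetical sketch (the name `ensure_line` is not from the notebook) that could be pointed at `~/.bashrc`.

```python
import os


def ensure_line(path, line):
    """Append `line` to `path` unless an identical line is already present.

    Returns True when the file was modified.
    """
    path = os.path.expanduser(path)
    content = ""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            content = handle.read()
    if line in content.splitlines():
        return False
    # Start on a fresh line if the file does not end with a newline.
    separator = "" if content == "" or content.endswith("\n") else "\n"
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(separator + line + "\n")
    return True
```

For example, `ensure_line("~/.bashrc", "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64")` leaves existing entries untouched and is safe to run repeatedly.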

# Using Hadoop in Standalone Mode (local)

# Using Hadoop in Pseudo-Distributed Mode

### Starting sshd server

Check the postBuild and sshd_config files for more details.

In [7]:
!/usr/sbin/sshd -f resources/configs/ssh/sshd_config 

### Copying configuration files to Hadoop folder

In [20]:
!cp resources/configs/hadoop/${HADOOP_VERSION}/* ${HADOOP_PATH}/etc/hadoop/
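
The copied files are not shown in this notebook. For orientation, the official single-node setup guide configures pseudo-distributed mode with two properties: `fs.defaultFS` in `core-site.xml` and `dfs.replication` in `hdfs-site.xml`. The sketch below renders such files from Python. The values `hdfs://localhost:9000` and `1` come from that guide and are assumptions here; the files under `resources/configs/hadoop/` may differ. The helper name `hadoop_site_xml` is illustrative.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


# Assumed values, taken from the Hadoop single-node setup guide.
core_site = hadoop_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = hadoop_site_xml({"dfs.replication": 1})
```

Each string could be written to the matching file in `${HADOOP_PATH}/etc/hadoop/` instead of copying a prepared one.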

### Formatting the filesystem

In [21]:
!./${HADOOP_PATH}/bin/hdfs namenode -format -force -nonInteractive

2019-07-07 01:12:11,447 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = jupyter-thedatasociety-2dbinderhub-2dhadoop-2d7t7j5qi7/10.12.14.122
STARTUP_MSG:   args = [-format, -force, -nonInteractive]
STARTUP_MSG:   version = 3.1.2
STARTUP_MSG:   classpath = /home/jovyan/hadoop-3.1.2/etc/hadoop:/home/jovyan/hadoop-3.1.2/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/home/jovyan/hadoop-3.1.2/share/hadoop/common/lib/jetty-server-9.3.24.v20180605.jar:/home/jovyan/hadoop-3.1.2/share/hadoop/common/lib/jersey-json-1.19.jar:/home/jovyan/hadoop-3.1.2/share/hadoop/common/lib/jackson-core-2.7.8.jar:/home/jovyan/hadoop-3.1.2/share/hadoop/common/lib/commons-lang-2.6.jar:/home/jovyan/hadoop-3.1.2/share/hadoop/common/lib/kerby-util-1.0.1.jar:/home/jovyan/hadoop-3.1.2/share/hadoop/common/lib/nimbus-jose-jwt-4.41.1.jar:/home/jovyan/hadoop-3.1.2/share/hadoop/common/lib/jul-to-slf4j-1.7.25.jar:/hom

### Adding ssh options: running on a different port

In [22]:
%env HADOOP_SSH_OPTS= -o StrictHostKeyChecking=no -p 8822 

env: HADOOP_SSH_OPTS=-o StrictHostKeyChecking=no -p 8822


### Starting/stopping the NameNode and DataNode daemons

In [23]:
!./${HADOOP_PATH}/sbin/start-dfs.sh
# Uncomment to stop the daemons:
# !./${HADOOP_PATH}/sbin/stop-dfs.sh

Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [jupyter-thedatasociety-2dbinderhub-2dhadoop-2d7t7j5qi7]


### Creating folders in the distributed file system

In [24]:
!./${HADOOP_PATH}/bin/hdfs dfs -mkdir -p /user/matheus/input/

### Copying a file to a folder in the distributed file system

In [25]:
!./${HADOOP_PATH}/bin/hdfs dfs -put ./resources/examples/newyorknewyork.txt /user/matheus/input/

In [26]:
!./${HADOOP_PATH}/bin/hdfs dfs -ls /user/matheus/input/

Found 1 items
-rw-r--r--   1 jovyan supergroup        865 2019-07-07 01:14 /user/matheus/input/newyorknewyork.txt


In [27]:
!./${HADOOP_PATH}/bin/hdfs dfs -cat /user/matheus/input/newyorknewyork.txt

Start spreading the news
I am leaving today
I want to be a part of it
New York New York
These vagabond shoes
They are longing to stray
Right through the very heart of it
New York New York
I want to wake up in that city
That doesn't sleep
And find I'm king of the hill
Top of the heap
My little town blues
They are melting away
I gonna make a brand new start of it
In old New York
If I can make it there
I'll make it anywhere
It's up to you
New York New York
New York New York
I want to wake up in that city
That never sleeps
And find I'm king of the hill
Top of the list
Head of the heap
King of the hill
These are little town blues
They have all melted away
I am about to make a brand new start of it
Right there in old New York
And you bet baby
If I can make it there
You know I'm gonna make it just about anywhere
Come on come through
New York New York New York


# Running YARN on a single node

In [36]:
!./${HADOOP_PATH}/sbin/start-yarn.sh

Starting resourcemanager
resourcemanager is running as process 1125.  Stop it first.
Starting nodemanagers
localhost: nodemanager is running as process 1222.  Stop it first.
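
The start script may return before the daemons accept connections, and a job submitted too early only retries against the ResourceManager. A small readiness check can guard the next step. This is a sketch with illustrative function names; port 8032 is the ResourceManager address that appears in the client log of this notebook, and `localhost` is an assumption for a single-node setup.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, attempts=30, delay=1.0):
    """Poll until the port accepts connections; False after `attempts` tries."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_port("localhost", 8032)` before submitting a job returns `True` as soon as the port is reachable, and `False` if it stays closed for the whole polling window.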


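# Simple word count example

The job is submitted through YARN, so the ResourceManager must be running before this cell is executed. In the output captured below, the client keeps retrying the connection to port 8032, and the later listing shows that no output folder was created.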

In [32]:
!./${HADOOP_PATH}/bin/yarn jar ./${HADOOP_PATH}/share/hadoop/mapreduce/hadoop-mapreduce-examples-${HADOOP_VERSION}.jar  \
                                 wordcount \
                                 /user/matheus/input /user/matheus/output

2019-07-07 01:15:36,634 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2019-07-07 01:15:38,374 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-07 01:15:39,375 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-07 01:15:40,376 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-07 01:15:41,377 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-07 01:15:42,379 INFO ipc.Client: Retrying connect to

In [33]:
!./${HADOOP_PATH}/bin/hdfs dfs -ls /user/matheus/output/

ls: `/user/matheus/output/': No such file or directory


In [34]:
!./${HADOOP_PATH}/bin/hdfs dfs -cat /user/matheus/output/part-r-00000

cat: `/user/matheus/output/part-r-00000': No such file or directory
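
Since the captured run produced no `part-r-00000` file, a local sketch can illustrate what the word count job computes. It is plain Python, not the Hadoop example's code: the map step emits a `(word, 1)` pair per whitespace-separated token, and the reduce step sums the pairs per word. Two lines of the input file serve as sample data; the function names are illustrative.

```python
from collections import Counter


def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce step: sum the values per word and return them sorted by word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


sample = ["New York New York", "In old New York"]
pairs = [pair for line in sample for pair in map_words(line)]
for word, count in reduce_counts(pairs):
    print(f"{word}\t{count}")
# Prints, tab-separated: In 1, New 3, York 3, old 1 (one pair per line)
```

Counting is case-sensitive and punctuation stays attached to the word, so `York` and `york` would be counted separately.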


In [35]:
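# Uncomment to delete the output folder before re-running the job (the job fails if the folder already exists):
# !./${HADOOP_PATH}/bin/hdfs dfs -rm -r /user/matheus/output/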