## Downloading Hadoop

In [1]:
%env HADOOP_VERSION 2.9.2
%env HADOOP_PATH hadoop-2.9.2

env: HADOOP_VERSION=2.9.2
env: HADOOP_PATH=hadoop-2.9.2


In [2]:
# !wget http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz -q --show-progress
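
The extraction step expects a file named hadoop-${HADOOP_VERSION}.tar.gz, and Apache mirrors keep that file inside a hadoop-<version>/ directory. A minimal Python sketch of how the URL can be assembled from the version follows. The helper name `hadoop_archive_url` is hypothetical, and the mirror is the one from the cell above, which may no longer host this release.

```python
import os

# Mirror taken from the cell above. Treat it as an assumption: mirrors
# tend to drop old releases over time.
MIRROR = "http://ftp.unicamp.br/pub/apache/hadoop/common"


def hadoop_archive_url(version, mirror=MIRROR):
    """Return the tarball URL for a Hadoop release (hypothetical helper)."""
    name = f"hadoop-{version}"
    return f"{mirror.rstrip('/')}/{name}/{name}.tar.gz"


# Falls back to 2.9.2 when HADOOP_VERSION is not set in the environment.
print(hadoop_archive_url(os.environ.get("HADOOP_VERSION", "2.9.2")))
```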

### Extracting the archive and removing the .tar.gz file

In [3]:
!tar -xf hadoop-${HADOOP_VERSION}.tar.gz
# !rm hadoop-${HADOOP_VERSION}.tar.gz

### Discovering Java path

In [4]:
!dirname $(dirname $(readlink -f $(which javac)))

/usr/lib/jvm/java-8-openjdk-amd64
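
The shell pipeline above resolves the javac symlink and then strips the trailing bin/javac components. The same steps can be sketched in Python. The function name `java_home_from` is hypothetical, and the sketch assumes the JDK uses the usual <JAVA_HOME>/bin/javac layout.

```python
import os
import shutil


def java_home_from(javac_path):
    """Resolve symlinks, then drop the trailing bin/javac components."""
    real = os.path.realpath(javac_path)
    return os.path.dirname(os.path.dirname(real))


javac = shutil.which("javac")  # None when no JDK is on PATH
if javac is not None:
    print(java_home_from(javac))
```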


### Setting Java path envvar

We also append it to the user's .bashrc so that it is loaded when the nodes open SSH connections.

In [5]:
%env JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64

env: JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64


In [6]:
!echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
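
Because `>>` appends unconditionally, re-running the cell adds the same export line again. One possible way to avoid duplicates is to append only when the line is missing. This is a sketch, not part of the original setup, and `append_once` is a hypothetical helper.

```python
import os


def append_once(path, line):
    """Append `line` to `path` unless an identical line already exists.

    Returns True when the file was modified, False otherwise.
    """
    content = ""
    if os.path.exists(path):
        with open(path) as fh:
            content = fh.read()
    if line in content.splitlines():
        return False
    with open(path, "a") as fh:
        # Start on a fresh line if the file does not end with a newline.
        if content and not content.endswith("\n"):
            fh.write("\n")
        fh.write(line + "\n")
    return True
```

It could be called as `append_once(os.path.expanduser("~/.bashrc"), "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64")`.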

# Using Hadoop in Standalone Mode (local)

# Using Hadoop in Pseudo-Distributed Mode

### Starting sshd server

Check the postBuild and sshd_config files for more details.

In [7]:
!/usr/sbin/sshd -f resources/configs/ssh/sshd_config 

### Adding to known hosts by establishing an SSH connection (avoiding the yes/no host confirmation)

In [8]:
!ssh -o "StrictHostKeyChecking no" -p 8822 $USER@localhost exit
!ssh -o "StrictHostKeyChecking no" -p 8822 $USER@0.0.0.0 exit



### Copying configuration files to the Hadoop folder

In [11]:
!cp resources/configs/hadoop/${HADOOP_VERSION}/* ${HADOOP_PATH}/etc/hadoop/

### Formatting the filesystem

In [13]:
!./${HADOOP_PATH}/bin/hdfs namenode -format -force -nonInteractive

19/07/07 16:26:50 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = 9106882a5d72/172.17.0.2
STARTUP_MSG:   args = [-format, -force, -nonInteractive]
STARTUP_MSG:   version = 2.9.2
STARTUP_MSG:   classpath = /home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/etc/hadoop:/home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/share/hadoop/common/lib/xmlenc-0.52.jar:/home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/share/hadoop/common/lib/nimbus-jose-jwt-4.41.1.jar:/home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/share/hadoop/common/lib/httpcore-4.4.4.jar:/home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/share/hadoop/common/lib/junit-4.11.jar:/home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/share/hadoop/common/lib/activation-1.1.jar:/home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/s

### Adding SSH options: running on a different port

In [14]:
%env HADOOP_SSH_OPTS=-o StrictHostKeyChecking=no -p 8822

env: HADOOP_SSH_OPTS=-o StrictHostKeyChecking=no -p 8822


### Starting/stopping the NameNode and DataNode daemons

In [16]:
!./${HADOOP_PATH}/sbin/start-dfs.sh
# !./${HADOOP_PATH}/sbin/stop-dfs.sh

Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/logs/hadoop-matheus-namenode-9106882a5d72.out
localhost: starting datanode, logging to /home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/logs/hadoop-matheus-datanode-9106882a5d72.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/logs/hadoop-matheus-secondarynamenode-9106882a5d72.out
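
The log lines above only say that the daemons were launched. A quick way to confirm they are still running is to look for them in the output of `jps`, the JVM process lister shipped with the JDK. The sketch below assumes `jps` is on the PATH and that the daemons run as the current user. `missing_daemons` and `check_dfs_daemons` are hypothetical helper names.

```python
import subprocess

# Daemons started by start-dfs.sh in this pseudo-distributed setup.
EXPECTED = {"NameNode", "DataNode", "SecondaryNameNode"}


def missing_daemons(jps_output, expected=EXPECTED):
    """Return the expected daemon names absent from `jps` output.

    Each `jps` line has the form "<pid> <main class name>".
    """
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            running.add(parts[1])
    return set(expected) - running


def check_dfs_daemons():
    """Run `jps` and report which HDFS daemons are not running."""
    out = subprocess.run(
        ["jps"], capture_output=True, text=True, check=True
    ).stdout
    return missing_daemons(out)
```

An empty set from `check_dfs_daemons()` means all three daemons were found.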


### Creating folders in the distributed file system

In [17]:
!./${HADOOP_PATH}/bin/hdfs dfs -mkdir /user/
!./${HADOOP_PATH}/bin/hdfs dfs -mkdir /user/matheus/
!./${HADOOP_PATH}/bin/hdfs dfs -mkdir /user/matheus/input/
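
The three calls above create one directory level at a time. `hdfs dfs -mkdir` also accepts `-p`, which creates missing parent directories in a single call. A minimal sketch of building that command from Python follows. The helper names are hypothetical, and the default `hadoop-2.9.2` path is taken from the HADOOP_PATH value set earlier.

```python
import getpass
import os


def user_input_dir(user=None):
    """Return /user/<user>/input, defaulting to the current login name."""
    return f"/user/{user or getpass.getuser()}/input"


def mkdir_command(hdfs_dir, hadoop_path=None):
    """Build the argv for `hdfs dfs -mkdir -p <hdfs_dir>`."""
    hadoop_path = hadoop_path or os.environ.get("HADOOP_PATH", "hadoop-2.9.2")
    hdfs_bin = os.path.join(".", hadoop_path, "bin", "hdfs")
    # -p creates parent directories and does not fail if the path exists.
    return [hdfs_bin, "dfs", "-mkdir", "-p", hdfs_dir]
```

The resulting list can be passed to `subprocess.run(mkdir_command(user_input_dir()), check=True)` once HDFS is up.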

### Copying a file to a folder in the distributed file system

In [18]:
!./${HADOOP_PATH}/bin/hdfs dfs -put ./resources/examples/newyorknewyork.txt /user/matheus/input/

In [19]:
!./${HADOOP_PATH}/bin/hdfs dfs -ls /user/matheus/input/

Found 1 items
-rw-r--r--   1 matheus supergroup        865 2019-07-07 16:27 /user/matheus/input/newyorknewyork.txt


In [20]:
!./${HADOOP_PATH}/bin/hdfs dfs -cat /user/matheus/input/newyorknewyork.txt

Start spreading the news
I am leaving today
I want to be a part of it
New York New York
These vagabond shoes
They are longing to stray
Right through the very heart of it
New York New York
I want to wake up in that city
That doesn't sleep
And find I'm king of the hill
Top of the heap
My little town blues
They are melting away
I gonna make a brand new start of it
In old New York
If I can make it there
I'll make it anywhere
It's up to you
New York New York
New York New York
I want to wake up in that city
That never sleeps
And find I'm king of the hill
Top of the list
Head of the heap
King of the hill
These are little town blues
They have all melted away
I am about to make a brand new start of it
Right there in old New York
And you bet baby
If I can make it there
You know I'm gonna make it just about anywhere
Come on come through
New York New York New York


# Running YARN on a single node

In [21]:
!./${HADOOP_PATH}/sbin/start-yarn.sh

starting yarn daemons
starting resourcemanager, logging to /home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/logs/yarn-matheus-resourcemanager-9106882a5d72.out
localhost: starting nodemanager, logging to /home/matheus/local-home/git/github/binderhub-hadoop/hadoop-2.9.2/logs/yarn-matheus-nodemanager-9106882a5d72.out


# Simple word count example

In [26]:
!./${HADOOP_PATH}/bin/yarn jar  ./${HADOOP_PATH}/share/hadoop/mapreduce/hadoop-mapreduce-examples-${HADOOP_VERSION}.jar wordcount \
                                      /user/matheus/input /user/matheus/output

19/07/07 16:29:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/07/07 16:29:39 INFO input.FileInputFormat: Total input files to process : 1
19/07/07 16:29:40 INFO mapreduce.JobSubmitter: number of splits:1
19/07/07 16:29:40 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/07/07 16:29:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1562516889785_0002
19/07/07 16:29:40 INFO impl.YarnClientImpl: Submitted application application_1562516889785_0002
19/07/07 16:29:40 INFO mapreduce.Job: The url to track the job: http://9106882a5d72:8088/proxy/application_1562516889785_0002/
19/07/07 16:29:40 INFO mapreduce.Job: Running job: job_1562516889785_0002
19/07/07 16:29:44 INFO mapreduce.Job: Job job_1562516889785_0002 running in uber mode : false
19/07/07 16:29:44 INFO mapreduce.Job:  map 0% reduce 0%
19/07/07 16:29:48 INFO mapreduce.Job:  map 100% reduce

In [23]:
!./${HADOOP_PATH}/bin/hdfs dfs -ls /user/matheus/output/

Found 2 items
-rw-r--r--   1 matheus supergroup          0 2019-07-07 16:28 /user/matheus/output/_SUCCESS
-rw-r--r--   1 matheus supergroup        571 2019-07-07 16:28 /user/matheus/output/part-r-00000


In [24]:
!./${HADOOP_PATH}/bin/hdfs dfs -cat /user/matheus/output/part-r-00000

And	3
Come	1
Head	1
I	8
I'll	1
I'm	3
If	2
In	1
It's	1
King	1
My	1
New	13
Right	2
Start	1
That	2
These	2
They	3
Top	2
York	13
You	1
a	3
about	2
all	1
am	2
anywhere	2
are	3
away	2
baby	1
be	1
bet	1
blues	2
brand	2
can	2
city	2
come	1
doesn't	1
find	2
gonna	2
have	1
heap	2
heart	1
hill	3
in	3
it	8
just	1
king	2
know	1
leaving	1
list	1
little	2
longing	1
make	6
melted	1
melting	1
never	1
new	2
news	1
of	10
old	2
on	1
part	1
shoes	1
sleep	1
sleeps	1
spreading	1
start	2
stray	1
that	2
the	8
there	3
through	2
to	6
today	1
town	2
up	3
vagabond	1
very	1
wake	2
want	3
you	2
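
The output above shows that tokens are split on whitespace only and counted case-sensitively: `King` and `king` are separate keys, punctuation stays attached (`doesn't`), and uppercase words sort before lowercase ones. A small local sketch that mirrors this behaviour can help sanity-check the job on small files. It is not the Hadoop implementation, and the function names are hypothetical.

```python
from collections import Counter


def word_count(text):
    """Count whitespace-separated tokens, case-sensitively."""
    return Counter(text.split())


def format_counts(counts):
    """Render counts as `word<TAB>count` lines, sorted by word."""
    return "\n".join(f"{word}\t{counts[word]}" for word in sorted(counts))


# Assumes the input file from the notebook is available locally:
# with open("resources/examples/newyorknewyork.txt") as fh:
#     print(format_counts(word_count(fh.read())))
```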


In [ ]:
# !./${HADOOP_PATH}/bin/hdfs dfs -rm -r  /user/matheus/output/