# Hadoop Installation in Google Colab

Hadoop is a Java-based data processing framework. Let’s install Hadoop step by step in Google Colab. Hadoop needs Java, and there are two ways to provide it: install Java on our own machine, or install it inside the Google Colab runtime so that nothing has to be installed locally. Since we are using Google Colab, we take the second approach:

In [1]:
# Install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [2]:
# Set the JAVA_HOME and SPARK_HOME environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

### Install Hadoop

In [3]:
# Download Hadoop
!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz

--2022-01-27 10:40:28--  https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Resolving downloads.apache.org (downloads.apache.org)... 135.181.214.104, 88.99.95.219, 2a01:4f8:10a:201a::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|135.181.214.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 500749234 (478M) [application/x-gzip]
Saving to: ‘hadoop-3.3.0.tar.gz’


2022-01-27 10:40:35 (65.2 MB/s) - ‘hadoop-3.3.0.tar.gz’ saved [500749234/500749234]



In [4]:
# we’ll use the tar command with the -x flag to extract, -z to uncompress, 
# -v for verbose output, and -f to specify that we’re extracting from a file
!tar -xzvf hadoop-3.3.0.tar.gz

Streaming output truncated to the last 5000 lines.
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/source/hadoop-common-project/hadoop-common/target/api/org/apache/hadoop/fs/FSDataOutputStream.html
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/source/hadoop-common-project/hadoop-common/target/api/org/apache/hadoop/fs/TrashPolicyDefault.Emptier.html
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/source/hadoop-common-project/hadoop-common/target/api/org/apache/hadoop/fs/HarFileSystem.html
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/source/hadoop-common-project/hadoop-common/target/api/org/apache/hadoop/fs/PathExistsException.html
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/source/hadoop-common-project/hadoop-common/target/api/org/apache/hadoop/fs/XAttrSetFlag.html
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/sourc

In [5]:
# Copy the extracted Hadoop directory to /usr/local
!cp -r hadoop-3.3.0/ /usr/local/

### Configuring Java Home variable

In [6]:
# Find the default Java installation path
!readlink -f /usr/bin/java | sed "s:bin/java::"

/usr/lib/jvm/java-11-openjdk-amd64/
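
The `readlink`/`sed` pipeline above resolves the `/usr/bin/java` symlink and strips the first `bin/java` from the result. A rough Python equivalent is sketched below; `strip_bin_java` and `java_home_from_binary` are hypothetical helper names, not part of any library. Note that the output above reports Java 11 as the default, while `JAVA_HOME` was set earlier to the Java 8 directory, so it may be worth checking which one you intend Hadoop to use.

```python
import os


def strip_bin_java(resolved_path: str) -> str:
    """Remove the first 'bin/java' from a path, like sed "s:bin/java::"."""
    return resolved_path.replace("bin/java", "", 1)


def java_home_from_binary(java_binary: str = "/usr/bin/java") -> str:
    """Resolve symlinks (like readlink -f), then strip 'bin/java'."""
    return strip_bin_java(os.path.realpath(java_binary))


# Pure string step, using the path reported in the cell output above.
example = strip_bin_java("/usr/lib/jvm/java-11-openjdk-amd64/bin/java")
print(example)

# Only inspect the real binary if it exists on this machine.
if os.path.exists("/usr/bin/java"):
    print(java_home_from_binary())
```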


### Run Hadoop

In [7]:
#Running Hadoop
!/usr/local/hadoop-3.3.0/bin/hadoop

Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:


    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get the Hadoop jar and the
    

In [8]:
!mkdir ~/input

In [9]:
!cp /usr/local/hadoop-3.3.0/etc/hadoop/*.xml ~/input

In [10]:
!ls ~/input

capacity-scheduler.xml	hdfs-rbf-site.xml  kms-acls.xml     yarn-site.xml
core-site.xml		hdfs-site.xml	   kms-site.xml
hadoop-policy.xml	httpfs-site.xml    mapred-site.xml


In [11]:
!/usr/local/hadoop-3.3.0/bin/hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep ~/input ~/grep_example 'allowed[.]*'

2022-01-27 10:41:08,720 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2022-01-27 10:41:08,803 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2022-01-27 10:41:08,803 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2022-01-27 10:41:08,972 INFO input.FileInputFormat: Total input files to process : 10
2022-01-27 10:41:09,005 INFO mapreduce.JobSubmitter: number of splits:10
2022-01-27 10:41:09,181 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local224624305_0001
2022-01-27 10:41:09,181 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-01-27 10:41:09,350 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2022-01-27 10:41:09,351 INFO mapreduce.Job: Running job: job_local224624305_0001
2022-01-27 10:41:09,357 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2022-01-27 10:41:09,365 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2022-0
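
The `grep` example job above scans every file in `~/input`, extracts each string that matches the regular expression `allowed[.]*`, and counts how often each match occurs. As a rough single-machine illustration of that computation (not the Hadoop implementation), the sketch below does the same over in-memory strings. The sample texts and the `count_matches` helper are made up for illustration.

```python
import re
from collections import Counter


def count_matches(documents, pattern):
    """Count every regex match across all documents, most frequent first."""
    regex = re.compile(pattern)
    counts = Counter()
    for text in documents:
        # findall returns the full matched strings (the pattern has no groups).
        counts.update(regex.findall(text))
    return counts.most_common()


# Hypothetical stand-ins for the XML files copied into ~/input.
sample_docs = [
    "users allowed to submit jobs",
    "hosts allowed. groups allowed to administer",
]

result = count_matches(sample_docs, r"allowed[.]*")
print(result)
```

In the notebook itself, the real job writes its result to `~/grep_example`, which the commented-out `cat` cell further down can display.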

### General Info about Hadoop

- Big Data refers to data sets that are so large or complex that traditional data processing application software is inadequate to deal with them.
- Analyzing such data requires massively parallel software running on several servers.
- Volume, Variety, Velocity, Variability and Veracity describe the properties of Big Data.

- Hadoop is a framework for running applications on large clusters.
- The Hadoop framework transparently provides applications both reliability and data motion.
- Hadoop implements the computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
- It provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
- Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
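
To illustrate the Map/Reduce paradigm described above, here is a minimal word-count sketch in plain Python. It runs in a single process and only mimics the three stages (map, shuffle, reduce); the function names and sample lines are illustrative assumptions, and none of this uses Hadoop's actual API.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


lines = ["big data needs big clusters", "hadoop runs on clusters"]

# Each line could be mapped on a different node; here we simply loop.
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a framework is free to run (or re-run) them on any node, which is the property the bullet above refers to.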

### HDFS

- HDFS is a distributed file system.
- HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
- HDFS is suitable for applications that have large data sets.
- HDFS provides interfaces to move applications closer to where the data is located. The computation is much more efficient when the size of the data set is huge.
- HDFS consists of a single NameNode with a number of DataNodes which manage storage.
- HDFS exposes a file system namespace and allows user data to be stored in files.
  - A file is split into blocks that are stored in DataNodes; the NameNode determines the mapping of blocks to DataNodes.
  - The NameNode executes operations like opening, closing, and renaming files and directories.
  - The Secondary NameNode periodically checkpoints the NameNode's metadata; it is not a standby NameNode.
  - The DataNodes perform block creation, deletion, and replication upon instruction from the NameNode.
  - The placement of replicas is optimized for data reliability, availability, and network bandwidth utilization.
  - User data never flows through the NameNode.
- Files in HDFS are write-once and have strictly one writer at any time.
- The DataNode has no knowledge about HDFS files.
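
To make the block model concrete, the sketch below estimates how many blocks a file would occupy and how much raw storage its replicas would take. It assumes the stock defaults of a 128 MB block size (`dfs.blocksize`) and a replication factor of 3 (`dfs.replication`); a real cluster may be configured differently, and the helper names are hypothetical.

```python
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # assumed default dfs.blocksize, in bytes
DEFAULT_REPLICATION = 3                 # assumed default dfs.replication


def block_count(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Number of blocks needed for a file (ceiling division; 0 for an empty file)."""
    return (file_size + block_size - 1) // block_size


def raw_storage_bytes(file_size, replication=DEFAULT_REPLICATION):
    """Total bytes stored across all replicas; the last block is not padded."""
    return file_size * replication


# File sizes taken from the listings in this notebook.
hadoop_tarball = 500749234
mnist_test_csv = 18289443

print(block_count(hadoop_tarball), block_count(mnist_test_csv))
print(raw_storage_bytes(mnist_test_csv))
```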

### Accessibility

All [HDFS commands](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html) are invoked by the `bin/hdfs` script:
```shell
hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]
```
#### Manage files and directories
```shell
hdfs dfs -ls -h -R # Recursively list subdirectories with human-readable file sizes
hdfs dfs -cp <src> <dst> # Copy files from source to destination
hdfs dfs -mv <src> <dst> # Move files from source to destination
hdfs dfs -mkdir /foodir # Create a directory named /foodir
hdfs dfs -rm -r /foodir # Remove a directory named /foodir (-rmr is deprecated)
hdfs dfs -cat /foodir/myfile.txt # View the contents of a file named /foodir/myfile.txt
```

For now, in Colab, we call the `hdfs` and `hadoop` commands by their full paths, like this:
```shell
/usr/local/hadoop-3.3.0/bin/hadoop fs -ls # hadoop fs -ls
/usr/local/hadoop-3.3.0/bin/hdfs dfs -mkdir /data # hdfs dfs -mkdir /data
```

Next, you can try the examples given above.
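
As an alternative to typing the full path each time (the rest of this notebook keeps the full paths), the Hadoop `bin` directory can be put on `PATH` from Python. This is a sketch under the assumption that Hadoop was copied to `/usr/local/hadoop-3.3.0` as above; `prepend_to_path` is a hypothetical helper.

```python
import os

HADOOP_BIN = "/usr/local/hadoop-3.3.0/bin"  # install location used in this notebook


def prepend_to_path(directory, current_path):
    """Return a PATH string with `directory` first and no duplicate of it."""
    parts = [p for p in current_path.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory] + parts)


# Shell commands started from the notebook inherit this environment.
os.environ["PATH"] = prepend_to_path(HADOOP_BIN, os.environ.get("PATH", ""))
print(os.environ["PATH"].split(os.pathsep)[0])
```

After running this in a cell, a later cell such as `!hdfs dfs -ls` should resolve `hdfs` without the full path.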

In [12]:
# !cat ~/grep_example/*

In [13]:
!/usr/local/hadoop-3.3.0/bin/hadoop fs -ls

Found 4 items
drwxr-xr-x   - root root       4096 2022-01-07 14:33 .config
drwxr-xr-x   - 1001 1001       4096 2020-07-06 19:50 hadoop-3.3.0
-rw-r--r--   1 root root  500749234 2020-07-15 17:30 hadoop-3.3.0.tar.gz
drwxr-xr-x   - root root       4096 2022-01-07 14:33 sample_data


In [14]:
ll /usr/local/hadoop-3.3.0/bin/

total 1732
-rwxr-xr-x 1 root 802832 Jan 27 10:41 container-executor*
-rwxr-xr-x 1 root   9034 Jan 27 10:41 hadoop*
-rwxr-xr-x 1 root  11265 Jan 27 10:41 hadoop.cmd*
-rwxr-xr-x 1 root  11274 Jan 27 10:41 hdfs*
-rwxr-xr-x 1 root   8081 Jan 27 10:41 hdfs.cmd*
-rwxr-xr-x 1 root   6237 Jan 27 10:41 mapred*
-rwxr-xr-x 1 root   6311 Jan 27 10:41 mapred.cmd*
-rwxr-xr-x 1 root  29200 Jan 27 10:41 oom-listener*
-rwxr-xr-x 1 root 837840 Jan 27 10:41 test-container-executor*
-rwxr-xr-x 1 root  12439 Jan 27 10:41 yarn*
-rwxr-xr-x 1 root  12840 Jan 27 10:41 yarn.cmd*


In [15]:
!/usr/local/hadoop-3.3.0/bin/hdfs dfs -mkdir /data

In [16]:
!/usr/local/hadoop-3.3.0/bin/hdfs dfs -put sample_data/mnist_test.csv /data/

In [17]:
!/usr/local/hadoop-3.3.0/bin/hdfs dfs -ls -R /data/

-rw-r--r--   1 root root   18289443 2022-01-27 10:41 /data/mnist_test.csv


In [18]:
!/usr/local/hadoop-3.3.0/bin/hdfs dfs -tail /data/mnist_test.csv

0,0,0,0,0,0,0,33,217,253,253,132,64,0,0,18,43,157,171,253,253,253,253,253,160,2,0,0,0,0,0,0,0,0,3,166,253,253,242,49,17,49,158,210,254,253,253,253,253,253,253,253,253,11,0,0,0,0,0,0,0,0,10,227,253,253,207,15,172,253,253,253,254,247,201,253,210,210,253,253,175,4,0,0,0,0,0,0,0,0,10,228,253,253,224,87,242,253,253,184,60,54,9,60,35,182,253,253,52,0,0,0,0,0,0,0,0,0,13,253,253,253,253,231,253,253,253,93,86,86,86,109,217,253,253,134,5,0,0,0,0,0,0,0,0,0,2,115,253,253,253,253,253,253,253,253,254,253,253,253,253,253,134,5,0,0,0,0,0,0,0,0,0,0,0,3,166,253,253,253,253,253,253,253,254,253,253,253,175,52,5,0,0,0,0,0,0,0,0,0,0,0,0,0,7,35,132,225,253,253,253,195,132,132,132,110,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

Hadoop's configuration can be viewed in `hadoop-env.sh`, which stores the global settings used by all Hadoop shell commands.

In [19]:
!cat /usr/local/hadoop-3.3.0/etc/hadoop/hadoop-env.sh

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

##
## THIS FILE ACTS AS THE MASTER FILE FOR ALL HADOOP PROJECTS.
## SETTINGS HERE WILL BE READ BY ALL HADOOP COMMANDS.  THEREFORE,
## ONE CAN USE THIS FILE TO SET

### Spark. DataFrame. Spark SQL. 

Spark consists of four main modules:
1. Spark SQL - helps to write Spark programs using SQL-like queries.
2. Spark Streaming - an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. It is used heavily for processing social media data.
3. Spark MLlib - the machine learning component of Spark. It helps train ML models on massive datasets with very high efficiency.
4. Spark GraphX - the graph processing component of Spark. It enables users to view data both as graphs and as collections without data movement or duplication.

Hopefully this image gives a better idea of what I am talking about:

![alt](https://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2015/11/spark-streaming-datanami.png)

In [20]:
!pip install -q findspark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
     |████████████████████████████████| 281.4 MB 40 kB/s 
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 67.0 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=56b484e75ab91ad4841ebfde6b0c904da89cd7ad2618481be09c3499dd6444f8
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


In [21]:
!ls /usr/lib/jvm 

default-java		   java-11-openjdk-amd64     java-8-openjdk-amd64
java-1.11.0-openjdk-amd64  java-1.8.0-openjdk-amd64


In [22]:
!pip install -U pyarrow

Collecting pyarrow
  Downloading pyarrow-6.0.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.6 MB)
     |████████████████████████████████| 25.6 MB 38.2 MB/s 
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 3.0.0
    Uninstalling pyarrow-3.0.0:
      Successfully uninstalled pyarrow-3.0.0
Successfully installed pyarrow-6.0.1


In [23]:
import findspark
findspark.init()

In [24]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

#### Dataset Preparation

In [25]:
!wget https://jacobceles.github.io/knowledge_repo/colab_and_pyspark/cars.csv

--2022-01-27 10:42:28--  https://jacobceles.github.io/knowledge_repo/colab_and_pyspark/cars.csv
Resolving jacobceles.github.io (jacobceles.github.io)... 185.199.109.153, 185.199.108.153, 185.199.110.153, ...
Connecting to jacobceles.github.io (jacobceles.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://jacobcelestine.com/knowledge_repo/colab_and_pyspark/cars.csv [following]
--2022-01-27 10:42:28--  https://jacobcelestine.com/knowledge_repo/colab_and_pyspark/cars.csv
Resolving jacobcelestine.com (jacobcelestine.com)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to jacobcelestine.com (jacobcelestine.com)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22608 (22K) [text/csv]
Saving to: ‘cars.csv’


2022-01-27 10:42:28 (86.8 MB/s) - ‘cars.csv’ saved [22608/22608]



In [26]:
!/usr/local/hadoop-3.3.0/bin/hdfs dfs -put cars.csv /data/

In [27]:
!/usr/local/hadoop-3.3.0/bin/hdfs dfs -ls -R /data/

-rw-r--r--   1 root root   18289443 2022-01-27 10:41 /data/mnist_test.csv
-rw-r--r--   1 root root      22608 2022-01-27 10:42 /data/cars.csv


In [28]:
df = spark.read.csv('cars.csv', header=True, sep=";")
df.show(5)

+--------------------+----+---------+------------+----------+------+------------+-----+------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|
+--------------------+----+---------+------------+----------+------+------------+-----+------+
|Chevrolet Chevell...|18.0|        8|       307.0|     130.0| 3504.|        12.0|   70|    US|
|   Buick Skylark 320|15.0|        8|       350.0|     165.0| 3693.|        11.5|   70|    US|
|  Plymouth Satellite|18.0|        8|       318.0|     150.0| 3436.|        11.0|   70|    US|
|       AMC Rebel SST|16.0|        8|       304.0|     150.0| 3433.|        12.0|   70|    US|
|         Ford Torino|17.0|        8|       302.0|     140.0| 3449.|        10.5|   70|    US|
+--------------------+----+---------+------------+----------+------+------------+-----+------+
only showing top 5 rows

