<a href="https://colab.research.google.com/github/sreent/big-data-analysis/blob/main/Hadoop%20Hand-On%20Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hadoop Hands-On Lab**

##Setting Up Hadoop Environment

In [None]:
# install java and set JAVA_HOME variable 
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# download hadoop to colab's compute engine
!wget -q https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
!tar -xzf hadoop-3.3.0.tar.gz

# create folder for storing hadoop files
!mkdir -p /usr/local/hadoop
# copy the downloaded files to the created hadoop folder
!cp -r hadoop-3.3.0/* /usr/local/hadoop/.

# delete the downloaded archive and the extracted folder
!rm -rf hadoop*

# set java and hadoop environment
import os, sys
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["HADOOP_VERSION"] = "3.3.0"
os.environ["HADOOP_HOME"] = "/usr/local/hadoop"
os.environ["HADOOP_CONF_DIR"] = "/usr/local/hadoop/etc/hadoop"
os.environ["HADOOP_MAPRED_HOME"] = "/usr/local/hadoop"
os.environ["HADOOP_COMMON_HOME"] = "/usr/local/hadoop"
os.environ["HADOOP_HDFS_HOME"] = "/usr/local/hadoop"
os.environ["YARN_HOME"] = "/usr/local/hadoop"
os.environ["HADOOP_TOOLS"] = "/usr/local/hadoop/share/hadoop/tools/lib"

# append hadoop executable paths to the existing system path
os.environ["PATH"] += os.pathsep + "/usr/local/hadoop/bin" + os.pathsep + "/usr/local/hadoop/sbin"

##Command Line Cheat Sheet

####Accessibility

All [Hadoop commands](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html) are invoked through the `bin/hadoop` script:
```shell
hadoop [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]
```

####Manage files and directories
```shell
hadoop fs -ls -h -R  # Recursively list subdirectories with human-readable file sizes
hadoop fs -cp <src> <dst>  # Copy files from source to destination
hadoop fs -mv <src> <dst>  # Move files from source to destination
hadoop fs -mkdir /foodir  # Create a directory named /foodir
hadoop fs -rm -r /foodir  # Remove a directory named /foodir
hadoop fs -cat /foodir/myfile.txt  # View the contents of a file named /foodir/myfile.txt
```

####Transfer files between nodes
##### put
```shell
hadoop fs -put [-f] [-p] [-l] [-d] [ - | <localsrc1> ... ] <dst>
```
Copies a single source, or multiple sources, from the local file system to the destination file system.

Options:

    -p : Preserves access and modification times, ownership and permissions.
    -f : Overwrites the destination if it already exists.

```shell
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
```
Commands similar to `fs -put`:
- `moveFromLocal` : deletes the local source after it has been copied.
- `copyFromLocal` : the source is restricted to a local file reference.

In the opposite direction, `copyToLocal` is similar to `fs -get`, except that the destination is restricted to a local file reference.

##**Lab 1**: Hadoop Cluster
**Task 1.1** Check that your HDFS home directory required to execute MapReduce jobs exists.
```bash
hadoop fs -ls /user/${USER}
```

Type the following commands: 
```bash
hadoop fs -ls
hadoop fs -ls ~/
hadoop fs -mkdir ~/lab1
```

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

**Task 1.2** Create a folder called <code>lab1</code> and add to it a file <code>user.txt</code> containing your name and the date:
```bash
mkdir -p ./lab1
echo "FirstName LastName" > ./lab1/user.txt
echo `date` >> ./lab1/user.txt 
cat ./lab1/user.txt
```

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

**Task 1.3** Copy the file to HDFS:
```bash
hadoop fs -copyFromLocal ./lab1/user.txt ~/lab1/.
```

Check with:
```bash
hadoop fs -ls -R ~/lab1
hadoop fs -cat ~/lab1/user.txt 
hadoop fs -tail ~/lab1/user.txt 
```

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

**Task 1.4** Remove the file and the directory from HDFS.

Remove the file:
```bash
hadoop fs -rm ~/lab1/user.txt
```
Remove the directory:
```bash
hadoop fs -rm -r ~/lab1
```

Check with:
```bash
hadoop fs -ls -R ~/lab1 
```

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

##**Lab 2**: Command Line Hands-On Practice
1. Create a directory <code>lab2</code> in <code>HDFS</code>.
2. List the contents of a directory <code>~/lab2</code>.
3. Upload the file <code>today.txt</code> in <code>HDFS</code>.
```bash
mkdir -p ./lab2
date > ./lab2/today.txt
whoami >> ./lab2/today.txt
```
4. Display contents of file <code>today.txt</code>
5. Copy <code>today.txt</code> file from source to <code>lab2</code> directory.
6. Copy file <code>jps.txt</code> from the local file system to <code>HDFS</code>, and from <code>HDFS</code> back to the local file system. The <code>jps</code> command reports the local VM identifier of each instrumented JVM found on the target system.
```bash
jps > ./lab2/jps.txt
```
7. Move file <code>jps.txt</code> from source to <code>~/lab2</code>.
8. Remove file <code>today.txt</code> from home directory in <code>HDFS</code>.
9. Display last few lines of <code>jps.txt</code>.
10. Display the help of the <code>du</code> command and show, in a human-readable fashion, the total amount of space used by your HDFS home directory.
11. Display the help of the <code>df</code> command and show, in a human-readable fashion, the total amount of space available in the filesystem.
12. With <code>chmod</code>, change the permissions of the <code>today.txt</code> file. It has to be readable and writable only by you.

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

##**Lab 3**: Hadoop Streaming Using Python – Word Count Problem
Hadoop Streaming is a utility that ships with Hadoop and lets developers write MapReduce programs in languages other than Java, such as Python, C++, or Ruby. Any language that can read from standard input and write to standard output is supported. In this lab we implement the word count problem in Python to see how Hadoop Streaming works: <code>mapper.py</code> performs the map task and <code>reducer.py</code> performs the reduce task.

Let’s create one file which contains multiple words that we can count.


**Task 3.1**: Create a folder called <code>lab3</code> and add a text file with the name <code>data.txt</code> with some content into it.
```shell
mkdir -p ./lab3
``` 
```shell
%%writefile ./lab3/data.txt
geeks for geeks is best online coding platform
welcome to geeks for geeks hadoop streaming lab
```

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%writefile ./lab3/data.txt
# Add Your Content Here

Check if <code>data.txt</code> is created in the <code>lab3</code> folder.

In [None]:
%%shell
# Insert Your Code Here

**Task 3.2**: Create a <code>mapper.py</code> file that implements the mapper logic. It reads lines from <code>STDIN</code>, splits each line into words, and writes every word to <code>STDOUT</code> together with a count of 1.

In [None]:
%%file ./lab3/mapper.py
#!/usr/bin/env python
  
# import sys because we need to read and write data to STDIN and STDOUT
import sys
  
# reading entire line from STDIN (standard input)
for line in sys.stdin:
    # to remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
      
    # we are looping over the words array and printing the word
    # with the count of 1 to the STDOUT
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        print("%s\t%s" % (word, 1))

Let’s test <code>mapper.py</code> locally to check that it works.

***Syntax***:
```shell
cat <text_data_file> | python <mapper_code_python_file>
```
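
The mapper's logic can also be checked without a pipe by expressing the same transformation as a pure function. The following is only an illustrative sketch and not part of the lab files; the name `map_words` is hypothetical.

```python
def map_words(lines):
    """Yield a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


# Format the pairs the way mapper.py prints them: word<TAB>1
pairs = list(map_words(["geeks for geeks"]))
output = ["%s\t%s" % pair for pair in pairs]
print("\n".join(output))
```

Each word is emitted once per occurrence; no counting happens at this stage.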

In [None]:
%%shell
# Insert Your Code Here

**Task 3.3**: Create a <code>reducer.py</code> file that implements the reducer logic. It reads the output of <code>mapper.py</code> from <code>STDIN</code> (standard input), sums the occurrences of each word, and writes the final counts to <code>STDOUT</code>.

In [None]:
%%file ./lab3/reducer.py
#!/usr/bin/env python

import sys
  
current_word = None
current_count = 0
word = None
  
# read the entire line from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split on the tab written by mapper.py and convert
    # the count (currently a string) to int
    try:
        word, count = line.split("\t", 1)
        count = int(count)
    except ValueError:
        # the line had no tab or the count was not a number,
        # so silently ignore/discard this line
        continue
  
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word
  
# do not forget to output the last word (skipped when there was no input)
if current_word is not None:
    print("%s\t%s" % (current_word, current_count))

Now let’s check that <code>reducer.py</code> works together with <code>mapper.py</code> by running the command below.

<pre>
cat ./lab3/data.txt | python ./lab3/mapper.py | sort -k1,1 | python ./lab3/reducer.py
</pre>
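
The role of `sort -k1,1` in this pipeline can be seen in a small in-memory sketch of the three stages. It only approximates what Hadoop does between the map and reduce phases; the function name `word_count` is hypothetical, and `itertools.groupby` stands in for the reducer's comparison of consecutive keys.

```python
from itertools import groupby
from operator import itemgetter


def word_count(lines):
    """Simulate map -> sort -> reduce for the word count problem."""
    # Map: emit (word, 1) for each word
    mapped = [(word, 1) for line in lines for word in line.split()]
    # Sort: order the pairs by key so equal words become adjacent
    mapped.sort(key=itemgetter(0))
    # Reduce: sum the counts of each run of identical keys
    return {
        word: sum(count for _, count in group)
        for word, group in groupby(mapped, key=itemgetter(0))
    }


counts = word_count([
    "geeks for geeks is best",
    "welcome to geeks for geeks",
])
print(counts["geeks"], counts["for"])
```

Without the sort step, `groupby` (like `reducer.py`) would treat each separate run of a word as a new key and report partial counts.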

In [None]:
%%shell
# Insert Your Code Here

**Task 3.4**: Let’s deploy our MapReduce Python code into the Hadoop environment.

First, create the directory <code>~/lab3/input</code> in HDFS to store our <code>data.txt</code> file, using the commands below.
<pre>
hadoop fs -mkdir -p ~/lab3
hadoop fs -mkdir -p ~/lab3/input
</pre>

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here

Copy <code>data.txt</code> to this folder in our <code>HDFS</code> with help of <code>copyFromLocal</code> command.

Syntax to copy a file from your local file system to the HDFS is given below:
<pre>
hadoop fs -copyFromLocal /path1 /path2 ... /pathN /destination
</pre>

In [None]:
%%shell
# Insert Your Code Here

The data file should now be in <code>HDFS</code>. We can check that it arrived by using the command below or by browsing HDFS manually.

<pre>
hadoop fs -ls ~/lab3/input    # list the contents of the ~/lab3/input directory
</pre>

In [None]:
%%shell
# Insert Your Code Here

Let’s give execute permission to <code>mapper.py</code> and <code>reducer.py</code> with the command below.
<pre>
chmod +x ./lab3/mapper.py ./lab3/reducer.py     # add execute permission; existing read and write permissions are unchanged
</pre>
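
What adding the execute bits does to a file's mode can be sketched in Python on a throwaway file. This is an illustration only: the helper name `add_execute_bits` is hypothetical, and unlike `chmod +x` it sets the bits for user, group and others regardless of the current umask.

```python
import os
import stat
import tempfile


def add_execute_bits(path):
    """Add execute permission for user, group and others, keeping other bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate on a temporary file that starts as rw-r--r-- (0o644)
with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
    demo_path = tmp.name
os.chmod(demo_path, 0o644)
new_mode = add_execute_bits(demo_path)
print(oct(new_mode))
os.remove(demo_path)
```

The read and write bits stay as they were; only the execute bits are added.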

In [None]:
%%shell
# Insert Your Code Here

Then verify that the file permissions have changed.

In [None]:
%%shell
# Insert Your Code Here

**Task 3.5**: Locate the hadoop-streaming jar file. It is included in the Hadoop distribution installed during setup and is found in the <code>HADOOP_TOOLS</code> directory (<code>/usr/local/hadoop/share/hadoop/tools/lib</code>), so it does not need to be downloaded separately.

Now let’s run our python files with the help of the Hadoop streaming utility as shown below.

```shell
hadoop jar ${HADOOP_TOOLS}/hadoop-streaming-${HADOOP_VERSION}.jar \
    -file  ./lab3/mapper.py  -mapper  ./lab3/mapper.py \
    -file  ./lab3/reducer.py -reducer ./lab3/reducer.py \
    -input ~/lab3/input/*.txt -output ~/lab3/output
```
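
If the job is launched from a Python cell instead of a shell cell, the same command can be assembled as an argument list. The sketch below only builds the list and mirrors the options used above; `build_streaming_command` and its parameters are hypothetical names, and the fallback values are the ones set in the setup cell of this notebook. It passes the input directory rather than a `*.txt` glob, and expands `~` explicitly because no shell is involved.

```python
import os


def build_streaming_command(mapper, reducer, input_path, output_path, env=None):
    """Return the argv list for a Hadoop Streaming job."""
    env = os.environ if env is None else env
    # Fall back to the values from the setup cell if the variables are unset
    tools = env.get("HADOOP_TOOLS", "/usr/local/hadoop/share/hadoop/tools/lib")
    version = env.get("HADOOP_VERSION", "3.3.0")
    jar = os.path.join(tools, "hadoop-streaming-%s.jar" % version)
    return [
        "hadoop", "jar", jar,
        "-file", mapper, "-mapper", mapper,
        "-file", reducer, "-reducer", reducer,
        "-input", input_path, "-output", output_path,
    ]


command = build_streaming_command(
    "./lab3/mapper.py", "./lab3/reducer.py",
    os.path.expanduser("~/lab3/input"), os.path.expanduser("~/lab3/output"),
    env={},
)
print(" ".join(command))
```

The resulting list could then be passed to `subprocess.run(command, check=True)` in an environment where `hadoop` is on the `PATH`.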

In [None]:
%%shell
# Insert Your Code Here

The <code>-output</code> option in the command above specifies the location in <code>HDFS</code> where the job stores its results. Let’s check the output file at <code>~/lab3/output/part-00000</code>, either by visiting the location in <code>HDFS</code> manually or with the <code>cat</code> command as shown below.
```shell
hadoop fs -ls ~/lab3/output
hadoop fs -cat ~/lab3/output/part-00000
```
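
Once the contents of the output file are available as text, they can be loaded into a dictionary for further analysis. The following is a minimal sketch assuming the `word<TAB>count` line format written by `reducer.py`; `parse_counts` is a hypothetical helper and the sample string is made up for illustration, not actual job output.

```python
def parse_counts(text):
    """Parse 'word<TAB>count' lines into a dict of word -> int."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, count = line.split("\t", 1)
        counts[word] = int(count)
    return counts


# Hypothetical sample in the format written by reducer.py
sample = "for\t2\ngeeks\t4\nhadoop\t1\n"
result = parse_counts(sample)
# Show the two most frequent words
print(sorted(result.items(), key=lambda kv: -kv[1])[:2])
```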

In [None]:
%%shell
# Insert Your Code Here

In [None]:
%%shell
# Insert Your Code Here