## Hadoop Overview

<a href="#MapReduce">MapReduce for Processing</a>  
<a href="#HDFS">HDFS for Storing</a>  
<a href="#Sqoop">Sqoop for taking in data</a>  


#### Why Hadoop?

It's very cheap to store large amounts of data, but it's hard to process that data with traditional tools  
Prob 1 = how can we **reliably** store large data?  
Prob 2 = how can we analyze all this data? Takes lots of **time** to read data into memory  

Hadoop is the solution to this - scalable, reliable, available, fast, economical  
1) HDFS = Reliable, distributes data across a cluster   
2) MapReduce = framework for parallel processing  

Hadoop is run on a collection of servers - as opposed to running on a "super computer"  
i.e., Hadoop uses horizontal scaling (adding more machines)  
Node = name of each server  
Each node both stores & processes data (running the computation on the node that already holds the data is called data locality)  

Hadoop 2.0 (through YARN, its resource manager) also supports non-MapReduce applications, including interactive and streaming applications

#### Benefits of Hadoop

1. Fault tolerance    
Hadoop has built-in data redundancy (3 copies of each block by default) in case one of the nodes fails  

---

#### Hadoop Trends

(Old) `Data warehouse` --> need to clean and structure before storing  
(New) `Data Lakes` --> able to store structured and unstructured raw data  
$\quad$ Hadoop is a platform for building a data lake  
$\quad$ Issue: Can we actually use all this to add value?  
$\quad$ Issue: Hard to work with, especially with multiple people

**Cloud computing**  
On-premises big data providers - Cloudera  
Cloud computing providers - AWS, Azure, Google Cloud

---

# Hadoop Ecosystem - All of these are scalable and distributed

<img src=https://i.imgur.com/NktUrwM.png width="500" height="440" align="left">

# TASK 1 - Processing Layer (eg MapReduce)

##### Batch Processing Tools (Top left)

1. Hadoop MapReduce

2. Apache Pig - offers higher-level data processing than raw MapReduce  
Useful for ETL - its scripting language (Pig Latin) is more "human friendly", with SQL-like operations

3. Apache Hive - data warehouse application for Hadoop  
Similar to SQL but for big data

4. Spark - fast in-memory processing engine

##### Interactive Query Tools

1. Apache Impala - interactive SQL engine, low latency  

2. Apache Drill - high performance SQL engine  
Can query semi-structured data across many sources (e.g., HDFS, HBase, JSON files, MongoDB, Amazon S3)

##### Machine Learning

1. Apache Mahout - ML for Hadoop

2. Apache Spark MLlib - ML component of Spark

3. H2O - in-memory, distributed - can be used with R, Python


##### Streaming

1. Apache Storm

2. Spark Streaming

3. Apache Flink

---

# MapReduce
`processing`


<img src=https://i.imgur.com/JcxE2Sa.jpg width="400" height="340" align="left">



MapReduce is a programming model; jobs are typically written in Java (other languages such as Python are supported through Hadoop Streaming)  
Follows the "shared-nothing" architecture - tasks are not dependent on each other

**2 functions, both written by the user**  
Map - takes an input pair and produces a set of intermediate key/value pairs  
$\quad$ The input value is a row from the input  
Shuffle and sort (done by the framework, not the user) - groups the intermediate values that share a key and passes each group to reduce  
Reduce - accepts an intermediate key plus its values and merges those values together  
$\quad$ The values are a collection of all values for that key

**Example**  
`Input` - Key = url, Value = page contents  
`Map output` - Key = url, Value = a pair of key=term, value=frequency (one pair per term found on the page)  
`Reduce output` - Key = url, Value = all term/frequency pairs for that url merged into one
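
The example above can be sketched in plain Python, without Hadoop. This is only a single-process illustration, assuming each input record is a (url, page text) pair. The names `map_page`, `shuffle`, `reduce_terms`, and `run_job` are hypothetical, and in a real job the framework performs the shuffle step, not user code.

```python
from collections import Counter, defaultdict


def map_page(url, page):
    """Map: emit one (url, (term, 1)) pair per word on the page."""
    for term in page.lower().split():
        yield url, (term, 1)


def shuffle(pairs):
    """Stand-in for the framework's shuffle and sort: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_terms(url, values):
    """Reduce: merge all (term, count) pairs for one url into a single dict."""
    totals = Counter()
    for term, count in values:
        totals[term] += count
    return url, dict(totals)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over (url, page) records."""
    intermediate = [pair for url, page in records for pair in map_page(url, page)]
    grouped = shuffle(intermediate)
    return dict(reduce_terms(url, values) for url, values in grouped.items())


if __name__ == "__main__":
    pages = [("a.com", "big data big cluster"), ("b.com", "data lake")]
    print(run_job(pages))
```

Because each call to `map_page` and `reduce_terms` depends only on its own arguments, the calls could run on different nodes, which is the "shared-nothing" property noted above.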


#### Benefits of MapReduce

1. Simple compared to other distributed programming models  
2. Flexible - handles data like images, video, etc  
3. Scalable - able to finish sooner by adding more workers

Quickly losing ground to Spark and other engines


**Debugging before running through MapReduce**  
`hadoop fs -cat /dualcore/employees/* \
| head -n 100 \
| python mapper.py \
| sort \
| python reducer.py`


---

# Lab 2 - MapReduce

https://pages.github.umn.edu/deliu/bigdata19/02-Hadoop/lab02-mapreduce.html  
https://pages.github.umn.edu/deliu/bigdata19/02-Hadoop/lab02-mapreduce-solution.html


Goal: for each state, how many employees earn more than 75k?

**Step 1 - enter the directory**  
`cd ADIR/exercises/data_ingest/bonus_01`  
`ls -l` to see the files

**Step 2 - clean output destination**  
`rm results.txt
hadoop fs -rm -r /user/cloudera/empcounts`


**Step 3 - view mapper, reducer, and runjob shell script**  
`cat mapper.py
cat reducer.py
cat runjob.sh`


**Step 4 - test, run the job, and verify results**  
`hadoop fs -ls /user/cloudera/empcounts` does this directory already exist?  
`hadoop fs -cat input_data | head -n 100 | python mapper.py | sort | python reducer.py` to debug  
`./runjob.sh` **to run the job**  
`hadoop fs -getmerge /user/cloudera/empcounts results.txt` to download the results to a local file  
`less results.txt` to view results
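
The lab's `mapper.py` and `reducer.py` are not reproduced in these notes. As a rough guess at their shape, the sketch below counts employees earning more than 75k per state in the streaming style (lines in, `key<TAB>value` lines out). The input layout is an assumption: tab-separated rows with the state and salary at the hypothetical positions `STATE_COL` and `SALARY_COL`. The real lab data may be laid out differently.

```python
import itertools

STATE_COL = 0      # hypothetical column positions; the lab's real layout may differ
SALARY_COL = 1
THRESHOLD = 75000


def mapper(lines):
    """Emit 'state<TAB>1' for each employee earning more than THRESHOLD."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            salary = int(fields[SALARY_COL])
        except (IndexError, ValueError):
            continue  # skip malformed rows instead of failing the task
        if salary > THRESHOLD:
            yield f"{fields[STATE_COL]}\t1"


def reducer(sorted_lines):
    """Sum the counts for each state; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for state, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{state}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Mimic `... | python mapper.py | sort | python reducer.py` in one process."""
    return list(reducer(sorted(mapper(lines))))
```

The `sorted` call plays the role of `sort` in the debug pipeline: the reducer relies on all lines for one key arriving together.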

---

# TASK 2 - Storage Layer (eg HDFS)

1. HDFS (Hadoop Distributed File System)  
Good - inexpensive, scalable, reliable  
Issue - high latency (long delays)

2. Apache HBase (key-value storage)  
NoSQL database built on HDFS  
Scales very well, high throughput  
Maps row key, column key, and timestamp to a value  
Issue - no high-level query language, no SQL support, API access only  
Not really good for analytics

3. Apache Kudu  
Designed for analytics - columnar storage  
Good for SQL  
Not built on HDFS
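
To make HBase's "row key, column key, timestamp" lookup concrete, here is a toy in-memory model in Python. It is not the HBase API. `TinyTable` and its methods are hypothetical names, and the sketch only illustrates the idea that each cell keeps several timestamped versions.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of an HBase-style lookup: (row key, column key, timestamp) -> value."""

    def __init__(self):
        # (row, column) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        """Store one version of a cell."""
        self._cells[(row, column)][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (the latest if None)."""
        versions = self._cells.get((row, column), {})
        eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(eligible)] if eligible else None


if __name__ == "__main__":
    table = TinyTable()
    table.put("user1", "info:email", 1, "old@example.com")
    table.put("user1", "info:email", 5, "new@example.com")
    print(table.get("user1", "info:email"))               # latest version
    print(table.get("user1", "info:email", timestamp=3))  # version as of t=3
```

Lookups by key like this are fast, but there is no way to express a join or an aggregate, which is why the notes call HBase a poor fit for analytics.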









# HDFS 
`storing`

Files stored in HDFS are not visible to the host's file system. You have to use a special utility (e.g., `hadoop fs` or a web-based utility) to view files in HDFS.

Optimized for millions of large files (likely over 100 MB each)  
Hierarchical directory storage  

**Differences between HDFS and a Linux file system**  
No current directory  
CANNOT modify files in place once they are written (write-once; data can only be appended)  
Have to use a special utility to access files

---

#### HDFS Structure

NameNode keeps the metadata (which files exist and which blocks make up each file)  
Files break down into blocks  
Blocks are replicated 3 times and stored on the DataNodes  
A Java Virtual Machine is required to process a block  
$\quad$ if we have too many blocks, scalability is limited

**More detailed look of ^**  
Files are broken into blocks and then duplicated 3 times  
$\quad$ Typical HDFS block = 128 MB (131,072 KB)  
$\quad$ Typical Windows block = 4 KB    

Each cluster runs as master/slave  
Master = NameNode (1 or 2) - keeps track of files, blocks, DataNodes  
$\quad$ NameNodes store this info in memory (rule of thumb - each block takes 150 bytes of memory)  
Slave = DataNodes (many) - read and write the actual data
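
The block numbers above lend themselves to a quick back-of-the-envelope calculation. The sketch below is an estimate only: it applies the 150-byte rule of thumb per logical block, ignores the NameNode's file and directory entries, and assumes non-empty files. The function name `hdfs_footprint` is hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024       # 128 MB block size, as in the notes
REPLICATION = 3                      # copies kept of each block
NAMENODE_BYTES_PER_BLOCK = 150       # rule of thumb from the notes


def hdfs_footprint(file_sizes):
    """Return (blocks, stored replicas, estimated NameNode bytes) for the given file sizes."""
    blocks = sum(math.ceil(size / BLOCK_SIZE) for size in file_sizes)
    return blocks, blocks * REPLICATION, blocks * NAMENODE_BYTES_PER_BLOCK


if __name__ == "__main__":
    one_big_file = [1024 ** 3]                # a single 1 GB file
    many_small_files = [1024 ** 2] * 1024     # the same 1 GB as 1,024 files of 1 MB
    print(hdfs_footprint(one_big_file))
    print(hdfs_footprint(many_small_files))
```

The same 1 GB of data needs 8 blocks as one file but 1,024 blocks as small files, which is why HDFS favors a modest number of large files.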

#### Copying Info 

`hadoop fs -put` == copy local files to HDFS  
`hadoop fs -get` == copy HDFS files to the local file system


### LAB

https://pages.github.umn.edu/deliu/bigdata19/02-Hadoop/lab01-hdfs.html  
https://pages.github.umn.edu/deliu/bigdata19/02-Hadoop/lab01-hdfs-solution.html

---

# TASK 3 - Integration Layer

1. HDFS - direct file transfer

2. Apache Sqoop - mainly for moving relational data between Hadoop and other databases

3. Apache Kafka - for streaming data, messaging systems

4. Apache Flume - for collecting and moving streaming log/event data into Hadoop


# Sqoop
`Transfer data between RDBMS and Hadoop`

Sqoop is a command line utility mainly used for ingesting data from an RDBMS; by default it imports the data as comma-delimited text files

Performs tasks using MapReduce jobs (4 parallel map tasks by default when importing; the work is split up by primary key).  
When importing, creates a new directory in HDFS.  
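
The idea of splitting an import by primary key can be illustrated with a small Python sketch. This is not Sqoop's actual splitter: it assumes an integer key whose minimum and maximum are already known, and `split_ranges` is a hypothetical name.

```python
def split_ranges(min_id, max_id, num_mappers=4):
    """Divide [min_id, max_id] into contiguous inclusive ranges, one per mapper."""
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    num_mappers = min(num_mappers, total)   # never create empty ranges
    base, extra = divmod(total, num_mappers)
    ranges, start = [], min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    # Each range would become one map task's slice of the table.
    print(split_ranges(1, 1000))
```

Even ranges only balance the work when the key values are spread evenly. A skewed key leaves some map tasks with far more rows than others.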

We can import partial tables by specifying which columns we want with `--columns`, or by adding a filter that rows must meet with `--where`.  
Incremental imports use `--incremental`, together with `--check-column` and `--last-value` to specify where the increment starts
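
The logic behind an append-style incremental import can be sketched in Python. This is an illustration of the idea only, not Sqoop code. It assumes rows are dicts and that the check column only ever increases, and `incremental_append` is a hypothetical name.

```python
def incremental_append(rows, check_column, last_value):
    """Return the rows whose check_column exceeds last_value, plus the new last value."""
    new_rows = [row for row in rows if row[check_column] > last_value]
    new_last = max((row[check_column] for row in new_rows), default=last_value)
    return new_rows, new_last


if __name__ == "__main__":
    table = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
    imported, last = incremental_append(table, "order_id", last_value=1)
    print(imported, last)
```

The returned last value is what the next run would pass as its starting point, so each run picks up only rows added since the previous one.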

**Tools (subcommands)**  
`import`  
`import-all-tables`  
`export`  
`list-tables`

---

**Example - import order_details**  
**Step 1 - see the top-level directories in HDFS**  
`hadoop fs -ls /` 

**Step 2 - Import tables into dualcore folder**  
`sqoop import \  
--connect jdbc:mysql://localhost/dualcore \  
--username training --password training \  
--fields-terminated-by '\t' \  
--warehouse-dir /dualcore \  
--table order_details \  
--split-by=order_id`  
`--split-by` is needed when the table does not have a single-column primary key