## Big Data Hadoop
https://www.udemy.com/big-data-and-hadoop-for-beginners/learn/v4/overview

### Goals: 
Learn about the Big Data market, different job roles, technology trends, the history of Hadoop, HDFS, the Hadoop ecosystem, Hive and Pig. The course shows how a beginner should start with Hadoop and comes with many hands-on examples to help you learn Hadoop quickly.

+ Big Data at a Glance 
+ Getting Started with Hadoop
+ Getting Started with Hive
+ Getting Started with Pig
+ Use Cases
+ Practice

#### Learnings: 
+ Understand what Hadoop is for, and how it works
+ Understand the complex architecture of Hadoop and its components
+ Hadoop installation on your machine
+ Understand how MapReduce, Hive and Pig can be used to analyze big data sets
+ High quality documents
+ Demos: running HDFS commands, Hive queries, Pig queries
+ Sample data sets and scripts (HDFS commands, Hive sample queries, Pig sample queries, Data Pipeline sample queries)
+ Start writing your own code in Hive and Pig to process huge volumes of data
+ Design your own data pipeline using Pig and Hive
+ Understand modern data architecture: Data Lake
+ Practice with Big Data sets

## Big Data at a Glance

Topics: 
1. Introduction to Big Data [09:23]
2. Job Roles in Big Data [06:30]
3. Salary Analysis [02:55]
4. Technology Trends in the Market [06:30]
5. Advice for Big Data Beginners [02:45]

### What's Big Data?

#### What problem does Big Data solve?
Some data is complex to analyse:
- semi-structured data
- unstructured data

Problem: this data cannot be analysed by traditional systems (Oracle, MySQL, MS SQL)
- Traditional systems only store structured data

What is structured data?
- xls spreadsheets or any database table

Semi-structured:
- XML

Unstructured:
- computer log files
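
To make the three categories concrete, here is a minimal Python sketch that reads one tiny sample of each. All sample data, field names and the log format are made up for illustration; they do not come from the course.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed columns, every row has the same fields.
structured = "id,name,amount\n1,alice,30\n2,bob,45\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing tags, but fields may vary per record.
semi_structured = (
    "<orders><order id='1'><name>alice</name></order><order id='2'/></orders>"
)
orders = ET.fromstring(semi_structured).findall("order")
names = [order.findtext("name") for order in orders]  # second order has no name

# Unstructured: free text; the reader has to impose a structure on it.
log_line = "2019-04-11 16:09:06 ERROR disk /dev/sda1 is 95% full"
match = re.match(r"(\S+) (\S+) (\w+) (.*)", log_line)
date, time, level, message = match.groups()

print(rows[1]["amount"], names, level)
```

The structured sample maps directly onto table columns, the XML sample needs a check for missing fields, and the log line only yields fields after a hand-written pattern is applied to it.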

#### Big Data 5 Vs
+ **Volume**: vast amounts of data (terabytes)

+ **Velocity**: the speed at which data is generated and the speed at which it moves around

+ **Variety**: different types of data (structured, semi-structured, unstructured) can be analysed

+ **Veracity**: the accuracy and truthfulness of the data

+ **Value**: access to data is only valuable if there are valuable use cases for it

### Why is Big Data important?
Capture and process user data in real time and turn it into insights

#### How are companies making money with Big Data?

e.g.
- Credit card companies: track customer rules
- Retailers: identify patterns in behaviour



### Job roles + salaries

+ Big data analyst
    - Works with data scientists
    - BI tools (Tableau)
    - R, Python, Matlab
    - Hadoop, MapReduce, Hive, Pig, SQL


+ Hadoop Administrator

+ Big Data Engineer
    - Builds what the Architect designed
    - Design of big data solutions
    - Builds large scale data processing systems
    - DW, ETL, BI
    - Hadoop: HDFS,  MapReduce, Hive, Pig
    - NoSQL: MongoDB, Cassandra
    - Cloud environment (familiarity)

+ Data Scientists
    - Machine Learning
    - Predictive Modelling, Stats Analysis
    - Python, R, Java, Clojure, Ruby
    - NLP
    - Hadoop: HDFS, MapReduce, Hive, Pig
    - NoSQL: MongoDB


+ Big Data Manager
    - Sits between the business and technical teams
    - Manages the Big Data team
    - ML, predictive modelling, stats
    - Hadoop: HDFS, MapReduce, Hive, Pig
    ...


+ Big Data Architect
    - Good exposure to designing large-scale data systems
    - Good Hadoop exposure: Hadoop, Hive, Pig, Mahout, Oozie...
    - NoSQL databases
    - RDBMS, DW, ETL (Pentaho, Informatica)
    - Hadoop on cloud
    - Python, Java

+ Chief Data Officer


### Advice to beginners

+ Make it a habit
    - meetups
    - conferences
    - online news (TechCrunch, VentureBeat...)
    - follow companies doing big data 


+ Skills
    - RDBMS (MySQL, Oracle, MS SQL)
    - ETL tool hands-on
    - BI understanding
    - DW
    - Migration from RDBMS to Hadoop


+ Start Small
    - take small dataset
    - Use HDFS, Hive, Pig
    - Try out a use case using Hive, Pig (**dataflow** and **datapipeline**)


+ Go Big
    - Take bigger datasets
    - Play with HDFS, Hive, Pig
    - Implement data processing techniques
    - Pick use case
    - Benchmark data processing
    - Tune your techniques and configs

## Getting started with Hadoop

Topics:
7. Introduction to Hadoop [08:23]
8. Hadoop Ecosystem [05:01]
9. Hadoop 1.x vs Hadoop 2.x [14:13]
10. ETL vs ELT [03:19]
11. Different Hadoop Vendors [04:20]
12. Hadoop Installation - [HDP 2.2 Download Link](https://hortonassets.s3.amazonaws.com/2.2/Sandbox_HDP_2.2_VirtualBox.ova)
13. Managing HDFS from Command Line [09:09]
14. Hadoop on Cloud [05:11]

### Introduction to Hadoop -- Hadoop 2.2 version

#### What is Hadoop
- Open source software that enables distributed processing of large data sets across clusters of commodity servers
- Designed to scale from one server to thousands of machines with high fault tolerance
- Cluster resiliency comes from the ability to detect and handle failures at the application layer

**Hadoop Fundamentals:** 
- File storage engine = HDFS
- Data processing engine = MapReduce

**HDFS** 
- distributed file system
- store files of any size
- as many files as possible 
- distributed storage across machines

**MapReduce**
- distributed data processing framework
- processes data in HDFS
- moves the processing code to the data
- avoids shipping data over the network, which reduces latency

#### MapReduce: high level process
+ Feed the input data file to the mapper (`mapper.sh`)
+ The mapper processes the data and emits key-value pairs back to the framework (logic written by the programmer)
+ The framework performs a shuffle and sort operation on the key-value pairs from nodes across the cluster
+ The framework feeds the grouped key-value pairs to the reducer (logic written by the programmer)
+ The reducer performs the reduce operation to get the final result
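
The steps above can be sketched on a single machine in plain Python. This is only an illustration of the map, shuffle-and-sort, and reduce phases using a word count; the function names (`mapper`, `shuffle_and_sort`, `reducer`, `run_job`) are hypothetical and are not part of any Hadoop API. In a real cluster the framework runs the mapper and reducer on many nodes in parallel.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Framework step: group all values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases one after another on an in-memory input."""
    mapped = [pair for line in lines for pair in mapper(line)]
    return [reducer(key, values) for key, values in shuffle_and_sort(mapped)]


result = run_job(["big data hadoop", "hadoop hive pig", "big data"])
print(result)
```

Only `mapper` and `reducer` correspond to code the programmer writes; `shuffle_and_sort` stands in for work the framework does between the two.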


### Hadoop Ecosystem

![Screen%20Shot%202019-04-11%20at%204.09.06%20PM.png](attachment:Screen%20Shot%202019-04-11%20at%204.09.06%20PM.png)

#### Hadoop Platform:

Two key services: 
+ HDFS: reliable distributed file system 
+ MapReduce: high performance data processing engine

#### Components of Hadoop Ecosystem: 
- They provide a means to access and process data in HDFS
- Each component is designed for a specific business need

**Hive** [SUPER IMPORTANT]
- equivalent to a data warehouse (DW) on top of Hadoop
- SQL-like queries to interact with data (instead of MapReduce code in Java)
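
To show what "SQL-like query instead of MapReduce code" means in practice, here is a small sketch that uses Python's built-in `sqlite3` as a stand-in for Hive. This is an assumption for illustration only: SQLite is not Hive, and the `page_views` table and its rows are made up. The point is the shape of the query, which expresses a group-and-count in one declarative statement.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "home"), ("u2", "home"), ("u1", "cart")],
)

# One declarative statement replaces a hand-written map and reduce function.
query = """
    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page
    ORDER BY views DESC, page
"""
top_pages = conn.execute(query).fetchall()
print(top_pages)
conn.close()
```

On a Hadoop cluster, Hive would translate a query of this kind into distributed jobs over files in HDFS rather than running it in a local database.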

**Pig**
- Data flow language (Pig Latin) to interact with Hadoop
- Pig Latin is similar to SQL
- Scripts are used to process data in Hadoop
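
A data flow script is a sequence of named steps, each transforming the output of the previous one. The Python sketch below imitates that style on a tiny in-memory data set; the comments name the Pig Latin operator each step roughly corresponds to. The records and the threshold are invented for illustration, and this is plain Python, not Pig.

```python
from collections import defaultdict

# LOAD: (user, category, amount) records, made up for this example.
records = [
    ("alice", "books", 30.0),
    ("bob", "games", 45.0),
    ("alice", "games", 5.0),
    ("carol", "books", 12.5),
]

# FILTER: keep only purchases of at least 10.
large = [record for record in records if record[2] >= 10.0]

# GROUP: collect the amounts by category.
groups = defaultdict(list)
for user, category, amount in large:
    groups[category].append(amount)

# FOREACH ... GENERATE: produce one output row per group.
totals = {category: sum(amounts) for category, amounts in groups.items()}
print(totals)
```

Each intermediate name (`large`, `groups`, `totals`) plays the role of a relation in a Pig script, which is what makes the pipeline easy to read step by step.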

**Sqoop** [IMPORTANT]
- SQL to Hadoop
- Tool used to transfer data from **RDBMS to HDFS (and vice-versa)**

**Oozie** [IMPORTANT]
- Java webapp
- Used to schedule Apache Hadoop jobs
- Combines multiple jobs sequentially into one unit of work
- supports various Hadoop jobs (Pig, Hive, Sqoop...)

**ZooKeeper**
- provides operational services to the Hadoop cluster
- distributed config service
- synchronisation service
- naming registry for distributed systems

**HBase** [IMPORTANT]
- NoSQL database
- real-time read-write access to Hadoop datasets
- scales linearly for big datasets
- combines data sources with different structures and schema

**Flume**
- distributed, reliable, available
- collect, aggregate, move large data streams into HDFS
- e.g. collect server logs into HDFS real-time

**Mahout** [SUPER IMPORTANT]
- library of scalable **ML algorithms**
- runs on top of Hadoop using the MapReduce paradigm
- data science tools to **find patterns in big datasets** from HDFS
- Use cases:
    - Collaborative filtering
    - Clustering
    - Classification
    - Frequent itemset mining
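
As a rough idea of what one of these use cases computes, here is a single-machine Python sketch of frequent itemset mining restricted to pairs of items. It does not use Mahout or its API; the function name `frequent_pairs`, the baskets and the support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket
        # and always appears in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk", "eggs"],
]
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

Counting pairs per basket is itself a map-style step and summing the counts is a reduce-style step, which is why this kind of algorithm fits the MapReduce paradigm on large datasets.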
    

