## Big Data Overview:

In this section we're going to discuss a few __Big Data__ processing frameworks. We'll go through the following topics:

* Hadoop, MapReduce, Spark and PySpark
* Local Vs Distributed systems
* Overview of the Hadoop Ecosystem
* Detailed Overview of Spark
* Setting up everything we need (Jupyter, Spark) on __Amazon Web Services__
* Resources on other Spark options
* Jupyter Notebook hands-on code with __PySpark__ and __RDDs__

So far we've worked with data that can fit on our local computer, on the scale of 0-8 GB. But what can we do if we have a larger set of data?
* Try using a __SQL Database__ to move storage onto the __Hard Drive__ instead of __RAM__
* Or use a distributed system, which distributes the data across multiple computers

A local machine is our own computer, on which we are restricted to __One Hard Drive__ and __One RAM__. In a __Distributed System__, however, we can have __One Machine__ controlling a distribution of __Multiple Machines__.

* A local process uses the computational resources of a __Single Machine__.
* A distributed process has access to the computational resources of a number of machines connected together.
* Distributed machines also have the advantage of easy __Scaling__: we can simply add more machines.
* Distributed machines also provide __Fault Tolerance__: if one machine fails, the whole process can still continue.

## Format of Distributed Architecture with Hadoop:

* __Hadoop__ is a way to distribute very large files across multiple machines.
* It uses __HDFS__ (Hadoop Distributed File System).
* HDFS allows users to work with large datasets.
* HDFS also duplicates blocks of data for __Fault Tolerance__.
* Hadoop then uses __MapReduce__, which allows computations on __Large-Scale Data__.
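
To make block splitting and replication concrete, here is a toy sketch in plain Python. It does not use the HDFS API. The block size, replication factor, node names, and round-robin placement are all assumptions made for illustration; real HDFS uses far larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy; assumes replication <= len(nodes).
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Pretend this small byte string is a very large file.
data = b"a very large file, pretend it is terabytes"
blocks = split_into_blocks(data, block_size=16)
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical machine names
placement = place_replicas(len(blocks), nodes, replication=3)

# Simulate one machine failing: every block should still have a live copy.
failed = "node-b"
survivors = {b: [n for n in ns if n != failed] for b, ns in placement.items()}
still_readable = all(len(ns) > 0 for ns in survivors.values())
```

Because every block lives on several machines, losing one machine removes at most one copy of each block, which is the idea behind HDFS fault tolerance.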

## MapReduce:

* MapReduce is a way of __Splitting a Computation__ task across a distributed set of files (such as files in HDFS).
* It consists of a Job Tracker and multiple Task Trackers.
* The Job Tracker sends code to run on the Task Trackers.
* The Task Trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes.

What we covered can be thought of in two distinct parts:
* Using __HDFS__ to distribute large datasets
* Using __MapReduce__ to distribute a computational task across a distributed dataset
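
The MapReduce idea can be sketched with the classic word count, written here in plain Python on one machine. This is only an illustration of the map, shuffle, and reduce phases, not Hadoop code; the function names and the sample chunks are made up, and each chunk stands in for a block of a file that would live on a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse the list of values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for one block of a distributed file.
chunks = ["spark is fast", "hadoop is reliable", "spark is flexible"]

# The map step runs independently on every chunk, so it can run in parallel.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
```

The map step never needs to see more than one chunk at a time, which is what lets the work be spread across many machines.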




## Spark Overview:

Let's go ahead and discuss the following things about __Spark__:
* What is Spark
* Spark Vs MapReduce
* Spark RDDs
* RDD Operations

* Spark is one of the latest technologies being used to quickly and easily handle data.
* It is an Open Source __Apache__ project
* It was created at the __AMPLab (UC Berkeley)__
* We can think of __Spark__ as a flexible alternative to __MapReduce__

* Spark can use data stored in a variety of storage systems, e.g.:
    * Cassandra
    * AWS S3
    * HDFS
    * And more
* __MapReduce__ requires files to be stored in __HDFS__; __Spark__ does not!
* __Spark__ can also perform operations up to __100x__ faster than __MapReduce__.
* __MapReduce__ writes most of its data to disk after each map and reduce operation, while __Spark__ keeps most of the data in memory after each transformation. It can spill over to disk if the memory is full.

At the core of __Spark__ is the idea of a __Resilient Distributed Dataset (RDD)__. It has four main features:
* Distributed Collection of Data
* Fault-tolerant
* Parallel operation - partitioned
* Ability to use many data sources

RDDs are immutable, lazily evaluated, and cacheable. There are two main types of RDD operations:
* Transformations
* Actions
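
The difference between transformations and actions, together with lazy evaluation, can be sketched with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this example, not the Spark API: transformations only record a step and return a new object, while actions actually run the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source   # the original data, never modified
        self._steps = steps     # tuple of pending (kind, function) steps

    # Transformations: return a NEW dataset with one more recorded step.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    # Helper that replays the recorded steps over the source data.
    def _run(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return plain Python values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)  # side effect so we can see when the work really happens
    return x * x


numbers = LazyDataset([1, 2, 3, 4])
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
ran_before_action = len(calls)    # still 0: nothing has been computed yet
result = evens_squared.collect()  # the action triggers the whole pipeline
```

Note that `numbers` itself is never changed by `map` or `filter`, which mirrors the immutability of RDDs.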

### Basic Actions:
* First -> Returns the first element in the RDD
* Collect -> Returns all the elements of the RDD as an array at the driver program
* Count -> Returns the number of elements in the RDD
* Take -> Returns an array with the first 'n' elements of the RDD

### Basic Transformations:
* Filter
* Map
* FlatMap

### RDD.filter():
* Applies a function to each element and returns only the elements for which the function evaluates to __True__
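
As a rough local analogue, Python's built-in `filter` behaves the same way on a list. The sample numbers and the `is_even` helper are made up for this sketch; in PySpark the equivalent call on an RDD would be `rdd.filter(is_even)`.

```python
def is_even(n):
    """Predicate: True for the elements we want to keep."""
    return n % 2 == 0


numbers = [1, 2, 3, 4, 5, 6]

# Keep only the elements for which the predicate returns True.
evens = list(filter(is_even, numbers))
```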

### RDD.map()
* Transforms each element and preserves the number of elements. Very similar in idea to the Pandas .apply() method

### Map():
* Example: grabbing the first letter from each name in a list of names
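
A minimal local sketch of this example, using Python's built-in `map` on a made-up list of names. In PySpark the equivalent would be `rdd.map(lambda name: name[0])`.

```python
names = ["Alice", "Bob", "Carol"]  # hypothetical sample data

# One output element per input element: the first letter of each name.
first_letters = list(map(lambda name: name[0], names))
```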

### FlatMap():
* Example: transforming a corpus of text into a list of words!
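
The contrast with a plain map can be sketched locally as follows. The corpus is made-up sample text, and `itertools.chain.from_iterable` plays the role of the flattening step; in PySpark the equivalent would be `rdd.flatMap(lambda line: line.split())`.

```python
from itertools import chain

corpus = ["the quick brown fox", "jumps over", "the lazy dog"]

# A plain map gives one output per input line: a list of lists.
nested = [line.split() for line in corpus]

# flatMap flattens those per-line lists into a single sequence of words.
words = list(chain.from_iterable(line.split() for line in corpus))
```

So map preserves the number of elements (three lines in, three lists out), while flatMap can return more or fewer elements than it was given.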

### Pair RDDs:
Often RDDs will be holding their values in tuples (key, value). This offers better partitioning of the data and leads to functionality based on reduction.

### Reduce():
* An action that will aggregate RDD elements using a function that returns a single element.
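
A local sketch of the same idea uses `functools.reduce` on a made-up list. In PySpark the equivalent would be `rdd.reduce(lambda a, b: a + b)`; there the function should be commutative and associative, because partial results from different partitions are combined in no fixed order.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Repeatedly combine two elements into one until a single value is left.
total = reduce(lambda a, b: a + b, numbers)
largest = reduce(lambda a, b: a if a > b else b, numbers)
```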

### ReduceByKey():
* A transformation that aggregates the values of each key in a Pair RDD using a function, and returns a new Pair RDD. These ideas are similar to a Pandas __Groupby__ operation.
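
The per-key aggregation can be sketched in plain Python like this. `reduce_by_key` is a hypothetical helper written for this example, and the sales data is made up; in PySpark the equivalent would be `pair_rdd.reduceByKey(lambda a, b: a + b)`.

```python
def reduce_by_key(pairs, fn):
    """Combine the values of each key with fn; returns a list of (key, value) pairs."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are folded in with fn.
        merged[key] = fn(merged[key], value) if key in merged else value
    return list(merged.items())


sales = [("apples", 3), ("pears", 2), ("apples", 4), ("pears", 1), ("kiwis", 5)]

# Sum the values that share a key, much like a group-by followed by a sum.
totals = dict(reduce_by_key(sales, lambda a, b: a + b))
```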

Finally, the __Spark__ ecosystem includes:
* SparkSQL
* Spark DataFrames
* MLlib
* GraphX
* Spark Streaming

Now that we have learned enough about __Apache Spark__, let's go ahead and set up an __Amazon Web Services__ account to get __Spark__ up and running!
