# Big Data and Spark Overview

## Big Data Overview

### Local vs Distributed Machine

![Capture.PNG](attachment:Capture.PNG)

### What is a Local Machine?
A local machine is a single computer, so you are restricted to the RAM and hard drive of that one machine. A local process uses the computational resources of that single machine.

### What is a Distributed Machine?
A distributed system is a group of machines connected through a network, typically with one machine coordinating the work of the others. A distributed process has access to the computational resources of all of these machines.

Distributed systems also have the advantage of scaling easily: you can just add more machines.

They also provide fault tolerance. If one machine fails, the rest of the network can keep going.

### What is Hadoop?

1. Hadoop is a way to distribute very large files across multiple machines
2. It uses the Hadoop Distributed File System (HDFS)
3. HDFS allows a user to work with large datasets
4. HDFS also duplicates blocks of data for fault tolerance
5. Hadoop also uses MapReduce
6. MapReduce allows computation on that data

### Distributed Storage - HDFS

1. HDFS splits files into blocks with a default size of 128 MB
2. Each of these blocks is replicated 3 times by default
3. The blocks are distributed across nodes in a way that supports fault tolerance
4. Smaller blocks provide more parallelization during processing
5. Multiple copies of a block prevent loss of data due to the failure of a node
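
To make the block arithmetic concrete, here is a small sketch in plain Python. It does not talk to HDFS at all; the helper name `hdfs_block_layout` and the 1 GB example file are assumptions made for illustration, while the 128 MB block size and the replication factor of 3 are the defaults listed above.

```python
import math

BLOCK_SIZE_MB = 128      # default HDFS block size from the list above
REPLICATION_FACTOR = 3   # default number of copies kept of each block


def hdfs_block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                      replication=REPLICATION_FACTOR):
    """Hypothetical helper: estimate how a file would be split and replicated."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": num_blocks,                          # distinct blocks
        "block_replicas": num_blocks * replication,    # block copies on the cluster
        "raw_storage_mb": file_size_mb * replication,  # disk space used in total
    }


layout = hdfs_block_layout(1024)  # an assumed 1 GB (1024 MB) file
print(layout)
```

Under these assumptions a 1 GB file becomes 8 blocks, so up to 8 machines can work on it in parallel, at the cost of storing every block three times.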

![1.PNG](attachment:1.PNG)

### MapReduce

1. MapReduce is a way of splitting a computation task across a distributed set of files (such as files in HDFS)
2. It consists of a Job Tracker and multiple Task Trackers
3. The Job Tracker sends the code to run to the Task Trackers
4. The Task Trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes
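
The map and reduce idea itself can be sketched with a word count in plain Python. This is a single-process illustration only, not Hadoop code: the Job Tracker and Task Trackers are not modelled, and the function names and sample strings are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a chunk of a file stored on a different node.
chunks = ["spark is fast", "hadoop is reliable", "spark is flexible"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

In a real cluster, each call to `map_phase` would run on the node that holds the chunk, and only the grouped pairs would travel over the network.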

![2.PNG](attachment:2.PNG)

## Spark Overview

### What is Spark?

1. Spark is one of the latest technologies being used to quickly and easily handle Big Data
2. It is an open-source Apache project
3. It was open-sourced in 2010 and donated to the Apache Software Foundation in 2013, and its popularity has exploded due to its ease of use and speed
4. It was created at the AMPLab at UC Berkeley
5. Spark can be thought of as a flexible alternative to MapReduce
6. Spark can run on top of HDFS infrastructure to provide enhanced and additional functionality
7. Spark can use data stored in a variety of storage systems, such as Cassandra, AWS S3, HDFS and more

### Difference between Spark and MapReduce

1. Spark is more of an alternative to MapReduce than a replacement for Hadoop
2. It is not intended to replace Hadoop but to provide a comprehensive and unified solution for managing different big data use cases and requirements
3. Unlike Spark, MapReduce requires files to be stored in HDFS. Spark can run on top of HDFS but does not require files to be stored there
4. Spark can perform operations up to 100x faster than MapReduce
5. MapReduce writes most data to disk after each map and reduce operation
6. Spark keeps most of the data in memory after each transformation
7. Spark can spill over to disk if memory fills up

### Spark RDDs

1. At the core of Spark is the idea of the Resilient Distributed Dataset (RDD)
2. An RDD has four main features:
    - Distributed collection of data
    - Fault tolerant
    - Parallel operation (partitioned)
    - Ability to use many data sources
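
The "partitioned, parallel operation" feature can be illustrated without Spark. The following is only an analogy in plain Python, using threads on one machine instead of a cluster; `split_into_partitions` is a hypothetical helper and the choice of 3 partitions is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Hypothetical helper: split a list into roughly equal contiguous chunks."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions get one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


data = list(range(1, 11))  # the numbers 1..10
partitions = split_into_partitions(data, 3)

# Each partition is processed independently, as a worker node would do.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, partitions))

total = sum(partial_sums)  # combine the per-partition results
print(partitions, partial_sums, total)
```

Because every partition is summed on its own, losing one worker only means recomputing that one partition, which is the intuition behind the fault tolerance of RDDs.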

### Spark RDDs Basic Actions

1. Collect: Return all the elements of the RDD as an array to the driver program
2. Count: Return the number of elements in the RDD
3. First: Return the first element in the RDD
4. Take: Return an array with the first n elements of the RDD
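
What each of the actions above returns can be shown with a plain Python list standing in for the RDD. This sketch does not use Spark; the list `numbers` is invented for the example, and the PySpark calls named in the comments are the ones that would apply to a real RDD.

```python
# Plain-Python stand-in for the contents of an RDD. With a SparkContext `sc`,
# the RDD itself would be created with sc.parallelize(numbers).
numbers = [10, 20, 30, 40, 50]

collected = list(numbers)   # like rdd.collect(): every element, sent to the driver
element_count = len(numbers)  # like rdd.count(): number of elements
first_item = numbers[0]     # like rdd.first(): the first element
first_three = numbers[:3]   # like rdd.take(3): a list of the first 3 elements

print(collected, element_count, first_item, first_three)
```

On a real cluster, `collect()` pulls the whole dataset onto one machine, so `take(n)` is usually the safer way to peek at a large RDD.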

### Spark RDDs Basic Transformation

1. Filter (RDD.filter()) - Applies a function to each element and returns the elements for which it evaluates to true
2. Map (RDD.map()) - Transforms each element and preserves the number of elements; a very similar idea to pandas .apply()
3. FlatMap (RDD.flatMap()) - Transforms each element into 0-N elements, so the number of elements can change
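
The difference between the three transformations is easiest to see on a tiny input. The sketch below uses plain Python lists rather than Spark, with made-up sample strings; the comments name the PySpark call each line imitates. One caveat of the analogy: these lists are computed immediately, whereas Spark only evaluates transformations once an action is called.

```python
from itertools import chain

lines = ["hello world", "", "spark is fast"]

# filter: keep only the elements for which the function returns True
# (PySpark: rdd.filter(lambda s: len(s) > 0))
non_empty = [s for s in lines if len(s) > 0]

# map: exactly one output per input, so the element count is preserved
# (PySpark: rdd.map(len))
lengths = [len(s) for s in lines]

# flatMap: each input yields 0..N outputs, flattened into one collection
# (PySpark: rdd.flatMap(lambda s: s.split()))
words = list(chain.from_iterable(s.split() for s in lines))

print(non_empty, lengths, words)
```

Three input lines produce three lengths with `map`, but five words with `flatMap`, because the empty line contributes nothing and the other lines contribute several words each.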

### Pair RDDs

1. RDDs often hold their values as (key, value) tuples
2. This offers better partitioning of data and enables functionality based on reduction

### Reduce and ReduceByKey

1. reduce() - An action that aggregates RDD elements using a function and returns a single element
2. reduceByKey() - A transformation that aggregates the values for each key of a Pair RDD using a function and returns a Pair RDD
3. These ideas are similar to a Group By operation
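
The two reductions can be mimicked in plain Python to show what they return. This is a sketch of the semantics only, not Spark code: `reduce_by_key` is a hypothetical helper and the `sales` data is invented. On a real RDD the corresponding calls would be `rdd.reduce(add)` and `pair_rdd.reduceByKey(add)`.

```python
from functools import reduce
from operator import add

# reduce: collapse all elements into one value
numbers = [1, 2, 3, 4, 5]
grand_total = reduce(add, numbers)

# reduceByKey: collapse the values of each key separately
sales = [("apples", 3), ("pears", 2), ("apples", 4), ("pears", 1)]


def reduce_by_key(func, pairs):
    """Hypothetical helper mimicking reduceByKey on a list of (key, value) tuples."""
    result = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are merged with func.
        result[key] = func(result[key], value) if key in result else value
    return sorted(result.items())


totals_by_key = reduce_by_key(add, sales)
print(grand_total, totals_by_key)
```

The first call yields a single number, while the second yields one (key, value) pair per distinct key, much like a Group By followed by a sum.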

### PySpark Setup on EC2 Instance

1. Create an Ubuntu EC2 instance
2. Connect to that instance using PuTTY
3. Now we need to download and install Anaconda on our EC2 instance. Type the following commands
    - wget http://repo.continuum.io/archive/Anaconda3-4.1.1-Linux-x86_64.sh
    - bash Anaconda3-4.1.1-Linux-x86_64.sh
4. The installer will first ask you to accept the license terms. Keep pressing Enter until you reach the end of the license text and then type yes
5. The packages will then be installed, which can take a while. Afterwards the installer will ask whether to add the Anaconda install location to your PATH. Type yes
6. Now we need to locate where Python has been installed so that we can start using Python on our EC2 instance. Type the following commands
    - source .bashrc
    - which python
    - python
7. You are now in the Python console. Run a Python command to check that it works (such as print("hello")) and then leave the console with quit()
8. Anaconda and Python are now properly installed on the instance
9. The next step is to set up Jupyter Notebook. Jupyter Notebook comes with Anaconda, but we need to configure it so that it can be accessed remotely on the EC2 instance over HTTPS. Type the following command
    - jupyter notebook --generate-config
10. The above step writes a configuration file. The next step is to create a certificate for our connections in the form of a .pem file. Type the following commands
    - mkdir certs
    - cd certs
    - sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
11. The certificate will now be created, and you will be asked some questions such as country, state and city. Fill in the details or leave them blank and press Enter
12. Now we need to edit the configuration file that has been created. Type the following command:
    - cd ~/.jupyter
    - vi jupyter_notebook_config.py
13. A file opens that contains some commented Python code. We need to insert our own code into this file. Type i to enter insert mode, and then type the following code
    - `c = get_config()`
    - `# Notebook config: this is where you saved your pem cert`
    - `c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'`
    - `# Run on all IP addresses of your instance`
    - `c.NotebookApp.ip = '*'`
    - `# Don't open browser by default`
    - `c.NotebookApp.open_browser = False`
    - `# Fix port to 8888`
    - `c.NotebookApp.port = 8888`
14. Once this code is typed, press Esc and then type :wq! to save the file and quit
15. Now cd over to the certs folder and type the following command:
    - sudo chmod 777 mycert.pem
16. Now we need to start the Jupyter notebook. For that, type the following command:
    - jupyter notebook
17. Now grab the public DNS of your EC2 instance and open the following link in your web browser: https://(address of your EC2 instance):8888
18. You should now be able to access the Jupyter notebook running on your EC2 instance
19. Next, we need to install Scala, because Spark depends on it
20. Since Scala in turn depends on Java, we install Java first and then Scala. Type the following commands:
    - `sudo apt-get update`
    - `# Go to the home directory`
    - `cd`
    - `# Install Java`
    - `sudo apt-get install default-jre`
    - `# Check whether Java has been properly installed`
    - `java -version`
    - `# Install Scala`
    - `sudo apt-get install scala`
    - `# Check the Scala version`
    - `scala -version`
    - `# Install py4j, which allows Python to connect to Java. First make sure that pip belongs to Anaconda's installation of Python instead of Ubuntu's default`
    - `export PATH=$PATH:$HOME/anaconda3/bin`
    - `conda install pip`
21. To check whether pip has been installed properly, type the following command:
     - /bin/which pip
22. To install py4j, type the following command:
    - pip install py4j
23. The final step is to download and unpack Spark, using a build that is pre-built for Hadoop 2.7. Type the following commands:
    - /usr/bin/wget http://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
    - /usr/bin/sudo tar -zxvf spark-2.0.0-bin-hadoop2.7.tgz
24. Finally, we need to set our paths for Spark so Python finds it:
    - export SPARK_HOME='/home/ubuntu/spark-2.0.0-bin-hadoop2.7'
    - export PATH=$SPARK_HOME:$PATH
    - export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
25. After this just type the following command:
    - jupyter notebook
26. Now go to the web browser and check whether Spark is working. Repeat step 17 for this
27. Once you are in the Jupyter notebook on your EC2 instance, open a new notebook and type the following commands:
    - from pyspark import SparkContext
    - sc = SparkContext()
28. If these commands run without errors, everything has been properly installed
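
If the import fails instead, it can help to check what Python actually sees. The following is a minimal diagnostic sketch, assuming the environment variables from step 24 were exported in the same shell that started Jupyter; the helper name `spark_setup_report` is made up for illustration and is not part of Spark or Jupyter.

```python
import importlib.util
import os


def spark_setup_report():
    """Hypothetical helper: summarize whether the Spark setup is visible to Python."""
    spark_home = os.environ.get("SPARK_HOME")
    return {
        "SPARK_HOME": spark_home,
        # True only if the variable is set and points at an existing directory.
        "spark_home_exists": bool(spark_home) and os.path.isdir(spark_home),
        # find_spec returns None when a top-level module cannot be located.
        "py4j_importable": importlib.util.find_spec("py4j") is not None,
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
    }


report = spark_setup_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

If `pyspark_importable` comes back as False, the most likely cause is that `PYTHONPATH` was not set in the shell that launched Jupyter, so re-running the export commands and restarting the notebook server is a reasonable first thing to try.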