# Big Data II (part 1)

In the "Big Data I" lesson we saw multicore programming: taking advantage of the multiple cores in a computer to speed up our code.  
In "Big Data II" we focus on how to go further: how to increase the amount of data we can process per unit of time.  

We are going to introduce two key concepts, **scale-up (supercomputers)** and **scale-out (cluster/grid computing)**, and finally we will see that not every problem is parallelizable.


## 1.- Parallelization

When working with big data, it is important to pay attention to throughput, that is, to reduce as much as possible the computation time of each operation on the data. When working with millions of records, saving 1 µs of computation time per record is significant.  
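
To get a feel for that figure, here is a rough back-of-the-envelope calculation. The dataset sizes are made-up values chosen only for illustration, and the function name is hypothetical; only the 1 µs saving comes from the text above.

```python
# Time saved when every record is processed 1 microsecond faster.
SAVING_PER_RECORD_S = 1e-6  # 1 µs, the figure used in the text


def total_saving_seconds(n_records, saving_per_record_s=SAVING_PER_RECORD_S):
    """Return the total time saved, in seconds, over n_records."""
    return n_records * saving_per_record_s


# Hypothetical dataset sizes, chosen only for illustration
for n_records in (10**6, 10**9, 10**12):
    saved = total_saving_seconds(n_records)
    print("{0:>18,d} records -> {1:,.0f} s saved".format(n_records, saved))
```

With a million records the saving is about one second; with a trillion records it grows to more than eleven days.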

What can we do to improve our code execution? 

+ Buy new/more hardware. 
+ Program using all hardware resources. 
       
As we saw in the "Big Data I" lesson, multicore programming lets us take advantage of all the available resources.
It is an application of **"divide and conquer"**: divide a problem into smaller parts and deal with each part individually.
This idea can be used in computing to speed up some kinds of code (parallelizable code).
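
The divide-and-conquer idea can be sketched in plain Python. This is only an illustration of the structure (divide, work on each part, combine): the helper names are hypothetical, and the built-in `map` used here runs the parts one after another. In a parallel setting, a parallel `map` would take its place.

```python
def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous parts of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The work done on one part; each call is independent of the others."""
    return sum(x * x for x in chunk)


data = list(range(1, 101))
chunks = split_into_chunks(data, 4)           # divide
partials = list(map(sum_of_squares, chunks))  # deal with each part
total = sum(partials)                         # combine the partial results
print(total)
```

The code is parallelizable because `sum_of_squares` needs nothing from the other chunks. A computation where each step depends on the previous result cannot be split this way.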

### Multi-core parallelization
In the **"Big Data I"** lesson we saw how to use the **ipcluster** command to take advantage of all the cores in a computer. Let's do a short review.

#### Start the cluster
Using the "Clusters" tab, we can create and start a cluster of engines on our computer. We can see the processes related to ipyparallel with a command like:  
>*top -p \`pgrep -f ipyparallel -d ","\` -c*

#### Connect to the cluster
Using the **ipyparallel** package, we can connect to the cluster.




In [None]:
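import ipyparallel as ipp

# Connect to the running cluster and make calls wait for their results
clients = ipp.Client()
clients.block = True
# List the ids of the available engines
clients.ids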

#### Using the engines
With the steps above, we have started N engines, that is, N Python interpreters that can be used in parallel.

We have two primary models for interacting with engines:  

**Direct interface**: Send instructions to engines explicitly.

In [None]:
# Examples taken from the "Big Data I" lesson
# Send direct commands to individual engines:
clients[0].execute('a = 2')
clients[0].execute('b = 10')
clients[1].execute('a = 9')
clients[1].execute('b = 7')
# Run the same instruction on engines 0 and 1
clients[0:2].execute('c = a + b')

# Get the result from the engines
print(clients[0:2].pull('c'))

# We can access multiple engines at a time using a variable:
dview2 = clients[0:2]
noResultExpected = dview2.execute('reset -f')  # This instruction cleans the interpreter

**LoadBalanced View**: The cluster decides which engines must be used in a balanced strategy (SGE)

In [None]:
print("Our cluster has {0} engines/nodes. Their ids are: {1}".format(len(clients.ids), clients.ids))

# With the blocking option, our interpreter waits until all engines have received all values.
# This prevents engine N from receiving its ID after receiving its task. More info on this in the "Big Data I" notebook
clients.block = True

#Create a loadBalanced view
lview = clients.load_balanced_view()

# With execute, we send a single instruction to each engine.
# Sets a different variable "my_id" in each engine
for i in clients.ids:     # i=0,1,2,3 in a 4-node cluster
    # block=True: wait until the engine has set my_id before going on
    clients[i].execute('my_id = ' + str(i), block=True)

# Define a function that receives a parameter, does "some work" with it,
# and returns the value of the "my_id" variable:
def sleep_and_return_id(sec):
    import time
    #Do something very interesting here, or just sleep
    time.sleep(sec)
    return "engine id:" + str(my_id) + " waited for " + str(sec) + " seconds"

#With the map method, we send the function and data to be managed by the LoadBalanced View
lview.map(sleep_and_return_id, [1,1,1,1,1,1])

## 2.- Scale-up and scale-out
With this, we have reviewed how to use each of our N cores to speed up code execution.
We have reduced the compute time from *t* to something near *t*/N.
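
The "near *t*/N" estimate assumes that all of the code can run in parallel. One common way to reason about the gap is Amdahl's law, which is not part of the original lesson and is brought in here only as an illustration: if a fraction *p* of the run time is parallelizable, the time on N cores is roughly *t*·((1 − *p*) + *p*/N). The function name and the values of *t* and *p* below are hypothetical.

```python
def parallel_time(t, n_cores, parallel_fraction=1.0):
    """Estimated run time on n_cores according to Amdahl's law."""
    serial_part = (1.0 - parallel_fraction) * t
    parallel_part = parallel_fraction * t / n_cores
    return serial_part + parallel_part


t = 100.0  # seconds on a single core (made-up value)
for p in (1.0, 0.95, 0.5):
    print("p={0:.2f}: {1:.1f} s on 4 cores, {2:.1f} s on 64 cores".format(
        p, parallel_time(t, 4, p), parallel_time(t, 64, p)))
```

Only when *p* is 1 does the time reach *t*/N. With *p* = 0.5, no number of cores brings the time below *t*/2.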

What can we do if we want to reduce this time even more? How can we scale our computational system?

+ One solution is what is called **scale-up**: buy a new computer or a new processor with more cores, add more RAM, buy new storage, and so on.
+ Another solution is called **scale-out**: buy commodity computers, interconnect them and make them operate together to solve your problem. That is, create a **Cluster** (hundreds of nodes in an enterprise) or a **GRID** (interconnected user computers).

The main idea is:
+ **Scale-up** gives you the best performance, but has limited scalability.
+ **Scale-out** gives you great scalability, but performance is worse.

At http://www.top500.org/ you can see the top 500 supercomputers/HPC (High Performance Computing) systems worldwide. Since June 2016, the top system is **Sunway TaihuLight** (https://www.top500.org/site/50623), a supercomputer with **10,649,600 cores** and 93 PFlops that needs **15,371 kW** to work.

As of June 2016, Sunway TaihuLight has the highest performance registered in a single system. Besides the price, this kind of system has other limitations:

+ Scaling the system, if possible at all, is very expensive.
+ It consumes a lot of power, even if you only want to execute a simple instruction.

With cluster/grid computing you can scale your system to meet your needs: add as many nodes as you need, use all of them or only a part of them.  
One of the most prominent examples of grid computing is SETI@home, a project that searches for extraterrestrial life using volunteers' computers; the GRID computes ~22 PFlops (~855,000 computers).

When working with grid computing, we must keep in mind its important limitations:

+ Computers are interconnected over the internet, which is slow and subject to failures.
+ Computers are distant. It is impossible to share memory and difficult to share a filesystem. The slowdown caused by data movement must be taken into account.
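
To see why data movement matters, the following sketch estimates how long it takes just to move a dataset over a link. The dataset size and the two bandwidths are hypothetical values, and the formula ignores protocol overhead and congestion, so real transfers will be slower.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_s, latency_s=0.0):
    """Rough time to move size_bytes over a link, ignoring protocol overhead."""
    return latency_s + (size_bytes * 8) / bandwidth_bits_per_s


GB = 10**9
dataset = 100 * GB  # hypothetical dataset size

# Hypothetical links: an internet connection and a local cluster network
for name, bandwidth in (("100 Mbit/s internet", 100e6), ("10 Gbit/s LAN", 10e9)):
    minutes = transfer_seconds(dataset, bandwidth) / 60
    print("{0}: about {1:.1f} minutes".format(name, minutes))
```

If the computation itself takes a few minutes, sending the data to distant nodes can easily cost more time than the parallelization saves.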


## 3.- How to get a Cluster? In the cloud, where else?

Three major providers:

+ Amazon EC2 - https://aws.amazon.com/ec2/
+ Microsoft Azure - https://azure.microsoft.com/es-es/
+ Google Cloud Platform - https://cloud.google.com/

We are going to see an example on Amazon EC2 because, in 2017, it had a 40% market share.

### Amazon EC2 - Amazon Elastic Compute Cloud ###

"Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers."

In a few words: **Amazon EC2 gives us the possibility of having a grid in the cloud.**  
Amazon EC2 is billed by the hour, according to the OS, CPU, memory, storage and so on that we contract. 
There is a calculator at http://calculator.s3.amazonaws.com/index.html to estimate the cost of our infrastructure.

When you sign up, the service gives you a **Free Tier** for development purposes. It gives us, with restrictions, **750h/month** of computing for free during one year. That should be enough for development purposes.

Keep in mind some restrictions of the Free Tier:

+ 750h/month during one year.
+ Only includes t2.micro instances.
+ Only includes some operating systems.
+ There is a limit on the amount of storage.
+ There is a limit on input/output traffic.

For more info: http://aws.amazon.com/free/

Finally, keep in mind that signing up means that Amazon has your credit card number. If you make a mistake, you will be charged on your credit card. The most common mistakes are:

+ Using instances or operating systems that are not eligible for the Free Tier.
+ Not terminating instances: 750h with 100 nodes gives you "only" 7.5h for free.
+ Reserving a public IP and not using it.
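
The arithmetic behind the second point is simple: the 750 free hours are shared by all running instances. A minimal sketch, with a hypothetical function name:

```python
FREE_HOURS_PER_MONTH = 750  # Free Tier allowance quoted above


def free_hours_per_node(n_nodes, free_hours=FREE_HOURS_PER_MONTH):
    """Hours each node can run before the monthly allowance is used up."""
    return free_hours / float(n_nodes)


for n_nodes in (1, 5, 100):
    print("{0:>3d} nodes -> {1:.1f} h each".format(
        n_nodes, free_hours_per_node(n_nodes)))
```

A month has at most 744 hours (31 days), so the allowance covers one instance running continuously, and nothing more.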

Last but not least: be careful with your credentials. With your EC2 credentials, someone could start nodes that will be charged to your credit card. Be especially careful not to put your credentials in world-readable sites (GitHub).
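
One way to keep credentials out of code that may end up in a public repository is to read them from environment variables. The sketch below assumes the variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, which are the names used by the AWS command line tools; the helper functions and the sample values are hypothetical.

```python
import os


def load_aws_credentials(environ=os.environ):
    """Read the AWS key pair from the environment instead of the source code."""
    access_key = environ.get("AWS_ACCESS_KEY_ID")
    secret_key = environ.get("AWS_SECRET_ACCESS_KEY")
    if not access_key or not secret_key:
        raise RuntimeError("AWS credentials are not set in the environment")
    return access_key, secret_key


def mask(secret):
    """Hide all but the last 4 characters, e.g. before printing or logging."""
    return "*" * max(len(secret) - 4, 0) + secret[-4:]


# Fake values, used only to show the behaviour
fake_env = {"AWS_ACCESS_KEY_ID": "EXAMPLEKEYID", "AWS_SECRET_ACCESS_KEY": "examplesecret"}
key_id, secret = load_aws_credentials(fake_env)
print(key_id, mask(secret))
```

The notebook or script can then be shared safely, because the secret never appears in the file itself.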

## Creating the grid

### The hard way

The ipcluster command can be used to create a **grid** of computers in Amazon EC2.

The problem with this method is that it takes a lot of hard work to make it operational:

We must create instances in Amazon EC2 and configure them:

+ Install ipython
+ Define a security policy
+ Interconnect them
+ Optionally, create a shared filesystem

This is the hard way. If you have your own grid and you are interested in using ipcluster, start by reading **Starting the IPython controller and engines**[1]

### The engineer way: StarCluster[2]

http://star.mit.edu/cluster/  
"StarCluster is an open source cluster-computing toolkit for Amazon’s Elastic Compute Cloud (EC2) released under the LGPL license.   
StarCluster has been designed to automate and simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud. StarCluster allows anyone to easily create a cluster computing environment in the cloud suited for distributed and parallel computing applications and systems."

http://star.mit.edu/cluster/docs/latest/overview.html  
Out-of-the-box, StarCluster automates:  

+ Creates a security policy for the cluster (firewall)
+ Sets user-friendly hostnames: master, node001, node002, etc.
+ Sets up communication between nodes: SSH + NFS share on /home
+ Creates a queuing system (Oracle Grid Engine)

#### First of all: non-human interaction with Amazon EC2 API

There are two main ways to operate with AWS: using the **web portal**, or without human interaction using the **Amazon API**.  
As we want StarCluster to do the hard work automatically, we must use the Amazon API.

Using the Amazon API requires a user with an Access Key ID and a Secret Access Key. To create them:  

+ In AWS, go to IAM (Identity and Access Management) -> in the Users section, choose the user or create a new one. The user needs "Programmatic access". Download the Access Key ID and the Secret Access Key and store them.

For this class, in the Policies section, we are going to grant our user AdministratorAccess.

#### Installing Starcluster

http://star.mit.edu/cluster/docs/latest/installation.html  
On Linux/Mac/Windows: 'pip install StarCluster'

#### Create a small cluster with StarCluster

To create a new **configuration file**, in a command line, execute:  
    
     $> starcluster help #Choose option 2

Edit the configuration file **(~/.starcluster/config)** and set your **AWS authentication settings**:  

      [aws info]
      AWS_ACCESS_KEY_ID = your-access-key-id
      AWS_SECRET_ACCESS_KEY = your-secret-access-key
      AWS_USER_ID = your-user-id (User name!)
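
Before starting a cluster, it can be useful to check that the three settings are really filled in. The sketch below uses Python's standard `configparser` module on a sample text; it is an assumption that this parser reads the file the same way StarCluster does, and the function name and sample values are hypothetical.

```python
import configparser

REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_USER_ID")


def missing_aws_settings(config_text):
    """Return the required keys that are absent or empty in [aws info]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section("aws info"):
        return list(REQUIRED_KEYS)
    section = parser["aws info"]
    return [key for key in REQUIRED_KEYS if not section.get(key, "").strip()]


# Sample configuration with one empty and one missing setting
sample = """
[aws info]
AWS_ACCESS_KEY_ID = EXAMPLE-KEY-ID
AWS_SECRET_ACCESS_KEY =
"""
print(missing_aws_settings(sample))
```

To check the real file, its contents can be read from `~/.starcluster/config` and passed to the same function.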
      
To create the cluster, StarCluster needs to log in to the different nodes in EC2 without providing a password.  
For this purpose, StarCluster uses an RSA **key pair**. We can create and activate a key pair with the command:

    $> starcluster createkey -o bigDataKeyPair.rsa bigDataKeyPair
    
Be sure to have 600 permissions on the keypair file!  

    $> chmod 600 bigDataKeyPair.rsa
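
The same check can be done from Python. This is a minimal sketch for POSIX systems (Linux/Mac); the helper names are hypothetical and a temporary file stands in for the real keypair file.

```python
import os
import stat
import tempfile


def has_owner_only_permissions(path):
    """True if the file mode is exactly 600 (read/write for the owner only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600


def restrict_to_owner(path):
    """Equivalent of `chmod 600 path`."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


# A temporary file plays the role of bigDataKeyPair.rsa
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    key_path = tmp.name
os.chmod(key_path, 0o644)  # readable by everyone: too permissive for a key
before = has_owner_only_permissions(key_path)
restrict_to_owner(key_path)
after = has_owner_only_permissions(key_path)
os.remove(key_path)
print(before, after)
```

SSH clients usually refuse to use a private key that other users can read, which is why the 600 mode matters.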

After this, we can set the keypair in the StarCluster configuration file **(~/.starcluster/config)**:

      #Set your keypairs, used to connect to your instances
      #The section name must be the same name you used when creating the keypair!
      [key bigDataKeyPair]
      KEY_LOCATION=/path/to/your/bigDataKeyPair.rsa
      
And finally, setup our cluster:  

      #Cluster settings
      [cluster smallcluster]
      #Same name as in the [key bigDataKeyPair] section
      KEYNAME = bigDataKeyPair
      #Number of nodes to be started
      CLUSTER_SIZE = 5
      #To enable ipython
      PLUGINS = ipcluster
      #
      #Ipcluster plugin settings
      [plugin ipcluster]
      SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster
      #Uncomment the next lines if you want the cluster to start an IPython notebook server
      # Be sure to modify the permissions to grant HTTPS access to your cluster
      #enable_notebook = True
      #notebook_passwd = SuperSecretPassword    
         
#### Start the cluster:

To start the cluster, from command line, execute the command:

     $> starcluster start smallcluster
     
More info on **[3] - IPython Cluster Plugin**

#### Using the cluster
The following command opens an SSH session on the master node:

    $> starcluster sshmaster smallcluster -u sgeadmin
    
Once inside the master node, we can start an ipython interpreter and start working with the cluster.

    In [1]: from IPython.parallel import Client
    In [2]: rc = Client()
    In [3]: rc.ids
    Out[3]: [0, 1, 2, 3, 4]
    In [4]: %px !ifconfig eth0 | grep 'inet addr'
    [stdout:0]           inet addr:10.232.96.237  Bcast:10.232.96.255  Mask:255.255.255.192
    [stdout:1]           inet addr:10.170.57.47  Bcast:10.170.57.63  Mask:255.255.255.192
    [stdout:2]           inet addr:10.170.90.68  Bcast:10.170.90.127  Mask:255.255.255.192
    [stdout:3]           inet addr:10.171.4.251  Bcast:10.171.4.255  Mask:255.255.255.192
    [stdout:4]           inet addr:10.170.58.209  Bcast:10.170.58.255  Mask:255.255.255.192
    
You can copy files between your local computer and the cluster using the put and get commands: 

    starcluster put smallcluster /local/path/file /remote/path/  
    starcluster get smallcluster /remote/path/results /local/path
    
#### Stop and Terminate cluster

When an instance is terminated, Amazon stops the instance and then deletes all the data stored in it.  
To stop the cluster without terminating it:

    $> starcluster stop smallcluster
    
To terminate the cluster (stop it and delete all its data) you can use:

    $> starcluster terminate smallcluster


### The way: SaaS (Software as a Service) Amazon EMR

Cloud providers have launched managed services such as:

+ **ElastiCache**. Amazon's service to create in-memory cache clusters.
+ **EMR**. Amazon's service to create MapReduce clusters.
+ **HDInsight**. Microsoft's service to create Linux-based clusters in Azure (Hadoop and so on).
+ **Google Cloud Dataproc**. Google's service to create Hadoop clusters to operate with data.

## Final notes on data movement and MapReduce    

Something important to take into account when working with distributed computing is **data movement**. Where do you have your data? How can your nodes access data?  

One choice is to have a shared filesystem, as we have in Amazon or other cloud providers. Some problems with shared filesystems are:

+ They are difficult to implement with computers in different networks (providers).
+ Data is **stored in a server**. Each node has to connect to the server and retrieve data from that same server. It can be a bottleneck when working with **big data**.

Here, we want to introduce the concept of **distributed filesystems**[4]. The idea behind them is to divide your files into chunks and replicate each chunk on multiple nodes. With this solution you get:

+ Your data is distributed: the same node that stores the chunk can work with it.
+ Your data is replicated, so there is no single point of failure.
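
The chunk-and-replicate idea can be sketched in a few lines of Python. This is a toy model under stated assumptions: the round-robin placement, the node names and the function names are hypothetical, and real distributed filesystems use more elaborate placement policies.

```python
def chunk_data(data, chunk_size):
    """Divide the data into chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(n_chunks, nodes, replication=3):
    """Assign each chunk to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for chunk_id in range(n_chunks):
        placement[chunk_id] = [nodes[(chunk_id + r) % len(nodes)]
                               for r in range(replication)]
    return placement


nodes = ["node001", "node002", "node003", "node004"]
chunks = chunk_data(b"x" * 1000, 256)
placement = place_replicas(len(chunks), nodes, replication=2)

# If one node fails, every chunk is still available on another node
failed = "node002"
survivors = {chunk_id: [n for n in holders if n != failed]
             for chunk_id, holders in placement.items()}
print(placement)
print(survivors)
```

Each node can process the chunks it stores locally, and the loss of a single node leaves at least one copy of every chunk.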

This is the idea behind MapReduce:

## MapReduce?

At this point, we want to mention a word that you have probably heard if you have been introduced to the Big Data world: **MapReduce**.

From Wikipedia [6]:  
*"MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster."*   
  
In this lesson we are not going into depth on MapReduce, but we want you to keep one key concept in mind: **"MapReduce is not only Map and Reduce"**.

Python, like many other languages, has map and reduce functions:  

+ map(function, iterable)  - Applies 'function' to each item of 'iterable' and returns a new iterable with the results.
+ functools.reduce(function, iterable [,initializer]) - Applies a cumulative 'function' to the items of 'iterable', from left to right, so as to reduce the iterable to a single value.

Let's see a simple example:

In [None]:
import functools

#Maps the square function to each element of a list
mapResults = map(lambda x: x**2, range(1,11))

#Reduce mapResults to get the total sum
reduceResults = functools.reduce(lambda x, y: x+y, mapResults)

#Show the results
print(reduceResults)

**MapReduce** is an entire framework that simplifies our lives: programmers only have to write the Map and Reduce functions. MapReduce does the rest of the work:

+ Uses a distributed filesystem and shares data between the nodes (data movement)
+ Applies the Map and Reduce functions to the data
+ Returns the reduce results to the users
+ Manages node failures
+ ...
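
To make the "not only Map and Reduce" point concrete, here is a single-machine sketch of a word count. It is not how any particular framework is implemented: the function names and the input are hypothetical, and the grouping step in the middle (often called shuffle) stands for part of the work that a real framework does for us across many nodes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key (done by the framework, not the user)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values of one key into a single result."""
    return key, reduce(lambda x, y: x + y, values)


documents = ["big data is big", "data moves slowly"]  # made-up input
mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

The programmer writes only `map_phase` and `reduce_phase`. Distributing the documents, grouping the pairs and recovering from failures are left to the framework.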

Some well known implementations of MapReduce are:

+ Google MapReduce cluster [7]
+ Hadoop [8]
+ MongoDB [9]

It is interesting to read the Google papers, published in 2003 and 2004, that revolutionized distributed computing. Both are public:

+ Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung from Google. oct 2003. **The Google File System** <http://research.google.com/archive/gfs.html>  
+ Jeffrey Dean and Sanjay Ghemawat from Google. dec 2004. **MapReduce: Simplified Data Processing on Large Clusters** <http://research.google.com/archive/mapreduce.html>

Sources

[1] Starting the IPython controller and engines: http://ipython.org/ipython-doc/2/parallel/parallel_process.html#using-ipcluster  
[2] StarCluster - http://star.mit.edu/cluster/index.html  
[3] IPython Cluster Plugin - http://star.mit.edu/cluster/docs/latest/plugins/ipython.html  
[4] Wikipedia Distributed Filesystems - http://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems  
[5] Wikipedia Apache Hadoop - http://en.wikipedia.org/wiki/Apache_Hadoop  
[6] Wikipedia MapReduce - http://en.wikipedia.org/wiki/MapReduce  
[7] Google MapReduce cluster - http://static.googleusercontent.com/media/research.google.com/ca//archive/mapreduce-osdi04.pdf  
[8] Hadoop - http://hadoop.apache.org/  
[9] MongoDB - http://www.mongodb.org/  