# Getting started with...

![Apache Spark](http://spark.apache.org/images/spark-logo.png)

In the [previous session](../intro/Spark_workshop_Introduction.ipynb) we discussed some basics of the map/reduce programming paradigm. 

Now, we'll see how we can "scale up" these ideas to distribute the workload among many machines.

For this, we will use [Spark](http://spark.apache.org), a distributed computing framework that fits nicely into 
the [Apache Hadoop](http://hadoop.apache.org) ecosystem.

## Basic Data Abstraction -> the Resilient Distributed Dataset

An RDD is the essential building block of every Spark application. It:

* keeps track of data distribution across the workers  
* provides an interface to the user to access and operate on the data
* can rebuild data upon failure

**As a Spark user, you write applications that feed data into RDDs and subsequently transform them into something useful.**



![Basic RDD diagram](https://raw.githubusercontent.com/rokroskar/spark_workshop/master/figs/basic_rdd.png)
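To make the three properties above a bit more concrete, here is a toy, single-process sketch. `ToyRDD` and all of its methods are made up for illustration and are not Spark's API; the point is only to show a dataset that tracks its partitions, offers an interface to operate on them, and can rebuild lost data from the recipe ("lineage") that produced it.

```python
class ToyRDD(object):
    """A toy stand-in for an RDD (hypothetical, NOT Spark's implementation)."""

    def __init__(self, partitions=None, parent=None, func=None):
        self._partitions = partitions  # the data, one list per "worker"
        self._parent = parent          # lineage: the dataset this one came from
        self._func = func              # lineage: how to derive it from the parent

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        # deal the elements out so that every partition gets a share
        parts = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(partitions=parts)

    def map(self, func):
        # nothing is computed here; we only record the recipe
        return ToyRDD(parent=self, func=func)

    def compute(self):
        if self._partitions is None:
            # data missing (never computed, or lost): rebuild it from the lineage
            self._partitions = [[self._func(x) for x in part]
                                for part in self._parent.compute()]
        return self._partitions

    def lose_data(self):
        # simulate a failure; only derived datasets can be recovered
        if self._parent is not None:
            self._partitions = None

    def collect(self):
        # gather all partitions back into one flat list
        return [x for part in self.compute() for x in part]


numbers = ToyRDD.parallelize(range(10), num_partitions=3)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_data()          # pretend a worker holding the data died
after = squares.collect()    # the data is rebuilt from `numbers`
print(before == after)
```

A real RDD does the same bookkeeping across many machines, which is why losing a worker does not mean losing the dataset.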

## Spark Architecture Overview

A very abbreviated and naive description of how Spark works: 

The runtime system consists of a **driver**, **executors** and **workers**. The driver coordinates the work to be done, keeps track of tasks, collects metrics about them (disk IO, memory, etc.), and communicates with the executors. The executors receive tasks from the driver and distribute them among the workers that they control. Basically, you can think of an executor as a *node* and a worker as a *core* on that node. 

The user's access point to this Spark universe is the **Spark Context** which provides an interface to generate RDDs. 

**The only point of contact with the Spark "universe" is the Spark Context and the RDDs.**

![Spark Universe](https://raw.githubusercontent.com/rokroskar/spark_workshop/master/figs/spark_universe.png)
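The naive picture above can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not of Spark's internals: `ToyDriver` and `ToyExecutor` are hypothetical names, and threads stand in for the workers.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyExecutor(object):
    """Hypothetical executor: a "node" owning a fixed pool of workers (threads)."""

    def __init__(self, name, num_workers):
        self.name = name
        self.pool = ThreadPoolExecutor(max_workers=num_workers)

    def submit(self, task, partition):
        # hand the task to whichever worker is free
        return self.pool.submit(task, partition)

    def shutdown(self):
        self.pool.shutdown()


class ToyDriver(object):
    """Hypothetical driver: hands out one task per partition, gathers results."""

    def __init__(self, executors):
        self.executors = executors

    def run(self, task, partitions):
        futures = []
        for i, partition in enumerate(partitions):
            # spread the tasks over the executors in turn
            executor = self.executors[i % len(self.executors)]
            futures.append(executor.submit(task, partition))
        # wait for every task and return the results in partition order
        return [f.result() for f in futures]


# two "nodes" with two "cores" each
executors = [ToyExecutor('node-%d' % i, num_workers=2) for i in range(2)]
driver = ToyDriver(executors)

partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
partial_sums = driver.run(sum, partitions)   # one task per partition
total = sum(partial_sums)                    # final combination on the driver

for executor in executors:
    executor.shutdown()

print(partial_sums)
print(total)
```

The user-facing code only ever talks to the driver object, much like a Spark application only talks to the Spark Context.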

## Flexibility of Spark runtime

A few points before we get our feet wet with doing some basic data massaging in Spark. 

The Spark runtime can be deployed on:

* a single machine (local)
* a set of pre-defined machines (stand-alone)
* a dedicated Hadoop-aware scheduler (YARN/Mesos)
* the "cloud", e.g. Amazon EC2 

The development workflow is that you start small (local) and scale up to one of the other solutions, depending on your needs and resources. 
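In practice, the deployment choice boils down to the *master URL* that is handed to Spark. The helper below is a rough sketch: `master_url` and the host names are made up for illustration, while the URL formats themselves (`local[N]`, `spark://host:port`, `mesos://host:port`, `yarn`) are the ones Spark accepts.

```python
def master_url(mode, host=None, port=None, cores=None):
    """Build a Spark master URL (hypothetical helper, for illustration only)."""
    if mode == 'local':
        # 'local' uses a single thread, 'local[N]' uses N threads
        return 'local' if cores is None else 'local[{}]'.format(cores)
    if mode == 'standalone':
        # 7077 is the default port of a stand-alone Spark master
        return 'spark://{}:{}'.format(host, port or 7077)
    if mode == 'mesos':
        # 5050 is the default port of a Mesos master
        return 'mesos://{}:{}'.format(host, port or 5050)
    if mode == 'yarn':
        # older 1.x releases spell this 'yarn-client' or 'yarn-cluster'
        return 'yarn'
    raise ValueError('unknown deployment mode: {}'.format(mode))


print(master_url('local', cores=4))
print(master_url('standalone', host='some-host'))
print(master_url('mesos', host='some-host'))
print(master_url('yarn'))
```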

In addition, you can run applications on any of these platforms either

* interactively through a shell (or an IPython notebook, as we'll see)
* in batch mode 

**Often, you don't need to change *any* code to go between these methods of deployment!**
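One way to keep the code identical across deployments is to leave the master URL out of the code and read it from the environment instead. A minimal sketch, assuming a made-up environment variable `WORKSHOP_SPARK_MASTER` (this is our own invention, not a variable that Spark itself knows about):

```python
import os


def pick_master(environ=None, default='local[4]'):
    """Return the master URL from the (hypothetical) WORKSHOP_SPARK_MASTER variable."""
    if environ is None:
        environ = os.environ
    return environ.get('WORKSHOP_SPARK_MASTER', default)


# on a laptop nothing is set, so we stay local
print(pick_master(environ={}))
# on a cluster the same code picks up whatever the environment provides
print(pick_master(environ={'WORKSHOP_SPARK_MASTER': 'spark://some-host:7077'}))
```

The result would then be passed along when creating the context, e.g. `pyspark.SparkContext(master=pick_master())`.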

## Starting up the `SparkContext` locally

The most lightweight way of playing around with Spark is to run the whole Spark runtime on a single (local) machine. 

First, we need to do a few lines of setup (we can later move these to a startup script of some sort) and then we start the `SparkContext`:

In [5]:
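import sys, os

# tell Spark where it is installed (adjust this path to your own installation)
os.environ['SPARK_HOME'] = '/Users/rok/spark'
# make the pyspark package and its bundled py4j dependency importable
sys.path.insert(0, '/Users/rok/spark/python/')
sys.path.insert(0, '/Users/rok/spark/python/lib/py4j-0.8.2.1-src.zip')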

import pyspark
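The setup above hard-codes both the installation path and the py4j version, so it breaks as soon as either one changes. As a sketch of the "startup script" idea, the two hypothetical helpers below (`spark_python_paths` and `add_spark_to_path` are our own names, not part of Spark) look up the py4j archive instead of spelling out its version. They assume the same `python/lib/py4j-*-src.zip` layout as the paths used above.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the directories/archives that must be on sys.path to import pyspark."""
    python_dir = os.path.join(spark_home, 'python')
    # find the bundled py4j archive, whatever its version happens to be
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, 'lib', 'py4j-*-src.zip')))
    if not py4j_zips:
        raise IOError('no py4j-*-src.zip found under {}'.format(python_dir))
    return [python_dir, py4j_zips[-1]]


def add_spark_to_path(spark_home):
    """Set SPARK_HOME and make pyspark importable from this Python session."""
    os.environ['SPARK_HOME'] = spark_home
    for path in spark_python_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)


# usage, with the installation path from this notebook:
# add_spark_to_path('/Users/rok/spark')
# import pyspark
```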

In [6]:
# 'local[4]' runs Spark on this machine using 4 worker threads
sc = pyspark.SparkContext(master='local[4]')

Hurrah! We have a Spark Context! Now let's get some data in there just to see if it works: 

In [9]:
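from __future__ import print_function  # print(...) then works the same in Python 2 and 3

import numpy as np

# one million uniform random numbers between 0 and 1
data = np.random.rand(1000000)
# distribute the array among the workers as an RDD
rdd = sc.parallelize(data)
print('Number of elements: ', rdd.count())
print('Sum and mean: ', rdd.sum(), rdd.mean())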

Number of elements:  1000000
Sum and mean:  500178.597714 0.500178597714
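To connect this back to the map/reduce ideas from the previous session, here is roughly what a distributed count, sum and mean amount to, written in plain Python without Spark. This is a sketch of the idea only, not of how Spark implements these methods: each partition produces a small partial result, and the partial results are then combined.

```python
import random
from functools import reduce

random.seed(0)
data = [random.random() for _ in range(100000)]

# split the data into 4 partitions, one per (imaginary) worker
num_partitions = 4
partitions = [data[i::num_partitions] for i in range(num_partitions)]

# "map" step: every worker summarizes its own partition as (count, sum)
partials = [(len(part), sum(part)) for part in partitions]

# "reduce" step: combine the small partial results pairwise
count, total = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
mean = total / count

print('Number of elements: {}'.format(count))
print('Sum and mean: {} {}'.format(total, mean))
```

Only the tiny `(count, sum)` pairs need to travel back to the driver, never the full data.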


Now if you look at your console, you will see *a lot* of output -- Spark is reporting all the stages of execution and can become rather verbose. Initially it's useful to inspect this output just to see what's going on and to see when issues arise. Later on we'll see how to quiet it down. 

In addition, each Spark application runs its own dedicated Web UI, accessible by default at `driver:4040`. In this case, that is http://localhost:4040. 