# Data-Intensive Applications on HPC Using Hadoop, Spark and RADICAL-Cybertools

*Shantenu Jha and Andre Luckow*

The tutorial material is available as an IPython notebook on GitHub:

* https://github.com/radical-cybertools/supercomputing2015-tutorial

## Requirements and Setup

Python with the following libraries:

* IPython
* NumPy
* pandas
* scikit-learn
* Matplotlib, Seaborn

We recommend using [Anaconda](http://continuum.io/downloads).


## 1. Hadoop and Spark Introduction





## 2. Pilot-Abstraction for distributed HPC and Apache Hadoop Big Data Stack (ABDS)

The Pilot-Abstraction has been used successfully in HPC to support a diverse set of task-based workloads on distributed resources. A Pilot-Job is a placeholder job that is submitted to the resource management system and serves as a container for a dynamically determined set of compute tasks. The Pilot-Data abstraction extends the Pilot-Abstraction to manage data in conjunction with compute tasks.

The Pilot-Abstraction supports heterogeneous resources, in particular different kinds of cloud, HPC, and Hadoop resources.

![Pilot Abstraction](./figures/interoperable_pilot_job.png)

The following example demonstrates how the Pilot-Abstraction is used to manage a set of compute tasks.
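
As a library-independent illustration of the concept, the sketch below models a pilot as a fixed pool of workers that is acquired once and then receives tasks at runtime. It uses only the Python standard library. `ToyPilot`, `submit_task`, and `cancel` are hypothetical names chosen for this sketch and are not part of the RADICAL-Pilot API; a real pilot would be submitted to a resource manager rather than started as local threads.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyPilot:
    """Hypothetical stand-in for a Pilot-Job: it holds a fixed number of
    'cores' (worker threads) and runs whatever tasks are submitted later."""

    def __init__(self, cores):
        # Acquiring the pool once mimics the single placeholder job
        # that is submitted to the resource management system.
        self.cores = cores
        self._pool = ThreadPoolExecutor(max_workers=cores)

    def submit_task(self, func, *args):
        # Tasks are bound to the already acquired resources at runtime.
        return self._pool.submit(func, *args)

    def cancel(self):
        # Wait for running tasks, then release the held resources.
        self._pool.shutdown(wait=True)


def square(x):
    return x * x


pilot = ToyPilot(cores=4)
# The set of tasks is determined after the pilot has started.
futures = [pilot.submit_task(square, i) for i in range(8)]
results = [f.result() for f in futures]
pilot.cancel()
print(results)
```

The point of the design is the decoupling: resources are requested once, while the workload that runs on them can be decided later and can change over time.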




## 3. Advanced Analytics

The following pair plots show a scatter plot for each pair of the four features. The color of each point indicates the species.

## 3.3 KMeans (Spark)

https://spark.apache.org/docs/latest/mllib-clustering.html#k-means

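    # Assumes a SparkContext `sc` is already available (e.g., in the PySpark shell)
    from pyspark.sql import SQLContext
    sqlCtx = SQLContext(sc)
    # `data` is assumed to be a pandas DataFrame containing the four feature columns
    data_spark = sqlCtx.createDataFrame(data)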

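    # show() prints the rows and returns None, so keep the selection in its own variable
    data_spark_without_class = data_spark.select('SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth')
    data_spark_without_class.show()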

    SepalLength SepalWidth PetalLength PetalWidth
    5.1         3.5        1.4         0.2
    4.9         3.0        1.4         0.2
    4.7         3.2        1.3         0.2
    4.6         3.1        1.5         0.2
    5.0         3.6        1.4         0.2
    5.4         3.9        1.7         0.4
    4.6         3.4        1.4         0.3
    5.0         3.4        1.5         0.2
    4.4         2.9        1.4         0.2
    4.9         3.1        1.5         0.1
    5.4         3.7        1.5         0.2
    4.8         3.4        1.6         0.2
    4.8         3.0        1.4         0.1
    4.3         3.0        1.1         0.1
    5.8         4.0        1.2         0.2
    5.7         4.4        1.5         0.4
    5.4         3.9        1.3         0.4
    5.1         3.5        1.4         0.3
    5.7         3.8        1.7         0.3
    5.1         3.8        1.5         0.3


### Convert DataFrame to Tuple for MLlib

    # Map over the underlying RDD; DataFrame.map() is not available in Spark 2.0 and later
    data_spark_tuple = data_spark.rdd.map(lambda a: (a[0], a[1], a[2], a[3]))

### Run MLlib KMeans

    # Build the model (cluster the data into k=3 groups)
    # Note: `runs` is a Spark 1.x parameter; drop it if your Spark version rejects it
    from pyspark.mllib.clustering import KMeans, KMeansModel
    clusters = KMeans.train(data_spark_tuple, 3, maxIterations=10,
                            runs=10, initializationMode="random")
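
To make the clustering step more concrete, here is a minimal single-machine sketch of Lloyd's algorithm, the basic iteration behind k-means. It is not MLlib's implementation, which is distributed and offers other initialization modes; the function names `closest` and `lloyd`, the toy points, and the fixed starting centers are all assumptions made for this illustration.

```python
def closest(point, centers):
    # Index of the nearest center by squared Euclidean distance.
    return min(
        range(len(centers)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])),
    )


def lloyd(points, centers, max_iterations=10):
    for _ in range(max_iterations):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers


# Toy data: two well-separated groups of two points each.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]
final_centers = lloyd(points, centers=[(0.0, 0.0), (10.0, 0.0)])
print(final_centers)
```

Fixed starting centers keep the sketch deterministic. With random initialization, as in the MLlib call above, results can differ between runs.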

### Evaluate Model

    # Evaluate the clustering: each point contributes its Euclidean distance
    # to the nearest cluster center. Because of the sqrt, the reported value is
    # a sum of distances rather than a sum of squared distances.
    from math import sqrt

    def error(point):
        center = clusters.centers[clusters.predict(point)]
        return sqrt(sum([x**2 for x in (point - center)]))

    WSSSE = data_spark_tuple.map(lambda point: error(point)).reduce(lambda x, y: x + y)
    print("Within Set Sum of Squared Error = " + str(WSSSE))

    Within Set Sum of Squared Error = 97.3259242343
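
The `error` function above takes a square root per point, so the reported number is a sum of Euclidean distances. The Within Set Sum of Squared Errors in the strict sense omits the square root. The following Spark-free sketch contrasts the two quantities on a toy data set; the points, the centers, and the helper names `squared_distance` and `nearest_center` are assumptions made for illustration only.

```python
from math import sqrt


def squared_distance(p, c):
    # Squared Euclidean distance between a point and a center.
    return sum((a - b) ** 2 for a, b in zip(p, c))


def nearest_center(point, centers):
    # The center with the smallest squared distance to the point.
    return min(centers, key=lambda c: squared_distance(point, c))


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
centers = [(0.0, 1.0), (10.0, 2.0)]

# Strict WSSSE: sum of squared distances to the nearest center.
wssse = sum(squared_distance(p, nearest_center(p, centers)) for p in points)

# Sum of plain distances: what a per-point sqrt produces.
sum_of_distances = sum(
    sqrt(squared_distance(p, nearest_center(p, centers))) for p in points
)

print(wssse, sum_of_distances)
```

The two values differ in general, so it helps to state which one is being reported when comparing clustering runs.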


## 5. Future Work: Midas

![Midas](figures/midas.png)
