# TensorFlow + Spark

This notebook shows how to use Spark with TensorFlow applications. It has two parts. The first covers data parallelization, for data that is too large for a single machine. The second covers task parallelization of TensorFlow across a cluster, which applies only when the data fits on a single machine.

In [1]:
import tensorflow as tf

In [2]:
from collections import namedtuple

In [3]:
DataPoint = namedtuple('data_point', ['label','features'])

In [4]:
points = [DataPoint(1.0, [1.0, 2.0, 3.0]),
          DataPoint(0.0, [1.0, 1.0, 1.0]),
          DataPoint(1.0, [2.0, 2.0, 2.0]),
          DataPoint(1.0, [3.0, 3.0, 3.0])]

In [5]:
# `sc` is the SparkContext, assumed to be provided by the notebook environment.
df = sc.parallelize(points).toDF()

The <code>next_batch</code> method returns a random portion of the data each time it is called. This is useful for batch training steps in neural networks.

In [6]:
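class BatchHandler(object):
    def __init__(self, df):
        self._df = df

    def next_batch(self, ratio=0.5, seed=None):
        # Split the wrapped DataFrame (not the global `df`) and keep the
        # first part, which holds roughly `ratio` of the rows.
        weights = [ratio, 1 - ratio]
        if seed is None:
            return self._df.randomSplit(weights)[0]
        return self._df.randomSplit(weights, seed)[0]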
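
The batch sizes printed later in this notebook vary between calls (2, 4 and 0 rows), because a random split samples rows rather than cutting the data at a fixed position. The sketch below imitates that behaviour in plain Python with a hypothetical helper, `bernoulli_split`, which keeps each row independently with probability `ratio`. It is an illustration of the idea under that assumption, not Spark's actual implementation.

```python
import random


def bernoulli_split(rows, ratio=0.5, seed=None):
    """Keep each row independently with probability `ratio` (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < ratio]


rows = list(range(4))

# The same seed always gives the same batch.
assert bernoulli_split(rows, 0.5, seed=7) == bernoulli_split(rows, 0.5, seed=7)

# Different seeds give batches of different sizes, anywhere from 0 to 4 rows.
sizes = [len(bernoulli_split(rows, 0.5, seed=s)) for s in range(200)]
print(sorted(set(sizes)))
```

With only four rows, empty and full batches are both likely, so a training loop built on this kind of sampling should tolerate an empty batch.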

In [7]:
handler = BatchHandler(df)

## SampleModel
A simple TensorFlow model that computes a dot product. Constructing the model only builds the graph of operations; the graph is executed on the CPU or GPU when it is run in a session.
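
To make the arithmetic concrete, here is a plain-Python sketch of what the graph computes. The name `make_dot_model` is hypothetical: the captured `parameter` stands in for the constant, and the argument of the returned function stands in for the value fed to the placeholder.

```python
def make_dot_model(parameter):
    """Return a function computing the dot product of `parameter` and its input."""
    parameter = [float(p) for p in parameter]  # plays the role of the constant

    def run(features):
        # Mirrors the fixed placeholder shape: lengths must match.
        if len(features) != len(parameter):
            raise ValueError("expected %d features, got %d"
                             % (len(parameter), len(features)))
        # Element-wise multiply, then sum.
        return sum(p * f for p, f in zip(parameter, features))

    return run


model = make_dot_model([2.0, 2.0, 2.0])
print(model([1.0, 2.0, 3.0]))  # 12.0
```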
## Train
The <code>train</code> function is a sample that takes a <code>BatchHandler</code> and a TensorFlow model. The model could be a neural network, but here it is the dot-product <code>SampleModel</code>. Performance caveat: <code>collect()</code> brings every row of the batch to the driver, so each batch must fit in driver memory.
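
One possible way around the <code>collect()</code> caveat is to score rows where they live, one partition at a time; PySpark's `RDD.mapPartitions` supports this pattern. The sketch below shows only the shape of such a per-partition function. It uses plain lists in place of Spark partitions and a plain dot product in place of the TensorFlow session, and the names `score_partition` and `partitions` are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def score_partition(rows, parameter):
    """Yield one dot product per (label, features) row of a single partition."""
    for _label, features in rows:
        yield dot(parameter, features)


# Two stand-in partitions holding the four data points from this notebook.
partitions = [
    [(1.0, [1.0, 2.0, 3.0]), (0.0, [1.0, 1.0, 1.0])],
    [(1.0, [2.0, 2.0, 2.0]), (1.0, [3.0, 3.0, 3.0])],
]

# Each partition is consumed lazily; only the scores are gathered.
scores = [s for part in partitions
          for s in score_partition(iter(part), [2.0, 2.0, 2.0])]
print(scores)  # [12.0, 6.0, 12.0, 18.0]
```

Because the function is a generator, no partition has to be materialised as a whole list before scoring starts.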
## Task
Like <code>train</code>, the <code>task</code> function computes a simple dot product with TensorFlow, but it takes no DataFrame and assumes that the whole dataset fits in the memory of a single machine. Several <code>task</code> calls can therefore run in parallel, one per model.

In [8]:
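class SampleModel(object):

    def __init__(self, parameter):
        self.parameter = parameter
        # Fixed model parameter and a placeholder for one 3-element input.
        self.x = tf.constant(value=parameter)
        self.y = tf.placeholder(dtype=tf.float32, shape=[3,])

        # Dot product: element-wise multiply, then sum.
        # Note: tf.mul was renamed tf.multiply in TensorFlow 1.0.
        self.dot = tf.reduce_sum(tf.mul(self.x, self.y))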

def train(handler, sample_model):
    # collect() pulls the whole batch to the driver.
    batch = handler.next_batch().collect()

    print 'BATCH:',batch,'SAMPLE_MODEL',sample_model.parameter

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    # A single session, closed automatically when the block exits.
    with tf.Session(config=config) as sess:
        result = [sess.run(sample_model.dot, feed_dict={sample_model.y: map(float, i['features'])}) for i in batch]

    return result

def task(dataset, sample_model):
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    # A single session, closed automatically when the block exits.
    with tf.Session(config=config) as sess:
        result = [sess.run(sample_model.dot, feed_dict={sample_model.y: map(float, i)}) for i in dataset]

    return result

In [9]:
models = [SampleModel([1.0, 1.0, 1.0]), SampleModel([2.0, 2.0, 2.0]), SampleModel([3.0, 3.0, 3.0])]

### Run the dot product on DataFrame batches

In [10]:
results = map(lambda x: train(handler,x), models)

BATCH: [Row(label=1.0, features=[1.0, 2.0, 3.0]), Row(label=1.0, features=[3.0, 3.0, 3.0])] SAMPLE_MODEL [1.0, 1.0, 1.0]
BATCH: [Row(label=1.0, features=[1.0, 2.0, 3.0]), Row(label=0.0, features=[1.0, 1.0, 1.0]), Row(label=1.0, features=[2.0, 2.0, 2.0]), Row(label=1.0, features=[3.0, 3.0, 3.0])] SAMPLE_MODEL [2.0, 2.0, 2.0]
BATCH: [] SAMPLE_MODEL [3.0, 3.0, 3.0]


In [11]:
print results

[[6.0, 9.0], [12.0, 6.0, 12.0, 18.0], []]


In [12]:
dataset = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 4.0]]

In [13]:
parameters = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 4.0]]

### Run the dot products of several models in parallel

In [14]:
task_results = sc.parallelize(parameters).map(lambda x: task(dataset, SampleModel(x)))

In [15]:
task_results

PythonRDD[17] at RDD at PythonRDD.scala:48

In [16]:
task_results.collect()

[[3.0, 6.0, 10.0], [6.0, 12.0, 20.0], [10.0, 20.0, 34.0]]
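
As a sanity check, the collected result can be reproduced without Spark or TensorFlow. The sketch below recomputes every parameter-by-row dot product in plain Python, using the same `dataset` and `parameters` values as above; the helper `dot` is introduced here only for illustration.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


dataset = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 4.0]]
parameters = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 4.0]]

# One inner list per model, one entry per dataset row.
expected = [[dot(p, row) for row in dataset] for p in parameters]
print(expected)
# [[3.0, 6.0, 10.0], [6.0, 12.0, 20.0], [10.0, 20.0, 34.0]]
```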