### grp

# Spark: The Definitive Guide

## PART 6: Advanced Analytics and Machine Learning 

## dataPaths

In [1]:
imagesDL = '/Users/grp/sparkTheDefinitiveGuide/data/deep-learning-images/'

## _Chapter #31 - Deep Learning_

-  learns from high-dimensional, unstructured data such as images, audio, and text

### Deep Neural Networks:
-  graph of nodes with weights and activation functions
-  nodes are organized into _layers_
-  layers are connected to previous layers in the network
-  layers are stacked together with many nodes to recognize complex signals in the input
-  networks are trained to associate certain inputs with certain outputs by tuning weights and values of each node
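
As a minimal sketch of the points above, the following pure-Python forward pass sends an input through one hidden layer and one output layer. The weights, biases, layer sizes, and the choice of sigmoid and softmax activations are illustrative assumptions, not values from the book.

```python
import math

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def dense_layer(inputs, weights, biases):
    """Weighted sum per node: one row of weights and one bias per node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Pass the input through a sigmoid hidden layer, then a softmax output layer."""
    hidden = [sigmoid(z) for z in dense_layer(inputs, hidden_w, hidden_b)]
    return softmax(dense_layer(hidden, output_w, output_b))

# Hypothetical weights for a network with 2 inputs, 3 hidden nodes, and 2 output classes.
hidden_w = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[1.0, -1.0, 0.5], [-0.5, 0.7, 0.2]]
output_b = [0.0, 0.0]

probabilities = forward([1.0, 2.0], hidden_w, hidden_b, output_w, output_b)
print(probabilities)
```

Training would adjust the four weight and bias lists until the output probabilities match the desired labels; this sketch only shows the forward direction.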

### Use Cases:
-  computer vision
-  speech processing
-  NLP
-  facial recognition
-  image (brand) detection
-  sound patterns

### Spark Deep Learning Methods:
-  Inference:
    -  take a pretrained model and apply it to a large dataset via Spark's parallel processing
    -  typically call the DL library (ex: TensorFlow) inside a _map_ function so that inference runs in a distributed fashion
-  Featurization and Transfer Learning:
    -  use an existing model as a _featurizer_ via the "transfer learning" method (leverages a pre-trained model and then modifies it to better fit a new use case)
-  Model Training:
    -  train from scratch via Spark's parallel processing
    -  perform ETL/FE via Spark's parallel processing, then export the prepared data and train on a single machine using libraries like TensorFlow or Keras
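
The inference method can be sketched without a cluster. In the example below, plain Python lists stand in for the partitions of a distributed dataset, and `load_model` with its weights is a hypothetical stand-in for loading a real pretrained model. The point is the shape of the function: load the model once per partition, then score every record in that partition.

```python
def load_model():
    """Hypothetical stand-in for loading a pretrained model (for example, from disk)."""
    weights = [0.4, 0.6]

    def model(features):
        return sum(w * x for w, x in zip(weights, features))

    return model

def score_partition(records):
    """Load the model once per partition, then score every record in it."""
    model = load_model()
    for record in records:
        yield model(record)

# Two plain lists stand in for the partitions of a distributed dataset.
partitions = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0]],
]

# Each partition is scored independently of the others, which is what allows parallel execution.
scores = [list(score_partition(partition)) for partition in partitions]
print(scores)
```

In PySpark, a generator with this shape could be passed to `RDD.mapPartitions`, assuming the model and its library are available on every worker.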

### Deep Learning Libraries:
-  MLlib Neural Network Support:
    -  MLlib's **ml.classification.MultilayerPerceptronClassifier** deep neural network
    -  uses the sigmoid activation function in the hidden layers and the softmax activation function in the output layer
-  TensorFrames:
    -  https://github.com/databricks/tensorframes
    -  helps pass data between Spark DFs and TensorFlow
    -  a better alternative to calling the DL library inside a Python _map_ function
-  BigDL:
    -  https://github.com/intel-analytics/BigDL
    -  distributed DL framework optimized to run on CPUs rather than the GPUs typically used for DL
-  TensorFlowOnSpark:
    -  https://github.com/yahoo/TensorFlowOnSpark
    -  ability to train TensorFlow models in parallel on Spark clusters
-  DeepLearning4J:
    -  https://deeplearning4j.org/docs/latest/deeplearning4j-spark-training
    -  Java/Scala DL library for both single node and cluster distributed training
-  Deep Learning Pipelines:
    -  https://github.com/databricks/spark-deep-learning
    -  https://spark-packages.org/package/databricks/spark-deep-learning
    -  Databricks package integrating DL functionality into Spark's ML Pipelines API via distributed computing
    -  currently only available via Python language for integrating with TensorFlow and Keras libraries
    -  Install Dependencies:
        -  **if using cluster make sure everything is installed on driver machine and all worker machines**
        -  TensorFrames => https://github.com/databricks/tensorframes
        -  TensorFlow => https://www.tensorflow.org
        -  Keras => https://keras.io
        -  h5py => http://www.h5py.org

### _Chapter #31 Exercises (DL)_

### _Terminal Packages Example_

In [2]:
'''
pyspark --packages databricks:tensorframes:0.5.0-s_2.11,databricks:spark-deep-learning:1.2.0-spark2.3-s_2.11
'''


### _Spark DL Import Example_

In [3]:
from pyspark.ml.image import ImageSchema

In [4]:
image_df = ImageSchema.readImages(imagesDL)
image_df.printSchema()

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = false)
 |    |-- width: integer (nullable = false)
 |    |-- nChannels: integer (nullable = false)
 |    |-- mode: integer (nullable = false)
 |    |-- data: binary (nullable = false)



### _Transfer Learning Example_

In [5]:
from pyspark.ml.image import ImageSchema
from pyspark.sql.functions import lit

In [6]:
# reading images
tulips_df = ImageSchema.readImages(imagesDL + "/tulipsSample").withColumn("label", lit(1))
daisy_df = ImageSchema.readImages(imagesDL + "/daisySample").withColumn("label", lit(0))

# train/test splits
tulips_train, tulips_test = tulips_df.randomSplit([0.6, 0.4])  
daisy_train, daisy_test = daisy_df.randomSplit([0.6, 0.4])
# union() replaces unionAll(), which is deprecated as of Spark 2.0
train_df = tulips_train.union(daisy_train)
test_df = tulips_test.union(daisy_test)
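
`randomSplit` assigns each row at random, so the 60/40 sizes are approximate, and it accepts an optional `seed` argument for reproducible splits. Splitting each class separately and then combining the pieces, as in the cell above, puts both labels in both sets. A rough pure-Python sketch of that idea follows; `random_split`, the file names, and the seeds are hypothetical and only imitate the behaviour.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

# Hypothetical file names; label 1 = tulip and label 0 = daisy, as in the cell above.
tulips = [("tulip_%d.jpg" % i, 1) for i in range(50)]
daisies = [("daisy_%d.jpg" % i, 0) for i in range(50)]

# Split each class on its own, then combine the pieces.
tulips_train, tulips_test = random_split(tulips, 0.6, seed=42)
daisies_train, daisies_test = random_split(daisies, 0.6, seed=43)

train = tulips_train + daisies_train
test = tulips_test + daisies_test
print(len(train), len(test))
```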

In [7]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer 

Using TensorFlow backend.


In [8]:
# using transformer (DeepImageFeaturizer):
    # leverages pre-trained model called Inception:
        # neural network used to identify patterns in images
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")

# logistic regression learning algorithm is being used to train model
lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label")
p = Pipeline(stages=[featurizer, lr])

p_model = p.fit(train_df)
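
The pipeline above has two stages: a fixed, pre-trained featurizer and a small classifier trained on its output. The sketch below mirrors that structure in plain Python. `frozen_featurizer` is a hypothetical stand-in for InceptionV3, the "images" are made-up pixel lists, and the hand-written gradient descent is an assumption for illustration, not how MLlib's `LogisticRegression` is implemented.

```python
import math

def frozen_featurizer(pixels):
    """Hypothetical stand-in for a pretrained network: fixed, never updated during training."""
    return [sum(pixels) / len(pixels), max(pixels)]

def predict_proba(weights, bias, features):
    """Logistic regression probability of class 1."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(rows, labels, epochs=1000, learning_rate=0.5):
    """Fit only the small classifier on top of the fixed features (batch gradient descent)."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for features, label in zip(rows, labels):
            error = predict_proba(weights, bias, features) - label
            grad_w = [g + error * x for g, x in zip(grad_w, features)]
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias

# Made-up "images" as short pixel lists: class 1 is bright, class 0 is dark.
images = [[0.9, 0.8, 1.0], [0.7, 0.9, 0.8], [0.1, 0.2, 0.0], [0.2, 0.1, 0.3]]
labels = [1, 1, 0, 0]

features = [frozen_featurizer(image) for image in images]    # stage 1: featurize
weights, bias = train_logistic_regression(features, labels)  # stage 2: train the classifier
predictions = [round(predict_proba(weights, bias, f)) for f in features]
print(predictions)
```

Only `weights` and `bias` are learned here; the featurizer is untouched, which is why this approach can work with a small labeled dataset.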

In [9]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [10]:
# classification evaluation metrics
tested_df = p_model.transform(test_df)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(tested_df.select("prediction", "label"))))

Test set accuracy = 0.7368421052631579
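
With `metricName="accuracy"`, the evaluator reports the fraction of rows whose `prediction` equals `label`. A small sketch of that calculation, using made-up predictions and labels rather than the notebook's actual test rows:

```python
def accuracy(predictions, labels):
    """Fraction of rows whose predicted class equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Made-up values for illustration: 4 of the 5 predictions match their labels.
predictions = [1.0, 0.0, 1.0, 1.0, 0.0]
labels = [1, 0, 0, 1, 0]
print("Test set accuracy = " + str(accuracy(predictions, labels)))
```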


In [11]:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import expr, udf

In [12]:
# find the test images where the model's predictions were most wrong
# a simple UDF to extract the predicted probability of class 1 as a double
def _p1(v):
    return float(v.array[1])
p1 = udf(_p1, DoubleType())

df = tested_df.withColumn("p_1", p1(tested_df.probability))
wrong_df = df.orderBy(expr("abs(p_1 - label)"), ascending=False)
for row in wrong_df.select("image.origin", "p_1", "label").limit(10).collect():
    print(row)

Row(origin='file:/Users/grp/sparkTheDefinitiveGuide/data/deep-learning-images/daisySample/4571353297_5634177744_n.jpg', p_1=0.9436311454723025, label=0)
Row(origin='file:/Users/grp/sparkTheDefinitiveGuide/data/deep-learning-images/daisySample/4571993204_5b3efe0e78.jpg', p_1=0.9191149810614544, label=0)
Row(origin='file:/Users/grp/sparkTheDefinitiveGuide/data/deep-learning-images/tulipsSample/367020749_3c9a652d75.jpg', p_1=0.10012947989947016, label=1)
Row(origin='file:/Users/grp/sparkTheDefinitiveGuide/data/deep-learning-images/daisySample/130684941_d1abfa3be6_m.jpg', p_1=0.6930696737582386, label=0)
Row(origin='file:/Users/grp/sparkTheDefinitiveGuide/data/deep-learning-images/daisySample/11642632_1e7627a2cc.jpg', p_1=0.5809598871642018, label=0)
Row(origin='file:/Users/grp/sparkTheDefinitiveGuide/data/deep-learning-images/tulipsSample/11642632_1e7627a2cc.jpg', p_1=0.5809598871642018, label=1)
Row(origin='file:/Users/grp/sparkTheDefinitiveGuide/data/deep-learning-images/daisySample/206
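
The ordering above ranks rows by `abs(p_1 - label)`, the distance between the predicted probability of class 1 and the true label. The same ranking can be reproduced in plain Python on three of the rows printed above (file names shortened to the base name); the `error` helper is introduced here only for illustration.

```python
# (file name, p_1, label) triples copied from the output above, deliberately out of order.
rows = [
    ("130684941_d1abfa3be6_m.jpg", 0.6930696737582386, 0),
    ("367020749_3c9a652d75.jpg", 0.10012947989947016, 1),
    ("4571353297_5634177744_n.jpg", 0.9436311454723025, 0),
]

def error(row):
    """Distance between the predicted probability of class 1 and the true label."""
    _, p_1, label = row
    return abs(p_1 - label)

# Largest error first, mirroring orderBy(..., ascending=False).
ranked = sorted(rows, key=error, reverse=True)
for row in ranked:
    print(row[0], round(error(row), 4))
```

A tulip (label 1) with a low `p_1` ranks as high as a daisy (label 0) with a high `p_1`, because both are confident mistakes.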

### _Apply Deep Image Predictor [Transformer] Example_

In [13]:
from pyspark.ml.image import ImageSchema
from sparkdl import DeepImagePredictor

In [14]:
'''
image_df = ImageSchema.readImages(imagesDL)

predictor = DeepImagePredictor\
(inputCol="image", outputCol="predicted_labels", modelName="InceptionV3", decodePredictions=True, topK=10)
predictions_df = predictor.transform(image_df)

predictions_df.select("predicted_labels", "image.origin").show()
'''
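
The `topK=10` setting above asks for the ten most likely labels per image. Conceptually, that means keeping the `k` classes with the highest predicted probability. The sketch below shows this selection on made-up class probabilities; it illustrates the idea only and is not the package's implementation.

```python
import heapq

def top_k(class_probabilities, k):
    """Return the k (label, probability) pairs with the highest probability."""
    return heapq.nlargest(k, class_probabilities.items(), key=lambda item: item[1])

# Made-up class probabilities for a single image.
probabilities = {"daisy": 0.62, "tulip": 0.25, "rose": 0.08, "sunflower": 0.05}
print(top_k(probabilities, 2))
```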


In [15]:
'''
df = p_model.transform(image_df)
'''


### _Apply Custom Keras Model Example_

In [16]:
from keras.applications import InceptionV3
from sparkdl.udf.keras_image_model import registerKerasImageUDF

In [17]:
'''
registerKerasImageUDF("inceptionV3_udf", InceptionV3(weights="imagenet"))
'''

