
### Multilayer Perceptron

A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLPs are trained with a supervised learning technique called backpropagation. The multiple layers and nonlinear activations distinguish an MLP from a linear perceptron and allow it to separate data that is not linearly separable.

Spark's multilayer perceptron classifier (MLPC) follows this design. Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by taking a linear combination of the inputs with the node’s weights w and bias b and then applying an activation function. For an MLPC with K+1 layers, this can be written in matrix form as follows:
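$$y(\mathbf{x}) = f_K\left(\ldots f_2\left(\mathbf{w}_2^T f_1\left(\mathbf{w}_1^T \mathbf{x} + b_1\right) + b_2\right) \ldots + b_K\right)$$
Nodes in intermediate layers use the sigmoid (logistic) function:
$$f(z_i) = \frac{1}{1 + e^{-z_i}}$$
Nodes in the output layer use the softmax function:
$$f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$$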
The number of nodes N in the output layer corresponds to the number of classes. For example, binary classification uses 2 output nodes.
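
To make the formulas above concrete, here is a minimal pure-Python sketch of a forward pass with sigmoid hidden layers and a softmax output layer. The helper names (`affine`, `forward`) and the toy weights are hypothetical and chosen only for illustration; this is not how Spark implements the computation internally.

```python
import math

def sigmoid(z):
    # Element-wise logistic function: 1 / (1 + e^{-z_i})
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(weights, bias, x):
    # weights holds one row per output node, so each row plays the role of w^T
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def forward(x, layers):
    # layers is a list of (weights, bias) pairs; the last pair is the output layer
    a = x
    for i, (w, b) in enumerate(layers):
        z = affine(w, b, a)
        a = softmax(z) if i == len(layers) - 1 else sigmoid(z)
    return a

# Hypothetical toy network: 3 inputs -> 2 hidden nodes -> 2 output classes
toy_layers = [
    ([[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]], [0.0, 0.1]),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
]
probs = forward([1.0, 2.0, 3.0], toy_layers)
print(probs)  # two class probabilities that sum to 1
```

Because the output layer is a softmax, the returned values can be read as class probabilities, and the predicted class is the index of the largest one.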




<h1>Neural Networks</h1>

<p>A multilayer perceptron in Spark requires a dataframe with numerical features and a label column as input, and outputs a predictions dataframe.</p> <p>Our workflow is as follows:<ul>
<li> Load dataset</li>
<li>Select features we are going to use</li>
<li>Convert features</li>
<li>Split dataset to train and test</li>
<li>Train and fit our model</li>
<li>Evaluate our model on test set</li>
<li>Conclusion</li>
</ul>   
</p>

<p>Our data is the breast cancer dataset, which consists of feature columns and a diagnosis (label) column with two classes indicating whether a sample is malignant (M) or benign (B). The notebook runs on Spark in local mode with PySpark.

</p>



In [11]:
# imports (the unused LabeledPoint and Vectors imports have been dropped)
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler


<h1>Load and Parse the Data from a CSV</h1>

In [4]:
# Load the full dataset; it is split into train and test sets later
data = spark.read.format("csv").load("../data/breast_cancer.csv", header=True, inferSchema=True).cache()

data.show(5)

+--------+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+----+
|      id|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|_c32|
+--------+---------+-----------+------

In [15]:
data.printSchema()

root
 |-- id: integer (nullable = true)
 |-- diagnosis: string (nullable = true)
 |-- radius_mean: double (nullable = true)
 |-- texture_mean: double (nullable = true)
 |-- perimeter_mean: double (nullable = true)
 |-- area_mean: double (nullable = true)
 |-- smoothness_mean: double (nullable = true)
 |-- compactness_mean: double (nullable = true)
 |-- concavity_mean: double (nullable = true)
 |-- concave points_mean: double (nullable = true)
 |-- symmetry_mean: double (nullable = true)
 |-- fractal_dimension_mean: double (nullable = true)
 |-- radius_se: double (nullable = true)
 |-- texture_se: double (nullable = true)
 |-- perimeter_se: double (nullable = true)
 |-- area_se: double (nullable = true)
 |-- smoothness_se: double (nullable = true)
 |-- compactness_se: double (nullable = true)
 |-- concavity_se: double (nullable = true)
 |-- concave points_se: double (nullable = true)
 |-- symmetry_se: double (nullable = true)
 |-- fractal_dimension_se: double (nullable = true)
 |-- radi

In [8]:
# Our features are already numeric, so we only need to assemble them into a single vector column.
# Note: this list has 28 entries but only 25 distinct columns, because 'radius_mean',
# 'perimeter_mean' and 'concave points_worst' each appear twice.
assembler = VectorAssembler(
    inputCols=['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
               'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean',
               'fractal_dimension_mean', 'radius_se', 'texture_se', 'radius_mean', 'perimeter_mean',
               'area_se', 'smoothness_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst',
               'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_se',
               'compactness_worst', 'concave points_worst', 'concave points_worst',
               'symmetry_worst', 'fractal_dimension_worst'],
    outputCol="features")
output = assembler.transform(data)
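
With a column list this long, repeated names are easy to miss. The following sketch, which uses only the Python standard library, checks the list passed to `VectorAssembler` for repeats. The variable names `input_cols` and `duplicates` are illustrative and not part of the original notebook.

```python
from collections import Counter

# The same column names that are passed to VectorAssembler in the notebook
input_cols = ['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
              'smoothness_mean', 'compactness_mean', 'concavity_mean',
              'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
              'radius_se', 'texture_se', 'radius_mean', 'perimeter_mean',
              'area_se', 'smoothness_se', 'fractal_dimension_se', 'radius_worst',
              'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst',
              'compactness_se', 'compactness_worst', 'concave points_worst',
              'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

counts = Counter(input_cols)
# Names that occur more than once in the list
duplicates = sorted(name for name, n in counts.items() if n > 1)

print(len(input_cols), "entries,", len(counts), "distinct")
print("repeated:", duplicates)
```

Running a check like this before assembling makes it clear how many distinct measurements actually reach the model, which matters when choosing the input layer size.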

In [9]:
output.show(5)

+--------+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+----+--------------------+
|      id|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|_c32|            featur

In [10]:
#select the label and feature columns
ftDf = output.select("diagnosis", "features")
ftDf.show(5)

+---------+--------------------+
|diagnosis|            features|
+---------+--------------------+
|        M|[17.99,10.38,122....|
|        M|[20.57,17.77,132....|
|        M|[19.69,21.25,130....|
|        M|[11.42,20.38,77.5...|
|        M|[20.29,14.34,135....|
+---------+--------------------+
only showing top 5 rows



In [12]:
# Convert the string label column to numeric indices; here malignant (M) maps to 1.0 and benign (B) to 0.0
indexer = StringIndexer(inputCol="diagnosis", outputCol="label")
new_ftDf = indexer.fit(ftDf).transform(ftDf)
new_ftlabelDf = new_ftDf.select("label", "features")
new_ftlabelDf.show(5)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|[17.99,10.38,122....|
|  1.0|[20.57,17.77,132....|
|  1.0|[19.69,21.25,130....|
|  1.0|[11.42,20.38,77.5...|
|  1.0|[20.29,14.34,135....|
+-----+--------------------+
only showing top 5 rows
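
The mapping of M to 1.0 follows from how `StringIndexer` orders labels: by default it sorts them by descending frequency, so the most common label receives index 0.0. The sketch below mimics that ordering in plain Python on a small made-up sample. The function name `frequency_index` is hypothetical, and the alphabetical tie-breaking is an assumption based on the behavior documented for recent Spark versions.

```python
from collections import Counter

def frequency_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically (assumption)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

# Made-up sample in which benign (B) is more common than malignant (M)
sample = ["M", "B", "B", "M", "B"]
mapping = frequency_index(sample)
print(mapping)
```

If a specific class must map to 1.0 regardless of frequency, the index assignment should be checked explicitly rather than assumed.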



In [14]:
# Split the data into train and test and set seed for reproducing results
splits = new_ftlabelDf.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]
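
Note that `randomSplit` assigns rows to the splits at random, so the 60/40 proportions are approximate rather than exact. The following plain-Python sketch illustrates the idea of a seeded per-row random assignment. It is a conceptual illustration with hypothetical names (`bernoulli_split`, `train_rows`, `test_rows`), not Spark's actual implementation.

```python
import random

def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_part, test_part = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train_part.append(row)
        else:
            test_part.append(row)
    return train_part, test_part

rows = list(range(1000))
train_rows, test_rows = bernoulli_split(rows, 0.6, 1234)
# Sizes are close to 600 and 400 but usually not exactly equal to them
print(len(train_rows), len(test_rows))
```

Using the same seed yields the same split on every run, which is why the notebook fixes it.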

### How to build a neural network in Spark
In Spark, the MultilayerPerceptronClassifier class expects a dataframe of features and labels. The input layer must have as many nodes as there are entries in the feature vector, and the output layer must have as many nodes as there are class labels. The number and size of the hidden layers in between can be changed to improve model performance.

In [10]:

# specify layers for the neural network:
# input layer of size 28 (features), two intermediate of size 5 and 4
# and output of size 2 (classes)
layers = [28, 5, 4, 2]
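
As a rough guide to model size, the number of trainable parameters follows directly from the layer sizes: each pair of consecutive layers contributes one weight per connection plus one bias per receiving node. The helper below is a hypothetical sketch of that arithmetic, assuming fully connected layers with a bias term on every non-input node.

```python
def count_parameters(layers):
    """Count weights and biases for a fully connected network."""
    # For each consecutive pair: n_in * n_out weights plus n_out biases
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layers, layers[1:]))

layers = [28, 5, 4, 2]
n_params = count_parameters(layers)
print(n_params)  # (28*5 + 5) + (5*4 + 4) + (4*2 + 2) = 179
```

Widening or adding hidden layers increases this count quickly, so it is worth recomputing when experimenting with the architecture.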


In [11]:
# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

In [12]:
# train the model
model = trainer.fit(train)

# compute accuracy on the test set
result = model.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))

Test set accuracy = 0.90099009901
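
The accuracy metric is the fraction of test rows whose prediction equals the label. The plain-Python sketch below shows that computation on a few made-up (prediction, label) pairs. The function name `accuracy` and the sample data are illustrative and do not come from the notebook's results.

```python
def accuracy(pairs):
    """Fraction of (prediction, label) pairs that match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("cannot compute accuracy of an empty collection")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)

# Made-up predictions and labels: three of the four pairs agree
toy_pairs = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
toy_accuracy = accuracy(toy_pairs)
print(toy_accuracy)  # 0.75
```

Accuracy alone can hide class-specific errors, so for an imbalanced medical dataset it may be worth also looking at other metrics the evaluator supports.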


In [13]:
predictionAndLabels.show(5)

+----------+-----+
|prediction|label|
+----------+-----+
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
|       0.0|  0.0|
+----------+-----+
only showing top 5 rows



### Conclusion
This is a simple example of applying a deep learning model in Spark. There are other ways to approach this problem, both with Spark and with other machine learning frameworks.