# Iris-classification-with-pyspark 

Download the dataset from https://archive.ics.uci.edu/ml/datasets/iris

In [1]:
ls ../datasets/iris_uci_edu/

bezdekIris.data  iris.data  iris.names


In [2]:
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

In [3]:
from pyspark.sql import SparkSession

# Build (or reuse) a local session; the SparkContext is available from it.
spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

In [4]:
schema = StructType([
    StructField('sepal_length', DoubleType(), nullable=False),
    StructField('sepal_width', DoubleType(), nullable=False),
    StructField('petal_length', DoubleType(), nullable=False),
    StructField('petal_width', DoubleType(), nullable=False),
    StructField('class', StringType(), nullable=False),
])

In [5]:
iris = spark.read.csv('../datasets/iris_uci_edu/iris.data',
                      schema=schema)

In [6]:
iris.describe().toPandas()

Unnamed: 0,summary,sepal_length,sepal_width,petal_length,petal_width,class
0,count,150.0,150.0,150.0,150.0,150
1,mean,5.843333333333335,3.0540000000000007,3.758666666666669,1.1986666666666672,
2,stddev,0.8280661279778637,0.4335943113621737,1.764420419952262,0.7631607417008414,
3,min,4.3,2.0,1.0,0.1,Iris-setosa
4,max,7.9,4.4,6.9,2.5,Iris-virginica


In [7]:
iris.select('class').distinct().toPandas()

Unnamed: 0,class
0,Iris-virginica
1,Iris-setosa
2,Iris-versicolor


In [8]:
iris.where('class = "Iris-setosa"').drop('class').describe().toPandas()

Unnamed: 0,summary,sepal_length,sepal_width,petal_length,petal_width
0,count,50.0,50.0,50.0,50.0
1,mean,5.005999999999999,3.4180000000000006,1.464,0.2439999999999999
2,stddev,0.3524896872134513,0.381024397954691,0.1735111594364455,0.1072095030816784
3,min,4.3,2.3,1.0,0.1
4,max,5.8,4.4,1.9,0.6


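### Register the DataFrame as a temporary view so we can query it with SQL; the view exists only for the lifetime of the SparkSession

In [9]:
# createOrReplaceTempView replaces the deprecated registerTempTable
iris.createOrReplaceTempView('iris')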

In [10]:
spark.sql('''
SELECT *
FROM iris
LIMIT 5
''').toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [11]:
spark.sql('''
SELECT 
    class, 
    min(sepal_length), avg(sepal_length), max(sepal_length),
    min(sepal_width), avg(sepal_width), max(sepal_width),
    min(petal_length), avg(petal_length), max(petal_length),
    min(petal_width), avg(petal_width), max(petal_width)
FROM iris
GROUP BY class
''').toPandas()

ParseException: "\nno viable alternative at input 'SELECT\xa0'(line 2, pos 6)\n\n== SQL ==\n\nSELECT\xa0\n------^^^\n\xa0 \xa0 class,\xa0\n\xa0 \xa0 min(sepal_length), avg(sepal_length), max(sepal_length),\n\xa0 \xa0 min(sepal_width), avg(sepal_width), max(sepal_width),\n\xa0 \xa0 min(petal_length), avg(petal_length), max(petal_length),\n\xa0 \xa0 min(petal_width), avg(petal_width), max(petal_width)\nFROM iris\nGROUP BY class\n"
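
The `\xa0` characters in the message are non-breaking spaces (U+00A0), most likely pasted in together with the query text; the parser rejects them where it expects ordinary whitespace. A minimal sketch of one way to clean such a string before passing it to `spark.sql` follows. The shortened query and the variable names are illustrative only, not the notebook's own.

```python
# Non-breaking space, which shows up as \xa0 in the ParseException above.
NBSP = "\u00a0"

# A shortened, illustrative query that contains the same kind of stray characters.
query = f"SELECT{NBSP}\n{NBSP} {NBSP} class, avg(sepal_length)\nFROM iris\nGROUP BY class\n"

# Replace every non-breaking space with an ordinary space.
clean_query = query.replace(NBSP, " ")

print(repr(clean_query))
```

Retyping the indentation by hand in the cell has the same effect.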

## Transformers
A Transformer is a piece of logic that transforms the data without needing to learn or fit anything from the data. A good intuition for transformers is that they represent functions that we wish to map over our data. All stages of a pipeline have parameters so that we can make sure that the transformation is being applied to the right fields, and with the desired configuration. Let’s look at a few examples.<br>

### SQLTRANSFORMER
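The SQLTransformer has only one parameter, `statement`, which is the SQL statement that will be executed against our DataFrame. Inside the statement, the placeholder `__THIS__` stands for the input DataFrame; the query below instead names the `iris` view registered earlier. Let’s use a SQLTransformer to do the group-by we attempted above.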

In [13]:
from pyspark.ml.feature import SQLTransformer

statement = '''
SELECT 
    class, 
    min(sepal_length), avg(sepal_length), max(sepal_length),
    min(sepal_width), avg(sepal_width), max(sepal_width),
    min(petal_length), avg(petal_length), max(petal_length),
    min(petal_width), avg(petal_width), max(petal_width)
FROM iris
GROUP BY class
'''

sql_transformer = SQLTransformer(statement=statement)

In [14]:
sql_transformer.transform(iris).toPandas()

ParseException: "\nno viable alternative at input 'SELECT\xa0'(line 2, pos 6)\n\n== SQL ==\n\nSELECT\xa0\n------^^^\n\xa0 \xa0 class,\xa0\n\xa0 \xa0 min(sepal_length), avg(sepal_length), max(sepal_length),\n\xa0 \xa0 min(sepal_width), avg(sepal_width), max(sepal_width),\n\xa0 \xa0 min(petal_length), avg(petal_length), max(petal_length),\n\xa0 \xa0 min(petal_width), avg(petal_width), max(petal_width)\nFROM iris\nGROUP BY class\n"

##### SQLTransformer is useful when you have preprocessing or restructuring that you need to perform on your data before other steps in the pipeline. Now let’s look at a transformer that works on one field and returns the original data with a new field. 

### BINARIZER 

The Binarizer is a Transformer that applies a threshold to a numeric field, turning it into 0s (when the value is less than or equal to the threshold) and 1s (when the value is above the threshold). It takes three parameters.

inputCol the column to be binarized<br>
outputCol the column containing the binarized values<br>
threshold the threshold we will apply
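
The thresholding rule itself can be sketched in plain Python. This is not Spark's implementation, only an illustration of the comparison it applies; the function name `binarize` is made up for this example, and the values are the first five `sepal_length` entries from the data.

```python
def binarize(value, threshold=0.0):
    """Return 1.0 if value is strictly above the threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0

# First five sepal_length values from the iris data.
sepal_lengths = [5.1, 4.9, 4.7, 4.6, 5.0]
flags = [binarize(v, threshold=5.0) for v in sepal_lengths]

print(flags)
```

Note that 5.0 itself maps to 0.0, because the comparison is strict.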

In [15]:
from pyspark.ml.feature import Binarizer

binarizer = Binarizer(inputCol='sepal_length',
                     outputCol='sepal_length_above_5', threshold=5.0)

In [16]:
binarizer.transform(iris).limit(5).toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,sepal_length_above_5
0,5.1,3.5,1.4,0.2,Iris-setosa,1.0
1,4.9,3.0,1.4,0.2,Iris-setosa,0.0
2,4.7,3.2,1.3,0.2,Iris-setosa,0.0
3,4.6,3.1,1.5,0.2,Iris-setosa,0.0
4,5.0,3.6,1.4,0.2,Iris-setosa,0.0


***note:*** Binarizer returns a new DataFrame with the extra column; the original DataFrame is left unchanged.

### VECTORASSEMBLER 

Another important Transformer is the VectorAssembler. It takes a list of numeric and vector-valued columns and constructs a single vector. This is useful since MLlib’s machine learning algorithms all expect a single vector-valued input column for features. The VectorAssembler takes two parameters.

inputCols the list of columns to be assembled<br>
outputCol the column containing the new vectors

In [17]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['sepal_length', 'sepal_width',
                                      'petal_length', 'petal_width'],
                           outputCol='features')

Persist the assembled DataFrame so that later steps can reuse it without recomputing it.

In [18]:
iris_w_vecs = assembler.transform(iris).persist()

In [19]:
iris_w_vecs.limit(5).toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,features
0,5.1,3.5,1.4,0.2,Iris-setosa,"[5.1, 3.5, 1.4, 0.2]"
1,4.9,3.0,1.4,0.2,Iris-setosa,"[4.9, 3.0, 1.4, 0.2]"
2,4.7,3.2,1.3,0.2,Iris-setosa,"[4.7, 3.2, 1.3, 0.2]"
3,4.6,3.1,1.5,0.2,Iris-setosa,"[4.6, 3.1, 1.5, 0.2]"
4,5.0,3.6,1.4,0.2,Iris-setosa,"[5.0, 3.6, 1.4, 0.2]"


***note:*** MLlib works with feature vectors like the ones we created above using VectorAssembler.

## Estimators and Models 

Estimators allow us to create transformations that are informed by our data. The Estimator is fit with a DataFrame returning a Model which is a kind of Transformer. The Models created from classifier and regression Estimators are PredictionModels.

This is a similar design to scikit-learn, with the exception that in scikit-learn when we call fit we mutate the estimator instead of creating a new object. There are pros and cons to this, as there always are when debating mutability. Idiomatic Scala strongly prefers immutability.

### MINMAXSCALER 

MinMaxScaler rescales each feature to a target range, by default from 0 to 1. It takes four parameters.<br>
1- inputCol the column to be scaled<br>
2- outputCol the column containing the scaled values<br>
3- max the new maximum value (optional, default=1)<br>
4- min the new minimum value (optional, default=0)
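
The rescaling arithmetic can be sketched in plain Python for a single value. This is an illustration under the usual min-max formula, not Spark's code; the function name and argument names are invented for this example, and the constant-column case (mapping to the midpoint of the target range) is an assumption based on the behaviour Spark documents for it. The bounds 4.3 and 7.9 are the `sepal_length` minimum and maximum reported by `describe()` above.

```python
def min_max_scale(value, col_min, col_max, new_min=0.0, new_max=1.0):
    """Linearly map value from [col_min, col_max] to [new_min, new_max]."""
    if col_max == col_min:
        # Constant column: no spread to scale, so use the midpoint of the target range.
        return 0.5 * (new_max + new_min)
    return (value - col_min) / (col_max - col_min) * (new_max - new_min) + new_min

# sepal_length of the first row, scaled with the column's observed min and max.
scaled = min_max_scale(5.1, col_min=4.3, col_max=7.9)

print(scaled)
```

The estimator has to be fit first precisely because `col_min` and `col_max` are learned from the data.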

In [21]:
from pyspark.ml.feature import MinMaxScaler

scaler = MinMaxScaler(inputCol='features', 
                      outputCol='petal_length_scaled')

First we fit the scaler on some data.

In [22]:
scaler_model = scaler.fit(iris_w_vecs)

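Now we transform the desired data. Because the input column is the whole `features` vector, every component is rescaled, even though the output column is named `petal_length_scaled`.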

In [23]:
scaler_model.transform(iris_w_vecs).limit(5).toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,features,petal_length_scaled
0,5.1,3.5,1.4,0.2,Iris-setosa,"[5.1, 3.5, 1.4, 0.2]","[0.22222222222222213, 0.6249999999999999, 0.06..."
1,4.9,3.0,1.4,0.2,Iris-setosa,"[4.9, 3.0, 1.4, 0.2]","[0.1666666666666668, 0.41666666666666663, 0.06..."
2,4.7,3.2,1.3,0.2,Iris-setosa,"[4.7, 3.2, 1.3, 0.2]","[0.11111111111111119, 0.5, 0.05084745762711865..."
3,4.6,3.1,1.5,0.2,Iris-setosa,"[4.6, 3.1, 1.5, 0.2]","[0.08333333333333327, 0.4583333333333333, 0.08..."
4,5.0,3.6,1.4,0.2,Iris-setosa,"[5.0, 3.6, 1.4, 0.2]","[0.19444444444444448, 0.6666666666666666, 0.06..."


## StringIndexer 

Let’s build a model! We will try to predict the class from the other features. We will use a decision tree. First, though, we must convert our target into index values.<br><br>
The StringIndexer Estimator will turn our class values into indices. We want to do this to simplify some of the downstream processing: it is simpler to implement most training algorithms with the assumption that the target is a number. It takes four parameters.
<br><br>
inputCol the column to be indexed<br>
outputCol the column containing the indexed values<br>
handleInvalid the policy for how the model should handle values not seen by the estimator (optional, default=error)<br>
stringOrderType how to order the values to make the indexing deterministic (optional, default=frequencyDesc)
<br><br>We will also want an IndexToString Transformer. This will let us map our predictions, which will be indices, back to string values. IndexToString takes three parameters.
<br><br>
inputCol the column to be mapped<br>
outputCol the column containing the mapped values<br>
labels the mapping from index to value, usually generated by a StringIndexer.
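
The round trip from strings to indices and back can be sketched in plain Python. This only mimics the `frequencyDesc` ordering; the function names are invented for this example, and breaking ties alphabetically is an assumption made here so the result is deterministic.

```python
from collections import Counter

def fit_labels(values):
    """Order distinct values by descending frequency (ties broken alphabetically)."""
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))

def to_index(values, labels):
    """Map each string to its position in labels; unseen values raise KeyError."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[v] for v in values]

def to_string(indices, labels):
    """Inverse mapping: turn indices back into the original strings."""
    return [labels[int(i)] for i in indices]

classes = ["Iris-setosa", "Iris-virginica", "Iris-setosa",
           "Iris-versicolor", "Iris-setosa", "Iris-virginica"]
labels = fit_labels(classes)
indexed = to_index(classes, labels)
restored = to_string(indexed, labels)

print(labels, indexed)
```

The `labels` list plays the role of the fitted model's `labels` attribute: the same list drives both directions of the mapping.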

Fit the indexer to convert class strings to indices, and build the inverse mapping that turns indices back into strings.

In [27]:
from pyspark.ml.feature import StringIndexer, IndexToString

indexer = StringIndexer(inputCol='class', outputCol='class_ix')
indexer_model = indexer.fit(iris_w_vecs)

index2string = IndexToString(
    # Map the classifier's predicted index (default column 'prediction')
    # back to a string, rather than the true label index.
    inputCol='prediction',
    outputCol='pred_class',
    labels=indexer_model.labels
)

In [28]:
iris_indexed = indexer_model.transform(iris_w_vecs)

### Training decision tree classifier 

In [29]:
from pyspark.ml.classification import DecisionTreeClassifier

dl_clfr = DecisionTreeClassifier(
    featuresCol = 'features',
    labelCol = 'class_ix',
    maxDepth=5,
    impurity='gini',
    seed=123
)

In [30]:
dt_clfr_model = dl_clfr.fit(iris_indexed)

In [31]:
iris_w_pred = dt_clfr_model.transform(iris_indexed)

In [32]:
iris_w_pred.limit(5).toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,features,class_ix,rawPrediction,probability,prediction
0,5.1,3.5,1.4,0.2,Iris-setosa,"[5.1, 3.5, 1.4, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
1,4.9,3.0,1.4,0.2,Iris-setosa,"[4.9, 3.0, 1.4, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
2,4.7,3.2,1.3,0.2,Iris-setosa,"[4.7, 3.2, 1.3, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
3,4.6,3.1,1.5,0.2,Iris-setosa,"[4.6, 3.1, 1.5, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
4,5.0,3.6,1.4,0.2,Iris-setosa,"[5.0, 3.6, 1.4, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0


##### Now map the indices back to strings using IndexToString

In [33]:
iris_w_pred_class = index2string.transform(iris_w_pred)

In [34]:
iris_w_pred_class.limit(5).toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,features,class_ix,rawPrediction,probability,prediction,pred_class
0,5.1,3.5,1.4,0.2,Iris-setosa,"[5.1, 3.5, 1.4, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa,"[4.9, 3.0, 1.4, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa,"[4.7, 3.2, 1.3, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa,"[4.6, 3.1, 1.5, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa,"[5.0, 3.6, 1.4, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0,Iris-setosa


## EVALUATE MODEL WITH EVALUATOR 

In our example, we are trying to solve a multiclass prediction problem, so we will use the MulticlassClassificationEvaluator.
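
The `accuracy` metric used with this evaluator is the fraction of rows whose predicted index equals the label index. A minimal plain-Python sketch of that calculation, with a made-up function name and made-up sample values:

```python
def accuracy(labels, predictions):
    """Fraction of rows whose predicted index equals the true label index."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Illustrative values only: three of four predictions match.
sample_labels = [0.0, 0.0, 1.0, 2.0]
sample_predictions = [0.0, 0.0, 1.0, 1.0]
score = accuracy(sample_labels, sample_predictions)

print(score)
```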

In [35]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol = 'class_ix',
    metricName='accuracy'
)

In [36]:
evaluator.evaluate(iris_w_pred_class)

1.0

***note:*** This seems too good. We evaluated the model on the same data it was trained on, so what if we are overfit? We should try using cross-validation to evaluate our models. Before we do that, let’s organize our stages into a pipeline.

## PIPELINES 

##### A Pipeline is a special kind of Estimator that takes a list of Transformers and Estimators and allows us to use them as a single Estimator.
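
How fitting chains through the stages can be sketched with toy classes in plain Python. None of these classes are Spark's; the names are invented for this example. The point is only that each estimator is fit on the output of the stages before it, and the fitted result is a list of transformers.

```python
class AddOne:
    """Toy transformer: nothing to learn, just maps a function over the rows."""
    def transform(self, rows):
        return [r + 1 for r in rows]

class CenterModel:
    """Toy model: a transformer produced by fitting CenterEstimator."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, rows):
        return [r - self.mean for r in rows]

class CenterEstimator:
    """Toy estimator: fit() learns the mean and returns a model."""
    def fit(self, rows):
        return CenterModel(sum(rows) / len(rows))

def fit_pipeline(stages, rows):
    """Fit estimators in order, feeding each stage the previous stage's output."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted

def run_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for model in fitted:
        rows = model.transform(rows)
    return rows

data = [1.0, 2.0, 3.0]
fitted = fit_pipeline([AddOne(), CenterEstimator()], data)
result = run_pipeline(fitted, data)

print(result)
```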

In [39]:
from pyspark.ml import Pipeline

pipeline = Pipeline(
    stages=[assembler, indexer, dl_clfr, index2string]
)

In [40]:
pipeline_model = pipeline.fit(iris)

In [41]:
pipeline_model.transform(iris).limit(5).toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,features,class_ix,rawPrediction,probability,prediction,pred_class
0,5.1,3.5,1.4,0.2,Iris-setosa,"[5.1, 3.5, 1.4, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa,"[4.9, 3.0, 1.4, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa,"[4.7, 3.2, 1.3, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa,"[4.6, 3.1, 1.5, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa,"[5.0, 3.6, 1.4, 0.2]",0.0,"[50.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0,Iris-setosa


## CROSS VALIDATION

Now that we have a Pipeline and an Evaluator, we can create a CrossValidator. The CrossValidator itself is also an Estimator. When we call fit, it splits the data into folds, fits our Pipeline on the training portion of each split, and calculates the metric determined by our Evaluator on the held-out fold. CrossValidator takes five parameters.<br>
<br>estimator the Estimator to be tuned
<br>estimatorParamMaps the hyperparameter values to try in a hyperparameter grid search
<br>evaluator the Evaluator that calculates the metric
<br>numFolds the number of folds to split the data into
<br>seed a seed for making the splits reproducible
<br><br>We will make a trivial hyperparameter grid here, since we are only interested in estimating how well our model does on data it has not seen.
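
The fold logic can be sketched in plain Python over row indices. This is an illustration of k-fold splitting in general, not Spark's exact procedure: the function name is invented, and assigning rows to folds by slicing a shuffled index list is a simplification chosen to give equal-sized folds.

```python
import random

def k_fold_splits(n_rows, num_folds, seed):
    """Return (training, validation) index lists, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the splits are reproducible
    folds = [indices[i::num_folds] for i in range(num_folds)]
    splits = []
    for i, validation in enumerate(folds):
        # Train on every fold except the one held out for validation.
        training = [ix for j, fold in enumerate(folds) if j != i for ix in fold]
        splits.append((training, validation))
    return splits

splits = k_fold_splits(n_rows=150, num_folds=3, seed=123)
for training, validation in splits:
    print(len(training), len(validation))
```

Every row is held out exactly once, so the per-fold metrics are all computed on data the corresponding model did not see during fitting.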

In [42]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

param_grid = (
    ParamGridBuilder()
    .addGrid(dl_clfr.maxDepth, [5])
    .build()
)

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
    seed=123
)

In [43]:
cv_model = cv.fit(iris)

In [44]:
cv_model.avgMetrics

[0.9588996659642801]

## Serialization Of Models 

MLlib allows us to save Pipelines so that we can use them later. We can also save individual Transformers and Models, but we will often want to keep all the stages of a Pipeline together. Generally speaking, we use separate programs for building models and using models.

In [45]:
pipeline_model.write().overwrite().save('pipeline.model')

In [47]:
! ls pipeline.model/* 

pipeline.model/metadata:
part-00000  _SUCCESS

pipeline.model/stages:
0_VectorAssembler_f0e3c11f1cbb	2_DecisionTreeClassifier_dec81554aae2
1_StringIndexer_d54497b894f8	3_IndexToString_094101ae0ef8
