<a href="https://cocl.us/Data_Science_with_Scalla_top"><img src = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/adds/Data_Science_with_Scalla_notebook_top.png" width = 750, align = "center"></a>
 <br/>
<a><img src="https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png" width="200" align="center"></a>

# Module 2: Preparing Data - Transformers and Estimators

## Transformers and Estimators 

### Lesson Objectives 

After completing this lesson, you should be able to: 

-	Understand, create, and use a `Transformer`
-	Understand, create, and use an `Estimator`
-	Set parameters of `Transformers` and `Estimators`
-	Create a feature `Vector` with `VectorAssembler`

## Transformers

-	Algorithm which can transform one `DataFrame` into another `DataFrame`
-	Abstraction that includes feature transformers and learned models. 
-	Implements a method `transform()`, which converts one `DataFrame` into another, generally by appending one or more columns
-	Input and output columns set with `setInputCol` and `setOutputCol` methods 
-	Examples:
  -	read one or more columns and map them into a new column of feature vectors
  -	read a column containing feature vectors and make a prediction for each vector

In [1]:
import org.apache.spark.mllib.util.MLUtils

In [2]:
import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer}

// Three sample sentences, each with an integer label
val data = Seq((0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic, regression, models,are,neat"))

val sentenceDataFrame = spark.createDataFrame(data).toDF("label", "sentence")

// Tokenizer lowercases the text and splits it on whitespace
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

// transform() appends the "words" column to the input DataFrame
val tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.show()

+-----+--------------------+--------------------+
|label|            sentence|               words|
+-----+--------------------+--------------------+
|    0|Hi I heard about ...|[hi, i, heard, ab...|
|    1|I wish Java could...|[i, wish, java, c...|
|    2|Logistic, regress...|[logistic,, regre...|
+-----+--------------------+--------------------+

data = List((0,Hi I heard about Spark), (1,I wish Java could use case classes), (2,Logistic, regression, models,are,neat))
sentenceDataFrame = [label: int, sentence: string]
tokenizer = tok_cbb0dc52c9cb
tokenized = [label: int, sentence: string ... 1 more field]


[label: int, sentence: string ... 1 more field]
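The default `Tokenizer` splits only on whitespace, which is why the third sentence yields tokens such as `logistic,` with the comma still attached. The `RegexTokenizer` imported above can split on a pattern instead. The following is a minimal sketch, assuming the notebook's predefined `spark` session; the names `sentences`, `regexTokenizer`, and `regexTokenized` are illustrative and not part of the original lesson.

```scala
import org.apache.spark.ml.feature.RegexTokenizer

// Same sentences as in the Tokenizer example above
val sentences = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic, regression, models,are,neat")
)).toDF("label", "sentence")

// Split on runs of non-word characters instead of whitespace only
val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W+")

// Like any Transformer, transform() appends the output column
val regexTokenized = regexTokenizer.transform(sentences)
regexTokenized.select("sentence", "words").show(false)
```

With this pattern the commas in the third sentence act as separators, so they no longer appear inside the tokens.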

## Estimators 

-	Algorithm which can be fit on a `DataFrame` to produce a `Transformer` 
-	Abstracts the concept of a learning algorithm or any algorithm that fits or trains on data 
-	Implements a method `fit()`, which accepts a `DataFrame` and produces a `Model`, which is a `Transformer`
-	Example: `LogisticRegression`
  -	It is a learning algorithm and therefore an `Estimator`
  -	Calling the method `fit()` to train the logistic regression returns a `Model`

In [3]:
// A Simple Example of an Estimator

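import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Build a small labeled training set; convertVectorColumnsToML converts the
// old mllib vector column to the ml vector type that LogisticRegression expects
val training = MLUtils.convertVectorColumnsToML(spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(20.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF("label", "features"))

// LogisticRegression is an Estimator
val lr = new LogisticRegression()

// Setters modify this instance in place and return it
lr.setMaxIter(10).setRegParam(0.01)

// fit() trains on the DataFrame and returns a LogisticRegressionModel (a Transformer)
val model1 = lr.fit(training)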

training = [label: double, features: vector]
lr = logreg_e74e96516dd0
model1 = LogisticRegressionModel: uid = logreg_e74e96516dd0, numClasses = 2, numFeatures = 3


LogisticRegressionModel: uid = logreg_e74e96516dd0, numClasses = 2, numFeatures = 3

In [4]:
model1.transform(training).show()

+-----+--------------+--------------------+--------------------+----------+
|label|      features|       rawPrediction|         probability|prediction|
+-----+--------------+--------------------+--------------------+----------+
|  1.0| [0.0,1.1,0.1]|[-2.6711500036993...|[0.06469734519861...|       1.0|
|  0.0|[2.0,1.0,-1.0]|[1.25786304216765...|[0.77865802305260...|       0.0|
|  0.0|[20.0,1.3,1.0]|[2.78558770259479...|[0.94189202365949...|       0.0|
|  1.0|[0.0,1.2,-0.5]|[-1.3897978468983...|[0.19944003140081...|       1.0|
+-----+--------------+--------------------+--------------------+----------+
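The fitted model is itself a `Transformer`, which is why `model1.transform(training)` works above. The sketch below makes that relationship explicit, assuming the notebook's predefined `spark` session. It rebuilds a training set like the one above using `org.apache.spark.ml.linalg.Vectors` directly, so no `MLUtils` conversion is needed; the names `train`, `estimator`, `fitted`, and `asTransformer` are illustrative.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors

// Training data built with ml vectors from the start
val train = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(20.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// The Estimator: holds the algorithm and its parameters, but no learned state
val estimator = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// fit() returns the Model, which carries the learned state
val fitted: LogisticRegressionModel = estimator.fit(train)

// This assignment compiles only because the model is a Transformer
val asTransformer: Transformer = fitted

// Learned state of the binary logistic regression model
println(s"coefficients: ${fitted.coefficients}, intercept: ${fitted.intercept}")

// Used purely as a Transformer: appends the prediction columns
asTransformer.transform(train).select("label", "prediction").show()
```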



## Parameters

-	Transformers and Estimators use a uniform API for specifying parameters
-	A `ParamMap` is a set of `(parameter, value)` pairs
-	Parameters are specific to a given instance
-	There are two main ways to pass parameters to an algorithm: 
  -	Setting parameters for an instance using an appropriate method, for instance `setMaxIter(10)`
  -	Passing a `ParamMap` to `fit()` or `transform()`, for instance, `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`
  -	In this case, the parameter `maxIter` is specified for two different instances, `lr1` and `lr2`
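The two approaches can be sketched side by side. This is a minimal illustration that needs no data; `lr1` and `lr2` are the hypothetical instance names used in the bullets above, and `sharedMap` is an illustrative name.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

// Two independent estimator instances
val lr1 = new LogisticRegression()
val lr2 = new LogisticRegression()

// Way 1: set a parameter on one instance with its setter
lr1.setMaxIter(5)
println(lr1.getMaxIter)

// Way 2: collect (parameter, value) pairs in a ParamMap.
// Each Param belongs to the instance it was read from, so one map
// can hold a separate maxIter value for lr1 and for lr2.
val sharedMap = ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)

println(sharedMap.get(lr1.maxIter))
println(sharedMap.get(lr2.maxIter))
println(sharedMap.size)
```

Because parameters are specific to an instance, the map keeps two distinct entries rather than letting the second `maxIter` overwrite the first.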

In [5]:
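// A Simple Example of Parameter Setting

import org.apache.spark.ml.param.ParamMap

// Parameters of the `lr` instance, referenced through the instance itself
val paramMap = ParamMap(lr.maxIter -> 20, lr.regParam -> 0.01)

// Values in the ParamMap take precedence over those set earlier with setters
val model2 = lr.fit(training, paramMap)

model2.transform(training).show()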

+-----+--------------+--------------------+--------------------+----------+
|label|      features|       rawPrediction|         probability|prediction|
+-----+--------------+--------------------+--------------------+----------+
|  1.0| [0.0,1.1,0.1]|[-2.0332608460019...|[0.11575473786511...|       1.0|
|  0.0|[2.0,1.0,-1.0]|[1.93242469544033...|[0.87351755418157...|       0.0|
|  0.0|[20.0,1.3,1.0]|[2.46813565317686...|[0.92187760149918...|       0.0|
|  1.0|[0.0,1.2,-0.5]|[-2.5016071722078...|[0.07574558805102...|       1.0|
+-----+--------------+--------------------+--------------------+----------+



paramMap = 
model2 = LogisticRegressionModel: uid = logreg_e74e96516dd0, numClasses = 2, numFeatures = 3


{
	logreg_e74e96516dd0-maxIter: 20,
	logreg_e74e96516dd0-regParam: 0.01
}


LogisticRegressionModel: uid = logreg_e74e96516dd0, numClasses = 2, numFeatures = 3
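A `ParamMap` can also be built up incrementally and merged with another one. The following sketch needs no data; `lrDemo`, `base`, `extra`, and `combined` are illustrative names, and `"myProbability"` is an arbitrary column name chosen for the example.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lrDemo = new LogisticRegression().setMaxIter(10)

// Build a ParamMap step by step; a later put overrides an earlier value
val base = ParamMap(lrDemo.maxIter -> 20)
  .put(lrDemo.maxIter, 30)
  .put(lrDemo.regParam -> 0.1, lrDemo.threshold -> 0.55)

// A second map that renames the probability output column
val extra = ParamMap(lrDemo.probabilityCol -> "myProbability")

// ++ merges two ParamMaps into a new one
val combined = base ++ extra
println(combined)

// Documentation, default value, and current value of every parameter
println(lrDemo.explainParams())
```

Passing `combined` to `fit()` would apply these values for that call only; the setter value on `lrDemo` itself stays at 10.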

## Vector Assembler

-	Transformer that combines a given list of columns into a single vector column
-	Useful for combining raw features and features generated by other transformers into a single feature vector
-	Accepts the following input column types: 
  -	all numeric types 
  -	boolean
  -	vector

In [6]:
// An Example of a VectorAssembler

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._


val dfRandom = spark.range(0, 10).select("id")
 .withColumn("uniform", rand(10L))
 .withColumn("normal1", randn(10L))
 .withColumn("normal2", randn(11L))

val assembler = new VectorAssembler().
 setInputCols(Array("uniform","normal1","normal2")).
 setOutputCol("features")

val dfVec = assembler.transform(dfRandom)


// Show the assembled feature vectors alongside their ids

dfVec.select("id","features").show()

+---+--------------------+
| id|            features|
+---+--------------------+
|  0|[0.41371264720975...|
|  1|[0.73117192818966...|
|  2|[0.90317011551182...|
|  3|[0.09430205113458...|
|  4|[0.38340505276222...|
|  5|[0.55692461355235...|
|  6|[0.49774414066138...|
|  7|[0.20766661062014...|
|  8|[0.95719194065089...|
|  9|[0.74293954612044...|
+---+--------------------+



dfRandom = [id: bigint, uniform: double ... 2 more fields]
assembler = vecAssembler_139594e48681
dfVec = [id: bigint, uniform: double ... 3 more fields]


[id: bigint, uniform: double ... 3 more fields]
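The example above assembles only double columns. The sketch below mixes the accepted input types (integer, double, vector, and boolean) in one call, assuming the notebook's predefined `spark` session. The data and the names `mixed`, `mixedAssembler`, and `assembled` are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

// Hypothetical rows: id, an Int, a Double, a Vector, and a Boolean
val mixed = spark.createDataFrame(Seq(
  (0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), true),
  (1, 25, 2.0, Vectors.dense(1.0, 3.0, 0.2), false)
)).toDF("id", "hour", "mobile", "userFeatures", "clicked")

// Columns are concatenated in the order given; the vector column
// contributes all of its elements, and booleans become 0.0 or 1.0
val mixedAssembler = new VectorAssembler()
  .setInputCols(Array("hour", "mobile", "userFeatures", "clicked"))
  .setOutputCol("features")

val assembled = mixedAssembler.transform(mixed)
assembled.select("id", "features").show(false)
```

Each resulting feature vector has six elements: one each for `hour`, `mobile`, and `clicked`, plus three from `userFeatures`.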

## Lesson Summary 

-	Having completed this lesson, you should be able to: 
  - Understand, create, and use a `Transformer`
  -	Understand, create, and use an `Estimator`
  -	Set parameters of `Transformers` and `Estimators`
  -	Create a feature `Vector` with `VectorAssembler`

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.