Update helloworld examples to be simple #351

Merged: 5 commits, Jul 2, 2019
**docs/examples/Boston-Regression.md** (76 changes: 33 additions & 43 deletions)
@@ -1,6 +1,6 @@
# Boston Regression

-The following code illustrates how TransmogrifAI can be used to do linear regression. We use the Boston dataset to predict housing prices.
+The following code illustrates how TransmogrifAI can be used to do linear regression. We use the Boston dataset to predict housing prices. This example is very similar to the Titanic Binary Classification example, so you should look over that example first if you have not already.
The code for this example can be found [here](https://github.com/salesforce/TransmogrifAI/tree/master/helloworld/src/main/scala/com/salesforce/hw/boston), and the data [here](https://github.com/salesforce/op/tree/master/helloworld/src/main/resources/BostonDataset).

**Define features**
@@ -25,58 +25,48 @@
val medv = FeatureBuilder.RealNN[BostonHouse].extract(_.medv.toRealNN).asResponse
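The rest of the feature definitions are collapsed in this diff view. As a rough sketch only (the field names come from the feature list below; the exact feature types in the real file may differ), the predictors are defined along these lines:

```scala
// Hypothetical sketch of the collapsed predictor definitions; the real file
// may choose different feature types for some columns.
val crim = FeatureBuilder.Real[BostonHouse].extract(_.crim.toReal).asPredictor
val rm = FeatureBuilder.Real[BostonHouse].extract(_.rm.toReal).asPredictor
// ...and similarly for zn, indus, chas, nox, age, dis, rad, tax, ptratio, b, lstat
```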
**Feature Engineering**

```scala
-val houseFeatures = Seq(crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat).transmogrify()
+val features = Seq(crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat).transmogrify()
+val label = medv
+val checkedFeatures = label.sanityCheck(features, removeBadFeatures = true)
```

**Modeling & Evaluation**

For regression problems, we use ```RegressionModelSelector``` to choose which type of model to use, which in this case is Linear Regression. You can find more model types [here](../developer-guide#modelselector).

```scala
val prediction = RegressionModelSelector
-  .withCrossValidation(dataSplitter = Option(DataSplitter(seed = randomSeed)), seed = randomSeed)
-  .setRandomForestSeed(randomSeed)
-  .setGradientBoostedTreeSeed(randomSeed)
-  .setInput(medv, houseFeatures)
-  .getOutput()
+  .withTrainValidationSplit(
+    modelTypesToUse = Seq(OpLinearRegression))
+  .setInput(label, checkedFeatures).getOutput()

val workflow = new OpWorkflow().setResultFeatures(prediction)

-val evaluator = Evaluators.Regression().setLabelCol(medv).setPredictionCol(prediction)
+val evaluator = Evaluators.Regression().setLabelCol(label).setPredictionCol(prediction)

val model = workflow.train()
```

**Results**

-def runner(opParams: OpParams): OpWorkflowRunner =
-  new OpWorkflowRunner(
-    workflow = workflow,
-    trainingReader = trainingReader,
-    scoringReader = scoringReader,
-    evaluationReader = Option(trainingReader),
-    evaluator = Option(evaluator),
-    scoringEvaluator = None,
-    featureToComputeUpTo = Option(houseFeatures)
-  )
+We can extract each feature's contribution to the model via ```ModelInsights```, as in the Titanic Binary Classification example.

+```scala
+val modelInsights = model.modelInsights(prediction)
+val modelFeatures = modelInsights.features.flatMap( feature => feature.derivedFeatures)
+val featureContributions = modelFeatures.map( feature => (feature.derivedFeatureName,
+  feature.contribution.map( contribution => math.abs(contribution))
+    .foldLeft(0.0) { (max, contribution) => math.max(max, contribution)}))
+val sortedContributions = featureContributions.sortBy( contribution => -contribution._2)
+
+val (scores, metrics) = model.scoreAndEvaluate(evaluator = evaluator)
+```
-You can run the code using the following commands for train, score and evaluate:
+You can run the code using the following command:

```bash
cd helloworld
./gradlew compileTestScala installDist
```
-**Train**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.boston.OpBoston -Dargs="\
---run-type=train \
---model-location=/tmp/boston-model \
---read-location BostonHouse=`pwd`/src/main/resources/BostonDataset/housing.data"
-```
-**Score**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.boston.OpBoston -Dargs="\
---run-type=score \
---model-location=/tmp/boston-model \
---read-location BostonHouse=`pwd`/src/main/resources/BostonDataset/housing.data \
---write-location=/tmp/boston-scores"
-```
-**Evaluate**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.boston.OpBoston -Dargs="\
---run-type=evaluate \
---read-location BostonHouse=`pwd`/src/main/resources/BostonDataset/housing.data \
---write-location=/tmp/boston-eval \
---model-location=/tmp/boston-model \
---metrics-location=/tmp/boston-metrics"
+./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.OpBostonSimple -Dargs="\
+`pwd`/src/main/resources/BostonDataset/housingData.csv"
```
**docs/examples/Iris-MultiClass-Classification.md** (106 changes: 56 additions & 50 deletions)
@@ -1,72 +1,78 @@
# Iris MultiClass Classification

-The following code illustrates how TransmogrifAI can be used to do multiclass classification over the Iris dataset.
-The code for this example can be found [here](https://github.com/salesforce/TransmogrifAI/tree/master/helloworld/src/main/scala/com/salesforce/hw/iris), and the data [here](https://github.com/salesforce/op/tree/master/helloworld/src/main/resources/IrisDataset).
+The following code illustrates how TransmogrifAI can be used to do multiclass classification over the Iris dataset. This example is very similar to the Titanic Binary Classification example, so you should look over that example first if you have not already.
+The code for this example can be found [here](https://github.com/salesforce/TransmogrifAI/tree/master/helloworld/src/main/scala/com/salesforce/hw/OpIrisSimple.scala), and the data [here](https://github.com/salesforce/op/tree/master/helloworld/src/main/resources/IrisDataset/iris.csv).

**Data Schema**

```scala
case class Iris
(
id: Int,
sepalLength: Double,
sepalWidth: Double,
petalLength: Double,
petalWidth: Double,
irisClass: String
)
```

-**Define Features**

+**Define features**
```scala
val id = FeatureBuilder.Integral[Iris].extract(_.getID.toIntegral).asPredictor
val sepalLength = FeatureBuilder.Real[Iris].extract(_.getSepalLength.toReal).asPredictor
val sepalWidth = FeatureBuilder.Real[Iris].extract(_.getSepalWidth.toReal).asPredictor
val petalLength = FeatureBuilder.Real[Iris].extract(_.getPetalLength.toReal).asPredictor
val petalWidth = FeatureBuilder.Real[Iris].extract(_.getPetalWidth.toReal).asPredictor
val irisClass = FeatureBuilder.Text[Iris].extract(_.getClass$.toText).asResponse

```

**Feature Engineering**

```scala
-val labels = irisClass.indexed()
val features = Seq(sepalLength, sepalWidth, petalLength, petalWidth).transmogrify()
+val label = irisClass.indexed()
+val checkedFeatures = label.sanityCheck(features, removeBadFeatures = true)
```

**Modeling & Evaluation**

For multiclass classification, we use the ```MultiClassificationModelSelector``` to select the model we want to run, which in this case is Logistic Regression. You can find more information on model selection [here](../developer-guide#modelselector).

```scala
-val pred = MultiClassificationModelSelector
-  .withCrossValidation(splitter = Some(DataCutter(reserveTestFraction = 0.2, seed = randomSeed)), seed = randomSeed)
-  .setInput(labels, features).getOutput()
-
-private val evaluator = Evaluators.MultiClassification.f1()
-  .setLabelCol(labels)
-  .setPredictionCol(pred)
-
-private val wf = new OpWorkflow().setResultFeatures(pred, labels)
-
-def runner(opParams: OpParams): OpWorkflowRunner =
-  new OpWorkflowRunner(
-    workflow = wf,
-    trainingReader = irisReader,
-    scoringReader = irisReader,
-    evaluationReader = Option(irisReader),
-    evaluator = Option(evaluator),
-    featureToComputeUpTo = Option(features)
-  )
+val prediction = MultiClassificationModelSelector
+  .withTrainValidationSplit(
+    modelTypesToUse = Seq(OpLogisticRegression))
+  .setInput(label, checkedFeatures).getOutput()
+
+val evaluator = Evaluators.MultiClassification()
+  .setLabelCol(label)
+  .setPredictionCol(prediction)
+
+val workflow = new OpWorkflow().setResultFeatures(prediction, label).setReader(dataReader)
+
+val model = workflow.train()
```
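The ```dataReader``` used by the workflow is defined outside this excerpt. A minimal sketch of how it is presumably constructed (mirroring the Titanic example's CSV reader; the ```csvFilePath``` value here is an assumption):

```scala
// Assumed construction of dataReader, following the Titanic example;
// csvFilePath would point at src/main/resources/IrisDataset/iris.csv.
val dataReader = DataReaders.Simple.csvCase[Iris](path = Option(csvFilePath), key = _.id.toString)
```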
-You can run the code using the following commands for train, score and evaluate:

+**Results**
+
+We can still compute each feature's contribution to the model, but in multiclass classification ```ModelInsights``` reports each feature's contribution to the prediction of every class. The code below takes the maximum of these per-class contributions as the overall contribution.
+
+```scala
+val modelInsights = model.modelInsights(prediction)
+val modelFeatures = modelInsights.features.flatMap( feature => feature.derivedFeatures)
+val featureContributions = modelFeatures.map( feature => (feature.derivedFeatureName,
+  feature.contribution.map( contribution => math.abs(contribution))
+    .foldLeft(0.0) { (max, contribution) => math.max(max, contribution)}))
+val sortedContributions = featureContributions.sortBy( contribution => -contribution._2)
+
+val (scores, metrics) = model.scoreAndEvaluate(evaluator = evaluator)
+```

+You can run the code using the following command:
```bash
cd helloworld
./gradlew compileTestScala installDist
```
-**Train**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.iris.OpIris -Dargs="\
---run-type=train \
---model-location=/tmp/iris-model \
---read-location Iris=`pwd`/src/main/resources/IrisDataset/iris.data"
-```
-**Score**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.iris.OpIris -Dargs="\
---run-type=score \
---model-location=/tmp/iris-model \
---read-location Iris=`pwd`/src/main/resources/IrisDataset/bezdekIris.data \
---write-location=/tmp/iris-scores"
-```
-**Evaluate**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.iris.OpIris -Dargs="\
---run-type=evaluate \
---model-location=/tmp/iris-model \
---metrics-location=/tmp/iris-metrics \
---read-location Iris=`pwd`/src/main/resources/IrisDataset/bezdekIris.data \
---write-location=/tmp/iris-eval"
+./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.OpIrisSimple -Dargs="\
+`pwd`/src/main/resources/IrisDataset/iris.csv"
```
**docs/examples/Titanic-Binary-Classification.md** (43 changes: 23 additions & 20 deletions)
@@ -36,13 +36,13 @@
In the main function, we create a Spark session as per usual:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

-val conf = new SparkConf().setAppName("TitanicPrediction")
-implicit val spark = SparkSession.builder.config(conf).getOrCreate()
+implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
import spark.implicits._ // Needed for Encoders for the Passenger case class
```

-We then define the set of raw features that we would like to extract from the data. The raw features are defined using [FeatureBuilders](/Developer-Guide#featurebuilders), and are strongly typed. TransmogrifAI supports the following basic feature types: Text, Numeric, Vector, List, Set, Map. In addition, it supports many specific feature types which extend these base types: Email extends Text; Integral, Real and Binary extend Numeric; Currency and Percentage extend Real. For a complete view of the types supported, see the [Type Hierarchy and Automatic Feature Engineering](/Developer-Guide#type-hierarchy-and-automatic-feature-engineering) section in the documentation.
+We then define the set of raw features that we would like to extract from the data. The raw features are defined using [FeatureBuilders](../developer-guide#featurebuilders), and are strongly typed. TransmogrifAI supports the following basic feature types: Text, Numeric, Vector, List, Set, Map. In addition, it supports many specific feature types which extend these base types: Email extends Text; Integral, Real and Binary extend Numeric; Currency and Percentage extend Real. For a complete view of the types supported, see the [Type Hierarchy and Automatic Feature Engineering](../developer-guide#type-hierarchy-and-automatic-feature-engineering) section in the documentation.
Collaborator: I don't think ../ prefixes would work once we deploy to https://docs.transmogrif.ai. Were the links broken? How did you test them?

Contributor (author): The links are broken on the website, but only for that paragraph. I just imitated the style of the links that work in the other paragraphs. How do I link a page that will work when deployed?

Collaborator: Try running the docs server locally as described here: https://github.com/salesforce/TransmogrifAI/tree/master/docs#docs

Collaborator: Oh, you're right. We actually have them as ../ elsewhere.

Contributor (author): They work.


-Basic FeatureBuilders will be created for you if you use the TransmogrifAI CLI to bootstrap your project as described [here](/examples/Bootstrap-Your-First-Project.html). However, it is often useful to edit this code to customize feature generation and take full advantage of the Feature types available (selecting the appropriate type will improve automatic feature engineering steps).
+Basic FeatureBuilders will be created for you if you use the TransmogrifAI CLI to bootstrap your project as described [here](../examples/Bootstrap-Your-First-Project.html). However, it is often useful to edit this code to customize feature generation and take full advantage of the Feature types available (selecting the appropriate type will improve automatic feature engineering steps).

When defining raw features, specify the extract logic to be applied to the raw data, and also annotate the features as either predictor or response variables via the FeatureBuilders:
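The feature definition code itself is collapsed in this diff view. As an illustrative sketch of the strongly typed builders described above (these exact lines are not quoted from the diff, and the field types on the Passenger case class are assumptions):

```scala
// Illustrative only: typed FeatureBuilders for a few Titanic fields.
// Picking a specific type (e.g. PickList instead of plain Text for sex)
// improves the automatic feature engineering applied downstream.
val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse
val sex = FeatureBuilder.PickList[Passenger].extract(_.sex.map(_.toString).toPickList).asPredictor
val age = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor
```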
@@ -114,16 +114,18 @@
The next stage applies another powerful AutoML Estimator — the [SanityChecker]

```scala
// Optionally check the features with a sanity checker
-val sanityCheck = false
-val finalFeatures = if (sanityCheck) survived.sanityCheck(passengerFeatures) else passengerFeatures
+val checkedFeatures = survived.sanityCheck(passengerFeatures, removeBadFeatures = true)
```
Finally, the OpLogisticRegression Estimator is applied to derive a new triplet of Features which are essentially probabilities and predictions returned by the logistic regression algorithm:

```scala
-// Define the model we want to use (here a simple logistic regression) and get the resulting output
import com.salesforce.op.stages.impl.classification.OpLogisticRegression

-val prediction = new OpLogisticRegression().setInput(survived, finalFeatures).getOutput
+// Define the model we want to use (here a simple logistic regression) and get the resulting output
+val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(
+  modelTypesToUse = Seq(OpLogisticRegression)
+).setInput(survived, checkedFeatures).getOutput()
```
The [ModelSelector](../automl-capabilities#modelselectors) used here is a powerful AutoML Estimator that can automatically try out a variety of different classification algorithms and then select the best one.
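For example, trying several model families is just a matter of passing more entries to ```modelTypesToUse``` (a sketch, assuming the same inputs as above; ```OpRandomForestClassifier``` is one of the binary classification model types the selector supports):

```scala
// Sketch: let the selector fit several model families and keep the best one.
val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(
  modelTypesToUse = Seq(OpLogisticRegression, OpRandomForestClassifier)
).setInput(survived, checkedFeatures).getOutput()
```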

@@ -132,40 +132,41 @@
Notice that everything we've done so far has been purely at the level of definitions.
```scala
import com.salesforce.op.readers.DataReaders

-val trainDataReader = DataReaders.Simple.csvCase[Passenger](
-  path = Some(csvFilePath), // location of data file
-  key = _.id.toString // identifier for entity being modeled
-)
+val dataReader = DataReaders.Simple.csvCase[Passenger](path = Option(csvFilePath), key = _.id.toString)

val workflow =
  new OpWorkflow()
    .setResultFeatures(survived, prediction)
-    .setReader(trainDataReader)
+    .setReader(dataReader)
```

-When we now call 'train' on this workflow, it automatically computes and executes the entire DAG of Stages needed to compute the features ```survived, prediction, rawPrediction```, and ```prob```, fitting all the estimators on the training data in the process. Calling ```score``` on the fitted workflow then transforms the underlying training data to produce a DataFrame with all the features manifested. The ```score``` method can optionally be passed an evaluator that produces metrics.
+When we now call 'train' on this workflow, it automatically computes and executes the entire DAG of Stages needed to compute the features ```survived, prediction, rawPrediction```, and ```prob```, fitting all the estimators on the training data in the process. Calling ```scoreAndEvaluate``` on the model then transforms the underlying training data to produce a DataFrame with all the features manifested. The ```scoreAndEvaluate``` method can optionally be passed an evaluator that produces metrics.

```scala
import com.salesforce.op.evaluators.Evaluators

// Fit the workflow to the data
-val fittedWorkflow = workflow.train()
+val model = workflow.train()

val evaluator = Evaluators.BinaryClassification()
.setLabelCol(survived)
.setPredictionCol(prediction)

// Apply the fitted workflow to the train data and manifest
// the resulting dataframe together with metrics
-val (transformedTrainData, metrics) = fittedWorkflow.scoreAndEvaluate(evaluator = evaluator)
+val (scores, metrics) = model.scoreAndEvaluate(evaluator = evaluator)
```

-The fitted workflow can now be saved, and loaded again to be applied to any new data set of type Passengers by changing the reader.

+We can find information about the model via ```ModelInsights```. In this example, we extract each feature's contribution to the model by first listing all of the features that the model derives from the input features and then finding their maximum contribution. We can also find the correlation of each feature with the label, the mean of the feature values, the variance of the values, and so on. More information about ```ModelInsights``` can be found [here](../developer-guide#extracting-modelinsights-from-a-fitted-workflow).
```scala
-fittedWorkflow.save(saveWorkflowPath)
-
-val savedWorkflow = OpWorkflowModel.load(saveWorkflowPath).setReader(testDataReader)
+// Extract information (i.e. feature importance) via model insights
+val modelInsights = model.modelInsights(prediction)
+val modelFeatures = modelInsights.features.flatMap( feature => feature.derivedFeatures)
+val featureContributions = modelFeatures.map( feature => (feature.derivedFeatureName,
+  feature.contribution.map( contribution => math.abs(contribution))
+    .foldLeft(0.0) { (max, contribution) => math.max(max, contribution)}))
+val sortedContributions = featureContributions.sortBy( contribution => -contribution._2)
```
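Although this walkthrough no longer covers it, the fitted model can still be saved and reloaded to score any new data set of Passengers by swapping the reader. A minimal sketch based on the API visible in the removed lines above (```saveWorkflowPath``` is a placeholder for any writable location):

```scala
// Persist the fitted workflow, then reload it and point it at new data.
model.save(saveWorkflowPath)
val savedModel = OpWorkflowModel.load(saveWorkflowPath).setReader(dataReader)
```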


