# Op Housing Prices Sample
Here we describe a very simple TransmogrifAI workflow for predicting housing prices based on the features of a house on sale. The code for building and applying the Titanic model can be found here: Titanic Code, and the data can be found here: Titanic Data.

First, we need to load the TransmogrifAI and Spark MLlib jars:

In [None]:
%classpath add mvn com.salesforce.transmogrifai transmogrifai-core_2.11 0.6.0

In [None]:
%classpath add mvn org.apache.spark spark-mllib_2.11 2.3.3

**Import the classes**

In [None]:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.udf

import com.salesforce.op._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.evaluators.Evaluators

In [None]:
import com.salesforce.op.OpWorkflow
import com.salesforce.op.evaluators.Evaluators
import com.salesforce.op.readers.DataReaders

Instantiate a SparkSession

In [None]:
val conf = new SparkConf().setMaster("local[*]").setAppName("HousingPricesPrediction")
implicit val spark = SparkSession.builder.config(conf).getOrCreate()

### Schema class
Let us create a case class to describe the schema for the data:

In [None]:
case class HousingPrices(
  lotFrontage: Double,
  area: Integer,
  lotShape: String,
  yrSold : Integer,
  saleType: String,
  saleCondition: String,
  salePrice: Double)

### Feature engineering

We then define the set of raw features that we would like to extract from the data. The 
raw features are defined using [FeatureBuilders](https://docs.transmogrif.ai/Developer-Guide#featurebuilders) 
and are strongly typed. TransmogrifAI supports the following basic feature types: `Text`, 
`Numeric`, `Vector`, `List`, `Set`, `Map`. In addition, it supports many specific feature 
types which extend these base types: `Email` extends `Text`; `Integral`, `Real` and `Binary` extend 
`Numeric`; `Currency` and `Percentage` extend `Real`. For a complete view of the types supported, 
see the Type Hierarchy and Automatic Feature Engineering section in the Documentation.

Basic `FeatureBuilders` will be created for you if you use the TransmogrifAI CLI to bootstrap 
your project as described here. However, it is often useful to edit this code to customize 
feature generation and take full advantage of the Feature types available (selecting the 
appropriate type will improve automatic feature engineering steps).
When defining raw features, specify the extract logic to be applied to the raw data, and 
also annotate the features as either predictor or response variables via the FeatureBuilders:


In [None]:
import org.apache.spark.sql.Encoders
implicit val srEncoder = Encoders.product[HousingPrices]

In [None]:
val lotFrontage = FeatureBuilder.Real[HousingPrices].extract(_.lotFrontage.toReal).asPredictor
val area = FeatureBuilder.Integral[HousingPrices].extract(_.area.toIntegral).asPredictor

In [None]:
val lotShape = FeatureBuilder.Integral[HousingPrices].extract(_.lotShape match {
    case "IR1" => 1.toIntegral
    case _ => 0.toIntegral
}).asPredictor
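
The `lotShape` extractor above turns a categorical string into a binary flag. As a minimal sketch of that logic outside TransmogrifAI, the plain-Scala function below reproduces the same pattern match. The object name `LotShapeEncoding` and the sample values `"Reg"` and `"IR2"` are assumptions made for illustration only.

```scala
// Plain-Scala sketch of the binary encoding used for lotShape above.
// LotShapeEncoding is a hypothetical name, not part of TransmogrifAI.
object LotShapeEncoding {
  // "IR1" maps to 1; every other value, including unseen ones, maps to 0.
  def encode(lotShape: String): Int = lotShape match {
    case "IR1" => 1
    case _     => 0
  }

  // Hypothetical sample values, for illustration only.
  val example: Seq[Int] = Seq("IR1", "Reg", "IR2").map(encode)
}
```

Note that this encoding collapses every value other than `"IR1"` into a single bucket, so any distinction between the remaining lot shapes is not visible to the model.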

In [None]:
val yrSold = FeatureBuilder.Integral[HousingPrices].extract(_.yrSold.toIntegral).asPredictor

In [None]:
val saleType = FeatureBuilder.Text[HousingPrices].extract(_.saleType.toText).asPredictor.indexed()

In [None]:
val saleCondition = FeatureBuilder.Text[HousingPrices]
  .extract(_.saleCondition.toText).asPredictor.indexed()

In [None]:
val salePrice = FeatureBuilder.RealNN[HousingPrices].extract(_.salePrice.toRealNN).asResponse

In [None]:
val trainFilePath = "/home/beakerx/helloworld/src/main/resources/HousingPricesDataset/train_lf_la_ls_ys_st_sc.csv"

Create a training data reader from the `trainFilePath` using `DataReaders.Simple.csvCase`:

In [None]:
val trainDataReader = DataReaders.Simple.csvCase[HousingPrices](
      path = Option(trainFilePath)
    )

### Create a feature sequence and transmogrify it

`.transmogrify()` is a TransmogrifAI shortcut that applies many estimators at once. This is in essence the automatic feature engineering stage of TransmogrifAI. This stage can be discarded in favor of hand-tuned feature engineering and manual vector creation, followed by combination using the VectorsCombiner transformer (shorthand `Seq(...).combine()`), if the user desires to have complete control over feature engineering.

A stage that can follow is another powerful TransmogrifAI estimator, the SanityChecker. The SanityChecker applies a variety of statistical tests to the data based on feature types and discards predictors that are indicative of label leakage or that show little to no predictive power. This is in essence the automatic feature selection stage of TransmogrifAI. The cell below does not apply it; it only transmogrifies the features and creates a `DataSplitter` for the model selector:

In [None]:
import com.salesforce.op.stages.impl.tuning.DataSplitter

// Combine the raw predictors into a single feature vector
val features = Seq(lotFrontage, area, lotShape, yrSold, saleType, saleCondition).transmogrify()
val randomSeed = 42L
// Splitter used by the model selector to hold out part of the data
val splitter = DataSplitter(seed = randomSeed)

### Model selector
Create a prediction feature (the model) with `RegressionModelSelector`. We are using Gradient Boosted Trees and Random Forest. Notice that `setInput` takes the response `salePrice` first, followed by the `features` vector.

In [None]:
import com.salesforce.op.stages.impl.regression.RegressionModelSelector
import com.salesforce.op.stages.impl.regression.RegressionModelsToTry.{OpGBTRegressor, OpRandomForestRegressor}

val prediction1 = RegressionModelSelector
      .withCrossValidation(
        dataSplitter = Some(splitter), seed = randomSeed,
        modelTypesToUse = Seq(OpGBTRegressor, OpRandomForestRegressor)
      ).setInput(salePrice,features).getOutput()
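
The selector above is built with `withCrossValidation`. To illustrate the general idea of k-fold cross-validation, here is a minimal plain-Scala sketch that splits row indices into folds. It is a generic illustration and not TransmogrifAI's internal implementation; `KFoldSketch`, `folds` and the fold count used in the note below are hypothetical.

```scala
// Generic k-fold split over row indices; not TransmogrifAI code.
object KFoldSketch {
  // Returns one (trainIndices, validationIndices) pair per fold.
  def folds(n: Int, k: Int, seed: Long): Seq[(Seq[Int], Seq[Int])] = {
    require(k >= 2 && k <= n, "k must be between 2 and n")
    // Shuffle once with a fixed seed so the split is reproducible.
    val shuffled = new scala.util.Random(seed).shuffle((0 until n).toList)
    (0 until k).map { fold =>
      // Every k-th position of the shuffled list goes to this fold's validation set.
      val (validation, train) =
        shuffled.zipWithIndex.partition { case (_, pos) => pos % k == fold }
      (train.map(_._1), validation.map(_._1))
    }
  }
}
```

With, say, `KFoldSketch.folds(10, 3, 42L)`, each candidate model would be trained three times, each time validated on a different third of the rows, and the averaged validation metric would be used to compare candidates.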

Create a regression evaluator and set its label and prediction columns with `setLabelCol` and `setPredictionCol`:

In [None]:
val evaluator = Evaluators.Regression().setLabelCol(salePrice).setPredictionCol(prediction1)

### Workflow and WorkflowModel

An `OpWorkflow` takes the final features that the user wants to generate as 
inputs and constructs the full DAG needed to generate them from the lineage of those features. 
It then fits any estimators in the DAG to create a sequence of transformations that 
is saved in a workflow model.

When we call `train` on this workflow, it automatically computes and executes the 
entire DAG of stages needed to compute the features, fitting all the estimators on the training data in the process. 
`workflow.train()` fits all of the estimators in the pipeline and returns a 
pipeline model containing only transformers. It uses data loaded as specified by the data reader to 
generate the initial data set.

Calling `score` on the fitted workflow model then transforms the underlying training data to 
produce a DataFrame with all the features manifested. The `score` method can optionally 
be passed an evaluator that produces metrics.
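
To make the "fit the estimators, keep only transformers" idea concrete, here is a toy sketch in plain Scala. It is a conceptual illustration under simplifying assumptions (a linear list of stages over a `Seq[Double]` rather than a DAG over a DataFrame), and none of the names below (`FitThenTransformSketch`, `MeanCenterer`, `Doubler`, `train`, `score`) are TransmogrifAI classes or methods.

```scala
// Toy illustration of fitting estimators into a model of transformers only.
object FitThenTransformSketch {
  // A transformer needs no training: it maps data to data.
  trait Transformer { def transform(data: Seq[Double]): Seq[Double] }
  // An estimator must see data first; fitting it yields a transformer.
  trait Estimator { def fit(data: Seq[Double]): Transformer }

  // Estimator: learns the mean, then centers values around it.
  object MeanCenterer extends Estimator {
    def fit(data: Seq[Double]): Transformer = {
      val mean = if (data.isEmpty) 0.0 else data.sum / data.size
      new Transformer {
        def transform(d: Seq[Double]): Seq[Double] = d.map(_ - mean)
      }
    }
  }

  // Plain transformer: nothing to learn.
  object Doubler extends Transformer {
    def transform(d: Seq[Double]): Seq[Double] = d.map(_ * 2)
  }

  // Walk the stages in order, fit each estimator on the data as transformed
  // so far, and collect the resulting transformers.
  def train(stages: Seq[Either[Estimator, Transformer]],
            data: Seq[Double]): Seq[Transformer] = {
    val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], data)) {
      case ((acc, current), stage) =>
        val t = stage match {
          case Left(estimator)    => estimator.fit(current)
          case Right(transformer) => transformer
        }
        (acc :+ t, t.transform(current))
    }
    fitted
  }

  // Scoring applies the fitted transformers in order, with no refitting.
  def score(model: Seq[Transformer], data: Seq[Double]): Seq[Double] =
    model.foldLeft(data)((d, t) => t.transform(d))
}
```

In this sketch the model returned by `train` remembers the mean learned from the training data, so scoring new data reuses that mean instead of recomputing it.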

In [None]:
val workflow = new OpWorkflow().setResultFeatures(prediction1, salePrice).setReader(trainDataReader)
val workflowModel = workflow.train()

### Score and evaluate the model

In [None]:
val (scores, metrics) = workflowModel.scoreAndEvaluate(evaluator)
scores.show(false)

In [None]:
metrics.toString()
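
Regression metrics typically include quantities such as root mean squared error, mean absolute error and R². As a sketch of what these quantities measure, the plain-Scala code below computes them by hand. It is independent of TransmogrifAI; `RegressionMetricsSketch`, `Metrics` and `compute` are hypothetical names, and the values in the note below are made up for illustration.

```scala
// Hand-rolled regression metrics for illustration; not TransmogrifAI code.
object RegressionMetricsSketch {
  final case class Metrics(rmse: Double, mae: Double, r2: Double)

  def compute(labels: Seq[Double], predictions: Seq[Double]): Metrics = {
    require(labels.nonEmpty && labels.size == predictions.size,
      "labels and predictions must be non-empty and of equal length")
    val errors = labels.zip(predictions).map { case (y, p) => y - p }
    val residualSumSquares = errors.map(e => e * e).sum
    val mse = residualSumSquares / errors.size
    val mae = errors.map(e => math.abs(e)).sum / errors.size
    val mean = labels.sum / labels.size
    val totalSumSquares = labels.map(y => (y - mean) * (y - mean)).sum
    // R2 = 1 - SS_res / SS_tot; reported as NaN when all labels are identical.
    val r2 =
      if (totalSumSquares == 0.0) Double.NaN
      else 1.0 - residualSumSquares / totalSumSquares
    Metrics(math.sqrt(mse), mae, r2)
  }
}
```

For example, `RegressionMetricsSketch.compute(Seq(1.0, 2.0, 3.0), Seq(1.0, 2.0, 5.0))` has a single error of magnitude 2 across three rows, so the mean absolute error is 2/3, while R² is negative because the predictions fit worse than always predicting the label mean.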