# Pima Indians Diabetes Classification

Add the TransmogrifAI Core jar (version 0.6.0) to the classpath.

In [None]:
%classpath add mvn com.salesforce.transmogrifai transmogrifai-core_2.11 0.6.0

Add Spark MLlib (version 2.3.2) to the classpath.

In [3]:
%classpath add mvn org.apache.spark spark-mllib_2.11 2.3.2

### Import the Scala packages and classes
Let us import the relevant classes for Spark and TransmogrifAI.

In [None]:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.udf

import com.salesforce.op._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.evaluators.Evaluators

In [None]:
// OpWorkflow and Evaluators are already covered by the imports in the previous cell.
import com.salesforce.op.readers.DataReaders

Instantiate a `SparkSession` and make it implicit so that the reader and workflow calls below can pick it up.

In [None]:
val conf = new SparkConf().setMaster("local[*]").setAppName("PimaIndiansClassification")
implicit val spark = SparkSession.builder.config(conf).getOrCreate()

### Schema class
Create the schema case class with one field per predictor variable plus the variable to be predicted (`piClass`).

In [None]:
case class PimaIndians
(
  numberOfTimesPreg: Double,
  plasmaGlucose: Double,
  bp: Double,
  spinThickness: Double,
  serumInsulin: Double,
  bmi: Double,
  diabetesPredigree : Double,
  ageInYrs : Double,
  piClass: String
)
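
As a rough illustration of how one CSV row lines up with these fields, the sketch below parses a comma-separated line by hand. `PimaRow` and `PimaRowParser.parseRow` are hypothetical names used only here (the data reader introduced later does this conversion itself), and the layout of eight numeric columns followed by the class label is an assumption about the file.

```scala
import scala.util.Try

// Hypothetical stand-in for the PimaIndians schema class above, same field order.
case class PimaRow(
  numberOfTimesPreg: Double, plasmaGlucose: Double, bp: Double, spinThickness: Double,
  serumInsulin: Double, bmi: Double, diabetesPredigree: Double, ageInYrs: Double,
  piClass: String
)

object PimaRowParser {
  // Returns None when the line does not have exactly nine columns
  // or when one of the first eight columns is not numeric.
  def parseRow(line: String): Option[PimaRow] = {
    val cols = line.split(",", -1).map(_.trim)
    if (cols.length != 9) None
    else Try(cols.take(8).map(_.toDouble)).toOption.map { d =>
      PimaRow(d(0), d(1), d(2), d(3), d(4), d(5), d(6), d(7), cols(8))
    }
  }
}
```

Calling `PimaRowParser.parseRow("6,148,72,35,0,33.6,0.627,50,1")` yields a populated `PimaRow` wrapped in `Some`, while a short or non-numeric line yields `None`.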

### Feature Creation
Create a feature for each field of the schema class and mark it as a predictor or as the response.

In [None]:
val numberOfTimesPreg = FeatureBuilder.Real[PimaIndians].extract(_.numberOfTimesPreg.toReal).asPredictor
val plasmaGlucose = FeatureBuilder.Real[PimaIndians].extract(_.plasmaGlucose.toReal).asPredictor
val bp = FeatureBuilder.Real[PimaIndians].extract(_.bp.toReal).asPredictor
val spinThickness = FeatureBuilder.Real[PimaIndians].extract(_.spinThickness.toReal).asPredictor
val serumInsulin = FeatureBuilder.Real[PimaIndians].extract(_.serumInsulin.toReal).asPredictor
val bmi = FeatureBuilder.Real[PimaIndians].extract(_.bmi.toReal).asPredictor
val diabetesPredigree = FeatureBuilder.Real[PimaIndians].extract(_.diabetesPredigree.toReal).asPredictor
val ageInYrs = FeatureBuilder.Real[PimaIndians].extract(_.ageInYrs.toReal).asPredictor
val piClass = FeatureBuilder.Text[PimaIndians].extract(_.piClass.toText).asResponse

Set the path to the training data.

In [None]:
val trainFilePath = "../src/main/resources/PimaIndiansDataset/primaindiansdiabetes.data"

Instantiate `trainDataReader`, which is a CSV reader.

`DataReaders.Simple.csvCase` is a factory method that creates a data reader for CSV data. Each CSV record is converted to an instance of the given case class, here `PimaIndians`.

In [None]:
import spark.implicits._

val trainDataReader = DataReaders.Simple.csvCase[PimaIndians](
  path = Option(trainFilePath)
)

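Create the sequence of predictor features and call `.transmogrify()` on it to apply automated feature engineering and combine the results into a single feature vector. Also create a `DataSplitter`.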

In [None]:
import com.salesforce.op.stages.impl.tuning.{DataCutter, DataSplitter}

val features = Seq(numberOfTimesPreg, plasmaGlucose, bp, spinThickness, serumInsulin,
  bmi, diabetesPredigree, ageInYrs).transmogrify()
val randomSeed = 42L
val splitter = DataSplitter(seed = randomSeed)

Create an `Encoder` for the case class `PimaIndians` and index the text label so that it can be used as a numeric response.

In [None]:
import org.apache.spark.sql.Encoders

implicit val piEncoder = Encoders.product[PimaIndians]
//val piReader = DataReaders.Simple.csvCase[PimaIndians]()
val labels = piClass.indexed()
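
To make the idea of label indexing concrete, here is a minimal plain-Scala sketch that maps each distinct label string to a numeric index. It assumes frequency-ordered indices (the most frequent label gets 0.0), which is how Spark's `StringIndexer` behaves by default; `LabelIndexer` is a hypothetical helper and not what `indexed()` runs internally.

```scala
object LabelIndexer {
  // Builds a label -> index map, most frequent label first (index 0.0).
  // Ties are broken alphabetically here; that tie-break is an assumption of this sketch.
  def fit(labels: Seq[String]): Map[String, Double] =
    labels.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .zipWithIndex
      .map { case ((label, _), index) => label -> index.toDouble }
      .toMap

  // Replaces each label by its index; labels not seen during fit are dropped.
  def transform(mapping: Map[String, Double], labels: Seq[String]): Seq[Double] =
    labels.flatMap(label => mapping.get(label))
}
```

For the labels `Seq("0", "1", "0", "0", "1")`, `fit` returns `Map("0" -> 0.0, "1" -> 1.0)` because `"0"` occurs more often.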

Import `MultiClassificationModelSelector`, configure it with cross-validation and a `DataCutter` as the splitter, and set its inputs.

In [None]:
import com.salesforce.op.stages.impl.classification.MultiClassificationModelSelector
import com.salesforce.op.stages.impl.tuning.DataCutter

val cutter = DataCutter(reserveTestFraction = 0.2, seed = randomSeed)
val prediction = MultiClassificationModelSelector
    .withCrossValidation(splitter = Option(cutter), seed = randomSeed)
    .setInput(labels, features).getOutput()
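
The `reserveTestFraction = 0.2` argument holds back roughly a fifth of the data for testing. The sketch below shows the general idea of a seeded holdout split in plain Scala. `HoldoutSketch` is a hypothetical helper: it produces an exact split, whereas the actual splitter may sample rows differently and, in the case of `DataCutter`, may apply extra label handling that is not modelled here.

```scala
import scala.util.Random

object HoldoutSketch {
  // Shuffles with a fixed seed and reserves the given fraction of rows for testing.
  // Returns (train, test).
  def split[A](rows: Seq[A], reserveTestFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(rows)
    val testSize = math.round(rows.size * reserveTestFraction).toInt
    val (test, train) = shuffled.splitAt(testSize)
    (train, test)
  }
}
```

Because the shuffle is seeded, calling `split` twice with the same seed returns the same partition, which is the reason the notebook passes `randomSeed` everywhere.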

Create a multiclass classification evaluator that reports the F1 score.

In [None]:
val evaluator = Evaluators.MultiClassification.f1().setLabelCol(labels).setPredictionCol(prediction)
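
For intuition about the metric, the following sketch computes a multiclass F1 score from (actual, predicted) label pairs. It assumes the weighted-average definition, where each class's F1 is weighted by how often the class occurs among the actual labels; whether the evaluator above uses exactly this weighting is not stated in this notebook, and `F1Sketch` is a hypothetical helper.

```scala
object F1Sketch {
  // Per-class F1 from true positives, false positives and false negatives.
  def f1ForClass(label: Double, pairs: Seq[(Double, Double)]): Double = {
    val tp = pairs.count { case (actual, predicted) => actual == label && predicted == label }
    val fp = pairs.count { case (actual, predicted) => actual != label && predicted == label }
    val fn = pairs.count { case (actual, predicted) => actual == label && predicted != label }
    if (tp == 0) 0.0 else 2.0 * tp / (2.0 * tp + fp + fn)
  }

  // Average of per-class F1 scores, weighted by class frequency in the actual labels.
  def weightedF1(pairs: Seq[(Double, Double)]): Double = {
    val total = pairs.size.toDouble
    pairs.map(_._1).distinct.map { label =>
      val weight = pairs.count(_._1 == label) / total
      weight * f1ForClass(label, pairs)
    }.sum
  }
}
```

With the pairs `(0,0), (0,0), (0,1), (1,1)`, class 0 has F1 = 0.8 and class 1 has F1 = 2/3, so the weighted score is 0.75 * 0.8 + 0.25 * 2/3.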

### Workflow

Once all the Features and Feature transformations have been defined, actual data can be materialized by adding the desired Features to a TransmogrifAI Workflow and feeding it a DataReader. When the Workflow is trained, it infers the entire DAG of Features, Transformers, and Estimators that are needed to materialize the result Features. It then prepares this DAG by passing the data specified by the DataReader through the DAG and fitting all the intermediate Estimators in the DAG to Transformers.

In [None]:
val workflow = new OpWorkflow().setResultFeatures(prediction, labels).setReader(trainDataReader)

### Fitting the workflow
When a workflow is fitted, a number of things happen: the data is read using the DataReader, raw Features are built, each Stage is executed in sequence, and all Features are materialized and added to the underlying DataFrame. During Stage execution, each Estimator is fitted and becomes a Transformer. A fitted Workflow (i.e. an OpWorkflowModel) therefore contains a sequence of Transformers (map operations) which can be applied to any input data of the appropriate type.

In [None]:
val workflowModel = workflow.train()
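
The estimator-to-transformer idea described above can be sketched with toy types. Everything in this block (`ToyEstimator`, `ToyTransformer`, `MeanCenterer`, `ToyWorkflow`) is hypothetical and unrelated to the real TransmogrifAI or Spark classes; it only illustrates that fitting turns each estimator into a plain map operation that can later be applied to new data.

```scala
// Hypothetical toy interfaces; these are NOT the TransmogrifAI or Spark classes.
trait ToyTransformer { def transform(input: Seq[Double]): Seq[Double] }
trait ToyEstimator { def fit(values: Seq[Double]): ToyTransformer }

// An estimator that learns the mean of the training data.
object MeanCenterer extends ToyEstimator {
  def fit(values: Seq[Double]): ToyTransformer = {
    val mean = if (values.isEmpty) 0.0 else values.sum / values.size
    new ToyTransformer {
      // The fitted transformer is a plain map operation using the learned mean.
      def transform(input: Seq[Double]): Seq[Double] = input.map(_ - mean)
    }
  }
}

object ToyWorkflow {
  // Fits each estimator in order on the output of the previous stage
  // and returns the fitted transformers (the "model").
  def train(stages: Seq[ToyEstimator], data: Seq[Double]): Seq[ToyTransformer] = {
    val start = (Seq.empty[ToyTransformer], data)
    val (fitted, _) = stages.foldLeft(start) { case ((done, current), stage) =>
      val transformer = stage.fit(current)
      (done :+ transformer, transformer.transform(current))
    }
    fitted
  }

  // Applies the fitted transformers, in order, to new data.
  def score(model: Seq[ToyTransformer], data: Seq[Double]): Seq[Double] =
    model.foldLeft(data)((current, transformer) => transformer.transform(current))
}
```

With a single `MeanCenterer` stage trained on `Seq(1.0, 2.0, 3.0)`, the learned mean is 2.0, so scoring `Seq(2.0, 5.0)` gives `Seq(0.0, 3.0)`.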

Score and evaluate the model.

In [None]:
val (scores, metrics) = workflowModel.scoreAndEvaluate(evaluator)

### Extracting ModelInsights from a Fitted Workflow

In [None]:
val modelInsights = workflowModel.modelInsights(prediction)
val labelSummary = modelInsights.label
println("labelName: " + labelSummary.labelName)
println("rawFeatureName: "+ labelSummary.rawFeatureName)
println("stagesApplied: " + labelSummary.stagesApplied)

Extract the derived model features from `modelInsights` and print each one with its first contribution value.

In [None]:
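val modelFeatures = modelInsights.features.flatMap(feature => feature.derivedFeatures)
// headOption avoids an exception for derived features that have no contribution values
modelFeatures.foreach(x => println(x.derivedFeatureName + ", " + x.contribution.headOption.getOrElse(0.0)))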

In [None]:
// For each derived feature, keep the largest absolute contribution value
val featureContributions = modelFeatures.map(feature => (feature.derivedFeatureName,
  feature.contribution.map(contribution => math.abs(contribution))
    .foldLeft(0.0) { (max, contribution) => math.max(max, contribution) }))

val sortedContributions = featureContributions.sortBy( contribution => -contribution._2)
sortedContributions.foreach(x => println(x._1 + "," + x._2) )
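
The ranking logic in the last cell can be tried in isolation on plain tuples. This is a minimal sketch with a hypothetical `ContributionRanking` helper; it takes (name, contributions) pairs instead of the objects returned by `modelInsights`.

```scala
object ContributionRanking {
  // For each feature, keep the largest absolute contribution (0.0 if there are none),
  // then sort the features by that value in descending order.
  def rank(contributions: Seq[(String, Seq[Double])]): Seq[(String, Double)] =
    contributions
      .map { case (name, values) =>
        (name, values.map(v => math.abs(v)).foldLeft(0.0)((a, b) => math.max(a, b)))
      }
      .sortBy { case (_, maxAbs) => -maxAbs }
}
```

For made-up inputs such as `Seq(("bmi", Seq(0.1, -0.4)), ("bp", Seq(0.2)))`, `rank` puts `bmi` first with 0.4, because the sign of a contribution is ignored when ranking.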