Update helloworld examples to be simple #351

Merged: 5 commits, Jul 2, 2019
**docs/examples/Boston-Regression.md** (76 changes: 33 additions & 43 deletions)
@@ -1,6 +1,6 @@
# Boston Regression

-The following code illustrates how TransmogrifAI can be used to do linear regression. We use the Boston dataset to predict housing prices.
+The following code illustrates how TransmogrifAI can be used to do linear regression. We use the Boston dataset to predict housing prices. This example is very similar to the Titanic Binary Classification example, so you should look over that example first if you have not already.
The code for this example can be found [here](https://github.com/salesforce/TransmogrifAI/tree/master/helloworld/src/main/scala/com/salesforce/hw/boston), and the data [here](https://github.com/salesforce/op/tree/master/helloworld/src/main/resources/BostonDataset).

**Define features**
@@ -25,58 +25,48 @@
val medv = FeatureBuilder.RealNN[BostonHouse].extract(_.medv.toRealNN).asResponse
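The rest of the feature definitions are collapsed in this diff view. As a rough sketch only (the field names come from the feature list below; the exact feature types in the real file may differ), the predictors are defined along these lines:

```scala
// Hypothetical sketch of the collapsed predictor definitions; the real file
// may choose different feature types for some columns.
val crim = FeatureBuilder.Real[BostonHouse].extract(_.crim.toReal).asPredictor
val rm = FeatureBuilder.Real[BostonHouse].extract(_.rm.toReal).asPredictor
// ...and similarly for zn, indus, chas, nox, age, dis, rad, tax, ptratio, b, lstat
```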
**Feature Engineering**

```scala
-val houseFeatures = Seq(crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat).transmogrify()
+val features = Seq(crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat).transmogrify()
+val label = medv
+val checkedFeatures = label.sanityCheck(features, removeBadFeatures = true)
```

**Modeling & Evaluation**

For regression problems, we use ```RegressionModelSelector``` to choose which type of model to use, which in this case is Linear Regression. You can find more model types [here](../developer-guide#modelselector).

```scala
val prediction = RegressionModelSelector
-  .withCrossValidation(dataSplitter = Option(DataSplitter(seed = randomSeed)), seed = randomSeed)
-  .setRandomForestSeed(randomSeed)
-  .setGradientBoostedTreeSeed(randomSeed)
-  .setInput(medv, houseFeatures)
-  .getOutput()
+  .withTrainValidationSplit(
+    modelTypesToUse = Seq(OpLinearRegression))
+  .setInput(label, checkedFeatures).getOutput()

val workflow = new OpWorkflow().setResultFeatures(prediction)

-val evaluator = Evaluators.Regression().setLabelCol(medv).setPredictionCol(prediction)
+val evaluator = Evaluators.Regression().setLabelCol(label).setPredictionCol(prediction)

val model = workflow.train()
```

**Results**

-def runner(opParams: OpParams): OpWorkflowRunner =
-  new OpWorkflowRunner(
-    workflow = workflow,
-    trainingReader = trainingReader,
-    scoringReader = scoringReader,
-    evaluationReader = Option(trainingReader),
-    evaluator = Option(evaluator),
-    scoringEvaluator = None,
-    featureToComputeUpTo = Option(houseFeatures)
-  )
+We can extract each feature's contribution to the model via ```ModelInsights```, as in the Titanic Binary Classification example.

+```scala
+val modelInsights = model.modelInsights(prediction)
+val modelFeatures = modelInsights.features.flatMap( feature => feature.derivedFeatures)
+val featureContributions = modelFeatures.map( feature => (feature.derivedFeatureName,
+  feature.contribution.map( contribution => math.abs(contribution))
+    .foldLeft(0.0) { (max, contribution) => math.max(max, contribution)}))
+val sortedContributions = featureContributions.sortBy( contribution => -contribution._2)
+
+val (scores, metrics) = model.scoreAndEvaluate(evaluator = evaluator)
+```
-You can run the code using the following commands for train, score and evaluate:
+You can run the code using the following command:

```bash
cd helloworld
./gradlew compileTestScala installDist
```
-**Train**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.boston.OpBoston -Dargs="\
---run-type=train \
---model-location=/tmp/boston-model \
---read-location BostonHouse=`pwd`/src/main/resources/BostonDataset/housing.data"
-```
-**Score**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.boston.OpBoston -Dargs="\
---run-type=score \
---model-location=/tmp/boston-model \
---read-location BostonHouse=`pwd`/src/main/resources/BostonDataset/housing.data \
---write-location=/tmp/boston-scores"
-```
-**Evaluate**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.boston.OpBoston -Dargs="\
---run-type=evaluate \
---read-location BostonHouse=`pwd`/src/main/resources/BostonDataset/housing.data \
---write-location=/tmp/boston-eval \
---model-location=/tmp/boston-model \
---metrics-location=/tmp/boston-metrics"
+./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.OpBostonSimple -Dargs="\
+`pwd`/src/main/resources/BostonDataset/housingData.csv"
```
**docs/examples/Iris-MultiClass-Classification.md** (106 changes: 56 additions & 50 deletions)
@@ -1,72 +1,78 @@
# Iris MultiClass Classification

-The following code illustrates how TransmogrifAI can be used to do multiclass classification over the Iris dataset.
-The code for this example can be found [here](https://github.com/salesforce/TransmogrifAI/tree/master/helloworld/src/main/scala/com/salesforce/hw/iris), and the data [here](https://github.com/salesforce/op/tree/master/helloworld/src/main/resources/IrisDataset).
+The following code illustrates how TransmogrifAI can be used to do multiclass classification over the Iris dataset. This example is very similar to the Titanic Binary Classification example, so you should look over that example first if you have not already.
+The code for this example can be found [here](https://github.com/salesforce/TransmogrifAI/tree/master/helloworld/src/main/scala/com/salesforce/hw/OpIrisSimple.scala), and the data [here](https://github.com/salesforce/op/tree/master/helloworld/src/main/resources/IrisDataset/iris.csv).

**Data Schema**

```scala
case class Iris
(
id: Int,
sepalLength: Double,
sepalWidth: Double,
petalLength: Double,
petalWidth: Double,
irisClass: String
)
```

-**Define Features**

+**Define features**
```scala
val id = FeatureBuilder.Integral[Iris].extract(_.getID.toIntegral).asPredictor
val sepalLength = FeatureBuilder.Real[Iris].extract(_.getSepalLength.toReal).asPredictor
val sepalWidth = FeatureBuilder.Real[Iris].extract(_.getSepalWidth.toReal).asPredictor
val petalLength = FeatureBuilder.Real[Iris].extract(_.getPetalLength.toReal).asPredictor
val petalWidth = FeatureBuilder.Real[Iris].extract(_.getPetalWidth.toReal).asPredictor
val irisClass = FeatureBuilder.Text[Iris].extract(_.getClass$.toText).asResponse

```

**Feature Engineering**

```scala
-val labels = irisClass.indexed()
val features = Seq(sepalLength, sepalWidth, petalLength, petalWidth).transmogrify()
+val label = irisClass.indexed()
+val checkedFeatures = label.sanityCheck(features, removeBadFeatures = true)
```

**Modeling & Evaluation**

For multiclass classification, we use the ```MultiClassificationModelSelector``` to select the model we want to run, which in this case is Logistic Regression. You can find more information on model selection [here](../developer-guide#modelselector).

```scala
-val pred = MultiClassificationModelSelector
-  .withCrossValidation(splitter = Some(DataCutter(reserveTestFraction = 0.2, seed = randomSeed)), seed = randomSeed)
-  .setInput(labels, features).getOutput()
-
-private val evaluator = Evaluators.MultiClassification.f1()
-  .setLabelCol(labels)
-  .setPredictionCol(pred)
-
-private val wf = new OpWorkflow().setResultFeatures(pred, labels)
-
-def runner(opParams: OpParams): OpWorkflowRunner =
-  new OpWorkflowRunner(
-    workflow = wf,
-    trainingReader = irisReader,
-    scoringReader = irisReader,
-    evaluationReader = Option(irisReader),
-    evaluator = Option(evaluator),
-    featureToComputeUpTo = Option(features)
-  )
+val prediction = MultiClassificationModelSelector
+  .withTrainValidationSplit(
+    modelTypesToUse = Seq(OpLogisticRegression))
+  .setInput(label, checkedFeatures).getOutput()
+
+val evaluator = Evaluators.MultiClassification()
+  .setLabelCol(label)
+  .setPredictionCol(prediction)
+
+val workflow = new OpWorkflow().setResultFeatures(prediction, label).setReader(dataReader)
+
+val model = workflow.train()
```
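The ```dataReader``` used by the workflow is defined outside this excerpt. A minimal sketch of how it is presumably constructed (mirroring the Titanic example's CSV reader; the ```csvFilePath``` value here is an assumption):

```scala
// Assumed construction of dataReader, following the Titanic example;
// csvFilePath would point at src/main/resources/IrisDataset/iris.csv.
val dataReader = DataReaders.Simple.csvCase[Iris](path = Option(csvFilePath), key = _.id.toString)
```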
-You can run the code using the following commands for train, score and evaluate:

+**Results**
+
+We can still compute each feature's contribution to the model, but in multiclass classification ```ModelInsights``` reports each feature's contribution to the prediction of every class. The code below takes the maximum of these per-class contributions as the overall contribution.
+
+```scala
+val modelInsights = model.modelInsights(prediction)
+val modelFeatures = modelInsights.features.flatMap( feature => feature.derivedFeatures)
+val featureContributions = modelFeatures.map( feature => (feature.derivedFeatureName,
+  feature.contribution.map( contribution => math.abs(contribution))
+    .foldLeft(0.0) { (max, contribution) => math.max(max, contribution)}))
+val sortedContributions = featureContributions.sortBy( contribution => -contribution._2)
+
+val (scores, metrics) = model.scoreAndEvaluate(evaluator = evaluator)
+```

+You can run the code using the following command:
```bash
cd helloworld
./gradlew compileTestScala installDist
```
-**Train**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.iris.OpIris -Dargs="\
---run-type=train \
---model-location=/tmp/iris-model \
---read-location Iris=`pwd`/src/main/resources/IrisDataset/iris.data"
-```
-**Score**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.iris.OpIris -Dargs="\
---run-type=score \
---model-location=/tmp/iris-model \
---read-location Iris=`pwd`/src/main/resources/IrisDataset/bezdekIris.data \
---write-location=/tmp/iris-scores"
-```
-**Evaluate**
-```bash
-./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.iris.OpIris -Dargs="\
---run-type=evaluate \
---model-location=/tmp/iris-model \
---metrics-location=/tmp/iris-metrics \
---read-location Iris=`pwd`/src/main/resources/IrisDataset/bezdekIris.data \
---write-location=/tmp/iris-eval"
+./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.OpIrisSimple -Dargs="\
+`pwd`/src/main/resources/IrisDataset/iris.csv"
```
**docs/examples/Titanic-Binary-Classification.md** (43 changes: 23 additions & 20 deletions)
@@ -36,13 +36,13 @@
In the main function, we create a Spark session as per usual:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

-val conf = new SparkConf().setAppName("TitanicPrediction")
-implicit val spark = SparkSession.builder.config(conf).getOrCreate()
+implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
import spark.implicits._ // Needed for Encoders for the Passenger case class
```

-We then define the set of raw features that we would like to extract from the data. The raw features are defined using [FeatureBuilders](/Developer-Guide#featurebuilders), and are strongly typed. TransmogrifAI supports the following basic feature types: Text, Numeric, Vector, List, Set, Map. In addition, it supports many specific feature types which extend these base types: Email extends Text; Integral, Real and Binary extend Numeric; Currency and Percentage extend Real. For a complete view of the types supported, see the [Type Hierarchy and Automatic Feature Engineering](/Developer-Guide#type-hierarchy-and-automatic-feature-engineering) section in the documentation.
+We then define the set of raw features that we would like to extract from the data. The raw features are defined using [FeatureBuilders](../developer-guide#featurebuilders), and are strongly typed. TransmogrifAI supports the following basic feature types: Text, Numeric, Vector, List, Set, Map. In addition, it supports many specific feature types which extend these base types: Email extends Text; Integral, Real and Binary extend Numeric; Currency and Percentage extend Real. For a complete view of the types supported, see the [Type Hierarchy and Automatic Feature Engineering](../developer-guide#type-hierarchy-and-automatic-feature-engineering) section in the documentation.
Collaborator: I don't think ../ prefixes would work once we deploy to https://docs.transmogrif.ai. Were the links broken? How did you test them?

Contributor (author): The links are broken on the website, but only for that paragraph. I just imitated the style of the links that work in the other paragraphs. How do I link a page that will work when deployed?

Collaborator: Try running the docs server locally as described here: https://github.com/salesforce/TransmogrifAI/tree/master/docs#docs

Collaborator: Oh, you're right. We actually have them as ../ elsewhere.

Contributor (author): They work.


-Basic FeatureBuilders will be created for you if you use the TransmogrifAI CLI to bootstrap your project as described [here](/examples/Bootstrap-Your-First-Project.html). However, it is often useful to edit this code to customize feature generation and take full advantage of the Feature types available (selecting the appropriate type will improve automatic feature engineering steps).
+Basic FeatureBuilders will be created for you if you use the TransmogrifAI CLI to bootstrap your project as described [here](../examples/Bootstrap-Your-First-Project.html). However, it is often useful to edit this code to customize feature generation and take full advantage of the Feature types available (selecting the appropriate type will improve automatic feature engineering steps).

When defining raw features, specify the extract logic to be applied to the raw data, and also annotate the features as either predictor or response variables via the FeatureBuilders:
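The feature definition code itself is collapsed in this diff view. As an illustrative sketch of the strongly typed builders described above (these exact lines are not quoted from the diff, and the field types on the Passenger case class are assumptions):

```scala
// Illustrative only: typed FeatureBuilders for a few Titanic fields.
// Picking a specific type (e.g. PickList instead of plain Text for sex)
// improves the automatic feature engineering applied downstream.
val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse
val sex = FeatureBuilder.PickList[Passenger].extract(_.sex.map(_.toString).toPickList).asPredictor
val age = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor
```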
@@ -114,16 +114,18 @@
The next stage applies another powerful AutoML Estimator — the [SanityChecker]

```scala
// Optionally check the features with a sanity checker
-val sanityCheck = false
-val finalFeatures = if (sanityCheck) survived.sanityCheck(passengerFeatures) else passengerFeatures
+val checkedFeatures = survived.sanityCheck(passengerFeatures, removeBadFeatures = true)
```
Finally, the OpLogisticRegression Estimator is applied to derive a new triplet of Features which are essentially probabilities and predictions returned by the logistic regression algorithm:

```scala
-// Define the model we want to use (here a simple logistic regression) and get the resulting output
import com.salesforce.op.stages.impl.classification.OpLogisticRegression

-val prediction = new OpLogisticRegression().setInput(survived, finalFeatures).getOutput
+// Define the model we want to use (here a simple logistic regression) and get the resulting output
+val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(
+  modelTypesToUse = Seq(OpLogisticRegression)
+).setInput(survived, checkedFeatures).getOutput()
```
The [ModelSelector](../automl-capabilities#modelselectors) used here is a powerful AutoML Estimator that can automatically try out a variety of different classification algorithms and then select the best one.
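For example, trying several model families is just a matter of passing more entries to ```modelTypesToUse``` (a sketch, assuming the same inputs as above; ```OpRandomForestClassifier``` is one of the binary classification model types the selector supports):

```scala
// Sketch: let the selector fit several model families and keep the best one.
val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(
  modelTypesToUse = Seq(OpLogisticRegression, OpRandomForestClassifier)
).setInput(survived, checkedFeatures).getOutput()
```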

@@ -132,40 +132,41 @@
Notice that everything we've done so far has been purely at the level of definitions.
```scala
import com.salesforce.op.readers.DataReaders

-val trainDataReader = DataReaders.Simple.csvCase[Passenger](
-  path = Some(csvFilePath), // location of data file
-  key = _.id.toString // identifier for entity being modeled
-)
+val dataReader = DataReaders.Simple.csvCase[Passenger](path = Option(csvFilePath), key = _.id.toString)

val workflow =
  new OpWorkflow()
    .setResultFeatures(survived, prediction)
-    .setReader(trainDataReader)
+    .setReader(dataReader)
```

-When we now call 'train' on this workflow, it automatically computes and executes the entire DAG of Stages needed to compute the features ```survived, prediction, rawPrediction```, and ```prob```, fitting all the estimators on the training data in the process. Calling ```score``` on the fitted workflow then transforms the underlying training data to produce a DataFrame with all the features manifested. The ```score``` method can optionally be passed an evaluator that produces metrics.
+When we now call 'train' on this workflow, it automatically computes and executes the entire DAG of Stages needed to compute the features ```survived, prediction, rawPrediction```, and ```prob```, fitting all the estimators on the training data in the process. Calling ```scoreAndEvaluate``` on the model then transforms the underlying training data to produce a DataFrame with all the features manifested. The ```scoreAndEvaluate``` method can optionally be passed an evaluator that produces metrics.

```scala
import com.salesforce.op.evaluators.Evaluators

// Fit the workflow to the data
-val fittedWorkflow = workflow.train()
+val model = workflow.train()

val evaluator = Evaluators.BinaryClassification()
.setLabelCol(survived)
.setPredictionCol(prediction)

// Apply the fitted workflow to the train data and manifest
// the resulting dataframe together with metrics
-val (transformedTrainData, metrics) = fittedWorkflow.scoreAndEvaluate(evaluator = evaluator)
+val (scores, metrics) = model.scoreAndEvaluate(evaluator = evaluator)
```

-The fitted workflow can now be saved, and loaded again to be applied to any new data set of type Passengers by changing the reader.

+We can find information about the model via ```ModelInsights```. In this example, we extract each feature's contribution to the model by first listing all of the features that the model derives from the input features and then finding their maximum contribution. We can also find the correlation of each feature with the label, the mean of the feature values, the variance of the values, and so on. More information about ```ModelInsights``` can be found [here](../developer-guide#extracting-modelinsights-from-a-fitted-workflow).
```scala
-fittedWorkflow.save(saveWorkflowPath)
-
-val savedWorkflow = OpWorkflowModel.load(saveWorkflowPath).setReader(testDataReader)
+// Extract information (i.e. feature importance) via model insights
+val modelInsights = model.modelInsights(prediction)
+val modelFeatures = modelInsights.features.flatMap( feature => feature.derivedFeatures)
+val featureContributions = modelFeatures.map( feature => (feature.derivedFeatureName,
+  feature.contribution.map( contribution => math.abs(contribution))
+    .foldLeft(0.0) { (max, contribution) => math.max(max, contribution)}))
+val sortedContributions = featureContributions.sortBy( contribution => -contribution._2)
```
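Although this walkthrough no longer covers it, the fitted model can still be saved and reloaded to score any new data set of Passengers by swapping the reader. A minimal sketch based on the API visible in the removed lines above (```saveWorkflowPath``` is a placeholder for any writable location):

```scala
// Persist the fitted workflow, then reload it and point it at new data.
model.save(saveWorkflowPath)
val savedModel = OpWorkflowModel.load(saveWorkflowPath).setReader(dataReader)
```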


