TransmogrifAI


TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation and an API that enforces compile-time type safety, modularity, and reuse. Through automation, it achieves accuracies close to those of hand-tuned models with an almost 100x reduction in development time.

Use TransmogrifAI if you need a machine learning library to:

  • Build production-ready machine learning applications in hours, not months
  • Build machine learning models without getting a Ph.D. in machine learning
  • Build modular, reusable, strongly typed machine learning workflows

Skip to Quick Start and Documentation.

Predicting Titanic Survivors with TransmogrifAI

The Titanic dataset is an often-cited dataset in the machine learning community. The goal is to build a machine-learned model that predicts which passengers survived, given the Titanic passenger manifest. Here is how you would build the model using TransmogrifAI:

import com.salesforce.op._
import com.salesforce.op.readers._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
import spark.implicits._

// Read Titanic data as a DataFrame
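// (Passenger is a case class matching the Titanic CSV schema and pathToData is
//  the path to the CSV file; both are assumed to be defined elsewhere)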
val passengersData = DataReaders.Simple.csvCase[Passenger](path = pathToData).readDataset().toDF()

// Extract response and predictor Features
val (survived, predictors) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")

// Automated feature engineering
val featureVector = predictors.transmogrify()

// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

// Automated model selection
val pred = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(pred).train()

println("Model summary:\n" + model.summaryPretty())

Model summary:

Evaluated Logistic Regression, Random Forest models with 3 folds and AuPR metric.
Evaluated 3 Logistic Regression models with AuPR between [0.6751930383321765, 0.7768725281794376]
Evaluated 16 Random Forest models with AuPR between [0.7781671467343991, 0.8104798040316159]

Selected model Random Forest classifier with parameters:
|-----------------------|--------------|
| Model Param           |     Value    |
|-----------------------|--------------|
| modelType             | RandomForest |
| featureSubsetStrategy |         auto |
| impurity              |         gini |
| maxBins               |           32 |
| maxDepth              |           12 |
| minInfoGain           |        0.001 |
| minInstancesPerNode   |           10 |
| numTrees              |           50 |
| subsamplingRate       |          1.0 |
|-----------------------|--------------|

Model evaluation metrics:
|-------------|--------------------|---------------------|
| Metric Name | Hold Out Set Value |  Training Set Value |
|-------------|--------------------|---------------------|
| Precision   |               0.85 |   0.773851590106007 |
| Recall      | 0.6538461538461539 |  0.6930379746835443 |
| F1          | 0.7391304347826088 |  0.7312186978297163 |
| AuROC       | 0.8821603927986905 |  0.8766642291593114 |
| AuPR        | 0.8225075757571668 |   0.850331080886535 |
| Error       | 0.1643835616438356 | 0.19682151589242053 |
| TP          |               17.0 |               219.0 |
| TN          |               44.0 |               438.0 |
| FP          |                3.0 |                64.0 |
| FN          |                9.0 |                97.0 |
|-------------|--------------------|---------------------|

Top model insights computed using correlation:
|-----------------------|----------------------|
| Top Positive Insights |      Correlation     |
|-----------------------|----------------------|
| sex = "female"        |   0.5177801026737666 |
| cabin = "OTHER"       |   0.3331391338844782 |
| pClass = 1            |   0.3059642953159715 |
|-----------------------|----------------------|
| Top Negative Insights |      Correlation     |
|-----------------------|----------------------|
| sex = "male"          |  -0.5100301587292186 |
| pClass = 3            |  -0.5075774968534326 |
| cabin = null          | -0.31463114463832633 |
|-----------------------|----------------------|

Top model insights computed using CramersV:
|-----------------------|----------------------|
|      Top Insights     |       CramersV       |
|-----------------------|----------------------|
| sex                   |    0.525557139885501 |
| embarked              |  0.31582347194683386 |
| age                   |  0.21582347194683386 |
|-----------------------|----------------------|
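
Once trained, the workflow model can also be saved and used to score new data. A minimal sketch, assuming a newPassengersData DataFrame with the same schema as the training data (the variable name and save path are illustrative):

// Persist the fitted workflow model for later reuse (illustrative path)
model.save("/tmp/titanic-model")

// Score previously unseen passengers with the trained model
val scores = model.setInputDataset(newPassengersData).score()
scores.show()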

While this may seem a bit too magical, for those who want more control TransmogrifAI also provides the flexibility to fully specify every feature being extracted and every algorithm being applied in your ML pipeline. Visit our docs site for full documentation, getting started guides, examples, the FAQ, and other information.
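For example, the automated steps above could be replaced with an explicitly specified pipeline along the lines of the sketch below. This is only an illustration: the Passenger field names and types are assumptions, and OpLogisticRegression is just one of the available algorithms rather than a recommended default.

// Declare each feature explicitly instead of inferring it from the DataFrame
// (assumed fields: survived: Double, sex: Option[String], age: Option[Double], pClass: Option[Int])
val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse
val sex      = FeatureBuilder.PickList[Passenger].extract(_.sex.map(_.toString).toPickList).asPredictor
val age      = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor
val pClass   = FeatureBuilder.PickList[Passenger].extract(_.pClass.map(_.toString).toPickList).asPredictor

// Hand-pick which features go into the feature vector
val featureVector = Seq(sex, age, pClass).transmogrify()

// Apply a specific algorithm instead of automated model selection
val prediction = new OpLogisticRegression().setInput(survived, featureVector).getOutput()

// Train using a typed reader over the same CSV
val manualModel = new OpWorkflow()
  .setReader(DataReaders.Simple.csvCase[Passenger](path = pathToData))
  .setResultFeatures(prediction)
  .train()

Each automated step (transmogrify, sanityCheck, model selection) can be swapped for explicit stages in the same way.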

Adding TransmogrifAI into your project

You can simply add TransmogrifAI as a regular dependency to an existing project.

For Gradle in build.gradle add:

repositories {
    jcenter()
    mavenCentral()
}
dependencies {
    // TransmogrifAI core dependency
    compile 'com.salesforce.transmogrifai:transmogrifai-core_2.11:0.4.0'

    // TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
    // compile 'com.salesforce.transmogrifai:transmogrifai-models_2.11:0.4.0'
}

For SBT in build.sbt add:

scalaVersion := "2.11.12"

resolvers += Resolver.jcenterRepo

// TransmogrifAI core dependency
libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.4.0"

// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
// libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-models" % "0.4.0"

Then import TransmogrifAI into your code:

// TransmogrifAI functionality: feature types, feature builders, feature dsl, readers, aggregators etc.
import com.salesforce.op._
import com.salesforce.op.aggregators._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.readers._

// Spark enrichments (optional)
import com.salesforce.op.utils.spark.RichDataset._
import com.salesforce.op.utils.spark.RichRDD._
import com.salesforce.op.utils.spark.RichRow._
import com.salesforce.op.utils.spark.RichMetadata._
import com.salesforce.op.utils.spark.RichStructType._

Quick Start and Documentation

Visit our docs site for full documentation, getting started guides, examples, the FAQ, and other information.

See scaladoc for the programming API.

Authors

Internal Contributors (prior to release)

License

BSD 3-Clause © Salesforce.com, Inc.