<a id="toc"></a>
## Table of Contents

[Spark](#spark)

[TransmogrifAI](#automl)


<a id="scala"></a>
## Scala

[Table of Contents](#toc)

In [1]:
scala.util.Properties.versionString

version 2.11.12

<a id="automl"></a>
## Get TransmogrifAI

[Table of Contents](#toc)

In [2]:
%classpath add mvn com.salesforce.transmogrifai transmogrifai-core_2.11 0.5.0

<a id="spark"></a>
## Get Spark 

[Table of Contents](#toc)

In [3]:
%classpath add mvn org.apache.spark spark-mllib_2.11 2.3.0

Adding `spark-mllib` also pulls in `spark-core` and `spark-sql` as transitive dependencies.

## Get Started

In [4]:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext


In [7]:
import com.salesforce.op._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import com.salesforce.op.evaluators.Evaluators


In [8]:
org.apache.log4j.Logger.getRootLogger().setLevel(org.apache.log4j.Level.WARN);

In [9]:
val conf = new SparkConf().setMaster("local[*]").setAppName("automl-app") // Spark configuration
val sc = new SparkContext(conf)  // initialize spark context
val sqlContext = new org.apache.spark.sql.SQLContext(sc)  // initialize sql context
implicit val spark = SparkSession.builder.config(conf).getOrCreate() // start spark session 
import spark.implicits._

org.apache.spark.sql.SparkSession$implicits$@735019e9

## Get Data

Using the Spark SQL context, we'll read in the Titanic dataset.

In [10]:
val rawData = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("titanic_with_headers.csv")

[PassengerId: int, Survived: int ... 10 more fields]

In [11]:
rawData.printSchema

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [44]:
println(rawData.columns.mkString(","))
rawData.take(5).foreach(println)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
[1,0,3,Braund, Mr. Owen Harris,male,22.0,1,0,A/5 21171,7.25,null,S]
[2,1,1,Cumings, Mrs. John Bradley (Florence Briggs Thayer),female,38.0,1,0,PC 17599,71.2833,C85,C]
[3,1,3,Heikkinen, Miss. Laina,female,26.0,0,0,STON/O2. 3101282,7.925,null,S]
[4,1,1,Futrelle, Mrs. Jacques Heath (Lily May Peel),female,35.0,1,0,113803,53.1,C123,S]
[5,0,3,Allen, Mr. William Henry,male,35.0,0,0,373450,8.05,null,S]


One important note: we want to get to a model in as few steps as possible, so we require every numeric variable to be a double. This matters when the data is read into the feature engineering step, which by default handles a single numeric type plus strings.

In [13]:
// cast all non-double numeric types to double
rawData.createOrReplaceTempView("raw")
val passengerData = spark.sql("""
    select 
      cast(passengerId as double) as id, 
      cast(survived as double) as survived, 
      cast(pclass as double) as pclass,
      name, sex, age, 
      cast(sibsp as double) as sibsp, 
      cast(parch as double) as parch, 
      ticket, fare, cabin, embarked 
      from raw
""")

[id: double, survived: double ... 10 more fields]
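The casts in the query above were written by hand. As a rough sketch that is not part of the original notebook, the same select clause could be generated from the schema instead. The `schema` pairs are copied from the `printSchema` output, while `nonDoubleNumeric` and `castExpr` are hypothetical names introduced only for illustration. Unlike the hand-written query, this version keeps the original column names and does not rename `PassengerId` to `id`.

```scala
// Column name / type name pairs, copied from the printSchema output above.
val schema: Seq[(String, String)] = Seq(
  "PassengerId" -> "integer", "Survived" -> "integer", "Pclass" -> "integer",
  "Name" -> "string", "Sex" -> "string", "Age" -> "double",
  "SibSp" -> "integer", "Parch" -> "integer", "Ticket" -> "string",
  "Fare" -> "double", "Cabin" -> "string", "Embarked" -> "string"
)

// Assumption: these are the non-double numeric type names worth casting.
val nonDoubleNumeric: Set[String] = Set("integer", "long", "float", "short", "byte")

// Hypothetical helper: emit a cast for non-double numerics, pass other columns through.
def castExpr(name: String, typeName: String): String =
  if (nonDoubleNumeric.contains(typeName)) s"cast($name as double) as $name" else name

val selectClause: String =
  schema.map { case (name, typeName) => castExpr(name, typeName) }.mkString(", ")

println(s"select $selectClause from raw")
```

In the notebook itself, the pairs could probably be derived from `rawData.schema.fields` (name and `dataType.typeName`) rather than typed in, and the resulting string passed to `spark.sql`.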

## Declare target and features

Below, the `FeatureBuilder` object has a method `fromDataFrame`, to which we pass the DataFrame `passengerData` created above along with the name of the target variable, in this case `survived`. The `RealNN` type parameter (type parameterization is well beyond the scope of this talk) declares the response as a non-nullable real number; in short, it requires every numeric field in the DataFrame to have a Double data type, and you will get an error otherwise. Every field other than `survived` is treated as a potential feature in the feature engineering section.

In [14]:
val (target, features) = FeatureBuilder.fromDataFrame[RealNN](passengerData, response = "survived")
OutputCell.HIDDEN

The `features` variable returned in the previous cell is a sequence of features. As is, it is not entirely useful: our ML algorithms want numbers, and we still have text floating around in our feature set. So we will need to perform one-hot encoding.
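To make the one-hot encoding idea concrete, here is a minimal plain-Scala sketch. It is not how TransmogrifAI implements it; `oneHot` is a hypothetical helper, and the input values are the `sex` column of the five rows printed earlier.

```scala
// Hypothetical helper: one-hot encode values against a sorted vocabulary
// built from the distinct values that were seen.
def oneHot(values: Seq[String]): (Seq[String], Seq[Vector[Double]]) = {
  val vocab = values.distinct.sorted
  val encoded = values.map { v =>
    vocab.map(category => if (category == v) 1.0 else 0.0).toVector
  }
  (vocab, encoded)
}

// The `sex` values of the first five passengers shown above.
val (vocab, encoded) = oneHot(Seq("male", "female", "female", "female", "male"))

println(vocab.mkString(", "))
encoded.foreach(row => println(row.mkString("[", ", ", "]")))
```

Each text value becomes a vector with a single 1.0 in the position of its category, which is the numeric form a classifier can consume.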

In [15]:
features.foreach(i => println(i + "\n"))

Feature(name = id, uid = Real_000000000001, isResponse = false, originStage = FeatureGeneratorStage_000000000001, parents = [], distributions = [])

Feature(name = pclass, uid = Real_000000000003, isResponse = false, originStage = FeatureGeneratorStage_000000000003, parents = [], distributions = [])

Feature(name = name, uid = Text_000000000004, isResponse = false, originStage = FeatureGeneratorStage_000000000004, parents = [], distributions = [])

Feature(name = sex, uid = Text_000000000005, isResponse = false, originStage = FeatureGeneratorStage_000000000005, parents = [], distributions = [])

Feature(name = age, uid = Real_000000000006, isResponse = false, originStage = FeatureGeneratorStage_000000000006, parents = [], distributions = [])

Feature(name = sibsp, uid = Real_000000000007, isResponse = false, originStage = FeatureGeneratorStage_000000000007, parents = [], distributions = [])

Feature(name = parch, uid = Real_000000000008, isResponse = false, originStage = FeatureGenerat

### Transmogrify

Our next step will be to transmogrify the sequence of features into a feature vector.

This is like Spark's `VectorAssembler` on steroids.

The `transmogrify` method takes in a sequence of features, automatically applies default transformations to them based on feature types (e.g. imputation, null value tracking, one hot encoding, tokenization, split Emails and pivot out the top K domains) and combines them into a single vector.

Mind you, we have not declared any special types other than text, so it will probably just apply tokenization and bag of words to the text fields, null value tracking, and maybe some imputation on the numeric fields with null values.
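As a sketch of what imputation with null value tracking means for a numeric field, the plain-Scala snippet below fills missing values with the mean and adds an indicator column. This illustrates the general idea only and makes no claim about TransmogrifAI's internals or defaults; `imputeWithNullTracking` is a hypothetical helper and the ages are made-up illustrative values.

```scala
// Hypothetical helper: replace missing values with the mean of the present ones
// and pair every value with a null indicator (1.0 = was missing, 0.0 = was present).
def imputeWithNullTracking(xs: Seq[Option[Double]]): Seq[(Double, Double)] = {
  val present = xs.flatten
  val mean = if (present.isEmpty) 0.0 else present.sum / present.size
  xs.map {
    case Some(v) => (v, 0.0)
    case None    => (mean, 1.0)
  }
}

// Illustrative ages, not taken from the dataset; the second one is missing.
val ages: Seq[Option[Double]] = Seq(Some(22.0), None, Some(26.0), Some(36.0))

imputeWithNullTracking(ages).foreach { case (value, wasNull) =>
  println(s"value = $value, wasNull = $wasNull")
}
```

Keeping the indicator alongside the imputed value lets a model learn from the fact that a value was missing, not only from the filled-in number.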

In [16]:
val featureVector = features.transmogrify()

Feature(name = age-cabin-embarked-fare-id-name-parch-pclass-sex-sibsp-ticket_3-stagesApplied_OPVector_00000000000f, uid = OPVector_00000000000f, isResponse = false, originStage = VectorsCombiner_00000000000f, parents = [OPVector_00000000000d,OPVector_00000000000e], distributions = [])

Inspecting `featureVector.toString` shows the name of our feature vector. It is essentially a concatenation of all the fields we declared as features, plus the total number of stages that have been applied. It provides no insight into the contents of the transmogrified feature vector.

In [17]:
featureVector.parents.foreach(v => println(v + "\n"))

Feature(name = age-fare-id-parch-pclass-sibsp_1-stagesApplied_OPVector_00000000000d, uid = OPVector_00000000000d, isResponse = false, originStage = RealVectorizer_00000000000d, parents = [Real_000000000001,Real_000000000003,Real_000000000006,Real_000000000007,Real_000000000008,Real_00000000000a], distributions = [])

Feature(name = cabin-embarked-name-sex-ticket_1-stagesApplied_OPVector_00000000000e, uid = OPVector_00000000000e, isResponse = false, originStage = SmartTextVectorizer_00000000000e, parents = [Text_000000000004,Text_000000000005,Text_000000000009,Text_00000000000b,Text_00000000000c], distributions = [])



#### Transmogrify for numeric variables

In [55]:
println("just a few examples of options for transmogrify on numeric variables")
featureVector.parents(0).originStage.explainParams.split("\n").take(4).mkString("\n\n")

just a few examples of options for transmogrify on numeric variables


fillValue: default value for FillWithConstant (default: 0.0)

inputFeatures: Input features (default: [Lcom.salesforce.op.features.TransientFeature;@73899234, current: [Lcom.salesforce.op.features.TransientFeature;@18b205b6)

inputSchema: the schema of the input data from the dataframe (default: StructType(), current: StructType(StructField(id,DoubleType,true), StructField(pclass,DoubleType,true), StructField(age,DoubleType,true), StructField(sibsp,DoubleType,true), StructField(parch,DoubleType,true), StructField(fare,DoubleType,true)))

outputFeatureName: output name that overrides default output name for feature made by this stage (undefined)

#### Transmogrify for non-numeric variables

In [53]:
println("just a few examples of options for transmogrify on non numeric variables")
featureVector.parents(1).originStage.explainParams.split("\n").take(5).mkString("\n\n")

just a few examples of options for transmogrify on non numeric variables


autoDetectLanguage: whether to attempt language detection (default: false, current: false)

autoDetectThreshold: language detection threshold (default: 0.99, current: 0.99)

binaryFreq: if true, term frequency vector will be binary such that non-zero term counts will be set to 1.0 (default: false, current: false)

cleanText: ignore capitalization and punctuation in grouping categories (default: true, current: true)

defaultLanguage: default language (default: unknown, current: unknown)

### Sanity Checker

The Sanity Checker is a TransmogrifAI estimator that helps identify problems in your dataset. The easiest way to use it is to call the `sanityCheck` method of the target variable, passing the feature vector as the first argument. There are many other arguments you can pass in, but here we focus on `removeBadFeatures`, which removes the features flagged as bad by the checks performed.

Some other options:
* Max correlation allowed between a feature and the target (default is 0.95)
* Min correlation allowed between a feature and the target (default is 0.0)
* Minimum variance allowed for a feature vector (default is 1e-5)
* Set the type of correlation to use.
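The thresholds listed above can be pictured with a small plain-Scala sketch. It is a simplification under stated assumptions, not the Sanity Checker's actual logic: it uses population variance, Pearson correlation, and the absolute value of the correlation, and `isBadFeature` is a hypothetical helper whose defaults mirror the values quoted in the list.

```scala
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

// Population variance (assumption: the real checker may use a different estimator).
def variance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / xs.size
}

// Pearson correlation coefficient between two equally sized sequences.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  val (mx, my) = (mean(xs), mean(ys))
  val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
  val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
  cov / (sx * sy)
}

// Hypothetical helper: flag a feature with too little variance, or whose absolute
// correlation with the label falls outside [minCorr, maxCorr].
def isBadFeature(feature: Seq[Double], label: Seq[Double],
                 maxCorr: Double = 0.95, minCorr: Double = 0.0,
                 minVar: Double = 1e-5): Boolean =
  if (variance(feature) < minVar) true
  else {
    val c = math.abs(pearson(feature, label))
    c > maxCorr || c < minCorr
  }

val label = Seq(0.0, 1.0, 1.0, 0.0)
println(isBadFeature(Seq(0.0, 1.0, 1.0, 0.0), label)) // copy of the label: leakage
println(isBadFeature(Seq(3.0, 3.0, 3.0, 3.0), label)) // constant: no variance
println(isBadFeature(Seq(1.0, 2.0, 4.0, 3.0), label)) // moderately correlated
```

A feature that correlates almost perfectly with the target usually signals label leakage, while a near-constant feature carries no information; both are reasonable candidates for removal.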

In [20]:
val checkedFeatures = target.sanityCheck(featureVector, removeBadFeatures = true)

Feature(name = age-cabin-embarked-fare-id-name-parch-pclass-sex-sibsp-survived-ticket_4-stagesApplied_OPVector_000000000010, uid = OPVector_000000000010, isResponse = false, originStage = SanityChecker_000000000010, parents = [RealNN_000000000002,OPVector_00000000000f], distributions = [])

In [56]:
println("just an example of a few things to set in the sanity checker\n")
println(checkedFeatures.originStage.explainParams.split("\n").take(5).mkString("\n\n"))

just an example of a few things to set in the sanity checker

categoricalLabel: If true, then label is treated as categorical (eg. Cramer's V will be calculated between it and categorical features). If this is not set, then use a max class fraction of 0.1 to estimate whether label iscategorical or not. (undefined)

checkSample: Rate to downsample the data for statistical calculations (note: actual sampling will not be exact due to Spark's dataset sampling behavior) (default: 1.0, current: 1.0)

correlationExclusion: Setting for what categories of feature vector columns to exclude from the correlation calculation (default: NoExclusion, current: NoExclusion)

correlationType: Which coefficient to use for computing correlation (default: Pearson, current: Pearson)

featureLabelCorrOnly: If true, then only calculate the correlations between the features and the label. Otherwise, calculate the entire correlation matrix, which includes all feature-feature correlations. (default: false, cur

## Model

In [22]:
val prediction = BinaryClassificationModelSelector
  .withCrossValidation(seed = 142L)
  .setInput(target, checkedFeatures)
  .setOutputFeatureName("prediction")
  .getOutput()

Feature(name = prediction, uid = Prediction_00000000001c, isResponse = true, originStage = ModelSelector_00000000001c, parents = [RealNN_000000000002,OPVector_000000000010], distributions = [])

In [23]:
val Array(train, test) = passengerData.randomSplit(Array(0.7, 0.3))

[id: double, survived: double ... 10 more fields]

In [24]:
// Setting up a TransmogrifAI workflow
val workflow = new OpWorkflow().setInputDataset(train).setResultFeatures(prediction)

com.salesforce.op.OpWorkflow@7aca6d33

In [25]:
val fittedWorkflow = workflow.train()

com.salesforce.op.OpWorkflowModel@3f1fd3d4

In [26]:
println("Model summary:\n" + fittedWorkflow.summaryPretty())

Model summary:
Evaluated OpLogisticRegression, OpLinearSVC, OpGBTClassifier, OpRandomForestClassifier models using Cross Validation and area under precision-recall metric.
Evaluated 8 OpLogisticRegression models with area under precision-recall metric between [0.7059060187219958, 0.8019345677832576].
Evaluated 4 OpLinearSVC models with area under precision-recall metric between [0.6689174995130905, 0.6776168955058083].
Evaluated 18 OpGBTClassifier models with area under precision-recall metric between [0.7919492354936848, 0.8413025093504036].
Evaluated 18 OpRandomForestClassifier models with area under precision-recall metric between [0.6757512570359909, 0.7985987427196202].
+--------------------------------------------------------+
|            Selected Model - OpGBTClassifier            |
+--------------------------------------------------------+
| Model Param           | Value                          |
+-----------------------+--------------------------------+
| cacheNodeIds       

## Test Dataset Performance

In [57]:
val evaluator = Evaluators.BinaryClassification()
   .setLabelCol(target)
   .setPredictionCol(prediction)

OpBinaryClassificationEvaluator_000000000044

In [58]:
fittedWorkflow.setInputDataset(test)

com.salesforce.op.OpWorkflowModel@3f1fd3d4

In [59]:
val (scoredTestData, metrics) = fittedWorkflow.scoreAndEvaluate(evaluator = evaluator)
OutputCell.HIDDEN

In [60]:
metrics.toMap

## Refine a few features by hand


#### Have to create a new workflow



In [None]:
val passengersData2 = passengerData.select("survived", "name", "pClass", "sex", "age", "sibSp", "parCh", "fare", "embarked")

In [None]:
val columns = passengerData.columns.filter(col => col != "id" && col != "name")

In [None]:
// Automated model selection
val pred = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()
//val pred = new OpGBTClassifier().setInput(survived, checkedFeatures).getOutput()

#### SQL User Defined Functions (UDF)

In [38]:
// Define a regular Scala function
val title: String => String = _.split("\\.").apply(0).split(",").apply(1)

// Define a UDF that wraps the Scala function defined above
// You could also define the function in place, i.e. inside udf
// but separating Scala functions from Spark SQL's UDFs allows for easier testing
import org.apache.spark.sql.functions.udf
val titleUDF = udf(title)

// Apply the UDF to change the source dataset
passengerData.withColumn("title", titleUDF(passengerData.col("name"))).createOrReplaceTempView("title")

spark.sql("select title, sum(1), avg(survived) as p from title group by 1 order by 1").show(100)

+-------------+------+-------------------+
|        title|sum(1)|                  p|
+-------------+------+-------------------+
|         Capt|     1|                0.0|
|          Col|     2|                0.5|
|          Don|     1|                0.0|
|           Dr|     7|0.42857142857142855|
|     Jonkheer|     1|                0.0|
|         Lady|     1|                1.0|
|        Major|     2|                0.5|
|       Master|    40|              0.575|
|         Miss|   182| 0.6978021978021978|
|         Mlle|     2|                1.0|
|          Mme|     1|                1.0|
|           Mr|   517|0.15667311411992263|
|          Mrs|   125|              0.792|
|           Ms|     1|                1.0|
|          Rev|     6|                0.0|
|          Sir|     1|                1.0|
| the Countess|     1|                1.0|
+-------------+------+-------------------+



null
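The `title` function above keeps the space that follows the comma (note the leading space in ` the Countess`) and would throw on a name without a comma. A slightly more defensive variant might look like the sketch below; `safeTitle` and the `"Unknown"` fallback are hypothetical choices, not part of the original notebook.

```scala
// Hypothetical variant of `title`: same split logic, plus trimming and a fallback
// for names that do not follow the "Last, Title. First" pattern.
def safeTitle(name: String): String = {
  val beforeDot = name.split("\\.").headOption.getOrElse("")
  val parts = beforeDot.split(",")
  if (parts.length > 1) parts(1).trim else "Unknown"
}

// Names taken from the rows printed earlier, plus one malformed example.
Seq("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "no comma here")
  .foreach(n => println(s"$n -> ${safeTitle(n)}"))
```

It could be wrapped with `udf` in the same way as `title` if the trimmed titles are preferred for grouping.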

In [None]:
// Passenger data schema
case class Passenger(
  id: Int,
  survived: Int,
  pClass: Option[Int],
  name: Option[String],
  sex: Option[String],
  age: Option[Double],
  sibSp: Option[Int],
  parCh: Option[Int],
  ticket: Option[String],
  fare: Option[Double],
  cabin: Option[String],
  embarked: Option[String]
)

In [None]:
// import necessary packages
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op._

// Define features using the TransmogrifAI types based on the data
val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse

val pClass = FeatureBuilder.PickList[Passenger].extract(_.pClass.map(_.toString).toPickList).asPredictor

val name = FeatureBuilder.Text[Passenger].extract(_.name.toText).asPredictor

val sex = FeatureBuilder.PickList[Passenger].extract(_.sex.map(_.toString).toPickList).asPredictor

// age is Option[Double] and contains nulls, so use the nullable Real type (as for fare)
val age = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor

val sibSp = FeatureBuilder.Integral[Passenger].extract(_.sibSp.toIntegral).asPredictor

val parCh = FeatureBuilder.Integral[Passenger].extract(_.parCh.toIntegral).asPredictor

val ticket = FeatureBuilder.PickList[Passenger].extract(_.ticket.map(_.toString).toPickList).asPredictor

val fare = FeatureBuilder.Real[Passenger].extract(_.fare.toReal).asPredictor

val cabin = FeatureBuilder.PickList[Passenger].extract(_.cabin.map(_.toString).toPickList).asPredictor

val embarked = FeatureBuilder.PickList[Passenger].extract(_.embarked.map(_.toString).toPickList).asPredictor