# Summary

We will be using TransmogrifAI to help build a model so we may make a submission to the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic) Kaggle competition.  This is an excellent challenge if you are new to data science, and its aim (in my opinion) is to introduce
* feature engineering
* binary classification

We will also use it to get a quick intro to Scala, Spark and TransmogrifAI.

#### Competition Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

<img src="./images/pic1.PNG" alt="drawing" width="200"/>

# Want to see my Transmogrif-AI-er? 

TransmogrifAI will be helpful in two regards.

* It will help automate __feature engineering__.
* It will automate machine learning __model selection__.

## Feature Engineering

Feature engineering is usually a complex and time-intensive process used to develop features for a machine learning model.  Crafting the right features for an ML model usually depends on subject matter expertise, experience, and trial and error.  But keep in mind that some components of feature engineering are ripe for automation.  Consider categorical variables: you need to come up with a numeric representation for them.  You could automate a workflow by always choosing to one-hot encode categorical data.  There are other things that you could automate by using rules of thumb or more complex methods of vectorization (e.g., entity embedding), but be advised, you will not find automated feature engineering to be as good as a knowledgeable SME.
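To make the "always one-hot encode" rule concrete, here is a minimal sketch in plain Scala.  It is not TransmogrifAI's implementation; the object name `OneHotSketch` and its two functions are made up for illustration, and the choice to map unseen categories to an all-zero vector is an assumption.

```scala
object OneHotSketch {
  // Learn the vocabulary: distinct categories in sorted order, each mapped to a column index.
  def fit(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  // Encode one value as a 0/1 vector; categories not seen during fit become all zeros.
  def encode(vocab: Map[String, Int], value: String): Vector[Double] = {
    val row = Array.fill(vocab.size)(0.0)
    vocab.get(value).foreach(i => row(i) = 1.0)
    row.toVector
  }
}
```

For the `embarked` column, fitting on the values `S`, `C` and `Q` would give three columns, and each passenger would get a 1 in exactly one of them.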

### Automated Feature Engineering

You can't always rely on automated methods over domain experience.  Consider the following example.  A feature in your dataset has missing values.  Without knowing the context of the problem or what the feature represents, a person may decide that mean value replacement should be used to impute the data.  But suppose that the feature represents a customer's credit utilization rate.  While we may not know what a credit utilization rate is, our good buddy Keith does.  Keith tells us that __credit utilization rate__ _is the amount of revolving credit you're currently using divided by the total amount of revolving credit you have available_.  Good to know!  But why would it be missing?  Keith informs us that while some customers may have credit (à la a car loan), they may not have access to revolving credit.

Hmm, so a missing value for credit utilization rate may mean this person can't get revolving credit, is wealthy and doesn't need it, or something else.  So be aware: automated feature engineering isn't a be-all and end-all, but there are some workarounds (imputation and null value tracking).
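The two workarounds can be combined: impute the missing value, but also keep a separate indicator column so the model can still learn from the fact that the value was missing.  The following is a rough sketch in plain Scala, not the library's code; `ImputeSketch` and `imputeWithIndicator` are hypothetical names, and falling back to 0.0 when nothing is observed is an assumption.

```scala
object ImputeSketch {
  // Replace each missing value with the mean of the observed values and
  // emit a second column that records whether the value was missing (1.0) or not (0.0).
  def imputeWithIndicator(xs: Seq[Option[Double]]): Seq[(Double, Double)] = {
    val observed = xs.flatten
    // Assumption: fall back to 0.0 when no value was observed at all.
    val mean = if (observed.isEmpty) 0.0 else observed.sum / observed.size
    xs.map {
      case Some(x) => (x, 0.0)
      case None    => (mean, 1.0)
    }
  }
}
```

With utilization rates of 0.25, missing, and 0.75, the missing entry is filled with 0.5 and flagged with a 1.0, so "no revolving credit" stays distinguishable from "uses half of it".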

## Machine Learning Model Selection

TransmogrifAI will try a number of preselected models with sets of predetermined hyperparameters.  The best is chosen via cross-validation and made available for scoring.  Additionally, the summary of the fitted model gives some insight into how the candidate models compared.
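The idea behind choosing the best model via cross-validation can be sketched in a few lines of plain Scala.  This is only an illustration of the general technique, not how TransmogrifAI implements it; `CrossValidationSketch`, the round-robin fold assignment, and the `train`/`score` function parameters are all assumptions made for the example.

```scala
object CrossValidationSketch {
  // Split the row indices 0 until n into k folds by round-robin assignment.
  def folds(n: Int, k: Int): Seq[Seq[Int]] =
    (0 until k).map(f => (0 until n).filter(_ % k == f))

  // Mean validation score of one candidate: train on k-1 folds, score on the held-out fold.
  def cvScore[M](n: Int, k: Int)(train: Seq[Int] => M)(score: (M, Seq[Int]) => Double): Double = {
    val scores = folds(n, k).map { heldOut =>
      val trainIdx = (0 until n).filter(i => !heldOut.contains(i))
      score(train(trainIdx), heldOut)
    }
    scores.sum / k
  }

  // Pick the candidate with the highest mean validation score.
  def best(results: Map[String, Double]): String = results.maxBy(_._2)._1
}
```

Each candidate (a model type plus one hyperparameter setting) gets its own `cvScore`, and `best` returns the name of the winner, which is then refit and used for scoring.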

# Check Scala Version

In [2]:
scala.util.Properties.versionString

version 2.11.12

# Get TransmogrifAI

In [3]:
%classpath add mvn com.salesforce.transmogrifai transmogrifai-core_2.11 0.5.0

# Get Spark

In [4]:
%classpath add mvn org.apache.spark spark-mllib_2.11 2.3.0

http://bailiwick.io/2017/08/21/using-xgboost-with-the-titanic-dataset-from-kaggle/

# Get Started

In [5]:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.udf

import com.salesforce.op._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import com.salesforce.op.evaluators.Evaluators


In [6]:
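// Spark configuration. Note: setExecutorEnv sets an environment variable named "memory" on the executors
// (it shows up in the configuration as spark.executorEnv.memory); it does not set spark.executor.memory.
val conf = new SparkConf().setMaster("local[*]").setAppName("automl-app").setExecutorEnv(Array( ("memory", "10g")))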
val sc = new SparkContext(conf)  // initialize spark context
val sqlContext = new org.apache.spark.sql.SQLContext(sc)  // initialize sql context
implicit val spark = SparkSession.builder.config(conf).getOrCreate() // start spark session 
import spark.implicits._

org.apache.spark.sql.SparkSession$implicits$@2c8fc96f

In [7]:
sc.getConf.getAll

[(spark.driver.host,R90RY1SX.myfnb.us), (spark.executorEnv.memory,10g), (spark.app.name,automl-app), (spark.master,local[*]), (spark.executor.id,driver), (spark.driver.port,51769), (spark.app.id,local-1549338046942)]

# Get Data

In [8]:
val rawData = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("data/train.csv")
rawData.printSchema

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



null

In [9]:
println(rawData.columns.mkString(","))
rawData.take(5).foreach(println)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
[1,0,3,Braund, Mr. Owen Harris,male,22.0,1,0,A/5 21171,7.25,null,S]
[2,1,1,Cumings, Mrs. John Bradley (Florence Briggs Thayer),female,38.0,1,0,PC 17599,71.2833,C85,C]
[3,1,3,Heikkinen, Miss. Laina,female,26.0,0,0,STON/O2. 3101282,7.925,null,S]
[4,1,1,Futrelle, Mrs. Jacques Heath (Lily May Peel),female,35.0,1,0,113803,53.1,C123,S]
[5,0,3,Allen, Mr. William Henry,male,35.0,0,0,373450,8.05,null,S]


In [10]:
// cast the label to double and treat pclass as a categorical string
rawData.createOrReplaceTempView("raw")
val passengerData = spark.sql("""
    select 
      passengerId as id, 
      cast(survived as double) as survived, 
      cast(pclass as string) as pclass, name, sex, age, 
      sibsp, parch, ticket, 
      fare, cabin, embarked 
      from raw
""")

[id: int, survived: double ... 10 more fields]

In [11]:
passengerData.show

+---+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
| id|survived|pclass|                name|   sex| age|sibsp|parch|          ticket|   fare|cabin|embarked|
+---+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|  1|     0.0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|  2|     1.0|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|  3|     1.0|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|  4|     1.0|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|  5|     0.0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|  6|     0.0|     3|    Moran, Mr. James|  male|null|    0|    0|          330877| 8.4583| null|       Q|
|  7|     0.0|     1|McCarthy, Mr. Ti

# Declare Target and Features

Features for our dataset could come in all types of flavors.  We could have location data, email addresses, text, addresses, etc., but for our purposes we will broadly classify our features as either text or numeric.  This broad classification affects the feature engineering step of our workflow because the feature engineering has predefined routines based on the type of the feature.  For instance, with a feature that is an email address, it may parse out the domain and generate a one-hot encoding for the top $k$ domains.  For our purposes, text and numeric will work fine.
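As an illustration of what a type-specific routine might do, here is a small plain-Scala sketch of the email example: parse out the domain, find the top $k$ domains, and one-hot encode against them.  It is not TransmogrifAI's actual vectorizer; `EmailDomainSketch`, the alphabetical tie-breaking, and the trailing "other" column are assumptions for the sake of the example.

```scala
object EmailDomainSketch {
  // Extract the part after the last '@', lower-cased; None if there is no usable domain.
  def domain(email: String): Option[String] = {
    val at = email.lastIndexOf('@')
    if (at >= 0 && at < email.length - 1) Some(email.substring(at + 1).toLowerCase) else None
  }

  // The k most frequent domains, ties broken alphabetically.
  def topK(emails: Seq[String], k: Int): Seq[String] =
    emails.flatMap(e => domain(e))
      .groupBy(identity)
      .toSeq
      .sortBy { case (d, hits) => (-hits.size, d) }
      .take(k)
      .map(_._1)

  // One column per top-k domain plus a trailing "other" column for everything else.
  def encode(top: Seq[String], email: String): Vector[Double] = {
    val idx = domain(email).map(d => top.indexOf(d)).filter(_ >= 0).getOrElse(top.size)
    Vector.tabulate(top.size + 1)(i => if (i == idx) 1.0 else 0.0)
  }
}
```

The point is that knowing a column holds email addresses unlocks a more useful encoding than treating it as free text.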

In [12]:
val Array(train, test) = passengerData.randomSplit(Array(0.7, 0.3))

[id: int, survived: double ... 10 more fields]

I believe that the type parameterization on `FeatureBuilder.fromDataFrame` concerns the response field.  Below we have `RealNN`, which is a real number that is not nullable.

In [13]:
val (target, features) = FeatureBuilder.fromDataFrame[RealNN](train, response = "survived")
OutputCell.HIDDEN

In [14]:
val id = features(0)

Feature(name = id, uid = Integral_000000000001, isResponse = false, originStage = FeatureGeneratorStage_000000000001, parents = [], distributions = [])

# Transmogrify (Automated Feature Engineering)

Eliminate the id field as it doesn't make sense as a feature.  

In [20]:
val featureVector = features.filter{ _.name != "id" }.transmogrify()

Feature(name = age-cabin-embarked-fare-name-parch-pclass-sex-sibsp-ticket_4-stagesApplied_OPVector_000000000010, uid = OPVector_000000000010, isResponse = false, originStage = VectorsCombiner_000000000010, parents = [OPVector_00000000000d,OPVector_00000000000e,OPVector_00000000000f], distributions = [])

In [21]:
featureVector.prettyParentStages

+-- combVec
|    +-- smartTxtVec
|    |    +-- ConcatText(embarked)
|    |    +-- ConcatText(cabin)
|    |    +-- ConcatText(ticket)
|    |    +-- ConcatText(sex)
|    |    +-- ConcatText(name)
|    |    +-- ConcatText(pclass)
|    +-- vecReal
|    |    +-- SumReal(fare)
|    |    +-- SumReal(age)
|    +-- vecInt
|    |    +-- SumIntegral(parch)
|    |    +-- SumIntegral(sibsp)


In [22]:
featureVector.parents(0).originStage.explainParams.split("\n").mkString("\n\n")

fillValue: default value for FillWithConstant (default: 0)

inputFeatures: Input features (default: [Lcom.salesforce.op.features.TransientFeature;@6cf9dd0b, current: [Lcom.salesforce.op.features.TransientFeature;@549e2c1a)

inputSchema: the schema of the input data from the dataframe (default: StructType())

outputFeatureName: output name that overrides default output name for feature made by this stage (undefined)

outputMetadata: any metadata that user wants to save in the transformed DataFrame (default: {}, current: {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"sibsp_0"},{"idx":1,"name":"parch_1"}]},"num_attrs":2},"vector_history":{"sibsp":{"stages":["vecInt_IntegralVectorizer_00000000000d"],"origin_features":["sibsp"]},"parch":{"stages":["vecInt_IntegralVectorizer_00000000000d"],"origin_features":["parch"]}},"vector_columns":[{"indices":[1],"parent_feature_type":["com.salesforce.op.features.types.Integral"],"parent_feature":["parch"]},{"indices":[0],"parent_feature_type":["com.s

In [23]:
features(1)

Feature(name = pclass, uid = Text_000000000003, isResponse = false, originStage = FeatureGeneratorStage_000000000003, parents = [], distributions = [])

In [24]:
//featureVector.parents(1).originStage.explainParams.split("\n").mkString("\n\n")

input is incomplete: input is incomplete

# Sanity Check (Feature Refinement)

In [25]:
val checkFeatures = false

false

In [26]:
val checkedFeatures = if(checkFeatures) target.sanityCheck(featureVector, removeBadFeatures = true) else featureVector

Feature(name = age-cabin-embarked-fare-name-parch-pclass-sex-sibsp-ticket_4-stagesApplied_OPVector_000000000010, uid = OPVector_000000000010, isResponse = false, originStage = VectorsCombiner_000000000010, parents = [OPVector_00000000000d,OPVector_00000000000e,OPVector_00000000000f], distributions = [])

In [27]:
println(checkedFeatures.originStage.explainParams.split("\n").mkString("\n\n"))

inputFeatures: Input features (default: [Lcom.salesforce.op.features.TransientFeature;@183fd255, current: [Lcom.salesforce.op.features.TransientFeature;@38136ff6)

inputSchema: the schema of the input data from the dataframe (default: StructType())

outputFeatureName: output name that overrides default output name for feature made by this stage (undefined)

outputMetadata: any metadata that user wants to save in the transformed DataFrame (default: {})


# Model Selection

In [45]:
// candidate validation metrics: error, auROC, auPR
val modelSelector = BinaryClassificationModelSelector
  .withCrossValidation(
    numFolds = 3,
    parallelism = 10
    // , validationMetric = Evaluators.BinaryClassification.auPR
  )
  .setInput(target, checkedFeatures)
  .setOutputFeatureName("prediction")

ModelSelector_000000000041

In [29]:
modelSelector.validator.getParams

In [30]:
println(s"total models to estimate: ${modelSelector.models.map(i => i._2.length).sum}")

total models to estimate: 48


In [39]:
val prediction = modelSelector.getOutput()

Feature(name = prediction, uid = Prediction_00000000001c, isResponse = true, originStage = ModelSelector_00000000001c, parents = [RealNN_000000000002,OPVector_000000000010], distributions = [])

# Setting up a TransmogrifAI Workflow

In [49]:
val workflow = new OpWorkflow().setInputDataset(train).setResultFeatures(id, prediction)

com.salesforce.op.OpWorkflow@6541e029

In [50]:
workflow.uid

OpWorkflow_000000000042

In [51]:
// val fittedWorkflow = workflow.loadModel("model")

input is incomplete: input is incomplete

# Train a Workflow

In [52]:
val fittedWorkflow = workflow.train()

com.salesforce.op.OpWorkflowModel@1b9d95f

In [53]:
fittedWorkflow.save("model")

In [54]:
println(fittedWorkflow.summaryPretty())

Evaluated OpLogisticRegression, OpLinearSVC, OpRandomForestClassifier, OpGBTClassifier models using Cross Validation and area under precision-recall metric.
Evaluated 8 OpLogisticRegression models with area under precision-recall metric between [0.7536868326562881, 0.8190254684318203].
Evaluated 4 OpLinearSVC models with area under precision-recall metric between [0.7416743180639309, 0.7461669122149207].
Evaluated 18 OpRandomForestClassifier models with area under precision-recall metric between [0.6871269308007691, 0.7960781816903247].
Evaluated 18 OpGBTClassifier models with area under precision-recall metric between [0.8072934000924459, 0.8392060064114096].
+---------------------------------------------------------+
|            Selected Model - OpGBTClassifier             |
+---------------------------------------------------------+
| Model Param           | Value                           |
+-----------------------+---------------------------------+
| cacheNodeIds          | false

# Evaluate on Test Data

In [55]:
val evaluator = Evaluators.BinaryClassification()
   .setLabelCol(target)
   .setPredictionCol(prediction)

OpBinaryClassificationEvaluator_000000000067

In [56]:
evaluator.explainParams

labelCol: label column name (default: label, current: survived)
predictionCol: prediction column name (current: prediction)
predictionValueCol: prediction column name (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name (default: rawPrediction)

In [57]:
fittedWorkflow.setInputDataset(test)

com.salesforce.op.OpWorkflowModel@1b9d95f

In [58]:
val (scoredTestData, metrics) = fittedWorkflow.scoreAndEvaluate(evaluator = evaluator)
OutputCell.HIDDEN

In [65]:
metrics.toMap

# Kaggle Test Data

Pull in Kaggle test data and score it.  

In [64]:
// import the Kaggle test data, which has no Survived column
val rawTestData = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("data/test.csv")
rawTestData.createOrReplaceTempView("rawTest")

// add a dummy survived column so the schema matches the training data
val passengerTestData = spark.sql("""
    select 
      passengerId as id, 
      cast(1 as double) as survived, 
      cast(pclass as string) as pclass, name, sex, age, 
      sibsp, parch, ticket, 
      fare, cabin, embarked 
      from rawTest
""")

fittedWorkflow.setInputDataset(passengerTestData)

val output = fittedWorkflow.computeDataUpTo(prediction).select("id", "prediction")

import java.io.{File, PrintWriter}

// pull the probability of class 1 out of the prediction map and threshold it at 0.5
val getScore = udf{ map: Map[String, Double ] => map.get("probability_1")}
val local = output.withColumn("p", getScore(output.col("prediction"))).selectExpr("id", "case when p >= 0.5 then 1 else 0 end").collect

// write the submission file with the PassengerId,Survived header
val myFile = new File("myprediction1.csv")
val pw = new PrintWriter(myFile)
pw.write("PassengerId,Survived\n")
local.foreach( record => pw.write(record.mkString(",") +"\n"))

pw.close

null

Accuracy around 77-78%
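
For reference, the thresholding step and the accuracy metric can be written out in plain Scala as follows.  This is only a sketch of the arithmetic; `ThresholdSketch` and its functions are hypothetical names and are not part of TransmogrifAI or Spark.

```scala
object ThresholdSketch {
  // Turn a predicted probability of survival into a 0/1 label.
  def label(p: Double, threshold: Double = 0.5): Int = if (p >= threshold) 1 else 0

  // Fraction of predictions that match the true labels.
  def accuracy(predicted: Seq[Int], actual: Seq[Int]): Double = {
    require(predicted.size == actual.size && predicted.nonEmpty,
      "inputs must be non-empty and the same length")
    predicted.zip(actual).count { case (p, a) => p == a }.toDouble / predicted.size
  }
}
```

An accuracy of 0.77 means roughly 77 out of every 100 passengers were assigned the correct label.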