# Classification Machine Learning: Will YOU survive the Titanic?
<img src='https://raw.githubusercontent.com/bradenrc/Spark_POT/master/Modules/MachineLearning/Classification/titanic.jpg' width="70%" height="70%"></img>
#### With Spark, we can easily describe data and use it to make predictions.  We'll be using the famous Titanic data set from Kaggle (https://www.kaggle.com/c/titanic/data) and the machine learning package in Spark to do just that.
## Access your data
#### We have the Titanic data on an instance of Object Storage, a cloud data store for unstructured content.  We'll configure the connection here.

In [1]:
def set_hadoop_config(name):
    prefix = 'fs.swift.service.' + name
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + '.auth.url', 'https://identity.open.softlayer.com'+'/v3/auth/tokens')
    hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')
    hconf.set(prefix + '.tenant', 'XXXXXXXX')
    hconf.set(prefix + '.username', 'XXXXXXXX')
    hconf.set(prefix + '.password', 'XXXXXXXX')
    hconf.setInt(prefix + '.http.port', 8080)
    hconf.set(prefix + '.region', 'dallas')
    hconf.setBoolean(prefix + '.public', True)

name = 'keystone'
set_hadoop_config(name)
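
The service name passed to `set_hadoop_config` shows up again in the `swift://` URL used to load the file in the next section. As a minimal sketch of how that URL is put together, assuming the `swift://<container>.<service name>/<object>` layout seen in this notebook (the helper `swift_url` and the container name `my_container` are hypothetical, for illustration only):

```python
def swift_url(container, service_name, object_path):
    """Build a URL of the form swift://<container>.<service name>/<object>."""
    return "swift://{0}.{1}/{2}".format(container, service_name, object_path)


# "my_container" is a placeholder, not a real container name.
url = swift_url("my_container", "keystone", "train.csv")
print(url)
```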

## Data processing
#### Once we have the data, all of the processing is done in memory.  Here, we're formatting the data, removing columns, dropping rows with insufficient data, creating a DataFrame, and creating columns.
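
To make the row-dropping and casting steps concrete, here is a rough plain-Python sketch of the same idea, without Spark. The three rows are made-up toy data, and the variable names are illustrative assumptions rather than anything from the notebook.

```python
# Toy rows (hypothetical); None marks a missing value.
rows = [
    {"Sex": "male", "Age": 22.0, "Embarked": "S", "Survived": 0},
    {"Sex": "female", "Age": None, "Embarked": "C", "Survived": 1},
    {"Sex": "female", "Age": 38.0, "Embarked": None, "Survived": 1},
]

# Keep only rows where every required column has a value,
# similar in spirit to dropna(subset=["Age", "Embarked"]).
required = ["Age", "Embarked"]
kept = [r for r in rows if all(r[c] is not None for c in required)]

# Convert the integer label to a floating-point label,
# similar in spirit to cast("double").
cleaned = [dict(r, Survived=float(r["Survived"])) for r in kept]
print(cleaned)
```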

In [2]:
from pyspark.sql import SQLContext,Row
from pyspark.sql.functions import lit

sqlContext = SQLContext(sc)
loadTitanicData = sqlContext.read.format("csv").options(header="true",inferSchema="true").load("swift://XXXXXXXX.keystone/train.csv")
loadTitanicData.show(2)

+-----------+--------+------+--------------------+------+----+-----+-----+---------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|   Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+---------+-------+-----+--------+
|          1|       0|     3|Braund Mr. Owen H...|  male|22.0|    1|    0|A/5 21171|   7.25|     |       S|
|          2|       1|     1|Cumings Mrs. John...|female|38.0|    1|    0| PC 17599|71.2833|  C85|       C|
+-----------+--------+------+--------------------+------+----+-----+-----+---------+-------+-----+--------+
only showing top 2 rows



In [3]:
loadTitanicData = loadTitanicData.drop("PassengerId").drop("Name").drop("Ticket").drop("Cabin").dropna(subset=["Age", "Embarked"])
loadTitanicData = loadTitanicData.withColumn("SurvivedTemp", loadTitanicData["Survived"].cast("double")).drop("Survived").withColumnRenamed("SurvivedTemp","Survived")
loadTitanicData.show(2)
loadTitanicData.printSchema()

+------+------+----+-----+-----+-------+--------+--------+
|Pclass|   Sex| Age|SibSp|Parch|   Fare|Embarked|Survived|
+------+------+----+-----+-----+-------+--------+--------+
|     3|  male|22.0|    1|    0|   7.25|       S|     0.0|
|     1|female|38.0|    1|    0|71.2833|       C|     1.0|
+------+------+----+-----+-----+-------+--------+--------+
only showing top 2 rows

root
 |-- Pclass: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Embarked: string (nullable = true)
 |-- Survived: double (nullable = true)

## Building your Spark.ML pipeline
#### String Indexer and One Hot Encoder

In [4]:
from pyspark.ml.feature import StringIndexer

SexIndexer = StringIndexer().setInputCol("Sex").setOutputCol("SexIndex").setHandleInvalid("skip")
EmbarkedIndexer = StringIndexer().setInputCol("Embarked").setOutputCol("EmbarkedIndex").setHandleInvalid("skip")
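
By default, `StringIndexer` assigns index 0 to the most frequent label, 1 to the next most frequent, and so on. The plain-Python sketch below imitates that behaviour on a made-up list of labels. The function name `fit_string_index` and the alphabetical tie-break are assumptions made for this illustration. The last line mimics `setHandleInvalid("skip")` by dropping labels that were not seen during fitting.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(values)
    # Ties are broken alphabetically here only to keep the sketch deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


embarked = ["S", "S", "S", "C", "C", "Q"]  # hypothetical sample
mapping = fit_string_index(embarked)

# Labels that were never seen when fitting are skipped.
new_values = ["C", "X", "S"]
indexed = [mapping[v] for v in new_values if v in mapping]
print(mapping, indexed)
```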

In [5]:
from pyspark.ml.feature import OneHotEncoder

SexEncoder = OneHotEncoder(inputCol="SexIndex", outputCol="SexFeatures")
EmbarkedEncoder = OneHotEncoder(inputCol="EmbarkedIndex",outputCol="EmbarkedFeatures")
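
`OneHotEncoder` turns a category index into a vector of zeros with a single 1.0. With its default `dropLast=True`, the last category is encoded as all zeros, so a column with n categories yields a vector of length n - 1. A small sketch of that encoding in plain Python (the function name `one_hot_drop_last` is hypothetical, and a dense list stands in for Spark's sparse vector):

```python
def one_hot_drop_last(index, num_categories):
    """One-hot encode `index`, dropping the slot for the last category."""
    size = num_categories - 1
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Four categories give vectors of length three.
for i in range(4):
    print(i, one_hot_drop_last(i, 4))
```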

#### Bucketizer

In [6]:
loadTitanicData.describe().show()

+-------+------------------+------------------+------------------+-------------------+------------------+------------------+
|summary|            Pclass|               Age|             SibSp|              Parch|              Fare|          Survived|
+-------+------------------+------------------+------------------+-------------------+------------------+------------------+
|  count|               714|               714|               714|                714|               714|               714|
|   mean|2.2366946778711485| 29.69911764705882|0.5126050420168067|0.43137254901960786|34.694514005602215|0.4061624649859944|
| stddev| 0.838249862698379|14.526497332334042|0.9297834541221924| 0.8532893658062201|52.918929502543556|0.4914598643353705|
|    min|                 1|              0.42|                 0|                  0|               0.0|               0.0|
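|    max|                 3|              80.0|                 5|                  6|          512.3292|               1.0|
+-------+------------------+------------------+------------------+-------------------+------------------+------------------+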


In [7]:
from pyspark.ml.feature import Bucketizer

AgeSplits = [-float("inf"), 4.0, 12.0, 18.0, 35.0, 60.0, 80.0, float("inf")]
FareSplits = [-float("inf"), 20.0, 50.0, 100.0, float("inf")]

AgeBucketizer = Bucketizer(splits=AgeSplits, inputCol="Age", outputCol="AgeBuckets")
FareBucketizer = Bucketizer(splits=FareSplits, inputCol="Fare", outputCol="FareBuckets")
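
A `Bucketizer` with splits `s0, s1, ..., sn` places a value in bucket `i` when it falls in `[si, si+1)`, and the last bucket also includes its upper bound. The sketch below reproduces that rule with the standard library, reusing the split points from the cell above. The function name `bucket_index` is an assumption for illustration.

```python
from bisect import bisect_right

AGE_SPLITS = [-float("inf"), 4.0, 12.0, 18.0, 35.0, 60.0, 80.0, float("inf")]
FARE_SPLITS = [-float("inf"), 20.0, 50.0, 100.0, float("inf")]


def bucket_index(value, splits):
    """Return the bucket for `value`; bucket i covers [splits[i], splits[i+1])."""
    index = bisect_right(splits, value) - 1
    # The final bucket is closed on the right, so clamp to the last bucket.
    return float(min(index, len(splits) - 2))


print(bucket_index(20.0, AGE_SPLITS))   # a 20-year-old passenger
print(bucket_index(8.05, FARE_SPLITS))  # a fare of 8.05
```

For an age of 20.0 and a fare of 8.05 this gives buckets 3.0 and 0.0, which agrees with the `AgeBuckets` and `FareBuckets` values in the sample prediction row shown later in this notebook.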

#### Vector Assembler

In [8]:
from pyspark.ml.feature import VectorAssembler

Assembler = VectorAssembler(inputCols=["SexFeatures", "EmbarkedFeatures", "AgeBuckets", "FareBuckets", "SibSp", "Pclass", "Parch"],outputCol="features")

#### Normalizer

In [9]:
from pyspark.ml.feature import Normalizer

Normalizer = Normalizer(inputCol="features", outputCol="normFeatures")
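
With its default setting (`p=2.0`), `Normalizer` rescales each feature vector to unit Euclidean length. As a quick check of what that means, the sketch below normalizes the three nonzero feature values (1.0, 3.0, 3.0) from the sample prediction row shown later in this notebook. The function name `l2_normalize` is hypothetical.

```python
import math


def l2_normalize(values):
    """Divide every entry by the vector's Euclidean (L2) norm."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        return list(values)  # an all-zero vector is left unchanged
    return [v / norm for v in values]


features = [1.0, 3.0, 3.0]  # nonzero entries of the sample row's features
normalized = [round(v, 4) for v in l2_normalize(features)]
print(normalized)
```

The rounded result matches the `normFeatures` entries (0.2294, 0.6882, 0.6882) in that row.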

#### Logistic Regression

In [10]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=10, regParam=0.2, elasticNetParam=0.8, featuresCol="normFeatures", labelCol="Survived")
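
The `regParam` and `elasticNetParam` arguments control an elastic-net penalty that is added to the training loss. Following the form given in the Spark MLlib documentation, with λ = `regParam` and α = `elasticNetParam`, the penalty is λ(α‖w‖₁ + (1 − α)/2 · ‖w‖₂²). The sketch below evaluates that expression for a made-up weight vector; the weights and the function name `elastic_net_penalty` are assumptions for illustration only.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a mix of L1 and squared L2 norms of the weights."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    mix = elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    return reg_param * mix


weights = [0.5, -1.0, 0.0]  # hypothetical coefficients
penalty = elastic_net_penalty(weights, reg_param=0.2, elastic_net_param=0.8)
print(penalty)
```

Setting `elasticNetParam` to 0.0 leaves only the L2 (ridge) term, and 1.0 leaves only the L1 (lasso) term.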

### All added into the pipeline

In [11]:
from pyspark.ml import Pipeline, PipelineModel

pipeline = Pipeline(stages=[SexIndexer, EmbarkedIndexer, SexEncoder, EmbarkedEncoder, AgeBucketizer, FareBucketizer, Assembler, Normalizer, lr])

#### Create a testing data set and training data set

In [12]:
[train, test] = loadTitanicData.randomSplit([.75, .25])
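
`randomSplit` assigns each row to a split at random, so the 75/25 proportions are approximate and the resulting split changes between runs unless a `seed` is passed as the second argument. The plain-Python sketch below shows the general idea; it is not Spark's implementation, and the function name `random_split` and the seed value are assumptions.

```python
import random


def random_split(rows, weights, seed):
    """Randomly assign each row to one part, in proportion to `weights`."""
    rng = random.Random(seed)
    total = float(sum(weights))
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)  # cumulative upper bound of each part
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        index = 0
        while index < len(bounds) - 1 and draw >= bounds[index]:
            index += 1
        parts[index].append(row)
    return parts


# 714 stands in for the number of rows reported by describe() above.
train_rows, test_rows = random_split(list(range(714)), [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))
```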

#### Create a model from the pipeline

In [13]:
model = pipeline.fit(train)

#### Predict whether or not a passenger will survive the Titanic, using data the model has not seen.

In [14]:
predictions = model.transform(test)



In [15]:
predictions.take(1)

[Row(Pclass=3, Sex=u'male', Age=20.0, SibSp=0, Parch=0, Fare=8.05, Embarked=u'S', Survived=0.0, SexIndex=0.0, EmbarkedIndex=0.0, SexFeatures=0.0, EmbarkedFeatures=SparseVector(3, {0: 1.0}), AgeBuckets=3.0, FareBuckets=0.0, features=SparseVector(9, {1: 1.0, 4: 3.0, 7: 3.0}), normFeatures=SparseVector(9, {1: 0.2294, 4: 0.6882, 7: 0.6882}), rawPrediction=DenseVector([0.7168, -0.7168]), probability=DenseVector([0.6719, 0.3281]), prediction=0.0)]
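
In the row above, `rawPrediction`, `probability`, and `prediction` are related: for binary logistic regression, the probability of class 1 is the logistic (sigmoid) function of the class-1 margin, and the predicted label is 1.0 when that probability exceeds the default threshold of 0.5. A short sketch that recomputes the numbers from the rounded margin in the output (the variable names are illustrative):

```python
import math


def sigmoid(margin):
    """Logistic function: maps a real-valued margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


raw_margin = -0.7168  # second entry of rawPrediction, the margin for class 1
p_survive = sigmoid(raw_margin)
p_not_survive = 1.0 - p_survive
prediction = 1.0 if p_survive > 0.5 else 0.0
print(round(p_not_survive, 4), round(p_survive, 4), prediction)
```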

In [16]:
predictions.filter("Survived = 0.0").select("Sex", "Age", "Fare", "Embarked", "Pclass", "Parch", "SibSp", "Survived", "prediction").show(5)
predictions.filter("Survived = 1.0").select("Sex", "Age", "Fare", "Embarked", "Pclass", "Parch", "SibSp", "Survived", "prediction").show(5)

+----+----+-------+--------+------+-----+-----+--------+----------+
| Sex| Age|   Fare|Embarked|Pclass|Parch|SibSp|Survived|prediction|
+----+----+-------+--------+------+-----+-----+--------+----------+
|male|20.0|   8.05|       S|     3|    0|    0|     0.0|       0.0|
|male|39.0| 31.275|       S|     3|    5|    1|     0.0|       0.0|
|male|19.0|  263.0|       S|     1|    2|    3|     0.0|       0.0|
|male|66.0|   10.5|       S|     2|    0|    0|     0.0|       0.0|
|male|65.0|61.9792|       C|     1|    1|    0|     0.0|       0.0|
+----+----+-------+--------+------+-----+-----+--------+----------+
only showing top 5 rows

+------+----+-------+--------+------+-----+-----+--------+----------+
|   Sex| Age|   Fare|Embarked|Pclass|Parch|SibSp|Survived|prediction|
+------+----+-------+--------+------+-----+-----+--------+----------+
|female|55.0|   16.0|       S|     2|    0|    0|     1.0|       1.0|
|female|38.0|31.3875|       S|     3|    5|    1|     1.0|       0.0|
|female|29.0|

### Evaluate and tune your model

In [17]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="Survived", metricName="areaUnderROC")
print(evaluator.evaluate(predictions))

0.797305318139
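
The area under the ROC curve can be read as the probability that a randomly chosen survivor receives a higher score than a randomly chosen non-survivor, with ties counted as one half. The sketch below computes it that way on four made-up examples. This pairwise form is an equivalent definition for illustration, not the algorithm Spark's evaluator uses, and the labels and scores are hypothetical.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1.0]
    negatives = [s for l, s in zip(labels, scores) if l == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [0.0, 0.0, 1.0, 1.0]    # hypothetical true labels
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
auc = area_under_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.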


In [18]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = ParamGridBuilder().baseOn({lr.labelCol: 'Survived'}).baseOn([lr.predictionCol, 'prediction'])\
    .addGrid(lr.elasticNetParam,[0.0,0.4,0.8]).addGrid(lr.maxIter, [2, 10, 20]).addGrid(lr.regParam, [0.0,0.1,0.2,0.3])\
    .build()

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator)

cvModel = cv.fit(train)
newPredictions = cvModel.transform(test)
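
It helps to know how much work this cell asks for. The three `addGrid` calls produce every combination of the listed values, while the `baseOn` entries are fixed and do not multiply the grid. Since `numFolds` is not set, `CrossValidator` uses its default of 3 folds. A rough count, as a sketch with illustrative variable names:

```python
from itertools import product

elastic_net_values = [0.0, 0.4, 0.8]
max_iter_values = [2, 10, 20]
reg_param_values = [0.0, 0.1, 0.2, 0.3]

# Every combination of the three parameter lists.
combinations = list(product(elastic_net_values, max_iter_values, reg_param_values))

num_folds = 3  # CrossValidator default when numFolds is not given
total_fits = len(combinations) * num_folds
print(len(combinations), total_fits)
```

After scoring all combinations, the cross-validator refits the pipeline once more on the full training set using the best-scoring parameters.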

In [19]:
print("Area under the ROC curve for non-tuned model = " + str(evaluator.evaluate(predictions)))
print("Area under the ROC curve for best fitted model = " + str(evaluator.evaluate(newPredictions)))
print("Improvement = " + str((evaluator.evaluate(newPredictions) - evaluator.evaluate(predictions)) *100 / evaluator.evaluate(predictions)) + "%")

Area under the ROC curve for non-tuned model = 0.797305318139
Area under the ROC curve for best fitted model = 0.884021842355
Improvement = 10.876200402%
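
The improvement reported above is relative to the non-tuned score, not a difference in percentage points. The sketch below redoes the arithmetic from the two printed values, storing each metric in a variable once so that nothing has to be evaluated repeatedly (the variable names are illustrative):

```python
baseline_auc = 0.797305318139  # non-tuned model, from the output above
tuned_auc = 0.884021842355     # best fitted model, from the output above

absolute_gain = tuned_auc - baseline_auc
improvement_pct = absolute_gain * 100 / baseline_auc
print("Absolute gain = {0:.4f}".format(absolute_gain))
print("Improvement = {0:.2f}%".format(improvement_pct))
```

Because the train/test split above is random and unseeded, these figures will vary from run to run.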


# Will YOU survive?

In [20]:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

# The field order here must match the order of the values in the tuple below
schema = StructType([StructField("Sex", StringType(), True), StructField("Age", DoubleType(), True),
                     StructField("Fare", DoubleType(), True), StructField("Embarked", StringType(), True),
                     StructField("Pclass", IntegerType(), True), StructField("SibSp", IntegerType(), True),
                     StructField("Parch", IntegerType(), True)])
me = sc.parallelize([("male", 27.0, 15.0, "C", 2, 1, 1)])

PredictionFeatures = sqlContext.createDataFrame(me,schema)
SurvivedOrNotPrediction = cvModel.transform(PredictionFeatures)
SurvivedOrNotPrediction.select("Sex", "Age", "Fare", "Embarked", "Pclass", "SibSp", "Parch","prediction").show()

+----+----+----+--------+------+-----+-----+----------+
| Sex| Age|Fare|Embarked|Pclass|SibSp|Parch|prediction|
+----+----+----+--------+------+-----+-----+----------+
|male|27.0|15.0|       C|     2|    1|    1|       0.0|
+----+----+----+--------+------+-----+-----+----------+

