# Use Logistic Regression to predict survival of Titanic passengers
Resources
* https://www.kaggle.com/c/titanic

By Shagun Garg

## Objective
* The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

* One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

* In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

## Data download
* Downloading from OneDrive cloud

In [4]:
%sh
wget --no-check-certificate 'https://onedrive.live.com/download?cid=19C57AC336968345&resid=19C57AC336968345%2113479&authkey=AM7djqn63GKsfxk' -O titanic.csv

## Data exploration

Loading file into a dataframe

In [7]:
titanic = spark.read.csv('file:/databricks/driver/titanic.csv',header="true",inferSchema = "true")

In [8]:
titanic.show(3)

## Data cleaning

In [10]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import Row
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *

Removing rows that have null values in the columns used for modeling

In [12]:
# Keep only rows where every feature column and the result column (Survived) are non-null
titanicCleaned = titanic.filter(titanic.Travel_class.isNotNull() & titanic.Sex.isNotNull() & titanic.Age.isNotNull() & titanic.Survived.isNotNull())
titanicCleaned.show(3)

In [13]:
# The survival field to be predicted...
titanic.select('Survived').distinct().show()

In [14]:
titanic.createOrReplaceTempView('Titanic')

## Finding survivors by class

In [16]:
%sql
SELECT Travel_class, COUNT(Survived) AS cnt FROM Titanic WHERE Survived = 'Yes' GROUP BY Travel_class

## Data transformation

In [18]:
# Convert the Survived column to the numeric label that MLlib expects (a double)
def labelForResults(s):
    if s == 'No':
        return 0.0
    elif s == 'Yes':
        return 1.0
    else:
        return -1.0  # unrecognized value; filtered out below
label = UserDefinedFunction(labelForResults, DoubleType())
# Start from titanicCleaned so rows with null features do not reach the pipeline
labeledData = titanicCleaned.select(label(titanicCleaned.Survived).alias('label'), titanicCleaned.Travel_class, titanicCleaned.Sex, titanicCleaned.Age).where('label >= 0')
labeledData.take(1)

In [19]:
display(labeledData)

In [20]:
# Split into training and testing data
train, test = labeledData.randomSplit([0.8, 0.2], seed=12345)
display(train)
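`randomSplit` assigns each row to a split by a random draw, so the 80/20 proportions are approximate rather than exact, and the seed makes the split repeatable. The sketch below illustrates that idea in plain Python. It is not Spark's implementation: the helper `random_split` and the sample rows are hypothetical.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using an independent random draw.

    Illustrative only: this helper is an assumption made for the example,
    not Spark's actual randomSplit implementation.
    """
    total = float(sum(weights))
    # Cumulative boundaries in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed by an earlier one.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))  # hypothetical stand-in for passenger rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=12345)
print("{} training rows, {} test rows".format(len(train_rows), len(test_rows)))
```

Running it again with the same seed reproduces the same split, while the split sizes only come close to 800 and 200.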

## Data modeling

In [22]:
# Configure the ML pipeline stages: index each input column, then assemble the indices into one feature vector
stringIndexer_tc = StringIndexer(inputCol="Travel_class", outputCol="TC_IX")
stringIndexer_sex = StringIndexer(inputCol="Sex", outputCol="SEX_IX")
# Note: indexing Age treats each distinct age as a category rather than as a number
stringIndexer_age = StringIndexer(inputCol="Age", outputCol="AGE_IX")
vectorAssembler_features = VectorAssembler(inputCols=["TC_IX", "SEX_IX", "AGE_IX"], outputCol="features")
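By default, `StringIndexer` gives the most frequent label index 0.0, the next most frequent 1.0, and so on. The following plain-Python sketch mimics that ordering on a tiny sample. The helper `fit_string_index`, the sample values, and the alphabetical tie-break are assumptions for illustration, not the library's code.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to a float index, most frequent label first.

    Hypothetical helper that imitates frequency-descending ordering;
    ties are broken alphabetically here, which is an assumption.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

sexes = ["male", "female", "male", "male", "female"]  # hypothetical sample
sex_index = fit_string_index(sexes)
encoded = [sex_index[s] for s in sexes]
print(sex_index)
print(encoded)
```

A label that was never seen while fitting has no entry in the mapping, which is why indexing a high-cardinality column such as Age can fail on test rows unless unseen values are handled explicitly.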

## Model training

In [24]:
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[stringIndexer_tc, stringIndexer_sex, stringIndexer_age, vectorAssembler_features, lr])

# Fit the pipeline to the training data.
model = pipeline.fit(train)
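Once fitted, a logistic regression model scores a row by passing a weighted sum of its features through the sigmoid function and comparing the resulting probability to a threshold. The sketch below shows that calculation in plain Python. The weights, intercept, feature values and the 0.5 threshold are made up for illustration and are not the values learned by the pipeline above.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, intercept, threshold=0.5):
    """Return (probability of label 1, predicted label) for one feature vector."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, (1.0 if probability > threshold else 0.0)

# Hypothetical coefficients for [TC_IX, SEX_IX, AGE_IX]; not fitted values.
weights = [-0.9, 2.4, -0.02]
intercept = 0.1
probability, prediction = predict([0.0, 1.0, 5.0], weights, intercept)
print("probability={:.3f}, prediction={}".format(probability, prediction))
```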

## Prediction

In [26]:
# Make predictions on test data.
predictionsDf = model.transform(test)
predictionsDf.createOrReplaceTempView('Predictions')
display(predictionsDf)

## Model evaluation

In [28]:
numSuccesses = predictionsDf.where("(label = 0 AND prediction = 0) OR  (label = 1 AND prediction = 1)").count()
numPassengers = predictionsDf.count()

print("There were {} passengers and there were {} successful predictions".format(numPassengers, numSuccesses))
print("This is a {}% success rate".format(float(numSuccesses) / float(numPassengers) * 100))

## Visualization

In [30]:
truePositive = int(predictionsDf.where("(label = 1 AND prediction = 1)").count())
trueNegative = int(predictionsDf.where("(label = 0 AND prediction = 0)").count())
falsePositive = int(predictionsDf.where("(label = 0 AND prediction = 1)").count())
falseNegative = int(predictionsDf.where("(label = 1 AND prediction = 0)").count())

print([['TP', truePositive], ['TN', trueNegative], ['FP', falsePositive], ['FN', falseNegative]])
resultDF = sqlContext.createDataFrame([['TP', truePositive], ['TN', trueNegative], ['FP', falsePositive], ['FN', falseNegative]], ['metric', 'value'])
display(resultDF)
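The four counts above are enough to derive the usual summary metrics. The sketch below computes accuracy, precision, recall and F1 from them in plain Python. The helper `classification_metrics` and the example counts are hypothetical and are not results from this notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive summary metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = float(tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=40, tn=80, fp=10, fn=20)
for name in ("accuracy", "precision", "recall", "f1"):
    print("{}: {:.3f}".format(name, metrics[name]))
```

In the notebook, the same function could be called with `truePositive`, `trueNegative`, `falsePositive` and `falseNegative`.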

In [31]:
resultDF.createOrReplaceTempView("LRresult")

In [32]:
%r
library(SparkR)
sparkdf <- sql("SELECT * FROM LRresult")
rdf <- collect(sparkdf)
print(rdf)
vals <- rdf$value
labels <- rdf$metric
# Simple pie chart of the confusion-matrix counts
pie(vals, labels)