# Titanic: Machine Learning from Disaster

* Description:
    The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
* Problem definition: In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

# Importing relevant packages

In [17]:
import numpy as np
import pandas as pd
import sklearn.linear_model as lm
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

%matplotlib inline


# Downloading data:

In order to download the relevant data:

Create an account on Kaggle (https://www.kaggle.com).

Follow the API setup instructions here (https://www.kaggle.com/docs/api?utm_me).

Register for the competition (https://www.kaggle.com/c/titanic/data) before trying to download the data.

In [2]:
! pip install kaggle
! kaggle competitions download -c titanic #Download data

titanic.zip: Skipping, found more recently modified local copy (use --force to force download)


## Data Exploration

You need to change the path so that it points to where the data is stored on your computer after downloading it.
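The download step leaves a `titanic.zip` archive in the working directory, while the next cell reads from `data_sets/titanic/`. A minimal sketch of unpacking the archive with the standard library, assuming those two locations (the helper name `extract_archive` is hypothetical; adjust both paths to your machine):

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Unpack zip_path into target_dir and return the extracted file names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())


# Assumed locations; the extraction only runs if the archive is actually present.
if Path("titanic.zip").exists():
    print(extract_archive("titanic.zip", "data_sets/titanic"))
```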



In [3]:
train = pd.read_csv('data_sets/titanic/train.csv')
test = pd.read_csv('data_sets/titanic/test.csv')
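# Used below as the test target. Note that on Kaggle this file is a sample submission
# (it predicts that all and only female passengers survive), not the true test labels.
y_test = pd.read_csv('data_sets/titanic/gender_submission.csv')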

* Training data: the data the model is trained on.
* Test data: a part of the data that we set aside and use very few times (ideally once) to check the performance of the predictive model we built.


In this dataset the division into train and test data is predetermined (as it is part of a Kaggle competition).
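Since the test set should be touched as rarely as possible, model choices are usually made on a validation part carved out of the training data. A minimal sketch with scikit-learn's `train_test_split`, run on toy data rather than the Titanic files (the names `X_toy`, `y_toy`, `X_fit`, `X_val` are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the training features and labels (not the Titanic data).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"Age": rng.integers(1, 80, size=100),
                      "Fare": rng.uniform(5, 100, size=100)})
y_toy = pd.Series(rng.integers(0, 2, size=100), name="Survived")

# Hold out 20% of the training rows for validation; stratify keeps the
# class balance similar in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print(X_fit.shape, X_val.shape)  # (80, 2) (20, 2)
```

Cross-validation on the training data serves the same purpose and avoids depending on a single split.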

In [4]:
print("Dimension of train data", train.shape)
print("Dimension of test data", test.shape)  # This contains only the features (inputs, covariates) of the test data
print("Dimension of test labels", y_test.shape)  # This is the output (target) of the test data

Dimension of train data (891, 12)
Dimension of test data (418, 11)
Dimension of test labels (418, 2)


You are going to use a subset of features for your prediction. 

If you are interested in a more detailed description of how to build smart features from this dataset, please contact me; I will provide the full data analysis if there is demand.

### Here I am dropping potentially useful features. If you want to do some preprocessing for these features, go ahead!

In [5]:
label_drop = ['PassengerId', 'Name', 'Cabin', 'Ticket']
train.drop(labels=label_drop, axis=1, inplace=True)
test.drop(labels=label_drop, axis=1, inplace=True)
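As one example of preprocessing a dropped feature, the `Name` column follows a "Surname, Title. Given names" pattern, so a title can be pulled out with a regular expression. This is only a sketch on three sample strings in that format; on the real data it would have to run before `Name` is dropped, and the variable name `titles` is an assumption:

```python
import pandas as pd

# Example strings in the "Surname, Title. Given names" format of the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the text between the comma and the first period after it.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

The resulting titles are categorical, so they would need encoding (for example with `pd.get_dummies`) before being passed to a model.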


## Data imputation - replace with median and add an indicator column

In [6]:
train.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

In [7]:
test.isnull().sum()

Pclass       0
Sex          0
Age         86
SibSp        0
Parch        0
Fare         1
Embarked     0
dtype: int64

In [8]:
# Not creating Embarked and Fare missing-indicator columns, as very few of their values are missing
for df in (train, test):
    if df['Age'].isnull().any():
        # 1 where Age is missing, 0 otherwise
        df['Age_miss'] = np.where(df['Age'].isnull(), 1, 0)

In [9]:
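# Filling the 2 samples with missing Embarked with the most common value
# (assigning back instead of inplace=True on a column, which newer pandas versions deprecate)
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])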

In [10]:
# Binary encoding for the dichotomous column Sex (female = 1, male = 0)
train['Sex'] = np.where(train['Sex'] == 'female', 1, 0)
test['Sex'] = np.where(test['Sex'] == 'female', 1, 0)

# One-hot encoding for Embarked. For tree-based algorithms this is not necessarily needed.
train = pd.get_dummies(train, columns=['Embarked'])
test = pd.get_dummies(test, columns=['Embarked'])
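Encoding train and test separately works here because both frames contain all three ports, but in general a category missing from one frame produces mismatched columns. A hedged sketch of aligning the test columns to the training columns with `reindex`, using toy frames (`train_toy`, `test_toy` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy frames: the "test" part happens to contain only one port.
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test_toy = pd.DataFrame({"Embarked": ["S", "S"]})

train_enc = pd.get_dummies(train_toy, columns=["Embarked"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["Embarked"], dtype=int)

# Give the test frame exactly the training columns, in the same order;
# categories unseen in test become all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```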

In [11]:
train.isnull().sum()

Survived        0
Pclass          0
Sex             0
Age           177
SibSp           0
Parch           0
Fare            0
Age_miss        0
Embarked_C      0
Embarked_Q      0
Embarked_S      0
dtype: int64

In [12]:
test.isnull().sum()

Pclass         0
Sex            0
Age           86
SibSp          0
Parch          0
Fare           1
Age_miss       0
Embarked_C     0
Embarked_Q     0
Embarked_S     0
dtype: int64

In [13]:
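# Impute with medians computed on the training data only, and reuse them for the test data
age_median = train['Age'].median()
train['Age'] = train['Age'].fillna(age_median)
test['Age'] = test['Age'].fillna(age_median)
test['Fare'] = test['Fare'].fillna(train['Fare'].median())  # Has to be filled: one Fare value is missing in the test data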
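The same "median plus indicator column" idea is also available in scikit-learn as `SimpleImputer` with `add_indicator=True`. This is an alternative sketch on toy columns, not the approach used in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy columns, not the real data.
train_toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 26.0]})
test_toy = pd.DataFrame({"Age": [np.nan, 30.0]})

# Median imputation plus a 0/1 missing-value indicator column.
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_out = imputer.fit_transform(train_toy)  # median learned from train only
test_out = imputer.transform(test_toy)        # same median reused for test

print(train_out)
print(test_out)
```

The first output column holds the imputed ages (the toy training median is 26), the second is the indicator. Because the imputer is fitted once and then reused, it also fits naturally into a scikit-learn `Pipeline`.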

In [14]:
train.head()

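|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Age_miss | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 1 | 0 | 0 |
| 2 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 0 | 1 |
| 3 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 0 | 1 |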


Defining the output (target) for the train data

In [15]:
y_train = train['Survived']
Survived_drop = ['Survived']
train.drop(labels=Survived_drop, axis=1, inplace=True)

y_test.drop(labels='PassengerId', axis=1, inplace=True)

# Your code should go below

Predict using logistic regression and random forest. Think about whether you want to do some pre-processing (you are also welcome to use the features that I dropped in my data cleaning). Try to use the test set no more than 5 times.
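One possible starting point, sketched on synthetic data so that it runs on its own: a scaled logistic regression and a random forest, each tuned with `GridSearchCV` on the training data only. The parameter grids and the names `X_toy`, `y_toy` are assumptions for illustration; swap in `train` and `y_train` for the real exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (train, y_train); replace with the real objects.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Logistic regression: scaling happens inside the pipeline, so each CV fold
# is scaled using only its own training part.
logreg = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
logreg_search = GridSearchCV(
    logreg, {"clf__C": [0.01, 0.1, 1, 10]}, scoring="roc_auc", cv=5
)
logreg_search.fit(X_toy, y_toy)

# Random forest: tree-based, so no scaling step is needed.
forest_search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    {"max_depth": [3, 5, None]},
    scoring="roc_auc",
    cv=5,
)
forest_search.fit(X_toy, y_toy)

print("logistic regression:", logreg_search.best_params_, logreg_search.best_score_)
print("random forest:", forest_search.best_params_, forest_search.best_score_)
```

Because all tuning is done by cross-validation on the training data, the test set is only needed once per final model, which keeps test usage within the suggested limit.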

If you know what AUC is, please use AUC as the measure of performance rather than accuracy (think about which is more relevant in this case).
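To make the difference concrete, here is a small sketch on made-up labels and probabilities (not model output from this dataset): accuracy depends on the 0.5 cut-off, while AUC only depends on how well the scores rank positives above negatives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels and predicted probabilities (2 positives out of 10).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.25, 0.30, 0.40])

# Accuracy needs hard labels, so it depends on the 0.5 cut-off.
y_pred = (y_prob >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)

# AUC is computed from the scores themselves and only uses their ranking.
auc = roc_auc_score(y_true, y_prob)

print(acc, auc)  # 0.8 1.0
```

With a fitted scikit-learn classifier, the scores would typically come from `model.predict_proba(X)[:, 1]` rather than from `model.predict(X)`.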

If you want to get creative, think of how to impute the data differently from what I have done.
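One alternative, shown here only as a sketch on toy rows: fill a missing `Age` with the median of passengers in the same `Pclass` and `Sex` group instead of the overall median. The grouping columns are an assumption; other groupings may work better.

```python
import numpy as np
import pandas as pd

# Toy rows, not the real data.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    [1, 1, 1, 0, 0, 0],
    "Age":    [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# Median age within each (Pclass, Sex) group, broadcast back to every row.
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Use the group median where available, else fall back to the overall median.
toy["Age"] = toy["Age"].fillna(group_median).fillna(toy["Age"].median())

print(toy["Age"].tolist())  # [40.0, 50.0, 45.0, 20.0, 30.0, 25.0]
```

On the real data the group medians should be computed on the training set and then applied to both train and test, in the same way the overall median is handled above.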

The relevant data is found in the following objects:

* train - training features
* test - test features
* y_train - target for training
* y_test - target for test