[View in Colaboratory](https://colab.research.google.com/github/shashank2806/kaggle-titanic/blob/master/titanic.ipynb)


# Titanic: Machine Learning from Disaster

Predict survival on the Titanic with ML.

In [0]:
# Upload the dataset from the local machine to Colab
from google.colab import files
files.upload()

In [3]:
# unzip the data
!unzip titanic-kaggle.zip

Archive:  titanic-kaggle.zip
  inflating: train.csv               
  inflating: test.csv                
  inflating: gender_submission.csv   


In [0]:
# import dependencies
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

train_file_path = 'train.csv'

titanic_data = pd.read_csv(train_file_path)

# target object
y = titanic_data.Survived

# input features
features = ['PassengerId', 'Pclass', 'Age', 'Parch', 'Fare']
X = titanic_data[features]

In [5]:
# view data
X.head()

   PassengerId  Pclass   Age  Parch     Fare
0            1       3  22.0      0   7.2500
1            2       1  38.0      0  71.2833
2            3       3  26.0      0   7.9250
3            4       1  35.0      0  53.1000
4            5       3  35.0      0   8.0500


In [6]:
# check whether any field is null
# X[X.isnull().any(axis=1)]
X.isnull().sum()

PassengerId      0
Pclass           0
Age            177
Parch            0
Fare             0
dtype: int64

In [0]:
# impute missing values with the column mean (SimpleImputer's default strategy)
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
X_inputed  = my_imputer.fit_transform(X)

In [0]:
# X_inputed is a NumPy array, so we have to cast it back to a DataFrame
X_inputed = pd.DataFrame(X_inputed)

In [9]:
# we can see that we have lost column titles.
X_inputed.head()

     0    1     2    3        4
0  1.0  3.0  22.0  0.0   7.2500
1  2.0  1.0  38.0  0.0  71.2833
2  3.0  3.0  26.0  0.0   7.9250
3  4.0  1.0  35.0  0.0  53.1000
4  5.0  3.0  35.0  0.0   8.0500


In [0]:
# Since the order of the columns does not change after imputation,
# we can add the titles back like this
X_inputed.columns = X.columns

In [21]:
X_inputed.head()

   PassengerId  Pclass   Age  Parch     Fare
0          1.0     3.0  22.0    0.0   7.2500
1          2.0     1.0  38.0    0.0  71.2833
2          3.0     3.0  26.0    0.0   7.9250
3          4.0     1.0  35.0    0.0  53.1000
4          5.0     3.0  35.0    0.0   8.0500


Now that we have taken care of the missing values, we can move on to training.
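
The impute, cast, and relabel steps above can also be folded into one expression. The following is a minimal sketch, assuming scikit-learn's `SimpleImputer`; the frame `toy_X` and its values are made up for illustration and are not the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for X (values are made up for illustration).
toy_X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

imputer = SimpleImputer(strategy="mean")  # mean is the default strategy

# Rebuild the DataFrame in one step, reusing the original column and row labels.
toy_X_imputed = pd.DataFrame(
    imputer.fit_transform(toy_X),
    columns=toy_X.columns,
    index=toy_X.index,
)

print(toy_X_imputed)
```

Passing `columns` and `index` at construction time means the column titles are never lost, so no separate relabeling cell is needed.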

In [0]:
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X_inputed, y, test_size=0.2)
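
Without a fixed seed, the split (and therefore the accuracy figures) changes on every run. The sketch below shows one way to make the split repeatable and to keep the class ratio equal in both parts, using the `random_state` and `stratify` parameters of `train_test_split`. The arrays `demo_X` and `demo_y` are synthetic stand-ins, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X_inputed and y (not the Titanic data).
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(100, 3))
demo_y = np.array([1] * 40 + [0] * 60)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y,
    test_size=0.2,
    random_state=0,    # fixed seed: the same split on every run
    stratify=demo_y,   # keep the class ratio equal in both parts
)

print(demo_train_y.mean(), demo_val_y.mean())
```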

In [0]:
# model
titanic_model = LogisticRegression(C=1e2)
titanic_model.fit(train_X, train_y)
val_predictions = titanic_model.predict(val_X)

In [27]:
# print accuracy
train_accuracy = titanic_model.score(train_X, train_y)
val_accuracy = titanic_model.score(val_X, val_y)
print('train_accuracy: ',train_accuracy)
print('val_accuracy: ',val_accuracy)

train_accuracy:  0.699438202247191
val_accuracy:  0.6871508379888268
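
An accuracy figure is easier to judge next to a trivial baseline that always predicts the most common class. A minimal sketch, assuming scikit-learn's `DummyClassifier`; the label arrays are made up and only stand in for `train_y` and `val_y`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels standing in for train_y and val_y.
demo_train_y = np.array([0, 0, 0, 1, 1, 0, 1, 0])
demo_val_y = np.array([0, 1, 0, 0])

# DummyClassifier ignores the features, so placeholder zeros are enough here.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(demo_train_y), 1)), demo_train_y)

baseline_accuracy = baseline.score(np.zeros((len(demo_val_y), 1)), demo_val_y)
print("baseline_accuracy:", baseline_accuracy)
```

If the model's validation accuracy is only slightly above this baseline, the chosen features are contributing little.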


In [28]:
# model on full data
titanic_model_on_full_data = LogisticRegression(C=1e2)
titanic_model_on_full_data.fit(X_inputed, y)

LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [0]:
# test file path
test_file_path = 'test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_file_path)

# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
test_X = test_data[features]

# impute test data with the fill values learned from the training data
test_X_inputed  = pd.DataFrame(my_imputer.transform(test_X))
test_X_inputed.columns = test_X.columns
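
A common convention when imputing test data is to reuse the fill values learned from the training set instead of refitting on the test set, so both sets are treated identically. The sketch below illustrates the pattern with `SimpleImputer`; the frames `demo_train` and `demo_test` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values; only the fit/transform pattern matters.
demo_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan], "Fare": [10.0, 20.0, 30.0]})
demo_test = pd.DataFrame({"Age": [np.nan, 40.0], "Fare": [np.nan, 5.0]})

imputer = SimpleImputer()          # default strategy: column mean
imputer.fit(demo_train)            # learn fill values from the training data only

demo_test_imputed = pd.DataFrame(
    imputer.transform(demo_test),  # reuse the training means, no refitting
    columns=demo_test.columns,
)
print(demo_test_imputed)
```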

In [0]:
# make predictions which we will submit.
test_preds = titanic_model_on_full_data.predict(test_X_inputed)

In [0]:
# The lines below show how to save the predictions in the format the competition expects
output = pd.DataFrame({'PassengerId': test_data.PassengerId,
                       'Survived': test_preds})

In [0]:
# output to csv
output.to_csv('submission.csv', index=False)
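
Before uploading, it can help to read the file back and confirm its shape. This is a rough sketch of such a check, assuming the expected layout is the two columns `PassengerId` and `Survived` with 0/1 values; the frame `demo_output` is a made-up stand-in for `output`, and an in-memory buffer replaces the real file.

```python
import io
import pandas as pd

# Made-up stand-in for the `output` frame.
demo_output = pd.DataFrame({"PassengerId": [892, 893, 894],
                            "Survived": [0, 1, 0]})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
demo_output.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

columns_ok = list(check.columns) == ["PassengerId", "Survived"]
values_ok = bool(check["Survived"].isin([0, 1]).all())
rows_ok = len(check) == len(demo_output)
print(columns_ok, values_ok, rows_ok)
```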

In [0]:
files.download('submission.csv')

This submission scored 66.985% accuracy.

A lot still needs to be done to improve the model.
* We can use categorical values
* We can try a different algorithm
* We can use another technique for handling missing data
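
The three ideas above could be combined in a single scikit-learn pipeline. The following is only a sketch under stated assumptions: the `Sex` column, the median imputation strategy, and the random forest are illustrative choices not used in the notebook, and the frame `demo` contains made-up rows rather than the real training data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame using Titanic-style column names.
demo = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 2, 1],
    "Sex": ["male", "female", "female", "female", "male", "male", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 27.0, 40.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 13.00, 11.13, 26.55],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
})
numeric_cols = ["Pclass", "Age", "Fare"]
categorical_cols = ["Sex"]

preprocess = ColumnTransformer([
    # another missing-data technique: median instead of mean
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # categorical values: one-hot encode the text column
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    # a different algorithm: random forest instead of logistic regression
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

feature_cols = numeric_cols + categorical_cols
model.fit(demo[feature_cols], demo["Survived"])
demo_preds = model.predict(demo[feature_cols])
print(demo_preds)
```

Because the preprocessing lives inside the pipeline, calling `model.predict` on test data applies the fill values and encoding learned during `fit`.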