 # Titanic: Machine Learning from Disaster
**Start here! Predict survival on the Titanic and get familiar with ML basics**
## Overview
The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

 The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

 The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

 We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.

# Step 1: Import the data files and take a look at them

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
import os
warnings.filterwarnings('ignore')
# The dataset files are expected in the ../input/ directory
!ls ../input/

# Read the training and test sets into DataFrames
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

Looking at the first rows of each data file:

In [None]:
train.head()

In [None]:
test.head()

Looking at the dataset, we need to think about which features could be useful for predicting survival. A good starting point is to consider how each feature correlates with survival, which can be explored during data cleaning.
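One quick way to see how a feature relates to survival is to compare the survival rate across its values. The sketch below uses a small made-up table with the same column names as train.csv (the rows are illustrative, not real passengers); on the real data the same `groupby` call could be applied to `train`.

```python
import pandas as pd

# Toy stand-in for train.csv: made-up rows, same column names as the real data
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0, 0, 1],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male', 'male', 'female'],
    'Pclass': [3, 1, 3, 3, 1, 2, 3, 2],
})

# The mean of the 0/1 Survived column per group is the fraction of survivors in that group
rate_by_sex = sample.groupby('Sex')['Survived'].mean()
rate_by_class = sample.groupby('Pclass')['Survived'].mean()
print(rate_by_sex)
print(rate_by_class)
```

A feature whose groups show clearly different survival rates is a reasonable candidate to keep.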

# Step 2: Data cleaning

First, evaluate whether the data needs cleaning by checking for missing values:

In [None]:
NanExist = False
# A column with fewer non-null entries than rows contains missing values
if train.count().min() == train.shape[0] and test.count().min() == test.shape[0]:
    print('There is no missing data!')
else:
    NanExist = True
    print('We have NaN values!')
if NanExist:
    # Count missing values per column in both sets and show only the affected columns
    NumOfNan = pd.concat([train.isnull().sum(), test.isnull().sum()], axis=1, keys=['Train Data', 'Test Data'])
    print(NumOfNan[NumOfNan.sum(axis=1) > 0])

Next, we want to build arrays from the train and test data that contain only the features we will work with. **Age** is one of the important features for prediction, so its missing values need to be handled carefully. It is not advisable to simply replace every missing **Age** with 0: a large number (177) of **Age** values are missing, and doing so would affect the survival prediction. Instead, we use the **Name** column, where the title gives some clue about the passenger's age. To do this, we first extract the title from the **Name** column.

In [None]:
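# The title is the text between the comma and the first period, e.g. "Braund, Mr. Owen Harris" -> "Mr"
title_train = train['Name'].str.split(',').str[1].str.split('.').str[0].str.strip()
title_test = test['Name'].str.split(',').str[1].str.split('.').str[0].str.strip()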
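To make the string handling concrete, here is a small standalone sketch of the same split-based extraction applied to a few names written in the "Surname, Title. Given names" layout of the **Name** column. The `names` Series is an illustrative sample, not the full dataset.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" layout of the Name column
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
])

# Keep the part after the comma, then the part before the first period, then trim spaces
titles = names.str.split(',').str[1].str.split('.').str[0].str.strip()
print(titles.tolist())
```

Calling `value_counts()` on the extracted titles is a simple way to see which titles occur and how often.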

Based on the title, the simplest approach is to replace a missing **Age** with 0 for passengers titled **Miss** or **Master**, and with 18 for all other titles.

In [None]:
# Fill missing ages from the title: 0 for 'Miss' and 'Master', 18 for every other title.
# Using .loc with boolean masks avoids chained assignment such as train['Age'][i] = 0.
for df, titles in ((train, title_train), (test, title_test)):
    young = titles.str.contains('Miss|Master')
    missing = df['Age'].isna()
    df.loc[missing & young, 'Age'] = 0
    df.loc[missing & ~young, 'Age'] = 18
print(train['Age'].isna().sum())  # checking train
train_orig = train.copy()  # keep a copy of the age-imputed data before encoding
print(test['Age'].isna().sum())  # checking test
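The fixed values 0 and 18 are a coarse rule. A possible refinement, shown here only as a sketch and not used in the rest of this notebook, is to fill each missing age with the median of the known ages for the same title. The `Title` column and the numbers below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages and titles; NaN marks a missing age
df = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss', 'Master', 'Master'],
    'Age': [30.0, 40.0, np.nan, 20.0, np.nan, 24.0, 4.0, np.nan],
})

# Median of the known ages within each title, broadcast back to every row
median_by_title = df.groupby('Title')['Age'].transform('median')

# Only rows with a missing age take the group median; known ages are untouched
df['Age'] = df['Age'].fillna(median_by_title)
print(df)
```

If this is applied to real data, the medians should be computed on the training set and reused for the test set.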

**Sex** and **Embarked** are categorical features stored as strings rather than numeric values. We need to encode these strings as numbers so the algorithm can perform its calculations.

In [None]:
# Encode Sex as integers: male -> 1, female -> 2
train['Sex'] = train['Sex'].map({'male': 1, 'female': 2})

test['Sex'] = test['Sex'].map({'male': 1, 'female': 2})

Similarly, **Embarked** has 2 missing values. Since **Embarked** has only 3 categories, the simplest fix is to fill them in with the most frequent port.

In [None]:
fp = train['Embarked'].dropna().mode()[0]
train['Embarked'] = train['Embarked'].fillna(fp)

Map the **Embarked** values to integers:

In [None]:
train['Embarked'] = train['Embarked'].map({'S': 0, 'C':1,'Q':2}).astype(int)
test['Embarked'] = test['Embarked'].map({'S': 0, 'C':1,'Q':2}).astype(int)
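Integer codes such as S=0, C=1, Q=2 imply an ordering between ports that a linear model will treat as meaningful. One common alternative, shown here only as a sketch and not used in the rest of this notebook, is one-hot encoding with `pd.get_dummies`. The `ports` frame is a made-up example.

```python
import pandas as pd

# Made-up sample of the Embarked column before integer mapping
ports = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One 0/1 indicator column per port instead of a single ordered integer code
one_hot = pd.get_dummies(ports['Embarked'], prefix='Embarked', dtype=int)
print(one_hot)
```

Each row has exactly one indicator set to 1, so no port is treated as "larger" than another.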

Now the data is reasonably clean. We are not using the **Cabin** data, so we can leave that column as it is.
The next step is to convert the DataFrames into arrays.

In [None]:
# Convert the pandas DataFrames to NumPy arrays so that they can be used in sklearn
feature_names = ['Sex','Age','Pclass','SibSp','Parch','Embarked']
train_feature = train[feature_names].values
train_class = train['Survived'].values
test_feature = test[feature_names].values

# Step 3: Apply Classifier

# **Logistic regression**

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
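clf.fit(train_feature, train_class)  # fit the model on the training data

test_predict = clf.predict(test_feature)  # predicted survival for the test set
# Note: despite the name, this is accuracy on the training data, not a cross-validated score
cv_score = clf.score(train_feature, train_class)
cv_score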
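The score above is measured on the same rows the model was fitted on, so it tends to be optimistic. A minimal sketch of k-fold cross-validation with `cross_val_score` follows. It uses randomly generated stand-in data (`X`, `y`) so that it runs on its own; with the real data, `train_feature` and `train_class` would take their place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for train_feature / train_class (random, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# 5-fold cross-validation: each fold is scored on rows the model was not fitted on
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```

The mean of the fold scores is usually a better guide to leaderboard performance than training accuracy.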


Next, try logistic regression on degree-2 polynomial features:

In [None]:
from sklearn import preprocessing
poly = preprocessing.PolynomialFeatures(degree=2)
# Fit the polynomial expansion on the training features, then reuse it for the test features
poly_train_feature = poly.fit_transform(train_feature)
poly_test_feature = poly.transform(test_feature)
classifier = LogisticRegression()
classifier_ = classifier.fit(poly_train_feature, train_class)
poly_test_predict = classifier_.predict(poly_test_feature)
print(classifier_.score(poly_train_feature, train_class))  # accuracy on the training data
# The test set has no ground truth, so its predictions cannot be scored here
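A scikit-learn pipeline can bundle the feature expansion and the classifier so that every transformer is fitted on the training data only and then reused on the test data. The sketch below is one possible arrangement, not the notebook's method: the `StandardScaler` step and `include_bias=False` are additions made here, and the data is randomly generated so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-ins for train_feature, train_class and test_feature
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 6))
y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(20, 6))

# include_bias=False: LogisticRegression already fits its own intercept
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)    # every step is fitted on the training data only
pred = model.predict(X_test)   # the fitted steps are applied to the test data
print(pred.shape, model.score(X_train, y_train))
```

With 6 input features, the degree-2 expansion without the bias column produces 27 features (6 linear terms plus 21 squares and pairwise products).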


In [None]:
LogReg_TestResult= pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':poly_test_predict})
LogReg_TestResult.head()
LogReg_TestResult.to_csv('PLogReg_TestResult.csv',index=False)

# **Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
# The hyperparameters below are a starting point and have not been tuned
seed = 42
# max_features='sqrt' is what the former 'auto' option meant for classifiers
rclf = RandomForestClassifier(n_estimators=1000, criterion='entropy', max_depth=5, min_samples_split=2,
                              min_samples_leaf=1, max_features='sqrt', bootstrap=False, oob_score=False,
                              n_jobs=1, random_state=seed, verbose=0)
rclf.fit(train_feature, train_class)
test_predict = rclf.predict(test_feature)
print(rclf.score(train_feature, train_class))  # accuracy on the training data
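A fitted random forest exposes `feature_importances_`, which can hint at which inputs the trees rely on, and with bootstrapping enabled it can also report an out-of-bag (OOB) accuracy estimate. The sketch below uses randomly generated data in which only the first column carries signal. The pairing with the notebook's feature names is purely illustrative, and `forest` uses smaller settings than the model above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = ['Sex', 'Age', 'Pclass', 'SibSp', 'Parch', 'Embarked']

# Synthetic data: the label depends only on the first column
rng = np.random.default_rng(42)
X = rng.normal(size=(300, len(feature_names)))
y = (X[:, 0] > 0).astype(int)

# bootstrap defaults to True, which oob_score=True requires
forest = RandomForestClassifier(n_estimators=200, max_depth=5,
                                oob_score=True, random_state=42)
forest.fit(X, y)

# Importances sum to 1; larger values mean the trees split on that feature more usefully
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
print(forest.oob_score_)  # accuracy estimated on rows left out of each tree's bootstrap sample
```

On the real data, the same two attributes could be read from a forest fitted on `train_feature` and `train_class`, provided it is created with `bootstrap=True` and `oob_score=True`.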


In [None]:
RandForst_TestResult= pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':test_predict})
RandForst_TestResult.head()
RandForst_TestResult.to_csv('RandForst_Test.csv',index=False)