## Classifying and predicting the survivability of the passengers in the dataset

### Classification using the decision tree

In [1]:
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing as pp

trainData = pd.read_csv("Dataset/train.csv")

trainData.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

#### Only "Cabin" has null cells, so the columns used as model features have none

In [2]:
encoded_gender = pp.LabelEncoder().fit_transform(trainData['Sex'])

In [3]:
classify = pd.DataFrame([trainData['Pclass'],trainData['Age'],encoded_gender,trainData['Fare']]).T

tree_model = tree.DecisionTreeClassifier(max_depth=6)

tree_model.fit(X=classify,y=trainData['Survived'])

DecisionTreeClassifier(max_depth=6)

In [4]:
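with open("Dtree.dot","w") as f:
    # export_graphviz writes to the open file and returns None when out_file is given,
    # so the result must not be assigned back to f
    tree.export_graphviz(tree_model,feature_names=["Pclass","Age","Sex","Fare"],out_file=f)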

### The resulting tree is too big to read
### We will use a random forest to find the important variables and shrink the tree

### Once the classifier has been fitted, the tree model is available

#### Below is the accuracy of the trained model, measured on the training data itself

In [5]:
tree_model.score(X=classify,y=trainData["Survived"])

0.8627671541057368
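
Because this score is computed on the same rows the tree was fitted on, it tends to be optimistic. A minimal sketch of a hold-out check follows. It runs on synthetic stand-in data rather than `train.csv`, and the split size, the `random_state` values and the label rule are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four features used above (NOT the real Titanic data).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Age": rng.uniform(1, 80, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Fare": rng.uniform(5, 500, size=n),
})
# Hypothetical label: driven by Sex, with roughly 20% of rows replaced by noise.
noise = rng.random(n) < 0.2
y = np.where(noise, rng.integers(0, 2, size=n), 1 - X["Sex"])

# Keep 25% of the rows aside; the tree never sees them during fitting.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = DecisionTreeClassifier(max_depth=6, random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on rows used for fitting
val_acc = model.score(X_val, y_val)        # accuracy on held-out rows
print("train accuracy:", train_acc)
print("validation accuracy:", val_acc)
```

On the real data the same pattern would be applied to `classify` and `trainData['Survived']`; the validation figure is the one to compare between models.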

## =========================================================================
## Now predicting survival on the test dataset with the trained model

In [6]:
testData =  pd.read_csv("Dataset/test.csv")

testData['Sex'] = pp.LabelEncoder().fit_transform(testData['Sex'])

test_features = pd.DataFrame([testData['Pclass'],testData['Age'],testData['Sex'],testData['Fare']]).T

test_predict = tree_model.predict(X=test_features)

predicted_data = pd.DataFrame({"PassengerId":testData['PassengerId'],"Survived":test_predict})

# Write the predictions to the output file
predicted_data.to_csv("Predicted_output.csv",index=False)

predicted_data.head()

   PassengerId  Survived
0          892         0
1          893         1
2          894         0
3          895         0
4          896         0
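
The test cell fits a fresh `LabelEncoder` on the test `Sex` column. That happens to give the same codes as the training encoder because `LabelEncoder` sorts the labels, but it only holds while both files contain the same set of values. A small sketch of the safer pattern, fitting once and reusing the encoder, is shown below; the two short series are made-up examples, not rows from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up example values standing in for trainData['Sex'] and testData['Sex'].
train_sex = pd.Series(["male", "female", "female", "male"])
test_sex = pd.Series(["female", "male", "male"])

# Fit the encoder once, on the training column only.
encoder = LabelEncoder().fit(train_sex)

# Reuse the same fitted encoder for both splits so the codes always agree.
train_encoded = encoder.transform(train_sex)
test_encoded = encoder.transform(test_sex)

# classes_ holds the sorted labels; a label's position in it is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

With this pattern, an unseen label in the test file raises an error in `transform` instead of silently shifting the codes.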


## =========================================================================
## Now applying RandomForestClassifier

In [7]:
trainData['Sex']      = pp.LabelEncoder().fit_transform(trainData['Sex'])
trainData['Embarked'] = pp.LabelEncoder().fit_transform(trainData['Embarked'])

trainData['Age']  = np.round(trainData['Age'])
trainData['Fare'] = np.round(trainData['Fare'])

In [8]:
rf_model = RandomForestClassifier(n_estimators=1000,max_features=2,oob_score=True)

features=["Pclass","Sex","Age","SibSp","Fare","Embarked"]

rf_model.fit(X=trainData[features],y=trainData['Survived'])

RandomForestClassifier(max_features=2, n_estimators=1000, oob_score=True)

In [9]:
print("OOB Accuracy score: ", rf_model.oob_score_)

OOB Accuracy score:  0.8110236220472441


In [10]:
for feature, imp in zip(features, rf_model.feature_importances_):
    print(feature,"\t: ", imp)

Pclass 	:  0.09664082784881883
Sex 	:  0.2830930497424547
Age 	:  0.3001951951608912
SibSp 	:  0.05555661631118755
Fare 	:  0.22640227784765757
Embarked 	:  0.038112033088990055


### Only "Age", "Sex" and "Fare" have clearly higher importance than the other features, so only these variables are treated as the important ones.

### Now taking only the important variables for the decision tree
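
The selection can also be made explicit by ranking the importances and applying a cutoff. The sketch below reuses the values printed above; the `threshold` of 0.2 is a hypothetical cutoff chosen for illustration, not something the forest provides.

```python
# Importances copied from the output of the cell above.
importances = {
    "Pclass": 0.09664082784881883,
    "Sex": 0.2830930497424547,
    "Age": 0.3001951951608912,
    "SibSp": 0.05555661631118755,
    "Fare": 0.22640227784765757,
    "Embarked": 0.038112033088990055,
}

# Hypothetical cutoff: keep features that carry at least 20% of the total importance.
threshold = 0.2

# Rank from most to least important, then keep those above the cutoff.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
selected = [name for name, score in ranked if score >= threshold]

for name, score in ranked:
    print(f"{name:<10}{score:.3f}")
print("selected:", selected)
```

With this cutoff the selected features are Age, Sex and Fare, in that order.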

In [14]:
tree_model = tree.DecisionTreeClassifier(max_depth=6, max_leaf_nodes=12)

cl_data = pd.DataFrame([trainData["Age"],trainData["Sex"],trainData["Fare"]]).T

tree_model.fit(X=cl_data,y=trainData["Survived"])

DecisionTreeClassifier(max_depth=6, max_leaf_nodes=12)

In [15]:
tree_model.score(X=cl_data,y=trainData["Survived"])

0.8076490438695163

In [16]:
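with open("Dtree_Survived.dot","w") as f:
    # export_graphviz writes to the open file and returns None when out_file is given,
    # so the result must not be assigned back to f
    tree.export_graphviz(tree_model,feature_names=["Age","Sex","Fare"],out_file=f)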
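
Besides rendering the `.dot` file, the split rules of a fitted tree can be printed as plain text with `sklearn.tree.export_text`. The sketch below fits a tree on eight made-up rows (not taken from the dataset) in which the label depends only on `Sex`, so the printed rules are easy to check by eye.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Eight made-up rows; Sex uses the same coding as above (0 = female, 1 = male).
X = pd.DataFrame({
    "Age":  [22, 38, 26, 35, 54, 2, 27, 14],
    "Sex":  [1, 0, 0, 0, 1, 1, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, 30.07],
})
# Hypothetical label for the illustration: survived exactly when Sex == 0.
y = [0, 1, 1, 1, 0, 0, 1, 1]

model = DecisionTreeClassifier(max_depth=6, max_leaf_nodes=12, random_state=0)
model.fit(X, y)

# export_text returns the decision rules as an indented string.
rules_text = export_text(model, feature_names=["Age", "Sex", "Fare"])
print(rules_text)
```

Applied to the `tree_model` fitted above, the same call would list every path from root to leaf, which is a convenient starting point for writing down conditions by hand.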

## Survival Conditions

##### 1. If the passenger is female and paid a fare of more than 44.5, then the survivability is high.
##### 2. If the passenger is female, paid a fare of more than 48.5 and is more than 8 years old, then the survivability is high.
##### 3. If the passenger is male, is more than 6.5 years old and paid a fare of more than 27, then the survivability is high.
##### 4. If the passenger is male, is more than 13.5 years old and paid a fare between 25.5 and 27, then the survivability is high.
##### 5. If the passenger is male, is more than 47.5 years old and paid a fare between 25.5 and 387.5, then the survivability is high.
##### 6. If the passenger is male, is more than 6.5 years old and paid a fare of more than 387.5, then the survivability is high.
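
The six conditions above can be written as a small predicate, which makes it easy to try them on individual passengers. This is only a sketch of the list as stated: the function name `matches_survival_rule` is hypothetical, `sex` is passed as a string rather than the encoded integer, and "between a and b" is assumed to mean `a < fare <= b`, following the `<=` convention of decision tree splits.

```python
def matches_survival_rule(sex, age, fare):
    """Return True if a passenger satisfies any of the six listed conditions."""
    if sex == "female":
        rule_1 = fare > 44.5
        rule_2 = fare > 48.5 and age > 8
        return rule_1 or rule_2
    if sex == "male":
        rule_3 = age > 6.5 and fare > 27
        rule_4 = age > 13.5 and 25.5 < fare <= 27
        rule_5 = age > 47.5 and 25.5 < fare <= 387.5
        rule_6 = age > 6.5 and fare > 387.5
        return rule_3 or rule_4 or rule_5 or rule_6
    # Any other value of sex is not covered by the listed conditions.
    return False


# Hypothetical passengers used only to exercise the rules.
print(matches_survival_rule("female", 30, 50))  # satisfies condition 1
print(matches_survival_rule("male", 30, 26))    # satisfies condition 4
print(matches_survival_rule("male", 5, 30))     # too young for any male condition
```

Writing the list this way also shows that some conditions overlap: any passenger matching condition 2 already matches condition 1, and condition 6 is contained in condition 3.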
