In this kernel we perform a complete analysis to determine which groups of passengers were most likely to survive the sinking of the Titanic. In particular, we apply the tools of machine learning to predict which passengers survived the tragedy.

**Description of the Data Fields:**
**survival** - Survival (0 = No; 1 = Yes)
**class** - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
**name** - Name
**sex** - Sex
**age** - Age
**sibsp** - Number of Siblings/Spouses Aboard
**parch** - Number of Parents/Children Aboard
**ticket** - Ticket Number
**fare** - Passenger Fare
**cabin** - Cabin
**embarked** - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

In [None]:
#importing required libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
# Reading the training and test data
titanic_train = pd.read_csv("../input/train.csv")
titanic_test = pd.read_csv("../input/test.csv")

# The first thing after reading the dataset is to know the dimensions.
print("The dimensions of the training data and test data are",titanic_train.shape,"and",titanic_test.shape,"respectively")

We can infer from the above that the training data contains one extra column, which is the label. Let’s get some more information about the dataset using the .info() method.


In [None]:
titanic_train.info()



The most important insight from the output above is that about 20 percent of the Age values and about 77 percent of the Cabin values are null.
Looking at the Cabin column, we are missing too much of that data to do anything useful with the raw values, even at a basic level. We will either drop the Cabin column or replace its values with a 0/1 tag.
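
The missing-value percentages can also be computed directly instead of being read off the .info() output. The following is a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not the real data); the same expression should work on titanic_train.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for titanic_train; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull() marks missing cells; the column-wise mean is the fraction missing.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```

On the real data, titanic_train.isnull().mean() would give the corresponding fractions for every column at once.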

**Exploratory Data Analysis**

In [None]:

sns.set_style('whitegrid')

# People who survived vs. those who didn't
sns.countplot(x='Survived', data=titanic_train)

From the above plot it is clear that the number of survivors was significantly lower than the number of people who didn’t survive. 

In [None]:
sns.countplot(x='Survived', hue='Sex', data= titanic_train,palette='RdBu_r')

The above plot shows that males make up most of the passengers who didn’t survive, while females clearly outnumber males among the survivors. A likely explanation is that the crew followed a "women and children first" policy when loading the lifeboats.

In [None]:
sns.countplot(x='Survived', hue='Pclass', data= titanic_train, palette='rainbow')



The above plot shows that most of the passengers who lost their lives in this tragic incident belonged to Class 3. There were more survivors from Class 1 than any other class.
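
Count plots compare raw numbers, and the groups differ in size, so it can also help to look at survival rates per group. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration only); because Survived is coded 0/1, the group mean is the survival rate.

```python
import pandas as pd

# Toy rows standing in for titanic_train; the values are made up for illustration.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Sex": ["male", "female", "female", "male", "male", "female", "male", "male"],
    "Pclass": [3, 1, 2, 3, 2, 1, 3, 1],
})

# Survived is 0/1, so its mean within a group is that group's survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```

Applying the same groupby calls to titanic_train would give the rates behind the two plots above.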

Now let's handle the missing values. For the Cabin column, we assign the tag 1 to rows with a valid cabin number and the tag 0 to rows where the value is NaN.


In [None]:
# Flag whether a cabin was recorded: 1 for a valid cabin number, 0 for NaN
titanic_train['Cabin'] = titanic_train['Cabin'].notna().astype(int)

In [None]:
titanic_train['Cabin'].describe()

Now let's fill in the missing values of the Age column. We can do this by computing the mean and standard deviation of the known ages and then filling each null age with a random integer drawn from within one standard deviation of the mean.

In [None]:
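age_avg = titanic_train['Age'].mean()
age_std = titanic_train['Age'].std()

# Draw one random integer per missing age from [mean - std, mean + std)
age_null = titanic_train['Age'].isnull()
random_ages = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null.sum())

# Assign with .loc to avoid chained-assignment problems, then store ages as integers
titanic_train.loc[age_null, 'Age'] = random_ages
titanic_train['Age'] = titanic_train['Age'].astype(int)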

In [None]:
titanic_train['Age'].describe()
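
The random age fill can be sketched on toy data to show that each missing entry gets its own draw. This is only an illustration under stated assumptions: the `ages` values are made up, and the seeded generator `rng` is an addition (not part of the original kernel) so that the result is repeatable.

```python
import numpy as np
import pandas as pd

# Toy ages standing in for titanic_train['Age']; the values are made up.
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0])

# Integer bounds one standard deviation either side of the mean.
low = int(ages.mean() - ages.std())
high = int(ages.mean() + ages.std())

# Seeded generator (an assumption, not in the original) for repeatable draws.
rng = np.random.default_rng(0)
missing = ages.isnull()

# Draw one integer per missing entry so the filled values can differ.
ages_filled = ages.copy()
ages_filled.loc[missing] = rng.integers(low, high, size=missing.sum())
ages_filled = ages_filled.astype(int)
print(ages_filled.tolist())
```

Passing size equal to the number of missing entries is the important detail: without it, a single random value would be broadcast to every missing row.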

In [None]:
# Fill missing ports with "S" (Southampton), the most common value in the training data
titanic_train["Embarked"] = titanic_train["Embarked"].fillna("S")

So we have now successfully handled the missing data. 

Next, we add two new columns:
1. Adding a column denoting the family size.
2. Adding a column denoting whether the passenger travelled alone or not.

In [None]:
titanic_train['family_size'] = titanic_train['SibSp'] + titanic_train['Parch'] + 1
titanic_train['is_alone'] = 0
titanic_train.loc[titanic_train['family_size'] == 1, 'is_alone'] = 1


Now we need to convert the categorical features to numeric values using pandas. Otherwise, our machine learning algorithm won’t be able to take those features directly as inputs. To do that, we map each category to an integer code.

In [None]:
train = titanic_train.copy()

# Mapping Sex
sex_map = {'female': 0, 'male': 1}
train['Sex'] = train['Sex'].map(sex_map).astype(int)

# Mapping Embarked
embark_map = {'S': 0, 'C': 1, 'Q': 2}
train['Embarked'] = train['Embarked'].map(embark_map).astype(int)
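
The integer codes above place the ports on an arbitrary numeric scale (S, then C, then Q). An alternative is one-hot encoding, often called dummy variables, which creates one 0/1 column per category. The sketch below shows this with pd.get_dummies on a made-up column (`toy` is an assumption for illustration); it is offered as a comparison, not as the approach this kernel uses.

```python
import pandas as pd

# Toy column standing in for the Embarked column before any integer mapping.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# Replace the original string column with its indicator columns.
encoded = pd.concat([toy.drop(columns="Embarked"), dummies], axis=1)
print(encoded)
```

Each row has exactly one indicator set to 1, so no ordering between the ports is implied.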

In [None]:
train.head(5)

From the above we can see that the Ticket column consists of seemingly arbitrary strings that are not very useful. The Name column mainly contains titles such as Mr., Mrs. and Miss, which largely reflect the passenger's sex or age, so it depends on other columns and adds little. These two features can therefore be considered redundant and dropped.

In [None]:
# Drop the two redundant columns in one call
train = train.drop(["Name", "Ticket"], axis=1)

In [None]:
train.describe(include='all')

We now have a cleaned dataset on which we can apply our machine learning models. First, we separate the label from the features.

In [None]:
Y=train.Survived
train=train.drop("Survived",axis=1)

Now we split the training data into training and validation sets.

In [None]:

from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(train, Y, test_size=0.33, random_state=42)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, log_loss
from xgboost import XGBClassifier
classifier =  XGBClassifier(n_estimators=1000, learning_rate=0.05,n_jobs=-1)
classifier.fit(X_train, y_train)
pred3 = classifier.predict(X_valid)

print(classification_report(y_valid, pred3))
print('\n')
print(confusion_matrix(y_valid, pred3))
print('\n')
print(accuracy_score(y_valid, pred3))

With this model, we obtained good accuracy on the validation set, as reported by the accuracy score printed above.
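
To make the link between the confusion matrix and the accuracy score concrete, here is a small worked example. The labels and predictions are made up for eight hypothetical passengers (they are not results from the model above), and pd.crosstab is used as a stand-in that follows the same layout as confusion_matrix: rows are actual classes, columns are predicted classes.

```python
import pandas as pd

# Made-up labels and predictions for eight hypothetical passengers.
y_true = pd.Series([0, 0, 1, 1, 0, 1, 0, 1], name="actual")
y_pred = pd.Series([0, 1, 1, 1, 0, 0, 0, 1], name="predicted")

# Rows are actual classes, columns are predicted classes.
cm = pd.crosstab(y_true, y_pred)
print(cm)

# Accuracy is the share of predictions on the diagonal (the correct ones).
correct = cm.loc[0, 0] + cm.loc[1, 1]
accuracy = correct / cm.values.sum()
print(accuracy)
```

Here six of the eight predictions fall on the diagonal, so the accuracy is 0.75. The off-diagonal cells show which kind of error was made, which the single accuracy number hides.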