# Kaggle - Machine Learning Competition

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “What sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

In [1]:
# Importing the Dependencies

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

In [2]:
# Importing the Machine Learning Classification models

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [3]:
# Loading the Titanic Dataset

x_train_df = pd.read_csv('c://users/santhosh reddy/desktop/untitled folder/project-1 (titanic dataset)/train.csv')

x_test_df = pd.read_csv('c://users/santhosh reddy/desktop/untitled folder/project-1 (titanic dataset)/test.csv')

y_test_df = pd.read_csv('c://users/santhosh reddy/desktop/untitled folder/project-1 (titanic dataset)/y_test_data.csv')

In [4]:
x_train_df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
x_test_df.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [6]:
x_train_df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [7]:
y_test_df.head()

,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [8]:
print(x_train_df.shape, x_test_df.shape)

(891, 12) (418, 11)


# Exploratory Data Analysis

In [9]:
x_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [10]:
# Removing the Name, PassengerId and Ticket columns from the dataset

x_train_df = x_train_df.drop(columns=['Name','Ticket','PassengerId'])

x_test_df = x_test_df.drop(columns=['Name','Ticket','PassengerId'])

y_test_df = y_test_df.drop(columns=['PassengerId'])

In [11]:
# Finding the null values

x_train_df.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [12]:
# Attaching the target to the test features so that rows dropped later stay aligned

x_test_df['Survived'] = y_test_df['Survived']

In [13]:
# Cabin has 687 null values out of 891 rows, so we remove the Cabin column

x_train_df = x_train_df.drop(columns='Cabin')

x_test_df = x_test_df.drop(columns='Cabin')

In [14]:
x_train_df.Sex.unique()

array(['male', 'female'], dtype=object)

In [15]:
# Removing the rows that have null values

x_train_df = x_train_df.dropna(axis=0)
x_test_df = x_test_df.dropna(axis=0)

In [16]:
print(x_train_df.shape, x_test_df.shape)

(712, 8) (331, 8)


In [17]:
# Creating the y_test and x_test data

y_test = x_test_df['Survived']
x_test_df = x_test_df.drop(columns='Survived')

In [18]:
x_train_df.Survived.value_counts()

0    424
1    288
Name: Survived, dtype: int64

In [19]:
x_train_df.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

In [20]:
# Splitting the x_train into features and target

x_train = x_train_df.drop(columns='Survived', axis=1)
y_train = x_train_df['Survived']

In [21]:
# Knowing the shape of the training data

print(x_train.shape, y_train.shape)

(712, 7) (712,)


# LabelEncoding the Training and Testing data

In [22]:
# Converting the 'Sex' column into numbers using the LabelEncoder
# (fit on the training data, then apply the same mapping to the test data)

le = LabelEncoder()
x_train['Sex'] = le.fit_transform(x_train['Sex'])
x_test_df['Sex'] = le.transform(x_test_df['Sex'])

Sex column LabelEncoding output

Male   --> 1   
Female --> 0

In [23]:
# Converting the 'Embarked' column into numbers using the LabelEncoder
# (fit on the training data, then apply the same mapping to the test data)

x_train['Embarked'] = le.fit_transform(x_train['Embarked'])
x_test_df['Embarked'] = le.transform(x_test_df['Embarked'])

Embarked column LabelEncoding output

C --> 0   
Q --> 1    
S --> 2
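
A mapping like the one above can be double-checked by inspecting the fitted encoder. The sketch below is a minimal illustration on a hand-made list of ports; the `ports` values and the names `embarked_encoder`, `mapping` and `encoded` are assumptions for the example, not the notebook's data. `LabelEncoder` assigns integer codes in sorted order of the class labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of Embarked values, for illustration only
ports = ['S', 'C', 'S', 'Q', 'C']

embarked_encoder = LabelEncoder()
embarked_encoder.fit(ports)

# classes_ holds the sorted labels; each label's position is its integer code
mapping = {str(label): int(code)
           for code, label in enumerate(embarked_encoder.classes_)}
print(mapping)

# A fitted encoder can be reused on new data with transform(),
# which keeps the codes identical across data sets
encoded = embarked_encoder.transform(['Q', 'S']).tolist()
print(encoded)
```

Calling `transform` (rather than `fit_transform`) on a second data set guarantees the same label-to-code mapping, even if that set happens to be missing one of the categories.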

In [24]:
x_train.head()

,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,1,22.0,1,0,7.25,2
1,1,0,38.0,1,0,71.2833,0
2,3,0,26.0,0,0,7.925,2
3,1,0,35.0,1,0,53.1,2
4,3,1,35.0,0,0,8.05,2


In [25]:
x_test_df.head()

,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,1,34.5,0,0,7.8292,1
1,3,0,47.0,1,0,7.0,2
2,2,1,62.0,0,0,9.6875,1
3,3,1,27.0,0,0,8.6625,2
4,3,0,22.0,1,1,12.2875,2


# Standardizing the train and test data

In [26]:
scalar = StandardScaler()

In [27]:
# Standardizing the x_train and x_test data

x_train_std = scalar.fit_transform(x_train)
x_test_std = scalar.fit_transform(x_test_df)

In [28]:
print(x_train_std, x_test_std)

[[ 0.90859974  0.75613751 -0.52766856 ... -0.50678737 -0.51637992
   0.51958818]
 [-1.48298257 -1.32251077  0.57709388 ... -0.50678737  0.69404605
  -2.04948671]
 [ 0.90859974 -1.32251077 -0.25147795 ... -0.50678737 -0.50362035
   0.51958818]
 ...
 [-1.48298257 -1.32251077 -0.73481151 ... -0.50678737 -0.08633507
   0.51958818]
 [-1.48298257  0.75613751 -0.25147795 ... -0.50678737 -0.08633507
  -2.04948671]
 [ 0.90859974  0.75613751  0.16280796 ... -0.50678737 -0.50692839
  -0.76494927]] [[ 1.01542612  0.78901776  0.30665727 ... -0.49211953 -0.54228095
  -0.50868113]
 [ 1.01542612 -1.2673986   1.19423645 ... -0.49211953 -0.55584416
   0.6525151 ]
 [-0.16804587  0.78901776  2.25933148 ... -0.49211953 -0.51188479
  -0.50868113]
 ...
 [ 1.01542612 -1.2673986  -0.15488391 ... -0.49211953 -0.5431675
   0.6525151 ]
 [-1.35151786 -1.2673986   0.62618578 ... -0.49211953  1.11093161
  -1.66987736]
 [ 1.01542612  0.78901776  0.59068261 ... -0.49211953 -0.55175491
   0.6525151 ]]
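
The cell above calls `fit_transform` on both data sets, so each set is scaled with its own mean and standard deviation. A common alternative is to learn the statistics from the training data only and reuse them on the test data. The sketch below illustrates that pattern on small made-up arrays (`train_demo`, `test_demo` and `demo_scaler` are hypothetical names); it is an illustration of the pattern, not a rerun of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices, for illustration only
train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test_demo = np.array([[2.5, 25.0], [5.0, 50.0]])

demo_scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
train_demo_std = demo_scaler.fit_transform(train_demo)

# transform reuses those training statistics without refitting
test_demo_std = demo_scaler.transform(test_demo)

print(demo_scaler.mean_)   # per-column means learned from train_demo
print(test_demo_std)
```

With this pattern a test row equal to the training mean maps to zero, and both sets are expressed on the same scale.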


# Logistic Regression model

In [29]:
# parameters of the model
logistic_param = {'fit_intercept':[True,False],
                 'C':[1,5,10,20],
                 'solver':['lbfgs','liblinear','newton-cg','saga']
                 }

In [30]:
# Using GridSearchCV to find the best parameters for the model

logistic_grid = GridSearchCV(LogisticRegression(), logistic_param, cv=5)

In [31]:
# Fitting the GridSearchCV

logistic_grid.fit(x_train_std, y_train)

In [32]:
# Best parameter combinations from GridsearchCV

logistic_grid.best_params_

{'C': 5, 'fit_intercept': True, 'solver': 'lbfgs'}
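
By default `GridSearchCV` refits the best parameter combination on the full training data (`refit=True`), so the tuned model is also available as `best_estimator_` and does not have to be rebuilt by hand. The sketch below shows this on synthetic data from `make_classification`; the data, the reduced parameter grid and the names `X_demo`, `y_demo`, `demo_grid` and `best_model` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# A reduced grid in the spirit of logistic_param
param_grid = {'C': [1, 5, 10], 'fit_intercept': [True, False]}

demo_grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
demo_grid.fit(X_demo, y_demo)

# best_estimator_ is already fitted with the best parameters
best_model = demo_grid.best_estimator_
demo_pred = best_model.predict(X_demo)

# best_score_ is the mean cross-validated accuracy of the best combination
print(demo_grid.best_params_, round(demo_grid.best_score_, 3))
```

Reading the model from `best_estimator_` avoids copying parameter values by hand, which is where mismatches between the search result and the final model tend to creep in.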

In [33]:
logistic_model = LogisticRegression(C=5, fit_intercept=True, max_iter=1000)

In [34]:
# Training the model with training data

logistic_model.fit(x_train_std, y_train)

In [35]:
# Logistic model train data prediction

logistic_train_pred = logistic_model.predict(x_train_std)

In [36]:
# Logistic model train data accuracy

logistic_train_accuracy = accuracy_score(y_train, logistic_train_pred)
print(logistic_train_accuracy)

0.8033707865168539


In [37]:
# Logistic model test data prediction

logistic_test_pred = logistic_model.predict(x_test_std)

In [38]:
# Logistic model test data accuracy_score

logistic_test_accuracy = accuracy_score(y_test, logistic_test_pred)
print(logistic_test_accuracy)

0.9244712990936556
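
Accuracy alone does not show which kind of mistake a classifier makes. `confusion_matrix` is imported at the top of the notebook but not used; the sketch below shows one way it could complement `accuracy_score`. The labels are made up for illustration (`y_true_demo` and `y_pred_demo` are hypothetical names), with 0 meaning "did not survive" and 1 meaning "survived".

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels, for illustration only
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

# For a binary problem, ravel() yields the four cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print('true negatives:', tn, 'false positives:', fp)
print('false negatives:', fn, 'true positives:', tp)
print('accuracy:', accuracy_score(y_true_demo, y_pred_demo))
```

In the notebook the same call would take `y_test` and a prediction array such as `logistic_test_pred`.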


# Support Vector Classifier

In [39]:
# SVC model parameters 

svc_params = {'C':[1,5,10],
              'kernel':['linear','rbf','sigmoid'],
              'coef0':[0.001,0.4,0.8]
             }

In [40]:
# Using GridSearchCV to find the best parameters of the model

svc_grid = GridSearchCV(SVC(), svc_params, cv=5)

In [41]:
# Training the GridSearchCV with training data

svc_grid.fit(x_train_std, y_train)

In [42]:
# Best parameters combination for the model

svc_grid.best_params_

{'C': 1, 'coef0': 0.001, 'kernel': 'rbf'}

In [43]:
svc_model = SVC(C=1.0, coef0=0.001, kernel='rbf', tol=0.0001, max_iter=-1)

In [44]:
#Training the model with Training data

svc_model.fit(x_train_std, y_train)

In [45]:
# Training data prediction for SVC model

svc_train_pred = svc_model.predict(x_train_std)

In [46]:
# Accuracy of the training data

svc_train_accuracy = accuracy_score(y_train, svc_train_pred)
print(svc_train_accuracy)

0.8441011235955056


In [47]:
# Test data prediction for SVC model

svc_test_pred = svc_model.predict(x_test_std)

In [48]:
# Accuracy of the Test data

svc_test_accuracy = accuracy_score(y_test, svc_test_pred)
print(svc_test_accuracy)

0.9063444108761329


# KNeighbors Classifier

In [49]:
knc_params = {'n_neighbors':[3,5,10,20],
             'weights':['uniform','distance'],
             'leaf_size':[30,40,50],
             'p':[1,1.5,2]
             }

In [50]:
# Using GridSearchCV to find the best parameters of the model

knc_grid = GridSearchCV(KNeighborsClassifier(), knc_params, cv=5)

In [51]:
# Training the GridSearchCV with training data

knc_grid.fit(x_train_std, y_train)

In [52]:
# Best parameters combination for the model

knc_grid.best_params_

{'leaf_size': 30, 'n_neighbors': 10, 'p': 1, 'weights': 'uniform'}

In [53]:
knc_model = KNeighborsClassifier(n_neighbors=10, p=1, leaf_size=30, weights='uniform')

In [54]:
#Training the model with Training data

knc_model.fit(x_train_std, y_train)

In [55]:
# Training data prediction of the KNeighbours model

knc_train_pred = knc_model.predict(x_train_std)

In [56]:
# Accuracy of the training data

knc_train_accuracy = accuracy_score(y_train, knc_train_pred)
print(knc_train_accuracy)

0.8426966292134831


In [57]:
# Test data prediction of the KNeighbours model

knc_test_pred = knc_model.predict(x_test_std)

In [58]:
# Accuracy of the test data

knc_test_accuracy = accuracy_score(y_test, knc_test_pred)
print(knc_test_accuracy)

0.851963746223565


# GaussianNB Classifier

In [59]:
gaussian_model = GaussianNB()

In [60]:
#Training the model with Training data

gaussian_model.fit(x_train_std, y_train)

In [61]:
# Training data prediction of the GaussianNB model

gaussian_train_pred = gaussian_model.predict(x_train_std)

In [62]:
# Accuracy of the training data

gaussian_train_accuracy = accuracy_score(y_train, gaussian_train_pred)
print(gaussian_train_accuracy)

0.7949438202247191


In [63]:
# Test data prediction of the GaussianNB model

gaussian_test_pred = gaussian_model.predict(x_test_std)

In [64]:
# Accuracy of the test data

gaussian_test_accuracy = accuracy_score(y_test, gaussian_test_pred)
print(gaussian_test_accuracy)

0.8580060422960725


# Decision Tree Classifier

In [65]:
dtree_params = {'criterion':['gini','entropy','log_loss'],
               'max_depth':[3,4,5,6,7,8],
               'min_samples_split':[2,5,10,20,25,30],
               'min_samples_leaf':[1,2,5,7,10],
               'max_features':['sqrt','log2'],
                'random_state':[3]
               }

In [66]:
# Using GridSearchCV to find the best parameters of the model

dtree_grid = GridSearchCV(DecisionTreeClassifier(), dtree_params, cv=5)

In [67]:
# Training the GridSearchCV with training data

dtree_grid.fit(x_train_std, y_train)

In [68]:
# Best parameters combination for the model

dtree_grid.best_params_

{'criterion': 'gini',
 'max_depth': 6,
 'max_features': 'sqrt',
 'min_samples_leaf': 5,
 'min_samples_split': 2,
 'random_state': 3}

In [69]:
dtree_model = DecisionTreeClassifier(criterion='gini', max_depth=6, max_features='sqrt', min_samples_leaf=5, min_samples_split=2, random_state=3)

In [70]:
#Training the model with Training data

dtree_model.fit(x_train_std, y_train)

In [71]:
# Training data prediction of the DecisionTree model

dtree_train_pred = dtree_model.predict(x_train_std)

In [72]:
# Accuracy of the training data

dtree_train_accuracy = accuracy_score(y_train, dtree_train_pred)
print(dtree_train_accuracy)

0.8342696629213483


In [73]:
# Test data prediction of the DecisionTree model
dtree_test_pred = dtree_model.predict(x_test_std)

In [74]:
# Accuracy of the test data

dtree_test_accuracy = accuracy_score(y_test, dtree_test_pred)
print(dtree_test_accuracy)

0.9274924471299094


In [75]:
# List of the models we used to classify the data

models_list = [LogisticRegression(C=5, fit_intercept=True, max_iter=1000),
              SVC(C=1.0, coef0=0.001, kernel='rbf', tol=0.0001, max_iter=-1),
              KNeighborsClassifier(n_neighbors=10, p=1, leaf_size=30, weights='uniform'),
              GaussianNB(),
              DecisionTreeClassifier(criterion='gini', max_depth=6, max_features='sqrt', min_samples_leaf=5, min_samples_split=2, random_state=3)
              ]

In [76]:
# Accuracy score of training and test data for all models

for model in models_list:
    model.fit(x_train_std, y_train)
    train_pred = model.predict(x_train_std)
    test_pred = model.predict(x_test_std)
    train_accuracy = accuracy_score(y_train, train_pred)
    test_accuracy = accuracy_score(y_test, test_pred)
    print(model)
    print('The training data accuracy score is :',round(train_accuracy*100,2))
    print('The test data accuracy score is :',round(test_accuracy*100,2))
    print('***********************************************************************************')

LogisticRegression(C=5, max_iter=1000)
The training data accuracy score is : 80.34
The test data accuracy score is : 92.45
***********************************************************************************
SVC(coef0=0.001, tol=0.0001)
The training data accuracy score is : 84.41
The test data accuracy score is : 90.63
***********************************************************************************
KNeighborsClassifier(n_neighbors=10, p=1)
The training data accuracy score is : 84.27
The test data accuracy score is : 85.2
***********************************************************************************
GaussianNB()
The training data accuracy score is : 79.49
The test data accuracy score is : 85.8
***********************************************************************************
DecisionTreeClassifier(max_depth=6, max_features='sqrt', min_samples_leaf=5,
                       random_state=3)
The training data accuracy score is : 83.43
The test data accuracy score is : 92.75
*******
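
Printed blocks like the ones above are hard to compare at a glance. One option is to collect the scores in a table and sort it. The sketch below does this for two of the model types on synthetic data; the data, the split and the names `demo_models`, `rows` and `results` are assumptions for the example, and the numbers it prints are unrelated to the Titanic results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and split, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_models = {
    'GaussianNB': GaussianNB(),
    'DecisionTree': DecisionTreeClassifier(max_depth=6, random_state=3),
}

# One row per model with train and test accuracy in percent
rows = []
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    rows.append({
        'model': name,
        'train_accuracy': round(accuracy_score(y_tr, model.predict(X_tr)) * 100, 2),
        'test_accuracy': round(accuracy_score(y_te, model.predict(X_te)) * 100, 2),
    })

# Best test accuracy first
results = pd.DataFrame(rows).sort_values('test_accuracy', ascending=False)
print(results.to_string(index=False))
```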

The DecisionTreeClassifier gives the highest test accuracy of the five models (92.75%), narrowly ahead of LogisticRegression (92.45%).   
We will use this model to build the predictive system.

In [79]:
def survival_classification():
    input_data = []
    features_list = ['Pclass','Sex(M or F)','Age','SibSp','Parch','Fare','Embarked(Q, C or S)']
    for feature in features_list:
        print('Enter the',feature,'value :')
        in_data = input().strip()
        input_data.append(in_data)
    print(input_data)

    # LabelEncoding the Sex column (male --> 1, female --> 0)
    sex_codes = {'m': 1, 'f': 0}
    input_data[1] = sex_codes[input_data[1].lower()]

    # LabelEncoding the Embarked column (C --> 0, Q --> 1, S --> 2)
    embarked_codes = {'c': 0, 'q': 1, 's': 2}
    input_data[6] = embarked_codes[input_data[6].lower()]

    # Converting the input_data into a single-row DataFrame of floats,
    # with the same column order as the training features
    input_df = pd.DataFrame([input_data], columns=x_train.columns).astype(float)

    # Standardizing the input data with the training data statistics
    # (refitting a scaler on a single row would turn every feature into 0)
    input_scaler = StandardScaler().fit(x_train)
    input_data_std = input_scaler.transform(input_df)

    dtree_model = DecisionTreeClassifier(max_depth=6, max_features='sqrt', min_samples_leaf=5,random_state=3)
    dtree_model.fit(x_train_std, y_train)

    # Predicting from the standardized input, as the model was trained on standardized data
    input_data_prediction = dtree_model.predict(input_data_std)

    # In the Survived column, 1 means survived and 0 means did not survive
    if input_data_prediction[0] == 1:
        print('The person would have survived')
    else:
        print('The person would not have survived')
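
The encoding step of the function can also be written as a small helper that takes arguments instead of reading from `input()`, which makes it easier to check. The sketch below is one possible version; `encode_passenger` and its parameter names are hypothetical, and the codes follow the LabelEncoding outputs listed earlier (male 1, female 0; C 0, Q 1, S 2).

```python
def encode_passenger(pclass, sex, age, sibsp, parch, fare, embarked):
    """Return the seven features as floats, in the training column order."""
    sex_codes = {'m': 1, 'f': 0}
    embarked_codes = {'c': 0, 'q': 1, 's': 2}

    # Normalise case and whitespace before looking up the codes
    sex_key = str(sex).strip().lower()
    embarked_key = str(embarked).strip().lower()

    if sex_key not in sex_codes:
        raise ValueError("Sex must be 'M' or 'F'")
    if embarked_key not in embarked_codes:
        raise ValueError("Embarked must be 'C', 'Q' or 'S'")

    return [float(pclass), float(sex_codes[sex_key]), float(age),
            float(sibsp), float(parch), float(fare),
            float(embarked_codes[embarked_key])]


# Same values as the first training row shown in x_train.head()
row = encode_passenger('3', 'M', '22', '1', '0', '7.25', 's')
print(row)
```

A comparison such as `value == 'M' or 'm'` is always true in Python because the non-empty string `'m'` is truthy on its own; a dictionary lookup on the lower-cased value avoids that pitfall.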