
In [0]:
import pandas as pd
from zipfile import ZipFile

In [0]:
file_name = '/content/titanic-machine-learning-from-disaster.zip'

In [0]:
with ZipFile(file_name, 'r') as archive:
  # archive.printdir()
  archive.extractall()

In [0]:
train_data = pd.read_csv('/content/train.csv')
test_data = pd.read_csv('/content/test.csv')
all_data = [train_data, test_data]

In [0]:
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [0]:
train_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [0]:
train_data.shape

(891, 12)

Feature 1: How did the survival rate vary by passenger class?

In [0]:
print( train_data[['Pclass','Survived']].groupby(["Pclass"], as_index=False).mean())

   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363


Feature 2: How did the survival rate vary by sex?

In [0]:
print( train_data[['Sex','Survived']].groupby(["Sex"], as_index=False).mean())

      Sex  Survived
0  female  0.742038
1    male  0.188908


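Feature 3: Did family size play a role in survival? Family size is computed as SibSp + Parch + 1 (siblings/spouses aboard, plus parents/children aboard, plus the passenger).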

In [0]:
for data in all_data:
  data['family_size'] = data['SibSp'] + data['Parch'] + 1
print( train_data[['family_size', 'Survived']].groupby(["family_size"], as_index=False).mean())

   family_size  Survived
0            1  0.303538
1            2  0.552795
2            3  0.578431
3            4  0.724138
4            5  0.200000
5            6  0.136364
6            7  0.333333
7            8  0.000000
8           11  0.000000


From the table above, families of size 4 had the highest survival rate (about 72%), while passengers in families of 5 or more fared much worse.
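
Survival rates for rare family sizes rest on very few passengers, so it can help to look at the group size next to the rate. A minimal sketch on made-up data (the `toy` frame and the column names `rate` and `n_passengers` are assumptions for illustration, not values from the Titanic set):

```python
import pandas as pd

# Hypothetical toy data: 6 passengers alone, 4 in families of 2, 2 in a family of 4
toy = pd.DataFrame({
    "family_size": [1] * 6 + [2] * 4 + [4] * 2,
    "Survived":    [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1],
})

# Mean of a 0/1 column is the survival rate; count is the number of passengers behind it
summary = (
    toy.groupby("family_size")["Survived"]
       .agg(rate="mean", n_passengers="count")
       .reset_index()
)
print(summary)
```

In this toy frame the family of 4 shows a rate of 1.0, but it is based on only two passengers, which is the kind of caveat the count column makes visible.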

Feature 4: Did survival depend on whether a passenger was alone or not?

In [0]:
for data in all_data:
  data['is_alone'] = 0
  data.loc[data['family_size'] == 1, 'is_alone'] = 1
print( train_data[['is_alone', 'Survived']].groupby(["is_alone"], as_index=False).mean())

   is_alone  Survived
0         0  0.505650
1         1  0.303538


Feature 5: Did the port of embarkation play a role in survival? The Embarked column contains some missing values. We fill them with 'S', since it is the most frequent value.

In [0]:
for data in all_data:
  data['Embarked'] = data['Embarked'].fillna('S')
print( train_data[['Embarked', 'Survived']].groupby(["Embarked"], as_index=False).mean())

  Embarked  Survived
0        C  0.553571
1        Q  0.389610
2        S  0.339009
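
The fill value 'S' is hardcoded in the cell above. A small sketch of looking up the most frequent value with `Series.mode()` instead, on made-up data (the `toy` frame and `most_common` are names assumed for this example):

```python
import pandas as pd

# Hypothetical toy column with two missing entries
toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None, "S"]})

# mode() ignores missing values by default; take the first (most frequent) entry
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(most_common)
print(toy["Embarked"].value_counts())
```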


Feature 6: Did fare play a role in survival? The Fare column contains some missing values, which we fill with the column median. We then bin the fares with qcut, which chooses the bin edges from quantiles so that each of the four bins holds roughly the same number of records.

In [0]:
for data in all_data:
  data['Fare'] = data['Fare'].fillna(data['Fare'].median())
train_data['category_fare'] = pd.qcut(train_data['Fare'], 4)
print( train_data[['category_fare', 'Survived']].groupby(["category_fare"], as_index=False).mean())

     category_fare  Survived
0   (-0.001, 7.91]  0.197309
1   (7.91, 14.454]  0.303571
2   (14.454, 31.0]  0.454955
3  (31.0, 512.329]  0.581081
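
To see why qcut is used for a skewed column like Fare, it can be compared with cut, which splits the value range into equal-width bins. A minimal sketch on made-up fares (the `fares` series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical fares with one large outlier
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

by_quantile = pd.qcut(fares, 4)  # edges taken from quartiles
by_width = pd.cut(fares, 4)      # edges split the range 1..100 into equal widths

print(by_quantile.value_counts().sort_index())
print(by_width.value_counts().sort_index())
```

With these values the quantile bins each hold two records, while the equal-width bins put seven records in the first bin and the outlier alone in the last.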


Feature 7: Did age play a role in survival?

In [0]:
train_data['Age'].isnull().values.any()

True

Since the Age column contains some missing values, we fill them with random integers drawn between (mean age - standard deviation) and (mean age + standard deviation). After that, we group the ages into 5 equal-width bins.

In [0]:
import numpy as np

for data in all_data:
  age_mean = data['Age'].mean()
  age_std = data['Age'].std()
  age_null = data['Age'].isnull()

  # Draw random integer ages within one standard deviation of the mean
  random_age_list = np.random.randint(int(age_mean - age_std), int(age_mean + age_std), size=age_null.sum())
  # Assign through .loc to avoid chained indexing (SettingWithCopyWarning)
  data.loc[age_null, 'Age'] = random_age_list
  data['Age'] = data['Age'].astype(int)

train_data['category_age'] = pd.cut(train_data['Age'], 5)
print( train_data[['category_age', 'Survived']].groupby(["category_age"], as_index=False).mean())

    category_age  Survived
0  (-0.08, 16.0]  0.530435
1   (16.0, 32.0]  0.354120
2   (32.0, 48.0]  0.368421
3   (48.0, 64.0]  0.434783
4   (64.0, 80.0]  0.090909
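
Because the missing ages are filled with random draws, the table above changes slightly from run to run. A minimal sketch of a reproducible variant, assuming a seeded NumPy generator (the seed 0 and the `toy` frame are arbitrary choices for this example, not part of the original notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for this sketch

# Hypothetical toy column with two missing ages
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0]})

age_mean = toy["Age"].mean()
age_std = toy["Age"].std()
missing = toy["Age"].isnull()

# Integer bounds one standard deviation either side of the mean (upper bound exclusive)
low = int(age_mean - age_std)
high = int(age_mean + age_std)

# Assign through .loc so the original frame is modified, not a temporary copy
toy.loc[missing, "Age"] = rng.integers(low, high, size=int(missing.sum()))
toy["Age"] = toy["Age"].astype(int)
print(toy)
```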


Feature 8: Did survival depend on the passenger's title? Each name contains a title (Mr, Mrs, Miss, ...), which we extract using a regular expression.

In [0]:
test_data['Name'][:5]

0                                Kelly, Mr. James
1                Wilkes, Mrs. James (Ellen Needs)
2                       Myles, Mr. Thomas Francis
3                                Wirz, Mr. Albert
4    Hirvonen, Mrs. Alexander (Helga E Lindqvist)
Name: Name, dtype: object

In [0]:
import re

def get_title(name):
  # Capture the word that precedes a period, e.g. "Mr" in "Kelly, Mr. James"
  title_search = re.search(r' ([A-Za-z]+)\. ', name)
  if title_search:
    return title_search.group(1)
  return ""

for data in all_data:
  data['titles'] = data['Name'].apply(get_title)

for data in all_data:
  data['titles'] = data['titles'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
  data['titles'] = data['titles'].replace('Mlle', 'Miss')
  data['titles'] = data['titles'].replace('Mme', 'Mrs')
  data['titles'] = data['titles'].replace('Ms', 'Miss')

print( pd.crosstab(train_data['titles'], train_data['Sex']))
print('-------------------------------------')
print( train_data[['titles', 'Survived']].groupby(["titles"], as_index=False).mean())

Sex     female  male
titles              
Master       0    40
Miss       185     0
Mr           0   517
Mrs        126     0
Rare         3    20
-------------------------------------
   titles  Survived
0  Master  0.575000
1    Miss  0.702703
2      Mr  0.156673
3     Mrs  0.793651
4    Rare  0.347826
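
The same pattern can also be applied to the whole column at once with pandas string methods. A minimal sketch using `Series.str.extract`, as an alternative to the `apply` call above rather than the notebook's own method (the `names` series reuses two names shown earlier plus one made-up entry without a title):

```python
import pandas as pd

names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "No title here",  # hypothetical entry with no title
])

# expand=False returns a Series for a single capture group; non-matches become NaN
titles = names.str.extract(r" ([A-Za-z]+)\. ", expand=False)
print(titles)
```

Unlike `get_title`, which returns an empty string when nothing matches, this variant leaves a missing value, so a `fillna("")` would be needed to reproduce the same behaviour.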


Feature engineering complete!

Mapping begins!

In [0]:
# Map Data
for data in all_data:

  # Mapping Sex
  sex_map = {'female':0, 'male':1}
  data['Sex'] = data['Sex'].map(sex_map).astype(int)

  #Mapping Titles
  titles_map = {'Mr':1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5}
  data['titles'] = data['titles'].map(titles_map)
  data['titles'] = data['titles'].fillna(0)

  #Mapping Embarked
  embark_map = {'S': 0, 'C': 1, 'Q': 2}
  data['Embarked'] = data['Embarked'].map(embark_map).astype(int)

  #Mapping Fare
  data.loc[ data['Fare'] <= 7.91, 'Fare']                             = 0
  data.loc[(data['Fare'] > 7.91) & (data['Fare'] <= 14.454), 'Fare']  = 1
  data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 31), 'Fare']    = 2
  data.loc[ data['Fare'] > 31, 'Fare']                                = 3
  data['Fare'] = data['Fare'].astype(int)

  #Mapping Age
  data.loc[ data['Age'] <= 16, 'Age']                       = 0
  data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age'] = 1
  data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
  data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age'] = 3
  data.loc[ data['Age'] > 64, 'Age']                        = 4

#Feature Selection
#Create a list of columns to drop
drop_elements = ["Name", "Ticket", "Cabin", "SibSp", "Parch", "family_size"]

#Drop columns from both sets
train_data = train_data.drop(drop_elements, axis=1)
train_data = train_data.drop(['PassengerId', 'category_fare', 'category_age'], axis = 1)
test_data = test_data.drop(drop_elements, axis=1)

#Print ready to use data
print(train_data.head(10))

   Survived  Pclass  Sex  Age  Fare  Embarked  is_alone  titles
0         0       3    1    1     0         0         0       1
1         1       1    0    2     3         1         0       3
2         1       3    0    1     1         0         1       2
3         1       1    0    2     3         0         0       3
4         0       3    1    2     1         0         1       1
5         0       3    1    1     1         2         1       1
6         0       1    1    3     3         0         1       1
7         0       3    1    0     2         0         0       4
8         1       3    0    1     1         0         0       3
9         1       2    0    0     2         1         0       3
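
The chained `.loc` assignments used for Fare can also be written as a single `pd.cut` call. A minimal sketch, assuming the same cut points as the mapping cell (7.91, 14.454, 31) and made-up fare values; `fare_edges` and `fare_codes` are names introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical fares, including values that sit exactly on a cut point
fares = pd.Series([7.25, 7.91, 8.05, 14.454, 30.0, 31.0, 71.2833])

# Right-closed intervals: (-inf, 7.91], (7.91, 14.454], (14.454, 31], (31, inf)
fare_edges = [-np.inf, 7.91, 14.454, 31, np.inf]

# labels=False returns the integer bin index instead of an interval label
fare_codes = pd.cut(fares, bins=fare_edges, labels=False)
print(fare_codes.tolist())
```

Keeping the edges in one list makes it harder for the thresholds to drift apart than when they are repeated across several comparison lines.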


Step 4: Prediction

In [0]:
#Prediction
#Train and test data

X_train = train_data.drop("Survived", axis=1)
Y_train = train_data["Survived"]
X_test = test_data.drop("PassengerId", axis=1).copy()

Now, we call the classifier!

In [0]:
#Running the classifier
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

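Y_pred = decision_tree.predict(X_test)
# Note: this score is computed on the training data itself, so it measures how
# well the tree fits the training set, not its accuracy on unseen passengers.
accuracy = round(decision_tree.score(X_train, Y_train) * 100, 2)

print('Model accuracy: ', accuracy)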

Model accuracy:  87.21
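
The figure above is measured on the same rows the tree was trained on. A minimal sketch of how a held-out estimate could be obtained with scikit-learn's `cross_val_score`, run here on synthetic data rather than the Titanic features (the arrays `X` and `y`, the noise level, and the seeds are all assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the encoded features: 300 rows, 5 small integer columns
X = rng.integers(0, 4, size=(300, 5))
# Label follows the first column, with about 20% of labels flipped as noise
y = ((X[:, 0] >= 2) ^ (rng.random(300) < 0.2)).astype(int)

# Accuracy on the rows used for fitting
tree = DecisionTreeClassifier(random_state=0)
train_acc = tree.fit(X, y).score(X, y)

# Mean accuracy over 5 folds, each scored on rows the tree did not see
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

print(f"training accuracy:  {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```

With the notebook's own data, the equivalent call would pass `X_train` and `Y_train` in place of `X` and `y`.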


Create CSV with results for submission.

In [0]:
submission = pd.DataFrame({
    "PassengerId": test_data['PassengerId'],
    "Survived": Y_pred
})

submission.to_csv('submission.csv', index=False)

Fin!