# Titanic

This is my attempt at the Titanic problem after studying some basics of data science.

Survival is a discrete outcome (a passenger either survived or did not), so this is a classification problem.

We are given a training set and a test set, and we need to build a model that predicts whether each passenger of the infamous Titanic disaster survived. To start, we look at the shape and contents of the data provided to us.

In [1]:
import pandas

train = pandas.read_csv("./data/train.csv")
train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


The first thing we want to figure out is the type and behaviour of the data we have been given.

## Data Description

The columns are as follows:

| Column | Definition | Values (if categorical) | Number of values |
|------|----------|-----------|-----|
| PassengerId | Key identifying an individual passenger | | |
| Survived | Whether the passenger survived | 0 = died, 1 = survived | 2 |
| Pclass | Class of ticket purchased | n = nth class | 3 |
| Name | Name of the passenger | | |
| Sex | Sex of the passenger | female, male | 2 |
| Age | Age of the passenger | | |
| SibSp | Number of siblings/spouses aboard the Titanic | | |
| Parch | Number of parents/children aboard the Titanic | | |
| Ticket | Ticket number | | |
| Fare | Fare paid for the ticket | | |
| Cabin | Cabin number, for passengers who had a cabin | | |
| Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | 3 |
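
The value columns of this table can be checked with pandas rather than by eye. The sketch below is a minimal illustration on a hand-copied sample of the first five rows shown earlier (the frame name `sample` is an assumption for illustration); on the full training set the same calls would run on `train` and report all three classes and ports.

```python
import pandas as pd

# First five rows of the training set, copied from the output above
# (only the categorical columns are kept).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Number of distinct values per column (NaN is not counted).
distinct_counts = sample.nunique()

# The distinct values themselves, sorted for readability.
distinct_values = {
    col: sorted(sample[col].unique().tolist()) for col in sample.columns
}

print(distinct_counts)
print(distinct_values)
```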


Next we look at how each field correlates with survival. To do that, the categorical fields first need to be converted into discrete numeric codes.

In [2]:
# categorical transformations

train['Sex'] = train['Sex'].replace(['female', 'male'], [0, 1])
train['Embarked'] = train['Embarked'].replace(['C', 'Q', 'S'], [0, 1, 2])

In [3]:
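# numeric_only=True skips the text columns (Name, Ticket, Cabin);
# recent pandas versions raise an error on them otherwise
train.corr(numeric_only=True)['Survived']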

PassengerId   -0.005007
Survived       1.000000
Pclass        -0.338481
Sex           -0.543351
Age           -0.077221
SibSp         -0.035322
Parch          0.081629
Fare           0.257307
Embarked      -0.169718
Name: Survived, dtype: float64
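
One caveat about the Embarked figure: the codes 0, 1 and 2 are arbitrary labels, so a correlation computed on them depends on the order chosen. A common alternative is one-hot encoding, sketched below on a small made-up frame. The `toy` data is invented purely to show the mechanics and is not taken from the data set.

```python
import pandas as pd

# Invented toy data, only to show the mechanics.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1],
    "Embarked": ["C", "S", "C", "S", "Q", "Q", "S", "C"],
})

# One indicator column per port instead of a single ordered code.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
encoded = pd.concat([toy[["Survived"]], dummies], axis=1)

# Correlation of each port indicator with survival.
port_corr = encoded.corr()["Survived"].drop("Survived")
print(port_corr)
```

Each port then gets its own correlation with survival, independent of how the ports happen to be numbered.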

Some columns we exclude from the analysis right away:

1. Fare, since Pclass is a much better indicator and correlates more strongly with survival.
2. PassengerId, which is only a row identifier.
3. Parch and SibSp. These are mutual: if a passenger has a sibling aboard, that sibling's record shows the same relationship. This is too specific to individual families to build a general model around.


In [4]:
attribs_required = ['Pclass', 'Sex', 'Embarked', 'Age']
# Pclass, Sex and Embarked are discrete
# Age is continuous (compared with the few discrete values of the other attributes)

Pclass is questionable. To see why, let us look at its value counts for each outcome.

In [5]:
train.groupby("Survived")['Pclass'].value_counts()

Survived  Pclass
0         3         372
          2          97
          1          80
1         1         136
          3         119
          2          87
Name: Pclass, dtype: int64

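This shows that although Pclass correlates strongly with survival, it does not separate the outcomes cleanly: second class is split almost evenly (97 died, 87 survived), so Pclass alone cannot decide the outcome for those passengers.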
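
The same counts read more easily as a survival rate per class. The sketch below rebuilds the counts from the output above by hand (on the real frame, `train.groupby('Pclass')['Survived'].mean()` should give the same figures).

```python
import pandas as pd

# Counts copied from the groupby output above: (Survived, Pclass) -> passengers.
counts = pd.Series(
    [372, 97, 80, 136, 119, 87],
    index=pd.MultiIndex.from_tuples(
        [(0, 3), (0, 2), (0, 1), (1, 1), (1, 3), (1, 2)],
        names=["Survived", "Pclass"],
    ),
)

# Rows: Pclass, columns: Survived (0 or 1).
table = counts.unstack("Survived").sort_index()

# Share of survivors within each class.
survival_rate = table[1] / table.sum(axis=1)
print(survival_rate.round(3))
```

Roughly 63% of first class survived, 47% of second class and 24% of third class, so only the first and third classes lean clearly one way.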

Next, we check which columns need imputing for training to work.

In [6]:
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Age has 177 null values out of 891. That is a large enough share to change the distribution if we fill them with the average or impute them from other factors.  
We should still check whether there is a quick way to do it.
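
To make the concern concrete: filling every missing age with the mean leaves the mean unchanged but adds 177 values with zero deviation, so the spread shrinks. A small sketch of that arithmetic, using only the non-null count (714) and the sample standard deviation (14.526497) that `train.describe()` reports for Age in this notebook:

```python
import math

n_known = 714          # non-null ages
n_missing = 177        # ages that would be filled with the mean
std_known = 14.526497  # sample standard deviation of the known ages

# Sum of squared deviations from the mean for the known ages (ddof=1).
ss = std_known ** 2 * (n_known - 1)

# Mean-filled values add nothing to the sum of squares,
# but the divisor grows from 713 to 890.
n_total = n_known + n_missing
std_after_fill = math.sqrt(ss / (n_total - 1))

print(round(std_after_fill, 3))
```

The standard deviation drops from about 14.5 to about 13.0 years, which is the distortion mean imputation would introduce.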

In [7]:
required = train[train['Age'].isnull()]
required

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",1,,0,0,330877,8.4583,,1.0
17,18,1,2,"Williams, Mr. Charles Eugene",1,,0,0,244373,13.0000,,2.0
19,20,1,3,"Masselmani, Mrs. Fatima",0,,0,0,2649,7.2250,,0.0
26,27,0,3,"Emir, Mr. Farred Chehab",1,,0,0,2631,7.2250,,0.0
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",0,,0,0,330959,7.8792,,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",1,,0,0,2629,7.2292,,0.0
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",0,,8,2,CA. 2343,69.5500,,2.0
868,869,0,3,"van Melkebeke, Mr. Philemon",1,,0,0,345777,9.5000,,2.0
878,879,0,3,"Laleff, Mr. Kristo",1,,0,0,349217,7.8958,,2.0


In [8]:
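# honorifics that might indicate a young passenger; raw strings keep the
# escaped dots valid as regular expressions
underage_indicators = [r'M\.', r'Ms\.', r'Miss\.', r'Master\.']
train[train['Name'].str.contains('|'.join(underage_indicators))]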

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.9250,,2.0
7,8,0,3,"Palsson, Master. Gosta Leonard",1,2.0,3,1,349909,21.0750,,2.0
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",0,4.0,1,1,PP 9549,16.7000,G6,2.0
11,12,1,1,"Bonnell, Miss. Elizabeth",0,58.0,0,0,113783,26.5500,C103,2.0
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",0,14.0,0,0,350406,7.8542,,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...
869,870,1,3,"Johnson, Master. Harold Theodor",1,4.0,1,1,347742,11.1333,,2.0
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",0,15.0,0,0,2667,7.2250,,0.0
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",0,22.0,0,0,7552,10.5167,,2.0
887,888,1,1,"Graham, Miss. Margaret Edith",0,19.0,0,0,112053,30.0000,B42,2.0


In [9]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
count,891.0,891.0,891.0,891.0,714.0,891.0,891.0,891.0,889.0
mean,446.0,0.383838,2.308642,0.647587,29.699118,0.523008,0.381594,32.204208,1.535433
std,257.353842,0.486592,0.836071,0.47799,14.526497,1.102743,0.806057,49.693429,0.792088
min,1.0,0.0,1.0,0.0,0.42,0.0,0.0,0.0,0.0
25%,223.5,0.0,2.0,0.0,20.125,0.0,0.0,7.9104,1.0
50%,446.0,0.0,3.0,1.0,28.0,0.0,0.0,14.4542,2.0
75%,668.5,1.0,3.0,1.0,38.0,1.0,0.0,31.0,2.0
max,891.0,1.0,3.0,1.0,80.0,8.0,6.0,512.3292,2.0


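The honorific turns out to be a poor guide to age: the matches for "Miss." include a 4-year-old and a 58-year-old. Across the whole set the known ages have a standard deviation of about 14.5 years, so imputing the age from existing data and the honorific could easily be off by that much. That is unacceptable and would only add noise to the prediction model.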
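
The spread within one honorific can be measured directly. The sketch below uses a handful of rows copied from the filtered output above; the regular expression and the `Title` column are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Rows copied from the filtered output above (name and age only).
rows = pd.DataFrame({
    "Name": [
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
        "Sandstrom, Miss. Marguerite Rut",
        "Bonnell, Miss. Elizabeth",
        "Vestrom, Miss. Hulda Amanda Adolfina",
        "Johnson, Master. Harold Theodor",
        "Dahlberg, Miss. Gerda Ulrika",
        "Graham, Miss. Margaret Edith",
    ],
    "Age": [26.0, 2.0, 4.0, 58.0, 14.0, 4.0, 22.0, 19.0],
})

# The honorific sits between the comma and the first period.
rows["Title"] = rows["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Age range per honorific.
spread = rows.groupby("Title")["Age"].agg(["min", "median", "max"])
print(spread)
```

In this small sample "Master" stays within a narrow band, while "Miss" ranges from 4 to 58, so a single fill value per honorific would be far off for many passengers.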

In [10]:
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [11]:
train['Embarked'].value_counts()

2.0    644
0.0    168
1.0     77
Name: Embarked, dtype: int64

In [12]:
# fill the two missing ports with the most common one (2 = Southampton)
train['Embarked'] = train['Embarked'].fillna(2)
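
The fill value does not have to be hard-coded; it can be derived from the column itself. A minimal sketch, with the column rebuilt by hand from the value counts above (the names `embarked` and `most_common` are illustrative):

```python
import numpy as np
import pandas as pd

# Encoded ports rebuilt from the value counts above, plus the 2 missing entries.
embarked = pd.Series([2.0] * 644 + [0.0] * 168 + [1.0] * 77 + [np.nan] * 2)

# mode() ignores NaN and returns the most frequent value(s); take the first.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(most_common, filled.isna().sum())
```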

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

y = train['Survived']
# attribs_required[:-1] drops Age, which still contains missing values
# x_train, x_valid, y_train, y_valid = train_test_split(train[attribs_required[:-1]], y, train_size=0.9, test_size=0.1, random_state=0)

model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=3)
model.fit(train[attribs_required[:-1]], y)
# val_predictions = model.predict(x_valid)

# accuracy_score(y_valid, val_predictions)

RandomForestClassifier(max_depth=3, random_state=3)
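
The commented-out lines hint at a hold-out check before touching the test set. Here is a sketch of how that could look, run on synthetic data so that it is self-contained. The frame `X`, the labels `y` and the survival probabilities are all invented; only the split size and the model parameters mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded training frame (values are invented).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Embarked": rng.integers(0, 3, n),
})
# Make survival depend mostly on Sex so there is something to learn.
survive_prob = np.where(X["Sex"] == 0, 0.75, 0.2)
y = pd.Series((rng.random(n) < survive_prob).astype(int))

# Hold out 10% of the rows for validation.
x_train, x_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.1, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=3)
clf.fit(x_train, y_train)

# accuracy_score expects the true labels first, then the predictions.
valid_accuracy = accuracy_score(y_valid, clf.predict(x_valid))
print(valid_accuracy)
```

Because the model never sees the held-out rows, `valid_accuracy` gives a rough estimate of how it will do on unseen passengers.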

In [14]:
test_data = pandas.read_csv('./data/test.csv')
test_data
# val_predictions = model.predict(test_data)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [15]:
# transform test_data based on the predefined transform logic
test_data['Sex'] = test_data['Sex'].replace(['female', 'male'], [0, 1])
test_data['Embarked'] = test_data['Embarked'].replace(['C', 'Q', 'S'], [0, 1, 2])
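
Since the same two lines are applied to both frames, the encoding could live in one place. The helper below is a hypothetical refactoring, not part of the original notebook; it uses `Series.map` with explicit lookup tables, which turns any value outside the table (including NaN) into NaN.

```python
import pandas as pd

# Explicit lookup tables, matching the codes used in this notebook.
SEX_CODES = {"female": 0, "male": 1}
PORT_CODES = {"C": 0, "Q": 1, "S": 2}


def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Sex and Embarked mapped to integer codes.

    Hypothetical helper for illustration. Values missing from the lookup
    tables (including NaN) come out as NaN.
    """
    out = df.copy()
    out["Sex"] = out["Sex"].map(SEX_CODES)
    out["Embarked"] = out["Embarked"].map(PORT_CODES)
    return out


# Two rows copied from the test set output above.
demo = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["Q", "S"]})
encoded_demo = encode_categoricals(demo)
print(encoded_demo)
```

Calling the same function on the training and test frames keeps the two encodings from drifting apart.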

In [16]:
test_data.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
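
The gaps here are in Age, Fare and Cabin, none of which the model uses, so prediction can go ahead. A small guard can make that check explicit before calling `predict`. The helper `missing_in` and the miniature frame are invented for illustration.

```python
import numpy as np
import pandas as pd


def missing_in(df: pd.DataFrame, columns: list) -> dict:
    """Hypothetical helper: NaN count per listed column, omitting zeros."""
    counts = df[columns].isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


# Invented miniature frame with gaps in Age and Fare only.
demo = pd.DataFrame({
    "Pclass": [3, 3, 2],
    "Sex": [1, 0, 1],
    "Embarked": [1, 2, 1],
    "Age": [34.5, np.nan, 62.0],
    "Fare": [7.8292, 7.0, np.nan],
})

print(missing_in(demo, ["Pclass", "Sex", "Embarked"]))
print(missing_in(demo, ["Age", "Fare"]))
```

An empty result for the model's feature columns means no imputation is needed before predicting.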

In [17]:
test_data['Survived'] = model.predict(test_data[attribs_required[:-1]])

In [18]:
answer_set = pandas.read_csv('./data/actual_result.csv')
answer_set

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [20]:
check = (test_data[['PassengerId', 'Survived']] == answer_set)
counts = check['Survived'].value_counts()
accuracy = counts[True] / check['Survived'].size
accuracy

0.777511961722488
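
The comparison above relies on both frames having their rows in the same order. A sketch of an order-independent variant joins the two on PassengerId first. The true labels are the first five rows of the answer set shown above; the predictions in `preds` are hypothetical.

```python
import pandas as pd

# True labels for five passengers, copied from the answer set above.
truth = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Survived": [0, 1, 0, 0, 1],
})
# Hypothetical predictions, deliberately in a different row order.
preds = pd.DataFrame({
    "PassengerId": [896, 895, 894, 893, 892],
    "Survived": [1, 0, 1, 1, 0],
})

# Join on the key so row order cannot affect the comparison.
merged = preds.merge(truth, on="PassengerId", suffixes=("_pred", "_true"))

# Fraction of matching labels.
accuracy = (merged["Survived_pred"] == merged["Survived_true"]).mean()

# Counts of each (true, predicted) pair.
confusion = pd.crosstab(merged["Survived_true"], merged["Survived_pred"])
print(accuracy)
print(confusion)
```

The cross-tabulation also shows which kind of error the model makes, which a single accuracy figure hides.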