# Introduction
The sinking of the **Titanic** is one of the most infamous shipwrecks in history. On **April 15, 1912**, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing **1502** out of **2224** passengers and crew. It remains an unforgettable disaster.

The Titanic cost about $7.5 million to build, and it sank after the collision. The Titanic dataset is a very good starting point for beginners who want to begin their journey in data science and take part in Kaggle competitions.

The objective of this notebook is to give an idea of the workflow in a predictive modeling problem: how we check features, how we add new features, and how we apply some machine learning concepts. I have tried to keep the notebook as basic as possible so that even newcomers can follow every phase of it.

![](https://preview.redd.it/0izq0428pe661.jpg?width=960&format=pjpg&auto=webp&s=15022053715fc50198a17c401be035445592fee2)

## Importing Required Packages

In [2]:
import pandas as pd 
import numpy as np
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('dark')
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

**Load and display train data**

In [3]:
train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
train_data.head(50)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


**Load and display test data**

In [4]:
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


**Let's check for missing values in both training and test data.**

In [None]:
train_data.info()

In [None]:
test_data.info()

**From the tables above, we can say that `Age`, `Cabin` and `Embarked` have missing values in the train data set, while `Age`, `Fare` and `Cabin` have missing values in the test data.**
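
The `info()` output only lists non-null counts, so a direct count of nulls per column can be easier to read. Here is a minimal sketch on a tiny made-up frame (`toy` and its values are hypothetical; in this notebook the same calls would be applied to `train_data` and `test_data`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column and keep only the columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Columns without missing values (here `Fare`) are filtered out, so only the problematic ones are listed.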

In [None]:
train_data.shape, test_data.shape

In [None]:
train_data.duplicated().sum()

In [None]:
test_data.duplicated().sum()

**Let's focus on the target (survival) and see how many passengers survived.**

In [None]:
train_data['Survived'].value_counts(normalize=True)

In [None]:
plt.figure(figsize=(10,5))
plt.title('Survivors and Deads Count', fontsize=14)
sns.countplot(x=train_data['Survived'], palette=('#C52219', '#23C552'))
plt.xlabel("Survived (0 = No, 1 = Yes)", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

**We see that in the training data only around 38.4% of the passengers managed to survive the disaster.**

# Feature Analysis

* **Here we'll look at how the data can be used to perform a more precise feature selection in the modeling part.**
* **We will thus explore one feature at a time in order to determine its importance in predicting if a passenger survived or not.**

## Sex
* **We see that around 65% of the passengers were male while the remaining 35% were female.** 
* **The important thing to notice here is that the survival rate for women was four times the survival rate for men and this makes `Sex` one of the most informative features.**

In [None]:
train_data['Sex'].value_counts().to_frame()

In [None]:
train_data.groupby('Sex').Survived.mean()

In [None]:
fig, axarr = plt.subplots(1, 2, figsize=(10,5))
a = sns.countplot(x=train_data['Sex'], ax=axarr[0], palette=('#003f7f','#ff007f')).set_title('Passengers count by sex')
axarr[1].set_title('Survival rate by sex')
b = sns.barplot(x='Sex', y='Survived', data=train_data, palette=('#003f7f','#ff007f'), ci=None, ax=axarr[1]).set_ylabel('Survival rate')

## Pclass
* **There were three classes on the ship and from the plot we see that the number of passengers in the third class was higher than the number of passengers in the first and second classes combined.**
* **However, the survival rate differs by class: more than 60% of first-class passengers and around half of second-class passengers were rescued, whereas about 75% of third-class passengers did not survive the disaster.**  
* **For this reason, this is definitely an important aspect to consider.**

In [None]:
train_data['Pclass'].value_counts().to_frame()

In [None]:
train_data.groupby('Pclass').Survived.mean()

In [None]:
fig, axarr = plt.subplots(1,2,figsize=(12,6))
a = sns.barplot(x='Pclass', y='Survived', data=train_data, palette="Greens", ci=None, ax=axarr[0]).set_ylabel('Survival rate')
axarr[0].set_title('Survival rate by class')
b = sns.countplot(x='Pclass', hue='Survived', data=train_data, palette=('#C52219', '#23C552'), ax=axarr[1]).set_title('Survivors and deads count by class')

## Pclass & Sex

* **We can also look at the survival rate by `Sex` and `Pclass` together, which is quite striking: about 97% of first-class women and 92% of second-class women were rescued, while the percentage drops to 50% for third-class women.**  
* **Even so, this is still higher than the 37% survival rate for first-class men.** 

In [None]:
train_data.groupby(['Pclass', 'Sex']).Survived.mean().to_frame()
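
The same two-way breakdown can also be laid out as a grid, with one row per class and one column per sex. The sketch below uses a small made-up frame (`toy` is hypothetical and its values are not taken from the Titanic data) just to show the shape of the result:

```python
import pandas as pd

# Hypothetical toy frame standing in for train_data
toy = pd.DataFrame({
    'Pclass':   [1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'male', 'female', 'female', 'male', 'male'],
    'Survived': [1, 0, 1, 0, 0, 0],
})

# Mean of Survived per (Pclass, Sex) cell, i.e. the survival rate
rates = toy.pivot_table(index='Pclass', columns='Sex',
                        values='Survived', aggfunc='mean')
print(rates)
```

Each cell is the mean of `Survived` for that class and sex, which is the same quantity the `groupby` above computes.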

In [None]:
plt.figure(figsize = [10,5])
plt.title('Survival rate by sex and class')
g = sns.barplot(x='Pclass', y='Survived', hue='Sex', palette=('#003f7f','#ff007f'), ci=None, data=train_data).set_ylabel('Survival rate')

## Age
* **Although this column contains a lot of missing values, we see that in the training data the average age was just under 30 years.**  
* **Here is the plot of the overall age distribution compared with the distributions for those who survived and those who died.**

In [None]:
fig, axarr = plt.subplots(1,2,figsize=(12,6))
axarr[0].set_title('Age distribution')
f = sns.distplot(train_data['Age'], color='g', bins=40, ax=axarr[0])
axarr[1].set_title('Age distribution for the two subpopulations')
# Colors match the other plots: green for survivors, red for those who died
g = sns.kdeplot(train_data['Age'].loc[train_data['Survived'] == 1], color='#23C552',
                shade=True, ax=axarr[1], label='Survived').set_xlabel('Age')
g = sns.kdeplot(train_data['Age'].loc[train_data['Survived'] == 0], color='#C52219',
                shade=True, ax=axarr[1], label='Not Survived')
axarr[1].legend()

## Age & Sex
* **At first glance, the relationship between `Age` and `Survived` is not very clear. There is certainly a peak corresponding to young passengers among those who survived, but apart from that the rest is not very informative.**  
* **We can appreciate this feature more if we consider `Sex` too: now it is clearer that a good number of male survivors were younger than 12, while the female group shows no particular pattern.**

In [None]:
plt.figure(figsize=(10,5))
g = sns.swarmplot(y='Sex', x='Age', hue='Survived', palette=('#C52219', '#23C552'), data=train_data).set_title('Survived by age and sex')

## Age, Pclass & Sex
* **Another interesting thing to look at is the relation between `Age`, `Pclass` and `Survived`.**  
* **`Pclass` appears to be the more influential of the two, as there are no really clear age-related patterns within each class.** 
* **Also, we note that there were not many children in the first class.**

In [None]:
plt.figure(figsize=(10,5))
h = sns.barplot(x='Pclass', y='Age', hue='Survived', palette=('#C52219', '#23C552'), ci=None, data=train_data).set_title('Survived by age and class')

## Fare
* **From the `describe()` output, we see that the `Fare` distribution is positively skewed, with 75% of the data under 31 and a maximum of 512.**  
* **To understand this feature better, the simplest idea is to create fare ranges using quartiles.** 
* **At first glance, we notice that the higher the fare, the higher the chance of surviving.**

In [None]:
train_data.Fare.describe().to_frame()

In [None]:
fig, axarr = plt.subplots(1,2,figsize=(12,6))
f = sns.distplot(train_data.Fare, color='g', ax=axarr[0]).set_title('Fare distribution')
fare_ranges = pd.qcut(train_data.Fare, 4, labels = ['Low', 'Mid', 'High', 'Very high'])
axarr[1].set_title('Survival rate by fare category')
g = sns.barplot(x=fare_ranges, y=train_data.Survived, palette='mako', ci=None, ax=axarr[1]).set_ylabel('Survival rate')
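
To make the quartile binning concrete, here is a minimal sketch on a handful of made-up fares (`fares` and `survived` are hypothetical values chosen only for illustration). `pd.qcut` picks the bin edges so that each bin holds roughly the same number of rows:

```python
import pandas as pd

# Hypothetical fares and outcomes, chosen only to illustrate quartile binning
fares = pd.Series([5.0, 7.0, 8.0, 10.0, 15.0, 26.0, 31.0, 80.0])
survived = pd.Series([0, 0, 0, 1, 0, 1, 1, 1])

# Four quantile-based bins with (roughly) equal row counts
fare_bins = pd.qcut(fares, 4, labels=['Low', 'Mid', 'High', 'Very high'])

# Survival rate per fare bin
rate_by_bin = survived.groupby(fare_bins, observed=False).mean()
print(fare_bins.value_counts())
print(rate_by_bin)
```

Because the edges come from the data, they have to be recomputed (or reused explicitly) if the same bins are wanted on another dataset.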

## Fare & Sex
* **Looking at the more detailed plot below, we also see, for example, that all males with a fare between 200 and 300 died.**  
* **For this reason, we can leave the `Fare` feature as it is to avoid losing too much information; at deeper levels of a tree, a more discriminating relationship might open up and it could become a good group detector.**

In [None]:
plt.figure(figsize=(10,5))
a = sns.swarmplot(x='Sex', y='Fare', hue='Survived', palette=('#C52219', '#23C552'), data=train_data).set_title('Survived by fare and sex')

**Also, looking at the output of `describe()`, we notice that the minimum value for `Fare` is zero, which is a bit strange.  
Let's see who these passengers are.**

In [None]:
train_data.loc[train_data.Fare==0]

**About 15 such passengers are present.
Since some of them are 1st or 2nd class passengers, we should remove the zero fares, which might confuse our model.  
With the help of this function, we are going to set a null value every time we encounter a zero value for `Fare`.**

In [None]:
def remove_zero_fares(row):
    if row.Fare == 0:
        row.Fare = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
    return row
# Apply the function
train_data = train_data.apply(remove_zero_fares, axis=1)
test_data = test_data.apply(remove_zero_fares, axis=1)
# Check if it did the job
print('Number of zero-Fares: {:d}'.format(train_data.loc[train_data.Fare==0].shape[0]))
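
As an aside, the same replacement can be done without a row-by-row `apply`. This is only a sketch of an alternative, shown on a made-up frame (`toy` is hypothetical), not the approach used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook this would be train_data or test_data
toy = pd.DataFrame({'Pclass': [1, 3, 2], 'Fare': [0.0, 7.25, 0.0]})

# mask() sets entries to NaN wherever the condition is True
toy['Fare'] = toy['Fare'].mask(toy['Fare'] == 0)

print('Number of zero-Fares: {:d}'.format((toy['Fare'] == 0).sum()))
print('Number of missing Fares: {:d}'.format(toy['Fare'].isnull().sum()))
```

Working on the single column leaves the other columns and their dtypes untouched, and it is usually much faster than applying a function to every row.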

## Embarked 
* **`Embarked` tells us where a passenger boarded.**
* **There are three possible values for it: Southampton, Cherbourg and Queenstown.**  
* **In the training data, more than 70% of the people boarded at Southampton, slightly under 20% at Cherbourg and the rest at Queenstown.**
* **Counting survivors by boarding point, we see that among those who embarked at Cherbourg, more people survived than died.**
* **Most of the people who embarked at Southampton did not survive the disaster.**

In [None]:
train_data['Embarked'].value_counts().to_frame()

In [None]:
train_data.groupby('Embarked').Survived.mean().to_frame()

In [None]:
fig, axarr = plt.subplots(1,2,figsize=(12,6))
sns.countplot(x=train_data['Embarked'], palette='magma', ax=axarr[0]).set_title('Passengers count by boarding point')
p = sns.countplot(x = 'Embarked', hue = 'Survived', data = train_data, palette=('#C52219', '#23C552'),
                  ax=axarr[1]).set_title('Survivors and deads count by boarding point')

## Embarked & Pclass
* **Since we don't expect a passenger's boarding point to change the chance of surviving, we guess this is probably due to the higher proportion of first and second class passengers among those who boarded at Cherbourg compared with Queenstown and Southampton.** 
* **To check this, we look at the class distribution for the different embarking points.**

In [None]:
train_data.groupby(['Embarked', 'Pclass']).Survived.sum().to_frame()

In [None]:
plt.figure(figsize=(10,5))
g = sns.countplot(data=train_data, x='Embarked', hue='Pclass', palette="twilight").set_title('Pclass count by embarking point')

* **The claim is correct and hopefully explains why the survival rate is so high at Cherbourg.** 
* **Again, this feature might be useful in detecting groups at a deeper level of a tree, and this is the only reason why I keep it.**
* **Also, most of the 3rd class passengers embarked at Southampton, and most of them died.**
* **Note that the table above counts survivors (`Survived.sum()`), not passengers: among those who embarked at Queenstown, only 1 first-class and 2 second-class passengers survived.**
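
To look at the class mix per boarding point as passenger counts and as shares, a cross-tabulation is one option. The sketch below uses a small made-up frame (`toy` is hypothetical and does not reproduce the real counts):

```python
import pandas as pd

# Hypothetical toy data standing in for train_data
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C', 'C', 'Q'],
    'Pclass':   [3,   3,   1,   1,   1,   3],
})

# Passenger counts per boarding point and class
counts = pd.crosstab(toy['Embarked'], toy['Pclass'])

# Row-normalized shares: the class mix within each boarding point
shares = pd.crosstab(toy['Embarked'], toy['Pclass'], normalize='index')
print(counts)
print(shares.round(2))
```

With `normalize='index'` every row sums to 1, which makes boarding points of very different sizes directly comparable.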

## Name
* **The `Name` column contains useful information; for example, we could identify family groups using surnames.**  
* **In this notebook, however, we extract only the passengers' titles from it, creating a new feature for both train and test data.**

In [None]:
train_data['Title'] = train_data['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
test_data['Title'] = test_data['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
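
To see what the extraction does, here it is applied to three names taken from the rows displayed earlier. The regular-expression variant is only an alternative sketch (the pattern is my own and not part of the original notebook):

```python
import pandas as pd

# Names taken from the rows displayed earlier in the notebook
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])

# Same logic as above: the text between the first comma and the following period
titles_split = names.apply(lambda x: x.split(',')[1].split('.')[0].strip())

# Vectorized alternative with a regular expression (assumed pattern)
titles_regex = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

print(titles_split.tolist())
print(titles_regex.tolist())
```

Both variants rely on the `Surname, Title. First names` layout of the `Name` column.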

In [None]:
train_data['Title'].value_counts().to_frame()

In [None]:
test_data['Title'].value_counts().to_frame()

* **Looking at the distribution of the titles, it might be convenient to move the really low-frequency ones into bigger groups.**  
* **After analyzing them, we can substitute all rare female titles with Miss and all rare male titles with Mr.**

In [None]:
# Group rare titles; assign the result back instead of using inplace on a column
female_rare = ['Mme', 'Ms', 'Lady', 'Mlle', 'the Countess', 'Dona']
male_rare = ['Major', 'Col', 'Capt', 'Don', 'Sir', 'Jonkheer']
for df in (train_data, test_data):
    df['Title'] = df['Title'].replace(female_rare, 'Miss').replace(male_rare, 'Mr')

**Here is the final result. We have relatively high hopes for this new feature since the survival rate in most cases appears to be either significantly above or below the average survival rate, which should help our model.**

In [None]:
train_data.groupby('Title').Survived.mean()

In [None]:
plt.figure(figsize=(10,5))
plt.title('Survival rate by Title')
g = sns.barplot(x='Title', y='Survived', palette="magma", ci=None, data=train_data).set_ylabel('Survival rate')

## Cabin and Ticket
* **The `Cabin` feature is somewhat problematic as there are many missing values.**  
* **We cannot expect it to help our model much.**  
* **On the other hand, a correctly engineered `Ticket` column is the best way to find family groups.** 
* **Since it would be a pity to delete it knowing its full potential, we can create two new columns: one for the first two characters of the ticket and one for the ticket length.**

**Extract the first two characters**

In [None]:
train_data['Ticket_lett'] = train_data.Ticket.apply(lambda x: x[:2])
test_data['Ticket_lett'] = test_data.Ticket.apply(lambda x: x[:2])

**Calculate ticket length**

In [None]:
train_data['Ticket_len'] = train_data.Ticket.apply(lambda x: len(x))
test_data['Ticket_len'] = test_data.Ticket.apply(lambda x: len(x))

## SibSp
* **`SibSp` is the number of siblings or spouses of a person aboard the Titanic.**  
* **We see that more than 90% of people traveled alone or with one sibling or spouse.** 
* **The survival rate between the different categories is a bit confusing but we see that the chances of surviving are lower for those who traveled alone or with more than 2 siblings.**  
* **Furthermore, we notice that no one from a big family with 5 or 8 siblings was able to survive.**

In [None]:
fig, axarr = plt.subplots(1,2,figsize=(12,6))
a = sns.countplot(x=train_data['SibSp'], palette="magma", ax=axarr[0]).set_title('Passengers count by SibSp')
axarr[1].set_title('Survival rate by SibSp')
b = sns.barplot(x='SibSp', y='Survived', data=train_data, palette="mako", ci=None, ax=axarr[1]).set_ylabel('Survival rate')

In [None]:
plt.figure(figsize = [10,5])
plt.title('Survivors and deads count by SibSp')
sns.countplot(x='SibSp', hue='Survived', palette=('#C52219', '#23C552'), data=train_data)

## Parch
* **Similar to the `SibSp` column, this feature contains the number of parents or children each passenger was traveling with.** 
* **Here we draw the same conclusions as `SibSp`; we see again that small families had more chances to survive than bigger ones and passengers who traveled alone.**

In [None]:
fig, axarr = plt.subplots(1,2,figsize=(12,6))
a = sns.countplot(x=train_data['Parch'], palette="magma", ax=axarr[0]).set_title('Passengers count by Parch')
axarr[1].set_title('Survival rate by Parch')
b = sns.barplot(x='Parch', y='Survived', data=train_data, palette="mako", ci=None, ax=axarr[1]).set_ylabel('Survival rate')

In [None]:
plt.figure(figsize = [10,5])
plt.title('Survivors and deads count by Parch')
sns.countplot(x='Parch', hue='Survived', palette=('#C52219', '#23C552'), data=train_data)

## Family Size
* **Since we have two seemingly weak predictors, one thing we can do is combine them to get a stronger one.** 
* **In the case of `SibSp` and `Parch`, we can join the two variables to get a family size feature, which is the sum of `SibSp`, `Parch` and 1 (the passenger themselves).** 
* **Below we create the new `Fam_size` column.**

In [None]:
train_data['Fam_size'] = train_data['SibSp'] + train_data['Parch'] + 1
test_data['Fam_size'] = test_data['SibSp'] + test_data['Parch'] + 1

**Plotting the survival rate by family size, it is clear that people who were alone had a lower chance of surviving than families of up to 4 members, while the survival rate drops for bigger families and ultimately becomes zero for very large ones.**

In [None]:
plt.figure(figsize=(10,5))
plt.title('Survival rate by family size')
g = sns.barplot(x='Fam_size', y='Survived', palette="magma", ci=None, data=train_data).set_ylabel('Survival rate')

In [None]:
plt.figure(figsize=(10,5))
plt.title('Survivors and deads count by family size')
sns.countplot(x='Fam_size', hue='Survived', data=train_data, palette=('#C52219', '#23C552'))

## Family Type
**To further summarize the previous trend, as our final feature, let's create four groups for family size.**

In [None]:
# Creation of four groups
train_data['Fam_type'] = pd.cut(train_data.Fam_size, [0,1,4,7,11], labels=['Solo', 'Small', 'Big', 'Very big'])
test_data['Fam_type'] = pd.cut(test_data.Fam_size, [0,1,4,7,11], labels=['Solo', 'Small', 'Big', 'Very big'])
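
The bin edges `[0,1,4,7,11]` are right-inclusive by default, so a family of 4 is still "Small" and a family of 7 is still "Big". A minimal sketch on made-up family sizes (`fam_size` here is a hypothetical Series, not the real column) shows where each value lands:

```python
import pandas as pd

# Hypothetical family sizes covering every bin, including the edges
fam_size = pd.Series([1, 2, 4, 5, 7, 8, 11])

# Bins are right-inclusive by default: (0, 1], (1, 4], (4, 7], (7, 11]
fam_type = pd.cut(fam_size, [0, 1, 4, 7, 11],
                  labels=['Solo', 'Small', 'Big', 'Very big'])
print(fam_type.tolist())
```

Any family size above 11 would fall outside the last bin and become `NaN`, so the upper edge has to cover the largest value in both train and test data.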

**Here is the final result: we discovered a nice pattern.**

In [None]:
plt.figure(figsize=(10,5))
plt.title('Survival rate by family type')
g = sns.barplot(x=train_data.Fam_type, y=train_data.Survived, palette='twilight', ci=None).set_ylabel('Survival rate')

In [None]:
plt.figure(figsize=(10,5))
plt.title('Survivors and deads count by family type')
sns.countplot(x='Fam_type', hue='Survived', data=train_data, palette=('#C52219', '#23C552'))

# Modeling
* **We start by selecting the features we will use and isolating the target.**  
* **We will not consider `Cabin`, and in the end we also excluded `Age`, as the relevant information (being a young boy) is encoded in the Master title.**  
* **We also did not use `Sex` as it is not useful given the `Title` column: adult males and young children have the same sex but are really different categories as we saw before, so we don't want to confuse our algorithm.**  

***If you don't extract the `Title` column, remember to put `Sex` in your models as it is pretty important!***

In [None]:
y = train_data['Survived']
features = ['Pclass', 'Fare', 'Title', 'Embarked', 'Fam_type', 'Ticket_len', 'Ticket_lett']
X = train_data[features]
X.head()
X_test = test_data[features]

In [None]:
X_test

Since we have multiple categorical fields in our data, we need to one-hot encode them so that our model can use them.

In [None]:
one_hot_encoded_training_predictors = pd.get_dummies(X)
one_hot_encoded_test_predictors = pd.get_dummies(X_test)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                    join='left', 
                                                                    axis=1)
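
To show why the alignment step is needed, here is a minimal sketch with two made-up frames (`train` and `test` are hypothetical) where one category appears only in the training data and another only in the test data. The `fill_value=0` argument is my addition for illustration; the cell above leaves such columns as `NaN`:

```python
import pandas as pd

# Hypothetical toy frames: 'Dr' appears only in train, 'Q' only in test
train = pd.DataFrame({'Title': ['Mr', 'Miss', 'Dr'], 'Embarked': ['S', 'C', 'S']})
test = pd.DataFrame({'Title': ['Mr', 'Miss'], 'Embarked': ['S', 'Q']})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# join='left' keeps exactly the training columns; fill_value=0 fills the
# columns that were absent from test instead of leaving them as NaN
train_enc, test_enc = train_enc.align(test_enc, join='left', axis=1, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

After the alignment both frames have the same columns in the same order, which is what the model expects at prediction time. Categories seen only in the test data (here `Embarked_Q`) are dropped.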

Next, we split the training data into a training part and a held-out testing part for the model. Holding out part of the training data lets you check the results on data the model has not seen and validate your score before submitting the final CSV file. This gives a rough idea of how well you are going to score.

In [None]:
from sklearn.model_selection import train_test_split
xtr,xts,ytr,yts=train_test_split(final_train,y)
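
The split above changes on every run because no seed is given. A hedged sketch of a repeatable, stratified split follows; the data is synthetic (`X_toy` and `y_toy` are hypothetical), and `test_size=0.25`, `random_state=0` and `stratify` are my choices, not settings from this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 38% positives, similar to the Titanic target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 38 + [0] * 62)

# random_state makes the split repeatable; stratify keeps the class ratio
xtr, xts, ytr, yts = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)

print(xtr.shape, xts.shape, ytr.mean(), yts.mean())
```

With stratification, the share of survivors stays close to the overall rate in both parts, so the validation score is less dependent on a lucky or unlucky split.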

In [None]:
xtr

In [None]:
ytr

Handle null values with SimpleImputer

In [None]:
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(xtr)

In [None]:
from sklearn import tree
alg=tree.DecisionTreeClassifier()
alg.fit(data_with_imputed_values,ytr)
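
The held-out part of the split is meant for validating the score before submitting. One possible way to do that is sketched below on synthetic data (`X_toy` and `y_toy` are hypothetical stand-ins for `final_train` and `y`). Wrapping the imputer and the tree in a `Pipeline` is my assumption about how to organize this, not the notebook's own method:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data with some missing values
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

xtr, xts, ytr, yts = train_test_split(X_toy, y_toy, random_state=0)

# Imputer and tree in one pipeline: the imputer is fitted on training rows only
model = Pipeline([
    ('imputer', SimpleImputer()),
    ('tree', DecisionTreeClassifier(random_state=0)),
])
model.fit(xtr, ytr)

# Accuracy on the held-out rows, plus 5-fold cross-validation on the training rows
holdout_acc = model.score(xts, yts)
cv_scores = cross_val_score(model, xtr, ytr, cv=5)
print(holdout_acc, cv_scores.mean())
```

A pipeline also applies the same imputation at prediction time, so new data goes through exactly the steps the model was trained with.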

**We are now ready to make our predictions by simply calling the predict method on the test data.**

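Fill null values in the `final_test` dataset with 0. Note that `fillna` returns a new DataFrame instead of modifying the original, so the result has to be assigned back.

In [None]:
final_test = final_test.fillna(0)

In [None]:
final_test.describe()

`DataFrame.replace(np.nan, 0)` is an equivalent pandas call (it is a pandas method, not a NumPy function). After the assignment above it changes nothing and is kept only as a safeguard.

In [None]:
final_test=final_test.replace(np.nan, 0)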

**Finally, we predict on the provided test dataset, `final_test`.**

In [None]:
# Preprocessing of test data, get predictions
predictions = alg.predict(final_test)

In [None]:
predictions.shape

**All we have to do now is convert them into the submission file!**

In [None]:
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
print('Your submission was successfully saved!')