# **Titanic: Machine Learning from Disaster**

Hello everyone,<br>
This is my first detailed competition notebook. A month ago I made my first submission just for fun, and my ranking was around the top 80 percent. After that, I came back to the problem and was able to bring it down to the top **8%**.

Notebook content is as follows:

 - [Exploratory Data Analysis](#1)
 - [Missing Value Analysis](#2)
 - [Feature Engineering](#3)
 - [Label Encoding](#4)
 - [Modelling](#5)
 - [Submission](#6)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()

In [None]:
test_path = "/kaggle/input/titanic/test.csv"
train_path = "/kaggle/input/titanic/train.csv"

titanic_test = pd.read_csv(test_path)
titanic_train = pd.read_csv(train_path)

In [None]:
titanic_train.head()

In [None]:
titanic_train.describe().T

In [None]:
titanic_train.info()

TODO: add notes on missing values, variable types, and counts.
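The `info()` output above can also be condensed into one table per column. The sketch below is only an illustration: `toy` and `column_overview` are hypothetical names, and the three rows are made up rather than taken from `titanic_train`.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_train
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Cabin": [None, "C85", None],
    "Sex": ["male", "female", "female"],
})

def column_overview(df):
    """Return dtype, non-null count and missing count for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "missing": df.isna().sum(),
    })

overview = column_overview(toy)
print(overview)
```

Calling `column_overview(titanic_train)` would give the same kind of table for the real data.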

<a id="1"></a> <br>
# **EDA**

In [None]:
categorical = []
numerical = []

for column in titanic_train.columns:
  if titanic_train[column].dtype == "object":
    categorical.append(column)
  else:
    numerical.append(column)

print("Categorical Variables: ", *categorical)
print("Numerical Variables: " , *numerical)

## **Variables**

We have 12 variables; some are categorical and some are numerical.<br>

**Categorical**:
  - Name
  - Sex
  - Ticket
  - Cabin
  - Embarked

**Numerical**:
  - PassengerId
  - Survived (target)
  - Pclass
  - Age
  - SibSp
  - Parch
  - Fare 

## **Survived**

- **Survived** is our target variable. As the name suggests, it tells us whether a passenger survived the sinking of the Titanic.

  - Survived = 1
  - Not Survived = 0


In [None]:
values = titanic_train["Survived"].value_counts()

# plotting
values.plot.pie(autopct='%1.1f%%',shadow=True,figsize=(10,6))
plt.show()

#printing the values
print("Number of Survived")
print(values)

- As we can see, the split is roughly 60 to 40: about 62 percent of the passengers did not survive and about 38 percent survived.

## **Sex**

- Let's take a look at the number of men and women on board. 
- Next, let's examine the relationship between gender and target variable.

In [None]:
sns.countplot(x = "Sex", data = titanic_train)
plt.title("Number of Sex (fig.1)")
plt.show()

print("Proportion of Sex")
print(titanic_train.Sex.value_counts(normalize=True)*100)

#### **Sex-Survived**

In [None]:
sns.catplot(x = "Sex", y="Survived",
            data=titanic_train, kind = "bar", height = 5)
plt.title("Survived Probability (fig.2)")
plt.show()

sns.countplot(x = "Sex", hue = "Survived", data = titanic_train)
plt.title("Number of Survived (fig.3)")
plt.show()

- Figure 1 shows that about 65 percent of the passengers on the ship are men.
- However, according to Figure 2, only about 19 percent of the men survived, even though they are the majority.
- Taken together, the graphs show that women had a much higher survival rate than men.

## **Pclass**

**Pclass** is the ticket class of each passenger (1 = *First Class*, 2 = Second Class, 3 = Third Class). This variable serves as a proxy for the **economic and social** status of the passengers.

In [None]:
sns.countplot(x = "Pclass", data = titanic_train)
plt.title("Number of Pclass (Fig.1)")
plt.show()


titanic_train.Pclass.value_counts(normalize = True).plot.pie(autopct='%1.1f%%',shadow=True,figsize=(10,6))
plt.title("Proportions of Pclass (Fig.2)")
plt.show()

#### **Pclass-Survived**

In [None]:
sns.countplot(x = "Pclass", hue = "Survived", data = titanic_train)
plt.title("Number of Survived for Each Class (Fig.3)")
plt.show()
sns.catplot(x= "Pclass", y="Survived", data = titanic_train,
            kind = "bar", height = 5)
plt.title("Survived Ratio According to Pclass (Fig.4)")
plt.show()

- From Fig.1 and Fig.2, Third Class passengers are the largest group on board (about 55 percent), followed by First Class and Second Class.
- The survival rate, however, is highest for First Class passengers and lowest for 3rd class passengers.
- As far as I remember from the movie, 3rd class passengers were traveling in the lowest part of the ship. This may be one reason for their high death rate.

## **Age**
Not much explanation is needed here: as the name suggests, this is the age of each passenger on board.

In [None]:
# histplot replaces the deprecated distplot
sns.histplot(titanic_train["Age"], bins = 20, kde = True)
plt.title("Age Distribution (Fig.1)")
plt.show()

#### **Age-Survived**


In [None]:
# Box Plot
sns.boxplot(x ="Survived",y="Age", data = titanic_train)
plt.title("Dist. of Age According to Survived (Fig.2)")
plt.show()

# KDE plot
# fill=True replaces the deprecated shade=True
ax = sns.kdeplot(titanic_train.loc[(titanic_train.Survived == 0), "Age"],
                 color = "r", fill = True, label = "Not Survived")

ax = sns.kdeplot(titanic_train.loc[(titanic_train.Survived == 1), "Age"],
                 color = "b", fill = True, label = "Survived")
ax.legend(loc="upper right")
ax.set_xlabel("Age")
ax.set_ylabel("Density")
ax.set_title("Age - Survived (Fig.3)")
plt.show()

print("-- Mean of Age the Survived --")
print(titanic_train.groupby("Survived")[["Age"]].mean())

- The plots show that the passengers on the Titanic were mostly young adults.
- The average age of those who died (30.6) is higher than that of the survivors (28.3).

## **Cabin**

In [None]:
print(titanic_train.Cabin.unique())

In [None]:
def extract_first(x):
  # return the deck letter, or None when the cabin is missing
  if pd.notna(x):
    return str(x)[0]
  return None

# extracting first letter of cabin values except nan values
titanic_train["Cabin_first"] = titanic_train.Cabin.apply(extract_first)

In [None]:
sns.countplot(x = "Cabin_first", data=titanic_train,
              order = titanic_train.Cabin_first.value_counts().index)
plt.title("Number of People in Cabins (Fig.1)")
plt.show()


### **Cabin-Survived**

In [None]:
sns.catplot(x="Cabin_first", y="Survived",
            kind = "bar", height = 5, data = titanic_train)
plt.title("Proportions of Survived (Fig.2)")
plt.show()


print("Mean age for each cabin letter")
print(titanic_train.groupby("Cabin_first")[["Age"]].mean().sort_values(by ="Age",ascending=False))


print("Mean survival rate for each cabin letter")
print(titanic_train.groupby("Cabin_first")[["Survived"]].mean().sort_values(by = "Survived",ascending=False))

- Among the passengers with a recorded cabin, most are traveling in **Cabin C**.
- The mean age for each cabin letter is as follows:
  - T : 45.0
  - A : 44.1
  - D : 39.7
  - C : 38.3
  - E : 38.1
  - B : 36.4
  - F : 21.3
  - G : 12.0

- I will group the Cabin_first feature as follows in Feature Engineering:
  - **High Survival Rate**: D,E,B,F,C
  - **Normal Survival Rate**: G, A
  - **Lower**: T


In [None]:
# drop the helper column for now; it is rebuilt in Feature Engineering
titanic_train = titanic_train.drop("Cabin_first", axis = 1)

## **Fare**
It tells us the price of each passenger's ticket.

In [None]:
# histplot replaces the deprecated distplot
sns.histplot(titanic_train["Fare"], kde = True)
plt.title("Distribution of Fare")
plt.show()

### **Fare-Survived**

In [None]:
sns.boxplot(y = "Fare", x = "Survived", data = titanic_train)
plt.show()

# cut the fare into 4 parts
print(pd.cut(titanic_train['Fare'], 4).value_counts())
print("-"*20)
print("Fare Mean According to Each Cabin")
titanic_train["Fare"].groupby(titanic_train["Cabin"]).mean().sort_values(ascending = False)

In [None]:
titanic_train["Survived"].groupby(pd.cut(titanic_train['Fare'], 4)).mean()

- I will group the Fare values as follows in Feature Engineering:
  - **Very High Fare**: 384.247 - 512.329
  - **High Fare**: 256.165 - 384.247
  - **Normal Fare**: 128.082 - 256.165
  - **Low Fare**: -0.512 - 128.082
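As a side note, `pd.cut` can attach these names directly through its `labels` argument instead of only printing the intervals. The snippet below is a small sketch on hypothetical fares (not the notebook's actual pipeline); `fares`, `labels` and `fare_cat` are illustrative names.

```python
import pandas as pd

# Hypothetical fares; the last value stands in for the largest fare in the data
fares = pd.Series([0.0, 7.25, 71.28, 263.0, 512.3292])
labels = ["Low Fare", "Normal Fare", "High Fare", "Very High Fare"]

# Four equal-width bins over the observed range, named from lowest to highest
fare_cat = pd.cut(fares, bins=4, labels=labels)
print(fare_cat.tolist())
```

Because the bin edges come from the data itself, the largest fare always lands in the top bin, which avoids rounding problems with hand-typed boundaries.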

## **Ticket**
Passengers' ticket codes

In [None]:
print(titanic_train["Ticket"].unique())

In [None]:
# Get first letters of the tickets
titanic_train["Ticket_first"] = titanic_train["Ticket"].apply(lambda x: str(x)[0])


sns.catplot(x="Ticket_first", y="Survived", 
            height=5, kind="bar", data = titanic_train)
plt.title("Survival Rate")
plt.show()

print("Surviving rates of first letters")
print(titanic_train.groupby("Ticket_first")["Survived"].mean().sort_values(ascending=False))

In [None]:
# drop the helper column for now; it is rebuilt in Feature Engineering
titanic_train = titanic_train.drop("Ticket_first", axis = 1)

## **Embarked**
The Embarked feature shows us at which port each passenger boarded the Titanic.
  - S: Southampton
  - C: Cherbourg
  - Q: Queenstown


In [None]:
titanic_train.Embarked.value_counts().plot.pie(autopct='%1.1f%%',shadow=True,figsize=(10,6))
plt.title("Embarked Ports (Fig.1)")
plt.show()

print("-- Mean of Age According to Each Embarked Points --")
print(titanic_train.groupby("Embarked")[["Age"]].mean())

### **Embarked-Survived**

In [None]:
sns.catplot(x = "Embarked",y="Survived",
            data=titanic_train, kind="bar", height = 5)
plt.title("Survived Rate (Fig.2)")
plt.show()

- Figure 1 shows that most passengers boarded the Titanic at **Southampton**.
- However, Figure 2 shows that the passengers who boarded at **Cherbourg** were the most likely to survive.

## **SibSp**
Number of siblings or spouses aboard the Titanic

In [None]:
sns.countplot(x = "SibSp", data = titanic_train)
plt.title("Number of Sibling or Spouse")
plt.show()


### **SibSp-Survived**

In [None]:
g = sns.catplot(x = "SibSp", y = "Survived",
                data = titanic_train, kind = "bar", height = 5)
g.set_ylabels("Survived Probability")
plt.show()

- Passengers with many siblings or spouses aboard (high **SibSp**) have less chance to survive.
- If the **SibSp** value is 0, 1 or 2, the passenger has more chance to survive.
- We can consider a new feature describing these categories.

## **Parch**
Number of parents or children aboard the Titanic

In [None]:
sns.countplot(x = "Parch", data = titanic_train)
plt.title("Number of Parch")
plt.show()

### **Parch-Survived**

In [None]:
g = sns.catplot(x = "Parch", y = "Survived", 
                   kind = "bar", data = titanic_train, height = 5)
g.set_ylabels("Survived Probability")
plt.show()

- **SibSp** and **Parch** can be combined into a new feature, using a threshold of 3.
- Small families have more chance to survive.
- The survival estimate for passengers with Parch = 3 has a large standard deviation (a wide error bar), so it should be read with caution.

## **Name**
Passengers' names

In [None]:
titanic_train.Name.value_counts()

In [None]:
# Get titles
titanic_train["Title"] = titanic_train['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

# Print title counts
print(titanic_train["Title"].value_counts())
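To see what the chained `str.split` calls do, here is a small sketch on three name strings written in the dataset's "Surname, Title. Given names" format. The `names` series is typed out by hand for illustration instead of being read from the CSV.

```python
import pandas as pd

# Illustrative name strings in the "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Heikkinen, Miss. Laina",
])

# Step 1: keep everything after "Surname, "
after_surname = names.str.split(", ", expand=True)[1]

# Step 2: keep everything before the first "."
titles = after_surname.str.split(".", expand=True)[0]

print(titles.tolist())
```

Note that multi-word titles such as "the Countess" are kept whole, which is why that exact string appears in the title groups.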

### **Name-Survived**

In [None]:
g = sns.catplot(x = "Title", y = "Survived",kind = "bar", 
                data = titanic_train, height = 5)
g.set_ylabels("Survived Probability")
plt.xticks(rotation=90) 
plt.show()

print(titanic_train["Survived"].groupby(titanic_train["Title"]).mean().sort_values(ascending=False))

I will group the titles by their survival rates as follows:

  - **Higher** = the Countess, Mlle, Lady, Ms , Sir, Mme, Mrs, Miss, Master
  - **Neutral** = Major, Col, Dr
  - **Lower** = Mr, Rev, Jonkheer, Don, Capt

In [None]:
# drop the helper column for now; it is rebuilt in Feature Engineering
titanic_train = titanic_train.drop("Title", axis = 1)

<a id="2"></a> <br>
# **Missing Value Analysis**

In [None]:
# Before starting imputation I will take a copy from my original dataset
data1 = titanic_train.copy()
data2 = titanic_test.copy()

In [None]:
def missing_val_table(data):
    """
    Takes the dataframe as Input and returns the missing values and
    percentages with respect to dataframe length.
    """
    missing_val = data.isnull().sum()
    missing_val_perc = 100 * data.isnull().sum() / len(data)
    table = pd.concat([missing_val, missing_val_perc], axis=1)
    table = table.rename(columns = {0:"Missing Values",
                                    1:"% of Total Values"})
    table = table.sort_values(by="% of Total Values",
                              ascending=False)
    return table

missing_val_table(data1)

- I will impute as follows:
  - Mean: Age
  - Mode: Embarked
  - Cabin is not imputed here; its missing values are handled in Feature Engineering.

In [None]:
# imputing Age
data1["Age"] = data1["Age"].fillna(data1["Age"].mean())
data2["Age"] = data2["Age"].fillna(data2["Age"].mean())

# imputing Embarked
data1["Embarked"] = data1["Embarked"].fillna(data1["Embarked"].mode()[0])
data2["Embarked"] = data2["Embarked"].fillna(data2["Embarked"].mode()[0])
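The cell above fills each data set with its own statistics. A common alternative, shown here only as a sketch on made-up rows, is to compute the fill values on the training data and reuse them for the test data, so that both sets are treated identically. `train_toy`, `test_toy`, `age_mean` and `port_mode` are hypothetical names.

```python
import pandas as pd

# Made-up rows standing in for data1 (train) and data2 (test)
train_toy = pd.DataFrame({"Age": [20.0, 30.0, None], "Embarked": ["S", "S", None]})
test_toy = pd.DataFrame({"Age": [None, 40.0], "Embarked": [None, "C"]})

# Fill values are learned from the training rows only
age_mean = train_toy["Age"].mean()
port_mode = train_toy["Embarked"].mode()[0]

# The same values are applied to both data sets
fill_values = {"Age": age_mean, "Embarked": port_mode}
train_toy = train_toy.fillna(fill_values)
test_toy = test_toy.fillna(fill_values)

print(test_toy)
```

Whether this changes the leaderboard score for this notebook is not something the sketch can tell; it only shows the mechanics.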

<a id="3"></a> <br>
# **Feature Engineering**

## **Cabin**
I will group the first letter of the Cabin feature as follows:

- **High Survival Rate**: D,E,B,F,C
- **Normal Survival Rate**: G, A
- **Lower**: T, plus passengers with no recorded cabin (missing values fall into this group)

In [None]:
def assign_label_cabin(cabin):
    if cabin in ["D", "E", "B", "F", "C"]:
        return "Cabin_high"
    elif cabin in ["G", "A"]:
        return "Cabin_middle"
    else:
        return "Cabin_low"

# extract first letter (a missing cabin becomes "n" from "nan" and ends up in Cabin_low)
data1["Cabin"] = data1["Cabin"].apply(lambda x: str(x)[0])
data2["Cabin"] = data2["Cabin"].apply(lambda x: str(x)[0])

# apply the function
data1["Cabin_first"] = data1["Cabin"].apply(lambda x: assign_label_cabin(x))
data2["Cabin_first"] = data2["Cabin"].apply(lambda x: assign_label_cabin(x))

#drop the cabin feature
data1 = data1.drop("Cabin", axis = 1)
data2 = data2.drop("Cabin", axis = 1)

## **Fare**
I will group the Fare values as follows:
  - **Very High Fare**: 384.247 - 512.329
  - **High Fare**: 256.165 - 384.247
  - **Normal Fare**: 128.082 - 256.165
  - **Low Fare**: -0.512 - 128.082

In [None]:
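def fare_bound(x):
  x = float(x)
  # No upper bound on the top bin: the largest fare (about 512.3292) is slightly
  # above the rounded edge 512.329 and would otherwise fall through to "Low Fare".
  if x > 384.247:
    return "Very High Fare"
  elif 256.165 < x <= 384.247:
    return "High Fare"
  elif 128.082 < x <= 256.165:
    return "Normal Fare"
  else:
    # also reached for a missing fare (NaN), since comparisons with NaN are False
    return "Low Fare"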

# apply the function
data1["Fare_cat"] = data1["Fare"].apply(lambda x: fare_bound(x))
data2["Fare_cat"] = data2["Fare"].apply(lambda x: fare_bound(x))

#drop the fare feature
data1 = data1.drop("Fare", axis = 1)
data2 = data2.drop("Fare", axis = 1)

## **Ticket**
I am going to group the tickets by their first character as follows:

  - **Ticket High** = F, 1, P , 9
  - **Ticket Middle** = S, C, 2
  - **Ticket Low** = else

In [None]:
def label_ticket(x):
    if x in ["F", "1", "P", "9"]:
        return "Ticket_high"
    elif x in ["S", "C", "2"]:
        return "Ticket_middle"
    else:
        return "Ticket_low"

# extract first letter
data1["Ticket"] = data1["Ticket"].apply(lambda x: str(x)[0])
data2["Ticket"] = data2["Ticket"].apply(lambda x: str(x)[0])

# apply the function
data1["Ticket_cat"] = data1["Ticket"].apply(lambda x: label_ticket(x))
data2["Ticket_cat"] = data2["Ticket"].apply(lambda x: label_ticket(x))


#drop the ticket feature
data1 = data1.drop("Ticket", axis = 1)
data2 = data2.drop("Ticket", axis = 1)

## **Name**
I will group the titles by their survival rates as follows:

  - **Higher** = the Countess, Mlle, Lady, Ms , Sir, Mme, Mrs, Miss, Master
  - **Neutral** = Major, Col, Dr
  - **Lower** = Mr, Rev, Jonkheer, Don, Capt

In [None]:
def assign_label_title(title):
    if title in ["the Countess", "Mlle", "Lady", "Ms", "Sir", "Mme", "Mrs", "Miss", "Master"]:
        return "Title_high"
    elif title in ["Major", "Col", "Dr"]:
        return "Title_middle"
    else:
        return "Title_low"

# extract title from the name
data1["Title"] = data1['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
data2["Title"] = data2['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

#apply the function
data1["Title"] = data1["Title"].apply(lambda x: assign_label_title(x))
data2["Title"] = data2["Title"].apply(lambda x: assign_label_title(x))

#drop the name
data1 = data1.drop("Name", axis = 1)
data2 = data2.drop("Name", axis = 1)

## **SibSp & Parch**

In [None]:
data1["family_size"] = data1["SibSp"] + data1["Parch"]
data2["family_size"] = data2["SibSp"] + data2["Parch"]

In [None]:
def family_label(family_size):
    if family_size == 0:
        return "Alone"
    elif family_size <=3:
        return "Small_family"
    else:
        return "Big_family"

#apply the function
data1["family_size"] = data1["family_size"].apply(lambda x: family_label(x))
data2["family_size"] = data2["family_size"].apply(lambda x: family_label(x))

#drop the SibSp and Parch
data1 = data1.drop("SibSp", axis=1)
data1 = data1.drop("Parch", axis =1)

data2 = data2.drop("Parch", axis =1)
data2 = data2.drop("SibSp", axis =1)

**Let's also drop PassengerId**

In [None]:
data1 = data1.drop("PassengerId", axis = 1)
data2 = data2.drop("PassengerId", axis = 1)

**Let's look at the final version of our data set at the end of Feature Engineering**

In [None]:
display(data1.head())
display(data2.head())

<a id="4"></a> <br>
# **Label Encoding**

In [None]:
data1_new = data1.copy()
data2_new = data2.copy()

In this part I will use:
- LabelEncoding: Sex
- OneHotEncoding: Rest

In [None]:
from sklearn.preprocessing import LabelEncoder

labelEncoder = LabelEncoder()

# fit the encoder on the training data and reuse the same mapping for the test data
data1_new["Sex"] = labelEncoder.fit_transform(data1_new["Sex"])
data2_new["Sex"] = labelEncoder.transform(data2_new["Sex"])

In [None]:
dummy_cols = ["Pclass", "Embarked", "Ticket_cat", "Fare_cat", "Cabin_first", "Title", "family_size"]
data1_new = pd.get_dummies(data1_new, columns=dummy_cols, drop_first=True)
data2_new = pd.get_dummies(data2_new, columns=dummy_cols, drop_first=True)
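Because `get_dummies` is called on the two frames separately, a category that appears in only one of them would produce different columns. One possible safeguard, sketched here on toy data with hypothetical names (`train_toy`, `test_toy`), is to reindex the test columns to match the training columns.

```python
import pandas as pd

# Toy frames: the test set happens to contain no "Q" passengers
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Embarked": ["S", "C"]})

train_d = pd.get_dummies(train_toy, columns=["Embarked"], drop_first=True, dtype=int)
test_d = pd.get_dummies(test_toy, columns=["Embarked"], drop_first=True, dtype=int)

# Give the test frame exactly the training columns, in the same order;
# dummy columns missing from the test frame are filled with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(test_d)
```

In the notebook, the same idea would mean reindexing the test frame against the training columns with the target column excluded.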

**Final look at our data**

In [None]:
display(data1_new.head())

In [None]:
display(data2_new.head())

<a id="5"></a> <br>
# **Modelling**

First, I import the necessary libraries.

In [None]:
from sklearn.model_selection import train_test_split,cross_val_score,RandomizedSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.ensemble import RandomForestClassifier

In [None]:
# I will make another copy
train = data1_new.copy()
test = data2_new.copy() 

X = train.drop("Survived", axis = 1)
y = train["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 42, stratify=y)

### **Random Forest**

In [None]:
rf = RandomForestClassifier()

params = {'n_estimators': [100,300,500,700,1000],
          'max_depth': [3,5,7],
          'criterion':['entropy', 'gini'],
          'min_samples_leaf' : [1, 2, 3, 4, 5],
          'max_features':['sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
          'min_samples_split': [3, 5, 10],
          'max_leaf_nodes':[2,3,5,7],
          }

rf_cv = RandomizedSearchCV(rf, params, cv = 10, n_jobs=-1, verbose=2).fit(X_train, y_train)

In [None]:
print(rf_cv.best_params_)
best_rf_model = rf_cv.best_estimator_

print(best_rf_model)
print(rf_cv.best_score_)

In [None]:
rf_pred = rf_cv.predict(X_test)

# Print the accuracy with accuracy_score function
print("Accuracy: ", accuracy_score(y_test, rf_pred))

# Display the confusion matrix
print("\nConfusion Matrix\n")
print(confusion_matrix(y_test, rf_pred))
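To make the link between the confusion matrix and the accuracy explicit, here is a plain-Python sketch on made-up labels. It follows the layout scikit-learn uses for binary labels 0 and 1, `[[TN, FP], [FN, TP]]`; `confusion_counts` is a hypothetical helper, not part of any library.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels, laid out as [[TN, FP], [FN, TP]]."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return [[tn, fp], [fn, tp]]

# Made-up labels, not model output
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 0]

matrix = confusion_counts(y_true_toy, y_pred_toy)
(tn, fp), (fn, tp) = matrix

# Accuracy is the share of predictions on the diagonal
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(matrix, accuracy)
```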

Let's save the model

In [None]:
import pickle

# use a context manager so the file is closed after writing
with open("titanic_model.pkl", "wb") as f:
    pickle.dump(best_rf_model, f)
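For completeness, a saved object can be read back with `pickle.load`. The sketch below uses a plain dictionary as a stand-in for the model and a temporary directory, so it does not touch the real `titanic_model.pkl`; `model_stub` and `path` are hypothetical names.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted model
model_stub = {"name": "random_forest", "max_depth": 3}

# Write to a temporary location so the sketch leaves no files in the working directory
path = os.path.join(tempfile.mkdtemp(), "titanic_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model_stub, f)

# Read it back and check that the round trip preserved the object
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.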

In [None]:
Importance = pd.DataFrame({"Importance": best_rf_model.feature_importances_*100},
                         index = X_train.columns)
Importance.sort_values(by = "Importance", 
                       axis = 0, 
                       ascending = True).plot(kind ="barh", color = "r")

plt.xlabel("Feature Importance")
plt.show()

In [None]:
last_model =RandomForestClassifier(max_depth=3, max_leaf_nodes=7, min_samples_leaf=3,
                       min_samples_split=10, n_estimators=500).fit(X,y)

In [None]:
IDs = pd.read_csv(test_path)[["PassengerId"]].values

# pass the test columns in the same order as the training features
predictions = last_model.predict(test[X.columns])

print(predictions)

<a id="6"></a> <br>
# **Submission**

In [None]:
result_df = {'PassengerId': IDs.ravel(), 'Survived':predictions}
submission = pd.DataFrame(result_df)

display(submission.head())


In [None]:
# Save the file
submission.to_csv("titanic_sub.csv", index=False)