# Titanic - Machine Learning from Disaster
Hello everyone,

I am new to machine learning and would like to try the Titanic problem, so any comments and feedback are welcome.

My plan is to first understand the data with some visualization, then process the data for modeling, and finally create an ML model for prediction.

**Best Score: 0.79186 - Top 7%**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df_train = pd.read_csv("/kaggle/input/titanic/train.csv")
df_test = pd.read_csv("/kaggle/input/titanic/test.csv")
df_train.head()

# 0. Reference Score

[Alexis Cook's Titanic Tutorial notebook](https://www.kaggle.com/alexisbcook/titanic-tutorial) is a great tutorial on how to use Kaggle, approach the Titanic problem, create a basic ML model, and make a prediction. Thanks for the tutorial!

First, I would like to use the same code from the tutorial and make a prediction, so that I can see how the Random Forest model performs and use that score as a benchmark.

In [None]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

After submitting this prediction, the **score** is **0.77511** (v1 of this notebook).

# 1. Understanding the Data

In [None]:
df_train.info()
print("\n" + "-"*50 + "\n")

print("# of Missing values in train data:")
n_miss_val = df_train.isnull().sum().sort_values(ascending=False)
print(n_miss_val[n_miss_val>0])
print("# of Missing values in test data:")
n_miss_val_test = df_test.isnull().sum().sort_values(ascending=False)
print(n_miss_val_test[n_miss_val_test>0])
print("\n" + "-"*50 + "\n")

print("Distribution of Survived column:")
df_train["Survived"].value_counts()

So, we have 891 values in the Survived column. Unfortunately, we see from the distribution that most of the people did not survive.
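
Since the classes are imbalanced, it helps to know what accuracy a model would get by always predicting the majority class. The sketch below uses a small made-up label series (not the real `Survived` column) just to show the calculation; `survived`, `class_share` and `baseline_accuracy` are illustrative names.

```python
import pandas as pd

# Toy labels standing in for df_train["Survived"] (made-up values).
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], name="Survived")

# Share of each class; the largest share equals the accuracy of a model
# that always predicts the majority class.
class_share = survived.value_counts(normalize=True)
baseline_accuracy = class_share.max()

print(class_share)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

Running the same two lines on `df_train["Survived"]` would give the baseline that any model in this notebook should beat.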

Intuitively, I feel like the Pclass and Sex columns have a strong relation with surviving. Let's start with them:

**a. Pclass and Sex**

In [None]:
df_train[["Pclass", "Survived"]].groupby(["Pclass"], as_index=False).mean()

In [None]:
df_train[["Sex", "Survived"]].groupby(["Sex"], as_index=False).mean()

The Pclass and Sex columns have a strong relationship with surviving. I will use these columns as they are.
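
The two columns can also be looked at together. This is a minimal sketch on a made-up toy frame (the values are not from the competition data) showing how a two-key `groupby` gives the survival rate per Sex and Pclass combination.

```python
import pandas as pd

# Toy rows in the same shape as the train data (made-up values).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "female", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 0],
})

# Mean of Survived per (Sex, Pclass) pair, reshaped so that
# rows are Sex and columns are Pclass.
rates = toy.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")
print(rates)
```

Replacing `toy` with `df_train` would show whether the effect of class is similar for both sexes.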

**b. SibSp and Parch**


In [None]:
df_train[["SibSp", "Survived"]].groupby("SibSp").agg(["count","mean"])

In [None]:
df_train[["Parch", "Survived"]].groupby("Parch").agg(["count","mean"])

* SibSp is the number of siblings + spouses
* Parch is the number of parents + children


* People with 0 SibSp / Parch have a survival rate of about 0.34
* People with a few family members seem to have a higher survival chance
* But when people have more family members, the survival chance suddenly drops


* To test my theory, I will define a new column for family size, 'FamSize'
* If my theory is correct, I will define a new categorical column for family type
* Categories will be: Alone, Small Family, Large Family

In [None]:
df_train['FamSize'] = df_train['SibSp'] + df_train['Parch']
df_train[['FamSize', 'Survived']].groupby('FamSize').agg(['count','mean'])

In [None]:
df_train['FamType'] = np.where(df_train['FamSize'] == 0, "Alone", "Small_Family")
df_train.loc[df_train['FamSize'] > 3, 'FamType'] = "Large_Family"

df_train[['FamType', 'Survived']].groupby('FamType').agg(['count','mean'])

In [None]:
# same operations in test data
df_test['FamSize'] = df_test['SibSp'] + df_test['Parch']
df_test['FamType'] = np.where(df_test['FamSize'] == 0, "Alone", "Small_Family")
df_test.loc[df_test['FamSize'] > 3, 'FamType'] = "Large_Family"

As a result, I am happy with my new Family Type column (FamType) and will use it in my model.
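
The same three categories can also be produced in one step with `pd.cut`. This is only an alternative sketch, not what the cells above use; the bin edges are my reading of the rules (0 is Alone, 1 to 3 is Small_Family, more than 3 is Large_Family), and `fam_size` is a toy series.

```python
import numpy as np
import pandas as pd

# Toy family sizes (SibSp + Parch), made-up values.
fam_size = pd.Series([0, 1, 2, 3, 4, 7], name="FamSize")

# Intervals are right-closed by default: (-1, 0], (0, 3], (3, inf)
fam_type = pd.cut(
    fam_size,
    bins=[-1, 0, 3, np.inf],
    labels=["Alone", "Small_Family", "Large_Family"],
)
print(fam_type.tolist())
```

Note that `pd.cut` returns a categorical column, which `pd.get_dummies` handles the same way as a string column.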

**c. Name**

The "Name" variable seems to follow the pattern "LastName, Title. FirstName", so we can extract these three pieces of information from the column. I don't think FirstName is valuable for us. I am also not sure about LastName, since we already defined a column for families. So for now I will extract Title and see if it is useful.

In [None]:
# extracting the string between "," and "."
df_train["Title"] = df_train["Name"].apply(lambda x :x.split(",")[-1]).str.split(".",expand=True).loc[:,0]
df_train["Title"].value_counts()
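
The split-based extraction above can also be written as a single regular expression with `Series.str.extract`. This is a sketch on a few sample strings in the dataset's name format; the group names `LastName`, `Title` and `FirstName` are my own labels. Unlike the split version, this pattern does not keep the space that follows the comma.

```python
import pandas as pd

# Sample strings in the "LastName, Title. FirstName" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# LastName: everything before the first comma
# Title: everything between the comma and the next period
# FirstName: the rest of the string
pattern = r"^(?P<LastName>[^,]+),\s*(?P<Title>[^.]+)\.\s*(?P<FirstName>.*)$"
parts = names.str.extract(pattern)
print(parts)
```

If this version were used in the notebook, it would have to be applied to both train and test so that the title strings stay identical.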

In [None]:
# The Mr, Miss, Mrs and Master titles look good
# But the other titles have really low occurrence
# I will replace them with "Other"
title_counts = df_train["Title"].value_counts()
titles_to_keep = title_counts[title_counts>10]
titles_to_replace = list( set(title_counts.index) - set(titles_to_keep.index) )
df_train["TitleProcessed"] = df_train["Title"].replace(titles_to_replace, "Other")

df_train[["TitleProcessed", "Survived"]].groupby("TitleProcessed").agg(["count","mean"])

In [None]:
# same operations in test data
df_test["Title"] = df_test["Name"].apply(lambda x :x.split(",")[-1]).str.split(".",expand=True).loc[:,0]

title_counts_test = df_test["Title"].value_counts() # we will keep the same title categories as in train
titles_to_replace_test = list( set(title_counts_test.index) - set(titles_to_keep.index) )
df_test["TitleProcessed"] = df_test["Title"].replace(titles_to_replace_test, "Other")
df_test["TitleProcessed"].value_counts()

**d. Age**

The Age column in the train dataset has 177 missing values. There are several ways to fill these values, for example filling with the mean/median, using a placeholder value, or even generating random numbers from a Gaussian distribution using the column's mean and std.
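
To make these options concrete, here is a small sketch on a toy `age` series (made-up values). The placeholder `-1`, the fixed random seed and the clipping at 0 are my own assumptions for the illustration.

```python
import numpy as np
import pandas as pd

# Toy ages with two missing values (made-up numbers).
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0], name="Age")
missing = age.isna()

# Option 1: fill with the median (the mean works the same way).
filled_median = age.fillna(age.median())

# Option 2: fill with a placeholder value that cannot be a real age.
filled_placeholder = age.fillna(-1)

# Option 3: draw random values from a Gaussian with the column's mean and std.
rng = np.random.default_rng(0)
random_ages = rng.normal(age.mean(), age.std(), size=int(missing.sum()))
random_ages = np.clip(random_ages, 0, None)  # avoid negative ages
filled_random = age.copy()
filled_random.loc[missing] = random_ages

print(filled_median.tolist())
print(filled_placeholder.tolist())
print(filled_random.round(1).tolist())
```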

In Pedro Marcelino's (pmarcelino) [Data analysis and feature extraction with Python](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python) notebook, I've seen a really interesting approach. It mentions that these values can be estimated based on known relationships. A person's title is related to their age, so the mean age of each title is used to fill the missing values.

Inspired by this approach, I will calculate the average age for each TitleProcessed category and use these values for filling.

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(12, 7))
fig.suptitle('Distribution of Age - Before Filling Missing Values')
sns.kdeplot(ax=axes[0], data=df_train, x="Age", hue="Survived", fill=True)  # fill replaces the deprecated shade
sns.barplot(ax=axes[1], data=df_train, x="TitleProcessed", y="Age")

In [None]:
# Creating a dictionary for the average age value of each Title
titles_avg_age = df_train.groupby('TitleProcessed')['Age'].mean().round(2).to_dict()
titles_avg_age

In [None]:
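# Filling missing values with the average age of each passenger's title.
# Assigning the result back instead of using inplace=True on a column selection,
# which newer pandas versions may not apply to the original DataFrame.
df_train['Age'] = df_train['Age'].fillna(df_train['TitleProcessed'].map(titles_avg_age))
df_test['Age'] = df_test['Age'].fillna(df_test['TitleProcessed'].map(titles_avg_age))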

In [None]:
fig, axes = plt.subplots(1, 1, figsize=(12, 4))
fig.suptitle('Distribution of Age - After Filling Missing Values')
sns.kdeplot(data=df_train, x="Age", hue="Survived", fill=True)  # fill replaces the deprecated shade

**e. Embarked**

In [None]:
df_train[["Embarked", "Survived"]].groupby(["Embarked"]).agg(['count','mean'])

In [None]:
sns.catplot(x="Embarked", y="Survived", data=df_train, kind = "bar")

In [None]:
# In the Embarked column, we only have 2 missing values.
# So I will fill them with the most frequent value, S
df_train["Embarked"] = df_train["Embarked"].fillna("S")

**f. Fare**

In [None]:
df_train["Fare"].describe()

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.kdeplot(ax=axes[0], data=df_train, x="Fare", hue="Survived")
sns.violinplot(ax=axes[1], data=df_train, x="Survived", y="Fare")
sns.violinplot(ax=axes[2], data=df_train, x="Pclass", y="Fare", hue="Survived")

As the fare increases, the chance of survival also seems to increase. We can also see from the third plot that there is a relation between fare and class.

In Pclass 1, there are some really expensive tickets which we can consider outliers. Fare is a positively skewed variable, so I also want to see the plots after a log transform. Since the minimum value of the fare is 0 and the logarithm is only defined for values greater than 0, we have to shift the values before applying it. Intuitively, I will first add the median of the fare and then take the log transform.
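
As a quick check of what the shift and log do to the skew, here is a sketch on a toy `fare` series (made-up values that include a 0 and a few very large fares). `np.log1p` is shown only as a common alternative shift; it is not what this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy fares, made-up values with a zero and a long right tail.
fare = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.3, 263.0, 512.3],
                 name="Fare")

# Shift by the median so that every value is strictly positive, then take the log.
fare_log = np.log(fare + fare.median())

# Common alternative: log(1 + x), which also handles zeros.
fare_log1p = np.log1p(fare)

print("skew raw      :", round(fare.skew(), 2))
print("skew log+med  :", round(fare_log.skew(), 2))
print("skew log1p    :", round(fare_log1p.skew(), 2))
```

A skew closer to 0 means a more symmetric distribution, which is what the plots after the transform should reflect.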

In [None]:
df_train["FareLog"] = np.log( df_train["Fare"] + df_train["Fare"].median() )

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.kdeplot(ax=axes[0], data=df_train, x="FareLog", hue="Survived")
sns.violinplot(ax=axes[1], data=df_train, x="Survived", y="FareLog")
sns.violinplot(ax=axes[2], data=df_train, x="Pclass", y="FareLog", hue="Survived")

We obtained an interesting variable. I am going to try out this variable for the model.

In [None]:
# In the test data, we have 1 missing Fare. I will simply fill it with the median,
# then apply the same log operation
df_test["Fare"] = df_test["Fare"].fillna(df_test["Fare"].median())
df_test["FareLog"] = np.log( df_test["Fare"] + df_test["Fare"].median() )

**g. Ticket and Cabin**

In [None]:
print("Random 10 Tickets:")
print( df_train["Ticket"].sample(10, random_state=42) )
print()

print("Checking max repetition:")
ticket_counts = df_train["Ticket"].value_counts()
print(ticket_counts.head())
print()

print("# of unique tickets: {}".format(df_train["Ticket"].nunique()))

In [None]:
print("Random 10 Cabins:")
print( df_train["Cabin"].sample(10, random_state=42) )
print()

print("Checking max repetition:")
cabin_counts = df_train["Cabin"].value_counts()
print(cabin_counts.head())
print()

print("# of unique cabins: {}".format(df_train["Cabin"].nunique()))
print("# of missing values: {}".format(df_train["Cabin"].isnull().sum()))

Ticket:
* I'm not sure if ticket has a relationship with surviving
* Some values are numerical, some of them are not
* Most of the tickets are unique
* There are some repeated tickets, maybe they are family members

Cabin:
* If there were more values, maybe we could process this column and find some information
* But most of the values are missing

For now, I will not use the Ticket and Cabin columns in my model.
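
If I come back to the repeated-ticket idea later, one way to check it would be to count how many passengers share each ticket and whether they share a last name. This is only a sketch on made-up rows; `TicketGroupSize` and `LastName` are hypothetical columns that do not exist in the notebook.

```python
import pandas as pd

# Made-up rows: ticket numbers and last names are invented for illustration.
toy = pd.DataFrame({
    "Ticket":   ["T100", "T200", "T200", "T300", "T300", "T300"],
    "LastName": ["Smith", "Jones", "Jones", "Brown", "Brown", "Brown"],
})

# Number of passengers travelling on the same ticket.
toy["TicketGroupSize"] = toy.groupby("Ticket")["Ticket"].transform("count")

# For shared tickets, count the distinct last names per ticket.
# A value of 1 suggests the ticket holders are one family.
shared = toy[toy["TicketGroupSize"] > 1]
names_per_ticket = shared.groupby("Ticket")["LastName"].nunique()

print(toy)
print(names_per_ticket)
```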

# 2. Preprocess Data

In [None]:
# In this section:
# Selecting the columns that I will use in the model
# Applying get_dummies for categorical columns
# Dropping one column for each categorical column
#    because the other columns contain that information
# Finally, feature scaling

X_train = df_train[["Pclass","TitleProcessed","Sex","Age","FamType","FareLog","Embarked"]]
X_train = pd.get_dummies(X_train)
X_train = X_train.drop(["TitleProcessed_Other","Sex_female","FamType_Large_Family","Embarked_Q"], axis=1)
X_train.head()

In [None]:
# Same operation to test data
X_test = df_test[["Pclass","TitleProcessed","Sex","Age","FamType","FareLog","Embarked"]]
X_test = pd.get_dummies(X_test)
X_test = X_test.drop(["TitleProcessed_Other","Sex_female","FamType_Large_Family","Embarked_Q"], axis=1)
X_test.head()
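
Applying `get_dummies` separately to train and test only works if both contain the same categories. As a hedged sketch (toy frames, made-up values), `reindex` can be used to force the test columns to match the train columns when a category is missing in test.

```python
import pandas as pd

# Toy frames: the test frame has no passenger with Embarked == "Q".
train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Pclass": [3, 1, 2]})
test = pd.DataFrame({"Embarked": ["S", "S", "C"], "Pclass": [3, 2, 1]})

X_tr = pd.get_dummies(train)

# Align test columns to the train columns; missing dummy columns are filled with 0.
X_te = pd.get_dummies(test).reindex(columns=X_tr.columns, fill_value=0)

print(X_tr.columns.tolist())
print(X_te.columns.tolist())
```

In this notebook the drop call above would raise an error if one of the listed dummy columns were missing, so the mismatch would at least not go unnoticed.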

In [None]:
y_train = df_train["Survived"]
y_train.head()

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# 3. Model Selection

In this section, I will try different classifiers and compare their cross-validation scores. I will use both scaled and unscaled data for comparison.
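
The comparison can also be written as one loop. The sketch below uses synthetic data from `make_classification` instead of the Titanic features, and it wraps the scaler in a pipeline so that it is refit inside each fold. That is a different setup from the pre-scaled `X_train_scaled` used in the cells below, so the numbers are not comparable; `X_demo`, `y_demo` and `scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (X_train, y_train).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
}

scores = {}
for name, clf in classifiers.items():
    # Mean 10-fold accuracy without scaling.
    unscaled = cross_val_score(clf, X_demo, y_demo, cv=10).mean()
    # Mean 10-fold accuracy with the scaler fitted on each training fold only.
    scaled = cross_val_score(
        make_pipeline(StandardScaler(), clf), X_demo, y_demo, cv=10
    ).mean()
    scores[name] = (round(unscaled, 3), round(scaled, 3))
    print("{}: unscaled={:.3f}, scaled={:.3f}".format(name, unscaled, scaled))
```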

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [None]:
# LogisticRegression
classifier = LogisticRegression(solver='liblinear')

score_unscaled = cross_val_score(classifier, X_train, y_train, cv=10).mean().round(3)
score_scaled = cross_val_score(classifier, X_train_scaled, y_train, cv=10).mean().round(3)
print("Logistic Regression")
print("Cross Val Score: {}".format(score_unscaled))
print("Cross Val Score for Scaled Data: {}".format(score_scaled))

In [None]:
# K-Nearest Neighbors
classifier = KNeighborsClassifier()

score_unscaled = cross_val_score(classifier, X_train, y_train, cv=10).mean().round(3)
score_scaled = cross_val_score(classifier, X_train_scaled, y_train, cv=10).mean().round(3)
print("K-Nearest Neighbors")
print("Cross Val Score: {}".format(score_unscaled))
print("Cross Val Score for Scaled Data: {}".format(score_scaled))

In [None]:
# Support Vector Classifier (SVC)
classifier = SVC()

score_unscaled = cross_val_score(classifier, X_train, y_train, cv=10).mean().round(3)
score_scaled = cross_val_score(classifier, X_train_scaled, y_train, cv=10).mean().round(3)
print("Support Vector Classifier")
print("Cross Val Score: {}".format(score_unscaled))
print("Cross Val Score for Scaled Data: {}".format(score_scaled))

In [None]:
# Naive Bayes
classifier = GaussianNB()

score_unscaled = cross_val_score(classifier, X_train, y_train, cv=10).mean().round(3)
score_scaled = cross_val_score(classifier, X_train_scaled, y_train, cv=10).mean().round(3)
print("Naive Bayes")
print("Cross Val Score: {}".format(score_unscaled))
print("Cross Val Score for Scaled Data: {}".format(score_scaled))

In [None]:
# Decision Tree Classification
classifier = DecisionTreeClassifier()

score_unscaled = cross_val_score(classifier, X_train, y_train, cv=10).mean().round(3)
score_scaled = cross_val_score(classifier, X_train_scaled, y_train, cv=10).mean().round(3)
print("Decision Tree Classifier")
print("Cross Val Score: {}".format(score_unscaled))
print("Cross Val Score for Scaled Data: {}".format(score_scaled))

In [None]:
# Random Forest Classification
classifier = RandomForestClassifier()

score_unscaled = cross_val_score(classifier, X_train, y_train, cv=10).mean().round(3)
score_scaled = cross_val_score(classifier, X_train_scaled, y_train, cv=10).mean().round(3)
print("Random Forest Classifier")
print("Cross Val Score: {}".format(score_unscaled))
print("Cross Val Score for Scaled Data: {}".format(score_scaled))

# 4. Modeling and Submission

In section 3, Logistic Regression and SVC have the best cross-validation scores, followed by KNN and Random Forest. In this section I will submit the predictions from these classifiers and compare the submission scores. Later, I will try to improve the best classifier with hyperparameter tuning.

In [None]:
# SVC
classifier = SVC()
classifier.fit(X_train_scaled, y_train)

In [None]:
y_pred = classifier.predict(X_test_scaled)

In [None]:
result = pd.DataFrame({'PassengerId': df_test.PassengerId, 'Survived': y_pred})
result.head()

In [None]:
result.to_csv('submission.csv', index=False)
print("Results are saved to submission.csv")

So without hyperparameter tuning, I submitted all the predictions and achieved the following scores. SVC has the highest score. Now, let's tune the hyperparameters using grid search and submit again.
* LogReg  : 0.76076 (V4)
* SVC     : **0.78947** (V5)
* RandFor : 0.75119 (V6)
* KNN     : 0.76794 (V7)

In [None]:
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

parameters = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf', 'sigmoid']
             }

grid = GridSearchCV(SVC(), parameters, refit = True)
grid.fit(X_train_scaled,y_train)

# printing results
print("Best Score: {}".format(grid.best_score_))
print("Best Parameters: {}".format(grid.best_params_))
print("Model: {}".format(grid.best_estimator_))

After parameter tuning, I obtained the best validation score with the parameters {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}. Even so, I didn't get a better competition score using these parameters.

For now, I get the best score using SVC with default parameters.
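
One thing that may help explain this is how close the top candidates are to each other. The fitted grid search stores the mean and standard deviation of the fold scores in `cv_results_`. The sketch below runs a small grid on synthetic data (not the Titanic features) only to show how to read that table; `grid_demo`, `X_demo` and `y_demo` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for (X_train_scaled, y_train).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 0.01]}
grid_demo = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid_demo.fit(X_demo, y_demo)

# One row per parameter combination, best rank first.
results = pd.DataFrame(grid_demo.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head(3)
print(top)
```

If the difference between the best candidates is smaller than their `std_test_score`, the ranking is not very reliable, and a different leaderboard result is not surprising.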

V9 update:
In the Fare section, I used to add the mean value before taking the log transform. Now, I tried adding the median instead of the mean. The distribution of FareLog is now better. Keeping the other parameters the same, I increased my score from 0.78947 to 0.79186.

Thank you very much for your interest in my notebook! :)