<a href="https://colab.research.google.com/github/udx1/Machine-Learning-Specialization/blob/main/Kaggle/Titanic_Survival_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

titanic_path = kagglehub.competition_download('titanic')

print('Data source import complete.')


# *Problem Statement*

### **Context**

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

### Objective
To predict if a passenger survived the sinking of the Titanic or not.


### Data Dictionary


* survival : Survived or not (0 = No, 1 = Yes)
* pclass   : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
* sex      : Male or female
* Age      : Age in years
* sibsp    : Number of siblings / spouses aboard the Titanic
* parch    : Number of parents / children aboard the Titanic
* ticket   : Ticket number
* fare     : Passenger fare
* cabin    : Cabin number
* embarked : Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Import the python libraries

In [None]:
# Libraries required to do data visualization.
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px

# To train decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

# To evaluate model performance
from sklearn.metrics import accuracy_score, f1_score

# To tune different models.
from sklearn.model_selection import GridSearchCV

# import warnings and filter them
import warnings
warnings.filterwarnings("ignore")


### Load Data
Files
* /kaggle/input/titanic/train.csv
* /kaggle/input/titanic/test.csv
* /kaggle/input/titanic/gender_submission.csv

In [None]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")

df = train_data.copy()



In [None]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
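
The hard-coded `/kaggle/input/titanic/` paths above exist only on Kaggle. In Colab, the files would be expected under the directory stored in `titanic_path` instead. A minimal sketch of a path resolver, assuming a hypothetical helper `find_data_file` (it is not part of this notebook or of kagglehub):

```python
import os
import tempfile

def find_data_file(filename, candidate_dirs):
    """Return the first existing path to `filename` among `candidate_dirs`.

    Hypothetical helper for illustration; raises FileNotFoundError if none match.
    """
    for directory in candidate_dirs:
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"{filename} not found in {candidate_dirs}")

# Demo: a temporary directory stands in for the real download location.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "train.csv"), "w").close()
    resolved = find_data_file("train.csv", ["/no/such/dir", demo_dir])
    resolved_name = os.path.basename(resolved)
    print(resolved_name)
```

In this notebook it might be called as `find_data_file("train.csv", ["/kaggle/input/titanic", titanic_path])`, so the same cell could run in both environments.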


# Data Overview

In [None]:
# Check the attributes and data types
df.info()

**Observations**

* Data has 7 numeric and 5 categorical features.
* Some data is missing in Age, Cabin and Embarked features.


In [None]:
# Check the shape of the data
df.shape

* There are 891 rows and 12 columns in the data

In [None]:
# Check the top 5 rows of the data
df.head()

In [None]:
# Check the statistical summary of the numeric features.

numeric_features = df.select_dtypes(include="number").columns.to_list()

df[numeric_features].describe().T

* The youngest passenger is under one year old (about 5 months), while the oldest is 80 years. The average age of the passengers is about 30 years.
* The passenger fare ranges from 0 to about 512.

In [None]:
# Check for missing values
df.isna().sum()

* Age and Cabin are missing a lot of values.
* Embarked is missing 2 values.
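
Raw counts are easier to judge next to percentages. The sketch below builds a small missing-value summary on a toy frame (the values are made up and only stand in for `df`); the column names `missing_count` and `missing_pct` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),
})

# Keep only columns that actually have gaps, worst first.
missing = missing[missing["missing_count"] > 0].sort_values(
    "missing_pct", ascending=False
)
print(missing)
```

Applying the same steps to `df` would show at a glance which features lose too many rows to be imputed safely.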

In [None]:
# Check for duplicate records
df.duplicated().sum()

* There are no duplicate records in the data

# *Exploratory Data Analysis*

# *Univariate Analysis*

In [None]:
# Check the distribution of numeric features

plt.figure(figsize=(12,8))

for i,feature in enumerate(numeric_features):
    plt.subplot(3,3,i+1)
    sns.histplot(data=df, x=feature, kde=True)

plt.tight_layout()
plt.show()



* Age exhibits right skewness in the data.
* Fare exhibits right skewness.

In [None]:
# Check for Outliers

plt.figure(figsize=(12,15))

for i, feature in enumerate(numeric_features):
    plt.subplot(3,3,i+1)
    sns.boxplot(data=df, y=feature)

plt.tight_layout()
plt.show()


* Age and Fare have some outliers.

In [None]:
# Check the unique values and counts of the Survived feature

print("Unique values and count of Survived feature: \n", df['Survived'].value_counts())
print("Unique values and percentage of Survived feature: \n", round(df['Survived'].value_counts()*100/df.shape[0],2))

sns.countplot(data=df, x='Survived')
plt.show()

* Out of 891 passengers, 342 passengers survived and 549 didn't survive.
* Nearly 62% of the passengers didn't survive, indicating significant loss of life.

In [None]:
# Check for Unique and count of Sex feature

print("Unique values and count of Sex feature: \n", df['Sex'].value_counts())
print("Unique values and percentage of Sex feature: \n", round(df['Sex'].value_counts()*100/df.shape[0],2))

sns.countplot(data=df, x='Sex')
plt.show()

* Out of 891 passengers, 577 are male and 314 are female passengers.
* Nearly 65% of the passengers are male

# *Bivariate Analysis*

In [None]:
# Check the correlation between the numeric variables

sns.heatmap(data=df[numeric_features].corr(), annot=True, cmap="Spectral", fmt="0.2f")
plt.show()

* Survived has a positive correlation with Fare, but it's a weak one.
* Survived has a negative correlation with Pclass.
* Pclass has a negative correlation with Fare and Age.

In [None]:
# Survived vs Male/Female

sns.countplot(data=df, x="Survived", hue="Sex")
plt.show()

* More female passengers survived than male passengers.
* Gender is an important factor to consider for modeling.

In [None]:
# Survived vs PClass

sns.countplot(data=df, x="Survived", hue="Pclass")
plt.show()

* By count, Class 1 has the most survivors, while Class 2 has the fewest.
* A significant number of Class 3 passengers didn't survive.
* Pclass might be an important factor for modeling.
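
A count plot compares absolute numbers, which depend on how many passengers each class had. A survival rate per class separates the two effects. The sketch below uses a small invented frame, not the Titanic data, to show the group with the fewest survivors and the group with the lowest survival rate can differ.

```python
import pandas as pd

# Invented toy data for illustration only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 1, 1, 0, 0, 0],
})

# Survivors (sum of 0/1 flags), group size, and survival rate (mean of flags).
summary = toy.groupby("Pclass")["Survived"].agg(
    survivors="sum",
    passengers="count",
    survival_rate="mean",
)
print(summary)
```

In this toy frame class 2 has the fewest survivors while class 3 has the lowest rate. Running the same aggregation on `df` would show whether that holds for the real data.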

In [None]:
# Survived vs Cabin

sns.countplot(data=df, x='Survived', hue='Cabin')
plt.show()

* There are too many cabins to derive any insight.

In [None]:
# Survived vs Age

sns.scatterplot(data=df, y='Survived', x='Age')
plt.show()

* No particular pattern exists between Age and Survived passengers.
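
A scatter plot of a 0/1 target against Age stacks the points on two horizontal lines, so a pattern can be hard to see even if one exists. One possible alternative is to bin Age and compare survival rates per bin. The sketch below uses invented data, and the bin edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

# Invented toy data; the last row has a missing age on purpose.
toy = pd.DataFrame({
    "Age":      [2.0, 8.0, 15.0, 24.0, 30.0, 41.0, 55.0, 70.0, float("nan")],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0, 1],
})

# Right-closed bins: (0, 12], (12, 18], (18, 60], (60, 120].
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Rows with a missing Age get no group and are left out of the rates.
rate_by_group = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(rate_by_group)
```

Whether such bins reveal anything on the real data would need to be checked by running the same steps on `df`.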

In [None]:
# Survived vs Embarked

print("Passengers by Port of Embarkation")
print(round(df.groupby('Embarked')['PassengerId'].count()*100/df.shape[0],2))


print("Survived by Port of Embarkation")
print(round(df.groupby('Embarked')['Survived'].value_counts(normalize=True)*100,2))

sns.countplot(data=df, x="Survived", hue="Embarked")
plt.show()

* About 72% of the passengers boarded at Southampton, and most of them did not survive: only about 33% of Southampton passengers survived.
* Passengers who boarded at Cherbourg had the highest survival rate, at about 55%.
* About 38% of the passengers who boarded at Queenstown survived.


In [None]:
# Pairplot of numeric variables
sns.pairplot(data=df[numeric_features])
plt.show()

**Observations**

* Sex, Pclass, and port of embarkation each appear to be associated with Survived.

# **Data Preparation for Modeling**

In [None]:
df.info()

In [None]:
# Check the missing Embarked records.

df[df['Embarked'].isna()]

In [None]:
# Are there any other passengers in the Cabin B28 or with Ticket 113572

df[(df['Cabin'] == 'B28') | (df['Ticket'] == '113572')]


* Since these 2 passengers survived, removing them from the data could result in losing important information.
* Instead, I will impute the missing Embarked values with the mode of Embarked.

In [None]:
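print(df['Embarked'].mode()[0])
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may not modify df in newer pandas versions.
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df[df['Embarked'].isna()]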


In [None]:
# Check for missing values.
df.isna().sum()

* Age and Cabin have significant missing data. They did not show a clear relationship with Survived in the plots above, so these columns are dropped below instead of being imputed.

In [None]:
y_train = df['Survived']
X_train = df.drop('Survived', axis=1)

In [None]:
# Drop input features that are not used by this model.
# PassengerId is kept so the columns match X_test, where it is needed
# to build the submission file.

X_train = X_train.drop(columns=['Name', 'Age', 'Cabin', 'Ticket'])




In [None]:
# Convert the categorical variables into numeric values.

X_train = pd.get_dummies(data=X_train, columns=['Sex', 'Embarked'], dtype=float, drop_first=True)


In [None]:
X_train.info()

# **Model Building**

In [None]:
# Model build using DecisionTreeClassifier and GridSearchCV Hyperparameter tuning.

# Instantiate the DecisionTreeClassifier model.
dtree = DecisionTreeClassifier(random_state=42)

# Define the hyper parameters
param_grid = {
    "max_depth": np.arange(2,11,2),
    "max_leaf_nodes": np.arange(10,51,10),
    "min_samples_split" : np.arange(10,51,10)
}

grid_search = GridSearchCV(
    estimator=dtree,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

# Fit the training data.
grid_search.fit(X_train, y_train)

# Best fit model
best_dtree_model = grid_search.best_estimator_

# Compute the accuracy on the training data (not a held-out estimate).
training_score = accuracy_score(y_train, best_dtree_model.predict(X_train))
print("Training accuracy with best parameters = {:.2f}".format(training_score))
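
Training accuracy tends to be optimistic because the model has already seen those rows. `GridSearchCV` also stores the mean cross-validated score of the winning combination in `best_score_` and the combination itself in `best_params_`. The sketch below shows both on synthetic data from `make_classification` with a deliberately small grid; the data and grid are assumptions for illustration, not the notebook's own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# A small illustrative grid: 2 * 2 = 4 parameter combinations.
demo_grid = {"max_depth": [2, 4], "min_samples_split": [10, 20]}
n_candidates = len(ParameterGrid(demo_grid))

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    demo_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Candidates:", n_candidates)
print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy: {:.2f}".format(search.best_score_))
print("Training accuracy: {:.2f}".format(search.score(X_demo, y_demo)))
```

By the same count, the grid in this notebook has 5 × 5 × 5 = 125 combinations, so with `cv=5` the search fits 625 trees plus one final refit.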

In [None]:
# Prepare the test data for modeling.
# The Kaggle test file has no Survived column, so accuracy cannot be computed here.
X_test = test_data.copy()

# PassengerId is kept: it is needed to build the submission file.
X_test = X_test.drop(columns=['Name', 'Age', 'Cabin', 'Ticket'])


In [None]:
print(X_test['Embarked'].mode()[0])
# Assign back instead of using inplace=True on a column selection.
X_test['Embarked'] = X_test['Embarked'].fillna(X_test['Embarked'].mode()[0])
X_test[X_test['Embarked'].isna()]

In [None]:
X_test = pd.get_dummies(
    data=X_test,
    columns=['Sex', 'Embarked'],
    dtype=float,
    drop_first=True
)
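
Encoding the training and test frames separately only yields matching columns when both contain the same categories. With `drop_first=True`, a file that lacks one category would drop a different level and silently shift the encoding. One way to guard against that is to fix the category list before encoding. The sketch below uses invented toy frames; `port_categories` is an illustrative name.

```python
import pandas as pd

# Invented toy frames: the test frame contains only one port.
train_toy = pd.DataFrame({"Fare": [7.25, 71.28, 8.05], "Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Fare": [7.9, 26.0], "Embarked": ["S", "S"]})

# Fix the category list from the training data and apply it to both frames.
port_categories = sorted(train_toy["Embarked"].dropna().unique())
for frame in (train_toy, test_toy):
    frame["Embarked"] = pd.Categorical(frame["Embarked"], categories=port_categories)

# With a shared categorical dtype, get_dummies emits the same columns for both.
X_train_toy = pd.get_dummies(train_toy, columns=["Embarked"], dtype=float, drop_first=True)
X_test_toy = pd.get_dummies(test_toy, columns=["Embarked"], dtype=float, drop_first=True)

print(list(X_train_toy.columns))
print(list(X_test_toy.columns))
```

Comparing `list(X_train.columns)` with `list(X_test.columns)` before calling `predict` is a cheap way to confirm the two frames line up.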

In [None]:
# Populate the missing Fare value with the mean Fare of the test data.

print(X_test['Fare'].mean())
# Assign back instead of using inplace=True on a column selection.
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].mean())
X_test[X_test['Fare'].isna()]

In [None]:
X_test.info()

In [None]:
# Check for missing values in test data
X_test.isna().sum()

In [None]:
# Predict the Survival on test data.

y_test = best_dtree_model.predict(X_test)

In [None]:
# Create output file for submission.

output_dict = {
    "PassengerId" : X_test["PassengerId"],
    "Survived" : y_test
}

output_df = pd.DataFrame(output_dict)
output_df.head(10)



In [None]:
# Generate the submission file
output_df.to_csv("submission.csv", index=False)
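
Before uploading, it can help to check the file's shape and contents. The sketch below assumes the expected format is exactly two columns, `PassengerId` and `Survived`, with one 0/1 prediction per test passenger; `submission_problems` is a hypothetical helper, and the toy frames are invented.

```python
import pandas as pd

def submission_problems(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty if none).

    Hypothetical helper for illustration.
    """
    if list(submission.columns) != ["PassengerId", "Survived"]:
        return ["unexpected columns"]
    problems = []
    if len(submission) != len(expected_ids):
        problems.append("row count mismatch")
    elif submission["PassengerId"].tolist() != list(expected_ids):
        problems.append("PassengerId mismatch")
    if not submission["Survived"].isin([0, 1]).all():
        problems.append("Survived must be 0 or 1")
    return problems

# Toy frames: one valid, one with an invalid label.
good = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 2, 0]})

good_result = submission_problems(good, [892, 893, 894])
bad_result = submission_problems(bad, [892, 893, 894])
print(good_result)
print(bad_result)
```

In this notebook the check might be run as `submission_problems(output_df, test_data["PassengerId"])` just before writing the CSV.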