# About Dataset

* **Survived** - Survival (0 = No, 1 = Yes) ---> Output Variable
* **Pclass** - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) ---> Input Variable
* **Sex** - Sex of the passenger ---> Input Variable
* **Age** - Age in years ---> Input Variable
* **SibSp** - Number of siblings/spouses aboard the Titanic ---> Input Variable
* **Parch** - number of parents/children aboard the Titanic ---> Input Variable
* **Ticket** - Ticket number ---> Input Variable
* **Fare** - Passenger fare ---> Input Variable
* **Cabin** - Cabin number ---> Input Variable
* **Embarked** - Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) ---> Input Variable

# Importing Libraries / Reading Dataset

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler , LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

In [None]:
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')

display(train.head())

# EDA

In [None]:
# info() prints its summary and returns None, so wrapping it in display() only adds a stray "None"
train.info()

In [None]:
# info() prints its summary and returns None, so no display() wrapper is needed
test.info()

**PassengerId, Name, and Ticket don't seem to carry any predictive significance in their raw form, so these columns are dropped.**

**Cabin is missing in more than 500 rows, so that column is dropped as well.**

In [None]:
train.drop(['PassengerId' , 'Name' , 'Ticket' , 'Cabin'] , axis=1 , inplace=True)

train.info()

In [None]:
test.drop(['PassengerId' , 'Name' , 'Ticket' , 'Cabin'] , axis=1 , inplace=True)

test.info()

In [None]:
train.isna().sum()

In [None]:
test.isna().sum()

**The Age and Embarked columns still have a few missing values. Missing Age (numerical) values are replaced with the median and missing Embarked (categorical) values with the mode. In the test set, missing Fare values are replaced with the mean.**

In [None]:
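# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the original frame
train['Age'] = train['Age'].fillna(train['Age'].median())
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode().values[0])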

train.isna().sum()

In [None]:
# Assign the result back instead of using inplace=True on a column selection
test['Age'] = test['Age'].fillna(test['Age'].median())
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())

test.isna().sum()
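
The cells above fill each frame from its own statistics. A common alternative is to learn the fill values from the training data only and reuse them on the test data. The sketch below shows that variant on made-up toy frames; the names `toy_train`, `toy_test` and `fill_values` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train/test (values are made up for illustration).
toy_train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", "S", None],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 47.0],
    "Fare": [7.83, np.nan],
    "Embarked": ["Q", "S"],
})

# Learn the fill values from the training frame only.
fill_values = {
    "Age": toy_train["Age"].median(),
    "Fare": toy_train["Fare"].median(),
    "Embarked": toy_train["Embarked"].mode().iloc[0],
}

# fillna with a dict fills each column with its own value.
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)

print(fill_values)
print(toy_test_filled)
```

Whether this changes the final score on a dataset of this size is an open question; the point is only that the test data never influences the preprocessing.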

In [None]:
train.head()

In [None]:
train.duplicated().sum()

In [None]:
train.drop_duplicates(inplace=True)
train.info()

In [None]:
train.describe()

In [None]:
train.describe(include=['O'])

**Analyzing how each categorical, ordinal, or discrete feature relates to survival by grouping on it and averaging Survived.**

* **Pclass -** Pclass is strongly associated with Survived: the mean survival rate for Pclass=1 and Pclass=2 is markedly higher than for Pclass=3 (see the grouped means below).

* **Sex -** We confirm the observation that Sex=female had a very high survival rate, at 74%.

* **SibSp / Parch -** These features have zero correlation with survival for certain values. It may be best to derive a feature or a set of features from these individual features.

In [None]:
train[['Pclass' , 'Survived']].groupby(['Pclass'] , as_index=False).mean().sort_values(by='Survived' , ascending=False)

In [None]:
train[['Sex' , 'Survived']].groupby(['Sex'] , as_index=False).mean().sort_values(by='Survived' , ascending=False)

In [None]:
train[['SibSp' , 'Survived']].groupby(['SibSp'] , as_index=False).mean().sort_values(by='Survived' , ascending=False)

In [None]:
train[['Parch' , 'Survived']].groupby(['Parch'] , as_index=False).mean().sort_values(by='Survived' , ascending=False)
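
The four group-by cells above repeat the same pattern and report only the mean. A small helper that also returns the group size makes it easier to see when a rate comes from just a handful of passengers. This is a sketch on a made-up frame; `survival_by` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
import pandas as pd

def survival_by(df, col, target="Survived"):
    """Return survival rate and group size for each value of `col`."""
    return (
        df.groupby(col)[target]
          .agg(rate="mean", n="count")   # named aggregation: one column per statistic
          .sort_values("rate", ascending=False)
          .reset_index()
    )

# Made-up rows, only to exercise the helper.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})
summary = survival_by(toy, "Pclass")
print(summary)
```

On the real data the call would be `survival_by(train, 'Pclass')`, and likewise for `Sex`, `SibSp` and `Parch`.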

# Data Visualization

In [None]:
cat_cols = list(train.select_dtypes(include='object').columns)
num_cols = list(train.select_dtypes(exclude='object').columns)
num_cols.remove('Survived')

print(cat_cols , num_cols)

In [None]:
cat_cols = ['Sex' , 'Embarked' , 'Pclass' , 'SibSp' , 'Parch']
num_cols = ['Age' , 'Fare']

**Plotting values for each Feature**

In [None]:
sns.set()

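for n in num_cols:
    plt.figure(figsize=(12,8))
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(train[n], kde=True)
    plt.title(f'{n}' , size=14)
    plt.show()

for c in cat_cols:
    plt.figure(figsize=(12,8))
    # pass the column as x explicitly; newer seaborn versions treat the first positional argument as `data`
    ax = sns.countplot(x=train[c])
    for i in ax.containers:
        ax.bar_label(i, label_type='edge', padding=1)

    plt.title(f'{c}' , size=14)
    plt.show()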
        

In [None]:
plt.figure(figsize=(12, 8))
sns.histplot(x=train.Age, hue=train.Sex, element='step')
plt.title('Male/Female Ages' , size=15)

**Relationship between each feature and target variable ( Survived )**

In [None]:
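for n in num_cols:
    # FacetGrid creates its own figure, so no plt.figure() call is needed;
    # the `size` argument was renamed to `height` in seaborn 0.9
    g = sns.FacetGrid(train , col='Survived' , height=4)
    g.map(plt.hist, n , bins=20)

    #plt.title(f"{n} vs Survived", size=15)
    plt.show()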

for c in cat_cols:
    plt.figure(figsize=(12,8))
    ax = sns.countplot(x=train.Survived, hue=train[c])
    
    for i in ax.containers:
        ax.bar_label(i, label_type='edge', padding=1)
    ax.margins(y=0.1)
    
    plt.title(f"Survived vs {c}", size=15)
    plt.show()

In [None]:
# `size` was renamed to `height` in seaborn 0.9
grid = sns.FacetGrid(train, col='Survived', row='Pclass', height=4, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
plt.show()

* Pclass=3 had the most passengers; however, most of them did not survive.
* Infant passengers in Pclass=2 and Pclass=3 mostly survived.
* Most passengers in Pclass=1 survived.
* The age distribution of passengers varies by Pclass.

In [None]:
# `size` was renamed to `height` in seaborn 0.9
grid = sns.FacetGrid(train , row='Embarked' , col='Survived' , height=3 , aspect=1.8)
grid.map(sns.barplot , 'Sex' , 'Fare' , alpha=0.5)
plt.show()

* Passengers who paid higher fares had better survival rates.
* Port of embarkation correlates with survival rates.

In [None]:
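# Restrict the correlation to numeric columns; Sex and Embarked are still strings here
corr = train.corr(numeric_only=True)
print(corr)
sns.heatmap(corr)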

# Feature Engineering

In [None]:
#FOR TRAIN DATA
train['FamilySize'] = train['SibSp'] + train['Parch']

train[['FamilySize' , 'Survived']].groupby(['FamilySize'] , as_index=False).mean().sort_values(by='Survived' , ascending=False)


In [None]:
train['IsAlone'] = train['FamilySize'].apply(lambda x: 0 if x>0 else 1)

train[['IsAlone' , 'Survived']].groupby(['IsAlone'] , as_index=False).mean().sort_values(by='Survived' , ascending=False)

In [None]:
train.head()

In [None]:
#FOR TEST DATA

test['FamilySize'] = test['SibSp'] + test['Parch']
test['IsAlone'] = test['FamilySize'].apply(lambda x: 0 if x>0 else 1)

test.head()
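
Because the same two columns are derived for both train and test, the steps can be wrapped in one function so the two frames cannot drift apart. The sketch below uses a made-up frame; `add_family_features` is a hypothetical helper name. It keeps the notebook's definition, in which FamilySize counts relatives only and not the passenger.

```python
import pandas as pd

def add_family_features(df):
    """Return a copy of df with FamilySize and IsAlone columns added."""
    out = df.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"]
    # Vectorized equivalent of the lambda: 1 when travelling with no relatives.
    out["IsAlone"] = (out["FamilySize"] == 0).astype(int)
    return out

# Made-up rows, only to exercise the helper.
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})
toy_fe = add_family_features(toy)
print(toy_fe)
```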

In [None]:
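# Restrict the correlation to numeric columns; Sex and Embarked are still strings here
corr = train.corr(numeric_only=True)
print(corr)
sns.heatmap(corr)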

In [None]:
df= train.copy()
df.head()

In [None]:
# Dropping FamilySize, SibSp and Parch due to their correlation with IsAlone
# (columns= already names the axis, so axis=1 is redundant)
train.drop(columns=['Parch' , 'FamilySize' , 'SibSp'] , inplace=True)
test.drop(columns=['Parch' , 'FamilySize' , 'SibSp'] , inplace=True)

train.head()

# Data Preprocessing

In [None]:
X = train.drop('Survived' , axis=1)
y = train['Survived']

**One Hot Encoding**

In [None]:
X = pd.get_dummies(X , columns=['Sex' , 'Embarked'])
X.head()

In [None]:
#FOR TEST DATA
test = pd.get_dummies(test , columns=['Sex' , 'Embarked'])
test.head()
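
Encoding train and test separately works only if both frames happen to contain the same categories. When they might not, the test frame can be reindexed onto the training columns. The sketch below uses made-up frames in which the test set has a port that training lacks and is missing a category that training has; all names are illustrative.

```python
import pandas as pd

toy_train = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "C"]})
toy_test = pd.DataFrame({"Sex": ["female", "female"], "Embarked": ["Q", "S"]})

train_enc = pd.get_dummies(toy_train, columns=["Sex", "Embarked"])
test_enc = pd.get_dummies(toy_test, columns=["Sex", "Embarked"])

# Force the test frame onto the training columns: dummies missing from test
# are filled with 0, and dummies unseen in training (Embarked_Q) are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_aligned)
```

Dropping an unseen category silently is a design choice; raising an error instead may be preferable when new categories should be investigated.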

**Scaling/Standardizing Data**

In [None]:
sc = StandardScaler()

X[['Age' , 'Fare']] = sc.fit_transform(X[['Age' , 'Fare']])
X.head()

In [None]:
#For TEST DATA

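# Reuse the scaler fitted on the training data; refitting on test would put the two sets on different scales
test[['Age' , 'Fare']] = sc.transform(test[['Age' , 'Fare']])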
test.head()
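
The difference between fitting and reusing a scaler is easiest to see on a few numbers. In this sketch the arrays are made up: the scaler learns its mean and standard deviation from the training rows and then applies those same statistics to the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns stand in for Age and Fare; the values are made up.
train_vals = np.array([[20.0, 10.0], [30.0, 20.0], [40.0, 30.0]])
test_vals = np.array([[30.0, 20.0], [50.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_vals)  # learns mean/std from train
test_scaled = scaler.transform(test_vals)        # reuses them, no refit

print(scaler.mean_)
print(test_scaled)
```

A test row equal to the training mean maps to zero, which would not hold if the scaler were refitted on the test rows.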

**Splitting Data**

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X , y , test_size=0.25 , random_state=42)
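
Since the two classes are not equally frequent, it can help to keep the class ratio identical in both parts of the split. `train_test_split` supports this through its `stratify` argument. The sketch below uses made-up arrays with a 2:1 class ratio; it is an optional variant, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 24 made-up samples: 16 of class 0 and 8 of class 1.
X_demo = np.arange(48).reshape(24, 2)
y_demo = np.array([0] * 16 + [1] * 8)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Both parts keep the original 2:1 ratio.
print(np.bincount(y_tr), np.bincount(y_te))
```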

# Modelling

Check this notebook for parameter tuning and cross-validation: https://www.kaggle.com/kenjee/titanic-project-example/notebook
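
For readers who want a quick look at cross-validation without leaving this page, here is a minimal sketch on synthetic data from `make_classification`. It is not taken from the linked notebook; the dataset, fold count and forest size are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the preprocessed features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Five stratified folds: each sample is used for validation exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`; the spread across folds gives a rough sense of how much a single hold-out score can vary.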

**Tuned Random Forest Classifier**

In [None]:
# param_grid_rfc = {"max_depth": [None],
#                   "max_features": [1, 3, 10],
#                   "min_samples_split": [2, 3, 10],
#                   "min_samples_leaf": [1, 4, 10],
#                   "n_estimators" :[100, 200, 500]}

# grid_rf = GridSearchCV(RandomForestClassifier(), param_grid_rfc, cv=5, scoring="accuracy", verbose=0, n_jobs=-1)

# grid_rf.fit(X_train, y_train)
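
The search above is commented out, presumably because the full grid takes a while to run. The following sketch shows the same `GridSearchCV` pattern on a much smaller grid and on synthetic data, so it finishes quickly. The grid values and dataset here are assumptions for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

# 2 x 2 = 4 candidate settings, each scored with 3-fold cross-validation.
small_grid = {"max_depth": [None, 4], "min_samples_leaf": [1, 4]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    small_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The winning settings can then be passed on as `RandomForestClassifier(**search.best_params_)`, which is what the commented `rfc_params` line in the next cell is set up for.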

In [None]:
# rfc_params = grid_rf.best_params_

rf = RandomForestClassifier()   #(**rfc_params)
rf.fit(X_train, y_train)

rfpred = rf.predict(X_test)

rfscore = accuracy_score(y_test , rfpred)
print('Accuracy Score = ' , rfscore)

In [None]:
from sklearn.metrics import classification_report , roc_curve , auc

In [None]:
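# Use class-1 probabilities rather than hard 0/1 labels so the curve covers more than one threshold
rfproba = rf.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = roc_curve(y_test, rfproba)
rfauc = auc(fpr, tpr)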
plt.figure(figsize=(5,5),dpi=100)
plt.plot(fpr,tpr,linestyle='-',label = "(auc = %0.3f)" % rfauc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
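
An ROC curve is traced by sweeping a threshold over continuous scores. If hard 0/1 predictions are passed in, only one threshold remains and the reported AUC generally changes. The sketch below compares the two on six made-up samples; the scores stand in for the output of something like `predict_proba(...)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # made-up probabilities
y_label = (y_score >= 0.5).astype(int)               # hard 0/1 predictions

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)

# The curve itself is also computed from the scores.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print("AUC from scores:", auc_from_scores)
print("AUC from labels:", auc_from_labels)
```

In this toy case 8 of the 9 positive/negative pairs are ranked correctly by the scores, while thresholding at 0.5 turns three of those pairs into ties.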

In [None]:
def f_importances(model, model_name):
    f_imp = pd.DataFrame({"Feature Importances": model.feature_importances_}, index=X.columns)

    plt.figure(figsize=(12,8))
    sns.barplot(x=f_imp["Feature Importances"], y=f_imp.index)
    plt.title(f"{model_name} Feature Importances", size=15)
    plt.show()

In [None]:
f_importances(rf, "Random Forest")

**SVM rbf Kernel**

In [None]:
svc1 = SVC(kernel='rbf')
svc1.fit(X_train , y_train)

pred1 = svc1.predict(X_test)

score1 = accuracy_score(y_test , pred1)

print('Accuracy Score = ' , score1)

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix(y_test, pred1), annot=True, cmap="Blues", fmt='d')
plt.title("Confusion Matrix of SVM (rbf kernel)", size=15)
plt.show()

In [None]:
report1 = classification_report(y_test , pred1)

print(report1)
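
The numbers in the classification report follow directly from the four cells of the confusion matrix. The sketch below recomputes precision, recall and F1 for the positive class by hand on eight made-up labels and cross-checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up true labels and predictions.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# For binary labels, rows are true classes and columns are predicted classes,
# so ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, f1)
```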

In [None]:
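# Score with the SVM's signed distance to the decision boundary rather than hard 0/1 labels
roc_auc_score(y_test, svc1.decision_function(X_test))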

In [None]:
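# Rank by the SVM's signed distance to the boundary instead of hard 0/1 labels
fpr, tpr, threshold = roc_curve(y_test, svc1.decision_function(X_test))
auc1 = auc(fpr, tpr)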
plt.figure(figsize=(5,5),dpi=100)
plt.plot(fpr,tpr,linestyle='-',label = "(auc = %0.3f)" % auc1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

**2d Plot for rbf kernel SVM**

In [None]:
def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

In [None]:
XX = X_train[['Fare' , 'Age']] # use only Fare and Age so the decision surface can be drawn in 2D

In [None]:
model = SVC(kernel='rbf')
clf = model.fit(XX, y_train)

fig, ax = plt.subplots()
# title for the plots
title = ('Decision surface of rbf SVC')
# Set-up grid for plotting.
X0, X1 = XX.iloc[:, 0], XX.iloc[:, 1]
xx, yy = make_meshgrid(X0, X1)

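plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
scatter = ax.scatter(X0, X1, c=y_train, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_ylabel('Age')
ax.set_xlabel('Fare')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
# Build legend entries from the scatter colors; a bare ax.legend() has no labeled artists to show
ax.legend(*scatter.legend_elements(), title='Survived')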
plt.show()


**SVM Polynomial degree 3 kernel**

In [None]:
svc3 = SVC(kernel='poly' , degree=3)
svc3.fit(X_train , y_train)

pred3 = svc3.predict(X_test)

score3 = accuracy_score(y_test , pred3)

print('Accuracy Score = ' , score3)

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix(y_test, pred3), annot=True, cmap="Blues", fmt='d')
plt.title("Confusion Matrix of SVM (polynomial kernel, degree 3)", size=15)
plt.show()

In [None]:
report3 = classification_report(y_test , pred3)

print(report3)

In [None]:
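# Score with the SVM's signed distance to the decision boundary rather than hard 0/1 labels
roc_auc_score(y_test, svc3.decision_function(X_test))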

In [None]:
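# Rank by the SVM's signed distance to the boundary instead of hard 0/1 labels
fpr, tpr, threshold = roc_curve(y_test, svc3.decision_function(X_test))
auc3 = auc(fpr, tpr)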
plt.figure(figsize=(5,5),dpi=100)
plt.plot(fpr,tpr,linestyle='-',label = "(auc = %0.3f)" % auc3)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

**Plotting 2d Polynomial SVC**

In [None]:
model = SVC(kernel='poly' , degree=3)
clf = model.fit(XX, y_train)

fig, ax = plt.subplots()
# title for the plots
title = ('Decision surface of Polynomial SVC degree 3')
# Set-up grid for plotting.
X0, X1 = XX.iloc[:, 0], XX.iloc[:, 1]
xx, yy = make_meshgrid(X0, X1)

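plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
scatter = ax.scatter(X0, X1, c=y_train, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_ylabel('Age')
ax.set_xlabel('Fare')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
# Build legend entries from the scatter colors; a bare ax.legend() has no labeled artists to show
ax.legend(*scatter.legend_elements(), title='Survived')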
plt.show()


# Testing and Submission

In [None]:
final_svc = SVC(kernel='poly' , degree=3)
final_svc.fit(X , y)

# Select the columns in the same order the model was trained on
final_pred = final_svc.predict(test[X.columns])

In [None]:
test_sub = pd.read_csv('../input/titanic/test.csv')

submission = pd.DataFrame({
        "PassengerId": test_sub["PassengerId"],
        "Survived": final_pred
        })

submission.to_csv('submission.csv', index=False)
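
Before uploading, it can be worth running a few cheap checks on the submission frame. The sketch below is one possible set of checks on a made-up frame; `check_submission` is a hypothetical helper, and the ids are placeholders rather than values read from the competition files.

```python
import pandas as pd

def check_submission(sub, expected_ids):
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(sub.columns) == ["PassengerId", "Survived"], "unexpected columns"
    assert len(sub) == len(expected_ids), "row count differs from test set"
    assert sub["PassengerId"].tolist() == list(expected_ids), "ids out of order"
    assert sub["Survived"].isin([0, 1]).all(), "predictions must be 0 or 1"
    return True

# Placeholder ids and predictions, only to exercise the helper.
toy_ids = [892, 893, 894]
toy_sub = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0, 1, 0]})
ok = check_submission(toy_sub, toy_ids)
print("submission looks valid:", ok)
```

With the notebook's variables the call would be `check_submission(submission, test_sub['PassengerId'])`.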