# <p style="font-family:Papyrus;color:Orange;font-size:1.em"> Titanic Disaster EDA, Visualization and Survival Prediction using SVM </p>

## Introduction :
<p style="color: indigo">
This kernel follows a typical data science workflow to build a model that predicts the survival of passengers on board the Titanic. If you like the work here, please upvote. If you have suggestions to make it better, I would love to read them in the comments. Thanks.
    
</p>



# Objective Definition


**1. Data Acquisition of the Titanic Data Set :** We will read and understand the available data, and validate the high-level data quality of the source.

**2. Data Wrangling :** Clean data where applicable. Handle any missing values. Build Integrated views based on associations. Build Summaries if needed.

**3. Exploratory Analysis :** Here we will perform a detailed analysis to explore hidden patterns, correlations etc., using visualizations where needed.

**4. Feature Engineering :** This is where we will identify Features to develop the model, check and prepare if any Derived Features are needed.

**5. Model Preparation :** Build the model using the defined features. Since the target is whether a passenger survived or not, this is a binary classification problem. We will restrict ourselves to two algorithms: Support Vector Machine (SVM) and Logistic Regression, which despite its name is also a classification algorithm.

**6. Model Evaluation :** We will evaluate both models and compare the accuracy scores obtained from each. Where applicable, we will use visualization to compare the results. Based on the evaluation, we will pick the better-scoring model.

**7. Predict :** In the final stage, we will run the chosen model to generate predictions.

# Data Acquisition

In [None]:
# Input data files are available in the read-only "../input/" directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Import necessary packages
# Create any reusable methods to use later

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
def screen_data(df):
    print('-'*40)
    print('list of Columns : ',df.columns.to_list())
    print('-'*40)
    print('Missing Values in the columns : \n')
    print(df.isnull().sum())
    print('-'*40)
    print('Unique Value Counts : \n')
    print(df.nunique())
    print('-'*40)

In [None]:
df_train = pd.read_csv('../input/titanic/train.csv')
screen_data(df_train)


In [None]:
# Random sampling in the data
df_train.sample(3)

In [None]:
df_test = pd.read_csv(  '../input/titanic/test.csv'  )
screen_data(df_test)

Here we see that both `train.csv` and `test.csv` have **missing values in the columns Age and Cabin.**

### To understand the missing values and percentages better, we will run a heat map on the dataframes

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(18,6))
colors = ['lightyellow' , 'red']

f1 = sns.heatmap(df_train[['Age','Cabin','Embarked','Fare']].sort_values('Age').isnull(), cmap=sns.color_palette(colors), ax=ax1)
f1.set_title('Missing Values - Train Data')

f2 = sns.heatmap(df_test[['Age','Cabin','Embarked','Fare']].sort_values('Age').isnull(), cmap=sns.color_palette(colors), ax=ax2)
f2.set_title('Missing Values - Test Data')

plt.show()

The red areas in the above heat maps show the distribution of missing values in the respective data sets.
    
   - We have a large amount of missing data in the `Cabin` column.
   
   - Missing data for `Age` is also considerable, but we can work around it.
   
   - Columns `Embarked` and `Fare` do not have any significant missing data.
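
The heat maps only give a visual impression. A minimal sketch of how the share of missing values per column could be quantified is given here; the helper name `missing_percent` and the small `demo` frame are illustrative assumptions and not part of the original kernel.

```python
import numpy as np
import pandas as pd


def missing_percent(df):
    """Return the percentage of missing values per column, largest first."""
    return (df.isnull().mean() * 100).round(1).sort_values(ascending=False)


# Hypothetical miniature frame standing in for df_train
demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

result = missing_percent(demo)
print(result)
```

On the real data the same helper could be called as `missing_percent(df_train)` and `missing_percent(df_test)`.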
   

# Data Exploration and  Data Wrangling

Exploring the data and wrangling it for the next round of exploration is an iterative process, so we will combine the two stages.

As part of the data exploration stage, we want to explore how the available features relate to the `Survived` outcome of the passengers.

We also want to identify whether there are new features we can derive that would be better indicators of the chance of survival.

During wrangling, our goal is to identify the necessary pre-processing operations and put them into a reusable function that is applied to both the train and test data sets, so that both follow the same standards when passed to the model in later stages.
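
One way to keep both data sets on the same standards is to learn the replacement values from the training data once and reuse them for the test data. The sketch below illustrates that idea under assumptions: the helper names `learn_fill_values` and `apply_fill_values` and the two miniature frames are hypothetical, and this is not necessarily how the function built later in this kernel works.

```python
import numpy as np
import pandas as pd


def learn_fill_values(train):
    """Collect the replacement values from the training data only."""
    return {
        'Age': train['Age'].median(),
        'Embarked': train['Embarked'].mode()[0],
    }


def apply_fill_values(df, fill_values):
    """Return a copy of df with missing values replaced by the learned values."""
    return df.fillna(value=fill_values)


# Hypothetical miniature frames standing in for df_train and df_test
train_demo = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0],
                           'Embarked': ['S', 'S', 'C', None]})
test_demo = pd.DataFrame({'Age': [np.nan, 50.0],
                          'Embarked': ['Q', None]})

fill_values = learn_fill_values(train_demo)
train_clean = apply_fill_values(train_demo, fill_values)
test_clean = apply_fill_values(test_demo, fill_values)
print(test_clean)
```

Because the helpers return copies, the original frames stay untouched and the same `fill_values` dictionary can be applied to any later data set.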

In [None]:
df_train.head(3)

In [None]:
df_train[['PassengerId','Age']].groupby('Age').count().reset_index().rename(columns={'PassengerId' : 'Cnt'}).sort_values('Cnt', ascending=False).head(5)

In [None]:
df_train['Age_range'] = pd.cut(df_train['Age'], 10, precision=0)

fig , ax = plt.subplots(1,1, figsize=(14,5))

z = sns.barplot(data = df_train[['Survived','Age_range']] , x='Age_range' , y='Survived',  ax = ax, palette=sns.color_palette('pastel'))
z.set_title('Age comparison for Survival')
plt.show()

df_train.drop('Age_range', axis = 1, inplace  = True)


#### Age data is easier to interpret when grouped into range bins.
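
As a small illustration of the binning idea, the sketch below groups ages with explicit bin edges and computes the survival rate per group. The edges, the labels `child`, `adult`, `senior` and the `demo` frame are assumptions chosen for the example, not values taken from the data set.

```python
import pandas as pd

# Hypothetical miniature sample standing in for df_train
demo = pd.DataFrame({
    'Age': [4, 9, 25, 31, 38, 62, 70],
    'Survived': [1, 1, 0, 1, 0, 0, 0],
})

# Explicit bin edges keep the buckets identical no matter which rows are passed in
age_edges = [0, 12, 40, 80]
age_labels = ['child', 'adult', 'senior']
demo['Age_group'] = pd.cut(demo['Age'], bins=age_edges, labels=age_labels)

# The mean of the 0/1 Survived flag per bucket is the survival rate of that bucket
rate_by_group = demo.groupby('Age_group', observed=False)['Survived'].mean()
print(rate_by_group)
```

Passing a number of bins instead of edges, as done in the cell above, lets pandas derive the edges from the data, so the buckets can shift when the data changes.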

#### Ticket values are basically ticket identifiers, so they can be removed from the analysis, as a ticket ID is unlikely to have any bearing on the survival rate.

In [None]:

s1 = sns.barplot(data = df_train, y='Survived' ,  hue='Sex' , x='Sex')
s1.set_title('Male-Female Survival Comparison')
plt.show()

#### Female passengers had higher Survival rate compared to Male passengers.

In [None]:
# Finding Titles in the names

import re
from collections import Counter


def check_title(x):
    # Raw string avoids the invalid "\." escape warning; the title is the word followed by a period
    return re.search(r' ([A-Za-z]+)\.', x).group(1)

Counter(df_train['Name'].map(check_title).to_list())

#### We will explore whether the Title has any correlation with survival

In [None]:
df_train['Title'] = df_train['Name'].map(check_title)

fig , ax = plt.subplots(1,1, figsize=(16,6))
bar = sns.barplot(data = df_train[['Survived' , 'Title']] , y='Title' , x='Survived',  orient='h', ax=ax, palette=sns.color_palette('Blues'))
# bar = sns.swarmplot(data = df_train[['Survived' , 'Title']] , x='Title' , y='Survived', ax=ax)
bar.set_title ('Survival Comparison for Passengers with Titles')
plt.show()

df_train.drop('Title', axis=1, inplace=True)

#### Some titles such as Lady, Sir and Countess show a clear inclination towards survival, although these titles are rare. To capture this in our prediction, we will add a new derived feature for 'Title' to our data set.
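
Since titles like these occur only a few times, a common option is to merge infrequent titles into a single bucket before encoding them. The sketch below shows one possible way to do that; the helper names, the `'Rare'` label and the `min_count` threshold are assumptions for illustration and are not used elsewhere in this kernel.

```python
import re
from collections import Counter

TITLE_PATTERN = re.compile(r' ([A-Za-z]+)\.')


def extract_title(name):
    """Return the title found in a passenger name, or 'Unknown' if none matches."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else 'Unknown'


def group_rare_titles(names, min_count=2):
    """Map each name to its title; titles seen fewer than min_count times become 'Rare'."""
    titles = [extract_title(name) for name in names]
    counts = Counter(titles)
    return [title if counts[title] >= min_count else 'Rare' for title in titles]


# Sample names in the "Surname, Title. Given names" layout of the Name column
names = [
    'Braund, Mr. Owen Harris',
    'Allen, Mr. William Henry',
    'Heikkinen, Miss. Laina',
    'Sandstrom, Miss. Marguerite Rut',
    'Duff Gordon, Sir. Cosmo Edmund',
]
grouped = group_rare_titles(names)
print(grouped)
```

On the full data a larger threshold would probably make more sense; the value 2 is only chosen so that the five sample names show the effect.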

In [None]:
# df_train['Fare_range'] = pd.cut(df_train['Fare'], 10, precision=0)

fig , ax = plt.subplots(1,1, figsize=(16,4))
bar = sns.violinplot(data = df_train[['Survived' , 'Fare']] , y='Fare' , x='Survived', ax=ax)
bar.set_title ('Survival based on Fare')
plt.show()

### Exploring multiple features together

In [None]:
sns.pairplot(df_train, hue='Survived')

We can observe in the above visuals that the survival rate is :
    
   1. better for certain Pclass values.
   2. better for certain Age ranges.
   3. not clearly tied to Parch values.
   4. better for very high Fare values.
   
For other features such as PassengerId, the survival rate does not concentrate around any particular values.
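
Visual impressions like these can be backed by numbers. As a rough sketch, a pivot table gives the survival rate for each combination of two features; the `demo` frame below is a hypothetical stand-in for `df_train` and its values are made up for the example.

```python
import pandas as pd

# Hypothetical miniature sample standing in for df_train
demo = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Sex':      ['female', 'male', 'female', 'male', 'female', 'male', 'male', 'female'],
    'Survived': [1, 0, 1, 0, 1, 0, 0, 0],
})

# The mean of the 0/1 Survived flag per cell is the survival rate of that group
rates = demo.pivot_table(index='Pclass', columns='Sex', values='Survived', aggfunc='mean')
print(rates)
```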

### Checking Embarked for Survival

In [None]:
grid = sns.FacetGrid(df_train, row='Survived', col='Embarked', height=2, aspect=2, palette=sns.color_palette('ocean'))
grid.map(plt.hist, 'Embarked', alpha=.5, bins=50)
grid.add_legend()
plt.show()

# Feature Engineering

### Building a Reusable Function to apply on both Train and Test Data set

In [None]:
def feature_process(df):
    # Fill missing values; assign back instead of using chained inplace calls
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # mode() returns a Series, take its first value
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())  # Fix missing values in Fare
    df['Age'] = df['Age'].fillna(df['Age'].median())  # Put median value for Age

    # Cabin is mostly missing and Ticket is only an identifier, so drop both if present
    df.drop(columns=['Cabin', 'Ticket'], errors='ignore', inplace=True)

    # Bucket Age of the frame being processed (not df_train) into 10 ranges.
    # Fixed edges (0 to 80 in steps of 8) keep the buckets identical for train and test.
    df['Age_cd'] = pd.cut(df['Age'], bins=list(range(0, 81, 8))).cat.codes

    df['Embarked_cd'] = df['Embarked'].astype('category').cat.codes  # new Column

    df['Sex_cd'] = df['Sex'].astype('category').cat.codes  # Change sex to codes

    # Encode titles against the titles seen in the training data so that a code means
    # the same title in both data sets; titles found only in the test data get -1
    title_categories = sorted(df_train['Name'].map(check_title).unique())
    df['Title_cd'] = pd.Categorical(df['Name'].map(check_title), categories=title_categories).codes

    print("Preprocessing on the data complete ..")

#### Running the preprocessing on Train and Test Data set

In [None]:
feature_process(df_train)

In [None]:
feature_process(df_test)

In [None]:
df_train.head(3)

In [None]:
df_train.columns

### Checking Correlation of various Features

In [None]:
h = sns.heatmap(pd.get_dummies(df_train[['Survived', 'Pclass', 'Sex', 'Age_cd']], 
               columns=['Survived', 'Pclass', 'Sex', 'Age_cd']).corr(),
           annot=True,cmap='RdYlGn_r',linewidths=0.2)

fig=plt.gcf()

h.set_title('Correlation on Various key Features in consideration')

fig.set_size_inches([18,10])

plt.show()


# Building the Model


In this stage we will build our models. As identified in the objective section above, we will build two types of models.

1. SVM model

2. Logistic Regression model
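
An SVM with an RBF kernel is sensitive to the scale of its inputs, so features on a large scale (such as Fare) can dominate features on a small scale (such as Pclass). The sketch below shows one possible way to standardize features inside a scikit-learn pipeline. It runs on synthetic data from `make_classification`, and the variable names and `C=1.0` are assumptions for illustration, not the settings used in this kernel.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features (assumption: 8 numeric columns)
X, y = make_classification(n_samples=300, n_features=8, random_state=25)
X[:, 0] *= 100  # put one column on a much larger scale than the others

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=25)

# The scaler is fitted on the training split only and reused when predicting
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, random_state=1))
scaled_svm.fit(X_tr, y_tr)

score = scaled_svm.score(X_te, y_te)
print(f'Accuracy with scaling : {score * 100:.2f} %')
```

Wrapping the scaler and the model in one pipeline means the same transformation is applied automatically wherever the pipeline is asked to predict.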

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
# identified features to be used
features = ['Pclass', 'SibSp', 'Parch', 'Fare',  'Age_cd', 'Sex_cd', 'Embarked_cd', 'Title_cd']


X_train, X_test, y_train, y_test = train_test_split(df_train[features],  df_train['Survived'], test_size=0.3 , random_state=25)

# Check basic setup of train and test
for x , y in enumerate([X_train, X_test, y_train, y_test]):
    print(f'{x+1} :  {y.shape}')  



In [None]:
# Logistic regression
lr = LogisticRegression(max_iter=2000)
lr.fit(X_train, y_train)

In [None]:
# SVM 
svm = SVC(kernel='rbf', C=100 , random_state=1)
svm.fit(X_train, y_train)

# Model Evaluation



In [None]:
def evaluate_model(model):
    print("-"*40, '\n')
    print("Evaluation for Model : ", model)
    print('\n', "-"*40, '\n')
    y_predict = model.predict(X_test)
    acc = accuracy_score(y_test, y_predict)
    print(f'Accuracy score of the model : {acc*100:.2f} %')
    cmat = confusion_matrix(y_test, y_predict)
    # Diagonal / row totals = share of each true class predicted correctly (per-class recall)
    scores = cmat.diagonal() / cmat.sum(axis=1)
    for label, score in zip(['Not Survived', 'Survived'], scores):
        print(f'Per-class accuracy (recall) for - {label} : {score*100:.2f} %')
    print('\n', "-"*40, '\n')
    sns.heatmap(cmat, cmap='Set3', annot=True, fmt='d')  # counts are integers
    plt.title(f'Confusion matrix : {model}', y=1.1, size=20)
    plt.show()

    

In [None]:
evaluate_model(lr)

In [None]:
evaluate_model(svm)

#### The overall accuracy scores of the two models are close. However, the per-class scores show that SVM is slightly better at predicting 'Survived', so we will use SVM for the final prediction.
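
When two models score this close on a single train/test split, cross-validation can give a more stable comparison. The sketch below shows the idea on synthetic data from `make_classification`; the data, the `candidates` dictionary and the choice of 5 folds are assumptions for illustration, and the numbers it prints say nothing about the Titanic models above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features and target
X, y = make_classification(n_samples=300, n_features=8, random_state=25)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=2000),
    'SVM': SVC(kernel='rbf', C=100, random_state=1),
}

cv_scores = {}
for name, model in candidates.items():
    # Each model is trained and scored on 5 different train/validation splits
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    cv_scores[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```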

# Prediction of the Test Set

In [None]:
prediction = svm.predict(df_test[features])
final = pd.DataFrame ({'PassengerId' : df_test['PassengerId'], 'Survived': prediction})
final.to_csv('./submission_svm.csv', index=False)

final.head()


In [None]:
df_result = final.groupby('Survived').count().reset_index().rename(columns = {'PassengerId': 'Passenger Count'})
res = sns.barplot(data=df_result, x = 'Survived' , y='Passenger Count', hue='Survived', palette='cool_r')
res.set_title('Final Results from Prediction')
plt.show()
