## Synthanic EDA + Visualization Using Plotly 

![](https://storage.googleapis.com/kaggle-competitions/kaggle/26478/logos/header.png?t=2021-03-29-17-07-0)


<a id="top"></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Table of Contents</h3>
    
* [Part-0: Introduction](#Introduction)
    * [Import libraries, load data](#Import-libraries,-load-data)
* [Part-1: Data Overview](#Data-Overview)
    * [View data table, stats summary](#View-data-table,-stats-summary)
    * [Unique cats and missing values](#Unique-cats-and-missing-values)
    * [Features distribution](#Features-distribution)
    * [Feature correlation](#Feature-correlation)
* [Part-2.1: Survival by Features](#Survival-by-Features)
    * [Survival by Age, Sex, Fare](#Survival-by-Age,-Sex,-Fare)
    * [Survival by family (SibSP, Parch)](#Survival-by-family-(SibSP,-Parch))
    * [Survival by Embarked, Pclass](#Survival-by-Embarked,-Pclass)
* [Part-2.2: Survival Features Interaction](#Survival-Features-Interaction)
    * [Gender-Age survival](#Gender-Age-survival)
    * [Pclass-Age Survival](#Pclass-Age-Survival)
* [References](#References)

## Introduction

Starting in January this year, the Kaggle competitions team has been offering month-long Tabular Playground competitions. The series aims to bridge the gap between InClass competitions and Featured competitions with friendly, approachable datasets.

For April, Kaggle is offering a synthetic dataset that is based on a real dataset and was generated using a CTGAN. This time the features are not anonymized: they are based on the famous Titanic dataset.


#### Data Dictionary
#### Variable: Definition (Key)
- **survival**: Survival (0 = No, 1 = Yes)
- **pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- **sex**: Sex (male, female)
- **Age**: Age in years
- **sibsp**: Number of siblings / spouses aboard the Titanic
- **parch**: Number of parents / children aboard the Titanic
- **ticket**: Ticket number
- **fare**: Passenger fare
- **cabin**: Cabin number
- **embarked**: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


### Import libraries, load data

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import plotly.io as pio
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from sklearn.preprocessing import LabelEncoder
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
pio.templates.default = "none"

template = 'ggplot2'  # alternatives: 'plotly_dark', 'seaborn', 'simple_white', 'plotly'

import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
train = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv', )
test = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/test.csv', )
subm = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/sample_submission.csv')

## Data Overview
### View data table, stats summary
#### Size of the data
- The train and test datasets have the same length (100k samples each) and share 11 features; the train data additionally contains the target column (Survived). Survived is the target variable to be predicted for the test data.

In [None]:
print("Shape of train data is ", format(train.shape))
print("Shape of test data is ", format(test.shape))

In [None]:
train.head()

In [None]:
train.info()

In [None]:
train.describe()

### Unique cats and missing values
#### Unique categories
- Of the features common to the train and test data, *Pclass, Sex, SibSp, Parch* and *Embarked* have the same unique values in both. The remaining features do not have the same unique values in train and test, but this can be expected, as these features are of float (continuous-feature) data type.
- The number of unique values in the *Name* column is not equal to the number of passengers. This might indicate duplicate names or, in rare cases, passengers with identical names. This is true for both train and test data. To be checked later!
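
The two checks described above can be sketched on a toy example. The frames and the helper names `same_categories` and `repeated_names` are made up for illustration and are not part of the notebook; the same calls should apply to the real `train` and `test` tables.

```python
import pandas as pd

# Tiny made-up frames standing in for the competition's train/test tables.
train_demo = pd.DataFrame({
    "Name": ["Smith, John", "Lee, Ann", "Smith, John", "Diaz, Eva"],
    "Pclass": [1, 3, 3, 2],
    "Embarked": ["S", "C", "S", "Q"],
})
test_demo = pd.DataFrame({
    "Name": ["Khan, Ali", "Lee, Ann", "Roy, Tom"],
    "Pclass": [3, 1, 2],
    "Embarked": ["S", "S", "C"],
})

def same_categories(a, b, col):
    """True when both frames hold the same set of non-missing values in `col`."""
    return set(a[col].dropna().unique()) == set(b[col].dropna().unique())

def repeated_names(df, col="Name"):
    """Count occurrences of each name and keep only names seen more than once."""
    counts = df[col].value_counts()
    return counts[counts > 1]

pclass_match = same_categories(train_demo, test_demo, "Pclass")
embarked_match = same_categories(train_demo, test_demo, "Embarked")
dupes = repeated_names(train_demo)

print("Pclass categories match:", pclass_match)
print("Embarked categories match:", embarked_match)
print(dupes)
```

Comparing sets rather than `nunique()` counts also catches the case where both tables have the same number of categories but different values.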

#### Missing values
- Five features (Cabin, Ticket, Age, Embarked and Fare) have missing data.
- Cabin has the largest amount of missing data in both train (67.8%) and test (70.1%), followed by Ticket (train: 4.6%, test: 5.2%) and Age (train: 3.3%, test: 3.1%).
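
A more compact way to build such a missing-value table is sketched below. The helper `missing_summary` and the toy frame are assumptions for illustration only; the notebook itself builds the table with explicit loops.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values, only for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "NA (count)": counts,
        "NA (%)": (100 * counts / len(df)).round(2),
    })
    # Keep affected columns only, largest gap first.
    return summary[summary["NA (count)"] > 0].sort_values("NA (count)", ascending=False)

# Made-up data: Cabin is mostly missing, Age has one gap, Pclass is complete.
demo = pd.DataFrame({
    "Cabin": [np.nan, np.nan, "C85", np.nan],
    "Age": [22.0, np.nan, 35.0, 41.0],
    "Pclass": [3, 1, 1, 2],
})
na_table = missing_summary(demo)
print(na_table)
```

Calling the helper once on `train` and once on `test` would give two tables that can be passed to the bar chart in the same way as `DF1` and `DF2`.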

[back to contents](#top)

In [None]:
display('Unique values in train data')
for col in train.columns:
    print('{} unique values in {}'.format(train[col].nunique(), col))
print('*' * 50)
display('Unique values in test data')
for col in test.columns:
    print('{} unique values in {}'.format(test[col].nunique(), col))

In [None]:
# train_data missing values
null_values_train = []
for col in train.columns:
    if train[col].isna().sum() != 0:
        pct_na = np.round((100 * (train[col].isna().sum())/len(train)), 2)            
        dict1 ={
            'Features' : col,
            'NA_train (count)': train[col].isna().sum(),
            'NA_train (%)': '{}%'.format(pct_na)
        }
        null_values_train.append(dict1)
DF1 = pd.DataFrame(null_values_train, index=None).sort_values(by='NA_train (count)',ascending=False)
#print(DF1)


# test_data missing values
null_values_test = []
for col in test.columns:
    if test[col].isna().sum() != 0:
        pct_na = np.round((100 * (test[col].isna().sum())/len(test)), 2)
        dict2 ={
            'Features' : col,
            'NA_test (count)': test[col].isna().sum(),
            'NA_test (%)': '{}%'.format(pct_na)
        }
        null_values_test.append(dict2)
DF2 = pd.DataFrame(null_values_test, index=None).sort_values(by='NA_test (count)',ascending=False)
#print(DF2)


# barplots
fig = go.Figure(data=[go.Bar(x=DF1['Features'],
                             y=DF1["NA_train (count)"], 
                             text=DF1['NA_train (%)'], 
                             textposition='auto', name='Train', marker_color='lightseagreen'),        

                go.Bar(x=DF2['Features'],
                             y=DF2["NA_test (count)"], 
                             text=DF2['NA_test (%)'], 
                             textposition='auto', name='Test', marker_color='lightsalmon')])
fig.update_traces(marker_line_color='black', marker_line_width=1.5, opacity=1)
fig.update_layout(title_text='Missing values', 
                  #template='plotly_dark',
                  paper_bgcolor='rgb(230, 230, 230)',
                  plot_bgcolor='rgb(230, 230, 230)',
                  width=600, height=300,
                  xaxis_title='Features', yaxis_title='Count',
                  titlefont={'color':'black', 'size': 24, 'family': 'San-Serif'})
fig.show()

### Features distribution

- More passengers survived (57.2%) than not (42.8%). Note that the target is available only in the train data!
- **Pclass 3** has the most passengers (train: 64%, test: 41%), with Pclass 2 having the fewest (train: 29%, test: 9%).
- There are more **male** passengers than **female** passengers (train: 56.1% vs 43.9%, test: 70% vs 30%).
- Most passengers **embarked** at port S (train: 72%, test: 69%), with Q being the least common (train: 5.4%, test: 8.6%). Both train and test data have less than 1% missing data in this feature.
- Passengers with no **parents or children** aboard (Parch 0) are the largest group in both train and test data (train: 73.5%, test: 71.5%).
- Passengers with no **siblings or spouses** aboard (SibSp 0) are the majority (train: 73.5%, test: 62%).
- The average **Age** in the train data is 38.3 years, whereas in the test data it is 30.56 years.
- The average **Fare** in the train and test data is 43.9 and 45.4, respectively.
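
Percentages like the ones listed above can be read off the pie charts, but they can also be computed directly. Below is a minimal sketch on made-up data; the helper `share_table` is hypothetical and not part of the notebook.

```python
import pandas as pd

def share_table(train_col, test_col):
    """Percentage share of each category in train and test, side by side."""
    return pd.DataFrame({
        # dropna=False counts missing values as their own category.
        "train (%)": train_col.value_counts(normalize=True, dropna=False) * 100,
        "test (%)": test_col.value_counts(normalize=True, dropna=False) * 100,
    }).round(1)

# Made-up columns standing in for train['Sex'] and test['Sex'].
train_sex = pd.Series(["male", "female", "male", "male"], name="Sex")
test_sex = pd.Series(["male", "female", "female", "male", "male"], name="Sex")

shares = share_table(train_sex, test_sex)
print(shares)
```

For the numeric features, `train['Age'].mean()` and `test['Age'].mean()` would give the averages in the same spirit.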

[back to contents](#top)

In [None]:
#colors = ['lightgray', 'Rebeccapurple','gold','royalblue','lightseagreen','lightsalmon']
#colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen', 'black', 'Gray']

fig = make_subplots(rows=3, cols=2,
                    specs=[[{'type':'domain'}, {'type':'domain'}],
                           [{'type':'domain'}, {'type':'domain'}], 
                           [{'type':'domain'}, {'type':'domain'}], 
                           ])
fig.add_trace(
    go.Pie(
        labels=train['Sex'],
        values=None,#scalegroup='one',
        hole=.4,
        title='Sex (train)',
        titlefont={'color':'white', 'size': 24},         

        ),
    row=1,col=1
    )
fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=['lightseagreen', 'lightsalmon'], 
        line=dict(color='#000000',
                  width=2)
        )
    )

fig.add_trace(
    go.Pie(
        labels=test['Sex'],
        values=None,#scalegroup='one',
        hole=.4,
        title='Sex (test)',
        titlefont={'color':'white', 'size': 24},
        ),
    row=1,col=2
    )
fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=['lightseagreen', 'lightsalmon'],
        line=dict(color='#000000',
                  width=2)
        )
    )

fig.add_trace(
    go.Pie(
        labels=train['Embarked'],
        values=None,#scalegroup='one',
        hole=.4,
        title='Embarked (train)',
        titlefont={'color':'white', 'size': 24},
        ),
    row=2,col=1
    )
fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=['lightseagreen', 'lightsalmon', 'gray'],
        line=dict(color='#000000',
                  width=2)
        )
    )

fig.add_trace(
    go.Pie(
        labels=test['Embarked'],
        values=None,#scalegroup='one',
        hole=.4,
        title='Embarked (test)',
        titlefont={'color':'white', 'size': 24},
        ),
    row=2,col=2
    )
fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=['lightseagreen', 'lightsalmon', 'gray'],
        line=dict(color='#000000',
                  width=2)
        )
    )

fig.add_trace(
    go.Pie(
        labels=train['Pclass'],
        values=None,#scalegroup='one',
        hole=.4,
        title='Pclass (train)',
        titlefont={'color':'white', 'size': 24},
       ),
    row=3,col=1
    )
fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=['lightseagreen', 'lightsalmon', 'gray'],
        line=dict(color='#000000',
                  width=2)
        )
    )

fig.add_trace(
    go.Pie(
        labels=test['Pclass'],
        values=None,#scalegroup='one',
        hole=.4,
        title='Pclass (test)',
        titlefont={'color':'white', 'size': 24},
       ),
    row=3,col=2
    )
fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=['lightseagreen', 'lightsalmon', 'gray'],
        line=dict(color='#000000',
                  width=2)
        )
    )
fig.layout.update(title="Features Distribution (train/test data)", showlegend=False, height=650, width=600, 
                  template='plotly_dark', titlefont={'color':'white', 'size': 24, 'family': 'San-Serif'}
                 )
fig.show()


In [None]:
fig = make_subplots(rows=2, cols=2)

trace0 = go.Histogram(x=train['Parch'],name='Parch (train)',
                      histnorm='percent', marker_color='seagreen',                                          
                     )
trace1 = go.Histogram(x=test['Parch'],name='Parch (test)',
                      histnorm='percent',marker_color='lightseagreen', opacity=0.5,                     
                      )
trace2 = go.Histogram(x=train['SibSp'],name='SibSp (train)',
                      histnorm='percent',marker_color='salmon',                      
                      )
trace3 = go.Histogram(x=test['SibSp'],name='SibSp (test)', 
                      histnorm='percent', marker_color='lightsalmon',opacity=0.5,   
                      )

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)
fig.append_trace(trace3, 2, 2)

fig.update_layout(title="Parch and SibSp distribution (train-test)", 
                  bargap=0.2,
                  titlefont={'size': 24},
                  font_family ='San Serif',
                  template='plotly_dark',
                  width=600, height=450,
                  legend=dict(
                  orientation="v", y=1.0, yanchor="top", x=1.2, xanchor="right",)                
                  )

fig['layout']['xaxis']['title']='Parch'
fig['layout']['xaxis2']['title']='Parch'
fig['layout']['xaxis3']['title']='SibSp'
fig['layout']['xaxis4']['title']='SibSp'
fig['layout']['yaxis']['title']='%'
fig['layout']['yaxis2']['title']='%'
fig['layout']['yaxis3']['title']='%'
fig['layout']['yaxis4']['title']='%'
fig.show()

In [None]:
fig = go.Figure()

fig.add_trace(go.Histogram(x=train['Age'],
                           name='train', 
                           histnorm='probability density',
                           xbins=dict(
                               start=0,
                               end=100,
                               size=2
                           ),
                           marker_color='lightsalmon',
                           opacity=0.75
                          )
             ) 
fig.add_trace(go.Histogram(x=test['Age'],
                           name='test', 
                           histnorm='probability density',
                           xbins=dict(
                               start=0,
                               end=100,
                               size=2
                           ),
                           marker_color='lightseagreen',
                           opacity=0.75
                          )
             ) 
fig.update_layout(title='Passengers Age Distribution (train-test)',
                  xaxis_title='Age [years]', 
                  yaxis_title='Probability Density [-]',
                  titlefont={'size': 24},
                  font_family = 'San Serif',
                  width=600,height=300,
                  template="plotly_dark",
                  showlegend=True,
                  font=dict(
                      color ='white',
                      ),
                  legend=dict(
                      orientation="v",
                      y=1, 
                      yanchor="top", 
                      x=1.0, 
                      xanchor="right",)   
 )
fig.show()

In [None]:
fig = go.Figure()

fig.add_trace(       
    go.Pie(
        labels=train['Survived'],
        values=None,
        hole=.4,
        title='Survived',
        titlefont={'color':'white', 'size': 24, 'family': 'San Serif'},
       ))

fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=['lightseagreen', 'lightsalmon'],
        line=dict(color='#000000',
                  width=2)
        )
    )
fig.layout.update(title="Passengers Survival (train)", showlegend=False, height=450, width=600,
                  template='plotly_dark', titlefont={'color':'white', 'size': 24, 'family': 'San Serif'}
                 )
fig.show()


In [None]:
train = train.copy()
test = test.copy()

train['Fare'] = train['Fare'].fillna(train['Fare'].mode().iloc[0])
test['Fare'] = test['Fare'].fillna(test['Fare'].mode().iloc[0])


group_labels = ['train', 'test']

fig = ff.create_distplot([train['Fare'], test['Fare']],
                         group_labels, 
                         show_hist=False, 
                         show_rug=False,
                         )

fig.update_layout(title='Fare Paid by Passengers',
                  xaxis_title='Fare', 
                  yaxis_title='Density',
                  titlefont={'size': 24},
                  font_family = 'San Serif',
                  width=700,height=400,
                  template="plotly_dark",
                  showlegend=True,
                  paper_bgcolor="black",
                  font=dict(
                      color ='white',
                      ),
                  legend=dict(
                      orientation="v",
                      y=1, 
                      yanchor="top", 
                      x=1.0, 
                      xanchor="right",)   
 )
fig.show()

### Feature correlation
- **Cramér's V** is used to measure the association between the categorical features and the target variable (Survived).
- For the continuous features, **Pearson's correlation** is used.
- Correlation between the continuous features and the (categorical) target is calculated using the **point-biserial correlation** method.
- **Sex, Embarked and Pclass** seem to have the highest correlation with survival.
- **Name, Ticket, PassengerId and Cabin** are dropped from the correlation analysis. These features need further feature engineering before they can be used for such an analysis.
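
To make the point-biserial coefficient less of a black box, here is a small sketch that computes it from the group means and compares it with Pearson's correlation on a 0/1-coded target. The two coincide mathematically. The helper name `point_biserial_manual` and the numbers are made up for illustration.

```python
import numpy as np

def point_biserial_manual(binary, values):
    """Point-biserial r from group means: (M1 - M0) / s * sqrt(p * (1 - p))."""
    binary = np.asarray(binary)
    values = np.asarray(values, dtype=float)
    m1 = values[binary == 1].mean()   # mean of the group coded 1
    m0 = values[binary == 0].mean()   # mean of the group coded 0
    p = binary.mean()                 # share of the group coded 1
    s = values.std()                  # population standard deviation (ddof=0)
    return (m1 - m0) / s * np.sqrt(p * (1 - p))

# Made-up target and fare values.
survived = np.array([0, 0, 1, 1, 0, 1])
fare = np.array([7.0, 8.0, 30.0, 50.0, 10.0, 25.0])

r_manual = point_biserial_manual(survived, fare)
r_pearson = np.corrcoef(survived, fare)[0, 1]
print(r_manual, r_pearson)
```

A positive value here simply means that the group coded 1 has the higher mean, which is how the sign of the Fare and Age correlations with Survived can be read.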

[back to contents](#top)

In [None]:
## I borrowed this code snippet from https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9
def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x,y)
    chi2 = stats.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1),(rcorr-1)))

categoricals = ['Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch']
print("Cramer's V categorical features correlation with Survival ")
print('**********************************************************')
for cats in categoricals:
    print('Correlation between {} and survival is {:.2f}'.format(cats, cramers_v(train[cats], train['Survived'])))
    

In [None]:
# Point-biserial correlation for categorical-continuous features
# first we need to impute the missing values in Age and Fare
# (NA values are not accepted by the point_biserial function)

train['Age'] = train['Age'].fillna(train['Age'].mode().iloc[0])
train['Fare'] = train['Fare'].fillna(train['Fare'].mode().iloc[0])

def point_biserial(cat):
    a = train['Survived']
    b = train[cat]
    pb = stats.pointbiserialr(a, b)
    return pb

continuous = ['Fare', 'Age', 'SibSp', 'Parch']
print("Point-biserial correlation for categorical-continuous features")
print('***************************************************************')
for conts in continuous:
    print('Correlation between {} and Survival is {:.2f} '.format(conts, point_biserial(conts)[0] ))    

In [None]:
data = train.copy()
data.drop(columns=['PassengerId', 'Ticket', 'Name', 'Sex', 'Pclass', 'Embarked', 'Cabin', 'Survived'], inplace=True)

cat_features = [col for col in data.columns if data[col].dtype=='object']
num_features = [col for col in data.columns if data[col].dtype=='float']

# label encoding

le = LabelEncoder()

le_data = data.copy()

for col in cat_features:
    le_data[col] = le.fit_transform(data[col])
corrdata = le_data

## correlation 

corr = corrdata.corr(method='pearson')

mask1 = np.triu(np.ones_like(corr, dtype=bool))
mask2 = np.tril(np.ones_like(corr, dtype=bool))
corr1=corr.mask(mask1)
corr2=corr.mask(mask2)

fig = go.Figure(data= go.Heatmap(z=corr1,
                  x=corr1.index.values,
                  y=corr1.columns.values,       
                  xgap=3, ygap=3,
                  colorscale='emrld',
                  colorbar_thickness=10,
                  colorbar_ticklen=3,
                   )
                )
fig.update_layout(title_text='Continuous Features Correlation', 
                title_x=0.5,
                font_family="San Serif",
                titlefont={'size': 24},
                width=500, height=500,
                xaxis_showgrid=False,
                yaxis_showgrid=False,
                yaxis_autorange='reversed', 
                paper_bgcolor=None,
                margin=dict(l=70, r=70, t=70, b=70, pad=1),
                template="plotly_dark"    )
fig.show()


## Survival by Features

### Survival by Age, Sex, Fare 
- More female passengers survived than male passengers: 71% of the females survived, compared to only 29% of the males.
- Survivors tend to be older than those who did not survive.
- On average, survivors paid a higher fare than non-survivors: 59 units versus 32.6 units.
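
Figures of this kind can be cross-checked with a couple of `groupby` calls. The sketch below uses a made-up frame with the same column names as the train data; the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows with the same column names as the train data.
demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Fare": [10.0, 80.0, 40.0, 20.0, 30.0, 60.0],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Survived is 0/1, so its mean within a group is that group's survival rate.
survival_rate_by_sex = demo.groupby("Sex")["Survived"].mean() * 100

# Average fare paid by non-survivors (0) and survivors (1).
mean_fare_by_outcome = demo.groupby("Survived")["Fare"].mean()

print(survival_rate_by_sex.round(1))
print(mean_fare_by_outcome)
```

Replacing `"Fare"` with `"Age"` in the second call would give the average age of survivors and non-survivors.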

In [None]:
df = train
fig = px.histogram(df, x="Survived", y=None, color="Sex",
                width=600,height=350,
                histnorm='percent',
                color_discrete_map={ 
                    "male": "RebeccaPurple", "female": "lightsalmon"
                },
                template="plotly_dark"
                )

fig.update_layout(title="Survival by Gender", 
                  font_family="San Serif",
                  bargap=0.2,
                  barmode='group',
                  titlefont={'size': 24},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right",)                 
                  )
fig.show()

In [None]:
Survived0 = train[train['Survived'] == 0]['Age']
Survived1 = train[train['Survived'] == 1]['Age']

fig = go.Figure()

fig.add_trace(go.Violin(x=Survived0, line_color='salmon', name='Survived = 0',))
fig.add_trace(go.Violin(x=Survived1, line_color='gold', name= 'Survived = 1', ))


fig.update_traces(orientation='h', side='positive', width=3, points=False, meanline_visible=True,)
fig.update_layout(xaxis_showgrid=True, xaxis_zeroline=False)

fig.update_layout(title='Survival-Age distn.',
                  xaxis_title='Age',
                  font_family="San Serif",
                  width=600,height=350,
    template="plotly_dark",
    showlegend=False,
    titlefont={'size': 24},
    paper_bgcolor="black",
    font=dict(
        color ='white', 
    )
 )

fig.show()

In [None]:
Survived0_fare = train[train['Survived'] == 0]['Fare']
Survived1_fare = train[train['Survived'] == 1]['Fare']

fig = go.Figure()
fig.add_trace(go.Violin(x=Survived0_fare, line_color='salmon', name='Non-Survivors'))
fig.add_trace(go.Violin(x=Survived1_fare, line_color='seagreen', name='Survivors'))


fig.update_traces(orientation='h', side='positive', width=3, points=False, meanline_visible=True)
fig.update_layout(xaxis_showgrid=True, xaxis_zeroline=False)

fig.update_layout(title="Survival-Fare distn.",
                  font_family="San Serif",
                  xaxis_title='Fare',
                  width=600,
                  height=300,
                  template="plotly_dark",
                  titlefont={'size': 24},
                  showlegend=False,
                  paper_bgcolor="black",
                  font=dict(
                      color ='white',
                      )
                  )

fig.show()

### Survival by family (SibSP, Parch)


In [None]:
df = train

fig = px.histogram(df, x="SibSp", y=None, color="Survived",
                width=600,height=300,
                histnorm='percent',
                color_discrete_map={ 
                    1: "RebeccaPurple", 0: "lightsalmon"
                },
                template="plotly_dark"
                )

fig.update_layout(title="SibSp-survival", 
                  font_family="San Serif",
                  barmode='group',
                  bargap=0.2,
                  titlefont={'size': 24},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 )
fig.show()
fig = px.histogram(df, x="Parch", y=None, color="Survived",
                width=600,height=300,
                histnorm='percent',
                color_discrete_map={ 
                    1: "RebeccaPurple", 0: "lightsalmon"
                },
                template="plotly_dark"
                )

fig.update_layout(title="Parch-survival", 
                  font_family="San Serif",
                  barmode='group',
                  bargap=0.2,
                  titlefont={'size': 24},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 )
fig.show()


### Survival by Embarked, Pclass
- There are more male passengers in Pclass 3, but more female than male passengers in Pclass 1 and 2.
- More female passengers embarked at C and Q, but more male passengers embarked at S.

- More Pclass 3 (54%) passengers survived than Pclass 2 and Pclass 1.
- Passengers who embarked at S had the highest survival rate (85.7%).
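
One caveat when reading survival rates off the histograms: as far as I can tell, `histnorm='percent'` normalizes each trace separately, and `px.histogram` draws one trace per `color` value. The bars therefore show how each survival group is spread over the categories, not the survival rate within a category. A row-normalized crosstab gives the latter directly. The frame below is made up for illustration.

```python
import pandas as pd

# Made-up rows with the same column names as the train data.
demo = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Survived": [0, 0, 0, 1, 1, 1, 0, 1],
})

# normalize="index" makes every row (port) sum to 100%,
# so column 1 is the survival rate per port.
rates = pd.crosstab(demo["Embarked"], demo["Survived"], normalize="index") * 100
print(rates)
```

The same call with `demo["Pclass"]` in place of `demo["Embarked"]` would give the survival rate per ticket class.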

[back to contents](#top)

In [None]:
df = train
fig = px.histogram(df, x="Embarked",
                   y=None, color="Survived",                   
                   width=600,height=300,
                   histnorm='percent',
                   color_discrete_map={
                       1: "RebeccaPurple", 0: "lightsalmon"
                       },
                   template="plotly_dark"
                  )       

fig.update_layout(title="Embarked-survival", 
                  font_family="San Serif",
                  bargap=0.2,
                  barmode='group',
                  titlefont={'size': 24},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 )
fig.show()

fig = px.histogram(df, x="Pclass", y=None, color="Survived",
                width=600,height=300,
                histnorm='percent',
                color_discrete_map={ 
                    1: "RebeccaPurple", 0: "lightsalmon"
                },
                template="plotly_dark"
                )

fig.update_layout(title="Pclass-survival", 
                  font_family="San Serif",
                  bargap=0.2,
                  barmode='group',
                  titlefont={'size': 24},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 )
fig.show()

## Survival Features Interaction
### Gender-Age survival 
- Male passengers had an average age of 37 years and female passengers 42 years.
- Passengers who survived are slightly older than those who didn't.

### Pclass-Age Survival
- Generally, the age distribution follows Pclass 3 < Pclass 2 < Pclass 1. The exception is that Pclass 3 survivors tend to be younger than non-survivors.

### Pclass-Sex Survival
- As is the case with overall sex survival rate, in all Pclasses female passengers had the highest survival rate.
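
The interaction summaries above can also be tabulated numerically. Here is a minimal sketch on made-up data, using a pivot table for the Pclass-Sex survival rate and a two-level `groupby` for the Pclass-Age comparison.

```python
import pandas as pd

# Made-up rows with the same column names as the train data.
demo = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male", "female", "female", "male", "male"],
    "Age":      [40.0, 50.0, 45.0, 55.0, 20.0, 30.0, 25.0, 35.0],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survival rate (%) for every Pclass-Sex combination.
rate = demo.pivot_table(index="Pclass", columns="Sex",
                        values="Survived", aggfunc="mean") * 100

# Mean age of non-survivors (0) and survivors (1) within each Pclass.
mean_age = demo.groupby(["Pclass", "Survived"])["Age"].mean()

print(rate)
print(mean_age)
```

Each cell of `rate` is the mean of a 0/1 column, so it can be compared directly across classes and sexes.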

[back to contents](#top)

In [None]:
female_age = train[train['Sex'] == 'female']['Age']
male_age = train[train['Sex'] == 'male']['Age']

female_age_1 = train[(train['Sex'] == 'female') & (train['Survived'] == 1)]['Age']
female_age_0 = train[(train['Sex'] == 'female') & (train['Survived'] == 0)]['Age']

male_age_1 = train[(train['Sex'] == 'male') & (train['Survived'] == 1)]['Age']
male_age_0 = train[(train['Sex'] == 'male') & (train['Survived'] == 0)]['Age']


fig = go.Figure()

fig.add_trace(go.Box(x=male_age, line_color="RebeccaPurple", name='male'))
fig.add_trace(go.Box(x=male_age_1, line_color='darkturquoise', name= 'male_survived'))
fig.add_trace(go.Box(x=male_age_0, line_color='darkgray', name= 'male_not_survived'))
fig.add_trace(go.Box(x=female_age, line_color='salmon', name= 'female'))
fig.add_trace(go.Box(x=female_age_1, line_color='lightsalmon', name= 'female_survived'))
fig.add_trace(go.Box(x=female_age_0, line_color='gray', name= 'female_not_survived'))

fig.update_layout(xaxis_showgrid=True, xaxis_zeroline=False)

fig.update_layout(title='Gender-Age Survival',
                  font_family="San Serif",
                  xaxis_title='Age',
                  width=600,height=400,
                  template="plotly_dark",
                  showlegend=False,
                  titlefont={'size': 24},
                  paper_bgcolor="black",
                  font=dict(
                      color ='white',
                      )
                  )
fig.show()

In [None]:
PClass1_1 = train[(train['Pclass'] == 1) &(train['Survived'] == 1)]['Age']
PClass1_0 = train[(train['Pclass'] == 1) &(train['Survived'] == 0)]['Age']
PClass2_1 = train[(train['Pclass'] == 2) &(train['Survived'] == 1)]['Age']
PClass2_0 = train[(train['Pclass'] == 2) &(train['Survived'] == 0)]['Age']
PClass3_1 = train[(train['Pclass'] == 3) &(train['Survived'] == 1)]['Age']
PClass3_0 = train[(train['Pclass'] == 3) &(train['Survived'] == 0)]['Age']


fig = go.Figure()

fig.add_trace(go.Violin(x=PClass1_1, line_color='salmon', name='PClass1_[1]', ))
fig.add_trace(go.Violin(x=PClass1_0, line_color='lightsalmon', name= 'PClass1_[0]', ))
fig.add_trace(go.Violin(x=PClass2_1, line_color='seagreen', name='PClass2_[1]', ))
fig.add_trace(go.Violin(x=PClass2_0, line_color='lightseagreen', name='PClass2_[0]', ))
fig.add_trace(go.Violin(x=PClass3_1, line_color='gold', name= 'PClass3_[1]', ))
fig.add_trace(go.Violin(x=PClass3_0, line_color='silver', name='PClass3_[0]', ))

fig.update_traces(orientation='h', side='positive', width=3,
                  bandwidth = None, points=False, meanline_visible=True, scalemode='count')
fig.update_layout(xaxis_showgrid=True, xaxis_zeroline=False)

fig.update_layout(title='Pclass-Age Survival',
                  font_family="San Serif",
                  xaxis_title='Age',
                  width=600,height=400,
                  template="plotly_dark",
                  showlegend=False,
                  titlefont={'size': 24},
                  paper_bgcolor="black",
                  font=dict(
                      color ='white',
                      )
                  )
fig.show()

In [None]:
PClass1_1 = train[(train['Pclass'] == 1) &(train['Survived'] == 1)]['Sex']
PClass1_0 = train[(train['Pclass'] == 1) &(train['Survived'] == 0)]['Sex']
PClass2_1 = train[(train['Pclass'] == 2) &(train['Survived'] == 1)]['Sex']
PClass2_0 = train[(train['Pclass'] == 2) &(train['Survived'] == 0)]['Sex']
PClass3_1 = train[(train['Pclass'] == 3) &(train['Survived'] == 1)]['Sex']
PClass3_0 = train[(train['Pclass'] == 3) &(train['Survived'] == 0)]['Sex']


fig = go.Figure()

fig.add_trace(go.Histogram(y=PClass1_1, marker_color='darkkhaki', histnorm='percent', name='PClass1_[1]', ))
fig.add_trace(go.Histogram(y=PClass1_0, marker_color='paleturquoise', histnorm='percent',name= 'PClass1_[0]', ))
fig.add_trace(go.Histogram(y=PClass2_1, marker_color='lightsalmon', histnorm='percent',name='PClass2_[1]', ))
fig.add_trace(go.Histogram(y=PClass2_0, marker_color='salmon', histnorm='percent',name='PClass2_[0]', ))
fig.add_trace(go.Histogram(y=PClass3_1, marker_color='lightseagreen', histnorm='percent',name= 'PClass3_[1]', ))
fig.add_trace(go.Histogram(y=PClass3_0, marker_color='seagreen', histnorm='percent',name='PClass3_[0]', ))

fig.update_traces(orientation='h')
fig.update_layout(xaxis_showgrid=True, xaxis_zeroline=False)

fig.update_layout(title='Pclass-Sex Survival',
                  font_family="San Serif",
                  xaxis_title='Survival [%]',
                  width=600,height=400,
                  template="plotly_dark",
                  showlegend=True,
                  titlefont={'size': 24},
                  paper_bgcolor="black",
                  font=dict(
                      color ='white',
                      )
                  )
fig.show()

[back to contents](#top)

## References 
- @subinium did great visualization work by customizing the matplotlib library in a [dark theme](https://www.kaggle.com/subinium/dark-mode-visualization-apple-version/comments), and I found it quite beautiful. As a result, I wanted to try a dark theme myself. However, this notebook is based on plotly's own dark theme.

- The code for Cramer's V correlation comes from [this](https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9) article.

- I have adapted figures from my other [EDA notebook](https://www.kaggle.com/desalegngeb/students-performance-practice-eda-with-plotly), where I used plotly on students performance dataset.


Thank you so much for reading my notebook! If you have any feedback, please drop it in the comments.