# Synthanic: A Synthetically Created Dataset from Titanic

# Exploratory Data Analysis for Beginners

- Image: A Titanic ship built from Lego, sinking in a swimming pool. Synthanic enough?

![https://i.ytimg.com/vi/cFhP-TN4EG4/maxresdefault.jpg](https://i.ytimg.com/vi/cFhP-TN4EG4/maxresdefault.jpg)

# Import Libraries

In [None]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
import statistics

# visualization
import seaborn as sns
sns.set(style='white', context='notebook', palette='deep')
import matplotlib.pyplot as plt
%matplotlib inline


# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve


from collections import Counter
import warnings
warnings.filterwarnings("ignore")

# Read the files 

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-apr-2021/sample_submission.csv')

In [None]:
display("train data", train_df)
display("test data", test_df)

# What do you observe?

The target variable "Survived" is not present in the test data.

The test data and the training data each have 100,000 rows.

Some columns are numerical and some are categorical.

We can pass include='all' to describe to see every column. Note that std and the percentile values are not computed for categorical columns, so they show up as NaN.


In [None]:
train_df.describe(include='all')



The lower counts indicate that some columns, such as Age, Ticket, Cabin and Embarked, have missing values.
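Here is a minimal sketch of how the count row reveals missing values. The `toy_df` frame and its values are made up for illustration and are not taken from the competition files.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from the competition files.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, 8.05, 26.55],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

summary = toy_df.describe(include="all")

# 'count' only tallies non-missing entries, so a count below len(toy_df)
# flags a column with gaps.
missing_from_count = len(toy_df) - summary.loc["count"].astype(int)
print(missing_from_count)

# Numeric-only statistics such as std are NaN for the text columns.
print(summary.loc[["std", "50%"], ["Sex", "Embarked"]])
```

The same subtraction on `train_df.describe(include='all')` should give the per-column number of missing entries.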

Some columns shown as numerical, such as Pclass, are actually ordinal and may be treated as categorical.

Which columns are continuous, discrete, ordinal and categorical?

Continuous: Age, Fare.

Discrete: SibSp, Parch.

Ordinal: Pclass.

Categorical: Survived, Sex, Ticket, Cabin and Embarked.
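One possible way to treat an ordinal column such as Pclass as an ordered categorical is sketched below. The `toy_df` values are made up, and the ranking of class 1 as "highest" is an assumption for this example.

```python
import pandas as pd

# Made-up ticket classes for illustration only.
toy_df = pd.DataFrame({"Pclass": [3, 1, 2, 3, 1]})

# Assumption: class 1 is treated as the highest rank, so the order is 3 < 2 < 1.
pclass_type = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
toy_df["Pclass"] = toy_df["Pclass"].astype(pclass_type)

print(toy_df["Pclass"].dtype)

# With an ordered categorical, min/max follow the declared order,
# not the numeric value of the labels.
lowest_class = toy_df["Pclass"].min()
highest_class = toy_df["Pclass"].max()
print(lowest_class, highest_class)
```

Once the column is categorical, `describe(include='all')` reports count, unique, top and freq for it instead of a mean and std.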

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum() # even test data has missing values
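Raw counts are hard to judge on their own, so it can help to express them as a share of all rows. A small sketch on made-up data (the `toy_df` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# isnull() yields booleans; their column mean is the fraction of missing entries.
missing_pct = (toy_df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Applying the same expression to `train_df` and `test_df` should list the columns with the largest gaps first.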

Which fields are relevant?

PassengerId, like most IDs, carries little predictive value and may be discarded.

Is Pclass (ticket class) a useful feature, or can it be discarded?

In [None]:
print(train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False))

g = sns.catplot(x="Pclass",y="Survived",data=train_df,kind="bar")
g = g.set_ylabels("survival probability")

First-class passengers, i.e. the wealthier ones, have a higher probability of survival!

What about gender and age?

In [None]:
# Wrap in display(): only the last expression of a cell is shown automatically
display(train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False))

g = sns.catplot(x="Sex",y="Survived",data=train_df,kind="bar")
g = g.set_ylabels("survival probability")

In [None]:
# Explore the Age distribution by survival outcome
# (fill= replaces the deprecated shade= argument; needs seaborn >= 0.11)
g = sns.kdeplot(train_df["Age"][(train_df["Survived"] == 0) & (train_df["Age"].notnull())], color="Red", fill=True)
g = sns.kdeplot(train_df["Age"][(train_df["Survived"] == 1) & (train_df["Age"].notnull())], ax=g, color="Blue", fill=True)
g.set_xlabel("Age")
g.set_ylabel("Density")  # a KDE shows estimated density, not raw frequency
g = g.legend(["Not Survived", "Survived"])

We observe that women have a higher probability of survival.

We also notice that children and the elderly survive more often, while young and middle-aged adults are less likely to survive.
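The age pattern read off the density plot can also be checked numerically by binning Age and comparing survival rates per bin. This is only a sketch: the bin edges, labels and `toy_df` values are made up and say nothing about the real data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "Age": [4, 9, 25, 31, 38, 47, 66, 72],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 0],
})

# Hypothetical bin edges and labels; choose ones that suit the question.
bins = [0, 12, 30, 60, 120]
labels = ["child", "young adult", "middle-aged", "senior"]
toy_df["AgeGroup"] = pd.cut(toy_df["Age"], bins=bins, labels=labels)

# Mean of a 0/1 column is the share of survivors in each group.
rate_by_group = toy_df.groupby("AgeGroup", observed=False)["Survived"].mean()
print(rate_by_group)
```

Rows with a missing Age get no AgeGroup and drop out of the groupby, which matches the `notnull()` filter used for the plot.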

What about the number of siblings and spouses aboard (SibSp)? Does it have an impact?


In [None]:
# Explore SibSp feature vs Survived
display(train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False))

g = sns.catplot(x="SibSp",y="Survived",data=train_df,kind="bar", height = 6 , 
palette = "muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

The number of siblings does not appear to be a distinguishing factor for survival rate. Most values sit around 40%, except for 5 siblings, which is likely just a statistical anomaly.
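One way to check whether an outlying rate is just noise is to look at the group size next to the rate, since a rate computed from a handful of passengers is unreliable. A sketch on made-up data (the `toy_df` values are hypothetical):

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "SibSp": [0, 0, 0, 0, 1, 1, 1, 5],
    "Survived": [0, 1, 0, 1, 1, 0, 0, 1],
})

# Report the survival rate together with the number of rows behind it.
summary = (
    toy_df.groupby("SibSp")["Survived"]
    .agg(rate="mean", n="count")
    .reset_index()
)
print(summary)
```

In this toy frame the SibSp = 5 group has a perfect survival rate, but it rests on a single row, so it should not be read as a real effect.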


Let's look at more charts and explore what they tell us.

In [None]:
# Correlation matrix between numerical values (SibSp Parch Age and Fare values) and Survived 
g = sns.heatmap(train_df[["Survived","SibSp","Parch","Age","Fare"]].corr(),
                annot=True, fmt = ".2f", cmap = "coolwarm")

In [None]:
# Explore Parch feature vs Survived
# (factorplot was renamed to catplot and later removed from seaborn)
g = sns.catplot(x="Parch", y="Survived", data=train_df, kind="bar", height=6,
                palette="muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

In [None]:
# Explore Age vs Sex, Parch, Pclass and SibSp
g = sns.catplot(y="Age",x="Sex",data=train_df,kind="box")
g = sns.catplot(y="Age",x="Sex",hue="Pclass", data=train_df,kind="box")
g = sns.catplot(y="Age",x="Parch", data=train_df,kind="box")
g = sns.catplot(y="Age",x="SibSp", data=train_df,kind="box")
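The box plots can be paired with a numeric summary of Age per group, which makes the medians easier to compare. A minimal sketch on made-up data (the `toy_df` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Age": [45.0, 22.0, 38.0, np.nan, 30.0, 52.0],
})

# count ignores missing ages; median is the line drawn inside each box.
age_summary = toy_df.groupby(["Sex", "Pclass"])["Age"].agg(["count", "median"])
print(age_summary)
```

A group whose ages are all missing shows a count of 0 and a NaN median, which is worth noticing before trusting the corresponding box in the plot.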