# Titanic Dataset
Data Analysis on Titanic Data (Python)
***

![](https://cdn.pixabay.com/photo/2021/03/04/16/32/ship-6068668_1280.png)

# Introduction

# Data Dictionary

* **'Survived':** 	 Survival 	 0 = No, 1 = Yes
* **'Pclass':** 	 Ticket class 	 1 = 1st, 2 = 2nd, 3 = 3rd
* **'Sex':** 	 Sex
* **'Age':**	 Age in years
* **'SibSp':** 	 # of siblings / spouses aboard the Titanic
* **'Parch':** 	 # of parents / children aboard the Titanic
* **'Ticket':** 	 Ticket number
* **'Fare':** 	 Passenger fare
* **'Cabin':** 	 Cabin number
* **'Embarked':** 	 Port of Embarkation 	 C = Cherbourg, Q = Queenstown, S = Southampton

# Import Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
sns.set(style="darkgrid")
import matplotlib.pyplot as plt
#plt.style.use('ggplot')
from matplotlib.pyplot import figure
from scipy import stats


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load and View Data

In [None]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
gender_data = pd.read_csv("/kaggle/input/titanic/gender_submission.csv")

In [None]:
display('Train Data:',train_data.head(), 'Test Data:',test_data.head(), 'Gender Data:',gender_data.head())

In [None]:
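# Merging gender table and test table to dataframe 'gender_test'
# Note: 'Survived' in gender_submission.csv is Kaggle's example prediction
# (all women survive, all men die), not the recorded outcome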
left = gender_data
right = test_data

gender_test = pd.merge(left, right, on=["PassengerId"])
gender_test

In [None]:
# Appending the created dataframe 'gender_test' to the existing dataframe 'train_data'
# ignore_index=True avoids duplicate index labels in the combined dataframe
df = pd.concat([train_data, gender_test], ignore_index=True)
df
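
The merge and the concatenation can be tried out on a tiny example first. This is a minimal sketch on made-up toy frames, not the Kaggle files; the `*_toy` names and the optional arguments `validate="one_to_one"` and `ignore_index=True` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for train.csv, test.csv and gender_submission.csv (made-up values)
train_toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Sex": ["male", "female"]})
test_toy = pd.DataFrame({"PassengerId": [3, 4], "Sex": ["female", "male"]})
gender_toy = pd.DataFrame({"PassengerId": [3, 4], "Survived": [1, 0]})

# validate raises a MergeError if a PassengerId appears more than once on either side
gender_test_toy = pd.merge(gender_toy, test_toy, on="PassengerId", validate="one_to_one")

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined_toy = pd.concat([train_toy, gender_test_toy], ignore_index=True)
print(combined_toy)
```

Columns are aligned by name during the concatenation, so the different column order of the two frames does not matter.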

In [None]:
#Show all column names
df.columns.tolist()

In [None]:
# Overview of columns, range, non-null value counts, memory usage and data types (see also df.dtypes)
df.info()

In [None]:
# Counting rows and columns
col_row = df.shape
print('Columns in dataset:', col_row[1],'\nRows in dataset:', col_row[0])

In [None]:
# Total cells in the dataset
total_cells = np.prod(df.shape)  # np.product was removed in NumPy 2.0
print('Total cells in this dataset:',total_cells)

# Data Cleaning

## Identifying duplicate values

In [None]:
df.duplicated().sum()

In [None]:
# Show every row that belongs to a group of duplicates
df.loc[df.duplicated(keep=False)]
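
The difference between counting duplicates and listing them can be seen on a small example. A minimal sketch with made-up rows (the frame `toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cid"],
    "Age":  [30, 40, 30, 50],
})

# Default keep="first": only the later copies are flagged
n_extra_copies = toy.duplicated().sum()

# keep=False flags every member of a duplicate group, which is handy for inspection
all_copies = toy[toy.duplicated(keep=False)]
print(n_extra_copies)
print(all_copies)
```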

## Identifying missing values

In [None]:
# Find the amount of missing values in each column
missing_values = df.isnull().sum().sort_values(ascending=False)
missing_values

In [None]:
# Calculating the percentage of missing values:

# 1. Total amount of missing data (total_cells was calculated above)
total_missing = missing_values.sum()

# 2. Calculating percent of data that is missing
percent_missing = (total_missing/total_cells) * 100

print("Total missing values: {}  =  {:.2f} %".format(total_missing, percent_missing))

In [None]:
number_missing = df.isnull().sum().sort_values(ascending=False)
pct_column = (df.isnull().sum() / len(df) * 100).round(2).astype('str')+' %'
pct_total = (df.isnull().sum()/df.isna().sum().sum()*100).round(2).astype('str')+' %'
missing_values = pd.concat([number_missing, pct_column, pct_total], axis=1, keys=['Number_Missing_Values', 'PCT_Missing_in_Column','PCT_of_all_Missing'])
   
print('\nMISSING VALUES IN',df.shape[0],'ROWS:')    
all_missing = missing_values.loc[missing_values['Number_Missing_Values'] > 0]
all_missing
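
The same kind of summary can be written a little more compactly. This is a sketch on a made-up toy frame; it relies on the fact that the mean of a boolean mask is the share of `True` values. The column names of `summary` are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: 4 rows, 3 columns, 3 missing cells in total
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", "E46", "B20"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

missing_count = toy.isna().sum()
# The mean of a boolean mask is the fraction of True values, i.e. the share of missing cells
missing_share = toy.isna().mean() * 100

summary = pd.DataFrame({"n_missing": missing_count, "pct_missing": missing_share})
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```

Keeping the percentages numeric (instead of converting them to strings with a ' %' suffix) means the table can still be sorted or filtered by them.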

## Visualisation of missing data

In [None]:
# Detect missing values
missing = df.isnull()

# Visualisation
plt.figure(figsize=(15,5), dpi=100)
sns.heatmap(missing,yticklabels=False, cbar=False, cmap=None)
plt.title('MISSING VALUES', size=17, pad=13)
plt.show()

## Dealing with missing values

**Drop data**
* Drop the whole row
* Drop the whole column

*or*

**Replace data**
* Replace it by mean / median
* Replace it by frequency
* Replace it based on other functions

Which method to choose depends on the composition and correlation of the data and on the task or goal of the analysis.
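
The options listed above can be sketched on a small example. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 60.0],
    "Embarked": ["S", "S", None, "C"],
})

# Drop data: remove every row that has at least one missing value
dropped_rows = toy.dropna(axis=0)

# Drop data: remove every column that has at least one missing value
dropped_cols = toy.dropna(axis=1)

# Replace data: median for a numeric column, most frequent value for a categorical one
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])
print(filled)
```

In this toy frame both columns contain a gap, so dropping columns removes everything, which illustrates why dropping is only attractive when few rows or columns are affected.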

#### Looking at the columns with missing data

***
`Embarked`
***

In [None]:
# Only two values are missing in the 'Embarked' column. Let's check the corresponding rows.
df[pd.isnull(df.Embarked)]

In [None]:
# Checking and counting the values in the 'Embarked' column.
df['Embarked'].value_counts()

In [None]:
# As only two values are missing (0.22%), I decided to replace them with the most frequent value ('S').
df["Embarked"] = df["Embarked"].fillna("S")

In [None]:
# Checking that the missing values have been replaced by 'S'
df['Embarked'].value_counts()

***
`Cabin`
***

As **most of the data in the column "Cabin" is missing** (687 out of 891 values --> **77%**) and I do not need the column for my analysis, I decided to **delete the column**. Deleting the rows instead would remove 687 otherwise useful rows.

In [None]:
# Deleting the column
df.drop("Cabin", axis=1, inplace=True)

In [None]:
# Checking that the column "Cabin" has been dropped from df
df.head()

***
`Age`
***

Looking for the best way to fill the missing data

In [None]:
df.Age.describe().to_frame()

In [None]:
# The histogram shows the age distribution of all passengers
df['Age'].hist(bins=16, color='purple' ,figsize=(16,7))
plt.title('Age Distribution of all passengers', size=17, pad=13)
plt.xlabel('Age')
plt.ylabel('Count')
plt.grid(True)

In [None]:
plt.figure(figsize=(10,6), dpi=75)
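# Use the combined dataframe (with a clean index) so the plot matches the statistics above
sns.boxplot(x="Age", data=df.reset_index(drop=True), color='purple')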
plt.title('Age Boxplot', size=17, pad=13)
plt.show()

In [None]:
# Mean age vs. median
display(df.Age.mean())
display(df.Age.median())

The **average age** of all people on board is **29.7**. The **median** is **28**.
Replacing every missing value with the overall mean or median would be inaccurate. I would like to find out the average age of men and women in each class and check whether there is a difference and whether there are further correlations.

In [None]:
# Grouping by column 'Sex' to see the average age of women and men
grouped_sex_age = df.groupby(['Sex']).Age.agg([len, 'min', 'max', 'mean', 'median'])
grouped_sex_age

In [None]:
# Grouping the 'Pclass' to see if the average age changes from class to class
grouped_class_age = df.groupby(['Pclass']).Age.agg([len, 'min', 'max', 'mean', 'median'])
grouped_class_age

**The average age seems to depend on both sex and class.**

In [None]:
# Checking the (average) age and amount of people for each class and sex
grouped_sex_pclass_age = df.groupby(['Sex', 'Pclass']).Age.agg([len, 'min', 'max', 'mean', 'median'])
grouped_sex_pclass_age

**The average age in each class and sex is very different. There also seems to be a correlation between Age and SibSp. 
In my opinion this needs to be considered when replacing the missing values**.

In [None]:
# Checking the age and amount of people for each sex, class and SibSp
grouped_sex_pclass_sibsp_age = df.groupby(['Sex', 'Pclass', 'SibSp']).Age.agg([len, 'min', 'max', 'mean', 'median'])
grouped_sex_pclass_sibsp_age

In [None]:
fem_p1_s0 =df.loc[(df.Sex == 'female') & (df.Pclass == 1) & (df.SibSp == 0)].Age.median()
fem_p1_s1 =df.loc[(df.Sex == 'female') & (df.Pclass == 1) & (df.SibSp == 1)].Age.median()
fem_p1_s2 =df.loc[(df.Sex == 'female') & (df.Pclass == 1) & (df.SibSp == 2)].Age.median()
fem_p1_s3 =df.loc[(df.Sex == 'female') & (df.Pclass == 1) & (df.SibSp == 3)].Age.median()
fem_p2_s0 =df.loc[(df.Sex == 'female') & (df.Pclass == 2) & (df.SibSp == 0)].Age.median()
fem_p2_s1 =df.loc[(df.Sex == 'female') & (df.Pclass == 2) & (df.SibSp == 1)].Age.median()
fem_p2_s2 =df.loc[(df.Sex == 'female') & (df.Pclass == 2) & (df.SibSp == 2)].Age.median()
fem_p2_s3 =df.loc[(df.Sex == 'female') & (df.Pclass == 2) & (df.SibSp == 3)].Age.median()
fem_p3_s0 =df.loc[(df.Sex == 'female') & (df.Pclass == 3) & (df.SibSp == 0)].Age.median()
fem_p3_s1 =df.loc[(df.Sex == 'female') & (df.Pclass == 3) & (df.SibSp == 1)].Age.median()
fem_p3_s2 =df.loc[(df.Sex == 'female') & (df.Pclass == 3) & (df.SibSp == 2)].Age.median()
fem_p3_s3 =df.loc[(df.Sex == 'female') & (df.Pclass == 3) & (df.SibSp == 3)].Age.median()
fem_p3_s4 =df.loc[(df.Sex == 'female') & (df.Pclass == 3) & (df.SibSp == 4)].Age.median()
fem_p3_s5 =df.loc[(df.Sex == 'female') & (df.Pclass == 3) & (df.SibSp == 5)].Age.median()
fem_p3_s8 =df.loc[(df.Sex == 'female') & (df.Pclass == 3) & (df.SibSp == 8)].Age.median()
male_p1_s0 =df.loc[(df.Sex == 'male') & (df.Pclass == 1) & (df.SibSp == 0)].Age.median()
male_p1_s1 =df.loc[(df.Sex == 'male') & (df.Pclass == 1) & (df.SibSp == 1)].Age.median()
male_p1_s2 =df.loc[(df.Sex == 'male') & (df.Pclass == 1) & (df.SibSp == 2)].Age.median()
male_p1_s3 =df.loc[(df.Sex == 'male') & (df.Pclass == 1) & (df.SibSp == 3)].Age.median()
male_p2_s0 =df.loc[(df.Sex == 'male') & (df.Pclass == 2) & (df.SibSp == 0)].Age.median()
male_p2_s1 =df.loc[(df.Sex == 'male') & (df.Pclass == 2) & (df.SibSp == 1)].Age.median()
male_p2_s2 =df.loc[(df.Sex == 'male') & (df.Pclass == 2) & (df.SibSp == 2)].Age.median()
male_p3_s0 =df.loc[(df.Sex == 'male') & (df.Pclass == 3) & (df.SibSp == 0)].Age.median()
male_p3_s1 =df.loc[(df.Sex == 'male') & (df.Pclass == 3) & (df.SibSp == 1)].Age.median()
male_p3_s2 =df.loc[(df.Sex == 'male') & (df.Pclass == 3) & (df.SibSp == 2)].Age.median()
male_p3_s3 =df.loc[(df.Sex == 'male') & (df.Pclass == 3) & (df.SibSp == 3)].Age.median()
male_p3_s4 =df.loc[(df.Sex == 'male') & (df.Pclass == 3) & (df.SibSp == 4)].Age.median()
male_p3_s5 =df.loc[(df.Sex == 'male') & (df.Pclass == 3) & (df.SibSp == 5)].Age.median()
male_p3_s6 =df.loc[(df.Sex == 'male') & (df.Pclass == 3) & (df.SibSp == 6)].Age.median()
male_p3_s8 =df.loc[(df.Sex == 'male') & (df.Pclass == 3) & (df.SibSp == 8)].Age.median()

In [None]:
# Filling missing values with the median age of each sex / class / SibSp group
def myfunc(age, pclass, sex, SibSp):
    if pd.isnull(age) and pclass==1 and sex == 'female' and SibSp == 0:
        age=fem_p1_s0
    elif pd.isnull(age) and pclass==1 and sex == 'female' and SibSp == 1:
        age=fem_p1_s1
    elif pd.isnull(age) and pclass==1 and sex == 'female' and SibSp == 2:
        age=fem_p1_s2
    elif pd.isnull(age) and pclass==1 and sex == 'female' and SibSp == 3:
        age=fem_p1_s3      
    elif pd.isnull(age) and pclass==2 and sex == 'female' and SibSp == 0:
        age=fem_p2_s0
    elif pd.isnull(age) and pclass==2 and sex == 'female' and SibSp == 1:
        age=fem_p2_s1
    elif pd.isnull(age) and pclass==2 and sex == 'female' and SibSp == 2:
        age=fem_p2_s2
    elif pd.isnull(age) and pclass==2 and sex == 'female' and SibSp == 3:
        age=fem_p2_s3
    elif pd.isnull(age) and pclass==3 and sex == 'female' and SibSp == 0:
        age=fem_p3_s0
    elif pd.isnull(age) and pclass==3 and sex == 'female' and SibSp == 1:
        age=fem_p3_s1
    elif pd.isnull(age) and pclass==3 and sex == 'female' and SibSp == 2:
        age=fem_p3_s2
    elif pd.isnull(age) and pclass==3 and sex == 'female' and SibSp == 3:
        age=fem_p3_s3
    elif pd.isnull(age) and pclass==3 and sex == 'female' and SibSp == 4:
        age=fem_p3_s4
    elif pd.isnull(age) and pclass==3 and sex == 'female' and SibSp == 5:
        age=fem_p3_s5  
    elif pd.isnull(age) and pclass==3 and sex == 'female' and SibSp == 8:
        age=df.Age.median()  
    elif pd.isnull(age) and pclass==1 and sex == 'male' and SibSp == 0:
        age=male_p1_s0
    elif pd.isnull(age) and pclass==1 and sex == 'male' and SibSp == 1:
        age=male_p1_s1
    elif pd.isnull(age) and pclass==1 and sex == 'male' and SibSp == 2:
        age=male_p1_s2
    elif pd.isnull(age) and pclass==1 and sex == 'male' and SibSp == 3:
        age=male_p1_s3      
    elif pd.isnull(age) and pclass==2 and sex == 'male' and SibSp == 0:
        age=male_p2_s0
    elif pd.isnull(age) and pclass==2 and sex == 'male' and SibSp == 1:
        age=male_p2_s1
    elif pd.isnull(age) and pclass==2 and sex == 'male' and SibSp == 2:
        age=male_p2_s2   
    elif pd.isnull(age) and pclass==3 and sex == 'male' and SibSp == 0:
        age=male_p3_s0
    elif pd.isnull(age) and pclass==3 and sex == 'male' and SibSp == 1:
        age=male_p3_s1
    elif pd.isnull(age) and pclass==3 and sex == 'male' and SibSp == 2:
        age=male_p3_s2
    elif pd.isnull(age) and pclass==3 and sex == 'male' and SibSp == 3:
        age=male_p3_s3
    elif pd.isnull(age) and pclass==3 and sex == 'male' and SibSp == 4:
        age=male_p3_s4
    elif pd.isnull(age) and pclass==3 and sex == 'male' and SibSp == 5:
        age=male_p3_s5
    elif pd.isnull(age) and pclass==3 and sex == 'male' and SibSp == 6:
        age=male_p3_s6
    elif pd.isnull(age) and pclass==3 and sex == 'male' and SibSp == 8:
        age=male_p3_s8 
    else:
        age=age
    return age
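
The same idea (fill a missing age with the median of its sex / class / SibSp group) can also be written without one branch per group. The following is a minimal sketch of that alternative on made-up toy data, not the code used in this notebook; the fallback to the overall median for groups without any known age is an assumption that mirrors the special case in the function above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sex":    ["female", "female", "female", "male", "male", "male", "male"],
    "Pclass": [1, 1, 1, 3, 3, 3, 3],
    "SibSp":  [0, 0, 0, 1, 1, 8, 8],
    "Age":    [30.0, 40.0, np.nan, 20.0, np.nan, np.nan, np.nan],
})

# Median age of each (Sex, Pclass, SibSp) group, broadcast back to every row
group_median = toy.groupby(["Sex", "Pclass", "SibSp"])["Age"].transform("median")

# Use the group median where Age is missing; fall back to the overall median
# for groups in which no age is known at all (here: male, 3rd class, SibSp 8)
toy["Age_Filled"] = toy["Age"].fillna(group_median).fillna(toy["Age"].median())
print(toy)
```

Because `transform` returns a result with the same index as the input, no per-group variables are needed and new group combinations are handled automatically.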

In [None]:
# Creating a new column 'Age_Filled' in which missing ages are replaced by the group medians
df['Age_Filled'] = df.apply(lambda x: myfunc(x['Age'], x['Pclass'], x['Sex'], x['SibSp']), axis=1)

In [None]:
# Checking the new column and values
df.head()

In [None]:
# Finally checking if there is any missing data in the new column 'Age_Filled'
df.Age_Filled.isnull().sum()

In [None]:
# The histogram shows the age distribution of all passengers after replacing the missing values
df['Age_Filled'].hist(bins=16, color='purple' ,figsize=(16,8))
plt.title('Age Distribution of all passengers')
plt.xlabel('Age')
plt.ylabel('Count')
plt.grid(True)

***
`Fare`
***

In [None]:
# Drop every row with NaN in the "Fare" column
df.dropna(subset=["Fare"], axis=0, inplace=True)

# Reset the index, because rows were dropped
df.reset_index(drop=True, inplace=True)

# Verify that no missing 'Fare' values remain (expected: empty dataframe)
df[pd.isnull(df['Fare'])]

# Analysing Patterns using Visualisations

In [None]:
df.describe(include='all')

In [None]:
# Checking data types
df.info()

The data type of 'Sex' is object, so the column does not appear in a numeric correlation matrix. To include it, I create a new column 'Sex_Number' that sets the value 'female' to '1' and 'male' to '0'.

In [None]:
# Creating a new column 'Sex_Number' with the values '1' for 'female' and '0' for 'male'
df['Sex_Number'] = np.where((df['Sex'] == "female"), 1, 0)
# Checking if the column 'Sex_Number' has been added
df.head()
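
One property of the `np.where` encoding is that every value other than 'female' becomes 0, including typos or missing entries. A possible alternative, sketched here on made-up data, is an explicit mapping with `Series.map`, which leaves unexpected values as NaN so they can be spotted. The misspelled 'Male' entry is invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["female", "male", "Male", "female"]})

# Values that are not keys of the dictionary become NaN instead of silently turning into 0
toy["Sex_Number"] = toy["Sex"].map({"female": 1, "male": 0})

# Rows whose 'Sex' value was not covered by the mapping
unexpected = toy[toy["Sex_Number"].isna()]
print(unexpected)
```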

## Correlation

In [None]:
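# Correlation table including the new column 'Sex_Number'
# numeric_only=True skips text columns such as 'Name' or 'Ticket' (required since pandas 2.0)
correlation = df.corr(numeric_only=True)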
correlation

In [None]:
# Visualisation of the correlation table
plt.figure(figsize=(12,8), dpi=77)
sns.heatmap(correlation, linecolor='white',linewidths=0.1, annot=True)
plt.title('Correlation Matrix'.upper(), size=19, pad=13)
plt.xlabel('Titanic Data')
plt.ylabel('Titanic Data')
plt.xticks(rotation=33)
plt.show()

**This matrix shows that there is a correlation between sex and the chance to survive.** There is also a correlation between fare and the chance to survive, as well as a **negative correlation between the class and the chance to survive**.

## P-values

In [None]:
# Correlation and P-value of 'Survived' and 'Sex'
pearson_coef, p_value = stats.pearsonr(df['Survived'], df['Sex_Number'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

In [None]:
# Correlation and P-value of 'Survived' and 'Fare'
pearson_coef, p_value = stats.pearsonr(df['Survived'], df['Fare'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

In [None]:
# Correlation and P-value of 'Survived' and 'Pclass'
pearson_coef, p_value = stats.pearsonr(df['Survived'], df['Pclass'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
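
'Survived', 'Sex' and 'Pclass' are categorical rather than continuous, and the Pearson coefficient of two 0/1 columns is the phi coefficient. A commonly used alternative for two categorical variables is a chi-square test of independence. The following is a minimal sketch of that swapped-in technique on made-up counts; it is not part of the original analysis.

```python
import pandas as pd
from scipy import stats

toy = pd.DataFrame({
    "Sex":      ["female"] * 10 + ["male"] * 10,
    "Survived": [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8,
})

# Contingency table: rows = Sex, columns = Survived, cells = passenger counts
table = pd.crosstab(toy["Sex"], toy["Survived"])

# Chi-square test of independence between the two categorical variables
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print("chi2 =", chi2, " p =", p_value, " dof =", dof)
```

A small p-value suggests that survival and sex are not independent in the sample; like the Pearson p-value, it says nothing about the size of the effect.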

## Continuous Numerical Variables

### Linear Relationship

***
`Age` and `Fare`
***

In [None]:
df[["Age_Filled","Fare"]].corr()

In [None]:
# Calculating the P-value
pearson_coef, p_value = stats.pearsonr(df['Age_Filled'], df['Fare'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

In [None]:
plt.figure(figsize=(16,8))
sns.regplot(x="Fare", y="Age_Filled",data=df, scatter_kws={'color':'blue'}, line_kws={'color':'orange'}, marker='*')
plt.title('Relationship between Fare and Age')
plt.ylabel('Age')
plt.ylim(0.1,)

In [None]:
plt.figure(figsize=(20,8), dpi=75)
sns.scatterplot(x='Age_Filled', y='Fare', hue='Sex', data = df)
plt.title('Relationship between Age and Fare', size=17, pad=13)
plt.show()

## Categorical Variables

***
`Sex`
***

In [None]:
male = (df['Sex'] == 'male').sum()
female = (df['Sex']== 'female').sum()
proportions = [male,female]

plt.figure(figsize=(12,8), dpi=77)
plt.pie(proportions, labels= ['Males', 'Females'], explode = (0.05,0), startangle=90, autopct='%1.1f%%', shadow=False)
plt.axis('equal')
plt.title("Sex Proportion", size=17, pad=13)
plt.show()

***
`Survived`
***

In [None]:
# How many people survived ('Survived' == 1) and how many died ('Survived' == 0)
survived_data=df.Survived.value_counts().to_frame()
survived_data

In [None]:
pd.pivot_table(df, index="Survived", values=['Pclass','Age_Filled','SibSp', 'Parch', 'Fare'])


***
`Survived` and `Age`
***

In [None]:
# Amount and average age of people who survived compared to those who died.
df.groupby(['Survived']).Age.agg([len, 'min', 'max', 'mean', 'median'])

In [None]:
# Age comparison of the people who survived and those who died using a boxplot.
plt.figure(figsize=(10,8), dpi=77)
sns.boxplot(x="Survived", y="Age_Filled", data=df)
plt.title("Comparison: Age of People who died / survived", size=17, pad=13)
plt.ylabel('Age')
plt.xlabel(' ')
plt.xticks([0, 1], ['Not Survived', 'Survived'])
plt.show()

***
`Survived` and `Sex`
***

In [None]:
# The barplot compares the survival of men to women
plt.figure(figsize=(10,8), dpi=77)
sns.barplot(x="Sex", y="Survived", data=df)
plt.title("Survivors - Male & Female", size=17, pad=13 )
plt.show()

In [None]:
# Sex and Age compared with Survived and Not Survived
g = sns.FacetGrid(df, col='Survived', sharey=False, ylim=(0,80), hue='Sex', height=7, aspect=1.1)
g.map_dataframe(sns.histplot, x='Age')
g.set_axis_labels('Age', 'Count')
g.add_legend()
plt.show()

In [None]:
plt.figure(figsize=(10,8), dpi=77)
sns.countplot(x=df['Sex'],hue=df['Survived'])
plt.title("Comparison: Survivors - Male & Female", size=17, pad= 13)
plt.legend(['Not Survived', 'Survived'], loc='upper right', prop={'size': 12})
plt.show()

In [None]:
# Amount and average age of women and men who survived compared to those who died.
df.groupby(['Sex','Survived']).Age.agg([len,'mean', 'median'])

In [None]:
# Percentage of women who survived
women = df.loc[df.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)*100

print("% of women who survived: {:.2f}".format(rate_women))

In [None]:
# Percentage of men who survived
men = df.loc[df.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)*100

print("% of men who survived: {:.2f}".format(rate_men))
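
Both rates can also be obtained in one step, because the mean of a 0/1 column is the share of ones. A minimal sketch on made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of ones, so this is the survival rate per sex
rate_by_sex = toy.groupby("Sex")["Survived"].mean() * 100
print(rate_by_sex.round(2))
```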

***
`Survived` and `Pclass`
***

In [None]:
df['Pclass'].value_counts()

In [None]:
df.groupby(['Pclass', 'Survived']).Age.agg([len])

In [None]:
# Compares the chance of survival for each ticket class
plt.figure(figsize=(10,8), dpi=77)
sns.barplot(x="Pclass", y="Survived", data=df)
plt.title("Chance of Survival for each Ticket Class", size=17, pad=13)
plt.show()

In [None]:
# Survived and not survived compared for each ticket class
plt.figure(figsize=(10,8), dpi=77)
sns.countplot(x=df['Pclass'],hue=df['Survived'])
plt.title("Comparison: Survivors - Ticket Class", size=17, pad=13)
plt.legend(['Not Survived', 'Survived'], loc='upper left', prop={'size': 12})
plt.show()
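
The counts behind such a plot can be turned into shares per ticket class with a normalised cross table. This is a sketch on made-up data; `normalize="index"` divides each row by its row total.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# normalize="index" divides each row by its row total, giving shares per ticket class
share = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(share)
```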

***
`Survived`, `Pclass`, `Sex` and `Age`
***

In [None]:
# Amount and average age of women and men of each class who survived compared to those who died.
df.groupby(['Sex','Survived', 'Pclass']).Age.agg([len, 'min', 'max', 'mean'])

***
`Survived` and `Parch`
***

In [None]:
df['Parch'].value_counts().to_frame()

In [None]:
plt.figure(figsize=(12,6), dpi=77)
sns.barplot(x="Parch", y="Survived", data=df)
plt.title("Chance of Survival for Passengers with Parents or Children", size=17, pad=13)
plt.xlabel('Number of Parents / Children')
plt.show()

In [None]:
plt.figure(figsize=(12,8), dpi=77)
sns.countplot(x=df['Parch'],hue=df['Survived'])
plt.title("Survived - Parents or Children", size=17, pad=12)
plt.legend(['Not Survived', 'Survived'], loc='upper right', prop={'size': 12})
plt.xlabel('Parents / Children')
plt.show()

In [None]:
# Age of people with parents or children
df.groupby(['Parch','Survived']).Age.agg([len, 'min', 'max'])

***
`Survived` and `SibSp`
***

In [None]:
df.SibSp.value_counts()


In [None]:
df.groupby(['SibSp','Survived']).Age.agg([len, 'mean'])

In [None]:
# This barplot compares the chance of survival within a category
plt.figure(figsize=(12,6), dpi=77)
sns.barplot(x="SibSp", y="Survived", data=df)
plt.title("Chance of Survival for Passengers with Siblings or Spouses", size=17, pad=13)
plt.xlabel('Number of Siblings / Spouses')
plt.show()

In [None]:
plt.figure(figsize=(12,8), dpi=77)
sns.countplot(x=df['SibSp'],hue=df['Survived'])
plt.title("Survived - Siblings or Spouses", size=17, pad=13)
plt.xlabel('Siblings / Spouses')
plt.legend(['Not Survived', 'Survived'], loc='upper right', prop={'size': 12})
plt.show()

In [None]:
df.groupby(['SibSp','Survived']).Age.agg([len, 'min', 'max', 'mean'])

***
`Survived` and `Embarked`
***

In [None]:
plt.figure(figsize=(10,8), dpi=77)
# order fixes the category order so that the tick labels below are guaranteed to match
sns.barplot(x="Embarked", y="Survived", data=df, order=['S', 'C', 'Q'])
plt.title('Chance of Survival by Port of Embarkation', size=17, pad=13)
plt.xlabel('Port of Embarkation', size=13)
plt.xticks([0, 1, 2],['Southampton', 'Cherbourg', 'Queenstown'])
plt.show()

In [None]:
plt.figure(figsize=(10,8), dpi=77)
# order fixes the category order so that the tick labels below are guaranteed to match
sns.countplot(x=df['Embarked'], hue=df['Survived'], order=['S', 'C', 'Q'])
plt.title("Comparison: Survivors by Port of Embarkation", size=17, pad=13)
plt.xlabel('Port of Embarkation')
plt.legend(['Not Survived', 'Survived'], loc='upper right', prop={'size': 12})
plt.xticks([0, 1, 2],['Southampton', 'Cherbourg', 'Queenstown'])
plt.show()

***
`Age` and `SibSp`
***

In [None]:
plt.figure(figsize=(12,6), dpi=77)
sns.barplot(x="SibSp", y="Age", data=df)
plt.title('Average Age of Passengers with Siblings or Spouses', size=17, pad=13)
plt.xlabel('Siblings / Spouses')
plt.show()

***
`Age` and `Parch`
***

In [None]:
plt.figure(figsize=(12,8), dpi=77)
sns.boxplot(x="Parch", y="Age", data=df)
plt.title('Age of Passengers with Parents or Children', size=17, pad=13)
plt.xlabel('Parents / Children')
plt.show()

In [None]:
plt.figure(figsize=(12,6), dpi=77)
sns.barplot(x="Parch", y="Age", data=df)
plt.title('Average Age of Passengers with Parents or Children', size=17, pad=12)
plt.xlabel('Parents / Children')
plt.show()

***
`Age` and `Pclass`
***

In [None]:
plt.figure(figsize=(12,6), dpi=77)
sns.boxplot(x="Pclass", y="Age", data=df)
plt.title('Age of Passengers for each Ticket Class', size=17, pad=13)
plt.xlabel('Ticket Class')
plt.show()

### Comparing multiple columns

In [None]:
# Comparison of Pclass, Age, Sex and Survivors
g = sns.FacetGrid(df, col='Survived', row='Pclass', sharey=False, ylim=(0,300), hue='Sex', height=7)
g.map_dataframe(sns.scatterplot, x='Age', y='Fare')
g.set_axis_labels('Age', 'Fare')
g.add_legend()
# g.set_titles(col_template='', row_template='')
plt.show()

In [None]:
# Comparison of SibSp, Age, Fare, Sex and Survivors
g = sns.FacetGrid(df, col='Survived', row='SibSp', sharey=False, ylim=(0,300), hue='Sex', height=7)
g.map_dataframe(sns.scatterplot, x='Age', y='Fare')
g.set_axis_labels('Age', 'Fare')
g.add_legend()
# g.set_titles(col_template='', row_template='')
plt.show()

***
`Fare`
***

In [None]:
plt.figure(figsize=(20,10))
sns.histplot(df.Fare)
plt.title('Fares Paid', size=17, pad=13)
plt.show()

In [None]:
fig, axs = plt.subplots(figsize=(22, 9))
sns.countplot(x='Fare', hue='Survived', data=df)
plt.xlabel('Fare', size=16, labelpad=10)
plt.ylabel('Count', size=15, labelpad=10)
plt.tick_params(axis='x', labelsize=13)
plt.tick_params(axis='y', labelsize=15)
plt.legend(['Not Survived', 'Survived'], loc='upper right', prop={'size': 15})
plt.title('Survival compared to Fare', size=20, y=1, pad=13)
plt.show()
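
With one bar per distinct fare the plot above becomes very wide. One option is to group the fares into a few bins first. The following is a minimal sketch on made-up fares; the number of bins (4) and the column name `Fare_Bin` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Fare":     [5.0, 7.0, 8.0, 10.0, 15.0, 26.0, 30.0, 80.0],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Split the fares into 4 bins holding roughly the same number of passengers each
toy["Fare_Bin"] = pd.qcut(toy["Fare"], q=4)

# Survival rate within each fare bin
rate_by_bin = toy.groupby("Fare_Bin", observed=True)["Survived"].mean()
print(rate_by_bin)
```

A binned column like this could then be passed to `sns.countplot` or `sns.barplot` in place of the raw 'Fare' values.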
