
# Exploratory Data Analysis I

## Table of Contents

1. [Problem Statement](#section1)<br>
2. [Data Loading and Description](#section2)
3. [Data Profiling](#section3)
    - 3.1 [Understanding the Dataset](#section301)<br/>
    - 3.2 [Pre Profiling](#section302)<br/>
    - 3.3 [Preprocessing](#section303)<br/>
    - 3.4 [Post Profiling](#section304)<br/>
4. [Questions](#section4)
    - 4.1 [Of all the passengers, how many survived and how many died?](#section401)<br/>
    - 4.2 [Who is more likely to survive, Male or Female?](#section402)<br/>
    - 4.3 [What is the rate of survival of males, females and child on the basis of Passenger Class?](#section403)<br/>
    - 4.4 [What is the survival rate considering the Embarked variable?](#section404)<br/>
    - 4.5 [Survival rate - Comparing Embarked and Sex.](#section405)<br/>
    - 4.6 [How survival rate varies with Embarked, Sex and Pclass](#section406)<br/>
    - 4.7 [Segment age in bins with size 10.](#section407)<br/>
    - 4.8 [Analysing SibSp and Parch variable.](#section408)<br/>
    - 4.9 [Segment fare in bins of size 12.](#section409)<br/>
    - 4.10 [Draw pair plot to know the joint relationship between 'Fare','Age','Pclass' and 'Survived'](#section410)<br/>
    - 4.11 [Establish correlation between all the features using heatmap.](#section411)<br/>
    - 4.12 [Hypothesis: Women and children are more likely to survive](#section412)<br/>
5. [Conclusions](#section5)<br/>  

<a id=section1></a>

## 1. Problem Statement

This notebook explores the basic use of __Pandas__ and covers the basic commands of __Exploratory Data Analysis (EDA)__, which include __cleaning__, __munging__, __combining__, __reshaping__, __slicing__, __dicing__, and __transforming data__ for analysis.
EDA involves collecting, aggregating, cleaning, and organizing the data to be consumed by the algorithms designed to make discoveries or to create models.

* __Exploratory Data Analysis__ <br/>
Understand the data through EDA and derive simple models with Pandas as a baseline.
EDA is a critical first step in analyzing data, and we do it for the following reasons:
    - Finding patterns in Data
    - Determining relationships in Data
    - Checking of assumptions
    - Preliminary selection of appropriate models
    - Detection of mistakes


<a id=section2></a>

## 2. Data Loading and Description


<a id=section201></a>

- The dataset contains information about the people who boarded the famous RMS Titanic. Its variables include age, sex, fare, ticket and so on.
- The dataset comprises __891 observations of 12 columns__. The table below lists the columns and their descriptions.

| Column Name   | Description                                                          |
| ------------- |:--------------------------------------------------------------------:|
| PassengerId   | Passenger identity                                                   |
| Survived      | Whether the passenger survived or not (1 = survived, 0 = died)       |
| Pclass        | Class of ticket                                                      |
| Name          | Name of passenger                                                    |
| Sex           | Sex of passenger                                                     |
| Age           | Age of passenger                                                     |
| SibSp         | Number of siblings and/or spouses travelling with the passenger      |
| Parch         | Number of parents and/or children travelling with the passenger      |
| Ticket        | Ticket number                                                        |
| Fare          | Price of ticket                                                      |
| Cabin         | Cabin number                                                         |
| Embarked      | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |

#### Some Background Information
The sinking of the RMS Titanic in the early morning of __15 April 1912, four days into the ship's maiden voyage__ from __Southampton to New York City__, was one of the deadliest peacetime maritime disasters in history, __killing more
than 1,500 people__. The largest passenger liner in service at the time, Titanic had an __estimated 2,224 people on
board__ when she struck an __iceberg in the North Atlantic__. The ship had received __six warnings__ of sea ice but
was travelling at near __maximum speed when the lookouts sighted the iceberg__. Unable to turn quickly enough, the
ship suffered a glancing blow that buckled the starboard (right) side and opened __five of sixteen compartments to
the sea__. The disaster caused widespread outrage over the lack of lifeboats, lax regulations, and the __unequal treatment__ of the three passenger classes during the evacuation. Inquiries recommended sweeping changes to maritime regulations, leading to the __International Convention for the Safety of Life at Sea (1914)__, which continues to govern maritime safety.

In [None]:
!pip install https://github.com/ydataai/pandas-profiling/archive/master.zip

#### Importing packages                                          

In [None]:
import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
import pandas_profiling                                            # Generates the HTML profiling reports
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()

from subprocess import check_output ###Assignment



#### Importing the Dataset

In [None]:
titanic_data = pd.read_csv("https://raw.githubusercontent.com/amity1415/DS/main/EKeeda/titanicRawData.csv")



In [None]:
type(titanic_data)

In [None]:
titanic_data.head()

In [None]:
titanic_data.info() ## Summarises the structure of the dataset: column names, non-null counts and dtypes.

In [None]:
type(titanic_data['Name'][0])

In [None]:
i = titanic_data.Name[1]
type(i)

<a id=section3></a>

## 3. Data Profiling

- In the upcoming sections we will first __understand our dataset__ using various pandas functionalities.
- Then with the help of __pandas profiling__ we will find which columns of our dataset need preprocessing.
- In __preprocessing__ we will deal with erroneous and missing values of columns.
- We will then run __pandas profiling__ again to see how preprocessing has transformed our dataset.

<a id=section301></a>

In [None]:
# Generate the pandas pre-profiling report
# Perform data preprocessing based on the issues shown by the pre-profiling report
# Generate the pandas post-profiling report and verify that the data is ready for analysis

### 3.1 Understanding the Dataset

To gain insights from the data we must look at each aspect of it carefully. We will start by observing a few rows and columns of the data, both from the start and from the end.


In [None]:
titanic_data.shape ## This will print the number of rows and columns of the Data Frame

titanic_data has __891 rows__ and __12 columns.__

In [None]:
titanic_data.columns # This will print the names of all columns

In [None]:
titanic_data.head()

In [None]:
titanic_data.tail() # This will print the last 5 rows of the Data Frame (the default n)

In [None]:
titanic_data.isnull().sum() # Finding the count of null values in the data set.

In [None]:
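titanic_data.isnull().count() # count() tallies every entry of the boolean frame, so this is the total number of rows per column, not the number of nulls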

From the above output we can see that the __Age__ and __Cabin__ columns contain __the most null values__. We will see how to deal with them.
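
Raw null counts are easier to compare when expressed as a percentage of the column length. The following is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration, not the Titanic data). It relies on the fact that the mean of a boolean column is the share of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing entries
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

The same expression should work on `titanic_data` directly, and sorting puts the columns that need the most attention first.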

<a id=section302></a>

### 3.2 Pre Profiling

- Pandas profiling generates an __interactive HTML report__ which contains all the information about the columns of the dataset, such as the __counts and type__ of each _column_, detailed information about each column, the __correlation between different columns__ and a sample of the dataset.<br/>
- It gives us a __visual interpretation__ of each column in the data.
- The _spread of the data_ can be better understood from the distribution plot.
- It allows _granular-level_ analysis of each column.

In [None]:
profile = pandas_profiling.ProfileReport(titanic_data)

In [None]:
profile.to_file(output_file="titanic_before_preprocessing.html")

Here, we have done Pandas Profiling before preprocessing our dataset, so we have named the html file as __titanic_before_preprocessing.html__. Take a look at the file and see what useful insight you can develop from it. <br/>
Now we will process our data to better understand it.

<a id=section303></a>

In [None]:
for name in titanic_data.Name:
    print(name)

### 3.3 Preprocessing

- Dealing with missing values<br/>
    - Replacing missing entries of __Embarked.__
    - Replacing missing values of __Age__ with the median age of the passenger's group.
    - Converting __'Cabin'__ to a has-cabin flag and later dropping it, as it has too many _null_ values.
    - Replacing 0 values of __Fare__ with the median fare of the passenger's class.

In [None]:
titanic_data.drop(['PassengerId','Name'], axis=1, inplace=True)

In [None]:
titanic_data.head(1)

In [None]:
titanic_data.groupby('Pclass')['Fare'].median()

In [None]:
#Lets try to replace the missing embarked value
titanic_data['Embarked'].value_counts() #Value Counts

In [None]:
titanic_data['Embarked'].isnull() # Boolean mask that is True where Embarked is null

In [None]:
titanic_data[titanic_data['Embarked'].isnull()] # Finding out the details of the passengers whose Embarked data is null

In [None]:
titanic_data[titanic_data['Ticket'] == '113572']

In [None]:
titanic_data[titanic_data['Cabin'] == 'B28']

In [None]:
#Create a column that will give a representation of the ticket price per passenger.

In [None]:
titanic_data['Ticket'].value_counts()

In [None]:
titanic_data[titanic_data['Ticket']=='347082']

In [None]:
titanic_data[titanic_data['Pclass']==3]['Fare'].median()

In [None]:
titanic_data.groupby('Ticket').agg({'Ticket':'count','Fare':'mean'})

In [None]:
titanic_data['Ticket_Pass_Count']= titanic_data.groupby('Ticket')['Ticket'].transform('count')

In [None]:
#titanic_data.drop('Ticket_Count', axis=1,inplace=True)

In [None]:
# Fare per passenger: the ticket fare divided by the number of passengers travelling on that ticket
titanic_data['FarePerPass']= titanic_data['Fare']/(titanic_data['Ticket_Pass_Count'])
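
The two cells above can be illustrated on a tiny example. This is a sketch on made-up data (the tickets `A1` and `B2` and their fares are hypothetical), and it assumes, as the notebook does, that `Fare` holds the price of the whole ticket rather than a per-person price.

```python
import pandas as pd

# Hypothetical toy data: two passengers share ticket "A1", one travels alone on "B2"
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2"],
    "Fare": [80.0, 80.0, 7.5],
})

# transform('count') returns one value per row, aligned with the original index
toy["Ticket_Pass_Count"] = toy.groupby("Ticket")["Ticket"].transform("count")
toy["FarePerPass"] = toy["Fare"] / toy["Ticket_Pass_Count"]
print(toy)
```

Unlike `agg`, which returns one row per ticket, `transform` keeps the original shape, so the result can be assigned straight back as a new column.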

In [None]:
titanic_data.head(20)

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(titanic_data['Fare'], kde=True, bins=40, color='skyblue')
plt.title('Distribution of Fare (histogram with KDE)', fontsize=16)
plt.xlabel('Fare')
plt.ylabel('Count')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
# Same plot for the per-passenger fare
plt.figure(figsize=(10,5))
sns.histplot(titanic_data['FarePerPass'], kde=True, bins=40, color='skyblue')
plt.title('Distribution of Fare Per Pass (histogram with KDE)', fontsize=16)
plt.xlabel('Fare Per Pass')
plt.ylabel('Count')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

In [None]:
titanic_data[titanic_data['Fare']>=80.0].head(1)

In [None]:
titanic_data[titanic_data['Embarked'].isnull()] # Finding out the details of the passengers whose Embarked data is null

In [None]:
titanic_data['Embarked'].mode()
titanic_data['Embarked'].value_counts()

In [None]:
titanic_data[ (titanic_data['Embarked']=='C') & (titanic_data['Pclass']== 1) & (titanic_data['Sex']=='female')]['FarePerPass'].median()

In [None]:
titanic_data[ (titanic_data['Embarked']=='S') & (titanic_data['Pclass']== 1) & (titanic_data['Sex']=='female')]['FarePerPass'].median()

In [None]:
titanic_data[ (titanic_data['Embarked']=='Q') & (titanic_data['Pclass']== 1) & (titanic_data['Sex']=='female')]['FarePerPass'].median()

In [None]:
titanic_data['Embarked'].mode() # To find the mode of the Embarked Data

In [None]:
#titanic_data['Embarked'].mode()[0] #Getting the Mode Value using the [0] index

In [None]:
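# Fill the missing ports with 'C'. Assigning the result back is more reliable than calling fillna(inplace=True) on a selected column.
titanic_data['Embarked'] = titanic_data['Embarked'].fillna('C')
# Alternative: fill with the most frequent port instead
#titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])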
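
As an illustration of the two filling strategies mentioned in this cell, here is a small sketch on a made-up column (`toy` is hypothetical data, not the Titanic frame). It fills once with a fixed port and once with the most frequent port, and assigns the result back rather than filling in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing ports
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", np.nan]})

# Option 1: fill with an explicitly chosen port
filled_fixed = toy["Embarked"].fillna("C")

# Option 2: fill with the most frequent value; mode() returns a Series, hence [0]
most_common = toy["Embarked"].mode()[0]
filled_mode = toy["Embarked"].fillna(most_common)

# Assign the result back instead of relying on inplace=True on a selected column
toy["Embarked"] = filled_fixed
print(toy["Embarked"].isnull().sum())
```

Which option is appropriate depends on the data; the notebook picks a specific port after comparing per-passenger fares across ports.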

In [None]:
titanic_data.info()

In [None]:
#titanic_data[titanic_data['Embarked'].isnull()]
titanic_data[titanic_data['Embarked'].isnull()]['Embarked'] # Should now return an empty Series, since no null values remain in Embarked.

In [None]:
titanic_data['Age'].median()

In [None]:
titanic_data[titanic_data['Age'].isnull()]

In [None]:
titanic_data[(titanic_data['Sex']=='female') & (titanic_data['Survived']==1)  ]['Age'].median()

In [None]:
titanic_data[(titanic_data['Sex']=='male') & (titanic_data['Survived']==1)]['Age'].median()

In [None]:
titanic_data[titanic_data['Age'].isnull()]['Survived']

In [None]:
titanic_data.groupby(['Pclass','Sex','Survived'])['Age'].median()

In [None]:
#Dealing with the null values of the Age data
# Percentage survived among the passengers whose Age is missing.
titanic_data.loc[titanic_data['Age'].isnull(), 'Survived'].mean()*100

In [None]:
(titanic_data.Survived.sum()/titanic_data.Survived.value_counts().sum())*100 # Percentage Survived in overall data set

In [None]:
titanic_data.Age.median()

In [None]:
#median_age= titanic_data.Age.median()# Filling the missing values with Median Value of Age
#titanic_data['Age']= titanic_data.Age.fillna(median_age)
male_median_age = titanic_data[titanic_data['Sex']=='male']['Age'].median()
male_median_age

In [None]:
female_median_age = titanic_data[titanic_data['Sex']=='female']['Age'].median()
female_median_age

In [None]:
#Since the difference between the median age of males and females is not significant,
#we could replace the missing values of age with the overall median age (line below).
#The next cell uses the median of each Pclass/Sex/Survived group instead.
#titanic_data['Age']= titanic_data['Age'].fillna(titanic_data.Age.median())

In [None]:
titanic_data['Age'] = titanic_data.groupby(['Pclass', 'Sex','Survived'])['Age'].transform(
   lambda x: x.fillna(x.median())
)
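
To see what the group-wise fill does, here is a minimal sketch on made-up ages (the frame `toy` is hypothetical). For illustration it groups only by `Pclass` and `Sex`; leaving `Survived` out of the keys is an assumption of this sketch, made because the target would not be known for unseen passengers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age in each group
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Each missing age is replaced by the median of its own (Pclass, Sex) group
toy["Age"] = toy.groupby(["Pclass", "Sex"])["Age"].transform(
    lambda s: s.fillna(s.median())
)
print(toy)
```

A group whose ages are all missing has no median, so its rows would stay null; it is worth rechecking the null count after this step.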

In [None]:
#Rechecking whether the null values got filled with the median
print(titanic_data['Age'].isnull().sum())
titanic_data.info()


In [None]:
#Cabin is null for most rows. Instead of dropping it right away, mark missing cabins with 0 so that it can be turned into a has-cabin flag.
#titanic_data.drop('Cabin', axis=1, inplace=True)

titanic_data['Cabin'] = titanic_data['Cabin'].fillna(0)

In [None]:
titanic_data['Cabin'].value_counts()

In [None]:
# Every remaining (non-zero) entry is a recorded cabin, so flag it with 1
titanic_data.loc[titanic_data['Cabin'] != 0, 'Cabin']=1

In [None]:
titanic_data['Cabin'].value_counts()

In [None]:
titanic_data.head(1)

 We see that this is now the case, and all the passengers are well over 18 years of age.

In [None]:
titanic_data[titanic_data.Age<=1]['FarePerPass']

In [None]:
titanic_data[titanic_data.Fare==0]['Pclass'].value_counts() ## Checking the passenger class of passengers with 0 Fare.


In [None]:
titanic_data.groupby('Pclass')['FarePerPass'].median()

In [None]:
# Checking the median value of fare
#median_fare_1= titanic_data.loc[titanic_data['Pclass']==1,'Fare'].median()
#median_fare_1

In [None]:
#titanic_data.loc[((titanic_data['Fare'] == 0) & (titanic_data['Pclass']==1)),['Fare']]=median_fare_1

In [None]:
#median_fare_2= titanic_data.loc[titanic_data['Pclass']==2,'Fare'].median()
#titanic_data.loc[((titanic_data['Fare'] == 0) & (titanic_data['Pclass']==2)),['Fare']]=median_fare_2

In [None]:
#median_fare_3= titanic_data.loc[titanic_data['Pclass']==3,'Fare'].median()
#titanic_data.loc[((titanic_data['Fare'] == 0) & (titanic_data['Pclass']==3)),['Fare']]=median_fare_3

In [None]:
##print("Median Fare 1st Class: ",median_fare_1)
#print("Median Fare 2nd Class: ",median_fare_2)
#print("Median Fare 3rd Class: ",median_fare_3)

In [None]:
#Replacing the 0 Fare with the median fare of its respective Pclass.
titanic_data['Fare'] = titanic_data.groupby('Pclass')['Fare'].transform(
    lambda x: x.mask(x == 0, x.median())
)

In [None]:
titanic_data['FarePerPass'] = titanic_data.groupby('Pclass')['FarePerPass'].transform(
    lambda x: x.mask(x == 0, x.median())
)
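
One detail of the replacement above is that the class median is computed with the zero fares still included. The sketch below, on made-up fares (`toy` is hypothetical), shows that approach next to a variant that takes the median of the non-zero fares only; the variant is an illustration, not what the notebook does.

```python
import pandas as pd

# Hypothetical toy data: one zero fare in each class
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare": [0.0, 50.0, 70.0, 0.0, 8.0, 10.0],
})

# Same approach as the notebook: zeros are replaced by the class median (zeros included)
with_zeros = toy.groupby("Pclass")["Fare"].transform(
    lambda s: s.mask(s == 0, s.median())
)

# Variant: compute the median from the non-zero fares only
without_zeros = toy.groupby("Pclass")["Fare"].transform(
    lambda s: s.mask(s == 0, s[s != 0].median())
)
print(pd.DataFrame({"with_zeros": with_zeros, "without_zeros": without_zeros}))
```

With only a handful of zero fares per class the two results should be close, but the second avoids pulling the fill value towards zero.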

In [None]:
#Alternative: replace the 0 values of Fare with the overall median, considering that a 0 fare was entered by mistake.
#titanic_data['Fare']= titanic_data['Fare'].replace(0, titanic_data.Fare.median())
titanic_data['Fare'].isnull().sum()

In [None]:
titanic_data[titanic_data.Fare==0] #Rechecking if 0 Fare Exists.


In [None]:
titanic_data[titanic_data['Fare'].isnull()]# Double checking if the Fare/FarePerPass column is having any null value or not.

In [None]:
titanic_data['GenderClass'] = titanic_data.apply(lambda x: 'child' if x['Age']<15 else x['Sex'], axis=1 ) #Creating a new column

- Segmenting the __Sex__ column by __Age__: passengers younger than 15 are labelled __child__, and passengers aged 15 or older keep their gender, __male__ or __female__.
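
The same segmentation can be written without a row-wise `apply`. The sketch below uses `np.where` on a made-up frame (`toy` is hypothetical) and is meant as an equivalent, vectorized formulation of the rule described above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering both sides of the age-15 boundary
toy = pd.DataFrame({
    "Age": [8.0, 15.0, 40.0, 14.9],
    "Sex": ["male", "female", "male", "female"],
})

# np.where evaluates the condition for the whole column at once:
# 'child' where Age < 15, otherwise the passenger's own Sex value
toy["GenderClass"] = np.where(toy["Age"] < 15, "child", toy["Sex"])
print(toy)
```

A missing age compares as `False` in `Age < 15`, so such a passenger would keep the `Sex` label, which matches the behaviour of the `apply` version.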

In [None]:
titanic_data[titanic_data.Age<15].head()

In [None]:
#Create a new Column : Family Size
titanic_data['FamilySize'] = titanic_data['SibSp']+ titanic_data['Parch']+1

In [None]:
titanic_data.head()

In [None]:
titanic_data.drop('Sex', axis=1, inplace=True) # Since we created Gender class, we are dropping the sex column.


In [None]:
titanic_data.drop(['SibSp','Parch','Ticket'], axis=1, inplace=True)

In [None]:
titanic_data.drop(['Cabin'], axis=1, inplace=True)

In [None]:
titanic_data.drop(['Ticket_Pass_Count'], axis=1, inplace=True)

In [None]:
titanic_data.head(2)

In [None]:
titanic_data.drop_duplicates(inplace=True) # Dropping rows that became exact duplicates once the identifying columns were removed during preprocessing.
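
Because the identifying columns are gone at this point, two different passengers with the same class, age and fare look like duplicates. A quick count before dropping shows how many rows are affected. The sketch below uses made-up rows (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical, row 3 differs only in Fare
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 0],
    "Pclass": [3, 3, 1, 3],
    "Age": [25.0, 25.0, 38.0, 25.0],
    "Fare": [7.9, 7.9, 71.3, 8.05],
})

# duplicated() marks every repeat of an earlier, fully identical row
n_duplicates = toy.duplicated().sum()
deduplicated = toy.drop_duplicates()
print(n_duplicates, len(deduplicated))
```

The counts quoted in section 4.1 (455 died, 319 survived) add up to 774 rows, fewer than the original 891, which is consistent with this step removing rows.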

In [None]:
titanic_data.info()

<a id=section304></a>

### 3.4 Post Profiling

In [None]:
import pandas_profiling
profile = pandas_profiling.ProfileReport(titanic_data)
profile.to_file(output_file="titanic_after_preprocessing.html")

We have now preprocessed the data: the dataset no longer contains missing values, and we have introduced a new feature named __FamilySize__. The pandas profiling report generated after preprocessing will therefore give us more useful insights. You can compare the two reports, i.e. __titanic_after_preprocessing.html__ and __titanic_before_preprocessing.html__.<br/>
Observations from the titanic_after_preprocessing.html report:
- In the Dataset info, Total __Missing(%)__ = __0.0%__
- Number of __variables__ = __8__
- Observe the newly created variable FamilySize. Click on Toggle details to get more detailed information about it.

<a id=section4></a>

## 4. Questions

<a id=section401></a>

### 4.1 Of all the passengers, how many survived and how many died?

- Using Countplot

In [None]:
sns.countplot(x='Survived', data=titanic_data).set_title('Count plot for survived')

You can see that __more people died than survived.__ To know the exact count:

- Using groupby

In [None]:
titanic_data.head(1)

In [None]:
titanic_data.groupby('Survived')['Survived'].count() # Number of passengers who died (0) and survived (1)

Notice that __455__ people __died__ and only __319 survived.__

<a id=section402></a>

### 4.2 Who is more likely to survive, Male or Female?

First, let us look at how __Age varies with gender.__

In [None]:

as_fig=sns.FacetGrid(titanic_data,hue='GenderClass', aspect=5) # Always assign hue to a categorical column
as_fig.map(sns.kdeplot,'Age', fill=True) # fill replaces the deprecated shade argument
oldest = titanic_data['Age'].max()
as_fig.set(xlim=(0,oldest))
as_fig.add_legend()
plt.title('Age Distribution using FacetGrid')


- Among __children__ on the RMS Titanic, ages __3-8__ yrs are the most common.
- Most __males and females__ are aged __25-35__ yrs.

Using groupby

In [None]:
titanic_data.groupby(['Survived','GenderClass','Pclass']).size() # Number of passengers in each combination

From the above you can see that it's __difficult__ to absorb information quickly by looking at __numbers.__ Therefore we will make a variety of plots to get a clear view of the scenario.

- Using catplot

In [None]:
sns.catplot(x='GenderClass', hue='Survived', kind= 'count', data=titanic_data)
plt.title('Factor plot for male female and child')

In [None]:
titanic_data[titanic_data.GenderClass == 'female']['Survived'].count() # Total Female

In [None]:
titanic_data[titanic_data.GenderClass == 'female']['Survived'].sum()# Total Female that survived

- Majority of __males died__.
- __Females__ have high probability to __survive.__

To know the exact %

In [None]:
titanic_data[titanic_data.GenderClass == 'female']['Survived'].sum()

In [None]:
print("% of women survived: ", titanic_data[titanic_data.GenderClass == 'female']['Survived'].sum()/ titanic_data[titanic_data.GenderClass == 'female']['Survived'].count()*100)
print("% of male survived: ", titanic_data[titanic_data.GenderClass == 'male']['Survived'].sum()/ titanic_data[titanic_data.GenderClass == 'male']['Survived'].count()*100)
print("% of children survived: ", titanic_data[titanic_data.GenderClass == 'child']['Survived'].sum()/ titanic_data[titanic_data.GenderClass == 'child']['Survived'].count()*100)
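
Since `Survived` is coded as 0/1, its mean within a group is that group's survival rate, so the three print statements can be condensed into a single `groupby`. The sketch below uses made-up rows (`toy` is hypothetical), so its percentages are not the Titanic figures.

```python
import pandas as pd

# Hypothetical toy data: 2 females, 4 males, 2 children
toy = pd.DataFrame({
    "GenderClass": ["female", "female", "male", "male", "male", "male", "child", "child"],
    "Survived":    [1, 1, 0, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate per group
survival_pct = toy.groupby("GenderClass")["Survived"].mean().mul(100)
print(survival_pct)
```

On `titanic_data` the same line should reproduce the three percentages printed above in one table.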

- Using pie plot

In [None]:
titanic_data['Survived'][titanic_data['GenderClass']=='male'].value_counts()

In [None]:
f,ax = plt.subplots(1,3,figsize=(20,7))
titanic_data['Survived'][titanic_data['GenderClass']=='male'].value_counts().plot.pie(explode=[0,0.2], autopct='%1.1f%%', ax=ax[0], shadow=True)
titanic_data['Survived'][titanic_data['GenderClass']=='female'].value_counts().plot.pie(explode=[0,0.2], autopct='%1.1f%%', ax=ax[1], shadow=True)
titanic_data['Survived'][titanic_data['GenderClass']=='child'].value_counts().plot.pie(explode=[0,0.2], autopct='%1.1f%%', ax=ax[2], shadow=True)
ax[0].set_title('Survived (male)')
ax[1].set_title('Survived (female)')
ax[2].set_title('Survived (child)')

From the above pie plot you can see how survival depends on whether the passenger is a child, male or female.
- __76% of females__ survived.
- __57% of children__ also survived.
- Only __16% of males__ survived.

In [None]:
titanic_data['Survived'][titanic_data['GenderClass']=='male'].value_counts()

In [None]:
titanic_data['Survived'][titanic_data['GenderClass']=='female'].value_counts()

<a id=section403></a>

### 4.3 What is the rate of survival of males, females and child on the basis of Passenger Class?

- Using mathematical function

In [None]:
print("% of Survival in PClass=1: ", titanic_data[titanic_data.Pclass==1]['Survived'].sum()/ titanic_data[titanic_data.Pclass==1]['Survived'].count()*100)
print("% of Survival in PClass=2: ", titanic_data[titanic_data.Pclass==2]['Survived'].sum()/ titanic_data[titanic_data.Pclass==2]['Survived'].count()*100)
print("% of Survival in PClass=3: ", titanic_data[titanic_data.Pclass==3]['Survived'].sum()/ titanic_data[titanic_data.Pclass==3]['Survived'].count()*100)

- Using crosstab function

In [None]:
pd.crosstab([titanic_data.GenderClass, titanic_data.Survived], titanic_data.Pclass, margins=True).apply(lambda r: 100*r/len(titanic_data), axis=1).style.background_gradient(cmap='autumn_r')
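
The crosstab above divides every cell by the size of the whole dataset, so it shows each group's share of all passengers. If the goal is the survival rate within each class, `pd.crosstab` can normalise each row instead. The following sketch on made-up rows (`toy` is hypothetical) shows that variant.

```python
import pandas as pd

# Hypothetical toy data: 4 passengers in class 1 and 4 in class 3
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# normalize='index' makes each row sum to 1, so the values are rates within a class
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index").mul(100)
print(rates)
```

Each row now sums to 100, which makes the classes directly comparable even though they contain different numbers of passengers.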

In [None]:
titanic_data.groupby(['Survived','GenderClass','Pclass']).size() # Number of passengers in each combination

You can see how the percentage of males, females and children who survived varies with the passenger class they are in. Also, it's quite difficult to develop quick insights by looking only at numbers. Therefore we will explore doing the same with the help of __plotting.__

- Using __violin plot__ to see the relationship between __Pclass and Survived__

In [None]:
sns.violinplot(x='Pclass',y='Survived', data=titanic_data)
plt.title('Violinplot Pclass vs Survived')
plt.show()

Above is another beautiful way to see how the survival rate varies with passenger class.
- __Pclass 3__ has __more__ people who __died__, and for __Pclass 1 the survival rate is higher.__

 Drawing a __catplot__ to look at the __distribution of population__ with __Pclass and GenderClass.__

In [None]:
sns.catplot(x='Pclass', data=titanic_data, hue='GenderClass', kind='count')
plt.title('Catplot with kind="count" for Pclass and GenderClass') # Set the title before show(), otherwise it lands on a new empty figure
plt.show()

1. __Pclass 3__ has the _maximum_ number of __males__.
2. __Pclass 1__ has the _minimum_ number of __children__.

- Using catplot to see the variation of __survival rate with Pclass and GenderClass.__

In [None]:
sns.catplot(x='Pclass',y='Survived', hue='GenderClass', kind='point', data=titanic_data) # kind='point' plots the mean of Survived, i.e. the survival rate
plt.title('Catplot for survival rate variation with Pclass and GenderClass')

The above graph shows:
1. The survival rate for males is very __low__ _irrespective of the class_ they belong to.
2. The survival rate is _lower_ for all the _3rd class passengers._
3. __Almost all women__ in Pclass __1 and 2 survived__ and __nearly all men__ in Pclass __2 and 3 died.__

<a id=section404></a>

### 4.4 What is the survival rate considering the Embarked variable?

- Using countplot

In [None]:
sns.countplot(x='Embarked', data=titanic_data, hue='Survived')

In [None]:
# Percentage of the whole dataset in each (Embarked, GenderClass, Survived) group. For example, about 6% of all passengers were females who embarked at C and survived.
100*titanic_data.groupby(['Embarked','GenderClass'])['Survived'].value_counts()/len(titanic_data)

1. The __maximum__ number of people have __Southampton__ as their port of embarkation.
2. Also observe that among the people who boarded at _Cherbourg_, _more_ people _survived than died_, and this is reversed for Queenstown.

- Using __catplot__ with __kind = 'point'__

In [None]:
sns.catplot(x= 'Embarked',y='Survived', kind='point' ,data=titanic_data)
plt.title('Factorplot for Embarked and Survived')
plt.show()

<a id=section405></a>

### 4.5 Survival rate - Comparing Embarked and Sex.

- Distribution of _GenderClass_ with respect to _Port of Embarkation_ using __Countplot__.

In [None]:
sns.countplot(x='Embarked', data=titanic_data, hue='GenderClass')

Most of the people boarded from __S__. Also, among all who boarded, __males__ constitute the __majority__.

- Using catplot to see the variation of __survival rate with port of embarkation and GenderClass__

In [None]:
sns.catplot(x='Embarked',y='Survived', kind='point', hue='GenderClass', data=titanic_data)
plt.title('Factorplot for Embarked and Survived')
plt.show()

- The chances of survival are _highest_ for __females__ who boarded from __C__.
- The chances of survival are _lowest_ for __males__ who boarded from __Q__.

<a id=section406></a>

### 4.6 How survival rate varies with Embarked, Sex and Pclass.

Seeing relation between Pclass and Embarked.

In [None]:
relation = pd.crosstab(titanic_data.Embarked, titanic_data.Pclass)
relation.plot.barh(figsize=(15,5))
plt.xticks(size=10)
plt.yticks(size=10)
plt.title('Relation between Pclass and Embarked', size=20)

Most of the people who boarded from __S__ belong to __Pclass 3__.<br/>
Most of the passengers belonging to __Pclass 1__ boarded from __C and S__.


- Using Swarmplot

In [None]:
sns.set(style='whitegrid', palette='muted')
sns.swarmplot(x='Embarked',y='Age', hue='GenderClass', palette='gnuplot', data=titanic_data)

- Using catplot with kind = 'point'

In [None]:
sns.catplot(x='Embarked',y='Survived',col='Pclass', hue='GenderClass', kind='point', data=titanic_data)
plt.show()

- Practically all _women_ of __Pclass 2__ that embarked in __C and Q survived__, and nearly all _women_ of __Pclass 1__ _survived_.
- All _men_ of __Pclass 1 and 2__ that embarked in __Q died__, and the survival rate for men in __Pclass 2 and 3__ is always __below 0.2__.
- For the remaining men in Pclass 1 that embarked in S and C, the survival rate is approx. __0.4__.

<a id=section407></a>

### 4.7 Segment age in bins with size 10.

In [None]:
for i in range(8,0,-1):
    titanic_data.loc[titanic_data['Age']<=i*10, 'Age_bin']=i # 80,70,60,50....
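
The descending loop works because the last assignment that matches a row is the smallest bin whose upper edge covers the age. `pd.cut` expresses the same binning declaratively. The sketch below compares both on a few made-up ages (`toy` is hypothetical); the edges 0, 10, ..., 80 are taken from the loop.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including values that sit exactly on a bin edge
toy = pd.DataFrame({"Age": [0.42, 10.0, 10.5, 35.0, 80.0]})

# Loop from the notebook: the last matching assignment is the smallest bin
for i in range(8, 0, -1):
    toy.loc[toy["Age"] <= i * 10, "Age_bin_loop"] = i

# pd.cut with right-closed intervals (0, 10], (10, 20], ..., (70, 80]
edges = np.arange(0, 90, 10)
toy["Age_bin_cut"] = pd.cut(
    toy["Age"], bins=edges, labels=list(range(1, 9))
).astype(float)
print(toy)
```

One difference to keep in mind: an age of exactly 0 falls outside `(0, 10]` and would become null with `pd.cut` unless `include_lowest=True` is passed, whereas the loop assigns it to bin 1.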

In [None]:
print(titanic_data[['Age','Age_bin']])

In [None]:
titanic_data.plot.hexbin(x='Age_bin', y='Survived', gridsize=12, legend=True)

Comparing count of those who survived and died with respect to the Age_bin they are in.
- __Age_bin 1__: As you can see, the hexagon for Survived (1.0) is darker than the one for Died (0.0), which means __more children survived than died__.
- __Age_bin 3__: __More died than survived__. The count of survivors is also the highest among all age bins (look horizontally along Survived = 1.0), which means most of the people who boarded the Titanic were from this age group.
- __Age_bin >4__: More people died than survived.

In [None]:
sns.barplot(x='Age_bin',y='Survived', hue='Pclass', data=titanic_data)
plt.show()

- Calculating number of people of Age_bin = 1 and 8 from each Pclass.

In [None]:
titanic_data[(titanic_data.Age_bin==1)]['Pclass'].value_counts()

In [None]:
titanic_data[(titanic_data.Age_bin==8)]['Age'].value_counts()

In [None]:
titanic_data[(titanic_data.Age_bin==1)&(titanic_data.Pclass==1)]['Survived']

In [None]:
titanic_data[(titanic_data.Age_bin==8)]['Pclass'].value_counts()

In [None]:
titanic_data[(titanic_data.Age_bin==8)& (titanic_data.Pclass==1)]['Survived'].value_counts()

In [None]:
titanic_data[(titanic_data.Age_bin==8)& (titanic_data.Pclass==3)]['Survived'].value_counts()

- Among children of __age 0-10 yrs__ we don't have enough data points (3) in Pclass 1, therefore we are __discarding it__ (blue line of Age_bin 1).<br/>
- The number of passengers in the age group __70-80 yrs__ is also very small, therefore we are __ignoring them.__<br/>
- In __each Pclass__, we can see that the probability of survival of __small children (Age = 0-10 yrs)__ is _higher_ than that of the other age groups.<br/>
- In every Age_bin (ignoring Pclass 1 of the first Age_bin, and the last Age_bin), the __survival probability is highest for Pclass 1 and lowest for Pclass 3.__

In [None]:
sns.catplot(x='Age_bin', y='Survived',kind='point',data=titanic_data)
plt.show()

In [None]:
sns.catplot(x='Age_bin', y='Survived',kind='point',hue='GenderClass',data=titanic_data) # factorplot was renamed to catplot in seaborn 0.9
plt.show()

It's clear from the above graph that among people of all ages, __females__ in general have a __higher probability of survival than males__.

In [None]:
sns.catplot(x='Age_bin',y='Survived', col='Pclass', row='GenderClass',kind='point',hue='Embarked',data=titanic_data)
plt.show()

From the catplot:<br/>

- In general for males, as __Pclass increases, survival probability decreases.__
- For females as well, as _Pclass increases_, _survival probability decreases._<br/>
- You can also see survival rate within each Pclass for males and females.

<a id=section408></a>

### 4.8 Analysing survival rate with FamilySize.

- Using __catplot__ with kind = 'violin' to see the survival rate on the basis of __FamilySize__.

In [None]:
ax= sns.catplot(x='FamilySize', y='Survived',data=titanic_data, kind='violin', aspect=1.5, palette='Greens')
ax.set(ylabel="Percent of Passengers")
plt.title('Survival by Total Family Size')

In [None]:
titanic_data.head()

As the __size of the family increases__, its chances of survival also __increase__.

<a id=section409></a>

### 4.9 Segment fare in bins of size 12.

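- Using a histogram with a KDE curve to see the distribution of __FarePerPass__.

In [None]:
sns.histplot(titanic_data['FarePerPass'], kde=True, stat='density', color='g') # histplot replaces the deprecated distplot
plt.title('Distribution of Fare per Passenger')
plt.show()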

We have seen that the fare mostly varies between __10 and 90.__ We will use this information to create bins.

- Creating a new column named __'Fare_bin'__ from the per-passenger fare. Note that the code below uses intervals of width 10, giving __8 bins__, with every fare above 80 placed in the last bin.

In [None]:
for i in range(8,0,-1):
    titanic_data.loc[titanic_data['FarePerPass']<=i*10, 'Fare_bin']=i
titanic_data.loc[titanic_data['FarePerPass']>80, 'Fare_bin']=8 # Fares above 80 go to the last bin; uses the same column as the loop
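
For equally wide bins the bin number can also be computed arithmetically. The sketch below assumes that bins are assigned on the per-passenger fare, that every value above 80 belongs in the last bin, and that no zero fares remain; the fares in `toy` are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical per-passenger fares, including edge values and one very high fare
toy = pd.DataFrame({"FarePerPass": [7.25, 10.0, 26.55, 80.0, 151.55]})

# ceil(fare / 10) is the smallest i with fare <= i * 10; clip keeps the result in 1..8
toy["Fare_bin"] = np.ceil(toy["FarePerPass"] / 10).clip(lower=1, upper=8)
print(toy)
```

The `clip` call plays the role of the extra line after the loop: anything beyond the eighth interval is folded into bin 8.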

In [None]:
titanic_data[['Fare','Fare_bin']].groupby('Fare_bin')['Fare'].mean()

In [None]:
sns.histplot(titanic_data['Fare_bin'], kde=True, stat='density', color='g') # histplot replaces the deprecated distplot
plt.title('Distribution of Fare Bin')
plt.show()

In [None]:
titanic_data['Fare'].mean()
titanic_data['Fare'].median()

- Using __barplot__ to plot the relationship between __survival rate and Fare_bin and Pclass.__

In [None]:
fig, ax = plt.subplots(figsize=(8,8))
sns.barplot(x='Fare_bin', y='Survived', hue='Pclass', data=titanic_data, ax=ax)
plt.show()

- As the __fare increases, the chances of survival also increase__.
- Also, __Pclass 1__ (blue color) has __a higher chance of survival__ than the other classes.

<a id=section410></a>

### 4.10 Draw pair plot to know the joint relationship between 'Fare','Age','Pclass' and 'Survived'

In [None]:
sns.pairplot(titanic_data[['Fare','Age','Pclass','Survived']],vars=['Fare','Age','Pclass'], hue='Survived', dropna=True, markers=['o','s'])
plt.title('Pair Plot')

Observing the diagonal elements,
- More people of Pclass 1 survived than died (First peak of red is higher than blue)
- More people of Pclass 3 died than survived (Third peak of blue is higher than red)
- More people of age group 20-40 died than survived.
- Most of the people paying less fare died.

<a id=section411></a>

### 4.11 Establish correlation between all the features using heatmap.

In [None]:
# Convert categorical columns to numerical using one-hot encoding
titanic_data_encoded = pd.get_dummies(titanic_data, columns=['Embarked', 'GenderClass'], drop_first=True)


corr = titanic_data_encoded.corr()
plt.figure(figsize=(12,10))
sns.heatmap(corr, vmax=1, linewidth=0.01, square=True, annot=True, cmap='YlGnBu', linecolor='black')
plt.title('Correlation between features')
plt.show()
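
When the question is specifically how each feature relates to survival, the relevant column of the correlation matrix can be pulled out and sorted. This is a minimal sketch on made-up values (`toy` is hypothetical), so the signs it shows illustrate the idea and are not results from the Titanic data.

```python
import pandas as pd

# Hypothetical toy data: survivors here are in better classes and paid higher fares
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Pclass":   [1, 1, 3, 3, 2, 3],
    "Fare":     [80.0, 60.0, 8.0, 7.0, 20.0, 9.0],
})

# Keep only the correlations with the target and order them from most negative to most positive
corr_with_target = toy.corr()["Survived"].drop("Survived").sort_values()
print(corr_with_target)
```

Applied to `titanic_data_encoded`, the same line would give a ranked list that is easier to scan than the full heatmap.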

- __Age and Pclass are negatively correlated with Survived.__
- FamilySize is built only from Parch and SibSp, so it is highly positively correlated with them.
- __Fare and FamilySize__ are __positively correlated with Survived.__
- With high correlation we face __redundancy__ issues.

<a id=section412></a>

### 4.12 Hypothesis: Women and children are more likely to survive

On studying Questions 4.1, 4.2 and 4.3 we observed that an overwhelming percentage of __women & children__ survived the Titanic disaster.
- __76%__ of __females__ survived.
- __57%__ of __children__ also survived.
- Only __16%__ of __males__ survived.<br/>
Also, the survival rate for males is very low irrespective of the _class_ they belong to, and the _survival rate is lower_ for all the _3rd class passengers._ Almost all women in Pclass 1 and 2 survived and nearly all men in Pclass 2 and 3 died.

<a id=section5></a>

## 5. Conclusions

- With the help of this notebook we learnt how exploratory data analysis can be carried out using Pandas plotting.
- We have also seen how to use packages like __matplotlib and seaborn__ to develop better insights about the data.<br/>
- We have also seen how __preprocessing__ helps in dealing with _missing_ values and irregularities present in the data. We also learnt how to _create new features_ which will in turn help us to better predict survival.
- We also made use of __pandas profiling__ to generate an HTML report containing all the information about the various features present in the dataset.
- We have seen the impact of columns like _Age, Embarked, Fare, SibSp and Parch_ on the rate of survival.
- The most important inference drawn from all this analysis is that we get to know the __features with which survival is highly positively and negatively correlated.__
- This analysis will help us choose which __machine learning model__ we can apply to predict survival on the test dataset.