# Exploratory Data Analysis (EDA)
In this analysis, we explore Haberman's cancer survival dataset using visualization and some statistical analysis to build intuition about the relationship between the attributes and the classes. 
This will help us derive simple rules for distinguishing the two classes:<br>
1 = the patient survived 5 years or longer<br>
2 = the patient died within 5 years<br>

Please upvote this kernel if you find it helpful...

## Contents 

1. About the dataset<br>
&nbsp;&nbsp;1.1 Dataset Loading<br>
&nbsp;&nbsp;1.2 Other Information About the Dataset<br>
2. Objective of our EDA<br>
3. Univariate EDA<br>
&nbsp;&nbsp;3.1 Bar chart<br>
&nbsp;&nbsp;3.2 Histogram<br>
&nbsp;&nbsp;&nbsp;&nbsp;3.2.1 Age attribute<br>
&nbsp;&nbsp;&nbsp;&nbsp;3.2.2 Year attribute<br>
&nbsp;&nbsp;&nbsp;&nbsp;3.2.3 Positive Nodes attribute<br>
4. Bivariate EDA<br>
&nbsp;&nbsp;4.1 Scatter plot<br>
&nbsp;&nbsp;4.2 Pair plot<br>
&nbsp;&nbsp;4.3 Correlation Heatmap<br>
5. Conclusion<br>

## 1. About the dataset 

1. Title: Haberman's Survival Data. Link: https://www.kaggle.com/gilsousa/habermans-survival-data-set 
2. Sources: (a) Donor: Tjen-Sien Lim (limt@stat.wisc.edu) (b) Date: March 4, 1999
3. Relevant Information: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
4. Attribute Information:
    * Age of patient at time of operation (numerical)
    * Patient's year of operation (year - 1900, numerical)
    * Number of positive axillary nodes detected (numerical)
    * Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years

### 1.1. Dataset Loading 

In [None]:
# Importing the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from scipy import stats
import pylab

# Loading haberman.csv and printing the first rows.
# Note: the file has no header row, so pandas uses the first record as the column names.
HabermanDataFrame = pd.read_csv('../input/habermans-survival-data-set/haberman.csv')
HabermanDataFrame.head(5)

### 1.2. Other Information About the Dataset

In [None]:
# Overall information about the DataFrame
HabermanDataFrame.info()

In [None]:
# Finding out the column names 
HabermanDataFrame.columns 

In [None]:
# Renaming the columns according to their meanings
HabermanDataFrame = HabermanDataFrame.rename(columns={"30":"Age", "64":"Year", "1":"Positive Nodes", "1.1":"Survival Status"})
HabermanDataFrame.columns

In [None]:
# Checking for null values
HabermanDataFrame.isnull().any()

In [None]:
# Shape of our dataset 
HabermanDataFrame.shape

In [None]:
# How many patients from each class 
HabermanDataFrame["Survival Status"].value_counts()

Observations:
* The DataFrame has 305 entries, indexed from 0 to 304.
* It has 4 columns, and each column's data type is int64.
* No attribute (column) contains null values.
* The shape of our DataFrame is (305, 4), i.e., 305 rows and 4 columns.
* Among the 305 patients, 224 survived 5 years or longer and the other 81 died within 5 years.
* Because the file has no header row, `read_csv` used the first record as the column names (hence "30", "64", "1", "1.1" before renaming), so that one record is not among the 305 rows.
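
The column names "30", "64", "1" and "1.1" seen before renaming suggest that the first record of the file was read as a header. Below is a minimal sketch of the difference, using a few made-up rows held in memory instead of the real file. The `names` list is our own choice of labels, not something the file provides.

```python
import io
import pandas as pd

# Toy stand-in for a header-less CSV (illustrative rows, not the real file).
raw = "30,64,1,1\n30,62,3,1\n31,59,2,1\n34,59,0,2\n"

# Assumed column labels, chosen to match the renamed columns in this notebook.
names = ["Age", "Year", "Positive Nodes", "Survival Status"]

# Default behaviour: the first record is consumed as the header row.
implicit = pd.read_csv(io.StringIO(raw))

# header=None keeps every record as data and applies our own labels.
explicit = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(list(implicit.columns), implicit.shape)
print(list(explicit.columns), explicit.shape)
```

With the default settings the toy frame loses one row to the header, while `header=None` keeps all four rows.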

## 2. Objective of our EDA
Our objective is to explore Haberman's cancer survival dataset using visualization and some statistical analysis to build intuition about the relationship between the attributes and the classes, so that we can identify a set of attributes or simple rules that help distinguish the two classes. 

## 3. Univariate EDA
In univariate data analysis, we explore the characteristics of the distribution of a single quantitative variable, such as its center, spread, modality (number of peaks in the PDF), shape (including “heaviness of the tails”), and outliers.
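
As a small sketch of these summary quantities, the snippet below computes center, spread and skewness for a toy sample. The values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy right-skewed sample (illustrative values, not from the dataset).
values = pd.Series([0, 0, 0, 1, 1, 2, 3, 4, 8, 21])

summary = {
    "mean": values.mean(),      # center, sensitive to extreme values
    "median": values.median(),  # center, robust to extreme values
    "std": values.std(),        # spread
    "iqr": values.quantile(0.75) - values.quantile(0.25),  # robust spread
    "skew": values.skew(),      # shape: > 0 means a longer right tail
}
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean well above the median, together with positive skewness, is a quick numeric hint of a right-skewed distribution.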

### 3.1 Bar Chart 

In [None]:
# Simple bar chart of the number of data points in each class
sns.set(style="darkgrid")
plt.figure(figsize=(10,7))
# countplot counts the rows per class itself, so labels and bar heights cannot get out of sync
ax = sns.countplot(x="Survival Status", data=HabermanDataFrame)
plt.title('Bar Chart On Survival Status', fontsize=15)

### 3.2 Histogram

#### 3.2.1 Age Attribute

In [None]:
# Histogram with a KDE curve for the Age attribute
plt.figure(figsize=(12,5))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(HabermanDataFrame["Age"], kde=True, stat="density")
plt.title('"Age" Attribute Distribution Plot', fontsize=18)
HabermanDataFrame["Age"].describe()

In [None]:
#Basic descriptive Statistics of Age attribute
iqr_age = HabermanDataFrame["Age"].describe()['75%'] - HabermanDataFrame["Age"].describe()['25%']
median_age = HabermanDataFrame["Age"].median()
mode_age = HabermanDataFrame["Age"].mode()
print("Inter Quartile Range for Age attribute is:", iqr_age)
print("Median of Age attribute is:", median_age)
print("Mode of Age attribute is:", mode_age)

In [None]:
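# Distribution of both classes compared over the Age attribute
# (height is the current name of the old size argument; histplot replaces the deprecated distplot)
sns.FacetGrid(HabermanDataFrame, hue="Survival Status", aspect=2, height=5) \
   .map(sns.histplot, "Age", kde=True, stat="density") \
   .add_legend();
plt.title('Both Classes Distribution Comparison Over "Age"', fontsize=18)
plt.show();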

In [None]:
# Box and violin plots comparing both classes over the Age attribute
plt.figure(figsize=(12,5))
sns.boxplot(x='Survival Status', y='Age', data=HabermanDataFrame)
plt.title('Both Classes Box Plot Comparison Over "Age"', fontsize=18)
# Second figure for the violin plot
plt.figure(figsize=(12,5))
sns.violinplot(x='Survival Status', y='Age', data=HabermanDataFrame)
plt.title('Both Classes Violin Plot Comparison Over "Age"', fontsize=18)
plt.show()

#### Observations on Age Attribute:
* From the plot, the Age attribute looks approximately symmetric and bell-shaped, i.e., roughly normally distributed.
* The mean, median and mode are nearly the same, which is consistent with a symmetric distribution, although on its own this does not establish normality.
* Comparing the distributions of both classes over Age, the classes are mixed up. They have nearly the same mean, median and variance, which does not help to distinguish the classes.
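
Equal mean, median and mode only indicate symmetry, so a formal check can be useful. The sketch below builds a toy sample whose mean, median and mode coincide and runs the Shapiro-Wilk test from `scipy.stats` on it. The sample is made up for illustration and is not the Age column.

```python
import pandas as pd
from scipy import stats

# Toy symmetric sample with only three distinct values (illustrative, not the Age column).
sample = pd.Series([0] * 100 + [5] * 120 + [10] * 100)

mean = sample.mean()
median = sample.median()
mode = sample.mode().iloc[0]
print("mean, median, mode:", mean, median, mode)

# Shapiro-Wilk test: a small p-value is evidence against normality.
statistic, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk W = {statistic:.3f}, p = {p_value:.3g}")
```

Here all three measures of center agree, yet the test rejects normality, so a histogram or Q-Q plot should back up any normality claim.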

#### 3.2.2 Year Attribute

In [None]:
# Histogram with a KDE curve for the Year attribute
plt.figure(figsize=(12,5))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(HabermanDataFrame["Year"], kde=True, stat="density")
plt.title('"Year" Attribute Distribution Plot', fontsize=18)
HabermanDataFrame["Year"].describe()

In [None]:
# Basic descriptive statistics of the Year attribute
iqr_year = HabermanDataFrame["Year"].describe()['75%'] - HabermanDataFrame["Year"].describe()['25%']
median_year = HabermanDataFrame["Year"].median()
mode_year = HabermanDataFrame["Year"].mode()
print("Inter Quartile Range for Year attribute is:", iqr_year)
print("Median of Year attribute is:", median_year)
print("Mode of Year attribute is:", mode_year)

In [None]:
# Distribution of both classes compared over the Year attribute
# (height is the current name of the old size argument; histplot replaces the deprecated distplot)
sns.FacetGrid(HabermanDataFrame, hue="Survival Status", height=5, aspect=2) \
   .map(sns.histplot, "Year", kde=True, stat="density") \
   .add_legend();
plt.title('Both Classes Distribution Comparison Over "Year"', fontsize=18)
plt.show();

In [None]:
# Box and violin plots comparing both classes over the Year attribute
plt.figure(figsize=(12, 5))
sns.boxplot(x='Survival Status', y='Year', data=HabermanDataFrame)
plt.title('Both Classes Box Plot Comparison Over "Year"', fontsize=18)
# Second figure for the violin plot
plt.figure(figsize=(12, 5))
sns.violinplot(x='Survival Status', y='Year', data=HabermanDataFrame)
plt.title('Both Classes Violin Plot Comparison Over "Year"', fontsize=18)
plt.show()

#### Observations on Year Attribute:
* The Year attribute is not normally distributed. From the visualization it looks somewhat bimodal, because it has 2 peaks.
* The distributions of both classes over Year have nearly the same mean and overlap almost completely.
* The survived class looks multi-modal, whereas the died class looks bimodal.
* In the died class distribution, patients operated on around years 58 and 63 seem to have died in larger numbers relative to the survived class. But the classes are not clearly distinguishable because of the large amount of overlap.
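
Visual impressions like these can be cross-checked with a per-year death rate. The sketch below does this on a toy frame. The records are made up for illustration, and the `Calendar Year` and `Died` columns are helper names introduced here, not part of the dataset.

```python
import pandas as pd

# Toy records (illustrative, not the real data); Year is stored as year - 1900.
toy = pd.DataFrame({
    "Year": [58, 58, 58, 60, 60, 63, 63, 63, 63],
    "Survival Status": [1, 2, 2, 1, 1, 1, 2, 2, 2],
})

# Decode the calendar year and flag patients who died within 5 years (status 2).
toy["Calendar Year"] = toy["Year"] + 1900
toy["Died"] = toy["Survival Status"].eq(2)

# Share of patients per operation year who died, plus the group size behind each share.
per_year = toy.groupby("Calendar Year")["Died"].agg(death_rate="mean", patients="size")
print(per_year)
```

Keeping the group size next to the rate helps to avoid over-reading years with only a handful of patients.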

#### 3.2.3 Positive Nodes Attribute 

In [None]:
# Histogram with a KDE curve for the Positive Nodes attribute
plt.figure(figsize=(12,5))
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(HabermanDataFrame["Positive Nodes"], kde=True, stat="density")
plt.title('"Positive Nodes" Attribute Distribution Plot', fontsize=18)
HabermanDataFrame["Positive Nodes"].describe()

In [None]:
# Basic descriptive statistics of the Positive Nodes attribute
iqr_positive_node = HabermanDataFrame["Positive Nodes"].describe()['75%'] - HabermanDataFrame["Positive Nodes"].describe()['25%']
median_positive_node = HabermanDataFrame["Positive Nodes"].median()
mode_positive_node = HabermanDataFrame["Positive Nodes"].mode()
print("Inter Quartile Range for Positive Nodes attribute is:", iqr_positive_node)
print("Median of Positive Nodes attribute is:", median_positive_node)
print("Mode of Positive Nodes attribute is:", mode_positive_node)

In [None]:
# Distribution of both classes compared over the Positive Nodes attribute
# (height is the current name of the old size argument; histplot replaces the deprecated distplot)
sns.FacetGrid(HabermanDataFrame, hue="Survival Status", height=5, aspect=2) \
   .map(sns.histplot, "Positive Nodes", kde=True, stat="density") \
   .add_legend();
plt.title('Both Classes Distribution Comparison Over "Positive Nodes"', fontsize=18)
plt.show();

In [None]:
# Box and violin plots comparing both classes over the Positive Nodes attribute
plt.figure(figsize=(12, 5))
sns.boxplot(x='Survival Status', y='Positive Nodes', data=HabermanDataFrame)
plt.title('Both Classes Box Plot Comparison Over "Positive Nodes"', fontsize=18)
# Second figure for the violin plot
plt.figure(figsize=(12, 5))
sns.violinplot(x='Survival Status', y='Positive Nodes', data=HabermanDataFrame)
plt.title('Both Classes Violin Plot Comparison Over "Positive Nodes"', fontsize=18)
plt.show()

In [None]:
#Check for pareto distribution 
plt.figure(figsize=(12, 5))
stats.probplot(HabermanDataFrame["Positive Nodes"], dist=stats.pareto, sparams=(2.5,), plot=pylab)
plt.title('Q-Q plot to check Pareto distribution \nfor Positive Nodes', fontsize=18)
pylab.show()

#### Observations on Positive Nodes Attribute:
* The distribution of the Positive Nodes attribute is heavily right-skewed.
* The box-and-whisker plot was drawn to find possible outliers, and it clearly shows that the Positive Nodes attribute contains many of them.
* The box plot has no lower whisker because the lower quartile coincides with the minimum. The values are concentrated at the low end with a long right tail, which resembles a Pareto distribution.
* But the Q-Q plot makes it quite clear that it is not a Pareto distribution, at least not with the shape parameter 2.5 used above.
* Although the classes overlap on this attribute too, patients with more than 5 positive axillary nodes died in larger numbers than they survived.
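
The points that a box plot draws individually as outliers follow the usual 1.5 × IQR rule, which is seaborn's default whisker length. The sketch below applies that rule by hand to a toy series. The node counts are made up for illustration.

```python
import pandas as pd

# Toy node counts (illustrative, not the real column).
nodes = pd.Series([0, 0, 0, 0, 1, 1, 2, 3, 4, 7, 13, 22])

q1, q3 = nodes.quantile(0.25), nodes.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles fall outside the whiskers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = nodes[(nodes < lower_fence) | (nodes > upper_fence)]

print("Q1, Q3, IQR:", q1, q3, iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers.tolist())
```

For a heavily right-skewed attribute this rule flags much of the long tail, so the flagged points are not necessarily data errors.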

## 4. Bivariate EDA
Bivariate analysis is the analysis of exactly two variables at a time. It gives us intuition about which pair of attributes, taken together, helps most with classification.

### 4.1 Scatter Plot


In [None]:
#Simple scatterplot demonstration 
plt.figure(figsize=(12,5))
plt.scatter(x=HabermanDataFrame['Age'], y=HabermanDataFrame['Positive Nodes'])
plt.title('Scatterplot Between "Age" & "Positive Nodes"', fontsize=18)

### 4.2 Pair Plot

In [None]:
# Pair plot to visualize the scatter plot between each pair of attributes
# (height is the current name of the old size argument)
sns.pairplot(HabermanDataFrame, hue="Survival Status", vars=['Age', 'Year', 'Positive Nodes'], diag_kind='kde', height=4)
plt.suptitle('Pair Plot Between All Attributes', x=.48, y=1, fontsize=18)
plt.show()

#### Observations:
* It is hard to say which two attributes are preferable for classification.
* The data points of both classes, i.e., survived and died, are so mixed in every scatter plot that no simple if-else logic can classify them.
* Even a decision boundary learned by an ML model may give low accuracy.
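
One way to judge a candidate if-else rule is to compare its accuracy with always predicting the majority class. The sketch below does this on a toy frame. The data and the `rule_accuracy` helper are made up for illustration, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy labelled sample (illustrative, not the real data): 1 = survived, 2 = died.
toy = pd.DataFrame({
    "Positive Nodes": [0, 0, 1, 2, 3, 0, 8, 12, 1, 20],
    "Survival Status": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
})

def rule_accuracy(frame, threshold):
    """Accuracy of: predict 'died' (2) when Positive Nodes > threshold, else 'survived' (1)."""
    predicted = frame["Positive Nodes"].gt(threshold).map({True: 2, False: 1})
    return (predicted == frame["Survival Status"]).mean()

# Baseline: always predict the most frequent class.
baseline = toy["Survival Status"].value_counts(normalize=True).max()

print(f"majority-class baseline: {baseline:.2f}")
for threshold in (3, 5, 10):
    print(f"nodes > {threshold}: accuracy {rule_accuracy(toy, threshold):.2f}")
```

On an imbalanced dataset like this one, a rule is only interesting if it clearly beats the majority-class baseline.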

### 4.3 Correlation Heatmap
A correlation matrix gives further insight into the relationship between each pair of attributes. 

In [None]:
#Heatmap of correlation matrix
plt.figure(figsize=(12,10))
sns.heatmap(HabermanDataFrame.drop('Survival Status', axis =1).corr(), annot=True)
plt.title('Correlation Heatmap Between All Attributes', fontsize=18)

#### Observations:
* All the correlation values are low, i.e., the attributes are only weakly linearly correlated with each other.
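
`DataFrame.corr()` computes Pearson correlation by default, which only captures linear association. As a hedged sketch, the toy example below contrasts it with Spearman rank correlation on a monotone but non-linear relationship. The columns `x` and `y` are made up for illustration.

```python
import pandas as pd

# Toy monotone but non-linear relationship (illustrative, not the real data).
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")   # measures linear association only
print(f"Spearman: {spearman:.3f}")  # measures monotone (rank) association
```

Passing `method="spearman"` to the heatmap's `corr()` call would be one way to check whether the low values hold for rank association as well.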

## 5. Conclusion
* The data points of both classes are jumbled up over all attributes. 
* From the Positive Nodes plots, it seems that patients with 10 or more positive nodes are more likely to die within 5 years.
* On a final note, the classes overlap heavily in this dataset, so distinguishing them with high accuracy is hard using this set of attributes.
