# Exploratory Data Analysis (EDA) with Haberman Dataset

# Data Description

1. Haberman Cancer Survival Dataset.
2. Collected between 1958 and 1970 at the University of Chicago's Billings Hospital.
3. It is based on the survival of patients who had undergone surgery for breast cancer.

# Environment Setup

In [None]:
#importing necessary Libraries 

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

#Loading Haberman Dataset

hman=pd.read_csv('../input/haberman/haberman.csv')
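
The `read_csv` call above assumes the file already has a header row with the column names used throughout this notebook. If a copy of the file has no header row, a sketch like the following assigns the names explicitly. The three sample rows are made up for illustration and stand in for the real file.

```python
import io
import pandas as pd

# Column names used throughout this notebook.
COLUMNS = ['Age', 'Operation_Year', 'axillary_lymph_nodes',
           'Survival_status_after_5_years']

# Made-up rows standing in for a header-less haberman.csv.
raw = io.StringIO("30,64,1,1\n34,59,0,2\n45,66,3,1\n")

# header=None tells pandas that the first line is data, not column names.
sample = pd.read_csv(raw, header=None, names=COLUMNS)
print(sample.shape)
```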



# Data Attributes

In [None]:
# Number of Data Points

print(hman.shape)

In [None]:
# Columns of the Dataset

print(hman.columns)

In [None]:
# Overview of the Data

print(hman.tail())

In [None]:
# modify the target column values to be meaningful and categorical

hman['Survival_status_after_5_years'] = hman['Survival_status_after_5_years'].map({1:"yes", 2:"no"})
hman['Survival_status_after_5_years'] = hman['Survival_status_after_5_years'].astype('category')

# printing top of modified data

print(hman.head())
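
`Series.map` with a dictionary turns any value missing from the dictionary into `NaN`, so it can be worth checking that the mapping covered every row. A small sketch with a made-up series, where the value `3` is deliberately not a valid status code:

```python
import pandas as pd

status = pd.Series([1, 2, 1, 3])            # 3 is not in the mapping
mapped = status.map({1: "yes", 2: "no"})

# Count the entries that the mapping did not cover.
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist(), n_unmapped)
```

On the notebook's DataFrame, the same check would be `hman['Survival_status_after_5_years'].isna().sum()` right after the mapping.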

In [None]:
# Number of data points per class of 'Survival_status_after_5_years'

hman['Survival_status_after_5_years'].value_counts()


# The Haberman dataset is imbalanced: yes = 225, no = 81
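
To put the imbalance in proportion, the class counts quoted above (225 and 81) can be converted to shares. This sketch rebuilds the counts by hand rather than reading the file; on the real DataFrame, `value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Class counts as quoted in the cell above.
counts = pd.Series({"yes": 225, "no": 81})

# Share of each class in the whole dataset.
shares = counts / counts.sum()
print(shares.round(3))
```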



In [None]:
# High Level Statistics

print(hman.describe())

**Observation(s)**

1. The dataset is imbalanced.
2. Patient age ranges from 30 to 83, with a mean of about 52.
3. At least 25% of patients have no axillary lymph nodes, 50% have at most one, and 75% have at most four.
4. 75% of patients are about 60 years old or younger (the 75th percentile of age is close to 60).
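
A quartile statement such as "75% of patients have at most four nodes" can be cross-checked by counting directly. A minimal sketch, using a made-up list of node counts and a hypothetical helper `share_at_or_below` (neither is part of the original notebook):

```python
import pandas as pd

def share_at_or_below(values: pd.Series, threshold: float) -> float:
    """Fraction of entries that are <= threshold."""
    return float((values <= threshold).mean())

# Made-up node counts, for illustration only.
nodes = pd.Series([0, 0, 1, 1, 2, 4, 7, 13])
print(share_at_or_below(nodes, 4))
```

On the real data the call would be `share_at_or_below(hman['axillary_lymph_nodes'], 4)`.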


# Objective

To predict whether a patient who has undergone surgery for breast cancer will survive 5 years or more, based on the patient's age, year of operation, and number of axillary lymph nodes.

# Univariate Analysis

In [None]:
# plotting probability distribution wrt Patient's Age

plt.close()
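# Current seaborn releases use `height` instead of `size`, and `histplot(kde=True)` instead of the deprecated `distplot`
sns.FacetGrid(hman, hue='Survival_status_after_5_years', height=5).map(sns.histplot, 'Age', kde=True, stat='density').add_legend()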
plt.show()


In [None]:
# plotting probability distribution wrt Patient's Year of Operation

plt.close()
sns.FacetGrid(hman, hue='Survival_status_after_5_years', height=5).map(sns.histplot, 'Operation_Year', kde=True, stat='density').add_legend()
plt.show()

In [None]:
# plotting probability distribution wrt number of axillary lymph nodes
plt.close()
sns.FacetGrid(hman, hue='Survival_status_after_5_years', height=6).map(sns.histplot, 'axillary_lymph_nodes', kde=True, stat='density').add_legend()
plt.show()

**Observations**

1. Survival status cannot be easily separated based on age or year of operation.
2. The number of axillary lymph nodes separates survival status comparatively better.
3. Patients with 0 to 2 axillary lymph nodes have the highest probability of surviving 5 years.

**Cumulative Distribution Function**

In [None]:
# CDF wrt Patient's Age

# Legend entries, in the order the four curves are plotted
label = ["pdf of survived", "cdf of survived", "pdf of not survived", "cdf of not survived"]

# Split the data by survival status
hman_survived = hman.loc[hman['Survival_status_after_5_years'] == 'yes']
hman_not_survived = hman.loc[hman['Survival_status_after_5_years'] == 'no']

# Survived: per-bin probabilities (PDF) and their running sum (CDF)
counts, bin_edges = np.histogram(hman_survived['Age'], bins=15)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
plt.figure(1)
plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)

# Not survived: same computation
counts, bin_edges = np.histogram(hman_not_survived['Age'], bins=15)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
plt.grid()
plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)
plt.xlabel('Age')
plt.ylabel('Probability')  # the curves are fractions of patients, not counts
plt.legend(label)
plt.show()




In [None]:
# CDF wrt Year of Operation

# Survived: per-bin probabilities (PDF) and their running sum (CDF)
counts, bin_edges = np.histogram(hman_survived['Operation_Year'], bins=15)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)

# Not survived: same computation
counts, bin_edges = np.histogram(hman_not_survived['Operation_Year'], bins=15)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
plt.grid()
plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)
plt.xlabel('Year of Operation')
plt.ylabel('Probability')  # the curves are fractions of patients, not counts
plt.legend(label)
plt.show()

In [None]:
#CDF wrt axillary lymph nodes

# Survived: per-bin probabilities (PDF) and their running sum (CDF)
counts, bin_edges = np.histogram(hman_survived['axillary_lymph_nodes'], bins=15)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
plt.figure(1)
plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)

# Not survived: same computation
counts, bin_edges = np.histogram(hman_not_survived['axillary_lymph_nodes'], bins=15)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
plt.grid()
plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)
plt.xlabel('axillary_lymph_nodes')
plt.ylabel('Probability')  # the curves are fractions of patients, not counts
plt.legend(label)
plt.show()

**Observation(s)**

1. 80% of patients who survived are aged 64 or younger.
2. No patient with more than 47 axillary lymph nodes survived.
3. No patient older than 77 at the time of operation survived.
4. 80% of patients who survived have fewer than 4 axillary lymph nodes.
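
The histogram-based PDF/CDF computation is repeated in each of the three cells above. One way to avoid the repetition is a small helper; `empirical_pdf_cdf` is a hypothetical name introduced here, and the ages below are made up purely to show the shapes of the outputs.

```python
import numpy as np

def empirical_pdf_cdf(values, bins=15):
    """Return (right bin edges, per-bin probability, cumulative probability)."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()      # fraction of observations in each bin
    cdf = np.cumsum(pdf)             # running total, ends at 1.0
    return bin_edges[1:], pdf, cdf

# Made-up ages, for illustration only.
ages = np.array([30, 35, 41, 47, 52, 52, 58, 60, 64, 77])
edges, pdf, cdf = empirical_pdf_cdf(ages, bins=5)
print(edges, pdf, cdf)
```

With the notebook's data it could be called as `empirical_pdf_cdf(hman_survived['Age'])`, and the three returned arrays passed to `plt.plot` as before.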


# Boxplot

In [None]:
# BoxPlot wrt patient's age

sns.boxplot(y='Age',x='Survival_status_after_5_years',data=hman)
plt.show()

In [None]:
# BoxPlot wrt year of operation

sns.boxplot(y='Operation_Year',x='Survival_status_after_5_years',data=hman)
plt.show()

In [None]:
# BoxPlot wrt axillary lymph nodes

sns.boxplot(y='axillary_lymph_nodes',x='Survival_status_after_5_years',data=hman)
plt.show()

# Violin Plot

In [None]:
# Violin plot wrt patient's age

sns.violinplot(y='Age',x='Survival_status_after_5_years',data=hman)
plt.show()

In [None]:
# Violin plot wrt year of operation

sns.violinplot(y='Operation_Year',x='Survival_status_after_5_years',data=hman)
plt.show()

In [None]:
# Violin plot wrt axillary lymph nodes

sns.violinplot(y='axillary_lymph_nodes',x='Survival_status_after_5_years',data=hman)
plt.show()

**Observation(s)** 
(from boxplot and violin plot)

1. 75% of patients who survived are aged 60 or younger.
2. 75% of patients who survived have fewer than 4 axillary lymph nodes.
3. About 25% of patients who did not survive have more than 10 axillary lymph nodes (the 75th percentile for this group is roughly 11).
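
The quartiles that a boxplot draws can also be computed numerically, which makes statements like the ones above easier to verify. A minimal sketch on made-up rows (the column names mirror the notebook's DataFrame, the values do not come from the dataset):

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'Survival_status_after_5_years': ['yes'] * 5 + ['no'] * 5,
    'axillary_lymph_nodes': [0, 0, 1, 2, 3, 1, 4, 7, 11, 20],
})

# One row per class, one column per quartile.
quartiles = (
    toy.groupby('Survival_status_after_5_years')['axillary_lymph_nodes']
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

Replacing `toy` with `hman` should give the box edges and medians shown in the plots.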

# Bi-variate Analysis

# Scatter Plot

In [None]:
# Three features give 3C2 = 3 possible scatter plots; Age vs. axillary lymph nodes is drawn here

sns.set_style('whitegrid')
sns.FacetGrid(hman, hue='Survival_status_after_5_years', height=6).map(plt.scatter, 'Age', 'axillary_lymph_nodes').add_legend()
plt.show()
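
The count of 3C2 = 3 mentioned in the scatter-plot cell can be enumerated with `itertools.combinations`, which also gives the feature pairs to loop over if all three plots are wanted. The feature names are the column names used in this notebook.

```python
from itertools import combinations

features = ['Age', 'Operation_Year', 'axillary_lymph_nodes']

# Every unordered pair of distinct features.
pairs = list(combinations(features, 2))
for x, y in pairs:
    print(x, 'vs', y)
```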


# Pair Plot



In [None]:
# Pair Plot 

plt.close()
sns.pairplot(hman, hue='Survival_status_after_5_years', height=5)
plt.show()


**Observation(s)**

 1. Survival status cannot be easily separated based on these plots.
 2. 'Year of Operation' versus 'axillary lymph nodes' gives comparatively better separation than the other scatter plots.

# Conclusion

1. The number of axillary lymph nodes gives comparatively better separation than the other factors.
2. Patients with fewer than four axillary lymph nodes have a better probability of survival.
3. Survival status cannot be easily predicted from this dataset using simple techniques, as the distributions of the attributes overlap heavily.
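
The statement about the four-node threshold can be quantified by grouping patients on that threshold and taking the survival rate in each group. The rows below are made up for illustration, so the printed rates say nothing about the real data; the group labels are also an assumption of this sketch.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'axillary_lymph_nodes': [0, 1, 2, 3, 5, 8, 12, 20],
    'Survival_status_after_5_years':
        ['yes', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'no'],
})

survived = toy['Survival_status_after_5_years'] == 'yes'
group = toy['axillary_lymph_nodes'].lt(4).map(
    {True: '<4 nodes', False: '>=4 nodes'})

# The mean of a boolean column is the share of True values in each group.
rate = survived.groupby(group).mean()
print(rate)
```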