# Haberman Exploratory Data Analysis

# Haberman's Survival Dataset

Dataset: [https://www.kaggle.com/gilsousa/habermans-survival-data-set](https://www.kaggle.com/gilsousa/habermans-survival-data-set)

Title: Haberman's Survival Data

Sources: (a) Donor: Tjen-Sien Lim (limt@stat.wisc.edu) (b) Date: March 4, 1999

Past Usage:

1. Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models, Proceedings of the 9th International Biometrics Conference, Boston, pp. 104-122.
2. Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984), Graphical Models for Assessing Logistic Regression Models (with discussion), Journal of the American Statistical Association 79: 61-83.
3. Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.

Relevant Information: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Number of Instances: 306

Number of Attributes: 4 (including the class attribute)



# Attribute Information:



1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years

Missing Attribute Values: None

# Objective


This is a two-class classification problem:

class 1 = the patient survived 5 years or longer

class 2 = the patient died within 5 years

The goal is to see how well the two classes can be told apart using the given features ***Age, year_of_operation, axillary_nodes***.

In [None]:
#import required modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#suppress the warnings you get in python
import warnings
warnings.filterwarnings("ignore")

In [None]:
# download dataset from... https://www.kaggle.com/gilsousa/habermans-survival-data-set

# Load dataset into pandas dataframe.

h=pd.read_csv('../input/habermans-survival-data-set/haberman.csv',names=['Age','year_of_operation','axillary_nodes','Survival_status'])  #add the column names

In [None]:
h.head()

In [None]:
# how many datapoints and features??
print(h.shape)

In [None]:
# what are the features??
print(h.columns)

In [None]:
print(h.describe())

## Observation:
From the Age column: patient age ranges from 30 to 83.

From the year_of_operation column: the operations took place between 1958 and 1969 (stored as 58 to 69).

From the axillary_nodes column: the maximum number of axillary nodes is 52, whereas 75% of the patients have fewer than 5 nodes and at least 25% of them have 0 nodes.
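
The way the quartile rows of `describe()` translate into statements like these can be sketched on a small sample. The eight node counts below are hypothetical and are not taken from `haberman.csv`; the sketch only shows how the 25% and 75% values relate to shares of patients.

```python
import numpy as np

# Hypothetical node counts for eight patients (illustrative only, not real data).
nodes = np.array([0, 0, 0, 1, 2, 4, 9, 30])

# The same quartiles that describe() reports in its 25%, 50% and 75% rows.
q25, q50, q75 = np.percentile(nodes, [25, 50, 75])

# Share of patients with zero nodes, and share at or below the 75% value.
share_zero = np.mean(nodes == 0)
share_at_or_below_q75 = np.mean(nodes <= q75)

print("quartiles:", q25, q50, q75)
print("share with 0 nodes:", share_zero)
print("share at or below the 75% value:", share_at_or_below_q75)
```

When the 25% value is 0, at least a quarter of the patients have no positive nodes, and the 75% value is the count that three quarters of the patients do not exceed.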

In [None]:
print(h.info())

In [None]:
#data-points per class
print(h['Survival_status'].value_counts())
# balanced vs imbalanced dataset:
# Haberman is imbalanced because the two classes contain very different numbers of data points

In [None]:
print(h['Survival_status'].value_counts(normalize=True))

## Observation:
The dataset is therefore **imbalanced**: about 73.5% of the points belong to class 1 (the patient survived 5 years or longer) and about 26.5% belong to class 2 (the patient died within 5 years).
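
The arithmetic behind these percentages can be checked by hand. The counts 225 and 81 used below are an assumption (the commonly reported class split for this dataset) and should be confirmed against the `value_counts()` output above; the variable names are illustrative.

```python
# Assumed class counts (confirm with h['Survival_status'].value_counts()).
counts = {1: 225, 2: 81}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# How many class 1 points there are for every class 2 point.
imbalance_ratio = counts[1] / counts[2]

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(shares.values())

print("total:", total)
print("shares:", shares)
print("imbalance ratio:", round(imbalance_ratio, 2))
print("majority baseline accuracy:", round(majority_baseline, 3))
```

The majority baseline is worth keeping in mind: on data this imbalanced, any classifier has to beat the accuracy obtained by always predicting class 1 before it can be called useful.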

In [None]:
# print the ages of the patients who died within 5 years (class 2)
q=list(h['Age'][h['Survival_status']==2])
print(q)

# Domain Knowledge - What are axillary lymph nodes in breast?
The body has about 20 to 40 bean-shaped axillary lymph nodes located in the underarm area. These lymph nodes are responsible for draining lymph – a clear or white fluid made up of white blood cells – from the breasts and surrounding areas, including the neck, the upper arms, and the underarm area.


# 2-D Scatter plot

In [None]:
h.plot(x='Age',y='axillary_nodes',kind='scatter')
plt.title("Scatter plot of Age vs axillary_nodes")
plt.grid()
plt.show()

In [None]:
# Age vs axillary_nodes
sns.set_style('whitegrid')

# height replaces the old size argument, which newer seaborn releases no longer accept
sns.FacetGrid(h, hue='Survival_status', height=5) \
   .map(plt.scatter, 'Age', 'axillary_nodes').add_legend()

plt.title("Age vs axillary_nodes")
plt.show()

**Observation:**

1. This plot does not help much in separating the blue (class 1) and orange (class 2) points.
2. The two classes cannot be separated by a straight line in this plot.

In [None]:
# Age vs year_of_operation
sns.set_style('whitegrid')

# height replaces the old size argument, which newer seaborn releases no longer accept
sns.FacetGrid(h, hue='Survival_status', height=5) \
   .map(plt.scatter, 'Age', 'year_of_operation').add_legend()
plt.title("Age vs year_of_operation")
plt.show()

**Observation:**

1. It is hard to separate the two classes in this plot.
2. The classes are not linearly separable.

In [None]:
# axillary_nodes vs year_of_operation
sns.set_style('whitegrid')

# height replaces the old size argument, which newer seaborn releases no longer accept
sns.FacetGrid(h, hue='Survival_status', height=5) \
   .map(plt.scatter, 'axillary_nodes', 'year_of_operation') \
   .add_legend()
plt.title("axillary_nodes vs year_of_operation")
plt.show()

**Observation:**

1. It is also hard to separate the classes here.
2. This plot looks somewhat better than the two plots above.
3. Even so, the classes are still not cleanly separable.
4. Of the three scatter plots, this one can be considered the best.

# Pair-Plots

In [None]:
sns.set_style('whitegrid')
# height replaces the old size argument, which newer seaborn releases no longer accept
sns.pairplot(h, hue="Survival_status", vars=['Age','year_of_operation','axillary_nodes'], height=5)
plt.show()

## observation:
In the pair plots above, **year_of_operation vs axillary_nodes** shows the most separable distribution of blue and orange points, although the classes still overlap.

Hence, **year_of_operation and axillary_nodes** are considered the most useful features for the classification task.

# Perform Univariate Analysis

PDF, CDF, box plot, violin plot

The aim of all of these is to understand which features matter most for the classification task.

In [None]:
hamberman_1=h.loc[h['Survival_status']==1]
hamberman_2=h.loc[h['Survival_status']==2]

plt.plot(hamberman_1['Age'],np.zeros_like(hamberman_1['Age']),'o',label="class 1")
plt.plot(hamberman_2['Age'],np.zeros_like(hamberman_2['Age']),'o',label="class 2")
plt.title("1-D scatter plot of Age")
plt.xlabel("Age")
plt.ylabel("all points plotted at y = 0")
plt.legend()
plt.show()

In [None]:
plt.plot(hamberman_1['year_of_operation'],np.zeros_like(hamberman_1['year_of_operation']),'o',label="class 1")
plt.plot(hamberman_2['year_of_operation'],np.zeros_like(hamberman_2['year_of_operation']),'o',label="class 2")
plt.title("1-D scatter plot of year_of_operation")
plt.xlabel("year_of_operation")
plt.ylabel("all points plotted at y = 0")
plt.legend()
plt.show()

In [None]:
plt.plot(hamberman_1['axillary_nodes'],np.zeros_like(hamberman_1['axillary_nodes']),'o',label="class 1")
plt.plot(hamberman_2['axillary_nodes'],np.zeros_like(hamberman_2['axillary_nodes']),'o',label="class 2")
plt.title("1-D scatter plot of axillary_nodes")
plt.xlabel("axillary_nodes")
plt.ylabel("all points plotted at y = 0")
plt.legend()
plt.show()

## Histograms 

In [None]:
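# histplot with kde=True replaces the deprecated distplot; height replaces the old size argument
sns.FacetGrid(h, hue='Survival_status', height=5)\
   .map(sns.histplot, 'Age', kde=True, stat='density')\
   .add_legend()
plt.title("Histogram of Age")
plt.ylabel("density")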
plt.show()

In [None]:
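# histplot with kde=True replaces the deprecated distplot; height replaces the old size argument
sns.FacetGrid(h, hue='Survival_status', height=5)\
   .map(sns.histplot, 'year_of_operation', kde=True, stat='density')\
   .add_legend()
plt.title("Histogram of year_of_operation")
plt.ylabel("density")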
plt.show()

In [None]:
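# histplot with kde=True replaces the deprecated distplot; height replaces the old size argument
sns.FacetGrid(h, hue='Survival_status', height=5)\
   .map(sns.histplot, 'axillary_nodes', kde=True, stat='density')\
   .add_legend()
plt.title("Histogram of axillary_nodes")
plt.ylabel("density")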
plt.show()

## Observation:
For every feature, the distributions of the two classes overlap heavily; hence, no single feature alone is enough for classification.
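
One way to put a number on "overlap" is the shared area of two normalised histograms. This is not something the notebook computes; it is a minimal sketch on hypothetical, randomly generated ages (the means, spreads and sample sizes are assumptions chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ages for two classes (synthetic, not the Haberman data).
class_1 = rng.normal(52, 11, size=225)
class_2 = rng.normal(54, 10, size=81)

# Use the same bin edges for both classes so the bars are comparable.
edges = np.histogram_bin_edges(np.concatenate([class_1, class_2]), bins=15)
hist_1, _ = np.histogram(class_1, bins=edges)
hist_2, _ = np.histogram(class_2, bins=edges)

# Turn counts into per-bin probabilities.
p1 = hist_1 / hist_1.sum()
p2 = hist_2 / hist_2.sum()

# Overlap coefficient: 1.0 means identical histograms, 0.0 means disjoint ones.
overlap = np.minimum(p1, p2).sum()
print("overlap coefficient:", round(overlap, 3))
```

A value close to 1 says that a threshold on this feature alone cannot separate the classes well, which is the situation the histograms above show visually.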

## PDF's and CDF's

In [None]:
#PDF's

counts , bin_edge=np.histogram(hamberman_1['Age'],bins=10,density=True)

# density: bool, optional

# If False, the result will contain the number of samples in each bin. 

# If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. 

# Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen;

# it is not a probability mass function.

print("counts :" ,counts)
print("bin_edge :", bin_edge)

pdfs= counts/(sum(counts))
cdfs=np.cumsum(pdfs)

print("pdf:",pdfs)
print("cdf :",cdfs)

plt.plot(bin_edge[1:],pdfs,label="PDF")
plt.plot(bin_edge[1:],cdfs,label="CDF")
plt.xlabel("Age from class 1 data")
plt.ylabel("corresponding values")
plt.title("PDF and CDF of Age belong to class 1 ")
plt.legend()
plt.show()

In [None]:
#PDF's

counts , bin_edge=np.histogram(hamberman_2['Age'],bins=10,density=True)
print("counts :" ,counts)
print("bin_edge :", bin_edge)

pdfs= counts/(sum(counts))
cdfs=np.cumsum(pdfs)

print("pdf:",pdfs)
print("cdf :",cdfs)

plt.plot(bin_edge[1:],pdfs,label="PDF")
plt.plot(bin_edge[1:],cdfs,label="CDF")
plt.xlabel("Age from class 2 data")
plt.ylabel("corresponding values")
plt.title("PDF and CDF of Age belong to class 2")
plt.legend()
plt.show()

In [None]:
#PDF's

counts , bin_edge=np.histogram(hamberman_1['year_of_operation'],bins=10,density=True)

print("counts :" ,counts)
print("bin_edge :", bin_edge)

pdfs= counts/(sum(counts))
cdfs=np.cumsum(pdfs)

print("pdf:",pdfs)
print("cdf :",cdfs)

plt.plot(bin_edge[1:],pdfs,label="PDF")
plt.plot(bin_edge[1:],cdfs,label="CDF")
plt.xlabel("year_of_operation from class 1 data")
plt.ylabel("corresponding values")
plt.title("PDF and CDF of year_of_operation belong to class 1")
plt.legend()
plt.show()

In [None]:
#PDF's

counts , bin_edge=np.histogram(hamberman_2['year_of_operation'],bins=10,density=True)

print("counts :" ,counts)
print("bin_edge :", bin_edge)

pdfs= counts/(sum(counts))
cdfs=np.cumsum(pdfs)

print("pdf:",pdfs)
print("cdf :",cdfs)

plt.plot(bin_edge[1:],pdfs,label="PDF")
plt.plot(bin_edge[1:],cdfs,label="CDF")
plt.xlabel("year_of_operation from class 2 data")
plt.ylabel("corresponding values")
plt.title("PDF and CDF of year_of_operation belong to class 2")
plt.legend()
plt.show()

In [None]:
#PDF's

counts , bin_edge=np.histogram(hamberman_1['axillary_nodes'],bins=10,density=True)

print("counts :" ,counts)
print("bin_edge :", bin_edge)

pdfs= counts/(sum(counts))
cdfs=np.cumsum(pdfs)

print("pdf:",pdfs)
print("cdf :",cdfs)

plt.plot(bin_edge[1:],pdfs,label="PDF")
plt.plot(bin_edge[1:],cdfs,label="CDF")
plt.xlabel("axillary_nodes from class 1 data")
plt.ylabel("corresponding values")
plt.title("PDF and CDF of axillary_nodes belong to class 1")
plt.legend()
plt.show()

In [None]:
#PDF's

counts , bin_edge=np.histogram(hamberman_2['axillary_nodes'],bins=10,density=True)

print("counts :" ,counts)
print("bin_edge :", bin_edge)

pdfs= counts/(sum(counts))
cdfs=np.cumsum(pdfs)

print("pdf:",pdfs)
print("cdf :",cdfs)

plt.plot(bin_edge[1:],pdfs,label="PDF")
plt.plot(bin_edge[1:],cdfs,label="CDF")
plt.xlabel("axillary_nodes from class 2 data")
plt.ylabel("corresponding values")
plt.title("PDF and CDF of axillary_nodes belong to class 2")
plt.legend()
plt.show()

## Observation:
**PDF**: PDFs are useful for determining the percentage of points that fall in a given interval.

**CDF**: A CDF is a cumulative sum. It gives the percentage of points whose value is less than or equal to a given value (similar to percentiles).

Integration(PDF) == CDF (like the area under the curve studied in calculus)

Differentiation(CDF) == PDF
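
The two relations can be checked numerically on binned data, where integration becomes a cumulative sum and differentiation becomes a difference between neighbouring bins. The sample below is hypothetical (randomly generated), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample (synthetic values, not the Haberman data).
values = rng.normal(50, 10, size=1000)

# density=True returns the density per bin, not the probability per bin.
density, bin_edges = np.histogram(values, bins=10, density=True)
widths = np.diff(bin_edges)

# Probability mass per bin = density * bin width; the masses sum to 1.
pdf = density * widths

# "Integration": the running sum of the bin masses gives the CDF.
cdf = np.cumsum(pdf)

# "Differentiation": differences of neighbouring CDF values recover the bin masses.
recovered_pdf = np.diff(cdf, prepend=0.0)

print("pdf:", pdf)
print("cdf:", cdf)
```

Because all bins have the same width here, `density * widths` gives the same numbers as dividing the densities by their sum, which is the normalisation used in the cells above.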


# Mean and Std



In [None]:
for i in list(h.columns)[:-1]:
    print('mean of %s in hamberman_1'%i)
    print(np.mean(hamberman_1[i]))
    print('mean of %s in hamberman_2'%i)
    print(np.mean(hamberman_2[i]))


In [None]:
for i in list(h.columns)[:-1]:
    print('std of %s in hamberman_1'%i)
    print(np.std(hamberman_1[i]))
    print('std of %s in hamberman_2'%i)
    print(np.std(hamberman_2[i]))

In [None]:
h.columns

# Median and Quantiles

In [None]:
for i in list(h.columns)[:-1]:   
    print('median of %s in class 1'%i)
    print(np.median(hamberman_1[i]))
    print('median of %s in class 2'%i)
    print(np.median(hamberman_2[i]))

In [None]:
for i in list(h.columns)[:-1]:
    print("Quantiles of %s of hamberman_1:"%i)
    print(np.percentile(hamberman_1[i],np.arange(0,100,25)))
    print("Quantiles of %s of hamberman_2:"%i)
    print(np.percentile(hamberman_2[i],np.arange(0,100,25)))

In [None]:
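# Median Absolute Deviation (MAD)

# MAD = Median( abs( Xi - median ) )

# MAD is less sensitive to outliers than the standard deviation, hence it is used here.

# Note: by default statsmodels' robust.mad divides this value by about 0.6745,
# so that it estimates the standard deviation for normally distributed data.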

from statsmodels import robust

for i in list(h.columns)[:-1]:
    print ("Median Absolute Deviation of %s of hamberman_1:" %i)
    print(robust.mad(hamberman_1[i]))
    print ("Median Absolute Deviation of %s of hamberman_2:" %i)
    print(robust.mad(hamberman_2[i]))

# Box-Plot and Whiskers

In [None]:
sns.boxplot(data=h,x='Survival_status',y='Age',hue='Survival_status')
plt.title("Box plot of Age")
plt.show()

In [None]:
sns.boxplot(data=h,x='Survival_status',y='year_of_operation',hue='Survival_status')
plt.title("Box plot of year_of_operation")
plt.show()

In [None]:
sns.boxplot(data=h,x='Survival_status',y='axillary_nodes',hue='Survival_status')
plt.title("Box plot of axillary_nodes")
plt.show()

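## Observation:
The middle line of each box marks the 50th percentile (the median).

The upper and lower edges of the box mark the 75th and 25th percentiles respectively.

By default the whiskers extend to the farthest data points that lie within 1.5 times the interquartile range from the box edges; points beyond the whiskers are drawn individually as outliers.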
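
The numbers a box plot draws can be computed directly. The sketch below uses a small hypothetical array of node counts (not the real data) and assumes the default whisker rule of 1.5 times the interquartile range.

```python
import numpy as np

# Hypothetical node counts (illustrative only, not taken from haberman.csv).
values = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 11, 23, 52])

# Box: 25th percentile, median and 75th percentile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which a point counts as an outlier (default rule, 1.5 * IQR).
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the limits.
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

For a skewed feature such as axillary_nodes, this explains why the box sits near zero while a long trail of individual points appears above the upper whisker.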

# Violin-plot

In [None]:
sns.violinplot(data=h,x='Survival_status',y='Age',hue='Survival_status')
plt.title("Violin-Plot of Age")
plt.show()

In [None]:
sns.violinplot(data=h,x='Survival_status',y='year_of_operation',hue='Survival_status')
plt.title("Violin-Plot of year_of_operation")
plt.show()

In [None]:
sns.violinplot(data=h,x='Survival_status',y='axillary_nodes',hue='Survival_status')
plt.title("Violin-Plot of axillary_nodes")
plt.show()

## Observation:
A violin plot combines a box plot with a PDF: the box plot is drawn in the middle, and the outline on either side is a kernel density estimate of the data.

# Contour plot

In [None]:
# 2D Density plot,Contour plots

# recent seaborn versions require x and y to be passed as keyword arguments
sns.jointplot(x=hamberman_1['Age'], y=hamberman_1['year_of_operation'], kind='kde')
plt.suptitle("Density plot of Age vs year_of_operation belong to class 1")
plt.show()

## Observation:
Imagine the image above as a mountain rising out of the screen.

The darker areas are the highest points (the densest regions).

As the shade gets lighter, the height drops.
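
The "height" of that mountain is a two-dimensional density. As a rough stand-in for the kernel density estimate that `jointplot` draws, the sketch below uses a plain 2-D histogram from NumPy instead, on hypothetical, randomly generated values (the means and spreads are assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Age and year_of_operation values (synthetic, not the real data).
age = rng.normal(52, 10, size=500)
year = rng.normal(63, 3, size=500)

# 2-D histogram normalised to a density: the "height" above each grid cell.
density, age_edges, year_edges = np.histogram2d(age, year, bins=8, density=True)

# Area of every grid cell, so that density * area is the probability in that cell.
cell_area = np.outer(np.diff(age_edges), np.diff(year_edges))

# Locate the peak of the mountain (the darkest region of a contour plot).
i, j = np.unravel_index(np.argmax(density), density.shape)
peak_age = (age_edges[i] + age_edges[i + 1]) / 2
peak_year = (year_edges[j] + year_edges[j + 1]) / 2

print("peak near Age =", round(peak_age, 1), "and year =", round(peak_year, 1))
```

A contour line joins all locations of equal height, so tightly packed contours around the peak indicate a sharp concentration of patients in that region.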


# Conclusion 

1. In the Haberman dataset, the feature distributions of the two classes mostly overlap.
2. It is hard to separate the blue points and the orange points, linearly or in any other way.
3. Of all the features, year_of_operation and axillary_nodes help the most in classifying the points, though only to a limited extent (not the best, not the worst).
4. This analysis also gave some intuition about how Age, year_of_operation and axillary_nodes relate to Survival_status.