Haberman's Survival: Exploratory Data Analysis
**DESCRIPTION** Haberman's survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
**Attribute Information:**
1. Age of patient at the time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years

In [None]:
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np 
can_df=pd.read_csv('../input/habermans-survival-data-set/haberman.csv',names=['age','year_of_operation','axillary_lymph_node','survival_status_after_5_years'])
can_df

In [None]:
can_df.shape  # (number of data points, number of features)

**Statistics**


In [None]:

# mean: the average value of each feature.
# 25%: the value at or below which 25 percent of the patients fall (likewise for 50% and 75%).
print(can_df.describe())

In [None]:
print(can_df.info())

In [None]:
can_df['survival_status_after_5_years'].value_counts()  # number of patients in each class (1 = survived, 2 = died)
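
Raw counts are easier to compare as proportions, which also show how imbalanced the two classes are. The sketch below is a minimal illustration: `class_share` is a hypothetical helper, and the `status` series holds made-up labels standing in for `can_df['survival_status_after_5_years']`.

```python
import pandas as pd

# Made-up labels for illustration only; in the notebook this would be
# can_df['survival_status_after_5_years'].
status = pd.Series([1, 1, 1, 2, 1, 2, 1, 1], name="survival_status_after_5_years")

def class_share(labels: pd.Series) -> pd.Series:
    """Return the fraction of rows that fall in each class label."""
    return labels.value_counts(normalize=True).sort_index()

shares = class_share(status)
print(shares)
```

If one class dominates, accuracy alone would be a misleading yardstick for any later model.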

> **BIVARIATE ANALYSIS**

**Scatter plots**

In [None]:
sns.set_style("whitegrid")
sns.FacetGrid(can_df,hue="survival_status_after_5_years",height=4) \
   .map(plt.scatter,"axillary_lymph_node","survival_status_after_5_years") \
   .add_legend()
plt.show()

In [None]:
sns.set_style("whitegrid")
sns.FacetGrid(can_df,hue="survival_status_after_5_years",height=4) \
   .map(plt.scatter,"age","year_of_operation") \
   .add_legend()
plt.show()

In [None]:
sns.set_style("whitegrid")
sns.FacetGrid(can_df,hue="survival_status_after_5_years",height=4) \
   .map(plt.scatter,"age","axillary_lymph_node") \
   .add_legend()
plt.show()

*Pair plots show every pairwise scatter plot at once, which is useful when there are many features*

In [None]:
sns.set_style("whitegrid")
sns.pairplot(can_df,hue='survival_status_after_5_years', height=3)
plt.show()

*Observations*


1. The two classes overlap heavily in every pair of features, so the dataset is not linearly separable.
2. Simple if-else conditions on the features cannot be used to separate the classes.
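
One way to probe the second observation is to measure how well a single if-else rule performs. The sketch below is only an illustration: `threshold_rule_accuracy` is a hypothetical helper, and the arrays are made-up values rather than rows from the dataset.

```python
import numpy as np

def threshold_rule_accuracy(values, status, threshold):
    """Accuracy of the rule: predict 1 (survived) if value <= threshold, else 2."""
    values = np.asarray(values)
    status = np.asarray(status)
    predicted = np.where(values <= threshold, 1, 2)
    return float(np.mean(predicted == status))

# Made-up node counts and class labels, for illustration only.
nodes = np.array([0, 1, 0, 12, 3, 0, 7, 2])
status = np.array([1, 1, 2, 2, 1, 1, 1, 2])

# Try a few cut-offs and report the accuracy of each rule.
for t in (0, 2, 5):
    print(t, threshold_rule_accuracy(nodes, status, t))
```

With the real data, the same function could be called on `can_df['axillary_lymph_node']` and the survival column. If no threshold gets close to perfect accuracy, that is consistent with the observation above.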


**UNIVARIATE ANALYSIS**

*Histograms* 

In [None]:
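# One histogram with a KDE curve per feature, colored by survival status.
# sns.distplot is deprecated; sns.histplot(kde=True) is its replacement.
for feature in can_df.columns[:-1]:
    sns.FacetGrid(can_df,hue='survival_status_after_5_years',height=5) \
       .map(sns.histplot,feature,kde=True,stat='density') \
       .add_legend()
plt.show()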


*pdf and cdf for the features*

In [None]:
counts,bin_edges=np.histogram(can_df['age'],bins=10,density=True)
counts1,bin_edges1=np.histogram(can_df['year_of_operation'],bins=10,density=True)
counts2,bin_edges2=np.histogram(can_df['axillary_lymph_node'],bins=10,density=True)
print(counts,counts1,counts2)

In [None]:
pdf=counts/sum(counts)
pdf1=counts1/sum(counts1)
pdf2=counts2/sum(counts2)
cdf=np.cumsum(pdf)
cdf1=np.cumsum(pdf1)
cdf2=np.cumsum(pdf2)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.xlabel('age')
label=['pdf','cdf']
plt.legend(label)


In [None]:
plt.plot(bin_edges1[1:],pdf1)
plt.plot(bin_edges1[1:],cdf1)
plt.xlabel('year_of_operation')
label=['pdf','cdf']
plt.legend(label)

In [None]:
plt.plot(bin_edges2[1:],pdf2)
plt.plot(bin_edges2[1:],cdf2)
plt.xlabel('axillary_lymph_node')
label=['pdf','cdf']
plt.legend(label)
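
The three cells above repeat the same steps for each feature: bin the values, normalize the bin counts so they sum to 1, then take a running sum. A minimal sketch of that recipe as one reusable function follows. `pdf_cdf` is a hypothetical helper name, and the `ages` array is made up for illustration.

```python
import numpy as np

def pdf_cdf(values, bins=10):
    """Return (bin_edges, pdf, cdf), where pdf holds the fraction of values per bin."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations in each bin
    cdf = np.cumsum(pdf)          # running total; the last entry is 1.0
    return bin_edges, pdf, cdf

# Made-up ages for illustration only; in the notebook this would be can_df['age'].
ages = np.array([30, 34, 38, 41, 45, 47, 52, 55, 60, 63, 70, 83])
edges, pdf, cdf = pdf_cdf(ages, bins=5)
print(pdf)
print(cdf)
```

Because the bins have equal width, normalizing raw counts here gives the same per-bin fractions as normalizing the `density=True` counts used in the cells above. The pair can be plotted as before with `plt.plot(edges[1:], pdf)` and `plt.plot(edges[1:], cdf)`.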

*Box plots and whiskers*
The box marks the 25th, 50th, and 75th percentiles of a feature (age, year of operation, axillary lymph nodes) for each survival class.
The whiskers extend to the most extreme data points that lie within 1.5 × the interquartile range (IQR) of the box edges.
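
The quantities a box plot draws can also be computed directly. The sketch below assumes the common 1.5 × IQR whisker rule; `box_plot_summary` is a hypothetical helper, and the node counts are made up for illustration.

```python
import numpy as np

def box_plot_summary(values):
    """Quartiles, IQR, and whisker ends under the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = values[(values >= low_limit) & (values <= high_limit)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_whisker": inside.min(),   # most extreme point within the limits
        "upper_whisker": inside.max(),
        "outliers": values[(values < low_limit) | (values > high_limit)],
    }

# Made-up node counts for illustration only (not taken from the dataset).
nodes = [0, 0, 1, 1, 2, 3, 4, 4, 20]
summary = box_plot_summary(nodes)
print(summary)
```

Points outside the whisker limits, such as the value 20 in this toy array, are the ones a box plot draws as individual markers.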


In [None]:
sns.boxplot(x='survival_status_after_5_years',y='axillary_lymph_node',data=can_df)
plt.show()

In [None]:
sns.boxplot(x='survival_status_after_5_years',y='year_of_operation',data=can_df)
plt.show()

*Violin plots* (histogram + box plot)
The outer curve is the estimated PDF of the feature, the white dot is the median (50th percentile), and the thin lines are the upper and lower whiskers.

In [None]:
sns.violinplot(x='survival_status_after_5_years',y='year_of_operation',data=can_df)
plt.show()

In [None]:
sns.violinplot(x='survival_status_after_5_years',y='age',data=can_df)
plt.show()

**MULTI-VARIATE PROBABILITY DENSITY**

In [None]:
sns.jointplot(x='age',y='axillary_lymph_node',data=can_df,kind='kde')
plt.show()

In [None]:
sns.jointplot(x='year_of_operation',y='survival_status_after_5_years',data=can_df,kind='kde')
plt.show()

**CONCLUSIONS**
1. The dataset is not linearly separable.
2. Patients with 0-5 positive axillary lymph nodes have a high rate of survival after 5 years.
3. About 75 percent of the patients who survived were younger than 60.
4. Patients aged 40-65 tend to have a high number of axillary lymph nodes.
5. Operations performed in the years 59-66 (1959-1966) show a high rate of survival after 5 years.
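
Conclusions like the one about node counts can be checked numerically rather than read off a plot. The following is a minimal sketch: `survival_rate` is a hypothetical helper, the cut-off of 5 nodes is taken from the list above, and the arrays are made-up values rather than the real data.

```python
import numpy as np

def survival_rate(status, mask):
    """Fraction of patients selected by mask whose status is 1 (survived 5+ years)."""
    status = np.asarray(status)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return float("nan")   # no patients in this group
    return float(np.mean(status[mask] == 1))

# Made-up node counts and class labels, for illustration only.
nodes = np.array([0, 1, 0, 12, 3, 0, 7, 2, 25, 9])
status = np.array([1, 1, 1, 2, 1, 2, 1, 1, 2, 2])

few = survival_rate(status, nodes <= 5)    # patients with 0-5 positive nodes
many = survival_rate(status, nodes > 5)    # patients with more than 5
print(few, many)
```

On the real data, the same comparison could be run with `can_df['axillary_lymph_node']` and `can_df['survival_status_after_5_years']`, and the mask could be swapped for an age or year range to check the other conclusions.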

