**EXPLORATORY DATA ANALYSIS : HABERMAN DATASET**

**Data Description :**

The Haberman's survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

**Attribute Information :**

1. Age of patient at time of operation (numerical)

2. Patient's year of operation (year - 1900, numerical)

3. Number of positive axillary nodes detected (numerical)

4. Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years

**OBJECTIVE :**

**To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age, year of operation and number of positive axillary nodes.**


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np



# Load haberman.csv into a pandas DataFrame.

haberman=pd.read_csv("../input/habermans-survival-data-set/haberman.csv",names=['age', 'operation_year', 'axil_nodes', 'survival_status'])



**HIGH LEVEL STATISTICS OF THE DATASET :**


In [None]:
haberman.head(10)

**Number of points :**

In [None]:
haberman.shape

**Observations :**

*   Number of data points (rows) = 306.
*   Number of columns = 4.





**Number of Features :**

In [None]:
haberman.columns

**Observation :**
*   The column names assigned at load time match the attribute information provided with the dataset.




In [None]:
haberman.info()

In [None]:
haberman['survival_status'].value_counts()

**Observations :**

*   225 patients survived 5 years or longer, and 81 died within 5 years.
*   The class label survival_status is stored as an integer and needs to be converted to a categorical datatype.
*   The dataset is imbalanced but complete: no values are missing.
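
To put a number on the imbalance noted above, the class counts can be turned into proportions. This is a minimal sketch that rebuilds the two reported counts (225 and 81) by hand instead of reading haberman.csv; the variable names are illustrative only.

```python
import pandas as pd

# Class counts as reported above (1 = survived 5+ years, 2 = died within 5 years).
class_counts = pd.Series({1: 225, 2: 81}, name="survival_status")

# Share of each class in the whole dataset.
class_share = class_counts / class_counts.sum()
print(class_share.round(3))

# Ratio of majority to minority class; 1.0 would mean perfectly balanced.
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

On the real DataFrame, `haberman['survival_status'].value_counts(normalize=True)` should give the same proportions directly.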





In [None]:
# Relabel the class label "survival_status": 1 -> "yes" (survived), 2 -> "no" (did not survive).
haberman['survival_status'] = haberman['survival_status'].map({1:"yes", 2:"no"})
# Check the updated survival_status counts.
haberman["survival_status"].value_counts()
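
The mapping above yields plain string labels, not a pandas categorical. If a true categorical dtype is wanted, one possible approach is sketched below on a small hypothetical Series standing in for the real column; the category order `["yes", "no"]` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the raw integer labels in haberman["survival_status"].
raw_status = pd.Series([1, 2, 1, 1, 2], name="survival_status")

# Map the integer codes to readable labels, then store them as a categorical.
status_dtype = pd.CategoricalDtype(categories=["yes", "no"])
status = raw_status.map({1: "yes", 2: "no"}).astype(status_dtype)

print(status.dtype)
print(status.value_counts())

# Any code outside {1, 2} would have become NaN in map(); check that none did.
unmapped = int(status.isna().sum())
print("unmapped labels:", unmapped)
```

Checking for unmapped values is a cheap guard: `map` silently turns any unexpected code into a missing value.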

**Number of classes :**

In [None]:
print(haberman["survival_status"].unique())

**Data-points per class :**


In [None]:
print(haberman.groupby("survival_status").count())

**Analysis of Dataset through Mean, Variance and Standard deviation :**

In [None]:
status_yes=haberman.loc[haberman["survival_status"]=="yes"]
status_no=haberman.loc[haberman["survival_status"]=="no"]

print("SURVIVAL STATUS : YES -> STATISTICS :")
print(status_yes.describe())
print("\n****************************************************************************\n")
print("SURVIVAL STATUS : NO -> STATISTICS :")
print(status_no.describe())

**SURVIVAL STATUS : YES**
  
  **Observations :**


*   Number of patients who survived : 225
*   Average age of patients who survived : 52
*   Minimum age of patients who survived : 30
*   Maximum age of patients who survived : 77

**SURVIVAL STATUS : NO**

  **Observations :**


*   Number of patients who did not survive : 81
*   Average age of patients who did not survive : 53
*   Minimum age of patients who did not survive : 34
*   Maximum age of patients who did not survive : 83


  

**Analysis of Dataset through Medians, quantiles, median absolute deviation :**

In [None]:
print("MEDIANS :\n")
print("Median age of the people who survived : ",np.median(status_yes["age"]))
print("Median age of the people who could not survive : ", np.median(status_no["age"]))
print("Median Positive axillary nodes in the people who survived : ", np.median(status_yes["axil_nodes"]))
print("Median Positive axillary nodes in the people who could not survive :  ", np.median(status_no["axil_nodes"]))

print("\n************************************************************************************************\n")

print("QUANTILES :\n")
print("Survival status : Yes")
print("AGE :",np.percentile(status_yes["age"], np.arange(0, 100, 25)))
print("NODES : ", np.percentile(status_yes["axil_nodes"], np.arange(0,100,25)))
print("Survival Status : No")
print("AGE :",np.percentile(status_no["age"], np.arange(0, 100, 25)))
print("NODES : ", np.percentile(status_no["axil_nodes"], np.arange(0,100,25)))

print("\n************************************************************************************************\n")

from statsmodels import robust
print("MEDIAN ABSOLUTE DEVIATION :\n")
print("Survival Status : Yes")
print("AGE :",robust.mad(status_yes["age"]))
print("NODES :",robust.mad(status_yes["axil_nodes"]))
print("Survival Status : No")
print("AGE :",robust.mad(status_no["age"]))
print("NODES :",robust.mad(status_no["axil_nodes"]))
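
To make the median absolute deviation (MAD) figures above easier to interpret, here is a minimal NumPy-only sketch of the computation on a hypothetical sample. As far as we know, `robust.mad` applies a normal-consistency scaling of this kind by default, but that should be checked against the statsmodels documentation for the installed version.

```python
import numpy as np

# Hypothetical node counts with one extreme value (not the real data).
sample = np.array([1, 2, 3, 4, 100])

# Raw MAD: median of the absolute deviations from the median.
raw_mad = np.median(np.abs(sample - np.median(sample)))

# Scale so the result is comparable to a standard deviation for normal data.
# 0.6744897501960817 is the 75th percentile of the standard normal distribution.
NORMAL_Q75 = 0.6744897501960817
scaled_mad = raw_mad / NORMAL_Q75

print("raw MAD:", raw_mad)
print("scaled MAD:", round(scaled_mad, 4))
print("standard deviation:", round(sample.std(), 4))
```

The single extreme value inflates the standard deviation but barely moves the MAD, which is why the MAD is useful for a skewed feature such as axil_nodes.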


**Observation :**


*   This is a binary classification problem: we need to predict whether the patient will survive 5 years or longer based on the patient's age, year of operation and number of positive axillary nodes.



**UNIVARIATE ANALYSIS :**

   

**HISTOGRAMS :**

In [None]:
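# Analysis of patient age
# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is its replacement.
sns.set_style("whitegrid")
sns.FacetGrid(haberman, hue="survival_status", height=6)\
    .map(sns.histplot, "age", kde=True, stat="density")\
    .add_legend()

plt.title('Histogram of ages of patients', fontsize=17)
plt.show()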

**Observation :**


*   Most of the patients who survived are in the 40-60 age range.



In [None]:
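# Analysis of operation year
# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is its replacement.
sns.FacetGrid(haberman, hue="survival_status", height=6)\
    .map(sns.histplot, "operation_year", kde=True, stat="density")\
    .add_legend()

plt.title('Histogram of operation year of patients', fontsize=17)
plt.show()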

**Observations :**


*   Patients operated on in year 60 had the highest survival rate.
*   Patients operated on in years 63-66 had the lowest survival rate.
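
A density histogram shows where each class concentrates, not the survival rate itself. A rate needs the number of survivors divided by the number of patients operated on in that year. The sketch below shows one way to compute it on a hypothetical miniature table; the values are made up and only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the real values come from haberman.csv.
toy = pd.DataFrame({
    "operation_year": [60, 60, 60, 63, 63, 65, 65, 65],
    "survival_status": ["yes", "yes", "no", "no", "yes", "no", "no", "yes"],
})

# Survival rate = survivors / all patients operated on in that year.
survived = toy["survival_status"].eq("yes")
rate_by_year = survived.groupby(toy["operation_year"]).mean()
print(rate_by_year)

# Keep the group sizes next to the rates: a rate from few patients is unreliable.
patients_by_year = toy.groupby("operation_year").size()
print(patients_by_year)
```

Applied to the real `haberman` DataFrame, the same two lines would let the per-year claims be checked numerically.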



In [None]:
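# Analysis of axillary nodes
# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is its replacement.
sns.FacetGrid(haberman, hue="survival_status", height=6)\
    .map(sns.histplot, "axil_nodes", kde=True, stat="density")\
    .add_legend()

plt.title('Histogram of axillary nodes detected', fontsize=17)
plt.show()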

**Observation :**


*   Patients with 0 positive axillary nodes have the highest survival rate.



**PDF AND CDF :**

In [None]:
plt.figure(figsize=(20,6))
plt.subplot(131)
counts,bin_edges=np.histogram(status_yes["age"],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,linewidth=3.0)
plt.plot(bin_edges[1:],cdf,linewidth=3.0)
plt.xlabel('AGE')
plt.ylabel("PROBABILITY")
plt.title('PDF-CDF of AGE for Survival Status = YES')
plt.legend(['PDF-AGE', 'CDF-AGE'], loc = 5,prop={'size': 12})

plt.subplot(132)
counts,bin_edges=np.histogram(status_yes["operation_year"],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,linewidth=3.0)
plt.plot(bin_edges[1:],cdf,linewidth=3.0)
plt.xlabel('YEAR OF OPERATION')
plt.ylabel("PROBABILITY")
plt.title('PDF-CDF of OPERATION YEAR for Survival Status = YES')
plt.legend(['PDF-OPERATION YEAR', 'CDF-OPERATION YEAR'], loc = 5,prop={'size': 11})

plt.subplot(133)
counts,bin_edges=np.histogram(status_yes["axil_nodes"],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,linewidth=3.0)
plt.plot(bin_edges[1:],cdf,linewidth=3.0)
plt.xlabel('AXILLARY NODES')
plt.ylabel("PROBABILITY")
plt.title('PDF-CDF of AXIL NODES for Survival Status = YES')
plt.legend(['PDF-AXIL NODES', 'CDF-AXIL NODES'], loc = 5,prop={'size': 11})
plt.show()


**Observation :**


*   About 90% of the patients who survived have 10 or fewer positive axillary nodes.
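
A share like this can also be read off without binning, by computing the empirical CDF directly. The following is a minimal sketch on hypothetical node counts; with the real data, `status_yes["axil_nodes"]` would take the place of `nodes`.

```python
import numpy as np

# Hypothetical node counts for ten patients (not the real data).
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 23])

# Empirical CDF at a threshold: fraction of patients with nodes <= threshold.
threshold = 10
share_at_or_below = np.mean(nodes <= threshold)
print(f"share with nodes <= {threshold}: {share_at_or_below:.2f}")

# Full empirical CDF without binning: sort the values and step from 1/n to 1.
sorted_nodes = np.sort(nodes)
ecdf = np.arange(1, len(sorted_nodes) + 1) / len(sorted_nodes)
print(list(zip(sorted_nodes.tolist(), ecdf.tolist())))
```

Unlike the histogram-based CDF, this version does not depend on the number of bins, so the value at a threshold is exact for the sample.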



**Box Plot and Whiskers :**

In [None]:
sns.boxplot(x = "survival_status" , y = "age", data = haberman )
plt.title("1. Box plot for survival_status and Age")
plt.show()

sns.boxplot(x = "survival_status" , y = "operation_year", data = haberman )
plt.title("\n2. Box plot for survival_status and Operation Year")
plt.show()

sns.boxplot(x = 'survival_status', y = 'axil_nodes', data = haberman)
plt.title("\n3. Box plot for survival_status and Axillary Nodes")
plt.show()

**Observations :**


*   From box plot 1, patients younger than 34 all survived: the youngest patient who did not survive was 34.
*   No patient older than 77 survived 5 years or longer: the oldest survivor was 77.
*   From box plot 3, we can conclude that the higher the number of positive axillary nodes, the higher the chance of death.
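
For reference, the quantities a box plot draws can be computed by hand. The sketch below uses hypothetical values and the usual 1.5 × IQR whisker convention, which is seaborn's default; a real column such as `status_yes["age"]` could be substituted for `values`.

```python
import numpy as np

# Hypothetical values; replace with a real column such as status_yes["age"].
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Quartiles drawn as the box edges and the centre line.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Comparing these numbers per class gives the same information as the plots, in a form that is easier to quote in the observations.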


**VIOLIN PLOTS :**

In [None]:
sns.violinplot(x = 'survival_status', y = 'age', data = haberman)
plt.title("Violin plot for survival_status and Age")
plt.show()

sns.violinplot(x = 'survival_status', y = 'operation_year', data = haberman)
plt.title("\nViolin plot for survival_status and Operation Year")
plt.show()

sns.violinplot(x = 'survival_status', y = 'axil_nodes', data = haberman)
plt.title("\nViolin plot for survival_status and Axillary Node")
plt.show()

**CONTOUR PLOT :**

In [None]:
sns.jointplot(x="age",y="operation_year",data=haberman, kind="kde")
plt.show()

sns.jointplot(x="age",y="axil_nodes",data=haberman, kind="kde")
plt.show()

sns.jointplot(x="operation_year",y="axil_nodes",data=haberman, kind="kde")
plt.show()

**BI-VARIATE ANALYSIS :**


**SCATTER PLOT :**

In [None]:
# AGE VS AXILLARY NODES
sns.FacetGrid(haberman, hue="survival_status", height=6) \
   .map(plt.scatter, "age", "axil_nodes") \
   .add_legend();
plt.show();

**Observations :**


*   Patients with age < 40 and axillary nodes < 30 have a higher chance of survival.
*   Patients with age > 50 and axillary nodes > 10 have a lower chance of survival.
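
Thresholds read off a scatter plot can be checked by bucketing the feature and computing the survival rate per bucket. This is a rough sketch on made-up data; the cut points (0 and 10) and the bucket labels are illustrative assumptions, not values derived from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset (not the real values).
toy = pd.DataFrame({
    "axil_nodes": [0, 0, 0, 2, 5, 9, 12, 20, 35],
    "survival_status": ["yes", "yes", "yes", "yes", "no", "yes", "no", "no", "no"],
})

# Bucket node counts; the cut points 0 and 10 are illustrative choices.
node_bucket = pd.cut(
    toy["axil_nodes"],
    bins=[-1, 0, 10, float("inf")],
    labels=["0", "1-10", ">10"],
)

# Patients and survival rate per bucket.
summary = (
    toy.assign(survived=toy["survival_status"].eq("yes"))
       .groupby(node_bucket, observed=True)["survived"]
       .agg(patients="size", survival_rate="mean")
)
print(summary)
```

The same pattern works for age by swapping the column and the bin edges.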


In [None]:
# AXILLARY NODES VS OPERATION YEAR
sns.FacetGrid(haberman, hue="survival_status", height=6) \
   .map(plt.scatter, "axil_nodes", "operation_year") \
   .add_legend();

plt.show();

**Observation :**


*   This scatter plot yields little information: the data points of the two classes do not form a pattern that can be read from a 2-D scatter plot.


In [None]:
#AGE VS OPERATION YEAR
sns.FacetGrid(haberman, hue="survival_status", height=6) \
   .map(plt.scatter, "operation_year", "age") \
   .add_legend();
plt.show();

**Observation :**


*   Operation years 60, 61 and 68 have a higher survival rate.



**PAIR PLOTS :**





In [None]:
plt.close()
sns.pairplot(haberman,hue="survival_status",height=3.5)
plt.show()

**Observation :**


*   From the pair plots above, we can conclude that the two classes are not linearly separable.


**TOTAL OBSERVATIONS :**

*   This is a binary classification problem: we need to predict whether the patient will survive 5 years or longer based on the patient's age, year of operation and number of positive axillary nodes.
*   The dataset is imbalanced but complete: no values are missing.
*   The class label "survival_status" is stored as an integer and needs to be converted to a categorical datatype.
*   The class label "survival_status" is now labelled yes and no, where "yes" means survived and "no" means did not survive.
*   Most of the patients who survived are in the 40-60 age range.
*   Patients operated on in year 60 had the highest survival rate.
*   Patients operated on in years 63-66 had the lowest survival rate.
*   Patients with 0 positive axillary nodes have the highest survival rate.
*   About 90% of the patients who survived have 10 or fewer positive axillary nodes.
*   Patients younger than 34 all survived.
*   No patient older than 77 survived 5 years or longer.
*   The higher the number of positive axillary nodes, the higher the chance of death.
*   Patients with age < 40 and axillary nodes < 30 have a higher chance of survival.
*   Patients with age > 50 and axillary nodes > 10 have a lower chance of survival.
*   Operation years 60, 61 and 68 have a higher survival rate.
*   From the pair plots, we can conclude that the two classes are not linearly separable.
