**Details about Haberman's dataset**

The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Number of Instances: 306

Number of Attributes: 4 (including the class attribute)

**Attribute Information:**
* Age of patient at time of operation (numerical)
* Patient's year of operation (year - 1900, numerical)
* Number of positive axillary nodes detected (numerical)
* Survival status (class attribute)
    * 1 = the patient survived 5 years or longer 
    * 2 = the patient died within 5 years
    
Missing Attribute Values: None
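The encodings listed above can be made more readable before any analysis. The following is a minimal sketch, assuming the column names used later in this notebook. The three rows are made-up values for illustration, and `operation_year` and `survival_label` are hypothetical helper columns that do not exist in the original file.

```python
import pandas as pd

# Toy rows in the dataset's layout (illustrative values, not taken from the file)
sample = pd.DataFrame({
    "age_patients": [30, 45, 62],
    "year_operation": [64, 58, 69],  # stored as year - 1900
    "positive_axillary_nodes": [1, 0, 12],
    "survival_status": [1, 1, 2],
})

# Recover the calendar year from the "year - 1900" encoding
sample["operation_year"] = sample["year_operation"] + 1900

# Map the class codes to descriptive labels (hypothetical label text)
status_labels = {1: "survived >= 5 years", 2: "died within 5 years"}
sample["survival_label"] = sample["survival_status"].map(status_labels)

print(sample[["operation_year", "survival_label"]])
```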

**Objective**

Predict whether a patient will survive 5 years or longer after a breast cancer operation, or die within 5 years, given the following parameters:
 * patient's age at the time of operation
 * year of operation
 * number of positive axillary nodes

**Loading data**

Let us load the data to understand it better.

In [None]:
import numpy as np
import seaborn as sns
import pandas as pd

# Read the CSV and supply column names, since the file has no header row
habermans=pd.read_csv("../input/haberman.csv",
                      names=["age_patients","year_operation","positive_axillary_nodes","survival_status"])
habermans

In [None]:
#High Level statistics on the data
print("Number of datapoints in the Haberman's dataset ",habermans.shape[0])
print("Number of features in the Haberman's dataset ",habermans.shape[1]-1)#because last column is class label
print("Count of each class in habermans dataset \n",habermans["survival_status"].value_counts())

**Observations on the above data**

* The Haberman dataset contains 306 rows, i.e. information about 306 operations.
* 225 patients survived 5 years or longer after the operation.
* 81 patients died within 5 years of the operation.
* The dataset is imbalanced: about 73.5% of patients (225 of 306) survived 5 years or longer, against about 26.5% (81 of 306) who did not.
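The imbalance can be quantified directly from the class counts reported above (225 and 81). The following is a minimal sketch that rebuilds only the label column from those counts rather than reading the CSV. The names `labels`, `class_share` and `majority_baseline` are illustrative.

```python
import pandas as pd

# Rebuild the label column from the reported counts (225 in class 1, 81 in class 2)
labels = pd.Series([1] * 225 + [2] * 81, name="survival_status")

# Share of each class; normalize=True returns fractions instead of counts
class_share = labels.value_counts(normalize=True)
print(class_share.round(3))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any classifier built on this data would need to beat this baseline, since always predicting class 1 is already right for about 73.5% of patients.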

**Univariate Analysis**

Let us understand which features affect the column survival_status by performing univariate analysis (i.e. analyzing one feature at a time).

*Analyzing if the patient's age affects the classification*

Constructing histograms/PDFs to check whether age affects the survival_status of the patient


In [None]:
import matplotlib.pyplot as plt
sns.FacetGrid(hue="survival_status",data=habermans,height=4).map(sns.distplot,"age_patients").add_legend()
plt.title("Histogram/PDF of patient age")
plt.show()

**Observations**

The PDFs of patient age for the two classes overlap almost completely, so the survival status cannot be clearly distinguished. Hence, the age of the patient alone does not determine whether the patient will survive 5 years after the operation.


In [None]:
#creating CDF plot
age_survival1=habermans[habermans["survival_status"]==1].age_patients
age_survival2=habermans[habermans["survival_status"]==2].age_patients

count_survival1,bin_survival1=np.histogram(age_survival1,bins=10,density=True)
count_survival2,bin_survival2=np.histogram(age_survival2,bins=10,density=True)

#calculating Pdf
pdf1=(count_survival1/sum(count_survival1))
pdf2=count_survival2/sum(count_survival2)

#calculating cdf
cdf1=np.cumsum(pdf1)
cdf2=np.cumsum(pdf2)

#plotting
plt.plot(bin_survival1[1:],pdf1,label="pdf_survivalStatus_1")
plt.plot(bin_survival1[1:],cdf1,label="cdf_survivalStatus_1")

plt.plot(bin_survival2[1:],pdf2,label="pdf_survivalStatus_2")
plt.plot(bin_survival2[1:],cdf2,label="cdf_survivalStatus_2")

plt.xlabel("Age of patients")
plt.ylabel("Probability")
plt.title("CDF of age_patients")
plt.legend()
plt.show()

**Observation**

In both classes, most patients are older than 50: the CDF rises from about 0.4 at age 50 to 1.0, so roughly 60% of patients fall above that age. The CDF curves of the two classes almost coincide, so age does not separate them.
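The binned CDF above depends on the choice of 10 bins. As an alternative, an empirical CDF can be computed without binning. The following is a minimal sketch, where `ecdf` is a hypothetical helper and the ages are made-up values used only to show the mechanics.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the fraction of observations <= each value."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Made-up ages for illustration only
ages = [34, 41, 47, 50, 52, 55, 58, 61, 66, 72]
x, y = ecdf(ages)

# Fraction of patients older than 50, i.e. 1 - F(50)
frac_over_50 = 1 - np.mean(np.asarray(ages) <= 50)
print(frac_over_50)
```

With the real data, passing `age_survival1` and `age_survival2` to `ecdf` and plotting each result with `plt.plot(x, y)` would give the two curves to compare.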

In [None]:
#drawing box plot and violin plot to understand more

plt.close()
plt.title("Box plot for age_patients")
sns.boxplot(x="survival_status",y="age_patients",data=habermans)
plt.show()

**Observations**

The data cannot be classified by patient age. The box plots of the two classes overlap almost entirely, so a model built on age alone would misclassify a large share of patients.

In [None]:
plt.close()
plt.title("Violin plot for age_patients")
sns.violinplot(x="survival_status",y="age_patients",data=habermans)
plt.show()

**Observations**

* The 50th percentile is almost the same for both survival classes: the median age is around 52 years whether the patient survived 5 years or longer or died within 5 years.
* In both classes, about 75% of patients are around 60 years old or younger.
* Age alone cannot determine whether a patient will survive less than 5 years or more than 5 years after the operation. Hence, a model cannot be built on age_patients alone.
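The percentiles read from the box and violin plots can also be computed numerically. The following is a minimal sketch using `groupby` and `quantile` on made-up rows. With the real data, `toy` would be replaced by `habermans`.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "age_patients": [38, 45, 52, 60, 67, 40, 48, 53, 61, 70],
    "survival_status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

# 25th, 50th and 75th percentile of age for each survival class;
# unstack() turns the quantile level into columns
age_quartiles = (
    toy.groupby("survival_status")["age_patients"]
       .quantile([0.25, 0.50, 0.75])
       .unstack()
)
print(age_quartiles)
```

Similar quartiles in both rows would support the observation that age does not separate the classes.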

*Analyzing if the year of operation affects the classification*

Constructing histograms/PDFs to check whether year_operation affects the survival status

In [None]:
sns.FacetGrid(habermans,hue="survival_status",height=4).map(sns.distplot,"year_operation").add_legend()
plt.title("Histogram/PDF for year_operation")
plt.show()

plt.close()
plt.title("Box plot for year_operation")
sns.boxplot(x="survival_status",y="year_operation",data=habermans)
plt.show()

plt.close()
plt.title("Violin plot for year_operation")
sns.violinplot(y="year_operation",x="survival_status",data=habermans)
plt.show()

**Observation**

year_operation does not affect the survival status: the histograms of the two classes overlap completely. In other words, the year of operation does not help determine whether a patient will survive less than or more than 5 years after the operation. This matches common sense, since the year of operation by itself should not decide the outcome (unless, say, a particular doctor performed all operations that year or a medical advance was introduced that year).

In [None]:
#CDF for year_operation
survivalStatus_1=habermans[habermans["survival_status"]==1]
survivalStatus_2=habermans[habermans["survival_status"]==2]

count1_yr,bin1_year=np.histogram(survivalStatus_1.year_operation,bins=10)
count2_yr,bin2_year=np.histogram(survivalStatus_2.year_operation,bins=10)

pdf1=count1_yr/sum(count1_yr)
pdf2=count2_yr/sum(count2_yr)

cdf1=np.cumsum(pdf1)
cdf2=np.cumsum(pdf2)

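plt.plot(bin1_year[1:],pdf1,label="pdf_survivalStatus_1")
plt.plot(bin1_year[1:],cdf1,label="cdf_survivalStatus_1")
plt.plot(bin2_year[1:],pdf2,label="pdf_survivalStatus_2")
plt.plot(bin2_year[1:],cdf2,label="cdf_survivalStatus_2")

plt.xlabel("year of operation")
plt.ylabel("probability")
plt.title("CDF for year_operation")
plt.legend()
plt.show()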

**Observation**

About 90% of the operations took place by 1968 (year_operation up to 68), and this holds both for patients who survived 5 years or longer and for those who died within 5 years.

*Analyzing if the positive axillary nodes affect the survival status*

Check whether positive_axillary_nodes affects the cancer survival status

In [None]:
sns.FacetGrid(hue="survival_status",height=4,data=habermans).map(sns.distplot,"positive_axillary_nodes").add_legend()
plt.show()

**Observation**

The number of positive axillary nodes alone does not determine the survival status either, since the PDFs overlap, but it is a better indicator than age and year of operation.

In [None]:
#CDF plot for patients who survived 5 years or longer (survival_status == 1)
count_node,bin_node=np.histogram(survivalStatus_1.positive_axillary_nodes,bins=10)
pdf_node=count_node/sum(count_node)
cdf_node=np.cumsum(pdf_node)

plt.plot(bin_node[1:],pdf_node,label="pdf_survivalStatus_1")
plt.plot(bin_node[1:],cdf_node,label="cdf_survivalStatus_1")
plt.xlabel("Positive Axillary Nodes")
plt.ylabel("Probability")
plt.title("CDF for positive_axillary_nodes")
plt.legend()
plt.show()

**Observations**

Among patients who survived 5 years or longer (the class plotted above), nearly 90% have fewer than 10 positive axillary nodes. The remaining 10% have between 10 and 48 positive axillary nodes.
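The CDF cell above is computed from `survivalStatus_1` only, so it says nothing about class 2. The share of patients with fewer than 10 nodes can be computed for both classes at once. The following is a minimal sketch on made-up rows, where `below_10` and `share_below_10` are illustrative names. With the real data, `toy` would be replaced by `habermans`.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "positive_axillary_nodes": [0, 0, 1, 3, 14, 0, 2, 9, 11, 30],
    "survival_status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

# True where the patient has fewer than 10 positive nodes
below_10 = toy["positive_axillary_nodes"].lt(10)

# The mean of a boolean column is the fraction of True values, here per class
share_below_10 = below_10.groupby(toy["survival_status"]).mean()
print(share_below_10)
```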

In [None]:
#plotting box plot 
plt.close()
sns.boxplot(x="survival_status",y="positive_axillary_nodes",data=habermans)
plt.show()

**Observation**

The 25th and 50th percentiles of the two survival classes appear close together in the box plot, i.e. a sizeable share of operated patients have similar positive axillary node counts whether they survived more than 5 years or died within 5 years of the operation.
There is no clear-cut separation between the two classes, so a model cannot be built on positive axillary nodes alone. For example, if a threshold of 9 nodes is used to identify patients who will survive longer than 5 years, more than 50% of the patients who survived less than 5 years might end up on the "survivor" side of the threshold.
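The effect of a threshold rule like the one described above can be checked by counting how each class falls on either side of it. The following is a minimal sketch, assuming the rule "predict class 1 when the node count is at or below the threshold". `evaluate_threshold` is a hypothetical helper and the arrays are made-up values, not taken from the dataset.

```python
import numpy as np

def evaluate_threshold(nodes, status, threshold):
    """Predict class 1 if nodes <= threshold, else class 2.

    Returns the overall accuracy and the fraction of true class-2
    patients that the rule wrongly predicts as class 1.
    """
    nodes = np.asarray(nodes)
    status = np.asarray(status)
    predicted = np.where(nodes <= threshold, 1, 2)
    accuracy = np.mean(predicted == status)
    class2 = status == 2
    missed_class2 = np.mean(predicted[class2] == 1)
    return accuracy, missed_class2

# Made-up node counts and labels for illustration only
nodes = [0, 1, 2, 4, 12, 0, 3, 8, 15, 23]
status = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

accuracy, missed_class2 = evaluate_threshold(nodes, status, threshold=9)
print(accuracy, missed_class2)
```

With the real data, `nodes` and `status` would be `habermans["positive_axillary_nodes"]` and `habermans["survival_status"]`, and the accuracy can be compared against always predicting the majority class.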

In [None]:
#violin plot
plt.close()
sns.violinplot(x="survival_status",y="positive_axillary_nodes",data=habermans)
plt.show()

**Observation**

If a threshold of more than 50 nodes is used to identify patients who survive less than 5 years after the operation, such a rule would be right for only about 10% of them, because only about 10% of the class 2 points differ from class 1 (patients surviving more than 5 years after the operation).
Positive axillary node counts from 0.5 to 50 overlap completely between the two classes, so identifying the survival status is difficult for those points. Any point lying outside this range can be classified as class 2 (patients who died within 5 years of the operation).

**Bi-Variate Analysis**

No single variable alone helps us differentiate the survival status. Let us check whether a combination of two features does.

In [None]:
#building 2D scatter plot to understand the relationship between patients age and positive axillary nodes to determine
#survival rate
sns.FacetGrid(habermans,hue="survival_status",height=4).map(plt.scatter,"age_patients","positive_axillary_nodes").add_legend()
plt.show()

In [None]:
sns.FacetGrid(habermans,hue="survival_status",height=4).map(plt.scatter,"age_patients","year_operation").add_legend()
plt.show()

In [None]:
sns.FacetGrid(habermans,hue="survival_status",height=4).map(plt.scatter,"year_operation","positive_axillary_nodes").add_legend()
plt.show()

**Observation**

Looking at all the 2D scatter plots above, no combination of two features separates the survival classes.
This suggests that the available data is insufficient to determine the survival status of a new patient.

In [None]:
#Let us draw pair plot to understand more
plt.close();
sns.set_style("whitegrid");
sns.pairplot(habermans, hue="survival_status", vars=["age_patients","year_operation","positive_axillary_nodes"],height=3);
plt.show()

**Observations**

No combination of features in the pair plot clearly separates the two survival classes. The data is insufficient to determine whether a patient will survive less than 5 years or more than 5 years after the operation.

**Summary**

A model cannot be built because the data is insufficient. More data points are needed to determine whether a patient will survive less than 5 years or more than 5 years after a breast cancer operation.