**Exploratory Data Analysis With Haberman Dataset**

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

* Number of Instances: 306
* Number of Attributes: 4 (including the class attribute)

**Attribute Information:**
* Age of patient at time of operation (numerical)
* Patient's year of operation (year - 1900, numerical)
* Number of positive axillary nodes detected (numerical)
* Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years
* Missing attribute values: None

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline



In [None]:
df_input = pd.read_csv("../input/haberman.csv")
df_input.shape
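
The shape reported above depends on whether pandas treats the first line of the file as a header. The following is a minimal sketch, assuming the CSV has no header row (an assumption, since the file contents are not shown here). The sample rows are illustrative stand-ins, not a copy of the file.

```python
import io
import pandas as pd

# Illustrative sample in the same four-column layout (not the real file).
sample_csv = "30,64,1,1\n30,62,3,1\n31,59,2,2\n"
columns = ["Age", "Year_of_Operation", "positive_axillary_node", "survival_status"]

# Default behaviour: the first record is consumed as column names.
with_default = pd.read_csv(io.StringIO(sample_csv))

# header=None keeps every record as data; names= sets the column labels.
with_names = pd.read_csv(io.StringIO(sample_csv), header=None, names=columns)

print(with_default.shape)
print(with_names.shape)
```

If the real file does contain a header row, the default call is the right one and `header=None` would turn the header into a data row.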

In [None]:
columns =["Age","Year_of_Operation","positive_axillary_node","survival_status"]
df_input.columns = columns
df = df_input.copy()
df_input.head()

**Making Dataset more readable**

In [None]:
# Map the numeric class codes to readable labels (1 = survived 5+ years, 2 = died within 5 years)
labels = {1: "Survived more than 5 years", 2: "Survived less than 5 years"}
df["survival_status"] = df["survival_status"].map(labels)

In [None]:
df.head()

In [None]:
df.describe()

**Observation**

* The average age of the patients surveyed is 52.5 years.
* The average positive axillary node count is about 4.
* The middle half of the patients (25th to 75th percentile) were between 44 and 61 years old.
* Most patients had a positive axillary node count between 0 and 4.


In [None]:
df.isnull().sum()

**Missing values**

From this we infer that the dataset is clean and that no feature has any missing values.

In [None]:

print("Number of rows in dataset:", df.shape[0])
print("Number of columns in dataset:", df.shape[1])
# Split the data by class label and count each group
df_not_survived = df[df.survival_status == "Survived less than 5 years"]
ptnt_died_within5yrs = df_not_survived.shape[0]
df_survived = df[df.survival_status == "Survived more than 5 years"]
patient_survived = df_survived.shape[0]
print(f"{ptnt_died_within5yrs} patients died within 5 years of the operation")
print(f"{patient_survived} patients survived 5 years or longer")

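**Observation**
1. The loaded DataFrame has 305 rows, one fewer than the 306 instances listed in the dataset description. This is consistent with `read_csv` treating the first record of the file as a header row.
2. There are 3 features plus the class attribute.
3. Of the patients who had a breast cancer operation, 224 survived 5 years or longer and 81 died within 5 years of the operation.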
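
The class counts above can also be expressed as proportions, which makes the imbalance between the two classes explicit. This is a minimal sketch on a small hypothetical label column; in the notebook the input would be `df.survival_status`.

```python
import pandas as pd

# Hypothetical labels standing in for df.survival_status.
status = pd.Series(
    ["Survived more than 5 years"] * 3 + ["Survived less than 5 years"]
)

counts = status.value_counts()                # absolute count per class
shares = status.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(2))
```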


**UNIVARIATE ANALYSIS**

CDF and PDF 

In [None]:
counts,bins  = np.histogram(df.positive_axillary_node,bins = 100,density = True)
pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bins[1:],pdf,label = 'PDF')
plt.plot(bins[1:], cdf,label = 'CDF')
plt.title("positive_axillary_node")
plt.legend()


**Observation :**

From the PDF and CDF plot of positive_axillary_node above, the probability density is highest for counts close to zero and decreases gradually. Thus the majority of patients surveyed had a positive_axillary_node count close to zero.
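
The same three steps (histogram, normalise, cumulative sum) are applied to each feature in this section. As a sketch, they could be wrapped in a helper. The function name `empirical_pdf_cdf` and the sample values are hypothetical and not part of the original notebook.

```python
import numpy as np

def empirical_pdf_cdf(values, bins=10):
    """Return right bin edges, per-bin probability mass, and its running sum."""
    counts, edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()  # fraction of observations in each bin
    cdf = np.cumsum(pdf)         # fraction of observations up to each right edge
    return edges[1:], pdf, cdf

# Hypothetical node counts, skewed towards zero like the feature discussed above.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 20])
edges, pdf, cdf = empirical_pdf_cdf(nodes, bins=5)
print(edges)
print(pdf)
print(cdf)
```

Strictly speaking, `pdf` here is the probability mass per bin rather than a continuous density. Passing `edges` with `pdf` and `cdf` to `plt.plot` gives curves of the kind shown in this section.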

In [None]:
counts,bins  = np.histogram(df.Age,bins = 50,density = True)
pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bins[1:],pdf,label = 'PDF')
plt.plot(bins[1:], cdf,label = 'CDF')
plt.title("Age")
plt.legend()

**Observation :**

No comparable trend is visible here, but the probability density is comparatively higher between ages 45 and 55; that is, more of the patients surveyed fall in this age range.

In [None]:
counts,bins  = np.histogram(df.Year_of_Operation,bins = 50,density = True)
pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bins[1:],pdf, label='PDF')
plt.plot(bins[1:], cdf,label = 'CDF')
plt.title("Year_of_Operation")
plt.legend()



**Observation:**

The number of patients surveyed is fairly uniform across the operation years, which run from 1958 to 1969 (values 58 to 69 on the axis).

In [None]:
df_not_survived.Age.mean()

In [None]:
df_survived.Age.mean()


In [None]:
# sns.distplot is deprecated; sns.histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(df, hue="survival_status", height=4, aspect=4).map(sns.histplot, "Age", bins=100, kde=True, stat="density").add_legend()
sns.FacetGrid(df, hue="survival_status", height=4, aspect=4).map(sns.histplot, "positive_axillary_node", bins=200, kde=True, stat="density").add_legend()
sns.FacetGrid(df, hue="survival_status", height=4, aspect=4).map(sns.histplot, "Year_of_Operation", bins=100, kde=True, stat="density").add_legend()


In [None]:
# One row per survival class; sns.histplot replaces the deprecated sns.distplot
sns.FacetGrid(df, row="survival_status", hue="survival_status", height=4, aspect=4).map(sns.histplot, "Year_of_Operation", kde=True, stat="density").add_legend()
sns.FacetGrid(df, row="survival_status", hue="survival_status", height=4, aspect=4).map(sns.histplot, "positive_axillary_node", kde=True, stat="density").add_legend()
sns.FacetGrid(df, row="survival_status", hue="survival_status", height=4, aspect=4).map(sns.histplot, "Age", kde=True, stat="density").add_legend()

**Observations from histograms**

* From these plots we can infer that patients aged 30 to 40 have a comparatively higher chance of surviving more than 5 years after the operation.
* The majority of patients surveyed are between 45 and 55 years old.
* Patients aged 78 to 82 have a lower chance of surviving more than 5 years after the operation.
* positive_axillary_node is strongly related to survival status: the majority of patients who survived more than 5 years after the operation had a positive_axillary_node count close to zero.
* No comparably strong relation was found between year of operation and survival status.
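
The age-related observations are read off the plots by eye. One way to put numbers on them is to bin the ages and compute the share of survivors per band. This is a sketch on hypothetical rows, with band edges chosen purely for illustration; in the notebook the input would be `df_input`, where `survival_status` is still coded 1 or 2.

```python
import pandas as pd

# Hypothetical rows in the notebook's column layout (not the real data).
toy = pd.DataFrame({
    "Age": [32, 35, 38, 47, 50, 52, 54, 79, 80],
    "survival_status": [1, 1, 1, 1, 2, 1, 2, 2, 2],  # 1 = survived 5+ years
})

# Bin ages into bands, then take the share of survivors in each band.
bands = pd.cut(toy["Age"], bins=[30, 40, 60, 85])
survived = toy["survival_status"].eq(1)
rate_by_band = survived.groupby(bands, observed=True).mean()
print(rate_by_band)
```

Bands that contain only a handful of patients give unstable rates, so the group sizes are worth checking alongside the rates.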

**Exploratory data analysis using box plots**

In [None]:
sns.boxplot(x = "survival_status", y ="Age" , data = df ).set_title("Age")


In [None]:
sns.boxplot(x = "survival_status", y= "positive_axillary_node",data = df).set_title("positive_axillary_node")


In [None]:
sns.boxplot(x = "survival_status", y= "Year_of_Operation",data = df).set_title("Year_of_Operation")

**Observation:**

* The majority of patients who survived more than 5 years after the operation have a positive_axillary_node count close to zero.
* The positive_axillary_node feature has some outliers.
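
The points a box plot draws individually are, with the default whisker setting, those lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule directly to hypothetical node counts (not the real data).

```python
import numpy as np

# Hypothetical node counts (not the real data).
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 20, 46])

q1, q3 = np.percentile(nodes, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # fences of the 1.5 * IQR rule

# Values outside the fences are the ones a box plot marks as outliers.
outliers = nodes[(nodes < lower) | (nodes > upper)]
print(outliers)
```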

**Exploratory data analysis using violin plots**

In [None]:
sns.violinplot(x="survival_status",y = "Year_of_Operation",data =df).set_title("Year_of_Operation")

In [None]:
sns.violinplot(x="survival_status",y = "Age",data =df).set_title("Age")

In [None]:
sns.violinplot(x = "survival_status" , y= "positive_axillary_node",data = df).set_title("positive_axillary_node")

**Observation from violin plots**
* The violin plots also show that the majority of patients who survived more than 5 years after the operation have a positive_axillary_node count close to zero.

**Bivariate analysis**

In [None]:
sns.set_style("darkgrid")
# Use df so the legend shows the readable survival labels instead of 1 and 2
sns.pairplot(df, hue="survival_status", height=5)

**Conclusion**
* Patients with a positive axillary node count close to zero have a greater chance of surviving 5 years after the operation.
* Positive axillary node count is the most useful of the three features for predicting the survival status of patients.
* No clear conclusion can be drawn from the "Age" and "Year of operation" features, as the distributions for the two survival classes overlap.
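
The first conclusion can be backed by a per-class summary of the node counts rather than by the plots alone. This is a minimal sketch on hypothetical rows with shortened, made-up labels; in the notebook the same `groupby` would run on `df`.

```python
import pandas as pd

# Hypothetical rows (not the real data); the labels are shortened for the example.
toy = pd.DataFrame({
    "positive_axillary_node": [0, 0, 1, 2, 3, 7, 11, 23],
    "survival_status": ["survived"] * 4 + ["died"] * 4,
})

# Median and mean node count for each survival class.
summary = (
    toy.groupby("survival_status")["positive_axillary_node"]
    .agg(["median", "mean"])
)
print(summary)
```

The median is the more robust of the two columns here, since the node counts are skewed and contain outliers.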