# Exploratory Data Analysis (EDA) on Haberman Cancer Survival Dataset

Download the Haberman Cancer Survival dataset from Kaggle: https://www.kaggle.com/gilsousa/habermans-survival-data-set

  


# Objective:
To explore the Haberman Cancer Survival Dataset and find which feature, or combination of features, is helpful in determining a patient's survival status 5 years after the operation.



In [None]:
# Importing the necessary packages
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [None]:
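# Importing the dataset
import os
print(os.listdir("../input"))
# Note: read_csv treats the first row as the header by default. If haberman.csv
# has no header row, pass header=None so the first record is not consumed as column names.
df = pd.read_csv('../input/haberman.csv')

# Print the number of data points and columns.
print(df.shape)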

In [None]:
#Check the columns in the dataset.
df.columns

In [None]:
# Assign column names to the dataset and recheck them.
df.columns = ["Age", "Operation_year", "Axil_nodes", "Surv_status"]
print(df.columns)

In [None]:
# Data points per class.
# Surv_status: 1 = Survived, 2 = Died
df["Surv_status"] = df["Surv_status"].apply(lambda x: "Survived" if x == 1 else "Died")
df["Surv_status"].value_counts()

In [None]:
df.head(10)  # show the first 10 rows

In [None]:
df.describe()

### Observations:
1) The dataset has 4 columns (3 features plus the class label Surv_status) and 305 data points.  

2) The dataset covers patients aged between 30 and 83 years who underwent cancer surgery between 1958 and 1969.

3) Almost 75% of the patients had 0-4 axil nodes, at least 25% had 0 nodes, and the maximum is 52 axil nodes.

4) The dataset has 224 data points labeled "1" and 81 data points labeled "2", where Surv_status 1 = Survived (the patient survived 5 years or longer) and 2 = Died (the patient died within 5 years).

5) The dataset is imbalanced (see observation 4).
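
The degree of imbalance can be quantified directly from the class counts. The sketch below is only an illustration: it reuses the two counts quoted in the observations (224 and 81) instead of reading the CSV, and the variable names are assumptions.

```python
import pandas as pd

# Class counts as reported in the observations above (not re-read from the CSV).
class_counts = pd.Series({"Survived": 224, "Died": 81})

# Share of each class in the dataset.
class_share = class_counts / class_counts.sum()

# Ratio of the majority class to the minority class.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_share.round(3))
print("Majority-to-minority ratio:", round(imbalance_ratio, 2))
```

On the notebook's DataFrame, `df["Surv_status"].value_counts(normalize=True)` should give the same shares.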

## Univariate Analysis:  
Univariate analysis explores a single feature/variable at a time. It includes histograms, PDFs, CDFs, etc.
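
As a rough illustration of how a histogram relates to a PDF and a CDF, the sketch below turns bin counts into per-bin probabilities and then accumulates them. The helper name `histogram_pdf_cdf` and the toy values are hypothetical and are not taken from the dataset.

```python
import numpy as np

def histogram_pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probability mass (PDF) and its running total (CDF)."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()  # fraction of points falling in each bin
    cdf = np.cumsum(pdf)         # fraction of points up to each bin's right edge
    return bin_edges, pdf, cdf

# Hypothetical toy values, used only to show the shape of the outputs.
toy_values = np.array([0, 0, 1, 2, 2, 3, 5, 8])
edges, pdf, cdf = histogram_pdf_cdf(toy_values, bins=4)
print("bin edges:", edges)
print("pdf:", pdf)
print("cdf:", cdf)
```

Because the PDF values sum to 1, the last CDF value is always 1, which is a quick sanity check for this kind of plot.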

In [None]:
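# Distribution of Operation Year
# Note: newer seaborn releases use `height` instead of `size` for FacetGrid,
# and sns.distplot is deprecated there in favour of sns.histplot / sns.kdeplot.
sns.FacetGrid(df, hue="Surv_status", height=5)\
.map(sns.distplot, "Operation_year").add_legend()
plt.show()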

### Observations:

1) The operation-year distributions of the two survival classes overlap, so no conclusion about a patient's survival status can be drawn from the year of operation alone. The one exception is that patients who underwent surgery between 1959 and 1963 appear to have a higher probability of survival.



In [None]:
# Distribution of patient age at the time of operation
sns.FacetGrid(df, hue="Surv_status", height=5)\
.map(sns.distplot, "Age").add_legend()
plt.show()

### Observations:
1) The distributions overlap, so no major information can be gained.  
2) Patients younger than 40 years have a higher chance of survival, and patients older than 78 years are most likely to die within 5 years of surgery.

In [None]:
# Distribution of axil nodes
sns.FacetGrid(df, hue="Surv_status", height=5)\
.map(sns.distplot, "Axil_nodes").add_legend()
plt.show()

### Observations:
1) About 95% of the patients have between 0 and 25 axil nodes.  
2) Patients with 0-3 axil nodes had a higher chance of survival.  
3) The distributions overlap, so we cannot find a cut-off point and "if-else" conditions to build a simple model that separates survival from death using these observations.
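
Claims such as "patients with 0-3 axil nodes had a higher chance of survival" can be checked numerically by bucketing the feature and computing the survival rate per bucket. The sketch below assumes the column names used in this notebook; the helper `survival_rate_by_bucket`, the bin edges and the toy rows are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

def survival_rate_by_bucket(frame, column, bin_edges, label="Surv_status"):
    """Fraction of rows labeled 'Survived' within each bucket of `column`."""
    buckets = pd.cut(frame[column], bins=bin_edges, include_lowest=True)
    survived = frame[label].astype(str).eq("Survived")
    return survived.groupby(buckets, observed=False).mean()

# Hypothetical toy rows, NOT taken from the Haberman dataset.
toy = pd.DataFrame({
    "Axil_nodes": [0, 1, 2, 3, 4, 10, 30],
    "Surv_status": ["Survived", "Survived", "Died", "Survived",
                    "Died", "Died", "Survived"],
})
rates = survival_rate_by_bucket(toy, "Axil_nodes", bin_edges=[0, 3, 25, 52])
print(rates)
```

Calling the same helper with `df` in place of `toy` would give the per-bucket survival rates for the real data.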

### PDF and CDF:

In [None]:
Survived = df.loc[df["Surv_status"] == "Survived"]
Died = df.loc[df["Surv_status"] == "Died"]

plt.figure(figsize=(20, 5))
for i, feature in enumerate(df.columns[:-1], start=1):
    plt.subplot(1, 3, i)

    # Survived
    counts, bin_edges = np.histogram(Survived[feature], bins=20, density=True)
    pdf = counts / sum(counts)
    cdf = np.cumsum(pdf)  # accumulate the normalized PDF so the CDF ends at 1
    plt.plot(bin_edges[1:], cdf, label="CDF of survived", color="red")
    plt.plot(bin_edges[1:], pdf, label="PDF of survived", color="black")

    # Died
    counts, bin_edges = np.histogram(Died[feature], bins=20, density=True)
    pdf = counts / sum(counts)
    cdf = np.cumsum(pdf)
    plt.plot(bin_edges[1:], cdf, label="CDF of died")
    plt.plot(bin_edges[1:], pdf, label="PDF of died")

    plt.xlabel(feature)
    plt.grid()
    plt.legend()
plt.show()

### Observations:
1) In this dataset, patients aged 32-36 all survived, and none of the patients aged 77-85 survived.

2) No insight into survival status can be drawn from the year of operation, as both classes are evenly distributed across the years. The exception is that patients who underwent surgery between 1961 and 1965 have a slightly higher probability of survival.  

3) Patients with fewer than 22 axil nodes have a better probability of survival, and patients with 0-2 axil nodes are the most likely to survive.  

In [None]:
# Box plot
print("********************************* Box Plot ***********************************************")
plt.figure(figsize=(20, 5))
for j, feature in enumerate(df.columns[:-1], start=1):
    plt.subplot(1, 3, j)
    sns.boxplot(x='Surv_status', y=feature, data=df)
    plt.grid()  # draw the grid on every subplot, not only the last one
plt.show()

print("*********************************** Violin Plot ******************************************")
# Violin plot
plt.figure(figsize=(20, 5))
for k, feature in enumerate(df.columns[:-1], start=1):
    plt.subplot(1, 3, k)
    sns.violinplot(x='Surv_status', y=feature, data=df)
    plt.grid()
plt.show()


### Observations:

1) No major conclusion can be drawn from these plots because the data points overlap (i.e. they are scattered within the same range of values).  

2) For patients who survived, the axil node counts are concentrated in the 0-5 range.
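
The quartiles that a box plot draws can also be printed as numbers, which makes the overlap between classes easier to compare. This is a minimal sketch assuming the notebook's column names; the helper `quartiles_by_class` and the toy rows are hypothetical.

```python
import pandas as pd

def quartiles_by_class(frame, column, label="Surv_status"):
    """Return the 25th, 50th and 75th percentiles of `column` for each class."""
    return frame.groupby(label)[column].quantile([0.25, 0.5, 0.75]).unstack()

# Hypothetical toy rows, NOT taken from the Haberman dataset.
toy = pd.DataFrame({
    "Axil_nodes": [0, 1, 2, 3, 4, 2, 4, 6, 8, 10],
    "Surv_status": ["Survived"] * 5 + ["Died"] * 5,
})
summary = quartiles_by_class(toy, "Axil_nodes")
print(summary)
```

Each row of the result corresponds to one box in the plot: the three columns are the lower edge, the median line and the upper edge of the box.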

## Bi-variate analysis:

In [None]:
# Pair plot
df['Surv_status'] = df['Surv_status'].astype('category')
plt.close()
sns.set_style("whitegrid")
# `height` replaces the older `size` argument in newer seaborn releases
sns.pairplot(df, hue="Surv_status", vars=["Age", "Operation_year", "Axil_nodes"], height=3)
plt.show()

### Observations:
The classes are highly mixed: none of the variable pairs yields linearly separable clusters, so we cannot find "lines" or "if-else" conditions to build a simple model that classifies the patient's survival status.
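
One way to test whether a single "if-else" condition is useful is to compare its accuracy with that of always predicting the majority class. The sketch below assumes the notebook's column names; both helper functions, the threshold and the toy rows are hypothetical and serve only to show the comparison.

```python
import pandas as pd

def threshold_rule_accuracy(frame, column, threshold, label="Surv_status"):
    """Accuracy of the rule: predict 'Survived' if column <= threshold, else 'Died'."""
    predicted = frame[column].le(threshold).map({True: "Survived", False: "Died"})
    return (predicted == frame[label].astype(str)).mean()

def majority_baseline_accuracy(frame, label="Surv_status"):
    """Accuracy obtained by always predicting the most frequent class."""
    return frame[label].value_counts(normalize=True).max()

# Hypothetical toy rows, NOT taken from the Haberman dataset.
toy = pd.DataFrame({
    "Axil_nodes": [0, 1, 2, 3, 4, 10, 30],
    "Surv_status": ["Survived", "Survived", "Died", "Survived",
                    "Died", "Died", "Survived"],
})
rule_acc = threshold_rule_accuracy(toy, "Axil_nodes", threshold=3)
base_acc = majority_baseline_accuracy(toy)
print("rule accuracy:", round(rule_acc, 3))
print("majority baseline:", round(base_acc, 3))
```

Because the dataset is imbalanced, a rule is only informative if it clearly beats the majority baseline.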