# Exploratory Data Analysis on Haberman's Cancer Survival Data Set

###### Introduction:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

### Info on the data set

1. 306 data points (rows)

2. 4 columns: 3 features plus the survival status (class label)


##### Features Information:

This data set has 4 columns:

1. Age of patient at time of operation
2. Patient's year of operation (year - 1900)
3. Number of positive axillary nodes detected
4. Survival status 
   1. 1 = the patient survived 5 years or longer (labelled `GT_5y`)
   2. 2 = the patient died within 5 years (labelled `LT_5y`)

##### Objective of analysis

The main objective of this analysis is to estimate the chances that a patient who has undergone the operation survives 5 years or longer.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np


# Load haberman.csv into a pandas DataFrame.
haberman = pd.read_csv("../input/haberman-dataset/haberman.csv")

In [None]:
# Number of data points (rows) and features (columns)
print(haberman.shape[0], "data points")

print(haberman.shape[1], "features")

In [None]:
haberman.columns

In [None]:
haberman["surv_status"].value_counts()

##### Observation:

The dataset looks imbalanced: the two survival classes are not equally represented.

We need to analyse it further to find out whether any conclusion can be drawn from it.
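
The degree of imbalance can be put into numbers. The sketch below uses a small made-up DataFrame rather than the real file; only the column name `surv_status` and the `GT_5y` / `LT_5y` labels are taken from this notebook, and the variable names `toy`, `shares` and `imbalance_ratio` are illustrative.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
toy = pd.DataFrame({"surv_status": ["GT_5y"] * 3 + ["LT_5y"]})

# Absolute counts per class, as in the cell above.
counts = toy["surv_status"].value_counts()

# Relative share of each class (sums to 1).
shares = toy["surv_status"].value_counts(normalize=True)

# Ratio of the largest class to the smallest class; 1.0 means balanced.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print("majority/minority ratio:", imbalance_ratio)
```

Running the same three lines on `haberman["surv_status"]` would give the actual class shares for this dataset.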

We start with 2D scatter plots.


# 2D-Scatter Plot

#### age vs axil_nodes

In [None]:
sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="surv_status", height=4) \
   .map(plt.scatter, "age", "axil_nodes") \
   .add_legend();
plt.show();

#### axil_nodes vs op_year

In [None]:
sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="surv_status", height=4) \
   .map(plt.scatter, "axil_nodes", "op_year") \
   .add_legend();
plt.show();

#### age vs op_year

In [None]:
sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="surv_status", height=4) \
   .map(plt.scatter, "age", "op_year") \
   .add_legend();
plt.show();

## Pair-Plot

In [None]:
plt.close();
sns.set_style("whitegrid");
sns.pairplot(haberman, hue = "surv_status", height=4);
plt.show()

#### Observation(s):

1. None of the 2D scatter plots lets us distinguish patients who survive from those who do not.
2. For every pair of features, the two survival classes overlap heavily.
3. We therefore move on to a univariate analysis of the data.


# Univariate Analysis

### PDF or Histogram

##### PDF on Age for survival status

In [None]:

sns.FacetGrid(haberman, hue="surv_status", height=5) \
   .map(sns.distplot, "age") \
   .add_legend();
plt.title("Distribution plot on age for survival status")
plt.ylabel("Density")
plt.show();


##### PDF on Operation Year for survival status

In [None]:

sns.FacetGrid(haberman, hue="surv_status", height=5) \
   .map(sns.distplot, "op_year") \
   .add_legend();
plt.title("Distribution plot on operation year for survival status")
plt.ylabel("Density")
plt.show();

##### PDF on Axil nodes identified for survival status

In [None]:

sns.FacetGrid(haberman, hue="surv_status", height=8) \
   .map(sns.distplot, "axil_nodes") \
   .add_legend();
plt.title("Distribution plot on no. of Axil nodes identified for survival status")
plt.ylabel("Density")
plt.show();

#### Observations:

1. Of all the given features, only the density plot of axil_nodes supports a reasonable conclusion.
2. A patient is more likely to survive more than 5 years when the number of axil nodes found is in the range 0-2.
3. However, some patients with 0-2 axil nodes still died within 5 years.


To get a more precise percentage of survival based on the number of axil nodes found, we continue with the CDF.
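
The CDF cells in this notebook all repeat the same histogram-to-CDF steps. As a sketch only, those steps can be wrapped in one helper; the name `binned_pdf_cdf` and the sample values are hypothetical. It uses raw counts instead of `density=True`, which gives the same result after normalisation because `np.histogram` produces equal-width bins when `bins` is an integer.

```python
import numpy as np

def binned_pdf_cdf(values, bins=10):
    """Return (right bin edges, per-bin probability mass, cumulative mass)."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # share of observations in each bin
    cdf = np.cumsum(pdf)          # running total, ends at 1.0
    return bin_edges[1:], pdf, cdf

# Made-up sample, for illustration only.
sample = np.array([0, 0, 1, 1, 2, 3, 5, 8, 13, 21])
edges, pdf, cdf = binned_pdf_cdf(sample, bins=5)

for edge, p, c in zip(edges, pdf, cdf):
    print(f"<= {edge:5.1f}: pdf={p:.2f} cdf={c:.2f}")
```

With such a helper, each class would need only one call followed by two `plt.plot(edges, ...)` calls.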

## CDF 

In [None]:
haberman_gt_5 = haberman.loc[haberman["surv_status"] == "GT_5y"];
haberman_lt_5 = haberman.loc[haberman["surv_status"] == "LT_5y"];


### CDF on Operation year

In [None]:

labels = ["pdf of GT_5", "cdf of GT_5", "pdf of LT_5", "cdf of LT_5"]
counts, bin_edges = np.histogram(haberman_gt_5['op_year'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)

counts, bin_edges = np.histogram(haberman_lt_5['op_year'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)

plt.title("pdf and cdf on operation year")
plt.xlabel("year")
plt.ylabel("Probability")
plt.legend(labels)
plt.show();

### CDF on age

In [None]:

labels = ["pdf of GT_5", "cdf of GT_5", "pdf of LT_5", "cdf of LT_5"]
counts, bin_edges = np.histogram(haberman_gt_5['age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)


counts, bin_edges = np.histogram(haberman_lt_5['age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)



plt.title("pdf and cdf on age")
plt.xlabel("age")
plt.ylabel("Probability")
plt.legend(labels)
plt.show();

### CDF on axil nodes

In [None]:


counts, bin_edges = np.histogram(haberman_gt_5['axil_nodes'], bins=40, 
                                 density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)


counts, bin_edges = np.histogram(haberman_lt_5['axil_nodes'], bins=40, 
                                 density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
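plt.plot(bin_edges[1:], cdf)

# Label the plot the same way as the operation-year and age plots
plt.title("pdf and cdf on axil nodes")
plt.xlabel("axil nodes")
plt.ylabel("Probability")
plt.legend(["pdf of GT_5", "cdf of GT_5", "pdf of LT_5", "cdf of LT_5"])
plt.show();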

### Observations:

1. Based on operation year: the two survival classes cannot be separated, because their CDFs follow each other closely.
2. Based on age:
    1. Reading the CDF of the `GT_5y` class, about 18% of the patients who survived more than 5 years were younger than 38.
3. Based on axillary nodes, we can conclude the following:
    1. The plot suggests roughly a 30% chance of surviving if fewer than 5 axillary nodes are found.
    2. If more than 47 axil nodes are found, no patient in this dataset survived more than 5 years.
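
Values read off a binned CDF plot are approximate, because the curve is only defined at the bin edges. A class-conditional fraction can also be computed exactly from the raw values. The sketch below assumes nothing beyond NumPy; the helper name `fraction_below` and the ages are made up for illustration.

```python
import numpy as np

def fraction_below(values, threshold):
    """Empirical CDF just below `threshold`: share of values < threshold."""
    values = np.asarray(values)
    return float(np.mean(values < threshold))

# Made-up ages standing in for one survival class.
ages_one_class = np.array([30, 34, 37, 41, 45, 52, 58, 60, 63, 70])

frac = fraction_below(ages_one_class, 38)
print(f"{frac:.0%} of this class is younger than 38")
```

Applied to `haberman_gt_5["age"]` or `haberman_lt_5["axil_nodes"]`, this gives the share of that class below the threshold, which is what the per-class CDF curves show.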
       

# Box Plot of data set

Box plot based on age:

In [None]:
sns.boxplot(x='surv_status',y='age', data=haberman)
plt.show()

Box plot based on Operation Year:

In [None]:
sns.boxplot(x='surv_status',y='op_year', data=haberman)
plt.show()

Box plot based on Axil nodes:

In [None]:
sns.boxplot(x='surv_status',y='axil_nodes', data=haberman)
plt.show()

## Violin Plots

In [None]:
#Median, Quantiles, Percentiles, IQR.
print("\nMedians:")
print(np.median(haberman_gt_5["axil_nodes"]), " : GT_5y")
print(np.median(haberman_lt_5["axil_nodes"]), " : LT_5y")


print("\nDeciles:")
print(np.percentile(haberman_gt_5["axil_nodes"],np.arange(0, 100, 10)))
print(np.percentile(haberman_lt_5["axil_nodes"],np.arange(0, 100, 10)))

print("\nPercentiles in steps of 5:")
print(np.percentile(haberman_gt_5["axil_nodes"],np.arange(0, 100, 5)))
print(np.percentile(haberman_lt_5["axil_nodes"],np.arange(0, 100, 5)))

from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(haberman_gt_5["axil_nodes"]))
print(robust.mad(haberman_lt_5["axil_nodes"]))

In [None]:

sns.violinplot(x="surv_status", y="axil_nodes", data=haberman)
plt.show()

###### Observations:

1. Based on the box plot and violin plot analysis, we can come up with the following data model (percentages are read per survival class, so they need not sum to 100%):

   if axil_nodes <= 2:
       about 80% of the patients who survived and about 40% of the patients who died fall in this range.
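
The quartiles that a box plot draws, and the IQR mentioned in the statistics cell, can be computed directly. This is a minimal sketch on made-up node counts; the scaled MAD divides by the normal-consistency constant 0.6745, which to my understanding is what `statsmodels.robust.mad` applies by default, so treat that correspondence as an assumption.

```python
import numpy as np

# Made-up node counts standing in for one survival class.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 12])

# Quartiles: the box in a box plot spans q1..q3 with a line at the median.
q1, median, q3 = np.percentile(nodes, [25, 50, 75])
iqr = q3 - q1

# Median absolute deviation around the median, raw and normal-scaled.
mad_raw = np.median(np.abs(nodes - median))
mad_scaled = mad_raw / 0.6744897501960817  # assumed default scaling constant

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("MAD (raw, scaled):", mad_raw, mad_scaled)
```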
    

# Conclusions:

Based on the univariate and bivariate analysis of the Haberman dataset, we arrive at the following data model (percentages are read per survival class):

    if axil_nodes <= 2:
        about 80% of the patients who survived and about 40% of the patients who died within 5 years of the operation fall in this range
    elif axil_nodes >= 47:
        no patient in this dataset survived more than 5 years after the operation
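
One way to check such a rule against the data is to write it as a function and cross-tabulate its output with the recorded survival status. The sketch below is illustrative only: the function name `rule_of_thumb`, its output labels and the toy rows are all hypothetical, and only the thresholds 2 and 47 and the column names come from this notebook.

```python
import pandas as pd

def rule_of_thumb(axil_nodes):
    """Map a node count to the bucket used in the conclusion above."""
    if axil_nodes <= 2:
        return "likely GT_5y"
    if axil_nodes >= 47:
        return "likely LT_5y"
    return "undetermined"

# Toy rows, made up for illustration.
toy = pd.DataFrame({
    "axil_nodes": [0, 1, 2, 5, 20, 47, 52],
    "surv_status": ["GT_5y", "GT_5y", "LT_5y", "GT_5y", "LT_5y", "LT_5y", "LT_5y"],
})

toy["rule"] = toy["axil_nodes"].apply(rule_of_thumb)

# Rows: bucket assigned by the rule; columns: recorded survival status.
summary = pd.crosstab(toy["rule"], toy["surv_status"])
print(summary)
```

On the real `haberman` DataFrame the same cross-tabulation would show how many patients each branch of the rule covers and how often it agrees with the recorded outcome.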


The results could be improved with a more balanced dataset.