# EXPLORATORY DATA ANALYSIS: HABERMAN DATASET
### A. DATASET INFORMATION
### B. OBJECTIVE
### C. DATASET CONFIGURATION
       1. ENVIRONMENT SETUP
       2. LOADING THE DATASET
### D. HIGH LEVEL STATISTICS OF THE DATASET
       1. NUMBER OF POINTS
       2. NUMBER OF FEATURES
       3. NUMBER OF CLASSES
       4. DATAPOINTS PER NUMBER OF CLASS
       5. MEAN, MAD & STD. DEVIATION
### E. UNIVARIATE ANALYSIS
       1. HISTOGRAM
       2. PDF & CDF
       3. BOX PLOT & WHISKERS
       4. VIOLIN PLOT
       5. CONTOUR PLOT
### F. BI-VARIATE ANALYSIS
       1. PAIR PLOTS
       2. SCATTER PLOTS
### G. ALL OBSERVATIONS
       
       
       

# A. DATASET INFORMATION

**About the Dataset** :- The Haberman's Survival Dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

**Dataset source** :- https://www.kaggle.com/gilsousa/habermans-survival-data-set/data

**Attribute Information:**

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute):
   - 1 = the patient survived 5 years or longer
   - 2 = the patient died within 5 years

# B. OBJECTIVE

**To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age, the year of operation and the number of positive axillary nodes.**

# C. DATASET CONFIGURATION

### 1. ENVIRONMENT SETUP

In [None]:
# check for the input dataset
import os
print(os.listdir('../input'))

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore') ## ignore all the warnings
from statsmodels import robust ## for Median Absolute Deviation

### 2. LOADING THE DATASET

In [None]:
haberman=pd.read_csv("../input/haberman.csv",names=['age', 'operation_year', 'axil_nodes', 'survival_status'])

# D. HIGH LEVEL STATISTICS OF THE DATASET

In [None]:
# haberman.describe()
print(haberman.head(15))

### 1. NUMBER OF POINTS

In [None]:
print (haberman.shape)

In [None]:
haberman["survival_status"].value_counts()

In [None]:
haberman.info()

In [None]:
# Relabel the class label "survival_status" as {1: "yes", 2: "no"}: "yes" = survived, "no" = did not survive
haberman['survival_status'] = haberman['survival_status'].map({1:"yes", 2:"no"})
print(haberman.head(15)) ## display the first 15 rows

In [None]:
## CHECKING THE UPDATED SURVIVAL STATUS
haberman["survival_status"].value_counts()

In [None]:
## CHECKING THE UPDATED INFO ABOUT THE CHANGED DATATYPE OF OUR CLASS LABEL
haberman.info()

**OBSERVATION**
1. The dataset is IMBALANCED (the two classes have different counts) but complete, as no values are missing.
2. Our CLASS LABEL, i.e. survival_status, is loaded as an INTEGER and needs to be converted to a valid CATEGORICAL datatype.
3. The class label "survival_status" is therefore relabelled as {1:"yes",2:"no"}, where "yes" means the patient survived and "no" means the patient did not survive.
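
The imbalance and the categorical conversion described above can also be checked numerically. The following is a minimal sketch on a hypothetical toy frame (the values are invented for illustration and are not the real Haberman data); it assumes only the column name `survival_status` and the {1:"yes", 2:"no"} mapping used in this notebook.

```python
import pandas as pd

# Hypothetical toy sample, NOT the real Haberman data
toy = pd.DataFrame({
    "age": [30, 34, 45, 52, 61, 70],
    "survival_status": [1, 1, 2, 1, 1, 2],
})

# Map the integer codes to labels, then store them as an explicit categorical dtype
toy["survival_status"] = (
    toy["survival_status"]
    .map({1: "yes", 2: "no"})
    .astype(pd.CategoricalDtype(categories=["yes", "no"]))
)

# Share of each class: a large gap between the shares signals an imbalanced dataset
class_share = toy["survival_status"].value_counts(normalize=True)

# Number of missing values per column: all zeros means the data is complete
missing_per_column = toy.isna().sum()

print(class_share)
print(missing_per_column)
```

Running the same three steps on `haberman` would give the class shares for the real data; note that `map` alone produces an object column, so the explicit `astype` is what makes the label categorical.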

### 2. NUMBER OF FEATURES

In [None]:
print (haberman.columns)

In [None]:
## Last column is our CATEGORY that is the DEPENDENT VARIABLE, therefore it is not considered as a feature
print (haberman.columns[:-1])

### 3. NUMBER OF CLASSES

In [None]:
print(haberman["survival_status"].unique())

### 4. DATAPOINTS PER NUMBER OF CLASS

In [None]:
print(haberman.groupby("survival_status").count())

### 5. MEAN, MAD & STD. DEVIATION

In [None]:
haberman_yes=haberman.loc[haberman["survival_status"]=="yes"]
haberman_no=haberman.loc[haberman["survival_status"]=="no"]

print("SURVIVAL_STATUS: YES , COUNT",haberman_yes["survival_status"].count())
print("VARIABLE       MEAN             MEDIAN    STD. DEVIATION        MAD")
print("age         ",np.mean(haberman_yes["age"]),"  ",np.median(haberman_yes["age"]),"  ",np.std(haberman_yes["age"]),"   ",robust.mad(haberman_yes["age"]))
print("opYear      ",np.mean(haberman_yes["operation_year"]),"  ",np.median(haberman_yes["operation_year"]),"  ",np.std(haberman_yes["operation_year"]),"   ",robust.mad(haberman_yes["operation_year"]))
print("axilnodes   ",np.mean(haberman_yes["axil_nodes"]),"  ",np.median(haberman_yes["axil_nodes"]),"  ",np.std(haberman_yes["axil_nodes"]),"   ",robust.mad(haberman_yes["axil_nodes"]))
print("\n")
# includeInDescribe=['ag','opYear','axilNodes']
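perc=[.0,.25,.50,.75,1.0] ## minimum, quartiles and maximum (the last entry was .1, i.e. the 10th percentile, presumably a typo)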


print("SURVIVAL_STATUS: NO , COUNT",haberman_no["survival_status"].count())  
print("VARIABLE       MEAN             MEDIAN    STD. DEVIATION        MAD")
print("age         ",np.mean(haberman_no["age"]),"  ",np.median(haberman_no["age"]),"  ",np.std(haberman_no["age"]),"   ",robust.mad(haberman_no["age"]))
print("opYear      ",np.mean(haberman_no["operation_year"]),"  ",np.median(haberman_no["operation_year"]),"  ",np.std(haberman_no["operation_year"]),"  ",robust.mad(haberman_no["operation_year"]))
print("axilnodes   ",np.mean(haberman_no["axil_nodes"]),"   ",np.median(haberman_no["axil_nodes"]),"   ",np.std(haberman_no["axil_nodes"]),"   ",robust.mad(haberman_no["axil_nodes"]))
print("\n\n")

print("SURVIVAL STATUS: YES -> DESCRIBE")
print(haberman_yes.describe(percentiles=perc))
print("\n\n")
print("SURVIVAL STATUS: NO -> DESCRIBE")
print(haberman_no.describe(percentiles=perc))
# print(haberman.describe(percentiles=perc))


**OBSERVATION**

1. This is a Binary Classification Problem, where we need to predict whether the patient will survive 5 years or longer, based on the patient's age, the year of operation and the number of positive axillary nodes.
2. About 50% of the patients are younger than 54 (compare the median age in the output above).
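
The MAD column above comes from `robust.mad`. To our understanding, statsmodels by default divides the raw median absolute deviation by a constant close to 0.6745 (the 0.75 quantile of the standard normal distribution), so that the result is comparable to a standard deviation for normally distributed data. The sketch below reproduces that calculation with NumPy only; the function name `scaled_mad` and the sample values are hypothetical.

```python
import numpy as np

def scaled_mad(values, c=0.6744897501960817):
    """Median absolute deviation around the median, divided by c.

    c is approximately the 0.75 quantile of the standard normal distribution
    (assumed here to match the default scaling of statsmodels' robust.mad).
    """
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    return np.median(np.abs(values - center)) / c

# Hypothetical node counts with one extreme value
nodes = [0, 0, 1, 2, 3, 4, 46]

print("std:", np.std(nodes))      # strongly inflated by the single extreme value
print("mad:", scaled_mad(nodes))  # barely affected by the extreme value
```

This is why MAD is reported next to the standard deviation: for a skewed feature such as `axil_nodes`, a few very large values can dominate the standard deviation while the MAD stays close to the typical spread.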

# E. UNIVARIATE ANALYSIS


## 1. HISTOGRAM

In [None]:
## PATIENT AGE
sns.FacetGrid(haberman,hue="survival_status",height=5)\
    .map(sns.distplot,"age")\
    .add_legend();

plt.show();

**OBSERVATION**

1. The largest number of survivors falls in the 40-60 age range.

In [None]:
sns.FacetGrid(haberman,hue="survival_status",height=5)\
    .map(sns.distplot,"operation_year")\
    .add_legend();
plt.show()

**OBSERVATION**

1. Operation years in the range 63-66 had the highest survival rate.
2. Operation year 60 had the highest non-survival rate.
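
Statements about the survival rate per operation year are easier to verify from a table than from overlapping histograms. A minimal sketch, assuming the column names used in this notebook and a hypothetical toy frame (the rows are invented, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy sample, NOT the real Haberman data
toy = pd.DataFrame({
    "operation_year": [60, 60, 60, 63, 63, 65, 65, 65],
    "survival_status": ["no", "no", "yes", "yes", "yes", "yes", "no", "yes"],
})

# Row-normalised cross-tabulation: each row gives the share of "no" and "yes"
# among the patients operated on in that year, so every row sums to 1
rate_by_year = pd.crosstab(
    toy["operation_year"], toy["survival_status"], normalize="index"
)
print(rate_by_year)
```

Applied to `haberman`, the "yes" column of this table would be the survival rate for each operation year, which can then be compared with the impression given by the plot.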



In [None]:
sns.FacetGrid(haberman,hue="survival_status",height=5)\
    .map(sns.distplot,"axil_nodes")\
    .add_legend();

plt.show()

**OBSERVATION**

1. Patients with 0 positive axillary nodes have the highest survival rate.

## 2. PDF & CDF

In [None]:
## PDF & CDF of each feature for the patients who survived (survival_status = "yes")
plt.figure(figsize=(20,6))
plt.subplot(131) ## 1 row, 3 columns, 1st panel
counts,bin_edges=np.histogram(haberman_yes["age"],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,linewidth=3.0)
plt.plot(bin_edges[1:],cdf,linewidth=3.0)
plt.ylabel("PROBABILITY")
plt.xlabel('AGE')
plt.title('PDF-CDF of AGE for Survival Status = YES')
plt.legend(['PDF-AGE', 'CDF-AGE'], loc = 5,prop={'size': 16})

plt.subplot(132)
counts,bin_edges=np.histogram(haberman_yes["operation_year"],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,linewidth=3.0)
plt.plot(bin_edges[1:],cdf,linewidth=3.0)
plt.ylabel("PROBABILITY")
plt.xlabel('YEAR OF OPERATION')
plt.title('PDF-CDF of OPERATION YEAR for Survival Status = YES')
plt.legend(['PDF-OPERATION YEAR', 'CDF-OPERATION YEAR'], loc = 5,prop={'size': 11})

plt.subplot(133)
counts,bin_edges=np.histogram(haberman_yes["axil_nodes"],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,linewidth=3.0)
plt.plot(bin_edges[1:],cdf,linewidth=3.0)
plt.ylabel("PROBABILITY")
plt.xlabel('AXIL NODES')
plt.title('PDF-CDF of AXIL NODES for Survival Status = YES')
plt.legend(['PDF-AXIL NODES', 'CDF-AXIL NODES'], loc = 5,prop={'size': 16})
plt.show()
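
The same histogram-to-PDF-to-CDF computation is repeated for each feature in the cell above. One way to avoid the repetition is a small helper; the sketch below uses a hypothetical function name `empirical_pdf_cdf` and invented sample ages. It normalises the raw bin counts by their sum, which for equal-width bins gives the same per-bin probabilities as the cell above.

```python
import numpy as np

def empirical_pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probabilities (PDF) and their running sum (CDF)."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of points falling in each bin
    cdf = np.cumsum(pdf)          # fraction of points at or below each bin's right edge
    return bin_edges, pdf, cdf

# Hypothetical ages, NOT the real Haberman data
ages = [30, 34, 38, 41, 45, 47, 52, 52, 55, 60, 63, 70]
bin_edges, pdf, cdf = empirical_pdf_cdf(ages, bins=4)

print("bin edges:", bin_edges)
print("pdf:", pdf)
print("cdf:", cdf)
```

Both curves would then be plotted against `bin_edges[1:]`, as in the cell above, so that each CDF value reads as "fraction of patients with a value up to this bin's right edge".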

## 3. BOX PLOT & WHISKERS

In [None]:
# A box plot takes little space and visually represents the five-number summary of the data points.
# Points beyond the whiskers are displayed individually as outliers.
# 1. Lower whisker: smallest observation at or above Q1 - 1.5*IQR
# 2. Q1 (25th percentile)
# 3. Q2 (50th percentile or median)
# 4. Q3 (75th percentile)
# 5. Upper whisker: largest observation at or below Q3 + 1.5*IQR
# Interquartile Range (IQR) = Q3 - Q1

figure, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    mystr="Box plot for survival_status and "+feature
    sns.boxplot( x='survival_status', y=feature, data=haberman, ax=axes[idx]).set_title(mystr)
plt.show()

**OBSERVATION**

1. From the box plot of AXIL_NODES against SURVIVAL_STATUS, we can conclude that the higher the number of positive axillary nodes, the higher the chance of death.
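
The quantities a box plot draws can also be computed directly, which helps when reading the axil_nodes panel with its many outliers. A minimal sketch with NumPy, using a hypothetical helper name `box_plot_summary` and invented node counts; it assumes the common 1.5*IQR rule for the whisker limits.

```python
import numpy as np

def box_plot_summary(values):
    """Quartiles, IQR, 1.5*IQR fences and the points lying outside the fences."""
    values = np.asarray(values, dtype=float)
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "outliers": outliers.tolist(),
    }

# Hypothetical node counts, NOT the real Haberman data
nodes = [0, 0, 1, 1, 2, 3, 4, 8, 30]
summary = box_plot_summary(nodes)
print(summary)
```

Calling the helper once per class (for example on `haberman_yes["axil_nodes"]` and `haberman_no["axil_nodes"]`) would put numbers on the difference that the box plot shows visually.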

## 4. VIOLIN PLOT

In [None]:
# A violin plot combines the benefits of a box plot and a univariate PDF (density) plot.
# Denser regions of the data appear wider and sparser regions appear thinner.

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
#     print(idx,feature)
    sns.violinplot( x='survival_status', y=feature, data=haberman, ax=axes[idx])
plt.show()

## 5. CONTOUR PLOT

In [None]:
# 2D density plot (contour plot)
sns.jointplot(x="age",y="operation_year",data=haberman, kind="kde")
plt.show()

sns.jointplot(x="age",y="axil_nodes",data=haberman, kind="kde")
plt.show()

sns.jointplot(x="operation_year",y="axil_nodes",data=haberman, kind="kde")
plt.show()



# F. BI-VARIATE ANALYSIS


## 1. PAIR PLOTS

In [None]:
plt.close() ## close previous show()
sns.set_style("whitegrid")
sns.pairplot(haberman,hue="survival_status",vars=["age","operation_year","axil_nodes"],height=3.5,plot_kws=dict(s=70),diag_kind='kde')
plt.show()

**OBSERVATION**

From the pair plots above, we can say that the two classes are not linearly separable on any pair of features.

## 2. SCATTER PLOTS

In [None]:
## AGE <> AXIL NODES
# haberman.plot(kind='scatter', x='age', y='axil_nodes') ;
# plt.show()

sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="survival_status", height=6) \
   .map(plt.scatter, "age", "axil_nodes") \
   .add_legend();
plt.show();

**OBSERVATION**

1. Patients with age < 40 and axil_nodes < 30 have a higher chance of survival.
2. Patients with age > 50 and axil_nodes > 10 are more likely to die.
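
Threshold-based observations like the two above can be quantified by computing the survival rate inside each region of the scatter plot. The sketch below uses a hypothetical helper `survival_rate` and an invented toy frame; only the column names and the "yes"/"no" labels are taken from this notebook.

```python
import pandas as pd

def survival_rate(df, mask):
    """Share of the rows selected by mask that are labelled "yes" (NaN if none selected)."""
    subset = df.loc[mask, "survival_status"]
    if subset.empty:
        return float("nan")
    return (subset == "yes").mean()

# Hypothetical toy sample, NOT the real Haberman data
toy = pd.DataFrame({
    "age": [33, 38, 39, 45, 55, 58, 62, 67],
    "axil_nodes": [0, 2, 12, 1, 15, 0, 22, 11],
    "survival_status": ["yes", "yes", "yes", "yes", "no", "yes", "no", "yes"],
})

# The two regions named in the observations above
young_few_nodes = (toy["age"] < 40) & (toy["axil_nodes"] < 30)
older_many_nodes = (toy["age"] > 50) & (toy["axil_nodes"] > 10)

rate_young = survival_rate(toy, young_few_nodes)
rate_older = survival_rate(toy, older_many_nodes)
print("age < 40 and axil_nodes < 30:", rate_young, "from", int(young_few_nodes.sum()), "patients")
print("age > 50 and axil_nodes > 10:", rate_older, "from", int(older_many_nodes.sum()), "patients")
```

Printing the number of patients next to each rate matters: a rate computed from a handful of points is much weaker evidence than the same rate computed from a large region.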

In [None]:
## AXIL NODES <> OPERATION YEAR
sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="survival_status", height=5) \
   .map(plt.scatter, "axil_nodes", "operation_year") \
   .add_legend();

plt.show();

**OBSERVATION**

1. Patients with more than 50 positive axillary nodes have a higher rate of non-survival.

In [None]:
## AGE <> OPERATION YEAR
sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="survival_status", height=5) \
   .map(plt.scatter, "operation_year", "age") \
   .add_legend();
plt.show();

**OBSERVATION**

1. Operation years 60, 61 and 68 show a higher survival rate.

# G. ALL OBSERVATIONS
1. The dataset is IMBALANCED (the two classes have different counts) but complete, as no values are missing.
2. Our CLASS LABEL, i.e. survival_status, is loaded as an INTEGER and needs to be converted to a valid CATEGORICAL datatype.
3. The class label "survival_status" is therefore relabelled as {1:"yes",2:"no"}, where "yes" means the patient survived and "no" means the patient did not survive.
4. This is a Binary Classification Problem, where we need to predict whether the patient will survive 5 years or longer, based on the patient's age, the year of operation and the number of positive axillary nodes.
5. About 50% of the patients are younger than 54.
6. Operation years in the range 63-66 had the highest survival rate.
7. Operation year 60 had the highest non-survival rate. The largest number of survivors falls in the 40-60 age range.
8. Patients with 0 positive axillary nodes have the highest survival rate.
9. From the box plot of AXIL_NODES against SURVIVAL_STATUS, we can conclude that the higher the number of positive axillary nodes, the higher the chance of death.
10. From the pair plots, we can say that the two classes are not linearly separable on any pair of features.
11. Patients with age < 40 and axil_nodes < 30 have a higher chance of survival.
12. Patients with age > 50 and axil_nodes > 10 are more likely to die.
13. Patients with more than 50 positive axillary nodes have a higher rate of non-survival.
14. Operation years 60, 61 and 68 show a higher survival rate.

