# Exploratory data analysis (EDA) for Haberman's Survival Data Set


Title: Haberman's Survival Data

Sources: (a) Donor: Tjen-Sien Lim (limt@stat.wisc.edu) (b) Date: March 4, 1999

Past Usage: Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models, Proceedings of the 9th International Biometrics Conference, Boston, pp. 104-122.
Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984), Graphical Models for Assessing Logistic Regression Models (with discussion), Journal of the American Statistical Association 79: 61-83.
Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.

Relevant Information: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Number of Instances: 306

Number of Attributes: 4 (including the class attribute)

Attribute Information: Age of patient at time of operation (numerical), Patient's year of operation (year - 1900, numerical), Number of positive axillary nodes detected (numerical), Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years.

Missing Attribute Values: None

Objective: Classify whether a new patient who has undergone surgery for breast cancer will survive 5 years or longer, or die within 5 years.

In [None]:
# check for the input dataset
import os
print(os.listdir('../input'))

In [None]:
# import the essential libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as mpatches

In [None]:
#Load Haberman data into a pandas dataFrame.
haberman = pd.read_csv("../input/haberman.csv", header=None,names=['Age','Op_Year','axil_nodes_det','Surv_status'])

In [None]:
#checking the first 5 values of the dataframe
haberman.head()

In [None]:
# how many data-points and features?
print(haberman.shape)

In [None]:
#What are the column names in our dataset?
print(haberman.columns)

In [None]:
#How many data points for each class are present? 
haberman["Surv_status"].value_counts()
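The raw counts are easier to compare as proportions. The following is a minimal sketch, assuming a small hand-made frame with the same `Surv_status` coding; `toy` and its values are illustrative stand-ins, not rows from the data set.

```python
import pandas as pd

# Toy stand-in for the haberman frame; values are illustrative only.
toy = pd.DataFrame({"Surv_status": [1, 1, 1, 2, 1, 2, 1, 1]})

# normalize=True turns the per-class counts into fractions that sum to 1.
class_share = toy["Surv_status"].value_counts(normalize=True)
print(class_share)

# A large gap between the shares means the classes are imbalanced.
imbalance_ratio = class_share.max() / class_share.min()
print("majority/minority ratio:", imbalance_ratio)
```

On the real frame the same call would be `haberman["Surv_status"].value_counts(normalize=True)`.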

# 2-D Scatter Plot

In [None]:
#2-D scatter plot:
haberman.plot(kind='scatter', x='Age', y='Op_Year', title='2-D scatter plot Age Vs Op_Year') ;
plt.show()
#this plot doesn't give any information about the classes

In [None]:
# 2-D Scatter plot with color-coding for each class.

sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="Surv_status", height=4) \
   .map(plt.scatter, "Age", "Op_Year") \
   .add_legend();
plt.title('2-D scatter plot Age Vs Op_Year')
plt.show();

# Observation(s):
1. The plot gives a rough idea that the two classes are mixed and not easily distinguishable.

# Pair-plot

In [None]:
# this pair plot shows how the classes are distributed w.r.t. each pair of features
plt.close();
sns.set_style("whitegrid");
sns.pairplot(haberman, hue="Surv_status",vars=["Age", "Op_Year","axil_nodes_det"], height=5);
plt.suptitle('Pair Plot for each feature')
plt.show()

# Observation(s):
1. The diagonal panels show the PDF of each feature and suggest that the classes are not easily separable.
2. Both classes are mixed in every panel, so no straight line drawn in any of these plots separates them.

# Histogram, PDF, CDF

In [None]:
#1-D scatter plot of AGE
haberman_one = haberman.loc[haberman["Surv_status"] == 1];
haberman_two= haberman.loc[haberman["Surv_status"] == 2];

plt.plot(haberman_one["Age"], np.zeros_like(haberman_one['Age']), 'o')
plt.plot(haberman_two["Age"], np.zeros_like(haberman_two['Age']), 'o')
# one explicit label per plotted class, in plotting order
plt.legend(['class 1', 'class 2'])
plt.xlabel("Age")
plt.title('1-D scatter plot for AGE')
plt.show()

In [None]:
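# Histogram and PDF of AGE for both classes
# (sns.distplot is deprecated since seaborn 0.11; sns.histplot with kde=True, stat="density" is the newer equivalent)
sns.FacetGrid(haberman, hue="Surv_status", height=5) \
   .map(sns.distplot, "Age") \
       .add_legend();
plt.title('Histogram and PDF of AGE')
# distplot normalises the histogram, so the y-axis is a density, not a count
plt.ylabel('density')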
plt.show();

# Observation(s):
1. The means of the two classes are approximately equal, and the two distributions are nearly the same.
2. The spread (variance) of the class of patients who survived 5 years or longer is larger than that of the class who died within 5 years of the operation.

In [None]:
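# Histogram and PDF of Op_Year for both classes
# (sns.distplot is deprecated since seaborn 0.11; sns.histplot with kde=True, stat="density" is the newer equivalent)
sns.FacetGrid(haberman, hue="Surv_status", height=5) \
   .map(sns.distplot, "Op_Year") \
       .add_legend();
plt.title('Histogram and PDF of Op_Year')
# distplot normalises the histogram, so the y-axis is a density, not a count
plt.ylabel('density')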
plt.show();

# Observation(s):
1. W.r.t. Op_Year the two classes overlap, so this feature does not readily differentiate them.

# CDF

In [None]:
# PDF and CDF plot of Age for class 1
counts, bin_edges = np.histogram(haberman_one['Age'], bins=10, 
                                 density = True)
# normalise so that the per-bin values sum to 1
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges);
cdf = np.cumsum(pdf)
# label each curve so the legend names both the PDF and the CDF
plt.plot(bin_edges[1:], pdf, label='PDF class 1')
plt.plot(bin_edges[1:], cdf, label='CDF class 1')
plt.legend()
plt.xlabel("Age")
plt.ylabel("Probability")
plt.title("PDF and CDF plot of Age for class 1")
plt.show();
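The step from histogram counts to a PDF and CDF can be isolated in a small helper. This is a sketch under stated assumptions: `binned_pdf_cdf` is a hypothetical name and `toy_ages` holds illustrative values, not data from the set. Because the bins have equal width, dividing the counts by their sum gives the same result whether or not `density=True` is passed to `np.histogram`.

```python
import numpy as np

def binned_pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probability mass, and cumulative mass."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of points falling in each bin
    cdf = np.cumsum(pdf)          # fraction of points at or below each right edge
    return bin_edges, pdf, cdf

# Illustrative ages only, not taken from the data set.
toy_ages = np.array([30, 34, 38, 41, 45, 47, 52, 52, 58, 63, 70, 77])
edges, pdf, cdf = binned_pdf_cdf(toy_ages, bins=5)
print(pdf)
print(cdf)
```

The last CDF value should be 1, which is a quick sanity check that the normalisation worked.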

In [None]:
# PDF and CDF plot of Age for both classes

# class 1 (survived 5 years or longer)
counts, bin_edges = np.histogram(haberman_one['Age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF class 1')
plt.plot(bin_edges[1:], cdf, label='CDF class 1')

# class 2 (died within 5 years)
counts, bin_edges = np.histogram(haberman_two['Age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF class 2')
plt.plot(bin_edges[1:], cdf, label='CDF class 2')

# labels set on each plot call keep the legend colours tied to the actual lines
plt.legend()
plt.xlabel("Age")
plt.ylabel("probability")
plt.title("CDF and PDF plot of both classes w.r.t Age")
plt.show();

# Observation(s):
1. The PDFs of the two classes first intersect at about age 44. Reading the CDFs at that point, roughly 30% of class 1 (survived) patients are aged 44 or younger, against roughly 18% of class 2 patients. Predicting survival for everyone aged 44 or younger would therefore be wrong for about 18% of the patients who died.
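Reading a CDF at a threshold amounts to counting the fraction of points at or below it. A minimal sketch, assuming made-up ages for the two classes; `fraction_at_or_below`, `toy_survived`, `toy_died` and the `threshold` value are hypothetical names and values chosen for illustration.

```python
import numpy as np

def fraction_at_or_below(values, threshold):
    """Empirical CDF evaluated at a single threshold."""
    values = np.asarray(values)
    return np.mean(values <= threshold)

# Illustrative ages only, not taken from the data set.
toy_survived = np.array([30, 35, 40, 44, 50, 55, 60, 65, 70, 75])
toy_died = np.array([38, 46, 48, 52, 53, 57, 61, 66, 72, 80])

threshold = 44  # hypothetical cut-off, e.g. where the two PDFs cross
share_survived = fraction_at_or_below(toy_survived, threshold)
share_died = fraction_at_or_below(toy_died, threshold)
print(share_survived, share_died)
```

On the real data the two arguments would be `haberman_one["Age"]` and `haberman_two["Age"]`; unlike the binned CDF, this count does not depend on the choice of bin edges.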

# Mean, variance and Standard Dev

In [None]:
#Mean, Variance, Standard-dev,  
print("mean:")
print('mean of class 1 is:',np.mean(haberman_one["Age"]))
#Mean with an outlier.
print('mean with outlier is:',np.mean(np.append(haberman_one["Age"],2240)));
print('mean of class 2 is: ',np.mean(haberman_two["Age"]))

print("\nStandard-dev:");
print('STD of class 1 is:',np.std(haberman_one["Age"]))
print('STD of class 2 is:',np.std(haberman_two["Age"]))

# Observation(s):
1. The means of the survived and died classes are close, but appending an outlier of 2240 to the survived class noticeably increases the class 1 mean.
2. Thus, the mean can be easily corrupted by an outlier.
3. The standard deviations of the two classes are nearly the same.
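The cell above prints the mean and standard deviation but not the variance. As a sketch on illustrative values (`toy_ages` is not taken from the data set), the variance is the mean squared distance from the mean, and the standard deviation is its square root. `np.std` and `np.var` use `ddof=0` (population form) by default.

```python
import numpy as np

# Illustrative ages only, not taken from the data set.
toy_ages = np.array([30.0, 40.0, 50.0, 60.0, 70.0])

mean = toy_ages.mean()
# Population variance: average squared distance from the mean (ddof=0).
variance = np.mean((toy_ages - mean) ** 2)
std = np.sqrt(variance)

# One extreme value shifts the mean and inflates the variance.
with_outlier = np.append(toy_ages, 2240.0)
print(mean, variance, std)
print(with_outlier.mean(), with_outlier.var())
```

On the real data, `np.var(haberman_one["Age"])` and `np.var(haberman_two["Age"])` would give the per-class variances.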

In [None]:
#Median, Quantiles, Percentiles, IQR.
print("\nmedians:")
print('median of class 1 is:',np.median(haberman_one["Age"]))
#Median with an outlier
print('median with outlier is:',np.median(np.append(haberman_one["Age"],2240)));
print('median of class 2 is:',np.median(haberman_two["Age"]))

print("\nQuantiles:")
print(np.percentile(haberman_one["Age"],np.arange(0, 100, 25)))
print(np.percentile(haberman_two["Age"],np.arange(0, 100, 25)))

print("\n90th Percentiles:")
print(np.percentile(haberman_one["Age"],90))
print(np.percentile(haberman_two["Age"],90))

print("\n85th Percentiles:")
print(np.percentile(haberman_one["Age"],85))
print(np.percentile(haberman_two["Age"],85))


from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(haberman_one["Age"]))
print(robust.mad(haberman_two["Age"]))

# Observation(s):
1. The median of the survived class (1) is the same with and without the outlier, so the outlier has little or no effect on it. Unlike the mean, the median is not easily corrupted by an outlier.
2. The ages at the 0%, 25%, 50% and 75% quantiles are 30, 43, 52 and 60 respectively for class 1, and 34, 46, 53 and 61 respectively for class 2.
3. The 90th percentile is 67.0 for both classes.
4. The 85th percentiles for class 1 and class 2 are 64 and 65 respectively.
5. The Median Absolute Deviation differs between the two classes.
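The cell comment mentions the IQR, which is not printed above. The sketch below computes it from the quartiles, together with a NumPy version of the Median Absolute Deviation. The helper names and `toy_ages` are illustrative assumptions; the scaling constant 0.6745 (approximately the 75th percentile of the standard normal) is, to my understanding, what `statsmodels.robust.mad` divides by by default.

```python
import numpy as np

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

def scaled_mad(values, c=0.6745):
    """Median absolute deviation from the median, divided by c."""
    values = np.asarray(values, dtype=float)
    return np.median(np.abs(values - np.median(values))) / c

# Illustrative ages only, not taken from the data set.
toy_ages = np.array([30.0, 40.0, 50.0, 60.0, 70.0])
with_outlier = np.append(toy_ages, 2240.0)

# Both statistics move only slightly when the extreme value is added.
print(iqr(toy_ages), iqr(with_outlier))
print(scaled_mad(toy_ages), scaled_mad(with_outlier))
```

Like the median, both are robust measures of spread, which is why they are reported alongside the standard deviation.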

# Box plot and Whiskers

In [None]:
sns.boxplot(x='Surv_status',y='Age', data=haberman).set_title('Box plot of AGE and survival status')
blue_patch = mpatches.Patch(color='blue', label='class 1')
orange_patch = mpatches.Patch(color='orange', label='class 2')
plt.legend(handles=[blue_patch,orange_patch],loc=1)
plt.show()

# Observation(s):
1. The box (25th to 75th percentile) of Age spans about 43 to 60 for class 1 and about 47 to 61 for class 2.
2. The median, read from the line inside each box, is 52 for class 1 and 53 for class 2.
3. The whiskers for class 2 extend farther than those for class 1.
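Where the whiskers end can be checked numerically. This is a sketch assuming the common 1.5 × IQR rule, which I believe is seaborn's default (`whis=1.5`): each whisker stops at the most extreme data point still inside the fence. `whisker_limits` and `toy_ages` are hypothetical and illustrative.

```python
import numpy as np

def whisker_limits(values, whis=1.5):
    """Return the (lower, upper) whisker ends under the whis * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    lower_fence = q1 - whis * spread
    upper_fence = q3 + whis * spread
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    return inside.min(), inside.max()

# Illustrative ages only, not taken from the data set; 120 lies beyond the upper fence.
toy_ages = [30, 40, 50, 60, 70, 120]
low, high = whisker_limits(toy_ages)
print(low, high)
```

Points outside the returned limits are the ones a box plot draws as individual outlier markers.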

# Violin plots

In [None]:
# size is not a violinplot parameter, so it is dropped here
sns.violinplot(x="Surv_status", y="Age", data=haberman)
plt.title('Violin plot of AGE and Survival status')
blue_patch = mpatches.Patch(color='blue', label='class 1')
orange_patch = mpatches.Patch(color='orange', label='class 2')
plt.legend(handles=[blue_patch,orange_patch],loc=1)
plt.show()

# Observation(s):
1. This plot combines the information of a PDF and a box plot: the outer curve is the PDF and the inner region is the box plot.

# Contour plot

In [None]:
#2D Density plot, contour plot
sns.jointplot(x="Age", y="Op_Year", data=haberman_one, kind="kde");
plt.suptitle('Contour plot of AGE and Op_Year')
plt.show();

# Observation(s):
1. For class 1, the density is highest for operations in the years 60 to 62 (1960 to 1962) on patients aged roughly 47 to 53.
2. The plot shows the PDF of each feature along the margins.

# Conclusions:
1. This exploratory data analysis of the Haberman dataset showed me the importance of EDA before starting any new ML project: it gives a better understanding of the problem, of how the data is distributed, and of what to do next.
2. EDA is important whether or not we have domain knowledge, because it gives a basic understanding of the problem and the data.
3. I plotted a 1-D scatter plot, 2-D scatter plot, pair plot, histogram, PDF, CDF, violin plot, contour plot and box plot, and calculated the mean, standard deviation, median, quantiles, percentiles and median absolute deviation for the Haberman survival dataset.