# Exploratory Data Analysis on Haberman's Survival Data

Author: Dileep Vadlamudi

# Dataset Details

Title: Haberman's Survival Data

Sources: (a) Donor: Tjen-Sien Lim (limt@stat.wisc.edu) (b) Date: March 4, 1999

Past Usage:

1. Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models, Proceedings of the 9th International Biometrics Conference, Boston, pp. 104-122.

2. Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984). Graphical Models for Assessing Logistic Regression Models (with discussion), Journal of the American Statistical Association 79: 61-83.

3. Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.

Relevant Information: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Number of Instances: 305

Number of Attributes: 4 (including the class attribute)

Attribute Information:

1. Age of patient at time of operation (numerical)

2. Patient's year of operation (year - 1900, numerical)

3. Number of positive axillary nodes detected (numerical)

4. Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years

Missing Attribute Values: None
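The attribute encoding described above can be made easier to read before plotting. The following is a minimal sketch, not part of the original notebook: the rows are made-up illustrative values, and the names `status_labels`, `status_label` and `operation_year_full` are hypothetical. It maps the two class codes to readable labels and converts the `year - 1900` encoding back to a calendar year.

```python
import pandas as pd

# Illustrative rows only (made-up values, not taken from the dataset).
toy = pd.DataFrame({
    "Age": [30, 45, 62],
    "operation_year": [64, 58, 66],
    "axilarynodes": [1, 0, 12],
    "survival_status": [1, 1, 2],
})

# Hypothetical label names for the two class codes (1 and 2).
status_labels = {1: "survived_5y_or_longer", 2: "died_within_5y"}
toy["status_label"] = toy["survival_status"].map(status_labels)

# The year column stores (year - 1900), so add 1900 to get the calendar year.
toy["operation_year_full"] = toy["operation_year"] + 1900

print(toy)
```

The same two lines could be applied to the `hs` DataFrame loaded below, assuming its columns use the names given in `labels`.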

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Habermans=pd.read_csv("../input/haberman.csv") #reading csv file

In [None]:
print(Habermans.shape)

In [None]:
labels=['Age','operation_year','axilarynodes','survival_status']
hs=pd.read_csv("../input/haberman.csv",names=labels)

In [None]:
print(hs.head())

In [None]:
print(hs.tail())

In [None]:
print(hs.describe())

In [None]:
print(hs.columns)

In [None]:
print(hs["survival_status"].value_counts())




From the counts above we can conclude that the dataset is imbalanced: class 1 (survived) has far more records than class 2 (died).
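Class imbalance is often easier to judge from proportions than from raw counts. The sketch below uses a small made-up series of class codes (not the real data) to show how `value_counts(normalize=True)` gives the share of each class; `imbalance_ratio` is a hypothetical helper name for the majority-to-minority ratio.

```python
import pandas as pd

# Made-up class codes for illustration; with the real data this would be
# hs["survival_status"].
status = pd.Series([1, 1, 1, 2, 1, 2, 1, 1], name="survival_status")

counts = status.value_counts()                # absolute count per class
shares = status.value_counts(normalize=True)  # fraction of rows per class

# Ratio of the largest class to the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(3))
print("majority/minority ratio:", imbalance_ratio)
```

A ratio well above 1 means that plain accuracy would be a misleading measure for any later classifier, which is worth keeping in mind when reading the plots that follow.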

# 2-D scatter plot


In [None]:
hs.plot(kind='scatter',x='Age',y='operation_year');
plt.title("operation_year vs age")
plt.show()

# Observation

1. The plot above shows that the majority of operations were performed on patients aged between 40 and 60.

In [None]:
sns.set_style("whitegrid")
# Recent seaborn versions use `height` instead of the old `size` argument.
sns.FacetGrid(hs, hue="survival_status", height=4)\
    .map(plt.scatter, "Age", "operation_year")\
    .add_legend()
plt.title("Age vs operation_year")
plt.show()

In [None]:
sns.set_style("whitegrid")
# Recent seaborn versions use `height` instead of the old `size` argument.
sns.FacetGrid(hs, hue="survival_status", height=4)\
    .map(plt.scatter, "Age", "axilarynodes")\
    .add_legend()
plt.title("Age vs axilarynodes")
plt.show()

# Observation

1. Using Age and operation_year, we are unable to separate survived (1) from dead (2).

2. Using axilarynodes and Age, we are also unable to separate survived (1) from dead (2). This plot does show that the largest group of patients has 0 positive axillary nodes.

## Pair Plots or bivariate Analysis

In [None]:
plt.close()
sns.set_style("whitegrid")
sns.pairplot(hs, hue="survival_status", height=4)  # `height` replaces the old `size` argument
plt.show()

# Observation

1. Compared with the other pair plots, the plot of axilarynodes against operation_year shows the clearest separation between the two classes.

# Histogram, PDF, CDF

In [None]:
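# Split the data by class and draw a 1-D scatter of the node counts.
survived = hs.loc[hs["survival_status"] == 1]
dead = hs.loc[hs["survival_status"] == 2]
plt.plot(survived["axilarynodes"], np.zeros_like(survived["axilarynodes"]), 'o', label="survived (1)")
plt.plot(dead["axilarynodes"], np.zeros_like(dead["axilarynodes"]), 'o', label="dead (2)")
plt.xlabel("axilarynodes")
plt.legend()
plt.show()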

In [None]:
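# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws a comparable histogram with a density curve. `height` replaces `size`.
sns.FacetGrid(hs, hue="survival_status", height=5) \
   .map(sns.histplot, "axilarynodes", kde=True, stat="density") \
   .add_legend()
plt.show()

In [None]:
sns.FacetGrid(hs, hue="survival_status", height=5) \
   .map(sns.histplot, "operation_year", kde=True, stat="density") \
   .add_legend()
plt.show()

In [None]:
sns.FacetGrid(hs, hue="survival_status", height=5) \
   .map(sns.histplot, "Age", kde=True, stat="density") \
   .add_legend()
plt.show()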

In [None]:
# Plot the PDF and CDF of axilarynodes (10 bins), then the PDF again with 20 bins

counts, bin_edges = np.histogram(hs['axilarynodes'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges);
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)


counts, bin_edges = np.histogram(hs['axilarynodes'], bins=20, 
                                 density = True)
pdf = counts/(sum(counts))
plt.plot(bin_edges[1:],pdf);

plt.show();
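The histogram cells in this section all follow the same recipe: bin the values, divide the bin counts by their sum so they add up to 1 (a binned PDF), and take the running sum to get the CDF. The sketch below wraps that recipe in a small function; `binned_pdf_cdf` is a hypothetical helper name and the sample values are made up. With equal-width bins, `density=True` is not needed here, because the counts are renormalised by their sum anyway.

```python
import numpy as np

def binned_pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probabilities and their cumulative sum."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations in each bin
    cdf = np.cumsum(pdf)          # fraction of observations up to each bin's right edge
    return bin_edges, pdf, cdf

# Made-up node counts, for illustration only.
sample = np.array([0, 0, 0, 1, 1, 2, 3, 5, 8, 13])
edges, pdf, cdf = binned_pdf_cdf(sample, bins=4)

print("edges:", edges)
print("pdf:  ", pdf)
print("cdf:  ", cdf)
```

Plotting `edges[1:]` against `pdf` and `cdf` reproduces the curves drawn in the cells of this section; the last CDF value is always 1 because every observation falls in some bin.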


In [None]:
plt.figure(figsize=(20,5))
for idx, feature in enumerate(list(hs.columns)[:-1]):
    plt.subplot(1, 3, idx+1)
    print("********* "+feature+" *********")
    counts, bin_edges = np.histogram(hs[feature], bins=10, density=True)
    print("Bin Edges: {}".format(bin_edges))
    pdf = counts/sum(counts)
    print("PDF: {}".format(pdf))
    cdf = np.cumsum(pdf)
    print("CDF: {}".format(cdf))
    plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)
    plt.xlabel(feature)
    

In [None]:
#survival
counts, bin_edges = np.histogram(survived['axilarynodes'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)


# dead
counts, bin_edges = np.histogram(dead['axilarynodes'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)

In [None]:
print("Means:")
print(np.mean(survived['axilarynodes']))
print(np.mean(dead["axilarynodes"]))

print("\nStd-dev:");
print(np.std(survived['axilarynodes']))
print(np.std(dead['axilarynodes']))


In [None]:
#Median, Quantiles, Percentiles, IQR.
print("\nMedians:")
print(np.median(survived['axilarynodes']));
print(np.median(dead['axilarynodes']))

print("\nQuantiles:")
print(np.percentile(survived['axilarynodes'],np.arange(0, 100, 25)))
print(np.percentile(dead['axilarynodes'],np.arange(0, 100, 25)))

print("\n90th Percentiles:")
print(np.percentile(survived['axilarynodes'],90))
print(np.percentile(dead['axilarynodes'],90))

from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(survived['axilarynodes']))
print(robust.mad(dead['axilarynodes']))


In [None]:
sns.boxplot(x='survival_status',y='axilarynodes', data=hs)
plt.show()

In [None]:
sns.boxplot(x='survival_status',y='Age', data=hs)
plt.show()

In [None]:
sns.boxplot(x='survival_status',y='operation_year', data=hs)
plt.show()

In [None]:
# `size` is not a violinplot parameter, so it is dropped here.
sns.violinplot(x="survival_status", y="axilarynodes", data=hs)
plt.show()

In [None]:
sns.violinplot(x="survival_status", y="Age", data=hs)
plt.show()

In [None]:
sns.violinplot(x="survival_status", y="operation_year", data=hs)
plt.show()

# Observation

1. Most of the patients who survived have 0 positive axillary nodes.
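An observation like this can be checked numerically by computing, for each class, the fraction of patients with zero nodes. The following is a minimal sketch on made-up rows (not the real data); `zero_share` is a hypothetical variable name, and the column names are assumed to match those used in this notebook.

```python
import pandas as pd

# Made-up rows for illustration; with the real data, use the `hs` DataFrame.
toy = pd.DataFrame({
    "axilarynodes":    [0, 0, 2, 0, 7, 0, 1, 15],
    "survival_status": [1, 1, 1, 1, 2, 2, 1, 2],
})

# Boolean flag per row, averaged within each class: the mean of True/False
# values is the fraction of rows where the flag is True.
zero_share = (toy["axilarynodes"] == 0).groupby(toy["survival_status"]).mean()

print(zero_share)
```

Running the same expression on `hs` would put a number on the box and violin plots above, where the distribution for class 1 is concentrated near zero.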