**Data Set Description**: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

**Number of data points**: 306

**Number of Attributes**: 4 (including the class attribute)

**Attribute Information**:
    1. Age of patient at time of operation (numerical)
    2. Patient's year of operation (year - 1900, numerical)
    3. Number of positive axillary nodes detected (numerical)
    4. Survival status (class attribute)
        1 = the patient survived 5 years or longer
        2 = the patient died within 5 years

**Missing Attribute Values**: None

**1. Data Loading & Environment Setting**

In [17]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid");
import os

print(os.listdir("../input"))
haberman = pd.read_csv("../input/haberman.csv",header=None, 
                       names=['age', 'year_of_operation', 'positive_axillary_nodes', 'survival_status'])

**2. High Level Statistics**

In [18]:
# (Q) how many data-points and features?
print(haberman.shape)

In [19]:
#(Q) How many data points for each class are present? 
haberman["survival_status"].value_counts()

In [20]:
# (Q) High Level Statistics
haberman.describe()

**3. Data Transformation**

The dataset has no missing values, so no imputation is required.
The *survival_status* column holds the codes 1 and 2, which are not self-explanatory, so it is converted to a categorical yes/no value.

In [21]:
# modify the target column values to be meaningful as well as categorical
haberman['survival_status'] = haberman['survival_status'].map({1:"yes", 2:"no"})
haberman['survival_status'] = haberman['survival_status'].astype('category')
print(haberman.head())
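
The "no missing values" claim and the code mapping above can be checked rather than assumed. The sketch below is a minimal illustration, not the notebook's own method: `toy` is a made-up three-row frame in the same column layout, and `map_status` is a hypothetical helper name. It raises an error if a code other than 1 or 2 appears, which also catches the common notebook pitfall of re-running the mapping cell on an already converted column (every value would then silently become `NaN`).

```python
import pandas as pd

# Toy rows in the same four-column layout; the values are made up for illustration only.
toy = pd.DataFrame({
    "age": [30, 45, 62],
    "year_of_operation": [64, 60, 58],
    "positive_axillary_nodes": [1, 0, 12],
    "survival_status": [1, 2, 1],
})

def map_status(df):
    """Return a copy with survival_status mapped to 'yes'/'no' (hypothetical helper)."""
    out = df.copy()
    mapped = out["survival_status"].map({1: "yes", 2: "no"})
    # Series.map yields NaN for any value that is not a key of the dict
    if mapped.isnull().any():
        raise ValueError("unexpected survival_status code")
    out["survival_status"] = mapped.astype("category")
    return out

# Count missing values per column before transforming
missing_per_column = toy.isnull().sum()
converted = map_status(toy)

print(missing_per_column)
print(converted["survival_status"].tolist())
```

Applied to the real data, the same two checks would be `haberman.isnull().sum()` before the mapping and a `isnull().any()` test on the mapped column after it.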

**4. Data Distribution**


In [22]:
print("# of rows: " + str(haberman.shape[0]))
print("# of columns: " + str(haberman.shape[1]))
print("Columns: " + ", ".join(haberman.columns))

print("Target variable distribution")
print(haberman.iloc[:,-1].value_counts())
print(haberman.iloc[:,-1].value_counts(normalize = True))

**5. Objective Of Exploratory Analysis**

To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age, the year of operation, and the number of positive axillary nodes.

**6. Univariate Analysis - PDF**


In [23]:
sns.set_style("whitegrid")
# `height` replaces the older `size` argument, which newer seaborn releases no longer accept
sns.FacetGrid(haberman, hue="survival_status", height=5) \
   .map(sns.distplot, "age") \
   .add_legend()
plt.show()

In [24]:
# `height` replaces the older `size` argument, which newer seaborn releases no longer accept
sns.FacetGrid(haberman, hue="survival_status", height=5) \
   .map(sns.distplot, "year_of_operation") \
   .add_legend()
plt.show()

In [25]:
# `height` replaces the older `size` argument, which newer seaborn releases no longer accept
sns.FacetGrid(haberman, hue="survival_status", height=5) \
   .map(sns.distplot, "positive_axillary_nodes") \
   .add_legend()
plt.show()

**6. Univariate Analysis - CDF**

In [26]:
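counts, bin_edges = np.histogram(haberman['age'], bins=10, density=True)
# normalize so the per-bin values sum to 1 (a binned PDF)
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)

# compute the CDF as the running sum of the binned PDF
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('age')
plt.legend()

plt.show()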

In [27]:
counts, bin_edges = np.histogram(haberman['positive_axillary_nodes'], bins=10, density=True)
# normalize so the per-bin values sum to 1 (a binned PDF)
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)

# compute the CDF as the running sum of the binned PDF
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('positive_axillary_nodes')
plt.legend()

plt.show()

In [28]:
counts, bin_edges = np.histogram(haberman['year_of_operation'], bins=10, density=True)
# normalize so the per-bin values sum to 1 (a binned PDF)
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)

# compute the CDF as the running sum of the binned PDF
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('year_of_operation')
plt.legend()

plt.show()
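
The three CDF cells above repeat the same steps for each column, so the computation can be factored into one function. The following is a minimal sketch under stated assumptions: `binned_pdf_cdf` is a hypothetical helper name, and `values` is a small made-up array used only to show the arithmetic. It normalizes raw bin counts directly; with equal-width bins this gives the same per-bin probabilities as calling `np.histogram(..., density=True)` and dividing by the sum, because the constant bin width cancels.

```python
import numpy as np

def binned_pdf_cdf(values, bins=10):
    """Return (pdf, cdf, bin_edges) for equal-width bins (hypothetical helper)."""
    counts, bin_edges = np.histogram(values, bins=bins)
    # Fraction of points in each bin; these fractions sum to 1
    pdf = counts / counts.sum()
    # Running total of the bin fractions; the last entry is 1
    cdf = np.cumsum(pdf)
    return pdf, cdf, bin_edges

# Made-up values, for illustration only
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])
pdf, cdf, bin_edges = binned_pdf_cdf(values, bins=4)

print(pdf)
print(cdf)
```

With the notebook's data, a call such as `binned_pdf_cdf(haberman['age'])` would return the arrays that each cell plots against `bin_edges[1:]`.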

**6. Univariate Analysis - Box Plot**

In [29]:
# one row of three panels, one box plot per feature
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
sns.boxplot(x='survival_status', y='year_of_operation', data=haberman, ax=axes[0])
sns.boxplot(x='survival_status', y='age', data=haberman, ax=axes[1])
sns.boxplot(x='survival_status', y='positive_axillary_nodes', data=haberman, ax=axes[2])
plt.show()
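
A box plot draws the 25th, 50th, and 75th percentiles of each group, and the same numbers can be read off directly when exact values are needed. The sketch below assumes a made-up `toy` frame with only two of the notebook's columns; it is an illustration of the per-class quartile computation, not output from the Haberman data.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    "positive_axillary_nodes": [0, 1, 2, 3, 4, 0, 5, 10, 15, 20],
    "survival_status": ["yes"] * 5 + ["no"] * 5,
})

# Quartiles per class: rows are classes, columns are the quantile levels
quartiles = (
    toy.groupby("survival_status")["positive_axillary_nodes"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)

print(quartiles)
```

The box edges correspond to the 0.25 and 0.75 columns and the center line to the 0.5 column; replacing `toy` with `haberman` would give the values behind the third panel.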


**6. Univariate Analysis - Violin Plots**

In [30]:
# one row of three panels, one violin plot per feature
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
sns.violinplot(x='survival_status', y='year_of_operation', data=haberman, ax=axes[0])
sns.violinplot(x='survival_status', y='age', data=haberman, ax=axes[1])
sns.violinplot(x='survival_status', y='positive_axillary_nodes', data=haberman, ax=axes[2])
plt.show()

**7. Bi-Variate Analysis**

In [31]:
# `height` replaces the older `size` argument, which newer seaborn releases no longer accept
sns.pairplot(haberman, hue='survival_status', height=4)
plt.show()