# Exploratory data analysis (EDA) for Haberman Dataset

1. Download the Haberman Cancer Survival dataset from Kaggle. You may have to create a Kaggle account to download the data. (https://www.kaggle.com/gilsousa/habermans-survival-data-set)
2. Perform a similar analysis as above on this dataset with the following sections:
* High level statistics of the dataset: number of points, number of features, number of classes, data-points per class.
* Explain our objective.
* Perform Univariate analysis (PDF, CDF, Boxplot, Violin plots) to understand which features are useful towards classification.
* Perform Bi-variate analysis (scatter plots, pair-plots) to see if combinations of features are useful in classification.
* Write your observations in English as crisply and unambiguously as possible. Always quantify your results.

## **Description** <br/>
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

**Number of Attributes**: 4  <br/>
**Attributes Information:**<br/>
1. Age of patient at time of operation ('age') <br/>
2. Patient's year of operation ('operation_year') <br/>
3. Number of positive axillary nodes detected ('axil_nodes') <br/>
4. Survival status:
   * 1 = the patient survived 5 years or longer ('will_survive')
   * 2 = the patient died within 5 years ('not_survive')

## Analysis

### Data Preprocessing 

In [None]:
# Loading necessary modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Dataset downloaded from https://drive.google.com/open?id=1o1I9PLyjqGgs0eOylK-2srXM2ZH3mIVb

# Load haberman.csv into a pandas DataFrame.
haberman = pd.read_csv("../input/haberman.csv")

In [None]:
# A brief look at the dataframe
print("Haberman Dataset Head\n")
print(haberman.head(10))

In [None]:
# Dataset Description
print("Haberman Dataset Describe\n")
print(haberman.describe())

In [None]:
print("Haberman Dataset Info\n")
# info() prints its summary itself and returns None, so it is not wrapped in print()
haberman.info()

In [None]:
# how many data-points and features?
print (haberman.shape)

In [None]:
# Check for any Null values
print(haberman.isnull().values.any())

In [None]:
print(list(haberman['status'].unique()))

##### Change "status" data type (int64 to categorical)

In [None]:
haberman['status'] = haberman['status'].map({1:"survived", 2:"not_survived"})
haberman['status'] = haberman['status'].astype('category')
print(haberman.head(10))

In [None]:
# info() prints its summary itself and returns None, so it is not wrapped in print()
haberman.info()

##### Observations

1. There are no null or missing values, so no data imputation is needed.
2. The status column is stored as int64, so it is converted to a categorical type.

### High level statistics of the dataset:

High level statistics of the dataset: number of points, number of features, number of classes, data-points per class

In [None]:
print("Number of rows: " + str(haberman.shape[0]))
print("Number of columns: " + str(haberman.shape[1]))
print("Columns: " + ", ".join(haberman.columns))
print("*"*100)
print("Target variable distribution")
print(haberman.iloc[:,-1].value_counts())
print("*"*100)
print(haberman.iloc[:,-1].value_counts(normalize = True))
print("*"*100)
print(haberman.describe())

##### Observations

1. Number of data points == 306,
2. Number of features == 3,
3. Number of classes == 2,
4. Data-points per class == (survived = 225, not survived = 81)
5. The age of the patients varies from 30 to 83, with a median of 52.
6. Although the maximum number of positive lymph nodes observed is 52, nearly 75% of the patients have fewer than 5 positive axillary nodes and nearly 25% of the patients have none.
7. The target column is imbalanced: about 73% of the values are "survived".
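
The class shares quoted in the observations follow directly from the reported counts. A minimal sketch of that arithmetic, using only the two counts stated above (the variable names are illustrative and not part of the notebook):

```python
# Class counts as reported in the observations above
counts = {"survived": 225, "not_survived": 81}

total = sum(counts.values())

# Share of each class among all data points
shares = {label: n / total for label, n in counts.items()}

# How many survivors there are for each non-survivor
imbalance_ratio = counts["survived"] / counts["not_survived"]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A classifier that always predicts the majority class would already be right for the "survived" share of the points, which is why accuracy alone can be misleading on this dataset.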

### Objective

To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age, the year of operation and the number of positive axillary lymph nodes.

### Univariate Analysis

##### Distribution plots

In [None]:
# cite : https://www.kaggle.com/gokulkarthik/haberman-s-survival-exploratory-data-analysis
# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is used instead
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    fg = sns.FacetGrid(haberman, hue='status', height=5)
    fg.map(sns.histplot, feature, kde=True, stat="density").add_legend()
    plt.suptitle(str("Distribution Plot for "+feature), y=1.05, fontsize=18)
    plt.show()

##### CDF(Cumulative Distribution Function)

In [None]:
plt.figure(figsize=(20, 5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    plt.subplot(1, 3, idx+1)
    print("\n*************** "+str(feature)+" ****************")
    counts, bin_edges = np.histogram(haberman[feature], bins=10, density=True)
    # Normalise so the bin values sum to 1 (probability mass per bin)
    pdf = counts/sum(counts)
    print("PDF: {}".format(pdf))
    print("Bin Edges: {}".format(bin_edges))
    # Running total of the per-bin mass gives the CDF at each right bin edge
    cdf = np.cumsum(pdf)
    print("CDF: {}".format(cdf))
    plt.plot(bin_edges[1:], pdf, label = "pdf")
    plt.plot(bin_edges[1:], cdf, label = "cdf")
    plt.xlabel(feature)
    plt.legend(loc='upper left')
    plt.title(str(feature+"'s pdf-cdf plot"))
    plt.grid(True)
# Set the figure-level title once, after all subplots are drawn
plt.suptitle("PDF-CDF Plots", y=1.05, fontsize=20)
plt.show()
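
The binned CDF above depends on the choice of 10 bins. As a complement, an empirical CDF can be computed without binning by sorting the values. The sketch below is a minimal illustration with NumPy only; `empirical_cdf` is a hypothetical helper and `sample_nodes` is made-up data, not taken from the Haberman file:

```python
import numpy as np

def empirical_cdf(values):
    """Return sorted values and the fraction of points <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Made-up node counts, for illustration only
sample_nodes = np.array([0, 0, 1, 3, 4, 7, 13, 22])

x, y = empirical_cdf(sample_nodes)

# Fraction of the sample with at most 4 nodes, read directly from the data
share_at_most_4 = np.mean(sample_nodes <= 4)
print(share_at_most_4)
```

In the notebook the same helper could be applied to a feature column such as `haberman[feature]`, and `plt.step(x, y, where="post")` would draw the result as a staircase.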

##### Mean, Variance and Std-dev

In [None]:
# Mean and standard deviation per class

yes = haberman.loc[haberman["status"] == "survived"]
no = haberman.loc[haberman["status"] == "not_survived"]

for idx, feature in enumerate(list(haberman.columns)[:-1]):
    print(str(feature))
    print("Mean of ", str(feature)," of survived class : ",np.mean(yes[feature]))
    print("Mean of ", str(feature)," of not survived class : ",np.mean(no[feature]))
    print("Std Dev. of ", str(feature)," of survived class : ",np.std(yes[feature]))
    print("Std Dev. of ", str(feature)," of not survived class : ",np.std(no[feature]))
    print("\n")


##### Median, Percentile, Quantile, IQR, MAD

In [None]:
# Median, quantiles, percentiles and MAD per class (survived first, then not survived)
from statsmodels import robust

for idx, feature in enumerate(tuple(haberman.columns)[:-1]):
    print("*"*10,str(feature),"*"*10)
    print("\nMedians:")
    print(np.median(yes[feature]))
    print(np.median(no[feature]))

    # 0th, 25th, 50th and 75th percentiles
    print("\nQuantiles:")
    print(np.percentile(yes[feature],np.arange(0, 100, 25)))
    print(np.percentile(no[feature],np.arange(0, 100, 25)))

    print("\n90th Percentiles:")
    print(np.percentile(yes[feature],90))
    print(np.percentile(no[feature],90))

    print("\nMedian Absolute Deviation")
    print(robust.mad(yes[feature]))
    print(robust.mad(no[feature]))
    print("\n")
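
`robust.mad` in the cell above reports a scaled median absolute deviation: the median of the absolute deviations from the median, divided by about 0.6745 so that the result is comparable to a standard deviation for normally distributed data. A minimal NumPy sketch of that computation, assuming the default scaling constant; `scaled_mad` and the sample values are hypothetical and only for illustration:

```python
import numpy as np

# Assumed default scaling constant: the 0.75 quantile of the standard normal
GAUSSIAN_CONSISTENCY = 0.6744897501960817

def scaled_mad(values, c=GAUSSIAN_CONSISTENCY):
    """Median absolute deviation from the median, divided by c."""
    a = np.asarray(values, dtype=float)
    return np.median(np.abs(a - np.median(a))) / c

# Made-up sample with one extreme value
sample = np.array([1, 2, 3, 4, 100])

print("scaled MAD:", scaled_mad(sample))
print("std dev   :", np.std(sample))
```

The single extreme value inflates the standard deviation but barely moves the MAD, which is why the MAD is a useful companion statistic for a skewed feature such as the node count.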


##### Box plot 

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
for idx, feature in enumerate(tuple(haberman.columns)[:-1]):
    sns.boxplot(x='status',y=feature, hue="status", data=haberman, ax=axes[idx]).set_title(str("status-"+feature))
plt.suptitle("Box Plots",y=1.0, fontsize=18)
plt.show()

##### Violin plots

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(tuple(haberman.columns)[:-1]):
    sns.violinplot(x='status',y=feature, hue="status", data=haberman, ax=axes[idx]).set_title(str("status-"+feature))
plt.suptitle("Violin Plots",y=1.05, fontsize=18)
plt.show()

##### Observations

1. The PDF of the axillary node count shows that survivors are densely concentrated between 0 and 5 nodes.
2. Almost 80% of the patients have 5 or fewer positive axillary nodes.
3. Patients treated after 1966 have a slightly higher chance of surviving than the rest, and patients treated before 1959 have a slightly lower chance.
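
Observations like the third one can be quantified with a grouped survival rate rather than read off a plot. The sketch below uses a made-up toy frame; the column name `year`, the two-digit year coding and every value in `toy` are assumptions for illustration, not data from the Haberman file:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "year":   [58, 58, 60, 63, 66, 67, 68, 69],
    "status": ["not_survived", "survived", "survived", "not_survived",
               "survived", "survived", "survived", "not_survived"],
})

# Boolean flag so that the group mean equals the survival rate
toy["is_survivor"] = toy["status"].eq("survived")

# Label each row by operation period (assumes two-digit years)
toy["period"] = np.where(toy["year"] > 66, "after_1966", "1966_or_earlier")

rate_by_period = toy.groupby("period")["is_survivor"].mean()
print(rate_by_period)
```

Applied to the real dataframe, the same pattern would give a survival rate per period that can be stated next to the observation.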

### Bivariate Analysis

##### 2-D Scatter Plot

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    for i, f in enumerate(list(haberman.columns)[:-1]):
        sns.scatterplot(x=feature, y=f, hue='status', ax=axes[idx][i], data = haberman).set_title(str(feature+" vs. "+f))
plt.suptitle("2-d Scatter Plots", fontsize=18)
plt.show()


##### Pairplot

In [None]:
plt.close()
sns.set()
sns.set_style("whitegrid")
sns.pairplot(haberman, hue="status", height=6)
plt.suptitle("PairPlots", y=1.05, fontsize = 18, color='black')
plt.show()

##### Contour plot

In [None]:
# 2D density (contour) plots for each pair of features, survived class only
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    for i, f in enumerate(list(haberman.columns)[:-1]):
        # Skip the diagonal and the mirrored pairs
        if idx >= i:
            continue
        sns.jointplot(x=feature, y=f, data=yes, kind="kde")
        plt.suptitle(str("contour plot for "+feature+" & "+f), y = 1.05, fontsize=18)
        plt.show()

### Conclusions

1. The dataset is imbalanced: it does not contain an equal number of data points for each class.
2. The two classes are not linearly separable. The data points overlap too much, so classification is difficult.
3. The number of positive axillary nodes is the feature that gives some intuition about the class.
4. We cannot build a simple model using only if-else conditions; a more complex technique is needed to handle this dataset.
5. The scatter plot of year versus nodes shows slightly better separation between the two classes than the other scatter plots.