# Haberman's Cancer Survival - EDA
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

**FEATURES**:

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)<br>
-- 1 = the patient survived 5 years or longer<br>
-- 2 = the patient died within 5 years

**Domain**:

  Positive axillary lymph node:
* A lymph node in the area of the armpit (axilla) to which cancer has spread. This spread is determined by surgically removing some of the lymph nodes and examining them under a microscope to see whether cancer cells are present.

* In this notebook, the feature "Number of positive axillary nodes detected" is referred to as "axil nodes" (column `axil_nodes`).

**OBJECTIVE:**

Predict the patient's survival status (survival status = 1 or survival status = 2) from the patient's age, year of operation, and number of axil nodes.

In [None]:
import warnings
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Suppress warnings
warnings.filterwarnings("ignore")
# Load haberman.csv into a pandas DataFrame; column names are supplied explicitly via `names`.
haberman = pd.read_csv("../input/habermans-survival-data-set/haberman.csv",
                       names=['age', 'operation_year', 'axil_nodes', 'survival_status'])
haberman.reset_index(drop=True, inplace=True)
# Split the rows by class for the per-status plots further down.
status1 = haberman.loc[haberman["survival_status"] == 1]
status2 = haberman.loc[haberman["survival_status"] == 2]
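
The codes 1 and 2 are easy to misread in tables and plot legends. Below is a minimal sketch, assuming the column names used above, of mapping the codes to readable labels. The helper name `add_status_label`, the label strings, and the three toy rows are illustrative assumptions and are not taken from haberman.csv.

```python
import pandas as pd

# Hypothetical mapping from the dataset's codes to readable labels.
STATUS_LABELS = {1: "survived >= 5 years", 2: "died within 5 years"}

def add_status_label(df):
    """Return a copy of df with a readable 'status_label' column."""
    unknown = set(df["survival_status"].unique().tolist()) - set(STATUS_LABELS)
    if unknown:
        raise ValueError(f"unexpected survival_status codes: {sorted(unknown)}")
    out = df.copy()
    out["status_label"] = out["survival_status"].map(STATUS_LABELS)
    return out

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "age": [30, 45, 60],
    "operation_year": [64, 62, 65],
    "axil_nodes": [1, 0, 12],
    "survival_status": [1, 1, 2],
})
labelled = add_status_label(toy)
print(labelled[["survival_status", "status_label"]])
```

Raising on unknown codes is a design choice here: it surfaces a wrongly parsed header row or a stray value early instead of silently producing missing labels.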

In [None]:
#Data-Points and Features
print(haberman.shape)

In [None]:
print(haberman.columns)

In [None]:
# Show the first 10 data points.
haberman.head(n=10)

In [None]:
# info() prints its summary directly and returns None, so no print() is needed.
haberman.info()

# 1. High level statistics of the dataset

In [None]:
print(haberman.describe())

In [None]:
print("No. Of Rows: " + str(haberman.shape[0]))
print("No. Of Columns: " + str(haberman.shape[1]))
print("Columns: " + ", ".join(haberman.columns))
print("No. of patients in each survival status:")
print(haberman["survival_status"].value_counts())
print("% of patients in each survival status:")
print(haberman["survival_status"].value_counts(normalize=True)*100)

**Observations:**<br>
* The age of the patients varies from 30 to 83.
* The maximum number of positive axil nodes observed is 52.
* Nearly 75% of the patients have fewer than 5 axil nodes, and about half of the patients have at most 1 axil node.
* The dataset is imbalanced: approximately 74% of the records have survival status 1, i.e. in about 74% of cases the patient survived 5 years or longer.
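
The figures in these observations can be recomputed directly rather than read off the `describe()` table. The following is a sketch under stated assumptions: the helper name `summarise` and the four toy rows are hypothetical, and only the column names come from this notebook.

```python
import pandas as pd

def summarise(df):
    """Collect the numbers behind the observations above."""
    nodes = df["axil_nodes"]
    return {
        "age_range": (int(df["age"].min()), int(df["age"].max())),
        "max_nodes": int(nodes.max()),
        # Fraction of patients with fewer than 5 positive nodes
        "share_nodes_below_5": float((nodes < 5).mean()),
        "median_nodes": float(nodes.median()),
        # Fraction of rows in each class (checks for imbalance)
        "class_share": df["survival_status"].value_counts(normalize=True).to_dict(),
    }

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "age": [34, 41, 52, 63],
    "axil_nodes": [0, 1, 3, 20],
    "survival_status": [1, 1, 1, 2],
})
summary = summarise(toy)
print(summary)
```

On the real data this would be called as `summarise(haberman)`.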

# 2. Univariate Analysis (PDF, CDF, Box Plot and Violin Plot)

In [None]:
"""
2.1 PDF and Histogram

A probability density function (PDF) describes how likely a continuous variable is to fall near a value x; the area under the curve over an interval is the probability of that interval. The curve can be read as a smoothed version of the histogram.
Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable.
Here the bar heights are normalised densities, so within each group the bar areas sum to 1.
"""
# `height` replaces the older `size` argument of FacetGrid, and
# `histplot(kde=True, stat="density")` replaces the deprecated `distplot`.
for feature in ["age", "operation_year", "axil_nodes"]:
    sns.FacetGrid(haberman, hue="survival_status", height=5) \
       .map(sns.histplot, feature, kde=True, stat="density") \
       .add_legend()
    plt.show()
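
To make the KDE idea concrete, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy only. It is not how seaborn computes its curve internally (seaborn chooses a bandwidth automatically); the function name `gaussian_kde_1d`, the fixed bandwidth of 5, and the sample ages are assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardised distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

ages = np.array([30, 34, 41, 45, 52, 52, 60, 63, 70])  # made-up values
grid = np.linspace(0, 100, 1001)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# A density should integrate to roughly 1; check with a Riemann sum.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one gives a smoother curve that can hide structure.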


In [None]:
"""
2.2 PDF and CDF of age

The cumulative distribution function (CDF) is the probability that the variable takes a value less than or equal to x.
"""
label = ["PDF", "CDF"]
groups = [("Status 1", status1), ("Status 2", status2)]
for fig_num, (name, subset) in enumerate(groups, start=1):
    if fig_num > 1:
        print("*" * 100)  # separator between the two groups
    print(name)
    counts, bin_edges = np.histogram(subset["age"], bins=10, density=True)
    # Rescale so the bin values sum to 1 (probability mass per bin)
    pdf = counts / counts.sum()
    print(pdf)
    print(bin_edges)
    cdf = np.cumsum(pdf)
    plt.figure(fig_num)
    plt.title("PDF and CDF for age (" + name + ")")
    plt.xlabel("Age")
    plt.plot(bin_edges[1:], pdf)
    plt.plot(bin_edges[1:], cdf)
    plt.legend(label)
    plt.show()

In [None]:
"""
2.2 PDF and CDF of operation year

The cumulative distribution function (CDF) is the probability that the variable takes a value less than or equal to x.
"""
label = ["PDF", "CDF"]
groups = [("Status 1", status1), ("Status 2", status2)]
for fig_num, (name, subset) in enumerate(groups, start=1):
    if fig_num > 1:
        print("*" * 100)  # separator between the two groups
    print(name)
    counts, bin_edges = np.histogram(subset["operation_year"], bins=10, density=True)
    # Rescale so the bin values sum to 1 (probability mass per bin)
    pdf = counts / counts.sum()
    print(pdf)
    print(bin_edges)
    cdf = np.cumsum(pdf)
    plt.figure(fig_num)
    plt.title("PDF and CDF for Operation Year (" + name + ")")
    plt.xlabel("Operation Year")
    plt.plot(bin_edges[1:], pdf)
    plt.plot(bin_edges[1:], cdf)
    plt.legend(label)
    plt.show()

In [None]:
"""
2.2 PDF and CDF of axil nodes

The cumulative distribution function (CDF) is the probability that the variable takes a value less than or equal to x.
"""
label = ["PDF", "CDF"]
groups = [("Status 1", status1), ("Status 2", status2)]
for fig_num, (name, subset) in enumerate(groups, start=1):
    if fig_num > 1:
        print("*" * 100)  # separator between the two groups
    print(name)
    counts, bin_edges = np.histogram(subset["axil_nodes"], bins=10, density=True)
    # Rescale so the bin values sum to 1 (probability mass per bin)
    pdf = counts / counts.sum()
    print(pdf)
    print(bin_edges)
    cdf = np.cumsum(pdf)
    plt.figure(fig_num)
    plt.title("PDF and CDF for Axil Nodes (" + name + ")")
    plt.xlabel("Axil Nodes")
    plt.plot(bin_edges[1:], pdf)
    plt.plot(bin_edges[1:], cdf)
    plt.legend(label)
    plt.show()
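
The CDFs above depend on the choice of 10 bins. As an alternative, an empirical CDF can be computed without binning by sorting the values. This is a sketch, not a replacement for the cells above; the function name `ecdf` and the sample node counts are assumptions for illustration.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the fraction of points <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

nodes = np.array([0, 0, 0, 1, 1, 2, 4, 7, 13, 30])  # made-up values
x, y = ecdf(nodes)

# Fraction of points with at most 4 nodes, read off the sorted array.
share_le_4 = float(np.searchsorted(x, 4, side="right") / x.size)
print(share_le_4)
```

The pair `(x, y)` can be drawn with `plt.step(x, y, where="post")`, which gives one step per data point instead of one point per bin.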

In [None]:
"""
2.3 Box plot with whiskers
A box plot summarises a distribution through its quartiles: the box spans the 25th to 75th percentiles, the line inside marks the median, and points beyond the whiskers are drawn as outliers.
"""
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
sns.boxplot(x="survival_status", y="age", data=haberman, ax=axes[0]).set_title("Box plot for survival_status and age")
sns.boxplot(x="survival_status", y="operation_year", data=haberman, ax=axes[1]).set_title("Box plot for survival_status and operation_year")
sns.boxplot(x="survival_status", y="axil_nodes", data=haberman, ax=axes[2]).set_title("Box plot for survival_status and axil_nodes")
plt.show()
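
The numbers that a box plot draws can also be computed directly. The sketch below assumes the common 1.5 × IQR whisker rule (the default in seaborn and matplotlib); the function name `box_stats` and the sample values are hypothetical.

```python
import numpy as np

def box_stats(values, whisker=1.5):
    """Quartiles, IQR, and points outside the whisker fences."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Points beyond the fences are the ones drawn as individual markers.
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}

nodes = np.array([0, 0, 1, 1, 2, 3, 4, 5, 30])  # made-up values
result = box_stats(nodes)
print(result)
```

On the real data, `box_stats(status1["axil_nodes"])` and `box_stats(status2["axil_nodes"])` would give the per-class values behind the third panel.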

In [None]:
"""
2.4 Violin plot
A violin plot combines a box plot with a density estimate (the PDF shown earlier) in a single figure.
Denser regions of the data are wider and sparser regions are narrower.
"""
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
sns.violinplot(x="survival_status", y="age", data=haberman, ax=axes[0]).set_title("Violin plot for survival_status and age")
sns.violinplot(x="survival_status", y="operation_year", data=haberman, ax=axes[1]).set_title("Violin plot for survival_status and operation_year")
sns.violinplot(x="survival_status", y="axil_nodes", data=haberman, ax=axes[2]).set_title("Violin plot for survival_status and axil_nodes")
plt.show()

**Observation(s):**
* For survivors, the number of axil nodes is densely concentrated between 0 and 5.
* Almost 80% of the patients have 5 or fewer axil nodes, and patients with 0 axil nodes form the largest group.
* The data in the axil_nodes column is right-skewed.
* The dataset is too small to predict survival status reliably, and the two classes overlap for most of the data.
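
Skew can be quantified rather than judged from the plots. A minimal sketch, assuming the simple (biased) third standardised moment; `sample_skewness` is a hypothetical helper and the node counts are made up. Note that pandas' `Series.skew()` applies a bias correction, so its value will differ slightly from this one.

```python
import numpy as np

def sample_skewness(values):
    """Third standardised moment (population form, no bias correction)."""
    v = np.asarray(values, dtype=float)
    centred = v - v.mean()
    return float(np.mean(centred ** 3) / np.mean(centred ** 2) ** 1.5)

nodes = np.array([0, 0, 0, 1, 1, 2, 4, 7, 13, 30])  # made-up values
skew = sample_skewness(nodes)
print(skew)

# A long right tail also pulls the mean above the median.
print(nodes.mean(), np.median(nodes))
```

A positive value indicates a long right tail, a negative value a long left tail, and a value near zero a roughly symmetric distribution.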

# 3. Bi-variate analysis (scatter plots, pair-plots) 

In [None]:
"""
3.1 1-D scatter plots of axil_nodes and age
"""
# Each point is one patient; y is fixed at 0, so only the x position carries information.
plt.xlabel("Axil Nodes")
plt.plot(status1["axil_nodes"], np.zeros_like(status1["axil_nodes"]), 'o', label="status 1")
plt.plot(status2["axil_nodes"], np.zeros_like(status2["axil_nodes"]), 'o', label="status 2")
plt.legend()
plt.show()
plt.xlabel("Age")
plt.plot(status1["age"], np.zeros_like(status1["age"]), 'o', label="status 1")
plt.plot(status2["age"], np.zeros_like(status2["age"]), 'o', label="status 2")
plt.legend()
plt.show()

In [None]:
"""
3.2 2-D scatter plot
Two-dimensional scatterplots visualize a relation (correlation) between two variables X and Y
"""
sns.set_style("whitegrid")
# `height` replaces the older `size` argument of FacetGrid.
sns.FacetGrid(haberman, hue="survival_status", height=4) \
   .map(plt.scatter, "age", "axil_nodes") \
   .add_legend()
plt.show()
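
A scatter plot shows a relationship visually; the Pearson correlation coefficient puts a number on its linear part. The sketch below is illustrative only: `pearson_r` is a hypothetical helper, and the ages and node counts are made up rather than taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

age = np.array([30, 40, 50, 60, 70])   # made-up values
nodes = np.array([0, 3, 1, 8, 2])      # made-up values
r = pearson_r(age, nodes)
print(r)
```

On the real data the same number is available from `haberman[["age", "axil_nodes"]].corr()`. A value near 0 means little linear relationship, which would match a scatter plot with no visible trend.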

In [None]:
"""
3.3 Pair Plot

Pair plot in seaborn plots the scatter plot between every two data columns in a given dataframe.
It is used to visualize the relationship between two variables.
"""
plt.close()
sns.set_style("whitegrid")
# `height` replaces the older `size` argument of pairplot.
sns.pairplot(haberman, hue="survival_status", height=4, vars=["age", "operation_year", "axil_nodes"])
plt.show()

**Observation(s):**
* Patients older than 77 died within 5 years (survival status = 2), and patients younger than 34 survived 5 years or longer (survival status = 1).
* Most of the patients had between 0 and 5 positive axillary nodes.
* Patients with more than 46 axil nodes died within 5 years (survival status = 2).
* The axil_nodes feature gives more intuition about survival than the other features.
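
Threshold observations like these can be checked against the data instead of being read off a plot. The sketch below assumes the notebook's column names; the helpers `feature_ranges_by_status` and `statuses_above`, and the six toy rows, are hypothetical. With so few patients at the extremes, such rules may not generalise beyond this sample.

```python
import pandas as pd

def feature_ranges_by_status(df, feature):
    """Min and max of one feature within each survival_status group."""
    return df.groupby("survival_status")[feature].agg(["min", "max"])

def statuses_above(df, feature, threshold):
    """Set of survival statuses among rows with feature > threshold."""
    return set(df.loc[df[feature] > threshold, "survival_status"].tolist())

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "age": [30, 33, 50, 60, 78, 83],
    "survival_status": [1, 1, 2, 1, 2, 2],
})
ranges = feature_ranges_by_status(toy, "age")
print(ranges)

# If only one status appears above the threshold, the rule holds on this sample.
above_77 = statuses_above(toy, "age", 77)
print(above_77)
```

On the real data, `statuses_above(haberman, "age", 77)` and `statuses_above(haberman, "axil_nodes", 46)` would each return a single-element set if the corresponding observation holds.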