1. Download the Haberman Cancer Survival dataset from Kaggle. You may have to create a Kaggle account to download the data. (https://www.kaggle.com/gilsousa/habermans-survival-data-set)
2. Perform EDA on this dataset with the following sections:
* High-level statistics of the dataset: number of points, number of features, number of classes, data points per class.
* Explain our objective.
* Perform univariate analysis (PDF, CDF, box plot, violin plots) to understand which features are useful for classification.
* Perform bivariate analysis (scatter plots, pair plots) to see if combinations of features are useful in classification.
* Write your observations in English as crisply and unambiguously as possible. Always quantify your results.

Attribute Information:
* 1- Age of patient at time of operation (numerical)
* 2- Patient's year of operation (year - 1900, numerical)
* 3- Number of positive axillary nodes detected (numerical)
* 4- Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#Loading data
haber = pd.read_csv('/kaggle/input/habermans-survival-data-set/haberman.csv', header=None, \
                    skiprows=1, \
                    names=['op_age', 'op_year', 'postv_axil_nodes', 'survival_status'])
haber.head(5)

In [None]:
# Some pre-processing for better understanding
# Changing 1 to "Positive" and 2 to "Negative"
# A single .map() call avoids chained assignment, which may silently fail to update the DataFrame

haber["survival_status"] = haber["survival_status"].map({1: "Positive", 2: "Negative"})
haber.head(10)

#### High-level statistics of the dataset: number of points, number of features, number of classes, data points per class.

In [None]:
print(haber.shape)
#There are 306 rows and 4 columns

#Check the name of the columns
print(haber.columns)

In [None]:
haber.isnull().any() #This says that there are no missing values in my DataFrame

#For reference: 
#https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b

In [None]:
haber.info() #the three feature columns are integers; survival_status now holds strings

In [None]:
#Obtaining the statistical info from the data
#getting statistical data about year and status is not that useful

haber.describe()

In [None]:
# Survival count
print(haber['survival_status'].value_counts())
#https://seaborn.pydata.org/generated/seaborn.countplot.html
p = sns.countplot(x="survival_status" ,hue="survival_status", data=haber)
p.set_title("Cancer Survival Status")
p.set_xlabel("Survival Status\nPositive: Survived\nNegative: Not Survived")

#### Observations:
* The dataset has 306 observations of cancer patients with no missing values.
* Most patients survived 5 years or longer.
* The mean and the median (50% value) of the operation ages are almost equal.
* There is a large gap between the mean and the median of the positive axillary nodes. We will see this again with boxplots.
* The 75% value and the maximum of the positive axillary nodes also differ hugely, which implies that this feature has extreme values (outliers).
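
The mean-versus-median and 75%-versus-max comparisons above can be turned into numbers. The following is a minimal sketch on a small hypothetical sample (the values in `nodes` are made up for illustration and are not rows from the Haberman data); it uses Tukey's 1.5 × IQR rule, which is one common convention for flagging outliers and is the rule boxplots typically draw.

```python
import numpy as np

# Hypothetical right-skewed sample standing in for a node-count column.
# These values are illustrative only, not taken from the dataset.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 13, 25, 46])

mean = nodes.mean()
median = np.median(nodes)

# Quartiles and Tukey's upper fence: values above it are flagged as outliers.
q1, q3 = np.percentile(nodes, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = nodes[nodes > upper_fence]

print(f"mean={mean:.2f}, median={median:.2f}")
print(f"Q3={q3:.2f}, max={nodes.max()}, upper fence={upper_fence:.2f}")
print("flagged as outliers:", outliers.tolist())
```

On the real DataFrame the same steps would presumably be applied to `haber["postv_axil_nodes"]`.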

#### OBJECTIVE: To predict whether a patient who has undergone surgery for cancer will survive 5 years or longer.

#### Perform univariate analysis (PDF, CDF, box plot, violin plots) to understand which features are useful for classification.

In [None]:
#1D Scatter plot
haber_survived = haber[haber["survival_status"]=="Positive"]
haber_not_survived = haber[haber["survival_status"]=="Negative"]

plt.figure(figsize=(8,8))
#plotting for axillary nodes
plt.subplot(211)
plt.plot(haber_survived["postv_axil_nodes"], \
         np.zeros_like(haber_survived['postv_axil_nodes']), \
         'o', label="survived")

plt.plot(haber_not_survived["postv_axil_nodes"], \
         np.zeros_like(haber_not_survived['postv_axil_nodes']), \
         'o', label="not survived")

plt.xlabel("positive axillary nodes")
plt.title("1D Distribution of Positive Axillary Nodes")
plt.legend()

#plotting for operation age
plt.subplot(212)
plt.plot(haber_survived["op_age"], \
         np.zeros_like(haber_survived['op_age']),\
         'o', label="survived")

plt.plot(haber_not_survived["op_age"], \
         np.zeros_like(haber_not_survived['op_age']),\
         'o', label="not survived")  # label fixed: this series is the non-survivors

plt.xlabel("operation age")
plt.title("1D Distribution of Operation Age")
plt.legend()

plt.subplots_adjust(hspace=0.4)

##### Observation:
* A 1D plot like this makes it very hard to see any relation between the features and the objective: the points overlap so heavily that we cannot tell how many points of each colour sit at a given value. Hence, we use other plot types below to understand the distributions better.
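
How much a 1D plot hides can be counted directly. The sketch below uses a small hypothetical array of ages (illustrative values, not rows from the dataset) and counts how many points share a value with another point and therefore land on the same marker.

```python
import numpy as np

# Hypothetical ages, for illustration only.
ages = np.array([34, 38, 38, 41, 41, 41, 52, 52, 60])

# Each distinct value is drawn as one visible marker; the rest are hidden under it.
values, counts = np.unique(ages, return_counts=True)
hidden = int((counts - 1).sum())

print(dict(zip(values.tolist(), counts.tolist())))
print(f"{hidden} of {ages.size} points are hidden behind another marker")
```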

In [None]:
# Observing by creating Density graphs

# Note: FacetGrid's `size` argument was renamed to `height`, and
# sns.distplot is deprecated; histplot with kde=True draws a comparable graph.
p1= sns.FacetGrid(haber, hue="survival_status", height=3.5)\
    .map(sns.histplot, "op_age", kde=True, stat="density")
#https://stackoverflow.com/questions/29813694/how-to-add-a-title-to-seaborn-facet-plot
p1.fig.suptitle("Density graph- Operation Age", x=0.5, y=1.01)
p1.add_legend()

p2= sns.FacetGrid(haber, hue="survival_status", height=3.5)\
    .map(sns.histplot, "op_year", kde=True, stat="density")
p2.fig.suptitle("Density graph- Operation Year", x=0.5, y=1.01)
p2.add_legend()

p3= sns.FacetGrid(haber, hue="survival_status", height=3.5)\
    .map(sns.histplot, "postv_axil_nodes", kde=True, stat="density")
p3.fig.suptitle("Density graph- Positive Axillary Nodes", x=0.5, y=1.01)
p3.add_legend()

#####  Observations:
* It is very difficult to decide with simple if/else rules on the above features whether a patient will survive. The densities of the two classes overlap heavily and cannot be separated with simple thresholds.
* Because of this overlap, cumulative density graphs were not created to determine the error percentages.
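
For reference, if error percentages were wanted later, they could be read off an empirical CDF. This is a minimal sketch on two small hypothetical samples (the arrays and the helper `empirical_cdf` are illustrative assumptions, not part of the notebook or the dataset): the CDF at a threshold is the fraction of a class at or below that threshold.

```python
import numpy as np

# Hypothetical node counts per class, for illustration only.
survived = np.array([0, 0, 0, 0, 1, 1, 2, 3, 4, 7])
not_survived = np.array([0, 1, 3, 4, 6, 9, 11, 13, 20, 23])

def empirical_cdf(sample, threshold):
    """Fraction of the sample with a value <= threshold."""
    return float(np.mean(sample <= threshold))

threshold = 4
p_surv = empirical_cdf(survived, threshold)
p_not = empirical_cdf(not_survived, threshold)

print(f"survived with <= {threshold} nodes:     {p_surv:.0%}")
print(f"not survived with <= {threshold} nodes: {p_not:.0%}")
```

The closer the two fractions are at every threshold, the less the feature separates the classes on its own.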

Let's go ahead and check for outliers, and how the feature distributions differ between the two classes, using boxplots and violin plots.

In [None]:
plt.figure(figsize=(10,4)) #setting size of the figure

#https://stackoverflow.com/questions/42406233/how-to-add-title-to-seaborn-boxplot
plt.subplot(131)
sns.boxplot(y='op_age', x='survival_status', data=haber)\
    .set_title("BoxPlot- Operation Age")

plt.subplot(132)
sns.boxplot(y='op_year', x='survival_status', data=haber)\
    .set_title("BoxPlot- Operation Year")

plt.subplot(133)
sns.boxplot(y='postv_axil_nodes', x='survival_status', data=haber)\
    .set_title("BoxPlot- Positive Axillary Nodes")
plt.subplots_adjust(wspace=0.4)
#adjusting gaps between subplots as the labels overlapped:
#   https://jakevdp.github.io/PythonDataScienceHandbook/04.08-multiple-subplots.html

In [None]:
plt.figure(figsize=(10,5)) #setting size of the figure

plt.subplot(131)
sns.violinplot(data=haber, y='op_age', x='survival_status')\
    .set_title("Violin Plot- Operation Age")

plt.subplot(132)
sns.violinplot(data=haber, y='op_year', x='survival_status')\
    .set_title("Violin Plot- Operation Year")

plt.subplot(133)
sns.violinplot(data=haber, y='postv_axil_nodes', x='survival_status')\
    .set_title("Violin Plot- Positive Axillary Nodes")

plt.subplots_adjust(wspace=0.7)

#### Observations:
* The error percentage is too high for our objective if we try to draw a univariate conclusion from this dataset: in every single feature the two classes overlap heavily.
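
The error percentage of a univariate rule can be measured rather than judged by eye. The sketch below is an assumption-laden illustration: the arrays are hypothetical, and `threshold_error` is a helper invented here. It scans candidate cutoffs for the rule "predict survival when the feature is at or below the cutoff" and reports the error rate of each.

```python
import numpy as np

# Hypothetical feature values and labels, for illustration only.
nodes = np.array([0, 0, 1, 2, 3, 5, 8, 0, 2, 10])
survived = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0], dtype=bool)

def threshold_error(feature, label, cutoff):
    """Error rate of the rule 'predict survival when feature <= cutoff'."""
    predicted = feature <= cutoff
    return float(np.mean(predicted != label))

# Try every integer cutoff in the observed range and keep the best one.
errors = {c: threshold_error(nodes, survived, c) for c in range(0, 11)}
best = min(errors, key=errors.get)

print(f"best cutoff: {best}, error rate: {errors[best]:.0%}")
```

Running the same scan on each column of `haber` would presumably give a number to quote for each feature.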

Hence, we move on to multivariate analysis and check whether combinations of features give more useful results.

#### Perform bivariate analysis (scatter plots, pair plots) to see if combinations of features are useful in classification.

In [None]:
p= sns.pairplot(data=haber, hue='survival_status', height=3)  # `size` was renamed to `height`
p.fig.suptitle("Pair Plot Cancer Survival DataSet", y=1.02)

#### Final Words:
* With these features and simple analysis alone, it is not possible to reach conclusions without a high error rate.
* To draw conclusions from this dataset, we need to go with a hybrid approach.