# Exploratory Data Analysis on the Haberman Dataset

## Dataset Information:
- Number of instances: 306
- Number of attributes: 4 (including the class attribute)
- Attribute information:
  1. Age of patient at time of operation (numerical)
  2. Patient's year of operation (year - 1900, numerical)
  3. Number of positive axillary nodes detected (numerical)
  4. Survival status (class attribute):
     - 1 = the patient survived 5 years or longer
     - 2 = the patient died within 5 years
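
The encodings listed above (year stored as year - 1900, survival status stored as 1 or 2) can be decoded into more readable columns. The following is a minimal sketch under stated assumptions: the four rows are made-up values in the same layout, not records copied from the real file, and the names `status_labels`, `Status label` and `Calendar year` are illustrative only.

```python
import io
import pandas as pd

# Made-up rows in the same four-column layout (illustrative, not real records).
raw = "30,64,1,1\n34,59,0,2\n45,66,0,1\n52,61,12,2\n"

columns = ["Age", "Year", "Axillary nodes", "Survival status"]

# header=None tells pandas that the first line is data, not column names.
sample = pd.read_csv(io.StringIO(raw), header=None, names=columns)

# Map the class codes to readable labels (label texts are illustrative).
status_labels = {1: "survived 5 years or longer", 2: "died within 5 years"}
sample["Status label"] = sample["Survival status"].map(status_labels)

# The year is stored as (year - 1900), so add 1900 to get the calendar year.
sample["Calendar year"] = sample["Year"] + 1900

print(sample)
```

Keeping the original coded columns next to the decoded ones means the rest of an analysis can still filter on `Survival status == 1` or `== 2`.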

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# Download the data set from
# https://www.kaggle.com/gilsousa/habermans-survival-data-set/data
# Load the data set. The file has no header row, so pass header=None;
# otherwise pandas uses the first record as column names and drops it from the data.
url = "../input/haberman.csv"
haberman = pd.read_csv(url, header=None)

In [None]:
# Number of data points and features
print(haberman.shape)

In [None]:
# The data set has no column names, so add headers to the columns.
haberman.columns = ["Age", "Year", "Axillary nodes", "Survival status"]
print(haberman.columns)

In [None]:
haberman.head()

In [None]:
# How many patients survived 5 years or longer, and how many died within 5 years?
haberman["Survival status"].value_counts()

### Observation:
1. **Imbalanced** data set.

2. With all 306 records loaded, **225 patients survived 5 years or longer and 81 patients died within 5 years**, so the classes are clearly not balanced.
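
The degree of imbalance can also be expressed as class shares and a majority-to-minority ratio. This is a small sketch on toy labels (not the real data); the variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy labels, not the real data:
# 1 = survived 5 years or longer, 2 = died within 5 years.
status = pd.Series([1, 1, 1, 2, 1, 2, 1, 1], name="Survival status")

counts = status.value_counts()                 # absolute count per class
shares = status.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # majority size / minority size

print(counts)
print(shares.round(3))
print(f"majority:minority ratio = {imbalance_ratio:.2f}")
```

The same three lines can be applied to `haberman["Survival status"]` once the data set is loaded.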

## 2-D Scatter Plot

In [None]:
# Plain scatter plot of age against axillary nodes
haberman.plot(kind='scatter', x='Age', y='Axillary nodes')
plt.show()

In [None]:
sns.set_style("whitegrid")
# `height` replaces the `size` argument, which was renamed in seaborn 0.9
sns.FacetGrid(haberman, hue="Survival status", height=6) \
   .map(plt.scatter, "Age", "Axillary nodes") \
   .add_legend()
plt.show()


### Observation:

1. Most patients appear to have 0 positive axillary nodes detected.

## Pair Plot

In [None]:
plt.close()
sns.set_style("whitegrid")
# `height` replaces the `size` argument, which was renamed in seaborn 0.9
sns.pairplot(haberman, hue="Survival status",
             vars=['Age', 'Year', 'Axillary nodes'], height=3)
plt.show()
# The diagonal elements are PDFs for each feature.

### Observation:

1. ***Axillary nodes versus Age*** is the most useful plot: it at least shows that most patients who survived had 0 positive axillary nodes detected.

2. The two classes cannot be separated easily with the scatter plots above, because the points of both classes are mostly ***overlapping***.
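
The first observation can be backed by a number: the share of patients with zero positive nodes in each class. The sketch below uses a toy frame with the notebook's column names; the values are invented for illustration and `zero_share` is a hypothetical name.

```python
import pandas as pd

# Toy values, not the real data.
toy = pd.DataFrame({
    "Axillary nodes":  [0, 0, 1, 0, 3, 0, 7, 0, 12, 2],
    "Survival status": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of zero-node patients per class.
zero_share = (
    toy.assign(zero_nodes=toy["Axillary nodes"].eq(0))
       .groupby("Survival status")["zero_nodes"]
       .mean()
)
print(zero_share)
```

Replacing `toy` with `haberman` would give the corresponding shares for the real data.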

## Histogram, PDF

In [None]:
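# Note: sns.distplot is deprecated since seaborn 0.11
# (sns.histplot and sns.displot are its replacements).
# `height` replaces the `size` argument, which was renamed in seaborn 0.9.
sns.FacetGrid(haberman, hue="Survival status", height=5) \
   .map(sns.distplot, "Axillary nodes") \
   .add_legend()
plt.show()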

In [None]:
sns.FacetGrid(haberman, hue="Survival status", height=5) \
   .map(sns.distplot, "Age") \
   .add_legend();
plt.show();

In [None]:
sns.FacetGrid(haberman, hue="Survival status", height=5) \
   .map(sns.distplot, "Year") \
   .add_legend();
plt.show();

### Observation:

1. From the PDFs above (univariate analysis), neither Age nor Year is a useful feature on its own, because the **distributions are very similar for patients who survived and patients who died**.

2. **Axillary nodes** is the only feature that helps to tell the survival status apart, because its distribution differs between the two classes (labels). From that distribution we can infer that **most patients who survived had zero positive axillary nodes**.

3. In the Year distribution, the density of patients who did not survive drops and then rises sharply between 1958 and 1960. Let's check the summary statistics to get more insight.
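
One way to look more closely at the third observation is to count patients per operation year and class, which avoids the smoothing of a density plot. This is only a sketch on invented values; `per_year`, `totals` and `died_share` are hypothetical names.

```python
import pandas as pd

# Toy values, not the real data. Year is stored as (year - 1900).
toy = pd.DataFrame({
    "Year":            [58, 58, 58, 59, 59, 60, 60, 60, 61, 61],
    "Survival status": [1,  2,  2,  1,  1,  1,  2,  1,  1,  2],
})

# Rows: operation year, columns: class code (1 or 2), cells: patient counts.
per_year = pd.crosstab(toy["Year"], toy["Survival status"])

# Share of patients operated in each year who died within 5 years.
totals = per_year.sum(axis=1)
died_share = per_year[2] / totals

print(per_year)
print(died_share.round(3))
```

With the real data, a dip in the class-2 column around a particular year would show up directly in the table.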


## CDF

In [None]:
# Split the data set in two according to the label "Survival status":
# alive means status == 1 and dead means status == 2
alive = haberman.loc[haberman["Survival status"] == 1]
dead = haberman.loc[haberman["Survival status"] == 2]


In [None]:
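# Histogram of axillary nodes for patients who survived 5 years or longer
counts, bin_edges = np.histogram(alive['Axillary nodes'], bins=30,
                                 density=True)
# Normalize so the bin values sum to 1 (probability mass per bin)
pdf = counts / counts.sum()
print(pdf)
print(bin_edges)
# The running total of the bin probabilities gives the CDF
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(['PDF for patients who survived 5 years or longer',
            'CDF for patients who survived 5 years or longer'])
plt.show()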

In [None]:
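# Histogram of axillary nodes for patients who died within 5 years
counts, bin_edges = np.histogram(dead['Axillary nodes'], bins=30, density=True)

# Normalize so the bin values sum to 1 (probability mass per bin)
pdf = counts / counts.sum()
print(pdf)
print(bin_edges)
# The running total of the bin probabilities gives the CDF
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(['PDF for patients who died within 5 years',
            'CDF for patients who died within 5 years'])
plt.show()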
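
The two cells above repeat the same steps, so they can be sketched as one helper. This is an illustrative version, not the notebook's own code: the function name `binned_pdf_cdf` and the toy values are assumptions, and it uses raw counts instead of `density=True`. For equal-width bins both routes give the same per-bin probabilities after dividing by the sum.

```python
import numpy as np

def binned_pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probabilities and their cumulative sum."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of values in each bin
    cdf = np.cumsum(pdf)          # fraction of values in bins 0..i
    return bin_edges, pdf, cdf

# Toy node counts, not the real data.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 20])
edges, pdf, cdf = binned_pdf_cdf(nodes, bins=4)

print(edges)
print(pdf)
print(cdf)
```

Here `cdf[i]` is the fraction of values that fall in bins up to and including bin `i`, which is why the notebook plots it against the right bin edges `bin_edges[1:]`.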

In [None]:
# Also check the summary statistics below for ways to distinguish
# patients who survived from those who did not

## Mean, Variance and Std-dev

In [None]:
print("Summary statistics of patients who survived 5 years or longer:")
alive.describe()

In [None]:
print("Summary statistics of patients who died within 5 years:")
dead.describe()

### Observations:
1. Both tables show that the statistics are **similar for almost all features, except for Axillary nodes**.

2. The **mean (average) number of axillary nodes is higher** for patients who died within 5 years than for patients who lived 5 years or longer.

3. From the CDFs, we can infer that in this data set only patients who died within 5 years had **more than 46 axillary nodes detected**.
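
The comparison of means and the threshold reading from the CDFs can both be reproduced with a grouped aggregation. The sketch below runs on invented values, so its numbers do not match the real tables; `summary`, `survivor_max` and `above` are hypothetical names.

```python
import pandas as pd

# Toy values, not the real data.
toy = pd.DataFrame({
    "Axillary nodes":  [0, 1, 0, 4, 10, 3, 9, 0, 25, 15],
    "Survival status": [1, 1, 1, 1, 1,  2, 2, 2, 2,  2],
})

# Mean, median and maximum node count per class.
summary = (
    toy.groupby("Survival status")["Axillary nodes"]
       .agg(["mean", "median", "max"])
)

# Largest node count seen among survivors; any patient above it
# necessarily belongs to the other class in this sample.
survivor_max = summary.loc[1, "max"]
above = toy[toy["Axillary nodes"] > survivor_max]

print(summary)
print(above)
```

Such a threshold only describes the sample at hand; with very few patients above it, it should be read as a description rather than a prediction rule.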

## Box Plot and Whiskers

In [None]:
sns.boxplot(x='Survival status',y='Axillary nodes', data=haberman)
plt.show()

In [None]:
sns.boxplot(x='Survival status',y='Age', data=haberman)
plt.show()

In [None]:
sns.boxplot(x='Survival status',y='Year', data=haberman)
plt.show()

## Violin plots

In [None]:
# Denser regions of the data are fatter, and sparser ones thinner 
#in a violin plot

sns.violinplot(x='Survival status', y='Year', data=haberman)
plt.show()

In [None]:
sns.violinplot(x='Survival status', y='Axillary nodes', data=haberman)
plt.show()

In [None]:

sns.violinplot(x='Survival status', y='Age', data=haberman)
plt.show()

### Observation:
1. From the box and violin plots, we can say that the patients who died are concentrated at **ages 46-62 and operation years 59-65**, while the patients who survived are concentrated at **ages 42-60 and operation years 60-66**.
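
The box in a box plot spans the 25th to the 75th percentile, so ranges like the ones above can be read off numerically instead of by eye. A minimal sketch on toy ages (not the real data); the column labels `Q1`, `median`, `Q3` and `IQR` are chosen here for illustration.

```python
import pandas as pd

# Toy values, not the real data.
toy = pd.DataFrame({
    "Age":             [30, 40, 50, 60, 70, 35, 45, 55, 65, 75],
    "Survival status": [1,  1,  1,  1,  1,  2,  2,  2,  2,  2],
})

# Quartiles per class: one row per class, one column per quantile.
quartiles = (
    toy.groupby("Survival status")["Age"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
       .rename(columns={0.25: "Q1", 0.5: "median", 0.75: "Q3"})
)

# Interquartile range: the height of the box in the box plot.
quartiles["IQR"] = quartiles["Q3"] - quartiles["Q1"]

print(quartiles)
```

The same call on `haberman` with `"Age"` or `"Year"` gives the box boundaries for the real data.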
    

In [None]:
# Contour plot
sns.jointplot(x="Age", y="Year", data=haberman, kind="kde")
plt.show()