# Data visualization with Haberman Dataset

#### <font color='green'> Author - Atul Kumar </font> 

## 1. Dataset 

### Haberman's Survival Data
- The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

- Number of Instances: 306

- Number of Attributes: 4 (including the class attribute)

- Attribute Information:
    - Age of patient at time of operation (numerical)
    - Patient's year of operation (year - 1900, numerical)
    - Number of positive axillary nodes detected (numerical)
    - Survival status (class attribute)
        - 1 = the patient survived 5 years or longer
        - 2 = the patient died within 5 years

- Missing Attribute Values: None

## 2. Objective

To perform exploratory data analysis and check which independent variables, or combinations of independent variables, are important for predicting the survival status of a cancer patient.


## 3. Importing Libraries 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels import robust

## 4. Loading the csv file

In [1]:
hsd = pd.read_csv('../input/haberman.csv', names = ['Age', 'Year', 'Aux_nodes', 'Sr_stat'])

In [3]:
hsd.head()

In [3]:
# Converting 1 to Survived and 2 to Not Survived strings to make the data more meaningful

hsd['Sr_stat'] = hsd['Sr_stat'].map({1:"Survived", 2:"Not Survived"})
hsd['Sr_stat'] = hsd['Sr_stat'].astype('category')

In [5]:
hsd.head()

In [6]:
hsd.tail()

## 5. High level statistical analysis 

In [7]:
hsd.shape

306 data points 

3 features 

1 dependent variable 

In [8]:
hsd.columns

Column legends 
- Age: Age of patient at the time of operation
- Year: Patient's year of operation 
- Aux_nodes: Number of positive axillary nodes detected
- Sr_stat: Survival status

In [9]:
hsd['Sr_stat'].value_counts()

Observations
- There are 2 classes of data
    - 225 patients survived 5 years or longer
    - 81 patients died within 5 years
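
The two counts above imply an imbalanced dataset. As a minimal sketch using only those two counts (the variable names are illustrative), the class proportions and the accuracy of a trivial "always predict the majority class" rule can be worked out as follows; any model built later would need to beat this baseline to be useful:

```python
# Class counts taken from the value_counts() observation above.
counts = {"Survived": 225, "Not Survived": 81}
total = sum(counts.values())

# Share of each class in the dataset.
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial rule that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline ({majority_label}): {baseline_accuracy:.1%}")
```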

## 6. Univariate analysis

### 6.1 Histogram and PDF

In [5]:
sns.set(style = "whitegrid")
# `size` was renamed to `height` in seaborn 0.9
sns.FacetGrid(hsd, hue = "Sr_stat", height = 5) \
   .map(sns.distplot, "Age") \
   .add_legend()
plt.xlabel("age of patient at the time of operation")
plt.show()

Observations 
* The plot shows that age at the time of operation is not a good feature on its own for determining survival status: the histograms for patients who survived and patients who died within 5 years overlap heavily, and their PDFs intersect at several points.

In [6]:
sns.FacetGrid(hsd, hue = "Sr_stat", height = 5) \
   .map(sns.distplot, "Year") \
   .add_legend()
plt.xlabel("patient's year of operation")
plt.show()

Observations 
* Year of operation is also a poor feature on its own for determining survival status, because the histograms overlap heavily here as well.

In [17]:
sns.FacetGrid(hsd, hue = "Sr_stat", height = 5, ylim = (0, 0.55)) \
   .map(sns.distplot, "Aux_nodes") \
   .add_legend()
plt.xlabel("number of positive axillary nodes")
plt.show()

Observation
- Close to 0 positive nodes, the density for patients who survived 5 years after the operation is very high (above 0.5).
- The two distributions still overlap, so no single feature can determine the survival status of a patient on its own.
- <b>But this feature separates the two classes much better than the other features.</b>

### 6.2 CDF

In [23]:
survived = hsd.loc[hsd["Sr_stat"] == "Survived"]
not_survived = hsd.loc[hsd["Sr_stat"] == "Not Survived"]

#### CDF of number of positive axillary nodes
<b>[IMP]</b> Since this feature showed the least overlap of the three, we will examine it further and check whether survival status can be classified on the basis of this feature alone.

In [31]:
#Survived

counts, bin_edges = np.histogram(survived['Aux_nodes'], bins = 10, density = True)
pdf = counts/sum(counts)
print('FOR SURVIVED')
print()
print("PDF: " ,pdf)
print("BIN EDGES: ", bin_edges)
CDF = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], CDF)

#Not Survived
print()
print()
counts, bin_edges = np.histogram(not_survived['Aux_nodes'], bins = 10, density = True)
pdf = counts/sum(counts)
print('FOR NOT SURVIVED')
print()
print("PDF: " ,pdf)
print("BIN EDGES: ", bin_edges)
CDF = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], CDF)
plt.xlabel('number of positive axillary nodes')
plt.ylabel('PDF/CDF')
plt.legend(['Survived pdf', 'Survived CDF','Not survived pdf', 'Not survived CDF'])
plt.show()

Observations
- For Survived data points
    - At about 10 nodes, where the PDF value is about 0.08, approximately 92% of the points lie to the left, i.e. before the point (10, 0.08).
    - <b> So most of the points lie in the range (0, 10) </b>
- For Not Survived data points
    - At about 10 nodes, where the PDF value is about 0.17, approximately 70% of the points lie before the point (10, 0.17).
    - At about 20 nodes, where the PDF value is about 0.1, approximately 90% of the points lie before the point (20, 0.1).
    - <b> So most of the points lie in the range (0, 20) </b>
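
The percentages above are read off the binned CDF curves. As a rough sketch of the same idea without binning, the share of patients at or below a given node count can be computed directly. The small array below is made-up illustrative data, not values from the Haberman dataset, and `fraction_at_or_below` is a hypothetical helper name:

```python
import numpy as np

def fraction_at_or_below(values, threshold):
    """Empirical CDF at `threshold`: share of values <= threshold."""
    values = np.asarray(values)
    return float(np.mean(values <= threshold))

# Made-up node counts for illustration only (NOT the real dataset).
toy_nodes = np.array([0, 0, 0, 1, 2, 3, 4, 8, 12, 25])

# Share of toy patients with at most 10 positive nodes.
print(fraction_at_or_below(toy_nodes, 10))
```

With the notebook's own data, the same helper could be called as `fraction_at_or_below(survived['Aux_nodes'], 10)` to cross-check the value read from the plot.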
    

### 6.3 Low level statistical analysis

In [10]:
print("FOR SURVIVED")
survived.iloc[:,0:3].describe()

In [11]:
print("FOR NOT SURVIVED")
not_survived.iloc[:,0:3].describe()

#### 6.3.1 Mean

As the tables show, the mean number of positive axillary nodes is considerably higher for patients who did not survive than for patients who survived.

Age, by contrast, is not a useful factor for categorizing survival status.

#### 6.3.2 Median 

In [12]:
print("Median number of positive axillary nodes is: {0} for SURVIVED".format(np.median(survived["Aux_nodes"])))

So at least half of the patients who survived had 0 positive axillary nodes.

In [13]:
print("Median number of positive axillary nodes is: {0} for NOT SURVIVED".format(np.median(not_survived["Aux_nodes"])))

Patients who died within 5 years, by contrast, have a noticeably larger median.

#### 6.3.3 Standard Deviation (spread from mean)

$$\mathrm{std} = \sqrt{\mathrm{Var}} = \sqrt{\frac{\sum_{i = 1}^{N} (x_{i} - \mathrm{mean})^{2}}{N}}$$

As the tables above show, the standard deviation is almost identical between the two groups for the age and year features, but differs noticeably for the number of positive axillary nodes.
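
One detail is worth keeping in mind when comparing against the tables: the formula above divides by N (population standard deviation), whereas pandas' `describe()` reports the sample standard deviation, which divides by N - 1. A small sketch with made-up numbers (not dataset values), using only Python's standard library, shows the difference:

```python
import math
import statistics

# Made-up values for illustration only (NOT taken from the dataset).
values = [0, 0, 1, 3, 6]

mean = sum(values) / len(values)
squared_dev = sum((x - mean) ** 2 for x in values)

# Population standard deviation: divide by N, as in the formula above.
population_std = math.sqrt(squared_dev / len(values))

# Sample standard deviation: divide by N - 1 (the pandas default, ddof=1).
sample_std = math.sqrt(squared_dev / (len(values) - 1))

# Cross-check both against the standard library implementations.
print(population_std, statistics.pstdev(values))
print(sample_std, statistics.stdev(values))
```

For groups as large as the ones in this dataset the two versions differ only slightly, so the comparison between groups is unaffected.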

#### 6.3.4 Quantiles

For patients who survived, the 25th and 50th percentiles are both zero, which means at least 50% of the values are zero. For patients who did not survive, the values between the 25th and 75th percentiles range from 1 to 11. We can infer that patients who survived are more likely to have 0 positive axillary nodes.

#### 6.3.5 Median Absolute Deviation (spread from median)

$$\mathrm{MAD} = \mathrm{median}\left(\left|x_{i} - \mathrm{median}_{x}\right|\right)_{i = 1}^{n}$$

In [14]:
print("FOR SURVIVED")
print(robust.mad(survived['Aux_nodes']))
print("FOR NOT SURVIVED")
print(robust.mad(not_survived['Aux_nodes']))

This gives a better picture than the standard deviation because it is robust to outliers.

<b> The MAD is larger for patients who did not survive, i.e. their values are more spread out than those of patients who survived. </b>
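
For reference, here is a minimal sketch of the MAD computed by hand on made-up numbers (not dataset values). Note that `robust.mad` in statsmodels, by default, divides the raw MAD by a normalizing constant (about 0.6745, the 0.75 quantile of the standard normal distribution), so its output is roughly 1.48 times the raw formula. The helper name `raw_mad` is illustrative:

```python
import statistics

def raw_mad(values):
    """Median of absolute deviations from the median (no scaling)."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])

# Made-up node counts for illustration only (NOT the real dataset).
toy_nodes = [0, 0, 0, 1, 2, 3, 10]

mad = raw_mad(toy_nodes)

# Normal-consistency constant: the 0.75 quantile of the standard normal.
c = statistics.NormalDist().inv_cdf(0.75)
scaled_mad = mad / c

print(mad, scaled_mad)
```

The single large value (10) barely moves the MAD, which is the robustness to outliers mentioned above.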

### 6.4 Box plot

In [102]:
sns.boxplot(x = 'Sr_stat', y = 'Aux_nodes', data = hsd)
plt.xlabel("survival status")
plt.ylabel("number of positive axillary nodes")
plt.show()

Observations

For Survived 
- Minimum value -> 0
- 25th percentile -> 0
- 50th percentile -> 0
- 75th percentile -> 3
- Inference -> There are many outliers, but at least 50% of the values are 0.

For Not Survived 
- Minimum value -> 0
- 25th percentile -> 1
- 50th percentile -> 4
- 75th percentile -> 11
- Inference -> There are only two outliers, and 75% of the data points lie between 0 and 11.
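
The outliers mentioned above follow from the box plot's whisker rule. Assuming seaborn's default of whiskers at 1.5 times the interquartile range, the upper outlier threshold for each group can be worked out from the quartiles listed above (`upper_fence` is an illustrative helper name, not a seaborn function):

```python
def upper_fence(q1, q3, whis=1.5):
    """Upper whisker limit: points above Q3 + whis * IQR are drawn as outliers."""
    return q3 + whis * (q3 - q1)

# Quartiles read from the observations above.
survived_fence = upper_fence(q1=0, q3=3)        # 3 + 1.5 * 3
not_survived_fence = upper_fence(q1=1, q3=11)   # 11 + 1.5 * 10

print(survived_fence, not_survived_fence)
```

Under that assumption, a survivor with more than 7 positive nodes is drawn as an outlier, whereas for non-survivors only counts above 26 are. This is consistent with the many outliers in the first box and the few in the second.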

### 6.5 Violin plot

In [15]:
sns.violinplot(x = 'Sr_stat', y = 'Aux_nodes', data = hsd)
plt.xlabel("survival status")
plt.ylabel("number of positive axillary nodes")
plt.show()

In [19]:
sns.violinplot(x = 'Sr_stat', y = 'Year', data = hsd)
plt.xlabel("survival status")
plt.ylabel("year of operation")
plt.show()

In [21]:
sns.violinplot(x = 'Sr_stat', y = 'Age', data = hsd)
plt.xlabel("survival status")
plt.ylabel("Age at time of operation")
plt.show()

Observations
- In the first violin plot, <b> survivors are most likely to have a number of positive axillary nodes close to zero </b>, while <b> non-survivors have values spread from 0 to 30 with appreciable probability </b>.
- In the second plot, patients operated on between <b> 1958 and 1963 appear more likely to survive </b>, in particular between <b> 1958 and 1960 </b>, while patients operated on between <b> 1963 and 1966 appear more likely to die within 5 years. </b>
- In the third plot, patients aged <b> 43 to 50 at the time of operation appear more likely to survive. </b>

## 7. Bi-variate analysis

### 7.1 2-D Scatter plot

In [17]:
sns.FacetGrid(hsd, hue = 'Sr_stat', height = 5) \
   .map(plt.scatter, 'Year', 'Aux_nodes') \
   .add_legend()
plt.xlabel("patient's year of operation")
plt.ylabel('number of positive axillary nodes')
plt.show()

Observations 
- Most patients appear to have 0 positive axillary nodes.
- The data cannot be classified using a simple model such as a single if-else rule.
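
To make the "simple if-else model" concrete, the sketch below shows what such a rule could look like and how its accuracy would be measured. The threshold of 3 nodes is an arbitrary assumption, the function names are hypothetical, and the sample pairs are made up for illustration; none of them come from the Haberman dataset:

```python
def predict_by_nodes(nodes, threshold=3):
    """Hypothetical if-else rule: few positive nodes -> predict 'Survived'."""
    if nodes <= threshold:
        return "Survived"
    return "Not Survived"

def accuracy(samples, threshold=3):
    """Share of (nodes, label) pairs that the rule classifies correctly."""
    hits = sum(predict_by_nodes(n, threshold) == label for n, label in samples)
    return hits / len(samples)

# Made-up (nodes, label) pairs for illustration only (NOT the real dataset).
toy_samples = [
    (0, "Survived"), (0, "Survived"), (1, "Survived"), (8, "Survived"),
    (0, "Not Survived"), (4, "Not Survived"),
    (11, "Not Survived"), (23, "Not Survived"),
]

print(accuracy(toy_samples))
```

Because both classes contain patients with 0 nodes, any single threshold misclassifies some of them, which is the overlap visible in the scatter plot.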

### 7.2 Pair plot

In [18]:
plt.close()
sns.pairplot(hsd, hue = 'Sr_stat', vars = ['Age', 'Year', 'Aux_nodes'], height = 4)
plt.show()

Observations
- Even using two features together doesn't seem to solve our problem completely.
- But the Aux_nodes vs. Year plot seems to separate the data points more clearly than the other plots.
- Inference -> Aux_nodes and Year are better features than Age to classify the survival status of a patient.

## 8. Conclusion

To predict whether a patient will survive the 5 years following the operation, we can use all features to build our model, since no feature, individually or in a pair, classifies the survival status with good accuracy using a simple model. However, if we were restricted to only the best features rather than all of them, the order of preference would be: Aux_nodes > (Aux_nodes, Year) > Year. With the help of EDA, I was able to visualize which of the given features are the most useful.