# Assignment 2 - Haberman Dataset (EDA)

Taken from [Kaggle](https://www.kaggle.com/gilsousa/habermans-survival-data-set/version/1), the dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

**Information of the Dataset:**


1.   Number of Attributes: 4
2.   Total instances: 306
3.   Attribute Information:
      *   **age**: Age of the patient at the time of operation (numerical)
      *   **year**: Patient's year of operation (numerical)
      *   **nodes**: Number of positive axillary nodes detected (numerical)
      *   **status**: Survival status of the patient. (1 - Survived > 5yrs / 2 - Died within 5yrs) (Class attribute)
4.   The number of **axillary nodes** is an important parameter because breast cancer usually spreads first to the lymph nodes near the arm, the area closest to the breast. The number of positive axillary lymph nodes helps the doctor confirm the diagnosis and stage the cancer.
5.   Number of null values: 0
6.   Number of classes: 2
7.   Number of data points per class: (1 - 225, 2 - 81)
      *   225 instances of patients who survived for more than 5 yrs after operation
      *   81 instances of patients who died within 5 yrs of the operation
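
The class counts above mean the dataset is imbalanced, which matters when judging any later classifier. A minimal sketch of that arithmetic, using only the counts listed above (the variable names are illustrative and not part of the original notebook):

```python
# Class counts as reported in the dataset summary above
class_counts = {1: 225, 2: 81}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(proportions.values())

print(f"total instances: {total}")
for label, share in proportions.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any model built on this data should be compared against the accuracy that always predicting class 1 would achieve (225 out of 306, about 73.5%).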






In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set_style('darkgrid')

In [None]:
cols = ['age', 'op_year', 'axil_nodes', 'surv_status']
data = pd.read_csv('../input/habermans-survival-data-set/haberman.csv', names=cols)
data.info()

In [None]:
data.surv_status.value_counts()

In [None]:
data.describe()

**Observations:**
1. The min and max age for the patients are:
        min: 30
        max: 83
2. The min and max year for performing the operation are:
        min: 58
        max: 69
3. The min and max number of axillary lymph nodes are:
        min: 0
        max: 52
4. Even though the maximum number of nodes found was 52, 75% of the patients had fewer than 5 nodes and at least 25% of the patients had no positive nodes
5. 75% of the patients were aged below 61 yrs
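
The percentile statements above are read from the quartile rows of `describe()`. As a hedged illustration of how to interpret them, the sketch below uses a small made-up series (not the Haberman values) to show that a 25th percentile of 0 only says that at least a quarter of the values are 0, while a direct comparison gives the exact share:

```python
import pandas as pd

# Hypothetical toy values for illustration only, NOT the Haberman data
nodes = pd.Series([0, 0, 0, 0, 1, 2, 3, 4, 9, 23], name='axil_nodes')

# Quartiles, as reported by describe()
q25, q75 = nodes.quantile([0.25, 0.75])

# Exact shares computed directly from the values
share_zero = (nodes == 0).mean()
share_below_5 = (nodes < 5).mean()

print(f"25th percentile: {q25}, 75th percentile: {q75}")
print(f"share with 0 nodes: {share_zero:.0%}")
print(f"share with fewer than 5 nodes: {share_below_5:.0%}")
```

Running the same two comparisons on `data['axil_nodes']` would give the exact shares behind observation 4.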

# Objective

The objective is to determine whether we can predict if a patient will survive for more than 5 yrs after the operation, given the patient's age, the year in which the operation was performed, and the number of positive axillary nodes.
<br/>This is a classification problem.

# Preprocessing

We will first make the data more readable by converting the `surv_status` column into a new `survived_more_than_5yrs` column, with `1` mapped to `yes` and `2` mapped to `no`.

In [None]:
data['survived_more_than_5yrs'] = data['surv_status'].apply(lambda x: 'yes' if x == 1 else 'no')
data.drop('surv_status', axis=1, inplace=True)
data.head()

In [None]:
data.info()

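Now, we see that the `survived_more_than_5yrs` feature has been created with the `object` dtype, which is the general-purpose dtype pandas uses here for Python strings. It is not the dedicated `category` dtype, although the column can be treated as a categorical feature for plotting.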
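
If a true categorical column is wanted, the conversion is explicit. A minimal sketch on a toy frame (the frame `toy` is hypothetical and only stands in for the notebook's `data`):

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's `data`
toy = pd.DataFrame({'surv_status': [1, 2, 1, 1, 2]})

# Same mapping as in the preprocessing step: 1 -> 'yes', 2 -> 'no'
toy['survived_more_than_5yrs'] = toy['surv_status'].map({1: 'yes', 2: 'no'})
print(toy['survived_more_than_5yrs'].dtype)  # a string-like dtype, not 'category'

# Explicit conversion to the pandas categorical dtype
toy['survived_more_than_5yrs'] = toy['survived_more_than_5yrs'].astype('category')
print(toy['survived_more_than_5yrs'].dtype)
print(list(toy['survived_more_than_5yrs'].cat.categories))
```

The categorical dtype stores each distinct label once and exposes the label set through the `.cat` accessor.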

# Univariate Analysis

##  Distribution Plots and PDFs

In [None]:
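# The last column is the class label, so only the three numerical features are plotted
for feature in data.columns[:-1]:
    grid = sns.FacetGrid(data, hue='survived_more_than_5yrs', height=5)
    # sns.distplot is deprecated in recent seaborn releases;
    # histplot with kde=True draws a comparable histogram and density curve
    grid.map(sns.histplot, feature, kde=True, stat='density').add_legend()
    plt.show()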

**Observations:**
1. **age**
    1. Patients who died within 5 yrs of the operation are concentrated around ages 40 - 58 (approx.), whereas patients who survived for more than 5 yrs peak around ages 50 - 55 (approx.), so the two distributions overlap heavily.
    2. We cannot conclude anything related to our objective by looking at the age of the patients alone.
2. **op_year**
    1. The largest number of operations on patients who survived for more than 5 yrs was performed around 1960 (approx.).
    2. The largest number of operations on patients who died within 5 yrs was performed around 1965 (approx.).
    3. No conclusive insight can be drawn from the `op_year` feature.
    
3. **axil_nodes**
    1. We can clearly see that patients with fewer axillary lymph nodes (< 4 approx.) were more likely to survive.
    2. This is a good feature to base our model on.
    
**Conclusion:**
1. `age` has no direct relation to the survival status of the patient.
2. `op_year` has no direct relation to the survival status of the patient.
3. `axil_nodes` is a promising feature for classification, as patients with < 4 axillary nodes were more likely to survive for more than 5 yrs after the operation.
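
The threshold reading in point 3 can be checked as a one-line rule. The sketch below is an illustration under stated assumptions: the rows are made up (not the Haberman data), and `rule_accuracy` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy rows for illustration only, NOT the Haberman data
toy = pd.DataFrame({
    'axil_nodes': [0, 1, 2, 0, 7, 12, 3, 5],
    'survived_more_than_5yrs': ['yes', 'yes', 'yes', 'no', 'no', 'no', 'yes', 'yes'],
})

def rule_accuracy(frame, threshold=4):
    """Share of rows where the rule 'nodes < threshold means survived' matches the label."""
    predicted = (frame['axil_nodes'] < threshold).map({True: 'yes', False: 'no'})
    return (predicted == frame['survived_more_than_5yrs']).mean()

accuracy = rule_accuracy(toy)
print(f"rule accuracy on the toy rows: {accuracy:.0%}")
```

Calling `rule_accuracy(data)` on the real frame would show whether the rule beats the share of the majority class (225 of 306 patients), which is the accuracy of always predicting survival.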

## CDF Plots

In [None]:
def get_cdf_info(dataset):
    ''' Calculate the per-bin probability (pdf), cumulative probability (cdf) and bin edges for each feature '''
    cdf = {}
    pdf = {}
    bins = {}
    # The last column is the class label, so it is skipped
    for feature in dataset.columns[:-1]:
        counts, bins_edge = np.histogram(dataset[feature], bins=15, density=True)
        # Normalise so that the per-bin values sum to 1
        pdf[feature] = np.asarray(counts / np.sum(counts))
        cdf[feature] = np.cumsum(pdf[feature])
        bins[feature] = bins_edge
        
    return pdf, cdf, bins

In [None]:
# Divide the dataset based on the survival status of the patients
survived_data = data[data['survived_more_than_5yrs'] == 'yes'].copy()
not_survived_data = data[data['survived_more_than_5yrs'] == 'no'].copy()

In [None]:
# Plotting the cdf and pdf for the survived dataset
pdf, cdf, bins = get_cdf_info(survived_data.copy())

plt.figure(figsize=(20, 5))
for indx, feature in enumerate(survived_data.columns[:-1]):
    plt.subplot(1, 3, indx + 1)
    # Plot the PDF first and the CDF second so the curves match the legend order
    plt.plot(bins[feature][1:], pdf[feature])
    plt.plot(bins[feature][1:], cdf[feature])
    plt.legend(['PDF', 'CDF'])
    plt.xlabel(feature)
    plt.ylabel('Probability')

**Observations (Survived):**
1. From the pdf-cdf plot for the axillary nodes, it can be observed that out of all the patients who survived for more than 5 yrs, about 85% had fewer than 5 axillary nodes.
2. Around 82% of the patients who survived were aged less than 60 yrs.
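
Percentages read from a CDF built on 15 histogram bins are approximate, because each bin is several values wide. A hedged alternative is the exact empirical CDF computed from the sorted values. In the sketch below, `ecdf` and `fraction_below` are hypothetical helpers and the sample values are made up:

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the cumulative share of points at or below each one."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def fraction_below(values, threshold):
    """Exact share of values strictly below the threshold."""
    x = np.sort(np.asarray(values))
    return np.searchsorted(x, threshold, side='left') / x.size

# Hypothetical toy values for illustration only, NOT the Haberman data
toy_nodes = [0, 0, 1, 2, 3, 4, 4, 6, 10, 20]
x, y = ecdf(toy_nodes)
print(fraction_below(toy_nodes, 5))
```

Applied to `survived_data['axil_nodes']` and `survived_data['age']`, `fraction_below` would give exact counterparts of the two percentages above.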

In [None]:
# Plotting the cdf and pdf for the not survived dataset
pdf, cdf, bins = get_cdf_info(not_survived_data.copy())

plt.figure(figsize=(20, 5))
for indx, feature in enumerate(not_survived_data.columns[:-1]):
    plt.subplot(1, 3, indx + 1)
    # Plot the PDF first and the CDF second so the curves match the legend order
    plt.plot(bins[feature][1:], pdf[feature])
    plt.plot(bins[feature][1:], cdf[feature])
    plt.legend(['PDF', 'CDF'])
    plt.xlabel(feature)
    plt.ylabel('Probability')

**Observations (Not survived):**
1. From the pdf-cdf plot for axillary nodes, we can observe that out of all the patients who died within 5 yrs of the operation, approximately 40% had more than 5 axillary nodes.
2. Around 50% of the patients who died within 5 yrs had 4 or fewer axillary nodes.
3. There is a sharp increase in the rate of death of patients as the number of axillary nodes rises.

## Box plots and Violin plots

In [None]:
# Box plots
plt.figure(figsize=(15, 5))
for indx, feature in enumerate(data.columns[:-1]):
    plt.subplot(1, 3, indx + 1)
    plt.subplots_adjust(wspace=0.8)
    sns.boxplot(x='survived_more_than_5yrs', y=feature, data=data)

**Observations:**
1. We can see that there is a significant number of outliers in the `axil_nodes` feature for the patients who survived for > 5 yrs, but only 2 outliers for the patients who died within 5 yrs of the operation.
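
The points a box plot draws as outliers are, by default, the values lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of counting them per class follows; `count_box_outliers` is a hypothetical helper and the rows are made up, not the Haberman data.

```python
import pandas as pd

def count_box_outliers(series, whisker=1.5):
    """Count values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return int(((series < lower) | (series > upper)).sum())

# Hypothetical toy rows for illustration only
toy = pd.DataFrame({
    'axil_nodes': [0, 0, 1, 1, 2, 30, 3, 4, 5, 6],
    'survived_more_than_5yrs': ['yes'] * 6 + ['no'] * 4,
})

outliers = toy.groupby('survived_more_than_5yrs')['axil_nodes'].apply(count_box_outliers)
print(outliers)
```

The same `groupby` on `data` would turn the visual impression from the box plot into explicit counts per class.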

In [None]:
# Violin plots - show a kernel density estimate of each group around a miniature box plot
plt.figure(figsize=(15, 5))
for indx, feature in enumerate(data.columns[:-1]):
    plt.subplot(1, 3, indx + 1)
    plt.subplots_adjust(wspace=0.8)
    sns.violinplot(x='survived_more_than_5yrs', y=feature, data=data)

**Conclusion of Univariate Analysis:**
1. `axil_nodes` is the most important feature for classification.
2. After `axil_nodes`, `age` can also be used for classification, although it appears far less significant.
3. The initial analysis shows that `op_year` is not a significant feature for classification.

# Bivariate Analysis

## Pair plot

In [None]:
sns.pairplot(data, hue='survived_more_than_5yrs', height=5)
plt.show()

**Observations:**
1. Unable to find any meaningful pattern or observation from the pair plots.
2. There might be some relation or pattern between `age` and `axil_nodes`; it can be explored further using a scatter plot.

## Scatter plot

In [None]:
sns.scatterplot(x='age', y='axil_nodes', data=data, hue='survived_more_than_5yrs')
plt.show()

**Observations:**
1. We can see that if the number of axillary nodes is very low, most patients survived for more than 5 yrs regardless of age.
2. As the number of axillary nodes increases, patients younger than 45 yrs still had a high chance of survival.
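
Scatter-plot readings like these can be cross-checked by tabulating the survival rate per age band and node band. The sketch below is illustrative only: the rows are made up, and the band edges (45 yrs; 0, 1-4, 5+ nodes) are assumptions chosen to mirror the observations above.

```python
import pandas as pd

# Hypothetical toy rows for illustration only, NOT the Haberman data
toy = pd.DataFrame({
    'age': [34, 41, 38, 52, 60, 47, 65, 70],
    'axil_nodes': [0, 5, 2, 0, 8, 1, 12, 0],
    'survived_more_than_5yrs': ['yes', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'no'],
})

# Assumed band edges: ages below 45 vs 45 and above; nodes 0, 1-4, 5 or more
toy['age_band'] = pd.cut(toy['age'], bins=[0, 45, 120], right=False,
                         labels=['<45', '>=45']).astype(str)
toy['node_band'] = pd.cut(toy['axil_nodes'], bins=[-1, 0, 4, 100],
                          labels=['0', '1-4', '5+']).astype(str)
toy['survived'] = (toy['survived_more_than_5yrs'] == 'yes').astype(int)

# Survival rate and group size for every (age band, node band) combination
summary = toy.groupby(['age_band', 'node_band'])['survived'].agg(['mean', 'count'])
print(summary)
```

The `count` column matters as much as the rate: a high survival rate in a band with only a handful of patients is weak evidence.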

**Conclusion:**
1. `axil_nodes` and `age` can be considered the significant features for the classification problem on the Haberman survival dataset.