# **Haberman's Survival: Exploratory Data Analysis**

## **Data Description:**

The Haberman's survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

## **Attribute Information:**

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years

## **Environment Configuration:**

In [None]:
import os
print(os.listdir('../input/habermans-survival-data-set/'))

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

#Load haberman.csv into a pandas dataFrame.
cancer_df = pd.read_csv("../input/habermans-survival-data-set/haberman.csv", header=0, names=["Age", "Operation_Year", "Positive_Axillary_Nodes", "Survival_Status_After_5_Years"])
print(cancer_df.head())


## **High Level Statistics of Dataset and Data Preparation:**


In [None]:
cancer_df.shape

In [None]:
cancer_df.info()

In [None]:
print("Target variable distribution:")
print(cancer_df['Survival_Status_After_5_Years'].value_counts())
print(cancer_df.iloc[:,-1].value_counts(normalize = True))

#### **Observations:**

1. Each column has 306 non-null entries, which equals the total number of data points, so the dataset has no missing values and no imputation is needed.
2. The 'Survival_Status_After_5_Years' column is of type int, so it needs to be converted to a categorical type with meaningful values.
3. The dataset is imbalanced: 225 data points (73.53%) belong to patients who survived 5 years or longer and only 81 data points (26.47%) belong to patients who died within 5 years.


In [None]:
#yes : the patient survived 5 years or longer.
#no : the patient died within 5 years
cancer_df['Survival_Status_After_5_Years'] = cancer_df['Survival_Status_After_5_Years'].map({1:"yes", 2:"no"})
cancer_df['Survival_Status_After_5_Years'] = cancer_df['Survival_Status_After_5_Years'].astype('category')
cancer_df['Survival_Status_After_5_Years'].value_counts()

In [None]:
cancer_df.info()

In [None]:
print(cancer_df.describe())

#### **Observations:**

1. The age of the patients varies from 30 to 83, with a median age of 52.
2. The number of positive axillary nodes varies from 0 to 52, but 25% of patients had 0 positive axillary nodes, 50% had at most 1, and 75% had at most 4.


In [None]:
survival_status_yes=cancer_df.loc[cancer_df["Survival_Status_After_5_Years"]=="yes"]
survival_status_no=cancer_df.loc[cancer_df["Survival_Status_After_5_Years"]=="no"]

print("Patients who survived 5 years or more :")
print(survival_status_yes.describe())
print("\n****************************************************************************\n")
print("Patients who died within 5 years :")
print(survival_status_no.describe())

#### **Observations:**

**Patients who survived 5 years or longer:**
1. The mean number of positive axillary nodes is 2.79.
2. The median number of positive axillary nodes is 0, which implies that at least 50% of these patients had 0 positive axillary nodes.
3. Although the maximum was 46, 75% of these patients had 3 or fewer positive axillary nodes.

**Patients who died within 5 years:**
1. The mean number of positive axillary nodes is 7.45.
2. The median number of positive axillary nodes is 4, which implies that 50% of these patients had 4 or fewer positive axillary nodes.
3. The maximum was 52, and 75% of these patients had 11 or fewer positive axillary nodes.
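
The per-class statistics quoted above can also be collected into a single table with `groupby`. The sketch below is only illustrative: it uses a tiny hand-made DataFrame (`toy_df`, with hypothetical values that are not taken from haberman.csv) that reuses the column names of `cancer_df`.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; NOT the Haberman data.
toy_df = pd.DataFrame({
    "Positive_Axillary_Nodes": [0, 0, 1, 3, 10, 2, 4, 9, 21],
    "Survival_Status_After_5_Years": ["yes"] * 5 + ["no"] * 4,
})

# One row per class, one column per summary statistic.
node_summary = (
    toy_df.groupby("Survival_Status_After_5_Years")["Positive_Axillary_Nodes"]
    .agg(["count", "mean", "median", "max"])
)
print(node_summary)
```

On the real data the same call would presumably be applied to `cancer_df`. Because that column is categorical there, passing `observed=True` to `groupby` may be needed to avoid a warning in recent pandas versions.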

## **Objective:**

To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age, year of operation, and number of positive axillary nodes.

## **Univariate Analysis:**

### **Histograms:**

In [None]:
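sns.set()
for feature in list(cancer_df.columns)[:-1]:
    # sns.distplot is deprecated in recent seaborn releases;
    # histplot with a KDE overlay gives an equivalent density histogram.
    sns.FacetGrid(cancer_df, hue="Survival_Status_After_5_Years", height=5) \
        .map(sns.histplot, feature, kde=True, stat="density") \
        .add_legend()
    plt.title("Histogram of " + feature)
    plt.show()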

#### **Observations:**

1. More deaths occurred among patients operated on around 1958 and around 1965 than among patients operated on in other years.
2. Patients who survived 5 years or longer mostly had between 0 and 5 positive axillary nodes, while patients who died within 5 years were spread over a wider range of roughly 0 to 20.

### **PDF and CDF:**

In [None]:
plt.figure(figsize=(18,5))
for idx, feature in enumerate(list(cancer_df.columns)[:-1]):
    plt.subplot(1, 3, idx+1)
    print("\n********* "+feature+" *********\n")
    counts, bin_edges = np.histogram(cancer_df[feature], bins=10, density=True)
    print("Bin Edges: {}".format(bin_edges))
    pdf = counts/sum(counts)
    print("PDF: {}".format(pdf))
    cdf = np.cumsum(pdf)
    print("CDF: {}".format(cdf))
    plt.plot(bin_edges[1:], pdf, bin_edges[1:], cdf)
    plt.xlabel(feature)

#### **Observations:**

Most patients (~80%) had fewer than 10 positive axillary nodes.
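
To make the link between the histogram, the PDF and the CDF concrete, here is a minimal sketch on a small hypothetical array (not the Haberman data). It mirrors the cell above: the bin counts are normalised so that they sum to 1, then accumulated.

```python
import numpy as np

# Hypothetical node counts for illustration only; NOT the Haberman data.
values = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 12])

# Raw counts over 4 equal-width bins spanning [0, 12].
counts, bin_edges = np.histogram(values, bins=4)

# Fraction of points per bin (a discretised PDF) and its running total (CDF).
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)

print("Bin edges:", bin_edges)
print("PDF:", pdf)
print("CDF:", cdf)
```

Each CDF value is the fraction of points that fall in that bin or in any bin to its left, which is why the cell above plots it against `bin_edges[1:]`. Reading a statement such as "~80% of patients had fewer than 10 nodes" off the plot amounts to finding the bin edge at which the CDF reaches about 0.8.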

In [None]:
for idx, feature in enumerate(list(cancer_df.columns)[:-1]):
    print("\n********* "+feature+" *********\n")
    counts1, bin_edges1 = np.histogram(survival_status_yes[feature], bins=10, density=True)
    counts2, bin_edges2 = np.histogram(survival_status_no[feature], bins=10, density=True)
    print("People who survived more than 5 years:\n")
    print("Bin Edges: {}".format(bin_edges1))
    pdf1 = counts1/sum(counts1)
    print("PDF: {}".format(pdf1))
    cdf1 = np.cumsum(pdf1)
    print("CDF: {}".format(cdf1))
    print("\nPeople who died within 5 years:\n")
    print("Bin Edges: {}".format(bin_edges2))
    pdf2 = counts2/sum(counts2)
    print("PDF: {}".format(pdf2))
    cdf2 = np.cumsum(pdf2)
    print("CDF: {}".format(cdf2))
    plt.figure(figsize=(18,5))
    plt.subplot(1, 2, 1)
    plt.plot(bin_edges1[1:], pdf1, bin_edges1[1:], cdf1)
    plt.xlabel(feature)
    plt.title("People who survived more than 5 years")
    plt.subplot(1, 2, 2)
    plt.plot(bin_edges2[1:], pdf2, bin_edges2[1:], cdf2)
    plt.xlabel(feature)
    plt.title("People who died within 5 years")
    plt.show()

### **Quantiles, 90th Percentile, and Median Absolute Deviation(MAD):**

In [None]:
# Import once, outside the loop.
from statsmodels import robust

for feature in list(cancer_df.columns[:-1]):
    print("\n**********"+feature+"**********")
    # 0th, 25th, 50th and 75th percentiles
    print("\nQuantiles:")
    print("People who survived more than 5 years:")
    print(np.percentile(survival_status_yes[feature],np.arange(0, 100, 25)))
    print("People who died within 5 years:")
    print(np.percentile(survival_status_no[feature],np.arange(0, 100, 25)))

    print("\n90th Percentiles:")
    print("People who survived more than 5 years:")
    print(np.percentile(survival_status_yes[feature],90))
    print("People who died within 5 years:")
    print(np.percentile(survival_status_no[feature],90))

    # By default robust.mad divides the raw MAD by ~0.6745, so that it is
    # comparable to a standard deviation for normally distributed data.
    print("\nMedian Absolute Deviation:")
    print("People who survived more than 5 years:")
    print(robust.mad(survival_status_yes[feature]))
    print("People who died within 5 years:")
    print(robust.mad(survival_status_no[feature]))


#### **Observations:**

1. More deaths occurred among patients operated on around 1958 and around 1965 than among patients operated on in other years.
2. The IQR (inter-quartile range = 75th percentile - 25th percentile) of positive axillary nodes is 3 for patients who survived 5 years or longer, whereas it is 10 for patients who died within 5 years.
3. The 90th percentile of positive axillary nodes is 8 for patients who survived 5 years or longer, meaning 90% of them had 8 or fewer positive axillary nodes, whereas the 90th percentile for patients who died within 5 years is 20.
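
As a worked illustration of the IQR and MAD definitions used above, the following sketch computes both with NumPy alone on a small hypothetical array (not the Haberman data). The scaling constant in the last step is an assumption about the default behaviour of `statsmodels.robust.mad`, which divides the raw MAD by the 75th percentile of the standard normal distribution.

```python
import numpy as np

# Hypothetical node counts for illustration only; NOT the Haberman data.
nodes = np.array([0, 0, 1, 2, 3, 4, 8, 11, 20])

# Quartiles and inter-quartile range (75th percentile - 25th percentile).
q25, q50, q75 = np.percentile(nodes, [25, 50, 75])
iqr = q75 - q25

# Raw median absolute deviation: median of |x - median(x)|.
raw_mad = np.median(np.abs(nodes - q50))

# Assumed default scaling of statsmodels' robust.mad: divide by ~0.6745.
scaled_mad = raw_mad / 0.6744897501960817

print("IQR:", iqr)
print("Raw MAD:", raw_mad)
print("Scaled MAD:", scaled_mad)
```

Both measures ignore the extreme values in the tail, which is why they suit a skewed feature such as the number of positive axillary nodes better than the standard deviation does.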

### **Box Plots:**

In [None]:
plt.figure(figsize=(18,5))
for idx, feature in enumerate(list(cancer_df.columns)[:-1]):
    plt.subplot(1, 3, idx+1)
    sns.boxplot(x='Survival_Status_After_5_Years',y=feature, data=cancer_df)

### **Violin Plots:**

In [None]:
plt.figure(figsize=(18,5))
for idx, feature in enumerate(list(cancer_df.columns)[:-1]):
    plt.subplot(1, 3, idx+1)
    sns.violinplot(x='Survival_Status_After_5_Years',y=feature, data=cancer_df)


#### **Observations:**

The box plots and violin plots above reinforce the earlier observations.

## **Multivariate Analysis:**


### **Pair-Plot and Scatter Plots:**

In [None]:
sns.pairplot(cancer_df, hue="Survival_Status_After_5_Years", height=3);
plt.show()

In [None]:
sns.scatterplot(x="Age",y="Positive_Axillary_Nodes",data=cancer_df, hue='Survival_Status_After_5_Years')
plt.show()

In [None]:
import plotly.express as px
fig = px.scatter_3d(cancer_df, x='Age', y='Operation_Year', z='Positive_Axillary_Nodes',
              color='Survival_Status_After_5_Years', width=600,height=400)
fig.update_layout(legend=dict(
    yanchor="top",
    y=0.99,
    xanchor="left",
    x=0.01
),
margin=dict(
        l=0,
        r=0,
        b=0,
        t=0,
        pad=0
    ))
fig.show()

#### **Observations:**

1. Patients younger than ~40 with fewer than ~8 positive axillary nodes tend to survive 5 years or longer.
2. Patients older than ~40 with more than ~4 positive axillary nodes tend to die within 5 years.
3. Patients of all ages tend to die within 5 years if they have a high number (> ~4) of positive axillary nodes.
4. Patients aged between ~38 and ~70 had higher numbers of positive axillary nodes.
5. The data is not linearly separable.

## **Final Conclusions:**

1. The dataset is imbalanced, with fewer data points belonging to patients who died within 5 years. The dataset has no missing values. The classes are not linearly separable.
2. The number of positive axillary nodes varies from 0 to 52, but 25% of patients had 0 positive axillary nodes, 50% had at most 1, and 75% had at most 4.
3. More deaths occurred among patients operated on around 1958 and around 1965 than among patients operated on in other years.
4. The median number of positive axillary nodes among patients who survived 5 years or longer is 0, which implies that at least 50% of them had 0 positive axillary nodes; 90% of them had 8 or fewer.
5. The median number of positive axillary nodes among patients who died within 5 years is 4, which implies that 50% of them had 4 or fewer positive axillary nodes; 90% of them had 20 or fewer.
6. The higher the number of positive axillary nodes, the higher the chance of death within 5 years.
7. Patients younger than ~40 with fewer than ~8 positive axillary nodes tend to survive 5 years or longer.
8. Patients older than ~40 with more than ~4 positive axillary nodes tend to die within 5 years.
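
One way to check a conclusion such as "more positive nodes, higher chance of death" numerically is to bin the node counts and compute the share of deaths per bin. The sketch below is an assumption-laden illustration: `toy_df` holds hypothetical rows (not the Haberman data), and the bin edges and labels are arbitrary choices made for this example.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; NOT the Haberman data.
toy_df = pd.DataFrame({
    "Positive_Axillary_Nodes": [0, 0, 1, 2, 3, 5, 6, 9, 12, 25],
    "Survival_Status_After_5_Years": [
        "yes", "yes", "yes", "no", "yes", "yes", "no", "no", "yes", "no",
    ],
})

# Arbitrary illustrative bins: 0, 1-4, 5-10 and more than 10 nodes.
node_bins = pd.cut(
    toy_df["Positive_Axillary_Nodes"],
    bins=[-1, 0, 4, 10, 60],
    labels=["0", "1-4", "5-10", ">10"],
)

# Boolean flag for death within 5 years; its mean per bin is the death rate.
died = toy_df["Survival_Status_After_5_Years"].eq("no")
death_rate = died.groupby(node_bins, observed=True).mean()
print(death_rate)
```

Applied to `cancer_df`, a table like this would put numbers behind the visual impression from the plots, although with only 306 rows the rates in the sparse upper bins should be read with caution.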