**Data Description** The Haberman's survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

**Attribute Information:**

* Age of patient at time of operation (numerical)
* Patient's year of operation (year - 1900, numerical)
* Number of positive axillary nodes detected (numerical)
* Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years

Domain inputs from https://pubmed.ncbi.nlm.nih.gov/6352003/
According to the paper, the number of positive axillary nodes is strongly related to the survival rate: the greater the number, the smaller the chance of survival. Let's analyse the data to learn more about that feature.

In [None]:
%pip install mplcyberpunk

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

import mplcyberpunk
plt.style.use("cyberpunk")
mplcyberpunk.add_glow_effects()

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
dataset = pd.read_csv("/kaggle/input/habermans-survival-data-set/haberman.csv", names =["age","year","n_auxillary_nodes","survival_after_5years"])
dataset.head()

In [None]:
print(dataset.info())

In [None]:
dataset.shape[0]

In [None]:
print(dataset.survival_after_5years.value_counts())
print(dataset.iloc[:,-1].value_counts(normalize = True))


**Observations - Higher level statistics of the dataset**
- The dataset has 306 rows and 4 columns: 3 features and the target. survival_after_5years is the target column and takes two values, 1 and 2.
- 225 patients survived 5 years or longer after surgery and 81 did not. The target column is therefore imbalanced, with about 73.5% of the values being '1'.
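
One practical reading of this imbalance, as a minimal sketch that uses only the two counts reported above: a model that always predicts the majority class would already be right about 73.5% of the time, so that figure is a natural baseline for any classifier. The variable names are illustrative.

```python
# Counts taken from the value_counts() output described above.
n_survived = 225  # class 1: survived 5 years or longer
n_died = 81       # class 2: died within 5 years
n_total = n_survived + n_died

# Accuracy of a trivial model that always predicts the majority class.
majority_share = max(n_survived, n_died) / n_total
# How many survivors there are for each patient who died.
imbalance_ratio = n_survived / n_died

print(f"Total patients: {n_total}")
print(f"Majority-class baseline accuracy: {majority_share:.3f}")
print(f"Survived : died ratio: {imbalance_ratio:.2f} : 1")
```

With an imbalance of this size, plain accuracy can look good even for a useless model, which is why per-class comparisons are used in the plots that follow.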


In [None]:
dataset['survival_after_5years'] = dataset['survival_after_5years'].map({1:"yes", 2:"no"})
dataset.head()

In [None]:
dataset.info()

**Observation** - survival_after_5years now has the object dtype. Since it is our class variable with 2 categories, we convert it to the category dtype.

In [None]:
dataset['survival_after_5years'] = dataset['survival_after_5years'].astype('category')

In [None]:
dataset.info()

In [None]:
print(dataset.describe())

In [None]:
dataset.isna().sum()

**Observations**
- The age column has a larger spread than the other columns, and no patient in this dataset is younger than 30. Patient age ranges from 30 to 83.
- Year of operation ranges from 1958 to 1969 (stored as 58 to 69).
- Although the maximum number of positive axillary nodes observed is 52, nearly 75% of the patients have fewer than 5 positive axillary nodes and at least 25% have none.
- The data is clean: there are no missing values, so imputation is not necessary.
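
The percentile-based statements about the node counts can be checked directly. The following is a minimal sketch: `node_shares` is a hypothetical helper, and the toy frame only illustrates the call, it does not contain the Haberman values.

```python
import pandas as pd


def node_shares(df: pd.DataFrame, column: str = "n_auxillary_nodes") -> dict:
    """Summarise how many rows have zero nodes and fewer than 5 nodes."""
    nodes = df[column]
    return {
        "share_zero": float((nodes == 0).mean()),     # fraction with no positive nodes
        "share_below_5": float((nodes < 5).mean()),   # fraction with fewer than 5 nodes
        "q25": float(nodes.quantile(0.25)),
        "q75": float(nodes.quantile(0.75)),
    }


# Toy data for illustration only (not the real dataset).
toy = pd.DataFrame({"n_auxillary_nodes": [0, 0, 1, 2, 3, 4, 8, 30]})
shares = node_shares(toy)
print(shares)
```

On the notebook's data the call would be `node_shares(dataset)`, assuming the column name used in the `pd.read_csv` cell.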

**OBJECTIVE** - *To perform exploratory data analysis on the Haberman cancer survival dataset to find out which features are useful for classification.*

**UNIVARIATE ANALYSIS**
- Probability Density Functions
- Cumulative Distribution Functions
- Box plots
- Violin plots

In [None]:
# Distribution plots
"""
* Distribution plots show how the data points are distributed across the range of a variable.
* In a histogram, the data points are grouped into bins, and the height of each bar grows with the number of data points
that fall into that bin.
* The probability density function (PDF) describes the relative likelihood of the variable taking values near x. It can be seen as a smoothed version of the histogram.
* Kernel density estimation (KDE) is a way to estimate the PDF. The area under the KDE curve is 1.
* Here the bars are normalised to a density (stat="density"), so the bar areas for each class sum to 1.
"""
# sns.distplot is deprecated in recent seaborn releases; sns.histplot with kde=True draws the same kind of plot.
for feature in list(dataset.columns)[:-1]:
    fg = sns.FacetGrid(dataset, hue='survival_after_5years', height=5)
    fg.map(sns.histplot, feature, kde=True, stat="density").add_legend()
    plt.show()

**Observations**
- Patients with more than 20 positive axillary nodes appear less likely to survive, which is consistent with the domain input above.
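
To put a number on this observation rather than reading it off the plot, the survival rate can be compared above and below a node threshold. This is a minimal sketch: `survival_rate_by_threshold` is a hypothetical helper, the column names and the "yes"/"no" labels are the ones set earlier in this notebook, and the toy rows are invented for illustration.

```python
import pandas as pd


def survival_rate_by_threshold(df, threshold,
                               nodes_col="n_auxillary_nodes",
                               target_col="survival_after_5years",
                               survived_label="yes"):
    """Compare the share of survivors above and at-or-below a node threshold."""
    above = df[df[nodes_col] > threshold]
    at_or_below = df[df[nodes_col] <= threshold]

    def rate(part):
        # Fraction of rows labelled as survived; NaN if the group is empty.
        if len(part) == 0:
            return float("nan")
        return float((part[target_col] == survived_label).mean())

    return {
        "n_above": len(above),
        "rate_above": rate(above),
        "n_at_or_below": len(at_or_below),
        "rate_at_or_below": rate(at_or_below),
    }


# Toy rows for illustration only (not the real dataset).
toy = pd.DataFrame({
    "n_auxillary_nodes": [0, 1, 3, 22, 25, 40],
    "survival_after_5years": ["yes", "yes", "no", "no", "no", "yes"],
})
result = survival_rate_by_threshold(toy, threshold=20)
print(result)
```

On the notebook's data the call would be `survival_rate_by_threshold(dataset, 20)`. The group above the threshold is small, so the resulting rate should be read with that sample size in mind.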


In [None]:
dataset[dataset.n_auxillary_nodes>20]

In [None]:
"""
The cumulative distribution function (CDF) is the probability that the variable takes a value less than or equal to x.
"""
plt.figure(figsize=(20,5))
for idx, feature in enumerate(list(dataset.columns)[:-1]):
    plt.subplot(1, 3, idx+1)
    print("********* "+feature+" *********")
    counts, bin_edges = np.histogram(dataset[feature], bins=10, density=True)
    print("Bin Edges: {}".format(bin_edges))
    # Normalise so the per-bin values sum to 1 (probability mass per bin)
    pdf = counts/sum(counts)
    print("PDF: {}".format(pdf))
    cdf = np.cumsum(pdf)
    print("CDF: {}".format(cdf))
    plt.plot(bin_edges[1:], pdf, label="PDF")
    plt.plot(bin_edges[1:], cdf, label="CDF")
    plt.xlabel(feature)
    plt.legend()
plt.show()

**Observations**
- Almost 80% of the patients have 5 or fewer positive axillary nodes.
- Roughly half of the patients had surgery before 1964 and half after.
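
These percentages are read off a CDF built from 10 bins, so they are only as precise as the bin edges. As a minimal sketch, an empirical CDF computed from the sorted values avoids binning altogether. The helper names `ecdf` and `fraction_at_or_below` are hypothetical, and the toy list is not the real data.

```python
import numpy as np


def ecdf(values):
    """Return the sorted values x and the cumulative fraction y at each position."""
    x = np.sort(np.asarray(values))
    # y[i] is the fraction of points among the first i+1 sorted values.
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


def fraction_at_or_below(values, cutoff):
    """Exact fraction of data points less than or equal to cutoff."""
    return float(np.mean(np.asarray(values) <= cutoff))


# Toy values for illustration only (not the real dataset).
toy_nodes = [0, 0, 1, 2, 3, 5, 8, 13, 21, 52]
x, y = ecdf(toy_nodes)
share = fraction_at_or_below(toy_nodes, 5)
print(share)
```

On the notebook's data, `fraction_at_or_below(dataset["n_auxillary_nodes"], 5)` would give the exact share behind the first observation, and `plt.step(x, y, where="post")` would draw the curve.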

In [None]:
"""
A box plot takes little space and summarises the distribution of the data points in a box with whiskers.
Points beyond the whiskers are displayed individually as outliers.
1. Lower whisker: smallest data point at or above Q1 - 1.5*IQR
2. Q1 (25th percentile)
3. Q2 (50th percentile or median)
4. Q3 (75th percentile)
5. Upper whisker: largest data point at or below Q3 + 1.5*IQR
Interquartile range (IQR) = Q3 - Q1
"""
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(dataset.columns)[:-1]):
    sns.boxplot( x='survival_after_5years', y=feature, data=dataset, ax=axes[idx])
plt.show()  
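
The quantities a box plot draws can also be computed by hand. The following is a minimal sketch under the usual 1.5*IQR whisker rule described in the cell above. `box_stats` is a hypothetical helper and the toy values are invented for illustration.

```python
import numpy as np


def box_stats(values):
    """Compute quartiles, whisker ends and outliers under the 1.5*IQR rule."""
    v = np.asarray(values, dtype=float)
    q1, q2, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points that lie inside the fences.
    inside = v[(v >= lower_fence) & (v <= upper_fence)]
    outliers = v[(v < lower_fence) | (v > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(q2),
        "q3": float(q3),
        "iqr": float(iqr),
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "outliers": sorted(outliers.tolist()),
    }


# Toy values for illustration only (not the real dataset).
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

Applied per class, for example to `dataset.loc[dataset.survival_after_5years == "yes", "n_auxillary_nodes"]`, this gives the numbers behind each box in the figure.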

In [None]:
"""
A violin plot combines a box plot with a kernel density estimate of the distribution.
"""
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(dataset.columns)[:-1]):
    sns.violinplot( x='survival_after_5years', y=feature, data=dataset, ax=axes[idx])
plt.show()

**BIVARIATE ANALYSIS**

In [None]:
"""
Pair plot in seaborn plots the scatter plot between every two data columns in a given dataframe.
It is used to visualize the relationship between two variables
"""
sns.pairplot(dataset, hue='survival_after_5years', height=4)
plt.show()

**Observations**
- From the pair plots, we see that no pair of features is enough for classification on its own: in none of the scatter plots can "yes" and "no" be linearly separated.
- Overall, the classes overlap heavily in every plot, so drawing inferences from them is difficult.
- Still, the patient's age and the number of positive axillary nodes show some characteristics that are useful for classification and that more advanced machine learning algorithms may be able to exploit.
