In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Objective**: To classify each patient's status 5 years after the operation as survived (1) or not (2).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

In [None]:
path = '/kaggle/input/habermans-survival-data-set/haberman.csv'
data = pd.read_csv(path, header = None, names = ['age', 'year', 'nodes', 'status'])

In [None]:
data.shape

In [None]:
data.head(5)

The data contains 306 rows and 4 columns.
*age*, *year* and *nodes* are the **independent** variables and *status* is the **dependent** variable.


Attribute Information:

* Age of patient at time of operation (numerical)
* Patient's year of operation (year - 1900, numerical)
* Number of positive axillary nodes detected (numerical)
* Survival status (class attribute)
    * 1 = the patient survived 5 years or longer
    * 2 = the patient died within 5 years
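
Because the class attribute is coded as 1 and 2 rather than 0 and 1, it can help to attach readable labels before counting or plotting. The following is a minimal sketch, not part of the original analysis: the `toy` frame contains made-up rows standing in for `haberman.csv`, and `STATUS_LABELS` and `status_label` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; the notebook itself loads `data` from haberman.csv.
toy = pd.DataFrame({
    "age": [30, 45, 52, 61, 38],
    "year": [64, 60, 66, 59, 63],
    "nodes": [1, 0, 12, 3, 0],
    "status": [1, 1, 2, 2, 1],
})

# Mapping taken from the attribute description above.
STATUS_LABELS = {1: "survived 5+ years", 2: "died within 5 years"}

# `status_label` is a hypothetical helper column, not a column of the dataset.
toy["status_label"] = toy["status"].map(STATUS_LABELS)

# Class counts and proportions show how balanced the target is.
counts = toy["status_label"].value_counts()
shares = toy["status_label"].value_counts(normalize=True)
print(counts)
print(shares.round(2))
```

Running the same two `value_counts` calls on the real `data` frame would show how imbalanced the two survival classes are.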

In [None]:
data.isna().sum() #Checking if there are any null values

There are no null or missing values in the data.

In [None]:
data.nunique() #Counting the number of unique values in the data

The data covers 12 different calendar years and, as expected, 2 survival statuses.

## High Level Statistics and Univariate Analysis

In [None]:
data.describe()

In [None]:
features = data.columns[:-1]
target = data.columns[-1]

In [None]:
sns.set_style("whitegrid")  # set once; the whitegrid style already draws grid lines
for feature in features:
    sns.displot(data=data, x=feature, kde=True, hue=target, palette='vlag')
    plt.title(f'Distribution of {feature} column')
    plt.show()

Since the nodes column has a wide spread, let's check its cumulative distribution plot.

In [None]:
sns.kdeplot(data=data, x='nodes', cumulative = True)
plt.title("Cumulative distribution of positive axillary nodes")
plt.show()
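
A cumulative KDE is a smoothed estimate, so values read off it are approximate. As a rough cross-check, the share of patients at or below a given node count can be computed directly from the empirical CDF. The sketch below is an assumption-laden illustration: `ecdf` and `share_at_or_below` are hypothetical helpers, and `toy_nodes` is made-up data rather than the real `nodes` column.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the fraction of observations <= each one."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def share_at_or_below(values, threshold):
    """Fraction of observations that are <= threshold."""
    values = np.asarray(values)
    return float(np.mean(values <= threshold))

# Made-up node counts for illustration only (not the Haberman data).
toy_nodes = [0, 0, 0, 1, 1, 2, 4, 8, 13, 30]

x, y = ecdf(toy_nodes)
share_le_8 = share_at_or_below(toy_nodes, 8)
print(share_le_8)
```

On the real data, `share_at_or_below(data['nodes'], 8)` would give the exact fraction behind a statement such as "80% of patients have 8 or fewer nodes". Seaborn's `sns.ecdfplot(data=data, x='nodes')` should draw the same unsmoothed curve.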

### Observations on univariate analysis

1. Most patients were 45-65 years old, and the patients who died within 5 years of the operation (status 2) were concentrated in the 40-55 age range.
2. Very few patients under 40 fell under status 2.
3. The number of patients was highest in 1958 and declined over time, which could have several causes (for example, advances in technology or greater awareness).
4. In 1964, the ratio of status 1 to status 2 patients was particularly poor: the status 2 curve climbs, which indicates worse outcomes that year.
5. Node counts range from 0 to 52, but the majority of patients had 0 or 1 positive axillary nodes (median = 1, mean ≈ 4.0), so the distribution is strongly right-skewed and we can expect outliers. [We can see that 80% of the dataset has <= 8 nodes.]
6. Patients with 0 positive axillary nodes tend to live longer after the operation.
7. The risk to the patient increases with the number of positive axillary nodes, although only a few patients have a high node count.
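
The relationship between node count and outcome can also be checked numerically by binning the node counts and computing the share of status 2 patients in each bin. This is only a sketch: the bin edges, the `node_group` and `died` columns, and the `toy` rows are all assumptions made for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only (not the Haberman data).
toy = pd.DataFrame({
    "nodes":  [0, 0, 0, 0, 1, 2, 3, 5, 9, 15, 22, 30],
    "status": [1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2],
})

# Hypothetical bin edges: 0, 1-3, 4-10 and 11 or more positive nodes.
bins = [-1, 0, 3, 10, 60]
labels = ["0", "1-3", "4-10", "11+"]
toy["node_group"] = pd.cut(toy["nodes"], bins=bins, labels=labels)

# Per bin: number of patients and share that died within 5 years (status 2).
summary = (
    toy.assign(died=toy["status"].eq(2))
       .groupby("node_group", observed=True)["died"]
       .agg(patients="count", died_share="mean")
)
print(summary)
```

Replacing `toy` with the real `data` frame would give a per-bin mortality share to compare against what the density plots suggest.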


However, the two target classes are not easily separable using univariate analysis alone, i.e. they do not form distinct clusters in any of the plots.

## Bivariate Analysis

In [None]:
sns.pairplot(data = data, hue = 'status', palette='vlag')
plt.show()

The classes are not easily separable with any pairwise combination of features either.

#### Plotting Box Plots and Violin Plots

In [None]:
for feature in features:
    sns.boxplot(x=target, y = feature, data= data)
    plot_title = feature + ' vs ' + target + ' boxplot'
    plt.title(plot_title)
    plt.show()

For every feature, the boxes of the two status groups overlap heavily along the vertical axis. That means many data points from both classes lie in the same range and can't be separated.

The nodes vs status boxplot shows the least overlap. As noted earlier, most status 1 patients had few positive nodes, and patients with more positive axillary nodes were more likely to fall into the status 2 group. Even so, some status 1 patients had a high node count as well.
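
The overlap seen in a boxplot can be expressed as a number by comparing the interquartile ranges (the boxes) of the two classes. The sketch below uses made-up rows and a hypothetical `iqr_overlap` measure. It is one simple way to quantify the overlap, not a method used in the original analysis.

```python
import pandas as pd

# Made-up rows for illustration only (not the Haberman data).
toy = pd.DataFrame({
    "nodes":  [0, 0, 1, 2, 4, 0, 1, 3, 9, 20],
    "status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

# Quartiles of `nodes` within each status group: the edges of each box.
q = toy.groupby("status")["nodes"].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ["q1", "median", "q3"]

# Length of the interval shared by both boxes (0 if the boxes do not overlap).
low = q["q1"].max()
high = q["q3"].min()
iqr_overlap = max(0.0, float(high - low))

print(q)
print(f"Overlap of the two IQR boxes: {iqr_overlap}")
```

Applied to each feature of the real `data` frame, a larger shared interval relative to the box heights would correspond to the heavier overlap described above.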

In [None]:
for feature in features:
    sns.violinplot(x=target, y = feature, data= data)
    plot_title = feature + ' vs ' + target + ' violinplot'
    plt.title(plot_title)
    plt.show()

Here, we can see the shape of each distribution in addition to the summary statistics shown in the boxplots.

## Multivariate Analysis

In [None]:
sns.jointplot(x = 'age', y = 'nodes', data = data, kind='kde', fill = True)
plt.show()

A large share of the patients were 40-60 years old with 0-1 positive axillary nodes.

In [None]:
sns.jointplot(x = 'age', y = 'nodes', data = data[data.status == 1], kind='kde', hue = target, fill = True)
plt.show()

Status 1 patients span ages 30-70 and mostly have few positive axillary nodes.

In [None]:
sns.jointplot(x = 'age', y = 'nodes', data = data[data.status == 2], kind='kde', fill = True)
plt.show()

The 40-60 age group had more status 2 patients. Status 2 patients were also more likely to have a higher positive axillary node count.

### Calculating Quantiles

In [None]:
quant = [1, 25, 50, 75, 99]
for feature in ['age', 'nodes']:
    print(f"Quantiles of {feature} : ")
    print(f"Min : {data[feature].min()}")
    q = np.percentile(data[feature], quant)
    for p, value in zip(quant, q):
        print(f'{p} percentile : {value}')
    print(f"Max : {data[feature].max()}")
    # Interquartile range: 75th percentile minus 25th percentile
    iqr = q[3] - q[1]
    print(f'IQR ({feature}) = {iqr}')
    print()

In the nodes column, the 99th percentile is 29.90 but the maximum is 52. This large gap suggests that the nodes column contains outliers.
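
One common way to follow up on a suspicion like this is Tukey's 1.5 × IQR rule, which flags values far outside the interquartile range. The rule is swapped in here as an illustration and is not part of the original analysis. `tukey_outliers` is a hypothetical helper and `toy_nodes` is made-up data.

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return values[mask], (lower, upper)

# Made-up node counts for illustration only (not the Haberman data).
toy_nodes = [0, 0, 0, 1, 1, 2, 3, 4, 7, 52]

outliers, (lower, upper) = tukey_outliers(toy_nodes)
print(f"Fences: [{lower}, {upper}]")
print(f"Flagged values: {outliers}")
```

For a strongly right-skewed count such as `nodes`, this rule can flag many legitimate high values, so the flagged rows are better treated as candidates for inspection than as errors.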

## Observations
1. As per the analysis, the given data is not linearly separable.
2. Most patients had 0 or 1 positive axillary nodes. The risk to the patient increases with the node count, although only a few patients have a high node count.
3. Most patients were 45-65 years old, and the patients who died within 5 years of the operation (status 2) were concentrated in the 40-55 age range.
4. In 1964, the number of patients falling into the status 2 class increased, although the majority were still in status 1. The same pattern held every year but was worst in 1964.
5. Over the long run, the total number of patients decreased.
6. The nodes feature might contain outliers.