# 3. Exploratory Data Analysis (EDA) on Haberman's Data set

## Data Set Information:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

## Attribute Information:

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
-- 1 = the patient survived 5 years or longer
-- 2 = the patient died within 5 years

Ref: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival

In [None]:
# Import required packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Load data from the csv file
hd=pd.read_csv('../input/habermans-survival-data-set/haberman.csv', header=None, names=['age', 'year', 'nodes', 'status'])
hd.info()
hd.shape

### Observations: 
1. The dataset has no null values, so no rows need to be dropped.
2. All values are stored as integers.
3. The data has 3 features (`age`, `year`, `nodes`), one class label (`status`), and 306 observations.
4. `status` indicates whether the patient survived 5 years or longer after surgery (1) or died within 5 years (2).
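
The first two observations can also be checked explicitly instead of being read off `hd.info()`. The sketch below is a minimal illustration that assumes a small hand-made frame, `toy`, with the same column names as `hd`; its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows with the same columns as `hd` (not the real data).
toy = pd.DataFrame({
    "age": [30, 34, 45, 62],
    "year": [64, 60, 65, 58],
    "nodes": [1, 0, 6, 22],
    "status": [1, 1, 2, 2],
})

# Missing values per column; all zeros means nothing has to be dropped.
null_counts = toy.isnull().sum()

# True only if every column has an integer dtype.
all_integer = all(pd.api.types.is_integer_dtype(dtype) for dtype in toy.dtypes)

print(null_counts)
print(all_integer)
```

Replacing `toy` with `hd` should run the same checks on the loaded data.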

In [None]:
hd.columns

In [None]:
hd['status'].value_counts()

### Observations: 
1. 225 patients survived 5 years or longer after surgery.
2. 81 patients died within 5 years of surgery.
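
The two counts above imply that the classes are imbalanced. As a rough sketch, the proportions can be worked out from the reported counts alone; the `counts` Series below is typed in by hand from the observations rather than computed from `hd`.

```python
import pandas as pd

# Class counts as reported above (1 = survived, 2 = died within 5 years).
counts = pd.Series({1: 225, 2: 81}, name="status")

# Share of each class in the 306 observations.
proportions = counts / counts.sum()

# How many survivors there are for each patient who died.
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(round(imbalance_ratio, 2))
```

On the loaded frame, `hd['status'].value_counts(normalize=True)` should give the same proportions directly.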

In [None]:
hd.describe()

### Observations: 
1. Patient age ranges from 30 to 83 years.
2. The table also reports the mean, standard deviation, and quartiles of each column.

## Distribution plots

## Univariate Analysis

In [None]:
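# Overlaid age distributions, one colour per survival status.
# sns.histplot is used here because sns.distplot is deprecated in recent seaborn releases.
sns.FacetGrid(hd, hue="status", height=4) \
   .map(sns.histplot, "age", kde=True, stat="density") \
   .add_legend()
plt.title("Patient's Age", fontsize=20)
plt.show()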

### Observations: 
1. No conclusion about survival can be drawn from age alone using the plot above.

In [None]:
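# Overlaid distributions of positive axillary nodes, one colour per survival status.
# sns.histplot is used here because sns.distplot is deprecated in recent seaborn releases.
sns.FacetGrid(hd, hue="status", height=4) \
   .map(sns.histplot, "nodes", kde=True, stat="density") \
   .add_legend()
plt.title("Patient's Nodes", fontsize=20)
plt.show()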

### Observations: 
1. No conclusion about survival can be drawn from the number of nodes alone using the plot above.

### 2D Scatter plot

In [None]:
hd.plot(kind='scatter',x='nodes',y='age')
plt.title ('Nodes vs Age comparison',fontsize=20)
plt.show();

### Pair plot

In [None]:
plt.close()
sns.set_style("whitegrid")
# `height` replaces the old `size` argument of pairplot (renamed in seaborn 0.9).
sns.pairplot(hd, hue="status", height=4)
plt.show()

### Observations: 
1. We can't arrive at any conclusion since the data points are not clearly separated.

### PDF and CDF

In [None]:
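# Ages of patients who survived 5 years or longer (status == 1)
lived = hd[hd['status'] == 1]
counts, bin_edges = np.histogram(lived['age'], bins=10, density=True)
# Normalise so the bin values sum to 1 (fraction of patients per bin)
pdf = counts / counts.sum()
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
# Plot both curves against the right edge of each bin
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.gca().legend(('PDF', 'CDF'))
plt.title('PDF & CDF of age for patients who survived', fontsize=20)
plt.show()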

In [None]:
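# Ages of patients who died within 5 years (status == 2)
dead = hd[hd['status'] == 2]
counts, bin_edges = np.histogram(dead['age'], bins=10, density=True)
# Normalise so the bin values sum to 1 (fraction of patients per bin)
pdf = counts / counts.sum()
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
# Plot both curves against the right edge of each bin
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.gca().legend(('PDF', 'CDF'))
plt.title('PDF & CDF of age for patients who died', fontsize=20)
plt.show()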
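
The two cells above repeat the same histogram steps for each class. One possible way to factor them out is a small helper; `pdf_cdf` is a hypothetical name introduced here for illustration, and the `ages` array holds made-up values rather than the dataset. Because the bins have equal width, dividing raw counts by their total gives the same per-bin fractions as the `density=True` version followed by renormalisation.

```python
import numpy as np

def pdf_cdf(values, bins=10):
    """Return (bin_edges, pdf, cdf) for a 1-D sample using equal-width bins."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of the sample in each bin
    cdf = np.cumsum(pdf)          # running total of those fractions
    return bin_edges, pdf, cdf

# Hypothetical ages, for illustration only (not taken from the dataset).
ages = np.array([30, 34, 38, 41, 45, 47, 52, 55, 61, 70])
bin_edges, pdf, cdf = pdf_cdf(ages, bins=4)

print(bin_edges)
print(pdf)
print(cdf)
```

With the real data, `pdf_cdf(lived['age'])` and `pdf_cdf(dead['age'])` could then be plotted on the same axes to compare the two classes.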

## Box plot

In [None]:
sns.boxplot(x='status',y='age', data=hd)
plt.title ('Box plot for Status vs Age',fontsize=20)
plt.show()

### Observations:
1. No conclusion can be drawn from age alone, since the 25th and 75th percentiles of the two classes are very close.
2. In this sample, every patient who had surgery before age 33 survived.
3. In this sample, every patient who had surgery after age 78 died.
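
Age thresholds like the ones above are easier to confirm from numbers than from box plot whiskers. The sketch below shows one way to do that, assuming a hand-made frame `toy` with hypothetical ages; the printed values describe the toy rows only, not the study.

```python
import pandas as pd

# Hypothetical rows (not the real data): five survivors, five deaths.
toy = pd.DataFrame({
    "age": [30, 33, 50, 65, 77, 34, 52, 60, 78, 83],
    "status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

# Five-number summary of age for each class.
age_summary = toy.groupby("status")["age"].describe()[
    ["min", "25%", "50%", "75%", "max"]
]

# Everyone younger than `youngest_death` survived;
# everyone older than `oldest_survivor` died.
youngest_death = toy.loc[toy["status"] == 2, "age"].min()
oldest_survivor = toy.loc[toy["status"] == 1, "age"].max()

print(age_summary)
print(youngest_death, oldest_survivor)
```

Running the same lines with `hd` in place of `toy` should give the exact cut-offs for the real data.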

In [None]:
sns.boxplot(x='status',y='nodes', data=hd)
plt.title ('Box plot for Status vs Nodes',fontsize=20)
plt.show()

### Observations:
1. Many patients who survived appear as outliers with a high number of nodes, so no conclusion can be drawn from this plot.
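
The points drawn beyond the whiskers follow the 1.5 × IQR rule that seaborn's `boxplot` uses by default. A minimal sketch of that rule is given below, assuming a hypothetical `nodes` array; only the upper fence is computed because node counts cannot be negative.

```python
import numpy as np

# Hypothetical node counts, for illustration only (not taken from the dataset).
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 15, 30])

# Quartiles and interquartile range.
q1, q3 = np.percentile(nodes, [25, 75])
iqr = q3 - q1

# Values above this fence are drawn as individual outlier points.
upper_fence = q3 + 1.5 * iqr
outliers = nodes[nodes > upper_fence]

print(upper_fence)
print(outliers)
```

Applied to `hd.loc[hd['status'] == 1, 'nodes']`, the same steps would count how many survivors fall above the fence.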


## Violin plot

In [None]:
# `size` is not a violinplot argument, so it is dropped here.
sns.violinplot(x="status", y="age", data=hd)
plt.title('Violin plot for Status vs Age', fontsize=20)
plt.show()

### Observations:
1. In this sample, every patient who had surgery before age 33 survived.
2. In this sample, every patient who had surgery after age 78 died.

In [None]:
# `size` is not a violinplot argument, so it is dropped here.
sns.violinplot(x="status", y="nodes", data=hd)
plt.title('Violin plot for Status vs Nodes', fontsize=20)
plt.show()