#  Exploratory data analysis on Haberman Dataset

**Attribute Information :**
* Age of patient at time of operation
* Year of operation (year - 1900)
* Number of positive axillary nodes detected 
* Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years

**Objectives :**
* To classify a person's survival status based on three features

In [None]:
cd /kaggle/input/habermans-survival-data-set

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
from prettytable import PrettyTable
warnings.filterwarnings('ignore')

data=pd.read_csv('haberman.csv',names=["age","year", "nodes","status"])

In [None]:
data.head()

In [None]:
data.status.value_counts()

## 1. Basic Information about the dataset :

In [0]:
# How many data-points and features ?
# data.size is rows * columns; shape[0] is the number of data points
print(data.shape[0],data.shape[1]-1)

In [0]:
# What are the features in given data set?
print([i for i in data.columns[:-1]])

In [0]:
# What need to be classified ?
print('Whether the patient died within 5 years (status=2) or survived 5 years or longer (status=1) after the operation')

In [0]:
# The data was collected between ?
print('19{} and 19{}'.format(data.year.min(),data.year.max()))

In [0]:
# The age group of patients?
print(data.age.min(),data.age.max())

In [0]:
##https://matplotlib.org/gallery/pie_and_polar_charts/pie_features.html#sphx-glr-gallery-pie-and-polar-charts-pie-features-py

# How many patients survived 5 years or longer (status=1) and how many died within 5 years (status=2)?
print('Patients who survived 5 years or longer:',data.status.value_counts()[1],'and patients who died within 5 years:',data.status.value_counts()[2])
status_counts=[data.status.value_counts()[1],data.status.value_counts()[2]]
labels='Status 1','Status 2'

fig1,ax1=plt.subplots(figsize=(6,6));
ax1.pie(status_counts,explode = (0, 0.1),labels=labels,shadow=1,startangle=90,autopct='%1.1f%%');
ax1.axis('equal');
plt.show();

In [0]:
# How many values are unknown or not available?
data.info()

**Overview of data :**

1. The data set covers 306 patients
2. The data is imbalanced: the shares of the two survival statuses differ by about 47 percentage points
3. The data doesn't have any missing values
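
The imbalance noted above can be quantified with a small helper. This is a minimal sketch using only the standard library; `class_balance` and the toy labels are hypothetical and are not taken from `haberman.csv`. It also reports the accuracy of always predicting the majority class, which is the baseline any classifier on imbalanced data should beat.

```python
from collections import Counter

def class_balance(labels):
    """Return {label: share of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels for illustration only (not the Haberman data):
# three rows of class 1 and one row of class 2.
toy_labels = [1, 1, 2, 1]
shares = class_balance(toy_labels)

# Gap between the two class shares, in percentage points.
gap_points = abs(shares[1] - shares[2]) * 100

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = max(shares.values())

print(shares, gap_points, baseline_accuracy)
```

On the notebook's frame the same helper would be called as `class_balance(data.status)`.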

## 2. Initial numerical details of the data set: 

In [0]:
data.describe()

In [0]:
g=data.groupby('status')

In [0]:
# Select several columns with a list; tuple-style selection is not supported by recent pandas
g[['age','nodes']].describe()

In [0]:
# Among the patients who did not survive, the percentage without a single positive node
data[(data.nodes==0) & (data.status==2)].shape[0]/data[data.status==2].shape[0]*100

**Note :**
1. **Below** **age** **34**, patients in this data set practically always survive 5 years or longer
2. The **average age** of patients who survived five years or longer and of those who died within five years is about **52** and **53.7** years respectively
3. It seems that the **higher** the **age**, the **lower** the **chance** of **survival**
4. **75%** of patients are **aged** at most about **60** (status 1) and **61** (status 2)
5. About **6.2%** of patients **died** within five years even **without** any positive **nodes**
6. The **average** number of axillary **nodes** is around **3** and **7** for status 1 and 2 respectively
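
A percentage like the one about patients without nodes depends on the denominator. The cell above divides by the number of status-2 patients; dividing by all patients gives a different, smaller figure. The sketch below shows both on toy records; the `(nodes, status)` tuples are made up for illustration and are not rows from `haberman.csv`.

```python
# Toy (nodes, status) records for illustration only.
records = [(0, 1), (0, 1), (3, 1), (0, 2), (5, 2), (9, 2), (1, 1), (0, 1)]

died = [r for r in records if r[1] == 2]
died_no_nodes = [r for r in died if r[0] == 0]

# Share among patients who died: conditions on status == 2.
share_among_died = len(died_no_nodes) / len(died) * 100

# Share among all patients: same numerator, larger denominator.
share_among_all = len(died_no_nodes) / len(records) * 100

print(round(share_among_died, 1), round(share_among_all, 1))
```

When quoting such a figure, it helps to say which of the two denominators was used.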

## 2.1 Quantile Analysis of Nodes

In [0]:
stts1=g['nodes'].quantile([i*0.1 for i in range(11)])[1].values
stts2=g['nodes'].quantile([i*0.1 for i in range(11)])[2].values
x=[i*10 for i in range(11)]
table=PrettyTable()
table.field_names=['Percentile','No of Nodes (Status 1)','No of Nodes (Status 2)']
for i in range(11):
  table.add_row(np.around([x[i],stts1[i],stts2[i]],2))
print(table)

In [0]:
plt.figure(figsize=(10,5))
plt.plot(x,stts1);
plt.scatter(x,stts1,c='g',label='status 1');
plt.plot(x,stts2);
plt.scatter(x,stts2,c='r',label='status 2');
plt.legend()
plt.xlabel('Percentile')
plt.ylabel('No of Nodes')
plt.title('Percentile plot of no of Nodes for status 1 and 2')
plt.grid()

**Observations**: 
* The **maximum** number of **nodes** is **46** for patients who survived and **52** for patients who did **not** **survive**
* About **90%** of the patients who **survived** had **8 nodes** or fewer, whereas fewer than **70%** of the patients who did **not survive** had that few
* It seems that the **more** **nodes** a patient has, the **lower** the chance of **survival**
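
A percentile table answers the question "what share of a group has a value at or below some threshold?", which is the empirical CDF of that group. The sketch below makes that reading explicit. `share_at_or_below` is a hypothetical helper and the node counts are toy values, not the Haberman data.

```python
def share_at_or_below(values, threshold):
    """Fraction of `values` that are <= threshold (empirical CDF at threshold)."""
    values = list(values)
    return sum(v <= threshold for v in values) / len(values)

# Toy node counts for illustration only.
survived_nodes = [0, 0, 0, 1, 1, 2, 3, 4, 8, 20]
died_nodes = [0, 1, 3, 5, 8, 9, 11, 13, 20, 35]

p_survived = share_at_or_below(survived_nodes, 8)
p_died = share_at_or_below(died_nodes, 8)

print(p_survived, p_died)
```

Note that these are shares within each group, not the probability that a patient with 8 nodes survives. On the notebook's frame the call would be, for example, `share_at_or_below(data.nodes[data.status == 1], 8)`.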

## 2.2 Quantile Analysis of Age

In [0]:
stts1=g['age'].quantile([i*0.1 for i in range(11)])[1].values
stts2=g['age'].quantile([i*0.1 for i in range(11)])[2].values
x=[i*10 for i in range(11)]
table=PrettyTable()
table.field_names=['Percentile','Age (Status 1)','Age (Status 2)']
for i in range(11):
  table.add_row(np.around([x[i],stts1[i],stts2[i]],2))
print(table)

In [0]:
plt.figure(figsize=(10,5))
plt.plot(x,stts1);
plt.scatter(x,stts1,c='g',label='status 1');
plt.plot(x,stts2);
plt.scatter(x,stts2,c='r',label='status 2');
plt.legend()
plt.xlabel('Percentile')
plt.ylabel('Age')
plt.title('Percentile plot of Age for status 1 and 2')
plt.grid()

**Observations :**
* Unlike the number of nodes, the **age** of a patient does **not** **distinguish** survival status well
* Still, we can observe that the **lower** the **age**, the **higher** the **chance** of **survival**
* If a patient is **below** **age** **34**, the **patient** would most probably **survive**

## 3. Box Plots

In [0]:
plt.figure(figsize=(20,5))
plt.subplot(121)
sns.boxplot(x='status',y='nodes', data=data)
#plt.show()
plt.subplot(122)
sns.boxplot(x='status',y='age', data=data)
plt.suptitle('Box plots of No of Nodes present and Age of patients')
plt.show()

* Box plots give a graphical view of the interquartile ranges of the variables
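
The numbers behind a box plot can also be computed directly. The sketch below derives the quartiles, the interquartile range (IQR) and the 1.5 × IQR fences, which is the default whisker rule in matplotlib and seaborn. `box_summary` is a hypothetical helper, the values are toy data, and `statistics.quantiles` needs Python 3.8 or newer.

```python
import statistics

def box_summary(values):
    """Quartiles, IQR and 1.5*IQR fences, as used by a standard box plot."""
    # method="inclusive" interpolates linearly between data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }

# Toy node counts for illustration only (not the Haberman data).
toy_nodes = [0, 0, 1, 2, 3, 4, 7, 11, 30]
summary = box_summary(toy_nodes)

# Points outside the fences are the ones a box plot draws individually.
outliers = [v for v in toy_nodes
            if v < summary["lower_fence"] or v > summary["upper_fence"]]

print(summary, outliers)
```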

## 4. Univariate Analysis:

### 4.1 Effect of Age:

In [0]:
d=np.array(data.age.loc[data.status==1])
s=np.array(data.age.loc[data.status==2])

fig,axes=plt.subplots(1,3,sharey=True,sharex=True,figsize=(17,5));

sns.distplot(d,ax=axes[0],label='status 1');
sns.distplot(s,ax=axes[1],label='status 2',color='r');
sns.distplot(d,ax=axes[2],label='status 1' );
sns.distplot(s,ax=axes[2],label='status 2',color='r');
axes[0].legend();
axes[0].set_xlabel('Age of people with status 1')
axes[1].set_xlabel('Age of people with status 2')
axes[1].legend();
axes[2].legend();
plt.suptitle('Histograms of age');
plt.xlabel('Age');

**Observations :**
* **Age** is around **50** to **60** for **most patients**, both among those who survived and those who did not


### 4.2 Effect of Nodes:

In [0]:
d=np.array(data.nodes.loc[data.status==1])
s=np.array(data.nodes.loc[data.status==2])

fig,axes=plt.subplots(1,3,sharey=True,sharex=True,figsize=(17,5));

sns.distplot(d,ax=axes[0],label='status 1');
sns.distplot(s,ax=axes[1],label='status 2',color='r');
sns.distplot(d,ax=axes[2],label='status 1' );
sns.distplot(s,ax=axes[2],label='status 2',color='r');
axes[0].legend();
axes[0].set_xlabel('No of Nodes')
axes[1].set_xlabel('No of Nodes')
axes[1].legend();
axes[2].legend();
plt.suptitle('Histograms of No of Nodes');
plt.xlabel('No of Nodes');

**Observations :**
* Most patients have 0 to 5 nodes, both among those who survived and those who did not

### 4.3 Analysis by Year :

In [0]:
d=np.array(data.year.loc[data.status==1])
s=np.array(data.year.loc[data.status==2])

fig,axes=plt.subplots(1,3,sharey=True,sharex=False,figsize=(17,5));

sns.distplot(d,ax=axes[0],label='status 1');
sns.distplot(s,ax=axes[1],label='status 2',color='r');
#plt.xticks(d,['19'+str(i) for  i in d],rotation=90)
sns.distplot(d,ax=axes[2],label='status 1' );
axes[0].grid()
axes[1].grid()
sns.distplot(s,ax=axes[2],label='status 2',color='r');
axes[0].legend();
axes[0].set_xlabel('Year')
axes[1].set_xlabel('Year')
axes[1].legend();
axes[2].legend();
# One tick per distinct year instead of one per patient
plt.xticks(np.unique(d),['19'+str(i) for i in np.unique(d)],rotation=90)
plt.suptitle('Histograms of Year');
plt.xlabel('Year');
plt.grid()

**Observations :**
* There **isn't** much **difference** between the year distributions of patients who **survived** and those who did **not survive**
* There are **fewer** instances of patients who did **not survive** around **1962** and **more** instances around **1965**
* For patients who **survived**, the trend is almost **unchanged** **across** **time**

## 5. Violin Plots

In [0]:
plt.figure(figsize=(20,5))
plt.subplot(121)
sns.violinplot(x='status',y='nodes', data=data)
#plt.show()
plt.subplot(122)
sns.violinplot(x='status',y='age', data=data)
plt.suptitle('Violin plots of No of Nodes present and Age of patients')
plt.show()

* Violin plots combine the distribution plot with the interquartile view of the box plot

## 6. PDF, CDF

In [0]:
counts, bin_edges = np.histogram(data.nodes[data.status==1], bins=10,density = True)
plt.figure(figsize=(20,6))
plt.subplot(121)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='PDF status 1',c='b')
plt.plot(bin_edges[1:], cdf,label='CDF status 1',c='r')
counts, bin_edges = np.histogram(data.nodes[data.status==2], bins=10,density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='PDF status 2',c='g')
# '200' is not a valid matplotlib color (grayscale strings must lie in 0-1); use black
plt.plot(bin_edges[1:], cdf,label='CDF status 2',c='k')
plt.legend()
plt.xlabel('Number of Nodes')
plt.title('CDF and PDF of Number of nodes present for status 1 and 2')
#plt.show();

counts, bin_edges = np.histogram(data.age[data.status==1], bins=10,density = True)
plt.subplot(122)
#plt.figure(figsize=(15,7))
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='PDF status 1',c='b')
plt.plot(bin_edges[1:], cdf,label='CDF status 1',c='r')
counts, bin_edges = np.histogram(data.age[data.status==2], bins=10,density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='PDF status 2',c='g')
# '200' is not a valid matplotlib color (grayscale strings must lie in 0-1); use black
plt.plot(bin_edges[1:], cdf,label='CDF status 2',c='k')
plt.legend()
plt.xlabel('Age of patients')
plt.title('CDF and PDF of Age of patients for status 1 and 2')
plt.show();

**Observations:**
* The CDF plots lead to similar conclusions as the percentile analysis
* The PDF plots lead to similar conclusions as the histograms and distribution analysis above
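
The PDF and CDF curves above come from a histogram: each bin's share of the data is the binned PDF, and its running sum is the CDF. Because the bins have equal width, normalising the `density=True` output by its sum gives the same per-bin shares as normalising raw counts. The sketch below reproduces that computation with the standard library only. `binned_pdf_cdf` is a hypothetical helper and the ages are toy values, not the Haberman data.

```python
from itertools import accumulate

def binned_pdf_cdf(values, bins=10):
    """Equal-width histogram as per-bin probabilities plus their running sum.

    Assumes the values are not all equal (the bin width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value goes into the last bin, as in numpy.histogram.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    pdf = [c / total for c in counts]
    cdf = list(accumulate(pdf))
    right_edges = [lo + width * (i + 1) for i in range(bins)]
    return right_edges, pdf, cdf

# Toy ages for illustration only.
toy_ages = [30, 34, 38, 41, 45, 47, 50, 52, 52, 55, 57, 60, 63, 66, 70]
edges, pdf, cdf = binned_pdf_cdf(toy_ages, bins=4)

for edge, p, c in zip(edges, pdf, cdf):
    print(f"<= {edge:5.1f}: pdf={p:.3f} cdf={c:.3f}")
```

The CDF value at a bin's right edge is the share of the group at or below that edge, which is why it matches the percentile tables read the other way round.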

## 7. Bivariate Analysis

In [None]:
plt.close();
sns.set_style("whitegrid");
# vars takes column names; height replaces the deprecated size argument
sns.pairplot(data, hue="status",markers=["o", "s"],vars=['age','nodes','year'], height=3,diag_kind="kde");
plt.show()

**Observations :**
* There is a lot of overlap in the data; no combination of two features helps much to distinguish the survival status of patients

# Conclusions :

* We observed that **age** and number of **nodes** are the **important** features for a preliminary conclusion that **distinguishes the survival status** of patients
* The **higher** the **age**, the **lower** the probability of **survival**
* The **more** **nodes** present, the **lower** the probability of **survival**
* Even though the above **observations aren't certain**, they give a first guess at survival status.
* As there is much less data about patients who did not survive, we can't be more certain about the above observations