**HABERMAN DATASET EXPLORATORY DATA ANALYSIS**

In [None]:
import warnings

warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
 

habermandataset = pd.read_csv('../input/eda-on-haberman-cancer-survival/haberman.csv')
habermandataset.head(5)

In [None]:
print(habermandataset.shape)
print(habermandataset.columns)

In [None]:
#Converting the status column from numeric to string
habermandataset['status'] = habermandataset['status'].apply(lambda x: 'Survived' if x == 1 else 'Dead')

In [None]:
habermandataset['status'].value_counts()

The dataset has two classes, Survived and Dead. The counts above show that the classes are imbalanced, with survivors clearly outnumbering deaths.
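To put a number on the imbalance, the class fractions can be computed alongside the raw counts. The following is a minimal sketch, assuming a DataFrame with a `status` column like the one built above; `class_balance` is a hypothetical helper name, and the small `toy` frame is made-up data for illustration, not the Haberman data.

```python
import pandas as pd

def class_balance(df, label_col="status"):
    """Return the count and fraction of rows in each class of label_col."""
    counts = df[label_col].value_counts()
    fractions = df[label_col].value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "fraction": fractions})

# Made-up frame for illustration only (not the real dataset)
toy = pd.DataFrame({"status": ["Survived", "Survived", "Survived", "Dead"]})
balance = class_balance(toy)
print(balance)
```

Calling `class_balance(habermandataset)` in the notebook would give the same table for the real data.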

In [None]:
habermandataset.describe()

# OBJECTIVE
Determine which of the features (age, year, nodes) help in classifying a patient's survival status.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid");
# seaborn 0.9 renamed the `size` argument to `height`
sns.FacetGrid(habermandataset, hue="status", height=8, palette="Set1") \
   .map(plt.scatter, "age", "year") \
   .set(title='Age vs Year')\
   .add_legend();
plt.show();

In [None]:
plt.close();
sns.set_style("whitegrid");
sns.pairplot(habermandataset, hue="status", height=3).set(title='Scatter plot');
plt.show()

Because the two classes overlap heavily, it is difficult to draw any conclusion from the scatter plots.

# HISTOGRAMS AND PDFS

In [None]:
sns.FacetGrid(habermandataset, hue="status", palette=['Red', 'Blue'], height=10) \
   .map(sns.distplot, "age") \
   .set(xticks=np.arange(0,100,2),title='Histogram/PDF for age') \
   .add_legend();
plt.show();

From the density plot above, we observe that people aged 30 to 34 are more likely to survive, and people older than 76 are more likely to die.  
Between the ages of 34 and 76, the distributions of those who survived and those who died overlap heavily, so no classification can be made based on age alone.
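The age boundaries read off the plot can be checked numerically by comparing the minimum and maximum age within each class. This is a sketch under the assumption that the frame has `age` and `status` columns as above; `age_range_by_status` is a hypothetical helper, and the `toy` frame is made-up data used only to show the output shape.

```python
import pandas as pd

def age_range_by_status(df):
    """Minimum and maximum age observed within each status class."""
    return df.groupby("status")["age"].agg(["min", "max"])

# Made-up frame for illustration only (not the real dataset)
toy = pd.DataFrame({
    "age": [30, 34, 50, 60, 40, 78],
    "status": ["Survived", "Survived", "Survived", "Survived", "Dead", "Dead"],
})
ranges = age_range_by_status(toy)
print(ranges)
```

Ages below the minimum of the Dead class contain only survivors, and ages above the maximum of the Survived class contain only deaths, which is the non-overlapping part of the two distributions.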

In [None]:
sns.FacetGrid(habermandataset, hue="status", palette=['Red', 'Blue'], height=8) \
   .map(sns.distplot, "year") \
   .set(title='Histogram/PDF for year')\
   .add_legend();
plt.show();

No classification can be made based on the year alone, as every year contains both survivors and people who died.

In [None]:
sns.FacetGrid(habermandataset, hue="status", palette=['Red', 'Blue'], height=8) \
   .map(sns.distplot, "nodes") \
   .set(xticks=np.arange(-10,60,2),title='Histogram/PDF for nodes')\
   .add_legend();
plt.show();

There is a high concentration of survivors where the nodes value is close to 0 (0-2), and it drops sharply as the value of nodes increases.
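The concentration near zero can be quantified as the share of each class with at most two nodes. The sketch below assumes `nodes` and `status` columns as above; `share_with_few_nodes` and its `max_nodes` parameter are hypothetical names, and the `toy` frame is made-up data.

```python
import pandas as pd

def share_with_few_nodes(df, max_nodes=2):
    """Fraction of each status class whose node count is <= max_nodes."""
    few = df["nodes"] <= max_nodes          # boolean Series
    return few.groupby(df["status"]).mean()  # mean of booleans = fraction

# Made-up frame for illustration only (not the real dataset)
toy = pd.DataFrame({
    "nodes": [0, 1, 2, 10, 0, 5, 20, 30],
    "status": ["Survived"] * 4 + ["Dead"] * 4,
})
shares = share_with_few_nodes(toy)
print(shares)
```

A much larger share for the Survived class than for the Dead class would support the reading of the density plot.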

# BOX PLOTS

In [None]:
plt.figure(figsize=(16, 6))
sns.boxplot(x='status',y='age', data=habermandataset,hue='status')\
    .set(yticks=np.arange(25,90,2),title='Box plot for Age')
plt.show()

The box plot above shows that every patient aged 30 to 33 survived, while every patient aged 77 to 83 died.

In [None]:
plt.figure(figsize=(16, 6))
sns.boxplot(x='status',y='year', data=habermandataset,hue='status')\
   .set(yticks=np.arange(55,80,2),title='Box plot for year')
plt.show()

In [None]:
h = habermandataset[habermandataset.year <=60]
print(h['status'].value_counts())
print()
h = habermandataset[habermandataset.year >=66]
print(h['status'].value_counts())

For the years 58-60, more than twice as many people survived as died, and for the years 66-69, almost three times as many people survived as died.
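Rather than comparing counts for hand-picked year ranges, the survival rate can be computed for every year at once. This is a minimal sketch, assuming `year` and `status` columns as above with the label `'Survived'` for survivors; `survival_rate_by_year` is a hypothetical helper and the `toy` frame is made-up data.

```python
import pandas as pd

def survival_rate_by_year(df):
    """Fraction of patients labelled 'Survived' for each year of operation."""
    survived = df["status"].eq("Survived")
    return survived.groupby(df["year"]).mean()

# Made-up frame for illustration only (not the real dataset)
toy = pd.DataFrame({
    "year": [58, 58, 58, 65, 65],
    "status": ["Survived", "Survived", "Dead", "Survived", "Dead"],
})
rates = survival_rate_by_year(toy)
print(rates)
```

A per-year rate avoids the choice of range boundaries, although years with few patients will give noisy values.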

In [None]:
plt.figure(figsize=(16, 6))
sns.boxplot(x='status', y='nodes', data=habermandataset, hue='status')\
   .set(title='Box plot for nodes')
plt.show()

In [None]:
plt.figure(figsize=(16, 6))
sns.violinplot(x="status", y="age", data=habermandataset, hue='status')\
   .set(title='Violin plot for age')
plt.show()

As indicated above, younger people have a higher chance of survival.

In [None]:
plt.figure(figsize=(16, 6))
sns.violinplot(x="status", y="year", data=habermandataset, hue='status')\
   .set(yticks=np.arange(50,75,1),title='Violin plot for year')
plt.show()

The years 59-61 have a higher survival rate, while the years 64-66 have a higher death rate.

In [None]:
plt.figure(figsize=(16, 6))
sns.violinplot(x="status", y="nodes", data=habermandataset, hue='status')\
   .set(title='Violin plot for nodes')
plt.show()

Survivors are heavily concentrated at node values close to 0, as shown by the width of the Survived violin in that region.
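The width of a violin shows where the patients of one class are concentrated, not the survival rate itself. To estimate the rate directly, the survival fraction can be computed within bins of the node count. The sketch below assumes `nodes` and `status` columns as above; `survival_rate_by_node_bin` is a hypothetical helper, the bin edges are an arbitrary choice made for illustration, and the `toy` frame is made-up data.

```python
import pandas as pd

def survival_rate_by_node_bin(df, bins=(-1, 0, 3, 10, 60)):
    """Fraction of 'Survived' patients within each bin of node count.

    The default bin edges are an illustrative assumption; intervals are
    right-closed, so (-1, 0] holds exactly the patients with 0 nodes.
    """
    node_bin = pd.cut(df["nodes"], bins=list(bins))
    survived = df["status"].eq("Survived")
    return survived.groupby(node_bin, observed=True).mean()

# Made-up frame for illustration only (not the real dataset)
toy = pd.DataFrame({
    "nodes": [0, 0, 0, 2, 5, 15, 25],
    "status": ["Survived", "Survived", "Dead", "Survived",
               "Dead", "Dead", "Dead"],
})
bin_rates = survival_rate_by_node_bin(toy)
print(bin_rates)
```

If the rate falls steadily from the lowest bin to the highest on the real data, that supports the interpretation drawn from the violin plot.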

# CONCLUSION

1. The features nodes and age have the highest impact on survival.
2. Younger people have a higher chance of survival than older people.
3. A node value close to 0 is associated with a higher chance of survival.
