# Exploratory Data Analysis : Haberman's Survival

## Objective : To gain insight into the survival status of patients who underwent surgery for breast cancer.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels import robust
%matplotlib inline

In [None]:
df = pd.read_csv('../input/habermans-survival-data-set/haberman.csv',header=None, names=['age', 'year', 'nodes', 'status'])
df.head()

> ### Summary
* age    -  Age of patient at time of operation (numerical)
* year   -  Patient's year of operation (year - 1900, numerical)
* nodes  -  Number of positive axillary nodes detected (numerical)
* status -  Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years
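The encodings listed above (a two-digit year and a numeric status) can be decoded into more readable columns before plotting. The following is a minimal sketch, assuming a small hand-made frame with the same column names; the rows and the column names `year_full` and `status_label` are illustrative and not part of the dataset.

```python
import pandas as pd

# Hypothetical rows that follow the notebook's column layout.
sample = pd.DataFrame(
    {"age": [30, 34, 52], "year": [64, 59, 69], "nodes": [1, 0, 3], "status": [1, 2, 1]}
)

# 'year' is stored as (year - 1900), so add 1900 to recover the calendar year.
sample["year_full"] = sample["year"] + 1900

# Map the numeric class attribute to descriptive labels.
sample["status_label"] = sample["status"].map(
    {1: "survived 5 years or longer", 2: "died within 5 years"}
)

print(sample)
```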

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# Number of data points and features:
df.shape

In [None]:
# Column names:
df.columns

In [None]:
# count of data points for each class in 'status':
df['status'].value_counts()

* 225 patients survived 5 years or longer
* 81 patients died within 5 years of the operation
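To express the class counts above as proportions, `value_counts(normalize=True)` can be used. This sketch rebuilds a status column from the two reported counts (225 and 81) so that it runs on its own; in the notebook the same call would be applied to `df['status']`.

```python
import pandas as pd

# Rebuild the status column from the counts reported above (225 and 81).
status = pd.Series([1] * 225 + [2] * 81, name="status")

# normalize=True returns each class as a fraction of all rows.
proportions = status.value_counts(normalize=True)
print(proportions.round(3))
```

Roughly three quarters of the rows belong to class 1, which is the imbalance referred to in the conclusion.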

### Univariate Analysis

#### Distribution of 'Status' feature

In [None]:
sns.countplot(x='status', data=df).set_title('Distribution of Status')

* From the above visualization, patients who survived 5 years or longer clearly outnumber those who died within 5 years of the operation.

In [None]:
sns.histplot(df['age'], kde=True, stat='density').set_title('Distribution of age')

* Most of the patients in this dataset were between 40 and 65 years old at the time of operation.

In [None]:
sns.histplot(df['year'], kde=True, stat='density').set_title('Distribution of year')

In [None]:
sns.histplot(df['nodes'], kde=True, stat='density').set_title('Distribution of nodes')

* The distribution of nodes is right-skewed: most patients had 0 or 1 positive axillary nodes detected.
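A visual impression of right skew can be backed with a number. The sketch below uses a hypothetical list of node counts (not the real data) to show how sample skewness and the share of small values might be computed; in the notebook `df['nodes']` would take the place of `nodes`.

```python
import pandas as pd

# Hypothetical node counts with many zeros and a long right tail.
nodes = pd.Series([0, 0, 0, 0, 1, 1, 2, 3, 7, 25])

skewness = nodes.skew()             # a value > 0 indicates right skew
share_0_or_1 = (nodes <= 1).mean()  # fraction of values equal to 0 or 1

print(f"skewness = {skewness:.2f}, share with 0 or 1 nodes = {share_0_or_1:.0%}")
```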

### PDF (Probability Density Function)

In [None]:
sns.FacetGrid(df, hue = 'status',height = 6).map(sns.histplot,"age",kde = True,stat = "density").add_legend()
plt.ylabel("Density")
plt.title('Survival Status vs Age')

* Patients younger than 34 all survived.
* Patients older than 77 did not survive.
* Between ages 40 to 60 and 65 to 70, the density of non-survivors is higher than that of survivors.
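Age cutoffs read from a density plot are approximate, so it can help to check them against the per-class minimum and maximum. A minimal sketch, assuming a small hypothetical frame; in the notebook the same `groupby` would be run on `df`.

```python
import pandas as pd

# Hypothetical rows; the notebook's df would be used instead.
sample = pd.DataFrame(
    {"age": [30, 33, 45, 77, 34, 45, 62, 83], "status": [1, 1, 1, 1, 2, 2, 2, 2]}
)

# Youngest and oldest patient within each status group.
age_range = sample.groupby("status")["age"].agg(["min", "max"])
print(age_range)
```

Any age below the minimum of class 2 contains only survivors, and any age above the maximum of class 1 contains only non-survivors.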

In [None]:
sns.FacetGrid(df, hue = 'status',height = 6).map(sns.histplot,"nodes",kde = True,stat = "density").add_legend()
plt.ylabel("Density")
plt.title('Survival Status vs Axillary Nodes')

* More than 55% of patients with 0 to 5 axillary nodes survived, and around 12% did not.
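Shares like the ones quoted above can be tabulated directly instead of being read off the plot. The sketch below is one possible way to do so, using `pd.crosstab` with `normalize="all"` so that every cell is a fraction of all patients; the rows and the label `nodes_range` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the notebook's df would be used instead.
sample = pd.DataFrame(
    {"nodes": [0, 1, 2, 4, 5, 0, 3, 9, 12, 30], "status": [1, 1, 1, 1, 1, 2, 2, 1, 2, 2]}
)

# Label each patient by whether the node count falls in the 0-5 range.
nodes_range = sample["nodes"].le(5).map({True: "0-5", False: ">5"}).rename("nodes_range")

# Each cell is the fraction of ALL patients in that (range, status) combination.
shares = pd.crosstab(nodes_range, sample["status"], normalize="all")
print(shares)
```

Passing `normalize="index"` instead would give the survival split within each node range, which answers a slightly different question.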

In [None]:
sns.FacetGrid(df, hue = 'status',height = 6).map(sns.histplot,"year",kde = True,stat = "density").add_legend()
plt.ylabel("Density")
plt.title('Survival Status vs Operation_year')

### CDF (Cumulative Distribution Function)


In [None]:
# Plot CDF for 'Age'
counts, bin_edges = np.histogram(df['age'], bins = 10, density = True)

pdf_age = counts/sum(counts)
print(pdf_age)
print(bin_edges)
cdf_age = np.cumsum(pdf_age)
print(cdf_age)

In [None]:
plt.figure(figsize=(9,6))
plt.plot(bin_edges[1:],pdf_age)
plt.plot(bin_edges[1:],cdf_age)
plt.ylabel('Density')
plt.xlabel('Age')
plt.legend(['PDF of Age','CDF of Age'])
plt.show()
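The CDF above is computed from 10 bins, so its resolution is limited by the bin width. As an alternative, an empirical CDF can be built from the sorted values without any binning. This is a sketch on hypothetical ages, not the notebook's method; `ages` stands in for `df['age']`.

```python
import numpy as np

# Hypothetical ages; the notebook's df['age'] would be used instead.
ages = np.array([30, 34, 42, 42, 50, 57, 63, 70])

# Sort the values; the i-th sorted value has cumulative probability i / n.
x = np.sort(ages)
ecdf = np.arange(1, len(x) + 1) / len(x)

# Fraction of patients aged 50 or younger, read directly from the data.
share_le_50 = np.mean(ages <= 50)
print(share_le_50)
```

Plotting `x` against `ecdf` with `plt.step(x, ecdf, where="post")` gives a staircase curve that reaches 1 at the largest age.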

In [None]:
# creating data frame for each status
Survived = df.loc[df["status"] == 1]
Not_Survived = df.loc[df["status"] == 2]

In [None]:

counts_S, bin_edges_S = np.histogram(Survived['age'], bins = 10, density = True)

pdf_age_survived = counts_S/sum(counts_S)
cdf_age_survived = np.cumsum(pdf_age_survived)


counts_NS, bin_edges_NS = np.histogram(Not_Survived['age'], bins = 10, density = True)

pdf_age_Not_survived = counts_NS/sum(counts_NS)
cdf_age_Not_survived = np.cumsum(pdf_age_Not_survived)

plt.figure(figsize=(9,6))
plt.plot(bin_edges_S[1:],pdf_age_survived)
plt.plot(bin_edges_S[1:],cdf_age_survived)

plt.plot(bin_edges_NS[1:],pdf_age_Not_survived)
plt.plot(bin_edges_NS[1:],cdf_age_Not_survived)


plt.ylabel('Density')
plt.xlabel('Age')
plt.legend(['PDF of Age Survived','CDF of Age Survived','PDF of Age Not Survived','CDF of Age Not Survived'])
plt.show()

* From the above graph, patients older than 77 did not survive.
* Patients younger than about 38 survived the operation (the curves are drawn at the right edge of each bin, so this cutoff is approximate).
* The probability of survival is high for patients younger than 48.
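The histogram, normalization, and cumulative sum steps are repeated for every feature and class, so they could be wrapped in a small helper. The function name `binned_pdf_cdf` and the sample ages are hypothetical. With equal-width bins, dividing raw counts by their sum gives the same per-bin probabilities as the notebook's `density = True` followed by renormalization.

```python
import numpy as np

def binned_pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probabilities, and their cumulative sum."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()  # probability mass in each bin
    cdf = np.cumsum(pdf)         # running total, ends at 1
    return bin_edges, pdf, cdf

# Hypothetical ages; e.g. Survived['age'] in the notebook.
ages = np.array([30, 34, 42, 42, 50, 57, 63, 70])
edges, pdf, cdf = binned_pdf_cdf(ages, bins=5)
print(edges, pdf, cdf)
```

As in the cells above, the curves would then be drawn with `plt.plot(edges[1:], pdf)` and `plt.plot(edges[1:], cdf)`.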

In [None]:
counts_S, bin_edges_S = np.histogram(Survived['year'], bins = 10, density = True)

pdf_year_survived = counts_S/sum(counts_S)
cdf_year_survived = np.cumsum(pdf_year_survived)


counts_NS, bin_edges_NS = np.histogram(Not_Survived['year'], bins = 10, density = True)

pdf_year_Not_survived = counts_NS/sum(counts_NS)
cdf_year_Not_survived = np.cumsum(pdf_year_Not_survived)

plt.figure(figsize=(9,6))
plt.plot(bin_edges_S[1:],pdf_year_survived)
plt.plot(bin_edges_S[1:],cdf_year_survived)

plt.plot(bin_edges_NS[1:],pdf_year_Not_survived)
plt.plot(bin_edges_NS[1:],cdf_year_Not_survived)


plt.ylabel('Density')
plt.xlabel('year')
plt.legend(['PDF of year Survived','CDF of year Survived','PDF of year Not Survived','CDF of year Not Survived'])
plt.show()

In [None]:
counts_S, bin_edges_S = np.histogram(Survived['nodes'], bins = 10, density = True)

pdf_nodes_survived = counts_S/sum(counts_S)
cdf_nodes_survived = np.cumsum(pdf_nodes_survived)


counts_NS, bin_edges_NS = np.histogram(Not_Survived['nodes'], bins = 10, density = True)

pdf_nodes_Not_survived = counts_NS/sum(counts_NS)
cdf_nodes_Not_survived = np.cumsum(pdf_nodes_Not_survived)

plt.figure(figsize=(9,6))
plt.plot(bin_edges_S[1:],pdf_nodes_survived)
plt.plot(bin_edges_S[1:],cdf_nodes_survived)

plt.plot(bin_edges_NS[1:],pdf_nodes_Not_survived)
plt.plot(bin_edges_NS[1:],cdf_nodes_Not_survived)


plt.ylabel('Density')
plt.xlabel('Axillary Nodes')
plt.legend(['PDF of Axillary Nodes Survived','CDF of Axillary Nodes Survived','PDF of Axillary Nodes Not Survived','CDF of Axillary Nodes Not Survived'])
plt.show()

* Patients with more than 46 axillary nodes did not survive.

In [None]:
plt.figure(figsize=(8,5))
plt.plot(Survived["age"], np.zeros_like(Survived["age"]), '*', label = "Survived")
plt.plot(Not_Survived["age"], np.zeros_like(Not_Survived["age"]), '*', label = "Not Survived")
plt.title("scatter plot for Age")
plt.xlabel("age")
plt.legend()
plt.show()

* From the above chart, most of the patients who did not survive were between ages 41 and 70, although the two groups overlap heavily in this range.

In [None]:
sns.pairplot(df, hue = "status",vars = ["age", "year", "nodes"], height = 3)
plt.show()

In [None]:
print('Medians:')
print(np.median(Survived['nodes']))
# Median with an artificial outlier (50) appended, to show how little it moves
print(np.median(np.append(Survived['nodes'],50)))
print(np.median(Not_Survived['nodes']))

print('\nQuantiles:')
print(np.percentile(Survived['nodes'],np.arange(0,100,25)))
print(np.percentile(Not_Survived['nodes'],np.arange(0,100,25)))

print('\n90th percentile:')
print(np.percentile(Survived['nodes'],90))
print(np.percentile(Not_Survived['nodes'],90))

print ('\nMedian Absolute Deviation')
print(robust.mad(Survived['nodes']))
print(robust.mad(Not_Survived['nodes']))

#### Observations:

* From the above values, the median number of axillary nodes is 0 for long survival and 4 for short survival, so patients with a short survival time typically had more positive nodes.
* The 90th percentile of nodes is 8 for patients who survived and 20 for patients who did not: 90% of survivors had at most 8 positive nodes, while 90% of non-survivors had at most 20.
* The probability of survival is high for patients younger than 38.
* Patients older than 77 did not survive.
* Between ages 40 to 60 and 65 to 70, the density of non-survivors is higher than that of survivors.
* More than 55% of patients with 0 to 5 axillary nodes survived, and around 12% did not.
* Patients with more than 46 axillary nodes did not survive.
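The median and the median absolute deviation are used above because they are robust to outliers. The sketch below illustrates this on hypothetical node counts by appending one extreme value. The helper `raw_mad` is an assumption for illustration: it computes the unscaled deviation, whereas `robust.mad` from statsmodels applies a normal-consistency scaling constant by default, so its values differ by a constant factor.

```python
import numpy as np

def raw_mad(x):
    """Unscaled median absolute deviation around the median."""
    med = np.median(x)
    return np.median(np.abs(x - med))

# Hypothetical node counts, then the same data with one outlier appended.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4])
with_outlier = np.append(nodes, 50)

print("mean:  ", nodes.mean(), "->", with_outlier.mean())
print("median:", np.median(nodes), "->", np.median(with_outlier))
print("MAD:   ", raw_mad(nodes), "->", raw_mad(with_outlier))
```

The mean shifts substantially once the outlier is added, while the median and the raw MAD stay the same for this sample.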

#### Conclusion:
* This dataset is imbalanced, the classes overlap heavily, and it is therefore hard to classify.
* That said, the 'age' and 'nodes' features give more insight than the 'year' feature.
* Applying more advanced techniques for handling imbalanced datasets may make it easier to draw further insights.
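The notebook does not say which technique for handling imbalance is meant. One common and simple option is inverse-frequency class weighting, sketched here purely as an illustration using the class counts reported earlier (225 and 81). The formula `n_samples / (n_classes * count)` is a widely used heuristic, not something taken from this analysis.

```python
# Class counts reported earlier in the notebook.
counts = {1: 225, 2: 81}

n_samples = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: the rarer class receives the larger weight.
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}
print(weights)
```

Weights of this form could be passed to a classifier that accepts per-class weights, so that errors on the minority class count for more during training.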