## Exploratory Data Analysis: Haberman's Cancer Survival Data
1. This dataset was compiled between 1958 and 1970 at the University of Chicago's Billings Hospital by Haberman and his fellow researchers.
2. It records details of patients who underwent surgery for breast cancer. The attributes are:
   age: the age of the patient at the time of surgery;
   year: the two-digit year (19xx) in which the surgery was performed;
   axillary (nodes): the number of axillary lymph nodes that tested positive for cancer;
   survived: the survival status after surgery. '1' indicates that the patient survived 5 years or longer after the surgery and '2' indicates that the patient died within 5 years of the surgery.
3. The axillary nodes attribute is recorded to gauge the spread of the cancer: the axillary lymph nodes are the lymph nodes nearest to the breast, so their involvement indicates that the cancer has spread locally (see http://www.breastcancer.org/symptoms/diagnosis/staging). The remaining attributes (age, year, survived) are self-explanatory.


### Objective
###### To predict whether a patient will survive 5 years or longer after surgery, given the patient's age, the year of operation and the number of positive axillary nodes

In [9]:
#importing all important libraries and reading the dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

survival_df = pd.read_csv('../input/haberman.csv',names=['age', 'year', 'axillary', 'survived'])
#view first few rows
survival_df.head(3)

In [10]:
#Computing Meta-data
print('The number of instances/records in the dataset: {0} and the Number of attributes: {1} \n'\
      .format(survival_df.shape[0],survival_df.shape[1]))
print('List of attributes:',survival_df.columns)

In [11]:
#Analysing Class distribution:
survival_df['survived'].value_counts()

##### Observations
1. There are two classes: 1 indicates >= 5 years of survival and 2 indicates < 5 years of survival
2. The dataset is imbalanced and calls for techniques that reduce the effect of class imbalance; unmodified (vanilla) algorithms are likely to give predictions biased towards survival
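
One common way to counter the imbalance noted above is to weight each class inversely to its frequency. The sketch below is only an illustration: the helper name `balanced_class_weights` and the toy label array are assumptions, not part of the original notebook. It uses the heuristic `n_samples / (n_classes * class_count)`.

```python
import numpy as np

def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy labels (made up, not the real data): six of class 1, two of class 2.
toy_labels = np.array([1, 1, 1, 2, 1, 1, 2, 1])

# Share of each class in the sample.
classes, counts = np.unique(toy_labels, return_counts=True)
shares = {int(c): float(n) / toy_labels.size for c, n in zip(classes, counts)}

# The rarer class receives the larger weight.
weights = balanced_class_weights(toy_labels)
print(shares)
print(weights)
```

In the notebook itself the same helper could be called on `survival_df['survived'].to_numpy()`.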

### Bivariate Analysis
#### Pair Plots

In [4]:
#pairplots using seaborn
sns.set_style('whitegrid')
# 'height' replaces the 'size' argument, which was renamed in seaborn 0.9
sns.pairplot(survival_df,hue='survived',vars=survival_df.columns[:-1],height=4,markers=['o','D'],palette='cubehelix')
plt.show()

#### Observations
1. No strong correlation or trend is visible between the attributes of the dataset
2. It is very hard, if not impossible, to separate the classes with a hyperplane or with rules built on pairs of attributes
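
The visual impression of weak correlation can be checked numerically with Pearson correlation coefficients. The following is a minimal sketch under stated assumptions: `pairwise_correlations` is a hypothetical helper and the toy columns are made up for illustration, so the printed values say nothing about the real data.

```python
import numpy as np

def pairwise_correlations(columns):
    """Return {(name_a, name_b): Pearson r} for every pair of named columns."""
    names = list(columns)
    # np.corrcoef treats each row as one variable.
    matrix = np.corrcoef([columns[n] for n in names])
    return {(names[i], names[j]): float(matrix[i, j])
            for i in range(len(names))
            for j in range(i + 1, len(names))}

# Toy columns (made up, not the real data).
toy = {
    'age': np.array([30, 35, 42, 50, 61, 70], dtype=float),
    'year': np.array([64, 58, 66, 60, 65, 59], dtype=float),
    'axillary': np.array([1, 0, 4, 0, 13, 2], dtype=float),
}
corr = pairwise_correlations(toy)
for pair, r in corr.items():
    print(pair, round(r, 2))
```

On the notebook's dataframe, `survival_df[['age', 'year', 'axillary']].corr()` should give the equivalent matrix directly.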

### Univariate Analysis
##### Distribution Plots

In [5]:
#age
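# 'height' replaces the renamed 'size' argument; histplot(kde=True) replaces the deprecated distplot
sns.FacetGrid(survival_df,hue='survived',height=6,palette='cubehelix') \
   .map(sns.histplot,'age',kde=True,stat='density') \
   .add_legend()
plt.show()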

In [6]:
#year
sns.FacetGrid(survival_df,hue='survived',palette='cubehelix',height=6)\
   .map(sns.histplot,'year',kde=True,stat='density')\
   .add_legend()
plt.show()

In [7]:
#axillary nodes
sns.FacetGrid(survival_df,hue='survived',palette='cubehelix',height=6)\
   .map(sns.histplot,'axillary',kde=True,stat='density')\
   .add_legend()
plt.show()

#### Observations:
1. The classes are not linearly separable even on a single variable
2. Age: Patients below 42 years of age appear to have a higher chance of survival than others
3. Year of Surgery: No clear conclusion can be drawn.
4. Axillary Nodes: Patients with few (<= 3) positive axillary nodes have a higher chance of survival
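
A threshold observation such as "<= 3 nodes" can be quantified by comparing the share of survivors on either side of the cut. The sketch below assumes a hypothetical helper `survival_share` and uses made-up toy arrays; it illustrates the calculation and is not a result from the real dataset.

```python
import numpy as np

def survival_share(values, labels, threshold):
    """Share of class-1 (survived) patients at or below vs. above a threshold."""
    values = np.asarray(values)
    survived = np.asarray(labels) == 1
    low = values <= threshold
    share_low = survived[low].mean() if low.any() else float('nan')
    share_high = survived[~low].mean() if (~low).any() else float('nan')
    return float(share_low), float(share_high)

# Toy data (made up): positive node counts and survival labels (1 or 2).
toy_nodes = np.array([0, 1, 3, 2, 0, 8, 12, 5, 1, 20])
toy_labels = np.array([1, 1, 1, 2, 1, 2, 2, 1, 1, 2])

low, high = survival_share(toy_nodes, toy_labels, threshold=3)
print('share surviving with <= 3 nodes:', round(low, 2))
print('share surviving with  > 3 nodes:', round(high, 2))
```

With the notebook's dataframe, the call would be `survival_share(survival_df['axillary'], survival_df['survived'], 3)`; the same helper works for an age cut.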

### PDF and CDF

In [8]:
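#age
# split the data by class: 1 = survived 5 years or longer, 2 = died within 5 years
survived = survival_df[survival_df['survived']==1]
not_survived = survival_df[survival_df['survived']==2]
fig,ax=plt.subplots(1,1,figsize=(14,8))
# histogram over 10 equal-width bins; dividing by the sum gives the fraction of patients per bin
counts,bin_edges = np.histogram(survived['age'],bins=10,density=True)
pdf=counts/sum(counts)
# running total of the per-bin fractions
cdf=np.cumsum(pdf)
# plot against the right edge of each bin
plt.plot(bin_edges[1:],pdf,label='survived pdf')
plt.plot(bin_edges[1:],cdf,label='survived cdf')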

counts,bin_edges = np.histogram(not_survived['age'],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='not_survived pdf')
plt.plot(bin_edges[1:],cdf,label='not_survived cdf')

plt.legend()
plt.show()

#### Observations:
1. No hard rule can be deduced from the age attribute alone; however, patients younger than 44 appear to have a higher chance of survival

In [9]:
#year
fig,ax=plt.subplots(1,1,figsize=(14,8))
counts,bin_edges = np.histogram(survived['year'],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='survived pdf')
plt.plot(bin_edges[1:],cdf,label='survived cdf')

counts,bin_edges = np.histogram(not_survived['year'],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='not_survived pdf')
plt.plot(bin_edges[1:],cdf,label='not_survived cdf')

plt.legend()
plt.show()

In [10]:
#Axillary Nodes
fig,ax=plt.subplots(1,1,figsize=(14,8))
counts,bin_edges = np.histogram(survived['axillary'],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='survived pdf')
plt.plot(bin_edges[1:],cdf,label='survived cdf')

counts,bin_edges = np.histogram(not_survived['axillary'],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='not_survived pdf')
plt.plot(bin_edges[1:],cdf,label='not_survived cdf')

plt.legend()
plt.show()

#### Observations:
1. Roughly, it can be inferred that as the number of positive lymph nodes increases, the survival rate falls (i.e. a patient becomes less likely to survive past 5 years)
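
The three PDF/CDF cells above repeat the same histogram steps for each attribute and class. As a sketch, those steps could be wrapped in one helper; the name `binned_pdf_cdf` and the toy ages are assumptions for illustration. Because the bins have equal width, normalising the raw counts gives the same per-bin fractions as normalising the `density=True` output.

```python
import numpy as np

def binned_pdf_cdf(values, bins=10):
    """Return (right bin edges, fraction of values per bin, cumulative fraction)."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations in each bin
    cdf = np.cumsum(pdf)          # running total, ends at 1.0
    return bin_edges[1:], pdf, cdf

# Toy ages (made up, not the real data).
toy_ages = np.array([30, 34, 38, 41, 45, 47, 52, 55, 60, 63, 67, 72])
right_edges, pdf, cdf = binned_pdf_cdf(toy_ages, bins=4)
print(right_edges)
print(pdf)
print(cdf)
```

In the notebook, each plotting cell would then reduce to one call per class, for example `edges, pdf, cdf = binned_pdf_cdf(survived['age'])` followed by the two `plt.plot` calls.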

In [11]:
#Descriptive Statistics
survival_df.describe()

### Box plot

In [12]:
#age
sns.boxplot(data=survival_df,x='survived',y='age')
plt.show()

In [13]:
#year
sns.boxplot(data=survival_df,x='survived',y='year')
plt.show()

In [14]:
#no. of infected axillary nodes
sns.boxplot(data=survival_df,x='survived',y='axillary')
plt.show()

#### Observations:
1. Age: The box plots suggest that patients below about 35 years of age tend to survive past 5 years and patients above about 78 tend not to; these are tendencies read off a small sample, not guarantees
2. Year: Patients operated on after 1965 appear to have a somewhat higher chance of survival
3. Axillary nodes: Patients with more than 8 positive axillary nodes appear much less likely to survive past 5 years
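
The thresholds above are read off the boxes and whiskers, so it can help to compute those quantities explicitly. The sketch below assumes a hypothetical helper `box_stats` and made-up node counts; it uses the usual 1.5 x IQR whisker rule, which is also the seaborn `boxplot` default (`whis=1.5`).

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends and outlier count as drawn by a standard box plot."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers reach the most extreme points that lie within whis * IQR of the box.
    inside = values[(values >= q1 - whis * iqr) & (values <= q3 + whis * iqr)]
    return {
        'q1': float(q1),
        'median': float(median),
        'q3': float(q3),
        'whisker_low': float(inside.min()),
        'whisker_high': float(inside.max()),
        'n_outliers': int(values.size - inside.size),
    }

# Toy node counts (made up, not the real data).
toy_nodes = np.array([0, 0, 1, 1, 2, 3, 4, 7, 30])
stats = box_stats(toy_nodes)
print(stats)
```

Applied per class, for example `box_stats(survived['axillary'])` and `box_stats(not_survived['axillary'])`, this gives the numbers behind each box in the plots.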

### Violin Plots

In [15]:
#age
sns.violinplot(data=survival_df,x='survived',y='age')
plt.show()

In [16]:
#year
sns.violinplot(data=survival_df,x='survived',y='year')
plt.show()

In [17]:
#axillary nodes
sns.violinplot(data=survival_df,x='survived',y='axillary')
plt.show()

#### Observations:
1. The observations deduced from the box plots hold here as well, and no new observations are found