# EDA on Haberman Survival Dataset
Source: https://www.kaggle.com/gilsousa/habermans-survival-data-set/data

EDA by: Ashishkumar Teotia

EDA Date: 4th May 2018

# DataSet Details:
Title: Haberman's Survival Data

Sources: (a) Donor: Tjen-Sien Lim (limt@stat.wisc.edu) (b) Date: March 4, 1999

Past Usage:

1. Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models, Proceedings of the 9th International Biometrics Conference, Boston, pp. 104-122.

2. Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984), Graphical Models for Assessing Logistic Regression Models (with discussion), Journal of the American Statistical Association 79: 61-83.

3. Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.

Relevant Information: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Number of Instances: 306

Number of Attributes: 4 (including the class attribute)


Attribute Information:

1. Age of patient at time of operation (numerical)

2. Patient's year of operation (year - 1900, numerical)

3. Number of positive axillary nodes detected (numerical)

4. Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years

Missing Attribute Values: None

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

print('Libraries Imported')
import os
print(os.listdir('../input'))

In [6]:
#dataSetURL = '/resources/data/AppliedAI/EDA_ch8/haberman.csv'

labels = ['age', 'operation_year', 'axil_nodes', 'survived_status']

df = pd.read_csv('../input/haberman.csv', names = labels)
df.head()

In [7]:
df.shape

In [8]:
df.columns

In [9]:
df.describe()

In [10]:
df.info()

**No missing values** were found in the dataset, but the class column should be relabeled in a readable form: **1 means the patient survived, which we map to 'survived'**, and **2 means the patient did not survive, which we map to 'dead'**.

In [11]:
df['survived_status'] = df['survived_status'].map({1:'survived', 2:'dead'})
df.tail()

In [12]:
df['survived_status'].value_counts()

The dataset is imbalanced: of the 306 patients, 225 survived for 5 years or more and 81 died within 5 years. A 225:81 split is not balanced.
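To put a number on the imbalance, the short sketch below turns the 225:81 split into class shares and a majority-class baseline. The counts are typed in by hand here so that the snippet runs on its own; in the notebook they would come from `df['survived_status'].value_counts()`. The names `class_counts`, `class_share` and `majority_baseline` are illustrative.

```python
import pandas as pd

# Class counts from the 225:81 split reported above (hard-coded so the
# snippet is self-contained; normally taken from value_counts()).
class_counts = pd.Series({'survived': 225, 'dead': 81})

# Share of each class in the dataset.
class_share = class_counts / class_counts.sum()
print(class_share.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = class_share.max()
print(f'Majority-class baseline accuracy: {majority_baseline:.3f}')
```

Any classifier built on this data should be compared against that baseline rather than against 50%.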

In [13]:
plt.scatter(df['age'],df['operation_year'], c = 'g')
plt.xlabel('Age')
plt.ylabel('Operation year')
plt.title('Operation year vs Age')
plt.show()

This scatter plot does not reveal much: the points are thoroughly mixed. Still, most operations were performed on patients aged between 40 and 68, where most of the points fall.

In [14]:
plt.scatter(df['age'],df['axil_nodes'], color = 'g')
plt.xlabel('Age')
plt.ylabel('Axil Nodes')
plt.title('Axil_nodes vs Age')
plt.show()

There is a dense concentration of data points where axil_nodes is 0.

In [12]:
plt.scatter(df['axil_nodes'], df['operation_year'], c = 'g')
plt.xlabel('Axil Nodes')
plt.ylabel('Operation year')
plt.title('Operation year vs Axil Nodes')
plt.show()

Here we can conclude that a large number of operations were done in the 7-year span between 1960 and 1966.

In [15]:
plt.close();
sns.set_style('whitegrid');
# seaborn >= 0.9 names this argument `height`; older releases called it `size`
sns.pairplot(df, hue = 'survived_status', height = 4)
plt.show()

In [16]:
sns.set_style('whitegrid');
sns.FacetGrid(df, hue = 'survived_status', height = 6)\
   .map(plt.scatter, 'age', 'axil_nodes')\
   .add_legend();
plt.show();

1. This scatter plot suggests that patients with 0 axil nodes are more likely to survive, irrespective of their age.

2. Patients with more than 30 axil nodes are very rare.

3. Patients who are older than 50 and have more than 10 axil nodes are more likely to die.
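A reading like "0 axil nodes means more likely to survive" is easy to check numerically instead of by eye. The sketch below is one way to do it, assuming the column names used in this notebook; the helper `survival_rate_by_node_group` and the small `toy` frame are hypothetical and only demonstrate the call (the toy rows are made up, not patient data).

```python
import pandas as pd

def survival_rate_by_node_group(frame):
    """Return the share of 'survived' rows for zero vs. positive node counts."""
    # Label each row by whether any positive axillary nodes were detected.
    node_group = frame['axil_nodes'].gt(0).map({False: '0 nodes', True: '1+ nodes'})
    survived = frame['survived_status'].eq('survived')
    # Mean of a boolean column is the fraction of True values.
    return survived.groupby(node_group).mean()

# Tiny made-up frame, only to show the call; not real patient data.
toy = pd.DataFrame({
    'axil_nodes': [0, 0, 0, 2, 5, 12],
    'survived_status': ['survived', 'survived', 'dead',
                        'survived', 'dead', 'dead'],
})
rates = survival_rate_by_node_group(toy)
print(rates)
```

Calling `survival_rate_by_node_group(df)` on the real frame gives the two survival rates that the scatter plot only hints at.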


In [17]:
sns.set_style('whitegrid');
sns.FacetGrid(df, hue='survived_status', height = 7) \
    .map(plt.scatter, 'operation_year', 'axil_nodes') \
    .add_legend();
plt.show()

This plot does not give a clear picture of the dataset, but it appears that most of the operations done in 1965 were unsuccessful.
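Because overlapping points make a per-year claim hard to judge from a scatter plot, a per-year table is a useful cross-check. The sketch below assumes the notebook's column names; `death_rate_by_year` and the `toy` frame are hypothetical names introduced for illustration, and the toy rows are made up.

```python
import pandas as pd

def death_rate_by_year(frame):
    """Return, per operation year, the patient count and the share who died."""
    died = frame['survived_status'].eq('dead')
    summary = died.groupby(frame['operation_year']).agg(['count', 'mean'])
    return summary.rename(columns={'count': 'patients', 'mean': 'death_rate'})

# Tiny made-up frame, only to show the call; not real patient data.
toy = pd.DataFrame({
    'operation_year': [64, 64, 65, 65, 65],
    'survived_status': ['survived', 'dead', 'dead', 'dead', 'survived'],
})
by_year = death_rate_by_year(toy)
print(by_year)
```

Running `death_rate_by_year(df)` shows whether any single year really has a death rate above 50%, and how many patients that rate is based on.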

In [21]:
sns.FacetGrid(df, hue='survived_status', height = 5) \
    .map(sns.distplot, 'axil_nodes') \
    .add_legend();
plt.show();

From this histogram of axil_nodes we can conclude that patients with 0 axil nodes are more likely to survive.

In [22]:
sns.FacetGrid(df, hue='survived_status', height = 5) \
    .map(sns.distplot, 'age') \
    .add_legend();
plt.show();

1. The two histograms overlap heavily, but patients aged 40-60 still appear more likely to die.

2. Patients younger than 40 are more likely to survive.
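The age ranges read off the overlapping histograms can be compared directly by binning age and computing a survival rate per band. This is a minimal sketch, assuming the notebook's column names; the cut-offs of 40 and 60, the helper `survival_rate_by_age_band` and the `toy` frame are illustrative assumptions, and the toy rows are made up.

```python
import pandas as pd

def survival_rate_by_age_band(frame):
    """Return the share of 'survived' rows in each age band."""
    # Cut-offs of 40 and 60 mirror the ranges discussed above; they are an
    # assumption for illustration, not a clinically motivated choice.
    bands = pd.cut(
        frame['age'],
        bins=[0, 40, 60, 120],
        right=False,  # bands are [0, 40), [40, 60), [60, 120)
        labels=['under 40', '40-59', '60 and over'],
    )
    survived = frame['survived_status'].eq('survived')
    return survived.groupby(bands, observed=True).mean()

# Tiny made-up frame, only to show the call; not real patient data.
toy = pd.DataFrame({
    'age': [30, 35, 45, 50, 55, 70],
    'survived_status': ['survived', 'survived', 'dead',
                        'survived', 'dead', 'survived'],
})
age_rates = survival_rate_by_age_band(toy)
print(age_rates)
```

On the real frame, `survival_rate_by_age_band(df)` gives one survival rate per band, which is easier to compare than two overlapping density curves.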

In [26]:
sns.FacetGrid(df, hue='survived_status', height = 5) \
    .map(sns.distplot, 'operation_year') \
    .add_legend();
plt.show();

A large number of the patients who died had their operation in 1960 or 1965.

# CDF

In [25]:
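# Histogram of axil_nodes over 20 equal-width bins
counts, bin_edges = np.histogram(df['axil_nodes'], bins=20, density = True)
# Normalise so the per-bin values sum to 1 (probability mass per bin)
pdf = counts/(sum(counts))
print(pdf)
print(bin_edges)

# Compute the CDF as the running sum of the per-bin probabilities
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label = 'PDF')
plt.plot(bin_edges[1:], cdf, label = 'CDF')
plt.xlabel('Axil_nodes')
plt.legend()
plt.show()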
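The curve above depends on the choice of 20 bins. An unbinned empirical CDF avoids that choice and makes it easy to read off statements such as "what fraction of patients had at most k nodes". The sketch below is one possible version; `empirical_cdf`, `fraction_at_most` and the `toy_nodes` list are hypothetical names, and the toy values are made up.

```python
import numpy as np

def empirical_cdf(values):
    """Return sorted values and the fraction of observations <= each one."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

def fraction_at_most(values, threshold):
    """Return the fraction of observations that are <= threshold."""
    return float(np.mean(np.asarray(values) <= threshold))

# Made-up node counts, only to show the calls; not real patient data.
toy_nodes = [0, 0, 0, 1, 2, 4, 9, 23]
x, y = empirical_cdf(toy_nodes)
share_up_to_4 = fraction_at_most(toy_nodes, 4)
print(share_up_to_4)
```

With the real data, `fraction_at_most(df['axil_nodes'], 4)` gives the share of patients with four or fewer positive nodes, and `plt.step(x, y, where='post')` would draw the unbinned curve.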

In [24]:
sns.boxplot(x='survived_status', y = 'axil_nodes', data=df)
plt.show()

1. The box plot shows that the more axil nodes a patient has, the more likely the patient is to die.

2. Most of the patients who died had between 1 and 24 axil nodes.
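The quartiles that the box plot draws can also be printed as numbers, which makes ranges like the one above easier to state precisely. A minimal sketch, assuming the notebook's column names; `node_summary_by_class` and the `toy` frame are hypothetical, and the toy rows are made up.

```python
import pandas as pd

def node_summary_by_class(frame):
    """Return quartiles and maximum of axil_nodes for each survival class."""
    summary = frame.groupby('survived_status')['axil_nodes'].describe()
    return summary[['25%', '50%', '75%', 'max']]

# Tiny made-up frame, only to show the call; not real patient data.
toy = pd.DataFrame({
    'axil_nodes': [0, 0, 1, 3, 1, 4, 7, 12],
    'survived_status': ['survived'] * 4 + ['dead'] * 4,
})
node_table = node_summary_by_class(toy)
print(node_table)
```

On the real frame, `node_summary_by_class(df)` reports the same box edges and medians that `sns.boxplot` draws, so the two can be read side by side.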

In [27]:
sns.violinplot(x='survived_status', y='axil_nodes', data = df)
plt.show()

1. The violin plot shows that a large share of the patients who survived had 0 axil nodes.

2. Patients who died mostly had 1 or more axil nodes, and the more axil nodes a patient has, the more likely the patient is to die.

# Final Conclusion

1. From this dataset we can say that the majority of operations were performed on patients aged between 38 and 68, where most of the points fall on the scatter plot (Operation_year vs Age).

2. There is a dense concentration of data points where axil_nodes is 0.

3. A large number of operations were done in the 7-year span between 1960 and 1966 (Axil_nodes vs Operation_year).

4. Patients with 0 axil nodes are more likely to survive, irrespective of their age (Axil_nodes vs Age).

5. Patients with more than 30 axil nodes are very rare.

6. Patients who are older than 50 and have more than 10 axil nodes are more likely to die.

7. Most of the operations done in 1960-65 were unsuccessful, as most of the patients died within 5 years after the operation.

8. Patients who have 0 axil nodes are more likely to survive.

9. Patients aged 45-65 with 1 or more axil nodes are more likely to die.

10. Patients younger than 40 are more likely to survive even with 1 or more axil nodes.

11. The box plot shows that the more axil nodes a patient has, the more likely the patient is to die.

12. Most of the patients who died had between 1 and 24 axil nodes.

13. The violin plot shows that a large share of the patients who survived had 0 axil nodes.

14. Patients who died mostly had 1 or more axil nodes, and the more axil nodes a patient has, the more likely the patient is to die.




Feature Importance:
1. Axil_nodes is the most important feature in this dataset: patients with 1 or more axil nodes are more likely to die.

2. Age is also a somewhat important feature: patients younger than 40 are likely to survive despite having 1 or more axil nodes.