## About the Dataset

1. Title: Haberman's Survival Data

2. Sources:
   (a) Donor:   Tjen-Sien Lim (limt@stat.wisc.edu)
   (b) Date:    March 4, 1999

3. Past Usage:
   1. Haberman, S. J. (1976). Generalized Residuals for Log-Linear
      Models, Proceedings of the 9th International Biometrics
      Conference, Boston, pp. 104-122.
   2. Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984),
      Graphical Models for Assessing Logistic Regression Models (with
      discussion), Journal of the American Statistical Association 79:
      61-83.
   3. Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis,
      Department of Statistics, University of Wisconsin, Madison, WI.

4. Relevant Information:
   The dataset contains cases from a study that was conducted between
   1958 and 1970 at the University of Chicago's Billings Hospital on
   the survival of patients who had undergone surgery for breast
   cancer.

5. Number of Instances: 306

6. Number of Attributes: 4 (including the class attribute)

7. Attribute Information:
   1. Age of patient at time of operation (numerical)        = 'Age'
   2. Patient's year of operation (year - 1900, numerical)   = 'OperationYear'
   3. Number of positive axillary nodes detected (numerical) = 'Axil_Nodes'
   4. Survival status (class attribute)                      = 'Survival_Status'
         1 = the patient survived 5 years or longer
         2 = the patient died within 5 years

8. Missing Attribute Values: None

### Objective

Explore the data to see how the Survival_Status of a patient is related to the other features in the dataset.

#### Importing required libraries

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

#### Reading the dataset into pandas dataframe

Haberman's Survival dataset is a .csv (comma-separated values) file containing four columns and no header row.
Hence, the column names are added from the description given by the source.

In [None]:
haberman = pd.read_csv('../input/haberman.csv',names = ['Age','OperationYear','Axil_Nodes','Survival_Status'])
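
As a minimal sketch of how `names=` behaves on a header-less file, the snippet below reads a few illustrative rows from an in-memory string instead of `../input/haberman.csv`. The three rows and the `sample` variable are assumptions made for the example only; they merely mimic the file layout.

```python
import io
import pandas as pd

# Illustrative rows in the same layout as haberman.csv (no header line).
sample_csv = io.StringIO("30,64,1,1\n30,62,3,1\n34,59,0,2\n")

column_names = ['Age', 'OperationYear', 'Axil_Nodes', 'Survival_Status']

# When names= is passed, pandas treats the file as having no header,
# so the first line is kept as data rather than consumed as column labels.
sample = pd.read_csv(sample_csv, names=column_names)

print(sample.shape)
print(sample.columns.tolist())
```

Without `names=`, the first data row would be used as the header and one patient record would be lost.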

In [None]:
print(haberman.shape)

The dataset contains 4 columns and 306 rows, i.e., 3 features, 1 target and 306 data points. The dimensionality of the features is low, so the data can be explored visually with plots.

In [None]:
haberman['Survival_Status'].value_counts()

The dataset contains 225 points of the first class (survived 5 years or longer) and <br>
81 points of the second class (died within 5 years).<br>

The dataset is imbalanced towards the first class (about 74% vs. 26%), and with only 306 points it is not easy to build a good prediction model.
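
To put a number on the imbalance, here is a small sketch that starts from the two class counts reported above rather than from the CSV file. The names `class_counts` and `majority_baseline` are introduced only for this illustration.

```python
import pandas as pd

# Class counts as reported above
# (1 = survived 5 years or longer, 2 = died within 5 years).
class_counts = pd.Series({1: 225, 2: 81})

# Fraction of the dataset that falls in each class.
proportions = class_counts / class_counts.sum()

# A model that always predicts the larger class reaches this accuracy,
# so any real model should be compared against it.
majority_baseline = proportions.max()

print(proportions.round(3))
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Because of this baseline, plain accuracy can look good even for a model that has learned nothing about the minority class.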

In [None]:
haberman.info()

The dataset doesn't contain any null values. All the values are integers.

In [None]:
haberman.Survival_Status.unique()

The integer values 1 and 2 in Survival_Status are ambiguous. Hence, the column is converted to a boolean indicating whether the patient survived 5 years or longer.
The values are mapped as follows.
1 :-> True
2 :-> False

In [None]:
haberman['Survival_Status'] = haberman['Survival_Status'].map({1:True, 2:False})
#haberman['Survival_Status'] = haberman['Survival_Status'].astype('category')
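
One thing worth checking after a dictionary-based `map` is that no label was left unmapped, since `Series.map` returns NaN for any value missing from the dictionary. The sketch below uses a toy series (`raw_status`, hypothetical values) rather than the real column.

```python
import pandas as pd

# Toy labels using the dataset's 1/2 coding (not the real column).
raw_status = pd.Series([1, 2, 1, 1, 2])

survived = raw_status.map({1: True, 2: False})

# Any value outside {1, 2} would have become NaN, so verify there is none.
assert survived.notna().all(), "unexpected label outside {1, 2}"

print(survived.tolist())
```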

In [None]:
haberman.head(5)

In [None]:
haberman.info()

### Statistics


In [None]:
haberman.describe()

- The age of the patients varies from 30 to 83, with a mean of about 52.5 <br>
- All the operations were conducted in a span of 11 years, from 1958 to 1969 <br>
- The number of positive lymph nodes ranges from a minimum of 0 to a maximum of 52. 25% of the patients have no positive axillary lymph nodes, while 75% have fewer than 5. This implies that extreme cases are few in number (outlier points are present).
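
One common way to make "outlier" concrete is the box-plot rule: a value is flagged when it lies above Q3 + 1.5 × IQR. The sketch below applies that rule to a toy array. The helper `iqr_upper_fence` and the values in `toy_nodes` are assumptions for illustration, not part of the original analysis or the real data.

```python
import numpy as np

def iqr_upper_fence(values, k=1.5):
    """Return Q3 + k * IQR, the usual box-plot upper whisker limit."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + k * (q3 - q1)

# Toy node counts (hypothetical): mostly small values plus one extreme case.
toy_nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 52])

fence = iqr_upper_fence(toy_nodes)

# Everything above the fence is treated as an outlier under this rule.
outliers = toy_nodes[toy_nodes > fence]

print(fence, outliers)
```

The same function could be called on `haberman['Axil_Nodes']` to see how many patients fall beyond the fence.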

### Univariate and Bivariate Analysis

In [None]:
haberman.plot(kind='scatter', x='Age', y='Axil_Nodes') ;
plt.show()

With a plain 2-D scatter plot it is difficult to distinguish the data points. Adding a third parameter to colorize the data will improve readability.

In [None]:
sns.set_style("whitegrid")
# `height` replaces the `size` argument used by older seaborn releases
sns.FacetGrid(haberman, hue="Survival_Status", height=4) \
   .map(plt.scatter, "Age", "Axil_Nodes") \
   .add_legend()
plt.show()

The two Survival_Status classes appear mixed together in the Age vs. Axil_Nodes plane, without any significant separation.<br>
A pair plot will give insight into all pairwise relations.

#### Pair Plot

In [None]:
plt.close()
sns.set_style("whitegrid")
# `height` replaces the `size` argument used by older seaborn releases
sns.pairplot(haberman, hue="Survival_Status", height=3)
plt.show()

Although the data is imbalanced, we can observe some patterns.<br>
- Patients with a higher number of positive lymph nodes have a lower chance of surviving more than 5 years.
- Among the remaining plots, Axil_Nodes vs OperationYear separates the points best.
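
The relation between node count and survival can also be checked numerically by bucketing the node counts and computing the survival fraction per bucket. The sketch below uses toy rows and arbitrary bucket edges, both hypothetical. It shows the technique only and says nothing about the real survival rates.

```python
import pandas as pd

# Toy rows (hypothetical values, not the real data).
toy = pd.DataFrame({
    'Axil_Nodes':      [0, 0, 1, 2, 3, 6, 8, 12, 20, 30],
    'Survival_Status': [True, True, True, False, True,
                        True, False, False, False, False],
})

# Bucket edges are an arbitrary choice for illustration.
buckets = pd.cut(toy['Axil_Nodes'], bins=[-1, 0, 4, 10, 60],
                 labels=['0', '1-4', '5-10', '11+'])

# The mean of a boolean column is the fraction of True values.
survival_rate = toy.groupby(buckets, observed=True)['Survival_Status'].mean()

print(survival_rate)
```

Replacing `toy` with `haberman` would give the per-bucket survival fraction for the actual dataset.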

#### 1-D Plot

In [None]:
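# Split the rows by class: survivors (True) and non-survivors (False)
haberman_1 = haberman.loc[haberman["Survival_Status"] == True]
haberman_2 = haberman.loc[haberman["Survival_Status"] == False]

# Plot every age on a single horizontal line, one colour per class
plt.plot(haberman_1["Age"], np.zeros_like(haberman_1["Age"]), 'o', label='Survived')
plt.plot(haberman_2["Age"], np.zeros_like(haberman_2["Age"]), 'o', label='Did not survive')
plt.legend()
plt.show()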


The points overlap heavily, making it difficult to draw any conclusion.

#### Distribution Plots

In [None]:
# Note: sns.distplot is deprecated in recent seaborn; `height` replaces `size`
sns.FacetGrid(haberman, hue="Survival_Status", height=5) \
   .map(sns.distplot, "Age") \
   .add_legend()
plt.show()

In [None]:
sns.FacetGrid(haberman, hue="Survival_Status", height=5) \
   .map(sns.distplot, "OperationYear") \
   .add_legend()
plt.show()

In [None]:
sns.FacetGrid(haberman, hue="Survival_Status", height=5) \
   .map(sns.distplot, "Axil_Nodes") \
   .add_legend()
plt.show()

From the distribution plots we can infer that: <br>
- The age distribution of the patients is almost Gaussian, and the number of Axil_Nodes follows an approximate power-law distribution.
- Patients younger than 40 years have a slightly higher chance of surviving more than 5 years.
- Patients who survive for more than 5 years mostly have close to zero positive lymph nodes.

#### CDF (Cumulative Distribution Function)

In [None]:
counts, bin_edges = np.histogram(haberman_1['Age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges);
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)


counts, bin_edges = np.histogram(haberman_1['Age'], bins=20, 
                                 density = True)
pdf = counts/(sum(counts))
plt.plot(bin_edges[1:],pdf);

plt.show();

In [None]:
counts, bin_edges = np.histogram(haberman_1['Age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)

#compute CDF
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)



plt.show()

In [None]:
counts, bin_edges = np.histogram(haberman_1['Age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)


# patients who did not survive 5 years (haberman_2)
counts, bin_edges = np.histogram(haberman_2['Age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)



plt.show();
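
The cells above repeat the same histogram-to-CDF steps several times. A minimal sketch of those steps as one reusable function is given below. The function name `histogram_pdf_cdf` and the toy ages are assumptions for illustration. Raw counts are used instead of `density=True` because, with equal-width bins, dividing either by its sum gives the same per-bin probabilities.

```python
import numpy as np

def histogram_pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probabilities and their cumulative sum."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of points falling in each bin
    cdf = np.cumsum(pdf)          # fraction of points up to each right edge
    return bin_edges, pdf, cdf

# Toy ages (hypothetical values, not the real column).
toy_ages = np.array([30, 34, 38, 41, 45, 47, 52, 55, 60, 72])

bin_edges, pdf, cdf = histogram_pdf_cdf(toy_ages, bins=5)

print(bin_edges)
print(pdf)
print(cdf)
```

With the notebook's data, `plt.plot(bin_edges[1:], cdf)` on the result for `haberman_1['Age']` and `haberman_2['Age']` would reproduce the curves plotted above.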

In [None]:
print("Means:")
print(np.mean(haberman_1['Age']))
print(np.mean(haberman_2['Age']))

print("\nStd-dev:");
print(np.std(haberman_1['Age']))
print(np.std(haberman_2['Age']))


In [None]:
print("\nMedians:")
print(np.median(haberman_1['Age']))
print(np.median(haberman_2['Age']))


print("\nQuantiles:")
print(np.percentile(haberman_1['Age'],np.arange(0, 100, 25)))
print(np.percentile(haberman_2['Age'],np.arange(0, 100, 25)))

print("\n90th Percentiles:")
print(np.percentile(haberman_1['Age'],90))
print(np.percentile(haberman_2['Age'],90))

from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(haberman_1['Age']))
print(robust.mad(haberman_2['Age']))
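
For readers unfamiliar with the median absolute deviation, the sketch below computes it by hand with NumPy. It assumes the normalisation that `statsmodels.robust.mad` applies by default, to my understanding: division by about 0.6745, which makes the result comparable to a standard deviation for Gaussian data. The function name `mad_normalized` and the toy values are hypothetical.

```python
import numpy as np

def mad_normalized(values, c=0.6744897501960817):
    """Median of absolute deviations from the median, divided by c.

    c is the 75th percentile of the standard normal distribution.
    """
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    return np.median(np.abs(values - center)) / c

# Toy values with one extreme point (hypothetical).
toy = [1, 2, 3, 4, 100]

mad = mad_normalized(toy)
std = np.std(toy)

# The extreme value inflates the standard deviation far more than the MAD.
print(mad, std)
```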


#### Box Plots

In [None]:
sns.boxplot(x='Survival_Status', y='Age', data=haberman)
plt.show()

In [None]:
sns.boxplot(x='Survival_Status',y='OperationYear', data=haberman)
plt.show()

In [None]:
sns.boxplot(x='Survival_Status',y='Axil_Nodes', data=haberman)
plt.show()

#### Violin Plots

In [None]:
sns.violinplot(x="Survival_Status", y="Age", data=haberman)
plt.show()

In [None]:
sns.violinplot(x="Survival_Status", y="OperationYear", data=haberman)
plt.show()

In [None]:
sns.violinplot(x="Survival_Status", y="Axil_Nodes", data=haberman)
plt.show()

From the violin plots we can see that only Axil_Nodes shows a significant difference in distribution between the two survival classes.<br>
The remaining plots don't give us any more information about their relations.

In [None]:
sns.jointplot(x="Age", y="Axil_Nodes", data=haberman, kind="kde");
plt.show();

In [None]:
sns.jointplot(x="Age", y="OperationYear", data=haberman, kind="kde");
plt.show();

In [None]:
sns.jointplot(x="OperationYear", y="Axil_Nodes", data=haberman, kind="kde");
plt.show();

In [None]:
corr = haberman.corr(method = 'spearman')
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

From the above correlation matrix we can see that Survival_Status is most strongly correlated with Axil_Nodes. The correlation is negative, which means the two move in opposite directions (more positive lymph nodes = lower chance of survival).
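
Spearman correlation is the Pearson correlation of the ranked values, which is why it suits a skewed feature such as the node count. The sketch below checks that equivalence on a toy frame. The values and the column name `Survived` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): node counts and a 0/1 survival flag.
toy = pd.DataFrame({
    'Axil_Nodes': [0, 1, 1, 3, 8, 20],
    'Survived':   [1, 1, 1, 0, 1, 0],
})

# Built-in Spearman correlation, as used for the heatmap.
rho_builtin = toy.corr(method='spearman').loc['Axil_Nodes', 'Survived']

# Same quantity by hand: rank each column (ties get the average rank),
# then take the ordinary Pearson correlation of the ranks.
rho_manual = toy['Axil_Nodes'].rank().corr(toy['Survived'].rank())

print(rho_builtin, rho_manual)
```

A negative value here means that higher node ranks tend to go with the lower survival flag.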

### Final Observations
- The dataset contains very few data points and is imbalanced.
- Most of the patients have fewer than 5 positive lymph nodes.
- The feature Axil_Nodes is the most correlated with Survival_Status. Hence, it should be given more importance than the other features.
- Patients younger than 40 years have a slightly higher chance of surviving more than 5 years.
- The age distribution of the patients is almost Gaussian, and the number of Axil_Nodes follows an approximate power-law distribution.
- An accurate model built on the given data is likely to overfit because of the small data size.
- A linear model cannot separate the data without significant error.
- Gaussian Naive Bayes is unlikely to work well because not all features are Gaussian (Axil_Nodes in particular).
- To model this data, non-linear models have to be used.