# Haberman Cancer Survival Dataset Analysis

## About the dataset:
Title: Haberman's Survival Data

Sources: (a) Donor: Tjen-Sien Lim (limt@stat.wisc.edu) (b) Date: March 4, 1999

Past Usage:

1. Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models, Proceedings of the 9th International Biometrics Conference, Boston, pp. 104-122.
2. Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984), Graphical Models for Assessing Logistic Regression Models (with discussion), Journal of the American Statistical Association 79: 61-83.
3. Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.

Relevant Information: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Number of Instances: 306

Number of Attributes: 4 (including the class attribute)

Attribute Information:

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years

Missing Attribute Values: None

## About the terminology used in the analysis below:

### About the axil_nodes feature:

#### 1. Here axil_nodes means positive axillary lymph nodes (nodes which are affected by cancer).
#### 2. Throughout the analysis below, 'axil_nodes' always refers to the number of 'positive axillary lymph nodes'.
#### More at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4337126/

### About the class variable:

1. In the terminology below, **survivors means persons who survived 5 years or longer.**
2. In the terminology below, **victims means persons who died within 5 years.**

## Objective: Classify a person's cancer survival status based on the three given features

#### 1. Importing required libraries for data analysis

In [None]:
# Importing pandas library
import pandas as pd;
# Importing numpy library
import numpy as np;
# Importing pyplot from matplotlib
import matplotlib.pyplot as plt;
# Importing seaborn library
import seaborn as sbrn;
# Importing warnings
import warnings;

# To ignore warnings for better representation
warnings.filterwarnings('ignore');
# Reading haberman csv file using pandas
haberman_data = pd.read_csv("../input/haberman.csv");
# Printing head of dataframe just for verification
haberman_data.head()

In [None]:
# Printing tail of dataframe just for verification
haberman_data.tail()

### Changes to make to the dataset for better analysis:
1. The column names are not meaningful, so they need to be replaced with descriptive names.
2. The class variable surv_status is coded numerically (1 and 2), so its values need to be converted to readable strings.

In [None]:
# Assigning meaningful custom column names to dataframe
haberman_data.columns = ['age', 'op_year', 'axil_nodes', 'surv_status'];

In [None]:
# Replacing class variable data with strings for better data analysis by using nested dictionaries
haberman_data = haberman_data.replace({'surv_status': {1: 'more', 2: 'less'}});
# Printing head of dataframe just for verification
haberman_data.head()

In [None]:
# Printing tail of dataframe just for verification
haberman_data.tail()

#### 2. Number of points in dataset

In [None]:
# The shape attribute of a data frame is a (rows, columns) tuple, so index 0 gives the number of rows
haberman_data.shape[0]

#### 3. Number of features and their names

In [None]:
# The shape attribute of a data frame is a (rows, columns) tuple, so index 1 gives the number of columns
print(haberman_data.shape[1]);
# Printing names of features of dataset
print(haberman_data.columns);

#### 4. Number of classes and their names:

In [None]:
# Grouping dataset by class variable for knowing what classes are there
haberman_data_group = haberman_data.groupby('surv_status');
# Assigns names of classes to classes variable (as a plain list so it prints cleanly)
classes = list(haberman_data_group.groups.keys());
# Printing number of classes
print("Number of classes: ", len(classes));
# Printing names of classes
print("Classes: ", classes);

### About groups:
1. **more** represents - **patient survived 5 years or longer**
2. **less** represents - **patient died within 5 years**

#### 5. Number of points within each class

In [None]:
# Printing number of points per each class using value_counts
haberman_data['surv_status'].value_counts()

In [None]:
haberman_data.describe()

### Observations:
1. The **dataset is unbalanced**, as the number of points in the two classes differs considerably.
2. The operated persons are between **30 and 83** years old. The dataset contains **no children or young people**, which is consistent with them having a **lower chance** of being affected by this cancer, although this dataset alone cannot prove it.
3. The **operations** took place between **1958 and 1969**.
4. Most of the operated persons have **few positive axil_nodes**.
5. There are no missing (None) values, so the dataset is **complete and needs no imputation**.
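The imbalance noted in the first observation can be quantified as class shares instead of raw counts. The sketch below uses a small made-up frame (`toy`, not the real Haberman data) and also prints the accuracy that a trivial "always predict the majority class" model would reach, which is a useful baseline for any later classifier.

```python
import pandas as pd

# Hypothetical toy labels (NOT the real Haberman data), skewed on purpose
toy = pd.DataFrame({"surv_status": ["more"] * 7 + ["less"] * 3})

# Absolute number of points per class
counts = toy["surv_status"].value_counts()
# Relative frequency per class (fractions that sum to 1)
shares = toy["surv_status"].value_counts(normalize=True)
# Accuracy of a trivial model that always predicts the majority class
majority_baseline = shares.max()

print(counts)
print(shares)
print("Majority-class baseline accuracy:", majority_baseline)
```

On the notebook's own data the same two `value_counts` calls could be applied to `haberman_data['surv_status']`.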

### Creating separate dataframes for each class for better data analysis

In [None]:
#filtering rows which have column_value 'surv_status' as 'more'
survivors = haberman_data[haberman_data['surv_status'] == 'more'];
#filtering rows which have column_value 'surv_status' as 'less'
victims = haberman_data[haberman_data['surv_status'] == 'less'];

## Univariate Analysis of all features

In [None]:
# describe() summarizes basic statistics (count, mean, std, min, quartiles, max) of each feature
survivors.describe()

In [None]:
# describe() summarizes basic statistics (count, mean, std, min, quartiles, max) of each feature
victims.describe()

In [None]:
# Drawing distribution plot based on feature 'age' classified by surv_status
# (`height` replaces the older `size` argument, which was renamed in seaborn 0.9)
sbrn.FacetGrid(haberman_data, hue="surv_status", height = 6) \
    .map(sbrn.distplot, "age") \
    .add_legend();
plt.show();

In [None]:
# Drawing distribution plot based on feature 'op_year' classified by surv_status
# (`height` replaces the older `size` argument, which was renamed in seaborn 0.9)
sbrn.FacetGrid(haberman_data, hue="surv_status", height = 6) \
    .map(sbrn.distplot, "op_year") \
    .add_legend();
plt.show();

In [None]:
# Drawing distribution plot based on feature 'axil_nodes' classified by surv_status
# (`height` replaces the older `size` argument, which was renamed in seaborn 0.9)
sbrn.FacetGrid(haberman_data, hue="surv_status", height = 6) \
    .map(sbrn.distplot, "axil_nodes") \
    .add_legend();
plt.show();

### Observations:
1. From the descriptions and distribution plots of the features, **axil_nodes** separates the two classes **better** than age or op_year.
2. **50% of survivors** have **axil_nodes = 0** and **75% of survivors** have **axil_nodes less than 3**.
3. **50% of victims** have **axil_nodes less than 4**.
4. By this analysis, only about **25% of persons** can be classified correctly.
5. So we can build an if-else model on axil_nodes to classify persons, but it will have a **high error rate**.
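The quartile statements above are read off the `describe()` tables. As a minimal sketch, the same per-class quartiles can be put side by side in one table. The frame `toy` below is made up for illustration and does not contain the real Haberman values.

```python
import pandas as pd

# Hypothetical toy data (NOT the real Haberman values)
toy = pd.DataFrame({
    "axil_nodes": [0, 0, 1, 2, 8, 1, 4, 9, 13, 23],
    "surv_status": ["more"] * 5 + ["less"] * 5,
})

# 25th, 50th and 75th percentiles of axil_nodes for each class;
# unstack() turns the quantile level into columns (0.25, 0.5, 0.75)
quartiles = (
    toy.groupby("surv_status")["axil_nodes"]
       .quantile([0.25, 0.50, 0.75])
       .unstack()
)
print(quartiles)
```

Comparing the rows of such a table shows directly how far apart the two classes sit on a feature, which is the basis for picking a threshold.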

### PDFs and CDFs of features

In [None]:
for classtype, classtype_df in haberman_data_group:
    for feature in classtype_df.columns:
        if(feature != "surv_status"):
            counts, bin_edges = np.histogram(classtype_df[feature], bins = 20);
            if(classtype == "more"):
                print("="*30, "Counts and bin_edges of ", feature, " of survivors", "="*30);
            else:
                print("="*30, "Counts and bin_edges of ", feature, " of victims", "="*30);
            print("Counts: ", counts);
            print("Sum of Counts: ", sum(counts));
            print("Bin Edges: ", bin_edges);

#### Note: these plots could be drawn in a for loop, but for better understanding I am plotting the graphs individually
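For reference, the repeated histogram steps could be factored into one helper, as in the sketch below. The function name `histogram_pdf_cdf` and the array `toy_nodes` are assumptions made for this illustration, not part of the notebook. Note that the "PDF" here is really the share of points falling in each bin.

```python
import numpy as np

def histogram_pdf_cdf(values, bins=20):
    """Return (right_bin_edges, pdf, cdf) computed from a histogram of `values`."""
    counts, bin_edges = np.histogram(values, bins=bins)
    # Share of points in each bin (sums to 1)
    pdf = counts / counts.sum()
    # Running total of the shares (last value is 1)
    cdf = np.cumsum(pdf)
    return bin_edges[1:], pdf, cdf

# Hypothetical toy values (NOT the real Haberman data)
toy_nodes = np.array([0, 0, 0, 1, 1, 2, 3, 5, 8, 13])
edges, pdf, cdf = histogram_pdf_cdf(toy_nodes, bins=5)
print("Right bin edges:", edges)
print("PDF:", pdf)
print("CDF:", cdf)
```

With the notebook's data, one could loop over the three features and over `survivors` and `victims`, calling this helper and passing the returned arrays to `plt.plot`.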

In [None]:
# Calculating frequencies of survived persons in some specific intervals based on axil_nodes using histogram function
countsByAxialNodes, bin_edgesOfAxialNodes = np.histogram(survivors['axil_nodes'], bins = 20, density = True);
# Calculating probability distribution function values of axil_nodes of survived persons
pdf = countsByAxialNodes /sum(countsByAxialNodes);
# Calculating cumulative distribution function values of axil_nodes of survived persons
cdf = np.cumsum(pdf);
plt.plot(bin_edgesOfAxialNodes[1:], pdf, label="PDF of axil_nodes of survivors");
plt.plot(bin_edgesOfAxialNodes[1:], cdf, label="CDF of axil_nodes of survivors");
plt.xlabel('Axil Nodes');
# Calculating frequencies of victims in some specific intervals based on axil_nodes using histogram function
countsByAxialNodes, bin_edgesOfAxialNodes = np.histogram(victims['axil_nodes'], bins = 20, density = True);
# Calculating probability distribution function values of axil_nodes of victims
pdf = countsByAxialNodes /sum(countsByAxialNodes);
# Calculating cumulative distribution function values of axil_nodes of victims
cdf = np.cumsum(pdf);
plt.plot(bin_edgesOfAxialNodes[1:], pdf, label="PDF of axil_nodes of victims");
plt.plot(bin_edgesOfAxialNodes[1:], cdf, label="CDF of axil_nodes of victims");
plt.legend();
plt.show();

In [None]:
# Calculating frequencies of survived persons in some specific intervals based on operation year using histogram function
countsByOpearationYear, bin_edgesOfOperationYear = np.histogram(survivors['op_year'], bins = 20);
# Calculating probability distribution function values of operation year of survived persons
pdf = countsByOpearationYear /sum(countsByOpearationYear);
# Calculating cumulative distribution function values of operation year of survived persons
cdf = np.cumsum(pdf);
plt.plot(bin_edgesOfOperationYear[1:], pdf, label="PDF of Operation years of Survivors");
plt.plot(bin_edgesOfOperationYear[1:], cdf, label="CDF of Operation years of Survivors");
plt.xlabel('Operation year');
# Calculating frequencies of victims in some specific intervals based on operation year using histogram function
countsByOpearationYear, bin_edgesOfOperationYear = np.histogram(victims['op_year'], bins = 20);
# Calculating probability distribution function values of operation year of victims
pdf = countsByOpearationYear /sum(countsByOpearationYear);
# Calculating cumulative distribution function values of operation year of victims
cdf = np.cumsum(pdf);
plt.plot(bin_edgesOfOperationYear[1:], pdf, label="PDF of Operation years of Victims");
plt.plot(bin_edgesOfOperationYear[1:], cdf, label="CDF of Operation years of Victims");
plt.legend();
plt.show();

In [None]:
# Calculating frequencies of survived persons in some specific intervals based on age using histogram function
countsByAge, bin_edgesOfAge = np.histogram(survivors['age'], bins = 20);
# Calculating probability distribution function values of age of survived persons
pdf = countsByAge /sum(countsByAge);
# Calculating cumulative distribution function values of age of survived persons
cdf = np.cumsum(pdf);
plt.plot(bin_edgesOfAge[1:], pdf, label="PDF of ages of survivors");
plt.plot(bin_edgesOfAge[1:], cdf, label="CDF of ages of survivors");
plt.xlabel('Age');
# Calculating frequencies of victims in some specific intervals based on age using histogram function
countsByAge, bin_edgesOfAge = np.histogram(victims['age'], bins = 20);
# Calculating probability distribution function values of age of victims
pdf = countsByAge /sum(countsByAge);
# Calculating cumulative distribution function values of age of victims
cdf = np.cumsum(pdf);
plt.plot(bin_edgesOfAge[1:], pdf, label="PDF of ages of victims");
plt.plot(bin_edgesOfAge[1:], cdf, label="CDF of ages of victims");
plt.legend();
plt.show();

### Box-Plots

In [None]:
# Drawing box-plot of axil_nodes of dataset classified by surv_status
# Box-plot is very useful in analysing intervals of features and knowing quantile values
sbrn.boxplot(data = haberman_data, x="surv_status", y="axil_nodes");
plt.show();

In [None]:
# Drawing box-plot of operation year of dataset classified by surv_status
# Box-plot is very useful in analysing intervals of features and knowing quantile values
sbrn.boxplot(data = haberman_data, x="surv_status", y="op_year");
plt.show();

In [None]:
# Drawing box-plot of age of dataset classified by surv_status
# Box-plot is very useful in analysing intervals of features and knowing quantile values
sbrn.boxplot(data = haberman_data, x="surv_status", y="age");
plt.show();

### Violin Plots

In [None]:
# Drawing violin plot of axil_nodes of dataset classified by surv_status
sbrn.violinplot(data = haberman_data, x="surv_status", y="axil_nodes");
plt.show();

In [None]:
# Drawing violin plot of operation year of dataset classified by surv_status
sbrn.violinplot(data = haberman_data, x="surv_status", y="op_year");
plt.show();

In [None]:
# Drawing violin plot of age of dataset classified by surv_status
sbrn.violinplot(data = haberman_data, x="surv_status", y="age");
plt.show();

In [None]:
# Importing libraries for calculating median absolute deviation
from statsmodels import robust
print("="*30, " Median and MAD of axil_nodes of survivors: ", "="*30);
# Calculating median of axil_nodes of survived persons
median = np.median(survivors['axil_nodes']);
# Calculating mad of axil_nodes of survived persons (must use survivors here, not victims)
mad = robust.mad(survivors['axil_nodes']);
print("Median: ", median);
print("MAD: ", mad);
# Calculating intervals of axil_nodes of survived persons
print("Interval of axil_nodes of survivors: ({}, {})".format(median - mad, median + mad));
print("="*30, " Median and MAD of axil_nodes of victims: ", "="*30);
# Calculating median of axil_nodes of victims
median = np.median(victims['axil_nodes']);
# Calculating mad of axil_nodes of victims
mad = robust.mad(victims['axil_nodes']);
print("Median: ", median);
print("MAD: ", mad);
# Calculating intervals of axil_nodes of victims
print("Interval of axil_nodes of victims: ({}, {})".format(median - mad, median + mad));
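To make the median/MAD interval less of a black box, here is a minimal NumPy-only sketch of the median absolute deviation. The helper `raw_mad` and the list `toy_nodes` are hypothetical. As far as I know, `robust.mad` from statsmodels additionally divides the raw value by a constant of about 0.6745 by default, so that the result is comparable to a standard deviation for normally distributed data; the sketch shows both versions.

```python
import numpy as np

def raw_mad(values):
    """Median of the absolute deviations from the median (no scaling)."""
    values = np.asarray(values, dtype=float)
    return np.median(np.abs(values - np.median(values)))

# Hypothetical toy values (NOT the real Haberman data), with one extreme point
toy_nodes = [0, 0, 1, 2, 3, 4, 52]

median = np.median(toy_nodes)
mad = raw_mad(toy_nodes)
# Assumed default scaling of statsmodels' robust.mad (normal-consistency constant)
scaled_mad = mad / 0.6744897501960817

print("Median:", median)
print("Raw MAD:", mad)
print("Scaled MAD:", scaled_mad)
print("Interval: ({}, {})".format(median - scaled_mad, median + scaled_mad))
```

The extreme value 52 barely moves the median or the MAD, which is why these statistics are preferred over mean and standard deviation when outliers are suspected.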

### Percentiles at multiples of 10 (0th to 90th)

In [None]:
# Iterating over grouped dataset and calculating 10 multiple percentiles of each feature of each class
for classtype, classtype_df in haberman_data_group:
    # Iterating over each feature in each group
    for feature in classtype_df.columns:
        # We should not consider 'surv_status' as it is class variable
        if(feature != "surv_status"):
            # Calculating percentiles by giving list of 10 multiples less than 100 using percentile method
            percentilesBy10 = np.percentile(classtype_df[feature], np.arange(0, 100, 10));
            # Filtering class for good representation
            if(classtype == "more"):
                print("="*30, "10 Percentiles of ", feature, " of survivors", "="*30);
            else:
                print("="*30, "10 Percentiles of ", feature, " of victims", "="*30);
            print("Percentiles: ", percentilesBy10);

### Observations:
1. From the univariate analysis above, persons who were **operated after 1966** and have **axil_nodes less than 3** have **higher survival chances.**
2. **Age** is **not a useful** feature for classification, which suggests that **survival chances do not depend strongly on age** in this dataset.
3. **Half of the victims** can be classified correctly by a simple if-else model on axil_nodes.
4. **Most** of the **survivors** have **fewer than 4 positive axil_nodes**, and from the intervals of axil_nodes of survivors we can assume that there may be some **outliers present** (since **axil_nodes** has a **max value of 52**).
5. So with the if-else model **if axil_nodes < 4 then survivor else victim** we classify **50% of victims correctly** and misclassify the remaining **50% of victims** as **survivors**. So the **error rate among victims is 50%**.
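The if-else rule above can be written out and scored explicitly. The sketch below does this on a made-up frame (`toy`, not the real Haberman data); the function name `predict_by_nodes` is hypothetical. It reports the error rate within each class as well as the overall error rate, since the two can differ when classes are unbalanced.

```python
import pandas as pd

# Hypothetical toy data (NOT the real Haberman values)
toy = pd.DataFrame({
    "axil_nodes": [0, 0, 1, 2, 8, 1, 3, 9, 13, 23],
    "surv_status": ["more"] * 5 + ["less"] * 5,
})

def predict_by_nodes(axil_nodes, threshold=4):
    """Simple rule: fewer than `threshold` positive nodes -> 'more', else 'less'."""
    return "more" if axil_nodes < threshold else "less"

# Apply the rule to every row and flag the misclassified ones
toy["predicted"] = toy["axil_nodes"].apply(predict_by_nodes)
toy["wrong"] = toy["predicted"] != toy["surv_status"]

# Share of misclassified points within each true class
error_by_class = toy.groupby("surv_status")["wrong"].mean()
# Share of misclassified points over the whole dataset
overall_error = toy["wrong"].mean()

print(error_by_class)
print("Overall error rate:", overall_error)
```

Applied to `haberman_data`, the same steps would give the per-class and overall error rates of the rule, and the threshold could be varied to compare alternatives such as 3 and 4.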

## Multivariate analysis to check which features will be useful in classification

In [None]:
# Multivariable analysis using pairplot from seaborn
# hue represents by which the dataset must be classified
sbrn.pairplot(haberman_data, hue="surv_status");
plt.show();

### Observations:
1. The pair plots above show that the two classes overlap heavily, so this **dataset cannot be separated cleanly** by any pair of features.
2. For classification, the **combination of op_year and axil_nodes** is **somewhat better** than the others, but even the separation given by this combination is **not good**.

# Final Conclusions (All):

1. The **operated persons** are between **30 and 83 years old**. The dataset contains **no children or young people**, which is consistent with them having a **lower chance of being affected by this cancer**.
2. The **operations** took place between **1958 and 1969**.
3. Most of the **operated persons** have **few positive axil_nodes**.
4. The multivariate analysis shows that this **dataset cannot be separated cleanly**.
5. For classification, the **combination of op_year and axil_nodes** is **somewhat better** than the others.
6. In the univariate analysis, **axil_nodes is the best feature for classification**.
7. With the if-else model **if axil_nodes < 4 then survivor else victim**, 50% of victims are classified correctly and the remaining 50% of victims are classified as survivors, so the error rate among victims is 50%.
8. Persons who were **operated after 1966** and have **axil_nodes less than 3** have **higher survival chances**.
9. **Age is not a useful** feature for classification, which suggests that survival chances **do not depend strongly on age** in this dataset.
10. Almost **80% of survivors** have **fewer than 4 positive axil_nodes**.
11. From the above analysis we can build this if-else model: **if (axil_nodes < 3) then survival chance is higher else survival chance is lower**, but its **error rate is high.**
12. With the above if-else model we classify **25% of survivors as victims** and **50% of victims as survivors**, which is **not a good classification**.
13. From the above models and analysis we can say that, of all persons, **25% can be correctly classified**.