# Plotting for Exploratory Data Analysis (EDA) for Cancer Patients

# Haberman's Dataset

Sources: (a) Donor: Tjen-Sien Lim (b) Date: March 1999

Past Usage:

* Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models, Proceedings of the 9th International Biometrics Conference, Boston, pp. 104-122.
* Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984). Graphical Models for Assessing Logistic Regression Models (with discussion), Journal of the American Statistical Association 79: 61-83.
* Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.

Relevant Information: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

* Number of Instances: 306
* Number of Attributes: 4 (including the class attribute)
* Attribute Information:
    * Age of patient at time of operation (numerical)
    * Patient's year of operation (year - 1900, numerical)
    * Number of positive axillary nodes detected (numerical)
    * Survival status (class attribute)
        * 1 = the patient survived 5 years or longer
        * 2 = the patient died within 5 years
* Missing Attribute Values: None

# Objective

Given the 3 features, classify a new patient into one of the 2 classes: survived 5 years or longer, or died within 5 years.

In [11]:
#importing all libraries
import pandas as pd
import seaborn as se
import numpy as np
import matplotlib.pyplot as plt


In [25]:
#reading the dataset
hb = pd.read_csv("../input/haberman/haberman.csv")
#hb

In [26]:
hb.shape
#it shows we have 306 rows and 4 columns

In [27]:
hb.columns

In [28]:
#no trailing semicolon, so the notebook displays the counts
hb['survival_status'].value_counts()

# Observations

This shows

   * 225 patients survived 5 years or longer
   * 81 patients died within 5 years
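
The two counts above can also be read as proportions, which makes the class imbalance easier to see. The following is a minimal sketch that rebuilds the class column from the reported counts instead of reading the CSV; the variable names `status` and `proportions` are introduced here for illustration only.

```python
import pandas as pd

# Rebuild the class column from the counts reported above:
# 225 patients with status 1 and 81 patients with status 2.
status = pd.Series([1] * 225 + [2] * 81, name="survival_status")

# normalize=True returns the share of each class instead of the raw count.
proportions = status.value_counts(normalize=True)
print(proportions.round(3))
```

On the notebook's own data frame the equivalent call would be `hb['survival_status'].value_counts(normalize=True)`. Because the classes are not balanced, any classifier built later should be compared against simply predicting the larger class.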

# Univariate Analysis

# Histogram

In [29]:
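#seaborn >= 0.9 uses height (formerly size); histplot with kde=True replaces the deprecated distplot
se.FacetGrid(hb,hue="survival_status",height=5)\
    .map(se.histplot,"year",kde=True,stat="density")\
    .add_legend()
plt.show()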

Observation: The distributions of the two classes overlap almost completely, so year of operation alone does not separate them.

In [None]:
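#seaborn >= 0.9 uses height (formerly size); histplot with kde=True replaces the deprecated distplot
se.FacetGrid(hb,hue="survival_status",height=5)\
    .map(se.histplot,"Age",kde=True,stat="density")\
    .add_legend()
plt.show()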

Observation:
   * Patients aged between 30 and 35 survived more than 5 years after the operation.
   * Patients aged between 78 and 83 did not survive more than 5 years after the operation.
   * For patients aged 35 to 78 nothing can be concluded, because the two distributions overlap almost completely.

In [None]:
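#seaborn >= 0.9 uses height (formerly size); histplot with kde=True replaces the deprecated distplot
se.FacetGrid(hb,hue="survival_status",height=5)\
    .map(se.histplot,"positive_axillary_nodes",kde=True,stat="density")\
    .add_legend()
plt.show()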

Observation: The distributions overlap, so the classes cannot be separated cleanly, but one trend is visible: as the number of positive axillary nodes increases, patients become less likely to survive 5 years.

# Box and Whisker Plots

In [None]:
se.boxplot(x = 'survival_status',y = 'year',data = hb)
plt.show()

In [None]:
se.boxplot(x = 'survival_status',y = 'Age',data = hb)
plt.show()

In [None]:
se.boxplot(x = 'survival_status',y = 'positive_axillary_nodes',data = hb)
plt.show()

# Observations

* From the box plot we can observe that most patients who survived 5 years or longer have zero positive axillary nodes

# Violin plots

In [None]:
se.violinplot(x="survival_status", y="year", data=hb)
plt.show()

In [None]:
se.violinplot(x="survival_status", y="Age", data=hb)
plt.show()

In [None]:
se.violinplot(x="survival_status", y="positive_axillary_nodes", data=hb)
plt.show()

# Observations

* From the violin plots we can observe that most patients who survived 5 years or longer have zero positive axillary nodes
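
Both the box plots and the violin plots above are drawn from a handful of order statistics: the median and the quartiles. The sketch below shows how those numbers could be computed directly. The helper `five_number_summary` and the sample values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def five_number_summary(values):
    """Return the extremes, the quartiles, and the median of a 1-D sample."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "min": values.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values.max(),
    }

# Made-up node counts, for illustration only (not taken from the dataset).
example_nodes = [0, 0, 0, 0, 0, 1, 3, 8]
summary = five_number_summary(example_nodes)
print(summary)
```

A median of 0, as in this toy sample, means that at least half of the values are zero, which is the kind of statement the observation above reads off the plot. On the real data, `hb.groupby('survival_status')['positive_axillary_nodes'].describe()` should give the same statistics per class.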

# PDF and CDF

In [None]:
#pdf cdf of year

counts,bin_edges = np.histogram(hb['year'],bins = 30, density = True)
pdf = counts/(sum(counts))  #normalise so the bin probabilities sum to 1
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label = 'PDF')
plt.plot(bin_edges[1:],cdf,label = 'CDF')
plt.legend()  #the labels above give the legend something to show

plt.xlabel('Year')
plt.grid()

plt.show()

In [None]:
#pdf cdf of positive_axillary_nodes

counts,bin_edges = np.histogram(hb['positive_axillary_nodes'],bins = 30, density = True)
pdf = counts/(sum(counts))  #normalise so the bin probabilities sum to 1
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label = 'PDF')
plt.plot(bin_edges[1:],cdf,label = 'CDF')
plt.legend()  #the labels above give the legend something to show

plt.xlabel('positive_axillary_nodes')
plt.grid()

plt.show()

In [None]:
#pdf cdf of Age

counts,bin_edges = np.histogram(hb['Age'],bins = 30, density = True)
pdf = counts/(sum(counts))  #normalise so the bin probabilities sum to 1
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label = 'PDF')
plt.plot(bin_edges[1:],cdf,label = 'CDF')
plt.legend()  #the labels above give the legend something to show

plt.xlabel('Age')
plt.grid()

plt.show()
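
The three cells above repeat the same histogram, PDF, and CDF steps for each feature. One way to avoid the repetition, and to compare the two classes, is a small helper. This is only a sketch: `pdf_cdf` is a hypothetical name, and the ages are made-up values used for illustration, not rows from the dataset.

```python
import numpy as np

def pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probabilities (PDF), and their running sum (CDF)."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()  # share of observations falling in each bin
    cdf = np.cumsum(pdf)         # share of observations up to each bin's right edge
    return bin_edges, pdf, cdf

# Made-up ages for two groups, for illustration only (not from the dataset).
groups = {1: [30, 34, 41, 45, 52, 58, 63], 2: [44, 47, 53, 54, 61, 66, 78]}
for label, ages in groups.items():
    bin_edges, pdf, cdf = pdf_cdf(ages, bins=5)
    print(label, np.round(cdf, 2))
```

With equal-width bins, dividing the raw counts by their sum gives the same values as the `density = True` followed by renormalisation used in the cells above. Each returned triple can be plotted as before with `plt.plot(bin_edges[1:], cdf, label=...)`, one curve per class.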

# Bivariate analysis

# 2-D Scatter Plot

In [None]:
se.set_style("darkgrid")
#seaborn >= 0.9 uses height (formerly size)
se.FacetGrid(hb,hue='survival_status',height=6)\
    .map(plt.scatter,"year","Age")\
    .add_legend()
plt.show()

Observation: The points of the two classes overlap heavily, so year and age together do not separate them.

In [None]:
se.set_style("darkgrid")
#seaborn >= 0.9 uses height (formerly size)
se.FacetGrid(hb,hue='survival_status',height=6)\
    .map(plt.scatter,"positive_axillary_nodes","Age")\
    .add_legend()
plt.show()

Observation: The points of the two classes overlap heavily, so positive axillary nodes and age together do not separate them.

# Pair-Plot

In [None]:
plt.close()
se.set_style("whitegrid")
#seaborn >= 0.9 uses height (formerly size)
se.pairplot(hb,hue="survival_status",height=3)
plt.show()


# Observations 

* positive_axillary_nodes is a useful feature for identifying the survival_status of cancer patients
* Age and year of operation have overlapping curves for the two classes, so neither gives an observation that can classify survival_status


# Mean

In [None]:
#hb is the name of the data frame
less_five = hb[hb['survival_status']==2]
more_five = hb[hb['survival_status']==1]

In [None]:
#per-column means; np.mean on a whole DataFrame returns a single scalar in pandas >= 2.0
print(more_five.mean())

In [None]:
#per-column means; np.mean on a whole DataFrame returns a single scalar in pandas >= 2.0
print(less_five.mean())

Observation
* The mean age of patients who survived more than 5 years is 52 years; for those who did not survive it is 54 years
* Patients who did not survive 5 years have, on average, more than 3 positive_axillary_nodes
* Patients who survived more than 5 years after the operation have, on average, fewer than 3 positive_axillary_nodes
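
The two filtered frames and the two `print` calls above can be collapsed into a single grouped mean. The sketch below uses a small made-up frame (labelled `toy`, not rows from the dataset) whose column names match the ones used in this notebook.

```python
import pandas as pd

# Made-up rows, for illustration only; the column names match the notebook.
toy = pd.DataFrame({
    "Age": [34, 50, 61, 45, 58, 70],
    "positive_axillary_nodes": [0, 1, 2, 4, 9, 11],
    "survival_status": [1, 1, 1, 2, 2, 2],
})

# One row of column means per class.
class_means = toy.groupby("survival_status").mean()
print(class_means)
```

On the real data the call would be `hb.groupby('survival_status').mean()`. Since node counts are skewed towards zero, comparing `.median()` alongside the mean may give a fairer picture of a typical patient in each class.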

# Final Conclusion

* Patients who did not survive 5 years have, on average, more than 3 positive_axillary_nodes
* Patients who survived more than 5 years after the operation have, on average, fewer than 3 positive_axillary_nodes
* positive_axillary_nodes is a useful feature for identifying the survival_status of cancer patients
* Age and year of operation have overlapping curves for the two classes, so neither can be used on its own to classify patients by survival_status
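
The conclusion about 3 positive axillary nodes can be turned into a simple rule and checked against labelled data. The following is a rough sketch under stated assumptions: `predict_by_nodes` is a hypothetical helper, the threshold of 3 comes from the conclusion above, and the rows are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def predict_by_nodes(nodes, threshold=3):
    """Predict class 2 (died within 5 years) when nodes exceed the threshold, else class 1."""
    return (nodes > threshold).map({True: 2, False: 1})

# Made-up rows, for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "positive_axillary_nodes": [0, 1, 2, 5, 4, 9, 0, 11],
    "survival_status":         [1, 1, 1, 1, 2, 2, 2, 2],
})

predicted = predict_by_nodes(toy["positive_axillary_nodes"])
accuracy = (predicted == toy["survival_status"]).mean()
print(f"accuracy of the threshold rule: {accuracy:.2f}")
```

Applied to `hb`, the resulting accuracy is worth comparing with the baseline of always predicting class 1, which is correct for 225 of the 306 patients (about 74%). A rule that does not beat that baseline adds little, however intuitive it looks in the plots.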