**Exploratory Data Analysis (EDA) for cancer patients**

**Haberman's Survival Data Set**



Sources: (a) Donor: Tjen-Sien Lim (b) Date: March 1999

Past Usage:

Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models, Proceedings of the 9th International Biometrics Conference, Boston, pp. 104-122.

Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984). Graphical Models for Assessing Logistic Regression Models (with discussion), Journal of the American Statistical Association 79: 61-83.

Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.

Relevant Information:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

    Number of Instances: 306
    Number of Attributes: 4 (including the class attribute)
    Attribute Information:
        Age of patient at time of operation (numerical)
        Patient's year of operation (year - 1900, numerical)
        Number of positive axillary nodes detected (numerical)
        Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years
    Missing Attribute Values: None



In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
hdata = pd.read_csv("../input/haberman.csv",header=None ,names = ["Age", "Operation_year", "Axil_nodes", "Surv_status"])
hdata['Surv_status']=hdata['Surv_status'].map({2:"NO",1:'YES'})
hdata.head()

In [None]:
print(hdata.shape)

In [None]:
print(hdata.columns)

**1. Bi-variate analysis**

**Scatter plot**

**1.1. 1-D Plot**

In [None]:
sns.set_style("whitegrid");
sns.FacetGrid(hdata,hue="Surv_status",height=5).map(plt.scatter, "Surv_status", "Axil_nodes").add_legend();
plt.show();
#https://seaborn.pydata.org/tutorial/categorical.html

In [None]:
# Univariate analysis: split the patients by class and plot Axil_nodes on one axis
patient_survived = hdata.loc[hdata["Surv_status"] == 'YES']
patient_died = hdata.loc[hdata["Surv_status"] == 'NO']
plt.plot(patient_survived["Axil_nodes"], np.zeros_like(patient_survived['Axil_nodes']), 'o', label='YES')
plt.plot(patient_died["Axil_nodes"], np.zeros_like(patient_died['Axil_nodes']), 'o', label='NO')
plt.title("1-D scatter plot for Axil_nodes")
plt.legend(title="Surv_status")
plt.show()

Since the data points **overlap**, this plot provides little information.

**Observation:-** Both 1-D plots show the spread of the data along Axil_nodes, but the points of the two classes are so jumbled together that they cannot be told apart.
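One way to see how heavily the points are stacked is to count, for each node count, how many patients of each class share that value. The sketch below is only illustrative: the helper name `stacked_counts` is hypothetical and the frame `toy` holds made-up rows that merely reuse the notebook's column names. In the notebook one would pass `hdata` instead.

```python
import pandas as pd

def stacked_counts(df, value_col="Axil_nodes", class_col="Surv_status"):
    """Count how many rows of each class share each value of `value_col`."""
    return pd.crosstab(df[value_col], df[class_col])

# Made-up rows for illustration only (NOT the Haberman data).
toy = pd.DataFrame({
    "Axil_nodes": [0, 0, 0, 1, 1, 4],
    "Surv_status": ["YES", "YES", "NO", "YES", "NO", "NO"],
})
table = stacked_counts(toy)
print(table)
```

Rows where both columns are large are exactly the places where the 1-D scatter plot draws one class on top of the other.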

**2.  2-D Scatter Plot**

2.1. Age vs Operation_year

In [None]:
sns.set_style("whitegrid");
sns.FacetGrid(hdata,hue="Surv_status",height=5).map(plt.scatter,  "Operation_year" , "Age").add_legend();    
plt.show();

2.2. Age vs Axil_nodes

In [None]:
sns.set_style("whitegrid");
sns.FacetGrid(hdata,hue="Surv_status",height=5).map(plt.scatter, "Age" , "Axil_nodes").add_legend();
plt.show();


**Observation:-** In the Operation_year vs Age plot, the points for operation years 61 and 67 suggest a Surv_status of NO, but this does not tell us much. In the Age vs Axil_nodes plot the two classes are so mixed that it gives no useful information.
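An impression about particular operation years can be checked numerically rather than by eye. The following is a minimal sketch, assuming the notebook's column names; `death_share_by_year` is a hypothetical helper and `toy` contains made-up rows, not the real data.

```python
import pandas as pd

def death_share_by_year(df, year_col="Operation_year", class_col="Surv_status"):
    """Per operation year: number of patients and share with status 'NO'."""
    died = df[class_col].eq("NO")          # boolean Series, True where the patient died
    grouped = died.groupby(df[year_col])   # group the flags by operation year
    return pd.DataFrame({"patients": grouped.size(), "share_died": grouped.mean()})

# Made-up rows for illustration only (NOT the Haberman data).
toy = pd.DataFrame({
    "Operation_year": [61, 61, 61, 61, 67, 67],
    "Surv_status": ["NO", "YES", "YES", "YES", "NO", "YES"],
})
summary = death_share_by_year(toy)
print(summary)
```

Calling `death_share_by_year(hdata)` in the notebook would show whether any year really stands out once the number of patients per year is taken into account.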

**Pair Plot**

In [None]:
sns.set_style("whitegrid");
sns.pairplot(hdata, hue="Surv_status", height=4);
plt.show()

**Observation:-** In the pair plot the points of the two classes are jumbled together in every panel. The Axil_nodes vs Operation_year panel gives some idea of the structure, but it is still not easy to base a decision on it.

**Histogram**

In [None]:
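# distplot is deprecated in recent seaborn; histplot with kde=True is its replacement
sns.FacetGrid(hdata, hue='Surv_status', height=5).map(sns.histplot, 'Age', kde=True, stat='density').add_legend()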
plt.show()

In [None]:
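# distplot is deprecated in recent seaborn; histplot with kde=True is its replacement
sns.FacetGrid(hdata, hue='Surv_status', height=5).map(sns.histplot, "Operation_year", kde=True, stat='density').add_legend()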
plt.show()

In [None]:
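# distplot is deprecated in recent seaborn; histplot with kde=True is its replacement
sns.FacetGrid(hdata, hue='Surv_status', height=5).map(sns.histplot, "Axil_nodes", kde=True, stat='density').add_legend()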
plt.show()

**Observation:-** In the histograms the two classes overlap heavily, so prediction from any single feature is not easy.
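The degree of overlap can also be expressed as a single number. The sketch below uses the overlap coefficient of two histograms built on shared bins; this measure is not part of the original analysis and is offered only as one possible way to quantify what the plots show. The function name `histogram_overlap` and the choice of 10 bins are assumptions.

```python
import numpy as np

def histogram_overlap(a, b, bins=10):
    """Overlap coefficient of two samples' histograms on shared bins.

    Returns a value in [0, 1]: 1 means the binned distributions coincide,
    0 means the samples never fall into the same bin.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Use the same bin edges for both samples so the bars are comparable.
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    pa, _ = np.histogram(a, bins=edges)
    pb, _ = np.histogram(b, bins=edges)
    pa = pa / pa.sum()   # convert counts to per-bin probabilities
    pb = pb / pb.sum()
    return float(np.minimum(pa, pb).sum())

# Made-up samples for illustration only.
print(histogram_overlap([1, 2, 3], [1, 2, 3]))
print(histogram_overlap([0, 1], [9, 10]))
```

In the notebook a call such as `histogram_overlap(patient_survived["Age"], patient_died["Age"])` would give a value close to 1 if the two Age histograms are as similar as they look.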

**Plotting the PDF (Probability Density Function) and CDF (Cumulative Distribution Function) for the Axil_nodes, Age and Operation_year features**

In [None]:
def plot_pdf_cdf(values, label, bins=10):
    """Print and plot the binned PDF and CDF of `values`."""
    counts, bin_edges = np.histogram(values, bins=bins, density=True)
    pdf = counts / counts.sum()   # normalise so the bin probabilities sum to 1
    cdf = np.cumsum(pdf)
    print(pdf)
    print(bin_edges)
    plt.plot(bin_edges[1:], pdf, label=label + "_PDF")
    plt.plot(bin_edges[1:], cdf, label=label + "_CDF")

# One figure per feature; as before, Axil_nodes uses 15 bins for the
# survivors and 10 bins everywhere else.
for feature, survived_bins in [("Axil_nodes", 15), ("Age", 10), ("Operation_year", 10)]:
    plot_pdf_cdf(patient_survived[feature], "Survived", bins=survived_bins)
    plot_pdf_cdf(patient_died[feature], "Died", bins=10)
    plt.xlabel(feature)
    plt.legend()   # labels: Survived_PDF, Survived_CDF, Died_PDF, Died_CDF
    plt.show()
#appliedaicourse.com

**Observation:-**

In the first graph, about 95% of the patients who survived had fewer than 10 axillary nodes, while about 95% of those who died had fewer than 23.

In the second graph, about 95% of the patients who survived were younger than 70, and about 95% of those who died were also younger than 70.

In the third graph, about 95% of the patients who survived were operated on between 1958 and 1968, and about 95% of those who died were operated on between 1958 and 1967.
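Reading a 95% point off a binned CDF depends on the bin width, so it can help to compute the percentile directly from the raw values. This is a minimal sketch using `np.percentile`; the helper name `value_at_percentile` is hypothetical and `toy_nodes` is a made-up list, not the real node counts.

```python
import numpy as np

def value_at_percentile(values, q=95):
    """Value below which roughly q% of `values` fall (NumPy's default interpolation)."""
    return float(np.percentile(np.asarray(values, dtype=float), q))

# Made-up node counts for illustration only (NOT the Haberman data).
toy_nodes = [0, 0, 1, 2, 3, 4, 5, 8, 10, 20]
print(value_at_percentile(toy_nodes, 50))
print(value_at_percentile(toy_nodes, 95))
```

In the notebook, `value_at_percentile(patient_survived["Axil_nodes"])` and the same call on `patient_died` would give the two thresholds without going through a histogram.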

**Box Plot**

In [None]:
sns.boxplot(x='Surv_status',y='Age', data=hdata)
plt.show()

sns.boxplot(x='Surv_status',y='Operation_year', data=hdata)
plt.show()

sns.boxplot(x='Surv_status',y='Axil_nodes', data=hdata)
plt.show()

**Observation:-**

In the first box plot (Age), the 25th to 75th percentile range is about 42 to 60 for Surv_status YES and about 46 to 62 for Surv_status NO.

In the second box plot (Operation_year), the 25th to 75th percentile range is about 60 to 66 for Surv_status YES and about 59 to 65 for Surv_status NO.

In the third box plot (Axil_nodes), the 25th to 75th percentile range is about 0 to 4 for Surv_status YES and about 1 to 11 for Surv_status NO.
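The quartiles that a box plot draws can also be computed directly, which avoids estimating them from the figure. The sketch below assumes the notebook's `Surv_status` column; `quartiles_by_class` is a hypothetical helper and `toy` is a made-up frame used only to show the output shape.

```python
import pandas as pd

def quartiles_by_class(df, feature, class_col="Surv_status"):
    """25th, 50th and 75th percentiles of `feature` for each class."""
    return df.groupby(class_col)[feature].quantile([0.25, 0.5, 0.75]).unstack()

# Made-up rows for illustration only (NOT the Haberman data).
toy = pd.DataFrame({
    "Age": [30, 40, 50, 60, 70, 45, 55, 65],
    "Surv_status": ["YES"] * 5 + ["NO"] * 3,
})
quartiles = quartiles_by_class(toy, "Age")
print(quartiles)
```

Running `quartiles_by_class(hdata, "Axil_nodes")` in the notebook should reproduce the box edges for that feature, one row per class.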

**Violin plots**

In [None]:
sns.violinplot(x='Surv_status',y='Age', data=hdata)
plt.show()

sns.violinplot(x='Surv_status',y='Operation_year', data=hdata)
plt.show()

sns.violinplot(x='Surv_status',y='Axil_nodes', data=hdata)
plt.show()

**Because the two classes overlap so much, the violin plots also give little additional insight.**

In [None]:
print("Summary Statistics of Patients")
hdata.describe()

In [None]:
print("Summary statistics of patients who survived.")
patient_survived.describe()

In [None]:
print("Summary statistics of patients who did not survive.")
patient_died.describe()

**Observation:-**

The classes overlap heavily and the data set is imbalanced (225 patients survived and 81 did not), so no firm conclusion can be drawn. The mixing seen in the pair plot also shows up in the histograms and box plots.

No clear threshold value can be read from the box plots because of this overlap.
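The imbalance can be stated as fractions as well as counts. The sketch below rebuilds the label column from the two counts quoted in the observation (225 and 81) so that it runs on its own; `class_balance` is a hypothetical helper, and in the notebook one would simply pass `hdata["Surv_status"]`.

```python
import pandas as pd

def class_balance(labels):
    """Return the count and the fraction of each class label."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})

# Labels rebuilt from the counts quoted in the observation: 225 YES, 81 NO.
labels = pd.Series(["YES"] * 225 + ["NO"] * 81, name="Surv_status")
balance = class_balance(labels)
print(balance)
```

With these counts roughly 73.5% of the patients are in the YES class, so a rule that always predicts YES would already be right about that often. Any threshold taken from the plots should be judged against that baseline.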