# 1. Domain Knowledge about the Dataset
The dataset is Haberman's Survival Data.
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Number of Instances: 306
Number of Attributes: 4 (including the class attribute)
Attribute Information:
1st column : Age of patient at time of operation (numerical)
2nd column : Patient's year of operation (year - 1900, numerical)
3rd column : Number of positive axillary nodes detected (numerical)
4th column : Survival status : 2 class labels (1,2)
(class attribute) 1 = the patient survived 5 years or longer
(class attribute) 2 = the patient died within 5 years

In [None]:
# Import the libraries needed for EDA
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [None]:
#Reading the haberman file
haberman=pd.read_csv("../input/haberman.csv")

In [None]:
# Display the column names present in the dataset
print(haberman.columns)

In [None]:
# The file has no header row, so assign descriptive column names
haberman.columns=['age', 'year_of_operation', 'aux_nodes_detected', 'survival_status']

In [None]:
haberman.columns

In [None]:
#To see the format of data
haberman.head()

In [None]:
haberman.tail()

In [None]:
haberman.describe()

In [None]:
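# Check how many rows and columns are present in the dataset
haberman.shape
# 305 rows and 4 columns: one row fewer than the 306 instances listed above, most likely
# because read_csv used the first record of the headerless file as the column names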
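The 305-row count is one short of the 306 instances in the dataset description. A likely cause is that `pd.read_csv` treats the first line of a headerless file as column names by default. The sketch below illustrates the difference using a small in-memory sample in place of the real file; the sample records are for illustration only, and passing `header=None` with `names=` is one possible way to keep every record, not necessarily how the file was meant to be read.

```python
import io
import pandas as pd

# Illustrative stand-in for a headerless CSV file (sample records for the demo only).
raw = "30,64,1,1\n30,62,3,1\n30,65,0,1\n34,59,0,2\n"
column_names = ['age', 'year_of_operation', 'aux_nodes_detected', 'survival_status']

# Default behaviour: the first record is consumed as the header row.
default_read = pd.read_csv(io.StringIO(raw))

# header=None keeps every record as data; names= supplies the column labels.
full_read = pd.read_csv(io.StringIO(raw), header=None, names=column_names)

print(default_read.shape)  # one row fewer than the file contains
print(full_read.shape)     # all rows kept
```

With the default call the sample loses its first record; with `header=None` all four records survive and the columns already carry their final names.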

In [None]:
haberman.info()

In [None]:
#checking for any missing values
haberman.isnull().sum()

Observation: The dataset has no missing values in any column.

In [None]:
# Count how many records belong to each class
haberman["survival_status"].value_counts()

Observation: Of the 305 patients, 224 survived 5 years or longer and 81 died within 5 years of surgery.
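The two classes are far from balanced, which is worth keeping in mind when reading the plots that follow. A minimal sketch of how the class shares could be computed, rebuilding the label column from the counts reported above (224 and 81) rather than from the file itself:

```python
import pandas as pd

# Rebuild the label column from the reported counts: 224 survived (1), 81 died (2).
status = pd.Series([1] * 224 + [2] * 81, name="survival_status")

# normalize=True returns the share of each class instead of the raw count.
proportions = status.value_counts(normalize=True)
print(proportions.round(3))
```

On the real DataFrame the same call would be `haberman["survival_status"].value_counts(normalize=True)`.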

# Objective :
The objective is to predict whether a patient will survive for more than 5 years, given the patient's age, year of operation, and number of positive axillary nodes detected. This is a classification problem: each record must be assigned to one of the two class labels.

# Univariate Analysis
PDFs and CDFs

In [None]:
# Separate the survived and not-survived records
survived_patients = haberman[haberman['survival_status'] == 1]
not_survived_patients = haberman[haberman['survival_status'] == 2]

In [None]:
# Check whether the data was split correctly
survived_patients.head()

In [None]:
not_survived_patients.head()

In [None]:
plt.figure(2,figsize=(14,4))
plt.subplot(131)
plt.plot(survived_patients['age'],np.zeros_like(survived_patients['age']),'o',label='survived')
plt.plot(not_survived_patients['age'],np.zeros_like(not_survived_patients['age']),'o',label='not-survived')
plt.legend()
plt.xlabel('age')
plt.title('Survival_Status Based on Age')

plt.subplot(132)
plt.plot(survived_patients['aux_nodes_detected'],np.zeros_like(survived_patients['aux_nodes_detected']),'o',label='survived')
plt.plot(not_survived_patients['aux_nodes_detected'],np.zeros_like(not_survived_patients['aux_nodes_detected']),'o',label='not-survived')
plt.legend()
plt.xlabel('aux_nodes_detected')
plt.title('Survival Status Based on aux_nodes_detected')

plt.subplot(133)
plt.plot(survived_patients['year_of_operation'],np.zeros_like(survived_patients['year_of_operation']),'o',label='survived')
plt.plot(not_survived_patients['year_of_operation'],np.zeros_like(not_survived_patients['year_of_operation']),'o',label='not-survived')
plt.legend()
plt.xlabel('year_of_operation')
plt.title('Survival Status Based on year_of_operation')

Observation: The two classes overlap almost completely on every feature, so these 1-D scatter plots do not tell us much.

In [None]:
# `height` replaces the `size` argument used by older seaborn releases
sns.FacetGrid(haberman, hue="survival_status", height=5).map(sns.distplot, "age").add_legend()
plt.title('Histogram for survival_status based on age')
plt.show()

# Observation
1. Patients aged 40 to 45 mostly did not survive, more so than any other age group.
2. Patients aged 30 to 34 almost all survived, the best outcome of any age group.
3. Patients aged 78 to 82 did not survive 5 years.
4. Patients aged 30 to 40 have a better chance of surviving beyond 5 years.


In [None]:
sns.FacetGrid(haberman, hue="survival_status", height=5).map(sns.distplot, "year_of_operation").add_legend()
plt.title('Histogram for survival_status based on year_of_operation')
plt.show()

# Observation
1. Patients operated on between 1958 and 1963 or between 1966 and 1968 are more likely to survive more than 5 years.
2. Patients operated on between 1963 and 1966 might not survive more than 5 years.

In [None]:
sns.FacetGrid(haberman, hue="survival_status", height=5).map(sns.distplot, "aux_nodes_detected").add_legend()
plt.title('Histogram for survival_status based on axillary_nodes_detected')
plt.show()

# Observation
1. Patients with fewer than 5 axillary nodes are more likely to survive more than 5 years.
2. Patients with more than 5 axillary nodes might not survive more than 5 years.
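To put a number on this split, one option is to group patients by whether they fall below the 5-node cutoff and compare the share of survivors in each group. The sketch below assumes the column names used above; the `toy` frame holds made-up records for illustration only, and `node_group` is a hypothetical helper column.

```python
import numpy as np
import pandas as pd

# Made-up records for illustration only; not taken from the Haberman data.
toy = pd.DataFrame({
    "aux_nodes_detected": [0, 1, 2, 4, 0, 3, 7, 12, 9, 22],
    "survival_status":    [1, 1, 1, 2, 1, 1, 2, 2, 1, 2],
})

# Label each patient by the 5-node cutoff (hypothetical helper column).
toy["node_group"] = np.where(toy["aux_nodes_detected"] < 5, "under_5", "5_or_more")

# Share of patients with status 1 (survived 5 years or longer) in each group.
survival_rate = (toy["survival_status"] == 1).groupby(toy["node_group"]).mean()
print(survival_rate)
```

Replacing `toy` with `haberman` would give the actual survival share on each side of the cutoff.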

In [None]:
plt.figure(3,figsize=(20,5))
for idx, feature in enumerate(list(survived_patients.columns)[:-1]):
    plt.subplot(1, 3, idx+1)
    
    print("="*30+"SURVIVED_PATIENT"+"="*30)
    print("********* "+feature+" *********")
    counts, bin_edges = np.histogram(survived_patients[feature], bins=10, density=True)
    print("Bin Edges: {}".format(bin_edges))
    pdf = counts/sum(counts)
    print("PDF: {}".format(pdf))
    cdf = np.cumsum(pdf)
    print("CDF: {}".format(cdf))
    plt.plot(bin_edges[1:], pdf, label = 'pdf_survived')
    plt.plot(bin_edges[1:], cdf, label= 'cdf_survived')
    
    print("="*30+"NOT_SURVIVED_PATIENT"+"="*30)
    counts, bin_edges = np.histogram(not_survived_patients[feature], bins=10, density=True)
    print("Bin Edges: {}".format(bin_edges))
    pdf = counts/sum(counts)
    print("PDF: {}".format(pdf))
    cdf = np.cumsum(pdf)
    print("CDF: {}".format(cdf))
    plt.plot(bin_edges[1:], pdf, label = 'pdf_not_survived')
    plt.plot(bin_edges[1:], cdf, label= 'cdf_not_survived')
    
    plt.title('pdf & cdf for patients based on '+feature)
    plt.legend()
    plt.xlabel(feature)

# Observation:
1. The age PDF shows no survived patients older than about 78, so in this data nobody above that age survived 5 years.
2. Survival shows no clear dependence on the year of operation.
3. As the number of aux_nodes increases, the chance of survival drops toward zero.
4. In the 45 to 50 range there appear to be no survivors.
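The PDF and CDF curves behind these observations come from a normalized histogram and its running sum. A minimal sketch of that calculation on its own, using a made-up array of ages; `empirical_pdf_cdf` is a hypothetical helper name, not part of NumPy:

```python
import numpy as np

def empirical_pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probability mass, and its cumulative sum."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations falling in each bin
    cdf = np.cumsum(pdf)          # fraction of observations up to each bin's right edge
    return bin_edges, pdf, cdf

# Made-up ages for illustration only.
sample_ages = np.array([30, 34, 38, 41, 45, 47, 52, 52, 57, 60, 63, 70])
bin_edges, pdf, cdf = empirical_pdf_cdf(sample_ages, bins=4)

print("Bin edges:", bin_edges)
print("PDF:", pdf)
print("CDF:", cdf)
```

Reading the CDF at a bin's right edge gives the fraction of patients at or below that value, which is how statements such as "no survivors beyond a given age" can be checked numerically.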

# BOX PLOT

In [None]:
sns.boxplot(x='survival_status',y='age', data=haberman)
plt.title('box_plot based on age')
plt.show()

# Observation:
1. The median age of survived patients is 52.
2. The median age of not_survived patients is 53.
3. The IQR is larger for survived patients.
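The center line of each box is the median and the box height is the interquartile range (IQR), so both can also be computed directly rather than read off the plot. A minimal sketch with made-up ages, assuming the column names used above:

```python
import pandas as pd

# Made-up records for illustration only; not taken from the Haberman data.
toy = pd.DataFrame({
    "age":             [30, 40, 50, 60, 70, 45, 50, 55, 60, 65],
    "survival_status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

grouped = toy.groupby("survival_status")["age"]
summary = pd.DataFrame({
    "median": grouped.median(),
    "q1": grouped.quantile(0.25),   # lower edge of the box
    "q3": grouped.quantile(0.75),   # upper edge of the box
})
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```

Running the same grouping on `haberman` would confirm the medians and the IQR comparison stated above.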


In [None]:
sns.boxplot(x='survival_status',y='year_of_operation', data=haberman)
plt.title('box_plot based on year_of_operation')

In [None]:
sns.boxplot(x='survival_status',y='aux_nodes_detected', data=haberman)
plt.title('box_plot based on axillary_nodes_detected')

# Observation: 
The outliers in aux_nodes_detected affect the mean of each survival_status group.
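A small numeric illustration of why the mean is sensitive to outliers while the median is not. The node counts below are made up for the demonstration:

```python
import numpy as np

# Made-up node counts for illustration only.
node_counts = np.array([0, 0, 1, 1, 2, 3, 4])
with_outlier = np.append(node_counts, 46)  # add one extreme value

print("mean / median without outlier:", node_counts.mean(), np.median(node_counts))
print("mean / median with outlier:   ", with_outlier.mean(), np.median(with_outlier))
```

One extreme value moves the mean far more than the median, which is why the box plot's median line is the more robust summary for this column.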

# Violin Plot

In [None]:
sns.violinplot(x="survival_status", y="age", data=haberman)
plt.title('violin_plot based on age')

# Observation:
The age distributions of survived and not_survived patients are almost the same.

In [None]:
sns.violinplot(x="survival_status", y="year_of_operation", data=haberman)
plt.title('violin_plot based on year_of_operation')

In [None]:
sns.violinplot(x="survival_status", y="aux_nodes_detected", data=haberman)
plt.title('violin_plot based on axillary_nodes_detected')

# Observation
The spread of aux_nodes_detected is smaller for survived patients than for not_survived patients.

# Pair Plot

In [None]:
sns.set_style("whitegrid")
sns.pairplot(haberman, hue="survival_status", vars=['age','year_of_operation','aux_nodes_detected'], height=4)
plt.show()

Observation:
    Patients younger than 40 are more likely to live more than five years (from the year_of_operation vs age plot).

# Contour Plot

In [None]:
sns.jointplot(x="age", y="year_of_operation", data=haberman, kind="kde")
plt.title('Contour Plot age vs year_of_operation')
plt.show()

In [None]:
sns.jointplot(y="aux_nodes_detected", x="age", data=haberman, kind="kde")
plt.title('Contour Plot age vs axillary_nodes_detected')
plt.show()

In [None]:
sns.jointplot(x="year_of_operation", y="aux_nodes_detected", data=haberman, kind="kde")
plt.title('Contour Plot year_of_operation vs axillary_nodes_detected')
plt.show()

# Observation
1. Patients aged 40 to 60 were mostly operated on between 1960 and 1964.
2. Patients with more than 5 axillary nodes are rare.
3. Most patients aged 40 to 50 have 0 axillary nodes.
4. Patients operated on between 1960 and 1965 mostly have fewer than 5 axillary nodes.

# Overall Observation

1. Patients younger than 40 are more likely to survive.
2. Patients with fewer axillary nodes detected are more likely to survive.
3. More than 75% of the patients have fewer than 10 axillary nodes.
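The last point can be checked in two equivalent ways: by counting the share of patients below the cutoff, or by confirming that the 75th percentile lies below it. A minimal sketch on a made-up series of node counts, assuming the column name used above:

```python
import pandas as pd

# Made-up node counts for illustration only; not taken from the Haberman data.
toy_nodes = pd.Series([0, 0, 0, 1, 1, 2, 3, 4, 8, 13, 23, 30], name="aux_nodes_detected")

# Share of patients with fewer than 10 nodes (mean of a boolean mask).
share_below_10 = (toy_nodes < 10).mean()

# 75th percentile of the node counts.
q75 = toy_nodes.quantile(0.75)

print("share below 10:", share_below_10)
print("75th percentile:", q75)
```

On the real data, `haberman["aux_nodes_detected"]` would take the place of `toy_nodes`; the `75%` row of `haberman.describe()` reports the same percentile.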