# Exploratory Data Analysis on Haberman's Survival DataSet:

## About the DataSet:

The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

# Attribute Information:

1) Age of patient at time of operation (numerical)<br>
2) Patient's year of operation (year - 1900, numerical)<br>
3) Number of positive axillary nodes detected (numerical)<br>
4) Survival status (class attribute)<br>
        1 = the patient survived 5 years or longer<br>
        2 = the patient died within 5 years<br>
5) There are no missing values in this data.<br>
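Because the class attribute is coded as 1 and 2, tables and plot legends can be easier to read once the codes are mapped to labels. A minimal sketch, assuming a small hand-made frame: the rows, the label strings and the `Survival_Label` column name are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from haberman.csv.
toy = pd.DataFrame({
    "Age": [30, 45, 52, 61],
    "Operated_year": [64, 60, 66, 59],
    "Nodes_Detected": [1, 0, 7, 3],
    "Survival_Status": [1, 1, 2, 2],
})

# Map the numeric class codes to readable labels (label text is an assumption).
status_labels = {1: "survived_5y_or_longer", 2: "died_within_5y"}
toy["Survival_Label"] = toy["Survival_Status"].map(status_labels)

print(toy[["Survival_Status", "Survival_Label"]])
```

Any code missing from the dictionary would become `NaN` after `map`, which also makes unexpected class values easy to spot.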


## Importing Necessary Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Reading Dataset and basic information of Data

In [None]:
df=pd.read_csv('../input/habermans-survival-data-set/haberman.csv',names=['Age','Operated_year','Nodes_Detected','Survival_Status'])

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df['Survival_Status'].value_counts()

In [None]:
# Survival_Status == 2 marks patients who died within 5 years of surgery.
Deaths=df[df['Survival_Status']==2]
Total_Cases=len(df)
Death_rate=(len(Deaths)/Total_Cases)*100
print(Death_rate)

# Observations:

* From the above calculation, 26.47% of the patients died within 5 years of surgery, so 73.53% survived 5 years or longer.
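The class shares can also be read directly from `value_counts`, which keeps each percentage attached to its class code and so avoids mixing up which code is which. A minimal sketch on a hypothetical four-row status column (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical status column: three survivors (1) and one death (2).
status = pd.Series([1, 1, 2, 1], name="Survival_Status")

# normalize=True returns fractions instead of counts; scale to percentages.
share = status.value_counts(normalize=True) * 100

survived_pct = share.loc[1]  # share of class 1 (survived 5 years or longer)
died_pct = share.loc[2]      # share of class 2 (died within 5 years)
print(share)
```

On the real data the same call would be `df['Survival_Status'].value_counts(normalize=True) * 100`.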

## Pair Plots

In [None]:
sns.set_style('whitegrid')
sns.pairplot(df,hue='Survival_Status')

##### In all of the pair plots above, the two survival classes overlap heavily, so no pair of features separates them cleanly.


## Means, Medians and Standard Deviation 

In [None]:
# Splitting the data into patients who survived and patients who did not survive.

In [None]:
Patient_Survived=df[df['Survival_Status']==1]
Patient_Not_Survived=df[df['Survival_Status']==2]

### MEANS:

In [None]:
print("SURVIVED:")
print("The mean Age of Survived Patients is : {}".format(np.mean(Patient_Survived['Age'])))
print("The mean No. of Nodes of Survived Patients is : {}".format(np.mean(Patient_Survived['Nodes_Detected'])))
print("DEAD:")
print("The mean Age of Dead Patients is : {}".format(np.mean(Patient_Not_Survived['Age'])))
print("The mean No. of Nodes of Dead Patients is : {}".format(np.mean(Patient_Not_Survived['Nodes_Detected'])))

### MEDIANS:

In [None]:
print("SURVIVED:")
print("The Median Age of Survived Patients is : {}".format(np.median(Patient_Survived['Age'])))
print("The Median of Nodes Detected of Survived Patients is : {}".format(np.median(Patient_Survived['Nodes_Detected'])))
print("DEAD:")
print("The Median Age of Dead Patients is : {}".format(np.median(Patient_Not_Survived['Age'])))
print("The Median of Nodes Detected of Dead Patients is : {}".format(np.median(Patient_Not_Survived['Nodes_Detected'])))

### Quantiles:

In [None]:
# np.arange(0,100,25) requests the 0th, 25th, 50th and 75th percentiles.
print("SURVIVED:")
print("Age quantiles of Survived Patients: ",np.percentile(Patient_Survived["Age"],np.arange(0,100,25)))
print("Nodes Detected quantiles of Survived Patients: ",np.percentile(Patient_Survived["Nodes_Detected"],np.arange(0,100,25)))
print("DEAD:")
print("Age quantiles of Dead Patients: ",np.percentile(Patient_Not_Survived["Age"],np.arange(0,100,25)))
print("Nodes Detected quantiles of Dead Patients: ",np.percentile(Patient_Not_Survived["Nodes_Detected"],np.arange(0,100,25)))

### Standard Deviations

In [None]:
print("SURVIVED:")
print("The Standard Deviation of Age of Survived Patients is : {}".format(np.std(Patient_Survived['Age'])))
print("The Standard Deviation of No. of Nodes of Survived Patients is : {}".format(np.std(Patient_Survived['Nodes_Detected'])))
print("DEAD:")
print("The Standard Deviation of Age of Dead Patients is : {}".format(np.std(Patient_Not_Survived['Age'])))
print("The Standard Deviation of No. of Nodes of Dead Patients is : {}".format(np.std(Patient_Not_Survived['Nodes_Detected'])))
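The per-class means, medians and standard deviations printed above can also be collected in a single table with `groupby(...).agg(...)`. The sketch below uses a hypothetical six-row frame, not the real data. One caveat: pandas `std` defaults to the sample standard deviation (`ddof=1`), while `np.std` uses the population form (`ddof=0`), so the two differ unless `ddof` is set explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not taken from haberman.csv.
toy = pd.DataFrame({
    "Age": [30, 40, 50, 60, 70, 80],
    "Nodes_Detected": [0, 1, 2, 4, 8, 12],
    "Survival_Status": [1, 1, 1, 2, 2, 2],
})

# One row per class, one (feature, statistic) column per combination.
summary = toy.groupby("Survival_Status")[["Age", "Nodes_Detected"]].agg(
    ["mean", "median", "std"]
)
print(summary)

# Population standard deviation (ddof=0), matching what np.std computes.
pop_std_age = toy.groupby("Survival_Status")["Age"].std(ddof=0)
print(pop_std_age)
```

With a few hundred rows the gap between the two conventions is small, but it is worth knowing which one a printed number refers to.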

### Histograms of all the features of data

In [None]:
Age_hist=sns.FacetGrid(df,hue='Survival_Status',height=7).map(sns.distplot,'Age').add_legend()
Age_hist.set(xlabel='Age of Patient',ylabel='Density',title='Histogram of Age of Patient')

In [None]:
Year_hist=sns.FacetGrid(df,hue='Survival_Status',height=7).map(sns.distplot,'Operated_year').add_legend()
Year_hist.set(xlabel='Year of Operation',ylabel='Density',title='Histogram of Year Operated')

In [None]:
Nodes_hist=sns.FacetGrid(df,hue='Survival_Status',height=7).map(sns.distplot,'Nodes_Detected').add_legend()
Nodes_hist.set(xlabel='No. of Lymph Nodes Detected',ylabel='Density',title='Histogram of Lymph Nodes Detected')

In [None]:
Label=(["PDF of Survived","CDF of Survived","PDF of not Survived","CDF of not Survived"])
counts, bin_edges=np.histogram(Patient_Survived['Age'],bins=10,density=True)
pdf=counts/sum(counts)
print("PDF is : ",pdf)
print("Bin Edges are : ",bin_edges)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)

counts, bin_edges=np.histogram(Patient_Not_Survived['Age'],bins=10,density=True)
pdf=counts/sum(counts)
print("PDF is : ",pdf)
print("Bin Edges are : ",bin_edges)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.legend(Label)
plt.xlabel('Age of Patient')
plt.title('PDF and CDF of Patients based on Age')

In [None]:
Label=(["PDF of Survived","CDF of Survived","PDF of not Survived","CDF of not Survived"])
counts, bin_edges=np.histogram(Patient_Survived['Operated_year'],bins=10,density=True)
pdf=counts/sum(counts)
print("PDF is : ",pdf)
print("Bin Edges are : ",bin_edges)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)

counts, bin_edges=np.histogram(Patient_Not_Survived['Operated_year'],bins=10,density=True)
pdf=counts/sum(counts)
print("PDF is : ",pdf)
print("Bin Edges are : ",bin_edges)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.legend(Label)
plt.xlabel('Year of Operation')
plt.title('PDF and CDF of Patients based on Year of Operation')

In [None]:
Label=(["PDF of Survived","CDF of Survived","PDF of not Survived","CDF of not Survived"])
counts, bin_edges=np.histogram(Patient_Survived['Nodes_Detected'],bins=10,density=True)
pdf=counts/sum(counts)
print("PDF is : ",pdf)
print("Bin Edges are : ",bin_edges)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)

counts, bin_edges=np.histogram(Patient_Not_Survived['Nodes_Detected'],bins=10,density=True)
pdf=counts/sum(counts)
print("PDF is : ",pdf)
print("Bin Edges are : ",bin_edges)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)
plt.legend(Label)
plt.xlabel('No. of Nodes Detected')
plt.title('PDF and CDF of Patients based on No. of Nodes Detected')
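The three cells above repeat the same histogram, PDF and CDF steps, so they could be folded into one helper. This is only a sketch: the function name `empirical_pdf_cdf` and the sample ages are hypothetical. It skips `density=True` because, with equal-width bins, dividing the raw counts by their sum gives the same per-bin fractions.

```python
import numpy as np

def empirical_pdf_cdf(values, bins=10):
    """Return per-bin fractions (PDF), their running sum (CDF) and the bin edges."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations in each bin
    cdf = np.cumsum(pdf)          # fraction of observations up to each right edge
    return pdf, cdf, bin_edges

# Hypothetical ages for illustration only.
ages = np.array([30, 34, 38, 41, 45, 52, 52, 60, 66, 77])
pdf, cdf, edges = empirical_pdf_cdf(ages, bins=5)

print("PDF is : ", pdf)
print("CDF is : ", cdf)
print("Bin Edges are : ", edges)
```

Here `cdf[i]` is the fraction of values at or below `edges[i + 1]`, which is how a statement such as "x% of the group is younger than some age" can be read off the CDF curve.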

# Observations:

* The PDFs and CDFs of the two classes overlap too much to support any firm conclusions.
* However, the CDF of Age for the survived class suggests that roughly 15% of the patients who survived were younger than 38.

## Box Plots: 

In [None]:
sns.boxplot(x='Survival_Status',y='Age',data=df)
plt.title('Box Plot of Age of Patients')

In [None]:
sns.boxplot(x='Survival_Status',y='Nodes_Detected',data=df)
plt.title("Box Plot for No. of Nodes Detected")

In [None]:
sns.boxplot(x='Survival_Status',y='Operated_year',data=df)
plt.title("Box Plot for Year of Operation")

## Violin Plots: 

In [None]:
sns.violinplot(x = "Survival_Status", y = "Age", hue = "Survival_Status", data = df)
plt.title("Violin plot for Age ")
plt.show()

In [None]:
sns.violinplot(x = "Survival_Status", y = "Operated_year", hue = "Survival_Status", data = df)
plt.title("Violin plot for Year of Operation ")
plt.show()

In [None]:
sns.violinplot(x = "Survival_Status", y = "Nodes_Detected", hue = "Survival_Status", data = df)
plt.title("Violin plot for No. of Nodes Detected")
plt.show()

In [None]:
# FacetGrid's 'size' parameter was renamed to 'height' in newer seaborn releases.
sns.FacetGrid(data=df,height=10,hue='Survival_Status').map(plt.scatter,'Survival_Status','Age').add_legend()

## Observations:

* All patients younger than 34 survived 5 years or longer.<br>
* All patients aged 78 or older died within 5 years.
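Threshold statements like these can be checked against the data by looking at the age range of each class: everyone younger than the youngest death survived, and everyone older than the oldest survivor died. A minimal sketch on a hypothetical eight-row frame (the ages and outcomes are made up and do not reproduce the dataset):

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from haberman.csv.
toy = pd.DataFrame({
    "Age": [25, 29, 31, 31, 50, 70, 72, 80],
    "Survival_Status": [1, 1, 1, 2, 2, 1, 2, 2],
})

# Minimum and maximum age within each survival class.
age_range = toy.groupby("Survival_Status")["Age"].agg(["min", "max"])
print(age_range)

youngest_death = age_range.loc[2, "min"]   # below this age, nobody died
oldest_survivor = age_range.loc[1, "max"]  # above this age, nobody survived
print(youngest_death, oldest_survivor)
```

Applied to `df`, the same `groupby` gives the exact cut-offs and shows whether a boundary should be strict or inclusive.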

In [None]:
sns.jointplot(y='Nodes_Detected',x='Survival_Status',data=df,kind='kde')

# Conclusions:


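* The Haberman's Survival Dataset is imbalanced: about 73% of the patients survived 5 years or longer.
* The two classes overlap heavily on every feature, so no single feature or pair of features separates them, and only limited conclusions can be drawn from these plots.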
