<h1><center>EDA On Haberman Dataset</center></h1>

# Haberman Dataset
* The Haberman dataset contains survival information for patients who underwent breast cancer surgery between 1958 and 1970.
* The data was collected at the University of Chicago's Billings Hospital.
* Dataset attributes:
   * Age of the patient at the time of the operation.
   * Year of the operation.
   * Number of positive axillary nodes detected.
   * Survival status:
      * 1: the patient survived 5 years or longer after the operation.
      * 2: the patient died within 5 years.

# Objective:
* Classify a patient's survival status 5 years after breast cancer surgery, based on the dataset attributes.

In [None]:
#Import all required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Load the Haberman data into a DataFrame.
hman_data = pd.read_csv('../input/haberman.csv',
                        names=['age','year_of_op','axilary_nodes_cnt','survival_status_a5'])

In [None]:
# Number of data points (rows) and features (columns)
hman_data.shape

In [None]:
# Column names, as assigned when the file was read into the DataFrame
hman_data.columns

In [None]:
# Let's describe the data frame...
hman_data.describe()

# Observation
* Patients were between 30 and 83 years old at the time of surgery.
* The dataset covers surgeries performed from 1958 to 1969.
* The maximum number of positive axillary nodes detected is 52.

In [None]:
# Map the survival status from 1 to "Yes" and 2 to "No" to improve readability.

hman_data['survival_status_a5'] = hman_data['survival_status_a5'].map({1: 'Yes' , 2 : 'No'})


In [None]:
# Number of data points per class
hman_data['survival_status_a5'].value_counts()

# Observation:
* The data is imbalanced: roughly 3 times as many patients survived as died.
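
To put a number on the imbalance, the class shares can be computed directly. The following is a minimal sketch: the helper name `class_shares` is hypothetical, and `toy` is a small made-up frame standing in for `hman_data`, not the real dataset.

```python
import pandas as pd

def class_shares(df, label_col="survival_status_a5"):
    """Return the fraction of rows in each class, largest first."""
    return df[label_col].value_counts(normalize=True)

# Toy stand-in for hman_data (made-up rows, not the real dataset).
toy = pd.DataFrame({"survival_status_a5": ["Yes", "Yes", "Yes", "No"]})

shares = class_shares(toy)
# Ratio of the majority class share to the minority class share.
imbalance_ratio = shares.max() / shares.min()
print(shares)
print(imbalance_ratio)
```

Calling `class_shares(hman_data)` after the mapping step above should give the corresponding shares for the real data.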

In [None]:
# Let's see the data types of data frame features.
hman_data.dtypes

# 2D Scatter Plot

In [None]:
%matplotlib inline
hman_data.plot(kind='scatter', x = 'age',y = 'axilary_nodes_cnt')

In [None]:
# let's see with seaborn where we can color each survival status
sns.set_style("whitegrid")
sns.FacetGrid(hman_data, hue = 'survival_status_a5', height = 5)\
   .map(plt.scatter,"age", "axilary_nodes_cnt")\
   .add_legend();
plt.show()

# Observation
* No clear separating line between the two classes can be drawn from the 2D scatter plots.

# Pair Plots

In [None]:
sns.pairplot(hman_data,hue = "survival_status_a5", height = 3)

# Observation
* The pair plots do not reveal any attribute or pair of attributes that clearly separates the two classes.

# Countplots

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.countplot(x='axilary_nodes_cnt', hue='survival_status_a5', data=hman_data)

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.countplot(x='age', hue='survival_status_a5', data=hman_data)
plt.legend(loc='upper right')

In [None]:
sns.countplot(x='year_of_op', hue='survival_status_a5', data=hman_data)
plt.legend(loc='upper right')

# Observation
* From the count plots above, axilary_nodes_cnt appears to be the most informative feature for classifying a patient's survival status.
* Patients with fewer positive axillary nodes had a higher chance of surviving five years after surgery.
* Patients with a high number of positive axillary nodes were less likely to survive.
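
The trend read off the count plots can also be checked numerically by computing the share of survivors within buckets of the node count. This is only a sketch: the helper `survival_rate_by_bucket` and the bucket edges are assumptions chosen for illustration, and `toy` is made-up data standing in for `hman_data`.

```python
import pandas as pd

def survival_rate_by_bucket(df, col="axilary_nodes_cnt",
                            label_col="survival_status_a5",
                            bins=(-1, 0, 3, 10, 100)):
    """Fraction of 'Yes' rows inside each bucket of `col`.

    The default bucket edges are illustrative, not taken from the dataset.
    """
    buckets = pd.cut(df[col], bins=list(bins))   # right-closed intervals
    survived = df[label_col].eq("Yes")           # boolean Series
    # observed=True drops buckets that contain no rows.
    return survived.groupby(buckets, observed=True).mean()

# Toy stand-in for hman_data (made-up rows, not the real dataset).
toy = pd.DataFrame({
    "axilary_nodes_cnt": [0, 0, 1, 2, 5, 8, 20, 30],
    "survival_status_a5": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

rates = survival_rate_by_bucket(toy)
print(rates)
```

A survival share that falls steadily from the lowest bucket to the highest would support the reading of the count plots.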


# PDF and CDF

In [None]:
sns.set_style("whitegrid")
sns.FacetGrid(hman_data,hue="survival_status_a5", height=6)\
   .map(sns.distplot,"axilary_nodes_cnt")\
   .add_legend()

In [None]:
sns.set_style("whitegrid")
sns.FacetGrid(hman_data,hue="survival_status_a5", height=6)\
   .map(sns.distplot,"age")\
   .add_legend()

In [None]:
sns.set_style("whitegrid")
sns.FacetGrid(hman_data,hue="survival_status_a5", height=6)\
   .map(sns.distplot,"year_of_op")\
   .add_legend()

# Observation
* Of the three PDFs above, axilary_nodes_cnt is the most useful feature for classifying survival status because the "Yes" and "No" distributions overlap the least.
* In the other two PDFs, the two class labels overlap heavily.
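
Judging overlap by eye is subjective, so one way to cross-check it is an overlap score between the two class histograms. The sketch below uses the sum of bin-wise minima of two normalized histograms on shared bins (0 means disjoint, 1 means identical). The function name `histogram_overlap` and the choice of 10 bins are assumptions, not part of the original analysis.

```python
import numpy as np

def histogram_overlap(a, b, bins=10):
    """Overlap of two normalized histograms computed on shared bins."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Use a common range so both histograms share the same bin edges.
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    counts_a, _ = np.histogram(a, bins=bins, range=(lo, hi))
    counts_b, _ = np.histogram(b, bins=bins, range=(lo, hi))
    p = counts_a / counts_a.sum()
    q = counts_b / counts_b.sum()
    return float(np.minimum(p, q).sum())

# Made-up values for illustration only.
same = histogram_overlap([1, 2, 3], [1, 2, 3])
apart = histogram_overlap([0, 0, 1], [9, 10, 10])
print(same, apart)
```

For the real data, the two inputs would be one column of `hman_data` filtered to "Yes" and to "No"; the feature with the lowest score overlaps the least under this particular measure.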

In [None]:
# Let's look at the CDFs
for col in ['axilary_nodes_cnt', 'age', 'year_of_op']:
    for status in ['Yes', 'No']:
        # Values of this feature for one survival class
        values = hman_data.loc[hman_data['survival_status_a5'] == status, col]
        counts, bin_edges = np.histogram(values, bins=10, density=True)
        pdf = counts / counts.sum()
        cdf = np.cumsum(pdf)
        plt.plot(bin_edges[1:], pdf, label='PDF_' + status.upper())
        plt.plot(bin_edges[1:], cdf, label='CDF_' + status.upper())
    plt.xlabel(col)
    plt.legend()
    plt.show()

# Observation
* Patients with a low axilary_nodes_cnt had a high survival rate.
* Patients with fewer than 3 positive axillary nodes have roughly an 80-85% chance of survival.
* In this dataset, the CDF suggests that no patient with more than about 45 positive axillary nodes survived.
* In this dataset, no patient older than 77 survived.
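
Percentages read off a binned CDF are approximate, so it can help to compute the survival share for a given condition directly. A minimal sketch, assuming a hypothetical helper `survival_rate_where` and made-up `toy` data in place of `hman_data`:

```python
import pandas as pd

def survival_rate_where(df, mask, label_col="survival_status_a5"):
    """Share of 'Yes' rows among the rows selected by the boolean `mask`.

    Returns NaN when the mask selects no rows.
    """
    selected = df.loc[mask, label_col]
    return selected.eq("Yes").mean()

# Toy stand-in for hman_data (made-up rows, not the real dataset).
toy = pd.DataFrame({
    "axilary_nodes_cnt": [0, 0, 1, 2, 5, 8, 20, 30],
    "survival_status_a5": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

rate_low = survival_rate_where(toy, toy["axilary_nodes_cnt"] < 3)
rate_high = survival_rate_where(toy, toy["axilary_nodes_cnt"] > 10)
print(rate_low, rate_high)
```

On the real data, a call such as `survival_rate_where(hman_data, hman_data["axilary_nodes_cnt"] < 3)` would give the exact share behind the 80-85% estimate.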

# Box Plots

In [None]:
sns.boxplot(x='survival_status_a5',y='axilary_nodes_cnt',data=hman_data)

In [None]:
sns.boxplot(x='survival_status_a5',y='year_of_op',data=hman_data)

In [None]:
sns.boxplot(x='survival_status_a5',y='age',data=hman_data)

# Violinplot

In [None]:
sns.violinplot(x='survival_status_a5',y='axilary_nodes_cnt',data=hman_data)

In [None]:
sns.violinplot(x='survival_status_a5',y='year_of_op',data=hman_data)

In [None]:
sns.violinplot(x='survival_status_a5',y='age',data=hman_data)

# Observation
* The violin plots and box plots also indicate that axilary_nodes_cnt is the only feature that helps classify a patient's survival status 5 years after surgery.
* Patients with a high axilary_nodes_cnt are less likely to survive.
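
The quartiles that a box plot draws can also be printed per class, which makes the comparison between "Yes" and "No" easier to state precisely. This is a sketch on made-up `toy` data standing in for `hman_data`:

```python
import pandas as pd

# Toy stand-in for hman_data (made-up rows, not the real dataset).
toy = pd.DataFrame({
    "axilary_nodes_cnt": [0, 0, 1, 2, 5, 8, 20, 30],
    "survival_status_a5": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# One row per class; the 25%, 50% and 75% columns correspond to the
# box edges and the median line in a box plot.
summary = toy.groupby("survival_status_a5")["axilary_nodes_cnt"].describe()
print(summary[["25%", "50%", "75%", "max"]])
```

Running the same `groupby(...).describe()` on `hman_data` for each feature shows whether the class medians and quartiles actually differ.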

# Conclusion
* The dataset is imbalanced, which makes it difficult to classify survival status with high accuracy.
* axilary_nodes_cnt is the most useful feature for classifying a patient's survival status.
* The chance of survival is higher with a lower axilary_nodes_cnt.
* Patients with fewer than 2 positive axillary nodes had about an 85% chance of surviving 5 years after surgery.
* age and year_of_op are not very useful for classifying the class label survival_status_a5.