# Haberman's Breast Cancer Survival Classification

**Description:**

The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
 
 Number of Observations: 306

 Number of Attributes: 4 (including the Target Variable)

**Attribute Information:**

 1. Age - Age of patient at time of operation 
 2. Op_yr - Patient's year of operation (year - 1900, numerical)
 3. Aux - Number of positive axillary nodes detected (numerical)
 4. Sur_stat - Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years
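
The encodings listed above can be made explicit in code. The sketch below is only an illustration: it assumes the lowercase column names used later in this notebook, the three rows are made-up values rather than records from the dataset, and `op_year_full` and `survived_5y` are hypothetical helper columns.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the dataset).
demo = pd.DataFrame(
    {"age": [30, 52, 78], "op_yr": [64, 58, 65], "aux": [1, 4, 1], "sur_stat": [1, 2, 2]}
)

# op_yr is stored as (year - 1900), so adding 1900 recovers the calendar year.
demo["op_year_full"] = demo["op_yr"] + 1900

# sur_stat: 1 = survived 5 years or longer, 2 = died within 5 years.
demo["survived_5y"] = demo["sur_stat"].map({1: True, 2: False})

print(demo)
```

Keeping the original coded columns and adding decoded ones alongside them leaves the rest of the analysis unchanged.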

In [None]:
# Loading the libraries
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Loading the data 
haber=pd.read_csv("../input/habermans-survival-data-set/haberman.csv")

In [None]:
#Changing the Column Names
haber.columns=['age','op_yr','aux','sur_stat']
haber.columns

In [None]:
# Checking the dimensions
haber.shape

In [None]:
#Checking the column names
haber.columns

In [None]:
# Checking the categories in the sur_stat variable
haber['sur_stat'].unique()

In [None]:
# Subsetting the rows with class 1 in the sur_stat variable and printing the dimensions
haber_a5=haber[haber.sur_stat==1]
haber_a5.shape

In [None]:
# Subsetting the rows with class 2 in the sur_stat variable and printing the dimensions
haber_b5=haber[haber.sur_stat==2]
haber_b5.shape
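
The two subsets above give the number of rows per class. The same information, plus the class proportions, can be read in one step with `value_counts`. This is a minimal sketch on a made-up label series; in the notebook the input would be `haber["sur_stat"]`.

```python
import pandas as pd

# Made-up labels for illustration; in the notebook this would be haber["sur_stat"].
labels = pd.Series([1, 1, 1, 2, 1, 2, 1, 1], name="sur_stat")

counts = labels.value_counts()                 # number of rows per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares.round(2))
```

Checking the proportions early shows whether the classes are imbalanced, which matters when judging any classifier built later.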

# Objective
Our main objective is to classify sur_stat as 1 or 2 based on the 3 features we have: age, operation year and number of positive axillary nodes. We will perform univariate and bivariate analysis and use it to arrive at a simple model. The target variable is sur_stat and the remaining three are the independent variables.

In [None]:
# Univariate analysis on each of the variable
# Univariate analysis on the "age" variable using histogram

sns.set_style("whitegrid")
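# FacetGrid's `size` argument was renamed to `height` in seaborn 0.9
# distplot is deprecated in seaborn >= 0.11 (histplot/kdeplot are the replacements)
sns.FacetGrid(haber,hue="sur_stat",height=5).map(sns.distplot,"age").add_legend()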

**Observations from the above figure:**

1. The PDFs of classes 1 and 2 overlap heavily, so age alone does not separate the classes.
2. Patients who survived 5 years or longer are most concentrated between ages 51 and 56.
3. Patients who died within 5 years are most concentrated between ages 40 and 58.
4. Patients aged 30 to 34 survived 5 years or longer.
5. Patients aged 77 to 83 died within 5 years.

In [None]:
# Univariate analysis on the "op_yr" variable using histogram

sns.set_style("whitegrid")
# Plot "op_yr" here (the original cell plotted "age" again); `size` is now `height`
sns.FacetGrid(haber,hue="sur_stat",height=5).map(sns.distplot,"op_yr").add_legend()

**Observations from the above figure:**

1. Here too the two PDFs overlap, so op_yr alone does not separate the classes.
2. The operations took place between 1958 and 1969.
3. For operations between 1960 and 1963 there is an increase in class 2, i.e. patients who died within 5 years.
4. For operations between 1963 and 1966 there is an increase in class 1, i.e. patients who survived 5 years or longer.


In [None]:
# Univariate analysis on the "aux" variable using histogram

sns.set_style("whitegrid")
sns.FacetGrid(haber,hue="sur_stat",height=5).map(sns.distplot,"aux").add_legend()

**Observations from the above figure:**

1. Many patients have a positive axillary node (PAN) count below 1.
2. Many of the patients with 0 to 1 PAN are in class 1.
3. Many patients with more than 2 PAN died within 5 years.
4. Patients with 28 to 30 PAN or 46 to 47 PAN are likely to have survived 5 years or longer.

In [None]:
# Using Boxplots for univariate analysis of each variable:

# Univariate analysis of "age" variable using Boxplot

sns.set_style("whitegrid")
sns.boxplot(x="sur_stat",y="age",data=haber)

**Observations from the above figure:**

1. If age > 30 and age < 33: class 1, but this alone will not help much in classifying.
2. If age > 78: class 2.
3. These two rules hold, but no rules can be written for the other age values because the two classes overlap there.
4. 50% of class 1 patients are between ages 43 and 60.
5. 50% of class 2 patients are between ages 45 and 62.
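
The "50% of patients" ranges read off a boxplot are the 25th and 75th percentiles of each class, so they can also be computed directly instead of being estimated from the figure. The sketch below uses made-up ages and labels for illustration; in the notebook the input would be `haber`.

```python
import pandas as pd

# Made-up ages and labels for illustration (not records from the dataset).
demo = pd.DataFrame(
    {
        "age":      [30, 40, 50, 60, 70, 42, 46, 54, 62, 80],
        "sur_stat": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    }
)

# 25th, 50th and 75th percentiles of age within each class:
# the same values a boxplot draws as the box edges and the median line.
quartiles = demo.groupby("sur_stat")["age"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

Each row of `quartiles` is one class; the 0.25 and 0.75 columns bound the middle 50% of that class.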

In [None]:
# Univariate analysis of "op_yr" variable using Boxplot

sns.set_style("whitegrid")
sns.boxplot(x="sur_stat",y="op_yr",data=haber)

**Observations from the above figure:**

1. Treating classes 1 and 2 together (an assumption made here), we can say that:
   - 25% of the operations happened between 1958 and 1959
   - 50% of the operations happened between 1959 and 1966
   - the other 25% of the operations happened between 1966 and 1969

In [None]:
# Univariate analysis of "aux" variable using Boxplot

sns.set_style("whitegrid")
sns.boxplot(x="sur_stat",y="aux",data=haber)


**Observations from the above figure:**

1. 50% of the patients who survived 5 years or longer have a PAN count in the range 0 to 3.
2. 50% of the patients who died within 5 years have a PAN count in the range 2 to 11.
3. There are also points above the whiskers. By default seaborn draws as outliers the points that lie more than 1.5 times the interquartile range beyond the box, so these are patients with unusually high PAN counts.
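
By default seaborn's boxplot extends the whiskers to the most extreme points within 1.5 times the interquartile range (IQR) of the box and draws anything beyond that as an outlier. A minimal sketch of that rule for the upper whisker, on made-up node counts (in the notebook the input would be the `aux` column of one class):

```python
import numpy as np

# Made-up node counts for illustration (not records from the dataset).
aux = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 25])

q1, q3 = np.percentile(aux, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr   # points above this value are drawn as outliers

outliers = aux[aux > upper_fence]
print(q1, q3, upper_fence, outliers)
```

Because the fence depends on the quartiles, a skewed variable such as a node count can show many outliers without any of them being data errors.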

In [None]:
#Univariate analysis using violin plots:

sns.set_style("whitegrid")
# `size` is not a violinplot parameter, so it is dropped here
sns.violinplot(x="sur_stat",y="aux",data=haber)

**Observations from the above figure:**

1. Patients with fewer than 2 PAN (positive axillary nodes) are likely to survive 5 or more years.
2. Patients with more than 7 PAN are likely to die within 5 years.
3. Class 2 patients have PAN counts spread over the range 0 to 60.
4. Class 1:
   * 75% of patients: 0 - 3 aux nodes
   * 25% of patients: 3 - 7 aux nodes
5. Class 2:
   * 25% of patients: 0 - 1 aux nodes
   * 50% of patients: 1 - 11 aux nodes
   * 25% of patients: 11 - 24 aux nodes
6. There are also points beyond the whiskers, but they are considered outliers.
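
The PAN thresholds noted above suggest what a very simple rule-based model could look like. The sketch below is an illustration only, not a model fitted in this notebook: `predict_by_aux` is a hypothetical helper, the threshold of 2 is taken from the first observation, and the node counts and labels are made up.

```python
import numpy as np

def predict_by_aux(aux, threshold=2):
    """Hypothetical rule: predict class 1 (survived) when aux < threshold, else class 2."""
    aux = np.asarray(aux)
    return np.where(aux < threshold, 1, 2)

# Made-up node counts and labels for illustration (not records from the dataset).
aux_demo = np.array([0, 1, 0, 3, 8, 12, 1, 5])
y_demo = np.array([1, 1, 1, 1, 2, 2, 2, 2])

pred = predict_by_aux(aux_demo)
accuracy = (pred == y_demo).mean()   # fraction of rows where the rule matches the label
print(pred, accuracy)
```

On the real data the accuracy of such a rule should be compared with always predicting the majority class, since the classes are not equally frequent.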

In [None]:
#Bivariate analysis using 2d scatter plots (PAIR PLOTS)

sns.set_style("whitegrid")
# pairplot's `size` argument was renamed to `height` in seaborn 0.9
sns.pairplot(haber,hue="sur_stat",height=2)
plt.show()

**Observations from the above figure:**

1. There are 6 scatter plots in total, 3 of which are mirror images of the other 3.
2. None of the plots separates the categories of sur_stat cleanly.
3. Of the 3 distinct plots, the op_yr vs aux plot is the most helpful for classifying.

In [None]:
# Using PCA to visualize the data
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition

# Separate the features from the class label
haber_data=haber.drop('sur_stat',axis=1)
haber_label=haber['sur_stat']

# Standardize each feature to zero mean and unit variance before PCA
std_data=StandardScaler().fit_transform(haber_data)

# Project the 3 standardized features onto the first 2 principal components
pca=decomposition.PCA(n_components=2)
pca_data=pca.fit_transform(std_data)

# Stack the 2 components and the labels column-wise, then create a data frame
final_data=np.vstack((pca_data.T,haber_label)).T
data0=pd.DataFrame(final_data,columns=('1stDim','2ndDim','labels'))

# Visualizing the data
sns.FacetGrid(data=data0,hue='labels',height=6).map(plt.scatter,'1stDim','2ndDim').add_legend()
plt.show()
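
Before reading a 2-D PCA plot it helps to know how much of the total variance the two retained components carry. scikit-learn exposes this as `pca.explained_variance_ratio_`. The sketch below computes the same ratios with NumPy alone, on a made-up 3-feature matrix, so it runs without the dataset; in the notebook the input would be `std_data`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 3-feature matrix for illustration; in the notebook this would be the raw features.
X = rng.normal(size=(50, 3))
X[:, 1] += 0.8 * X[:, 0]   # make two of the columns correlated

# Standardize to zero mean and unit variance, as StandardScaler does.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# The squared singular values of the centered data are proportional to
# the variance along each principal axis.
s = np.linalg.svd(Xs, compute_uv=False)
ratio = s**2 / np.sum(s**2)

print(ratio)            # share of variance per component, largest first
print(ratio[:2].sum())  # share retained by the first two components
```

If the first two components retain only a modest share, overlap between classes in the 2-D plot may partly reflect information lost in the projection.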



In [None]:
# Using t-SNE for visualizing
from sklearn.manifold import TSNE
# Creating the model
# Note: scikit-learn 1.5 renamed n_iter to max_iter; use max_iter=5000 on newer versions
tsne_model=TSNE(n_components=2,random_state=0,perplexity=40,n_iter=5000)
# Fitting the data to the model
data1=tsne_model.fit_transform(std_data)
#Now appending the labels to the data using vstack
data2=np.vstack((data1.T,haber_label)).T
#Now creating the dataframe for the stacked data that we made
tsne_df=pd.DataFrame(data2,columns=('1st','2nd','labels'))

#Visualizing t-sne using Seaborn
sns.FacetGrid(data=tsne_df,hue='labels',height=6).map(plt.scatter,'1st','2nd').add_legend()
plt.show()