In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
haberman = pd.read_csv('../input/habermans-survival-data-set/haberman.csv', names = ['age','operation_year','axil_nodes','survived_status'])
haberman.head(5)

Introduction About Haberman Data Set:

1. The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

2. Attribute Information:

a. Age of patient at time of operation (numerical)

b. Patient's year of operation (year - 1900, numerical)

c. Number of positive axillary nodes detected (numerical)

d. Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years

EDA: Based on this data, the goal is to classify whether a new patient's survival status would be 1 or 2.

In [None]:
haberman.shape

In [None]:
haberman.columns

In [None]:
haberman["survived_status"].value_counts()

Observation/Conclusions:

1. The Haberman dataset consists of 4 columns: 'age', 'operation_year', 'axil_nodes' and 'survived_status'. The 'survived_status' column is used as the class label, and the remaining columns are the features.
2. The dataset consists of 306 data points.
3. There are 225 data points with survived_status = 1 and 81 with survived_status = 2, so we can treat it as an imbalanced dataset.
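
The imbalance noted above can be expressed as class proportions. The following is a minimal sketch that rebuilds the label column from the two counts reported above. The helper name `class_proportions` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def class_proportions(labels: pd.Series) -> pd.Series:
    """Return the share of each class label, largest first."""
    return labels.value_counts(normalize=True)

# Labels rebuilt from the counts reported above (225 with status 1, 81 with status 2).
# In the notebook itself, pass haberman["survived_status"] instead.
labels = pd.Series([1] * 225 + [2] * 81, name="survived_status")
proportions = class_proportions(labels)
print(proportions.round(3))
```

About 73.5% of the patients fall in class 1, so a model that always predicts 1 would already reach about 73.5% accuracy. Accuracy alone is therefore a weak yardstick for this dataset.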

In [None]:
plt.scatter(haberman["age"],haberman["operation_year"],color='r')
plt.xlabel("age")
plt.ylabel("operation_year")
plt.title("Age Vs Operation_year")
plt.show()

In [None]:
plt.scatter(haberman["age"],haberman["axil_nodes"],color='b')
plt.xlabel("age")
plt.ylabel("axil_nodes")
plt.title("Age Vs axil_nodes")
plt.show()

In [None]:
plt.scatter(haberman["axil_nodes"],haberman["operation_year"],color='g')
plt.xlabel("axil_nodes")
plt.ylabel("operation_year")
plt.title("axil_nodes Vs operation_year")
plt.show()

Observation/Conclusions:

1. Scatter plot of age vs operation_year:
  a. The points are thoroughly mixed, and since this plot does not color points by class, it cannot be used to classify the survived_status.
  b. Many of the patients who underwent the operation were between 40 and 65 years old.

2. Scatter plot of age vs axil_nodes:
  a. Most points are concentrated at axil_nodes = 0.

3. Scatter plot of axil_nodes vs operation_year:
  a. Most of the operated patients had between 0 and 10 positive axillary nodes.

Pair Plots:

With 3 features, the number of unique pair plots is 3C2 = 3. We then try to understand which features are useful for classifying the survived_status.
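
The count of 3 unique pairs can be checked by enumerating the feature combinations. This is a small sketch using only the standard library; the variable names `features` and `feature_pairs` are illustrative.

```python
from itertools import combinations

features = ["age", "operation_year", "axil_nodes"]

# Every unordered pair of distinct features corresponds to one unique scatter panel.
feature_pairs = list(combinations(features, 2))

print(len(feature_pairs))
for x, y in feature_pairs:
    print(x, "vs", y)
```

The pair plot grid itself is 3 x 3: each pair appears twice (mirrored across the diagonal), and the diagonal shows each feature's own distribution.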

In [None]:
sns.set_style('whitegrid')
sns.pairplot(haberman,hue='survived_status',height=3)
plt.show()

In [None]:
sns.set_style('whitegrid')
sns.FacetGrid(haberman,hue ='survived_status',height=5).map(plt.scatter,'age','operation_year').add_legend()
plt.title("Age Vs Operation_year")
plt.show()

In [None]:
sns.set_style('whitegrid')
sns.FacetGrid(haberman,hue ='survived_status',height=5).map(plt.scatter,'age','axil_nodes').add_legend()
plt.title("Age Vs axil_nodes")
plt.show()

In [None]:
sns.set_style('whitegrid')
sns.FacetGrid(haberman,hue ='survived_status',height=5).map(plt.scatter,'axil_nodes','operation_year').add_legend()
plt.title("axil_nodes Vs operation_year")
plt.show()

Observation/Conclusions from the Pair Plot and Scatter Plot:

 These pair plots and scatter plots are built using seaborn, so the color now lets us easily distinguish the class variable survived_status.

1. Scatter plot of age vs operation_year:
  a. The data still look mixed, but the colors now show which survived_status each point belongs to.
  b. Many of the patients who underwent the operation were between 40 and 65 years old, and most of them have survived_status 1.

2. Scatter plot of age vs axil_nodes:
  a. Most points are concentrated at axil_nodes = 0, and most of those patients have survived_status 1.
  b. Very few patients have more than 30 axillary nodes.

3. Scatter plot of axil_nodes vs operation_year:
  a. Most of the operated patients had between 0 and 10 positive axillary nodes.

In [None]:
sns.set_style('whitegrid')
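# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay draws the equivalent plot
sns.FacetGrid(haberman,hue ='survived_status',height=5).map(sns.histplot,'age',kde=True,stat='density').add_legend()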
plt.title("Age")
plt.show()

In [None]:
sns.set_style('whitegrid')
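# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay draws the equivalent plot
sns.FacetGrid(haberman,hue ='survived_status',height=5).map(sns.histplot,'operation_year',kde=True,stat='density').add_legend()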
plt.title("operation_year")
plt.show()

In [None]:
sns.set_style('whitegrid')
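# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay draws the equivalent plot
sns.FacetGrid(haberman,hue ='survived_status',height=5).map(sns.histplot,'axil_nodes',kde=True,stat='density').add_legend()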
plt.title("axil_nodes")
plt.show()

Observation/Conclusions from the Density Plot or PDF (Probability Density Function):

1. Density plot for age:
  a. The two smoothed histograms (one per survived_status) overlap heavily.
  b. Patients between 40 and 60 appear somewhat more likely to have survived status 2 (died), and patients younger than 40 appear more likely to have survived status 1.

2. Density plot for operation_year:
  a. The two smoothed histograms (one per survived_status) overlap heavily.
  b. Many of the patients operated on between years 60 and 65 have survived status 2 (died).

3. Density plot for axil_nodes:
  a. The two smoothed histograms (one per survived_status) overlap, though less than for the other two features.
  b. Many of the patients with 0 axillary nodes have survived_status 1.

Univariate Analysis:

From the three PDFs, we can rank the features by how useful they are for classification:
  axil_nodes > age > operation_year

In [None]:
print("Mean of age of people whose survived_status is 1 =",np.mean(haberman.age[haberman['survived_status']==1]))
print("Mean of age of people whose survived_status is 2 =",np.mean(haberman.age[haberman['survived_status']==2]))
print("Mean of axial_node of people whose survived_status is 1 =",np.mean(haberman.axil_nodes[haberman['survived_status']==1]))
print("Mean of axial_node of people whose survived_status is 2 =",np.mean(haberman.axil_nodes[haberman['survived_status']==2]))
print("Standard Deviation of age of people whose survived_status is 1 =",np.std(haberman.age[haberman['survived_status']==1]))
print("Standard Deviation of age of people whose survived_status is 2 =",np.std(haberman.age[haberman['survived_status']==2]))
print("Standard Deviation of axial_node of people whose survived_status is 1 =",np.std(haberman.axil_nodes[haberman['survived_status']==1]))
print("Standard Deviation of axial_node of people whose survived_status is 2 =",np.std(haberman.axil_nodes[haberman['survived_status']==2]))
print("90 percentile of age of people whose survived_status is 1 =",np.percentile(haberman.age[haberman['survived_status']==1],90))
print("90 percentile of age of people whose survived_status is 2 =",np.percentile(haberman.age[haberman['survived_status']==2],90))
print("90 percentile of axial_node of people whose survived_status is 1 =",np.percentile(haberman.axil_nodes[haberman['survived_status']==1],90))
print("90 percentile of axial_node of people whose survived_status is 2 =",np.percentile(haberman.axil_nodes[haberman['survived_status']==2],90))

Conclusions using Mean and Standard Deviation:

 Taking age into consideration:
 1. Most patients with survived status 1 are between 41 and 63 years old (mean ± one standard deviation).
 2. Most patients with survived status 2 are between 43 and 64 years old (mean ± one standard deviation).
 3. Based on age alone, it is hard to say which survived_status a patient belongs to: the 90th percentile shows that 90% of the patients in either class are below about 67 years old.
 Taking axil_nodes into consideration:
 1. For survived status 1, mean ± one standard deviation spans -3 to 8.5 axillary nodes. A node count cannot be negative, so in practice this means 0 to about 8.5.
 2. For survived status 2, mean ± one standard deviation spans -1 to 16.5 axillary nodes, which in practice means 0 to about 16.5.
 3. From the percentiles, 90% of the patients with survived status 1 have at most 8 axillary nodes, and 90% of the patients with survived status 2 have at most 20.
 
 We didn't consider mean for the operation_year because that is year in the gap of 5 years.
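
The mean ± one standard deviation ranges quoted above can be computed with a small helper that optionally clips the lower end, which matters for counts such as axil_nodes that cannot go below 0. This is a sketch under stated assumptions: the helper name `one_sigma_range` and the toy values are hypothetical and not taken from the dataset.

```python
import numpy as np

def one_sigma_range(values, lower_bound=None):
    """Return (mean - std, mean + std), optionally clipped below at lower_bound."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0), NumPy's default
    low, high = mean - std, mean + std
    if lower_bound is not None:
        low = max(low, lower_bound)
    return low, high

# Toy, made-up counts used only to show the clipping; not taken from the dataset.
toy_counts = [0, 0, 0, 4]
raw_range = one_sigma_range(toy_counts)
clipped_range = one_sigma_range(toy_counts, lower_bound=0)
print(raw_range, clipped_range)
```

In the notebook, the same helper could be applied to `haberman.axil_nodes[haberman['survived_status']==1]`. For a strongly skewed feature like this one, percentiles are usually a more faithful summary than a mean ± standard deviation range.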

In [None]:
sns.set_style('whitegrid')
sns.boxplot(x='survived_status',y='age',data=haberman)
plt.title("Box Plot for Age")
plt.show()

In [None]:
sns.set_style('whitegrid')
sns.boxplot(x='survived_status',y='operation_year',data=haberman)
plt.title("Box Plot for operation_year")
plt.show()

In [None]:
sns.set_style('whitegrid')
sns.boxplot(x='survived_status',y='axil_nodes',data=haberman)
plt.title("Box Plot for axil_nodes")
plt.show()

Conclusions using Box Plot:

 Box Plot for Age:
 1. For survived_status 1, the (25th, 50th, 75th) percentiles of age are (43, 52, 60).
 2. For survived_status 2, the (25th, 50th, 75th) percentiles of age are (46, 53, 61).
 3. The quartiles of the two classes differ by only 1 to 3 years, so age separates them weakly. Patients younger than 43 are somewhat more likely to have survived_status 1, and patients older than 60 somewhat more likely to have survived_status 2. In simple terms, younger patients lean toward status 1 and older patients toward status 2.
 Box Plot for operation_year:
 1. For survived_status 1, the (25th, 50th, 75th) percentiles of operation_year are (60, 63, 66).
 2. For survived_status 2, the (25th, 50th, 75th) percentiles of operation_year are (59, 63, 65).
 3. Irrespective of their age or number of axillary nodes, most patients were operated on between the years 59 and 66.
 Box Plot for axil_nodes:
 1. For survived_status 1, the (25th, 50th, 75th) percentiles of axil_nodes are (0, 0, 3).
 2. For survived_status 2, the (25th, 50th, 75th) percentiles of axil_nodes are (1, 4, 11).
 3. The fewer the axillary nodes, the more the status tends toward 1; as the number of axillary nodes increases, the survived_status shifts toward 2.
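
The quartiles above are read off the box plots by eye. They can also be computed directly per class. The following is a minimal sketch, assuming a DataFrame with the same column names as `haberman`; the helper name `quartiles_by_class` and the toy frame are hypothetical.

```python
import pandas as pd

def quartiles_by_class(df, value_col, class_col="survived_status"):
    """Return the 25th, 50th and 75th percentiles of value_col for each class."""
    return df.groupby(class_col)[value_col].quantile([0.25, 0.5, 0.75]).unstack()

# Toy, made-up frame used only to show the output shape; in the notebook, pass haberman.
toy = pd.DataFrame({
    "survived_status": [1] * 5 + [2] * 5,
    "axil_nodes": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})
toy_quartiles = quartiles_by_class(toy, "axil_nodes")
print(toy_quartiles)
```

The result has one row per class and one column per quantile, so `quartiles_by_class(haberman, "age")` would give the age quartiles for both classes in a single table.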

In [None]:
sns.set_style('whitegrid')
sns.violinplot(x='survived_status',y='age',data=haberman)
plt.title("Violin Plot for Age")
plt.show()

In [None]:
sns.set_style('whitegrid')
sns.violinplot(x='survived_status',y='operation_year',data=haberman)
plt.title("Violin Plot for operation_year")
plt.show()

In [None]:
sns.set_style('whitegrid')
sns.violinplot(x='survived_status',y='axil_nodes',data=haberman)
plt.title("Violin Plot for axil_nodes")
plt.show()

Conclusions using Violin Plot:
A violin plot combines the information of a PDF and a box plot in a single plot.
 Violin Plot for axil_nodes:
 1. The violin plot for axil_nodes shows that patients with 0 axillary nodes mostly have survived_status 1.

Final Conclusion:
 1. From the scatter plot of age vs operation_year, most operations were performed on patients between 40 and 60 years old.
 2. From the scatter plot of axil_nodes vs operation_year, most operations were performed between 1960 and 1966.
 3. From the density plot for age, many of the patients who died were older than 40; from the density plot for axil_nodes, patients with few axillary nodes tend to survive.
 4. Patients who are older than 50 and have more than 10 axillary nodes are more likely to die (survived_status = 2).
 5. The majority of the patients who died had between 1 and 24 axillary nodes.
 6. Patients with 0 axillary nodes tend to survive (survived_status = 1) irrespective of their age.
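
Subgroup statements such as "older than 50 with more than 10 axillary nodes" can be checked numerically rather than read from the plots. Below is a hedged sketch: the helper name `death_rate` and the toy rows are hypothetical, and the printed numbers describe the toy rows only, not the Haberman data.

```python
import pandas as pd

def death_rate(df, mask, class_col="survived_status", died_label=2):
    """Return (rate, n): the share of masked rows with the died label, and the subgroup size."""
    subgroup = df[mask]
    if len(subgroup) == 0:
        return float("nan"), 0
    return float((subgroup[class_col] == died_label).mean()), len(subgroup)

# Toy, made-up rows used only to exercise the helper; in the notebook, pass haberman.
toy = pd.DataFrame({
    "age": [45, 55, 60, 62, 38, 70],
    "axil_nodes": [0, 12, 15, 0, 2, 11],
    "survived_status": [1, 2, 2, 1, 1, 1],
})
older_many_nodes = (toy["age"] > 50) & (toy["axil_nodes"] > 10)
rate, n = death_rate(toy, older_many_nodes)
print(rate, n)
```

Reporting the subgroup size next to the rate is a deliberate choice: with only 306 patients, some subgroups are small enough that their rates should be read with caution.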