# Exploratory Data Analysis: Haberman's Survival

# Introduction

    This dataset contains cases from a study of the survival of breast cancer patients who underwent surgery at the University of Chicago's Billings Hospital. The study was conducted between 1958 and 1970.
    1. It contains 306 data points / rows
    2. It has four columns: three features and one class label

## Features / Columns Explanation:
* age (first column):
      Age of the patient at the time of the operation.
* operation_year (second column):
      Year in which the patient underwent surgery.
* axillary_nodes (third column):
      Number of axillary lymph nodes affected by cancer.
* survival_status (fourth column, class label):
      Whether the patient survived after breast cancer surgery.
        1 = The patient survived 5 years or more after surgery
        2 = The patient died within 5 years after surgery
    

## Objective:
  Find the key features that determine a patient's long-term survival after surgery.

## 1. Importing Libraries and loading dataset

In [None]:
%config Completer.use_jedi = False
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook
# Check which files are available in the input directory
import os
print(os.listdir('../input'))
# Load the Haberman dataset; the CSV has no header row, so column names are assigned here
haberman = pd.read_csv(
    '../input/habermans-survival-data-set/haberman.csv',
    names=['age', 'operation_year', 'axillary_nodes', 'survival_status'],
)


## 2. Understanding Data

In [None]:
# printing first five rows of data
haberman.head(5)

In [None]:
# Counting Number of Rows and Columns of our dataset
haberman.shape

### Observation:
    1. This dataset contains 306 rows and 4 columns / features

In [None]:
# prints column names of dataset
print(haberman.columns)

In [None]:
print(haberman.info())

### Observations:
        1. There are no missing values in any column
        2. All columns are of integer data type
        3. The "survival_status" column is of integer data type, but it can be converted into a categorical data type
        4. In the "survival_status" column:
                                4.1: value 1 can be mapped to "yes", meaning the patient survived 5 years or more
                                4.2: value 2 can be mapped to "no", meaning the patient died within 5 years
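
The first two observations can also be checked programmatically. The following is a minimal sketch on a hypothetical four-row frame in the same column layout (the values are made up for illustration, not taken from the dataset); the same two checks could be run on `haberman`.

```python
import pandas as pd

# Hypothetical toy rows in the same four-column layout (not the real data).
toy = pd.DataFrame(
    {
        "age": [30, 34, 42, 61],
        "operation_year": [64, 60, 59, 65],
        "axillary_nodes": [1, 0, 20, 3],
        "survival_status": [1, 1, 2, 2],
    }
)

# Number of missing values per column; all zeros means there are no gaps.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only when every column has an integer dtype.
all_integer = all(pd.api.types.is_integer_dtype(t) for t in toy.dtypes)
print(all_integer)
```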

In [None]:
haberman['survival_status'] = haberman['survival_status'].map({1:"yes",2:"no"})
haberman.head()
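
Mapping to "yes" / "no" produces string labels rather than a true pandas categorical column. If a categorical dtype is wanted, one possible extra step is `astype("category")`. This is a sketch on made-up labels, and the variable names are only for illustration.

```python
import pandas as pd

# Toy class labels in the original 1 / 2 encoding (made up for illustration).
status = pd.Series([1, 1, 2, 1, 2], name="survival_status")

# Step 1: map the integer codes to readable string labels.
labels = status.map({1: "yes", 2: "no"})

# Step 2: convert the string labels to the pandas categorical dtype.
categorical = labels.astype("category")

print(categorical.dtype)
print(list(categorical.cat.categories))
```

A categorical column stores each distinct label once, which keeps plots and group operations consistent about the set of classes.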

In [None]:
haberman.describe()

### Observations:
1. Count : Total number of values present in respective columns.

2. Mean: Mean of all the values present in the respective columns.

3. Std: Standard Deviation of the values present in the respective columns.

4. Min: The minimum value in the column.

5. 25%: Gives the 25th percentile value.

6. 50%: Gives the 50th percentile value.

7. 75%: Gives the 75th percentile value.

8. Max: The maximum value in the column.


In [None]:
# Checking the total numbers of yes and no in  survival status column
haberman['survival_status'].value_counts()

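## Observations:
1. Out of 306 patients, 225 (about 73.5%) survived 5 years or more, while 81 (about 26.5%) died within 5 years.
2. The dataset is imbalanced: the "yes" class is almost three times as large as the "no" class.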
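
To put a number on the imbalance, the class proportions can be computed with `value_counts(normalize=True)`. The sketch below rebuilds only the label column from the two counts reported above, so it runs without the CSV file.

```python
import pandas as pd

# Rebuild the label column from the reported counts: 225 "yes" and 81 "no".
labels = pd.Series(["yes"] * 225 + ["no"] * 81, name="survival_status")

# Share of each class instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))

# How many survivors there are for every patient who died.
imbalance_ratio = proportions["yes"] / proportions["no"]
print(round(imbalance_ratio, 2))
```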

In [None]:
survived = haberman[haberman['survival_status'] == 'yes']
died  = haberman[ haberman['survival_status'] == 'no']
survived.describe()

In [None]:
died.describe()

In [None]:
print(f"Median(died): {np.median(died['axillary_nodes'])}")
print(f"Median(survived): {np.median(survived['axillary_nodes'])}")

In [None]:
# importing median absolute deviation
from statsmodels.robust import mad
print(f"MAD(died): {mad(died['axillary_nodes'])}")
print(f"MAD(survived): {mad(survived['axillary_nodes'])}")


## Observations:
      1. The mean number of axillary nodes differs by almost 5 between patients who died and patients who survived.
      2. The median number of axillary nodes differs by almost 4 between the two groups.
      3. Patients who survived have fewer axillary nodes on average than patients who died.
      4. The mean age of the two groups is almost the same.
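
The median absolute deviation (MAD) used above is a spread measure that, like the median, is not pulled around by a few extreme values. Below is a minimal NumPy sketch of the raw definition on made-up node counts. The `scale` argument is an assumption added for illustration: to my knowledge `statsmodels.robust.mad` divides by a constant of about 0.6745 by default, and its default centering has differed between versions, so its output will generally not equal the raw value computed here.

```python
import numpy as np

def median_abs_deviation(values, scale=1.0):
    """Median of absolute deviations from the median, divided by `scale`."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    return np.median(np.abs(values - center)) / scale

# Made-up node counts with one extreme value.
toy_nodes = [0, 0, 1, 2, 4, 7, 52]

raw_mad = median_abs_deviation(toy_nodes)
std_dev = np.std(toy_nodes)

print(f"raw MAD: {raw_mad}")
print(f"standard deviation: {std_dev:.2f}")
```

The single large value inflates the standard deviation far more than the MAD, which is why the median and MAD are a useful pair for a skewed feature such as `axillary_nodes`.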

## 3. Univariate Analysis
      The purpose of this analysis is to describe, summarize, and analyze patterns in a single feature at a time.

## 3.1 PDF
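A Probability Density Function (PDF) describes how likely the variable is to fall near a value x; the smooth curve in the plots below is a kernel density estimate, which acts as a smoothed version of the histogram.
Here the histograms are normalized to a density, so the area of the bars for each group sums to 1 and a taller bar means a larger share of that group's data points falls in that bin.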
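
With a density-normalized histogram, it is the area of a bar (height times width), not the height alone, that gives the fraction of points in a bin. The sketch below checks this on synthetic values drawn from a normal distribution; the sample has nothing to do with the Haberman data.

```python
import numpy as np

# Synthetic sample, used only to illustrate the normalization.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=500)

# density=True rescales the bar heights so that the total bar area is 1.
density, edges = np.histogram(sample, bins=20, density=True)
widths = np.diff(edges)

# Area of each bar = fraction of the sample that falls in that bin.
fractions = density * widths
total_area = fractions.sum()

print(f"total area under the histogram: {total_area:.6f}")
```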

In [None]:
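# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True
# (seaborn >= 0.11) draws a comparable density histogram with a KDE curve
sns.FacetGrid(data=haberman, hue='survival_status', height=4)\
.map(sns.histplot, 'age', bins=40, kde=True, stat='density')\
.add_legend()
plt.show()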

## Observations:
1. Major overlap is observed, which suggests that age alone does not determine patient survival.
2. Although there is overlap, we can vaguely tell that patients aged 22 to 38 are more likely to survive, patients aged 39 to 58 or 76 to 95 have a lower chance of survival, and patients aged 59 to 75 have a roughly equal chance either way.

In [None]:
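# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True
# (seaborn >= 0.11) draws a comparable density histogram with a KDE curve
sns.FacetGrid(data=haberman, hue='survival_status', height=5)\
.map(sns.histplot, 'operation_year', kde=True, stat='density')\
.add_legend()
plt.show()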

## Observations:
1. Major overlap is observed, which suggests that operation year on its own is not a deciding factor for whether a patient survives, so it is a weak feature for modeling.
2. Although there is overlap, we can vaguely tell that operations between 1955-1957 and 1963-1966 were less successful, while operations between 1958-1961 were more successful.

In [None]:
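# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True
# (seaborn >= 0.11) draws a comparable density histogram with a KDE curve
sns.FacetGrid(data=haberman, hue='survival_status', height=5)\
.map(sns.histplot, 'axillary_nodes', kde=True, stat='density')\
.add_legend()
plt.show()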


## Observations:
1. Patients with no or one node are more likely to survive. Patients with 5 or more nodes are less likely to survive.

## 3.2 Cumulative Distribution Function (CDF)
The Cumulative Distribution Function (CDF) is the probability that the variable takes a value less than or equal to x.
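
As an alternative to the binned approach, an empirical CDF can be built directly from sorted values. The sketch below uses a hypothetical helper, `empirical_cdf`, and made-up node counts.

```python
import numpy as np

def empirical_cdf(values):
    """Return sorted values and, for each, the share of points at or below it."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Made-up node counts for illustration.
toy_nodes = [0, 0, 1, 3, 8]
x, y = empirical_cdf(toy_nodes)
print(list(zip(x, y)))

# The CDF at a threshold is simply the share of values at or below it.
share_at_most_1 = np.mean(np.asarray(toy_nodes) <= 1)
print(share_at_most_1)
```

The empirical CDF does not depend on a choice of bins, so thresholds read from it are exact for the sample.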

In [None]:
# Axillary node counts of patients who survived
status_yes = haberman[haberman['survival_status'] == 'yes']['axillary_nodes']
counts, bin_edges = np.histogram(status_yes, bins=10, density=True)
pdf = counts / counts.sum()   # fraction of patients in each bin
cdf = np.cumsum(pdf)          # running total of those fractions
plt.xlabel('Axillary Nodes')
plt.plot(bin_edges[1:], pdf, label='PDF (yes)')
plt.plot(bin_edges[1:], cdf, label='CDF (yes)')
print(f'Survived PDF: {pdf}')
print(f'Survived Bin Edges: {bin_edges[1:]}')

# Axillary node counts of patients who died
status_no = haberman[haberman['survival_status'] == 'no']['axillary_nodes']
counts, bin_edges = np.histogram(status_no, bins=10, density=True)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF (no)')
plt.plot(bin_edges[1:], cdf, label='CDF (no)')

# Optional: annotate each point of the last CDF with its value
# for x, y in zip(bin_edges[1:], cdf):
#     plt.annotate(f"{y:.2f}", (x, y), textcoords="offset points",
#                  xytext=(0, 10), ha='center')
plt.legend()
plt.grid()
plt.show()




## Observations:
1. About 83.55% of patients who survived have axillary nodes in the first bin, from 0 to 4.6.
2. About 58% of patients who died also have fewer than 5 nodes.
## 3.3 Box Plot and Violin Plot

In [None]:
sns.boxplot(y='age' , x = 'survival_status', data = haberman)
plt.show()
sns.boxplot(x = 'survival_status' , y = 'operation_year',data = haberman)
plt.show()
sns.boxplot(x = 'survival_status' , y = 'axillary_nodes' , data = haberman)
plt.show()

In [None]:
sns.violinplot(x='survival_status',y = 'age' , data = haberman)
plt.show()
sns.violinplot(x='survival_status',y = 'operation_year' , data = haberman)
plt.show()
sns.violinplot(x='survival_status',y = 'axillary_nodes' , data = haberman)
plt.show()

## Observations:
1. Most patients who died within the first five years are between 45 and 55 years old, but because of the overlap, age cannot be a deciding factor for classification.
2. Comparatively more of the patients operated on in 1965 did not survive for five years.
3. Most people who survived for five years have 0 or 1 node, but having 0 or 1 node did not always guarantee long-term survival. The violin plot shows some cases where patients with 0 or 1 node died within five years of surgery.
4. Significant overlap is observed in the box and violin plots for age and operation year, whereas the overlap is smaller in the axillary nodes plots.
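
A box plot is a picture of a few summary numbers: the quartiles, the interquartile range (IQR), and the points beyond the whiskers. The sketch below computes them with NumPy for made-up node counts. It assumes the common 1.5 × IQR whisker rule, which is the matplotlib default that seaborn builds on; `box_stats` is a hypothetical helper name.

```python
import numpy as np

def box_stats(values, whisker=1.5):
    """Quartiles, IQR, and the points outside the whisker fences."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}

# Made-up node counts with one extreme value.
toy_nodes = [0, 0, 1, 1, 2, 3, 4, 30]
stats = box_stats(toy_nodes)
print(stats)
```

Running the same helper separately on the "yes" and "no" groups would give the numbers behind each box, which makes the amount of overlap easier to state precisely.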

## 4. Bivariate Analysis

## 4.1 Scatter Plot

In [None]:
sns.set_style('whitegrid')
sns.FacetGrid(data = haberman , hue = 'survival_status',height = 10 )\
.map(sns.scatterplot ,'age', 'axillary_nodes')\
.add_legend()
plt.show()

## Observations:
1. Most patients with 0 or 1 node survived, but some died within 5 years.
2. The chance of survival is higher when a person has fewer nodes.
3. There are no patients younger than 29.
4. Patients aged 50 or more with 10 or more nodes have a lower chance of survival.
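
Observations read off a scatter plot can be checked by computing the survival rate inside and outside a region. This is a sketch on a made-up six-row frame that assumes the "yes" / "no" labels used earlier; `survival_rate` is a hypothetical helper, and the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "age": [35, 52, 60, 47, 66, 58],
        "axillary_nodes": [0, 12, 15, 1, 11, 0],
        "survival_status": ["yes", "no", "no", "yes", "yes", "yes"],
    }
)

def survival_rate(frame, mask):
    """Share of rows selected by `mask` whose survival_status is 'yes'."""
    subset = frame[mask]
    return (subset["survival_status"] == "yes").mean()

# Region described in the observation: age 50 or more and 10 or more nodes.
older_many_nodes = (toy["age"] >= 50) & (toy["axillary_nodes"] >= 10)

rate_in_group = survival_rate(toy, older_many_nodes)
rate_elsewhere = survival_rate(toy, ~older_many_nodes)
print(rate_in_group, rate_elsewhere)
```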

## 4.2 Pair Plots:

In [None]:
sns.set_style('whitegrid')
sns.pairplot(data = haberman , hue = 'survival_status' , height = 5)
plt.show()

## Observations:
1. Significant overlap is observed in every panel, but the plot of operation_year against axillary_nodes separates the classes comparatively better.

## 5. Multivariate Analysis (Contour Plots)

In [None]:
sns.jointplot(x='age',y ='axillary_nodes' ,data = haberman , kind = 'kde' , height = 10 )

## Observations:
1. Most patients are in the age range 48-55 years, and patients in this age range mostly have 0 or 1 node.


In [None]:
sns.jointplot(y = haberman.age , x = haberman.operation_year , kind = 'kde')

## Observations:
1. Most operations were performed between 1958 and 1964 on patients aged 44-55 years.

## Conclusions:
1. A patient's age and operation year cannot decide the patient's long-term survival.
2. Axillary nodes is the most informative feature, since patients with fewer nodes survived more often, but it cannot decide long-term survival on its own. We also saw cases where patients with 0 or 1 node died.
3. Based on these features alone, we cannot reliably predict whether a person will survive for 5 years or more.