# Exploratory data analysis (EDA) on Haberman's Survival dataset

### **About the Dataset**

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Number of Instances (rows): **306**

Number of Attributes (columns): **4** (including the class attribute; all attributes have **numeric** values)

### Attribute Information

1. **Age** of patient at time of operation

2. Patient's **year** of operation

3. Number of positive axillary **nodes** detected
<br>(**Positive axillary nodes** are lymph nodes in the underarm area that contain cancer. <br>These are also the first lymph nodes to which breast cancer is likely to spread.)
     
4. Survival **status** (class attribute) 
<br>1 = the patient survived 5 years or longer 
<br>2 = the patient died within 5 years

Missing Attribute Values: None

Source: https://www.kaggle.com/gilsousa/habermans-survival-data-set/data


## OBJECTIVE   

Classify a new patient who has undergone breast cancer surgery into one of 2 classes (the patient will survive 5 years or longer, or will die within 5 years) based on the 3 features (age, year, nodes).

### 1. Importing libraries

In [None]:
# importing the required python libraries

import numpy as np  
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline

### 2. Loading dataset

In [None]:
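# loading data into a pandas DataFrame
# NOTE: read_csv treats the first row as the header by default; if this CSV has no
# header row, pass header=None so that the first record is not consumed as column names
haberman_df = pd.read_csv('../input/habermans-survival-data-set/haberman.csv')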

### 3. Basic analysis of df

In [None]:
#to view the dataframe 
haberman_df

In [None]:
# assigning descriptive column names
haberman_df.columns = ['age', 'year', 'nodes', 'status']
haberman_df

**Observations :**

The **status** of a patient is hard to read, as the class attribute holds numeric codes: <br>
**1 =** the patient **survived** 5 years or longer<br>**2 =** the patient **died** within 5 years



In [None]:
# replacing 1 with 'survived' and 2 with 'died' using map()
haberman_df['status'] = haberman_df['status'].map({1: 'survived', 2: 'died'})
haberman_df
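
One caveat when using `Series.map` with a dict: any value that is missing from the dict becomes `NaN`. A minimal sketch on a made-up series (the values below are illustrative and are not taken from the dataset) shows how this could be checked:

```python
import pandas as pd

# Toy status column; the value 3 is deliberately absent from the mapping
status = pd.Series([1, 2, 1, 3])
labels = status.map({1: 'survived', 2: 'died'})

# Values that are not keys of the dict are mapped to NaN
n_unmapped = int(labels.isna().sum())
print(labels.tolist())
print("unmapped values:", n_unmapped)
```

If the count is zero on the real column, every code was translated to a label.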

In [None]:
#Checking the column names 
haberman_df.columns

In [None]:
#Checking the dimensions(rows,columns)
haberman_df.shape

In [None]:
#returns first 5 rows of the df
haberman_df.head()

In [None]:
#returns last 5 rows of the df
haberman_df.tail()

### 4. High level statistics of df

In [None]:
#to get a concise summary of the df
haberman_df.info()

**Observations :** 

- **No null values are present in the df,** as the non-null count of each attribute equals the number of instances (306).
- **All the features have numeric values,** as the datatype of each feature is integer.
- **The class attribute has non-numeric values,** as its datatype is object.

In [None]:
#computes a summary statistics of numeric data 
haberman_df.describe()

**Observations :**

- **age** of patients ranges from 30 to 83 (min, max), with a mean of about 52 and a standard deviation of about 10 (std)
- **year** of operation ranges from 1958 to 1969 (min, max)
- **nodes** - the highest number of positive axillary nodes for a patient is 52 (max), but 75% of patients have 4 or fewer nodes and at least 25% have no nodes



In [None]:
#Counting number of datapoints in each class
haberman_df['status'].value_counts()

**Observations:**<br>

haberman_df is an **imbalanced dataset**, as the number of data points in the 2 classes is not equal
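
The degree of imbalance can be quantified with `value_counts(normalize=True)`. The sketch below uses a small made-up label column rather than the real data, so the printed ratio is illustrative only:

```python
import pandas as pd

# Hypothetical labels standing in for haberman_df['status']
toy_status = pd.Series(['survived'] * 6 + ['died'] * 2)

# normalize=True returns class fractions instead of raw counts
proportions = toy_status.value_counts(normalize=True)
imbalance_ratio = proportions.max() / proportions.min()
print(proportions)
print("majority/minority ratio:", imbalance_ratio)
```

Running the same two lines on `haberman_df['status']` would give the actual class fractions.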



## 5. Univariate analysis
Univariate analysis (distribution plots, PDF and CDF, box plots, violin plots) is performed to understand which features are useful for classification.



### 5.1 Distribution plots 
Seaborn, together with Matplotlib, is used to draw the **distplot**, which shows the distribution of the data as a histogram with a density curve drawn over it.

In [None]:
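#Univariate analysis on the "age" variable using distplot
# (distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True, stat='density') is the newer equivalent)
sns.FacetGrid(haberman_df, hue='status', height=5) \
   .map(sns.distplot, "age") \
   .add_legend()
# distplot normalizes the histogram when the density curve is drawn, so the y-axis is a density
plt.ylabel('density')
plt.show()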


**Observations:**
- patients aged between 30 and 34 survived more than 5 years
- patients aged between 77 and 83 died within 5 years
- for the rest of the patients, nothing can be concluded from the plot because the distributions overlap.


In [None]:
#Univariate analysis on the "year" variable using distplot
sns.FacetGrid(haberman_df, hue='status', height=5) \
   .map(sns.distplot, "year") \
   .add_legend()
# the histogram is normalized, so the y-axis is a density
plt.ylabel('density')
plt.show()


**Observations:**
- nothing can be concluded from the plot because the distributions overlap heavily, so **year of operation** is not useful in classifying the patient.


In [None]:

#Univariate analysis on the "nodes" variable using distplot
sns.FacetGrid(haberman_df, hue='status', height=5) \
   .map(sns.distplot, "nodes") \
   .add_legend()
# the histogram is normalized, so the y-axis is a density
plt.ylabel('density')
plt.show()

**Observations:**
- even though the distributions overlap in this plot, it clearly shows that patients with fewer positive axillary nodes survived more than 5 years.

Hence, from the distplots, **nodes** seems to be more useful in classifying the patient's survival status than the other features.



### 5.2 PDF and CDF 

PDF and CDF are used to describe how the values of a variable are distributed.

#### PDF (Probability Density Function)
- gives the probability density of a continuous random variable
- used to understand the distribution of the data visually, without computing the exact probability for a certain range of values.

#### CDF (Cumulative Distribution Function)
- gives the cumulative probability of a random variable, whether it is continuous or discrete
- used to determine the probability that a value drawn from the population will be less than or equal to a certain value.
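
A minimal sketch of how a binned PDF and CDF can be estimated from a sample with `np.histogram`. The sample values are made up for illustration; the same steps apply to any numeric column:

```python
import numpy as np

# Made-up sample (not from the dataset)
sample = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Count how many values fall into each of 4 equal-width bins
counts, bin_edges = np.histogram(sample, bins=4)

# Probability per bin: counts divided by the sample size
pdf = counts / counts.sum()

# Running total of the bin probabilities; the last value is 1
cdf = np.cumsum(pdf)

print("bin edges:", bin_edges)
print("pdf:", pdf)
print("cdf:", cdf)
```

Each `cdf` entry is the fraction of the sample at or below the right edge of the corresponding bin, which is why the curves are usually plotted against `bin_edges[1:]`.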



In [None]:
#creating dataframe of patients who survived
survived_df = haberman_df.loc[haberman_df["status"]== 'survived']

#creating dataframe of patients who died
died_df = haberman_df.loc[haberman_df["status"]== 'died']

In [None]:
#Plotting PDF and CDF for age of patients who survived
counts, bin_edges = np.histogram(survived_df['age'], bins=10, density=True)
pdf = counts / counts.sum()   # probability per bin
print("pdf_survived :\n", pdf)
print("bin_edges_survived :\n", bin_edges)
cdf = np.cumsum(pdf)
print("cdf_survived :\n", cdf)
plt.plot(bin_edges[1:], pdf, label='pdf_survived')
plt.plot(bin_edges[1:], cdf, label='cdf_survived')

print('*'*60)

#Plotting PDF and CDF for age of patients who died
counts, bin_edges = np.histogram(died_df['age'], bins=10, density=True)
pdf = counts / counts.sum()   # probability per bin
print("pdf_died :\n", pdf)
print("bin_edges_died :\n", bin_edges)
cdf = np.cumsum(pdf)
print("cdf_died :\n", cdf)
plt.plot(bin_edges[1:], pdf, label='pdf_died')
plt.plot(bin_edges[1:], cdf, label='cdf_died')

plt.xlabel('age')
plt.ylabel('probability')
plt.title('PDF and CDF of age')
plt.legend()
plt.show()



**Observations:**

**From pdf** 
- for ages between 45 and 55, the PDF of patients who died is higher than that of patients who survived.

**From cdf**
- Patients younger than 45 have a higher probability of surviving than of dying within 5 years

**From pdf and cdf**
- No patients aged between 30 and 34 died within 5 years
- No patients aged between 77 and 83 survived more than 5 years

However, **age** does not seem to be very useful in classifying the patient, as the PDF and CDF curves of the two classes overlap.

In [None]:
#Plotting PDF and CDF for year of operation of patients who survived
counts, bin_edges = np.histogram(survived_df['year'], bins=10, density=True)
pdf = counts / counts.sum()   # probability per bin
print("pdf_survived :\n", pdf)
print("bin_edges_survived :\n", bin_edges)
cdf = np.cumsum(pdf)
print("cdf_survived :\n", cdf)
plt.plot(bin_edges[1:], pdf, label='pdf_survived')
plt.plot(bin_edges[1:], cdf, label='cdf_survived')

print('*'*60)

#Plotting PDF and CDF for year of operation of patients who died
counts, bin_edges = np.histogram(died_df['year'], bins=10, density=True)
pdf = counts / counts.sum()   # probability per bin
print("pdf_died :\n", pdf)
print("bin_edges_died :\n", bin_edges)
cdf = np.cumsum(pdf)
print("cdf_died :\n", cdf)
plt.plot(bin_edges[1:], pdf, label='pdf_died')
plt.plot(bin_edges[1:], cdf, label='cdf_died')

plt.xlabel('year')
plt.ylabel('probability')
plt.title('PDF and CDF of year')
plt.legend()
plt.show()



**Observations:**

**From pdf** 
- for operations between 1960 and 1962, the PDF of patients who survived is higher than that of patients who died.
- for operations between 1965 and 1967, the PDF of patients who died is higher than that of patients who survived.

**From cdf**
- does not give any valuable inference, as the curves overlap for both survival statuses.

Hence, from both the PDF and CDF plots, **year** also does not seem to be useful in classifying the patient.

In [None]:
#Plotting PDF and CDF for no. of positive axillary nodes of patients who survived
counts, bin_edges = np.histogram(survived_df['nodes'], bins=10, density=True)
pdf = counts / counts.sum()   # probability per bin
print("pdf_survived :\n", pdf)
print("bin_edges_survived :\n", bin_edges)
cdf = np.cumsum(pdf)
print("cdf_survived :\n", cdf)
plt.plot(bin_edges[1:], pdf, label='pdf_survived')
plt.plot(bin_edges[1:], cdf, label='cdf_survived')

print('*'*60)

#Plotting PDF and CDF for no. of positive axillary nodes of patients who died
counts, bin_edges = np.histogram(died_df['nodes'], bins=10, density=True)
pdf = counts / counts.sum()   # probability per bin
print("pdf_died :\n", pdf)
print("bin_edges_died :\n", bin_edges)
cdf = np.cumsum(pdf)
print("cdf_died :\n", cdf)
plt.plot(bin_edges[1:], pdf, label='pdf_died')
plt.plot(bin_edges[1:], cdf, label='cdf_died')

plt.xlabel('nodes')
plt.ylabel('probability')
plt.title('PDF and CDF of nodes')
plt.legend()
plt.show()



**Observations:**
 
**From pdf and cdf**

- Fewer than 5 positive axillary nodes are found in \~82% of the patients who survived, but in only \~58% of the patients who died.
- No patient with more than 46 nodes survived more than 5 years.

The PDF and CDF curves for 'nodes' also overlap for the two survival statuses, but they clearly show that patients with fewer positive axillary nodes have a better chance of survival.

Hence, from all the PDF and CDF plots, nodes seems to be more useful in classifying the patient's survival status than the other features.
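
The percentages quoted above are values of each class's empirical CDF at a threshold. A sketch of that calculation, using a hypothetical helper `fraction_below` and made-up node counts (not the real data):

```python
import pandas as pd

def fraction_below(values, threshold):
    """Share of observations strictly below `threshold`."""
    values = pd.Series(values)
    return float((values < threshold).mean())

# Made-up node counts for two groups
toy_survived = [0, 0, 1, 2, 3, 4, 6, 10]
toy_died = [0, 2, 5, 8, 11, 15, 20, 23]

p_survived = fraction_below(toy_survived, 5)
p_died = fraction_below(toy_died, 5)
print(p_survived, p_died)
```

Applied to `survived_df['nodes']` and `died_df['nodes']`, the same helper would give the per-class fractions for the real data.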

### 5.3 Boxplot

A box plot is a useful way to illustrate the central tendency, variability, and skewness of a distribution, and also an excellent way to detect outliers and extreme values.

- box represents the interquartile range (IQR)
- top line of the box is the 75th percentile (3rd quartile)
- center line in the box is the 50th percentile (median)
- bottom line of the box is the 25th percentile (1st quartile)
- whiskers extend from the box upward and downward (the plots here are vertical)
- by default, seaborn ends each whisker at the most extreme data point within 1.5 × IQR of the box and draws points beyond it individually as outliers, so the whisker ends equal the minimum and maximum only when there are no outliers.
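
The quantities a box plot draws can also be computed directly. A minimal sketch with `np.percentile` on made-up values, assuming the common 1.5 × IQR whisker rule (seaborn's default):

```python
import numpy as np

# Made-up values with one obvious extreme point
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR rule
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits are drawn individually as outliers
outliers = data[(data < lower_limit) | (data > upper_limit)]
print(q1, median, q3, iqr)
print("outliers:", outliers)
```

Here only the value 30 lies beyond the upper limit, so it would appear as a single point above the whisker.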


In [None]:
#boxplot for survival status of patient vs age 
sns.boxplot(x='status',y='age', data=haberman_df)
plt.title('survival status of patient vs age')
plt.show()

**Observations:**

- 50% of the patients who survived more than 5 years are aged between 43 and 60 (IQR)
- 50% of the patients who died within 5 years are aged between 46 and 61 (IQR)

From the box plot we found that the ages of ~75% of the patients overlap for both survival statuses. Hence, **age** is not very useful in classifying the patient.



In [None]:
#boxplot for survival status of patient vs year
sns.boxplot(x='status',y='year', data=haberman_df)
plt.title('survival status of patient vs year')
plt.show()

**Observations:**

- 50% of the operations on patients who survived more than 5 years took place between 1960 and 1966 (IQR)
- 50% of the operations on patients who died within 5 years took place between 1959 and 1965 (IQR)

From the box plot we found that ~75% of the years of operation overlap for both survival statuses. Hence, **year** is also not very useful in classifying the patient.

In [None]:
#boxplot for survival status of patient vs nodes
sns.boxplot(x='status',y='nodes', data=haberman_df)
plt.title('survival status of patient vs nodes')
plt.show()

**Observations:**

- At least 50% of the patients who survived more than 5 years have 0 positive axillary nodes (median)
- 50% of the patients who died within 5 years have between 1 and 11 positive axillary nodes (IQR)
- Almost all patients who survived more than 5 years have fewer than 7 positive nodes (upper whisker); the points above the whisker are outliers.

From the box plot we found that the positive node counts of only 25% of the patients overlap for both survival statuses. Hence, nodes seems to be more useful in classifying the patient's survival status than the other features.

### 5.4 Violin plot

- It is similar to a box plot, with the addition of a PDF on each side; the resulting shape looks like a violin, hence the name violin plot
- In a violin plot, denser regions of the data are fatter and sparser ones are thinner.
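
The outline of a violin is a kernel density estimate (KDE) of the data. The sketch below hand-rolls a Gaussian KDE in NumPy to show the idea. The helper name `gaussian_kde_1d`, the data, and the bandwidth value are arbitrary choices for illustration; seaborn selects its own bandwidth internally:

```python
import numpy as np

def gaussian_kde_1d(data, grid, bandwidth):
    """Average of Gaussian bumps centred on each data point, evaluated on `grid`."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Pairwise standardized distances, shape (len(grid), len(data))
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

toy = [0, 0, 1, 1, 2, 8]            # made-up values clustered near 0-2
grid = np.linspace(-5, 15, 401)
density = gaussian_kde_1d(toy, grid, bandwidth=1.0)

# The estimate integrates to roughly 1 and peaks where the data are dense
step = grid[1] - grid[0]
area = float(density.sum() * step)
peak_location = float(grid[np.argmax(density)])
print(round(area, 3), peak_location)
```

The fat part of a violin corresponds to the region where this density is highest.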


In [None]:
#violinplot for survival status of patient vs age 
sns.violinplot(x="status", y="age", data=haberman_df)
plt.title('survival status of patient vs age')
plt.show()


**Observations:**
- the ages of patients who died within 5 years are most densely concentrated between 45 and 55


In [None]:
#violinplot for survival status of patient vs year 
sns.violinplot(x="status", y="year", data=haberman_df)
plt.title('survival status of patient vs year')
plt.show()


**Observations:**

- the years of operation of patients who survived more than 5 years are most densely concentrated around 1960
- the years of operation of patients who died within 5 years are most densely concentrated around 1965



In [None]:
#violinplot for survival status of patient vs nodes
sns.violinplot(x="status", y="nodes", data=haberman_df)
plt.title('survival status of patient vs nodes')
plt.show()


**Observations:**
- The positive axillary node counts of patients who survived more than 5 years are most densely concentrated between 0 and 2 nodes.
- The positive axillary node counts of patients who died within 5 years are most densely concentrated between 4 and 7 nodes.

From the violin plot we found that most people who survived more than 5 years have zero positive axillary nodes. Hence, nodes seems to be more useful in classifying the patient's survival status than the other features.




## 6. Bivariate analysis
Bivariate analysis (scatter plots, pair plots) is performed to understand whether combinations of features are useful in classification.


### 6.1 Scatter plot

A scatter plot shows the relationship between two numerical variables where each value in the data set is represented by a dot.


In [None]:
#2-D scatter plot between age and year
haberman_df.plot(kind='scatter', x='age', y='year') ;
plt.show()

**Observations:**<br>
 
Not much can be said about this plot because both classes are drawn in the same color.

In [None]:
# 2-D Scatter plot with color-coding for each class.
sns.set_style("whitegrid")
sns.FacetGrid(haberman_df, hue='status', height=4) \
   .map(plt.scatter, "age", "year") \
   .add_legend();
plt.show();

**Observations :**<br>
 
Now it is easy to distinguish the two classes by survival status, but this combination of features (**age and year**) is not very useful in classifying the patient, as the points overlap.


### 6.2 Pair plot

Instead of drawing a scatter plot for each combination individually, we can draw a pair plot, which gives a clear view of the relationship between every combination of features.

In [None]:
# pairwise scatter plot: Pair-Plot
sns.pairplot(haberman_df, hue ="status", diag_kind = "hist", height = 3)
plt.show()

**Observations:**<br>

- The histograms on the diagonal show the distribution of a single variable, while the scatter plots above and below the diagonal show the relationship between two variables.
- From the **age and nodes** combination, patients aged between 30 and 40 with fewer positive axillary nodes seem more likely to have survived.
- In all combinations there is a lot of overlap, so the classes are not linearly separable in any plot and no combination seems very useful in classifying the patient.


## CONCLUSION

haberman_df is an imbalanced dataset, as the number of data points in the 2 classes is not equal.


We drew multiple plots above to see how a new patient could be classified into one of the 2 classes based on the 3 features.
 ### **1.age**<br>
       if age >= 30 && age <= 34, then survived more than 5 years
       if age > 77, then died within 5 years 
   for the rest of the patients, it is not possible to write working if/else rules, hence 'age' alone won't fully help in classifying.
  
 ### **2.year**<br>
    'year' does not give any valuable inference, as it overlaps for both survival statuses.

 ### **3.nodes**<br>
    from all the plots, 'nodes' gives us a clear idea that patients with zero or few positive axillary nodes survived more than 5 years after the operation.

Since the data points overlap so much, it is difficult to build a simple linearly separable model that classifies a new patient into one of the 2 classes based on the 3 features.

However, it is reasonable to assume that as the number of positive axillary nodes and the age increase, the patient's chance of survival decreases.
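
The age rules from point 1 of the conclusion can be written as a small function. This is only a sketch of the if/else idea: `classify_by_age` is a hypothetical name, the thresholds are the ones read off the plots, treating the 30 to 34 range as inclusive is an assumption, and patients outside both ranges are left unclassified:

```python
def classify_by_age(age):
    """Return a label for the two age ranges that did not overlap in the plots, else None."""
    if 30 <= age <= 34:
        return 'survived'
    if age > 77:
        return 'died'
    # Ages in between overlap for both classes, so no rule applies
    return None

for age in (32, 50, 80):
    print(age, classify_by_age(age))
```

Because most patients fall into the unclassified range, rules like these cover only a small part of the data.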

