The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

# Objective:

To determine a patient's chances of survival based on three parameters: age, year of operation, and number of positive axillary nodes.

In [None]:
#importing libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#Loading dataset as pandas dataframe 
data = pd.read_csv('../input/habermans-survival-data-set/haberman.csv')

In [None]:
#Displaying some rows of the dataset to get an insight of the values in each column.
data.head()

In [None]:
#Displaying number of rows and columns of dataset
data.shape

The dataset has 305 rows and 4 columns.

In [None]:
#Displaying column names
data.columns

In [None]:
#Assigning meaningful column names
data.columns = ['Age','Year_of_operation','Axillary_nodes','Survival_status']
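
The labels printed by `data.columns` are worth checking before they are overwritten. If the CSV file has no header row (an assumption here, since it depends on the file), `read_csv` treats the first record as column labels and the frame ends up one row short. A minimal sketch on a hypothetical three-record sample, where `raw`, `with_default` and `with_names` are illustrative names only:

```python
import io
import pandas as pd

# Hypothetical three-record sample in the same four-column layout, with no header row.
raw = "30,64,1,1\n30,62,3,1\n34,59,0,2\n"

cols = ['Age', 'Year_of_operation', 'Axillary_nodes', 'Survival_status']

# Default behaviour: the first record is consumed as the header row.
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every record as data; names= assigns the labels up front.
with_names = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(with_default.shape)  # one row fewer than the file contains
print(with_names.shape)    # every record retained
```

If the file does carry a header row, the default call is already correct and only the renaming step is needed.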

In [None]:
data.head()

The columns represent the following information:

**Age**: Represents age of patient at the time of operation.

**Year_of_operation**: Represents year of operation.

**Axillary_nodes**: Represents number of positive axillary nodes.

**Survival_status**: 1: Patient survived for 5 years or longer, 2: Patient died within 5 years of operation.

In [None]:
#Mapping values in Survival_status column
data['Survival_status'] = data['Survival_status'].map({1:'Survived',2:'Not Survived'})

In [None]:
#Finding the survival count
data['Survival_status'].value_counts()

**Observations**: 

1. In the dataset, 224 patients survived for 5 years or longer and 81 patients died within 5 years of operation.

2. The dataset is imbalanced: roughly 73% of the patients belong to the survived class.
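
The degree of imbalance can be quantified rather than read off the raw counts. The sketch below rebuilds a label column from the two counts reported above (the `labels` series is a stand-in, not the real frame) and uses `value_counts(normalize=True)` to get class proportions:

```python
import pandas as pd

# Stand-in label column rebuilt from the counts reported above (224 and 81).
labels = pd.Series(['Survived'] * 224 + ['Not Survived'] * 81,
                   name='Survival_status')

# normalize=True returns class proportions instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy of a trivial rule that always predicts the majority class.
baseline_accuracy = proportions.max()
print(round(baseline_accuracy, 3))
```

The majority-class proportion is a useful baseline: any rule derived from the features should be judged against it rather than against 50%.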

In [None]:
#Describing the dataset
data.describe()

**Observations**:

1. There are no null values.

2. The average number of positive axillary nodes per patient is approximately 4.
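
`describe()` only hints at missing data through its `count` row. A more direct check is `isnull().sum()`. The sketch below uses a small hypothetical frame (`toy`) with one deliberately missing value to show how the two views relate:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one deliberately missing value.
toy = pd.DataFrame({
    'Age': [30, 34, 41, 52],
    'Axillary_nodes': [1, np.nan, 0, 4],
})

# describe() reports the number of non-null entries per column.
non_null = toy.describe().loc['count']
print(non_null)

# isnull().sum() counts the missing entries directly.
missing = toy.isnull().sum()
print(missing)
```

On the real frame, `data.isnull().sum()` returning zero for every column would confirm the observation above.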

In [None]:
#Creating dataset 'Survived' having data of all survived patients.
Survived = data[data['Survival_status']=='Survived']

In [None]:
#Describing Survived dataset
Survived.describe()

**Observations**:

1. The average number of axillary nodes among patients who survived is approximately 3.

2. The maximum number of axillary nodes among patients who survived is 46.

3. 75% of survivors had 3 or fewer axillary nodes.

In [None]:
#Creating dataset 'Not_Survived' having data of patients who did not survive for more than 5 years.
Not_Survived = data[data['Survival_status']=='Not Survived']

In [None]:
#Describing Not_Survived dataset
Not_Survived.describe()

**Observations**:

1. The average number of axillary nodes among patients who did not survive is approximately 7, which is higher than the mean for patients who survived.

2. The maximum number of axillary nodes among patients who did not survive is 52.

3. 50% of patients who did not survive had at most 4 axillary nodes.
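
The two per-class summaries can also be produced side by side with a single `groupby`, which makes the comparison easier to read. A minimal sketch on hypothetical values (the `toy` frame is illustrative, not the real data):

```python
import pandas as pd

# Hypothetical node counts for the two classes.
toy = pd.DataFrame({
    'Axillary_nodes': [0, 1, 2, 3, 4, 9, 23],
    'Survival_status': ['Survived'] * 4 + ['Not Survived'] * 3,
})

# One row of summary statistics per class.
summary = toy.groupby('Survival_status')['Axillary_nodes'].describe()
print(summary[['count', 'mean', '50%', 'max']])
```

On the real frame the equivalent call would be `data.groupby('Survival_status')['Axillary_nodes'].describe()`.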

# Uni-variate analysis:

# 1. Distplot: 

A distplot plots the univariate distribution of a variable. It combines a histogram with a kernel density estimation (KDE) curve.

In [None]:
#PDF of Age using distplot
g = sns.FacetGrid(data, hue='Survival_status',height=8)
g.map(sns.distplot,'Age')
plt.legend()
plt.show()

**Observations**:

Patients aged 30-40 have a higher chance of survival, whereas patients aged 40-60 have a lower chance. For patients aged 65-75, the chances of surviving and not surviving are roughly equal.

In [None]:
#PDF of Year of operation using distplot
g = sns.FacetGrid(data, hue='Survival_status',height=8)
g.map(sns.distplot,'Year_of_operation')
plt.legend()
plt.show()

In [None]:
#PDF of Axillary nodes using distplot
g = sns.FacetGrid(data, hue='Survival_status',height=8)
g.map(sns.distplot,'Axillary_nodes')
plt.legend()
plt.show()

**Observations**:

1. Patients with no positive axillary nodes were the most likely to survive, while patients with more than 2 positive axillary nodes had lower chances of survival.

2. Of the three distplots above, the count of positive axillary nodes separates the two classes best, so it is the most useful single feature for judging survival.

# 2.Boxplot: 

A boxplot summarizes the distribution of quantitative data through its quartiles, interquartile range, and outliers.

In [None]:
#Boxplot between Survival_status and Axillary_nodes
fig = plt.figure(figsize=(8,8))
sns.boxplot(x = 'Survival_status', y = 'Axillary_nodes', data = data)
plt.show()

**Observations**:

1. 50% of patients who did not survive had between 1 and 11 axillary nodes (approximately).

2. 75% of patients who survived had very few axillary nodes.

3. So the higher the number of axillary nodes, the lower the chances of survival.
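
The quantities read off a boxplot can also be computed directly. The sketch below assumes the common 1.5 × IQR whisker rule (the default in matplotlib and seaborn) and uses a hypothetical list of node counts. `box_stats` is an illustrative helper, not part of any library.

```python
import numpy as np

def box_stats(values):
    """Return the quartiles, IQR and outliers under the 1.5 * IQR rule."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Points beyond the fences are the ones drawn as individual markers.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {'q1': q1, 'median': median, 'q3': q3,
            'iqr': iqr, 'outliers': outliers}

# Hypothetical node counts with one extreme value.
toy_nodes = [0, 0, 1, 1, 2, 3, 4, 30]
stats = box_stats(toy_nodes)
print(stats)
```

Applied to `Survived['Axillary_nodes']` and `Not_Survived['Axillary_nodes']`, this would give exact numbers for the ranges estimated from the plot.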

In [None]:
#Boxplot between Survival_status and Age
fig = plt.figure(figsize=(8,8))
sns.boxplot(x = 'Survival_status', y = 'Age', data = data)
plt.show()

**Observations**:

1. The youngest patients (about 30-35 years of age) fall almost entirely in the survived group.

2. No patient older than about 80 years appears in the survived group.

In [None]:
#Boxplot between Year_of_operation and Survival_status
fig = plt.figure(figsize=(8,8))
sns.boxplot(y = 'Year_of_operation', x = 'Survival_status', data = data )
plt.show()

# 3.Violinplot

A violin plot combines a boxplot with a kernel density estimation (KDE) plot, so it also shows the shape of the distribution.

In [None]:
#violinplot between Survival_status and Axillary_nodes
sns.violinplot(y = 'Axillary_nodes', x = 'Survival_status', data = data)
plt.show()

#violinplot between Survival_status and Age
sns.violinplot(y = 'Age', x = 'Survival_status', data = data)
plt.show()

#violinplot between Survival_status and Year_of_operation
sns.violinplot(y = 'Year_of_operation', x = 'Survival_status', data = data)
plt.show()

**Observations**:

1. Operations performed around 1965 show a higher density of patients who did not survive than other years.

2. Patients aged 45-55 were more likely not to survive.



# Bi-variate analysis

# Pairplot:

A pairplot shows the bivariate distribution for each pair of columns, with the univariate distribution of each column on the diagonal.

In [None]:
#Using pairplots
sns.set(style='darkgrid')
sns.pairplot(data,hue = 'Survival_status',height = 8)

**Observations**: Patients with fewer positive axillary nodes have higher chances of survival.

# CDF(Cumulative distribution function)

In [None]:
#CDF of axillary nodes in survived patients.
fig = plt.figure(figsize=(9,7))

counts, bin_edges = np.histogram(Survived['Axillary_nodes'], bins=10, density=True)

pdf = counts/sum(counts)

cdf = np.cumsum(pdf)

plt.plot(bin_edges[1:],pdf,label='PDF')
plt.plot(bin_edges[1:],cdf,label='CDF')

plt.xlabel('Axillary Nodes')
plt.legend()

**Observations**: About 90% of patients who survived had fewer than 10 axillary nodes.

In [None]:
#CDF of axillary nodes in not survived patients.
fig = plt.figure(figsize=(9,7))

counts, bin_edges = np.histogram(Not_Survived['Axillary_nodes'], bins=20, density=True)

pdf = counts/sum(counts)

cdf = np.cumsum(pdf)

plt.plot(bin_edges[1:],pdf,label='PDF')
plt.plot(bin_edges[1:],cdf,label='CDF')

plt.xlabel('Axillary Nodes')
plt.legend()

**Observations**: The CDF reaches 0.9 at roughly 20 nodes, so about 90% of patients who did not survive had about 20 or fewer axillary nodes.
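
Reading a percentage off a binned CDF depends on the number of bins. As a cross-check, the same question can be answered without binning by using an empirical CDF or `np.percentile`. A minimal sketch on hypothetical node counts (`toy_nodes` and `fraction_at_most` are illustrative names):

```python
import numpy as np

# Hypothetical node counts for ten patients.
toy_nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 21])

# Empirical CDF: after sorting, the i-th value has cumulative share (i + 1) / n.
x = np.sort(toy_nodes)
ecdf = np.arange(1, len(x) + 1) / len(x)

def fraction_at_most(values, threshold):
    """Share of observations less than or equal to the threshold."""
    return float(np.mean(np.asarray(values) <= threshold))

print(fraction_at_most(toy_nodes, 4))

# The 90th percentile answers "90% of patients had at most how many nodes?"
p90 = np.percentile(toy_nodes, 90)
print(p90)
```

On the real data, `np.percentile(Not_Survived['Axillary_nodes'], 90)` and the same call on `Survived` would give the thresholds quoted in the observations without depending on the choice of `bins`.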

# Multi-variate analysis:

# Jointplot:

A jointplot shows the bivariate distribution of two variables together with the univariate distribution of each along the margins.

In [None]:
#Jointplot between Year of operation and Age in Survived dataset
sns.jointplot(x = 'Year_of_operation', y = 'Age', data = Survived, kind = 'hex',height=8)
plt.show()

**Conclusion**: None of the three parameters can, on its own, accurately determine the chances of survival. However, the number of axillary nodes can indicate the chances of survival to some extent.