# Haberman's Survival Data Set Analysis

**Introduction**

*The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.*

**Objective**

*- EDA of features that influence the survival of patients who had undergone surgery for breast cancer (Univariate/Bivariate/Multivariate Analysis)*

## Exploratory Data Analysis

In [None]:
#Importing required modules and dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# pd.set_option('display.max_rows', 500)
# pd.set_option('display.max_columns', 500)
print(os.listdir("../input"))
data = pd.read_csv('../input/haberman.csv', header = None, names = ['age_of_patient', 'year_of_surgery', 'positive_axillary_nodes', 'survival_status'])
print("Modules and dataset imported successfully")

### Preview of Data

* **Input Variables:** 
> Age of the Patient,
> Year of the Surgery,
> Number of positive Axillary nodes detected

* **Target variable:** 
> Survival status of the Patient

In [None]:
data.head()

### Data Dictionary
> **age_of_patient** - *Age of the patient at the time of breast cancer surgery (numerical)*

> **year_of_surgery** - *Year of the patient's surgery, stored as the last two digits, e.g. 58 for 1958 (numerical)* 

> **positive_axillary_nodes** - *Number of positive axillary nodes detected (numerical)*
 >> A positive axillary node is a lymph node in the area of the armpit (axilla) to which cancer has spread. This spread is determined by surgically removing some of the lymph nodes and examining them under a microscope to see whether cancer cells are present.

> **survival_status** - *Survival status of the patient (class attribute)*
 >> **1** means the patient survived 5 years or longer and **2** means the patient died within 5 years of surgery.
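
The status coding above can be turned into readable labels with a small lookup table. A minimal sketch, assuming a toy Series of status codes (the five values are made up for illustration and are not rows of the dataset):

```python
import pandas as pd

# Hypothetical status codes for five patients (illustrative only)
status_codes = pd.Series([1, 2, 1, 1, 2], name='survival_status')

# 1 -> survived 5 years or longer, 2 -> died within 5 years of surgery
status_labels = status_codes.map({1: 'Survived', 2: 'Died'})
print(status_labels.tolist())
```

Any code missing from the lookup would come back as `NaN`, which makes unexpected values easy to spot.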

### Columns of the dataset and other information

In [None]:
print("\nThere are {} patients in the list".format(len(data.index)))
print("\nColumn names are {}".format(data.columns.tolist()))
# Converting survival_status column values to categorical: 1 - Survived, 2 - Died
# (assigning the result back avoids relying on an in-place change through attribute access)
data['survival_status'] = data['survival_status'].map({1: 'Survived', 2: 'Died'})
print("\nFull summary of data as follows:\n")
data.info()
print("\nMissing data% stats as follows:\n\n{}".format((data.isnull().sum() / len(data)) * 100))

In [None]:
# Now the dataframe looks like
data.head()

### High-level statistics

In [None]:
data.describe()

**Observations**

- *Patient age varies from **30 to 83**.*

- *The average patient age is **52** years, with a standard deviation of about **10** years.*

- *25% of the patients had **no Positive Axillary nodes** detected.*

- *50% of the patients had **1** or fewer Positive Axillary nodes detected.*

- *75% of the patients had **4** or fewer Positive Axillary nodes detected.*

- *The maximum number of Positive Axillary nodes detected in a patient is **52**.*

- *All surgeries were carried out between **1958 and 1969 (inclusive)**.*

## Univariate Analysis 

### Number of Patients Survived vs Patients Died Relationship

In [None]:
data.survival_status.value_counts()  # Gives the count of Patients survived and died

In [None]:
# Countplot of Patients Survived vs Patients Died
sns.set_style('whitegrid')
fig1, ax = plt.subplots(1, 2, figsize=(15,7))
fig1 = sns.countplot(x=data.survival_status, hue='survival_status', data=data, ax=ax[0])
fig1.set_xlabel('Survival Status of Patients', fontsize=15)
fig1.set_ylabel('Number of Patients', fontsize=15)
fig1.set_title('Number of Patients Survived vs Patients Died', fontsize=15)
fig1.legend(title='Survival Status')

# Pieplot showing percentages of Patients Survived and Patients Died
ax[1].pie(x=data.survival_status.value_counts(), labels=['Survived', 'Died'], autopct='%.2f%%')
ax[1].axis('equal')
ax[1].set_title('Percentage of Patients Survived vs Patients Died', fontsize=15)
ax[1].legend(title='Survival Status')
plt.show()

**Observations:**

- *Of the **306** patients in the dataset, **225** survived 5 years or longer, whereas **81** died within 5 years of surgery.*

- *That is, about **73.53%** of patients survived 5 years or longer and **26.47%** died within 5 years of surgery.*

> **This is clearly an Imbalanced class problem.**
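
The percentages above follow directly from the two class counts. A short sketch of the arithmetic, using only the counts reported in the observations (the variable names are illustrative):

```python
import pandas as pd

# Class counts as reported in the observations above
class_counts = pd.Series({'Survived': 225, 'Died': 81})

# Share of each class in percent
class_share = (class_counts / class_counts.sum() * 100).round(2)
# Ratio of the majority class to the minority class
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_share)
print("Majority/minority ratio: {:.2f}".format(imbalance_ratio))
```

A ratio well above 1 is one simple way to quantify how imbalanced the classes are.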

### Age vs Number of Patients Relationship

In [None]:
# Countplot of Age vs Number of Patients Relationship
sns.set_style('white')
fig2, ax = plt.subplots(4, 1, figsize=(20,25))
fig2 = sns.countplot(x=data.age_of_patient, ax=ax[0])
fig2.set_xlabel('Age of patients', fontsize=15)
fig2.set_ylabel('Number of patients', fontsize=15)
fig2.set_title('Age vs Number of Patients', fontsize=15)

# Histogram of Age group vs Number of Patients Relationship
ax[1].hist(x=data.age_of_patient, bins=[30,35,40,45,50,55,60,65,70,75,80,85])
ax[1].set_xlabel('Age of patients', fontsize=15)
ax[1].set_ylabel('Number of patients', fontsize=15)
ax[1].set_title('Age vs Number of Patients', fontsize=15)

# Histogram with KDE of patients age (histplot replaces the deprecated distplot)
sns.histplot(data.age_of_patient, kde=True, stat='density', ax=ax[2])
ax[2].set_xlabel('Age of patients', fontsize=15)
ax[2].set_ylabel('Density', fontsize=15)
ax[2].set_title('Density distribution of Patients Age', fontsize=15)

# PDF and CDF of Patient age, drawn explicitly on the fourth subplot
counts, bin_edges = np.histogram(data.age_of_patient)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
ax[3].plot(bin_edges[1:], pdf, label='PDF')
ax[3].plot(bin_edges[1:], cdf, label='CDF')
ax[3].set_xlabel('Age of patient', fontsize=15)
ax[3].set_title('PDF and CDF of Patient age', fontsize=15)
ax[3].legend(fontsize=15)
plt.show()

In [None]:
data.age_of_patient.value_counts().sort_index()  # Gives the count of patients per age

**Observations:**

- *The **50 to 54** age group has the most patients (**56**).*

- *The **80 to 84** age group has the fewest patients (only **1**).*

- *The most common age is **52**, with **14** patients.*

- *Patient ages are roughly normally distributed.*

- *From the CDF, patients aged **30 to 70** make up about **90%** of the patient population.*
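
The CDF reading above comes from normalising histogram counts and accumulating them. A minimal sketch of that computation, assuming ten made-up ages and hand-picked bin edges (neither is taken from the dataset):

```python
import numpy as np

# Hypothetical ages (illustrative only, not the Haberman data)
ages = np.array([30, 34, 41, 45, 48, 52, 52, 55, 61, 83])

counts, bin_edges = np.histogram(ages, bins=[30, 40, 50, 60, 70, 80, 90])
pdf = counts / counts.sum()   # fraction of patients falling in each bin
cdf = np.cumsum(pdf)          # fraction of patients up to each bin's right edge

for right_edge, share in zip(bin_edges[1:], cdf):
    print("Up to age {}: {:.0%} of patients".format(right_edge, share))
```

Strictly speaking, `pdf` here is the probability mass per bin rather than a continuous density, which is why its values sum to 1.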

### Age vs Survival Relationship

In [None]:
# Countplot of Age vs Survival Relationship
sns.set_style('whitegrid')
fig1, ax = plt.subplots(1, 1, figsize=(15,7))
fig1 = sns.countplot(x=data.age_of_patient, hue='survival_status', data=data, ax=ax)
fig1.set_xlabel('Age of patient', fontsize=15)
fig1.set_ylabel('Number of patients Survived/Died', fontsize=15)
fig1.set_title('Age vs Survival Relationship', fontsize=15)
plt.legend(bbox_to_anchor=(1,1), loc='upper right', title='Survival Status')
plt.show()

In [None]:
# Swarmplot of Age vs Survival Relationship
sns.set_style('white')
fig, ax = plt.subplots(figsize=(7, 5))
sns.swarmplot(data=data, y='survival_status', x='age_of_patient', ax=ax)
# sns.swarmplot(data=data, y='survivalStatus', x='ageOfPatient', hue='survivalStatus', ax=ax)
ax.set_xlabel('Age of patient', fontsize=15)
ax.set_ylabel('Survival status', fontsize=15)
ax.set_title('Age vs Survival Relationship', fontsize=15)
# ax.legend(bbox_to_anchor=(1,1), loc='upper left', title='Survival Status')  # a legend would be too obvious here, hence commented
plt.show()

In [None]:
pd.crosstab(data.age_of_patient, data.survival_status) # Gives the number of Patients Survived/Died per age

In [None]:
# Alternative: data.groupby('age_of_patient')['survival_status'].value_counts()
# Following code gives Patient survival % per age.
grouped = data.groupby('age_of_patient')['survival_status']
for age, statuses in grouped:
    # Share of 'Survived' labels within this age; 0 when nobody survived
    rate = round((statuses == 'Survived').mean() * 100, 2)
    print("Age of patient : "+str(age)+", Patient Survival rate is : "+str(rate)+"%")

**Observations:**

- *Patients in the age group **50-55** years had a higher chance of survival than other age groups.*
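
Raw counts favour the most populous ages, so age groups are easier to compare by survival rate per age band. A minimal sketch, assuming ten made-up patients and 10-year bands (both are illustrative choices, not taken from the dataset):

```python
import pandas as pd

# Hypothetical patients (illustrative only, not the Haberman data)
toy = pd.DataFrame({
    'age_of_patient': [31, 38, 44, 47, 52, 53, 54, 58, 63, 67],
    'survival_status': ['Survived', 'Survived', 'Died', 'Survived', 'Survived',
                        'Survived', 'Died', 'Survived', 'Died', 'Died'],
})

# Bin ages into 10-year groups; right=False gives bins like [30, 40)
toy['age_group'] = pd.cut(toy['age_of_patient'], bins=[30, 40, 50, 60, 70], right=False)

# normalize='index' turns counts into row-wise fractions, i.e. rates per group
rates = pd.crosstab(toy['age_group'], toy['survival_status'], normalize='index') * 100
print(rates.round(2))
```

With only a handful of patients per band, such rates are noisy, so they are best read alongside the group sizes.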

In [None]:
# Box plot to give 5 number summary
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(10, 7))
sns.boxplot(data=data, y='survival_status', x='age_of_patient', hue='survival_status', ax=ax)
ax.set_xlabel('Age of patient', fontsize=15)
ax.set_ylabel('Survival status', fontsize=15)
ax.set_title('Age vs Survival Relationship', fontsize=15)
ax.legend(bbox_to_anchor=(1,1), loc='upper left', title='Survival Status')  
plt.show()

In [None]:
# Violin plot
sns.set_style('white')
fig, ax = plt.subplots(figsize=(10, 7))
sns.violinplot(data=data, x='survival_status', y='age_of_patient', hue='survival_status', ax=ax)
ax.set_xlabel('Survival status', fontsize=15)
ax.set_ylabel('Age of patient', fontsize=15)
ax.set_title('Age vs Survival Relationship', fontsize=15)
ax.legend(bbox_to_anchor=(1,1), loc='upper left', title='Survival Status')  
plt.show()

In [None]:
# To calculate Q1, Median (Q2), Q3, IQR, Min, Max etc.
grouped = data.groupby('survival_status')['age_of_patient']
for status, ages in grouped:
    q1, q2, q3 = np.percentile(ages, [25, 50, 75])
    print("\nPatient status : "+str(status))
    print("Lowest age is "+str(ages.min()))
    print("Highest age is "+str(ages.max()))
    print("Median age is "+str(ages.median()))
    print("1st Quartile (Q1) is "+str(q1))
    print("2nd Quartile (Q2) is "+str(q2)+" which should be same as Median")
    print("3rd Quartile (Q3) is "+str(q3))
    print("IQR is "+str(q3 - q1))

**Observations:**

- *Lowest age of patients that **Survived** is **30** years and patients that Died is **34**.*

- *Q1 (the first quartile, or the 25% mark) for **Survived** patients is **43** and for patients that **Died** is **46**.*

- *Median age of patients that **Survived** is **52** years and patients that **Died** is **53**.*

- *Q3 (the third quartile, or the 75% mark) for **Survived** patients is **60** and for patients that **Died** is **61**.*

- *Highest age of patients that **Survived** is **77** years and patients that **Died** is **83**.*

- *The middle 50% of **Survived** patients are aged **43 to 60**.* 

- *The middle 50% of **Died** patients are aged **46 to 61**.*

- *The box plot shows no outliers in patient age, and every recorded age is plausible.*
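
Whether a box plot flags outliers can be checked against the usual 1.5 x IQR rule. A minimal sketch using the quartiles, minimum and maximum reported in the observations above (the helper `tukey_fences` is a hypothetical name, and the quartiles are taken as reported):

```python
# Summary values of age as reported in the observations above
age_summary = {
    'Survived': {'min': 30, 'q1': 43, 'q3': 60, 'max': 77},
    'Died': {'min': 34, 'q1': 46, 'q3': 61, 'max': 83},
}

def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a box plot marks outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

has_outliers = {}
for status, s in age_summary.items():
    lower, upper = tukey_fences(s['q1'], s['q3'])
    # A group has outliers if its extreme values fall outside the fences
    has_outliers[status] = s['min'] < lower or s['max'] > upper
    print("{}: fences = ({}, {}), outliers present: {}".format(
        status, lower, upper, has_outliers[status]))
```

Both groups stay inside their fences, which matches the absence of outlier markers in the age box plot.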

### Distribution of Patients age

In [None]:
# Histogram with KDE of Patients age (histplot replaces the deprecated distplot)
ax = sns.FacetGrid(data, hue='survival_status', height=5)
ax.map(sns.histplot, 'age_of_patient', kde=True, stat='density')
ax.set_xlabels('Age of patients', fontsize=15)
ax.set_ylabels('Density', fontsize=15)
ax.add_legend(title='Survival Status', fontsize=12)
plt.show()

In [None]:
# Histogram with KDE of Patients age, one panel per survival status
ax = sns.FacetGrid(data, col='survival_status', hue='survival_status', height=5)
ax.map(sns.histplot, 'age_of_patient', kde=True, stat='density')
ax.set_xlabels('Age of patients', fontsize=15)
ax.set_ylabels('Density', fontsize=15)
ax.add_legend(title='Survival Status', fontsize=12)
plt.show()

**Observations:**

- *Patients in the age group of **50 to 55** had a higher chance of survival than other age groups.*

### Year of surgery vs Number of Patients undergone surgery

In [None]:
# Countplot of Year of surgery vs Number of Patients undergone surgery
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(10,5))
fig = sns.countplot(x=data.year_of_surgery, ax=ax)
fig.set_xlabel('Year of surgery', fontsize=15)
fig.set_ylabel('Number of Patients undergone surgery', fontsize=15)
fig.set_title('Year of surgery vs Number of Patients undergone surgery', fontsize=15)
fig.set_xticklabels([1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969])
plt.show()

In [None]:
data.year_of_surgery.value_counts()  # Gives number of Patients undergone surgery per year

**Observations**

- *The most surgeries were performed in **1958**: **36** patients underwent surgery that year.*

- *The fewest surgeries were performed in **1969**: only **11** patients underwent surgery that year.*

### Year of surgery vs Survival Relationship

In [None]:
# Countplot of Year of surgery vs Survival Relationship
sns.set_style('whitegrid')
fig1, ax = plt.subplots(1, 1, figsize=(15,7))
fig1 = sns.countplot(x=data.year_of_surgery, hue='survival_status', data=data, ax=ax)
fig1.set_xlabel('Year of surgery', fontsize=15)
fig1.set_ylabel('Number of Patients Survived/Died', fontsize=15)
fig1.set_title('Year of surgery vs Survival Relationship', fontsize=15)
fig1.set_xticklabels([1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969])
fig1.legend(title='Survival Status')
plt.show()

In [None]:
pd.crosstab(data.year_of_surgery, data.survival_status) # Gives the number of Patients Survived/Died per year

In [None]:
# Alternative: data.groupby('year_of_surgery')['survival_status'].value_counts()
# Following code gives Patient survival % per year of surgery.
grouped = data.groupby('year_of_surgery')['survival_status']
for year, statuses in grouped:
    rate = round((statuses == 'Survived').mean() * 100, 2)
    print("Year of surgery : 19"+str(year)+", Patient Survival rate is : "+str(rate)+"%")

**Observations:**

- *Year **1961** had the highest survival percentage, **88.46%**. Of the 26 surgeries done in **1961**, **23** patients survived whereas only **3** died.*

- *Year **1965** had the lowest survival percentage, **53.57%**. Of the 28 surgeries done in **1965**, only **15** patients survived and **13** died.*

- *Of the **36** patients who underwent surgery in **1958** (the highest yearly total), **24** survived 5 years or longer whereas **12** died within 5 years, a survival rate of **66.67%**.*
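
The yearly percentages above can be reproduced from the survived/died counts. A short worked sketch using only the three years quoted in the observations (the dictionary layout is an illustrative choice):

```python
# (survived, died) counts per year as reported in the observations above
year_counts = {1958: (24, 12), 1961: (23, 3), 1965: (15, 13)}

survival_rate = {
    year: round(survived / (survived + died) * 100, 2)
    for year, (survived, died) in year_counts.items()
}

for year, rate in survival_rate.items():
    total = sum(year_counts[year])
    print("{}: {}% of {} patients survived".format(year, rate, total))
```

Printing the yearly totals next to the rates is a reminder that each percentage rests on only a few dozen patients.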

### Distribution of Surgery year

In [None]:
# Histogram with KDE of surgery year (histplot replaces the deprecated distplot)
ax = sns.FacetGrid(data, hue='survival_status', height=5)
ax.map(sns.histplot, 'year_of_surgery', kde=True, stat='density')
ax.set_xlabels('Year of surgery', fontsize=15)
ax.set_ylabels('Density', fontsize=15)
ax.add_legend(title='Survival Status', fontsize=12)
#ax.set_xticklabels([1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969])
plt.show()

In [None]:
# Histogram with KDE of surgery year, one panel per survival status
ax = sns.FacetGrid(data, col='survival_status', hue='survival_status', height=5)
ax.map(sns.histplot, 'year_of_surgery', kde=True, stat='density')
ax.set_xlabels('Year of surgery', fontsize=15)
ax.set_ylabels('Density', fontsize=15)
ax.add_legend(title='Survival Status', fontsize=12)
#ax.set_xticklabels([1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969])
plt.show()

**Observations:**

- *Patients who had surgery between **1958 and 1962** had a higher chance of survival than others.*

### Number of Positive Axillary nodes detected vs Number of Patients affected

In [None]:
# Countplot of Positive Axillary nodes detected vs Number of Patients affected
sns.set_style('whitegrid')
fig3, ax = plt.subplots(figsize=(20,5))
fig3 = sns.countplot(x=data.positive_axillary_nodes, ax=ax)
fig3.set_xlabel('Positive Axillary Nodes detected', fontsize=15)
fig3.set_ylabel('Number of Patients affected', fontsize=15)
fig3.set_title('Positive Axillary Nodes detected', fontsize=15)
plt.show()

In [None]:
data.positive_axillary_nodes.value_counts()  # Gives count of positive Axillary nodes and corresponding number of patients

**Observations**

- *No Positive Axillary nodes were detected in **136** patients surveyed.*

- *The highest number of Positive Axillary nodes detected in a patient is **52**.*

### Positive Axillary nodes detected vs Survival Relationship

In [None]:
# Countplot of Positive Axillary nodes detected vs Survival Relationship
sns.set_style('whitegrid')
fig1, ax = plt.subplots(1, 1, figsize=(15,7))
fig1 = sns.countplot(x=data.positive_axillary_nodes, hue='survival_status', data=data, ax=ax)
fig1.set_xlabel('Number of Positive Axillary Nodes detected', fontsize=15)
fig1.set_ylabel('Number of Patients Survived/Died', fontsize=15)
fig1.set_title('Positive Axillary Nodes detected vs Survival Relationship', fontsize=15)
plt.legend(bbox_to_anchor=(1,1), loc='upper right', title='Survival Status')
plt.show()

In [None]:
# Swarmplot of Positive Axillary nodes detected vs Survival Relationship
sns.set_style('white')
fig, ax = plt.subplots()
sns.swarmplot(data=data, y='survival_status', x='positive_axillary_nodes', ax=ax)
ax.set_xlabel('Positive Axillary nodes detected', fontsize=15)
ax.set_ylabel('Survival status', fontsize=15)
ax.set_title('Positive Axillary nodes detected vs Survival Relationship', fontsize=15)
plt.show()

In [None]:
pd.crosstab(data.positive_axillary_nodes, data.survival_status)  # Gives the number of Patients Survived/Died per Positive Axillary nodes detected

In [None]:
# Alternative: data.positive_axillary_nodes.value_counts()
# Following code gives Patient survival % per number of nodes detected.
grouped = data.groupby('positive_axillary_nodes')['survival_status']
for nodes, statuses in grouped:
    # Share of 'Survived' labels for this node count; 0 when nobody survived
    rate = round((statuses == 'Survived').mean() * 100, 2)
    print("Axillary Nodes detected : "+str(nodes)+", Patient Survival rate is : "+str(rate)+"%")

**Observations**

- *When no Positive Axillary nodes were detected, the survival rate was **86.03%**: of 136 such patients, **117** survived and only **19** died.*

In [None]:
# Box plot to give 5 number summary
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(10, 7))
sns.boxplot(data=data, y='survival_status', x='positive_axillary_nodes', hue='survival_status', ax=ax)
ax.set_xlabel('Positive Axillary nodes detected', fontsize=15)
ax.set_ylabel('Survival status', fontsize=15)
ax.set_title('Positive Axillary nodes detected vs Survival Relationship', fontsize=15)
ax.legend(bbox_to_anchor=(1,1), loc='upper left', title='Survival Status')  
plt.show()

In [None]:
# Violin plot
sns.set_style('white')
fig, ax = plt.subplots(figsize=(10, 7))
sns.violinplot(data=data, x='survival_status', y='positive_axillary_nodes', hue='survival_status', ax=ax)
ax.set_xlabel('Survival status', fontsize=15)
ax.set_ylabel('Positive Axillary nodes detected', fontsize=15)
ax.set_title('Positive Axillary nodes detected vs Survival Relationship', fontsize=15)
ax.legend(bbox_to_anchor=(1,1), loc='upper left', title='Survival Status')  
plt.show()

In [None]:
# To calculate Q1, Median (Q2), Q3, IQR, Min, Max etc.
grouped = data.groupby('survival_status')['positive_axillary_nodes']
for status, nodes in grouped:
    q1, q2, q3 = np.percentile(nodes, [25, 50, 75])
    print("\nPatient status : "+str(status))
    print("Lowest number of positive Axillary nodes detected is "+str(nodes.min()))
    print("Highest number of positive Axillary nodes detected is "+str(nodes.max()))
    print("Median number of positive Axillary nodes detected is "+str(nodes.median()))
    print("1st Quartile (Q1) is "+str(q1))
    print("2nd Quartile (Q2) is "+str(q2)+" which should be same as Median")
    print("3rd Quartile (Q3) is "+str(q3))
    print("IQR is "+str(q3 - q1))

**Observations:**

- *Lowest number of positive Axillary nodes detected in both **Survived** and **Died** patients is **0**.*

- *Q1 (the first quartile, or the 25% mark) for **Survived** patients is **0** and for **Died** patients is **1**.*

- *Median number of positive Axillary nodes detected in **Survived** patients is **0** and **Died** patients is **4**.*

- *Q3 (the third quartile, or the 75% mark) for **Survived** patients is **3** and for patients that **Died** is **11**.*

- *Highest number of positive Axillary nodes detected in **Survived** patients is **46** and **Died** patients is **52**.*

- *The middle 50% of **Survived** patients had **0 to 3** positive Axillary nodes detected.* 

- *The middle 50% of **Died** patients had **1 to 11** positive Axillary nodes detected.* 

- *Although the box plot marks some points as outliers, these may be genuine values.*

### Distribution of Positive Axillary nodes detected

In [None]:
# Histogram with KDE of Positive Axillary nodes detected (histplot replaces the deprecated distplot)
ax = sns.FacetGrid(data, hue='survival_status', height=5)
ax.map(sns.histplot, 'positive_axillary_nodes', kde=True, stat='density')
ax.set_xlabels('Positive Axillary nodes detected', fontsize=15)
ax.set_ylabels('Density', fontsize=15)
ax.add_legend(title='Survival Status', fontsize=12)
plt.show()

In [None]:
# Histogram with KDE of Positive Axillary nodes detected, one panel per survival status
ax = sns.FacetGrid(data, hue='survival_status', col='survival_status', height=5)
ax.map(sns.histplot, 'positive_axillary_nodes', kde=True, stat='density')
ax.set_xlabels('Positive Axillary nodes detected', fontsize=15)
ax.set_ylabels('Density', fontsize=15)
ax.add_legend(title='Survival Status', fontsize=12)
plt.show()

**Observations**

- *Patients who had **fewer than 5 positive Axillary nodes** had a higher chance of survival than others.*
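
A threshold claim like this one can be checked by splitting patients at the threshold and comparing survival rates on each side. A minimal sketch, assuming ten made-up patients (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical patients (illustrative only, not the Haberman data)
toy = pd.DataFrame({
    'positive_axillary_nodes': [0, 0, 1, 2, 4, 5, 8, 11, 23, 52],
    'survival_status': ['Survived', 'Survived', 'Survived', 'Died', 'Survived',
                        'Died', 'Survived', 'Died', 'Died', 'Died'],
})

# True for patients below the 5-node threshold discussed above
few_nodes = toy['positive_axillary_nodes'] < 5
survived = toy['survival_status'] == 'Survived'

# The mean of a boolean column is the fraction of True values
rate_by_group = survived.groupby(few_nodes).mean() * 100
print(rate_by_group)
```

Running the same comparison on the real `data` frame would give the actual rates on either side of the threshold.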

## Bivariate/Multivariate Analysis

### Age of patient vs Positive Axillary Nodes detected

In [None]:
# Scatter plot
ax1 = sns.FacetGrid(data, hue='survival_status', col='survival_status', height=5)
ax1.map(plt.scatter, 'age_of_patient', 'positive_axillary_nodes')
ax1.set_xlabels('Age of patient', fontsize=15)
ax1.set_ylabels('Positive Axillary nodes detected', fontsize=15)
ax1.add_legend(title='Survival Status', fontsize=12)

# Another one without using col parameter
ax2 = sns.FacetGrid(data, hue='survival_status', height=5)
ax2.map(plt.scatter, 'age_of_patient', 'positive_axillary_nodes')
ax2.set_xlabels('Age of patient', fontsize=15)
ax2.set_ylabels('Positive Axillary nodes detected', fontsize=15)
ax2.add_legend(title='Survival Status', fontsize=12)

# Another way
fig, ax3 = plt.subplots()
sns.scatterplot(x='age_of_patient', y='positive_axillary_nodes', hue='survival_status', data=data, ax=ax3) 
ax3.set_xlabel('Age of patient', fontsize=15) 
ax3.set_ylabel('Positive Axillary nodes detected', fontsize=15) 
ax3.legend(bbox_to_anchor=(1,1), loc='upper left', fontsize=12)
plt.show()

**Observations**

- *Survival status is not linearly separable here. Using `age_of_patient` and `positive_axillary_nodes` alone, we cannot easily separate patients who survived from those who died, because the two classes overlap.*

### Age of patient vs Year of surgery

In [None]:
# Swarmplot of Age of Patient vs Year of surgery Relationship
sns.set_style('white')
fig, ax = plt.subplots(figsize=(10, 7))
sns.swarmplot(data=data, hue='survival_status', x='year_of_surgery', y='age_of_patient', ax=ax)
ax.set_xlabel('Year of surgery', fontsize=15)
ax.set_ylabel('Age of patient', fontsize=15)
ax.set_title('Age of patient vs Year of surgery Relationship', fontsize=15)
ax.set_xticklabels([1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969])
ax.legend(bbox_to_anchor=(1,1), loc='upper left', title='Survival Status')  
plt.show()

**Observations**

- *The oldest patient (age 83) had surgery in **1958** and **died within 5 years of surgery**.*

- *The three youngest patients (age 30) had surgery in **1962, 1964 and 1965**, and all three **survived 5 years or longer**.*

### Year of surgery vs Positive Axillary nodes detected

In [None]:
# Swarmplot of Year of surgery vs Positive Axillary Nodes detected
sns.set_style('white')
fig, ax = plt.subplots(figsize=(10, 7))
sns.swarmplot(data=data, hue='survival_status', x='year_of_surgery', y='positive_axillary_nodes', ax=ax)
ax.set_xlabel('Year of surgery', fontsize=15)
ax.set_ylabel('Positive Axillary nodes detected', fontsize=15)
ax.set_title('Year of surgery vs Positive Axillary nodes detected', fontsize=15)
ax.set_xticklabels([1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969])
ax.legend(bbox_to_anchor=(1,1), loc='upper left', title='Survival Status')  
plt.show()

**Observations**

- *The highest number of Positive Axillary nodes in a single patient was detected in **1958**, and that patient **died within 5 years of surgery**.*

- *Every year had at least one patient with **zero** positive Axillary nodes, and the patient survived in most such cases.*

### Pairplot of all numerical features

In [None]:
ax = sns.pairplot(data, hue='survival_status', height = 4)
ax.add_legend(title='Survival Status')
plt.show()

**Observations**

- *From the pair plots above, no pair of features cleanly separates the class labels.*

## Conclusions

- *The Haberman dataset is clearly imbalanced. Of the **306** patients in the dataset, **225** survived 5 years or longer whereas **81** died within 5 years of surgery.*

- *Patients in the age group of **50 to 55** had a higher chance of survival than other age groups.*

- *Patients who had surgery between **1958 and 1962** had a higher chance of survival than others.*

- *Patients who had **fewer than 5 positive Axillary nodes** detected had a higher chance of survival than others.*