# Haberman's Survival Dataset: EDA

## Details
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

## Attributes:

1. age: Age of patient at time of operation (numerical)

2. year: Patient's year of operation (year - 1900, numerical)

3. nodes: Number of positive axillary nodes detected (numerical)

4. status: Survival status (class attribute)

    1 = the patient survived 5 years or longer

    2 = the patient died within 5 years

## Objective:
Given the age, year and nodes, predict whether a patient who underwent surgery for breast cancer survived 5 years or longer.

**Importing packages and libraries**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
import os
print(os.listdir('../input')) #checking input dataset

**Loading Dataset**

In [None]:
haberman_df = pd.read_csv('../input/haberman.csv/haberman.csv')


**Understanding the data**

The first 5 rows of the dataset can be inspected with the head() function.

In [None]:
haberman_df.head()

In [None]:
print(haberman_df.shape)    # (number of data points, number of columns)
print(haberman_df.columns)  # column names in the dataset

In [None]:
haberman_df["status"].value_counts()


**Observations:**


1.   There are 306 data points and 4 columns: 3 features (age, year, nodes) and the class label (status).
2.   The Haberman dataset is imbalanced: 225 patients survived 5 years or longer, while 81 died within 5 years.
3.   The status column is stored as an integer, but the codes 1 and 2 are labels, not quantities. It should be converted to a categorical datatype.
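
To put a number on the imbalance, a small sketch like the one below can help. The helper `class_balance` is a hypothetical name, not part of the notebook, and the label list is rebuilt from the totals reported above (306 rows, 225 in class 1) rather than read from the CSV file.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts, proportions, and the majority-class baseline accuracy."""
    counts = Counter(labels)
    total = sum(counts.values())
    proportions = {cls: n / total for cls, n in counts.items()}
    # Accuracy of a model that always predicts the most frequent class.
    baseline = max(counts.values()) / total
    return counts, proportions, baseline


# Assumption: labels reconstructed from the reported totals, not loaded from disk.
labels = [1] * 225 + [2] * (306 - 225)

counts, proportions, baseline = class_balance(labels)
print(dict(counts))
print({cls: round(p, 3) for cls, p in proportions.items()})
print(f"majority-class baseline accuracy: {baseline:.3f}")
```

In the notebook itself, `class_balance(haberman_df["status"])` should give the same figures. The baseline matters because any classifier built on this data has to beat a model that always answers "survived".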

In [None]:
print(list(haberman_df['status'].unique())) # print the unique values of the target column(status)

There are two unique values, 1 and 2, in the status column. The value 1 can be mapped to 'YES' (the patient survived 5 years or longer) and the value 2 to 'NO' (the patient died within 5 years).
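
The notebook's mapping produces plain strings. If a true pandas categorical dtype is wanted, one possible approach is sketched here on a toy series; the five values are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the status column (illustrative values only).
status = pd.Series([1, 2, 1, 1, 2], name="status")

# Map the integer codes to labels, then store them as a categorical
# with an explicit category order.
labelled = status.map({1: "YES", 2: "NO"}).astype(
    pd.CategoricalDtype(categories=["YES", "NO"])
)

print(labelled.dtype)
print(labelled.value_counts())

# Codes missing from the mapping would become NaN, so check for that.
print("unmapped values:", int(labelled.isna().sum()))
```

A categorical column keeps the set of valid labels explicit, which can make later group-by and plotting steps less error-prone.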

In [None]:
haberman_df['status'] = haberman_df['status'].map({1:'YES', 2:'NO'}) #mapping the value '1' to 'YES'and value '2' to 'NO'
haberman_df.head() #printing the first 5 records from the dataset.

## Scatter plots
**1-D scatter plot**

In [None]:
one = haberman_df.loc[haberman_df["status"] == "YES"]
two = haberman_df.loc[haberman_df["status"] == "NO"]
plt.plot(one["age"], np.zeros_like(one["age"]), 'o',label='YES')
plt.plot(two["age"], np.zeros_like(two["age"]), 'o',label='NO')
plt.title("1-D scatter plot for age")
plt.xlabel("age")
plt.legend(title="survival_status")
plt.show()

**Observation:**
1. The points of the two classes overlap heavily, so we can't infer much from this 1-D scatter plot.

**2-D Scatter Plot**

In [None]:
sns.set_style("whitegrid");
sns.FacetGrid(haberman_df, hue="status", height=6) \
   .map(plt.scatter, "age", "nodes") \
   .add_legend();
plt.show();


**Observations:**
1. Separating the patients who survived from those who died is hard because the two groups overlap considerably (they are not linearly separable).

**Pair Plots**

In [None]:
sns.set_style("whitegrid")
sns.pairplot(haberman_df, diag_kind="kde", hue="status", height=4)
plt.show()


**Observation:**


1.   The pair plots are not very informative because the classes overlap heavily in every pair of features. No pair separates the classes cleanly.
2.   The plot of year against nodes separates the classes slightly better than the others.



## Univariate Analysis
**PDF(Probability Density Function)**


In [None]:
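# sns.distplot is deprecated in recent seaborn; histplot with kde=True draws a histogram plus density curve
sns.FacetGrid(haberman_df, hue="status", height=5) \
   .map(sns.histplot, "age", kde=True, stat="density") \
   .add_legend()
plt.title("PDF of age")
plt.show()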

**Observations:**
The age PDFs of the two classes overlap heavily, so age alone does not separate patients who survived from those who died. Roughly, though, patients in the 30-40 age group appear more likely to survive.
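
A claim such as "patients aged 30-40 are more likely to survive" can be checked by binning age and computing the survival rate per bin. The sketch below assumes the 'YES'/'NO' labels used in this notebook; the rows in `toy` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age":    [32, 35, 38, 44, 47, 52, 55, 58, 63, 67],
    "status": ["YES", "YES", "YES", "NO", "YES", "NO", "YES", "NO", "YES", "NO"],
})

# Half-open decade bins: [30, 40), [40, 50), [50, 60), [60, 70).
bins = [30, 40, 50, 60, 70]
toy["age_group"] = pd.cut(toy["age"], bins=bins, right=False)

# Share of 'YES' rows in each age group, alongside the group size.
summary = (
    toy.assign(survived=toy["status"].eq("YES"))
       .groupby("age_group", observed=True)["survived"]
       .agg(patients="size", survival_rate="mean")
)
print(summary)
```

Reporting the group size next to the rate is a deliberate choice: a high survival rate in a bin with only a handful of patients is weak evidence.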

In [None]:
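# sns.distplot is deprecated in recent seaborn; histplot with kde=True draws a histogram plus density curve
sns.FacetGrid(haberman_df, hue="status", height=5) \
   .map(sns.histplot, "year", kde=True, stat="density") \
   .add_legend()
plt.title("PDF of year")
plt.show()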

**Observations:** The two classes again overlap heavily, so the year of operation alone cannot be used to determine a patient's chance of survival.
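
"Major overlapping" can be turned into a number. One option, sketched here as an assumption rather than something the notebook does, is the histogram overlap coefficient: bin both classes on shared edges, normalise, and sum the bin-wise minimum. The function name `histogram_overlap` and the sample values are hypothetical.

```python
import numpy as np


def histogram_overlap(a, b, bins=10):
    """Overlap coefficient of two samples: 0 means disjoint, 1 means identical histograms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shared bin edges so both histograms are directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    p, _ = np.histogram(a, bins=edges)
    q, _ = np.histogram(b, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())


# Made-up samples: one identical pair, one pair with no shared range.
same = histogram_overlap([58, 60, 62, 64, 66], [58, 60, 62, 64, 66])
apart = histogram_overlap([58, 59, 60], [67, 68, 69])
print(f"identical samples: {same:.2f}")
print(f"disjoint samples:  {apart:.2f}")
```

In the notebook, passing the `year` column of the survived and died subsets would give a value between these two extremes. A value close to 1 supports the observation that the feature does not separate the classes.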

In [None]:
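# sns.distplot is deprecated in recent seaborn; histplot with kde=True draws a histogram plus density curve
sns.FacetGrid(haberman_df, hue="status", height=5) \
   .map(sns.histplot, "nodes", kde=True, stat="density") \
   .add_legend()
plt.title("PDF of nodes")
plt.show()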

**Observations:**

1.   The two classes overlap, so it is difficult to separate them using nodes alone.
2.   Roughly, though, patients with 0 or 1 positive node appear more likely to survive.

**Cumulative Distribution Function(CDF)**

In [None]:
# the patient survived 5 years or longer
counts, bin_edges = np.histogram(one['nodes'], bins=10, density = True)
pdf1 = counts/(sum(counts))
print(pdf1);
print(bin_edges)
cdf1 = np.cumsum(pdf1)
plt.plot(bin_edges[1:],pdf1)
plt.plot(bin_edges[1:], cdf1)
 
# the patient died within 5 years
counts, bin_edges = np.histogram(two['nodes'], bins=10, density = True)
pdf2 = counts/(sum(counts))
print(pdf2)
print(bin_edges)
cdf2 = np.cumsum(pdf2)
plt.plot(bin_edges[1:],pdf2)
plt.plot(bin_edges[1:], cdf2)

label = ["pdf of patient_survived", "cdf of patient_survived", "pdf of patient_died", "cdf of patient_died"]
plt.legend(label)
plt.xlabel("positive_lymph_node")
plt.title("pdf and cdf for positive_lymph_node")
plt.show();

**Observations**: 

1.   About 84% of the patients who survived had nodes<=4
2.   About 56% of the patients who died had nodes<=4.5
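
The thresholds read off a histogram-based CDF are tied to the bin edges, and the edges depend on each class's own value range; this is presumably why the two observations above quote slightly different cut-offs. To compare both classes at the same threshold, the empirical CDF can be evaluated directly, as in this sketch. The helper `fraction_at_most` and the node counts are made up for illustration.

```python
import numpy as np


def fraction_at_most(values, threshold):
    """Empirical CDF at one point: the share of values <= threshold."""
    values = np.asarray(values)
    return float(np.mean(values <= threshold))


# Made-up node counts, not taken from the dataset.
nodes = [0, 0, 0, 1, 1, 2, 3, 4, 7, 13]

for t in (1, 4):
    print(f"share with nodes <= {t}: {fraction_at_most(nodes, t):.2f}")

# With bins=10 the first right-hand bin edge is wherever the range puts it,
# not a round number chosen by the analyst.
edges = np.histogram_bin_edges(nodes, bins=10)
print("first right-hand bin edge:", edges[1])
```

Calling `fraction_at_most(one["nodes"], 4)` and `fraction_at_most(two["nodes"], 4)` in the notebook would give two percentages that refer to exactly the same threshold.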



In [None]:
# the patient survived 5 years or longer
counts, bin_edges = np.histogram(one['age'], bins=10, density = True)
pdf1 = counts/(sum(counts))
print(pdf1);
print(bin_edges)
cdf1 = np.cumsum(pdf1)
plt.plot(bin_edges[1:],pdf1)
plt.plot(bin_edges[1:], cdf1)
 
# the patient died within 5 years
counts, bin_edges = np.histogram(two['age'], bins=10, density = True)
pdf2 = counts/(sum(counts))
print(pdf2)
print(bin_edges)
cdf2 = np.cumsum(pdf2)
plt.plot(bin_edges[1:],pdf2)
plt.plot(bin_edges[1:], cdf2)

label = ["pdf of patient_survived", "cdf of patient_survived", "pdf of patient_died", "cdf of patient_died"]
plt.legend(label)
plt.xlabel("age")
plt.title("pdf and cdf for age")
plt.show();

**Observations:**
1.   About 20% of the patients who survived were younger than 41
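
A statement of the form "20% of patients were younger than X" is a percentile, so it can also be computed directly instead of being read off the CDF curve. A minimal sketch, using made-up ages rather than the dataset:

```python
import numpy as np

# Made-up ages for illustration only.
ages = np.array([30, 34, 38, 41, 43, 47, 50, 52, 54, 57, 61])

# 20th, 50th and 80th percentiles (NumPy's default linear interpolation).
p20, p50, p80 = np.percentile(ages, [20, 50, 80])
print(f"20th percentile: {p20}")
print(f"median:          {p50}")
print(f"80th percentile: {p80}")
```

In the notebook, `np.percentile(one["age"], 20)` would give the age below which roughly 20% of the surviving patients fall.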



**Box Plots**

In [None]:
sns.boxplot(x='status',y='age', data=haberman_df)
plt.title("Box_plot for age and survival status")
plt.show()

sns.boxplot(x='status',y='year', data=haberman_df)
plt.title("Box_plot for year and survival status")
plt.show()

sns.boxplot(x='status',y='nodes', data=haberman_df)
plt.title("Box_plot for nodes and survival status")
plt.show()

**Violin Plots**

In [None]:
# size is not a violinplot parameter, so it has been dropped
sns.violinplot(x="status", y="age", data=haberman_df)
plt.title("Violin plot for age and survival status")
plt.show()

sns.violinplot(x="status", y="year", data=haberman_df)
plt.title("Violin plot for year and survival status")
plt.show()

sns.violinplot(x="status", y="nodes", data=haberman_df)
plt.title("Violin plot for nodes and survival status")
plt.show()

**Observations:**

1.   Most patients with 0 or 1 positive axillary nodes survived. However, a small number of patients with no positive nodes died within 5 years of the operation, so the absence of positive axillary nodes doesn't guarantee survival.
2.   Many of the patients who survived were aged between 50 and 60, while many of those who died were aged between 45 and 55. These ranges overlap, so age on its own is not a strong indicator of a person's chance of survival.
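
The numbers behind a box plot can be printed rather than estimated by eye. The sketch below computes the quartiles, the whisker ends and the outliers using the usual 1.5 × IQR rule, which is the default in matplotlib and seaborn box plots. The helper name `box_stats` and the sample values are illustrative assumptions.

```python
import numpy as np


def box_stats(values):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "outliers": outliers.tolist(),
    }


# Made-up node counts with one extreme value.
stats = box_stats([0, 0, 1, 1, 2, 3, 4, 5, 30])
for name, value in stats.items():
    print(f"{name}: {value}")
```

Running `box_stats` separately on the `nodes` column of the survived and died subsets would give the exact medians and quartiles that the observations above describe qualitatively.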




**Contour Plot**

In [None]:
sns.jointplot(x="age", y="year", data=haberman_df, kind="kde");
plt.show();

**Observation:**

1.   The density is highest for operations performed between 1960 and 1964 on patients aged between 45 and 55.

## Conclusions:

1.   The Haberman dataset is not linearly separable: the classes overlap heavily in every feature, which makes them difficult to classify.
2.   The dataset is imbalanced, as it contains unequal numbers of data points for each class. This makes it harder to classify a patient's chance of survival from the given features.
3.   The number of positive axillary nodes gives some insight into the chance of survival: patients with zero or few positive nodes were more likely to survive. Even so, the absence of positive nodes does not guarantee survival.

