In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## **Haberman's Survival Data:**

### The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

### **Attribute Information:**
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (1 = the patient survived 5 years or longer, 2 = the patient died within 5 years, class attribute)

### **OBJECTIVE:** We have to find out whether the patients will survive more than 5 years or not.

In [None]:
# Load haberman.csv into a pandas DataFrame
df = pd.read_csv('../input/habermans-survival-data-set/haberman.csv')
df.head()

The output above is wrong: the file has no header row, so the first data row was used as the column names. Let's fix this by passing explicit column names.

In [None]:
df = pd.read_csv('../input/habermans-survival-data-set/haberman.csv',names=['patient_age','operation_year','axillary_nodes','survival_status'])
df.head()
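
As a minimal sketch of why `names=` fixes the problem, the snippet below reads a tiny in-memory CSV twice. The three rows are made up for illustration and are not taken from the real file; only the four-column layout matches the dataset.

```python
import io
import pandas as pd

# Three made-up rows in the same 4-column layout as haberman.csv (illustrative only).
raw = "30,64,1,1\n30,62,3,1\n34,59,0,2\n"
cols = ["patient_age", "operation_year", "axillary_nodes", "survival_status"]

# Without names=, pandas treats the first data row as the header and loses it as data.
wrong = pd.read_csv(io.StringIO(raw))

# With names=, every row is kept as data and the columns get readable labels.
right = pd.read_csv(io.StringIO(raw), names=cols)

print(wrong.shape)  # one row fewer than the file contains
print(right.shape)  # all rows kept
```

Comparing the two shapes is a quick way to confirm that no patient record was silently consumed as a header.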

In [None]:
# how many data-points and features?
df.shape

In [None]:
df.info()

In [None]:
df.describe()

**OBSERVATION:**

**patient_age**
1. Oldest patient : 83 years old
2. Youngest patient : 30 years old
3. Average patient age : 52 years.
4. Median age : 52 years.

**axillary_nodes**
1. Highest number of axillary nodes in a patient : 52
2. Lowest number of axillary nodes in a patient : 0
3. Mean number of axillary nodes : 4
4. Median number of axillary nodes : 1
5. Most patients have between 0 and 5 axillary nodes

In [None]:
# column names in our dataset?
df.columns

**Independent Columns:** patient_age, operation_year, axillary_nodes

**Dependent Columns:** survival_status

In [None]:
# How many patients survived 5+ years, and how many died within 5 years?
df['survival_status'].value_counts()

In [None]:
patches, texts = plt.pie(df.survival_status.value_counts())
plt.legend(patches, ["Long Survived","Short Survived"], loc="best")
plt.tight_layout()
plt.show()

From the above result we **observe** that there is a large difference between the two survival classes.
1. Haberman's Survival Data Set is an imbalanced dataset.
2. Most patients survived for longer than 5 years.
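
To put a number on the imbalance, the class shares can be computed with `value_counts(normalize=True)`. The sketch below uses a hypothetical four-row label series rather than the real data; `class_shares` is a helper name introduced here for illustration.

```python
import pandas as pd

def class_shares(status: pd.Series) -> pd.Series:
    """Return the fraction of rows in each class, largest class first."""
    return status.value_counts(normalize=True)

# Toy labels (hypothetical, not the real data):
# 1 = survived 5 years or longer, 2 = died within 5 years.
toy_status = pd.Series([1, 1, 1, 2], name="survival_status")

shares = class_shares(toy_status)
print(shares)
```

In the notebook itself, the same helper could be applied as `class_shares(df['survival_status'])`.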

### **2-D Scatter Plot**

In [None]:
df.plot(kind='scatter', x='patient_age', y='axillary_nodes')
plt.title('Axillary Nodes vs Patient Age:', size=20)
plt.show()

We cannot make much sense out of this plot, because the two classes are not distinguished.
Let's color the points by their class label (survival status).

**2-D Scatter plot with color-coding for each Survival status/class.**

We can draw one 2-D scatter plot for each pair of features.

How many combinations exist? 3C2 = 3.
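
The count 3C2 = 3 can be checked by listing the feature pairs directly. This is a small sketch using `itertools.combinations` with the three feature column names used in this notebook.

```python
from itertools import combinations

features = ["patient_age", "operation_year", "axillary_nodes"]

# Every unordered pair of features corresponds to one 2-D scatter plot.
pairs = list(combinations(features, 2))

for x, y in pairs:
    print(x, "vs", y)
print("number of pairs:", len(pairs))
```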

In [None]:
sns.set_style("whitegrid")
sns.FacetGrid(df, hue="survival_status", height=5).map(plt.scatter, "patient_age", "axillary_nodes").add_legend()
plt.show()

**OBSERVATION:**

   1. When nodes < 20 & 30 < age <= 40, the chances of survival are much higher.
   2. When nodes < 10 & 40 < age <= 70, the chances of survival and non-survival are almost the same.
   3. When nodes < 10 & 60 < age <= 70, the chances of non-survival are higher.
   4. When 10 < nodes < 20 & 30 < age <= 50, the chances of survival are higher.
   5. When 10 < nodes < 20 & 50 < age <= 70, the chances of non-survival are higher.

In [None]:
sns.set_style("whitegrid")
sns.FacetGrid(df, hue="survival_status", height=5).map(plt.scatter, "patient_age", "operation_year").add_legend()
plt.show()

**OBSERVATION:**

   1. 30-40 age interval: More chances of survival.
   2. 40-70 age interval: Almost equal chances of survival and non survival.
   3. 70-80 age interval: More chances of survival.
   4. Above age 80: Most likely to die.

In [None]:
#3D Scatter plot
import plotly.express as px
fig = px.scatter_3d(df, x='operation_year', y='patient_age', z='axillary_nodes',color='survival_status')
fig.show()

The three features together could be used for classification, but the two classes overlap, so some misclassification errors are to be expected.

### **Pairwise scatter plot: Pair-Plot**
Disadvantages:
* Cannot be used when the number of features is high.
* Only 2-D patterns can be viewed.

In [None]:
plt.close();
sns.set_style("whitegrid");
sns.pairplot(df, hue="survival_status",palette="viridis", height=3);
plt.show()
# NOTE: the diagonal elements are PDFs for each feature. PDFs are explained below.

**OBSERVATION:**

1. axillary_nodes and patient_age are the most useful features for identifying the chances of survival.
2. When the number of axillary nodes is very low, roughly between 0 and 5, the chances of survival are higher.

### **Histogram, PDF, CDF**

In [None]:
#1-D scatter plot of survival status
survived_longer = df.loc[df["survival_status"] == 1];
survived_shorter = df.loc[df["survival_status"] == 2];

# Use point markers for both classes; a bare color code such as 'r' would draw a connecting line instead
plt.plot(survived_longer["axillary_nodes"], np.zeros_like(survived_longer['axillary_nodes']), 'ro')
plt.plot(survived_shorter["axillary_nodes"], np.zeros_like(survived_shorter['axillary_nodes']), 'o')
plt.title("Survival status: univariate scatter plot")
plt.show()

Disadvantages of 1-D scatter plot: Very hard to make sense as points are overlapping a lot.

In [None]:
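#Analysis of axillary nodes
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(df,hue="survival_status",height=8).map(sns.histplot,"axillary_nodes",kde=True,stat="density").add_legend()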
plt.title('Histogram of axillary nodes detected', fontsize=15)
plt.show()

**OBSERVATION:**

   1. Patients with 0 nodes are more likely to survive more than 5 years, although some of them still die within 5 years.
   2. Patients with fewer than 4 nodes have a high chance of surviving more than 5 years.

In [None]:
#Analysis of Patient Age
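# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(df, hue="survival_status",height=5).map(sns.histplot, "patient_age", kde=True, stat="density").add_legend()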
plt.title('Histogram of ages of patients', fontsize=15)
plt.show()

**Observations:**

1. The minimum age of a patient is 30.
2. The maximum age of a patient is 83.
3. The median age of a patient is around 52.
4. Between ages 40 and 75 the two classes overlap almost completely, so age alone cannot separate survival from non-survival.

In [None]:
#Analysis of Operation year
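# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(df,hue="survival_status",height=5).map(sns.histplot,"operation_year",kde=True,stat="density").add_legend()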
plt.title('Histogram of operation year of patients', fontsize=15)
plt.show()

**OBSERVATION:**
1. Operation year 60 (i.e. 1960) had the highest survival rate.
2. Operation years 63-66 had the lowest survival rate.

### axillary_nodes is the most useful feature for identifying the survival status, since its two class distributions differ the most from each other.
### **Let's Use CDF to find the exact percentage of people who will survive and who will not**

In [None]:
# Plots of CDF of axillary_nodes for survival status.
plt.figure(figsize=(20,10))
# survived_long
counts, bin_edges = np.histogram(survived_longer['axillary_nodes'], bins=10, density = True)
pdf = counts/(sum(counts))
# print(pdf);
# print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)


# survived_short
counts, bin_edges = np.histogram(survived_shorter['axillary_nodes'], bins=10, density = True)
pdf = counts/(sum(counts))
# print(pdf);
# print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
plt.xlabel('Axillary Lymph Nodes')
plt.legend(["1- PDF of high survival of more than 5 years","2- CDF of high survival of more than 5 years",
            "3- PDF of low survival of less than 5 years","4- CDF of low survival of less than 5 years"])
plt.show()

**OBSERVATION:**

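   1. About 84% of the patients who survived 5+ years had fewer than roughly 3 nodes. The cutoff is read off a CDF built from only 10 bins, so it is approximate. Note that this is the share of survivors with few nodes, not the probability of surviving given few nodes.
   2. The plot suggests that patients with more than about 40 nodes did not survive 5 years, but very few patients fall in this range, so this is weak evidence.
   3. As the number of axillary nodes increases, the chance of survival decreases (from the PDF plot).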
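
A class-wise CDF answers "what share of survivors had at most k nodes?", which is a different question from "what share of patients with at most k nodes survived?". The sketch below computes both quantities on a small hypothetical table (the ten rows are invented for illustration, and both function names are introduced here), to show that the two numbers need not agree.

```python
import pandas as pd

def share_of_survivors_with_few_nodes(data: pd.DataFrame, k: int) -> float:
    """P(nodes <= k | survived 5+ years): what the class-wise CDF shows."""
    survived = data[data["survival_status"] == 1]
    return (survived["axillary_nodes"] <= k).mean()

def survival_rate_given_few_nodes(data: pd.DataFrame, k: int) -> float:
    """P(survived 5+ years | nodes <= k): the share of low-node patients who survived."""
    few = data[data["axillary_nodes"] <= k]
    return (few["survival_status"] == 1).mean()

# Hypothetical toy table, NOT the real Haberman data.
toy = pd.DataFrame({
    "axillary_nodes":  [0, 0, 1, 2, 8, 0, 3, 10, 15, 20],
    "survival_status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

cdf_value = share_of_survivors_with_few_nodes(toy, 3)
survival_rate = survival_rate_given_few_nodes(toy, 3)
print(cdf_value, survival_rate)
```

On the notebook's `df`, calling both functions with the same `k` would show how far the CDF reading is from an actual survival rate.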

### **Mean, Variance and Std-dev**

In [None]:
print("Axillary_nodes:\n")
print("Mean of number of nodes for people who survived 5+ years is: ",np.mean(survived_longer["axillary_nodes"]))
print("STD of number of nodes for people who survived 5+ years is: ",np.std(survived_longer["axillary_nodes"]))
print("\nMean of number of nodes for people who survived <5 years is: ",np.mean(survived_shorter["axillary_nodes"]))
print("STD of number of nodes for people who survived <5 years is: ",np.std(survived_shorter["axillary_nodes"]))

**OBSERVATION**
1. The mean number of nodes is low for patients who survived 5+ years and high for patients who survived <5 years.
2. The spread (standard deviation) of the node counts is smaller for patients who survived 5+ years than for patients who survived <5 years.

### **Median, Percentile, Quantile, IQR, MAD**

In [None]:
#Median, Quantiles, Percentiles, IQR.
print("\nMedians:")
print("Median of number of nodes for people who survived 5+ years: ",np.median(survived_longer["axillary_nodes"]))
print("Median of number of nodes for people who survived <5 years: ",np.median(survived_shorter["axillary_nodes"]))


print("\nQuantiles:")
print("percentiles of long survival are ",np.percentile(survived_longer["axillary_nodes"],np.arange(0, 100, 25)))
print("percentiles of short survival are ",np.percentile(survived_shorter["axillary_nodes"],np.arange(0, 100, 25)))

print("\n90th Percentiles:")
print("90th percentile for long survival is",np.percentile(survived_longer["axillary_nodes"],90))
print("90th percentile for short survival is",np.percentile(survived_shorter["axillary_nodes"],90))

from statsmodels import robust
print ("\nMedian Absolute Deviation")
print("MAD of number of nodes for people who survived 5+ years: ",robust.mad(survived_longer["axillary_nodes"]))
print("MAD of number of nodes for people who survived <5 years: ",robust.mad(survived_shorter["axillary_nodes"]))

**Observations from Median:**

The median number of nodes is 0 for patients who survived 5+ years and 4 for patients who survived <5 years.

**Observations from Quantiles:**

   1. About 50% of the long-survival patients have 0 nodes and 75% have at most 3, so the remaining 25% have more than 3 axillary nodes.
   2. 75% of the short-survival patients have at most 11 nodes.
   3. The 90th percentile is 8 nodes for long survival and 20 nodes for short survival: 90% of long-survival patients have at most 8 nodes, whereas 90% of short-survival patients have at most 20.
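
The section heading mentions the IQR, which the cell above does not print. A minimal sketch of how it could be computed is shown below, together with a raw (unscaled) MAD. The node counts are hypothetical, and `iqr` and `raw_mad` are helper names introduced here. As far as I know, `statsmodels.robust.mad` divides the raw MAD by a normal-consistency constant of about 0.6745 by default, so its output would be larger than the raw value computed here.

```python
import numpy as np

def iqr(values) -> float:
    """Interquartile range: 75th percentile minus 25th percentile."""
    q25, q75 = np.percentile(values, [25, 75])
    return float(q75 - q25)

def raw_mad(values) -> float:
    """Median absolute deviation from the median, without any scaling constant."""
    values = np.asarray(values, dtype=float)
    return float(np.median(np.abs(values - np.median(values))))

# Hypothetical node counts, NOT taken from the real data.
toy_nodes = np.array([0, 0, 1, 2, 3, 4, 8, 20])

print("IQR:", iqr(toy_nodes))
print("raw MAD:", raw_mad(toy_nodes))
```

In the notebook, the same helpers could be applied to `survived_longer["axillary_nodes"]` and `survived_shorter["axillary_nodes"]`.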

### **Box plot and Whiskers**

In [None]:
sns.boxplot(x='survival_status',y='axillary_nodes', data=df)
plt.show()

**OBSERVATION**
1. The 75th percentile of the long-survival class is nearly equal to the 25th percentile of the short-survival class.
2. The range of node counts for long survival is roughly 0 to 8.
3. The range of node counts for short survival is roughly 0 to 25.
4. The median axillary_nodes for patients who survived is zero.

If we classify every patient with 0 to 7 nodes as long survival, many short-survival patients also fall in that range, which gives an error of almost 50% on the short-survival class.
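
The error of such a cutoff rule can be measured per class. The sketch below is one way to do it, assuming a simple rule "predict long survival when nodes <= max_nodes"; the table is hypothetical and `threshold_errors` is a helper name introduced here.

```python
import pandas as pd

def threshold_errors(data: pd.DataFrame, max_nodes: int) -> pd.Series:
    """Predict class 1 when nodes <= max_nodes, else class 2; return the error rate per true class."""
    predicted = (data["axillary_nodes"] <= max_nodes).map({True: 1, False: 2})
    wrong = predicted != data["survival_status"]
    # Mean of a boolean Series is the fraction of True values, i.e. the error rate.
    return wrong.groupby(data["survival_status"]).mean()

# Hypothetical toy table, NOT the real Haberman data.
toy = pd.DataFrame({
    "axillary_nodes":  [0, 0, 1, 2, 8, 0, 3, 10, 15, 20],
    "survival_status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

errors = threshold_errors(toy, 7)
print(errors)
```

Running `threshold_errors(df, 7)` on the real data would show how close the short-survival error actually is to the 50% estimated from the box plot.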

### **Violin plots**

In [None]:
# height is not a violinplot parameter, so it is dropped here
sns.violinplot(x='survival_status',y='axillary_nodes', data=df)
plt.show()

**OBSERVATION**
   1. Long-survival points are most densely concentrated near 0, and short-survival points near 2.
   2. The whiskers for long survival extend from 0 to 7, and those for short survival extend from 0 to 25.
   3. Both distributions are heavily right-skewed.

### **Multivariate probability density, contour plot.**

In [None]:
#2D density plot, contour plot
sns.jointplot(x='patient_age',y='axillary_nodes', data= df, hue="survival_status",height= 9, kind="kde")
plt.show()

### **Correlation**
Let's plot correlations between columns

In [None]:
corr = df.corr()
# annot=True prints each correlation coefficient inside its cell
sns.heatmap(corr, annot=True, xticklabels=corr.columns, yticklabels=corr.columns)
plt.show()

**OBSERVATION**
1. axillary_nodes has almost no linear correlation with patient_age and operation_year.
2. axillary_nodes and survival_status are slightly correlated. Because survival_status is coded 1 (survived) and 2 (died), a positive correlation means that more nodes go with a higher chance of dying within 5 years.
3. patient_age and survival_status are also slightly correlated, but less strongly than axillary_nodes and survival_status.
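
Since the class label is coded 1/2, the sign of its correlation is easier to read after recoding it as a 0/1 "died within 5 years" indicator. The sketch below does this on a hypothetical table (not the real data); the recoding is a linear shift, so the Pearson coefficient itself is unchanged.

```python
import pandas as pd

# Hypothetical toy table, NOT the real Haberman data.
toy = pd.DataFrame({
    "axillary_nodes":  [0, 0, 1, 2, 8, 0, 3, 10, 15, 20],
    "survival_status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

# 1 if the patient died within 5 years, 0 if the patient survived 5 years or longer.
died = (toy["survival_status"] == 2).astype(int)

r_recoded = toy["axillary_nodes"].corr(died)
r_original = toy["axillary_nodes"].corr(toy["survival_status"])
print(r_recoded, r_original)
```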
