In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('../input/habermans-survival-data-set/haberman.csv', names=['Patient_age','Operation_year','Axillary_nodes','Survival_status'])
data

#### Haberman's Survival Data:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

#### Attribute Information:

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (1 = the patient survived 5 years or longer, 2 = the patient died within 5 years, class attribute)

In [None]:
data.head(10)

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.columns

In [None]:
data['Survival_status'].value_counts()

The two classes have very different counts, so this is an imbalanced dataset.
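
To put a number on the imbalance, the counts can be turned into class shares. The sketch below is a minimal illustration: `class_balance` is a hypothetical helper (not part of the original notebook), and the `toy` labels are made-up values used only to show the call, not rows from the Haberman data.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and the share of each class, largest class first."""
    counts = labels.value_counts()
    shares = labels.value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": shares})

# Toy labels for illustration only (NOT the Haberman data):
# 1 = survived 5 years or longer, 2 = died within 5 years.
toy = pd.Series([1, 1, 1, 2, 1, 1, 2, 1], name="Survival_status")
summary = class_balance(toy)
print(summary)

# The share of the largest class equals the accuracy of a model that
# always predicts that class, a useful baseline for imbalanced data.
majority_baseline = summary["share"].max()
print(majority_baseline)
```

In the notebook the same helper would be called as `class_balance(data['Survival_status'])`.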

OBJECTIVE:
This is a binary classification problem: given a patient's features, predict whether the patient will survive 5 years or longer after the operation.

In [None]:
data.plot(kind='scatter', x='Patient_age', y='Axillary_nodes')
plt.title('Axillary Nodes vs Patient Age:', size=20)
plt.show()

Patient age and the number of positive axillary nodes are treated here as the most important factors in determining the chances of survival.

In [None]:
sns.set_style("whitegrid")
sns.FacetGrid(data, hue="Survival_status", height=4) \
   .map(plt.scatter, "Patient_age", "Operation_year") \
   .add_legend()
plt.show()

OBSERVATION:
1. Ages 30-40: survival is more likely.
2. Ages 40-70: survival and non-survival are almost equally likely.
3. Ages 70-80: survival is more likely.
4. Above age 80: death within 5 years is most likely, although very few patients fall in this range.

In [None]:
sns.set_style("whitegrid")
sns.FacetGrid(data, hue="Survival_status", height=5) \
    .map(plt.scatter, "Patient_age", "Axillary_nodes") \
    .add_legend()
plt.show()

OBSERVATION:
1. When nodes < 20 and 30 < age <= 40, the chances of survival are much higher.
2. When nodes < 10 and 40 < age <= 70, the chances of survival and non-survival are almost the same.
3. When nodes < 10 and 60 < age <= 70, the chances of non-survival are higher.
4. When 10 < nodes < 20 and 30 < age <= 50, the chances of survival are higher.
5. When 10 < nodes < 20 and 50 < age <= 70, the chances of non-survival are higher.
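
Observations like these are read off a scatter plot by eye, so it can help to tabulate the survival rate per age band and node band. The following is a rough sketch, not the notebook's own method: `survival_rate_by_bins` and the bin edges are assumptions chosen to resemble the ranges above, and the `toy` frame holds invented rows purely to demonstrate the call.

```python
import pandas as pd

def survival_rate_by_bins(df, age_edges, node_edges):
    """Survival rate and group size for each (age band, node band) cell."""
    binned = df.assign(
        # Age bands are right-closed, e.g. (30, 40] means 30 < age <= 40.
        age_bin=pd.cut(df["Patient_age"], bins=age_edges),
        # Node bands are left-closed, e.g. [0, 10) means 0 <= nodes < 10.
        node_bin=pd.cut(df["Axillary_nodes"], bins=node_edges, right=False),
        survived=(df["Survival_status"] == 1),
    )
    grouped = binned.groupby(["age_bin", "node_bin"], observed=True)["survived"]
    # mean of a boolean column = fraction of survivors in the cell
    return grouped.agg(rate="mean", n="count")

# Invented rows for illustration only (NOT the Haberman data).
toy = pd.DataFrame({
    "Patient_age":     [35, 38, 45, 65, 66, 55],
    "Axillary_nodes":  [2, 15, 3, 1, 12, 25],
    "Survival_status": [1, 1, 1, 2, 2, 2],
})
result = survival_rate_by_bins(
    toy,
    age_edges=[30, 40, 50, 60, 70, 80, 90],   # hypothetical edges
    node_edges=[0, 10, 20, 60],               # hypothetical edges
)
print(result)
```

The `n` column matters as much as `rate`: a cell with only a handful of patients cannot support a strong statement either way.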

In [None]:
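from itertools import combinations
def pair_plots(*args):
    # All unordered feature pairs. A pair plot draws each pair twice
    # (above and below the diagonal) plus one diagonal panel per feature.
    combos = list(combinations(args, 2))
    return combos, len(combos)*2 + len(args)
print(pair_plots(*data.columns[:-1]))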

In [None]:
sns.pairplot(data, hue='Survival_status', palette='rainbow', height=4)
plt.show()

OBSERVATION:

1. The two classes are not linearly separable in any of these pair plots.
2. When the number of axillary nodes is very low, roughly between 0 and 5, the chances of survival are higher.

In [None]:
long_survived = data.loc[data['Survival_status']== 1]
short_survived = data.loc[data['Survival_status']== 2]

plt.plot(long_survived['Axillary_nodes'], 'o', color='blue', label='Survived 5+ years (1)')
plt.plot(short_survived['Axillary_nodes'], 'o', color='orange', label='Died within 5 years (2)')

plt.xlabel('Row index')
plt.ylabel('Axillary_nodes')
plt.title('Survival vs Axillary_nodes')
plt.legend()
plt.show()

Patients with fewer than 10 positive nodes are more likely to survive 5 years or longer.

In [None]:
long_survived.shape

In [None]:
short_survived.shape

In [None]:
print(np.mean(long_survived['Patient_age']))
print(np.mean(short_survived['Patient_age']))

In [None]:
print(np.mean(long_survived['Axillary_nodes']))
print(np.mean(short_survived['Axillary_nodes']))

In [None]:
print(np.median(long_survived['Patient_age']))
print(np.median(short_survived['Patient_age']))

In [None]:
print(np.median(long_survived['Axillary_nodes']))
print(np.median(short_survived['Axillary_nodes']))

Axillary_nodes is more informative than Patient_age: its mean and median differ more between the two classes.
The mean of axillary nodes is much larger than the median, which indicates a right-skewed distribution with outliers.
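
A gap between mean and median hints at outliers; the usual box plot rule (values more than 1.5 IQR beyond the quartiles) makes that explicit. This is a small sketch under that assumption: `iqr_outliers` is a hypothetical helper and `toy_nodes` contains invented counts, not values from the dataset.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Invented node counts for illustration only (NOT the Haberman data).
toy_nodes = np.array([0, 0, 1, 1, 2, 3, 4, 30])
toy_mean = np.mean(toy_nodes)
toy_median = np.median(toy_nodes)
outliers = iqr_outliers(toy_nodes)

# A single large value pulls the mean well above the median.
print(toy_mean, toy_median)
print(outliers)
```

In the notebook it could be applied per class, for example `iqr_outliers(long_survived['Axillary_nodes'])`.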

In [None]:
counts, bins = np.histogram(long_survived['Axillary_nodes'], bins=10, density=True)
pdf = counts/(sum(counts))
print(pdf)
print(bins)

cdf = np.cumsum(pdf)
plt.plot(bins[1:],pdf)
plt.plot(bins[1:],cdf)
plt.legend(['PDF', 'CDF'])
plt.xlabel('Axillary_nodes')
plt.title('PDF and CDF of Axillary_nodes for long-survived patients')
plt.show()

In [None]:
counts, bins = np.histogram(short_survived['Axillary_nodes'], bins=10, density=True)
pdf = counts/(sum(counts))
print(pdf)
print(bins)

cdf = np.cumsum(pdf)
plt.plot(bins[1:],pdf)
plt.plot(bins[1:],cdf)
plt.legend(['PDF', 'CDF'])
plt.xlabel('Axillary_nodes')
plt.title('PDF and CDF of Axillary_nodes for short-survived patients')
plt.show()
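
The two cells above approximate the CDF from a 10-bin histogram, so the curve depends on the bin edges. As an alternative sketch, an empirical CDF can be computed directly from the sorted values without any binning. `ecdf` is a hypothetical helper and `toy` holds invented node counts, not values from the dataset.

```python
import numpy as np

def ecdf(values):
    """Empirical CDF: sorted values x and the fraction of data <= each x."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Invented node counts for illustration only (NOT the Haberman data).
toy = [0, 0, 1, 3, 8]
x, y = ecdf(toy)

# Fraction of observations with at most 1 node.
frac_at_most_1 = np.searchsorted(x, 1, side="right") / len(x)
print(x, y, frac_at_most_1)
```

The result can be drawn with `plt.step(x, y, where='post')` and read the same way as the binned CDF.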

In [None]:
print("SURVIVAL STATUS: YES -> STATISTICS: \n")
print(long_survived.describe())
print("\n\n")
print("SURVIVAL STATUS: NO -> STATISTICS: \n")
print(short_survived.describe())

In [None]:
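# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram + density curve
sns.FacetGrid(data, hue='Survival_status', height=5) \
    .map(sns.histplot, 'Operation_year', kde=True, stat='density') \
    .add_legend()
plt.title('PDF of Operation Year distribution:')
plt.show()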

In [None]:
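# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram + density curve
sns.FacetGrid(data, hue="Survival_status", height=4) \
    .map(sns.histplot, "Patient_age", kde=True, stat="density") \
    .add_legend()
plt.title('PDF of Patients Age distribution:')
plt.show()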

In [None]:
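# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram + density curve
sns.FacetGrid(data, hue="Survival_status", height=4) \
    .map(sns.histplot, "Axillary_nodes", kde=True, stat="density") \
    .add_legend()
plt.title('PDF of Axillary Nodes distribution:')
plt.show()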

Survival is most likely when the number of axillary nodes is roughly 0-1, and the survival rate declines gradually as the count rises. With more than 20 axillary nodes, death within 5 years is more likely than survival.
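
The density curves show where each class concentrates, but not the survival rate itself. A direct check is to bucket the node count and compute the fraction of survivors per bucket. The sketch below assumes bucket edges that mirror the ranges named above; `survival_rate_by_nodes` is a hypothetical helper and the `toy` rows are invented.

```python
import numpy as np
import pandas as pd

def survival_rate_by_nodes(df, edges=(-1, 1, 5, 20, np.inf)):
    """Fraction of survivors and group size per node-count bucket."""
    # Right-closed buckets: (-1, 1] covers 0-1 nodes, (20, inf] covers > 20.
    bucket = pd.cut(df["Axillary_nodes"], bins=list(edges))
    survived = df["Survival_status"].eq(1)
    rates = survived.groupby(bucket, observed=True).agg(["mean", "count"])
    return rates.rename(columns={"mean": "survival_rate", "count": "n"})

# Invented rows for illustration only (NOT the Haberman data).
toy = pd.DataFrame({
    "Axillary_nodes":  [0, 1, 0, 3, 4, 10, 25, 30],
    "Survival_status": [1, 1, 2, 1, 2, 2, 2, 1],
})
rates = survival_rate_by_nodes(toy)
print(rates)
```

Calling `survival_rate_by_nodes(data)` in the notebook would show whether the rate really falls as the node count rises, together with how many patients back each bucket.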

In [None]:
sns.boxplot(x='Survival_status', y='Patient_age', data=data)
plt.show()

OBSERVATION:

1. There are no outliers, but the two boxes overlap heavily, so not much can be derived from this plot.
2. For survivors, the middle 50% of ages lies between roughly 42 and 60.
3. For non-survivors, the middle 50% of ages lies between roughly 45 and 61.

In [None]:
sns.boxplot(x='Survival_status', y='Axillary_nodes', data=data)
plt.show()

OBSERVATION:

1. There are many outliers, so the median is preferred over the mean.
2. For survivors, the middle 50% of axillary node counts lies between roughly 0 and 4.
3. For non-survivors, the middle 50% of axillary node counts lies between roughly 2 and 11.
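
The box edges read from the plot are the first and third quartiles of each class, and they can be computed instead of estimated by eye. A minimal sketch, assuming the notebook's column names; `class_quartiles` is a hypothetical helper and the `toy` rows are invented.

```python
import pandas as pd

def class_quartiles(df, column, label="Survival_status"):
    """Q1, median, Q3 and IQR of `column` for each class in `label`."""
    q = df.groupby(label)[column].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q1", "median", "q3"]
    q["iqr"] = q["q3"] - q["q1"]
    return q

# Invented rows for illustration only (NOT the Haberman data).
toy = pd.DataFrame({
    "Axillary_nodes":  [0, 0, 1, 2, 4, 1, 3, 5, 11, 20],
    "Survival_status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})
quartiles = class_quartiles(toy, "Axillary_nodes")
print(quartiles)
```

In the notebook, `class_quartiles(data, 'Axillary_nodes')` and `class_quartiles(data, 'Patient_age')` would give the exact box limits for the two plots.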

In [None]:
print("Quartiles:")
for col in data.columns:
    # Percentiles 0, 25, 50, 75: minimum, Q1, median, Q3
    quar = np.percentile(data[col], np.arange(0, 100, 25))
    print("Quartiles of {} is {}".format(col,quar))
    IQR = quar[3] - quar[1]  # Q3 - Q1
    print("Inter quartile range of {} is {} \n".format(col,IQR))

In [None]:
sns.violinplot(x="Survival_status", y="Patient_age", data=data)
plt.title("Violin plot for Survival Status and Patient Age")
plt.show()

In [None]:
sns.violinplot(x="Survival_status", y="Operation_year", data=data)
plt.title("Violin plot for Survival Status and Operation Year")
plt.show()

In [None]:
sns.violinplot(x="Survival_status", y="Axillary_nodes", data=data)
plt.title("Violin plot for Survival Status and Axillary Nodes")
plt.show()

In [None]:
sns.jointplot(x="Patient_age", y="Operation_year", data=data, kind='kde')
plt.suptitle("Contour plot for Patient Age and Operation Year", y=1.02)  # figure-level title; plt.title would land on a marginal axis
plt.show()

In [None]:
sns.jointplot(x="Patient_age", y="Axillary_nodes", data=data, kind='kde')
plt.suptitle("Contour plot for Patient Age and Axillary Nodes", y=1.02)  # figure-level title; plt.title would land on a marginal axis
plt.show()

In [None]:
sns.jointplot(x="Operation_year", y="Axillary_nodes", data=data, kind='kde')
plt.suptitle("Contour plot for Operation Year and Axillary Nodes", y=1.02)  # figure-level title; plt.title would land on a marginal axis
plt.show()

CONCLUSION:

1. Patient age and the number of axillary nodes are the main deciding features for survival.
2. The dataset is imbalanced but complete: no values are missing.
3. Patients aged 40-60 have the highest chances of survival.
4. Operation year 60 (1960) had the highest survival rate.
5. Operation years 63-66 (1963-1966) had the lowest survival rate.
6. Patients with 0-1 axillary nodes have the highest survival rate.
7. Patients in the 30-34 age range survived after the treatment.
8. Patients older than 77 did not survive 5 years.
9. Patients with age < 40 and axillary nodes < 30 have higher chances of survival.
10. Patients with age > 50 and axillary nodes > 10 have lower chances of survival.
11. The interquartile range of Patient_age is 16.75.
12. The interquartile range of Operation_year is 5.75.
13. The interquartile range of Axillary_nodes is 4.0.
14. The classes are not linearly separable in the pair plots.
15. The Axillary_nodes feature has the most outliers.