# Dataset Attribute Information:
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
> A positive axillary node is a lymph node in the area of the armpit (axilla) to which cancer has spread. This spread is determined by surgically removing some of the lymph nodes and examining them under a microscope to see whether cancer cells are present.

4. Survival status (class attribute)
*      1 = the patient survived 5 years or longer
*      2 = the patient died within 5 years

In [None]:
#importing all the libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

In [None]:
#read csv file
df = pd.read_csv("/kaggle/input/habermans-survival-data-set/haberman.csv",  names=['patient_age', 'year_of_operation','axilary_nodes', 'survival_status'])

In [None]:
#printing head for initial view of data
df.head()

In [None]:
#How many data points and features are there? (rows, columns)
print(df.shape)

In [None]:
#what are the column names in the dataset?
print(df.columns)

In [None]:
#number of classes
print(df['survival_status'].unique())

In [None]:
df.info()

Observation:
1. There are 4 columns in total.
2. All columns have the int64 datatype.
3. There are no missing attribute values.
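The "no missing values" observation is read off the non-null counts printed by `df.info()`. A more explicit check could look like the sketch below. The `toy` frame is a made-up stand-in (not the Haberman data) with one deliberately missing value so the check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with one missing value, for illustration only.
toy = pd.DataFrame({
    "patient_age": [30, 34, 41],
    "axilary_nodes": [1, np.nan, 0],
})

# Number of missing entries per column; all zeros means nothing is missing.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# A column holding NaN cannot stay int64, so the dtype is another hint.
print(toy.dtypes)
```

On the real data the same call would be `df.isna().sum()`.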

In [None]:
df['survival_status'].value_counts()

Observation:
1. Out of 306 patients, 225 survived 5 years or longer.
2. Out of 306 patients, 81 died within 5 years.
3. This is an imbalanced dataset.
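To put a number on the imbalance, the counts above can be turned into class shares. This is a minimal sketch that hard-codes the two counts from the observation instead of reading the CSV. The "majority-class baseline" is an added illustration, not part of the original analysis.

```python
import pandas as pd

# Class counts as reported in the observation (1 = survived, 2 = died).
class_counts = pd.Series({1: 225, 2: 81}, name="survival_status")

# Share of each class in the dataset.
class_share = class_counts / class_counts.sum()
print(class_share.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = class_share.max()
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

On the real data, `df['survival_status'].value_counts(normalize=True)` should give the same shares directly.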

In [None]:
df.plot(kind='scatter', x='patient_age', y='axilary_nodes')
plt.title('patient age vs axilary node')
plt.show()
#without class labels, this plot doesn't give much information about the dataset

In [None]:
sns.set_style('whitegrid')
sns.FacetGrid(df, hue = 'survival_status', height=5)\
    .map(plt.scatter, 'patient_age', 'axilary_nodes')\
    .add_legend()
plt.show()

Observation:
1. The two classes overlap in the patient age vs axillary nodes plot.
2. They can't be separated using these two features because the overlap is considerable.

In [None]:
#the pair plot shows how each feature relates to every other feature
sns.set_style('whitegrid')
sns.pairplot(df, hue = 'survival_status', height = 3)
plt.show()

Observation:
1. No pair of features separates the two classes easily with a linear boundary.

In [None]:
#1-D scatter plot of axillary nodes for each class
patient_long_survived = df.loc[df['survival_status'] == 1]
patient_short_survived = df.loc[df['survival_status'] == 2]

plt.plot(patient_long_survived['axilary_nodes'],np.zeros_like(patient_long_survived['axilary_nodes']),'o')
#use point markers ('ro') so each patient shows as a dot; 'r--' would only draw a dashed line
plt.plot(patient_short_survived['axilary_nodes'],np.zeros_like(patient_short_survived['axilary_nodes']),'ro')
plt.legend(['long survived', 'short survived'], loc = 'lower right')
plt.show()


# PDF with histplot

In [None]:
sns.FacetGrid(df, hue = 'survival_status', height= 5, palette="Blues")\
    .map(sns.histplot ,'patient_age', kde = True)\
    .add_legend()
plt.show()

Observation:
1. The two age distributions overlap heavily, so age alone does not separate survivors from non-survivors.

In [None]:
sns.FacetGrid(df, hue = 'survival_status', height= 5, palette="Blues")\
    .map(sns.histplot ,'axilary_nodes', kde = True)\
    .add_legend()
plt.show()

Observation:
1. Patients with 0 nodes have a high chance of survival.
2. Patients with 20 or more nodes have a very low chance of survival.
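The two observations above come from reading the histogram by eye. One way to check them numerically is to bucket the node counts and compute the survival rate per bucket. The sketch below uses a small made-up `toy` frame (not the real data), and the bucket edges are an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical toy rows, NOT the real Haberman data.
toy = pd.DataFrame({
    "axilary_nodes":   [0, 0, 0, 1, 2, 5, 9, 21, 25, 30],
    "survival_status": [1, 1, 1, 1, 2, 1, 2, 2, 2, 1],
})

# Bucket the node counts; right=False makes each bin [low, high).
bins = [0, 1, 5, 20, float("inf")]
labels = ["0", "1-4", "5-19", "20+"]
toy["node_bucket"] = pd.cut(toy["axilary_nodes"], bins=bins, labels=labels, right=False)

# Fraction of patients in each bucket with survival_status == 1.
survival_rate = (
    toy.assign(survived=toy["survival_status"].eq(1))
       .groupby("node_bucket", observed=False)["survived"]
       .mean()
)
print(survival_rate)
```

Replacing `toy` with `df` would apply the same bucketing to the real dataset.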

In [None]:
sns.FacetGrid(df, hue = 'survival_status', height= 5, palette="Blues")\
    .map(sns.histplot ,'year_of_operation', kde = True)\
    .add_legend()
plt.show()

Observation:
1. The distributions overlap heavily, so it is difficult to draw any conclusion from the year of operation.

# CDF

In [None]:
#long survived: normalise the bin counts to get a PDF, then accumulate them for the CDF
counts, bin_edges = np.histogram(patient_long_survived['axilary_nodes'],bins = 10, density = True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)

#short survived
counts, bin_edges = np.histogram(patient_short_survived['axilary_nodes'],bins = 10, density = True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.xlabel('Number of axillary nodes')
plt.legend(['PDF of long survived','CDF of long survived', 'PDF of short survived', 'CDF of short survived'])

plt.show()

Observation:
1. About 82% of the patients who survived had fewer than 8 axillary nodes.
2. About 95% of the patients who died had 25 or fewer axillary nodes. The CDF gives the share of patients at or below a value, not above it.
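A CDF value at `x` is the share of observations that are less than or equal to `x`, so reading it as "more than `x`" flips the meaning. The sketch below shows the reading on a made-up array of node counts. The helper name `fraction_at_or_below` is hypothetical and only used for this illustration.

```python
import numpy as np

# Hypothetical node counts for illustration only (not the real data).
nodes = np.array([0, 0, 1, 1, 2, 3, 4, 7, 12, 26])

def fraction_at_or_below(values, threshold):
    """Empirical CDF at `threshold`: share of values <= threshold."""
    values = np.asarray(values)
    return np.mean(values <= threshold)

share_le_8 = fraction_at_or_below(nodes, 8)
print(share_le_8)       # share with 8 or fewer nodes
print(1 - share_le_8)   # share with more than 8 nodes
```

The second printed value is the complement, which is what a statement of the form "more than 8 nodes" needs.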

# Mean, Variance, Std-dev :

In [None]:
#summary statistics (count, mean, std, min, quartiles, max) for each class
print('-'*30,"LONG SURVIVED",'-'*40,'\n')

print(patient_long_survived.describe())

print('\n','-'*30,"SHORT SURVIVED",'-'*40,'\n')

print(patient_short_survived.describe())

Observation:
1. 75% of the patients who survived had 3 or fewer nodes (the 75th percentile, not the average).
2. 75% of the patients who died had 11 or fewer nodes.
3. Patients with 0 nodes are more likely to survive.
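The `75%` row printed by `describe()` is a percentile, which is easy to confuse with the `mean` row. A small sketch on made-up, right-skewed node counts (not the real data) shows how far apart the two can be.

```python
import pandas as pd

# Hypothetical right-skewed node counts, for illustration only.
nodes = pd.Series([0, 0, 0, 1, 1, 2, 3, 10, 28])

summary = nodes.describe()
print("mean:", summary["mean"])
print("75th percentile:", summary["75%"])

# Share of values at or below the 75th percentile.
share_at_or_below = (nodes <= summary["75%"]).mean()
print("share at or below the 75th percentile:", round(share_at_or_below, 3))
```

Here a few large values pull the mean above the 75th percentile, which is why the two rows should be reported separately.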

# Median, Quantile, Percentile,MAD

In [None]:
print('-'*30,"Median Long Survived",'-'*40,'\n')
print('axillary nodes : ',np.median(patient_long_survived['axilary_nodes']))
print('patients age : ',np.median(patient_long_survived['patient_age']))

print('-'*30,"Median Short Survived",'-'*40,'\n')
print('axillary nodes : ',np.median(patient_short_survived['axilary_nodes']))
print('patients age : ',np.median(patient_short_survived['patient_age']))

print('-'*30,"90 % Percentile Long Survived",'-'*40,'\n')
print('axillary nodes : ',np.percentile(patient_long_survived['axilary_nodes'],90))
print('patients age : ',np.percentile(patient_long_survived['patient_age'],90))

print('-'*30,"90 % Percentile Short Survived",'-'*40,'\n')
print('axillary nodes : ',np.percentile(patient_short_survived['axilary_nodes'],90))
print('patients age : ',np.percentile(patient_short_survived['patient_age'],90))

print('-'*30,"Quantile Long Survived",'-'*40,'\n')
print('axillary nodes : ',np.percentile(patient_long_survived['axilary_nodes'],np.arange(25,101,25)))
print('patients age : ',np.percentile(patient_long_survived['patient_age'],np.arange(25,101,25)))

print('-'*30,"Quantile Short Survived",'-'*40,'\n')
print('axillary nodes : ',np.percentile(patient_short_survived['axilary_nodes'],np.arange(25,101,25)))
print('patients age : ',np.percentile(patient_short_survived['patient_age'],np.arange(25,101,25)))

########################################################
from statsmodels import robust

#robust.mad takes the data only; its second argument is a scale constant c, not a percentile,
#so passing 90 would divide the result by 90
print('-'*30,"MAD Long Survived",'-'*40,'\n')
print('axillary nodes : ',robust.mad(patient_long_survived['axilary_nodes']))
print('patients age : ',robust.mad(patient_long_survived['patient_age']))

print('-'*30,"MAD Short Survived",'-'*40,'\n')
print('axillary nodes : ',robust.mad(patient_short_survived['axilary_nodes']))
print('patients age : ',robust.mad(patient_short_survived['patient_age']))

Observation:
1. The median number of nodes for patients who died is 4.
2. About 90% of the patients who died had 20 or fewer axillary nodes (90th percentile).
3. About 90% of the patients who survived had 8 or fewer axillary nodes.
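The median absolute deviation (MAD) can also be computed by hand, which makes it clearer what the number means. The sketch below is a plain NumPy version written for illustration. The function name `mad` and the toy values are assumptions. The default constant is the usual normal-consistency factor of about 0.6745, which, as far as I know, is also the default `c` in `statsmodels.robust.mad`.

```python
import numpy as np

def mad(values, c=0.6744897501960817):
    """Median absolute deviation around the median, divided by the scale constant c."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    return np.median(np.abs(values - med)) / c

# Hypothetical node counts with one extreme value (not the real data).
toy_nodes = [0, 1, 2, 4, 30]

print(mad(toy_nodes, c=1.0))   # raw MAD, no scaling
print(mad(toy_nodes))          # scaled to be comparable with a standard deviation
print(np.std(toy_nodes))       # the standard deviation is pulled up by the outlier
```

The raw MAD stays small because the single large value barely moves the median of the deviations.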

# Box plot

In [None]:
sns.boxplot(x='survival_status',y = 'axilary_nodes', data = df)
plt.title('survival status vs axilary nodes')
plt.show()

sns.boxplot(x='survival_status',y = 'patient_age', data = df)
plt.title('survival status vs patient age')
plt.show()

sns.boxplot(x='survival_status',y = 'year_of_operation', data = df)
plt.title('survival status vs year of operation')
plt.show()

In [None]:
sns.violinplot(x='survival_status',y = 'axilary_nodes', data = df)
plt.title('survival status vs axilary nodes')
plt.show()

sns.violinplot(x='survival_status',y = 'patient_age', data = df)
plt.title('survival status vs patient age')
plt.show()

sns.violinplot(x='survival_status',y = 'year_of_operation', data = df)
plt.title('survival status vs year of operation')
plt.show()

In [None]:
#jointplot returns a grid; set the title on its figure so it does not land on a marginal axis
g = sns.jointplot(x='axilary_nodes',y = 'patient_age', data = df, kind='kde')
g.fig.suptitle('axilary nodes vs patient age', y=1.02)
plt.show()

g = sns.jointplot(x='patient_age',y = 'year_of_operation', data = df, kind='kde')
g.fig.suptitle('patient age vs year of operation', y=1.02)
plt.show()

# Final observation
1. The classes overlap heavily on every feature, so they can't be separated by a linear boundary.
2. About 90% of the patients who died had 20 or fewer axillary nodes.
3. Patients with 0 nodes are the most likely to survive.
4. About 75% of the patients who survived had 3 or fewer nodes.
5. This is an imbalanced dataset and a binary classification problem.
6. Patients aged between 40 and 63 with fewer than 3 axillary nodes are likely to survive.
7. Operations performed around 1960-1965 had more unsuccessful outcomes.
8. The fewer the axillary nodes, the higher the chance of survival.