Description: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

The following information is provided with the dataset:

Number of Instances: 306

Number of Attributes: 4 (including the class attribute)

Attribute Information:

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
   1 = the patient survived 5 years or longer
   2 = the patient died within 5 years

Missing Attribute Values: None

    

In [None]:
#1.1
# Setting up the environment and storing data in a dataframe

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data into a dataframe; column names are supplied through `names`
col = ['Age', 'Operation_year', 'Axil_nodes', 'Surv_status']
haberman = pd.read_csv('../input/haberman.csv', names=col)
haberman.info()   # info() prints its summary itself and returns None, so no print() is needed
print("Distribution of records:\n", haberman['Surv_status'].value_counts())
print("Distribution of records in %:\n", haberman['Surv_status'].value_counts(normalize=True) * 100)


Observations:
    1. None of the attributes contains missing values.
    2. The dataframe is very small: it contains only 306 records.
    3. The data is imbalanced: 225 records (~73.5% of the total) belong to patients who
        survived 5 years or longer, and only 81 (~26.5%) belong to patients who died within 5 years.
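
The class split in observation 3 can be checked with a small helper. This is only a sketch: `class_balance` is a hypothetical name, and the label series is rebuilt from the counts quoted above (225 and 81) instead of being read from the CSV.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each class label."""
    counts = labels.value_counts().sort_index()
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})


# Labels rebuilt from the counts stated in the observations (assumption:
# 1 = survived 5 years or longer, 2 = died within 5 years).
labels = pd.Series([1] * 225 + [2] * 81, name="Surv_status")

balance = class_balance(labels)
print(balance)

# Ratio of the majority class to the minority class.
imbalance_ratio = balance["count"].max() / balance["count"].min()
print(round(imbalance_ratio, 2))
```

With the real data, `class_balance(haberman['Surv_status'])` would give the same table without rebuilding the labels by hand.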
        

#Bivariate Analysis

In [None]:
# `size` was renamed to `height` in seaborn 0.9
sns.pairplot(haberman, hue='Surv_status', height=4)

#2.1 Univariate Analysis and CDF for patients who survived 5 years or longer

In [None]:
haberman_serv_gt5yr=haberman.loc[haberman['Surv_status']==1]
haberman_not_serv=haberman.loc[haberman['Surv_status']==2]

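# Age: binned PDF and CDF
counts, bin_edges = np.histogram(haberman_serv_gt5yr['Age'], bins=10, density=True)
pdf = counts / counts.sum()   # normalise so the bin probabilities sum to 1
cdf = np.cumsum(pdf)
print(pdf)
print(cdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('Age')
plt.legend()
plt.show()   # close this figure so the next attribute is not drawn on the same axes

# Operation year: binned PDF and CDF
counts, bin_edges = np.histogram(haberman_serv_gt5yr['Operation_year'], bins=10, density=True)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
print(pdf)
print(cdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('Operation_year')
plt.legend()
plt.show()

# Positive axillary nodes: binned PDF and CDF
counts, bin_edges = np.histogram(haberman_serv_gt5yr['Axil_nodes'], bins=10, density=True)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
print(pdf)
print(cdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('Axil_nodes')
plt.legend()
plt.show()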
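
The cell above repeats the same three steps for every attribute: histogram, normalise, cumulative sum. A minimal sketch of those steps as one reusable function follows. The name `binned_pdf_cdf` and the toy ages are made up for illustration and are not part of the dataset. With equal-width bins, dividing the raw counts by their sum gives the same result as renormalising the `density=True` output, so the sketch uses raw counts.

```python
import numpy as np


def binned_pdf_cdf(values, bins=10):
    """Return (bin_edges, pdf, cdf) for equal-width bins.

    pdf[i] is the fraction of values that fall in bin i;
    cdf[i] is the fraction of values at or below the right edge of bin i.
    """
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()
    cdf = np.cumsum(pdf)
    return bin_edges, pdf, cdf


# Made-up ages, used only to show the call; not taken from haberman.csv.
toy_ages = np.array([30, 34, 38, 41, 45, 47, 52, 52, 58, 63, 70, 77])
edges, pdf, cdf = binned_pdf_cdf(toy_ages, bins=5)
print(edges)
print(pdf)
print(cdf)
```

In the notebook, the same call could be made once per column, for example `binned_pdf_cdf(haberman_serv_gt5yr['Age'])`, followed by the two `plt.plot(edges[1:], ...)` calls.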

#2.2 Univariate Analysis and CDF for patients who died within 5 years

In [None]:
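# Age: binned PDF and CDF
counts, bin_edges = np.histogram(haberman_not_serv['Age'], bins=10, density=True)
pdf = counts / counts.sum()   # normalise so the bin probabilities sum to 1
cdf = np.cumsum(pdf)
print(pdf)
print(cdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('Age')
plt.legend()
plt.show()   # close this figure so the next attribute is not drawn on the same axes

# Operation year: binned PDF and CDF
counts, bin_edges = np.histogram(haberman_not_serv['Operation_year'], bins=10, density=True)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
print(pdf)
print(cdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('Operation_year')
plt.legend()
plt.show()

# Positive axillary nodes: binned PDF and CDF
counts, bin_edges = np.histogram(haberman_not_serv['Axil_nodes'], bins=10, density=True)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
print(pdf)
print(cdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('Axil_nodes')
plt.legend()
plt.show()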

#3 Basic statistics

In [None]:
print("Patients who survived 5 years or more\n\n", haberman_serv_gt5yr.describe())
print("\n\nPatients who died within 5 years\n\n", haberman_not_serv.describe())

#4.1 Drawing boxplot for Age

In [None]:
sns.boxplot(x='Surv_status',y='Age',data=haberman)

#4.2 Drawing boxplot for operation year

In [None]:
sns.boxplot(x='Surv_status',y='Operation_year',data=haberman)

#4.3 Drawing boxplot for Axillary Nodes

In [None]:
sns.boxplot(x='Surv_status',y='Axil_nodes',data=haberman)
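
The box in each boxplot spans the first to third quartile, with a line at the median. A sketch of how those numbers could be read off directly per class is shown here. `quartiles_by_class` is a hypothetical helper and the toy operation years are invented for illustration, so the printed values say nothing about the real data.

```python
import pandas as pd


def quartiles_by_class(df, value_col, class_col="Surv_status"):
    """Return q1, median, q3 and IQR of value_col for each class."""
    q = df.groupby(class_col)[value_col].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q1", "median", "q3"]
    q["iqr"] = q["q3"] - q["q1"]
    return q


# Invented values, only to demonstrate the call; not taken from haberman.csv.
toy = pd.DataFrame({
    "Surv_status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Operation_year": [60, 62, 64, 66, 68, 58, 59, 61, 63, 65],
})

quartiles = quartiles_by_class(toy, "Operation_year")
print(quartiles)
```

On the notebook's dataframe the call would be `quartiles_by_class(haberman, 'Operation_year')`, and likewise for `'Age'` and `'Axil_nodes'`.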

#5.1 Drawing violinplot for Age

In [None]:
sns.violinplot(x='Surv_status',y='Age',data=haberman)

#5.2 Drawing violinplot for Operation Year

In [None]:
sns.violinplot(x='Surv_status',y='Operation_year',data=haberman)

#5.3 Drawing violinplot for Axillary Nodes

In [None]:
sns.violinplot(x='Surv_status',y='Axil_nodes',data=haberman)

**Observations/Conclusions**

1. At least 50% of the patients who survived 5 years or longer had 0 positive axillary nodes. This suggests that a patient
   with 0 positive axillary nodes has a high chance of surviving 5 years or more.      **(from section #3)**


2. If the operation was done in or after 1965, the chances of surviving 5 years or more are higher; if the
   operation was done before 1960, the chances of dying within 5 years are high.       **(from section #4.2)**
   
3. If a patient has more than 3 positive axillary nodes, the chances of dying within 5 years are very high.  **(from section #4.3)**
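
Conclusions 1 and 3 are statements about the chance of survival given a node count, whereas `describe()` and the boxplots show the distribution of node counts given the survival class. One way to check the conditional rate directly is sketched below. The helper names `node_group` and `survival_rate_by_group`, the group boundaries (0, 1-3, more than 3), and the toy rows are all assumptions made for illustration; the toy rows are not real patients.

```python
import pandas as pd


def node_group(n):
    """Bucket a positive-node count into '0', '1-3' or '>3'."""
    if n == 0:
        return "0"
    return "1-3" if n <= 3 else ">3"


def survival_rate_by_group(df, group_col, status_col="Surv_status"):
    """Fraction of patients in each group with status 1 (survived 5+ years)."""
    survived = df[status_col].eq(1)
    return survived.groupby(df[group_col]).mean()


# Invented rows, only to demonstrate the calculation.
toy = pd.DataFrame({
    "Axil_nodes": [0, 0, 0, 1, 2, 4, 7, 12],
    "Surv_status": [1, 1, 2, 1, 1, 2, 1, 2],
})
toy["node_group"] = toy["Axil_nodes"].map(node_group)

rates = survival_rate_by_group(toy, "node_group")
print(rates)
```

Applied to `haberman`, the resulting rates per node group would show how strongly the data supports each conclusion.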