Data:
Haberman's survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Attribute Information:
Age of patient at the time of operation (numerical)

Patient's year of operation (year - 1900, numerical)

Number of positive axillary nodes detected (numerical)
Reference: https://www.medicalnewstoday.com/articles/319713#outlook

Survival status (class attribute)
1 = the patient survived 5 years or longer
2 = the patient died within 5 years
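The encodings above can be decoded with a small helper. This is a minimal sketch: the column names `age`, `year`, `nodes`, `status` are the ones this notebook assigns when loading the CSV, while `decode_haberman`, `full_year`, `survived_5y` and the toy rows are hypothetical names and values used only for illustration.

```python
import pandas as pd

# Toy rows in the four-column layout described above.
# The values are made up for illustration and are not taken from the dataset.
toy = pd.DataFrame(
    {"age": [30, 45, 62], "year": [58, 64, 69], "nodes": [0, 3, 22], "status": [1, 1, 2]}
)

def decode_haberman(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a four-digit year and a boolean survival flag."""
    out = df.copy()
    out["full_year"] = out["year"] + 1900      # year is stored as (year - 1900)
    out["survived_5y"] = out["status"].eq(1)   # 1 = survived 5 years or longer
    return out

decoded = decode_haberman(toy)
print(decoded)
```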

Load Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Objective: Analyze the survival of patients after breast cancer surgery.

Load Data:

In [None]:
columns = ['age', 'year', 'nodes', 'status']
data=pd.read_csv("/kaggle/input/habermans-survival-data-set/haberman.csv",header=None, names=columns)
data.head()


Basic Data Analysis

In [None]:
data.info()

In [None]:
data.shape #no of rows and columns 

In [None]:
data.describe() #basic statistics

In [None]:
statusdist=data['status'].value_counts(normalize=True)*100

statusdist # % values of status

In [None]:
labels=statusdist.keys().map({1:'Survived',2:'Not Survived'}) # map status codes to pie chart labels: 1 = Survived, 2 = Not Survived

plt.pie(x=statusdist,labels=labels,autopct='%1.2f%%')

plt.title("Survival rate(%) of Cancer Patients")

Observation: The dataset is imbalanced. About 74% of patients have status 1 (survived 5 years or longer), so survivors clearly outnumber non-survivors.
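One practical consequence of the imbalance is that a model which always predicts the majority class already scores the majority share as accuracy. The sketch below computes that baseline; `majority_baseline` is a hypothetical helper and the toy labels are illustrative only.

```python
import pandas as pd

def majority_baseline(status: pd.Series):
    """Return (most common class, its share of all rows)."""
    shares = status.value_counts(normalize=True)
    return shares.idxmax(), float(shares.max())

# Toy labels for illustration only: three survivors (1) and one non-survivor (2).
toy_status = pd.Series([1, 1, 2, 1], name="status")
label, share = majority_baseline(toy_status)
print(label, share)
```

In this notebook the call would be `majority_baseline(data['status'])`. Any classifier built later should be compared against that share rather than against 50%.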

# Univariate Analysis


Year of Operation

In [None]:
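# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(data,hue='status',height=5).map(sns.histplot, "year", kde=True, stat="density").add_legend()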
plt.title("Distribution of Year")
plt.show();



Observation: The PDFs of operation year for survived (1) and non-survived (2) cases overlap heavily, so year alone does not separate the two classes.

Analyze the number of cases across years

In [None]:
Noofcases=data['year'].value_counts()
ld=pd.DataFrame(Noofcases)
ld=ld.reset_index()

ld.columns=['Year','NoOfCases']
ld

In [None]:
sns.lineplot(data=ld,x='Year',y='NoOfCases')
plt.xlabel('Year')
plt.ylabel('No. of cases observed')
plt.title("No. of cases over Years")
plt.show()

The number of cases observed is highest in year 58 and in years 62-64. We need to compare survived vs. non-survived cases later.

Creating dataframes for survived and Non-survived data

In [None]:
survived=data.loc[data['status']==1].copy() # .copy() gives independent dataframes that can be modified later
nonsurvived=data.loc[data['status']==2].copy()



Positive Axillary Nodes

In [None]:
data['nodes'].unique()

In [None]:
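# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(data,hue='status',height=5).map(sns.histplot, "nodes", kde=True, stat="density").add_legend()
plt.title("Survival based on Axillary Nodes")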
plt.show();



Observation: Patients with fewer than 5 positive nodes are more likely to survive, while patients with more than 25 nodes are more likely not to survive.
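The thresholds read off the plot (fewer than 5 nodes, more than 25 nodes) can be checked numerically by grouping patients into node bands and computing the survival rate in each. This is a sketch under assumptions: `survival_rate_by_node_band` and the band labels are hypothetical, and the toy frame holds made-up values, not results from the dataset.

```python
import pandas as pd

def survival_rate_by_node_band(df, edges=(-1, 4, 25, float("inf")),
                               labels=("0-4", "5-25", ">25")):
    """Survival rate and patient count per band of positive nodes."""
    # pd.cut uses right-closed intervals: (-1, 4], (4, 25], (25, inf)
    bands = pd.cut(df["nodes"], bins=list(edges), labels=list(labels))
    survived_flag = df["status"].eq(1).astype(int)  # 1 if survived 5 years or longer
    return survived_flag.groupby(bands, observed=False).agg(["mean", "count"])

# Toy data for illustration only.
toy = pd.DataFrame({"nodes": [0, 2, 4, 7, 30], "status": [1, 1, 2, 2, 2]})
rates = survival_rate_by_node_band(toy)
print(rates)
```

On the notebook's dataframe the call would be `survival_rate_by_node_band(data)`. The `count` column matters because a band with very few patients gives an unreliable rate.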

In [None]:
counts, bin_edges = np.histogram(survived['nodes'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
#plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf,label='Survived');

counts, bin_edges = np.histogram(nonsurvived['nodes'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
#plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf,label='Not Survived');
plt.legend();
plt.show();
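The CDF above is built from a 10-bin histogram, so its shape depends on the bin edges. As an alternative sketch (not the method used in this notebook), an empirical CDF can be computed directly from the sorted values; `ecdf` is a hypothetical helper name.

```python
import numpy as np

def ecdf(values):
    """Return sorted values x and the fraction of observations <= each x."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Toy node counts for illustration only.
x, y = ecdf([3, 0, 1, 0])
print(x, y)
```

With the notebook's dataframes, `plt.step(*ecdf(survived['nodes']), where='post')` would draw the curve for survived cases without any binning choice.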



Various percentile values for axillary nodes:

In [None]:
for i in range(0,101,10):
    print(f"{i}th percentile for Survived cases: {np.percentile(survived['nodes'],i)}")
    print(f"{i}th percentile for Non-Survived cases: {np.percentile(nonsurvived['nodes'],i)}")

Observation: Up to the 90th percentile, survived cases have fewer than 10 positive nodes; the node counts rise sharply only beyond that.
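The same percentile comparison can be laid out as one table, which makes the two classes easier to compare side by side. A minimal sketch, assuming the `status` and `nodes` column names used in this notebook; `node_percentiles` and the toy values are hypothetical.

```python
import pandas as pd

def node_percentiles(df, qs=(0.5, 0.75, 0.9)):
    """Rows: status class. Columns: requested quantiles of the node count."""
    return df.groupby("status")["nodes"].quantile(list(qs)).unstack()

# Toy data for illustration only.
toy = pd.DataFrame({
    "status": [1, 1, 1, 1, 2, 2, 2, 2],
    "nodes":  [0, 0, 2, 4, 1, 5, 9, 20],
})
table = node_percentiles(toy)
print(table)
```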

Age

In [None]:
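# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(data,hue='status',height=5).map(sns.histplot, "age", kde=True, stat="density").add_legend()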
plt.title("Distribution for Age")
plt.show();



In [None]:
counts, bin_edges = np.histogram(survived['age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
#plt.plot(bin_edges[1:],pdf,label='Survived')
plt.plot(bin_edges[1:],cdf,label='Survived');

counts, bin_edges = np.histogram(nonsurvived['age'], bins=10, 
                                 density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
#plt.plot(bin_edges[1:],pdf,label='Not Survived')
plt.plot(bin_edges[1:],cdf,label='Not Survived');
plt.legend();
plt.show();

Observation: The PDF and CDF of age do not give enough information to separate survived from non-survived cases.

Creating 2 categorical variables for Age-Group and Status

In [None]:
data['age'].unique()

Creating Bins of 10 Years 

In [None]:

data['agegroup'] = pd.cut(data['age'], bins=[29,40,50,60,70,80,90], labels=['30-40','40-50','50-60','60-70','70-80','80-90'])
data.head()
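Note how `pd.cut` treats the bin edges: by default the intervals are right-closed, so the bins are (29, 40], (40, 50] and so on, and a patient aged exactly 40 is labelled '30-40'. The sketch below shows this on a few illustrative ages (the ages are made up; the bins and labels are copied from the cell above).

```python
import pandas as pd

ages = pd.Series([30, 40, 41, 50, 83])  # illustrative ages only
bins = [29, 40, 50, 60, 70, 80, 90]
labels = ['30-40', '40-50', '50-60', '60-70', '70-80', '80-90']

# Default right=True: each bin includes its upper edge and excludes its lower edge.
groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())
```

Passing `right=False` would instead give left-closed bins such as [40, 50), in which case the edges would need to start at 30 rather than 29.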

In [None]:
data['statusflag']=data['status'].map({1:'Survived',2:'Not Survived'})
data.head()

# Bivariate Analysis

Pair Plots

In [None]:
sns.set_style("whitegrid");
sns.pairplot(data, hue="status", height=3);
plt.show()


Observation: Age and nodes show some relationship, and both are important for analyzing patient survival.

Correlation between Numeric Variables

In [None]:
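# restrict to the original numeric columns; agegroup and statusflag are non-numeric and make corr() fail in recent pandas
corr=data[columns].corr(method='spearman')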
corr

Observation: Status is most strongly correlated with the number of axillary nodes.
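Spearman correlation is the Pearson correlation of the ranks, which is why it suits a skewed count such as nodes paired with a two-valued status. The sketch below checks that equivalence on toy values; `spearman_via_ranks` is a hypothetical helper and the numbers are illustrative, not results from the dataset.

```python
import pandas as pd

def spearman_via_ranks(a: pd.Series, b: pd.Series) -> float:
    """Pearson correlation of average ranks, i.e. Spearman's rho."""
    return float(a.rank().corr(b.rank()))

# Toy data for illustration only.
toy = pd.DataFrame({"nodes": [0, 1, 3, 8, 20], "status": [1, 1, 1, 2, 2]})
rho_manual = spearman_via_ranks(toy["nodes"], toy["status"])
rho_pandas = toy.corr(method="spearman").loc["nodes", "status"]
print(rho_manual, rho_pandas)
```

Because status is coded 1 = survived and 2 = died, a positive coefficient with nodes means that more nodes go with non-survival.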

In [None]:
#checking the non-survived dataframe across age ranges
#assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning on the filtered slice
nonsurvived = nonsurvived.assign(agegroups=pd.cut(nonsurvived['age'], bins=[29,40,50,60,70,80,90], labels=['30-40','40-50','50-60','60-70','70-80','80-90']))

sns.boxplot(x='agegroups',y='nodes',data=nonsurvived)
plt.show()

Observation: The 40-60 age range has more non-survived cases and higher axillary node counts.

In [None]:
Survivedcases=survived['year'].value_counts()
lds=pd.DataFrame(Survivedcases)
lds=lds.reset_index()

lds.columns=['Year','NoOfCases']
lds['status']='Survived'

Nonsurvivedcases=nonsurvived['year'].value_counts()
ldns=pd.DataFrame(Nonsurvivedcases)
ldns=ldns.reset_index()

ldns.columns=['Year','NoOfCases']
ldns['status']='NonSurvived'
final=pd.concat([lds,ldns],ignore_index=True)
#final
sns.lineplot(x ='Year', y ='NoOfCases', data = final,hue='status')
plt.title("Survived vs Non Survived across years")
plt.show()

Observation: Year 65 has the highest number of non-survived cases, followed by year 58.
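Raw counts per year depend on how many operations were performed that year, so a year with many operations can show many non-survived cases even if its rate is ordinary. A sketch that reports the count and the rate together follows; `non_survival_by_year`, its output column names and the toy rows are hypothetical.

```python
import pandas as pd

def non_survival_by_year(df: pd.DataFrame) -> pd.DataFrame:
    """Per year: non-survived count, number of operations, and non-survival rate."""
    died = df["status"].eq(2).astype(int)  # 2 = died within 5 years
    out = died.groupby(df["year"]).agg(["sum", "count", "mean"])
    out.columns = ["non_survived", "operations", "non_survival_rate"]
    return out

# Toy data for illustration only: year 58 has more deaths but a lower rate than year 65.
toy = pd.DataFrame({
    "year":   [58, 58, 58, 58, 58, 58, 65, 65],
    "status": [2, 2, 1, 1, 1, 1, 2, 1],
})
by_year = non_survival_by_year(toy)
print(by_year)
```

Calling `non_survival_by_year(data)` in this notebook would show whether the years with the most non-survived cases also stand out by rate.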

# Multivariate Analysis


In [None]:

pt=pd.pivot_table(data,index=["year",'statusflag'],columns=['agegroup'],values=["nodes"],
               aggfunc='count',fill_value=0,margins=True)

pt

Observation: This pivot table gives the count of survived and non-survived cases for each year and age group.

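Conclusions:
1. About 74% of patients survived 5 years or longer, so the dataset is imbalanced toward survivors.
2. The number of positive axillary nodes is an important factor for survival.
3. Patients aged 40-60 account for more non-survived cases.
4. The PDF and CDF charts alone are not very helpful for separating the two classes.
5. 90% of survived cases have fewer than 10 positive axillary nodes.
6. Year 65 has the highest number of non-survived cases.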