Description : The Haberman Dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

In [None]:
import warnings 

warnings.filterwarnings("ignore")

In [None]:
#importing packages required to perform operations on the datasets

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# load the CSV into a pandas DataFrame; names= supplies the column labels
haberman = pd.read_csv('../input/habermans-survival-data-set/haberman.csv', names=['age', 'year', 'nodes', 'status'])
haberman

In [None]:
haberman['age'].value_counts()  # number of patients at each age

Observation : 

This gives the number of patients at each age.


E.g. 11 patients are 47 years old, only one patient is 83 years old, and so on.

In [None]:
haberman['age'].value_counts().plot(kind='bar', color='pink')  # bar chart of patient counts per age

Observation :



Bar chart of patient counts by age:


x axis : Age of patients.


y axis : Number of patients of that age.

The graph shows that age 52 has the most patients,
and ages of 78 and above have the fewest.
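
One caveat when reading this chart: `value_counts()` returns counts sorted by frequency, so the bars run from the most common age to the least common one rather than in age order. A minimal sketch of sorting by age before plotting, using a small made-up `ages` Series (hypothetical values, not the Haberman data):

```python
import pandas as pd

# Hypothetical ages, only for illustration (not taken from haberman.csv)
ages = pd.Series([52, 47, 52, 83, 47, 52])

by_frequency = ages.value_counts()            # most common age first
by_age = ages.value_counts().sort_index()     # youngest age first

print(by_frequency.index.tolist())
print(by_age.index.tolist())
```

On the real data, `haberman['age'].value_counts().sort_index().plot(kind='bar')` would place the ages in ascending order along the x axis.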


In [None]:
haberman['year'].value_counts()   # number of patients per year of operation

Observation :


This gives the number of patients per year.


The year 1958 has more patients than any other year.

In [None]:
haberman['year'].value_counts().plot(kind='bar', color='orange')  # bar chart of patient counts per year

Observation:


x axis : Year of operation, from 1958 to 1969.


y axis: Number of patients per year.

The graph shows that more patients were operated on in 1958 than in any other year.

The year with the fewest patients is 1969.

In [None]:
haberman['nodes'].value_counts()   #counting the nodes 

In [None]:
haberman['nodes'].value_counts().plot(kind='bar')  # bar chart of patient counts per number of nodes

Observation :


x: number of nodes.


y: number of patients with that many nodes.

More than 120 patients have 0 nodes and around 40 patients have 1 node.


In [None]:
haberman['status'].value_counts()   #checking the status of survival
                                    # 1 = the patient survived 5 years or longer
                                    # 2 = the patient died within 5 years

Observation :

225 people survived 5 years or longer.


81 people died within 5 years.
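
The two counts already hint at a class imbalance; expressing them as proportions makes it explicit. A small sketch that rebuilds the `status` column from the two counts reported above instead of reading the CSV (the variable names are illustrative):

```python
import pandas as pd

# Rebuild the status column from the counts reported above:
# 225 patients with status 1 and 81 patients with status 2
status = pd.Series([1] * 225 + [2] * 81)

# normalize=True returns fractions of the total instead of raw counts
proportions = status.value_counts(normalize=True)
print(proportions.round(3))
```

On the notebook's DataFrame the same figure would come from `haberman['status'].value_counts(normalize=True)`.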

In [None]:
haberman['status'].value_counts().plot(kind='bar', color='r')

Observation:

x axis: 1 and 2.


1 = the patient survived 5 years or longer.


2= the patient died within 5 years.


y axis: total number of patients.

The graph shows that more than 200 patients survived 5 years or longer and fewer than 100 patients died within 5 years.
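
Because the x axis only shows the codes 1 and 2, it can help to map them to readable labels before counting or plotting. A minimal sketch, assuming label strings of our own choosing (the dataset itself only stores 1 and 2) and a made-up sample column:

```python
import pandas as pd

# Hypothetical label names chosen for readability; the dataset only stores 1 and 2
status_labels = {1: 'survived 5+ years', 2: 'died within 5 years'}

status = pd.Series([1, 2, 1, 1, 2])   # made-up sample, not the real column
labelled = status.map(status_labels)

print(labelled.value_counts())
```

Applied to the notebook, `haberman['status'].map(status_labels).value_counts().plot(kind='bar')` would label the bars directly.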

**Plotting Scatter Plot**

In [None]:
haberman.plot(kind='scatter', x='age', y='year', color='black')  #plotting a scatter plot
plt.show()

Observation:

x axis: age of patients

y axis: Year from 1958-1969

From this scatter plot it is hard to see any relationship between a patient's age and the year of surgery.



In [None]:
haberman.plot(kind='scatter',x = 'age', y='nodes', color='magenta')    #plotting a scatter plot
plt.show()

Observation:

x axis : age

y axis: nodes

Patients between 40 and 60 years old had the highest numbers of nodes.

However, the plot is not clear enough to support an accurate prediction.

In [None]:
haberman.plot(kind='scatter',x = 'age', y='status')     #plotting a scatter plot
plt.show()

Observation :

x axis : age

y axis : status


It is hard to draw a firm conclusion, but patients older than 75 appear to have died within 5 years.

In [None]:
sns.set_style('darkgrid')
sns.FacetGrid(haberman, hue='status', height=6).map(plt.scatter, 'age', 'nodes').add_legend()   # scatter plot colored by status
plt.show()

Observation :

x axis : age

y axis : nodes

It is hard to judge the survival rate of patients from this scatter plot because the two classes overlap.

In [None]:
sns.set_style('darkgrid')
sns.FacetGrid(haberman, hue='status', height=8).map(plt.bar, 'age', 'nodes').add_legend()    # bar plot colored by status
plt.show()

Observation :

x axis : age

y axis: nodes


The bar graph shows that the highest number of nodes occurs at age 42, in a patient who died within 5 years.

In [None]:
sns.set_style('whitegrid')

sns.pairplot(haberman, hue='status', height = 4, palette='vlag').add_legend()      #plotting pairplot
plt.show()

Observation:

1. The diagonal plots are PDFs (estimated density curves) of each feature.

2. Plots 2, 3 and 6 show the same data as plots 4, 7 and 8, with the axes interchanged.


In [None]:
# 1D Plot

# numpy is already imported as np at the top of the notebook
haberman_more = haberman.loc[haberman['status'] == 1]   # survived 5 years or longer
haberman_less = haberman.loc[haberman['status'] == 2]   # died within 5 years
plt.plot(haberman_more['nodes'], np.zeros_like(haberman_more['nodes']), 'o', label='status 1')
plt.plot(haberman_less['nodes'], np.zeros_like(haberman_less['nodes']), 'o', label='status 2')
plt.xlabel('nodes')
plt.legend()
plt.show()

Observation:

It is hard to analyze the 1D plot because the data points overlap.

In [None]:
# Distribution plot (histplot replaces the deprecated distplot)

sns.FacetGrid(haberman, hue='status', height=6, palette='cubehelix').map(sns.histplot, 'nodes', kde=True, stat='density').add_legend()
plt.show()


Observation :

Patients with 0 nodes have the highest chance of survival compared to the others.

In [None]:
# Distribution plot (histplot replaces the deprecated distplot)
sns.FacetGrid(haberman, hue='status', height=6).map(sns.histplot, 'age', kde=True, stat='density').add_legend()
plt.show()

Observation:

From this distribution plot it can be observed that patients older than 50 and younger than 60
have the highest chance of survival.

In [None]:
# Distribution plot (histplot replaces the deprecated distplot)
sns.FacetGrid(haberman, hue='status', height=6, palette='rocket').map(sns.histplot, 'year', kde=True, stat='density').add_legend()
plt.show()

Observation:

From this distribution plot it is not possible to draw any particular conclusion because the two distributions overlap.

In [None]:
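# PDF and CDF of the number of nodes
# (the status column is constant within this group, so it cannot be used here)
counts, bin_edges = np.histogram(haberman_more['nodes'], bins=10, density=True)
pdf = counts / sum(counts)

cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('nodes')
plt.legend()
plt.show()

Observation:

PDF and CDF of the number of nodes for patients who survived 5 years or longer.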

In [None]:
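# PDF and CDF of the number of nodes
# (the status column is constant within this group, so it cannot be used here)
counts, bin_edges = np.histogram(haberman_less['nodes'], bins=10, density=True)
pdf = counts / sum(counts)

cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('nodes')
plt.legend()
plt.show()

Observation:

PDF and CDF of the number of nodes for patients who died within 5 years.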
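
The two cells above turn histogram counts into a PDF by dividing by their sum, then accumulate the PDF into a CDF. A minimal sketch of those mechanics on a small made-up array (the values and the choice of 4 bins are assumptions for illustration only):

```python
import numpy as np

# Made-up values, only to show the mechanics (not the Haberman data)
values = np.array([0, 0, 0, 1, 1, 2, 4, 7])

counts, bin_edges = np.histogram(values, bins=4)
pdf = counts / counts.sum()   # fraction of points that fall in each bin
cdf = np.cumsum(pdf)          # fraction of points at or below each bin's right edge

print(bin_edges)
print(pdf)
print(cdf)
```

Each CDF value answers "what fraction of points lies at or below this bin's right edge", which is why it is plotted against `bin_edges[1:]`.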

In [None]:
haberman.describe()  #details

In [None]:
print(np.median(haberman_more['age']))
print(np.median(haberman_more['year']))
print(np.median(haberman_more['nodes']))

In [None]:
sns.boxplot(data= haberman, x= 'status', y='age')
plt.show()

Observation:

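For patients who survived 5 years or longer, the box (the middle 50% of ages) spans roughly 42 to 60.

For patients who died within 5 years, the box spans roughly 45 to 61.

The two boxes overlap heavily, so age alone does not separate the groups.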
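
The age ranges read off the box plot are the edges of each box, i.e. the first and third quartiles per status group. They can also be computed directly instead of being estimated by eye. A sketch on a made-up DataFrame (hypothetical rows; only the column names follow the notebook):

```python
import pandas as pd

# Made-up rows for illustration; column names follow the notebook
sample = pd.DataFrame({
    'age':    [30, 40, 50, 60, 70, 45, 55, 65],
    'status': [1, 1, 1, 1, 1, 2, 2, 2],
})

# One row per status, one column per quantile (25%, median, 75%)
quartiles = sample.groupby('status')['age'].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

Replacing `sample` with `haberman` would give the exact box edges for the real data.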

In [None]:
sns.boxplot(data= haberman, x= 'status', y='nodes')
plt.show()

Observation:

Patients with fewer nodes were more likely to survive 5 years or longer.

In [None]:
sns.boxplot(data= haberman, x= 'status', y='year')
plt.show()

Observation:

For patients who survived 5 years or longer, the box spans operations between roughly 1960 and 1966.

For patients who died within 5 years, the box spans operations between roughly 1958 and 1965.


In [None]:
plt.figure(figsize=(8, 8))  # violinplot has no height argument, so size the figure instead
sns.violinplot(data=haberman, x='status', y='age')
plt.show()

In [None]:
plt.figure(figsize=(8, 8))  # violinplot has no height argument, so size the figure instead
sns.violinplot(data=haberman, x='status', y='year')
plt.show()

In [None]:
plt.figure(figsize=(8, 8))  # violinplot has no height argument, so size the figure instead
sns.violinplot(data=haberman, x='status', y='nodes')
plt.show()

Observation:

Patients with fewer nodes have a higher chance of survival.

Patients with more nodes have a lower chance of survival.

In [None]:
plt.figure(figsize=(8, 8))  # violinplot has no height argument, so size the figure instead
sns.violinplot(data=haberman, x='age', y='year')
plt.show()

In [None]:
plt.figure(figsize=(8, 8))  # violinplot has no height argument, so size the figure instead
sns.violinplot(data=haberman, x='nodes', y='year')
plt.show()

In [None]:
sns.jointplot(kind='kde', data= haberman, x='status', y='age')
plt.show()

Observation:

Most patients who survived 5 years or longer were between 40 and 60 years old.

In [None]:
sns.jointplot(kind='kde', data= haberman, x='year', y='status')
plt.show()

Observation:

Patients who survived were operated on between 1958 and 1968.

In [None]:
sns.jointplot(kind='kde', data= haberman, x='status', y='nodes')
plt.show()

Observation:

Patients with fewer nodes have the highest survival rate.

Patients with more nodes have a lower survival rate.

In [None]:
sns.jointplot(x="age", y="year", data = haberman, kind = "kde")
plt.show()

Observation:

Most patients who underwent the operation were between 40 and 60 years old.

In [None]:
sns.jointplot(x="age", y="nodes", data = haberman, kind = "kde")
plt.show()

Observation:

The patients with the highest numbers of nodes are mostly between 35 and 70 years old.

In [None]:
sns.jointplot(x="year", y="nodes", data = haberman, kind = "kde")
plt.show()

**Conclusion ::**

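1. It is hard to predict anything accurately because of the lack of data.

2. Even so, patients with fewer nodes appear to have the highest chance of survival.

3. The data is not balanced: far more patients survived 5 years or longer than died within 5 years.

4. Patients between 40 and 60 years old appear to have the highest chance of survival.

5. The two classes cannot be cleanly separated in any of the pair plots.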
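
The second point can be checked numerically by comparing survival rates between patients with few and many nodes. A rough sketch on made-up rows; the cut-off of 2 nodes and the group labels are arbitrary assumptions for illustration, not values taken from the analysis above:

```python
import pandas as pd

# Made-up rows for illustration; column names follow the notebook
sample = pd.DataFrame({
    'nodes':  [0, 0, 0, 1, 2, 5, 9, 15],
    'status': [1, 1, 1, 1, 2, 1, 2, 2],
})

sample['survived'] = sample['status'] == 1
# Arbitrary cut-off chosen only for this sketch
sample['group'] = sample['nodes'].le(2).map({True: '0-2 nodes', False: '3+ nodes'})

# The mean of a boolean column is the fraction of True values
survival_rate = sample.groupby('group')['survived'].mean()
print(survival_rate)
```

Running the same grouping on `haberman` would show how large the difference actually is in the real data.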