## Haberman cancer survival dataset

Cancer survival dataset
* A simple dataset for learning exploratory data analysis (EDA).
* Each row holds information about one patient.
* The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
* Area: life sciences.
* 1. Age of patient at time of operation (numerical)
* 2. Patient's year of operation (year - 1900, numerical)
* 3. Number of positive axillary nodes detected (numerical)
* 4. Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
haberman=pd.read_csv("../input/habermans-survival-data-set/haberman.csv",names=['age', 'year', 'nodes', 'status'])

In [None]:
#(Q) How many data points and features are present?
print(haberman.shape)

In [None]:
#(Q) What are the column names (features)?
print(haberman.columns)

In [None]:

#(Q) How many data points are present for each class?
# Note: this is an imbalanced data set
print(haberman['status'].value_counts())

In [None]:
#(Q) Number of classes ?
print(haberman['status'].nunique())

* (Q) What is our objective?
* To predict whether a patient survives 5 years or longer after surgery, given the patient's age, the year of operation, and the number of positive axillary nodes.
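
Since the classes are imbalanced, it can help to know how well a trivial "always predict the majority class" rule would do before reading anything into the plots. The sketch below uses a small made-up label series (the values are invented for illustration and are not the real Haberman counts); the names `class_share`, `majority_class` and `baseline_accuracy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy labels (NOT the real data): 1 = survived, 2 = died.
status = pd.Series([1, 1, 1, 2, 1, 1, 2, 1], name="status")

# normalize=True turns the per-class counts into proportions.
class_share = status.value_counts(normalize=True)

# Accuracy of a rule that always predicts the most frequent class.
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(class_share)
print(f"always predicting {majority_class} gives accuracy {baseline_accuracy:.2f}")
```

On the real data the same two lines could be applied to `haberman['status']`; any feature-based rule should beat this baseline to be worth keeping.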

In [None]:
#describing the dataframe
print(haberman.describe())

In [None]:
#2-D scatter plot:
#ALWAYS understand the axis: labels and scale.
haberman.plot(kind='scatter', x='age', y='year') ;
plt.title("2-D PLOT BETWEEN AGE AND YEAR")
plt.show()
# This plot of age against year tells us little, because the points are not separated by status.
# We need to color-code the points by status.

In [None]:
# Replace the status codes 1 and 2 with readable labels to avoid confusion
haberman['status']=haberman['status'].apply(lambda x:'Survived' if x==1 else 'Didntsurvive')

In [None]:
# 2-D Scatter plot with color-coding for each patient.
plt.close()
sns.set_style("whitegrid");
sns.FacetGrid(haberman,hue='status',height=4)\
.map(plt.scatter,"age","year")\
.add_legend();
plt.title("2-D Scatter plot with color-coding for each patient")
plt.show();      
# The graph plots age against year.
# The data points overlap heavily.
# Notice that the orange and blue points cannot be separated easily:
# no line can be drawn between the two groups.
# (Q) How many feature pairs are there? 3C2 = 3

**Observation(s):**
* 1. Using age and year alone, we cannot distinguish the two classes because the points overlap too much.
* 2. Age by itself does not play a big role; survival depends on other factors as well.
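
The "3C2" count mentioned in the cell above can be checked by listing the feature pairs explicitly. This is only a small sketch; the list `features` simply repeats the column names used in this notebook.

```python
from itertools import combinations
from math import comb

features = ["age", "year", "nodes"]

# Every unordered pair of features corresponds to one possible 2-D scatter plot.
pairs = list(combinations(features, 2))
print(pairs)

# The number of pairs equals the binomial coefficient C(3, 2).
print(len(pairs), comb(len(features), 2))
```

Each of these pairs appears as one off-diagonal panel (and its mirror image) in a pair plot.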

In [None]:
# 3-D scatter plot
import plotly.express as px
df = pd.read_csv("../input/habermans-survival-data-set/haberman.csv",names=['age', 'year', 'nodes', 'status'])
# plt.title has no effect on a Plotly figure, so the title is passed to px.scatter_3d
fig = px.scatter_3d(df, x='age', y='year', z='nodes',
              color='status',
              title="3D PLOT BETWEEN AGE AND YEAR AND NODES WITH COLOR AS STATUS")
fig.show()

In [None]:
# Pair plot
plt.close()
sns.set_style("whitegrid")
g = sns.pairplot(haberman,hue="status",height=3)
# plt.title would only label one subplot; suptitle labels the whole grid
g.fig.suptitle("PAIR PLOT", y=1.02)
plt.show()


**Observations**
* 1. The year of operation and the number of nodes separate the classes better than the other pairs.
* 2. All other plots overlap, making it harder to separate the data points.

#  Histogram, PDF, CDF

In [None]:
#1d scatter plot using one feature
import numpy as np
haberman_survived=haberman.loc[haberman['status']=='Survived']
haberman_Didntsurvive=haberman.loc[haberman['status']=='Didntsurvive']
plt.plot(haberman_survived['nodes'],np.zeros_like(haberman_survived['nodes']),'o')
plt.plot(haberman_Didntsurvive['nodes'],np.zeros_like(haberman_Didntsurvive['nodes']),'o')
plt.title("1D SCATTER PLOT FOR NODES FOR THOSE WHO SURVIVED AND DIDNT SURVIVE")
plt.show()

* The data points overlap heavily, so the 1-D scatter plot is not helpful here.

In [None]:
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(sns.distplot, "nodes") \
   .add_legend();
plt.title("PDF of Nodes")
plt.show();


* In the graph above, the more positive axillary nodes a patient has, the lower the chance of survival: survival tends to fall as the node count rises.

In [None]:
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(sns.distplot, "year") \
   .add_legend();
plt.title("PDF of year")
plt.show();


In [None]:
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(sns.distplot, "age") \
   .add_legend();
plt.title("PDF FOR AGE" )
plt.show();


Observations
* The number of nodes is the most informative feature.
* Most patients with 0 nodes survived.
* Patients with more nodes have a lower chance of survival.
* Patients around age 30 have a higher chance of surviving.
* If nodes == 0: long survival is likely.
* Else if nodes is between 1 and 3: moderate chance of survival.
* Else (nodes > 3.5): lower chance of survival.
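
The rule of thumb above can be checked by bucketing the node counts and computing the survival rate per bucket. The sketch below runs on a tiny made-up frame (rows invented for illustration, not taken from haberman.csv); `toy`, `node_group` and `survival_rate` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration only.
toy = pd.DataFrame({
    "nodes":  [0, 0, 0, 1, 2, 3, 4, 9, 15, 30],
    "status": ["Survived", "Survived", "Survived", "Survived", "Didntsurvive",
               "Survived", "Didntsurvive", "Didntsurvive", "Survived", "Didntsurvive"],
})

# Buckets follow the rule of thumb: 0 nodes, 1-3 nodes, more than 3 nodes.
# pd.cut bins are right-inclusive: (-1, 0], (0, 3], (3, inf).
toy["node_group"] = pd.cut(
    toy["nodes"],
    bins=[-1, 0, 3, float("inf")],
    labels=["0", "1-3", ">3"],
)

# Fraction of patients in each bucket whose status is 'Survived'.
survival_rate = (
    toy["status"].eq("Survived")
    .groupby(toy["node_group"], observed=False)
    .mean()
)
print(survival_rate)
```

Applied to the real `haberman` frame, the same grouping would show whether the survival rate really drops from bucket to bucket, rather than reading it off the density plots by eye.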

In [None]:
#Histograms and probability density functions
counts,bins=np.histogram(haberman_survived['nodes'],bins=10,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bins)
cdf=np.cumsum(pdf)
plt.plot(bins[1:],pdf,label='pdf')
plt.plot(bins[1:],cdf,label='cdf')
plt.xlabel('number of nodes')
plt.ylabel('probability')
plt.title("CDF AND PDF OF NODES FOR SURVIVED")
plt.legend()
plt.show()

In [None]:
counts,bins=np.histogram(haberman_survived['age'],bins=10,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bins)
cdf=np.cumsum(pdf)
plt.plot(bins[1:],pdf,label='pdf')
plt.plot(bins[1:],cdf,label='cdf')
plt.xlabel('age')
plt.ylabel('probability')
plt.title("CDF AND PDF FOR THE AGE OF SURVIVED")
plt.legend()
plt.show()

In [None]:
counts,bins=np.histogram(haberman_survived['year'],bins=10,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bins)
cdf=np.cumsum(pdf)
plt.plot(bins[1:],pdf,label='pdf')
plt.plot(bins[1:],cdf,label='cdf')
plt.xlabel('year')
plt.ylabel('probability')
plt.title("CDF AND PDF FOR THE YEAR OF OPERATION FOR THE SURVIVED")
plt.legend()
plt.show()

In [None]:
counts,bins=np.histogram(haberman_Didntsurvive['nodes'],bins=10,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bins)
cdf=np.cumsum(pdf)
plt.plot(bins[1:],pdf,label='pdf')
plt.plot(bins[1:],cdf,label='cdf')
plt.xlabel('nodes')
plt.ylabel('probability')
plt.title("CDF AND PDF OF THE NODES FOR THE DIDNT SURVIVE")
plt.legend()
plt.show()

***Observations***
* (1) From the CDF above, about 85% of the patients who survived have fewer than 5 nodes.
* (2) Among the patients who did not survive, the CDF reaches 100% only beyond 40 nodes.
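
To make it clearer how the PDF and CDF curves above are built, here is a minimal sketch on a short made-up array (values invented for illustration). It uses raw counts instead of `density=True`; with equal-width bins, dividing by the sum gives the same per-bin shares either way.

```python
import numpy as np

# Hypothetical node counts, invented for illustration only.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 12])

# Raw counts per bin; 4 equal-width bins spanning min..max.
counts, bin_edges = np.histogram(nodes, bins=4)

# Share of observations falling in each bin (sums to 1).
pdf = counts / counts.sum()

# Running total of the shares, bin by bin.
cdf = np.cumsum(pdf)

for right_edge, p, c in zip(bin_edges[1:], pdf, cdf):
    print(f"up to {right_edge:5.1f} nodes: bin share {p:.2f}, cumulative {c:.2f}")
```

Reading a statement such as "about 85% have fewer than 5 nodes" off the plot amounts to finding the bin edge nearest 5 and reading the cumulative value there.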

In [None]:
# CDF and PDF of nodes for both groups of patients
# survived
counts, bin_edges = np.histogram(haberman_survived['nodes'], bins=10,
                                 density = True)
pdf = counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='pdf (survived)')
plt.plot(bin_edges[1:], cdf,label='cdf (survived)')

# did not survive
counts, bin_edges = np.histogram(haberman_Didntsurvive['nodes'], bins=10,
                                 density = True)
pdf = counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label="pdf (didn't survive)")
plt.plot(bin_edges[1:], cdf,label="cdf (didn't survive)")
# label the axes and curves so the four lines can be told apart
plt.xlabel('number of nodes')
plt.ylabel('probability')
plt.title("BOTH CDF AND PDF OF SURVIVE AND DIDNT SURVIVE")
plt.legend()
plt.show()

**Observations**
* (1) According to the CDF for patients who did not survive, about 56% of them have fewer than 5 nodes.
* (2) The CDF for patients who did not survive reaches 100% only beyond 40 nodes.

# Mean, Variance and Std-dev

In [None]:
print("Means of nodes of survived and didnt survive:")
print(np.mean(haberman_survived["nodes"]))
print(np.mean(haberman_Didntsurvive["nodes"]))
print("Means of age of survived and didnt survive:")
print(np.mean(haberman_survived["age"]))
print(np.mean(haberman_Didntsurvive["age"]))
print("Means of year of operation of survived and didnt survive:")
print(np.mean(haberman_survived["year"]))
print(np.mean(haberman_Didntsurvive["year"]))
print("standard deviation of nodes of survived and didnt survive:")
print(np.std(haberman_survived["nodes"]))
print(np.std(haberman_Didntsurvive["nodes"]))
print("standard deviation of age of survived and didnt survive:")
print(np.std(haberman_survived["age"]))
print(np.std(haberman_Didntsurvive["age"]))
print("standard deviation of year of operation of survived and didnt survive:")
print(np.std(haberman_survived["year"]))
print(np.std(haberman_Didntsurvive["year"]))

In [None]:
# The mean is sensitive to outliers, so we do not rely on it alone.
# For nodes, there is a noticeable gap between the mean and the standard deviation.

# Median, Percentile, Quantile, IQR, MAD

In [None]:
#Median, Quantiles, Percentiles, IQR.
print("\nMedians:")
print(np.median(haberman_survived["nodes"]))
print(np.median(haberman_Didntsurvive["nodes"]))
print("\nQuantiles:")
print(np.percentile(haberman_survived["nodes"],np.arange(0, 101, 25)))
print(np.percentile(haberman_Didntsurvive["nodes"],np.arange(0, 101, 25)))

print("\n90th Percentiles:")
print(np.percentile(haberman_survived["nodes"],90))
print(np.percentile(haberman_Didntsurvive["nodes"],90))
from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(haberman_survived["nodes"]))
print(robust.mad(haberman_Didntsurvive["nodes"]))
print("\nQuantiles of whole:")
print(np.percentile(haberman["nodes"],np.arange(0, 101, 25)))


***Observations***
From the mean, median, standard deviation, MAD and percentiles:
* 1) The mean age of the patients is about 52.
* 2) 75% of the patients who survived have fewer than 5 nodes.
* 3) The median node count is 0 for patients who survived and 4 for those who did not.
* 4) 90% of the patients who survived have at most 8 nodes, whereas for patients who did not survive the 90th percentile is 20 nodes.
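
The reason for looking at the median and MAD next to the mean and standard deviation is their robustness to outliers. The sketch below shows this on a made-up array (values invented for illustration). The MAD here is the plain, unscaled version; as far as I know `robust.mad` from statsmodels additionally divides by a normalizing constant (about 0.6745) by default, so its numbers will be larger.

```python
import numpy as np

# Hypothetical node counts, invented for illustration only.
nodes = np.array([0, 0, 1, 1, 2, 3, 4])
with_outlier = np.append(nodes, 50)  # add one extreme patient

for name, values in [("without outlier", nodes), ("with outlier", with_outlier)]:
    median = np.median(values)
    # Unscaled median absolute deviation: median distance from the median.
    mad = np.median(np.abs(values - median))
    print(f"{name}: mean={np.mean(values):.2f}, std={np.std(values):.2f}, "
          f"median={median}, MAD={mad}")
```

A single extreme value moves the mean and standard deviation a lot, while the median and MAD barely change, which is why the percentile-based summaries are the safer ones to quote for `nodes`.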

# Box plot and Whiskers

In [None]:
sns.boxplot(x='status',y='nodes', data=haberman)
plt.title("BOX PLOT AND WHISKERS FOR SURVIVE AND DIDNT SURVIVE")
plt.show()

In [None]:
# Patients who survived are densely concentrated in the 0-5 node range.
# 75% of the patients who did not survive had fewer than 12 nodes.

# Violin plots

In [None]:

sns.violinplot(x="status", y="nodes", data=haberman)
plt.title("VIOLIN PLOT FOR SURVIVE AND DIDNT SURVIVE")
plt.show()

# Observation:
* Most of the patients who survived more than 5 years have 0 nodes.


In [None]:
print(haberman_survived[haberman_survived["nodes"]==0].shape)

In [None]:

print("Patients who survived with 0 nodes: 117 of 306, which is "+str(round(117/306*100))+"%")


In [None]:
# the patients who didn't survive more than 5 years having 0 nodes are
print(haberman_Didntsurvive[haberman_Didntsurvive["nodes"]==0].shape)

In [None]:
print("Patients who did not survive with 0 nodes: 19 of 306, which is "+str(round(19/306*100))+"%")

In [None]:
# Patients who survived with 1-3 nodes:

In [None]:
s=haberman_survived[haberman_survived["nodes"]>=1]
print(s[s['nodes']<=3].shape)


In [None]:
print("Patients who survived with 1-3 nodes: 61 of 306, which is "+str(round(61/306*100))+"%")

In [None]:
s=haberman_Didntsurvive[haberman_Didntsurvive["nodes"]>=1]
print(s[s['nodes']<=3].shape)


In [None]:
print("Patients who did not survive with 1-3 nodes: 20 of 306, which is "+str(round(20/306*100))+"%")

#  Multivariate probability density, contour plot.

In [None]:
haberman_survival1=haberman.loc[haberman["status"]=="Survived"]
sns.jointplot(x="age",y="nodes",data=haberman_survival1,kind="kde")
plt.title("CONTOUR PLOT BETWEEN NODES AND AGE")
plt.grid()
plt.show()

# Observation:
* In the plot above, the densest (darkest) region for patients who survived lies at ages 48-63 and 0 to 3 nodes.


# Conclusion:
Age matters, but the number of positive nodes plays the major role in determining the survival of the patients.

Lymph nodes:
* Lymph nodes are small, bean-shaped organs that act as filters along the lymph fluid channels.
* As lymph fluid leaves the breast and eventually goes back into the bloodstream, the lymph nodes try to catch and trap cancer cells before they reach other parts of the body.
* Patients with 0 positive nodes make up the largest group of survivors: 38% of all patients had 0 nodes and survived, while 6% had 0 nodes and did not.
* Extreme cases were not considered, because a few patients may be exceptional.
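
The percentages quoted above were typed in by hand from the printed shapes. A less error-prone option is to let pandas compute the shares directly. The sketch below does this on a small made-up frame (rows invented for illustration, not the real counts); `toy`, `node_group` and `share` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration only.
toy = pd.DataFrame({
    "nodes":  [0, 0, 0, 0, 1, 2, 5, 8],
    "status": ["Survived", "Survived", "Survived", "Didntsurvive",
               "Survived", "Didntsurvive", "Survived", "Didntsurvive"],
})

# Label each patient by whether any positive nodes were found.
node_group = (
    toy["nodes"].eq(0)
    .map({True: "0 nodes", False: "1+ nodes"})
    .rename("node_group")
)

# normalize='all' divides every cell by the total number of patients,
# which matches the "x of 306" style of percentage used in this notebook.
share = pd.crosstab(node_group, toy["status"], normalize="all") * 100
print(share.round(1))
```

With `haberman` in place of `toy`, the "0 nodes" row would give the two percentages quoted in the conclusion without any hard-coded counts.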