# About Dataset:
## Attributes:
1. Age: Age of patient at time of operation (numerical)
2. Op_Year: Patient's year of operation (year - 1900, numerical)
3. axil_nodes: Number of positive axillary nodes detected (numerical)
4. Surv_status: 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years (class label)

## Objective
Given a new patient's information, we have to classify whether a patient who underwent surgery for breast cancer will survive 5 years or longer. (This module covers only the exploratory analysis.)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#reading the data
haberman=pd.read_csv('../input/haberman.csv')

In [None]:
#visualizing data
haberman.head(5)

In [None]:
#changing the column names
haberman.columns=['Age','Op_Year','axil_nodes','Surv_status']
haberman.head(5)

In [None]:
#finding the shape of data
haberman.shape

## The shape of the data:
The data has 4 columns, as we already saw: 3 features and the class label.
<br> There are 305 data points (rows) present.

In [None]:
# finding out the number of data points for each class label
haberman['Surv_status'].value_counts()

### As you can see, the data is imbalanced.
<br> There are 224 data points for patients who survived 5 years or longer after the operation.
<br> There are 81 data points for patients who died within 5 years.
<br> Survivors make up about 73% of the data and non-survivors about 27%, a gap of about 47 percentage points (roughly 2.8 times as many survivors as non-survivors).
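
The class shares can be checked with a few lines of arithmetic. This is a minimal sketch that hard-codes the two counts reported above instead of reading the CSV; the names `class_counts`, `class_share` and `imbalance_ratio` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts as reported above (1 = survived 5+ years, 2 = died within 5 years).
class_counts = pd.Series({1: 224, 2: 81}, name="Surv_status")

# Share of each class in the whole dataset.
class_share = class_counts / class_counts.sum()

# Ratio of the majority class to the minority class.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_share.round(3))
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

On the real DataFrame, `haberman['Surv_status'].value_counts(normalize=True)` should give the same shares directly.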

# Pair Plots
Pair plots work well for low-dimensional data but do not scale to high-dimensional data. For n features there are nC2 = n(n-1)/2 distinct feature pairs, which is already 4950 plots for 100 dimensions. Going through that many plots to study the structure of the data takes a lot of time. This is the main drawback of pair plots.
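
To get a feel for how quickly the number of feature pairs grows, here is a small sketch using the standard library. The helper name `n_pair_plots` is made up for this illustration.

```python
from math import comb


def n_pair_plots(n_features: int) -> int:
    """Number of distinct feature pairs, C(n, 2) = n * (n - 1) / 2."""
    return comb(n_features, 2)


# The Haberman data has 3 features; 10 and 100 are shown for comparison.
for n in (3, 10, 100):
    print(n, n_pair_plots(n))
```

With only 3 features, as in this dataset, there are just 3 distinct pairs, so a pair plot is still easy to read.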

In [None]:
sns.set_style('whitegrid')
#'size' was renamed to 'height' in seaborn 0.9; pairplot adds the legend itself when hue is set
sns.pairplot(haberman,hue='Surv_status',vars=['Age','Op_Year','axil_nodes'],height=3)
plt.show()

## Observation
From the above graphs we can tell that there is a fair degree of overlap between the two classes of patients. So even by taking any combination of two features, we cannot linearly separate the two categories of patients.
<br> But by observing the plots, we can see that the axil_nodes feature is more important than the other two features for classifying the data.

## Univariate Analysis
You can use the PDF, CDF, box plots and violin plots to analyze a single feature and find out which features are useful for classification.
Analysis of a single feature is called univariate analysis, whereas analysis of two features is called bivariate analysis.
Analysis of more than two variables is called multivariate analysis.

## PDF (Probability Density Function), CDF (Cumulative Distribution Function)
The PDF describes how densely the values of a feature are concentrated; the probability of a new point falling in a particular range of values is the area under the PDF over that range.
The CDF tells us what fraction of points are less than or equal to a particular value of the feature.
<br> Useful scenario: a delivery company can read from the CDF what percentage of customers receive their deliveries within 2 days or within 5 days, and use that to analyze how to increase efficiency.
<br> Integrating the PDF gives the CDF, and differentiating the CDF gives the PDF.
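
The relationship between the PDF and the CDF can be sketched on a histogram. The example below uses synthetic values, not the Haberman data, and the variable names are illustrative. It computes the per-bin probabilities, accumulates them into a CDF, and then differences the CDF to get the per-bin probabilities back.

```python
import numpy as np

# Synthetic values used only for illustration (not the Haberman data).
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)

# density=True makes the bar heights integrate to 1 over the bin widths.
density, bin_edges = np.histogram(values, bins=10, density=True)
bin_widths = np.diff(bin_edges)

# Probability mass per bin = density * bin width; the masses sum to 1.
bin_prob = density * bin_widths

# The CDF at each right bin edge is the running sum of the bin probabilities.
cdf = np.cumsum(bin_prob)

# Differencing the CDF recovers the per-bin probabilities (a discrete "derivative").
recovered = np.diff(np.concatenate(([0.0], cdf)))

print(cdf[-1])                           # total probability
print(np.allclose(recovered, bin_prob))  # whether differencing undoes the cumulative sum
```

Dividing the counts by their sum, as the cells below do, gives the same per-bin probabilities when all bins have equal width.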

In [None]:
#split the data
hab_surv=haberman.loc[haberman['Surv_status']==1]
hab_unsurv=haberman.loc[haberman['Surv_status']==2]

In [None]:
#pdf and cdf for feature age
counts,bin_edges=np.histogram(hab_surv['Age'],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='pdf_habsurv')
plt.plot(bin_edges[1:],cdf,label='cdf_habsurv')

counts,bin_edges=np.histogram(hab_unsurv['Age'],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='pdf_habunsurv')
plt.plot(bin_edges[1:],cdf,label='cdf_habunsurv')

plt.xlabel('Age')
plt.legend(loc='center left',bbox_to_anchor=(1,0.5))
plt.show()

#pdf and cdf of feature Op_Year
counts,bin_edges=np.histogram(hab_surv['Op_Year'],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='pdf_habsurv')
plt.plot(bin_edges[1:],cdf,label='cdf_habsurv')

counts,bin_edges=np.histogram(hab_unsurv['Op_Year'],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='pdf_habunsurv')
plt.plot(bin_edges[1:],cdf,label='cdf_habunsurv')

plt.xlabel('Op_Year')
plt.legend(loc='center left',bbox_to_anchor=(1,0.5))
plt.show()

#pdf and cdf for feature axil_nodes
counts,bin_edges=np.histogram(hab_surv['axil_nodes'],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='pdf_habsurv')
plt.plot(bin_edges[1:],cdf,label='cdf_habsurv')

counts,bin_edges=np.histogram(hab_unsurv['axil_nodes'],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='pdf_habunsurv')
plt.plot(bin_edges[1:],cdf,label='cdf_habunsurv')

plt.xlabel('axil_nodes')
plt.legend(loc='center left',bbox_to_anchor=(1,0.5))
plt.show()







## Observations:
In the plot of Age, we can see that 100% of the points with survival status = 1 have an age below approx. 75.
<br> Similarly, in this data the points with axil_nodes greater than 50 fall under survival status = 2. Very few patients have that many nodes, so this rule should be treated with caution.

## Box-Plots
Box plots show the quartiles of a feature. The box contains 3 lines corresponding to the 25th percentile, the 50th percentile (median) and the 75th percentile. The 25th percentile is the value below which 25% of the points fall. Reading the quartiles off the plot also gives a rough estimate of how many points a simple threshold rule would misclassify.
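
The three lines of the box can be reproduced numerically. This is a minimal sketch on a small made-up sample (not the Haberman data), using NumPy's default linear interpolation; the variable names are illustrative.

```python
import numpy as np

# Small made-up sample, for illustration only.
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# The three lines drawn inside a box plot.
q1, median, q3 = np.percentile(sample, [25, 50, 75])

# The height of the box is the interquartile range.
iqr = q3 - q1

print(q1, median, q3, iqr)
```

For a column of the notebook's DataFrame, `np.percentile(haberman['axil_nodes'], [25, 50, 75])` would be the analogous call.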

In [None]:
sns.boxplot(x='Surv_status',y='Age',data=haberman)
plt.show()

sns.boxplot(x='Surv_status',y='Op_Year',data=haberman)
plt.show()

sns.boxplot(x='Surv_status',y='axil_nodes',data=haberman)
plt.show()

## Observations
Even though there is a fair amount of overlap between survival status 1 and 2 in axil_nodes, we can see that the overlap ends after the 50th percentile value of survival status 2. So we can partly separate the data using the axil_nodes feature.
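
One way to quantify such a partial separation is to measure how often a single-threshold rule is wrong. The sketch below is an assumption about how one might do this, not part of the original analysis: `threshold_error` is a hypothetical helper, and the `toy` rows are made up for illustration.

```python
import numpy as np
import pandas as pd


def threshold_error(df: pd.DataFrame, threshold: float) -> float:
    """Fraction of rows misclassified by the rule
    'predict Surv_status 2 when axil_nodes > threshold, else 1'."""
    predicted = np.where(df["axil_nodes"] > threshold, 2, 1)
    return float((predicted != df["Surv_status"].to_numpy()).mean())


# Toy rows for illustration only (not taken from the Haberman data).
toy = pd.DataFrame({
    "axil_nodes": [0, 1, 2, 8, 15, 0],
    "Surv_status": [1, 1, 1, 2, 2, 2],
})

print(threshold_error(toy, threshold=4))
```

Calling `threshold_error(haberman, t)` for a few values of `t` would show which threshold, if any, gives an acceptable error on the real data.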

## Violin Plots
Violin plots are a combination of box plots and density plots. The outer shape is a kernel density estimate (a smoothed histogram) and the inner black region is the box plot.

In [None]:
sns.violinplot(x='Surv_status',y='Age',data=haberman)
plt.show()

sns.violinplot(x='Surv_status',y='Op_Year',data=haberman)
plt.show()

sns.violinplot(x='Surv_status',y='axil_nodes',data=haberman)
plt.show()

## Summary
<br> From the above plots, we can tell that there is a fair degree of overlap between the two classes of patients.
<br> But by observing the plots, we can tell that the axil_nodes feature is more important than the other two features for classifying the data.
<br> In this data, the points with axil_nodes greater than 50 fall under survival status = 2, but very few patients have that many nodes, so this rule covers only a small part of the data.