# Exploratory Data Analysis on Haberman Dataset

# 1] Basic Terminology

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings 

warnings.filterwarnings("ignore") 

In [None]:
haberman = pd.read_csv('../input/habermans-survival-data-set/haberman.csv')

In [None]:
haberman.columns = ['age','year','nodes','status']
haberman

In [None]:
# Number of rows and columns in the data
haberman.shape

In [None]:
#Get the column names
print(haberman.columns)

In [None]:
# print a concise summary of a DataFrame
haberman.info()

In [None]:
# Summary statistics: count, mean, std, min, quartiles and max
haberman.describe()

In [None]:
# Number of patients in each survival status class
haberman["status"].value_counts()
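The raw counts are easier to compare as proportions. As a minimal sketch, using a small made-up `status` column rather than the real data (the values are illustrative only), `value_counts(normalize=True)` reports the class balance:

```python
import pandas as pd

# Hypothetical toy column, NOT the Haberman data: 1 = survived, 2 = did not survive
toy = pd.DataFrame({"status": [1, 1, 1, 2, 1, 2, 1, 1]})

counts = toy["status"].value_counts()                 # absolute counts per class
shares = toy["status"].value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares.round(3))
```

A strongly unequal split between the classes is worth noting early, because it affects how every later plot should be read.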

# 2] 1-D Scatter Plot Section

In [None]:
# 1-D scatter plot of age, split by survival status
survive = haberman.loc[haberman["status"] == 1]
unsurvive = haberman.loc[haberman["status"] == 2]
plt.title("1-D scatter plot of age at detection by survival status")
plt.plot(survive["age"], np.zeros_like(survive["age"]), 'o', label='Survived')
plt.plot(unsurvive["age"], np.zeros_like(unsurvive["age"]), 'o', label='Did not survive')
plt.xlabel('Age')
plt.legend()
plt.show()


**Observations:-** Most of the patients who survived were between roughly 40 and 62 years old. Beyond that, the 1-D scatter plot is very hard to read because the points overlap heavily.
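One common way to reduce overlap in a 1-D scatter plot is to add a little random vertical jitter instead of placing every point at y = 0. The sketch below uses made-up ages (not the dataset) and only computes the jittered coordinates; the jitter width of 0.1 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the jitter is reproducible

# Hypothetical ages, NOT taken from the dataset; note the repeated values
ages = np.array([34, 41, 41, 41, 52, 52, 60, 60, 60, 60])

# Instead of plotting every point at y = 0, spread points over a thin band
jitter = rng.uniform(-0.1, 0.1, size=ages.shape)

# These pairs could be passed to plt.plot(ages, jitter, 'o', alpha=0.5)
points = np.column_stack([ages, jitter])
print(points.shape)
```

Repeated ages then appear as a small vertical cluster rather than as a single dot.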



# 3] 2-D Scatter Plot Section

In [None]:
# 2-D scatter plot of age against year of operation
haberman.plot(kind='scatter', x='age', y='year', label="Patients")
plt.title("2-D scatter plot of age at detection vs. year of operation")
plt.legend()
plt.show()

In [None]:
# 2-D scatter plot, color-coded by survival status.
# Here 'sns' corresponds to seaborn.
sns.set_style("whitegrid")
# 'height' replaces the 'size' argument, which seaborn renamed in version 0.9
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(plt.scatter, "age", "year") \
   .add_legend()
plt.title("2-D scatter plot of age vs. year of operation by status")
plt.show()

**OBSERVATION:-** In this plot the points are colored by survival status and positioned by age at detection and year of operation. Blue points are patients who survived after the operation and orange points are patients who did not. The two colors are thoroughly mixed, so from this plot we cannot conclude whether most patients survive after treatment.

# 4] 3-D Scatter Plot Section 

In [None]:
import plotly.express as px
fig = px.scatter_3d(haberman, x='year', y='age', z='nodes', color='status', size='year',
                    size_max=15, opacity=0.7,
                    title="3-D Scatter plot of Age, Nodes & Year Based on Status")
# tight layout: shrink the default margins around the figure
fig.update_layout(margin=dict(l=0, r=0, b=0, t=40))
fig.show()

**Observation:** A 3-D scatter plot can help with classification, but in this data set it does not separate the classes well enough to draw a conclusion. Status 1 is shown in blue and status 2 in yellow, and the two are mixed together with no clear boundary.

# 5] Pair-plot

In [None]:
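# Pair plots from seaborn: one scatter plot for every pair of features
plt.close()
sns.set_style("whitegrid")
# 'height' replaces the 'size' argument, which seaborn renamed in version 0.9
titl = sns.pairplot(haberman, hue="status", vars=['age', 'nodes', 'year'], height=4)
titl.fig.suptitle("Pairplot of Age, Nodes & Year", y=1.02)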
plt.show()

# Title approach taken from:
# https://www.tutorialspoint.com/how-to-show-the-title-for-the-diagram-of-seaborn-pairplot-or-pridgrid-matplotlib 
# (plt.title did not work as expected for the pairplot)

**Observations:-** 
The figure above shows pair plots for every combination of the three features, numbered row by row from the top left. **Plot 1, Plot 5 and Plot 9** (the diagonal) show the PDF of each individual feature, i.e. where the data is densest.

**Plot 2** In this plot the detected nodes are on the X-axis and age is on the Y-axis. Compared with all the other plots, it is the most informative.

**Plot 3** This plot shows age against year. The points of the two classes mostly overlap, so it is hard to draw a conclusion.

**Plot 4** It is plotted using the features nodes and age, so it is the same as Plot 2 with the axes swapped.

**Plot 6** It is plotted using the features nodes and year of operation. The overlap of points is greater here than in the other plots.

**Plot 7** This plot is the same as Plot 3 with the axes interchanged.

**Plot 8** It is the same as Plot 6 with the axes interchanged.

So, the plots of age against nodes (Plots 2 and 4) are the only ones that make enough sense to support further conclusions.

# 6] Histogram, PDF, CDF

In [None]:
# Histogram showing the frequency distribution of patients by age at detection.
plt.hist(haberman["age"], label="Age")
plt.legend()
plt.title("Histogram of age at detection")
plt.show()

PDF - (Probability Density Function)
The PDF shows how densely the data is concentrated around each value. The curve has a high peak where many data points lie and stays flat or shows only a small peak where there are few. It can be thought of as a smoothed version of the histogram.
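The link between a histogram and a density can be checked numerically. The sketch below uses a made-up sample (not the dataset) and NumPy's `density=True` option, which rescales the bar heights so that the total bar area equals 1:

```python
import numpy as np

# Hypothetical sample, NOT the Haberman data
values = np.array([30, 34, 38, 41, 43, 45, 47, 50, 52, 52, 55, 57, 60, 63, 70])

# density=True rescales bar heights so that the total bar area equals 1
density, bin_edges = np.histogram(values, bins=5, density=True)
bin_widths = np.diff(bin_edges)

area = np.sum(density * bin_widths)   # total area under the histogram
print(density)
print(area)
```

The heights are densities rather than counts, which is why the y-axis of a PDF plot usually shows small fractional values.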


In [None]:
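# FacetGrid helps visualize the distribution of age for each survival status.
# PDF of age. sns.distplot is deprecated, so histplot with a KDE overlay is used
# instead (seaborn >= 0.11); 'height' replaces the old 'size' argument.
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(sns.histplot, "age", kde=True, stat="density") \
   .add_legend()
plt.title("PDF of age at detection by status")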
plt.show()

Observation: Nothing can be predicted from age alone. The density is similar for both classes across the age range, and the PDFs for status 1 and status 2 overlap each other.

In [None]:
# FacetGrid distribution of nodes for each survival status.
# PDF of nodes (histplot with a KDE overlay replaces the deprecated distplot)
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(sns.histplot, "nodes", kde=True, stat="density") \
   .add_legend()
plt.title("PDF of detected nodes by status")
plt.show()

Observation: The high peak near zero for status 1 shows that patients with few detected nodes mostly survived, while patients who did not survive tend to have more nodes. This is the one feature from which a conclusion can be drawn.

In [None]:
# FacetGrid distribution of year of operation for each survival status.
# PDF of year of operation (histplot with a KDE overlay replaces the deprecated distplot)
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(sns.histplot, "year", kde=True, stat="density") \
   .add_legend()
plt.title("PDF of year of operation by status")
plt.show()

Observation: The two distributions overlap each other, so the year of operation does not help predict survival.

CDF - (Cumulative Distribution Function)
The CDF is the running total of the PDF: at each value x it gives the fraction of data points that are less than or equal to x.
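An empirical CDF can also be built without binning, simply by sorting the data. This is a minimal sketch on made-up node counts (not the dataset); it is an alternative to the histogram-based approach used in the cells below, not a replacement for it.

```python
import numpy as np

# Hypothetical node counts, NOT the Haberman data
nodes = np.array([0, 0, 1, 0, 3, 2, 0, 8, 1, 4])

x = np.sort(nodes)                       # sorted observations
cdf = np.arange(1, len(x) + 1) / len(x)  # fraction of points <= each sorted value

# Fraction of observations with at most 2 nodes
frac_le_2 = np.mean(nodes <= 2)
print(frac_le_2)
```

Plotting `cdf` against `x` as a step curve gives the same kind of picture as the binned CDF, with one step per observation instead of one point per bin.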

In [None]:
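# Bin the ages of surviving patients and normalise the counts so they sum to 1
counts, bin_edges = np.histogram(survive['age'], bins=10, density=True)
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)
# The running sum of the per-bin probabilities gives the CDF
cdf = np.cumsum(pdf)
plt.title("PDF and CDF of age at detection (survived)")
plt.plot(bin_edges[1:], pdf, label="PDF survived")
plt.plot(bin_edges[1:], cdf, label="CDF survived")
plt.legend()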

In [None]:
counts, bin_edges = np.histogram(survive['year'], bins=10, density = True)
pdf = counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.title("PDF and CDF of year of operation (survived)")
plt.plot(bin_edges[1:],pdf,label="PDF survived")
plt.plot(bin_edges[1:], cdf,label="CDF survived")
plt.legend() 

In [None]:
counts, bin_edges = np.histogram(survive['nodes'], bins=10, density = True)
pdf = counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.title("PDF and CDF of detected nodes (survived)")
plt.plot(bin_edges[1:],pdf,label="PDF survived")
plt.plot(bin_edges[1:], cdf,label="CDF survived")
plt.legend()

**Observation:** In the CDF above, the orange line shows that about 82% of the patients who survived long term had fewer than 5 detected nodes. The share of survivors also drops off as the number of nodes increases. Roughly 80% to 85% of the survivors fall in the low-node range, which suggests that a low node count goes together with a good chance of survival.
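Two different proportions are easy to confuse when reading a CDF that was computed on survivors only: the share of survivors who have few nodes, and the survival rate among patients who have few nodes. The sketch below uses made-up toy arrays (not the dataset) to show that the two numbers are computed differently and need not agree.

```python
import numpy as np

# Hypothetical toy data, NOT the Haberman data
nodes = np.array([0, 1, 2, 0, 7, 3, 12, 1, 6, 0])
status = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 1])  # 1 = survived, 2 = did not

few = nodes < 5
survived = status == 1

# What a survivors-only CDF shows: share of survivors with fewer than 5 nodes
share_of_survivors_with_few = np.mean(few[survived])

# A different quantity: survival rate among patients with fewer than 5 nodes
survival_rate_given_few = np.mean(survived[few])

print(share_of_survivors_with_few, survival_rate_given_few)
```

Estimating an actual chance of survival for a given node count requires the second quantity, which needs both classes in the calculation.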

In [None]:
counts, bin_edges = np.histogram(unsurvive['nodes'], bins=10, density = True)
pdf = counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)
plt.title("PDF and CDF of detected nodes (did not survive)")
plt.plot(bin_edges[1:], pdf, label="PDF did not survive")
plt.plot(bin_edges[1:], cdf, label="CDF did not survive")
plt.legend()

# 7] Mean, Variance and Std-dev

**Mean** : It is the only measure of central tendency for which the sum of the deviations of each value from it is always zero.

**Variance (s²)** : It is the best-known measure of variability and is based on squared deviations: it is the average of the squared deviations from the mean (a mean squared deviation, not an average of the standard deviation).

**Standard Deviation** : It is the square root of the variance, so it is expressed in the same units as the data. It equals the Euclidean norm of the deviations from the mean divided by √n.
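These three definitions can be verified by hand on a small made-up sample (not the dataset). Note that `np.var` and `np.std` divide by n by default (`ddof=0`); passing `ddof=1` gives the sample versions that divide by n - 1.

```python
import numpy as np

# Hypothetical values, NOT the Haberman data
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()
deviations = x - mean                # these always sum to zero

pop_var = np.mean(deviations ** 2)   # same as np.var(x), which uses ddof=0
sample_var = np.var(x, ddof=1)       # divides by n - 1 instead of n
std = np.sqrt(pop_var)               # same as np.std(x)

print(mean, deviations.sum(), pop_var, std)
```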

In [None]:
#Mean, Variance, Std-deviation
print("Means:")
print(np.mean(haberman["nodes"]))
print(np.mean(haberman["age"]))

print("\nVariance:")
print(np.var(haberman['nodes']))
print(np.var(haberman['age']))

print("\nStd-dev:")
print(np.std(haberman["nodes"]))
print(np.std(haberman["age"]))

# 8] Median, Percentile, Quantile, IQR, MAD

**Median** : It is the middle value of a set of data. To determine the median value in a sequence of numbers, the numbers must first be arranged in ascending order.

**Percentile** : Percentiles are a very good way to describe the variability in data while limiting the influence of outliers. The Pth percentile is a value such that at least P% of the values are less than or equal to it and at least (100 - P)% of the values are greater than or equal to it.
The median is the 50th percentile of the data.

**Quantile** : It defines a particular part of a data set, i.e. a quantile determines how many values in a distribution are above or below a certain limit. Special quantiles are the quartile (quarter), the quintile (fifth) and percentiles (hundredth).

**Inter-Quartile Range [IQR]** : It is computed on ranked (sorted) data. Three quartiles divide the data into four parts: Q1 (25th percentile), Q2 (50th percentile) and Q3 (75th percentile). The inter-quartile range is the difference between Q3 and Q1.
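As a small worked example on made-up ages (not the dataset), the quartiles and the IQR can be obtained with `np.percentile`, which uses linear interpolation between sorted values by default:

```python
import numpy as np

# Hypothetical ages, NOT the Haberman data
ages = np.array([30, 35, 40, 45, 50, 55, 60, 65, 70])

q1, q2, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1          # spread of the middle 50% of the data

print(q1, q2, q3, iqr)
```

Here Q2 coincides with `np.median(ages)`, as expected for the 50th percentile.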

**Median Absolute Deviation [MAD]** : The mean absolute deviation, variance and standard deviation (discussed in the previous section) are not robust to extreme values and outliers. The MAD is: it is the median of the absolute deviations from the median.
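The robustness of the MAD is easy to see on a made-up sample with one extreme value (not the dataset). The sketch computes the raw MAD by hand. `statsmodels.robust.mad` additionally divides by a constant of about 0.6745 by default, so its output is larger than the raw value by a factor of roughly 1.4826.

```python
import numpy as np

# Hypothetical values with one extreme outlier, NOT the Haberman data
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)
raw_mad = np.median(np.abs(x - median))   # median of absolute deviations

# Default scaling used by statsmodels' robust.mad, which makes the result
# comparable to a standard deviation for normally distributed data
scaled_mad = raw_mad / 0.6744897501960817

print(raw_mad, scaled_mad, np.std(x))
```

The single outlier inflates the standard deviation considerably, while the MAD stays at the scale of the bulk of the data.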

In [None]:
print("\nMedians:")
print(np.median(haberman["age"]))

print("\nQuantiles:")
# 0th, 25th, 50th and 75th percentiles
print(np.percentile(haberman["age"], np.arange(0, 100, 25)))

print("\n90th Percentiles:")
print(np.percentile(haberman["age"],90))

# Referred from: https://www.statology.org/median-absolute-deviation-in-python/
# Note: robust.mad divides the raw MAD by about 0.6745 by default
from statsmodels import robust
print("\nMedian Absolute Deviation:")
print(robust.mad(haberman["age"]))


# 9] Box and Whiskers plots

In [None]:
# Box plot with whiskers: a more intuitive way of visualizing what the 1-D scatter plot shows.
# A box plot can be read as a PDF viewed sideways.

sns.boxplot(x='status', y='age', data=haberman)
plt.title("Box and whiskers plot of age at detection by status")
plt.show()

In [None]:
sns.boxplot(x='status',y='year', data=haberman)
plt.title("Box and whiskers plot of year of operation by status")
plt.show()

**Observation:** The interquartile range for status 1 lies between 60 and 66. The interquartile range for status 2 lies between 55 and 65.
Whiskers that are long relative to the box indicate an asymmetric distribution, and the length of the box represents the spread of the middle 50% of the data.
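The box and whisker positions can be reproduced numerically. The sketch below uses made-up values (not the dataset) and assumes the common 1.5 × IQR whisker rule, which is the default in seaborn's `boxplot`; points beyond the fences are drawn as outliers.

```python
import numpy as np

# Hypothetical values, NOT the Haberman data
x = np.array([58, 59, 60, 61, 62, 63, 64, 65, 66, 80])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that lie inside the fences
whisker_low = x[x >= lower_fence].min()
whisker_high = x[x <= upper_fence].max()
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(q1, q3, whisker_low, whisker_high, outliers)
```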

# 10] Violin plots

In [None]:
# A violin plot combines the benefits of the previous two plots (PDF and box plot).
# Denser regions of the data are fatter, and sparser ones thinner.

sns.violinplot(x="status", y="age", data=haberman)
plt.title("1) Violin plot for age at detection and survival status")
plt.show()

**Observation:-** In the violin plot above, the density for long-term survivors (status 1) is highest around age 60, with whiskers spanning roughly 30 to 75. The second violin (status 2) shows that the density for short-term survival is highest from about age 50, with a range of roughly 40 to 80.

In [None]:
sns.violinplot(x="status", y="year", data=haberman)
plt.title("3) Violin plot for year of operation and survival status")
plt.show()

**Observation:-** In the violin plot above, the density for long-term survivors (status 1) is highest around year 60, with whiskers spanning roughly 58 to 75. The second violin (status 2) shows that the density for short-term survival is highest around years 63 to 65, with the same range.

In [None]:
sns.violinplot(x="status", y="nodes", data=haberman)
plt.title("2) Violin plot for detected nodes and survival status")
plt.show()

**Observation:-** In the violin plot above, the density for long-term survivors (status 1) is highest near 0 nodes, with whiskers spanning roughly 0 to 9. The second violin (status 2) shows the density for short-term survival spread over 0 to 25 nodes, with a range of 0 to 12.

# 11] Multivariate probability density, contour plot.

In [None]:
# Joint plot: bivariate density of year vs. nodes drawn as contour lines
sns.jointplot(x=haberman.year, y=haberman.nodes, kind="kde")
plt.legend(labels=["nodes"])
plt.show()

In [None]:
# Contour plot of the joint density drawn as filled color shades
sns.set_style("white")
plt.title("Contour plot of age at detection vs. year of operation")
# 'fill' replaces the deprecated 'shade' argument
sns.kdeplot(x=haberman.age, y=haberman.year, cmap="Blues", fill=True, thresh=0)
plt.show()

**Observation:-** From the contour plot we can conclude that patients aged roughly 40 to 65 were detected and operated on most often; this is the region with the darkest shade.
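A coarse numeric counterpart of the contour plot is a 2-D histogram: the grid cell with the most points corresponds to the darkest region. The sketch below uses made-up (age, year) pairs (not the dataset) and an arbitrary 3 × 3 grid.

```python
import numpy as np

# Hypothetical (age, year) pairs, NOT the Haberman data
age = np.array([35, 42, 45, 47, 50, 52, 53, 55, 60, 72])
year = np.array([58, 60, 61, 63, 62, 64, 65, 63, 66, 69])

# Count points in a coarse 3 x 3 grid over the two features
counts, age_edges, year_edges = np.histogram2d(age, year, bins=3)

# Locate the fullest cell and report its age and year ranges
i, j = np.unravel_index(np.argmax(counts), counts.shape)
print(age_edges[i], age_edges[i + 1], year_edges[j], year_edges[j + 1])
```

A kernel density estimate smooths these cell counts into the continuous shading seen in the plot.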

References:

1) https://www.appliedaicourse.com/
2) https://www.kaggle.com/