# Haberman's Exploratory Data Analysis

<u><b>Introduction About Dataset</b></u> :
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.


There are 4 attributes in this dataset: 3 independent features and 1 class attribute, listed below.
<ol>
<li>Age</li>
<li>Operation Year</li>
<li>Number of Axillary nodes(Lymph Nodes)</li>
<li>Survival Status</li>
</ol>


<u><b>Attribute Information</b></u> :
<ul>
<li>Age of patient at time of operation (numerical)</li>
<li>Patient's year of operation (year - 1900, numerical)</li>
<li>Number of positive axillary nodes detected (numerical)</li>
<li>Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years</li>
</ul>


<u><b>Objective</b></u> : Classify whether a patient will survive 5 years or longer given the 3 features.

![Capture.JPG](attachment:Capture.JPG)

<u><b>Exploratory Data Analysis</b></u> : Exploratory Data Analysis is the process of examining the data to understand it and extract its main characteristics. In data analysis we have to play Sherlock Holmes: the prime objective is to perform an initial investigation of the data and discover insightful patterns before getting our hands dirty with modelling.

<b>Types of Exploratory data analysis:</b>

<b>Univariate Analysis: </b>Uni means one and variate means variable, so univariate analysis looks at only one variable at a time. It is the simplest form of data analysis. Its main objective is to describe the data, summarize it, and find the patterns present in it.

<b>Bi-Variate Analysis: </b>Bi means two and variate means variable, so here there are two variables. The analysis looks at the relationship between the two variables, including possible cause and effect.

<b>Multivariate analysis: </b> Multivariate analysis is required when more than two variables have to be analyzed simultaneously. It is a tremendously hard task for the human brain to visualize a relationship among 4 variables in a graph and thus multivariate analysis is used to study more complex sets of data. 

# Import all the required libraries

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Load the dataset

In [7]:
#read_csv() reads the data stored in a CSV file into a pandas DataFrame.
data=pd.read_csv("C:/Users/SLIM5/OneDrive/Documents/Project_semV/haberman.csv")

In [3]:
data


In [None]:
data.columns=['Age', 'year', 'nodes', 'status']

<u><b>Operation performed</b></u> : The file has no column names, so I assigned names that match the attribute list of the dataset.
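Because the file has no header row, a plain `read_csv` call uses the first record as column labels, and that record is then missing from the data. The sketch below shows one way to avoid this with `header=None` and `names=`. It uses a few illustrative in-memory rows in the same comma-separated layout instead of the real file, and the variable names are assumptions made for the example.

```python
import io
import pandas as pd

# Illustrative rows in the same headerless, comma-separated layout as the CSV file.
raw = "30,64,1,1\n30,62,3,1\n30,65,0,1\n31,59,2,1\n"
cols = ["Age", "year", "nodes", "status"]

# Default behaviour: the first record is consumed as the header row.
default_read = pd.read_csv(io.StringIO(raw))

# header=None keeps every record; names= assigns the column labels in the same step.
explicit_read = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(len(default_read), len(explicit_read))
print(explicit_read.head())
```

Comparing `data.shape` with the number of lines in the file is a quick way to check whether a record was lost this way.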

In [None]:
data

In [None]:
#head() function returns the first 5 rows.
data.head()

In [None]:
#tail() function returns the last 5 rows.
data.tail()

### ------------------------------------------------------------------Statistical Analysis-------------------------------------------------------------

In [None]:
#shape returns a tuple with the dimensions of the DataFrame as (rows, columns).
data.shape

In [None]:
#columns returns the column labels of the DataFrame.
data.columns

In [None]:
#dtypes returns the data type of each column.
data.dtypes

In [None]:
data['status'] = data['status'].map({1:"yes", 2:"no"})

<u><b>Operation performed</b></u> : The survival status feature has two values: 1 represents patients who survived 5 years or more, and 2 represents patients who died within 5 years. I mapped class 1 to the label "yes" and class 2 to the label "no" so that the values are descriptive.
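One thing to watch with `map` in a notebook: values that are not keys of the dictionary become `NaN`, so running the mapping cell a second time would wipe out the labels. A minimal sketch on a small hypothetical Series (not the real column):

```python
import pandas as pd

# Hypothetical status codes in the dataset's encoding (1 = survived, 2 = died).
status = pd.Series([1, 2, 1, 1, 2])
mapping = {1: "yes", 2: "no"}

labels = status.map(mapping)
print(labels.tolist())

# Mapping a second time: "yes"/"no" are not keys of the dict, so every value becomes NaN.
remapped = labels.map(mapping)
print(remapped.isna().all())
```

If the cell has to be safe to rerun, one option is to reload the data first or to keep the original codes in a separate column.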

In [None]:
data

In [None]:
#describe() returns descriptive statistics that summarize the central tendency,
#dispersion and shape of the distribution of each numeric column.
data.describe()

In [None]:
#isnull().sum() returns the number of missing values in each column.
data.isnull().sum()

In [None]:
#info() prints a summary of the DataFrame, including the index dtype, columns, non-null counts and memory usage.
data.info()

In [None]:
#Plot a countplot to show the number of observations in each survival status class.
sns.set(rc={'figure.figsize':(4,4)})
ax = sns.countplot(x='status', data=data)

#Annotate each bar with its count.
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.25, p.get_height()+0.01))

plt.show()

<u><b>Observation: </b></u>
    <ul>
    <li>We have 305 datapoints/rows and 4 features/columns.</li>
    <li>The 4 columns are: Age, year, nodes, status.</li>
    <li>Age, year and nodes are independent features of numerical type.</li>
    <li>There are no missing values in the dataset.</li>
    <li>There are 2 classes in the dataset (yes = the patient survived 5 years or longer, no = the patient died within 5 years): 224 datapoints are 'yes' and 81 are 'no'. The target column is imbalanced, with about 73% of the values being 'yes'.</li>
    <li>Patient age ranges from 30 to 83.</li>
    <li>Operation year ranges from 58 to 69.</li>
    <li>The number of axil nodes ranges from 0 to 52.</li>
    <li>The mean of nodes is 4.03 while the median is 1.00. This large gap implies that nodes has extreme values that pull the mean upwards.</li>
    </ul>
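The class shares and the mean-versus-median reading above can be checked with a few lines of pandas. The sketch below rebuilds the class column from the counts quoted in the observations (224 and 81). The second sample is a hypothetical right-skewed set of values, used only to show how a few large values pull the mean above the median.

```python
import pandas as pd

# Class counts taken from the observations above.
status = pd.Series(["yes"] * 224 + ["no"] * 81, name="status")

counts = status.value_counts()
shares = status.value_counts(normalize=True).round(3)
print(counts.to_dict())
print(shares.to_dict())

# Hypothetical right-skewed sample: most values are small, a few are large.
toy_nodes = pd.Series([0, 0, 0, 1, 1, 2, 3, 8, 25])
print("mean:", round(toy_nodes.mean(), 2), "median:", toy_nodes.median())
```

On the real DataFrame the same calls would be `data['status'].value_counts(normalize=True)` and `data['nodes'].agg(['mean', 'median'])`.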

### ------------------------------------------------------------------Univariate Analysis-------------------------------------------------------------

### 1D Scatter plot

In [None]:
sns.set(rc={'figure.figsize':(8,6)})
class1 = data.loc[data["status"] == "yes"]
class2 = data.loc[data["status"] == "no"]
plt.plot(class1["Age"], np.zeros_like(class1["Age"]), 'o', label='Yes')
plt.plot(class2["Age"], np.zeros_like(class2["Age"]), 'o', label='No')
plt.xlabel("Age")
plt.title("1D scatter plot of Age")
plt.legend(title = "status")
plt.show()

In [None]:
sns.set(rc={'figure.figsize':(8,6)})
class1 = data.loc[data["status"] == "yes"]
class2 = data.loc[data["status"] == "no"]
plt.plot(class1["year"], np.zeros_like(class1["year"]), 'o',label='Yes')
plt.plot(class2["year"], np.zeros_like(class2["year"]), 'o',  label='No')
plt.xlabel("Operation Year")
plt.title("1D scatter plot of Operation Year")
plt.legend(title = "status")
plt.show()

In [None]:
sns.set(rc={'figure.figsize':(8,6)})
class1 = data.loc[data["status"] == "yes"]
class2 = data.loc[data["status"] == "no"]
plt.plot(class1["nodes"], np.zeros_like(class1["nodes"]), 'o',label='Yes')
plt.plot(class2["nodes"], np.zeros_like(class2["nodes"]), 'o',label='No')
plt.xlabel("Axil nodes")
plt.title("1D scatter plot of Axil nodes")
plt.legend(title = "status")
plt.show()

<u><b>Observation of 1D scatter plot</b></u> : We can't extract any meaningful insight from the 1D plots above, because the datapoints of the two classes overlap across the whole range. As a result, we are unable to draw any conclusions from these plots.

### Histogram

In [None]:
sns.set(rc={'figure.figsize':(8,6)})
#binwidth alone defines the bins; seaborn lets binwidth override bins when both are given.
sns.histplot(x='Age', data=data, binwidth=5, hue='status').set(title='Histogram of Age')

In [None]:
sns.set(rc={'figure.figsize':(8,6)})
sns.histplot(x='year', data=data, binwidth=1, hue='status').set(title='Histogram of Operation year')

In [None]:
sns.set(rc={'figure.figsize':(8,6)})
sns.histplot(x='nodes', data=data, binwidth=1, hue='status').set(title='Histogram of Axil nodes')

<u><b>Observation of histogram plot</b></u>
<ul>
<li>The count of survivors is highest in the 50 to 55 age range.</li>
<li>The operation year is spread across the whole range from 58 to 69; we did not find any insight from the year attribute.</li>
<li>Most patients who survived more than 5 years have between 0 and 3 axil nodes.</li>
<li>Patients with 0 axil nodes have the highest chance of surviving more than 5 years.</li>
<li>We can write a simple rule of thumb like this:</li>
</ul>

    if (AxillaryNodes ≤ 0)
        Patient = Long survival
    else if (AxillaryNodes > 0 && AxillaryNodes ≤ 3.5 (approx))
        Patient = Long survival chances are high
    else if (AxillaryNodes > 3.5)
        Patient = Short survival
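The rule of thumb above can be sketched as a small Python function. The function name and the returned labels are assumptions made for illustration, and the 3.5 threshold is the approximate value read off the histogram, not a fitted parameter.

```python
def survival_rule(nodes: float) -> str:
    """Rule of thumb read off the histogram; the thresholds are approximate."""
    if nodes <= 0:
        return "long survival"
    elif nodes <= 3.5:
        return "long survival likely"
    else:
        return "short survival"


# Try the rule on a few example node counts.
for n in (0, 2, 10):
    print(n, "->", survival_rule(n))
```

Applied to the DataFrame, `data['nodes'].apply(survival_rule)` would give one label per patient, which could then be compared with `status` to see how often the rule agrees with the data.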

### Distribution plot

In [None]:
sns.displot(data, x='Age', hue="status", kind="kde", fill=True).set(title='Density plot of Age')

In [None]:
sns.displot(data, x='year', hue="status", kind="kde", fill=True).set(title='Density plot of Operation year')

In [None]:
sns.displot(data, x='nodes', hue="status", kind="kde", fill=True).set(title='Density plot of Axil nodes')

<u><b>Observation of distribution plot</b></u>: 
    <ul>
    <li>The majority of the operations were performed between 1958 and 1968.</li>
    <li>Most patients who survived more than 5 years after surgery have 0 axillary nodes.</li>
    <li>Patients tend to survive longer when fewer axillary nodes are detected, and vice versa. The classes are still hard to separate, but the node count is the most useful of the three features.</li>
    </ul>

### PDF and CDF plot

In [None]:
sns.set(rc={'figure.figsize':(8,6)})
class1=data.loc[data['status']=="yes"]
counts,bin_edges=np.histogram(class1['Age'],bins=10,density=True)
pdf=(counts/sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF of Yes class label")
plt.plot(bin_edges[1:], cdf, label="CDF of Yes class label")

class2=data.loc[data['status']=="no"]
counts,bin_edges=np.histogram(class2['Age'],bins=10,density=True)
pdf=(counts/sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF of No class label")
plt.plot(bin_edges[1:], cdf, label="CDF of No class label")
plt.xlabel("Age")
plt.ylabel("Density")
plt.legend(title = "Surv_status")

In [None]:
sns.set(rc={'figure.figsize':(8,6)})
class1=data.loc[data['status']=="yes"]
counts,bin_edges=np.histogram(class1['year'],bins=10,density=True)
pdf=(counts/sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF of Yes class label")
plt.plot(bin_edges[1:], cdf, label="CDF of Yes class label")

class2=data.loc[data['status']=="no"]
counts,bin_edges=np.histogram(class2['year'],bins=10,density=True)
pdf=(counts/sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF of No class label")
plt.plot(bin_edges[1:], cdf, label="CDF of No class label")
plt.xlabel("Operation year")
plt.ylabel("Density")
plt.legend(title = "Surv_status")

In [None]:
sns.set(rc={'figure.figsize':(8,6)})
class1=data.loc[data['status']=="yes"]
counts,bin_edges=np.histogram(class1['nodes'],bins=10,density=True)
pdf=(counts/sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF of Yes class label")
plt.plot(bin_edges[1:], cdf, label="CDF of Yes class label")

class2=data.loc[data['status']=="no"]
counts,bin_edges=np.histogram(class2['nodes'],bins=10,density=True)
pdf=(counts/sum(counts))
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label="PDF of No class label")
plt.plot(bin_edges[1:], cdf, label="CDF of No class label")
plt.xlabel("Axil nodes")
plt.ylabel("Density")
plt.legend(title = "Surv_status")

<u><b>Observation of PDF and CDF plots</b></u> : 
<ul>
    <li>The CDF of the yes class reads about 82 percent at the first bin of axil nodes, meaning that share of the patients who survived more than 5 years has few or no positive nodes. This is a share of the survivors, not the survival probability of a patient with 0 nodes.</li>
    <li>As the number of axil nodes increases, the chance of survival decreases.</li>
    <li>Patients tend to survive longer when fewer axillary nodes are detected, and vice versa. The classes are still hard to separate, but the node count is the most useful of the three features.</li>
</ul>
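The PDF and CDF cells above repeat the same three steps for every feature and class: bin the values, normalize the counts, and take the cumulative sum. A minimal sketch of those steps as a helper follows. The function name `pdf_cdf` and the sample values are hypothetical and only illustrate the calculation.

```python
import numpy as np


def pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probability (PDF) and cumulative probability (CDF)."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # share of the sample that falls in each bin
    cdf = np.cumsum(pdf)          # share of the sample at or below each bin's right edge
    return bin_edges, pdf, cdf


# Hypothetical node counts for one class, not the real data.
sample = np.array([0, 0, 0, 0, 1, 1, 2, 3, 7, 20])
edges, pdf, cdf = pdf_cdf(sample, bins=4)
print(edges)
print(pdf)
print(cdf)
```

Here `cdf[0]` is the fraction of this sample that lies in the first bin. When the helper is applied to one class at a time, that number is a share of that class, which is how the CDF curves above should be read.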

### ------------------------------------------------------------------Bivariate Analysis-------------------------------------------------------------

### Boxplot

In [None]:
sns.set(rc={'figure.figsize':(16,5)})
fig, axes = plt.subplots(1,3,sharex=True)
sns.boxplot(ax=axes[0], data=data, x='status', y='Age')
sns.boxplot(ax=axes[1], data=data, x='status', y='year')
sns.boxplot(ax=axes[2], data=data, x='status', y='nodes')

<u><b>Observation of box plot</b></u> : 
<ul>
<u><b>About Age box plot</b></u>

<b>For Surv_Status yes, out of 224 Patients:</b>
    <li>25% of the patients have age < 44 yrs.</li>
    <li>50% of the patients have age < 52 yrs.</li>
    <li>75% of the patients have age < 60 yrs.</li>
<b>For Surv_Status no, out of 81 Patients:</b>
    <li>25% of the patients have age < 46 yrs.</li>
    <li>50% of the patients have age < 54 yrs.</li>
    <li>75% of the patients have age < 62 yrs.</li>

<u><b>About operation year box plot</b></u>

<b>For Surv_Status yes, out of 224 Patients:</b>
    <li>25% of the patients had their operation before year 60.</li>
    <li>50% of the patients had their operation before year 63.</li>
    <li>75% of the patients had their operation before year 66.</li>
<b>For Surv_Status no, out of 81 Patients:</b>
    <li>25% of the patients had their operation before year 59.</li>
    <li>50% of the patients had their operation before year 63.</li>
    <li>75% of the patients had their operation before year 65.</li>
    
<u><b>About Axil node box plot</b></u>

<b>For Surv_Status yes, out of 224 Patients:</b>
    <li>The box plot of axil nodes shows outliers.</li>
    <li>The patients who survived more than 5 yrs have more outliers.</li>
    <li>The 50th percentile of nodes for short survival is nearly the same as the 75th percentile for long survival.</li>
<b>For Surv_Status no, out of 81 Patients:</b>
    <li>25% of the patients have axil nodes < 2.</li>
    <li>50% of the patients have axil nodes <= 4.</li>
    <li>75% of the patients have axil nodes < 12.</li>
    <li>There are a few outliers.</li>
</ul>
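The quartiles read off the box plots can also be computed directly, which avoids estimating them by eye. A minimal sketch with `groupby` and `quantile` follows. The small DataFrame holds hypothetical values, so the numbers it prints are not the ones listed above.

```python
import pandas as pd

# Hypothetical rows with the same column names as the notebook's DataFrame.
toy = pd.DataFrame({
    "nodes":  [0, 0, 1, 2, 3, 1, 4, 9, 12, 20],
    "status": ["yes"] * 5 + ["no"] * 5,
})

# 25th, 50th and 75th percentile of nodes for each class, one row per class.
quartiles = toy.groupby("status")["nodes"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

On the real data, replacing `toy` with `data` and `"nodes"` with `"Age"` or `"year"` gives the corresponding table for the other two box plots.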

### Violinplot

In [None]:
sns.set(rc={'figure.figsize':(16,5)})
fig, axes = plt.subplots(1,3,sharex=True)
sns.violinplot(ax=axes[0], data=data, x='status', y='Age')
sns.violinplot(ax=axes[1], data=data, x='status', y='year')
sns.violinplot(ax=axes[2], data=data, x='status', y='nodes')

<u><b>Observation of violin plot</b></u> :
<ul>
<li>In the axil nodes violin plot, the density for long survival is concentrated near 0 nodes, with whiskers in the range 0-7. For short survival, the density is spread over 0-20, with a threshold range of 0-12.</li>
    <li>The ages of patients who survived more than 5 yrs start slightly younger than the ages of those who did not survive.</li>
</ul>

### Pairplot

In [None]:
sns.pairplot(data, hue="status", markers=["o", "s"], corner=True)

<u><b>Observation</b></u>
<ul>
<li>It is not easy to tell whether a patient survives from any of the 3 attributes, because the data points of both classes mostly overlap and no simple if-else rule would classify survival with considerable accuracy.</li>
<li>The least overlapping feature is nodes, so let's use it to draw conclusions.</li>
<li>Plot 2 has Age on the X-axis and operation year on the Y-axis. The points of the two classes mostly overlap, so we cannot tell whether an orange point lies under a blue point or vice versa. I am rejecting this feature combination for further analysis.</li>

<li>Plot 4 has Age on the X-axis and axil nodes on the Y-axis. The classes still overlap, but some points are distinguishable, so this plot separates the classes better than the others.</li>

<li>Plot 5 has operation year on the X-axis and axil nodes on the Y-axis. It is similar to plot 4, but the points overlap more than in the other plots, so I am rejecting this combination as well.</li>
</ul>

### HeatMap

In [None]:

sns.set(rc={'figure.figsize':(8, 5)})
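#Plot the correlation matrix of the numeric features; the raw DataFrame contains the text column 'status'.
sns.heatmap(data[['Age', 'year', 'nodes']].corr(), annot=True)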

<b><u>Observation:</u></b> The figure above shows that none of the features is strongly correlated with another; they are largely independent of each other.

### Joint plot

In [None]:
sns.jointplot(x="Age", y="year", data=data, kind="kde", hue='status')
plt.show()

In [None]:
sns.jointplot(x="Age", y="nodes", data=data, kind="kde", hue='status')
plt.show()

In [None]:
sns.jointplot(x="nodes", y="year", data=data, kind="kde", hue='status')
plt.show()

<u><b>Observation</b></u> : The density is highest for patients aged 45-60 with 0-3 axillary nodes.

# Overall Conclusion

<ul>
    <b>From count plot: </b>
    <li>The class count plot shows that the dataset is imbalanced: 224 datapoints are 'yes' and 81 are 'no', so about 73% of the target values are 'yes'.</li>
    <b>Histogram:</b>
    <li>Most patients who survived more than 5 yrs have between 0 and 3 axil nodes.</li>
    <li>Most of the lymph node counts are concentrated around 0 and 1.</li>
    <b>Distribution plot:</b>
    <li>Most patients who survived more than 5 years after surgery have 0 axillary nodes.</li>
    <li>Patients tend to survive longer when fewer axillary nodes are detected, and vice versa. The classes are still hard to separate, but the node count is the most useful of the three features.</li>
    <b>PDF and CDF: </b>
    <li>The probability of survival is very high when the patient has between 0 and 3 axil nodes.</li>
    <li>About 82 percent of the patients who survived more than 5 years fall in the first bin of axil nodes, so they have few or no positive nodes.</li>
    <li>Age and operation year are not good features for classifying the dataset.</li>
    <li>Patients who survived more than 5 years have fewer positive nodes, or none at all.</li>
    <li>The nodes feature is relevant for classification, but it is not enough on its own.</li>
    <b>Boxplot: </b>
    <li>For the number of lymph nodes, many more outliers are seen among the patients who survived than among those who did not.</li>
    <b>Violin plot: </b>
    <li>In the axil nodes violin plot, the density for long survival is concentrated near 0 axil nodes.</li>
    <b>Heatmap: </b>
    <li>The heatmap shows that none of the features is strongly correlated with another; they are largely independent of each other.</li>
</ul>