# Introduction

**What is EDA?**

**Exploratory Data Analysis (EDA)**, in layman's terms, is a way to examine the characteristics of a given **data set** using various **visualization** methods and **tools**.

**Haberman's Data set**

Haberman's survival dataset contains cases from a study conducted at the **University of Chicago's Billings Hospital** on the survival of patients who had undergone surgery for **breast cancer**.

Number of Instances: 306

Number of Attributes: 4 (including the class attribute)

Attribute Information:

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years

Missing Attribute Values: None



# Importing the libraries

In [None]:
import numpy as np  # NumPy: package for scientific computing
import pandas as pd  # pandas: high-performance, easy-to-use data structures and data analysis tools
import matplotlib.pyplot as plt  # Matplotlib: plotting library for Python
import seaborn as sns  # seaborn: data visualization library based on Matplotlib
import os
print(os.listdir("../input"))  # list the available input files
import warnings
warnings.filterwarnings('ignore')



# Loading the Haberman's Data set

In [None]:
haberman = pd.read_csv("../input/haberman-dataset/haberman.csv") # pd.read_csv('file_name') loads the file
haberman.head()

In [None]:
haberman.tail()

In [None]:
haberman.describe()

# How many data points and features are there in the dataset?

In [None]:
haberman.shape

Haberman's dataset contains **306** data points and **4** columns: 3 features and the class label.

# What are the column names in our dataset?

In [None]:
haberman.columns

Haberman's dataset has 4 columns: **age, year, nodes and status**. The first three are features and **status** is the class label.
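The introduction states that the dataset has no missing values and that the year is stored as year - 1900. A minimal sketch of how both points could be checked is shown here. The rows in `sample_csv` are hypothetical placeholders in the same column layout, not real patient records; on the real data the same calls would be applied to `haberman`.

```python
import io
import pandas as pd

# Hypothetical rows in the same layout as haberman.csv (not real patient records).
sample_csv = io.StringIO(
    "age,year,nodes,status\n"
    "30,64,1,1\n"
    "45,60,0,1\n"
    "52,66,4,2\n"
)
sample = pd.read_csv(sample_csv)

# Count missing values per column; all zeros means there are no gaps.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# The year column stores (year - 1900), so adding 1900 gives calendar years.
calendar_year = sample["year"] + 1900
print(calendar_year.tolist())
```

Keeping the calendar year in a separate variable leaves the original column untouched, so the later plots still run unchanged.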

# How many data points for each class are present? 

In [None]:
haberman["status"].value_counts()
# 1 = the patient survived 5 years or longer
# 2 = the patient died within 5 years

**225 patients survived 5 years or longer.
81 patients died within 5 years.**

**Haberman is an imbalanced dataset: class 1 has 225 data points and class 2 has 81, a difference of 144. In other words, class 1 has more than twice as many data points as class 2.**
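The imbalance can also be expressed as shares and as a ratio. The sketch below rebuilds the label column from the two counts reported above (225 and 81) rather than reading the file. The names `status`, `shares` and `imbalance_ratio` are illustrative.

```python
import pandas as pd

# Class labels rebuilt from the counts reported above: 225 of class 1, 81 of class 2.
status = pd.Series([1] * 225 + [2] * 81, name="status")

counts = status.value_counts()                # absolute counts per class
shares = status.value_counts(normalize=True)  # fraction of all rows per class
imbalance_ratio = counts.loc[1] / counts.loc[2]

print(counts)
print(shares.round(3))
print(round(imbalance_ratio, 2))
```

On the loaded data, the same two `value_counts` calls can be applied directly to `haberman["status"]`.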

# 2-D Scatter Plot

In [None]:
haberman.plot(kind='scatter', x='age', y='nodes', color='y'); # 2-D Scatter plot of age vs nodes
plt.xlabel("Age")
plt.show()

In [None]:
haberman.plot(kind='scatter', x='year', y='nodes', color='y'); # 2-D Scatter plot of years vs nodes
plt.xlabel("Year")
plt.show()

**Observations**

1. Many operations were performed between 1960 and 1966.
2. Patients aged approximately 30-70 have roughly 0-15 positive axillary nodes.
3. Without color coding, the 2-D scatter plots of **age vs nodes** and **year vs nodes** do not let us draw any conclusions, because the two classes cannot be told apart.
4. Other than the above, no clear information can be taken from these plots.

In [None]:
sns.set_style("whitegrid");
h = sns.FacetGrid(haberman,  hue="status",height=7)
h = h.map(plt.scatter, "age", "nodes").add_legend();
plt.xlabel("Age")
plt.ylabel("Nodes")
plt.show();

**Observations**
1. Patients with nodes = 0 are likely to survive.
2. Node counts greater than 10 appear mostly among patients older than 40 and younger than 70.
3. Most of the patients who did not survive 5 years were approximately 33-75 years old.

In [None]:
sns.set_style("whitegrid");
h = sns.FacetGrid(haberman,  hue="status", height=7)
h = h.map(plt.scatter, "year", "nodes").add_legend();
plt.xlabel("Year")
plt.ylabel("Nodes")
plt.show();

**Observations**
1. Operations done in 1960 and 1961 appear to have been more successful than those in any other year.
2. Operations done in 1963 and 1965 appear to have been less successful than those in any other year.

In [None]:
sns.set_style("whitegrid");
h = sns.FacetGrid(haberman,  hue="status", height=7)
h = h.map(plt.scatter, "year", "age").add_legend();
plt.xlabel("Year")
plt.ylabel("Age")
plt.show();

**Observations**
1. We cannot make out much, but over the span of about 10 years from 1958 to 1968, the patients who survived were mostly in the age group of 30-70 years.

# Observations
1. In the first 3 scatter plots above, there is no clear separation between class 1 and class 2. So we cannot set bounds that distinguish the patients who survived 5 years or longer from those who died within 5 years.

# Pair-plot

In [None]:
plt.close();

sns.set(style="whitegrid", color_codes=True)
# `size` was renamed to `height` in newer seaborn; pairplot adds the hue legend itself.
sns.pairplot(haberman, hue="status", vars=["age", "year", "nodes"], height=5)
plt.show()

**Observations**
 1. We cannot find "lines" or "if-else" conditions that would give a simple model to classify the status of the patients.
 2. Even with pair plots we cannot decide which pair of features separates the status well enough to design a model.
 3. There is a very minor separation in the plot of year vs nodes.
 4. The two classes are not linearly separable in any pair of features.

# Univariate Analysis

# Histogram with PDF

In [None]:
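# `size` was renamed to `height` in newer seaborn (as used in the scatter plots above).
# Note: distplot is deprecated in recent seaborn releases in favor of histplot/displot.
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(sns.distplot, "age") \
   .add_legend();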
plt.xlabel("Age")
plt.show();


In [None]:
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(sns.distplot, "year") \
   .add_legend();
plt.xlabel("Year")
plt.show();

In [None]:
sns.FacetGrid(haberman, hue="status", height=5) \
   .map(sns.distplot, "nodes") \
   .add_legend();
plt.xlabel("Nodes")
plt.show();

**Observations**
1. From the age plot, patients aged 40 or more appear to have a lower chance of surviving 5 years.
2. The chance of surviving 5 years or more appears greater for patients aged 40 or younger.
3. The distributions of the two classes overlap heavily for every feature, so no single feature is enough to build a model.
4. Patients with nodes = 0 have a higher chance of surviving, and patients with nodes > 0 have a lower chance.
5. Survival chances were greater for operations in the years 1958-62.
6. Survival chances were lower for operations in the years 1963-66.
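An observation such as "nodes = 0 gives a higher chance of surviving" can be checked numerically by comparing survival rates between groups. The sketch below uses a small hypothetical table (the values in `toy` are made up for illustration, not taken from the dataset). On the real data, `haberman` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy rows (not the real data): node count and status per patient.
toy = pd.DataFrame({
    "nodes":  [0, 0, 0, 0, 1, 3, 8, 12],
    "status": [1, 1, 1, 2, 1, 2, 1, 2],
})

# status == 1 means the patient survived 5 years or longer.
toy["survived"] = toy["status"] == 1

# Label each patient by whether any positive nodes were detected.
toy["node_group"] = toy["nodes"].gt(0).map({False: "nodes = 0", True: "nodes > 0"})

# The mean of a boolean column is the fraction of True values, i.e. the survival rate.
survival_rate = toy.groupby("node_group")["survived"].mean()
print(survival_rate)
```

Comparing rates per group gives a number to put beside the visual impression from the overlapping histograms.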

# PDF and CDF

**AGE**

In [None]:
haberman_class1=haberman.loc[haberman["status"]==1];
haberman_class2=haberman.loc[haberman["status"]==2];

counts, bin_edges = np.histogram(haberman_class1['age'], bins=20, density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)

# Same steps for class 2 (patients who died within 5 years)
counts, bin_edges = np.histogram(haberman_class2['age'], bins=20, density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:] ,cdf)
plt.xlabel("Age")
plt.title("PDF and CDF for Age")
plt.legend(['survived_pdf', 'survived_cdf','died_pdf', 'died_cdf'])
plt.show();

**YEAR**

In [None]:
counts, bin_edges = np.histogram(haberman_class1['year'], bins=20, density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)

# Same steps for class 2 (patients who died within 5 years)
counts, bin_edges = np.histogram(haberman_class2['year'], bins=20, density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:] ,cdf)
plt.title("PDF and CDF for Year")
plt.xlabel("Year")
plt.legend(['survived_pdf', 'survived_cdf','died_pdf', 'died_cdf'])
plt.show();

**NODES**

In [None]:
counts, bin_edges = np.histogram(haberman_class1['nodes'], bins=20, density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:],cdf)

# Same steps for class 2 (patients who died within 5 years)
counts, bin_edges = np.histogram(haberman_class2['nodes'], bins=20, density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:] ,cdf)
plt.title("PDF and CDF for Nodes")
plt.xlabel("Nodes")
plt.legend(['survived_pdf', 'survived_cdf','died_pdf', 'died_cdf'])
plt.show();

**Observations**
1. About 0.4 or 40% of the patients who survived were younger than 50.
2. About 0.4 or 40% of the patients who did not survive were also younger than 50, so age below 50 does not separate the classes.
3. Approximately 97% of the patients who survived had fewer than 10 nodes.
4. Node counts greater than 23 fall in the last part of the died CDF (97-100%), so such high counts are rare even among the patients who died.
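Each CDF above is computed within one class, so a point on the curve gives the share of that class at or below a value, not the probability of surviving. The sketch below repeats the histogram recipe from the cells above on a hypothetical array (`nodes` holds made-up values) and compares it with a direct count that does not depend on the binning.

```python
import numpy as np

# Hypothetical node counts for one class (illustrative values only).
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 12])

# Same recipe as the cells above: histogram -> normalized bin mass -> cumulative sum.
counts, bin_edges = np.histogram(nodes, bins=4)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
print(bin_edges)
print(pdf)
print(cdf)

# Direct empirical share below a threshold, independent of the bin edges.
share_below_5 = np.mean(nodes < 5)
print(share_below_5)
```

The direct count is useful when a threshold of interest, such as 10 nodes, does not coincide with a bin edge.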

# Box Plot

**Age**

In [None]:
sns.boxplot(x='status',y='age', data=haberman)
plt.ylabel("Age") 
plt.xlabel("Status") 
plt.show()

**Year**

In [None]:
sns.boxplot(x='status',y='year', data=haberman)
plt.ylabel("Year") 
plt.xlabel("Status")
plt.show()

**Nodes**

In [None]:
sns.boxplot(x='status',y='nodes', data=haberman)
plt.ylabel("Nodes") 
plt.xlabel("Status")
plt.show()

**Observations**
1. Many of the patients who survived had 0 nodes.
2. Patients younger than about 35 appear to have mostly survived.
3. Most of the patients who died had between 1 and 25 nodes.
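The box in a box plot spans the 25th to 75th percentile, with the median marked inside. These values can be printed per class instead of being read off the figure. The sketch below does this on a hypothetical table (the values in `toy` are made up). On the real data, `haberman` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy rows (not the real data): node count and status per patient.
toy = pd.DataFrame({
    "nodes":  [0, 0, 1, 3, 0, 2, 4, 10],
    "status": [1, 1, 1, 1, 2, 2, 2, 2],
})

# Quartiles of nodes per class: one row per status, one column per quantile.
quartiles = toy.groupby("status")["nodes"].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range: the height of the box in the box plot.
iqr = quartiles[0.75] - quartiles[0.25]

print(quartiles)
print(iqr)
```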

# Violin Plots

**Age**

In [None]:
sns.violinplot(x="status", y="age", data=haberman)  # violinplot has no `size` parameter
plt.ylabel("Age") 
plt.xlabel("Status")
plt.show()

**Year**

In [None]:
sns.violinplot(x="status", y="year", data=haberman)  # violinplot has no `size` parameter
plt.ylabel("Year") 
plt.xlabel("Status")
plt.show()

**Nodes**

In [None]:
sns.violinplot(x="status", y="nodes", data=haberman)  # violinplot has no `size` parameter
plt.ylabel("Nodes") 
plt.xlabel("Status")
plt.show()

**Observations**
1. Looking at the violin plot of nodes, we can say that about 50% of the patients who survived had 0 nodes.
2. Patients with fewer than 8 or 9 nodes have a better chance of surviving more than 5 years.
3. Most of the patients who died had at least 1 node.
4. Many of the patients who died were operated on in the years 1958-65.
5. The overlap is high, so it is not possible to set limits that separate the 2 classes based on any single feature.

# Conclusion


1. From the different plots (pair plot, violin plot, box plot), we can say that there is a high level of overlap between class 1 and class 2 for every feature, be it age, year or nodes.
2. We can also see that as the number of nodes and the age increase, the chance of survival appears to decrease.
3. Patients with nodes equal to 0 have a higher chance of surviving.
4. None of age, year or nodes can on its own act as a reliable feature to classify the class variable; more features need to be considered.
5. Age alone does not appear to determine survival.
6. Of the 3 features, nodes is the most informative, but it is not always decisive.
7. Many of the patients who died were operated on in the years 1958-65.
8. Patients younger than about 35 appear to have mostly survived.
