Exploratory Data Analysis: Haberman’s Cancer Survival Dataset

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis testing task.
It is a good idea to explore a data set with several exploratory techniques, especially when their results can be compared side by side. The goal of exploratory data analysis is to gain enough confidence in your data that you are ready to apply a machine learning algorithm. A side benefit of EDA is that it helps refine the selection of feature variables that will be used later for machine learning.

1. Understanding the dataset

Title: Haberman’s Survival Data
Description: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Attribute Information:
Age of patient at the time of operation (numerical)
Patient’s year of operation (year minus 1900, numerical)
Number of positive axillary nodes detected (numerical)
Survival status (class attribute) :
1 = the patient survived 5 years or longer
2 = the patient died within 5 years
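
Because the operation year is stored as an offset from 1900 and the class is stored as an integer code, it can help to decode both before reading any plots. The following is a minimal sketch, assuming the column names `age`, `year`, `nodes` and `status` used later in this notebook. The rows in `toy` are made-up placeholders, not records from the dataset, and `calendar_year` is a hypothetical helper column.

```python
import pandas as pd

# Hypothetical placeholder rows (NOT real dataset records), same column layout.
toy = pd.DataFrame({
    "age": [30, 45, 62],
    "year": [64, 58, 69],   # stored as calendar year minus 1900
    "nodes": [1, 0, 12],
    "status": [1, 2, 1],    # 1 = survived 5+ years, 2 = died within 5 years
})

# Decode the operation year to a four-digit calendar year.
toy["calendar_year"] = toy["year"] + 1900

# Count patients per class using the raw integer codes.
class_counts = toy["status"].value_counts().sort_index()

print(toy[["year", "calendar_year"]])
print(class_counts)
```

On the real data, the same `value_counts` call shows how balanced the two classes are.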

2. Importing libraries and loading data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
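# Load the dataset (path as used in the original notebook environment)
data = pd.read_csv('../input/haberman.csv/haberman.csv')
print(data.shape)  # (number of rows, number of columns)
# Map the integer class codes to readable labels
data['status'] = data['status'].map({1: 'survived', 2: 'not survived'})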

In [None]:
data.head()

In [None]:
data.info()

Observations:

*     There are no missing values in this data set.
*     The age, year and nodes columns are of the integer data type.
*     The status column is an integer in the raw file. It was mapped above to the labels ‘survived’ (1, the patient survived 5 years or longer) and ‘not survived’ (2, the patient died within 5 years), so it now appears as an object column and can be treated as a categorical variable.
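
The checks behind these observations can also be made explicit in code. This is a minimal sketch, assuming the same column names as the notebook. The rows in `toy` are made-up placeholders rather than dataset records, and storing the labels with the `category` dtype is one possible choice, not something the original notebook does.

```python
import pandas as pd

# Hypothetical placeholder rows (NOT real dataset records).
toy = pd.DataFrame({
    "age": [30, 45, 62, 51],
    "year": [64, 58, 69, 60],
    "nodes": [1, 0, 12, 3],
    "status": [1, 2, 1, 2],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = toy.isna().sum()

# Map the integer codes to labels, then store them as a categorical column.
labels = {1: "survived", 2: "not survived"}
toy["status"] = toy["status"].map(labels).astype("category")

print(missing_per_column)
print(toy.dtypes)
```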

3. Univariate Analysis

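The main purpose of univariate analysis is to describe, summarize and find patterns in a single feature.
Probability Density Function (PDF)
For a continuous variable, the probability density function (PDF) describes how likely values near x are relative to other values; probabilities correspond to areas under the curve, not to its height. The curve drawn on the plots below is a smoothed version of the histogram.
Here the histogram is normalized, so the height of each bar is a density, and the area of the bar is the fraction of data points that fall in the corresponding bin.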
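
When a histogram is normalized to a density, the share of data points in a bin is the bar's area (height times width), and the areas sum to 1. The sketch below illustrates this with NumPy. The values in `ages` are made up for illustration and are not taken from the dataset, and the choice of 5 bins is arbitrary.

```python
import numpy as np

# Hypothetical ages for illustration only (NOT taken from the dataset).
ages = np.array([30, 34, 38, 41, 43, 47, 50, 52, 54, 57, 61, 65, 70])

# density=True rescales the bar heights so that the total bar area equals 1.
heights, edges = np.histogram(ages, bins=5, density=True)
widths = np.diff(edges)

# Share of data points in each bin = bar height * bin width (the bar's area).
bin_probabilities = heights * widths
total_area = bin_probabilities.sum()

print(bin_probabilities)
print(total_area)
```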

In [None]:
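# sns.distplot is deprecated in recent seaborn releases;
# histplot with kde=True draws a density histogram plus a smoothed curve.
sns.FacetGrid(data, hue='status', height=5)\
   .map(sns.histplot, 'age', kde=True, stat='density')\
   .add_legend()
plt.show()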

Observations:

*     The two distributions overlap heavily, so age alone does not separate patients who survived from those who did not.
*     Despite the overlap, patients aged 30–40 appear somewhat more likely to survive and patients aged 40–60 somewhat less likely, while patients aged 60–75 appear about equally likely to survive or not.
*     This cannot be a final conclusion: a patient’s survival chances cannot be decided from age alone.
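
A visual impression like this can be checked numerically by computing the survival share per age range. The following is a sketch under stated assumptions: the rows in `toy` are made-up placeholders (not dataset records), the bin edges follow the age ranges mentioned above, and the upper edge of 76 is an assumption chosen so that age 75 falls into the last bin.

```python
import pandas as pd

# Hypothetical placeholder rows (NOT real dataset records).
toy = pd.DataFrame({
    "age": [31, 35, 44, 52, 58, 63, 70, 39],
    "status": ["survived", "survived", "not survived", "survived",
               "not survived", "survived", "not survived", "survived"],
})

# Bins [30, 40), [40, 60), [60, 76); ages outside these ranges become NaN.
bins = [30, 40, 60, 76]
toy["age_group"] = pd.cut(toy["age"], bins=bins, right=False)

# Share of patients in each age group who survived.
survival_rate = (
    toy.assign(survived=toy["status"].eq("survived"))
       .groupby("age_group", observed=False)["survived"]
       .mean()
)

print(survival_rate)
```

Run on the real `data` frame, a table like this shows whether the differences seen in the plot are large enough to matter.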

In [None]:
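# sns.distplot is deprecated in recent seaborn releases;
# histplot with kde=True draws a density histogram plus a smoothed curve.
sns.FacetGrid(data, hue='status', height=5)\
   .map(sns.histplot, 'year', kde=True, stat='density')\
   .add_legend()
plt.show()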

Observations:

*     The two distributions overlap heavily. The plot only shows how the two outcomes are spread across the years of operation, so the year alone cannot determine a patient’s survival chances.
*     However, the density of patients who did not survive appears higher around 1960 and 1965 (60 and 65 on the axis).

In [None]:
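# sns.distplot is deprecated in recent seaborn releases;
# histplot with kde=True draws a density histogram plus a smoothed curve.
sns.FacetGrid(data, hue='status', height=5)\
   .map(sns.histplot, 'nodes', kde=True, stat='density')\
   .add_legend()
plt.show()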

Observations:

*     Patients with no nodes or 1 node are more likely to survive. Survival appears much less likely for patients with 25 or more nodes.
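
One way to put numbers on this observation is a row-normalized cross-tabulation of node groups against survival status. The sketch below is illustrative only: the rows in `toy` are made-up placeholders (not dataset records), and the group labels and the cut points at 1 and 24 are assumptions derived from the thresholds mentioned above.

```python
import pandas as pd

# Hypothetical placeholder rows (NOT real dataset records).
toy = pd.DataFrame({
    "nodes": [0, 1, 0, 4, 9, 26, 30, 1],
    "status": ["survived", "survived", "survived", "not survived",
               "survived", "not survived", "not survived", "not survived"],
})

# Group node counts into 0-1, 2-24 and 25+ (intervals are right-inclusive).
toy["node_group"] = pd.cut(
    toy["nodes"],
    bins=[-1, 1, 24, float("inf")],
    labels=["0-1", "2-24", "25+"],
)

# Row-normalized table: each row sums to 1 and shows the class shares per group.
shares = pd.crosstab(toy["node_group"], toy["status"], normalize="index")

print(shares)
```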

4. Bivariate Analysis

Pair plot

By default, sns.pairplot creates a grid of Axes in which each numeric variable in the data is shared across the y-axes of a single row and the x-axes of a single column. The diagonal Axes are treated differently: each one shows the univariate distribution of the variable in that column.

In [None]:
sns.set_style('whitegrid')
sns.pairplot(data, hue='status')
plt.show()

Observations:

*     The plot of year against nodes separates the two classes comparatively better than the other pairs.