# *Exploratory Data Analysis (EDA) is a crucial stage of any modeling task, whether regression or classification, yet it is often understated or ignored. EDA gives a clear picture of the data and of how the output depends on individual features or combinations of features.*

**Here is the sequence**

1. Introduction of dataset
2. Looking into the data
3. Univariate analysis
   * Probability Density Function
   * Cumulative Distribution Function
   * Box plot
   * Violin plot
4. Bivariate analysis
   * Scatter plot
   * Pair plot
5. Multivariate analysis
   * Contour plot

A few commonly used terms:
- A data set is the collection of data used (usually a table).
- A data point is a single observation in the data set (like a row).
- The target, also called the dependent variable or output variable, is the variable to be predicted or analyzed.
- A feature, also called an input variable or independent variable, is a variable (or one of a set of variables) used to determine the dependent variable.
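As a minimal sketch of the feature/target distinction, the snippet below splits a tiny table into its input columns and its output column. The rows are made up for illustration, and the column names are simply the ones this notebook assigns to the data further down.

```python
import pandas as pd

# Made-up rows for illustration only (not real patient records).
toy = pd.DataFrame({
    'age': [30, 45, 60],
    'op_year': [64, 62, 65],
    'axil_nodes': [1, 0, 12],
    'surv_status': [1, 1, 2],
})

features = toy.drop(columns='surv_status')  # independent variables (inputs)
target = toy['surv_status']                 # dependent variable (output)

print(features.shape)   # three data points, three features
print(target.tolist())
```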

Attribute Information:
1. Age of patient at the time of operation (numerical)
2. Patient’s year of operation (year minus 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 means the patient survived 5 years or longer; 2 means the patient died within 5 years

In [None]:
import pandas as pd
data = pd.read_csv('../input/habermans-survival-data-set/haberman.csv')

We import pandas and read the CSV (comma-separated values) file from Kaggle into a data frame.

In [None]:
data.shape

In [None]:
data.head()

Here, the data set does not have a header row (column names), so pandas has taken the first data point as the header. Hence, we need to supply the column names ourselves.
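The effect can be reproduced on a small in-memory CSV. This is only a sketch: the two rows are made up and merely follow the same four-column layout, and `io.StringIO` stands in for the real file.

```python
import io
import pandas as pd

# Two made-up rows in the same four-column layout (not real Haberman records).
raw = "30,64,1,1\n31,65,4,2\n"

# Default behaviour: the first row is consumed as the header.
no_names = pd.read_csv(io.StringIO(raw))
print(no_names.shape)  # only one data row is left

# Passing names= keeps every row as data and labels the columns.
cols = ['age', 'op_year', 'axil_nodes', 'surv_status']
with_names = pd.read_csv(io.StringIO(raw), names=cols)
print(with_names.shape)
```

Comparing the two shapes is a quick way to confirm that no observation was lost to the header.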

In [None]:
header_list=['age','op_year','axil_nodes','surv_status']
haberman_data = pd.read_csv('../input/habermans-survival-data-set/haberman.csv',names=header_list)

In [None]:
haberman_data.head()

Cool, the data is ready with feature names.

In [None]:
haberman_data['surv_status'].value_counts()

The data set is imbalanced, which means the number of data points in each class of ‘surv_status’ is not similar.
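One way to put a number on the imbalance is to look at class shares rather than raw counts. The sketch below uses a handful of made-up labels, not the real column; `value_counts(normalize=True)` returns the fraction of data points in each class.

```python
import pandas as pd

# Made-up labels for illustration only: 1 = survived, 2 = died.
labels = pd.Series([1, 1, 1, 2], name='surv_status')

counts = labels.value_counts()                # data points per class
shares = labels.value_counts(normalize=True)  # fraction per class
imbalance_ratio = counts.max() / counts.min() # majority size / minority size

print(shares)
print(imbalance_ratio)
```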

Now that we have a picture of the data, let’s analyze it further.

**UNIVARIATE ANALYSIS**

Univariate analysis, as the name says, simply means analysis using a single variable. This analysis gives the frequency/count of occurrences of the variable and lets us understand the distribution of that variable at various values.

3.1. PROBABILITY DENSITY FUNCTION (PDF) :

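In a PDF plot, the X-axis is the feature being analyzed and the Y-axis is the density of data points around each X-axis value: the histogram bars are scaled so that their total area is 1, and the smooth curve is a kernel density estimate. Taller regions mean that more of the data falls there. Hence the term “Density” in PDF.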
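With density scaling, bar heights are not raw counts: each height is chosen so that height times bin width, summed over all bins, equals 1. A minimal check with NumPy on made-up ages (the values and the bin count are assumptions for illustration):

```python
import numpy as np

# Made-up ages for illustration only.
ages = np.array([30, 34, 38, 41, 45, 47, 52, 55, 60, 63])

# density=True rescales the bar heights so the histogram integrates to 1.
density, edges = np.histogram(ages, bins=5, density=True)
widths = np.diff(edges)           # width of each bin
area = np.sum(density * widths)   # total area under the bars

print(area)
```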

In [None]:
import seaborn as sns
sns.set_style("whitegrid")

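# sns.distplot is deprecated; histplot with kde=True draws the same histogram plus density curve
sns.FacetGrid(haberman_data, hue='surv_status', height=5).map(sns.histplot, 'age', kde=True, stat='density').add_legend()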

Some insights:

* Major overlap is observed, so we cannot clearly say how survival depends on age.
* As a rough estimate, patients aged 20–50 have a slightly higher rate of survival and patients aged 75–90 have a lower rate of survival.
* Age can be considered as an input feature for predicting survival.

In [None]:
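# sns.distplot is deprecated; histplot with kde=True draws the same histogram plus density curve
sns.FacetGrid(haberman_data, hue='surv_status', height=5).map(sns.histplot, 'op_year', kde=True, stat='density').add_legend()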

Some more:

* The overlap is huge.
* Operation year alone is not a strong predictor of survival.

In [None]:
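# sns.distplot is deprecated; histplot with kde=True draws the same histogram plus density curve
sns.FacetGrid(haberman_data, hue='surv_status', height=5).map(sns.histplot, 'axil_nodes', kde=True, stat='density').add_legend()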

* Patients with 0 nodes have a high probability of survival.
* The number of axillary nodes can be used as an input feature for predicting survival.

Limitations:
From a PDF alone, we can’t say exactly how many data points lie in a given range, below a particular value, or above it.
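Those exact numbers are easy to compute directly from the data. A small sketch on made-up node counts (both the values and the threshold are assumptions chosen for illustration):

```python
import numpy as np

# Made-up node counts for illustration only.
nodes = np.array([0, 0, 1, 2, 3, 4, 8, 13, 22, 30])
threshold = 4

# A boolean mask counts the data points at or below the threshold.
n_at_or_below = int(np.sum(nodes <= threshold))
fraction = n_at_or_below / nodes.size

print(n_at_or_below, fraction)
```

This fraction, evaluated at every possible threshold, is exactly what a cumulative distribution function plots.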

3.2. CUMULATIVE DISTRIBUTION FUNCTION (CDF) :

First, we split the data by survival class.

In [None]:
survival_yes = haberman_data[haberman_data['surv_status']==1]
survival_no = haberman_data[haberman_data['surv_status']==2]

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# counts: the number of data points falling in each age bin
# edges: the bin boundaries along the X-axis (the feature under analysis)
# bins: the number of equal-width buckets
counts_no, edges_no = np.histogram(survival_no['age'], bins=10)
pdf_no = counts_no / counts_no.sum()  # fraction of data points in each bin
print(pdf_no)
# The CDF is the running total of the PDF; numpy's cumsum() computes it
cdf_no = np.cumsum(pdf_no)
print(cdf_no)

counts_yes, edges_yes = np.histogram(survival_yes['age'], bins=10)
pdf_yes = counts_yes / counts_yes.sum()
cdf_yes = np.cumsum(pdf_yes)

# Each class is plotted against its own bin edges and labelled by class
plt.plot(edges_yes[1:], pdf_yes, label='PDF, survived (1)')
plt.plot(edges_yes[1:], cdf_yes, label='CDF, survived (1)')
plt.plot(edges_no[1:], pdf_no, label='PDF, died (2)')
plt.plot(edges_no[1:], cdf_no, label='CDF, died (2)')
plt.legend()
# adding labels
plt.xlabel("AGE")
plt.ylabel("FRACTION OF PATIENTS")
plt.show()

Insights:
Around 80% of the data points have an age less than or equal to 60.

In [None]:
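# Same procedure as above, now for the number of axillary nodes
counts_no, edges_no = np.histogram(survival_no['axil_nodes'], bins=10)
pdf_no = counts_no / counts_no.sum()  # fraction of data points in each bin
print(pdf_no)
cdf_no = np.cumsum(pdf_no)
print(cdf_no)

counts_yes, edges_yes = np.histogram(survival_yes['axil_nodes'], bins=10)
pdf_yes = counts_yes / counts_yes.sum()
cdf_yes = np.cumsum(pdf_yes)

# Each class is plotted against its own bin edges and labelled by class
plt.plot(edges_yes[1:], pdf_yes, label='PDF, survived (1)')
plt.plot(edges_yes[1:], cdf_yes, label='CDF, survived (1)')
plt.plot(edges_no[1:], pdf_no, label='PDF, died (2)')
plt.plot(edges_no[1:], cdf_no, label='CDF, died (2)')
plt.legend()
plt.xlabel("AXIL_NODES")
plt.ylabel("FRACTION OF PATIENTS")
plt.show()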

Around 90% of the data points have an axil_nodes value less than or equal to 10.
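Statements like this can also be read from an empirical CDF that needs no binning at all: sort the values and give the i-th smallest the height i/n. The sketch below uses made-up values, and `empirical_cdf` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np

def empirical_cdf(values):
    """Return sorted values and the fraction of points <= each of them."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Made-up node counts for illustration only.
x, y = empirical_cdf([3, 0, 1, 0, 10])

# Fraction of data points with a value <= 1:
# side='right' finds the position just past the last value equal to 1.
share = y[np.searchsorted(x, 1, side='right') - 1]
print(share)
```

Unlike the binned version, this curve does not change with the choice of `bins`.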

3.3. BOX PLOTS

Points to note:

*     The median (50th percentile, or second quartile) is the middlemost value of the sorted data.
*     The 25th percentile (first quartile) is the value in the sorted data which has 25% of the data below it and 75% of the data above it.
*     The 75th percentile (third quartile) is the value in the sorted data which has 75% of the data below it and 25% of the data above it.
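The same three values can be computed with `np.percentile`. The ages below are made up for illustration. The last two lines sketch the usual 1.5 × IQR rule, which is seaborn's default for the whisker limits of a box plot; points beyond those limits are drawn as outliers.

```python
import numpy as np

# Made-up ages for illustration only.
ages = np.array([30, 35, 40, 45, 50, 55, 60, 65, 70])

q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1  # inter-quartile range: the height of the box

# Points outside these limits would be flagged as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

print(q1, median, q3, iqr)
```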

In [None]:
sns.boxplot(x='surv_status',y='age', data=haberman_data)

In [None]:
sns.boxplot(x='surv_status',y='axil_nodes', data=haberman_data)

In [None]:
sns.boxplot(x='surv_status',y='op_year', data=haberman_data)

3.4. VIOLIN PLOTS :

Violin plots combine box plots and density plots.

*     The white dot represents the median.
*     The edges of the thicker dark line represent the quartiles (25th and 75th percentiles).
*     The ends of the violin-shaped structure roughly mark the minimum and maximum; the smoothed density can extend slightly beyond the observed values.
*     The width of the shape represents the density/frequency of data points at that value.

In [None]:
sns.violinplot(x='surv_status',y='age', data=haberman_data)
plt.show()

In [None]:
sns.violinplot(x='surv_status',y='op_year', data=haberman_data)
plt.show()

In [None]:
sns.violinplot(x='surv_status',y='axil_nodes', data=haberman_data)
plt.show()

Insights:
*     Patients aged 75–90 are less likely to survive and patients aged 30–40 are more likely to survive.
*     The operation year doesn’t seem to give much information, as it is almost equally spread throughout the given years.
*     Patients with low node values are more likely to survive.

4. BI-VARIATE ANALYSIS :

SCATTER PLOT

Scatter plots compare two variables and help us analyze how the target variable depends on their combination.

In [None]:
sns.FacetGrid(haberman_data, hue="surv_status", height=8).map(plt.scatter, "age", "op_year").add_legend();

In [None]:
sns.FacetGrid(haberman_data, hue="surv_status", height=8).map(plt.scatter, "age", "axil_nodes").add_legend();

In [None]:
sns.FacetGrid(haberman_data, hue="surv_status", height=8).map(plt.scatter, "axil_nodes", "op_year").add_legend();

PAIR PLOTS :

In [None]:
sns.pairplot(haberman_data, hue="surv_status", height=7)
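A pair plot can be complemented by a numeric summary of the same pairwise relationships. As a rough sketch, `DataFrame.corr()` returns the Pearson correlation for every pair of numeric columns. The rows below are made up and deliberately perfectly linear, so the output says nothing about the real data set.

```python
import pandas as pd

# Made-up, deliberately linear rows for illustration only.
toy = pd.DataFrame({
    'age': [30, 40, 50, 60],
    'op_year': [64, 63, 62, 61],
    'axil_nodes': [0, 2, 4, 6],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

Values near 0 correspond to the shapeless clouds in a pair plot, while values near +1 or -1 correspond to points lying along a line.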

5. MULTI-VARIATE ANALYSIS :

CONTOUR PLOTS :

In [None]:
g=sns.jointplot(x = 'op_year', y = 'age', data = haberman_data, kind = 'kde')

A contour plot represents a 3-dimensional surface by plotting constant-z slices, called contours, in a 2-dimensional format.

In [None]:
g=sns.jointplot(x = 'op_year', y = 'age', data = haberman_data, kind = 'hex')

A large number of operations were performed in operation years 60–64 (1960–1964) on patients aged 45–55.
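A reading like this can be cross-checked numerically with a 2-D histogram, which counts data points per (operation year, age) cell much as the hexagonal bins do. The points and bin edges below are assumptions chosen for illustration, not the real data.

```python
import numpy as np

# Made-up (op_year, age) pairs for illustration only.
op_year = np.array([60, 61, 62, 63, 64, 66, 68])
age = np.array([46, 50, 52, 48, 54, 70, 35])

# Explicit bin edges: two year ranges and three age ranges.
year_edges = [60, 65, 70]
age_edges = [30, 45, 60, 75]
counts, _, _ = np.histogram2d(op_year, age, bins=[year_edges, age_edges])

# Locate the cell holding the most data points.
i, j = np.unravel_index(np.argmax(counts), counts.shape)
print(year_edges[i], year_edges[i + 1], age_edges[j], age_edges[j + 1], counts[i, j])
```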

In [None]:
# Hope you explored something interesting