## Exploratory Data Analysis of Haberman's Survival Data Set

**Exploratory Data Analysis (EDA)** is an approach to data analysis that aims to uncover the information hidden in a data set using statistical tools, plotting tools, linear algebra, and other techniques. It helps us understand the data better and highlights the main characteristics that may inform predictions and forecasts.

Understanding data is core to data science, so EDA is an important step towards building accurate machine learning models. Here we perform EDA on Haberman's Survival Data Set using Python. The data set contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

The various attributes of the data set are: <br> 
<ol>
    <li>Age of patient at the time of operation (numerical)</li>
    <li>Patient's year of operation (year between 1958 to 1970, numerical)</li> 
    <li>Number of positive axillary nodes detected (numerical)</li>
    <li>Survival status (class attribute) denoted as:</li> 
        <ul>
            <li>1 - if the patient survived 5 years or longer</li> 
            <li>2 - if the patient died within 5 years</li>
        </ul>
 </ol>
 
Just like the medical diagnosis of patients plays a key role in the patient's treatment lifecycle, EDA plays a vital role in data assessment and the creation of accurate models.

## Importing Requisite Python Libraries

**Python** was chosen for its mature ecosystem of data analysis and machine learning libraries. Here we import the libraries required to perform data analysis and plotting: <br>
<ul>
    <li><p><strong>Pandas(Python Data Analysis Library)</strong></p></li>
    <li><p><strong>Numpy(Python Package for Scientific Computing)</strong></p></li>
    <li><p><strong>Matplotlib(Python Plotting Library)</strong></p> </li>
    <li><p><strong>Seaborn(Python Statistical Data Visualization Library)</strong></p></li>
</ul>    


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Loading Data Set

The Haberman's Survival Data Set is a comma-separated values (csv) file. The **read_csv()** function of **Pandas** reads the csv file (haberman.csv) into a **DataFrame** named haberman. A DataFrame is a two-dimensional, size-mutable tabular structure whose columns can hold different data types.

In [None]:
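# header=0 treats the first line of the file as a header row and replaces it with `names`
column_names = ['Age of Patient', 'Year of Operation',
                'Positive Axillary Nodes', 'Survival Status']
haberman = pd.read_csv("../input/habermans-survival-data-set/haberman.csv",
                       header=0, names=column_names)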
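How `header` and `names` interact is easy to get wrong, so here is a minimal sketch on a made-up two-record file (the text below is hypothetical, not the real haberman.csv). It suggests checking whether the file on disk really starts with a header line before relying on `header=0`.

```python
import io
import pandas as pd

names = ['Age of Patient', 'Year of Operation',
         'Positive Axillary Nodes', 'Survival Status']

# Hypothetical two-record file WITH a header line (not the real haberman.csv).
with_header_text = "age,year,nodes,status\n30,64,1,1\n31,59,2,2\n"
# The same two records WITHOUT a header line.
no_header_text = "30,64,1,1\n31,59,2,2\n"

# header=0: the first line is consumed as a header and replaced by `names`.
df_a = pd.read_csv(io.StringIO(with_header_text), header=0, names=names)
# header=None: every line is data, and `names` supplies the labels.
df_b = pd.read_csv(io.StringIO(no_header_text), header=None, names=names)
# header=0 on a file without a header line silently drops the first record.
df_c = pd.read_csv(io.StringIO(no_header_text), header=0, names=names)

print(df_a.shape, df_b.shape, df_c.shape)  # (2, 4) (2, 4) (1, 4)
```

If the row count looks one short after loading, `header=None` is probably the setting to try.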

## Getting Glimpses of Data Set

Let's get acquainted with the data set by doing some preliminary analysis. First of all, let's see what the data set looks like.

In [None]:
"""
for i,j in haberman.iterrows(): #iterrows() is used to iterate through each row of the DataFrame
    print(j)
    print()
"""
print(haberman)

All the attributes of the data set are self-explanatory. Age of Patient is the age of the patient at the time of the operation. Year of Operation is the year in which the operation was performed. Positive Axillary Nodes is the number of axillary (underarm) lymph nodes in which cancer cells were found; a value of zero means none were detected. Finally, Survival Status records whether the patient survived for 5 years or longer.

** Observations: **
<ol>
    <li><p>The csv file contains <strong>306 rows and 4 columns</strong>, implying that the data set contains information about <strong>306 patients</strong> who underwent surgery for breast cancer. Considering the volume of data, the data set is small.</p></li>
    <li><p>A patient's diagnosis is based on the clinical findings. As <strong>Positive Axillary Nodes</strong> is the only attribute of the data set that describes the patient's clinical condition, we can assume that it is the strongest indicator of how far the cancer has spread (positive nodes are a sign of spread, not a cause of breast cancer). According to BreastCancer.org, to remove invasive breast cancer, the doctor removes one or some of the underarm lymph nodes (before or during surgery) so that they can be examined under a microscope for cancer cells. The presence of cancer cells is known as <strong>lymph node involvement</strong>.</p></li>
    <li><p>The presence of another symptom in the data set would have created confusion as to what variable should be given top priority for data analysis. Hence in the preliminary analysis, it seems Positive Axillary Nodes is the most important variable.</p></li>     
</ol>

The first five rows of data set can be seen by the head() function.

In [None]:
haberman.head()

Now let's find the total number of data points and features (attributes) of the data set by using the Pandas shape attribute. A data point is a complete record, that is, one row of the DataFrame holding a value for each of the four attributes. The shape attribute returns a tuple of the form (number of rows, number of columns).

In [None]:
print(haberman.shape)

The data set has 306 data points (rows) and 4 attributes (columns). The names of the attributes are available through the columns attribute of the DataFrame.

In [None]:
print(haberman.columns)

Here dtype refers to the data type of the column labels. The object data type can hold strings or a mix of Python objects (integers, floats or strings).

The Survival Status attribute (dependent variable) holds the **integer codes** 1 and 2, which stand for categories rather than quantities. Hence we convert them to the **descriptive labels** Survived and Died.

In [None]:
haberman['Survival Status'] = haberman['Survival Status'].apply(
    lambda x: 'Survived' if x == 1 else 'Died')

Let's verify whether the conversion has occurred.

In [None]:
print(haberman.head(10))

We can see now that the Survival Status has fields marked as Survived or Died.
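The conversion above produces plain strings (object dtype). If a true Pandas categorical column is wanted, one possible variant is sketched below on a few toy values; the `map` dictionary and the `astype('category')` step are a substitute for the `apply` call above, not part of the original notebook.

```python
import pandas as pd

# Toy status codes standing in for the 'Survival Status' column.
status = pd.Series([1, 2, 1, 1, 2], name='Survival Status')

# Map each code to its label, then store the result with the category dtype.
labels = status.map({1: 'Survived', 2: 'Died'}).astype('category')

print(labels.dtype)                 # category
print(list(labels.cat.categories))  # ['Died', 'Survived']
```

Unlike the lambda with an `else` branch, `map` with a dictionary leaves any unexpected code as NaN, which makes bad values easier to spot.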

A concise summary of data set can be displayed by the info method. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

In [None]:
haberman.info()  # info() prints the summary itself and returns None, so no print() is needed

** Observations: **
<ol>
    <li>The data type of the first three columns namely Age of Patient, Year of Operation and Positive Axillary Nodes is an integer. Survival Status has an object data type. The data set has four data columns.</li>
    <li>The data set has no null values.</li>
    <li>Memory used by data set is approximately 9.7 KB </li>
</ol>
   

The **Pandas describe** method generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. It analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided.

In [None]:
print(haberman.describe(include='all'))

** Observations: **
<ol>
    <li><p>The method gives the total count of each attribute</p></li>
    <li><p>For numeric data(variables like Age of Patient, Year of Operation, Positive Axillary Nodes), the method provides valuable information on <strong>standard deviation(std), mean, percentiles(25%, 50% and 75%), min and max values</strong>. The 50th percentile(50%) is <strong>median</strong>. Hence, a summary of central tendencies(mean and median) and dispersion(standard deviation) is obtained.</p></li>
    <li><p>Using min and max values, following inferences can be made:  </p></li>
        <ul>
            <li><p>The <strong>maximum age</strong> of patient is 83 and the <strong>minimum age</strong> is 30.</p></li>
            <li><p>The year of operation ranges from <strong>58 (1958) to 69 (1969)</strong>. </p></li>
            <li><p>At least one patient had <strong>52 positive axillary nodes</strong> and at least one patient had <strong>zero positive axillary nodes</strong>. </p></li>
        </ul>
    <li><p>For the object data type (Survival Status), the result includes <strong>unique, top and freq (frequency)</strong>. Variables with a numeric data type show NaN in these fields. Survival Status has two unique values (Survived and Died). top is the most common value, so <strong>Survived</strong> is the most common survival status. freq is the frequency of the most common value, which is 225 here. So the <strong>total number of patients who survived is 225</strong>.   </p></li>
</ol>

We can confirm the number of patients who survived with the value_counts() method.

In [None]:
print(haberman['Survival Status'].value_counts())

We can conclude that more patients **survived** breast cancer (225) than **died** of it (81). The data set is therefore **imbalanced**.
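To put a number on the imbalance, the counts can be turned into proportions. The sketch below rebuilds a stand-in column from the two counts reported above (225 and 81) rather than reading the file, and uses the `normalize=True` option of `value_counts()`.

```python
import pandas as pd

# Stand-in column built from the counts reported above (225 survived, 81 died).
status = pd.Series(['Survived'] * 225 + ['Died'] * 81, name='Survival Status')

# normalize=True returns proportions instead of raw counts.
share = status.value_counts(normalize=True)
print(share.round(4))
```

Roughly 74% of the records belong to the Survived class, which is worth keeping in mind when judging any classifier trained on this data.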

## Objective

The main objective of this EDA is to find out how well the attributes Age of Patient, Year of Operation and Positive Axillary Nodes indicate whether a patient will survive for 5 years or longer.

## Different Levels of Analysis

Now let's dive deeper into the data set. For that, it's imperative to consider the different levels of analysis that exist. They are:
<ul>
    <li><p><strong>Univariate Analysis</strong></p> </li>
    <li><p><strong>Bivariate Analysis</strong></p> </li>
    <li><p><strong>Multivariate Analysis</strong></p> </li>
</ul>

The choice of data analysis technique ultimately depends on the number of variables, the data type, and the focus of the statistical inquiry.

## Univariate Analysis

Univariate analysis is the simplest data analysis technique that deals with only one variable. Being a single variable process, it does not give insights on the cause or effect relationships. The primary objective of the univariate analysis is to simply describe the data to find patterns within the data. The univariate analysis methods being considered are:
<ol>
    <li><p><strong>1-D Scatter Plot</strong></p> </li>
    <li><p><strong>Probability Density Function(PDF)</strong></p> </li>
    <li><p><strong>Cumulative Distribution Function(CDF)</strong></p> </li>
    <li><p><strong>Box Plot</strong></p> </li>
    <li><p><strong>Violin Plot</strong></p> </li>
</ol>

## Bivariate Analysis

Bivariate analysis examines the relationship between two variables, which makes it more analytical than univariate analysis. If the data seems to fit a line or curve, then there is a relationship or correlation between the two variables. The bivariate analysis methods being considered are:
<ol>
    <li><p><strong>2-D Scatter Plot</strong></p> </li>
    <li><p><strong>Pair Plot</strong></p></li>
</ol>

## Multivariate Analysis

Multivariate analysis is a more complex statistical analysis. It is the analysis involving three or more variables and is implemented in a scenario where there is a need to understand the relationship between them. The multivariate analysis method being considered is:
<ol>
    <li><p><strong>Contour Plot</strong></p> </li>
</ol>

## Modus Operandi

The analysis will start with bivariate analysis. We will plot a 2-D scatter plot first and record our observations. Then we will move on to the pair plot to see both the distribution of single variables and the relationship between two variables. Afterward, the univariate and multivariate analyses will be conducted.

## 2-D Scatter Plot

The two-dimensional scatter plot helps to visualize a correlation between two variables using Cartesian coordinates. The values of one variable will be plotted along the x-axis and the other variable on the y-axis. The data will be plotted in the resultant quadrant as an ordered pair(x, y) in which x relates to value on x-axis and y relates to y-axis value.  

In [None]:
sns.set_style('whitegrid')
# `height` replaces the older `size` parameter, which was renamed in Seaborn 0.9
sns.FacetGrid(haberman, hue='Survival Status', height=8) \
.map(plt.scatter, 'Age of Patient', 'Positive Axillary Nodes') \
.add_legend()
plt.show()


** FacetGrid ** is a multi-plot grid for plotting conditional relationships. FacetGrid object takes a ** DataFrame ** as input and the names of the variables that will form the row, column, or hue dimensions of the grid. The variables should be categorical and the data at each level of the variable will be used for a facet along that axis. The map() method is responsible for repeating the same plot on each space of the grid. It applies a plotting function to each facet’s subset of the data. The add_legend() method creates the legend of the plot.

** Observations: **
<ol>
    <li><p>The majority of patients in the <strong>age group 30-40 have survived breast cancer</strong>, as there are very few orange dots here. </p></li>
    <li><p>It is <strong>very rare</strong> for patients to have <strong>more than 25 (or rather 30) positive axillary nodes</strong>. </p></li>
    <li><p>Almost <strong>all patients</strong> in the age group <strong>50-60 have survived</strong> when there is an <strong>absence of positive axillary nodes</strong>. We can assume this from the absence of orange dots at zero nodes between 50 and 60.</p></li>
    <li><p><strong>All patients above the age of 80 have died</strong> within five years of the operation, as there are no blue dots here. </p></li>
    <li><p>A few patients with a <strong>higher number of positive axillary nodes (greater than 10)</strong> have also survived breast cancer (blue dots where Positive Axillary Nodes > 10). </p></li>
</ol>
    

** Ascertaining Observations: ** <br>
We can ascertain the ** observation no. 1 ** by carrying out the following operation on the haberman DataFrame.

In [None]:
df_3040 = haberman.loc[(haberman['Age of Patient']<=40) & (haberman['Age of Patient']>=30)]
#print(df_3040)

df_3040_survived = df_3040.loc[df_3040['Survival Status']=='Survived']
print('No. of patients in the age group 30-40 survived: {0}' .format(len(df_3040_survived)))

df_3040_died = df_3040.loc[df_3040['Survival Status']=='Died']
print('No. of patients in the age group 30-40 died: {0}' .format(len(df_3040_died)))


**The output verifies observation no. 1.**

We can ascertain the value 25(assumed by the blue dot at the mid point of 20 and 30) mentioned in the ** observation no. 2 ** by:

In [None]:
ax_node = haberman['Positive Axillary Nodes'].unique() #unique values of axillary node
ax_node.sort() #sorted the list
print(ax_node)

The list has value 25. So we can safely assume that the blue dot is at value 25.

We can ascertain ** observation no. 4 ** by the following operation:

In [None]:
age = haberman['Age of Patient']
count = 0
print(len(age))
for i in age:
    if(i >= 80):
        count += 1
print('No. of patients whose age is greater than or equal to 80: {0}' .format(count))

Hence there is only one patient whose age is greater than or equal to 80. The orange dot beyond 80 must represent this patient.
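The checks above filter the DataFrame once per group. As an alternative sketch, `pd.cut` and `pd.crosstab` can tabulate survival for every age band in one step, and a vectorised comparison can replace the counting loop. The six rows and the band edges below are invented for illustration and are not taken from the data set.

```python
import pandas as pd

# Invented rows that mimic two columns of the haberman DataFrame.
toy = pd.DataFrame({
    'Age of Patient': [30, 35, 40, 55, 62, 83],
    'Survival Status': ['Survived', 'Survived', 'Died',
                        'Survived', 'Died', 'Died'],
})

# Bands are right-closed: (29, 40], (40, 60], (60, 80], (80, 100].
toy['Age Group'] = pd.cut(toy['Age of Patient'],
                          bins=[29, 40, 60, 80, 100],
                          labels=['30-40', '41-60', '61-80', '81+'])

# One row per age band, one column per survival status.
table = pd.crosstab(toy['Age Group'], toy['Survival Status'])
print(table)

# Vectorised replacement for the counting loop.
n_80_plus = int((toy['Age of Patient'] >= 80).sum())
print('Patients aged 80 or more:', n_80_plus)
```

On the real DataFrame the same two calls would apply unchanged, since the column names match.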

## Pair Plots

A pair plot shows pairwise relationships in a dataset. It creates a grid of Axes such that each numeric variable in the data is shared on the y-axis across a single row and on the x-axis across a single column. The diagonal Axes are treated differently from the rest of the grid: they show the univariate distribution of the variable in that column.


In [None]:
sns.set_style('whitegrid')
sns.pairplot(haberman, hue='Survival Status', height=4)  # `height` replaces the older `size` parameter
plt.show()

In the data set, there are 3 quantitative variables (Age of Patient, Year of Operation, Positive Axillary Nodes) and 1 categorical variable (Survival Status). Only numeric (integer or float) variables are plotted in a pair plot. Hence the pair plot shown above has **3C2 = 3 unique scatter plots**. Each of them appears twice, once on either side of the diagonal, with the axes swapped. The **diagonal plots (plot 1, plot 5 and plot 9)** show the distribution of a single variable, drawn as histograms or density curves depending on the Seaborn version.
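The plot count can be checked with a few lines of standard-library Python. This is only a counting sketch; the column names are the ones assigned when the file was loaded.

```python
from itertools import combinations
from math import comb

numeric_cols = ['Age of Patient', 'Year of Operation', 'Positive Axillary Nodes']

# Every unordered pair of numeric columns gives one unique scatter plot.
pairs = list(combinations(numeric_cols, 2))
for x, y in pairs:
    print(x, 'vs', y)

n = len(numeric_cols)
unique_plots = comb(n, 2)          # 3C2
grid_cells = n + 2 * unique_plots  # diagonal + both mirrored halves
print(unique_plots, grid_cells)    # 3 9
```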

** Observations: **
<ol>
    <li><p>There is no linear separation in any of the plots </p></li>
    <li><p>There is considerable overlapping of data in each plot. </p></li>
    <li><p>In <strong>plot 2</strong>, Year of Operation is on the x-axis and Age of Patient is on the y-axis. There is <strong>so much overlap</strong> that it is difficult to make a classification based on the plot. One interesting fact is that the <strong>majority of the patients</strong> who were operated on in the years <strong>1961 and 1968 survived</strong> 5 years or longer (there are very few orange dots compared to other years). In other words, 1961 and 1968 had the fewest deaths among patients who had undergone breast cancer operations. </p></li>
    <li><p>In <strong>plot 3</strong>, Positive Axillary Nodes is on x-axis and Age of Patient is on the y-axis. Even though there is an overlapping of dots, there are <strong>distinguishable patterns</strong> that enable us to make inferences. Plot 3(and Plot 7) appears to be better than the rest of the plots(<strong>We have analyzed the same plot in-depth during 2-D Scatter Plot</strong>). So Positive Axillary Nodes and Age of Patient are the most useful features to identify the survival status of a patient. </p></li>
    <li><p>In <strong>plot 6</strong>, Positive Axillary Nodes is on x-axis and Year of Operation is on the y-axis. The plot has the <strong>most overlapping</strong> of dots. Hence it will not lead to any meaningful conclusions or classifications. </p></li>
    <li><p><strong>Plot 4 is a mirror image of plot 2. Plot 7 is a mirror image of plot 3. Plot 8 is a mirror image of plot 6. </strong> </p></li>
    <li><p>Finally <strong>plot 7 and plot 3 are the best plots</strong> to be considered for data analysis. </p></li>
</ol>

** Ascertaining Observations: ** <br>
We can ascertain the ** observation no. 3** of plot 2 by carrying out the following operations on the haberman DataFrame:

In [None]:
df_1961 = haberman.loc[haberman['Year of Operation']==61]
df_1968 = haberman.loc[haberman['Year of Operation']==68]
#print(df_1961)

df_1961_survived = df_1961.loc[df_1961['Survival Status']=='Survived']
print('No. of patients survived during 1961: {0}' .format(len(df_1961_survived)))

df_1961_died = df_1961.loc[df_1961['Survival Status']=='Died']
print('No. of patients died during 1961: {0}' .format(len(df_1961_died)))

aster = '*'
print(aster*45)

df_1968_survived = df_1968.loc[df_1968['Survival Status']=='Survived']
print('No. of patients survived during 1968: {0}' .format(len(df_1968_survived)))

df_1968_died = df_1968.loc[df_1968['Survival Status']=='Died']
print('No. of patients died during 1968: {0}' .format(len(df_1968_died)))

## 1-D Scatter Plot

The scatter plot in which a single variable is used to make inferences is a 1-D scatter plot. Here the variable will be on the x-axis and the y-axis will have zeros(as it is impossible to make a plot without two axes). Obviously, it is a univariate analysis.

In [None]:
df_survived = haberman.loc[haberman['Survival Status'] == 'Survived'] 
df_died = haberman.loc[haberman['Survival Status'] == 'Died']
plt.plot(df_survived['Positive Axillary Nodes'], np.zeros_like(df_survived['Positive Axillary Nodes']), 'o')
plt.plot(df_died['Positive Axillary Nodes'], np.zeros_like(df_died['Positive Axillary Nodes']), 'o')
plt.show()

In the code above, **haberman.loc[ ]** selects the rows of the haberman DataFrame that satisfy the given condition and stores them in another DataFrame. The **np.zeros_like()** function creates an array of zeros with the same shape as its input. 'o' is Matplotlib's format string for circle markers, so each data point is drawn as a dot with no connecting line.

** Observations: **
<ol>
    <li>1-D scatter plot based on one feature - Positive Axillary Nodes </li>
    <li>There is significant overlap of data, which prevents us from making any meaningful observations. </li>
</ol> 

## Histogram

Histogram is an accurate representation of ** numerical data distribution ** that was first introduced by ** Karl Pearson **. It gives an estimate of continuous variables' probability distribution. Histogram is a univariate analysis as it relates to only one variable.<br>

The very first step in constructing a histogram is to **'bin' (or bucket)** the range of values. To bin means to divide the entire range of values into a series of intervals, and then count the number of values that fall into each interval. The bins are usually consecutive, non-overlapping intervals of a variable. They must be adjacent and are often, though not necessarily, of equal size. All but the last (rightmost) bin are half-open.<br>

For equal-sized bins, a rectangle is erected over the bin with height proportional to number of cases in each bin(frequency or count).

Matplotlib is used to plot histogram and Numpy is used to calculate count and bin edges.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

df_axnodes = haberman['Positive Axillary Nodes'] # Series holding the Positive Axillary Nodes column
count, bin_edges = np.histogram(df_axnodes, bins=25)

print('Bin Edges: ', bin_edges)
print('Counts per Bin: ', count)
plt.hist(df_axnodes, bins=25, color='skyblue', alpha=0.7) 
plt.xlabel('Positive Axillary Nodes', fontsize=15)
plt.ylabel('Frequency', fontsize=15)

Matplotlib plots the histogram using plt.hist(), which accepts array-like input such as a Pandas Series. bin_edges holds every bin boundary, from the left edge of the first bin to the right edge of the last bin, so it has one more element than there are bins. The color parameter sets the color of the bars and the alpha parameter sets their transparency. plt.xlabel() and plt.ylabel() set the labels of the x-axis and y-axis respectively.
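The binning rules described above can be seen on a tiny array. The seven values below are made up for the illustration; only `np.histogram` itself is taken from the cell above.

```python
import numpy as np

values = np.array([0, 0, 1, 2, 3, 4, 4])  # made-up sample

# Four equal-width bins spanning min(values) to max(values).
counts, edges = np.histogram(values, bins=4)

print(edges)   # [0. 1. 2. 3. 4.]  -> bins + 1 edges
print(counts)  # [2 1 1 3]

# Bins are [0, 1), [1, 2), [2, 3) and [3, 4]: only the last one
# includes its right edge, which is why both 4s land in it.
```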

** Observations **
<ul>
    <li><p>197 of the 306 patients fall in the first bin, which means they have fewer than 2.08 (that is, at most 2) positive axillary nodes. So the <strong>majority (64.38%) of patients have a small number of positive axillary nodes</strong>. </p></li>
</ul>    

## Probability Density Function(PDF)

PDF is used to specify the probability of the random variable falling within a particular range of values. It is the probability function used to describe a continuous probability distribution. PDF is used to deal with the probabilities of random variables that have continuous outcomes. The height of a person arbitrarily chosen from a population is a typical example. 

The PDF can be viewed as a smoothed version of the histogram. The smoothing is done using Kernel Density Estimation (KDE). The area under the PDF curve always equals 1. PDF is a univariate analysis.

The code snippet shown below will plot PDF. 

** PDF based on Age of Patient **

In [None]:
sns.set_style('whitegrid')
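# `height` replaces the older `size` parameter; histplot(kde=True) replaces the deprecated distplot
sns.FacetGrid(haberman, hue='Survival Status', height=8) \
    .map(sns.histplot, 'Age of Patient', kde=True, stat='density') \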
    .add_legend()
plt.show()

The bars(orange and blue) are the histograms, and the curves represent the PDF.<br>
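The smoothing behind the curves can be sketched by hand. The function below is a basic Gaussian KDE written for illustration; `gaussian_kde_1d`, the toy ages and the bandwidth of 5 are assumptions of this sketch, and Seaborn chooses its own bandwidth internally.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample (a basic KDE)."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

samples = np.array([30.0, 34.0, 41.0, 45.0, 52.0, 60.0, 63.0])  # toy ages
grid = np.linspace(0, 100, 2001)
density = gaussian_kde_1d(samples, grid, bandwidth=5.0)

# Trapezoidal rule: the area under a density curve should be close to 1.
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(round(area, 4))
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths more aggressively.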

** Observations: **
<ol>
    <li><p>There is significant overlap in the data, which makes the picture ambiguous. </p></li>
    <li><p>Patients in the <strong>age group 30-40 have better survival chances</strong> than other age groups. </p></li>
    <li><p>Patients in the <strong>age group 40-60 have lower prospects of survival</strong>. </p></li> 
    <li><p>The age group of <strong>40-45 recorded the highest number of deaths</strong>(have the least possibility of survival). </p></li>
    <li><p>We cannot make final conclusions about a patient's survival chances based on the attribute 'Age of Patient'.  </p></li>
</ol>    

** PDF based on Year of Operation **

In [None]:
sns.set_style('whitegrid')
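# `height` replaces the older `size` parameter; histplot(kde=True) replaces the deprecated distplot
sns.FacetGrid(haberman, hue='Survival Status', height=8) \
    .map(sns.histplot, 'Year of Operation', kde=True, stat='density') \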
    .add_legend()
plt.show()

** Observations: **
<ol>
    <li><p>Major overlapping can be observed. </p></li>
    <li><p>The plot provides information about the number of successful operations(in which patients survived) and the unsuccessful ones. <strong>Success of an operation cannot be based on year as a factor</strong>.</p></li>
    <li><p><strong>Most unsuccessful operations</strong> were performed in year <strong>1965, followed by 1960</strong>. </p></li> 
    <li><p><strong>Most successful operations</strong> were performed in year <strong>1961</strong>. </p></li>
</ol>    

** PDF based on Positive Axillary Nodes **

In [None]:
sns.set_style('whitegrid')
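# `height` replaces the older `size` parameter; histplot(kde=True) replaces the deprecated distplot
sns.FacetGrid(haberman, hue='Survival Status', height=8) \
    .map(sns.histplot, 'Positive Axillary Nodes', kde=True, stat='density') \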
    .add_legend()
plt.show()

** Observations: **
<ol>
    <li><p> The presence of positive axillary nodes (lymph node involvement) indicates that the cancer has spread beyond the breast, and BreastCancer.org describes how the nodes are examined for this reason. Positive Axillary Nodes thus carries more significance than the rest of the attributes. </p></li>
    <li><p><strong>Patients with zero positive axillary nodes have much higher chances of survival</strong> than patients who have one or more.</p></li>
    <li><p><strong>Patients with a single positive axillary node also have good chances of survival</strong>. </p></li> 
    <li><p>The <strong>likelihood of surviving</strong> breast cancer <strong>decreases as the number of positive axillary nodes increases</strong>. </p></li> 
    <li><p>Only a <strong>small number of patients have positive axillary nodes more than 25</strong>.</p></li>
    <li><p><strong>Positive Axillary Nodes</strong> is the preferred attribute to do data analysis.</p></li>
</ol> 

** Ascertaining Observations: ** <br>
We can ascertain the observations by carrying out the following operation on the haberman DataFrame:

In [None]:
df_one = haberman.loc[haberman['Positive Axillary Nodes']<=1]
df_less = haberman.loc[(haberman['Positive Axillary Nodes']<=25) & (haberman['Positive Axillary Nodes']>1)]
df_more = haberman.loc[haberman['Positive Axillary Nodes']>25]

df_one_survived = df_one.loc[df_one['Survival Status']=='Survived']
print('No. of patients survived(with one or no positive nodes): {0}' .format(len(df_one_survived)))

df_one_died = df_one.loc[df_one['Survival Status']=='Died']
print('No. of patients died(with one or no positive nodes): {0}' .format(len(df_one_died)))

aster = '*'
print(aster*65)

df_less_survived = df_less.loc[df_less['Survival Status']=='Survived']
print('No. of patients survived(1<positive nodes<=25): {0}' .format(len(df_less_survived)))

df_less_died = df_less.loc[df_less['Survival Status']=='Died']
print('No. of patients died(1<positive nodes<=25): {0}' .format(len(df_less_died)))

print(aster*65)

df_more_survived = df_more.loc[df_more['Survival Status']=='Survived']
print('No. of patients survived(25<positive nodes<=52): {0}' .format(len(df_more_survived)))

df_more_died = df_more.loc[df_more['Survival Status']=='Died']
print('No. of patients died(25<positive nodes<=52): {0}' .format(len(df_more_died)))


The output has ascertained the observations. We can make following conclusions: <br>
<ol>
    <li><p><strong>85%</strong> of patients with <strong>one or zero positive axillary nodes have survived</strong> breast cancer. </p></li>
    <li><p><strong>58%</strong> of patients with more than 1 and at most 25 positive axillary nodes have survived five years or longer. </p></li>
    <li><p><strong>60%</strong> of patients with more than 25 positive axillary nodes have survived breast cancer. </p></li>
    <li><p>These statistics suggest that the survival chances of patients are high if the number of positive axillary nodes is one or zero. If the number is greater than one, the survival chances range from 58% to 60%. Only a small number of patients have more than 25 nodes, so the last figure rests on very few cases. </p></li>
</ol>    
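The percentages above come from dividing the survived count by the group size for each band. A compact way to get all the rates at once is sketched below with `pd.cut` and `groupby`; the ten rows are invented for illustration, so the printed rates are not the ones quoted above.

```python
import pandas as pd

# Invented rows that mimic two columns of the haberman DataFrame.
toy = pd.DataFrame({
    'Positive Axillary Nodes': [0, 1, 1, 0, 3, 10, 20, 30, 40, 26],
    'Survival Status': ['Survived', 'Survived', 'Died', 'Survived', 'Survived',
                        'Died', 'Died', 'Survived', 'Died', 'Survived'],
})

# Same three bands as the check above: <=1, 2 to 25, more than 25.
band = pd.cut(toy['Positive Axillary Nodes'], bins=[-1, 1, 25, 52],
              labels=['0-1', '2-25', '26-52'])

# The mean of a boolean column is the fraction of True values.
survived = toy['Survival Status'].eq('Survived')
rate = survived.groupby(band, observed=True).mean()
print((rate * 100).round(1))
```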

## Cumulative Distribution Function(CDF)

The cumulative distribution function (cdf) of a real-valued random variable X is the probability that the variable takes a value less than or equal to x. <br>
** F(x) = P(X <= x) ** <br>
where the right-hand side represents the probability that the random variable X takes on a value less than or equal to x. The probability that X lies in the semi-closed interval (a,b], where a<b, is therefore <br>
** P(a < X <= b) = F(b) - F(a) **

Integrating the Probability Density Function (PDF) gives the CDF. CDF is also a univariate analysis.
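The two formulas can be checked on a sample with an empirical CDF, which estimates F(x) as the fraction of observations less than or equal to x. The helper `ecdf` and the ten toy node counts are assumptions of this sketch.

```python
import numpy as np

nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 13])  # toy node counts

def ecdf(sample, x):
    """Empirical F(x): fraction of observations <= x."""
    return np.mean(sample <= x)

F_a = ecdf(nodes, 1)  # F(1)
F_b = ecdf(nodes, 4)  # F(4)

# P(1 < X <= 4) computed two ways: F(b) - F(a), and by direct counting.
via_cdf = F_b - F_a
direct = np.mean((nodes > 1) & (nodes <= 4))
print(F_a, F_b, round(via_cdf, 2), direct)
```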

CDF is plotted using the selected variable 'Positive Axillary Nodes'.

In [None]:
df_axnodes_survived = haberman.loc[haberman['Survival Status']=='Survived']
counts1, bin_edges1 = np.histogram(df_axnodes_survived['Positive Axillary Nodes'], bins=10, density=True)
pdf1 = counts1/(sum(counts1))
print('PDF of patients survived 5 years or longer:', pdf1)
print('Bin Edges: ', bin_edges1)
cdf1 = np.cumsum(pdf1)

aster = '*'
print(aster * 60)

df_axnodes_died = haberman.loc[haberman['Survival Status']=='Died']
counts2, bin_edges2 = np.histogram(df_axnodes_died['Positive Axillary Nodes'], bins=10, density=True)
pdf2 = counts2/(sum(counts2))
print('PDF of patients died within 5 years:', pdf2)
print('Bin Edges: ', bin_edges2)
cdf2 = np.cumsum(pdf2)

print(aster * 60)

line1, = plt.plot(bin_edges1[1:], pdf1, label='PDF_Survived')
line2, = plt.plot(bin_edges1[1:], cdf1, label='CDF_Survived')
line3, = plt.plot(bin_edges2[1:], pdf2, label='PDF_Died')
line4, = plt.plot(bin_edges2[1:], cdf2, label='CDF_Died')
plt.legend(handles=[line1, line2, line3, line4])
"""
line1 = plt.plot(bin_edges1[1:], pdf1, label='PDF1') #no comma after line1 as above
line2 = plt.plot(bin_edges1[1:], cdf1, label='CDF1')
line3 = plt.plot(bin_edges2[1:], pdf2, label='PDF_Died')
line4 = plt.plot(bin_edges2[1:], cdf2, label='CDF_Died')
plt.legend()
"""
plt.xlabel('Positive Axillary Nodes', fontsize=15)
plt.show()


Numpy is used to calculate the counts and bin edges, and Matplotlib's plt.plot() draws the PDF and CDF as lines against the right edge of each bin (bin_edges[1:]). Dividing the counts by their sum gives the fraction of patients in each bin. np.cumsum() is a Numpy function that calculates the cumulative sum. plt.legend() is a Matplotlib function for generating the legend of the graph. plt.xlabel() is another Matplotlib function that labels the x-axis.
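The counts-to-CDF step can be followed on a small array. This sketch uses ten toy values and five bins, and leaves out `density=True` because dividing by the sum already yields per-bin fractions when the bins have equal width.

```python
import numpy as np

nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 13])  # toy node counts

counts, edges = np.histogram(nodes, bins=5)
fractions = counts / counts.sum()  # share of observations in each bin
cdf = np.cumsum(fractions)         # running total of those shares

# Each CDF value belongs to the right edge of its bin.
for right_edge, value in zip(edges[1:], cdf):
    print(f'P(X <= {right_edge:.1f}) = {value:.2f}')
```

The last value is always 1, since every observation lies at or below the right edge of the final bin.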

** Observations: **
<ol>
    <li><p> Even <strong>patients with a higher number of positive axillary nodes have survived breast cancer</strong>. Conversely, some patients with no positive axillary nodes died within five years of the operation.</p></li>
    <li><p>The <strong>maximum number of positive axillary nodes for a patient who survived cancer is 46</strong> </p></li>
    <li><p><strong>83.55%</strong> of patients who survived cancer had positive axillary nodes in the <strong>range 0 to 4.6</strong>. </p></li> 
    <li><p><strong>56.79%</strong> of patients who died had positive axillary nodes in the <strong>range 0 to 5.2</strong>. </p></li> 
</ol> 

** Ascertaining Observations: ** <br>
We can ascertain the ** observation no. 1** by carrying out the following operations on the haberman DataFrame:

In [None]:
df_axnodes_died = haberman.loc[haberman['Survival Status']=='Died']
df_no_axnodes_died = df_axnodes_died.loc[df_axnodes_died['Positive Axillary Nodes']==0]
print('No. of patients died with zero Positive Axillary Node: ', len(df_no_axnodes_died))

df_axnodes_survived = haberman.loc[haberman['Survival Status']=='Survived']
df_high_axnodes_survived = df_axnodes_survived.loc[df_axnodes_survived['Positive Axillary Nodes']>=20]
print('No. of patients survived with high Positive Axillary Nodes(>=20): ', len(df_high_axnodes_survived))


## Box Plot

A box plot is a visual representation of the distribution of data based on the five-number summary. These five numbers are **minimum or smallest number, first quartile (Q1) or 25th percentile, median (Q2) or 50th percentile, third quartile (Q3) or 75th percentile and maximum or largest number**. Q1 is the middle number between the minimum and the median. Q3 is the middle value between the median and the maximum. The InterQuartile Range (IQR) is the difference between the third quartile and the first quartile.<br>
**IQR = Q3 - Q1**<br>
In a vertical box plot, the height of the box represents the IQR. The bottom line and top line of the box represent the first quartile and the third quartile respectively. The line between them represents the median. The lines extending from the box are known as the 'whiskers', which indicate variability outside the upper and lower quartiles. Outliers are sometimes plotted as individual dots in line with the whiskers. An outlier is a data point that differs significantly from the other observations and lies outside the overall pattern of the distribution.

The Box and Whisker Plot was first introduced by mathematician ** John Tukey ** in 1969. Box Plots can be drawn either vertically or horizontally. Although Box Plots may seem primitive in comparison to a Histogram or Density Plot, they have the advantage of taking up less space, which is useful when comparing distributions between many groups or datasets.

The Python statistical data visualization library ** Seaborn ** is used to draw the ** box plot **.

In [None]:
sns.boxplot(x='Survival Status', y='Positive Axillary Nodes', data=haberman)
plt.show()

## Violin Plot

A violin plot is used to visualise the distribution of the data and its probability density. It combines a box plot with a rotated kernel density plot on either side to show the shape of the distribution. The ** white dot ** in the middle is the median, and the thick black bar at the centre represents the interquartile range. The thin black line extending from it represents the upper and lower adjacent values, that is, the most extreme data points that are not treated as outliers. Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator. Combining the best of both worlds ** (Histogram and PDF, Box Plot) ** gives the ** violin plot **. A violin plot is a ** univariate analysis **.
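
To make the kernel density idea more concrete, here is a minimal sketch of a one-dimensional Gaussian kernel density estimate written with NumPy only. The function name `gaussian_kde_1d`, the sample values, and the bandwidth of 2.0 are all assumptions made for illustration; Seaborn chooses its own bandwidth and may differ in detail.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    # One row per grid point, one column per sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical node counts (not the Haberman data)
nodes = np.array([0, 0, 1, 1, 2, 3, 4, 7, 9, 13, 30])
grid = np.linspace(-20, 60, 801)
density = gaussian_kde_1d(nodes, grid, bandwidth=2.0)

# A density should integrate to roughly 1 over a wide enough grid
area = np.sum(density) * (grid[1] - grid[0])
peak = grid[np.argmax(density)]
print('area under the curve:', round(area, 3))
print('location of the highest density:', round(peak, 1))
```

A violin is this kind of curve drawn vertically and mirrored around the box plot, so the widest part of the violin marks the values where the data are most concentrated.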

In [None]:
sns.violinplot(x='Survival Status', y='Positive Axillary Nodes', data=haberman)
plt.show()

** Observations: **
<ol>
    <li><p>The IQR measures where the bulk of the values lie. Hence, the bulk of the <strong>patients who survived</strong> had <strong>fewer than 3</strong> positive axillary nodes. Similarly, the bulk of the <strong>patients who died</strong> had <strong>more than 2</strong> positive axillary nodes. </p></li>
    <li><p>The presence of points outside the whiskers indicates the <strong>presence of outliers</strong>. The number of outliers in the Survived category (patients who survived 5 years or longer) is considerably higher than in the Died category (patients who died within 5 years). </p></li>
    <li><p>The Q1 and median of the Survived category are almost the same. The median of the Died category and Q3 of the Survived category are apparently on the same line. Hence there is <strong>overlapping</strong> that may result in at least <strong>15% to 20%</strong> of error. Thus it is difficult to set a threshold to differentiate patients' chances of survival.  </p></li>
    <li><p> The majority of patients who had an absence of positive axillary nodes survived breast cancer. Similarly, the majority of patients with a larger number of positive axillary nodes died.</p></li> 
    <li><p><strong>There is an exception to every rule</strong>, and it applies here too: a few patients with a large number of positive axillary nodes survived, and a few patients with no positive axillary nodes died.  </p></li> 
</ol> 
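
The quartile-based observations above can be checked numerically rather than read off the plot. The sketch below shows one way to do this with a pandas `groupby`; the `toy` DataFrame is a hypothetical stand-in with made-up rows that only borrows the column names of the haberman DataFrame, so its numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-in for the haberman DataFrame (made-up rows)
toy = pd.DataFrame({
    'Survival Status': ['Survived'] * 6 + ['Died'] * 6,
    'Positive Axillary Nodes': [0, 0, 1, 2, 3, 14, 0, 2, 4, 9, 11, 23],
})

# Q1, median and Q3 of the node count for each survival class
quartiles = (toy.groupby('Survival Status')['Positive Axillary Nodes']
                .quantile([0.25, 0.5, 0.75])
                .unstack())
quartiles.columns = ['Q1', 'Median', 'Q3']
quartiles['IQR'] = quartiles['Q3'] - quartiles['Q1']
print(quartiles)
```

Replacing `toy` with the haberman DataFrame should give the box edges and medians drawn in the box plot, which makes it easier to judge how much the two classes overlap.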

## Contour Plot

A contour plot is a multivariate analysis technique. It is not a normalization technique; rather, it is a graphical technique for representing a three-dimensional surface by plotting ** constant z slices, called contours **, in a two-dimensional format. Seaborn is used to draw the contour plot.
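
The idea of constant z slices can be sketched directly with Matplotlib before applying it to the data. In the example below the surface `Z` is an arbitrary smooth bump chosen for illustration, not a density estimated from the Haberman data, and the four contour levels are likewise assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative surface z = f(x, y): a single smooth bump
x = np.linspace(-3, 3, 61)
y = np.linspace(-3, 3, 61)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2) / 2)

# Each contour line joins the points where Z equals one fixed level
levels = [0.2, 0.4, 0.6, 0.8]
fig, ax = plt.subplots()
contours = ax.contour(X, Y, Z, levels=levels)
ax.clabel(contours, inline=True, fontsize=8)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
```

In a `kind='kde'` joint plot, the role of `Z` is played by the estimated joint density of the two variables, so the innermost contours enclose the region where the observations are most concentrated.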

In [None]:
sns.jointplot(x='Age of Patient', y='Positive Axillary Nodes', data=haberman, kind='kde')
plt.show()

** Observations: **
<ol>
    <li><p>Among the patients with two or fewer positive axillary nodes, the densest concentration falls in the age group <strong>50-56</strong>. </p></li>
</ol> 

** Ascertaining Observation: ** <br>
We can ascertain the observation by carrying out the following operations on the haberman DataFrame:

In [None]:
# Patients with two or fewer positive axillary nodes
df_axnodes_zero = haberman.loc[haberman['Positive Axillary Nodes']<=2]
print('No. of patients with positive axillary nodes<=2: ', len(df_axnodes_zero))
# Of those, the patients aged 50 to 56 (both ends inclusive)
df_axnodes_zero_50 = df_axnodes_zero.loc[(df_axnodes_zero['Age of Patient']>=50) & (df_axnodes_zero['Age of Patient']<=56)]
print('No. of patients in the age group 50-56 who have positive axillary nodes<=2: ', len(df_axnodes_zero_50))

So 20.30% of the patients with two or fewer positive axillary nodes fall in the age group 50-56.
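
The percentage is the second count divided by the first. A minimal sketch of that arithmetic follows, using a hypothetical `toy` DataFrame with made-up rows in place of the haberman DataFrame, so the resulting figure is illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the haberman DataFrame (made-up rows)
toy = pd.DataFrame({
    'Age of Patient': [34, 45, 50, 52, 56, 61, 70, 53],
    'Positive Axillary Nodes': [0, 1, 2, 0, 5, 1, 0, 2],
})

# Boolean masks for the two conditions
low_nodes = toy['Positive Axillary Nodes'] <= 2
in_age_band = toy['Age of Patient'].between(50, 56)  # inclusive on both ends

# Share of low-node patients who are also in the age band
share = (low_nodes & in_age_band).sum() / low_nodes.sum() * 100
print(f'{share:.2f}% of patients with <=2 nodes are aged 50-56')
```

Here 3 of the 7 low-node patients are in the age band, which gives 42.86% for the toy rows.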

Let's summarize the important observations we made during Exploratory Data Analysis.

** Conclusions: **
<ol>
    <li><p> The majority of patients in the <strong>age group 30-40 have survived breast cancer</strong> . </p></li>
    <li><p><strong>The majority of the patients</strong> who have undergone operations during the years <strong>1961 and 1968 have survived</strong> 5 years or longer after the operation. </p></li>
    <li><p>The presence of <strong>positive axillary nodes (lymph node involvement)</strong> can be an obvious manifestation of breast cancer. In general, the survival chances of a breast cancer patient decrease as the number of positive axillary nodes increases.</p></li>
    <li><p><strong>Patients with zero positive axillary nodes have much higher chances of survival</strong> than patients who have positive axillary nodes. </p></li>
    <li><p>A few patients with a <strong>large number of positive axillary nodes have survived</strong>, and a few patients with <strong>no positive axillary nodes have died</strong>. So the absence of positive axillary nodes does not guarantee survival. </p></li>
    <li><p>Only a <strong>small number of patients have more than 25 positive axillary nodes</strong>. </p></li>
    <li><p><strong>So based on Exploratory Data Analysis, we can propose a hypothesis about the survival chances of a breast cancer patient.</strong> </p></li>
</ol>    

** References: **

<ul>
    <li>https://www.breastcancer.org/symptoms/diagnosis/lymph_nodes</li>
    <li>https://www.kaggle.com/gilsousa/habermans-survival-data-set </li>
    <li>https://en.wikipedia.org/wiki/Exploratory_data_analysis  </li>
    <li>https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html  </li>
    <li>https://matplotlib.org/tutorials/intermediate/legend_guide.html  </li>
    <li>https://seaborn.pydata.org/generated/seaborn.FacetGrid.html  </li>
    <li>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html  </li>
    <li>https://en.wikipedia.org/wiki/Cumulative_distribution_function </li>
    <li>https://datavizcatalogue.com/methods/box_plot.html </li>
    <li>https://datavizcatalogue.com/methods/violin_plot.html </li>
</ul>    
  