## Exploratory Data Analysis: Haberman's Survival Data ##

**Relevant Information**: *The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.*

**Number of Instances:** *306*

**Number of Attributes:** *4 (including the class attribute)*

**Attribute Information:**

*Age of patient at the time of operation (numerical)<br>
Patient's year of operation (year - 1900, numerical)<br>
Number of positive axillary nodes detected (numerical)<br>
Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years*

***The attribute axillary nodes indicates the count of nearby lymph nodes (the lymph nodes under the arm) or the lymph nodes near the breast bone to which the cancer has spread. It is an important clinical characteristic for determining the stage of cancer. Higher numbers mean the cancer is more advanced. The remaining attributes (age, year, survival) are self-explanatory.***

### Objective: ###
 1. *To find whether there is any correlation among the provided features (age, year of operation, number of infected axillary nodes) and the survival status of a patient.*

In [None]:
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np
import os

#Load haberman.csv into a pandas DataFrame (the file has no header row, so column names are supplied)
columns = ['age','year','infected_axillary_nodes','survival_status']
haberman_df = pd.read_csv('../input/haberman.csv',names=columns)

#Verify the data that has been loaded
haberman_df.head(5)

## Data Preparation ##

In [None]:
#print the unique values of the target column

print(haberman_df['survival_status'].unique())

In [None]:
# The numeric values of the 'survival_status' column are not very meaningful.
# To improve the readability of the subsequent graphical analysis and
# observations, map the two possible labels 1 and 2 to
# 'Survived' and 'Not Survived' respectively.
# Assigning the result back avoids relying on an in-place replace on a column view.

haberman_df['survival_status'] = haberman_df['survival_status'].map({1: 'Survived', 2: 'Not Survived'})
haberman_df.head(8)

In [None]:
#(Q) How many data points are present for each class or label?

haberman_df['survival_status'].value_counts()

### Observations : ###
1. The output above shows that the dataset is imbalanced: the two labels do not have the same number of data points.
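
The degree of imbalance can be expressed as class proportions rather than raw counts. The following is a minimal sketch on a small hypothetical label series (the values are made up for illustration and are not the Haberman counts); on the real data the same call would be applied to `haberman_df['survival_status']`.

```python
import pandas as pd

# Hypothetical labels for illustration only; not the real dataset.
labels = pd.Series(['Survived'] * 7 + ['Not Survived'] * 3, name='survival_status')

# normalize=True returns the relative frequency of each class instead of counts.
proportions = labels.value_counts(normalize=True)
print(proportions)

# Ratio of the majority class to the minority class as a simple imbalance measure.
imbalance_ratio = proportions.max() / proportions.min()
print("Imbalance ratio: {:.2f}".format(imbalance_ratio))
```

A ratio close to 1 would indicate balanced classes; larger values indicate stronger imbalance.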

## Beginning With Analysis ##

### 1. Basic Statistical Parameters ###

In [None]:
# Drawing a general idea about the data from the basic statistical parameters
haberman_df.describe()

### Observations : ### 
    1. With a mean age of about 52 and a std of about 10, most patients in this study 
       fall roughly in the age group 42 - 62
    2. 75% of the patients have at most 4 infected axillary nodes, which is far 
       below the maximum value of 52
    3. The maximum value of the column 'infected_axillary_nodes' is an outlier which 
       inflates the mean and the std of that column.
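
The effect of a single extreme value on the mean can be checked by comparing it with the median, which is robust to outliers. This is a minimal sketch on hypothetical node counts (made-up values, chosen only so that one value is far larger than the rest).

```python
import numpy as np

# Hypothetical node counts: mostly small values plus one extreme value.
nodes = np.array([0, 0, 1, 1, 2, 3, 4, 52])

mean_with_outlier = nodes.mean()
median_with_outlier = np.median(nodes)

# Drop the single largest value and recompute the mean.
trimmed = nodes[nodes < nodes.max()]
mean_without_outlier = trimmed.mean()

print("Mean with outlier:   ", mean_with_outlier)
print("Median with outlier: ", median_with_outlier)
print("Mean without outlier:", mean_without_outlier)
```

When the mean sits far above the median, as in this toy example, the median is usually the more representative summary of a typical value.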
       
       
### 2. Bivariate Analysis: ###

### 2.1 Pair-Plot ###

In [None]:
# pairwise scatter plot: Pair-Plot
# Only possible to view 2D patterns
# NOTE: the diagonal elements are PDFs for each feature.

sb.set_style("whitegrid")
sb.pairplot(haberman_df, hue="survival_status", height=4)
plt.show()

### Observations: ###

    1. No obvious correlation is visible among the plotted features in the above graphs.
    2. The data points of the two classes overlap considerably in all of the graphs.
    3. From the graph "infected_axillary_nodes vs year" it appears that the lowest percentage
       of survival was in the year 1961 and the highest percentage of survival was in the
       year 1965 (conclusion drawn from the ratio of Survived cases to Not Survived ones)
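
Reading a ratio off a scatter plot is imprecise because overlapping points hide the true counts. A grouped mean gives the survival fraction per year directly. The sketch below uses a small hypothetical frame (`toy_df` and its values are invented for illustration); the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical records for illustration only; not the real dataset.
toy_df = pd.DataFrame({
    'year': [61, 61, 61, 65, 65, 65, 65],
    'survival_status': ['Survived', 'Not Survived', 'Not Survived',
                        'Survived', 'Survived', 'Survived', 'Not Survived'],
})

# Mark survivors as True, then average per operation year:
# the mean of a boolean column is the fraction of True values.
survival_rate = (
    toy_df['survival_status'].eq('Survived')
    .groupby(toy_df['year'])
    .mean()
)
print(survival_rate)
```

Years with only a few patients produce unstable rates, so the group sizes are worth checking alongside the fractions.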
       
### 3. Univariate Analysis: ###

        Univariate analysis is the simplest form of analysing data with only one variable or 
        feature. The analysis is more meaningful if the frequency distribution of the feature is
        taken into consideration.

### 3.1 Histogram, PDF ###

        1. HISTOGRAM is a graphical representation used in univariate analysis where ranges of
           a variable are plotted against their number of occurrences (frequencies). Rectangles 
           are used to depict the frequency of data items in each range.
           
        2. PDF (Probability Density Function) is a smoothed-out version of the histogram. The y 
           value at a point x is the probability density there; the probability that the variable 
           falls in an interval is the area under the curve over that interval.
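
The link between a histogram and a density can be sketched with `np.histogram`. The data here are hypothetical ages drawn from a normal distribution (an assumption made only for illustration). With `density=True` the bar heights are scaled so that the total bar area is 1, which is what allows a histogram to be compared with a PDF.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical ages, for illustration only; not the real dataset.
ages = rng.normal(loc=52, scale=10, size=300)

# density=True scales the bar heights so that the total bar area is 1.
density, bin_edges = np.histogram(ages, bins=10, density=True)
bin_widths = np.diff(bin_edges)

# Probability mass per bin = density height * bin width.
bin_probabilities = density * bin_widths
print("Per-bin probabilities:", bin_probabilities)
print("Total area:", bin_probabilities.sum())
```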

In [None]:
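#Histogram and PDF for the features age, year and infected axillary nodes

for feature in haberman_df.columns[0:3]:
    # One figure per feature with both classes overlaid;
    # kde=True adds the smoothed density curve on top of the histogram
    sb.FacetGrid(haberman_df, hue="survival_status", height=5) \
      .map(sb.histplot, feature, kde=True, stat="density") \
      .add_legend()
plt.show()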

### Observations: ###

    1. From graph 3 we can conclude that patients with fewer infected_axillary_nodes (0-4) 
       have higher chances of survival
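
A claim of this kind can be checked numerically by binning the node counts and comparing the survival fraction per bin. The sketch below uses a hypothetical frame (`toy_df` and its values are invented for illustration), and the cut point at 4 nodes is taken from the observation above.

```python
import pandas as pd

# Hypothetical records for illustration only; not the real dataset.
toy_df = pd.DataFrame({
    'infected_axillary_nodes': [0, 1, 2, 4, 0, 3, 6, 10, 15, 25],
    'survival_status': ['Survived', 'Survived', 'Survived', 'Not Survived', 'Survived',
                        'Survived', 'Not Survived', 'Survived', 'Not Survived', 'Not Survived'],
})

# Bin the node counts into "0-4" and "5 or more".
node_group = pd.cut(toy_df['infected_axillary_nodes'],
                    bins=[-1, 4, float('inf')],
                    labels=['0-4', '5+'])

# Fraction of survivors within each bin.
survival_by_group = (
    toy_df['survival_status'].eq('Survived')
    .groupby(node_group, observed=True)
    .mean()
)
print(survival_by_group)
```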
    

### 3.2 CDF ###

       The Cumulative Distribution Function (CDF) is the probability that the variable takes a 
       value less than or equal to x.
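
For a finite sample, this definition translates into the fraction of observations that are less than or equal to x. A minimal sketch, assuming hypothetical node counts and a helper named `empirical_cdf` introduced here only for illustration:

```python
import numpy as np

# Hypothetical node counts for illustration only.
values = np.array([0, 0, 1, 2, 2, 3, 5, 8, 13, 21])

def empirical_cdf(data, x):
    """Fraction of observations in `data` that are <= x."""
    return np.mean(np.asarray(data) <= x)

cdf_at_2 = empirical_cdf(values, 2)
cdf_at_5 = empirical_cdf(values, 5)
print("P(X <= 2) =", cdf_at_2)
print("P(X <= 5) =", cdf_at_5)
```

The CDF is non-decreasing, starts at 0 below the smallest observation and reaches 1 at the largest one.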

In [None]:
# Getting separate data for "Survived" and "Not Survived" for subsequent processing
# of the data

haberman_survived = haberman_df.loc[haberman_df["survival_status"] == "Survived"]
haberman_notSurvived = haberman_df.loc[haberman_df["survival_status"] == "Not Survived"]

#Analysing data for patients who survived

plt.figure(figsize=(20,5))
for idx, feature in enumerate(list(haberman_df.columns)[0:3]):
    plt.subplot(1, 3, idx+1)
    print("------------------- Survived -------------------------")
    print("--------------------- "+ feature + " ----------------------------")
    counts, bin_edges = np.histogram(haberman_survived[feature], bins=10, density=True)
    print("Bin Edges: {}".format(bin_edges))

    # Compute the PDF as per-bin probabilities; dividing by the sum works
    # because all bins have equal width
    pdf = counts / counts.sum()
    print("PDF:  {}".format(pdf))

    # Compute the CDF as the running total of the per-bin probabilities
    cdf = np.cumsum(pdf)
    print("CDF: {}".format(cdf))

    # Plot the computed values against the right edge of each bin
    plt.plot(bin_edges[1:], pdf, label="PDF")
    plt.plot(bin_edges[1:], cdf, label="CDF")
    plt.xlabel(feature)
    plt.legend()
    plt.grid()
plt.show()

In [None]:
#Analysing data for patients who did not survive

plt.figure(figsize=(20,5))
for idx, feature in enumerate(list(haberman_df.columns)[0:3]):
    plt.subplot(1, 3, idx+1)
    print("------------------- Not Survived -------------------------")
    print("--------------------- "+ feature + " ----------------------------")
    counts, bin_edges = np.histogram(haberman_notSurvived[feature], bins=10,
                                     density=True)
    print("Bin Edges: {}".format(bin_edges))

    # Compute the PDF as per-bin probabilities; dividing by the sum works
    # because all bins have equal width
    pdf = counts / counts.sum()
    print("PDF:  {}".format(pdf))

    # Compute the CDF as the running total of the per-bin probabilities
    cdf = np.cumsum(pdf)
    print("CDF: {}".format(cdf))

    # Plot the computed values against the right edge of each bin
    plt.plot(bin_edges[1:], pdf, label="PDF")
    plt.plot(bin_edges[1:], cdf, label="CDF")
    plt.xlabel(feature)
    plt.legend()
    plt.grid()
plt.show()

### 3.3 Box Plot ###

       A box plot is the visual representation of the statistical five number summary 
       of a given data set.
       
       A Five Number summary includes:
           1. Minimum Value - Vertical line extending from the extreme bottom
           2. First Quartile - Bottom of the rectangle
           3. Median (Second Quartile) - Horizontal line near the middle of the rectangle
           4. Third Quartile - Top of the rectangle
           5. Maximum - Vertical line extending from the extreme top
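
The five numbers can be computed directly with `np.percentile`. The sketch below uses hypothetical values chosen so the quartiles are easy to verify by hand. It also computes the 1.5 x IQR fences, because seaborn's default box plot draws the whiskers only to the most extreme points inside these fences and shows anything beyond them as individual outlier points, so the whisker ends are not always the true minimum and maximum.

```python
import numpy as np

# Hypothetical values for illustration only.
values = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])

# Five number summary: minimum, Q1, median, Q3, maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
print("Five number summary:", minimum, q1, median, q3, maximum)

# Inter-quartile range and the fences used to flag outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print("IQR:", iqr, "fences:", lower_fence, upper_fence, "outliers:", outliers)
```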

In [None]:
#Box plot
plt.figure(figsize=(20,5))
for idx, feature in enumerate(list(haberman_df.columns)[0:3]):
    plt.subplot(1, 3, idx+1)
    sb.boxplot(x='survival_status', y=feature, data=haberman_df)
plt.show()

### Observation: ###
    1. At least 50% of the survivors have 0 positive nodes, and 75% of the survivors have 
       fewer than 3 positive nodes
        
### 3.4 Violin Plot ###

        A Violin Plot is a combination of a Box Plot and a Density Plot that is rotated and 
        placed on each side, to show the distribution shape of the data
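
The density part of a violin plot is a kernel density estimate (KDE). The idea can be sketched in plain NumPy: place a small Gaussian bump on every data point and average the bumps. The helper `gaussian_kde_1d`, the sample values and the bandwidth of 1.5 are all assumptions made for illustration; seaborn chooses its bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(data, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each data point."""
    data = np.asarray(data, dtype=float)
    # Standardised distance from every grid point to every data point.
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical node counts for illustration only.
samples = np.array([0, 0, 1, 2, 2, 3, 5, 8], dtype=float)
grid = np.linspace(-20, 30, 2001)
density = gaussian_kde_1d(samples, grid, bandwidth=1.5)

# The estimated density should integrate to approximately 1.
area = np.sum(density) * (grid[1] - grid[0])
print("Area under the estimated density:", area)
```

Because the bumps are symmetric, the estimated density is non-zero below 0 even though a node count cannot be negative. The same smoothing can make a violin appear to extend past the observed range of the data.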
       

In [None]:
#Violin Plot
plt.figure(figsize=(20,5))
for idx, feature in enumerate(list(haberman_df.columns)[0:3]):
    plt.subplot(1, 3, idx+1)
    sb.violinplot(x='survival_status', y=feature, data=haberman_df)
plt.show()

### Observations: ###

    1. The number of positive lymph nodes among the survivors is densely concentrated between 0 and 5.
    2. Almost 80% of the patients have 5 or fewer positive lymph nodes.
    
    
### Conclusion: ###    

    1. With a mean age of about 52 and a std of about 10, most patients in this study 
       fall roughly in the age group 42 - 62
    2. 75% of the patients have at most 4 infected axillary nodes, which is far 
       below the maximum value of 52
    3. The maximum value of the column 'infected_axillary_nodes' is an outlier which 
       inflates the mean and the std of that column.
    4. Patients with fewer infected_axillary_nodes (0-4) have higher chances of survival
      
 **Many more observations could be listed from the graphical depictions above,
 but very few are conclusive for determining 5-year survival status.**