# Assignment No : 01 - EDA

**By Aziz Presswala**

## Haberman's Cancer Survival Dataset

* Records the survival of patients who underwent surgery for breast cancer.
* Contains cases from 1958 to 1970.
* The study was conducted at the University of Chicago's Billings Hospital.
* Features:-
    * Age of the patient
    * Year of operation
    * No. of axillary nodes
    * Survival status = 1 if the patient survived for 5 yrs or longer, else 2
* The no. of axillary nodes indicates how far the cancer has spread into the lymph nodes.

### Objective:- 
To determine whether a patient survived for 5 yrs or longer after the operation, or died within 5 yrs of it.

In [None]:
# Importing the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
hsd = pd.read_csv('../input/haberman.csv') # hsd stands for Haberman Survival Dataset


# 1. Statistics

In [None]:
# Info about the dataset
hsd.info()

In [None]:
# No. of datapoints and features
hsd.shape

In [None]:
# columns in the dataset
hsd.columns

In [None]:
# Datapoints for survival status
hsd['survival_status'].value_counts()

In [None]:
hsd.describe()

### Observation:-
* The dataset is imbalanced: there are many more patients with survival status 1 than with status 2.
* Mean age of the patients is around 52 yrs.
* Nearly 75% of the patients have 4 or fewer axillary nodes.
* About half of the patients are 52 yrs old or younger, since the median age is close to the mean. Note that `describe()` by itself does not give the probability of survival for an age group.
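
The imbalance noted above can be expressed as class proportions rather than raw counts. The following is a minimal sketch on a small made-up frame (`toy` is hypothetical and does not contain the Haberman values); it only assumes the same `survival_status` column name, so the same calls should apply to `hsd`.

```python
import pandas as pd

# Hypothetical toy data, NOT the Haberman values:
# 1 = survived 5 yrs or longer, 2 = died within 5 yrs
toy = pd.DataFrame({"survival_status": [1, 1, 1, 2, 1, 1, 2, 1]})

# normalize=True turns the raw counts into fractions that sum to 1
class_share = toy["survival_status"].value_counts(normalize=True)
print(class_share)

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced
imbalance_ratio = class_share.max() / class_share.min()
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio well above 1 is a hint that plain accuracy would be a misleading metric for any classifier built on this data later.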


# 2. Univariate Analysis

#### 2.1 Distribution Plots

In [None]:
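# Newer seaborn versions use `height` instead of `size`; histplot with kde=True stands in for the deprecated distplot
sns.FacetGrid(hsd, hue="survival_status", height=5) \
   .map(sns.histplot, "Age", kde=True, stat="density") \
   .add_legend()
plt.show()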

In [None]:
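# Newer seaborn versions use `height` instead of `size`; histplot with kde=True stands in for the deprecated distplot
sns.FacetGrid(hsd, hue="survival_status", height=5) \
   .map(sns.histplot, "year", kde=True, stat="density") \
   .add_legend()
plt.show()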

In [None]:
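# Newer seaborn versions use `height` instead of `size`; histplot with kde=True stands in for the deprecated distplot
sns.FacetGrid(hsd, hue="survival_status", height=5) \
   .map(sns.histplot, "positive_axillary_nodes", kde=True, stat="density") \
   .add_legend()
plt.show()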

#### 2.2 PDF and CDF

In [None]:
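# PDF & CDF for Age (patients with survival status 1)
status_1 = hsd.loc[hsd["survival_status"] == 1]  # survived 5 yrs or longer
status_2 = hsd.loc[hsd["survival_status"] == 2]  # died within 5 yrs

counts, bin_edges = np.histogram(status_1['Age'], bins=10,
                                 density=True)
pdf = counts / counts.sum()  # fraction of patients in each bin
print(pdf)
print(bin_edges)

cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('Age')
plt.legend()
plt.show()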

In [None]:
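# PDF & CDF for Operation Year (patients with survival status 1)
counts, bin_edges = np.histogram(status_1['year'], bins=10,
                                 density=True)
pdf = counts / counts.sum()  # fraction of patients in each bin
print(pdf)
print(bin_edges)

cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('Operation year')
plt.legend()
plt.show()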

In [None]:
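# PDF & CDF for Axillary Nodes (patients with survival status 1)
counts, bin_edges = np.histogram(status_1['positive_axillary_nodes'], bins=10,
                                 density=True)
pdf = counts / counts.sum()  # fraction of patients in each bin
print(pdf)
print(bin_edges)

cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.xlabel('Positive axillary nodes')
plt.legend()
plt.show()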
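
The three cells above repeat the same histogram steps, so they could be folded into one helper. The sketch below is one possible way to do that; `pdf_cdf` and `sample_ages` are hypothetical names introduced here, and the ages are made up for illustration. With equal-width bins, dividing the raw counts by their sum gives the same per-bin fractions as using `density=True` and renormalizing.

```python
import numpy as np

def pdf_cdf(values, bins=10):
    """Return (bin_edges, pdf, cdf), where pdf holds the fraction of values per bin."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations in each bin
    cdf = np.cumsum(pdf)          # running total of the fractions
    return bin_edges, pdf, cdf

# Made-up ages for illustration only, NOT taken from the dataset
sample_ages = np.array([30, 34, 38, 41, 45, 47, 50, 52, 57, 63])
edges, pdf, cdf = pdf_cdf(sample_ages, bins=5)
print(pdf)
print(cdf)
```

Reading the CDF at a given bin edge gives the fraction of patients at or below that value, which is how a statement such as "75% of patients have 4 or fewer nodes" can be checked.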

#### 2.3 Box Plots

In [None]:
# Box Plots for Age
sns.boxplot(x='survival_status',y='Age', data=hsd)
plt.show()

In [None]:
# Box Plots for Axil_Nodes
sns.boxplot(x='survival_status',y='positive_axillary_nodes', data=hsd)
plt.show()

In [None]:
# Box Plots for Operation Year
sns.boxplot(x='survival_status',y='year', data=hsd)
plt.show()

#### 2.4 Violin Plots

In [None]:
#Violin Plot for axillary nodes
sns.violinplot(x='survival_status', y='positive_axillary_nodes', data=hsd)  # `size` is not a violinplot option, so it is dropped
plt.show()

In [None]:
#Violin Plot for Age
sns.violinplot(x='survival_status', y='Age', data=hsd)  # `size` is not a violinplot option, so it is dropped
plt.show()

In [None]:
#Violin Plot for Operation year
sns.violinplot(x='survival_status', y='year', data=hsd)  # `size` is not a violinplot option, so it is dropped
plt.show()

### Observation
* From the box plots (2.3), patients operated on after 1965 appear to have a higher chance of survival.
* From the distribution plots (2.1), most patients have between 0 and 5 axillary nodes.


# 3. Bivariate Analysis

#### Pair-Plots

In [None]:
plt.close()
sns.set_style("whitegrid")
# Newer seaborn versions use `height` instead of `size` for pairplot
sns.pairplot(hsd, hue="survival_status", height=4)
plt.show()

### Observation
From the above plots we observe that:-
* In the Age vs axillary nodes plot, the chance of survival is greater when the no. of axillary nodes is less than 5.
* The survival status cannot be cleanly separated in any of the plots, as the two classes overlap considerably.
* Operation year and axillary nodes can be used to classify the survival status only to a limited extent.
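
The first observation above can be checked numerically by comparing survival rates on either side of the 5-node cutoff. This is a rough sketch on made-up rows (`toy` and `node_group` are hypothetical, and the values are not from the dataset); it assumes the same column names as `hsd`.

```python
import pandas as pd

# Hypothetical toy rows, NOT the Haberman values
toy = pd.DataFrame({
    "positive_axillary_nodes": [0, 1, 2, 3, 4, 6, 9, 12, 20, 0],
    "survival_status":         [1, 1, 1, 2, 1, 2, 2, 1, 2, 1],
})

# Split patients at the cutoff mentioned in the observation (fewer than 5 nodes)
toy["node_group"] = toy["positive_axillary_nodes"].lt(5).map({True: "<5", False: ">=5"})

# Mean of a boolean column = share of rows where it is True
toy["survived"] = toy["survival_status"] == 1
survival_rate = toy.groupby("node_group")["survived"].mean()
print(survival_rate)
```

If the two rates are close on the real data, the cutoff separates the classes poorly, which would be consistent with the overlap seen in the pair plots.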