# Haberman's Survival Dataset Analysis

A detailed description of each column is available [here](https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival). A short description of each column follows:
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
    *   1 = the patient survived 5 years or longer
    *   2 = the patient died within 5 years

### About the dataset

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.



### First look into the dataset

In [None]:
# Package imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
df = pd.read_csv("../input/habermans-survival-data-set/haberman.csv")
df.head()

**The load above is wrong: the file has no header row, so pandas treated the first data row as column names. Let's fix this.**

In [None]:
# header=None tells pandas not to treat the first row as column names;
# names= supplies the column names explicitly.
df = pd.read_csv(
    "../input/habermans-survival-data-set/haberman.csv",
    header=None,
    names=['Age', 'YearOfOperation', '#AxillaryNodes', 'SurvivalStatus'],
)
df.head()
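
After reloading, a few quick checks can confirm that no row was swallowed as a header. The snippet below is only a sketch: `sample_csv` is a hypothetical in-memory stand-in with made-up rows in the same four-column layout, not the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for haberman.csv: made-up, header-less rows
# in the same four-column layout (illustration only).
sample_csv = io.StringIO("30,64,1,1\n34,59,0,2\n52,66,4,1\n61,62,0,1\n")

columns = ["Age", "YearOfOperation", "#AxillaryNodes", "SurvivalStatus"]
sample_df = pd.read_csv(sample_csv, header=None, names=columns)

# Four rows in the file should give four rows in the frame.
print(sample_df.shape)
# Count missing values per column.
print(sample_df.isnull().sum())
# The class column should contain only the codes 1 and 2.
print(sorted(sample_df["SurvivalStatus"].unique()))
```

The same three checks can be run on `df` itself once it has been loaded with `header=None`.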

**Now the data looks good. Let's do some basic analysis on each column.**

### Age

In [None]:
df.Age.describe()

<span>&#9888;</span>**From above we can conclude the following**
<div style="color:blue;">
1. Oldest patient is 83 years old <br>
2. Youngest patient is 30 years old<br>
3. Average patient age is 52 years.<br>
4. Median age is 52 years.<br>
    </div>

### YearOfOperation

In [None]:
df.YearOfOperation.describe()

The summary above does not tell us much. Let's check a histogram.

In [None]:
df.YearOfOperation.hist()

<span>&#9888;</span>**From above we can conclude the following**
<div style="color:blue;">
1. Data was recorded between 1958 and 1969<br>
2. Most of the data is between 1958 and 1960
    </div>
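
Because the column stores `year - 1900`, it can help to convert it to full calendar years before reading the histogram. A minimal sketch, using a made-up `years` series rather than the real column:

```python
import pandas as pd

# Made-up values in the dataset's "year - 1900" encoding (illustration only).
years = pd.Series([58, 60, 64, 69, 58], name="YearOfOperation")

# Add 1900 to recover the calendar year.
full_years = years + 1900
print(full_years.tolist())  # [1958, 1960, 1964, 1969, 1958]

# Operations per calendar year, in chronological order.
per_year = full_years.value_counts().sort_index()
print(per_year)
```

On the real data, `df.YearOfOperation + 1900` would give the same conversion.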

### Number of Axillary Nodes

In [None]:
df['#AxillaryNodes'].describe()

In [None]:
df['#AxillaryNodes'].hist()

<span>&#9888;</span>**From above we can conclude the following**
<div style="color:blue;">
1. The highest number of positive axillary nodes in a patient is 52<br>
2. The lowest number of positive axillary nodes in a patient is 0<br>
3. The mean number of positive axillary nodes is about 4<br>
4. The median number of positive axillary nodes is 1<br>
5. Most patients have between 0 and 5 positive axillary nodes
</div>
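
A mean well above the median points to a right-skewed distribution, and the "most patients are in the 0-5 range" reading can be checked numerically rather than by eye. The sketch below uses a made-up `nodes` series, so the printed values describe the sample only, not the dataset.

```python
import pandas as pd

# Made-up node counts with a long right tail (illustration only).
nodes = pd.Series([0, 0, 1, 2, 4, 7, 23, 0, 3, 52], name="#AxillaryNodes")

# Fraction of patients with at most 5 positive nodes.
share_low = (nodes <= 5).mean()
print(f"Share with 0-5 nodes: {share_low:.0%}")

# For a right-skewed column the mean sits above the median.
print("mean:", nodes.mean(), "median:", nodes.median())
```

Replacing `nodes` with `df['#AxillaryNodes']` applies the same check to the real column.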

### SurvivalStatus

**Since SurvivalStatus is a binary variable, let's check the value counts of each category. Then, let's plot a pie chart to visualize them.**

In [None]:
df.SurvivalStatus.value_counts()

In [None]:
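counts = df.SurvivalStatus.value_counts()
# Derive legend labels from the class codes (1 = survived, 2 = died within 5 years)
# so the legend stays correct whatever order value_counts returns.
labels = counts.index.map({1: "Survived", 2: "Not Survived"})
patches, texts = plt.pie(counts)
plt.legend(patches, labels, loc="best")
plt.tight_layout()
plt.show()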


<span>&#9888;</span>**From above we can conclude the following**
<div style="color:blue;">
1. It's an imbalanced dataset, as one class has more values than the other <br>
2. Most patients survived for longer than 5 years.

</div>
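
The degree of imbalance is easier to judge as proportions than as raw counts. A minimal sketch with a made-up `status` series; the `labels` mapping assumes the coding from the UCI description (1 = survived 5 years or longer, 2 = died within 5 years).

```python
import pandas as pd

# Made-up class codes (illustration only).
status = pd.Series([1, 1, 1, 2, 1, 2, 1, 1], name="SurvivalStatus")

# Assumed coding, taken from the dataset description.
labels = {1: "Survived", 2: "Not Survived"}

# normalize=True returns fractions instead of counts.
proportions = status.map(labels).value_counts(normalize=True)
print(proportions)
```

On the real data, `df.SurvivalStatus.map(labels).value_counts(normalize=True)` would report the share of each class.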

## NOTE: The above analysis is not a complete univariate analysis.

Before going further, let me explain a few important terms.

**Dependent columns:** *A dependent column is one we want to predict from the independent columns. For example, if we want to predict a student's grade from the number of hours studied, the grade is the dependent column and the number of hours studied is the independent column. In our case there is only one dependent column: SurvivalStatus.*    
**Independent columns:** *These are the columns that, for the problem we are solving, don't depend on anything else. In practice, several columns may be correlated with each other; during exploratory data analysis we look for such correlated columns and retain only one of them. In our case, the independent columns are Age, YearOfOperation, and #AxillaryNodes.*

Usually, univariate analysis is done using one dependent column and one independent column: we try to see how the dependent column varies with a single independent column. My previous analysis did not include the dependent variable, so let's do a proper univariate analysis now.
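
In code, the dependent/independent split is simply a matter of separating one column from the rest. The sketch below uses a hypothetical `df_small` frame with made-up rows; the names `X`, `y`, and `target` are conventions chosen for this example, not part of the dataset.

```python
import pandas as pd

# Made-up rows in the notebook's column layout (illustration only).
df_small = pd.DataFrame({
    "Age": [30, 34, 52, 61],
    "YearOfOperation": [64, 59, 66, 62],
    "#AxillaryNodes": [1, 0, 4, 0],
    "SurvivalStatus": [1, 2, 1, 1],
})

target = "SurvivalStatus"

# Independent columns: everything except the target.
X = df_small.drop(columns=[target])
# Dependent column: the value we would like to predict.
y = df_small[target]

print(list(X.columns))
print(y.name)
```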

## Uni-variate Analysis

A typical univariate analysis explores descriptive and inferential methods. According to Wikipedia,
> Univariate analysis is perhaps the simplest form of statistical analysis. Like other forms of statistics, it can be inferential or descriptive. The key fact is that only one variable is involved.

### Age vs SurvivalStatus


In [None]:
# Let's plot a histogram and check the overlap between the two classes of SurvivalStatus.
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement.
sns.FacetGrid(df, hue="SurvivalStatus", height=7).map(sns.histplot, "Age", kde=True).add_legend()
plt.ylabel("Frequency")
_ = plt.title("Univariate Analysis - Age")

Note: Please read [this](https://blog.bioturing.com/2018/05/16/5-reasons-you-should-use-a-violin-graph/) article comparing violin plots and box plots.

In [None]:
# Violin plot

_ = sns.violinplot(x='SurvivalStatus', y='Age', data=df)


From the plots above, we can see that both classes have a similar age distribution, but the distribution for patients who survived is a bit flatter than the one for patients who did not survive.
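
The visual comparison can be backed by per-class summary statistics. This is a sketch on a made-up `df_small` frame, so the numbers it prints say nothing about the real patients; only the `groupby` pattern carries over.

```python
import pandas as pd

# Made-up ages and class codes (illustration only).
df_small = pd.DataFrame({
    "Age": [30, 45, 52, 60, 41, 53, 66, 70],
    "SurvivalStatus": [1, 1, 1, 1, 2, 2, 2, 2],
})

# One row per class, one column per statistic.
summary = df_small.groupby("SurvivalStatus")["Age"].agg(["count", "mean", "median", "std"])
print(summary)
```

A larger `std` for one class would correspond to the flatter, wider curve seen in the plots.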

### Year vs SurvivalStatus


In [None]:
# Let's plot a histogram and check the overlap between the two classes of SurvivalStatus.
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement.
sns.FacetGrid(df, hue="SurvivalStatus", height=7).map(sns.histplot, "YearOfOperation", kde=True).add_legend()
plt.ylabel("Frequency")
_ = plt.title("Univariate Analysis - YearOfOperation")

In [None]:
# Violin plot

_ = sns.violinplot(x='SurvivalStatus', y='YearOfOperation', data=df)


### #AxillaryNodes vs SurvivalStatus

In [None]:
# Let's plot a histogram and check the overlap between the two classes of SurvivalStatus.
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement.
sns.FacetGrid(df, hue="SurvivalStatus", height=7).map(sns.histplot, "#AxillaryNodes", kde=True).add_legend()
plt.ylabel("Frequency")
_ = plt.title("Univariate Analysis - #AxillaryNodes")

In [None]:
# Violin plot

_ = sns.violinplot(x='SurvivalStatus', y='#AxillaryNodes', data=df)


## Multi-variate Analysis

Now, let's draw a pair plot to see how each column varies with the others.

In [None]:
sns.pairplot(df,hue='SurvivalStatus')

The pair plot above shows that the two classes are mixed, with no clear pattern that separates them.

## Correlation

In [None]:
# Let's plot correlations between columns
corr = df.corr()
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)


<span>&#9888;</span>**From the above correlation heat-map, we can conclude the following**
<div style="color:blue;">
1. #AxillaryNodes shows almost no correlation with Age or YearOfOperation.<br>
2. #AxillaryNodes and SurvivalStatus are slightly correlated.<br>
3. Age and SurvivalStatus are also slightly correlated, but less so than #AxillaryNodes and SurvivalStatus.
</div>
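
When reading the sign of these correlations, keep in mind that SurvivalStatus is coded 1 (survived) and 2 (died within 5 years), so a positive value means higher values of a column go with the "died" class. The sketch below recodes the class as a hypothetical 0/1 `died` indicator on made-up rows and shows that the Pearson correlation is the same under either coding.

```python
import pandas as pd

# Made-up rows (illustration only); they do not reflect the real dataset.
df_small = pd.DataFrame({
    "Age": [30, 45, 52, 60, 41, 53, 66, 70],
    "#AxillaryNodes": [0, 1, 0, 3, 10, 4, 0, 22],
    "SurvivalStatus": [1, 1, 1, 1, 2, 2, 2, 2],
})

features = df_small.drop(columns="SurvivalStatus")

# Hypothetical indicator: 1 if the patient died within 5 years, else 0.
died = (df_small["SurvivalStatus"] == 2).astype(int)

# Pearson correlation of every feature with the indicator.
corr_with_died = features.corrwith(died)
# The same correlation against the raw 1/2 codes, for comparison.
corr_with_codes = features.corrwith(df_small["SurvivalStatus"])

print(corr_with_died)
```

Pearson correlation is unchanged by shifting the codes from 1/2 to 0/1, so the recoding only makes the direction easier to read.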