## Haberman Exploratory Data Analysis

Haberman Dataset: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Attribute Information:

   * Age of patient at time of operation (numerical)
   * Patient's year of operation (year - 1900, numerical)
   * Number of positive axillary nodes detected (numerical)
   * Survival status (class attribute)  
        1 = the patient survived 5 years or longer  
        2 = the patient died within 5 years

Missing Attribute Values: None
    
Source: https://www.kaggle.com/gilsousa/habermans-survival-data-set/version/1

Objective: Given a patient's age, year of operation, and number of positive axillary nodes, identify whether the patient will survive 5 years or longer after surgery.

### Import Libraries

In [None]:
##import all required libraries
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

### Load Dataset

In [None]:
##load csv file
df = pd.read_csv("../input/habermans-survival-data-set/haberman.csv", header=None, names=['age', 'year', 'nodes', 'status'])

In [None]:
##check head of dataset
df.head()

In [None]:
##check shape of data
df.shape

In [None]:
##check summary statistics of the data
df.describe()

### Missing values check

In [None]:
##check for missing values
df.isna().sum()

Missing values are not present in the dataframe

### Data handling 

In [None]:
###check dtype of the df
df.info()

In [None]:
##check for unique values counts in dataset
df.nunique()

Convert dtype of status column to object

In [None]:
###convert the datatype of status column to object
df.status = df.status.astype('object')

In [None]:
df.info()
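
As an alternative to a plain object column, the status codes can be mapped to readable labels and stored as a categorical. The sketch below is not part of the original notebook: the `status_label` column name and the toy rows are hypothetical, and the label meanings are taken from the attribute description at the top.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataframe `df`
toy = pd.DataFrame({"age": [30, 45, 62], "status": [1, 2, 1]})

# Label meanings from the attribute description: 1 = survived, 2 = died
status_labels = {1: "survived", 2: "died"}

# Map the codes to labels and store them as a categorical column
toy["status_label"] = toy["status"].map(status_labels).astype("category")

print(toy["status_label"].tolist())
```

Readable labels tend to make plot legends easier to interpret than the raw codes 1 and 2.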

### Outlier check

In [None]:
##plot box plot of age to check if any outlier is present
plt.figure(figsize = (4,8))
sns.boxplot( y=df['age'])
plt.show()

#### Inference: Outliers are not present in age column

In [None]:
##plot box plot of year to check if any outlier is present
plt.figure(figsize = (4,8))
sns.boxplot( y=df['year'])
plt.show()

#### Inference : Outliers are not present in year column

In [None]:
##plot box plot of nodes to check if any outlier is present
plt.figure(figsize = (4,8))
sns.boxplot( y=df['nodes'])
plt.show()

#### Inference : Outliers are present in nodes column
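
The box plots flag points that lie more than 1.5 times the interquartile range beyond the quartiles (the default whisker rule in seaborn). A minimal sketch of counting such points numerically is shown below; the helper name `count_iqr_outliers` and the toy values are hypothetical, and in the notebook the input would be `df['nodes']`.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count points beyond 1.5 * IQR from the first and third quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical toy values; in the notebook this would be df['nodes']
toy_nodes = pd.Series([0, 0, 1, 1, 2, 3, 4, 30])
n_outliers = count_iqr_outliers(toy_nodes)
print(n_outliers)
```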

### Imbalance percentage

In [None]:
####plot to check imbalance percentage of target column (status)
sns.countplot(x='status', data=df)
plt.title("Distribution of Status Variable")
plt.show()

#### Inference : Survival status of patients is imbalanced, with status 1 (survived) as the majority class

In [None]:
##check the value counts 
df['status'].value_counts(normalize=True)

- About 74% of the patients survived 5 years or longer (status 1); the remaining 26% died within 5 years (status 2)

#### Divide the dataframe into two dataframes
- df_status_1: patients with status = 1
- df_status_2: patients with status = 2

In [None]:
##create new dataframe for status=1
df_status_1=df[df['status']==1]
df_status_1.head()

In [None]:
##create new dataframe for status=2
df_status_2=df[df['status']==2]
df_status_2.head()

### Correlation of Numerical Variables

In [None]:
###plot heatmap of the correlation between the numerical variables when status is 1
###(status is constant within this subset, so it is left out of the matrix)
plt.figure(figsize=(12,8))
sns.heatmap(df_status_1[['age', 'year', 'nodes']].corr(), annot=True, cmap="RdYlGn", center=0.4)
plt.title('Correlation for status variable 1')
plt.show()

#### Inferences:
- Positive correlation is present between year and age, and between nodes and year, for the patients who survived
- Negative correlation is present between nodes and age for the patients who survived

In [None]:
###plot heatmap of the correlation between the numerical variables when status is 2
###(status is constant within this subset, so it is left out of the matrix)
plt.figure(figsize=(12,8))
sns.heatmap(df_status_2[['age', 'year', 'nodes']].corr(), annot=True, cmap="RdYlGn", center=0.4)
plt.title('Correlation for status variable 2')
plt.show()

#### Inferences :
- All the variables are negatively correlated with each other for the patients who did not survive
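
The two correlation matrices can also be computed in one step with a group-by, which makes it easy to read off individual coefficients instead of judging them from heatmap colors. This is a sketch on hypothetical toy rows, not the notebook's data; in the notebook the same call would be made on `df`.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook this would be the full `df`
toy = pd.DataFrame({
    "age":    [30, 40, 50, 60, 35, 45, 55, 65],
    "year":   [58, 60, 62, 64, 59, 61, 63, 65],
    "nodes":  [0, 1, 0, 2, 5, 9, 3, 12],
    "status": [1, 1, 1, 1, 2, 2, 2, 2],
})

# One correlation matrix per status value, numeric columns only
corr_by_status = toy.groupby("status")[["age", "year", "nodes"]].corr()

# Look up a single coefficient: age vs. year among status == 1
age_year_corr = corr_by_status.loc[(1, "age"), "year"]
print(corr_by_status)
```

The result is indexed by (status, variable), so each block of three rows corresponds to one of the heatmaps.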

### Univariate Analysis

In [None]:
###box plots of the age, year and nodes columns by survival status
plt.figure(figsize=(20,10)) 

plt.subplot(1,3,1)
sns.boxplot(x='status',y='age', data=df)
plt.title("Plot for survival_status and Age")

plt.subplot(1,3,2)
sns.boxplot(x='status',y='year', data=df)
plt.title("Plot for survival_status and Year")

plt.subplot(1,3,3)
sns.boxplot(x='status',y='nodes', data=df)
plt.title("Plot for survival_status and Nodes")
plt.show()

#### Inferences :
- Most patients who survived were between 43 and 60 years old (the box of the box plot), compared with 45 to 61 for patients who died
- The quartiles of operation year are about the same for patients who died and patients who survived
- More outliers are present for the patients who died than for the patients who survived
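
The quartiles read off the box plots can be cross-checked numerically with a group-by. The sketch below uses hypothetical toy rows rather than the real data; in the notebook the same expression could be applied to `df` for each of `age`, `year` and `nodes`.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook this would be the full `df`
toy = pd.DataFrame({
    "age":    [30, 40, 50, 60, 45, 55, 65],
    "status": [1, 1, 1, 1, 2, 2, 2],
})

# First quartile, median and third quartile of age for each status value
quartiles = toy.groupby("status")["age"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

Each row of the result corresponds to one box in the plot: the 0.25 and 0.75 columns are the box edges and the 0.5 column is the median line.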

In [None]:
###violin plots of the age, year and nodes columns by survival status
###(size is not a violinplot parameter, so it is dropped)
plt.figure(figsize=(20,8)) 

plt.subplot(1,3,1)
sns.violinplot(x="status", y="age", data=df)
plt.title("Plot for survival_status and Age")

plt.subplot(1,3,2)
sns.violinplot(x="status", y="year", data=df)
plt.title("Plot for survival_status and Year")

plt.subplot(1,3,3)
sns.violinplot(x="status", y="nodes", data=df)
plt.title("Plot for survival_status and Nodes")
plt.show()

In [None]:
##distribution of age, year and nodes, colored by survival status
##sns.distplot is deprecated; sns.histplot with kde=True draws a comparable plot
##FacetGrid creates its own figure, so no plt.figure call is needed
for col in ['age', 'year', 'nodes']:
    sns.FacetGrid(df, hue="status", height=5) \
       .map(sns.histplot, col, kde=True, stat="density") \
       .add_legend()
plt.show()

### Bivariate Analysis

In [None]:
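##pair plot of the numerical variables for patients with status = 1
##(pairplot creates its own figure, so no plt.figure call is needed)
df_numerical_var = df_status_1[['age', 'year', 'nodes']]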
sns.pairplot(df_numerical_var)
plt.show()

#### Inferences:
- Many of the patients who survived had 0 positive nodes

In [None]:
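##pair plot of the numerical variables for patients with status = 2
##(pairplot creates its own figure, so no plt.figure call is needed)
df_numerical_var = df_status_2[['age', 'year', 'nodes']]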
sns.pairplot(df_numerical_var)
plt.show()

### Observations:
- The dataset is imbalanced: about 74% of patients survived (status 1) and 26% died (status 2)
- Most patients between the ages of 30 and 40 survived after treatment
- Patients with fewer positive nodes are more likely to survive
- Patients in the 57 to 68 age group were more likely to die after treatment
- Patients with more than 1 positive node are less likely to survive
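
The observations about nodes can be checked numerically by computing the share of survivors within buckets of the node count. This is a minimal sketch under stated assumptions: the helper name `survival_rate_by_node_bucket`, the bucket edges and the toy rows are all hypothetical, and in the notebook the function would be called on `df`.

```python
import pandas as pd

def survival_rate_by_node_bucket(frame: pd.DataFrame) -> pd.Series:
    """Share of status == 1 rows within each bucket of positive nodes."""
    # Bucket edges are an arbitrary choice for illustration
    buckets = pd.cut(frame["nodes"], bins=[-1, 1, 4, 100],
                     labels=["0-1", "2-4", "5+"])
    survived = frame["status"].eq(1)
    return survived.groupby(buckets, observed=False).mean()

# Hypothetical toy rows; in the notebook this would be the full `df`
toy = pd.DataFrame({"nodes": [0, 0, 1, 2, 5, 10],
                    "status": [1, 1, 1, 2, 2, 2]})
rates = survival_rate_by_node_bucket(toy)
print(rates)
```

A table like this states how strongly survival drops as the node count rises, which is harder to judge from the plots alone. The same approach with age buckets could be used to check the age-related observations.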