### Data Set Information:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.


### Attribute Information:

1. Age of patient at time of operation
2. Patient's year of operation
3. Number of positive axillary nodes detected
4. Survival status
    * 1 = the patient survived 5 years or longer
    * 2 = the patient died within 5 years

### Objective

The goal is to perform Exploratory Data Analysis (EDA) in the following steps:
1. Understanding the Dataset.
    * Size of the dataset
    * Datatype of each column
    * 5 summary statistics
    * Target variable analysis
2. Clean the data.
3. Relationship analysis.
    * Histogram
    * Pair Plot
    * Joint Plot

### Importing Libraries

We import NumPy and pandas for data handling, and Matplotlib and Seaborn for data visualisation.

In [None]:
# importing libraries
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

In [None]:
# reading dataset
df=pd.read_csv("../input/habermans-survival-data-set/haberman.csv")
df.sample(5)

Inspecting the dataframe shows that the file has no header row: the first record was read as the column names. We therefore need to supply the column names ourselves when reading the file.
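To make the header problem concrete, here is a minimal sketch on a made-up three-row sample in the same headerless layout (the values are illustrative, not taken from the file). It shows that the default `read_csv` call consumes the first record as column names, while passing `names=` keeps every record as data.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same headerless layout (made-up values).
raw = "30,64,1,1\n30,62,3,1\n31,59,2,2\n"

# Default behaviour: the first record is consumed as the header row.
with_default = pd.read_csv(io.StringIO(raw))

# Passing names= tells pandas there is no header row, so every record stays as data.
with_names = pd.read_csv(
    io.StringIO(raw),
    names=["age", "op_year", "axil_nodes", "sur_status"],
)

print(with_default.shape)  # one row fewer than the file contains
print(with_names.shape)    # all rows kept
```

A quick sign of this problem in practice is column labels that look like data values.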

In [None]:
# re-reading the dataset, this time supplying the column names
df=pd.read_csv("../input/habermans-survival-data-set/haberman.csv",names=['age','op_year','axil_nodes','sur_status'])
df.sample(5)

### Understanding the data

In [None]:
# dataset shape
print(df.shape)
print("Dataset has {} rows and {} columns".format(df.shape[0],df.shape[1]))

In [None]:
# column data types and non-null counts
df.info()

* The index is a RangeIndex.
* All four columns are of int64 data type.
* sur_status is the target variable.
* All other columns hold discrete values.
* There are no null values.
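The points above can also be checked programmatically instead of reading them off the `info()` output. The following is a minimal sketch on a small stand-in dataframe (`df_demo` and its values are hypothetical); the same calls should work on `df`.

```python
import pandas as pd

# Hypothetical stand-in for df, with made-up values.
df_demo = pd.DataFrame({
    "age": [30, 45, 61],
    "op_year": [64, 62, 59],
    "axil_nodes": [1, 0, 7],
    "sur_status": [1, 1, 2],
})

# Number of missing values per column.
null_counts = df_demo.isnull().sum()

# True only if every column has an integer dtype.
all_int = all(pd.api.types.is_integer_dtype(t) for t in df_demo.dtypes)

print(null_counts)
print(all_int)
```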

In [None]:
# summary statistics
df.describe()

* There are no missing values in these columns, because the count of each column matches the row count of the dataframe.
* Patient age ranges from 30 to 83.
* op_year covers operations from 1958 to 1969.
* At least 75% of patients have fewer than 5 positive nodes, even though the maximum value is 52.
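The statement about node counts is read off the 75th percentile in `describe()`. A minimal sketch of computing such a share directly, using a made-up series (`nodes` and its values are hypothetical), could look like this:

```python
import pandas as pd

# Hypothetical node counts, made up for illustration.
nodes = pd.Series([0, 0, 1, 2, 3, 4, 8, 52], name="axil_nodes")

# A boolean mask averaged gives the fraction of rows satisfying the condition.
share_below_5 = (nodes < 5).mean()

# The 75th percentile, as reported in the "75%" row of describe().
q75 = nodes.quantile(0.75)

print(share_below_5, q75)
```

On the real data the same two lines would be applied to `df['axil_nodes']`.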

In [None]:
# target variable distribution
df['sur_status'].value_counts()

* The dataset is unbalanced.
* 225 patients survived 5 years or longer and 81 patients died within 5 years.
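To express the imbalance as proportions, the counts quoted above can be normalised. This sketch rebuilds the counts by hand; on the dataframe itself, `df['sur_status'].value_counts(normalize=True)` should give the same shares in one call.

```python
import pandas as pd

# Class counts as quoted above (1 = survived 5 years or longer, 2 = died within 5 years).
counts = pd.Series({1: 225, 2: 81}, name="sur_status")

# Share of each class in the dataset.
shares = counts / counts.sum()

print(shares.round(3))
```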

### Data Cleaning

* The previous exploration showed that there are no missing values.
* We now check whether missing values are encoded as some other placeholder value.

In [None]:
# number of unique values in each column
df.nunique()

* Age has 49 unique values.
* Operating year has 12 unique values.
* Nodes has 31 unique values.

In [None]:
# unique values in 'age' column
df['age'].unique()

In [None]:
# unique values in 'op_year' column
df['op_year'].unique()

In [None]:
# unique values in 'axil_nodes' column
df['axil_nodes'].unique()

Looking at the unique values, no missing values appear to be encoded as placeholder values.
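Scanning unique values by eye works for a small dataset. As a complement, a range check can flag placeholder codes automatically. The sketch below uses a made-up dataframe with two planted placeholders; the `valid_ranges` bounds are assumptions chosen for illustration, not limits documented for this dataset.

```python
import pandas as pd

# Hypothetical data with two planted placeholder values (-1 and 99).
df_demo = pd.DataFrame({
    "age": [30, 45, -1, 60],
    "op_year": [64, 62, 59, 99],
    "axil_nodes": [1, 0, 3, 2],
})

# Assumed plausible ranges per column (illustrative only).
valid_ranges = {"age": (0, 120), "op_year": (58, 70), "axil_nodes": (0, 100)}

# Count the values falling outside the assumed range for each column.
out_of_range = {
    col: int((~df_demo[col].between(lo, hi)).sum())
    for col, (lo, hi) in valid_ranges.items()
}

print(out_of_range)
```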

### Relationship Analysis

#### **Univariate Analysis**

Creating histograms of each column using Seaborn.

In [None]:
# 2x2 grid of histograms with a KDE overlay
# (sns.histplot replaces the deprecated sns.distplot)
plt.figure(figsize=[14,14])
plt.subplot(221)
sns.histplot(df['age'], kde=True, stat='density')
plt.title('Distribution of Age')
plt.xlabel('Age')

plt.subplot(222)
sns.histplot(df['op_year'], kde=True, stat='density')
plt.title('Distribution of Operating Year')
plt.xlabel('Operating Year')

plt.subplot(223)
sns.histplot(df['axil_nodes'], kde=True, stat='density')
plt.title('Distribution of Axillary Node')
plt.xlabel('Axillary Node')

plt.subplot(224)
sns.histplot(df['sur_status'], kde=True, stat='density')
plt.title('Distribution of Survival status')
plt.xlabel('Survival status')

plt.show()

* The age column is approximately normally distributed.
* axil_nodes is right-skewed.
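The shape of a distribution can also be summarised numerically with the sample skewness: values near 0 suggest a symmetric distribution and positive values a longer right tail. A minimal sketch on made-up data (`sample` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical values: a symmetric "age" column and a right-tailed "axil_nodes" column.
sample = pd.DataFrame({
    "age": [30, 40, 50, 60, 70],
    "axil_nodes": [0, 0, 1, 3, 30],
})

# Sample skewness of each column.
skewness = sample.skew()

print(skewness)
```

On the real data, `df[['age', 'axil_nodes']].skew()` would give the corresponding figures.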

#### **Box Plot**

In [None]:
sns.boxplot(x='sur_status',y='age', data=df)
plt.show()

In [None]:
sns.boxplot(x='sur_status',y='axil_nodes', data=df)
plt.show()

The axil_nodes column has a few outliers.
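The points drawn beyond the whiskers of a box plot follow, by default, the 1.5 × IQR rule. The same rule can be applied by hand to list the exceptional values; this is a sketch on made-up node counts (`nodes` is hypothetical), and it ignores the per-class split used in the plot.

```python
import pandas as pd

# Hypothetical node counts, made up for illustration.
nodes = pd.Series([0, 0, 0, 1, 1, 2, 3, 4, 7, 52], name="axil_nodes")

# Interquartile range and the upper fence used by the default box plot whiskers.
q1, q3 = nodes.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the upper fence are the ones drawn as individual points.
outliers = nodes[nodes > upper_fence]

print(upper_fence)
print(outliers.tolist())
```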

#### **Pair Plot**

In [None]:
sns.pairplot(df,hue="sur_status",height=3)

From the pair plot, we see that the target classes are not clearly separable by any single feature or pair of features.
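A per-class summary table is a simple numeric companion to the pair plot: if the ranges of a feature overlap heavily between classes, that feature alone cannot separate them. A minimal sketch on made-up data (`df_demo` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical rows, made up for illustration.
df_demo = pd.DataFrame({
    "age": [34, 52, 61, 43, 55, 66],
    "axil_nodes": [0, 1, 4, 2, 9, 0],
    "sur_status": [1, 1, 1, 2, 2, 2],
})

# Min, median and max of each feature within each survival class.
summary = df_demo.groupby("sur_status")[["age", "axil_nodes"]].agg(["min", "median", "max"])

print(summary)
```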

#### **Joint Plot**

In [None]:
sns.jointplot(x='sur_status',y='age', data=df, kind="kde");

In [None]:
sns.jointplot(x='sur_status',y='op_year', data=df, kind="kde");

In [None]:
sns.jointplot(x='sur_status',y='axil_nodes', data=df, kind="kde");

### Observations
* The CSV file has no header row, so the dataframe has no column names by default.
* The index is a RangeIndex.
* All columns are of 'int64' data type.
* 'sur_status' is the target variable.
* All columns hold discrete values.
* There are no null values in the dataset.
* Patient age ranges from 30 to 83.
* op_year covers operations from 1958 to 1969.
* At least 75% of patients have fewer than 5 positive nodes, even though the maximum value is 52.
* The dataset is unbalanced (225 patients survived 5 years or longer and 81 died within 5 years).
* No missing values appear to be encoded as placeholder values.
* The age column is approximately normally distributed.
* axil_nodes is right-skewed.
* The axil_nodes column has a few outliers.
* From the pair plot, the target classes are not clearly separable by any single feature or pair of features.

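### Conclusion
* Add column names when loading the dataframe.
* Account for the unbalanced target variable, especially when using distance-based ML algorithms such as the KNN classifier.
* Apply a log transform to the 'axil_nodes' column, for example log(1 + x) so that zero counts are handled, to reduce its right skew and bring it closer to a normal distribution.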
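As a sketch of the suggested transform, `np.log1p` computes log(1 + x), which stays defined when a count is zero. The series below is made up for illustration; a log transform reduces right skew but does not guarantee a normal distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed node counts, including zeros.
nodes = pd.Series([0, 0, 1, 3, 30], name="axil_nodes")

# log(1 + x): maps 0 to 0 and compresses large values.
log_nodes = np.log1p(nodes)

# Compare the skewness before and after the transform.
print(nodes.skew(), log_nodes.skew())
```

On the real data this would be applied as `np.log1p(df['axil_nodes'])`, and the histogram re-plotted to check the effect.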

With these steps in place, we can build an ML model to predict the survival status of a patient.