### **Haberman Dataset**

* Number of Instances: 306
* Number of Attributes: 4 
    * Age of patient at time of operation (numerical)
    * Patient's year of operation (year - 1900, numerical)
    * Number of positive axillary nodes detected (numerical)
    * Survival status (class attribute)
      * 1 = the patient survived 5 years or longer
      * 2 = the patient died within 5 years

## **Objective** - Survival of Patients who had undergone Surgery for Breast Cancer

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Installing required libraries
!pip install jupyterthemes==0.16.1

In [None]:
#Library
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
# Run `!pip install jupyterthemes==0.16.1` (previous cell) before running this cell
from jupyterthemes import jtplot
jtplot.style(theme='onedork')
import warnings
warnings.filterwarnings('ignore')
import numpy as np

%matplotlib inline

# **Data Wrangling:**
#### **General Properties**

In [None]:
#Load haberman.csv into pandas dataframe
df = pd.read_csv('/kaggle/input/habermans-survival-data-set/haberman.csv')
df.head(8)

#### **Observation:**
* The file has no header row, so the first record was read as the column names. Before starting, we have to supply column names when loading the data.
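
To illustrate why the first record ends up as a header: by default `pd.read_csv` treats the first line of the file as column names. The sketch below uses three made-up rows in the same four-column layout (illustrative values, not read from `haberman.csv`) and compares the default call with one that passes `header=None` and `names`.

```python
import io
import pandas as pd

# Three made-up rows in the same layout as haberman.csv (illustrative values only)
raw = "30,64,1,1\n30,62,3,1\n34,59,0,2\n"

# Default behaviour: the first line is consumed as the header
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data; names= supplies the column labels
col_names = ["age", "Operating_year", "axillary_nodes", "Survival_status"]
with_names = pd.read_csv(io.StringIO(raw), names=col_names, header=None)

print(with_default.shape)  # one record is lost to the header row
print(with_names.shape)    # all records are kept
```

Comparing the two shapes is a quick way to confirm that no record was lost when loading.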

In [None]:
#ref: https://stackoverflow.com/questions/31645466/give-column-name-when-read-csv-file-pandas
#Naming the Columns according to given description
col_names = ["age","Operating_year","axillary_nodes","Survival_status"]
haberman = pd.read_csv('/kaggle/input/habermans-survival-data-set/haberman.csv', names = col_names, header = None)
haberman.head(8)

In [None]:
#Cross Checking the Column names in dataset
haberman.columns

In [None]:
#Data points and Features 
haberman.shape

In [None]:
#Concise Summary of the DataFrame
haberman.info()

In [None]:
#Statistical Summary of DataFrame
haberman.describe()

In [None]:
#Missing Values
haberman.isna().sum()

In [None]:
#Checking whether this dataset has duplicate rows
sum(haberman.duplicated())

In [None]:
#Replacing 1 and 2 by Yes and No (assigning back avoids inplace changes on a column selection)
haberman['Survival_status'] = haberman['Survival_status'].replace({1: 'Yes', 2: 'No'})
haberman.sample(8)

In [None]:
#Number of patients in each survival class
haberman['Survival_status'].value_counts()

#### **Observation:**
* There are no missing values.
* The dataset is imbalanced: the two survival classes are not equally represented.
* There are 17 duplicated rows.
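
Both the imbalance and the duplicate count can be expressed as numbers. The following is a minimal sketch on a made-up sample that reuses the notebook's column names; the values are illustrative and are not taken from the real data.

```python
import pandas as pd

# Made-up sample with the notebook's column names (illustrative values only)
sample = pd.DataFrame({
    "age": [30, 30, 34, 34, 41, 41],
    "Operating_year": [64, 62, 59, 59, 60, 60],
    "axillary_nodes": [1, 3, 0, 0, 23, 23],
    "Survival_status": ["Yes", "Yes", "No", "No", "Yes", "Yes"],
})

# Share of each class: a large gap between the shares indicates imbalance
class_share = sample["Survival_status"].value_counts(normalize=True)

# duplicated() flags every repeat of an earlier row, so the sum counts the extra copies
n_duplicates = int(sample.duplicated().sum())

print(class_share)
print(n_duplicates)
```

Running the same two calls on `haberman` gives the class shares and the duplicate count for the real data.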

# **Data Cleaning:**

In [None]:
#Number of Distinct Observation 
haberman.nunique()

In [None]:
#DataFrame for Duplicate Values
haberman_duplicated = haberman[haberman.duplicated()]
haberman_duplicated

In [None]:
#Cross Checking
haberman[haberman['axillary_nodes'] == 11]

In [None]:
#Dropping Duplicates
haberman_cleaned = haberman.drop_duplicates()
haberman_cleaned.sample(3)

In [None]:
sum(haberman_cleaned.duplicated())

In [None]:
#Statistical Summary of Cleaned DataFrame
haberman_cleaned.describe()

In [None]:
#Number of patients in each survival class in the cleaned dataset
haberman_cleaned['Survival_status'].value_counts()

#### **Observation:**
* The dataset is still imbalanced after data cleaning.
* By definition, the main features in this dataset are **'age'** and **'axillary nodes'**.

# **Exploratory Data Analysis**

## **Univariate Exploration**

#### **Age**

In [None]:
plt.subplots(figsize = (14,7));
plt.xticks(rotation=90);
sns.countplot(data = haberman_cleaned, x = 'age');
plt.title('Age');

#### **Observation:**
* Ages range from 30 to 83.
* The most common age is 52.
* Ages 75, 76, 77, 78 and 83 each have only one patient.
* Most patients are between 40 and 65 years old.

#### **Patient's year of operation**

In [None]:
plt.subplots(figsize = (14,7));
plt.xticks(rotation=90);
sns.countplot(data = haberman_cleaned, x = 'Operating_year');
plt.title('Patient\'s year of operation');

#### **Number of positive axillary nodes detected**

In [None]:
plt.subplots(figsize = (14,7));
plt.xticks(rotation=90);
sns.countplot(data = haberman_cleaned, x = 'axillary_nodes');
plt.title('Number of positive axillary nodes detected');

#### **Observation:**
* Most patients have no positive axillary nodes.
* Among patients with positive axillary nodes, the most common count is 1.
* The maximum number of positive axillary nodes is 52.

#### **Survival status**

In [None]:
plt.subplots(figsize = (10,5));
sns.countplot(data = haberman_cleaned, x = 'Survival_status');
plt.title('Survival status');

#### **Observation:**
* More patients survived 5 years or longer than died within 5 years.

## **Bivariate Exploration**

#### **Age Vs Survival status**

In [None]:
plt.subplots(figsize = (14,7));
plt.xticks(rotation=90);
sns.countplot(data = haberman_cleaned, x = 'age', hue = 'Survival_status');
plt.legend(loc='upper right');
plt.title('Age Vs Survival status');

#### **Observation:**
* Most of the patients who survived are in the 38 to 70 age range.
* The largest number of patients who did not survive are aged 53, followed by ages 46, 52, 54 and 66.
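
Per-age counts are noisy because many ages have only a few patients. One option is to group ages into bands and compare the survival rate per band. This is a sketch on a made-up sample; the 10-year band edges are an assumption chosen for illustration, not something defined in this notebook.

```python
import pandas as pd

# Made-up sample (illustrative values only, not the real data)
sample = pd.DataFrame({
    "age": [31, 38, 45, 47, 52, 53, 58, 66, 70, 77],
    "Survival_status": ["Yes", "Yes", "No", "Yes", "Yes",
                        "No", "Yes", "No", "Yes", "No"],
})

# Hypothetical 10-year bands; right=False makes each band [start, end)
bands = pd.cut(sample["age"], bins=[30, 40, 50, 60, 70, 90], right=False)

# Fraction of survivors in each band (mean of a True/False column)
survived = sample["Survival_status"].eq("Yes")
rate_by_band = survived.groupby(bands, observed=False).mean()

print(rate_by_band)
```

A rate per band is easier to compare across ages than raw counts, since the bands contain different numbers of patients.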

#### **Patient's year of operation Vs Survival status**

In [None]:
plt.subplots(figsize = (14,7));
plt.xticks(rotation=90);
sns.countplot(data = haberman_cleaned, x = 'Operating_year', hue = 'Survival_status');
plt.legend(loc='upper right');
plt.title('Patient\'s year of operation Vs Survival status');

#### **Number of positive axillary nodes detected Vs Survival status**

In [None]:
plt.subplots(figsize = (14,7));
plt.xticks(rotation=90);
sns.countplot(data = haberman_cleaned, x = 'axillary_nodes', hue = 'Survival_status');
plt.legend(loc='upper right');
plt.title('Number of positive axillary nodes detected Vs Survival status');

#### **Observation:**
* Patients with no positive axillary nodes account for the largest number of survivors.

#### **Pair Plot**

In [None]:
sns.pairplot(haberman_cleaned, hue = 'Survival_status', markers=["D", "o"], height=4);
plt.show()

## **PDF and CDF**

In [None]:
sns.FacetGrid(haberman_cleaned, hue="Survival_status", height = 8).map(sns.distplot, "age");
plt.legend(loc='upper right');
plt.show();

#### **Observation:**
* The two distributions overlap heavily, so age alone does not separate survivors from non-survivors.
* On the density curves, the 20–50 range shows a slightly higher rate of survival and the 75–90 range a lower one. The curves extend past the observed ages (30 to 83) because of kernel smoothing.

In [None]:
sns.FacetGrid(haberman_cleaned, hue="Survival_status", height = 8).map(sns.distplot, "Operating_year");
plt.legend(loc='upper right');
plt.show();

#### **Observation:**
* The two distributions overlap heavily, so the patient's year of operation alone does not separate survivors from non-survivors.

In [None]:
sns.FacetGrid(haberman_cleaned, hue="Survival_status", height = 8).map(sns.distplot, "axillary_nodes");
plt.legend(loc='upper right');
plt.show();

#### **Observation:**
* Patients with 0 positive nodes have a higher probability of survival than patients with positive nodes.
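
The comparison between patients with and without positive nodes can also be made numerically. The sketch below uses a made-up sample (illustrative values, not the real data) and compares the survival rate of the two groups; the group labels are hypothetical names chosen for this example.

```python
import pandas as pd

# Made-up sample (illustrative values only, not the real data)
sample = pd.DataFrame({
    "axillary_nodes": [0, 0, 0, 0, 1, 3, 8, 23],
    "Survival_status": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# Label each patient by whether any positive node was detected
group = (sample["axillary_nodes"] > 0).map(
    {False: "no positive nodes", True: "positive nodes"}
)

# Fraction of survivors in each group
survived = sample["Survival_status"].eq("Yes")
rate = survived.groupby(group).mean()

print(rate)
```

Applied to `haberman_cleaned`, the same steps give the two survival rates that the density plot only shows visually.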

In [None]:
#Splitting the data by survival status for the PDF and CDF plots
haberman_cleaned_yes = haberman_cleaned.loc[haberman_cleaned['Survival_status'] == 'Yes']
haberman_cleaned_no = haberman_cleaned.loc[haberman_cleaned['Survival_status'] == 'No']

### **Plot PDF and CDF for Age**

In [None]:
counts, bin_edges = np.histogram(haberman_cleaned_yes['age'], bins=10, density = True)
pdf = counts/(sum(counts))
#print(pdf);
#print(bin_edges);
cdf = np.cumsum(pdf)
plt.subplots(figsize = (14,7));
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf);

counts, bin_edges = np.histogram(haberman_cleaned_no['age'], bins=10, density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf);

label = ['PDF for Yes', 'CDF for Yes', 'PDF for No', 'CDF for No']
plt.legend(label);

plt.xlabel('Age');
plt.ylabel('Percentage of People');
plt.title(label = 'PDF and CDF for Age', fontsize=18);

#### **Observation:**
* Around 80% of the patients are younger than 70 years.
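
A percentage read from a CDF curve can be cross-checked directly from the values, without binning. This is a minimal sketch on made-up ages (illustrative values only); the threshold of 70 matches the observation above.

```python
import numpy as np

# Made-up ages (illustrative values only, not the real data)
ages = np.array([34, 41, 45, 52, 53, 58, 61, 66, 72, 78])

threshold = 70

# The mean of a boolean array is the fraction of True entries
share_below = np.mean(ages < threshold)

print(f"{share_below:.0%} of the sample is younger than {threshold}")
```

Because the plotted CDF is built from 10 bins, a value read from it is only approximate, while the direct fraction is exact for the chosen threshold.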

### **Plot PDF and CDF for Patient's year of operation**

In [None]:
counts, bin_edges = np.histogram(haberman_cleaned_yes['Operating_year'], bins=10, density = True)
pdf = counts/(sum(counts))
#print(pdf);
#print(bin_edges);
cdf = np.cumsum(pdf)
plt.subplots(figsize = (14,7));
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf);

counts, bin_edges = np.histogram(haberman_cleaned_no['Operating_year'], bins=10, density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf);

label = ['PDF for Yes', 'CDF for Yes', 'PDF for No', 'CDF for No']
plt.legend(label);

plt.xlabel('Patient\'s year of operation');
plt.ylabel('Percentage of People');
plt.title(label = 'PDF and CDF for Patient\'s year of operation', fontsize=18);

#### **Observation:**
* Around 80% of the patients were operated on before 1966 (year value below 66).

### **Plot PDF and CDF for Number of positive axillary nodes detected**

In [None]:
counts, bin_edges = np.histogram(haberman_cleaned_yes['axillary_nodes'], bins=10, density = True)
pdf = counts/(sum(counts))
#print(pdf);
#print(bin_edges);
cdf = np.cumsum(pdf)
plt.subplots(figsize = (14,7));
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf);

counts, bin_edges = np.histogram(haberman_cleaned_no['axillary_nodes'], bins=10, density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf);

label = ['PDF for Yes', 'CDF for Yes', 'PDF for No', 'CDF for No']
plt.legend(label);

plt.xlabel('Number of positive axillary nodes detected');
plt.ylabel('Percentage of People');
plt.title(label = 'PDF and CDF for Number of positive axillary nodes detected', fontsize=18);

#### **Observation:**
* More than 90% of the patients have fewer than 10 positive axillary nodes.
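
The three PDF and CDF cells repeat the same computation, so it could be factored into a small helper. The function name `pdf_cdf` is hypothetical and the input values are made up for illustration. The sketch normalizes raw counts; with equal-width bins this gives the same result as normalizing the `density=True` output used in the cells above.

```python
import numpy as np

def pdf_cdf(values, bins=10):
    """Return per-bin fractions, their running total, and the bin edges."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of points in each bin
    cdf = np.cumsum(pdf)          # fraction of points up to each bin's right edge
    return pdf, cdf, bin_edges

# Made-up node counts (illustrative values only, not the real data)
nodes = np.array([0, 0, 0, 1, 1, 2, 4, 7, 13, 20])
pdf, cdf, edges = pdf_cdf(nodes, bins=4)

print(pdf)
print(cdf)
print(edges)
```

The returned arrays can be plotted the same way as in the cells above, for example `plt.plot(edges[1:], cdf)`.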