In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Understanding of data:

**Title: Haberman’s Survival Data**

**Description:** The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

**Attribute Information:**
* Age of patient at the time of operation (numerical)
* Patient’s year of operation (year minus 1900, numerical)
* Number of positive axillary nodes detected (numerical)

**Survival status (class attribute) :**
* 1 = the patient survived 5 years or longer
* 2 = the patient died within 5 years

**Objective: To predict whether a patient will survive or not after 5 years based on the features such as patient's age, operation year and number of positive lymph nodes.**

In [None]:
#importing dataset
haberman_dataset = pd.read_csv("../input/habermans-survival-data-set/haberman.csv", header = None,  
                               names= ['AGE', 'OPERATION_YEAR', 'POSITIVE_LYMPH_NODES', 'SURVIVAL_STATUS'])
haberman_dataset.head()

In [None]:
print(haberman_dataset.shape)

Observation: **The CSV file contains 306 rows and 4 columns.**

In [None]:
print(haberman_dataset.info())

**Observation:**
* There are no missing values in this data set.
* All the columns are of the integer data type.
* The datatype of the status is an integer, which we can convert to a categorical datatype.
* In the status column, the value 1 can be mapped to ‘yes’ which means the patient has survived 5 years or longer. And the value 2 can be mapped to ‘no’ which means the patient died within 5 years.

In [None]:
# Map the values 1 and 2 to 'yes' and 'no' respectively, convert the column
# to a categorical dtype, and show the first 10 records from the dataset.

haberman_dataset['SURVIVAL_STATUS'] = haberman_dataset['SURVIVAL_STATUS'].map({1: 'yes', 2: 'no'})
haberman_dataset['SURVIVAL_STATUS'] = haberman_dataset['SURVIVAL_STATUS'].astype('category')
haberman_dataset.head(10)

In [None]:
haberman_dataset.describe()

**Observation:**
* Count : Total number of values present in respective columns.
* Mean: Mean of all the values present in the respective columns.
* Std: Standard Deviation of the values present in the respective columns.
* Min: The minimum value in the column.
* 25%: Gives the 25th percentile value.
* 50%: Gives the 50th percentile value.
* 75%: Gives the 75th percentile value.
* Max: The maximum value in the column.
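
To make the rows of `describe()` concrete, the sketch below recomputes each statistic by hand with NumPy. The `ages` values are made up for illustration and are not taken from `haberman.csv`; the names `ages`, `summary` and `manual` are likewise only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up ages, for illustration only (not taken from haberman.csv).
ages = pd.Series([30, 34, 38, 42, 50], name="AGE")

summary = ages.describe()
values = ages.to_numpy()

# Each describe() entry recomputed by hand with NumPy.
manual = {
    "count": float(values.size),
    "mean": float(np.mean(values)),
    "std": float(np.std(values, ddof=1)),  # describe() reports the sample std (ddof=1)
    "min": float(np.min(values)),
    "25%": float(np.percentile(values, 25)),
    "50%": float(np.percentile(values, 50)),
    "75%": float(np.percentile(values, 75)),
    "max": float(np.max(values)),
}

for name, value in manual.items():
    print(f"{name:>5}: describe={summary[name]:.3f}  manual={value:.3f}")
```

Note that the standard deviation is the sample standard deviation (dividing by n - 1), which is why `ddof=1` is passed to `np.std`.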

In [None]:
#gives each count of the status type
haberman_dataset['SURVIVAL_STATUS'].value_counts()

**Observation:**
* The value_counts() function tells how many data points for each class are present. Here, it tells how many patients survived and how many did not survive.
* Out of 306 patients, 225 patients survived and 81 did not.

In [None]:
print(haberman_dataset.iloc[:,-1].value_counts(normalize = True))

* We can observe that the target class is **imbalanced**: about 73.5% (225/306) of the values are 'yes' and only about 26.5% (81/306) are 'no'.
* The dataset is also small: it consists of only **306 records**.
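
One practical consequence of the imbalance, sketched below under the assumption that accuracy would be used to judge a classifier: a rule that always predicts 'yes' already scores about 73.5% while never identifying a single patient who died. The counts come from `value_counts()` above; the always-'yes' rule is a hypothetical baseline, not a model from this notebook.

```python
# Class counts as reported by value_counts() above.
survived, died = 225, 81
total = survived + died

# Hypothetical baseline: always predict the majority class ('yes').
baseline_accuracy = survived / total

# The baseline never predicts 'no', so it finds none of the patients who died.
recall_no = 0 / died

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"recall on 'no':    {recall_no:.3f}")
```

Any model built on this data should therefore be compared against this baseline rather than against 50%.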

In [None]:
#survival_status_yes dataframe stores all the records where status is yes
survival_status_yes = haberman_dataset[haberman_dataset['SURVIVAL_STATUS']== 'yes']
survival_status_yes.describe()

In [None]:
#survival_status_no dataframe stores all the records where status is no
survival_status_no = haberman_dataset[haberman_dataset['SURVIVAL_STATUS'] == 'no']
survival_status_no.describe()

* The mean age and the mean operation year are almost the same for both classes, while the mean number of positive lymph nodes differs by approximately 5.
* Patients who survived had, on average, fewer positive lymph nodes than patients who did not survive.
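
The same per-class comparison can be obtained in a single call with `groupby`. The sketch below uses a small made-up frame (`toy`) with the same column names as the notebook; the values are for illustration only and are not taken from `haberman.csv`.

```python
import pandas as pd

# Made-up records for illustration only (not taken from haberman.csv).
toy = pd.DataFrame({
    "AGE": [34, 45, 52, 41, 50, 60],
    "POSITIVE_LYMPH_NODES": [0, 1, 2, 4, 9, 11],
    "SURVIVAL_STATUS": ["yes", "yes", "yes", "no", "no", "no"],
})

# Mean of each numeric feature, computed separately for each class.
class_means = toy.groupby("SURVIVAL_STATUS")[["AGE", "POSITIVE_LYMPH_NODES"]].mean()
print(class_means)

# Difference between the class means of the node count.
node_gap = (class_means.loc["no", "POSITIVE_LYMPH_NODES"]
            - class_means.loc["yes", "POSITIVE_LYMPH_NODES"])
print("difference in mean nodes:", node_gap)
```

Applied to `haberman_dataset`, the same `groupby` call would place the two class summaries side by side instead of in two separate `describe()` outputs.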

# Univariate Analysis

The major purpose of univariate analysis is to describe, summarize and find patterns in a single feature.

**One Dimensional Scatter Plot**

In [None]:
sns.set_style('whitegrid')
one= haberman_dataset.loc[haberman_dataset['SURVIVAL_STATUS']== 'yes']
two= haberman_dataset.loc[haberman_dataset['SURVIVAL_STATUS']== 'no']
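# All points are drawn at y = 0, so only the spread along the age axis is visible
plt.plot(one['AGE'], np.zeros_like(one['AGE']), 'o', label= "SURVIVAL_STATUS, yes")
plt.plot(two['AGE'], np.zeros_like(two['AGE']), 'o', label= "SURVIVAL_STATUS, no")
plt.xlabel('Age')
plt.legend()  # without this call the labels above are never displayed
plt.show()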


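**Probability Density Function (PDF)**

The probability density function (PDF) describes how densely the values of a variable are concentrated around x. The probability that the variable falls within an interval is the area under the curve over that interval. The curve can be read as a smoothed version of the histogram.

Here the height of each bar reflects the share of data points that fall in the corresponding bin, for each group.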
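
The relationship between bar height, bin width and probability can be checked numerically. In the sketch below (made-up ages, illustrative names), `np.histogram` with `density=True` returns heights whose area, height times bin width summed over all bins, equals 1; multiplying a height by its bin width recovers the fraction of points in that bin.

```python
import numpy as np

# Made-up ages, for illustration only (not taken from haberman.csv).
ages = np.array([30, 34, 38, 41, 43, 47, 52, 55, 61, 70])

# density=True scales the bar heights so that the total area is 1.
heights, edges = np.histogram(ages, bins=4, density=True)
widths = np.diff(edges)

# Height * width is the fraction of data points in each bin.
fractions = heights * widths
area = float(fractions.sum())

print("bin edges:", edges)
print("fractions per bin:", fractions)
print("total area:", area)
```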

In [None]:
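# histplot with kde=True is used because distplot is deprecated since seaborn 0.11
sns.FacetGrid(haberman_dataset, hue='SURVIVAL_STATUS', height=5)\
 .map(sns.histplot, "AGE", kde=True, stat="density")\
 .add_legend()
plt.show()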

**Observation:**
* There is major overlap between the two classes, which suggests that survival does not depend strongly on a patient's age.
* Despite the overlap, patients aged 30–40 appear somewhat more likely to survive and patients aged 40–60 somewhat less likely, while patients aged 60–75 appear about equally likely to survive or not.
* No firm conclusion can be drawn from this figure: the age parameter alone cannot decide a patient's survival chances.

In [None]:
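# histplot with kde=True is used because distplot is deprecated since seaborn 0.11
sns.FacetGrid(haberman_dataset, hue='SURVIVAL_STATUS', height=5)\
 .map(sns.histplot, "OPERATION_YEAR", kde=True, stat="density")\
 .add_legend()
plt.show()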

**Observation:**
* There is even more overlap here than for the "AGE" parameter. The plot only shows how the successful and unsuccessful operations are distributed across the years, so the operation year cannot be used to decide a patient's survival chances.
* However, it can be observed that the years 1960 and 1965 had more unsuccessful operations.

In [None]:
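# histplot with kde=True is used because distplot is deprecated since seaborn 0.11
sns.FacetGrid(haberman_dataset, hue='SURVIVAL_STATUS', height=5)\
 .map(sns.histplot, "POSITIVE_LYMPH_NODES", kde=True, stat="density")\
 .add_legend()
plt.show()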

**Observation:** Patients with 0 or 1 positive nodes are more likely to survive. Survival is very unlikely for patients with 25 or more nodes.

**Cumulative Distribution Function(CDF)**

The Cumulative Distribution Function (CDF) is the probability that the variable takes a value less than or equal to x.

In [None]:
# Binned PDF and CDF of the node count for patients who survived
counts1, bin_edges1 = np.histogram(survival_status_yes['POSITIVE_LYMPH_NODES'], bins=10, density=True)
pdf1 = counts1 / counts1.sum()  # fraction of patients in each bin
print(pdf1)
print(bin_edges1)
cdf1 = np.cumsum(pdf1)  # running total of the per-bin fractions
plt.plot(bin_edges1[1:], pdf1, label='PDF, yes')
plt.plot(bin_edges1[1:], cdf1, label='CDF, yes')
print("***********************************************************")
# Same computation for patients who did not survive
counts2, bin_edges2 = np.histogram(survival_status_no['POSITIVE_LYMPH_NODES'], bins=10, density=True)
pdf2 = counts2 / counts2.sum()
print(pdf2)
print(bin_edges2)
cdf2 = np.cumsum(pdf2)
plt.plot(bin_edges2[1:], pdf2, label='PDF, no')
plt.plot(bin_edges2[1:], cdf2, label='CDF, no')
plt.xlabel('POSITIVE_LYMPH_NODES')
plt.legend()
plt.show()

**Observation:**
83.55% of the patients who survived had a node count in the first bin, which spans the range 0 to 4.6.
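
A binned CDF depends on the choice of bin edges. As a complement, the sketch below computes an empirical CDF directly from the definition, the fraction of observations less than or equal to x, without any binning. The `nodes` values are made up for illustration and `ecdf` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Made-up node counts, for illustration only (not taken from haberman.csv).
nodes = np.array([0, 0, 1, 2, 3, 8, 15, 46])
sorted_nodes = np.sort(nodes)


def ecdf(x):
    """Fraction of observations that are less than or equal to x."""
    # side="right" counts every value equal to x as "less than or equal".
    return np.searchsorted(sorted_nodes, x, side="right") / sorted_nodes.size


print("P(nodes <= 4): ", ecdf(4))
print("P(nodes <= 46):", ecdf(46))
```

Calling `ecdf` on a value from the real column (after replacing `nodes` with, for example, `survival_status_yes['POSITIVE_LYMPH_NODES'].to_numpy()`) would give the share of survivors at or below that node count.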

**Box Plots and Violin Plots**

The box extends from the lower to upper quartile values of the data, with a line at the median. The whiskers extend from the box to show the range of the data. Outlier points are those past the end of the whiskers.

**Violin plot** is the combination of a box plot and a probability density function (PDF).
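
The quantities a box plot draws can be computed by hand. The sketch below assumes the common 1.5 × IQR whisker convention (the default in Matplotlib and seaborn) and uses made-up node counts; all variable names are illustrative.

```python
import numpy as np

# Made-up node counts, for illustration only (not taken from haberman.csv).
nodes = np.array([0, 0, 1, 1, 2, 3, 4, 7, 30])

# The box spans the lower quartile to the upper quartile, with a line at the median.
q1, median, q3 = np.percentile(nodes, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: whiskers reach the most extreme points within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = nodes[(nodes >= lower_fence) & (nodes <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the whiskers are drawn individually as outliers.
outliers = nodes[(nodes < lower_fence) | (nodes > upper_fence)]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```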

In [None]:
sns.boxplot(x='SURVIVAL_STATUS',y='AGE',data=haberman_dataset)
plt.show()
sns.boxplot(x='SURVIVAL_STATUS',y='OPERATION_YEAR',data=haberman_dataset)
plt.show()
sns.boxplot(x='SURVIVAL_STATUS',y='POSITIVE_LYMPH_NODES',data=haberman_dataset)
plt.show()

In [None]:
# violinplot is an axes-level function and has no `height` parameter, so it is omitted here
sns.violinplot(x="SURVIVAL_STATUS", y="AGE", data=haberman_dataset)
plt.show()
sns.violinplot(x="SURVIVAL_STATUS", y="OPERATION_YEAR", data=haberman_dataset)
plt.show()
sns.violinplot(x="SURVIVAL_STATUS", y="POSITIVE_LYMPH_NODES", data=haberman_dataset)
plt.show()

**Observation:**

* Patients with more than 1 positive node are less likely to survive: the more nodes, the lower the survival chances.
* A large percentage of the patients who survived had 0 nodes. Yet a small percentage of patients with no positive axillary nodes died within 5 years of the operation, so the absence of positive axillary nodes does not guarantee survival.
* Comparatively more of the patients who were operated on in 1965 did not survive for more than 5 years.
* Comparatively more patients in the age group 45 to 65 did not survive. Even so, patient age alone is not an important parameter in determining the survival of a patient.
* The box plots and violin plots for the age and year parameters give similar results, with substantial overlap of data points. The overlap for nodes is smaller than for the other features, but it still exists, so it is difficult to set a threshold that separates the two classes of patients.

# Bi-Variate Analysis 

**Two Dimensional Scatter Plot**

A scatter plot is a two-dimensional data visualization that uses dots to represent the values obtained for two different variables, one plotted along the x-axis and the other plotted along the y-axis.

In [None]:
sns.set_style("whitegrid")
# `height` replaces the old `size` keyword of FacetGrid
sns.FacetGrid(haberman_dataset, hue='SURVIVAL_STATUS', height=6).map(plt.scatter, 'AGE', 'POSITIVE_LYMPH_NODES').add_legend()
plt.show()

In [None]:
sns.set_style("whitegrid")
# `height` replaces the old `size` keyword of FacetGrid
sns.FacetGrid(haberman_dataset, hue='SURVIVAL_STATUS', height=6).map(plt.scatter, 'POSITIVE_LYMPH_NODES', 'OPERATION_YEAR').add_legend()
plt.show()

**Pair Plots**

Pair plots are useful for exploring datasets with a small number of features: they give insight into the underlying data by plotting the features against one another in pairs.

By default, this function will create a grid of Axes such that each variable in data will be shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.

In [None]:
sns.pairplot(haberman_dataset, hue= 'SURVIVAL_STATUS', height= 4)
plt.show()

**Observation:**
The plot of operation year against lymph nodes separates the two classes comparatively better than the other pairs.

# Multivariate analysis

**Contour Plot**

A contour line or isoline of a function of two variables is a curve along which the function has a constant value. It is a cross-section of the three-dimensional graph.
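
A minimal numeric illustration of this definition, using a hypothetical function f(x, y) = x² + y² rather than the density estimate that seaborn computes: every point on a circle of radius 2 gives the same value, so that circle is the contour line for the level 4.

```python
import numpy as np


def f(x, y):
    """Hypothetical surface used only to illustrate a contour line."""
    return x**2 + y**2


# Points on a circle of radius 2 centred at the origin.
theta = np.linspace(0, 2 * np.pi, 50)
x, y = 2 * np.cos(theta), 2 * np.sin(theta)

# f is constant along the circle, so the circle is the contour at level 4.
values = f(x, y)
print(values.min(), values.max())
```

In the density plots of this section the role of f is played by the estimated joint density of the two features, and each drawn curve connects points of equal estimated density.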

In [None]:
sns.jointplot(x = 'OPERATION_YEAR', y = 'AGE', data = haberman_dataset, kind = "kde")
plt.show()

In [None]:
sns.kdeplot(
    data= haberman_dataset, x="OPERATION_YEAR", y="AGE", hue="SURVIVAL_STATUS", fill=True,
)

In [None]:
sns.kdeplot(
     data= haberman_dataset, x="OPERATION_YEAR", y="AGE",
    fill=True, thresh=0, levels=100, cmap="mako",
)

# Conclusion:

* A patient's age and operation year alone are not deciding factors for survival. Still, patients younger than 35 have a better chance of survival.
* Survival chances decrease as the number of positive axillary nodes increases. We also saw that the absence of positive axillary nodes does not guarantee survival.
* Classifying the survival status of a new patient based on the given features is a difficult task because the dataset is small and imbalanced.