# Exploratory Data Analysis

This notebook explores Haberman's survival dataset, which records patients who underwent surgery for breast cancer.

### Objective:

Classify whether a cancer patient survived or died within five years of the operation.

In [None]:
# importing libraries

import numpy as np
import pandas as pd
import plotly.express as px

In [None]:
# importing data

df = pd.read_csv("../input/habermans-survival-data-set/haberman.csv", names = ['age', 'year', 'nodes', 'status'])

In [None]:
# inspecting data

df.head()

### Observation:

1.) Data is properly loaded.

2.) There are four columns in the dataset.

In [None]:
# inspecting size of the dataset

df.shape

### Observation:

1.) There are four columns in the dataset.

2.) The dataset consists of 306 rows.

In [None]:
# inspecting datatypes of columns

df.dtypes

### Observation:

1.) All the columns are of integer data type.

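2.) There appear to be no missing values in the dataset: pandas stores a numeric column that contains NaN as float, so integer columns cannot hold NaN. This does not rule out missing values encoded as placeholder integers.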
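The dtype argument above can be backed by an explicit check. The following is a minimal sketch, assuming a small stand-in frame named `sample` (illustrative rows, not the real data); in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Illustrative stand-in rows; in the notebook, use the loaded `df` instead.
sample = pd.DataFrame(
    {
        "age": [30, 30, 31],
        "year": [64, 62, 59],
        "nodes": [1, 3, 2],
        "status": [1, 1, 2],
    }
)

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Check that the label column only contains the expected codes 1 and 2.
unexpected_status = set(sample["status"].unique()) - {1, 2}
print(unexpected_status)
```

An empty `unexpected_status` set and all-zero counts would support the observation more directly than the dtype alone.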

In [None]:
# renaming status

df.rename(columns = {'status': 'survived'}, inplace = True)

In [None]:
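# mapping status for better understanding: 1 (survived) stays 1, 2 (did not survive) becomes 0
# assigning the result back avoids relying on an in-place replace on a column selection

df['survived'] = df['survived'].replace({2: 0})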

## Univariate analysis

In [None]:
df.describe()

### Observation:

1.) The mean age of the patients is about 52, which suggests that many of the patients are middle-aged.

2.) The mean year is about 63, so the operations cluster around the early to mid 1960s.

3.) 75% of the patients have four or fewer nodes. However, there seem to be some outliers in this column, as the maximum number of nodes is 52.
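One common way to make the outlier remark concrete is the 1.5 × IQR rule. The sketch below is an assumption about how one might check it, not part of the original analysis; `upper_fence` is a hypothetical helper and `sample_nodes` holds illustrative values. In the notebook it would be called on `df['nodes']`.

```python
import pandas as pd


def upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the usual upper cut-off for flagging outliers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)


# Illustrative node counts; in the notebook, use df['nodes'] instead.
sample_nodes = pd.Series([0, 0, 1, 2, 4, 4, 8, 52])

fence = upper_fence(sample_nodes)
flagged = sample_nodes[sample_nodes > fence]
print(fence, flagged.tolist())
```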

In [None]:
df['survived'].value_counts()

### Observation:

1.) 225 people survived for more than 5 years after the operation.

2.) 81 people did not survive for 5 years after the operation.

3.) The dataset is imbalanced: about 74% of the patients (225 of 306) belong to the survived class.
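The imbalance can be quantified as class shares. This is a minimal sketch that rebuilds the label column from the counts reported above; in the notebook, `df['survived'].value_counts(normalize = True)` would give the same shares directly.

```python
import pandas as pd

# Rebuild the label column from the reported counts (225 survived, 81 did not).
survived = pd.Series([1] * 225 + [0] * 81, name="survived")

# normalize=True returns fractions instead of raw counts.
class_share = survived.value_counts(normalize=True)
print(class_share.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = class_share.max()
print(round(majority_baseline, 3))
```

The majority-class share is a useful baseline: a classifier should beat it before its accuracy is considered meaningful.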

In [None]:
px.box(x = df['nodes'], color = df['survived'], notched=True, title = 'Box plot of nodes')

### Observation:

1.) In general, patients with fewer nodes seem to survive more often, especially patients with zero nodes.
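The box plot can be complemented with a numeric summary per group. The sketch below assumes an illustrative frame named `sample` with the same `nodes` and `survived` columns; the values are made up for demonstration and are not results from the dataset.

```python
import pandas as pd

# Illustrative rows; in the notebook, use the loaded `df` instead.
sample = pd.DataFrame(
    {
        "nodes": [0, 1, 2, 4, 9, 23],
        "survived": [1, 1, 1, 0, 0, 0],
    }
)

# Median node count for each survival class (0 = did not survive, 1 = survived).
median_nodes = sample.groupby("survived")["nodes"].median()
print(median_nodes)

# Share of patients with zero nodes in each class.
zero_node_share = (sample["nodes"] == 0).groupby(sample["survived"]).mean()
print(zero_node_share)
```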

In [None]:
px.bar(x = df['year'].value_counts().index.tolist(), y = df['year'].value_counts().values.tolist(), title = 'Number of operations per year')

### Observation:

1.) The number of operations tends to decrease over the years.

In [None]:
df_byyear = df.groupby(df['year']).sum()
px.bar(x = df_byyear.index.tolist(), y = df_byyear['nodes'], title = 'Total number of nodes per year')

### Observation:

1.) The total number of nodes was high in the late 1950s and early 1960s.

In [None]:
# median age per year, drawn as a polar bar chart
df_byyear = df.groupby('year').median()
years = df_byyear.index.tolist()
mi, ma = min(years), max(years)
# spread the years over [0, 360) so the first and last bars do not overlap at 0/360 degrees
theta = [(value - mi) * 360 / (ma - mi + 1) for value in years]
px.bar_polar(theta = theta, r = df_byyear['age'], title = 'Median age per year')

### Observation:

1.) The median age of patients at the time of operation appears to decrease over the years.
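A trend read off a polar chart is easy to misjudge, so a fitted slope can serve as a rough cross-check. This is a sketch under stated assumptions: `sample` holds illustrative values, and a least-squares line through the yearly medians is only one of several reasonable ways to test for a trend.

```python
import numpy as np
import pandas as pd

# Illustrative rows; in the notebook, use the loaded `df` instead.
sample = pd.DataFrame(
    {
        "year": [58, 58, 59, 59, 60, 60],
        "age": [50, 54, 49, 53, 48, 50],
    }
)

# Median age for each year of operation.
median_age = sample.groupby("year")["age"].median()

# Fit a straight line; a negative slope means the median age falls over time.
slope, intercept = np.polyfit(median_age.index.to_numpy(), median_age.to_numpy(), 1)
print(round(slope, 2))
```

With only a dozen yearly medians in the real data, a small slope should be read as a hint rather than as evidence of a trend.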

In [None]:
px.histogram(x = df['age'], color = df['survived'], title = 'histogram of age')

### Observation:

1.) As age increases, the chance of survival appears to decrease.

2.) Most of the patients are middle-aged.

## Bivariate analysis

In [None]:
# trendline = "ols" requires the statsmodels package to be installed
px.scatter(x = df['age'], y = df['nodes'], color = df['survived'], trendline = "ols", size = df['nodes'], title = 'Age vs Nodes scatter plot')

### Observation:

1.) There is a slight decrease in the number of nodes as age increases.

In [None]:
px.box(x = df['year'], y = df['nodes'], color = df['survived'], title = 'Box plot of nodes vs year')

### Observation:

Patients who did not survive have consistently had more nodes over the years.

## Multivariate analysis

In [None]:
px.parallel_coordinates(df, color = 'survived')

### Observation:

1.) The fewer nodes a patient has, the more likely the patient is to survive.