# Haberman's Survival - Exploratory Data Analysis

**Data Description** 
The Haberman's survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

**Attribute Information:**

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
   - 1 = the patient survived 5 years or longer
   - 2 = the patient died within 5 years

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_theme()

import warnings
warnings.filterwarnings("ignore")

In [None]:
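import os

# List every file under the Kaggle input directory and remember the full paths.
data_paths = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        path = os.path.join(dirname, filename)
        print(path)
        data_paths.append(path)

# The file has no header row, so the column names are supplied explicitly.
# Loads the last file found; this assumes the input directory holds only the Haberman CSV.
df = pd.read_csv(data_paths[-1], header=None, names=['age', 'operation_year', 'positive_axillary_nodes', 'survival_status'])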

In [None]:
# get a glimpse of the data
print(df.shape)
df.head()

### Observation:
1. The data has 306 records (rows).
2. survival_status 1 means the patient survived 5 years or longer after surgery and 2 means the patient died within 5 years. It is a categorical column.
3. operation_year 65 means 1965, since the year is stored as year - 1900 (the study ran from 1958 to 1970). Let's convert it into a readable format.

In [None]:
# survival_status indicates whether the patient survived or died, so map it to readable category labels
df.survival_status = df.survival_status.map({1:'survived', 2:'die'})
#df.operation_year = (df.operation_year + 1900).astype('category')
df.operation_year = df.operation_year + 1900
df.head()

Another step in data preprocessing is null-value imputation. Let's check whether there are any null values; if there are, we will take appropriate action to replace them where possible.

In [None]:
df.isnull().sum()

## Let's keep a few statistics handy

In [None]:
# summary statistics of the data
df.describe()

In [None]:
# 75, 90 and 95 %tile of positive_axillary_nodes
print(np.percentile(df.positive_axillary_nodes,[75,90, 95]))

### Observation:
The above table tells us a lot about the data. For example:
1. The maximum patient age is 83 years. Assuming age is normally distributed and taking mean=52.5 and std=10.8 (standard deviation) for simplicity of calculation, the 68-95-99.7 rule of the normal distribution says that about 95% of the patients who had the surgery were between 31 and 74 years old (mean ± 2 std). (For more info on the normal/Gaussian distribution visit: https://www.youtube.com/watch?v=mtbJbDwqWLE)

2. For positive_axillary_nodes, look at the 25%, 50% and 75% rows; 25% is the 25th percentile value (%tile is short for percentile). Comparing the min, 25%, 75% and max values of positive_axillary_nodes already suggests a strongly right-skewed, roughly log-normal shape. How?
(For more info on the log-normal distribution visit: https://www.youtube.com/watch?v=xtTX69JZ92w)
The range of positive_axillary_nodes is 0 to 52 (min to max), yet the 75th percentile tells us that 75% of the values are <= 4, and the 95th percentile is 19.75. So only 25% of the records have a value > 4, and those are spread over the rest of the range up to 52: the distribution has a long right tail.

Please do watch those 5-minute YouTube videos to better understand what these distributions mean.
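
The arithmetic behind points 1 and 2 above can be checked in a few lines. This is only a minimal sketch: the mean and standard deviation are the rounded values quoted above, `tail_share` is a hypothetical helper that is not part of the notebook, and the node counts are a made-up toy list rather than the Haberman data.

```python
from statistics import NormalDist

# Rounded values quoted in the observation above (assumed for easy arithmetic).
mean_age, std_age = 52.5, 10.8

# 68-95-99.7 rule: about 95% of a normal distribution lies within 2 std of the mean.
low, high = mean_age - 2 * std_age, mean_age + 2 * std_age
print(f"approx. 95% range: {low:.1f} to {high:.1f}")

# Exact probability mass of a normal distribution inside that range.
age_model = NormalDist(mu=mean_age, sigma=std_age)
coverage = age_model.cdf(high) - age_model.cdf(low)
print(f"normal coverage of mean +/- 2 std: {coverage:.4f}")


def tail_share(values, threshold):
    """Hypothetical helper: fraction of values strictly above `threshold`."""
    values = list(values)
    return sum(v > threshold for v in values) / len(values)


# Toy, made-up node counts (NOT the Haberman data), just to show the call.
toy_nodes = [0, 0, 0, 1, 1, 2, 3, 4, 9, 30]
print(tail_share(toy_nodes, 4))
```

On the real data, `tail_share(df.positive_axillary_nodes, 4)` would give the share of patients with more than 4 positive nodes, which can be compared with the 25% implied by the 75th percentile.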

# EDA - Exploratory Data Analysis

##  Univariate analysis (analysis of single variable)

In [None]:
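# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
ax = sns.histplot(df.age, kde=True, stat='density')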
ax.axvline(np.percentile(df.age, 95))
plt.title('fig 1- age')
plt.show()


In [None]:
# This is really a bivariate plot, but it is shown here to understand the data better
sns.histplot(data=df, x="operation_year", hue='survival_status')
plt.show()


# # Numerical approach to find the % of patients who died in each year. Uncomment and run the cell to get the numbers.
# tmp = df.groupby(['operation_year', 'survival_status'])['age'].count().reset_index()
# tmp.columns = ['operation_year', 'survival_status', 'patient_count']
# tmp = tmp.pivot_table(index='operation_year', columns='survival_status', values='patient_count')
# # divide by all patients of that year (die + survived), not by the survivors only
# tmp['%_die'] = round(tmp.die*100.00/(tmp.die + tmp.survived), 2)
# print(tmp)

In [None]:
df.survival_status.value_counts(normalize=True)


In [None]:
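# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(df.positive_axillary_nodes, kde=True, stat='density')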

## Multivariate analysis

In [None]:
df.head()

In [None]:
sns.pairplot(data=df, hue='survival_status')

Note: a pairplot is only useful when the number of features is small (about 4 or 5 at most). For x=3 features ('age', 'operation_year', 'positive_axillary_nodes') it already draws 9 plots (x^2=3^2=9), so the grid quickly becomes unreadable as the number of features grows. In such cases we can use dimensionality reduction techniques such as PCA or t-SNE for visualization.
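
As an illustration of the dimensionality reduction idea, here is a minimal sketch of PCA computed via SVD in NumPy, which projects many numeric features onto two components that fit in a single scatter plot. `pca_2d` is a hypothetical helper, and the random matrix is synthetic stand-in data, not the Haberman dataset.

```python
import numpy as np


def pca_2d(features):
    """Hypothetical helper: project rows of `features` onto the top 2 principal components."""
    X = np.asarray(features, dtype=float)
    # Standardize each column so features on large scales do not dominate.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X @ Vt[:2].T, explained[:2]


# Synthetic stand-in data (NOT the Haberman data): 100 rows, 6 numeric features.
rng = np.random.default_rng(0)
synthetic = rng.normal(size=(100, 6))
projected, explained_ratio = pca_2d(synthetic)
print(projected.shape)   # two coordinates per row, ready for a scatter plot
print(explained_ratio)   # share of variance captured by each of the two components
```

With a wider dataset, the two columns of `projected` could be passed to a scatter plot coloured by the class label, in the same spirit as the `hue` argument of the pairplot.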

### Observation:
1. Most patients are around 50 years old, and the age distribution looks approximately normal (normal/Gaussian distribution).
2. In fig 1, the vertical line shows that 95% of patients are aged <= 72.
3. More operations were performed in 1958 than in any other year.
4. **73.5% of patients survived 5 years or longer after the operation, which indicates that the dataset is imbalanced: about 73% survived and 27% died.**
5. The plot of positive_axillary_nodes confirms the long-tailed, roughly log-normal shape analyzed earlier.
6. **In 1965 the number of deaths was almost 86% of the number of survivors, which means roughly 46% of that year's patients died, whereas 1961 had the highest share of survivors. (A numerical approach is also included as commented-out code for reference; uncomment and run the cell. The numerical approach takes more code, which is why visualization helps so much in analysis.)**
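
The per-year shares can also be computed compactly with `pd.crosstab`, whose `normalize="index"` option divides each row by its total. The sketch below uses a small made-up frame that only borrows the notebook's column names and labels, so the printed numbers are illustrative and not results from the Haberman data.

```python
import pandas as pd

# Made-up toy records (NOT the Haberman data), using the notebook's column names.
toy = pd.DataFrame({
    "operation_year": [1958, 1958, 1958, 1958, 1959, 1959],
    "survival_status": ["survived", "survived", "survived", "die", "survived", "die"],
})

# Row-normalised counts: each row sums to 1, so the 'die' column is the
# fraction of that year's patients who died within 5 years.
share = pd.crosstab(toy.operation_year, toy.survival_status, normalize="index")
pct_die = (share["die"] * 100).round(2)
print(pct_die)
```

On the real data the same call would take `df.operation_year` and `df.survival_status`. Dividing by the row total gives the share of all patients in that year, which is different from the ratio of deaths to survivors.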

## Creating hypotheses and visualising them is the best way to generate insights out of data
#### Let's make some hypotheses and then analyze them
Hypothesis 1: as technology advanced over the years, the chances of survival increased. (Because the dataset is very small, this will not give a reliable analysis.)

Hypothesis 2: does a higher number of positive_axillary_nodes mean a lower chance of survival?

:

:

hypothesis n: ......

We can draw up many hypotheses based on business and domain knowledge, but we need a bigger dataset to draw any conclusions.
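
One possible way to start looking at hypothesis 2 is to bin the node counts and compare the survival rate per bin. This is a rough sketch under stated assumptions: it reuses the notebook's column names and labels, the bin edges are arbitrary illustrative choices, and the frame is made-up toy data, so the printed rates say nothing about the real patients.

```python
import pandas as pd

# Made-up toy records (NOT the Haberman data), using the notebook's column names.
toy = pd.DataFrame({
    "positive_axillary_nodes": [0, 0, 1, 2, 3, 5, 8, 12, 20, 35],
    "survival_status": ["survived", "survived", "survived", "die", "survived",
                        "survived", "die", "die", "survived", "die"],
})

# Arbitrary illustrative bins: 0, 1-4, 5-9, and 10 or more nodes.
bins = [-1, 0, 4, 9, float("inf")]
labels = ["0", "1-4", "5-9", "10+"]
toy["node_bin"] = pd.cut(toy.positive_axillary_nodes, bins=bins, labels=labels)

# The mean of a boolean column is the fraction of True values, i.e. the survival rate.
survival_rate = (
    toy.assign(survived=toy.survival_status.eq("survived"))
       .groupby("node_bin", observed=True)["survived"]
       .mean()
)
print(survival_rate)
```

A falling survival rate across the bins would be consistent with hypothesis 2, but with a dataset this small the per-bin counts should be checked before reading much into the pattern.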

# Learning:
After doing EDA on the Haberman dataset, we can see why 1. programming, 2. domain knowledge and 3. math/stats are the three main pillars of being an analyst or data scientist.