In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under 
# the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## 1. An overview of data

In [None]:
# The file has no header row, so we supply the column names given in the dataset description
# Original source of the dataset https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival
data = pd.read_csv(filepath_or_buffer="../input/habermans-survival-data-set/haberman.csv", 
                   header=None, names=["age", "year", "axil_nodes", "surv_status"])
data.head()

We have four attributes for this dataset: 
* **age:** age of patient at the time of operation 
* **year:** patient's year of operation (year - 1900s)
* **axil_nodes:** number of positive axillary nodes detected
* **surv_status:** survival status, 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years

In [None]:
# Shape of the dataset
data.shape

In [None]:
# Info on the columns of the dataset
data.info()

* All the attributes are non-null, that is, we don't have any missing values.
* All the attributes are of integer (numerical) type.

**NOTE:**
**year** and **surv_status** are of integer type because they hold numerical values. But if we think carefully, they behave more like categorical attributes, since their ordering does not matter here. For example, 1964 is greater than 1962, but that relation is of no value here. For us, 1964 and 1962 are just two years. Similarly, for **surv_status**, 2 > 1, but that ordering is of no use here. So we treat them as categorical values.

Let's change their types.

In [None]:
data["year"] = data["year"].astype('category')
data["surv_status"] = data["surv_status"].astype('category')

* The data type of these two columns has changed now. Let's also recode the values of surv_status to make them easier to read.
* From now on, 0 = the patient died within 5 years, 1 = the patient survived 5 years or longer.

In [None]:
# Let's change the categorical values also to make it easier for us
data["surv_status"] = data["surv_status"].apply(lambda x: 0 if x == 2 else 1)
data["surv_status"].value_counts()

In [None]:
data.info()

In [None]:
# Any null values?
data[data.isnull().any(axis=1)]

* No rows with any null values

In [None]:
data.describe()

* The range of age is [30, 83] and that of axil_nodes is [0, 52].
* 75% of patients have 4 or fewer axillary nodes, but the highest number of axil_nodes is 52.
* There may be outliers in the number of axillary nodes (though not because of data errors).

In [None]:
# Patients who had 0 positive axillary nodes but still did not survive more than 5 years.
data[(data["axil_nodes"] == 0) & (data["surv_status"] == 0)].shape[0]

In [None]:
# Distribution of target variable

# https://matplotlib.org/3.1.1/gallery/pie_and_polar_charts/pie_features.html#sphx-glr-gallery-pie-and-polar-charts-pie-features-py

labels = ['Survived>=5years', 'Survived<5years']
surv_status = list(data["surv_status"].value_counts())
explode = (0, 0)  

fig1, ax1 = plt.subplots()
ax1.pie(surv_status, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal') 

plt.show()

### Conclusions:
* There are 306 data points and 4 features, including the target class of survival status.
* It is fair to treat year as a categorical feature.
* age and axil_nodes are numerical features.
* There are no null values.
* The target variable has two class labels, split in a ratio of almost 3:1. (Is that enough to call the dataset imbalanced? You can check it out [here](https://datascience.stackexchange.com/questions/11788/when-should-we-consider-a-dataset-as-imbalanced))
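
The class ratio mentioned in the last point can be computed directly rather than read off the pie chart. The sketch below uses a small made-up label series (not the Haberman data), and `class_balance` is a hypothetical helper name introduced only for this illustration.

```python
import pandas as pd

# Hypothetical toy labels (NOT the Haberman data): 1 = survived, 0 = died.
toy_status = pd.Series([1, 1, 1, 0, 1, 1, 0, 1])


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class label, largest class first."""
    counts = labels.value_counts()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})


balance = class_balance(toy_status)

# Ratio of the largest class to the smallest class.
imbalance_ratio = balance["count"].max() / balance["count"].min()

print(balance)
print(f"majority:minority ratio = {imbalance_ratio:.2f}:1")
```

On the real data, the same helper could be called as `class_balance(data["surv_status"])`.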


## 2. Univariate Analysis

### 2.1. Age

In [None]:
#  What is the age group of patients?
sns.distplot(a=data["age"], bins=30, color="g")

In [None]:
sns.set_style("whitegrid")
sns.violinplot(y="age", data=data, orient="h", color="g")

* Most patients are between 40 and 65 years old.

We can verify this by looking at the 15 most frequent age values.

In [None]:
data["age"].value_counts()[:15]

In [None]:
sum(data["age"].value_counts()[:15])

* They account for more than 50% of the data points, and they lie between 40 and 65, with age 38 poking in between. :) 
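
The "share of the data covered by the most frequent values" can also be expressed as a single number. This is a minimal sketch on made-up ages (not the Haberman data); `top_k` is an illustrative parameter, set to 2 here only because the toy series is tiny.

```python
import pandas as pd

# Hypothetical toy ages, NOT the Haberman data.
toy_age = pd.Series([40, 40, 40, 52, 52, 38, 61, 70])

# value_counts() sorts by frequency, most frequent first.
counts = toy_age.value_counts()

top_k = 2
# Fraction of all rows that fall on the top_k most frequent ages.
share = counts.iloc[:top_k].sum() / counts.sum()

print(f"top {top_k} ages cover {share:.1%} of the rows")
```

For the notebook's data, the equivalent would use `data["age"]` and `top_k = 15`.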

### 2.2. Year

In [None]:
sns.set()
# ax = sns.distplot(a=data["year"], bins=20, hist_kws={"rwidth":1})
sns.countplot(x="year", data=data, palette="deep")
plt.tight_layout()

* Fewer surgeries happened in 1968 and 1969 than in the other years.
* 1958, 1963 and 1964 have the most surgeries (>= 30).

### 2.3. Axillary Nodes

In [None]:
# Distribution of number of axillary nodes
plt.figure(figsize=(12, 6))
sns.distplot(a=data["axil_nodes"], bins=30)

In [None]:
plt.figure(figsize=(15, 8))
sns.countplot(x="axil_nodes", data=data, palette="deep")

* More than 130 patients had 0 axillary nodes.
* More than 200 patients have 4 or fewer axillary nodes.
* Cases with higher numbers of axillary nodes are very rare, which makes their counts look almost random.

In [None]:
# Number of patients having less than or equal to 4 axillary nodes
data[data["axil_nodes"] <= 4].shape[0]

Out of 306 data points, 230 patients have less than or equal to 4 axillary nodes.

In [None]:
plt.figure(figsize=(10, 5))
ax = sns.boxplot(x="axil_nodes", data=data, palette="rocket")
ax.set_xticks(np.linspace(0, 50, 11))
plt.show()

* We can see that 75% of patients have fewer than 5 axillary nodes.
* Many data points fall outside the whiskers, which suggests they might be outliers.

Let's do the outlier analysis using quantiles.

In [None]:
# 25th percentile value
q1 = data["axil_nodes"].quantile(q=0.25)
print("25th percentile value: " + str(q1))

# 75th percentile value
q3 = data["axil_nodes"].quantile(q=0.75)
print("75th percentile value: " + str(q3))


# Interquartile range (IQR) and the usual 1.5 * IQR whisker length
iqr = q3 - q1
whisker = 1.5 * iqr
print("Interquartile range: " + str(iqr))
print("1.5 * IQR: " + str(whisker))

upper_bound = q3 + whisker
print("UpperBound value: " + str(upper_bound))

lower_bound = q1 - whisker
print("LowerBound value: " + str(lower_bound))

* Anything above 10 or below -6 (the latter is not possible in our case) will be considered an outlier.
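
The rule used here is the common 1.5 × IQR rule (Tukey's fences): with a first quartile of 0 and a third quartile of 4, the bounds are 0 - 1.5 × 4 = -6 and 4 + 1.5 × 4 = 10. A reusable version might look like the sketch below; `tukey_fences` is a hypothetical helper name, and the input values are made up rather than taken from the dataset.

```python
import numpy as np


def tukey_fences(values, k=1.5):
    """Return (lower, upper) bounds; points outside them are flagged as outliers."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical toy values, NOT the Haberman data.
toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

lower, upper = tukey_fences(toy)
outliers = toy[(toy < lower) | (toy > upper)]

print(lower, upper, outliers)
```

The multiplier `k=1.5` is a convention, not a law; a larger `k` flags fewer points.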

In [None]:
data["outlier_or_not"] = data["axil_nodes"].apply(lambda x: "yes" if x > upper_bound or x < lower_bound else "no")

data["outlier_or_not"].value_counts()

* We have 40 instances where the number of axil_nodes can be termed an outlier, but we know that these are genuine cases and not the result of any errors.
* The number of positive axillary nodes can be due to many causes.

#### Conclusions: 
* Almost 50% of patients are in the 45-60 age group.
* Fewer surgeries happened in 1968 and 1969 than in the other years.
* 1958, 1963 and 1964 have the most surgeries (>= 30).
* More than 75% of patients have 4 or fewer axillary nodes.
* A few patients have unusually high values, ranging up to 52 axillary nodes.
* The target variable (surv_status) is somewhat imbalanced.

## 3. Multivariate Analysis

Since we are mainly interested in the survival status of patients and how the other features relate to it, we will focus on exploring that relationship.

### 3.1. Survival Status variation with Age

In [None]:
sns.FacetGrid(data, hue="surv_status", height=6).map(sns.distplot, "age", bins=10)

The distributions in the plot above overlap, so let's look at a boxplot.

In [None]:
sns.boxplot(x="surv_status", y="age", data=data)

In [None]:
sns.violinplot(x="surv_status", y="age", data=data, inner="quartile")

We can see that age in non-survival cases reaches a higher upper end, but most of the data overlaps. Age cannot clearly distinguish between the two classes of survival status.

Let's see the number of cases at each age.

In [None]:
plt.figure(figsize=(20, 10))
sns.set_context("notebook")
grouped_data = data.groupby(by=["age", "surv_status"]).size().reset_index(name="counts")
ax = sns.lineplot(x="age", y="counts", data=grouped_data, markers=True, 
                  hue="surv_status", style="surv_status", dashes=True, palette='bright')
ax.set_xticks(np.linspace(30, 80, 51))
plt.show()

At most ages, more people survived than died; the exceptions are two ages, 46 and 53.
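
Instead of reading the exceptions off the line plot, the ages where deaths outnumber survivals can be listed with a cross-tabulation. The rows below are made up for illustration (not the Haberman data), so the ages printed are not the ones from the plot.

```python
import pandas as pd

# Hypothetical toy rows, NOT the Haberman data: 1 = survived, 0 = died.
toy = pd.DataFrame({
    "age":         [35, 35, 35, 42, 42, 42, 50, 50, 50, 61],
    "surv_status": [1,  1,  0,  0,  0,  1,  0,  0,  1,  1],
})

# Rows = age, columns = survival status, cells = number of patients.
counts = pd.crosstab(toy["age"], toy["surv_status"])

# Ages where deaths (column 0) strictly outnumber survivals (column 1).
ages_more_deaths = counts[counts[0] > counts[1]].index.tolist()

print(counts)
print(ages_more_deaths)
```

Using `>=` instead of `>` would also include ages where the two counts are tied.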

### 3.2. Survival Status variation with Year

In [None]:
plt.figure(figsize=(8, 4))
sns.countplot(x="year", hue="surv_status", data=data)

Apart from the years 65 and 69, the ratio of survival to non-survival cases hovers around 2:1.

In [None]:
# Convert these two features to numeric for the boxplot; otherwise it will throw an error
convert_dict = {'surv_status': int, 
                'year': int } 

sns.boxplot(x="surv_status", y="year", data=data.astype(convert_dict))

People who did not survive come more from the early years of operation, whereas people who survived come more from later years. 
This may be because of advancements in medical science, but there is no clear pattern as such.

In [None]:
plt.figure(figsize=(12, 6))
sns.set_context("notebook")
grouped_data = data.groupby(by=["year", "surv_status"]).size().reset_index(name="counts")
sns.lineplot(x="year", y="counts", markers=True, dashes=True, data=grouped_data, 
             hue="surv_status", style="surv_status", palette="bright")

* On average, every year there are more patients who survived for more than 5 years. That means our doctors are doing a good job. Kudos to the doctors!!!

* But the year 1965 was not so great for the patients: the gap between people who survived and people who died within 5 years is the smallest.

### 3.3.  Survival Status with axil-nodes

In [None]:
# Does a higher number of nodes mean a lower chance of survival?
sns.FacetGrid(data, hue="surv_status", height=7).map(sns.kdeplot, "axil_nodes", fill=True).add_legend()

* If you have 4 or fewer axillary nodes, the chances of survival are higher; for 5 to 20 nodes, they are lower.

This is verified in the code below.

In [None]:
# Cases where axil_nodes are less than or equal to 4
data[data["axil_nodes"] <= 4]["surv_status"].value_counts()

In [None]:
# Cases where axil_nodes are more than 4
data[(data["axil_nodes"] > 4)]["surv_status"].value_counts()

In [None]:
plt.figure(figsize=(16, 8))
sns.countplot(x="axil_nodes", data=data, hue="surv_status")

* As we can see, most of the patients have fewer than 4 axillary nodes, and they mostly survived for more than 5 years. For other numbers of axillary nodes, no clear statement can be made.

In [None]:
# data.groupby(by=["axil_nodes", "surv_status"]).count()
plt.figure(figsize=(16, 7))
grouped_data = data.groupby(by=["axil_nodes", "surv_status"]).size().reset_index(name="counts")
ax = sns.lineplot(x="axil_nodes", y="counts", data=grouped_data, markers=True, 
                  hue="surv_status", style="surv_status", dashes=True, palette='bright')

Beyond 4 axillary nodes, the two curves overlap too much.

In [None]:
plt.figure(figsize=(6, 8))
sns.boxplot(x="surv_status", y="axil_nodes", data=data, orient='v', palette="bright")

* Survival cases generally had very few axillary nodes, while non-survival cases had more.
* Apart from that there is no clear trend, because the data overlaps.

In [None]:
plt.figure(figsize=(6, 8))
sns.violinplot(x="surv_status", y="axil_nodes", data=data, orient='v', palette="deep")

### Conclusions: 
* At each age, more people survived more than 5 years than did not, except for two ages, 46 and 53.
* In most years there are almost twice as many survival cases as non-survival cases, except for year 65, where the two are almost equal.
* People who survived come more from later years than earlier years. 
* Even though we can make some statements about how age and year relate to survival status, the data shows no real pattern. These two features fail to say anything about the survival status of patients.
* For the number of axillary nodes, the data is heavily skewed. More than 75% of patients had 4 or fewer axillary nodes.
* Also, there are more survival cases when axil_nodes <= 4. For more than 4 axillary nodes, the chances of survival are lower, almost 50-50.  
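
The last point compares survival chances on either side of the 4-node threshold. One way to put a number on each side is a grouped survival rate, sketched below on made-up rows (not the Haberman data). It assumes `surv_status` is stored as integer 0/1; in this notebook the column is categorical, so something like `.astype(int)` would be needed first.

```python
import pandas as pd

# Hypothetical toy rows, NOT the Haberman data: 1 = survived, 0 = died.
toy = pd.DataFrame({
    "axil_nodes":  [0, 0, 1, 2, 4, 5, 8, 12, 20],
    "surv_status": [1, 1, 1, 0, 1, 1, 0, 0, 1],
})

# Split patients at the threshold used in the text (4 nodes).
node_group = (
    (toy["axil_nodes"] <= 4)
    .map({True: "<=4 nodes", False: ">4 nodes"})
    .rename("node_group")
)

# With 0/1 labels, the mean of surv_status is the survival rate.
summary = toy.groupby(node_group)["surv_status"].agg(
    patients="size", survival_rate="mean"
)

print(summary)
```

Reporting the group size next to the rate matters here, because a rate computed from a handful of patients is much less reliable than one computed from hundreds.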

### 3.4 Multivariate Analysis of other features

In [None]:
# HOW AXILLARY NODES AND AGE ARE RELATED
# Calculating average number of axillary nodes in each age.
grouped_data = data.groupby("age")["axil_nodes"].mean().reset_index()
plt.figure(figsize=(20, 8))
# barplot aggregates by mean itself, so the raw data is passed to keep the error bars
sns.barplot(x="age", y="axil_nodes", data=data)

Not so much. We can also see long black sticks on the bar plot; these are error bars, and their length says there is high variance in the data, which is a measure of uncertainty. That means age is not much of a factor for axillary nodes.

https://stackoverflow.com/questions/58362473/what-does-black-lines-on-a-seaborn-barplot-mean
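
As far as seaborn's defaults go, the black lines on a bar plot are bootstrapped 95% confidence intervals for the mean of each bar. The sketch below shows the general idea with a plain percentile bootstrap in NumPy; it is not seaborn's own code, `bootstrap_mean_ci` is a hypothetical helper, and the input values are made up.

```python
import numpy as np


def bootstrap_mean_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement and record the mean of each resample.
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    boot_means = samples.mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return values.mean(), low, high


# Hypothetical node counts for a single age, NOT the Haberman data.
mean, low, high = bootstrap_mean_ci([0, 0, 1, 2, 3, 30])

print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

A single large value in a small group, like the 30 above, stretches the interval a lot, which is why bars for ages with few patients get the longest sticks.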

In [None]:
plt.figure(figsize=(20, 10))
sns.boxplot(x="age", y="axil_nodes", data=data, hue="surv_status")

The plot above suggests that for the 40-60 age group, the number of axillary nodes varies widely. We can check this out.  

In [None]:
data[data["axil_nodes"] > 4]["surv_status"].value_counts()

So out of the 76 points with more than 4 axillary nodes (306 - 230), 37 of them (almost half) were non-survival cases.

Let's plot pairplots to see if we can find something there.

In [None]:
plt.figure(figsize=(20, 8))
sns.pairplot(data=data, hue="surv_status", diag_kind="hist")
plt.show()

In [None]:
plt.figure(figsize=(20, 8))
sns.pairplot(data=data, hue="surv_status", diag_kind="kde")
plt.show()

Scatter plots don't say much about the data as such. That's why we didn't plot them separately.

## Overall Conclusions:
* The data is skewed and doesn't give much room for making sense out of the features and their relations with survival status.
* Since the class label is imbalanced (for good), we always find that overall survival cases outnumber non-survival cases. 
* Age doesn't convey much apart from the fact that most of the patients were in the 45-60 age group.
* Year also doesn't show any clear relation with the survival status of patients.
* The number of positive axillary nodes is the only feature that carries some signal. 
* Patients with more than 4 axil_nodes have roughly a 50-50 chance of survival, whatever the exact number of nodes.
* Patients with 4 or fewer axil_nodes have a better chance of survival, with 0 axil_nodes being the best.

##### Please leave your comments to this kernel, positive or negative I don't mind, but try to be constructive with your feedback. That will surely help me improve.

##### If I have left anything to explore, do let me know.