<pre>
<img src='https://i.imgur.com/WaGFvvh.jpg' width=500>
</pre>

The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import numpy as np

In [None]:
input_dir = "/kaggle/input/habermans-survival-data-set/haberman.csv"
df = pd.read_csv(input_dir)

In [None]:
df.head()

#### The column names are missing from the dataframe, so let's add them.

<ol>
    <li>Age: Shows the age of the patient.</li>
    <li>Treatment Year: The year in which the patient was treated, given as the year minus 1900.</li>
    <li>Lymph Nodes: Number of positive axillary lymph nodes detected.</li>
    <li>Survival Status: 
        <ul>
        <li>2: Patient died within 5 years.</li>
        <li>1: Patient survived more than 5 years.</li>
        </ul>
    </li>
</ol>

In [None]:
df.columns = ["age", "treatment_year", "lymph_nodes", "surv_status"]
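
The need to add column names suggests that the CSV has no header row. If that is the case (an assumption, not verified against the file here), `pd.read_csv` with its default settings treats the first record as the header, and that record is lost once the columns are renamed. A minimal sketch of an alternative, using a hypothetical three-row sample in place of the real file:

```python
import io

import pandas as pd

# Hypothetical three-row sample in a headerless layout (not the real file)
raw_text = "30,64,1,1\n30,62,3,1\n34,59,0,2\n"
column_names = ["age", "treatment_year", "lymph_nodes", "surv_status"]

# Default behaviour: the first record is consumed as the header row
default_read = pd.read_csv(io.StringIO(raw_text))

# header=None keeps every record as data; names= assigns the column labels
full_read = pd.read_csv(io.StringIO(raw_text), header=None, names=column_names)

print(len(default_read), len(full_read))
```

If the real file is indeed headerless, the equivalent call would be `pd.read_csv(input_dir, header=None, names=column_names)`, and the row count would be one higher than with the default settings.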

In [None]:
df.head()

In [None]:
df.info()

So the dataframe contains only integer values. We will remap the survival status values later so that they are easier to interpret.

In [None]:
df.describe()

We have 305 data points in each column.



The summary shows that the average patient was about 52 years old, was treated around 1962, and had about 4 lymph nodes. The median survival status is 1, so more than half of the patients lived more than 5 years.


The patients' ages range from 30 to 83 years. How survival changes with age cannot be read from this summary, so we look at it in the plots below.


Let's map the survival status 1 to **yes** and 2 to **no** for better intuition.

In [None]:
df['surv_status'] = df['surv_status'].map({1:"yes", 2:"no"})

In [None]:
df['surv_status'].value_counts()

So, we have 224 patients who survived more than 5 years and 81 who died within 5 years.
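
The two classes are clearly unbalanced, which is worth keeping in mind when reading the plots that follow. As a small sketch, the counts can be turned into proportions with `value_counts(normalize=True)`. The `labels` series below is a hypothetical stand-in that only reproduces the 224/81 split reported above:

```python
import pandas as pd

# Hypothetical stand-in for df['surv_status'], rebuilt from the reported counts
labels = pd.Series(["yes"] * 224 + ["no"] * 81, name="surv_status")

# normalize=True returns fractions of the total instead of raw counts
shares = labels.value_counts(normalize=True)
print(shares.round(3))
```

Roughly 73% of the patients fall into the **yes** class, so a plot that looks dominated by survivors may partly reflect this imbalance.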

In [None]:
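# Largest value of each numeric feature (surv_status, the last column, is categorical)
for col in df.columns[:-1]:
    print(f"{col} has the maximum value of {df[col].max()}")


print("\n")
print("+"*50)
print("\n")

# Smallest value of each numeric feature
for col in df.columns[:-1]:
    print(f"{col} has the minimum value of {df[col].min()}")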

# EDA

## Univariate Analysis

In [None]:
df.plot(kind="scatter", x="age", y="lymph_nodes")
plt.show()

Age does not seem to determine the number of lymph nodes. The plot shows patients with 0 nodes even at the age of 75.

Let's get better intuition based upon survival status with both age and lymph nodes.

In [None]:
sns.set_style("whitegrid")

sns.FacetGrid(df, hue="surv_status", height=8) \
    .map(plt.scatter, "age", "lymph_nodes") \
    .add_legend();
plt.show()

In [None]:
# Patients with zero lymph nodes, split by survival status
zero_nodes_survived = df[(df['surv_status'] == "yes") & (df['lymph_nodes'] == 0)]
zero_nodes_died = df[(df['surv_status'] == "no") & (df['lymph_nodes'] == 0)]

In [None]:
print(f"Patients with zero lymph nodes who survived (status 1): {len(zero_nodes_survived)}")
print(f"Patients with zero lymph nodes who died (status 2): {len(zero_nodes_died)}")

So the fewer lymph nodes a patient has, the more likely the patient is to survive, even above the age of 70.
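
Raw counts are easier to compare once they are turned into a survival rate per group. The sketch below shows one way to do that with `groupby`. The `toy` frame and its values are hypothetical and only mimic the column layout of `df`:

```python
import pandas as pd

# Hypothetical sample with the same column names as df (values are made up)
toy = pd.DataFrame({
    "lymph_nodes": [0, 0, 0, 1, 4, 12, 23, 0],
    "surv_status": ["yes", "yes", "no", "yes", "no", "no", "no", "yes"],
})

# Boolean survival flag and a two-level grouping key
survived = toy["surv_status"].eq("yes")
node_group = toy["lymph_nodes"].gt(0).map({False: "0 nodes", True: "1+ nodes"})

# The mean of a boolean series is the fraction of True values
rate_by_group = survived.groupby(node_group).mean()
print(rate_by_group)
```

On the real data, replacing `toy` with `df` would give the survival rate for patients with and without positive nodes.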

In [None]:
px.scatter_3d(df, x='age', y='treatment_year', z='lymph_nodes',
              color='surv_status')

## PDF & CDF

In [None]:
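# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.FacetGrid(df, hue="surv_status", height=5) \
    .map(sns.histplot, "age", kde=True, stat="density") \
    .add_legend()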

In [None]:
# histplot with kde=True stands in for the deprecated sns.distplot
sns.FacetGrid(df, hue="surv_status", height=5) \
    .map(sns.histplot, "lymph_nodes", kde=True, stat="density") \
    .add_legend()

In [None]:
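plt.figure(figsize=(16, 8))

for i, col in enumerate(df.drop("surv_status", axis=1).columns):
    plt.subplot(1, 3, i + 1)
    # Normalize the histogram so that the bin heights sum to 1
    counts, bins = np.histogram(df[col], bins=10, density=True)
    pdf = counts / counts.sum()
    cdf = np.cumsum(pdf)

    # Plot both curves against the right edge of each bin
    plt.plot(bins[1:], pdf, label="PDF")
    plt.plot(bins[1:], cdf, label="CDF")
    plt.xlabel(col)
    plt.legend()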


According to the CDF, almost 80% of the patients have fewer than 10 lymph nodes. Let's do a sanity check on that.

In [None]:
print(len(df[df['lymph_nodes'] < 10]) / len(df))
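
A histogram-based CDF can only be read exactly at the bin edges, which is why a value read off the plot and a direct count at a round threshold may differ slightly. The sketch below illustrates this on a hypothetical array of node counts (not the real data): the CDF at a bin edge matches the fraction of values at or below that edge.

```python
import numpy as np

# Hypothetical node counts, chosen only to illustrate the idea
values = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 9, 10, 13, 22, 30, 52])

# Ten equal-width bins from 0 to 52, so the edges fall at 5.2, 10.4, ...
counts, edges = np.histogram(values, bins=10)
cdf = np.cumsum(counts) / counts.sum()

# cdf[1] is the CDF at the right edge of the second bin (edges[2])
cdf_at_edge = cdf[1]
direct_fraction = np.mean(values <= edges[2])

print(edges[2], cdf_at_edge, direct_fraction)
```

With ten bins over a range that ends at 52, the edge closest to 10 sits at about 10.4, so the plotted CDF there counts patients with up to 10 nodes.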

In [None]:
v1 = df[((df['lymph_nodes']<10) & (df['surv_status'] == "yes"))]["lymph_nodes"].values.size
v2 = df[((df['lymph_nodes']<10) & (df['surv_status'] == "no"))]["lymph_nodes"].values.size

In [None]:
print(f"Patients with fewer than 10 lymph nodes who survived: {v1}")
print(f"Patients with fewer than 10 lymph nodes who did not survive: {v2}")

## Box plot and Violin plot

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(16, 8))
for i, col in enumerate(df.drop("surv_status", axis=1).columns):
    
    sns.boxplot(x="surv_status", y=col, data=df, ax=ax[i])


In [None]:
fig, ax = plt.subplots(1, 3, figsize=(16, 8))
for i, col in enumerate(df.drop("surv_status", axis=1).columns):
    
    sns.violinplot(x="surv_status", y=col, data=df, ax=ax[i])


#### Age:

The average ages of the patients who survived and of those who did not are almost the same, so age does not seem to help much in separating the two groups.

In [None]:
df[(df['surv_status'] == "yes")]["age"].mean()

In [None]:
df[(df['surv_status'] == "no")]["age"].mean()
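
Both means can also be obtained in a single step, together with the median and the group size, which makes the comparison easier to read. A minimal sketch on a hypothetical frame (the ages are made up, only the column names follow `df`):

```python
import pandas as pd

# Hypothetical sample; only the column names follow the notebook's df
toy = pd.DataFrame({
    "age": [30, 45, 60, 52, 70, 43],
    "surv_status": ["yes", "yes", "yes", "no", "no", "no"],
})

# One row per survival class, one column per statistic
age_summary = toy.groupby("surv_status")["age"].agg(["mean", "median", "count"])
print(age_summary)
```

On the notebook's data the same call would be `df.groupby("surv_status")["age"].agg(["mean", "median", "count"])`.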

#### Treatment year

In [None]:
# Count 5-year survivors per treatment period (rows, not cells)
v1 = len(df[(df['treatment_year'] < 60) & (df['surv_status'] == "yes")])

In [None]:
v2 = len(df[(df['treatment_year'] >= 60) & (df['surv_status'] == "yes")])

In [None]:
print(f"Share of 5-year survivors treated before 1960: {(v1/(v1+v2))*100:.1f}%")
print(f"Share of 5-year survivors treated in 1960 or later: {(v2/(v1+v2))*100:.1f}%")

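At first glance this looks encouraging, since it suggests that outcomes improved over time. However, the two periods cover different numbers of years and patients, so these percentages alone are weak evidence that the fatality of this cancer type was reduced.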
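
A comparison that does not depend on how many patients fall into each period is the survival rate within each period, that is, survivors divided by all patients treated in that period. The sketch below contrasts the two measures on a hypothetical frame (all values are made up, and the period labels are illustrative names):

```python
import pandas as pd

# Hypothetical sample; values are invented to show how the two measures differ
toy = pd.DataFrame({
    "treatment_year": [58, 58, 59, 59, 60, 62, 64, 66, 67, 69],
    "surv_status": ["yes", "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes"],
})

survived = toy["surv_status"].eq("yes")
period = toy["treatment_year"].lt(60).map({True: "before 1960", False: "1960 or later"})

# Survival rate inside each period: survivors / all patients of that period
rate_within_period = survived.groupby(period).mean()

# Share of all survivors that belongs to each period (depends on period size)
share_of_survivors = period[survived].value_counts(normalize=True)

print(rate_within_period)
print(share_of_survivors)
```

In this made-up sample the later period holds the larger share of survivors simply because it contains more patients, while its survival rate is actually lower. The same check on `df` would show which of the two readings the real data supports.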

# Bivariate Analysis

## Pair Plot

In [None]:
sns.set_style("whitegrid")

sns.pairplot(df, hue="surv_status", height=5)

Only **lymph nodes** looks like a useful feature for deciding the survival status. The other plots overlap too much.

## Correlational Matrix plot with Heatmap

In [None]:
# surv_status now holds strings, so restrict the correlation to numeric columns
corr_mat = df.corr(numeric_only=True)

plt.figure(figsize=(16, 8))
sns.heatmap(corr_mat, annot=True)
plt.show()

The highest correlation we get is about 0.093 (9.3%), between age and treatment year, so the features are only weakly correlated with each other.
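
Because the survival status was mapped to strings, it cannot take part in the correlation matrix. One possible way to see how each feature relates to the outcome is to encode it as 0/1 first. This is only a sketch: the `toy` frame is hypothetical, and `survived` is an illustrative column name that does not exist in the notebook.

```python
import pandas as pd

# Hypothetical sample with the notebook's column names (values are made up)
toy = pd.DataFrame({
    "age": [30, 34, 41, 47, 52, 58, 63, 70],
    "lymph_nodes": [0, 0, 1, 2, 8, 15, 20, 0],
    "surv_status": ["yes", "yes", "yes", "yes", "no", "no", "no", "yes"],
})

# Encode the outcome numerically: 1 = survived, 0 = died
toy["survived"] = toy["surv_status"].map({"yes": 1, "no": 0})

# numeric_only=True skips the string column and keeps the encoded one
corr_with_target = toy.corr(numeric_only=True)
print(corr_with_target["survived"])
```

A binary target limits what a Pearson correlation can express, so the result is best read as a rough indication rather than a precise measure.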

### The End :)