# Haberman's Survival: Exploratory Data Analysis

Source: https://www.kaggle.com/gilsousa/habermans-survival-data-set

## Haberman's Cancer Survival: EDA
Haberman's survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

## Understanding the data set


### Attribute Information:
- Age of patient at the time of operation (numerical) - Feature 1
- Patient's year of operation (year - 1900, numerical) - Feature 2
- Number of positive axillary nodes detected (numerical) - Feature 3
- Survival status (class attribute): this variable has two values, which makes this a binary classification problem
  - 1 = the patient survived 5 years or longer 
  - 2 = the patient died within 5 years
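
The encoding above can be made easier to read during exploration. The following is a minimal sketch, assuming the column names used in this notebook; the rows in `toy` are hypothetical values for illustration only, and `status_labels`, `survival_label`, and `operation_year_full` are names introduced here, not part of the data set.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; they are not taken from haberman.csv.
toy = pd.DataFrame({
    "age": [30, 45, 62],
    "operation_year": [64, 58, 66],   # stored as year - 1900
    "axillary_nodes": [1, 0, 12],
    "survival_status": [1, 1, 2],
})

# Map the numeric class codes to readable labels (label names are assumptions).
status_labels = {1: "survived 5+ years", 2: "died within 5 years"}
toy["survival_label"] = toy["survival_status"].map(status_labels)

# Recover the four-digit year from the "year - 1900" encoding.
toy["operation_year_full"] = toy["operation_year"] + 1900

print(toy)
```

Keeping the original numeric columns and adding the readable ones alongside leaves the rest of the analysis unchanged.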

In [None]:
df_haberman = pd.read_csv('../input/habermans-survival-data-set/haberman.csv')
df_haberman.columns = ['age', 'operation_year', 'axillary_nodes', 'survival_status']
df_haberman.head()

In [None]:
df_haberman.shape

In [None]:
df_haberman.describe()

In [None]:
df_haberman.nunique()

In [None]:
df_haberman.operation_year.unique()


## Cleaning the data set

In [None]:
df_haberman.isnull().sum()

In [None]:
df_haberman_duplicated = df_haberman[df_haberman.duplicated()]
df_haberman_duplicated

In [None]:
df_haberman = df_haberman.drop_duplicates()
df_haberman.head()

In [None]:
df_haberman.describe()

## Analysing each variable

In [None]:
sns.set(font_scale = 1.5)
plt.subplots(figsize = (18,8));
sns.countplot(data = df_haberman, x = 'age');
plt.xticks(rotation=88)
plt.title('Age');

### Observation: 
- Ages range from 30 to 83.
- Age 52 has the largest number of patients.
- Ages 71, 75, 76, 77, 78, and 83 have only one patient each.
- Most patients are between 38 and 65 years old.
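
Observations like these can also be checked numerically rather than read off the plot. This is a small sketch using `value_counts`; `toy_age` holds hypothetical ages for illustration, and on the real data the same calls would be applied to `df_haberman['age']`.

```python
import pandas as pd

# Hypothetical ages for illustration only; not the real column.
toy_age = pd.Series([30, 52, 52, 52, 47, 47, 83], name="age")

age_counts = toy_age.value_counts()

# Age with the most patients.
most_common_age = age_counts.idxmax()

# Ages that occur exactly once.
singleton_ages = sorted(age_counts[age_counts == 1].index)

print("range:", toy_age.min(), "to", toy_age.max())
print("most common age:", most_common_age)
print("ages with one patient:", singleton_ages)
```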

In [None]:
plt.subplots(figsize = (12,6));
sns.countplot(data = df_haberman, x = 'operation_year');
plt.title('Patient\'s year of operation');

### Observation: 
- Operation years range from 1958 to 1969.
- The largest number of patients had their operation in 1958.
- The smallest number of patients had their operation in 1969.

In [None]:
plt.subplots(figsize = (14,7));
sns.countplot(data = df_haberman, x = 'axillary_nodes');
plt.title('Number of positive axillary nodes detected');

### Observation: 
- Most patients have no positive axillary nodes.
- Very few patients have more than 23 positive axillary nodes.
- The maximum number of positive axillary nodes observed is 52.

In [None]:
print(df_haberman['survival_status'].value_counts())

lcolors = ['#30c9a1', '#f54761']
labels = '1 = longer survival rate', '2 = shorter survival rate'
explode = (0, 0.1)
patches, texts, junk = plt.pie(df_haberman.survival_status.value_counts(),  explode=explode, labels=labels, autopct='%.2f%%', colors=lcolors)
plt.tight_layout()
plt.show()

colors = {1 : '#2eb82e', 2 : '#ff5050'}
plt.subplots(figsize = (2,5));
sns.countplot(data = df_haberman, x = 'survival_status',  palette=colors);
plt.title('Survival status');

## Analysing each feature variable with respect to class attribute variable i.e., survival_status

In [None]:
plt.subplots(figsize = (10,5));
a = sns.boxplot(x=df_haberman.survival_status, y=df_haberman.age, palette=colors)
plt.show()

### Observation:
- Younger patients have a better chance of survival.
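
A box plot comparison can be backed up with per-class summary statistics. The sketch below assumes the notebook's column names; the `toy` rows are hypothetical and chosen only to show the shape of the output, so they say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; they are not taken from haberman.csv.
toy = pd.DataFrame({
    "age": [30, 40, 50, 45, 55, 65],
    "survival_status": [1, 1, 1, 2, 2, 2],
})

# Median, minimum and maximum age within each survival class.
age_by_status = toy.groupby("survival_status")["age"].agg(["median", "min", "max"])
print(age_by_status)
```

On the real data, replacing `toy` with `df_haberman` would give the numbers behind the boxes drawn above.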

In [None]:
plt.subplots(figsize = (10,5));
a = sns.boxplot(x=df_haberman.survival_status, y=df_haberman.operation_year, palette=colors)
plt.show()

### Observation:
- Patients treated after 1966 have a higher chance of surviving compared to the rest.

In [None]:
plt.subplots(figsize = (10,5));
a = sns.boxplot(x=df_haberman.survival_status, y=df_haberman.axillary_nodes, palette=colors)
plt.show()

### Observation:
- Patients with fewer positive axillary nodes have a better chance of survival.
- Outliers are present and need to be removed before feeding this data set to a model.
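
One common way to choose an outlier cutoff is the 1.5 × IQR rule, which is also the default whisker length in seaborn's box plot. The sketch below illustrates that rule only; it is not necessarily how the cutoff used in this notebook was chosen. `iqr_upper_fence` and `toy_nodes` are hypothetical names and values introduced for illustration.

```python
import pandas as pd

def iqr_upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the upper fence of the usual box plot rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)

# Hypothetical node counts for illustration only.
toy_nodes = pd.Series([0, 0, 0, 1, 1, 2, 3, 4, 7, 30])

fence = iqr_upper_fence(toy_nodes)
outliers = toy_nodes[toy_nodes > fence]

print("upper fence:", fence)
print("outliers:", outliers.tolist())
```

Because the node counts are heavily skewed towards 0, a rule like this can flag a large share of the patients who died within 5 years, so it is worth checking how many rows of each class are removed.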

In [None]:
# Treat rows with more than 8 positive axillary nodes as outliers and drop them
df_outliers = df_haberman[df_haberman["axillary_nodes"]>8]
df_haberman = df_haberman[df_haberman["axillary_nodes"]<9]
print(df_outliers)
print("Count of outliers:", len(df_outliers))

In [None]:
plt.subplots(figsize = (18,8));
sns.countplot(data = df_haberman, x = 'age', hue = 'survival_status', palette=colors);
plt.legend(loc='upper right');
plt.xticks(rotation=88)
plt.title('Age / Survival status');

### Observation:
- The 37 to 70 age range contains most of the patients who survived 5 years or longer.
- Ages 52, 53, 46, 45, and 61 have the most patients who did not survive 5 years.

In [None]:
plt.subplots(figsize = (12,6));
sns.countplot(data = df_haberman, x = 'operation_year', hue = 'survival_status', palette=colors);
plt.legend(loc='upper right');
plt.title('Operation Year / Survival status');

### Observation:
- From 1966 onwards, the number of patients who survived 5 years or longer is comparatively high.
- The rate of patients who died within 5 years kept decreasing from 1966 to 1969.
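
Count plots show absolute numbers, so a rate per year is easier to judge when computed directly. This is a minimal sketch, assuming the notebook's column names; the `toy` rows are hypothetical and `survived` is a helper column introduced here.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; they are not taken from haberman.csv.
toy = pd.DataFrame({
    "operation_year": [58, 58, 58, 66, 66, 69],
    "survival_status": [1, 2, 2, 1, 1, 1],
})

# Fraction of patients per operation year who survived 5 years or longer.
survival_rate = (
    toy.assign(survived=toy["survival_status"].eq(1))
       .groupby("operation_year")["survived"]
       .mean()
)
print(survival_rate)
```

Years with few patients give noisy rates, so the group sizes should be read alongside the rates.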

In [None]:
plt.subplots(figsize = (14,7));
sns.countplot(data = df_haberman, x = 'axillary_nodes', hue = 'survival_status', palette=colors);
plt.legend(loc='upper right');
plt.title('Axillary Nodes / Survival status');

### Observation:
- Patients with 0 positive axillary nodes survived more often.

## Analysing combination of 2 feature variables with respect to class attribute variable i.e., survival_status

In [None]:
sns.relplot(data=df_haberman, x="age", y="operation_year", hue="survival_status", palette=colors)

### Observation:
- Almost all patients aged 40 or younger survived, regardless of operation year.
- Older patients have a better chance of survival in later operation years.

In [None]:
sns.relplot(data=df_haberman, x="operation_year", y="axillary_nodes", hue="survival_status", palette=colors)

### Observation:
- Patients with 0 axillary nodes survived in all operation years.
- Patients with a large number of axillary nodes have a better chance of survival in later operation years.

In [None]:
sns.relplot(data=df_haberman, x="axillary_nodes", y="age", hue="survival_status", palette=colors)

### Observation:
- Almost all patients aged 40 or younger survived.
- Patients with 0 axillary nodes have a better chance of survival.
- Older patients have a better chance of survival when they have fewer axillary nodes.

## Analysing combination of 3 feature variables with respect to class attribute variable i.e., survival_status

In [None]:
fig = px.scatter_3d(df_haberman, x = 'operation_year', y = 'age', z = 'axillary_nodes', color = 'survival_status', size_max = 20, opacity=0.8)
fig.show()

### Observation:
- More axillary nodes are associated with lower survival, and vice versa.
- Higher age is associated with lower survival, and vice versa.
- Earlier operation years are associated with lower survival, and vice versa.

## We are provided with a sample of labeled data to train a model, and the selected model should predict the output on that basis, so we use supervised ML models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from prettytable import PrettyTable

In [None]:
print(df_haberman['survival_status'].value_counts())

In [None]:
# Create a balanced data set by undersampling the majority class (status 1)
df_haberman_1 = df_haberman.loc[df_haberman["survival_status"]==1]
df_haberman_1 = df_haberman_1.sample(n=52)
df_haberman_2 =  df_haberman.loc[df_haberman["survival_status"]==2]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_haberman_b = pd.concat([df_haberman_1, df_haberman_2])
df_haberman_b.reset_index(drop = True, inplace = True)
print(df_haberman_b.head())

In [None]:
## Imbalanced
x = df_haberman.drop(['survival_status'], axis=1)
y = df_haberman['survival_status']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=11, stratify=y)

## Balanced
xb = df_haberman_b.drop(['survival_status'], axis=1)
yb = df_haberman_b['survival_status']
xb_train, xb_test, yb_train, yb_test = train_test_split(xb, yb, test_size=0.25, random_state=11, stratify=yb)

### 0. DecisionTreeClassifier

In [None]:
def dtcm(xtrain,ytrain,xtest,ytest):
    return DecisionTreeClassifier(criterion="entropy",max_depth=2,splitter='best', min_samples_split=2).fit(xtrain,ytrain) 

### 1. SupportVectorMachine

In [None]:
def svmm(xtrain,ytrain,xtest,ytest):
    return SVC(kernel='linear').fit(xtrain,ytrain)

### 2. LogisticRegression

In [None]:
def lrm(xtrain,ytrain,xtest,ytest):
    return LogisticRegression(solver='lbfgs').fit(xtrain,ytrain) 

### 3. Naive Bayes

In [None]:
def nbm(xtrain,ytrain,xtest,ytest):
    return GaussianNB().fit(xtrain,ytrain)

### 4. k-nearest neighbors

In [None]:
def knnm(xtrain,ytrain,xtest,ytest):
    return KNeighborsClassifier(n_neighbors=5).fit(xtrain,ytrain)

### Model

In [None]:
def ml_model(xtrain,ytrain,xtest,ytest,model):
    clf = ""
    model_name = ""
    if(model==0):
        clf = dtcm(xtrain,ytrain,xtest,ytest)
        model_name = "DecisionTreeClassifier"
    elif(model==1):
        clf = svmm(xtrain,ytrain,xtest,ytest)
        model_name = "SupportVectorMachine"
    elif(model==2):
        clf = lrm(xtrain,ytrain,xtest,ytest)
        model_name = "LogisticRegression"
    elif(model==3):
        clf = nbm(xtrain,ytrain,xtest,ytest)
        model_name = "NaiveBayes"
    elif(model==4):
        clf = knnm(xtrain,ytrain,xtest,ytest)
        model_name = "k-nearest neighbors"
    ypred = clf.predict(xtest)
    accuracy = metrics.accuracy_score(ytest,ypred)
    precision = metrics.precision_score(ytest,ypred,zero_division=1)
    return [model_name, accuracy, precision]

In [None]:
report = PrettyTable()
report.field_names = ["Data Set", "Model", "Accuracy", "Precision"]
for model in range(0,5):
    item = ['Imbalanced']
    item.extend(ml_model(x_train, y_train, x_test, y_test, model))
    report.add_row(item)
    item = ['Balanced']
    item.extend(ml_model(xb_train, yb_train, xb_test, yb_test, model))
    report.add_row(item)
print(report)

#### Observation:
- From the table above, LogisticRegression appears to give the best results.
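
With labels 1 and 2, `precision_score` uses `pos_label=1` by default, so the precision in the table refers to the "survived" class. A confusion matrix and the recall for class 2 can help show how well the patients who died within 5 years are detected. The sketch below uses hypothetical `y_true` and `y_pred` lists for illustration only, not results from the models above.

```python
from sklearn import metrics

# Hypothetical labels and predictions for illustration only.
y_true = [1, 1, 1, 2, 2, 1, 2, 1]
y_pred = [1, 1, 2, 2, 1, 1, 2, 1]

# Rows are true classes, columns are predicted classes, in the order [1, 2].
cm = metrics.confusion_matrix(y_true, y_pred, labels=[1, 2])

# Precision for the "survived" class (label 1).
precision_1 = metrics.precision_score(y_true, y_pred, pos_label=1)

# Recall for the "died within 5 years" class (label 2).
recall_2 = metrics.recall_score(y_true, y_pred, pos_label=2)

print(cm)
print("precision (class 1):", precision_1)
print("recall (class 2):", recall_2)
```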

In [None]:
model = lrm(x_train,y_train,x_test,y_test)
print('Survival Cases:')
data = [[33,60,2], [43,65,1], [50,66,0]]
for row in data:
    y_hat = model.predict_proba([row])
    prob_survive = y_hat[0, 0] * 100
    print('~> data=%s, Survival=%.3f%%' % (row, prob_survive)) 


print('Non-Survival Cases:')
data = [[34,65,13], [54,60,21], [42,69,9]]

for row in data:
    y_hat = model.predict_proba([row])
    prob_survive = y_hat[0, 0] * 100
    print('~> data=%s, Survival=%.3f%%' % (row, prob_survive)) 
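
The columns returned by `predict_proba` follow the order of `clf.classes_`, so indexing column 0 relies on class 1 being listed first. Looking the column up by label makes that assumption explicit. This is a minimal sketch on a hypothetical training set (`X`, `y`) invented for illustration; `survived_col` is a name introduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training rows [age, operation_year, axillary_nodes]; illustration only.
X = np.array([
    [30, 60, 0], [35, 62, 1], [40, 66, 0],
    [60, 59, 20], [65, 58, 15], [70, 60, 25],
])
y = np.array([1, 1, 1, 2, 2, 2])

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)

# Find the predict_proba column that corresponds to label 1 (survived).
survived_col = list(clf.classes_).index(1)

proba = clf.predict_proba([[45, 65, 2]])
prob_survive = proba[0, survived_col]
print("Survival=%.3f%%" % (prob_survive * 100))
```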