# Survival Prediction Using Logistic Regression

* **1. Introduction**
    * 1.1 About the data
    * 1.2 General information of the data
    * 1.3 Objective
* **2. Load and check data**
    * 2.1 Load data
* **3. Modelling**

## 1. Introduction

### 1.1 About the data

The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Source: http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival

### 1.2 General information of the data

**age:** Age of patient at time of operation (numerical)


**year:** Patient's year of operation (year - 1900, numerical)


**axillary:** Number of positive axillary nodes detected (numerical)


**status:** Survival status (class attribute)

* 1 = the patient survived 5 years or longer
* 2 = the patient died within 5 years

### 1.3 Objective

* Is there any correlation between age, year of operation, or number of positive axillary nodes and survival?
* Build a binary logistic regression model to predict survival.
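
The first question can be approached with a simple correlation check. The sketch below is one possible way to do it and is not part of the original analysis: `correlation_with_status` is a hypothetical helper that recodes `status` into a 0/1 "died" indicator and computes the Pearson correlation of each feature with it.

```python
import pandas as pd


def correlation_with_status(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each feature with a 0/1 'died' indicator.

    Hypothetical helper: assumes the columns age, year, axillary and
    status described in section 1.2.
    """
    # status is coded 1 = survived, 2 = died; recode to 0/1 so the sign
    # of the coefficient reads as "association with dying within 5 years"
    died = (df["status"] == 2).astype(int)
    features = df[["age", "year", "axillary"]]
    # corrwith computes the correlation of every column with the Series
    return features.corrwith(died)
```

Once the data is loaded in section 2.1, `correlation_with_status(df)` returns one coefficient per feature. Pearson correlation only captures linear association, so a coefficient near zero does not rule out a non-linear effect.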

## 2. Load and check data

### 2.1 Load data

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Comment this out if the inline visualisations don't work on your side
%matplotlib inline

plt.style.use('bmh')

In [6]:
df = pd.read_csv('data/haberman.csv')

In [7]:
df.head()

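|   | age | year | axillary | status |
|---|-----|------|----------|--------|
| 0 | 30  | 64   | 1        | 1      |
| 1 | 30  | 62   | 3        | 1      |
| 2 | 30  | 65   | 0        | 1      |
| 3 | 31  | 59   | 2        | 1      |
| 4 | 31  | 65   | 4        | 1      |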


In [8]:
df.describe()

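|       | age       | year      | axillary | status   |
|-------|-----------|-----------|----------|----------|
| count | 306.0     | 306.0     | 306.0    | 306.0    |
| mean  | 52.457516 | 62.852941 | 4.026144 | 1.264706 |
| std   | 10.803452 | 3.249405  | 7.189654 | 0.441899 |
| min   | 30.0      | 58.0      | 0.0      | 1.0      |
| 25%   | 44.0      | 60.0      | 0.0      | 1.0      |
| 50%   | 52.0      | 63.0      | 1.0      | 1.0      |
| 75%   | 60.75     | 65.75     | 4.0      | 2.0      |
| max   | 83.0      | 69.0      | 52.0     | 2.0      |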
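
Because `status` only takes the values 1 and 2, its mean of 1.264706 over 306 rows implies that about 26.5% of patients (81) died within 5 years and 225 survived, so the classes are imbalanced. A minimal sketch for checking this directly is shown below; `class_balance` is a hypothetical helper name, not something the notebook defines.

```python
import pandas as pd


def class_balance(status: pd.Series) -> pd.DataFrame:
    """Count and share of each class label (hypothetical helper)."""
    # sort_index keeps the labels in the order 1, 2
    counts = status.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})
```

Calling `class_balance(df['status'])` on the loaded data should give the split described above, which is worth keeping in mind when reading the per-class scores in section 3.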


## 3. Modelling

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [52]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('status', axis=1), df['status'], test_size=0.2, random_state=42)
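
The split above is purely random, so the share of each class in the 62-row test set can drift from the share in the full data. One alternative, shown here only as a sketch and not as what the notebook ran, is a stratified split using the `stratify` argument of `train_test_split`. The helper name `stratified_split` and its defaults are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df: pd.DataFrame, target: str = "status",
                     test_size: float = 0.2, seed: int = 42):
    """Train/test split that preserves the class proportions of `target`.

    Hypothetical helper; the notebook itself uses an unstratified split.
    """
    X = df.drop(columns=[target])
    y = df[target]
    # stratify=y keeps the ratio of class 1 to class 2 roughly equal
    # in the training and test sets
    return train_test_split(X, y, test_size=test_size,
                            random_state=seed, stratify=y)
```

A stratified split would produce a different test set, so scores obtained with it are not directly comparable to the report printed below.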

In [55]:
# n_jobs=-1 dropped: it has no effect with the liblinear solver and only raised a warning
m = LogisticRegression()
m.fit(X_train, y_train)
predictions = m.predict(X_test)


In [56]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           1       0.73      0.91      0.81        44
           2       0.43      0.17      0.24        18

   micro avg       0.69      0.69      0.69        62
   macro avg       0.58      0.54      0.52        62
weighted avg       0.64      0.69      0.64        62
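
To see what these scores mean in terms of patients, the per-class rows can be rebuilt from raw counts. The confusion counts below are not printed by the notebook; they are inferred from the rounded report (40 of 44 survivors and 3 of 18 deaths predicted correctly) and should be treated as an assumption. `per_class_metrics` is a hypothetical helper written in plain Python.

```python
# Confusion counts inferred from the rounded report (an assumption):
# outer key = true class, inner key = predicted class.
confusion = {
    1: {1: 40, 2: 4},   # 44 patients who survived 5 years or longer
    2: {1: 15, 2: 3},   # 18 patients who died within 5 years
}


def per_class_metrics(confusion, label):
    """Return (precision, recall, f1, support) for one class label."""
    true_positive = confusion[label][label]
    support = sum(confusion[label].values())                  # true members
    predicted = sum(row[label] for row in confusion.values())  # predicted members
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / support if support else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, support


for label in confusion:
    p, r, f1, n = per_class_metrics(confusion, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} "
          f"f1={f1:.2f} support={n}")
```

Under these counts the model flags only 3 of the 18 patients who died, which is what the 0.17 recall for class 2 expresses, while the 0.69 overall accuracy is driven mostly by the majority class.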

