# Implementing Logistic Regression

In the introduction, we touched on what logistic regression is and what a model looks like. Now, the idea needs to be extended to multiple input variables.

In [None]:
# All good Python projects begin with specifying which modules to load

import pandas as pd  # Pandas is a package which creates data frames
import numpy as np # Numpy is the package which creates/manages/operates on numerical data
import matplotlib.pyplot as plt # Matplotlib is the plotting library

from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

## The Data

Every project begins with the data.  We will be using data that _Tjen-Sien Lim_ (limt@stat.wisc.edu) supplied. The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data

//=-=-=-=-=-=-=-=-=-=-=-=-=-=

Dataset:  haberman.data
Lim, Tjen-Sien (1999). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Variables/Columns

1. Age of patient at time of operation (numerical) <br>
2. Patient's year of operation (year - 1900, numerical)<br> 
3. Number of positive axillary nodes detected (numerical) <br>
4. Survival status (class attribute) <br>
-- 1 = the patient survived 5 years or longer <br>
-- 2 = the patient died within 5 years <br>

//=-=-=-=-=-=-=-=-=-=-=-=-=-=


In [None]:
# Pull the data directly from the UCI Machine Learning Repository
haber = 'http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data'

# Data does not have a header row so we have to label the data
names = ['age', 'year', 'nodes', 'survival']
data = pd.read_csv(haber, header=None, names=names)

# We will change the survival labels
# 1-> 1 : No change
# 2-> 0 : Death within 5 years is 0
data['survival']=-1*(data['survival']-2)

# head() gives a snapshot of the data.  Jupyterhub is great at rendering tables.
data.head()

In [None]:
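# Inputs: every column except 'survival', as a 2-D array of shape (N, 3)
X = data.loc[:, data.columns != 'survival'].values
# Output: the 'survival' column as a 1-D array of shape (N,), so y[i] is a scalar
y = data['survival'].values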

# Error Calculation

To evaluate how well the parameters have been learned, we need an error measure. In linear regression, the error was the sum of squared errors. For logistic regression, we will use the sum of absolute errors: because both the actual output and the rounded prediction are binary, each difference $y_i - \hat y_i$ is -1, 0, or 1, so the sum counts the misclassified records.

Here, $y_i$ is the actual output and $\hat y_i$ is the predicted output for data record $i$.

$$
E = \sum_i |y_i - \hat y_i|
$$
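
As a small worked example of this error, the sketch below uses made-up label arrays (`y` and `y_hat` are illustrative values, not taken from the Haberman data). It shows that, for binary values, the sum of absolute errors equals the number of misclassified records, which also makes it easy to convert into an accuracy.

```python
import numpy as np

# Hypothetical actual and predicted labels (illustration only)
y = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 1, 1, 0, 0])

# Sum of absolute errors, E = sum_i |y_i - y_hat_i|
E = np.sum(np.abs(y - y_hat))

# For binary labels this is the same as counting mismatches
n_wrong = np.sum(y != y_hat)

# Fraction of records predicted correctly
accuracy = 1 - E / len(y)

print(E, n_wrong, accuracy)
```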

# Stochastic Gradient Descent

Unlike linear regression, logistic regression has no closed-form solution for the parameters, $\beta$. Therefore we have to find the parameters through an iterative process.

One such process is stochastic gradient descent. In this process, the parameters are updated with each new data point. The updates are done such that the parameters move in the direction of greatest decrease in the error.

## Parameter Updating

The update for each parameter, $\beta_n$, starts from its current value and adds the prediction error for record $i$, scaled by the input $x_{n,i}$ and a learning factor, $\eta$.

$$
\beta_{n,\textrm{new}} = \beta_{n,\textrm{old}} + \eta (y_i - \hat y_i) x_{n,i}
$$

So, for our 3 inputs the following update equations are used after each new data record is read.

$$
\beta_{0,\textrm{new}} = \beta_{0,\textrm{old}} + \eta (y_i - \hat y_i) \\
\beta_{1,\textrm{new}} = \beta_{1,\textrm{old}} + \eta (y_i - \hat y_i) x_{1,i}\\
\beta_{2,\textrm{new}} = \beta_{2,\textrm{old}} + \eta (y_i - \hat y_i) x_{2,i}\\
\beta_{3,\textrm{new}} = \beta_{3,\textrm{old}} + \eta (y_i - \hat y_i) x_{3,i}
$$

# Putting the Algorithm Together

The algorithm is put together in this order:

1. Create training data and test data from the available dataset
2. Make initial guesses for the equation parameters
3. Loop through the data
    1. Calculate the predicted output
    2. Update the parameters
4. Calculate error on test data
5. Repeat steps 3 and 4 multiple times (each pass through the data is called an epoch)
6. Profit

## Training and Test Sets

The input data needs to be randomly assigned to either the training data or the test data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
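
To illustrate what random assignment means here, the following is a rough NumPy-only sketch of a 70/30 split on a tiny made-up dataset. It is not how `train_test_split` is implemented, and the names `X_demo`, `y_demo`, and `n_test` are assumptions introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded so the split is repeatable

# Tiny made-up dataset: 10 records with 2 inputs each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2

test_size = 0.3
n_test = int(round(test_size * len(X_demo)))   # 3 of the 10 records

# Shuffle the row indices, then cut them into test and training parts
perm = rng.permutation(len(X_demo))
test_idx, train_idx = perm[:n_test], perm[n_test:]

X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]

print(X_train_demo.shape, X_test_demo.shape)
```

Every record lands in exactly one of the two sets, and the inputs and outputs are indexed with the same shuffled indices so they stay aligned.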

## Calculate Prediction

Here we will calculate the predicted output using a function. This function can be called inside the loop. The sigmoid output lies between 0 and 1, so we will round it to make the prediction binary.

In [None]:
def calc_survival( beta, inputs):
    t = beta[0] + beta[1]*inputs[0] + beta[2]*inputs[1] + beta[3]*inputs[2]
    
    y_ = 1/(1+np.exp(-1*t))
    
    assert y_ >= 0.0, "Prediction is less than zero"
    assert y_ <= 1.0, "Prediction is greater than one"
    
    return ______________
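
The return value above is left blank as an exercise. As one possible way to carry out the rounding described in the text, the sketch below uses `np.round`; the function name `calc_survival_rounded` and the demo parameters are hypothetical and only serve the illustration.

```python
import numpy as np

def calc_survival_rounded(beta, inputs):
    # Linear combination of the three inputs plus the intercept beta[0]
    t = beta[0] + beta[1]*inputs[0] + beta[2]*inputs[1] + beta[3]*inputs[2]

    # Sigmoid squeezes t into the range 0 to 1
    y_ = 1.0 / (1.0 + np.exp(-t))

    # Round to the nearest integer to get a binary prediction
    return np.round(y_)

beta_demo = np.array([0.0, 1.0, -1.0, 0.5])        # hypothetical parameters

# t = 1.0, sigmoid is about 0.73, which rounds to 1.0
print(calc_survival_rounded(beta_demo, [2.0, 1.0, 0.0]))

# t = -2.0, sigmoid is about 0.12, which rounds to 0.0
print(calc_survival_rounded(beta_demo, [0.0, 2.0, 0.0]))
```

Note that `np.round` rounds exact halves to the nearest even number, so a sigmoid output of exactly 0.5 becomes 0. An explicit threshold such as `y_ >= 0.5` would be an alternative if that edge case matters.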

## Update Parameters

The parameter updates require the prediction, the actual output, the parameters and the inputs. Again, this is wrapped in a function for later use.

In [None]:
def update_params( beta, predict, actual, inputs, eta=0.2):
    delta = actual - predict
    
    beta[0] = beta[0] + eta*delta
    beta[1] = beta[1] + eta*delta*inputs[0]
    beta[2] = beta[2] + eta*delta*inputs[1]
    beta[3] = beta[3] + eta*delta*inputs[2]
    
    return beta

## Error Calculation

The error calculation is the sum of the absolute error so it is easy to calculate.

In [None]:
def error_calc( actual, predict):
    '''
    Here the actual and predict inputs are assumed to be arrays.
    '''
    delta = actual - predict
    
    return np.sum(np.abs(delta))

## Loop Over Epochs

Now that the functions are created and the data is ready, we can write a loop to iterate over the data and over the epochs and print the error.

In [None]:
nepoch = ______
eta = ______
N = len(y_train)

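# Use a float array: with integer parameters, the fractional updates would be truncated
beta = np.array([0, 55, 60, 1], dtype=float)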

error_array = np.zeros(nepoch)

for n in range(nepoch):
    y_train_predict = np.zeros_like(y_train)
    y_test_predict = np.zeros_like(y_test)
       
    for i in range(N):
        inputs = X_train[i,:]
        y_train_predict[i] = calc_survival(beta, inputs)
        beta = update_params(beta, y_train_predict[i], y_train[i], inputs, eta=eta)
        
    for j in range(len(y_test)):
        y_test_predict[j] = calc_survival(beta, X_test[j,:])
    
    print(beta)
    error_array[n] = error_calc(y_test, y_test_predict)
    
error_array
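
Since `nepoch` and `eta` are left blank above, here is a self-contained sketch of the same epoch loop that can be run as-is. It makes several assumptions: the data is synthetic (not the Haberman dataset), the values of `nepoch` and `eta` are arbitrary picks, the parameters start at zero, and the helper `predict` is a hypothetical vectorized stand-in for `calc_survival` with rounding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for three inputs and a binary output (NOT the Haberman data)
n_samples = 200
X = rng.normal(size=(n_samples, 3))
true_beta = np.array([0.5, 2.0, -1.0, 0.5])       # hypothetical generating parameters
y = (true_beta[0] + X @ true_beta[1:] > 0).astype(float)

# Simple 70/30 split; the rows are already in random order
X_train, X_test = X[:140], X[140:]
y_train, y_test = y[:140], y[140:]

def predict(beta, inputs):
    # Works for a single record of shape (3,) or a batch of shape (N, 3)
    t = beta[0] + inputs @ beta[1:]
    return np.round(1.0 / (1.0 + np.exp(-t)))

nepoch = 5        # hypothetical number of passes through the training data
eta = 0.1         # hypothetical learning factor
beta = np.zeros(4)                 # float parameters, initial guess of zero
error_array = np.zeros(nepoch)

for n in range(nepoch):
    # One epoch: update the parameters after every training record
    for i in range(len(y_train)):
        delta = y_train[i] - predict(beta, X_train[i])
        beta[0] += eta * delta
        beta[1:] += eta * delta * X_train[i]

    # Sum of absolute errors on the held-out test data
    error_array[n] = np.sum(np.abs(y_test - predict(beta, X_test)))

print(beta)
print(error_array)
```

Each entry of `error_array` is the number of misclassified test records after that epoch, so it can never exceed `len(y_test)`.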