# Logistic Regression (binary classification) from scratch in NumPy

We will start off by importing the necessary libraries.

In [169]:
import numpy as np 
import pandas as pd
import math
import warnings 
warnings.filterwarnings('ignore')

## Overview of the algorithm

In logistic regression for binary classification, we use the sigmoid function to model the probability of one class given an input example. The probability of the other class is obtained by subtracting that probability from 1.

Sigmoid function:
g(z) = 1/(1 + exp(-z))
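As a quick illustration of the formula above, here is a minimal sketch that evaluates the sigmoid at a few points and derives the probability of the other class by subtraction. The input values in `z` are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation values, chosen only for illustration.
z = np.array([-2.0, 0.0, 2.0])

p_class1 = sigmoid(z)        # probability of the modelled class
p_class0 = 1.0 - p_class1    # probability of the other class

print(p_class1)
print(p_class0)
```

Note that g(0) = 0.5, which is why 0.5 is the natural decision threshold later on.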

We can view logistic regression as a two-layer model. The first layer computes an affine transformation of the inputs using the model parameters.

Affine transformation:
linear_pred = W<sup>T</sup>x  (the bias term is already included in the parameters W, with x<sub>0</sub> = 1)
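The following sketch shows why folding the bias into W works: prepending a constant 1 to the input gives the same value as computing the weighted sum and adding the bias separately. The names `x_raw`, `w`, and `b` and their values are hypothetical and used only for this example.

```python
import numpy as np

# Hypothetical two-feature example and parameters (illustrative values only).
x_raw = np.array([2.0, -1.0])
w = np.array([0.5, 0.25])
b = -0.3

# Fold the bias into the parameter vector and prepend x_0 = 1 to the input.
W = np.concatenate(([b], w))
x = np.concatenate(([1.0], x_raw))

linear_pred = W @ x          # W^T x with the bias included
explicit = w @ x_raw + b     # same quantity with the bias added separately

print(linear_pred, explicit)
```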

The second layer then applies the sigmoid activation function to output the probability of the specified class, i.e., the output of the second layer is:

1/(1 + exp(-linear_pred))

We will use the negative log-likelihood as our cost function, i.e., we fit the parameters by maximum likelihood estimation.

The update rule has the same form as the one for the linear regression model: after taking derivatives, both reduce to (prediction - target) times the input.
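For reference, here is a short derivation sketch of that update for a single training example (x, y). The symbols h(x) for the model output and α for the learning rate (`lr` in the code) are notation assumed here, not taken from the notebook.

```latex
\begin{aligned}
h(x) &= g(W^{T}x) = \frac{1}{1 + e^{-W^{T}x}}, \qquad g'(z) = g(z)\,\bigl(1 - g(z)\bigr) \\
\ell(W) &= y \log h(x) + (1 - y)\log\bigl(1 - h(x)\bigr) \\
\frac{\partial \ell}{\partial W_j}
  &= \left(\frac{y}{h(x)} - \frac{1 - y}{1 - h(x)}\right) h(x)\,\bigl(1 - h(x)\bigr)\,x_j
   = \bigl(y - h(x)\bigr)\,x_j \\
W_j &\leftarrow W_j - \alpha\,\bigl(h(x) - y\bigr)\,x_j
\end{aligned}
```

The sigmoid derivative cancels against the denominators of the log terms, which is why the final step looks identical to the least-squares update for linear regression.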

We will use the Pima Indians diabetes dataset for the classification task.
The dataset can be found here: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv

In [170]:
df = pd.read_csv('pima-indians-diabetes.data.csv',delimiter=',',header = None)
df.head()

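|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |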


In [171]:
df_data = df.iloc[:,:8]
df_label = df.iloc[:,8]

In [172]:
def linear_layer(x,weights):
    new_weights = np.array(weights)
    new_array = x.reshape((x.shape[0],))
    new_weights = new_weights.reshape((new_weights.shape[0],))
    y = np.dot(new_weights,new_array)
    return y

In [173]:
def sigmoid(z):
    return 1/(1 + np.exp(-z))

In [174]:
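def normalize(x):
    # Min-max scales each column of x in place.
    # Caveat: np.amin/np.amax are re-evaluated for every element, after earlier
    # rows of the same column have already been overwritten, so later rows are
    # scaled against a mix of raw and already-scaled values.
    for j in range(x.shape[1]):
        for i in range(x.shape[0]):
            x[i,j] = (x[i,j] - np.amin(x[:,j]))/(np.amax(x[:,j])-np.amin(x[:,j]))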
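The loop above looks up the column minimum and maximum again for every element while the column is being overwritten. A possible alternative, sketched below, computes both statistics once per column before scaling. The function name `min_max_normalize` and the handling of constant columns are assumptions of this sketch. It returns a new array instead of modifying its input, and it will not reproduce the outputs shown later in this notebook.

```python
import numpy as np

def min_max_normalize(x):
    """Return a copy of x with every column scaled to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)          # per-column minimum, computed once
    col_max = x.max(axis=0)          # per-column maximum, computed once
    span = col_max - col_min
    span[span == 0] = 1.0            # assumption: constant columns map to 0
    return (x - col_min) / span

# Tiny made-up example: the second column is constant.
demo = np.array([[1.0, 10.0],
                 [3.0, 10.0],
                 [5.0, 10.0]])
scaled = min_max_normalize(demo)
print(scaled)
```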

In [175]:
def data_split(data,ratio=0.8):
    data_train = data[:int(ratio*data.shape[0]),:]
    data_test  = data[int(ratio*data.shape[0]):,:]
    return data_train,data_test

In [176]:
def label_split(data,ratio=0.8):
    data_train = data[:int(ratio*data.shape[0])]
    data_test  = data[int(ratio*data.shape[0]):]
    return data_train,data_test
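The two helpers above cut the arrays at the same position, so features and labels stay aligned as long as both are called with the same ratio. A hedged sketch of a single helper that splits both at once, with an optional shuffle, could look like this. The name `split_features_and_labels` and the `seed` parameter are hypothetical, and shuffling is not something the notebook does.

```python
import numpy as np

def split_features_and_labels(data, labels, ratio=0.8, seed=None):
    """Split data and labels at the same row positions.

    With seed=None the original row order is kept (as in the notebook);
    with an integer seed the rows are shuffled first (an added assumption).
    """
    n = data.shape[0]
    idx = np.arange(n)
    if seed is not None:
        idx = np.random.default_rng(seed).permutation(n)
    cut = int(ratio * n)
    train_idx, test_idx = idx[:cut], idx[cut:]
    return data[train_idx], data[test_idx], labels[train_idx], labels[test_idx]

# Made-up data where row i of toy_x is [2*i, 2*i + 1] and its label is i.
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)
x_tr, x_te, y_tr, y_te = split_features_and_labels(toy_x, toy_y)
print(x_tr.shape, x_te.shape, y_tr.shape, y_te.shape)
```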

In [177]:
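def get_parameters(train_data,train_target,lr,epochs):
    # Per-example (stochastic) gradient descent, starting from all-zero weights.
    weights = [0.0 for i in range(train_data.shape[1])]
    for epoch in range(epochs):
        for i in range(train_data.shape[0]):
            linear_pred = linear_layer(train_data[i,:],weights)
            linear_pred = np.array(linear_pred)
            y_pred = sigmoid(linear_pred)
            y_pred = np.array(y_pred)
            error = y_pred - train_target[i]
            for j in range(train_data.shape[1]):
                # Note: the factor y_pred*(1.0-y_pred) is the sigmoid derivative, so this
                # step follows the squared-error gradient; the log-likelihood gradient
                # is just error*train_data[i,j].
                weights[j] = weights[j] - lr*error*y_pred*(1.0-y_pred)*train_data[i,j]
    weights = np.array(weights)
    return weights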
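For comparison, here is a minimal vectorized sketch of the log-likelihood update described in the overview. It differs from the function above in two ways that are assumptions of this sketch: it uses the plain `(prediction - target) * input` gradient without the sigmoid-derivative factor, and it updates on the whole batch per epoch rather than one example at a time. The name `fit_logistic` and the toy data are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=100):
    """Batch gradient descent on the mean negative log-likelihood."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        grad = X.T @ (p - y) / X.shape[0]    # mean of (p - y) * x over the batch
        w -= lr * grad
    return w

# Made-up, linearly separable data; the first column is the bias input.
X_toy = np.array([[1.0, -2.0],
                  [1.0, -1.0],
                  [1.0,  1.0],
                  [1.0,  2.0]])
y_toy = np.array([0, 0, 1, 1])

w_toy = fit_logistic(X_toy, y_toy, lr=0.5, epochs=500)
pred_toy = (1.0 / (1.0 + np.exp(-(X_toy @ w_toy))) >= 0.5).astype(int)
print(w_toy, pred_toy)
```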

In [178]:
def get_predictions(train_data,train_target,test_data,lr,epochs):
    predictions = []
    weights = get_parameters(train_data,train_target,lr,epochs)
    for i in range(test_data.shape[0]):
        linear_pred = linear_layer(test_data[i,:],weights)
        linear_pred = np.array(linear_pred)
        y_pred = sigmoid(linear_pred)
        predictions.append(y_pred)
    predictions = np.array(predictions)
    return predictions

In [195]:
def evaluation(train_data,train_targets,test_data,test_targets,out_function,lr,epochs):
    prediction = out_function(train_data,train_targets,test_data,lr,epochs)
    prediction_labels = []
    for i in range(len(prediction)):
        if(prediction[i]>=0.5):
            prediction_labels.append(1)
        else:
            prediction_labels.append(0)
    accuracy = 0
    for i in range(len(test_targets)):
        if(prediction_labels[i] == test_targets[i]):
            accuracy += 1
    accuracy = accuracy/len(test_targets)        
    return accuracy
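The thresholding and counting steps in `evaluation` can also be written without explicit loops. The sketch below is one way to do it, assuming the predictions are probabilities in a 1-D array. The name `accuracy_from_probabilities` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def accuracy_from_probabilities(probabilities, targets, threshold=0.5):
    """Fraction of examples whose thresholded prediction equals the target."""
    predicted = (np.asarray(probabilities) >= threshold).astype(int)
    return float(np.mean(predicted == np.asarray(targets)))

# Made-up probabilities and labels: three of the four predictions are correct.
acc = accuracy_from_probabilities([0.9, 0.2, 0.5, 0.4], [1, 0, 1, 1])
print(acc)
```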

## Implementation and testing

Converting the pandas DataFrame to NumPy arrays.

In [180]:
# as_matrix() was removed in pandas 1.0; to_numpy() returns the same NumPy array.
df_data = df_data.to_numpy()
df_label = df_label.to_numpy()

In [181]:
df_data

array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

In [182]:
df_data.dtype , df_label.dtype

(dtype('float64'), dtype('int64'))

Normalizing data before using it to train our model.

In [183]:
normalize(df_data)

In [184]:
df_data

array([[0.35294118, 0.74371859, 0.59016393, ..., 0.50074516, 0.23441503,
        0.48333333],
       [0.05882353, 0.42713568, 0.54098361, ..., 0.39642325, 0.11656704,
        0.37901056],
       [0.47058824, 0.91959799, 0.52459016, ..., 0.34724292, 0.25362938,
        0.39221783],
       ...,
       [1.        , 0.96031746, 1.        , ..., 0.86184211, 0.22781868,
        0.6363187 ],
       [1.        , 1.        , 0.85714286, ..., 0.99013158, 0.33418538,
        1.        ],
       [1.        , 1.        , 1.        , ..., 1.        , 0.29941165,
        1.        ]])

In [185]:
train_data , test_data = data_split(df_data)

In [186]:
train_data.shape,test_data.shape

((614, 8), (154, 8))

In [187]:
train_labels,test_labels = label_split(df_label)

In [188]:
train_labels.shape , test_labels.shape

((614,), (154,))

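Inserting the bias column in our data. The code uses a constant value of 0.5 rather than 1 for this column; any nonzero constant can serve as the bias input, since it only rescales the bias weight the model has to learn.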

In [189]:
train_data = np.insert(train_data,0,0.5,axis=1)
test_data = np.insert(test_data,0,0.5,axis=1)

In [190]:
train_data.shape , test_data.shape

((614, 9), (154, 9))

In [191]:
weights = get_parameters(train_data,train_labels,0.005,30)

In [192]:
lin_out = linear_layer(train_data[6],weights)

In [193]:
print(lin_out)

-0.4583926225802051


In [196]:
evaluation(train_data,train_labels,test_data,test_labels,get_predictions,lr=0.1,epochs=70)

0.7402597402597403

So we finally get an accuracy of about 74.03% on the test set with just basic logistic regression, which is higher than the baseline of roughly 65% for this dataset.
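If the baseline is taken to mean the accuracy of always predicting the most frequent class (an assumption, since the notebook does not say how the 65% figure was obtained), it can be computed from the labels alone. The helper name `majority_baseline` and the toy labels are hypothetical.

```python
import numpy as np

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class.

    Assumes the labels are non-negative integers such as 0 and 1.
    """
    labels = np.asarray(labels)
    counts = np.bincount(labels)      # number of examples per class
    return counts.max() / labels.size

# Made-up labels: three of the four examples belong to class 0.
baseline = majority_baseline([0, 0, 0, 1])
print(baseline)
```

Calling it on `df_label` or `test_labels` would give the corresponding figure for the full dataset or the test split.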