# Logistic regression metrics

In [1]:
import numpy as np
import math
np.set_printoptions(suppress=True)

from logistic_regressor import LogisticRegressor

### Random dataset to work with (binary classification)

**Training set**
<br>
The original goal was to train the regressor and reproduce the coefficients. However, reproducing the actual values turned out to be fairly difficult (I tried with sklearn and statsmodels as well, with little success). The ratios between the coefficients, on the other hand, are fairly well conserved.
<br>
I will use total accuracy for this demonstration.

In [2]:
size = 1000

coefficients = [0,-1.4,2.1,-3,10.4,-8]
x = np.ones((size,len(coefficients)))
for i in range(1,len(coefficients)):
    x[:,i]=np.random.rand(size)
y = np.vectorize(lambda x: round(1/(1+math.exp(-x))))((x*coefficients).sum(axis=1) + np.random.normal(size=size))
coefficients

[0, -1.4, 2.1, -3, 10.4, -8]
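
The labelling rule above rounds the sigmoid of a noisy linear score. Since the sigmoid exceeds 0.5 exactly when its argument is positive, this should be the same as labelling by the sign of the noisy score (apart from floating-point ties at exactly 0.5, which Python's `round` sends to 0). A minimal sketch to check this, where `z` is only a stand-in for the noisy score and the seed and scale are arbitrary assumptions:

```python
import math
import numpy as np

rng = np.random.default_rng(0)        # seed chosen only to make the sketch reproducible
z = rng.normal(scale=3.0, size=1000)  # stand-in for (x*coefficients).sum(axis=1) + noise

# Rule used in the notebook: round the sigmoid of the noisy score.
labels_rounded = np.array([round(1 / (1 + math.exp(-v))) for v in z])

# Equivalent rule: label 1 when the noisy score is positive.
labels_sign = (z > 0).astype(int)

print((labels_rounded == labels_sign).all())
```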

In [3]:
print((y==0).sum(),(y==1).sum())

512 488


The dataset is balanced.

**Test set**

In [4]:
t_x = np.ones((size,len(coefficients)))
for i in range(1,len(coefficients)):
    t_x[:,i]=np.random.rand(size)
t_y = np.vectorize(lambda x: round(1/(1+math.exp(-x))))((t_x*coefficients).sum(axis=1)
                                                        + np.random.normal(size=size))

### Logistic Regressor

In [5]:
from logistic_regressor import LogisticRegressor
log = LogisticRegressor()

**Default solution is found using stochastic gradient descent**

In [6]:
log.fit(x,y,epochs=100,learning_rate=0.01,bin_size=1)
log.coeff

array([-0.84269476, -0.87646913,  2.51090639, -2.73968631,  9.53928816,
       -6.58303746])
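
For readers who want to see what stochastic gradient descent on the logistic loss can look like, here is a rough sketch. It is not the code of `LogisticRegressor`: the function `sgd_logistic` is hypothetical, and treating `bin_size` as a mini-batch size (called `batch_size` here) is an assumption that the notebook does not confirm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(x, y, epochs=100, learning_rate=0.01, batch_size=1, seed=0):
    """Hypothetical mini-batch SGD for logistic regression (illustration only)."""
    rng = np.random.default_rng(seed)
    coeff = np.zeros(x.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))  # new random sample order every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            error = sigmoid(x[idx] @ coeff) - y[idx]
            # Gradient of the mean log-loss over the current mini-batch.
            coeff -= learning_rate * x[idx].T @ error / len(idx)
    return coeff

# Small synthetic check: labels follow the sign of a noisy linear score.
rng = np.random.default_rng(1)
x_demo = np.column_stack([np.ones(500), rng.random((500, 2))])
y_demo = ((x_demo @ np.array([0.0, 4.0, -4.0]) + rng.normal(size=500)) > 0).astype(int)

coeff_demo = sgd_logistic(x_demo, y_demo, epochs=50, learning_rate=0.1)
accuracy_demo = ((sigmoid(x_demo @ coeff_demo) >= 0.5) == y_demo).mean()
print(coeff_demo, accuracy_demo)
```

With `batch_size=1` every update uses a single sample, which is the classic stochastic variant; larger values average the gradient over several samples per update.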

**Coefficients ratio:**
<br>
(Intercept at index 0 should be ignored)

In [7]:
coeff = log.coeff
coefficients/coeff

array([-0.        ,  1.59731809,  0.83635137,  1.09501587,  1.0902281 ,
        1.21524449])
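
Because the true intercept is 0, the ratio at index 0 carries no information. A small sketch that drops it and summarises the remaining ratios, using the fitted values copied from the `log.coeff` output above (the names `true_coeff`, `fitted_coeff` and `ratio` are introduced here only for illustration):

```python
import numpy as np

true_coeff = np.array([0, -1.4, 2.1, -3, 10.4, -8])
# Fitted values copied from the log.coeff output of this notebook run.
fitted_coeff = np.array([-0.84269476, -0.87646913, 2.51090639,
                         -2.73968631, 9.53928816, -6.58303746])

ratio = true_coeff[1:] / fitted_coeff[1:]  # skip index 0, the intercept
print(ratio.round(3))
print('mean:', ratio.mean().round(3), 'std:', ratio.std().round(3))
```

A small spread relative to the mean is what "fairly conserved" would look like; the first slope is the clearest outlier in this run.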

Number of correct predictions on the test set (out of 1000):

In [8]:
pred = log.predict(t_x) 

print((pred==t_y).sum())

938
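
The printed number is a count of correct predictions. To report it as total accuracy, divide by the number of test samples. A minimal sketch, with `accuracy` as a hypothetical helper that is not part of `LogisticRegressor`:

```python
import numpy as np

def accuracy(pred, truth):
    """Fraction of predictions that match the true labels."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return (pred == truth).mean()

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 predictions are correct
print(938 / 1000)                            # the count above as a fraction of the test set
```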


**Batch gradient descent**

In [9]:
log.fit(x,y,method='batch',learning_rate=0.01,epochs=500)
pred = log.predict(t_x) 
print((pred==t_y).sum())

930
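
In contrast to the stochastic variant, batch gradient descent uses the whole training set for every update. The sketch below shows one plausible form of it; `batch_gd_logistic` and `log_loss` are hypothetical names, and the actual `method='batch'` implementation may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(x, y, coeff, eps=1e-12):
    """Mean negative log-likelihood of the labels under the logistic model."""
    prob = np.clip(sigmoid(x @ coeff), eps, 1 - eps)
    return -(y * np.log(prob) + (1 - y) * np.log(1 - prob)).mean()

def batch_gd_logistic(x, y, epochs=500, learning_rate=0.01):
    """Hypothetical batch gradient descent: one update per epoch on all samples."""
    coeff = np.zeros(x.shape[1])
    for _ in range(epochs):
        gradient = x.T @ (sigmoid(x @ coeff) - y) / len(y)
        coeff -= learning_rate * gradient
    return coeff

# Synthetic data, used only to show that the loss goes down.
rng = np.random.default_rng(0)
x_demo = np.column_stack([np.ones(400), rng.random((400, 2))])
y_demo = ((x_demo @ np.array([0.0, 4.0, -4.0]) + rng.normal(size=400)) > 0).astype(int)

loss_start = log_loss(x_demo, y_demo, np.zeros(3))
coeff_demo = batch_gd_logistic(x_demo, y_demo, epochs=200, learning_rate=0.5)
loss_end = log_loss(x_demo, y_demo, coeff_demo)
print(loss_start, loss_end)
```

Each epoch performs a single update here, which is one reason a batch run typically needs more epochs than a stochastic run to reach a similar fit.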


### Arguments

**A column of ones is automatically added if missing**
<br>
The user can force the regressor not to add it (`add_x0=False`).

With intercept:

In [10]:
log.fit(x[:,1:],y,epochs=500,learning_rate=0.01,bin_size=1)
pred = log.predict(t_x) 
print((pred==t_y).sum())

939


Without intercept:

In [11]:
log.fit(x[:,1:],y,add_x0=False,epochs=500,learning_rate=0.01,bin_size=1)
pred = log.predict(t_x[:,1:]) 
print((pred==t_y).sum())

935
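
How `LogisticRegressor` decides that the column of ones is missing is not shown in this notebook. One simple possibility, sketched below as an assumption, is to check whether the first column already consists only of ones. The helper `add_intercept_column` is hypothetical.

```python
import numpy as np

def add_intercept_column(x):
    """Prepend a column of ones unless the first column is already all ones."""
    x = np.asarray(x, dtype=float)
    if x.shape[1] > 0 and np.all(x[:, 0] == 1):
        return x
    return np.column_stack([np.ones(len(x)), x])

features = np.random.default_rng(0).random((4, 2))
with_x0 = add_intercept_column(features)

print(with_x0.shape)                        # one extra column
print(add_intercept_column(with_x0).shape)  # unchanged, the column is already there
```

A check of this kind would explain why the fit above accepts `x[:,1:]` while prediction still works with the full `t_x`.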


**User can modify learning rate, epochs and bin size**

In [12]:
log.fit(x,y,learning_rate=0.1,epochs=250,bin_size=20)

**User can specify starting coefficients**

In [13]:
log.fit(x,y,learning_rate=0.1,epochs=1)
pred = log.predict(t_x) 
print((pred==t_y).sum())

890


Continue from previously learnt coefficients:

In [14]:
log.fit(x,y,learning_rate=0.1,epochs=1,starting_coeff=log.coeff)
pred = log.predict(t_x) 
print((pred==t_y).sum())

937
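
The idea behind `starting_coeff` can be illustrated with a deterministic gradient descent: running one epoch and then resuming for another epoch from the learnt coefficients gives the same result as running two epochs in one go. The function `gd_epochs` below is a hypothetical stand-in, not the regressor's own code. With a stochastic method the two results would generally differ slightly because of the random sample order.

```python
import numpy as np

def gd_epochs(x, y, epochs, learning_rate, starting_coeff=None):
    """Hypothetical batch gradient descent that can resume from given coefficients."""
    if starting_coeff is None:
        coeff = np.zeros(x.shape[1])
    else:
        coeff = np.array(starting_coeff, dtype=float)  # copy, so the input is not modified
    for _ in range(epochs):
        prob = 1.0 / (1.0 + np.exp(-(x @ coeff)))
        coeff = coeff - learning_rate * x.T @ (prob - y) / len(y)
    return coeff

rng = np.random.default_rng(0)
x_demo = np.column_stack([np.ones(200), rng.random((200, 2))])
y_demo = (x_demo[:, 1] > x_demo[:, 2]).astype(int)

two_epochs = gd_epochs(x_demo, y_demo, epochs=2, learning_rate=0.1)
one_epoch = gd_epochs(x_demo, y_demo, epochs=1, learning_rate=0.1)
resumed = gd_epochs(x_demo, y_demo, epochs=1, learning_rate=0.1, starting_coeff=one_epoch)

print(np.allclose(two_epochs, resumed))
```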


### Predictions

**Predict labels**

In [15]:
pred = log.predict(t_x)

print('True 1:\t\t',t_y.sum())
print('Predicted 1:\t', pred.sum())
print(pred[:10])

True 1:		 512
Predicted 1:	 505
[1 1 1 0 0 0 0 0 0 0]


**Output probabilities** (probability for 1)

In [16]:
pred = log.predict(t_x,probability=True)

print(pred[:10])

[0.50899483 0.56325886 0.58824181 0.47268824 0.48184544 0.48953257
 0.47406203 0.49172269 0.44655136 0.43768228]
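
In a logistic model the probability of class 1 is the sigmoid of the linear score. The sketch below shows that computation on made-up numbers; `predict_probability`, `x_demo` and `coeff_demo` are hypothetical and are not taken from `LogisticRegressor` or from this run.

```python
import numpy as np

def predict_probability(x, coeff):
    """Probability of class 1: sigmoid of the linear score x @ coeff."""
    return 1.0 / (1.0 + np.exp(-(x @ coeff)))

# First column is the intercept term, second column a single feature.
x_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, -1.0]])
coeff_demo = np.array([0.0, 2.0])

prob_demo = predict_probability(x_demo, coeff_demo)
print(prob_demo)
```

A score of 0 maps to a probability of 0.5, and scores of opposite sign map to probabilities that add up to 1.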


**Predict labels with a different threshold** (default is 0.5)

In [17]:
pred = log.predict(t_x,threshold=0.6)

print('True 1:\t\t',t_y.sum())
print('Predicted 1:\t', pred.sum())
print(pred[:10])

True 1:		 512
Predicted 1:	 103
[0 0 0 0 0 0 0 0 0 0]
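
Thresholding turns probabilities into labels by comparing them with a cut-off. The sketch below applies this to the ten probabilities printed earlier. `to_labels` is a hypothetical helper, and whether the regressor uses `>=` or `>` at the cut-off is an assumption (it makes no difference for these values).

```python
import numpy as np

# First ten probabilities copied from the probability output of this notebook run.
prob = np.array([0.50899483, 0.56325886, 0.58824181, 0.47268824, 0.48184544,
                 0.48953257, 0.47406203, 0.49172269, 0.44655136, 0.43768228])

def to_labels(prob, threshold=0.5):
    """Label 1 where the probability reaches the threshold, otherwise 0."""
    return (np.asarray(prob) >= threshold).astype(int)

print(to_labels(prob))                 # [1 1 1 0 0 0 0 0 0 0]
print(to_labels(prob, threshold=0.6))  # [0 0 0 0 0 0 0 0 0 0]
```

Both results agree with the first ten labels printed by `log.predict` for the default threshold and for `threshold=0.6`.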
