k-class logistic regression

In [1]:
import numpy as np
from scipy.io import loadmat
import copy
from sklearn.metrics import accuracy_score

In [13]:
#initialise constants

# error tolerance: gradient descent stops once the 2-norm of the change in theta is at or below this value
error = 20

#learning rate
alpha = 0.01

# add machine epsilon while finding log to avoid log(0)
eps = np.finfo(float).eps

# size of the subset of training data to use. If it is None, the whole training set is used
subsetSize = None

In [3]:
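def calculateLogits(data,theta):
    # despite the name, this returns sigmoid probabilities, not raw logits
    # data: (features x samples), theta: (features x 1) -> result: (samples x 1)
    return 1/(1+np.exp(-1*np.matmul(np.transpose(data),theta)))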

In [4]:
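def gradientDescent(X_train,y_train,alpha,error):
    # initialize theta to random values in [0,1); X_train is (features x samples)
    thetaNew = np.random.rand(X_train.shape[0],1)
    thetaOld = np.inf
    numIterations = 0
    # loop until the 2-norm of the update is at most `error`; the isnan check keeps the
    # loop running if the norm evaluates to NaN (which can happen on the first pass, where thetaOld is inf)
    while np.linalg.norm(thetaNew - thetaOld,2) > error or np.isnan(np.linalg.norm(thetaNew - thetaOld,2)):
        thetaOld = thetaNew
        # batch update: the gradient is summed (not averaged) over all training examples
        thetaNew = thetaOld - (alpha * np.matmul(X_train,(calculateLogits(X_train,thetaOld)-y_train)))
        numIterations+=1
    return (thetaNew,numIterations)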

In [7]:
#loading and preprocessing of data
X_train = loadmat('mnistTrainImages.mat')
y_train = loadmat('mnistTrainLabels.mat')
    
if subsetSize is not None and subsetSize < X_train['trainData'].shape[0]:
    # random subset of row indices (randint samples with replacement, so duplicates are possible)
    trainIndices = np.random.randint(0,X_train['trainData'].shape[0],subsetSize)
else:
    trainIndices = np.arange(X_train['trainData'].shape[0])

#convert to numpy array
X_train = np.array(X_train['trainData'][trainIndices,:])
y_train = np.array(y_train['trainLabels'][trainIndices,:])

numClasses = len(np.unique(y_train))

# prepend with a column of ones to account for bias and take transpose 
X_train = np.transpose(np.insert(X_train,0,1,1))

trainSize = X_train.shape[1]
featurevecSize = X_train.shape[0]

In [8]:
#load test data and test labels
X_test = loadmat('mnistTestImages.mat')
y_test = loadmat('mnistTestLabels.mat')

X_test = np.array(X_test['testData'])
y_test = np.array(y_test['testLabels'])

testSize = X_test.shape[0]

X_test = np.insert(X_test,0,1,1)

prediction = np.zeros((testSize,numClasses))

In [8]:
# train model using gradient descent method
for label in range(numClasses):
    modifiedLabelsTrain = copy.deepcopy(y_train)
    modifiedLabelsTrain[np.where(y_train == label)[0]] = 1
    modifiedLabelsTrain[np.where(y_train != label)[0]] = 0
    (theta,_) = gradientDescent(X_train,modifiedLabelsTrain,alpha,error)
    prediction[:,label] = calculateLogits(np.transpose(X_test),theta).ravel()

finalPrediction = prediction.argmax(axis=1)

print('For error rate of {} with learning rate of {}'.format(error, alpha))
print('Accuracy in percentage {}'.format(accuracy_score(y_test.ravel(), finalPrediction)*100))

  


For error rate of 20 with learning rate of 0.01
Accuracy in percentage 85.32
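
The prediction step above picks, for each test example, the class whose one-vs-rest classifier returns the highest sigmoid output. A minimal sketch of that step follows, assuming made-up probabilities and labels (the values are illustrative only, not taken from the MNIST run). Labels are flattened with `ravel()` before comparison, because comparing an `(N,)` array with an `(N,1)` array would broadcast to `(N,N)`.

```python
import numpy as np

# Made-up sigmoid outputs: 3 test examples (rows) x 3 classes (columns).
prediction = np.array([[0.1, 0.7, 0.2],
                       [0.6, 0.3, 0.4],
                       [0.2, 0.2, 0.9]])

# Made-up true labels stored as a column vector, like the loaded label arrays.
y_test = np.array([[1], [0], [1]])

# Index of the largest value in each row is the predicted class.
finalPrediction = prediction.argmax(axis=1)

# Flatten the labels so the comparison is element-wise.
accuracy = np.mean(finalPrediction == y_test.ravel()) * 100
print(finalPrediction, accuracy)
```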


### Observations:

Alpha is the learning rate, and the error tolerance is the threshold on the distance (measured by the 2-norm) between successive theta values at or below which gradient descent stops. These are called hyper-parameters because they are not estimated during regression but are fixed by the user before the algorithm starts. There is no closed-form formula for choosing them, so they are set by trial and error.
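
The interaction between the two hyper-parameters can be illustrated with a small sketch. Each update changes theta by alpha times the gradient, so for the same gradient a smaller alpha produces a smaller change, which passes the tolerance check sooner. The helper `has_converged` and the gradient values below are hypothetical and chosen only for illustration.

```python
import numpy as np

def has_converged(theta_new, theta_old, tol):
    # Mirrors the loop condition in gradientDescent: stop once the
    # 2-norm of the change in theta is no longer greater than tol.
    return np.linalg.norm(theta_new - theta_old) <= tol

theta_old = np.zeros((2, 1))
gradient = np.array([[300.0], [400.0]])  # made-up gradient with 2-norm 500

outcomes = {}
for alpha in (0.1, 0.01):
    theta_new = theta_old - alpha * gradient
    step = np.linalg.norm(theta_new - theta_old)
    outcomes[alpha] = has_converged(theta_new, theta_old, tol=20)
    print(alpha, step, outcomes[alpha])
```

With a tolerance of 20, the step of roughly 50 taken at alpha = 0.1 does not pass the check, while the step of roughly 5 taken at alpha = 0.01 does.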

The following are the accuracies obtained for various values of the hyper-parameters alpha and error tolerance:

| Error tolerance | Learning rate | Accuracy (%) |
|:---------------:|:-------------:|:------------:|
| 10 | 0.01 | 80.87 |
| 20 | 0.01 | 85.32 |
| 5 | 0.001 | 83.96 |
| 10 | 0.001 | 81.79 |
| 20 | 0.001 | 80.91 |
| 5 | 0.0001 | 62.2 |
| 10 | 0.0001 | 63.46 |
| 20 | 0.0001 | 63.73 |
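
A table like the one above can be produced by looping over the hyper-parameter grid. The following is a rough sketch on a small synthetic two-class problem, not the MNIST experiment itself. The function `fit`, the `max_iter` safety cap, the seeds, and the grid values are all assumptions made for illustration. The update rule follows the summed-gradient form used in `gradientDescent`.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, alpha, tol, max_iter=10000, seed=0):
    """Batch gradient descent. X is (features x samples), y is (samples x 1)."""
    rng = np.random.RandomState(seed)
    theta = rng.rand(X.shape[0], 1)
    for it in range(1, max_iter + 1):
        # Summed gradient over all samples, scaled by the learning rate.
        step = alpha * X.dot(sigmoid(X.T.dot(theta)) - y)
        theta = theta - step
        # Stop once the 2-norm of the update is within the tolerance.
        if np.linalg.norm(step) <= tol:
            break
    return theta, it

# Synthetic data: two features, label 1 when their sum is positive.
rng = np.random.RandomState(1)
n = 200
x = rng.randn(n, 2)
y = (x[:, 0] + x[:, 1] > 0).astype(float).reshape(-1, 1)
X = np.vstack([np.ones(n), x.T])  # prepend a bias row -> (3 x n)

results = []
for alpha, tol in itertools.product([0.01, 0.001], [0.1, 0.01]):
    theta, iters = fit(X, y, alpha, tol)
    acc = np.mean((sigmoid(X.T.dot(theta)) >= 0.5) == (y == 1)) * 100
    results.append((tol, alpha, iters, acc))
    print('tol={} alpha={} iterations={} accuracy={:.2f}'.format(tol, alpha, iters, acc))
```

Recording the iteration count next to the accuracy makes it easier to see when a run stopped early because the update was already smaller than the tolerance.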

1. Batch gradient descent over the full training set, as used here, converges very slowly and often oscillates a lot for large datasets. Hence variants such as mini-batch gradient descent and stochastic gradient descent are commonly used instead.

2. Threshold probability is not used here. The *argmax* function in numpy gives the index of the largest of the per-class sigmoid outputs (one per one-vs-rest classifier) calculated for a particular test example. This index is assigned as the predicted class, which is compared to the original labels to calculate accuracy.

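3. It is observed that for a lower learning rate, convergence (as measured by the error tolerance) is considerably faster. This is because alpha scales down the gradient, so each update to theta is smaller, which limits oscillations and brings the change in theta under the error tolerance sooner. A smaller learning rate is therefore preferable when speed of convergence is the main concern.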

4. There is also an overall dip in accuracy as the learning rate decreases. Hence there is a trade-off between speed and accuracy: for a lower learning rate, convergence is fast but accuracy is low, and vice versa. This is a common situation in various other branches of computer science as well.

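5. For alpha = 0.001, accuracy decreases slightly as the error tolerance increases (83.96, 81.79, 80.91) while convergence happens faster, and vice versa. The rows for alpha = 0.01 and alpha = 0.0001 do not show the same decrease.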

6. From the above table, the best values among those tried are alpha = 0.01 and error tolerance = 20 for this dataset. Suitable hyper-parameter values vary across datasets.
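
As an illustration of the variants mentioned in the first observation, here is a minimal sketch of mini-batch gradient descent for a single binary classifier. It is not the method used in this notebook. The function name `minibatch_sgd`, the batch size, the epoch count, the learning rate, and the synthetic data are all assumptions. Unlike `gradientDescent`, it averages the gradient over each batch and runs for a fixed number of epochs instead of checking an error tolerance.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_sgd(X, y, alpha=0.1, batch_size=32, epochs=20, seed=0):
    """X is (features x samples) with a bias row, y is (samples x 1) in {0, 1}."""
    rng = np.random.RandomState(seed)
    theta = np.zeros((X.shape[0], 1))
    n = X.shape[1]
    for _ in range(epochs):
        # Visit the samples in a fresh random order every epoch.
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[:, idx], y[idx]
            # Gradient averaged over the current mini-batch only.
            grad = Xb.dot(sigmoid(Xb.T.dot(theta)) - yb) / len(idx)
            theta = theta - alpha * grad
    return theta

# Synthetic data: two features, label 1 when the first exceeds the second.
rng = np.random.RandomState(1)
x = rng.randn(300, 2)
y = (x[:, 0] - x[:, 1] > 0).astype(float).reshape(-1, 1)
X = np.vstack([np.ones(300), x.T])  # prepend a bias row -> (3 x 300)

theta = minibatch_sgd(X, y)
train_acc = np.mean((sigmoid(X.T.dot(theta)) >= 0.5) == (y == 1))
print(train_acc)
```

Setting `batch_size=1` in this sketch gives stochastic gradient descent, and setting it to the number of samples gives the full-batch behaviour.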