# Logistic Regression (Spam Email Classifier)

In this lab we will develop a Spam email classifier using Logistic Regression.

We will use the [SPAM E-mail Database](https://www.kaggle.com/somesh24/spambase) from Kaggle, which has been split into two almost equal parts: a training dataset (train.csv) and a test dataset (test.csv).
Each record in the datasets contains 58 features, one of which is the class label. The class label is the last feature and takes one of two values: +1 (spam email) or -1 (non-spam email). The other 57 features represent various characteristics of emails, such as the frequencies of certain words or characters in the text of an email and the lengths of sequences of consecutive capital letters (see [SPAM E-mail Database](https://www.kaggle.com/somesh24/spambase) for a detailed description of the features).

In [1]:
import numpy as np
import random

We start by implementing some auxiliary functions.

In [2]:
# Implement sigmoid function
def sigmoid(x):
    # Bound the argument to be in the interval [-500, 500] to prevent overflow
    x = np.clip( x, -500, 500 )

    return 1/(1 + np.exp(-x))
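
As a quick sanity check of the clipping, the sketch below repeats the function above in a self-contained form and evaluates it at a few points, including extreme arguments that would otherwise make `np.exp` overflow. The sample inputs are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    # Same definition as above: clip the argument, then apply the logistic function
    x = np.clip(x, -500, 500)
    return 1 / (1 + np.exp(-x))

# Illustrative inputs (not taken from the dataset)
values = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
probs = sigmoid(values)
print(probs)

# Check the identity sigmoid(-x) = 1 - sigmoid(x) on the same inputs
print(np.allclose(sigmoid(-values), 1 - probs))
```

The identity in the last line is why `1 - sigmoid(a)` can be read as the probability of the negative class.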

In [3]:
def load_data(fname):
    labels = []
    features = []
    
    with open(fname) as F:
        next(F) # skip the first line with feature names
        for line in F:
            p = line.strip().split(',')
            labels.append(int(p[-1]))
            features.append(np.array(p[:-1], float))
    return (np.array(labels), np.array(features))
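
For comparison, the same layout (one header line, label in the last column) could probably also be read with `np.loadtxt`. The sketch below is only an alternative under that assumption: the name `load_data_loadtxt` and the tiny temporary file are hypothetical and not part of the lab.

```python
import os
import tempfile
import numpy as np

def load_data_loadtxt(fname):
    # Hypothetical alternative to load_data: parse the whole CSV in one call
    table = np.loadtxt(fname, delimiter=",", skiprows=1)
    labels = table[:, -1].astype(int)   # last column is the class label
    features = table[:, :-1]            # all remaining columns are features
    return labels, features

# Tiny made-up file in the same layout (header line, label in the last column)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("f1,f2,label\n0.5,1.0,1\n0.0,3.0,-1\n")
    tmp_name = f.name

labels, features = load_data_loadtxt(tmp_name)
os.remove(tmp_name)
print(labels)
print(features.shape)  # (2, 2)
```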

Next we read the training and the test datasets.

In [4]:
(trainingLabels, trainingData) = load_data("train.csv")
(testLabels, testData) = load_data("test.csv")

In the files, the positive objects appear before the negative objects. We therefore reshuffle both datasets so that the training algorithm is not presented with all positive objects followed by all negative objects.

In [5]:
# Reshuffle the training data
permutation =  np.random.permutation(len(trainingData))
trainingLabels = trainingLabels[permutation]
trainingData = trainingData[permutation]

# Reshuffle the test data
permutation =  np.random.permutation(len(testData))
testLabels = testLabels[permutation]
testData = testData[permutation]
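
Indexing the labels and the feature matrix with the same permutation keeps each row paired with its own label. The sketch below illustrates this on made-up data. The seeded generator `np.random.default_rng(0)` is an assumption added for reproducibility; the lab itself uses the unseeded `np.random.permutation`.

```python
import numpy as np

# Made-up data: row i holds the value i, and its label encodes the parity of i
data = np.arange(6, dtype=float).reshape(6, 1)
labels = np.where(np.arange(6) % 2 == 0, 1, -1)

# Seeded generator (assumption, for reproducibility only)
rng = np.random.default_rng(0)
permutation = rng.permutation(len(data))

# Apply the same permutation to both arrays
shuffled_data = data[permutation]
shuffled_labels = labels[permutation]

# Every row still carries its original label after the shuffle
expected = np.where(shuffled_data[:, 0] % 2 == 0, 1, -1)
print(np.array_equal(shuffled_labels, expected))
```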

## Exercise 1

1. Implement the Logistic Regression training algorithm.

In [6]:
def logisticRegression(trainingData, trainingLabels, learningRate, maxIter):
    #Compute the number of training objects
    numTrainingObj = len(trainingData)
    #Compute the number of features (dimension of our data)
    numFeatures = len(trainingData[0])
    
    #Initialize the bias term and the weights
    b = 0
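    # np.longdouble is the portable name for extended precision
    # (np.float128 is not defined on every platform)
    W = np.zeros(numFeatures, dtype=np.longdouble)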
    
    for t in range(maxIter):
        #For every training object
        for i in range(numTrainingObj):
            X = trainingData[i]
            y = trainingLabels[i]
            #Compute the activation score
            a = np.dot(X, W) + b
        
            #Update the bias term and the weights
            b = b + learningRate*y*sigmoid(-y*a)
            for s in range(numFeatures):
                W[s] = W[s] + learningRate*y*sigmoid(-y*a)*X[s]
            
            #The above for-loop can be equivalently written in the vector form as follows
            #W = np.add(W, learningRate*y*sigmoid(-y*a)*X)
            
    return (b, W)
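
The update used in the inner loop can be read as one stochastic gradient descent step on the logistic loss of a single training object. A short derivation, assuming labels $y \in \{-1, +1\}$ and activation $a = W \cdot X + b$; the symbols $\ell$ (per-object loss) and $\eta$ (the learning rate, `learningRate` in the code) are introduced here for illustration:

```latex
\begin{aligned}
\ell(W, b) &= \log\bigl(1 + e^{-y a}\bigr), \qquad a = W \cdot X + b, \\
\frac{\partial \ell}{\partial W} &= -\,y\,\sigma(-y a)\,X, \qquad
\frac{\partial \ell}{\partial b} = -\,y\,\sigma(-y a), \\
W &\leftarrow W + \eta\, y\, \sigma(-y a)\, X, \qquad
b \leftarrow b + \eta\, y\, \sigma(-y a).
\end{aligned}
```

The second line follows from $\frac{d}{da}\log(1 + e^{-ya}) = \frac{-y\,e^{-ya}}{1 + e^{-ya}} = -y\,\sigma(-ya)$, and the last line is the step against the gradient, which matches the two update statements in the code.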

2. Use the training dataset to train a Logistic Regression classifier. Use learningRate=0.1 and maxIter=10. Output the bias term and the weight vector of the trained model.

In [7]:
(b,W) = logisticRegression(trainingData, trainingLabels, 0.1, 10)
print("Bias term: ", b, "\nWeight vector: ", W) 

Bias term:  -372.50583505382194097 
Weight vector:  [  -4.06907888 -103.36947318  -26.45890426    9.769       -24.56475082
   18.79321189   49.96750584   34.82771073    4.66779331   -5.83907522
   12.26129104 -244.64907509   11.09677929   -8.01756238   13.333
  102.51469675   34.73624603   21.71872523  -16.72953192   46.231
   83.23429667   94.957        57.39649369   60.74638059 -653.23374459
 -302.42621109 -311.92975736 -151.61695266 -124.95360719 -113.15797848
  -74.54897863  -49.71597848 -113.8829516   -51.57397848 -127.66184855
  -92.94474702 -158.51516313   -0.669       -89.06522497  -44.70297848
  -63.43373419 -177.52756458  -65.83995616  -85.8377336  -147.45683165
  -75.9957342    -7.62290153  -38.12401619  -17.53236711  -86.91459667
  -17.11060101   89.4320508    39.28911875    4.97647861 -331.53166083
  151.0743862   -95.85092115]


## Exercise 2

1. Implement the Logistic Regression classifier with a given bias term and weight vector.

In [8]:
def logisticRegressionTest(b, W, X):
    #Compute the activation score
    a = np.dot(X, W) + b
    predictedClass = 0
    confidence = 0
    
    if a > 0:
        predictedClass = +1
        confidence = sigmoid(a)
    else:
        predictedClass = -1
        confidence = 1-sigmoid(a)
    return (predictedClass, confidence)
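
The function above classifies one object at a time. A possible batch version is sketched below. The helper name `predict_batch` and the small model and inputs are hypothetical; it relies on the identity `1 - sigmoid(a) = sigmoid(-a)`, so the confidence of the predicted class is `sigmoid(|a|)` in both branches.

```python
import numpy as np

def sigmoid(x):
    x = np.clip(x, -500, 500)
    return 1 / (1 + np.exp(-x))

def predict_batch(b, W, data):
    # Hypothetical helper: one activation score per row of `data`
    a = data @ W + b
    predicted = np.where(a > 0, 1, -1)
    # sigmoid(|a|) equals sigmoid(a) for a > 0 and 1 - sigmoid(a) otherwise
    confidence = sigmoid(np.abs(a))
    return predicted, confidence

# Made-up model and inputs, for illustration only
b_demo = -1.0
W_demo = np.array([2.0, -1.0])
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]])

pred, conf = predict_batch(b_demo, W_demo, X_demo)
print(pred)   # activations are 1, -2, 0, so the classes are +1, -1, -1
print(conf)
```

As in the per-object function, an activation of exactly 0 is assigned to the negative class with confidence 0.5.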

2. Use the trained model to classify objects in the test dataset. Output an evaluation report (accuracy, precision, recall, F-score).

In [9]:
def evaluationReport(classTrue, classPred):
    positive_mask = classTrue == 1

    # Count the number of elements in the positive class 
    positive = np.count_nonzero(positive_mask)
    # Count True Positive
    tp = np.count_nonzero(classPred[positive_mask]==1)
    # Count False Negative
    fn = np.count_nonzero(classPred[positive_mask]==-1)
    
    negative_mask = classTrue == -1

    # Count the number of elements in the negative class 
    negative = np.count_nonzero(negative_mask)
    # Count False Positive
    fp = np.count_nonzero(classPred[negative_mask]==1)
    # Count True Negative
    tn = np.count_nonzero(classPred[negative_mask]==-1)

    # Compute Accuracy, Precision, Recall, and F-score
    accuracy = (tp + tn)/(tp + tn + fp + fn)
    precision = tp/(tp + fp)
    recall = tp/(tp + fn)
    fscore = 2*precision*recall/(precision + recall)
    print("Evaluation report")
    print("Accuracy: %.2f" % accuracy)
    print("Precision: %.2f" % precision)
    print("Recall: %.2f" % recall)
    print("F-score: %.2f" % fscore)
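
To make the four counts and the derived metrics concrete, here is a small worked example on made-up labels. It computes the same quantities as the report function, using combined boolean masks instead of indexing with a class mask.

```python
import numpy as np

# Made-up labels: 4 spam (+1) and 4 non-spam (-1) emails
class_true = np.array([1, 1, 1, 1, -1, -1, -1, -1])
class_pred = np.array([1, 1, -1, -1, 1, -1, -1, -1])

tp = np.count_nonzero((class_true == 1) & (class_pred == 1))    # 2
fn = np.count_nonzero((class_true == 1) & (class_pred == -1))   # 2
fp = np.count_nonzero((class_true == -1) & (class_pred == 1))   # 1
tn = np.count_nonzero((class_true == -1) & (class_pred == -1))  # 3

accuracy = (tp + tn) / (tp + tn + fp + fn)               # 5/8
precision = tp / (tp + fp)                               # 2/3
recall = tp / (tp + fn)                                  # 1/2
fscore = 2 * precision * recall / (precision + recall)   # 4/7
print(accuracy, precision, recall, fscore)
```

Note that precision and recall are undefined when their denominators are zero, for example when the classifier never predicts the positive class.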

In [10]:
classTrue = np.array([int(x) for x in testLabels], dtype=int)
classPred = np.array([int(logisticRegressionTest(b,W,X)[0]) for X in testData], dtype=int)
evaluationReport(classTrue, classPred)

Evaluation report
Accuracy: 0.62
Precision: 0.87
Recall: 0.02
F-score: 0.04


## Exercise 3

1. Apply Gaussian Normalisation to the training dataset.

In [11]:
def GaussianNormalisation(dataset):
    #Compute the number of features
    numFeatures = len(dataset[0])
    
    featureMean = np.empty(numFeatures, float)
    featureStd = np.empty(numFeatures, float)
    
    #For every feature
    for i in range(numFeatures):
        #find its Mean and Std
        featureMean[i] = dataset[:,i].mean(axis=0)
        featureStd[i] = dataset[:,i].std(axis=0)
        #Apply Gaussian Normalisation
        dataset[:,i] = (dataset[:,i] - featureMean[i])/featureStd[i]

    return (featureMean, featureStd)

In [12]:
#normalize the training dataset
(featureMean, featureStd) = GaussianNormalisation(trainingData)
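
The normalisation above loops over the columns and modifies the dataset in place. A vectorised, non-mutating variant might look like the sketch below. The name `gaussian_normalise` and the guard that replaces a zero standard deviation by 1 are assumptions added here, not part of the lab.

```python
import numpy as np

def gaussian_normalise(dataset):
    # Hypothetical non-mutating variant: column-wise mean and std in one call each
    feature_mean = dataset.mean(axis=0)
    feature_std = dataset.std(axis=0)
    # Assumption: replace a zero std (constant feature) by 1 to avoid division by zero
    safe_std = np.where(feature_std == 0, 1.0, feature_std)
    return (dataset - feature_mean) / safe_std, feature_mean, safe_std

# Made-up matrix: the second column is constant
demo = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
normalised, mean, std = gaussian_normalise(demo)

print(normalised.mean(axis=0))  # approximately [0, 0]
print(normalised.std(axis=0))   # approximately [1, 0]
```

Without such a guard, a constant feature would produce a division by zero and fill that column with NaN values.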

2. Train Logistic Regression on the normalised training dataset. Use learningRate=0.1 and maxIter=10. Output the bias term and the weight vector of the trained model.

In [13]:
#Train Logistic Regression classifier on the normalised training data
(b,W) = logisticRegression(trainingData, trainingLabels, 0.1, 10)
print("Bias term: ", b, "\nWeight vector: ", W) 

Bias term:  -3.713327771293792289 
Weight vector:  [ 0.26449472 -0.37943566  0.31332628  1.08894543  0.16898046  0.86758278
  0.96146412  0.48829645  0.07179314  0.09879491  0.04831041  0.05591818
  0.30279669 -0.14512708  0.48177015  1.10892931  0.53849794  0.34400421
  0.64148233 -0.04365394  0.06600537  1.93998242  0.72277151  1.65756034
 -4.2426971  -0.92111694 -6.05212535  0.29516656 -1.28347077  0.44909019
 -2.00706703 -0.03211902 -0.517342   -0.17765313 -1.54730298  0.34789046
 -0.21829777  0.29026295 -0.03737028  0.12137363 -3.14851923 -2.00371221
 -0.05750017 -1.32883656 -0.73536286 -1.17041856 -0.36395977 -1.49418548
  0.2588058   0.32725688  0.53060392  0.93535997  2.24809407  1.12636468
  0.46507143  1.05492443  0.51629748]


3. Normalise the test dataset using Means and Standard Deviations of the features *computed on the training dataset*.

In [14]:
def normalise(dataset, featureMean, featureStd):
    #Compute the number of features
    numFeatures = len(dataset[0])
    
    #For every feature
    for i in range(numFeatures):
        #Apply Gaussian Normalisation with given Mean and Std values
        dataset[:,i] = (dataset[:,i] - featureMean[i])/featureStd[i]

In [15]:
#normalize the test dataset using Means and Std computed on the training dataset
normalise(testData, featureMean, featureStd)
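
The point of reusing the training statistics is that both splits pass through exactly the same transformation. A minimal sketch on made-up single-feature data:

```python
import numpy as np

# Made-up single-feature datasets
train = np.array([[0.0], [2.0], [4.0]])
test = np.array([[2.0], [6.0]])

train_mean = train.mean(axis=0)   # [2.0]
train_std = train.std(axis=0)     # [sqrt(8/3)]

# Both splits are transformed with the *training* statistics
train_norm = (train - train_mean) / train_std
test_norm = (test - train_mean) / train_std

print(train_norm.ravel())
print(test_norm.ravel())
```

Here the test value 2.0 equals the training mean and is mapped to 0. Normalising the test split with its own mean (4.0) and standard deviation (2.0) would instead map it to -1, so the model would see the same email at a different point in feature space.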

4. Use the model trained on the normalised training dataset to classify objects in the normalised test dataset. Output an evaluation report (accuracy, precision, recall, F-score).

In [16]:
#Predict class labels of test objects for the normalized test dataset
classPred = np.array([int(logisticRegressionTest(b,W,X)[0]) for X in testData], dtype=int)
evaluationReport(classTrue, classPred)

Evaluation report
Accuracy: 0.89
Precision: 0.85
Recall: 0.88
F-score: 0.86


5. Compare the quality of the classifier with and without normalisation.