# Logistic Regression (Spam Email Classifier)

In this lab we will develop a spam email classifier using Logistic Regression.

We will use the [SPAM E-mail Database](https://www.kaggle.com/somesh24/spambase) from Kaggle, which has been split into two almost equal parts: a training dataset (train.csv) and a test dataset (test.csv).
Each record in the datasets contains 58 values: 57 features followed by the class label. The class label takes two values: +1 (spam email) and -1 (non-spam email). The features represent various characteristics of emails, such as the frequencies of certain words or characters in the text of an email and the lengths of sequences of consecutive capital letters (see [SPAM E-mail Database](https://www.kaggle.com/somesh24/spambase) for a detailed description of the features).

In [8]:
import numpy as np
import random

We start with implementing some auxiliary functions.

In [9]:
# Implement sigmoid function
def sigmoid(x):
    # Bound the argument to be in the interval [-500, 500] to prevent overflow
    x = np.clip( x, -500, 500 )

    return 1/(1 + np.exp(-x))
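
The clipping step can be checked in isolation. The following is a minimal sketch, assuming 64-bit floats, that shows why the bound of 500 is safe: `np.exp(500)` is still representable, whereas `np.exp(1000)` overflows to infinity.

```python
import numpy as np

def sigmoid(x):
    # Same definition as in the lab: clip the argument, then apply the logistic function
    x = np.clip(x, -500, 500)
    return 1 / (1 + np.exp(-x))

# exp(500) fits into a float64, so the clipped argument never overflows
largest = np.finfo(np.float64).max
exp_500_is_finite = np.exp(500.0) < largest

# Without clipping, exp(1000) overflows to inf (the warning is silenced here)
with np.errstate(over="ignore"):
    exp_1000_is_inf = np.isinf(np.exp(1000.0))

print(exp_500_is_finite, exp_1000_is_inf)
print(sigmoid(0.0), sigmoid(1000.0), sigmoid(-1000.0))
```

For very negative arguments the clipped sigmoid returns a tiny positive number rather than exactly zero, which is harmless for the update rule used later.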

In [10]:
def load_data(fname):
    labels = []
    features = []
    
    with open(fname) as F:
        next(F) # skip the first line with feature names
        for line in F:
            p = line.strip().split(',')
            labels.append(int(p[-1]))
            features.append(np.array(p[:-1], float))
    return (np.array(labels), np.array(features))
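
As an alternative to parsing the file line by line, the same result can probably be obtained with `np.loadtxt`. The sketch below is an assumption-laden illustration: `load_data_loadtxt` is a hypothetical helper, and the three-column CSV is made-up sample data standing in for train.csv.

```python
import io
import numpy as np

def load_data_loadtxt(source):
    # Hypothetical helper: read a comma-separated table, skipping the header line.
    # ndmin=2 keeps the result two-dimensional even for a single data row.
    table = np.loadtxt(source, delimiter=",", skiprows=1, ndmin=2)
    labels = table[:, -1].astype(int)   # the last column is the class label
    features = table[:, :-1]            # all other columns are features
    return labels, features

# Made-up sample with two features and a label column
sample = io.StringIO("f1,f2,label\n0.5,1.0,1\n0.0,2.5,-1\n")
labels, features = load_data_loadtxt(sample)
print(labels)
print(features.shape)
```

`np.loadtxt` also accepts a file name, so `load_data_loadtxt("train.csv")` should work in the same way, assuming the file contains only numeric columns after the header.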

Next we read the training and the test datasets.

In [11]:
(trainingLabels, trainingData) = load_data("train.csv")
(testLabels, testData) = load_data("test.csv")

In the files, the positive objects appear before the negative objects. We therefore reshuffle both datasets so that the training algorithm is not presented with all the positive objects followed by all the negative objects.

In [12]:
#Reshuffle the training data, applying the same permutation to labels and features
permutation = np.random.permutation(len(trainingData))
trainingLabels = trainingLabels[permutation]
trainingData = trainingData[permutation]

#Reshuffle the test data in the same way
permutation = np.random.permutation(len(testData))
testLabels = testLabels[permutation]
testData = testData[permutation]
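
Because the shuffle is random, the trained weights differ from run to run. If reproducible results are wanted, one option is a seeded generator. The sketch below assumes a hypothetical helper `shuffle_together` and a small made-up dataset; it also checks that each label stays attached to its own feature row.

```python
import numpy as np

def shuffle_together(labels, data, seed=0):
    # Hypothetical helper: apply one seeded permutation to both arrays
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(len(data))
    return labels[permutation], data[permutation]

# Made-up data: positives first, then negatives, as in the lab files.
# The first feature encodes the label so that the pairing can be verified.
labels = np.array([1, 1, 1, -1, -1, -1])
data = np.column_stack([labels * 10.0, np.arange(6.0)])

shuffled_labels, shuffled_data = shuffle_together(labels, data, seed=42)
repeat_labels, repeat_data = shuffle_together(labels, data, seed=42)

# Labels and feature rows are still aligned after shuffling
print(np.array_equal(shuffled_data[:, 0], shuffled_labels * 10.0))
# The same seed gives the same order
print(np.array_equal(shuffled_labels, repeat_labels))
```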

## Exercise 1

1. Implement the Logistic Regression training algorithm.

In [13]:
def logisticRegression(trainingData, trainingLabels, learningRate, maxIter):
    #Compute the number of training objects
    numTrainingObj = len(trainingData)
    #Compute the number of features (dimension of our data)
    numFeatures = len(trainingData[0])
    
    #Initialize the bias term and the weights
    b = 0
    W = np.zeros(numFeatures, dtype=np.float64)
    
    for t in range(maxIter):
        #For every training object
        for i in range(numTrainingObj):
            X = trainingData[i]
            y = trainingLabels[i]
            #Compute the activation score
            a = np.dot(X, W) + b

            #Compute the common update factor once per training object
            step = learningRate*y*sigmoid(-y*a)

            #Update the bias term and the weights
            b = b + step
            W = W + step*X

            #The vector update of W above is equivalent to the explicit loop
            #for s in range(numFeatures):
            #    W[s] = W[s] + step*X[s]
            
    return (b, W)
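
The update rule in the code can be read as one stochastic gradient ascent step on the log-likelihood of a single training object. The following is a sketch of that derivation, assuming labels $y \in \{-1, +1\}$, activation $a = W \cdot X + b$, and writing $\eta$ for `learningRate` (the symbol $\eta$ is introduced here only for illustration).

```latex
\begin{aligned}
\sigma(z) &= \frac{1}{1 + e^{-z}}, \qquad P(y \mid X) = \sigma(y\,a), \qquad a = W \cdot X + b, \\
\frac{\partial}{\partial a} \log \sigma(y\,a) &= y\,\bigl(1 - \sigma(y\,a)\bigr) = y\,\sigma(-y\,a), \\
b &\leftarrow b + \eta\, y\, \sigma(-y\,a), \\
W &\leftarrow W + \eta\, y\, \sigma(-y\,a)\, X .
\end{aligned}
```

The factor $\sigma(-y\,a)$ is close to 0 when the object is already classified correctly with high confidence and close to 1 when it is badly misclassified, so confident correct predictions barely change the model.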

2. Use the training dataset to train a Logistic Regression classifier. Use learningRate=0.1 and maxIter=10. Output the bias term and the weight vector of the trained model.

In [14]:
(b,W) = logisticRegression(trainingData, trainingLabels, 0.1, 10)
print("Bias term: ", b, "\nWeight vector: ", W) 

Bias term:  -363.15951816758025 
Weight vector:  [ -10.90024438 -113.58073666  -20.31922296   41.807         1.73756728
   21.08258423   58.13454541   33.98006613   13.25749851   -7.92082525
   13.96809494 -220.03480134   17.21060329   -7.28913832   20.86422556
  117.82563519   25.38574793   35.92883048   -7.46213282   38.64263399
   31.99293727  104.11697144   57.3855778    65.42787761 -622.12434161
 -282.79789999 -271.27507826 -149.26978826  -91.98978349  -96.88667151
  -61.28572679  -41.62719017 -108.67554666  -42.79419017 -120.17175428
  -76.71996123 -153.05430971   -3.543       -88.57889446  -38.14753387
  -58.59188729 -136.6373994   -61.21603343  -77.16082483 -129.01717334
  -81.85799361   -7.282       -39.31316575  -11.22401092  -78.57425684
  -12.38227811  102.9264842    45.92032563    6.77756225 -268.76593695
  250.70487775   67.32478733]


## Exercise 2

1. Implement a Logistic Regression classifier with a given bias term and weight vector.

In [15]:
def logisticRegressionTest(b, W, X):
    #Compute the activation score
    a = np.dot(X, W) + b
    predictedClass = 0
    confidence = 0
    
    if a > 0:
        predictedClass = +1
        confidence = sigmoid(a)
    else:
        predictedClass = -1
        confidence = 1-sigmoid(a)
    return (predictedClass, confidence)
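
The confidence returned above is the estimated probability of the predicted class. Since 1 - sigmoid(a) equals sigmoid(-a), both branches amount to sigmoid(|a|), so the confidence is never below 0.5. The sketch below illustrates this with a hypothetical function `predict_with_confidence` and made-up weights; it is not a replacement for the lab's implementation.

```python
import numpy as np

def sigmoid(x):
    x = np.clip(x, -500, 500)
    return 1 / (1 + np.exp(-x))

def predict_with_confidence(b, W, X):
    # Hypothetical compact variant: the sign of the activation gives the class,
    # its magnitude gives the confidence
    a = np.dot(X, W) + b
    predictedClass = 1 if a > 0 else -1
    confidence = sigmoid(abs(a))
    return predictedClass, confidence

# Made-up model and objects, chosen so that the activations are +1.5 and -1.5
W = np.array([1.0, -2.0])
b = 0.5
spam_like = np.array([2.0, 0.5])    # a = 2.0 - 1.0 + 0.5 = 1.5
ham_like = np.array([0.0, 1.0])     # a = 0.0 - 2.0 + 0.5 = -1.5

class_pos, conf_pos = predict_with_confidence(b, W, spam_like)
class_neg, conf_neg = predict_with_confidence(b, W, ham_like)
print(class_pos, conf_pos)
print(class_neg, conf_neg)
```

The two objects get opposite classes with the same confidence, because their activations differ only in sign.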

2. Use the trained model to classify objects in the test dataset. Output an evaluation report (accuracy, precision, recall, F-score).

In [16]:
def evaluationReport(classTrue, classPred):
    positive_mask = classTrue == 1

    # Count the number of elements in the positive class 
    positive = np.count_nonzero(positive_mask)
    # Count True Positive
    tp = np.count_nonzero(classPred[positive_mask]==1)
    # Count False Negative
    fn = np.count_nonzero(classPred[positive_mask]==-1)
    
    negative_mask = classTrue == -1

    # Count the number of elements in the negative class 
    negative = np.count_nonzero(negative_mask)
    # Count False Positive
    fp = np.count_nonzero(classPred[negative_mask]==1)
    # Count True Negative
    tn = np.count_nonzero(classPred[negative_mask]==-1)

    # Compute Accuracy, Precision, Recall, and F-score
    accuracy = (tp + tn)/(tp + tn + fp + fn)
    precision = tp/(tp + fp)
    recall = tp/(tp + fn)
    fscore = 2*precision*recall/(precision + recall)
    print("Evaluation report")
    print("Accuracy: %.2f" % accuracy)
    print("Precision: %.2f" % precision)
    print("Recall: %.2f" % recall)
    print("F-score: %.2f" % fscore)
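
To make the four metrics concrete, here is a small worked example on made-up labels. The function `compute_metrics` is a hypothetical variant that returns the values instead of printing them and, as an added assumption, returns 0.0 when a denominator is zero (the report above would raise a division error in that case).

```python
import numpy as np

def compute_metrics(classTrue, classPred):
    # Confusion-matrix counts for labels in {-1, +1}
    tp = np.count_nonzero((classTrue == 1) & (classPred == 1))
    fn = np.count_nonzero((classTrue == 1) & (classPred == -1))
    fp = np.count_nonzero((classTrue == -1) & (classPred == 1))
    tn = np.count_nonzero((classTrue == -1) & (classPred == -1))

    def safe_div(num, den):
        # Assumption: report 0.0 instead of failing on an empty denominator
        return num / den if den > 0 else 0.0

    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    fscore = safe_div(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "fscore": fscore}

# Made-up example: 3 spam and 5 non-spam emails
toyTrue = np.array([1, 1, 1, -1, -1, -1, -1, -1])
toyPred = np.array([1, 1, -1, 1, -1, -1, -1, -1])

# Here tp=2, fn=1, fp=1, tn=4, so accuracy = 6/8 and precision = recall = 2/3
metrics = compute_metrics(toyTrue, toyPred)
print(metrics)
```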

In [17]:
classTrue = np.array([int(x) for x in testLabels], dtype=int)
classPred = np.array([int(logisticRegressionTest(b,W,X)[0]) for X in testData], dtype=int)
evaluationReport(classTrue, classPred)

Evaluation report
Accuracy: 0.50
Precision: 0.44
Recall: 0.99
F-score: 0.61


## Exercise 3

1. Apply Gaussian Normalisation to the training dataset

In [18]:
def GaussianNormalisation(dataset):
    #Compute the number of features
    numFeatures = len(dataset[0])
    
    featureMean = np.empty(numFeatures, float)
    featureStd = np.empty(numFeatures, float)
    
    #For every feature
    for i in range(numFeatures):
        #find its Mean and Std
        featureMean[i] = dataset[:,i].mean(axis=0)
        featureStd[i] = dataset[:,i].std(axis=0)
        #Apply Gaussian Normalisation (the dataset is modified in place)
        dataset[:,i] = (dataset[:,i] - featureMean[i])/featureStd[i]

    return (featureMean, featureStd)

In [19]:
#Normalise the training dataset (in place) and keep the Means and Stds for later use
(featureMean, featureStd) = GaussianNormalisation(trainingData)

2. Train Logistic Regression on the normalised training dataset. Use learningRate=0.1 and maxIter=10. Output the bias term and the weight vector of the trained model.

In [20]:
#Train Logistic Regression classifier on the normalised training data
(b,W) = logisticRegression(trainingData, trainingLabels, 0.1, 10)
print("Bias term: ", b, "\nWeight vector: ", W) 

Bias term:  -3.8700230572307306 
Weight vector:  [ 0.30622392 -0.29738268 -0.04827423  0.97898444  0.22550381  0.22235831
  1.23644039  0.3995184   0.1675739  -2.17212054 -0.19072859 -0.14266101
 -0.38107243  0.18353617  0.45252564  1.17780158  0.63921452  0.23452619
 -0.28650925  0.38656092  0.24308377  2.38975508  1.08884446  1.62427053
 -4.78985762 -1.03482003 -6.3825585   0.17206847 -1.15155976  0.0880132
 -1.74818719  0.01457053 -0.44550382 -0.71287809 -1.09124483  0.46843464
 -0.63134178  0.3124822  -0.2318435  -0.00810798 -3.07630166 -2.57784402
 -0.06993102 -1.29041026 -0.50645505 -0.6733586  -0.67205198 -1.83949287
  0.45092734 -0.03057094 -0.43808852  1.31808973  1.67539497  1.65398792
 -0.80748744  1.34248496  0.75888506]


3. Normalise the test dataset using Means and Standard Deviations of the features *computed on the training dataset*.

In [21]:
def normalise(dataset, featureMean, featureStd):
    #Compute the number of features
    numFeatures = len(dataset[0])
    
    #For every feature
    for i in range(numFeatures):
        #Apply Gaussian Normalisation with the given Mean and Std values (in place)
        dataset[:,i] = (dataset[:,i] - featureMean[i])/featureStd[i]

In [22]:
#Normalise the test dataset using the Means and Stds computed on the training dataset
normalise(testData, featureMean, featureStd)
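
The two normalisation steps can also be written without loops. The sketch below is one possible vectorised variant under stated assumptions: `fit_standardiser` and `apply_standardiser` are hypothetical names, the functions return new arrays instead of modifying their input, and a feature with zero standard deviation is divided by 1 to avoid a division by zero (the loop-based versions above do not guard against this case).

```python
import numpy as np

def fit_standardiser(train):
    # Per-feature mean and standard deviation, computed on the training data only
    featureMean = train.mean(axis=0)
    featureStd = train.std(axis=0)
    # Assumption: constant features keep the value 0 after centring
    featureStd = np.where(featureStd == 0, 1.0, featureStd)
    return featureMean, featureStd

def apply_standardiser(dataset, featureMean, featureStd):
    # Broadcasting applies the per-feature statistics to every row
    return (dataset - featureMean) / featureStd

# Made-up data: the third feature is constant in the training set
toyTrain = np.array([[1.0, 10.0, 5.0],
                     [3.0, 30.0, 5.0],
                     [5.0, 50.0, 5.0]])
toyTest = np.array([[3.0, 30.0, 7.0]])

toyMean, toyStd = fit_standardiser(toyTrain)
toyTrainNorm = apply_standardiser(toyTrain, toyMean, toyStd)
toyTestNorm = apply_standardiser(toyTest, toyMean, toyStd)

print(toyTrainNorm.mean(axis=0))
print(toyTestNorm)
```

The test row is transformed with the training statistics, so a test set generally does not end up with exactly zero mean and unit variance. That is expected.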

4. Use the model trained on the normalised training dataset to classify objects in the normalised test dataset. Output an evaluation report (accuracy, precision, recall, F-score).

In [23]:
#Predict class labels of test objects for the normalised test dataset
classPred = np.array([int(logisticRegressionTest(b,W,X)[0]) for X in testData], dtype=int)
evaluationReport(classTrue, classPred)

Evaluation report
Accuracy: 0.88
Precision: 0.83
Recall: 0.89
F-score: 0.86


5. Compare the quality of the classifier with normalisation and without normalisation

With the same learningRate and maxIter, normalisation improves every metric except recall on this run: accuracy rises from 0.50 to 0.88, precision from 0.44 to 0.83, and the F-score from 0.61 to 0.86, while recall falls from 0.99 to 0.89. The combination of near-perfect recall and low precision suggests that the classifier trained without normalisation labels most emails as spam. A likely cause is that the raw features have very different scales, so the updates are dominated by the largest features: the bias term and several weights grow to the order of hundreds without normalisation, whereas they stay in single digits after it.