# Lab 8: The Perceptron

In this lab, you will complete an implementation of the Perceptron training algorithm. 

First a synthetic dataset is generated. The code then does k-fold cross-validation on that data. This means the data is split into 5 folds, and we train on 4 of the folds and evaluate on the remaining fold. We do that 5 times so that we evaluate on each of the 5 folds. 

The code reports precision and recall scores for evaluation. (https://en.wikipedia.org/wiki/Precision_and_recall ) Ideally these should both be close to 1, but because the training algorithm has not been completed, when you initially run this you should see something like Precision=0.5 and Recall=1.0 for each fold. In other words, recall is perfect because everything is being predicted as the positive class, which makes precision very low. 

Your goal for the lab is to complete the train() function below so that precision and recall are both close to 1. 


## Function for Generating Data

The following function genData() will generate some synthetic data consisting of two features/predictors and a binary response/outcome. The data is randomly generated but will look something like this, where each instance has a label 1 or -1 representing the outcome variable, followed by the values of the two features.

(1, array([ 1.49640192,  1.68659547]))

(-1, array([ 0.8539312 ,  0.98891425]))

…

In [1]:
import numpy as np
import random
import sys, os
import re, string

# Generates training data with two features.
# DO NOT MODIFY.
def genData(iterations):
    dataset = []
    for i in range(0,int(iterations/2.0)):
        # positive examples
        num1 = random.uniform(1.1,1.9)
        num2 = random.uniform(1.1,1.9)
        dataset.append((1,np.array([num1,num2])))

        # negative examples
        num1 = random.uniform(0.7,1.3)
        num2 = random.uniform(0.7,1.3)
        dataset.append((-1,np.array([num1,num2])))
    return dataset

## Function for Training

The function below is the only part of the code that you need to modify. But before you make any changes, try running all of the code and checking the precision and recall. 

In [2]:
# COMPLETE THIS FUNCTION.
# This function does the online perceptron training. 
def train(train_data, w):
    bias = 0
    for i in range(0,10):
        random.shuffle(train_data)
    
        '''
        TO COMPLETE:
        The outer loop does ten iterations over the dataset, shuffling the data each time.
        For each of those ten iterations, have an inner loop consider one data point at a time. 
        Calculate the activation for the datapoint, and update the weights and bias when there is an error.
        Follow the pseudocode.
 
        '''
        
        for (y,x) in train_data:
            activation = w * x
            if activation <= 0:
                if (y == -1):
                    w = w + y*x
                    bias += 1
                else: 
                    w += 1
            else:
                if (y == -1):
                    w += 1
                else:
                    w += 1
                    bias += 1 
            
        
    return (w, bias)

## Function for Testing

Once the model is trained, the following test() function is used to evaluate the model. It takes the learned weights and bias and applies the learned model to some unseen test data. Precision and recall are calculated. 

In [3]:
# Evaluates the trained model on the held-out portion of data.
# DO NOT MODIFY.
def test(w, bias, test_data, thresh=0):
    incorrect = 0
    tp = fp = tn = fn = 0
    for (y,x) in test_data:
        activation = bias + np.inner(w, x)
        if activation >= thresh:
            if (y == -1):
                fp += 1
                incorrect += 1
            else:
                tp += 1
        else:
            if (y == -1):
                tn += 1
            else:
                fn += 1
                incorrect += 1
    accuracy = ((len(test_data)-float(incorrect)) / len(test_data)) * 100
    
    try:
        precision = tp / float(tp + fp)
    except:
        precision = 0
    try:
        recall = tp / float(tp + fn)
    except:
        recall = 0
    print('Precision: %s, Recall %s' % (precision, recall))
    return (accuracy,precision,recall)

## Function for K-Fold Cross-Validation

In [4]:
# Does cross-validation.
# Splits the data into k folds, trains on k-1 folds and tests on 1 fold, and repeats k times.
# DO NOT MODIFY.
def cv(data, k=5):
    start = 0
    fold_size = int(len(data) / float(k))
    init_w = 0.0
    w = np.array([init_w]*len(data[0][1])) # initialize weights
    
    for i in range(0,k):
        train_data = data[start:start+fold_size]
        test_data = data[0:start] + data[start+fold_size:]
        (w, bias) = train(train_data, w)
        (acc,p,r) = test(w, bias, test_data)
        start = start+fold_size 
    return (w,bias)





## Testing Out All the Code

The following lines of code generate 1500 observations, and run 5-fold cross-validation.

In [5]:
data = genData(1500)      # generate some data
(w, bias) = cv(data, 5)  # do 5-fold cross-validation on the data

Precision: 0.5, Recall 1.0
Precision: 0.5, Recall 1.0
Precision: 0.5, Recall 1.0
Precision: 0.5, Recall 1.0
Precision: 0.5, Recall 1.0


# Deliverable: Submit your completed notebook via Blackboard.

If you finish early, try modifying the training method to use the _averaged perceptron_ technique. See this link, page 53, for pseudocode and an explanation: http://ciml.info/dl/v0_99/ciml-v0_99-ch04.pdf