# Perceptron, Adaline and Logistic Regression

## by *Templier William*

We implement the perceptron algorithm (Rosenblatt, 68), Adaline (Widrow and Hoff, 60) and logistic regression (Cox, 66), whose pseudo-code is given below:

Perceptron:
`input: Train, eta, MaxEp
init: w
epoch = 0
err = 1
m = len(Train)
while epoch <= MaxEp and err != 0
    err = 0
    for i in 1:m
        choose randomly an example (x,y)
        h <- w * x
        if (y * h <= 0)
            w <- w + eta * y * x
            err <- err + 1
    epoch <- epoch + 1
output: w`

Adaline:
`input: Train, eta, MaxEp
init: w
epoch = 0
err = 1
m = len(Train)
while epoch <= MaxEp and err != 0
    err = 0
    for i in 1:m
        choose randomly an example (x,y)
        h <- w * x
        if (y * h <= 0)
            err <- err + 1
        w <- w + eta * (y - h) * x
    epoch <- epoch + 1
output: w`

Logistic Regression:
`input: Train, eta, MaxEp
init: w
epoch = 0
err = 1
m = len(Train)
while epoch <= MaxEp and err != 0
    err = 0
    for i in 1:m
        choose randomly an example (x,y)
        h <- w * x
        if (y * h <= 0)
            err <- err + 1
        w <- w + eta * y * (1 - sigm(y * h)) * x
    epoch <- epoch + 1
output: w`

1. Create a list of 4 elements corresponding to the logical AND example called `Train`:
$Train=\{((1,+1,+1),+1),((1,-1,+1),-1),((1,-1,-1),-1),((1,+1,-1),-1)\}$

Each element of the list pairs the coordinates of an example, with the bias term '1' placed at the beginning of the vector, with the class of that example.

    

In [1]:
Train=[
    ([1, +1, +1], +1),
    ([1, -1, +1], -1),
    ([1, -1, -1], -1),
    ([1, +1, -1], -1),
]
Train

[([1, 1, 1], 1), ([1, -1, 1], -1), ([1, -1, -1], -1), ([1, 1, -1], -1)]

2. Code the Perceptron, Adaline and LR (Logistic regression) programs

Hint: You can write a function that calculates the dot product between an example $\mathbf{x} = (x_1, \ldots, x_d)$ and the weight vector $\mathbf{w} = (w_0, w_1, \ldots, w_d)$: 
$ h(\mathbf{x},\mathbf{w}) = w_0 + \sum_ {j = 1} ^ d w_j x_j $.


In [2]:
def dot(x, y):
    return sum(x_i*y_i for x_i, y_i in zip(x, y))

from math import exp

def sigmoid(z):
    return (1.0/(1.0+exp(-z)))

Since the only difference between the three algorithms is a *term* in the update step (w <- w + eta \* term \* x), I refactored as much as possible in order to write a single **gradient descent** algorithm that is specialized by providing the desired `update_function`.

In [3]:
def _compute_update(w, x, term, eta):
    return [w_i + eta * term * x_i for w_i,x_i in zip(w,x)]

def perceptron_update(w,x,h,y,eta):
    # w <- w + eta * y * x
    return _compute_update(w, x, term=y, eta=eta)

def adaline_update(w,x,h,y,eta):
    # w <- w + eta * (y - h) * x
    return _compute_update(w, x, term=(y - h), eta=eta)

def logreg_update(w,x,h,y,eta):
    # w <- w + eta * y * (1 - sigmoid(y * h)) * x
    return _compute_update(w,x, term=(y * (1 - sigmoid(y * h))), eta=eta)
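
The three terms can be read as negative gradients of three per-example losses, which is one way to see why a single gradient descent routine covers all of them. The derivation below is only a sketch: the loss names $\ell_P$, $\ell_A$ and $\ell_{LR}$ are notation introduced here, with $h = \mathbf{w} \cdot \mathbf{x}$ and $\sigma$ the sigmoid.

$$
\begin{aligned}
\ell_P(\mathbf{w}) &= \max(0, -y\,h), & -\nabla_{\mathbf{w}} \ell_P &= y\,\mathbf{x} \ \text{ if } y\,h < 0,\ \mathbf{0} \ \text{ if } y\,h > 0 \\
\ell_A(\mathbf{w}) &= \tfrac{1}{2}(y - h)^2, & -\nabla_{\mathbf{w}} \ell_A &= (y - h)\,\mathbf{x} \\
\ell_{LR}(\mathbf{w}) &= \ln\left(1 + e^{-y\,h}\right), & -\nabla_{\mathbf{w}} \ell_{LR} &= y\,\bigl(1 - \sigma(y\,h)\bigr)\,\mathbf{x}
\end{aligned}
$$

The last line uses $1 - \sigma(z) = e^{-z} / (1 + e^{-z})$. At $y\,h = 0$ the perceptron loss is not differentiable; the pseudo-code treats that case as a mistake and applies the update $y\,\mathbf{x}$.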

In [4]:
from random import randint

def gradient_descent(train, eta, max_epoch, update_rule, update_on_error=False):
    # One weight per input coordinate (the bias '1' is already part of x).
    w = [0.0] * len(train[0][0])
    epoch = 0

    while epoch < max_epoch:
        err = 0  # number of mistakes seen during this epoch

        for _ in range(len(train)):
            # Draw one example uniformly at random, with replacement.
            x, y = train[randint(0, len(train) - 1)]
            h = dot(w, x)

            misclassified = y * h <= 0
            if misclassified:
                err += 1

            # Perceptron: update only on a mistake; Adaline and LR: always update.
            if misclassified or not update_on_error:
                w = update_rule(w, x, h, y, eta)

        epoch += 1

    return w

In [5]:
w_perceptron = gradient_descent(Train, 0.01, 500, perceptron_update, update_on_error=True)
w_adaline = gradient_descent(Train, 0.01, 500, adaline_update)
w_logreg = gradient_descent(Train, 0.01, 500, logreg_update)

3. Apply the three learning models on the logical AND, and calculate the model error rate on this basis.

Hint: You can write a function that takes a weight vector $\mathbf{w}$ and an example $(\mathbf{x},y)$ and calculates the error rate of the model with weight $\mathbf{w}$.

In [6]:
Test = [
    ([1, +1, +1], +1),
    ([1, -1, +1], -1),
    ([1, -1, -1], -1),
    ([1, +1, -1], -1),
    ([1, +1, +1], +1),
    ([1, -1, +1], -1),
    ([1, -1, -1], -1),
    ([1, +1, -1], -1)
]

In [7]:
def EmpiricalRisk(Test,W):
    E=0.0
    m=len(Test)
    # The empirical error of a model with weight W on a test set of size m
    for xi, yi in Test:
        h_w = dot(W, xi)
        if (yi * h_w <= 0):
            E+=1.0
    return E/float(m)

In [8]:
print(f"Empirical risk Perceptron = {EmpiricalRisk(Test, w_perceptron)}")
print(f"Empirical risk Adaline = {EmpiricalRisk(Test, w_adaline)}")
print(f"Empirical risk LogReg = {EmpiricalRisk(Test, w_logreg)}")

Empirical risk Perceptron = 0.0
Empirical risk Adaline = 0.0
Empirical risk LogReg = 0.0
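
To make the zero error rate concrete, here is a minimal sketch that classifies the four AND examples with a hand-picked weight vector. The vector `w_example` and the helper `predict` are illustrative assumptions, not the weights learned above; any vector that separates the AND examples would behave the same way.

```python
def dot(w, x):
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def predict(w, x):
    # Predicted class is the sign of the dot product (0 counts as -1 here).
    return 1 if dot(w, x) > 0 else -1

and_examples = [
    ([1, +1, +1], +1),
    ([1, -1, +1], -1),
    ([1, -1, -1], -1),
    ([1, +1, -1], -1),
]

# Hypothetical separating weights: h = -1 + x1 + x2
w_example = [-1.0, 1.0, 1.0]

predictions = [predict(w_example, x) for x, _ in and_examples]
error_rate = sum(p != y for p, (_, y) in zip(predictions, and_examples)) / len(and_examples)
print(predictions, error_rate)  # [1, -1, -1, -1] 0.0
```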


4. We now study the behavior of the three models on four UCI datasets: http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks), https://archive.ics.uci.edu/ml/datasets/spambase, https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29 and https://archive.ics.uci.edu/ml/datasets/Ionosphere. These files are in the current repository under the names `sonar.txt`, `spam.txt`, `wdbc.txt` and `ionosphere.txt`. We can use the following `ReadCollection` function to read each file into the requested training-set form.

In [9]:
from math import sqrt
import pandas as pd
import random
from sklearn.model_selection import train_test_split

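def Normalize(x):
    # Scale x in place to unit Euclidean (L2) norm.
    norm = sqrt(sum(e**2 for e in x))
    if norm == 0.0:
        return x  # an all-zero vector is left unchanged
    for i in range(len(x)):
        x[i] /= norm
    return x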

def ReadCollection(filename, normalize=True):
    tag_df=pd.read_table(f'data/{filename}',sep=',',header=None)
    if("wdbc" in filename):
        Dic={'M': -1, 'B': +1}
    elif("sonar" in filename):
        Dic={'R': -1, 'M': +1}
    elif("iono" in filename):
        Dic={'g': -1, 'b': +1}
    elif("spam" in filename):
        Dic={0:-1, 1:+1}
        
    X=[]
    for e in range(len(tag_df)):
        x=list(tag_df.loc[e,:])
        if("wdbc" in filename):
            x.pop(0)
            cls=x.pop(0)
        else:
            cls=x.pop()
            
        if normalize: x=Normalize(x)
        x.insert(len(x),Dic[cls])
        X.append(x)
    random.shuffle(X)

    return X
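
Unlike the `Train` list of the AND example, the rows returned by `ReadCollection` do not carry a leading bias term '1'. If an intercept is wanted, one possible sketch is to prepend it after reading. The helper `add_bias` and the sample rows below are hypothetical and are not part of the original notebook.

```python
def add_bias(rows):
    """Prepend a constant 1.0 to each row; the last entry (class label) is untouched."""
    return [[1.0] + list(row) for row in rows]

# Hypothetical rows in the ReadCollection layout: features first, label last.
rows = [[0.6, 0.8, +1], [0.0, 1.0, -1]]
print(add_bias(rows))  # [[1.0, 0.6, 0.8, 1], [1.0, 0.0, 1.0, -1]]
```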

 2. Run the three models on these files with $\eta=0.01$ and $\eta=0.1$ and `MaxEp=500`.
 
 3. Report in the table below the average of the error rates on the test by repeating each experiment 10 times. 
 
 <br>
 
 
 <center> $\eta=0.01$, MaxEp$=500$ </center>
    
    
  | Collection | Perceptron | Adaline |    LR    |
  |------------|------------|---------|----------|
  | WDBC       |            |         |          |
  | Ionosphere |            |         |          |
  | Sonar      |            |         |          |
  | Spam       |            |         |          |
 
 <br><br>
  
  <center> $\eta=0.1$, MaxEp$=500$ </center>
    
    
  | Collection | Perceptron | Adaline |    LR    |
  |------------|------------|---------|----------|
  | WDBC       |            |         |          |
  | Ionosphere |            |         |          |
  | Sonar      |            |         |          |
  | Spam       |            |         |          |
  
  Hint: you can use the following function

In [10]:
models = {
    'Perceptron': perceptron_update,
    'Adaline': adaline_update,
    'LogReg': logreg_update
}

In [11]:
# helper function to transform the dataset in (xi,yi) tuples
def get_x_y(dataset):
    return [(vector[:-1], vector[-1]) for vector in dataset]

In [12]:
def hyperparameters_tuning(etas=(0.01, 0.1), normalize=True, nb_iter=10):

    datasets = {
        "WDBC": ReadCollection("wdbc.txt", normalize=normalize), 
        "Ionosphere": ReadCollection("ionosphere.txt", normalize=normalize),
        "Sonar": ReadCollection("sonar.txt", normalize=normalize),
        "Spam": ReadCollection("spam.txt", normalize=normalize)
    }

    for eta in etas:
        print(f'eta = {eta}\n--------')

        results = pd.DataFrame(
                index=datasets.keys(),
                columns=models.keys(),
                dtype=float
        ).fillna(0.0)

        for name, X in datasets.items():
            for i in range(nb_iter):
                train, test = train_test_split(X, test_size=0.25)

                train = get_x_y(train)
                test = get_x_y(test)

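                for model, update_fnc in models.items():
                    # update_on_error keeps its default (False), so every rule,
                    # the perceptron included, updates w on every drawn example.
                    results.loc[name, model] += EmpiricalRisk(
                        test, gradient_descent(train, eta, 500, update_fnc)
                    )

        # Average the accumulated error rates over the nb_iter repetitions.
        print((results / float(nb_iter)).round(4))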

In [13]:
hyperparameters_tuning(etas=[0.01,0.1], normalize=True, nb_iter=20)

eta = 0.01
--------
            Perceptron  Adaline  LogReg
WDBC            0.3787   0.0843  0.0965
Ionosphere      0.3068   0.1648  0.1687
Sonar           0.4452   0.2856  0.2846
Spam            0.3929   0.2602  0.2286
eta = 0.1
--------
            Perceptron  Adaline  LogReg
WDBC            0.3762   0.0913  0.0916
Ionosphere      0.2994   0.1750  0.1619
Sonar           0.4846   0.2452  0.2279
Spam            0.3935   0.2581  0.1514


 4. Normalize the vector representations of the observations by dividing them by their norm and repeat questions 2 and 3. Does normalizing change the results significantly? Please explain.
 
 **The order is inverted here: the results above were obtained on data that were already normalized, so the following discusses the non-normalized data.**


In [14]:
hyperparameters_tuning(etas=[0.01,0.1], normalize=False, nb_iter=20)

eta = 0.01
--------


  return sum(x_i*y_i for x_i, y_i in zip(x, y))
  return [w_i + eta * term * x_i for w_i,x_i in zip(w,x)]


OverflowError: math range error

As the weights are updated by gradient descent with a vector **x** whose coordinates can be large, the values accumulate and grow from one update to the next, especially in the dot product *h*.
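
The growth can be illustrated on a single example with the Adaline rule. Repeating the update on one pair $(\mathbf{x}, y)$ multiplies the residual $y - h$ by $(1 - \eta \|\mathbf{x}\|^2)$ at each step, so it blows up as soon as $\eta \|\mathbf{x}\|^2 > 2$. The sketch below uses a made-up two-dimensional example; the helper names and the values are assumptions chosen for illustration only.

```python
def adaline_step(w, x, y, eta):
    """One Adaline update: w <- w + eta * (y - w.x) * x."""
    h = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return [w_i + eta * (y - h) * x_i for w_i, x_i in zip(w, x)]

def residual_after(x, y, eta, steps):
    """Absolute residual |y - w.x| after repeated updates on one example."""
    w = [0.0] * len(x)
    for _ in range(steps):
        w = adaline_step(w, x, y, eta)
    return abs(y - sum(w_i * x_i for w_i, x_i in zip(w, x)))

raw = [1.0, 50.0]                        # hypothetical unnormalized example
norm = sum(v * v for v in raw) ** 0.5
unit = [v / norm for v in raw]           # same example scaled to unit norm

# eta * ||x||^2 = 25.01 > 2, so the residual grows at every step.
print(residual_after(raw, 1.0, 0.01, 20))
# eta * ||x||^2 = 0.01, so the residual shrinks by a factor 0.99 per step.
print(residual_after(unit, 1.0, 0.01, 20))
```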

In the update function of the logistic regression, we use the `exp` function from the `math` library, which raises an `OverflowError` instead of returning a value when its result is too large for a float, as happens here once *h* becomes very large in magnitude. We could use an **exp** function from a library that handles overflow differently, such as *numpy*, whose `exp` returns `inf` with a warning instead of raising.
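
Another option that stays within the `math` library is to evaluate the sigmoid so that `exp` only ever receives a non-positive argument. This is a minimal sketch, and `stable_sigmoid` is a name introduced here rather than a function of the notebook. It only removes the overflow in the `exp` call; it does not stop the weights or the dot product themselves from growing on unnormalized data.

```python
from math import exp

def stable_sigmoid(z):
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)); exp(z) < 1.
    e = exp(z)
    return e / (1.0 + e)

print(stable_sigmoid(-1000.0), stable_sigmoid(1000.0))  # 0.0 1.0
```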

It is also known that normalizing the data helps gradient descent converge by *smoothing* the space, as shown in the image below, taken from [here](https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3).

In [15]:
from IPython.display import Image
Image(url= "https://miro.medium.com/max/700/1*vXpodxSx-nslMSpOELhovg.png")