# ML Assignment 2 & 3   
# Paul Thillen, Louis-Philippe Noël
# 3. PRACTICAL PART (40 pts) : Neural network implementation and experiments

## Numerically stable softmax. 
You will need to compute a numerically
stable softmax. Refer to posted readings to see how to do this. Start by
writing the expression for a single vector, then adapt it for a mini-batch
of examples stored in a matrix.


In [82]:
%pylab inline
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.io
import gzip,pickle

Populating the interactive namespace from numpy and matplotlib


#Load data

In [84]:
#two moons 
two_moons = np.loadtxt(open('2moons.txt','r'))
#mnist
f=gzip.open('mnist.pkl.gz')
mnist=pickle.load(f)

#Utilities

In [42]:
def softmax(x):
    maximum = np.max(x, axis=-1, keepdims=True)
    top = np.exp(x - maximum)
    bottom = np.sum(top, axis=-1, keepdims=True)
    return top/bottom

#test
#A= array([[ 1.,  3.,  4.,  5.,  6.],
#       [ 4.,  5.,  6.,  3.,  3.],
#       [ 1.,  3.,  4.,  5.,  6.],
#       [ 4.,  5.,  6.,  3.,  3.]])
#softmax(A)

In [121]:
def uniforme(a,b):
    return(a+(b-a)*random()) 

## Parameter initialization. 
As you know, it is necessary to randomly
initialize the parameters of your neural network (trying to avoid symme-
try and saturating neurons, and ideally so that the pre-activation lies in
the bending region of the activation function so that the overall networks
acts as a non linear function). We suggest that you sample the weights
of a layer from a uniform distribution in [ -1/sqrt(n_c), 1/sqrt(n_c) ], 
where n c is the number of inputs for this layer (changing from one layer to the other).
Biases can be initialized at 0. Justify any other initialization method.

## fprop
fprop will compute the forward progpagation i.e. step by step computation from the input to the output and the cost of the activations of each layer.

## bprop
bprop will use the computed activations by fprop and does
the backpropagation of the gradients from the cost to the input following
precisely the steps derived in part 2.

## Finite difference gradient check. We can estimate the gradient nu-
merically using the finite difference method. You will implement this estimate as a tool to check your gradient computation. To do so, calculate the value of the loss function for the current parameter values (for a single example or a mini batch). Then for each scalar parameter θ k , change the
parameter value by adding a small perturbation (10^−6 < ε < 10^−4 )
and calculate the new value of the loss (same example or minibatch),
then set the value of the parameter back to its original value. The partial
derivative with respect to this parameter is estimated by dividing the
change in the loss function by ε. The ratio of your gradient computed
by backpropagation and your estimate using finite difference should be
between 0.99 and 1.01.

All in one : Class

In [73]:
class Model:

    def plot_function(self, train_data, title):
        plt.figure()
        d1 = train_data[train_data[:, -1] > 0]
        d2 = train_data[train_data[:, -1] == 0]
        plt.scatter(d1[:, 0], d1[:, 1], c='b', label='classe 1')
        plt.scatter(d2[:, 0], d2[:, 1], c='g', label='classe 0')
        x = np.linspace(np.min(train_data[:, 0]) - 0.5,
                        np.max(train_data[:, 0]) + 0.5,
                        100)
        y = -(self.weights[0]*x + self.bias - .5)/self.weights[1]
        plt.plot(x, y, c='r', lw=2, label='y = -(w1*x + b1)/w2')
        plt.xlim(np.min(train_data[:, 0]) - 0.5, np.max(train_data[:, 0]) + 0.5)
        plt.ylim(np.min(train_data[:, 1]) - 0.5, np.max(train_data[:, 1]) + 0.5)
        plt.grid()
        plt.legend(loc='lower right')
        plt.title(title)
        plt.show()

In [119]:
class neural_net:
    
    def __init__(self, nc, m, k):
        #nc : number of neurons in the hidden layer
        #m : number of classes
        #k : size of mini-batches to use
        self.nc = nc
        self.m = m
        self.k = k
        
    def ini(self,train_data):
        self.train_data = train_data
        self.x = train_data[:,:-1]
        self.y = train_data[:,-1]
        self.b1=0
        self.b2=0
        nc1 = train_data.shape[1]
        nc2 = self.nc
        self.w1 = uniforme(-1/sqrt(nc1),1/sqrt(nc1))
        self.w2 = uniforme(-1/sqrt(nc2),1/sqrt(nc2))
        
    def fprop(self):
        #ha : activation des neurones de la couche cachée
        self.ha= np.dot(w1,x)+b1
        #hs : sortie des neurones de la couche cachée
        self.hs= np.maximum(0,ha)
        #oa : activation des neurones de la couche de sortie
        self.oa= np.dot(w2,hs)+b2
        #os : sortie des neurones de la couche cachée
        self.os=softmax(oa)
        #L : fonction de coût
        self.L= -np.log(os)
        return L
    
    def bprop(self):
        #grad_oa : gradient de la fonction d'activation de la couche de sortie par rapport à L
        onehot=np.zeros(m)
        onehot[y-1]=1
        self.grad_oa = os-onehot
        #grad_w2 et grad_b2
        self.grad_w2 = np.dot(grad_oa,np.transpose(hs))
        self.grad_b2 = grad_oa
        #grad_hs
        self.grad_hs = np.dot(np.transpose(w2),grad_oa)
        #grad_ha
        def I_ha(x):
            #x is a vector
            y=np.zeros(len(x))
            for i in range(0, len(x)):
                    if x[i]>0:
                        y[i] = 1
                    else:
                        y[i]=0
            return y

        self.grad_ha = np.dot(grad_hs,I_ha(ha))
        #grad_w1 et grad_b1
        self.grad_w1 = np.dot(grad_ha,np.transpose(x))
        self.grad_b1 = grad_ha
        #grad_x
        self.grad_x = np.dot(np.transpose(w1),grad_ha)
        #elastic regularization
        self.grad_w2_el = grad_w2+2*w2-np.sign(w2)
        self.grad_w1_el = grad_w1+2*w1-np.sign(w1)
        
    def finite_check(self):
        e = 0.00002
        w1_2 = w1
        w1_2 = np.add(w1_2,e)
        w2_2 = w2
        w2_2 = np.add(w2_2,e)
        b1_2 = b1
        b1_2 = np.add(b1_2,e)
        b2_2 = b2
        b2_2 = np.add(b2_2,e)
        fprop(x,w1,w2,b1,b2)
        bprop(x,y,m,w1,w2,b1,b2)
        #gradients avec valeurs initiales des paramètres
        grad_w1_1 = grad_w1
        grad_w2_1 = grad_w2
        grad_b1_1 = grad_b1
        grad_b2_1 = grad_b2
        #gradients avec nouvelles valeurs des paramètres
        fprop(x,w1_2,w2_2,b1_2,b2_2)
        bprop(x,y,m,w1_2,w2_2,b1_2,b2_2)
        grad_w1_2 = grad_w1
        grad_w2_2 = grad_w2
        grad_b1_2 = grad_b1
        grad_b2_2 = grad_b2
        #check difference
        ratio_w1 = np.divide(grad_w1_1,grad_w1_2)
        ratio_w2 = np.divide(grad_w2_1,grad_w2_2)
        ratio_b1 = np.divide(grad_b1_1,grad_b1_2)
        ratio_b2 = np.divide(grad_b2_1,grad_b2_2)
        #sortie
        if ratio_w1>1.01 or ratio_w2>1.01 or ratio_b1>1.01 or ratio_b2>1.01:
            return False
        else:
            return True

 ## Size of the mini batches. 
We ask that your computation and gradient descent is done on minibatches (as opposed to the whole training set)
with adjustable size using a hyperparameter K. In the minibatch case,
we do not manipulate a single input vector, but rather a batch of input
vectors grouped in a matrix (that will give a matrix representation at
each layer, and for the input). In the case where the size is one, we obtain
an equivalent to the stochastic gradient. Given that numpy is eﬃcient on
matrix operations, it is more eﬃcient to perform computations on a whole
minibatch. It will greatly impact the execution time.

1. As a beginning, start with an implementation that computes the gradients
for a single example, and check that the gradient is correct using the ﬁnite
diﬀerence method described above.

In [122]:
#exemple x : 2 dimensions, classe = 1
x= np.array([[-1.2084724, 0.39429077, 1.]])
net1 = neural_net(2,2,1)
net1.ini(x)

TypeError: 'module' object is not callable

2. Display the gradients for both methods (direct computation and ﬁnite
difference) for a small network (e.g. d = 2 and d h = 2) with random
weights and for a single example.

3. Add a hyperparameter for the minibatch size K to allow compute the
gradients on a minibatch of K examples (in a matrix), by looping over
the K examples (this is a small addition to your previous code).