<a href="https://colab.research.google.com/github/ziqlu0722/Machine-Learning/blob/master/Multilayer_Perceptron.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

log: Softmax with Cross Entropy Loss to be done

In [0]:
import numpy as np
import tensorflow as tf
import random
import os

## 2.1 Data Manipulation

**Data Manipulation Functions**

In [0]:
def load_data(link):
    x = []
    y = []
    with open(link) as data:
        for line in data:
            y.append(line.strip().split(',')[0])
            x.append(line.strip().split(',')[1:])
    x = np.array(x).astype(np.float32)
    y = np.array(y).astype(np.float32)
    return x, y
  
def to_categorical(x):
    """ One-hot encoding """
    n_col = np.max(x).astype(int) + 1
    one_hot = np.zeros((x.shape[0], n_col))
    one_hot[np.arange(x.shape[0]).astype(int), x.astype(int)] = 1 
    return one_hot
  
def normalize(X, axis=-1, order=2):
    """ normalize input data"""
    l2 = np.atleast_1d(np.linalg.norm(X, order, axis))
    l2[l2 == 0] = 1
    return X / np.expand_dims(l2, axis)

In [0]:
x, y = load_data('drive/Data/wine/train_wine.csv')
y = to_categorical(y)
x = normalize(x)

In [9]:
y[:5]

array([[0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.]])

In [10]:
x.shape, y.shape

((151, 13), (151, 4))

##2.2 MLP From Scratch

###2.2.1 Loss Functions

**Cross Entropy Loss**

***1. Loss Function: ***

[source](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#mae-l1)

>Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.

>The graph above shows the range of possible loss values given a true observation (isDog = 1). As the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but especially those predictions that are confident and wrong!

>Cross-entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.

<img src="https://ml-cheatsheet.readthedocs.io/en/latest/_images/cross_entropy.png" width="400">


$$L = -\sum_{c=1}^ky_{i,c}\ln~(\sigma_{i, c})$$

* $K$ - number of classes (dog, cat, fish)
* $y$ - binary indicator (0 or 1) if class label $c$ is the correct classification for observation $i$
* $p$ - predicted probability observation $i$ is of class $c$

***2. Derivatives of Cross Entropy Loss: ***

\begin{align}
\frac{\partial L}{\partial \sigma_i} &= - \sum_{c=1}^k y_k \frac{\partial ln~(\sigma_k)}{\partial \sigma_k } \\
&= - \sum_{c=1}^k y_k \frac{1}{\sigma_k}
\end{align}



**Square Loss**

*** 1.Loss Function***

$$\frac{(y-\hat{y})^2}{2}$$
    
***2. Derivatives of Square Loss***

$$-(y - \hat{y})$$



In [0]:
class SquareLoss():
  def __init__(self): pass

  def loss(self, y, y_pred):
      return 0.5 * np.power((y - y_pred), 2)
  
  def acc(self, y, p):
    acc = np.count_nonzero(y[np.argmax(y, axis = 1) == np.argmax(p, axis = 1)])/len(y)
    return acc

  def gradient(self, y, y_pred):
      return -(y - y_pred)
    
class cross_entropy_softmax():
  def __init__(self): pass
  
  def loss(self, y, y_pred):
    m = y.shape[0]
    p = softmax(X)

    log_likelihood = -np.log(p[range(m),y])
    loss = np.sum(log_likelihood) / m
    return loss
  
  def gradient(self, y, y_pred):
    pass

###2.2.2 Activation Functions

**Softmax**

\begin{align}\sigma_i = \frac{e^{z_i}}{\sum_k e^{z_k}}\end{align}

softmax of 

\begin{align}
\left[
\begin{matrix}z_1, & z_2 & z_3, & ..., & z_k\end{matrix}
\right]\text{ > input of activation layer}
\end{align} is:

\begin{align}
\left[\begin{matrix}\frac{e^{z_1}}{\sum_k e^{z_k}}, & \frac{e^{z_2}}{\sum_k e^{z_k}} & \frac{e^{z_3}}{\sum_k e^{z_k}}, & ..., & \frac{e^{z_k}}{\sum_k e^{z_k}}\end{matrix}
\right]\text{ > output of activation layer}
\end{align}

**Derivative of Softmax**

Jacobian Matrix Definition from [Wiki](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant)

vectorized:

![Jacobian M](https://wikimedia.org/api/rest_v1/media/math/render/svg/74e93aa903c2695e45770030453eb77224104ee4)

or element-wisely:

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/f5be13db80a62a12665fad2f03414d1d445e9b10)

We need to calculate the Jacobian Matrix for Softmax, which is:

\begin{align}
\sigma'_{(z_i)} =
\left[
\begin{matrix}
{\sigma_1}'_{(z_1)}, &  {\sigma_1}'_{(z_2)}, & ... &{\sigma_1}'_{(z_k)}
\\{\sigma_2}'_{(z_1)}, &  {\sigma_2}'_{(z_2)}, & ... &{\sigma_2}'_{(z_k)}
\\{\sigma_3}'_{(z_1)}, &  {\sigma_3}'_{(z_2)}, & ... &{\sigma_3}'_{(z_k)}
\\...&...&...&...
\\{\sigma_k}'_{(z_1)}, &  {\sigma_k}'_{(z_2),}& ... &{\sigma_k}'_{(z_k)}\\
\end{matrix}
\right]
\end{align}



(1) if $i = j:$

\begin{align}
{\sigma_j}'_{(z_i)} &= (\frac{e^{z_i}}{\sum_k e^{z_k}})'_{(z_i)}
\\&=\frac{(e^{z_i})'_{(z_i)}(\sum_ke^{z_k})-(e^{z_i})(\sum_ke^{z_k})'_{(z_i)}}{(\sum_k e^{z_k})^2}
\\&=\frac{(e^{z_i})(\sum_ke^{z_k})-(e^{z_i})*1}{(\sum_k e^{z_k})^2}
\\&=\frac{e^{z_k}}{\sum_ke^{z_k}}(1-\frac{1}{\sum_k e^{z_k}})
\\& = \sigma_i(1-\sigma_i)
\end{align}

(2) if $i \ne j:$

\begin{align}
{\sigma_j}'_{(z_i)} &= (\frac{e^{z_j}}{\sum_k e^{z_k}})'_{(z_i)}
\\&=e^{z_j}*(-1)*\frac{(e^{z_i})}{(\sum_k e^{z_k})^2}
\\& =-\sigma_i\sigma_j
\end{align}

therefore,

$$
\sigma'_{(z_i)} = 
\left[
         \begin{matrix}
\sigma_1(1-\sigma_1), &  \sigma_1(0-\sigma_2),& ... &  \sigma_1(0-\sigma_k)\\
\sigma_2(0-\sigma_1), &  \sigma_2(1-\sigma_2),& ... &  \sigma_2(0-\sigma_k)\\
\sigma_3(0-\sigma_1), &  \sigma_3(0-\sigma_2),& ... &  \sigma_3(0-\sigma_k)\\
...&...&...&...\\
\sigma_k(0-\sigma_1) &  \sigma_k(0-\sigma_2)& ... &  
\sigma_k(1-\sigma_k)
         \end{matrix}
\right]
$$

or $$\sigma'_{(z_i)} =\sigma_j(\delta_{ij} - \sigma_i)$$

where

$$\delta_{ij}=
\begin{cases}
1& \text{i=j}\\
0& \text{i$\ne$j}
\end{cases}$$

**Derivative of Softmax with Cross Entropy Loss**

**Sigmoid**



<img src="http://drive.google.com/uc?export=view&id=1ZSX-N5o9Zf0R2PzeHCZHiTWdiqYstuMZ" width="500">


\begin{align}
S'(z_i) &= (\frac{1}{1+e^{-z_i}})'
\end{align}


**Derivatives of Sigmoid**

\begin{aligned}
S'(z) &= (\frac{1}{1+e^{-z}})' 
\\
&= \frac{e^{-z}}{(1+e^{-z})^{2}} 
\\
&= \frac{1+e^{-z}-1}{(1+e^{-z})^{2}}  
\\
&= \frac{1}{(1+e^{-z})} - \frac{1}{(1+e^{-z})^2}
\\
&= \frac{1}{(1+e^{-z})}(1-\frac{1}{(1+e^{-z})}) 
\\
&= S(z)(1-S(z))
\\
\end{aligned}

**Tanh**

<img src="http://drive.google.com/uc?export=view&id=14c37aZtyjsNYt_JcAZsA38UxNHBw8tf8" width="500">

** Relu **

<img src="http://drive.google.com/uc?export=view&id=165FrzviKDSkbX8a5MxHW6ZcJk-VohEjV" width="500">

** Leaky Relu **

<img src="http://drive.google.com/uc?export=view&id=1vn8D4CiAE4YvmP4p82K3qBlgHas-o4AG" width="500">

In [0]:
class Sigmoid():
    def __call__(self, x):
        return 1 / (1 + np.exp(-x))

    def gradient(self, x):
        return self.__call__(x) * (1 - self.__call__(x))
      
class Softmax():
    def __call__(self, x):
      """-np.max(x)it shifts all of elements in the vector to negative to zero, 
      and negatives with large exponents saturate to zero rather than the infinity, 
      avoiding overflowing and resulting in nan"""
      pass
      
class TanH():
    def __call__(self, x):
        return 2 / (1 + np.exp(-2*x)) - 1

    def gradient(self, x):
        return 1 - np.power(self.__call__(x), 2)

class ReLU():
    def __call__(self, x):
        return np.where(x >= 0, x, 0)

    def gradient(self, x):
        return np.where(x >= 0, 1, 0)

###2.2.3 MLP Algorithm

In [0]:
class MultilayerPerceptron():
  
  def __init__(self):
      pass
  
  def _init_param(self, X, y, num_hid, acti_hid, acti_out, loss, epoch=3000, lr=0.01):
      if acti_hid == 'relu':
        self.hid_activation = ReLU()
      elif acti_hid == 'sigmoid':
        self.hid_activation = Sigmoid()
      elif acti_hid == 'tanh':
        self.hid_activation = TanH()
      
      if acti_out == 'sigmoid':
        self.output_activation = Sigmoid()
      elif acti_out == 'softmax':
        self.output_activation = Softmax()
        
      if loss == 'cross_entropy':
        self.loss = CrossEntropy()
      elif loss == 'square_loss':
        self.loss = SquareLoss()

      self.epoch = epoch
      self.lr = lr
      self.num_hid = num_hid
      self.num_sample, self.num_input = X.shape
      self.num_output = y.shape[1]
      # Hidden layer
      self.w1 = np.random.uniform(size = (self.num_input, self.num_hid))
      self.b1 = np.zeros(shape = (1, self.num_hid))
      # Output layer
      self.w2 = np.random.uniform(size = (self.num_hid, self.num_output))
      self.b2 = np.zeros((1, self.num_output))
      
  def forwardpass(self, X):
      self.hid_layer_input = X @ self.w1 + self.b1
      self.hid_layer = self.hid_activation(self.hid_layer_input)
      self.output_layer_input = self.hid_layer @ self.w2 + self.b2 
      self.output_layer = self.output_activation(self.output_layer_input)   
      return self.output_layer
    
  def backprop(self, X, y):
      # grad. - loss to output layer input
      print(self.loss.gradient(y, self.output_layer).shape)
      print(self.output_activation.gradient(self.output_layer_input).shape)
      grad_output = self.loss.gradient(y, self.output_layer) * self.output_activation.gradient(self.output_layer_input)
      
      grad_w2 = self.hid_layer.T @ grad_output
      grad_b2 = np.sum(grad_output, axis=0, keepdims=True)
      # grad. - hid layer_1 param
      grad_hid_layer = grad_output.dot(self.w2.T) * self.hid_activation.gradient(self.hid_layer_input)
      grad_w1 = X.T.dot(grad_hid_layer)
      grad_b1 = np.sum(grad_hid_layer, axis=0, keepdims=True)

      self.w2 -= self.lr * grad_w2
      self.b2 -= self.lr * grad_b2
      self.w1 -= self.lr * grad_w1
      self.b1 -= self.lr * grad_b1

  def batch_grad(self, X, y, num_hid, acti_hid, acti_out, loss, epoch=3000, lr=0.01):
      self._init_param(X, y, num_hid, acti_hid, acti_out, loss, epoch=3000, lr=0.01)

      for epoch in range(self.epoch):
        y_pred = self.forwardpass(X)
        self.backprop(X, y)

        acc = self.loss.acc(y, y_pred)
        loss = np.sum(self.loss.loss(y, y_pred))/len(y)
            
        if epoch % 100 == 0:
          print('Epoch {} ---- Loss {:.4f} ---- Accuracy {:.4f}'.format(epoch+1, loss, acc))
       
  def predict(self, X):
    
      y_pred = self.fowardpass(X)

In [28]:
# Model_1

num_hid = 20
epoch = 100000
lr = 0.001
acti_hid = 'sigmoid'
acti_out = 'sigmoid'
loss = 'square_loss'

mlp = MultilayerPerceptron()
mlp.batch_grad(x, y, num_hid = num_hid, epoch = epoch, lr = lr, acti_hid = acti_hid, acti_out = acti_out, loss = loss) 

Epoch 1 ---- Loss 1.4947 ---- Accuracy 0.0000
Epoch 101 ---- Loss 0.8274 ---- Accuracy 0.0000
Epoch 201 ---- Loss 0.3290 ---- Accuracy 0.3841
Epoch 301 ---- Loss 0.3270 ---- Accuracy 0.3841
Epoch 401 ---- Loss 0.3248 ---- Accuracy 0.3841
Epoch 501 ---- Loss 0.3218 ---- Accuracy 0.3841
Epoch 601 ---- Loss 0.3178 ---- Accuracy 0.4967
Epoch 701 ---- Loss 0.3126 ---- Accuracy 0.6225
Epoch 801 ---- Loss 0.3058 ---- Accuracy 0.6424
Epoch 901 ---- Loss 0.2975 ---- Accuracy 0.6291
Epoch 1001 ---- Loss 0.2881 ---- Accuracy 0.6225
Epoch 1101 ---- Loss 0.2782 ---- Accuracy 0.6159
Epoch 1201 ---- Loss 0.2687 ---- Accuracy 0.6093
Epoch 1301 ---- Loss 0.2602 ---- Accuracy 0.6026
Epoch 1401 ---- Loss 0.2529 ---- Accuracy 0.6026
Epoch 1501 ---- Loss 0.2470 ---- Accuracy 0.6026
Epoch 1601 ---- Loss 0.2422 ---- Accuracy 0.6026
Epoch 1701 ---- Loss 0.2384 ---- Accuracy 0.6026
Epoch 1801 ---- Loss 0.2353 ---- Accuracy 0.6026
Epoch 1901 ---- Loss 0.2329 ---- Accuracy 0.6026
Epoch 2001 ---- Loss 0.2309 ----

In [0]:
# # Model_2

# num_hid = 24
# epoch = 100000
# lr = 0.0001
# acti_hid = 'relu'
# acti_out = 'softmax'
# loss = 'cross_entropy'

# mlp = MultilayerPerceptron()
# mlp.batch_grad(x, y, num_hid = num_hid, epoch = epoch, lr = lr, acti_hid = acti_hid, acti_out = acti_out, loss = loss) 

##2.3 MLP By Tensorflow

### 2.3.1. Create Some Constants

In [0]:
num_input = 13
num_hid = 8
num_output = 4
lr = 0.01
actf = tf.nn.sigmoid

### 2.3.2. Network Configuration

 - Define layer with its weights and bias variables
 - and activation functions

In [0]:
# create placeholder for x, y
x_ = tf.placeholder(dtype = tf.float32, shape = [None, num_input])
y_ = tf.placeholder(dtype = tf.float32, shape = [None, num_output])

# create container for param variables
hid_layer_param = {
'w': tf.Variable(tf.random_normal([num_input, num_hid])),
'b': tf.Variable(tf.zeros([num_hid]))}

output_layer_param = {
'w': tf.Variable(tf.random_normal([num_hid, num_output])),
'b': tf.Variable(tf.zeros([num_output]))}

hid_layer = actf(x_ @ hid_layer_param['w'] + hid_layer_param['b'])
output_layer = actf(hid_layer @ output_layer_param['w'] + output_layer_param['b'])

### 2.3.3. Define Loss Function, Accuracy Calculation

In [0]:
loss = tf.reduce_mean(tf.square(y - output_layer))

optimizer = tf.train.AdamOptimizer(lr)
train = optimizer.minimize(loss)

correct_prediction = tf.equal(tf.argmax(y, axis = 1), tf.argmax(output_layer, axis = 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

### 2.3.4. Training

In [0]:
def MLP(num_epoch, x_test, y_test):
    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        sess.run(init)

        for epoch in range(num_epoch):
            _, c, acc = sess.run([train, loss, accuracy], 
                                 feed_dict = {x_: x, y_: y})
                       
            if epoch % 100 == 0:
              print('epoch {} - loss: {:0.4f} - acc: {:0.4f}'.format(epoch, c, acc))
    
    return hid_layer_param, output_layer_param

In [182]:
num_epoch = 5000
MLP(num_epoch, x_test, y_test)

epoch 0 - loss: 0.2788 - acc: 0.3841
epoch 100 - loss: 0.1625 - acc: 0.3841
epoch 200 - loss: 0.1506 - acc: 0.6556
epoch 300 - loss: 0.1325 - acc: 0.6225
epoch 400 - loss: 0.1175 - acc: 0.6225
epoch 500 - loss: 0.1092 - acc: 0.6225
epoch 600 - loss: 0.1043 - acc: 0.6623
epoch 700 - loss: 0.1000 - acc: 0.7152
epoch 800 - loss: 0.0946 - acc: 0.7550
epoch 900 - loss: 0.0872 - acc: 0.8079
epoch 1000 - loss: 0.0788 - acc: 0.8675
epoch 1100 - loss: 0.0705 - acc: 0.8874
epoch 1200 - loss: 0.0631 - acc: 0.9139
epoch 1300 - loss: 0.0568 - acc: 0.9205
epoch 1400 - loss: 0.0515 - acc: 0.9272
epoch 1500 - loss: 0.0473 - acc: 0.9272
epoch 1600 - loss: 0.0437 - acc: 0.9272
epoch 1700 - loss: 0.0408 - acc: 0.9338
epoch 1800 - loss: 0.0383 - acc: 0.9338
epoch 1900 - loss: 0.0362 - acc: 0.9338
epoch 2000 - loss: 0.0345 - acc: 0.9404
epoch 2100 - loss: 0.0329 - acc: 0.9470
epoch 2200 - loss: 0.0315 - acc: 0.9470
epoch 2300 - loss: 0.0303 - acc: 0.9470
epoch 2400 - loss: 0.0292 - acc: 0.9536
epoch 2500 -

({'b': <tf.Variable 'Variable_13:0' shape=(8,) dtype=float32_ref>,
  'w': <tf.Variable 'Variable_12:0' shape=(13, 8) dtype=float32_ref>},
 {'b': <tf.Variable 'Variable_15:0' shape=(4,) dtype=float32_ref>,
  'w': <tf.Variable 'Variable_14:0' shape=(8, 4) dtype=float32_ref>})