# CONVOLUTIONAL NEURAL NETWORKS

# Reusing weights in multiple places

* When you need to detect the same feature in multiple places   
  ex. similar strokes in different parts of a digit
  
* Overfitting often caused by having more parameters than necessary to learn a specific dataset
  * difficult to avoid when NNs have lots of parameters but not very many training examples
  * regularization: a means to countering it, but not the only technique, and not most ideal
  
* Overfitting is about the **ratio** between:
  * the number of weights in the model 
  * and the number of datapoints it has to learn those weights
  
* To avoid it, it's better to use something loosely defined as **_structure_**:   
  * = reuse weights for multiple purposes when same pattern needs to be detected in multiple places   
  * can significantly **reduce overfitting** and lead to much **more accurate** models
  * because reduces weight-to-data ratio
  
* Usually, removing parameters makes a model less expressive (less able to learn patterns)
  * but if you're clever in where to reuse weights
  * model can be equally expressive and more robust to overfitting
* Also tends to make the model smaller (fewer actual parameters to store)
* **Convolution**: the most famous and widely spread structure in neural networks (called *convolutional layer* when used as a layer)

# The convolutional layer

* instead of a single big dense linear layer (fully-connected),   
  lots of very small linear layers
  * usually fewer than 25 inputs and a single output
  * reused in every position
  * called a convolutional **_kernel_**   
<br/>   

* **Example**: 3\*3 convolutional kernel:
  * predicts in its current location,
  * moves 1 pixel to the right and predicts again
  * and so on, until right edge reached
  * moves down 1 pixel and predicts
  * and so on, until bottom edge reached
* **Result** of one kernel is a smaller matrix of kernel predictions (input to next layer)
  * 8\*8 image with 3\*3 convolutional kernel: output is a 6\*6 matrix   
<br/>   

* Convolutional layers usually have **multiple kernels**, each producing a prediction matrix:
  * you can sum the prediction matrices elementwise = **_sum pooling_**
  * or take the mean elementwise = **_mean pooling_**
  * or take maximum value elementiwe = **_max pooling_** -> most popular
  * only **final matrix** is forward propagated into next layers
* Allows each kernel to learn a particular pattern then search for it somewhere in the image
  * = small set of weights training over a much larger set of training examples
  * each mini-kernel forward propagated multiple times on multiple segments of data
  * thus changing the ratio of weights to datapoints on which those weights are being trained
  * drastically **reduces ability to overfit** and **increases ability to generalize**

# Example code

#### Loading data

In [1]:
import sys
import numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, y_train = x_train[0:1000], y_train[0:1000]
num_train_images = len(x_train)
num_test_images = len(x_test)

# IMAGES
train_images = x_train.reshape(num_train_images, 28 * 28) / 255  # shape = (1000, 784)
test_images = x_test.reshape(num_test_images, 28*28) / 255

# LABELS: one_hot vectors
# label '4' = [0, 0, 0, 0, 1, 0, 0...]
train_labels = np.zeros((len(y_train), 10))
for i,l in enumerate(y_train):
    train_labels[i][l] = 1

test_labels = np.zeros((len(y_test), 10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1

#### Network parameters

In [2]:
alpha = 2
iterations = 700
batch_size = 128

pixels_per_image = 784
num_labels = 10

image_rows, image_cols = 28, 28
kernel_rows, kernel_cols = 3, 3
num_kernels = 16

hidden_size = ( (image_rows - kernel_rows) * (image_cols - kernel_cols) ) * num_kernels   # (25*25)*16 = 10000
kernels = 0.02 * np.random.random((kernel_rows * kernel_cols, num_kernels)) - 0.01        # shape = 9*16
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.01                    # shape = 10000*10

train_acc, test_acc = 0.0, 0.0

#### Utilitarian functions

In [3]:
np.random.seed(1)

tanh = lambda x: np.tanh(x)
tanh2deriv = lambda output: 1 - (output ** 2)
def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)

# Convolutional kernels

def get_image_section(layer, row_from, row_to, col_from, col_to):
    '''Selects the same subregion in a batch of images.'''
    section = layer[:,row_from:row_to,col_from:col_to]
    section = section.reshape(-1, 1, row_to - row_from, col_to - col_from)
    return section

def get_image_sections(layer, kernel_rows, kernel_cols):
    '''Creates a list of all the subregions in a batch of images.'''
    sections = list()
    n_rows = layer.shape[1]
    n_cols = layer.shape[2]
    
    for row_start in range(n_rows - kernel_rows):
        for col_start in range(n_cols - kernel_cols):
            section = get_image_section(layer,
                                        row_start,
                                        row_start+kernel_rows,
                                        col_start,
                                        col_start+kernel_cols)
            sections.append(section)
    
    return sections

def expand_sections(sections):
    '''Concatenates and reshapes the list of subregions from a batch of images.'''
    expanded_sections = np.concatenate(sections, axis=1)
    exp_shape = expanded_sections.shape
    flat_sections = expanded_sections.reshape(exp_shape[0] * exp_shape[1], -1)
    return (flat_sections, exp_shape)

#### Learning and predicting on training data

In [4]:
for j in range(iterations):
    
    # LEARNING AND PREDICTING ON TRAINING DATA
    
    train_correct_cnt = 0
    
    for i in range(num_train_images // batch_size):
        batch_start, batch_end = (i * batch_size), ((i+1) * batch_size)
        
        # LAYER 0
        layer_0 = train_images[batch_start:batch_end]         # shape = (28, 28)
        layer_0 = layer_0.reshape(layer_0.shape[0], 28, 28)   # shape = (128, 28, 28)
        
        # exp_shape = (128, 1875, 3) - flat_sections_trn.shape = (240000, 3) - kernels.shape = (9, 16)
        trn_sections = get_image_sections(layer_0, kernel_rows, kernel_cols)
        (flat_sections_trn, exp_shape) = expand_sections(trn_sections)
        kernel_output = flat_sections_trn.dot(kernels)  # set of 2-dim images (output of each kernel in each image position)
        
        # LAYER 1
        layer_1 = tanh(kernel_output.reshape(exp_shape[0], -1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        
        # LAYER 2
        layer_2 = softmax(np.dot(layer_1, weights_1_2))
        
        for k in range(batch_size):
            target_labels = train_labels[batch_start+k:batch_start+k+1]
            train_correct_cnt += int( np.argmax(layer_2[k:k+1]) == np.argmax(target_labels) )
        train_acc = train_correct_cnt / float(num_train_images)
        
        layer_2_delta = (train_labels[batch_start:batch_end] - layer_2) / (batch_size * layer_2.shape[0])
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * tanh2deriv(layer_1)
        layer_1_delta *= dropout_mask
        
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        l1d_reshape = layer_1_delta.reshape(kernel_output.shape)
        k_update = flat_sections_trn.T.dot(l1d_reshape)
        kernels -= alpha * k_update
        
        if (j == 0) and (i == 0):
            print(f'layer_0.shape {layer_0.shape}')   # (128, 28, 28)
            print(f'kernel_output.shape {kernel_output.shape}')  
            print('exp_shape {} flat_sections_trn.shape {} kernels.shape {}'.format(exp_shape,
                                                                                    flat_sections_trn.shape,
                                                                                    kernels.shape))
            print('layer_1.shape {} dropout_mask.shape {}'.format(layer_1.shape,     # (128, 10000)
                                                              dropout_mask.shape))   # (128, 10000)
            print()
    
    # LEARNING AND PREDICTING ON TEST DATA
    
    test_correct_cnt = 0
    
    for m in range(num_test_images):
        
        # LAYER 0
        layer_0 = test_images[m:m+1]
        layer_0 = layer_0.reshape(layer_0.shape[0], 28, 28)
        layer_0.shape
        
        tst_sections = get_image_sections(layer_0, kernel_rows, kernel_cols)
        (flat_sections_tst, exp_shape) = expand_sections(tst_sections)
        kernel_output = flat_sections_tst.dot(kernels)
        
        # LAYER 1
        layer_1 = tanh(kernel_output.reshape(exp_shape[0], -1))
        
        # LAYER 2
        layer_2 = np.dot(layer_1, weights_1_2)
        
        test_correct_cnt += int( np.argmax(layer_2) == np.argmax(test_labels[m:m+1]) )
        test_acc = test_correct_cnt / float(num_test_images)
    
    if (j % 20 == 0) or (j == iterations-1):
        print('Iter ' + str(j) + ' Train-Acc ' + str(train_acc)[0:5] + ' Test-Acc ' + str(test_acc)[0:5])
        

layer_0.shape (128, 28, 28)
kernel_output.shape (80000, 16)
exp_shape (128, 625, 3, 3) flat_sections_trn.shape (80000, 9) kernels.shape (9, 16)
layer_1.shape (128, 10000) dropout_mask.shape (128, 10000)

Iter 0 Train-Acc 0.066 Test-Acc 0.022
Iter 20 Train-Acc 0.037 Test-Acc 0.028
Iter 40 Train-Acc 0.191 Test-Acc 0.330
Iter 60 Train-Acc 0.601 Test-Acc 0.795
Iter 80 Train-Acc 0.697 Test-Acc 0.838
Iter 100 Train-Acc 0.73 Test-Acc 0.857
Iter 120 Train-Acc 0.77 Test-Acc 0.864
Iter 140 Train-Acc 0.765 Test-Acc 0.870
Iter 160 Train-Acc 0.773 Test-Acc 0.872
Iter 180 Train-Acc 0.781 Test-Acc 0.876
Iter 200 Train-Acc 0.79 Test-Acc 0.875
Iter 220 Train-Acc 0.79 Test-Acc 0.876
Iter 240 Train-Acc 0.808 Test-Acc 0.879
Iter 260 Train-Acc 0.814 Test-Acc 0.881
Iter 280 Train-Acc 0.815 Test-Acc 0.880
Iter 300 Train-Acc 0.815 Test-Acc 0.88
Iter 320 Train-Acc 0.813 Test-Acc 0.881
Iter 340 Train-Acc 0.824 Test-Acc 0.879
Iter 360 Train-Acc 0.832 Test-Acc 0.878
Iter 380 Train-Acc 0.828 Test-Acc 0.877
Iter 40

# Summary

* Here we use one layer for all convolutions,   
  but most convolutions stack multiple stacked layers:   
  * each convolutional layer treats the previous layers as an image
  * one of the main developments that allowed very deep neural networks (and the popularization of the phrase **_deep learning_**)
  * a **landmark** in the field and major progress!  
<br/>   

* Independently learning different types of information in different sets of weights (instead of all together),
  * for ex. image of a cat = colors, lines and edges, corners, small shapes
  * sets of **lower-level features**
* Then combining the different types of low-level features that correspond to the output   
  
<p style="background:#DDEEEE; padding: 15px; border-left: 7px solid red;">
Using the same piece of intelligence, and the <b>same weights</b>, in multiple places:  
<br/>    
makes the <b>weights more intelligent</b> by giving them more samples to learn from, <b>increasing generalization</b>
</p>

* Many of the biggest developments in deep learning over the past five years are iterations of this idea:
    * convolutions
    * recurrent neural networks (RNNs)
    * word embeddings
    * capsule networks

* When you know a network will need the same idea in multiple places, force it to use the same weights in those places