# Introduction

This tutorial introduces some ideas in neural networks, such as fine-tuning. The hands-on part involves fine-tuning a convolutional neural network (CNN) to perform 3D object classification, based on the paper Multi-view Convolutional Neural Networks for 3D Shape Recognition.
The goal of this tutorial is to implement an MVCNN 3D object classifier. To make things clearer, let's first look at the architecture in the paper. The dataset consists of 3D objects, each rendered as 12 images from 12 viewpoints. Each image is passed through CNN1, the 12 resulting feature maps are combined by taking their element-wise maximum, and the result is the input of CNN2.
<img  src="arch.png">

The tutorial consists of two parts. First, we implement a single-view classifier, which uses only one view of each object and corresponds to the CNN1 part. Then we implement the multi-view classifier, which uses all 12 views of each object and combines CNN1 with CNN2.

# Environment

Before getting started, you'll need to install the libraries we will use. First, install Anaconda Python on your computer (my version is 4.4.9), then install the Keras library (2.0.2) and its dependency TensorFlow (1.0.0) with the conda command. Keras is an open-source neural network library written in Python. TensorFlow is an open-source software library for dataflow programming across a range of tasks. It is a symbolic math library and is also used for machine learning applications such as neural networks.

    conda install tensorflow
    conda install keras

Training CNNs is much more efficient on a computer with a GPU. However, the purpose of this tutorial is to give you a first look at how Keras and TensorFlow are used to train CNNs, so if you only have a CPU you can simply train for one epoch on your local machine.

In [2]:
from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras.models import Model
from keras.applications.resnet50 import preprocess_input
from keras.preprocessing import image

import numpy as np
import os, os.path
import glob
import random



Using TensorFlow backend.


# Preparing and loading data 

The dataset we use in this tutorial is the same one used in the paper, ModelNet-40. It contains the 40 categories of CAD models used to train our deep network. You can download the dataset [here](http://3dshapenets.cs.princeton.edu). Then put the dataset in the project folder. Here is the structure of the dataset.
<img  src="dataset.png">

To load the data, we first need to read the file names. There are 40 classes of models in the dataset. To make life easier, we can use only the first few of them by changing the nclasses parameter (the default is 40).
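
The folder layout that the loader expects can be sketched on a miniature, made-up dataset. This is only an illustration: the temporary directory, the two classes, and the three empty placeholder files per model are assumptions, and the path pattern is taken from the file path used later in this tutorial (`class/subset/model/*.png`).

```python
import glob
import os
import tempfile

# Hypothetical miniature dataset: 2 classes, 1 model each, 3 views per model.
root = tempfile.mkdtemp()
for cls in ['airplane', 'bathtub']:
    model_dir = os.path.join(root, cls, 'train', cls + '_0001.off')
    os.makedirs(model_dir)
    for view in range(3):
        # Empty placeholder files stand in for the rendered PNG views.
        open(os.path.join(model_dir, '%s_0001.%d.png' % (cls, view)), 'w').close()

# Class indices follow the alphabetical order of the class folders.
classes = sorted(os.listdir(root))
samples = []
for icls, cls in enumerate(classes):
    for model_dir in sorted(os.listdir(os.path.join(root, cls, 'train'))):
        views = sorted(glob.glob(os.path.join(root, cls, 'train', model_dir, '*.png')))
        samples.append((icls, views))

print(classes)
print([(icls, len(views)) for icls, views in samples])
```

Each entry pairs a class index with the list of view images of one model, which is the structure the rest of the tutorial works with.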

In [3]:
# Get the sub-directories of a directory
def subdirs(dirname):
    return [x for x in os.listdir(dirname) if os.path.isdir(os.path.join(dirname, x))]

def data_filenames(subset, src_dir='modelnet40', nclasses=40):
    """
    Arguments:
    - subset:       Either 'train' or 'test'
    - src_dir:      The root directory of the dataset
    - nclasses:     The number of classes to load (default is 40)

    Returns a list of (class_index, view_filenames) tuples, one per model.
    """
    classes = sorted(subdirs(src_dir))
    ans = []
    # Keep only the first nclasses classes (in alphabetical order).
    for (icls, cls) in enumerate(classes[:nclasses]):
        subset_dir = os.path.join(src_dir, cls, subset)
        for model_dir in subdirs(subset_dir):
            filenames = glob.glob(os.path.join(subset_dir, model_dir, '*.png'))
            ans.append((icls, filenames))
    return ans

Then we check the function by counting the models it returns.

In [4]:
print(len(data_filenames('test',nclasses=10)))

710


The dataset consists of images, so we need a function that reads an image with the help of the Keras library. We also need to call the preprocessing function of the ResNet50 model on each image, because we will use this model for training later. The target size of the image is (224, 224).
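
To make the preprocessing step less of a black box, here is a minimal numpy sketch of what the ResNet50 `preprocess_input` does, assuming its default behaviour of reordering RGB to BGR and subtracting the ImageNet channel means. The function name `preprocess_sketch` is made up for this illustration and is not part of Keras.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumed default of the ResNet50 preprocessing).
BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype='float32')

def preprocess_sketch(x):
    """Illustrative re-implementation: RGB -> BGR, then subtract the channel means."""
    x = x[..., ::-1].astype('float32')   # reverse the channel axis (RGB -> BGR)
    return x - BGR_MEAN

# A black 2 x 2 RGB image with a leading batch axis, shape (1, 2, 2, 3).
black = np.zeros((1, 2, 2, 3), dtype='float32')
out = preprocess_sketch(black)
print(out[0, 0, 0])
```

A black pixel therefore ends up as roughly (-103.939, -116.779, -123.68), the same values that appear in the output of read_image in this tutorial.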

In [5]:
def read_image(filename, target_size=(224, 224)):
    x = image.load_img(filename, target_size=target_size)
    x = image.img_to_array(x)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    return x

In [6]:
print(read_image('ModelNet40/airplane/train/airplane_0001.off/airplane_0001.0.png'))

[[[[-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]
   ..., 
   [-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]]

  [[-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]
   ..., 
   [-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]]

  [[-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]
   ..., 
   [-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]]

  ..., 
  [[-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -116.77899933 -123.68000031]
   [-103.93900299 -

Next we need to implement a generator that returns images and classes from ModelNet-40 in batches of size one.

First we read the file names for the requested subset (train or test) of the ModelNet-40 dataset with the function data_filenames.

By default, the generator yields an infinite number of elements, each of the form (x, y), where x is an input for supervised training and y is the expected output.
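
The idea of an endless generator that reshuffles on every pass can be sketched without any image data. The helper name `endless` and the fixed seed are assumptions made only for this example.

```python
import itertools
import random

def endless(items, seed=0):
    """Yield items forever, reshuffling at the start of every pass."""
    rng = random.Random(seed)
    items = list(items)
    while True:
        rng.shuffle(items)
        for item in items:
            yield item

gen = endless(['a', 'b', 'c'])
# Take 7 elements: two full passes plus one element of a third pass.
taken = list(itertools.islice(gen, 7))
print(taken)
```

Because the generator never stops on its own, the caller decides how many elements make up one epoch.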

If single is True (single-view mode), then x is a 4D numpy array with shape 1 x h x w x 3 representing one input image (h for height, w for width), chosen at random from the 12 views of the object, and y is a 2D numpy array of shape 1 x nclasses, where nclasses is the number of classes in ModelNet-40 (the nclasses parameter, 40 by default).

If single is False (multi-view mode), then x is a list of arrays representing different views of the same model. The list has length nviews (12 in this dataset), and each view is a numpy array of an image with shape 1 x h x w x 3. Stacked together, x therefore corresponds to a 5D array with shape 12 x 1 x h x w x 3.
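
The shapes described above can be checked with placeholder data. In this sketch, arrays of zeros stand in for real images, and the helper `one_hot` and the class index 4 are chosen only for illustration.

```python
import numpy as np

h, w, nclasses, nviews = 224, 224, 40, 12

def one_hot(cls, nclasses):
    """Return a (1, nclasses) float32 array with a 1.0 at index cls."""
    y = np.zeros((1, nclasses), 'float32')
    y[0, cls] = 1.0
    return y

# Single-view sample: one image with a leading batch axis.
x_single = np.zeros((1, h, w, 3), 'float32')
# Multi-view sample: a list of nviews such arrays (zeros stand in for images).
x_multi = [np.zeros((1, h, w, 3), 'float32') for _ in range(nviews)]
y = one_hot(4, nclasses)

print(x_single.shape, y.shape)
print(len(x_multi), x_multi[0].shape)
print(np.stack(x_multi).shape)
```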

In [7]:
def data_generator(subset, single=True, frac=1.0, nclasses=40):
    """
    Arguments:
    - subset:       Either 'train' or 'test'
    - single:       If true, yield one image and class at a time in the format (img, cls), for the single-view classifier
                    If false, yield (x, cls), where x is a list of images for all (12) views, for the multi-view classifier
    - frac:         Fraction of dataset to load (use frac < 1.0 for quick tests).
    - nclasses:     The number of classes (default is 40)
    """
    # Pass nclasses through so that class indices always fit the one-hot array.
    filenames = data_filenames(subset, nclasses=nclasses)
    def generator_func():
        while True:
            random.shuffle(filenames)
            for (cls, view) in filenames[:int(len(filenames)*frac)]:
                # One-hot encode the class index.
                cls_array = np.zeros((1, nclasses), 'float32')
                cls_array[0, cls] = 1.0
                if single:
                    filename = random.choice(view)
                    yield (read_image(filename), cls_array)
                else:
                    yield ([read_image(view_elem) for view_elem in view], cls_array)
    # Note: the returned size is the full dataset size, regardless of frac.
    return (generator_func(), len(filenames))

Here, to make a simple test, we set frac to 0.01.

In [8]:
(g, dataset_size) = data_generator('test', single=True, frac=0.01)
(x1, y1) = next(g)
print('When single is True:')
print('dataset size:' + str(dataset_size))
print(len(x1))
print(y1)

(g2, dataset_size2) = data_generator('test', single=False, frac=0.01)
(x2, y2) = next(g2)
print('When single is False:')
print('dataset size:' + str(dataset_size2))
print(len(x2))
print(y2)

When single is True:
dataset size:2468
1
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   1.  0.  0.  0.]]
When single is False:
dataset size:2468
12
[[ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.]]


Finally, we can get the data generators for both the training and test data.

In [9]:
(train_generator,dataset_size_train) = data_generator('train')
(validation_generator, dataset_size_val) = data_generator('test')
print(dataset_size_train)
print(dataset_size_val)

9843
2468


# Single view classifier

In this part, we will implement a single-view classifier step by step.

### Fine-tuning

In neural network training, fine-tuning means taking a model that was already trained on one task and continuing to train it, adjusting its weights, on a new task. In this tutorial, we fine-tune an existing CNN model, ResNet-50, which is already included in keras.applications.

First, create a base model from the pretrained ResNet-50 without the fully-connected layer at the top of the network.

In [10]:
base_model = applications.ResNet50(weights = 'imagenet',include_top=False, input_shape=(224,224,3))

Then we add a Flatten layer followed by a Dense layer with 40 outputs and softmax as the activation.

Here we use the Sequential model in Keras, which is a linear stack of layers. The Sequential model is relatively simple: each layer feeds only the next one, with no branches or shared layers. For this single-view classifier, the Sequential model is all we need.

The softmax function is often used in the final layer of a neural-network classifier, because it turns the raw outputs into a probability distribution over the classes.
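
As a small illustration of the softmax function, here is a numpy sketch. The three raw scores are made up, and this is the textbook formula rather than the Keras implementation itself.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)   # shifting does not change the result
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1]])   # hypothetical raw outputs for 3 classes
probs = softmax(scores)
print(probs)
print(probs.sum(), probs.argmax())
```

The outputs are positive and sum to 1, and the largest raw score gets the largest probability, which is why the result can be compared directly with a one-hot label.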

In [11]:
top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(40, activation='softmax'))
model = Model(inputs= base_model.input, outputs= top_model(base_model.output))

Training all the layers sometimes gives worse results than keeping some of them fixed. So here we do not fine-tune every layer; instead we set the first fraction p of the layers (those closest to the input) to be not trainable. I use p = 0.5 in this tutorial, but you can experiment to find the best p.
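
To see which layers the slice `model.layers[:int(len(model.layers)*p)]` selects, here is a sketch with stand-in objects. `FakeLayer` is a hypothetical class invented for this example; it only mimics the `trainable` flag of a Keras layer.

```python
class FakeLayer(object):
    """Hypothetical stand-in for a Keras layer; only the trainable flag matters."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

layers = [FakeLayer('layer_%d' % i) for i in range(10)]
p = 0.5
# Same slicing as in this tutorial: freeze the first int(len * p) layers.
for layer in layers[:int(len(layers) * p)]:
    layer.trainable = False

print([layer.trainable for layer in layers])
```

With 10 layers and p = 0.5, the first five layers are frozen and the last five stay trainable.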

In [12]:
def set_not_train(model, p):
    for layer in model.layers[:int(len(model.layers)*p)]:
        layer.trainable = False

In [13]:
set_not_train(model,0.5)

After that, we need to compile the model. Here we use the categorical_crossentropy loss and the accuracy metric.
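
For intuition, the loss and the metric can be computed by hand on a tiny example. This is a sketch of the standard definitions with made-up labels and predictions, not the Keras code; the clipping constant `eps` is an assumption.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -sum(y_true * log(y_pred)) over the batch."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the label."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))

# Two hypothetical samples with 3 classes: one-hot labels and predicted probabilities.
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])

print(categorical_crossentropy(y_true, y_pred))
print(accuracy(y_true, y_pred))
```

The first sample is classified correctly and the second is not, so the accuracy is 0.5, while the loss also reflects how confident each prediction was.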

In [14]:
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])

Finally, we call the Keras model method fit_generator() with both the training and test generators. We also need to set the number of epochs, which is the number of full passes over the training data. Note that we do not need to run it until convergence.

In [56]:
model.fit_generator(
    train_generator,
    steps_per_epoch=dataset_size_train,
    epochs=20,
    validation_data=validation_generator,
    # Keras 2 name for the number of validation batches (nb_val_samples is the Keras 1 name).
    validation_steps=dataset_size_val
)

  


Epoch 1/20
1035/9843 [==>...........................] - ETA: 6129s - loss: 3.1274 - acc: 0.2512

KeyboardInterrupt: 

### Batch generator

Let's look at the result above, which I ran on my local machine without a GPU. The ETA is more than 6000s. The inefficiency arises because the generator produces only one image at a time, which means the batch size is 1. A larger batch size speeds up training. However, the larger the batch size, the more memory we need. Here we set the batch size to 16.

Here we build another function that generates batches from the generator above.

In [21]:
def batch_generator(generator, batch_size):
    """Group single samples from `generator` into batches of `batch_size`."""
    def generator_func():
        while True:
            xx = np.empty((batch_size, 224, 224, 3))
            yy = np.empty((batch_size, 40))
            for i in range(batch_size):
                # Each sample has a leading batch axis of size 1; drop it.
                (x, y) = next(generator)
                xx[i] = x[0]
                yy[i] = y[0]
            yield xx, yy
    return generator_func()

Next we use the new generator for the training and test data of fit_generator. Be careful: the number of steps per epoch and the number of validation steps both need to be divided by the batch size.
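
The step counts are plain integer arithmetic. Here is a short sketch using the dataset sizes printed earlier in this tutorial; the variable names `train_steps` and `val_steps` are chosen only for this example.

```python
dataset_size_train = 9843   # training models, as printed earlier in this tutorial
dataset_size_val = 2468     # test models, as printed earlier in this tutorial
batch_size = 16

# Integer division counts only the full batches.
train_steps = dataset_size_train // batch_size
val_steps = dataset_size_val // batch_size
print(train_steps, dataset_size_train - train_steps * batch_size)
print(val_steps, dataset_size_val - val_steps * batch_size)
```

With a batch size of 16 this gives 615 training steps and 154 validation steps per epoch. The few leftover samples are not lost: because the generator is endless, they simply roll over into the next epoch.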

In [None]:
batch_size = 16
model.fit_generator(
    batch_generator(train_generator, batch_size),
    steps_per_epoch=dataset_size_train // batch_size,
    epochs=20,
    validation_data=batch_generator(validation_generator, batch_size),
    # Keras 2 name for the number of validation batches (nb_val_samples is the Keras 1 name).
    validation_steps=dataset_size_val // batch_size
)



Epoch 1/20


# Multi view classifier

Next, we implement a more complex network that uses multiple views of an object. Each object in the dataset has 12 views. In the first part, we used only one of them to recognize the 3D object, which is clearly less accurate than using all of them. In part 1 we used the Sequential model, but here we use the functional API of Keras, which is designed for more complex models. We need it here because the network has 12 inputs that all share the same CNN1 layers.

### CNN1

As the architecture in the image above shows, the network has two CNNs. For CNN1 we truncate ResNet50 at layer index 34. I chose 34 with the 34-layer ResNet from the [ResNet paper](https://arxiv.org/pdf/1512.03385.pdf) in mind; note that cutting ResNet50 at Keras layer 34 does not reproduce that 34-layer architecture, so treat the cut point as a parameter you can tune. Be careful here: each object has 12 images, and each of them needs its own input, but all of them share CNN1 with the same parameters.
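
Weight sharing can be illustrated without Keras. In this sketch a single random matrix `W` followed by ReLU plays the role of CNN1, and 12 made-up feature vectors play the role of the views; all names and sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical tiny "CNN1": one weight matrix followed by ReLU.
W = rng.randn(8, 4)

def cnn1(view):
    """Apply the shared weights W to one view (a feature vector of length 8)."""
    return np.maximum(view.dot(W), 0.0)

# 12 views of the same object, each a vector of 8 made-up features.
views = [rng.randn(8) for _ in range(12)]
features = [cnn1(v) for v in views]   # 12 outputs, all produced by the same W

print(len(features), features[0].shape)
```

There are 12 separate inputs and 12 separate outputs, but only one set of parameters, so a training update to `W` would affect how every view is processed.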

In [1]:
from keras.layers import Input

base_model = applications.ResNet50(weights='imagenet', include_top=False)
# Build the truncated network once so that all 12 views share its weights.
cnn1 = Model(inputs=base_model.input, outputs=base_model.layers[34].output)

input_instances = []
models = []
for i in range(12):
    # One input per view; each is passed through the same cnn1.
    input_instances.append(Input(shape=(224, 224, 3)))
    models.append(cnn1(input_instances[i]))

### CNN2

For CNN2, the input is the element-wise maximum of the 12 outputs from CNN1. Then we build one Conv2D layer. Conv2D is a 2D convolution layer, which is often used in image-related problems. After that we use batch normalization, which helps stabilize training and can reduce vanishing gradients.
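
The `maximum` merge used in this tutorial combines the 12 CNN1 outputs element by element. Here is a numpy sketch of that view pooling step; the small random feature maps are made up and far smaller than the real ones.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical CNN1 outputs for 12 views: each a 2 x 2 feature map with 3 channels.
view_features = [rng.rand(2, 2, 3) for _ in range(12)]

# View pooling: element-wise maximum across the 12 views.
stacked = np.stack(view_features)        # shape (12, 2, 2, 3)
pooled = stacked.max(axis=0)             # shape (2, 2, 3)

print(stacked.shape, pooled.shape)
# Every pooled value is at least as large as the value from any single view.
print(all(np.all(pooled >= f) for f in view_features))
```

The pooled tensor has the same shape as a single view's output, so CNN2 does not need to know how many views went in.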

In [4]:
from keras.layers.convolutional import Conv2D
from keras.layers.normalization import  BatchNormalization
from keras.layers.merge import maximum
max_tensor = maximum(models)
x = Conv2D(filters=40, kernel_size = 5)(max_tensor)
x = BatchNormalization()(x)

Next, as in part 1, we add a Flatten layer followed by a Dense layer with 40 outputs and softmax as the activation.

In [None]:
x = Conv2D(filters=40, kernel_size = 5)(max_tensor)
x = BatchNormalization()(x)
x = Flatten()(x)
x = Dense(40, activation='softmax')(x)

model = Model(inputs= input_instances, outputs= x)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])

To improve performance, we also batch the data. Be careful here: the single parameter of the data generator must be False.

In [None]:
(train_generator,dataset_size_train) = data_generator('train',single = False)
(validation_generator, dataset_size_val) = data_generator('test', single = False)

def batch_generator(generator, batch_size):
    """Group multi-view samples into batches: one array per view."""
    def generator_func():
        while True:
            # One (batch_size, 224, 224, 3) array for each of the 12 views.
            xx = [np.empty((batch_size, 224, 224, 3)) for _ in range(12)]
            yy = np.empty((batch_size, 40))
            for i in range(batch_size):
                (x, y) = next(generator)
                for k in range(12):
                    # x[k] has shape (1, 224, 224, 3); drop the batch axis.
                    xx[k][i] = x[k][0]
                yy[i] = y[0]
            yield xx, yy
    return generator_func()

Finally, use the fit_generator function to train the model as in part 1.

In [None]:
batch_size = 500
model.fit_generator(
    batch_generator(train_generator, batch_size),
    steps_per_epoch=dataset_size_train // batch_size,
    epochs=20,
    validation_data=batch_generator(validation_generator, batch_size),
    # Keras 2 name for the number of validation batches (nb_val_samples is the Keras 1 name).
    validation_steps=dataset_size_val // batch_size
)

# Summary and Result

In this tutorial, I walked through the process of implementing a classifier for 3D objects. After this tutorial, you should be familiar with the following:

1. A basic understanding of two kinds of Keras model, the Sequential model and the functional API, and the difference between them.
2. Different types of layers in Keras, such as Dense, Flatten, and Conv2D.
3. Some concepts of deep learning, such as fine-tuning.

Finally, here are some results from AWS. Running until convergence takes too long, so I give the results after the first epoch as a reference.
For the single-view classifier, the accuracy after the first epoch is 55%.
For the multi-view classifier, the accuracy after the first epoch is 58%.

# References

1. Keras library: https://keras.io
2. Deep Residual Learning for Image Recognition: https://arxiv.org/pdf/1512.03385.pdf
3. The ModelNet-40 dataset: http://3dshapenets.cs.princeton.edu
4. Multi-view Convolutional Neural Networks for 3D Shape Recognition: https://arxiv.org/abs/1505.00880