In [4]:
import imgaug as ia
from imgaug import augmenters as iaa
import numpy as np
import imageio
import os
import tensorflow as tf

# Notes
- imbalanced dataset
    - check data distribution
- large file count and size
- do we need multiple GPUs?


## Bag of Tricks

Link: https://towardsdatascience.com/a-big-of-tricks-for-image-classification-fec41eb28e01

### Large batch size
- (1) Distributed training: Split up your training over multiple GPUs. On each training step, your batch will be split up across the available GPUs. For example, if you have a batch size of 8 and 8 GPUs, then each GPU will process one image.

- (2) Changing the batch and image size during training: Part of the reason why many research papers are able to report the use of such large batch sizes is that many standard research datasets have images that aren’t very big. When training networks on ImageNet, for example, most state-of-the-art networks use crops between 200 and 350 pixels; with such small images, large batches fit in memory easily.

- To get around this small bump in the road, you can start off your training with smaller images and a larger batch size. Do this by downsampling your training images. You’ll then be able to fit many more of them into one batch. With the large batch size + small images you should already be able to get some decent results. To complete the training of your network, fine-tune it with a smaller learning rate and larger images with a smaller batch size.
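The trade-off between image size and batch size can be sketched with a little arithmetic. The snippet below is only an illustration: the function name, the reference point (64 images at 224 pixels) and the image sizes are assumptions, and real memory use does not scale exactly with pixel count.

```python
def batch_size_for(image_size, ref_image_size=224, ref_batch_size=64):
    """Scale the batch size so the pixel count per batch stays roughly constant.

    The reference values are hypothetical; pick the largest batch that fits
    on your own hardware at the reference image size.
    """
    scale = (ref_image_size / image_size) ** 2
    return max(1, int(ref_batch_size * scale))


# Hypothetical schedule: small images with big batches first, full size last.
schedule = [(size, batch_size_for(size)) for size in (112, 160, 224)]
for size, batch in schedule:
    print(f"image size {size}px -> batch size {batch}")
```

Halving the image side length quarters the pixel count, so roughly four times as many images fit in one batch.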

### Refined training models
- So, start off with Adam: just set a learning rate that’s not absurdly high (0.0001 is a common choice; the default in Keras and PyTorch is 0.001) and you’ll usually get some very good results. Then, once your model starts to saturate with Adam, fine-tune with SGD at a smaller learning rate to squeeze in that last bit of accuracy!

### Transfer learning
- In general, models with higher accuracy (relative to each other on the same dataset) will be better for transfer learning and get you better final results. The only other thing to be aware of is to choose your pre-trained network for transfer learning in accordance with your target task. For example, using a network pre-trained for self-driving cars on a dataset for medical imaging wouldn't be such a great idea; it’s a huge gap between the domains as the data itself is quite different

### Fancy Data Augmentation
- Another technique which is now commonly used on the very latest ImageNet models is Cutout Regularisation. Despite the name, cutout can really be seen as a form of data augmentation that handles occlusion. Occlusion is an extremely common challenge in real-world applications, especially in the hot computer vision areas of robotics and self-driving cars. By quite literally applying a form of occlusion to the training data, we effectively adapt our network to be more robust to it.
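A minimal NumPy sketch of cutout, assuming a single image stored as a height x width x channels array. The function name, the square mask shape and the zero fill value are illustrative choices, not something the article prescribes.

```python
import numpy as np


def cutout(image, mask_size, rng=None, fill_value=0):
    """Return a copy of `image` with one random square region blanked out."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]

    # Pick the centre of the mask anywhere in the image; the mask may
    # therefore hang over the border and be partially clipped.
    centre_y = rng.integers(0, height)
    centre_x = rng.integers(0, width)
    y1 = centre_y - mask_size // 2
    x1 = centre_x - mask_size // 2
    y2, x2 = y1 + mask_size, x1 + mask_size
    y1, x1 = max(0, y1), max(0, x1)
    y2, x2 = min(height, y2), min(width, x2)

    occluded = image.copy()
    occluded[y1:y2, x1:x2] = fill_value
    return occluded


original = np.ones((32, 32, 3), dtype=np.float32)
augmented = cutout(original, mask_size=8, rng=np.random.default_rng(0))
```

Like the other augmentations, this would only be applied to training images.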

## Winning Approaches (NO CODE)
Link: https://medium.com/neuralspace/kaggle-1-winning-approach-for-image-classification-challenge-9c1188157a86

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets.[14]

Training a Convolutional Neural Network from scratch is rarely practical. So, we take the weights of a CNN pre-trained on ImageNet (1000 classes) and fine-tune it, keeping some layers frozen and training the unfrozen ones.

We will use Keras for initial benchmarks as Keras provides a number of pretrained models and we will use the ResNet50 and InceptionResNetV2 for our task. It is important to benchmark the dataset with one simple model and one very high end model to understand if we are overfitting/underfitting the dataset on the given model.

### Imbalanced dataset

So, we tried two approaches to balance the data:

1. Adaptive synthetic sampling approach for imbalanced learning (ADASYN): ADASYN generates synthetic data for classes with fewer samples, in such a way that more synthetic data is generated for samples that are harder to learn than for samples that are easier to learn.

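2. Synthetic Minority Over-sampling Technique (SMOTE): SMOTE over-samples the minority class by creating synthetic samples that interpolate between an existing minority sample and one of its nearest minority-class neighbours. The original method combines this with under-sampling of the majority class to get the best results.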
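The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration on feature vectors, not the reference implementation: the function name and parameters are made up for the example, it uses brute-force neighbour search, and it assumes `k` is smaller than the number of minority samples.

```python
import numpy as np


def smote_like_oversample(minority, n_new, k=3, rng=None):
    """Create `n_new` synthetic points between minority samples and neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    minority = np.asarray(minority, dtype=float)
    n_samples = len(minority)

    # Pairwise Euclidean distances; a sample is never its own neighbour.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # For each new point: pick a base sample, one of its k neighbours,
    # and a random position on the segment between the two.
    base = rng.integers(0, n_samples, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[picked] - minority[base])


minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(minority_points, n_new=10, k=2,
                                  rng=np.random.default_rng(0))
```

For raw images, interpolation is usually done on extracted features rather than on pixels.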


### Learning Rate

To further improve the results, we experimented with the learning rate, including cyclical learning rates and learning rates with warm restarts. But before doing that, we need to find the best possible learning rate for the model. This is done by plotting the loss against the learning rate and checking where the loss starts decreasing.
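A cyclical learning rate can be written as a plain function of the iteration number. The sketch below uses the common triangular form; the bounds and step size are placeholder values, not the ones used in the article.

```python
import math


def triangular_lr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=100):
    """Triangular cyclical learning rate.

    The rate climbs linearly from base_lr to max_lr over `step_size`
    iterations, then falls back over the next `step_size` iterations.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


for it in (0, 50, 100, 150, 200):
    print(it, triangular_lr(it))
```

In practice the bounds would come from the learning-rate-versus-loss plot described above.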

Another thing that can be done is to train several architectures using the above techniques and then merge their results. This is known as model ensembling and is a widely used technique, but it is computationally very expensive.

So, I decided to use a technique called snapshot ensembling [12] that achieves the goal of ensembling by training a single neural network, and making it converge to several local minima along its optimization path and saving the model parameters.


## Winning Approaches 2.0
https://towardsdatascience.com/latest-winning-techniques-for-kaggle-image-classification-with-limited-data-5259e7736327

Code: https://github.com/kayoyin/GreyClassifier/blob/master/src/dataset.py

Preprocessing:
- Normalization
- Contrast stretching, since the images are black and white
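Contrast stretching rescales pixel intensities so that they span the full output range. A minimal sketch for a greyscale image follows; the percentile cut-offs are an assumption (a common way to ignore outlier pixels), not taken from the linked repository.

```python
import numpy as np


def contrast_stretch(image, low_pct=2, high_pct=98):
    """Linearly map the [low_pct, high_pct] intensity range onto [0, 1]."""
    image = np.asarray(image, dtype=np.float32)
    low, high = np.percentile(image, [low_pct, high_pct])
    if high <= low:
        # Flat image: nothing to stretch.
        return np.zeros_like(image)
    return np.clip((image - low) / (high - low), 0.0, 1.0)


# A low-contrast image whose values only span 50..150.
dull = np.linspace(50, 150, 100).reshape(10, 10)
stretched = contrast_stretch(dull)
```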

Unbalanced dataset
- resampling by randomly cropping images

Data Augmentation
- only use in training set

Started off with a CNN model that has been pre-trained on ImageNet
- The idea is to freeze lower layers of the pre-trained model that can capture generic features while fine-tuning the higher layers to our specific domain.

- Among them, ResNet18 is the architecture I adopted as it gave the best validation accuracy upon training on our data, after running various architectures for 5 epochs. After experimenting with different numbers of frozen layers, 7 was found to be the best one. I also used the SGD optimizer with weight decay to discourage overfitting.

Learning Rate Scheduling
-  Instead of determining the optimal learning rate experimentally, I chose to use cyclic learning rate scheduling. This method makes the learning rate vary cyclically, which allows the model to converge to and escape several local minima. It also eliminates the need to find the best learning rate “by hand”.

Snapshot Ensembling
- Snapshot ensembling saves the model’s parameters periodically during training. The idea is that during cyclic LR scheduling, the model converges to different local minima. Therefore, by saving the model parameters at different local minima, we obtain a set of models that can give different insights for our prediction. This allows us to gather an ensemble of models in a single training cycle.

For each image, we concatenate the class probability predictions of each of the “snapshot” models to form a new data point. This new data is then fed into an XGBoost model to give a prediction based on the snapshot models.
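The concatenation step can be sketched with NumPy. The arrays below are random stand-ins for real snapshot predictions, the function name is made up for this example, and the XGBoost model itself is left out.

```python
import numpy as np


def stack_snapshot_predictions(snapshot_probs):
    """Concatenate per-snapshot class probabilities into one feature matrix.

    `snapshot_probs` is a list with one (n_images, n_classes) array per
    snapshot; the result has shape (n_images, n_snapshots * n_classes).
    """
    return np.concatenate(snapshot_probs, axis=1)


rng = np.random.default_rng(0)
n_images, n_classes, n_snapshots = 5, 4, 3
# Stand-in predictions: each row is a probability vector that sums to 1.
fake_probs = [rng.dirichlet(np.ones(n_classes), size=n_images)
              for _ in range(n_snapshots)]

meta_features = stack_snapshot_predictions(fake_probs)
# A simpler baseline is to average the snapshots instead of stacking them.
averaged = np.mean(fake_probs, axis=0)
```

`meta_features` is what a second-level model would be trained on, using the true labels of the same images as targets.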

Subclass Decision
- Upon inspection of the confusion matrix on the validation set for a single model, we discover that it tends to confuse each class with one particular other class. In fact, we find three subclasses whose members are often confused with each other.
- Also, the model is already very good at differentiating these subclasses (and finding suburbs). All that remains to get a great performance is for the model to accurately identify classifications within the subclasses.
- To do so, we train three new separate models, one on each subclass, using the same approach as before. Some classes have very little training data, so we increase the amount of data augmentation. We also find new parameters adjusted to each subclass.

Anti-aliasing
- The network outputs can change drastically with small shifts or translations to the input. This is because the striding operation in the convolutional network ignores the Nyquist sampling theorem and aliases, which breaks shift equivariance.

https://towardsdatascience.com/https-towardsdatascience-com-making-convolutional-networks-shift-invariant-again-f16acca06df2

Finally, after anti-aliasing the ResNet18 network and combining the training and validation sets to use all annotated data available for training, the testing accuracy rises to 0.97115. Anti-aliasing is a powerful method to improve generalization, which is crucial when the image data is limited.
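The aliasing problem is easy to reproduce on a 1-D signal. The sketch below compares plain strided max-pooling with a blur-then-subsample variant in the spirit of the linked article; the circular padding and the [1, 2, 1] / 4 blur kernel are simplifying assumptions made for this example.

```python
import numpy as np


def max_pool_stride2(x):
    """Plain max-pooling with window 2 and stride 2."""
    return x.reshape(-1, 2).max(axis=1)


def blur_pool_stride2(x):
    """Max with stride 1, then a small blur, then subsample with stride 2."""
    dense = np.maximum(x, np.roll(x, -1))                      # max(x[i], x[i+1])
    blurred = (np.roll(dense, 1) + 2 * dense + np.roll(dense, -1)) / 4
    return blurred[::2]


signal = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
shifted = np.roll(signal, 1)  # the same signal moved by one sample

plain_gap = np.abs(max_pool_stride2(signal) - max_pool_stride2(shifted)).max()
blur_gap = np.abs(blur_pool_stride2(signal) - blur_pool_stride2(shifted)).max()
print("plain max-pool change:", plain_gap)
print("blurred pool change:  ", blur_gap)
```

A one-sample shift changes the plain max-pool output far more than the blurred one, which is the effect anti-aliasing is meant to reduce.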


https://towardsdatascience.com/product-image-classification-with-deep-learning-part-i-5bc4e8dccf41

We are going to use Convolutional Neural Networks (CNNs) to classify the products’ images with supervised learning, running on PyTorch (a deep learning framework developed by Facebook). For this purpose, we take a ResNet model that is pre-trained on more than a million images from the ImageNet database.

# Convolutional Neural Networks

## CNN using Tensorflow

Link: https://medium.com/@tifa2up/image-classification-using-deep-neural-networks-a-beginner-friendly-approach-using-tensorflow-94b0a090ccd4
- *imgaug*: package to artificially add noise to the dataset
    - crop, flip, adjust hue, contrast and saturation
- no code

### Pre-processing images

In [5]:
def pre_process_image(image, training):
    # This function takes a single image as input,
    # and a boolean whether to build the training or testing graph.
    # It expects img_size_cropped and num_channels to be defined
    # elsewhere in the notebook.
    
    if training:
        # For training, add the following to the TensorFlow graph.

        # Randomly crop the input image.
        image = tf.random_crop(image, size=[img_size_cropped, img_size_cropped, num_channels])

        # Randomly flip the image horizontally.
        image = tf.image.random_flip_left_right(image)
        
        # Randomly adjust hue, contrast, brightness and saturation.
        image = tf.image.random_hue(image, max_delta=0.05)
        image = tf.image.random_contrast(image, lower=0.3, upper=1.0)
        image = tf.image.random_brightness(image, max_delta=0.2)
        image = tf.image.random_saturation(image, lower=0.0, upper=2.0)

        # Some of these functions may overflow and result in pixel
        # values beyond the [0, 1] range. It is unclear from the
        # documentation of TensorFlow 0.10.0rc0 whether this is
        # intended. A simple solution is to limit the range.

        # Limit the image pixels between [0, 1] in case of overflow.
        image = tf.minimum(image, 1.0)
        image = tf.maximum(image, 0.0)
    else:
        # For testing, add the following to the TensorFlow graph.

        # Crop the input image around the centre so it is the same
        # size as images that are randomly cropped during training.
        image = tf.image.resize_image_with_crop_or_pad(image,
                                                       target_height=img_size_cropped,
                                                       target_width=img_size_cropped)

    return image

### Splitting the dataset

In [6]:
train_batch_size = 64
def random_batch():
    # Number of images in the training-set.
    num_images = len(images_train)

    # Create a random index.
    idx = np.random.choice(num_images,
                           size=train_batch_size,
                           replace=False)

    # Use the random index to select random images and labels.
    x_batch = images_train[idx, :, :, :]
    y_batch = labels_train[idx, :]

    return x_batch, y_batch

### Building convolutional neural network

We’re going to have 3 convolution layers with 2 x 2 max-pooling.

Max-pooling: A technique used to reduce the spatial dimensions of an image or feature map by keeping only the maximum value in each cell of a grid (here, each 2 x 2 block). This also helps reduce overfitting and makes the model more generic.
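As a small worked example, 2 x 2 max-pooling on a single-channel array can be written in NumPy as follows. The function name is illustrative, and odd-sized inputs are simply trimmed.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2 x 2 max-pooling with stride 2 on a 2-D array."""
    height, width = feature_map.shape
    # Drop a trailing row/column if the size is odd.
    trimmed = feature_map[:height - height % 2, :width - width % 2]
    # Split into 2 x 2 blocks and keep the maximum of each block.
    blocks = trimmed.reshape(height // 2, 2, width // 2, 2)
    return blocks.max(axis=(1, 3))


example = np.array([[1, 3, 2, 4],
                    [5, 6, 7, 8],
                    [3, 2, 1, 0],
                    [1, 2, 3, 4]])
pooled = max_pool_2x2(example)
print(pooled)  # [[6 8]
               #  [3 4]]
```

Each output value is the largest of the four inputs in its block, so a 4 x 4 input becomes 2 x 2.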

## CNN using Keras
Link: https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8



In [19]:
from keras.models import Sequential #for initializing CNN
from keras.layers import Conv2D #for images, first step of CNN
from keras.layers import MaxPooling2D #getting the max value pixel per region
from keras.layers import Flatten #converting all 2d arrays into a single long cont vector
from keras.layers import Dense
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [8]:
#create an object of the sequential class below
classifier = Sequential()

Arguments to Conv2D
1. number of filters
2. shape of each filter - 3x3
3. input shape and type of image (3 for RGB)
4. activation function - ReLU

In [9]:
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))



In [11]:
#to reduce the number of nodes for the upcoming layers
classifier.add(MaxPooling2D(pool_size = (2, 2)))

In [12]:
classifier.add(Flatten())

In this step we need to create a fully connected layer and connect to it the set of nodes we got after the flattening step; these nodes act as the input to the fully connected layer. As this layer sits between the input layer and the output layer, we can refer to it as a hidden layer.

In [13]:
classifier.add(Dense(units = 128, activation = 'relu'))

As you can see, Dense is the function that adds a fully connected layer. ‘units’ defines the number of nodes in this hidden layer; the value is usually chosen between the number of input nodes and the number of output nodes, but the best number can only be found by experimenting, and it is common practice to use a power of 2. The activation function is a rectifier function (ReLU).

Now it’s time to initialise our output layer, which should contain only one node, as it is binary classification. This single node will give us a binary output of either a Cat or Dog.

In [14]:
classifier.add(Dense(units = 1, activation = 'sigmoid'))

Compile the CNN model

In [15]:
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

Perform some data augmentations on the images

In [20]:
train_datagen = ImageDataGenerator(rescale = 1./255,
                    shear_range = 0.2,
                    zoom_range = 0.2,
                    horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale = 1./255)
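# flow_from_directory needs 'training_set' and 'test_set' to exist relative to
# the working directory, each with one sub-folder per class.
training_set = train_datagen.flow_from_directory('training_set',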
                    target_size = (64, 64),
                    batch_size = 32,
                    class_mode = 'binary')
test_set = test_datagen.flow_from_directory('test_set',
                    target_size = (64, 64),
                    batch_size = 32,
                    class_mode = 'binary')


Fit data to our model.

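In the call below, ‘steps_per_epoch’ is the number of batches drawn per epoch, not the number of images. One full pass over the data therefore takes the number of images in the training_set folder divided by the batch size (rounded up); the source tutorial instead sets it to the image count itself.

‘epochs’ is the number of epochs to train for. A single epoch is one full pass over the training data; in other words, when the neural network has been trained on every training sample once, we say that one epoch is finished.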

In [None]:
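classifier.fit_generator(training_set,
    # len() of a Keras directory iterator is its number of batches per epoch
    # (the tutorial used 8000 and 2000 here)
    steps_per_epoch = len(training_set),
    epochs = 25,
    validation_data = test_set,
    validation_steps = len(test_set))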
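The relationship between image count, batch size and steps can be checked with a couple of lines. The image counts below are assumptions read off the tutorial's original values (8000 and 2000), combined with the batch size of 32 used in the generators.

```python
import math


def steps_for_one_pass(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


train_steps = steps_for_one_pass(8000, 32)
validation_steps = steps_for_one_pass(2000, 32)
print(train_steps, validation_steps)  # 250 63
```

With `steps_per_epoch = 8000` and a batch size of 32, each "epoch" would instead draw 8000 * 32 images, i.e. 32 passes over an 8000-image training set.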

Making predictions from the trained model

In [None]:
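import numpy as np
from keras.preprocessing import image

# Load one image and resize it to the input size the network was trained on.
test_image = image.load_img('dataset/single_prediction/cat_or_dog_1.jpg', target_size = (64, 64))
test_image = image.img_to_array(test_image)
# Apply the same 1/255 rescaling that the training generator used.
test_image = test_image / 255.
# Add a batch dimension: (64, 64, 3) -> (1, 64, 64, 3).
test_image = np.expand_dims(test_image, axis = 0)
result = classifier.predict(test_image)
# class_indices maps folder names to labels; the branches below assume dog -> 1.
print(training_set.class_indices)
# The sigmoid output is a probability, so threshold it instead of testing == 1.
if result[0][0] >= 0.5:
    prediction = 'dog'
else:
    prediction = 'cat'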

## CNN for Visual Recognition

Link: https://cs231n.github.io/transfer-learning/

Instead of training a ConvNet from scratch, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories), and then use the ConvNet either as an initialization or as a fixed feature extractor for the task of interest.

When and how to fine-tune? How do you decide what type of transfer learning you should perform on a new dataset? This is a function of several factors, but the two most important ones are the size of the new dataset (small or big), and its similarity to the original dataset (e.g. ImageNet-like in terms of the content of images and the classes, or very different, such as microscope images). Keeping in mind that ConvNet features are more generic in early layers and more original-dataset-specific in later layers, here are some common rules of thumb for navigating the 4 major scenarios:

1. New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.
2. New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.
3. New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train the SVM classifier from activations somewhere earlier in the network.
4. New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.
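The four rules of thumb above can be condensed into a small lookup. This is only a summary aid; the function name and the wording of the returned strings are made up for this sketch, and real datasets rarely fall cleanly into "small" or "large".

```python
def transfer_strategy(dataset_is_small, similar_to_original):
    """Return the rule-of-thumb strategy for one of the four scenarios."""
    if dataset_is_small and similar_to_original:
        return "linear classifier on top-layer CNN codes"
    if not dataset_is_small and similar_to_original:
        return "fine-tune through the full network"
    if dataset_is_small and not similar_to_original:
        return "linear classifier on activations from an earlier layer"
    return "initialize from pretrained weights, then fine-tune the entire network"


for small in (True, False):
    for similar in (True, False):
        print(f"small={small}, similar={similar}: "
              f"{transfer_strategy(small, similar)}")
```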


Advice:
1. Constraints from pretrained models. Note that if you wish to use a pretrained network, you may be slightly constrained in terms of the architecture you can use for your new dataset. For example, you can’t arbitrarily take out Conv layers from the pretrained network. However, some changes are straight-forward: Due to parameter sharing, you can easily run a pretrained network on images of different spatial size. This is clearly evident in the case of Conv/Pool layers because their forward function is independent of the input volume spatial size (as long as the strides “fit”). In case of FC layers, this still holds true because FC layers can be converted to a Convolutional Layer: For example, in an AlexNet, the final pooling volume before the first FC layer is of size [6x6x256]. Therefore, the FC layer looking at this volume is equivalent to having a Convolutional Layer that has receptive field size 6x6, and is applied with padding of 0.
2. Learning rates. It’s common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the (randomly-initialized) weights for the new linear classifier that computes the class scores of your new dataset. This is because we expect that the ConvNet weights are relatively good, so we don’t wish to distort them too quickly and too much (especially while the new Linear Classifier above them is being trained from random initialization).
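The claim in the first advice item, that a fully connected layer is equivalent to a convolution whose kernel covers the whole input volume, can be checked numerically. The sketch below uses a small, made-up channel and unit count to stay fast, and a naive loop-based convolution (no padding, stride 1) written just for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, units = 4, 3  # hypothetical sizes, much smaller than a real network

# A fully connected layer that looks at a 6 x 6 x channels volume.
fc_weights = rng.normal(size=(units, 6 * 6 * channels))
volume = rng.normal(size=(6, 6, channels))
fc_out = fc_weights @ volume.reshape(-1)

# The same weights viewed as `units` convolution kernels of size 6 x 6.
kernels = fc_weights.reshape(units, 6, 6, channels)


def conv_valid(x, kernels):
    """Naive convolution with no padding and stride 1."""
    k = kernels.shape[1]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.empty((out_h, out_w, kernels.shape[0]))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out


conv_out = conv_valid(volume, kernels)         # shape (1, 1, units)
larger_out = conv_valid(rng.normal(size=(8, 8, channels)), kernels)
print(np.allclose(conv_out[0, 0], fc_out), larger_out.shape)
```

On the 6 x 6 input the convolution reproduces the fully connected output exactly; on a larger input the same weights slide over the volume and produce a small map of outputs, which is what makes running on different image sizes possible.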
