# Setting up the running environment

In [0]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

In [0]:
!pip install keras==2.3.0

In [0]:
!pip install --user git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI

In [0]:
%cd './gdrive/My Drive/Workspace/code/D2SRetina/'

/content/gdrive/My Drive/Workspace/code/D2SRetina


# General Approach

Grocery items can be detected with a one-stage CNN detector, which finds each item's location and category at the same time. However, this approach has two main disadvantages:
* If new grocery items are added to the store, the model must be re-trained with a larger number of output classes.
* When the number of grocery items is large (e.g. 10,000 items), the classification sub-model needs to output a 10,000-way softmax, which increases the size of the model.

Instead, we propose a two-stage approach:
* A general-object detector is trained to detect objects in general. In other words, the classification sub-model outputs a single class ("object") instead of one class per grocery item.
* A variable-size metric-learning model is trained to learn the similarity between objects of the same class.

In operation, the pipeline works as follows:
* The general-object detector is used to locate objects. The detected objects are then cropped.
* Given a database of images of grocery items, the second model finds the grocery item that is most similar to each detected object and assigns its label.

![Pipeline](https://github.com/viethungluu/D2SRetina/blob/master/images/self_checkout.png?raw=true)

The main advantage of this approach is that the model does not need to be re-trained when new groceries are added. We only need to run the network once on a sample image of the new grocery item and store the result in the database for later use.
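The enrollment and lookup steps can be sketched as follows. This is only a minimal illustration, assuming the recognition model produces one embedding vector per image crop. The `Gallery` class, its method names, and the two-dimensional toy embeddings are hypothetical and are not part of the repository.

```python
import numpy as np


class Gallery:
    """In-memory stand-in for the embedding database (hypothetical helper)."""

    def __init__(self):
        self.embeddings = []  # one vector per enrolled sample image
        self.labels = []      # grocery label for each vector

    def enroll(self, embedding, label):
        # Adding a new grocery item only appends a row; nothing is re-trained.
        self.embeddings.append(np.asarray(embedding, dtype=np.float32))
        self.labels.append(label)

    def query(self, embedding):
        # Return the label of the closest enrolled embedding (L2 distance).
        gallery = np.stack(self.embeddings)
        target = np.asarray(embedding, dtype=np.float32)
        distances = np.linalg.norm(gallery - target, axis=1)
        return self.labels[int(np.argmin(distances))]


# Toy embeddings; in practice these would come from the recognition model.
gallery = Gallery()
gallery.enroll([0.0, 1.0], "apple")
gallery.enroll([1.0, 0.0], "banana")
predicted = gallery.query([0.9, 0.2])
print(predicted)
```

For a large catalogue, the linear scan in `query` would likely be replaced by an approximate nearest-neighbour index, but the idea stays the same.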


**One more thing**

Currently, the recognition model is based on comparing the visual similarity between objects. Grocery items usually carry *text (brand, ads)* on the packaging. A further step to increase the performance of the recognition model is to extract text features from the image and use them for comparison.

To extract text features from the image, a separate text detection model needs to be trained. This module will be added if we have more time.


# Training general-object detection model

## Architecture

The general-object detection model is set up as follows:
* **Architecture**: RetinaNet with ResNet50 as the backbone
* **Transfer learning**: The model's weights are initialized from a model pre-trained on the MS COCO dataset
* **General-object detector**: We train the model on a modified dataset in which all object classes are converted to a single class, "object". With this strategy, the model should become more generic toward objects in general rather than toward specific classes.
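The class-merging step can be illustrated with a short sketch. It assumes the annotations follow the standard COCO layout (`categories` and `annotations` lists); the function name `to_single_class` and the toy data are hypothetical, and the repository's own preprocessing may differ.

```python
import copy


def to_single_class(coco, name="object"):
    """Return a copy of a COCO-style dict with every annotation mapped to one category."""
    merged = copy.deepcopy(coco)  # leave the original annotations untouched
    merged["categories"] = [{"id": 1, "name": name, "supercategory": name}]
    for ann in merged["annotations"]:
        ann["category_id"] = 1
    return merged


# Toy annotation file with two grocery classes (illustrative values only).
toy = {
    "images": [{"id": 1, "file_name": "a.jpg"}],
    "categories": [{"id": 1, "name": "apple"}, {"id": 2, "name": "banana"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [0, 0, 10, 10]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [5, 5, 20, 20]},
    ],
}
single = to_single_class(toy)
print(sorted({ann["category_id"] for ann in single["annotations"]}))
```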

## Training

In [0]:
!pip install detection/.
!python detection/setup.py build_ext --inplace

In [0]:
!git pull

**Optimizing** the *scales* and *ratios* of the anchors for bounding-box detection. The values found are used as the configuration for training the detection model.

In [0]:
!python detection/keras_retinanet/bin/anchor_optimization.py --data-dir ../../data/processed/d2s/ --set-name training_object

**Verifying** that the annotations are correct

In [0]:
!python detection/keras_retinanet/bin/debug.py --annotations --no-resize --num_images 2 coco ../../data/processed/d2s/ --coco-set training_object

**Training**

In [0]:
!python detection/keras_retinanet/bin/train.py --backbone resnet50 --weights ../../models/snapshots/D2S/resnet50_coco_best_v2.1.0.h5 --lr 1e-5 --epochs 100 --steps 500 --batch-size 4 --snapshot-path ../../models/snapshots/D2S/Coco --logger-dir ../../models/logs/D2S/Coco --save-path ../../results/D2S/Coco coco ../../data/processed/d2s/

## Evaluation

**Evaluating** trained model

In [0]:
!python detection/keras_retinanet/bin/evaluate.py --model ../../models/snapshots/D2S/Coco/resnet50_coco_26.h5 --backbone resnet50 --convert-model --save-path ../../results/D2S/Coco coco ../../data/processed/d2s/

Detection results are shown below. A visualization of the detections is available **[HERE](https://drive.google.com/open?id=11Raospe5D3RZTZ3Wm9rCgv_BrlBJp3B3)**.

> Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.974

> Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.990

> Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.990

> Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000

> Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.618

> Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.981


> Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.291

> Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.987

> Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.987

> Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000

> Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.750

> Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.988
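A note on reading these numbers: in the COCO evaluation summary, a value of -1.000 typically indicates that no ground-truth objects fall into that area range, so the metric is undefined rather than poor. The sketch below shows the standard COCO area thresholds ($32^2$ and $96^2$ pixels). The function name and the sample areas are illustrative and are not taken from the dataset.

```python
def coco_area_bucket(area):
    """Map an annotation's pixel area to the COCO size range it is scored in."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# Made-up areas, only to show how objects are bucketed.
sample_areas = [500, 5000, 50000]
buckets = [coco_area_bucket(a) for a in sample_areas]
print(buckets)
```

If a dataset contains no annotation with an area below $32^2 = 1024$ pixels, the "small" rows cannot be computed, which would be consistent with the -1.000 entries above.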




# Training one-shot recognition model

In [0]:
!git pull


## Dataset

The provided dataset is in COCO format. Each image contains several objects.

We would like to re-use this dataset to train our object-recognition model, so we built a custom dataset generator. Its output is the set of image patches that cover each object in the dataset images.
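The idea behind such a generator can be sketched as below, assuming each annotation carries a COCO-style `bbox` in `[x, y, width, height]` form. The function `crop_patches` and the toy image are hypothetical; the actual `recognition/dataset.py` may crop, pad, or resize differently.

```python
import numpy as np


def crop_patches(image, annotations):
    """Yield (patch, category_id) for each annotation whose box overlaps the image."""
    height, width = image.shape[:2]
    for ann in annotations:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        # Clip the box to the image so slicing never goes out of bounds.
        x0, y0 = max(int(x), 0), max(int(y), 0)
        x1, y1 = min(int(x + w), width), min(int(y + h), height)
        if x1 > x0 and y1 > y0:
            yield image[y0:y1, x0:x1], ann["category_id"]


# Toy 100x200 image with two boxes; the second one extends past the border.
image = np.zeros((100, 200, 3), dtype=np.uint8)
annotations = [
    {"bbox": [10, 20, 50, 30], "category_id": 7},
    {"bbox": [180, 90, 40, 40], "category_id": 3},
]
patches = list(crop_patches(image, annotations))
print([patch.shape for patch, _ in patches])
```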

In [0]:
!python recognition/dataset.py --coco-path ../../data/processed/d2s/ --num-images 12

Visualizing some object images from our recognition dataset.

<img src="https://github.com/viethungluu/D2SRetina/blob/master/images/debug_regcognition.png?raw=true" alt="Recognition dataset" width="600"/>


## Training

**Training strategy**:
* **Architecture**: Triplet model with ResNet50 as the backbone. Image patches are resized and cropped to a fixed size of $224 \times 224$.
* **Transfer learning**: The backbone's weights are initialized from a model pre-trained on the ImageNet dataset. The first 6 blocks of the ResNet50 model are frozen.
* **Loss function**: Triplet loss with L2 distance.
*   **Batch balancing**: Each mini-batch contains $K \times P$ samples from which triplets are formed, where $P$ is the number of classes and $K$ is the number of samples per class. This helps keep the model from becoming biased toward specific classes.
*   **Curriculum triplet sampling**: Curriculum learning is a type of learning in which training starts with only easy examples of a task and then gradually increases the task difficulty.
 * In the first epochs, the model is trained with "all" sampling. With $K \times P$ samples per mini-batch as above, $[\frac{K \times (K -1)}{2} \times K \times (P - 1)] \times P$ triplets are formed.
 * In the following epochs, the model is trained with hard negative sampling: in each mini-batch, only the $\frac{P \times (P -1)}{2} \times K$ hardest triplets are used.
* **Learning rate**: We start with a large batch size and a large learning rate ($10^{-2}$) [\[ref\]](https://arxiv.org/abs/1711.00489). After the first 30 epochs, the learning rate is gradually decreased.
* **Soft margin**: Recall that the hard-margin loss is $loss = \max(positive\_distance - negative\_distance + margin, 0)$, where $margin$ is a hyper-parameter. To get rid of the $margin$ hyper-parameter, the soft margin is defined as $\ln(1+\exp(positive\_distance - negative\_distance))$. [\[ref\]](https://arxiv.org/pdf/1703.07737.pdf)
* **Random erasing**: Randomly erase part of the image for data augmentation.
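To make the difference between the two margin formulations concrete, here is a small NumPy sketch of both losses as written above. The function names are ours and the distances are toy values; the repository's `recognition/losses.py` may be implemented differently.

```python
import numpy as np


def hard_margin_loss(pos_dist, neg_dist, margin=1.0):
    """max(positive_distance - negative_distance + margin, 0)."""
    return np.maximum(pos_dist - neg_dist + margin, 0.0)


def soft_margin_loss(pos_dist, neg_dist):
    """ln(1 + exp(positive_distance - negative_distance)), computed stably."""
    # logaddexp(0, x) equals ln(exp(0) + exp(x)) = ln(1 + exp(x)).
    return np.logaddexp(0.0, pos_dist - neg_dist)


pos = np.array([0.5, 1.0, 2.0])  # anchor-positive distances (toy values)
neg = np.array([2.0, 1.0, 0.5])  # anchor-negative distances (toy values)
hard = hard_margin_loss(pos, neg)
soft = soft_margin_loss(pos, neg)
print(hard, soft)
```

Note that the hard-margin loss becomes exactly zero once the negative is farther away than the positive by at least the margin, while the soft margin keeps a small positive loss for every triplet.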

Model summary

In [0]:
!python recognition/models.py

**Start by training with all sampling**

In [0]:
!python recognition/train.py --coco-path ../../data/processed/d2s/ --backbone ResNet18 --imagenet-weights --optim Adam --triplet-selector all --K 4 --P 4 --soft-margin --lr 1e-2 --epoch-decay-start 10 --n-epoch 11 --n-batches 200 --snapshot-path ../../models/snapshots/D2S/Coco/soft_margin --logger-dir ../../models/logs/D2S/Coco/soft_margin

Reading dataset from ../../data/processed/d2s/annotations/D2S_training.json
loading annotations into memory...
Done (t=0.08s)
creating index...
index created!
{0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30, 30: 31, 31: 32, 32: 33, 33: 34, 34: 35, 35: 36, 36: 37, 37: 38, 38: 39, 39: 40, 40: 41, 41: 42, 42: 43, 43: 44, 44: 45, 45: 46, 46: 47, 47: 48, 48: 49, 49: 50, 50: 51, 51: 52, 52: 53, 53: 54, 54: 55, 55: 56, 56: 57, 57: 58, 58: 59, 59: 60}
{1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7, 9: 8, 10: 9, 11: 10, 12: 11, 13: 12, 14: 13, 15: 14, 16: 15, 17: 16, 18: 17, 19: 18, 20: 19, 21: 20, 22: 21, 23: 22, 24: 23, 25: 24, 26: 25, 27: 26, 28: 27, 29: 28, 30: 29, 31: 30, 32: 31, 33: 32, 34: 33, 35: 34, 36: 35, 37: 36, 38: 37, 39: 38, 40: 39, 41: 40, 42: 41, 43: 42, 44: 43, 45: 44, 46: 45, 47: 46, 48: 47, 49: 48, 50: 4
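For the `--K 4 --P 4` setting used in the command above, the number of triplets produced by "all" sampling can be checked against the formula $[\frac{K \times (K -1)}{2} \times K \times (P - 1)] \times P$ by brute-force enumeration. The sketch below is our own illustration and does not reproduce the selector in `recognition/triplet_selectors.py`.

```python
from itertools import combinations


def all_triplets(labels):
    """Enumerate every (anchor, positive, negative) index triplet in a batch."""
    triplets = []
    for a, p in combinations(range(len(labels)), 2):
        if labels[a] != labels[p]:
            continue  # anchor and positive must share a class
        for n, label in enumerate(labels):
            if label != labels[a]:
                triplets.append((a, p, n))
    return triplets


K, P = 4, 4  # samples per class, classes per mini-batch
labels = [c for c in range(P) for _ in range(K)]  # balanced batch of K * P samples
triplets = all_triplets(labels)
expected = (K * (K - 1) // 2) * (K * (P - 1)) * P
print(len(triplets), expected)
```

With $K = 4$ and $P = 4$, each class contributes 6 anchor-positive pairs and 12 negatives, giving $6 \times 12 \times 4 = 288$ triplets per mini-batch.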

**After 10 epochs, training with:**


1.   hard negative sampling
2.   learning rate decay
3.   random erasing
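One common way to realize hard negative sampling is to pair each anchor-positive pair with the negative that lies closest to the anchor in embedding space. The sketch below follows that definition as an assumption; the repository's selector may instead rank triplets by loss value, and the function name and toy embeddings are hypothetical.

```python
import numpy as np


def hardest_negative_triplets(embeddings, labels):
    """For each anchor-positive pair, pick the negative closest to the anchor."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    # Pairwise L2 distances between all embeddings in the batch.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    triplets = []
    for a in range(len(labels)):
        neg_idx = np.flatnonzero(labels != labels[a])
        if neg_idx.size == 0:
            continue  # no negatives available for this anchor
        hardest = int(neg_idx[np.argmin(dist[a, neg_idx])])
        for p in range(a + 1, len(labels)):
            if labels[p] == labels[a]:
                triplets.append((a, p, hardest))
    return triplets


# Toy batch: two classes with two samples each.
toy_embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]]
toy_labels = [0, 0, 1, 1]
hard_triplets = hardest_negative_triplets(toy_embeddings, toy_labels)
print(hard_triplets)
```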



In [0]:
!python recognition/train.py --coco-path ../../data/processed/d2s/ --backbone ResNet18 --snapshot ../../models/snapshots/D2S/Coco/soft_margin/ResNet18_all_10.pth --optim Adam --triplet-selector hard --K 4 --P 4 --soft-margin --p 0.5 --sh 0.2 --lr 1e-3 --epoch-decay-start 10 --n-epoch 100 --n-batches 200 --snapshot-path ../../models/snapshots/D2S/Coco/soft_margin --logger-dir ../../models/logs/D2S/Coco/soft_margin


Plot of the training loss

<img src="https://github.com/viethungluu/D2SRetina/blob/master/images/loss.png?raw=true" alt="Training loss" width="600"/>

## Evaluation

In [0]:
!python recognition/evaluate.py --coco-path ../../data/processed/d2s/ --backbone ResNet18 --snapshot ../../models/snapshots/D2S/Coco/soft_margin/ResNet18_hard_95.pth --emb-size 2048 --snapshot-path ../../models/snapshots/D2S/Coco/soft_margin


Results after 95 epochs.

|              | precision | recall | f1-score | # samples |
|--------------|-----------|--------|----------|-----------|
| accuracy     |           |        | 0.54     | 2250      |
| macro avg    | 0.45      | 0.40   | 0.41     | 2250      |
| weighted avg | 0.64      | 0.54   | 0.56     | 2250      |
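The gap between the macro and weighted averages can be read as follows: the macro average treats every class equally, while the weighted average weights each class by its number of samples, so classes with many samples dominate it. The sketch below illustrates the two definitions with made-up per-class numbers (not results from this model); the function name is ours.

```python
def macro_and_weighted(per_class_scores, per_class_support):
    """Return (macro average, support-weighted average) of per-class scores."""
    total = sum(per_class_support)
    macro = sum(per_class_scores) / len(per_class_scores)
    weighted = sum(s * n for s, n in zip(per_class_scores, per_class_support)) / total
    return macro, weighted


# Made-up example: a frequent class that scores well and a rare class that does not.
macro, weighted = macro_and_weighted([0.9, 0.2], [90, 10])
print(round(macro, 2), round(weighted, 2))
```

A weighted average that is noticeably higher than the macro average, as in the table above, suggests that the classes with more test samples tend to be recognized better than the rarer ones.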


The image below shows some recognition examples. Each row contains two pairs. In each pair, the left image is the query image (from the test set) and the right image is the ground-truth image (from the train set) that is most similar to the query image. We can see that this dataset is quite difficult: many grocery items from different classes look almost identical.

<img src="https://github.com/viethungluu/D2SRetina/blob/master/images/evaluation.png?raw=true" alt="Example"/>