# CTW dataset tutorial (Part 3: detection baseline)

In this part of the tutorial, we will show you:

  - [Framework of detection baseline](#Framework-of-detection-baseline)
  - [Training steps](#Training-steps)
  - [Predicting steps](#Predicting-steps)
  - [Submission format](#Submission-format)
  - [Evaluation API](#Evaluation-API)
  - [Evaluate results locally](#Evaluate-results-locally)
  - [Appendix: Details to the evaluation API](#Appendix:-Details-to-the-evaluation-API)

Notes:

  > This notebook MUST be run under `$CTW_ROOT/tutorial`.
  >
  > All the code SHOULD be run with `Python>=3.4`. We make it compatible with `Python>=2.7` with best effort.

## Framework of detection baseline

We use [YOLOv2](https://pjreddie.com/darknet/yolo/) with slight modifications, which are detailed in the git commits. We apply a multi-scale testing scheme, which is described in our paper.

Following the classification task, we also limit the number of categories to 1001, i.e., the 1000 most frequently observed character categories and an 'others' category.


## Training steps

Some steps are similar to those in the classification tutorial, so this tutorial is somewhat abbreviated.

To train the SSD512 model, just substitute `../detection/` with `../ssd/`.

#### Compile darknet and download pre-trained model

First, initialize the git submodules. If you have problems initializing them, you may download darknet manually and copy it to the corresponding directory. Note that we have slightly modified darknet, so you should clone the repository listed in `$CTW_ROOT/.gitmodules` and pay attention to its `branch`.

```
!git submodule update --init --recursive
!cd ../detection/darknet && make -j8
!cd ../detection && if [ ! -f "products/darknet19_448.conv.23" ]; then curl https://pjreddie.com/media/files/darknet19_448.conv.23 -o products/darknet19_448.conv.23; fi
```

#### Decide categories

Decide which character categories are the 1000 most frequently observed, and save them to `products/cates.json`.

```
!cd ../detection && python3 decide_cates.py
```
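As a rough illustration of what "deciding categories" means, here is a minimal sketch. It is not the code of `decide_cates.py`: the function names, the tie-breaking rule, and the choice of the last label ID for 'others' are assumptions made for this example only.

```python
from collections import Counter


def decide_categories(char_texts, top_k=1000):
    """Return the top_k most frequent characters, most frequent first.

    char_texts is any iterable of single-character strings, e.g. the
    texts of all Chinese character instances in TRAIN (hypothetical input).
    """
    counts = Counter(char_texts)
    # Sort by descending count, then by character, so ties are reproducible.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [text for text, _ in ranked[:top_k]]


def category_id(text, cates):
    """Map a character to a label ID; unseen characters map to 'others'."""
    lookup = {t: i for i, t in enumerate(cates)}
    # Assumption: the ID right after the last category stands for 'others'.
    return lookup.get(text, len(cates))
```

With `top_k=1000`, this yields 1000 category IDs plus one 'others' ID, i.e., 1001 categories in total.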

#### Crop images and write meta data

We will output the following files in this step:

  - detection/products/trainval/\*.\{jpg,txt\}
  - detection/products/trainval.txt
  - detection/products/yolo-chinese.cfg
  - detection/products/chinese.data
  - detection/products/chinese.names

```
!cd ../detection && python3 prepare_train_data.py
```

#### Run train scripts

We will output the following files in this step:

  - detection/products/backup/\*.weights

This script outputs a large amount of logs and takes a long time, so we recommend running it with `/bin/bash` instead of directly in the Jupyter notebook.

- Time cost estimation for YOLOv2 (NVIDIA GTX TITAN X): 3.0 sec / step, 38 hours in total.
- Time cost estimation for SSD512 (NVIDIA GTX TITAN X * 2): 1.8 sec / step, 60 hours in total.

```
!cd ../detection && python3 train.py
```

#### Download trained models

Since training takes a lot of energy and we hate global warming, we provide trained models which are trained using TRAIN+VAL.

Visit our homepage (https://ctwdataset.github.io/) to gain access to the trained models.

Notes: if you are using the trained models, you may run the training steps with fake, empty training data to produce the necessary files (e.g. `chinese.data` for YOLOv2, `deploy.prototxt` for SSD512).

1. Produce `cates.json` with TRAIN+VAL, the map from label ID to character category.

  `cp ../data/annotations/downloads/train.jsonl ../data/annotations/downloads/val.jsonl ../data/annotations/`

  `python3 decide_cates.py`

1. Fake 'empty' training data.

  `echo '{"train":[{"image_id":"0000172","file_name":"0000172.jpg"}],"val":[{"image_id":"0000486","file_name":"0000486.jpg"}],"test_cls":[],"test_det":[{"image_id":"0000001","file_name":"0000001.jpg"}]}' >../data/annotations/info.json`

  `head ../data/annotations/downloads/train.jsonl -n1 >../data/annotations/train.jsonl`

  `head ../data/annotations/downloads/val.jsonl -n1 >../data/annotations/val.jsonl`

1. Run training scripts with fake data.

  `python3 prepare_train_data.py`

  `python3 train.py`

  If a pre-trained model is required, just `touch` it or download it. Once training starts, press `CTRL+C`.

1. Replace the model with downloaded trained model.

## Predicting steps

#### Crop testing images and write meta data

We will output the following files in this step:

  - detection/products/test/\*.\{jpg,txt\}
  - detection/products/test.txt
  - detection/products/yolo-chinese-test.cfg


```
!cd ../detection && python3 prepare_test_data.py
```

#### Run darknet

You may need to edit `TEST_NUM_GPU` (in `detection/settings.py`) and `num_thread` (in `detection/eval.py`) before running this step. Each copy of YOLOv2 takes about 3.6 GB GPU memory. Each GPU will run `num_thread / TEST_NUM_GPU` copies of YOLOv2 at the same time.
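To pick values that fit your hardware, the arithmetic above can be sketched as follows. The helper names are hypothetical and are not part of `settings.py` or `eval.py`; the 3.6 GB figure is the per-copy estimate quoted above.

```python
def copies_per_gpu(num_thread, test_num_gpu):
    """Number of YOLOv2 copies that run on each GPU at the same time."""
    return float(num_thread) / test_num_gpu


def memory_per_gpu_gb(num_thread, test_num_gpu, gb_per_copy=3.6):
    """Approximate GPU memory needed on each GPU, in GB."""
    return copies_per_gpu(num_thread, test_num_gpu) * gb_per_copy
```

For example, 6 threads on 2 GPUs give 3 copies per GPU, which needs roughly 10.8 GB on each GPU. Compare this figure with the memory of your card before choosing `num_thread`.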

We will output the following files in this step:

  - `detection/products/chinese.*.data`
  - `detection/products/test.*.txt`
  - `detection/products/results/chinese.*.txt`

This script outputs a large amount of logs and takes a long time, so we recommend running it with `/bin/bash` instead of directly in the Jupyter notebook.

Time cost estimation for YOLOv2 (NVIDIA GTX TITAN X \* 2): 0.2 sec / subimage for each thread, 2.4 hours in total if using 6 threads.

Notes:

  > For the validation set, whose size is about $0.5$ times that of the detection testing set, the estimated time cost is 1.2 hours in total.

```
!cd ../detection && python3 eval.py
```

#### Merge results

We don't apply non-maximum suppression (NMS) on each subimage in YOLOv2.

1. Collect candidates from results of subimages from YOLOv2.
2. Remove candidates of improper size.
3. Map candidates from subimages back onto the full images.
4. Apply NMS on full images.
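The four steps above can be sketched as below. This is only an illustration under stated assumptions, not the code of `merge_results.py`: the input layout (`subimages` as offset plus candidate list), the size filter on the longer side, the thresholds, and the class-agnostic greedy NMS are all choices made for this example.

```python
def iou(a, b):
    """Intersection over union of two [x, y, w, h] boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def merge_candidates(subimages, min_side, max_side, nms_thresh):
    """subimages: list of (offset_x, offset_y, candidates), where each
    candidate is {'bbox': [x, y, w, h], 'text': str, 'score': float}
    with coordinates relative to its subimage (hypothetical layout)."""
    full = []
    for offset_x, offset_y, candidates in subimages:      # step 1: collect
        for cand in candidates:
            x, y, w, h = cand['bbox']
            # Step 2: drop candidates of improper size (assumed criterion).
            if not (min_side <= max(w, h) <= max_side):
                continue
            # Step 3: translate the box into full-image coordinates.
            full.append({'bbox': [x + offset_x, y + offset_y, w, h],
                         'text': cand['text'], 'score': cand['score']})
    # Step 4: greedy NMS, most confident candidates first.
    full.sort(key=lambda c: c['score'], reverse=True)
    kept = []
    for cand in full:
        if all(iou(cand['bbox'], k['bbox']) <= nms_thresh for k in kept):
            kept.append(cand)
    return kept
```

Running NMS only after splicing matters when subimages overlap: the same character may be detected in two subimages, and the duplicates can be recognized as such only once they share full-image coordinates.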

We will output the following files in this step:

  - detection/products/detections.jsonl

```
!cd ../detection && python3 merge_results.py
```

## Submission format

A detection submission MUST be a UTF-8 encoded [JSON Lines](http://jsonlines.org/) file. Each line MUST match the corresponding item in `test_det` in the overall information (`$CTW_ROOT/data/annotations/info.json`), which is described in `Tutorial part-1: Basics`.

```
result (corresponding to one line in .jsonl):
{
    detections: [detection_0, detection_1, detection_2, ...],  # length of this list MUST be <= 1000
}

detection:
{
    bbox: [x, y, w, h],          # x, y, w, h are floating-point numbers, where w, h MUST be greater than 0
    text: str,                   # length is usually 1, otherwise this detection must be a false positive
    score: float,
}
```
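Before uploading, it can help to check each line against the format above. The following is a minimal sketch for Python 3; `check_result` and `to_jsonl_line` are hypothetical helpers written for this tutorial, not part of the evaluation API, and they cover only the constraints stated above.

```python
import json


def check_result(result, max_det=1000):
    """Return a list of problems found in one result (empty if none)."""
    detections = result.get('detections')
    if not isinstance(detections, list):
        return ['"detections" must be a list']
    problems = []
    if len(detections) > max_det:
        problems.append('more than %d detections' % max_det)
    for i, det in enumerate(detections):
        bbox = det.get('bbox')
        is_number_list = (
            isinstance(bbox, list) and len(bbox) == 4 and
            all(isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in bbox))
        if not is_number_list:
            problems.append('detection %d: bbox must be [x, y, w, h]' % i)
        elif bbox[2] <= 0 or bbox[3] <= 0:
            problems.append('detection %d: w and h must be > 0' % i)
        if not isinstance(det.get('text'), str):
            problems.append('detection %d: text must be a string' % i)
        if not isinstance(det.get('score'), (int, float)):
            problems.append('detection %d: score must be a number' % i)
    return problems


def to_jsonl_line(result):
    """Serialize one result as a single JSON line (without the newline)."""
    # ensure_ascii=False keeps Chinese characters readable in the file.
    return json.dumps(result, ensure_ascii=False)
```

When writing the file, open it with `open(path, 'w', encoding='utf-8')` and write one `to_jsonl_line(result) + '\n'` per image, in the same order as `test_det`.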

## Evaluation API

The calculation of `AP` is similar to PASCAL VOC. More evaluation metrics are implemented in `cppapi/eval_tools.hpp`. Here is a brief description; more details can be found in the [Appendix](#Appendix:-Details-to-the-evaluation-API).

  1. At most 1000 detections are allowed for each image.
  1. IOU threshold is $0.5$.
  1. A detection which does not match a ground truth but matches a subregion of an 'ignore' region is excluded during the evaluation.
  1. `macro-mAP` (also called `mAP`) is the mean over character categories, weighted by the number of character instances in the corresponding category.
  1. `micro-mAP` is the mean over images, where each image has equal weight.
  1. `recall` is computed as described in the paper.
  1. When computing metrics for a specified size range,
    1. a detection (DT) out of the size range is excluded during the evaluation,
    1. a ground truth (GT) out of the size range is excluded,
    1. a DT in range that matches a GT out of range is excluded, while a DT out of range that matches a GT in range is included.
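The two kinds of mean described above differ only in what is averaged and how it is weighted. A minimal sketch, assuming the per-category and per-image APs have already been computed (the function and argument names are hypothetical):

```python
def macro_map(ap_by_category, n_by_category):
    """Mean of per-category APs, weighted by the number of character
    instances in each category."""
    total = sum(n_by_category[c] for c in ap_by_category)
    if total == 0:
        return 0.0
    weighted = sum(ap_by_category[c] * n_by_category[c]
                   for c in ap_by_category)
    return weighted / total


def micro_map(ap_by_image):
    """Unweighted mean of per-image APs."""
    if not ap_by_image:
        return 0.0
    return sum(ap_by_image) / len(ap_by_image)
```

For instance, with APs of 1.0 and 0.5 for two categories that have 3 and 1 instances, `macro_map` gives (1.0 × 3 + 0.5 × 1) / 4 = 0.875, whereas an unweighted mean would give 0.75.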

If no error occurs, the output of the evaluation API has the data structure described below.

```
output:
{
    error: 0,
    performance: {
        all: size_performance,
        large: size_performance,
        medium: size_performance,
        small: size_performance,
    },
}

size_performance:
{
    n: int,                                                             # number of GTs
    AP: float,
    AP_curve: Y_curve,
    mAP: float,                                                         # i.e., macro-mAP
    mAP_curve: XY_curve,                                                # only C++ API computes mAP_curve
    attributes: [recall_0, recall_1, recall_2, ..., recall_63],         # recall for each attribute. Index is bitwise, described in 'part-2: classification'
    texts: {str_0: recall_0, str_1: recall_1, str_2: recall_2, ...},    # recall for each character category
    mAP_micro: float,
}

XY_curve:
[(x_0, y_0), (x_1, y_1), (x_2, y_2), ...]

Y_curve:
[y_0, y_1, y_2, ...]      # of which X are 1/n, 2/n, 3/n, ..., respectively

recall: {
    n: int,
    recall: int,
}
```
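The structures above can be post-processed with a few small helpers. This is a sketch under assumptions, not part of the evaluation API: it assumes that the `n` passed to `y_curve_to_xy` is the `n` field of the same `size_performance`, and that bit `i` of the attribute index corresponds to the `i`-th attribute name in the list you pass in.

```python
def recall_rate(recall_entry):
    """Turn a {'n': int, 'recall': int} entry into a rate in [0, 1]."""
    n = recall_entry['n']
    if n == 0:
        return None  # no instance in this group, so the rate is undefined
    return recall_entry['recall'] / float(n)


def y_curve_to_xy(y_curve, n):
    """Attach X values 1/n, 2/n, 3/n, ... to a Y_curve."""
    return [((i + 1) / float(n), y) for i, y in enumerate(y_curve)]


def attributes_of_index(index, attributes):
    """Decode a bitwise attribute index into attribute names.

    Assumption: bit i (least significant first) stands for attributes[i].
    """
    return [name for i, name in enumerate(attributes) if (index >> i) & 1]
```

For example, with two attributes `["occluded", "bgcomplex"]`, index 0 means neither attribute is set and index 3 means both are set under this assumed bit order.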

The data structure of the evaluation server's output is described below.

```
evaluation server output:
{
    size_ranges: list,    # the configuration of considered sizes on the evaluation server, defined in `codalab/settings.py`
    attributes: list,     # the configuration of considered attributes, always ["occluded", "bgcomplex", etc.]
    max_det: int,         # the configured limit on the number of detections per image, always 1000
    iou_thresh: float,    # the configured IOU threshold, always 0.5
    performance: {
        all: size_performance,
        large: size_performance,
        medium: size_performance,
        small: size_performance,
    },
}
```

The `size_performance` in the output of the evaluation server differs slightly from that in the output of our evaluation API.

  - `AP_curve` is discretized into `AP_curve_discrete`, whose type is `XY_curve`, so that nobody can infer from the curve which DTs are correct, and to reduce the size of the output file.
  - `mAP_curve` is discretized into `mAP_curve_discrete`, for the same reasons.
  - `texts` only contains the 10 most frequent categories, to avoid revealing the frequency of each category in the testing set, and to reduce the size of the output file.

AP of size `all` should be considered the single most important metric on CTW dataset.

## Evaluate results locally

Since the following steps rely on `judge/products/stat_frequency.json`, you SHOULD first gather statistics, as described in the `Gather statistics` section of `Tutorial part-2: classification`.

#### Evaluate detection performance

We use the [rapidjson](https://github.com/Tencent/rapidjson) library in the C++ API; please initialize the submodules or download this library to `cppapi/rapidjson` manually.

We will output the following files in this step:

  - `<stdout>`: detection performance for each size and each attribute
  - `<stdout>`: detection performance of top-10 most frequent character categories for each size
  - `judge/products/plots/det_AP_curve.pdf`: (described in our paper)
  - `judge/products/plots/det_mAP_curve.pdf`: macro-mAP curve is mean over AP curves for each category
  - `judge/products/plots/det_recall_by_attr_size.pdf`: recall for each size and each attribute, respectively
  - `judge/products/detection_report.json`: the output of evaluation API described above
  - `judge/products/explore_det.html`: performance for each combination of attributes and each size

Notes:

  > If you evaluate the trained models on the validation set, you may obtain higher performance than reported in the paper. The likely reason is that our models are trained on TRAIN+VAL, while you are using the validation set as the testing set.

```
!git submodule update --init --recursive
!cd ../judge && python3 detection_perf.py
```

#### Draw detections on images

We show TPs in <font color="#0f0">**green**</font>, and show FPs in <font color="#ff0">**yellow**</font>. For each image, we draw the most confident DTs, and set $num(TPs) + num(FPs) = num(GTs)$.

We will output the following files in this step:

  - `judge/products/printtext-drawing/*.pdf`

```
!cd ../judge && python3 draw_detection_text.py
```

## Appendix: Details to the evaluation API

Our Python evaluation API in `pythonapi/eval_tools.py` and our C++ evaluation API in `cppapi/eval_tools.hpp` work as follows, similarly to PASCAL VOC.

  1. Check that the detection (DT) file has the same number of lines as the ground truth (GT) file. Otherwise, return an error.
  1. Check that each line of the DT file is valid JSON and conforms to the submission format. Otherwise, return an error.
  1. Use the ignore list (IG) in the annotations.
  1. Remove non-Chinese character instances from GTs.
  1. For each size, we process the DTs, GTs, and IGs of each image in the following steps.
    1. Move GTs which do not fit the current size range to IG. (For size 'all', this step never has any effect.)
    1. Match DTs with GTs greedily, in descending order of $IOU(DT, GT)$. For any given confidence score $c_0$, matched DTs whose confidence scores are greater than $c_0$ are true positives (TPs), while GTs whose matched DTs do not score above $c_0$ are false negatives (FNs).
    1. Remove unmatched DTs which can match IGs. They have no effect on the evaluation.
    1. Remove unmatched DTs which do not fit the current size range. (For size 'all', this step never has any effect.)
    1. Remaining DTs are false positives (FPs); remaining GTs are FNs.
  1. For each size, we compute metrics in the following steps.
    1. Take all TPs, FNs and FPs to compute `AP`.
    1. For each character category, take the TPs, FNs and FPs in that category to compute an average precision (AP). Compute the mean of these APs weighted by the number of character instances in the corresponding category, and denote the mean value as `macro-mAP`, also called `mAP`.
    1. For each image, take the minimum confidence score $c_0$ which leads to $num(TPs) + num(FPs) \leq num(GTs)$, i.e., keep at most as many of the most confident DTs as there are GTs. Then,
      1. for each attribute, compute `recall` by counting, in each image, the TPs with $score > c_0$ that have the specified attribute.
      1. for each character category, compute `recall` by counting, in each image, the TPs with $score > c_0$ that belong to the specified category.
    1. For each image, take the TPs, FNs and FPs in that image to compute an AP. Compute the mean of these APs, and denote this mean value as `micro-mAP`.
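The per-image matching step can be sketched as follows. This is a simplified illustration, not the code of `eval_tools.py`: it ignores size ranges and confidence thresholds, assumes GTs and ignore regions are axis-aligned `[x, y, w, h]` boxes, and uses a hypothetical `bbox` field for GTs. For the ignore test it relies on the observation that, among all subregions of an ignore region, the intersection with the DT maximizes the IOU, so the test reduces to checking whether more than half of the DT's area lies inside the ignore region.

```python
def intersection_area(a, b):
    """Area of the intersection of two [x, y, w, h] boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return iw * ih if iw > 0 and ih > 0 else 0.0


def iou(a, b):
    inter = intersection_area(a, b)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def match_image(dts, gts, igs, iou_thresh=0.5):
    """Return (dt_to_gt, ignored_dts, false_positives, false_negatives),
    all expressed as indices into dts or gts."""
    # Candidate pairs: same character category and IOU above the threshold.
    pairs = []
    for i, dt in enumerate(dts):
        for j, gt in enumerate(gts):
            if dt['text'] != gt['text']:
                continue
            overlap = iou(dt['bbox'], gt['bbox'])
            if overlap > iou_thresh:
                pairs.append((overlap, i, j))
    # Greedy matching in descending order of IOU; each DT and each GT
    # is used at most once.
    pairs.sort(key=lambda p: p[0], reverse=True)
    dt_to_gt, used_gts = {}, set()
    for _, i, j in pairs:
        if i not in dt_to_gt and j not in used_gts:
            dt_to_gt[i] = j
            used_gts.add(j)
    # Unmatched DTs covered by an ignore region are dropped; the rest are FPs.
    ignored, false_positives = [], []
    for i, dt in enumerate(dts):
        if i in dt_to_gt:
            continue
        box = dt['bbox']
        covered = any(
            intersection_area(box, ig) > iou_thresh * box[2] * box[3]
            for ig in igs)
        (ignored if covered else false_positives).append(i)
    false_negatives = [j for j in range(len(gts)) if j not in used_gts]
    return dt_to_gt, ignored, false_positives, false_negatives
```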

When matching a DT with a GT, we require that they have the identical character category and that $IOU(DT, GT) > 0.5$. When matching a DT with an IG, we require $\exists ig \subseteq IG$ s.t. $IOU(DT, ig) > 0.5$. Otherwise, they are not matched. Here $IOU(A, B) = \frac{Area(A \cap B)}{Area(A \cup B)}$.

When computing an AP, for any given confidence score $c_0$, we take as the precision the maximum of the precisions at all confidence scores $c \leq c_0$, i.e., at all recall levels at least as high, as in PASCAL VOC.
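A minimal sketch of such an AP computation is given below. It follows the PASCAL VOC convention, in which the precision at a given recall level is replaced by the maximum precision reached at that recall level or any higher one (that is, at any equal or lower confidence threshold); that this matches the evaluation API exactly is an assumption. The input layout is hypothetical, and ties between equal scores are not treated specially.

```python
def average_precision(scored_dts, n_gt):
    """scored_dts: list of (score, is_tp) for the DTs that remain after
    matching; n_gt: number of GTs. Returns the AP in [0, 1]."""
    if n_gt == 0:
        return 0.0
    ordered = sorted(scored_dts, key=lambda d: d[0], reverse=True)
    precisions = []  # precision each time recall increases by 1 / n_gt
    tp = 0
    for rank, (_, is_tp) in enumerate(ordered, start=1):
        if is_tp:
            tp += 1
            precisions.append(tp / float(rank))
    # Envelope: walk from the highest recall back to the lowest, keeping
    # the running maximum so the curve never increases with recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Recall levels that are never reached contribute a precision of 0.
    return sum(precisions) / n_gt
```

As a worked example, take four GTs and four DTs sorted by score with outcomes TP, FP, TP, TP. The raw precisions at the three TPs are 1, 2/3 and 3/4; the envelope raises the second one to 3/4; the fourth recall level is never reached. The AP is therefore (1 + 0.75 + 0.75 + 0) / 4 = 0.625.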

The computation steps above are inefficient. Our actual code is more efficient and does not follow these steps, but it produces the same results for any legal detection file.

Notes:

  > The Python API and the C++ API may generate slightly different results due to floating-point precision; the difference is always less than 0.0000001%. We treat the results of the C++ API as official.
  >
  > If some DTs have the same confidence score, our evaluation API produces a certain reproducible result.