# CTW dataset tutorial (Part 3: Detection baseline)

In this part of the tutorial, we will show you:

  - [Framework of detection baseline](#Framework-of-detection-baseline)
  - [Training steps](#Training-steps)
  - [Predicting steps](#Predicting-steps)
  - [Results format and evaluation API](#Results-format-and-evaluation-API)
  - [Evaluate results locally](#Evaluate-results-locally)

Notes:

  > This notebook MUST be run under `$CTW_ROOT/examples`.
  >
  > All our code SHOULD run on `Linux>=3` with `Python>=3.4`. We keep it compatible with `Python>=2.7` on a best-effort basis.

## Framework of detection baseline

We use [YOLOv2](https://pjreddie.com/darknet/yolo/) with slight modifications, which are detailed in the git commits. We also apply an image cropping method and a multiscale testing scheme, both of which are described in our paper.

Following the classification task, we also limit the number of categories to 1001, i.e., the 1000 most frequently observed character categories and an 'others' category.
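
As a rough illustration of the 1001-category scheme, the sketch below counts character frequencies and maps everything outside the top `top_k` to a shared 'others' class. It is not the code in `decide_cates.py`; the function names `decide_categories` and `class_of`, and the toy input string, are hypothetical.

```python
from collections import Counter


def decide_categories(texts, top_k=1000):
    """Return the top_k most frequent characters, a char -> class id map,
    and the id of the shared 'others' class (hypothetical helper)."""
    counts = Counter(texts)
    # most_common() orders by count, descending
    top = [ch for ch, _ in counts.most_common(top_k)]
    char_to_id = {ch: i for i, ch in enumerate(top)}
    others_id = len(top)  # one extra class for every remaining character
    return top, char_to_id, others_id


def class_of(ch, char_to_id, others_id):
    """Map a character to its class id, falling back to 'others'."""
    return char_to_id.get(ch, others_id)


# Toy example with top_k=2 instead of 1000
top, char_to_id, others_id = decide_categories("中中中国国人", top_k=2)
print(top, class_of("人", char_to_id, others_id))
```

With `top_k=1000` this yields 1000 character classes plus one 'others' class, i.e. 1001 categories in total.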


## Training steps

Some steps are similar to those in the classification tutorial, so this tutorial is simplified.

#### Compile darknet and download pre-trained model

First, initialize the git submodules. If you have problems initializing the submodules, you may manually download darknet and copy it to the corresponding directory.

Note that we have slightly modified darknet, so you shouldn't clone the original darknet. Clone the repository specified in `$CTW_ROOT/.gitmodules` and pay attention to its `branch`.

In [None]:
!git submodule update --init --recursive
!cd ../detection/darknet && make -j8
!cd ../detection && if [ ! -f "products/darknet19_448.conv.23" ]; then curl https://pjreddie.com/media/files/darknet19_448.conv.23 -o products/darknet19_448.conv.23; fi

#### Decide categories

In [None]:
!cd ../detection && python3 decide_cates.py

#### Crop images and write meta data

This step will write:

  - detection/products/trainval/\*.\{jpg,txt\}
  - detection/products/trainval.txt
  - detection/products/yolo-chinese.cfg
  - detection/products/chinese.data
  - detection/products/chinese.names

In [None]:
!cd ../detection && python3 prepare_train_data.py

#### Run train scripts

This step will write:

  - detection/products/backup/\*.weights

This script generates a large amount of logs and takes a long time, so we recommend running it with `/bin/bash` instead of running it directly in the Jupyter notebook.

Time cost estimation (NVIDIA GTX TITAN X): 3.0 sec / step, 38 hours in total.

In [None]:
!cd ../detection && python3 train.py

#### Download trained models

In [None]:
# TODO: download trained models

## Predicting steps

#### Crop testing images and write meta data

This step will write:

  - detection/products/test/\*.\{jpg,txt\}
  - detection/products/test.txt
  - detection/products/yolo-chinese-test.cfg


In [None]:
!cd ../detection && python3 prepare_test_data.py

#### Run darknet

You may need to set `TEST_NUM_GPU` in `detection/settings.py` and `num_thread` in `detection/eval.py` before running this step. One thread takes about 3.5 GB of GPU memory. If you can run $n$ threads on one GPU, you should set `num_thread` to `n * TEST_NUM_GPU` to achieve maximum utilization.
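
The arithmetic behind `num_thread` can be sketched as below. The helper `suggest_num_thread` is hypothetical and not part of `detection/eval.py`; the 12 GB figure is an assumed per-GPU memory size, so substitute the value for your own hardware.

```python
def suggest_num_thread(gpu_memory_gb, test_num_gpu, per_thread_gb=3.5):
    """Suggest num_thread = n * TEST_NUM_GPU, where n is the number of
    threads that fit into one GPU at about 3.5 GB per thread."""
    threads_per_gpu = int(gpu_memory_gb // per_thread_gb)
    return threads_per_gpu * test_num_gpu


# Assumed setup: 2 GPUs with 12 GB each -> 3 threads per GPU
print(suggest_num_thread(12.0, 2))  # 6
```

In practice some memory is taken by other processes, so a value slightly lower than the suggested one may be needed.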

This step will write:

  - detection/products/chinese.\*.data
  - detection/products/test.\*.txt
  - detection/products/results/chinese.\*.txt

This script generates a large amount of logs and takes a long time, so we recommend running it with `/bin/bash` instead of running it directly in the Jupyter notebook.

Time cost estimation (NVIDIA GTX TITAN X \* 2): 0.2 sec * num_thread / subimage, 2.4 hours in total.

Notes:

  > For the validation set, whose size is about 0.5 times that of the detection testing set, the estimated time cost is 1.2 hours in total.

In [None]:
!cd ../detection && python3 eval.py

#### Merge results

We don't apply non-maximum suppression (NMS) to each subimage in YOLOv2. Instead, results are merged in the following steps.

1. Collect candidates from the YOLOv2 results of the subimages.
2. Discard candidates with improper size.
3. Map candidates from the subimages back onto the full images.
4. Apply NMS on the full images.

In [None]:
!cd ../detection && python3 merge_results.py
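
The four merging steps above can be sketched roughly as follows. This is not the code in `merge_results.py`: the function names, the candidate layout (`offset`, `bbox`, `text`, `score`), the size thresholds, the NMS threshold, and the choice of class-agnostic NMS are all assumptions made for illustration, and the rescaling needed for multiscale testing is left out.

```python
def iou(a, b):
    """IOU of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def merge_subimage_results(candidates, min_side=2.0, max_side=400.0, nms_iou=0.5):
    """candidates: dicts with 'offset' (x0, y0 of the subimage inside the
    full image), 'bbox' (x, y, w, h in subimage coordinates), 'text', 'score'."""
    full = []
    for c in candidates:  # step 1: iterate over all collected candidates
        x, y, w, h = c['bbox']
        # step 2: drop candidates with improper size (thresholds are assumed)
        if not (min_side <= max(w, h) <= max_side):
            continue
        # step 3: translate subimage coordinates to full-image coordinates
        x0, y0 = c['offset']
        full.append({'bbox': (x + x0, y + y0, w, h),
                     'text': c['text'], 'score': c['score']})
    # step 4: greedy NMS on the full image, highest score first
    full.sort(key=lambda d: -d['score'])
    kept = []
    for d in full:
        if all(iou(d['bbox'], k['bbox']) <= nms_iou for k in kept):
            kept.append(d)
    return kept


candidates = [
    {'offset': (0, 0), 'bbox': (10, 10, 20, 20), 'text': '中', 'score': 0.9},
    {'offset': (5, 5), 'bbox': (6, 6, 20, 20), 'text': '中', 'score': 0.8},
    {'offset': (100, 0), 'bbox': (0, 0, 1, 1), 'text': '国', 'score': 0.99},
    {'offset': (100, 100), 'bbox': (0, 0, 30, 30), 'text': '人', 'score': 0.5},
]
merged = merge_subimage_results(candidates)
print([(d['text'], d['score']) for d in merged])
```

In the toy input, the second candidate overlaps the first one after translation and is suppressed, and the third one is dropped for being too small.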

## Results format and evaluation API

Detection results MUST be UTF-8 encoded [JSON Lines](http://jsonlines.org/). Each line MUST correspond to the matching item in `test_det` in the overall information, which is described in `Tutorial part 1: Basics`.

```
result (corresponding to one line in .jsonl):
{
    detections: [detection_0, detection_1, detection_2, ...],  # length of this list MUST be less than or equal to 1000
}

detection:
{
    bbox: [x, y, w, h],          # x, y, w, h are floating-point numbers, and w, h MUST be greater than 0
    text: str,                   # length is usually 1, otherwise this must be a failed detection
    score: float,
}
```
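
A minimal sketch of writing results in this format is shown below. The helpers `check_result` and `dump_results` are hypothetical and are not part of `pythonapi/eval_tools.py`; they only mirror the constraints listed above. An in-memory buffer stands in for a real file, which would be opened with `encoding='utf-8'`.

```python
import io
import json


def check_result(result):
    """Raise ValueError if `result` violates the format described above."""
    detections = result.get('detections')
    if not isinstance(detections, list) or len(detections) > 1000:
        raise ValueError('detections must be a list of at most 1000 items')
    for det in detections:
        bbox = det['bbox']
        if len(bbox) != 4 or not all(isinstance(v, (int, float)) for v in bbox):
            raise ValueError('bbox must be [x, y, w, h] with numeric values')
        if bbox[2] <= 0 or bbox[3] <= 0:
            raise ValueError('w and h must be greater than 0')
        if not isinstance(det['text'], str):
            raise ValueError('text must be a string')
        if not isinstance(det['score'], (int, float)):
            raise ValueError('score must be a number')


def dump_results(results, fp):
    """Write one JSON object per line, one line per test image."""
    for result in results:
        check_result(result)
        # ensure_ascii=False keeps Chinese characters readable in the file
        fp.write(json.dumps(result, ensure_ascii=False) + '\n')


results = [
    {'detections': [{'bbox': [10.0, 20.0, 32.5, 30.0], 'text': '中', 'score': 0.93}]},
    {'detections': []},  # an image with no detections still needs its own line
]
buf = io.StringIO()
dump_results(results, buf)
lines = buf.getvalue().splitlines()
print(lines[0])
```

Images without detections still produce a line, so that the number of lines stays equal to the number of items in `test_det`.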

Our evaluation API in `pythonapi/eval_tools.py` works as follows.

  1. Check that the detection (DT) file has the same number of lines as the ground truth (GT) file. Otherwise, return an error.
  2. Check that each line of the DT file is valid JSON and conforms to the results format. Otherwise, return an error.
  3. Use the ignore list (IG) in the annotations.
  4. Non-Chinese character instances are removed from GT and added to IG.
  5. For each of the sizes, we process the DTs, GTs, and IGs of each image in the following steps.
    1. Move GTs that do not fit the current size range to IG. (For size 'all', this step has no effect.)
    2. Match DTs with GTs greedily, ordered first by the 'score' of DTs in descending order and then by the IOU between DTs and GTs in descending order. Matched DTs are true positives (TPs).
    3. Match DTs with IGs. Matched DTs have no effect on the evaluation.
    4. Remove DTs that do not fit the current size range. These DTs have no effect on the evaluation. (For size 'all', this step has no effect.)
    5. Remaining DTs are false positives (FPs), and remaining GTs are false negatives (FNs). Note that there is no need to compute TNs.
    6. Sort the TPs and FPs of DTs by 'score' in descending order, and take the top-$n$ of them as TPs of GTs.
  6. For each size, we compute metrics in the following steps.
    1. For each character category, take the TPs and FPs of DTs belonging to that category to compute AP. Compute the mean of these APs weighted by the number of character instances in each category; denote this mean value as `macro-mAP`, also called `mAP`.
    2. For each attribute, count the TPs and FNs of GTs and compute `recall`.
    3. For each character category, count the TPs and FNs of GTs and compute `recall`.
    4. Take the TPs and FPs of DTs in all categories to compute `AP`.
    5. For each image, take the TPs and FPs of DTs belonging to that image to compute AP. Compute the mean of these APs; denote this mean value as `micro-mAP`.

When matching a DT with a GT, we require that they have the same character category and that $IOU(DT, GT) > 0.5$. When matching a DT with an IG, we require that $\exists ig \in IG$ s.t. $IOU(DT, ig) > 0.5$, where $IOU(A, B) = \frac{Area(A \cap B)}{Area(A \cup B)}$.
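
The matching rules can be sketched for a single image as follows. This is an illustration under stated assumptions, not the code in `pythonapi/eval_tools.py`: the function names, the data layout, and the labels `'tp'`, `'ignored'`, and `'unmatched'` are hypothetical, and the size-range steps are omitted.

```python
def iou(a, b):
    """IOU(A, B) = Area(A ∩ B) / Area(A ∪ B) for boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def match_detections(dts, gts, igs, thresh=0.5):
    """dts: dicts with 'bbox', 'text', 'score'; gts: dicts with 'bbox', 'text';
    igs: list of bboxes. Returns a label per DT and a taken flag per GT."""
    # Highest score first; sorted() is stable, so ties keep their list order
    order = sorted(range(len(dts)), key=lambda i: -dts[i]['score'])
    gt_taken = [False] * len(gts)
    labels = [None] * len(dts)
    for i in order:
        dt = dts[i]
        best, best_iou = -1, thresh
        for j, gt in enumerate(gts):
            # Same character category required; each GT is matched at most once
            if gt_taken[j] or gt['text'] != dt['text']:
                continue
            v = iou(dt['bbox'], gt['bbox'])
            if v > best_iou:  # strictly greater than 0.5, prefer the highest IOU
                best, best_iou = j, v
        if best >= 0:
            gt_taken[best] = True
            labels[i] = 'tp'
        elif any(iou(dt['bbox'], ig) > thresh for ig in igs):
            labels[i] = 'ignored'  # no effect on the evaluation
        else:
            labels[i] = 'unmatched'
    return labels, gt_taken


gts = [{'bbox': (0, 0, 10, 10), 'text': '中'}]
igs = [(50, 50, 10, 10)]
dts = [
    {'bbox': (1, 1, 10, 10), 'text': '中', 'score': 0.9},
    {'bbox': (0, 0, 10, 10), 'text': '中', 'score': 0.8},
    {'bbox': (51, 51, 10, 10), 'text': '国', 'score': 0.7},
    {'bbox': (0, 0, 10, 10), 'text': '国', 'score': 0.6},
]
labels, gt_taken = match_detections(dts, gts, igs)
print(labels)
```

In the toy input, the second DT overlaps the GT perfectly but stays unmatched, because the higher-scoring first DT has already claimed that GT.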

For any group of matched and unmatched DTs from the steps above, we compute AP in the following steps.

  1. TODO: How to compute AP

If there is no error, the data structure of the evaluation API output is as described below.

```
TODO: output format
```

Macro-mAP (also called `mAP`) of size `all` should be considered the single most important metric on CTW.

Notes:

  > The Python API and the C++ API may generate slightly different results due to floating-point precision; the difference is always less than 0.0000001%. We officially approve the results of the C++ API.
  >
  > If two DTs in one image, A and B, have the same confidence score and A comes before B in the list, we match A before B with GTs in the greedy matching step.
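
The tie-breaking rule in the last note corresponds to a stable sort by score. The snippet below only illustrates that behaviour with Python's built-in `sorted`, which is guaranteed to be stable; it is not taken from the evaluation API, and the toy detections are made up.

```python
detections = [
    {'text': 'A', 'score': 0.9},
    {'text': 'B', 'score': 0.9},   # same score as A, listed after A
    {'text': 'C', 'score': 0.95},
]

# Descending by score; equal scores keep their original list order
ordered = sorted(detections, key=lambda d: -d['score'])
print([d['text'] for d in ordered])  # ['C', 'A', 'B']
```

If the order of equally scored detections matters to you, place the preferred detection earlier in the `detections` list.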