# CTW dataset tutorial (Part 1: Basics)

Welcome to the tutorial for the _Chinese Text in the Wild_ (CTW) dataset. This tutorial series covers:

1. [Basics](#CTW-dataset-tutorial-(Part-1:-Basics))

  - [The structure of this repository](#The-structure-of-this-repository)
  - [Download images and annotations](#Download-images-and-annotations)
  - [Dataset split](#Dataset-Split)
  - [Annotation format](#Annotation-format)
  - [Draw annotations on images](#Draw-annotations-on-images)
  - [Appendix: Adjusted bounding box conversion](#Appendix:-Adjusted-bounding-box-conversion)

2. Classification baseline

  - Train classification model
  - Evaluate your classification model

3. Detection baseline

  - Train detection model
  - Evaluate your detection model

Our homepage is [https://ctwdataset.github.io](https://ctwdataset.github.io), where you can find more information.

Notes:

  > This notebook MUST be run under `$CTW_ROOT/examples`.
  >
  > All our code SHOULD be run on `Linux>=3` with `Python>=3.4`. We keep it compatible with `Python>=2.7` on a best-effort basis.
  >
  > The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://tools.ietf.org/html/rfc2119).

## The structure of this repository

Our git repository is `git@github.com:yuantailing/ctwdataset.git`, which you can browse from [GitHub](https://github.com/yuantailing/ctwdataset).

There are several directories under `$CTW_ROOT`.

  - **examples/**: tutorials
  - **data/**: download and place images and annotations
  - **prepare/**: prepare dataset splits
  - **classification/**: classification baselines using [TensorFlow](https://www.tensorflow.org/)
  - **detection/**: detection baseline using [YOLOv2](https://pjreddie.com/darknet/yolo/)
  - **judge/**: evaluate testing results and draw results and statistics
  - **pythonapi/**: APIs to traverse annotations, to evaluate results, and for common use
  - **cppapi/**: a faster implementation of detection mAP evaluation
  - **codalab/**: code that we run on [CodaLab](https://competitions.codalab.org/competitions/?q=CTW) (our evaluation server)
  - **ssd/**: a detection method using [SSD](https://github.com/weiliu89/caffe/tree/ssd)

Most of the above directories share a similar structure.

  - **\*/settings.py**: configure directory of images, file path to annotations, and dedicated configurations for each step
  - **\*/products/**: store temporary files, logs, intermediate products, and final products
  - **\*/pythonapi**: a symbolic link to `pythonapi/`, in order to use Python API more conveniently

Most of the code is written in Python, while some code is written in C++, Bash, etc.

Our code does not create or modify any files outside `$CTW_ROOT` (except in `/tmp/`) and does not need privilege elevation (except to run Docker workers on the evaluation server). You SHOULD install the following requirements before you run our code.

  - git>=1
  - Python>=3.4
  - Jupyter notebook>=5.0
  - gcc>=5
  - g++>=5
  - CUDA driver
  - CUDA toolkit>=8.0
  - CUDNN>=6.0
  - OpenCV>=3.0
  - requirements listed in `$CTW_ROOT/requirements.txt`

Recommended hardware requirements:

  - RAM >= 32GB
  - GPU memory >= 12 GB
  - Hard Disk free >= 100 GB
  - CPU logical cores >= 8
  - Network connection

## Download images and annotations

1. Clone the repository

  We assume you have cloned `git@github.com:yuantailing/ctwdataset.git` and changed directory to `ctwdataset/examples/`.

2. Download images to `$CTW_ROOT/data/all_images/`
3. Download annotations to `$CTW_ROOT/data/annotations/downloads/`

In [None]:
# TODO: This is only an example, replace it with real data
!curl https://example.com -o ../data/annotations/downloads/example.com

## Dataset Split

We split the dataset into 4 parts:

1. Training set (about 75%)

  For each image in the training set, the annotation contains a number of sentences, and each sentence contains some character instances.
  
  Each character instance contains:
  
    - its underlying character,
    - its bounding box (polygon),
    - and 6 attributes.

  Only Chinese character instances are completely annotated; non-Chinese characters are only partially annotated.

  Some ignore regions are annotated. They contain character instances that cannot be recognized by humans (e.g. too small, too fuzzy).

  We will show the annotation format in a [later section](#Annotation-format).

2. Validation set (about 5%)

  Same as the training set.
  
  The split between the training set and the validation set is only a recommendation. We place no restriction on how you split them. To enlarge the training data, you MAY use TRAIN+VAL to train your models.

3. Testing set for classification (about 10%)

  For this testing set, we make the annotated bounding boxes public. The underlying characters, attributes, sentences, and ignore regions are not available.

  To evaluate your results on testing set, please visit our evaluation server.
  
  > You MUST NOT use annotations on testing set to fine tune your models or hyper-parameters.
  >
  > You MUST NOT use evaluation server to fine tune your models or hyper-parameters.

4. Testing set for detection (about 10%)

  For this testing set, we make images public.

  To evaluate your results on testing set, please visit our evaluation server.

To run evaluation and analysis code locally, we will use the validation set as the testing sets in this tutorial.

If you plan to train your model on TRAIN+VAL, you can execute `cp ../data/annotations/downloads/* ../data/annotations/` instead of running the following code. However, you will not be able to run evaluation and analysis code locally, since you don't have the ground truth of the testing set.

In [None]:
!cd ../prepare && python3 fake_testing_set.py
# or exec `cp ../data/annotations/downloads/* ../data/annotations/`

In [None]:
# Then, create symbolic links to the downloaded images
!cd ../prepare && python3 symlink_images.py

## Annotation format

We will show you:

- Overall information format
- Training set annotation format
- Classification testing set format

We will display some examples in the next cell.

#### Overall information format

The overall information file (`../data/annotations/info.json`) is a UTF-8 (no BOM) encoded [JSON](https://www.json.org/) file.

The data structure of this information file is described below.

```
information:
{
    train: [image_meta_0, image_meta_1, image_meta_2, ...],
    val: [image_meta_0, image_meta_1, image_meta_2, ...],
    test_cls: [image_meta_0, image_meta_1, image_meta_2, ...],
    test_det: [image_meta_0, image_meta_1, image_meta_2, ...],
}

image_meta:
{
    image_id: str,
    file_name: str,
    width: int,
    height: int,
}
```
The `train`, `val`, `test_cls`, and `test_det` keys denote the training set, validation set, testing set for classification, and testing set for detection, respectively.

The resolution of every image is currently $2048 \times 2048$. The image ID is a 7-digit string; its first digit indicates the camera orientation according to the following rule.

  - '0': back
  - '1': left
  - '2': front
  - '3': right

Image file name doesn't contain directory name, and is always `image_id + '.jpg'`.
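
The image ID rules above can be sketched in a few lines of Python. The helper names `camera_orientation` and `image_file_name` are hypothetical and are not part of `pythonapi`; the ID `2040368` is the one mentioned elsewhere in this tutorial.

```python
# Orientation codes copied from the list above.
ORIENTATIONS = {'0': 'back', '1': 'left', '2': 'front', '3': 'right'}


def camera_orientation(image_id):
    """Return the camera orientation encoded in the first digit of image_id."""
    if len(image_id) != 7 or not image_id.isdigit():
        raise ValueError('expected a 7-digit image ID, got %r' % (image_id,))
    if image_id[0] not in ORIENTATIONS:
        raise ValueError('unknown orientation digit in %r' % (image_id,))
    return ORIENTATIONS[image_id[0]]


def image_file_name(image_id):
    """File names carry no directory part and are always image_id + '.jpg'."""
    return image_id + '.jpg'


print(camera_orientation('2040368'), image_file_name('2040368'))
```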

#### Training set annotation format

All `.jsonl` annotation files (e.g. `../data/annotations/train.jsonl`) are UTF-8 encoded [JSON Lines](http://jsonlines.org/), each line corresponding to the annotation of one image.

The data structure of each annotation in the training set (and validation set) is described below.
```
annotation (corresponding to one line in .jsonl):
{
    image_id: str,
    file_name: str,
    width: int,
    height: int,
    annotations: [sentence_0, sentence_1, sentence_2, ...],
    ignore: [ignore_0, ignore_1, ignore_2, ...],                    # MAY be an empty list
}

sentence:
[instance_0, instance_1, instance_2, ...]

instance:
{
    polygon: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]],    # x, y are floating-point numbers
    text: str,                                            # the length of the text MUST be exactly 1
    is_chinese: bool,
    attributes: [attr_0, attr_1, attr_2, ...],            # MAY be an empty list
    adjusted_bbox: [xmin, ymin, w, h],                    # x, y, w, h are floating-point numbers
}

attr:
"occluded" | "bgcomplex" | "distorted" | "raised" | "wordart" | "handwritten"

ignore:
{
    polygon: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]],
    bbox: [xmin, ymin, w, h],
}
```

The original bounding box annotations are polygons. We describe how `polygon` is converted to `adjusted_bbox` in the [appendix](#Appendix:-Adjusted-bounding-box-conversion).

Notes:

  > The order of lines is not guaranteed to be consistent with `info.json`.
  >
  > Each polygon MUST be a quadrangle.
  >
  > All characters in `CJK Unified Ideographs` are regarded as Chinese, while characters in `ASCII` and the `CJK Unified Ideographs Extension`(s) are not.
  >
  > The adjusted bbox of each character `instance` MUST intersect the image, while bboxes in `ignore` may not.
  >
  > Some logos on the camera car (e.g. "`腾讯街景地图`" in `2040368.jpg`) are ignored to avoid bias.
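
As an illustration of the nesting described above (image, then sentences, then character instances), here is a minimal sketch that flattens one annotation line and tallies the attributes of its Chinese instances. The sample line is hand-made and its values are invented; `iter_instances`, `square`, and `make_instance` are hypothetical helpers written for this example and are not the repository's `anno_tools` API.

```python
import collections
import json


def square(x, y, size):
    """Axis-aligned quadrangle, used only to build the sample data."""
    return [[x, y], [x + size, y], [x + size, y + size], [x, y + size]]


def make_instance(text, is_chinese, attributes, x):
    """Build one made-up character instance with the documented keys."""
    return {
        'polygon': square(x, 10.0, 20.0),
        'text': text,
        'is_chinese': is_chinese,
        'attributes': attributes,
        'adjusted_bbox': [x, 10.0, 20.0, 20.0],
    }


# One invented .jsonl line: two sentences, three character instances.
line = json.dumps({
    'image_id': '0000001',
    'file_name': '0000001.jpg',
    'width': 2048,
    'height': 2048,
    'annotations': [
        [make_instance('中', True, ['occluded'], 10.0),
         make_instance('文', True, ['occluded', 'wordart'], 40.0)],
        [make_instance('A', False, [], 80.0)],
    ],
    'ignore': [],
})


def iter_instances(anno):
    """Yield every character instance of one image, sentence by sentence."""
    for sentence in anno['annotations']:
        for instance in sentence:
            yield instance


anno = json.loads(line)
chinese = [inst for inst in iter_instances(anno) if inst['is_chinese']]
attr_counts = collections.Counter(
    attr for inst in chinese for attr in inst['attributes'])
print(len(chinese), dict(attr_counts))
```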

#### Classification testing set format

The data structure of each annotation in the classification testing set is described below.

```
annotation:
{
    image_id: str,
    file_name: str,
    width: int,
    height: int,
    proposals: [proposal_0, proposal_1, proposal_2, ...],
}

proposal:
{
    polygon: [[x0, y0], [x1, y1], [x2, y2], [x3, y3]],
    adjusted_bbox: [xmin, ymin, w, h],
}
```

Notes:

  > The order of `image_id` across lines is not guaranteed to be the same as in `info.json`.
  >
  > Non-Chinese characters MUST NOT appear in proposals.
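
One possible way to consume a proposal is to turn its floating-point `adjusted_bbox` into an integer pixel box before cropping. The sketch below is only an assumption about how you might do this, not the method used by the classification baseline; `crop_box` is a hypothetical helper. It clips to the image, because a bbox is only required to intersect the image and may extend beyond its border.

```python
import math


def crop_box(adjusted_bbox, width, height):
    """Convert [xmin, ymin, w, h] to integer (x0, y0, x1, y1) inside the image."""
    xmin, ymin, w, h = adjusted_bbox
    # Round outwards so the whole box is covered, then clip to the image.
    x0 = max(0, int(math.floor(xmin)))
    y0 = max(0, int(math.floor(ymin)))
    x1 = min(width, int(math.ceil(xmin + w)))
    y1 = min(height, int(math.ceil(ymin + h)))
    return x0, y0, x1, y1


# Invented bboxes: one inside the image, one crossing its border.
print(crop_box([100.5, 200.25, 30.0, 40.5], 2048, 2048))
print(crop_box([-5.0, 2040.0, 20.0, 20.0], 2048, 2048))
```

With an image loaded as a NumPy array (as `cv2.imread` returns), the crop would then be `img[y0:y1, x0:x1]`.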

In [None]:
from __future__ import print_function
from __future__ import unicode_literals

import json
import pprint
import settings

from pythonapi import anno_tools

print('Image meta info format:')
with open(settings.DATA_LIST) as f:
    data_list = json.load(f)
pprint.pprint(data_list['train'][0])

In [None]:
print('Training set annotation format:')
with open(settings.TRAIN) as f:
    anno = json.loads(f.readline())
pprint.pprint(anno, depth=3)

In [None]:
print('Character instance format:')
pprint.pprint(anno['annotations'][0][0])

In [None]:
print('Traverse character instances in an image')
for instance in anno_tools.each_char(anno):
    print(instance['text'], end=' ')
print()

In [None]:
print('Classification testing set format')
with open(settings.TEST_CLASSIFICATION) as f:
    anno = json.loads(f.readline())
pprint.pprint(anno, depth=2)

In [None]:
print('Classification testing set proposal format')
pprint.pprint(anno['proposals'][0])

## Draw annotations on images

In this section, we will draw annotations on images. This will help you understand the annotation format.

We show polygon bounding boxes of Chinese character instances in <font color="#0f0">**green**</font>, non-Chinese character instances in <font color="#ff0">**yellow**</font>, and ignore regions in <font color="#f00">**red**</font>.

In [None]:
import cv2
import json
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import os
import settings

from pythonapi import anno_tools

%matplotlib inline

with open(settings.TRAIN) as f:
    anno = json.loads(f.readline())
img = cv2.imread(os.path.join(settings.TRAINVAL_IMAGE_DIR, anno['file_name']))
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

plt.figure(figsize=(16, 16))
ax = plt.gca()
plt.imshow(img)
for instance in anno_tools.each_char(anno):
    color = (0, 1, 0) if instance['is_chinese'] else (1, 1, 0)
    ax.add_patch(patches.Polygon(instance['polygon'], fill=False, color=color))
for ignore in anno['ignore']:
    color = (1, 0, 0)
    ax.add_patch(patches.Polygon(ignore['polygon'], fill=False, color=color))
plt.show()

## Appendix: Adjusted bounding box conversion

In order to create a tighter bounding box for character instances, we compute `adjusted_bbox` with the following steps instead of using the real bounding box.

  1. Take the trisection points (<font color="#f00">red points</font>) of each edge of the polygon
  2. Compute the bounding box (<font color="#00f">blue rectangles</font>) of these points

The adjusted bounding box fits the character more tightly than the real bounding box, especially for sharp polygons.
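
To make the difference concrete, the following sketch compares the real and the adjusted bounding box of one slanted quadrangle numerically. It is a standalone re-statement of the two steps above; `trisection_points` and `bbox_of` are names chosen for this example, and the polygon coordinates are arbitrary.

```python
def trisection_points(poly):
    """Return the two points that split each polygon edge into thirds."""
    points = []
    n = len(poly)
    for i in range(n):
        (x0, y0), (x1, y1) = poly[i], poly[(i + 1) % n]
        for t in (1.0 / 3, 2.0 / 3):
            points.append((x0 + (x1 - x0) * t, y0 + (y1 - y0) * t))
    return points


def bbox_of(points):
    """Axis-aligned bounding box as (xmin, ymin, w, h)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


polygon = [[21, 1], [30, 5], [31, 19], [22, 14]]
real = bbox_of(polygon)
adjusted = bbox_of(trisection_points(polygon))

print('real bbox:    ', real, 'area', real[2] * real[3])
print('adjusted bbox:', adjusted, 'area', adjusted[2] * adjusted[3])
```

For this polygon the adjusted box covers roughly 140 square units against 180 for the real box, since the sharp corners no longer determine the extent.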

In [None]:
from __future__ import division

import collections
import matplotlib.patches as patches
import matplotlib.pyplot as plt

%matplotlib inline

def poly2bbox(poly):
    key_points = list()
    rotated = collections.deque(poly)
    rotated.rotate(1)
    for (x0, y0), (x1, y1) in zip(poly, rotated):
        for ratio in (1/3, 2/3):
            key_points.append((x0 * ratio + x1 * (1 - ratio), y0 * ratio + y1 * (1 - ratio)))
    x, y = zip(*key_points)
    adjusted_bbox = (min(x), min(y), max(x) - min(x), max(y) - min(y))
    return key_points, adjusted_bbox

polygons = [
    [[2, 1], [11, 2], [12, 18], [3, 16]],
    [[21, 1], [30, 5], [31, 19], [22, 14]],
]

plt.figure(figsize=(10, 6))
plt.xlim(0, 35)
plt.ylim(0, 20)
ax = plt.gca()
for polygon in polygons:
    color = (0, 1, 0)
    ax.add_patch(patches.Polygon(polygon, fill=False, color=color))
    key_points, adjusted_bbox = poly2bbox(polygon)
    ax.add_patch(patches.Rectangle(adjusted_bbox[:2], *adjusted_bbox[2:], fill=False, color=(0, 0, 1)))
    for kp in key_points:
        ax.add_patch(patches.Circle(kp, radius=0.1, fill=True, color=(1, 0, 0)))
plt.show()