# CTW dataset tutorial (Part 2: Classification baseline)

In this part of the tutorial, we will show you:

  - [Framework of classification baseline](#Framework-of-classification-baseline)
  - [Training steps](#Training-steps)
  - [Predicting steps](#Predicting-steps)
  - [Results format and evaluation API](#Results-format-and-evaluation-API)
  - [Evaluate results locally](#Evaluate-results-locally)

Notes:

  > This notebook MUST run under `$CTW_ROOT/examples`.
  >
  > All our code SHOULD run on `Linux>=3` with `Python>=3.4`. We keep it compatible with `Python>=2.7` on a best-effort basis.

## Framework of classification baseline

We regard the text recognition problem as a classification problem.

We only consider recognition of the 1000 most frequently observed character categories. We do not attempt to recognize other categories, so every instance of those categories is necessarily counted as a failure.

The _magic number_ `1000` is written in `classification/settings.py`. You may modify it if you want to train with another number of categories.

For each character instance, we perform the following operations.

  1. crop the image region around it
  2. (training step only) randomly adjust saturation, brightness, contrast
  3. (training step only) randomly apply an affine transform
  4. per-instance standardization
  5. resize to fit the input of classification models
  6. feed to each of the classification models
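
The deterministic steps (1, 4 and 5) can be sketched with plain NumPy as follows. This is only an illustration under assumptions, not the code in `classification/`: the `(x, y, w, h)` bounding box layout, the helper names, and the nearest-neighbour resize are all hypothetical choices made for this example.

```python
import numpy as np

def crop_region(image, bbox):
    """Crop an (H, W, C) image to bbox = (x, y, w, h), clipped to the image bounds.

    The (x, y, w, h) layout is an assumption made for this sketch.
    """
    x, y, w, h = bbox
    height, width = image.shape[:2]
    x0, y0 = max(int(np.floor(x)), 0), max(int(np.floor(y)), 0)
    x1, y1 = min(int(np.ceil(x + w)), width), min(int(np.ceil(y + h)), height)
    return image[y0:y1, x0:x1]

def standardize(patch):
    """Scale a single patch to zero mean and (roughly) unit variance."""
    patch = patch.astype(np.float32)
    # Lower bound on the divisor so that uniform patches do not divide by zero.
    min_std = 1.0 / np.sqrt(patch.size)
    return (patch - patch.mean()) / max(patch.std(), min_std)

def resize_nearest(patch, size):
    """Resize a patch to (size, size) with nearest-neighbour sampling."""
    rows = np.arange(size) * patch.shape[0] // size
    cols = np.arange(size) * patch.shape[1] // size
    return patch[rows][:, cols]

image = np.arange(10 * 12 * 3).reshape(10, 12, 3)
patch = crop_region(image, (2.5, 1.0, 4.0, 5.0))
model_input = resize_nearest(standardize(patch), 8)
print(patch.shape, model_input.shape)
```

Steps 2 and 3 are random augmentations applied only during training, so they are left out of this sketch.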


## Training steps

Notes:

  > Before you run any scripts, please ensure you have installed the requirements described in `Tutorial part 1: Basics`.
  >
  > We train models on a desktop with 32 GB RAM. If your RAM is less than 32 GB, some code may fail.
  >
  > We train models on a GTX TITAN X with 12 GB of GPU memory. If your GPU memory is less than 12 GB, you may need to reduce `batch_size` in `cfgs` in `classification/train.py`.

#### Decide categories
Decide which categories are the 1000 most frequently observed character categories, and save them to `products/cates.json`.

In [None]:
!cd ../classification && python3 decide_cates.py
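
Conceptually, selecting the most frequent categories is a counting problem. The sketch below illustrates the idea with `collections.Counter`; it is not the content of `decide_cates.py`, and the function name and the input (an iterable of annotated text strings) are assumptions made for this example.

```python
from collections import Counter

def top_categories(texts, num_categories):
    """Return the `num_categories` most frequent characters in `texts`."""
    counts = Counter()
    for text in texts:
        # Counting a string counts each of its characters.
        counts.update(text)
    return [char for char, _ in counts.most_common(num_categories)]

# Toy annotations: "中" appears 3 times, "国" twice, "人" once.
cates = top_categories(["中国中", "国中人"], 2)
print(cates)
```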

#### Create pickles
Crop the image region around character instances, and save pickles to `products/*.pkl`, in order to avoid frequently reading `.jpg` files.

In [None]:
!cd ../classification && python3 create_pkl.py

#### Run train scripts

Run the train script with the command line argument `alexnet_v2` to train `AlexNet v2`. Other choices are `overfeat`, `inception_v4`, `resnet_v2_50` and `resnet_v2_152`.

Train logs and checkpoints are saved to `classification/products/train_logs_alexnet_v2/`. You can run `tensorboard` to see detailed logs.

Time cost estimation:

  - **alexnet_v2**: 0.2 sec / step, 6 hours in total.
  - **others**: 1.0 sec / step, 28 hours in total.

Notes:

  > If training steps become slower and slower (e.g. >2 sec / step), you can stop training with Ctrl+C and rerun `train.py`. It will automatically resume from the latest checkpoint. We save a checkpoint every 1200 seconds; this interval can be changed via `save_interval_secs` in `cfg_common` in `classification/train.py`.
  >
  > If you get a `CUDNN_STATUS_BAD_PARAM` error, you may turn down `per_process_gpu_memory_fraction` in `classification/train.py`.
  >
  > When training `resnet_v2_152`, TensorFlow may run out of memory (OOM) if a training step and a summary step run at the same time. To disable the summary step for `resnet_v2_152`, you may set `save_summaries_secs` in `cfgs` in `classification/train.py` to a very large value (e.g. 999999).
  >
  > You can modify `cfgs` in `classification/train.py` to add or delete models. All available models are listed in `classification/slim/nets/nets_factory.py`, but we have not tested whether all of them are suitable for our dataset.
  >
  > You can update TensorFlow-Slim from [source](https://github.com/tensorflow/models/tree/master/research/slim), but please keep our customized `slim/train_image_classifier.py` and `slim/eval_image_classifier.py`.

In [None]:
!cd ../classification && python3 train.py alexnet_v2

In [None]:
# during training, you can browse train logs using tensorboard
!tensorboard --logdir=../classification/products/train_logs_alexnet_v2/

#### Download trained models

Since training takes a lot of energy and we hate global warming, we provide the `checkpoint` trained as described above.

In [None]:
# TODO: download trained models

## Predicting steps

The predicting steps are just like the [training steps](#Training-steps); you only need to substitute `train.py` with `eval.py`.

This step feeds each character instance in the classification testing set to the model, and saves the output end point (the so-called _logits_) of the model to `classification/products/eval_alexnet_v2.pkl`.

Then, for each of the character instances, we sort the `logits`, and output the Top-5 results to `classification/products/predictions_alexnet_v2.jsonl`.

In [None]:
!cd ../classification && python3 eval.py alexnet_v2
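
Turning one vector of logits into Top-5 candidates amounts to sorting category indices by descending score. A minimal sketch, assuming `logits` is a 1-D array aligned with a list of category characters (the function name and both inputs are hypothetical, not taken from `eval.py`):

```python
import numpy as np

def top_k(logits, categories, k=5):
    """Return the k categories with the highest logits, best first."""
    # Negate so that argsort yields descending order; "stable" keeps ties in index order.
    order = np.argsort(-np.asarray(logits), kind="stable")[:k]
    return [categories[i] for i in order]

candidates = top_k([0.1, 2.0, -1.0, 0.5, 3.0, 0.0], list("abcdef"))
print(candidates)
```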

## Results format and evaluation API

Classification results MUST be UTF-8 encoded [JSON Lines](http://jsonlines.org/), and each line MUST match the corresponding line in `Classification testing set annotations`, which is described in `Tutorial part 1: Basics`.

```
result (corresponding to one line in .jsonl):
{
    predictions: [prediction_0, prediction_1, prediction_2, ...],
}

prediction:
[candidate_0, candidate_1, candidate_2, ...]           # there MUST be at least 5 candidates

candidate: str
```
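
As an illustration of this format, the following sketch serializes results with the standard `json` module, one image per line. The helper names are hypothetical; only the `predictions` key and the five-candidate minimum come from the format above.

```python
import json

def result_line(predictions):
    """Serialize the predictions for one image as a single JSON line."""
    for candidates in predictions:
        if len(candidates) < 5:
            raise ValueError("each prediction needs at least 5 candidates")
    # ensure_ascii=False keeps Chinese characters readable in the output file.
    return json.dumps({"predictions": predictions}, ensure_ascii=False)

def write_results(path, results):
    """Write one line per image, in the same order as the annotation file."""
    with open(path, "w", encoding="utf-8") as f:
        for predictions in results:
            f.write(result_line(predictions) + "\n")

line = result_line([["的", "中", "国", "人", "大"], ["一", "二", "三", "四", "五"]])
print(line)
```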

Our evaluation API in `pythonapi/eval_tools.py` works as follows.

  1. Check that the prediction (PR) has the same number of lines as the ground truth (GT). Otherwise, return an error.
  2. Check that each line of PR is valid JSON and conforms to the results format. Otherwise, return an error.
  3. Check that PR provides the same number of predictions as the number of Chinese character instances in GT. Otherwise, return an error.
  4. Count the number of instances and the number of correctly recalled instances for each attribute combination and each size.
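
The counting in the last step can be sketched as below. This is a simplified illustration, not the implementation in `pythonapi/eval_tools.py`: it ignores attributes and sizes, and the function name and inputs (a list of ground-truth characters and a parallel list of candidate lists) are assumptions.

```python
def count_recalls(ground_truth, predictions, ks=(1, 5)):
    """Count how often the true character is among the first k candidates."""
    if len(ground_truth) != len(predictions):
        raise ValueError("number of predictions differs from number of instances")
    recalls = {k: 0 for k in ks}
    for truth, candidates in zip(ground_truth, predictions):
        for k in ks:
            if truth in candidates[:k]:
                recalls[k] += 1
    return {"n": len(ground_truth), "recalls": recalls}

stats = count_recalls(
    ["a", "b", "c"],
    [list("axyzw"), list("xbyzw"), list("xyzwv")],
)
print(stats)
```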

If there is no error, the output of the evaluation API has the data structure described below.

```
output:
{
  error: 0,
  performance: {
    all: size_performance,
    large: size_performance,
    medium: size_performance,
    small: size_performance,
  }
}

size_performance:
[attr_performance_0, attr_performance_1, ..., attr_performance_63]

attr_performance:
{
  n: int,
  recalls: {
    1: int,
    5: int,
  }
}  
```
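
Because every instance falls into exactly one attribute combination, an overall top-k accuracy for one size can be obtained by summing over the 64 entries. The sketch below assumes the structure shown above with integer keys in `recalls`; if the output has been round-tripped through JSON, the keys may be strings instead, and `accuracy` is a hypothetical helper name.

```python
def accuracy(size_performance, k):
    """Aggregate top-k accuracy over all attribute combinations of one size."""
    total = sum(entry["n"] for entry in size_performance)
    hits = sum(entry["recalls"][k] for entry in size_performance)
    # Avoid dividing by zero when no instance has this size.
    return float(hits) / total if total else 0.0

# Toy data with 3 entries instead of 64.
toy = [
    {"n": 10, "recalls": {1: 6, 5: 9}},
    {"n": 0, "recalls": {1: 0, 5: 0}},
    {"n": 30, "recalls": {1: 24, 5: 27}},
]
top1, top5 = accuracy(toy, 1), accuracy(toy, 5)
print(top1, top5)
```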

The index `k` in `attr_performance_k` is a bit mask with one bit per attribute. For example, `k = 5` (`000101` in binary) means:

| Attribute | Yes or no |
| --------- | --------- |
| occluded  | 1 |
| bgcomplex | 0 |
| distorted | 1 |
| raised    | 0 |
| wordart   | 0 |
| handwritten | 0 |

corresponding to character instances with attributes combination `occluded & ~bgcomplex & distorted & ~raised & ~wordart & ~handwritten`.
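
The mapping between `k` and attribute flags can be sketched as follows. The bit order (lowest bit first, following the table rows) is inferred from the `k = 5` example and should be treated as an assumption; the helper names are hypothetical.

```python
# Assumed order: bit 0 is "occluded", bit 5 is "handwritten".
ATTRIBUTES = ["occluded", "bgcomplex", "distorted", "raised", "wordart", "handwritten"]

def decode_attributes(k):
    """Map an index k in [0, 63] to a dict of attribute flags."""
    return {name: bool((k >> bit) & 1) for bit, name in enumerate(ATTRIBUTES)}

def encode_attributes(names):
    """Map a collection of attribute names that are set back to the index k."""
    return sum(1 << ATTRIBUTES.index(name) for name in names)

flags = decode_attributes(5)
print(flags)
```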

Notes:

  > Since our evaluation API computes both top-1 accuracy and top-5 accuracy, you MUST provide at least 5 candidates for each of the instances.

## Evaluate results locally

#### Gather statistics

Run this script to gather statistics. This step will generate:

  - `judge/products/stat_frequency.json`: the frequency in the whole dataset, both training set and testing set
  - `judge/products/plots/stat_attributes.pdf`: (described in our paper)
  - `judge/products/plots/stat_instance_size.pdf`: (described in our paper)
  - `judge/products/plots/stat_most_freq.pdf`: (described in our paper)
  - `judge/products/plots/stat_num_char.pdf`: (described in our paper)
  - `judge/products/plots/stat_num_uniq_char.pdf`: (described in our paper)

Notes:

> This step requires a network connection to download `SimHei.ttf`, a Chinese font file. This font is used to render Chinese characters in our statistics and in some other results.

In [None]:
!cd ../judge && python3 statistics_in_paper.py

#### Evaluate classification performance

In this step, we will output:

  - `<stdout>`: classification performance with each of the attributes
  - `<stdout>`: classification performance for top-10 most frequent character categories
  - `judge/products/plots/cls_precision_by_attr_size_(model_name).pdf`: (described in our paper)
  - `judge/products/plots/cls_precision_by_model_size.pdf`: performance for each of models and each of sizes
  - `judge/products/explore_cls.html`: performance for each of models and each of sizes

Notes:

  > Your performance may or may not be higher than that reported in the paper. If it is higher, the reason may be that you are using the validation set as the testing set, and the training set and validation set are more highly correlated.
  >
  > This step requires a network connection to download `Chart.min.js`, which is used to generate `explore_cls.html`.

In [None]:
!cd ../judge && python3 classification_perf.py alexnet_v2

#### Show results of character instances

This step will generate:

  - `judge/products/test_cls_cropped.pkl`: a cache file to avoid frequently reading .jpg files
  - `judge/products/predictions_compare.html`: (described in our paper)

Then, you can browse `judge/products/predictions_compare.html`.

In [None]:
!cd ../judge && python3 predictions2html.py alexnet_v2