<a href="https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/NDL_DocL%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88(%E8%B3%87%E6%96%99%E7%94%BB%E5%83%8F%E3%83%AC%E3%82%A4%E3%82%A2%E3%82%A6%E3%83%88%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)%E3%81%AE%E5%A4%89%E6%8F%9B%E3%81%A8%E5%8F%AF%E8%A6%96%E5%8C%96.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Converting and Visualizing the NDL-DocL Dataset (Document Image Layout Dataset)

This notebook converts the Pascal VOC XML files into a single COCO-format JSON file and visualizes the result.
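As a rough illustration of what such a conversion involves: Pascal VOC stores each box as corner coordinates (`xmin`, `ymin`, `xmax`, `ymax`), whereas COCO stores `[x, y, width, height]`. The sketch below parses a small VOC-style XML string and converts its boxes. The sample XML, the function name `voc_boxes_to_coco`, and the width convention (plain `xmax - xmin`, no +1 offset) are assumptions for illustration and may differ from what the converter used later actually does.

```python
import xml.etree.ElementTree as ET

# Hypothetical VOC-style annotation; real NDL-DocL files may carry more fields.
SAMPLE_XML = """
<annotation>
  <filename>sample.jpg</filename>
  <size><width>1600</width><height>1200</height></size>
  <object>
    <name>2_handwritten</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>400</xmax><ymax>600</ymax></bndbox>
  </object>
</annotation>
"""


def voc_boxes_to_coco(xml_text):
    """Return a list of (label, [x, y, width, height]) tuples from VOC XML text."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(bndbox.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # COCO keeps the top-left corner plus the box size (assumed: no +1 offset).
        boxes.append((obj.find("name").text, [xmin, ymin, xmax - xmin, ymax - ymin]))
    return boxes


print(voc_boxes_to_coco(SAMPLE_XML))
```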

## Downloading the Data

In [1]:
!wget https://lab.ndl.go.jp/dataset/dataset_kotenseki.zip

--2022-07-24 23:03:40--  https://lab.ndl.go.jp/dataset/dataset_kotenseki.zip
Resolving lab.ndl.go.jp (lab.ndl.go.jp)... 13.226.225.122, 13.226.225.2, 13.226.225.85, ...
Connecting to lab.ndl.go.jp (lab.ndl.go.jp)|13.226.225.122|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 848155076 (809M) [application/zip]
Saving to: ‘dataset_kotenseki.zip’


2022-07-24 23:03:50 (86.6 MB/s) - ‘dataset_kotenseki.zip’ saved [848155076/848155076]



Extract the archive.

In [2]:
!unzip -q dataset_kotenseki.zip

## Conversion

The conversion uses the following repository.

https://github.com/Kazuhito00/convert_voc_to_coco

In [3]:
!git clone https://github.com/Kazuhito00/convert_voc_to_coco

Cloning into 'convert_voc_to_coco'...
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 21 (delta 8), reused 10 (delta 4), pack-reused 0
Unpacking objects: 100% (21/21), done.


In [4]:
!mv convert_voc_to_coco/convert_voc_to_coco.py . 

Run the conversion.

In [5]:
!python convert_voc_to_coco.py "dataset_kotenseki/*" dataset_kotenseki.json

Number of xml files: 1219
Convert XML to JSON: 100% 1219/1219 [00:00<00:00, 3560.88it/s]
{'4_illustration': 1118, '1_overall': 1218, '2_handwritten': 13850, '3_typography': 9261, '5_stamp': 368}
Success: dataset_kotenseki.json
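The per-category counts printed by the converter can be cross-checked against the generated JSON. The following is a minimal sketch, assuming the file follows the usual COCO layout with top-level `categories` and `annotations` lists; the function name `count_annotations_by_category` and the tiny sample dictionary are hypothetical.

```python
import json
from collections import Counter


def count_annotations_by_category(coco):
    """Count annotations per category name in a COCO-style dict."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])


# Tiny synthetic example (not real data from the dataset).
sample = {
    "categories": [{"id": 1, "name": "1_overall"}, {"id": 2, "name": "2_handwritten"}],
    "annotations": [
        {"id": 1, "category_id": 1},
        {"id": 2, "category_id": 2},
        {"id": 3, "category_id": 2},
    ],
}
print(count_annotations_by_category(sample))

# With the converted file, something like this should work:
# with open("dataset_kotenseki.json") as f:
#     print(count_annotations_by_category(json.load(f)))
```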


## Creating the Image Directory (Copying the Images)

In [6]:
import glob
import os
from tqdm import tqdm

files = glob.glob("dataset_kotenseki/*/*.jpg")

output_dir = "img"
os.makedirs(output_dir, exist_ok=True)

for file in tqdm(files):
  # print(file)
  output_path = "{}/{}".format(output_dir, os.path.basename(file))
  !cp $file $output_path 
  # break

100%|██████████| 1219/1219 [02:18<00:00,  8.77it/s]
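The loop above starts a separate `cp` process for every file, which probably accounts for a good part of the running time shown. A possible alternative, sketched here with the standard library only, copies the files in-process with `shutil.copy`. The helper name `copy_images` is hypothetical, and no timing has been measured for it.

```python
import glob
import os
import shutil


def copy_images(pattern, output_dir):
    """Copy every file matching `pattern` into `output_dir`, flattening subdirectories."""
    os.makedirs(output_dir, exist_ok=True)
    copied = []
    for src in sorted(glob.glob(pattern)):
        dst = os.path.join(output_dir, os.path.basename(src))
        shutil.copy(src, dst)  # copies file contents and permission bits
        copied.append(dst)
    return copied


# Usage matching the cell above:
# copy_images("dataset_kotenseki/*/*.jpg", "img")
```

As in the original loop, files with the same base name in different subdirectories would overwrite each other.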


## Visualization

The visualization code is based on the following notebook.

https://www.kaggle.com/code/ericdepotter/visualize-coco-annotations/notebook

In [7]:
# Source: https://www.immersivelimit.com/tutorials/create-coco-annotations-from-scratch/#create-custom-coco-dataset
import base64
import IPython
import json
import numpy as np
import os
import random
import requests
from io import BytesIO
from math import trunc
from PIL import Image as PILImage
from PIL import ImageDraw as PILImageDraw

# Load the dataset json
class CocoDataset():
    def __init__(self, annotation_path, image_dir):
        self.annotation_path = annotation_path
        self.image_dir = image_dir
        self.colors = ['blue', 'purple', 'red', 'green', 'orange', 'salmon', 'pink', 'gold',
                        'orchid', 'slateblue', 'limegreen', 'seagreen', 'darkgreen', 'olive',
                        'teal', 'aquamarine', 'steelblue', 'powderblue', 'dodgerblue', 'navy',
                        'magenta', 'sienna', 'maroon']
        
        with open(self.annotation_path) as json_file:
            self.coco = json.load(json_file)
        
        #self.process_info()
        #self.process_licenses()
        self.process_categories()
        self.process_images()
        self.process_segmentations()

    def display_info(self):
        print('Dataset Info:')
        print('=============')
        for key, item in self.info.items():
            print('  {}: {}'.format(key, item))
        
        requirements = [['description', str],
                        ['url', str],
                        ['version', str],
                        ['year', int],
                        ['contributor', str],
                        ['date_created', str]]
        for req, req_type in requirements:
            if req not in self.info:
                print('ERROR: {} is missing'.format(req))
            elif type(self.info[req]) != req_type:
                print('ERROR: {} should be type {}'.format(req, str(req_type)))
        print('')
        
    def display_licenses(self):
        print('Licenses:')
        print('=========')
        
        requirements = [['id', int],
                        ['url', str],
                        ['name', str]]
        for license in self.licenses:
            for key, item in license.items():
                print('  {}: {}'.format(key, item))
            for req, req_type in requirements:
                if req not in license:
                    print('ERROR: {} is missing'.format(req))
                elif type(license[req]) != req_type:
                    print('ERROR: {} should be type {}'.format(req, str(req_type)))
            print('')
        print('')
        
    def display_categories(self):
        print('Categories:')
        print('=========')
        for sc_key, sc_val in self.super_categories.items():
            print('  super_category: {}'.format(sc_key))
            for cat_id in sc_val:
                print('    id {}: {}'.format(cat_id, self.categories[cat_id]['name']))
            print('')
    
    def display_image(self, image_id, show_polys=True, show_bbox=True, show_labels=True, show_crowds=True, use_url=False):
        print('Image:')
        print('======')
        if image_id == 'random':
            image_id = random.choice(list(self.images.keys()))
        
        # Print the image info
        image = self.images[image_id]
        for key, val in image.items():
            print('  {}: {}'.format(key, val))
            
        # Open the image
        if use_url:
            image_path = image['coco_url']
            response = requests.get(image_path)
            image = PILImage.open(BytesIO(response.content))
            
        else:
            image_path = os.path.join(self.image_dir, image['file_name'])
            image = PILImage.open(image_path)
            
        buffered = BytesIO()
        image.save(buffered, format="PNG")
        img_str = "data:image/png;base64, " + base64.b64encode(buffered.getvalue()).decode()
        
        # Calculate the size and adjusted display size
        max_width = 900
        image_width, image_height = image.size
        adjusted_width = min(image_width, max_width)
        adjusted_ratio = adjusted_width / image_width
        adjusted_height = adjusted_ratio * image_height
        
        # Create list of polygons to be drawn
        polygons = {}
        bbox_polygons = {}
        rle_regions = {}
        poly_colors = {}
        labels = {}
        print('  segmentations ({}):'.format(len(self.segmentations[image_id])))
        for i, segm in enumerate(self.segmentations[image_id]):
            polygons_list = []
            if segm['iscrowd'] != 0:
                # Gotta decode the RLE
                px = 0
                x, y = 0, 0
                rle_list = []
                for j, counts in enumerate(segm['segmentation']['counts']):
                    if j % 2 == 0:
                        # Empty pixels
                        px += counts
                    else:
                        # Need to draw on these pixels, since we are drawing in vector form,
                        # we need to draw horizontal lines on the image
                        x_start = trunc(trunc(px / image_height) * adjusted_ratio)
                        y_start = trunc(px % image_height * adjusted_ratio)
                        px += counts
                        x_end = trunc(trunc(px / image_height) * adjusted_ratio)
                        y_end = trunc(px % image_height * adjusted_ratio)
                        if x_end == x_start:
                            # This is only on one line
                            rle_list.append({'x': x_start, 'y': y_start, 'width': 1 , 'height': (y_end - y_start)})
                        if x_end > x_start:
                            # This spans more than one line
                            # Insert top line first
                            rle_list.append({'x': x_start, 'y': y_start, 'width': 1, 'height': (image_height - y_start)})
                            
                            # Insert middle lines if needed
                            lines_spanned = x_end - x_start + 1 # total number of lines spanned
                            full_lines_to_insert = lines_spanned - 2
                            if full_lines_to_insert > 0:
                                full_lines_to_insert = trunc(full_lines_to_insert * adjusted_ratio)
                                rle_list.append({'x': (x_start + 1), 'y': 0, 'width': full_lines_to_insert, 'height': image_height})
                                
                            # Insert bottom line
                            rle_list.append({'x': x_end, 'y': 0, 'width': 1, 'height': y_end})
                if len(rle_list) > 0:
                    rle_regions[segm['id']] = rle_list  
            else:
                # Add the polygon segmentation
                for segmentation_points in segm['segmentation']:
                    segmentation_points = np.multiply(segmentation_points, adjusted_ratio).astype(int)
                    polygons_list.append(str(segmentation_points).lstrip('[').rstrip(']'))

            polygons[segm['id']] = polygons_list

            if i < len(self.colors):
                poly_colors[segm['id']] = self.colors[i]
            else:
                poly_colors[segm['id']] = 'white'
            
            bbox = segm['bbox']
            bbox_points = [bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1],
                           bbox[0] + bbox[2], bbox[1] + bbox[3], bbox[0], bbox[1] + bbox[3],
                           bbox[0], bbox[1]]
            bbox_points = np.multiply(bbox_points, adjusted_ratio).astype(int)
            bbox_polygons[segm['id']] = str(bbox_points).lstrip('[').rstrip(']')
            
            labels[segm['id']] = (self.categories[segm['category_id']]['name'], (bbox_points[0], bbox_points[1] - 4))
            
            # Print details
            print('    {}:{}:{}'.format(segm['id'], poly_colors[segm['id']], self.categories[segm['category_id']]))

        # Draw segmentation polygons on image
        html = '<div class="container" style="position:relative;">'
        html += '<img src="{}" style="position:relative;top:0px;left:0px;width:{}px;">'.format(img_str, adjusted_width)
        html += '<div class="svgclass"><svg width="{}" height="{}">'.format(adjusted_width, adjusted_height)
        
        if show_polys:
            for seg_id, points_list in polygons.items():
                fill_color = poly_colors[seg_id]
                stroke_color = poly_colors[seg_id]
                for points in points_list:
                    html += '<polygon points="{}" style="fill:{}; stroke:{}; stroke-width:1; fill-opacity:0.5" />'.format(points, fill_color, stroke_color)
        
        if show_crowds:
            for seg_id, rect_list in rle_regions.items():
                fill_color = poly_colors[seg_id]
                stroke_color = poly_colors[seg_id]
                for rect_def in rect_list:
                    x, y = rect_def['x'], rect_def['y']
                    w, h = rect_def['width'], rect_def['height']
                    html += '<rect x="{}" y="{}" width="{}" height="{}" style="fill:{}; stroke:{}; stroke-width:1; fill-opacity:0.5; stroke-opacity:0.5" />'.format(x, y, w, h, fill_color, stroke_color)
            
        if show_bbox:
            for seg_id, points in bbox_polygons.items():
                fill_color = poly_colors[seg_id]
                stroke_color = poly_colors[seg_id]
                html += '<polygon points="{}" style="fill:{}; stroke:{}; stroke-width:1; fill-opacity:0" />'.format(points, fill_color, stroke_color)
                
        if show_labels:
            for seg_id, label in labels.items():
                color = poly_colors[seg_id]
                html += '<text x="{}" y="{}" style="fill:{}; font-size: 12pt;">{}</text>'.format(label[1][0], label[1][1], color, label[0])
                
        html += '</svg></div>'
        html += '</div>'
        html += '<style>'
        html += '.svgclass { position:absolute; top:0px; left:0px;}'
        html += '</style>'
        return html
       
    def process_info(self):
        self.info = self.coco['info']
    
    def process_licenses(self):
        self.licenses = self.coco['licenses']
    
    def process_categories(self):
        self.categories = {}
        self.super_categories = {}
        for category in self.coco['categories']:
            cat_id = category['id']
            super_category = category['supercategory']
            
            # Add category to the categories dict
            if cat_id not in self.categories:
                self.categories[cat_id] = category
            else:
                print("ERROR: Skipping duplicate category id: {}".format(category))

            # Add category to super_categories dict
            if super_category not in self.super_categories:
                self.super_categories[super_category] = {cat_id} # Create a new set with the category id
            else:
                self.super_categories[super_category] |= {cat_id} # Add category id to the set
                
    def process_images(self):
        self.images = {}
        for image in self.coco['images']:
            image_id = image['id']
            if image_id in self.images:
                print("ERROR: Skipping duplicate image id: {}".format(image))
            else:
                self.images[image_id] = image
                
    def process_segmentations(self):
        self.segmentations = {}
        for segmentation in self.coco['annotations']:
            image_id = segmentation['image_id']
            if image_id not in self.segmentations:
                self.segmentations[image_id] = []
            self.segmentations[image_id].append(segmentation)
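The `iscrowd` branch of `display_image` walks through uncompressed COCO run-length encoding (RLE), in which `counts` alternates between runs of background and foreground pixels, laid out in column-major order. A minimal NumPy sketch of the same decoding, producing a binary mask rather than SVG rectangles, might look like the following. The function name `decode_uncompressed_rle` is hypothetical, and the input is assumed to be a dict with `size` given as `[height, width]` and `counts` given as a list of integers.

```python
import numpy as np


def decode_uncompressed_rle(rle):
    """Decode an uncompressed COCO RLE dict into a (height, width) uint8 mask."""
    height, width = rle["size"]
    flat = np.zeros(height * width, dtype=np.uint8)
    pos = 0
    for i, run in enumerate(rle["counts"]):
        if i % 2 == 1:  # odd-indexed runs are foreground pixels
            flat[pos:pos + run] = 1
        pos += run
    # COCO RLE is column-major, so reshape in Fortran order.
    return flat.reshape((height, width), order="F")


mask = decode_uncompressed_rle({"size": [2, 3], "counts": [1, 2, 3]})
print(mask)
```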

In [8]:
annotation_path = r'dataset_kotenseki.json'
image_dir = r'img'

coco_dataset = CocoDataset(annotation_path, image_dir)
# coco_dataset.display_info()
# coco_dataset.display_licenses()
coco_dataset.display_categories()

Categories:
  super_category: none
    id 1: 1_overall
    id 2: 2_handwritten
    id 3: 3_typography
    id 4: 4_illustration
    id 5: 5_stamp



In [9]:
html = coco_dataset.display_image('random', use_url=False)
IPython.display.HTML(html)

Image:
  file_name: 2538632_0032.jpg
  height: 1200
  width: 1600
  id: 2538632_0032
  segmentations (28):
    398:blue:{'supercategory': 'none', 'id': 2, 'name': '2_handwritten'}
    399:purple:{'supercategory': 'none', 'id': 2, 'name': '2_handwritten'}
    400:red:{'supercategory': 'none', 'id': 2, 'name': '2_handwritten'}
    401:green:{'supercategory': 'none', 'id': 2, 'name': '2_handwritten'}
    402:orange:{'supercategory': 'none', 'id': 2, 'name': '2_handwritten'}
    403:salmon:{'supercategory': 'none', 'id': 2, 'name': '2_handwritten'}
    404:pink:{'supercategory': 'none', 'id': 2, 'name': '2_handwritten'}
    405:gold:{'supercategory': 'none', 'id': 2, 'name': '2_handwritten'}
    406:orchid:{'supercategory': 'none', 'id': 2, 'name': '2_handwritten'}
    407:slateblue:{'supercategory': 'none', 'id': 2, 'name': '2_handwritten'}
    408:limegreen:{'supercategory': 'none', 'id': 2, 'name': '2_handwritten'}
    409:seagreen:{'supercategory': 'none', 'id': 2, 'name': '2_handwritt
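The HTML/SVG overlay above only renders inside a notebook. Where an image file is preferred, the bounding boxes could instead be drawn directly with Pillow. This is a sketch under assumptions: the helper name `draw_bboxes` is hypothetical, and each annotation is assumed to carry a COCO-style `bbox` of `[x, y, width, height]`.

```python
from PIL import Image, ImageDraw


def draw_bboxes(image, annotations, color="red"):
    """Return a copy of `image` with each annotation's bbox outlined."""
    canvas = image.copy()  # leave the caller's image untouched
    draw = ImageDraw.Draw(canvas)
    for ann in annotations:
        x, y, w, h = ann["bbox"]
        draw.rectangle([x, y, x + w, y + h], outline=color, width=2)
    return canvas


# Usage with the CocoDataset instance from this notebook might look like:
# image_id = "2538632_0032"
# img = Image.open("img/" + coco_dataset.images[image_id]["file_name"])
# draw_bboxes(img, coco_dataset.segmentations[image_id]).save("preview.png")
```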

## Conversion to YOLO Format (added 2022-07-25)

The code below is adapted from the following repository.

https://github.com/ultralytics/JSON2YOLO

In [16]:
import json
import shutil
from pathlib import Path

import numpy as np
from tqdm import tqdm

def make_dirs(dir='new_dir/'):
    # Create folders
    dir = Path(dir)
    if dir.exists():
        shutil.rmtree(dir)  # delete dir
    for p in dir, dir / 'labels', dir / 'images':
        p.mkdir(parents=True, exist_ok=True)  # make dir
    return dir

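def convert_coco_json(json_dir='../coco/annotations/', output_dir='new_dir/', use_segments=False, cls91to80=False):
    # NOTE: make_dirs() deletes output_dir if it already exists; an empty string
    # would resolve to the current directory, so the default is a dedicated folder.
    save_dir = make_dirs(output_dir)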
    # coco80 = coco91_to_coco80_class()

    # Import json
    for json_file in sorted(Path(json_dir).resolve().glob('*.json')):
        fn = Path(save_dir) / 'labels' #  / json_file.stem.replace('instances_', '')  # folder name
        # fn.mkdir()
        with open(json_file) as f:
            data = json.load(f)

        # Create image dict
        images = {x['id']: x for x in data['images']}

        # Write labels file
        for x in tqdm(data['annotations'], desc=f'Annotations {json_file}'):
            if x['iscrowd']:
                continue

            img = images[x['image_id']]
            h, w, f = img['height'], img['width'], img['file_name']

            # The COCO box format is [top left x, top left y, width, height]
            box = np.array(x['bbox'], dtype=np.float64)
            box[:2] += box[2:] / 2  # xy top-left corner to center
            box[[0, 2]] /= w  # normalize x
            box[[1, 3]] /= h  # normalize y

            # Segments
            if use_segments:
                segments = [j for i in x['segmentation'] for j in i]  # all segments concatenated
                s = (np.array(segments).reshape(-1, 2) / np.array([w, h])).reshape(-1).tolist()

            # Write
            if box[2] > 0 and box[3] > 0:  # if w > 0 and h > 0
                # cls = coco80[x['category_id'] - 1] if cls91to80 else x['category_id'] - 1  # class
                cls = x['category_id'] - 1
                line = cls, *(s if use_segments else box)  # cls, box or segments
                with open((fn / f).with_suffix('.txt'), 'a') as file:
                    file.write(('%g ' * len(line)).rstrip() % line + '\n')

In [20]:
param_input_dir = "/content"
param_output_dir = "/content/yolo"
convert_coco_json(param_input_dir, param_output_dir)

Annotations /content/dataset_kotenseki.json: 100%|██████████| 25820/25820 [00:02<00:00, 9023.80it/s] 
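The box arithmetic inside `convert_coco_json` can be read in isolation: a COCO box `[x, y, width, height]` with a top-left origin becomes a YOLO box `(x_center, y_center, width, height)` with all values divided by the image size. The helper below is a standalone sketch of that step; the name `coco_bbox_to_yolo` is hypothetical and it is not part of the referenced repository.

```python
def coco_bbox_to_yolo(bbox, image_width, image_height):
    """Convert a COCO [x, y, w, h] box to normalized YOLO (xc, yc, w, h)."""
    x, y, w, h = bbox
    x_center = (x + w / 2) / image_width   # shift to the box center, then normalize
    y_center = (y + h / 2) / image_height
    return x_center, y_center, w / image_width, h / image_height


# Example with a 1600x1200 image, the size shown in the visualization output above.
print(coco_bbox_to_yolo([100, 200, 300, 400], 1600, 1200))
```

Each line of a label file then holds the zero-based class index (`category_id - 1` in the function above) followed by these four values.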


### Copying the Images

In [21]:
import glob
import os
from tqdm import tqdm

files = glob.glob("dataset_kotenseki/*/*.jpg")

output_dir = param_output_dir + "/images"
os.makedirs(output_dir, exist_ok=True)

for file in tqdm(files):
  # print(file)
  output_path = "{}/{}".format(output_dir, os.path.basename(file))
  !cp $file $output_path 
  # break

100%|██████████| 1219/1219 [02:22<00:00,  8.55it/s]
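Because `convert_coco_json` only writes a label file when an image has at least one non-crowd annotation, some copied images may end up without a matching `.txt` file. A quick consistency check, assuming the `images/` and `labels/` layout created above and a hypothetical helper name `images_without_labels`, could be sketched as:

```python
from pathlib import Path


def images_without_labels(images_dir, labels_dir):
    """Return sorted names of .jpg files that have no same-stem .txt label file."""
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(
        p.name for p in Path(images_dir).glob("*.jpg") if p.stem not in label_stems
    )


# Usage with the directories from this notebook:
# print(images_without_labels("/content/yolo/images", "/content/yolo/labels"))
```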
