# CrowdHuman

In this notebook, we prepare the CrowdHuman data. Preparation consists of converting the annotations to the label format supported by YOLOv7 and creating, for each split, a `summary.txt` file that lists the images that have at least one label.
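
As a rough illustration of the target format: a YOLO label file holds one line per box, written as `class x_center y_center width height`, with all four coordinates normalized by the image size. The helper below is a minimal sketch of that conversion for a pixel box given as `(x, y, width, height)`. The name `to_yolo_line` and the sample numbers are made up for this example and are not part of the notebook.

```python
def to_yolo_line(box, img_w, img_h, class_id=0):
    """Convert a pixel box (x, y, width, height) into one YOLO label line."""
    x, y, w, h = box
    # Clamp both corners to the image bounds
    xmin = min(max(x, 0), img_w)
    ymin = min(max(y, 0), img_h)
    xmax = min(max(x + w, 0), img_w)
    ymax = min(max(y + h, 0), img_h)
    # Normalize the center and the size by the image dimensions
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    x_size = (xmax - xmin) / img_w
    y_size = (ymax - ymin) / img_h
    return "{} {:.6f} {:.6f} {:.6f} {:.6f}".format(
        class_id, x_center, y_center, x_size, y_size)


# Hypothetical 200x100 box at (100, 50) in an 800x400 image
line = to_yolo_line((100, 50, 200, 100), img_w=800, img_h=400)
print(line)  # 0 0.250000 0.250000 0.250000 0.250000
```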

## Links

- https://www.crowdhuman.org/

- Preprocessing already available here: https://github.com/alaksana96/darknet-crowdhuman

## First step: Download

In `homemade/`:

```bash
mkdir crowdhuman
cd crowdhuman
```

Download all files from https://www.crowdhuman.org/:
<p>CrowdHuman_train01.zip
    <a href="https://pan.baidu.com/s/1e-61WDiCqQibBVTIWqrssQ">[Baidu Drive]</a> 
    <a href="https://drive.google.com/file/d/134QOvaatwKdy0iIeNqA_p-xkAhkV4F8Y/view">[Google Drive]</a>
</p>
<p>CrowdHuman_train02.zip
    <a href="https://pan.baidu.com/s/1OnndpWXiZxsCB3VtWEYE3w">[Baidu Drive]</a>
    <a href="https://drive.google.com/file/d/17evzPh7gc1JBNvnW1ENXLy5Kr4Q_Nnla/view">[Google Drive]</a>
</p>
<p>CrowdHuman_train03.zip 
    <a href="https://pan.baidu.com/s/1kkfOlHV_xXKNbJUlLSkyXA">[Baidu Drive]</a>
    <a href="https://drive.google.com/file/d/1tdp0UCgxrqy1B6p8LkR-Iy0aIJ8l4fJW/view">[Google Drive]</a>
</p>
<p>CrowdHuman_val.zip
    <a href="https://pan.baidu.com/s/1kVBchjxOWu9sM5h8OAxfQw">[Baidu Drive]</a>
    <a href="https://drive.google.com/file/d/18jFI789CoHTppQ7vmRSFEdnGaSQZ4YzO/view">[Google Drive]</a>
</p>
<p>annotation_train.odgt
    <a href="https://pan.baidu.com/s/1wShABN_jYEiTRPM6_9-Cxg">[Baidu Drive]</a>
    <a href="https://drive.google.com/file/d/1UUTea5mYqvlUObsC1Z8CFldHJAtLtMX3/view">[Google Drive]</a>
</p>
<p>annotation_val.odgt
    <a href="https://pan.baidu.com/s/1eObuAFcZyUw6PmUtpGS9vw">[Baidu Drive]</a>
    <a href="https://drive.google.com/file/d/10WIRwu8ju8GRLuCkZ_vT6hnNxs5ptwoL/view">[Google Drive]</a>
</p>
<p>CrowdHuman_test.zip<br>
    <a href="https://pan.baidu.com/s/133YKdndDTl9AWBRiVJJVRA">[Baidu Drive]</a> Fetch Code: cr7k<br>
    <a href="https://drive.google.com/file/d/1tQG3E_RrRI4wIGskorLTmDiWHH2okVvk/view">[Google Drive]</a>
</p>
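
Before unzipping, it can help to confirm that every file listed above actually landed in the `crowdhuman/` directory. The snippet below is a small sketch that assumes it is run from that directory. The file names come from the list above; the helper name `missing_downloads` is hypothetical.

```python
from pathlib import Path

# File names taken from the download list above
EXPECTED_FILES = [
    "CrowdHuman_train01.zip",
    "CrowdHuman_train02.zip",
    "CrowdHuman_train03.zip",
    "CrowdHuman_val.zip",
    "CrowdHuman_test.zip",
    "annotation_train.odgt",
    "annotation_val.odgt",
]


def missing_downloads(directory):
    """Return the expected file names that are not present in `directory`."""
    directory = Path(directory)
    return [name for name in EXPECTED_FILES
            if not (directory / name).is_file()]


# Assumes the current working directory is crowdhuman/
missing = missing_downloads(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All files are present.")
```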

## Second step: Prepare directories

- Train

```bash
unzip CrowdHuman_train01.zip
unzip CrowdHuman_train02.zip
unzip CrowdHuman_train03.zip
mkdir images/
mkdir images/train/
mv Images/* images/train/
```

- Val

```bash
unzip CrowdHuman_val.zip
mkdir images/val/
mv Images/* images/val/
```

- Test

```bash
unzip CrowdHuman_test.zip
mkdir images/test/
mv Images/* images/test/
```

- Clean

```bash
rmdir Images/
rm CrowdHuman_train01.zip
rm CrowdHuman_train02.zip
rm CrowdHuman_train03.zip
rm CrowdHuman_val.zip
rm CrowdHuman_test.zip
```

You should now have the following directory tree:

```bash
$ tree -L 2
.
├── annotation_train.odgt
├── annotation_val.odgt
└── images
    ├── test
    ├── train
    └── val
```
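
The layout can also be checked from Python by counting the images in each split. This is only a sketch: it assumes the images are `.jpg` files (as the label-generation code further down does) and that it is run from the `crowdhuman/` directory. The function name `count_images` is hypothetical. For reference, the annotation files processed later in this notebook contain 15000 training and 4370 validation records.

```python
from pathlib import Path


def count_images(root, splits=("train", "val", "test")):
    """Count the .jpg files under root/images/<split> for each split.

    A split whose directory does not exist is reported as None.
    """
    root = Path(root)
    counts = {}
    for split in splits:
        split_dir = root / "images" / split
        if split_dir.is_dir():
            counts[split] = len(list(split_dir.glob("*.jpg")))
        else:
            counts[split] = None
    return counts


# Assumes the current working directory is crowdhuman/
print(count_images("."))
```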

## Third step: Prepare labels

In [1]:
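from pathlib import Path

# Dataset root created in the download step
path = Path("homemade/crowdhuman")
# Splits that come with an annotation file
repositories = ['train', 'val']
path_images = path / 'images'
# Template for the annotation files; '{}' is replaced by the split name
odgt_format = path / "annotation_{}.odgt"

# Output directory for the YOLO label files
path_labels = path / 'labels'
path_labels.mkdir(exist_ok=True)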
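
Each line of an `.odgt` file is a standalone JSON object, which is why the cells below parse the file line by line with `json.loads`. The sketch below builds a made-up record containing only the fields this notebook reads (`ID`, `gtboxes`, `hbox`, and the `ignore` and `unsure` flags under `extra`) and applies the same filtering rule. The values are invented for illustration and real records carry more fields.

```python
import json

# Hypothetical record, reduced to the fields read by this notebook
sample_line = json.dumps({
    "ID": "example_image",
    "gtboxes": [
        {"hbox": [10, 20, 30, 40], "extra": {}},
        {"hbox": [5, 5, 10, 10], "extra": {"ignore": 1}},
        {"hbox": [50, 60, 20, 20]},
    ],
})

record = json.loads(sample_line)

# Keep only the boxes that are flagged neither 'ignore' nor 'unsure'
kept = [
    box["hbox"]
    for box in record["gtboxes"]
    if box.get("extra", {}).get("ignore", 0) != 1
    and box.get("extra", {}).get("unsure", 0) != 1
]

print(record["ID"], kept)  # example_image [[10, 20, 30, 40], [50, 60, 20, 20]]
```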

In [2]:
import json
from PIL import Image
from tqdm import tqdm

def generate_annotations(line, images, labels):
    """Write the YOLO label file for one .odgt line; return (has_labels, image_path)."""
    dict_line = json.loads(line)
    
    image_id = dict_line['ID']
    image_file = images / (image_id + '.jpg')
    
    # Only the image size is needed, so close the file right away
    with Image.open(image_file) as img:
        width, height = img.size
    
    strings = []
    for label in dict_line['gtboxes']:
        if 'extra' in label:
            if label['extra'].get('ignore', 0) == 1 or label['extra'].get('unsure', 0) == 1:
                continue
            
        bb = label['hbox']  # head box: x, y, width, height

        # Clip the box to the image bounds
        bb = [min(max(bb[0], 0), width),
              min(max(bb[1], 0), height),
              min(max(bb[0] + bb[2], 0), width),
              min(max(bb[1] + bb[3], 0), height)]  # xmin, ymin, xmax, ymax
        
        x_center = (bb[0] + bb[2]) / 2
        x_size = (bb[2] - bb[0])
        y_center = (bb[1] + bb[3]) / 2
        y_size = (bb[3] - bb[1])
        
        # Skip boxes that are 3 pixels or less on either side after clipping
        if x_size <= 3 or y_size <= 3:
            continue
            
        x_center /= width
        x_size /= width
        y_center /= height
        y_size /= height
        
        strings.append("{} {:.6f} {:.6f} {:.6f} {:.6f}".format(0, x_center, y_center, x_size, y_size))
        
    if len(strings) > 0:
        output_file = labels / (image_id + '.txt')
        with open(output_file, 'w') as f:
            f.write("\n".join(strings) + "\n")
        return True, str(image_file)
    return False, str(image_file)

In [3]:
for rep in repositories:
    odgt_file = str(odgt_format).format(rep)
    print("Processing {}:".format(odgt_file))

    # Each non-empty line of the .odgt file is one JSON record
    with open(odgt_file) as f:
        image_list = [line for line in f.read().split('\n') if line]

    valid_images = []
    invalid_images = []
    
    images = path_images / rep
    labels = path_labels / rep
    
    labels.mkdir(exist_ok=True)

    for line in tqdm(image_list):
        valid, image_file = generate_annotations(line, images, labels)
        if valid:
            valid_images.append(image_file)
        else:
            invalid_images.append(image_file)

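    # summary.txt lists only the images that kept at least one label
    with open(labels / "summary.txt", 'w') as f:
        f.write("\n".join(valid_images) + "\n")

    # Images left without any usable box
    print("Invalid images: {}".format("\n".join(invalid_images)))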

Processing homemade/crowdhuman/annotation_train.odgt:

100%|██████████| 15000/15000 [00:05<00:00, 2918.50it/s]

Invalid images: homemade/crowdhuman/images/train/282555,1e5f7000c479116e.jpg
Processing homemade/crowdhuman/annotation_val.odgt:

100%|██████████| 4370/4370 [00:01<00:00, 2896.81it/s]

Invalid images: 
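
As an optional sanity check, the generated label files can be read back to confirm that every line has five fields and that all coordinates fall within [0, 1]. The sketch below assumes the directory layout used in this notebook. The helper `check_label_file` is hypothetical. Note that `summary.txt` sits in the same directory as the label files, so it has to be skipped.

```python
from pathlib import Path


def check_label_file(label_file):
    """Return a list of problems found in one YOLO label file."""
    problems = []
    lines = Path(label_file).read_text().splitlines()
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 5:
            problems.append("line {}: expected 5 fields, got {}".format(
                number, len(fields)))
            continue
        try:
            values = [float(value) for value in fields[1:]]
        except ValueError:
            problems.append("line {}: non-numeric coordinate".format(number))
            continue
        if any(not 0.0 <= value <= 1.0 for value in values):
            problems.append("line {}: coordinate outside [0, 1]".format(number))
    return problems


labels_root = Path("homemade/crowdhuman/labels")
for split in ("train", "val"):
    split_dir = labels_root / split
    if not split_dir.is_dir():
        continue
    for label_file in sorted(split_dir.glob("*.txt")):
        if label_file.name == "summary.txt":
            continue  # image list, not a label file
        for problem in check_label_file(label_file):
            print(label_file, problem)
```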



