<a href="https://colab.research.google.com/github/srttkyk/desk/blob/master/Create_Dataset_20220403.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is in this notebook 🤔 ?

The goal of this notebook is to create an enhanced version of the dataset for the last [happywhale competition](https://www.kaggle.com/c/happy-whale-and-dolphin).<br>

This notebook implements a few tricks that came out of discussions and public notebooks. It keeps the same file structure as the official competition data and remains fully configurable.<br>
It is designed so that you can fork it and change the settings to suit your needs.<br>

## To fork this notebook
1. Fork this notebook by clicking the "Copy & edit notebook" button in the three-dot menu in the top right corner of the page
2. Upload your Kaggle API credentials to a **PRIVATE** dataset and add that dataset to your notebook
3. Change the settings as you like
4. Commit !

## Use the datasets created by this notebook !
* [light version (224x224)](https://www.kaggle.com/wolfy73/happywhale-enhanced-dataset-light)
* [normal version (380x380)](https://www.kaggle.com/wolfy73/happywhale-enhanced-dataset-normal)


If you want this notebook to integrate other tricks, please tell me in the comment section !<br>

If you found this notebook useful, please upvote it 👍 

# What tricks are implemented in this notebook 💡 ?

|Trick|Source|Description|
|---:|:---|:---|
|Bitmap format|[W&D - (224x224) Fast dataset](https://www.kaggle.com/wolfy73/w-d-224x224-fast-dataset)|The images are converted to BMP, a format that, unlike JPEG, is typically stored uncompressed and so needs little or no decoding when an image is read. As a result, image reading becomes faster|
|Cropped images|[Happywhale: Cropped Dataset [YOLOv5] ✂️](https://www.kaggle.com/awsaf49/happywhale-cropped-dataset-yolov5)|Images are cropped using a YOLOv5 model trained to predict bounding boxes. This crops out unwanted content such as land, people and boats, and "zooms" in on the part we are interested in|
|Fixed species|[Fix all known species column problems](https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/305574)|Merge similar species in the training set, reducing the number of classes to 26|
|Stratified KFold|[Stratified KFold v. Group KFold (aka. I'm a dummy)](https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/306521)|Assign a fold to each training sample so you can track which samples were used to train which model. This also makes cross-validation possible|
|OOD Detection|[🐳&🐬 - Filter YOLOv5 failure cases](https://www.kaggle.com/wolfy73/filter-yolov5-failure-cases)|As the YOLOv5 model isn't perfect, we use OOD sample detection to filter out wrong bounding boxes|
|Kaggle dataset|[HappyWhale TFRecords](https://www.kaggle.com/ks2019/happywhale-tfrecords)|Kaggle notebook output size is limited to 19.6GB. Writing the result to a Kaggle dataset removes this constraint|
|Maximize contrast| [🐳&🐬 - Extract more infromation from images 🖼️](https://www.kaggle.com/wolfy73/extract-more-infromation-from-images) | Try to maximize the information in the image by remapping the colors |
|Detic cropping|[cropped&resized(512x512) dataset using detic](https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/305503)| Use a CLIP-based cropping method to crop the dataset|
|YOLOv5 cropping #2|[🐳&🐬 - Train YOLOv5 w/ crowd-sourced dataset 📊](https://www.kaggle.com/wolfy73/train-yolov5-w-crowd-sourced-dataset)| Images are cropped using a YOLOv5 model trained to predict bounding boxes. The difference from the previous one is that this model was trained on a [specific crowd-sourced dataset](https://www.kaggle.com/wolfy73/wandd-crowed-sourced-bounging-boxes)|

# Changelog 📝
|Version|Description|
|:---|:---|
|v1| All tricks turned on ✅ |
|v2| Padding turned off ✅ |
|v3| Image size increased to 380 + flag quantile increased to 5% ❌ (Out of disk space) |
|v4| Outputting to a Kaggle dataset + changelog + test mode ✅ |
|v5| OOD_DETECTION_FLAG_QUANTILE=0.1, IMAGE_SIZE=224 + Maximize contrast trick ✅ |
|v6| Disabling maximized contrast for comparison ✅ |
|v7| Image size re-increased to 380 Cropping based on detic + OOD_DETECTION_FLAG_QUANTILE=0.01 ✅ |
|v8| Image size increased to 512 + OOD_DETECTION_FLAG_QUANTILE=0.02 ❌ (Out of disk space) |
|v9| Image size back to 380 + crowd-sourced dataset YOLOv5 ✅  |
|v10| Image size to 512 + save disk space ✅  |
|v11| Disabled bitmap ✅ |

## Kaggle Dataset Download

In [None]:
"""
Kaggle API setup (used when downloading Kaggle datasets into Colab).
Downloading manually and uploading to Google Drive may be faster.
"""

!pip install kaggle
from googleapiclient.discovery import build
import io, os
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth

auth.authenticate_user()

drive_service = build('drive', 'v3')
results = drive_service.files().list(
        q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])

filename = "/content/drive/MyDrive/Kaggle/kaggle.json"
os.makedirs(os.path.dirname(filename), exist_ok=True)

request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
fh = io.FileIO(filename, 'wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
os.chmod(filename, 0o600)  # octal mode: read/write for the owner only (decimal 600 sets the wrong bits)

Download 100%.


In [None]:
from google.colab import drive
drive.mount('/content/Drive')

Mounted at /content/Drive


In [None]:
! mkdir -p ~/.kaggle
# The key was saved to /content/drive/... (lowercase "drive") by the download cell above,
# not to the Drive mount at /content/Drive, so copy it from there.
! cp "/content/drive/MyDrive/Kaggle/kaggle.json" ~/.kaggle/


In [None]:
# ! pip install kaggle --upgrade
# ! kaggle config view
# ! kaggle competitions download happy-whale-and-dolphin  -p "/content/Drive/MyDrive/Kaggle/HappyWhale-2022/original_dana/"

Configuration values from /root/.kaggle
- username: srtturkey
- path: None
- proxy: None
- competition: None


# Configuration ⚙️

Feel free to tweak these settings to suit your needs !

Settings used when creating the dataset

In [None]:
class CFG:
    #Size of final images
    MAX_IMAGE_SIZE = 512

    ### Turn the tricks mentioned above on and off

    ## Conversion of images into the bitmap format to speed up image reading
    USE_BITMAP_FORMAT = False

    ## Crop images using Yolov5 to cut out parts of the image that may not be informative
    USE_CROPPED_IMAGES = True
    CROPPED_IMAGES_CONFIDENCE = 0.01 # Minimum confidence to keep the bounding box
    CROPPED_IMAGES_MARGIN = 0.1 # Proportion of the extra margin to add around the bounding box (avoid under-cropping)

    ## Merge species that can be considered as the same to have less different classes
    USE_FIXED_SPECIES = True

    # Assign a fold to each sample of the data to perform cross-validation
    USE_STRATIFIED_K_FOLD = True
    STRATIFIED_K_FOLD_K = 4 # Number of folds

    # Use OOD methods to filter out failure cases in the yolo bounding box dataset
    # Setting that excludes wrong bounding boxes; see the "Filter cropping failure cases" script for details
    USE_OOD_DETECTION = False
    OOD_DETECTION_FLAG_QUANTILE = 0.01 # Proportion of the data to flag by each method
    OOD_DETECTION_N_FLAGS = 2 # Minimum number of flag to filter out a sample
    OOD_DETECTION_USE_MAX_FLAG = True # Use the max confidence as a metric to detect OOD samples
    OOD_DETECTION_USE_DELTA_MAX_FLAG = True # Use the delta between the max confidence of original and cropped as a metric to detect OOD samples
    OOD_DETECTION_USE_ENTROPY_FLAG = True # Use the entropy of the prediction distribution as a metric to detect OOD samples
    OOD_DETECTION_USE_DELTA_ENTROPY_FLAG = True # Use the delta between the entropy of the original and cropped prediction distributions as a metric to detect OOD samples

    # Output the dataset as a kaggle dataset (higher memory limit)
    USE_KAGGLE_DATASET = True
    KAGGLE_DATASET_NAME = 'happywhale-enhanced-dataset-large' # Name of the resulting dataset

    # Maximize the information in the image by remapping the colors
    USE_MAXIMIZE_CONTRAST = False

    # Pad with zeros to obtain squared images, the goal is to avoid squeezing images
    USE_ZERO_PADDING = True


    TEST_MODE = False # Used for debugging

In [None]:
# Size of final images
MAX_IMAGE_SIZE = 512

### Turn the tricks mentioned above on and off

## Conversion of images into the bitmap format to speed up image reading
USE_BITMAP_FORMAT = False

## Crop images using Yolov5 to cut out parts of the image that may not be informative
USE_CROPPED_IMAGES = True
CROPPED_IMAGES_CONFIDENCE = 0.01 # Minimum confidence to keep the bounding box
CROPPED_IMAGES_MARGIN = 0.1 # Proportion of the extra margin to add around the bounding box (avoid under-cropping)

## Merge species that can be considered as the same to have less different classes
USE_FIXED_SPECIES = True

# Assign a fold to each sample of the data to perform cross-validation
USE_STRATIFIED_K_FOLD = True
STRATIFIED_K_FOLD_K = 4 # Number of folds

# Use OOD methods to filter out failure cases in the yolo bounding box dataset
# Setting that excludes wrong bounding boxes; see the "Filter cropping failure cases" script for details
USE_OOD_DETECTION = False
OOD_DETECTION_FLAG_QUANTILE = 0.01 # Proportion of the data to flag by each method
OOD_DETECTION_N_FLAGS = 2 # Minimum number of flag to filter out a sample
OOD_DETECTION_USE_MAX_FLAG = True # Use the max confidence as a metric to detect OOD samples
OOD_DETECTION_USE_DELTA_MAX_FLAG = True # Use the delta between the max confidence of original and cropped as a metric to detect OOD samples
OOD_DETECTION_USE_ENTROPY_FLAG = True # Use the entropy of the prediction distribution as a metric to detect OOD samples
OOD_DETECTION_USE_DELTA_ENTROPY_FLAG = True # Use the delta between the entropy of the original and cropped prediction distributions as a metric to detect OOD samples

# Output the dataset as a kaggle dataset (higher memory limit)
USE_KAGGLE_DATASET = True
KAGGLE_DATASET_NAME = 'happywhale-enhanced-dataset-large' # Name of the resulting dataset

# Maximize the information in the image by remapping the colors
USE_MAXIMIZE_CONTRAST = False

# Pad with zeros to obtain squared images, the goal is to avoid squeezing images
USE_ZERO_PADDING = True


TEST_MODE = False # Used for debugging
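
To make the `CROPPED_IMAGES_MARGIN` setting more concrete, here is a minimal sketch of how a margin of 0.1 widens a bounding box before cropping, clamped to the image borders. The helper name `expand_bbox` and the sample box and image size are assumptions made for illustration only; the notebook's own logic lives in `crop_image`.

```python
def expand_bbox(bbox, margin, image_width, image_height):
    """Grow an (xmin, ymin, xmax, ymax) box by `margin` times its size on each side.

    Hypothetical helper for illustration; coordinates are clamped to the image.
    """
    xmin, ymin, xmax, ymax = bbox
    dw = int(round(margin * (xmax - xmin)))  # extra pixels on the left and right
    dh = int(round(margin * (ymax - ymin)))  # extra pixels on the top and bottom
    return (
        max(xmin - dw, 0),
        max(ymin - dh, 0),
        min(xmax + dw, image_width),
        min(ymax + dh, image_height),
    )


# Made-up 200x100 box inside a 320x240 image, with a 10% margin.
expanded = expand_bbox((100, 50, 300, 150), 0.1, 320, 240)
print(expanded)  # (80, 40, 320, 160)
```

The right edge is clamped to the image width (320) here, which is why the box grows by 20 pixels on the left but cannot grow on the right.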

In [None]:
!pip install -q bbox-utility # check https://github.com/awsaf49/bbox for source code

In [None]:
import numpy as np
import pandas as pd
import os
import cv2
import matplotlib.pyplot as plt
import time
import glob
import shutil
import json
import datetime
from tqdm.notebook import tqdm
from bbox.utils import yolo2voc, draw_bboxes
from sklearn.model_selection import StratifiedKFold
import pickle


In [None]:
# def plot_images(batch, row=2, col=2, base_path="../input/happy-whale-and-dolphin/train_images/"):
def plot_images(batch, row=2, col=2, base_path="/content/Drive/MyDrive/Colab Notebooks/Kaggle/Happywhale/original_dana/train_images/"):
    """
        Copied and adapted from https://www.kaggle.com/awsaf49/happywhale-data-distribution
    """
    plt.figure(figsize=(col*3, row*3))
    for i in range(row*col):
        plt.subplot(row, col, i+1)
        path = os.path.join(base_path,  batch["image"].iloc[i])
        img = cv2.imread(path)
        if img is None:
            continue
        img = img[:, :, ::-1]
        plt.imshow(img)
        if "species" in batch:
            plt.title(batch["species"].iloc[i])
        plt.axis('off')
    plt.tight_layout()
    plt.show()
    
def get_size(start_path = '.'):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)

    return total_size

In [None]:
# # Authenticate (needs your kaggle.json file in secret named 'kaggle')
# from kaggle_secrets import UserSecretsClient
# user_secrets = UserSecretsClient()
# apitoken = user_secrets.get_secret("kaggle")
# !mkdir -p ~/.kaggle
# %store apitoken >~/.kaggle/kaggle.json
# !mkdir -p dataset

In [None]:
# """
#     Copy-paste from https://www.kaggle.com/ks2019/happywhale-tfrecords
# """

# if USE_KAGGLE_DATASET:    
#     !rm -r /tmp/{KAGGLE_DATASET_NAME}

#     os.makedirs(f'/tmp/{KAGGLE_DATASET_NAME}', exist_ok=True)

#     with open('/root/.kaggle/kaggle.json') as f:
#         kaggle_creds = json.load(f)

#     os.environ['KAGGLE_USERNAME'] = kaggle_creds['username']
#     os.environ['KAGGLE_KEY'] = kaggle_creds['key']

#     !kaggle datasets init -p /tmp/{KAGGLE_DATASET_NAME}

#     with open(f'/tmp/{KAGGLE_DATASET_NAME}/dataset-metadata.json') as f:
#         dataset_meta = json.load(f)
#     dataset_meta['id'] = f'wolfy73/{KAGGLE_DATASET_NAME}'
#     dataset_meta['title'] = KAGGLE_DATASET_NAME
#     with open(f'/tmp/{KAGGLE_DATASET_NAME}/dataset-metadata.json', "w") as outfile:
#         json.dump(dataset_meta, outfile)
#     print(dataset_meta)

#     !cp /tmp/{KAGGLE_DATASET_NAME}/dataset-metadata.json /tmp/{KAGGLE_DATASET_NAME}/meta.json
#     !ls /tmp/{KAGGLE_DATASET_NAME}

#     !kaggle datasets create -u -p /tmp/{KAGGLE_DATASET_NAME} 
    
#     BASE_PATH=f"/tmp/{KAGGLE_DATASET_NAME}"
# else:
#     BASE_PATH="."
# print("BASE_PATH =", BASE_PATH)
# !mkdir {BASE_PATH}/test_images
# !mkdir {BASE_PATH}/train_images

rm: cannot remove '/tmp/happywhale-enhanced-dataset-large': No such file or directory
Data package template written to: /tmp/happywhale-enhanced-dataset-large/dataset-metadata.json
{'title': 'happywhale-enhanced-dataset-large', 'id': 'wolfy73/happywhale-enhanced-dataset-large', 'licenses': [{'name': 'CC0-1.0'}]}
dataset-metadata.json  meta.json
Starting upload for file meta.json
100%|████████████████████████████████████████████| 132/132 [00:00<00:00, 531B/s]
Upload successful: meta.json (132B)
Dataset creation error: The requested title "happywhale-enhanced-dataset-large" is already in use by a dataset. Please choose another title.
BASE_PATH = /tmp/happywhale-enhanced-dataset-large


Data layout (Google Drive), for reference<br>
<br>
Happywhale<br>
　|---　original_data: competition data<br>
　|---　train-yolov5-w-crowd-sourced-dataset: CSV of the bounding-box coordinates detected by YOLO<br>
　|---　filter-yolov5-failure-cases: CSV of the OOD analysis results for each image<br>
　|---　Experiment_data: folder for the experimental datasets<br>
　　　|---　v1　　(datasets are created in several versions)<br>
　　　|---　v2<br>


# train.csv

In [None]:
%cd "/content/Drive/MyDrive/Colab Notebooks/Kaggle/Happywhale/"

[Errno 2] No such file or directory: '/content/Drive/MyDrive/Colab Notebooks/Kaggle/Happywhale/'
/content/drive/MyDrive/Colab Notebooks/Kaggle/Happywhale/original_data


In [None]:
BASE_PATH = "./"
train = pd.read_csv(BASE_PATH + "original_data/train.csv")
train

Unnamed: 0,image,species,individual_id
0,00021adfb725ed.jpg,melon_headed_whale,cadddb1636b9
1,000562241d384d.jpg,humpback_whale,1a71fbb72250
2,0007c33415ce37.jpg,false_killer_whale,60008f293a2b
3,0007d9bca26a99.jpg,bottlenose_dolphin,4b00fe572063
4,00087baf5cef7a.jpg,humpback_whale,8e5253662392
...,...,...,...
51028,fff639a7a78b3f.jpg,beluga,5ac053677ed1
51029,fff8b32daff17e.jpg,cuviers_beaked_whale,1184686361b3
51030,fff94675cc1aef.jpg,blue_whale,5401612696b9
51031,fffbc5dd642d8c.jpg,beluga,4000b3d7c24e


In [None]:
if USE_FIXED_SPECIES:
    # Fix misspelled labels and merge equivalent species.
    # Series.replace avoids the chained assignment train["species"][mask] = ...,
    # which triggers SettingWithCopyWarning and may silently do nothing.
    train["species"] = train["species"].replace({
        "bottlenose_dolpin": "bottlenose_dolphin",
        "kiler_whale": "killer_whale",
        "globis": "short_finned_pilot_whale",
        "pilot_whale": "short_finned_pilot_whale",
    })

if USE_BITMAP_FORMAT:
    train["image"] = train["image"].str[:-3] + "bmp"
    
if USE_STRATIFIED_K_FOLD:
    """
        Copied and adapted from https://www.kaggle.com/debarshichanda/pytorch-arcface-gem-pooling-starter
    """
    skf = StratifiedKFold(n_splits=STRATIFIED_K_FOLD_K)
    folds = np.zeros(len(train), dtype=np.uint8)
    for fold, ( _, val_) in enumerate(skf.split(X=train, y=train.individual_id)):
        folds[val_] = fold
    train["fold"] = folds
        
# train.to_csv(os.path.join(BASE_PATH, "./train.csv"), index=False)
train.to_csv(os.path.join(BASE_PATH, "fixed_data/train.csv"), index=False)
train



Unnamed: 0,image,species,individual_id,fold
0,00021adfb725ed.jpg,melon_headed_whale,cadddb1636b9,0
1,000562241d384d.jpg,humpback_whale,1a71fbb72250,1
2,0007c33415ce37.jpg,false_killer_whale,60008f293a2b,0
3,0007d9bca26a99.jpg,bottlenose_dolphin,4b00fe572063,0
4,00087baf5cef7a.jpg,humpback_whale,8e5253662392,0
...,...,...,...,...
51028,fff639a7a78b3f.jpg,beluga,5ac053677ed1,3
51029,fff8b32daff17e.jpg,cuviers_beaked_whale,1184686361b3,1
51030,fff94675cc1aef.jpg,blue_whale,5401612696b9,3
51031,fffbc5dd642d8c.jpg,beluga,4000b3d7c24e,3
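
The `fold` column shown above can drive a cross-validation loop: for fold `k`, the rows with `fold == k` are held out and the rest are used for training. The sketch below uses a small made-up array in place of `train["fold"]`, so the values and sizes are illustrative assumptions rather than the real split.

```python
import numpy as np

# Toy stand-in for train["fold"]; the real column has one entry per training image.
folds = np.array([0, 1, 0, 0, 0, 3, 1, 3, 3, 2])
n_folds = 4  # mirrors STRATIFIED_K_FOLD_K

splits = []
for k in range(n_folds):
    val_idx = np.flatnonzero(folds == k)    # rows held out for validation
    train_idx = np.flatnonzero(folds != k)  # rows used to fit the model
    splits.append((train_idx, val_idx))
    print(f"fold {k}: {len(train_idx)} train / {len(val_idx)} val")
```

Every row lands in exactly one validation set, so each model can later be evaluated on samples it never saw during training.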


# sample_submission.csv

In [None]:
sample_submission = pd.read_csv(BASE_PATH+"original_data/sample_submission.csv")
sample_submission

Unnamed: 0,image,predictions
0,000110707af0ba.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
1,0006287ec424cb.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
2,000809ecb2ccad.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
3,00098d1376dab2.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
4,000b8d89c738bd.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
...,...,...
27951,fff6ff1989b5cd.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
27952,fff8fd932b42cb.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
27953,fff96371332c16.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
27954,fffc1c4d3eabc7.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...


In [None]:
def crop_image(image, row):
    if row['bbox'] is None or len(row['bbox']) == 0: # if there is no bbox
        return image
    bbox = row['bbox'][0]
    conf = row['bbox_conf'][0]
    if conf>=CROPPED_IMAGES_CONFIDENCE: # don't crop for poor confident bboxes
        xmin, ymin, xmax, ymax = bbox
        width = xmax - xmin
        height = ymax - ymin
        dw = int(round(CROPPED_IMAGES_MARGIN * width))
        dh = int(round(CROPPED_IMAGES_MARGIN * height))
        xmin, xmax, ymin, ymax = max(xmin-dw, 0), min(xmax+dw, image.shape[1]), max(ymin-dh, 0), min(ymax+dh, image.shape[0])
        image = image[ymin:ymax, xmin:xmax] # crop image
        
    return image

def pad_and_resize(image):
    r = MAX_IMAGE_SIZE / max(image.shape[0], image.shape[1])
    dim = (int(image.shape[1] * r), int(image.shape[0] * r))
    image = cv2.resize(image, dim, interpolation=cv2.INTER_CUBIC)
    nimage = np.zeros((MAX_IMAGE_SIZE, MAX_IMAGE_SIZE, 3), dtype=np.uint8)
    
    middle = MAX_IMAGE_SIZE // 2
    
    vh1, hh1 = image.shape[0] // 2, image.shape[1] // 2
    vh2, hh2 = image.shape[0] - vh1, image.shape[1] - hh1
    
    iy1, iy2 = middle - vh1, middle + vh2
    ix1, ix2 = middle - hh1, middle + hh2
    
    nimage[iy1:iy2, ix1:ix2] = image
    
    return nimage

def remap_channel(image):
    # Argsort of the image to keep spatial information
    ids_sorted = np.argsort((image + np.random.random(image.shape) - 0.5).ravel())
    # Shades of grey
    values = np.floor(np.linspace(0.0, 256.0, num=len(ids_sorted), endpoint=False)).astype(np.uint8)
    s = image.shape
    image = image.ravel()
    # Reorder the shades of grey to look like the original image
    image[ids_sorted] = values
    image = image.reshape(s)
    return image

def remap_colors(image):
    """
        The remapping is equivalent to creating an image with n evenly spaced shades of grey
        and rearranging its pixels so that it looks like the original image
    """
    if len(image.shape) == 2:
        return remap_channel(image)
    image[:, :, 0] = remap_channel(image[:, :, 0])
    image[:, :, 1] = remap_channel(image[:, :, 1])
    image[:, :, 2] = remap_channel(image[:, :, 2])
    return image
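
As a rough illustration of what the remapping does to one channel, the sketch below applies the same rank-based idea to a tiny low-contrast array. It is a simplified variant and not the notebook's exact code: the function name `rank_remap` is made up, and ties are broken with a stable sort instead of the random noise used in `remap_channel`, so the result is deterministic.

```python
import numpy as np


def rank_remap(channel: np.ndarray) -> np.ndarray:
    """Spread the values of one channel evenly over 0..255 according to their rank."""
    flat = channel.ravel()
    # Stable sort for deterministic tie-breaking (the notebook adds random noise instead).
    order = np.argsort(flat, kind="stable")
    # One evenly spaced shade of grey per pixel, from darkest to brightest.
    values = np.floor(
        np.linspace(0.0, 256.0, num=flat.size, endpoint=False)
    ).astype(np.uint8)
    out = np.empty_like(values)
    out[order] = values  # the i-th darkest pixel receives the i-th shade
    return out.reshape(channel.shape)


low_contrast = np.array([[100, 101], [102, 103]], dtype=np.uint8)
remapped = rank_remap(low_contrast)
print(remapped)
# [[  0  64]
#  [128 192]]
```

Four nearly identical grey levels end up spread across the full range while their ordering, and therefore the spatial structure, is preserved.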

In [None]:
if USE_BITMAP_FORMAT:
    sample_submission["inference_image"] = sample_submission["image"].str[:-3] + "bmp"
sample_submission.to_csv(os.path.join(BASE_PATH, "./sample_submission.csv"), index=False)
sample_submission

Unnamed: 0,image,predictions
0,000110707af0ba.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
1,0006287ec424cb.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
2,000809ecb2ccad.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
3,00098d1376dab2.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
4,000b8d89c738bd.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
...,...,...
27951,fff6ff1989b5cd.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
27952,fff8fd932b42cb.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
27953,fff96371332c16.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...
27954,fffc1c4d3eabc7.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb96...


In [None]:
# yolo_bp = '../input/train-yolov5-w-crowd-sourced-dataset'
# metrics_bp = "../input/filter-yolov5-failure-cases"
yolo_bp = BASE_PATH + 'train-yolov5-w-crowd-sourced-dataset/'
metrics_bp = BASE_PATH + "filter-yolov5-failure-cases/"

# YOLO bounding-box coordinates
yolo_train = pd.read_csv(os.path.join(yolo_bp, 'train.csv'), index_col="image").fillna("[]")
bboxes_train = yolo_train["bbox"].map(eval)
bboxes_train_confidence = yolo_train["conf"].map(eval)

# OOD metrics
metrics_train = pd.read_csv(os.path.join(metrics_bp, 'train.csv')) 
metrics_train.index = metrics_train["image"].str[:-3] + "jpg"
assert (bboxes_train.index == metrics_train.index).all()

# Combine the OOD metrics with the YOLO coordinates
metrics_train["bbox"] = bboxes_train
metrics_train["bbox_conf"] = bboxes_train_confidence

yolo_test = pd.read_csv(os.path.join(yolo_bp, 'test.csv'), index_col="image").fillna("[]")
bboxes_test = yolo_test["bbox"].map(eval)
bboxes_test_confidence = yolo_test["conf"].map(eval)

metrics_test = pd.read_csv(os.path.join(metrics_bp, 'test.csv'))
metrics_test.index = metrics_test["image"].str[:-3] + "jpg"
assert (bboxes_test.index == metrics_test.index).all()
metrics_test["bbox"] = bboxes_test
metrics_test["bbox_conf"] = bboxes_test_confidence
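
The cell above parses the stringified lists in the `bbox` and `conf` columns with `eval`, which executes whatever the CSV contains. A possible safer alternative, sketched here under the assumption that the cells only hold plain Python literals such as `[[xmin, ymin, xmax, ymax]]`, is `ast.literal_eval`. The sample strings below are made up for illustration.

```python
import ast


def parse_list_cell(cell: str):
    """Parse a stringified Python literal (e.g. a list of boxes) without executing code."""
    return ast.literal_eval(cell)


sample_bbox = "[[10, 20, 110, 220]]"  # hypothetical content of a bbox cell
sample_conf = "[0.87]"                # hypothetical content of a conf cell
sample_empty = "[]"                   # what fillna("[]") yields for images without a box

boxes = parse_list_cell(sample_bbox)
confs = parse_list_cell(sample_conf)
empty = parse_list_cell(sample_empty)
print(boxes, confs, empty)
```

With pandas this could be plugged in as `yolo_train["bbox"].map(ast.literal_eval)`; anything that is not a literal raises an exception instead of being executed.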

In [None]:
train_flags = np.zeros(len(metrics_train))
test_flags = np.zeros(len(metrics_test))

## MAX FLAG
if OOD_DETECTION_USE_MAX_FLAG:
    # Train
    train_cropped_pred_max_conf = metrics_train["cropped_pred_max_conf"].values
    train_m = metrics_train["cropped_pred_max_conf"] <= np.quantile(train_cropped_pred_max_conf, OOD_DETECTION_FLAG_QUANTILE)
    print(f"MAX_FLAG: {train_m.sum()} samples flagged in train")
    train_flags[train_m] += 1
    # test
    test_cropped_pred_max_conf = metrics_test["cropped_pred_max_conf"].values
    test_m = metrics_test["cropped_pred_max_conf"] <= np.quantile(test_cropped_pred_max_conf, OOD_DETECTION_FLAG_QUANTILE)
    print(f"MAX_FLAG: {test_m.sum()} samples flagged in test")
    test_flags[test_m] += 1

## DELTA MAX FLAG
if OOD_DETECTION_USE_DELTA_MAX_FLAG:
    # Train
    train_pred_max_conf_delta = metrics_train["pred_max_conf_delta"].values
    train_m = metrics_train["pred_max_conf_delta"] > np.quantile(train_pred_max_conf_delta, 1 - OOD_DETECTION_FLAG_QUANTILE)
    print(f"DELTA_MAX_FLAG: {train_m.sum()} samples flagged in train")
    train_flags[train_m] += 1
    # test
    test_pred_max_conf_delta = metrics_test["pred_max_conf_delta"].values
    test_m = metrics_test["pred_max_conf_delta"] > np.quantile(test_pred_max_conf_delta, 1 - OOD_DETECTION_FLAG_QUANTILE)
    print(f"DELTA_MAX_FLAG: {test_m.sum()} samples flagged in test")
    test_flags[test_m] += 1

## ENTROPY FLAG
if OOD_DETECTION_USE_ENTROPY_FLAG:
    # Train
    train_cropped_pred_entropy = metrics_train["cropped_pred_entropy"].values
    train_m = metrics_train["cropped_pred_entropy"] > np.quantile(train_cropped_pred_entropy, 1 - OOD_DETECTION_FLAG_QUANTILE)
    print(f"ENTROPY_FLAG: {train_m.sum()} samples flagged in train")
    train_flags[train_m] += 1
    # test
    test_cropped_pred_entropy = metrics_test["cropped_pred_entropy"].values
    test_m = metrics_test["cropped_pred_entropy"] > np.quantile(test_cropped_pred_entropy, 1 - OOD_DETECTION_FLAG_QUANTILE)
    print(f"ENTROPY_FLAG: {test_m.sum()} samples flagged in test")
    test_flags[test_m] += 1

## DELTA ENTROPY FLAG
if OOD_DETECTION_USE_DELTA_ENTROPY_FLAG:
    # Train
    train_pred_entropy_delta = metrics_train["pred_entropy_delta"].values
    train_m = metrics_train["pred_entropy_delta"] <= np.quantile(train_pred_entropy_delta, OOD_DETECTION_FLAG_QUANTILE)
    print(f"DELTA_ENTROPY_FLAG: {train_m.sum()} samples flagged in train")
    train_flags[train_m] += 1
    # test
    test_pred_entropy_delta = metrics_test["pred_entropy_delta"].values
    test_m = metrics_test["pred_entropy_delta"] <= np.quantile(test_pred_entropy_delta, OOD_DETECTION_FLAG_QUANTILE)
    print(f"DELTA_ENTROPY_FLAG: {test_m.sum()} samples flagged in test")
    test_flags[test_m] += 1

if USE_OOD_DETECTION:
    train_m = train_flags >= OOD_DETECTION_N_FLAGS
    metrics_train["bbox"][train_m] = None
    print(f"Removed {train_m.sum()} bounding boxes in the train dataset")
    test_m = test_flags >= OOD_DETECTION_N_FLAGS
    metrics_test["bbox"][test_m] = None
    print(f"Removed {test_m.sum()} bounding boxes in the test dataset")

MAX_FLAG: 511 samples flagged in train
MAX_FLAG: 280 samples flagged in test
DELTA_MAX_FLAG: 511 samples flagged in train
DELTA_MAX_FLAG: 280 samples flagged in test
ENTROPY_FLAG: 511 samples flagged in train
ENTROPY_FLAG: 280 samples flagged in test
DELTA_ENTROPY_FLAG: 511 samples flagged in train
DELTA_ENTROPY_FLAG: 280 samples flagged in test
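
The voting scheme used above can be illustrated on toy data: each metric flags its most suspicious quantile of samples, and a bounding box is dropped only when it collects at least `OOD_DETECTION_N_FLAGS` flags. The sketch below uses two metrics and made-up values, with a larger quantile (0.2) than the notebook so that the effect is visible on ten samples.

```python
import numpy as np

# Made-up per-image metrics: low max confidence and high entropy are both suspicious.
max_conf = np.array([0.9, 0.8, 0.1, 0.7, 0.95, 0.85, 0.2, 0.75, 0.6, 0.65])
entropy = np.array([0.2, 0.3, 2.0, 0.4, 0.1, 0.25, 0.5, 0.35, 0.45, 0.3])

flag_quantile = 0.2  # plays the role of OOD_DETECTION_FLAG_QUANTILE
n_flags = 2          # plays the role of OOD_DETECTION_N_FLAGS

flags = np.zeros(len(max_conf), dtype=int)
# Flag the lowest-confidence fraction of samples.
flags += max_conf <= np.quantile(max_conf, flag_quantile)
# Flag the highest-entropy fraction of samples.
flags += entropy > np.quantile(entropy, 1 - flag_quantile)

removed = flags >= n_flags
print(np.flatnonzero(removed))  # [2 6]
```

Samples 2 and 6 are flagged by both metrics and would lose their bounding box; requiring two flags keeps a sample that looks odd on only one metric.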


In [None]:
def copy_dir(collection, base_path, Output_dir):
    dirname = "train_images" if collection == "train" else "test_images"
    print("Copying", collection, "images...")
    path = os.path.join(base_path, dirname)
    images = list(os.listdir(path))

    new_base_path = os.path.join(BASE_PATH, "Experiment_data", Output_dir, dirname)
    os.makedirs(new_base_path, exist_ok=True)
    
    if TEST_MODE:
        images = images[:50]
    n = len(images)
    for i, f in tqdm(enumerate(images), total=n):
        if i % 1000 == 0:
            # Measure the directory actually being written to; /tmp/{KAGGLE_DATASET_NAME} is unused here
            print("{}/{}: {:.4f} Gb used so far".format(i, n, get_size(new_base_path) / 1e9))
        
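        image_path = os.path.join(path, f)
        image = cv2.imread(image_path)
        if image is None:  # cv2.imread returns None for unreadable or non-image files
            print("Skipping unreadable file:", image_path)
            continue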
        
        if USE_CROPPED_IMAGES:
            df = metrics_train if collection == "train" else metrics_test
            image = crop_image(image, df.loc[f])
        
        # if USE_ZERO_PADDING and image.shape[0] != image.shape[1]:
        if USE_ZERO_PADDING :
            image = pad_and_resize(image)
        # else:
        #     # if image.shape[0] > MAX_IMAGE_SIZE or image.shape[1] > MAX_IMAGE_SIZE:
        #     if iUSE_ZERO_PADDING:
        #         image = cv2.resize(image, (MAX_IMAGE_SIZE, MAX_IMAGE_SIZE), interpolation=cv2.INTER_CUBIC)
            
        if USE_MAXIMIZE_CONTRAST:
            image = remap_colors(image)
            
        if USE_BITMAP_FORMAT:
            new_path = os.path.join(new_base_path, f.split('.')[0] + ".bmp")
        else:
            new_path = os.path.join(new_base_path, f)
        cv2.imwrite(new_path, image)
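
The geometry behind `pad_and_resize` can be checked without OpenCV: the longer side is scaled to `MAX_IMAGE_SIZE`, the shorter side is scaled by the same ratio, and the result is centered on a black square. The sketch below only computes sizes and offsets; the function name `letterbox_geometry` and the sample dimensions are assumptions for illustration.

```python
def letterbox_geometry(height: int, width: int, target: int):
    """Return (new_height, new_width, top, left) for centering a resized image
    on a target x target canvas while keeping its aspect ratio."""
    r = target / max(height, width)          # scale so the longer side equals target
    new_h, new_w = int(height * r), int(width * r)
    middle = target // 2
    top = middle - new_h // 2                # first row occupied by the image
    left = middle - new_w // 2               # first column occupied by the image
    return new_h, new_w, top, left


# A wide 256x1024 image placed on a 512x512 canvas.
geometry = letterbox_geometry(256, 1024, 512)
print(geometry)  # (128, 512, 192, 0)
```

The image fills the full width and occupies rows 192 to 319, leaving black bands above and below instead of being squeezed into a square.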

# Train images

In [None]:
!chmod -R 777 "original_data/train_images"
!chmod -R 777 "original_data/test_images"

^C


In [None]:
copy_dir("train", base_path="original_data/", Output_dir="v1-2")

Copying train images...


  0%|          | 0/51033 [00:00<?, ?it/s]

0/51033: 0.0000 Gb used so far
1000/51033: 0.0000 Gb used so far
2000/51033: 0.0000 Gb used so far
3000/51033: 0.0000 Gb used so far
4000/51033: 0.0000 Gb used so far
5000/51033: 0.0000 Gb used so far
6000/51033: 0.0000 Gb used so far
7000/51033: 0.0000 Gb used so far
8000/51033: 0.0000 Gb used so far
9000/51033: 0.0000 Gb used so far
10000/51033: 0.0000 Gb used so far
11000/51033: 0.0000 Gb used so far
12000/51033: 0.0000 Gb used so far
13000/51033: 0.0000 Gb used so far
14000/51033: 0.0000 Gb used so far
15000/51033: 0.0000 Gb used so far
16000/51033: 0.0000 Gb used so far
17000/51033: 0.0000 Gb used so far
18000/51033: 0.0000 Gb used so far
19000/51033: 0.0000 Gb used so far
20000/51033: 0.0000 Gb used so far
21000/51033: 0.0000 Gb used so far
22000/51033: 0.0000 Gb used so far
23000/51033: 0.0000 Gb used so far
24000/51033: 0.0000 Gb used so far
25000/51033: 0.0000 Gb used so far
26000/51033: 0.0000 Gb used so far
27000/51033: 0.0000 Gb used so far
28000/51033: 0.0000 Gb used so fa

In [None]:
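# copy_dir("train")  # call without base_path/Output_dir; the output below appears to come from an earlier version of copy_dir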

Copying train images...


  0%|          | 0/51033 [00:00<?, ?it/s]

0/51033: 0.0049 Gb used so far
1000/51033: 0.0945 Gb used so far
2000/51033: 0.1845 Gb used so far
3000/51033: 0.2748 Gb used so far
4000/51033: 0.3640 Gb used so far
5000/51033: 0.4525 Gb used so far
6000/51033: 0.5413 Gb used so far
7000/51033: 0.6303 Gb used so far
8000/51033: 0.7194 Gb used so far
9000/51033: 0.8089 Gb used so far
10000/51033: 0.8991 Gb used so far
11000/51033: 0.9906 Gb used so far
12000/51033: 1.0818 Gb used so far
13000/51033: 1.1725 Gb used so far
14000/51033: 1.2619 Gb used so far
15000/51033: 1.3526 Gb used so far
16000/51033: 1.4418 Gb used so far
17000/51033: 1.5314 Gb used so far
18000/51033: 1.6204 Gb used so far
19000/51033: 1.7104 Gb used so far
20000/51033: 1.7999 Gb used so far
21000/51033: 1.8886 Gb used so far
22000/51033: 1.9774 Gb used so far
23000/51033: 2.0664 Gb used so far
24000/51033: 2.1568 Gb used so far
25000/51033: 2.2456 Gb used so far
26000/51033: 2.3338 Gb used so far
27000/51033: 2.4238 Gb used so far
28000/51033: 2.5133 Gb used so fa

# Test images

In [None]:
copy_dir("test", base_path="original_data/", Output_dir="v1-2")

Copying test images...


  0%|          | 0/27956 [00:00<?, ?it/s]

0/27956: 0.0000 Gb used so far
1000/27956: 0.0000 Gb used so far
2000/27956: 0.0000 Gb used so far
3000/27956: 0.0000 Gb used so far
4000/27956: 0.0000 Gb used so far
5000/27956: 0.0000 Gb used so far
6000/27956: 0.0000 Gb used so far
7000/27956: 0.0000 Gb used so far
8000/27956: 0.0000 Gb used so far
9000/27956: 0.0000 Gb used so far
10000/27956: 0.0000 Gb used so far
11000/27956: 0.0000 Gb used so far
12000/27956: 0.0000 Gb used so far
13000/27956: 0.0000 Gb used so far
14000/27956: 0.0000 Gb used so far
15000/27956: 0.0000 Gb used so far
16000/27956: 0.0000 Gb used so far
17000/27956: 0.0000 Gb used so far
18000/27956: 0.0000 Gb used so far
19000/27956: 0.0000 Gb used so far
20000/27956: 0.0000 Gb used so far
21000/27956: 0.0000 Gb used so far
22000/27956: 0.0000 Gb used so far
23000/27956: 0.0000 Gb used so far
24000/27956: 0.0000 Gb used so far
25000/27956: 0.0000 Gb used so far
26000/27956: 0.0000 Gb used so far
27000/27956: 0.0000 Gb used so far


In [None]:
# copy_dir("test")

Copying test images...


  0%|          | 0/27956 [00:00<?, ?it/s]

0/27956: 4.5843 Gb used so far
1000/27956: 4.6741 Gb used so far
2000/27956: 4.7626 Gb used so far
3000/27956: 4.8508 Gb used so far
4000/27956: 4.9394 Gb used so far
5000/27956: 5.0282 Gb used so far
6000/27956: 5.1164 Gb used so far
7000/27956: 5.2066 Gb used so far
8000/27956: 5.2954 Gb used so far
9000/27956: 5.3852 Gb used so far
10000/27956: 5.4752 Gb used so far
11000/27956: 5.5626 Gb used so far
12000/27956: 5.6529 Gb used so far
13000/27956: 5.7403 Gb used so far
14000/27956: 5.8295 Gb used so far
15000/27956: 5.9172 Gb used so far
16000/27956: 6.0086 Gb used so far
17000/27956: 6.0988 Gb used so far
18000/27956: 6.1864 Gb used so far
19000/27956: 6.2766 Gb used so far
20000/27956: 6.3645 Gb used so far
21000/27956: 6.4550 Gb used so far
22000/27956: 6.5445 Gb used so far
23000/27956: 6.6329 Gb used so far
24000/27956: 6.7220 Gb used so far
25000/27956: 6.8113 Gb used so far
26000/27956: 6.9006 Gb used so far
27000/27956: 6.9889 Gb used so far


In [None]:
# if not TEST_MODE and USE_KAGGLE_DATASET:
#     !ls /tmp/{KAGGLE_DATASET_NAME}
#     verion_name = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
#     !kaggle datasets version -m {verion_name} -p /tmp/{KAGGLE_DATASET_NAME} -r zip -q

⬆️ You can access your dataset by clicking the link above ! ⬆️

In [None]:
if TEST_MODE:
    train = pd.read_csv(os.path.join(BASE_PATH,  "./train.csv"))
    path = os.path.join(BASE_PATH, "train_images")
    plot_images(
        train[train["image"].isin(list(os.listdir(path)))],
        row=5, col=5, 
        base_path=os.path.join(BASE_PATH, "train_images")
    )

![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Thats_all_folks.svg/2560px-Thats_all_folks.svg.png)