# Intro

* This kernel lets you efficiently convert all images from their tensor format into RGB images, then save them as 400x400 JPEGs inside two zip files (`train` and `test`).
* Feel free to customize this kernel as you wish. You can change the shape and extension of the final output image by changing the input arguments to `convert_to_rgb` and `build_new_df`.

### Notes

* In a previous version (V11) of the kernel, I claimed that the `rxrx.io.load_site_as_rgb` function was inefficient, and tried to provide a faster solution. It turns out I had not passed the correct argument, so the function was fetching the images directly from Google Storage; with the correct argument, the speed is comparable. **My sincere apologies for misleading everyone.**


### Updates

* V13: Changed output image size to 400 px instead of 224.

### Sources

* Found out about the loading functions from this kernel: https://www.kaggle.com/jesucristo/quick-visualization-eda

In [1]:
import os
import sys
import zipfile

import numpy as np
import pandas as pd
import cv2
import matplotlib.pyplot as plt
from tqdm import tqdm
from PIL import Image

# Preliminary

We also need to import `rxrx` (from the `rxrx1-utils` repository) in order to convert the tensors into images.

In [2]:
!git clone https://github.com/recursionpharma/rxrx1-utils
sys.path.append('rxrx1-utils')
import rxrx.io as rio

Cloning into 'rxrx1-utils'...
remote: Enumerating objects: 118, done.
remote: Total 118 (delta 0), reused 0 (delta 0), pack-reused 118
Receiving objects: 100% (118/118), 1.59 MiB | 0 bytes/s, done.
Resolving deltas: 100% (59/59), done.


We will need these folders later to store our JPEGs.

In [3]:
for folder in ['train', 'test']:
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists

!ls

__notebook__.ipynb  __output__.json  rxrx1-utils  test	train


In [4]:
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
print(train_df.shape)
print(test_df.shape)
train_df.head()

(36515, 5)
(19897, 4)


Unnamed: 0,id_code,experiment,plate,well,sirna
0,HEPG2-01_1_B03,HEPG2-01,1,B03,513
1,HEPG2-01_1_B04,HEPG2-01,1,B04,840
2,HEPG2-01_1_B05,HEPG2-01,1,B05,1020
3,HEPG2-01_1_B06,HEPG2-01,1,B06,254
4,HEPG2-01_1_B07,HEPG2-01,1,B07,144


In [5]:
train_df.tail()

Unnamed: 0,id_code,experiment,plate,well,sirna
36510,U2OS-03_4_O19,U2OS-03,4,O19,103
36511,U2OS-03_4_O20,U2OS-03,4,O20,202
36512,U2OS-03_4_O21,U2OS-03,4,O21,824
36513,U2OS-03_4_O22,U2OS-03,4,O22,328
36514,U2OS-03_4_O23,U2OS-03,4,O23,509


# Saving as JPEG

In [6]:
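def convert_to_rgb(df, split, resize=True, new_size=400, extension='jpeg'):
    """Save both sites of every row in `df` as RGB images inside the `split` folder."""
    N = df.shape[0]

    for i in tqdm(range(N)):
        # Label-based lookup; relies on the default 0..N-1 index from read_csv.
        code = df['id_code'][i]
        experiment = df['experiment'][i]
        plate = df['plate'][i]
        well = df['well'][i]

        for site in [1, 2]:
            save_path = f'{split}/{code}_s{site}.{extension}'

            # Load the site from the local input folder as an RGB array.
            im = rio.load_site_as_rgb(
                split, experiment, plate, well, site,
                base_path='../input/'
            )
            im = im.astype(np.uint8)
            im = Image.fromarray(im)

            if resize:
                im = im.resize((new_size, new_size), resample=Image.BILINEAR)

            # PIL picks the output format from the file extension.
            im.save(save_path)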
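
The per-image step inside `convert_to_rgb` (cast to `uint8`, wrap in a PIL image, resize) can be tried in isolation on synthetic data. This is only a minimal sketch, not part of the kernel: `array_to_resized_image` is a hypothetical helper, the 512x512 input shape is an assumption, and the `np.clip` call is an extra safeguard that the original function does not perform.

```python
import numpy as np
from PIL import Image


def array_to_resized_image(arr, new_size=400):
    """Clip a float RGB array to [0, 255], cast to uint8, and resize it."""
    # Clipping first avoids wrap-around when casting out-of-range floats.
    arr = np.clip(arr, 0, 255).astype(np.uint8)
    im = Image.fromarray(arr)
    return im.resize((new_size, new_size), resample=Image.BILINEAR)


# Synthetic stand-in for one site (assumed shape: 512x512 RGB, float values).
rng = np.random.default_rng(0)
fake_site = rng.uniform(0, 255, size=(512, 512, 3))

thumb = array_to_resized_image(fake_site, new_size=400)
print(thumb.size, thumb.mode)
```

Trying other values of `new_size` here is a cheap way to preview the output before launching the full conversion, which takes hours.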

In [7]:
convert_to_rgb(train_df, 'train')
convert_to_rgb(test_df, 'test')

 20%|██        | 7469/36515 [35:55<2:18:40,  3.49it/s]

# Zip everything

In [8]:
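def zip_and_remove(path):
    """Zip every file under `path` into `{path}.zip`, deleting each file once archived."""
    # The context manager closes the archive even if an error occurs midway.
    with zipfile.ZipFile(f'{path}.zip', 'w', zipfile.ZIP_DEFLATED) as ziph:
        for root, dirs, files in os.walk(path):
            for file in tqdm(files):
                file_path = os.path.join(root, file)
                ziph.write(file_path)
                os.remove(file_path)  # delete the original once it is archived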

In [9]:
zip_and_remove('train')
zip_and_remove('test')

100%|██████████| 73030/73030 [02:19<00:00, 522.77it/s]
100%|██████████| 39794/39794 [00:57<00:00, 687.36it/s]
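
Before relying on the archives, it may be worth checking that they contain the expected number of entries and that nothing is corrupt. The sketch below uses only the standard library; `summarize_zip` is a hypothetical helper, and the demo archive with dummy bytes exists only to keep the example self-contained.

```python
import os
import tempfile
import zipfile


def summarize_zip(zip_path):
    """Return (number of entries, name of the first corrupt member or None)."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
        bad = zf.testzip()  # None when every CRC check passes
    return len(names), bad


# Self-contained demo on a throwaway archive holding two dummy files.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
        for site in (1, 2):
            zf.writestr(f'train/DEMO-01_1_B03_s{site}.jpeg', b'not a real jpeg')
    n_entries, first_bad = summarize_zip(demo_zip)

print(n_entries, first_bad)
```

In this kernel, `summarize_zip('train.zip')` would be expected to report 73030 entries (two sites for each of the 36515 training rows), matching the progress bar above. Note that entries keep their folder prefix (for example `train/...`), because `ZipFile.write` stores the path it is given.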


# Create new labels

Since each row of the original data now yields two images (one per site), we also have to duplicate the labels so that there is one row per image.

In [10]:
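def build_new_df(df, extension='jpeg'):
    # Stack two copies of the labels: the first for site 1, the second for site 2.
    # ignore_index=True gives both objects the same fresh 0..2N-1 index,
    # so the column assignment below lines up row by row.
    new_df = pd.concat([df, df], ignore_index=True)
    new_df['filename'] = pd.concat([
        df['id_code'] + f'_s1.{extension}',
        df['id_code'] + f'_s2.{extension}'
    ], ignore_index=True)

    return new_df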


new_train = build_new_df(train_df)
new_test = build_new_df(test_df)

new_train.to_csv('new_train.csv', index=False)
new_test.to_csv('new_test.csv', index=False)
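
To see what the duplicated labels look like, here is a small sketch on a two-row stand-in for `train.csv`. It is an alternative formulation rather than the kernel's own code: the filename is attached to each per-site copy before concatenating, so the result does not depend on index alignment. The variable names `per_site` and `new_df` are illustrative.

```python
import pandas as pd

# Tiny stand-in for train.csv (only two of its columns, two rows).
df = pd.DataFrame({
    'id_code': ['HEPG2-01_1_B03', 'HEPG2-01_1_B04'],
    'sirna': [513, 840],
})

# One copy of the rows per site, each tagged with the matching filename.
per_site = [
    df.assign(filename=df['id_code'] + f'_s{site}.jpeg')
    for site in (1, 2)
]
new_df = pd.concat(per_site, ignore_index=True)

print(new_df)
```

Each label appears twice, once with the `_s1` filename and once with the `_s2` filename, which mirrors the names written by `convert_to_rgb`.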

# Remove the rxrx1 utils

We need to remove the cloned repository; otherwise, we will get an error when saving.

In [11]:
!rm -r rxrx1-utils