# Intro

* This kernel lets you efficiently load all images from their tensor format and save one file per site. As written, the cells below save each site as a 6-channel `.npy` array in `train_6ch` and `test_6ch`; the JPEG route (224x224 RGB images inside two zip files, `train` and `test`) is left commented out in `convert_to_rgb`.
* `rxrx.io.load_site_as_rgb` can be slow and memory-hungry; the `efficient_load` function defined in this kernel addresses both issues.
* Feel free to customize this kernel as you wish. You can change the output extension through the `extension` argument of `convert_to_rgb` and `build_new_df`; changing the output shape requires re-enabling the commented-out resize lines.


### Sources

* Found out about the loading functions from this kernel: https://www.kaggle.com/jesucristo/quick-visualization-eda

In [76]:
import os
import sys
import zipfile

import numpy as np
import pandas as pd
import cv2
import matplotlib.pyplot as plt
from tqdm import tqdm
from PIL import Image
import pickle

# Preliminary

We also need to import `rxrx` in order to convert the tensors into images.

In [2]:
!git clone https://github.com/recursionpharma/rxrx1-utils
sys.path.append('rxrx1-utils')
import rxrx.io as rio

fatal: destination path 'rxrx1-utils' already exists and is not an empty directory.


ModuleNotFoundError: No module named 'tensorflow'

We will need these folders later for storing the converted files.

In [3]:
!ls

new_test.csv					      test_6ch
new_test_6ch.csv				      test_controls.csv
new_train.csv					      test_rgb
new_train_6ch.csv				      train
pixel_stats.csv					      train.csv
recursion-2019-efficiently-load-entire-dataset.ipynb  train_6ch
recursion_dataset_license.pdf			      train_controls.csv
rxrx1-utils					      train_fold0.csv
sample_submission.csv				      train_rgb
test						      valid_fold0.csv
test.csv


In [25]:
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
print(train_df.shape)
print(test_df.shape)
train_df.head()

(36515, 5)
(19897, 4)


Unnamed: 0,id_code,experiment,plate,well,sirna
0,HEPG2-01_1_B03,HEPG2-01,1,B03,513
1,HEPG2-01_1_B04,HEPG2-01,1,B04,840
2,HEPG2-01_1_B05,HEPG2-01,1,B05,1020
3,HEPG2-01_1_B06,HEPG2-01,1,B06,254
4,HEPG2-01_1_B07,HEPG2-01,1,B07,144


In [26]:
train_df.tail()

Unnamed: 0,id_code,experiment,plate,well,sirna
36510,U2OS-03_4_O19,U2OS-03,4,O19,103
36511,U2OS-03_4_O20,U2OS-03,4,O20,202
36512,U2OS-03_4_O21,U2OS-03,4,O21,824
36513,U2OS-03_4_O22,U2OS-03,4,O22,328
36514,U2OS-03_4_O23,U2OS-03,4,O23,509


We define a utility function for loading the images. It opens each channel file of a site with `pillow` and writes it directly into a preallocated NumPy array.

In [14]:
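def efficient_load(dataset,
                   experiment,
                   plate,
                   well,
                   site,
                   channels=rio.DEFAULT_CHANNELS):
    """Load the channel PNGs of one site into a (6, 512, 512) uint8 array."""
    # Preallocate the output so each channel is written in place.
    # Planes of channels that are not requested stay uninitialized.
    site_img = np.empty((512, 512, 6), dtype=np.uint8)

    for channel in channels:
        path = f'{dataset}/{experiment}/Plate{plate}/{well}_s{site}_w{channel}.png'
        im = Image.open(path)
        # Channels are numbered from 1, array indices from 0.
        site_img[:, :, channel - 1] = im

    # Move the channel axis first: (512, 512, 6) -> (6, 512, 512).
    return site_img.transpose((2, 0, 1))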
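
To try the same loading pattern without the competition data, the sketch below writes six small synthetic grayscale PNGs to a temporary directory, using the same path layout, and stacks them with `pillow`. The helper name `load_site`, the 8x8 image size, and the fake pixel values are illustrative assumptions, not part of the original kernel.

```python
import os
import tempfile

import numpy as np
from PIL import Image


def load_site(root, experiment, plate, well, site,
              channels=(1, 2, 3, 4, 5, 6), size=8):
    """Stack one single-channel PNG per channel into a (C, H, W) uint8 array."""
    site_img = np.empty((len(channels), size, size), dtype=np.uint8)
    for idx, channel in enumerate(channels):
        path = os.path.join(root, experiment, f'Plate{plate}',
                            f'{well}_s{site}_w{channel}.png')
        with Image.open(path) as im:
            site_img[idx] = np.asarray(im)
    return site_img


# Build a tiny fake dataset: channel k is filled with the constant value 10 * k.
root = tempfile.mkdtemp()
plate_dir = os.path.join(root, 'HEPG2-01', 'Plate1')
os.makedirs(plate_dir)
for channel in range(1, 7):
    pixels = np.full((8, 8), 10 * channel, dtype=np.uint8)
    Image.fromarray(pixels).save(os.path.join(plate_dir, f'B03_s1_w{channel}.png'))

demo = load_site(root, 'HEPG2-01', 1, 'B03', 1)
print(demo.shape, demo[:, 0, 0])
```

Unlike `efficient_load`, this variant indexes planes by position in `channels`, so a subset of channels yields a smaller array with no uninitialized planes.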

# Comparing loading speed

Let's take a look at how fast each function is:

In [16]:
experiment = train_df['experiment'][1]
plate = train_df['plate'][1]
well = train_df['well'][1]
site = 2

%time img1 = rio.load_site_as_rgb('train', experiment, plate, well, site)
%time img2 = efficient_load('train', experiment, plate, well, site)

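# img1 is an RGB rendering and img2 holds the raw channels, so compare whole arrays
# with np.array_equal, which returns a single bool even when the shapes differ.
print("img1 is identical to img2:", np.array_equal(img1, img2))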

CPU times: user 189 ms, sys: 47.5 ms, total: 236 ms
Wall time: 1.38 s
CPU times: user 20.3 ms, sys: 0 ns, total: 20.3 ms
Wall time: 20 ms
img1 is identical to img2: False




In [17]:
img2.shape

(6, 512, 512)

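Our new method is much faster (about 20 ms versus 1.38 s of wall time in the run above) and also more memory efficient. The underlying loading function in `rxrx` is optimized for loading into TensorFlow, whereas our function uses `pillow` and fills the NumPy array as it goes. The comparison prints `False` because `efficient_load` returns the raw channels as a `(6, 512, 512)` array rather than an RGB image.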
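
A single `%time` call can be skewed by a cold disk cache, so repeating the measurement gives a fairer comparison. The harness below is a generic sketch: `best_time` and the two toy loaders are hypothetical stand-ins, and in the kernel you would pass lambdas that call `rio.load_site_as_rgb` and `efficient_load` instead.

```python
import time

import numpy as np


def best_time(fn, repeats=5):
    """Return the fastest wall-clock time (in seconds) over several calls to fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Toy stand-ins for the two loaders: both produce a (6, 512, 512) uint8 array.
def stacked_loader():
    return np.stack([np.zeros((512, 512), dtype=np.uint8) for _ in range(6)])


def preallocated_loader():
    return np.zeros((6, 512, 512), dtype=np.uint8)


t_stacked = best_time(stacked_loader)
t_prealloc = best_time(preallocated_loader)
print(f'stacked: {t_stacked * 1e3:.3f} ms, preallocated: {t_prealloc * 1e3:.3f} ms')
```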

# Saving as JPEG

In [35]:
# JPEG variant, kept for reference:
# def convert_to_rgb(df, split, resize=True, new_size=224, extension='jpeg'):
def convert_to_rgb(df, split, resize=True, extension='npy'):
    """Save both sites of every row in df as 6-channel arrays under '{split}_6ch/'."""
    N = df.shape[0]

    for i in tqdm(range(N)):
        code = df['id_code'][i]
        experiment = df['experiment'][i]
        plate = df['plate'][i]
        well = df['well'][i]

        for site in [1, 2]:
            save_path = f'{split}_6ch/{code}_s{site}.{extension}'
            im = efficient_load(split, experiment, plate, well, site)
            # np.save appends '.npy' if the path does not already end with it.
            np.save(save_path, im, allow_pickle=True, fix_imports=True)
            # `resize` is only used by the commented-out JPEG path below.
#             im = Image.fromarray(im)
#             if resize:
#                 im = im.resize((new_size, new_size), resample=Image.BILINEAR)
            
#             cv2.imwrite(save_path, im)

In [None]:
convert_to_rgb(train_df, 'train')
convert_to_rgb(test_df, 'test')

 90%|█████████ | 32945/36515 [1:55:25<11:18,  5.26it/s]  
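
The intro mentions packing the output into zip files. A minimal sketch of that step for `.npy` arrays is shown below, assuming each array is serialized in memory and written as one archive member. The archive lives in a `BytesIO` buffer here, and the member names and array contents are made up for illustration.

```python
import io
import zipfile

import numpy as np

# Two fake site arrays standing in for the output of efficient_load.
arrays = {
    'EXAMPLE-01_1_B03_s1.npy': np.zeros((6, 4, 4), dtype=np.uint8),
    'EXAMPLE-01_1_B03_s2.npy': np.ones((6, 4, 4), dtype=np.uint8),
}

archive = io.BytesIO()
with zipfile.ZipFile(archive, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
    for name, arr in arrays.items():
        buffer = io.BytesIO()
        np.save(buffer, arr)  # serialize in .npy format without touching disk
        zf.writestr(name, buffer.getvalue())

# Read one member back to confirm the round trip.
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
    restored = np.load(io.BytesIO(zf.read(names[1])))

print(names, restored.shape, restored.dtype)
```

To write a real file instead, pass a path such as `'train.zip'` to `zipfile.ZipFile` in place of the buffer.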

# Create new labels

Since each well now yields two files (one per site), we also have to duplicate our labels.

In [9]:
def build_new_df(df, extension='npy'):
    # Stack two copies of the metadata: the first for site 1, the second for site 2.
    new_df = pd.concat([df, df])
    new_df['filename'] = pd.concat([
        df['id_code'].apply(lambda string: string + f'_s1.{extension}'),
        df['id_code'].apply(lambda string: string + f'_s2.{extension}')
    ])
    # The cell type is the experiment name up to the first '-', e.g. 'HEPG2-01' -> 'HEPG2'.
    cell = df['experiment'].apply(lambda x: x.split('-')[0])
    new_df['cell'] = pd.concat([cell, cell])
    new_df['site'] = pd.concat([
        df['experiment'].apply(lambda x: 1),
        df['experiment'].apply(lambda x: 2)
    ])
    return new_df


new_train = build_new_df(train_df)
new_test = build_new_df(test_df)

new_train.to_csv('new_train_6ch.csv', index=False)
new_test.to_csv('new_test_6ch.csv', index=False)
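
For comparison, here is an alternative sketch of the same expansion on a toy frame: build one copy of the metadata per site with `assign`, then concatenate. The name `expand_sites` and the two toy rows are assumptions for illustration.

```python
import pandas as pd


def expand_sites(df, sites=(1, 2), extension='npy'):
    """Return one row per (well, site) with filename, cell and site columns."""
    parts = []
    for site in sites:
        parts.append(df.assign(
            filename=df['id_code'] + f'_s{site}.{extension}',
            cell=df['experiment'].str.split('-').str[0],
            site=site,
        ))
    return pd.concat(parts, ignore_index=True)


toy = pd.DataFrame({
    'id_code': ['HEPG2-01_1_B03', 'U2OS-03_4_O19'],
    'experiment': ['HEPG2-01', 'U2OS-03'],
    'plate': [1, 4],
    'well': ['B03', 'O19'],
    'sirna': [513, 103],
})
expanded = expand_sites(toy)
print(expanded[['filename', 'cell', 'site']])
```

One difference from `build_new_df`: `ignore_index=True` gives the result a fresh 0..N-1 index, whereas `build_new_df` keeps the original index values, so each label appears twice.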

In [10]:
new_train.head()

Unnamed: 0,id_code,experiment,plate,well,sirna,filename,cell,site
0,HEPG2-01_1_B03,HEPG2-01,1,B03,513,HEPG2-01_1_B03_s1.npy,HEPG2,1
1,HEPG2-01_1_B04,HEPG2-01,1,B04,840,HEPG2-01_1_B04_s1.npy,HEPG2,1
2,HEPG2-01_1_B05,HEPG2-01,1,B05,1020,HEPG2-01_1_B05_s1.npy,HEPG2,1
3,HEPG2-01_1_B06,HEPG2-01,1,B06,254,HEPG2-01_1_B06_s1.npy,HEPG2,1
4,HEPG2-01_1_B07,HEPG2-01,1,B07,144,HEPG2-01_1_B07_s1.npy,HEPG2,1


In [11]:
new_train.tail()

Unnamed: 0,id_code,experiment,plate,well,sirna,filename,cell,site
36510,U2OS-03_4_O19,U2OS-03,4,O19,103,U2OS-03_4_O19_s2.npy,U2OS,2
36511,U2OS-03_4_O20,U2OS-03,4,O20,202,U2OS-03_4_O20_s2.npy,U2OS,2
36512,U2OS-03_4_O21,U2OS-03,4,O21,824,U2OS-03_4_O21_s2.npy,U2OS,2
36513,U2OS-03_4_O22,U2OS-03,4,O22,328,U2OS-03_4_O22_s2.npy,U2OS,2
36514,U2OS-03_4_O23,U2OS-03,4,O23,509,U2OS-03_4_O23_s2.npy,U2OS,2


# Remove the rxrx1 utils

The cloned `rxrx1-utils` directory needs to be removed; otherwise we will get an error when saving.
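
One way to do the removal from Python is sketched below, assuming the repository was cloned into `rxrx1-utils` in the working directory as in the clone step earlier.

```python
import os
import shutil

utils_dir = 'rxrx1-utils'  # directory created by the git clone step

# ignore_errors=True makes this a no-op when the directory is already gone.
shutil.rmtree(utils_dir, ignore_errors=True)
print(os.path.exists(utils_dir))
```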

In [22]:
pix_df = pd.read_csv('../input/pixel_stats.csv')

In [77]:
pix_df.head()

Unnamed: 0,id_code,experiment,plate,well,site,channel,mean,std,median,min,max
0,HEPG2-01_1_B02,HEPG2-01,1,B02,1,1,71.063782,43.14624,67.0,7,255
1,HEPG2-01_1_B02,HEPG2-01,1,B02,1,2,32.174431,9.384594,31.0,6,98
2,HEPG2-01_1_B02,HEPG2-01,1,B02,1,3,61.836025,23.377997,59.0,11,255
3,HEPG2-01_1_B02,HEPG2-01,1,B02,1,4,56.983257,16.011435,56.0,11,156
4,HEPG2-01_1_B02,HEPG2-01,1,B02,1,5,91.671993,39.221836,85.0,13,255


In [85]:
mean_array = np.zeros(6)
std_array = np.zeros(6)
for i in range(6):
    mean_array[i] = pix_df[pix_df["channel"] == i+1]["mean"].mean()
    std_array[i] = np.sqrt((pix_df[pix_df["channel"] == i+1]["std"]**2).mean())

In [86]:
mean_array

array([ 5.84569159, 15.56796586, 10.10558294,  9.96439587,  5.57672051,
        9.06773161])

In [87]:
std_array

array([ 9.56803863, 13.35443191,  6.68432598,  8.67382883,  7.24327818,
        6.02148357])

In [81]:
std_array

array([6.62263268, 4.54647128, 3.67290432, 4.45053202, 5.53789205,
       3.48473303])
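
A note on the standard deviation: taking the root mean square of the per-image `std` values, as the loop over channels does, captures only the spread within each image. If all images have the same number of pixels, the law of total variance says the dataset-wide variance also includes the variance of the per-image means. The toy check below illustrates this on synthetic data; the fake images and all variable names are assumptions, and whether the extra term matters for normalization depends on your use case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five fake single-channel images with different brightness levels.
images = [rng.normal(loc=mu, scale=5.0, size=(16, 16)) for mu in (10, 20, 30, 40, 50)]

per_image_mean = np.array([img.mean() for img in images])
per_image_std = np.array([img.std() for img in images])

# Within-image part only (what sqrt(mean(std ** 2)) computes).
within_std = np.sqrt((per_image_std ** 2).mean())

# Within-image plus between-image part (law of total variance, equal-sized images).
total_std = np.sqrt((per_image_std ** 2).mean() + per_image_mean.var())

# Reference: standard deviation over all pixels at once.
global_std = np.concatenate([img.ravel() for img in images]).std()

print(within_std, total_std, global_std)
```

With `pixel_stats.csv`, the between-image term would be the variance of the `mean` column within each channel.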