# Image preprocessing

----

This notebook preprocesses the illumination-corrected raw images obtained from [IDR0033](https://idr.openmicroscopy.org/webclient/?show=screen-1751) for further analysis.

We will use the available metadata to filter out images that the authors of the [publication](https://elifesciences.org/articles/24060#s4) corresponding to the published imaging data manually identified as outliers, as well as images that did not pass their quality control steps. For more information on the workflows applied to identify those, please refer to the publication.

---

## 0. Environment setup

In [20]:
import os
import sys
import pandas as pd
import numpy as np
import tifffile

from collections import Counter


sys.path.append("..")

In [2]:
def rename_image_filenames(
    metadata,
    orig_col="Image_FileName_OrigHoechst",
    illum_col="Image_FileName_IllumHoechst",
    posfix="illum_corrected",
):
    """Add `illum_col` by inserting `posfix` before the first "." of each name in `orig_col`."""
    orig_image_file_names = list(metadata[orig_col])
    illum_corrected_image_file_names = []
    for orig_image_file_name in orig_image_file_names:
        # Position of the first dot, i.e. where the file extension starts
        idx = orig_image_file_name.index(".")
        illum_corrected_image_file_name = (
            orig_image_file_name[:idx] + posfix + orig_image_file_name[idx:]
        )
        illum_corrected_image_file_names.append(illum_corrected_image_file_name)
    # The new column is added to the passed dataframe in place
    metadata[illum_col] = illum_corrected_image_file_names
    return metadata

In [3]:
def filter_out_qc_flagged_items(
    metadata,
    blurry_col="Image_Metadata_QCFlag_isBlurry",
    saturated_col="Image_Metadata_QCFlag_isSaturated",
):
    """Return the rows of `metadata` that are flagged neither as blurry nor as saturated."""
    filtered_metadata = metadata.loc[
        (metadata[blurry_col] == 0) & (metadata[saturated_col] == 0)
    ]
    return filtered_metadata

In [4]:
def remove_outlier_items(
    metadata,
    outlier_plates=None,
    outlier_plate_wells=None,
    outlier_wells=None,
    plate_col="Image_Metadata_Plate",
    well_col="Image_Metadata_Well",
):
    """Return a copy of `metadata` without the given plates, (plate, well) pairs and wells."""
    metadata_orm = metadata.copy()
    # Drop all images of complete outlier plates
    if outlier_plates is not None:
        for outlier_plate in outlier_plates:
            metadata_orm = metadata_orm.loc[metadata_orm[plate_col] != outlier_plate]
    # Drop images of specific (plate, well) combinations
    if outlier_plate_wells is not None:
        for outlier_plate_well in outlier_plate_wells:
            metadata_orm = metadata_orm.loc[
                (metadata_orm[plate_col] != outlier_plate_well[0])
                | (metadata_orm[well_col] != outlier_plate_well[1])
            ]
    # Drop images of outlier wells irrespective of the plate
    if outlier_wells is not None:
        for outlier_well in outlier_wells:
            metadata_orm = metadata_orm.loc[metadata_orm[well_col] != outlier_well]
    return metadata_orm
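
The loop-based helper above can also be written with a single boolean mask. The following is a minimal sketch under the assumption that only complete plates and (plate, well) pairs need to be removed; the function name `remove_outliers_with_mask` and the toy dataframe are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def remove_outliers_with_mask(
    metadata,
    outlier_plates=(),
    outlier_plate_wells=(),
    plate_col="Image_Metadata_Plate",
    well_col="Image_Metadata_Well",
):
    """Hypothetical mask-based variant: drop outlier plates and (plate, well) pairs."""
    # True for every image that belongs to a complete outlier plate
    plate_mask = metadata[plate_col].isin(list(outlier_plates)).to_numpy()

    # True for every image whose (plate, well) pair is listed as an outlier
    pairs = [tuple(pair) for pair in outlier_plate_wells]
    if pairs:
        index = pd.MultiIndex.from_frame(metadata[[plate_col, well_col]])
        pair_mask = np.asarray(index.isin(pairs))
    else:
        pair_mask = np.zeros(len(metadata), dtype=bool)

    return metadata.loc[~(plate_mask | pair_mask)].copy()


# Toy data (made up) to illustrate the behaviour
toy = pd.DataFrame(
    {
        "Image_Metadata_Plate": [41749, 41754, 41754, 41757, 41744],
        "Image_Metadata_Well": ["a01", "b01", "b02", "e17", "b01"],
    }
)
kept = remove_outliers_with_mask(
    toy,
    outlier_plates=[41749],
    outlier_plate_wells=[[41754, "b01"], [41757, "e17"]],
)
print(kept)
```

In this toy example only the rows for plate 41754, well `b02`, and plate 41744, well `b01`, remain, since a well is removed only in combination with its plate.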

---

## 1. Read in data

First, we will read in the metadata, which tells us, for example, which image corresponds to which plate-well combination and which gene was targeted. The respective file was exported from a database for which the corresponding SQL script has been published alongside the imaging data. Please refer to the GitHub repository of the original publication for more information on how to set up the database. To derive the metadata file loaded in the following, we provide a custom SQL file (`extract_metadata.sql`).

In [5]:
metadata = pd.read_csv("../data/images/metadata/metadata_image_data.csv")
metadata.head()

Unnamed: 0,Image_Metadata_Plate,Image_Metadata_Well,Image_FileName_OrigHoechst,Image_Count_Nuclei,Image_Metadata_GeneID,Image_Metadata_GeneSymbol,Image_Metadata_IsLandmark,Image_Metadata_AlleleDesc,Image_Metadata_ExpressionVector,Image_Metadata_FlaggedForToxicity,...,Image_Metadata_IntendedOrfMismatch,Image_Metadata_OpenOrClosed,Image_Metadata_RNAiVirusPlateName,Image_Metadata_Site,Image_Metadata_TimePoint_Hours,Image_Metadata_Type,Image_Metadata_Virus_Vol_ul,Image_Metadata_ASSAY_WELL_ROLE,Image_Metadata_QCFlag_isBlurry,Image_Metadata_QCFlag_isSaturated
0,41744,k21,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...,60,1977.0,EIF4E,0.0,WT.2,pLX304,,...,,open,ORA11.12.13.18A,7,72H,ORF OE,1,Treated,0,0
1,41744,i13,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...,69,22943.0,DKK1,0.0,WT,pLX304,,...,,open,ORA11.12.13.18A,4,72H,ORF OE,1,Treated,0,0
2,41744,j16,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...,48,22926.0,ATF6,1.0,WT.1,pLX304,,...,,open,ORA11.12.13.18A,9,72H,ORF OE,1,Treated,0,0
3,41744,m07,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...,57,5045.0,FURIN,0.0,WT.2,pLX304,,...,,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0
4,41744,i04,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...,51,5599.0,MAPK8,0.0,WT.2,pLX304,,...,,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0


As suggested in the original publication, we will work with the images that were corrected for differing illumination conditions, which are also available.

We will now derive a new column `Image_FileName_IllumHoechst` from the `Image_FileName_OrigHoechst` entries in the metadata dataframe by adding the postfix `_illum_corrected`, so that this column holds the actual filenames of the `.tif` images that we will be working with.

In [12]:
posfix = "_illum_corrected.tif"
orig_col = "Image_FileName_OrigHoechst"
illum_col = "Image_FileName_IllumHoechst"

metadata = rename_image_filenames(
    metadata, orig_col=orig_col, illum_col=illum_col, posfix=posfix
)
metadata.head()

Unnamed: 0,Image_Metadata_Plate,Image_Metadata_Well,Image_FileName_OrigHoechst,Image_Count_Nuclei,Image_Metadata_GeneID,Image_Metadata_GeneSymbol,Image_Metadata_IsLandmark,Image_Metadata_AlleleDesc,Image_Metadata_ExpressionVector,Image_Metadata_FlaggedForToxicity,...,Image_Metadata_OpenOrClosed,Image_Metadata_RNAiVirusPlateName,Image_Metadata_Site,Image_Metadata_TimePoint_Hours,Image_Metadata_Type,Image_Metadata_Virus_Vol_ul,Image_Metadata_ASSAY_WELL_ROLE,Image_Metadata_QCFlag_isBlurry,Image_Metadata_QCFlag_isSaturated,Image_FileName_IllumHoechst
0,41744,k21,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...,60,1977.0,EIF4E,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,7,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...
1,41744,i13,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...,69,22943.0,DKK1,0.0,WT,pLX304,,...,open,ORA11.12.13.18A,4,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...
2,41744,j16,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...,48,22926.0,ATF6,1.0,WT.1,pLX304,,...,open,ORA11.12.13.18A,9,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...
3,41744,m07,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...,57,5045.0,FURIN,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...
4,41744,i04,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...,51,5599.0,MAPK8,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...
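
To make the renaming logic concrete, the following minimal sketch applies the same insert-before-the-first-dot rule as `rename_image_filenames` to two made-up file names. The toy file names and the helper name `insert_before_first_dot` are hypothetical and only serve as illustration.

```python
import pandas as pd


def insert_before_first_dot(file_name, posfix):
    """Insert `posfix` right before the first "." of `file_name`."""
    idx = file_name.index(".")
    return file_name[:idx] + posfix + file_name[idx:]


# Made-up file names; the real ones come from the metadata file
toy = pd.DataFrame(
    {"Image_FileName_OrigHoechst": ["plate_a01_s1_w1.tif", "plate_a01_s2_w1.tif"]}
)
toy["Image_FileName_IllumHoechst"] = toy["Image_FileName_OrigHoechst"].map(
    lambda name: insert_before_first_dot(name, "_illum_corrected")
)
print(toy["Image_FileName_IllumHoechst"].tolist())
```

Note that under this rule a postfix that itself ends in `.tif` yields names ending in `.tif.tif`; whether that matches the files on disk depends on how the corrected images were saved.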


---

## 2. Data filtering

### 2a. Filter out blurry or saturated images

Next, we will filter out images that were identified as blurry or saturated and thus do not meet the image quality standards, which we adopt from the authors of the original publication. The respective flags are also available in the metadata.

In [13]:
blurry_col = "Image_Metadata_QCFlag_isBlurry"
saturated_col = "Image_Metadata_QCFlag_isSaturated"

filtered_metadata = filter_out_qc_flagged_items(
    metadata, blurry_col=blurry_col, saturated_col=saturated_col
)
filtered_metadata.head()

Unnamed: 0,Image_Metadata_Plate,Image_Metadata_Well,Image_FileName_OrigHoechst,Image_Count_Nuclei,Image_Metadata_GeneID,Image_Metadata_GeneSymbol,Image_Metadata_IsLandmark,Image_Metadata_AlleleDesc,Image_Metadata_ExpressionVector,Image_Metadata_FlaggedForToxicity,...,Image_Metadata_OpenOrClosed,Image_Metadata_RNAiVirusPlateName,Image_Metadata_Site,Image_Metadata_TimePoint_Hours,Image_Metadata_Type,Image_Metadata_Virus_Vol_ul,Image_Metadata_ASSAY_WELL_ROLE,Image_Metadata_QCFlag_isBlurry,Image_Metadata_QCFlag_isSaturated,Image_FileName_IllumHoechst
0,41744,k21,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...,60,1977.0,EIF4E,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,7,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...
1,41744,i13,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...,69,22943.0,DKK1,0.0,WT,pLX304,,...,open,ORA11.12.13.18A,4,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...
2,41744,j16,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...,48,22926.0,ATF6,1.0,WT.1,pLX304,,...,open,ORA11.12.13.18A,9,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...
3,41744,m07,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...,57,5045.0,FURIN,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...
4,41744,i04,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...,51,5599.0,MAPK8,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...


In [14]:
print(
    "Images filtered out for not passing the quality standards: {}.".format(
        len(metadata) - len(filtered_metadata)
    )
)

Images filtered out for not passing the quality standards: 251.


As seen above, 251 images were identified as either blurry or saturated and should thus be excluded from the downstream analysis.
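
The total alone does not show how the two flags contribute. A minimal sketch of how one might break the count down is given below; it uses a small made-up dataframe in place of the real metadata, and the flag values are purely illustrative.

```python
import pandas as pd

blurry_col = "Image_Metadata_QCFlag_isBlurry"
saturated_col = "Image_Metadata_QCFlag_isSaturated"

# Toy flags (made up): one row per image, 1 means the image was flagged
toy = pd.DataFrame({blurry_col: [0, 1, 0, 1, 0], saturated_col: [0, 0, 1, 1, 0]})

# Cross-tabulate both flags to see how many images fall into each combination
flag_table = pd.crosstab(toy[blurry_col], toy[saturated_col])
print(flag_table)

# An image is dropped if at least one of the two flags is set
n_flagged = int(((toy[blurry_col] != 0) | (toy[saturated_col] != 0)).sum())
print(n_flagged)
```

Applied to `metadata`, the same cross-tabulation would separate images that are only blurry, only saturated, or both.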

---

### 2b. Filter out outlier images (manual selection)

In addition to the images that were flagged for not passing the quality control steps, the authors excluded two plate-well combinations and one complete plate from their analyses, as they identified those as outliers by visual inspection.

Those are the following:
* Plate 41749 (all wells)
* Plate 41754 (well B01)
* Plate 41757 (well E17)

Unfortunately, no description is given of which criteria were used to identify these outliers. When briefly looking at the data in the [IDR webclient](https://idr.openmicroscopy.org/webclient/?show=screen-1751), we do not see any remarkable abnormalities.

Nonetheless, we derive the subset of the dataset in which the items of this plate and these plate-well combinations are filtered out, to follow the preprocessing steps of the original publication. Whether we will use this subset or the larger set that includes those items in our final study is yet to be determined.

In [15]:
outlier_plates = [41749]
outlier_plate_wells = [[41754, "b01"], [41757, "e17"]]
outlier_wells = []
plate_col = "Image_Metadata_Plate"
well_col = "Image_Metadata_Well"


filtered_metadata_orm = remove_outlier_items(
    metadata,
    outlier_plates=outlier_plates,
    outlier_plate_wells=outlier_plate_wells,
    outlier_wells=outlier_wells,
    plate_col=plate_col,
    well_col=well_col,
)
filtered_metadata_orm.head()

Unnamed: 0,Image_Metadata_Plate,Image_Metadata_Well,Image_FileName_OrigHoechst,Image_Count_Nuclei,Image_Metadata_GeneID,Image_Metadata_GeneSymbol,Image_Metadata_IsLandmark,Image_Metadata_AlleleDesc,Image_Metadata_ExpressionVector,Image_Metadata_FlaggedForToxicity,...,Image_Metadata_OpenOrClosed,Image_Metadata_RNAiVirusPlateName,Image_Metadata_Site,Image_Metadata_TimePoint_Hours,Image_Metadata_Type,Image_Metadata_Virus_Vol_ul,Image_Metadata_ASSAY_WELL_ROLE,Image_Metadata_QCFlag_isBlurry,Image_Metadata_QCFlag_isSaturated,Image_FileName_IllumHoechst
0,41744,k21,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...,60,1977.0,EIF4E,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,7,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...
1,41744,i13,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...,69,22943.0,DKK1,0.0,WT,pLX304,,...,open,ORA11.12.13.18A,4,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...
2,41744,j16,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...,48,22926.0,ATF6,1.0,WT.1,pLX304,,...,open,ORA11.12.13.18A,9,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...
3,41744,m07,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...,57,5045.0,FURIN,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...
4,41744,i04,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...,51,5599.0,MAPK8,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...


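After removing the outliers, we are left with 1,918 unique plate-well combinations, for each of which 9 fields of view are available, leading to a total of 17,262 images. Note that the cell above passes `metadata` rather than `filtered_metadata` to `remove_outlier_items`, so the images flagged in 2a are still part of this count. Surprisingly, this is 36 images (4 plate-well combinations) more than what the publication describes as the final result of the image preprocessing.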
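
The numbers above are consistent with each other: 1,918 plate-well combinations times 9 fields of view gives 17,262 images. Such counts can be derived with a group-by over the plate and well columns. The sketch below shows this on a small made-up dataframe; the toy values and the variable `expected_sites` are assumptions for illustration (the toy uses 3 sites per well instead of 9).

```python
import pandas as pd

plate_col = "Image_Metadata_Plate"
well_col = "Image_Metadata_Well"

# Toy metadata (made up): one row per image
toy = pd.DataFrame(
    {
        plate_col: [41744, 41744, 41744, 41745, 41745, 41745],
        well_col: ["a01", "a01", "a02", "a01", "a01", "a01"],
        "Image_Metadata_Site": [1, 2, 1, 1, 2, 3],
    }
)

# Number of images available for each plate-well combination
images_per_well = toy.groupby([plate_col, well_col]).size()

n_wells = len(images_per_well)
n_images = int(images_per_well.sum())

# Plate-well combinations with fewer images than expected
expected_sites = 3
incomplete_wells = images_per_well[images_per_well < expected_sites]

print(n_wells, n_images)
print(incomplete_wells)
```

Running the same group-by on `filtered_metadata_orm` with `expected_sites = 9` would show whether every remaining plate-well combination has all of its fields of view.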

While our segmentation pipeline will differ from the one the authors used to segment the nuclei, we can already get a first impression of the size of the single-nuclei imaging dataset that we will be working with from the available metadata.

In [23]:
np.sum(list(filtered_metadata_orm["Image_Count_Nuclei"])), len(np.unique(filtered_metadata_orm["Image_Metadata_GeneSymbol"]))

(1278881, 194)

In [24]:
Counter(filtered_metadata_orm["Image_Metadata_GeneSymbol"])

Counter({'EIF4E': 90,
         'DKK1': 45,
         'ATF6': 135,
         'FURIN': 90,
         'MAPK8': 90,
         'CARD11': 135,
         'ATG16L1': 45,
         'TSC1': 90,
         'PAK1': 90,
         'XBP1': 180,
         'PRKAA1': 90,
         'MAP3K9': 45,
         'IKBKE': 90,
         'TGFBR1': 180,
         'RIPK1': 45,
         'EMPTY': 1557,
         'PSENEN': 45,
         'BRAF': 135,
         'DVL1': 90,
         'PER1': 90,
         'EGLN1': 135,
         'MOS': 90,
         'BMPR1B': 180,
         'Luciferase': 360,
         'CCND1': 90,
         'DVL3': 45,
         'TBK1': 90,
         'PIK3R1': 90,
         'PRKACA': 90,
         'NOTCH1': 135,
         'DEPTOR': 90,
         'TRAF6': 90,
         'PRKCZ': 180,
         'PRKACB': 135,
         'PRKACG': 135,
         'DIABLO': 45,
         'CHUK': 90,
         'LRPPRC': 45,
         'SLIRP': 90,
         'MKNK1': 45,
         'RAF1': 135,
         'GLI1': 45,
         'CYLD': 90,
         'JAK2': 90,
         'PKI

The authors obtained roughly 1.28 million nuclei, corresponding to the ORF overexpression of 193 genes and the control condition.
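
To see how the nuclei are distributed over the individual conditions, the per-image counts can be summed per gene symbol. The following is a minimal sketch on made-up values; treating `EMPTY` as the only control label is an assumption made for this example, not something stated in the metadata.

```python
import pandas as pd

gene_col = "Image_Metadata_GeneSymbol"
count_col = "Image_Count_Nuclei"

# Toy metadata (made up): one row per image
toy = pd.DataFrame(
    {
        gene_col: ["EIF4E", "EIF4E", "EMPTY", "DKK1", "EMPTY"],
        count_col: [60, 50, 40, 69, 30],
    }
)

# Total number of nuclei per gene symbol
nuclei_per_symbol = toy.groupby(gene_col)[count_col].sum()

# Assumption: these labels mark control wells rather than ORF constructs
control_labels = {"EMPTY"}
orf_symbols = [s for s in nuclei_per_symbol.index if s not in control_labels]

total_nuclei = int(nuclei_per_symbol.sum())
print(nuclei_per_symbol)
print(total_nuclei, len(orf_symbols))
```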

---

## 