# Inspect NaN features at each processing step
**Author:** Jessica Ewald <br>

Inspect the features that become NaNs at each step of the single-cell processing pipeline: raw profiles, well position correction, MAD normalization, and cell count regression.

Because many analytical methods (mAP, PCA, etc.) cannot handle missing values, I currently drop any feature that has even a single NaN after each step.
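A minimal sketch of this filter, assuming a pandas dataframe in which metadata columns carry the `Metadata_` prefix used in this notebook. The helper name `drop_nan_features` and the toy values are hypothetical and are not the pipeline's actual code.

```python
import numpy as np
import pandas as pd


def drop_nan_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every non-metadata column that contains at least one NaN."""
    feat_cols = [c for c in df.columns if "Metadata_" not in c]
    # A single missing value is enough to flag the whole feature
    nan_cols = [c for c in feat_cols if df[c].isna().any()]
    return df.drop(columns=nan_cols)


# Toy profiles (hypothetical values): one feature has a single NaN
toy = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A02", "A03"],
        "Cells_AreaShape_Area": [10.0, 12.0, 11.0],
        "Cells_Correlation_K_DNA_GFP": [0.5, np.nan, 0.7],
    }
)
filtered = drop_nan_features(toy)
print(list(filtered.columns))
```

The feature with one missing value is removed entirely, even though two of its three values are valid, which is why the identity of the dropped features matters.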

I want to know whether the dropped features are random or overrepresented in some categories, such that filtering them out would lose key information. If whole categories of features are being lost, it may be necessary to devise an imputation strategy or even to modify the processing functions (i.e. MAD, cell count regression) to better handle the edge cases that currently result in many NaNs.
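As an illustration of the kind of edge case meant here, the sketch below shows how a constant feature becomes an all-NaN column under a plain MAD scaling, and how a small term added to the denominator avoids it. This is an assumed mechanism for illustration only: the function `mad_normalize`, its `epsilon` parameter, and the toy values are hypothetical and are not taken from the pipeline's code.

```python
import numpy as np


def mad_normalize(x: np.ndarray, epsilon: float = 0.0) -> np.ndarray:
    """Center on the median and scale by the median absolute deviation."""
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    # Silence the divide-by-zero warnings so the NaNs can be inspected
    with np.errstate(divide="ignore", invalid="ignore"):
        return (x - median) / (mad + epsilon)


# A constant feature has MAD = 0, so every entry becomes 0 / 0
constant = np.array([3.0, 3.0, 3.0, 3.0])
plain = mad_normalize(constant)
guarded = mad_normalize(constant, epsilon=1e-6)

print(plain)
print(guarded)
```

With `epsilon=0.0` the whole column is NaN; with a small positive `epsilon` the same column is all zeros and would survive the NaN filter.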

In [35]:
# Imports
import pathlib
import pandas as pd
import numpy as np
import polars as pl
import seaborn as sns
from scipy.stats import hypergeom

import black
import jupyter_black

jupyter_black.load(
    lab=False,
    line_length=79,
    verbosity="DEBUG",
    target_version=black.TargetVersion.PY310,
)

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Define all paths and files
map_data_dir = pathlib.Path(
    "/dgx1nas1/storage/data/jess/varchamp/sc_data/map_data/"
).resolve(strict=True)
bl_path = map_data_dir / "bl_map_data.parquet"
well_path = map_data_dir / "well_map_data.parquet"
norm_path = map_data_dir / "norm_map_data.parquet"
cc_path = map_data_dir / "cc_map_data.parquet"

# Read in data
bl = pl.read_parquet(bl_path)
well = pl.read_parquet(well_path)
norm = pl.read_parquet(norm_path)
cc = pl.read_parquet(cc_path)

In [3]:
# Get the feature columns in each matrix
def get_feats(df):
    """Return feature columns, excluding metadata and mAP result columns."""
    map_cols = {"n_pos_pairs", "n_total_pairs", "average_precision"}
    return [
        i for i in df.columns if "Metadata_" not in i and i not in map_cols
    ]
    
feats_bl = get_feats(bl)
feats_well = get_feats(well)
feats_norm = get_feats(norm)
feats_cc = get_feats(cc)

In [4]:
# get raw feats (before initial NaN filter) using different approach
raw = pl.scan_parquet("/dgx1nas1/storage/data/jess/varchamp/sc_data/processed_profiles/B1A1R1_annotated.parquet")
feats_raw = [i for i in raw.columns if "Metadata_" not in i] 

In [5]:
# Count number of features after each step (NaNs always filtered out)
num_raw = len(feats_raw)
num_bl = len(feats_bl)
num_well = len(feats_well)
num_norm = len(feats_norm)
num_cc = len(feats_cc)

print(f'{num_raw - num_bl} features had at least one NaN in the raw profiles. {num_bl - num_well}, {num_well - num_norm}, and {num_norm - num_cc} additional features contained at least some NaNs after well position correction, MAD, and cell count regression respectively.')

173 features had at least one NaN in the raw profiles. 0, 44, and 591 additional features contained at least some NaNs after well position correction, MAD, and cell count regression respectively.


The features with NaNs in the raw profiles contained only a small number of them (i.e. for <1% of cells), whereas the features with NaNs after MAD normalization and cell count regression (CC) were missing for the entire column. These checks are not shown here to avoid reading large dataframes into memory.

In [13]:
# List the features with NaNs at each step
nan_bl = np.setdiff1d(np.array(feats_raw), np.array(feats_bl))
nan_norm = np.setdiff1d(np.array(feats_bl), np.array(feats_norm))
nan_cc = np.setdiff1d(np.array(feats_norm), np.array(feats_cc))

Next, parse each feature name to obtain information on the compartment, feature type, channel, etc. Store this information in dataframes.

In [31]:
# Define CP feature parse function
def parse_cp_features(
    feature: str, channels: list = ["DNA", "RNA", "AGP", "Mito", "ER", "mito_tubeness"]
):
    """Parses a CellProfiler feature string into its semantic components.

    This function takes a feature string and returns a dictionary containing its semantic components,
    specifically: the compartment, feature group, feature type, and channel.
    If the feature string is not in a recognized format, the function assigns 'XUNKNOWN' to the components it cannot parse.
    Channel and feature type are returned as 'XNONE' where they are not applicable.

    Parameters
    ----------
    feature : str
        The CellProfiler feature string to parse.

    channels : list, optional
        A list of channel names to use when parsing the feature string. The default is ['DNA', 'RNA', 'AGP', 'Mito', 'ER', "mito_tubeness"].

    Returns
    -------
    dict
        A dictionary with the following keys: 'feature', 'compartment', 'feature_group', 'feature_type', 'channel'.
        Each key maps to the respective component of the feature string.

    Raises
    ------
    ValueError
        Raised if the input is not a string.
    """

    if not isinstance(feature, str):
        raise ValueError(f"Expected a string, got {type(feature).__name__}")

    if not isinstance(channels, list):
        raise ValueError(f"Expected a list, got {type(channels).__name__}")

    def channel_standardizer(channel):
        channel = channel.replace("Orig", "")
        return channel

    unique_token = "XUNIQUEX"
    tokenized_feature = feature
    for channel in channels:
        tokenized_channel = channel.replace("_", unique_token)
        tokenized_feature = tokenized_feature.replace(channel, tokenized_channel)

    parts = tokenized_feature.split("_")

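    # feature_group is assigned in both branches below; reading parts[1]
    # up front would raise IndexError for names without an underscore
    if parts[0] not in ["Cells", "Cytoplasm", "Nuclei", "Image"]: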
        compartment = "XUNKNOWN"
        feature_group = "XUNKNOWN"
        feature_type = "XUNKNOWN"
        channel = "XUNKNOWN"
    else:
        compartment = parts[0]
        feature_group = parts[1]
        feature_type = "XNONE"  # default value
        channel = "XNONE"  # default value

        if feature_group in [
            "AreaShape",
            "Neighbors",
            "Children",
            "Parent",
            "Number",
            "Threshold",
            "ObjectSkeleton",
        ]:
            # Examples:
            # Cells,AreaShape,Zernike_2_0
            # Cells,AreaShape,BoundingBoxArea
            # Cells,Neighbors,AngleBetweenNeighbors_Adjacent
            # Nuclei,Children,Cytoplasm_Count
            # Nuclei,Parent,NucleiIncludingEdges
            # Nuclei,Number,ObjectNumber
            # Image,Threshold,SumOfEntropies_NucleiIncludingEdges
            # Nuclei,ObjectSkeleton,NumberTrunks_mito_skel

            feature_type = parts[2]

        elif feature_group == "Location":
            # Examples:
            # Cells,Location_CenterMassIntensity_X_DNA
            # Cells,Location_Center_X

            feature_type = parts[2]
            if feature_type != "Center":
                channel = parts[4]

        elif feature_group == "Count":
            # Examples:
            # Cells,Count,Cells
            pass

        elif feature_group == "Granularity":
            # Examples:
            # Cells,Granularity,15_ER
            channel = parts[3]

        elif feature_group in ["Intensity", "ImageQuality"]:
            # Examples:
            # Cells,Intensity,MeanIntensity_DNA
            # Image,ImageQuality,MaxIntensity_OrigAGP
            feature_type = parts[2]
            channel = parts[3]

        elif feature_group == "Correlation":
            # Examples:
            # Cells,Correlation,Correlation_DNA_ER
            feature_type = parts[2]
            channel = [parts[3], parts[4]]
            channel.sort()
            channel = "_".join(channel)

        elif feature_group in ["Texture", "RadialDistribution"]:
            # Examples:
            # Cells,Texture,SumEntropy_ER_3_01_256
            # Cells,RadialDistribution,FracAtD_mito_tubeness_2of16
            feature_type = parts[2]
            channel = parts[3]

        else:
            feature_group = "XUNKNOWN"
            feature_type = "XUNKNOWN"
            channel = "XUNKNOWN"

    channel = "_".join(list(map(channel_standardizer, channel.split("_"))))

    channel = channel.replace(unique_token, "_")

    return {
        "feature": feature,
        "compartment": compartment,
        "feature_group": feature_group,
        "feature_type": feature_type,
        "channel": channel,
    }
    
    
def parse_feat_list(feats, channels):
    """Parse each feature name and collect the results in a dataframe."""
    parsed = [parse_cp_features(feature=f, channels=channels) for f in feats]
    return pd.DataFrame(parsed)
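The `XUNIQUEX` token in `parse_cp_features` exists because some channel names contain underscores themselves. The short sketch below isolates that step, using the `mito_tubeness` example from the function's comments; the variable names are for illustration only.

```python
feature = "Cells_RadialDistribution_FracAtD_mito_tubeness_2of16"
channel = "mito_tubeness"
token = "XUNIQUEX"

# A naive split breaks the channel name into two separate parts
naive_parts = feature.split("_")

# Protect the underscore inside the channel name, split, then restore it
protected = feature.replace(channel, channel.replace("_", token))
parts = [p.replace(token, "_") for p in protected.split("_")]

print(naive_parts[3])  # mito
print(parts[3])  # mito_tubeness
```

Without the token, position 3 would hold only `mito` and every later position would shift by one.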

In [36]:
# parse the feature descriptors for each list
channels = ["DNA", "GFP", "AGP", "Mito", "mito_tubeness"]

# Universe for each step: the features present immediately upstream
# (parsed_all = all features produced by CellProfiler)
parsed_all = parse_feat_list(feats_raw, channels)
parsed_bl = parse_feat_list(feats_bl, channels)
parsed_norm = parse_feat_list(feats_norm, channels)

# NaNs induced at each step
parsed_nan_bl = parse_feat_list(list(nan_bl), channels)
parsed_nan_norm = parse_feat_list(list(nan_norm), channels)
parsed_nan_cc = parse_feat_list(list(nan_cc), channels)


We now have dataframes with the compartment, feature group, feature type, and channel of each feature that becomes a NaN at each step. However, it's hard to tell whether certain groups are overrepresented just by looking at them: it depends on the prevalence of each group in the full dataset. We can use a hypergeometric test to compute a p-value for each item within each column of the parsed NaN feature dataframes. 

In [53]:
def quantify_overlap(df_hits, df_universe):
    """Hypergeometric over-representation test for each annotation value.

    Notation (matches scipy.stats.hypergeom):
    M = universe size, n = set size in the universe,
    N = total hit number, x = set hit number.
    """
    res = []

    M = df_universe.shape[0]
    N = df_hits.shape[0]

    for column in ["compartment", "feature_group", "feature_type", "channel"]:
        for feat in df_hits[column].unique():
            # Exact matches only: str.count() treats the value as a regex
            # and also counts substrings (e.g. "Center" inside
            # "CenterMassIntensity"), which can inflate n and x.
            # Paired channels such as "DNA_GFP" are their own category.
            n = (df_universe[column] == feat).sum()
            x = (df_hits[column] == feat).sum()

            # Probability of drawing x or more hits: the survival function
            # sf(k) = P(X > k), so evaluate it at x - 1
            prb = hypergeom.sf(x - 1, M, n, N)
            res.append(
                {
                    "column": column,
                    "value": feat,
                    "pval": prb,
                    "expected_hits": (n / M) * N,
                    "actual_hits": x,
                    "set_size": n,
                }
            )

    return pd.DataFrame(res)
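To make the test concrete, here is a worked example for a single category, using the counts reported for the `Correlation` feature group in the `nan_ora_bl` output of this notebook. The universe size is inferred from that output as the sum of the three compartment set sizes, which is an assumption of this sketch.

```python
from scipy.stats import hypergeom

M = 1108 + 1091 + 994  # universe: Cells + Cytoplasm + Nuclei features
n = 180  # Correlation features in the universe
N = 173  # features dropped from the raw profiles
x = 90  # dropped features that are Correlation features

# Number of Correlation features expected among the dropped ones by chance
expected_hits = (n / M) * N

# P(X >= x) = P(X > x - 1), hence the survival function at x - 1
pval = hypergeom.sf(x - 1, M, n, N)

print(f"expected {expected_hits:.2f} hits, observed {x}, p = {pval:.3g}")
```

Roughly ten Correlation features would be expected among the dropped ones if the dropping were random, so observing 90 gives a very small p-value.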


Compute the probability (p-value) of randomly drawing the number of observed "hits" or more, to see if any compartment, feature type, channel, etc is overrepresented in the missing values. 

In [54]:
# Compute overlap p-values for each list of NaN feature annotations, relative to all feature annotations in the immediately upstream data
nan_ora_bl = quantify_overlap(parsed_nan_bl, parsed_all)
nan_ora_norm = quantify_overlap(parsed_nan_norm, parsed_bl)
nan_ora_cc = quantify_overlap(parsed_nan_cc, parsed_norm)

There are highly significant p-values in every case, indicating that specific feature types, compartments, etc. are converted to NaNs in non-random ways. While Jupyter will only print the top 20 rows, we can still see some informative examples in the first 20 rows of the baseline vs. raw profiles comparison (the baseline data has been filtered to remove any columns with NaNs in the CellProfiler profiles).

If surviving features in each category are correlated with the filtered features, this may not be a huge issue. However, here we can see that sometimes an entire category is filtered out (i.e. rows 6-11: all of the Costes, K, Overlap, AngleBetweenNeighbors, FirstClosestDistance, and SecondClosestDistance features are lost).

Full dataframes can be examined after running the notebook.
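A possible way to inspect a full result table and to flag categories that are lost entirely is sketched here. The rows are a subset copied from the `nan_ora_bl` output in this notebook; the `fraction_lost` column is a hypothetical addition for illustration.

```python
import pandas as pd

# Subset of rows copied from the nan_ora_bl output in this notebook
ora = pd.DataFrame(
    {
        "column": ["feature_group"] * 2 + ["feature_type"] * 3,
        "value": [
            "Correlation",
            "Neighbors",
            "Costes",
            "K",
            "AngleBetweenNeighbors",
        ],
        "actual_hits": [90, 9, 36, 36, 3],
        "set_size": [180, 21, 36, 36, 3],
    }
)

# A category is fully lost when every feature in the set was dropped
ora["fraction_lost"] = ora["actual_hits"] / ora["set_size"]
fully_lost = ora[ora["actual_hits"] == ora["set_size"]]

# Lift the row limit only inside this block so no rows are truncated
with pd.option_context("display.max_rows", None):
    print(ora.sort_values("fraction_lost", ascending=False))

print(fully_lost["value"].tolist())
```

Categories with `fraction_lost` equal to 1 have no surviving member that could stand in for the filtered features.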

In [62]:
# Jupyter will only print the first 20 rows, but still enough to see some important patterns
display(nan_ora_bl)

row,column,value,pval,expected_hits,actual_hits,set_size
0,compartment,Cells,0.9999859,60.032571,36,1108
1,compartment,Cytoplasm,6.651057e-13,59.111494,104,1091
2,compartment,Nuclei,0.9999139,53.855935,33,994
3,feature_group,Correlation,1.094518e-74,9.752584,90,180
4,feature_group,Neighbors,5.480469e-07,1.137801,9,21
5,feature_group,RadialDistribution,9.787108e-15,31.533354,74,582
6,feature_type,Costes,6.340333e-48,1.950517,36,36
7,feature_type,K,6.340333e-48,1.950517,36,36
8,feature_type,Overlap,6.797711e-24,0.975258,18,18
9,feature_type,AngleBetweenNeighbors,0.0001564523,0.162543,3,3
