# Calculate overlap between Varchamp and pooled-rare perturbations
Author: Jess Ewald

Varchamp and pooled-rare both investigate protein mislocalization in rare diseases. Some perturbations appear in both datasets, and this overlap could be used to examine reproducibility between them.

In [1]:
# Imports
import polars as pl
import os

import black
import jupyter_black

jupyter_black.load(
    lab=False,
    line_length=79,
    verbosity="DEBUG",
    target_version=black.TargetVersion.PY310,
)

import warnings
warnings.filterwarnings("ignore")


In [5]:
# Paths
varchamp_pm_path = "/dgx1nas1/storage/data/jess/varchamp/platemaps"
pooled_data_dir = "/dgx1nas1/storage/data/jess/pooled/sc_data/processed_profiles"
# Join with os.path.join to avoid a doubled "/" between directory and file name
pooled_dat_path = os.path.join(pooled_data_dir, "pilot_corrected_normalized_featselected.parquet")

In [24]:
# Read in Varchamp batch 1 platemaps
B7_plates = ["B7A1R1_P1.txt", "B7A1R1_P2.txt", "B7A1R1_P3.txt", "B7A1R1_P4.txt", "B7A2R1_P1.txt"]
# Read each tab-separated platemap and stack them into a single table
varchamp = pl.concat(
    [pl.read_csv(f"{varchamp_pm_path}/{plate}", separator="\t") for plate in B7_plates],
    how="vertical",
)

In [25]:
# Read in pooled-rare metadata
pooled = pl.scan_parquet(pooled_dat_path)
meta_cols = [col for col in pooled.columns if "Metadata_" in col]
pooled = pooled.select(meta_cols).collect()


In [26]:
# Get alleles in each dataset
pooled_alleles = pooled.select("Metadata_Foci_Barcode_MatchedTo_GeneCode").to_series().unique().to_list()
varchamp_alleles = varchamp.select("gene_allele").to_series().unique().to_list()

In [27]:
# Convert perturbation format: pooled-rare uses one-letter amino acid codes, while Varchamp uses three-letter abbreviations
aa_dict = {
    "Gly": "G",
    "Ala": "A",
    "Val": "V",
    "Leu": "L",
    "Ile": "I", 
    "Thr": "T",
    "Ser": "S",
    "Met": "M",
    "Cys": "C",
    "Pro": "P",
    "Phe": "F",
    "Tyr": "Y",
    "Trp": "W",
    "His": "H",
    "Lys": "K",
    "Arg": "R",
    "Asp": "D",
    "Glu": "E",
    "Asn": "N",
    "Gln": "Q"
}


In [33]:
# Convert Varchamp alleles (e.g. "GENE_Ala123Val") to the pooled-rare format (e.g. "GENE A123V")
varchamp_converted = []
for allele in varchamp_alleles:
    if not allele:
        continue  # skip null or empty entries
    temp = allele.split("_")
    # Entries without exactly one "_", and the "RASA1" special case, keep only the gene symbol
    if len(temp) != 2 or temp[1] == "RASA1":
        varchamp_converted.append(temp[0])
        continue
    perturb = temp[1]
    # Translate only the leading and trailing three-letter codes; the middle is left untouched
    first = aa_dict[perturb[:3]]
    second = aa_dict[perturb[-3:]]
    varchamp_converted.append(f"{temp[0]} {first}{perturb[3:-3]}{second}")
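
The conversion loop above could also be written as a small reusable function. The following is a minimal sketch, not the notebook's method: it assumes alleles look like `GENE_Ala123Val`, and the names `AA_3TO1`, `MISSENSE`, and `convert_allele` are hypothetical. Unlike the loop, it returns `None` instead of raising a `KeyError` when a three-letter code is not in the table (for example a stop codon written as `Ter`).

```python
import re

# Same three-letter to one-letter mapping as aa_dict (hypothetical name AA_3TO1).
AA_3TO1 = {
    "Gly": "G", "Ala": "A", "Val": "V", "Leu": "L", "Ile": "I",
    "Thr": "T", "Ser": "S", "Met": "M", "Cys": "C", "Pro": "P",
    "Phe": "F", "Tyr": "Y", "Trp": "W", "His": "H", "Lys": "K",
    "Arg": "R", "Asp": "D", "Glu": "E", "Asn": "N", "Gln": "Q",
}

# Assumed shape of a missense change: three-letter code, position, three-letter code.
MISSENSE = re.compile(r"^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def convert_allele(allele):
    """Convert 'GENE_Ala123Val' to 'GENE A123V'.

    Returns the gene symbol alone when the part after the first "_" is not a
    missense change, and None when it contains an unknown three-letter code.
    """
    gene, _, change = allele.partition("_")
    match = MISSENSE.match(change)
    if match is None:
        # No recognizable amino acid change: keep only the gene symbol.
        return gene
    ref, pos, alt = match.groups()
    if ref not in AA_3TO1 or alt not in AA_3TO1:
        # Unknown code: let the caller decide how to handle it.
        return None
    return f"{gene} {AA_3TO1[ref]}{pos}{AA_3TO1[alt]}"


# Usage with made-up allele names.
examples = ["GENE1_Ala123Val", "GENE1", "GENE2_RASA1", "GENE1_Arg12Ter"]
converted = [convert_allele(a) for a in examples]
```

Matching the whole change with a regular expression keeps the position digits separate from the codes, so nothing outside the first and last three letters can be altered by accident.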
        

In [36]:
print(len(varchamp_converted))
print(len(pooled_alleles))

pooled_allele_set = set(pooled_alleles)  # a set gives fast membership checks
allele_intersect = [value for value in varchamp_converted if value in pooled_allele_set]
print(len(allele_intersect))



1391
290
38


Of the 290 alleles in pooled-rare, 38 are also in Varchamp. It's possible that more pooled-rare alleles will be profiled in Varchamp in coming batches. 
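
To put the overlap in proportion, the shared count can be reported as a fraction of each dataset. The sketch below is illustrative only: `overlap_summary` is a hypothetical helper, the allele labels are made up, and it drops nulls and duplicates before counting, so its totals may differ slightly from the list lengths printed above.

```python
def overlap_summary(alleles_a, alleles_b):
    """Summarize the overlap between two collections of allele labels."""
    # Drop missing values and duplicates before comparing.
    set_a = {a for a in alleles_a if a is not None}
    set_b = {b for b in alleles_b if b is not None}
    shared = sorted(set_a & set_b)
    return {
        "shared": shared,
        "n_shared": len(shared),
        # Fraction of each collection that is also found in the other.
        "frac_of_a": len(shared) / len(set_a) if set_a else 0.0,
        "frac_of_b": len(shared) / len(set_b) if set_b else 0.0,
    }


# Toy example with made-up labels (not real data).
toy_varchamp = ["GENE1 A123V", "GENE1", "GENE2 R45H", None]
toy_pooled = ["GENE1 A123V", "GENE3 L7P"]
summary = overlap_summary(toy_varchamp, toy_pooled)
```

With the counts reported above, 38 of 290 pooled-rare alleles corresponds to roughly 13% of pooled-rare.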