# Curate perturbation dataset with `PerturbationCurator`

Here we use pertpy's `PerturbationCurator` to ensure that a perturbation dataset conforms to both the `CELLxGENE` schema (5.1.0) and pertpy's own criteria.
More specifically, `PerturbationCurator` builds upon [cellxgene-lamin](https://github.com/laminlabs/cellxgene-lamin) and extends it by additionally requiring `cell_line` and `X_treatments` columns for the perturbations.

This guide demonstrates how to curate a complex, real-world perturbation dataset from [McFarland et al. 2020](https://www.nature.com/articles/s41467-020-17440-w) using `PerturbationCurator`. See [lamindb's perturbation guide](https://docs.lamin.ai/perturbation) for more details.

In [1]:
#!pip install pertpy-datasets

In [2]:
# Using a local instance here but in practice, we use `laminlabs/pertpy-datasets`
!lamin init --storage ./test-perturbation --schema bionty,wetlab,ourprojects

[92m→[0m connected lamindb: zethson/test-perturbation


In [3]:
import lamindb as ln
import bionty as bt
import wetlab as wl
import ourprojects as opr
import pertpy_datasets as pts
import scanpy as sc

from lamindb.core import RecordList

[92m→[0m connected lamindb: zethson/test-perturbation


In [4]:
ln.track("HIRTYxL3aZc70000")

[92m→[0m created Transform('HIRTYxL3'), started new Run('ifbU3n87') at 2024-12-10 12:49:06 UTC
[92m→[0m notebook imports: bionty==0.53.2 lamindb==0.77.2 ourprojects==0.1.0 pertpy_datasets==0.1.0 scanpy==1.10.4 wetlab==0.37.0


In [5]:
adata = ln.Artifact.using("laminlabs/lamindata").get(uid="Xk7Qaik9vBLV4PKf0001").load()
adata.obs.head(3)

[92m→[0m completing transfer to track Artifact('Xk7Qaik9') as input
[92m→[0m mapped records: 
[92m→[0m transferred records: Artifact(uid='Xk7Qaik9vBLV4PKf0001'), Storage(uid='D9BilDV2')


Unnamed: 0,depmap_id,cancer,cell_det_rate,cell_line,cell_quality,channel,disease,dose_unit,dose_value,doublet_CL1,doublet_CL2,doublet_GMM_prob,doublet_dev_imp,doublet_z_margin,hash_assignment,hash_tag,num_SNPs,organism,perturbation,perturbation_type,sex,singlet_ID,singlet_dev,singlet_dev_z,singlet_margin,singlet_z_margin,time,tissue_type,tot_reads,nperts,ngenes,ncounts,percent_mito,percent_ribo,chembl-ID
AACTGGTGTCTCTCTG,ACH-000390,True,0.093159,LUDLU-1,normal,,lung cancer,µM,0.1,LUDLU1_LUNG,TE14_OESOPHAGUS,2.269468e-10,0.009426,0.403316,,,481,human,trametinib,drug,Male,LUDLU1_LUNG,0.655877,14.860933,0.462273,12.351139,24,cell_line,787,1,3045,12895.0,3.202792,24.955409,CHEMBL2103875
ATAGGCTCAGATTTCG,ACH-000444,True,0.145728,LU99,normal,2.0,lung cancer,µM,0.5,LU99_LUNG,MCAS_OVARY,0.0008562908,0.010173,0.188284,,,1003,human,afatinib,drug,Male,LU99_LUNG,0.762847,10.648094,0.47459,8.164565,24,cell_line,1597,1,4763,23161.0,7.473771,18.051898,CHEMBL1173655
GCCAAATCAAGCCGTC,ACH-000396,True,0.11733,J82,normal,,urinary bladder carcinoma,µM,0.1,J82_URINARY_TRACT,IGR1_SKIN,6.490367e-08,0.009686,1.185862,,,647,human,dabrafenib,drug,Male,J82_URINARY_TRACT,0.651059,14.740111,0.404508,11.188513,24,cell_line,1159,1,3834,18062.0,2.762706,22.08504,CHEMBL2028663


In [6]:
# Calculate an embedding because CELLxGENE requires one
sc.tl.pca(adata)

## Curate non-perturbation data

In [7]:
curator = pts.PerturbationCurator(
    adata, using_key="test-perturbation"
)
curator.validate()

[92m→[0m added defaults to the AnnData object: {'assay': 'unknown', 'cell_type': 'unknown', 'development_stage': 'unknown', 'donor_id': 'unknown', 'self_reported_ethnicity': 'unknown', 'suspension_type': 'cell', 'genetic_treatments': '', 'compound_treatments': '', 'environmental_treatments': '', 'combination_treatments': ''}
[92m✓[0m added 15 records with [3mFeature.name[0m for "columns": 'assay', 'cell_type', 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'sex', 'suspension_type', 'tissue_type', 'organism', 'cell_line', 'genetic_treatments', 'compound_treatments', 'environmental_treatments', 'combination_treatments'
[92m→[0m validating metadata using registries of instance [3mtest-perturbation[0m
[94m•[0m saving validated records of 'var_index'
[92m✓[0m added 1869 records [1;92mfrom public[0m with [3mGene.ensembl_gene_id[0m for "var_index": 'ENSG00000102316', 'ENSG00000109472', 'ENSG00000080007', 'ENSG00000203926', 'ENSG00000232301', 'ENSG0000

False

In [8]:
adata.obs

Unnamed: 0,depmap_id,cancer,cell_det_rate,cell_line,cell_quality,channel,disease,dose_unit,dose_value,doublet_CL1,doublet_CL2,doublet_GMM_prob,doublet_dev_imp,doublet_z_margin,hash_assignment,hash_tag,num_SNPs,organism,perturbation,perturbation_type,sex,singlet_ID,singlet_dev,singlet_dev_z,singlet_margin,singlet_z_margin,time,tissue_type,tot_reads,nperts,ngenes,ncounts,percent_mito,percent_ribo,chembl-ID,assay,cell_type,development_stage,donor_id,self_reported_ethnicity,suspension_type,genetic_treatments,compound_treatments,environmental_treatments,combination_treatments
AACTGGTGTCTCTCTG,ACH-000390,True,0.093159,LUDLU-1,normal,,lung cancer,µM,0.1,LUDLU1_LUNG,TE14_OESOPHAGUS,2.269468e-10,0.009426,0.403316,,,481,human,trametinib,drug,Male,LUDLU1_LUNG,0.655877,14.860933,0.462273,12.351139,24,cell_line,787,1,3045,12895.0,3.202792,24.955409,CHEMBL2103875,unknown,unknown,unknown,unknown,unknown,cell,,,,
ATAGGCTCAGATTTCG,ACH-000444,True,0.145728,LU99,normal,2,lung cancer,µM,0.5,LU99_LUNG,MCAS_OVARY,8.562908e-04,0.010173,0.188284,,,1003,human,afatinib,drug,Male,LU99_LUNG,0.762847,10.648094,0.474590,8.164565,24,cell_line,1597,1,4763,23161.0,7.473771,18.051898,CHEMBL1173655,unknown,unknown,unknown,unknown,unknown,cell,,,,
GCCAAATCAAGCCGTC,ACH-000396,True,0.117330,J82,normal,,urinary bladder carcinoma,µM,0.1,J82_URINARY_TRACT,IGR1_SKIN,6.490367e-08,0.009686,1.185862,,,647,human,dabrafenib,drug,Male,J82_URINARY_TRACT,0.651059,14.740111,0.404508,11.188513,24,cell_line,1159,1,3834,18062.0,2.762706,22.085040,CHEMBL2028663,unknown,unknown,unknown,unknown,unknown,cell,,,,
CGGAGAAGTCGCGTCA,ACH-000997,True,0.005422,HCT-15,low_quality,7,colorectal cancer,µM,0.1,HCT15_LARGE_INTESTINE,NCIH322_LUNG,,0.029753,0.000794,,,30,human,gemcitabine,drug,Male,HCT15_LARGE_INTESTINE,0.970247,2.852338,0.168971,0.833455,24,cell_line,76,1,178,726.0,70.247934,5.785124,CHEMBL888,unknown,unknown,unknown,unknown,unknown,cell,,,,
TAGTTGGAGATCGATA,ACH-000723,True,0.132708,YD-10B,low_quality,,head and neck cancer,,,YD10B_UPPER_AERODIGESTIVE_TRACT,647V_URINARY_TRACT,,0.156492,1.556214,,,874,human,sggpx4-2,CRISPR,Male,YD10B_UPPER_AERODIGESTIVE_TRACT,0.292802,3.272682,0.016459,0.330120,"72, 96",cell_line,2105,1,4341,20693.0,0.695887,16.242208,,unknown,unknown,unknown,unknown,unknown,cell,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GATCTAGTCATCGGAT,ACH-000015,True,0.094752,NCI-H1581,normal,,lung cancer,µM,0.1,NCIH1581_LUNG,HS766T_PANCREAS,2.865590e-08,0.007438,0.078045,,,480,human,dabrafenib,drug,Male,NCIH1581_LUNG,0.573389,19.212898,0.449900,17.285484,24,cell_line,697,1,3098,10575.0,6.780142,27.158392,CHEMBL2028663,unknown,unknown,unknown,unknown,unknown,cell,,,,
AACTCTTAGTCCCACG,ACH-000252,True,0.069540,LS1034,normal,,colorectal cancer,µM,0.0,LS1034_LARGE_INTESTINE,SNU1079_BILIARY_TRACT,1.436511e-02,0.036492,0.727242,,,366,human,control,drug,Male,LS1034_LARGE_INTESTINE,0.640246,10.526643,0.413657,8.703848,6,cell_line,616,0,2274,8281.0,7.837218,21.676126,,unknown,unknown,unknown,unknown,unknown,cell,,,,
CACTGTCCAGTCACGC,ACH-000681,True,0.094262,A549,normal,5,lung cancer,µM,2.5,A549_LUNG,MDAMB435S_SKIN,2.303331e-05,0.018360,0.089038,,,528,human,JQ1,drug,Male,A549_LUNG,0.714991,14.528289,0.487355,11.729031,24,cell_line,786,1,3081,10204.0,8.584869,17.679341,,unknown,unknown,unknown,unknown,unknown,cell,,,,
CACCTTGGTCGACTAT,ACH-000875,True,0.163557,NCI-H2347,normal,,lung cancer,µM,0.0,NCIH2347_LUNG,BICR6_UPPER_AERODIGESTIVE_TRACT,3.441387e-08,0.001044,0.314087,,,1230,human,control,drug,Female,NCIH2347_LUNG,0.721699,14.720316,0.525416,13.127789,24,cell_line,2591,0,5347,37053.0,3.902518,29.606240,,unknown,unknown,unknown,unknown,unknown,cell,,,,


In [9]:
adata.obs["sex"] = adata.obs["sex"].cat.rename_categories({"Unknown": "unknown"})
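
The rename above acts on the category labels of a categorical column. As a minimal sketch on a toy series (the values are made up for illustration and are not taken from the dataset), the following shows that only the mapped label changes. With a dict mapping, pandas passes unmapped categories through and ignores keys that are not present, so the call does nothing if `"Unknown"` does not occur.

```python
import pandas as pd

# Toy stand-in for adata.obs["sex"]; the values are illustrative only.
sex = pd.Series(["Male", "Unknown", "Female", "Unknown"], dtype="category")

# rename_categories relabels categories; rows keep their category assignment.
renamed = sex.cat.rename_categories({"Unknown": "unknown"})

# A key that is not an existing category is ignored.
unchanged = sex.cat.rename_categories({"not_present": "x"})

print(list(renamed))
print(list(renamed.cat.categories))
```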

In [10]:
efo_lo = bt.ExperimentalFactor.public().lookup()

In [11]:
adata.obs["assay"] = efo_lo.single_cell_rna_sequencing.name

In [12]:
adata = adata[:, ~adata.var_names.isin(curator.non_validated["var_index"])].copy()
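
The subsetting above keeps only the variables whose identifiers were validated. A small sketch of the same boolean-mask pattern on a plain pandas index, where `non_validated` is a hypothetical stand-in for `curator.non_validated["var_index"]` and `NOT_A_GENE` is an invented identifier:

```python
import pandas as pd

# Toy variable names: two Ensembl IDs from the log above plus one invented ID.
var_names = pd.Index(["ENSG00000102316", "ENSG00000109472", "NOT_A_GENE"])

# Hypothetical stand-in for curator.non_validated["var_index"].
non_validated = ["NOT_A_GENE"]

# isin marks the non-validated names; ~ inverts the mask to keep the rest.
keep = ~var_names.isin(non_validated)
kept_names = var_names[keep]

print(list(kept_names))
```

On an `AnnData` object the same mask is applied to the variable axis, and `.copy()` turns the resulting view into an independent object.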

In [13]:
# Recreate the curator because `adata` is now a new object
curator = pts.PerturbationCurator(adata, using_key="test-perturbation")
curator.validate()

[92m→[0m validating metadata using registries of instance [3mtest-perturbation[0m
[94m•[0m saving validated records of 'assay'
[92m✓[0m added 1 record [1;92mfrom public[0m with [3mExperimentalFactor.name[0m for "assay": 'single-cell RNA sequencing'
[92m✓[0m "var_index" is validated against [3mGene.ensembl_gene_id[0m
[92m✓[0m "assay" is validated against [3mExperimentalFactor.name[0m
[94m•[0m mapping "cell_type" on [3mCellType.name[0m
[93m![0m   [1;91m1 term[0m is not validated: [1;91m'unknown'[0m
    → fix typos, remove non-existent values, or save terms via [1;96m.add_new_from("cell_type")[0m
[94m•[0m mapping "development_stage" on [3mDevelopmentalStage.name[0m
[93m![0m   [1;91m1 term[0m is not validated: [1;91m'unknown'[0m
    → fix typos, remove non-existent values, or save terms via [1;96m.add_new_from("development_stage")[0m
[94m•[0m mapping "disease" on [3mDisease.name[0m
[93m![0m   [1;91m1 term[0m is not validated: [1;91m'panc

False

In [14]:
curator.standardize("all")
curator.add_new_from("all")

AttributeError: 'NoneType' object has no attribute 'name'

In [None]:
curator.validate()

[92m→[0m validating metadata using registries of instance [3mtest-perturbation[0m
[92m✓[0m "var_index" is validated against [3mGene.ensembl_gene_id[0m
[92m✓[0m "assay" is validated against [3mExperimentalFactor.name[0m
[92m✓[0m "cell_type" is validated against [3mCellType.name[0m
[92m✓[0m "development_stage" is validated against [3mDevelopmentalStage.name[0m
[92m✓[0m "disease" is validated against [3mDisease.name[0m
[92m✓[0m "donor_id" is validated against [3mULabel.name[0m
[92m✓[0m "self_reported_ethnicity" is validated against [3mEthnicity.name[0m
[92m✓[0m "sex" is validated against [3mPhenotype.name[0m
[92m✓[0m "suspension_type" is validated against [3mULabel.name[0m
[92m✓[0m "tissue_type" is validated against [3mULabel.name[0m
[92m✓[0m "organism" is validated against [3mOrganism.name[0m
[92m✓[0m "cell_line" is validated against [3mCellLine.name[0m
[92m✓[0m "genetic_treatments" is validated against [3mGeneticPerturbation.name[0

True

All treatment columns validate, but only because they are all empty.

## Curate perturbations

In [None]:
# Move each perturbation into the treatment column that matches its type
adata.obs["genetic_treatments"] = adata.obs["perturbation"].where(
    adata.obs["perturbation_type"] == "CRISPR", None
)
adata.obs["compound_treatments"] = adata.obs["perturbation"].where(
    adata.obs["perturbation_type"] == "drug", None
)
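
To illustrate what the two `where` calls do, here is a minimal sketch on a toy frame. The rows are invented, but the perturbation names and types mirror values seen in `adata.obs` above. `Series.where` keeps a value where the condition holds and inserts a missing value elsewhere, so each cell ends up with an entry in exactly one of the two columns.

```python
import pandas as pd

# Toy stand-in for adata.obs with the two source columns.
obs = pd.DataFrame(
    {
        "perturbation": ["trametinib", "sggpx4-2", "control", "sglacz"],
        "perturbation_type": ["drug", "CRISPR", "drug", "CRISPR"],
    }
)

# Keep the perturbation name only where the type matches; other rows become NaN.
obs["genetic_treatments"] = obs["perturbation"].where(
    obs["perturbation_type"] == "CRISPR"
)
obs["compound_treatments"] = obs["perturbation"].where(
    obs["perturbation_type"] == "drug"
)

print(obs)
```

Note that `control` carries the type `drug` in this dataset, so it lands in `compound_treatments`.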

### Genetic treatments

In [None]:
list(adata.obs["genetic_treatments"].unique())

[nan, 'sggpx4-2', 'sglacz', 'sggpx4-1', 'sgor2j2']

In [None]:
treatments = [
    ("sggpx4-1", "GPX4", "Glutathione Peroxidase 4"),
    ("sggpx4-2", "GPX4", "Glutathione Peroxidase 4"),
    ("sgor2j2", "or2j2", "Olfactory receptor family 2 subfamily J member 2"),
    ("sglacz", "lacz", "beta-galactosidase control"),  # Control from E. coli
]
organism = bt.Organism.lookup().human

genetic_treatments = []
for name, symbol, target_name in treatments:
    treatment = wl.GeneticTreatment(system="CRISPR-Cas9", name=name).save()
    if symbol != "lacz":
        gene_result = bt.Gene.from_source(symbol=symbol, organism=organism)
        gene = gene_result[0] if isinstance(gene_result, RecordList) else gene_result
        gene = gene.save()
    else:
        gene = bt.Gene(symbol=symbol, organism=organism).save()
    target = wl.TreatmentTarget(name=target_name).save()
    target.genes.add(gene)
    treatment.targets.add(target)
    genetic_treatments.append(treatment)

[92m✓[0m created [1;95m1 Gene record from Bionty[0m matching [3msymbol[0m: [1;95m'GPX4'[0m
[93m![0m record with similar name exists! did you mean to load it?


id,uid,name,system,sequence,on_target_score,off_target_score,run_id,created_at,created_by_id
1,WemtRNZv2ziV,sggpx4-1,CRISPR-Cas9,,,,1,2024-12-10 12:44:57.321264+00:00,1


[92m→[0m returning existing PerturbationTarget record with same name: 'Glutathione Peroxidase 4'
[92m✓[0m created [1;95m1 Gene record from Bionty[0m matching [3msynonyms[0m: [1;95m'or2j2'[0m
[93m![0m ambiguous validation in Bionty for 1 record: 'OR2J2'


TypeError: Field 'id' expected a number but got [Gene(uid='4U5zlMFedTR1', symbol='OR2J2', ensembl_gene_id='ENSG00000196231', ncbi_gene_ids='26707', biotype='protein_coding', synonyms='DJ80I19.4|HS6M1-6|OR6-8', description='olfactory receptor family 2 subfamily J member 2 ', created_by_id=1, run_id=1, source_id=11, organism_id=1, created_at=2024-12-10 12:45:01 UTC), Gene(uid='2cT5bXP0JAXr', symbol='OR2J2', ensembl_gene_id='ENSG00000204700', ncbi_gene_ids='26707', biotype='protein_coding', synonyms='DJ80I19.4|HS6M1-6|OR6-8', description='olfactory receptor family 2 subfamily J member 2 ', created_by_id=1, run_id=1, source_id=11, organism_id=1, created_at=2024-12-10 12:45:01 UTC), Gene(uid='1NVMcJPLrAlr', symbol='OR2J2', ensembl_gene_id='ENSG00000225550', ncbi_gene_ids='26707', biotype='protein_coding', synonyms='DJ80I19.4|HS6M1-6|OR6-8', description='olfactory receptor family 2 subfamily J member 2 ', created_by_id=1, run_id=1, source_id=11, organism_id=1, created_at=2024-12-10 12:45:01 UTC), Gene(uid='Nt62Ro62g1VZ', symbol='OR2J2', ensembl_gene_id='ENSG00000226000', ncbi_gene_ids='26707', biotype='protein_coding', synonyms='DJ80I19.4|HS6M1-6|OR6-8', description='olfactory receptor family 2 subfamily J member 2 ', created_by_id=1, run_id=1, source_id=11, organism_id=1, created_at=2024-12-10 12:45:01 UTC), Gene(uid='3wXVmsWB2nY2', symbol='OR2J2', ensembl_gene_id='ENSG00000226347', ncbi_gene_ids='26707', biotype='protein_coding', synonyms='DJ80I19.4|HS6M1-6|OR6-8', description='olfactory receptor family 2 subfamily J member 2 ', created_by_id=1, run_id=1, source_id=11, organism_id=1, created_at=2024-12-10 12:45:01 UTC), Gene(uid='4CkZekReK7p6', symbol='OR2J2', ensembl_gene_id='ENSG00000231676', ncbi_gene_ids='26707', biotype='protein_coding', synonyms='DJ80I19.4|HS6M1-6|OR6-8', description='olfactory receptor family 2 subfamily J member 2 ', created_by_id=1, run_id=1, source_id=11, organism_id=1, created_at=2024-12-10 12:45:01 UTC), Gene(uid='2ttz4jYNb9Oo', 
symbol='OR2J2', ensembl_gene_id='ENSG00000232945', ncbi_gene_ids='26707', biotype='protein_coding', synonyms='DJ80I19.4|HS6M1-6|OR6-8', description='olfactory receptor family 2 subfamily J member 2 ', created_by_id=1, run_id=1, source_id=11, organism_id=1, created_at=2024-12-10 12:45:01 UTC), Gene(uid='5iYHfK0bEf2e', symbol='OR2J2', ensembl_gene_id='ENSG00000234746', ncbi_gene_ids='26707', biotype='protein_coding', synonyms='DJ80I19.4|HS6M1-6|OR6-8', description='olfactory receptor family 2 subfamily J member 2 ', created_by_id=1, run_id=1, source_id=11, organism_id=1, created_at=2024-12-10 12:45:01 UTC)].
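
The list of `Gene` records in the error message suggests that an ambiguous lookup (several Ensembl IDs for `OR2J2`) returned multiple records, which then reached a call expecting a single one. As a hedged sketch in plain Python, a small helper could normalize "one record or a sequence of records" to a single record. `first_record` is a hypothetical helper, not part of lamindb or bionty, and picking the first match is an arbitrary choice that may not be the biologically appropriate gene.

```python
from collections.abc import Sequence


def first_record(result):
    """Return one record from a lookup that may yield a single record or several.

    Hypothetical helper for illustration; it is not a lamindb or bionty API.
    """
    # Lists and list-like containers count as sequences; strings are excluded
    # so that a plain string is treated as a single value.
    if isinstance(result, Sequence) and not isinstance(result, (str, bytes)):
        if len(result) == 0:
            raise ValueError("lookup returned no records")
        return result[0]
    return result


print(first_record(["ENSG00000196231", "ENSG00000204700"]))
print(first_record("ENSG00000196231"))
```

For ambiguous symbols, resolving by a unique identifier such as the Ensembl gene ID is likely the more robust route than taking the first match.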

In [None]:
curator.validate()

[92m→[0m validating metadata using registries of instance [3mtest-perturbation[0m
[92m✓[0m 'var_index' is validated against [3mGene.ensembl_gene_id[0m
[92m✓[0m 'assay' is validated against [3mExperimentalFactor.name[0m
[92m✓[0m 'cell_type' is validated against [3mCellType.name[0m
[92m✓[0m 'development_stage' is validated against [3mDevelopmentalStage.name[0m
[92m✓[0m 'disease' is validated against [3mDisease.name[0m
[92m✓[0m 'donor_id' is validated against [3mULabel.name[0m
[92m✓[0m 'self_reported_ethnicity' is validated against [3mEthnicity.name[0m
[92m✓[0m 'sex' is validated against [3mPhenotype.name[0m
[92m✓[0m 'suspension_type' is validated against [3mULabel.name[0m
[92m✓[0m 'tissue_type' is validated against [3mULabel.name[0m
[92m✓[0m 'organism' is validated against [3mOrganism.name[0m
[92m✓[0m 'cell_line' is validated against [3mCellLine.name[0m
[92m✓[0m 'genetic_treatments' is validated against [3mGeneticTreatment.name[0m


False

### Compounds

In [None]:
compounds = wl.Compound.from_values(adata.obs["compound_treatments"], field="name")

[92m✓[0m created [1;95m8 Compound records from Bionty[0m matching [3mname[0m: [1;95m'trametinib', 'afatinib', 'dabrafenib', 'gemcitabine', 'navitoclax', 'bortezomib', 'JQ1', 'everolimus'[0m
[93m![0m [1;91mdid not create[0m Compound records for [1;93m6 non-validated[0m [3mnames[0m: [1;93m'azd5591', 'brd3379', 'control', 'idasanutlin', 'prexasertib', 'taselisib'[0m


In [None]:
# The remaining compounds are not in ChEBI, so we create records for them manually
for missing in [
    "azd5591",
    "brd3379",
    "control",
    "idasanutlin",
    "prexasertib",
    "taselisib",
]:
    compounds.append(wl.Compound(name=missing))
ln.save(compounds)

In [None]:
drug_metadata = adata.obs[adata.obs["compound_treatments"].notna()]

unique_treatments = drug_metadata[
    ["perturbation", "dose_unit", "dose_value"]
].drop_duplicates()

compound_treatments = []
for _, row in unique_treatments.iterrows():
    compound = wl.Compound.get(name=row["perturbation"])
    treatment = wl.CompoundTreatment(
        name=compound.name,
        concentration=row["dose_value"],
        concentration_unit=row["dose_unit"],
    )
    compound_treatments.append(treatment)

ln.save(compound_treatments)
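
The cell above creates one treatment per unique combination of compound, dose unit, and dose value. A minimal sketch of that deduplication step on invented rows:

```python
import pandas as pd

# Toy stand-in for the drug-treated subset of adata.obs; rows are illustrative.
drug_metadata = pd.DataFrame(
    {
        "perturbation": ["trametinib", "trametinib", "afatinib", "trametinib"],
        "dose_unit": ["µM", "µM", "µM", "µM"],
        "dose_value": [0.1, 0.1, 0.5, 0.5],
    }
)

# One row per distinct (compound, unit, dose) combination.
unique_treatments = drug_metadata[
    ["perturbation", "dose_unit", "dose_value"]
].drop_duplicates()

print(unique_treatments)
```

In this toy example `trametinib` appears at two doses and therefore yields two rows. If the real data contain a compound at more than one dose, the resulting treatments would share a name, and a later lookup by name alone may match more than one record.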

In [None]:
compounds_to_targets = {
    "trametinib": ("MAPK/ERK pathway", ["P36507"]),
    "afatinib": ("EGFR, HER2, HER4 signaling", ["P00533", "Q9UK79", "Q15303"]),
    "dabrafenib": ("MAPK/ERK pathway", ["P15056"]),
    "gemcitabine": ("DNA synthesis inhibition", ["P23921"]),  # No single protein target
    "navitoclax": ("Apoptosis regulation", ["P10415", "Q07812"]),
    "bortezomib": ("Proteasome pathway", ["P49721"]),
    "brd3379": ("Transcription regulation (BET proteins)", ["O60885"]),
    "JQ1": ("Transcription regulation (BET proteins)", ["O60885"]),
    "azd5591": ("Apoptosis regulation", ["Q07820"]),
    "control": ("Baseline", [None]),  # No target for control
    "prexasertib": ("DNA damage response", ["O14757"]),
    "taselisib": ("PI3K/AKT/mTOR pathway", ["P42336", "O00329", "P48736"]),
    "idasanutlin": ("p53 regulation", ["Q00987"]),
    "everolimus": ("mTOR pathway", ["P42345"])
}

for compound_treatment_name, targets_tuple in compounds_to_targets.items():
    compound_treatment = wl.CompoundTreatment.get(name=compound_treatment_name)
    target = wl.TreatmentTarget(name=targets_tuple[0]).save()
    proteins = []
    for uniprot_id in targets_tuple[1]:
        if uniprot_id is not None:
            proteins.append(bt.Protein.from_source(uniprotkb_id=uniprot_id).save())
    target.proteins.set(proteins)
    compound_treatment.targets.add(target)

[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P36507'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P00533'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'Q9UK79'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'Q15303'[0m
[92m→[0m returning existing TreatmentTarget record with same name: 'MAPK/ERK pathway'
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P15056'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P23921'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P10415'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'Q07812'[0m
[93m![0m record with similar n

id,uid,name,description,run_id,created_at,created_by_id
4,i6XpYvMC,MAPK/ERK pathway,,2,2024-11-11 15:10:57.608736+00:00,1


[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P49721'[0m
[93m![0m record with similar name exists! did you mean to load it?


id,uid,name,description,run_id,created_at,created_by_id
7,SnJAC9fR,Apoptosis regulation,,2,2024-11-11 15:11:04.651378+00:00,1


[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'O60885'[0m
[92m→[0m returning existing TreatmentTarget record with same name: 'Transcription regulation (BET proteins)'
[92m→[0m returning existing TreatmentTarget record with same name: 'Apoptosis regulation'
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'Q07820'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'O14757'[0m
[93m![0m records with similar names exist! did you mean to load one of them?


id,uid,name,description,run_id,created_at,created_by_id
4,i6XpYvMC,MAPK/ERK pathway,,2,2024-11-11 15:10:57.608736+00:00,1
8,fuAhPhXj,Proteasome pathway,,2,2024-11-11 15:11:07.553672+00:00,1


[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P42336'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'O00329'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P48736'[0m
[93m![0m records with similar names exist! did you mean to load one of them?


id,uid,name,description,run_id,created_at,created_by_id
7,SnJAC9fR,Apoptosis regulation,,2,2024-11-11 15:11:04.651378+00:00,1
9,5KWr4lMW,Transcription regulation (BET proteins),,2,2024-11-11 15:11:08.744610+00:00,1


[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'Q00987'[0m
[93m![0m records with similar names exist! did you mean to load one of them?


id,uid,name,description,run_id,created_at,created_by_id
12,irV5vyHx,PI3K/AKT/mTOR pathway,,2,2024-11-11 15:11:12.919341+00:00,1
4,i6XpYvMC,MAPK/ERK pathway,,2,2024-11-11 15:10:57.608736+00:00,1
8,fuAhPhXj,Proteasome pathway,,2,2024-11-11 15:11:07.553672+00:00,1


[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P42345'[0m


## References

In [None]:
reference = opr.Reference(
    name="Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action",
    abbr="McFarland 2020",
    url="https://www.nature.com/articles/s41467-020-17440-w",
    doi="10.1038/s41467-020-17440-w",
    text=(
        "Assays to study cancer cell responses to pharmacologic or genetic perturbations are typically "
        "restricted to using simple phenotypic readouts such as proliferation rate. Information-rich assays, "
        "such as gene-expression profiling, have generally not permitted efficient profiling of a given "
        "perturbation across multiple cellular contexts. Here, we develop MIX-Seq, a method for multiplexed "
        "transcriptional profiling of post-perturbation responses across a mixture of samples with single-cell "
        "resolution, using SNP-based computational demultiplexing of single-cell RNA-sequencing data. We show "
        "that MIX-Seq can be used to profile responses to chemical or genetic perturbations across pools of 100 "
        "or more cancer cell lines. We combine it with Cell Hashing to further multiplex additional experimental "
        "conditions, such as post-treatment time points or drug doses. Analyzing the high-content readout of "
        "scRNA-seq reveals both shared and context-specific transcriptional response components that can identify "
        "drug mechanism of action and enable prediction of long-term cell viability from short-term transcriptional "
        "responses to treatment."
    ),
).save()

## Remove unused columns

In [None]:
adata.obs = adata.obs.drop(
    [
        "depmap_id",
        "cancer",
        "cell_quality",
        "channel",
        "perturbation",
        "perturbation_type",
        "singlet_dev",
        "singlet_dev_z",
        "singlet_margin",
        "singlet_z_margin",
        "nperts",
        "ngenes",
        "ncounts",
        "cell_det_rate",
        "doublet_GMM_prob",
        "doublet_dev_imp",
        "doublet_z_margin",
        "doublet_CL1",
        "doublet_CL2",
    ],
    axis=1,
)

## Register curated artifact

In [None]:
artifact = curator.save_artifact(description="McFarland AnnData")

[92m→[0m validating metadata using registries of instance [3mtest-perturbation[0m
[92m✓[0m 'var_index' is validated against [3mGene.ensembl_gene_id[0m
[92m✓[0m 'assay' is validated against [3mExperimentalFactor.name[0m
[92m✓[0m 'cell_type' is validated against [3mCellType.name[0m
[92m✓[0m 'development_stage' is validated against [3mDevelopmentalStage.name[0m
[92m✓[0m 'disease' is validated against [3mDisease.name[0m
[92m✓[0m 'donor_id' is validated against [3mULabel.name[0m
[92m✓[0m 'self_reported_ethnicity' is validated against [3mEthnicity.name[0m
[92m✓[0m 'sex' is validated against [3mPhenotype.name[0m
[92m✓[0m 'suspension_type' is validated against [3mULabel.name[0m
[92m✓[0m 'tissue_type' is validated against [3mULabel.name[0m
[92m✓[0m 'organism' is validated against [3mOrganism.name[0m
[92m✓[0m 'cell_line' is validated against [3mCellLine.name[0m
[92m✓[0m 'genetic_treatments' is validated against [3mGeneticTreatment.name[0m


id,uid,version,is_latest,description,key,suffix,type,size,hash,n_objects,n_observations,_hash_type,_accessor,visibility,_key_is_virtual,storage_id,transform_id,run_id,created_at,created_by_id
2,Xk7Qaik9vBLV4PKf0001,,True,McFarland 2020 preprocessed,,.h5ad,dataset,2511528,Iz4mVUpIruvtABfA6D3vQA,,,md5,AnnData,1,True,3,3,3,2024-11-11 14:44:54.959908+00:00,1


[93m![0m    [1;93m11 unique terms[0m (42.30%) are not validated for [3mname[0m: [1;93m'dose_unit', 'dose_value', 'hash_assignment', 'hash_tag', 'num_SNPs', 'singlet_ID', 'time', 'tot_reads', 'percent_mito', 'percent_ribo', ...[0m


In [None]:
# Set the perturbations and references
artifact.genetic_treatments.set(genetic_treatments)
artifact.compound_treatments.set(compound_treatments)
artifact.references.add(reference)

In [None]:
artifact.describe()

[1;92mArtifact[0m(uid='A2xWHSPBuPgBhcHi0000', is_latest=True, description='McFarland AnnData', suffix='.h5ad', type='dataset', size=3345456, hash='jzxUs9DOPJewAKOb6ZMaGg', n_observations=1000, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-11 15:11:20 UTC)
  [3mProvenance[0m
    .storage = '/home/zeth/PycharmProjects/pertpy-datasets/scripts/lamindb_datasets/test-perturbation'
    .transform = 'Curate perturbation dataset with `PerturbationCurator`'
    .run = 2024-11-11 15:10:01 UTC
    .created_by = 'zethson'
  [3mLabels[0m
    .references = 'Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action'
    .genetic_treatments = 'sggpx4-1', 'sggpx4-2', 'sgor2j2', 'sglacz'
    .compound_treatments = 'trametinib', 'afatinib', 'dabrafenib', 'gemcitabine', 'navitoclax', 'bortezomib', 'brd3379', 'JQ1', 'azd5591', 'control', ...
    .organisms = 'human'
    .cell_types = '