# Curate perturbation dataset with `PerturbationCurator`

Here we use pertpy's `PerturbationCurator` to ensure that a perturbation dataset conforms to both the `CELLxGENE` schema (5.1.0) and pertpy's own criteria.
More specifically, the `PerturbationCurator` builds upon [cellxgene-lamin](https://github.com/laminlabs/cellxgene-lamin) and extends it by additionally requiring a `cell_line` column and `X_treatments` columns (`genetic_treatments`, `compound_treatments`, `environmental_treatments`, and `combination_treatments`) for the perturbations.
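
As a rough illustration of what these extra requirements amount to, the sketch below checks a list of `obs` column names for the additional columns. The helper `missing_required_columns` and the `REQUIRED_COLUMNS` list are hypothetical and not part of pertpy-datasets; the column names are taken from the defaults that the curator adds later in this guide.

```python
# Hypothetical helper for illustration only; not part of pertpy-datasets.
REQUIRED_COLUMNS = [
    "cell_line",
    "genetic_treatments",
    "compound_treatments",
    "environmental_treatments",
    "combination_treatments",
]


def missing_required_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# With an AnnData object, one would pass `adata.obs.columns` instead of a list.
example_columns = ["cell_line", "disease", "compound_treatments"]
missing = missing_required_columns(example_columns)
print(missing)
```

In this guide the curator fills in such missing columns with defaults, so a check like this is only useful for inspecting a dataset before curation.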

This guide demonstrates how to curate a complex, real-world perturbation dataset from [McFarland et al. 2020](https://www.nature.com/articles/s41467-020-17440-w) using `PerturbationCurator`. See [lamindb's perturbation guide](https://docs.lamin.ai/perturbation) for more details.

In [1]:
#!pip install pertpy-datasets

In [2]:
# We use a local instance here; in practice, we use `laminlabs/pertpy-datasets`
!lamin init --storage ./test-perturbation --schema bionty,wetlab,findrefs

[92m→[0m connected lamindb: zethson/test-perturbation


In [3]:
import lamindb as ln
import bionty as bt
import wetlab as wl
import findrefs as fr
import pertpy_datasets as pts
import scanpy as sc

[92m→[0m connected lamindb: zethson/test-perturbation


In [4]:
ln.track("HIRTYxL3aZc70000")

[92m→[0m notebook imports: bionty==0.52.0 findrefs==0.1.0 lamindb==0.76.15 pertpy_datasets==0.1.0 scanpy==1.10.3 wetlab==0.34.0
[92m→[0m loaded Transform('HIRTYxL3'), started Run('RGyUX0tz') at 2024-11-11 15:10:01 UTC


In [5]:
adata = ln.Artifact.using("laminlabs/lamindata").get(uid="Xk7Qaik9vBLV4PKf0001").load()
adata.obs.head(3)

[92m→[0m completing transfer to track Artifact('Xk7Qaik9') as input
[92m→[0m mapped records: Artifact(uid='Xk7Qaik9vBLV4PKf0001')
[92m→[0m transferred records: 


Unnamed: 0,depmap_id,cancer,cell_det_rate,cell_line,cell_quality,channel,disease,dose_unit,dose_value,doublet_CL1,...,singlet_z_margin,time,tissue_type,tot_reads,nperts,ngenes,ncounts,percent_mito,percent_ribo,chembl-ID
AACTGGTGTCTCTCTG,ACH-000390,True,0.093159,LUDLU-1,normal,,lung cancer,µM,0.1,LUDLU1_LUNG,...,12.351139,24,cell_line,787,1,3045,12895.0,3.202792,24.955409,CHEMBL2103875
ATAGGCTCAGATTTCG,ACH-000444,True,0.145728,LU99,normal,2.0,lung cancer,µM,0.5,LU99_LUNG,...,8.164565,24,cell_line,1597,1,4763,23161.0,7.473771,18.051898,CHEMBL1173655
GCCAAATCAAGCCGTC,ACH-000396,True,0.11733,J82,normal,,urinary bladder carcinoma,µM,0.1,J82_URINARY_TRACT,...,11.188513,24,cell_line,1159,1,3834,18062.0,2.762706,22.08504,CHEMBL2028663


In [6]:
# Calculate an embedding because CELLxGENE requires one
sc.tl.pca(adata)
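
`sc.tl.pca` stores its result in `adata.obsm["X_pca"]`. The sketch below uses plain NumPy on random toy data rather than scanpy to show the underlying idea: center the matrix, take a singular value decomposition, and keep the leading components under a key with an `X_` prefix (which, as far as we understand the CELLxGENE schema, is how embeddings are expected to be named). scanpy's implementation differs in its details, so treat this as an illustration only.

```python
import numpy as np

# Toy expression matrix: 20 cells x 5 genes (random values, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

# PCA via SVD of the column-centered matrix.
centered = X - X.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

n_comps = 2
X_pca = u[:, :n_comps] * s[:n_comps]  # cell coordinates in PC space

# A plain dict standing in for `adata.obsm`.
obsm = {"X_pca": X_pca}
print(obsm["X_pca"].shape)
```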

## Curate non-perturbation data

In [None]:
# Fetch all ontologies from this instance
curator = pts.PerturbationCurator(adata)
curator.validate()

[92m→[0m added defaults to the AnnData object: {'assay': 'unknown', 'cell_type': 'unknown', 'development_stage': 'unknown', 'donor_id': 'unknown', 'self_reported_ethnicity': 'unknown', 'suspension_type': 'cell', 'genetic_treatments': '', 'compound_treatments': '', 'environmental_treatments': '', 'combination_treatments': ''}
[92m→[0m validating metadata using registries of instance [3mtest-perturbation[0m
[94m•[0m mapping [3mvar_index[0m on [3mGene.ensembl_gene_id[0m
[93m![0m    [1;91m2 terms[0m are not validated: [1;91m'ENSG00000255823', 'ENSG00000272370'[0m
→ fix typos, remove non-existent values, or save terms via [1;91m.add_new_from_var_index()[0m
[94m•[0m mapping [3massay[0m on [3mExperimentalFactor.name[0m
[93m![0m    [1;91m1 term[0m is not validated: [1;91m'unknown'[0m
→ fix typo, remove non-existent value, or save term via [1;91m.add_new_from('assay')[0m
[92m✓[0m 'cell_type' is validated against [3mCellType.name[0m
[92m✓[0m 'development_s

False

In [8]:
adata.obs

Unnamed: 0,depmap_id,cancer,cell_det_rate,cell_line,cell_quality,channel,disease,dose_unit,dose_value,doublet_CL1,...,assay,cell_type,development_stage,donor_id,self_reported_ethnicity,suspension_type,genetic_treatments,compound_treatments,environmental_treatments,combination_treatments
AACTGGTGTCTCTCTG,ACH-000390,True,0.093159,LUDLU-1,normal,,lung cancer,µM,0.1,LUDLU1_LUNG,...,unknown,unknown,unknown,unknown,unknown,cell,,,,
ATAGGCTCAGATTTCG,ACH-000444,True,0.145728,LU99,normal,2,lung cancer,µM,0.5,LU99_LUNG,...,unknown,unknown,unknown,unknown,unknown,cell,,,,
GCCAAATCAAGCCGTC,ACH-000396,True,0.117330,J82,normal,,urinary bladder carcinoma,µM,0.1,J82_URINARY_TRACT,...,unknown,unknown,unknown,unknown,unknown,cell,,,,
CGGAGAAGTCGCGTCA,ACH-000997,True,0.005422,HCT-15,low_quality,7,colorectal cancer,µM,0.1,HCT15_LARGE_INTESTINE,...,unknown,unknown,unknown,unknown,unknown,cell,,,,
TAGTTGGAGATCGATA,ACH-000723,True,0.132708,YD-10B,low_quality,,head and neck cancer,,,YD10B_UPPER_AERODIGESTIVE_TRACT,...,unknown,unknown,unknown,unknown,unknown,cell,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GATCTAGTCATCGGAT,ACH-000015,True,0.094752,NCI-H1581,normal,,lung cancer,µM,0.1,NCIH1581_LUNG,...,unknown,unknown,unknown,unknown,unknown,cell,,,,
AACTCTTAGTCCCACG,ACH-000252,True,0.069540,LS1034,normal,,colorectal cancer,µM,0.0,LS1034_LARGE_INTESTINE,...,unknown,unknown,unknown,unknown,unknown,cell,,,,
CACTGTCCAGTCACGC,ACH-000681,True,0.094262,A549,normal,5,lung cancer,µM,2.5,A549_LUNG,...,unknown,unknown,unknown,unknown,unknown,cell,,,,
CACCTTGGTCGACTAT,ACH-000875,True,0.163557,NCI-H2347,normal,,lung cancer,µM,0.0,NCIH2347_LUNG,...,unknown,unknown,unknown,unknown,unknown,cell,,,,


In [9]:
adata.obs["sex"] = adata.obs["sex"].replace({"Unknown": "unknown"})


In [10]:
efo_lo = bt.ExperimentalFactor.public().lookup()

In [11]:
adata.obs["assay"] = efo_lo.single_cell_rna_sequencing.name

In [12]:
adata = adata[:, ~adata.var_names.isin(curator.non_validated["var_index"])].copy()
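
The filtering step relies on a boolean mask over the variable names. A minimal sketch with a plain pandas `Index`, combining two placeholder IDs (made up for this example) with the two IDs that failed validation above:

```python
import pandas as pd

# Two placeholder IDs plus the two IDs reported as not validated.
var_names = pd.Index(
    ["ENSG_PLACEHOLDER_1", "ENSG00000255823", "ENSG00000272370", "ENSG_PLACEHOLDER_2"]
)
non_validated = ["ENSG00000255823", "ENSG00000272370"]

# True for every variable that should be kept.
mask = ~var_names.isin(non_validated)
kept = var_names[mask]
print(list(kept))
```

Applied to an `AnnData` object, the same mask selects columns, which is why the cell above indexes with `adata[:, mask]` and then calls `.copy()` to obtain an independent object instead of a view.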

In [None]:
# Recreate the curator because `adata` is now a new object (a filtered copy)
curator = pts.PerturbationCurator(adata)
curator.validate()

[92m→[0m validating metadata using registries of instance [3mtest-perturbation[0m
[92m✓[0m 'var_index' is validated against [3mGene.ensembl_gene_id[0m
[92m✓[0m 'assay' is validated against [3mExperimentalFactor.name[0m
[92m✓[0m 'cell_type' is validated against [3mCellType.name[0m
[92m✓[0m 'development_stage' is validated against [3mDevelopmentalStage.name[0m
[92m✓[0m 'disease' is validated against [3mDisease.name[0m
[92m✓[0m 'donor_id' is validated against [3mULabel.name[0m
[92m✓[0m 'self_reported_ethnicity' is validated against [3mEthnicity.name[0m
[92m✓[0m 'sex' is validated against [3mPhenotype.name[0m
[92m✓[0m 'suspension_type' is validated against [3mULabel.name[0m
[92m✓[0m 'tissue_type' is validated against [3mULabel.name[0m
[92m✓[0m 'organism' is validated against [3mOrganism.name[0m
[92m✓[0m 'cell_line' is validated against [3mCellLine.name[0m
[92m✓[0m 'genetic_treatments' is validated against [3mGeneticTreatment.name[0m


True

In [14]:
curator.add_new_from("all")

In [15]:
curator.validate()

[92m→[0m validating metadata using registries of instance [3mtest-perturbation[0m
[92m✓[0m 'var_index' is validated against [3mGene.ensembl_gene_id[0m
[92m✓[0m 'assay' is validated against [3mExperimentalFactor.name[0m
[92m✓[0m 'cell_type' is validated against [3mCellType.name[0m
[92m✓[0m 'development_stage' is validated against [3mDevelopmentalStage.name[0m
[92m✓[0m 'disease' is validated against [3mDisease.name[0m
[92m✓[0m 'donor_id' is validated against [3mULabel.name[0m
[92m✓[0m 'self_reported_ethnicity' is validated against [3mEthnicity.name[0m
[92m✓[0m 'sex' is validated against [3mPhenotype.name[0m
[92m✓[0m 'suspension_type' is validated against [3mULabel.name[0m
[92m✓[0m 'tissue_type' is validated against [3mULabel.name[0m
[92m✓[0m 'organism' is validated against [3mOrganism.name[0m
[92m✓[0m 'cell_line' is validated against [3mCellLine.name[0m
[92m✓[0m 'genetic_treatments' is validated against [3mGeneticTreatment.name[0m


True

All treatment columns validate, but only because they are all still empty.

## Curate perturbations

In [16]:
# Move the perturbation labels into the treatment columns that the curator expects
adata.obs["genetic_treatments"] = adata.obs["perturbation"].where(
    adata.obs["perturbation_type"] == "CRISPR", None
)
adata.obs["compound_treatments"] = adata.obs["perturbation"].where(
    adata.obs["perturbation_type"] == "drug", None
)
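
`Series.where` keeps a value where the condition holds and inserts a missing value everywhere else. A small sketch on a toy table (the three rows are made up and only reuse labels that occur in this dataset):

```python
import pandas as pd

obs = pd.DataFrame(
    {
        "perturbation": ["sglacz", "trametinib", "control"],
        "perturbation_type": ["CRISPR", "drug", "drug"],
    }
)

# Keep the label only in the column that matches its perturbation type.
obs["genetic_treatments"] = obs["perturbation"].where(
    obs["perturbation_type"] == "CRISPR"
)
obs["compound_treatments"] = obs["perturbation"].where(
    obs["perturbation_type"] == "drug"
)
print(obs)
```

Each row ends up with a value in at most one of the two treatment columns, which is what the later `notna()` filter on `compound_treatments` relies on.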

### Genetic treatments

In [17]:
list(adata.obs["genetic_treatments"].unique())

[nan, 'sggpx4-2', 'sglacz', 'sggpx4-1', 'sgor2j2']

In [18]:
treatments = [
    ("sggpx4-1", "GPX4", "Glutathione Peroxidase 4"),
    ("sggpx4-2", "GPX4", "Glutathione Peroxidase 4"),
    ("sgor2j2", "or2j2", "Olfactory receptor family 2 subfamily J member 2"),
    ("sglacz", "lacz", "beta-galactosidase control"),  # Control from E. coli
]
organism = bt.Organism.lookup().human

genetic_treatments = []
for name, symbol, target_name in treatments:
    treatment = wl.GeneticTreatment(system="CRISPR KO", name=name).save()
    if symbol != "lacz":
        gene_result = bt.Gene.from_source(symbol=symbol, organism=organism)
        gene = gene_result[0] if isinstance(gene_result, list) else gene_result
        gene = gene.save()
    else:
        gene = bt.Gene(symbol=symbol, organism=organism).save()
    target = wl.TreatmentTarget(name=target_name).save()
    target.genes.add(gene)
    treatment.targets.add(target)
    genetic_treatments.append(treatment)

[92m→[0m returning existing GeneticTreatment record with same name: 'sggpx4-1'
[92m→[0m returning existing TreatmentTarget record with same name: 'Glutathione Peroxidase 4'
[92m→[0m returning existing GeneticTreatment record with same name: 'sggpx4-2'
[92m→[0m returning existing TreatmentTarget record with same name: 'Glutathione Peroxidase 4'
[92m→[0m returning existing GeneticTreatment record with same name: 'sgor2j2'
[92m✓[0m loaded [1;92m1 Gene record[0m matching [3msymbol[0m: [1;92m'OR2J2'[0m
[92m✓[0m loaded [1;92m1 Gene record[0m matching [3msynonyms[0m: [1;92m'or2j2'[0m
[92m→[0m returning existing TreatmentTarget record with same name: 'Olfactory receptor family 2 subfamily J member 2'
[92m→[0m returning existing GeneticTreatment record with same name: 'sglacz'
[92m→[0m returning existing Gene record with same symbol: 'lacz'
[92m→[0m returning existing TreatmentTarget record with same name: 'beta-galactosidase control'


In [19]:
curator.validate()

[92m→[0m validating metadata using registries of instance [3mtest-perturbation[0m
[92m✓[0m 'var_index' is validated against [3mGene.ensembl_gene_id[0m
[92m✓[0m 'assay' is validated against [3mExperimentalFactor.name[0m
[92m✓[0m 'cell_type' is validated against [3mCellType.name[0m
[92m✓[0m 'development_stage' is validated against [3mDevelopmentalStage.name[0m
[92m✓[0m 'disease' is validated against [3mDisease.name[0m
[92m✓[0m 'donor_id' is validated against [3mULabel.name[0m
[92m✓[0m 'self_reported_ethnicity' is validated against [3mEthnicity.name[0m
[92m✓[0m 'sex' is validated against [3mPhenotype.name[0m
[92m✓[0m 'suspension_type' is validated against [3mULabel.name[0m
[92m✓[0m 'tissue_type' is validated against [3mULabel.name[0m
[92m✓[0m 'organism' is validated against [3mOrganism.name[0m
[92m✓[0m 'cell_line' is validated against [3mCellLine.name[0m
[92m✓[0m 'genetic_treatments' is validated against [3mGeneticTreatment.name[0m


False

### Compounds

In [20]:
compounds = wl.Compound.from_values(adata.obs["compound_treatments"], field="name")

[92m✓[0m created [1;95m8 Compound records from Bionty[0m matching [3mname[0m: [1;95m'trametinib', 'afatinib', 'dabrafenib', 'gemcitabine', 'navitoclax', 'bortezomib', 'JQ1', 'everolimus'[0m
[93m![0m [1;91mdid not create[0m Compound records for [1;93m6 non-validated[0m [3mnames[0m: [1;93m'azd5591', 'brd3379', 'control', 'idasanutlin', 'prexasertib', 'taselisib'[0m


In [21]:
# The remaining compounds are not in ChEBI, so we create records for them manually
for missing in [
    "azd5591",
    "brd3379",
    "control",
    "idasanutlin",
    "prexasertib",
    "taselisib",
]:
    compounds.append(wl.Compound(name=missing))
ln.save(compounds)

In [22]:
drug_metadata = adata.obs[adata.obs["compound_treatments"].notna()]

unique_treatments = drug_metadata[
    ["perturbation", "dose_unit", "dose_value"]
].drop_duplicates()

compound_treatments = []
for _, row in unique_treatments.iterrows():
    compound = wl.Compound.get(name=row["perturbation"])
    treatment = wl.CompoundTreatment(
        name=compound.name,
        concentration=row["dose_value"],
        concentration_unit=row["dose_unit"],
    )
    compound_treatments.append(treatment)

ln.save(compound_treatments)
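
The cell above creates one treatment record per unique combination of compound, dose unit, and dose value. A toy sketch of the deduplication step, with made-up dose values:

```python
import pandas as pd

# Four cells, but only three distinct (compound, unit, value) combinations.
drug_metadata = pd.DataFrame(
    {
        "perturbation": ["trametinib", "trametinib", "trametinib", "afatinib"],
        "dose_unit": ["µM", "µM", "µM", "µM"],
        "dose_value": [0.1, 0.1, 0.5, 0.1],
    }
)

unique_treatments = drug_metadata[
    ["perturbation", "dose_unit", "dose_value"]
].drop_duplicates()
print(unique_treatments)
```

Here `trametinib` appears twice in the result because it was applied at two different doses, so a compound name alone does not necessarily identify a single treatment.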

In [23]:
compounds_to_targets = {
    "trametinib": ("MAPK/ERK pathway", ["P36507"]),
    "afatinib": ("EGFR, HER2, HER4 signaling", ["P00533", "Q9UK79", "Q15303"]),
    "dabrafenib": ("MAPK/ERK pathway", ["P15056"]),
    "gemcitabine": ("DNA synthesis inhibition", ["P23921"]),  # No single protein target
    "navitoclax": ("Apoptosis regulation", ["P10415", "Q07812"]),
    "bortezomib": ("Proteasome pathway", ["P49721"]),
    "brd3379": ("Transcription regulation (BET proteins)", ["O60885"]),
    "JQ1": ("Transcription regulation (BET proteins)", ["O60885"]),
    "azd5591": ("Apoptosis regulation", ["Q07820"]),
    "control": ("Baseline", [None]),  # No target for control
    "prexasertib": ("DNA damage response", ["O14757"]),
    "taselisib": ("PI3K/AKT/mTOR pathway", ["P42336", "O00329", "P48736"]),
    "idasanutlin": ("p53 regulation", ["Q00987"]),
    "everolimus": ("mTOR pathway", ["P42345"])
}


for compound_treatment_name, targets_tuple in compounds_to_targets.items():
    compound_treatment = wl.CompoundTreatment.get(name=compound_treatment_name)
    target = wl.TreatmentTarget(name=targets_tuple[0]).save()
    proteins = []
    for uniprot_id in targets_tuple[1]:  # avoid shadowing the builtin `id`
        if uniprot_id is not None:  # the control has no protein target
            proteins.append(bt.Protein.from_source(uniprotkb_id=uniprot_id).save())
    target.proteins.set(proteins)
    compound_treatment.targets.add(target)

[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P36507'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P00533'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'Q9UK79'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'Q15303'[0m
[92m→[0m returning existing TreatmentTarget record with same name: 'MAPK/ERK pathway'
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P15056'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P23921'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P10415'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'Q07812'[0m
[93m![0m record with similar n

id,uid,name,description,run_id,created_at,created_by_id
4,i6XpYvMC,MAPK/ERK pathway,,2,2024-11-11 15:10:57.608736+00:00,1


[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P49721'[0m
[93m![0m record with similar name exists! did you mean to load it?


id,uid,name,description,run_id,created_at,created_by_id
7,SnJAC9fR,Apoptosis regulation,,2,2024-11-11 15:11:04.651378+00:00,1


[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'O60885'[0m
[92m→[0m returning existing TreatmentTarget record with same name: 'Transcription regulation (BET proteins)'
[92m→[0m returning existing TreatmentTarget record with same name: 'Apoptosis regulation'
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'Q07820'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'O14757'[0m
[93m![0m records with similar names exist! did you mean to load one of them?


id,uid,name,description,run_id,created_at,created_by_id
4,i6XpYvMC,MAPK/ERK pathway,,2,2024-11-11 15:10:57.608736+00:00,1
8,fuAhPhXj,Proteasome pathway,,2,2024-11-11 15:11:07.553672+00:00,1


[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P42336'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'O00329'[0m
[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P48736'[0m
[93m![0m records with similar names exist! did you mean to load one of them?


id,uid,name,description,run_id,created_at,created_by_id
7,SnJAC9fR,Apoptosis regulation,,2,2024-11-11 15:11:04.651378+00:00,1
9,5KWr4lMW,Transcription regulation (BET proteins),,2,2024-11-11 15:11:08.744610+00:00,1


[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'Q00987'[0m
[93m![0m records with similar names exist! did you mean to load one of them?


id,uid,name,description,run_id,created_at,created_by_id
12,irV5vyHx,PI3K/AKT/mTOR pathway,,2,2024-11-11 15:11:12.919341+00:00,1
4,i6XpYvMC,MAPK/ERK pathway,,2,2024-11-11 15:10:57.608736+00:00,1
8,fuAhPhXj,Proteasome pathway,,2,2024-11-11 15:11:07.553672+00:00,1


[92m✓[0m created [1;95m1 Protein record from Bionty[0m matching [3muniprotkb_id[0m: [1;95m'P42345'[0m


## References

In [24]:
reference = fr.Reference(
    name="Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action",
    abbr="McFarland 2020",
    url="https://www.nature.com/articles/s41467-020-17440-w",
    doi="10.1038/s41467-020-17440-w",
    text=(
        "Assays to study cancer cell responses to pharmacologic or genetic perturbations are typically "
        "restricted to using simple phenotypic readouts such as proliferation rate. Information-rich assays, "
        "such as gene-expression profiling, have generally not permitted efficient profiling of a given "
        "perturbation across multiple cellular contexts. Here, we develop MIX-Seq, a method for multiplexed "
        "transcriptional profiling of post-perturbation responses across a mixture of samples with single-cell "
        "resolution, using SNP-based computational demultiplexing of single-cell RNA-sequencing data. We show "
        "that MIX-Seq can be used to profile responses to chemical or genetic perturbations across pools of 100 "
        "or more cancer cell lines. We combine it with Cell Hashing to further multiplex additional experimental "
        "conditions, such as post-treatment time points or drug doses. Analyzing the high-content readout of "
        "scRNA-seq reveals both shared and context-specific transcriptional response components that can identify "
        "drug mechanism of action and enable prediction of long-term cell viability from short-term transcriptional "
        "responses to treatment."
    ),
).save()

## Remove unused columns

In [25]:
adata.obs = adata.obs.drop(
    [
        "depmap_id",
        "cancer",
        "cell_quality",
        "channel",
        "perturbation",
        "perturbation_type",
        "singlet_dev",
        "singlet_dev_z",
        "singlet_margin",
        "singlet_z_margin",
        "nperts",
        "ngenes",
        "ncounts",
        "cell_det_rate",
        "doublet_GMM_prob",
        "doublet_dev_imp",
        "doublet_z_margin",
        "doublet_CL1",
        "doublet_CL2",
    ],
    axis=1,
)
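
When it is not certain that every listed column exists, pandas can be told to skip absent names instead of raising a `KeyError`. A minimal sketch on a toy table (the single row is illustrative):

```python
import pandas as pd

obs = pd.DataFrame(
    {"cell_line": ["A549"], "depmap_id": ["ACH-000681"], "nperts": [1]}
)

# "doublet_CL2" is deliberately absent from this toy table.
unused = ["depmap_id", "nperts", "doublet_CL2"]

# errors="ignore" drops the columns that exist and silently skips the rest.
obs = obs.drop(columns=unused, errors="ignore")
print(list(obs.columns))
```

The strict default used in the cell above is the safer choice when a missing column would indicate an upstream problem.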

## Register curated artifact

In [26]:
artifact = curator.save_artifact(description="McFarland AnnData")

[92m→[0m validating metadata using registries of instance [3mtest-perturbation[0m
[92m✓[0m 'var_index' is validated against [3mGene.ensembl_gene_id[0m
[92m✓[0m 'assay' is validated against [3mExperimentalFactor.name[0m
[92m✓[0m 'cell_type' is validated against [3mCellType.name[0m
[92m✓[0m 'development_stage' is validated against [3mDevelopmentalStage.name[0m
[92m✓[0m 'disease' is validated against [3mDisease.name[0m
[92m✓[0m 'donor_id' is validated against [3mULabel.name[0m
[92m✓[0m 'self_reported_ethnicity' is validated against [3mEthnicity.name[0m
[92m✓[0m 'sex' is validated against [3mPhenotype.name[0m
[92m✓[0m 'suspension_type' is validated against [3mULabel.name[0m
[92m✓[0m 'tissue_type' is validated against [3mULabel.name[0m
[92m✓[0m 'organism' is validated against [3mOrganism.name[0m
[92m✓[0m 'cell_line' is validated against [3mCellLine.name[0m
[92m✓[0m 'genetic_treatments' is validated against [3mGeneticTreatment.name[0m


id,uid,version,is_latest,description,key,suffix,type,size,hash,n_objects,n_observations,_hash_type,_accessor,visibility,_key_is_virtual,storage_id,transform_id,run_id,created_at,created_by_id
2,Xk7Qaik9vBLV4PKf0001,,True,McFarland 2020 preprocessed,,.h5ad,dataset,2511528,Iz4mVUpIruvtABfA6D3vQA,,,md5,AnnData,1,True,3,3,3,2024-11-11 14:44:54.959908+00:00,1


[93m![0m    [1;93m11 unique terms[0m (42.30%) are not validated for [3mname[0m: [1;93m'dose_unit', 'dose_value', 'hash_assignment', 'hash_tag', 'num_SNPs', 'singlet_ID', 'time', 'tot_reads', 'percent_mito', 'percent_ribo', ...[0m


In [27]:
# Set the perturbations and references
artifact.genetic_treatments.set(genetic_treatments)
artifact.compound_treatments.set(compound_treatments)
artifact.references.add(reference)

In [28]:
artifact.describe()

[1;92mArtifact[0m(uid='A2xWHSPBuPgBhcHi0000', is_latest=True, description='McFarland AnnData', suffix='.h5ad', type='dataset', size=3345456, hash='jzxUs9DOPJewAKOb6ZMaGg', n_observations=1000, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-11 15:11:20 UTC)
  [3mProvenance[0m
    .storage = '/home/zeth/PycharmProjects/pertpy-datasets/scripts/lamindb_datasets/test-perturbation'
    .transform = 'Curate perturbation dataset with `PerturbationCurator`'
    .run = 2024-11-11 15:10:01 UTC
    .created_by = 'zethson'
  [3mLabels[0m
    .references = 'Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action'
    .genetic_treatments = 'sggpx4-1', 'sggpx4-2', 'sgor2j2', 'sglacz'
    .compound_treatments = 'trametinib', 'afatinib', 'dabrafenib', 'gemcitabine', 'navitoclax', 'bortezomib', 'brd3379', 'JQ1', 'azd5591', 'control', ...
    .organisms = 'human'
    .cell_types = '