# Upload VarChAMP data to MaveDB
Runxi Shen (modified from notebook by Jess Ewald)
2025-03-06

This notebook uploads VarChAMP experiments and score sets from the imaging work to [MaveDB](https://www.mavedb.org/) using the Python API client implemented in [mavetools](https://github.com/VariantEffect/mavetools). 

In [1]:
import asyncio
import urllib.request
import json
import csv
import os
import pandas as pd
import polars as pl
import requests
from fqfa.util.translate import translate_dna
from mavehgvs import Variant
from mavedb import __version__ as mavedb_version

{"message": "CloudWatch log handler is not enabled. Canonical logs will only be emitted to stdout."}
{"message": "MaveDB 2025.1.0"}


## Set up API key and endpoint

You can view your API key by logging into MaveDB and then visiting the [settings](https://mavedb.org/#/settings) page.
Copy the API key here, as this is required by the client to create records and view your private records.
You can also set up the API key using an environment variable `MAVEDB_API_KEY`.

In [2]:
if "MAVEDB_API_KEY" in os.environ:
    api_key = os.environ.get("MAVEDB_API_KEY")
else:
    # Paste your own key here for local use; never commit a real key to a shared notebook.
    api_key = "<YOUR_MAVEDB_API_KEY>"

# API URL for the production MaveDB instance
api_url = "https://api.mavedb.org/api/v1/"
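
A minimal sketch of a stricter alternative, assuming you would rather fail early than fall back to a key written in the notebook. The helper name `load_api_key` is hypothetical and is not part of mavetools or MaveDB.

```python
import os


def load_api_key(env=os.environ, name="MAVEDB_API_KEY"):
    """Return the API key stored in env[name], or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before running the upload cells."
        )
    return key
```

With this in place, `api_key = load_api_key()` could replace the if/else above, so the key never needs to appear in the notebook itself.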

If you are having problems with validation, compare the version of the MaveDB data models mavetools is using with the version of MaveDB running on the server you are accessing.

In [3]:
with urllib.request.urlopen(f"{api_url}api/version") as response:
    r = response.read()
    print(f"API version:{json.loads(r)['version']:>15}")
print(f"Module version:{mavedb_version:>12}")

API version:       2025.1.1
Module version:    2025.1.0
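
The comparison can also be made programmatically. The sketch below assumes that versions are dotted integers (as in the output above) and that agreement on the first two components is close enough. That threshold is an assumption for illustration, not a documented MaveDB compatibility rule.

```python
def parse_version(version):
    """Split a dotted version string such as '2025.1.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def versions_compatible(api_version, module_version, fields=2):
    """Treat two versions as compatible when their leading `fields` components match."""
    return parse_version(api_version)[:fields] == parse_version(module_version)[:fields]
```

For the versions printed above, `versions_compatible("2025.1.1", "2025.1.0")` compares `(2025, 1)` with `(2025, 1)`, so only the patch level differs.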


## Format the input data

MaveDB requires data in specific formats, including precisely formatted identifiers and column names. The key changes here are creating the "hgvs_nt" column from the variant nucleotide changes, stripping the version number from the Ensembl IDs, and adding the target sequences. Each uploaded score_set file must have a "score" column, and target (gene) labels must not contain spaces.
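
As a standalone illustration of these formatting rules, here is a small sketch using only the standard library. The helper names `strip_ensembl_version` and `make_hgvs_nt` are hypothetical; the notebook itself does the same work with polars expressions.

```python
import re


def strip_ensembl_version(identifier):
    """Drop a trailing version suffix, e.g. 'ENSP00000218104.3' -> 'ENSP00000218104'."""
    return re.sub(r"\..*$", "", identifier)


def make_hgvs_nt(label, nt_change):
    """Build an hgvs_nt string such as 'ABCD1:n.1552C>T' from a target label and a change."""
    if re.search(r"\s", label):
        raise ValueError(f"target label {label!r} contains whitespace")
    return f"{label}:n.{nt_change}"
```

The whitespace check reflects the rule that target labels must not contain spaces, so a bad label is caught before anything is sent to the server.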

In [55]:
## Jess's original code for a different version of the data, kept for REFERENCE

# dat_info = dat_info.with_columns(
#     pl.concat_str([
#         pl.col("symbol"),
#         pl.lit(":n."),
#         pl.col("nt_change")
#     ], separator="").alias("hgvs_nt"),
#     pl.col("ensembl_gene_id").str.replace(r"\..*", "", literal=False).alias("ensembl_gene_id"),
#     pl.concat_str([
#         pl.col("symbol"),
#         pl.col("aa_change")
#     ], separator="_").alias("Variant")
# ).rename({"sequence": "ref_cds"})

# get only gene-level info
# gene_info = dat_info.select(
#     ["symbol", "ensembl_gene_id", "ref_cds"]
# ).unique()

### Some QC

Check for duplicated mutations, sequence mismatches, and similar problems.

In [47]:
dat_info_orig = pl.read_csv("./B13-14_alleles_with_mutated_cds.tsv", separator="\t", schema_overrides={"chr": pl.String})
print(dat_info_orig.head())

dat_info_orig = dat_info_orig.with_columns(
    pl.col("variant").str.split("_").list.first().alias("symbol")
).rename({"variant": "Variant"})

dat_info_orig = dat_info_orig.with_columns(
    pl.concat_str([
        pl.col("symbol"),
        pl.lit(":n."),
        pl.col("nt_change")
    ], separator="").alias("hgvs_nt"),
    pl.col("ensembl_protein_id").str.replace(r"\..*", "", literal=False).alias("ensembl_protein_id"),
)
print(dat_info_orig.head())

shape: (5, 6)
┌─────────────────┬───────────┬─────────────────┬────────────────┬────────────────┬────────────────┐
│ variant         ┆ nt_change ┆ ccsb_orf_id     ┆ ensembl_protei ┆ ref_cds        ┆ allele_cds     │
│ ---             ┆ ---       ┆ ---             ┆ n_id           ┆ ---            ┆ ---            │
│ str             ┆ str       ┆ str             ┆ ---            ┆ str            ┆ str            │
│                 ┆           ┆                 ┆ str            ┆                ┆                │
╞═════════════════╪═══════════╪═════════════════╪════════════════╪════════════════╪════════════════╡
│ ABCD1_Arg518Trp ┆ 1552C>T   ┆ CCSBORF10000863 ┆ ENSP0000021810 ┆ ATGCCGGTGCTCTC ┆ ATGCCGGTGCTCTC │
│                 ┆           ┆ 7               ┆ 4.3            ┆ CAGGCCCCGGCCCT ┆ CAGGCCCCGGCCCT │
│                 ┆           ┆                 ┆                ┆ GG…            ┆ GG…            │
│ ABCD1_Arg389Gly ┆ 1165C>G   ┆ CCSBORF10000863 ┆ ENSP0000021810 ┆ ATGCCGGTGC

In [48]:
dat_info_orig.filter(pl.col("hgvs_nt").is_duplicated())

Variant,nt_change,ccsb_orf_id,ensembl_protein_id,ref_cds,allele_cds,symbol,hgvs_nt
str,str,str,str,str,str,str,str
"""F9_Glu73Lys""","""217G>A""","""CCSBORF52861""","""ENSP00000218099""","""ATGCAGCGCGTGAACATGATCATGGCAGAA…","""ATGCAGCGCGTGAACATGATCATGGCAGAA…","""F9""","""F9:n.217G>A"""
"""F9_Cys170Tyr""","""509G>A""","""CCSBORF52861""","""ENSP00000218099""","""ATGCAGCGCGTGAACATGATCATGGCAGAA…","""ATGCAGCGCGTGAACATGATCATGGCAGAA…","""F9""","""F9:n.509G>A"""
"""F9_Ile256Thr""","""767T>C""","""CCSBORF52861""","""ENSP00000218099""","""ATGCAGCGCGTGAACATGATCATGGCAGAA…","""ATGCAGCGCGTGAACATGATCATGGCAGAA…","""F9""","""F9:n.767T>C"""
"""F9_Cys124Tyr""","""509G>A""","""CCSBORF52861""","""ENSP00000218099""","""ATGCAGCGCGTGAACATGATCATGGCAGAA…","""ATGCAGCGCGTGAACATGATCATGGCAGAA…","""F9""","""F9:n.509G>A"""
"""F9_Ile210Thr""","""767T>C""","""CCSBORF52861""","""ENSP00000218099""","""ATGCAGCGCGTGAACATGATCATGGCAGAA…","""ATGCAGCGCGTGAACATGATCATGGCAGAA…","""F9""","""F9:n.767T>C"""
…,…,…,…,…,…,…,…
"""NF2_Gly197Cys""","""340G>T""","""CCSBORF3697""","""ENSP00000340626""","""ATGGCCGGGGCCATCGCTTCCCGCATGAGC…","""ATGGCCGGGGCCATCGCTTCCCGCATGAGC…","""NF2""","""NF2:n.340G>T"""
"""ZC4H2_Pro201Ser""","""601C>T""","""CCSBORF4320""","""ENSP00000363972""","""ATGGCAGATGAGCAAGAAATCATGTGCAAA…","""ATGGCAGATGAGCAAGAAATCATGTGCAAA…","""ZC4H2""","""ZC4H2:n.601C>T"""
"""ZC4H2_Val63Leu""","""187G>C""","""CCSBORF4320""","""ENSP00000363972""","""ATGGCAGATGAGCAAGAAATCATGTGCAAA…","""ATGGCAGATGAGCAAGAAATCATGTGCAAA…","""ZC4H2""","""ZC4H2:n.187G>C"""
"""ZC4H2_Val63Leu""","""187G>C""","""CCSBORF4320""","""ENSP00000363972""","""ATGGCAGATGAGCAAGAAATCATGTGCAAA…","""ATGGCAGATGAGCAAGAAATCATGTGCAAA…","""ZC4H2""","""ZC4H2:n.187G>C"""


In [49]:
from Bio.Seq import Seq
aa_single_to_three = {
    'A': 'Ala',  # Alanine
    'R': 'Arg',  # Arginine
    'N': 'Asn',  # Asparagine
    'D': 'Asp',  # Aspartic acid
    'C': 'Cys',  # Cysteine
    'Q': 'Gln',  # Glutamine
    'E': 'Glu',  # Glutamic acid
    'G': 'Gly',  # Glycine
    'H': 'His',  # Histidine
    'I': 'Ile',  # Isoleucine
    'L': 'Leu',  # Leucine
    'K': 'Lys',  # Lysine
    'M': 'Met',  # Methionine
    'F': 'Phe',  # Phenylalanine
    'P': 'Pro',  # Proline
    'S': 'Ser',  # Serine
    'T': 'Thr',  # Threonine
    'W': 'Trp',  # Tryptophan
    'Y': 'Tyr',  # Tyrosine
    'V': 'Val'   # Valine
}

for row in dat_info_orig.iter_rows(named=True): #.filter(pl.col("hgvs_nt").is_duplicated()).sort(by="hgvs_nt")
    # hgvs_df = dat_info.filter(pl.col("hgvs_nt")==hgvs_nt)
    nuc_pos = int(row["nt_change"][:-3])
    nuc_r, nuc_v = row["nt_change"].split(">")[0][-1], row["nt_change"].split(">")[-1][-1]

    nuc_ref, nuc_var = row["ref_cds"][nuc_pos-1], row["allele_cds"][nuc_pos-1]
    
    # print(nuc_r, nuc_v, nuc_ref, nuc_var)

    prot_pos = int(row["Variant"].split('_')[1][3:-3])
    prot_r, prot_v = row["Variant"].split('_')[1][:3], row["Variant"].split('_')[1][-3:]
    prot_ref_seq = Seq(row["ref_cds"]).translate()
    prot_var_seq = Seq(row["allele_cds"]).translate()
    try:
        prot_ref, prot_var = aa_single_to_three[prot_ref_seq[prot_pos-1]], aa_single_to_three[prot_var_seq[prot_pos-1]]
        # print(prot_r, prot_v, prot_ref, prot_var)

        if nuc_r != nuc_ref or nuc_v != nuc_var or prot_r != prot_ref or prot_v != prot_var:
            print("Original annotations:", row["Variant"], row["nt_change"])
            print(f"Actual Nuc Ref:{nuc_ref}, Var:{nuc_var} | " + \
                f"Actual Prot Ref:{prot_ref}, Var:{prot_var}")
            print("========================================================")
    except (KeyError, IndexError):  # stop codon ('*') or a position beyond the translated sequence
        print(row["Variant"], row["nt_change"], "| Protein seq len ONLY:", len(prot_ref_seq))
        print("========================================================")
    # break

Original annotations: F9_Cys124Tyr 509G>A
Actual Nuc Ref:G, Var:A | Actual Prot Ref:Glu, Var:Glu
Original annotations: F9_Ile210Thr 767T>C
Actual Nuc Ref:T, Var:C | Actual Prot Ref:Ile, Var:Ile
Original annotations: F9_Glu27Lys 217G>A
Actual Nuc Ref:G, Var:A | Actual Prot Ref:Glu, Var:Glu
Original annotations: NF2_Arg418Cys 1003C>T
Actual Nuc Ref:C, Var:T | Actual Prot Ref:Ile, Var:Ile
NF2_Leu535Pro 1355T>C | Protein seq len ONLY: 507
Original annotations: NF2_Gly197Cys 340G>T
Actual Nuc Ref:G, Var:T | Actual Prot Ref:Ile, Var:Ile
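
The nucleotide half of this check can be isolated into a small function. The sketch below assumes every change is a single-base substitution written as `<position><ref>><alt>` (for example `1552C>T`), and it parses that with a regular expression rather than fixed slices. The name `check_substitution` is hypothetical.

```python
import re

# One-based position, reference base, '>', variant base.
SUBSTITUTION = re.compile(r"^(\d+)([ACGT])>([ACGT])$")


def check_substitution(nt_change, ref_cds, allele_cds):
    """Return a list of problems for one annotated substitution (empty if consistent)."""
    match = SUBSTITUTION.match(nt_change)
    if match is None:
        return [f"cannot parse {nt_change!r}"]
    pos, ref, alt = int(match.group(1)), match.group(2), match.group(3)
    if pos < 1 or pos > len(ref_cds) or pos > len(allele_cds):
        return [f"position {pos} is outside the sequence"]
    problems = []
    if ref_cds[pos - 1] != ref:
        problems.append(f"reference has {ref_cds[pos - 1]} at {pos}, expected {ref}")
    if allele_cds[pos - 1] != alt:
        problems.append(f"allele has {allele_cds[pos - 1]} at {pos}, expected {alt}")
    return problems
```

Returning a list of problems instead of printing makes it easy to collect every inconsistent row into a table for whoever curates the metadata.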


### Updated file is here

In [58]:
dat_info = pl.read_csv("./B13-14_metadata_update_cleaned_TH.csv", separator="\t", schema_overrides={"chr": pl.String})
# print(dat_info.head())

dat_info = dat_info.rename({"gene_allele": "Variant"})

dat_info = dat_info.with_columns(
    pl.concat_str([
        pl.col("symbol"),
        pl.lit(":n."),
        pl.col("nt_change_cdna")
    ], separator="").alias("hgvs_nt"),
    pl.col("ensembl_protein_id").str.replace(r"\..*", "", literal=False).alias("ensembl_protein_id"),
)

dat_info_var = dat_info.filter((pl.col("ccsb_mutation_id_orig")==pl.col("ccsb_mutation_id"))&(~pl.col("ccsb_mutation_id").is_null()))
dat_info_var = dat_info_var.unique(subset=["symbol","Variant","hgvs_nt"])

In [59]:
dat_info_var.filter(pl.col("hgvs_nt").is_duplicated())

symbol,gene_allele_orig,Variant,dest_plate,dest_well,orf_id_wt,ccsb_mutation_id_orig,ccsb_mutation_id,nt_change_cdna,ref,alt,pos,aa_change_orig,aa_change,ensembl_protein_id,hgvs_nt
str,str,str,str,str,i64,str,str,str,str,str,str,str,str,str,str


In [60]:
# get only gene-level info
gene_info = dat_info.select(
    ["symbol", "ensembl_protein_id"] ## , "ref_cds"
).unique()

gene_info = gene_info.join(dat_info_orig.select(["ensembl_protein_id", "ref_cds"]).unique(), on="ensembl_protein_id")
gene_info

symbol,ensembl_protein_id,ref_cds
str,str,str
"""PTPN11""","""ENSP00000376376""","""ATGACATCGCGGAGATGGTTTCACCCAAAT…"
"""BRAF""","""ENSP00000493543""","""ATGGCGGCGCTGAGCGGTGGCGGTGGTGGC…"
"""BAP1""","""ENSP00000417132""","""ATGAATAAGGGCTGGCTGGAGCTGGAGAGC…"
"""CCM2""","""ENSP00000258781""","""ATGGAAGAGGAGGGCAAGAAGGGCAAGAAG…"
"""KRAS""","""ENSP00000308495""","""ATGACTGAATATAAACTTGTGGTAGTTGGA…"
…,…,…
"""RET""","""ENSP00000480088""","""ATGGCGAAGGCGACGTCCGGTGCCGCGGGG…"
"""BCL10""","""ENSP00000498104""","""ATGGAGCCCACCGCACCGTCCCTCACCGAG…"
"""RHO""","""ENSP00000296271""","""ATGAATGGCACAGAAGGCCCTAACTTCTAC…"
"""BRIP1""","""ENSP00000259008""","""ATGTCTTCAATGTGGTCTGAATATACAATT…"


We decided that there will be a separate scoreSet for each measurement, so here we keep the localization, morphology, and abundance scores in separate tables. Each table must have an "hgvs_nt" column and a "score" column. There can be additional columns that provide complementary statistics for the "score" (e.g., p-value or confidence interval).
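
A quick pre-upload check of a score table might look like the sketch below. The required columns come from the text above. The numeric-score and unique-variant checks are assumptions that mirror the QC in this notebook, not a statement of what the MaveDB validator enforces, and `validate_score_table` is a hypothetical helper.

```python
import csv
import io


def validate_score_table(handle, required=("hgvs_nt", "score")):
    """Check a CSV score table and return the number of distinct variants it contains."""
    reader = csv.DictReader(handle)
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        float(row["score"])  # raises ValueError if the score is not numeric
        if row["hgvs_nt"] in seen:
            raise ValueError(f"duplicate variant {row['hgvs_nt']} on line {line_no}")
        seen.add(row["hgvs_nt"])
    return len(seen)


example = io.StringIO(
    "hgvs_nt,score,Mislocalization_hit\n"
    "CCM2:n.568G>A,0.803612,false\n"
    "FH:n.922G>A,0.946719,true\n"
)
n_variants = validate_score_table(example)
```

For a file on disk, pass an open handle instead, for example `with open(path, newline="") as handle: validate_score_table(handle)`.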

In [61]:
# Format abundance and localization data
var_info = dat_info_var.select([
    "Variant", "hgvs_nt"
])
print(var_info)

shape: (593, 2)
┌─────────────────┬─────────────────┐
│ Variant         ┆ hgvs_nt         │
│ ---             ┆ ---             │
│ str             ┆ str             │
╞═════════════════╪═════════════════╡
│ CCM2_Val190Met  ┆ CCM2:n.568G>A   │
│ FH_Ala308Thr    ┆ FH:n.922G>A     │
│ CCM2_Ile432Thr  ┆ CCM2:n.1295T>C  │
│ ABCD1_Arg518Gln ┆ ABCD1:n.1553G>A │
│ BRCA1_Pro510Ser ┆ BRCA1:n.1528C>T │
│ …               ┆ …               │
│ CCM2_Val374Met  ┆ CCM2:n.1120G>A  │
│ FARS2_His84Pro  ┆ FARS2:n.251A>C  │
│ BRAF_Val487Gly  ┆ BRAF:n.1460T>G  │
│ G6PD_Ala149Thr  ┆ G6PD:n.445G>A   │
│ MSH2_Ser699Pro  ┆ MSH2:n.2095T>C  │
└─────────────────┴─────────────────┘


In [62]:
# Reformat localization
local = (
    pl.read_csv("/home/shenrunx/igvf/varchamp/2021_09_01_VarChAMP/7.analysis_runxi/output/classify_reimplement/classification_results/2025_01_Batch13-14/je_wAGP/misloc_summary_auroc.csv")
).rename({"allele_0": "Variant"})

# Get the Variant values in local but not in var_info
variants_not_in_var_info = local.join(var_info, on="Variant", how="anti").select("Variant")
variants_not_in_var_info

variants_in_var_info = local.join(var_info, on="Variant", how="inner").select("Variant")
variants_in_var_info

local = local.join(
    var_info, on="Variant"
).select(["hgvs_nt", "mean_auroc", "mislocalized_both_batches"]).rename({
    "mislocalized_both_batches": "Mislocalization_hit",
    "mean_auroc": "score",
})
print(local)

## Reformat morphology
morph = pl.read_csv("/home/shenrunx/igvf/varchamp/2021_09_01_VarChAMP/7.analysis_runxi/output/classify_reimplement/classification_results/2025_01_Batch13-14/je_wAGP/morph_summary_auroc.csv")
morph = morph.rename({"allele_0": "Variant"}).join(
    var_info, on="Variant"
).select(["hgvs_nt", "mean_auroc", "morphological_change_both_batches"]).rename({
    "morphological_change_both_batches": "Morphological_change_hit",
    "mean_auroc": "score",
    })
print(morph)

## Reformat abundance
abun = pl.read_csv("/home/shenrunx/igvf/varchamp/2021_09_01_VarChAMP/7.analysis_runxi/output/classify_reimplement/classification_results/2025_01_Batch13-14/je_wAGP/well-level_abundance_changes.csv").join(
    var_info, on="Variant"
).select(["hgvs_nt", "U2OS_t"]).rename({"U2OS_t": "score"})
abun

## write out scores
local.write_csv("./varchamp_data/batch13_14/localization_scores.csv")
morph.write_csv("./varchamp_data/batch13_14/morphological_change_scores.csv")
abun.write_csv("./varchamp_data/batch13_14/abundance_scores.csv")

shape: (447, 3)
┌─────────────────┬──────────┬─────────────────────┐
│ hgvs_nt         ┆ score    ┆ Mislocalization_hit │
│ ---             ┆ ---      ┆ ---                 │
│ str             ┆ f64      ┆ bool                │
╞═════════════════╪══════════╪═════════════════════╡
│ CCM2:n.568G>A   ┆ 0.803612 ┆ false               │
│ FH:n.922G>A     ┆ 0.946719 ┆ true                │
│ CCM2:n.1295T>C  ┆ 0.986383 ┆ true                │
│ ABCD1:n.1553G>A ┆ 0.728458 ┆ false               │
│ BRCA1:n.1528C>T ┆ 0.645235 ┆ false               │
│ …               ┆ …        ┆ …                   │
│ CCM2:n.1177A>G  ┆ 0.787021 ┆ false               │
│ CCM2:n.1120G>A  ┆ 0.66408  ┆ false               │
│ BRAF:n.1460T>G  ┆ 0.669995 ┆ false               │
│ G6PD:n.445G>A   ┆ 0.98663  ┆ true                │
│ MSH2:n.2095T>C  ┆ 0.80533  ┆ false               │
└─────────────────┴──────────┴─────────────────────┘
shape: (447, 3)
┌─────────────────┬──────────┬──────────────────────────┐
│ hgvs_nt

## Format experiment and dataset entries

MaveDB requires several pieces of text metadata for each record (see the [upload guide](https://www.mavedb.org/docs/mavedb/upload_guide.html)). These functions populate all of the key fields required to characterize the VarChAMP data. We decided that there will be one "experimentSet" for each large batch of submitted data. The "method_text" field in the format_experiment function describes the basic wet lab protocol used to generate the data.

In [63]:
def format_experiment(experiment_set_urn=None):
    date = "January_2025"
    dataset = {
        "title" : f"VarChAMP_Imaging_{date}",
        "short_description" : "Protein localization & abundance and cell morphological changes from images of cells.",
        "abstract_text" : "This study measured protein subcellular localization and abundance, and cell morphological changes using fluorescence microscopy.",
        "method_text" : "Entry clones of alleles were transferred using Gateway technology into a mammalian expression pLenti6.2 plasmid containing a C-terminal mNeonGreen fusion (plasmid modified from Addgene 87075). Inserts were verified by restriction digestion and clones that did not produce the expected digestion pattern were omitted from further analysis. Lentiviral constructs were packaged in HEK 293T cells seeded in 96-well plates, then viral supernatant was transferred to spinfect U2OS cells seeded in 384-well plates (4x technical replicates were performed by administering the same viral supernatant to 4 different wells; all viral production and infection was repeated on a separate day for 2x biological replicates). 48 hrs following infection, cells were selected for infection and protein overexpression by applying puromycin for 48 hrs. Cells were then stained with 500 nM MitoTracker Deep Red 1 hr prior to paraformaldehyde fixation. Blocking, permeabilization and staining (8.25 nM Alexa Fluor™ 568 Phalloidin, 1 ug/mL Hoechst 33342, 1.5 ug/mL WGA Alexa Fluor 555) was then performed in one step. All confocal images were captured on a Perkin Elmer Opera Phenix Microscope (20X water objective, 384 wells, 9 fields).",
        "extra_metadata" : {},
        "primary_publication_identifiers" : [],
        "raw_read_identifiers" : [],
    }
    if experiment_set_urn:  # add to an existing experiment set
        dataset["experiment_set_urn"] = experiment_set_urn
    return dataset

The next three functions format each of the scoreSet submissions. Here, the "method_text" describes the data processing pipeline used to compute the submitted scores. The "label" field for each target_sequence must match the hgvs_nt prefix for the variants to map properly.

In [64]:
def format_localization_score_set(gene_info, experiment_urn):
    date = "January_2025"
    target_genes = [
        {
            "name": row["symbol"],
            "category": "protein_coding",
            "external_identifiers": [
                {
                    "identifier": {
                        "dbName": "Ensembl",
                        "identifier": row["ensembl_protein_id"]
                    },
                    'offset': 0,
                },
            ],
            "target_sequence": {
                "sequence": row["ref_cds"],
                "sequence_type": "dna",
                "taxonomy": {
                    "tax_id": 9606,
                },
                "label": row["symbol"] # THIS MUST MATCH THE PREFIX IN THE hgvs_nt COLUMN OF THE SCORE SET
            },
        }
        for row in gene_info.to_dicts()
    ]

    dataset = {
        "title": f"VarChAMP_Imaging_Localization_{date}",
        "short_description": "Protein localization from images of cells.",
        "abstract_text": (
            "This study measured protein subcellular localization using fluorescence microscopy."
        ),
        "method_text": (
            "We used CellProfiler to create morphological profiles of single cells using images from the protein channel (GFP). "
            "Profiles were filtered to remove features with low variance or missing values, and were MAD-normalized within each plate. "
            "Cells with abnormal cytoplasm:nucleoplasm area ratios or with median GFP intensities > 5 MAD from the median were filtered out. "
            "A binary XGBoost classifier was trained to distinguish single-cell profiles for each reference-variant pair, with 4-fold cross-validation and data splits by plate. "
            "Binary XGBoost classifiers were also trained between all possible pairs of control wells that were repeated on each plate, to quantify the well position effect. "
            "Reference-variant classifier AUROC values were compared to the technical well position null AUROC values to determine which ones showed evidence from differences in the protein channel that exceeded technical artifacts. "
            "These 'hits' were considered variants that cause protein mislocalization. "
        ),
        "extra_metadata": {},
        "primary_publication_identifiers": [],
        "experiment_urn": experiment_urn,
        "license_id": 1,
        "target_genes": target_genes,
    }

    return dataset


In [65]:
def format_morphological_change_score_set(gene_info, experiment_urn):
    date = "January_2025"
    target_genes = [
        {
            "name": row["symbol"],
            "category": "protein_coding",
            "external_identifiers": [
                {
                    "identifier": {
                        "dbName": "Ensembl",
                        "identifier": row["ensembl_protein_id"]
                    },
                    'offset': 0,
                },
            ],
            "target_sequence": {
                "sequence": row["ref_cds"],
                "sequence_type": "dna",
                "taxonomy": {
                    "tax_id": 9606,
                },
                "label": row["symbol"] # THIS MUST MATCH THE PREFIX IN THE hgvs_nt COLUMN OF THE SCORE SET
            },
        }
        for row in gene_info.to_dicts()
    ]

    dataset = {
        "title": f"VarChAMP_Imaging_Morphological_Change_{date}",
        "short_description": "Cell morphological changes from images of cells.",
        "abstract_text": (
            "This study measured morphological changes of cells using fluorescence microscopy."
        ),
        "method_text": (
            "We used CellProfiler to create morphological profiles of single cells using images from the DNA channel, AGP channel and Mitochondria channel. "
            "Profiles were filtered to remove features with low variance or missing values, and were MAD-normalized within each plate. "
            "Cells with abnormal cytoplasm:nucleoplasm area ratios or with median GFP intensities > 5 MAD from the median were filtered out. "
            "A binary XGBoost classifier was trained to distinguish single-cell profiles for each reference-variant pair, with 4-fold cross-validation and data splits by plate. "
            "Binary XGBoost classifiers were also trained between all possible pairs of control wells that were repeated on each plate, to quantify the well position effect. "
            "Reference-variant classifier AUROC values were compared to the technical well position null AUROC values to determine which ones showed evidence from differences in the DNA, AGP and Mitochondria channels that exceeded technical artifacts. "
            "These 'hits' were considered variants that cause morphological changes of cells. "
        ),
        "extra_metadata": {},
        "primary_publication_identifiers": [],
        "experiment_urn": experiment_urn,
        "license_id": 1,
        "target_genes": target_genes,
    }

    return dataset

In [66]:
def format_abundance_score_set(gene_info, experiment_urn):
    date = "January_2025"
    target_genes = [
        {
            "name": row["symbol"],
            "category": "protein_coding",
            "external_identifiers": [
                {
                    "identifier": {
                        "dbName": "Ensembl",
                        "identifier": row["ensembl_protein_id"]
                    },
                    'offset': 0,
                },
            ],
            "target_sequence": {
                "sequence": row["ref_cds"],
                "sequence_type": "dna",
                "taxonomy": {
                    "tax_id": 9606,
                },
                "label": row["symbol"] # THIS MUST MATCH THE PREFIX IN THE hgvs_nt COLUMN OF THE SCORE SET
            },
        }
        for row in gene_info.to_dicts()
    ]

    dataset = {
        "title": f"VarChAMP_Imaging_Abundance_{date}",
        "short_description": "Protein abundance from images of cells.",
        "abstract_text": (
            "This study measured protein subcellular abundance using fluorescence microscopy."
        ),
        "method_text": (
            "We used CellProfiler to create morphological profiles of single cells using images from the protein channel (GFP). "
            "Profiles were filtered to remove features with low variance or missing values, and were MAD-normalized within each plate. "
            "Cells with abnormal cytoplasm:nucleoplasm area ratios or with median GFP intensities > 5 MAD from the median were filtered out. "
            "We measured changes in protein abundance across reference-variant pairs by computing differences in median protein intensity, while controlling for plate as a random effect."
        ),
        "extra_metadata": {},
        "primary_publication_identifiers": [],
        "experiment_urn": experiment_urn,
        "license_id": 1,
        "target_genes": target_genes,
    }

    return dataset
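
The three functions above build an identical `target_genes` list. One possible refactor, sketched below, moves that list into a shared helper and adds a check that every hgvs_nt prefix has a matching label. Both `build_target_genes` and `unmatched_prefixes` are hypothetical names, and the dictionary layout is copied from the functions above rather than from the MaveDB schema documentation.

```python
def build_target_genes(gene_rows):
    """Build the target_genes list shared by the three score set payloads."""
    return [
        {
            "name": row["symbol"],
            "category": "protein_coding",
            "external_identifiers": [
                {
                    "identifier": {
                        "dbName": "Ensembl",
                        "identifier": row["ensembl_protein_id"],
                    },
                    "offset": 0,
                },
            ],
            "target_sequence": {
                "sequence": row["ref_cds"],
                "sequence_type": "dna",
                "taxonomy": {"tax_id": 9606},
                "label": row["symbol"],  # must match the hgvs_nt prefix
            },
        }
        for row in gene_rows
    ]


def unmatched_prefixes(hgvs_nt_values, target_genes):
    """Return the hgvs_nt prefixes that have no matching target_sequence label."""
    labels = {gene["target_sequence"]["label"] for gene in target_genes}
    prefixes = {value.split(":", 1)[0] for value in hgvs_nt_values}
    return sorted(prefixes - labels)
```

In this notebook the helper would be called as `build_target_genes(gene_info.to_dicts())`, and an empty result from `unmatched_prefixes` suggests that every variant in a score table has a target to map to.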


## Upload the experiment

The next few steps will upload the data to MaveDB. When developing this, I made many errors, resulting in half-completed submissions. If you log into your MaveDB account and go to the dashboard, you can see a record of all of your uploaded experiments and their associated scores. From this interface, you can delete submissions, which is helpful if you need to start over again.

The first submission creates an experiment "urn" ID. This is like creating an experiment folder in your MaveDB account. Knowing the IDs is useful, because you can append additional submissions to previously created experiments. There is no need to manually track these IDs, since they are available on your MaveDB online dashboard.

In [67]:
# Upload to maveDB
from timeit import default_timer as timer

start = timer()
temp_datasets = list()

# upload experimentSet info
response = requests.post(
    api_url+'experiments/',
    json=format_experiment(),
    headers={"X-API-Key": api_key}
)
response.raise_for_status()  # fail here rather than on a missing "urn" key
response_data = response.json()
created_exp = response_data["urn"]
print(f"uploaded experiment:\t{created_exp}")

uploaded experiment:	tmp:708988c8-806e-4751-a4cb-9d12079fbe5b


## Upload the mislocalization score set

To submit scoreSets to an experiment, we must pass in the experiment urn ID ("created_exp") as a parameter to the API request. Each scoreSet submission also returns a urn ID ("created_ss"). We use this scoreSet ID to post the actual table with all of the scores (e.g., localization_scores.csv) to the correct scoreSet description.
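
The two-step pattern (create the scoreSet record, then attach its scores) is repeated for every measurement, so it can be wrapped in one function. The sketch below is an illustration rather than the mavetools client. `upload_score_set` is a hypothetical helper, and it takes the posting function as an argument so that it can be exercised without network access. The endpoints and the `X-API-Key` header are the ones used in the cells of this notebook.

```python
def upload_score_set(post, api_url, api_key, payload, scores_csv_text):
    """Create a score set, attach its scores, and return the score set URN.

    `post` is any callable with the same keyword interface as requests.post.
    """
    headers = {"X-API-Key": api_key}

    # Step 1: create the score set record and read back its URN.
    response = post(api_url + "score-sets/", json=payload, headers=headers)
    response.raise_for_status()  # stop before reading "urn" from an error body
    urn = response.json()["urn"]

    # Step 2: attach the scores table to that record.
    response = post(
        api_url + f"score-sets/{urn}/variants/data",
        files={"scores_file": ("scores.csv", scores_csv_text)},
        headers=headers,
    )
    response.raise_for_status()
    return urn
```

In the notebook this could be called as `upload_score_set(requests.post, api_url, api_key, format_localization_score_set(gene_info, created_exp), scores_text)`, where `scores_text` holds the contents of the scores CSV. Note that this sketch sends the file text unchanged, whereas the cells in this notebook round-trip it through pandas first.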

In [68]:
# upload dataSet info
response = requests.post(
    api_url+"score-sets/",
    json={**format_localization_score_set(gene_info, experiment_urn=created_exp)},
    headers={"X-API-Key": api_key}
)
response.raise_for_status()  # fail here rather than on a missing "urn" key
response_data = response.json()
created_ss = response_data["urn"]
print(f"uploaded score set:\t{created_ss}")

# upload scores file
response = requests.post(
    api_url+f"score-sets/{created_ss}/variants/data",
    files={
        "scores_file": ("scores.csv", pd.read_csv(f"./varchamp_data/batch13_14/localization_scores.csv").to_csv()),
    },
    headers={"X-API-Key": api_key}
)
response.raise_for_status()
print(f"uploaded scores for score set:\t{created_ss}")

# finish up
end = timer()
print(f"elapsed time:\t{end - start:.2f}", end="\n\n")

temp_datasets.append(created_ss)

with open("temp_accessions.txt", "w") as handle:
    for urn_ss in temp_datasets:
        print(urn_ss, file=handle)

uploaded score set:	tmp:dac239c5-5b08-42d0-81d3-acf379f35b65
uploaded scores for score set:	tmp:dac239c5-5b08-42d0-81d3-acf379f35b65
elapsed time:	4.23



## Upload the morphological change scores

Now we repeat the process for the morphological change scores. 

In [69]:
# upload dataSet info
response = requests.post(
    api_url+"score-sets/",
    json={**format_morphological_change_score_set(gene_info, experiment_urn=created_exp)},
    headers={"X-API-Key": api_key}
)
response.raise_for_status()  # fail here rather than on a missing "urn" key
response_data = response.json()
created_ss = response_data["urn"]
print(f"uploaded score set:\t{created_ss}")

# upload scores file
response = requests.post(
    api_url+f"score-sets/{created_ss}/variants/data",
    files={
        "scores_file": ("scores.csv", pd.read_csv(f"./varchamp_data/batch13_14/morphological_change_scores.csv").to_csv()),
    },
    headers={"X-API-Key": api_key}
)
response.raise_for_status()
print(f"uploaded scores for score set:\t{created_ss}")

# finish up
end = timer()
print(f"elapsed time:\t{end - start:.2f}", end="\n\n")

temp_datasets.append(created_ss)

with open("temp_accessions.txt", "w") as handle:
    for urn_ss in temp_datasets:
        print(urn_ss, file=handle)

uploaded score set:	tmp:078e8707-c602-4a60-be1e-b75148e593ce
uploaded scores for score set:	tmp:078e8707-c602-4a60-be1e-b75148e593ce
elapsed time:	6.06



## Upload the abundance scores

Now we repeat the process for the abundance scores. 

In [70]:
# upload dataSet info
response = requests.post(
    api_url+"score-sets/",
    json={**format_abundance_score_set(gene_info, experiment_urn=created_exp)},
    headers={"X-API-Key": api_key}
)
response.raise_for_status()  # fail here rather than on a missing "urn" key
response_data = response.json()
created_ss = response_data["urn"]
print(f"uploaded score set:\t{created_ss}")

# upload scores file
response = requests.post(
    api_url+f"score-sets/{created_ss}/variants/data",
    files={
        "scores_file": ("scores.csv", pd.read_csv(f"./varchamp_data/batch13_14/abundance_scores.csv").to_csv()),
    },
    headers={"X-API-Key": api_key}
)
response.raise_for_status()
print(f"uploaded scores for score set:\t{created_ss}")

# finish up
end = timer()
print(f"elapsed time:\t{end - start:.2f}", end="\n\n")

temp_datasets.append(created_ss)

with open("temp_accessions.txt", "w") as handle:
    for urn_ss in temp_datasets:
        print(urn_ss, file=handle)

uploaded score set:	tmp:a863e7b9-d16f-4419-8107-e51ef2a2b1d2
uploaded scores for score set:	tmp:a863e7b9-d16f-4419-8107-e51ef2a2b1d2
elapsed time:	7.94

