# Upload VarChAMP data to MaveDB
Jess Ewald (modified from notebook by Alan Rubin)
2024-11-22

This notebook uploads VarChAMP experiments and score sets from the imaging work to [MaveDB](https://www.mavedb.org/) using the Python API client implemented in [mavetools](https://github.com/VariantEffect/mavetools). 

In [95]:
import asyncio
import urllib.request
import json
import csv
import os
import pandas as pd
import polars as pl
import requests
from fqfa.util.translate import translate_dna
from mavehgvs import Variant
from mavedb import __version__ as mavedb_version

## Set up API key and endpoint

You can view your API key by logging into MaveDB and then visiting the [settings](https://mavedb.org/#/settings) page.
Copy the API key here, as this is required by the client to create records and view your private records.
You can also set up the API key using an environment variable `MAVEDB_API_KEY`.

In [96]:
# Prefer the MAVEDB_API_KEY environment variable; otherwise paste your key below.
# Avoid saving a real key in a shared copy of this notebook.
api_key = os.environ.get("MAVEDB_API_KEY", "<your-api-key>")

# API URL for the production MaveDB instance
api_url = "https://api.mavedb.org/api/v1/"

If you are having problems with validation, compare the version of the MaveDB data models mavetools is using with the version of MaveDB running on the server you are accessing.

In [97]:
with urllib.request.urlopen(f"{api_url}api/version") as response:
    r = response.read()
    print(f"API version:{json.loads(r)['version']:>15}")
print(f"Module version:{mavedb_version:>12}")

API version:       2024.4.2
Module version:    2024.4.1
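
The two versions above are compared by eye. As a minimal sketch, the check could be automated, assuming both strings follow the dotted numeric form shown in the output. The helper names and the rule of comparing only the first two components are assumptions made for illustration, not part of MaveDB or mavetools.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted numeric version such as '2024.4.2' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def versions_compatible(api_version: str, module_version: str, depth: int = 2) -> bool:
    """Return True when the first `depth` components match (an assumed rule of thumb)."""
    return parse_version(api_version)[:depth] == parse_version(module_version)[:depth]


# Values copied from the output above
print(versions_compatible("2024.4.2", "2024.4.1"))
```

If this returns `False` and validation is failing, a version mismatch is a reasonable first thing to rule out.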


## Format the input data

MaveDB requires data in specific formats, including precisely formatted identifiers and column names. The key changes here are creating the "hgvs_nt" column, appending the Ensembl IDs (without the version number), adding the target sequences, and reformatting the variant nucleotide changes. Each uploaded score_set file must have a "score" column. Also, target (gene) labels may not contain spaces.

In [None]:
# format allele file
dat_info = pl.read_csv("./jess_allles_with_mutated_cds.tsv", separator="\t").with_columns(
    pl.col("variant").str.replace("_.*", "").alias("symbol")
)

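# replace every space (str.replace would only change the first one),
# because target labels may not contain spaces
dat_info = dat_info.with_columns(
    pl.col("symbol").str.replace_all(" ", "_").alias("symbol")
)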

dat_info = dat_info.with_columns(
    pl.concat_str([
        pl.col("symbol"),
        pl.lit(":n."),
        pl.col("nt_change")
    ], separator="").alias("hgvs_nt"),
    pl.col("ensembl_protein_id").str.replace(r"\..*", "", literal=False).alias("ensembl_protein_id")
).rename({"variant": "Variant"})

# get only gene-level info
gene_info = dat_info.select(
    ["symbol", "ensembl_protein_id", "ref_cds"]
).unique()
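
To make the string transformations in this cell concrete, here is a plain-Python sketch that applies them to a single record. The function name `format_record` and the example values are hypothetical. The sketch assumes, as the regular expressions above imply, that the variant name starts with the gene symbol followed by an underscore.

```python
import re


def format_record(variant: str, nt_change: str, ensembl_protein_id: str) -> dict:
    """Apply the same string transformations as the polars cell to one record."""
    # Gene symbol: everything before the first underscore in the variant name
    symbol = re.sub(r"_.*", "", variant, count=1)
    # Target labels may not contain spaces
    symbol = symbol.replace(" ", "_")
    return {
        "symbol": symbol,
        # hgvs_nt is "<target label>:n.<nucleotide change>"
        "hgvs_nt": f"{symbol}:n.{nt_change}",
        # Drop the version suffix from the Ensembl ID
        "ensembl_protein_id": re.sub(r"\..*", "", ensembl_protein_id, count=1),
    }


# Hypothetical values, for illustration only
print(format_record("GENE1_Ala2Val", "5C>T", "ENSP00000000001.3"))
```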

We decided that there will be a separate scoreSet for each measurement, so here we keep the localization and abundance scores separate. Each table must have an "hgvs_nt" column and a "score" column. Additional columns can provide complementary statistics for the "score" (e.g., p-value, confidence interval).

In [101]:
# Format abundance and localization data
var_info = dat_info.select([
    "Variant", "hgvs_nt"
])

# Reformat localization
local = pl.read_csv("./varchamp_data/1_auroc.csv").join(
    var_info, on="Variant"
).select(["hgvs_nt", "mean_auroc", "Mislocalized_both_batches"]).rename({
    "Mislocalized_both_batches": "Mislocalization_hit",
    "mean_auroc": "score",
    })

# Reformat abundance
abun = pl.read_csv("./varchamp_data/2_abundance_changes.csv").join(
    var_info, on="Variant"
).select(["hgvs_nt", "U2OS_Z"]).rename({"U2OS_Z": "score"})

# write out scores
local.write_csv("./varchamp_data/localization_scores.csv")
abun.write_csv("./varchamp_data/abundance_scores.csv")
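
Because a malformed table only fails once it reaches the server, a quick local check before uploading can save a round trip. The following is a minimal sketch using only the standard library. The function name `check_score_table` is hypothetical, and the rules it enforces (both required columns present, a space-free prefix before the colon, a score that parses as a number) are a reading of the requirements stated above, not MaveDB's actual validator. In particular, MaveDB may accept missing scores that this check would flag.

```python
import csv
import io


def check_score_table(csv_text: str) -> list[str]:
    """Return a list of problems found in a score table; empty if none were found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    problems = [
        f"missing column: {name}" for name in ("hgvs_nt", "score") if name not in columns
    ]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        hgvs_nt = row["hgvs_nt"] or ""
        prefix, sep, _ = hgvs_nt.partition(":")
        # The prefix before ":" is the target label, which may not contain spaces
        if not sep or not prefix or " " in prefix:
            problems.append(f"line {line_no}: bad hgvs_nt {hgvs_nt!r}")
        try:
            float(row["score"] or "")
        except ValueError:
            # Assumption: scores must be numeric; adjust if missing values are allowed
            problems.append(f"line {line_no}: non-numeric score {row['score']!r}")
    return problems


# Hypothetical table contents
example = "hgvs_nt,score\nGENE1:n.5C>T,0.91\nGENE 2:n.7G>A,abc\n"
print(check_score_table(example))
```

With the files written above, this could be called as `check_score_table(open("./varchamp_data/localization_scores.csv").read())`.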


## Format experiment and dataset entries

MaveDB requires several pieces of text metadata for each record (see the [upload guide](https://www.mavedb.org/docs/mavedb/upload_guide.html)). These functions populate all of the key fields required to characterize the VarChAMP data. We decided that there will be one "experimentSet" for each large batch of submitted data. The "method_text" field in the `format_experiment` function describes the basic wet lab protocol used to generate the data.

In [102]:
def format_experiment(experiment_set_urn=None):

    dataset = {
        "title" : "VarChAMP_Imaging_November_2024",
        "short_description" : "Protein localization and abundance from images of cells.",
        "abstract_text" : "This study measured protein subcellular localization and abundance using fluorescence microscopy.",
        "method_text" : "Entry clones of alleles were transferred using Gateway technology into a mammalian expression pLenti6.2 plasmid containing a C-terminal mNeonGreen fusion (plasmid modified from Addgene 87075). Inserts were verified by restriction digestion and clones that did not produce the expected digestion pattern were omitted from further analysis. Lentiviral constructs were packaged in HEK 293T cells seeded in 96-well plates, then viral supernatant was transferred to spinfect U2OS cells seeded in 384-well plates (4x technical replicates were performed by administering the same viral supernatant to 4 different wells; all viral production and infection was repeated on a separate day for 2x biological replicates). 48 hrs following infection, cells were selected for infection and protein overexpression by applying puromycin for 48 hrs. Cells were then stained with 500 nM MitoTracker Deep Red 1 hr prior to paraformaldehyde fixation. Blocking, permeabilization and staining (8.25 nM Alexa Fluor™ 568 Phalloidin, 1 ug/mL Hoechst 33342, 1.5 ug/mL WGA Alexa Fluor 555) was then performed in one step. All confocal images were captured on a Perkin Elmer Opera Phenix Microscope (20X water objective, 384 wells, 9 fields).",
        "extra_metadata" : {},
        "primary_publication_identifiers" : [],
        "raw_read_identifiers" : [],
    }
    if experiment_set_urn:  # add to an existing experiment set
        dataset["experiment_set_urn"] = experiment_set_urn
    return dataset

The next two functions format each of the scoreSet submissions. Here, the "method_text" describes the data processing pipeline used to compute the submitted scores. The "label" field for each target_sequence must match the hgvs_nt prefix for the variants to map properly. 

In [None]:
def format_localization_score_set(gene_info, experiment_urn):
    target_genes = [
        {
            "name": row["symbol"],
            "category": "protein_coding",
            "external_identifiers": [
                {
                    "identifier": {
                        "dbName": "Ensembl",
                        "identifier": row["ensembl_protein_id"]
                    },
                    'offset': 0,
                },
            ],
            "target_sequence": {
                "sequence": row["ref_cds"],
                "sequence_type": "dna",
                "taxonomy": {
                    "tax_id": 9606,
                },
                "label": row["symbol"] # THIS MUST MATCH THE PREFIX IN THE hgvs_nt COLUMN OF THE SCORE SET
            },
        }
        for row in gene_info.to_dicts()
    ]

    dataset = {
        "title": "VarChAMP_Imaging_Localization_November_2024",
        "short_description": "Protein localization from images of cells.",
        "abstract_text": (
            "This study measured protein subcellular localization using fluorescence microscopy."
        ),
        "method_text": (
            "We used CellProfiler to create morphological profiles of single cells using images from the protein channel (GFP). "
            "Profiles were filtered to remove features with low variance or missing values, and were MAD-normalized within each plate. "
            "Cells with abnormal cytoplasm:nucleoplasm area ratios or with median GFP intensities > 5 MAD from the median were filtered out. "
            "A binary XGBoost classifier was trained to distinguish single-cell profiles for each reference-variant pair, with 4-fold cross-validation and data splits by plate. "
            "Binary XGBoost classifiers were also trained between all possible pairs of control wells that were repeated on each plate, to quantify the well position effect. "
            "Reference-variant classifier AUROC values were compared to the technical well position null AUROC values to determine which pairs showed evidence of differences in the protein channel that exceeded technical artifacts. "
            "These 'hits' were considered variants that cause protein mislocalization. "
        ),
        "extra_metadata": {},
        "primary_publication_identifiers": [],
        "experiment_urn": experiment_urn,
        "license_id": 1,
        "target_genes": target_genes,
    }

    return dataset


In [None]:
def format_abundance_score_set(gene_info, experiment_urn):
    target_genes = [
        {
            "name": row["symbol"],
            "category": "protein_coding",
            "external_identifiers": [
                {
                    "identifier": {
                        "dbName": "Ensembl",
                        "identifier": row["ensembl_protein_id"]
                    },
                    'offset': 0,
                },
            ],
            "target_sequence": {
                "sequence": row["ref_cds"],
                "sequence_type": "dna",
                "taxonomy": {
                    "tax_id": 9606,
                },
                "label": row["symbol"] # THIS MUST MATCH THE PREFIX IN THE hgvs_nt COLUMN OF THE SCORE SET
            },
        }
        for row in gene_info.to_dicts()
    ]

    dataset = {
        "title": "VarChAMP_Imaging_Abundance_November_2024",
        "short_description": "Protein abundance from images of cells.",
        "abstract_text": (
            "This study measured protein abundance using fluorescence microscopy."
        ),
        "method_text": (
            "We used CellProfiler to create morphological profiles of single cells using images from the protein channel (GFP). "
            "Profiles were filtered to remove features with low variance or missing values, and were MAD-normalized within each plate. "
            "Cells with abnormal cytoplasm:nucleoplasm area ratios or with median GFP intensities > 5 MAD from the median were filtered out. "
            "We measured changes in protein abundance across reference-variant pairs by computing differences in median protein intensity, while controlling for plate as a random effect."
        ),
        "extra_metadata": {},
        "primary_publication_identifiers": [],
        "experiment_urn": experiment_urn,
        "license_id": 1,
        "target_genes": target_genes,
    }

    return dataset
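
Since variants only map when each "label" matches the prefix used in the "hgvs_nt" column, it can help to compare the two before uploading. This is a minimal sketch; the function name `unmatched_prefixes` and the example inputs are hypothetical, and the inputs are shaped like the `target_genes` list built in the functions above.

```python
def unmatched_prefixes(hgvs_nt_values, target_genes):
    """Return hgvs_nt prefixes that have no target_sequence label to map onto."""
    labels = {gene["target_sequence"]["label"] for gene in target_genes}
    # The prefix is everything before the first ":" in each hgvs_nt string
    prefixes = {value.split(":", 1)[0] for value in hgvs_nt_values}
    return sorted(prefixes - labels)


# Hypothetical inputs shaped like the structures built above
target_genes = [
    {"name": "GENE1", "target_sequence": {"label": "GENE1"}},
    {"name": "GENE2", "target_sequence": {"label": "GENE2"}},
]
hgvs_nt_values = ["GENE1:n.5C>T", "GENE2:n.7G>A", "GENE3:n.1A>G"]
print(unmatched_prefixes(hgvs_nt_values, target_genes))  # ['GENE3']
```

With the notebook's own objects, the inputs would be `local["hgvs_nt"].to_list()` and the `"target_genes"` entry of the dictionary returned by `format_localization_score_set`. An empty result means every variant has a target to map to.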


## Upload the experiment

The next few steps upload the data to MaveDB. When developing this, I made many errors, resulting in half-completed submissions. If you log into your MaveDB account and go to the dashboard, you can see a record of all of your uploaded experiments and their associated scores. From this interface you can delete submissions, which is helpful if you need to start over.

The first submission creates an experiment "urn" ID. This is like creating an experiment folder in your MaveDB account. Knowing the IDs is useful because you can append additional submissions to previously created experiments. There is no need to track these IDs manually, as they are available on your MaveDB online dashboard.

In [105]:
# Upload to maveDB
from timeit import default_timer as timer

start = timer()

temp_datasets = list()

# upload experimentSet info
response = requests.post(
    api_url+'experiments/',
    json=format_experiment(),
    headers={"X-API-Key": api_key}
)
response.raise_for_status()  # stop here if the request was rejected
response_data = response.json()
created_exp = response_data["urn"]
print(f"uploaded experiment:\t{created_exp}")


uploaded experiment:	tmp:d991ec9d-8f02-48ac-a9e0-312aeb1d8c06
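
When a submission is rejected, the reason is usually in the response body rather than in the status code alone. The following is a minimal sketch of surfacing that message before reading the URN. The helper name `extract_urn` is hypothetical, and the `"detail"` key is an assumption about how the server reports errors; if it is absent, the whole payload is shown instead.

```python
def extract_urn(status_code: int, payload: dict) -> str:
    """Return the URN from a creation response, or raise with the server's message."""
    if status_code >= 400:
        # "detail" is an assumed key for the error message; fall back to the whole payload
        message = payload.get("detail", payload)
        raise RuntimeError(f"request failed ({status_code}): {message}")
    return payload["urn"]


# Hypothetical responses, for illustration only
print(extract_urn(200, {"urn": "tmp:example"}))
try:
    extract_urn(422, {"detail": "validation error"})
except RuntimeError as err:
    print(err)
```

In the cell above this would be used as `created_exp = extract_urn(response.status_code, response.json())`.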


## Upload the mislocalization score set

To submit scoreSets to an experiment, we must pass the experiment urn ID ("created_exp") as a parameter in the API request. Each scoreSet submission also returns a urn ID ("created_ss"). We use this scoreSet ID to post the actual table of scores (i.e., localization_scores.csv) to the correct scoreSet record.

In [106]:
# upload dataSet info
response = requests.post(
    api_url+"score-sets/",
    json={**format_localization_score_set(gene_info, experiment_urn=created_exp)},
    headers={"X-API-Key": api_key}
)
response.raise_for_status()  # stop here if the request was rejected
response_data = response.json()
created_ss = response_data["urn"]
print(f"uploaded score set:\t{created_ss}")

# upload scores file
response = requests.post(
    api_url+f"score-sets/{created_ss}/variants/data",
    files={
        # read back the file written earlier in this notebook (relative path keeps the notebook portable)
        "scores_file": ("scores.csv", pd.read_csv("./varchamp_data/localization_scores.csv").to_csv()),
    },
    headers={"X-API-Key": api_key}
)
response.raise_for_status()
print(f"uploaded scores for score set:\t{created_ss}")

# finish up
end = timer()
print(f"elapsed time:\t{end - start:.2f}", end="\n\n")

temp_datasets.append(created_ss)

with open("temp_accessions.txt", "w") as handle:
    for urn_ss in temp_datasets:
        print(urn_ss, file=handle)


uploaded score set:	tmp:8e561ceb-f17d-41ce-a746-a21abfeec2ea
uploaded scores for score set:	tmp:8e561ceb-f17d-41ce-a746-a21abfeec2ea
elapsed time:	4.52



## Upload the abundance scores

Now we repeat the process for the abundance scores. 

In [107]:
# upload dataSet info
response = requests.post(
    api_url+"score-sets/",
    json={**format_abundance_score_set(gene_info, experiment_urn=created_exp)},
    headers={"X-API-Key": api_key}
)
response.raise_for_status()  # stop here if the request was rejected
response_data = response.json()
created_ss = response_data["urn"]
print(f"uploaded score set:\t{created_ss}")

# upload scores file
response = requests.post(
    api_url+f"score-sets/{created_ss}/variants/data",
    files={
        # read back the file written earlier in this notebook (relative path keeps the notebook portable)
        "scores_file": ("scores.csv", pd.read_csv("./varchamp_data/abundance_scores.csv").to_csv()),
    },
    headers={"X-API-Key": api_key}
)
response.raise_for_status()
print(f"uploaded scores for score set:\t{created_ss}")

# finish up
end = timer()
print(f"elapsed time:\t{end - start:.2f}", end="\n\n")

temp_datasets.append(created_ss)

with open("temp_accessions.txt", "w") as handle:
    for urn_ss in temp_datasets:
        print(urn_ss, file=handle)


uploaded score set:	tmp:af037e24-b2a8-4ccf-898c-031ceed49299
uploaded scores for score set:	tmp:af037e24-b2a8-4ccf-898c-031ceed49299
elapsed time:	8.08
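
The two upload cells differ only in the metadata function and the scores file, so the shared steps could be wrapped in one function. This is a sketch rather than part of the original workflow. The name `upload_score_set` is hypothetical, and the `post` argument is expected to be `requests.post` or any callable with the same signature, which keeps the function easy to try out without touching the server.

```python
def upload_score_set(post, api_url, api_key, metadata, scores_csv_text):
    """Create a score set record, then attach its scores file; return the new URN.

    `post` is any callable with the signature of `requests.post`.
    """
    headers = {"X-API-Key": api_key}

    # Step 1: create the score set record from its metadata
    response = post(api_url + "score-sets/", json=metadata, headers=headers)
    response.raise_for_status()
    urn = response.json()["urn"]

    # Step 2: post the scores table to the record that was just created
    response = post(
        api_url + f"score-sets/{urn}/variants/data",
        files={"scores_file": ("scores.csv", scores_csv_text)},
        headers=headers,
    )
    response.raise_for_status()
    return urn
```

For the abundance data, a call might look like `upload_score_set(requests.post, api_url, api_key, format_abundance_score_set(gene_info, created_exp), pd.read_csv("./varchamp_data/abundance_scores.csv").to_csv())`, with the returned URN appended to `temp_datasets` as in the cells above.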

