scverse · BorisMuzellec · Jan 17, 2023 · Jan 16, 2023 · Jan 16, 2023 · Jan 16, 2023
@@ -8,7 +8,7 @@
     - [Requirements](#requirements)
   - [Getting started](#getting-started)
     - [Documentation](#documentation)
-    - [TCGA Data](#tcga-data)
+    - [Data](#data)
   - [Contributing](#contributing)
     - [1 - Download the repository](#1---download-the-repository)
     - [2 - Create a conda environment](#2---create-a-conda-environment)
@@ -72,11 +72,13 @@ You can also try them from your browser (on synthetic data only):
 
 The documentation is hosted [here on ReadTheDocs](https://pydeseq2.readthedocs.io/en/latest/). If you want to have the latest version of the documentation, you can build it from source. Please go to the dedicated [README.md](https://github.com/owkin/PyDESeq2/blob/main/docs/README.md) for information on how to do so.
 
-### TCGA Data
+### Data
 
-The quick start notebooks either use synthetic data (provided in this repo) or data from [The Cancer Genome Atlas](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). 
+The quick start notebooks use synthetic data, provided in this repo (see [datasets](https://github.com/owkin/PyDESeq2/blob/main/datasets/README.md)).
 
-For more information on how to obtain and organize TCGA data, see [datasets](https://github.com/owkin/PyDESeq2/blob/main/datasets/README.md).
+The experiments described in our [preprint](https://www.biorxiv.org/content/10.1101/2022.12.14.520412v1) rely on data
+from [The Cancer Genome Atlas](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga),
+which may be obtained from this [portal](https://portal.gdc.cancer.gov/).
 
 ## Contributing
 

@@ -1,35 +1,16 @@
-# TCGA Dataset
+# Datasets
 
-Data from The Cancer Genome Atlas can be obtained [here](https://portal.gdc.cancer.gov/).
+This directory stores example data which can be retrieved using the `load_example_data` function from
+`pydeseq2.utils`. Only synthetic data is provided for now, but new datasets might be available in the future.
 
 ## Folder organisation
 
-The `tcga_data` is supposed to be organized in the following way:
-
 ```
 PyDESeq2
 │
 └───datasets
         │
-        └───tcga_data   
-            │
-            └───Clinical
-            │       TCGA-BRCA_clinical.tsv.gz
-            │       TCGA-COAD_clinical.tsv.gz
-            │       TCGA-LUAD_clinical.tsv.gz
-            │       TCGA-LUSC_clinical.tsv.gz
-            │       TCGA-PAAD_clinical.tsv.gz
-            │       TCGA-PRAD_clinical.tsv.gz
-            │       TCGA-READ_clinical.tsv.gz
-            │       TCGA-SKCM_clinical.tsv.gz
-            │   
-            └───Gene_expressions
-                    TCGA-BRCA_raw_RNAseq.tsv.gz
-                    TCGA-COAD_raw_RNAseq.tsv.gz
-                    TCGA-LUAD_raw_RNAseq.tsv.gz
-                    TCGA-LUSC_raw_RNAseq.tsv.gz
-                    TCGA-PAAD_raw_RNAseq.tsv.gz
-                    TCGA-PRAD_raw_RNAseq.tsv.gz
-                    TCGA-READ_raw_RNAseq.tsv.gz
-                    TCGA-SKCM_raw_RNAseq.tsv.gz
+        └───synthetic   
+            │       test_clinical.csv
+            │       test_counts.csv
 ```
@@ -9,19 +9,7 @@
     "\n",
     "This notebook gives a minimalistic example of how to perform DEA using PyDESeq2.\n",
     "\n",
-    "It allows you to run the PyDESeq2 pipeline either on synthetic data or on one of the following TCGA datasets:\n",
-    "- TCGA-BRCA\n",
-    "- TCGA-COAD\n",
-    "- TCGA-LUAD\n",
-    "- TCGA-LUSC\n",
-    "- TCGA-PAAD\n",
-    "- TCGA-PRAD\n",
-    "- TCGA-READ\n",
-    "- TCGA-SKCM.\n",
-    "\n",
-    "While the synthetic data is part of this repository, TCGA data should be downloaded as per the instructions in `datasets/`. \n",
-    "\n",
-    "Running this pipeline may take a few minutes (~5-10 min for TCGA) depending on your setup and on the chosen dataset."
+    "It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository."
    ]
   },
   {
@@ -54,19 +42,7 @@
    "id": "13cf8156",
    "metadata": {},
    "source": [
-    "## Data loading\n",
-    "\n",
-    "See the `datasets` readme for the required data organization. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4984fb79",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "DATASET = \"synthetic\"  # or 'TCGA-BRCA', 'TCGA-COAD', etc."
+    "## Data loading"
    ]
   },
   {
@@ -76,7 +52,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "OUTPUT_PATH = f\"../output_files/{DATASET}\"\n",
+    "OUTPUT_PATH = \"../output_files/synthetic_example\"  # Replace this with the path where you wish to save outputs\n",
     "os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist"
    ]
   },
@@ -91,7 +67,7 @@
    "source": [
     "counts_df = load_example_data(\n",
     "    modality=\"raw_counts\",\n",
-    "    dataset=DATASET,\n",
+    "    dataset=\"synthetic\",\n",
     "    debug=False,\n",
     ")"
    ]
@@ -105,7 +81,7 @@
    "source": [
     "clinical_df = load_example_data(\n",
     "    modality=\"clinical\",\n",
-    "    dataset=DATASET,\n",
+    "    dataset=\"synthetic\",\n",
     "    debug=False,\n",
     ")"
    ]
@@ -122,34 +98,23 @@
     "counts_df"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "2815adea",
-   "metadata": {},
-   "source": [
-    "Remove samples for which `high_grade` is NaN."
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "581f6bbd",
+   "id": "0ab9ad77",
    "metadata": {},
    "outputs": [],
    "source": [
-    "if DATASET != \"synthetic\":\n",
-    "    samples_to_keep = ~clinical_df.high_grade.isna()\n",
-    "    samples_to_keep.sum()\n",
-    "    counts_df = counts_df.loc[samples_to_keep]\n",
-    "    clinical_df = clinical_df.loc[samples_to_keep]"
+    "clinical_df"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "436ee275",
    "metadata": {},
    "source": [
-    "Filter out genes that have less than 10 counts in total"
+    "Filter out genes that have fewer than 10 counts in total.\n",
+    "There shouldn't be any in the synthetic dataset, but pre-filtering genes is good practice in general."
    ]
   },
   {
@@ -198,7 +163,7 @@
     "dds = DeseqDataSet(\n",
     "    counts_df,\n",
     "    clinical_df,\n",
-    "    design_factors=\"condition\" if DATASET == \"synthetic\" else \"high_grade\",\n",
+    "    design_factors=\"condition\",  # compare samples based on the \"condition\" column (\"B\" vs \"A\")\n",
     "    refit_cooks=True,\n",
     "    n_cpus=8,\n",
     ")"

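Note on the pre-filtering cell mentioned in the hunk above: the filtering code itself is not part of this diff, so the following is only a minimal sketch of what "filter out genes with fewer than 10 counts in total" could look like with pandas. The toy `counts_df` values and the `genes_to_keep` / `filtered_df` names are assumptions made for illustration; the layout (samples as rows, genes as columns) is also assumed.

```python
import pandas as pd

# Toy counts matrix (made-up values): rows are samples, columns are genes.
counts_df = pd.DataFrame(
    {"gene1": [0, 3, 2], "gene2": [12, 7, 30], "gene3": [5, 5, 0]},
    index=["sample1", "sample2", "sample3"],
)

# Sum each gene over all samples and keep genes with a total of at least 10.
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
filtered_df = counts_df[genes_to_keep]

print(list(filtered_df.columns))
```

Here `gene1` totals 5 and is dropped, while `gene2` (49) and `gene3` (exactly 10) are kept, since only genes strictly below the threshold are removed.
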
@@ -9,19 +9,7 @@
     "\n",
     "This notebook details all the steps of the PyDESeq2 pipeline.\n",
     "\n",
-    "It allows you to run the PyDESeq2 pipeline either on synthetic data or on one of the following TCGA datasets:\n",
-    "- TCGA-BRCA\n",
-    "- TCGA-COAD\n",
-    "- TCGA-LUAD\n",
-    "- TCGA-LUSC\n",
-    "- TCGA-PAAD\n",
-    "- TCGA-PRAD\n",
-    "- TCGA-READ\n",
-    "- TCGA-SKCM.\n",
-    "\n",
-    "While the synthetic data is part of this repository, TCGA data should be downloaded as per the instructions in `datasets/`. \n",
-    "\n",
-    "Running this pipeline may take a few minutes (~5-10 min for TCGA) depending on your setup and on the chosen dataset."
+    "It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository."
    ]
   },
   {
@@ -54,19 +42,7 @@
    "id": "13cf8156",
    "metadata": {},
    "source": [
-    "## Data loading\n",
-    "\n",
-    "See the `datasets` readme for the required data organization. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4984fb79",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "DATASET = \"synthetic\"  # or 'TCGA-BRCA', 'TCGA-COAD', etc."
+    "## Data loading"
    ]
   },
   {
@@ -76,7 +52,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "OUTPUT_PATH = f\"../output_files/{DATASET}\"\n",
+    "OUTPUT_PATH = \"../output_files/synthetic_example\"  # Replace this with the path where you wish to save outputs\n",
     "os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist"
    ]
   },
@@ -89,7 +65,7 @@
    "source": [
     "counts_df = load_example_data(\n",
     "    modality=\"raw_counts\",\n",
-    "    dataset=DATASET,\n",
+    "    dataset=\"synthetic\",\n",
     "    debug=False,\n",
     ")"
    ]
@@ -103,7 +79,7 @@
    "source": [
     "clinical_df = load_example_data(\n",
     "    modality=\"clinical\",\n",
-    "    dataset=DATASET,\n",
+    "    dataset=\"synthetic\",\n",
     "    debug=False,\n",
     ")"
    ]
@@ -120,34 +96,23 @@
     "counts_df"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "2815adea",
-   "metadata": {},
-   "source": [
-    "Remove samples for which `high_grade` is NaN."
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "581f6bbd",
+   "id": "e2853727",
    "metadata": {},
    "outputs": [],
    "source": [
-    "if DATASET != \"synthetic\":\n",
-    "    samples_to_keep = ~clinical_df.high_grade.isna()\n",
-    "    samples_to_keep.sum()\n",
-    "    counts_df = counts_df.loc[samples_to_keep]\n",
-    "    clinical_df = clinical_df.loc[samples_to_keep]"
+    "clinical_df"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "436ee275",
    "metadata": {},
    "source": [
-    "Filter out genes that have less than 10 counts in total"
+    "Filter out genes that have fewer than 10 counts in total.\n",
+    "There shouldn't be any in the synthetic dataset, but pre-filtering genes is good practice in general."
    ]
   },
   {
@@ -196,7 +161,7 @@
     "dds = DeseqDataSet(\n",
     "    counts_df,\n",
     "    clinical_df,\n",
-    "    design_factors=\"condition\" if DATASET == \"synthetic\" else \"high_grade\",\n",
+    "    design_factors=\"condition\",  # compare samples based on the \"condition\" column (\"B\" vs \"A\")\n",
     "    refit_cooks=True,\n",
     "    n_cpus=8,\n",
     ")"

@@ -1,7 +1,6 @@
 # Notebooks
 
-This folder contains examples of DEA pipelines with PyDESeq2. They assume that TCGA data was downloaded and organized
-as described in the [`datasets`](../datasets/README.md) directory.
+This folder contains examples of DEA pipelines with PyDESeq2. They rely on the synthetic data stored in the [`datasets`](../datasets/README.md) directory.
 
 - `PyDESeq2_minimal_example`: a minimal example of a DEA pipeline.
 - `PyDESeq2_step_by_step_pipeline`: a more detailed version of the above example, where each step is shown.
@@ -23,10 +23,11 @@ def load_example_data(
     debug=False,
     debug_seed=42,
 ):
-    """Load synthetic or TCGA data (gene raw counts or clinical) for a given dataset.
+    """Load synthetic example data.
 
-    May load either clinical or rna-seq data.The synthetic data is part of this
-    repo, but TCGA data should be downloaded as per the instructions in `datasets/`.
+    May load either clinical or RNA-seq data. For now, this function may only return the
+    synthetic data provided as part of this repo, but new datasets might be added in the
+    future.
 
     Parameters
     ----------
@@ -36,10 +37,11 @@ def load_example_data(
     dataset : str
         The dataset for which to return gene expression data.
         If "synthetic", will return the synthetic data that is used for CI unit tests.
-        Otherwise, must be a valid TCGA dataset. (default: "synthetic").
+        (default: "synthetic").
 
     debug : bool
-        If true, subsample 10 samples and 100 genes at random. (default: False).
+        If true, subsample 10 samples and 100 genes at random.
+        (Note that the "synthetic" dataset is already 10 x 100.) (default: False).
 
     debug_seed : int
         Seed for the debug mode. (default: 42).
@@ -55,20 +57,8 @@ def load_example_data(
     )
 
     assert dataset in [
-        "synthetic",
-        "TCGA-BRCA",
-        "TCGA-COAD",
-        "TCGA-LUAD",
-        "TCGA-LUSC",
-        "TCGA-PAAD",
-        "TCGA-PRAD",
-        "TCGA-READ",
-        "TCGA-SKCM",
-    ], (
-        "The dataset argument must be one of the following: "
-        "synthetic, TCGA-BRCA, TCGA-COAD, TCGA-LUAD, TCGA-LUSC, "
-        "TCGA-PAAD, TCGA-PRAD, TCGA-READ, TCGA-SKCM"
-    )
+        "synthetic"
+    ], "The dataset argument must be one of the following: synthetic."
 
     # Load data
     datasets_path = Path(pydeseq2.__file__).parent.parent / "datasets"
@@ -101,24 +91,8 @@ def load_example_data(
                 index_col=0,
             )
 
-    else:
-        path_to_data = datasets_path / "tcga_data"
-        if modality == "raw_counts":
-            df = pd.read_csv(
-                path_to_data / "Gene_expressions" / f"{dataset}_raw_RNAseq.tsv.gz",
-                compression="gzip",
-                sep="\t",
-                index_col=0,
-            ).T
-        elif modality == "clinical":
-            df = pd.read_csv(
-                path_to_data / "Clinical" / f"{dataset}_clinical.tsv.gz",
-                compression="gzip",
-                sep="\t",
-                index_col=0,
-            )
-
     if debug:
+        # TODO: until we provide a larger dataset, this option is useless
         # subsample 10 samples and 100 genes
         df = df.sample(n=10, axis=0, random_state=debug_seed)
         if modality == "raw_counts":
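
Review note on the `dataset` check in `load_example_data`: the diff keeps an `assert` for argument validation. Python strips `assert` statements when run with `-O`, so one possible alternative is an explicit `ValueError`. The sketch below is not the code in this PR; `SUPPORTED_DATASETS` and `check_dataset` are hypothetical names introduced only for illustration.

```python
# Hypothetical constant: the only dataset name accepted after this change.
SUPPORTED_DATASETS = ("synthetic",)


def check_dataset(dataset: str) -> str:
    """Return `dataset` unchanged if it is supported, else raise ValueError."""
    if dataset not in SUPPORTED_DATASETS:
        raise ValueError(
            "The dataset argument must be one of the following: "
            + ", ".join(SUPPORTED_DATASETS)
            + f". Got {dataset!r}."
        )
    return dataset


# Usage: a supported name passes through, a removed TCGA name is rejected.
print(check_dataset("synthetic"))
try:
    check_dataset("TCGA-BRCA")
except ValueError as err:
    print(err)
```

Keeping the supported names in one tuple would also make it straightforward to extend the check if new example datasets are added later, as the docstring anticipates.
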