diff --git a/README.md b/README.md
index 99c4dc71..585aaa5b 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
     - [Requirements](#requirements)
   - [Getting started](#getting-started)
     - [Documentation](#documentation)
-    - [TCGA Data](#tcga-data)
+    - [Data](#data)
   - [Contributing](#contributing)
     - [1 - Download the repository](#1---download-the-repository)
     - [2 - Create a conda environment](#2---create-a-conda-environment)
@@ -72,11 +72,13 @@ You can also try them from your browser (on synthetic data only):
 
 The documentation is hosted [here on ReadTheDocs](https://pydeseq2.readthedocs.io/en/latest/). If you want to have the latest version of the documentation, you can build it from source. Please go to the dedicated [README.md](https://github.com/owkin/PyDESeq2/blob/main/docs/README.md) for information on how to do so.
 
-### TCGA Data
+### Data
 
-The quick start notebooks either use synthetic data (provided in this repo) or data from [The Cancer Genome Atlas](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga).
+The quick start notebooks use synthetic data, provided in this repo (see [datasets](https://github.com/owkin/PyDESeq2/blob/main/datasets/README.md)).
 
-For more information on how to obtain and organize TCGA data, see [datasets](https://github.com/owkin/PyDESeq2/blob/main/datasets/README.md).
+The experiments described in our [preprint](https://www.biorxiv.org/content/10.1101/2022.12.14.520412v1) rely on data
+from [The Cancer Genome Atlas](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga),
+which may be obtained from this [portal](https://portal.gdc.cancer.gov/).
 
 ## Contributing
 
diff --git a/datasets/README.md b/datasets/README.md
index dc2d2074..f4add701 100644
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -1,35 +1,16 @@
-# TCGA Dataset
+# Datasets
 
-Data from The Cancer Genome Atlas can be obtained [here](https://portal.gdc.cancer.gov/).
+This directory stores example data which can be retrieved using the `load_example_data` function from
+`pydeseq2.utils`. Only synthetic data is provided for now, but new datasets might be available in the future.
 
 ## Folder organisation
 
-The `tcga_data` is supposed to be organized in the following way:
-
 ```
 PyDESeq2
 │
 └───datasets
     │
-    └───tcga_data
-        │
-        └───Clinical
-        │   TCGA-BRCA_clinical.tsv.gz
-        │   TCGA-COAD_clinical.tsv.gz
-        │   TCGA-LUAD_clinical.tsv.gz
-        │   TCGA-LUSC_clinical.tsv.gz
-        │   TCGA-PAAD_clinical.tsv.gz
-        │   TCGA-PRAD_clinical.tsv.gz
-        │   TCGA-READ_clinical.tsv.gz
-        │   TCGA-SKCM_clinical.tsv.gz
-        │
-        └───Gene_expressions
-            TCGA-BRCA_raw_RNAseq.tsv.gz
-            TCGA-COAD_raw_RNAseq.tsv.gz
-            TCGA-LUAD_raw_RNAseq.tsv.gz
-            TCGA-LUSC_raw_RNAseq.tsv.gz
-            TCGA-PAAD_raw_RNAseq.tsv.gz
-            TCGA-PRAD_raw_RNAseq.tsv.gz
-            TCGA-READ_raw_RNAseq.tsv.gz
-            TCGA-SKCM_raw_RNAseq.tsv.gz
+    └───synthetic
+    │   test_clinical.csv
+    │   test_counts.csv
 ```
diff --git a/notebooks/PyDESeq2_minimal_example.ipynb b/notebooks/PyDESeq2_minimal_example.ipynb
index 7477e6b5..d17b7fbc 100644
--- a/notebooks/PyDESeq2_minimal_example.ipynb
+++ b/notebooks/PyDESeq2_minimal_example.ipynb
@@ -9,19 +9,7 @@
     "\n",
     "This notebook gives a minimalistic example of how to perform DEA using PyDESeq2.\n",
     "\n",
-    "It allows you to run the PyDESeq2 pipeline either on synthetic data or on one of the following TCGA datasets:\n",
-    "- TCGA-BRCA\n",
-    "- TCGA-COAD\n",
-    "- TCGA-LUAD\n",
-    "- TCGA-LUSC\n",
-    "- TCGA-PAAD\n",
-    "- TCGA-PRAD\n",
-    "- TCGA-READ\n",
-    "- TCGA-SKCM.\n",
-    "\n",
-    "While the synthetic data is part of this repository, TCGA data should be downloaded as per the instructions in `datasets/`. \n",
-    "\n",
-    "Running this pipeline may take a few minutes (~5-10 min for TCGA) depending on your setup and on the chosen dataset."
+    "It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository."
    ]
   },
   {
@@ -54,19 +42,7 @@
    "id": "13cf8156",
    "metadata": {},
    "source": [
-    "## Data loading\n",
-    "\n",
-    "See the `datasets` readme for the required data organization. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4984fb79",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "DATASET = \"synthetic\"  # or 'TCGA-BRCA', 'TCGA-COAD', etc."
+    "## Data loading"
    ]
   },
   {
@@ -76,7 +52,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "OUTPUT_PATH = f\"../output_files/{DATASET}\"\n",
+    "OUTPUT_PATH = f\"../output_files/synthetic_example\"  # Replace this with the path where you wish to save outputs\n",
     "os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist"
    ]
   },
@@ -91,7 +67,7 @@
    "source": [
     "counts_df = load_example_data(\n",
     "    modality=\"raw_counts\",\n",
-    "    dataset=DATASET,\n",
+    "    dataset=\"synthetic\",\n",
     "    debug=False,\n",
     ")"
    ]
@@ -105,7 +81,7 @@
    "source": [
     "clinical_df = load_example_data(\n",
     "    modality=\"clinical\",\n",
-    "    dataset=DATASET,\n",
+    "    dataset=\"synthetic\",\n",
     "    debug=False,\n",
     ")"
    ]
@@ -122,26 +98,14 @@
     "counts_df"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "2815adea",
-   "metadata": {},
-   "source": [
-    "Remove samples for which `high_grade` is NaN."
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "581f6bbd",
+   "id": "0ab9ad77",
    "metadata": {},
    "outputs": [],
    "source": [
-    "if DATASET != \"synthetic\":\n",
-    "    samples_to_keep = ~clinical_df.high_grade.isna()\n",
-    "    samples_to_keep.sum()\n",
-    "    counts_df = counts_df.loc[samples_to_keep]\n",
-    "    clinical_df = clinical_df.loc[samples_to_keep]"
+    "clinical_df"
    ]
   },
   {
@@ -149,7 +113,8 @@
    "id": "436ee275",
    "metadata": {},
    "source": [
-    "Filter out genes that have less than 10 counts in total"
+    "Filter out genes that have less than 10 counts in total.\n",
+    "There shouldn't be any in the synthetic dataset, but pre-filtering genes is good practice in general."
    ]
   },
   {
@@ -198,7 +163,7 @@
     "dds = DeseqDataSet(\n",
     "    counts_df,\n",
     "    clinical_df,\n",
-    "    design_factors=\"condition\" if DATASET == \"synthetic\" else \"high_grade\",\n",
+    "    design_factors=\"condition\",  # compare samples based on the \"condition\" column (\"B\" vs \"A\")\n",
     "    refit_cooks=True,\n",
     "    n_cpus=8,\n",
     ")"
diff --git a/notebooks/PyDESeq2_step_by_step_pipeline.ipynb b/notebooks/PyDESeq2_step_by_step_pipeline.ipynb
index 65b703dc..a2c3d3d0 100644
--- a/notebooks/PyDESeq2_step_by_step_pipeline.ipynb
+++ b/notebooks/PyDESeq2_step_by_step_pipeline.ipynb
@@ -9,19 +9,7 @@
     "\n",
     "This notebook details all the steps of the PyDESeq2 pipeline.\n",
     "\n",
-    "It allows you to run the PyDESeq2 pipeline either on synthetic data or on one of the following TCGA datasets:\n",
-    "- TCGA-BRCA\n",
-    "- TCGA-COAD\n",
-    "- TCGA-LUAD\n",
-    "- TCGA-LUSC\n",
-    "- TCGA-PAAD\n",
-    "- TCGA-PRAD\n",
-    "- TCGA-READ\n",
-    "- TCGA-SKCM.\n",
-    "\n",
-    "While the synthetic data is part of this repository, TCGA data should be downloaded as per the instructions in `datasets/`. \n",
-    "\n",
-    "Running this pipeline may take a few minutes (~5-10 min for TCGA) depending on your setup and on the chosen dataset."
+    "It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository."
    ]
   },
   {
@@ -54,19 +42,7 @@
    "id": "13cf8156",
    "metadata": {},
    "source": [
-    "## Data loading\n",
-    "\n",
-    "See the `datasets` readme for the required data organization. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4984fb79",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "DATASET = \"synthetic\"  # or 'TCGA-BRCA', 'TCGA-COAD', etc."
+    "## Data loading"
    ]
   },
   {
@@ -76,7 +52,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "OUTPUT_PATH = f\"../output_files/{DATASET}\"\n",
+    "OUTPUT_PATH = f\"../output_files/synthetic_example\"  # Replace this with the path where you wish to save outputs\n",
     "os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist"
    ]
   },
@@ -89,7 +65,7 @@
    "source": [
     "counts_df = load_example_data(\n",
     "    modality=\"raw_counts\",\n",
-    "    dataset=DATASET,\n",
+    "    dataset=\"synthetic\",\n",
     "    debug=False,\n",
     ")"
    ]
@@ -103,7 +79,7 @@
    "source": [
     "clinical_df = load_example_data(\n",
     "    modality=\"clinical\",\n",
-    "    dataset=DATASET,\n",
+    "    dataset=\"synthetic\",\n",
     "    debug=False,\n",
     ")"
    ]
@@ -120,26 +96,14 @@
     "counts_df"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "2815adea",
-   "metadata": {},
-   "source": [
-    "Remove samples for which `high_grade` is NaN."
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "581f6bbd",
+   "id": "e2853727",
    "metadata": {},
    "outputs": [],
    "source": [
-    "if DATASET != \"synthetic\":\n",
-    "    samples_to_keep = ~clinical_df.high_grade.isna()\n",
-    "    samples_to_keep.sum()\n",
-    "    counts_df = counts_df.loc[samples_to_keep]\n",
-    "    clinical_df = clinical_df.loc[samples_to_keep]"
+    "clinical_df"
    ]
   },
   {
@@ -147,7 +111,8 @@
    "id": "436ee275",
    "metadata": {},
    "source": [
-    "Filter out genes that have less than 10 counts in total"
+    "Filter out genes that have less than 10 counts in total.\n",
+    "There shouldn't be any in the synthetic dataset, but pre-filtering genes is good practice in general."
    ]
   },
   {
@@ -196,7 +161,7 @@
     "dds = DeseqDataSet(\n",
     "    counts_df,\n",
     "    clinical_df,\n",
-    "    design_factors=\"condition\" if DATASET == \"synthetic\" else \"high_grade\",\n",
+    "    design_factors=\"condition\",  # compare samples based on the \"condition\" column (\"B\" vs \"A\")\n",
     "    refit_cooks=True,\n",
     "    n_cpus=8,\n",
     ")"
diff --git a/notebooks/README.md b/notebooks/README.md
index acc75e24..9ba594e8 100644
--- a/notebooks/README.md
+++ b/notebooks/README.md
@@ -1,7 +1,6 @@
 # Notebooks
 
-This folder contains examples of DEA pipelines with PyDESeq2. They assume that TCGA data was downloaded and organized
-as described in the [`datasets`](../datasets/README.md) directory.
+This folder contains examples of DEA pipelines with PyDESeq2. They rely on the synthetic data stored in the [`datasets`](../datasets/README.md) directory.
 
 - `PyDESeq2_minimal_example`: a minimal example of a DEA pipeline.
 - `PyDESeq2_step_by_step_pipeline`: a more detailed version of the above example, where each step is shown.
diff --git a/pydeseq2/utils.py b/pydeseq2/utils.py
index 5d6db3be..16772949 100644
--- a/pydeseq2/utils.py
+++ b/pydeseq2/utils.py
@@ -23,10 +23,11 @@ def load_example_data(
     debug=False,
     debug_seed=42,
 ):
-    """Load synthetic or TCGA data (gene raw counts or clinical) for a given dataset.
+    """Load synthetic example data.
 
-    May load either clinical or rna-seq data.The synthetic data is part of this
-    repo, but TCGA data should be downloaded as per the instructions in `datasets/`.
+    May load either clinical or rna-seq data. For now, this function may only return the
+    synthetic data provided as part of this repo, but new datasets might be added in the
+    future.
 
     Parameters
     ----------
@@ -36,10 +37,11 @@ def load_example_data(
     dataset : str
         The dataset for which to return gene expression data.
         If "synthetic", will return the synthetic data that is used for CI unit tests.
-        Otherwise, must be a valid TCGA dataset. (default: "synthetic").
+        (default: "synthetic").
 
     debug : bool
-        If true, subsample 10 samples and 100 genes at random. (default: False).
+        If true, subsample 10 samples and 100 genes at random.
+        (Note that the "synthetic" dataset is already 10 x 100.) (default: False).
 
     debug_seed : int
         Seed for the debug mode. (default: 42).
@@ -55,20 +57,8 @@ def load_example_data(
     )
 
     assert dataset in [
-        "synthetic",
-        "TCGA-BRCA",
-        "TCGA-COAD",
-        "TCGA-LUAD",
-        "TCGA-LUSC",
-        "TCGA-PAAD",
-        "TCGA-PRAD",
-        "TCGA-READ",
-        "TCGA-SKCM",
-    ], (
-        "The dataset argument must be one of the following: "
-        "synthetic, TCGA-BRCA, TCGA-COAD, TCGA-LUAD, TCGA-LUSC, "
-        "TCGA-PAAD, TCGA-PRAD, TCGA-READ, TCGA-SKCM"
-    )
+        "synthetic"
+    ], "The dataset argument must be one of the following: synthetic."
 
     # Load data
     datasets_path = Path(pydeseq2.__file__).parent.parent / "datasets"
@@ -101,24 +91,8 @@ def load_example_data(
                 index_col=0,
             )
 
-    else:
-        path_to_data = datasets_path / "tcga_data"
-        if modality == "raw_counts":
-            df = pd.read_csv(
-                path_to_data / "Gene_expressions" / f"{dataset}_raw_RNAseq.tsv.gz",
-                compression="gzip",
-                sep="\t",
-                index_col=0,
-            ).T
-        elif modality == "clinical":
-            df = pd.read_csv(
-                path_to_data / "Clinical" / f"{dataset}_clinical.tsv.gz",
-                compression="gzip",
-                sep="\t",
-                index_col=0,
-            )
-
     if debug:
+        # TODO: until we provide a larger dataset, this option is useless
         # subsample 10 samples and 100 genes
         df = df.sample(n=10, axis=0, random_state=debug_seed)
         if modality == "raw_counts":