10 changes: 6 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
- [Requirements](#requirements)
- [Getting started](#getting-started)
- [Documentation](#documentation)
- [TCGA Data](#tcga-data)
- [Data](#data)
- [Contributing](#contributing)
- [1 - Download the repository](#1---download-the-repository)
- [2 - Create a conda environment](#2---create-a-conda-environment)
Expand Down Expand Up @@ -72,11 +72,13 @@ You can also try them from your browser (on synthetic data only):

The documentation is hosted [here on ReadTheDocs](https://pydeseq2.readthedocs.io/en/latest/). If you want to have the latest version of the documentation, you can build it from source. Please go to the dedicated [README.md](https://github.com/owkin/PyDESeq2/blob/main/docs/README.md) for information on how to do so.

### TCGA Data
### Data

The quick start notebooks either use synthetic data (provided in this repo) or data from [The Cancer Genome Atlas](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga).
The quick start notebooks use synthetic data, provided in this repo (see [datasets](https://github.com/owkin/PyDESeq2/blob/main/datasets/README.md)).

For more information on how to obtain and organize TCGA data, see [datasets](https://github.com/owkin/PyDESeq2/blob/main/datasets/README.md).
The experiments described in our [preprint](https://www.biorxiv.org/content/10.1101/2022.12.14.520412v1) rely on data
from [The Cancer Genome Atlas](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga),
which may be obtained from this [portal](https://portal.gdc.cancer.gov/).

## Contributing

Expand Down
31 changes: 6 additions & 25 deletions datasets/README.md
@@ -1,35 +1,16 @@
# TCGA Dataset
# Datasets

Data from The Cancer Genome Atlas can be obtained [here](https://portal.gdc.cancer.gov/).
This directory stores example data which can be retrieved using the `load_example_data` function from
`pydeseq2.utils`. Only synthetic data is provided for now, but new datasets might be available in the future.

## Folder organisation

The `tcga_data` is supposed to be organized in the following way:

```
PyDESeq2
└───datasets
└───tcga_data
└───Clinical
│ TCGA-BRCA_clinical.tsv.gz
│ TCGA-COAD_clinical.tsv.gz
│ TCGA-LUAD_clinical.tsv.gz
│ TCGA-LUSC_clinical.tsv.gz
│ TCGA-PAAD_clinical.tsv.gz
│ TCGA-PRAD_clinical.tsv.gz
│ TCGA-READ_clinical.tsv.gz
│ TCGA-SKCM_clinical.tsv.gz
└───Gene_expressions
TCGA-BRCA_raw_RNAseq.tsv.gz
TCGA-COAD_raw_RNAseq.tsv.gz
TCGA-LUAD_raw_RNAseq.tsv.gz
TCGA-LUSC_raw_RNAseq.tsv.gz
TCGA-PAAD_raw_RNAseq.tsv.gz
TCGA-PRAD_raw_RNAseq.tsv.gz
TCGA-READ_raw_RNAseq.tsv.gz
TCGA-SKCM_raw_RNAseq.tsv.gz
└───synthetic
│ test_clinical.csv
│ test_counts.csv
```
55 changes: 10 additions & 45 deletions notebooks/PyDESeq2_minimal_example.ipynb
Expand Up @@ -9,19 +9,7 @@
"\n",
"This notebook gives a minimalistic example of how to perform DEA using PyDESeq2.\n",
"\n",
"It allows you to run the PyDESeq2 pipeline either on synthetic data or on one of the following TCGA datasets:\n",
"- TCGA-BRCA\n",
"- TCGA-COAD\n",
"- TCGA-LUAD\n",
"- TCGA-LUSC\n",
"- TCGA-PAAD\n",
"- TCGA-PRAD\n",
"- TCGA-READ\n",
"- TCGA-SKCM.\n",
"\n",
"While the synthetic data is part of this repository, TCGA data should be downloaded as per the instructions in `datasets/`. \n",
"\n",
"Running this pipeline may take a few minutes (~5-10 min for TCGA) depending on your setup and on the chosen dataset."
"It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository."
]
},
{
Expand Down Expand Up @@ -54,19 +42,7 @@
"id": "13cf8156",
"metadata": {},
"source": [
"## Data loading\n",
"\n",
"See the `datasets` readme for the required data organization. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4984fb79",
"metadata": {},
"outputs": [],
"source": [
"DATASET = \"synthetic\" # or 'TCGA-BRCA', 'TCGA-COAD', etc."
"## Data loading"
]
},
{
Expand All @@ -76,7 +52,7 @@
"metadata": {},
"outputs": [],
"source": [
"OUTPUT_PATH = f\"../output_files/{DATASET}\"\n",
"OUTPUT_PATH = \"../output_files/synthetic_example\" # Replace this with the path where you wish to save outputs\n",
"os.makedirs(OUTPUT_PATH, exist_ok=True) # Create path if it doesn't exist"
]
},
Expand All @@ -91,7 +67,7 @@
"source": [
"counts_df = load_example_data(\n",
" modality=\"raw_counts\",\n",
" dataset=DATASET,\n",
" dataset=\"synthetic\",\n",
" debug=False,\n",
")"
]
Expand All @@ -105,7 +81,7 @@
"source": [
"clinical_df = load_example_data(\n",
" modality=\"clinical\",\n",
" dataset=DATASET,\n",
" dataset=\"synthetic\",\n",
" debug=False,\n",
")"
]
Expand All @@ -122,34 +98,23 @@
"counts_df"
]
},
{
"cell_type": "markdown",
"id": "2815adea",
"metadata": {},
"source": [
"Remove samples for which `high_grade` is NaN."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "581f6bbd",
"id": "0ab9ad77",
"metadata": {},
"outputs": [],
"source": [
"if DATASET != \"synthetic\":\n",
" samples_to_keep = ~clinical_df.high_grade.isna()\n",
" samples_to_keep.sum()\n",
" counts_df = counts_df.loc[samples_to_keep]\n",
" clinical_df = clinical_df.loc[samples_to_keep]"
"clinical_df"
]
},
{
"cell_type": "markdown",
"id": "436ee275",
"metadata": {},
"source": [
"Filter out genes that have less than 10 counts in total"
"Filter out genes that have fewer than 10 counts in total.\n",
"There shouldn't be any in the synthetic dataset, but pre-filtering genes is good practice in general."
]
},
{
Expand Down Expand Up @@ -198,7 +163,7 @@
"dds = DeseqDataSet(\n",
" counts_df,\n",
" clinical_df,\n",
" design_factors=\"condition\" if DATASET == \"synthetic\" else \"high_grade\",\n",
" design_factors=\"condition\", # compare samples based on the \"condition\" column (\"B\" vs \"A\")\n",
" refit_cooks=True,\n",
" n_cpus=8,\n",
")"
Expand Down
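
The pre-filtering step mentioned in the notebook above (dropping genes with fewer than 10 counts in total) can be illustrated without pandas. The following is a minimal sketch, not the notebook's actual code: the nested-dict layout, the `filter_low_count_genes` helper, and the toy values are all assumptions made for illustration.

```python
def filter_low_count_genes(counts, min_total=10):
    """Keep only genes whose counts, summed over all samples, reach min_total.

    `counts` maps sample name -> {gene name -> raw count} (hypothetical layout).
    """
    # Sum each gene's counts across samples.
    totals = {}
    for gene_counts in counts.values():
        for gene, value in gene_counts.items():
            totals[gene] = totals.get(gene, 0) + value
    # Genes with fewer than min_total counts in total are dropped.
    kept = {gene for gene, total in totals.items() if total >= min_total}
    return {
        sample: {g: v for g, v in gene_counts.items() if g in kept}
        for sample, gene_counts in counts.items()
    }


# Toy data (made up): gene totals are gene1=19, gene2=7, gene3=9.
toy_counts = {
    "sample1": {"gene1": 12, "gene2": 3, "gene3": 0},
    "sample2": {"gene1": 7, "gene2": 4, "gene3": 9},
}
filtered = filter_low_count_genes(toy_counts)
print(sorted(filtered["sample1"]))
```

The threshold is applied to the total across samples, not per sample, so a gene that is low in one sample can still be kept.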
55 changes: 10 additions & 45 deletions notebooks/PyDESeq2_step_by_step_pipeline.ipynb
Expand Up @@ -9,19 +9,7 @@
"\n",
"This notebook details all the steps of the PyDESeq2 pipeline.\n",
"\n",
"It allows you to run the PyDESeq2 pipeline either on synthetic data or on one of the following TCGA datasets:\n",
"- TCGA-BRCA\n",
"- TCGA-COAD\n",
"- TCGA-LUAD\n",
"- TCGA-LUSC\n",
"- TCGA-PAAD\n",
"- TCGA-PRAD\n",
"- TCGA-READ\n",
"- TCGA-SKCM.\n",
"\n",
"While the synthetic data is part of this repository, TCGA data should be downloaded as per the instructions in `datasets/`. \n",
"\n",
"Running this pipeline may take a few minutes (~5-10 min for TCGA) depending on your setup and on the chosen dataset."
"It allows you to run the PyDESeq2 pipeline on the synthetic data provided in this repository."
]
},
{
Expand Down Expand Up @@ -54,19 +42,7 @@
"id": "13cf8156",
"metadata": {},
"source": [
"## Data loading\n",
"\n",
"See the `datasets` readme for the required data organization. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4984fb79",
"metadata": {},
"outputs": [],
"source": [
"DATASET = \"synthetic\" # or 'TCGA-BRCA', 'TCGA-COAD', etc."
"## Data loading"
]
},
{
Expand All @@ -76,7 +52,7 @@
"metadata": {},
"outputs": [],
"source": [
"OUTPUT_PATH = f\"../output_files/{DATASET}\"\n",
"OUTPUT_PATH = \"../output_files/synthetic_example\" # Replace this with the path where you wish to save outputs\n",
"os.makedirs(OUTPUT_PATH, exist_ok=True) # Create path if it doesn't exist"
]
},
Expand All @@ -89,7 +65,7 @@
"source": [
"counts_df = load_example_data(\n",
" modality=\"raw_counts\",\n",
" dataset=DATASET,\n",
" dataset=\"synthetic\",\n",
" debug=False,\n",
")"
]
Expand All @@ -103,7 +79,7 @@
"source": [
"clinical_df = load_example_data(\n",
" modality=\"clinical\",\n",
" dataset=DATASET,\n",
" dataset=\"synthetic\",\n",
" debug=False,\n",
")"
]
Expand All @@ -120,34 +96,23 @@
"counts_df"
]
},
{
"cell_type": "markdown",
"id": "2815adea",
"metadata": {},
"source": [
"Remove samples for which `high_grade` is NaN."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "581f6bbd",
"id": "e2853727",
"metadata": {},
"outputs": [],
"source": [
"if DATASET != \"synthetic\":\n",
" samples_to_keep = ~clinical_df.high_grade.isna()\n",
" samples_to_keep.sum()\n",
" counts_df = counts_df.loc[samples_to_keep]\n",
" clinical_df = clinical_df.loc[samples_to_keep]"
"clinical_df"
]
},
{
"cell_type": "markdown",
"id": "436ee275",
"metadata": {},
"source": [
"Filter out genes that have less than 10 counts in total"
"Filter out genes that have fewer than 10 counts in total.\n",
"There shouldn't be any in the synthetic dataset, but pre-filtering genes is good practice in general."
]
},
{
Expand Down Expand Up @@ -196,7 +161,7 @@
"dds = DeseqDataSet(\n",
" counts_df,\n",
" clinical_df,\n",
" design_factors=\"condition\" if DATASET == \"synthetic\" else \"high_grade\",\n",
" design_factors=\"condition\", # compare samples based on the \"condition\" column (\"B\" vs \"A\")\n",
" refit_cooks=True,\n",
" n_cpus=8,\n",
")"
Expand Down
3 changes: 1 addition & 2 deletions notebooks/README.md
@@ -1,7 +1,6 @@
# Notebooks

This folder contains examples of DEA pipelines with PyDESeq2. They assume that TCGA data was downloaded and organized
as described in the [`datasets`](../datasets/README.md) directory.
This folder contains examples of DEA pipelines with PyDESeq2. They rely on the synthetic data stored in the [`datasets`](../datasets/README.md) directory.

- `PyDESeq2_minimal_example`: a minimal example of a DEA pipeline.
- `PyDESeq2_step_by_step_pipeline`: a more detailed version of the above example, where each step is shown.
46 changes: 10 additions & 36 deletions pydeseq2/utils.py
Expand Up @@ -23,10 +23,11 @@ def load_example_data(
debug=False,
debug_seed=42,
):
"""Load synthetic or TCGA data (gene raw counts or clinical) for a given dataset.
"""Load synthetic example data.

May load either clinical or rna-seq data. The synthetic data is part of this
repo, but TCGA data should be downloaded as per the instructions in `datasets/`.
May load either clinical or RNA-seq data. For now, this function can only return the
synthetic data provided as part of this repo, but new datasets might be added in the
future.

Parameters
----------
Expand All @@ -36,10 +37,11 @@ def load_example_data(
dataset : str
The dataset for which to return gene expression data.
If "synthetic", will return the synthetic data that is used for CI unit tests.
Otherwise, must be a valid TCGA dataset. (default: "synthetic").
(default: "synthetic").

debug : bool
If true, subsample 10 samples and 100 genes at random. (default: False).
If true, subsample 10 samples and 100 genes at random.
(Note that the "synthetic" dataset is already 10 x 100.) (default: False).

debug_seed : int
Seed for the debug mode. (default: 42).
Expand All @@ -55,20 +57,8 @@ def load_example_data(
)

assert dataset in [
"synthetic",
"TCGA-BRCA",
"TCGA-COAD",
"TCGA-LUAD",
"TCGA-LUSC",
"TCGA-PAAD",
"TCGA-PRAD",
"TCGA-READ",
"TCGA-SKCM",
], (
"The dataset argument must be one of the following: "
"synthetic, TCGA-BRCA, TCGA-COAD, TCGA-LUAD, TCGA-LUSC, "
"TCGA-PAAD, TCGA-PRAD, TCGA-READ, TCGA-SKCM"
)
"synthetic"
], "The dataset argument must be one of the following: synthetic."

# Load data
datasets_path = Path(pydeseq2.__file__).parent.parent / "datasets"
Expand Down Expand Up @@ -101,24 +91,8 @@ def load_example_data(
index_col=0,
)

else:
path_to_data = datasets_path / "tcga_data"
if modality == "raw_counts":
df = pd.read_csv(
path_to_data / "Gene_expressions" / f"{dataset}_raw_RNAseq.tsv.gz",
compression="gzip",
sep="\t",
index_col=0,
).T
elif modality == "clinical":
df = pd.read_csv(
path_to_data / "Clinical" / f"{dataset}_clinical.tsv.gz",
compression="gzip",
sep="\t",
index_col=0,
)

if debug:
# TODO: until we provide a larger dataset, this option is useless
# subsample 10 samples and 100 genes
df = df.sample(n=10, axis=0, random_state=debug_seed)
if modality == "raw_counts":
Expand Down
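
The `debug` option above draws a fixed-size random subset using a seed (`debug_seed`), so that repeated calls return the same rows. A rough standard-library sketch of that idea follows. The `subsample` helper is hypothetical and uses Python's `random` module rather than pandas, so it will not select the same rows as `df.sample`. Clamping `n` to the input size is also an assumption of this sketch, not behavior taken from the diff.

```python
import random


def subsample(rows, n=10, seed=42):
    """Return up to n rows drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator: leaves the global random state untouched
    n = min(n, len(rows))  # assumption: clamp instead of failing on small inputs
    return rng.sample(rows, n)


# Made-up sample names, for illustration only.
samples = [f"sample{i}" for i in range(1, 26)]
first = subsample(samples)
second = subsample(samples)
print(first == second)
```

With the same seed and the same input, both calls return the same selection. When the input has no more than `n` rows, as with a 10 x 100 dataset, every row is returned and the subsampling has no filtering effect.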