Pandora Paper

This repository contains the code required to reproduce all datasets used in our publication on Pandora, a tool for estimating the uncertainty of dimensionality reduction on population genetics data.

Setting up the environment

To run the code, you need a few packages installed. You can use the provided environment.yaml file and conda:

conda env create --file environment.yaml

Note: if you are working on a MacBook with an M1 or M2 chip, installing EIGENSOFT from conda won't work. In this case, remove the - eigensoft line from the environment file before executing the conda command above. Install EIGENSOFT by following the instructions in the Pandora documentation.
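On Apple Silicon, the manual edit described above can also be scripted. The following is a minimal sketch, not part of the repository: the function name drop_eigensoft is hypothetical, and it assumes the dependency appears in environment.yaml as a list item containing the word eigensoft.

```python
from pathlib import Path


def drop_eigensoft(yaml_text: str) -> str:
    """Return the environment file text without the eigensoft dependency line.

    Assumption: the dependency is a YAML list item such as "- eigensoft",
    possibly with a version pin or channel prefix.
    """
    kept = []
    for line in yaml_text.splitlines():
        stripped = line.strip().lower()
        # Skip only list items that mention eigensoft; keep everything else.
        if stripped.startswith("-") and "eigensoft" in stripped:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


def write_env_without_eigensoft(src: str, dst: str) -> None:
    """Read the environment file at src and write a filtered copy to dst."""
    Path(dst).write_text(drop_eigensoft(Path(src).read_text()))
```

Calling write_env_without_eigensoft("environment.yaml", "environment_arm.yaml") writes a filtered copy (the output filename is only an example), which can then be passed to conda env create --file.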

Reproducing datasets

The empirical datasets are based on genotype files in EIGENSTRAT format, so we benchmark them with a Snakemake pipeline that calls the Pandora command line interface.

The simulated datasets used in our paper are RData exports that must be loaded as Pandora NumpyDatasets, so these benchmarks use Pandora as a Python library.

Empirical datasets

Start the Jupyter server by running jupyter notebook in your terminal. This should automatically open a new tab in your browser.
Open the Jupyter notebook for the dataset you want to recreate. Note that for the Mathieson and Çayönü datasets, you first need to run the NearEastPublic.ipynb notebook, as it provides the modern samples required for both.
Execute the Jupyter notebook.
Done 🙂

You can now use the dataset to run the benchmark pipeline:

First, you need to configure the snakemake pipeline according to the dataset you want to analyze. You can do so by modifying the config.yaml file. See the provided config.yaml for hints on the allowed settings.
Run the pipeline in dry-run mode to see which rules will be run and which files will be created: snakemake -n
Finally, run the pipeline using snakemake -c1. Note that -c1 restricts Snakemake to a single core, so all rules are executed sequentially. This is required so that the benchmarks are correct and don't influence each other.
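The two commands above can be chained from Python so that the real run only starts if the dry run succeeds. This is a minimal sketch, assuming snakemake is on the PATH of the active conda environment. The helper names are hypothetical and not part of the repository.

```python
import subprocess


def snakemake_command(dry_run: bool) -> list[str]:
    """Build the snakemake call: -n for a dry run, -c1 for a single-core run."""
    return ["snakemake", "-n"] if dry_run else ["snakemake", "-c1"]


def run_pipeline(workdir: str = ".") -> None:
    """Dry-run first, then execute sequentially on one core.

    check=True raises CalledProcessError if the dry run fails,
    so the benchmark run is never started on a broken configuration.
    """
    subprocess.run(snakemake_command(dry_run=True), cwd=workdir, check=True)
    subprocess.run(snakemake_command(dry_run=False), cwd=workdir, check=True)
```

Call run_pipeline() from the directory that contains the Snakefile and config.yaml, or pass that directory as workdir.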

You can load the resulting results.parquet file using pandas:

import pandas as pd
df = pd.read_parquet("results.parquet")

Simulated Datasets

To reproduce the simulated dataset benchmarks, you first need to download the data as instructed in the first few lines of the benchmark_simulated_data.py script. You can then adapt the # CONFIG section and set the analysis to PCA or MDS and run the benchmarks:

python benchmark_simulated_data.py

Finally, you can load the resulting results.parquet file using pandas:

import pandas as pd
df = pd.read_parquet("results.parquet")

Download the results

Instead of rerunning the analyses yourself, you can also download all (log)files from our experiments via our lab server. The directory follows this structure:

├── empirical_datasets [1]
│   ├── pca  # PCA results
│   │   ├── {dataset}  # one such directory for each empirical dataset 
│   │   │   ├── no_convergence  # Logs and (intermediate) results with convergence check disabled
│   │   │   ├── convergence_5p  # Logs and (intermediate) results with 5% convergence tolerance
│   │   │   ├── convergence_1p  # Logs and (intermediate) results with 1% convergence tolerance
│   │   │   ├── convergence_5p_10threads  # Logs and (intermediate) results with 5% convergence tolerance and only 10 threads
│   ├── mds  # MDS results
│   │   ├── {dataset}  # one such directory for each empirical dataset 
│   │   │   ├── no_convergence  # Logs and (intermediate) results with convergence check disabled
│   │   │   ├── convergence_5p  # Logs and (intermediate) results with 5% convergence tolerance
│   │   │   ├── convergence_1p  # Logs and (intermediate) results with 1% convergence tolerance
├── simulated_datasets [2]
│   ├── pca  # PCA results
│   │   ├── {dataset}  # one such directory for each simulated dataset 
│   ├── mds  # MDS results (FST population distance)
│   │   ├── {dataset}  # one such directory for each simulated dataset 
│   ├── mds_sample_dist  # MDS (missing corrected Hamming distance)
│   │   ├── {dataset}  # one such directory for each simulated dataset 
├── sliding_window_hgdp  # Sliding-window results
│   ├── 12_windows
│   ├── 50_windows

[1] The empirical datasets include the following:

HO-WE (smartpca settings: 5 outlier iterations, shrinking disabled)
HO-WE-no_outlier (smartpca settings: 0 outlier iterations, shrinking disabled)
HO-WE-shrink (smartpca settings: 5 outlier iterations, shrinking enabled)
HO-WE-230
HO-WE-Cayonu (smartpca settings: 5 outlier iterations, shrinking disabled)
HO-WE-Cayonu-shrink (smartpca settings: 5 outlier iterations, shrinking enabled)
HO-Glob
Goats
Sheep
Panel{i} for i in 1, ..., 13
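To locate the run directories of a single empirical dataset after downloading, the tree above can be turned into paths. This is a hedged sketch: the function name and the local root directory are hypothetical, and it assumes every dataset has all settings listed in the tree, which may not hold for each one.

```python
from pathlib import Path

# Run settings per analysis, copied from the directory tree above.
SETTINGS = {
    "pca": [
        "no_convergence",
        "convergence_5p",
        "convergence_1p",
        "convergence_5p_10threads",
    ],
    "mds": ["no_convergence", "convergence_5p", "convergence_1p"],
}


def empirical_result_dirs(root: Path, analysis: str, dataset: str) -> list[Path]:
    """List the expected setting directories for one empirical dataset."""
    if analysis not in SETTINGS:
        raise ValueError(f"analysis must be one of {sorted(SETTINGS)}")
    base = Path(root) / "empirical_datasets" / analysis / dataset
    return [base / setting for setting in SETTINGS[analysis]]
```

For example, empirical_result_dirs(Path("results"), "pca", "HO-WE") yields four paths, one per PCA setting. Checking each with Path.is_dir() shows which of them exist in the download.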

[2] The simulated datasets include the following:

Cline
Cline_1p (1% missing data)
Cline_10p (10% missing data)
Cline_20p (20% missing data)
Island
Island_1p (1% missing data)
Island_10p (10% missing data)
Island_20p (20% missing data)
P3
P3_1p (1% missing data)
P3_10p (10% missing data)
P3_20p (20% missing data)
P3-mig50
P3-mig50_1p (1% missing data)
P3-mig50_10p (10% missing data)
P3-mig50_20p (20% missing data)
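The sixteen names above follow a regular pattern: four base datasets, each with no suffix or a missing-data suffix. A short sketch that generates them, for example to loop over the downloaded directories. The constant and function names are hypothetical.

```python
from itertools import product

# Base dataset names and missing-data suffixes, taken from the list above.
BASE_DATASETS = ["Cline", "Island", "P3", "P3-mig50"]
MISSING_SUFFIXES = ["", "_1p", "_10p", "_20p"]  # 0%, 1%, 10%, 20% missing data


def simulated_dataset_names() -> list[str]:
    """Return all simulated dataset names in the order listed above."""
    return [base + suffix for base, suffix in product(BASE_DATASETS, MISSING_SUFFIXES)]
```

Each name corresponds to one {dataset} directory under pca, mds, and mds_sample_dist in simulated_datasets.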

Within the individual Pandora result directories, the directory structure follows the specification given in the Pandora documentation.

Publication

The paper explaining the details of Pandora is available as a preprint on bioRxiv:

Haag, J., Jordan, A. I. & Stamatakis, A. (2024). Pandora: A Tool to Estimate Dimensionality Reduction Stability of Genotype Data. bioRxiv. https://doi.org/10.1101/2024.03.14.584962
