# Simulations using a random dataset and scikit-learn blob dataset

To support the first claim of the paper, namely that Pandora is able to detect instability in PCA/MDS analyses, we use two datasets:
1. a dataset of randomly generated samples and features
2. a blob dataset generated with the scikit-learn library

For dataset (1), we expect a low Pandora Stability (PS) value since the data is completely random and there is no structure in the data. For dataset (2), we expect a high PS value since the data is generated such that it contains three distinct "blobs" of data, i.e. three distinct clusters. The signal in the data should be high enough to be stable across different runs of the PCA/MDS analysis on bootstrapped datasets.


In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs

from pandora.dataset import NumpyDataset
from pandora.embedding_comparison import BatchEmbeddingComparison
from pandora.plotting import plot_populations

N_BOOTSTRAPS = 10

## 1. Random Dataset

We generate a random dataset with 100 samples and 10,000 features. We generate that many features since genotype datasets typically contain a large number of features (e.g. SNPs) compared to the number of individuals in the dataset.

In [25]:
np.random.seed(42)
X = np.random.choice([0, 1, 2], (100, 10_000))
samples = pd.Series([f"sample_{i}" for i in range(X.shape[0])])
populations = pd.Series(["random" for _ in range(X.shape[0])])
dataset = NumpyDataset(X, sample_ids=samples, populations=populations, dtype=np.float64)

# generate N_BOOTSTRAPS bootstrap datasets, each with its own seed for reproducibility
bootstrap_datasets = [dataset.bootstrap(seed=i) for i in range(N_BOOTSTRAPS)]
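
To make the resampling step concrete, here is a minimal NumPy-only sketch. It assumes that a bootstrap replicate is obtained by drawing the feature columns with replacement while keeping all samples; this notebook does not confirm that this is what `dataset.bootstrap` does internally. The helper `bootstrap_features` and the variable `X_demo` are hypothetical names introduced for illustration.

```python
import numpy as np


def bootstrap_features(X: np.ndarray, seed: int) -> np.ndarray:
    """Return a copy of X whose columns are resampled with replacement.

    Assumption: bootstrapping acts on features (columns), not on samples (rows).
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # draw n_features column indices, duplicates allowed
    column_idx = rng.integers(0, n_features, size=n_features)
    return X[:, column_idx]


# toy genotype-like matrix with values in {0, 1, 2}
X_demo = np.random.default_rng(42).integers(0, 3, size=(100, 10_000))

# one replicate per seed, mirroring the seed=i pattern used above
replicates = [bootstrap_features(X_demo, seed=i) for i in range(10)]
```

Each replicate has the same shape as the input, and reusing a seed reproduces the same replicate.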

### 1.1. PCA Analysis

In [26]:
dataset.run_pca(n_components=2)

# run PCA for each bootstrap dataset
_ = [b.run_pca(n_components=2) for b in bootstrap_datasets]
# compare the embeddings
embedding_comparison = BatchEmbeddingComparison([b.pca for b in bootstrap_datasets])
ps = embedding_comparison.compare()
pcs = embedding_comparison.compare_clustering(kmeans_k=3)

print("Pandora Stability (PCA):", round(ps, 2))
print("Pandora Cluster Stability (PCA):", round(pcs, 2))

# plot_populations(dataset.pca)

Pandora Stability (PCA): 0.34
Pandora Cluster Stability (PCA): 0.35
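
As an illustration of how two embeddings can be compared, the following sketch aligns two PCA embeddings with a Procrustes analysis and turns the resulting disparity into a similarity in [0, 1]. That `BatchEmbeddingComparison.compare` works this way is an assumption, not something shown in this notebook; `procrustes_similarity` and `embed_bootstrap` are hypothetical helpers, and the formula `sqrt(1 - disparity)` is one plausible choice.

```python
import numpy as np
from scipy.spatial import procrustes
from sklearn.decomposition import PCA


def procrustes_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Similarity in [0, 1] after optimally translating, scaling and rotating emb_b onto emb_a."""
    _, _, disparity = procrustes(emb_a, emb_b)
    # assumption: similarity is defined as sqrt(1 - disparity)
    return float(np.sqrt(max(0.0, 1.0 - disparity)))


def embed_bootstrap(X: np.ndarray, seed: int) -> np.ndarray:
    """PCA embedding of a replicate obtained by resampling columns with replacement."""
    rng = np.random.default_rng(seed)
    cols = rng.integers(0, X.shape[1], size=X.shape[1])
    return PCA(n_components=2, random_state=0).fit_transform(X[:, cols])


# small random genotype-like matrix, comparable in spirit to the dataset above
X_demo = np.random.default_rng(0).integers(0, 3, size=(100, 2_000)).astype(float)

emb_a = embed_bootstrap(X_demo, seed=1)
emb_b = embed_bootstrap(X_demo, seed=2)
similarity = procrustes_similarity(emb_a, emb_b)
print("Procrustes similarity:", round(similarity, 2))
```

Because Procrustes analysis removes differences in translation, scale and rotation, a low similarity indicates that the relative positions of the samples differ between replicates, not merely the orientation of the axes.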


### 1.2. MDS Analysis

In [21]:
dataset.run_mds(n_components=2)

# run MDS for each bootstrap dataset
_ = [b.run_mds(n_components=2) for b in bootstrap_datasets]
# compare the embeddings
embedding_comparison = BatchEmbeddingComparison([b.mds for b in bootstrap_datasets])
ps = embedding_comparison.compare()
pcs = embedding_comparison.compare_clustering(kmeans_k=3)

print("Pandora Stability (MDS):", round(ps, 2))
print("Pandora Cluster Stability (MDS):", round(pcs, 2))

# plot_populations(dataset.mds)

Pandora Stability (MDS): 0.28
Pandora Cluster Stability (MDS): 0.34


## 2. Blob Dataset

In [22]:
X, y = make_blobs(n_samples=100, n_features=10_000, random_state=42)
samples = pd.Series([f"sample_{i}" for i in range(X.shape[0])])
populations = pd.Series([f"blob_{i}" for i in y])

dataset = NumpyDataset(X, sample_ids=samples, populations=populations, dtype=np.float64)

# generate N_BOOTSTRAPS bootstrap datasets, each with its own seed for reproducibility
bootstrap_datasets = [dataset.bootstrap(seed=i) for i in range(N_BOOTSTRAPS)]
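
The call to `make_blobs` above does not pass `centers`. With an integer `n_samples` and no `centers` argument, scikit-learn generates three centers, which is where the three "blobs" come from. The short check below confirms the shape and the number of distinct labels; `X_demo` and `y_demo` are illustrative names used to avoid overwriting the notebook variables.

```python
import numpy as np
from sklearn.datasets import make_blobs

# same call as in the cell above; centers is left at its default
X_demo, y_demo = make_blobs(n_samples=100, n_features=10_000, random_state=42)

labels, counts = np.unique(y_demo, return_counts=True)
print("shape:", X_demo.shape)
print("distinct blob labels:", labels)
print("samples per blob:", counts)
```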

### 2.1. PCA Analysis

In [12]:
dataset.run_pca(n_components=2)

# run PCA for each bootstrap dataset
_ = [b.run_pca(n_components=2) for b in bootstrap_datasets]
# compare the embeddings
embedding_comparison = BatchEmbeddingComparison([b.pca for b in bootstrap_datasets])
ps = embedding_comparison.compare()
pcs = embedding_comparison.compare_clustering(kmeans_k=3)
print("Pandora Stability (PCA):", round(ps, 2))
print("Pandora Cluster Stability (PCA):", round(pcs, 2))

# plot_populations(dataset.pca)

Pandora Stability (PCA): 1.0
Pandora Cluster Stability (PCA): 1.0
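
The cluster stability can be pictured as follows: cluster each embedding with k-means and measure how well the two partitions agree. The sketch below uses the Fowlkes-Mallows index from scikit-learn for the agreement. That `compare_clustering` uses this particular index is an assumption; `cluster_labels` is a hypothetical helper, and the dataset is a smaller blob dataset to keep the example fast.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import fowlkes_mallows_score


def cluster_labels(embedding: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Assign each sample of a 2D embedding to one of k clusters via k-means."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embedding)


X_demo, _ = make_blobs(n_samples=100, n_features=500, random_state=42)

# two bootstrap replicates (columns resampled with replacement), each embedded with PCA
rng = np.random.default_rng(0)
embeddings = []
for _ in range(2):
    cols = rng.integers(0, X_demo.shape[1], size=X_demo.shape[1])
    embeddings.append(PCA(n_components=2, random_state=0).fit_transform(X_demo[:, cols]))

labels_a = cluster_labels(embeddings[0], k=3)
labels_b = cluster_labels(embeddings[1], k=3)

# the index is invariant to a renaming of the cluster labels
score = fowlkes_mallows_score(labels_a, labels_b)
print("Cluster agreement:", round(score, 2))
```

The index is 1 when both partitions group the samples identically and decreases as samples switch clusters between replicates.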


### 2.2. MDS Analysis

In [13]:
dataset.run_mds(n_components=2)

# run MDS for each bootstrap dataset
_ = [b.run_mds(n_components=2) for b in bootstrap_datasets]
# compare the embeddings
embedding_comparison = BatchEmbeddingComparison([b.mds for b in bootstrap_datasets])
ps = embedding_comparison.compare()
pcs = embedding_comparison.compare_clustering(kmeans_k=3)
print("Pandora Stability (MDS):", round(ps, 2))
print("Pandora Cluster Stability (MDS):", round(pcs, 2))

# plot_populations(dataset.mds)

Pandora Stability (MDS): 1.0
Pandora Cluster Stability (MDS): 1.0
