This notebook demonstrates a multivariate phenotype simulation across multiple GRG files. Effect sizes are drawn from a multivariate normal model with correlated effects (correlation 0.25) across two phenotypes, 1000 causal mutations are sampled per file, and environmental noise is added according to the specified heritability values (0.33 and 0.25), so that each phenotype value combines a genetic and an environmental component. Two approaches to memory management are shown: loading all GRG files into RAM simultaneously, and loading them sequentially. Both produce a dataframe containing the genetic value, environmental noise, and final phenotype for each individual and each causal_mutation_id.
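
As a rough illustration of the effect-size model described above, the following sketch draws correlated effects for two phenotypes directly with NumPy. It does not call grg_pheno_sim; the variable names are illustrative assumptions, and the library's internal sampling may differ.

```python
import numpy as np

# Illustrative only: draw correlated effect sizes for two phenotypes with NumPy.
# This is not grg_pheno_sim's internal code; all names here are hypothetical.
rng = np.random.default_rng(1)

means = np.zeros(2)
cov = np.array([[1.0, 0.25], [0.25, 1.0]])  # unit variances, correlation 0.25

num_causal = 1000
# One row per causal mutation, one column per phenotype.
effects = rng.multivariate_normal(means, cov, size=num_causal)

# Empirical correlation between the two columns; near 0.25 up to sampling error.
empirical_corr = float(np.corrcoef(effects, rowvar=False)[0, 1])
print(effects.shape, round(empirical_corr, 3))
```

Because the diagonal of cov is 1, the off-diagonal entry is both the covariance and the correlation between the two phenotypes' effects.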

In [1]:
import numpy as np

from grg_pheno_sim.multi_grg_phenotype import sim_phenotypes_multi_grg
from grg_pheno_sim.model import grg_causal_mutation_model


The following commands only serve to convert the compressed VCF file (.vcf.gz) into a GRG for use in the phenotype simulation. The second cell builds a copy from the same VCF so that the simulation has two GRG files to work with. The bash scripts below work as expected provided the relative path to the source data file is correct.

In [2]:
%%script bash --out /dev/null
if [ ! -f test-200-samples.grg ]; then
  grg construct -p 10 ../data/test-200-samples.vcf.gz --out-file test-200-samples.grg
fi

Processing input file in 10 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 10/10 [00:00<00:00, 21.11it/s]
Merging...


In [3]:
%%script bash --out /dev/null
if [ ! -f test-200-samples_copy.grg ]; then
  grg construct -p 10 ../data/test-200-samples.vcf.gz --out-file test-200-samples_copy.grg
fi

Processing input file in 10 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 10/10 [00:00<00:00, 23.05it/s]
Merging...


In [4]:
grg_list = ["test-200-samples.grg", "test-200-samples_copy.grg"]
# list of GRG files to load

We first demonstrate loading all GRG files into RAM and simulating phenotypes. Causal mutations are sampled from each GRG, and genetic values are computed for the samples. The combined genetic dataframe is the sum of the per-GRG genetic values (for each individual and causal mutation ID). Noise is sampled once at the end and added to obtain the phenotypes.

NOTE: It is necessary for each GRG to have the same number of samples.
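
The combination step described above can be sketched with pandas and NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the two per-GRG tables are made-up toy data, and the noise variance is set with the common rule var_e = var_g * (1 - h2) / h2, which may differ from what sim_phenotypes_multi_grg does internally.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_individuals, n_phenotypes = 200, 2

def toy_genetic_values():
    """Hypothetical stand-in for one GRG's genetic values (long format)."""
    ids = np.repeat(np.arange(n_individuals), n_phenotypes)
    pheno = np.tile(np.arange(n_phenotypes), n_individuals)
    return pd.DataFrame({
        "individual_id": ids,
        "causal_mutation_id": pheno,
        "genetic_value": rng.normal(size=ids.size),
    })

# Both tables cover the same individuals, mirroring the equal-sample-count requirement.
per_grg = [toy_genetic_values(), toy_genetic_values()]

# Sum genetic values across GRGs for each (individual, causal_mutation_id) pair.
combined = (
    pd.concat(per_grg)
    .groupby(["individual_id", "causal_mutation_id"], as_index=False)["genetic_value"]
    .sum()
)

# Add noise once, at the end, scaled per phenotype to a target heritability.
heritability = {0: 0.33, 1: 0.25}
combined["environmental_noise"] = 0.0
for pid, h2 in heritability.items():
    mask = combined["causal_mutation_id"] == pid
    var_g = combined.loc[mask, "genetic_value"].var()
    var_e = var_g * (1 - h2) / h2  # assumed noise-variance rule
    combined.loc[mask, "environmental_noise"] = rng.normal(
        scale=np.sqrt(var_e), size=int(mask.sum())
    )

combined["phenotype"] = combined["genetic_value"] + combined["environmental_noise"]
```

The sum is only well defined when every table has a row for the same set of individuals, which is one way to read the requirement in the note above.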

In [5]:
model_type = "multivariate normal"
means = np.zeros(2)
cov = np.array([[1, 0.25], [0.25, 1]])

model = grg_causal_mutation_model(model_type, mean=means, cov=cov)

num_causal_per_file = 1000

random_seed = 1

normalize_phenotype = True #check for normalizing phenotypes

normalize_genetic_values_before_noise = False

heritability = [0.33, 0.25]

load_all_into_RAM = True #this parameter decides whether to load all GRGs into RAM together

multi_grg_multi_phenotypes = sim_phenotypes_multi_grg(
    grg_list,
    model,
    num_causal_per_file,
    random_seed,
    normalize_phenotype=normalize_phenotype,
    load_all_ram=load_all_into_RAM,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
)


Loaded test-200-samples.grg into RAM
Loaded test-200-samples_copy.grg into RAM
Genetic values for test-200-samples.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0      -0.994071                   0
1                0      27.817296                   1
2                1      -8.278699                   0
3                1       5.242780                   1
4                2       4.658089                   0
..             ...            ...                 ...
395            197       1.647747                   1
396            198      -4.357972                   0
397            198       8.708424                   1
398            199      -8.094397                   0
399            199      23.491199                   1

[400 rows x 3 columns]
Genetic values for test-200-samples_copy.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0      20.594138                   0
1                0      

In [6]:
multi_grg_multi_phenotypes

     causal_mutation_id  individual_id  genetic_value  environmental_noise  phenotype
0                     0              0      -0.256711            -0.426986  -0.683696
1                     1              0       0.597793             0.941915   1.539708
2                     0              1      -0.616746             0.117250  -0.499496
3                     1              1      -0.862186             0.149566  -0.712620
4                     0              2       0.184385             0.147079   0.331464
...                 ...            ...            ...                  ...        ...
395                   1            197      -0.357021            -1.383645  -1.740666
396                   0            198       0.377578            -1.215148  -0.837570
397                   1            198      -0.537448             0.135495  -0.401953
398                   0            199      -0.688505            -0.755862  -1.444366
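
One way to sanity-check a result table of this shape is to compare the variance of genetic_value with the variance of phenotype within each causal_mutation_id. The sketch below runs on a small synthetic table with the same column names (the numbers are generated here, not taken from the notebook output), and the noise scaling is an assumption. With only 200 individuals, the ratio will typically differ somewhat from the target heritability.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Build a synthetic table with the same columns as the simulation output.
frames = []
for pid, h2 in enumerate([0.33, 0.25]):
    g = rng.normal(size=n)                                # unit-variance genetic values
    e = rng.normal(scale=np.sqrt((1 - h2) / h2), size=n)  # assumed noise scaling
    frames.append(pd.DataFrame({
        "causal_mutation_id": pid,
        "individual_id": np.arange(n),
        "genetic_value": g,
        "environmental_noise": e,
        "phenotype": g + e,
    }))
df = pd.concat(frames, ignore_index=True)

# Variance ratio per phenotype: a rough empirical estimate of heritability.
variances = df.groupby("causal_mutation_id")[["genetic_value", "phenotype"]].var()
ratio = variances["genetic_value"] / variances["phenotype"]
print(ratio)
```

The same two groupby lines could be applied to a dataframe returned by the simulation, assuming it carries these column names as shown in the output above.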


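We now run a similar simulation, loading the GRGs into RAM sequentially instead of all together. Note that this call does not pass normalize_phenotype, so the resulting values are not on the normalized scale of the previous run.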
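
The memory trade-off between the two loading modes can be pictured with a generic pattern that is independent of grg_pheno_sim. In the sketch below, load_values is a hypothetical stand-in for "load one GRG and compute its genetic values". The sequential version keeps only one per-file result alive at a time and accumulates a running sum, while the all-at-once version holds every result before summing. In this toy setting both give the same total.

```python
import numpy as np

def load_values(path, n_individuals=200):
    """Hypothetical loader: returns a per-individual vector for one file.
    Values are seeded from the path string so repeated calls agree."""
    seed = sum(path.encode())  # deterministic toy seed; no file is read
    return np.random.default_rng(seed).normal(size=n_individuals)

paths = ["a.grg", "b.grg"]  # placeholder names only

# All at once: every per-file result is held in memory before summing.
all_values = [load_values(p) for p in paths]
total_all = np.sum(all_values, axis=0)

# Sequential: only one per-file result exists at a time.
total_seq = np.zeros(200)
for p in paths:
    values = load_values(p)
    total_seq += values
    del values  # release before loading the next file

print(np.allclose(total_all, total_seq))
```

Peak memory in the sequential loop is bounded by the largest single file plus the running total, at the cost of not being able to revisit an earlier file without reloading it.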

In [7]:
model_type = "multivariate normal"
means = np.zeros(2)
cov = np.array([[1, 0.25], [0.25, 1]])

model = grg_causal_mutation_model(model_type, mean=means, cov=cov)

num_causal_per_file = 1000

random_seed = 1

normalize_genetic_values_before_noise = False

heritability = [0.33, 0.25]

load_all_into_RAM = False #this parameter decides whether to load all GRGs into RAM together

multi_grg_multi_seq_phenotypes = sim_phenotypes_multi_grg(
    grg_list,
    model,
    num_causal_per_file,
    random_seed,
    load_all_ram=load_all_into_RAM,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
)


Loaded test-200-samples.grg into RAM
Genetic values for test-200-samples.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0      24.477461                   0
1                0      11.052996                   1
2                1      31.432044                   0
3                1     -10.336537                   1
4                2      42.255579                   0
..             ...            ...                 ...
395            197       5.243472                   1
396            198      40.608915                   0
397            198      12.021479                   1
398            199      46.438978                   0
399            199       2.298881                   1

[400 rows x 3 columns]
Loaded test-200-samples_copy.grg into RAM
Genetic values for test-200-samples_copy.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0      -4.753873                   0
1                0      

In [8]:
multi_grg_multi_seq_phenotypes

     causal_mutation_id  individual_id  genetic_value  environmental_noise   phenotype
0                     0              0      19.723589            -0.496570   19.227019
1                     1              0      16.963230           -30.764031  -13.800801
2                     0              1      17.097865            35.116510   52.214374
3                     1              1       4.726079            -1.001434    3.724645
4                     0              2      22.309574            13.163451   35.473025
...                 ...            ...            ...                  ...         ...
395                   1            197      19.375170           -13.621793    5.753377
396                   0            198      39.722359             3.275476   42.997835
397                   1            198      31.791933           -36.018390   -4.226457
398                   0            199      27.778340             0.219618   27.997958
