This notebook demonstrates a multi-GRG phenotype simulation workflow. It uses the sim_phenotypes_multi_grg function to simulate both continuous and binary phenotypes across multiple GRG files, each representing the same 200 samples. The simulation samples causal mutations from each GRG file (1000 per file), draws their effect sizes from a normal distribution model (mean=0, variance=1) to compute a genetic value for each individual, combines the genetic values from all GRGs additively, and then adds environmental noise according to a specified heritability (0.33 in the examples). The scenarios covered are: loading all GRGs into RAM simultaneously versus sequentially, and generating continuous phenotypes as well as binary phenotypes with a specified population prevalence (0.2). The effect sizes can also optionally be saved to parameter files for downstream analysis.

In [1]:
from grg_pheno_sim.multi_grg_phenotype import sim_phenotypes_multi_grg
from grg_pheno_sim.model import grg_causal_mutation_model


The following cells only convert the gzip-compressed VCF file into the GRG files used for the phenotype simulation. Each bash script skips construction if its output GRG already exists, and works as expected provided the relative path to the source data file is correct.

In [2]:
%%script bash --out /dev/null
if [ ! -f test-200-samples.grg ]; then
  grg construct -p 10 ../data/test-200-samples.vcf.gz --out-file test-200-samples.grg
fi


Processing input file in 10 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 10/10 [00:12<00:00,  1.22s/it]
Merging...


In [3]:
%%script bash --out /dev/null
if [ ! -f test-200-samples_copy.grg ]; then
  grg construct -p 10 ../data/test-200-samples.vcf.gz --out-file test-200-samples_copy.grg
fi

Processing input file in 10 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 10/10 [00:11<00:00,  1.19s/it]
Merging...


In [4]:
%%script bash --out /dev/null
if [ ! -f test-200-samples_last.grg ]; then
  grg construct -p 10 ../data/test-200-samples.vcf.gz --out-file test-200-samples_last.grg
fi

Processing input file in 10 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 10/10 [00:10<00:00,  1.04s/it]
Merging...


In [5]:
grg_list = ["test-200-samples.grg", "test-200-samples_copy.grg"]
# List of GRG files to load

We first demonstrate loading all GRG files into RAM and simulating phenotypes. Causal mutations are sampled from each GRG, and genetic values are computed for every sample. The combined genetic values are the per-sample sum of each GRG's genetic values. Noise is sampled last and added to the combined genetic values to obtain the phenotypes.

NOTE: It is necessary for each GRG to have the same number of samples.
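
The additive combination and the heritability-based noise described above can be sketched with plain NumPy. This is only a toy illustration under stated assumptions, not the implementation inside grg_pheno_sim: the helper `toy_genetic_values` and the random 0/1/2 genotype matrix are hypothetical stand-ins for a GRG, and the noise variance follows the common convention Var(E) = Var(G)(1 - h²)/h², which we assume rather than confirm for the library.

```python
import numpy as np

rng = np.random.default_rng(1)

n_samples = 200        # every (toy) file must describe the same samples
n_causal = 1000        # causal mutations drawn per file
heritability = 0.33


def toy_genetic_values(rng, n_samples, n_causal):
    """Hypothetical stand-in for one GRG: random genotypes times N(0, 1) effects."""
    genotypes = rng.integers(0, 3, size=(n_samples, n_causal))  # 0, 1 or 2 copies
    effects = rng.normal(loc=0.0, scale=1.0, size=n_causal)     # mean=0, var=1
    return genotypes @ effects


# One vector of genetic values per file, then an element-wise sum across files.
per_file = [toy_genetic_values(rng, n_samples, n_causal) for _ in range(2)]
genetic = np.sum(per_file, axis=0)

# Assumed convention: choose Var(E) so that Var(G) / (Var(G) + Var(E)) = h2.
noise_var = genetic.var() * (1.0 - heritability) / heritability
noise = rng.normal(0.0, np.sqrt(noise_var), size=n_samples)

phenotype = genetic + noise
```

The element-wise sum is also why every file needs the same number of samples: the per-file vectors must line up index by index.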

In [6]:
model_type = "normal"
mean = 0
var = 1

model = grg_causal_mutation_model(model_type, mean=mean, var=var)

num_causal_per_file = 1000

random_seed = 1

normalize_phenotype = True #set to True to normalize the final phenotypes

normalize_genetic_values_before_noise = False

heritability = 0.33

load_all_into_RAM = True #this parameter decides whether to load all GRGs into RAM together

save_effects = True

path_list = ['first_sample_effect_sizes.par', 'second_sample_effect_sizes.par']

multi_grg_uni_phenotypes = sim_phenotypes_multi_grg(
    grg_list,
    model,
    num_causal_per_file,
    random_seed,
    normalize_phenotype=normalize_phenotype,
    load_all_ram=load_all_into_RAM,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
    save_effect_output=save_effects,
    effect_path_list=path_list,
)


Loaded test-200-samples.grg into RAM
Loaded test-200-samples_copy.grg into RAM
Genetic values for test-200-samples.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0     -19.020975                   0
1                1      -6.169599                   0
2                2     -23.516688                   0
3                3      11.246903                   0
4                4      -5.730934                   0
..             ...            ...                 ...
195            195      -6.838783                   0
196            196     -10.646021                   0
197            197       8.320047                   0
198            198      -2.339506                   0
199            199     -10.111281                   0

[200 rows x 3 columns]
Genetic values for test-200-samples_copy.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0     -22.453439                   0
1                1     -

In [7]:
multi_grg_uni_phenotypes

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-0.704615,-0.238636,-0.943251
1,0,1,-0.115182,1.418549,1.303366
2,0,2,-0.572481,-1.404155,-1.976636
3,0,3,0.168105,0.336582,0.504688
4,0,4,-0.666954,-0.164556,-0.831510
...,...,...,...,...,...
195,0,195,-0.510187,0.066806,-0.443381
196,0,196,-0.066285,-1.156457,-1.222742
197,0,197,0.223388,-1.000433,-0.777045
198,0,198,-0.082874,-0.361472,-0.444346


Next, we call the function with its default parameter values, passing only the GRG list, the loading mode, and the heritability.

In [8]:
heritability = 0.33

load_all_into_RAM = True #this parameter decides whether to load all GRGs into RAM together

multi_grg_uni_phenotypes_default = sim_phenotypes_multi_grg(grg_list, load_all_ram=load_all_into_RAM, heritability=heritability)


Loaded test-200-samples.grg into RAM
Loaded test-200-samples_copy.grg into RAM
Genetic values for test-200-samples.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0      43.337891                   0
1                1      46.293544                   0
2                2      24.802211                   0
3                3      59.585820                   0
4                4      39.311780                   0
..             ...            ...                 ...
195            195      40.694166                   0
196            196      37.210722                   0
197            197      27.361940                   0
198            198      34.491399                   0
199            199      35.881352                   0

[200 rows x 3 columns]
Genetic values for test-200-samples_copy.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0      16.557583                   0
1                1      

In [9]:
multi_grg_uni_phenotypes_default

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,59.895474,-16.227358,43.668116
1,0,1,52.396002,-14.090678,38.305325
2,0,2,48.809231,-34.512444,14.296787
3,0,3,62.900187,-10.146706,52.753482
4,0,4,39.920363,10.964286,50.884649
...,...,...,...,...,...
195,0,195,33.830926,-33.515235,0.315691
196,0,196,62.463197,29.893875,92.357071
197,0,197,22.079705,51.018533,73.098237
198,0,198,54.630493,49.771255,104.401748


We now run a similar simulation, but load the GRGs into RAM one at a time instead of all together.
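
The memory trade-off between the two loading modes can be sketched as follows. This is a toy illustration, not the library's code: `load_genetic_values` is a hypothetical loader that fabricates deterministic values from the file name, and the file names are placeholders (nothing is read from disk). The sketch only shows that a running sum needs one file's data in memory at a time; it says nothing about how sim_phenotypes_multi_grg orders its random draws in either mode.

```python
import numpy as np

N_SAMPLES = 200


def load_genetic_values(path, n_samples=N_SAMPLES):
    """Hypothetical loader: one genetic value per sample for a given file."""
    seed = sum(path.encode())  # deterministic toy seed derived from the name
    return np.random.default_rng(seed).normal(size=n_samples)


paths = ["a.grg", "b.grg", "c.grg"]  # placeholder names only

# All at once: every file's values are held in memory before summing.
all_values = [load_genetic_values(p) for p in paths]
total_all = np.sum(all_values, axis=0)

# Sequential: keep only a running total and one file's values at a time.
total_seq = np.zeros(N_SAMPLES)
for p in paths:
    values = load_genetic_values(p)
    total_seq += values
    del values  # release this file's data before loading the next one

same = np.allclose(total_all, total_seq)
```

In this toy setting both strategies produce the same totals because the loader is deterministic per file; the sequential version simply trades peak memory for loading files one after another.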

In [10]:
model_type = "normal"
mean = 0
var = 1

model = grg_causal_mutation_model(model_type, mean=mean, var=var)

num_causal_per_file = 1000

random_seed = 1

normalize_phenotype = True #set to True to normalize the final phenotypes

normalize_genetic_values_before_noise = False

heritability = 0.33

load_all_into_RAM = False #this parameter decides whether to load all GRGs into RAM together

save_effects = True

path_list = ['first_seq_sample_effect_sizes.par', 'second_seq_sample_effect_sizes.par']

multi_grg_uni_seq_phenotypes = sim_phenotypes_multi_grg(
    grg_list,
    model,
    num_causal_per_file,
    random_seed,
    normalize_phenotype=normalize_phenotype,
    load_all_ram=load_all_into_RAM,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
    save_effect_output=save_effects,
    effect_path_list=path_list,
)

Loaded test-200-samples.grg into RAM
Genetic values for test-200-samples.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0      -6.979905                   0
1                1       9.802051                   0
2                2      -2.521683                   0
3                3      -9.488867                   0
4                4     -16.724798                   0
..             ...            ...                 ...
195            195      13.250207                   0
196            196       0.621734                   0
197            197     -18.751941                   0
198            198     -19.365297                   0
199            199      -7.264376                   0

[200 rows x 3 columns]
Loaded test-200-samples_copy.grg into RAM
Genetic values for test-200-samples_copy.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0      12.778135                   0
1                1      

In [11]:
multi_grg_uni_seq_phenotypes

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-0.481271,0.078361,-0.402910
1,0,1,-0.342456,1.049300,0.706844
2,0,2,-0.093188,-1.366265,-1.459452
3,0,3,-0.400921,1.333512,0.932591
4,0,4,-0.864817,-0.024640,-0.889457
...,...,...,...,...,...
195,0,195,0.904433,-1.927629,-1.023196
196,0,196,-0.099902,-1.463320,-1.563222
197,0,197,-1.233560,-0.647971,-1.881532
198,0,198,-0.448741,-0.604923,-1.053664


Now, we demonstrate a case with binary phenotypes.
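
For orientation, one common way to turn a continuous liability into a binary trait with a target prevalence is a liability-threshold rule. The sketch below is an assumption-laden illustration of that general technique only, not a description of what sim_phenotypes_multi_grg does internally: the liability values are random stand-ins, and thresholding at the empirical quantile is our choice here. The library may well use a different rule, since the run later in this notebook reports an observed prevalence of 0.42 for a specified prevalence of 0.2.

```python
import numpy as np

rng = np.random.default_rng(1)

prevalence = 0.2
n_samples = 200

# Stand-in liability: in a real simulation this would be genetic value + noise.
liability = rng.normal(size=n_samples)

# Cases are the individuals whose liability exceeds the empirical
# (1 - prevalence) quantile, so about `prevalence` of the sample are cases.
threshold = np.quantile(liability, 1.0 - prevalence)
binary = (liability > threshold).astype(int)

observed = binary.mean()  # fraction of cases in the sample
```

Because the threshold is taken from the sample itself, the observed case fraction matches the target by construction; a threshold taken from a theoretical distribution instead would only match it approximately.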

In [12]:
model_type = "normal"
mean = 0
var = 1

model = grg_causal_mutation_model(model_type, mean=mean, var=var)

num_causal_per_file = 1000

random_seed = 1

normalize_genetic_values_before_noise = False

binary = True

population_prevalence = 0.2

heritability = 0.33

load_all_into_RAM = True #this parameter decides whether to load all GRGs into RAM together

save_effects = True

# NOTE: these are the same paths as in the sequential example above,
# so those effect-size files are presumably overwritten by this run.
path_list = ['first_seq_sample_effect_sizes.par', 'second_seq_sample_effect_sizes.par']

multi_grg_uni_bin_phenotypes = sim_phenotypes_multi_grg(
    grg_list,
    model,
    num_causal_per_file,
    random_seed,
    load_all_ram=load_all_into_RAM,
    binary=binary,
    population_prevalence=population_prevalence,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
    save_effect_output=save_effects,
    effect_path_list=path_list,
)

Loaded test-200-samples.grg into RAM
Loaded test-200-samples_copy.grg into RAM
Genetic values for test-200-samples.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0     -25.191199                   0
1                1      -5.732776                   0
2                2      -2.033059                   0
3                3     -11.159693                   0
4                4      -0.409668                   0
..             ...            ...                 ...
195            195       3.650644                   0
196            196       6.378082                   0
197            197      -2.159229                   0
198            198      -4.366015                   0
199            199       6.200247                   0

[200 rows x 3 columns]
Genetic values for test-200-samples_copy.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0       1.121761                   0
1                1      

In [13]:
multi_grg_uni_bin_phenotypes

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,-24.069438,11.518308,0
1,0,1,11.068769,-4.398059,1
2,0,2,-14.047168,-12.303400,0
3,0,3,-21.437173,-22.884782,0
4,0,4,-12.649445,11.732483,0
...,...,...,...,...,...
195,0,195,1.108585,-15.627286,0
196,0,196,-8.763735,-3.781550,0
197,0,197,-7.335565,13.257569,1
198,0,198,-24.620852,-8.624949,0


In [14]:
binary_list = multi_grg_uni_bin_phenotypes["phenotype"]
num_zeros = (binary_list == 0).sum()
num_ones = (binary_list == 1).sum()

print("Number of 0s:", num_zeros)
print("Number of 1s:", num_ones)
print("Population prevalence ratio observed: ", str(num_ones/(num_ones+num_zeros)))


Number of 0s: 116
Number of 1s: 84
Population prevalence ratio observed:  0.42


Finally, we demonstrate a case with more than two GRGs.

In [15]:
new_grg_list = ["test-200-samples.grg", "test-200-samples_copy.grg", "test-200-samples_last.grg"]


In [16]:
model_type = "normal"
mean = 0
var = 1

model = grg_causal_mutation_model(model_type, mean=mean, var=var)

num_causal_per_file = 1000

random_seed = 1

normalize_genetic_values_before_noise = False

heritability = 0.33

load_all_into_RAM = True #this parameter decides whether to load all GRGs into RAM together

multi_grg_uni_phenotypes_three = sim_phenotypes_multi_grg(
    new_grg_list,
    model,
    num_causal_per_file,
    random_seed,
    load_all_ram=load_all_into_RAM,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
)


Loaded test-200-samples.grg into RAM
Loaded test-200-samples_copy.grg into RAM
Loaded test-200-samples_last.grg into RAM
Genetic values for test-200-samples.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0      -8.516631                   0
1                1      12.080463                   0
2                2      -1.835255                   0
3                3       7.642389                   0
4                4       3.995506                   0
..             ...            ...                 ...
195            195       5.259805                   0
196            196      -2.091133                   0
197            197       8.208745                   0
198            198       6.349593                   0
199            199     -10.409041                   0

[200 rows x 3 columns]
Genetic values for test-200-samples_copy.grg are as follows:
     individual_id  genetic_value  causal_mutation_id
0                0      -9.973063   

In [17]:
multi_grg_uni_phenotypes_three

Unnamed: 0,causal_mutation_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,35.516035,43.469769,78.985804
1,0,1,21.472437,15.658662,37.131099
2,0,2,-0.173869,-68.607838,-68.781708
3,0,3,15.223645,4.523833,19.747478
4,0,4,46.754325,15.698933,62.453258
...,...,...,...,...,...
195,0,195,44.053954,33.526804,77.580758
196,0,196,-10.718519,10.477390,-0.241129
197,0,197,26.212827,-16.227547,9.985281
198,0,198,-2.738703,-40.479284,-43.217988
