This notebook demonstrates how to supply custom effect sizes instead of drawing them from one of the distribution-based models provided in the library. In the univariate case, effect sizes may be passed as a list, a dictionary, or a pandas DataFrame. In the multivariate case, they must be passed as a pandas DataFrame that follows the format shown in the multivariate demos; otherwise the input is not compatible with the simulation library.
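
As a rough illustration of what accepting all three input types involves, the sketch below coerces a list, a dictionary, or a DataFrame into a single table with the `mutation_id`, `effect_size`, and `causal_mutation_id` columns that appear in the outputs later in this notebook. The helper `to_effect_size_frame` is hypothetical and is not part of `grg_pheno_sim`. It also assumes that a list is ordered by mutation ID, which matches how the lists are built in this notebook but is not confirmed by the library.

```python
import pandas as pd


def to_effect_size_frame(effects, causal_mutation_id=0):
    """Coerce a list, dict, or DataFrame of effect sizes into one table.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if isinstance(effects, pd.DataFrame):
        frame = effects[["mutation_id", "effect_size"]].copy()
    elif isinstance(effects, dict):
        # Keys are mutation IDs, values are effect sizes.
        frame = pd.DataFrame(
            list(effects.items()), columns=["mutation_id", "effect_size"]
        )
    else:
        # Assumption: a list is ordered by mutation ID (0, 1, 2, ...).
        frame = pd.DataFrame(
            {"mutation_id": range(len(effects)), "effect_size": list(effects)}
        )
    frame["causal_mutation_id"] = causal_mutation_id
    return frame


from_list = to_effect_size_frame([0.5, -1.0, 2.0])
from_dict = to_effect_size_frame({0: 0.5, 1: -1.0, 2: 2.0})
print(from_list)
```

Under these assumptions, the list and dictionary forms describe the same effect sizes, so either can be used interchangeably in the univariate case.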

In [1]:
import numpy as np
import pandas as pd
import pygrgl
import random

from grg_pheno_sim.phenotype import sim_phenotypes_custom


The following cell converts the gzip-compressed VCF file into a GRG, which is then used for the phenotype simulation. It skips the conversion if the GRG file already exists, and it assumes that the relative path to the source data file is correct.

In [4]:
%%script bash --out /dev/null
if [ ! -f test-200-samples.grg ]; then
  grg construct -p 10 ../data/test-200-samples.vcf.gz --out-file test-200-samples.grg
fi

Processing input file in 10 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 10/10 [00:00<00:00, 17.88it/s]
Merging...
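
The same "build only if missing" logic can also be written in Python, which may be convenient outside a notebook. This is a minimal sketch: the helpers `build_construct_cmd` and `ensure_grg` are hypothetical names, and the only `grg` arguments used are the ones that appear in the bash cell above.

```python
import subprocess
from pathlib import Path


def build_construct_cmd(vcf_path, grg_path, parts=10):
    """Return the `grg construct` command from the bash cell as an argument list."""
    return [
        "grg", "construct",
        "-p", str(parts),
        str(vcf_path),
        "--out-file", str(grg_path),
    ]


def ensure_grg(vcf_path, grg_path, parts=10):
    """Run the conversion only if grg_path is missing; return True if it ran."""
    if Path(grg_path).exists():
        return False
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(build_construct_cmd(vcf_path, grg_path, parts), check=True)
    return True


print(build_construct_cmd("../data/test-200-samples.vcf.gz", "test-200-samples.grg"))
```

Calling `ensure_grg("../data/test-200-samples.vcf.gz", "test-200-samples.grg")` would then mirror the bash cell, provided the `grg` command is available on the path.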


In [5]:
grg_1 = pygrgl.load_immutable_grg("test-200-samples.grg")  # load the sample GRG stored in the current directory
n = grg_1.num_mutations  # number of mutations in the GRG; one effect size is built per mutation below

In [7]:
random_effects = [random.random() for _ in range(n)]  # list input, uniform random values in [0, 1)

specific_effects = [1.0 for _ in range(n)]  # list input, fixed (non-random) values

effect_sizes = np.random.randn(n)  # standard normal draws, one per mutation

mutation_dict = {i: effect_sizes[i] for i in range(n)}  # dictionary input: mutation ID -> effect size

input_df = pd.DataFrame(list(mutation_dict.items()), columns=['mutation_id', 'effect_size'])  # dataframe input

# dataframe for building the simulation steps manually: input_df plus the causal_mutation_id column
input_df_manual = input_df.copy()
input_df_manual['causal_mutation_id'] = 0
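
The random effect sizes above are drawn without a seed, so they change every time the notebook is rerun. The sketch below shows one way to make those inputs reproducible. The seed value and the small `n_demo` (a stand-in for `grg_1.num_mutations`) are illustrative assumptions. It only fixes the inputs built here; whether it also fixes the noise drawn inside `sim_phenotypes_custom` is not something this sketch establishes.

```python
import random

import numpy as np

SEED = 42    # hypothetical seed; any fixed integer works
n_demo = 5   # stand-in for grg_1.num_mutations

random.seed(SEED)  # seeds the standard-library generator
random_effects = [random.random() for _ in range(n_demo)]

rng = np.random.default_rng(SEED)  # dedicated NumPy generator
effect_sizes = rng.standard_normal(n_demo)

# Re-seeding reproduces exactly the same draws.
random.seed(SEED)
repeat_random = [random.random() for _ in range(n_demo)]
repeat_normal = np.random.default_rng(SEED).standard_normal(n_demo)

print(random_effects == repeat_random)
print(np.array_equal(effect_sizes, repeat_normal))
```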


We first show custom effect sizes contained within a list.

In [8]:
normalize_genetic_values_before_noise = True

heritability = 0.33

standardized_output = True

output_path = 'custom_pheno.phen'  # path of the .phen output file, saved in the current directory

phenotypes_list = sim_phenotypes_custom(
    grg_1,
    specific_effects,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
    standardized_output=standardized_output,
    path=output_path,
)
phenotypes_list

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0          1.0                   0
1                1          1.0                   0
2                2          1.0                   0
3                3          1.0                   0
4                4          1.0                   0
...            ...          ...                 ...
10888        10888          1.0                   0
10889        10889          1.0                   0
10890        10890          1.0                   0
10891        10891          1.0                   0
10892        10892          1.0                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0         2665.0                   0
1                1         2729.0                   0
2                2         2740.0                   0
3                3         2773.0                   0
4      

     causal_mutation_id  individual_id  genetic_value  environmental_noise  phenotype
0                     0              0      -1.884264            -0.225075  -2.109339
1                     0              1      -0.150362             1.114632   0.964270
2                     0              2       0.147653             0.077670   0.225323
3                     0              3       1.041696            -0.576655   0.465041
4                     0              4      -0.150362            -1.943573  -2.093934
...                 ...            ...            ...                  ...        ...
195                   0            195       1.339710             1.055654   2.395364
196                   0            196      -1.396604             0.666002  -0.730602
197                   0            197      -0.800575            -1.108121  -1.908696
198                   0            198      -0.692206             0.802020   0.109813
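
To give a sense of what a `heritability` setting means for the noise, the sketch below uses the textbook additive model, in which the phenotype is the genetic value plus environmental noise and heritability is the share of phenotypic variance due to the genetic values. This is an assumption about the general model, not a description of how `sim_phenotypes_custom` scales its noise internally; the toy genetic values are simulated, not taken from the GRG.

```python
import numpy as np

rng = np.random.default_rng(0)
h2 = 0.33  # target heritability, as in the cell above

# Toy genetic values, normalized to mean 0 and variance 1 before adding noise.
genetic = rng.normal(size=100_000)
genetic = (genetic - genetic.mean()) / genetic.std()

# h2 = Var(G) / (Var(G) + Var(E))  =>  Var(E) = Var(G) * (1 - h2) / h2
noise_var = genetic.var() * (1 - h2) / h2
noise = rng.normal(0.0, np.sqrt(noise_var), size=genetic.size)

phenotype = genetic + noise

# Share of phenotypic variance explained by the genetic values.
realized_h2 = genetic.var() / phenotype.var()
print(round(realized_h2, 3))
```

With a large sample the realized value lands close to the target; with only 200 individuals, as in this notebook, it would fluctuate more from run to run.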


We then show custom effect sizes contained within a dictionary.

In [9]:
normalize_genetic_values_before_noise = True

heritability = 0.33

# no path is given here, so the standard .phen output is not saved (the default)

phenotypes_dict = sim_phenotypes_custom(
    grg_1,
    mutation_dict,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
)
phenotypes_dict

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     1.275108                   0
1                1    -0.505153                   0
2                2    -0.561123                   0
3                3     0.412935                   0
4                4     1.766245                   0
...            ...          ...                 ...
10888        10888    -0.083483                   0
10889        10889     0.639424                   0
10890        10890     0.213614                   0
10891        10891    -0.667769                   0
10892        10892     0.412074                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0     -28.884707                   0
1                1     -26.796481                   0
2                2      10.995769                   0
3                3     -27.186726                   0
4      

     causal_mutation_id  individual_id  genetic_value  environmental_noise  phenotype
0                     0              0      -0.952245            -0.218695  -1.170940
1                     0              1      -0.895464            -0.295123  -1.190587
2                     0              2       0.132157            -1.509207  -1.377050
3                     0              3      -0.906075             0.087642  -0.818433
4                     0              4      -0.085406             2.730236   2.644830
...                 ...            ...            ...                  ...        ...
195                   0            195      -0.655348            -0.480768  -1.136117
196                   0            196      -0.011047             0.529288   0.518240
197                   0            197       1.372006            -1.028990   0.343016
198                   0            198       3.501792             0.345657   3.847449


Finally, we show custom effect sizes contained within a pandas DataFrame. The user need not add the `causal_mutation_id` column; that is handled internally.

In [10]:
normalize_genetic_values_before_noise = True

heritability = 0.33

phenotypes_df = sim_phenotypes_custom(
    grg_1,
    input_df,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    heritability=heritability,
)
phenotypes_df

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     1.275108                   0
1                1    -0.505153                   0
2                2    -0.561123                   0
3                3     0.412935                   0
4                4     1.766245                   0
...            ...          ...                 ...
10888        10888    -0.083483                   0
10889        10889     0.639424                   0
10890        10890     0.213614                   0
10891        10891    -0.667769                   0
10892        10892     0.412074                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0     -28.884707                   0
1                1     -26.796481                   0
2                2      10.995769                   0
3                3     -27.186726                   0
4      

     causal_mutation_id  individual_id  genetic_value  environmental_noise  phenotype
0                     0              0      -0.952245            -0.799563  -1.751808
1                     0              1      -0.895464             0.679729  -0.215734
2                     0              2       0.132157             0.928280   1.060438
3                     0              3      -0.906075             0.245714  -0.660361
4                     0              4      -0.085406            -0.046388  -0.131794
...                 ...            ...            ...                  ...        ...
195                   0            195      -0.655348             1.152895   0.497547
196                   0            196      -0.011047             0.105130   0.094082
197                   0            197       1.372006            -1.083676   0.288330
198                   0            198       3.501792             2.885304   6.387097


Alternatively, the user can take their custom effect sizes (in a compatible dataframe) through the individual steps of the simulation manually instead of calling the `sim_phenotypes_custom` function. For the univariate case, the dataframe must then be formed like `input_df_manual` above, that is, with the `causal_mutation_id` column included.
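
To make the first of those manual steps concrete, the sketch below computes genetic values from a dataframe shaped like `input_df_manual`. It is a toy version under stated assumptions: it uses a small dense genotype matrix of allele counts rather than a GRG, and it does not call any `grg_pheno_sim` function, so it illustrates the arithmetic only, not the library's manual API.

```python
import numpy as np
import pandas as pd

# Toy genotype matrix: 4 individuals x 3 mutations, entries are allele counts.
genotypes = np.array([
    [0, 1, 2],
    [1, 0, 0],
    [2, 2, 1],
    [0, 0, 0],
])

# Effect sizes in the same shape as input_df_manual.
effects = pd.DataFrame({
    "mutation_id": [0, 1, 2],
    "effect_size": [0.5, -1.0, 2.0],
    "causal_mutation_id": 0,
})

# Each individual's genetic value is the effect-size-weighted sum of allele counts.
genetic_values = genotypes @ effects["effect_size"].to_numpy()

genetic_df = pd.DataFrame({
    "individual_id": range(len(genetic_values)),
    "genetic_value": genetic_values,
    "causal_mutation_id": 0,
})
print(genetic_df)
```

When every effect size is 1.0, as in the list example earlier, this weighted sum reduces to a count of the mutation copies each individual carries, which is consistent with the whole-number genetic values printed for that run.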

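Now, we show how the user can simulate phenotypes from custom effect sizes with custom noise. Here the noise is specified through `user_mean` and `user_cov` rather than through `heritability`.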

In [11]:
normalize_genetic_values_before_noise = True

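mean_1 = 0  # mean of the user-defined noise, passed as user_mean
std_1 = 1   # spread of the user-defined noise, passed as user_cov

phenotypes_df_user_noise = sim_phenotypes_custom(
    grg_1,
    input_df,
    normalize_genetic_values_before_noise=normalize_genetic_values_before_noise,
    user_mean=mean_1,
    user_cov=std_1,
)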
phenotypes_df_user_noise

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     1.275108                   0
1                1    -0.505153                   0
2                2    -0.561123                   0
3                3     0.412935                   0
4                4     1.766245                   0
...            ...          ...                 ...
10888        10888    -0.083483                   0
10889        10889     0.639424                   0
10890        10890     0.213614                   0
10891        10891    -0.667769                   0
10892        10892     0.412074                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0     -28.884707                   0
1                1     -26.796481                   0
2                2      10.995769                   0
3                3     -27.186726                   0
4      

     causal_mutation_id  individual_id  genetic_value  environmental_noise  phenotype
0                     0              0      -0.952245            -0.994571  -1.946816
1                     0              1      -0.895464             0.517973  -0.377491
2                     0              2       0.132157             0.484903   0.617060
3                     0              3      -0.906075             0.324259  -0.581816
4                     0              4      -0.085406             0.072400  -0.013006
...                 ...            ...            ...                  ...        ...
195                   0            195      -0.655348             0.498859  -0.156489
196                   0            196      -0.011047             1.288604   1.277556
197                   0            197       1.372006             0.194403   1.566409
198                   0            198       3.501792             0.216266   3.718059
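
In the output above, each phenotype is the genetic value plus the environmental noise. The sketch below reproduces that step on toy data with user-specified Gaussian noise. Treating the spread parameter as a variance is an assumption based on the name `user_cov`; the notebook passes the value 1, for which the variance and the standard deviation coincide, so the cell above does not distinguish the two readings. The variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy normalized genetic values for five individuals.
genetic = np.array([-0.95, -0.90, 0.13, -0.91, -0.09])

noise_mean = 0.0
noise_var = 1.0  # assumed to be a variance; equals the standard deviation only at 1

# rng.normal takes a standard deviation, so convert the variance first.
noise = rng.normal(noise_mean, np.sqrt(noise_var), size=genetic.size)

phenotype = genetic + noise
print(np.column_stack([genetic, noise, phenotype]))
```

For any spread other than 1, it is worth checking which convention the library expects before passing a value.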
