This notebook demonstrates how to supply custom effect sizes instead of using one of the distribution-based models provided in the library. In the univariate case, effect sizes may be passed as a list, a dictionary, or a pandas dataframe. In the multivariate case, they must be passed as a pandas dataframe that follows the format shown in the multivariate demos so that it is compatible with the simulation library.
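To make the three univariate formats concrete, here is a minimal sketch that expresses the same toy effect sizes as a list, a dictionary, and a dataframe, and brings each into the two-column layout used later in this notebook. The helper `to_effect_frame` is hypothetical and only illustrates how the formats relate; it is not the library's internal conversion. It assumes that list positions and dictionary keys are mutation ids.

```python
import pandas as pd


def to_effect_frame(effects):
    """Hypothetical helper: coerce a list, dict, or DataFrame of effect
    sizes into a frame with columns ['mutation_id', 'effect_size']."""
    if isinstance(effects, pd.DataFrame):
        frame = effects[["mutation_id", "effect_size"]].copy()
    elif isinstance(effects, dict):
        # Assumption: dictionary keys are mutation ids.
        frame = pd.DataFrame(
            list(effects.items()), columns=["mutation_id", "effect_size"]
        )
    else:
        # Assumption: the position in the list is the mutation id.
        frame = pd.DataFrame(
            {"mutation_id": range(len(effects)), "effect_size": list(effects)}
        )
    return frame.sort_values("mutation_id").reset_index(drop=True)


toy_list = [0.5, -1.2, 0.0]
toy_dict = {0: 0.5, 1: -1.2, 2: 0.0}
toy_df = pd.DataFrame({"mutation_id": [0, 1, 2], "effect_size": [0.5, -1.2, 0.0]})

frames = [to_effect_frame(x) for x in (toy_list, toy_dict, toy_df)]
# True when all three inputs describe the same effect sizes.
all_equal = all(frames[0].equals(f) for f in frames[1:])
```

All three toy inputs end up as the same frame, which is why the list, dictionary, and dataframe demos below can be read as variations of one workflow.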

In [1]:
import numpy as np
import pandas as pd
import pygrgl
import random

from grg_pheno_sim.phenotype import sim_phenotypes_custom


The following command converts the gzipped VCF file into a GRG for use in the phenotype simulation. The bash script works as expected provided the relative path to the source data file is correct; it skips construction if the GRG file already exists.

In [2]:
%%script bash --out /dev/null
if [ ! -f test-200-samples.grg ]; then
  grg construct -p 10 ../data/test-200-samples.vcf.gz --out-file test-200-samples.grg
fi

Processing input file in 10 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 10/10 [00:07<00:00,  1.34it/s]
Merging...
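The same "build only if missing" guard can be sketched in Python, for readers who prefer not to use a bash cell. The function name `construct_grg_if_missing` and its `run` parameter are assumptions made for this sketch; the command line itself is copied from the cell above.

```python
import subprocess
from pathlib import Path


def construct_grg_if_missing(vcf_path, grg_path, parts=10, run=subprocess.run):
    """Build the GRG only when the output file is absent.

    Returns the command that was run, or None if the file already existed.
    `run` is injectable so the guard can be exercised without the grg CLI.
    """
    if Path(grg_path).exists():
        return None
    cmd = [
        "grg", "construct", "-p", str(parts),
        str(vcf_path), "--out-file", str(grg_path),
    ]
    # check=True raises CalledProcessError if the command fails.
    run(cmd, check=True)
    return cmd
```

With the paths used in this notebook, the call would be `construct_grg_if_missing("../data/test-200-samples.vcf.gz", "test-200-samples.grg")`.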


In [3]:
grg_1 = pygrgl.load_immutable_grg("test-200-samples.grg") # load the sample GRG stored in the same directory
n = grg_1.num_mutations # number of mutations in the GRG

In [4]:
random_effects = [random.random() for _ in range(n)] # list input, random values in [0, 1)

specific_effects = [1.0 for _ in range(n)] # list input, fixed (non-random) values

effect_sizes = np.random.randn(n) # one standard-normal effect size per mutation

mutation_dict = {i: effect_sizes[i] for i in range(n)} # dictionary input, mutation id -> effect size

input_df = pd.DataFrame(list(mutation_dict.items()), columns=['mutation_id', 'effect_size']) # dataframe input

input_df_manual = pd.DataFrame(list(mutation_dict.items()), columns=['mutation_id', 'effect_size']) # dataframe for the manual workflow
input_df_manual['causal_mutation_id'] = 0 # extra column needed when building the simulation steps manually


We first pass custom effect sizes as a list (here `specific_effects`, in which every mutation has effect size 1.0).

In [5]:
normalize_genetic_values_before_noise = True

heritability = 0.33

standardized_output = True

output_path = 'custom_pheno.phen' # file name for the saved .phen output, written to the current directory

phenotypes_list = sim_phenotypes_custom(grg_1, specific_effects, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, heritability=heritability, standardized_output=standardized_output, path=output_path)
phenotypes_list

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0          1.0                   0
1                1          1.0                   0
2                2          1.0                   0
3                3          1.0                   0
4                4          1.0                   0
...            ...          ...                 ...
10888        10888          1.0                   0
10889        10889          1.0                   0
10890        10890          1.0                   0
10891        10891          1.0                   0
10892        10892          1.0                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0         2665.0                   0
1                1         2729.0                   0
2                2         2740.0                   0
3                3         2773.0                   0
4      

     causal_mutation_id  individual_id  genetic_value  environmental_noise  phenotype
0                     0              0      -1.884264             1.448283  -0.435981
1                     0              1      -0.150362             0.463262   0.312901
2                     0              2       0.147653             0.997660   1.145313
3                     0              3       1.041696            -0.187224   0.854472
4                     0              4      -0.150362             1.058266   0.907904
...                 ...            ...            ...                  ...        ...
195                   0            195       1.339710             0.870780   2.210491
196                   0            196      -1.396604             1.194063  -0.202541
197                   0            197      -0.800575            -1.829407  -2.629982
198                   0            198      -0.692206             1.495442   0.803236
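In the output above, each `phenotype` is the sum of `genetic_value` and `environmental_noise` (for example, -1.884264 + 1.448283 = -0.435981). As a sketch of how a target heritability can fix the scale of the noise, the snippet below uses the standard definition h² = Var(G) / (Var(G) + Var(E)), which gives Var(E) = Var(G)(1 - h²) / h². The function name `noise_for_heritability` and the toy genetic values are assumptions for illustration; the library's own noise routine may differ in detail.

```python
import numpy as np


def noise_for_heritability(genetic_values, heritability, rng):
    """Draw zero-mean Gaussian noise whose variance targets the given h^2."""
    var_g = np.var(genetic_values)
    # Rearranged from h^2 = Var(G) / (Var(G) + Var(E)).
    var_e = var_g * (1.0 - heritability) / heritability
    return rng.normal(0.0, np.sqrt(var_e), size=len(genetic_values))


rng = np.random.default_rng(0)
genetic_values = rng.standard_normal(100_000)  # toy values, not from the GRG
noise = noise_for_heritability(genetic_values, heritability=0.33, rng=rng)
phenotypes = genetic_values + noise

# Realized heritability of the simulated sample; it approaches the
# target as the sample size grows.
realized_h2 = np.var(genetic_values) / np.var(phenotypes)
```

With only 200 individuals, as in this notebook, the realized ratio in any single run can sit noticeably away from 0.33; the large toy sample here is used only to make the relationship visible.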


Next, we pass custom effect sizes as a dictionary that maps each mutation id to its effect size.

In [6]:
normalize_genetic_values_before_noise = True

heritability = 0.33

#by default, the standard .phen output will not be saved

phenotypes_dict = sim_phenotypes_custom(grg_1, mutation_dict, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, heritability=heritability)
phenotypes_dict

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     1.299722                   0
1                1    -0.603991                   0
2                2     0.199823                   0
3                3     1.769115                   0
4                4     0.200174                   0
...            ...          ...                 ...
10888        10888    -0.537798                   0
10889        10889     1.214221                   0
10890        10890     0.790617                   0
10891        10891    -1.084699                   0
10892        10892    -0.608201                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0      68.815937                   0
1                1      27.343123                   0
2                2      26.384373                   0
3                3     103.526853                   0
4      

     causal_mutation_id  individual_id  genetic_value  environmental_noise  phenotype
0                     0              0       1.292056             0.755150   2.047207
1                     0              1       0.158733             0.713695   0.872428
2                     0              2       0.132534            -1.942494  -1.809961
3                     0              3       2.240598             2.867540   5.108137
4                     0              4       1.241830             2.627134   3.868964
...                 ...            ...            ...                  ...        ...
195                   0            195      -0.708728            -0.411610  -1.120338
196                   0            196      -0.912680            -0.406501  -1.319181
197                   0            197       0.164682            -1.191408  -1.026725
198                   0            198      -0.936673            -0.545649  -1.482323


Finally, we pass custom effect sizes as a pandas dataframe. The user need not add the `causal_mutation_id` column; it is handled internally.

In [7]:
normalize_genetic_values_before_noise = True

heritability = 0.33

phenotypes_df = sim_phenotypes_custom(grg_1, input_df, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, heritability=heritability)
phenotypes_df

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     1.299722                   0
1                1    -0.603991                   0
2                2     0.199823                   0
3                3     1.769115                   0
4                4     0.200174                   0
...            ...          ...                 ...
10888        10888    -0.537798                   0
10889        10889     1.214221                   0
10890        10890     0.790617                   0
10891        10891    -1.084699                   0
10892        10892    -0.608201                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0      68.815937                   0
1                1      27.343123                   0
2                2      26.384373                   0
3                3     103.526853                   0
4      

     causal_mutation_id  individual_id  genetic_value  environmental_noise  phenotype
0                     0              0       1.292056            -0.469228   0.822828
1                     0              1       0.158733             0.967233   1.125966
2                     0              2       0.132534            -1.235867  -1.103334
3                     0              3       2.240598             0.799297   3.039895
4                     0              4       1.241830             0.380386   1.622216
...                 ...            ...            ...                  ...        ...
195                   0            195      -0.708728             1.775633   1.066905
196                   0            196      -0.912680            -1.193242  -2.105922
197                   0            197       0.164682             1.511735   1.676417
198                   0            198      -0.936673            -1.433974  -2.370647


Alternatively, the user can supply custom effect sizes in a compatible dataframe and build the consecutive steps of the simulation manually instead of calling `sim_phenotypes_custom`. In the univariate case, the dataframe must be structured like `input_df_manual` above, including the `causal_mutation_id` column.
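The layout of such a dataframe can be sketched on toy data. The helper `build_manual_effect_frame` below is hypothetical: it reproduces the three columns of `input_df_manual` (`mutation_id`, `effect_size`, `causal_mutation_id`), and its length check is an assumption of this sketch rather than a documented library requirement.

```python
import pandas as pd


def build_manual_effect_frame(effect_sizes, num_mutations, causal_mutation_id=0):
    """Hypothetical helper: build a univariate effect-size frame with the
    same three columns as `input_df_manual`."""
    # Assumption: one effect size is supplied per mutation, as in this notebook.
    if len(effect_sizes) != num_mutations:
        raise ValueError(
            f"expected {num_mutations} effect sizes, got {len(effect_sizes)}"
        )
    frame = pd.DataFrame(
        {
            "mutation_id": range(num_mutations),
            "effect_size": [float(x) for x in effect_sizes],
        }
    )
    # A single trait in the univariate case, so every row gets the same id.
    frame["causal_mutation_id"] = causal_mutation_id
    return frame


manual_df = build_manual_effect_frame([0.3, -0.7, 1.1, 0.0], num_mutations=4)
```

In this notebook the equivalent call would pass `effect_sizes` and `n`; the resulting frame has the same columns as `input_df_manual`.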

We now show how to simulate phenotypes from custom effect sizes with user-defined noise, specified through the `user_mean` and `user_cov` arguments.

In [8]:
normalize_genetic_values_before_noise = True

mean_1 = 0
std_1 = 1

phenotypes_df_user_noise = sim_phenotypes_custom(grg_1, input_df, normalize_genetic_values_before_noise=normalize_genetic_values_before_noise, user_mean=mean_1, user_cov=std_1)
phenotypes_df_user_noise

The initial effect sizes are 
       mutation_id  effect_size  causal_mutation_id
0                0     1.299722                   0
1                1    -0.603991                   0
2                2     0.199823                   0
3                3     1.769115                   0
4                4     0.200174                   0
...            ...          ...                 ...
10888        10888    -0.537798                   0
10889        10889     1.214221                   0
10890        10890     0.790617                   0
10891        10891    -1.084699                   0
10892        10892    -0.608201                   0

[10893 rows x 3 columns]
The genetic values of the individuals are 
     individual_id  genetic_value  causal_mutation_id
0                0      68.815937                   0
1                1      27.343123                   0
2                2      26.384373                   0
3                3     103.526853                   0
4      

     causal_mutation_id  individual_id  genetic_value  environmental_noise  phenotype
0                     0              0       1.292056             0.443900   1.735956
1                     0              1       0.158733             0.246125   0.404859
2                     0              2       0.132534            -0.916166  -0.783633
3                     0              3       2.240598             0.023182   2.263779
4                     0              4       1.241830             1.855594   3.097424
...                 ...            ...            ...                  ...        ...
195                   0            195      -0.708728            -0.276704  -0.985432
196                   0            196      -0.912680            -1.239123  -2.151803
197                   0            197       0.164682            -1.564872  -1.400189
198                   0            198      -0.936673            -0.100849  -1.037522
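As a rough picture of what user-defined noise means in the univariate case, the sketch below draws Gaussian noise with a chosen mean and spread and adds it to a vector of genetic values. It does not call the library; `add_user_noise` is a hypothetical name, and treating the spread as a standard deviation is an assumption. With the value 1 used above, standard deviation and variance coincide, so this example does not settle how `user_cov` is interpreted.

```python
import numpy as np


def add_user_noise(genetic_values, mean, std, seed=None):
    """Hypothetical helper: add N(mean, std^2) noise to each genetic value."""
    rng = np.random.default_rng(seed)
    genetic_values = np.asarray(genetic_values, dtype=float)
    noise = rng.normal(loc=mean, scale=std, size=genetic_values.shape)
    return noise, genetic_values + noise


# First four genetic values from the output above.
g = [1.292056, 0.158733, 0.132534, 2.240598]
noise, phenotype = add_user_noise(g, mean=0.0, std=1.0, seed=42)
```

Fixing `seed` makes the draw reproducible, which helps when comparing runs that differ only in the noise parameters.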
