<a href="https://colab.research.google.com/github/reworkhow/AG2PI-Workshop/blob/main/2.DataSharingCollaboration/data_encryption.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <img src="https://github.com/JuliaLang/julia-logo-graphics/raw/master/images/julia-logo-color.png" height="100" /> _Colab Notebook Template_

## Instructions
1. Work on a copy of this notebook: _File_ > _Save a copy in Drive_ (you will need a Google account). Alternatively, you can download the notebook using _File_ > _Download .ipynb_, then upload it to [Colab](https://colab.research.google.com/).
2. If you need a GPU: _Runtime_ > _Change runtime type_ > _Hardware accelerator_ = _GPU_.
3. Execute the following cell (click on it and press Ctrl+Enter) to install Julia, IJulia and other packages (if needed, update `JULIA_VERSION` and the other parameters). This takes a couple of minutes.
4. Reload this page (press Ctrl+R, or ⌘+R, or the F5 key) and continue to the next section.

_Notes_:
* If your Colab Runtime gets reset (e.g., due to inactivity), repeat steps 2, 3 and 4.
* After installation, if you want to change the Julia version or activate/deactivate the GPU, you will need to reset the Runtime: _Runtime_ > _Factory reset runtime_ and repeat steps 3 and 4.

In [None]:
%%shell
set -e

#---------------------------------------------------#
JULIA_VERSION="1.8.5" # any version ≥ 0.7.0
JULIA_PACKAGES="IJulia CSV DataFrames Random Statistics Distributions LinearAlgebra JWAS"  # Plots will cause error
JULIA_PACKAGES_IF_GPU=""
JULIA_NUM_THREADS=4
#---------------------------------------------------#

if [ -n "$COLAB_GPU" ] && [ -z `which julia` ]; then
  # Install Julia
  JULIA_VER=`cut -d '.' -f -2 <<< "$JULIA_VERSION"`
  echo "Installing Julia $JULIA_VERSION on the current Colab Runtime..."
  BASE_URL="https://julialang-s3.julialang.org/bin/linux/x64"
  URL="$BASE_URL/$JULIA_VER/julia-$JULIA_VERSION-linux-x86_64.tar.gz"
  wget -nv $URL -O /tmp/julia.tar.gz # -nv means "not verbose"
  tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
  rm /tmp/julia.tar.gz

  # Install Packages
  if [ "$COLAB_GPU" = "1" ]; then
      JULIA_PACKAGES="$JULIA_PACKAGES $JULIA_PACKAGES_IF_GPU"
  fi
  for PKG in `echo $JULIA_PACKAGES`; do
    echo "Installing Julia package $PKG..."
    julia -e 'using Pkg; pkg"add '$PKG'; precompile;"' &> /dev/null
  done

  # Install kernel and rename it to "julia"
  echo "Installing IJulia kernel..."
  julia -e 'using IJulia; IJulia.installkernel("julia", env=Dict(
      "JULIA_NUM_THREADS"=>"'"$JULIA_NUM_THREADS"'"))'
  KERNEL_DIR=`julia -e "using IJulia; print(IJulia.kerneldir())"`
  KERNEL_NAME=`ls -d "$KERNEL_DIR"/julia*`
  mv -f $KERNEL_NAME "$KERNEL_DIR"/julia

  echo ''
  echo "Successfully installed `julia -v`!"
  echo "Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then"
  echo "jump to the 'Checking the Installation' section."
fi

Installing Julia 1.8.5 on the current Colab Runtime...
2023-06-14 21:26:14 URL:https://storage.googleapis.com/julialang2/bin/linux/x64/1.8/julia-1.8.5-linux-x86_64.tar.gz [130873886/130873886] -> "/tmp/julia.tar.gz" [1]
Installing Julia package IJulia...
Installing Julia package CSV...
Installing Julia package DataFrames...
Installing Julia package Random...
Installing Julia package Statistics...
Installing Julia package Distributions...
Installing Julia package LinearAlgebra...
Installing Julia package JWAS...
Installing IJulia kernel...
[ Info: Installing julia kernelspec in /root/.local/share/jupyter/kernels/julia-1.8

Successfully installed julia version 1.8.5!
Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then
jump to the 'Checking the Installation' section.




In [None]:
versioninfo()

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 2 × Intel(R) Xeon(R) CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
  Threads: 4 on 2 virtual cores
Environment:
  LD_LIBRARY_PATH = /usr/lib64-nvidia
  JULIA_NUM_THREADS = 4


In [None]:
using CSV, DataFrames, Random, Statistics, Distributions, LinearAlgebra, JWAS

# Homomorphic encryption using a high-dimensional random orthogonal matrix

Simulate raw genotype (M) and phenotype (y) data for 100 individuals and 10 markers. Both M and y are centered after simulation.

In [None]:
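Random.seed!(123)
n, p = 100,10
M  = rand([0.0,1,2],n,p)          # genotypes coded 0/1/2
g  = M*randn(p)                   # true genetic values from random marker effects
y  = g + randn(n)*sqrt(var(g))    # residual variance equal to var(g)
M = M.-mean(M,dims=1)             # center each marker column
y = y.-mean(y);                   # center phenotypes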

Generate a random orthogonal matrix P, which serves as the encryption key:

In [None]:
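# Draw a random n×p matrix with orthonormal columns;
# for n == p the result is a random orthogonal matrix.
function stiefel(n,p;rng)
   A  = randn(rng,n,p)    # Gaussian random matrix
   AA = A'*A              # p×p symmetric matrix
   return A*AA^(-0.5)     # A(A'A)^(-1/2) has orthonormal columns
end
P = stiefel(n,n,rng=MersenneTwister(3))  # n×n key used to encrypt M and y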

P[1:5,1:5]

5×5 Matrix{Float64}:
  0.0545174  -0.0434924  -0.0755381     0.022179    -0.0868766
 -0.219804    0.120624   -0.0956059     0.100837    -0.130328
  0.181158    0.126938    0.000617421   0.00680411  -0.119745
 -0.123869    0.0369384  -0.0976296    -0.121432     0.00928307
  0.0763924  -0.0964402   0.0865538    -0.0205998   -0.172082

In [None]:
round.(P'P,digits=3)[1:5,1:5]

5×5 Matrix{Float64}:
  1.0  -0.0  -0.0  -0.0   0.0
 -0.0   1.0  -0.0   0.0  -0.0
 -0.0  -0.0   1.0   0.0   0.0
 -0.0   0.0   0.0   1.0  -0.0
  0.0  -0.0   0.0  -0.0   1.0
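
The identity matrix seen in this check follows directly from the construction used in `stiefel`. As a short derivation, assuming $A^\top A$ is invertible (which holds with probability one for a Gaussian $A$ with $n \ge p$), write $S = A^\top A$ and $P = A S^{-1/2}$. Since $S^{-1/2}$ is symmetric:

$$
P^\top P = S^{-1/2} A^\top A \, S^{-1/2} = S^{-1/2} \, S \, S^{-1/2} = I_p .
$$

In the square case $n = p$ used here, $P$ is invertible, so $P^{-1} = P^\top$ and $P P^\top = I_n$ as well.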

Encrypt the genotype and phenotype data by left-multiplying with P:

In [None]:
M[1:3,1:3]

3×3 Matrix{Float64}:
 -0.01   0.05  0.95
 -0.01  -0.95  0.95
  0.99   0.05  0.95

In [None]:
# encrypt genotypes: P*M
M_encrypted = P*M
M_encrypted[1:3,1:3]

3×3 Matrix{Float64}:
  0.22479    -1.0301    -0.477804
  0.0317817   0.986898   0.720304
 -1.00543    -0.554276   1.33093

In [None]:
y[1:3]

3-element Vector{Float64}:
 1.4583758399849935
 2.0685919432881854
 3.6662720880277546

In [None]:
#encrypt phenotypes P*y
y_encrypted = P*y
y_encrypted[1:3]

3-element Vector{Float64}:
  3.4279651874846992
 -0.3124339258801001
  5.241554432345621
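
Because P is orthogonal, whoever holds P can undo the encryption with its transpose. The following is a minimal sketch of that round trip, not part of the original analysis: the sizes, the seed, and the stand-in vector `y` are assumptions chosen for illustration, and the key is built with the same `A*(A'A)^(-0.5)` construction as `stiefel`.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(2)        # arbitrary seed for this sketch
n   = 20                        # hypothetical number of individuals
A   = randn(rng, n, n)
P   = A * (A'A)^(-0.5)          # random orthogonal key, built as in stiefel
y   = randn(rng, n)             # stand-in for a phenotype vector

y_enc = P * y                   # encryption
y_dec = P' * y_enc              # decryption uses the transpose, since P'P = I

println("max recovery error: ", maximum(abs.(y_dec - y)))
println("norm preserved:     ", norm(y_enc) ≈ norm(y))
```

The encrypted vector has the same Euclidean norm as the original, while its individual entries no longer correspond to individual records.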

## Preserved relationships between SNPs

In [None]:
(M'M)[1:5,1:5]

5×5 Matrix{Float64}:
  68.99   -3.95   3.95  -11.11  -2.15
  -3.95   66.75   3.25  -14.45  -5.25
   3.95    3.25  76.75   -7.55   9.25
 -11.11  -14.45  -7.55   65.79  -0.65
  -2.15   -5.25   9.25   -0.65  54.75

In [None]:
(M_encrypted'M_encrypted)[1:5,1:5]

5×5 Matrix{Float64}:
  68.99   -3.95   3.95  -11.11  -2.15
  -3.95   66.75   3.25  -14.45  -5.25
   3.95    3.25  76.75   -7.55   9.25
 -11.11  -14.45  -7.55   65.79  -0.65
  -2.15   -5.25   9.25   -0.65  54.75

## Scrambled relationships between individuals

In [None]:
(M*M')[1:5,1:5]

5×5 Matrix{Float64}:
  6.7559  -3.1641   0.1359  -3.0141   0.3059
 -3.1641   4.9159   1.2159   2.0659   1.3859
  0.1359   1.2159   4.5159  -0.6341   1.6859
 -3.0141   2.0659  -0.6341   8.2159  -0.4641
  0.3059   1.3859   1.6859  -0.4641   2.8559

In [None]:
(M_encrypted*M_encrypted')[1:5,1:5]

5×5 Matrix{Float64}:
  5.49998   -0.188062   0.674004   0.414616   0.469399
 -0.188062   5.7496    -1.18879   -1.15605   -1.62761
  0.674004  -1.18879    9.05803   -1.55662    0.489959
  0.414616  -1.15605   -1.55662    6.721     -2.54658
  0.469399  -1.62761    0.489959  -2.54658   10.9648
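
The two comparisons above can be reproduced in a few lines. This is a sketch under assumed sizes and seed (not the notebook's data): marker-by-marker products are unchanged because (PM)'(PM) = M'P'PM = M'M, whereas individual-by-individual products become P(MM')P', which generally differs from MM'.

```julia
using LinearAlgebra, Random

rng  = MersenneTwister(1)                 # arbitrary seed
n, p = 50, 4                              # hypothetical sizes
A    = randn(rng, n, n)
P    = A * (A'A)^(-0.5)                   # random orthogonal matrix
M    = rand(rng, [0.0, 1.0, 2.0], n, p)   # stand-in genotypes
Me   = P * M                              # encrypted genotypes

# Marker-by-marker products: (PM)'(PM) = M'P'PM = M'M
snp_preserved = Me'Me ≈ M'M
# Individual-by-individual products: (PM)(PM)' = P(MM')P', generally not MM'
ind_preserved = Me*Me' ≈ M*M'

println("M'M preserved: ", snp_preserved)
println("MM' preserved: ", ind_preserved)
```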

# Bayesian variable selection model (BayesC$\pi$) using raw and encrypted data

In [None]:
using JWAS,LinearAlgebra

In [None]:
Random.seed!(1)
G = R = 1.0

1.0

Raw data:

In [None]:
genotypes       = get_genotypes(M,G,G_is_marker_variance = true,center=false,method="BayesC",quality_control=false)
model_equation  = "y = intercept + genotypes";
model           = build_model(model_equation,R);
pheno           = DataFrame(ID=1:n,y=y);
out             = runMCMC(model,pheno,chain_length=50_000,double_precision=true);

The marker IDs are set to 1,2,...,#markers
The individual IDs is set to 1,2,...,#observations
Genotype informatin:
#markers: 10; #individuals: 100
The folder results is created to save results.
Checking genotypes...
Checking phenotypes...
Individual IDs (strings) are provided in the first column of the phenotypic data.
Predicted values for individuals of interest will be obtained as the summation of Any[] (Note that genomic data is always included for now).
Phenotypes for 100 observations are used in the analysis.These individual IDs are saved in the file IDs_for_individuals_with_phenotypes.txt.

A Linear Mixed Model was build using model equations:

y = intercept + genotypes

Model Information:

Term            C/F          F/R            nLevels
intercept       factor       fixed                1

MCMC Information:

chain_length                                  

running MCMC ... 100%|███████████████████████████████████| Time: 0:00:04

The analysis has finished. Results are saved in the returned variable and text files. MCMC samples are saved in text files.




Encrypted data:

In [None]:
genotypes_encrypted       = get_genotypes(M_encrypted,G,G_is_marker_variance = true,center=false,method="BayesC",quality_control=false)
model_equation_encrypted  ="y_encrypted = intercept + genotypes_encrypted";
model_encrypted           = build_model(model_equation_encrypted,R);
pheno_encrypted           = DataFrame(ID=1:n,y_encrypted=y_encrypted);
out_encrypted             = runMCMC(model_encrypted,pheno_encrypted,chain_length=50_000,double_precision=true);

The marker IDs are set to 1,2,...,#markers
The individual IDs is set to 1,2,...,#observations
Genotype informatin:
#markers: 10; #individuals: 100
The folder results already exists.
The folder results1 is created to save results.
Checking genotypes...
Checking phenotypes...
Individual IDs (strings) are provided in the first column of the phenotypic data.
Predicted values for individuals of interest will be obtained as the summation of Any[] (Note that genomic data is always included for now).
Phenotypes for 100 observations are used in the analysis.These individual IDs are saved in the file IDs_for_individuals_with_phenotypes.txt.

A Linear Mixed Model was build using model equations:

y_encrypted = intercept + genotypes_encrypted

Model Information:

Term            C/F          F/R            nLevels
intercept       factor       fixed                1

MCMC

running MCMC ... 100%|███████████████████████████████████| Time: 0:00:00

The analysis has finished. Results are saved in the returned variable and text files. MCMC samples are saved in text files.




Marker effects estimated from the raw data (left column) and the encrypted data (right column):

In [None]:
println("The correlation is: ", cor(out["marker effects genotypes"][!,:Estimate],out_encrypted["marker effects genotypes_encrypted"][!,:Estimate]))

[out["marker effects genotypes"][!,:Estimate] out_encrypted["marker effects genotypes_encrypted"][!,:Estimate]]

The correlation is: 0.9998764117229259


10×2 Matrix{Float64}:
 -0.0365045  -0.0213402
 -0.503389   -0.476628
  1.42329     1.40781
  0.482046    0.500435
 -1.36591    -1.35355
 -1.09932    -1.10219
 -0.508864   -0.515837
  1.22906     1.25062
 -0.588028   -0.583852
 -0.56502    -0.538082
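
One way to see why the two analyses agree is to look at an estimator with a closed form. The sketch below swaps in ridge regression without an intercept (not the BayesC MCMC that JWAS runs here) because its solution depends on the data only through M'M and M'y, both of which are unchanged by the encryption. The sizes, seed, penalty `λ`, and the helper name `ridge` are assumptions made for illustration.

```julia
using LinearAlgebra, Random, Statistics

rng     = MersenneTwister(4)              # arbitrary seed
n, p, λ = 60, 5, 1.0                      # hypothetical sizes and penalty
M = rand(rng, [0.0, 1.0, 2.0], n, p)      # stand-in genotypes
M .-= mean(M, dims=1)                     # center markers
y = M * randn(rng, p) + randn(rng, n)     # stand-in phenotypes
y .-= mean(y)

A = randn(rng, n, n)
P = A * (A'A)^(-0.5)                      # random orthogonal key

# Ridge estimate (M'M + λI)^(-1) M'y; hypothetical helper for this sketch
ridge(M, y, λ) = (M'M + λ*I) \ (M'y)

α_raw = ridge(M, y, λ)                    # from raw data
α_enc = ridge(P*M, P*y, λ)                # from encrypted data

println("max difference in marker effects: ", maximum(abs.(α_raw - α_enc)))
```

For this deterministic estimator the two solutions coincide up to floating-point error. The BayesC estimates in this notebook come from MCMC chains, so a small difference between the raw and encrypted runs is expected.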

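Breeding values. The EBVs from the encrypted analysis are themselves encrypted, so they are decrypted with P' before being compared with the raw-data EBVs: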

In [None]:
EBV_unencrypted=P'*out_encrypted["EBV_y_encrypted"][!,:EBV]
println("The correlation is: ", cor(out["EBV_y"][!,:EBV],EBV_unencrypted))

[out["EBV_y"][!,:EBV] EBV_unencrypted ]

The correlation is: 0.9997848958650805


100×2 Matrix{Float64}:
  1.09503    1.05313
  1.8361     1.8312
  2.17764    2.18383
  0.344712   0.328436
  2.56646    2.55718
 -3.94382   -3.93782
  0.923808   0.952068
 -1.54749   -1.46066
 -0.572625  -0.553677
 -0.720548  -0.746947
 -0.656238  -0.743436
 -2.43366   -2.45559
  3.70776    3.71992
  ⋮         
 -1.61005   -1.66458
 -2.15351   -2.17042
  0.404903   0.371322
  2.80143    2.75025
 -2.98915   -3.00271
 -2.29629   -2.26483
 -1.47055   -1.48419
 -0.586388  -0.632109
 -2.24757   -2.17141
  1.27398    1.28088
 -1.34386   -1.37246
 -0.13118   -0.146663
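
The decryption step `P'*EBV` can be justified in one line. Assuming, as the JWAS log suggests, that the reported EBVs contain only the genomic term, the encrypted EBVs are $\hat{u}_{enc} = (PM)\,\hat{\alpha}_{enc}$, where $\hat{\alpha}_{enc}$ denotes the marker effects estimated from the encrypted data. Then

$$
P^\top \hat{u}_{enc} = P^\top P \, M \, \hat{\alpha}_{enc} = M \, \hat{\alpha}_{enc},
$$

which is the EBV vector on the original scale, and it matches the raw-data EBVs $M\hat{\alpha}$ to the extent that $\hat{\alpha}_{enc}$ matches $\hat{\alpha}$.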

genetic variance:



In [None]:
[out["genetic_variance"][1,:Estimate] out_encrypted["genetic_variance"][1,:Estimate]]

1×2 Matrix{Float64}:
 3.82667  3.7475

residual variance:

In [None]:
[out["residual variance"][1,:Estimate] out_encrypted["residual variance"][1,:Estimate]] # close, but not identical

1×2 Matrix{Float64}:
 1.98222  1.9833

heritability:

In [None]:
[out["heritability"][1,:Estimate] out_encrypted["heritability"][1,:Estimate]] # close, but not identical

1×2 Matrix{Float64}:
 0.657371  0.65227

# Joint analysis using encrypted data from multiple contributors

Simulate data for a second contributor (contributor2) with 200 individuals and 10 markers.

In [None]:
n2, p2 = 200,10 #for larger p (e.g.,500), a longer chain is needed
M2   = rand([0,1,2],n2,p2);
g2   = M2*randn(p2)
y2   = g2 + randn(n2)*sqrt(var(g2));
M2=M2.-mean(M2,dims=1)
y2=y2.-mean(y2);

Encrypt contributor2's data using its own, independently generated key P2:

In [None]:
# generate P
P2    = stiefel(n2,n2,rng=MersenneTwister(123))
#data encryption
y2_encrypted   = P2*y2;
M2_encrypted   = P2*M2;

joint raw data:

In [None]:
M_all=[M
       M2]
y_all = [y
         y2];

@show size(M_all),size(y_all);

(size(M_all), size(y_all)) = ((300, 10), (300,))


joint encrypted data:

In [None]:
M_all_encrypted = [M_encrypted
                   M2_encrypted]
y_all_encrypted = [y_encrypted
                   y2_encrypted];

@show size(M_all_encrypted),size(y_all_encrypted);

(size(M_all_encrypted), size(y_all_encrypted)) = ((300, 10), (300,))
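
Stacking data encrypted with different keys is equivalent to encrypting the stacked raw data with one block-diagonal orthogonal matrix, so the joint M'M is still preserved. The sketch below checks this under assumed sizes and seed; `random_orthogonal` is a hypothetical helper that mirrors the `stiefel` construction, and the Gaussian matrices are stand-ins for centered genotypes.

```julia
using LinearAlgebra, Random

# Hypothetical helper mirroring the notebook's stiefel construction (square case)
function random_orthogonal(rng, n)
    A = randn(rng, n, n)
    return A * (A'A)^(-0.5)
end

rng       = MersenneTwister(3)            # arbitrary seed
n1, n2, p = 30, 40, 5                     # hypothetical sizes
M1 = randn(rng, n1, p)                    # stand-in data, contributor 1
M2 = randn(rng, n2, p)                    # stand-in data, contributor 2
P1 = random_orthogonal(rng, n1)           # contributor 1's private key
P2 = random_orthogonal(rng, n2)           # contributor 2's private key

M_all     = [M1; M2]                      # joint raw data
M_all_enc = [P1*M1; P2*M2]                # joint encrypted data
P_all     = [P1 zeros(n1, n2); zeros(n2, n1) P2]   # implied joint key

println("stacking equals block-diagonal encryption: ", M_all_enc ≈ P_all * M_all)
println("joint M'M preserved: ", M_all_enc'M_all_enc ≈ M_all'M_all)
```

No contributor needs to know another contributor's key for this to hold; the block-diagonal matrix is only formed here to verify the equivalence.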


# Bayesian variable selection model (BayesC$\pi$) using joint raw and encrypted data

raw data:

In [None]:
genotypes       = get_genotypes(M_all,G,G_is_marker_variance = true,center=false,method="BayesC",quality_control=false)
model_equation  = "y = intercept + genotypes";
model           = build_model(model_equation,R);
pheno           = DataFrame(ID=1:(n+n2), y=y_all)
out             = runMCMC(model,pheno,chain_length=50_000,double_precision=true);

The marker IDs are set to 1,2,...,#markers
The individual IDs is set to 1,2,...,#observations
Genotype informatin:
#markers: 10; #individuals: 300
The folder results already exists.
The folder results1 already exists.
The folder results2 is created to save results.
Checking genotypes...
Checking phenotypes...
Individual IDs (strings) are provided in the first column of the phenotypic data.
Predicted values for individuals of interest will be obtained as the summation of Any[] (Note that genomic data is always included for now).
Phenotypes for 300 observations are used in the analysis.These individual IDs are saved in the file IDs_for_individuals_with_phenotypes.txt.

A Linear Mixed Model was build using model equations:

y = intercept + genotypes

Model Information:

Term            C/F          F/R            nLevels
intercept       factor       fixed     

running MCMC ... 100%|███████████████████████████████████| Time: 0:00:01

The analysis has finished. Results are saved in the returned variable and text files. MCMC samples are saved in text files.




encrypted data:

In [None]:
genotypes_encrypted       = get_genotypes(M_all_encrypted,G,G_is_marker_variance = true,center=false,method="BayesC",quality_control=false)
model_equation_encrypted  ="y_encrypted = intercept + genotypes_encrypted";
model_encrypted           = build_model(model_equation_encrypted,R);
pheno_encrypted           = DataFrame(ID=1:(n+n2),y_encrypted=y_all_encrypted)
out_encrypted             = runMCMC(model_encrypted,pheno_encrypted,chain_length=50_000,double_precision=true);

The marker IDs are set to 1,2,...,#markers
The individual IDs is set to 1,2,...,#observations
Genotype informatin:
#markers: 10; #individuals: 300
The folder results already exists.
The folder results1 already exists.
The folder results2 already exists.
The folder results3 is created to save results.
Checking genotypes...
Checking phenotypes...
Individual IDs (strings) are provided in the first column of the phenotypic data.
Predicted values for individuals of interest will be obtained as the summation of Any[] (Note that genomic data is always included for now).
Phenotypes for 300 observations are used in the analysis.These individual IDs are saved in the file IDs_for_individuals_with_phenotypes.txt.

A Linear Mixed Model was build using model equations:

y_encrypted = intercept + genotypes_encrypted

Model Information:

Term            C/F      

running MCMC ... 100%|███████████████████████████████████| Time: 0:00:01

The analysis has finished. Results are saved in the returned variable and text files. MCMC samples are saved in text files.




Marker effects estimated from the joint raw data (left column) and the joint encrypted data (right column):

In [None]:
println("The correlation is: ", cor(out["marker effects genotypes"][!,:Estimate],out_encrypted["marker effects genotypes_encrypted"][!,:Estimate]))

[out["marker effects genotypes"][!,:Estimate] out_encrypted["marker effects genotypes_encrypted"][!,:Estimate]]

The correlation is: 0.9999313581865804


10×2 Matrix{Float64}:
  0.389828     0.406236
 -0.0454642   -0.0440686
  0.00629754  -0.0116026
  1.58312      1.59023
 -0.894963    -0.898297
 -0.852499    -0.873619
 -0.310281    -0.313352
 -0.71418     -0.713442
 -1.0042      -1.00792
 -0.814324    -0.819186

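Estimated breeding values. The joint key P_all is block diagonal, so each contributor's EBVs are decrypted with that contributor's own key: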

In [None]:
P_all=[P           zeros(n,n2)
       zeros(n2,n) P2]
EBV_unencrypted=P_all'*out_encrypted["EBV_y_encrypted"][!,:EBV]
println("The correlation is: ", cor(out["EBV_y"][!,:EBV],EBV_unencrypted))

[out["EBV_y"][!,:EBV] EBV_unencrypted ]

The correlation is: 0.999935617664424


300×2 Matrix{Float64}:
 -0.16309    -0.183917
  0.0240965   0.00961323
  0.142284    0.1275
 -0.793845   -0.820669
  2.90226     2.89855
 -2.48903    -2.49766
  1.01679     1.02696
 -1.84354    -1.83009
 -1.61079    -1.61308
  1.57299     1.55119
  2.92297     2.9329
  3.41628     3.44613
  3.62974     3.63522
  ⋮          
 -0.907107   -0.907912
  1.38673     1.40444
  0.314949    0.266132
  0.959666    1.00239
  3.96908     4.01261
  2.35193     2.38322
 -0.546432   -0.532653
 -1.15608    -1.1907
  3.61686     3.65481
 -2.21522    -2.25614
 -3.93044    -4.00279
  0.687832    0.719896

genetic variance:

In [None]:
[out["genetic_variance"][1,:Estimate] out_encrypted["genetic_variance"][1,:Estimate]]

1×2 Matrix{Float64}:
 4.54082  4.60765

residual variance:

In [None]:
[out["residual variance"][1,:Estimate] out_encrypted["residual variance"][1,:Estimate]] # close, but not identical

1×2 Matrix{Float64}:
 10.3848  10.3143

heritability:

In [None]:
[out["heritability"][1,:Estimate] out_encrypted["heritability"][1,:Estimate]] # close, but not identical

1×2 Matrix{Float64}:
 0.303191  0.307796