## Part 2. Using the variants' weights to estimate individual risks

### Inputs:
1. EAS PRS (contains population-specific posterior effect size estimates, one per SNP): test_EAS_pst_eff_a1_b0.5_phi1e-02_chr22.txt
2. EUR PRS (contains population-specific posterior effect size estimates, one per SNP): test_EUR_pst_eff_a1_b0.5_phi1e-02_chr22.txt
3. the validation dataset for genotypes: genotype_vali
4. the test dataset for genotypes: genotype_test
5. the validation dataset for phenotypes: phenotype_vali
6. the test dataset for phenotypes: phenotype_test

### Intermediate outputs:
1. a trimmed copy of test_EAS_pst_eff_a1_b0.5_phi1e-02_chr22.txt that keeps only the "rsid" and "effect size" columns for the EAS population: EAS_prscsx_output_cut
2. a trimmed copy of test_EUR_pst_eff_a1_b0.5_phi1e-02_chr22.txt that keeps only the "rsid" and "effect size" columns for the EUR population: EUR_prscsx_output_cut
3. the txt file that includes overlapping risk variants between the test data of EUR and EAS populations: overlap_risk_variants.txt

### Outputs:
1. the coefficient vector for EAS population ($W_{eas}$): W_eas
2. the coefficient vector for EUR population ($W_{eur}$): W_eur
3. the weight parameter for the EAS population: a_hat (displayed in the notebook and saved to weight_parameters)
4. the weight parameter for the EUR population: b_hat (displayed in the notebook and saved to weight_parameters)
5. the predicted phenotypes of the validation dataset: y_hat_vali

### Import Python packages

In [4]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import os
from IPython.display import Markdown

### Set the working directory

In [6]:
# Set the working directory as the parent folder of where the script is located and save it as a variable named "cwd"
cwd = os.path.dirname(os.getcwd())
os.chdir(cwd)

# Inspect the current working directory
print(f"Current working directory: {os.getcwd()}")

Current working directory: /Users/aliceyan/Documents/GitHub/prs-csx-workshop-tutorial-updated/run_use_evaluate_prscsx/use


### Process PRS-CSx results to extract the weights for variants that exist in both populations

In [8]:
# Cut the PRS-CSx output files to keep only the rsid and the effect size
!awk '{print $2 "\t" $6}' ./inputs/test_EAS_pst_eff_a1_b0.5_phi1e-02_chr22.txt > ./outputs/EAS_prscsx_output_cut
!awk '{print $2 "\t" $6}' ./inputs/test_EUR_pst_eff_a1_b0.5_phi1e-02_chr22.txt > ./outputs/EUR_prscsx_output_cut

In [9]:
# Extract rsid column and sort the files
!cut -f1 "./outputs/EAS_prscsx_output_cut" | sort > ./outputs/EAS_prscsx_output_cut_sorted
!cut -f1 "./outputs/EUR_prscsx_output_cut" | sort > ./outputs/EUR_prscsx_output_cut_sorted

# Find common items between the two sorted files
!comm -12 ./outputs/EAS_prscsx_output_cut_sorted ./outputs/EUR_prscsx_output_cut_sorted > "./outputs/overlap_risk_variants.txt"

# Clean up temporary files
!rm ./outputs/EAS_prscsx_output_cut_sorted ./outputs/EUR_prscsx_output_cut_sorted

In [10]:
# Read in the overlapping variants file
with open("./outputs/overlap_risk_variants.txt", "r") as file:
    overlap_var_list = [line.strip() for line in file]  # Removes newline characters

# Obtain W_eas
pd.set_option('display.float_format', lambda x: '%.10f' % x)
EAS_var_w = pd.read_csv("./outputs/EAS_prscsx_output_cut", sep = '\t', header = None, float_precision = 'high')
W_eas = EAS_var_w[EAS_var_w[0].isin(overlap_var_list)][[1]]
W_eas = W_eas.values

# Obtain W_eur
EUR_var_w = pd.read_csv("./outputs/EUR_prscsx_output_cut", sep = '\t', header = None, float_precision = 'high')
W_eur = EUR_var_w[EUR_var_w[0].isin(overlap_var_list)][[1]]
W_eur = W_eur.values

# Save W_eas and W_eur
np.savetxt("./outputs/W_eas", W_eas, fmt="%.10f", delimiter="\t")
np.savetxt("./outputs/W_eur", W_eur, fmt="%.10f", delimiter="\t")
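
The `isin` filter in the cell above keeps each file's own row order, so `W_eas` and `W_eur` line up row by row only if both PRS-CSx outputs list the shared variants in the same order. The sketch below shows one way to make that alignment explicit with an inner merge on the rsid. The data frames, column names (`rsid`, `beta`), and values are hypothetical toy stand-ins, not the workshop files.

```python
import pandas as pd

# Toy stand-ins for the two cut files; the rsids and effect sizes are made up.
eas = pd.DataFrame({"rsid": ["rs1", "rs2", "rs4"], "beta": [0.10, -0.20, 0.05]})
eur = pd.DataFrame({"rsid": ["rs4", "rs1", "rs3"], "beta": [0.30, 0.15, -0.07]})

# An inner merge keeps only variants present in both tables and puts the
# two effect sizes for each variant on the same row.
merged = eas.merge(eur, on="rsid", suffixes=("_eas", "_eur"))

# Column vectors with one row per shared variant, in identical order.
W_eas_toy = merged[["beta_eas"]].to_numpy()
W_eur_toy = merged[["beta_eur"]].to_numpy()

print(merged)
```

The same merged table could also be used to check that the columns of the genotype matrix follow the same variant order, which the matrix products later in this part rely on.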

### Load the validation datasets for genotype vector X and phenotype vector y

In [12]:
vali_geno_vector_path = cwd + "/inputs/genotype_vali.tsv"
vali_pheno_vector_path = cwd + "/inputs/phenotype_vali.tsv"
X_vali = np.loadtxt(vali_geno_vector_path, delimiter="\t")
y_vali = np.loadtxt(vali_pheno_vector_path, delimiter="\t")

### Run Regression to find weight parameters a_hat and b_hat

In [14]:
# Inspect data dimensions
print(X_vali.shape)
print(y_vali.shape)
print(W_eas.shape)
print(W_eur.shape)

(201, 901)
(201,)
(901, 1)
(901, 1)


### Prepare the model input

In [16]:
# Based on the equation, calculate the weighted input data for each population
XWeas_vali = X_vali @ W_eas
XWeur_vali = X_vali @ W_eur
# Horizontally stack the weighted inputs for both populations to form the model input
XW_vali = np.hstack((XWeas_vali, XWeur_vali))

# In essence, this block of code takes the validation data, 
# multiplies it by population-specific coefficient vectors, 
# and then concatenates the results side by side to form a combined input for the model.
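
Written out, the model that this cell prepares appears to be the linear combination below. It is reconstructed from the code in this notebook, so treat it as a reading of the code rather than a quotation of the PRS-CSx method. Here $X$ is the validation genotype matrix (`X_vali`, 201 × 901), $W_{eas}$ and $W_{eur}$ are the 901 × 1 coefficient vectors, $y$ is the validation phenotype vector (`y_vali`), and $a$, $b$ are the scalar population weights.

$$
\hat{y} = \hat{a}\, X W_{eas} + \hat{b}\, X W_{eur},
\qquad
(\hat{a}, \hat{b}) = \arg\min_{a,\, b} \left\lVert\, y - a\, X W_{eas} - b\, X W_{eur} \right\rVert_2^2
$$

Stacking $X W_{eas}$ and $X W_{eur}$ side by side gives a 201 × 2 design matrix, so $a$ and $b$ are simply the two regression coefficients of a no-intercept linear model.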

### Fit the model

In [18]:
model = LinearRegression(fit_intercept = False).fit(XW_vali, y_vali) 

# We usually don't include an intercept in the PRS calculation.
# As a result, the PRS calculated in this manner reflects only the relative risk, not the absolute risk.
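
To make the no-intercept fit concrete, the sketch below solves the same kind of least-squares problem directly with `np.linalg.lstsq` on hypothetical toy data. The array names (`XW_toy`, `y_toy`, `true_coef`) and all values are assumptions for illustration; the two columns merely stand in for the EAS and EUR weighted scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 50 samples and two score columns standing in for
# X @ W_eas and X @ W_eur.
XW_toy = rng.normal(size=(50, 2))
true_coef = np.array([0.9, 0.1])
y_toy = XW_toy @ true_coef  # noise-free, so the fit should recover true_coef

# Least squares without an intercept: minimise ||y_toy - XW_toy @ coef||^2.
coef, residuals, rank, _ = np.linalg.lstsq(XW_toy, y_toy, rcond=None)
a_toy, b_toy = coef

print(f"a_toy = {a_toy:.4f}, b_toy = {b_toy:.4f}")
```

Because no column of ones is added to the design matrix, the fitted line is forced through the origin, which is the same constraint that `fit_intercept = False` imposes in the cell above.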

### Obtain the regression parameters

In [20]:
a_hat = model.coef_[0]
b_hat = model.coef_[1]
print(f"{a_hat =}")
print(f"{b_hat =}")
np.savetxt("./outputs/weight_parameters", [[a_hat, b_hat]], fmt="%.10f", delimiter="\t")

a_hat =0.966567462748027
b_hat =0.06467726018612857


### Predict Phenotype on Validation Datasets

In [22]:
# Make predictions on validation data
y_hat_vali = a_hat * XWeas_vali + b_hat * XWeur_vali
y_hat_vali = y_hat_vali.flatten()  # flatten() converts the 2D (201, 1) array into a 1D array
print(f"y_hat_vali.shape: {y_hat_vali.shape}")
print(f"y_hat_vali:\n {y_hat_vali}")

np.savetxt("./outputs/y_hat_vali", y_hat_vali, delimiter="\t", fmt="%s")

y_hat_vali.shape: (201,)
y_hat_vali:
 [ 0.06679619  0.05423122  0.05315978  0.08290685  0.07300631  0.0679764
  0.05492552  0.08604488  0.0708863   0.05682363  0.08017354  0.0751186
  0.08320225  0.03592101  0.06294025  0.06434849  0.06062601  0.0488995
  0.08803146  0.05855221  0.07837309  0.06010917  0.08693166  0.04142555
  0.08624608  0.08567306  0.08062337  0.10498899  0.07734738  0.04989415
 -0.00395381  0.03646481  0.05039314  0.06268068  0.07537368  0.05014194
  0.06357115  0.07347852  0.09554682  0.03882835  0.06824929  0.05928679
  0.07099239  0.06790522  0.0304094   0.05231566  0.05488074  0.08225548
  0.04245699  0.06240646  0.0882606   0.04473492  0.04838708  0.07043227
  0.05451036  0.07576891  0.06265146  0.0648521   0.0744746   0.07665692
  0.06673922  0.0586416   0.08227391  0.0661315   0.07165477  0.08666999
  0.04593017  0.05398522  0.04981935  0.01015506  0.09139208  0.07859414
  0.01833147  0.07449207  0.07300863  0.07487902  0.08894729  0.07780635
  0.0465033   0.