## Part 1. Running PRS-CSx to obtain variant weights

### Inputs:
1. Reference panel for the EAS population: ldblk_1kg_eas.tar.gz (from Dropbox)
2. Reference panel for the EUR population: ldblk_1kg_eur.tar.gz (from Dropbox)
3. SNP information: snpinfo_mult_1kg_hm3 (from Dropbox)
4. EAS population summary statistics for schizophrenia: EAS_sumstats.txt (from the test data in the PRS-CSx GitHub repository)
5. EUR population summary statistics for schizophrenia: EUR_sumstats.txt (from the test data in the PRS-CSx GitHub repository)

### Outputs
1. EAS posterior effect sizes (population-specific posterior SNP effect size estimates, one row per SNP): test_EAS_pst_eff_a1_b0.5_phi1e-02_chr22.txt
2. EUR posterior effect sizes (population-specific posterior SNP effect size estimates, one row per SNP): test_EUR_pst_eff_a1_b0.5_phi1e-02_chr22.txt

### Check that bash and curl are available:
Run the cell below. If either command fails, install the missing tool and rerun the cell.

In [5]:
!bash --version
!which bash
!curl --version

GNU bash, version 3.2.57(1)-release (arm64-apple-darwin23)
Copyright (C) 2007 Free Software Foundation, Inc.
/bin/bash
curl 8.9.1 (arm64-apple-darwin20.0.0) libcurl/8.9.1 OpenSSL/3.0.15 zlib/1.2.13 libssh2/1.11.0 nghttp2/1.57.0
Release-Date: 2024-07-31
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS GSS-API HSTS HTTP2 HTTPS-proxy IPv6 Kerberos Largefile libz NTLM SPNEGO SSL threadsafe TLS-SRP UnixSockets
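
The same check can be done from Python, which avoids reading through the version banners. The sketch below is only an illustration: the helper name `missing_tools` is hypothetical, and the default tool list (`bash`, `curl`, `git`, `tar`) is an assumption based on the commands that appear in `run_prscsx.sh`.

```python
import shutil


def missing_tools(tools=("bash", "curl", "git", "tar")):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print(f"Please install before continuing: {', '.join(missing)}")
else:
    print("All required command-line tools were found.")
```

An empty list means every tool was found; anything else names what still needs to be installed.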


### Import Python packages:

In [7]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import os
from IPython.display import Markdown

### Set the working directory

In [9]:
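# Set the working directory to the parent of the current directory (the folder the notebook is run from) and save it as a variable named "cwd"
# Note: re-running this cell moves up one more level each time, so run it only once per session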
cwd = os.path.dirname(os.getcwd())
os.chdir(cwd)

# Inspect the current working directory
print(f"Current working directory: {os.getcwd()}")

Current working directory: /Users/aliceyan/Documents/GitHub/prs-csx-workshop-tutorial-updated/run_use_evaluate_prscsx/run
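
Because `os.path.dirname(os.getcwd())` climbs one level every time the cell is executed, a re-run can leave the notebook in the wrong folder. One possible alternative, sketched below, searches upward for a marker folder instead. The helper name `find_project_root` is hypothetical, and using `scripts` as the marker is an assumption based on the `scripts/run_prscsx.sh` path used later in this part.

```python
import os
from pathlib import Path


def find_project_root(start, marker="scripts"):
    """Walk upward from `start` and return the first folder that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' folder")


# Typical use inside the notebook (left commented out so the sketch has no side effects):
# cwd = str(find_project_root(os.getcwd()))
# os.chdir(cwd)
```

Running this version repeatedly always resolves to the same folder, since the search stops at the first directory that contains the marker.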


### Inspect the bash script to run PRS-CSx

In [11]:
# Absolute path to the bash script
bash_script_path = os.path.join(cwd, "scripts/run_prscsx.sh")
print(f"Bash script path: {bash_script_path}")

# Read the script content
with open(bash_script_path, "r") as file:
    bash_script_content = file.read()

# Display the script as preformatted Markdown
Markdown(f"```bash\n{bash_script_content}\n```")

Bash script path: /Users/aliceyan/Documents/GitHub/prs-csx-workshop-tutorial-updated/run_use_evaluate_prscsx/run/scripts/run_prscsx.sh


```bash
#!/bin/bash

## Setting Up PRS-CSx 

# 0. Print the current working directory. It is expected to be the parent of the `scripts` folder that contains this script.
echo "Current working directory: $(realpath ./)"

# 1. Clone the PRS-CSx repository (skipped if a previous run already cloned it):
if [ ! -d ./PRScsx ]; then
    git clone https://github.com/getian107/PRScsx.git
fi

# 2. Create a sub-folder named `ref`. Download the LD reference panels to `ref` and extract files:
# For regions that don't have access to Dropbox, reference panels can be downloaded from the [alternative download site](https://personal.broadinstitute.org/hhuang/public/PRS-CSx/Reference/).
mkdir -p ./inputs/ref
echo "Downloading files..."
# Use dl=1 so that Dropbox serves the file itself rather than the HTML preview page.
EAS_REF_URL="https://www.dropbox.com/s/7ek4lwwf2b7f749/ldblk_1kg_eas.tar.gz?dl=1"
EUR_REF_URL="https://www.dropbox.com/s/mt6var0z96vb6fv/ldblk_1kg_eur.tar.gz?e=1&dl=1"
EAS_REF_DIR="./inputs/ref/ldblk_1kg_eas.tar.gz"
EUR_REF_DIR="./inputs/ref/ldblk_1kg_eur.tar.gz"
curl -L -o "$EAS_REF_DIR" "$EAS_REF_URL"
curl -L -o "$EUR_REF_DIR" "$EUR_REF_URL"
echo "Download completed: 1) EAS ref: $(realpath "$EAS_REF_DIR"), 2) EUR ref: $(realpath "$EUR_REF_DIR")"
echo "Extracting files..."
tar -zxvf "$EAS_REF_DIR" -C "./inputs/ref"
tar -zxvf "$EUR_REF_DIR" -C "./inputs/ref"
echo "Extraction completed."

# 3. Download the SNP information file and put it in the same folder as the reference panels (dl=1 requests the file itself rather than the preview page):
SNP_INFO_URL="https://www.dropbox.com/s/rhi806sstvppzzz/snpinfo_mult_1kg_hm3?dl=1"
SNP_INFO_DIR="./inputs/ref/snpinfo_mult_1kg_hm3"
curl -L -o "$SNP_INFO_DIR" "$SNP_INFO_URL"

# 4. PRScsx requires the Python packages `scipy` and `h5py`:
# Function to check for a Python package and install it if it is missing
check_and_install_package() {
    local PACKAGE="$1"
    if python -c "import $PACKAGE" &> /dev/null; then
        echo "$PACKAGE is already installed."
    else
        echo "$PACKAGE is not installed. Installing..."
        pip install "$PACKAGE"
        if [ $? -eq 0 ]; then
            echo "$PACKAGE installed successfully."
        else
            echo "Failed to install $PACKAGE. Please check your Python and pip setup."
            exit 1
        fi
    fi
}

# Ensure pip is available
if ! command -v pip &> /dev/null; then
    echo "pip is not installed. Please install pip and rerun this script."
    exit 1
fi

# Check and install scipy and h5py
check_and_install_package "scipy"
check_and_install_package "h5py"

# 5. Once Python and its dependencies have been installed, running the following will print a list of command-line options:
echo "Printing PRScsx command options..."
python ./PRScsx/PRScsx.py --help

## Using PRS-CSx with Test Data
# The test data contains EUR and EAS GWAS summary statistics and a bim file for 1,000 SNPs on chromosome 22.
# 1. Create a directory to store output:
mkdir -p outputs
OUTPUT_FOLDER_DIR=$(realpath "./outputs")
    
# 2. Run PRS-CSx:
echo "Running PRS-CSx on test data..."
python ./PRScsx/PRScsx.py \
    --ref_dir=./inputs/ref \
    --bim_prefix=./PRScsx/test_data/test \
    --sst_file=./PRScsx/test_data/EUR_sumstats.txt,./PRScsx/test_data/EAS_sumstats.txt \
    --n_gwas=200000,100000 \
    --pop=EUR,EAS \
    --chrom=22 \
    --phi=1e-2 \
    --out_dir=./outputs \
    --out_name=test
echo "PRS-CSx finished running: $OUTPUT_FOLDER_DIR"

```

### Explanation of the input of PRS-CSx
`ref_dir` --> PATH_TO_REFERENCE (required): Full path to the directory that contains the SNP information file and LD reference panels. If the 1000 Genomes reference is used, the folder would contain the SNP information file snpinfo_mult_1kg_hm3 and one or more of the LD reference files: ldblk_1kg_afr, ldblk_1kg_amr, ldblk_1kg_eas, ldblk_1kg_eur, ldblk_1kg_sas; if the UK Biobank reference is used, the folder would contain the SNP information file snpinfo_mult_ukbb_hm3 and one or more of the LD reference files: ldblk_ukbb_afr, ldblk_ukbb_amr, ldblk_ukbb_eas, ldblk_ukbb_eur, ldblk_ukbb_sas. (In our script, we used the 1000 Genomes reference.)

`bim_prefix` --> VALIDATION_BIM_PREFIX (required): Full path and the prefix of the bim file for the target (validation/testing) dataset. This file is used to provide a list of SNPs that are available in the target dataset.

`sst_file` --> SUM_STATS_FILE (required): Full path and the file name of the GWAS summary statistics. Multiple GWAS summary statistics files are allowed and should be separated by commas. The summary statistics file must include either BETA/OR + SE or BETA/OR + P. The formats are specified in the GitHub repository.

`n_gwas` --> GWAS_SAMPLE_SIZE (required): Sample sizes of the GWAS, in the same order of the GWAS summary statistics files, separated by comma.

`pop` --> POPULATION (required): Population of the GWAS sample, in the same order of the GWAS summary statistics files, separated by comma. For both the 1000 Genomes reference and the UK Biobank reference, AFR, AMR, EAS, EUR and SAS are allowed.

`chrom` --> CHROM (optional): The chromosome on which the model is fitted, separated by comma, e.g., --chrom=1,3,5. Parallel computation for the 22 autosomes is recommended. Default is iterating through 22 autosomes (can be time-consuming).

`phi` --> PARAM_PHI (optional): Global shrinkage parameter phi. If phi is not specified, it will be learnt from the data using a fully Bayesian approach. This usually works well for polygenic traits with very large GWAS sample sizes (hundreds of thousands of subjects). For GWAS with limited sample sizes (including most of the current disease GWAS), fixing phi to 1e-2 (for highly polygenic traits) or 1e-4 (for less polygenic traits), or doing a small-scale grid search (e.g., phi=1e-6, 1e-4, 1e-2, 1) to find the optimal phi value in the validation dataset often improves predictive performance.

`out_dir` --> OUTPUT_DIR (required): Output directory of the posterior effect size estimates.

`out_name` --> OUTPUT_FILE_PREFIX (required): Output filename prefix of the posterior effect size estimates.
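
The flags above can also be assembled programmatically, which helps when running the small-scale grid search over phi. The sketch below only builds and prints the command lines; it does not run PRS-CSx. The helper `build_prscsx_command` is hypothetical, the paths are the ones used in `run_prscsx.sh`, and passing phi as a string such as `"1e-6"` is an assumption about how you would want it to appear on the command line. The length check reflects the requirement that `sst_file`, `n_gwas` and `pop` are listed in the same order.

```python
def build_prscsx_command(ref_dir, bim_prefix, sst_files, n_gwas, pops,
                         out_dir, out_name, chrom=None, phi=None,
                         script="./PRScsx/PRScsx.py"):
    """Build the argument list for one PRS-CSx run (nothing is executed here)."""
    # sst_file, n_gwas and pop must describe the same GWAS in the same order.
    if not (len(sst_files) == len(n_gwas) == len(pops)):
        raise ValueError("sst_files, n_gwas and pops must have the same length")
    cmd = [
        "python", script,
        f"--ref_dir={ref_dir}",
        f"--bim_prefix={bim_prefix}",
        f"--sst_file={','.join(sst_files)}",
        f"--n_gwas={','.join(str(n) for n in n_gwas)}",
        f"--pop={','.join(pops)}",
        f"--out_dir={out_dir}",
        f"--out_name={out_name}",
    ]
    if chrom is not None:  # optional: restrict the run to these chromosomes
        cmd.append(f"--chrom={','.join(str(c) for c in chrom)}")
    if phi is not None:    # optional: fix the global shrinkage parameter
        cmd.append(f"--phi={phi}")
    return cmd


# Grid of phi values suggested in the explanation of `phi` above.
phi_grid = ["1e-6", "1e-4", "1e-2", "1"]
commands = [
    build_prscsx_command(
        ref_dir="./inputs/ref",
        bim_prefix="./PRScsx/test_data/test",
        sst_files=["./PRScsx/test_data/EUR_sumstats.txt",
                   "./PRScsx/test_data/EAS_sumstats.txt"],
        n_gwas=[200000, 100000],
        pops=["EUR", "EAS"],
        out_dir="./outputs",
        out_name="test",
        chrom=[22],
        phi=phi,
    )
    for phi in phi_grid
]
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the reference panels are in place. Keeping the commands as lists rather than one long string avoids shell-quoting problems with paths that contain spaces.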

More optional flags are explained in the GitHub repository.

Reference: https://github.com/getian107/PRScsx

### Run the bash script to run PRS-CSx

In [14]:
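print(bash_script_path)
# Write the script content back to disk (this leaves the file unchanged unless bash_script_content was edited above)
with open(bash_script_path, "w") as file:
    file.write(bash_script_content)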

# Make the script executable
!chmod +x ./scripts/run_prscsx.sh

# Run the script
!./scripts/run_prscsx.sh

/Users/aliceyan/Documents/GitHub/prs-csx-workshop-tutorial-updated/run_use_evaluate_prscsx/run/scripts/run_prscsx.sh
Current working directory: /Users/aliceyan/Documents/GitHub/prs-csx-workshop-tutorial-updated/run_use_evaluate_prscsx/run
Cloning into 'PRScsx'...
remote: Enumerating objects: 276, done.
remote: Counting objects: 100% (82/82), done.
remote: Compressing objects: 100% (72/72), done.
remote: Total 276 (delta 48), reused 22 (delta 10), pack-reused 194 (from 1)
Receiving objects: 100% (276/276), 99.90 KiB | 4.76 MiB/s, done.
Resolving deltas: 100% (166/166), done.
Downloading files...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   137  100   137    0     0    722      0 --:--:-- --:--:-- --:--:--   724
100    17  100    17    0     0     11      0  0:00:01  0:00:01 --:--:--    74
100   496    0   496    0     0    224      0 --:--:--  0:00:02 --:--:

### Explanation of the output of PRS-CSx
*For each input GWAS, PRS-CSx writes `posterior SNP effect size estimates for each chromosome` to the user-specified directory. The output file contains `chromosome, rs ID, base position, A1, A2 and posterior effect size estimate for each SNP`.*
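
As a rough illustration of that layout, the sketch below loads a file with those six columns into pandas. Several things here are assumptions rather than facts from the PRS-CSx documentation: the column names (`CHR`, `SNP`, `BP`, `A1`, `A2`, `BETA`) are labels chosen for this example, the file is assumed to be whitespace-delimited with no header row, and the two rows are made-up toy values, not real PRS-CSx output.

```python
import io

import pandas as pd

# Column labels chosen for this example; they follow the order described above.
columns = ["CHR", "SNP", "BP", "A1", "A2", "BETA"]

# Two made-up rows standing in for a real output file such as
# ./outputs/test_EUR_pst_eff_a1_b0.5_phi1e-02_chr22.txt
toy_output = (
    "22\trs0000001\t17000000\tA\tG\t1.500000e-04\n"
    "22\trs0000002\t17000500\tC\tT\t-2.000000e-04\n"
)

weights = pd.read_csv(
    io.StringIO(toy_output),  # replace with the path to a real output file
    sep=r"\s+",
    header=None,
    names=columns,
)
print(weights)
```

With a real file, it is worth checking `weights.shape` and `weights.head()` first to confirm that the assumed delimiter and column order match what PRS-CSx actually wrote.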

*An individual-level polygenic score can be produced by concatenating output files from all chromosomes and then using PLINK's --score command (https://www.cog-genomics.org/plink/1.9/score). If polygenic scores are generated by chromosome, use the 'sum' modifier so that they can be combined into a genome-wide score.* In our tutorial, we are going to demonstrate this calculation process. We will multiply the posterior SNP effect sizes for chromosome 22 by the individual's genotype matrix to compute the PRS.
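
The multiplication described above can be sketched with NumPy. The genotype matrix and effect sizes below are toy values invented for illustration, and the sketch assumes each genotype entry counts copies of the effect allele (A1) for the SNP in the matching column.

```python
import numpy as np

# Toy genotype matrix: 4 individuals (rows) x 3 SNPs (columns),
# each entry is the number of copies of the effect allele A1 (0, 1 or 2).
genotypes = np.array([
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [0, 0, 0],
], dtype=float)

# Toy posterior effect sizes, one per SNP, in the same column order.
effects = np.array([0.10, -0.20, 0.05])

# PRS for each individual = sum over SNPs of (allele count x effect size).
prs = genotypes @ effects
print(prs)
```

The result has one score per individual. Scores computed separately for each chromosome can be added together in the same way to obtain a genome-wide score, which is what the 'sum' modifier mentioned above is for.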

### How to use PRS-CSx?
*Given a global shrinkage parameter, the first approach calculates `one polygenic score for each discovery population` using `population-specific posterior SNP effect size estimates` and learns a `linear combination of the polygenic scores` that most accurately predicts the trait in the validation dataset. The optimal global shrinkage parameter and linear combination weights are then taken to an independent dataset, where the predictive performance of the final PRS can be assessed.* The second approach is explained in detail in the GitHub repository.

Reference: https://github.com/getian107/PRScsx
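
A minimal sketch of the first approach is shown below, using simulated scores rather than real data. Everything here is an assumption made for illustration: the simulated PRS values, the true weights (0.6 and 0.3), the 100/100 split, and the use of ordinary least squares via `np.linalg.lstsq` (scikit-learn's `LinearRegression`, imported earlier in this notebook, would fit the same model).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated population-specific scores and a trait that depends on both.
prs_eur = rng.normal(size=n)
prs_eas = rng.normal(size=n)
phenotype = 0.6 * prs_eur + 0.3 * prs_eas + rng.normal(scale=0.1, size=n)

# Split into a validation set (to learn the weights) and a testing set.
val, test = slice(0, 100), slice(100, None)


def design(idx):
    """Design matrix with an intercept column and the two PRS columns."""
    return np.column_stack([np.ones(len(prs_eur[idx])), prs_eur[idx], prs_eas[idx]])


# Learn the linear combination weights in the validation set.
coef, *_ = np.linalg.lstsq(design(val), phenotype[val], rcond=None)

# Apply the learned weights to the independent testing set and compute R^2.
pred = design(test) @ coef
y = phenotype[test]
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"Weights (intercept, EUR, EAS): {coef}")
print(f"R^2 in the testing set: {r2:.3f}")
```

In a real analysis this fit would be repeated for each candidate phi, and the phi with the best validation performance would be carried forward together with its weights.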

We will demonstrate the usage of PRS-CSx in the following part.