# How to calculate all PGS Catalog polygenic scores for an individual

>  A polygenic score (PGS) aggregates the effects of many genetic variants into a single number which predicts genetic predisposition for a phenotype. PGS are typically composed of hundreds-to-millions of genetic variants (usually SNPs) which are combined using a weighted sum of allele dosages multiplied by their corresponding effect sizes, as estimated from a relevant genome-wide association study (GWAS). [Source: https://www.pgscatalog.org/about/#score]

This notebook outlines the steps to get an individual's polygenic score values for all traits reported in the [PGS Catalog](https://www.pgscatalog.org/).

If the vcf file doesn't list a variant that is part of a polygenic score, the individual is assumed to carry 0 effect alleles for that variant. The polygenic score calculation therefore relies on the correctness of the incoming vcf file.
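To make the weighted sum and the "missing means 0 effect alleles" assumption concrete, here is a minimal sketch. It is not the implementation used by `search_your_dna`; the function name, the rsIDs, the weights and the dosages are all hypothetical and only illustrate the arithmetic.

```python
from typing import Mapping


def weighted_sum_score(effect_weights: Mapping[str, float],
                       dosages: Mapping[str, int]) -> float:
    """Sum effect_weight * effect-allele dosage over all variants in the score.

    Variants that are absent from `dosages` (i.e. not listed in the vcf)
    are treated as dosage 0, mirroring the assumption described above.
    """
    return sum(weight * dosages.get(rsid, 0)
               for rsid, weight in effect_weights.items())


# Hypothetical score definition: rsID -> effect weight.
example_weights = {"rs0000001": 0.12, "rs0000002": -0.05, "rs0000003": 0.30}
# Hypothetical genotype: rsID -> number of effect alleles (0, 1 or 2).
# rs0000003 is missing, so it contributes nothing to the score.
example_dosages = {"rs0000001": 2, "rs0000002": 1}

example_score = weighted_sum_score(example_weights, example_dosages)
print(round(example_score, 2))
```

A consequence of this convention is that a vcf with poor coverage silently lowers (or raises, for negative weights) the score instead of raising an error.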


*Current shortcomings:*
1) Interpretation: there are different methods for calculating polygenic scores, and this notebook doesn't cover how to interpret the scores. Users need to do that themselves.
2) Values in the polygenic score file rsid column that are not true rsid values in the format `rs12345` are not handled:
  * values left empty by the PGS study
  * other nomenclature (like SNP_DQA1_32717072 and AA_DQB1_57_32740666_AS)
  * values that are not variants relative to the reference genome (not sure about this)
3) For chrom-pos based polygenic score files, only hg19 is handled (at the time of writing, 2021/02, there is only a single score file created with hg38: `PGS000325`)
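As an illustration of the second shortcoming, the sketch below shows one way to separate true rsid values from the other kinds of entries. The helper name `is_rsid` and the pattern are assumptions made for this example; the library may filter differently.

```python
import re

# Assumed definition of a "true" rsid: the prefix "rs" followed by digits only.
RSID_PATTERN = re.compile(r"rs\d+")


def is_rsid(value) -> bool:
    """Return True only for strings such as 'rs12345'."""
    return isinstance(value, str) and RSID_PATTERN.fullmatch(value) is not None


# Empty values and other nomenclature are rejected.
candidate_values = ["rs12345", "", None, "SNP_DQA1_32717072", "AA_DQB1_57_32740666_AS"]
usable_rsids = [value for value in candidate_values if is_rsid(value)]
print(usable_rsids)
```

Variants rejected this way contribute nothing to the score, so a score file with many such entries is underestimated in the same way as a vcf with missing variants.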

In [None]:
%reload_ext autoreload
%autoreload 2

import pandas as pd
from pathlib import Path

from search_your_dna.pgscatalog import calc_all_polygenic_scores_parallel, \
    get_pgs_metadata, get_all_pgs_api_data, do_calc_polygenic_score

## Inputs

In [None]:
root_path = Path("data/")
root_path.mkdir(parents=True, exist_ok=True)
input_path = root_path
input_path.mkdir(parents=True, exist_ok=True)

# individual's genome variant call file. vcf file *must* be annotated with rsids
vcf_file = input_path / "GENOME12345.GRCh38.p7.rsid_annotated.vcf"
vcf_rsid_index_file = vcf_file.parent / f"{vcf_file.name}.rsid"
vcf_compressed_file = vcf_file.parent / f"{vcf_file.name}.gz"
vcf_chrom_pos_index_file = vcf_file.parent / f"{vcf_file.name}.gz.tbi"

# metadata
hg19_ref_genome_dir = input_path / "hg19"
hg19_rsid_chrom_pos_mapping_file = hg19_ref_genome_dir / "hg19_avsnp150.txt.gz"
hg19_rsid_chrom_pos_mapping_file_index = hg19_ref_genome_dir / "hg19_avsnp150.txt.gz.tbi"

# other inputs
num_parallel_processes = 7 # num of cores to use to calculate pgs
max_pgs_alleles = 100_000 # won't calculate pgs score for file with more than `max_pgs_alleles`
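The `max_pgs_alleles` limit can be pictured as a simple row count on a score file. The sketch below is only an assumption about how such a check could work, not the library's actual logic; the helper names and the sample file content are hypothetical. It relies on PGS Catalog scoring files starting with `#` header lines followed by one column-header row.

```python
from typing import Iterable


def count_score_file_variants(lines: Iterable[str]) -> int:
    """Count variant rows: skip blank lines, '#' header lines and the column-header row."""
    data_lines = [line for line in lines
                  if line.strip() and not line.startswith("#")]
    return max(len(data_lines) - 1, 0)  # subtract the column-header row


def exceeds_allele_limit(lines: Iterable[str], limit: int) -> bool:
    """Return True when the score file has more variant rows than `limit`."""
    return count_score_file_variants(lines) > limit


# Hypothetical, heavily shortened score file content.
sample_score_file = """#pgs_id=PGS_EXAMPLE
rsID\teffect_allele\teffect_weight
rs0000001\tA\t0.12
rs0000002\tG\t-0.05
"""
sample_lines = sample_score_file.splitlines()
print(count_score_file_variants(sample_lines))
print(exceeds_allele_limit(sample_lines, limit=1))
```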

### How to get snp metadata file for mapping hg19 chrom-pos to rsids

#### Download hg19 snp metadata file with annovar

Download directly from [annovar's server](http://www.openbioinformatics.org/annovar/download/hg19_avsnp150.txt.gz) or use annovar:

In [None]:
# replace path /path/to/hg19_ref_genome_dir
!/path/to/annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp150 /path/to/hg19_ref_genome_dir

#### Create tabix index

*tabix and bgzip are part of HTSlib, which is typically installed together with samtools*

In [None]:
# replace path /path/to/hg19_ref_genome_dir
!bgzip -c /path/to/hg19_ref_genome_dir/hg19_avsnp150.txt > /path/to/hg19_rsid_chrom_pos_mapping_file
!tabix --begin 2 --end 3 --sequence 1 /path/to/hg19_rsid_chrom_pos_mapping_file
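The tabix options above say that column 1 holds the chromosome and columns 2 and 3 hold the start and end position. Assuming the remaining columns are ref, alt and rsid (an assumption about the avsnp150 layout that is worth checking against the downloaded file), a chrom-pos to rsid lookup can be sketched as below. The function name and the sample rows are hypothetical, and an in-memory dict is only practical for a small slice of the file; the tabix index exists precisely so the full file does not have to be loaded.

```python
from typing import Dict, Iterable, Tuple


def build_chrom_pos_to_rsid(lines: Iterable[str]) -> Dict[Tuple[str, int], str]:
    """Map (chrom, start) to rsid for tab-separated rows.

    Assumed column order: chrom, start, end, ref, alt, rsid.
    """
    lookup: Dict[Tuple[str, int], str] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip rows that are too short or have a non-numeric start position.
        if len(fields) < 6 or not fields[1].isdigit():
            continue
        chrom, start, rsid = fields[0], int(fields[1]), fields[5]
        # Keep the first rsid seen for a position.
        lookup.setdefault((chrom, start), rsid)
    return lookup


# Hypothetical rows in the assumed layout.
sample_rows = [
    "1\t1000\t1000\tA\tG\trs0000001",
    "1\t2000\t2001\tTA\tT\trs0000002",
    "malformed row",
]
chrom_pos_lookup = build_chrom_pos_to_rsid(sample_rows)
print(chrom_pos_lookup.get(("1", 1000)))
```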

## Outputs

In [None]:
output_dir = root_path / "5_how_to_calculate_individuals_all_pgs_catalog_polygenic_scores"

pgs_results_summary_csv = output_dir / "pgs_results.csv" # summary polygenic score values with metadata
pgs_catalog_input_cache_dir = output_dir / "pgs" # directory containing score file with metadata from PGS catalog API
pgs_results_cache_dir = output_dir / "pgs_results" # directory containing individual's pgs analysis results

## Get all PGS catalog score info from their API

In [None]:
pgs_catalog_score_metadata = get_all_pgs_api_data("score/all", cache_dir=pgs_catalog_input_cache_dir)

## Calculating polygenic scores for all scores found in the PGS Catalog

In [None]:
%%time

pgs_ids = sorted(item["id"] for item in pgs_catalog_score_metadata)

# NOTE: fetching polygenic score files from the PGS Catalog sleeps intentionally, because the API has a 100 req/min limit

all_pgs_scores = calc_all_polygenic_scores_parallel(
    pgs_ids=pgs_ids,
    vcf_file=vcf_file,
    num_parallel_processes=num_parallel_processes,
    max_pgs_alleles=max_pgs_alleles,
    hg19_rsid_chrom_pos_mapping_file=hg19_rsid_chrom_pos_mapping_file,
    pgs_catalog_input_cache_dir = pgs_catalog_input_cache_dir,
    pgs_result_cache_dir = pgs_results_cache_dir,
)
all_pgs_scores
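The intentional sleep mentioned in the note can be pictured with the following sketch. It is not the code used inside `calc_all_polygenic_scores_parallel`; `fetch_throttled`, its parameters and the stand-in fetch function are hypothetical. The idea is simply to wait `60 / max_per_minute` seconds between consecutive requests so that the 100 req/min limit is never exceeded.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def fetch_throttled(ids: Iterable[str],
                    fetch: Callable[[str], T],
                    max_per_minute: int = 100,
                    sleep: Callable[[float], None] = time.sleep) -> List[T]:
    """Call `fetch` for every id, pausing between calls to respect a rate limit."""
    min_interval = 60.0 / max_per_minute  # seconds between two requests
    results: List[T] = []
    for position, item_id in enumerate(ids):
        if position > 0:
            sleep(min_interval)  # no pause is needed before the first request
        results.append(fetch(item_id))
    return results


# Demonstration with a stand-in fetch function and a recorded (not real) sleep.
recorded_pauses: List[float] = []
demo_results = fetch_throttled(
    ["pgs_a", "pgs_b", "pgs_c"],
    fetch=str.upper,
    sleep=recorded_pauses.append,
)
print(demo_results, len(recorded_pauses))
```

Passing `sleep` as a parameter keeps the sketch testable without actually waiting; with the default `time.sleep`, 100 ids take roughly one minute.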

## Attach polygenic score metadata

In [None]:
errors = all_pgs_scores.loc[all_pgs_scores["error"].notna(), ["pgs_id", "error"]].set_index("pgs_id")
errors

In [None]:
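# DataFrame.append was removed in pandas 2.0, so collect the records first and build the frame once.
# Assumes get_pgs_metadata returns one dict-like record per score.
pgs_metadata_records = [get_pgs_metadata(pgs_id, pgs_catalog_input_cache_dir) for pgs_id in pgs_ids]
pgs_metadata_df = pd.DataFrame(pgs_metadata_records, columns=["pgs_id", "trait", "method_categorized", "method", "ancestry"])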

result_df = all_pgs_scores[["pgs_id", "score"]].set_index("pgs_id").join(pgs_metadata_df.set_index("pgs_id"), on="pgs_id")
result_df = result_df.join(errors, on="pgs_id")

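# pgs_id is the index of result_df, so the index is written to keep it in the output.
# Note that the file is tab-separated despite its .csv extension.
result_df.to_csv(pgs_results_summary_csv, index=True, sep="\t")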

## Calculating a single score

### Example of calculating a score that has rsids in the PGS score file

In [None]:
%%time

pgs_325_id = "PGS000325"
pgs_325_score = do_calc_polygenic_score(
    vcf_file=vcf_file,
    pgs_id=pgs_325_id,
    hg19_rsid_chrom_pos_mapping_file=hg19_rsid_chrom_pos_mapping_file,
    max_pgs_alleles=200,
    pgs_catalog_input_cache_dir=pgs_catalog_input_cache_dir,
    pgs_result_cache_dir=pgs_results_cache_dir
)
pgs_325_score


### Example of calculating a score that has only chrom/pos values in the PGS score file

In [None]:
%%time

pgs_004_id = "PGS000004"
pgs_004_score = do_calc_polygenic_score(
    vcf_file=vcf_file,
    pgs_id=pgs_004_id,
    hg19_rsid_chrom_pos_mapping_file=hg19_rsid_chrom_pos_mapping_file,
    max_pgs_alleles=2000,
    pgs_catalog_input_cache_dir=pgs_catalog_input_cache_dir,
    pgs_result_cache_dir=pgs_results_cache_dir
)
pgs_004_score