## Aim: 
The GWAS summary statistics were originally based on the hg19 reference genome, whereas our current LD reference panel is hg38-based. So that variant positions match the LD panel during fine-mapping, we converted the GWAS summary statistics to hg38 using UCSC LiftOver.
## Input:
* LiftOver tool: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver
* hg19 → hg38 chain file: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
* GWAS summary statistics:
  1. Unadjusted walking pace GWAS: https://drive.google.com/drive/folders/1H1Xj33C-867dxVHOFIh5l_nLluWcnqzx

* convert.sh: script that runs LiftOver to convert hg19 coordinates to hg38.
* file_path.txt: file listing the paths to the hg19 GWAS summary statistics in .bed format.
## Output:
* hg38-based GWAS summary statistics: `s3://statfungen/ftp_fgc_xqtl/GWAS/image_GWAS_hg38/`

## Procedures:
1. Format GWAS Summary Statistics for Conversion

Convert the hg19-based GWAS summary statistics into a four-column .bed file:

* chrom: chromosome name with the `chr` prefix (e.g. `chr5`)
* start: variant position
* end: variant position (same value as start)
* region_id: variant ID, used to merge the converted positions back after conversion
```
chrom	start	end	region_id
<chr>	<int>	<int>	<chr>
chr5	29439275	29439275	rs667647
chr5	85928892	85928892	rs113534962
```
**Note: LiftOver does not support .bed files with more than six columns. Since AD/aging image GWAS summary statistics share the same variant positions across multiple dimensions, we perform the conversion once for both datasets.**

2. Perform LiftOver Conversion
Use the `convert.sh` script to run LiftOver and map hg19 coordinates to hg38.

**Note: The conversion maps some variants to additional contigs and scaffolds, which represent alternative loci or regions that are difficult to place within the main chromosomes, e.g. `chr14_GL000009v2_random` and `chr19_KI270938v1_alt`. These are hard to interpret, so they were removed and only chr1-22 were kept.**
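
Since `convert.sh` itself is not reproduced here, the following is only a sketch of what such a wrapper might look like. It assumes that `file_path.txt` holds one .bed path per line, that the LiftOver binary and chain file from the Input section sit in the working directory, and that unmapped records are written to a `.unmapped.bed` file (a hypothetical name). The output name follows the `*.to_hg38.bed` pattern used later in this notebook.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a convert.sh-style wrapper; the real script may differ.
LIFTOVER="${LIFTOVER:-./liftOver}"               # UCSC liftOver binary (assumed location)
CHAIN="${CHAIN:-./hg19ToHg38.over.chain.gz}"     # hg19 -> hg38 chain file (assumed location)

lift_all() {
    # $1: text file listing one hg19 .bed path per line (e.g. file_path.txt)
    local list="$1" bed
    while IFS= read -r bed || [ -n "$bed" ]; do
        [ -z "$bed" ] && continue                # skip blank lines
        # liftOver usage: liftOver oldFile map.chain newFile unMapped
        "$LIFTOVER" "$bed" "$CHAIN" \
            "${bed%.bed}.to_hg38.bed" \
            "${bed%.bed}.unmapped.bed" < /dev/null || return 1
    done < "$list"
}

# Example: lift_all file_path.txt
```

Keeping the unmapped file is what makes it possible to count the variants that LiftOver could not place.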

3. Merge Back to GWAS Summary Statistics
Load the hg38 .bed file and merge it back with the original GWAS summary statistics, preserving all necessary information.

## Simple summary for the conversion

| Studies | before_conversion(original) | unmapped | after_conversion(final) | overall_dropped | proportion_dropped |
|---------|----------------------------|----------|------------------------|-----------------|-------------------|
| unadjusted walking pace GWAS | 11,335,563 | 3,939 | 11,331,624 | 5,999 | 0.0529% |


In [22]:
5999/11335563
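
The cell above computes the proportion_dropped entry of the summary table. As a small sketch (the helper name `pct_dropped` is hypothetical), the same arithmetic can be reproduced in the shell, together with a consistency check of the unmapped and after_conversion columns:

```shell
# Hypothetical helper: percentage of variants dropped,
# given the number dropped and the original number of variants.
pct_dropped() {
    awk -v dropped="$1" -v total="$2" 'BEGIN { printf "%.4f%%\n", 100 * dropped / total }'
}

pct_dropped 5999 11335563      # prints 0.0529%
echo $((11335563 - 3939))      # original minus unmapped: prints 11331624
```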

In [2]:
library(data.table)
library(tidyverse)

In [3]:
options(scipen = 999)


In [4]:
unadjusted_GWAS = fread("~/rl3328/motor_qtl/GWAS/wp_ukb.txt.gz")
head(unadjusted_GWAS)
dim(unadjusted_GWAS)

SNP,CHR,BP,EFFECT.ALLELE,ALT.ALLELE,EAF,INFO,BETA,SE,P
<chr>,<int>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1:612688_TCTC_T,1,612688,TCTC,T,0.979815,0.617496,0.00101014,0.00584836,0.9
rs545945172,1,636285,T,C,0.903972,0.721502,-0.000918819,0.00247797,0.71
rs201942322,1,649192,A,T,0.882575,0.743729,-0.00034383,0.00223761,0.87
rs371628865,1,662414,C,T,0.864997,0.694792,-0.000533999,0.00219459,0.81
rs61769339,1,662622,G,A,0.889532,0.777266,-0.000191469,0.0022453,0.92
rs539032812,1,665266,T,C,0.981031,0.670456,0.00411049,0.00558377,0.45


In [9]:
sum(grepl("[eE]", unadjusted_GWAS$BP))

In [5]:
unadjusted_GWAS_needed = unadjusted_GWAS |> mutate(chrom = paste0("chr",CHR), start = BP, end = BP, region_id = SNP) |> select(chrom, start, end, region_id)

In [6]:
head(unadjusted_GWAS_needed)
dim(unadjusted_GWAS_needed)

chrom,start,end,region_id
<chr>,<dbl>,<dbl>,<chr>
chr1,612688,612688,1:612688_TCTC_T
chr1,636285,636285,rs545945172
chr1,649192,649192,rs201942322
chr1,662414,662414,rs371628865
chr1,662622,662622,rs61769339
chr1,665266,665266,rs539032812


In [7]:
fwrite(unadjusted_GWAS_needed,"/mnt/lustre/home/rl3328/rl3328/motor_qtl/GWAS/unadjusted_GWAS_hg19.bed", sep = '\t',col.names=FALSE)
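
Because `BP` is read as a double, positions could in principle be written in scientific notation, which is why `scipen` is raised and `BP` is checked for `e`/`E` above. A quick check of the written file could look like the sketch below; the helper name `count_bad_bed_rows` is hypothetical, and it only tests that each row has four tab-separated fields with plain-integer start and end values.

```shell
# Hypothetical helper: count rows that do not have exactly 4 tab-separated
# fields, or whose start/end are not plain integers (e.g. 1e+06).
count_bad_bed_rows() {
    awk -F'\t' 'NF != 4 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { bad++ }
                END { print bad + 0 }' "$1"
}

# Example (path taken from the fwrite() call above):
# count_bad_bed_rows /mnt/lustre/home/rl3328/rl3328/motor_qtl/GWAS/unadjusted_GWAS_hg19.bed
```

A result of 0 suggests the file is in the shape the conversion step expects.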

## Read in the hg38 .bed file, reduce it to three columns (chrom, pos, id), and merge it back into the original summary statistics

In [11]:
unadjusted_GWAS_hg38 = fread("/mnt/lustre/home/rl3328/rl3328/motor_qtl/GWAS/unadjusted_GWAS_hg19.to_hg38.bed")

In [12]:
head(unadjusted_GWAS_hg38)


V1,V2,V3,V4
<chr>,<int>,<int>,<chr>
chr1,677308,677308,1:612688_TCTC_T
chr1,700905,700905,rs545945172
chr1,713812,713812,rs201942322
chr1,727034,727034,rs371628865
chr1,727242,727242,rs61769339
chr1,729886,729886,rs539032812


In [13]:
dim(unadjusted_GWAS_hg38)

In [14]:
unadjusted_GWAS_hg38 = unadjusted_GWAS_hg38[,-3]

In [15]:
colnames(unadjusted_GWAS_hg38) <- c("chr","pos","rsid")

In [16]:
unadjusted_GWAS_hg38 = unadjusted_GWAS_hg38 |> mutate(chr = gsub("chr", "", chr))

In [17]:
unadjusted_GWAS_hg38 = unadjusted_GWAS_hg38 |> mutate(chr=as.integer(chr))

ℹ In argument: `chr = as.integer(chr)`.
! NAs introduced by coercion


In [18]:
unadjusted_GWAS_hg38 = unadjusted_GWAS_hg38 |> filter(!is.na(chr))
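
The autosome filter above relies on `as.integer()` turning names such as `X` or `14_GL000009v2_random` into `NA`. An alternative, shown here only as a sketch with a hypothetical helper name, is to select chr1-22 explicitly from the lifted .bed file before loading it:

```shell
# Hypothetical helper: keep only rows whose first column is chr1 ... chr22.
keep_autosomes() {
    awk -F'\t' '$1 ~ /^chr([1-9]|1[0-9]|2[0-2])$/' "$1"
}

# Example (input path taken from the fread() call above; output name is illustrative):
# keep_autosomes unadjusted_GWAS_hg19.to_hg38.bed > autosomes_only.bed
```

The explicit pattern makes the kept set visible in the code rather than implied by a coercion warning.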

In [19]:
head(unadjusted_GWAS_hg38)
dim(unadjusted_GWAS_hg38)

chr,pos,rsid
<int>,<int>,<chr>
1,677308,1:612688_TCTC_T
1,700905,rs545945172
1,713812,rs201942322
1,727034,rs371628865
1,727242,rs61769339
1,729886,rs539032812


In [20]:
unique(unadjusted_GWAS_hg38$chr)

In [23]:
unadjusted_GWAS_remain = unadjusted_GWAS |> select(-CHR, -BP)
unadjusted_GWAS_hg38_final = unadjusted_GWAS_remain |> inner_join(unadjusted_GWAS_hg38, by = c('SNP' ='rsid'))

Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 2397 of `x` matches multiple rows in `y`.
ℹ Row 2397 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =


In [24]:
head(unadjusted_GWAS_hg38_final)
dim(unadjusted_GWAS_hg38_final)

SNP,EFFECT.ALLELE,ALT.ALLELE,EAF,INFO,BETA,SE,P,chr,pos
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>
1:612688_TCTC_T,TCTC,T,0.979815,0.617496,0.00101014,0.00584836,0.9,1,677308
rs545945172,T,C,0.903972,0.721502,-0.000918819,0.00247797,0.71,1,700905
rs201942322,A,T,0.882575,0.743729,-0.00034383,0.00223761,0.87,1,713812
rs371628865,C,T,0.864997,0.694792,-0.000533999,0.00219459,0.81,1,727034
rs61769339,G,A,0.889532,0.777266,-0.000191469,0.0022453,0.92,1,727242
rs539032812,T,C,0.981031,0.670456,0.00411049,0.00558377,0.45,1,729886
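
The many-to-many warning raised by the join indicates that at least one variant ID occurs more than once in both tables, in which case `inner_join()` returns every pairing of the matching rows. To see which IDs are affected, the ID column of the lifted .bed file can be inspected. This is a sketch with a hypothetical helper name, assuming the ID is in the fourth tab-separated column:

```shell
# Hypothetical helper: list variant IDs (column 4) that occur more than once.
list_duplicate_ids() {
    cut -f4 "$1" | sort | uniq -d
}

# Example (path taken from the fread() call above):
# list_duplicate_ids /mnt/lustre/home/rl3328/rl3328/motor_qtl/GWAS/unadjusted_GWAS_hg19.to_hg38.bed | wc -l
```

If the list is non-empty, joining on the ID alone cannot tell the repeated records apart, so the merged table may contain extra rows for those IDs.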


In [29]:
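# Rename the allele and chromosome columns; the sorting step below uses chrom, A1 and A2
unadjusted_GWAS_hg38_final = unadjusted_GWAS_hg38_final |> rename(A1 = EFFECT.ALLELE, A2 = ALT.ALLELE, chrom = chr)
# Approximate the per-variant sample size from the allele frequency and standard error
unadjusted_GWAS_hg38_final = unadjusted_GWAS_hg38_final |> mutate(n_sample = 1 / (2 * EAF * (1 - EAF) * SE^2)) |> rename(beta = BETA, se = SE, p = P, effect_allele_frequency = EAF)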
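
The `n_sample` expression is consistent with the usual large-sample approximation for the standard error of an additive effect estimate. The notebook does not state its rationale, so the reading below is an assumption: with effect allele frequency $p$, sample size $N$, and phenotype variance $\sigma_y^2$ (taken to be 1, i.e. a standardized trait),

$$
\mathrm{SE}(\hat\beta)^2 \approx \frac{\sigma_y^2}{2 N p (1 - p)}
\quad\Longrightarrow\quad
N \approx \frac{1}{2\, p\, (1 - p)\, \mathrm{SE}(\hat\beta)^2} \quad (\sigma_y^2 = 1).
$$

For rs545945172, $p = 0.903972$ and $\mathrm{SE} = 0.00247797$ give $2p(1-p) \approx 0.17361$ and $\mathrm{SE}^2 \approx 6.1403 \times 10^{-6}$, so $N \approx 1 / (1.0660 \times 10^{-6}) \approx 9.38 \times 10^{5}$, in line with the `n_sample` value printed for this variant below.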

In [26]:
unadjusted_GWAS_hg38_final = unadjusted_GWAS_hg38_final |> arrange(chrom, pos) |> select(chrom, pos, A1, A2, everything())

In [30]:
head(unadjusted_GWAS_hg38_final)

chrom,pos,A1,A2,SNP,effect_allele_frequency,INFO,beta,se,p,n_sample
<int>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,97269,T,TAC,rs199687009,0.904921,0.847312,0.000316765,0.00224748,0.88,1150490.6
1,677308,TCTC,T,1:612688_TCTC_T,0.979815,0.617496,0.00101014,0.00584836,0.9,739143.9
1,700905,T,C,rs545945172,0.903972,0.721502,-0.000918819,0.00247797,0.71,938048.0
1,713812,A,T,rs201942322,0.882575,0.743729,-0.00034383,0.00223761,0.87,963582.8
1,727034,C,T,rs371628865,0.864997,0.694792,-0.000533999,0.00219459,0.81,889007.0
1,727242,G,A,rs61769339,0.889532,0.777266,-0.000191469,0.0022453,0.92,1009306.8


In [31]:
fwrite(unadjusted_GWAS_hg38_final, "/mnt/lustre/home/rl3328/rl3328/motor_qtl/GWAS/unadjusted_wp_ukb.sumstats_hg38.tsv.gz", sep = '\t')

In [None]:
# Prefix the header line with '#', sort the records by chromosome and position, bgzip-compress the result, then index it with tabix
(zcat /mnt/lustre/home/rl3328/rl3328/motor_qtl/GWAS/unadjusted_wp_ukb.sumstats_hg38.tsv.gz | head -1 | sed 's/^/#/'; \
 zcat /mnt/lustre/home/rl3328/rl3328/motor_qtl/GWAS/unadjusted_wp_ukb.sumstats_hg38.tsv.gz | tail -n +2 | \
 awk 'BEGIN{OFS="\t"} {$2=int($2); print}' | sort -k1,1V -k2,2n) | \
bgzip > /mnt/lustre/home/rl3328/rl3328/motor_qtl/GWAS/unadjusted_wp_ukb.sumstats_hg38_sorted.tsv.gz

tabix -s 1 -b 2 -e 2 /mnt/lustre/home/rl3328/rl3328/motor_qtl/GWAS/unadjusted_wp_ukb.sumstats_hg38_sorted.tsv.gz
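
tabix expects a bgzip-compressed file whose records are sorted by sequence name and position, which is what the pipeline above is meant to produce. As a sketch (the helper name `check_sorted` is hypothetical), the position ordering within each chromosome can be verified on the decompressed stream. It does not check that each chromosome forms a single contiguous block.

```shell
# Hypothetical helper: read tab-separated records on stdin (chromosome in
# column 1, position in column 2) and exit non-zero if a position is smaller
# than the previous one within the same chromosome.
check_sorted() {
    awk -F'\t' '
        /^#/ { next }                                          # skip the commented header
        $1 == prev_chr && $2 + 0 < prev_pos { bad = 1; exit }  # position went backwards
        { prev_chr = $1; prev_pos = $2 + 0 }
        END { exit bad + 0 }
    '
}

# Example (path taken from the bgzip command above):
# zcat /mnt/lustre/home/rl3328/rl3328/motor_qtl/GWAS/unadjusted_wp_ukb.sumstats_hg38_sorted.tsv.gz | check_sorted && echo "sorted"
```

Once the index exists, a region can be retrieved with a query of the form `tabix <file> 1:677308-729886`, using the chromosome naming of the file (here without the `chr` prefix).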