SCARB2 - Single gene analysis in GP2 Neurobooster genotyping data (all ancestries)

Project: GP2 SCARB2

Version: Python/3.10.15, R/4.3.3

Notebook Overview

Description

Loading Python libraries

Set paths

Make working directory

Installing packages

LD pruning (TMEM175/SCARB2/CTSB)

Create a covariate file with GP2 data

Annotation of the gene SCARB2

Association analysis to compare allele frequencies between cases and controls

GLM analysis adjusting for sex, age, PC1-5

Burden tests (SkatO, Skat, cmc, zeggini, mb, fp, cmcWald)

Conditional analysis

Loading Python libraries

In [1]:
# Use pathlib for file path manipulation
import pathlib

# Import NumPy for numerical operations
import numpy as np

# Import pandas for tabular data
import pandas as pd

# Import plotnine: a ggplot2-style Python plotting package
from plotnine import *

# Always show all columns in a Pandas DataFrame
pd.set_option('display.max_columns', None)

Set paths

In [2]:
REL10_PATH = pathlib.Path(pathlib.Path.home(), 'workspace/gp2_tier2_eu_release10')
!ls -hal {REL10_PATH}

total 100K
dr-xr-xr-x. 1 jupyter users    0 Jul 26 10:56 clinical_data
dr-xr-xr-x. 1 jupyter users    0 Jul 26 10:56 imputed_genotypes
dr-xr-xr-x. 1 jupyter users    0 Jul 26 10:56 meta_data
dr-xr-xr-x. 1 jupyter users    0 Jul 26 10:56 raw_genotypes
dr-xr-xr-x. 1 jupyter users    0 Jul 26 10:56 raw_genotypes_flipped
-r--r--r--. 1 jupyter users 100K Jun 30 20:07 README_release10_01072025.txt
dr-xr-xr-x. 1 jupyter users    0 Jul 26 10:56 wgs


Make working directory

In [3]:
! mkdir ~/workspace/ws_files/SCARB2

In [4]:
WORK_DIR = "~/workspace/ws_files/SCARB2/"

In [5]:
# Check that all required tools are installed
! ls /home/jupyter/tools

annovar				       plink2
annovar.latest.tar.gz		       plink2_linux_x86_64_latest.zip
gcta-1.95.0-linux-kernel-3-x86_64      plink_linux_x86_64_20190304.zip
gcta-1.95.0-linux-kernel-3-x86_64.zip  prettify
intel-simplified-software-license.txt  rvtests
LICENSE				       toy.map
__MACOSX			       toy.ped
plink				       vcf_subset


In [6]:
# Make sure the tool binaries are executable
! chmod u+x /home/jupyter/tools/plink
! chmod u+x /home/jupyter/tools/plink2
! chmod 777 /home/jupyter/tools/rvtests/executable/rvtest

In [7]:
%%bash
# Make one working directory per ancestry
# Loop over all the ancestries
for ancestry in {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'} ;
do

#Make a folder for each ancestry
mkdir ~/workspace/ws_files/SCARB2/SCARB2_"$ancestry"

done
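
The same per-ancestry folders could also be created from Python. The following is a minimal sketch, not part of the original workflow: `make_ancestry_dirs` is a hypothetical helper, and a temporary directory stands in for `~/workspace/ws_files`. Using `mkdir(parents=True, exist_ok=True)` means the cell can be re-run without errors when a folder already exists.

```python
import pathlib
import tempfile

ANCESTRIES = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']


def make_ancestry_dirs(base_dir, gene='SCARB2', ancestries=ANCESTRIES):
    """Create <base_dir>/<gene>/<gene>_<ancestry> for each ancestry and return the paths.

    Hypothetical helper for illustration only.
    """
    created = []
    for ancestry in ancestries:
        path = pathlib.Path(base_dir).expanduser() / gene / f'{gene}_{ancestry}'
        # parents=True creates the gene folder if needed; exist_ok=True makes re-runs safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstration in a throwaway directory instead of the real workspace
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_ancestry_dirs(tmp)
    n_created = sum(d.is_dir() for d in dirs)
    print(f'Created {n_created} ancestry folders')
```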

In [8]:
# The covariate file (with GBA carriers excluded) was created in a separate notebook

Annotation of the gene

Extract the region using PLINK

Extract SCARB2 gene

SCARB2 coordinates: Chromosome 4: 76,158,737-76,234,536(GRCh38/hg38)
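
The PLINK extraction in this notebook uses `--from-bp 76108737` and `--to-bp 76284536`, which is the gene interval widened by 50 kb on each side. A small sketch of that arithmetic follows; `padded_window` and the constant names are illustrative, not part of the original notebook.

```python
# SCARB2 gene coordinates on chromosome 4 (GRCh38/hg38), as stated above
GENE_START = 76_158_737
GENE_END = 76_234_536
FLANK_BP = 50_000  # flanking sequence added on each side


def padded_window(start, end, flank):
    """Return (from_bp, to_bp) for a gene interval widened by `flank` bases per side."""
    # Clamp at 1 so a gene near the chromosome start never yields a negative position
    return max(1, start - flank), end + flank


from_bp, to_bp = padded_window(GENE_START, GENE_END, FLANK_BP)
print(from_bp, to_bp)
```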

In [9]:
## Extract SCARB2 plus 50 kb of flanking sequence on each side (76,158,737 - 50,000 to 76,234,536 + 50,000) using PLINK 2
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'

    ! /home/jupyter/tools/plink2 \
    --pfile {REL10_PATH}/imputed_genotypes/{ancestry}/chr4_{ancestry}_release10_vwb \
    --chr 4 \
    --from-bp 76108737 \
    --to-bp 76284536 \
    --make-bed \
    --out {WORK_DIR}/{ancestry}_SCARB2

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2.log.
Options in effect:
  --chr 4
  --from-bp 76108737
  --make-bed
  --out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2
  --pfile /home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/SAS/chr4_SAS_release10_vwb
  --to-bp 76284536

Start time: Sat Jul 26 11:01:12 2025
26046 MiB RAM detected, ~24954 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
945 samples (339 females, 606 males; 945 founders) loaded from
/home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/SAS/chr4_SAS_release10_vwb.psam.
2227587 variants loaded from
/home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/SAS/chr4_SAS_release10_vwb.pvar.
1 binary phenotype loaded (317 cases, 269 controls).
2268 variants rema

In [10]:
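# Write the first sample of each .fam file to a keep list; one sample is enough
# to carry the variant sites through to the VCF that is annotated later
for ancestry in ancestries: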
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}/'
    
    ! head -n 1 {WORK_DIR}/{ancestry}_SCARB2.fam > {WORK_DIR}/{ancestry}_s1.txt

In [None]:
! head /home/jupyter/workspace/ws_files/SCARB2/SCARB2_EUR/EUR_s1.txt

Turn binary files into VCF

In [12]:
for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    
    ## Subset the binary fileset to the single sample in the keep list
    ! /home/jupyter/tools/plink2 \
    --bfile {WORK_DIR}/{ancestry}_SCARB2 \
    --keep {WORK_DIR}/{ancestry}_s1.txt \
    --make-bed \
    --out {WORK_DIR}/{ancestry}_SCARB2_v1

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2_v1.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2
  --keep /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_s1.txt
  --make-bed
  --out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2_v1

Start time: Sat Jul 26 11:04:12 2025
26046 MiB RAM detected, ~24848 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
945 samples (339 females, 606 males; 945 founders) loaded from
/home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2.fam.
2268 variants loaded from
/home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2.bim.
1 binary phenotype loaded (317 cases, 269 controls).
--keep: 1 sample remaining.
1 sample (0 females, 1 male; 1 founder) remaining after ma

In [14]:
for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    
    ## Turn the one-sample binary files into VCF
    ## (PLINK 2 reports 'vcf-fid' as deprecated and suggests 'vcf' + 'id-paste=fid')
    ! /home/jupyter/tools/plink2 \
    --bfile {WORK_DIR}/{ancestry}_SCARB2_v1 \
    --recode vcf-fid \
    --out {WORK_DIR}/{ancestry}_SCARB2_v1

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2_v1.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2_v1
  --export vcf-fid
  --out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2_v1

Start time: Sat Jul 26 11:09:15 2025
Note: --export 'vcf-fid' modifier is deprecated.  Use 'vcf' + 'id-paste=fid'.
26046 MiB RAM detected, ~24890 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
1 sample (0 females, 1 male; 1 founder) loaded from
/home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2_v1.fam.
2268 variants loaded from
/home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2_v1.bim.
1 binary phenotype loaded (1 case, 0 controls).
--export vcf to
/home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2_v1

In [15]:
### Bgzip and Tabix (zip and index the file)
for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    ! bgzip -f {WORK_DIR}/{ancestry}_SCARB2_v1.vcf
    ! tabix -f -p vcf {WORK_DIR}/{ancestry}_SCARB2_v1.vcf.gz 

Annotate using ANNOVAR

In [16]:
## annotate using ANNOVAR

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    
    ! perl /home/jupyter/tools/annovar/table_annovar.pl {WORK_DIR}/{ancestry}_SCARB2_v1.vcf.gz /home/jupyter/tools/annovar/humandb/ -buildver hg38 \
    -out {WORK_DIR}/{ancestry}_SCARB2.annovar \
    -remove -protocol refGene,clinvar_20140902 \
    -operation g,f \
    --nopolish \
    -nastring . \
    -vcfinput


NOTICE: Running with system command <convert2annovar.pl  -includeinfo -allsample -withfreq -format vcf4 /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2_v1.vcf.gz > /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2.annovar.avinput>
NOTICE: Finished reading 2275 lines from VCF file
NOTICE: A total of 2268 locus in VCF file passed QC threshold, representing 2093 SNPs (1452 transitions and 641 transversions) and 175 indels/substitutions
NOTICE: Finished writing allele frequencies based on 2093 SNP genotypes (1452 transitions and 641 transversions) and 175 indels/substitutions for 1 samples

NOTICE: Running with system command </home/jupyter/tools/annovar/table_annovar.pl /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2.annovar.avinput /home/jupyter/tools/annovar/humandb/ -buildver hg38 -outfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2.annovar -remove -protocol refGene,clinvar_20140902 -operation g,f --nopolish -nastring . -otheri

In [18]:
WORK_DIR = f'~/workspace/ws_files/SCARB2/'

In [49]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_CAH/CAH_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108753,76108753,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108753,chr4:76108753:T:C,T,C,.,.,PR,GT,0/0
1,4,76108755,76108755,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108755,chr4:76108755:T:C,T,C,.,.,PR,GT,0/0
2,4,76108804,76108804,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108804,chr4:76108804:T:C,T,C,.,.,PR,GT,0/0
3,4,76108828,76108828,G,T,intronic,ART3,.,.,.,.,0.0,.,.,4,76108828,chr4:76108828:G:T,G,T,.,.,PR,GT,0/0
4,4,76108848,76108848,C,T,intronic,ART3,.,.,.,.,0.0,.,.,4,76108848,chr4:76108848:C:T,C,T,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3915,4,76284281,76284281,C,A,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284281,chr4:76284281:C:A,C,A,.,.,PR,GT,0/0
3916,4,76284298,76284298,C,T,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284298,chr4:76284298:C:T,C,T,.,.,PR,GT,0/0
3917,4,76284303,76284303,C,A,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284303,chr4:76284303:C:A,C,A,.,.,PR,GT,0/0
3918,4,76284327,76284327,C,T,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284327,chr4:76284327:C:T,C,T,.,.,PR,GT,0/0


In [50]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      3440
intergenic     210
UTR3            81
downstream      72
exonic          60
upstream        30
UTR5            24
splicing         3
Name: count, dtype: int64

In [51]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV    33
synonymous SNV       26
stopgain              1
Name: count, dtype: int64

In [52]:
#Make lists of variants to keep - all exonic variants, and nonsynonymous SNVs (missense, as coded by ANNOVAR) plus loss-of-function variants

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    print(f'WORKING ON: {ancestry}')

    # Read in ANNOVAR multianno file
    gene = pd.read_csv(f'{WORK_DIR}/{ancestry}_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
    
    #Print number of variants in the different categories
    results = [] 

    utr5 = gene[gene['Func.refGene']== 'UTR5']
    intronic = gene[gene['Func.refGene']== 'intronic']
    exonic = gene[gene['Func.refGene']== 'exonic']
    utr3 = gene[gene['Func.refGene']== 'UTR3']
    coding_nonsynonymous = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'nonsynonymous SNV')]
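    # Note: this keeps every exonic variant that is not a nonsynonymous SNV, so the
    # 'Synonymous' count printed below can also include stopgain/frameshift variants
    coding_synonymous = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] != 'nonsynonymous SNV')]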
    lof = exonic[(exonic['ExonicFunc.refGene'] == 'stopgain') | (exonic['ExonicFunc.refGene'] == 'stoploss') | (exonic['ExonicFunc.refGene'] == 'frameshift deletion') | (exonic['ExonicFunc.refGene'] == 'frameshift insertion')]
    nonsynonymous_lof = pd.concat([coding_nonsynonymous, lof])

    print({ancestry})
    print('Total variants: ', len(gene))
    print("Intronic: ", len(intronic))
    print('UTR3: ', len(utr3))
    print('UTR5: ', len(utr5))
    print("Total exonic: ", len(exonic))
    print('  Synonymous: ', len(coding_synonymous))
    print("  Nonsynonymous: ", len(coding_nonsynonymous))
    print("nonsynonymous_lof: ", len(nonsynonymous_lof))
    results.append((gene, intronic, utr3, utr5, exonic, coding_synonymous, coding_nonsynonymous, nonsynonymous_lof))
    print('\n')

    # Save in PLINK range format - nonsynonymous SNVs plus loss-of-function variants
    # Nonsynonymous SNVs are missense variants; other protein-altering classes (e.g. stopgain/stoploss or frameshift variants) are coded separately in ExonicFunc.refGene
    variants_toKeep = nonsynonymous_lof[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep.to_csv(f'{WORK_DIR}/{ancestry}_SCARB2.nonsynonymous_lof.variantstoKeep.txt', sep="\t", index=False, header=False)


    # Save in PLINK format - all coding variants
    variants_toKeep2 = exonic[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep2.to_csv(f'{WORK_DIR}/{ancestry}_SCARB2.exonic.variantstoKeep.txt', sep="\t", index=False, header=False)

WORKING ON: SAS
{'SAS'}
Total variants:  2268
Intronic:  1972
UTR3:  59
UTR5:  12
Total exonic:  35
  Synonymous:  15
  Nonsynonymous:  20
nonsynonymous_lof:  20


WORKING ON: FIN
{'FIN'}
Total variants:  1200
Intronic:  1032
UTR3:  23
UTR5:  3
Total exonic:  16
  Synonymous:  7
  Nonsynonymous:  9
nonsynonymous_lof:  9


WORKING ON: MDE
{'MDE'}
Total variants:  3010
Intronic:  2656
UTR3:  56
UTR5:  14
Total exonic:  59
  Synonymous:  19
  Nonsynonymous:  40
nonsynonymous_lof:  41


WORKING ON: AAC
{'AAC'}
Total variants:  3967
Intronic:  3493
UTR3:  93
UTR5:  21
Total exonic:  63
  Synonymous:  27
  Nonsynonymous:  36
nonsynonymous_lof:  37


WORKING ON: AFR
{'AFR'}
Total variants:  4696
Intronic:  4111
UTR3:  95
UTR5:  34
Total exonic:  75
  Synonymous:  35
  Nonsynonymous:  40
nonsynonymous_lof:  42


WORKING ON: CAS
{'CAS'}
Total variants:  2681
Intronic:  2356
UTR3:  59
UTR5:  18
Total exonic:  50
  Synonymous:  22
  Nonsynonymous:  28
nonsynonymous_lof:  29


WORKING ON: CAH
{'CA
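
The category logic can be illustrated on a toy table. This is a sketch under stated assumptions: the positions are invented, and the synonymous subset is selected with `== 'synonymous SNV'` and the loss-of-function subset with `isin`, which is one possible way to keep the categories mutually exclusive rather than the notebook's own code.

```python
import pandas as pd

# ExonicFunc.refGene values treated as loss-of-function, matching the classes used above
LOF_CLASSES = ['stopgain', 'stoploss', 'frameshift deletion', 'frameshift insertion']

# Toy annotation table; positions are made up for illustration
toy = pd.DataFrame({
    'Chr': [4, 4, 4, 4, 4],
    'Start': [100, 200, 300, 400, 500],
    'End': [100, 200, 300, 400, 500],
    'Func.refGene': ['exonic', 'exonic', 'exonic', 'intronic', 'exonic'],
    'Gene.refGene': ['SCARB2'] * 5,
    'ExonicFunc.refGene': ['nonsynonymous SNV', 'synonymous SNV', 'stopgain',
                           '.', 'frameshift deletion'],
})

exonic = toy[toy['Func.refGene'] == 'exonic']
synonymous = exonic[exonic['ExonicFunc.refGene'] == 'synonymous SNV']
nonsynonymous = exonic[exonic['ExonicFunc.refGene'] == 'nonsynonymous SNV']
lof = exonic[exonic['ExonicFunc.refGene'].isin(LOF_CLASSES)]
nonsynonymous_lof = pd.concat([nonsynonymous, lof])

# Range file content: chromosome, start, end, set name (tab-separated, no header)
range_lines = nonsynonymous_lof[['Chr', 'Start', 'End', 'Gene.refGene']].to_csv(
    sep='\t', index=False, header=False)
print(range_lines)
```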

ALL variants

assoc

glm

ASSOC

In [53]:
#Run case-control analysis using plink assoc for all variants, not adjusting for any covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    
    ! /home/jupyter/tools/plink \
    --bfile {WORK_DIR}/{ancestry}_SCARB2 \
    --keep /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.noGBA.samplestoKeep \
    --assoc \
    --allow-no-sex \
    --ci 0.95 \
    --maf 0.01 \
    --out {WORK_DIR}/{ancestry}_SCARB2.all

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.all.log.
Options in effect:
  --allow-no-sex
  --assoc
  --bfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2
  --ci 0.95
  --keep /home/jupyter/workspace/ws_files/AAC/AAC.noGBA.samplestoKeep
  --maf 0.01
  --out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.all

26046 MB RAM detected; reserving 13023 MB for main workspace.
3967 variants loaded from .bim file.
1215 people (532 males, 683 females) loaded from .fam.
1133 phenotype values loaded from .fam.
--keep: 1207 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1207 founders and 0 nonfounders present.
Calculating allele frequencies done.
Total genotyping rate in remaining samples is 0.997423.
2526 variants removed due to

In [54]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    #Read the assoc results and keep variants with nominal p-value < 0.05
    freq = pd.read_csv(f'{WORK_DIR}/{ancestry}_SCARB2.all.assoc', sep=r'\s+')
    sig_all_nonadj = freq[freq['P']<0.05]
    
    print(f'Variants with p-value < 0.05: {sig_all_nonadj.shape}')
    
    #Save FREQ to csv
    freq.to_csv(f'{WORK_DIR}/{ancestry}.all_nonadj.csv')

WORKING ON: AAC
Variants with p-value < 0.05: (236, 13)
WORKING ON: AFR
Variants with p-value < 0.05: (70, 13)
WORKING ON: AJ
Variants with p-value < 0.05: (61, 13)
WORKING ON: AMR
Variants with p-value < 0.05: (740, 13)
WORKING ON: CAS
Variants with p-value < 0.05: (28, 13)
WORKING ON: EAS
Variants with p-value < 0.05: (418, 13)
WORKING ON: EUR
Variants with p-value < 0.05: (270, 13)
WORKING ON: FIN
Variants with p-value < 0.05: (5, 13)
WORKING ON: MDE
Variants with p-value < 0.05: (225, 13)
WORKING ON: SAS
Variants with p-value < 0.05: (4, 13)
WORKING ON: CAH
Variants with p-value < 0.05: (40, 13)
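
For intuition, an unadjusted allelic case-control test compares allele counts in a 2x2 table with a 1-df chi-square statistic. The sketch below uses hypothetical allele counts and a hand-written helper (`allelic_chi_square`); it illustrates the general technique and is not PLINK's implementation.

```python
import math


def allelic_chi_square(case_a1, case_a2, control_a1, control_a2):
    """1-df chi-square test on a 2x2 table of allele counts.

    Returns (chi2, p_value, odds_ratio). No continuity correction is applied.
    """
    table = [[case_a1, case_a2], [control_a1, control_a2]]
    total = case_a1 + case_a2 + control_a1 + control_a2
    row_sums = [sum(row) for row in table]
    col_sums = [case_a1 + control_a1, case_a2 + control_a2]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_a1 * control_a2) / (case_a2 * control_a1)
    return chi2, p_value, odds_ratio


# Hypothetical counts: 100 cases and 100 controls, i.e. 200 alleles per group
chi2, p_value, odds_ratio = allelic_chi_square(60, 140, 40, 160)
print(f'chi2={chi2:.3f}, p={p_value:.4f}, OR={odds_ratio:.3f}')
```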


In [55]:
#Run case-control analysis with covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'

    ! /home/jupyter/tools/plink2 \
    --bfile {WORK_DIR}/{ancestry}_SCARB2 \
    --keep /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.noGBA.samplestoKeep \
    --allow-no-sex \
    --maf 0.01 \
    --ci 0.95 \
    --glm \
    --covar /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.noGBA.noNA.txt \
    --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
    --covar-variance-standardize \
    --neg9-pheno-really-missing \
    --out {WORK_DIR}/{ancestry}_SCARB2.all_adj

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.all_adj.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2
  --ci 0.95
  --covar /home/jupyter/workspace/ws_files/AAC/AAC.noGBA.noNA.txt
  --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5
  --covar-variance-standardize
  --glm
  --keep /home/jupyter/workspace/ws_files/AAC/AAC.noGBA.samplestoKeep
  --maf 0.01
  --neg9-pheno-really-missing
  --out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.all_adj

Start time: Sat Jul 26 11:24:21 2025
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
26046 MiB RAM detected, ~24828 available; reserving 13023 MiB for main
workspac

In [56]:
WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_AMR'

! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/AMR_SCARB2 \
--keep /home/jupyter/workspace/ws_files/AMR/AMR.noGBA.samplestoKeep \
--allow-no-sex \
--maf 0.01 \
--ci 0.95 \
--glm \
--covar /home/jupyter/workspace/ws_files/AMR/AMR_covariate_file_noGBA.orthoPCs.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--covar-variance-standardize \
--neg9-pheno-really-missing \
--out {WORK_DIR}/AMR_SCARB2.all_adj

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AMR/AMR_SCARB2.all_adj.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AMR/AMR_SCARB2
  --ci 0.95
  --covar /home/jupyter/workspace/ws_files/AMR/AMR_covariate_file_noGBA.orthoPCs.txt
  --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5
  --covar-variance-standardize
  --glm
  --keep /home/jupyter/workspace/ws_files/AMR/AMR.noGBA.samplestoKeep
  --maf 0.01
  --neg9-pheno-really-missing
  --out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AMR/AMR_SCARB2.all_adj

Start time: Sat Jul 26 11:25:41 2025
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
26046 MiB RAM detected, ~24813 available; reserving 13023 Mi
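
The `--covar-variance-standardize` flag rescales the named covariates to mean 0 and variance 1 before the regression, which helps when covariates such as AGE and the PCs are on very different scales. A minimal sketch of that transformation is below; the ages are hypothetical, and the use of the sample (n - 1) variance is an assumption, since the notebook does not show which denominator PLINK uses.

```python
import numpy as np


def variance_standardize(values):
    """Rescale a covariate to mean 0 and unit (sample) variance."""
    values = np.asarray(values, dtype=float)
    # ddof=1 gives the sample standard deviation (assumption, see lead-in)
    return (values - values.mean()) / values.std(ddof=1)


age = np.array([55.0, 62.0, 70.0, 48.0, 66.0])  # hypothetical ages
age_std = variance_standardize(age)
print(age_std.round(3))
```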

In [57]:
#Process results from plink glm analysis including covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    #Read in plink glm results
    assoc = pd.read_csv(f'{WORK_DIR}/{ancestry}_SCARB2.all_adj.PHENO1.glm.logistic.hybrid', sep=r'\s+')

    #Filter for the additive test only - these rows hold the per-variant results
    assoc_add = assoc[assoc['TEST']=="ADD"]
    
    #Check if there are any nominally significant (p < 0.05) variants
    significant = assoc_add[assoc_add['P']<0.05]

    print(f'There are {len(significant)} variants with p-value < 0.05 in glm')
    
    #Check if there are any genome-wide significant (p < 5e-8) variants
    GWsignificant = assoc_add[assoc_add['P']<5e-8]

    print(f'There are {len(GWsignificant)} variants with p-value < 5e-8 in glm')
    
    #Save assoc_add to csv
    assoc_add.to_csv(f'{WORK_DIR}/{ancestry}.all_adj.csv')

WORKING ON: AAC
There are 248 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: AFR
There are 209 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: AJ
There are 52 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: AMR
There are 167 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: CAS
There are 3 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: EAS
There are 62 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: EUR
There are 258 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: FIN
There are 0 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: MDE
There are 110 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: SAS
There are 11 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: CAH
There are 54 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm


Format association results for meta-analysis (AFR/AJ/AMR/EAS/EUR)

In [58]:
# Define ancestry groups and their sample sizes
ancestry_sample_sizes = {
    'AFR': 3316,
    'AJ': 1693,
    'AMR': 3319,
    'EAS': 4975,
    'EUR': 33870
}

# Output container (optional)
cleaned_results = {}

# Loop over each ancestry
for ancestry, sample_size in ancestry_sample_sizes.items():
    # Load PLINK association result file
    file_path = f'/home/jupyter/workspace/ws_files/SCARB2/SCARB2_{ancestry}/{ancestry}.all_adj.csv'
    df = pd.read_csv(file_path)

    # Build MARKERNAME
    df['MARKERNAME'] = 'chr' + df['#CHROM'].astype(str) + ':' + df['POS'].astype(str)

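    # Rename columns
    # Caution: PLINK 2 reports effect sizes for the A1 allele, which is not guaranteed to be ALT;
    # if an A1 column is present, confirm A1 == ALT before labelling ALT as the effect allele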
    df.rename(columns={
        '#CHROM': 'CHROMOSOME',
        'POS': 'POSITION',
        'REF': 'NEA',
        'ALT': 'EA',
        'A1_FREQ': 'EAF',
        'LOG(OR)_SE': 'SE',
        'L95': 'OR_95L',
        'U95': 'OR_95U'
    }, inplace=True)

    # Add BETA = log(OR)
    df['BETA'] = np.log(df['OR'])

    # Add sample size
    df['N'] = sample_size

    # Keep desired columns
    final_cols = [
        'MARKERNAME', 'CHROMOSOME', 'POSITION', 'NEA', 'EA', 'EAF',
        'BETA', 'OR', 'SE', 'OR_95L', 'OR_95U', 'P', 'N'
    ]
    df_cleaned = df[final_cols]

    # Store or save
    cleaned_results[ancestry] = df_cleaned

    # Optional: save to file
    output_path = f'/home/jupyter/workspace/ws_files/SCARB2/SCARB2_{ancestry}/{ancestry}.cleaned_assoc.txt.gz'
    df_cleaned.to_csv(output_path, sep='\t', index=False)
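
The notebook does not show which meta-analysis tool consumes these files. As an illustration of what the BETA and SE columns are typically used for, here is a minimal sketch of an inverse-variance-weighted fixed-effect meta-analysis for a single variant. The function name and the input values are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis for one variant.

    betas: per-cohort log odds ratios; ses: their standard errors.
    Returns (pooled_beta, pooled_se, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return pooled_beta, pooled_se, p_value


# Hypothetical effects for the same variant in two ancestry groups
pooled_beta, pooled_se, p_value = fixed_effect_meta([0.1, 0.3], [0.1, 0.1])
print(f'beta={pooled_beta:.3f}, se={pooled_se:.4f}, p={p_value:.5f}')
```

Because the weights are 1/SE^2, larger cohorts with smaller standard errors dominate the pooled estimate.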

In [59]:
df = pd.read_csv('/home/jupyter/workspace/ws_files/SCARB2/SCARB2_AJ/AJ.cleaned_assoc.txt.gz', sep='\t')
df

Unnamed: 0,MARKERNAME,CHROMOSOME,POSITION,NEA,EA,EAF,BETA,OR,SE,OR_95L,OR_95U,P,N
0,chr4:76108828,4,76108828,G,T,0.488445,-0.072232,0.930315,0.103227,0.759911,1.13893,0.484090,1693
1,chr4:76108858,4,76108858,G,A,0.488095,-0.072979,0.929620,0.103190,0.759399,1.13800,0.479424,1693
2,chr4:76108994,4,76108994,G,A,0.191527,-0.026174,0.974166,0.132669,0.751113,1.26346,0.843604,1693
3,chr4:76109007,4,76109007,G,A,0.191176,-0.027505,0.972870,0.132660,0.750126,1.26176,0.835750,1693
4,chr4:76109102,4,76109102,C,T,0.191176,-0.027505,0.972870,0.132660,0.750126,1.26176,0.835750,1693
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1024,chr4:76281999,4,76281999,A,G,0.336720,0.113248,1.119910,0.113801,0.896019,1.39975,0.319653,1693
1025,chr4:76282260,4,76282260,C,G,0.502102,-0.021202,0.979021,0.105906,0.795508,1.20487,0.841327,1693
1026,chr4:76283245,4,76283245,CAT,C,0.111773,-0.284414,0.752455,0.163238,0.546426,1.03617,0.081451,1693
1027,chr4:76283347,4,76283347,C,T,0.240533,0.015204,1.015320,0.126539,0.792303,1.30110,0.904384,1693
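
A quick consistency check on the formatted columns: the odds ratio and its 95% confidence limits should equal exp(BETA) and exp(BETA ± 1.96 × SE). The sketch below recomputes them for the first AJ row shown above; `or_confidence_interval` is an illustrative helper.

```python
import math


def or_confidence_interval(beta, se, z=1.959964):
    """Return (lower, upper) 95% confidence limits of the odds ratio from log(OR) and its SE."""
    return math.exp(beta - z * se), math.exp(beta + z * se)


# Values from the first row of the AJ table (chr4:76108828)
beta, se = -0.072232, 0.103227
odds_ratio = math.exp(beta)
lower, upper = or_confidence_interval(beta, se)
print(f'OR={odds_ratio:.6f}, 95% CI=({lower:.6f}, {upper:.5f})')
```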


Burden Analyses using RVTests

In [60]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['exonic', 'nonsynonymous_lof']

#Loop over all the ancestries and the 2 variant classes - export a VCF of the exonic and of the nonsynonymous + LoF variants as input for rvtests
for ancestry in ancestries:
    for variant_class in variant_classes:
        
        WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'

        # Print a progress message
        print(f'Running plink to extract {variant_class} variants for ancestry: {ancestry}')
        
        #Extract relevant variants
        ! /home/jupyter/tools/plink2 \
        --pfile {REL10_PATH}/imputed_genotypes/{ancestry}/chr4_{ancestry}_release10_vwb \
        --keep /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.noGBA.samplestoKeep \
        --extract range {WORK_DIR}/{ancestry}_SCARB2.{variant_class}.variantstoKeep.txt \
        --recode vcf-iid \
        --out {WORK_DIR}/{ancestry}_SCARB2.{variant_class}
        
        # Print a progress message
        print(f'Running bgzip and tabix for {variant_class} variants for ancestry: {ancestry}')
        
        ## Bgzip and Tabix (zip and index the file)
        ! bgzip -f {WORK_DIR}/{ancestry}_SCARB2.{variant_class}.vcf
        ! tabix -f -p vcf {WORK_DIR}/{ancestry}_SCARB2.{variant_class}.vcf.gz

Running plink to extract exonic variants for ancestry: AAC
PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.exonic.log.
Options in effect:
  --export vcf-iid
  --extract range /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.exonic.variantstoKeep.txt
  --keep /home/jupyter/workspace/ws_files/AAC/AAC.noGBA.samplestoKeep
  --out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.exonic
  --pfile /home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/AAC/chr4_AAC_release10_vwb

Start time: Sat Jul 26 11:34:53 2025
Note: --export 'vcf-iid' modifier is deprecated.  Use 'vcf' + 'id-paste=iid'.
26046 MiB RAM detected, ~24892 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
1215 samples (683 females, 532 males; 1215 founders) loaded from
/home/jupyt

In [None]:
#Run RVtests
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['exonic', 'nonsynonymous_lof']

for ancestry in ancestries:
    for variant_class in variant_classes:
        
        WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'

        # Print a progress message
        print(f'Running RVtests for {variant_class} variants for ancestry: {ancestry}')
        
        ## RVtests with covariates 
        #Make sure the pheno and covariate file starts with these 5 columns: fid, iid, fatid, matid, sex
        #The pheno-name flag only works when the pheno/covar file is structured properly
        ! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
        --out {WORK_DIR}/{ancestry}_SCARB2.burden.{variant_class} \
        --kernel skato \
        --inVcf {WORK_DIR}/{ancestry}_SCARB2.{variant_class}.vcf.gz \
        --pheno /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}_covariate_file_noGBA.txt \
        --pheno-name PHENO \
        --gene SCARB2 \
        --geneFile /home/jupyter/workspace/ws_files/refFlat_HG38.txt \
        --covar /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}_covariate_file_noGBA.txt \
        --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
        --freqUpper 0.01
# Other rvtests options that could be used here: --burden cmc,zeggini,mb,fp,cmcWald --kernel skat,skato
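
To make the idea behind gene-level rare-variant tests concrete, the sketch below shows a toy collapsing step in the spirit of CMC-style burden tests: variants below a frequency cutoff are collapsed into a single carrier indicator per sample. The genotype matrix, the cutoff of 0.1 (chosen only because the toy has six samples; the notebook uses `--freqUpper 0.01`), and the helper name are assumptions; this is not the rvtests implementation.

```python
import numpy as np


def collapse_rare_variants(genotypes, freq_upper):
    """Collapse rare variants into one carrier indicator per sample.

    genotypes: samples x variants matrix of alternate-allele counts (0, 1 or 2).
    Returns (rare_mask, carriers).
    """
    genotypes = np.asarray(genotypes)
    n_samples = genotypes.shape[0]
    alt_freq = genotypes.sum(axis=0) / (2 * n_samples)
    rare_mask = alt_freq < freq_upper
    # A sample is a carrier if it has at least one alternate allele at any rare variant
    carriers = (genotypes[:, rare_mask] > 0).any(axis=1)
    return rare_mask, carriers


# Toy data: 6 samples x 3 variants; the middle variant is common
genotypes = [
    [0, 1, 0],
    [0, 1, 0],
    [1, 2, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
rare, carriers = collapse_rare_variants(genotypes, freq_upper=0.1)
print(rare, carriers)
```

The carrier indicator would then be compared between cases and controls, usually in a regression that includes the same covariates as the single-variant model.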

Look at RVtest results SKAT-O

In [65]:
%load_ext rpy2.ipython



In [112]:
WORK_DIR = "~/workspace/ws_files/SCARB2"

In [63]:
#Check EUR all_coding variant results
! cat {WORK_DIR}/SCARB2_EUR/EUR_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	40128	43	39	147695	0	0.520947


In [64]:
#Check EUR nonsynonymous_lof variant results
! cat {WORK_DIR}/SCARB2_EUR/EUR_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	40128	27	24	108275	0	0.536357


In [69]:
#Check AAC all_coding variant results
! cat {WORK_DIR}/SCARB2_AAC/AAC_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	1061	21	18	24584.7	0	0.00187003


In [70]:
#Check AAC nonsynonymous_lof variant results
! cat {WORK_DIR}/SCARB2_AAC/AAC_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	1061	12	9	7912.68	0	0.0500565
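
Rather than inspecting each `.SkatO.assoc` file with `cat`, the results could be gathered into one table. The sketch below parses the EUR and AAC exonic outputs shown above from in-memory strings so that it runs without the files; in practice the same `pd.read_csv(..., sep='\t')` call would be pointed at each file path. The dictionary and column name `ANCESTRY` are illustrative.

```python
import io

import pandas as pd

HEADER = 'Gene\tRANGE\tN_INFORMATIVE\tNumVar\tNumPolyVar\tQ\trho\tPvalue\n'
GENE_RANGE = '4:76158738-76213824,4:76158738-76213899'

# SKAT-O exonic results copied from the cell outputs above
skato_outputs = {
    'EUR': HEADER + f'SCARB2\t{GENE_RANGE}\t40128\t43\t39\t147695\t0\t0.520947\n',
    'AAC': HEADER + f'SCARB2\t{GENE_RANGE}\t1061\t21\t18\t24584.7\t0\t0.00187003\n',
}

frames = []
for ancestry, text in skato_outputs.items():
    frame = pd.read_csv(io.StringIO(text), sep='\t')
    frame.insert(0, 'ANCESTRY', ancestry)
    frames.append(frame)

# One row per ancestry, smallest p-value first
summary = pd.concat(frames, ignore_index=True).sort_values('Pvalue').reset_index(drop=True)
print(summary[['ANCESTRY', 'N_INFORMATIVE', 'NumVar', 'NumPolyVar', 'Pvalue']])
```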


In [None]:
# Run the SKAT test for exonic variants in the AAC ancestry
# Make sure the pheno and covariate file starts with these 5 columns: fid, iid, fatid, matid, sex
# The pheno-name flag only works when the pheno/covar file is structured properly
WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_AAC'

! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
--out {WORK_DIR}/AAC_SCARB2.SKAT.exonic \
--kernel skat \
--inVcf {WORK_DIR}/AAC_SCARB2.exonic.vcf.gz \
--pheno /home/jupyter/workspace/ws_files/AAC/AAC_covariate_file_noGBA.txt \
--pheno-name PHENO \
--gene SCARB2 \
--geneFile /home/jupyter/workspace/ws_files/refFlat_HG38.txt \
--covar /home/jupyter/workspace/ws_files/AAC/AAC_covariate_file_noGBA.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--freqUpper 0.01

In [72]:
# look at result
! cat {WORK_DIR}/AAC_SCARB2.SKAT.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	1061	21	18	3.83605e+06	0.000890497	10000	10000	3.83605e+06	485	0	0.0485


In [None]:
# Run single-variant score tests (--single score) on the AAC exonic variants
! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
--inVcf {WORK_DIR}/AAC_SCARB2.exonic.vcf.gz \
--single score \
--pheno /home/jupyter/workspace/ws_files/AAC/AAC_covariate_file_noGBA.txt \
--pheno-name PHENO \
--geneFile /home/jupyter/workspace/ws_files/refFlat_HG38.txt \
--gene SCARB2 \
--out {WORK_DIR}/AAC_SCARB2.burdenTESTSV.exonic \
--covar /home/jupyter/workspace/ws_files/AAC/AAC_covariate_file_noGBA.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--freqUpper 0.01

In [None]:
# look at results from single variant analysis
! cat {WORK_DIR}/AAC_SCARB2.burdenTESTSV.exonic.SingleScore.assoc

In [75]:
# Load the single-variant results into a DataFrame (saved as CSV in the next cell)
SV_AACexonic = pd.read_csv(f'{WORK_DIR}/AAC_SCARB2.burdenTESTSV.exonic.SingleScore.assoc', sep = '\t')
# Drop duplicate rows
SV_AACexonic = SV_AACexonic.drop_duplicates()
display(SV_AACexonic)

Unnamed: 0,Gene,CHROM,POS,REF,ALT,N_INFORMATIVE,AF,U,V,STAT,DIRECTION,EFFECT,SE,PVALUE
0,SCARB2,4,76161743,C,T,1061,0.000471,-3.09448,77.2455,0.123966,-,-3.12244,8.86837,0.724773
1,SCARB2,4,76163295,A,G,1061,0.000471,1.26482,74.1873,0.021564,+,1.32886,9.04931,0.883253
2,SCARB2,4,76163320,T,C,1061,0.0,,,,,,,
3,SCARB2,4,76163352,C,T,1061,0.0,,,,,,,
4,SCARB2,4,76174287,T,G,1061,0.001414,51.7497,232.278,11.5294,+,17.3652,5.11419,0.000685025
5,SCARB2,4,76176451,T,C,1061,0.000471,47.8512,77.4252,29.5735,+,48.1715,8.85807,5.38361e-08
6,SCARB2,4,76179562,A,G,1061,0.000943,-2.0681,155.264,0.027547,-,-1.0382,6.25526,0.868179
7,SCARB2,4,76179643,G,A,1061,0.001414,49.205,231.863,10.4421,+,16.5409,5.11876,0.00123174
8,SCARB2,4,76179654,T,C,1061,0.001885,-8.06009,301.074,0.215778,-,-2.08664,4.49204,0.642276
9,SCARB2,4,76179673,G,A,1061,0.000471,-1.28739,77.6605,0.021341,-,-1.29208,8.84464,0.883853
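
A short sketch for reading the table above: rows with AF = 0 carry no test statistic, and the smallest p-values belong to variants with very few alternate alleles. The code below rebuilds the POS, AF and PVALUE columns from the displayed AAC output, estimates the alternate allele count as 2 x N_INFORMATIVE x AF (assuming diploid genotypes with no missing calls), and applies a Bonferroni threshold over the variants that were actually tested. The Bonferroni step is an illustrative choice and not necessarily the threshold used in this project.

```python
import numpy as np
import pandas as pd

# POS, AF and PVALUE copied from the AAC SingleScore output displayed above
sv = pd.DataFrame({
    "POS": [76161743, 76163295, 76163320, 76163352, 76174287,
            76176451, 76179562, 76179643, 76179654, 76179673],
    "N_INFORMATIVE": 1061,
    "AF": [0.000471, 0.000471, 0.0, 0.0, 0.001414,
           0.000471, 0.000943, 0.001414, 0.001885, 0.000471],
    "PVALUE": [0.724773, 0.883253, np.nan, np.nan, 0.000685025,
               5.38361e-08, 0.868179, 0.00123174, 0.642276, 0.883853],
})

# Keep only variants that produced a test statistic
tested = sv.dropna(subset=["PVALUE"]).copy()

# Approximate alternate allele count (assumption: diploid, no missing genotypes)
tested["ALT_COUNT"] = (2 * tested["N_INFORMATIVE"] * tested["AF"]).round().astype(int)

# Illustrative Bonferroni threshold over the tested variants
threshold = 0.05 / len(tested)
hits = tested[tested["PVALUE"] < threshold]

print(f"tested variants: {len(tested)}, threshold: {threshold:.5f}")
print(hits[["POS", "AF", "ALT_COUNT", "PVALUE"]])
```

Under these assumptions the variant at 4:76176451 corresponds to a single alternate allele, so its score-test p-value should be interpreted with caution.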


In [76]:
SV_AACexonic.to_csv(f'{WORK_DIR}/AAC_SCARB2.burdenTESTSV.exonic.SingleScore.assoc.csv')

In [79]:
#Check AFR all_coding variant results
! cat {WORK_DIR}/SCARB2_AFR/AFR_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	2738	19	16	7391.53	1	0.783387


In [80]:
#Check AFR nonsynonymous_lof variant results
! cat {WORK_DIR}/SCARB2_AFR/AFR_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	2738	10	9	24604.4	1	0.275785


In [81]:
#Check AJ all_coding variant results
! cat {WORK_DIR}/SCARB2_AJ/AJ_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	2174	5	3	43729.5	0	0.0206808


In [84]:
#Check AJ nonsynonymous_lof variant results
! cat {WORK_DIR}/SCARB2_AJ/AJ_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	2174	4	2	NA	NA	NA


In [None]:
# Run the SKAT test on exonic variants in the AJ ancestry
# Make sure the pheno and covariate files start with these 5 columns: fid, iid, fatid, matid, sex
# The --pheno-name flag only works when the pheno/covar file is structured properly
WORK_DIR = '~/workspace/ws_files/SCARB2/SCARB2_AJ'

! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
--out {WORK_DIR}/AJ_SCARB2.SKAT.exonic \
--kernel skat \
--inVcf {WORK_DIR}/AJ_SCARB2.exonic.vcf.gz \
--pheno /home/jupyter/workspace/ws_files/AJ/AJ_covariate_file_noGBA.txt \
--pheno-name PHENO \
--gene SCARB2 \
--geneFile /home/jupyter/workspace/ws_files/refFlat_HG38.txt \
--covar /home/jupyter/workspace/ws_files/AJ/AJ_covariate_file_noGBA.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--freqUpper 0.01

In [88]:
# look at result
! cat {WORK_DIR}/AJ_SCARB2.SKAT.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	2174	5	3	1.89553e+07	0.0206494	10000	10000	1.89553e+07	209	0	0.0209


In [None]:
# Run the single-variant score test (rvtests --single score) on exonic variants in AJ
! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
--inVcf {WORK_DIR}/AJ_SCARB2.exonic.vcf.gz \
--single score \
--pheno /home/jupyter/workspace/ws_files/AJ/AJ_covariate_file_noGBA.txt \
--pheno-name PHENO \
--geneFile /home/jupyter/workspace/ws_files/refFlat_HG38.txt \
--gene SCARB2 \
--out {WORK_DIR}/AJ_SCARB2.burdenTESTSV.exonic \
--covar /home/jupyter/workspace/ws_files/AJ/AJ_covariate_file_noGBA.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--freqUpper 0.01

In [90]:
# Load the single-variant results into a DataFrame (saved as CSV in the next cell)
SV_AJexonic = pd.read_csv(f'{WORK_DIR}/AJ_SCARB2.burdenTESTSV.exonic.SingleScore.assoc', sep = '\t')
# Drop duplicate rows
SV_AJexonic = SV_AJexonic.drop_duplicates()
display(SV_AJexonic)

Unnamed: 0,Gene,CHROM,POS,REF,ALT,N_INFORMATIVE,AF,U,V,STAT,DIRECTION,EFFECT,SE,PVALUE
0,SCARB2,4,76179654,T,C,2174,0.00989,-221.064,9123.97,5.35614,-,-5.24879,2.26795,0.020649
1,SCARB2,4,76195738,G,A,2174,0.0,,,,,,,
2,SCARB2,4,76195788,T,C,2174,0.0,,,,,,,
3,SCARB2,4,76213464,C,T,2174,0.00092,34.9293,862.831,1.41402,+,8.76981,7.37501,0.23439
4,SCARB2,4,76213496,C,G,2174,0.00138,64.1473,1291.92,3.18508,+,10.7564,6.02708,0.074314


In [91]:
SV_AJexonic.to_csv(f'{WORK_DIR}/AJ_SCARB2.burdenTESTSV.exonic.SingleScore.assoc.csv')

In [94]:
#Check AMR all_coding variant results
! cat {WORK_DIR}/SCARB2_AMR/AMR_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	3265	18	14	14716.2	0	0.854888


In [95]:
#Check AMR nonsynonymous_lof variant results
! cat {WORK_DIR}/SCARB2_AMR/AMR_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	3265	14	11	15769.7	0	0.502164


In [98]:
#Check CAS all_coding variant results
! cat {WORK_DIR}/SCARB2_CAS/CAS_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	1104	15	10	11529.8	0	0.127544


In [99]:
#Check CAS nonsynonymous_lof variant results
! cat {WORK_DIR}/SCARB2_CAS/CAS_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	1104	10	7	11315.1	0.2	0.090737


In [102]:
#Check EAS all_coding variant results
! cat {WORK_DIR}/SCARB2_EAS/EAS_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	4670	17	14	80648	0	0.0239711


In [103]:
#Check EAS nonsynonymous_lof variant results
! cat {WORK_DIR}/SCARB2_EAS/EAS_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	4670	12	12	80564.9	0	0.0192043


In [None]:
# Run the SKAT test on exonic variants in the EAS ancestry
# Make sure the pheno and covariate files start with these 5 columns: fid, iid, fatid, matid, sex
# The --pheno-name flag only works when the pheno/covar file is structured properly
WORK_DIR = '~/workspace/ws_files/SCARB2/SCARB2_EAS'

! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
--out {WORK_DIR}/EAS_SCARB2.SKAT.exonic \
--kernel skat \
--inVcf {WORK_DIR}/EAS_SCARB2.exonic.vcf.gz \
--pheno /home/jupyter/workspace/ws_files/EAS/EAS_covariate_file_noGBA.txt \
--pheno-name PHENO \
--gene SCARB2 \
--geneFile /home/jupyter/workspace/ws_files/refFlat_HG38.txt \
--covar /home/jupyter/workspace/ws_files/EAS/EAS_covariate_file_noGBA.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--freqUpper 0.01

In [105]:
# look at result
! cat {WORK_DIR}/EAS_SCARB2.SKAT.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	4670	17	14	5.06632e+06	0.0129632	10000	10000	5.06632e+06	137	0	0.0137


In [None]:
# Run the SKAT test on nonsynonymous_lof variants in the EAS ancestry
# Make sure the pheno and covariate files start with these 5 columns: fid, iid, fatid, matid, sex
# The --pheno-name flag only works when the pheno/covar file is structured properly
WORK_DIR = '~/workspace/ws_files/SCARB2/SCARB2_EAS'

! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
--out {WORK_DIR}/EAS_SCARB2.SKAT.nonsynonymous_lof \
--kernel skat \
--inVcf {WORK_DIR}/EAS_SCARB2.nonsynonymous_lof.vcf.gz \
--pheno /home/jupyter/workspace/ws_files/EAS/EAS_covariate_file_noGBA.txt \
--pheno-name PHENO \
--gene SCARB2 \
--geneFile /home/jupyter/workspace/ws_files/refFlat_HG38.txt \
--covar /home/jupyter/workspace/ws_files/EAS/EAS_covariate_file_noGBA.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--freqUpper 0.01

In [107]:
# look at result
! cat {WORK_DIR}/EAS_SCARB2.SKAT.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	4670	12	12	5.0611e+06	0.0107745	10000	10000	5.0611e+06	113	0	0.0113


In [None]:
# Run the single-variant score test (rvtests --single score) on exonic variants in EAS
! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
--inVcf {WORK_DIR}/EAS_SCARB2.exonic.vcf.gz \
--single score \
--pheno /home/jupyter/workspace/ws_files/EAS/EAS_covariate_file_noGBA.txt \
--pheno-name PHENO \
--geneFile /home/jupyter/workspace/ws_files/refFlat_HG38.txt \
--gene SCARB2 \
--out {WORK_DIR}/EAS_SCARB2.burdenTESTSV.exonic \
--covar /home/jupyter/workspace/ws_files/EAS/EAS_covariate_file_noGBA.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--freqUpper 0.01

In [109]:
# Load the single-variant results into a DataFrame (saved as CSV in the next cell)
SV_EASexonic = pd.read_csv(f'{WORK_DIR}/EAS_SCARB2.burdenTESTSV.exonic.SingleScore.assoc', sep = '\t')
# Drop duplicate rows
SV_EASexonic = SV_EASexonic.drop_duplicates()
display(SV_EASexonic)

Unnamed: 0,Gene,CHROM,POS,REF,ALT,N_INFORMATIVE,AF,U,V,STAT,DIRECTION,EFFECT,SE,PVALUE
0,SCARB2,4,76161720,C,T,4670,0.000107,0.701884,31.3742,0.015702,+,0.702537,5.60648,0.90028
1,SCARB2,4,76163238,C,T,4670,0.001285,30.7708,374.137,2.53075,+,2.58277,1.62353,0.111647
2,SCARB2,4,76163358,G,A,4670,0.000107,1.75344,31.3818,0.097973,+,1.75465,5.60579,0.754276
3,SCARB2,4,76166279,T,C,4670,0.000535,17.4393,156.611,1.94194,+,3.49689,2.50937,0.163459
4,SCARB2,4,76168474,C,T,4670,0.000964,-2.95886,281.496,0.031101,-,-0.330087,1.87171,0.860015
5,SCARB2,4,76174224,G,A,4670,0.000749,-2.44731,218.926,0.027358,-,-0.351049,2.1224,0.868628
6,SCARB2,4,76175799,G,C,4670,0.000321,6.95021,93.6762,0.515663,+,2.32994,3.2446,0.472697
7,SCARB2,4,76179549,C,T,4670,0.001071,-42.8865,312.717,5.88153,-,-4.30671,1.77583,0.0153
8,SCARB2,4,76179614,G,A,4670,0.000428,-5.93391,125.4,0.280791,-,-1.486,2.80432,0.596183
9,SCARB2,4,76179625,A,T,4670,0.000321,-7.47759,93.9443,0.595187,-,-2.49958,3.23997,0.44042


In [110]:
SV_EASexonic.to_csv(f'{WORK_DIR}/EAS_SCARB2.burdenTESTSV.exonic.SingleScore.assoc.csv')

In [114]:
WORK_DIR

'~/workspace/ws_files/SCARB2'

In [115]:
#Check FIN nonsynonymous_lof variant results (header only: no gene-level result was produced)
! cat {WORK_DIR}/SCARB2_FIN/FIN_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue


In [116]:
#Check MDE all_coding variant results
! cat {WORK_DIR}/SCARB2_MDE/MDE_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	1230	21	11	10.0292	1	1


In [117]:
#Check MDE nonsynonymous_lof variant results
! cat {WORK_DIR}/SCARB2_MDE/MDE_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	1230	15	7	143.991	1	1


In [118]:
#Check SAS all_coding variant results
! cat {WORK_DIR}/SCARB2_SAS/SAS_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	689	10	6	2310.65	0	0.286168


In [119]:
#Check SAS nonsynonymous_lof variant results
! cat {WORK_DIR}/SCARB2_SAS/SAS_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	689	7	5	3066.05	0.8	0.150845


In [124]:
#Check CAH all_coding variant results
! cat {WORK_DIR}/SCARB2_CAH/CAH_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	900	16	11	152.593	1	1


In [125]:
#Check CAH nonsynonymous_lof variant results
! cat {WORK_DIR}/SCARB2_CAH/CAH_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213899	900	9	7	393.54	1	1
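
The SKAT-O results above are inspected one file at a time. A minimal sketch of collecting them into one table per variant class is given below. `collect_skato` is a hypothetical helper; the directory layout and file naming follow the `cat` commands in this notebook. The demonstration writes three small files into a temporary directory using rows copied from the outputs above (AJ with NA values, EAS, and the header-only FIN file), so it does not touch the real workspace.

```python
import tempfile
from pathlib import Path

import pandas as pd

HEADER = "Gene\tRANGE\tN_INFORMATIVE\tNumVar\tNumPolyVar\tQ\trho\tPvalue\n"
RANGE = "4:76158738-76213824,4:76158738-76213899"


def collect_skato(base_dir, ancestries, variant_class):
    """Stack per-ancestry SkatO.assoc files into one DataFrame (hypothetical helper)."""
    frames = []
    for ancestry in ancestries:
        path = (Path(base_dir) / f"SCARB2_{ancestry}"
                / f"{ancestry}_SCARB2.burden.{variant_class}.SkatO.assoc")
        if not path.exists():
            continue
        df = pd.read_csv(path, sep="\t",
                         dtype={"Q": float, "rho": float, "Pvalue": float})
        if df.empty:  # header-only file, e.g. FIN above
            continue
        df.insert(0, "ANCESTRY", ancestry)
        frames.append(df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Demonstration with rows copied from the nonsynonymous_lof outputs above
demo_rows = {
    "AJ": f"SCARB2\t{RANGE}\t2174\t4\t2\tNA\tNA\tNA\n",
    "EAS": f"SCARB2\t{RANGE}\t4670\t12\t12\t80564.9\t0\t0.0192043\n",
    "FIN": "",
}
with tempfile.TemporaryDirectory() as tmp:
    for ancestry, row in demo_rows.items():
        out_dir = Path(tmp) / f"SCARB2_{ancestry}"
        out_dir.mkdir()
        name = f"{ancestry}_SCARB2.burden.nonsynonymous_lof.SkatO.assoc"
        (out_dir / name).write_text(HEADER + row)
    table = collect_skato(tmp, ["AJ", "EAS", "FIN"], "nonsynonymous_lof")

print(table[["ANCESTRY", "N_INFORMATIVE", "NumVar", "NumPolyVar", "Pvalue"]])
```

Keeping the NA rows (such as AJ nonsynonymous_lof) in the table makes it visible which ancestries had too few polymorphic variants to produce a test result.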


LD pruning(in EUR)

In [126]:
WORK_DIR = "~/workspace/ws_files/"

In [127]:
# Make sure to use high-quality SNPs
! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/SCARB2/SCARB2_EUR/EUR_SCARB2 \
--maf 0.01 \
--geno 0.05 \
--hwe 1e-5 0.001 \
--make-bed \
--keep /home/jupyter/workspace/ws_files/EUR/EUR.noGBA.samplestoKeep \
--exclude {WORK_DIR}/exclusion_regions_hg38.txt \
--out {WORK_DIR}/SCARB2/SCARB2_UNIMPUTED

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files//SCARB2/SCARB2_UNIMPUTED.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files//SCARB2/SCARB2_EUR/EUR_SCARB2
  --exclude /home/jupyter/workspace/ws_files//exclusion_regions_hg38.txt
  --geno 0.05
  --hwe 1e-5 0.001
  --keep /home/jupyter/workspace/ws_files/EUR/EUR.noGBA.samplestoKeep
  --maf 0.01
  --make-bed
  --out /home/jupyter/workspace/ws_files//SCARB2/SCARB2_UNIMPUTED

Start time: Sat Jul 26 12:28:25 2025
26046 MiB RAM detected, ~24817 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
58823 samples (25749 females, 33074 males; 58823 founders) loaded from
/home/jupyter/workspace/ws_files//SCARB2/SCARB2_EUR/EUR_SCARB2.fam.
7102 variants loaded from
/home/jupyter/workspace/ws_files//SCARB2/SCARB2_EUR/EUR_SCARB2.bim.
1 binary phenotype loade

In [128]:
# Prune out unnecessary SNPs (only need to do this to generate PCs)
! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/SCARB2/SCARB2_UNIMPUTED \
--indep-pairwise 50 5 0.5 \
--out {WORK_DIR}/SCARB2/prune_SCARB2

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files//SCARB2/prune_SCARB2.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files//SCARB2/SCARB2_UNIMPUTED
  --indep-pairwise 50 5 0.5
  --out /home/jupyter/workspace/ws_files//SCARB2/prune_SCARB2

Start time: Sat Jul 26 12:29:00 2025
26046 MiB RAM detected, ~24777 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
53641 samples (23385 females, 30256 males; 53641 founders) loaded from
/home/jupyter/workspace/ws_files//SCARB2/SCARB2_UNIMPUTED.fam.
920 variants loaded from
/home/jupyter/workspace/ws_files//SCARB2/SCARB2_UNIMPUTED.bim.
1 binary phenotype loaded (24208 cases, 9662 controls).
Calculating allele frequencies... done.
--indep-pairwise (1 compute thread): 779/920 variants removed.
Variant lists written to
/home/jupyter/workspace/ws_files//SCAR

In [129]:
!wc -l {WORK_DIR}/SCARB2/prune_SCARB2.prune.in

141 /home/jupyter/workspace/ws_files//SCARB2/prune_SCARB2.prune.in


In [130]:
# Bonferroni threshold: 0.05 / 141 independent (LD-pruned) variants = 3.55E-4
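
The value 3.55E-4 matches 0.05 divided by the 141 variants retained in prune_SCARB2.prune.in. A minimal sketch of deriving the threshold from the file instead of typing it by hand is shown below. `bonferroni_from_prune_in` is a hypothetical helper, alpha = 0.05 is an assumption, and the demonstration uses a temporary file with 141 placeholder IDs, not the real variant list.

```python
import tempfile
from pathlib import Path


def bonferroni_from_prune_in(prune_in_path, alpha=0.05):
    """Return (threshold, n_variants) for a PLINK .prune.in file (hypothetical helper)."""
    with open(prune_in_path) as handle:
        n_variants = sum(1 for line in handle if line.strip())
    if n_variants == 0:
        raise ValueError(f"no variants listed in {prune_in_path}")
    return alpha / n_variants, n_variants


# Demonstration: 141 placeholder IDs stand in for the real prune.in content
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "prune_SCARB2.prune.in"
    demo_file.write_text("\n".join(f"placeholder_{i}" for i in range(141)) + "\n")
    threshold, n_variants = bonferroni_from_prune_in(demo_file)

print(f"{n_variants} variants -> threshold {threshold:.2e}")
```

The returned value can be passed to `--cojo-p` so that the threshold stays in sync with the pruning output.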

zoom locus plot

In [2]:
## locus zoom format
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'

    df = pd.read_csv(f'{WORK_DIR}/{ancestry}.all_adj.csv')
    df['b'] = np.log(df['OR'])
    export_ldassoc = df[['#CHROM', 'POS', 'REF', 'ALT', 'A1_FREQ', 'b', 'LOG(OR)_SE', 'P']].copy()
    export_ldassoc = export_ldassoc.rename(columns = {'#CHROM': 'CHROM', 'A1_FREQ':'freq', 'LOG(OR)_SE':'se', 'P':'p'})
    # save to files
    export_ldassoc.to_csv(f'{WORK_DIR}/{ancestry}.all_adj.formatted.tab', sep = '\t', index=False)
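
Before exporting, it can help to confirm that `b` and `se` are on the same scale by recomputing a two-sided Wald p-value from them and comparing it with the reported `p`. The sketch below assumes the reported p-values come from a Wald test on log(OR); if they come from a different test (for example Firth regression), some disagreement is expected. `wald_p` is a hypothetical helper and the small table holds illustrative values, not results from this analysis.

```python
import math

import numpy as np
import pandas as pd


def wald_p(b: float, se: float) -> float:
    """Two-sided normal p-value for z = b / se."""
    z = abs(b / se)
    return math.erfc(z / math.sqrt(2))


# Illustrative values only (not taken from the GP2 results)
toy = pd.DataFrame({
    "OR": [1.0, 1.5, 0.8],
    "se": [0.10, 0.10, 0.20],
})
toy["b"] = np.log(toy["OR"])  # same transformation as in the cell above
toy["p_wald"] = [wald_p(b, se) for b, se in zip(toy["b"], toy["se"])]

print(toy)
```

A large gap between the recomputed and reported p-values would suggest that the effect column is still on the odds-ratio scale or that the standard error belongs to a different model.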

GCTA cojo analysis

In [132]:
## cojo format
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'

    sumstats = pd.read_csv(f'{WORK_DIR}/{ancestry}.all_adj.csv')
    #Format summary statistics for GCTA-COJO
    #First get the log odds ratio - this is required for COJO
    #1) For a case-control study, the effect size should be log(odds ratio) with its corresponding standard error.
    sumstats_formatted = sumstats.copy()
    sumstats_formatted['b'] = np.log(sumstats_formatted['OR'])

    #Now select just the necessary columns for COJO
    sumstats_export = sumstats_formatted[['ID', 'A1', 'OMITTED', 'A1_FREQ', 'b', 'LOG(OR)_SE', 'P', 'OBS_CT']].copy()

    #Rename columns following COJO format
    sumstats_export = sumstats_export.rename(columns = {'ID':'SNP', 'OMITTED':'A2', 'A1_FREQ':'freq', 'LOG(OR)_SE':'se', 'P':'p', 'OBS_CT':'N'})

    #Export
    sumstats_export.to_csv(f'{WORK_DIR}/{ancestry}.all_adj.sumstats.ma', sep = '\t', index=False)
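
GCTA-COJO expects the summary file columns in the order SNP, A1, A2, freq, b, se, p, N. A minimal sketch of a sanity check that could be run on each table before export is shown below. `check_cojo_table` is a hypothetical helper and the example rows are illustrative, not taken from the GP2 results; the specific checks (non-finite values, duplicate IDs, non-positive standard errors) are my own choice of common problems, for example `np.log` of an odds ratio of 0 or a missing OR.

```python
import numpy as np
import pandas as pd

COJO_COLUMNS = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]


def check_cojo_table(ma: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems found in a COJO-style table."""
    if list(ma.columns) != COJO_COLUMNS:
        return [f"unexpected columns: {list(ma.columns)}"]
    problems = []
    numeric = ma[["freq", "b", "se", "p", "N"]].apply(pd.to_numeric, errors="coerce")
    bad_rows = ~np.isfinite(numeric.to_numpy(dtype=float)).all(axis=1)
    if bad_rows.any():
        problems.append(f"{int(bad_rows.sum())} row(s) with missing or non-finite values")
    if ma["SNP"].duplicated().any():
        problems.append("duplicate SNP identifiers")
    if (numeric["se"] <= 0).any():
        problems.append("non-positive standard errors")
    return problems


# Illustrative rows only (not taken from the GP2 results)
toy = pd.DataFrame({
    "SNP": ["var1", "var2", "var2"],
    "A1": ["A", "T", "T"],
    "A2": ["G", "C", "C"],
    "freq": [0.12, 0.03, 0.03],
    "b": [0.12, np.nan, -0.05],
    "se": [0.04, 0.05, 0.0],
    "p": [0.003, 0.5, 0.9],
    "N": [1000, 1000, 1000],
})
problems = check_cojo_table(toy)
print(problems)
```

Rows flagged here can be dropped or inspected before writing the `.ma` file, so that unexpected input does not reach GCTA unnoticed.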

In [133]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    
    #Select independently associated SNPs with COJO at p < 3.55e-4 (0.05 / 141 LD-pruned variants)
    #Change --cojo-p if a different significance threshold is needed
    #--bfile refers to the full dataset in PLINK binary format, e.g. GP2 - whatever you used to run the GWAS
    ! /home/jupyter/tools/gcta-1.95.0-linux-kernel-3-x86_64/gcta64 \
    --bfile {WORK_DIR}/{ancestry}_SCARB2 \
    --maf 0.01 \
    --cojo-file {WORK_DIR}/{ancestry}.all_adj.sumstats.ma \
    --cojo-p 3.55e-4 \
    --cojo-slct \
    --out {WORK_DIR}/{ancestry}.all_adj.ldprune.COJO

*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version v1.95.0 Linux
* Built at Jul 21 2025 17:30:18, by GCC 8.4
* (C) 2010-present, Yang Lab, Westlake University
* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
*******************************************************************
Analysis started at 12:32:55 UTC on Sat Jul 26 2025.
Hostname: 4b8bf4fabeb0

Accepted options:
--bfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2
--maf 0.01
--cojo-file /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS.all_adj.sumstats.ma
--cojo-p 0.000355
--cojo-slct
--out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS.all_adj.ldprune.COJO


Reading PLINK FAM file from [/home/jupyter/workspace/ws_files/SCARB2/SCARB2_SAS/SAS_SCARB2.fam].
945 individuals to be included from [/home/jupyter/works