SCARB2 - Single gene analysis in GP2 Neurobooster genotyping data (all ancestries)

Project: GP2 SCARB2

Version: Python/3.10.15, R/4.3.3

Notebook Overview

1. Description
Loading Python libraries
Set paths
Make working directory

2. Installing packages

3. LD pruning (TMEM175/SCARB2/CTSB)

4. Create a covariate file with GP2 data

5. Annotation of the gene SCARB2

6. Association analysis to compare allele frequencies between cases and controls

7. GLM analysis adjusting for sex, age, PC1-5

8. Burden tests (SKAT-O, SKAT, CMC, Zeggini, MB, FP, CMC Wald)

9. Conditional analysis

Loading Python libraries

In [1]:
# Use pathlib for file path manipulation
import pathlib

# Import NumPy
import numpy as np

# Import pandas for tabular data
import pandas as pd

# Import plotnine: a ggplot2-style Python plotting package
from plotnine import *

# Always show all columns in a Pandas DataFrame
pd.set_option('display.max_columns', None)

Set paths

In [None]:
REL7_PATH = pathlib.Path(pathlib.Path.home(), 'workspace/gp2_tier2_eu_release7_30042024')
!ls -hal {REL7_PATH}

Make working directory

In [3]:
! mkdir ~/workspace/ws_files/SCARB2

mkdir: cannot create directory ‘/home/jupyter/workspace/ws_files/SCARB2’: File exists


In [4]:
WORK_DIR = "~/workspace/ws_files/SCARB2/"

In [4]:
# Make sure all tools are installed
! ls /home/jupyter/tools

LICENSE				   plink			    prettify
annovar				   plink2			    rvtests
annovar.latest.tar.gz		   plink2_linux_x86_64_latest.zip   toy.map
gcta-1.94.1-linux-kernel-3-x86_64  plink_linux_x86_64_20190304.zip  toy.ped


In [9]:
# Make sure the tool binaries are executable
! chmod u+x /home/jupyter/tools/plink
! chmod u+x /home/jupyter/tools/plink2
! chmod 777 /home/jupyter/tools/rvtests/executable/rvtest

In [6]:
%%bash
# making working directory
#Loop over all the ancestries
for ancestry in {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'} ;
do

#Make a folder for each ancestry
mkdir ~/workspace/ws_files/SCARB2/SCARB2_"$ancestry"

done
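The same folder layout can also be created from Python in a way that does not fail when a directory already exists. The following is a minimal sketch, not part of the original pipeline; `make_ancestry_dirs` is a hypothetical helper name, and the base directory is passed in by the caller.

```python
import pathlib

ANCESTRIES = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']


def make_ancestry_dirs(base_dir, gene='SCARB2', ancestries=ANCESTRIES):
    """Create <base_dir>/<gene>_<ancestry> for every ancestry and return the paths."""
    base = pathlib.Path(base_dir).expanduser()  # expand a leading '~'
    created = []
    for ancestry in ancestries:
        path = base / f'{gene}_{ancestry}'
        # parents=True creates missing parent folders; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

In this notebook it would be called as `make_ancestry_dirs('~/workspace/ws_files/SCARB2')`.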

LD pruning (in EUR)

In [9]:
WORK_DIR = "~/workspace/ws_files/"

In [6]:
# Make sure to use high-quality SNPs
! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/TMEM175/TMEM175_EUR/EUR_TMEM175 \
--maf 0.01 \
--geno 0.05 \
--hwe 1E-6 \
--make-bed \
--exclude {WORK_DIR}/exclusion_regions_hg38.txt \
--out TMEM175_UNIMPUTED

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to TMEM175_UNIMPUTED.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files//TMEM175/TMEM175_EUR/EUR_TMEM175
  --exclude /home/jupyter/workspace/ws_files//exclusion_regions_hg38.txt
  --geno 0.05
  --hwe 1E-6
  --maf 0.01
  --make-bed
  --out TMEM175_UNIMPUTED

Start time: Tue Aug 20 12:38:19 2024
52223 MiB RAM detected, ~50840 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
38839 samples (15543 females, 23296 males; 38839 founders) loaded from
/home/jupyter/workspace/ws_files//TMEM175/TMEM175_EUR/EUR_TMEM175.fam.
9504 variants loaded from
/home/jupyter/workspace/ws_files//TMEM175/TMEM175_EUR/EUR_TMEM175.bim.
1 binary phenotype loaded (21198 cases, 9214 controls).
--exclude: 9504 variants remaining.
Calculating allele frequencies... done.
--geno: 4576 variants removed due to mi

In [7]:
# Prune out unnecessary SNPs (only need to do this to generate PCs)
! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/LDpruning/TMEM175_UNIMPUTED \
--indep-pairwise 50 5 0.5 \
--out {WORK_DIR}/LDpruning/prune_TMEM175

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files//LDpruning/prune_TMEM175.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files//LDpruning/TMEM175_UNIMPUTED
  --indep-pairwise 50 5 0.5
  --out /home/jupyter/workspace/ws_files//LDpruning/prune_TMEM175

Start time: Tue Aug 20 12:40:52 2024
52223 MiB RAM detected, ~50788 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
38839 samples (15543 females, 23296 males; 38839 founders) loaded from
/home/jupyter/workspace/ws_files//LDpruning/TMEM175_UNIMPUTED.fam.
366 variants loaded from
/home/jupyter/workspace/ws_files//LDpruning/TMEM175_UNIMPUTED.bim.
1 binary phenotype loaded (21198 cases, 9214 controls).
Calculating allele frequencies... done.
--indep-pairwise (1 compute thread): 257/366 variants removed.
Variant lists written to
/home/jupyter/wo

In [8]:
!wc -l {WORK_DIR}/LDpruning/prune_TMEM175.prune.in

109 /home/jupyter/workspace/ws_files//LDpruning/prune_TMEM175.prune.in
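To make the `--indep-pairwise 50 5 0.5` arguments more concrete (window of 50 variants, step of 5 variants, r² threshold of 0.5), here is a simplified greedy pruning sketch on a toy genotype matrix. It illustrates the idea only: PLINK's actual procedure differs in detail (for example in how the window advances and which member of a correlated pair is dropped), and `ld_prune` is a hypothetical function name.

```python
import numpy as np


def ld_prune(genotypes, window=50, r2_threshold=0.5):
    """Greedy LD pruning on a (samples x variants) dosage matrix.

    A variant is kept unless its squared correlation with an already kept
    variant fewer than `window` positions upstream exceeds `r2_threshold`.
    The step size used by PLINK is ignored in this simplified version.
    """
    kept = []
    for j in range(genotypes.shape[1]):
        redundant = False
        for k in kept:
            if j - k >= window:
                continue  # outside the window, not compared
            r = np.corrcoef(genotypes[:, j], genotypes[:, k])[0, 1]
            if r ** 2 > r2_threshold:
                redundant = True
                break
        if not redundant:
            kept.append(j)
    return kept


# Toy data: variant 1 duplicates variant 0 (r^2 = 1); variant 2 is uncorrelated with variant 0
genotypes = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [2, 2, 0],
    [0, 0, 2],
    [1, 1, 2],
    [2, 2, 2],
])
kept = ld_prune(genotypes)
print(kept)  # indices of variants retained, analogous to the .prune.in list
```

The number of retained indices plays the role of the line count of the `.prune.in` file checked with `wc -l`.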


In [14]:
# SCARB2 LD pruning
# Make sure to use high-quality SNPs
! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/SCARB2/SCARB2_EUR/EUR_SCARB2 \
--maf 0.01 \
--geno 0.05 \
--hwe 1E-6 \
--make-bed \
--exclude {WORK_DIR}/LDpruning/exclusion_regions_hg38.txt \
--out {WORK_DIR}/LDpruning/SCARB2_UNIMPUTED

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files//LDpruning/SCARB2_UNIMPUTED.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files//SCARB2/SCARB2_EUR/EUR_SCARB2
  --exclude /home/jupyter/workspace/ws_files//LDpruning/exclusion_regions_hg38.txt
  --geno 0.05
  --hwe 1E-6
  --maf 0.01
  --make-bed
  --out /home/jupyter/workspace/ws_files//LDpruning/SCARB2_UNIMPUTED

Start time: Tue Aug 20 13:59:30 2024
52223 MiB RAM detected, ~50600 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
38839 samples (15543 females, 23296 males; 38839 founders) loaded from
/home/jupyter/workspace/ws_files//SCARB2/SCARB2_EUR/EUR_SCARB2.fam.
9570 variants loaded from
/home/jupyter/workspace/ws_files//SCARB2/SCARB2_EUR/EUR_SCARB2.bim.
1 binary phenotype loaded (21198 cases, 9214 controls).
--exclude: 9570 variants rem

In [15]:
# Prune out unnecessary SNPs (only need to do this to generate PCs)
! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/LDpruning/SCARB2_UNIMPUTED \
--indep-pairwise 50 5 0.5 \
--out {WORK_DIR}/LDpruning/prune_SCARB2

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files//LDpruning/prune_SCARB2.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files//LDpruning/SCARB2_UNIMPUTED
  --indep-pairwise 50 5 0.5
  --out /home/jupyter/workspace/ws_files//LDpruning/prune_SCARB2

Start time: Tue Aug 20 13:59:36 2024
52223 MiB RAM detected, ~50591 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
38839 samples (15543 females, 23296 males; 38839 founders) loaded from
/home/jupyter/workspace/ws_files//LDpruning/SCARB2_UNIMPUTED.fam.
919 variants loaded from
/home/jupyter/workspace/ws_files//LDpruning/SCARB2_UNIMPUTED.bim.
1 binary phenotype loaded (21198 cases, 9214 controls).
Calculating allele frequencies... done.
--indep-pairwise (1 compute thread): 776/919 variants removed.
Variant lists written to
/home/jupyter/workspa

In [16]:
!wc -l {WORK_DIR}/LDpruning/prune_SCARB2.prune.in

143 /home/jupyter/workspace/ws_files//LDpruning/prune_SCARB2.prune.in


In [18]:
# CTSB LD pruning
# Make sure to use high-quality SNPs
! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/CTSB_EUR/EUR_CTSB \
--maf 0.01 \
--geno 0.05 \
--hwe 1E-6 \
--make-bed \
--exclude {WORK_DIR}/LDpruning/exclusion_regions_hg38.txt \
--out {WORK_DIR}/LDpruning/CTSB_UNIMPUTED

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files//LDpruning/CTSB_UNIMPUTED.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files//CTSB_EUR/EUR_CTSB
  --exclude /home/jupyter/workspace/ws_files//LDpruning/exclusion_regions_hg38.txt
  --geno 0.05
  --hwe 1E-6
  --maf 0.01
  --make-bed
  --out /home/jupyter/workspace/ws_files//LDpruning/CTSB_UNIMPUTED

Start time: Tue Aug 20 14:02:15 2024
52223 MiB RAM detected, ~50614 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
38839 samples (15543 females, 23296 males; 38839 founders) loaded from
/home/jupyter/workspace/ws_files//CTSB_EUR/EUR_CTSB.fam.
13481 variants loaded from
/home/jupyter/workspace/ws_files//CTSB_EUR/EUR_CTSB.bim.
1 binary phenotype loaded (21198 cases, 9214 controls).
--exclude: 13481 variants remaining.
Calculating allele frequenc

In [19]:
# Prune out unnecessary SNPs (only need to do this to generate PCs)
! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/LDpruning/CTSB_UNIMPUTED \
--indep-pairwise 50 5 0.5 \
--out {WORK_DIR}/LDpruning/prune_CTSB

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files//LDpruning/prune_CTSB.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files//LDpruning/CTSB_UNIMPUTED
  --indep-pairwise 50 5 0.5
  --out /home/jupyter/workspace/ws_files//LDpruning/prune_CTSB

Start time: Tue Aug 20 14:02:51 2024
52223 MiB RAM detected, ~50604 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
38839 samples (15543 females, 23296 males; 38839 founders) loaded from
/home/jupyter/workspace/ws_files//LDpruning/CTSB_UNIMPUTED.fam.
594 variants loaded from
/home/jupyter/workspace/ws_files//LDpruning/CTSB_UNIMPUTED.bim.
1 binary phenotype loaded (21198 cases, 9214 controls).
Calculating allele frequencies... done.
--indep-pairwise (1 compute thread): 477/594 variants removed.
Variant lists written to
/home/jupyter/workspace/ws_file

In [20]:
!wc -l {WORK_DIR}/LDpruning/prune_CTSB.prune.in

117 /home/jupyter/workspace/ws_files//LDpruning/prune_CTSB.prune.in


Create a covariate file with GP2 data

In [7]:
CLINICAL_DATA_PATH = pathlib.Path(REL7_PATH, 'clinical_data/master_key_release7_final_vwb.csv')

In [None]:
# Let's load the master key
key = pd.read_csv(CLINICAL_DATA_PATH, low_memory=False)
print(key.shape)
key.head()

In [None]:
# Subset to keep only a few columns and rename them
# (chaining .rename avoids modifying a slice in place, which triggers SettingWithCopyWarning)
key = key[['GP2sampleID', 'baseline_GP2_phenotype_for_qc', 'biological_sex_for_qc', 'age_at_sample_collection', 'age_of_onset', 'label']].rename(
    columns={'GP2sampleID': 'IID',
             'baseline_GP2_phenotype_for_qc': 'phenotype',
             'biological_sex_for_qc': 'SEX',
             'age_at_sample_collection': 'AGE',
             'age_of_onset': 'AAO'})
key

In [10]:
! pwd

/home/jupyter/workspace/ws_files


In [11]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    ## Subset to keep ancestry of interest 
    ancestry_key = key[key['label']==ancestry].copy()
    ancestry_key = ancestry_key.reset_index(drop=True)
    
    # Load information about related individuals in the ancestry analyzed
    related_df = pd.read_csv(f'{REL7_PATH}/meta_data/related_samples/{ancestry}_release7_vwb.related')
    print(f'Related individuals: {related_df.shape}')
    
    # List one member (IID1) of each related pair
    related_list = list(related_df['IID1'])
    
    # Remove that member of each pair so only unrelated individuals remain
    ancestry_key = ancestry_key[~ancestry_key["IID"].isin(related_list)]
    
    # Check size
    print(f'Unrelated individuals: {ancestry_key.shape}')
    
    # Recode phenotype following the PLINK convention: PD = 2, control = 1
    # Any other phenotype becomes missing (<NA>), not -9
    pheno_mapping = {"PD": 2, "Control": 1}
    ancestry_key['PHENO'] = ancestry_key['phenotype'].map(pheno_mapping).astype('Int64')
    
    # Check value counts of pheno
    ancestry_key['PHENO'].value_counts(dropna=False)
    
    ## Get the PCs
    pcs = pd.read_csv(f'{REL7_PATH}/meta_data/qc_metrics/projected_pcs_vwb.csv')
    
    # Select just the first 5 PCs
    selected_columns = ['IID', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
    pcs = pd.DataFrame(data=pcs.iloc[:, 1:7].values, columns=selected_columns)
    
    # Drop the first row (since it's now the column names)
    pcs = pcs.drop(0)
    
    # Reset the index to remove any potential issues
    pcs = pcs.reset_index(drop=True)
    
    # Check size
    print(f'PCs: {pcs.shape}')
    
    # Check value counts of SEX
    sex_og_values = ancestry_key['SEX'].value_counts(dropna=False)
    print(f'Sex value counts - original:\n {sex_og_values.to_string()}')
    
    # Recode sex following the PLINK convention: female = 2, male = 1
    # Any other value becomes missing (<NA>), not -9
    sex_mapping = {"Female": 2, "Male": 1}
    ancestry_key['SEX'] = ancestry_key['SEX'].map(sex_mapping).astype('Int64')
    
    # Check value counts of SEX after recoding
    sex_recode_values = ancestry_key['SEX'].value_counts(dropna=False)
    print(f'Sex value counts - recoded:\n{sex_recode_values.to_string()}')
    
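    ## Make covariate file
    # Inner merge keeps only samples in this ancestry's unrelated set that also have PCs;
    # a left merge would carry every sample in the PC file into the keep list
    df = pd.merge(pcs, ancestry_key, on='IID', how='inner')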
    print(f'Check columns for covariate file: {df.columns}')
    
    #Make additional columns - FID, fatid and matid - these are needed for RVtests!!
    #RVtests needs the first 5 columns to be fid, iid, fatid, matid and sex otherwise it does not run correctly
    #Uppercase column name is ok
    #See https://zhanxw.github.io/rvtests/#phenotype-file
    df['FID'] = 0
    df['FATID'] = 0
    df['MATID'] = 0
    
    ## Clean up and keep columns we need 
    final_df = df[['FID','IID', 'FATID', 'MATID', 'SEX', 'AGE', 'PHENO','PC1', 'PC2', 'PC3', 'PC4', 'PC5']].copy()
    
    ##DO NOT replace missing values with -9 as this is misinterpreted by RVtests - needs to be nonnumeric
    #Leave missing values as NA
    
    #Check number of PD cases missing age
    pd_missAge = final_df[(final_df['PHENO']==2)&(final_df['AGE'].isna())]
    print(f'Number of PD cases missing age: {pd_missAge.shape[0]}')
    
    #Check number of controls missing age
    control_missAge = final_df[(final_df['PHENO']==1)&(final_df['AGE'].isna())]
    print(f'Number of controls missing age: {control_missAge.shape[0]}')
    
    ## Make file of sample IDs to keep 
    samples_toKeep = final_df[['FID', 'IID']].copy()
    samples_toKeep.to_csv(f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}/{ancestry}.samplestoKeep', sep = '\t', index=False, header=None)
    
    ## Make your covariate file
    #Included na_rep to write out missing/NA values explicitly as string/text, not as blank otherwise they are misread in RVtests
    final_df.to_csv(f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}/{ancestry}_covariate_file.txt', sep = '\t', na_rep='NA', index=False)

WORKING ON: AJ
Related individuals: (245, 9)
Unrelated individuals: (2675, 6)
PCs: (58209, 6)
Sex value counts - original:
 SEX
Male                          1655
Female                        1009
Other/Unknown/Not Reported      11
Sex value counts - recoded:
SEX
1       1655
2       1009
<NA>      11
Check columns for covariate file: Index(['IID', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'phenotype', 'SEX', 'AGE',
       'AAO', 'label', 'PHENO'],
      dtype='object')
Number of PD cases missing age: 95
Number of controls missing age: 195
WORKING ON: EAS
Related individuals: (350, 9)
Unrelated individuals: (5303, 6)
PCs: (58209, 6)
Sex value counts - original:
 SEX
Male                          3398
Female                        1899
Other/Unknown/Not Reported       6
Sex value counts - recoded:
SEX
1       3398
2       1899
<NA>       6
Check columns for covariate file: Index(['IID', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'phenotype', 'SEX', 'AGE',
       'AAO', 'label', 'PHENO'],
      dtype=
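The covariate-building steps above can be condensed into one function and checked on toy data. This is a minimal sketch under stated assumptions: `build_covariates` is a hypothetical helper, the sample IDs and values are made up, only two PCs are used for brevity, and an inner merge is used so that only samples present in both tables are written.

```python
import io

import pandas as pd


def build_covariates(key, pcs, related_ids):
    """Return an RVtests-style covariate table for unrelated samples that have PCs."""
    key = key[~key['IID'].isin(related_ids)].copy()
    # PLINK convention: case = 2, control = 1, female = 2, male = 1; anything else is <NA>
    key['PHENO'] = key['phenotype'].map({'PD': 2, 'Control': 1}).astype('Int64')
    key['SEX'] = key['SEX'].map({'Female': 2, 'Male': 1}).astype('Int64')
    df = pcs.merge(key, on='IID', how='inner')
    # RVtests expects FID, IID, FATID, MATID and SEX as the first five columns
    df['FID'] = 0
    df['FATID'] = 0
    df['MATID'] = 0
    return df[['FID', 'IID', 'FATID', 'MATID', 'SEX', 'AGE', 'PHENO', 'PC1', 'PC2']]


# Toy inputs (made-up values)
key = pd.DataFrame({
    'IID': ['S1', 'S2', 'S3', 'S4'],
    'phenotype': ['PD', 'Control', 'PD', 'Other'],
    'SEX': ['Male', 'Female', 'Male', 'Female'],
    'AGE': [65.0, None, 70.0, 58.0],
})
pcs = pd.DataFrame({
    'IID': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'PC1': [0.1, 0.2, 0.3, 0.4, 0.5],
    'PC2': [0.0, -0.1, 0.2, -0.3, 0.1],
})

cov = build_covariates(key, pcs, related_ids=['S3'])

# Write missing values explicitly as the text 'NA', as in the notebook
buffer = io.StringIO()
cov.to_csv(buffer, sep='\t', na_rep='NA', index=False)
text = buffer.getvalue()
print(text)
```

Sample S3 is dropped as related, S5 is dropped because it has no clinical record, and the missing age and unmapped phenotype are written as `NA`.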

Annotation of the gene

Extract the region using PLINK

Extract SCARB2 gene

SCARB2 coordinates: Chromosome 4: 76,158,737-76,234,536 (GRCh38/hg38). The extraction below uses this interval extended by 50 kb on each side (76,108,737-76,284,536).
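The `--from-bp` and `--to-bp` values can be derived from the gene coordinates rather than typed by hand. A minimal sketch, assuming a symmetric 50,000 bp flank, which matches the values used in the PLINK commands in this section; `region_with_flank` is a hypothetical helper.

```python
GENE_START = 76_158_737  # SCARB2 start (GRCh38/hg38), as given above
GENE_END = 76_234_536    # SCARB2 end (GRCh38/hg38), as given above
FLANK = 50_000           # assumed flank size in base pairs


def region_with_flank(start, end, flank):
    """Extend a 1-based interval by `flank` bp on both sides, never going below 1."""
    return max(1, start - flank), end + flank


from_bp, to_bp = region_with_flank(GENE_START, GENE_END, FLANK)
print(from_bp, to_bp)  # 76108737 76284536
```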

In [16]:
# Generate PLINK binary files (bed/bim/fam) for the EUR ancestry

WORK_DIR = "~/workspace/ws_files/SCARB2/"

! /home/jupyter/tools/plink2 \
--pfile {REL7_PATH}/imputed_genotypes/EUR/chr4_EUR_release7_vwb \
--chr 4 \
--from-bp 76108737 \
--to-bp 76284536 \
--make-bed \
--out {WORK_DIR}/SCARB2_EUR/EUR_SCARB2

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2//SCARB2_EUR/EUR_SCARB2.log.
Options in effect:
  --chr 4
  --from-bp 76108737
  --make-bed
  --out /home/jupyter/workspace/ws_files/SCARB2//SCARB2_EUR/EUR_SCARB2
  --pfile /home/jupyter/workspace/gp2_tier2_eu_release7_30042024/imputed_genotypes/EUR/chr4_EUR_release7_vwb
  --to-bp 76284536

Start time: Tue Aug 27 14:03:25 2024
52223 MiB RAM detected, ~50462 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
38839 samples (15543 females, 23296 males; 38839 founders) loaded from
/home/jupyter/workspace/gp2_tier2_eu_release7_30042024/imputed_genotypes/EUR/chr4_EUR_release7_vwb.psam.
10419161 variants loaded from
/home/jupyter/workspace/gp2_tier2_eu_release7_30042024/imputed_genotypes/EUR/chr4_EUR_release7_vwb.pvar.
1 binary phenotype loaded (21198 cases,

In [12]:
## extract region using plink
for ancestry in ancestries:
    
    WORK_DIR = "~/workspace/ws_files/SCARB2/"

    ! /home/jupyter/tools/plink2 \
    --pfile {REL7_PATH}/imputed_genotypes/{ancestry}/chr4_{ancestry}_release7_vwb \
    --chr 4 \
    --from-bp 76108737 \
    --to-bp 76284536 \
    --make-bed \
    --out {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_SCARB2

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2.log.
Options in effect:
  --chr 4
  --from-bp 76108737
  --make-bed
  --out /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2
  --pfile /home/jupyter/workspace/gp2_tier2_eu_release7_30042024/imputed_genotypes/AJ/chr4_AJ_release7_vwb
  --to-bp 76284536

Start time: Tue Aug 20 09:35:27 2024
52223 MiB RAM detected, ~50313 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
2655 samples (1007 females, 1648 males; 2655 founders) loaded from
/home/jupyter/workspace/gp2_tier2_eu_release7_30042024/imputed_genotypes/AJ/chr4_AJ_release7_vwb.psam.
1669760 variants loaded from
/home/jupyter/workspace/gp2_tier2_eu_release7_30042024/imputed_genotypes/AJ/chr4_AJ_release7_vwb.pvar.
1 binary phenotype loaded (1292 cases, 411 controls).


In [13]:
# Visualize bim file
! head {WORK_DIR}/SCARB2_EUR/EUR_SCARB2.bim

4	chr4:76108753:T:G	0	76108753	G	T
4	chr4:76108755:T:C	0	76108755	C	T
4	chr4:76108761:G:A	0	76108761	A	G
4	chr4:76108794:G:A	0	76108794	A	G
4	chr4:76108804:T:A	0	76108804	A	T
4	chr4:76108804:T:C	0	76108804	C	T
4	chr4:76108807:A:C	0	76108807	C	A
4	chr4:76108828:G:T	0	76108828	T	G
4	chr4:76108844:G:A	0	76108844	A	G
4	chr4:76108848:C:T	0	76108848	T	C
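For checks beyond `head`, the `.bim` file can be loaded into pandas. The sketch below is illustrative: the column names are labels chosen here (a `.bim` file has no header), `read_bim` is a hypothetical helper, and the three example rows are copied from the output above. It verifies that each variant ID of the form `chr:pos:ref:alt` agrees with the position and allele columns, and lists positions that carry more than one alternate allele.

```python
import io

import pandas as pd

BIM_COLUMNS = ['CHR', 'ID', 'CM', 'POS', 'A1', 'A2']  # labels assumed for this sketch


def read_bim(path_or_buffer):
    """Read a tab-delimited PLINK .bim file into a DataFrame."""
    return pd.read_csv(path_or_buffer, sep='\t', header=None,
                       names=BIM_COLUMNS, dtype={'CHR': str})


example = io.StringIO(
    '4\tchr4:76108753:T:G\t0\t76108753\tG\tT\n'
    '4\tchr4:76108804:T:A\t0\t76108804\tA\tT\n'
    '4\tchr4:76108804:T:C\t0\t76108804\tC\tT\n'
)
bim = read_bim(example)

# Split the ID into its four parts and compare with the POS / allele columns
id_parts = bim['ID'].str.split(':', expand=True)
id_parts.columns = ['id_chr', 'id_pos', 'id_ref', 'id_alt']
consistent = (
    (id_parts['id_pos'].astype(int) == bim['POS'])
    & (id_parts['id_ref'] == bim['A2'])
    & (id_parts['id_alt'] == bim['A1'])
)

# Positions listed more than once (multi-allelic sites split into separate rows)
multiallelic_positions = bim.loc[bim['POS'].duplicated(keep=False), 'POS'].unique()
print(consistent.all(), list(multiallelic_positions))
```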


In [None]:
# Visualize fam file
! head {WORK_DIR}/SCARB2_EUR/EUR_SCARB2.fam

In [15]:
for ancestry in ancestries:
    
    WORK_DIR = "~/workspace/ws_files/SCARB2/"
    
    ! head -n 1 {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_SCARB2.fam > {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_s1.txt

In [None]:
! head {WORK_DIR}/SCARB2_EUR/EUR_s1.txt

Turn binary files into VCF

In [17]:
for ancestry in ancestries:
    
    WORK_DIR = "~/workspace/ws_files/SCARB2/"
    
    ## Keep only the sample listed in {ancestry}_s1.txt (the first line of the .fam file)
    ! /home/jupyter/tools/plink2 \
    --bfile {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_SCARB2 \
    --keep {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_s1.txt \
    --make-bed \
    --out {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_SCARB2_v1

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2_v1.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2
  --keep /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_s1.txt
  --make-bed
  --out /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2_v1

Start time: Tue Aug 20 09:38:51 2024
52223 MiB RAM detected, ~50233 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
2655 samples (1007 females, 1648 males; 2655 founders) loaded from
/home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2.fam.
1849 variants loaded from
/home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2.bim.
1 binary phenotype loaded (1292 cases, 411 controls).
--keep: 1 sample remaining.
1 sample (0 females, 1 male; 1 founder) remaining after mai

In [18]:
for ancestry in ancestries:
    
    WORK_DIR = "~/workspace/ws_files/SCARB2/"
    
    ## Turn binary files into VCF
    ## (PLINK 2 reports 'vcf-fid' as deprecated and suggests 'vcf' + 'id-paste=fid')
    ! /home/jupyter/tools/plink2 \
    --bfile {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_SCARB2_v1 \
    --recode vcf-fid \
    --out {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_SCARB2_v1

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2_v1.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2_v1
  --export vcf-fid
  --out /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2_v1

Start time: Tue Aug 20 09:39:45 2024
Note: --export 'vcf-fid' modifier is deprecated.  Use 'vcf' + 'id-paste=fid'.
52223 MiB RAM detected, ~50245 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
1 sample (0 females, 1 male; 1 founder) loaded from
/home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2_v1.fam.
1849 variants loaded from
/home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2_v1.bim.
1 binary phenotype loaded (1 case, 0 controls).
--export vcf to
/home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2_v1.vcf .

In [19]:
### Bgzip and Tabix (zip and index the file)
for ancestry in ancestries:
    
    WORK_DIR = "~/workspace/ws_files/SCARB2/"
    ! bgzip -f {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_SCARB2_v1.vcf
    ! tabix -f -p vcf {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_SCARB2_v1.vcf.gz 
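Before annotation, it can be useful to confirm how many header lines and variant records each compressed VCF contains. A minimal sketch, relying on the fact that bgzip output is readable with Python's standard `gzip` module; `count_vcf_lines` is a hypothetical helper.

```python
import gzip


def count_vcf_lines(path):
    """Return (header_lines, variant_records) for a gzip- or bgzip-compressed VCF."""
    header = 0
    records = 0
    with gzip.open(path, 'rt') as handle:
        for line in handle:
            if line.startswith('#'):
                header += 1       # meta-information lines and the #CHROM line
            elif line.strip():
                records += 1      # one line per variant record
    return header, records
```

The record count can then be compared with the number of loci that ANNOVAR reports when it converts the same file.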

Annotate using ANNOVAR

In [20]:
## annotate using ANNOVAR

for ancestry in ancestries:
    
    WORK_DIR = "~/workspace/ws_files/SCARB2/"
    
    ! perl /home/jupyter/tools/annovar/table_annovar.pl {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_SCARB2_v1.vcf.gz /home/jupyter/tools/annovar/humandb/ -buildver hg38 \
    -out {WORK_DIR}/SCARB2_{ancestry}/{ancestry}_SCARB2.annovar \
    -remove -protocol refGene,clinvar_20140902 \
    -operation g,f \
    --nopolish \
    -nastring . \
    -vcfinput


NOTICE: Running with system command <convert2annovar.pl  -includeinfo -allsample -withfreq -format vcf4 /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2_v1.vcf.gz > /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2.annovar.avinput>
NOTICE: Finished reading 1856 lines from VCF file
NOTICE: A total of 1849 locus in VCF file passed QC threshold, representing 1707 SNPs (1174 transitions and 533 transversions) and 142 indels/substitutions
NOTICE: Finished writing allele frequencies based on 1707 SNP genotypes (1174 transitions and 533 transversions) and 142 indels/substitutions for 1 samples

NOTICE: Running with system command </home/jupyter/tools/annovar/table_annovar.pl /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2.annovar.avinput /home/jupyter/tools/annovar/humandb/ -buildver hg38 -outfile /home/jupyter/workspace/ws_files/SCARB2//SCARB2_AJ/AJ_SCARB2.annovar -remove -protocol refGene,clinvar_20140902 -operation g,f --nopolish -nastring . -otherinfo>

In [21]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_AAC/AAC_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108753,76108753,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108753,chr4:76108753:T:C,T,C,.,.,PR,GT,0/0
1,4,76108755,76108755,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108755,chr4:76108755:T:C,T,C,.,.,PR,GT,0/0
2,4,76108761,76108761,G,A,intronic,ART3,.,.,.,.,0.0,.,.,4,76108761,chr4:76108761:G:A,G,A,.,.,PR,GT,0/0
3,4,76108804,76108804,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108804,chr4:76108804:T:C,T,C,.,.,PR,GT,0/0
4,4,76108828,76108828,G,T,intronic,ART3,.,.,.,.,0.5,.,.,4,76108828,chr4:76108828:G:T,G,T,.,.,PR,GT,0/1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3843,4,76284266,76284266,G,A,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284266,chr4:76284266:G:A,G,A,.,.,PR,GT,0/0
3844,4,76284298,76284298,C,T,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284298,chr4:76284298:C:T,C,T,.,.,PR,GT,0/0
3845,4,76284397,76284397,A,G,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284397,chr4:76284397:A:G,A,G,.,.,PR,GT,0/0
3846,4,76284490,76284490,A,G,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284490,chr4:76284490:A:G,A,G,.,.,PR,GT,0/0


In [22]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      3385
intergenic     186
UTR3            89
downstream      71
exonic          59
upstream        33
UTR5            22
splicing         3
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [35]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV    32
synonymous SNV       25
stopgain              1
stoploss              1
Name: count, dtype: int64
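The read, count, and filter steps are repeated for each ancestry, so they can be wrapped in one function. The sketch below is an illustration, not the notebook's own code: `summarize_annotation` is a hypothetical helper and the toy table uses made-up rows. Because the extracted region also covers neighbouring genes (ART3 and FAM47E appear in `Gene.refGene`), the optional `gene` argument restricts the counts to an exact gene symbol match.

```python
import pandas as pd


def summarize_annotation(multianno, gene=None):
    """Count functional classes, optionally restricted to one gene symbol."""
    if gene is not None:
        multianno = multianno[multianno['Gene.refGene'] == gene]
    func_counts = multianno['Func.refGene'].value_counts()
    exonic = multianno[multianno['Func.refGene'] == 'exonic']
    exonic_counts = exonic['ExonicFunc.refGene'].value_counts()
    return func_counts, exonic_counts


# Toy table standing in for a *.hg38_multianno.txt file (values are made up)
toy = pd.DataFrame({
    'Gene.refGene': ['ART3', 'SCARB2', 'SCARB2', 'SCARB2', 'FAM47E', 'FAM47E'],
    'Func.refGene': ['intronic', 'exonic', 'exonic', 'UTR3', 'exonic', 'intronic'],
    'ExonicFunc.refGene': ['.', 'nonsynonymous SNV', 'synonymous SNV', '.',
                           'nonsynonymous SNV', '.'],
})

all_func, all_exonic = summarize_annotation(toy)
gene_func, gene_exonic = summarize_annotation(toy, gene='SCARB2')
print(all_exonic.to_dict())
print(gene_exonic.to_dict())
```

With the real files, the function would be applied to `pd.read_csv(..., sep='\t')` of each ancestry's multianno file inside a loop over `ancestries`.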

In [36]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_AFR/AFR_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108750,76108750,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108750,chr4:76108750:T:C,T,C,.,.,PR,GT,0/0
1,4,76108755,76108755,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108755,chr4:76108755:T:C,T,C,.,.,PR,GT,0/0
2,4,76108804,76108804,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108804,chr4:76108804:T:C,T,C,.,.,PR,GT,0/0
3,4,76108828,76108828,G,T,intronic,ART3,.,.,.,.,0.5,.,.,4,76108828,chr4:76108828:G:T,G,T,.,.,PR,GT,0/1
4,4,76108858,76108858,G,A,intronic,ART3,.,.,.,.,1.0,.,.,4,76108858,chr4:76108858:G:A,G,A,.,.,PR,GT,1/1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4299,4,76284274,76284274,T,C,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284274,chr4:76284274:T:C,T,C,.,.,PR,GT,0/0
4300,4,76284298,76284298,C,T,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284298,chr4:76284298:C:T,C,T,.,.,PR,GT,0/0
4301,4,76284438,76284438,C,T,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284438,chr4:76284438:C:T,C,T,.,.,PR,GT,0/0
4302,4,76284476,76284476,C,G,intronic,FAM47E-STBD1,.,.,.,.,1.0,.,.,4,76284476,chr4:76284476:C:G,C,G,.,.,PR,GT,1/1


In [37]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      3759
intergenic     242
UTR3            95
downstream      81
exonic          64
upstream        34
UTR5            26
splicing         3
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [39]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV      36
synonymous SNV         26
frameshift deletion     1
stoploss                1
Name: count, dtype: int64

In [40]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_AJ/AJ_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108753,76108753,T,G,intronic,ART3,.,.,.,.,0.0,.,.,4,76108753,chr4:76108753:T:G,T,G,.,.,PR,GT,0/0
1,4,76108828,76108828,G,T,intronic,ART3,.,.,.,.,0.5,.,.,4,76108828,chr4:76108828:G:T,G,T,.,.,PR,GT,0/1
2,4,76108858,76108858,G,A,intronic,ART3,.,.,.,.,0.5,.,.,4,76108858,chr4:76108858:G:A,G,A,.,.,PR,GT,0/1
3,4,76108994,76108994,G,A,intronic,ART3,.,.,.,.,0.5,.,.,4,76108994,chr4:76108994:G:A,G,A,.,.,PR,GT,0/1
4,4,76109007,76109007,G,A,intronic,ART3,.,.,.,.,0.5,.,.,4,76109007,chr4:76109007:G:A,G,A,.,.,PR,GT,0/1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1844,4,76283418,76283418,G,A,exonic,FAM47E,.,nonsynonymous SNV,"FAM47E:NM_001136570:exon8:c.G1142A:p.R381H,FAM...",.,0.0,.,.,4,76283418,chr4:76283418:G:A,G,A,.,.,PR,GT,0/0
1845,4,76283566,76283566,A,T,UTR3,FAM47E,NM_001242936:c.*108A>T;NM_001136570:c.*108A>T,.,.,.,0.0,.,.,4,76283566,chr4:76283566:A:T,A,T,.,.,PR,GT,0/0
1846,4,76284014,76284014,A,G,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284014,chr4:76284014:A:G,A,G,.,.,PR,GT,0/0
1847,4,76284108,76284108,G,C,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284108,chr4:76284108:G:C,G,C,.,.,PR,GT,0/0


In [41]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      1605
intergenic     111
UTR3            42
downstream      33
exonic          31
upstream        21
UTR5             5
splicing         1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [43]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV    19
synonymous SNV       12
Name: count, dtype: int64

In [44]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_AMR/AMR_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108753,76108753,T,G,intronic,ART3,.,.,.,.,0,.,.,4,76108753,chr4:76108753:T:G,T,G,.,.,PR,GT,0/0
1,4,76108755,76108755,T,C,intronic,ART3,.,.,.,.,0,.,.,4,76108755,chr4:76108755:T:C,T,C,.,.,PR,GT,0/0
2,4,76108804,76108804,T,C,intronic,ART3,.,.,.,.,0,.,.,4,76108804,chr4:76108804:T:C,T,C,.,.,PR,GT,0/0
3,4,76108828,76108828,G,T,intronic,ART3,.,.,.,.,0.5,.,.,4,76108828,chr4:76108828:G:T,G,T,.,.,PR,GT,0/1
4,4,76108858,76108858,G,A,intronic,ART3,.,.,.,.,0.5,.,.,4,76108858,chr4:76108858:G:A,G,A,.,.,PR,GT,0/1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2511,4,76283751,76283751,G,A,UTR3,FAM47E,NM_001242936:c.*293G>A;NM_001136570:c.*293G>A,.,.,.,0,.,.,4,76283751,chr4:76283751:G:A,G,A,.,.,PR,GT,0/0
2512,4,76283911,76283911,C,T,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76283911,chr4:76283911:C:T,C,T,.,.,PR,GT,0/0
2513,4,76284105,76284105,A,-,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284104,chr4:76284104:TA:T,TA,T,.,.,PR,GT,0/0
2514,4,76284108,76284108,G,C,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284108,chr4:76284108:G:C,G,C,.,.,PR,GT,0/0


In [45]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      2212
intergenic     137
UTR3            47
downstream      44
exonic          43
upstream        22
UTR5             9
splicing         2
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [47]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV    29
synonymous SNV       14
Name: count, dtype: int64

In [48]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_CAH/CAH_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108753,76108753,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108753,chr4:76108753:T:C,T,C,.,.,PR,GT,0/0
1,4,76108755,76108755,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108755,chr4:76108755:T:C,T,C,.,.,PR,GT,0/0
2,4,76108804,76108804,T,C,intronic,ART3,.,.,.,.,0.0,.,.,4,76108804,chr4:76108804:T:C,T,C,.,.,PR,GT,0/0
3,4,76108828,76108828,G,T,intronic,ART3,.,.,.,.,0.0,.,.,4,76108828,chr4:76108828:G:T,G,T,.,.,PR,GT,0/0
4,4,76108848,76108848,C,T,intronic,ART3,.,.,.,.,0.0,.,.,4,76108848,chr4:76108848:C:T,C,T,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3545,4,76284241,76284241,C,T,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284241,chr4:76284241:C:T,C,T,.,.,PR,GT,0/0
3546,4,76284281,76284281,C,A,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284281,chr4:76284281:C:A,C,A,.,.,PR,GT,0/0
3547,4,76284298,76284298,C,T,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284298,chr4:76284298:C:T,C,T,.,.,PR,GT,0/0
3548,4,76284303,76284303,C,A,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284303,chr4:76284303:C:A,C,A,.,.,PR,GT,0/0


In [49]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      3125
intergenic     189
UTR3            71
downstream      66
exonic          52
upstream        25
UTR5            19
splicing         3
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [51]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV    29
synonymous SNV       22
stopgain              1
Name: count, dtype: int64

In [52]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_CAS/CAS_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108828,76108828,G,T,intronic,ART3,.,.,.,.,0.5,.,.,4,76108828,chr4:76108828:G:T,G,T,.,.,PR,GT,0/1
1,4,76108858,76108858,G,A,intronic,ART3,.,.,.,.,0.5,.,.,4,76108858,chr4:76108858:G:A,G,A,.,.,PR,GT,0/1
2,4,76108861,76108864,GTAA,-,intronic,ART3,.,.,.,.,0.0,.,.,4,76108860,chr4:76108860:GGTAA:G,GGTAA,G,.,.,PR,GT,0/0
3,4,76108994,76108994,G,A,intronic,ART3,.,.,.,.,0.0,.,.,4,76108994,chr4:76108994:G:A,G,A,.,.,PR,GT,0/0
4,4,76109007,76109007,G,A,intronic,ART3,.,.,.,.,0.0,.,.,4,76109007,chr4:76109007:G:A,G,A,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2214,4,76283347,76283347,C,T,intronic,FAM47E;FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76283347,chr4:76283347:C:T,C,T,.,.,PR,GT,0/0
2215,4,76283509,76283509,A,G,UTR3,FAM47E,NM_001242936:c.*51A>G;NM_001136570:c.*51A>G,.,.,.,0.0,.,.,4,76283509,chr4:76283509:A:G,A,G,.,.,PR,GT,0/0
2216,4,76284107,76284107,A,G,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284107,chr4:76284107:A:G,A,G,.,.,PR,GT,0/0
2217,4,76284122,76284122,C,G,intronic,FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76284122,chr4:76284122:C:G,C,G,.,.,PR,GT,0/0


In [53]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      1945
intergenic     120
UTR3            51
downstream      37
exonic          36
upstream        19
UTR5            10
splicing         1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [55]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV    21
synonymous SNV       14
stopgain              1
Name: count, dtype: int64

In [56]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_EAS/EAS_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108828,76108828,G,T,intronic,ART3,.,.,.,.,0,.,.,4,76108828,chr4:76108828:G:T,G,T,.,.,PR,GT,0/0
1,4,76108858,76108858,G,A,intronic,ART3,.,.,.,.,0,.,.,4,76108858,chr4:76108858:G:A,G,A,.,.,PR,GT,0/0
2,4,76108994,76108994,G,A,intronic,ART3,.,.,.,.,1,.,.,4,76108994,chr4:76108994:G:A,G,A,.,.,PR,GT,1/1
3,4,76109007,76109007,G,A,intronic,ART3,.,.,.,.,1,.,.,4,76109007,chr4:76109007:G:A,G,A,.,.,PR,GT,1/1
4,4,76109074,76109074,A,G,intronic,ART3,.,.,.,.,0,.,.,4,76109074,chr4:76109074:A:G,A,G,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3089,4,76284241,76284241,C,T,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284241,chr4:76284241:C:T,C,T,.,.,PR,GT,0/0
3090,4,76284298,76284298,C,T,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284298,chr4:76284298:C:T,C,T,.,.,PR,GT,0/0
3091,4,76284478,76284478,C,T,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284478,chr4:76284478:C:T,C,T,.,.,PR,GT,0/0
3092,4,76284489,76284489,C,T,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284489,chr4:76284489:C:T,C,T,.,.,PR,GT,0/0


In [57]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      2670
intergenic     170
UTR3            84
downstream      59
exonic          58
UTR5            26
upstream        25
splicing         2
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [59]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV      35
synonymous SNV         21
stopgain                1
frameshift deletion     1
Name: count, dtype: int64

In [60]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_MDE/MDE_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108755,76108755,T,C,intronic,ART3,.,.,.,.,0,.,.,4,76108755,chr4:76108755:T:C,T,C,.,.,PR,GT,0/0
1,4,76108804,76108804,T,C,intronic,ART3,.,.,.,.,0,.,.,4,76108804,chr4:76108804:T:C,T,C,.,.,PR,GT,0/0
2,4,76108820,76108820,A,G,intronic,ART3,.,.,.,.,0,.,.,4,76108820,chr4:76108820:A:G,A,G,.,.,PR,GT,0/0
3,4,76108828,76108828,G,T,intronic,ART3,.,.,.,.,0.5,.,.,4,76108828,chr4:76108828:G:T,G,T,.,.,PR,GT,0/1
4,4,76108858,76108858,G,A,intronic,ART3,.,.,.,.,0.5,.,.,4,76108858,chr4:76108858:G:A,G,A,.,.,PR,GT,0/1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2375,4,76283774,76283774,T,G,UTR3,FAM47E,NM_001242936:c.*316T>G;NM_001136570:c.*316T>G,.,.,.,0,.,.,4,76283774,chr4:76283774:T:G,T,G,.,.,PR,GT,0/0
2376,4,76284107,76284107,A,G,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284107,chr4:76284107:A:G,A,G,.,.,PR,GT,0/0
2377,4,76284108,76284108,G,C,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284108,chr4:76284108:G:C,G,C,.,.,PR,GT,0/0
2378,4,76284127,76284127,C,T,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284127,chr4:76284127:C:T,C,T,.,.,PR,GT,0/0


In [61]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      2116
intergenic     116
UTR3            42
downstream      37
exonic          36
upstream        20
UTR5            11
splicing         2
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [63]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV    26
synonymous SNV       10
Name: count, dtype: int64

In [64]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_SAS/SAS_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108828,76108828,G,T,intronic,ART3,.,.,.,.,0.5,.,.,4,76108828,chr4:76108828:G:T,G,T,.,.,PR,GT,0/1
1,4,76108858,76108858,G,A,intronic,ART3,.,.,.,.,0.5,.,.,4,76108858,chr4:76108858:G:A,G,A,.,.,PR,GT,0/1
2,4,76108861,76108864,GTAA,-,intronic,ART3,.,.,.,.,0,.,.,4,76108860,chr4:76108860:GGTAA:G,GGTAA,G,.,.,PR,GT,0/0
3,4,76108972,76108972,T,-,intronic,ART3,.,.,.,.,0,.,.,4,76108971,chr4:76108971:GT:G,GT,G,.,.,PR,GT,0/0
4,4,76108974,76108974,G,A,intronic,ART3,.,.,.,.,0,.,.,4,76108974,chr4:76108974:G:A,G,A,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2053,4,76283677,76283677,T,G,UTR3,FAM47E,NM_001242936:c.*219T>G;NM_001136570:c.*219T>G,.,.,.,0,.,.,4,76283677,chr4:76283677:T:G,T,G,.,.,PR,GT,0/0
2054,4,76283721,76283721,G,A,UTR3,FAM47E,NM_001242936:c.*263G>A;NM_001136570:c.*263G>A,.,.,.,0,.,.,4,76283721,chr4:76283721:G:A,G,A,.,.,PR,GT,0/0
2055,4,76284022,76284022,G,A,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284022,chr4:76284022:G:A,G,A,.,.,PR,GT,0/0
2056,4,76284108,76284108,G,C,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284108,chr4:76284108:G:C,G,C,.,.,PR,GT,0/0


In [65]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      1794
intergenic     105
UTR3            53
downstream      40
exonic          34
upstream        22
UTR5             9
splicing         1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [67]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV    19
synonymous SNV       15
Name: count, dtype: int64

In [68]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_FIN/FIN_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108828,76108828,G,T,intronic,ART3,.,.,.,.,0.0,.,.,4,76108828,chr4:76108828:G:T,G,T,.,.,PR,GT,0/0
1,4,76108858,76108858,G,A,intronic,ART3,.,.,.,.,0.0,.,.,4,76108858,chr4:76108858:G:A,G,A,.,.,PR,GT,0/0
2,4,76108994,76108994,G,A,intronic,ART3,.,.,.,.,0.5,.,.,4,76108994,chr4:76108994:G:A,G,A,.,.,PR,GT,0/1
3,4,76109007,76109007,G,A,intronic,ART3,.,.,.,.,0.5,.,.,4,76109007,chr4:76109007:G:A,G,A,.,.,PR,GT,0/1
4,4,76109102,76109102,C,T,intronic,ART3,.,.,.,.,0.5,.,.,4,76109102,chr4:76109102:C:T,C,T,.,.,PR,GT,0/1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1161,4,76283246,76283247,AT,-,intronic,FAM47E;FAM47E-STBD1,.,.,.,.,0.5,.,.,4,76283245,chr4:76283245:CAT:C,CAT,C,.,.,PR,GT,0/1
1162,4,76283278,76283279,AG,-,intronic,FAM47E;FAM47E-STBD1,.,.,.,.,0.0,.,.,4,76283277,chr4:76283277:TAG:T,TAG,T,.,.,PR,GT,0/0
1163,4,76283347,76283347,C,T,intronic,FAM47E;FAM47E-STBD1,.,.,.,.,0.5,.,.,4,76283347,chr4:76283347:C:T,C,T,.,.,PR,GT,0/1
1164,4,76283587,76283587,A,G,UTR3,FAM47E,NM_001242936:c.*129A>G;NM_001136570:c.*129A>G,.,.,.,0.0,.,.,4,76283587,chr4:76283587:A:G,A,G,.,.,PR,GT,0/0


In [69]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      999
intergenic     84
downstream     24
UTR3           23
exonic         16
upstream       15
UTR5            4
splicing        1
Name: count, dtype: int64

In [None]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding.count()

In [71]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV    10
synonymous SNV        6
Name: count, dtype: int64

In [24]:
# Read in ANNOVAR multianno file
gene = pd.read_csv(f'{WORK_DIR}/SCARB2_EUR/EUR_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,76108753,76108753,T,G,intronic,ART3,.,.,.,.,.,.,.,4,76108753,chr4:76108753:T:G,T,G,.,.,PR,GT,./.
1,4,76108755,76108755,T,C,intronic,ART3,.,.,.,.,0,.,.,4,76108755,chr4:76108755:T:C,T,C,.,.,PR,GT,0/0
2,4,76108761,76108761,G,A,intronic,ART3,.,.,.,.,0,.,.,4,76108761,chr4:76108761:G:A,G,A,.,.,PR,GT,0/0
3,4,76108794,76108794,G,A,intronic,ART3,.,.,.,.,0,.,.,4,76108794,chr4:76108794:G:A,G,A,.,.,PR,GT,0/0
4,4,76108804,76108804,T,A,intronic,ART3,.,.,.,.,0,.,.,4,76108804,chr4:76108804:T:A,T,A,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9565,4,76284318,76284318,G,T,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284318,chr4:76284318:G:T,G,T,.,.,PR,GT,0/0
9566,4,76284341,76284341,C,T,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284341,chr4:76284341:C:T,C,T,.,.,PR,GT,0/0
9567,4,76284385,76284385,G,A,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284385,chr4:76284385:G:A,G,A,.,.,PR,GT,0/0
9568,4,76284405,76284405,A,G,intronic,FAM47E-STBD1,.,.,.,.,0,.,.,4,76284405,chr4:76284405:A:G,A,G,.,.,PR,GT,0/0


In [25]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic      8370
intergenic     473
UTR3           225
exonic         194
downstream     156
upstream        83
UTR5            66
splicing         3
Name: count, dtype: int64

In [26]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']

In [27]:
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV         128
synonymous SNV             58
stopgain                    4
nonframeshift deletion      1
frameshift insertion        1
frameshift deletion         1
stoploss                    1
Name: count, dtype: int64
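
The `ExonicFunc.refGene` labels above can be grouped into the broader classes used later in this notebook (synonymous, nonsynonymous, and loss-of-function). The sketch below is one way to do that grouping. The function names `classify_exonic` and `summarise` and the class names are illustrative assumptions rather than part of the original analysis, and the block uses only the standard library so it runs without the annotation files. The counts fed in are the EUR counts printed above.

```python
from collections import Counter

# ANNOVAR ExonicFunc.refGene labels treated as loss-of-function here.
# This mirrors the `lof` filter used later in this notebook.
LOF_LABELS = {
    "stopgain",
    "stoploss",
    "frameshift deletion",
    "frameshift insertion",
}


def classify_exonic(label: str) -> str:
    """Map an ExonicFunc.refGene label to a broad variant class."""
    if label == "synonymous SNV":
        return "synonymous"
    if label == "nonsynonymous SNV":
        return "nonsynonymous"
    if label in LOF_LABELS:
        return "lof"
    # e.g. nonframeshift indels or 'unknown'
    return "other"


def summarise(counts: dict[str, int]) -> Counter:
    """Aggregate per-label counts into per-class counts."""
    totals: Counter = Counter()
    for label, n in counts.items():
        totals[classify_exonic(label)] += n
    return totals


# EUR counts copied from the value_counts() output above
eur_counts = {
    "nonsynonymous SNV": 128,
    "synonymous SNV": 58,
    "stopgain": 4,
    "nonframeshift deletion": 1,
    "frameshift insertion": 1,
    "frameshift deletion": 1,
    "stoploss": 1,
}
eur_summary = summarise(eur_counts)
print(dict(eur_summary))
```

Counting only `synonymous SNV` as synonymous keeps stop-gain and frameshift variants out of the synonymous tally.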

In [28]:
#Make lists of variants to keep: all coding (exonic) variants, and nonsynonymous (missense, as coded in ANNOVAR) plus loss-of-function variants

for ancestry in ancestries:
        
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    # Read in ANNOVAR multianno file
    gene = pd.read_csv(f'{WORK_DIR}/{ancestry}_SCARB2.annovar.hg38_multianno.txt', sep = '\t')
    
    #Print number of variants in the different categories
    results = [] 
    
    utr5 = gene[gene['Func.refGene']== 'UTR5']
    intronic = gene[gene['Func.refGene']== 'intronic']
    exonic = gene[gene['Func.refGene']== 'exonic']
    utr3 = gene[gene['Func.refGene']== 'UTR3']
    coding_nonsynonymous = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'nonsynonymous SNV')]
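    # NOTE: this selects every exonic variant that is not a nonsynonymous SNV, so the 'Synonymous' count printed below
    # also includes any stopgain, stoploss, frameshift, or other non-SNV exonic variants
    coding_synonymous = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] != 'nonsynonymous SNV')]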
    lof = exonic[(exonic['ExonicFunc.refGene'] == 'stopgain') | (exonic['ExonicFunc.refGene'] == 'stoploss') | (exonic['ExonicFunc.refGene'] == 'frameshift deletion') | (exonic['ExonicFunc.refGene'] == 'frameshift insertion')]
    nonsynonymous_lof = pd.concat([coding_nonsynonymous, lof])
    
    print({ancestry})
    print('Total variants: ', len(gene))
    print("Intronic: ", len(intronic))
    print('UTR3: ', len(utr3))
    print('UTR5: ', len(utr5))
    print("Total exonic: ", len(exonic))
    print('  Synonymous: ', len(coding_synonymous))
    print("  Nonsynonymous: ", len(coding_nonsynonymous))
    print("nonsynonymous_lof: ", len(nonsynonymous_lof))
    results.append((gene, intronic, utr3, utr5, exonic, coding_synonymous, coding_nonsynonymous, nonsynonymous_lof))
    print('\n')
    
    # Save in PLINK format - nonsynonymous and loss-of-function variants
    # Nonsynonymous SNVs are missense variants; other protein-altering types (e.g. stopgain/stoploss or frameshift variants) are coded differently in ExonicFunc.refGene and are included via `lof`
    variants_toKeep = nonsynonymous_lof[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep.to_csv(f'{WORK_DIR}/{ancestry}_SCARB2.nonsynonymous_lof.variantstoKeep.txt', sep="\t", index=False, header=False)
    
    # Save in PLINK format - all coding variants
    variants_toKeep2 = exonic[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep2.to_csv(f'{WORK_DIR}/{ancestry}_SCARB2.exonic.variantstoKeep.txt', sep="\t", index=False, header=False)

WORKING ON: AJ
{'AJ'}
Total variants:  1849
Intronic:  1605
UTR3:  42
UTR5:  5
Total exonic:  31
  Synonymous:  12
  Nonsynonymous:  19
nonsynonymous_lof:  19


WORKING ON: EAS
{'EAS'}
Total variants:  3094
Intronic:  2670
UTR3:  84
UTR5:  26
Total exonic:  58
  Synonymous:  23
  Nonsynonymous:  35
nonsynonymous_lof:  37


WORKING ON: SAS
{'SAS'}
Total variants:  2058
Intronic:  1794
UTR3:  53
UTR5:  9
Total exonic:  34
  Synonymous:  15
  Nonsynonymous:  19
nonsynonymous_lof:  19


WORKING ON: AMR
{'AMR'}
Total variants:  2516
Intronic:  2212
UTR3:  47
UTR5:  9
Total exonic:  43
  Synonymous:  14
  Nonsynonymous:  29
nonsynonymous_lof:  29


WORKING ON: MDE
{'MDE'}
Total variants:  2380
Intronic:  2116
UTR3:  42
UTR5:  11
Total exonic:  36
  Synonymous:  10
  Nonsynonymous:  26
nonsynonymous_lof:  26


WORKING ON: CAH
{'CAH'}
Total variants:  3550
Intronic:  3125
UTR3:  71
UTR5:  19
Total exonic:  52
  Synonymous:  23
  Nonsynonymous:  29
nonsynonymous_lof:  30


WORKING ON: FIN
{'FIN
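
The `variantstoKeep.txt` files written above are PLINK range files: one tab-separated line per variant with chromosome, start, end, and a label. In the ANNOVAR tables shown earlier, deletions have a `Start` one base after the VCF position in `Otherinfo5` (for example `Start` 76108861 versus VCF position 76108860 for `chr4:76108860:GGTAA:G`), so a range built from `Start`/`End` alone may not cover the position PLINK stores for such a variant. The sketch below is one possible safeguard, not the notebook's method: the helper `range_row` is hypothetical and simply widens the range to include the VCF position.

```python
import csv
import io


def range_row(chrom, start, end, vcf_pos, label):
    """Return one PLINK range-file row that covers both the ANNOVAR
    interval and the original VCF position (hypothetical helper)."""
    lo = min(int(start), int(vcf_pos))
    hi = max(int(end), int(vcf_pos))
    return [str(chrom), str(lo), str(hi), label]


# Two rows copied from the ANNOVAR output above:
# (Chr, Start, End, Otherinfo5 = VCF position, Gene.refGene)
annovar_rows = [
    (4, 76108828, 76108828, 76108828, "ART3"),  # SNV: coordinates agree
    (4, 76108861, 76108864, 76108860, "ART3"),  # deletion: Start = POS + 1
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for row in annovar_rows:
    writer.writerow(range_row(*row))

range_file_text = buffer.getvalue()
print(range_file_text, end="")
```

Whether this matters in practice depends on how many indels fall in the exonic and loss-of-function sets; comparing the number of variants PLINK extracts with the number of rows in the range file is a quick check.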

ALL variants

assoc

glm

ASSOC

In [29]:
#Run case-control analysis using plink assoc for all variants, not adjusting for any covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    
    ! /home/jupyter/tools/plink \
    --bfile {WORK_DIR}/{ancestry}_SCARB2 \
    --keep {WORK_DIR}/{ancestry}.samplestoKeep \
    --assoc \
    --allow-no-sex \
    --ci 0.95 \
    --maf 0.01 \
    --out {WORK_DIR}/{ancestry}_SCARB2.all

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.all.log.
Options in effect:
  --allow-no-sex
  --assoc
  --bfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2
  --ci 0.95
  --keep /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC.samplestoKeep
  --maf 0.01
  --out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.all

52223 MB RAM detected; reserving 26111 MB for main workspace.
3848 variants loaded from .bim file.
1111 people (455 males, 656 females) loaded from .fam.
1086 phenotype values loaded from .fam.
--keep: 1111 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1111 founders and 0 nonfounders present.
Calculating allele frequencies...

In [30]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    #Look at assoc results, only variants with nominal p-value < 0.05
    freq = pd.read_csv(f'{WORK_DIR}/{ancestry}_SCARB2.all.assoc', sep=r'\s+')
    sig_all_nonadj = freq[freq['P']<0.05]
    
    print(f'Variants with p-value < 0.05: {sig_all_nonadj.shape}')
    
    #Save FREQ to csv
    freq.to_csv(f'{WORK_DIR}/{ancestry}.all_nonadj.csv')

WORKING ON: AAC
Variants with p-value < 0.05: (106, 13)
WORKING ON: AFR
Variants with p-value < 0.05: (368, 13)
WORKING ON: AJ
Variants with p-value < 0.05: (56, 13)
WORKING ON: AMR
Variants with p-value < 0.05: (10, 13)
WORKING ON: CAS
Variants with p-value < 0.05: (96, 13)
WORKING ON: EAS
Variants with p-value < 0.05: (403, 13)
WORKING ON: EUR
Variants with p-value < 0.05: (277, 13)
WORKING ON: FIN
Variants with p-value < 0.05: (41, 13)
WORKING ON: MDE
Variants with p-value < 0.05: (132, 13)
WORKING ON: SAS
Variants with p-value < 0.05: (20, 13)
WORKING ON: CAH
Variants with p-value < 0.05: (72, 13)
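
The filtering step above keeps variants with nominal p < 0.05 from PLINK's whitespace-delimited `.assoc` output. As a dependency-free illustration of the same logic, the sketch below parses a small made-up table and skips rows where PLINK reports `NA`. The variant IDs and values are invented for the example, only a subset of the real `.assoc` columns is shown, and `nominal_hits` is a hypothetical helper name.

```python
import io

# Invented example in the whitespace-delimited layout PLINK uses;
# only a subset of the real .assoc columns is shown.
example_assoc = io.StringIO(
    """ CHR            SNP         BP   A1      F_A      F_U   A2        CHISQ            P
   4   example_var1   76108828    T   0.1200   0.1000    G        4.100       0.0429
   4   example_var2   76108858    A   0.3000   0.2900    G        0.210       0.6468
   4   example_var3   76108994    A       NA       NA    G           NA           NA
"""
)


def nominal_hits(handle, alpha=0.05):
    """Return rows (as dicts) whose P column is below alpha."""
    header = handle.readline().split()
    hits = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        row = dict(zip(header, fields))
        try:
            p_value = float(row["P"])
        except ValueError:
            # PLINK writes NA when the test could not be computed
            continue
        if p_value < alpha:
            hits.append(row)
    return hits


hits = nominal_hits(example_assoc)
print([row["SNP"] for row in hits])
```

Because these are nominal, uncorrected p-values, the per-ancestry counts are best read as a screening summary rather than as evidence of association.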




GLM

In [31]:
#Run case-control analysis with covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'

    ! /home/jupyter/tools/plink2 \
    --bfile {WORK_DIR}/{ancestry}_SCARB2 \
    --keep {WORK_DIR}/{ancestry}.samplestoKeep \
    --allow-no-sex \
    --maf 0.01 \
    --ci 0.95 \
    --glm \
    --covar {WORK_DIR}/{ancestry}_covariate_file.txt \
    --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
    --covar-variance-standardize \
    --neg9-pheno-really-missing \
    --out {WORK_DIR}/{ancestry}_SCARB2.all_adj

PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.all_adj.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2
  --ci 0.95
  --covar /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_covariate_file.txt
  --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5
  --covar-variance-standardize
  --glm
  --keep /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC.samplestoKeep
  --maf 0.01
  --neg9-pheno-really-missing
  --out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.all_adj

Start time: Tue Aug 20 09:48:42 2024
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
52223 MiB RAM detected, ~50235 available; reserving 2

In [32]:
#Process results from plink glm analysis including covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    #Read in plink glm results
    assoc = pd.read_csv(f'{WORK_DIR}/{ancestry}_SCARB2.all_adj.PHENO1.glm.logistic.hybrid', sep=r'\s+')
    
    #Filter for additive test only - this is the variant results
    assoc_add = assoc[assoc['TEST']=="ADD"]
    
    #Check if there are any significant (p < 0.05) variants
    significant = assoc_add[assoc_add['P']<0.05]

    print(f'There are {len(significant)} variants with p-value < 0.05 in glm')
    
    #Check if there are any genome-wide significant (p < 5e-8) variants
    GWsignificant = assoc_add[assoc_add['P']<5e-8]

    print(f'There are {len(GWsignificant)} variants with p-value < 5e-8 in glm')
    
    #Save assoc_add to csv
    assoc_add.to_csv(f'{WORK_DIR}/{ancestry}.all_adj.csv')

WORKING ON: AAC
There are 82 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: AFR
There are 62 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: AJ
There are 52 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: AMR
There are 19 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: CAS
There are 1 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: EAS
There are 123 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: EUR
There are 279 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: FIN
There are 21 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: MDE
There are 58 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: SAS
There are 8 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm
WORKING ON: CAH
There are 104 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm




Change format of association files for LDassoc

In [34]:
WORK_DIR = "~/workspace/ws_files/SCARB2/"

In [38]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    WORK_DIR = "~/workspace/ws_files/SCARB2/"
    print(f'WORKING ON: {ancestry}')
    
    df = pd.read_csv(f'{WORK_DIR}/SCARB2_{ancestry}/{ancestry}.all_adj.csv')
    
    #First just select relevant columns: chromosome, bp position and p-value
    export_ldassoc = df[['#CHROM', 'POS', 'P']].copy()
    
    #Rename the #CHROM column to remove the hashtag as I think this might be confusing LDassoc
    export_ldassoc = export_ldassoc.rename(columns={'#CHROM': 'CHROM'}) 
    
    #Then export as a tab-separated, not comma-separated file

    export_ldassoc.to_csv(f'{WORK_DIR}/SCARB2_{ancestry}/{ancestry}.all_adj.formatted.tab', sep = '\t', index=False)

WORKING ON: AJ
WORKING ON: EAS
WORKING ON: SAS
WORKING ON: AMR
WORKING ON: MDE
WORKING ON: CAH
WORKING ON: FIN
WORKING ON: CAS
WORKING ON: AFR
WORKING ON: EUR
WORKING ON: AAC


GCTA conditional analysis

In [39]:
# prepare file for COJO
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    #Read in summary statistics
    sumstats = pd.read_csv(f'{WORK_DIR}/SCARB2_{ancestry}/{ancestry}.all_adj.csv')
    
    #Format summary statistics for GCTA-COJO
    #First get the log odds ratio - this is required for COJO
    #1) For a case-control study, the effect size should be log(odds ratio) with its corresponding standard error.
    sumstats_formatted = sumstats.copy()
    sumstats_formatted['b'] = np.log(sumstats_formatted['OR'])
    
    #Now select just the necessary columns for COJO
    sumstats_export = sumstats_formatted[['ID', 'A1', 'OMITTED', 'A1_FREQ', 'b', 'LOG(OR)_SE', 'P', 'OBS_CT']].copy()
    
    #Rename columns following COJO format
    sumstats_export = sumstats_export.rename(columns = {'ID':'SNP', 'OMITTED':'A2', 'A1_FREQ':'freq', 'LOG(OR)_SE':'se', 'P':'p', 'OBS_CT':'N'})
    
    #Export
    sumstats_export.to_csv(f'{WORK_DIR}/SCARB2_{ancestry}/{ancestry}.all_adj.sumstats.ma', sep = '\t', index=False)
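
GCTA-COJO expects the effect size of a case-control study on the log-odds scale, which is why the cell above takes `np.log` of the `OR` column. The sketch below works through that conversion for a single made-up variant, and also shows how a standard error on the log-odds scale could be recovered from a 95% confidence interval if only the interval were available. The numbers and the helper names `log_odds` and `se_from_ci` are illustrative assumptions, not values or functions from this analysis.

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile


def log_odds(odds_ratio: float) -> float:
    """Effect size b = ln(OR), as required in the COJO .ma file."""
    return math.log(odds_ratio)


def se_from_ci(lower: float, upper: float) -> float:
    """Standard error of ln(OR) implied by a 95% CI on the OR scale."""
    return (math.log(upper) - math.log(lower)) / (2 * Z_95)


# Made-up variant: OR = 1.25 with 95% CI 1.05 to 1.49
b = log_odds(1.25)
se = se_from_ci(1.05, 1.49)

# One line in COJO column order: SNP A1 A2 freq b se p N
ma_line = "\t".join(
    ["example_var1", "T", "G", "0.12", f"{b:.4f}", f"{se:.4f}", "0.0125", "1000"]
)
print(ma_line)
```

In the notebook itself the standard error is taken directly from the `LOG(OR)_SE` column, so the confidence-interval route is only needed when that column is missing.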

In [43]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    
    #Select multiple associated SNPs based on significance p-value 5e-08
    #Can change the p-value for significance if needed
    #bfile is referring to the full dataset in plink binary format, e.g. GP2 - whatever you used to run the GWAS
    ! /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64 \
    --bfile {WORK_DIR}/{ancestry}_SCARB2 \
    --maf 0.01 \
    --cojo-file {WORK_DIR}/{ancestry}.all_adj.sumstats.ma \
    --cojo-p 5e-8 \
    --cojo-slct \
    --out {WORK_DIR}/{ancestry}.all_adj.COJO

*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version v1.94.1 Linux
* Built at Nov 15 2022 21:14:25, by GCC 8.5
* (C) 2010-present, Yang Lab, Westlake University
* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
*******************************************************************
Analysis started at 10:23:43 UTC on Tue Aug 20 2024.
Hostname: jupyterlabvertexai20240708

Accepted options:
--bfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AJ/AJ_SCARB2
--maf 0.01
--cojo-file /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AJ/AJ.all_adj.sumstats.ma
--cojo-p 5e-08
--cojo-slct
--out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AJ/AJ.all_adj.COJO


Reading PLINK FAM file from [/home/jupyter/workspace/ws_files/SCARB2/SCARB2_AJ/AJ_SCARB2.fam].
2655 individuals to be included from [/home/jupyter/workspace

Use p threshold (0.00238) from LD pruning

In [21]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    
    #Select multiple associated SNPs using a relaxed significance threshold (--cojo-p 3.5e-4 in this cell)
    #Can change the p-value for significance if needed
    #bfile is referring to the full dataset in plink binary format, e.g. GP2 - whatever you used to run the GWAS
    ! /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64 \
    --bfile {WORK_DIR}/{ancestry}_SCARB2 \
    --maf 0.01 \
    --cojo-file {WORK_DIR}/{ancestry}.all_adj.sumstats.ma \
    --cojo-p 3.5e-4 \
    --cojo-slct \
    --out {WORK_DIR}/{ancestry}.all_adj.LDprune.COJO

*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version v1.94.1 Linux
* Built at Nov 15 2022 21:14:25, by GCC 8.5
* (C) 2010-present, Yang Lab, Westlake University
* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
*******************************************************************
Analysis started at 14:22:46 UTC on Tue Aug 20 2024.
Hostname: jupyterlabvertexai20240708

Accepted options:
--bfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_CAH/CAH_SCARB2
--maf 0.01
--cojo-file /home/jupyter/workspace/ws_files/SCARB2/SCARB2_CAH/CAH.all_adj.sumstats.ma
--cojo-p 0.00035
--cojo-slct
--out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_CAH/CAH.all_adj.LDprune.COJO


Reading PLINK FAM file from [/home/jupyter/workspace/ws_files/SCARB2/SCARB2_CAH/CAH_SCARB2.fam].
851 individuals to be included from [/home/
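
The relaxed `--cojo-p` threshold used here is described as coming from LD pruning. One common way to obtain such a threshold is a Bonferroni correction over the number of approximately independent variants left after pruning. The sketch below shows that arithmetic only. The variant counts are assumptions chosen for illustration, since the pruned counts are not shown in this section: 0.05 divided by 21 gives roughly 0.00238 (the value quoted in the heading above), and 0.05 divided by 143 gives roughly 3.5e-4 (the value passed to `--cojo-p`). Whether either count matches the actual pruning results is not confirmed by the notebook.

```python
def bonferroni_threshold(n_independent: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold for n independent tests."""
    if n_independent < 1:
        raise ValueError("n_independent must be at least 1")
    return alpha / n_independent


# Hypothetical counts of LD-independent variants (not taken from the notebook)
for n in (21, 143):
    print(n, f"{bonferroni_threshold(n):.3g}")

threshold_21 = bonferroni_threshold(21)
threshold_143 = bonferroni_threshold(143)
```

Stating the pruned variant count next to the threshold would make it easier to see which value each COJO run was meant to use.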

In [44]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
    
    #Select a fixed number (10) of top associated SNPs by stepwise selection (--cojo-top-SNPs); no p-value threshold is applied
    #bfile is referring to the full dataset in plink binary format, e.g. GP2 - whatever you used to run the GWAS
    ! /home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64 \
    --bfile {WORK_DIR}/{ancestry}_SCARB2 \
    --maf 0.01 \
    --cojo-file {WORK_DIR}/{ancestry}.all_adj.sumstats.ma \
    --cojo-top-SNPs 10 \
    --out {WORK_DIR}/{ancestry}.all_adj.top10

*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version v1.94.1 Linux
* Built at Nov 15 2022 21:14:25, by GCC 8.5
* (C) 2010-present, Yang Lab, Westlake University
* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
*******************************************************************
Analysis started at 10:29:53 UTC on Tue Aug 20 2024.
Hostname: jupyterlabvertexai20240708

Accepted options:
--bfile /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AJ/AJ_SCARB2
--maf 0.01
--cojo-file /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AJ/AJ.all_adj.sumstats.ma
--cojo-top-SNPs 10
--out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AJ/AJ.all_adj.top10


Reading PLINK FAM file from [/home/jupyter/workspace/ws_files/SCARB2/SCARB2_AJ/AJ_SCARB2.fam].
2655 individuals to be included from [/home/jupyter/workspace/ws_fil
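
The GCTA-COJO call above can also be assembled as an argument list, which makes it easier to run per ancestry with `subprocess` and stop on the first failure. This is a minimal sketch, assuming the same folder layout as the cell above. `build_cojo_top_cmd` is a hypothetical helper name, and only flags already used in this notebook appear in it.

```python
import os
from pathlib import Path

# Path to the GCTA binary as used in the cells above
GCTA = "/home/jupyter/tools/gcta-1.94.1-linux-kernel-3-x86_64/gcta64"


def build_cojo_top_cmd(ancestry, top_n=10, base_dir="~/workspace/ws_files/SCARB2"):
    """Return the GCTA-COJO argument list for one ancestry (hypothetical helper)."""
    # Expand "~" here because subprocess does not go through a shell
    work_dir = Path(os.path.expanduser(base_dir)) / f"SCARB2_{ancestry}"
    return [
        GCTA,
        "--bfile", str(work_dir / f"{ancestry}_SCARB2"),
        "--maf", "0.01",
        "--cojo-file", str(work_dir / f"{ancestry}.all_adj.sumstats.ma"),
        "--cojo-top-SNPs", str(top_n),
        "--out", str(work_dir / f"{ancestry}.all_adj.top{top_n}"),
    ]


cmd = build_cojo_top_cmd("AJ")
print(" ".join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

With `check=True`, a failed GCTA run raises an exception instead of letting the loop continue silently to the next ancestry.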

Burden Analyses using RVTests

In [82]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['exonic', 'nonsynonymous_lof']

#Loop over all the ancestries and the 2 variant classes - extract the exonic and nonsynonymous/LoF variants into bgzipped, tabix-indexed VCFs for RVtests
for ancestry in ancestries:
    for variant_class in variant_classes:
        
        WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
        
        # Print a progress message (for debugging purposes)
        print(f'Running plink to extract {variant_class} variants for ancestry: {ancestry}')
        
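        #Extract relevant variants
        #Note: plink2 reports the 'vcf-iid' modifier as deprecated and suggests 'vcf' + 'id-paste=iid' instead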
        ! /home/jupyter/tools/plink2 \
        --pfile {REL7_PATH}/imputed_genotypes/{ancestry}/chr4_{ancestry}_release7_vwb \
        --keep {WORK_DIR}/{ancestry}.samplestoKeep \
        --extract range {WORK_DIR}/{ancestry}_SCARB2.{variant_class}.variantstoKeep.txt \
        --recode vcf-iid \
        --out {WORK_DIR}/{ancestry}_SCARB2.{variant_class}
        
        # Print a progress message (for debugging purposes)
        print(f'Running bgzip and tabix for {variant_class} variants for ancestry: {ancestry}')
        
        ## Bgzip and Tabix (zip and index the file)
        ! bgzip -f {WORK_DIR}/{ancestry}_SCARB2.{variant_class}.vcf
        ! tabix -f -p vcf {WORK_DIR}/{ancestry}_SCARB2.{variant_class}.vcf.gz

Running plink to extract exonic variants for ancestry: AAC
PLINK v2.00a6LM 64-bit Intel (4 Jul 2024)      www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.exonic.log.
Options in effect:
  --export vcf-iid
  --extract range /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.exonic.variantstoKeep.txt
  --keep /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC.samplestoKeep
  --out /home/jupyter/workspace/ws_files/SCARB2/SCARB2_AAC/AAC_SCARB2.exonic
  --pfile /home/jupyter/workspace/gp2_tier2_eu_release7_30042024/imputed_genotypes/AAC/chr4_AAC_release7_vwb

Start time: Mon Jul 29 15:19:25 2024
Note: --export 'vcf-iid' modifier is deprecated.  Use 'vcf' + 'id-paste=iid'.
52223 MiB RAM detected, ~50656 available; reserving 26111 MiB for main
workspace.
Using up to 8 compute threads.
1111 samples (656 females, 455 males; 1111 founders) loaded f
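
Before running RVtests it can help to confirm that every expected VCF and tabix index was actually written, since a failed extraction for one ancestry would otherwise only show up later as an empty result file. This is a minimal sketch, assuming the file naming used in the loop above and the default `.tbi` index written by `tabix -p vcf`. `missing_vcf_inputs` is a hypothetical helper name.

```python
from pathlib import Path


def missing_vcf_inputs(base_dir, ancestries, variant_classes):
    """List expected .vcf.gz and .vcf.gz.tbi files that are absent or empty."""
    missing = []
    for ancestry in ancestries:
        for variant_class in variant_classes:
            vcf = (Path(base_dir) / f"SCARB2_{ancestry}"
                   / f"{ancestry}_SCARB2.{variant_class}.vcf.gz")
            # Check both the compressed VCF and its tabix index
            for path in (vcf, Path(str(vcf) + ".tbi")):
                if not path.is_file() or path.stat().st_size == 0:
                    missing.append(path)
    return missing


if __name__ == "__main__":
    ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
    variant_classes = ['exonic', 'nonsynonymous_lof']
    base = Path.home() / "workspace/ws_files/SCARB2"
    for path in missing_vcf_inputs(base, ancestries, variant_classes):
        print(f"Missing or empty: {path}")
```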

In [None]:
#Run RVtests
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['exonic', 'nonsynonymous_lof']

for ancestry in ancestries:
    for variant_class in variant_classes:
        
        WORK_DIR = f'~/workspace/ws_files/SCARB2/SCARB2_{ancestry}'
        
        # Print a progress message (for debugging purposes)
        print(f'Running RVtests for {variant_class} variants for ancestry: {ancestry}')
        
        ## RVtests with covariates 
        #Make sure the pheno and covariate file starts with these 5 columns: fid, iid, fatid, matid, sex
        #The --pheno-name flag only works when the pheno/covar file is structured properly
        ! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
        --out {WORK_DIR}/{ancestry}_SCARB2.burden.{variant_class} \
        --kernel skat,skato \
        --inVcf {WORK_DIR}/{ancestry}_SCARB2.{variant_class}.vcf.gz \
        --pheno {WORK_DIR}/{ancestry}_covariate_file.txt \
        --pheno-name PHENO \
        --gene SCARB2 \
        --geneFile ~/workspace/ws_files/TMEM175/refFlat.txt \
        --covar {WORK_DIR}/{ancestry}_covariate_file.txt \
        --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
        --freqUpper 0.01
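
The column requirement noted in the comments above can be checked on the header line before launching RVtests. This is a minimal sketch: the case-insensitive comparison of the first five columns is an assumption, the named columns are compared exactly as they are passed to `--pheno-name` and `--covar-name`, and `check_pheno_header` is a hypothetical helper name.

```python
REQUIRED_LEADING = ["fid", "iid", "fatid", "matid", "sex"]
# Names passed to --pheno-name and --covar-name in the cell above
REQUIRED_NAMED = ["PHENO", "SEX", "AGE", "PC1", "PC2", "PC3", "PC4", "PC5"]


def check_pheno_header(header_line):
    """Return a list of problems found in a pheno/covariate header line."""
    columns = header_line.split()
    problems = []
    # Assumption: the leading five column names are matched ignoring case
    leading = [name.lower() for name in columns[:5]]
    if leading != REQUIRED_LEADING:
        problems.append(
            f"first five columns are {columns[:5]}, expected {REQUIRED_LEADING}"
        )
    # Every column requested on the command line must exist in the header
    for name in REQUIRED_NAMED:
        if name not in columns:
            problems.append(f"column {name} not found")
    return problems


header = "FID IID FATID MATID SEX AGE PHENO PC1 PC2 PC3 PC4 PC5"
print(check_pheno_header(header) or "header looks fine")
```

In the notebook, the header would come from the first line of `{ancestry}_covariate_file.txt` for each ancestry.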

EUR

In [85]:
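#Check EUR exonic (all coding) variant results (SKAT)
#Note: WORK_DIR must point to the top-level ~/workspace/ws_files/SCARB2 folder here; the loops above reassign it per ancestry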
! cat {WORK_DIR}/SCARB2_EUR/EUR_SCARB2.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	20918	64	43	8475.92	0.891794	10000	1109	8475.92	1000	0	0.901713
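
The PermPvalue column in the SKAT output appears to follow (NumGreater + NumEqual/2) / ActualPerm, since that expression reproduces the EUR, SAS and AMR rows in this notebook. The sketch below recomputes it. The formula is inferred from these rows rather than taken from the RVTests documentation, and `perm_pvalue` is a hypothetical helper name.

```python
def perm_pvalue(num_greater, num_equal, actual_perm):
    """Empirical p-value from permutation counts; ties count as half (inferred)."""
    if actual_perm <= 0:
        raise ValueError("actual_perm must be positive")
    return (num_greater + 0.5 * num_equal) / actual_perm


# Counts copied from the SKAT rows in this notebook
print(round(perm_pvalue(1000, 0, 1109), 6))    # EUR exonic, notebook reports 0.901713
print(round(perm_pvalue(128, 37, 10000), 6))   # SAS exonic, notebook reports 0.01465
print(round(perm_pvalue(0, 1000, 1000), 6))    # AMR nonsynonymous_lof, notebook reports 0.5
```

In most rows NumGreater is exactly 1000 and ActualPerm is below 10000, which would be consistent with the permutations stopping early once the result is clearly non-significant.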


In [86]:
#Check EUR exonic (all coding) variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_EUR/EUR_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	20918	64	43	892.339	1	1


In [87]:
#Check EUR nonsynonymous/LoF variant results (SKAT)
! cat {WORK_DIR}/SCARB2_EUR/EUR_SCARB2.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	20918	43	26	3091.69	0.731763	10000	1389	3091.69	1000	0	0.719942


In [88]:
#Check EUR nonsynonymous/LoF variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_EUR/EUR_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	20918	43	26	4256.15	1	0.345116


AAC

In [89]:
#Check AAC exonic (all coding) variant results (SKAT)
! cat {WORK_DIR}/SCARB2_AAC/AAC_SCARB2.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	925	20	16	626.538	0.971408	10000	1062	626.538	1000	0	0.94162


In [90]:
#Check AAC exonic (all coding) variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_AAC/AAC_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	925	20	16	473.593	1	0.719617


In [91]:
#Check AAC nonsynonymous/LoF variant results (SKAT)
! cat {WORK_DIR}/SCARB2_AAC/AAC_SCARB2.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	925	11	9	547.367	0.85572	10000	1263	547.367	1000	0	0.791766


In [92]:
#Check AAC nonsynonymous/LoF variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_AAC/AAC_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	925	11	9	462.384	1	0.638153


AFR

In [93]:
#Check AFR exonic (all coding) variant results (SKAT)
! cat {WORK_DIR}/SCARB2_AFR/AFR_SCARB2.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	1062	17	12	2540.23	0.518996	10000	1937	2540.23	1000	0	0.516262


In [94]:
#Check AFR exonic (all coding) variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_AFR/AFR_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	1062	17	12	1066.59	1	0.590661


In [96]:
#Check AFR nonsynonymous/LoF variant results (SKAT)
! cat {WORK_DIR}/SCARB2_AFR/AFR_SCARB2.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	1062	9	6	2386.29	0.293053	10000	4078	2386.29	1000	0	0.245218


In [97]:
#Check AFR nonsynonymous/LoF variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_AFR/AFR_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	1062	9	6	1141.95	0.6	0.402744


AJ

In [98]:
#Check AJ exonic (all coding) variant results (SKAT)
! cat {WORK_DIR}/SCARB2_AJ/AJ_SCARB2.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	1410	5	3	187.979	0.360916	10000	3606	187.979	1000	0	0.277316


In [99]:
#Check AJ exonic (all coding) variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_AJ/AJ_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	1410	5	3	NA	NA	NA


In [100]:
#Check AJ nonsynonymous/LoF variant results (SKAT)
! cat {WORK_DIR}/SCARB2_AJ/AJ_SCARB2.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	1410	4	2	187.979	0.360916	10000	3606	187.979	1000	0	0.277316


In [101]:
#Check AJ nonsynonymous/LoF variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_AJ/AJ_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	1410	4	2	NA	NA	NA


AMR

In [102]:
#Check AMR exonic (all coding) variant results (SKAT)
! cat {WORK_DIR}/SCARB2_AMR/AMR_SCARB2.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	447	15	5	76.8106	0.466648	10000	1730	76.8106	1000	0	0.578035


In [103]:
#Check AMR exonic (all coding) variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_AMR/AMR_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	447	15	5	38.4053	0.1	0.466649


In [104]:
#Check AMR nonsynonymous/LoF variant results (SKAT)
! cat {WORK_DIR}/SCARB2_AMR/AMR_SCARB2.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	447	11	4	0	-nan	10000	1000	0	0	1000	0.5


In [105]:
#Check AMR nonsynonymous/LoF variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_AMR/AMR_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	447	11	4	NA	NA	NA


CAH

In [106]:
#Check CAH exonic (all coding) variant results (SKAT)
! cat {WORK_DIR}/SCARB2_CAH/CAH_SCARB2.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	693	15	10	4916.94	0.185781	10000	6012	4916.94	1000	0	0.166334


In [107]:
#Check CAH exonic (all coding) variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_CAH/CAH_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	693	15	10	2458.47	0	0.300097


In [108]:
#Check CAH nonsynonymous/LoF variant results (SKAT)
! cat {WORK_DIR}/SCARB2_CAH/CAH_SCARB2.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	693	8	6	1734.19	0.428815	10000	2561	1734.19	1000	0	0.390472


In [109]:
#Check CAH nonsynonymous/LoF variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_CAH/CAH_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	693	8	6	867.093	0	0.581358


CAS

In [110]:
#Check CAS exonic (all coding) variant results (SKAT)
! cat {WORK_DIR}/SCARB2_CAS/CAS_SCARB2.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	658	10	7	1034.61	0.149244	10000	8942	1034.61	1000	0	0.111832


In [111]:
#Check CAS exonic (all coding) variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_CAS/CAS_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	658	10	7	517.304	0	0.233515


In [112]:
#Check CAS nonsynonymous/LoF variant results (SKAT)
! cat {WORK_DIR}/SCARB2_CAS/CAS_SCARB2.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	658	7	4	53.0168	0.514372	10000	1537	53.0168	1000	0	0.650618


In [113]:
#Check CAS nonsynonymous/LoF variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_CAS/CAS_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	658	7	4	26.5084	0.4	0.517254


EAS

In [114]:
#Check EAS exonic (all coding) variant results (SKAT)
! cat {WORK_DIR}/SCARB2_EAS/EAS_SCARB2.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	2523	17	12	3848.26	0.728263	10000	1379	3848.26	1000	0	0.725163


In [115]:
#Check EAS exonic (all coding) variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_EAS/EAS_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	2523	17	12	2026.13	1	0.666468


In [116]:
#Check EAS nonsynonymous/LoF variant results (SKAT)
! cat {WORK_DIR}/SCARB2_EAS/EAS_SCARB2.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	2523	12	9	4024.31	0.60396	10000	1692	4024.31	1000	0	0.591017


In [117]:
#Check EAS nonsynonymous/LoF variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_EAS/EAS_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	2523	12	9	3800.46	1	0.447802


FIN

In [118]:
#Check FIN exonic (all coding) variant results (SKAT)
! cat {WORK_DIR}/SCARB2_FIN/FIN_SCARB2.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue


In [122]:
#Check FIN exonic (all coding) variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_FIN/FIN_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue


In [121]:
#Check FIN nonsynonymous/LoF variant results (SKAT)
! cat {WORK_DIR}/SCARB2_FIN/FIN_SCARB2.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue


In [123]:
#Check FIN nonsynonymous/LoF variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_FIN/FIN_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue


MDE

In [124]:
#Check MDE exonic (all coding) variant results (SKAT)
! cat {WORK_DIR}/SCARB2_MDE/MDE_SCARB2.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	379	10	5	243.498	0.489121	10000	1430	243.498	1000	0	0.699301


In [125]:
#Check MDE exonic (all coding) variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_MDE/MDE_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	379	10	5	274.609	1	0.276794


In [126]:
#Check MDE nonsynonymous/LoF variant results (SKAT)
! cat {WORK_DIR}/SCARB2_MDE/MDE_SCARB2.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	379	7	3	183.965	0.361618	10000	1971	183.965	1000	0	0.507357


In [127]:
#Check MDE nonsynonymous/LoF variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_MDE/MDE_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	379	7	3	123.624	1	0.350369


SAS

In [128]:
#Check SAS exonic (all coding) variant results (SKAT)
! cat {WORK_DIR}/SCARB2_SAS/SAS_SCARB2.burden.exonic.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	296	9	3	464.996	0.0027725	10000	10000	464.996	128	37	0.01465


In [129]:
#Check SAS exonic (all coding) variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_SAS/SAS_SCARB2.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	296	9	3	NA	NA	NA


In [130]:
#Check SAS nonsynonymous/LoF variant results (SKAT)
! cat {WORK_DIR}/SCARB2_SAS/SAS_SCARB2.burden.nonsynonymous_lof.Skat.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	Pvalue	NumPerm	ActualPerm	Stat	NumGreater	NumEqual	PermPvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	296	6	2	464.996	0.0027725	10000	10000	464.996	128	37	0.01465


In [131]:
#Check SAS nonsynonymous/LoF variant results (SKAT-O)
! cat {WORK_DIR}/SCARB2_SAS/SAS_SCARB2.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
SCARB2	4:76158738-76213824,4:76158738-76213824	296	6	2	232.498	0	0.000884866
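
Rather than inspecting each `.assoc` file with `cat`, the per-ancestry results can be gathered into one list of records. This is a minimal sketch using only the standard library, assuming the tab-separated layout and file naming shown above. `read_assoc`, `to_float` and `collect_results` are hypothetical helper names. Header-only files (as for FIN) contribute no rows, and `NA` or `-nan` P-values become `None`.

```python
import csv
from pathlib import Path


def read_assoc(path):
    """Read one RVTests .assoc file into a list of row dicts (empty if header-only)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def to_float(text):
    """Convert a numeric field to float; 'NA' and '-nan' become None."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return None if value != value else value  # NaN is the only value unequal to itself


def collect_results(base_dir, ancestries, variant_classes, tests=("Skat", "SkatO")):
    """Gather P-values from all available .assoc files under base_dir."""
    rows = []
    for ancestry in ancestries:
        for variant_class in variant_classes:
            for test in tests:
                path = (Path(base_dir) / f"SCARB2_{ancestry}"
                        / f"{ancestry}_SCARB2.burden.{variant_class}.{test}.assoc")
                if not path.is_file():
                    continue  # skip combinations that were never run
                for rec in read_assoc(path):
                    rows.append({
                        "ancestry": ancestry,
                        "variant_class": variant_class,
                        "test": test,
                        "n": int(rec["N_INFORMATIVE"]),
                        "num_var": int(rec["NumVar"]),
                        "num_poly_var": int(rec["NumPolyVar"]),
                        "pvalue": to_float(rec["Pvalue"]),
                    })
    return rows
```

The returned list can be passed to `pd.DataFrame(rows)` for sorting or filtering across ancestries.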
