TMEM175 - Single gene analysis in GP2 Neurobooster genotyping data (all ancestries)

Project: GP2 TMEM175

Version: Python/3.10.15, R/4.3.3

Notebook Overview

Description

Loading Python libraries

Set paths

Make working directory

Installing packages

Create a covariate file with GP2 data

Annotation of the gene TMEM175

Association analysis to compare allele frequencies between cases and controls

GLM analysis adjusting for gender, age, PC1-5

Burden test (SkatO, Skat, cmc, zeggini, mb, fp, cmcWald)

Conditional analysis

Loading Python libraries

In [1]:
# Use pathlib for file path manipulation
import pathlib

# Import numpy
import numpy as np

# Import pandas for tabular data
import pandas as pd

# Import plotnine: a ggplot2-style Python plotting package
from plotnine import *

# Always show all columns in a Pandas DataFrame
pd.set_option('display.max_columns', None)

Set paths

In [2]:
REL10_PATH = pathlib.Path(pathlib.Path.home(), 'workspace/gp2_tier2_eu_release10')
!ls -hal {REL10_PATH}

total 100K
dr-xr-xr-x. 1 jupyter users    0 Jul 25 13:32 clinical_data
dr-xr-xr-x. 1 jupyter users    0 Jul 25 13:32 imputed_genotypes
dr-xr-xr-x. 1 jupyter users    0 Jul 25 13:32 meta_data
dr-xr-xr-x. 1 jupyter users    0 Jul 25 13:32 raw_genotypes
dr-xr-xr-x. 1 jupyter users    0 Jul 25 13:32 raw_genotypes_flipped
-r--r--r--. 1 jupyter users 100K Jun 30 20:07 README_release10_01072025.txt
dr-xr-xr-x. 1 jupyter users    0 Jul 25 13:32 wgs


Make working directory

In [4]:
! mkdir -p ~/workspace/ws_files/TMEM175

In [3]:
WORK_DIR = "~/workspace/ws_files/TMEM175/"

In [4]:
# Check that all tools are installed
! ls /home/jupyter/tools

annovar				       plink_linux_x86_64_20190304.zip
annovar.latest.tar.gz		       prettify
intel-simplified-software-license.txt  rvtests
LICENSE				       toy.map
plink				       toy.ped
plink2				       vcf_subset
plink2_linux_x86_64_latest.zip


Install packages

RVTests

In [7]:
%%bash

#Install RVTESTS: Option 1 (~15min)
if test -e /home/jupyter/tools/rvtests; then

echo "rvtests is already installed"
else
echo "rvtests is not installed"

mkdir /home/jupyter/tools/rvtests
cd /home/jupyter/tools/rvtests

wget https://github.com/zhanxw/rvtests/releases/download/v2.1.0/rvtests_linux64.tar.gz 

tar -zxvf rvtests_linux64.tar.gz
fi

rvtests is not installed


--2025-07-24 15:16:49--  https://github.com/zhanxw/rvtests/releases/download/v2.1.0/rvtests_linux64.tar.gz
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://release-assets.githubusercontent.com/github-production-release-asset/7528862/63fbb780-29ae-11e9-91d3-99ec3c3134cc?sp=r&sv=2018-11-09&sr=b&spr=https&se=2025-07-24T16%3A10%3A13Z&rscd=attachment%3B+filename%3Drvtests_linux64.tar.gz&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skt=2025-07-24T15%3A09%3A25Z&ske=2025-07-24T16%3A10%3A13Z&sks=b&skv=2018-11-09&sig=v%2Bg0ZHemeimaLBK7%2B25gwCWFnKuu9C755inMrILqaMs%3D&jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmVsZWFzZS1hc3NldHMuZ2l0aHVidXNlcmNvbnRlbnQuY29tIiwia2V5Ijoia2V5MSIsImV4cCI6MTc1MzM3MDUwOSwibmJmIjoxNzUzMzcwMjA5LCJwYXRoIjoicmVsZWFzZWFzc2V0cHJvZHV

executable/
executable/combineKinship
executable/plink2vcf
executable/vcf2plink
executable/vcfAnnoSummaryLite
executable/explainCSI2
executable/vcf2kinship
executable/vcfIndvSummary
executable/createVCFIndex
executable/extractVCFIndex
executable/explainTabix
executable/vcf2ld_neighbor
executable/vcfConcordance
executable/vcfExtractSite
executable/vcfSummaryLite
executable/bgenFileInfo
executable/vcfPeek
executable/vcfVariantSummaryLite
executable/kinshipDecompose
executable/rvtest
executable/vcfPair
executable/vcfSummary
executable/explainCSI1
executable/vcf2ld_window
executable/queryVCFIndex
executable/bgen2vcf
executable/vcf2ld_gene
executable/vcf2geno
example/
example/README.md
example/example.vcf
example/covar.missing
example/pheno
example/experimental/
example/experimental/1k.fam
example/experimental/1k.bed
example/experimental/1k.bim
example/experimental/1k.1m.bgen
example/experimental/cmd.sh
example/experimental/1k.1m.vcf.gz
example/experimental/1k.ped
example/experimental/1k.1m

In [8]:
# Make sure the PLINK and RVTests binaries are executable
! chmod u+x /home/jupyter/tools/plink
! chmod u+x /home/jupyter/tools/plink2
! chmod 777 /home/jupyter/tools/rvtests/executable/rvtest

In [9]:
%%bash
# Make a working directory for each ancestry
#Loop over all the ancestries
for ancestry in {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'} ;
do

#Make a folder for each ancestry (-p: no error if it already exists)
mkdir -p ~/workspace/ws_files/TMEM175/TMEM175_"$ancestry"

done
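
The per-ancestry folders can also be created from Python, which returns the paths for later use. This is a minimal sketch assuming the directory layout used in this notebook; `make_ancestry_dirs` is a hypothetical helper name and is not part of the original workflow.

```python
import pathlib

ANCESTRIES = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

def make_ancestry_dirs(base_dir, ancestries=ANCESTRIES, gene='TMEM175'):
    """Create <base_dir>/<gene>_<ancestry> for each ancestry and return the paths."""
    base = pathlib.Path(base_dir).expanduser()
    created = []
    for ancestry in ancestries:
        path = base / f'{gene}_{ancestry}'
        # parents=True creates missing parent folders; exist_ok=True tolerates reruns
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Example call (path assumed from the cells above): `make_ancestry_dirs('~/workspace/ws_files/TMEM175')`.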

In [11]:
# The covariate file (excluding GBA carriers) was created in a separate notebook

Annotation of the gene

Extract the region using PLINK

Extract TMEM175 gene

TMEM175 coordinates: chr4:932387-958656 (GRCh38/hg38)
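
The `--from-bp` and `--to-bp` values passed to PLINK in the following cell equal the gene boundaries padded by 50 kb on each side. The sketch below reproduces that arithmetic; the 50 kb flank is inferred from those values, and `padded_window` is a hypothetical helper name.

```python
# Gene coordinates as stated above (GRCh38/hg38)
GENE_START = 932387
GENE_END = 958656

# Assumption: the extraction window pads the gene by 50 kb on each side
FLANK_BP = 50_000

def padded_window(start, end, flank):
    """Return (from_bp, to_bp) for a region padded by `flank` bases, clipped at position 1."""
    return max(1, start - flank), end + flank

from_bp, to_bp = padded_window(GENE_START, GENE_END, FLANK_BP)
print(from_bp, to_bp)  # 882387 1008656
```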

In [12]:
## Extract TMEM175 plus 50 kb flanks on each side using plink2
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'

    ! /home/jupyter/tools/plink2 \
    --pfile {REL10_PATH}/imputed_genotypes/{ancestry}/chr4_{ancestry}_release10_vwb \
    --chr 4 \
    --from-bp 882387 \
    --to-bp 1008656 \
    --make-bed \
    --out {WORK_DIR}/{ancestry}_TMEM175

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EUR/EUR_TMEM175.log.
Options in effect:
  --chr 4
  --from-bp 882387
  --make-bed
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EUR/EUR_TMEM175
  --pfile /home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/EUR/chr4_EUR_release10_vwb
  --to-bp 1008656

Start time: Thu Jul 24 15:23:46 2025
26046 MiB RAM detected, ~24020 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
58823 samples (25749 females, 33074 males; 58823 founders) loaded from
/home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/EUR/chr4_EUR_release10_vwb.psam.
7745089 variants loaded from
/home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/EUR/chr4_EUR_release10_vwb.pvar.
1 binary phenotype loaded (26778 cases, 10372 controls).
720

In [13]:
for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}/'
    
    ! head -n 1 {WORK_DIR}/{ancestry}_TMEM175.fam > {WORK_DIR}/{ancestry}_s1.txt

In [None]:
! head /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EUR/EUR_s1.txt

Turn binary files into VCF

In [None]:
for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    
    ## Turn binary files into VCF
    ! /home/jupyter/tools/plink2 \
    --bfile {WORK_DIR}/{ancestry}_TMEM175 \
    --keep {WORK_DIR}/{ancestry}_s1.txt \
    --export vcf \
    --out {WORK_DIR}/{ancestry}_TMEM175_v1

In [18]:
### Bgzip and Tabix (zip and index the file)
for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    ! bgzip -f {WORK_DIR}/{ancestry}_TMEM175_v1.vcf
    ! tabix -f -p vcf {WORK_DIR}/{ancestry}_TMEM175_v1.vcf.gz 

Annotate using ANNOVAR

In [19]:
## annotate using ANNOVAR

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    
    ! perl /home/jupyter/tools/annovar/table_annovar.pl {WORK_DIR}/{ancestry}_TMEM175_v1.vcf.gz /home/jupyter/tools/annovar/humandb/ -buildver hg38 \
    -out {WORK_DIR}/{ancestry}_TMEM175.annovar \
    -remove -protocol refGene,clinvar_20140902 \
    -operation g,f \
    --nopolish \
    -nastring . \
    -vcfinput


NOTICE: Running with system command <convert2annovar.pl  -includeinfo -allsample -withfreq -format vcf4 /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EUR/EUR_TMEM175_v1.vcf.gz > /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EUR/EUR_TMEM175.annovar.avinput>
NOTICE: Finished reading 7216 lines from VCF file
NOTICE: A total of 7209 locus in VCF file passed QC threshold, representing 6826 SNPs (5193 transitions and 1633 transversions) and 383 indels/substitutions
NOTICE: Finished writing allele frequencies based on 6826 SNP genotypes (5193 transitions and 1633 transversions) and 383 indels/substitutions for 1 samples

NOTICE: Running with system command </home/jupyter/tools/annovar/table_annovar.pl /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EUR/EUR_TMEM175.annovar.avinput /home/jupyter/tools/annovar/humandb/ -buildver hg38 -outfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_EUR/EUR_TMEM175.annovar -remove -protocol refGene,clinvar_20140902 -operation g,f --nopolish -nast

In [56]:
# Read in ANNOVAR multianno file
gene = pd.read_csv('~/workspace/ws_files/TMEM175/TMEM175_SAS/SAS_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
display(gene)

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,clinvar_20140902,Otherinfo1,Otherinfo2,Otherinfo3,Otherinfo4,Otherinfo5,Otherinfo6,Otherinfo7,Otherinfo8,Otherinfo9,Otherinfo10,Otherinfo11,Otherinfo12,Otherinfo13
0,4,882397,882397,C,T,intronic,GAK,.,.,.,.,0,.,.,4,882397,chr4:882397:C:T,C,T,.,.,PR,GT,0/0
1,4,882432,882432,G,A,intronic,GAK,.,.,.,.,0,.,.,4,882432,chr4:882432:G:A,G,A,.,.,PR,GT,0/0
2,4,882530,882530,C,T,intronic,GAK,.,.,.,.,0,.,.,4,882530,chr4:882530:C:T,C,T,.,.,PR,GT,0/0
3,4,882566,882566,G,-,intronic,GAK,.,.,.,.,0,.,.,4,882565,chr4:882565:AG:A,AG,A,.,.,PR,GT,0/0
4,4,882611,882611,T,C,intronic,GAK,.,.,.,.,0,.,.,4,882611,chr4:882611:T:C,T,C,.,.,PR,GT,0/0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2098,4,1008337,1008337,C,T,intergenic,"IDUA,FGFRL1",dist=3773;dist=3289,.,.,.,0.5,.,.,4,1008337,chr4:1008337:C:T,C,T,.,.,PR,GT,0/1
2099,4,1008371,1008371,C,-,intergenic,"IDUA,FGFRL1",dist=3807;dist=3255,.,.,.,0,.,.,4,1008370,chr4:1008370:TC:T,TC,T,.,.,PR,GT,0/0
2100,4,1008480,1008480,A,G,intergenic,"IDUA,FGFRL1",dist=3916;dist=3146,.,.,.,0,.,.,4,1008480,chr4:1008480:A:G,A,G,.,.,PR,GT,0/0
2101,4,1008515,1008515,C,T,intergenic,"IDUA,FGFRL1",dist=3951;dist=3111,.,.,.,0,.,.,4,1008515,chr4:1008515:C:T,C,T,.,.,PR,GT,0/0


In [57]:
gene["Func.refGene"].value_counts()

Func.refGene
intronic           1701
exonic              187
intergenic           94
UTR3                 59
UTR5                 25
downstream           24
upstream             12
exonic;splicing       1
Name: count, dtype: int64

In [58]:
# Filter exonic variants
coding = gene[gene['Func.refGene'] == 'exonic']
coding["ExonicFunc.refGene"].value_counts()

ExonicFunc.refGene
nonsynonymous SNV         104
synonymous SNV             78
stopgain                    2
frameshift deletion         2
nonframeshift deletion      1
Name: count, dtype: int64

In [59]:
#Make lists of variants to keep: all exonic variants, and nonsynonymous SNVs (missense, as coded in ANNOVAR) plus loss-of-function variants (stopgain, stoploss, frameshift)

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    print(f'WORKING ON: {ancestry}')

    # Read in ANNOVAR multianno file
    gene = pd.read_csv(f'{WORK_DIR}/{ancestry}_TMEM175.annovar.hg38_multianno.txt', sep = '\t')
    
    #Print number of variants in the different categories
    results = [] 

    utr5 = gene[gene['Func.refGene']== 'UTR5']
    intronic = gene[gene['Func.refGene']== 'intronic']
    exonic = gene[gene['Func.refGene']== 'exonic']
    utr3 = gene[gene['Func.refGene']== 'UTR3']
    coding_nonsynonymous = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] == 'nonsynonymous SNV')]
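    # Note: despite its name, this holds every exonic variant that is not a nonsynonymous SNV (synonymous SNVs plus stopgain, frameshift, etc.)
    coding_synonymous = gene[(gene['Func.refGene'] == 'exonic') & (gene['ExonicFunc.refGene'] != 'nonsynonymous SNV')]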
    lof = exonic[(exonic['ExonicFunc.refGene'] == 'stopgain') | (exonic['ExonicFunc.refGene'] == 'stoploss') | (exonic['ExonicFunc.refGene'] == 'frameshift deletion') | (exonic['ExonicFunc.refGene'] == 'frameshift insertion')]
    nonsynonymous_lof = pd.concat([coding_nonsynonymous, lof])

    print({ancestry})
    print('Total variants: ', len(gene))
    print("Intronic: ", len(intronic))
    print('UTR3: ', len(utr3))
    print('UTR5: ', len(utr5))
    print("Total exonic: ", len(exonic))
    print('  Synonymous: ', len(coding_synonymous))
    print("  Nonsynonymous: ", len(coding_nonsynonymous))
    print("nonsynonymous_lof: ", len(nonsynonymous_lof))
    results.append((gene, intronic, utr3, utr5, exonic, coding_synonymous, coding_nonsynonymous, nonsynonymous_lof))
    print('\n')

    # Save in PLINK range format - nonsynonymous SNVs plus loss-of-function variants
    # ANNOVAR codes missense variants as "nonsynonymous SNV"; stopgain/stoploss and frameshift variants have their own ExonicFunc.refGene labels
    variants_toKeep = nonsynonymous_lof[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep.to_csv(f'{WORK_DIR}/{ancestry}_TMEM175.nonsynonymous_lof.variantstoKeep.txt', sep="\t", index=False, header=False)


    # Save in PLINK format - all coding variants
    variants_toKeep2 = exonic[['Chr', 'Start', 'End', 'Gene.refGene']].copy()
    variants_toKeep2.to_csv(f'{WORK_DIR}/{ancestry}_TMEM175.exonic.variantstoKeep.txt', sep="\t", index=False, header=False)

WORKING ON: EUR
{'EUR'}
Total variants:  7209
Intronic:  5651
UTR3:  214
UTR5:  82
Total exonic:  731
  Synonymous:  313
  Nonsynonymous:  418
nonsynonymous_lof:  437


WORKING ON: AMR
{'AMR'}
Total variants:  3883
Intronic:  3134
UTR3:  102
UTR5:  42
Total exonic:  323
  Synonymous:  150
  Nonsynonymous:  173
nonsynonymous_lof:  180


WORKING ON: MDE
{'MDE'}
Total variants:  2741
Intronic:  2182
UTR3:  77
UTR5:  34
Total exonic:  257
  Synonymous:  123
  Nonsynonymous:  134
nonsynonymous_lof:  136


WORKING ON: AJ
{'AJ'}
Total variants:  1378
Intronic:  1086
UTR3:  46
UTR5:  7
Total exonic:  141
  Synonymous:  66
  Nonsynonymous:  75
nonsynonymous_lof:  79


WORKING ON: SAS
{'SAS'}
Total variants:  2103
Intronic:  1701
UTR3:  59
UTR5:  25
Total exonic:  187
  Synonymous:  83
  Nonsynonymous:  104
nonsynonymous_lof:  108


WORKING ON: CAH
{'CAH'}
Total variants:  3430
Intronic:  2746
UTR3:  95
UTR5:  32
Total exonic:  282
  Synonymous:  128
  Nonsynonymous:  154
nonsynonymous_lof:  162
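
The "Synonymous" counts printed above come from a filter that keeps every exonic variant that is not a nonsynonymous SNV, so they also include stopgain and frameshift variants (for SAS, 83 versus the 78 synonymous SNVs in the earlier value counts). The sketch below shows one way to separate the classes explicitly. It runs on a made-up toy table rather than GP2 data, and `classify_variant` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# ExonicFunc.refGene labels treated as loss-of-function in this notebook
LOF_LABELS = {'stopgain', 'stoploss', 'frameshift deletion', 'frameshift insertion'}

def classify_variant(func, exonic_func):
    """Map one ANNOVAR row to a coarse variant class."""
    if func != 'exonic':
        return func
    if exonic_func == 'nonsynonymous SNV':
        return 'nonsynonymous'
    if exonic_func in LOF_LABELS:
        return 'lof'
    if exonic_func == 'synonymous SNV':
        return 'synonymous'
    return 'other_exonic'

# Toy table, made up for illustration
toy = pd.DataFrame({
    'Func.refGene': ['intronic', 'exonic', 'exonic', 'exonic', 'exonic', 'UTR3'],
    'ExonicFunc.refGene': ['.', 'nonsynonymous SNV', 'synonymous SNV',
                           'stopgain', 'nonframeshift deletion', '.'],
})
toy['variant_class'] = [
    classify_variant(f, e)
    for f, e in zip(toy['Func.refGene'], toy['ExonicFunc.refGene'])
]
print(toy['variant_class'].value_counts())
```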

ALL variants

assoc

glm

ASSOC

In [60]:
! pwd

/home/jupyter/workspace/Sample_Notebooks


In [65]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','FIN','MDE','SAS','CAH', 'EUR'}

for ancestry in ancestries:
    # read cov file
    cov = pd.read_csv(f'~/workspace/ws_files/{ancestry}/{ancestry}_covariate_file_noGBA.txt', sep = '\t')

    ## Make file of sample IDs to keep 
    samples_toKeep = cov[['FID', 'IID']].copy()
    samples_toKeep.to_csv(f'~/workspace/ws_files/{ancestry}/{ancestry}.noGBA.samplestoKeep', sep = '\t', index=False, header=False)

In [66]:
#Run case-control analysis using plink assoc for all variants, not adjusting for any covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    
    ! /home/jupyter/tools/plink \
    --bfile {WORK_DIR}/{ancestry}_TMEM175 \
    --keep /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.noGBA.samplestoKeep \
    --assoc \
    --allow-no-sex \
    --ci 0.95 \
    --maf 0.01 \
    --out {WORK_DIR}/{ancestry}_TMEM175.all

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.all.log.
Options in effect:
  --allow-no-sex
  --assoc
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175
  --ci 0.95
  --keep /home/jupyter/workspace/ws_files/AAC/AAC.noGBA.samplestoKeep
  --maf 0.01
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.all

26046 MB RAM detected; reserving 13023 MB for main workspace.
3520 variants loaded from .bim file.
1215 people (532 males, 683 females) loaded from .fam.
1133 phenotype values loaded from .fam.
--keep: 1207 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1207 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate in remaining samples is 0.998418.
2916 variants re

In [68]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    #Look at assoc results, only variants with nominal p-value < 0.05
    freq = pd.read_csv(f'{WORK_DIR}/{ancestry}_TMEM175.all.assoc', sep=r'\s+')
    sig_all_nonadj = freq[freq['P']<0.05]
    
    print(f'Variants with p-value < 0.05: {sig_all_nonadj.shape}')
    
    #Save FREQ to csv
    freq.to_csv(f'{WORK_DIR}/{ancestry}.all_nonadj.csv')

WORKING ON: AAC
Variants with p-value < 0.05: (23, 13)
WORKING ON: AFR
Variants with p-value < 0.05: (76, 13)
WORKING ON: AJ
Variants with p-value < 0.05: (150, 13)
WORKING ON: AMR
Variants with p-value < 0.05: (297, 13)
WORKING ON: CAS
Variants with p-value < 0.05: (8, 13)
WORKING ON: EAS
Variants with p-value < 0.05: (203, 13)
WORKING ON: EUR
Variants with p-value < 0.05: (218, 13)
WORKING ON: FIN
Variants with p-value < 0.05: (14, 13)
WORKING ON: MDE
Variants with p-value < 0.05: (107, 13)
WORKING ON: SAS
Variants with p-value < 0.05: (10, 13)
WORKING ON: CAH
Variants with p-value < 0.05: (88, 13)


In [76]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','FIN','MDE','SAS','CAH', 'EUR'}

for ancestry in ancestries:
    # read cov file
    cov = pd.read_csv(f'~/workspace/ws_files/{ancestry}/{ancestry}_covariate_file_noGBA.txt', sep = '\t')
    cov = cov.dropna()
    cov.columns = ['FID', 'IID', 'FATID', 'MATID', 'SEX', 'AGE', 'PHENO', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']

    ## save files 
    cov.to_csv(f'~/workspace/ws_files/{ancestry}/{ancestry}.noGBA.noNA.txt', sep = '\t', index=False)

In [13]:
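from sklearn.decomposition import PCA

# Re-orthogonalise the five PCs for AMR: rotate them with a PCA so the resulting columns are uncorrelated
cov = pd.read_csv('~/workspace/ws_files/AMR/AMR_covariate_file_noGBA.txt', sep = '\t')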
pcs = cov[['PC1', 'PC2', 'PC3', 'PC4', 'PC5']]
pca = PCA(n_components=5)
pcs_ortho = pca.fit_transform(pcs)
for i in range(5):
    cov[f'PC{i+1}_ortho'] = pcs_ortho[:, i]

# (Optional) Drop old PCs
cov = cov.drop(columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])
cov = cov.rename(columns={
    'PC1_ortho': 'PC1',
    'PC2_ortho': 'PC2',
    'PC3_ortho': 'PC3',
    'PC4_ortho': 'PC4',
    'PC5_ortho': 'PC5'
})

cov.to_csv('~/workspace/ws_files/AMR/AMR_covariate_file_noGBA.orthoPCs.txt', sep='\t', index=False)

In [16]:
cov = pd.read_csv(f'~/workspace/ws_files/AMR/AMR_covariate_file_noGBA.orthoPCs.txt', sep = '\t')
cov = cov.dropna()
cov.to_csv('~/workspace/ws_files/AMR/AMR_covariate_file_noGBA.orthoPCs.txt', sep='\t', index=False)
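
The effect of the PCA rotation used for AMR can be checked on synthetic data. The sketch below uses NumPy's SVD on a centred matrix, which should give the same scores as `PCA.fit_transform` up to column signs, and compares the largest off-diagonal correlation before and after. The input values are random numbers, not GP2 PCs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "PCs": five correlated columns (synthetic data, not GP2 values)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 5))
pcs = latent @ mixing

# PCA rotation via SVD of the centred matrix; the scores are U * S
centred = pcs - pcs.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
scores = u * s

# Largest absolute off-diagonal correlation before and after the rotation
off_diag = ~np.eye(5, dtype=bool)
max_before = np.abs(np.corrcoef(pcs, rowvar=False)[off_diag]).max()
max_after = np.abs(np.corrcoef(scores, rowvar=False)[off_diag]).max()
print(max_before, max_after)
```

After the rotation the columns are uncorrelated up to floating-point error, while the sample ordering is unchanged.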

In [7]:
#Run case-control analysis with covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'

    ! /home/jupyter/tools/plink2 \
    --bfile {WORK_DIR}/{ancestry}_TMEM175 \
    --keep /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.noGBA.samplestoKeep \
    --allow-no-sex \
    --maf 0.01 \
    --ci 0.95 \
    --glm \
    --covar /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.noGBA.noNA.txt \
    --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
    --covar-variance-standardize \
    --neg9-pheno-really-missing \
    --out {WORK_DIR}/{ancestry}_TMEM175.all_adj

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.all_adj.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175
  --ci 0.95
  --covar /home/jupyter/workspace/ws_files/AAC/AAC.noGBA.noNA.txt
  --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5
  --covar-variance-standardize
  --glm
  --keep /home/jupyter/workspace/ws_files/AAC/AAC.noGBA.samplestoKeep
  --maf 0.01
  --neg9-pheno-really-missing
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.all_adj

Start time: Fri Jul 25 08:23:36 2025
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
26046 MiB RAM detected, ~24894 available; reserving 13023 MiB for main

In [17]:
WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_AMR'

! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/AMR_TMEM175 \
--keep /home/jupyter/workspace/ws_files/AMR/AMR.noGBA.samplestoKeep \
--allow-no-sex \
--maf 0.01 \
--ci 0.95 \
--glm \
--covar /home/jupyter/workspace/ws_files/AMR/AMR_covariate_file_noGBA.orthoPCs.txt \
--covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
--covar-variance-standardize \
--neg9-pheno-really-missing \
--out {WORK_DIR}/AMR_TMEM175.all_adj

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AMR/AMR_TMEM175.all_adj.log.
Options in effect:
  --allow-no-sex
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AMR/AMR_TMEM175
  --ci 0.95
  --covar /home/jupyter/workspace/ws_files/AMR/AMR_covariate_file_noGBA.orthoPCs.txt
  --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5
  --covar-variance-standardize
  --glm
  --keep /home/jupyter/workspace/ws_files/AMR/AMR.noGBA.samplestoKeep
  --maf 0.01
  --neg9-pheno-really-missing
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AMR/AMR_TMEM175.all_adj

Start time: Fri Jul 25 08:37:12 2025
Note: --allow-no-sex no longer has any effect.  (Missing-sex samples are
automatically excluded from association analysis when sex is a covariate, and
treated normally otherwise.)
26046 MiB RAM detected, ~24828 available; reserving

In [18]:
#Process results from plink glm analysis including covariates
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']

for ancestry in ancestries:

    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    print(f'WORKING ON: {ancestry}')
    
    #Read in plink glm results
    assoc = pd.read_csv(f'{WORK_DIR}/{ancestry}_TMEM175.all_adj.PHENO1.glm.logistic.hybrid', sep=r'\s+')

    #Keep only the additive (ADD) rows - these are the per-variant results; the other rows are covariate effects
    assoc_add = assoc[assoc['TEST']=="ADD"]
    
    #Check if there are any nominally significant (p < 0.05) variants
    significant = assoc_add[assoc_add['P']<0.05]

    print(f'There are {len(significant)} variants with p-value < 0.05 in glm')
    
    #Check if there are any genome-wide significant (p < 5e-8) variants
    GWsignificant = assoc_add[assoc_add['P']<5e-8]

    print(f'There are {len(GWsignificant)} variants with p-value < 5e-8 in glm')
    
    #Save assoc_add to csv
    assoc_add.to_csv(f'{WORK_DIR}/{ancestry}.all_adj.csv')

WORKING ON: AAC
There are 33 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm

WORKING ON: AFR
There are 55 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm

WORKING ON: AJ
There are 152 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm

WORKING ON: AMR
There are 24 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm

WORKING ON: CAS
There are 3 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm

WORKING ON: EAS
There are 55 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm

WORKING ON: EUR
There are 171 variants with p-value < 0.05 in glm
There are 33 variants with p-value < 5e-8 in glm

WORKING ON: FIN
There are 10 variants with p-value < 0.05 in glm
There are 2 variants with p-value < 5e-8 in glm

WORKING ON: MDE
There are 94 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm

WORKING ON: SAS
There are 66 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm

WORKING ON: CAH
There are 67 variants with p-value < 0.05 in glm
There are 0 variants with p-value < 5e-8 in glm

Format association files for meta-analysis (AFR/AJ/AMR/EAS/EUR)

In [26]:
# Define ancestry groups and their sample sizes
ancestry_sample_sizes = {
    'AFR': 3316,
    'AJ': 1693,
    'AMR': 3319,
    'EAS': 4975,
    'EUR': 33870
}

# Output container (optional)
cleaned_results = {}

# Loop over each ancestry
for ancestry, sample_size in ancestry_sample_sizes.items():
    # Load PLINK association result file
    file_path = f'/home/jupyter/workspace/ws_files/TMEM175/TMEM175_{ancestry}/{ancestry}.all_adj.csv'
    df = pd.read_csv(file_path)

    # Build MARKERNAME
    df['MARKERNAME'] = 'chr' + df['#CHROM'].astype(str) + ':' + df['POS'].astype(str)

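    # plink2 reports the OR for the A1 allele, which is not always ALT:
    # take the effect allele from A1 and the other allele from REF/ALT (assumes biallelic variants)
    df['EA'] = df['A1']
    df['NEA'] = np.where(df['A1'] == df['ALT'], df['REF'], df['ALT'])

    # Rename columns
    df.rename(columns={
        '#CHROM': 'CHROMOSOME',
        'POS': 'POSITION',
        'A1_FREQ': 'EAF',
        'LOG(OR)_SE': 'SE',
        'L95': 'OR_95L',
        'U95': 'OR_95U'
    }, inplace=True)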

    # Add BETA = log(OR)
    df['BETA'] = np.log(df['OR'])

    # Add sample size
    df['N'] = sample_size

    # Keep desired columns
    final_cols = [
        'MARKERNAME', 'CHROMOSOME', 'POSITION', 'NEA', 'EA', 'EAF',
        'BETA', 'OR', 'SE', 'OR_95L', 'OR_95U', 'P', 'N'
    ]
    df_cleaned = df[final_cols]

    # Store or save
    cleaned_results[ancestry] = df_cleaned

    # Optional: save to file
    output_path = f'/home/jupyter/workspace/ws_files/TMEM175/TMEM175_{ancestry}/{ancestry}.cleaned_assoc.txt.gz'
    df_cleaned.to_csv(output_path, sep='\t', index=False)

In [27]:
df = pd.read_csv('/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AJ/AJ.cleaned_assoc.txt.gz', sep='\t')
df

Unnamed: 0,MARKERNAME,CHROMOSOME,POSITION,NEA,EA,EAF,BETA,OR,SE,OR_95L,OR_95U,P,N
0,chr4:882530,4,882530,C,T,0.249648,0.278139,1.320670,0.130923,1.021770,1.70702,0.033631,1693
1,chr4:882611,4,882611,T,C,0.039216,0.259213,1.295910,0.292887,0.729911,2.30081,0.376139,1693
2,chr4:883018,4,883018,G,A,0.036765,0.207095,1.230100,0.293740,0.691686,2.18762,0.480791,1693
3,chr4:883133,4,883133,G,A,0.015406,-0.502955,0.604741,0.366620,0.294783,1.24062,0.170104,1693
4,chr4:883483,4,883483,G,A,0.025210,-0.468553,0.625907,0.295532,0.350714,1.11703,0.112862,1693
...,...,...,...,...,...,...,...,...,...,...,...,...,...
409,chr4:1007954,4,1007954,T,C,0.267157,0.015736,1.015860,0.121593,0.800447,1.28924,0.897053,1693
410,chr4:1008009,4,1008009,C,T,0.432489,0.060521,1.062390,0.109869,0.856571,1.31766,0.581740,1693
411,chr4:1008209,4,1008209,C,T,0.044210,-0.110261,0.895600,0.245206,0.553853,1.44822,0.652950,1693
412,chr4:1008337,4,1008337,C,T,0.267157,0.015736,1.015860,0.121593,0.800447,1.28924,0.897053,1693
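
The BETA and SE columns can be sanity-checked against the odds ratio and its confidence limits. The sketch below does this for the first AJ row shown above, assuming the interval is a Wald 95% interval on the log-odds scale, so that log(U95) - log(L95) = 2 × 1.959964 × SE.

```python
import math

# First row of the AJ table above (chr4:882530)
odds_ratio, l95, u95 = 1.320670, 1.021770, 1.70702
beta_reported, se_reported = 0.278139, 0.130923

# BETA is the natural log of the odds ratio
beta = math.log(odds_ratio)

# Assumption: Wald 95% interval, so the log-scale width is 2 * z * SE
Z_95 = 1.959964
se_from_ci = (math.log(u95) - math.log(l95)) / (2 * Z_95)

print(round(beta, 6), round(se_from_ci, 6))
```

Both values agree with the reported columns to within rounding of the printed inputs.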


Create PCs file for meta-analysis

In [None]:
# Define ancestries and base path
ancestries = ['AFR', 'AJ', 'AMR', 'EAS', 'EUR']
base_path = '~/workspace/ws_files/'

# List to store each ancestry's PC dataframe
dfs = []

for ancestry in ancestries:
    file_path = f'{base_path}{ancestry}/{ancestry}.noGBA.noNA.txt'
    
    # Read the file
    df = pd.read_csv(file_path, sep='\t')
    
    # Keep the sample ID and PC1-PC5
    df_pcs = df[['IID','PC1', 'PC2', 'PC3', 'PC4', 'PC5']].copy()
    
    # Add group label
    df_pcs['GROUP'] = ancestry
    
    # Store
    dfs.append(df_pcs)

# Combine all ancestries
merged_df = pd.concat(dfs, ignore_index=True)

# View merged structure
merged_df

In [None]:
# Drop the per-ancestry PCs; the projected PCs from the release metadata are merged in below
merged_df = merged_df.drop(columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])
merged_df

In [None]:
pcs = pd.read_csv(f'{REL10_PATH}/meta_data/qc_metrics/projected_pcs_vwb.csv')
pcs

In [None]:
# Merge on IID, keeping all rows from merged_df
final_df = merged_df.merge(pcs.drop(columns=['FID']), on='IID', how='left')

# View result
final_df

In [13]:
final_df.to_csv('~/workspace/ws_files/PCs_meta.txt', sep='\t', index=False)
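
Because the merge is a left join, samples without projected PCs are kept with missing values. A minimal sketch of how such samples could be listed, using pandas' `indicator=True` option on made-up IDs and values (not GP2 data):

```python
import pandas as pd

# Toy sample list and projected PCs (made-up IDs and values)
samples = pd.DataFrame({'IID': ['s1', 's2', 's3'], 'GROUP': ['AJ', 'EUR', 'EUR']})
projected = pd.DataFrame({'FID': ['0', '0'], 'IID': ['s1', 's3'], 'PC1': [0.01, -0.02]})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
merged = samples.merge(projected.drop(columns=['FID']), on='IID', how='left', indicator=True)

# Samples that found no match in the projected PCs table
missing = merged.loc[merged['_merge'] == 'left_only', 'IID'].tolist()
print(missing)  # ['s2']
```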

Burden Analyses using RVTests

In [4]:
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['exonic', 'nonsynonymous_lof']

#Loop over all the ancestries and the 2 variant classes - extract the exonic and the nonsynonymous + LoF variants as VCF for rvtests
for ancestry in ancestries:
    for variant_class in variant_classes:
        
        WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
        
        # Print a progress message
        print(f'Running plink to extract {variant_class} variants for ancestry: {ancestry}')
        
        #Extract relevant variants
        ! /home/jupyter/tools/plink2 \
        --pfile {REL10_PATH}/imputed_genotypes/{ancestry}/chr4_{ancestry}_release10_vwb \
        --keep /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.noGBA.samplestoKeep \
        --extract range {WORK_DIR}/{ancestry}_TMEM175.{variant_class}.variantstoKeep.txt \
        --recode vcf-iid \
        --out {WORK_DIR}/{ancestry}_TMEM175.{variant_class}
        
        # Print a progress message
        print(f'Running bgzip and tabix for {variant_class} variants for ancestry: {ancestry}')
        
        ## Bgzip and Tabix (zip and index the file)
        ! bgzip -f {WORK_DIR}/{ancestry}_TMEM175.{variant_class}.vcf
        ! tabix -f -p vcf {WORK_DIR}/{ancestry}_TMEM175.{variant_class}.vcf.gz

Running plink to extract exonic variants for ancestry: AAC
PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.exonic.log.
Options in effect:
  --export vcf-iid
  --extract range /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.exonic.variantstoKeep.txt
  --keep /home/jupyter/workspace/ws_files/AAC/AAC.noGBA.samplestoKeep
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.exonic
  --pfile /home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/AAC/chr4_AAC_release10_vwb

Start time: Fri Jul 25 13:38:41 2025
Note: --export 'vcf-iid' modifier is deprecated.  Use 'vcf' + 'id-paste=iid'.
26046 MiB RAM detected, ~24836 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
1215 samples (683 females, 532 males; 1215 founders) loaded from
/h

In [None]:
#Run RVtests
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['exonic', 'nonsynonymous_lof']

for ancestry in ancestries:
    for variant_class in variant_classes:
        
        WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
        
        # Print a progress message
        print(f'Running RVtests for {variant_class} variants for ancestry: {ancestry}')
        
        ## RVtests with covariates 
        #Make sure the pheno and covariate file starts with these first 5 columns: fid, iid, fatid, matid, sex
        #The --pheno-name flag only works when the pheno/covar file is structured properly
        ! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
        --out {WORK_DIR}/{ancestry}_TMEM175.burden.{variant_class} \
        --kernel skato \
        --inVcf {WORK_DIR}/{ancestry}_TMEM175.{variant_class}.vcf.gz \
        --pheno /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}_covariate_file_noGBA.txt \
        --pheno-name PHENO \
        --gene TMEM175 \
        --geneFile /home/jupyter/workspace/ws_files/refFlat_HG38.txt \
        --covar /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}_covariate_file_noGBA.txt \
        --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5 \
        --freqUpper 0.01
# --burden cmc,zeggini,mb,fp,cmcWald --kernel skat,skato \
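
The comment in the cell above requires the pheno/covariate file to begin with fid, iid, fatid, matid, sex. A minimal sketch of a pre-flight check on a loaded table follows. `check_rvtests_header` is a hypothetical helper (not part of RVtests), and the case-insensitive comparison of the first five column names is an assumption.

```python
import pandas as pd

# Hypothetical helper, not part of RVtests.
REQUIRED_PREFIX = ['fid', 'iid', 'fatid', 'matid', 'sex']
MODEL_COLUMNS = ('PHENO', 'SEX', 'AGE', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5')

def check_rvtests_header(df, needed=MODEL_COLUMNS):
    """Return a list of problems found in a pheno/covariate table (empty list = OK)."""
    problems = []
    # Assumption: the five leading column names are compared case-insensitively
    first_five = [str(c).lower() for c in df.columns[:5]]
    if first_five != REQUIRED_PREFIX:
        problems.append(f'first five columns are {list(df.columns[:5])}')
    # Columns named in --pheno-name / --covar-name must be present
    missing = [c for c in needed if c not in df.columns]
    if missing:
        problems.append(f'missing columns: {missing}')
    return problems

# Toy header only (no real samples)
demo = pd.DataFrame(columns=['FID', 'IID', 'FATID', 'MATID', 'SEX', 'PHENO',
                             'AGE', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5'])
print(check_rvtests_header(demo))
```

In the notebook this could be called on `pd.read_csv(..., sep='\t')` of each `{ancestry}_covariate_file_noGBA.txt` before the loop is launched.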

Look at RVtests SKAT-O results

In [6]:
WORK_DIR = "~/workspace/ws_files/TMEM175"

In [46]:
%load_ext rpy2.ipython



In [7]:
#Check EUR all_coding variant results
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	40128	128	103	800191	0	0.40194


In [8]:
#Check EUR nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	40128	69	56	76157.3	1	0.856905


In [9]:
#Check AAC all_coding variant results
! cat {WORK_DIR}/TMEM175_AAC/AAC_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1061	59	46	36140.1	0	0.628155


In [10]:
#Check AAC nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AAC/AAC_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1061	38	32	25101.9	0	0.524219


In [11]:
#Check AFR all_coding variant results
! cat {WORK_DIR}/TMEM175_AFR/AFR_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	2738	76	61	37231.7	1	0.738165


In [12]:
#Check AFR nonsynonymous_lof results
! cat {WORK_DIR}/TMEM175_AFR/AFR_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	2738	47	37	22926.7	1	0.743344


In [13]:
#Check AJ all_coding variant results
! cat {WORK_DIR}/TMEM175_AJ/AJ_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	2174	24	18	21568.3	0	0.746508


In [14]:
#Check AJ nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AJ/AJ_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	2174	15	12	14410	1	0.580701


In [15]:
#Check AMR all_coding variant results
! cat {WORK_DIR}/TMEM175_AMR/AMR_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	3265	72	53	41022.6	0	0.62126


In [16]:
#Check AMR nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AMR/AMR_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	3265	42	29	5872.81	1	0.829141


In [17]:
#Check CAS all_coding variant results
! cat {WORK_DIR}/TMEM175_CAS/CAS_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1104	39	22	22431.3	1	0.179996


In [18]:
#Check CAS nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_CAS/CAS_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1104	26	12	7573.21	0	0.261768


In [19]:
#Check EAS all_coding variant results
! cat {WORK_DIR}/TMEM175_EAS/EAS_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	4670	52	34	32480.5	0	0.85871


In [20]:
#Check EAS nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_EAS/EAS_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	4670	34	21	28568.2	0	0.401021


In [23]:
#Check MDE all_coding variant results
! cat {WORK_DIR}/TMEM175_MDE/MDE_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1230	47	30	17800.6	0	0.42309


In [24]:
#Check MDE nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_MDE/MDE_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1230	24	19	18140.9	0	0.0663344


In [25]:
#Check SAS all_coding variant results
! cat {WORK_DIR}/TMEM175_SAS/SAS_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	689	25	13	4939.3	0	0.523074


In [26]:
#Check SAS nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_SAS/SAS_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	689	18	10	3602.35	1	0.546915


In [27]:
#Check CAH all_coding variant results
! cat {WORK_DIR}/TMEM175_CAH/CAH_TMEM175.burden.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	900	47	35	12161	1	0.642748


In [28]:
#Check CAH nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_CAH/CAH_TMEM175.burden.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	900	30	24	15084.2	0	0.433118
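
The per-ancestry `cat` cells above can also be collapsed into one table. A minimal sketch, assuming the output files follow the `{ancestry}_TMEM175.{tag}.{variant_class}.SkatO.assoc` naming used in this notebook; `collect_skato` is a hypothetical helper, and runs without an output file are kept as empty rows so that they remain visible.

```python
import pathlib
import pandas as pd

def collect_skato(base_dir, ancestries, variant_classes, tag='burden'):
    """Stack SkatO.assoc files into one table, one row per ancestry and variant class."""
    base_dir = pathlib.Path(base_dir).expanduser()
    frames = []
    for ancestry in ancestries:
        for variant_class in variant_classes:
            path = (base_dir / f'TMEM175_{ancestry}'
                    / f'{ancestry}_TMEM175.{tag}.{variant_class}.SkatO.assoc')
            if path.exists():
                res = pd.read_csv(path, sep='\t').drop(columns='RANGE')
            else:
                # Placeholder row: the run produced no output file
                res = pd.DataFrame([{}])
            res.insert(0, 'variant_class', variant_class)
            res.insert(0, 'ancestry', ancestry)
            frames.append(res)
    return pd.concat(frames, ignore_index=True)

# Example call (paths as used in this notebook):
# summary = collect_skato('~/workspace/ws_files/TMEM175',
#                         ['AAC', 'AFR', 'AJ'], ['exonic', 'nonsynonymous_lof'])
```

Passing `tag='GWAhits'` would read the conditional results written by the later RVtests run instead.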


Add GWAS hits as covariates for TMEM175

In [29]:
## extract p.M393T using plink
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'

    ! /home/jupyter/tools/plink2 \
    --pfile {REL10_PATH}/imputed_genotypes/{ancestry}/chr4_{ancestry}_release10_vwb \
    --chr 4 \
    --from-bp 958159 \
    --to-bp 958159 \
    --make-bed \
    --out {WORK_DIR}/{ancestry}_M393T

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_M393T.log.
Options in effect:
  --chr 4
  --from-bp 958159
  --make-bed
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_M393T
  --pfile /home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/AAC/chr4_AAC_release10_vwb
  --to-bp 958159

Start time: Fri Jul 25 14:06:54 2025
26046 MiB RAM detected, ~24890 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
1215 samples (683 females, 532 males; 1215 founders) loaded from
/home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/AAC/chr4_AAC_release10_vwb.psam.
4117743 variants loaded from
/home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/AAC/chr4_AAC_release10_vwb.pvar.
1 binary phenotype loaded (370 cases, 763 controls).
1 variant remainin

In [32]:
## recode p.M393T as allele dosage using plink
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'

    !/home/jupyter/tools/plink2 \
        --bfile {WORK_DIR}/{ancestry}_M393T \
        --keep /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.noGBA.samplestoKeep \
        --recode A \
        --out {WORK_DIR}/{ancestry}_M393T

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_M393T.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_M393T
  --export A
  --keep /home/jupyter/workspace/ws_files/AAC/AAC.noGBA.samplestoKeep
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_M393T

Start time: Fri Jul 25 14:12:58 2025
26046 MiB RAM detected, ~24880 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
1215 samples (683 females, 532 males; 1215 founders) loaded from
/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_M393T.fam.
1 variant loaded from
/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_M393T.bim.
1 binary phenotype loaded (370 cases, 763 controls).
--keep: 1207 samples remaining.
1207 samples (680 females, 527 males; 1207 founders) remai

In [36]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','FIN','MDE','SAS','CAH', 'EUR'}

for ancestry in ancestries:
    # read cov file
    cov = pd.read_csv(f'~/workspace/ws_files/{ancestry}/{ancestry}_covariate_file_noGBA.txt', sep = '\t')
    M393T = pd.read_csv(f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}/{ancestry}_M393T.raw', sep = '\t')
    cov_M393T = cov.merge(M393T[['IID', 'chr4:958159:T:C_T']], on='IID', how='left')
    cov_M393T = cov_M393T.rename(columns={'chr4:958159:T:C_T': 'M393T'})
    cov_M393T = cov_M393T.dropna()

    ## save files 
    cov_M393T.to_csv(f'~/workspace/ws_files/{ancestry}/{ancestry}.cov_M393T.txt', sep = '\t', index=False)
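
The merge above hard-codes the dosage column name `chr4:958159:T:C_T`. PLINK names each `.raw` dosage column `<variant ID>_<counted allele>`, so the suffix depends on which allele was counted. A minimal sketch that looks the column up by variant ID instead; `find_dosage_column` is a hypothetical helper and the toy genotypes are made up.

```python
import pandas as pd

def find_dosage_column(raw, variant_id):
    """Return the single .raw column named '<variant_id>_<counted allele>'."""
    hits = [c for c in raw.columns if c.startswith(variant_id + '_')]
    if len(hits) != 1:
        raise ValueError(f'expected one column for {variant_id}, found {hits}')
    return hits[0]

# Toy .raw-like table (made-up genotypes, two samples)
raw = pd.DataFrame({'IID': ['s1', 's2'], 'chr4:958159:T:C_T': [0, 1]})
col = find_dosage_column(raw, 'chr4:958159:T:C')
counted_allele = col.rsplit('_', 1)[1]
dosage = raw[['IID', col]].rename(columns={col: 'M393T'})
print(counted_allele, dosage.columns.tolist())
```

Recording the counted allele alongside the renamed column makes it easier to check that the dosage is coded in the same direction in every ancestry.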

In [38]:
## extract p.Q65P using plink
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'

    ! /home/jupyter/tools/plink2 \
    --pfile {REL10_PATH}/imputed_genotypes/{ancestry}/chr4_{ancestry}_release10_vwb \
    --chr 4 \
    --from-bp 950422 \
    --to-bp 950422 \
    --make-bed \
    --out {WORK_DIR}/{ancestry}_Q65P

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_Q65P.log.
Options in effect:
  --chr 4
  --from-bp 950422
  --make-bed
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_Q65P
  --pfile /home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/AAC/chr4_AAC_release10_vwb
  --to-bp 950422

Start time: Fri Jul 25 14:23:21 2025
26046 MiB RAM detected, ~24828 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
1215 samples (683 females, 532 males; 1215 founders) loaded from
/home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/AAC/chr4_AAC_release10_vwb.psam.
4117743 variants loaded from
/home/jupyter/workspace/gp2_tier2_eu_release10/imputed_genotypes/AAC/chr4_AAC_release10_vwb.pvar.
1 binary phenotype loaded (370 cases, 763 controls).
1 variant remaining 

In [39]:
## recode p.Q65P as allele dosage using plink
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'

    !/home/jupyter/tools/plink2 \
        --bfile {WORK_DIR}/{ancestry}_Q65P \
        --keep /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.noGBA.samplestoKeep \
        --recode A \
        --out {WORK_DIR}/{ancestry}_Q65P

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_Q65P.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_Q65P
  --export A
  --keep /home/jupyter/workspace/ws_files/AAC/AAC.noGBA.samplestoKeep
  --out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_Q65P

Start time: Fri Jul 25 14:24:08 2025
26046 MiB RAM detected, ~24811 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
1215 samples (683 females, 532 males; 1215 founders) loaded from
/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_Q65P.fam.
1 variant loaded from
/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_Q65P.bim.
1 binary phenotype loaded (370 cases, 763 controls).
--keep: 1207 samples remaining.
1207 samples (680 females, 527 males; 1207 founders) remaining 

In [41]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','FIN','MDE','SAS','CAH', 'EUR'}

for ancestry in ancestries:
    # read cov file
    cov_M393T = pd.read_csv(f'~/workspace/ws_files/{ancestry}/{ancestry}.cov_M393T.txt', sep = '\t')
    Q65P = pd.read_csv(f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}/{ancestry}_Q65P.raw', sep = '\t')
    cov_Q65P = cov_M393T.merge(Q65P[['IID', 'chr4:950422:A:C_A']], on='IID', how='left')
    cov_Q65P = cov_Q65P.rename(columns={'chr4:950422:A:C_A': 'Q65P'})
    cov_Q65P = cov_Q65P.dropna()

    ## save files 
    cov_Q65P.to_csv(f'~/workspace/ws_files/{ancestry}/{ancestry}.cov_M393T_Q65P.txt', sep = '\t', index=False)
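
`dropna()` without arguments removes a sample if any column is missing, including columns that are not part of the model. The conditional runs later in this notebook report fewer informative samples than the baseline runs (for example 25226 versus 40128 in EUR), so it may be worth checking which columns account for the loss. A minimal sketch, assuming the model only needs PHENO plus the columns passed to `--covar-name`; `drop_incomplete` is a hypothetical helper and the toy table is made up.

```python
import numpy as np
import pandas as pd

# Assumption: these are the only columns the conditional model needs
MODEL_COLS = ['PHENO', 'SEX', 'AGE', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'M393T', 'Q65P']

def drop_incomplete(df, cols):
    """Drop rows missing any of `cols`; return kept rows and per-column NaN counts."""
    cols = [c for c in cols if c in df.columns]
    missing = df[cols].isna().sum()
    kept = df.dropna(subset=cols)
    return kept, missing

# Toy table: s2 lacks AGE (a model column); s1 and s3 lack OTHER (not in the model)
toy = pd.DataFrame({
    'IID': ['s1', 's2', 's3'],
    'PHENO': [1, 2, 1],
    'AGE': [60.0, np.nan, 71.0],
    'Q65P': [0.0, 1.0, 0.0],
    'OTHER': [np.nan, 1.0, np.nan],
})
kept, missing = drop_incomplete(toy, MODEL_COLS)
print(len(toy), '->', len(kept))
print(missing[missing > 0])
```

A plain `toy.dropna()` would remove all three toy samples, whereas restricting the drop to model columns removes only the sample with a missing AGE.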

Run RVtests with GWAS hits as covariates

In [42]:
#Run RVtests
ancestries = ['AAC', 'AFR', 'AJ', 'AMR', 'CAS', 'EAS', 'EUR', 'FIN', 'MDE', 'SAS', 'CAH']
variant_classes = ['exonic', 'nonsynonymous_lof']

for ancestry in ancestries:
    for variant_class in variant_classes:
        
        WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
        
        # Print the command to be executed (for debugging purposes)
        print(f'Running RVtests for {variant_class} variants for ancestry: {ancestry}')
        
        ## RVtests with covariates 
        # Make sure the pheno and covariate files start with these five columns: fid, iid, fatid, matid, sex
        # The --pheno-name flag only works when the pheno/covar file is structured this way
        ! /home/jupyter/tools/rvtests/executable/rvtest --noweb --hide-covar \
        --out {WORK_DIR}/{ancestry}_TMEM175.GWAhits.{variant_class} \
        --kernel skato \
        --inVcf {WORK_DIR}/{ancestry}_TMEM175.{variant_class}.vcf.gz \
        --pheno /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.cov_M393T_Q65P.txt \
        --pheno-name PHENO \
        --gene TMEM175 \
        --geneFile /home/jupyter/workspace/ws_files/refFlat_HG38.txt \
        --covar /home/jupyter/workspace/ws_files/{ancestry}/{ancestry}.cov_M393T_Q65P.txt \
        --covar-name SEX,AGE,PC1,PC2,PC3,PC4,PC5,M393T,Q65P \
        --freqUpper 0.01
# --burden cmc,zeggini,mb,fp,cmcWald --kernel skat,skato \

Running RVtests for exonic variants for ancestry: AAC
Thank you for using rvtests (version: 20190205, git: c86e589efef15382603300dc7f4c3394c82d69b8)
  For documentations, refer to http://zhanxw.github.io/rvtests/
  For questions and comments, plase send to Xiaowei Zhan <zhanxw@umich.edu>
  For bugs and feature requests, please submit at: https://github.com/zhanxw/rvtests/issues

The following parameters are available.  Ones with "[]" are in effect:

Available Options
      Basic Input/Output:
                          --inVcf [/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.exonic.vcf.gz]
                          --inBgen [], --inBgenSample [], --inKgg []
                          --out [/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.GWAhits.exonic]
                          --outputRaw
       Specify Covariate:
                          --covar [/home/jupyter/workspace/ws_files/AAC/AAC.cov_M393T_Q65P.txt]
                          --covar-name [SEX,

In [43]:
WORK_DIR = "~/workspace/ws_files/TMEM175"

In [44]:
#Check AAC all_coding variant results
! cat {WORK_DIR}/TMEM175_AAC/AAC_TMEM175.GWAhits.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1014	59	45	7107.84	0	0.424299


In [45]:
#Check AAC nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AAC/AAC_TMEM175.GWAhits.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1014	38	31	3376.87	0	0.622341


In [72]:
#Check AFR all_coding variant results
! cat {WORK_DIR}/TMEM175_AFR/AFR_TMEM175.GWAhits.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	2547	76	60	2155.27	1	0.873784


In [73]:
#Check AFR nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AFR/AFR_TMEM175.GWAhits.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	2547	47	37	5176.09	1	0.703047


In [76]:
#Check AJ all_coding variant results
! cat {WORK_DIR}/TMEM175_AJ/AJ_TMEM175.GWAhits.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1421	24	16	4023.79	0	0.212791


In [77]:
#Check AJ nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AJ/AJ_TMEM175.GWAhits.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1421	15	10	3156.23	0	0.240471


In [80]:
#Check AMR all_coding variant results
! cat {WORK_DIR}/TMEM175_AMR/AMR_TMEM175.GWAhits.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	3104	72	51	7510.5	0	0.5333


In [81]:
#Check AMR nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_AMR/AMR_TMEM175.GWAhits.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	3104	42	29	11430.4	1	0.143274


In [84]:
#Check CAS all_coding variant results
! cat {WORK_DIR}/TMEM175_CAS/CAS_TMEM175.GWAhits.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	861	39	18	480.407	1	0.688999


In [85]:
#Check CAS nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_CAS/CAS_TMEM175.GWAhits.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	861	26	9	NA	NA	NA


In [88]:
#Check EAS all_coding variant results
! cat {WORK_DIR}/TMEM175_EAS/EAS_TMEM175.GWAhits.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	3364	52	32	11949.2	1	0.255894


In [89]:
#Check EAS nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_EAS/EAS_TMEM175.GWAhits.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	3364	34	19	20875.8	1	0.0442927


In [92]:
#Check EUR all_coding variant results
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.GWAhits.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	25226	128	95	153490	0	0.062384


In [93]:
#Check EUR nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_EUR/EUR_TMEM175.GWAhits.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	25226	69	55	83041.7	0	0.141847


In [98]:
#Check MDE all_coding variant results
! cat {WORK_DIR}/TMEM175_MDE/MDE_TMEM175.GWAhits.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1070	47	28	3884.07	0	0.358024


In [99]:
#Check MDE nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_MDE/MDE_TMEM175.GWAhits.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	1070	24	17	1097.97	0	0.608166


In [103]:
#Check SAS all_coding variant results
! cat {WORK_DIR}/TMEM175_SAS/SAS_TMEM175.GWAhits.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	405	25	10	528.451	0	0.465113


In [104]:
#Check SAS nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_SAS/SAS_TMEM175.GWAhits.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	405	18	8	208.184	0	0.610274


In [107]:
#Check CAH all_coding variant results
! cat {WORK_DIR}/TMEM175_CAH/CAH_TMEM175.GWAhits.exonic.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	845	47	34	10121.7	1	0.273611


In [108]:
#Check CAH nonsynonymous_lof variant results
! cat {WORK_DIR}/TMEM175_CAH/CAH_TMEM175.GWAhits.nonsynonymous_lof.SkatO.assoc

Gene	RANGE	N_INFORMATIVE	NumVar	NumPolyVar	Q	rho	Pvalue
TMEM175	4:932459-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655,4:932386-958655	845	30	23	14266.3	1	0.00597753
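
To read the baseline and conditional results side by side, the nonsynonymous_lof SKAT-O p-values printed in this notebook can be placed in one table. This is a minimal sketch: the values are transcribed from the outputs above, and the Bonferroni column is only one possible correction, which assumes that only the conditional tests in this single variant class are counted.

```python
import numpy as np
import pandas as pd

# P-values transcribed from the nonsynonymous_lof SkatO.assoc outputs in this notebook
baseline = {'AAC': 0.524219, 'AFR': 0.743344, 'AJ': 0.580701, 'AMR': 0.829141,
            'CAS': 0.261768, 'EAS': 0.401021, 'EUR': 0.856905, 'MDE': 0.0663344,
            'SAS': 0.546915, 'CAH': 0.433118}
conditional = {'AAC': 0.622341, 'AFR': 0.703047, 'AJ': 0.240471, 'AMR': 0.143274,
               'CAS': np.nan,  # reported as NA by RVtests
               'EAS': 0.0442927, 'EUR': 0.141847, 'MDE': 0.608166,
               'SAS': 0.610274, 'CAH': 0.00597753}

table = pd.DataFrame({'baseline': baseline, 'conditional': conditional})

# Assumption: Bonferroni over the non-missing conditional tests in this variant class
n_tests = int(table['conditional'].notna().sum())
table['passes_bonferroni'] = table['conditional'] < 0.05 / n_tests

print(table.sort_values('conditional'))
```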


LD pruning (in EUR)

In [111]:
WORK_DIR = "~/workspace/ws_files/"

In [119]:
# Make sure to use high-quality SNPs
! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/TMEM175/TMEM175_EUR/EUR_TMEM175 \
--maf 0.01 \
--geno 0.05 \
--hwe 1e-5 0.001 \
--make-bed \
--keep /home/jupyter/workspace/ws_files/EUR/EUR.noGBA.samplestoKeep \
--exclude {WORK_DIR}/exclusion_regions_hg38.txt \
--out {WORK_DIR}/TMEM175/TMEM175_UNIMPUTED

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files//TMEM175/TMEM175_UNIMPUTED.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files//TMEM175/TMEM175_EUR/EUR_TMEM175
  --exclude /home/jupyter/workspace/ws_files//exclusion_regions_hg38.txt
  --geno 0.05
  --hwe 1e-3
  --keep /home/jupyter/workspace/ws_files/EUR/EUR.noGBA.samplestoKeep
  --maf 0.01
  --make-bed
  --out /home/jupyter/workspace/ws_files//TMEM175/TMEM175_UNIMPUTED

Start time: Fri Jul 25 15:20:54 2025
26046 MiB RAM detected, ~24770 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
58823 samples (25749 females, 33074 males; 58823 founders) loaded from
/home/jupyter/workspace/ws_files//TMEM175/TMEM175_EUR/EUR_TMEM175.fam.
7209 variants loaded from
/home/jupyter/workspace/ws_files//TMEM175/TMEM175_EUR/EUR_TMEM175.bim.
1 binary phenotyp

In [123]:
# LD-prune the filtered variants to estimate the number of independent variants in the region
! /home/jupyter/tools/plink2 \
--bfile {WORK_DIR}/TMEM175/TMEM175_UNIMPUTED \
--indep-pairwise 50 5 0.5 \
--out {WORK_DIR}/TMEM175/prune_TMEM175

PLINK v2.0.0-a.7LM 64-bit Intel (7 Jul 2025)       cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/jupyter/workspace/ws_files//TMEM175/prune_TMEM175.log.
Options in effect:
  --bfile /home/jupyter/workspace/ws_files//TMEM175/TMEM175_UNIMPUTED
  --indep-pairwise 50 5 0.5
  --out /home/jupyter/workspace/ws_files//TMEM175/prune_TMEM175

Start time: Fri Jul 25 15:24:04 2025
26046 MiB RAM detected, ~24800 available; reserving 13023 MiB for main
workspace.
Using up to 4 compute threads.
53641 samples (23385 females, 30256 males; 53641 founders) loaded from
/home/jupyter/workspace/ws_files//TMEM175/TMEM175_UNIMPUTED.fam.
366 variants loaded from
/home/jupyter/workspace/ws_files//TMEM175/TMEM175_UNIMPUTED.bim.
1 binary phenotype loaded (24208 cases, 9662 controls).
Calculating allele frequencies... done.
--indep-pairwise (1 compute thread): 259/366 variants removed.
Variant lists written to
/home/jupyter/workspace/ws_f

In [124]:
!wc -l {WORK_DIR}/TMEM175/prune_TMEM175.prune.in

107 /home/jupyter/workspace/ws_files//TMEM175/prune_TMEM175.prune.in
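
The value 4.67e-4 used below as the COJO p-value threshold matches 0.05 divided by the 107 variants retained after pruning. A minimal sketch of that arithmetic, assuming the threshold was derived as a Bonferroni correction over the pruned variants; `count_lines` and `bonferroni_threshold` are hypothetical helpers.

```python
import pathlib

def count_lines(path):
    """Count non-empty lines in a text file such as plink's .prune.in."""
    with open(pathlib.Path(path).expanduser()) as handle:
        return sum(1 for line in handle if line.strip())

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold for n_tests independent tests."""
    return alpha / n_tests

# 107 variants were retained in prune_TMEM175.prune.in (wc -l output above)
threshold = bonferroni_threshold(107)
print(f'{threshold:.2e}')

# With the file available, the count could be read directly:
# n_independent = count_lines('~/workspace/ws_files/TMEM175/prune_TMEM175.prune.in')
```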


In [None]:
# Significance threshold: 0.05 / 107 LD-independent variants = 4.67e-4

LocusZoom plot

In [2]:
## LocusZoom format
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'

    df = pd.read_csv(f'{WORK_DIR}/{ancestry}.all_adj.csv')
    df['b'] = np.log(df['OR'])
    export_ldassoc = df[['#CHROM', 'POS', 'A1', 'OMITTED', 'A1_FREQ', 'b', 'LOG(OR)_SE', 'P']].copy()
    export_ldassoc = export_ldassoc.rename(columns = {'#CHROM': 'CHROM', 'OMITTED':'A2', 'A1_FREQ':'freq', 'LOG(OR)_SE':'se', 'P':'p'})
    # save to files
    export_ldassoc.to_csv(f'{WORK_DIR}/{ancestry}.all_adj.formatted.tab', sep = '\t', index=False)



GCTA-COJO analysis

In [127]:
## COJO format
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'

    sumstats = pd.read_csv(f'{WORK_DIR}/{ancestry}.all_adj.csv')
    #Format summary statistics for GCTA-COJO
    #First get the log odds ratio - this is required for COJO
    #1) For a case-control study, the effect size should be log(odds ratio) with its corresponding standard error.
    sumstats_formatted = sumstats.copy()
    sumstats_formatted['b'] = np.log(sumstats_formatted['OR'])

    #Now select just the necessary columns for COJO
    sumstats_export = sumstats_formatted[['ID', 'A1', 'OMITTED', 'A1_FREQ', 'b', 'LOG(OR)_SE', 'P', 'OBS_CT']].copy()

    #Rename columns following COJO format
    sumstats_export = sumstats_export.rename(columns = {'ID':'SNP', 'OMITTED':'A2', 'A1_FREQ':'freq', 'LOG(OR)_SE':'se', 'P':'p', 'OBS_CT':'N'})

    #Export
    sumstats_export.to_csv(f'{WORK_DIR}/{ancestry}.all_adj.sumstats.ma', sep = '\t', index=False)
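
Before passing the `.ma` file to COJO, the effect size, standard error and p-value can be checked against each other. A minimal sketch, assuming the `P` column is a two-sided Wald p-value for `b / se`; if the association model used a different test, small differences would be expected. `wald_p` is a hypothetical helper and the toy effect size is made up.

```python
import math

def wald_p(beta, se):
    """Two-sided p-value for z = beta / se under a standard normal."""
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2))

# Toy values: OR = 1.2 with a standard error of 0.05 on the log-odds scale
beta = math.log(1.2)
print(wald_p(beta, 0.05))

# Applied to the formatted table, rows where the recomputed and reported
# p-values disagree strongly would deserve a closer look, e.g.:
# recomputed = [wald_p(b, se) for b, se in zip(sumstats_export['b'], sumstats_export['se'])]
```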



In [None]:
# Install GCTA (which provides COJO); run these commands in a terminal
! wget https://yanglab.westlake.edu.cn/software/gcta/bin/gcta-1.95.0-linux-kernel-3-x86_64.zip
! unzip gcta-1.95.0-linux-kernel-3-x86_64.zip

In [129]:
# gcta64 has no --version option; the banner printed before the error still reports the version
! /home/jupyter/tools/gcta-1.95.0-linux-kernel-3-x86_64/gcta64 --version

*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version v1.95.0 Linux
* Built at Jul 21 2025 17:30:18, by GCC 8.4
* (C) 2010-present, Yang Lab, Westlake University
* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
*******************************************************************
Analysis started at 15:39:09 UTC on Fri Jul 25 2025.
Hostname: 4b8bf4fabeb0

Accepted options:
Error: 
  invalid option "--version".

An error occurs, please check the options or data


In [130]:
ancestries = {'AAC','AFR','AJ','AMR','CAS','EAS','EUR','FIN','MDE','SAS','CAH'}

for ancestry in ancestries:
    
    WORK_DIR = f'~/workspace/ws_files/TMEM175/TMEM175_{ancestry}'
    
    # Select independently associated SNPs with COJO, using the p-value threshold 4.67e-4 obtained from the LD-pruned variant count
    # The p-value threshold can be changed if needed
    # --bfile is the full dataset in PLINK binary format (e.g. GP2), i.e. the data used to run the GWAS
    ! /home/jupyter/tools/gcta-1.95.0-linux-kernel-3-x86_64/gcta64 \
    --bfile {WORK_DIR}/{ancestry}_TMEM175 \
    --maf 0.01 \
    --cojo-file {WORK_DIR}/{ancestry}.all_adj.sumstats.ma \
    --cojo-p 4.67e-4 \
    --cojo-slct \
    --out {WORK_DIR}/{ancestry}.all_adj.ldprune.COJO

*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version v1.95.0 Linux
* Built at Jul 21 2025 17:30:18, by GCC 8.4
* (C) 2010-present, Yang Lab, Westlake University
* Please report bugs to Jian Yang <jian.yang@westlake.edu.cn>
*******************************************************************
Analysis started at 15:43:28 UTC on Fri Jul 25 2025.
Hostname: 4b8bf4fabeb0

Accepted options:
--bfile /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175
--maf 0.01
--cojo-file /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC.all_adj.sumstats.ma
--cojo-p 0.000467
--cojo-slct
--out /home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC.all_adj.ldprune.COJO


Reading PLINK FAM file from [/home/jupyter/workspace/ws_files/TMEM175/TMEM175_AAC/AAC_TMEM175.fam].
1215 individuals to be included from [/home/ju