GWAS finishing effort - 21,572 missing variants #19

yk-tanigawa · 2020-07-04T16:12:11Z

It turned out that the summary statistics generated in array-combined/gwas/current does not match the number of expected lines (1,080,969).

There are 880 files with 1059397 lines (meaning that 21572 lines are missing in each file).

21,033 variants are on both arrays
539 variants are on one array

The text was updated successfully, but these errors were encountered:

yk-tanigawa · 2020-07-05T19:52:42Z

The analysis script for this is in here: https://github.com/rivas-lab/ukbb-tools/tree/master/04_gwas/extras/202006-GWAS-finish/gwas21572

We the following message multiple times

Skipping chrY in --glm regression on phenotype 'PHENO1', since all
remaining samples are controls.

yk-tanigawa · 2020-07-11T06:45:09Z

for this, we have 832 log files, indicating that 48 files were not finished (presumably due to the computing shutdown on 7/8).

[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572]$ find data/ -name "*array-combined.log" | wc
    832     832   50372

yk-tanigawa · 2020-07-11T06:55:27Z

cat ../gwas-current-gz-wc.20200629-131527.tsv | awk -v FS='\t' -v wc_l=1059397 '($3==wc_l){print $1}' | while read symlink ; do
    pop=$(basename $(dirname $symlink))
    GBE_ID=$(basename $symlink | awk -v FS='.' '{print $2}')
    log_f=data/${pop}/ukb24983_v2_hg19.${GBE_ID}.array-combined.log
    if [ ! -f ${log_f} ] ; then echo $GBE_ID $pop | tr ' ' '\t' ; fi
done | tee job.$(date +%Y%m%d-%H%M%S).tsv

Wrote a sbatch wrapper

yk-tanigawa · 2020-07-11T07:05:27Z

For binary traits, we submit SLURM jobs.

[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572]$ sbatch -p mrivas --qos=high_p --time=6:00:00 --mem=30000 --nodes=1 --cores=6 --job-name=gwas21572 --output=logs/gwas21572.%A_%a.out --error=logs/gwas21572.%A_%a.err --array=1-20 $parallel_sbatch_sh 3_gwas.sbatch.sh 1.20.lst 1
Submitted batch job 3896267

For QTs, we just run it in the interactive session.

[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572]$ seq 21 47 | while read i ; do echo $i ;bash 3_gwas.sbatch.sh $i ; done

yk-tanigawa · 2020-07-11T16:24:47Z

Those remaining 48 files were generated.

yk-tanigawa · 2020-07-11T18:44:40Z

check the files

removal of 11 phenotypes with phenotyping error

[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572]$ find data -name "*array-combined.*.glm*" | wc
    836     836   67040
[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572]$ find data -name "*.array-combined.log" | wc
    880     880   53381

The difference:

african e_asian non_british_white s_asian
88 files.
88 = 4 pops * (both and one array) * 11 phenotypes

This means

we are fine to focus on the 836 phenotypes.
Also, WB finished successflly, meaning that while running this batch the phe file is fixed.

find data | egrep 'INI21049|INI21051|INI21052|INI21053|INI21054|INI21055|INI21056|INI21058|INI21059|INI21060|INI21061' | while read f ; do mv $f $(echo $f | sed -e 's/^data/__delete/g') ; done

check again

cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572
find data -name "*array-combined.*.glm*" | wc -l
825
find data -name "*.array-combined.log" | wc -l
825

We have 825 files. Looks good.

line count

824 files have fewer lines (-691) because the chrY variants were skipped.

$ find data -name "*array-combined.*.glm*" | while read f ; do wc $f ; done | awk '{print $1}' | sort | uniq -c
    824 20882
      1 21573

find data -name "*array-combined.*.glm*" | while read f ; do wc $f ; done | awk '$1==21573'
  21573  280449 1545067 data/white_british/ukb24983_v2_hg19.INI22152.array-combined.PHENO1.glm.linear

find data -name "*.array-combined.log" | parallel 'cat {}' | grep 'Skipping chrY in --glm regression' | wc
   1648   17214  114704

combine the results


echo "#GBE_ID pop symlink additional_res" | tr ' ' '\t' > 4_combined.idx.tsv

wc_l=1059397

cat ../gwas-current-gz-wc.20200629-131527.tsv | egrep -v 'INI21049|INI21051|INI21052|INI21053|INI21054|INI21055|INI21056|INI21058|INI21059|INI21060|INI21061' | awk -v FS='\t' -v wc_l=${wc_l} '($3==wc_l){print $1}' | while read symlink ; do
    pop=$(basename $(dirname $symlink))
    GBE_ID=$(basename $symlink | awk -v FS='.' '{print $2}')
    echo $GBE_ID $pop $symlink $(readlink -f data/${pop}/$(basename $symlink .gz) | sed -e "s/array-combined/array-combined.PHENO1/g")
done | tr ' ' '\t' >> 4_combined.idx.tsv

We then combine the results

bash 4_gwas_merge.sh 825
bash 4_gwas_merge.sh 824

sbatch -p mrivas --qos=high_p --time=1:00:00 --mem=6000 --nodes=1 --cores=1 --job-name=merge --output=logs/merge.%A_%a.out --error=logs/merge.%A_%a.err --array=1-83 $parallel_sbatch_sh 4_gwas_merge.sh 1.823.lst 10
Submitted batch job 3904698

yk-tanigawa · 2020-07-11T20:30:31Z

This patch generated 824 files with 1080278 lines (-691) because the chrY variants were skipped and one file with 1080969 lines.

yk-tanigawa self-assigned this Jul 4, 2020

yk-tanigawa added this to the GWAS sumstats freeze, July 2020 milestone Jul 4, 2020

yk-tanigawa mentioned this issue Jul 5, 2020

GWAS finishing effort - Simple line counts check #21

Closed

yk-tanigawa closed this as completed Jul 11, 2020

yk-tanigawa added a commit that referenced this issue Jul 11, 2020

add scripts for issue ticket #19

982bb92

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GWAS finishing effort - 21,572 missing variants #19

GWAS finishing effort - 21,572 missing variants #19

yk-tanigawa commented Jul 4, 2020

yk-tanigawa commented Jul 5, 2020

yk-tanigawa commented Jul 11, 2020

yk-tanigawa commented Jul 11, 2020

yk-tanigawa commented Jul 11, 2020

yk-tanigawa commented Jul 11, 2020

yk-tanigawa commented Jul 11, 2020

yk-tanigawa commented Jul 11, 2020

GWAS finishing effort - 21,572 missing variants #19

GWAS finishing effort - 21,572 missing variants #19

Comments

yk-tanigawa commented Jul 4, 2020

yk-tanigawa commented Jul 5, 2020

yk-tanigawa commented Jul 11, 2020

yk-tanigawa commented Jul 11, 2020

yk-tanigawa commented Jul 11, 2020

yk-tanigawa commented Jul 11, 2020

yk-tanigawa commented Jul 11, 2020

check the files

removal of 11 phenotypes with phenotyping error

check again

line count

combine the results

yk-tanigawa commented Jul 11, 2020