Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GWAS finishing effort - 21,572 missing variants #19

Closed
yk-tanigawa opened this issue Jul 4, 2020 · 7 comments
Closed

GWAS finishing effort - 21,572 missing variants #19

yk-tanigawa opened this issue Jul 4, 2020 · 7 comments
Assignees

Comments

@yk-tanigawa
Copy link
Contributor

It turned out that the summary statistics generated in array-combined/gwas/current does not match the number of expected lines (1,080,969).

There are 880 files with 1059397 lines (meaning that 21572 lines are missing in each file).

  • 21,033 variants are on both arrays
  • 539 variants are on one array
@yk-tanigawa
Copy link
Contributor Author

The analysis script for this is in here: https://github.com/rivas-lab/ukbb-tools/tree/master/04_gwas/extras/202006-GWAS-finish/gwas21572

We the following message multiple times

Skipping chrY in --glm regression on phenotype 'PHENO1', since all
remaining samples are controls.

@yk-tanigawa
Copy link
Contributor Author

for this, we have 832 log files, indicating that 48 files were not finished (presumably due to the computing shutdown on 7/8).

[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572]$ find data/ -name "*array-combined.log" | wc
    832     832   50372

@yk-tanigawa
Copy link
Contributor Author

cat ../gwas-current-gz-wc.20200629-131527.tsv | awk -v FS='\t' -v wc_l=1059397 '($3==wc_l){print $1}' | while read symlink ; do
    pop=$(basename $(dirname $symlink))
    GBE_ID=$(basename $symlink | awk -v FS='.' '{print $2}')
    log_f=data/${pop}/ukb24983_v2_hg19.${GBE_ID}.array-combined.log
    if [ ! -f ${log_f} ] ; then echo $GBE_ID $pop | tr ' ' '\t' ; fi
done | tee job.$(date +%Y%m%d-%H%M%S).tsv

Wrote a sbatch wrapper

@yk-tanigawa
Copy link
Contributor Author

For binary traits, we submit SLURM jobs.

[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572]$ sbatch -p mrivas --qos=high_p --time=6:00:00 --mem=30000 --nodes=1 --cores=6 --job-name=gwas21572 --output=logs/gwas21572.%A_%a.out --error=logs/gwas21572.%A_%a.err --array=1-20 $parallel_sbatch_sh 3_gwas.sbatch.sh 1.20.lst 1
Submitted batch job 3896267

For QTs, we just run it in the interactive session.

[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572]$ seq 21 47 | while read i ; do echo $i ;bash 3_gwas.sbatch.sh $i ; done

@yk-tanigawa
Copy link
Contributor Author

Those remaining 48 files were generated.

@yk-tanigawa
Copy link
Contributor Author

check the files

removal of 11 phenotypes with phenotyping error

[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572]$ find data -name "*array-combined.*.glm*" | wc
    836     836   67040
[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572]$ find data -name "*.array-combined.log" | wc
    880     880   53381

The difference:

african e_asian non_british_white s_asian
88 files.
88 = 4 pops * (both and one array) * 11 phenotypes

This means

  • we are fine to focus on the 836 phenotypes.
  • Also, WB finished successflly, meaning that while running this batch the phe file is fixed.
find data | egrep 'INI21049|INI21051|INI21052|INI21053|INI21054|INI21055|INI21056|INI21058|INI21059|INI21060|INI21061' | while read f ; do mv $f $(echo $f | sed -e 's/^data/__delete/g') ; done

check again

cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas21572
find data -name "*array-combined.*.glm*" | wc -l
825
find data -name "*.array-combined.log" | wc -l
825

We have 825 files. Looks good.

line count

824 files have fewer lines (-691) because the chrY variants were skipped.

$ find data -name "*array-combined.*.glm*" | while read f ; do wc $f ; done | awk '{print $1}' | sort | uniq -c
    824 20882
      1 21573
find data -name "*array-combined.*.glm*" | while read f ; do wc $f ; done | awk '$1==21573'
  21573  280449 1545067 data/white_british/ukb24983_v2_hg19.INI22152.array-combined.PHENO1.glm.linear
find data -name "*.array-combined.log" | parallel 'cat {}' | grep 'Skipping chrY in --glm regression' | wc
   1648   17214  114704

combine the results


echo "#GBE_ID pop symlink additional_res" | tr ' ' '\t' > 4_combined.idx.tsv

wc_l=1059397

cat ../gwas-current-gz-wc.20200629-131527.tsv | egrep -v 'INI21049|INI21051|INI21052|INI21053|INI21054|INI21055|INI21056|INI21058|INI21059|INI21060|INI21061' | awk -v FS='\t' -v wc_l=${wc_l} '($3==wc_l){print $1}' | while read symlink ; do
    pop=$(basename $(dirname $symlink))
    GBE_ID=$(basename $symlink | awk -v FS='.' '{print $2}')
    echo $GBE_ID $pop $symlink $(readlink -f data/${pop}/$(basename $symlink .gz) | sed -e "s/array-combined/array-combined.PHENO1/g")
done | tr ' ' '\t' >> 4_combined.idx.tsv

We then combine the results

bash 4_gwas_merge.sh 825
bash 4_gwas_merge.sh 824
sbatch -p mrivas --qos=high_p --time=1:00:00 --mem=6000 --nodes=1 --cores=1 --job-name=merge --output=logs/merge.%A_%a.out --error=logs/merge.%A_%a.err --array=1-83 $parallel_sbatch_sh 4_gwas_merge.sh 1.823.lst 10
Submitted batch job 3904698

@yk-tanigawa
Copy link
Contributor Author

This patch generated 824 files with 1080278 lines (-691) because the chrY variants were skipped and one file with 1080969 lines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant