## Create a minimal example from the UKB data

Select 100 individuals from the bed files

1. Individuals to select: `/home/dc2325/scratch60/plink-clumping/samplesID.txt`
2. Filter the bgen files for chr1 and chr2 to contain only those individuals.

## Running this notebook

On the Yale Farnam cluster:

```
sos run ~/project/pleiotropy_UKB/workflow/MWE.ipynb plink2 -c ~/project/pleiotropy_UKB/farnam.yml -q farnam -J 40 \
-s build &> sos-submission-MWE-060920.log
```

In [None]:
[global]
# Working directory: change accordingly
parameter: cwd = path('~/scratch60/plink-clumping')
# Genotype file in plink binary format
parameter: genoFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindivs.bed')
# Path to bgen files
parameter: bgenFile = paths([f'/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{x+1}_v3.bgen' for x in range(2)])
# Path to sample file
parameter: sampleFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample')
# Samples to keep, in plink format: two columns, FID and IID
parameter: samplesPlink = path('/home/dc2325/scratch60/plink-clumping/samples_plink.txt')
# Samples to keep, for awk: a single column with the IID
parameter: samplesID = path('/home/dc2325/scratch60/plink-clumping/samplesID.txt')
# Samples to keep, for qctool: whitespace-delimited list of IIDs
parameter: samplesQctool = path('/home/dc2325/scratch60/plink-clumping/samples_qctool.txt')
# Covariate file
parameter: covarFile = path('/home/dc2325/scratch60/plink-clumping/phenotypes.txt')
# Raw phenotype to extract individuals
parameter: rawPheno2 = path('/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/UKB_caucasians_BMIwaisthip_AsthmaAndT2D_withagesex_033120')
# Raw phenotype to extract individuals
parameter: rawPheno1 = path('/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_forbolt030720')
# Phenotype file for both quantitative (BMI) and qualitative (asthma) traits
parameter: phenoFile = path('/home/dc2325/scratch60/plink-clumping/phenotypes.txt')
# Unrelated samples from UKB
parameter: unrelated_samples = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620')
# Files listing the SNPs (rsIDs) to keep from each bgen file
parameter: rsid = paths([f'{cwd}/chr{x+1}_filter_snps.txt' for x in range(2)])
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Specific number of threads to use
parameter: numThreads = 20
# Load specific modules
parameter: plink_module = '''
module load PLINK/1.90-beta5.3
echo "Module plink loaded"
{cmd}
'''
parameter: qctool_module = '''
module load QCTOOL/2.0-foss-2016b-rc7-CentOS6.8
echo "Module qctool loaded"
{cmd}
'''
parameter: plink2_module = '''
module load PLINK/2_x86_64_20180428
echo "Module plink2 loaded"
{cmd}
'''
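
The three `*_module` parameters above are shell templates with a `{cmd}` placeholder. As a rough illustration only, the sketch below assumes the placeholder is filled by plain text substitution; the `render` helper and the `bash step_script.sh` command are hypothetical and are not part of SoS.

```python
# Copy of the plink_module template from the [global] section.
plink_module = '''
module load PLINK/1.90-beta5.3
echo "Module plink loaded"
{cmd}
'''


def render(template: str, cmd: str) -> str:
    """Fill the {cmd} placeholder (hypothetical helper, plain substitution assumed)."""
    return template.replace("{cmd}", cmd)


# "bash step_script.sh" is a made-up stand-in for the command that runs the step script.
wrapped = render(plink_module, "bash step_script.sh")
print(wrapped)
```

Under that assumption, the module is loaded first and the step's own command runs last, which is why each template ends with `{cmd}`.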

In [None]:
# Create .bed, .bim and .fam files
[plink]
input: genoFile, samplesPlink
output: f'{cwd}/{_input[0]:b}.MWE_data.bed'
bash: expand= "${ }", workdir = cwd, template = plink_module
    plink \
    --bfile ${_input[0]} \
    --keep ${_input[1]} \
    --make-bed \
    --out ${_output:bn}

In [None]:
# Extract the 100 individuals from the phenotypic files
[phenotypes]
input: rawPheno1, rawPheno2, unrelated_samples
output: f'{cwd}/phenotype_100samples.txt', f'{cwd}/samplesID.txt', f'{cwd}/samples_qctool.txt', f'{cwd}/samples_plink.txt'
bash: expand= "${ }", workdir = cwd
    awk 'FNR==NR{a[$1];next}($1 in a){print}' ${_input[0]} ${_input[1]} > common_IDs.txt
    cat common_IDs.txt | awk 'NR==1; $5==1 {print}' | head -n 51 > 50_cases.txt
    cat common_IDs.txt | awk 'NR==1; $5==0 {print}' | head -n 51 > 50_controls.txt
    awk 'FNR>1 || NR==1' 50_* > ${_output[0]}
    awk '{print $2}' ${_output[0]} | sort -k 1n > ${_output[1]}
    cat ${_output[1]} | awk -F " " 'NR>1 {print}; {ORS= " "}' > ${_output[2]}
    awk '{print $1,$2}' ${_output[0]} > ${_output[3]}
    grep -w -F -f ${_output[1]} ${_input[2]}  > unrelated_samplesID.txt
    rm 50_* common_IDs.txt
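
The selection logic of the `[phenotypes]` step can be sketched in Python on toy data. This is a minimal illustration, not the workflow itself: the column names and the toy table are made up, and the only detail taken from the step is that asthma status is read from the fifth column (`1` for cases, `0` for controls). Unlike the awk commands, the derived lists here omit the header row.

```python
def select_samples(rows, status_col=4, n_per_group=50):
    """Keep the header plus the first n cases and first n controls.

    rows: list of already-split lines; rows[0] is the header.
    status_col: zero-based index of the case/control column (awk's $5).
    """
    header, body = rows[0], rows[1:]
    cases = [r for r in body if r[status_col] == "1"][:n_per_group]
    controls = [r for r in body if r[status_col] == "0"][:n_per_group]
    return [header] + cases + controls


# Toy table with hypothetical column names and 200 made-up individuals.
toy = [["FID", "IID", "age", "sex", "asthma"]]
toy += [[str(i), str(i), "50", "1", str(i % 2)] for i in range(1000, 1200)]

selected = select_samples(toy)

# One IID per sample, sorted numerically (like samplesID.txt).
iids = sorted(int(r[1]) for r in selected[1:])
# Space-delimited IIDs on one line (like samples_qctool.txt).
qctool_list = " ".join(str(i) for i in iids)
# FID and IID pairs (like samples_plink.txt).
plink_keep = [(r[0], r[1]) for r in selected[1:]]

print(len(selected) - 1, "samples selected")
```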

In [None]:
# Filter bgen files chr1 and chr2 with only the 100 individuals
[qctool]
input: bgenFile, group_by=1, paired_with= 'rsid'
output: f'{cwd}/{_input:bn}.filtered.bgen', f'{cwd}/{_input:bn}.filtered.sample'
task: trunk_workers = 1, trunk_size = job_size, cores = numThreads, walltime = '12h', mem = '48G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", workdir = cwd, stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', template = qctool_module
    qctool \
    -g ${_input} \
    -s ${sampleFile} \
    -og ${_output[0]} \
    -os ${_output[1]} \
    -incl-samples ${samplesQctool} \
    -incl-rsids ${_rsid}

In [None]:
# Filter bgen files chr1 and chr2 to keep only the 100 individuals, exporting with 8-bit probability encoding
[plink2]
input: bgenFile, group_by=1, paired_with= 'rsid'
output: f'{cwd}/{_input:bn}.plink.filtered.bgen'
task: trunk_workers = 1, trunk_size = job_size, cores = numThreads, walltime = '12h', mem = '48G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", workdir = cwd, stderr = f'{_output:bn}.stderr', stdout = f'{_output:bn}.stdout', template = plink2_module
    plink2 \
      --bgen ${_input} ref-first \
      --sample ${sampleFile} \
      --keep ${samplesPlink} \
      --extract ${_rsid:n} \
      --export bgen-1.2 "bits=8" \
      --out ${_output:bn}
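
The `[plink2]` step processes one bgen file at a time (`group_by=1`) and pairs each with the matching entry of `rsid`. The sketch below shows that pairing with plain Python strings. It reuses the list comprehensions from the `[global]` section, but leaves `~` unexpanded and uses `zip` as a stand-in for the pairing, so it is an approximation of what SoS does rather than its implementation.

```python
# Values copied from the [global] section; plain strings instead of SoS path objects.
cwd = "~/scratch60/plink-clumping"
bgen_dir = "/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset"

# range(2) yields 0 and 1, so x+1 gives chromosomes 1 and 2.
bgen_files = [f"{bgen_dir}/ukb_imp_chr{x+1}_v3.bgen" for x in range(2)]
rsid_files = [f"{cwd}/chr{x+1}_filter_snps.txt" for x in range(2)]

# One (bgen, rsid list) pair per chromosome, matched by position.
pairs = list(zip(bgen_files, rsid_files))
for bgen, rsid in pairs:
    print(bgen, "->", rsid)
```

Because the pairing is positional, both lists must be built in the same chromosome order.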

```
cat /SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_forbolt030720 \
| awk 'NR==1; $5==1 {print}' | head -n 51 > asthma_cases.txt
cat /SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_forbolt030720 \
| awk 'NR==1; $5==0 {print}' | head -n 51 > asthma_controls.txt

# This was to see which IDs were different between asthma and BMI
diff -y <( awk '{print $1}' samples_asthma_sorted.txt) <( awk '{print $1}' BMI_samples_sorted.txt )

# Look for common IDs in both files
# One way to do it with awk
awk 'FNR==NR{a[$1];next}($1 in a){print}' Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_forbolt030720 UKB_caucasians_BMIwaisthip_AsthmaAndT2D_withagesex_033120 > common_IDs.txt

# Another way to do it, but the IDs need to be sorted first (comm expects lexically sorted input)
# Note: this yields only the shared IDs from column 1, not the full rows
comm -12 <( awk '{print $1}' Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_forbolt030720 | sort ) \
<( awk '{print $1}' UKB_caucasians_BMIwaisthip_AsthmaAndT2D_withagesex_033120 | sort ) > common_IDs.txt

# Now select 50 cases and 50 controls based on asthma status (column 5); head -n 51 keeps the header plus 50 rows
cat common_IDs.txt | awk 'NR==1; $5==1 {print}' | head -n 51 > asthma_cases.txt
cat common_IDs.txt | awk 'NR==1; $5==0 {print}' | head -n 51 > asthma_controls.txt

# Join the cases and controls in one file, keeping a single header line
awk 'FNR>1 || NR==1' asthma_c* > asthma_samples.txt

# Create the samplesID file to be used
awk '{print $2}' asthma_samples.txt | sort -k 1n > samplesID.txt

# See which of the samples are unrelated (only 84 are unrelated)
grep -w -F -f samplesID.txt /SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620 > unrelated_samplesID.txt

# Create the samples file for qctool (IIDs on a single space-delimited line, header dropped)
cat samplesID.txt | awk -F " " 'NR>1 {print}; {ORS= " "}' > samples_qctool.txt

#Create the sample file for plink
awk '{print $1,$2}' asthma_samples.txt > samples_plink.txt

# Print column 2 only for rows where it contains "rs" (partial match)
awk '$2 ~ /rs/ { print $2 }' dummy_file

cat chr1_snps.txt | awk 'BEGIN {ORS=" "}; $2 ~ /rs/ { print $2 }' > chr1_filter_snps.txt
cat chr2_snps.txt | awk 'BEGIN {ORS=" "}; $2 ~ /rs/ { print $2 }' > chr2_filter_snps.txt
```
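
The last two awk commands build the per-chromosome SNP lists by keeping the second column of every row whose ID contains `rs`, written on one space-delimited line. A Python sketch of the same filter is shown here on made-up rows; the six-column toy layout is an assumption, and only the position of the ID in column 2 comes from the commands above. Unlike awk with `ORS=" "`, `join` leaves no trailing space.

```python
import re


def filter_rsids(lines, id_col=1):
    """Return a space-delimited string of IDs from column id_col that contain 'rs'.

    id_col is zero-based, so 1 corresponds to awk's $2.
    """
    ids = []
    for line in lines:
        fields = line.split()
        # Same partial match as awk's `$2 ~ /rs/`.
        if len(fields) > id_col and re.search(r"rs", fields[id_col]):
            ids.append(fields[id_col])
    return " ".join(ids)


# Hypothetical variant rows; the second row has a positional ID and is dropped.
toy = [
    "1 rs123 0 1000 A G",
    "1 1:2000_A_T 0 2000 A T",
    "1 rs456 0 3000 C T",
]
print(filter_rsids(toy))
```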