## Fix mismatched sample names for chr 2 and chr 20

Problem: some blocks of chr 2 (block 58 onwards) and chr 20 (block 3 onwards) have a different number of available samples than the earlier blocks. Between the download of the first blocks and the download of the last blocks, the UKBB set 3 more individuals to missing because they withdrew from the study. This section describes the patch that resolves the mismatch in the data.

samples_name.txt format: one old_name new_name pair per line, separated by a space
```
chr2_b57_v1_original chr2_b58_v1_original
-000001 -000002
-000002 -000003
3081695 -000004
-000003 -000005
-000004 -000006
-000005 -000007
-000006 -000008
-000007 -000009
-000009 -000011
-000010 -000012
-000011 -000013
2284688 -000014
-000012 -000015
-000013 -000016
-000014 -000017
-000015 -000018
1533484 -000001
```
Note: between UKBB downloads, the IDs assigned to the missing samples are shuffled, which creates problems in the merge.

```
diff -y ukb23156_c2_b57_v1.samples ukb23156_c2_b58_v1.samples > differentb57_b58
diff -y ukb23156_c20_b0_v1.samples ukb23156_c20_b3_v1.samples > different_chr20_b0_b3
comm -3 <(sort ukb23156_c1.merged.filtered.bed.merged_allchr.fam) <(sort ukb23156_c1.merged.filtered.bed.keep_samples.id)
```
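As a rough illustration of how old_name new_name pairs can be read off two sample lists, the sketch below pastes two toy lists side by side and keeps only the rows that differ. It assumes both lists have the same length and the same sample order (only the IDs change). Every file name and ID in it is made up for the example, and it is not the command that produced the mapping above.

```shell
# Toy sample lists (hypothetical IDs): one sample is replaced by a
# missing-sample ID in the later list, which shifts the negative IDs after it
pairdir=$(mktemp -d)
printf '%s\n' 1000001 -000001 1000002 -000002 > "$pairdir/early.samples"
printf '%s\n' 1000001 -000001 -000002 -000003 > "$pairdir/late.samples"

# Put the lists side by side and keep only the rows where the name changed.
# Appending "" forces a string comparison, so IDs are never compared as numbers.
paste -d' ' "$pairdir/early.samples" "$pairdir/late.samples" \
    | awk '$1"" != $2""' > "$pairdir/rename_pairs.txt"

cat "$pairdir/rename_pairs.txt"
```

Each output row is one old_name new_name pair, which is the layout described for samples_name.txt.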
Find which plink files need to be replaced (all except chr2 and chr20) and rewrite their .fam files with the new sample names:
```
grep -E "1533484|2284688|3081695" *.fam

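for file in ukb23156_c*.merged.filtered.fam; do
    # Load the old -> new mapping from the first file, then rename column 1 of the .fam and write it as both FID and IID
    awk -v OFS=' ' 'NR==FNR{a[$1]=$2;next} $1 in a{$1=a[$1]} {print $1,$1,$3,$4,$5,$6}' samples_to_rename_vcf.txt "$file" > "${file%.*}.new.fam"
done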
```
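To make the effect of the awk rename concrete, here is a minimal sketch that runs the same awk program on a toy mapping and a toy .fam file. The temporary directory, file names and IDs are all hypothetical and exist only for this example.

```shell
# Toy mapping file: old_name new_name (hypothetical IDs)
famdir=$(mktemp -d)
printf '%s\n' '-000001 -000002' '1000002 -000004' > "$famdir/map.txt"

# Toy .fam file: FID IID father mother sex phenotype
printf '%s\n' \
    '-000001 -000001 0 0 0 -9' \
    '1000002 1000002 0 0 2 -9' \
    '1000001 1000001 0 0 1 -9' > "$famdir/toy.fam"

# Same awk program as in the loop above: rename column 1 when it is in the
# mapping, then print it as both FID and IID and keep the other columns
awk -v OFS=' ' 'NR==FNR{a[$1]=$2;next} $1 in a{$1=a[$1]} {print $1,$1,$3,$4,$5,$6}' \
    "$famdir/map.txt" "$famdir/toy.fam" > "$famdir/toy.new.fam"

cat "$famdir/toy.new.fam"
```

Samples that are not in the mapping (here 1000001) pass through unchanged. Each line is looked up once, so a new name that also appears as an old name is not renamed a second time.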

## To run notebook

```
sos run /home/dmc2245/project/UKBB_GWAS_dev/workflow/patch_vcf_files.ipynb reheader \
    --cwd  /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/072721_run/ \
    --vcfs /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/072721_run/cache/ukb23156_c2_b{0..57}_v1.leftnorm.filtered.vcf.gz \
    --samples_name /mnt/mfs/statgen/UKBiobank/plink_files/samples_to_rename_vcf.txt \
    --container_lmm /mnt/mfs/statgen/containers/lmm.sif
```

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# pVCF files 
parameter: vcfs = paths
# samples file which contains sample names to be changed
parameter: samples_name = path
# Container with bcftools
parameter: container_lmm = 'statisticalgenetics/lmm:2.4'
# Number of threads
parameter: numThreads = 2
# For cluster jobs, number of commands to run per job
parameter: job_size = 10

In [None]:
# Change names of problematic samples during merging
[reheader]
input: vcfs, group_by=1
output: f'{cwd}/reheader/{_input:bnn}.vcf.gz'
task: trunk_workers = 1, trunk_size=job_size, walltime = '12h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'

    bcftools reheader -s ${samples_name} -o ${_output} ${_input}

### Script to merge and then patch plink files

In [None]:
[]
rm -f list_beds.txt
for chr in {2..22}; do echo "ukb23156_c${chr}.merged.filtered.bed ukb23156_c${chr}.merged.filtered.bim ukb23156_c${chr}.merged.filtered.fam" >> list_beds.txt; done

plink \
  --bed ukb_cal_chr1_v2.bed \
  --bim ukb_snp_chr1_v2.bim \
  --fam ukbXXX_int_chr1_v2_s488373.fam \
  --merge-list list_beds.txt \
  --make-bed --out ukb_cal_allChrs
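
Before a long merge it can help to confirm that every file named in the merge list exists. The sketch below is one possible pre-flight check, not part of the original workflow. It builds a toy list in a temporary directory (only chromosomes 2 and 3, with the chr3 .fam deliberately left out) and reports the missing entries. The directory and the empty placeholder files are assumptions made for the example.

```shell
# Toy setup: empty placeholder files, with the chr3 .fam left out on purpose
checkdir=$(mktemp -d)
for ext in bed bim fam; do : > "$checkdir/ukb23156_c2.merged.filtered.$ext"; done
for ext in bed bim; do : > "$checkdir/ukb23156_c3.merged.filtered.$ext"; done

# Write the merge list with one "bed bim fam" triple per line
: > "$checkdir/list_beds.txt"
for chr in 2 3; do
    prefix="$checkdir/ukb23156_c${chr}.merged.filtered"
    echo "$prefix.bed $prefix.bim $prefix.fam" >> "$checkdir/list_beds.txt"
done

# Collect every listed file that does not exist
missing=$(tr ' ' '\n' < "$checkdir/list_beds.txt" | while read -r f; do
    [ -e "$f" ] || echo "$f"
done)

echo "missing: ${missing:-none}"
```

Pointing the same check at the real list_beds.txt would flag a missing or misnamed file before plink starts the merge.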