# Subset LD blocks from imputed UK Biobank data

## Aim

The aim of this notebook is to extract bgen, bgi and variant files for genome-wide LD blocks, so that LDstore can then be run to generate an LD matrix for each block.

## Input data

To run this notebook you need:

* A list of regions to extract, in the format `chr start stop` (note: make sure `chr` is zero-padded, i.e. 01, 02, 03, etc.)

* The original bgen files from which these regions are to be extracted
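
The expected region format can be illustrated with a small sketch. `normalize_region` is a hypothetical helper that is not part of the workflow, and the coordinates are made-up values; it only mirrors the clean-up the `bgenix_1` step applies to each row (strip whitespace, drop a `chr` prefix, zero-pad the chromosome).

```python
def normalize_region(line):
    """Parse one 'chr start stop' line and zero-pad the chromosome label."""
    # split() with no argument handles tabs, spaces and stray whitespace
    chrom, start, stop = line.split()
    # Drop an optional 'chr' prefix, then pad to two characters: 'chr1' -> '01'
    chrom = chrom.replace("chr", "").zfill(2)
    return chrom, start, stop


# Illustrative rows only; real rows come from the LD block .bed files
print(normalize_region("chr1\t1000\t2000"))
print(normalize_region("22 5000 9000"))
```

Start and stop are kept as strings here because the workflow also reads every column with `dtype=str`.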

## Software used

This notebook uses bgenix as implemented in the lmm.sif image.

## Output

After running this notebook you will have bgen, bgi and variant files for each input region, organized by chromosome and region.
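
As a rough guide to where the files land, the sketch below rebuilds the three output paths from the `output:` statements of the `bgenix_1`, `bgenix_2` and `bgenix_3` steps. `region_outputs` is a hypothetical helper written for illustration, and the region values are made up.

```python
def region_outputs(cwd, chrom, start, stop):
    """Return the per-region output paths, following the workflow's naming."""
    # bgenix_1 writes {cwd}/{chr}/{chr}_{start}_{stop}.bgen
    base = f"{cwd}/{chrom}/{chrom}_{start}_{stop}"
    return {
        "bgen": base + ".bgen",
        # bgenix_2 appends .bgi to the full bgen file name
        "bgi": base + ".bgen.bgi",
        # bgenix_3 replaces the .bgen extension with .variants
        "variants": base + ".variants",
    }


for kind, path in region_outputs("test", "01", "1000", "2000").items():
    print(kind, path)
```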

## Minimal working examples

To generate a single file with all the regions:
```
sos run ~/UKBB_GWAS_dev/workflow/113022_bgenix_ldblocks.ipynb \
    regions\
    --cwd test\
    --genofile_prefix test/ukb_imp_chr\
    --genofile_suffix _v3.bgen\
    --merged_filename test/fourier_ls-chr1_22.txt\
    --region_files data/ldblocks/EUR/fourier_ls-chr*.bed\
    --job_size 10
```

To generate the bgen, bgi and variant files per region:

```
sos run ~/UKBB_GWAS_dev/workflow/113022_bgenix_ldblocks.ipynb \
    bgenix\
    --cwd test\
    --genofile_prefix test/ukb_imp_chr\
    --genofile_suffix _v3.bgen\
    --region_file data/ldblocks/EUR/fourier_ls-all.bed\
    --job_size 10
```

## Command interface

In [132]:
sos run /home/dmc2245/project/UKBB_GWAS_dev/workflow/113022_bgenix_ldblocks.ipynb -h

usage: sos run /home/dmc2245/project/UKBB_GWAS_dev/workflow/113022_bgenix_ldblocks.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  regions
  bgenix

Global Workflow Options:
  --cwd VAL (as path, required)
                        working directory
  --genofile-prefix ''
                        please provide the genofile prefix name
  --genofile-suffix ''
                        please provide the genofile suffix name
  --region-file . (as path)
                        this the name of the files to be merged the format of
                        the files should be chr start stop (the asterisk to
                        match file names)
  --merged-filename ''
   

In [None]:
[global]
# Specify the working dir
parameter: cwd = path
# Provide the genofile prefix name
parameter: genofile_prefix = ''
# Provide the genofile suffix name
parameter: genofile_suffix = ''
# Provide the accompanying sample file if the genofile is in bgen format
parameter: sample_file = path
# Provide the region file as chr start stop
parameter: region_file = path('.')
# Provide a file with a column Id for related samples
parameter: related_samples = path('.')
# Exclusion sample file name generated from the related samples
parameter: excluded_samples = ''
# Provide the name of the merged region file if you want to create it
parameter: merged_filename = ''
# Number of jobs to group into each submitted task
parameter: job_size = 20
# Specify the walltime
parameter: walltime = '2h'
# Specify the memory
parameter: mem = '10G'
# Specify the number of threads
parameter: numThreads = 1
# Specify the container path
parameter: container = '/mnt/vast/hpc/csg/containers/lmm.sif'

In [None]:
# Only run this step if you have one region file per chromosome
[regions (create region file)]
# If you have one file per chromosome, provide them as region_files to concatenate them into a single region file
parameter: region_files=''
input: region_files
output: merged_filename
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
python: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    import glob
    import pandas as pd
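    # Match any file containing 'chr' (e.g. fourier_ls-chr1.bed); sort by length, then by name, so the order is deterministic
    region_files=sorted(glob.glob(${_input:dr}+'/*chr*.bed'), key=lambda f: (len(f), f))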
    df = pd.concat((pd.read_csv(f,header=0, sep="\t",dtype=str) for f in region_files), ignore_index=True)
    df.columns = df.columns.str.strip()
    df['chr'] = df['chr'].str.strip()
    df['start'] = df['start'].str.strip()
    df['stop'] = df['stop'].str.strip()
    df['chr'] = df['chr'].str.replace('chr','')
    df.to_csv(${_output:r}, header=0, sep=' ', index=False)
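
Ordering the per-chromosome files by name length is a compact way to place chr1 to chr9 ahead of chr10 to chr22. An alternative sketch is to sort on the chromosome number parsed from the file name. This assumes every file name contains `chr<number>`; the file names below are hypothetical and `chrom_number` is not part of the workflow.

```python
import re


def chrom_number(path):
    """Extract the chromosome number from a file name such as 'x-chr10.bed'."""
    match = re.search(r"chr(\d+)", path)
    if match is None:
        raise ValueError(f"no chromosome number found in {path!r}")
    return int(match.group(1))


# Hypothetical file names, deliberately out of order
files = ["fourier_ls-chr10.bed", "fourier_ls-chr2.bed", "fourier_ls-chr1.bed"]
ordered = sorted(files, key=chrom_number)
print(ordered)
```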

In [None]:
# Extract the regions
[bgenix_1 (extract region and samples)]
import pandas as pd
df=pd.read_csv(region_file,header=0,sep="\t", names=["chr", "start", "stop"], dtype=str)
df.columns = df.columns.str.strip()
df1 = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
df1['chr'] = df1['chr'].str.replace('chr','')
df1['chr'] = df1['chr'].str.zfill(2)
region=df1.values.tolist()
input: for_each = 'region' #is a list of regions for all the chromosomes
output: region_bgen = f'{cwd}/{_region[0]}/{_region[0]}_{_region[1]}_{_region[2]}.bgen'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    bgenix \
    -g ${genofile_prefix}${_region[0].lstrip('0')}${genofile_suffix} \
    -incl-range ${_region[0]}:${_region[1]}-${_region[2]}  > ${_output}
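
The extraction step uses the chromosome label in two forms: unpadded in the input file name and zero-padded in the range passed to `-incl-range`. The sketch below assembles the same command string outside SoS to make that explicit. `bgenix_extract_command` is a hypothetical helper and the region is made up. Only leading zeros are removed for the file name, because `'10'.strip('0')` would return `'1'`.

```python
def bgenix_extract_command(genofile_prefix, genofile_suffix, region, output):
    """Build the bgenix call for one (chr, start, stop) region."""
    chrom, start, stop = region
    # File names use the unpadded chromosome: '01' -> '1', '10' stays '10'
    genofile = f"{genofile_prefix}{chrom.lstrip('0')}{genofile_suffix}"
    # The range keeps the zero-padded label, as in the workflow step
    return f"bgenix -g {genofile} -incl-range {chrom}:{start}-{stop} > {output}"


print(bgenix_extract_command("test/ukb_imp_chr", "_v3.bgen",
                             ("01", "1000", "2000"), "test/01/01_1000_2000.bgen"))
```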

In [None]:
# Create the bgi files
[bgenix_2 (create index)]
input: named_output('region_bgen')
output: f'{_input}.bgi'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    bgenix -g ${_input} -index

In [None]:
# Create variant list
[bgenix_3 (create variants)]
input: named_output('region_bgen')
output: f'{_input:n}.variants'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    bgenix -g ${_input} -list > ${_output}