## Pipeline Initialization

This notebook creates the directories and reference objects that the rest of the pipeline depends on.

In [1]:
import os
import pandas as pd
import numpy as np
%run {os.environ['NB_DIR']}/nb.py
%run common.py
%run files.py

### Directory Initialization

In [2]:
!mkdir -p data
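Directory creation can also be done from Python in a way that is safe to re-run. The following is a minimal sketch; `ensure_dir` is a hypothetical helper (not part of `common.py` or `files.py`), and a temporary directory stands in for the notebook's working directory.

```python
import os
import tempfile

def ensure_dir(path):
    """Create `path` (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path

# Demonstrated in a temporary location so the sketch has no side effects
with tempfile.TemporaryDirectory() as tmp:
    data_dir = ensure_dir(os.path.join(tmp, "data"))
    ensure_dir(data_dir)  # second call is a no-op rather than an error
    created = os.path.isdir(data_dir)
```

In the notebook itself, the equivalent call would presumably be `ensure_dir('data')`.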


### Canine Reference Genome

Canine genome assemblies do not come with a standardized set of statistics from which a Hail reference genome could be built directly, so we construct one from the data at hand. For Hail to interpret our data, it needs to know only which contigs exist, how long each one is, and which contigs correspond to sex chromosomes. The first and last of these are well known (dogs have 38 autosomes and 2 allosomes, plus mitochondrial DNA), and the contig lengths can be approximated from the data by taking the largest variant position observed on each contig.
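To make the length approximation concrete, here is a small sketch in plain Python. The variant lists are toy stand-ins for the two datasets and `approximate_lengths` is a hypothetical name. The largest observed position is only a lower bound on the true contig length, which should be enough for every locus in the data to fall inside the declared contig.

```python
# Toy (contig, base-pair position) pairs standing in for two variant tables
ref_variants = [("1", 212740), ("1", 320055), ("2", 85000)]
tgt_variants = [("1", 307563), ("2", 91000), ("39", 5000)]

def approximate_lengths(*variant_sets):
    """Return {contig: largest position seen in any of the variant sets}."""
    lengths = {}
    for variants in variant_sets:
        for contig, position in variants:
            lengths[contig] = max(lengths.get(contig, 0), position)
    return lengths

lengths = approximate_lengths(ref_variants, tgt_variants)
```

A contig that appears in only one dataset (here `"39"`) still receives a length, because each contig's maximum is taken over whichever datasets contain it.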

In [2]:
df_ref_bim = get_bim(ORGANISM_CANINE_REF_DIR, PLINK_FILE_REF)
df_ref_bim.head()

Unnamed: 0,contig,snp,pos,locus,alt,ref
0,1,chr1_212740,0,212740,A,G
1,1,chr1_249580,0,249580,G,A
2,1,chr1_273487,0,273487,A,G
3,1,chr1_307563,0,307563,A,C
4,1,chr1_320055,0,320055,G,A
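`get_bim` is defined in `files.py` and is not shown here. As a rough sketch of what reading a PLINK `.bim` file involves, the example below assumes the standard six whitespace-separated fields map onto the column names in the output above. `read_bim` and `BIM_COLUMNS` are hypothetical names, not the notebook's helpers, and the two sample lines are rebuilt from the first two rows of that output.

```python
import io

# Assumed mapping of the six .bim fields onto the column names shown above
BIM_COLUMNS = ["contig", "snp", "pos", "locus", "alt", "ref"]

def read_bim(handle):
    """Parse whitespace-separated .bim lines into a list of dicts."""
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(BIM_COLUMNS, fields))
        row["pos"] = float(row["pos"])    # genetic distance (0 when unknown)
        row["locus"] = int(row["locus"])  # base-pair coordinate
        rows.append(row)
    return rows

sample = io.StringIO(
    "1\tchr1_212740\t0\t212740\tA\tG\n"
    "1\tchr1_249580\t0\t249580\tG\tA\n"
)
rows = read_bim(sample)
```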


In [3]:
df_tgt_bim = get_bim(ORGANISM_CANINE_TGT_DIR, PLINK_FILE_TGT)
df_tgt_bim.head()

Unnamed: 0,contig,snp,pos,locus,alt,ref
0,1,BICF2P1383091,0.058048,212740,A,G
1,1,TIGRP2P259_rs8993730,0.058849,249580,G,A
2,1,BICF2G630707908,0.059382,273487,A,G
3,1,BICF2P563564,0.060122,307563,A,C
4,1,BICF2P574107,0.06039,320055,G,A


In [11]:
# Find the max locus for each contig across both datasets
contigs = pd.concat([
    df_ref_bim.groupby('contig')['locus'].max(),
    df_tgt_bim.groupby('contig')['locus'].max()
], axis=1).max(axis=1).astype(int).to_dict()

# Create a Hail-compatible reference genome spec
rg = dict(
    name='canine',
    contigs=[str(k) for k in contigs.keys()],
    lengths={str(k): v for k, v in contigs.items()},
    x_contigs='39',
    mt_contigs='41'
)
rg

{'name': 'canine',
 'contigs': ['1',
  '2',
  '3',
  '4',
  '5',
  '6',
  '7',
  '8',
  '9',
  '10',
  '11',
  '12',
  '13',
  '14',
  '15',
  '16',
  '17',
  '18',
  '19',
  '20',
  '21',
  '22',
  '23',
  '24',
  '25',
  '26',
  '27',
  '28',
  '29',
  '30',
  '31',
  '32',
  '33',
  '34',
  '35',
  '36',
  '37',
  '38',
  '39',
  '41'],
 'lengths': {'1': 122670980,
  '2': 85416217,
  '3': 91858198,
  '4': 88267880,
  '5': 88908300,
  '6': 77552613,
  '7': 80858461,
  '8': 74057381,
  '9': 61043804,
  '10': 69316974,
  '11': 74388336,
  '12': 72480470,
  '13': 63232306,
  '14': 60959782,
  '15': 64187680,
  '16': 59511764,
  '17': 64281982,
  '18': 55763074,
  '19': 53735656,
  '20': 58114749,
  '21': 50855586,
  '22': 61382644,
  '23': 52291577,
  '24': 47651928,
  '25': 51628093,
  '26': 38939728,
  '27': 45753342,
  '28': 41164216,
  '29': 41841565,
  '30': 40196606,
  '31': 39786599,
  '32': 38745890,
  '33': 31361794,
  '34': 42089769,
  '35': 26506199,
  '36': 30798114,
  '37': 30897806,
  '38': 23903967,
  '39': 123833839,
  '41': 6608343},
 'x_contigs': '39',
 'mt_contigs': '41'}

In [15]:
import json
with open(REF_GENOME_FILE, 'w') as fd:
    json.dump(rg, fd)
!cat $REF_GENOME_FILE

{"name": "canine", "contigs": ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "41"], "lengths": {"1": 122670980, "2": 85416217, "3": 91858198, "4": 88267880, "5": 88908300, "6": 77552613, "7": 80858461, "8": 74057381, "9": 61043804, "10": 69316974, "11": 74388336, "12": 72480470, "13": 63232306, "14": 60959782, "15": 64187680, "16": 59511764, "17": 64281982, "18": 55763074, "19": 53735656, "20": 58114749, "21": 50855586, "22": 61382644, "23": 52291577, "24": 47651928, "25": 51628093, "26": 38939728, "27": 45753342, "28": 41164216, "29": 41841565, "30": 40196606, "31": 39786599, "32": 38745890, "33": 31361794, "34": 42089769, "35": 26506199, "36": 30798114, "37": 30897806, "38": 23903967, "39": 123833839, "41": 6608343}, "x_contigs": "39", "mt_contigs": "41"}
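Before handing the file to downstream steps, it may be worth checking that the spec is internally consistent. The sketch below uses a hypothetical `check_reference_spec` helper and a three-contig excerpt of the JSON above rather than the full file. It assumes the keys mirror the keyword arguments of Hail's `ReferenceGenome` constructor.

```python
import json

def check_reference_spec(spec):
    """Return a list of problems found in a reference genome spec dict."""
    problems = []
    contigs = spec["contigs"]
    lengths = spec["lengths"]
    if len(set(contigs)) != len(contigs):
        problems.append("duplicate contig names")
    for contig in contigs:
        if contig not in lengths:
            problems.append(f"no length for contig {contig}")
        elif not isinstance(lengths[contig], int) or lengths[contig] <= 0:
            problems.append(f"invalid length for contig {contig}")
    for key in ("x_contigs", "y_contigs", "mt_contigs"):
        value = spec.get(key, [])
        names = [value] if isinstance(value, str) else list(value)
        for name in names:
            if name not in contigs:
                problems.append(f"{key} refers to unknown contig {name}")
    return problems

# Excerpt with the same keys as the file written above
spec = json.loads(
    '{"name": "canine", "contigs": ["38", "39", "41"], '
    '"lengths": {"38": 23903967, "39": 123833839, "41": 6608343}, '
    '"x_contigs": "39", "mt_contigs": "41"}'
)
problems = check_reference_spec(spec)

# With Hail installed, a clean spec could then be passed on as keyword
# arguments, e.g. hl.ReferenceGenome(**spec) (assumption: the keys match
# the constructor's parameter names).
```

An empty `problems` list means that every contig has a positive integer length and that the sex and mitochondrial contigs are among the declared contigs.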