# Genome Comparison

This notebook contains the code used to check whether several popular projects' `GRCh38` reference FASTAs were derived from the `GRCh38_no_alt` analysis set. It contains the links that were used to track down the reference genome for each project, the code to download and prepare the reference genome as specified in each project's docs, and the final generation of a sequence dictionary for each FASTA to do the comparison.

In particular, the goal is to determine whether the autosomes, the sex chromosomes, and the decoy sequences that all genomes share (for instance, the EBV sequence) have the same MD5 sum as the original no alt set.
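
To make "the same MD5 sum" concrete: as I understand the SAM specification, the `M5` value stored in a sequence dictionary is the MD5 digest of the sequence after whitespace is removed and the bases are uppercased. The sketch below reproduces that idea on a toy FASTA string. The function name `sequence_md5s` and the toy sequences are my own illustrations, and this is not Picard's actual implementation.

```python
import hashlib

def sequence_md5s(fasta_text):
    """Return {sequence name: MD5 hex digest} for FASTA-formatted text."""
    digests = {}
    name, md5 = None, None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if name is not None:
                digests[name] = md5.hexdigest()
            # Keep only the first whitespace-delimited token of the header.
            name = line[1:].split()[0]
            md5 = hashlib.md5()
        elif name is not None:
            # Drop whitespace and uppercase the bases before hashing.
            md5.update("".join(line.split()).upper().encode("ascii"))
    if name is not None:
        digests[name] = md5.hexdigest()
    return digests

# Toy input: chrA is soft-masked and wrapped differently, but has the same bases as chrB.
toy = ">chrA first copy\nacgt\nACGT\n>chrB\nACGTACGT\n"
result = sequence_md5s(toy)
print(result["chrA"] == result["chrB"])  # prints True
```

Because line wrapping and case are normalized away, two FASTAs can differ byte-for-byte on disk and still agree on every per-sequence MD5, which is why the comparison is done on the sequence dictionary rather than on the files themselves.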

## Setup

First, let's ensure you have all the command line tools needed. For this notebook, I have preinstalled [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) and I'm using anaconda + bioconda to install any dependencies not already included with Ubuntu 18.04.

In [1]:
%%bash
# Uncomment the lines below to create the conda environment.
# - I'm not necessarily advocating you do this in this notebook, 
#   probably want to set it up in a terminal and then start this Jupyter notebook after :).
# - `gunzip` is assumed to be available on any machine by default.

# conda create -n genome-comparison \
#              -c anaconda \
#              -c bioconda \
#              wget \
#              picard \
#              -y
# conda init bash
# source ~/.bashrc
# conda activate genome-comparison

If you'd just like to check that you have everything on the `$PATH`, you can run this snippet.

In [2]:
%%bash
FAIL=false
commands_needed=(wget gunzip picard)
for CMD in ${commands_needed[*]}; do
  if ! which $CMD &>/dev/null; then
    >&2 echo "- \`$CMD\` must be available on the \$PATH!"
    FAIL=true
  fi
done

if [[ $FAIL = true ]]; then 
  >&2 echo ""
  >&2 echo "Please add the above command line tools to your \$PATH before continuing! Note that some of the tools (\`picard\` specifically) should be installed with bioconda (see instructions above). If they are not, you may need to edit these commands manually."
fi
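
If you'd rather do this check from a Python cell, here is a minimal sketch using the standard library's `shutil.which`. The helper name `missing_tools` is my own and is not part of any tool used in this notebook.

```python
import shutil

def missing_tools(names):
    """Return the subset of `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]

missing = missing_tools(["wget", "gunzip", "picard"])
for name in missing:
    print(f"- `{name}` must be available on the $PATH!")
```

An empty `missing` list means all three tools were found.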

## Original no alt analysis set

It's always good to have a baseline, right? For our baseline, we will be using the officially hosted `GRCh38_no_alt` analysis set FASTA from the NCBI (and also recommended by [Heng Li on his blog](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use)). That's a great read by the way, you should check it out if you have a few minutes (like, say, as you are running this)!

In [3]:
%%bash
FASTA_SOURCE="NCBI"
FASTA_URL="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz"

>&2 echo "== $FASTA_SOURCE =="
>&2 echo "  [*] Downloading $FASTA_SOURCE reference genome..."
wget $FASTA_URL -O "$FASTA_SOURCE".fa.gz -q --continue

>&2 echo "  [*] Unzipping $FASTA_SOURCE reference genome..."
gunzip -c "$FASTA_SOURCE".fa.gz > "$FASTA_SOURCE".fa

>&2 echo "  [*] Creating the $FASTA_SOURCE sequence dictionary..."
picard CreateSequenceDictionary R="$FASTA_SOURCE".fa O="$FASTA_SOURCE".fa.dict

== NCBI ==
  [*] Downloading NCBI reference genome...
  [*] Unzipping NCBI reference genome...
  [*] Creating the NCBI sequence dictionary...
INFO	2019-07-07 00:02:07	CreateSequenceDictionary	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    CreateSequenceDictionary -R NCBI.fa -O NCBI.fa.dict
**********


00:02:08.148 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/claymcleod/conda/envs/star-mapping/share/picard-2.20.2-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Jul 07 00:02:08 CDT 2019] CreateSequenceDictionary OUTPUT=NCBI.fa.dict REFERENCE=NCBI.fa    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRI

## ENCODE Project

The ENCODE project lists all of its currently used reference files [here](https://www.encodeproject.org/data-standards/reference-sequences/). From that page, you can see that the base reference genome can be downloaded [here](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz). However, we'd like to do a complete analysis that includes all of the sequences they use for decoys, viruses, etc., in case we want to use them later. Here are the steps I took to find the complete set of reference FASTAs they used in their STAR index:

1. Searching for their RNA-seq pipeline yields their specification pretty quickly ([here](https://www.encodeproject.org/pages/pipelines/#RNA-seq)).
2. The pipeline we are looking for is [this one](https://www.encodeproject.org/pipelines/ENCPL002LPE/).
3. At the bottom of the page, you will see a PDF that gives a comprehensive overview of their pipelines and contains a list of all of the current ENCODE reference accessions ([link](https://www.encodeproject.org/documents/6354169f-86f6-4b59-8322-141005ea44eb/@@download/attachment/Long%20RNA-seq%20pipeline%20overview.pdf)).
4. In that document, you find that the link to their most recently built `STAR` genome is [here](https://www.encodeproject.org/references/ENCSR314WMD/).
5. That page gives you every `STAR` index they use at ENCODE. Select the `GRCh38`-based one ([here](https://www.encodeproject.org/files/ENCFF742NER/)).
6. Finally, you can see all of the files from which this `STAR` index was derived. I think that's a pretty nice feature for data provenance!

For our purposes, we can simply concatenate them. Note that the names I've given them in this list are derived from the metadata tags on each of those pages.
  * [Spikes.fixed.fasta.gz](https://www.encodeproject.org/files/ENCFF001RTP/@@download/ENCFF001RTP.fasta.gz)
  * [PhiX.fasta.gz](https://www.encodeproject.org/files/ENCFF335FFV/@@download/ENCFF335FFV.fasta.gz)
  * [GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz)
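
For readers who prefer Python over shell for the concatenation step, here is a minimal sketch that decompresses several gzipped FASTAs, in order, into one plain file. The helper name `concatenate_gzipped` and the toy records are assumptions for illustration only; nothing is downloaded here.

```python
import gzip
import os
import shutil
import tempfile

def concatenate_gzipped(sources, destination):
    """Decompress each gzip file in `sources`, in order, into one plain file."""
    with open(destination, "wb") as out:
        for path in sources:
            with gzip.open(path, "rb") as handle:
                shutil.copyfileobj(handle, out)

# Demonstrate on two tiny, made-up FASTA records written to a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for name, record in [("genome", b">chr1\nACGT\n"), ("spikein", b">ERCC-00002\nTTGA\n")]:
        path = os.path.join(tmp, name + ".fa.gz")
        with gzip.open(path, "wb") as handle:
            handle.write(record)
        parts.append(path)
    combined = os.path.join(tmp, "combined.fa")
    concatenate_gzipped(parts, combined)
    with open(combined, "rb") as handle:
        combined_bytes = handle.read()

print(combined_bytes.decode())
```

One thing to watch for with any concatenation approach: if an input does not end with a newline, the next file's `>` header gets fused onto the last sequence line.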

In [4]:
%%bash
FASTA_SOURCE="ENCODE"
FASTA_URL="https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz"
PHIX_FASTA="https://www.encodeproject.org/files/ENCFF335FFV/@@download/ENCFF335FFV.fasta.gz"
SPIKEIN_FASTA="https://www.encodeproject.org/files/ENCFF001RTP/@@download/ENCFF001RTP.fasta.gz"

>&2 echo "== $FASTA_SOURCE =="
>&2 echo "  [*] Downloading $FASTA_SOURCE reference genome..."
wget $FASTA_URL -O "$FASTA_SOURCE".fa.gz -q --continue
>&2 echo "  [*] Downloading SpikeIn reference FASTA..."
wget $SPIKEIN_FASTA -O "$FASTA_SOURCE".SpikeIn.fa.gz -q --continue
>&2 echo "  [*] Downloading PhiX reference FASTA..."
wget $PHIX_FASTA -O "$FASTA_SOURCE".PhiX.fa.gz -q --continue

>&2 echo "  [*] Unzipping fastas..."
gunzip -c "$FASTA_SOURCE".fa.gz "$FASTA_SOURCE".SpikeIn.fa.gz "$FASTA_SOURCE".PhiX.fa.gz > "$FASTA_SOURCE".fa

>&2 echo "  [*] Creating the $FASTA_SOURCE sequence dictionary..."
picard CreateSequenceDictionary R="$FASTA_SOURCE".fa O="$FASTA_SOURCE".fa.dict

== ENCODE ==
  [*] Downloading ENCODE reference genome...
  [*] Downloading SpikeIn reference FASTA...
  [*] Downloading PhiX reference FASTA...
  [*] Unzipping fastas...
  [*] Creating the ENCODE sequence dictionary...
INFO	2019-07-07 00:04:05	CreateSequenceDictionary	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    CreateSequenceDictionary -R ENCODE.fa -O ENCODE.fa.dict
**********


00:04:06.015 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/claymcleod/conda/envs/star-mapping/share/picard-2.20.2-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Jul 07 00:04:06 CDT 2019] CreateSequenceDictionary OUTPUT=ENCODE.fa.dict REFERENCE=ENCODE.fa    TRUNCATE_NAMES_AT_WHITESPA

## Genomic Data Commons

The GDC similarly makes all of their reference files available on one page [here](https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files). The docs currently specify that their genome is made up of the three following reference sets:

* The standard `GRCh38_no_alt` FASTA.
* A set of standard sequence decoys.
* A set of viral sequences.

To be safe, we will download the full, concatenated FASTA they provide.

In [5]:
%%bash
FASTA_SOURCE="GDC"
FASTA_URL="https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834"

>&2 echo "== $FASTA_SOURCE =="
>&2 echo "  [*] Downloading $FASTA_SOURCE reference genome..."
wget $FASTA_URL -O "$FASTA_SOURCE".fa.tar.gz -q --continue

>&2 echo "  [*] Unzipping $FASTA_SOURCE reference genome..."
tar xfvz "$FASTA_SOURCE".fa.tar.gz
mv GRCh38.d1.vd1.fa "$FASTA_SOURCE".fa

>&2 echo "  [*] Creating the $FASTA_SOURCE sequence dictionary..."
picard CreateSequenceDictionary R="$FASTA_SOURCE".fa O="$FASTA_SOURCE".fa.dict

GRCh38.d1.vd1.fa


== GDC ==
  [*] Downloading GDC reference genome...
  [*] Unzipping GDC reference genome...
  [*] Creating the GDC sequence dictionary...
INFO	2019-07-07 00:06:05	CreateSequenceDictionary	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    CreateSequenceDictionary -R GDC.fa -O GDC.fa.dict
**********


00:06:06.183 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/claymcleod/conda/envs/star-mapping/share/picard-2.20.2-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Jul 07 00:06:06 CDT 2019] CreateSequenceDictionary OUTPUT=GDC.fa.dict REFERENCE=GDC.fa    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPR

## TOPMed + GTEx RNA-seq pipeline

The GTEx consortium and TOPMed program both use the [GTEx RNA-seq pipeline](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq) developed by the Broad Institute. This workflow processes a large number of samples and has a strong reputation, so it's worth taking a look at.

Following the "reference genome and annotation" [section](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq#reference-genome-and-annotation) of their `README.md`, you are directed to the [TOPMed RNA-seq pipeline harmonization](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) page. The "Reference files" section of that documentation essentially lays out that they use the Broad Institute's version of `GRCh38` and add the `ERCC SpikeIn` sequences. They provide both [a link to the Broad's original FASTA](https://software.broadinstitute.org/gatk/download/bundle) and [a link to their built FASTA](https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz) (although, given that it points to a personal page, I'm not sure how long this link will be valid). For now, we will use the personal link.

In [6]:
%%bash
FASTA_SOURCE="TOPMed"
FASTA_URL="https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz"

>&2 echo "== $FASTA_SOURCE =="
>&2 echo "  [*] Downloading $FASTA_SOURCE reference genome..."
wget $FASTA_URL -O "$FASTA_SOURCE".fa.tar.gz -q --continue

>&2 echo "  [*] Unzipping $FASTA_SOURCE reference genome..."
tar xfvz "$FASTA_SOURCE".fa.tar.gz Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta
mv Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta "$FASTA_SOURCE".fa

>&2 echo "  [*] Creating the $FASTA_SOURCE sequence dictionary..."
picard CreateSequenceDictionary R="$FASTA_SOURCE".fa O="$FASTA_SOURCE".fa.dict

Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta


== TOPMed ==
  [*] Downloading TOPMed reference genome...
  [*] Unzipping TOPMed reference genome...
  [*] Creating the TOPMed sequence dictionary...
INFO	2019-07-07 00:19:05	CreateSequenceDictionary	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    CreateSequenceDictionary -R TOPMed.fa -O TOPMed.fa.dict
**********


00:19:05.708 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/claymcleod/conda/envs/star-mapping/share/picard-2.20.2-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Jul 07 00:19:05 CDT 2019] CreateSequenceDictionary OUTPUT=TOPMed.fa.dict REFERENCE=TOPMed.fa    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION

## Final Analysis

Now that we have downloaded all of the FASTAs we want to explore and created a sequence dictionary for each, we can run through them with some quick Python code to generate a Markdown table laying out how the MD5s of each sequence compare.

First, some utility functions:

In [7]:
import sys

def parse_sq_line(sq_line):
    # Strip the trailing newline so it cannot end up inside the last column's value.
    sq_split_by_tab = sq_line.rstrip("\n").split("\t")
    # Dynamically look for sequence name and md5sum
    sn = None
    md5 = None
    for col in sq_split_by_tab:
        if col.startswith("SN:"):
            sn = col.replace("SN:", "")
        elif col.startswith("M5:"):
            md5 = col.replace("M5:", "")
                
        if sn and md5:
            break
                    
    if not sn or not md5:
        print(f"Could not parse SN and MD5 for SQ line: {sq_line}!")
        sys.exit(1)
    
    return sn, md5

def blacklisted_sq(sq_name):
    """
    Returns True if we do not want to compare this SQ tag.
    
    I have chosen a blacklist approach here to ensure any new sequences added to any
    of the genomes must be manually triaged (and this notebook must be updated).
    """
    
    blacklisted = ['phiX', 'ERCC', 'chrUn_', '_random', 'CMV', 'HBV', 'HCV', 'HIV', 'KSHV', 'HTLV', 'SV40', 'HPV', 'MCV']
    for b in blacklisted:
        if b in sq_name:
            return True
    return False
        
def read_sequence_dict(file, blacklist=True):
    results = {}
    
    with open(file) as f:
        sqs = [line for line in f.readlines() if line.startswith("@SQ")]
        for sq in sqs:
            sn, md5 = parse_sq_line(sq)
            if blacklist and blacklisted_sq(sn):
                continue
            results[sn] = md5
                
    return results
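
To show what these helpers are picking apart, here is a minimal, self-contained sketch that splits a single `@SQ` line into its tags. The function name `sq_tags` and the example line (sequence name, length, MD5, and path) are invented for illustration and do not come from any of the dictionaries generated above.

```python
def sq_tags(sq_line):
    """Split one @SQ header line into a {tag: value} dict."""
    tags = {}
    # Skip the leading "@SQ" column; every other column looks like "TAG:value".
    for field in sq_line.rstrip("\n").split("\t")[1:]:
        tag, _, value = field.partition(":")
        tags[tag] = value
    return tags

# A hypothetical @SQ line; all values here are made up.
example = "@SQ\tSN:chrToy\tLN:8\tM5:0123456789abcdef0123456789abcdef\tUR:file:/tmp/toy.fa\n"
tags = sq_tags(example)
print(tags["SN"], tags["M5"])
```

Splitting on the first `:` only matters for tags like `UR`, whose value itself contains colons. Keying by tag also makes it easy to match exclusion substrings against `tags["SN"]` alone, which avoids accidental hits in the `UR` path.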

Then, the main analysis:

In [8]:
import sys

from glob import glob
from IPython.display import display, Markdown

# which reference dataset do you want to use as a baseline?
baseline='NCBI'
# Defaults to True, which compares only the autosomes, sex chromosomes, and common decoy sequences.
# If set to True, this script will complain about any MD5 mismatches.
# If set to False, the script expects differences among the uncommon sequences and just
# prints out the Markdown table.
blacklist_uncommon_seqs=True

# Read all sequence dictionaries
results = {}
for file_name in glob("*.dict"):
    source_name = file_name.replace(".fa.dict", "")
    results[source_name] = read_sequence_dict(file_name, blacklist=blacklist_uncommon_seqs)
sources = sorted(results.keys())
if baseline not in sources:
    print(f"Missing baseline source sequence dictionary: {baseline}!")
    sys.exit(1)
sources.remove(baseline)
sources = [baseline] + sources
    
# Detect all chrs in each source
detectedChrs = set()
for source in sources:
    for sq in results[source].keys():
        detectedChrs.add(sq)
        
# Accumulate chrs in a nice order for viewing. Append remaining to end in sorted order.
allChrs = []
for i in list(range(1, 23)) + ['X', 'Y', 'M', 'EBV']:
    identifier = f'chr{i}'
    if identifier in detectedChrs:
        allChrs.append(identifier)
        detectedChrs.remove(identifier)
allChrs = allChrs + sorted(detectedChrs)

# Print markdown table header
lines = []
lines.append(' | '.join(['Sequence Name'] + [sources[0] + " (baseline)"] + sources[1:] + ['Concordant']))
lines.append(' | '.join(['-'] * (len(sources) + 2)))

for _chr in allChrs:
    line = [_chr]
    concordant = True
    concordant_md5 = None
    for source in sources:
        md5 = "Not present"
        if _chr in results[source]:
            md5 = results[source][_chr]
        if not concordant_md5:
            concordant_md5 = md5
        else:
            if concordant and concordant_md5 != md5:
                if blacklist_uncommon_seqs:
                    print(f"{_chr} does not match for all sources!")
                concordant = False
        line.append("`" + md5 + "`")  
    line.append('True' if concordant else 'False')
    lines.append(' | '.join(line))  

lines = ['| ' + line + ' |' for line in lines]
md = '\n'.join(lines)

# You can embed this in your Jupyter notebook or print it to be included 
# elsewhere by commenting/uncommenting the following lines.

display(Markdown(md))
# print(md)

| Sequence Name | NCBI (baseline) | ENCODE | GDC | TOPMed | Concordant |
| - | - | - | - | - | - |
| chr1 | `6aef897c3d6ff0c78aff06ac189178dd` | `6aef897c3d6ff0c78aff06ac189178dd` | `6aef897c3d6ff0c78aff06ac189178dd` | `6aef897c3d6ff0c78aff06ac189178dd` | True |
| chr2 | `f98db672eb0993dcfdabafe2a882905c` | `f98db672eb0993dcfdabafe2a882905c` | `f98db672eb0993dcfdabafe2a882905c` | `f98db672eb0993dcfdabafe2a882905c` | True |
| chr3 | `76635a41ea913a405ded820447d067b0` | `76635a41ea913a405ded820447d067b0` | `76635a41ea913a405ded820447d067b0` | `76635a41ea913a405ded820447d067b0` | True |
| chr4 | `3210fecf1eb92d5489da4346b3fddc6e` | `3210fecf1eb92d5489da4346b3fddc6e` | `3210fecf1eb92d5489da4346b3fddc6e` | `3210fecf1eb92d5489da4346b3fddc6e` | True |
| chr5 | `a811b3dc9fe66af729dc0dddf7fa4f13` | `a811b3dc9fe66af729dc0dddf7fa4f13` | `a811b3dc9fe66af729dc0dddf7fa4f13` | `a811b3dc9fe66af729dc0dddf7fa4f13` | True |
| chr6 | `5691468a67c7e7a7b5f2a3a683792c29` | `5691468a67c7e7a7b5f2a3a683792c29` | `5691468a67c7e7a7b5f2a3a683792c29` | `5691468a67c7e7a7b5f2a3a683792c29` | True |
| chr7 | `cc044cc2256a1141212660fb07b6171e` | `cc044cc2256a1141212660fb07b6171e` | `cc044cc2256a1141212660fb07b6171e` | `cc044cc2256a1141212660fb07b6171e` | True |
| chr8 | `c67955b5f7815a9a1edfaa15893d3616` | `c67955b5f7815a9a1edfaa15893d3616` | `c67955b5f7815a9a1edfaa15893d3616` | `c67955b5f7815a9a1edfaa15893d3616` | True |
| chr9 | `6c198acf68b5af7b9d676dfdd531b5de` | `6c198acf68b5af7b9d676dfdd531b5de` | `6c198acf68b5af7b9d676dfdd531b5de` | `6c198acf68b5af7b9d676dfdd531b5de` | True |
| chr10 | `c0eeee7acfdaf31b770a509bdaa6e51a` | `c0eeee7acfdaf31b770a509bdaa6e51a` | `c0eeee7acfdaf31b770a509bdaa6e51a` | `c0eeee7acfdaf31b770a509bdaa6e51a` | True |
| chr11 | `1511375dc2dd1b633af8cf439ae90cec` | `1511375dc2dd1b633af8cf439ae90cec` | `1511375dc2dd1b633af8cf439ae90cec` | `1511375dc2dd1b633af8cf439ae90cec` | True |
| chr12 | `96e414eace405d8c27a6d35ba19df56f` | `96e414eace405d8c27a6d35ba19df56f` | `96e414eace405d8c27a6d35ba19df56f` | `96e414eace405d8c27a6d35ba19df56f` | True |
| chr13 | `a5437debe2ef9c9ef8f3ea2874ae1d82` | `a5437debe2ef9c9ef8f3ea2874ae1d82` | `a5437debe2ef9c9ef8f3ea2874ae1d82` | `a5437debe2ef9c9ef8f3ea2874ae1d82` | True |
| chr14 | `e0f0eecc3bcab6178c62b6211565c807` | `e0f0eecc3bcab6178c62b6211565c807` | `e0f0eecc3bcab6178c62b6211565c807` | `e0f0eecc3bcab6178c62b6211565c807` | True |
| chr15 | `f036bd11158407596ca6bf3581454706` | `f036bd11158407596ca6bf3581454706` | `f036bd11158407596ca6bf3581454706` | `f036bd11158407596ca6bf3581454706` | True |
| chr16 | `db2d37c8b7d019caaf2dd64ba3a6f33a` | `db2d37c8b7d019caaf2dd64ba3a6f33a` | `db2d37c8b7d019caaf2dd64ba3a6f33a` | `db2d37c8b7d019caaf2dd64ba3a6f33a` | True |
| chr17 | `f9a0fb01553adb183568e3eb9d8626db` | `f9a0fb01553adb183568e3eb9d8626db` | `f9a0fb01553adb183568e3eb9d8626db` | `f9a0fb01553adb183568e3eb9d8626db` | True |
| chr18 | `11eeaa801f6b0e2e36a1138616b8ee9a` | `11eeaa801f6b0e2e36a1138616b8ee9a` | `11eeaa801f6b0e2e36a1138616b8ee9a` | `11eeaa801f6b0e2e36a1138616b8ee9a` | True |
| chr19 | `85f9f4fc152c58cb7913c06d6b98573a` | `85f9f4fc152c58cb7913c06d6b98573a` | `85f9f4fc152c58cb7913c06d6b98573a` | `85f9f4fc152c58cb7913c06d6b98573a` | True |
| chr20 | `b18e6c531b0bd70e949a7fc20859cb01` | `b18e6c531b0bd70e949a7fc20859cb01` | `b18e6c531b0bd70e949a7fc20859cb01` | `b18e6c531b0bd70e949a7fc20859cb01` | True |
| chr21 | `974dc7aec0b755b19f031418fdedf293` | `974dc7aec0b755b19f031418fdedf293` | `974dc7aec0b755b19f031418fdedf293` | `974dc7aec0b755b19f031418fdedf293` | True |
| chr22 | `ac37ec46683600f808cdd41eac1d55cd` | `ac37ec46683600f808cdd41eac1d55cd` | `ac37ec46683600f808cdd41eac1d55cd` | `ac37ec46683600f808cdd41eac1d55cd` | True |
| chrX | `2b3a55ff7f58eb308420c8a9b11cac50` | `2b3a55ff7f58eb308420c8a9b11cac50` | `2b3a55ff7f58eb308420c8a9b11cac50` | `2b3a55ff7f58eb308420c8a9b11cac50` | True |
| chrY | `ce3e31103314a704255f3cd90369ecce` | `ce3e31103314a704255f3cd90369ecce` | `ce3e31103314a704255f3cd90369ecce` | `ce3e31103314a704255f3cd90369ecce` | True |
| chrM | `c68f52674c9fb33aef52dcf399755519` | `c68f52674c9fb33aef52dcf399755519` | `c68f52674c9fb33aef52dcf399755519` | `c68f52674c9fb33aef52dcf399755519` | True |
| chrEBV | `6743bd63b3ff2b5b8985d8933c53290a` | `6743bd63b3ff2b5b8985d8933c53290a` | `6743bd63b3ff2b5b8985d8933c53290a` | `6743bd63b3ff2b5b8985d8933c53290a` | True |
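
If you only want the disagreements rather than the full table, the same comparison can be sketched as a small function that returns just the sequences whose MD5 differs from the baseline (or that are missing from some source). The function name `discordant_sequences`, the source name `OtherProject`, and the toy MD5 strings are assumptions for illustration.

```python
def discordant_sequences(md5s_by_source, baseline):
    """Return {sequence name: {source: md5 or None}} for names that differ from the baseline."""
    reference = md5s_by_source[baseline]
    # Collect every sequence name seen in any source.
    names = set()
    for md5s in md5s_by_source.values():
        names.update(md5s)
    mismatches = {}
    for name in sorted(names):
        # None marks a sequence that is absent from a source.
        observed = {source: md5s.get(name) for source, md5s in md5s_by_source.items()}
        if any(value != reference.get(name) for value in observed.values()):
            mismatches[name] = observed
    return mismatches

# Toy data: chrA agrees, chrB differs, chrC is missing from the baseline.
toy = {
    "NCBI": {"chrA": "aaa", "chrB": "bbb"},
    "OtherProject": {"chrA": "aaa", "chrB": "ccc", "chrC": "ddd"},
}
mismatches = discordant_sequences(toy, baseline="NCBI")
print(mismatches)
```

An empty result corresponds to a table in which every row's `Concordant` column is `True`, as in the table above.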