# Extracting per-read per-position dwell-times from fast5 files

The purpose of this notebook is to extract a dwell-time for each nucleotide position on one nanopore sequencing read.
The dwell-times for the read are extracted from that read's fast5 file.

This notebook is written very pedantically on purpose. It helps me learn, and someone might need to trace my steps.

This notebook was written to be run with Python 2.

## Background

These are all good pages to learn about the fast5 files and the data they contain:
* https://nanoporetech.github.io/tombo/resquiggle.html?highlight=fast5#tombo-fast5-format (official Tombo docs on the fast5 format)
* https://simpsonlab.github.io/2015/04/08/eventalign/ (brief high-level overview)
* https://simpsonlab.github.io/2015/12/18/kdtree-mapping/ (just the first few paragraphs - info on raw signal data)
* https://simpsonlab.github.io/2017/02/27/packing_fast5/ (nuts and bolts on fast5 internals)
* https://portal.hdfgroup.org/display/HDF5/Introduction+to+HDF5 (intro to HDF5)

Fast5 files are really HDF5 files, so it helps to know a little about HDF5.

(The blog posts on the Simpson Lab website are about nanopolish rather than Tombo. nanopolish is a similar tool, and its analog of Tombo's "resquiggling" step is called "eventalign.")

## Data source

The `data` directory, next to this notebook, contains one fast5 file and one FASTA file. The fast5 file (which is in single-fast5 format, not multi-fast5 format) is an HDF5 file describing
* the raw output of the nanopore sequencing device
* inaccurate basecalls made during sequencing
* the results of "resquiggling" the read to a reference genome

The FASTA file in the `data` directory is the same reference genome to which the fast5 was resquiggled.
The fast5 and FASTA files were copied from these paths on 7/28:
* `/fs/project/PAS1405/GabbyLee/project/m6A_modif/sequencing_data/Trizol_A/01936724-8ae4-4a0f-99ce-c67f3a334c1f.fast5`
* `/fs/project/PAS1405/General/Kimmel_Chris/RNA_section__454_9627.fa`

The original fast5 was from natural HIV RNA harvested with Trizol, not from IVT RNA.

## Navigating fast5 files

In [155]:
import numpy as np
import h5py # the usual third-party Python package for working with HDF5 files (not part of the standard library) - http://docs.h5py.org/en/stable/quick.html

In [3]:
fast5_path = 'data/01936724-8ae4-4a0f-99ce-c67f3a334c1f.fast5'
reference_fasta_path = 'data/RNA_section__454_9627.fa'

Here's just a toy example of how we can navigate through HDF5 groups and read their attributes:

(None of the groups `Analyses`, `Raw`, `UniqueGlobalKey` have any attributes.)

In [57]:
with h5py.File(fast5_path, 'r') as file:
    print('root-group subgroups: {}'.format(file.keys()))
    print('root-group attributes: {}'.format(file.attrs.keys()))
    print('')
    for key in file.keys():
        print('group: {}'.format(key))
        print('\tsubgroups: {}'.format(file[key].keys()))
        print('\tattributes: {}'.format(file[key].attrs.keys()))
        print('')

root-group subgroups: [u'Analyses', u'Raw', u'UniqueGlobalKey']
root-group attributes: [u'file_version']

group: Analyses
	subgroups: [u'Segmentation_000', u'Basecall_1D_000', u'RawGenomeCorrected_000']
	attributes: []

group: Raw
	subgroups: [u'Reads']
	attributes: []

group: UniqueGlobalKey
	subgroups: [u'channel_id', u'context_tags', u'tracking_id']
	attributes: []
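
The loop above only descends one level. A minimal sketch of a recursive version follows. `list_tree` is a hypothetical helper name, and it relies only on `.keys()` and item access, so it is shown here on nested dicts standing in for HDF5 groups (the toy layout mimics the fast5 file but is not read from it). It should work the same way on an open `h5py.File`, and h5py groups also provide a built-in `visititems` method for this kind of traversal.

```python
def list_tree(node, prefix=''):
    """Return the paths of everything below `node`, depth first.

    Anything with a .keys() method is treated as a group and descended into;
    everything else is treated as a leaf (for example, a dataset).
    """
    paths = []
    for key in sorted(node.keys()):
        path = prefix + '/' + key
        paths.append(path)
        child = node[key]
        if hasattr(child, 'keys'):
            paths.extend(list_tree(child, path))
    return paths


# Nested dicts standing in for HDF5 groups (hypothetical, for illustration only)
toy_file = {
    'Analyses': {'Basecall_1D_000': {'BaseCalled_template': {'Fastq': 'dataset'}}},
    'Raw': {'Reads': {}},
}
tree_paths = list_tree(toy_file)
for path in tree_paths:
    print(path)
```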



In [174]:
# A useful function for peeking at HDF5 groups

def explore_hdf5_group(hdf5_group):
    '''
    Given an HDF5 group hdf5_group, print info on its name, attributes, datasets, and subgroups
    '''
    # h5py.Group is the public name for h5py._hl.group.Group
    assert isinstance(hdf5_group, h5py.Group), 'Error: This is not an hdf5 group. Its type is {}'.format(type(hdf5_group))
    
    print('group: {}'.format(hdf5_group))
    
    print('attribute values:')
    for key in hdf5_group.attrs.keys():
        print('\tattribute: {}'.format(key))
        print('\tvalue: {}'.format(hdf5_group.attrs[key]))
    
    datasets_and_subgroups = hdf5_group.keys()
    # h5py.Dataset and h5py.Group are the public names for the classes in h5py._hl
    print('datasets: {}'.format([x for x in datasets_and_subgroups if isinstance(hdf5_group[x], h5py.Dataset)]))
    print('subgroups: {}'.format([x for x in datasets_and_subgroups if isinstance(hdf5_group[x], h5py.Group)]))

There are some nice hints at https://simpsonlab.github.io/2017/02/27/packing_fast5/ about what is stored where in this HDF5 file.

### Basecall_1D_000

The dataset `file['Analyses']['Basecall_1D_000']['BaseCalled_template']['Fastq']` holds the sequencing device's initial (pre-resquiggling) guess at the sequence of the read, stored as a FASTQ record. Let's take a look:

In [96]:
# Peek at the pre-resquiggling basecall FASTQ record

with h5py.File(fast5_path, 'r') as file:
    print(np.asarray(file['Analyses']['Basecall_1D_000']['BaseCalled_template']['Fastq']))

@01936724-8ae4-4a0f-99ce-c67f3a334c1f runid=0bc4549742d8f2e47ef1c96ae611fbfe6c1c6847 read=28507 ch=475 start_time=2020-01-08T12:24:42Z flow_cell_id=FAN05606 protocol_group_id=010720_HIVRNA_SSIV_multioligos sample_id=010720_HIVRNA_SSIV_multioligos .poretools_tmp/single_fast5_A/0/01936724-8ae4-4a0f-99ce-c67f3a334c1f.fast5
AGACCAGAUAGGGCAAGAGAGCUCCUGCUAACUAGGAACCCAUGCUUAAGCCUCAAUUUGGCUUACACCUCUGAAGUGCUCAAGUAGUAUGUGCCCGUCUGUUGUGUGACUCCGUGGUAACUAUAGAGAUAUCCCUCAGACCUUUUAGUCAGUGUGGAAAAAAUUUUUAGCAGUGGCGCCCCGAACAGGACUUGAGCGAAAGUAAAUGAGAGGAGAUUUGCGCAGGACUCGGCUUGCUGAAGCGCGCACAAUAAGAGCGCGAGGGGCGGCGACUGGUGAGUCGCCAAAUUCUGACUAGCGGAGGCUAGAAGGAGAGAUGGGUGCGAGAGCGUCGGUAUUAAGCGGGGAGAAUUAGAGAGAUAAUAAAUGGGGAAAAAUGGUUAAGGCUCAGGAAAGAAAACAAUAUAAACUAAACUACUAACAGUAUGGGCAAGCGAGAGCUAGAACGAACUUCGCGGUUAAUCCUGGCCUUUUUUAAAGAGUUUCGGAAGGCAAGUAGACAAAUACUGGGACGGCUACUACUAUACCUCCUUCGGACAGGAUCAGAAGAAACUUAGAUCAUUAUAAUACAAUAGCAGACCGUUGUGUGCAUUUUAAAGGAUAGAUGCUGAAGAUUUCGAGGAAGCUUAGAUACAAAGAUAGAGGAAGAGCAAAACAAAAGUAAGAAAAAGGCACAGC
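
A FASTQ record has four lines: a header starting with `@`, the sequence, a `+` separator, and a quality string with one character per base. The sketch below splits such a record into its parts. `parse_fastq_record` is a hypothetical helper, and the record it is run on is made up rather than taken from the fast5 file. On the real dataset it would presumably be applied to the scalar value (for example, the result of indexing the dataset with `[()]`), which may arrive as bytes.

```python
def parse_fastq_record(record):
    """Split a single four-line FASTQ record into (read_id, sequence, quality)."""
    if isinstance(record, bytes):
        # h5py may hand back raw bytes; FASTQ text is plain ASCII
        record = record.decode('ascii')
    lines = record.strip().split('\n')
    if len(lines) != 4 or not lines[0].startswith('@') or not lines[2].startswith('+'):
        raise ValueError('not a single four-line FASTQ record')
    header, sequence, _, quality = lines
    if len(sequence) != len(quality):
        raise ValueError('sequence and quality lengths differ')
    read_id = header[1:].split(' ')[0]  # the text between "@" and the first space
    return read_id, sequence, quality


# A made-up record for illustration (not taken from the fast5 file)
toy_record = '@toy-read-1 ch=1\nAGACCAGAU\n+\n#$%&()*+,\n'
read_id, sequence, quality = parse_fastq_record(toy_record)
print(read_id)
print(sequence)
print(len(sequence) == len(quality))
```

Note that the basecalled sequence uses `U` (this is an RNA read), whereas the resquiggled `Events` table further down reports bases with `T`.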

### RawGenomeCorrected_000

We know the fast5 file has been resquiggled only once because `Analyses` contains only one HDF5 group with a name like `RawGenomeCorrected_000`.

If the fast5 had been resquiggled twice, there would be another HDF5 group with a name like `RawGenomeCorrected_001`.
(If `tombo resquiggle` had been given a custom name with the `--corrected-group` option, the group would carry that name instead of the default `RawGenomeCorrected_...` one.)
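
The same check can be sketched in code. `resquiggle_groups` is a hypothetical helper that filters subgroup names by the default prefix; it would miss any group created under a custom `--corrected-group` name, and it takes the naming pattern described above as given.

```python
def resquiggle_groups(analysis_group_names, prefix='RawGenomeCorrected_'):
    """Return, in sorted order, the group names that start with `prefix`."""
    return sorted(name for name in analysis_group_names if name.startswith(prefix))


# Subgroup names of `Analyses`, copied from the output earlier in this notebook
analyses_subgroups = ['Segmentation_000', 'Basecall_1D_000', 'RawGenomeCorrected_000']
found = resquiggle_groups(analyses_subgroups)
print(found)
print('{} resquiggle group(s) under the default name'.format(len(found)))
```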

In [137]:
# Peek at HDF5 group "RawGenomeCorrected_000"

with h5py.File(fast5_path, 'r') as file:
    explore_hdf5_group(file['Analyses']['RawGenomeCorrected_000'])

group: <HDF5 group "/Analyses/RawGenomeCorrected_000" (1 members)>
attribute values:
	attribute: tombo_version
	value: 1.5.1
	attribute: basecall_group
	value: Basecall_1D_000
subgroups: [u'BaseCalled_template']


In [219]:
with h5py.File(fast5_path, 'r') as file:
    explore_hdf5_group(file['Analyses']['RawGenomeCorrected_000']['BaseCalled_template'])

group: <HDF5 group "/Analyses/RawGenomeCorrected_000/BaseCalled_template" (2 members)>
attribute values:
	attribute: status
	value: success
	attribute: rna
	value: True
	attribute: signal_match_score
	value: 1.28494354458
	attribute: shift
	value: 570.098736793
	attribute: scale
	value: 73.7784909873
	attribute: norm_type
	value: median
	attribute: lower_lim
	value: -5.0
	attribute: upper_lim
	value: 5.0
	attribute: outlier_threshold
	value: 5.0
datasets: [u'Events']
subgroups: [u'Alignment']


In [220]:
with h5py.File(fast5_path, 'r') as file:
    explore_hdf5_group(file['Analyses']['RawGenomeCorrected_000']['BaseCalled_template']['Alignment'])

group: <HDF5 group "/Analyses/RawGenomeCorrected_000/BaseCalled_template/Alignment" (0 members)>
attribute values:
	attribute: mapped_start
	value: 33
	attribute: mapped_end
	value: 9166
	attribute: mapped_strand
	value: +
	attribute: mapped_chrom
	value: truncated_hiv_rna_genome
	attribute: clipped_bases_start
	value: 18
	attribute: clipped_bases_end
	value: 13
	attribute: num_insertions
	value: 380
	attribute: num_deletions
	value: 459
	attribute: num_matches
	value: 8302
	attribute: num_mismatches
	value: 372
datasets: []
subgroups: []


In [192]:
# And this is the dataset that actually contains the resquiggling results for this read:

with h5py.File(fast5_path, 'r') as file:
    dataset = file['Analyses']['RawGenomeCorrected_000']['BaseCalled_template']['Events']
    
    print('Shape of dataset (In our case, this is the number of positions):')
    print('\t' + str(dataset.shape))
    # If the dataset were two-dimensional, then dataset.shape would be a length-two tuple.
    # Even though our dataset has one dimension (9133 rows), each entry has 5 different "fields", which are basically like columns
    
    print('Dataset dtype:')
    print('\t' + str(dataset.dtype))

Shape of dataset (In our case, this is the number of positions):
	(9133,)
Dataset dtype:
	[('norm_mean', '<f8'), ('norm_stdev', '<f8'), ('start', '<u4'), ('length', '<u4'), ('base', 'S1')]
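
A dtype like this makes the dataset behave as a NumPy structured array once it is loaded. The sketch below builds a tiny stand-in array with the same field layout, using three rows copied from the start of the `Events` table displayed in the next section, to show how rows and fields are indexed. `toy_events` is an illustrative name, not something stored in the fast5 file.

```python
import numpy as np

# Same field layout as the Events dataset shown above
events_dtype = np.dtype([('norm_mean', '<f8'), ('norm_stdev', '<f8'),
                         ('start', '<u4'), ('length', '<u4'), ('base', 'S1')])

# Three rows standing in for the real table
toy_events = np.array([(2.1117437, np.nan, 0, 20, b'G'),
                       (2.31800499, np.nan, 20, 17, b'A'),
                       (1.93237796, np.nan, 37, 6, b'G')], dtype=events_dtype)

print(toy_events.shape)        # one dimension: one entry per row
print(toy_events.dtype.names)  # the field ("column") names
print(toy_events['length'])    # indexing by field name gives a plain 1-D array
print(toy_events[1]['start'])  # indexing by row, then by field, gives a single value
```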


## Resquiggling results

HDF5 datasets can have attributes, just like groups:

In [228]:
with h5py.File(fast5_path, 'r') as file:
    dataset = file['Analyses']['RawGenomeCorrected_000']['BaseCalled_template']['Events']
    print('Attribute(s) and value(s) of the "Events" dataset:')
    for key in dataset.attrs.keys():
        print('\tattribute: {}'.format(key))
        print('\tvalue: {}'.format(dataset.attrs[key]))

Attribute(s) and value(s) of the "Events" dataset:
	attribute: read_start_rel_to_raw
	value: 884


Let's extract the dataset into a numpy array.

In [221]:
# Extract resquiggling results into numpy array

with h5py.File(fast5_path, 'r') as file:
    dataset = file['Analyses']['RawGenomeCorrected_000']['BaseCalled_template']['Events']
    resquig_array = np.asarray(dataset)
    
# Now we don't have to do everything in "with ... as ..." blocks.
print(type(resquig_array))
print(resquig_array.shape)
print('First few rows:')
display(resquig_array[:20].reshape(-1,1)) # the "reshape" is just to make it display in one column
print('Last few rows:')
display(resquig_array[-20:].reshape(-1,1)) # the "reshape" is just to make it display in one column

<type 'numpy.ndarray'>
(9133,)
First few rows:


array([[( 2.1117437 , nan,   0, 20, 'G')],
       [( 2.31800499, nan,  20, 17, 'A')],
       [( 1.93237796, nan,  37,  6, 'G')],
       [(-0.87049404, nan,  43, 24, 'C')],
       [(-1.46034994, nan,  67, 87, 'T')],
       [(-1.33534568, nan, 154, 19, 'C')],
       [(-0.58246972, nan, 173,  8, 'T')],
       [(-0.58190496, nan, 181,  6, 'C')],
       [( 1.28680815, nan, 187, 80, 'T')],
       [( 0.88280516, nan, 267, 26, 'G')],
       [( 0.69669713, nan, 293,  6, 'G')],
       [(-0.60501642, nan, 299, 13, 'C')],
       [(-0.08718083, nan, 312, 36, 'T')],
       [( 0.66874183, nan, 348, 32, 'A')],
       [(-0.09744908, nan, 380, 22, 'A')],
       [(-0.69920428, nan, 402, 80, 'C')],
       [( 0.43608903, nan, 482, 22, 'T')],
       [( 1.95839557, nan, 504, 29, 'A')],
       [( 2.30242602, nan, 533, 63, 'G')],
       [( 2.9469981 , nan, 596, 21, 'G')]],
      dtype=[('norm_mean', '<f8'), ('norm_stdev', '<f8'), ('start', '<u4'), ('length', '<u4'), ('base', 'S1')])

Last few rows:


array([[(-0.81684245, nan, 404889,  18, 'T')],
       [( 0.25531489, nan, 404907,  31, 'C')],
       [( 1.50604909, nan, 404938,  94, 'A')],
       [( 1.47831608, nan, 405032,   6, 'A')],
       [(-0.62072434, nan, 405038, 152, 'T')],
       [( 1.27361094, nan, 405190,  47, 'A')],
       [( 1.17880195, nan, 405237,  29, 'A')],
       [( 1.89266687, nan, 405266,  19, 'A')],
       [( 0.63909227, nan, 405285,   8, 'G')],
       [(-0.97422045, nan, 405293,  18, 'C')],
       [(-0.71518681, nan, 405311,   6, 'T')],
       [( 0.08107055, nan, 405317,  25, 'T')],
       [( 1.01747717, nan, 405342,  18, 'G')],
       [(-1.42034681, nan, 405360,  13, 'C')],
       [(-0.8544484 , nan, 405373,  17, 'C')],
       [(-0.13107025, nan, 405390, 105, 'T')],
       [( 1.2512576 , nan, 405495,  41, 'T')],
       [( 1.66882629, nan, 405536,  27, 'G')],
       [( 2.28478419, nan, 405563,  12, 'A')],
       [( 0.71333169, nan, 405575,  11, 'G')]],
      dtype=[('norm_mean', '<f8'), ('norm_stdev', '<f8'), ('start', '<u4'), ('length', '<u4'), ('base', 'S1')])

For an explanation of the fields, look at the official Tombo docs here:
* https://nanoporetech.github.io/tombo/resquiggle.html#tombo-fast5-format (see the paragraph starting with "The Events slot" under the section "Tombo FAST5 format")

It appears that
* Every row corresponds to one nucleotide position in the reference genome (but not every nucleotide in the reference genome gets mapped to)
* `norm_mean` is the (normalized) average current across all current measurements taken while that nucleotide of that read was centered in the nanopore
* `norm_stdev` is the (normalized) standard deviation across all measurements taken while that nucleotide of that read was centered in the nanopore
* `length` is the total number of current measurements taken while that nucleotide of that read was centered in the nanopore
* `base` is the corresponding nucleotide in the reference genome
* `start` is the index of the first current measurement attributed to this nucleotide of this read, relative to `read_start_rel_to_raw`
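
Putting these fields together with the attributes printed earlier (`mapped_start` = 33, `mapped_end` = 9166, `read_start_rel_to_raw` = 884), each row can be tied to a reference position and to a slice of the raw signal. The sketch below is only that, a sketch: `locate_row` is a hypothetical helper, and it assumes a `+` strand mapping with 0-based coordinates and raw-signal indices that run in the same direction as the `Events` rows. For RNA reads in particular, it is worth checking the Tombo docs on how the raw signal is oriented before trusting the raw indices.

```python
def locate_row(row_index, start, length, mapped_start, read_start_rel_to_raw):
    """Map one Events row to a reference position and a raw-signal slice.

    Assumes a '+' strand mapping with 0-based coordinates, and that `start`
    counts measurements from `read_start_rel_to_raw`.
    """
    reference_position = mapped_start + row_index
    raw_first = read_start_rel_to_raw + start
    raw_stop = raw_first + length  # exclusive end, like a Python slice
    return reference_position, raw_first, raw_stop


# Attribute values and the first two Events rows, copied from the outputs in this notebook
mapped_start, mapped_end = 33, 9166
read_start_rel_to_raw = 884
n_rows = 9133

# The row count matches the length of the mapped reference interval
print(mapped_end - mapped_start == n_rows)

row0 = locate_row(0, start=0, length=20, mapped_start=mapped_start,
                  read_start_rel_to_raw=read_start_rel_to_raw)
row1 = locate_row(1, start=20, length=17, mapped_start=mapped_start,
                  read_start_rel_to_raw=read_start_rel_to_raw)
print(row0)
print(row1)
```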

(I think the 'Events' dataset has been deprecated? See Mari Miyamoto's post in this tweet thread: https://twitter.com/nickschurch/status/1037328783836160000. Not sure what to make of this.)

In [233]:
print("Is the 'length' field just the difference between neighboring 'start' fields?")
np.all( # only true if every entry is true
    np.equal( # elementwise comparison
        np.diff(resquig_array['start']), # has length one less than len(resquig_array['start'])
        resquig_array['length'][:-1] # Use index [:-1] to throw out last entry
    )
)

Is the 'length' field just the difference between neighboring 'start' fields?


True

Well, that's convenient. For every row $x$ except the last (which has no row $x+1$ after it):

$$\text{length}_x = \text{start}_{x+1} - \text{start}_x$$

In [250]:
# Is the length field ever zero?
print(np.where(resquig_array['length'] == 0))
assert len(np.where(resquig_array['length'] == 0)[0]) == 0, 'Error: there are entries in the Events table with length 0'
# The result of np.where is an empty array, so there are no positions where the length field is zero.

(array([], dtype=int64),)


In this dataset, it appears `length` is never zero.

Because the flowcell ammeters take current measurements at a constant rate, `length` is proportional to the dwell time.
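
To turn `length` into an actual duration, divide by the sampling rate. The sketch below assumes the rate is known; in ONT fast5 files I believe it is stored as the `sampling_rate` attribute of `UniqueGlobalKey/channel_id`, which can be confirmed with `explore_hdf5_group`. The value used here is a placeholder, not one read from this file, and `dwell_times_in_seconds` is a hypothetical helper name.

```python
def dwell_times_in_seconds(lengths, sampling_rate_hz):
    """Convert per-position measurement counts into dwell times in seconds."""
    if sampling_rate_hz <= 0:
        raise ValueError('sampling rate must be positive')
    return [n / float(sampling_rate_hz) for n in lengths]


# First few `length` values from the Events table shown earlier
lengths = [20, 17, 6, 24, 87]
# HYPOTHETICAL sampling rate in Hz; read the real value from the fast5 file instead
sampling_rate_hz = 4000.0

dwell_seconds = dwell_times_in_seconds(lengths, sampling_rate_hz)
for n, seconds in zip(lengths, dwell_seconds):
    print('{} measurements -> {:.5f} s'.format(n, seconds))
```

For comparisons within a single read, the raw `length` counts are enough, since the conversion only rescales every position by the same constant.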

In [272]:
# Here's the genome from the events table
events_genome = ''.join(resquig_array['base'])
events_genome

'GAGCTCTCTGGCTAACTAGGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTCAAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGACTTGAAAGCGAAAGTAAAGCCAGAGGAGATCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGGCAAGAGGCGAGGGGCGGCGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCGTCGGTATTAAGCGGGGGAGAATTAGATAAATGGGAAAAAATTCGGTTAAGGCCAGGGGGAAAGAAACAATATAAACTAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAATCCTGGCCTTTTAGAGACATCAGAAGGCTGTAGACAAATACTGGGACAGCTACAACCATCCCTTCAGACAGGATCAGAAGAACTTAGATCATTATATAATACAATAGCAGTCCTCTATTGTGTGCATCAAAGGATAGATGTAAAAGACACCAAGGAAGCCTTAGATAAGATAGAGGAAGAGCAAAACAAAAGTAAGAAAAAGGCACAGCAAGCAGCAGCTGACACAGGAAACAACAGCCAGGTCAGCCAAAATTACCCTATAGTGCAGAACCTCCAGGGGCAAATGGTACATCAGGCCATATCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTCAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAATACCATGCTAAACACAGTGGGGGGACATCAAGCAGCCATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCATGCAGGGCCTATTGCACCAGGCCAGATGAGAGAACCAAGG

In [268]:
# Here's the genome from the FASTA file:
with open(reference_fasta_path, 'rt') as file:
    # The FASTA file uses Windows-style (CRLF) line endings, so file.read() returns a string with "\r\n" between lines
    fasta_content = file.read().split('\r\n')
print(fasta_content)
reference_genome = fasta_content[1]

['>truncated_hiv_rna_genome', 'gggtctctctggttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgctcaaagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagacccttttagtcagtgtggaaaatctctagcagtggcgcccgaacagggacttgaaagcgaaagtaaagccagaggagatctctcgacgcaggactcggcttgctgaagcgcgcacggcaagaggcgaggggcggcgactggtgagtacgccaaaaattttgactagcggaggctagaaggagagagatgggtgcgagagcgtcggtattaagcgggggagaattagataaatgggaaaaaattcggttaaggccagggggaaagaaacaatataaactaaaacatatagtatgggcaagcagggagctagaacgattcgcagttaatcctggccttttagagacatcagaaggctgtagacaaatactgggacagctacaaccatcccttcagacaggatcagaagaacttagatcattatataatacaatagcagtcctctattgtgtgcatcaaaggatagatgtaaaagacaccaaggaagccttagataagatagaggaagagcaaaacaaaagtaagaaaaaggcacagcaagcagcagctgacacaggaaacaacagccaggtcagccaaaattaccctatagtgcagaacctccaggggcaaatggtacatcaggccatatcacctagaactttaaatgcatgggtaaaagtagtagaagagaaggctttcagcccagaagtaatacccatgttttcagcattatcagaaggagccaccccacaagatttaaataccatgctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagaccatcaatgaggaagctgcagaa

In [274]:
# Check that events_genome is a substring of the reference genome:
events_genome in reference_genome

False

In [None]:
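# The check above is case-sensitive: events_genome is upper-case, but the FASTA sequence is lower-case.
# Okay, let's try the comparison again after upper-casing the reference:
events_genome in reference_genome.upper()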
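
The `False` above is likely a case mismatch rather than a true disagreement: Python's `in` is case-sensitive, the `Events` table reports upper-case bases, and the FASTA sequence is lower-case. A minimal sketch of a case-insensitive lookup that also reports where the match starts is below. `find_ignoring_case` is a hypothetical helper, and the toy strings are short excerpts copied from the outputs above, not the full sequences.

```python
def find_ignoring_case(fragment, reference):
    """Return the 0-based offset of `fragment` in `reference`, ignoring case, or -1."""
    return reference.upper().find(fragment.upper())


# Toy excerpts: the start of the FASTA sequence shown above,
# and the first bases of events_genome in their upper-case form
toy_reference = 'gggtctctctggttagaccagatctgagcctgggagctctctggctaactagggaacc'
toy_fragment = 'GAGCTCTCTGGCTAACTAGGG'

print(toy_fragment in toy_reference)                    # case-sensitive test
print(find_ignoring_case(toy_fragment, toy_reference))  # offset after normalizing case
```

On the full sequences, the returned offset can be compared against the `mapped_start` attribute as a sanity check on the alignment coordinates.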