Columns in pairwise tsv file

Ryan Wick edited this page May 30, 2023 · 15 revisions

Verticall pairwise produces a TSV file which describes each pairwise comparison in detail and is in turn used by other Verticall commands. This page describes each of the columns in this file.

Columns 1 and 2 indicate which pair of assemblies are being compared:

  1. assembly_a: sample name of assembly A (taken from the assembly's filename).
  2. assembly_b: sample name of assembly B (taken from the assembly's filename).

Note that there may be more than one line in the TSV file for a given pair (see Primary vs secondary results).
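Because a pair can span more than one line, it can help to group rows by pair when reading the file. The following is a minimal sketch, assuming the file starts with a header row whose names match the column names on this page; the function name `rows_by_pair` and the example data are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def rows_by_pair(handle):
    """Group the rows of a pairwise TSV by (assembly_a, assembly_b).

    Assumes a header row with the column names described on this page.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = defaultdict(list)
    for row in reader:
        pairs[(row["assembly_a"], row["assembly_b"])].append(row)
    return dict(pairs)


# Made-up example containing only three of the columns.
example = io.StringIO(
    "assembly_a\tassembly_b\tresult_level\n"
    "s1\ts2\tprimary\n"
    "s1\ts2\tsecondary\n"
    "s2\ts1\tprimary\n"
)
pairs = rows_by_pair(example)
for pair, rows in pairs.items():
    print(pair, [row["result_level"] for row in rows])
```

With a real file, `open(path, newline="")` would take the place of the `io.StringIO` object.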

Columns 3–6 describe the minimap2 alignments between the two assemblies:

  3. alignment_count: the number of alignments.
  4. n50_alignment_length: the N50 size of the alignments.
  5. aligned_fraction: the fraction of assembly A which is covered by the alignments. E.g. 1.0 would indicate that the entirety of assembly A aligned to assembly B, 0.5 would indicate that half of the bases in assembly A aligned to assembly B, etc.
  6. mean_distance: the mean genomic distance over all regions of the alignments. Equivalent to 1 − ANI, so it provides similar information to a tool such as FastANI.
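Since mean_distance is equivalent to 1 − ANI, converting between the two is a single subtraction. A small sketch (the function names are hypothetical, and the value 0.015 is made up):

```python
def distance_to_ani(mean_distance):
    """Convert a genomic distance (1 - ANI) into ANI as a fraction."""
    return 1.0 - mean_distance


def distance_to_ani_percent(mean_distance):
    """Same conversion, expressed as a percentage as many ANI tools report it."""
    return 100.0 * (1.0 - mean_distance)


# A made-up mean_distance of 0.015 corresponds to an ANI of about 98.5%.
print(distance_to_ani_percent(0.015))
```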

Columns 7–10 describe the sliding windows across alignments and the resulting distance distribution:

  7. window_size: the size of the sliding windows. Windows overlap by 99% of their length, so dividing this value by 100 will give the step distance between adjacent windows.
  8. window_count: the total number of windows in all alignments for this assembly pair.
  9. mean_window_distance: the mean genomic distance using all windows (the mean number of differences per window divided by the window size).
  10. median_window_distance: the interpolated median genomic distance of all windows (the interpolated median number of differences per window divided by the window size).
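The arithmetic behind these columns can be sketched as follows. The function names and the numbers are made up for illustration; only the two relationships (step = window size / 100, distance = differences / window size) come from the descriptions above.

```python
def window_step(window_size):
    """Step between adjacent windows, given 99% overlap."""
    return window_size / 100


def differences_to_distance(differences_per_window, window_size):
    """Convert a per-window difference count into a genomic distance."""
    return differences_per_window / window_size


# Made-up values: a 50,000 bp window and a mean of 25 differences per window.
print(window_step(50000))
print(differences_to_distance(25, 50000))
```

Multiplying a window distance by window_size goes the other way, recovering the (possibly fractional) number of differences per window.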

Column 11 describes the smoothed distance distribution:

  11. mass_peaks: a comma-delimited list of all peaks in the smoothed distribution. Numbers are given as genomic distances (the number of differences per window divided by the window size).
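The mass_peaks field needs to be split before its values can be used as numbers. A minimal sketch, assuming the values are plain decimal numbers separated by commas (the function name is hypothetical, and returning an empty list for an empty field is a choice made here, not documented behaviour):

```python
def parse_mass_peaks(field):
    """Split a comma-delimited mass_peaks field into a list of floats."""
    field = field.strip()
    if not field:
        # Assumption: treat an empty field as "no peaks".
        return []
    return [float(value) for value in field.split(",")]


# Made-up example with two peaks.
peaks = parse_mass_peaks("0.001,0.0123")
print(peaks)
```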

The preceding columns describe the assembly pair as a whole, i.e. they are the same whether the result is primary or secondary. The following columns are based on a particular mass peak and will differ between primary and secondary results. See Primary vs secondary results for more info.

Columns 12–14 describe the mass peak the following columns are based on:

  12. result_level: either "primary" or "secondary". Each assembly pair will have one primary result. Most pairs will not have a secondary result, but some might. Two or more secondary results are rare but possible.
  13. peak_window_distance: the genomic distance of this peak. Will be equal to one of the values in the mass_peaks column.
  14. peak_mass: the mass associated with this peak. Primary results will have the largest mass for the pair.
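The relationships described in this list (one primary result per pair, a peak distance that appears in mass_peaks, and the primary result having the largest mass) can be turned into a sanity check. This is only a sketch: `check_peaks` is a hypothetical helper, the rows are assumed to be dicts keyed by column name, and the example values are made up.

```python
import math


def check_peaks(rows):
    """Check the peak columns of all rows belonging to one assembly pair.

    Returns the primary row if the checks pass, otherwise raises ValueError.
    """
    primaries = [r for r in rows if r["result_level"] == "primary"]
    if len(primaries) != 1:
        raise ValueError(f"expected one primary result, found {len(primaries)}")
    primary = primaries[0]
    for row in rows:
        peaks = [float(p) for p in row["mass_peaks"].split(",")]
        peak = float(row["peak_window_distance"])
        # Compare with a tolerance in case the two columns are rounded differently.
        if not any(math.isclose(peak, p, rel_tol=1e-9, abs_tol=1e-12) for p in peaks):
            raise ValueError("peak_window_distance not found in mass_peaks")
        if float(row["peak_mass"]) > float(primary["peak_mass"]):
            raise ValueError("a non-primary result has a larger peak_mass")
    return primary


# Made-up rows for a single pair with one secondary result.
rows = [
    {"result_level": "primary", "mass_peaks": "0.001,0.02",
     "peak_window_distance": "0.001", "peak_mass": "0.8"},
    {"result_level": "secondary", "mass_peaks": "0.001,0.02",
     "peak_window_distance": "0.02", "peak_mass": "0.2"},
]
primary = check_peaks(rows)
print(primary["peak_window_distance"])
```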

Columns 15 and 16 describe the alignment painting:

  15. alignments_vertical_fraction: the fraction of the alignments painted vertical (expressed as a percentage).
  16. alignments_horizontal_fraction: the fraction of the alignments painted horizontal (expressed as a percentage).

Columns 17–20 provide genomic distances based on the vertically painted regions of the alignments:

  17. mean_vertical_window_distance: the mean of all vertically painted windows in the alignments.
  18. median_vertical_window_distance: the interpolated median of all vertically painted windows in the alignments.
  19. mean_vertical_distance: the mean of all vertically painted regions in the alignments. This is the default distance taken by Verticall matrix.
  20. r/m: the number of genomic differences in horizontally painted regions divided by the number of genomic differences in vertically painted regions.
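As an illustration of how mean_vertical_distance might be pulled out of the file, the sketch below collects that value from the primary result of each pair. It is not how Verticall matrix is implemented; `vertical_distances` is a hypothetical helper, the rows are assumed to be dicts keyed by column name, the field is assumed to always hold a number, and the values are made up.

```python
def vertical_distances(rows):
    """Map (assembly_a, assembly_b) to mean_vertical_distance for primary results."""
    distances = {}
    for row in rows:
        if row["result_level"] != "primary":
            continue  # secondary results are ignored in this sketch
        key = (row["assembly_a"], row["assembly_b"])
        distances[key] = float(row["mean_vertical_distance"])
    return distances


# Made-up rows: one pair with a secondary result, and the reverse pair.
rows = [
    {"assembly_a": "s1", "assembly_b": "s2", "result_level": "primary",
     "mean_vertical_distance": "0.0011"},
    {"assembly_a": "s1", "assembly_b": "s2", "result_level": "secondary",
     "mean_vertical_distance": "0.0190"},
    {"assembly_a": "s2", "assembly_b": "s1", "result_level": "primary",
     "mean_vertical_distance": "0.0012"},
]
distances = vertical_distances(rows)
print(distances)
```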

Columns 21–26 give an overview of the contig painting for each assembly:

  21. assembly_a_vertical_fraction: the fraction of assembly A painted vertical (expressed as a percentage).
  22. assembly_a_horizontal_fraction: the fraction of assembly A painted horizontal (expressed as a percentage).
  23. assembly_a_unaligned_fraction: the fraction of assembly A painted unaligned (expressed as a percentage).
  24. assembly_b_vertical_fraction: the fraction of assembly B painted vertical (expressed as a percentage).
  25. assembly_b_horizontal_fraction: the fraction of assembly B painted horizontal (expressed as a percentage).
  26. assembly_b_unaligned_fraction: the fraction of assembly B painted unaligned (expressed as a percentage).
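The three fractions for an assembly can be read together as a quick summary. The sketch below makes two assumptions that this page does not confirm: that a value may or may not carry a trailing `%` sign, and that vertical, horizontal and unaligned together cover the whole assembly (so the total should be close to 100). The helper names and values are made up.

```python
def parse_percent(field):
    """Parse a percentage field, tolerating an optional trailing '%' sign."""
    return float(field.strip().rstrip("%"))


def painting_summary(row, assembly):
    """Collect the three painting percentages for assembly 'a' or 'b'."""
    kinds = ("vertical", "horizontal", "unaligned")
    summary = {k: parse_percent(row[f"assembly_{assembly}_{k}_fraction"]) for k in kinds}
    # Assumption: the three categories partition the assembly.
    summary["total"] = sum(summary[k] for k in kinds)
    return summary


# Made-up row containing only the assembly A columns.
row = {
    "assembly_a_vertical_fraction": "70.5%",
    "assembly_a_horizontal_fraction": "20.0%",
    "assembly_a_unaligned_fraction": "9.5%",
}
summary = painting_summary(row, "a")
print(summary)
```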

Columns 27–32 describe the contig painting in detail for each assembly. These columns can be quite lengthy, especially when the assemblies contain many contigs:

  27. assembly_a_vertical_regions: regions of assembly A painted vertical (expressed as a comma-delimited list of contig:start-end).
  28. assembly_a_horizontal_regions: regions of assembly A painted horizontal (expressed as a comma-delimited list of contig:start-end).
  29. assembly_a_unaligned_regions: regions of assembly A painted unaligned (expressed as a comma-delimited list of contig:start-end).
  30. assembly_b_vertical_regions: regions of assembly B painted vertical (expressed as a comma-delimited list of contig:start-end).
  31. assembly_b_horizontal_regions: regions of assembly B painted horizontal (expressed as a comma-delimited list of contig:start-end).
  32. assembly_b_unaligned_regions: regions of assembly B painted unaligned (expressed as a comma-delimited list of contig:start-end).
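A region list in the contig:start-end format can be parsed along these lines. This is a sketch under some assumptions: `parse_regions` is a hypothetical helper, contig names are assumed not to contain commas, and the coordinate convention (0-based or 1-based, inclusive or exclusive end) is not stated on this page, so the numbers are returned unchanged.

```python
def parse_regions(field):
    """Parse a comma-delimited list of contig:start-end into tuples."""
    regions = []
    for item in field.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty field or a trailing comma
        # Split on the last colon so contig names containing ':' still work.
        contig, _, span = item.rpartition(":")
        start, _, end = span.partition("-")
        regions.append((contig, int(start), int(end)))
    return regions


# Made-up example with two regions on different contigs.
regions = parse_regions("contig_1:0-1500,contig_2:200-900")
print(regions)
```

When reading these columns with Python's `csv` module, very long fields can exceed the module's default field size limit, which `csv.field_size_limit` can raise.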