runlength_analysis_cpp

This library was built to replace rlorigro/runlength_analysis. Given raw sequences, or an assembly, this library will generate a confusion matrix of homopolymer length predictions. Some evaluation tools are included to aid in the development of runlength related methods.

Installation

Download repo and run library installation script:

git clone https://github.com/rlorigro/runlength_analysis_cpp.git
cd runlength_analysis_cpp/
sudo sh install_prerequisites_ubuntu_18.sh

If minimap2 not installed and added to path:

cd your/preferred/installation/directory
sudo sh path/to/runlength_analysis_cpp/install_minimap2.sh

If cmake is not installed:

wget https://github.com/Kitware/CMake/releases/download/v3.14.4/cmake-3.14.4-Linux-x86_64.sh
sudo mkdir /opt/cmake 
sudo sh cmake-3.14.4-Linux-x86_64.sh --prefix=/opt/cmake --skip-license 
sudo ln -s /opt/cmake/bin/cmake /usr/local/bin/cmake
cmake --version

Finally:

mkdir build
cd build
cmake ..
make -j [n_threads]

Usage

Generating a Shasta config file

Acquire a FASTA file containing read sequences.

Run measure_runlength_distribution_from_fasta using the executable in your build directory:

/home/ubuntu/software/runlength_analysis_cpp/build/measure_runlength_distribution_from_fasta \
--minimap_preset map-ont \
--max_threads 92 \
--minimum_match_length 11 \
--output_dir /home/ubuntu/data/human/hg005/reads/rle_run1/ \
--ref /home/ubuntu/data/human/reference/hg005/HG005.paternal.f1_assembly_v2_genbank.fasta \
--sequences /home/ubuntu/data/human/hg005/reads/05_25_21_R941_GM24631_9X_2_Guppy_5.0.7_prom_sup.fasta

To avoid regions of the alignment that contain misalignments as a result of short repeats (dinucleotide or others), it is highly recommended that the minimum_match_length parameter is used. See below in the "x_repeat" channel for an example of misalignments in RLE-space:

For ONT data, be sure to use the map-ont preset, or for assemblies the asm20 preset, etc.

Once this finishes running, a CSV will be generated containing confusion matrices for each nucleotide's runlength from 1-50bp. This is used to generate a table of normalized likelihoods (shasta's Bayesian ConsensusCaller config file). For example:

../scripts/convert_frequency_matrix_to_shasta_config.py \
--prior human \
--name HG005_wg_guppy_5.0.7_with_10_pseudocount_4-6-2022 \
--input /home/ryan/data/human/hg005/rle/guppy5/run2/length_frequency_matrix_nondirectional.csv \
--output /home/ryan/data/human/hg005/rle/guppy5/run2/shasta_bayesian_config_nondirectional_zero-fix/ \
--pseudocount 10

This generates an output directory and in it is a CSV which can be used directly by Shasta.

You can also edit the plot_matrix script to generate comparisons of frequency matrices for debugging/QC purposes:

Generating a MarginPolish RLE Matrix

To generate a MarginPolish RLE Matrix you need to use the measure_runlength_distribution_from_fasta exectuable and the convert_frequency_matrix_to_marginpolish.py script from this repository, and a reference and .fasta (not .fastq) file with your reads.

From the build/ directory:

./measure_runlength_distribution_from_fasta --ref reference.fasta --sequences reads.fasta
python ../scripts/convert_frequency_matrix_to_marginpolish.py -i output/length_frequency_matrix_directional.csv -o output/marginPolish

This will create a file in output/marginPolish. The contents of this file can be placed into the "repeatCountSubstitutionMatrix" node of the polish parameters.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

runlength_analysis_cpp

Installation

Download repo and run library installation script:

If minimap2 not installed and added to path:

If cmake is not installed:

Finally:

Usage

Generating a Shasta config file

Generating a MarginPolish RLE Matrix

Files

README.md

Latest commit

History

README.md

File metadata and controls

runlength_analysis_cpp

Installation

Download repo and run library installation script:

If minimap2 not installed and added to path:

If cmake is not installed:

Finally:

Usage

Generating a Shasta config file

Generating a MarginPolish RLE Matrix