polyAudit
All-in-one poly(A) tail analyses for long-read RNA sequencing data
Documentation
The entire pipeline is contained within polyAudit.py
and can be run in a single command. Default options will run the -polyA
pipeline on the input file and write results to the output directory. Test input files can be found in /test
.
Setup the polyAudit conda environment by running conda env create -f environment.yml
and activate with conda activate polyAudit
.
usage: python polyAudit.py [-h] [-p PRIMER_FILE] [-polyA] [-l LENGTH_FILE] [-k K] [-n N]
[-PAS PAS_FILE] [-pssm PSSM]
input_file output_directory
Measure poly(A) tails, count k-mers, find the putative PAS, and generate PSSMs a given sequence file.
positional arguments:
input_file A FASTA file with long-read sequencing data that hasn't
had poly(A) tails trimmed.
output_directory A path to the output directory.
optional arguments:
-h, --help show this help message and exit
-p PRIMER_FILE, --primer_file PRIMER_FILE
A file with a single primer sequences per line. NOTE:
it is recommended that the input sequences are trimmed
for adapters and primers prior to running this
program. I do not guarantee a desirable or predictable
outcome with the -p option.
-polyA, --polyA Runs the algorithm to measure poly(A) tails in a FASTA
file. Requires -p if sequences contain a 3' adapter or
primer. Will include poly(A) tail measurement, k-mer
counting in the 50 nt upstream of of the tail,
identification of a putative PAS, and generation of a
PSSM for the entire input file.
-l LENGTH_FILE, --length_file LENGTH_FILE
A file with an isoform ID and poly(A) tail length on
each line to be used for k-mer counting and PAS
identification. This file can be generated with
-polyA.
-k K, --k K An integer, the kmer length (defaults to 6).
-n N, --n N An integer, the number of k-mers to pipe from the
k-mer search to the PAS search (defaults to 10).
-PAS PAS_FILE, --pas_file PAS_FILE
Identify the most likely PAS; requires a file with one
k-mer per line. NOTE: the algorithm will search for
each k-mer one at a time and stop the search once it
has found one. Thus, it is HIGHLY recommended that
this file is ordered in descending PAS/k-mer usage.
-pssm PSSM, --pssm PSSM
Generate a position-specific score matrix for the
region flanking the PAS. Requires a CSV file
(generated by -polyA) with column names Isoform, 3'
Sequence, kmer, and Count.
Expected Output
Output files:
Command: python polyAudit.py -p test/primers.txt -polyA test/sample.fasta test/output
sample_trimmed.fasta
: A FASTA files with sequences that have had 3' adapters/primers removed. NOTE: There are better tools to perform this trimming, and I recommend using those to first trim your sequences and then use polyAudit
without the -p
option.
sample_polya.csv
: A CSV file (without column names) that includes FASTA headers from sample.fasta
in column one and the measured poly(A) tail length in column two.
sample_6mer.csv
: A CSV containing all possible k-mers of length -k
with the measured abundance in the 50 nt upstream of the poly(A) tails in all FASTA entries in sample.fasta
.
sample_PAS.csv
: A CSV file with FASTA headers from sample.fasta
, its predicted PAS, the 15 nt downstream and 50 nt upstream of the 5' end of the poly(A) site, and the PAS index (i.e, distance of the PAS upstream of the poly(A) tail).
sample_PAS_clean.fasta
: A FASTA file of all entries from sample.fasta
clipped to include the 15 nt downstream and 50 nt upstream of the 5' end of the poly(A) site.
sample_pssm.txt
: A position-specific score matrix (PSSM) generated from sample_PAS_clean.fasta
.
Alternative command:
A PSSM for the entire input_file
is not that useful, so I recommend choosing subsets of sequences for PSSM generation (i.e., all transcripts that utilize a given PAS). In this case, you can bypass most functions and go directly to -pssm
, for example:
python polyAudit.py -pssm test/AATAAA.csv test/sample.fasta test/output
will result in:
AATAAA_pssm.txt
and AATAAA_PAS_clean.fasta
Expected Data
I've also included an example R script for data analysis. Note that this script is NOT intended for production and should only be used as a guide. Following are potential plots that can be created using the data generated by this script.
(A) Histogram of poly(A) tail lengths with the median length drawn in red
(B) Top ten 6-mers in the 50 nt upstream of the 5' end of the poly(A)
(C) Location of given 6-mers in the 50 nt upstream of the 5' end of the poly(A)
(D) Nucleotide frequencies in the 50 nt upstream of the 5' end of the poly(A), faceted by PAS