Skip to content

zamanianlab/polyAudit

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 

polyAudit

All-in-one poly(A) tail analyses for long-read RNA sequencing data

Documentation

The entire pipeline is contained within polyAudit.py and can be run in a single command. Default options will run the -polyA pipeline on the input file and write results to the output directory. Test input files can be found in /test.

Setup the polyAudit conda environment by running conda env create -f environment.yml and activate with conda activate polyAudit.

usage: python polyAudit.py [-h] [-p PRIMER_FILE] [-polyA] [-l LENGTH_FILE] [-k K] [-n N]
                [-PAS PAS_FILE] [-pssm PSSM]
                input_file output_directory

Measure poly(A) tails, count k-mers, find the putative PAS, and generate PSSMs a given sequence file.

positional arguments:
  input_file            A FASTA file with long-read sequencing data that hasn't
                        had poly(A) tails trimmed.
  output_directory      A path to the output directory.

optional arguments:
  -h, --help            show this help message and exit
  -p PRIMER_FILE, --primer_file PRIMER_FILE
                        A file with a single primer sequences per line. NOTE:
                        it is recommended that the input sequences are trimmed
                        for adapters and primers prior to running this
                        program. I do not guarantee a desirable or predictable
                        outcome with the -p option.
  -polyA, --polyA       Runs the algorithm to measure poly(A) tails in a FASTA
                        file. Requires -p if sequences contain a 3' adapter or
                        primer. Will include poly(A) tail measurement, k-mer
                        counting in the 50 nt upstream of of the tail,
                        identification of a putative PAS, and generation of a
                        PSSM for the entire input file.
  -l LENGTH_FILE, --length_file LENGTH_FILE
                        A file with an isoform ID and poly(A) tail length on
                        each line to be used for k-mer counting and PAS
                        identification. This file can be generated with
                        -polyA.
  -k K, --k K           An integer, the kmer length (defaults to 6).
  -n N, --n N           An integer, the number of k-mers to pipe from the
                        k-mer search to the PAS search (defaults to 10).
  -PAS PAS_FILE, --pas_file PAS_FILE
                        Identify the most likely PAS; requires a file with one
                        k-mer per line. NOTE: the algorithm will search for
                        each k-mer one at a time and stop the search once it
                        has found one. Thus, it is HIGHLY recommended that
                        this file is ordered in descending PAS/k-mer usage.
  -pssm PSSM, --pssm PSSM
                        Generate a position-specific score matrix for the
                        region flanking the PAS. Requires a CSV file
                        (generated by -polyA) with column names Isoform, 3'
                        Sequence, kmer, and Count.

Expected Output

Output files:

Command: python polyAudit.py -p test/primers.txt -polyA test/sample.fasta test/output

sample_trimmed.fasta: A FASTA files with sequences that have had 3' adapters/primers removed. NOTE: There are better tools to perform this trimming, and I recommend using those to first trim your sequences and then use polyAudit without the -p option.

sample_polya.csv: A CSV file (without column names) that includes FASTA headers from sample.fasta in column one and the measured poly(A) tail length in column two.

sample_6mer.csv: A CSV containing all possible k-mers of length -k with the measured abundance in the 50 nt upstream of the poly(A) tails in all FASTA entries in sample.fasta.

sample_PAS.csv: A CSV file with FASTA headers from sample.fasta, its predicted PAS, the 15 nt downstream and 50 nt upstream of the 5' end of the poly(A) site, and the PAS index (i.e, distance of the PAS upstream of the poly(A) tail).

sample_PAS_clean.fasta: A FASTA file of all entries from sample.fasta clipped to include the 15 nt downstream and 50 nt upstream of the 5' end of the poly(A) site.

sample_pssm.txt: A position-specific score matrix (PSSM) generated from sample_PAS_clean.fasta.

Alternative command:

A PSSM for the entire input_file is not that useful, so I recommend choosing subsets of sequences for PSSM generation (i.e., all transcripts that utilize a given PAS). In this case, you can bypass most functions and go directly to -pssm, for example:

python polyAudit.py -pssm test/AATAAA.csv test/sample.fasta test/output

will result in:

AATAAA_pssm.txt and AATAAA_PAS_clean.fasta

Expected Data

I've also included an example R script for data analysis. Note that this script is NOT intended for production and should only be used as a guide. Following are potential plots that can be created using the data generated by this script.

poly(A) Plot

(A) Histogram of poly(A) tail lengths with the median length drawn in red
(B) Top ten 6-mers in the 50 nt upstream of the 5' end of the poly(A)
(C) Location of given 6-mers in the 50 nt upstream of the 5' end of the poly(A)
(D) Nucleotide frequencies in the 50 nt upstream of the 5' end of the poly(A), faceted by PAS

About

All-in-one poly(A) tail analyses for long-read RNA sequencing data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages