Set of tools for processing genomic and bioinformatic data, implemented in Python 2
Package containing utility data structures and functions for handling generic DNA data
Useful data structures and functions for handling DNA data
It supports the following data structures:
- dna_standard_alphabet ~ standard DNA text characters: "ACGT"
- complement ~ dictionary mapping each DNA character to its complement, used for the reverse complement
It supports the following functions:
- generate_random_seq(num_in_seq) ~ function to get back a random seq made from the dna_standard_alphabet
- reverse_complement(dna_seq) ~ given a DNA sequence, return the reverse complement of that sequence
- get_random_reads(genome, num_reads, read_len) ~ returns a list containing num_reads of random "reads" from the genome of read_len
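A minimal sketch of how the structures and functions listed above could fit together. The contents of `dna_standard_alphabet` and `complement` and the function bodies are assumptions based only on the descriptions in this README (for example, that `num_in_seq` is the sequence length), not the package's actual code.

```python
import random

# Assumed contents, inferred from the descriptions above.
dna_standard_alphabet = "ACGT"
complement = {"A": "T", "C": "G", "G": "C", "T": "A"}


def generate_random_seq(num_in_seq):
    # Draw num_in_seq characters uniformly from the alphabet.
    return "".join(random.choice(dna_standard_alphabet) for _ in range(num_in_seq))


def reverse_complement(dna_seq):
    # Walk the sequence backwards, swapping each base for its complement.
    return "".join(complement[base] for base in reversed(dna_seq))


def get_random_reads(genome, num_reads, read_len):
    # Pick num_reads random start offsets and slice read_len characters from each.
    reads = []
    for _ in range(num_reads):
        start = random.randint(0, len(genome) - read_len)
        reads.append(genome[start:start + read_len])
    return reads
```

For example, `reverse_complement("ACCGT")` gives `"ACGGT"`.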
Package containing utility data structures and functions for handling fasta formatted data
Useful functions for fasta files with a single genome
It supports the following functions:
- read_genome(file_name) ~ reads an entire genome from a fasta file given the fasta file name
Useful functions for handling fasta files with multiple reads in them
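As an illustration of reading a single-genome fasta file, the sketch below skips header lines (those starting with `>`) and concatenates the rest. `parse_fasta_lines` is a hypothetical helper added here so the parsing can be tried without a file on disk; the real `read_genome` may be structured differently.

```python
def parse_fasta_lines(lines):
    # Hypothetical helper: join every non-header, non-empty line into one string.
    genome_parts = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            genome_parts.append(line)
    return "".join(genome_parts)


def read_genome(file_name):
    # Open the fasta file and hand its lines to the parser above.
    with open(file_name, "r") as fasta_file:
        return parse_fasta_lines(fasta_file)
```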
It supports the following data structures:
- CodonInfo() containing the following:
  - codons ~ list of all codons extracted from a given single-stranded DNA sequence
  - open_reading_frames ~ list of sequences starting with ATG and ending with TAA, TAG, or TGA
- FastaRecord() containing the following:
  - id ~ unique identifier
  - header
  - sequence (combined)
  - frame_1_codon_info ~ list of all codons and ORFs starting at position 0 in the DNA sequence
  - frame_2_codon_info ~ list of all codons and ORFs starting at position 1 in the DNA sequence
  - frame_3_codon_info ~ list of all codons and ORFs starting at position 2 in the DNA sequence
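The three reading frames described above can be sketched as follows. The helper names `get_codons` and `get_open_reading_frames` are hypothetical, and the ORF rule used here (start at the first ATG, stop at the next in-frame stop codon, then look for the next ATG) is one reasonable reading of the description, not necessarily what `CodonInfo()` does.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def get_codons(dna_seq, frame_offset):
    # Split the sequence into complete triplets starting at frame_offset (0, 1, or 2).
    return [dna_seq[i:i + 3] for i in range(frame_offset, len(dna_seq) - 2, 3)]


def get_open_reading_frames(codons):
    # Collect codon runs that begin with ATG and end at the next stop codon.
    orfs = []
    current = None
    for codon in codons:
        if current is None:
            if codon == "ATG":
                current = [codon]
        else:
            current.append(codon)
            if codon in STOP_CODONS:
                orfs.append("".join(current))
                current = None
    return orfs
```

With the sequence `"ATGAAATAGC"`, frame 1 (offset 0) yields the codons `ATG`, `AAA`, `TAG` and a single ORF, while frame 2 (offset 1) yields `TGA`, `AAT`, `AGC` and no ORF.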
It supports the following functions:
- get a count of the number of valid and potentially erroneous records
- get a list containing just the record headers
- get a dictionary containing a list of FastaRecord() objects pulled from the file keyed on the id in the FastaRecord()
Package containing utility data structures and functions for handling fastq formatted data
Useful data structures and functions for handling fastq data
It supports the following functions:
- q_to_phred_33(Q) ~ Turn Q into Phred+33 ASCII-encoded quality
- phread_33_to_q(qual) ~ Turn Phred+33 ASCII-encoded quality into Q
- read_fastq(file_name) ~ given a fastq file, return a list of reads and their corresponding qualities
- matplotlib ~ external dependency, used for graphing
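The Phred+33 conversions are simple offsets on the ASCII code, and a fastq record is four lines (header, sequence, `+` separator, qualities). The sketch below shows both under those assumptions; `parse_fastq_lines` is a hypothetical stand-in for the parsing inside `read_fastq`, and the function names follow the list above.

```python
def q_to_phred_33(Q):
    # Quality value -> single ASCII character, offset by 33.
    return chr(Q + 33)


def phread_33_to_q(qual):
    # Single ASCII character -> quality value.
    return ord(qual) - 33


def parse_fastq_lines(lines):
    # Hypothetical helper: consume four lines per record and keep
    # the sequence line and the quality line.
    reads, qualities = [], []
    line_iter = iter(lines)
    for header in line_iter:
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = next(line_iter).strip()
        next(line_iter)  # skip the '+' separator line
        quality = next(line_iter).strip()
        reads.append(sequence)
        qualities.append(quality)
    return reads, qualities
```

For instance, a quality of 40 encodes as `"I"`, and `"#"` decodes to 2.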
Package containing utility data structures and functions and tests for aligning reads to sequences
Utility data structures and functions for aligning reads to sequences
It supports the following data structures:
- global_alignment_score ~ 2x2 penalty matrix with scores for alignment mismatches, used for global alignment with dynamic programming: transitions (A<->G, C<->T) cost 2, transversions cost 4, and gaps cost 8
- KMerIndex ~ class used for calculating and querying a k-mer index on a given sequence
- BoyerMoore ~ class used to do Boyer-Moore pre-processing on a given read before doing alignment
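To make the k-mer index idea concrete, here is a small sketch that stores sorted `(k-mer, offset)` pairs and finds candidates with a binary search, then verifies the rest of the read. The class name `KMerIndexSketch`, its `query` method, and the function name `query_k_mer_index_sketch` are illustrative assumptions; the real `KMerIndex` interface may differ.

```python
import bisect


class KMerIndexSketch(object):
    def __init__(self, sequence, k):
        # Record every k-mer with its offset, sorted so bisect can search it.
        self.k = k
        self.index = sorted(
            (sequence[i:i + k], i) for i in range(len(sequence) - k + 1)
        )

    def query(self, read):
        # Return offsets where the first k characters of the read occur.
        kmer = read[:self.k]
        start = bisect.bisect_left(self.index, (kmer, -1))
        hits = []
        for entry_kmer, offset in self.index[start:]:
            if entry_kmer != kmer:
                break
            hits.append(offset)
        return hits


def query_k_mer_index_sketch(read, sequence, index):
    # Index hits only guarantee the first k characters; verify the remainder.
    k = index.k
    offsets = []
    for i in index.query(read):
        if sequence[i + k:i + len(read)] == read[k:]:
            offsets.append(i)
    return offsets
```

For the sequence `"GATTACAGATTACA"` indexed with `k=3`, looking up the read `"GATTACA"` returns the offsets `[0, 7]`.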
It supports the following functions:
- naive_exact(read, sequence) ~ returns a list of offsets where the read occurs in the sequence, as well as the number of matched and mismatched characters
- naive_exact_with_rc(read, sequence) ~ returns a list of offsets where the read occurs on either the forward OR reverse strand
- naive_exact_with_counts(read, sequence) ~ returns the same information as the standard naive_exact() function, but with two additional counts: num_alignments and num_characters_compared, as a measure of work done
- naive_mm_allowed(read, sequence, num_mm_allowed=2) ~ matches the read against the DNA sequence, returning a list of the occurrences (offsets from start of sequence); allows up to num_mm_allowed mismatches, with a default of 2
- boyer_moore(read, p_bm, sequence) ~ uses the Boyer-Moore algorithm to return matches; requires that you pre-process the read with the BoyerMoore() object, which is passed in as the second parameter
- boyer_moore_with_counts(read, p_pm, sequence) ~ returns the same information as the standard boyer_moore() function, but with two additional counts: num_alignments and num_characters_compared, as a measure of work done
- approximate_match_boyer_moore(read, sequence, num_allowed_edits) ~ approximate matching function that uses num_allowed_edits+1 segments and pigeon-hole matching, with Boyer-Moore for exact matching per segment
- query_k_mer_index(read, sequence, index) ~ searches a pre-indexed sequence stored in a KMerIndex object for the given read
- approximate_match_kmer_index(read, sequence, num_allowed_edits, kmer_index) ~ approximate matching function that uses num_allowed_edits+1 segments and pigeon-hole matching, with the kmer-index for exact matching per segment
- approximate_match_subsequence_index(read, sequence, num_allowed_edits, subsequence_index) ~ approximate matching function that uses num_allowed_edits+1 segments and pigeon-hole matching, with the subsequence-index for exact matching per segment
- get_hamming_distance(str_1, str_2) ~ returns the Hamming distance between two strings of equal length; if the strings are not of equal length, returns -1
- get_edit_distance_recursive(str_1, str_2) ~ returns the edit distance between two strings, implemented using a recursive technique (SLOW!!!)
- get_edit_distance_dynamic_programming(str_1, str_2) ~ returns the edit distance between two strings, implemented using a dynamic programming technique
- get_edit_distance_dynamic_programming_global_alignment(str_1, str_2) ~ returns the global edit distance between two strings using a scoring matrix
- get_edit_distance_dynamic_programming_approximate(str_1, str_2) ~ returns the approximate edit distance, implemented using dynamic programming
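Three of the functions above are short enough to sketch directly. The bodies below follow the textbook definitions (mismatch counting, and the standard edit-distance table with unit costs) and reuse the names and signatures from the list; they are assumptions about behaviour, not the package's source.

```python
def get_hamming_distance(str_1, str_2):
    # Count differing positions; -1 signals unequal lengths, as described above.
    if len(str_1) != len(str_2):
        return -1
    return sum(1 for a, b in zip(str_1, str_2) if a != b)


def get_edit_distance_dynamic_programming(str_1, str_2):
    # table[i][j] holds the edit distance between str_1[:i] and str_2[:j].
    rows, cols = len(str_1) + 1, len(str_2) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = table[i - 1][j - 1] + (0 if str_1[i - 1] == str_2[j - 1] else 1)
            table[i][j] = min(substitution, table[i - 1][j] + 1, table[i][j - 1] + 1)
    return table[-1][-1]


def naive_mm_allowed(read, sequence, num_mm_allowed=2):
    # Try every alignment; give up on one as soon as it exceeds the mismatch budget.
    occurrences = []
    for i in range(len(sequence) - len(read) + 1):
        mismatches = 0
        for j in range(len(read)):
            if sequence[i + j] != read[j]:
                mismatches += 1
                if mismatches > num_mm_allowed:
                    break
        else:
            occurrences.append(i)
    return occurrences
```

As a quick check, the edit distance between `"kitten"` and `"sitting"` is 3, and `naive_mm_allowed("ACT", "ACGACTAAT", 1)` returns `[0, 3, 6]`.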
# assembly
Package containing utility data structures and functions for dealing with genomic assembly
It supports the following functions:
- overlap(str_1, str_2, min_overlap_length=3) ~ returns the number of characters by which a suffix of str_1 overlaps a prefix of str_2, with at LEAST min_overlap_length characters matching. Default of 3.
- naive_overlap_map(reads, min_overlap_length) ~ returns a map of overlapping reads using a simple naive suffix -> prefix overlapping structure with the nodes being the reads and the edges containing the number of characters that overlapped
- overlap_all_pairs(reads, k) ~ given a list of reads and a kmer value of k, returns a list of tuples, one for each pair of reads where a suffix of one exactly matches a prefix of the other
- shortest_common_super_string(set_of_strings) ~ returns the shortest common super string by trying every ordering of the strings. SLOW!!!! N! run time =*( <--- saddest panda
- shortest_common_super_string_list(set_of_strings) ~ SLOW!! returns a list of all the possible shortest common super strings
- pick_maximal_overlap(reads, k) ~ Finds the two reads with the maximal overlap, combines them and returns both reads and the overlap length
- greedy_shortest_common_super_string(reads, k) ~ relies on the pick_maximal_overlap() function, returns the shortest common super string in a "greedy fashion", will need to find optimal value of k
- bisect ~ standard library module, used for bisect left (binary search) in the k-mer and subsequence index implementations
- permutations ~ from itertools, used for naive_overlap_map and shortest_common_super_string
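The suffix-to-prefix overlap and the brute-force super string search can be sketched as below. The bodies are assumptions built from the descriptions above; in particular, the use of `min_overlap_length=1` inside `shortest_common_super_string` and the tuple-keyed dictionary returned by `naive_overlap_map` are illustrative choices.

```python
from itertools import permutations


def overlap(str_1, str_2, min_overlap_length=3):
    # Look for str_2's first min_overlap_length characters inside str_1,
    # then check whether the rest of str_1 from there is a prefix of str_2.
    start = 0
    while True:
        start = str_1.find(str_2[:min_overlap_length], start)
        if start == -1:
            return 0
        if str_2.startswith(str_1[start:]):
            return len(str_1) - start
        start += 1


def naive_overlap_map(reads, min_overlap_length):
    # Edges are (read_a, read_b) pairs; the value is the overlap length.
    overlaps = {}
    for read_a, read_b in permutations(reads, 2):
        overlap_length = overlap(read_a, read_b, min_overlap_length)
        if overlap_length > 0:
            overlaps[(read_a, read_b)] = overlap_length
    return overlaps


def shortest_common_super_string(set_of_strings):
    # Try all N! orderings, gluing neighbours together at their overlap.
    shortest = None
    for ordering in permutations(set_of_strings):
        super_string = ordering[0]
        for i in range(len(ordering) - 1):
            overlap_length = overlap(ordering[i], ordering[i + 1], min_overlap_length=1)
            super_string += ordering[i + 1][overlap_length:]
        if shortest is None or len(super_string) < len(shortest):
            shortest = super_string
    return shortest
```

For example, `overlap("TTACGT", "CGTGTGC")` returns 3, and the shortest common super string of `"ACGGATGAGC"`, `"GAGCGGA"`, and `"GAGCGAG"` is `"ACGGATGAGCGAGCGGA"`. The factorial number of orderings is why the brute-force version is only practical for a handful of strings.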