# Using bioinformatics algorithms in this repository

`reverse_complement.py` returns the reverse complement of a DNA k-mer. 

**Example:** k-mer = ACTAAG<br>
         reverse k-mer = GAATCA<br>
         reverse complement k-mer = CTTAGT<br>

The file to be read consists of a single DNA k-mer. If no file is specified, `text_files/Vibrio_cholerae.txt` is used by default.
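
The example above can be reproduced with a few lines of Python. This is a minimal sketch, not the script's actual code: the function name `reverse_complement` and the constant `COMPLEMENT` are illustrative, and the input is assumed to contain only the uppercase letters A, C, G and T.

```python
# Hypothetical helper; the code in reverse_complement.py may differ.
# Translation table that maps each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the resulting string."""
    return kmer.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACTAAG"))  # CTTAGT
```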

`count_kmer_occurrences.py` counts the number of occurrences of a k-mer in a given genome sequence. Overlapping occurrences are counted separately.

The file to be read consists of 2 lines:<br>
Line 1: genome sequence<br>
Line 2: k-mer<br>

**Sample input:**  <br>
GCGCG<br>
GCG

**Output:** 2
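
The sample only gives 2 if overlapping matches are counted, which a sliding window does naturally. The sketch below assumes that behaviour; the function name `count_kmer_occurrences` is illustrative and not necessarily what the script uses.

```python
# Hypothetical helper; the code in count_kmer_occurrences.py may differ.
def count_kmer_occurrences(genome: str, kmer: str) -> int:
    """Count occurrences of kmer in genome, including overlapping ones."""
    k = len(kmer)
    # Compare the k-mer against every window of length k.
    return sum(genome[i:i + k] == kmer for i in range(len(genome) - k + 1))


print(count_kmer_occurrences("GCGCG", "GCG"))  # 2
```

Note that Python's built-in `"GCGCG".count("GCG")` returns 1, because `str.count` only counts non-overlapping matches.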


`frequent_kmers.py` finds the patterns of length *k* (k-mers) that occur most frequently in a genome.

The file to be read consists of 2 lines:<br>
Line 1: Nucleotide sequence<br>
Line 2: length k of k-mer<br>

**Sample input:** <br>
ACGTTGCATGTCGCATGATGCATGAGAGCT<br>
4<br>

**Output:**
CATG GCAT
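
One way to get this result is to count every window of length *k* and keep the k-mers with the highest count. The sketch below assumes that approach; `frequent_kmers` is an illustrative name, and the sorted output order is a choice made here, not something the script guarantees.

```python
from collections import Counter


# Hypothetical helper; the code in frequent_kmers.py may differ.
def frequent_kmers(text: str, k: int) -> list[str]:
    """Return the most frequent k-mers in text, sorted alphabetically."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return []
    best = max(counts.values())
    return sorted(kmer for kmer, n in counts.items() if n == best)


print(*frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))  # CATG GCAT
```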

`kmer_genome_match.py` lists the 0-based indices in a genome at which a given k-mer starts. Overlapping matches are included.

The file to be read consists of 2 lines:<br>
Line 1: k-mer, e.g. ACT<br>
Line 2: genome sequence<br>

If no file is specified, `text_files/Kmer_genome_match.txt` is used by default.

**Sample input:**<br>
ATAT<br>
GATATATGCATATACTT<br>

**Output:** 1 3 9
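
The sample output corresponds to 0-based start positions, with overlapping matches included. A minimal sketch under those assumptions (the name `kmer_positions` is illustrative):

```python
# Hypothetical helper; the code in kmer_genome_match.py may differ.
def kmer_positions(kmer: str, genome: str) -> list[int]:
    """Return the 0-based start index of every occurrence of kmer in genome."""
    k = len(kmer)
    return [i for i in range(len(genome) - k + 1) if genome[i:i + k] == kmer]


print(*kmer_positions("ATAT", "GATATATGCATATACTT"))  # 1 3 9
```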

`find_kmer_clumps.py` finds all k-mers that form clumps, i.e. that occur at least *t* times within some window of length *L*.

The file to be read consists of 2 lines:<br>
Line 1: genome sequence<br>
Line 2: k L t<br>

If no file is specified, the file `text_files/E-coli_kmer_clumps.txt` is used for calculation. 
It contains the genome of *E. coli*.

**Sample input:** <br>
CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA<br>
5 50 4<br>

**Output:** CGACA GAAGA
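
A compact way to detect clumps is to record the start positions of every k-mer and check whether any *t* consecutive occurrences fit inside a window of length *L*. The sketch below assumes that "occur t times" means "at least t times" and that an occurrence must lie entirely inside the window. The name `find_clumps` is illustrative, and this is not necessarily how the script is implemented.

```python
from collections import defaultdict


# Hypothetical helper; the code in find_kmer_clumps.py may differ.
def find_clumps(genome: str, k: int, L: int, t: int) -> list[str]:
    """Return k-mers occurring at least t times in some window of length L."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)

    clumps = set()
    for kmer, starts in positions.items():
        # t consecutive occurrences fit in a window if the span from the
        # first start to the end of the last occurrence is at most L.
        for j in range(len(starts) - t + 1):
            if starts[j + t - 1] + k - starts[j] <= L:
                clumps.add(kmer)
                break
    return sorted(clumps)


genome = ("CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAG"
          "TGAAGAGAAGAGGAAACATTGTAA")
print(*find_clumps(genome, 5, 50, 4))  # CGACA GAAGA
```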

`minimum_skew_problem.py` finds the position(s) in a genome at which the skew, i.e. the number of guanines minus the number of cytosines seen so far, is minimal. This minimum could mark the origin of replication of the genome. 

The file to be read consists of a single line:<br>
Line 1: genome sequence<br>

If no file is specified, the file `text_files/Vibrio_cholerae.txt` is used for calculation. 
It contains the genome of the bacterium that causes cholera.

NOTE: Positions are 1-based, not 0-based. 

**Sample input:**<br>
TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT<br>

**Output:** 11 24
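
The skew can be tracked with a single running counter. In the sketch below, position *i* refers to the skew after reading the first *i* bases, which matches the 1-based output above. Ignoring the empty prefix (position 0) is an assumption made here, and `minimum_skew_positions` is an illustrative name.

```python
# Hypothetical helper; the code in minimum_skew_problem.py may differ.
def minimum_skew_positions(genome: str) -> list[int]:
    """Return the 1-based positions where (#G - #C) of the prefix is minimal."""
    skew = 0
    best = None
    positions = []
    for i, base in enumerate(genome, start=1):
        # +1 for G, -1 for C, 0 for A and T.
        skew += (base == "G") - (base == "C")
        if best is None or skew < best:
            best, positions = skew, [i]
        elif skew == best:
            positions.append(i)
    return positions


genome = "TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"
print(*minimum_skew_positions(genome))  # 11 24
```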

`hamming_distance.py` calculates the Hamming distance between two DNA strands. 

The file to be read consists of two lines:<br>
Line 1: DNA sequence 1<br>
Line 2: DNA sequence 2<br>

If no file is specified, the file `text_files/Hamming_distance.txt` is used for calculation. 

**Sample input:**<br>
GGGCCGTTGGT<br>
GGACCGTTGAC<br>

**Output:** 3
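
The Hamming distance is the number of positions at which two strands of equal length differ. A minimal sketch; the name `hamming_distance` and the choice to raise an error on unequal lengths are assumptions, not taken from the script:

```python
# Hypothetical helper; the code in hamming_distance.py may differ.
def hamming_distance(seq1: str, seq2: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


print(hamming_distance("GGGCCGTTGGT", "GGACCGTTGAC"))  # 3
```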

`approximate_pattern_matching.py` counts the positions in a genome at which a k-mer occurs with at most *d* mismatches (Hamming distance of at most *d*). The code also outputs the starting indices of these approximate matches. 

The file to be read consists of 3 lines:<br>
Line 1: k-mer, e.g. 'TGC'<br>
Line 2: DNA sequence<br>
Line 3: d (maximum allowed distance)<br>

If no file is specified, the file `text_files/Approx_matching.txt` is used for calculation.

**Sample input:**<br>
ATTCTGGA<br>
CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT<br>
3<br>

**Output:**<br>
4<br>
6 7 26 27<br>
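
Approximate matching combines the sliding window with a Hamming distance check. A minimal sketch, assuming 0-based indices as in the sample output (the name `approximate_matches` is illustrative):

```python
# Hypothetical helper; the code in approximate_pattern_matching.py may differ.
def approximate_matches(kmer: str, genome: str, d: int) -> list[int]:
    """Return 0-based starts of windows within Hamming distance d of kmer."""
    k = len(kmer)
    hits = []
    for i in range(len(genome) - k + 1):
        mismatches = sum(a != b for a, b in zip(kmer, genome[i:i + k]))
        if mismatches <= d:
            hits.append(i)
    return hits


genome = ("CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACG"
          "GTACGGACGTCAATCAAAT")
hits = approximate_matches("ATTCTGGA", genome, 3)
print(len(hits))  # 4
print(*hits)      # 6 7 26 27
```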


`frequent_kmers_with_mismatches.py` finds patterns of length k (k-mers) that occur most frequently in a genome. 
An occurrence may differ from the k-mer in up to d positions (Hamming distance of at most d).

The file to be read consists of 2 lines:<br>
Line 1: Nucleotide sequence<br>
Line 2: k (length k of k-mer) d (Hamming distance measure for level of mismatch)<br>

If no file is specified by the user, `text_files/Frequent_patterns_with_mismatch.txt` is used by default.

**Sample input:**<br>
GGCGGGCGTACGATGGATAGGGGATACGGAGGAGGCGGGAGATTACTACGGATACTACGGAGGCGGGCGGGCGTAGGGATTACTAGGGGAGGAGATGGAGATGGCGGATTACGGCGGGAGGCGGATTAGGTAGGTACGGAGATTACGGATACTAGGGATTACGGCGTAGGTAGGGATGATTAGGTACGATGGAGGATAGGGATTAGGTAGGTAGGGGCGTACGGCGGGATAGGGATTAGGGATTAGGGATGGCGGATGGATACGGCGTAGGGATGGAGATGGCGTAGGGGATAGGTACGGCGTAGGGGCGGATTAGGGGAGGCGGATTACGATGGCG<br>
5 3<br>

**Output:** GGGGG
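
For small *k*, the problem can be solved by brute force: score each of the 4^k possible k-mers by the number of windows within Hamming distance *d*. The sketch below uses that exhaustive approach for clarity; the script itself may use a different, faster method, and the name `frequent_kmers_with_mismatches` is illustrative. The usage example uses a shorter input than the sample above.

```python
from itertools import product


# Hypothetical brute-force helper; frequent_kmers_with_mismatches.py may
# use another technique (for example, generating neighbourhoods).
def frequent_kmers_with_mismatches(text: str, k: int, d: int) -> list[str]:
    """Return the k-mers with the most approximate occurrences in text."""
    windows = [text[i:i + k] for i in range(len(text) - k + 1)]
    best, winners = 0, []
    for letters in product("ACGT", repeat=k):
        candidate = "".join(letters)
        # Number of windows that differ from the candidate in <= d positions.
        count = sum(
            sum(a != b for a, b in zip(candidate, w)) <= d for w in windows
        )
        if count > best:
            best, winners = count, [candidate]
        elif count == best and count > 0:
            winners.append(candidate)
    return winners


print(*frequent_kmers_with_mismatches("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1))
# ATGC ATGT GATG
```

Note that a winning k-mer does not have to occur exactly in the genome; it only needs many approximate occurrences.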

`frequent_kmers_mismatch_reverse_complement.py` finds patterns of length k (k-mers) that occur most frequently in a genome. <br>
An occurrence may differ from the k-mer in up to d positions (Hamming distance of at most d). <br>
Reverse complements and their mismatches are included as well.<br>

The output consists of all k-mers with the highest sum of counts:<br>
max (count(pattern) + count(reverse_complement))<br>

The file to be read consists of 2 lines:<br>
Line 1: Nucleotide sequence<br>
Line 2: k (length k of k-mer) d (Hamming distance measure for level of mismatch)

**Sample input:**<br>
ACGTTGCATGTCGCATGATGCATGAGAGCT<br>
4 1<br>

**Output:** ATGT ACAT
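
The scoring rule above can be sketched by brute force over all 4^k candidate k-mers, adding the approximate count of each candidate to that of its reverse complement. This is an illustration under the assumption that *k* is small. The names `approx_count` and `frequent_kmers_mismatch_rc` are hypothetical, and the script may be implemented differently.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def approx_count(pattern: str, windows: list[str], d: int) -> int:
    """Count windows within Hamming distance d of pattern."""
    return sum(sum(a != b for a, b in zip(pattern, w)) <= d for w in windows)


# Hypothetical brute-force helper; the actual script may differ.
def frequent_kmers_mismatch_rc(text: str, k: int, d: int) -> list[str]:
    """Return k-mers maximizing count(pattern) + count(reverse complement)."""
    windows = [text[i:i + k] for i in range(len(text) - k + 1)]
    scores = {}
    for letters in product("ACGT", repeat=k):
        pattern = "".join(letters)
        rc = pattern.translate(COMPLEMENT)[::-1]
        scores[pattern] = (approx_count(pattern, windows, d)
                           + approx_count(rc, windows, d))
    best = max(scores.values())
    return [pattern for pattern, s in scores.items() if s == best]


print(*frequent_kmers_mismatch_rc("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1))
# ACAT ATGT
```

The result contains the same two k-mers as the sample output, listed in alphabetical order.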

`neighbours.py` generates the set of all k-mers whose Hamming distance from pattern does not exceed d.

The file to be read consists of 2 lines:<br>
Line 1: pattern<br>
Line 2: d<br>

If no file is specified, the default file `Neighbours.txt` is loaded.

**Sample input:**<br>
ACG<br>
1<br>

**Output:** CCG TCG GCG AAG ATG AGG ACA ACC ACT ACG
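
The neighbourhood can be built recursively: take the neighbours of the pattern's suffix, and prepend either any base (if the suffix neighbour still has mismatches to spare) or the pattern's own first base. The sketch below follows that idea; it is not necessarily the script's implementation, and the function names are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical helper; the code in neighbours.py may differ.
def neighbours(pattern: str, d: int) -> set[str]:
    """Return all k-mers within Hamming distance d of pattern."""
    if d == 0 or not pattern:
        return {pattern}
    if len(pattern) == 1:
        return set("ACGT")
    result = set()
    suffix = pattern[1:]
    for candidate in neighbours(suffix, d):
        if hamming(suffix, candidate) < d:
            # One mismatch is still available for the first position.
            result.update(base + candidate for base in "ACGT")
        else:
            # All d mismatches are used up, so keep the first base.
            result.add(pattern[0] + candidate)
    return result


print(len(neighbours("ACG", 1)))  # 10
```

A pattern of length 3 has 3 × 3 = 9 single-mismatch variants plus the pattern itself, which matches the 10 k-mers in the sample output. The order of a Python set is not defined, so the k-mers may be printed in a different order than shown above.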

`motif_enumeration.py` finds all (k,d)-motifs in a given set of DNA sequences. A (k,d)-motif is a k-mer that appears in every sequence with at most d mismatches. The hope is that these motifs correspond to regulatory motifs, i.e. binding sites that regulatory proteins use to control the expression of different genes (for example, genes with circadian activity). 

The file to be read consists of 2 lines:<br>
Line 1: k d<br>
Line 2: sequence1 sequence2 sequence3 ...<br>

**Sample Input:** <br>
3 1<br>
ATTTGGC TGCCTTA CGGTATC GAAAATT<br>

**Output:** ATA ATT GTT TTT
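
For small *k*, the motifs can be enumerated by testing every possible k-mer against every sequence. The sketch below takes this brute-force route, assuming a (k,d)-motif must appear in each sequence with at most *d* mismatches. The function names are illustrative and the script may work differently.

```python
from itertools import product


def appears_with_mismatches(kmer: str, sequence: str, d: int) -> bool:
    """True if some window of sequence is within Hamming distance d of kmer."""
    k = len(kmer)
    return any(
        sum(a != b for a, b in zip(kmer, sequence[i:i + k])) <= d
        for i in range(len(sequence) - k + 1)
    )


# Hypothetical brute-force helper; motif_enumeration.py may differ.
def motif_enumeration(sequences: list[str], k: int, d: int) -> list[str]:
    """Return all k-mers that appear in every sequence with <= d mismatches."""
    return [
        kmer
        for kmer in map("".join, product("ACGT", repeat=k))
        if all(appears_with_mismatches(kmer, s, d) for s in sequences)
    ]


sequences = "ATTTGGC TGCCTTA CGGTATC GAAAATT".split()
print(*motif_enumeration(sequences, 3, 1))  # ATA ATT GTT TTT
```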

`motif_entropy.py` calculates the entropy of a set of motifs.

The file to be read consists of 1 line:<br>
Line 1: motif1 motif2 motif3 ...<br>

If no file is specified, `Motif_entropy.txt` is loaded by default.<br>

**Sample input:**<br>
TCGGGGGTTTTT CCGGTGACTTAC ACGGGGATTTTC TTGGGGACTTTT AAGGGGACTTCC TTGGGGACTTCC TCGGGGATTCAT TCGGGGATTCCT TAGGGGAACTAC TCGGGTATAACC<br>

**Output:** 9.916290005356972
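
The sample output is consistent with summing the Shannon entropy (base 2) of the nucleotide distribution in each column of the motif matrix. The sketch below assumes that definition and that all motifs have the same length; `motif_entropy` is an illustrative name.

```python
from collections import Counter
from math import log2


# Hypothetical helper; the code in motif_entropy.py may differ.
def motif_entropy(motifs: list[str]) -> float:
    """Sum the per-column Shannon entropy (in bits) of aligned motifs."""
    n = len(motifs)
    total = 0.0
    for column in zip(*motifs):  # iterate over columns of the alignment
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


motifs = ("TCGGGGGTTTTT CCGGTGACTTAC ACGGGGATTTTC TTGGGGACTTTT AAGGGGACTTCC "
          "TTGGGGACTTCC TCGGGGATTCAT TCGGGGATTCCT TAGGGGAACTAC TCGGGTATAACC").split()
print(round(motif_entropy(motifs), 4))  # 9.9163
```

The last digits of the unrounded result can vary with the order of the floating-point additions.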

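`median_string_problem.py` finds a k-mer that minimizes distance(kmer, dna_strings) among all possible k-mers. Here, the distance is the sum, over all DNA strings, of the smallest Hamming distance between the k-mer and any k-mer of that string.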

The file to be read consists of 2 lines:<br>
Line 1: k<br>
Line 2: DNA strings separated by a space<br>

The script uses a brute-force method.<br>

**Sample input:**<br>
3<br>
AAATTGACGCAT GACGACCACGTT CGTCAGCGCCTG GCTGAGCACCGG AGTTCGGGACAG<br>

**Output:** GAC
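
The brute-force search can be sketched as follows: compute the total distance for each of the 4^k possible k-mers and keep the smallest. The names `distance` and `median_string` are illustrative, and breaking ties by taking the first candidate in alphabetical order is an assumption made here, not the script's documented behaviour.

```python
from itertools import product


def distance(kmer: str, dna_strings: list[str]) -> int:
    """Sum over all strings of the minimum Hamming distance to kmer."""
    k = len(kmer)
    return sum(
        min(
            sum(a != b for a, b in zip(kmer, s[i:i + k]))
            for i in range(len(s) - k + 1)
        )
        for s in dna_strings
    )


# Hypothetical brute-force helper; median_string_problem.py may differ.
def median_string(dna_strings: list[str], k: int) -> str:
    """Return a k-mer with minimal total distance to dna_strings."""
    candidates = map("".join, product("ACGT", repeat=k))
    return min(candidates, key=lambda kmer: distance(kmer, dna_strings))


dna = "AAATTGACGCAT GACGACCACGTT CGTCAGCGCCTG GCTGAGCACCGG AGTTCGGGACAG".split()
print(distance("GAC", dna))  # 2
print(median_string(dna, 3))
```

If several k-mers share the minimal distance, this sketch returns the alphabetically first one, which may differ from the k-mer the script reports.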