## Genetic code - translation

In this part, the objective is to **translate a DNA sequence into a protein sequence** (amino acid sequence) using the codon table. This notebook contains the following functions:

- **sequence_extractor_for_all_files(data_folder, output_folder)**: Process all .txt files in the data folder, translating their DNA sequences to amino acid sequences. Output files are saved in the output folder with the same name as input files.


- **sequence_extractor(txt_file, data_folder, output_folder)**: Reads the DNA sequence from the input .txt file, translates it to an amino acid sequence, and writes the result to a .txt file in the output folder.


- **translate(seq)**: Translates a DNA sequence into an amino acid sequence using the global codon table.



In [2]:
import os

# Global codon table
CODON_TABLE = {
    'ATA': 'I', 'ATC': 'I', 'ATT': 'I', 'ATG': 'M',
    'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T',
    'AAC': 'N', 'AAT': 'N', 'AAA': 'K', 'AAG': 'K',
    'AGC': 'S', 'AGT': 'S', 'AGA': 'R', 'AGG': 'R',
    'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L',
    'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P',
    'CAC': 'H', 'CAT': 'H', 'CAA': 'Q', 'CAG': 'Q',
    'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R',
    'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V',
    'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A',
    'GAC': 'D', 'GAT': 'D', 'GAA': 'E', 'GAG': 'E',
    'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G',
    'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S',
    'TTC': 'F', 'TTT': 'F', 'TTA': 'L', 'TTG': 'L',
    'TAC': 'Y', 'TAT': 'Y', 'TAA': '_', 'TAG': '_',
    'TGC': 'C', 'TGT': 'C', 'TGA': '_', 'TGG': 'W',
}

def sequence_extractor_for_all_files(data_folder, output_folder):

    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Loop through all files in the data folder
    for txt_file in os.listdir(data_folder):
        txt_path = os.path.join(data_folder, txt_file)

        # Check if it is a valid .txt file
        if os.path.isfile(txt_path) and txt_file.endswith(".txt"):
            print(f"Processing {txt_file}...")
            sequence_extractor(txt_file, data_folder, output_folder)

def sequence_extractor(txt_file, data_folder, output_folder):

    txt_path = os.path.join(data_folder, txt_file)

    # Read the DNA sequence from the file
    with open(txt_path, 'r') as f:
        dna_sequence = f.read().strip()  # Read and remove any surrounding whitespace

    # Translate the DNA sequence to a protein sequence
    protein_sequence = translate(dna_sequence)

    # Write the protein sequence to the output file
    output_path = os.path.join(output_folder, txt_file)  # Keep the same filename
    with open(output_path, 'w') as out_file:
        out_file.write(protein_sequence)

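def translate(seq):
    """Translate a DNA sequence into a one-letter amino acid string.

    Stop codons are written as '_' and codons missing from CODON_TABLE as 'X'.
    """
    # Process the sequence in codons (triplets), ignoring a trailing incomplete codon
    usable_length = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], 'X')  # Use 'X' for unknown codons
        for i in range(0, usable_length, 3)
    )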


# Input and output directories
input_folder = "dna_sequences"  # Folder containing input .txt files
output_folder = "aa_sequences"  # Folder to store output .txt files

# Run the sequence extraction and translation
sequence_extractor_for_all_files(input_folder, output_folder)
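
To make the codon lookup concrete, here is a minimal, self-contained sketch of the same translation logic applied to a short made-up sequence. The names `SKETCH_CODON_TABLE` and `translate_sketch` are illustrative and not part of the notebook; the table is built compactly from the standard genetic code (same content as `CODON_TABLE`, with `_` for stop codons), and the example sequence is an assumption chosen for readability.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; '_' marks stop codons (as in the notebook).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
SKETCH_CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_sketch(seq):
    """Translate codon by codon; unknown codons become 'X', a trailing partial codon is dropped."""
    usable_length = len(seq) - len(seq) % 3
    return "".join(
        SKETCH_CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable_length, 3)
    )

# ATG GCC TGG TAA + trailing "GT" (ignored)
print(translate_sketch("ATGGCCTGGTAAGT"))  # MAW_
# A codon containing an ambiguous base is not in the table
print(translate_sketch("ATGNNNTAA"))       # MX_
```

Note that the lookup is case-sensitive, so a lowercase input would come back as a run of `X`; whether to upper-case the sequence first depends on how the input files are formatted.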

Additionally, we will implement frame-shift mutations and translation in all six possible reading frames, so that a mutated sequence can be passed to the translation function and its impact on the protein sequence evaluated. The new functions are:

- **translate_in_all_frames** - Translates the DNA sequence in all six possible reading frames:

    1. **Forward frames:** Translation starting from positions 0, 1, and 2.
    2. **Reverse frames:** Translation of the reverse complement, starting from positions 0, 1, and 2.
    3. **Returns** the protein sequences for all frames as a dictionary.
    
    
    
- **updated_sequence_extractor_for_all_files** - Reads DNA sequences, applies frame-shift mutations, translates in all six frames, and saves the results (mutated sequences and protein sequences for all frames) in the output folder.
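
The six-frame idea can be sketched on a small example. The code cell that follows calls a `reverse_complement` helper that this section does not show, so `reverse_complement_sketch` below is only one plausible way to write it (an assumption, not necessarily the notebook's version). `six_frames_sketch`, `translate_sketch`, and `SKETCH_CODON_TABLE` are likewise illustrative names, and the input sequence is made up.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; '_' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
SKETCH_CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}

def translate_sketch(seq):
    usable_length = len(seq) - len(seq) % 3
    return "".join(
        SKETCH_CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable_length, 3)
    )

def reverse_complement_sketch(seq):
    """Hypothetical helper: complement each base (A<->T, C<->G), then reverse."""
    complement = str.maketrans("ACGT", "TGCA")
    return seq.translate(complement)[::-1]

def six_frames_sketch(seq):
    """Translate offsets 0, 1, 2 on the given strand and on its reverse complement."""
    frames = {}
    strands = (("forward", seq), ("reverse", reverse_complement_sketch(seq)))
    for strand_name, strand_seq in strands:
        for offset in range(3):
            frames[f"{strand_name}_frame_{offset}"] = translate_sketch(strand_seq[offset:])
    return frames

for name, protein in six_frames_sketch("ATGGCCTAA").items():
    print(name, protein)
# forward_frame_0 MA_
# forward_frame_1 WP
# forward_frame_2 GL
# reverse_frame_0 LGH
# reverse_frame_1 _A
# reverse_frame_2 RP
```

Characters other than A, C, G, T pass through `str.translate` unchanged in this sketch, so ambiguous bases such as `N` would still be translated as `X`.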


In [2]:
def translate_in_all_frames(seq):
    
    frames = {}
    
    # Forward frames
    for frame in range(3):
        frames[f'forward_frame_{frame}'] = translate(seq[frame:])
    
    # Reverse complement and reverse frames
    rev_comp_seq = reverse_complement(seq)
    for frame in range(3):
        frames[f'reverse_frame_{frame}'] = translate(rev_comp_seq[frame:])
    
    return frames

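def updated_sequence_extractor_for_all_files(data_folder, output_folder, shift=1, deletion=False):
    """Mutate every .txt sequence in data_folder, translate it in all six frames, and save the results.

    shift and deletion are passed unchanged to simulate_frame_shift.
    """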
    os.makedirs(output_folder, exist_ok=True)
    
    for txt_file in os.listdir(data_folder):
        txt_path = os.path.join(data_folder, txt_file)
        
        if os.path.isfile(txt_path) and txt_file.endswith(".txt"):
            print(f"Processing {txt_file} with frame-shift mutation...")
            
            with open(txt_path, 'r') as f:
                dna_sequence = f.read().strip()
            
            # Apply frame-shift mutation
            mutated_seq = simulate_frame_shift(dna_sequence, shift, deletion)
            
            # Translate in all six frames
            frames_translation = translate_in_all_frames(mutated_seq)
            
            # Save results
            output_file_path = os.path.join(output_folder, txt_file)
            with open(output_file_path, 'w') as out_file:
                out_file.write(f"Original Sequence:\n{dna_sequence}\n")
                out_file.write(f"Mutated Sequence:\n{mutated_seq}\n")
                out_file.write("Translations in all six frames:\n")
                for frame, protein_seq in frames_translation.items():
                    out_file.write(f"{frame}: {protein_seq}\n")
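
The cell above relies on `simulate_frame_shift(dna_sequence, shift, deletion)`, which is not defined in this section. As a rough illustration of what such a mutation does to the translated protein, the sketch below assumes that `shift` is the number of bases inserted or deleted. The `position` and `filler` parameters, and every name ending in `_sketch`, are hypothetical and may differ from the notebook's actual helper.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; '_' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
SKETCH_CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}

def translate_sketch(seq):
    usable_length = len(seq) - len(seq) % 3
    return "".join(
        SKETCH_CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable_length, 3)
    )

def frame_shift_sketch(seq, shift=1, deletion=False, position=0, filler="A"):
    """Hypothetical mutation: delete `shift` bases at `position`, or insert `shift` filler bases there."""
    if shift < 0:
        raise ValueError("shift must be non-negative")
    if deletion:
        return seq[:position] + seq[position + shift:]
    return seq[:position] + filler * shift + seq[position:]

original = "ATGGCCTAA"
inserted = frame_shift_sketch(original, shift=1, position=3)                # ATGAGCCTAA
deleted = frame_shift_sketch(original, shift=1, deletion=True, position=3)  # ATGCCTAA
in_frame = frame_shift_sketch(original, shift=3, deletion=True, position=3) # ATGTAA

print(translate_sketch(original))  # MA_
print(translate_sketch(inserted))  # MSL
print(translate_sketch(deleted))   # MP
print(translate_sketch(in_frame))  # M_
```

The last case shows why only insertions or deletions whose length is not a multiple of three shift the reading frame: removing a whole codon drops one amino acid but leaves the downstream codons intact.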