## Libraries

In [10]:
import os
import sys

# Custom functions
python_dir_path = os.path.join('..', 'scripts', 'python')
sys.path.append(python_dir_path)
from subset_pr2_database import run_mafft, alignment_stats

## Variables

In [4]:
project = 'Suthaus_2022'
marker = 'Full18S'
sim = 'sim90'
denoise_method = 'RAD'
raw_data = os.path.join('..', 'raw_data')
subset_align_dir = os.path.join(raw_data, 'reference_alignments', 'pr2_subset')

# MAFFT Alignment

Our dataset is moderately large and diverse (around 1,600 18S rRNA sequences), so we want a strategy that balances speed and accuracy. FFT-NS-2, MAFFT's standard progressive method, is a reasonable choice for a dataset of this size. Note that FFT-NS-2 does not itself perform iterative refinement; if refinement turns out to be necessary, FFT-NS-i adds it at a higher computational cost.

**FFT-NS-2:**
- **Speed:** Slower than FFT-NS-1 but faster than the iterative refinement methods (e.g. FFT-NS-i).
- **Method:** A progressive method run in two passes. It first builds a rough guide tree and a progressive alignment (this is FFT-NS-1), then re-estimates the guide tree from that alignment and repeats the progressive alignment. No iterative refinement is performed; that is the step FFT-NS-i adds on top.
- **Usage Scenario:** Provides a balance between speed and accuracy. It's useful when you need a relatively quick alignment but want it to be more accurate than what FFT-NS-1 would provide.
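
As a rough illustration of what a wrapper such as `run_mafft` might do internally, the sketch below builds a MAFFT command line and runs it with `subprocess`. In MAFFT's documentation, `--retree 2 --maxiterate 0` corresponds to the FFT-NS-2 strategy. The function names `build_mafft_command` and `run_mafft_sketch` are hypothetical, the sketch assumes `mafft` is available on the `PATH`, and the actual implementation in `subset_pr2_database.py` may differ.

```python
import subprocess


def build_mafft_command(input_file, retree=2, maxiterate=0):
    """Return the MAFFT argument list for a progressive alignment.

    --retree 2 --maxiterate 0 selects FFT-NS-2: two rounds of guide-tree
    construction and no iterative refinement.
    """
    return [
        "mafft",
        "--retree", str(retree),
        "--maxiterate", str(maxiterate),
        input_file,
    ]


def run_mafft_sketch(input_file, output_file, log_file):
    """Hypothetical wrapper (not the notebook's run_mafft).

    MAFFT writes the alignment to stdout and progress messages to stderr,
    so the two streams are redirected to separate files.
    """
    cmd = build_mafft_command(input_file)
    with open(output_file, "w") as out, open(log_file, "w") as log:
        result = subprocess.run(cmd, stdout=out, stderr=log)
    return result.returncode  # 0 indicates success


# Only build and show the command here; nothing is executed.
print(build_mafft_command("input.fasta"))
```

Keeping the command construction separate from the execution makes it easy to inspect or log the exact MAFFT call before running it on a large input.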

In [8]:
# Run MAFFT alignment on the input file using the FFT-NS-2 strategy.

# Output and input paths
input_file_path = os.path.join(subset_align_dir, 'pr2_version_5.0.0_SSU_UTAX_euknucl_subset.fasta')
output_file_path = os.path.join(subset_align_dir, 'pr2_version_5.0.0_SSU_UTAX_euknucl_subset_mafft.fasta')
log_file_path = os.path.join(subset_align_dir, 'pr2_version_5.0.0_SSU_UTAX_euknucl_subset_mafft.log')

# Run MAFFT
run_mafft(input_file=input_file_path,
          output_file=output_file_path,
          log_file=log_file_path)

Alignment was saved to: ../raw_data/reference_alignments/pr2_subset/pr2_version_5.0.0_SSU_UTAX_euknucl_subset_mafft.fasta
Log file was saved to: ../raw_data/reference_alignments/pr2_subset/pr2_version_5.0.0_SSU_UTAX_euknucl_subset_mafft.log


0

In [11]:
# Checking some basic statistics about our MAFFT alignment
mafft_alignment = os.path.join(subset_align_dir, 'pr2_version_5.0.0_SSU_UTAX_euknucl_subset_mafft.fasta')

num_sequences, alignment_length, average_gaps, percentage_gaps, percentage_identity = alignment_stats(alignment_file = mafft_alignment)
print(f"Number of sequences: {num_sequences}")
print(f"Alignment length: {alignment_length}")
print(f"Average number of gaps per sequence: {average_gaps:.2f}")
print(f"Percentage of gaps across the sequences: {percentage_gaps:.2f}%")
print(f"Average percentage identity: {percentage_identity:.2f}%")

Number of sequences: 1371
Alignment length: 11943
Average number of gaps per sequence: 10217.93
Percentage of gaps across the sequences: 85.56%
Average percentage identity: 94.98%
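
The gap figures above are consistent with the gap percentage being the average gap count divided by the alignment length (10217.93 / 11943 ≈ 85.56%). The sketch below shows one way such statistics could be computed from aligned sequences held as plain strings. The function name `gap_stats` and the toy alignment are hypothetical; this is not the code of `alignment_stats`, whose definition (including how it computes percentage identity) is not shown here.

```python
def gap_stats(sequences, gap_char="-"):
    """Return (num_sequences, alignment_length, average_gaps, percentage_gaps).

    `sequences` is a list of aligned sequences of equal length.
    """
    if not sequences:
        raise ValueError("alignment is empty")
    alignment_length = len(sequences[0])
    if any(len(s) != alignment_length for s in sequences):
        raise ValueError("sequences differ in length; not an alignment")

    # Count gap characters in each aligned sequence.
    gap_counts = [s.count(gap_char) for s in sequences]
    average_gaps = sum(gap_counts) / len(sequences)
    # Share of alignment positions that are gaps, averaged over sequences.
    percentage_gaps = 100 * average_gaps / alignment_length
    return len(sequences), alignment_length, average_gaps, percentage_gaps


# Toy alignment: 4 sequences, 5 columns, 5 gap characters in total.
toy = ["AC-GT", "A--GT", "ACTGT", "AC-G-"]
print(gap_stats(toy))  # (4, 5, 1.25, 25.0)
```

A gap fraction as high as the one reported usually means a small number of long or divergent sequences are stretching the alignment, so checking per-sequence gap counts (the `gap_counts` list in the sketch) can help locate them.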
