# SeqPandas Alignment and Logo Tutorial

This tutorial demonstrates how to use SeqPandas for:
1. **Alignment visualization** - Reading and visualizing SAM/BAM alignments with gaps
2. **Sequence logos** - Creating publication-quality sequence logos from alignments

These features are particularly useful for:
- Visualizing multiple sequence alignments
- Identifying conserved regions
- Creating sequence motif logos
- Analyzing alignment patterns

In [1]:
import seqpandas as spd
from seqpandas.alignment import (
    read_aligned_sam,
    show_alignment_block,
    alignment_to_frequency_matrix,
    generate_position_labels,
    get_alignment_frequencies,
    sequences_to_logo_format,
)
from seqpandas.logo import plot_logo_with_indels
import pandas as pd
import numpy as np
from pathlib import Path
import tempfile

## Part 1: Alignment Visualization

The alignment module provides tools to read SAM/BAM files and visualize aligned sequences with explicit gap representations.

### Sample Files Setup

In [2]:
# Example SAM file from test data
alignment_sam = "tests/test-data/example.sam"
reference_fasta = "tests/test-data/example.fasta"

print(f"Using SAM file: {alignment_sam}")
print(f"Using reference: {reference_fasta}")

Using SAM file: tests/test-data/example.sam
Using reference: tests/test-data/example.fasta


### Reading Aligned Sequences

The `read_aligned_sam` function reads SAM/BAM files and returns sequences aligned to reference coordinates, with gaps shown explicitly.

In [3]:
# Read SAM file with aligned view
aligned_df = read_aligned_sam(
    alignment_file=alignment_sam,
    reference_file=reference_fasta,
    max_reads=10,  # Limit for demonstration
    show_insertions=True,  # Show insertion positions
    gap_char="-",  # Character for deletions
    insertion_char="*",  # Character for insertions
)

print(f"Loaded {len(aligned_df)} alignments")
print("\nDataFrame columns:")
print(aligned_df.columns.tolist())

Loaded 5 alignments

DataFrame columns:
['read_name', 'ref_name', 'ref_start', 'ref_end', 'strand', 'mapq', 'aligned_sequence', 'original_sequence', 'cigar', 'alignment_start', 'alignment_end']


In [4]:
# View the first few alignments
aligned_df[["read_name", "ref_name", "ref_start", "ref_end", "strand", "mapq"]].head()

Unnamed: 0,read_name,ref_name,ref_start,ref_end,strand,mapq
0,r001,ref1,6,22,+,30
1,r002,ref1,8,18,+,30
2,r003,ref1,8,14,+,30
3,r004,ref1,15,40,+,30
4,r003,ref2,28,33,-,17


### Visualizing Alignment Blocks

The `show_alignment_block` function creates a text visualization of aligned sequences, similar to traditional alignment viewers.

In [5]:
# Create alignment visualization
if len(aligned_df) > 0:
    alignment_view = show_alignment_block(
        aligned_df,
        max_display=5,  # Show first 5 sequences
        width=80,  # Characters per line
        show_positions=True,  # Show position ruler
    )
    print(alignment_view)
else:
    print("No alignments found in the SAM file")

Position 6:
r001                   TTAGATAAagGATA-CTG------------------
r002                   --AGATAAggATA----------------------
r003                   --AGCTAA--------------------------
r004                   ---------ATAGCT--------------TCAGC
r003                   ----------------------TAGGC-------
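For intuition, a block view like the one above can be approximated in plain Python. This is a minimal sketch, not the SeqPandas implementation: `format_block` and its `name_width` parameter are hypothetical names, and the ruler line counts alignment columns rather than reference coordinates.

```python
def format_block(names, sequences, width=80, name_width=23):
    """Render read names and gapped sequences as wrapped text blocks."""
    if len(names) != len(sequences):
        raise ValueError("names and sequences must have the same length")
    longest = max((len(seq) for seq in sequences), default=0)
    blocks = []
    # Emit one block per `width` alignment columns
    for start in range(0, longest, width):
        lines = [f"Position {start + 1}:"]
        for name, seq in zip(names, sequences):
            # Left-justify the name so every sequence starts in the same column
            lines.append(f"{name:<{name_width}}{seq[start:start + width]}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


print(format_block(["r001", "r002"], ["TTAGATAAagGATA-CTG", "--AGATAAggATA-----"], width=10))
```

Wrapping by a fixed `width` keeps long alignments readable in a terminal, which is presumably what the `width=80` argument above controls.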


### Understanding Aligned Sequences

Each aligned sequence in the DataFrame:
- Has the same length (padded with gaps)
- Shows deletions as `-` characters
- Shows insertions as lowercase letters (when `show_insertions=True`)
- Is positioned relative to the reference coordinates

In [6]:
# Examine a single aligned sequence in detail
if len(aligned_df) > 0:
    first_read = aligned_df.iloc[0]
    print(f"Read name: {first_read['read_name']}")
    print(f"Reference: {first_read['ref_name']}")
    print(f"Position: {first_read['ref_start']}-{first_read['ref_end']}")
    print(f"Strand: {first_read['strand']}")
    print(f"CIGAR: {first_read['cigar']}")
    print(f"\nAligned sequence (first 100 bp):")
    print(first_read["aligned_sequence"][:100])
    print(f"\nOriginal sequence (first 100 bp):")
    print(first_read["original_sequence"][:100] if first_read["original_sequence"] else "N/A")

Read name: r001
Reference: ref1
Position: 6-22
Strand: +
CIGAR: 8M2I4M1D3M

Aligned sequence (first 100 bp):
TTAGATAAagGATA-CTG------------------

Original sequence (first 100 bp):
TTAGATAAAGGATACTG
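The relationship between the CIGAR string, the original sequence, and the aligned sequence can be reproduced by hand. The sketch below assumes the conventions visible in the output above (insertions lowercased, deletions filled with the gap character); `expand_cigar` is a hypothetical helper, not part of SeqPandas, and it does not add the padding to reference coordinates.

```python
import re


def expand_cigar(read, cigar, gap_char="-"):
    """Expand a read into a gapped string by walking its CIGAR operations."""
    out = []
    pos = 0  # current offset into the read
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in "M=X":  # match or mismatch: copy read bases
            out.append(read[pos:pos + n])
            pos += n
        elif op == "I":  # insertion: copy read bases in lowercase
            out.append(read[pos:pos + n].lower())
            pos += n
        elif op in "DN":  # deletion or skipped region: no read bases, emit gaps
            out.append(gap_char * n)
        elif op == "S":  # soft clip: bases are in the read but not aligned
            pos += n
        # H (hard clip) and P (padding) consume neither read nor output
    return "".join(out)


print(expand_cigar("TTAGATAAAGGATACTG", "8M2I4M1D3M"))  # TTAGATAAagGATA-CTG
```

For r001, `8M2I4M1D3M` consumes 8 + 2 + 4 + 3 = 17 read bases and yields 18 characters, which matches the non-padded prefix of the aligned sequence shown above.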


### Logo-Ready Format

The `sequences_to_logo_format` function prepares data in the format logomaker expects and also identifies variant positions.

In [7]:
# Convert sequences to logo format
if len(aligned_df) > 0:
    # Define the input here so this cell does not depend on a later cell
    sequences = aligned_df["aligned_sequence"].tolist()
    logo_df, logo_labels, variant_positions = sequences_to_logo_format(sequences)

    print("Logo-Ready Format")
    print("=" * 40)
    print(f"DataFrame shape: {logo_df.shape}")
    print(f"Position labels (first 10): {logo_labels[:10]}...")
    print(f"\nVariant positions (differ from reference): {variant_positions}")

    # This is the exact format that logo.py uses internally
    print("\nFrequency DataFrame (as used by logo.py):")
    print(logo_df.head(10))

    # Show the transposed version (as saved to CSV by logo.py)
    csv_format = logo_df.copy().T
    csv_format.columns = logo_labels
    print("\nCSV Format (transposed, first 5 positions):")
    print(csv_format.iloc[:, :5])

In [8]:
# Get aligned sequences from our SAM file
if len(aligned_df) > 0:
    sequences = aligned_df["aligned_sequence"].tolist()

    # Create frequency matrix
    freq_matrix = alignment_to_frequency_matrix(sequences)

    print("Frequency Matrix from SAM Alignments")
    print("=" * 40)
    print(f"Shape: {freq_matrix.shape}")
    print(f"Characters: {list(freq_matrix.columns)}")
    print("\nFirst 10 positions:")
    print(freq_matrix.head(10))

    # Show positions with high variation
    print("\nPositions with multiple characters (entropy > 0):")
    for idx, row in freq_matrix.iterrows():
        if (row > 0).sum() > 1:  # More than one character present
            print(f"Position {idx}: {dict(row[row > 0])}")

Frequency Matrix from SAM Alignments
Shape: (36, 6)
Characters: ['-', 'A', 'C', 'G', 'T', '.']

First 10 positions:
     -    A    C    G    T    .
0  0.8  0.0  0.0  0.0  0.2  0.0
1  0.8  0.0  0.0  0.0  0.2  0.0
2  0.4  0.6  0.0  0.0  0.0  0.0
3  0.4  0.0  0.0  0.6  0.0  0.0
4  0.4  0.4  0.2  0.0  0.0  0.0
5  0.4  0.0  0.0  0.0  0.6  0.0
6  0.4  0.6  0.0  0.0  0.0  0.0
7  0.4  0.6  0.0  0.0  0.0  0.0
8  0.6  0.2  0.0  0.2  0.0  0.0
9  0.4  0.2  0.0  0.4  0.0  0.0

Positions with multiple characters (entropy > 0):
Position 0: {'-': 0.8, 'T': 0.2}
Position 1: {'-': 0.8, 'T': 0.2}
Position 2: {'-': 0.4, 'A': 0.6}
Position 3: {'-': 0.4, 'G': 0.6}
Position 4: {'-': 0.4, 'A': 0.4, 'C': 0.2}
Position 5: {'-': 0.4, 'T': 0.6}
Position 6: {'-': 0.4, 'A': 0.6}
Position 7: {'-': 0.4, 'A': 0.6}
Position 8: {'-': 0.6, 'A': 0.2, 'G': 0.2}
Position 9: {'-': 0.4, 'A': 0.2, 'G': 0.4}
Position 10: {'-': 0.4, 'A': 0.2, 'G': 0.2, 'T': 0.2}
Position 11: {'-': 0.4, 'A': 0.4, 'T': 0.2}
Position 12: {'-': 0.4,

## Part 1.5: Frequency Matrix Generation

SeqPandas provides functions that convert alignments into frequency matrices: per-position character frequencies that are the foundation for sequence logos and other analyses.
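The idea behind a frequency matrix can be sketched without SeqPandas: count the characters in each alignment column and divide by the number of sequences. `column_frequencies` and its default alphabet are assumptions made for this illustration; the library's `alignment_to_frequency_matrix` returns a DataFrame and may treat edge cases differently.

```python
from collections import Counter


def column_frequencies(sequences, alphabet="-ACGT"):
    """Return one {character: frequency} dict per alignment column."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    rows = []
    for column in zip(*sequences):
        # Uppercase so lowercase insertion bases count toward their letter
        counts = Counter(ch.upper() for ch in column)
        # Characters outside `alphabet` are ignored in this sketch
        rows.append({ch: counts.get(ch, 0) / len(sequences) for ch in alphabet})
    return rows


toy_alignment = ["TA", "-A", "-A", "-G", "-G"]
for position, row in enumerate(column_frequencies(toy_alignment)):
    print(position, row)
```

With five sequences, every frequency is a multiple of 0.2, which is why the matrices in this tutorial only contain values such as 0.2, 0.4, 0.6, and 0.8.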

## Part 2: Creating Sequence Logos

Sequence logos are graphical representations of sequence conservation, where the height of each letter represents its frequency at that position. The logo module uses the frequency matrix functions from the alignment module internally, which keeps it modular and reusable.

In [10]:
# Get frequency matrix directly from SAM file
freq_df, position_labels = get_alignment_frequencies(
    alignment_sam, max_reads=10, normalize=True  # Get frequencies instead of counts
)

print("Alignment Frequencies with Position Labels")
print("=" * 40)
print(f"DataFrame shape: {freq_df.shape}")
print(f"Number of positions: {len(position_labels)}")

# Show the labeled frequency matrix
print("\nFrequency matrix with position labels (first 10 positions):")
print(freq_df.head(10))

# Find the most variable positions
print("\nMost variable positions (lowest max frequency):")
max_freqs = freq_df.max(axis=1).sort_values()
for pos, max_freq in list(max_freqs.head(5).items()):
    print(f"  {pos}: max frequency = {max_freq:.2f}")

Alignment Frequencies with Position Labels
DataFrame shape: (36, 6)
Number of positions: 36

Frequency matrix with position labels (first 10 positions):
       -    A    C    G    T    .
T1   0.8  0.0  0.0  0.0  0.2  0.0
T2   0.8  0.0  0.0  0.0  0.2  0.0
A3   0.4  0.6  0.0  0.0  0.0  0.0
G4   0.4  0.0  0.0  0.6  0.0  0.0
A5   0.4  0.4  0.2  0.0  0.0  0.0
T6   0.4  0.0  0.0  0.0  0.6  0.0
A7   0.4  0.6  0.0  0.0  0.0  0.0
A8   0.4  0.6  0.0  0.0  0.0  0.0
A9   0.6  0.2  0.0  0.2  0.0  0.0
G10  0.4  0.2  0.0  0.4  0.0  0.0

Most variable positions (lowest max frequency):
  G11: max frequency = 0.40
  A5: max frequency = 0.40
  G10: max frequency = 0.40
  A12: max frequency = 0.40
  T13: max frequency = 0.40


### Complete Alignment-to-Frequency Pipeline

The `get_alignment_frequencies` function combines reading a SAM/BAM file and generating a labeled frequency matrix in one step.

In [11]:
# Demonstrate that logo.py now uses alignment module functions
print("Logo Generation Process:")
print("1. plot_logo_with_indels() receives alignment")
print("2. Internally calls sequences_to_logo_format()")
print("3. sequences_to_logo_format() calls:")
print("   - alignment_to_frequency_matrix()")
print("   - generate_position_labels()")
print("4. Returns frequency DataFrame ready for logomaker")
print("\nThis modular approach allows you to:")
print("- Use frequency matrices without creating logos")
print("- Generate position labels for other analyses")
print("- Create custom visualizations from the same data")

Logo Generation Process:
1. plot_logo_with_indels() receives alignment
2. Internally calls sequences_to_logo_format()
3. sequences_to_logo_format() calls:
   - alignment_to_frequency_matrix()
   - generate_position_labels()
4. Returns frequency DataFrame ready for logomaker

This modular approach allows you to:
- Use frequency matrices without creating logos
- Generate position labels for other analyses
- Create custom visualizations from the same data


In [12]:
# Get aligned sequences from our SAM file
if len(aligned_df) > 0:
    sequences = aligned_df["aligned_sequence"].tolist()
    reference_seq = sequences[0]  # First sequence as reference

    # Generate all position labels
    labels, indices = generate_position_labels(reference_seq)

    print("Position Labels")
    print("=" * 40)
    print(f"Total positions: {len(labels)}")
    print(f"\nFirst 20 labels:")
    for i, label in enumerate(labels[:20]):
        print(f"  Position {i}: {label}")

    # Generate labels for specific positions only
    print("\n\nFiltered Position Labels (positions 10-15):")
    filtered_labels, filtered_indices = generate_position_labels(reference_seq, positions=[10, 11, 12, 13, 14, 15])
    for label, idx in zip(filtered_labels, filtered_indices):
        print(f"  Index {idx}: {label}")

Position Labels
Total positions: 36

First 20 labels:
  Position 0: T1
  Position 1: T2
  Position 2: A3
  Position 3: G4
  Position 4: A5
  Position 5: T6
  Position 6: A7
  Position 7: A8
  Position 8: A9
  Position 9: G10
  Position 10: G11
  Position 11: A12
  Position 12: T13
  Position 13: A14_A
  Position 14: -14_B
  Position 15: C15
  Position 16: T16
  Position 17: G17_A
  Position 18: -17_B
  Position 19: -17_C


Filtered Position Labels (positions 10-15):
  Index 9: G10
  Index 10: G11
  Index 11: A12
  Index 12: T13
  Index 13: A14_A
  Index 15: C15
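The labeling pattern in the output above (a running count over non-gap characters, with `_A`, `_B`, ... suffixes when gap columns follow a base) can be reproduced with a short function. This is a sketch inferred from the printed labels only: `position_labels` is a hypothetical name, and the real `generate_position_labels` may handle leading gaps or long gap runs differently.

```python
from string import ascii_uppercase


def position_labels(reference, gap_char="-"):
    """Label each column; a base followed by gaps shares its number with them."""
    labels = []
    ref_pos = 0  # 1-based count of non-gap characters seen so far
    i = 0
    while i < len(reference):
        # Find the end of the gap run that directly follows column i
        j = i + 1
        while j < len(reference) and reference[j] == gap_char:
            j += 1
        run = j - i
        if run > len(ascii_uppercase):
            raise ValueError("gap run too long for single-letter suffixes")
        if reference[i] != gap_char:
            ref_pos += 1
        if run == 1:
            labels.append(f"{reference[i].upper()}{ref_pos}")
        else:
            # The base gets suffix _A, the following gap columns _B, _C, ...
            for k in range(run):
                labels.append(f"{reference[i + k].upper()}{ref_pos}_{ascii_uppercase[k]}")
        i = j
    return labels


print(position_labels("TTAGATAAagGATA-CTG--")[12:])
```

Sharing one number across a base and its trailing gap columns keeps the numbering tied to the non-gap characters, so labels stay comparable when columns are inserted.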


### Preparing Alignment for Logo Generation

First, let's create a simple multiple sequence alignment to demonstrate logo generation.

In [14]:
# Create example aligned sequences for logo demonstration
# These could come from your actual alignment data
example_alignment = [
    "ATGGATTTAT-CTGCTCTTCG",
    "ATGGATTTAT-CTGCTCTTCG",
    "ATGGATTTAT-CTGCTCTTCG",
    "ATGGATTTGT-CTGCTCTTCG",
    "ATGGATTTAT-CTGCTCTTCG",
    "ATGGCTTTAT-CTGCTCTTCG",
    "ATGGATTTAT-CTGCTCTTCG",
    "ATGGATTTATACTGCTCTTCG",  # Has insertion
    "ATGGATTTAT-CTGCTCTTCG",
    "ATGGATTTAT-CTGCTCTTCG",
]

print(f"Alignment length: {len(example_alignment[0])} positions")
print(f"Number of sequences: {len(example_alignment)}")
print("\nFirst 3 sequences:")
for seq in example_alignment[:3]:
    print(seq)

Alignment length: 21 positions
Number of sequences: 10

First 3 sequences:
ATGGATTTAT-CTGCTCTTCG
ATGGATTTAT-CTGCTCTTCG
ATGGATTTAT-CTGCTCTTCG


### Creating a Basic Sequence Logo

In [15]:
# Generate sequence logo
output_logo = "example_logo.png"

freq_df = plot_logo_with_indels(
    alignment=example_alignment,
    output_file=output_logo,
    title="Example Sequence Logo",
    color_scheme="NajafabadiEtAl2017",  # Default color scheme
)

print(f"Logo saved to: {output_logo}")
print("\nFrequency matrix shape:", freq_df.shape)
print("\nFirst few positions:")
freq_df.head()

Logo saved to: example_logo.png

Frequency matrix shape: (21, 6)

First few positions:


Unnamed: 0,-,A,C,G,T,.
0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.9,0.1,0.0,0.0,0.0


### Highlighting Specific Positions

You can highlight specific positions in the logo to draw attention to important sites.

In [16]:
# Create logo with highlighted positions
highlighted_logo = "example_logo_highlighted.png"

freq_df_highlight = plot_logo_with_indels(
    alignment=example_alignment,
    output_file=highlighted_logo,
    title="Logo with Highlighted Positions",
    positions=[5, 6, 7, 8, 9],  # Focus on positions 5-9
    highlight_pos=[1, 2],  # Highlight specific positions
    highlight_color="gold",
    highlight_alpha=0.5,
)

print(f"Highlighted logo saved to: {highlighted_logo}")

Highlighted logo saved to: example_logo_highlighted.png


### Using Different Color Schemes

The logo module supports various color schemes for different visualization needs.

## Summary

SeqPandas alignment and logo modules provide a comprehensive toolkit for sequence analysis:

**Alignment Module:**
- Read SAM/BAM files with gap-aware alignment
- Visualize alignments with explicit gap representation
- Extract aligned sequences for downstream analysis
- Handle insertions and deletions properly
- **NEW: Generate frequency matrices from alignments**
- **NEW: Create position labels with gap handling**
- **NEW: Direct conversion to logo-ready format**

**Logo Module:**
- Create publication-quality sequence logos
- Support for DNA, RNA, and protein sequences
- Handle insertions and deletions in alignments
- Multiple color schemes for different visualization needs
- Export frequency matrices for further analysis
- **NEW: Uses alignment module functions internally for modularity**

**Frequency Matrix Functions:**
- `alignment_to_frequency_matrix()` - Convert sequences to frequency matrix
- `generate_position_labels()` - Create meaningful position labels
- `get_alignment_frequencies()` - Complete pipeline from SAM to frequencies
- `sequences_to_logo_format()` - Prepare data for logomaker

**Integration:**
- Seamlessly convert SAM/BAM alignments to sequence logos
- Identify conserved and variable regions
- Create visual summaries of alignment data
- Perform custom analyses on frequency matrices
- Calculate conservation metrics like Shannon entropy
- Generate consensus sequences

### Custom Analysis with Frequency Matrices

The frequency matrices can be used for various bioinformatics analyses beyond visualization.

In [17]:
# Example: Calculate Shannon entropy for each position
import numpy as np


def calculate_shannon_entropy(freq_matrix):
    """Calculate Shannon entropy for each position in the alignment."""
    entropies = []
    for idx, row in freq_matrix.iterrows():
        # Remove zeros and calculate entropy
        probs = row[row > 0]
        if len(probs) > 0:
            entropy = -sum(probs * np.log2(probs))
        else:
            entropy = 0
        entropies.append(entropy)
    return entropies


# Calculate entropy for our alignment
if "freq_matrix" in locals() and not freq_matrix.empty:
    entropies = calculate_shannon_entropy(freq_matrix)

    print("Shannon Entropy Analysis")
    print("=" * 40)
    print(f"Mean entropy: {np.mean(entropies):.3f}")
    print(f"Max entropy: {np.max(entropies):.3f}")
    print(f"Min entropy: {np.min(entropies):.3f}")

    # Find highly conserved positions (low entropy)
    conserved_threshold = 0.5
    conserved_positions = [i for i, e in enumerate(entropies) if e < conserved_threshold]
    print(f"\nHighly conserved positions (entropy < {conserved_threshold}):")
    print(f"  Positions: {conserved_positions[:10]}...")

    # Find highly variable positions (high entropy)
    variable_threshold = 1.0
    variable_positions = [i for i, e in enumerate(entropies) if e > variable_threshold]
    print(f"\nHighly variable positions (entropy > {variable_threshold}):")
    print(f"  Positions: {variable_positions}")

Shannon Entropy Analysis
Mean entropy: 0.765
Max entropy: 1.922
Min entropy: -0.000

Highly conserved positions (entropy < 0.5):
  Positions: [18, 19, 20, 21, 27, 28, 34, 35]...

Highly variable positions (entropy > 1.0):
  Positions: [4, 8, 9, 10, 11, 12, 13]
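As a cross-check on the reported maximum, the entropy of a single column can be computed by hand. The sketch below uses the frequencies printed earlier for position 10; `shannon_entropy` is a hypothetical standalone helper, separate from the function defined in the cell above.

```python
import math


def shannon_entropy(freqs):
    """Entropy in bits of one alignment column given {character: frequency}."""
    # Zero frequencies are skipped because log2(0) is undefined
    return sum(-p * math.log2(p) for p in freqs.values() if p > 0)


position_10 = {"-": 0.4, "A": 0.2, "G": 0.2, "T": 0.2}
print(f"{shannon_entropy(position_10):.3f}")  # 1.922

conserved = {"A": 1.0}
print(f"{shannon_entropy(conserved):.3f}")  # 0.000
```

Summing the individual `-p * log2(p)` terms, rather than negating the final sum, returns `0.0` for a fully conserved column instead of the `-0.000` seen in the output above. For a six-character alphabet the upper bound is `log2(6)`, about 2.585 bits.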


### Applying a Color Scheme

The cell below lists the available color schemes and renders the example alignment with the `grays` scheme.

In [18]:
# Available color schemes
color_schemes = ["NajafabadiEtAl2017", "grays", "PGDM1400", "PGT121", "VRC01"]

print("Available color schemes:")
for scheme in color_schemes:
    print(f"  - {scheme}")

# Create logo with gray color scheme
gray_logo = "example_logo_gray.png"

freq_df_gray = plot_logo_with_indels(
    alignment=example_alignment, output_file=gray_logo, title="Logo with Gray Color Scheme", color_scheme="grays"
)

print(f"\nGray logo saved to: {gray_logo}")

Available color schemes:
  - NajafabadiEtAl2017
  - grays
  - PGDM1400
  - PGT121
  - VRC01

Gray logo saved to: example_logo_gray.png


## Part 3: Integration - From SAM to Logo

Now let's combine both modules to create a logo directly from SAM alignment data.

In [19]:
# Extract aligned sequences from SAM data
if len(aligned_df) > 0 and "aligned_sequence" in aligned_df.columns:
    # Get aligned sequences
    sam_sequences = aligned_df["aligned_sequence"].tolist()

    # Take first N sequences that have actual alignment data
    sam_sequences = [seq for seq in sam_sequences if seq and len(seq) > 0][:10]

    if sam_sequences:
        # Ensure all sequences have the same length
        max_len = max(len(seq) for seq in sam_sequences)
        sam_sequences = [seq.ljust(max_len, "-") for seq in sam_sequences]

        # Create logo from SAM alignments
        sam_logo = "sam_alignment_logo.png"

        freq_df_sam = plot_logo_with_indels(
            alignment=sam_sequences, output_file=sam_logo, title="Logo from SAM Alignments"
        )

        print(f"Created logo from {len(sam_sequences)} SAM alignments")
        print(f"Logo saved to: {sam_logo}")
        print(f"\nAlignment statistics:")
        print(f"  Sequence length: {max_len}")
        print(f"  Number of sequences: {len(sam_sequences)}")
    else:
        print("No valid aligned sequences found in SAM data")
else:
    print("No alignment data available for logo generation")

Created logo from 5 SAM alignments
Logo saved to: sam_alignment_logo.png

Alignment statistics:
  Sequence length: 36
  Number of sequences: 5


## Part 4: Advanced Features

### Working with Protein Sequences

The logo module is particularly useful for protein sequence alignments, supporting all 20 amino acids.

In [20]:
# Example protein alignment
protein_alignment = [
    "MKWVTFISLLFLFSSAYS",
    "MKWVTFISLLFLFSSAYS",
    "MKWVTFISLLFLFSSAYS",
    "MKWVTFISLLFLFSSAYT",
    "MKWVTFISLLFLFSSAYS",
    "MKWVTFISLLFLFS-AYS",  # With deletion
    "MKWVTFISLLFLFSSAYS",
    "MKWVAFISLLFLFSSAYS",
]

protein_logo = "protein_logo.png"

freq_df_protein = plot_logo_with_indels(
    alignment=protein_alignment, output_file=protein_logo, title="Protein Sequence Logo"
)

print(f"Protein logo saved to: {protein_logo}")
print("\nAmino acid frequencies at first 5 positions:")
freq_df_protein.head()

Protein logo saved to: protein_logo.png

Amino acid frequencies at first 5 positions:


Unnamed: 0,-,A,F,I,K,L,M,S,T,V,W,Y,.
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.875,0.0,0.0,0.0,0.0


### Handling Insertions and Deletions

The logo function handles indels gracefully, using alphabetical suffixes for positions with insertions.

In [21]:
# Alignment with complex indels
indel_alignment = [
    "ATG-GATTTAT",  # Deletion at position 4
    "ATGCGATTTAT",  # Normal
    "ATG-GATTTAT",  # Deletion
    "ATGCGATTTAT",  # Normal
    "ATG--ATTTAT",  # Double deletion
    "ATGCGATTTAT",  # Normal
]

indel_logo = "indel_logo.png"

freq_df_indel = plot_logo_with_indels(
    alignment=indel_alignment, output_file=indel_logo, title="Logo with Insertions and Deletions"
)

print(f"Indel logo saved to: {indel_logo}")
print("\nNote: Positions with gaps are labeled with suffixes (e.g., G4_A for insertion after G4)")

Indel logo saved to: indel_logo.png

Note: Positions with gaps are labeled with suffixes (e.g., G4_A for insertion after G4)
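To see which columns of `indel_alignment` carry gaps before plotting, the gap fraction per column can be computed directly. This is an illustrative sketch in plain Python; `gap_fraction` is a hypothetical helper and not a SeqPandas function.

```python
def gap_fraction(sequences, gap_char="-"):
    """Return the fraction of sequences that have a gap in each column."""
    return [
        sum(ch == gap_char for ch in column) / len(sequences)
        for column in zip(*sequences)
    ]


indel_alignment = [
    "ATG-GATTTAT",
    "ATGCGATTTAT",
    "ATG-GATTTAT",
    "ATGCGATTTAT",
    "ATG--ATTTAT",
    "ATGCGATTTAT",
]

fractions = gap_fraction(indel_alignment)
# Report only the columns that contain at least one gap (0-based index)
print([(i, round(f, 3)) for i, f in enumerate(fractions) if f > 0])
```

Half of the sequences have a gap in the fourth column (index 3), and one of six has a gap in the fifth, so these are the two columns where the logo shows a gap character alongside the bases.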


### Exporting Frequency Data

The frequency matrix used to generate the logo is automatically saved as a CSV file for further analysis.

In [22]:
# The frequency data is saved alongside the logo
csv_file = Path(protein_logo).with_suffix(".csv")

if csv_file.exists():
    freq_data = pd.read_csv(csv_file, index_col=0)
    print(f"Frequency data saved to: {csv_file}")
    print("\nFrequency matrix (first 5 positions):")
    print(freq_data.iloc[:, :5])
    print("\nThis data can be used for:")
    print("  - Statistical analysis of conservation")
    print("  - Identifying highly variable positions")
    print("  - Creating custom visualizations")
    print("  - Comparing different alignments")

Frequency data saved to: protein_logo.csv

Frequency matrix (first 5 positions):
    M1   K2   W3   V4     T5
-  0.0  0.0  0.0  0.0  0.000
A  0.0  0.0  0.0  0.0  0.125
F  0.0  0.0  0.0  0.0  0.000
I  0.0  0.0  0.0  0.0  0.000
K  0.0  1.0  0.0  0.0  0.000
L  0.0  0.0  0.0  0.0  0.000
M  1.0  0.0  0.0  0.0  0.000
S  0.0  0.0  0.0  0.0  0.000
T  0.0  0.0  0.0  0.0  0.875
V  0.0  0.0  0.0  1.0  0.000
W  0.0  0.0  1.0  0.0  0.000
Y  0.0  0.0  0.0  0.0  0.000
.  0.0  0.0  0.0  0.0  0.000

This data can be used for:
  - Statistical analysis of conservation
  - Identifying highly variable positions
  - Creating custom visualizations
  - Comparing different alignments
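One simple use of the exported table is deriving a consensus sequence. The sketch below parses a small inline table in the transposed layout printed above (characters as rows, position labels as columns) with the standard library. The exact CSV layout written by the logo module is assumed from that printout, and `consensus_from_csv` is a hypothetical helper.

```python
import csv
import io


def consensus_from_csv(text):
    """Pick the most frequent character at each position of a transposed table."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    labels, body = rows[0][1:], rows[1:]
    consensus = []
    for col in range(1, len(labels) + 1):
        # Ties resolve to the first row encountered
        best = max(body, key=lambda row: float(row[col]))
        consensus.append(best[0])
    return labels, "".join(consensus)


table = """,M1,K2,W3,V4,T5
-,0.0,0.0,0.0,0.0,0.0
A,0.0,0.0,0.0,0.0,0.125
K,0.0,1.0,0.0,0.0,0.0
M,1.0,0.0,0.0,0.0,0.0
T,0.0,0.0,0.0,0.0,0.875
V,0.0,0.0,0.0,1.0,0.0
W,0.0,0.0,1.0,0.0,0.0
"""

labels, consensus = consensus_from_csv(table)
print(consensus)  # MKWVT
```

The same function could be pointed at the contents of `protein_logo.csv` if the file follows this layout.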


## Cleanup

Remove temporary files created during the tutorial.

In [24]:
# Clean up generated files (optional)
import os

files_to_clean = [
    "example_logo.png",
    "example_logo.csv",
    "example_logo_highlighted.png",
    "example_logo_highlighted.csv",
    "example_logo_gray.png",
    "example_logo_gray.csv",
    "sam_alignment_logo.png",
    "sam_alignment_logo.csv",
    "protein_logo.png",
    "protein_logo.csv",
    "indel_logo.png",
    "indel_logo.csv",
]

for file in files_to_clean:
    if os.path.exists(file):
        os.remove(file)
        print(f"Removed: {file}")

print("\nCleanup complete!")

Removed: example_logo.png
Removed: example_logo.csv
Removed: example_logo_highlighted.png
Removed: example_logo_highlighted.csv
Removed: example_logo_gray.png
Removed: example_logo_gray.csv
Removed: sam_alignment_logo.png
Removed: sam_alignment_logo.csv
Removed: protein_logo.png
Removed: protein_logo.csv
Removed: indel_logo.png
Removed: indel_logo.csv

Cleanup complete!
