# scikit-bio: 25 Examples for Bioinformatics Students

This notebook demonstrates various functions and capabilities of **scikit-bio**, a Python library for bioinformatics. We'll cover:

- Sequence objects (DNA, RNA, Protein)
- File format parsing and writing (FASTA, FASTQ)
- Sequence alignment and manipulation
- Distance matrices and phylogenetic trees
- Diversity metrics (alpha and beta diversity)
- Visualizations with matplotlib and seaborn

---

In [None]:
# Install required packages if needed
# !pip install scikit-bio matplotlib seaborn numpy pandas

In [None]:
# Import all required libraries
import skbio
from skbio import DNA, RNA, Protein, Sequence
from skbio import TabularMSA, DistanceMatrix, TreeNode
from skbio import io as skbio_io
from skbio.diversity import alpha_diversity, beta_diversity
from skbio.stats.distance import permanova, anosim
from skbio.stats.ordination import pcoa

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print(f"scikit-bio version: {skbio.__version__}")

---
## Part 1: DNA Sequence Objects
---

### Example 1: Creating and Inspecting DNA Sequences

The `DNA` class is one of scikit-bio's core sequence types. It validates its characters against the IUPAC DNA alphabet, carries per-sequence and per-position metadata, and supports common sequence operations.

In [None]:
# Example 1: Creating and Inspecting DNA Sequences

# Create a DNA sequence with metadata
raw_seq = 'ATGCGATCGATCGATCGATCGATCGATCGATCG'
dna_seq = DNA(
    raw_seq,
    metadata={'id': 'gene_001', 'description': 'Example gene sequence'},
    # positional_metadata needs exactly one value per base
    positional_metadata={'quality': list(range(len(raw_seq)))}
)

print("DNA Sequence Information:")
print(f"Sequence: {dna_seq}")
print(f"Length: {len(dna_seq)} bp")
print(f"Metadata: {dna_seq.metadata}")
print(f"\nSequence statistics:")
print(f"GC Content: {dna_seq.gc_content():.2%}")
print(f"Has gaps: {dna_seq.has_gaps()}")
print(f"Has degenerates: {dna_seq.has_degenerates()}")
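
To make clear what a GC-content figure measures, the calculation can be reproduced in plain Python. This is a minimal sketch, not scikit-bio's implementation: `manual_gc_content` is a hypothetical helper, and it assumes an ungapped sequence containing only A, C, G, and T.

```python
def manual_gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (hypothetical helper, not a scikit-bio call)."""
    seq = seq.upper()
    if not seq:
        return 0.0  # avoid dividing by zero for an empty sequence
    gc = sum(base in "GC" for base in seq)
    return gc / len(seq)


example = "ATGCGATCGATCGATCGATCGATCGATCGATCG"
print(f"Length: {len(example)} bp")
print(f"GC content: {manual_gc_content(example):.2%}")
```

For the sequence in Example 1 this counts 17 G or C bases out of 33, which should agree with the value printed by `dna_seq.gc_content()`.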

### Example 2: DNA Sequence Slicing and Manipulation

DNA sequences support Python-like slicing and various manipulation methods.

In [None]:
# Example 2: DNA Sequence Slicing and Manipulation

dna = DNA('ATGCGATCGATCGATCGATCG')

print("Original sequence:", dna)
print(f"First 10 bases: {dna[:10]}")
print(f"Last 5 bases: {dna[-5:]}")
print(f"Every 3rd base: {dna[::3]}")
print(f"\nReverse complement: {dna.reverse_complement()}")
print(f"Complement: {dna.complement()}")

# Find specific patterns using regex
import re
pattern = 'GATC'
seq_str = str(dna)
positions = [m.start() for m in re.finditer(pattern, seq_str)]
print(f"\nMotif '{pattern}' found at positions: {positions}")
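
One caveat with the regex approach above: `re.finditer` reports non-overlapping matches, so a motif that can overlap itself is undercounted. The sketch below shows one common workaround, a zero-width lookahead. `find_motif` is a hypothetical helper introduced only for illustration.

```python
import re


def find_motif(seq: str, motif: str, overlapping: bool = True) -> list[int]:
    """Return 0-based start positions of `motif` in `seq` (hypothetical helper)."""
    escaped = re.escape(motif)
    # A lookahead matches without consuming characters, so the next search
    # starts one position later and overlapping hits are all reported.
    pattern = f"(?={escaped})" if overlapping else escaped
    return [m.start() for m in re.finditer(pattern, seq)]


print(find_motif("ATATATA", "ATA", overlapping=False))  # [0, 4]
print(find_motif("ATATATA", "ATA", overlapping=True))   # [0, 2, 4]
print(find_motif("ATGCGATCGATCGATCGATCG", "GATC"))      # [4, 8, 12, 16]
```

`GATC` cannot overlap itself, so both modes give the same answer for the sequence in Example 2; the difference only matters for self-overlapping motifs such as `ATA`.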

### Example 3: Transcription and Translation

Convert DNA to RNA (transcription) and then to protein (translation).

In [None]:
# Example 3: Transcription and Translation

# Create a coding DNA sequence (CDS)
coding_dna = DNA('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')

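print("Central Dogma Workflow:")
print(f"DNA (coding):       {coding_dna}")

# Transcribe DNA to RNA. transcribe() swaps T for U, so the input is
# treated as the coding (sense) strand, not the template strand.
rna = coding_dna.transcribe()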
print(f"RNA (transcript):   {rna}")

# Translate RNA to Protein
protein = rna.translate()
print(f"Protein (peptide):  {protein}")

# Analyze the protein
print(f"\nProtein length: {len(protein)} amino acids")
print(f"Contains stop codon: {'*' in str(protein)}")
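
The example sequence has an in-frame `TGA` at its eighth codon, so a full translation contains an internal `*`. The sketch below rebuilds the standard genetic code (NCBI table 1) in plain Python and stops at the first stop codon. `translate_to_first_stop` is a hypothetical helper, and the code assumes an uppercase sequence of A/C/G/T read in frame 1.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_to_first_stop(cds: str) -> str:
    """Translate frame 1 and stop at the first stop codon (hypothetical helper)."""
    peptide = []
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


cds = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate_to_first_stop(cds))  # MAIVMGR
```

Whether to truncate at the first stop or keep the full translation depends on the question being asked; the notebook's `translate()` call keeps everything, which is why `*` appears inside the printed protein.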

---
## Part 2: RNA and Protein Sequences
---

### Example 4: RNA Sequence Analysis

RNA sequences have their own class with specific methods for RNA-related operations.

In [None]:
# Example 4: RNA Sequence Analysis

# Create an RNA sequence (note: U instead of T)
rna_seq = RNA('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')

print("RNA Sequence Analysis:")
print(f"Sequence: {rna_seq}")
print(f"Length: {len(rna_seq)} nt")
print(f"GC Content: {rna_seq.gc_content():.2%}")

# Reverse transcribe back to DNA
back_to_dna = rna_seq.reverse_transcribe()
print(f"\nReverse transcribed to DNA: {back_to_dna}")

# Direct translation
protein = rna_seq.translate()
print(f"Translated protein: {protein}")

### Example 5: Protein Sequence Analysis

Analyze protein sequences for composition and properties.

In [None]:
# Example 5: Protein Sequence Analysis

# Create a protein sequence (insulin B chain)
insulin_b = Protein(
    'FVNQHLCGSHLVEALYLVCGERGFFYTPKT',
    metadata={'id': 'insulin_b', 'description': 'Insulin B chain'}
)

print("Protein Sequence Analysis:")
print(f"Sequence: {insulin_b}")
print(f"Length: {len(insulin_b)} amino acids")

# Count amino acid frequencies
aa_counts = insulin_b.frequencies()
print(f"\nAmino acid frequencies:")
for aa, count in sorted(aa_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"  {aa}: {count} ({count/len(insulin_b):.1%})")

### Example 6: Visualizing Sequence Composition

Create visualizations of nucleotide and amino acid composition using matplotlib and seaborn.

In [None]:
# Example 6: Visualizing Sequence Composition

# Create a longer DNA sequence for better statistics
dna_long = DNA('ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGGAATTCCGGAATTCCGG'
               'GCGCGCATATATATGCGCGCATATATATGCGCGCATATATAT')

# Get nucleotide frequencies
nt_freq = dna_long.frequencies()

# Create side-by-side plots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot of nucleotide frequencies
colors = ['#e74c3c', '#3498db', '#2ecc71', '#f39c12']  # A, C, G, T
nucleotides = ['A', 'C', 'G', 'T']
counts = [nt_freq.get(nt, 0) for nt in nucleotides]

axes[0].bar(nucleotides, counts, color=colors, edgecolor='black', linewidth=1.5)
axes[0].set_xlabel('Nucleotide', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Nucleotide Frequency Distribution', fontsize=14)

# Add value labels on bars
for i, (nt, count) in enumerate(zip(nucleotides, counts)):
    axes[0].text(i, count + 0.5, str(count), ha='center', fontsize=11, fontweight='bold')

# Pie chart of GC vs AT content
gc = nt_freq.get('G', 0) + nt_freq.get('C', 0)
at = nt_freq.get('A', 0) + nt_freq.get('T', 0)
axes[1].pie([gc, at], labels=['GC', 'AT'], autopct='%1.1f%%',
            colors=['#9b59b6', '#e67e22'], explode=[0.05, 0],
            textprops={'fontsize': 12})
axes[1].set_title(f'GC Content: {dna_long.gc_content():.1%}', fontsize=14)

plt.tight_layout()
plt.show()

---
## Part 3: File Format Parsing and Writing
---

### Example 7: Reading and Writing FASTA Files

FASTA is one of the most common formats for storing biological sequences.

In [None]:
# Example 7: Reading and Writing FASTA Files

# Create sample FASTA data
fasta_data = """>seq1 Homo sapiens beta-globin
ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTG
>seq2 Mus musculus beta-globin
ATGGTGCACCTGACTGATGCTGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTG
>seq3 Rattus norvegicus beta-globin
ATGGTGCACTTGACTGATGCTGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTG
"""

# Parse FASTA from string
print("Reading FASTA sequences:")
sequences = []
for seq in skbio_io.read(StringIO(fasta_data), format='fasta', constructor=DNA):
    sequences.append(seq)
    print(f"\nID: {seq.metadata.get('id', 'N/A')}")
    print(f"Description: {seq.metadata.get('description', 'N/A')}")
    print(f"Length: {len(seq)} bp")
    print(f"First 30 bp: {seq[:30]}...")

# Write sequences to FASTA format
print("\n" + "="*50)
print("Writing sequences back to FASTA format:")
output = StringIO()
for seq in sequences:
    skbio_io.write(seq, format='fasta', into=output)
print(output.getvalue()[:200] + "...")
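
To show how a FASTA header splits into an ID and a description, here is a minimal parser in plain Python. It is a sketch for illustration only: `parse_fasta` is a hypothetical helper, it assumes well-formed input, and it performs none of the validation that scikit-bio's reader does.

```python
def parse_fasta(text: str) -> list[tuple[str, str, str]]:
    """Return (id, description, sequence) tuples (hypothetical helper)."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((*header, "".join(chunks)))
            # Everything up to the first space is the ID; the rest is the description.
            seq_id, _, description = line[1:].partition(" ")
            header, chunks = (seq_id, description), []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        records.append((*header, "".join(chunks)))
    return records


fasta_text = ">seq1 Homo sapiens beta-globin\nATGGTG\nCACCTG\n>seq2\nATGGTGCAC\n"
records = parse_fasta(fasta_text)
for seq_id, description, sequence in records:
    print(seq_id, repr(description), len(sequence))
```

The same split explains why `seq.metadata['id']` is `seq1` and `seq.metadata['description']` holds the species and gene name in the scikit-bio output above.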

### Example 8: Working with FASTQ Files (Quality Scores)

FASTQ stores a Phred quality score for every base alongside the sequence, which makes it the usual format for raw NGS reads.

In [None]:
# Example 8: Working with FASTQ Files (Quality Scores)

# Create sample FASTQ data
fastq_data = """@read1 length=50
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read2 length=50
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
+
IIIIIIIIIIIIIIIIIIIII55555555555555555555555555555
@read3 length=50
TTTTAAAACCCCGGGGTTTTAAAACCCCGGGGTTTTAAAACCCCGGGGTT
+
IIIII!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!IIIIII
"""

print("Parsing FASTQ sequences with quality scores:")
reads = []
for seq in skbio_io.read(StringIO(fastq_data), format='fastq', constructor=DNA, 
                          variant='illumina1.8'):
    reads.append(seq)
    quality = seq.positional_metadata['quality'].values
    print(f"\nRead: {seq.metadata.get('id', 'N/A')}")
    print(f"Sequence: {seq[:20]}...")
    print(f"Mean quality: {np.mean(quality):.1f}")
    print(f"Min quality: {np.min(quality)}")
    print(f"Max quality: {np.max(quality)}")
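
The quality characters in the FASTQ records above decode to integers through a fixed ASCII offset. The sketch below assumes the Phred+33 encoding that goes with `variant='illumina1.8'`; `phred33_scores` and `error_probability` are hypothetical helper names.

```python
def phred33_scores(quality_line: str) -> list[int]:
    """Decode a Phred+33 quality string into integer scores (hypothetical helper)."""
    return [ord(ch) - 33 for ch in quality_line]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


for ch in "I5!":
    q = phred33_scores(ch)[0]
    print(f"{ch!r} -> Q{q}, error probability {error_probability(q):.4g}")
```

With this encoding `I` is Q40, `5` is Q20, and `!` is Q0, which is why read3 in the sample data shows a minimum quality of 0.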

### Example 9: Visualizing Quality Scores

Quality score visualization is critical for assessing sequencing data quality.

In [None]:
# Example 9: Visualizing Quality Scores

# Create a plot of quality scores across positions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot quality scores for each read
colors = sns.color_palette('husl', len(reads))
for i, seq in enumerate(reads):
    quality = seq.positional_metadata['quality'].values
    axes[0].plot(quality, label=seq.metadata.get('id', f'Read {i+1}'), 
                 color=colors[i], linewidth=2)

axes[0].axhline(y=30, color='red', linestyle='--', label='Q30 threshold')
axes[0].axhline(y=20, color='orange', linestyle='--', label='Q20 threshold')
axes[0].set_xlabel('Position in Read', fontsize=12)
axes[0].set_ylabel('Quality Score (Phred)', fontsize=12)
axes[0].set_title('Per-Base Quality Scores', fontsize=14)
axes[0].legend(loc='lower left')
axes[0].set_ylim(0, 45)

# Box plot of quality distributions
quality_data = [seq.positional_metadata['quality'].values for seq in reads]
read_names = [seq.metadata.get('id', f'Read {i+1}') for i, seq in enumerate(reads)]

bp = axes[1].boxplot(quality_data, patch_artist=True)
axes[1].set_xticklabels(read_names)  # avoids the `labels` keyword, renamed in Matplotlib 3.9
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

axes[1].axhline(y=30, color='red', linestyle='--', label='Q30 threshold')
axes[1].set_xlabel('Read', fontsize=12)
axes[1].set_ylabel('Quality Score (Phred)', fontsize=12)
axes[1].set_title('Quality Score Distribution by Read', fontsize=14)
axes[1].legend(loc='lower left')  # show the Q30 threshold label

plt.tight_layout()
plt.show()

---
## Part 4: Sequence Alignment
---

### Example 10: Sequence Comparison and Similarity

Compare sequences by calculating identity and finding differences between them.

In [None]:
# Example 10: Sequence Comparison and Similarity

# Define two sequences to compare
seq1 = DNA('ATGCGATCGATCGATCGATCG')
seq2 = DNA('ATGCGATGGATCGTTCGATCG')  # Has some differences

print("Sequence Comparison:")
print(f"\nSequence 1: {seq1}")
print(f"Sequence 2: {seq2}")
print(f"Length: {len(seq1)} bp each")

# Calculate identity (matching positions)
s1, s2 = str(seq1), str(seq2)
matches = sum(a == b for a, b in zip(s1, s2))
identity = matches / len(s1) * 100

print(f"\nMatching positions: {matches}/{len(s1)}")
print(f"Percent Identity: {identity:.1f}%")

# Find positions that differ
print("\nDifferences:")
diff_positions = []
for i, (a, b) in enumerate(zip(s1, s2)):
    if a != b:
        diff_positions.append(i)
        print(f"  Position {i}: {a} -> {b}")

print(f"\nTotal differences: {len(diff_positions)} positions")

# Create visual alignment representation
print("\nVisual comparison:")
print(f"Seq1: {s1}")
match_str = ''.join('|' if a == b else ' ' for a, b in zip(s1, s2))
print(f"      {match_str}")
print(f"Seq2: {s2}")

### Example 11: Finding Conserved Regions

Identify and extract conserved regions between sequences.

In [None]:
# Example 11: Finding Conserved Regions

# Sequences with a conserved region in the middle
seq1 = DNA('AAAAAAGATCGATCGATCGATCAAAAAAA')
seq2 = DNA('TTTTTTGATCGATCGATCGATCTTTTTT')

print("Finding Conserved Regions:")
print(f"\nSequence 1: {seq1}")
print(f"Sequence 2: {seq2}")

# Find longest common substring (conserved region)
def find_longest_common_substring(s1, s2):
    """Find the longest common substring between two sequences."""
    m, n = len(s1), len(s2)
    # Create a table to store lengths of longest common suffixes
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    longest = 0
    end_pos = 0
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
                if dp[i][j] > longest:
                    longest = dp[i][j]
                    end_pos = i
    
    return s1[end_pos - longest:end_pos], end_pos - longest

s1, s2 = str(seq1), str(seq2)
conserved, start_pos = find_longest_common_substring(s1, s2)

print(f"\nConserved region found: {conserved}")
print(f"Length: {len(conserved)} bp")
print(f"Position in seq1: {start_pos}-{start_pos + len(conserved) - 1}")

# Find position in seq2
start_in_seq2 = s2.find(conserved)
print(f"Position in seq2: {start_in_seq2}-{start_in_seq2 + len(conserved) - 1}")

# Visualize the conserved region
print("\nVisualization:")
print(f"Seq1: {s1}")
highlight1 = ' ' * start_pos + '^' * len(conserved) + ' ' * (len(s1) - start_pos - len(conserved))
print(f"      {highlight1}")
print(f"Seq2: {s2}")
highlight2 = ' ' * start_in_seq2 + '^' * len(conserved) + ' ' * (len(s2) - start_in_seq2 - len(conserved))
print(f"      {highlight2}")
print(f"\n      '^' marks the conserved region: {conserved}")

### Example 12: Multiple Sequence Alignment (TabularMSA)

TabularMSA represents multiple aligned sequences in a tabular format.

In [None]:
# Example 12: Multiple Sequence Alignment (TabularMSA)

# Create a multiple sequence alignment
msa = TabularMSA([
    DNA('ATGCGATC-ATCGATCGATCG', metadata={'id': 'human'}),
    DNA('ATGCGATCGATCGATCGATCG', metadata={'id': 'chimp'}),
    DNA('ATGCGAT--ATCGATCGATCG', metadata={'id': 'gorilla'}),
    DNA('ATGCGATC-ATCGAT-GATCG', metadata={'id': 'orangutan'}),
])

print("Multiple Sequence Alignment:")
print(msa)

print(f"\nAlignment dimensions: {msa.shape[0]} sequences x {msa.shape[1]} positions")

# Access individual sequences
print(f"\nFirst sequence (human): {msa[0]}")

# Access columns (positions)
print(f"\nColumn 7 (all sequences): {msa[:, 7]}")

# Calculate conservation at each position
print("\nConservation analysis:")
for pos in range(min(10, msa.shape[1])):
    col = [str(seq)[pos] for seq in msa]
    unique = set(col)
    conserved = "*" if len(unique) == 1 else " "
    print(f"Position {pos}: {col} {conserved}")
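
The column-by-column check above can be extended to a majority-rule consensus. This is a plain-Python sketch over the same four aligned strings; `column_summary` is a hypothetical helper, and gaps are treated as ordinary characters, which is an assumption rather than a recommendation.

```python
from collections import Counter

aligned = [
    "ATGCGATC-ATCGATCGATCG",  # human
    "ATGCGATCGATCGATCGATCG",  # chimp
    "ATGCGAT--ATCGATCGATCG",  # gorilla
    "ATGCGATC-ATCGAT-GATCG",  # orangutan
]


def column_summary(rows):
    """Return (most common character, its fraction) for every column."""
    summary = []
    for column in zip(*rows):  # zip(*rows) iterates over alignment columns
        residue, count = Counter(column).most_common(1)[0]
        summary.append((residue, count / len(rows)))
    return summary


summary = column_summary(aligned)
consensus = "".join(residue for residue, _ in summary)
variable = [i for i, (_, fraction) in enumerate(summary) if fraction < 1.0]

print(f"Consensus:        {consensus}")
print(f"Variable columns: {variable}")
```

Only columns 7, 8, and 15 vary in this alignment, and all three differences come from gaps rather than substitutions.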

---
## Part 5: Distance Matrices
---

### Example 13: Creating Distance Matrices

Distance matrices are fundamental for clustering and phylogenetic analysis.

In [None]:
# Example 13: Creating Distance Matrices

# Create a distance matrix from raw data
ids = ['Sample_A', 'Sample_B', 'Sample_C', 'Sample_D', 'Sample_E']
data = np.array([
    [0.0, 0.2, 0.4, 0.6, 0.8],
    [0.2, 0.0, 0.3, 0.5, 0.7],
    [0.4, 0.3, 0.0, 0.4, 0.6],
    [0.6, 0.5, 0.4, 0.0, 0.3],
    [0.8, 0.7, 0.6, 0.3, 0.0]
])

dm = DistanceMatrix(data, ids=ids)

print("Distance Matrix:")
print(dm)

print(f"\nMatrix shape: {dm.shape}")
print(f"Sample IDs: {dm.ids}")

# Access specific distances
print(f"\nDistance between Sample_A and Sample_B: {dm['Sample_A', 'Sample_B']:.2f}")
print(f"Distance between Sample_A and Sample_E: {dm['Sample_A', 'Sample_E']:.2f}")
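
`DistanceMatrix` expects its input to be square, symmetric, and hollow (zeros on the diagonal). The sketch below checks those properties in plain Python before any library call; `check_distance_matrix` is a hypothetical helper and the tolerance is an arbitrary choice.

```python
def check_distance_matrix(matrix, tol=1e-12):
    """Return a list of problems; an empty list means the checks passed."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return ["matrix is not square"]
    problems = []
    for i in range(n):
        if abs(matrix[i][i]) > tol:
            problems.append(f"diagonal entry ({i}, {i}) is not zero")
        for j in range(i + 1, n):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                problems.append(f"entries ({i}, {j}) and ({j}, {i}) differ")
    return problems


data = [
    [0.0, 0.2, 0.4, 0.6, 0.8],
    [0.2, 0.0, 0.3, 0.5, 0.7],
    [0.4, 0.3, 0.0, 0.4, 0.6],
    [0.6, 0.5, 0.4, 0.0, 0.3],
    [0.8, 0.7, 0.6, 0.3, 0.0],
]
print(check_distance_matrix(data))              # []
print(check_distance_matrix([[0, 1], [2, 0]]))  # one asymmetry reported
```

Running a check like this first can make it easier to locate a typo in a hand-entered matrix.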

### Example 14: Visualizing Distance Matrices as Heatmaps

Heatmaps provide an intuitive visualization of pairwise distances.

In [None]:
# Example 14: Visualizing Distance Matrices as Heatmaps

# Convert to pandas DataFrame for seaborn
dm_df = dm.to_data_frame()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Standard heatmap
sns.heatmap(dm_df, annot=True, cmap='YlOrRd', fmt='.2f', 
            ax=axes[0], square=True, linewidths=0.5)
axes[0].set_title('Distance Matrix Heatmap', fontsize=14)

# Same matrix with a perceptually uniform colormap (viridis) for comparison.
# seaborn's clustermap, which reorders rows and columns, creates its own figure.
sns.heatmap(dm_df, annot=True, cmap='viridis', fmt='.2f',
            ax=axes[1], square=True, linewidths=0.5)
axes[1].set_title('Distance Matrix (Viridis colormap)', fontsize=14)

plt.tight_layout()
plt.show()

### Example 15: Computing Sequence Distances

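Calculate pairwise distances between aligned sequences, here as the uncorrected proportion of differing sites (the normalized Hamming distance, also called the p-distance).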

In [None]:
# Example 15: Computing Sequence Distances

# Function to calculate Hamming distance (proportion of differing positions)
def hamming_distance(seq1, seq2):
    """Calculate normalized Hamming distance between two sequences."""
    s1, s2 = str(seq1), str(seq2)
    if len(s1) != len(s2):
        raise ValueError("Sequences must be same length")
    differences = sum(a != b for a, b in zip(s1, s2))
    return differences / len(s1)

# Five ungapped, equal-length sequences (a new set, not the gapped MSA from Example 12)
msa_seqs = [
    DNA('ATGCGATCGATCGATCGATCG', metadata={'id': 'human'}),
    DNA('ATGCGATCGATCGATTGATCG', metadata={'id': 'chimp'}),
    DNA('ATGCGATCGATCAATCGATCG', metadata={'id': 'gorilla'}),
    DNA('ATGCGATCAATCGATCGATCG', metadata={'id': 'orangutan'}),
    DNA('ATGCAATCGATCGATCGATCG', metadata={'id': 'gibbon'}),
]

# Calculate pairwise distances
n = len(msa_seqs)
distances = np.zeros((n, n))
ids = [seq.metadata['id'] for seq in msa_seqs]

for i in range(n):
    for j in range(n):
        distances[i, j] = hamming_distance(msa_seqs[i], msa_seqs[j])

seq_dm = DistanceMatrix(distances, ids=ids)

print("Sequence Distance Matrix (Hamming distances):")
print(seq_dm)

# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(seq_dm.to_data_frame(), annot=True, cmap='Blues', fmt='.3f',
            square=True, linewidths=0.5)
plt.title('Pairwise Sequence Distances (Hamming)', fontsize=14)
plt.tight_layout()
plt.show()
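
The Hamming proportion is an uncorrected distance: it ignores multiple substitutions at the same site. One standard correction, which the notebook itself does not use, is the Jukes-Cantor (JC69) formula $d = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right)$. The sketch below applies it in plain Python; both function names are hypothetical.

```python
import math


def p_distance(s1: str, s2: str) -> float:
    """Proportion of differing sites between two equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("Sequences must be the same length")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def jukes_cantor(p: float) -> float:
    """JC69-corrected distance; undefined (returned as inf) once p reaches 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


human = "ATGCGATCGATCGATCGATCG"
chimp = "ATGCGATCGATCGATTGATCG"
p = p_distance(human, chimp)
d = jukes_cantor(p)
print(f"p-distance: {p:.4f}, JC69 distance: {d:.4f}")
```

For closely related sequences the two values are nearly identical; the correction grows as sequences diverge and saturation becomes likely.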

---
## Part 6: Phylogenetic Trees
---

### Example 16: Creating and Manipulating Phylogenetic Trees

TreeNode allows construction and manipulation of phylogenetic trees.

In [None]:
# Example 16: Creating and Manipulating Phylogenetic Trees

# Create a tree from Newick format
newick_string = "((human:0.1,chimp:0.1):0.1,gorilla:0.2,(orangutan:0.2,gibbon:0.3):0.1);"
tree = TreeNode.read(StringIO(newick_string))

print("Phylogenetic Tree from Newick:")
print(tree.ascii_art())

print(f"\nNumber of tips: {len(list(tree.tips()))}")
print(f"Tip names: {[tip.name for tip in tree.tips()]}")

# Calculate tree statistics
print(f"\nTotal branch length: {tree.descending_branch_length():.2f}")

# Find common ancestor
human_node = tree.find('human')
chimp_node = tree.find('chimp')
lca = tree.lowest_common_ancestor(['human', 'chimp'])
print(f"\nLowest common ancestor of human and chimp:")
print(f"  Children: {[c.name for c in lca.children]}")

### Example 17: Tree-Based Distance Calculations

Calculate phylogenetic distances between tips on a tree.

In [None]:
# Example 17: Tree-Based Distance Calculations

# Get tip-to-tip distances from the tree
tip_names = [tip.name for tip in tree.tips()]
n_tips = len(tip_names)

# Calculate pairwise tip distances using the tree's built-in method
# The tip_tip_distances method returns a proper distance matrix
tree_dm = tree.tip_tip_distances()

print("Tree-based Phylogenetic Distances:")
print(tree_dm)

# Visualize tree distances
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap of distances
sns.heatmap(tree_dm.to_data_frame(), annot=True, cmap='Greens', fmt='.2f',
            square=True, linewidths=0.5, ax=axes[0])
axes[0].set_title('Phylogenetic Distances', fontsize=14)

# Bar chart of distances from human
human_distances = {tip: tree_dm['human', tip] for tip in tree_dm.ids if tip != 'human'}
axes[1].barh(list(human_distances.keys()), list(human_distances.values()), 
             color=sns.color_palette('Greens_r', len(human_distances)))
axes[1].set_xlabel('Phylogenetic Distance', fontsize=12)
axes[1].set_title('Distance from Human', fontsize=14)

plt.tight_layout()
plt.show()
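
A tip-to-tip distance is the sum of branch lengths on the path joining two tips through their lowest common ancestor. The sketch below encodes the Newick tree from Example 16 as a child-to-parent map and recomputes a few distances by hand. The internal node labels `n1`, `n2`, and `root` are hypothetical names added for illustration.

```python
# child -> (parent, branch length), transcribed from the Newick string in Example 16
parent = {
    "human": ("n1", 0.1),
    "chimp": ("n1", 0.1),
    "n1": ("root", 0.1),
    "gorilla": ("root", 0.2),
    "orangutan": ("n2", 0.2),
    "gibbon": ("n2", 0.3),
    "n2": ("root", 0.1),
}


def distances_to_ancestors(node):
    """Map the node and each of its ancestors to the path length from the node."""
    dist = {node: 0.0}
    total = 0.0
    while node in parent:
        node, length = parent[node]
        total += length
        dist[node] = total
    return dist


def tip_distance(a, b):
    """Path length between two tips via their lowest common ancestor."""
    up_a = distances_to_ancestors(a)
    up_b = distances_to_ancestors(b)
    # Among shared ancestors, the lowest one gives the shortest combined path.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


for other in ["chimp", "gorilla", "orangutan", "gibbon"]:
    print(f"human - {other}: {tip_distance('human', other):.2f}")
```

These hand-computed values should match the `human` row of the matrix returned by `tree.tip_tip_distances()`.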

---
## Part 7: Diversity Metrics
---

### Example 18: Alpha Diversity Calculations

Alpha diversity measures species diversity within a single sample.

In [None]:
# Example 18: Alpha Diversity Calculations

# Create sample abundance data (counts of different species/OTUs)
# Rows are samples, columns are species/OTUs
sample_ids = ['Sample_1', 'Sample_2', 'Sample_3', 'Sample_4', 'Sample_5']
counts = np.array([
    [100, 50, 30, 20, 10, 5, 3, 2, 1, 0],   # Sample 1 - high diversity
    [150, 40, 25, 15, 8, 4, 2, 1, 0, 0],    # Sample 2 - moderate
    [200, 20, 5, 2, 1, 0, 0, 0, 0, 0],      # Sample 3 - low diversity
    [50, 50, 50, 50, 50, 10, 10, 10, 10, 10], # Sample 4 - even distribution
    [250, 1, 1, 1, 1, 0, 0, 0, 0, 0],       # Sample 5 - dominated by one species
])

# Calculate various alpha diversity metrics
print("Alpha Diversity Metrics:")
print("=" * 60)

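# Metric names vary between scikit-bio releases ('observed_otus' in particular may be
# renamed); if one is rejected, check skbio.diversity.get_alpha_diversity_metrics()
metrics = ['observed_otus', 'shannon', 'simpson', 'chao1']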
results = {}

for metric in metrics:
    diversity = alpha_diversity(metric, counts, ids=sample_ids)
    results[metric] = diversity
    print(f"\n{metric.upper()}:")
    for sample, value in zip(sample_ids, diversity):
        print(f"  {sample}: {value:.4f}")
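
Two of these metrics are simple enough to compute by hand, which helps when interpreting the numbers. The sketch below is plain Python, not the scikit-bio implementation. The logarithm base for Shannon entropy is passed explicitly because the library default may differ between versions, and Simpson's index is taken here as one minus the sum of squared proportions.

```python
import math


def shannon(counts, base=2.0):
    """Shannon entropy H = -sum(p * log(p)) over non-zero proportions."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)


def simpson(counts):
    """Simpson's index as 1 - sum(p^2): chance that two random draws differ."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


sample_4 = [50, 50, 50, 50, 50, 10, 10, 10, 10, 10]  # fairly even
sample_5 = [250, 1, 1, 1, 1, 0, 0, 0, 0, 0]          # dominated by one species

for name, counts in [("Sample_4", sample_4), ("Sample_5", sample_5)]:
    print(f"{name}: shannon={shannon(counts):.3f}, simpson={simpson(counts):.3f}")
```

Both indices are much higher for Sample_4 than for Sample_5, reflecting evenness as well as the number of species present.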

### Example 19: Visualizing Alpha Diversity

Create comprehensive visualizations of alpha diversity metrics.

In [None]:
# Example 19: Visualizing Alpha Diversity

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

colors = sns.color_palette('Set2', len(sample_ids))

# Plot each metric
for idx, (metric, values) in enumerate(results.items()):
    ax = axes[idx // 2, idx % 2]
    bars = ax.bar(sample_ids, values, color=colors, edgecolor='black', linewidth=1.5)
    ax.set_title(f'{metric.replace("_", " ").title()}', fontsize=14)
    ax.set_ylabel('Diversity Value', fontsize=12)
    ax.set_xlabel('Sample', fontsize=12)
    ax.tick_params(axis='x', rotation=45)
    
    # Add value labels
    for bar, val in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01 * max(values),
                f'{val:.2f}', ha='center', va='bottom', fontsize=10)

plt.suptitle('Alpha Diversity Metrics Comparison', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

### Example 20: Beta Diversity Calculations

Beta diversity measures differences in species composition between samples.

In [None]:
# Example 20: Beta Diversity Calculations

# Calculate beta diversity using different metrics
print("Beta Diversity Analysis:")
print("=" * 60)

# Bray-Curtis dissimilarity
bc_dm = beta_diversity('braycurtis', counts, ids=sample_ids)
print("\nBray-Curtis Dissimilarity:")
print(bc_dm)

# Jaccard distance
jaccard_dm = beta_diversity('jaccard', counts, ids=sample_ids)
print("\nJaccard Distance:")
print(jaccard_dm)
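
Both dissimilarities can be checked by hand for a single pair of samples. The sketch below uses Sample_1 and Sample_3 from Example 18 and assumes the usual definitions: Bray-Curtis as the sum of absolute differences over the sum of all counts, and Jaccard on presence/absence. The function names are hypothetical.

```python
def bray_curtis(u, v):
    """Sum of absolute count differences divided by the total count of both samples."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


def jaccard_presence_absence(u, v):
    """1 - |shared taxa| / |taxa present in either sample|."""
    present_u = {i for i, count in enumerate(u) if count > 0}
    present_v = {i for i, count in enumerate(v) if count > 0}
    union = present_u | present_v
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1 - len(present_u & present_v) / len(union)


sample_1 = [100, 50, 30, 20, 10, 5, 3, 2, 1, 0]
sample_3 = [200, 20, 5, 2, 1, 0, 0, 0, 0, 0]

print(f"Bray-Curtis: {bray_curtis(sample_1, sample_3):.4f}")
print(f"Jaccard:     {jaccard_presence_absence(sample_1, sample_3):.4f}")
```

Bray-Curtis responds to abundance differences, while this Jaccard variant only sees which taxa are present, so the two matrices can rank sample pairs differently.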

### Example 21: PCoA Ordination Analysis

Principal Coordinates Analysis for visualizing sample relationships.

In [None]:
# Example 21: PCoA Ordination Analysis

# Perform PCoA on Bray-Curtis distances
pcoa_results = pcoa(bc_dm)

print("PCoA Results:")
print(f"Proportion explained by PC1: {pcoa_results.proportion_explained.iloc[0]:.2%}")
print(f"Proportion explained by PC2: {pcoa_results.proportion_explained.iloc[1]:.2%}")

# Extract coordinates
coords = pcoa_results.samples

# Create PCoA plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# PCoA scatter plot
colors = sns.color_palette('husl', len(sample_ids))
for i, sample in enumerate(sample_ids):
    axes[0].scatter(coords.iloc[i, 0], coords.iloc[i, 1], 
                    c=[colors[i]], s=200, label=sample, edgecolor='black', linewidth=2)
    axes[0].annotate(sample, (coords.iloc[i, 0], coords.iloc[i, 1]),
                     xytext=(5, 5), textcoords='offset points', fontsize=10)

axes[0].set_xlabel(f'PC1 ({pcoa_results.proportion_explained.iloc[0]:.1%})', fontsize=12)
axes[0].set_ylabel(f'PC2 ({pcoa_results.proportion_explained.iloc[1]:.1%})', fontsize=12)
axes[0].set_title('PCoA of Bray-Curtis Distances', fontsize=14)
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=0, color='gray', linestyle='--', alpha=0.5)

# Scree plot (labels are attached to each artist so the legend cannot be mismatched)
explained = pcoa_results.proportion_explained.iloc[:5] * 100
axis_numbers = range(1, len(explained) + 1)
axes[1].bar(axis_numbers, explained, color='steelblue', edgecolor='black', label='Individual')
axes[1].plot(axis_numbers, np.cumsum(explained), 'ro-', linewidth=2, markersize=8,
             label='Cumulative')
axes[1].set_xlabel('Principal Coordinate', fontsize=12)
axes[1].set_ylabel('Variance Explained (%)', fontsize=12)
axes[1].set_title('Scree Plot', fontsize=14)
axes[1].legend(loc='right')

plt.tight_layout()
plt.show()

---
## Part 8: Statistical Analysis
---

### Example 22: PERMANOVA Analysis

Test if groups of samples differ significantly in their composition.

In [None]:
# Example 22: PERMANOVA Analysis

# Create the grouping variable as a DataFrame indexed by sample ID
# (permanova also accepts a 1-D array-like of group labels in sample order)
grouping_df = pd.DataFrame({
    'Group': ['Treatment', 'Treatment', 'Control', 'Control', 'Treatment']
}, index=sample_ids)

print("PERMANOVA Analysis:")
print("=" * 60)
print(f"\nGrouping:")
for sample in sample_ids:
    print(f"  {sample}: {grouping_df.loc[sample, 'Group']}")

# Perform PERMANOVA
permanova_results = permanova(bc_dm, grouping_df, column='Group', permutations=999)

print(f"\nPERMANOVA Results:")
print(f"  Test statistic: {permanova_results['test statistic']:.4f}")
print(f"  p-value: {permanova_results['p-value']:.4f}")
print(f"  Number of permutations: {permanova_results['number of permutations']}")

if permanova_results['p-value'] < 0.05:
    print("\n  Interpretation: Significant difference between groups (p < 0.05)")
else:
    print("\n  Interpretation: No significant difference between groups (p >= 0.05)")
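
With only five samples, the permutation test cannot reach p < 0.05 no matter how strong the separation is, because there are very few distinct ways to assign the labels. The sketch below counts them in plain Python; `distinct_labelings` is a hypothetical helper, and the resulting bound on the p-value is approximate.

```python
from math import comb


def distinct_labelings(group_sizes):
    """Number of distinct ways to assign samples to groups of the given sizes."""
    remaining = sum(group_sizes)
    ways = 1
    for size in group_sizes:
        ways *= comb(remaining, size)
        remaining -= size
    return ways


ways = distinct_labelings([3, 2])  # 3 Treatment samples, 2 Control samples
print(f"Distinct labelings: {ways}")
print(f"Approximate smallest attainable p-value: {1 / ways:.2f}")
```

About one random permutation in ten reproduces the observed grouping, so the reported p-value stays near 0.1 or above. The example is useful for showing the API, but a real study needs more samples per group.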

### Example 23: ANOSIM Analysis

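ANOSIM (Analysis of Similarities) is another permutation test for group differences. Its R statistic compares the ranks of between-group dissimilarities with the ranks of within-group dissimilarities.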

In [None]:
# Example 23: ANOSIM Analysis

print("ANOSIM Analysis:")
print("=" * 60)

# Perform ANOSIM (using the same grouping DataFrame from Example 22)
anosim_results = anosim(bc_dm, grouping_df, column='Group', permutations=999)

print(f"\nANOSIM Results:")
print(f"  R statistic: {anosim_results['test statistic']:.4f}")
print(f"  p-value: {anosim_results['p-value']:.4f}")
print(f"  Number of permutations: {anosim_results['number of permutations']}")

print(f"\n  R statistic interpretation:")
print(f"    R = 1: Complete separation between groups")
print(f"    R = 0: No separation (random grouping)")
print(f"    R < 0: More similarity between groups than within")

---
## Part 9: Advanced Examples
---

### Example 24: GC Content Analysis Across Sequences

Analyze GC content distribution across multiple sequences with visualization.

In [None]:
# Example 24: GC Content Analysis Across Sequences

# Generate synthetic sequences with varying GC content
np.random.seed(42)

def generate_sequence(length, gc_bias=0.5):
    """Generate a random DNA sequence with specified GC bias."""
    gc_prob = gc_bias / 2
    at_prob = (1 - gc_bias) / 2
    probs = [at_prob, gc_prob, gc_prob, at_prob]  # A, C, G, T
    bases = np.random.choice(['A', 'C', 'G', 'T'], size=length, p=probs)
    return DNA(''.join(bases))

# Generate sequences from different "organisms"
organisms = {
    'E. coli': 0.51,
    'S. cerevisiae': 0.38,
    'Human': 0.41,
    'P. falciparum': 0.19,
    'M. tuberculosis': 0.66
}

# Generate 20 sequences per organism
sequences_data = []
for org, gc_target in organisms.items():
    for i in range(20):
        seq = generate_sequence(1000, gc_bias=gc_target)
        sequences_data.append({
            'organism': org,
            'sequence_id': f"{org}_{i+1}",
            'gc_content': seq.gc_content(),
            'length': len(seq)
        })

gc_df = pd.DataFrame(sequences_data)

print("GC Content Summary by Organism:")
print(gc_df.groupby('organism')['gc_content'].agg(['mean', 'std', 'min', 'max']).round(3))
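
The spread in the summary table can be predicted. If each base is drawn independently, as in `generate_sequence`, the GC fraction of a length-$n$ sequence has standard deviation $\sqrt{p(1-p)/n}$, where $p$ is the target GC fraction. The sketch below evaluates that formula in plain Python; `gc_standard_deviation` is a hypothetical helper name.

```python
import math


def gc_standard_deviation(gc_fraction: float, length: int) -> float:
    """Binomial standard deviation of the GC fraction for independent bases."""
    return math.sqrt(gc_fraction * (1 - gc_fraction) / length)


targets = {
    "E. coli": 0.51,
    "S. cerevisiae": 0.38,
    "Human": 0.41,
    "P. falciparum": 0.19,
    "M. tuberculosis": 0.66,
}

for organism, gc in targets.items():
    print(f"{organism:16s} expected std: {gc_standard_deviation(gc, 1000):.4f}")
```

Each expected value is well under 0.02 for 1000 bp sequences, so the simulated organisms form tight, well separated clusters. Real genomes show more local variation than this independent-base model produces.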

In [None]:
# Visualize GC content distribution
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Box plot
colors = sns.color_palette('Set2', len(organisms))
sns.boxplot(data=gc_df, x='organism', y='gc_content', palette='Set2', ax=axes[0, 0])
axes[0, 0].set_xlabel('Organism', fontsize=12)
axes[0, 0].set_ylabel('GC Content', fontsize=12)
axes[0, 0].set_title('GC Content Distribution by Organism', fontsize=14)
axes[0, 0].tick_params(axis='x', rotation=45)

# Violin plot
sns.violinplot(data=gc_df, x='organism', y='gc_content', palette='Set2', ax=axes[0, 1])
axes[0, 1].set_xlabel('Organism', fontsize=12)
axes[0, 1].set_ylabel('GC Content', fontsize=12)
axes[0, 1].set_title('GC Content Violin Plot', fontsize=14)
axes[0, 1].tick_params(axis='x', rotation=45)

# KDE (kernel density estimate) curve for each organism
for org in organisms.keys():
    org_data = gc_df[gc_df['organism'] == org]['gc_content']
    sns.kdeplot(org_data, label=org, ax=axes[1, 0], linewidth=2)
axes[1, 0].set_xlabel('GC Content', fontsize=12)
axes[1, 0].set_ylabel('Density', fontsize=12)
axes[1, 0].set_title('GC Content Density by Organism', fontsize=14)
axes[1, 0].legend()

# Strip plot with mean markers
sns.stripplot(data=gc_df, x='organism', y='gc_content', palette='Set2', 
              alpha=0.6, ax=axes[1, 1])
# Add mean markers
means = gc_df.groupby('organism')['gc_content'].mean()
for i, org in enumerate(gc_df['organism'].unique()):
    axes[1, 1].scatter(i, means[org], color='red', s=100, marker='D', 
                       zorder=3, edgecolor='black', linewidth=2)
axes[1, 1].set_xlabel('Organism', fontsize=12)
axes[1, 1].set_ylabel('GC Content', fontsize=12)
axes[1, 1].set_title('GC Content with Mean (red diamond)', fontsize=14)
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

### Example 25: Comprehensive Sequence Analysis Pipeline

A complete workflow demonstrating multiple scikit-bio features together.

In [None]:
# Example 25: Comprehensive Sequence Analysis Pipeline

print("="*70)
print("COMPREHENSIVE BIOINFORMATICS ANALYSIS PIPELINE")
print("="*70)

# Step 1: Read sequences (simulated FASTA)
fasta_input = """>species_A gene=rbcL organism=PlantA
ATGTCACCACAAACAGAGACTAAAGCAAGTGTTGGATTCAAAGCTGGTGTTAAAGATTACAAATTGACTTATTATACTCCTGACTATGAAACCAAAGATACTGATATCTTGGCAGCATTCCGAGTAACTCCTCAACCTGGAGTTCCACCTGAAGAAGCAGGGGCTGCGGTAGCTGCCGAATCTTCTACTGGTACATGGACAACTGTGTGGACCGATGGGCTTACCAGTCTTGATCGTTACAAAGGACGATGCTACCACATCGAGCCCGTTGCTGGAGAAGAAAATCAATTTATTGCTTATGTAGCTTACCCATTAGACCTTTTTGAAGAAGGTTCTGTTACTAACATGTTTACTTCCATTGTGGGTAATGTATTTGGGTTCAAAGCCTTGCGCGCTCTACGTCTGGAAGATCTGCGAATTCCCCCTGCTTATTCAAAAACTTTCCAAGGTCCGCCTCACGGCATCCAAGTTGAAAGAGATAAATTGAACAAGTATGGTCGTCCCCTATTGGGATGTACTATTAAACCAAAATTGGGGTTATCCGCTAAGAATTACGGTAGAGCTGTTTATGAATGTCTGCGCGGTGGACTTGATTTTACCAAAGATGATGAAAACGTGAACTCACAACCATTTATGCGTTGGAGAGATCGTTTCTTATTTTGTGCCGAAGCTATTTATAAATCACAGGCTGAAACAGGTGAAATCAAAGGGCATTACTTGAATGCTACTGCAGGTACATGCGAAGAAATGATCAAAAGGGCTGTATTTGCTAGAGAATTGGGAGTTCCTATCGTAATGCATGACTACTTAACAGGGGGATTCACTGCAAATACTAGTTTGGCTCATTATTGCCGAGATAATGGCCTACTTCTTCACATCCACCGTGCAATGCATGCAGTTATTGATAGACAGAAGAATCATGGTATGCACTTTCGTGTACTAGCTAAAGCTTTACGTATGTCAGGTGGAGATCATATTCACGCTGGTACAGTAGTAGGTAAACTTGAAGGAGAAAGAGAAATTACTTTGGGCTTTGTTGATTTACTGCGTGATGATTATGTTGAAAAAGATCGAAGTCGCGGTATTTTTTTCACTCAAGATTGGGTCTCTTTACCAGGAGTCTTGACTGTATTGCCTAATTTGATCGGTTATGAAAACAACGTGTTAACCAATTGCTGCGTTATTCATCGTCTTGAACGTGAACATCAACAAACTTGATAATAG
>species_B gene=rbcL organism=PlantB
ATGTCACCACAAACAGAGACTAAAGCAAGTGTTGGATTCAAAGCTGGTGTTAAAGATTACAAATTGACTTATTATACTCCTGACTACGAAACCAAAGATACTGATATCTTGGCAGCATTCCGAGTAACTCCTCAACCTGGAGTTCCACCTGAAGAAGCAGGGGCTGCGGTAGCTGCCGAATCTTCTACTGGTACATGGACAACTGTATGGACCGATGGGCTTACCAGTCTTGATCGTTACAAAGGACGATGCTACCATATCGAGCCCGTTGCTGGAGAAGAAAATCAATTTATTGCTTATGTAGCTTACCCATTAGACCTTTTTGAAGAAGGTTCTGTTACTAACATGTTTACTTCCATTGTAGGTAATGTATTTGGGTTCAAAGCCTTGCGCGCTCTACGTCTGGAAGATCTGCGAATTCCCCCTGCTTATTCAAAAACTTTCCAAGGTCCGCCTCACGGAATCCAAGTTGAAAGAGATAAATTGAACAAGTATGGTCGTCCCCTATTGGGATGTACTATTAAACCAAAATTGGGGTTATCCGCTAAGAATTACGGTAGAGCTGTTTATGAATGTCTGCGCGGTGGACTTGATTTTACCAAAGATGATGAAAACGTGAACTCACAACCATTTATGCGTTGGAGAGATCGTTTCTTATTTTGTGCCGAAGCAATTTATAAATCACAGGCTGAAACAGGTGAAATCAAAGGGCATTACTTGAATGCTACTGCAGGTACATGCGAAGAAATGATCAAAAGGGCTGTATTTGCTAGAGAATTGGGAGTCCCTATCGTAATGCATGACTACTTAACAGGGGGATTCACTGCAAATACTAGTTTGGCTCATTATTGCCGAGATAATGGTCTACTTCTTCACATCCACCGTGCAATGCATGCAGTTATTGATAGACAGAAGAATCATGGTATGCACTTTCGTGTACTAGCTAAAGCTTTACGTATGTCAGGTGGAGATCATATTCACGCTGGTACAGTAGTAGGTAAACTTGAAGGAGAAAGAGAAATTACTTTGGGCTTTGTGGATTTACTGCGTGATGATTATGTTGAAAAAGATCGAAGTCGCGGTATTTTTTTCACTCAAGATTGGGTCTCTTTACCAGGAGTCTTAACTGTATTGCCTAATTTGATCGGTTATGAAAACAACGTGTTAACCAATTGCTGCGTTATTCATCGTCTTGAACGTGAACATCAACAAACTTGATAATAG
>species_C gene=rbcL organism=PlantC
ATGTCACCACAAACAGAGACTAAAGCAAGTGTTGGATTCAAAGCTGGTGTTAAAGATTACAAATTGACTTATTATACTCCTGACTATGAAACCAAAGATTCTGATATCTTGGCAGCATTCCGAGTAACTCCTCAACCTGGAGTTCCACCTGAAGAAGCAGGGGCTGCAGTAGCTGCCGAATCTTCTACTGGTACATGGACAACTGTGTGGACCGATGGGCTTACCAGTCTTGATCGTTACAAAGGACGATGCTACCACATCGAGCCCGTTGCTGGAGAAGAAAATCAATATATTGCTTATGTAGCTTACCCATTAGACCTTTTTGAAGAAGGTTCTGTTACTAACATGTTTACTTCCATTGTGGGTAATGTATTTGGGTTCAAAGCCTTGCGCGCTCTACGTCTGGAAGATCTGCGAATTCCCCCTGCTTATTCTAAAACTTTCCAAGGTCCGCCTCATGGCATCCAAGTTGAAAGAGATAAATTGAACAAGTATGGACGTCCCCTATTGGGATGTACTATTAAACCGAAATTGGGGTTATCCGCTAAGAATTACGGTAGAGCTGTTTATGAATGTCTACGCGGTGGACTTGATTTTACCAAAGATGATGAAAACGTGAACTCACAACCATTTATGCGTTGGAGAGATCGTTTCTTATTTTGTGCCGAAGCTATTTATAAATCACAGGCTGAAACAGGTGAAATCAAAGGGCATTACTTGAATGCTACTGCAGGTACATGCGAAGAAATGATCAAAAGGGCTGTATTTGCTAGAGAATTGGGAGTTCCTATCGTAATGCATGACTACTTAACAGGGGGATTCACCGCAAATACTAGTTTGGCTCATTATTGCCGAGATAATGGCCTACTTCTTCACATCCACCGTGCAATGCATGCAGTTATTGATAGACAGAAGAATCATGGTATGCACTTTCGTGTACTAGCTAAAGCTTTACGTATGTCAGGTGGAGATCATATTCACGCTGGTACAGTAGTAGGTAAACTTGAAGGAGAAAGAGAAATCACTTTGGGCTTTGTTGATTTACTGCGTGATGATTATGTTGAAAAAGATCGAAGTCGCGGTATTTTTTTCACTCAAGATTGGGTCTCTTTACCAGGAGTCTTGACTGTATTGCCTAATTTGATCGGTTATGAAAACAACGTGTTAACCAATTGCTGCGTTATTCATCGTCTTGAACGTGAACATCAACAAACTTGATAATAG
"""

# Parse sequences
sequences = list(skbio_io.read(StringIO(fasta_input), format='fasta', constructor=DNA))
print("\n1. SEQUENCE INPUT")
print(f"   Loaded {len(sequences)} sequences")
for seq in sequences:
    print(f"   - {seq.metadata['id']}: {len(seq)} bp, GC={seq.gc_content():.1%}")
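
The loop above prints only `seq.metadata['id']`. In scikit-bio's FASTA reader the ID is the header text up to the first whitespace, and the rest of the header (here the `gene=` and `organism=` annotations) is stored under `metadata['description']`. The split can be illustrated in plain Python; `split_fasta_header` is a hypothetical helper written for this sketch, not a scikit-bio function, and the `key=value` parsing is an assumption about how these particular headers are laid out.

```python
def split_fasta_header(header_line):
    """Split a FASTA header into (id, description, annotations).

    Hypothetical helper for illustration only: the ID is everything up to
    the first whitespace, the description is whatever follows.
    """
    text = header_line.lstrip('>').strip()
    parts = text.split(None, 1)  # split once on any whitespace
    seq_id = parts[0] if parts else ''
    description = parts[1] if len(parts) > 1 else ''
    # Assumption: annotations appear as whitespace-separated key=value tokens
    annotations = dict(
        token.split('=', 1) for token in description.split() if '=' in token
    )
    return seq_id, description, annotations


seq_id, description, annotations = split_fasta_header(
    '>species_B gene=rbcL organism=PlantB'
)
print(seq_id)
print(annotations)
```

Keeping the ID and the description apart is what lets the later steps use short labels such as `species_B` for matrix rows and plot ticks.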

In [None]:
# Step 2: Pairwise distance calculation (no alignment is performed here;
# sequences are compared position by position)
print("\n2. PAIRWISE DISTANCE CALCULATION")

n = len(sequences)
distances = np.zeros((n, n))
ids = [seq.metadata['id'].split()[0] for seq in sequences]

for i in range(n):
    for j in range(i+1, n):
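        # Calculate simple p-distance (proportion of differing sites).
        # Truncating to the shorter length is only meaningful if the
        # sequences are already positionally aligned and ungapped.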
        s1, s2 = str(sequences[i]), str(sequences[j])
        min_len = min(len(s1), len(s2))
        diff = sum(a != b for a, b in zip(s1[:min_len], s2[:min_len]))
        dist = diff / min_len
        distances[i, j] = dist
        distances[j, i] = dist

seq_dm = DistanceMatrix(distances, ids=ids)
print("   Calculated pairwise distances:")
for i in range(n):
    for j in range(i+1, n):
        print(f"   {ids[i]} vs {ids[j]}: {seq_dm[i, j]:.4f}")
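
The p-distance used above is the number of mismatched sites divided by the number of sites compared, which is the same quantity as a normalized Hamming distance. A stricter variant can be sketched in plain Python: instead of silently truncating to the shorter sequence, it refuses unequal lengths. `p_distance` is a hypothetical helper name chosen for this sketch.

```python
def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError(
            f"Sequences must have equal length ({len(seq_a)} != {len(seq_b)}); "
            "align them first."
        )
    if not seq_a:
        raise ValueError("Cannot compute a distance for empty sequences.")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


# Two mismatches (positions 4 and 8) out of eight sites
print(p_distance('ATGCATGC', 'ATGAATGT'))  # 0.25
```

Raising an error makes a length mismatch visible early, which matters because an insertion or deletion shifts every downstream site and inflates a position-by-position distance.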

In [None]:
# Step 3: Translation analysis
print("\n3. TRANSLATION ANALYSIS")
proteins = []
for seq in sequences:
    protein = seq.translate()
    proteins.append(protein)
    print(f"   {seq.metadata['id'].split()[0]}:")
    print(f"     Protein length: {len(protein)} aa")
    print(f"     First 30 aa: {str(protein)[:30]}...")
    # Count amino acid composition ('*' entries, if present, are stop
    # codons rather than amino acids)
    aa_freq = protein.frequencies()
    most_common = sorted(aa_freq.items(), key=lambda x: x[1], reverse=True)[:3]
    print(f"     Most common AAs: {most_common}")
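
Translation reads the DNA in non-overlapping triplets (codons) and writes a stop codon as `*`, so a coding sequence that includes its stop codon produces a protein string containing `*`. The plain-Python sketch below shows the codon split and a cut at the first in-frame stop. It assumes reading frame 1 and the standard genetic code's stop codons; `codons` and `coding_codons` are hypothetical helper names.

```python
STOP_CODONS = {'TAA', 'TAG', 'TGA'}  # stop codons of the standard genetic code


def codons(dna):
    """Return the complete, non-overlapping codons in reading frame 1."""
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def coding_codons(dna):
    """Return codons up to, but not including, the first in-frame stop."""
    result = []
    for codon in codons(dna):
        if codon in STOP_CODONS:
            break
        result.append(codon)
    return result


example = 'ATGTCACCATGATAATAG'  # toy sequence, not taken from the data above
print(codons(example))
print(coding_codons(example))
```

Counting amino acids on the truncated codon list (or filtering `*` out of the frequency table) keeps stop symbols from appearing in composition statistics.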

In [None]:
# Step 4: Final visualization summary
print("\n4. GENERATING SUMMARY VISUALIZATION")

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# A: GC Content comparison
gc_values = [seq.gc_content() for seq in sequences]
colors = sns.color_palette('viridis', len(sequences))
bars = axes[0, 0].bar(ids, gc_values, color=colors, edgecolor='black', linewidth=1.5)
axes[0, 0].axhline(y=np.mean(gc_values), color='red', linestyle='--', label=f'Mean: {np.mean(gc_values):.1%}')
axes[0, 0].set_xlabel('Species', fontsize=12)
axes[0, 0].set_ylabel('GC Content', fontsize=12)
axes[0, 0].set_title('GC Content by Species', fontsize=14)
axes[0, 0].legend()
for bar, val in zip(bars, gc_values):
    axes[0, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
                    f'{val:.1%}', ha='center', fontsize=11)

# B: Distance matrix heatmap
sns.heatmap(seq_dm.to_data_frame(), annot=True, cmap='YlOrRd', fmt='.4f',
            square=True, linewidths=0.5, ax=axes[0, 1])
axes[0, 1].set_title('Pairwise Sequence Distances', fontsize=14)

# C: Sequence length comparison
lengths = [len(seq) for seq in sequences]
axes[1, 0].barh(ids, lengths, color=colors, edgecolor='black', linewidth=1.5)
axes[1, 0].set_xlabel('Sequence Length (bp)', fontsize=12)
axes[1, 0].set_title('Sequence Lengths', fontsize=14)
# Label each bar with its length (bar i sits at y-position i)
for i, length in enumerate(lengths):
    axes[1, 0].text(length + 10, i, str(length), va='center', fontsize=11)

# D: Nucleotide composition stacked bar
nt_data = []
for seq in sequences:
    freq = seq.frequencies()
    total = len(seq)
    nt_data.append({
        'A': freq.get('A', 0) / total,
        'T': freq.get('T', 0) / total,
        'G': freq.get('G', 0) / total,
        'C': freq.get('C', 0) / total
    })

nt_df = pd.DataFrame(nt_data, index=ids)
nt_df.plot(kind='bar', stacked=True, ax=axes[1, 1], 
           color=['#e74c3c', '#f39c12', '#2ecc71', '#3498db'],
           edgecolor='black', linewidth=0.5)
axes[1, 1].set_xlabel('Species', fontsize=12)
axes[1, 1].set_ylabel('Proportion', fontsize=12)
axes[1, 1].set_title('Nucleotide Composition', fontsize=14)
axes[1, 1].legend(title='Nucleotide', bbox_to_anchor=(1.02, 1))
axes[1, 1].tick_params(axis='x', rotation=0)

plt.suptitle('Comprehensive Sequence Analysis Summary', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

print("\n" + "="*70)
print("ANALYSIS COMPLETE")
print("="*70)
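
Panels A and D are both built from per-base counts. As a sanity check on that arithmetic, the same numbers can be derived in plain Python with `collections.Counter`: the four proportions should sum to 1 for an unambiguous sequence, and the G and C shares together give the GC fraction. `base_composition` is a hypothetical helper, and the toy sequence is not part of the dataset above.

```python
from collections import Counter


def base_composition(dna):
    """Return the proportion of A, T, G and C in a DNA string."""
    if not dna:
        raise ValueError("Cannot compute composition of an empty sequence.")
    counts = Counter(dna.upper())
    total = len(dna)
    return {base: counts.get(base, 0) / total for base in 'ATGC'}


composition = base_composition('ATGCGGCCAT')
gc_fraction = composition['G'] + composition['C']
print(composition)
print(f"GC fraction: {gc_fraction:.1%}")
```

If a sequence contains gaps or degenerate characters, the four proportions sum to less than 1 because the denominator is the full sequence length, and the stacked bars in panel D would fall short of the top of the axis for the same reason.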

---
## Summary

This notebook covered 25 examples demonstrating scikit-bio's capabilities:

| Example | Topic | Key Functions |
|---------|-------|---------------|
| 1-3 | DNA Sequences | `DNA()`, `gc_content()`, `transcribe()`, `translate()` |
| 4-6 | RNA/Protein & Visualization | `RNA()`, `Protein()`, `frequencies()`, matplotlib/seaborn |
| 7-9 | File I/O | `skbio.io.read()`, `skbio.io.write()`, FASTA, FASTQ |
| 10-12 | Sequence Comparison | Sequence identity, conserved regions, `TabularMSA` |
| 13-15 | Distance Matrices | `DistanceMatrix()`, heatmaps, Hamming distance |
| 16-17 | Phylogenetic Trees | `TreeNode`, Newick format, tree distances |
| 18-21 | Diversity Metrics | `alpha_diversity()`, `beta_diversity()`, `pcoa()` |
| 22-23 | Statistical Analysis | `permanova()`, `anosim()` |
| 24-25 | Advanced Analysis | GC content analysis, comprehensive pipelines |

### Further Resources
- [scikit-bio Documentation](http://scikit-bio.org/)
- [scikit-bio GitHub](https://github.com/biocore/scikit-bio)
- [scikit-bio Tutorials](http://scikit-bio.org/docs/latest/index.html)

---