# BioPython Tutorial: 25 Examples for Learning

This notebook contains 25 hands-on examples to help you learn BioPython, a Python library for computational biology and bioinformatics.

## Prerequisites
Make sure BioPython is installed: `pip install biopython`

In [None]:
# Uncomment to install BioPython if needed
# !pip install biopython

---
## Part 1: Working with Sequences (Examples 1-7)

### Example 1: Creating a Seq Object
The `Seq` object is the fundamental sequence type in BioPython.

In [None]:
from Bio.Seq import Seq

# Create a DNA sequence
my_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(f"DNA Sequence: {my_dna}")
print(f"Type: {type(my_dna)}")
print(f"Length: {len(my_dna)} nucleotides")

### Example 2: Sequence Properties and Counting
Analyze nucleotide composition of a sequence.

In [None]:
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Count individual nucleotides
print("Nucleotide Counts:")
print(f"  Adenine (A):  {dna.count('A')}")
print(f"  Thymine (T):  {dna.count('T')}")
print(f"  Guanine (G):  {dna.count('G')}")
print(f"  Cytosine (C): {dna.count('C')}")

# Calculate GC content manually
gc_count = dna.count('G') + dna.count('C')
gc_content = (gc_count / len(dna)) * 100
print(f"\nGC Content: {gc_content:.2f}%")
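
The four `count()` calls above each scan the sequence once. As a minimal sketch on a plain string (not a `Seq` object), `collections.Counter` tallies every symbol in a single pass and also exposes characters you did not expect, such as `N`. The variable names here are illustrative only.

```python
from collections import Counter

# Same sequence as in Example 2, held as a plain Python string
dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"

counts = Counter(dna)  # one pass over the sequence
gc_percent = 100 * (counts["G"] + counts["C"]) / len(dna)

# Anything outside A/C/G/T (for example N or lowercase letters) ends up here
unexpected = sorted(set(counts) - set("ACGT"))

print(dict(counts))
print(f"GC content: {gc_percent:.2f}%")
print(f"Unexpected symbols: {unexpected}")
```

Checking `unexpected` before computing GC content is a cheap way to catch ambiguous bases early.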

### Example 3: Slicing and Indexing Sequences
Access parts of a sequence using Python slicing.

In [None]:
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Indexing (0-based)
print(f"First nucleotide: {dna[0]}")
print(f"Last nucleotide: {dna[-1]}")

# Slicing
print(f"\nFirst 10 nucleotides: {dna[:10]}")
print(f"Last 10 nucleotides: {dna[-10:]}")
print(f"Nucleotides 5-15: {dna[5:15]}")

# Every third nucleotide (codon positions)
print(f"\nFirst position of each codon: {dna[0::3]}")
print(f"Second position of each codon: {dna[1::3]}")
print(f"Third position of each codon: {dna[2::3]}")

### Example 4: Complement and Reverse Complement
Essential operations for working with double-stranded DNA.

In [None]:
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATG")

print(f"Original (5'->3'):           {dna}")
print(f"Complement (3'->5'):         {dna.complement()}")
print(f"Reverse Complement (5'->3'): {dna.reverse_complement()}")

# Visual representation of double-stranded DNA
print("\nDouble-stranded DNA:")
print(f"5'-{dna}-3'")
print(f"3'-{dna.complement()}-5'")
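
To see what these two methods compute, here is a plain-Python sketch of the same operations on strings. It is not Biopython's implementation and handles only unambiguous A/C/G/T; the helper names are made up for this illustration.

```python
# Map each base to its Watson-Crick partner (unambiguous bases only)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(seq: str) -> str:
    """Pair every base with its partner, keeping the original order."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq: str) -> str:
    """Complement, then reverse, so the result reads 5'->3' again."""
    return complement(seq)[::-1]

dna = "ATGGCCATTGTAATG"
print(complement(dna))          # TACCGGTAACATTAC
print(reverse_complement(dna))  # CATTACAATGGCCAT

# A site that equals its own reverse complement is palindromic, e.g. GAATTC
print(reverse_complement("GAATTC") == "GAATTC")  # True
```

Applying `reverse_complement` twice returns the original sequence, which makes a handy sanity check.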

### Example 5: Transcription (DNA to RNA)
Convert DNA to messenger RNA.

In [None]:
from Bio.Seq import Seq

# Template strand (3' to 5') - used by RNA polymerase
template_strand = Seq("TACGGTAACATTACCCGGCGACTT")

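# Coding strand (5' to 3') - same sequence as mRNA (with T instead of U).
# The template above is written 3'->5', so its base-by-base complement
# already reads 5'->3'; reverse_complement() would flip it the wrong way.
coding_strand = template_strand.complement()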

print(f"Template strand (3'->5'): {template_strand}")
print(f"Coding strand (5'->3'):   {coding_strand}")

# Transcription replaces T with U
mrna = coding_strand.transcribe()
print(f"mRNA (5'->3'):            {mrna}")

# You can also go back from RNA to DNA
back_to_dna = mrna.back_transcribe()
print(f"\nBack to DNA: {back_to_dna}")
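
Strand orientation is the usual source of confusion here. The following sketch uses plain strings (no Biopython) to show how a template written 3'->5' relates to the coding strand and the mRNA. The variable names are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Template strand written 3'->5', as in Example 5
template_3to5 = "TACGGTAACATTACCCGGCGACTT"

# Pairing base by base gives the coding strand, already in 5'->3' order
coding_5to3 = template_3to5.translate(COMPLEMENT)

# The mRNA carries the coding-strand sequence with U in place of T
mrna = coding_5to3.replace("T", "U")

# If the same template were stored 5'->3', a reverse complement would be needed
template_5to3 = template_3to5[::-1]
same_result = template_5to3.translate(COMPLEMENT)[::-1] == coding_5to3

print(coding_5to3)  # ATGCCATTGTAATGGGCCGCTGAA
print(mrna)         # AUGCCAUUGUAAUGGGCCGCUGAA
print(same_result)  # True
```

The coding strand starts with `ATG`, which is a quick check that the orientation is right.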

### Example 6: Translation (DNA/RNA to Protein)
Translate nucleotide sequences to amino acids.

In [None]:
from Bio.Seq import Seq

# A coding DNA sequence (CDS)
cds = Seq("ATGGCCATTGTAATGGGCCGCTGA")
print(f"DNA: {cds}")

# Translate to protein; "*" marks a stop codon
protein = cds.translate()
print(f"Protein: {protein}")

# Stop at the first in-frame stop codon (the "*" is not included)
protein_to_stop = cds.translate(to_stop=True)
print(f"Protein (to stop): {protein_to_stop}")

# Use a different codon table; in the vertebrate mitochondrial code
# TGA encodes tryptophan (W) instead of a stop
mito_protein = cds.translate(table="Vertebrate Mitochondrial")
print(f"Mito translation: {mito_protein}")

# Show available codon tables
from Bio.Data import CodonTable
print("\nAvailable codon tables:")
for name in list(CodonTable.unambiguous_dna_by_name.keys())[:5]:
    print(f"  - {name}")
print("  ... and more")
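
Translation is a codon-by-codon table lookup. The sketch below mimics the effect of `to_stop` with a deliberately tiny table that covers only the codons in the Example 6 sequence (standard genetic code). It is an illustration, not how Biopython implements `translate()`, and the names `CODONS` and `translate` are local to this sketch.

```python
# Partial codon table: only the codons that occur in the Example 6 CDS
CODONS = {
    "ATG": "M", "GCC": "A", "ATT": "I", "GTA": "V",
    "GGC": "G", "CGC": "R", "TGA": "*",
}

def translate(dna: str, to_stop: bool = False) -> str:
    """Translate complete codons; optionally stop before the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODONS[dna[i:i + 3]]  # KeyError for codons outside the demo table
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)

cds = "ATGGCCATTGTAATGGGCCGCTGA"
print(translate(cds))                # MAIVMGR*
print(translate(cds, to_stop=True))  # MAIVMGR
```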

### Example 7: GC Content with SeqUtils
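Use `gc_fraction` from `Bio.SeqUtils` instead of counting by hand. It returns a fraction between 0 and 1 and is available in Biopython 1.80 and later.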

In [None]:
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction

sequences = [
    ("Human beta-globin", Seq("ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG")),
    ("E. coli lacZ", Seq("ATGACCATGATTACGGATTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAAC")),
    ("AT-rich example", Seq("ATAATATATAATATATATTATTATATATAT")),
    ("GC-rich example", Seq("GCGCGCGCGCGCGCGCGCGCGCGCGCGCGC"))
]

print("GC Content Analysis:")
print("-" * 50)
for name, seq in sequences:
    gc = gc_fraction(seq) * 100
    print(f"{name:20} | GC: {gc:5.1f}% | Length: {len(seq)}")

---
## Part 2: SeqRecord Objects (Examples 8-10)

### Example 8: Creating SeqRecord Objects
SeqRecord combines a sequence with metadata.

In [None]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Create a SeqRecord with full annotations
record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"),
    id="gene001",
    name="example_gene",
    description="An example gene for demonstration"
)

# Add additional annotations
record.annotations["organism"] = "Homo sapiens"
record.annotations["molecule_type"] = "DNA"

print(f"ID: {record.id}")
print(f"Name: {record.name}")
print(f"Description: {record.description}")
print(f"Sequence: {record.seq}")
print(f"Length: {len(record)}")
print(f"Annotations: {record.annotations}")

### Example 9: Working with Sequence Features
Add biological features like genes, exons, or binding sites.

In [None]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Create a sequence record
record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"),
    id="gene001"
)

# Add a gene feature spanning the whole sequence (0-based start, end-exclusive: 0 to 39)
gene_feature = SeqFeature(
    FeatureLocation(start=0, end=39),
    type="gene",
    qualifiers={"gene": "example", "product": "example protein"}
)
record.features.append(gene_feature)

# Add a CDS feature (coding sequence)
cds_feature = SeqFeature(
    FeatureLocation(start=0, end=36),
    type="CDS",
    qualifiers={"protein_id": "PRO001"}
)
record.features.append(cds_feature)

# Display features
print(f"Record: {record.id}")
print(f"Number of features: {len(record.features)}")
for feature in record.features:
    print(f"\n  Type: {feature.type}")
    print(f"  Location: {feature.location}")
    print(f"  Qualifiers: {feature.qualifiers}")
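
Feature locations follow Python slice conventions: the start is 0-based and the end is exclusive, whereas GenBank flat files print 1-based, inclusive coordinates. A small sketch of the conversion, using a hypothetical helper name and plain tuples rather than Biopython objects:

```python
def to_genbank_coords(start: int, end: int) -> str:
    """Convert a 0-based, end-exclusive span to 1-based inclusive text."""
    return f"{start + 1}..{end}"

seq = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
spans = {"gene": (0, 39), "CDS": (0, 36)}

for name, (start, end) in spans.items():
    piece = seq[start:end]  # the same numbers work directly as a Python slice
    print(f"{name}: {to_genbank_coords(start, end)} ({len(piece)} bp)")
```

Because the end is exclusive, `end - start` is always the feature length.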

### Example 10: Slicing SeqRecords
Slicing a SeqRecord returns a new record that keeps only the features lying entirely inside the slice, with their coordinates shifted to the new origin.

In [None]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Create a record with features
record = SeqRecord(
    Seq("AAAAAA" + "ATGGCCATTGTAATGGGCCGCTGA" + "AAAAAAA"),  # 6 bp 5'UTR + 24 bp CDS + 7 bp 3'UTR
    id="gene001",
    description="Gene with flanking regions"
)

# Add features at different positions
record.features.append(SeqFeature(FeatureLocation(0, 6), type="5'UTR"))
record.features.append(SeqFeature(FeatureLocation(6, 30), type="CDS"))
record.features.append(SeqFeature(FeatureLocation(30, 37), type="3'UTR"))

print("Original record:")
print(f"  Sequence: {record.seq}")
print(f"  Features: {len(record.features)}")

# Slice the record to get just the CDS
cds_record = record[6:30]

print("\nSliced record (CDS only):")
print(f"  Sequence: {cds_record.seq}")
print(f"  Features: {len(cds_record.features)}")
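
The rule for which features survive a slice can be sketched with plain tuples. This is an approximation of the documented behaviour (features that cross a slice boundary are dropped), not Biopython's code, and `slice_features` is a hypothetical helper.

```python
def slice_features(features, start, end):
    """Keep features fully inside [start, end) and shift them to the new origin."""
    kept = []
    for f_start, f_end, f_type in features:
        if start <= f_start and f_end <= end:
            kept.append((f_start - start, f_end - start, f_type))
    return kept

# (start, end, type) tuples mirroring the features in Example 10
features = [(0, 6, "5'UTR"), (6, 30, "CDS"), (30, 37, "3'UTR")]

print(slice_features(features, 6, 30))  # [(0, 24, 'CDS')]
print(slice_features(features, 3, 30))  # [(3, 27, 'CDS')]
```

In the second call the 5'UTR is only partly inside the slice, so it is dropped rather than truncated.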

---
## Part 3: Parsing FASTA Files (Examples 11-15)

### Example 11: Creating Sample FASTA Files
First, let's create sample FASTA files to work with.

In [None]:
# Create a sample FASTA file with multiple sequences
fasta_content = """>seq1 Homo sapiens beta-globin
ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAAC
GTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAG
>seq2 Mus musculus beta-globin
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCTGTGGGGAAAGGTGAAC
TCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTTGTCTACCCTTGGACCCAG
>seq3 Gallus gallus beta-globin
ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAAT
GTGGCCGAATGTGGGGCCGAAGCCCTGGCCAGACTGCTGGTCGTCTACCCCTGGACTCAG
>seq4 Danio rerio hemoglobin
ATGAGTCTGTCTGATGACGACGCTGCTGACAGGCTGCAGAAAGCCCTTCAGCTCAACTGT
GACAAATCCCTTCACGCGAAGGTTGGTGGTGAGGCCTTGGGCAGGTTGCTGGTGGTCTAC
>seq5 Xenopus laevis hemoglobin
ATGGTGCATTTGACTGCTGAGGAAAAGACTGCCGTCACTGCCCTGTGGGGCAAAGTGAAT
GTTGATGATGTTGGTGGTGAGGCCCTGGGCAGATTGCTGGTTGTCTACCCATGGACTCAG
"""

# Write to file
with open("sample_sequences.fasta", "w") as f:
    f.write(fasta_content)

print("Created 'sample_sequences.fasta' with 5 sequences")
print("\nFile contents:")
print(fasta_content[:500] + "...")

### Example 12: Parsing a Single FASTA Sequence
`SeqIO.parse` returns an iterator, so records are read and processed one at a time without loading the whole file into memory.

In [None]:
from Bio import SeqIO

# Parse and iterate through sequences one at a time
print("Parsing FASTA file sequence by sequence:")
print("=" * 60)

for record in SeqIO.parse("sample_sequences.fasta", "fasta"):
    print(f"\nID: {record.id}")
    print(f"Description: {record.description}")
    print(f"Sequence length: {len(record.seq)} bp")
    print(f"First 30 bp: {record.seq[:30]}...")
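
To make the FASTA layout concrete, here is a minimal hand-written parser. It is a teaching sketch under simple assumptions (no comments, no validation) and not a replacement for `SeqIO.parse`; the function names are made up. Like Biopython, it treats the first word of the header as the ID and the full header line as the description.

```python
import io

def _make_record(header, chunks):
    """Build an (identifier, description, sequence) tuple from one entry."""
    words = header.split()
    identifier = words[0] if words else ""
    return identifier, header, "".join(chunks)

def parse_fasta(handle):
    """Yield one record at a time from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:].strip(), []
        elif line and header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)

demo = io.StringIO(
    ">seq1 Homo sapiens beta-globin\nATGGTG\nCATCTG\n>seq2 Mus musculus\nATGGTGCAC\n"
)
records = list(parse_fasta(demo))
for identifier, description, sequence in records:
    print(identifier, "|", description, "|", sequence)
```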

### Example 13: Loading All FASTA Sequences into Memory
Load every record into a list or dictionary when you need random access by position or by ID.

In [None]:
from Bio import SeqIO

# Method 1: Load as a list
sequences_list = list(SeqIO.parse("sample_sequences.fasta", "fasta"))
print(f"Loaded {len(sequences_list)} sequences as a list")
print(f"Third sequence ID: {sequences_list[2].id}")

# Method 2: Load as a dictionary (indexed by ID)
sequences_dict = SeqIO.to_dict(SeqIO.parse("sample_sequences.fasta", "fasta"))
print(f"\nLoaded {len(sequences_dict)} sequences as a dictionary")
print(f"Available IDs: {list(sequences_dict.keys())}")

# Access a specific sequence by ID
seq2 = sequences_dict["seq2"]
print(f"\nseq2 description: {seq2.description}")
print(f"seq2 sequence: {seq2.seq[:40]}...")

### Example 14: Filtering FASTA Sequences
Select sequences based on specific criteria.

In [None]:
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

# Filter sequences based on multiple criteria
print("Filtering FASTA sequences:")
print("=" * 60)

filtered_by_length = []
filtered_by_gc = []
filtered_by_species = []

for record in SeqIO.parse("sample_sequences.fasta", "fasta"):
    # Filter by length (> 100 bp)
    if len(record.seq) > 100:
        filtered_by_length.append(record)
    
    # Filter by GC content (> 50%)
    if gc_fraction(record.seq) > 0.50:
        filtered_by_gc.append(record)
    
    # Filter by species in description
    if "Homo sapiens" in record.description or "Mus musculus" in record.description:
        filtered_by_species.append(record)

print(f"Sequences > 100 bp: {len(filtered_by_length)}")
print(f"Sequences with GC > 50%: {len(filtered_by_gc)}")
for rec in filtered_by_gc:
    print(f"  - {rec.id}: {gc_fraction(rec.seq)*100:.1f}% GC")
print(f"Mammalian sequences: {len(filtered_by_species)}")

### Example 15: FASTA Statistics and Summary
Calculate comprehensive statistics from a FASTA file.

In [None]:
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction
import statistics

# Collect statistics
lengths = []
gc_contents = []
nucleotide_counts = {'A': 0, 'T': 0, 'G': 0, 'C': 0}

for record in SeqIO.parse("sample_sequences.fasta", "fasta"):
    seq = record.seq
    lengths.append(len(seq))
    gc_contents.append(gc_fraction(seq) * 100)
    for nucleotide in nucleotide_counts:
        nucleotide_counts[nucleotide] += seq.count(nucleotide)

total_bases = sum(nucleotide_counts.values())

print("FASTA File Statistics")
print("=" * 50)
print(f"Number of sequences: {len(lengths)}")
print(f"\nSequence Lengths:")
print(f"  Total bases: {sum(lengths):,}")
print(f"  Min length: {min(lengths)}")
print(f"  Max length: {max(lengths)}")
print(f"  Mean length: {statistics.mean(lengths):.1f}")
print(f"\nGC Content:")
print(f"  Min: {min(gc_contents):.1f}%")
print(f"  Max: {max(gc_contents):.1f}%")
print(f"  Mean: {statistics.mean(gc_contents):.1f}%")
print(f"\nNucleotide Composition:")
for nuc, count in nucleotide_counts.items():
    pct = (count / total_bases) * 100
    print(f"  {nuc}: {count:,} ({pct:.1f}%)")

---
## Part 4: File I/O and Format Conversion (Examples 16-18)

### Example 16: Writing FASTA Files
Save sequences to FASTA format.

In [None]:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Create some sequence records
records = [
    SeqRecord(Seq("ATGCATGCATGC"), id="seq_a", description="First sequence"),
    SeqRecord(Seq("GCTAGCTAGCTA"), id="seq_b", description="Second sequence"),
    SeqRecord(Seq("TTAATTAATTAA"), id="seq_c", description="Third sequence")
]

# Write to FASTA file
count = SeqIO.write(records, "output_sequences.fasta", "fasta")
print(f"Wrote {count} sequences to 'output_sequences.fasta'")

# Verify by reading back
print("\nVerifying file contents:")
with open("output_sequences.fasta") as f:
    print(f.read())
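
The sequences above are short enough to fit on one line, but longer ones are wrapped when written. The sketch below shows that layout by hand, assuming a 60-character line width (the same width used in `sample_sequences.fasta`). `format_fasta` is a hypothetical helper for illustration, not a Biopython function.

```python
def format_fasta(identifier: str, description: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA entry with the sequence wrapped at `width` characters."""
    header = f">{identifier} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"

entry = format_fasta("seq_x", "wrapped demo", "ACGT" * 35)  # 140 bp
print(entry)
```

A 140 bp sequence comes out as one header line followed by lines of 60, 60, and 20 bases.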

### Example 17: Parsing GenBank Files
Work with richly annotated GenBank format files.

In [None]:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Create a GenBank-style record
record = SeqRecord(
    Seq("ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG"),
    id="HBB_HUMAN",
    name="HBB",
    description="Homo sapiens hemoglobin subunit beta"
)

# Add annotations
record.annotations["molecule_type"] = "DNA"
record.annotations["organism"] = "Homo sapiens"
record.annotations["taxonomy"] = ["Eukaryota", "Metazoa", "Chordata", "Mammalia", "Primates", "Hominidae", "Homo"]

# Add features
record.features.append(
    SeqFeature(FeatureLocation(0, 93), type="gene", qualifiers={"gene": "HBB"})
)
record.features.append(
    SeqFeature(FeatureLocation(0, 93), type="CDS", qualifiers={"gene": "HBB", "product": "hemoglobin subunit beta"})
)

# Write as GenBank
SeqIO.write(record, "sample.gb", "genbank")
print("Created 'sample.gb'")

# Parse the GenBank file
print("\nParsing GenBank file:")
for record in SeqIO.parse("sample.gb", "genbank"):
    print(f"ID: {record.id}")
    print(f"Name: {record.name}")
    print(f"Organism: {record.annotations.get('organism', 'N/A')}")
    print(f"Features: {len(record.features)}")
    for feature in record.features:
        print(f"  - {feature.type}: {feature.location}")

### Example 18: Converting Between File Formats
Convert sequences between FASTA, GenBank, and other formats.

In [None]:
from Bio import SeqIO

# Convert GenBank to FASTA
records = list(SeqIO.parse("sample.gb", "genbank"))
count = SeqIO.write(records, "converted.fasta", "fasta")
print(f"Converted {count} record(s) from GenBank to FASTA")

# One-liner conversion using SeqIO.convert()
count = SeqIO.convert("sample.gb", "genbank", "converted_v2.fasta", "fasta")
print(f"Direct conversion: {count} record(s)")

# Show the converted FASTA
print("\nConverted FASTA content:")
with open("converted.fasta") as f:
    print(f.read())

# List of supported formats (partial)
print("Some supported formats:")
formats = ["fasta", "genbank", "fastq", "tab", "embl", "swiss", "clustal"]
for fmt in formats:
    print(f"  - {fmt}")

---
## Part 5: Sequence Analysis (Examples 19-22)

### Example 19: Finding Patterns and Motifs
Search for specific patterns in sequences.

In [None]:
from Bio.Seq import Seq
import re

# DNA sequence with potential restriction sites
dna = Seq("ATGAATTCGCTAGCAATTCGATCGATGAATTCGCTAGAATTCGATCG")

# Find EcoRI sites (GAATTC)
pattern = "GAATTC"

print(f"Sequence: {dna}")
print(f"Length: {len(dna)} bp")
print(f"\nSearching for {pattern} (EcoRI site):")

# Find all occurrences
positions = []
start = 0
while True:
    pos = str(dna).find(pattern, start)
    if pos == -1:
        break
    positions.append(pos)
    start = pos + 1

print(f"Found {len(positions)} site(s) at positions: {positions}")

# Using regex for more complex patterns
print("\nUsing regex to find patterns:")

# Find all ATG (start codons)
start_codons = [m.start() for m in re.finditer("ATG", str(dna))]
print(f"Start codons (ATG) at positions: {start_codons}")

# Find the N-glycosylation motif (PROSITE pattern N-{P}-[ST]-{P})
protein = Seq("MKNSTLNWTSFGNETSNRTP")
nglyc_pattern = r"N[^P][ST][^P]"
matches = [m.start() for m in re.finditer(nglyc_pattern, str(protein))]
print(f"\nProtein: {protein}")
print(f"N-glycosylation sites at positions: {matches}")
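
One caveat with `re.finditer`: it reports non-overlapping matches, so a motif that starts inside a previous match is skipped. Wrapping the pattern in a lookahead is a common workaround. The toy peptide below is invented to make the difference visible.

```python
import re

protein = "NNTSA"  # toy peptide with two overlapping candidate sites
motif = r"N[^P][ST][^P]"

# Plain search: the first match consumes positions 0-3, hiding the site at 1
plain = [m.start() for m in re.finditer(motif, protein)]

# Lookahead search: zero-width matches, so every start position is tested
overlapping = [m.start() for m in re.finditer(rf"(?=({motif}))", protein)]

print(plain)        # [0]
print(overlapping)  # [0, 1]
```

For the protein in Example 19 both approaches return the same positions, because none of its sites overlap.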

### Example 20: Pairwise Sequence Alignment
Align two sequences to find similarities.

In [None]:
from Bio import Align

# Create an aligner
aligner = Align.PairwiseAligner()

# Set alignment parameters
aligner.mode = 'global'  # Global (end-to-end) alignment; see aligner.algorithm for the method chosen
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

# Sequences to align
seq1 = "ATGCGATCGATCGATCG"
seq2 = "ATGCAATCGATGATCG"

print(f"Sequence 1: {seq1}")
print(f"Sequence 2: {seq2}")
print(f"\nAlignment parameters:")
print(f"  Match: {aligner.match_score}")
print(f"  Mismatch: {aligner.mismatch_score}")
print(f"  Gap open: {aligner.open_gap_score}")
print(f"  Gap extend: {aligner.extend_gap_score}")

# Perform alignment
alignments = aligner.align(seq1, seq2)

print(f"\nNumber of alignments: {len(alignments)}")
print(f"Alignment score: {alignments[0].score}")
print("\nBest alignment:")
print(alignments[0])
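
For intuition about how a global alignment score is computed, here is a compact Needleman-Wunsch sketch. It assumes a single linear gap penalty, which is simpler than the separate open and extend scores set above, so its scores will not match `PairwiseAligner` with those settings. The function name and default scores are choices made for this illustration.

```python
def needleman_wunsch_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best global alignment score with a linear gap penalty (score only)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("ACGT", "AGT"))   # 4  (three matches, one gap)
print(needleman_wunsch_score("ACGT", "ACGT"))  # 8  (four matches)
```

Only two rows of the dynamic-programming table are kept, which is enough for the score but not for recovering the alignment itself.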

### Example 21: Molecular Weight and Protein Properties
Calculate physical and chemical properties of proteins.

In [None]:
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Protein sequence (insulin A chain)
protein_seq = "GIVEQCCTSICSLYQLENYCN"

print(f"Protein sequence: {protein_seq}")
print(f"Length: {len(protein_seq)} amino acids")

# Analyze with ProteinAnalysis
protein = ProteinAnalysis(protein_seq)

print(f"\nPhysicochemical Properties:")
print(f"  Molecular weight: {protein.molecular_weight():.2f} Da")
print(f"  Isoelectric point (pI): {protein.isoelectric_point():.2f}")
print(f"  Instability index: {protein.instability_index():.2f}")
print(f"  GRAVY (hydrophobicity): {protein.gravy():.3f}")
print(f"  Aromaticity: {protein.aromaticity():.3f}")

# Amino acid composition
print(f"\nAmino Acid Composition:")
aa_count = protein.count_amino_acids()
for aa, count in sorted(aa_count.items(), key=lambda x: -x[1]):
    if count > 0:
        pct = (count / len(protein_seq)) * 100
        print(f"  {aa}: {count} ({pct:.1f}%)")

# Fraction of residues with a propensity for helix, turn, or sheet (not a structure prediction)
helix, turn, sheet = protein.secondary_structure_fraction()
print(f"\nSecondary Structure Propensity:")
print(f"  Helix: {helix:.1%}")
print(f"  Turn: {turn:.1%}")
print(f"  Sheet: {sheet:.1%}")
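
Two of these properties are simple averages over the residues. The sketch below recomputes aromaticity (fraction of F, W, and Y) and GRAVY (mean hydropathy) by hand, assuming the Kyte-Doolittle scale. It is meant to show the arithmetic, not to reproduce `ProteinAnalysis` exactly, and the function names are local to this example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def aromaticity(seq: str) -> float:
    """Fraction of residues that are Phe, Trp, or Tyr."""
    return sum(seq.count(aa) for aa in "FWY") / len(seq)

def gravy(seq: str) -> float:
    """Mean hydropathy per residue (positive values are more hydrophobic)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

insulin_a = "GIVEQCCTSICSLYQLENYCN"
print(f"Aromaticity: {aromaticity(insulin_a):.3f}")
print(f"GRAVY: {gravy(insulin_a):.3f}")
```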

### Example 22: Codon Usage Analysis
Analyze codon usage patterns in coding sequences.

In [None]:
from Bio.Seq import Seq
from Bio.Data import CodonTable
from collections import Counter

# Coding sequence
cds = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAGTTGCAGGTAATGCCC")

# Extract codons
codons = [str(cds[i:i+3]) for i in range(0, len(cds) - len(cds) % 3, 3)]

print(f"CDS: {cds}")
print(f"Length: {len(cds)} bp ({len(codons)} codons)")
print(f"\nCodons: {' '.join(codons)}")

# Count codon usage
codon_counts = Counter(codons)

print(f"\nCodon Usage:")
table = CodonTable.unambiguous_dna_by_name["Standard"]
for codon, count in codon_counts.most_common():
    if codon in table.forward_table:
        aa = table.forward_table[codon]
    elif codon in table.stop_codons:
        aa = "Stop"
    else:
        aa = "?"
    freq = count / len(codons) * 100
    print(f"  {codon} ({aa}): {count} ({freq:.1f}%)")

# Synonymous codon usage: group the observed codons by the amino acid
# they encode (a simple tally, not a codon adaptation index)
print(f"\nSynonymous codon usage:")
from collections import defaultdict
aa_codons = defaultdict(list)
for codon in codons:
    if codon in table.forward_table:
        aa_codons[table.forward_table[codon]].append(codon)

for aa, codon_list in sorted(aa_codons.items()):
    if len(set(codon_list)) > 1:
        usage = Counter(codon_list)
        print(f"  {aa}: {dict(usage)}")

---
## Part 6: NCBI and External Resources (Examples 23-25)

### Example 23: Using Entrez for Database Searches
Search NCBI databases programmatically.

**Note:** NCBI asks every Entrez user to supply a contact email address, so set `Entrez.email` before making any request.

In [None]:
from Bio import Entrez

# IMPORTANT: Set your email (required by NCBI)
Entrez.email = "your.email@example.com"  # Replace with your email

# Example: Search for nucleotide sequences
# Note: This requires internet connection
print("Entrez Database Search Example")
print("="*50)
print("\nNote: Uncomment the code below to run actual searches.")
print("Requires internet connection and valid email.")

# To run, remove the triple quotes around the block below:
"""
# Search nucleotide database
handle = Entrez.esearch(db="nucleotide", term="human[orgn] AND hemoglobin AND complete cds", retmax=5)
record = Entrez.read(handle)
handle.close()

print(f"\nSearch Results:")
print(f"Total hits: {record['Count']}")
print(f"Retrieved IDs: {record['IdList']}")

# Get information about each ID
ids = record['IdList']
if ids:
    handle = Entrez.esummary(db="nucleotide", id=",".join(ids))
    summaries = Entrez.read(handle)
    handle.close()
    
    print("\nSummary of results:")
    for summary in summaries:
        print(f"  - {summary['Title'][:60]}...")
"""

# Show available databases
print("\nAvailable NCBI databases (selection):")
databases = ["nucleotide", "protein", "gene", "pubmed", "taxonomy", "structure", "genome"]
for db in databases:
    print(f"  - {db}")

### Example 24: Fetching Sequences from NCBI
Download sequences directly from NCBI databases.

In [None]:
from Bio import Entrez, SeqIO

# Set your email
Entrez.email = "your.email@example.com"  # Replace with your email

print("Fetching Sequences from NCBI")
print("="*50)
print("\nNote: Uncomment the code below to fetch real sequences.")

# To run, remove the triple quotes around the block below:
"""
# Fetch a specific sequence by accession number
accession = "NM_000518"  # Human beta-globin mRNA

# Fetch as GenBank format
handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"\nFetched: {record.id}")
print(f"Description: {record.description}")
print(f"Sequence length: {len(record.seq)} bp")
print(f"Features: {len(record.features)}")
print(f"Organism: {record.annotations.get('organism', 'N/A')}")

# Save to local file
SeqIO.write(record, f"{accession}.gb", "genbank")
print(f"\nSaved to {accession}.gb")

# Fetch as FASTA
handle = Entrez.efetch(db="nucleotide", id=accession, rettype="fasta", retmode="text")
fasta_record = SeqIO.read(handle, "fasta")
handle.close()
print(f"\nFirst 100 bp: {fasta_record.seq[:100]}...")
"""

# Example output (illustrative; exact values depend on the record version NCBI returns)
print("\nExample output (illustrative):")
print("""
Fetched: NM_000518.5
Description: Homo sapiens hemoglobin subunit beta (HBB), mRNA
Sequence length: 626 bp
Features: 8
Organism: Homo sapiens
""")

### Example 25: Complete Workflow - FASTA Analysis Pipeline
A practical bioinformatics workflow combining multiple concepts.

In [None]:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqUtils import gc_fraction
import re

def analyze_fasta_file(filename):
    """Complete analysis pipeline for a FASTA file."""
    
    results = {
        'sequences': [],
        'translations': [],
        'summary': {}
    }
    
    total_gc = 0
    total_length = 0
    orfs_found = 0
    
    print(f"Analyzing: {filename}")
    print("=" * 60)
    
    for record in SeqIO.parse(filename, "fasta"):
        seq = record.seq
        gc = gc_fraction(seq) * 100
        
        print(f"\n>> {record.id}")
        print(f"   Description: {record.description}")
        print(f"   Length: {len(seq)} bp")
        print(f"   GC content: {gc:.1f}%")
        
        # Find ORFs (start codon to stop codon)
        seq_str = str(seq)
        start_codons = [m.start() for m in re.finditer('ATG', seq_str)]
        stop_codons = ['TAA', 'TAG', 'TGA']
        
        orfs = []
        for start in start_codons:
            # Walk codon by codon in the reading frame set by this ATG
            for i in range(start, len(seq_str) - 2, 3):
                codon = seq_str[i:i+3]
                if codon in stop_codons:
                    orf_length = i + 3 - start
                    if orf_length >= 30:  # Minimum ORF length: 30 bp, stop codon included
                        orfs.append((start, i + 3, orf_length))
                        orfs_found += 1
                    break
        
        print(f"   ORFs found (>= 30 bp): {len(orfs)}")
        
        # Translate longest ORF
        if orfs:
            longest_orf = max(orfs, key=lambda x: x[2])
            orf_seq = seq[longest_orf[0]:longest_orf[1]]
            protein = orf_seq.translate(to_stop=True)
            print(f"   Longest ORF: {longest_orf[2]} bp -> {len(protein)} aa")
            print(f"   Protein (first 30 aa): {protein[:30]}...")
            
            # Store translation
            prot_record = SeqRecord(
                protein,
                id=record.id + "_protein",
                description=f"Translated from {record.id}"
            )
            results['translations'].append(prot_record)
        
        # Update totals
        total_gc += gc
        total_length += len(seq)
        results['sequences'].append(record)
    
    # Summary
    n_seqs = len(results['sequences'])
    results['summary'] = {
        'num_sequences': n_seqs,
        'total_bases': total_length,
        'mean_gc': total_gc / n_seqs if n_seqs > 0 else 0,
        'orfs_found': orfs_found
    }
    
    print("\n" + "=" * 60)
    print("SUMMARY")
    print("=" * 60)
    print(f"Total sequences: {n_seqs}")
    print(f"Total bases: {total_length:,}")
    print(f"Average GC content: {results['summary']['mean_gc']:.1f}%")
    print(f"Total ORFs found: {orfs_found}")
    
    return results

# Run the pipeline
results = analyze_fasta_file("sample_sequences.fasta")

# Save translations to file
if results['translations']:
    SeqIO.write(results['translations'], "translated_orfs.fasta", "fasta")
    print(f"\nSaved {len(results['translations'])} translations to 'translated_orfs.fasta'")
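
The pipeline above scans only the strand given in the file, while genes can lie on either strand. As a hedged extension on plain strings, the sketch below reuses the same ATG-to-stop logic and runs it on the reverse complement as well. The function names are hypothetical, and minus-strand coordinates are reported relative to the reverse-complemented sequence, not the original.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_length=30):
    """Return (start, end, length) for each ATG...stop ORF on this strand."""
    orfs = []
    for match in re.finditer("ATG", seq):
        start = match.start()
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i + 3 - start >= min_length:
                    orfs.append((start, i + 3, i + 3 - start))
                break  # first in-frame stop ends this ORF
    return orfs

def find_orfs_both_strands(seq, min_length=30):
    """Scan the given strand ("+") and its reverse complement ("-")."""
    forward = [("+", *orf) for orf in find_orfs(seq, min_length)]
    revcomp = seq.translate(COMPLEMENT)[::-1]
    reverse = [("-", *orf) for orf in find_orfs(revcomp, min_length)]
    return forward + reverse

# A 33 bp ORF hidden on the minus strand of the test sequence
orf = "ATG" + "GCC" * 9 + "TAA"
minus_strand_gene = orf.translate(COMPLEMENT)[::-1]
print(find_orfs_both_strands(minus_strand_gene))  # [('-', 0, 33, 33)]
```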

---
## Cleanup (Optional)
Remove files created during this tutorial.

In [None]:
import os

# List of files created
files_to_remove = [
    "sample_sequences.fasta",
    "output_sequences.fasta",
    "sample.gb",
    "converted.fasta",
    "converted_v2.fasta",
    "translated_orfs.fasta"
]

# Uncomment to remove files
# for f in files_to_remove:
#     if os.path.exists(f):
#         os.remove(f)
#         print(f"Removed: {f}")

print("To clean up, uncomment the removal code above and run this cell.")
print(f"Files that would be removed: {files_to_remove}")

---
## Summary

This notebook covered 25 BioPython examples organized into 6 parts:

1. **Working with Sequences (1-7):** Seq objects, properties, slicing, complement, transcription, translation, GC content

2. **SeqRecord Objects (8-10):** Creating records, adding features, slicing with features

3. **Parsing FASTA Files (11-15):** Creating, reading, filtering, and analyzing FASTA files

4. **File I/O and Format Conversion (16-18):** Writing FASTA, parsing GenBank, format conversion

5. **Sequence Analysis (19-22):** Pattern finding, pairwise alignment, protein properties, codon usage

6. **NCBI and External Resources (23-25):** Entrez searches, fetching sequences, complete workflow

### Further Resources
- [BioPython Tutorial](https://biopython.org/DIST/docs/tutorial/Tutorial.html)
- [BioPython API Documentation](https://biopython.org/docs/latest/api/)
- [NCBI Entrez Programming Utilities](https://www.ncbi.nlm.nih.gov/books/NBK25501/)