# DNA Sequence Analysis: Population Genomics Basics

Learn genomics fundamentals through DNA sequence analysis and phylogenetics.

## Dataset

Ten individuals sequenced for the COI gene (240 bp):
- 4 geographic locations (A, B, C, D)
- Cytochrome c oxidase subunit I (mitochondrial barcode)
- Realistic single nucleotide polymorphisms (SNPs)

## Methods
- Multiple sequence alignment
- Phylogenetic tree construction
- Genetic diversity analysis
- Population structure assessment

In [None]:
from Bio import SeqIO, AlignIO, Phylo
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✓ Setup complete")

## 1. Load and Explore Sequences

In [None]:
# Load FASTA file
sequences = list(SeqIO.parse('sample_sequences.fasta', 'fasta'))

print(f"Number of sequences: {len(sequences)}")
print(f"\nFirst sequence:")
print(f"ID: {sequences[0].id}")
print(f"Length: {len(sequences[0])} bp")
print(f"Sequence: {str(sequences[0].seq)[:60]}...")

# Extract metadata
metadata = []
for seq in sequences:
    parts = seq.id.split('|')
    metadata.append({'ID': parts[0], 'Location': parts[1], 'Length': len(seq)})

metadata_df = pd.DataFrame(metadata)
print("\nSample Information:")
print(metadata_df)

## 2. Sequence Composition

In [None]:
# Calculate GC content
def calc_gc_content(seq):
    g = seq.count('G')
    c = seq.count('C')
    return (g + c) / len(seq) * 100

composition_data = []
for seq in sequences:
    comp = {
        'ID': seq.id.split('|')[0],
        'A': seq.seq.count('A'),
        'T': seq.seq.count('T'),
        'G': seq.seq.count('G'),
        'C': seq.seq.count('C'),
        'GC%': calc_gc_content(seq.seq)
    }
    composition_data.append(comp)

comp_df = pd.DataFrame(composition_data)
print("Nucleotide Composition:")
print(comp_df.round(2))

# Plot GC content
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(comp_df['ID'], comp_df['GC%'], color='steelblue', alpha=0.7, edgecolor='black')
ax.axhline(comp_df['GC%'].mean(), color='red', linestyle='--', label=f'Mean: {comp_df["GC%"].mean():.1f}%')
ax.set_xlabel('Individual', fontsize=11)
ax.set_ylabel('GC Content (%)', fontsize=11)
ax.set_title('GC Content by Individual', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 3. Sequence Alignment

In [None]:
# Wrap the records in an alignment object. No aligning happens here:
# the sequences must already be aligned and of equal length.
alignment = MultipleSeqAlignment(sequences)

print(f"Alignment length: {alignment.get_alignment_length()} bp")
print(f"Number of sequences: {len(alignment)}")

# Show alignment snippet
print("\nAlignment (first 80 bp):")
print(alignment[:, :80])

## 4. Variant Identification

In [None]:
# Find variable positions
alignment_length = alignment.get_alignment_length()
variable_positions = []

for i in range(alignment_length):
    column = alignment[:, i]
    unique_bases = set(column)
    if len(unique_bases) > 1:  # Variable position
        variable_positions.append({
            'Position': i+1,
            'Variants': ','.join(sorted(unique_bases)),
            'Count': len(unique_bases)
        })

var_df = pd.DataFrame(variable_positions)

print(f"Total variable positions: {len(variable_positions)}")
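# This is the share of variable (segregating) sites, not nucleotide diversity (pi),
# which averages pairwise differences per site.
print(f"Proportion of variable sites: {len(variable_positions)/alignment_length*100:.2f}%")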

if len(var_df) > 0:
    print("\nVariable Positions:")
    print(var_df.head(10))

# Plot variant distribution
if len(variable_positions) > 0:
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.scatter([v['Position'] for v in variable_positions], 
              [1]*len(variable_positions), s=50, color='red', alpha=0.6)
    ax.set_xlim(0, alignment_length)
    ax.set_ylim(0.5, 1.5)
    ax.set_xlabel('Position (bp)', fontsize=11)
    ax.set_yticks([])
    ax.set_title('Variant Positions Along Sequence', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='x')
    plt.tight_layout()
    plt.show()

## 5. Pairwise Distances

In [None]:
# Calculate pairwise distances
calculator = DistanceCalculator('identity')
distance_matrix = calculator.get_distance(alignment)

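# Convert to DataFrame. DistanceMatrix.matrix is stored in lower-triangular form
# (row i holds i+1 values), so mirror it into a full symmetric array first;
# passing the ragged list directly leaves the upper triangle as NaN.
names = [seq.id.split('|')[0] for seq in sequences]
full_matrix = np.zeros((len(names), len(names)))
for i, row in enumerate(distance_matrix.matrix):
    full_matrix[i, :len(row)] = row
full_matrix = np.maximum(full_matrix, full_matrix.T)  # distances are non-negative
dist_df = pd.DataFrame(full_matrix, index=names, columns=names)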

print("Pairwise Distance Matrix:")
print(dist_df.round(3))

# Heatmap
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(dist_df, annot=True, fmt='.3f', cmap='YlOrRd', square=True, ax=ax,
           cbar_kws={'label': 'Genetic Distance'})
ax.set_title('Pairwise Genetic Distances', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 6. Phylogenetic Tree

In [None]:
# Construct UPGMA tree
constructor = DistanceTreeConstructor()
upgma_tree = constructor.upgma(distance_matrix)

# Draw tree
fig, ax = plt.subplots(figsize=(12, 8))
Phylo.draw(upgma_tree, axes=ax, do_show=False)
ax.set_title('UPGMA Phylogenetic Tree', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("UPGMA groups individuals by pairwise genetic distance.")
print("Closely related individuals cluster together; branch lengths assume a constant rate of evolution.")

## 7. Population Structure

In [None]:
# Analyze by location
location_data = []
for seq in sequences:
    parts = seq.id.split('|')
    location_data.append({'Individual': parts[0], 'Location': parts[1]})

loc_df = pd.DataFrame(location_data)

print("Samples by Location:")
print(loc_df['Location'].value_counts())

# Average distance within vs between locations
locations = loc_df['Location'].unique()
within_dist = []
between_dist = []

for i in range(len(sequences)):
    for j in range(i+1, len(sequences)):
        loc_i = loc_df.iloc[i]['Location']
        loc_j = loc_df.iloc[j]['Location']
        distance = dist_df.iloc[j, i]  # row j, column i (j > i): the lower triangle is always populated
        
        if loc_i == loc_j:
            within_dist.append(distance)
        else:
            between_dist.append(distance)

print(f"\nGenetic Distance Summary:")
print(f"  Within populations: {np.mean(within_dist):.4f} ± {np.std(within_dist):.4f}")
print(f"  Between populations: {np.mean(between_dist):.4f} ± {np.std(between_dist):.4f}")
print(f"  Fst (approximation): {(np.mean(between_dist) - np.mean(within_dist))/np.mean(between_dist):.3f}")

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.boxplot([within_dist, between_dist])
ax.set_xticklabels(['Within Populations', 'Between Populations'])
ax.set_ylabel('Genetic Distance', fontsize=12)
ax.set_title('Population Structure', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
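
The Fst approximation printed above is simple arithmetic on two means, so it can be checked by hand. The sketch below repeats the same formula on made-up distances; `fst_from_distances` is a hypothetical helper name, and the toy numbers are invented for illustration, not taken from the dataset. It also returns `None` where the notebook's one-line formula would divide by zero.

```python
from statistics import mean


def fst_from_distances(within, between):
    """Approximate Fst as (mean between - mean within) / mean between.

    Returns None when the ratio is undefined (an empty list, or a mean
    between-population distance of zero).
    """
    if not within or not between:
        return None
    mean_between = mean(between)
    if mean_between == 0:
        return None
    return (mean_between - mean(within)) / mean_between


# Toy distances, invented for illustration only
toy_within = [0.01, 0.02, 0.00]          # pairs from the same location
toy_between = [0.05, 0.06, 0.04, 0.05]   # pairs from different locations

# mean within = 0.01, mean between = 0.05, so (0.05 - 0.01) / 0.05 = 0.8
print(f"Fst (approximation): {fst_from_distances(toy_within, toy_between):.3f}")
```

With so few samples per location, a value like this is only a rough indication. The ratio can also come out negative when within-location distances exceed between-location distances, which is usually read as no detectable differentiation.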

## 8. Summary Statistics

In [None]:
summary = {
    'Number of Sequences': len(sequences),
    'Sequence Length': alignment_length,
    'Variable Positions': len(variable_positions),
    'Variable Sites (%)': f"{len(variable_positions)/alignment_length*100:.2f}",
    'Mean GC Content (%)': f"{comp_df['GC%'].mean():.2f}",
    # Use the lower triangle (k=-1): each pair is counted once and the values are always populated
    'Mean Pairwise Distance': f"{dist_df.values[np.tril_indices_from(dist_df.values, k=-1)].mean():.4f}",
    'Number of Locations': len(locations)
}

print("="*60)
print("GENOMICS ANALYSIS SUMMARY")
print("="*60)
for key, value in summary.items():
    print(f"{key:.<45} {value}")
print("="*60)

print("\n✓ Analysis complete!")

## Key Concepts Learned

### DNA Sequences
- **FASTA format**: Standard sequence file format
- **Nucleotides**: A, T, G, C building blocks
- **GC content**: Proportion of G+C nucleotides
- **Sequence length**: Number of base pairs
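
To make these ideas concrete without any library, the sketch below parses FASTA text by hand and computes GC content. The two records are made-up toy data, and the `ID|Location` header layout is an assumption that mirrors how this notebook splits sequence IDs. `parse_fasta` and `gc_percent` are hypothetical helper names, not Biopython functions.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is the header
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may wrap, so append them to the current record
            records[header] += line.upper()
    return records


def gc_percent(seq):
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


# Toy input, invented for illustration only
toy_fasta = """>IND01|A
ATGCGC
ATAT
>IND02|B
GGCCNN--at
"""

toy_records = parse_fasta(toy_fasta)
for header, seq in toy_records.items():
    sample_id, location = header.split("|")
    print(sample_id, location, len(seq), f"{gc_percent(seq):.1f}%")
```

Unlike the notebook's `calc_gc_content`, this version is case-insensitive and leaves gaps and ambiguous bases out of the denominator. That is one possible convention, so state whichever one you use when reporting GC content.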

### Sequence Alignment
- **Multiple sequence alignment**: Arranging sequences to identify similarities
- **Variable positions**: Sites that differ between sequences
- **Polymorphisms**: Genetic variation (SNPs)
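
One detail worth seeing in code: a column that differs only because of gaps or ambiguous bases is not necessarily a polymorphism. The sketch below finds variable positions in a toy alignment while treating `-`, `N`, and `?` as missing data. The alignment is invented, `variable_sites` is a hypothetical helper, and ignoring missing characters is an assumed convention rather than the only valid one.

```python
def variable_sites(aligned_seqs, missing="-N?"):
    """Return 1-based positions where at least two different bases occur.

    Characters listed in `missing` are ignored instead of counted as variants.
    """
    lengths = {len(seq) for seq in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    positions = []
    # zip(*...) walks the alignment column by column
    for i, column in enumerate(zip(*aligned_seqs), start=1):
        bases = {base.upper() for base in column} - set(missing)
        if len(bases) > 1:
            positions.append(i)
    return positions


# Toy alignment, invented for illustration only
toy_alignment = [
    "ATGCA-GT",
    "ATGTA-GT",
    "ACGTANGT",
    "ATGCAAGT",
]

print(variable_sites(toy_alignment))  # prints [2, 4]
```

Position 6 contains `-`, `N`, and `A`, but only one real base, so it is not reported. A plain `set(column)` check would count it as variable.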

### Phylogenetics
- **Genetic distance**: Measure of sequence difference
- **Phylogenetic tree**: Evolutionary relationships
- **UPGMA**: Distance-based clustering method that builds a rooted tree and assumes a constant rate of evolution (molecular clock)
- **Clustering**: Related sequences group together
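
The clustering idea behind UPGMA fits in a few lines: repeatedly merge the two closest clusters, then recompute distances to the new cluster as a size-weighted average. The following is a minimal teaching sketch, not Biopython's implementation. The function name `upgma`, the `frozenset` keys, and the toy distances are all assumptions made for illustration.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster `labels` by UPGMA.

    `dist` maps frozenset({a, b}) to the distance between a and b.
    Returns a Newick-style topology string and a list of
    (cluster_a, cluster_b, merge_height) tuples in merge order.
    """
    sizes = {name: 1 for name in labels}  # cluster name -> number of leaves
    d = {frozenset(p): float(dist[frozenset(p)]) for p in combinations(labels, 2)}
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair; sorting the names makes ties deterministic
        pair = min(d, key=lambda p: (d[p], sorted(p)))
        a, b = sorted(pair)
        merged = f"({a},{b})"
        merges.append((a, b, d[pair] / 2))  # node height is half the distance
        # Distance from the new cluster to each remaining cluster:
        # average of the old distances, weighted by cluster size
        new_d = {}
        for other in sizes:
            if other in pair:
                continue
            new_d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        # Drop every entry that involves the two merged clusters
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        d.update(new_d)
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    (root,) = sizes
    return root + ";", merges


# Toy distances, invented for illustration only
toy_labels = ["A", "B", "C", "D"]
toy_dist = {
    frozenset(("A", "B")): 0.02,
    frozenset(("C", "D")): 0.04,
    frozenset(("A", "C")): 0.10,
    frozenset(("A", "D")): 0.10,
    frozenset(("B", "C")): 0.10,
    frozenset(("B", "D")): 0.10,
}

newick, merges = upgma(toy_labels, toy_dist)
print(newick)  # prints ((A,B),(C,D));
for a, b, height in merges:
    print(f"merge {a} + {b} at height {height:.3f}")
```

Because every merge height is half the distance between the merged clusters, all leaves end up equally far from the root. That is the molecular-clock assumption in action, and it is why UPGMA can mislead when lineages evolve at different rates.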

### Population Genetics
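- **Nucleotide diversity (π)**: Average proportion of sites that differ between two sequences, taken over all pairs; distinct from the proportion of variable sites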
- **Population structure**: Genetic differentiation
- **Fst**: Proportion of variation between populations
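
Nucleotide diversity and the proportion of variable sites are easy to confuse, so a small worked comparison may help. The sketch below computes both on four made-up aligned sequences. The function names are hypothetical, and the π shown is the plain unweighted mean over all sequence pairs with no correction for sample size or missing data.

```python
from itertools import combinations


def segregating_fraction(seqs):
    """Proportion of alignment columns that contain more than one base."""
    variable = sum(len(set(column)) > 1 for column in zip(*seqs))
    return variable / len(seqs[0])


def nucleotide_diversity(seqs):
    """Mean proportion of differing sites over all pairs of sequences (pi)."""
    pairs = list(combinations(seqs, 2))
    per_pair = [
        sum(a != b for a, b in zip(x, y)) / len(x)
        for x, y in pairs
    ]
    return sum(per_pair) / len(pairs)


# Toy aligned sequences, invented for illustration only
toy_seqs = [
    "ATGCATGCAT",
    "ATGCATGCAA",
    "TTGCATGGAA",
    "ATGCATGCAT",
]

# 3 of 10 columns vary, so the segregating fraction is 0.300
print(f"Variable sites: {segregating_fraction(toy_seqs):.3f}")
# The 6 pairs differ at 1, 3, 0, 2, 1 and 3 sites: mean 10/6 sites, or 0.167 per site
print(f"Nucleotide diversity: {nucleotide_diversity(toy_seqs):.3f}")
```

The two numbers diverge because π weights each variant by how often it actually separates two sampled sequences, while the variable-site fraction counts a site once no matter how rare the variant is.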

## Next Steps

### Real Genomic Data
- **[NCBI GenBank](https://www.ncbi.nlm.nih.gov/genbank/)**: Millions of sequences
- **[1000 Genomes](https://www.internationalgenome.org/)**: Human genetic variation
- **[BOLD Systems](http://www.boldsystems.org/)**: DNA barcoding database

### Advanced Analyses
- Whole genome sequencing
- Variant calling from reads
- GWAS (genome-wide association studies)
- Population genomics at scale

## Resources

- **[Biopython Tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html)**
- **[MEGA Software](https://www.megasoftware.net/)**: Phylogenetics
- **Textbook**: *Molecular Evolution and Phylogenetics* by Nei & Kumar