**Lesson 14: Off-Target Prediction & Specificity**

Learn to predict off-target effects, using the tumor suppressor gene TP53 as the example.

---

## 🧬 Real Example: TP53 (Cancer Research)

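**Target Gene: TP53** (encodes the tumor suppressor protein p53)

- **Function**: Helps prevent cancer by responding to DNA damage: p53 halts the cell cycle, activates repair genes, or triggers apoptosis
- **Mutation**: Mutated in ~50% of human cancers
- **CRISPR Research**: Used to study cancer mechanisms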

### Why Off-Targets Matter:
- TP53 is critical: an unintended cut elsewhere in the genome could disrupt other genes and, in the worst case, contribute to cancer
- The guide must be highly specific to its intended site
- Safety is paramount in cancer research

In [None]:
# TP53 sequence used in this lesson (a partial coding sequence; source: NCBI)
TP53_GENE = "ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCGCGTGGCCCCTGCACCAGCAGCTCCTACACCGGCGGCCCCTGCACCAGCCCCCTCCTGGCCCCTGTCATCTTCTGTCCCTTCCCAGAAAACCTACCAGGGCAGCTACGGTTTCCGTCTGGGCTTCTTGCATTCTGGGACAGCCAAGTCTGTGACTTGCACGTACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGGTCTGGCCCCTCCTCAGCATCTTATCCGAGTGGAAGGAAATTTGCGTGTGGAGTATTTGGATGACAGAAACACTTTTCG"

print("TP53 Gene: Tumor protein p53 (tumor suppressor gene)")
print(f"Length: {len(TP53_GENE)} bp")
print(f"First 60 bp: {TP53_GENE[:60]}")
print(f"Last 60 bp: {TP53_GENE[-60:]}")
print(f"\nGC Content: {(TP53_GENE.count('G') + TP53_GENE.count('C')) / len(TP53_GENE) * 100:.1f}%")

## 🎯 Off-Target Risk Assessment

### Mismatch Tolerance:
- **0-2 mismatches**: High risk
- **3-4 mismatches**: Moderate risk
- **5+ mismatches**: Low risk

### Seed Region (positions 10-20, counted from the guide's 5' end):
- These are the bases nearest the PAM; mismatches here greatly reduce cutting
- Most critical for specificity
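
The seed-region rule above can be sketched in code. This is a minimal illustration, not a validated scoring model: the function name `seed_weighted_mismatches`, the `SEED_START` boundary, and the weight of 2.0 are assumptions made for this lesson. Positions are 0-based from the guide's 5' end, so the PAM-proximal base is the last index.

```python
# Hypothetical sketch: count seed (PAM-proximal) mismatches separately and
# weight them more heavily than PAM-distal mismatches.
SEED_START = 9  # 0-based index; corresponds to positions 10-20 in 1-based numbering


def seed_weighted_mismatches(guide, site, seed_weight=2.0):
    """Return (weighted_score, seed_mismatches, distal_mismatches)."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    seed = 0
    distal = 0
    for i, (a, b) in enumerate(zip(guide, site)):
        if a != b:
            if i >= SEED_START:
                seed += 1
            else:
                distal += 1
    # seed_weight is an illustrative value, not a measured one
    return distal + seed_weight * seed, seed, distal


guide = "ACGTACGTACGTACGTACGT"
distal_variant = "TCGTACGTACGTACGTACGT"  # mismatch at position 1 (PAM-distal)
seed_variant = "ACGTACGTACGTACGTACGA"    # mismatch at position 20 (PAM-proximal)

print(seed_weighted_mismatches(guide, distal_variant))  # (1.0, 0, 1)
print(seed_weighted_mismatches(guide, seed_variant))    # (2.0, 1, 0)
```

A higher weighted score means the site differs from the guide in the positions that matter most, so it is less likely to be cut. Both variants have one mismatch, but the seed mismatch counts double under this assumed weighting.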

In [None]:
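def count_mismatches(seq1, seq2):
    """Return (mismatch count, 0-based mismatch positions) for two equal-length sequences."""
    if len(seq1) != len(seq2):
        # Sequences of different lengths cannot be compared position by position;
        # a sentinel count here would be misread as "many mismatches = low risk"
        raise ValueError("Sequences must have the same length")
    positions = [i for i, (a, b) in enumerate(zip(seq1, seq2)) if a != b]
    return len(positions), positions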

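def calculate_offtarget_risk(target, potential_offtarget):
    """Return (risk score 0-100, label) based on the mismatch count alone.

    Mismatch positions (e.g. seed vs. non-seed) are not used in this simple score.
    """
    count, _ = count_mismatches(target, potential_offtarget)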
    
    if count == 0:
        return 100, "On-target (perfect match)"
    elif count >= 5:
        return 10, "Low risk"
    elif count == 4:
        return 30, "Moderate risk"
    elif count == 3:
        return 50, "Moderate-high risk"
    elif count == 2:
        return 70, "High risk"
    else:
        return 85, "Very high risk"

print("Off-target assessment functions ready!")

## 🔍 Find Guides in TP53 and Check Specificity

In [None]:
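def find_pam_sites(dna):
    """Return 0-based start positions of NGG PAMs on the forward strand."""
    pams = []
    for i in range(len(dna) - 2):
        # N must be a valid base, followed by GG
        if dna[i] in 'ATGC' and dna[i+1:i+3] == 'GG':
            pams.append(i)
    return pams

def find_guides(gene_sequence):
    """Return the 20-nt guide immediately 5' of each NGG PAM (forward strand only)."""
    pam_sites = find_pam_sites(gene_sequence)
    guides = []
    for pam_pos in pam_sites:
        # Skip PAMs too close to the start to have a full 20-nt guide upstream
        if pam_pos >= 20:
            guide = gene_sequence[pam_pos-20:pam_pos]
            pam = gene_sequence[pam_pos:pam_pos+3]
            guides.append({'guide': guide, 'pam': pam, 'position': pam_pos-20})
    return guides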

# Find guides in TP53
print("Finding CRISPR guides in TP53 gene...\n")
tp53_guides = find_guides(TP53_GENE)

print(f"Found {len(tp53_guides)} potential guides in TP53")
print("\nFirst 3 TP53 Guides:\n" + "="*60)

for i, guide in enumerate(tp53_guides[:3], 1):
    print(f"\nGuide #{i}:")
    print(f"  Sequence: 5'-{guide['guide']}-3'")
    print(f"  PAM: {guide['pam']}")
    print(f"  Position: {guide['position']} bp")

## 💡 Challenge: Off-Target Search

Simulate an off-target search by comparing the first guide against every 20-nt window of the gene itself:

In [None]:
# Check first guide for potential off-targets
if tp53_guides:
    test_guide = tp53_guides[0]['guide']
    print(f"Testing guide: {test_guide}")
    print(f"\nSearching for similar sequences in TP53...\n")
    
    offtargets_found = []
    for i in range(len(TP53_GENE) - 20 + 1):
        potential = TP53_GENE[i:i+20]
        count, positions = count_mismatches(test_guide, potential)
        if 0 < count <= 4:  # 1-4 mismatches
            risk, desc = calculate_offtarget_risk(test_guide, potential)
            offtargets_found.append({'seq': potential, 'pos': i, 'mismatches': count, 'risk': risk, 'desc': desc})
    
    print(f"Found {len(offtargets_found)} potential off-target sites with 1-4 mismatches")
    
    if offtargets_found:
        print("\nTop off-target concerns:")
        for ot in sorted(offtargets_found, key=lambda x: x['risk'], reverse=True)[:3]:
            print(f"\n  Sequence: {ot['seq']}")
            print(f"  Position: {ot['pos']}")
            print(f"  Mismatches: {ot['mismatches']}")
            print(f"  Risk: {ot['risk']}/100 - {ot['desc']}")

## 🔑 Key Takeaways

1. **TP53 is critical** - off-target cuts made while editing it could damage other genes and contribute to cancer
2. **Always search** for similar sequences genome-wide
3. **Seed region mismatches** reduce risk significantly
4. **Balance** target location with specificity
5. Real tools search entire genomes (3 billion bases!)
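
As a small step toward a genome-wide search, the sketch below scans both DNA strands and only reports near-matches that sit next to an NGG PAM. It is a toy illustration under stated assumptions: the helper names (`reverse_complement`, `hamming`, `scan_both_strands`) and the example sequences are hypothetical, and real tools rely on indexed genome searches rather than a linear scan like this.

```python
# Hypothetical sketch: scan both strands for near-matches followed by an NGG PAM.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (uppercase A/C/G/T)."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a, b):
    """Count mismatches between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def scan_both_strands(guide, dna, max_mismatches=4):
    """Yield (strand, start, site, mismatches) for sites with an adjacent NGG PAM.

    For the '-' strand, start is a 0-based index into the reverse complement.
    """
    n = len(guide)
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        # Leave room for the guide-length site plus a 3-nt PAM
        for i in range(len(seq) - n - 2):
            site = seq[i:i + n]
            pam = seq[i + n:i + n + 3]
            if pam[1:] == "GG":
                mm = hamming(guide, site)
                if mm <= max_mismatches:
                    yield strand, i, site, mm


guide = "GATTACAGATTACAGATTAC"     # made-up 20-nt guide, not from TP53
dna_plus = "TT" + guide + "TGG"    # site on the forward strand
dna_minus = reverse_complement(dna_plus)  # same site, now on the reverse strand

print(list(scan_both_strands(guide, dna_plus)))
# [('+', 2, 'GATTACAGATTACAGATTAC', 0)]
print(list(scan_both_strands(guide, dna_minus)))
# [('-', 2, 'GATTACAGATTACAGATTAC', 0)]
```

Requiring a PAM filters out windows that look similar to the guide but could not be cut by Cas9, and checking the reverse complement catches sites that a forward-only scan would miss.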

## 📚 Next Lesson
Lesson 15: Complete pipeline with CCR5 (HIV resistance)