<a href="https://colab.research.google.com/github/shandley/claude-for-bioinformatics/blob/master/guided-tutorials/01-first-rnaseq-analysis/Module_1_1_RNA_seq_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 1.1: Your First RNA-seq Analysis with Claude Code

## 🎯 Learning Objectives
By the end of this tutorial, you will:
- ✅ Set up a bioinformatics analysis environment in the cloud
- ✅ Download and examine real RNA-seq data
- ✅ Run quality control analysis using industry-standard tools
- ✅ Interpret bioinformatics results with confidence
- ✅ Understand how Claude Code enhances bioinformatics workflows

**⏱️ Estimated Time**: 30-45 minutes  
**💻 Requirements**: Google account (you're already here!)  
**🔧 Software**: All tools installed automatically in this notebook

---

## ⚠️ Prerequisites

### Required Reading (10 minutes)
**IMPORTANT**: Before starting this tutorial, you should understand Claude Code basics:
- [**Claude Code Best Practices**](https://github.com/shandley/claude-for-bioinformatics/blob/master/claude-code-best-practices.md)

This covers essential setup, installation, and basic usage patterns. While this Colab tutorial runs independently, understanding Claude Code fundamentals will help you apply these skills in your own research.

---

## 🧬 About This Tutorial

We'll analyze a small but realistic RNA-seq dataset using the same tools and workflows used in professional bioinformatics:

- **Sample Data**: 10,000 paired-end reads from a human cell line
- **Tools**: FastQC, MultiQC (industry standards for quality control)
- **Skills**: Real-world quality assessment and interpretation
- **Output**: Publication-quality QC reports you can download

Everything runs in this notebook - no software installation required on your computer!

---

# 🛠️ Step 1: Environment Setup

First, we'll install the bioinformatics tools we need. This is exactly what you'd do in a real research environment!

**📚 Learning Note**: In professional bioinformatics, tool installation and environment management are crucial skills. We're learning the real process here.

In [None]:
# Install conda package manager (this might take 2-3 minutes)
print("🔧 Installing conda package manager...")
!wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!bash Miniconda3-latest-Linux-x86_64.sh -b -p /content/miniconda
!rm Miniconda3-latest-Linux-x86_64.sh

# Add conda to PATH
import os
os.environ['PATH'] = '/content/miniconda/bin:' + os.environ['PATH']

print("✅ Conda installation complete!")

In [None]:
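# Set up the Bioconda channels (where bioinformatics tools live).
# `conda config --add` puts each channel at the top of the list, so the
# channel added last (conda-forge) ends up with the highest priority.
print("📦 Configuring bioinformatics software channels...")
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge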

print("✅ Bioconda channels configured!")
print("🧬 Ready to install bioinformatics tools")

In [None]:
# Install FastQC and MultiQC (the tools we'll use for quality control)
print("🧪 Installing bioinformatics tools...")
print("   - FastQC: Industry standard for sequencing quality control")
print("   - MultiQC: Combines reports from multiple tools")
print("")
print("⏱️ This may take 3-5 minutes...")

!conda install -y fastqc multiqc

print("")
print("✅ Tool installation complete!")

# Verify installations
print("🔍 Verifying tool installations:")
!fastqc --version
!multiqc --version
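
The version checks above can also be done from Python, which is handy when a later cell should stop early if a tool is missing. This is a minimal sketch, not part of the original notebook: the helper names `find_tools` and `missing_tools` are hypothetical, and it assumes the tools are looked up on the same `PATH` that was extended earlier.

```python
import shutil


def find_tools(names):
    """Map each tool name to its resolved path, or None when it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the tool names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


# Report which of the two QC tools (if any) are not reachable.
missing = missing_tools(["fastqc", "multiqc"])
print("Missing tools:", missing if missing else "none")
```

If the list is not empty, re-running the installation cells is usually the first thing to try.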

---

# 📁 Step 2: Download Sample Data

Now we'll download realistic RNA-seq data designed for learning. This data has the same characteristics as real research data but is small enough to process quickly.

**🔬 About Our Sample Data**:
- **Type**: Paired-end RNA-seq reads from a human cell line
- **Size**: 10,000 read pairs (~1.5MB total)
- **Processing time**: Under 1 minute for quality control
- **Educational features**: Realistic quality patterns for learning interpretation
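
Once the download cell below has run, the stated size can be checked directly. The sketch below counts reads in each file and confirms that R1 and R2 contain the same number, which paired-end tools generally expect. It assumes plain four-line FASTQ records (no wrapped sequence lines), and the function names are illustrative rather than part of any library.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming four lines per read."""
    with gzip.open(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError(f"{path}: {line_count} lines is not a multiple of 4")
    return line_count // 4


def check_pairing(r1_path, r2_path):
    """Return the shared read count, or raise if R1 and R2 disagree."""
    n1 = count_fastq_reads(r1_path)
    n2 = count_fastq_reads(r2_path)
    if n1 != n2:
        raise ValueError(f"Unpaired files: R1 has {n1} reads, R2 has {n2}")
    return n1


# Example, after the files have been downloaded:
# check_pairing("data/raw/sample_R1.fastq.gz", "data/raw/sample_R2.fastq.gz")
```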

In [None]:
# Create directory structure (like a real bioinformatics project)
print("📁 Setting up project structure...")
!mkdir -p data/raw results/qc

# Download sample FASTQ files
print("⬇️ Downloading sample RNA-seq data...")
!wget -q -O data/raw/sample_R1.fastq.gz "https://github.com/shandley/claude-for-bioinformatics/raw/master/guided-tutorials/01-first-rnaseq-analysis/sample-data/sample_R1.fastq.gz"
!wget -q -O data/raw/sample_R2.fastq.gz "https://github.com/shandley/claude-for-bioinformatics/raw/master/guided-tutorials/01-first-rnaseq-analysis/sample-data/sample_R2.fastq.gz"

print("✅ Data download complete!")

# Examine what we downloaded
print("\n📊 Sample data overview:")
!ls -lh data/raw/

print("\n🔍 Quick peek at the data format:")
!gunzip -c data/raw/sample_R1.fastq.gz | head -8

**📚 Understanding FASTQ Format**:

Each read has 4 lines:
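1. `@HWI-ST1276...` - Read identifier (a unique name for the read, always starting with `@`)
2. `GATAGGCATA...` - Base calls (A, C, G, T, or N for a base that could not be called)
3. `+` - Separator line
4. `IEBFGFCHEG...` - Quality scores, one ASCII character per base (characters later in ASCII order = better quality)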

This is the standard format for raw sequencing data worldwide!
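
To make the four-line layout concrete, here is a small sketch that reads records and turns the quality line into numbers. It assumes the Phred+33 encoding used by current Illumina instruments (each character's ASCII code minus 33 is the quality score) and unwrapped four-line records. The function names and the example read are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality_string) for each four-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def phred_scores(quality, offset=33):
    """Convert an ASCII quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# A single made-up record, held in memory instead of a file.
example = io.StringIO("@read1\nGATAGG\n+\nIEBFGF\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, phred_scores(qual))
# prints: read1 GATAGG [40, 36, 33, 37, 38, 37]
```

The same generator works on a real file opened with `gzip.open(path, "rt")`.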

---

# 🤖 Step 3: Claude Code Integration

In a real workflow, this is where you'd use Claude Code to plan your analysis. Let's simulate how that conversation would go:

**🗣️ Example Claude Code Interaction**:

```
You: I have paired-end RNA-seq FASTQ files and need to run comprehensive 
     quality control analysis.

Claude Code: I'll help you run FastQC and MultiQC for quality control. 
             Here's the workflow:

1. Run FastQC on both R1 and R2 files
2. Generate MultiQC report to combine results  
3. Interpret key quality metrics
4. Determine if data needs preprocessing

Let me provide the specific commands...
```

**💡 In this tutorial**: We'll run the analysis directly, but in your real research, Claude Code would provide the exact commands and help interpret results.

---

# 🔬 Step 4: Quality Control Analysis

Now we'll run the same quality control analysis used in professional bioinformatics labs worldwide. FastQC analyzes sequence quality, and MultiQC combines the reports into a beautiful summary.

**⚡ This is the exciting part - we're about to generate real research-quality results!**
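
Outside a notebook, the same FastQC step can be driven from a Python script. The sketch below is one way to do that, assuming `fastqc` is on `PATH`. The helper names are hypothetical, and the only FastQC options used are `-q` (quiet), `-t` (threads) and `-o` (output directory).

```python
import subprocess
from pathlib import Path


def build_fastqc_command(fastq_files, outdir, threads=1):
    """Assemble the FastQC command line as a list of arguments."""
    return ["fastqc", "-q", "-t", str(threads), "-o", str(outdir),
            *[str(path) for path in fastq_files]]


def run_fastqc(fastq_files, outdir, threads=1):
    """Create the output directory, then run FastQC and fail loudly on errors."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_fastqc_command(fastq_files, outdir, threads), check=True)


# Example:
# run_fastqc(["data/raw/sample_R1.fastq.gz", "data/raw/sample_R2.fastq.gz"],
#            "results/qc", threads=2)
```

Passing the command as a list avoids shell quoting problems with unusual file names, and `check=True` raises an exception if FastQC exits with an error.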

In [None]:
# Run FastQC on both files
print("🧪 Running FastQC quality control analysis...")
print("   Analyzing forward reads (R1)...")
!fastqc data/raw/sample_R1.fastq.gz -o results/qc/ -q

print("   Analyzing reverse reads (R2)...")
!fastqc data/raw/sample_R2.fastq.gz -o results/qc/ -q

print("✅ FastQC analysis complete!")

# Check what files were created
print("\n📁 Generated files:")
!ls -la results/qc/

In [None]:
# Run MultiQC to combine the reports
print("📊 Creating combined MultiQC report...")
!multiqc results/qc/ -o results/qc/ -q

print("✅ MultiQC report generated!")
print("\n📄 Final report files:")
!ls -la results/qc/*.html

---

# 📈 Step 5: Viewing and Interpreting Results

Congratulations! You've just generated publication-quality QC reports. Let's examine what we found.

**🎉 You now have the same reports that professional bioinformaticians create for every RNA-seq project!**
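
Alongside each HTML report, FastQC writes a zip archive containing a `summary.txt` file with one PASS, WARN or FAIL line per module. The sketch below reads that file so the module statuses can be listed without opening the report. It assumes the usual tab-separated layout (status, module name, file name), and the function names are illustrative.

```python
import zipfile


def read_fastqc_summary(zip_path):
    """Return {module_name: status} from the summary.txt inside a FastQC zip."""
    with zipfile.ZipFile(zip_path) as archive:
        summary_name = next(
            name for name in archive.namelist() if name.endswith("summary.txt")
        )
        text = archive.read(summary_name).decode("utf-8")
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


def flagged_modules(statuses):
    """Return the modules whose status is anything other than PASS."""
    return [module for module, status in statuses.items() if status != "PASS"]


# Example, after FastQC has run:
# summary = read_fastqc_summary("results/qc/sample_R1_fastqc.zip")
# print(flagged_modules(summary))
```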

In [None]:
# Let's examine the summary statistics
print("📊 Quality Control Summary")
print("=" * 50)

# Read MultiQC general stats if available
import os
if os.path.exists('results/qc/multiqc_data/multiqc_general_stats.txt'):
    print("📈 MultiQC General Statistics:")
    !head -5 results/qc/multiqc_data/multiqc_general_stats.txt
else:
    print("📁 Report files generated - ready for viewing!")

print("\n🔍 Individual FastQC reports created for:")
!ls results/qc/*_fastqc.html

In [None]:
# Create a downloadable zip of all results
print("📦 Creating downloadable results package...")
!zip -r RNA_seq_QC_results.zip results/qc/

print("✅ Results package created: RNA_seq_QC_results.zip")
print("\n📥 To download your results:")
print("   1. Click the folder icon on the left sidebar")
print("   2. Find 'RNA_seq_QC_results.zip'")
print("   3. Right-click and select 'Download'")
print("")
print("🖥️ Then open the HTML files on your computer to view the reports!")

# Show file sizes
print("\n📏 Your results package:")
!ls -lh RNA_seq_QC_results.zip

---

# 🎓 Step 6: Understanding Your Results

## Key Quality Metrics to Understand:

### ✅ **Per-base Sequence Quality**
- **Green zone (>28)**: Excellent quality
- **Orange zone (20-28)**: Good quality  
- **Red zone (<20)**: Poor quality
- **Normal pattern**: Slight decline toward 3' end
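
A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that a base call is wrong and Q30 means 1 in 1,000. The sketch below applies the thresholds listed above. How scores of exactly 20 and 28 are assigned is an assumption here, and the function names are illustrative.

```python
def error_probability(phred_score):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-phred_score / 10)


def quality_zone(phred_score):
    """Classify a Phred score using the thresholds from the list above."""
    if phred_score >= 28:  # boundary handling is an assumption
        return "excellent"
    if phred_score >= 20:
        return "good"
    return "poor"


for score in (35, 24, 12):
    print(score, quality_zone(score), f"{error_probability(score):.4f}")
```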

### ⚠️ **Sequence Duplication Levels**
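- **RNA-seq expectation**: 15-30% duplication is common, and FastQC may still flag it with a warning
- **Why**: Highly expressed genes produce many identical reads, so some duplication is biological rather than technical
- **Concern level**: >50% deserves a closer look (possible low library complexity or PCR over-amplification)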
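
As a rough illustration of what a duplication percentage means, the sketch below counts how many reads are exact copies of a sequence already seen. This is a simplification: FastQC's own estimate is computed differently, so the two numbers will not match exactly. The function name is hypothetical.

```python
from collections import Counter


def duplication_percent(sequences):
    """Percentage of reads that are exact repeats of an earlier sequence."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)  # every copy beyond the first
    return 100.0 * duplicates / total


reads = ["ACGT", "ACGT", "TTGA", "CCAG"]
print(duplication_percent(reads))  # prints: 25.0
```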

### 🔍 **Adapter Content**
- **Our data**: ~5% adapter contamination (included deliberately for teaching)
- **Real decision**: >10% usually needs trimming
- **Learning point**: Adapter detection is crucial
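
The trimming decision above can be written as a simple check. The sketch below counts reads that contain an adapter sequence and compares the fraction with the 10% rule of thumb. The default sequence `AGATCGGAAGAGC` is the commonly used start of the Illumina TruSeq adapter and is an assumption about this dataset. Only exact matches are counted, whereas real trimming tools also handle mismatches and partial adapters at read ends.

```python
def adapter_fraction(sequences, adapter="AGATCGGAAGAGC"):
    """Fraction of reads containing an exact copy of the adapter sequence."""
    sequences = list(sequences)
    if not sequences:
        return 0.0
    hits = sum(1 for seq in sequences if adapter in seq)
    return hits / len(sequences)


def needs_trimming(fraction, threshold=0.10):
    """Apply the rule of thumb: more than 10% adapter content suggests trimming."""
    return fraction > threshold


reads = ["AAAGATCGGAAGAGC", "CCCC", "GGGG", "TTTT"]
fraction = adapter_fraction(reads)
print(fraction, needs_trimming(fraction))  # prints: 0.25 True
```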

### 📊 **GC Content**
- **Expected**: Species-specific distribution
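- **Human**: ~41% average GC content across the genome; RNA-seq reads come from transcribed regions, so a sample's average can differ from this figure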
- **Interpretation**: Major deviations suggest contamination
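
GC content is simply the share of called bases that are G or C. A minimal sketch, ignoring `N` and any other non-ACGT characters (a choice made for this example, not necessarily how FastQC counts them):

```python
def gc_percent(sequence):
    """Percentage of A/C/G/T bases in the sequence that are G or C."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


print(gc_percent("GATC"))  # prints: 50.0
```

Computing this per read and plotting the distribution gives the same kind of curve FastQC shows in its per-sequence GC content module.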

## 🎯 What This Means for Your Data:

Based on our tutorial dataset, you should see:
- ✅ Generally high quality scores
- ⚠️ Some warnings included deliberately for teaching
- 📈 Realistic patterns you'll see in real data
- 🚀 Data suitable for downstream analysis

**🏆 Congratulations! You've successfully completed your first bioinformatics quality control analysis!**

---

# 🤖 Step 7: How Claude Code Enhances This Workflow

In a real research environment, Claude Code would help at every step:

## 🔄 **Planning Phase**
```
You: "I have RNA-seq data and need to assess quality"
Claude: "I'll guide you through QC with FastQC and MultiQC..."
```

## ⚙️ **Command Generation**
```
You: "Create commands for quality control"
Claude: "Here are the optimized commands for your data..."
```

## 📊 **Result Interpretation**
```
You: "What do these quality scores mean?"
Claude: "Based on your results, here's what I see..."
```

## 🚀 **Next Steps**
```
You: "What should I do next?"
Claude: "Based on your QC, I recommend..."
```

**💡 The Power**: Claude Code combines bioinformatics expertise with AI assistance, making complex analyses accessible to researchers at all levels.

---

# 🎯 Step 8: Next Steps in Your Learning Journey

## 🏆 What You've Accomplished
- ✅ Set up a complete bioinformatics environment
- ✅ Processed real RNA-seq data with industry-standard tools
- ✅ Generated publication-quality QC reports
- ✅ Learned to interpret key bioinformatics metrics
- ✅ Experienced the complete workflow from data to results

## 🚀 Ready for More Advanced Learning

### **Immediate Next Steps**:
1. **[Module 1.2: Understanding Your Results](../02-understanding-results/)** - Deep dive into QC interpretation
2. **[Module 1.3: Variant Calling Walkthrough](../03-variant-calling/)** - Apply these skills to a different analysis

### **When You're Ready for Local Setup**:
- **[Claude Code Best Practices](https://github.com/shandley/claude-for-bioinformatics/blob/master/claude-code-best-practices.md)** - Set up on your computer
- **[Complete SOP Guide](https://github.com/shandley/claude-for-bioinformatics/blob/master/Claude_Code_Bioinformatics_SOP.md)** - Production workflows

### **Advanced Learning Tracks**:
- **[Enhanced Educational Plan](https://github.com/shandley/claude-for-bioinformatics/blob/master/ENHANCED_EDUCATIONAL_PLAN.md)** - Complete learning roadmap
- **[Project Templates](https://github.com/shandley/claude-for-bioinformatics/tree/master/project-templates)** - Ready-to-use analysis structures

## 💬 Get Help and Share Success
- **Questions**: [GitHub Discussions](https://github.com/shandley/claude-for-bioinformatics/discussions)
- **Issues**: [Report Problems](https://github.com/shandley/claude-for-bioinformatics/issues)
- **Community**: Share your results and learn from others!

**🎉 Welcome to the world of AI-assisted bioinformatics!**

---

# 📝 Session Summary

## 🔬 Technical Skills Gained:
- Bioinformatics environment setup and tool installation
- FASTQ file format understanding and manipulation
- FastQC and MultiQC usage for quality control
- Quality metric interpretation and decision-making
- Professional workflow organization and documentation

## 🤖 Claude Code Integration Points:
- Workflow planning and optimization
- Command generation and parameter selection
- Result interpretation and next-step recommendations
- Troubleshooting and problem-solving assistance

## 📊 Real-World Applications:
- Quality assessment for any RNA-seq project
- Data preprocessing decision-making
- Publication-ready quality control reporting
- Team collaboration and result sharing

---

**🎓 Congratulations on completing Module 1.1!**

*You've taken your first step into AI-assisted bioinformatics. The skills you've learned here form the foundation for all advanced genomics analyses.*

**⭐ If this tutorial was helpful, please star the [GitHub repository](https://github.com/shandley/claude-for-bioinformatics) to help others discover it!**