# FASTQ Quality Control and Preprocessing Pipeline

## Overview
This notebook performs quality control and preprocessing of RNA-seq FASTQ files using FASTP. It processes raw sequencing data from reproducibility experiments, performs quality trimming, and organizes files into appropriate directory structures for downstream analysis.

## Pipeline Workflow
1. **Data Download**: Retrieve sequencing results from the core facility
2. **Environment Setup**: Activate conda environment and verify tool installations
3. **Quality Control**: Run FASTP for adapter trimming and quality filtering
4. **File Organization**: Organize processed files into sample-specific directories
5. **File Merging**: Combine technical replicates (lanes) into single files per sample
6. **Quality Assessment**: Verify the merged files and the fastp HTML reports for each sample

## Input Data
- Raw FASTQ files from sequencing core facility
- Pool: POOL-631 (Project P2794_HSIU-CHUAN)
- Location: `/BSSE_TREUTLEIN/TREUTLEIN/P2794_HSIU-CHUAN/POOL-631`

## Output Structure
```
Reproducibility_SC102A1/
├── [Sample_ID]/
│   └── 00.CLEAN_FQ/
│       ├── [Sample_ID]_R1.merged.fastq.gz
│       ├── [Sample_ID]_R2.merged.fastq.gz
│       ├── *_fastp.html (quality reports)
│       ├── *_fastp.json (quality metrics)
│       └── *_fastp.log (processing logs)
```

## Software Requirements
- FASTP (adapter trimming and quality control)
- GNU Parallel (parallel processing)
- Standard Unix tools (sed, basename, etc.)

## Notes
- Uses parallel processing for efficient handling of multiple samples
- Merges technical replicates (lanes) into single files per sample
- Generates comprehensive quality reports for each sample

## Configuration and Environment Setup

In [None]:
# =============================================================================
# CONFIGURATION VARIABLES
# =============================================================================

# Core processing parameters
export MAXCORES=8
export PARALLEL_JOBS=15

# File paths and directories
export SEQUENCING_POOL="/BSSE_TREUTLEIN/POOL-631"
export RAW_DATA_DIR="/links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw"
export PROJECT_NAME="Reproducibility_SC102A1"
export SCRIPTS_DIR="/links/groups/treutlein/SCRIPTS/sequencing"

# Reference genome path
export STARGEN="/local1/sequencing/PUBLIC_DATA/genomes/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/star/"

echo "Configuration loaded:"
echo "  Max cores: $MAXCORES"
echo "  Parallel jobs: $PARALLEL_JOBS"
echo "  Project: $PROJECT_NAME"
echo "  Raw data: $RAW_DATA_DIR"
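
With the values above, up to 15 fastp jobs run at once with 8 threads each, which is 120 threads in total. The following is a minimal sketch, not part of the original pipeline, that compares this budget with the cores reported by `nproc`. The helper names `thread_budget` and `check_thread_budget` are hypothetical, and `nproc` (GNU coreutils) is assumed to be installed.

```shell
#!/usr/bin/env bash
# Sketch: compare the requested thread budget with the cores on this machine.
# thread_budget and check_thread_budget are hypothetical helper names.

# Total threads requested = concurrent jobs x threads per job.
thread_budget() {
    local jobs=$1 threads_per_job=$2
    echo $(( jobs * threads_per_job ))
}

# Print a warning and return 1 when the budget exceeds the available cores.
check_thread_budget() {
    local jobs=$1 threads_per_job=$2 available=$3
    local requested
    requested=$(thread_budget "$jobs" "$threads_per_job")
    if (( requested > available )); then
        echo "WARNING: $requested threads requested, $available cores available"
        return 1
    fi
    echo "OK: $requested threads requested, $available cores available"
}

# Defaults mirror the configuration cell; fall back to 1 core if nproc is absent.
available_cores=$(nproc 2>/dev/null || echo 1)
check_thread_budget "${PARALLEL_JOBS:-15}" "${MAXCORES:-8}" "$available_cores" || true
```

If the check warns, lowering `PARALLEL_JOBS` or `MAXCORES` so that their product fits the machine is one option; which of the two to reduce depends on the workload.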

## 1. Data Acquisition and Download

In [None]:
# Download sequencing data from core facility
# Pool: POOL-631 (Project P2794_HSIU-CHUAN)
# Location: /BSSE_TREUTLEIN/TREUTLEIN/P2794_HSIU-CHUAN/POOL-631

echo "Data download location: $SEQUENCING_POOL"

In [None]:
# Check sequencing download script options
python $SCRIPTS_DIR/get_sequencing_results.py -h

Traceback (most recent call last):
  File "/links/groups/treutlein/SCRIPTS/sequencing/get_sequencing_results.py", line 6, in <module>
    import git
ModuleNotFoundError: No module named 'git'
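
The traceback above shows the script stopping at `import git`, a module that is provided by the GitPython package. A minimal sketch of a pre-flight check follows; `has_python_module` is a hypothetical helper, and whether the module should be installed or the script should simply be run from a different environment is an assumption to verify for this setup.

```shell
# Sketch: test whether a Python module can be imported before calling a script
# that depends on it. has_python_module is a hypothetical helper name.
# PYTHON defaults to the `python` found on PATH.
has_python_module() {
    local module=$1
    "${PYTHON:-python}" -c "import ${module}" >/dev/null 2>&1
}

if has_python_module git; then
    echo "git module available"
else
    # The git module is provided by the GitPython package.
    echo "git module missing: install GitPython into the active environment"
fi
```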



In [None]:
# Download sequencing results from the specified pool
python $SCRIPTS_DIR/get_sequencing_results.py $SEQUENCING_POOL

## 2. Environment Setup and Tool Verification

In [None]:
# Activate conda environment for bulk RNA-seq processing
mamba activate bulk_seq
echo "Activated environment: $CONDA_DEFAULT_ENV"


In [None]:
# Verify FASTP installation and location
which fastp
# fastp prints its version to stderr, so redirect it to stdout instead of discarding it
fastp --version 2>&1 || echo "FASTP version check failed"

/nas/groups/treutlein/USERS/jjans/mambaforge/envs/bulk_seq/bin/fastp

In [None]:
# Verify STAR aligner installation (for downstream analysis)
which STAR
STAR --version 2>/dev/null || echo "STAR version check failed"

/nas/groups/treutlein/USERS/jjans/mambaforge/envs/bulk_seq/bin/STAR
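
The individual `which` checks above can be folded into one helper that fails early when any tool is missing. This is a minimal sketch; `require_tools` is a hypothetical name, and the tool list is taken from the Software Requirements section plus STAR.

```shell
# Sketch: report every required tool and return non-zero if any is missing.
# require_tools is a hypothetical helper name.
require_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "MISSING: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tools named in the Software Requirements list, plus STAR for the next notebook.
require_tools fastp parallel STAR || echo "Install the missing tools before continuing"
```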

## 3. FASTQ File Processing with FASTP

In [None]:
# List available FASTQ files to verify data presence
echo "Available R1 FASTQ files:"
ls $RAW_DATA_DIR/*/*R1*.fastq.gz | head -5
echo "Total R1 files: $(ls $RAW_DATA_DIR/*/*R1*.fastq.gz | wc -l)"

[0m[01;32m/links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Reproducibility_409B2_1_1/Reproducibility_409B2_1_1_S1_L001_R1_001.fastq.gz[0m[K
[01;32m/links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Reproducibility_409B2_1_1/Reproducibility_409B2_1_1_S2_L002_R1_001.fastq.gz[0m[K
[01;32m/links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Reproducibility_409B2_1_10/Reproducibility_409B2_1_10_S1_L001_R1_001.fastq.gz[0m[K
[01;32m/links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Reproducibility_409B2_1_10/Reproducibility_409B2_1_10_S2_L002_R1_001.fastq.gz[0m[K
[01;32m/links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Reproducibility_409B2_1_11/Reproducibility_409B2_1_11_S1_L001_R1_001.fastq.gz[0m[K
[01;32m/links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Reproducibility_409B2_1_11/Reproducibility_409B2_1_11_S2_L002_R1_001.fastq.gz[0m[K
[01;32m/links/groups/


In [None]:
# =============================================================================
# MAIN FASTP PROCESSING PIPELINE
# =============================================================================

echo "Starting FASTP processing with $PARALLEL_JOBS parallel jobs"
echo "Using $MAXCORES cores per job"
echo "Processing files from: $RAW_DATA_DIR"

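# Process all R1 FASTQ files in parallel.
# {/..L} = file name up to the lane (<sample>_S<n>_L<lane>); {/..} = sample ID.
# The anchored patterns keep sample names that contain _S or _R intact.
# printf replaces ls so that no color codes or aliases leak into the file list.
printf '%s\n' "$RAW_DATA_DIR"/*/*R1*.fastq.gz | parallel -j "$PARALLEL_JOBS" --plus \
    --rpl '{/..L} s:\.fastq\.gz$::;s:.*/::;s:_R[12]_\d+$::' \
    --rpl '{/..} s:\.fastq\.gz$::;s:.*/::;s:_S\d+_L\d+_R[12]_\d+$::' "
        echo 'Processing sample: {/..}'

        # Create the sample directory and its clean FASTQ subdirectory
        # under the project directory, where the merge step expects them
        mkdir -p $PROJECT_NAME/{/..}/00.CLEAN_FQ
        cd $PROJECT_NAME/{/..} || exit 1

        # Run FASTP quality control and trimming
        # The R2 file is found by swapping _R1_ for _R2_ in the input path
        fastp \
            -i {} \
            -I {=s/_R1_/_R2_/=} \
            -o 00.CLEAN_FQ/{/..L}_R1.clean.fastq.gz \
            -O 00.CLEAN_FQ/{/..L}_R2.clean.fastq.gz \
            -h 00.CLEAN_FQ/{/..L}_fastp.html \
            -j 00.CLEAN_FQ/{/..L}_fastp.json \
            --thread $MAXCORES \
            &>> 00.CLEAN_FQ/{/..L}_fastp.log

        echo 'Completed: {/..L}'
"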

echo "FASTP processing completed for all samples"

/nas/groups/treutlein/USERS/jjans/analysis/iNeuron_morphogens/revisions/bulk_experiments/Reproducibility_409B2_1_10
/nas/groups/treutlein/USERS/jjans/analysis/iNeuron_morphogens/revisions/bulk_exp
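
The `--rpl` expressions in the fastp cell derive a per-lane prefix and a sample ID from each R1 file name. The same mapping can be sketched with bash parameter expansion alone, which makes it easy to test on a single path before launching all jobs. This is an illustration, not the notebook's own code: it assumes Illumina-style names of the form `<sample>_S<n>_L<lane>_R1_<chunk>.fastq.gz`, and `lane_prefix`, `sample_id` and `r2_path` are hypothetical helper names.

```shell
# Sketch: derive the names used by the fastp step from one R1 path.
# Assumes names like <sample>_S<n>_L<lane>_R1_<chunk>.fastq.gz inside a directory.

# File name without directory and without the _R1_<chunk>.fastq.gz suffix
lane_prefix() {
    local name=${1##*/}
    name=${name%.fastq.gz}
    echo "${name%_R[12]_*}"
}

# Lane prefix without the trailing _S<n>_L<lane> part, i.e. the sample ID
sample_id() {
    local prefix
    prefix=$(lane_prefix "$1")
    prefix=${prefix%_L[0-9]*}
    echo "${prefix%_S[0-9]*}"
}

# Path of the mate file: only the last _R1_ in the file name is replaced
r2_path() {
    local dir=${1%/*} name=${1##*/}
    echo "$dir/${name%_R1_*}_R2_${name##*_R1_}"
}

# Example path taken from the file listing earlier in this notebook
f="/links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Reproducibility_409B2_1_1/Reproducibility_409B2_1_1_S1_L001_R1_001.fastq.gz"
echo "lane prefix: $(lane_prefix "$f")"
echo "sample ID:   $(sample_id "$f")"
echo "R2 file:     $(r2_path "$f")"
```

Because only the trailing `_S<n>_L<lane>` part is stripped, a sample ID that itself contains `_S` followed by letters (as the project name `Reproducibility_SC102A1` does) is left untouched.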


## 4. File Organization and Merging

In [None]:
# =============================================================================
# MERGE TECHNICAL REPLICATES (LANES) INTO SINGLE FILES
# =============================================================================

echo "Merging technical replicates for each sample..."

# Define the base directory
BASE_DIR="$PROJECT_NAME"

# Merge FASTQ files within each sample folder
for FOLDER in "$BASE_DIR"/*; do
    # Skip anything that is not a sample directory with clean FASTQ files
    [ -d "$FOLDER/00.CLEAN_FQ" ] || continue

    # Get the folder name (sample ID)
    FOLDER_NAME=$(basename "$FOLDER")

    echo "Merging files for sample: $FOLDER_NAME"

    # Work in a subshell so the working directory of the session never changes,
    # even if cd fails
    (
        cd "$FOLDER/00.CLEAN_FQ" || exit 1

        # Concatenated gzip files form a valid gzip stream. The glob sorts lanes
        # the same way for R1 and R2, which keeps read pairs in sync.
        for READ in R1 R2; do
            if compgen -G "*_${READ}.clean.fastq.gz" > /dev/null; then
                cat *_"${READ}".clean.fastq.gz > "${FOLDER_NAME}_${READ}.merged.fastq.gz"
                echo "  ✓ $READ files merged"
            else
                echo "  ✗ no $READ clean files found" >&2
            fi
        done
    )
done

echo "File merging completed for all samples"
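
Merging with `cat` works because concatenated gzip members form a valid gzip stream, but it does not confirm that R1 and R2 still contain the same number of reads. A minimal sketch of such a check follows; `count_reads` and `check_pair` are hypothetical helper names, and the example sample name is taken from the file listing earlier in this notebook.

```shell
# Sketch: confirm that merged R1 and R2 files hold the same number of reads.
# count_reads and check_pair are hypothetical helper names.

# A FASTQ record spans four lines, so reads = lines / 4.
count_reads() {
    local lines
    lines=$(gzip -cd "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Print both counts and return 0 only when they are equal.
check_pair() {
    local r1=$1 r2=$2 n1 n2
    n1=$(count_reads "$r1")
    n2=$(count_reads "$r2")
    echo "$r1: $n1 reads, $r2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}

# Example call, guarded so that nothing happens when the files are absent.
sample="Reproducibility_409B2_1_1"
dir="${PROJECT_NAME:-Reproducibility_SC102A1}/$sample/00.CLEAN_FQ"
if [ -f "$dir/${sample}_R1.merged.fastq.gz" ] && [ -f "$dir/${sample}_R2.merged.fastq.gz" ]; then
    check_pair "$dir/${sample}_R1.merged.fastq.gz" "$dir/${sample}_R2.merged.fastq.gz"
fi
```

Decompressing every file is slow for large samples, so this is best treated as an occasional sanity check rather than a required pipeline step.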

## 5. Quality Assessment and Final Verification

In [None]:
# =============================================================================
# FINAL VERIFICATION AND SUMMARY
# =============================================================================

echo "=== PROCESSING SUMMARY ==="

# Count processed samples
echo "Processed samples:"
find "$PROJECT_NAME" -name "*_R1.merged.fastq.gz" | wc -l

# List final merged files
echo -e "\nFinal merged FASTQ files:"
find "$PROJECT_NAME" -name "*.merged.fastq.gz" | sort

# Check for quality reports
echo -e "\nQuality reports generated:"
find "$PROJECT_NAME" -name "*_fastp.html" | wc -l

# Display directory structure
echo -e "\nFinal directory structure:"
tree -L 2 "$PROJECT_NAME" 2>/dev/null || find "$PROJECT_NAME" -type d | sort

# This message is printed unconditionally, so review the counts above before continuing
echo -e "\n=== FASTP PIPELINE FINISHED ==="
echo "Next step: Run STAR alignment (see 01_Mapping_STAR.ipynb)"
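
The counts printed by the summary do not show which sample is incomplete. A minimal sketch that lists samples lacking a merged R1 or R2 file follows. It assumes the layout from the Output Structure section (`<project>/<sample>/00.CLEAN_FQ/`), and `missing_merged` is a hypothetical helper name.

```shell
# Sketch: list samples that lack a non-empty merged R1 or R2 file.
# missing_merged is a hypothetical helper name.
missing_merged() {
    local base=$1 dir sample mate found=0 status=0
    for dir in "$base"/*/; do
        # Only directories that contain a 00.CLEAN_FQ folder count as samples
        [ -d "${dir}00.CLEAN_FQ" ] || continue
        found=1
        sample=$(basename "$dir")
        for mate in R1 R2; do
            if [ ! -s "${dir}00.CLEAN_FQ/${sample}_${mate}.merged.fastq.gz" ]; then
                echo "$sample: missing $mate merged file"
                status=1
            fi
        done
    done
    if [ "$found" -eq 0 ]; then
        echo "no sample directories found under $base"
        return 1
    fi
    return "$status"
}

if missing_merged "${PROJECT_NAME:-Reproducibility_SC102A1}"; then
    echo "All samples have merged R1 and R2 files"
fi
```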