### NGS: *Arabidopsis thaliana* RNA-Seq Preprocessing for DGE Analysis<br>  
 &nbsp;  rna-seq_03  : Streamlined Pipeline, **Paired-end**

---

This notebook is designed for performing NGS analysis on Google Colab.

*   Preprocessing *Arabidopsis thaliana* paired-end RNA-Seq data for subsequent differential gene expression (DGE) analysis.<br>

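*   The following files are expected on Google Drive, at the paths set by `drive_path` and `comon_ref` in the processing cell:<br>
accession_list.txt (in the project folder), and at_index*.ht2 and araport11.gtf (in the shared reference folder)<br>  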

Note: See the README for more details.<br>

---


Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount("/content/drive/")
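
With Drive mounted, it can help to confirm that the expected input files are present before any download starts. The following is a minimal sketch, not part of the original pipeline: `missing_inputs` is a hypothetical helper, and the two folder arguments are assumed to match the `drive_path` and `comon_ref` values used in the processing cell.

```python
from pathlib import Path


def missing_inputs(project_dir, ref_dir):
    """Return the expected input files that cannot be found (hypothetical helper)."""
    project_dir, ref_dir = Path(project_dir), Path(ref_dir)
    missing = []
    # One accession per line, read by the processing loop
    if not (project_dir / "accession_list.txt").is_file():
        missing.append(str(project_dir / "accession_list.txt"))
    # Annotation passed to featureCounts
    if not (ref_dir / "araport11.gtf").is_file():
        missing.append(str(ref_dir / "araport11.gtf"))
    # The HISAT2 index consists of several files sharing the at_index prefix
    if not any(ref_dir.glob("at_index*.ht2")):
        missing.append(str(ref_dir / "at_index*.ht2"))
    return missing
```

For example, `missing_inputs("/content/drive/MyDrive/ngs_analysis/rna-seq_03", "/content/drive/MyDrive/ngs_analysis/common_ref")` should return an empty list when everything is in place.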

Installing micromamba

In [None]:
!curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj -C /root/ --strip-components=1

In [None]:
import os
os.environ["PATH"] = os.path.abspath("/root") + os.pathsep + os.environ["PATH"] # Prepend /root (where the micromamba binary was extracted) to PATH

In [None]:
!micromamba create -y -n ngs  # Create virtual environment for NGS

In [None]:
os.environ["PATH"] = os.path.abspath("/root/.local/share/mamba/envs/ngs/bin") + os.pathsep + os.environ["PATH"]
# Prepend the ngs environment's bin directory to PATH so its tools can be called directly

Installing Required Tools

In [None]:
!micromamba install -y -q -n ngs -c conda-forge -c bioconda  sra-tools fastqc fastp hisat2 samtools subread
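
Because the install runs quietly (`-q`), a failed install is easy to miss. A small sketch of a sanity check follows. It assumes the packages above provide the executables listed in `REQUIRED_TOOLS`; `missing_tools` is a hypothetical helper name.

```python
import shutil

# Executables the packages above are assumed to provide
REQUIRED_TOOLS = [
    "prefetch", "fasterq-dump",   # sra-tools
    "fastqc", "fastp",
    "hisat2", "samtools",
    "featureCounts",              # subread
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `missing_tools()` after the PATH has been updated should return an empty list. Any name it returns points to a package that did not install or to a PATH entry that is missing.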

Raw Reads Processing: Downloading, Trimming, Mapping, and Binary Conversion
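
The processing cell reads accession_list.txt one line at a time, so stray whitespace or a mistyped entry only shows up once `prefetch` fails partway through a run. A minimal sketch of a pre-check is given here. It assumes the list holds run accessions (SRR, ERR, or DRR followed by digits); `read_accessions` is a hypothetical helper.

```python
import re

# Assumed format: SRA/ENA/DDBJ run accessions such as SRR1234567
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def read_accessions(path):
    """Return (valid, invalid) entries from a one-accession-per-line file."""
    valid, invalid = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            name = line.strip()  # drop surrounding whitespace and the newline
            if not name:
                continue  # ignore blank lines
            if RUN_ACCESSION.fullmatch(name):
                valid.append(name)
            else:
                invalid.append(name)
    return valid, invalid
```

If the second list is non-empty, correct those entries in the file before starting the run.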

In [None]:
%%bash
set -e
set -o pipefail
now=$(date '+%Y%m%d_%H%M%S')
logfile="/content/run_${now}.log"
exec > >(tee -a "$logfile") 2>&1

drive_path="/content/drive/MyDrive/ngs_analysis/rna-seq_03"
comon_ref="/content/drive/MyDrive/ngs_analysis/common_ref"

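while read -r file || [ -n "${file}" ]; do   # -r keeps backslashes literal; the || test handles a last line without a trailing newline
    [ -z "${file}" ] && continue   # Skip blank lines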
    echo "------------------------------------------------------------"
    start_time=$(date +%s)
    echo "<<< Start processing for ${file} :  [`date '+%F %T'`] >>> "

    echo "### [PREFETCH ] : ${file} ###"
    prefetch "${file}" -O /content/

    echo "### [FASTERQ-DUMP] : ${file} ###"
    fasterq-dump /content/"${file}" --split-files -O /content/
    rm  /content/"${file}"/"${file}.sra"  # Remove the downloaded .sra file to free disk space

    echo "### [ FASTP ] : ${file} ###"
    fastp   --in1 /content/"${file}_1.fastq" \
              --in2 /content/"${file}_2.fastq" \
             --out1 /content/"${file}_1_cleaned.fastq" \
             --out2 /content/"${file}_2_cleaned.fastq" \
             --html /content/"${file}.html"
    cp /content/"${file}.html" "${drive_path}"   # Save fastp report file to Google Drive
    rm  /content/"${file}_1.fastq"   /content/"${file}_2.fastq" # Remove raw FASTQ files (already trimmed by fastp)

    echo "### [ HISAT2 ] : ${file} ,  [SAM→BAM] ###"
    hisat2 -x ${comon_ref}/at_index \
           -1 /content/"${file}_1_cleaned.fastq" \
           -2 /content/"${file}_2_cleaned.fastq" \
           -p 4 \
           -k 2 \
           --phred33 |
           samtools view -@ 4 -b  -o /content/"${file}.bam"

    rm  /content/"${file}_1_cleaned.fastq"   /content/"${file}_2_cleaned.fastq"  # Remove cleaned fastq files

    echo "<< Completed processing for ${file} >>"
    end_time=$(date +%s)
    elapsed=$((end_time - start_time))
    echo "Time for ${file}: ${elapsed} seconds"
    echo "------------------------------------------------------------"
    cp /content/*.log ${drive_path} # Save log file to Google Drive
done < ${drive_path}/accession_list.txt

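echo "### [featureCounts] ###"
# -p marks the input as paired-end; since Subread 2.0.2, --countReadPairs is also needed to count fragments rather than individual reads
featureCounts -T 4 -p --countReadPairs \
              -t exon \
              -g gene_id \
              -a ${comon_ref}/araport11.gtf \
              -o /content/counts.txt \
              /content/*.bam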
cp /content/counts.txt  ${drive_path}   # Save read count file to Google Drive
cp /content/counts.txt.summary  ${drive_path}

echo "<<< featureCounts completed - result saved : [`date '+%F %T'`] >>>"
cp /content/*.log ${drive_path} # Save log file to Google Drive
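
The counts.txt.summary file written next to counts.txt is tab-separated: a `Status` header row with one column per BAM file, then one row per assignment category (`Assigned`, `Unassigned_...`). As a rough quality check, the share of assigned entries per sample can be computed as sketched below. `assignment_rates` is a hypothetical helper, and what counts as an acceptable rate is left to the reader.

```python
import csv


def assignment_rates(summary_path):
    """Return {sample column: Assigned / total} from a featureCounts summary file."""
    with open(summary_path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    samples = rows[0][1:]                 # first row: "Status" plus one column per BAM
    totals = [0] * len(samples)
    assigned = [0] * len(samples)
    for row in rows[1:]:
        counts = [int(value) for value in row[1:]]
        totals = [t + c for t, c in zip(totals, counts)]
        if row[0] == "Assigned":
            assigned = counts
    return {
        sample: (a / t if t else 0.0)
        for sample, a, t in zip(samples, assigned, totals)
    }
```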


🔚 End of this notebook:<br>&nbsp;&nbsp; &nbsp;&nbsp; Next, proceed to DGE analysis using the generated count matrix (counts.txt).
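
As a starting point for that next step, counts.txt can be read into a simple gene-by-sample structure. The file begins with a `#` comment line recording the command, followed by a tab-separated table whose first six columns are Geneid, Chr, Start, End, Strand and Length, and whose remaining columns hold one count per BAM file. The sketch below uses only the standard library. `read_count_matrix` is a hypothetical helper, and shortening each column name to the BAM file's stem is an assumption about how samples should be labeled.

```python
import csv
from pathlib import Path


def read_count_matrix(counts_path):
    """Return (sample_names, {gene_id: [count per sample]}) from featureCounts output."""
    with open(counts_path, newline="") as handle:
        rows = [
            row for row in csv.reader(handle, delimiter="\t")
            if row and not row[0].startswith("#")   # skip the command comment line
        ]
    header, body = rows[0], rows[1:]
    # Columns 0-5 are annotation fields; the rest are per-BAM counts
    samples = [Path(column).stem for column in header[6:]]
    counts = {row[0]: [int(value) for value in row[6:]] for row in body}
    return samples, counts
```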



---

