#### [NGS Analysis] Preprocessing Arabidopsis RNA-Seq Data for DGE Analysis   <br>&nbsp; RNAseq_02  : Partially Streamlined Pipeline

---
<small>This notebook is intended to run on Google Colab.  
・It preprocesses single-end Arabidopsis thaliana RNA-Seq data in preparation for DGE analysis.  
・It assumes the following files are already stored in your Google Drive folder:  
&nbsp;&nbsp; &nbsp;&nbsp;accession_list.txt, at_index*.ht2, araport11.gtf  
Note: see the Readme for details.  </small>

---


■ Initial setup: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount("/content/drive/")
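
The notebook assumes that accession_list.txt, at_index*.ht2, and araport11.gtf are already in Google Drive. Once Drive is mounted, a check along the following lines could confirm that before any long-running step starts. This is only a sketch: the two folder paths are copied from the pipeline cell later in this notebook, and the helper name `find_missing_inputs` is hypothetical.

```python
from pathlib import Path

# Folder layout copied from the pipeline cell later in this notebook.
DRIVE_PATH = Path("/content/drive/MyDrive/ngs_analysis/rna-seq_02")
COMMON_REF = Path("/content/drive/MyDrive/ngs_analysis/common_ref")


def find_missing_inputs(drive_path, common_ref):
    """Return the expected input files (as strings) that could not be found."""
    drive_path, common_ref = Path(drive_path), Path(common_ref)
    missing = []
    if not (drive_path / "accession_list.txt").is_file():
        missing.append(str(drive_path / "accession_list.txt"))
    if not (common_ref / "araport11.gtf").is_file():
        missing.append(str(common_ref / "araport11.gtf"))
    # The HISAT2 index consists of several files sharing the "at_index" prefix.
    if not list(common_ref.glob("at_index*.ht2")):
        missing.append(str(common_ref / "at_index*.ht2"))
    return missing


missing = find_missing_inputs(DRIVE_PATH, COMMON_REF)
if missing:
    print("Missing input files:")
    for item in missing:
        print("  " + item)
else:
    print("All expected input files were found.")
```

An empty result means all three kinds of input were found; it does not validate their contents.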

■ Initial setup: Install Micromamba

In [None]:
!curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj -C /root/ --strip-components=1

In [None]:
import os
os.environ["PATH"] = os.path.abspath("/root") + os.pathsep + os.environ["PATH"] # Add /root, where the micromamba binary was extracted, to PATH

In [None]:
!micromamba create -y -n ngs  # Create a virtual environment for NGS tools

In [None]:
os.environ["PATH"] = os.path.abspath("/root/.local/share/mamba/envs/ngs/bin") + os.pathsep + os.environ["PATH"]  # Add the ngs environment's bin directory to PATH

In [None]:
!micromamba install -y -q -n ngs -c conda-forge -c bioconda  sra-tools fastqc fastp hisat2 samtools subread
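
Because the tools are made available only by editing `PATH`, it may be worth confirming that every executable the pipeline calls can actually be resolved before starting it. The sketch below uses `shutil.which`; the tool list is taken from the commands used in the pipeline cell, and the helper name `find_missing_tools` is hypothetical.

```python
import shutil

# Executables invoked by the pipeline cell in this notebook.
REQUIRED_TOOLS = [
    "prefetch",
    "fasterq-dump",
    "fastqc",
    "fastp",
    "hisat2",
    "samtools",
    "featureCounts",
]


def find_missing_tools(tools):
    """Return the tools that cannot be located on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing_tools = find_missing_tools(REQUIRED_TOOLS)
if missing_tools:
    print("Not found on PATH: " + ", ".join(missing_tools))
else:
    print("All required tools are available.")
```

If anything is reported as missing, the likely causes are a failed install or a `PATH` entry that does not match where micromamba created the environment.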

■  SRA download → dump → QC → trimming → mapping → BAM conversion, sorting, and indexing >>> quantification  
　 prefetch → fasterq-dump → fastqc → fastp → hisat2 → samtools view / sort / index >>> featureCounts
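
To make the per-sample flow concrete, the following sketch lists the files each accession is expected to leave in /content after the loop in the next cell. The file names follow the naming used in that cell plus the default output names of FastQC (`_fastqc.html`) and `samtools index` (`.bam.bai`). The helper names `read_accessions` and `expected_outputs` are hypothetical, and blank lines in the accession list are skipped here as an assumption about the file's format.

```python
from pathlib import Path


def read_accessions(list_path):
    """Return the non-empty, whitespace-stripped lines of an accession list."""
    with open(list_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def expected_outputs(accession, work_dir="/content"):
    """Map each pipeline step to the file it should produce for one accession."""
    work_dir = Path(work_dir)
    return {
        "fastq": work_dir / f"{accession}.fastq",                 # fasterq-dump
        "fastqc_report": work_dir / f"{accession}_fastqc.html",   # fastqc
        "trimmed_fastq": work_dir / f"{accession}_cleaned.fastq", # fastp -o
        "fastp_report": work_dir / f"{accession}.html",           # fastp -h
        "bam": work_dir / f"{accession}.bam",                     # samtools sort -o
        "bam_index": work_dir / f"{accession}.bam.bai",           # samtools index
    }


# Path copied from the pipeline cell; the report is skipped if Drive is not mounted.
list_path = Path("/content/drive/MyDrive/ngs_analysis/rna-seq_02/accession_list.txt")
if list_path.is_file():
    for acc in read_accessions(list_path):
        for step, path in expected_outputs(acc).items():
            status = "ok" if path.exists() else "missing"
            print(f"{acc}\t{step}\t{status}")
```

Running this after the pipeline gives a quick view of which samples completed every step.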


In [None]:
%%bash
set -e
set -o pipefail

drive_path="/content/drive/MyDrive/ngs_analysis/rna-seq_02"
common_ref="/content/drive/MyDrive/ngs_analysis/common_ref"

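while IFS= read -r file || [ -n "${file}" ]; do   # -r keeps the line verbatim; the || test also handles a last line without a trailing newline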
    echo "------------------------------------------------------------"
    echo "<<< Start processing for ${file} >>> "

    echo "### [PREFETCH ] : ${file} ###"
    prefetch "${file}" -O /content/

    echo "### [FASTERQ-DUMP] : ${file} ###"
    fasterq-dump /content/"${file}" -O /content/

    echo "### [ FASTQC ] :  ${file} ###"
    fastqc /content/"${file}.fastq"

    echo "### [ FASTP ] : ${file} ###"
    fastp -i /content/"${file}.fastq" \
            -o /content/"${file}_cleaned.fastq" \
            --trim_poly_g \
            --trim_poly_x \
            --length_required 20 \
            -h /content/"${file}.html"

    echo "### [ HISAT2 ] : ${file} ,  [SAM→BAM→ sorting] ###"
    hisat2 -x "${common_ref}/at_index" \
           -U /content/"${file}_cleaned.fastq" \
           -p 2 \
           --phred33 | \
      samtools view -b - | \
      samtools sort -@ 2 -o /content/"${file}.bam" -   # "-" reads from stdin; sorted BAM is also usable for genome browser visualization

    echo "### [BAM Indexing] : ${file} ###"
    samtools index /content/"${file}.bam"  # for genome browser visualization

    echo "<<< Completed processing for ${file} >>>"
    echo "------------------------------------------------------------"

done < "${drive_path}/accession_list.txt"

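echo "### [featureCounts] ###"
# Write the output to /content explicitly so the cp commands below find it regardless of the working directory
featureCounts -T 2 \
              -t exon \
              -g gene_id \
              -a "${common_ref}/araport11.gtf" \
              -o /content/counts.txt \
              /content/*.bam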

cp /content/counts.txt  "${drive_path}"   # Save read count file to Google Drive
cp /content/counts.txt.summary  "${drive_path}"

echo "### Count file saved ###"

cp /content/*.bai "${drive_path}"   # Copy BAM indices if genome browser visualization is intended
cp /content/*.bam "${drive_path}"   # Copy BAM files if genome browser visualization is intended
cp /content/*.html "${drive_path}"  # Copy FastQC and fastp reports to assess FASTQ quality before and after trimming
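
Before moving on, it can help to confirm that counts.txt looks as expected. featureCounts writes a `#` comment line, then a tab-separated header whose first six columns are Geneid, Chr, Start, End, Strand, and Length, followed by one column per BAM file. The sketch below parses that layout using only the standard library and sums the counts per sample. The helper names `read_featurecounts` and `library_sizes` are hypothetical, and integer counts are assumed (the default when no fractional counting option is used).

```python
import csv
import os


def read_featurecounts(path):
    """Parse a featureCounts table into (sample_names, {gene_id: [counts]})."""
    with open(path, newline="", encoding="utf-8") as handle:
        # Skip the leading "# Program:featureCounts ..." comment line.
        data_lines = (line for line in handle if not line.startswith("#"))
        rows = csv.reader(data_lines, delimiter="\t")
        header = next(rows)
        samples = header[6:]  # columns after Geneid, Chr, Start, End, Strand, Length
        counts = {row[0]: [int(v) for v in row[6:]] for row in rows if row}
    return samples, counts


def library_sizes(samples, counts):
    """Sum the assigned read counts per sample."""
    totals = [0] * len(samples)
    for values in counts.values():
        for i, value in enumerate(values):
            totals[i] += value
    return dict(zip(samples, totals))


# Output location used by this notebook; skipped if the pipeline has not been run yet.
counts_path = "/content/counts.txt"
if os.path.isfile(counts_path):
    samples, counts = read_featurecounts(counts_path)
    print(f"{len(counts)} genes x {len(samples)} samples")
    for sample, total in library_sizes(samples, counts).items():
        print(f"{sample}\t{total}")
```

A sample whose total is far lower than the others may point to a failed download or a mapping problem; counts.txt.summary gives the breakdown of assigned and unassigned reads.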


🔚 This notebook ends here.  <br>
&nbsp;&nbsp; &nbsp;&nbsp;The next step is DGE analysis using the count matrix (counts.txt) created above.  



---

