# M1 MEG - UE 5 A2B

## RNA-seq Data Analysis Pipeline - Part 1
### Practical 5 (TP5) - 24/09/2025

<div class="alert alert-info">
<b>Course Overview:</b><br>
This notebook covers the essential steps of the first part of an RNA-seq data analysis: <br>
    - Checking file integrity <br>
    - Quality control of the raw sequencing data <br>
    - Read preprocessing <br>
    - Alignment of the reads to the reference genome <br>
</div>

<div class="alert alert-warning">
<b>Environment Resources:</b><br>
- CPUs: 4 <br>
- RAM: 4 GB
</div>

<div class="alert alert-success">
<b>Data Locations:</b><br>
- Raw data: <code>/srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/</code> <br>
- Genome annotation: <code>/srv/data/Genomes/Mmu/GRCm39/extracted/genome_annotation-M37.gtf</code> <br>
- Genome index: <code>/srv/data/Genomes/Mmu/GRCm39/indexes_upto49bases/</code> <br>
</div>

## 0. Environment Setup

<div class="alert alert-info">
The required tools are pre-installed in the meg-m1-ue5-unix2 environment you are working in: <br>
- FastQC (v0.12.1) - Quality control <br>
- MultiQC (v1.13) - Aggregated reports <br>
- fastp (v0.23.1) - Read preprocessing <br>
- STAR (v2.7.11a) - Read alignment <br>
- samtools (v1.18) - BAM file manipulation <br>
</div>

Create the directories that will hold the results of each analysis step.

In [3]:
cd ~/meg-m1-ue5-unix2-testmapping
pwd

/srv/home/scaburet/meg-m1-ue5-unix2-testmapping


In [4]:
ls -lh

total 84K
drwxr-xr-x 2 scaburet scaburet 4.0K Sep 11 16:18 binder
-rw-r--r-- 1 scaburet scaburet  358 Sep 11 16:18 README.md
-rw-rw-r-- 1 scaburet     1012  76K Sep 11 16:19 TP5-QC-mapping.ipynb


In [5]:
samtools flagstat /srv/data/meg-m1-a2b/blumenthal-2014/chr7/3-bam/SRX1589831.chr7.bam

3341615 + 0 in total (QC-passed reads + QC-failed reads)
3168704 + 0 primary
172911 + 0 secondary
0 + 0 supplementary
185458 + 0 duplicates
185458 + 0 primary duplicates
3341615 + 0 mapped (100.00% : N/A)
3168704 + 0 primary mapped (100.00% : N/A)
3168704 + 0 paired in sequencing
1584802 + 0 read1
1583902 + 0 read2
3166272 + 0 properly paired (99.92% : N/A)
3166272 + 0 with itself and mate mapped
2432 + 0 singletons (0.08% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
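
The percentages printed by `samtools flagstat` can be rechecked from the raw counts. The sketch below hard-codes the counts from the output above and recomputes the two ratios with `awk`, assuming that both percentages are taken relative to the "paired in sequencing" count. The variable names are illustrative only.

```shell
# Counts copied from the flagstat output above (variable names are illustrative)
total=3341615
primary=3168704
secondary=172911
paired=3168704            # "paired in sequencing"
properly_paired=3166272
singletons=2432

# Total alignments = primary + secondary (no supplementary alignments here)
echo "primary + secondary = $((primary + secondary)) (reported total: $total)"

# Recompute the two percentages relative to the paired reads
pct_proper=$(awk -v a="$properly_paired" -v b="$paired" 'BEGIN { printf "%.2f", 100 * a / b }')
pct_single=$(awk -v a="$singletons" -v b="$paired" 'BEGIN { printf "%.2f", 100 * a / b }')
echo "properly paired: ${pct_proper}%"
echo "singletons:      ${pct_single}%"
```

The singletons are also the difference between the paired reads and the properly paired ones (3168704 - 3166272 = 2432).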


In [6]:
# Cell 1: Create the working directories
mkdir -p ./rnaseq/results/1-fastqc
mkdir -p ./rnaseq/results/2-fastpfq
mkdir -p ./rnaseq/results/3-bam
mkdir -p ./rnaseq/results/0-multiqc


## 1. Checking the Initial Data Files

In [7]:
ls -lh /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/

total 481M
-rw-rw-r--+ 1 scaburet 1012 372 Sep  9 15:47 md5sum.txt
-rw-rw-r--+ 1 scaburet 1012 14K Sep  3 17:18 samples-Blumenthal2014.tsv
-rw-rw-r--+ 1 scaburet 1012 80M Sep  9 15:46 SRX1589831_chr7_R1.fastq.gz
-rw-rw-r--+ 1 scaburet 1012 82M Sep  9 15:46 SRX1589831_chr7_R2.fastq.gz
-rw-rw-r--+ 1 scaburet 1012 90M Sep  9 15:46 SRX1589834_chr7_R1.fastq.gz
-rw-rw-r--+ 1 scaburet 1012 92M Sep  9 15:46 SRX1589834_chr7_R2.fastq.gz
-rw-rw-r--+ 1 scaburet 1012 69M Sep  9 15:47 SRX1589839_chr7_R1.fastq.gz
-rw-rw-r--+ 1 scaburet 1012 70M Sep  9 15:47 SRX1589839_chr7_R2.fastq.gz


In [8]:
md5sum /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589831_chr7_R1.fastq.gz

0021704ca4f1dd04a00d708d1311db73  /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589831_chr7_R1.fastq.gz


In [9]:
head /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/md5sum.txt

0021704ca4f1dd04a00d708d1311db73  SRX1589831_chr7_R1.fastq.gz
558d9168b0c630a0182f60c63d931c1a  SRX1589831_chr7_R2.fastq.gz
e8e4e8c09b0adf38c57cacb1041a9a6a  SRX1589834_chr7_R1.fastq.gz
cf8fa4a75e6d4f2723ae8218a91e8d74  SRX1589834_chr7_R2.fastq.gz
4765f492b28bc17892e1112576d2723c  SRX1589839_chr7_R1.fastq.gz
1089bdb8566a7389a24f692ce95341d4  SRX1589839_chr7_R2.fastq.gz


In [10]:
md5sum /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/*.fastq.gz

0021704ca4f1dd04a00d708d1311db73  /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589831_chr7_R1.fastq.gz
558d9168b0c630a0182f60c63d931c1a  /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589831_chr7_R2.fastq.gz
e8e4e8c09b0adf38c57cacb1041a9a6a  /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589834_chr7_R1.fastq.gz
cf8fa4a75e6d4f2723ae8218a91e8d74  /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589834_chr7_R2.fastq.gz
4765f492b28bc17892e1112576d2723c  /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589839_chr7_R1.fastq.gz
1089bdb8566a7389a24f692ce95341d4  /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589839_chr7_R2.fastq.gz
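
Rather than comparing the checksums by eye, `md5sum -c` can do the comparison. This is a minimal sketch, assuming the manifest is named `md5sum.txt` and lists bare file names, as shown above. `verify_md5` is a hypothetical helper name, not part of the course material.

```shell
# Hypothetical helper: check every file listed in <dir>/md5sum.txt.
# The manifest lists bare file names, so the check runs from inside the directory.
verify_md5() {
    local dir="$1"
    ( cd "$dir" && md5sum -c md5sum.txt )
}
```

Calling `verify_md5 /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq` should print one `OK` line per file. The function returns a non-zero status if any checksum differs.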


## 2. Quality Assessment of the Raw Data

We first examine the quality of the raw sequencing data with FastQC.
We start by analysing the first sample, then use a single command to process all the samples, including the 2 remaining ones, in one go.

In [5]:
fastqc -v 
fastqc --help

FastQC v0.12.1

            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

	fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of 
    which will help to identify a different potential type of problem in your
    data.
    
    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.
    
    The options for the program as as follows:
    
    -h --help       Print this help file and exit
    
    -v --version    Print the version of the p

In [11]:
# Cell 2: Run FastQC on the first sample

fastqc -o ./rnaseq/results/1-fastqc -t 10 \
  /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589831_chr7_R{1,2}.fastq.gz

application/gzip
application/gzip
Started analysis of SRX1589831_chr7_R1.fastq.gz
Started analysis of SRX1589831_chr7_R2.fastq.gz
Approx 5% complete for SRX1589831_chr7_R1.fastq.gz
Approx 5% complete for SRX1589831_chr7_R2.fastq.gz
Approx 10% complete for SRX1589831_chr7_R1.fastq.gz
Approx 10% complete for SRX1589831_chr7_R2.fastq.gz
Approx 15% complete for SRX1589831_chr7_R1.fastq.gz
Approx 15% complete for SRX1589831_chr7_R2.fastq.gz
Approx 20% complete for SRX1589831_chr7_R1.fastq.gz
Approx 20% complete for SRX1589831_chr7_R2.fastq.gz
Approx 25% complete for SRX1589831_chr7_R1.fastq.gz
Approx 25% complete for SRX1589831_chr7_R2.fastq.gz
Approx 30% complete for SRX1589831_chr7_R1.fastq.gz
Approx 30% complete for SRX1589831_chr7_R2.fastq.gz
Approx 35% complete for SRX1589831_chr7_R1.fastq.gz
Approx 35% complete for SRX1589831_chr7_R2.fastq.gz
Approx 40% complete for SRX1589831_chr7_R2.fastq.gz
Approx 40% complete for SRX1589831_chr7_R1.fastq.gz
Approx 45% complete for SRX1589831_chr7_

In [12]:
# Cell 3: Run FastQC on all samples with a single command

fastqc -o ./rnaseq/results/1-fastqc -t 10 \
  /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/*.fastq.gz

application/gzip
application/gzip
Started analysis of SRX1589831_chr7_R1.fastq.gz
application/gzip
application/gzip
application/gzip
application/gzip
Started analysis of SRX1589831_chr7_R2.fastq.gz
Approx 5% complete for SRX1589831_chr7_R1.fastq.gz
Started analysis of SRX1589834_chr7_R1.fastq.gz
Approx 5% complete for SRX1589831_chr7_R2.fastq.gz
Approx 10% complete for SRX1589831_chr7_R1.fastq.gz
Started analysis of SRX1589834_chr7_R2.fastq.gz
Approx 5% complete for SRX1589834_chr7_R1.fastq.gz
Approx 10% complete for SRX1589831_chr7_R2.fastq.gz
Approx 15% complete for SRX1589831_chr7_R1.fastq.gz
Started analysis of SRX1589839_chr7_R1.fastq.gz
Approx 5% complete for SRX1589834_chr7_R2.fastq.gz
Approx 15% complete for SRX1589831_chr7_R2.fastq.gz
Approx 10% complete for SRX1589834_chr7_R1.fastq.gz
Approx 5% complete for SRX1589839_chr7_R1.fastq.gz
Approx 20% complete for SRX1589831_chr7_R1.fastq.gz
Started analysis of SRX1589839_chr7_R2.fastq.gz
Approx 10% complete for SRX1589834_chr7_R2.

In [13]:
# Cell 4: Aggregate the FastQC reports with MultiQC

multiqc -f -o rnaseq/results/0-multiqc \
    rnaseq/results/1-fastqc/ \
    --interactive \
    --title "1-fastqc-" \
    --comment "MultiQC report on the 3 samples"


  /// MultiQC 🔍 | v1.13

|           multiqc | MultiQC Version v1.31 now available!
|           multiqc | Report title: 1-fastqc-
|           multiqc | Search path : /srv/home/scaburet/meg-m1-ue5-unix2-testmapping/rnaseq/results/1-fastqc
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 12/12
|            fastqc | Found 6 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : rnaseq/results/0-multiqc/1-fastqc-_multiqc_report.html
|           multiqc | Data        : rnaseq/results/0-multiqc/1-fastqc-_multiqc_report_data
|           multiqc | MultiQC complete


## 3. Read Preprocessing

<div class="alert alert-info">
We use fastp to:  <br>
- Trim low-quality bases  <br>
- Remove adapter sequences  <br>
- Filter out poor-quality reads  <br>
</div>

In [7]:
fastp -v
fastp --help

fastp 0.23.1
usage: fastp [options] ... 
options:
  -i, --in1                            read1 input file name (string [=])
  -o, --out1                           read1 output file name (string [=])
  -I, --in2                            read2 input file name (string [=])
  -O, --out2                           read2 output file name (string [=])
      --unpaired1                      for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
      --unpaired2                      for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --unpaired1 (default mode), both unpaired reads will be written to this same file. (string [=])
      --overlapped_out                 for each read pair, output the overlapped region if it has no any mismatched base. (string [=])
      --failed_out                     specify the file to store reads that cannot pass the filters. (string [=])
  

In [14]:
ls -lh /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/

total 481M
-rw-rw-r--+ 1 scaburet 1012 372 Sep  9 15:47 md5sum.txt
-rw-rw-r--+ 1 scaburet 1012 14K Sep  3 17:18 samples-Blumenthal2014.tsv
-rw-rw-r--+ 1 scaburet 1012 80M Sep  9 15:46 SRX1589831_chr7_R1.fastq.gz
-rw-rw-r--+ 1 scaburet 1012 82M Sep  9 15:46 SRX1589831_chr7_R2.fastq.gz
-rw-rw-r--+ 1 scaburet 1012 90M Sep  9 15:46 SRX1589834_chr7_R1.fastq.gz
-rw-rw-r--+ 1 scaburet 1012 92M Sep  9 15:46 SRX1589834_chr7_R2.fastq.gz
-rw-rw-r--+ 1 scaburet 1012 69M Sep  9 15:47 SRX1589839_chr7_R1.fastq.gz
-rw-rw-r--+ 1 scaburet 1012 70M Sep  9 15:47 SRX1589839_chr7_R2.fastq.gz


In [15]:
# Cell 4: Process the first sample with fastp

echo "Processing SRX1589831..."
fastp \
    --in1 /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589831_chr7_R1.fastq.gz \
    --in2 /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589831_chr7_R2.fastq.gz \
    --out1 ./rnaseq/results/2-fastpfq/SRX1589831_chr7_R1_fastp.fastq.gz \
    --out2 ./rnaseq/results/2-fastpfq/SRX1589831_chr7_R2_fastp.fastq.gz \
    --detect_adapter_for_pe \
    --dedup \
    --dup_calc_accuracy 3 \
    -p -P 500 \
    --html ./rnaseq/results/2-fastpfq/SRX1589831_chr7_report.fastp.html \
    --json ./rnaseq/results/2-fastpfq/SRX1589831_chr7_report.fastp.json \
    --thread 3


Processing SRX1589831...
Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
GGATTTAGCTCAGTGGTAGAGCGCTTGCCTAGCAAGCGCAAGGCCCTGGGTTCGGTCCT

Read1 before filtering:
total reads: 1984129
total bases: 96301751
Q20 bases: 95705016(99.3803%)
Q30 bases: 92864786(96.431%)

Read2 before filtering:
total reads: 1984129
total bases: 95521970
Q20 bases: 94798111(99.2422%)
Q30 bases: 91284912(95.5643%)

Read1 after filtering:
total reads: 1934291
total bases: 93880210
Q20 bases: 93294457(99.3761%)
Q30 bases: 90513292(96.4136%)

Read2 after filtering:
total reads: 1934291
total bases: 93033769
Q20 bases: 92324215(99.2373%)
Q30 bases: 88886934(95.5427%)

Filtering result:
reads passed filter: 3954578
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 13680
reads with adapter trimmed: 9330
bases trimmed due to adapters: 379947

Duplication rate: 2.40105%

Insert size peak (evaluated by paired-
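
The counts in the fastp summary can be cross-checked with a little shell arithmetic. The sketch below hard-codes the figures from the output above, with illustrative variable names. The reads that passed the filters but are missing from the output were presumably removed by `--dedup`. That is an interpretation, not something the log states.

```shell
# Counts copied from the fastp output above (variable names are illustrative)
pairs_before=1984129     # "total reads" per mate before filtering
pairs_after=1934291      # "total reads" per mate after filtering
passed=3954578           # "reads passed filter" (R1 + R2)
too_short=13680          # "reads failed due to too short" (R1 + R2)

reads_before=$((2 * pairs_before))
reads_after=$((2 * pairs_after))

echo "reads before filtering:         $reads_before"
echo "passed + too short:             $((passed + too_short))"
echo "passed but absent from output:  $((passed - reads_after))"
awk -v a="$reads_after" -v b="$reads_before" \
    'BEGIN { printf "fraction of reads kept:         %.2f%%\n", 100 * a / b }'
```

The first two lines should agree, which confirms that every input read is either counted as passed or as too short.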

In [16]:
# (Cell 5): Process all samples with fastp

for sample in /srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/*_R1.fastq.gz; do
    base=$(basename "$sample" _R1.fastq.gz)
    echo "Processing $base..."
    fastp \
        -i "${sample}" \
        -I "${sample%_R1.fastq.gz}_R2.fastq.gz" \
        -o "rnaseq/results/2-fastpfq/${base}_R1_fastp.fastq.gz" \
        -O "rnaseq/results/2-fastpfq/${base}_R2_fastp.fastq.gz" \
        --detect_adapter_for_pe \
        --dedup \
        --dup_calc_accuracy 3 \
        -p -P 500 \
        --html "rnaseq/results/2-fastpfq/${base}_report.fastp.html" \
        --json "rnaseq/results/2-fastpfq/${base}_report.fastp.json" \
        --thread 3
done

Processing SRX1589831_chr7...
Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
GGATTTAGCTCAGTGGTAGAGCGCTTGCCTAGCAAGCGCAAGGCCCTGGGTTCGGTCCT

Read1 before filtering:
total reads: 1984129
total bases: 96301751
Q20 bases: 95705016(99.3803%)
Q30 bases: 92864786(96.431%)

Read2 before filtering:
total reads: 1984129
total bases: 95521970
Q20 bases: 94798111(99.2422%)
Q30 bases: 91284912(95.5643%)

Read1 after filtering:
total reads: 1934291
total bases: 93880232
Q20 bases: 93294457(99.376%)
Q30 bases: 90513322(96.4136%)

Read2 after filtering:
total reads: 1934291
total bases: 93033788
Q20 bases: 92324231(99.2373%)
Q30 bases: 88886985(95.5427%)

Filtering result:
reads passed filter: 3954578
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 13680
reads with adapter trimmed: 9330
bases trimmed due to adapters: 379947

Duplication rate: 2.40105%

Insert size peak (evaluated by pai
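
The loop in the cell above derives the sample name and the R2 path from each R1 path. The sketch below isolates those two string operations on one example path. No file is read, so it can be run anywhere.

```shell
# Illustrative path only: the expansions work on the string itself
sample=/srv/data/meg-m1-a2b/blumenthal-2014/chr7/1-fastq/SRX1589831_chr7_R1.fastq.gz

# basename strips the directory and, with a second argument, the given suffix
base=$(basename "$sample" _R1.fastq.gz)

# ${var%pattern} removes the shortest matching suffix and keeps the directory part
r2="${sample%_R1.fastq.gz}_R2.fastq.gz"

echo "sample name: $base"
echo "mate file:   $r2"
```

Quoting `"$sample"` keeps the commands safe if a path ever contains spaces.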

## 4. Post-processing Quality Control

<div class="alert alert-info">
We will run FastQC on the cleaned reads and generate a MultiQC report combining all the quality metrics.
</div>

In [17]:
# Cell 6: Run FastQC on the cleaned reads

fastqc -o rnaseq/results/2-fastpfq/ -t 10 \
  rnaseq/results/2-fastpfq/*_fastp.fastq.gz

application/gzip
application/gzip
Started analysis of SRX1589831_chr7_R1_fastp.fastq.gz
application/gzip
application/gzip
application/gzip
application/gzip
Started analysis of SRX1589831_chr7_R2_fastp.fastq.gz
Approx 5% complete for SRX1589831_chr7_R1_fastp.fastq.gz
Started analysis of SRX1589834_chr7_R1_fastp.fastq.gz
Approx 5% complete for SRX1589831_chr7_R2_fastp.fastq.gz
Approx 10% complete for SRX1589831_chr7_R1_fastp.fastq.gz
Started analysis of SRX1589834_chr7_R2_fastp.fastq.gz
Approx 5% complete for SRX1589834_chr7_R1_fastp.fastq.gz
Approx 10% complete for SRX1589831_chr7_R2_fastp.fastq.gz
Started analysis of SRX1589839_chr7_R1_fastp.fastq.gz
Approx 15% complete for SRX1589831_chr7_R1_fastp.fastq.gz
Approx 5% complete for SRX1589834_chr7_R2_fastp.fastq.gz
Approx 10% complete for SRX1589834_chr7_R1_fastp.fastq.gz
Approx 15% complete for SRX1589831_chr7_R2_fastp.fastq.gz
Started analysis of SRX1589839_chr7_R2_fastp.fastq.gz
Approx 5% complete for SRX1589839_chr7_R1_fastp.fastq.gz

In [18]:
multiqc --version

multiqc, version 1.13


In [19]:
# Cell 7: Generate a MultiQC report for all samples

multiqc -f -o rnaseq/results/0-multiqc \
    rnaseq/results/1-fastqc/ \
    rnaseq/results/2-fastpfq/ \
    --interactive \
    --title "2-fastqc-fastp" \
    --comment "MultiQC report before and after read cleaning with fastp"


  /// MultiQC 🔍 | v1.13

|           multiqc | MultiQC Version v1.31 now available!
|           multiqc | Report title: 2-fastqc-fastp
|           multiqc | Search path : /srv/home/scaburet/meg-m1-ue5-unix2-testmapping/rnaseq/results/1-fastqc
|           multiqc | Search path : /srv/home/scaburet/meg-m1-ue5-unix2-testmapping/rnaseq/results/2-fastpfq
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 38/38
|             fastp | Found 3 reports
|            fastqc | Found 12 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : rnaseq/results/0-multiqc/2-fastqc-fastp_multiqc_report.html
|           multiqc | Data        : rnaseq/results/0-multiqc/2-fastqc-fastp_multiqc_report_data
|           mul

## 5. Read Alignment

<div class="alert alert-info">
We will align the cleaned reads to the mouse reference genome with STAR. <br>
The genome annotation file is located under /srv/data: <br>
<code>/srv/data/Genomes/Mmu/GRCm39/extracted/genome_annotation-M37.gtf</code><br>
The reference genome index has already been built (a step that is very demanding in compute time and resources) and is available here:<br>
<code>/srv/data/Genomes/Mmu/GRCm39/indexes_upto49bases/</code> <br>
</div>

In [20]:
head -n 7 /srv/data/Genomes/Mmu/GRCm39/extracted/genome_annotation-M37.gtf

##description: evidence-based annotation of the mouse genome (GRCm39), version M37 (Ensembl 114)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2025-01-19
chr1	HAVANA	gene	3143476	3144545	.	+	.	gene_id "ENSMUSG00000102693.2"; gene_type "TEC"; gene_name "4933401J01Rik"; level 2; mgi_id "MGI:1918292"; havana_gene "OTTMUSG00000049935.1";
chr1	HAVANA	transcript	3143476	3144545	.	+	.	gene_id "ENSMUSG00000102693.2"; transcript_id "ENSMUST00000193812.2"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; level 2; transcript_support_level "NA"; mgi_id "MGI:1918292"; tag "basic"; tag "Ensembl_canonical"; tag "GENCODE_Primary"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";


In [21]:
STAR

Usage: STAR  [options]... --genomeDir /path/to/genome/index/   --readFilesIn R1.fq R2.fq
Spliced Transcripts Alignment to a Reference (c) Alexander Dobin, 2009-2022

STAR version=2.7.11a
STAR compilation time,server,dir=2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
For more details see:
<https://github.com/alexdobin/STAR>
<https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf>

To list all parameters, run STAR --help


In [1]:
# Cell 8: Align the reads of the first sample

echo "Aligning SRX1589831_chr7..."
STAR --genomeDir /srv/data/Genomes/Mmu/GRCm39/indexes_upto49bases/ \
     --readFilesIn rnaseq/results/2-fastpfq/SRX1589831_chr7_R1_fastp.fastq.gz rnaseq/results/2-fastpfq/SRX1589831_chr7_R2_fastp.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix rnaseq/results/3-bam/SRX1589831_chr7_ \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN 5


Aligning SRX1589831_chr7...
	/srv/conda/envs/notebook/bin/STAR-avx2 --genomeDir /srv/data/Genomes/Mmu/GRCm39/indexes_upto49bases/ --readFilesIn rnaseq/results/2-fastpfq/SRX1589831_chr7_R1_fastp.fastq.gz rnaseq/results/2-fastpfq/SRX1589831_chr7_R2_fastp.fastq.gz --readFilesCommand zcat --outFileNamePrefix rnaseq/results/3-bam/SRX1589831_chr7_ --outSAMtype BAM SortedByCoordinate --runThreadN 5
	STAR version: 2.7.11a   compiled: 2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
Sep 11 18:15:45 ..... started STAR run
Sep 11 18:15:45 ..... loading genome
/srv/conda/envs/notebook/bin/STAR: line 8:  1029 Killed                  "${cmd}" "$@"


: 137
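
Exit status 137 corresponds to 128 + 9, that is, a process terminated by `SIGKILL`. When this happens while STAR is loading the genome, a likely (but here unconfirmed) cause is that the index does not fit in the available RAM. The sketch below compares the on-disk size of an index directory with the memory reported as available in `/proc/meminfo`. `index_fits_in_ram` is a hypothetical helper, and the on-disk size is only a rough proxy for what STAR actually needs.

```shell
# Hypothetical helper: does the on-disk index size fit in the available memory?
# Returns 0 if it fits, non-zero otherwise. Linux only (reads /proc/meminfo).
index_fits_in_ram() {
    local index_dir="$1"
    local index_kb avail_kb
    index_kb=$(du -sk "$index_dir" | cut -f1)
    avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
    echo "index: $((index_kb / 1024)) MB, available RAM: $((avail_kb / 1024)) MB"
    [ "$index_kb" -lt "$avail_kb" ]
}
```

For example, `index_fits_in_ram /srv/data/Genomes/Mmu/GRCm39/indexes_upto49bases/` could be run before launching STAR. In a containerised environment the real memory limit may be lower than what `/proc/meminfo` reports, so a positive answer is not a guarantee.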

In [2]:
# Cell 8: Align the reads of the first sample (rerun with the index in new/indexchr7_upto49bases/)

echo "Aligning SRX1589831_chr7..."
STAR --genomeDir new/indexchr7_upto49bases/ \
     --readFilesIn rnaseq/results/2-fastpfq/SRX1589831_chr7_R1_fastp.fastq.gz rnaseq/results/2-fastpfq/SRX1589831_chr7_R2_fastp.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix rnaseq/results/3-bam/SRX1589831_chr7_ \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN 5


Aligning SRX1589831_chr7...
	/srv/conda/envs/notebook/bin/STAR-avx2 --genomeDir new/indexchr7_upto49bases/ --readFilesIn rnaseq/results/2-fastpfq/SRX1589831_chr7_R1_fastp.fastq.gz rnaseq/results/2-fastpfq/SRX1589831_chr7_R2_fastp.fastq.gz --readFilesCommand zcat --outFileNamePrefix rnaseq/results/3-bam/SRX1589831_chr7_ --outSAMtype BAM SortedByCoordinate --runThreadN 5
	STAR version: 2.7.11a   compiled: 2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
Sep 11 19:29:52 ..... started STAR run
Sep 11 19:29:52 ..... loading genome
Sep 11 19:29:53 ..... started mapping
Sep 11 19:32:39 ..... finished mapping
Sep 11 19:32:39 ..... started sorting BAM
Sep 11 19:32:40 ..... finished successfully


The previous cell runs in an environment with 6 cores + 15 GB RAM (--runThreadN 5)  
Aligning SRX1589831_chr7...  
Sep 11 19:29:52 ..... started STAR run  
Sep 11 19:32:40 ..... finished successfully  
Under 3 min for 1 sample.

Test below for all 3 samples, environment with 6 cores + 15 GB RAM (--runThreadN 5)  

Aligning SRX1589831_chr7...  
Sep 12 08:44:31 ..... started STAR run  
Sep 12 08:47:23 ..... finished successfully  
Aligning SRX1589834_chr7...  
Sep 12 08:47:24 ..... started STAR run  
Sep 12 08:50:11 ..... finished successfully  
Aligning SRX1589839_chr7...  
Sep 12 08:50:11 ..... started STAR run  
Sep 12 08:52:59 ..... finished successfully

So a little under 9 min for the 3 samples, which is fine.

Rerun with an environment of 5 cores and 12 GB RAM (--runThreadN 4)  
Aligning SRX1589831_chr7...  
Sep 12 08:59:32 ..... started STAR run  
Sep 12 09:02:25 ..... finished successfully  
Aligning SRX1589834_chr7...  
Sep 12 09:02:25 ..... started STAR run  
Sep 12 09:05:21 ..... finished successfully  
Aligning SRX1589839_chr7...  
Sep 12 09:05:21 ..... started STAR run  
Sep 12 09:08:09 ..... finished successfully

Same timing, OK.

Rerun with an environment of 5 cores and 10 GB RAM (--runThreadN 4)  
Aligning SRX1589831_chr7...  
Sep 12 09:10:38 ..... started STAR run  
Sep 12 09:13:35 ..... finished successfully  
Aligning SRX1589834_chr7...  
Sep 12 09:13:36 ..... started STAR run  
Sep 12 09:16:30 ..... finished successfully  
Aligning SRX1589839_chr7...  
Sep 12 09:16:30 ..... started STAR run  
Sep 12 09:19:21 ..... finished successfully

Same timing, OK.

Rerun with an environment of 4 cores and 8 GB RAM (--runThreadN 3)  
Aligning SRX1589831_chr7...  
Sep 12 09:25:13 ..... started STAR run  
Sep 12 09:29:25 ..... finished successfully  
Aligning SRX1589834_chr7...  
Sep 12 09:29:25 ..... started STAR run  
Sep 12 09:34:08 ..... finished successfully  
Aligning SRX1589839_chr7...  
Sep 12 09:34:08 ..... started STAR run  
Sep 12 09:37:35 ..... finished successfully  
Slightly longer: about 12 min for the 3 samples, so we can stay with 4 cores + 8 GB of RAM.
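
The run times quoted in these notes can be recomputed from the STAR timestamps. This is a minimal sketch, assuming GNU `date` (for the `-d` option). `elapsed_s` is a hypothetical helper, and the timestamps are copied from the first and last three-sample runs above.

```shell
# Hypothetical helper: seconds between two timestamps (requires GNU date)
elapsed_s() {
    local start="$1" end="$2"
    echo $(( $(date -u -d "$end" +%s) - $(date -u -d "$start" +%s) ))
}

# First "started STAR run" and last "finished successfully" of each three-sample run
t6=$(elapsed_s "Sep 12 08:44:31" "Sep 12 08:52:59")   # 6 cores, 15 GB, --runThreadN 5
t4=$(elapsed_s "Sep 12 09:25:13" "Sep 12 09:37:35")   # 4 cores, 8 GB, --runThreadN 3

echo "6 cores / 15 GB: $((t6 / 60)) min $((t6 % 60)) s"
echo "4 cores / 8 GB:  $((t4 / 60)) min $((t4 % 60)) s"
```

The same helper can be applied to the two intermediate runs to compare all four configurations.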


In [1]:
# (Cell 9): Align all samples in a loop

for sample in rnaseq/results/2-fastpfq/*_R1_fastp.fastq.gz; do
    base=$(basename "$sample" _R1_fastp.fastq.gz)
    echo "Aligning $base..."
    STAR --genomeDir new/indexchr7_upto49bases/ \
         --readFilesIn "${sample}" "${sample%_R1_fastp.fastq.gz}_R2_fastp.fastq.gz" \
         --readFilesCommand zcat \
         --outFileNamePrefix "rnaseq/results/3-bam/${base}_" \
         --outSAMtype BAM SortedByCoordinate \
         --runThreadN 3
done

Aligning SRX1589831_chr7...
	/srv/conda/envs/notebook/bin/STAR-avx2 --genomeDir new/indexchr7_upto49bases/ --readFilesIn rnaseq/results/2-fastpfq/SRX1589831_chr7_R1_fastp.fastq.gz rnaseq/results/2-fastpfq/SRX1589831_chr7_R2_fastp.fastq.gz --readFilesCommand zcat --outFileNamePrefix rnaseq/results/3-bam/SRX1589831_chr7_ --outSAMtype BAM SortedByCoordinate --runThreadN 3
	STAR version: 2.7.11a   compiled: 2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
Sep 12 09:25:13 ..... started STAR run
Sep 12 09:25:13 ..... loading genome
Sep 12 09:25:14 ..... started mapping
Sep 12 09:29:23 ..... finished mapping
Sep 12 09:29:24 ..... started sorting BAM
Sep 12 09:29:25 ..... finished successfully
Aligning SRX1589834_chr7...
	/srv/conda/envs/notebook/bin/STAR-avx2 --genomeDir new/indexchr7_upto49bases/ --readFilesIn rnaseq/results/2-fastpfq/SRX1589834_chr7_R1_fastp.fastq.gz rnaseq/results/2-fastpfq/SRX1589834_chr7_R2_fastp.fastq.gz --readFilesCommand zcat --o

In [8]:
cd /srv/home/scaburet/meg-m1-ue5-unix2-testmapping/
pwd

ls -lh rnaseq/results/3-bam/

/srv/home/scaburet/meg-m1-ue5-unix2-testmapping
total 438M
-rw-rw-r-- 1 scaburet scaburet 148M Sep 12 09:29 SRX1589831_chr7_Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 scaburet scaburet 2.0K Sep 12 09:29 SRX1589831_chr7_Log.final.out
-rw-rw-r-- 1 scaburet scaburet 6.8K Sep 12 09:29 SRX1589831_chr7_Log.out
-rw-rw-r-- 1 scaburet scaburet  482 Sep 12 09:29 SRX1589831_chr7_Log.progress.out
-rw-rw-r-- 1 scaburet scaburet 452K Sep 12 09:29 SRX1589831_chr7_SJ.out.tab
-rw-rw-r-- 1 scaburet scaburet 163M Sep 12 09:34 SRX1589834_chr7_Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 scaburet scaburet 2.0K Sep 12 09:34 SRX1589834_chr7_Log.final.out
-rw-rw-r-- 1 scaburet scaburet 6.8K Sep 12 09:34 SRX1589834_chr7_Log.out
-rw-rw-r-- 1 scaburet scaburet  600 Sep 12 09:34 SRX1589834_chr7_Log.progress.out
-rw-rw-r-- 1 scaburet scaburet 478K Sep 12 09:34 SRX1589834_chr7_SJ.out.tab
-rw-rw-r-- 1 scaburet scaburet 126M Sep 12 09:37 SRX1589839_chr7_Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 scaburet scaburet 2.0K 

In [9]:
# Cell 10: Index the BAM files

for bam in rnaseq/results/3-bam/*_Aligned.sortedByCoord.out.bam; do
    echo "Indexing $(basename "$bam")..."
    samtools index -@ 3 "$bam"
done

Indexing SRX1589831_chr7_Aligned.sortedByCoord.out.bam...
Indexing SRX1589834_chr7_Aligned.sortedByCoord.out.bam...
Indexing SRX1589839_chr7_Aligned.sortedByCoord.out.bam...
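
After indexing, each BAM file should have a `.bai` file next to it (the default output name of `samtools index`). The sketch below checks this with plain shell tests. `check_bam_indexes` is a hypothetical helper name.

```shell
# Hypothetical helper: report BAM files in a directory that lack a .bai index.
# Returns 0 only if every BAM file has its index next to it.
check_bam_indexes() {
    local dir="$1" missing=0 bam
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue          # the glob matched nothing
        if [ ! -f "${bam}.bai" ]; then
            echo "missing index: $bam"
            missing=1
        fi
    done
    return "$missing"
}
```

In this notebook the call would be `check_bam_indexes rnaseq/results/3-bam`. It prints nothing when all indexes are present.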


## 6. Summary and Next Steps

<div class="alert alert-success">
<b>Completed Steps:</b><br>
✓ Quality control of the raw data<br>
✓ Read preprocessing and cleaning<br>
✓ Post-processing quality control<br>
✓ Alignment of the reads to the reference genome<br>
✓ Indexing of the BAM files
</div>

<div class="alert alert-info">
<b>Output Files Generated:</b><br>
- Quality reports: <code>./rnaseq/results/1-fastqc/</code><br>
- Cleaned reads: <code>./rnaseq/results/2-fastpfq/</code><br>
- MultiQC reports: <code>./rnaseq/results/0-multiqc/</code><br>
- BAM alignments and BAI indexes: <code>./rnaseq/results/3-bam/</code>
</div>

<div class="alert alert-warning">
<b>Next Steps:</b><br>
The BAM files generated can now be used for:<br>
- Counting reads per gene (featureCounts, HTSeq)<br>
- Differential expression analysis (DESeq2, edgeR)<br>
- Visualising the alignments (IGV, UCSC Genome Browser)
</div>