# STAR RNA-seq Alignment Pipeline

## Overview
This notebook aligns paired-end RNA-seq reads with STAR after quality control with fastp. It maps the cleaned, merged FASTQ files to the human reference genome (GRCh38) and writes coordinate-sorted BAM files, per-gene read counts, and bedGraph signal tracks for downstream analysis.

## Pipeline Workflow
1. **Environment Setup**: Activate conda environment and verify tool installations
2. **Reference Genome Setup**: Prepare or verify STAR genome index
3. **Sample Processing**: Align paired-end reads using STAR aligner
4. **Output Generation**: Write coordinate-sorted BAM files, per-sample gene count tables, and bedGraph signal tracks
5. **Quality Assessment**: Collect alignment statistics and summaries

## Input Requirements
- Processed FASTQ files from `00_FASTP_FQ_FILES.ipynb`
- Directory structure with merged R1/R2 files in `00.CLEAN_FQ/` subdirectories
- STAR genome index for GRCh38

## Output Structure
```
[Sample_ID]/
├── 00.CLEAN_FQ/
│   ├── [Sample_ID]_R1.merged.fastq.gz
│   └── [Sample_ID]_R2.merged.fastq.gz
└── 10.MAPPING/
    ├── [Sample_ID]_STAR_Aligned.sortedByCoord.out.bam
    ├── [Sample_ID]_STAR_ReadsPerGene.out.tab
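    ├── [Sample_ID]_STAR_Log.final.out
    ├── [Sample_ID]_STAR_SJ.out.tab
    ├── [Sample_ID]_STAR_Unmapped.out.mate1
    ├── [Sample_ID]_STAR_Unmapped.out.mate2
    └── [Sample_ID]_STAR_Signal.{Unique,UniqueMultiple}.str*.out.bg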
```

## Software Requirements
- STAR aligner (>=2.7; the index used here was built with 2.7.11b)
- GNU Parallel (parallel processing)
- Standard Unix tools

## Reference Genome
- Genome: Human GRCh38 (Ensembl release 111)
- GTF: Homo_sapiens.GRCh38.111.chr.gtf
- FASTA: Homo_sapiens.GRCh38.dna.toplevel.fa

## Notes
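- Runs several samples concurrently with GNU Parallel
- Writes per-sample gene count tables (`ReadsPerGene.out.tab`) that can be assembled into a count matrix for DESeq2 and similar tools
- Produces RPM-normalized bedGraph signal tracks for visualization
- Signal tracks are unstranded (`--outWigStrand Unstranded`); `ReadsPerGene.out.tab` still reports unstranded and both stranded counts in separate columns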
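
The per-sample count tables can be combined into one matrix before loading them into DESeq2. The following is a minimal sketch, not part of the original pipeline: `build_count_matrix` is a hypothetical helper, and it assumes that every sample was aligned against the same index (so the gene order is identical in every file) and that the unstranded counts in column 2 are the ones wanted.

```shell
# Combine STAR ReadsPerGene tables into one gene-by-sample matrix (sketch).
# Usage: build_count_matrix PROJECT_DIR [COLUMN]
#   COLUMN 2 = unstranded counts (default), 3 and 4 = stranded counts.
build_count_matrix() {
    local project_dir="$1" col="${2:-2}" tmpdir f sample
    tmpdir=$(mktemp -d) || return 1
    find "$project_dir" -name "*_STAR_ReadsPerGene.out.tab" | sort > "$tmpdir/list"
    if [ ! -s "$tmpdir/list" ]; then
        echo "no count files found under $project_dir" >&2
        rm -rf "$tmpdir"
        return 1
    fi

    # Gene IDs come from the first file; the four N_* summary rows are skipped
    tail -n +5 "$(head -n 1 "$tmpdir/list")" | cut -f1 > "$tmpdir/matrix"
    printf 'gene_id' > "$tmpdir/header"

    # Append one count column per sample, in sorted path order
    while IFS= read -r f; do
        sample=$(basename "$f" _STAR_ReadsPerGene.out.tab)
        printf '\t%s' "$sample" >> "$tmpdir/header"
        tail -n +5 "$f" | cut -f"$col" | paste "$tmpdir/matrix" - > "$tmpdir/next"
        mv "$tmpdir/next" "$tmpdir/matrix"
    done < "$tmpdir/list"

    printf '\n' >> "$tmpdir/header"
    cat "$tmpdir/header" "$tmpdir/matrix"
    rm -rf "$tmpdir"
}
```

For example, `build_count_matrix "$PROJECT_NAME" > counts_unstranded.tsv` would write a tab-separated matrix (the output file name is only an illustration). For a stranded library, column 3 or 4 would be the appropriate choice instead of column 2.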

## Configuration and Environment Setup

In [None]:
# =============================================================================
# CONFIGURATION VARIABLES
# =============================================================================

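# Core processing parameters
# Note: up to PARALLEL_JOBS x MAXCORES threads (15 x 8 = 120) may run at once;
# lower PARALLEL_JOBS if the machine has fewer cores or limited RAM
export MAXCORES=8          # threads per STAR job
export PARALLEL_JOBS=15    # samples aligned concurrently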

# Data paths
export RAW_DATA_DIR="/links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw"
export PROJECT_NAME="Reproducibility_SC102A1"

# STAR genome index path (multiple options - adjust as needed)
export STARGEN="/links/groups/treutlein/USERS/jjans/resources/genomes/STAR/Homo_sapiens.GRCh38"
# Alternative paths:
# export STARGEN="/local1/sequencing/PUBLIC_DATA/genomes/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/star/"
# export STARGEN="/links/groups/treutlein/USERS/jjans/software/cellranger/refdata-gex-GRCh38-2020-A/star/"

# Reference genome files (for index generation if needed)
export GENOME_FASTA="/links/groups/treutlein/USERS/jjans/resources/genomes/hsapiens/Homo_sapiens.GRCh38.dna.toplevel.fa"
export GTF_FILE="/links/groups/treutlein/USERS/jjans/resources/genomes/hsapiens/Homo_sapiens.GRCh38.111.chr.gtf"
export STAR_INDEX_DIR="/links/groups/treutlein/USERS/jjans/resources/genomes/STAR/Homo_sapiens.GRCh38"

# STAR alignment parameters
export STAR_THREADS=$MAXCORES
export STAR_OUTWIG_TYPE="bedGraph"
export STAR_OUTWIG_STRAND="Unstranded"
export STAR_OUTWIG_NORM="RPM"

echo "Configuration loaded:"
echo "  Max cores: $MAXCORES"
echo "  Parallel jobs: $PARALLEL_JOBS"
echo "  STAR genome index: $STARGEN"
echo "  Raw data directory: $RAW_DATA_DIR"
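
The two core parameters multiply: each of the `PARALLEL_JOBS` STAR processes uses `STAR_THREADS` threads and needs memory for the genome index. As a rough sketch (the helper name `suggest_parallel_jobs` and the memory figures are assumptions for illustration, not measured values), the job count can be derived from the CPU and RAM budget:

```shell
# Pick a job count that fits both the CPU and the memory budget (sketch).
# Arguments: total cores, threads per STAR job, total RAM in GB,
# assumed RAM per STAR job in GB (hypothetical figure; measure on your data).
suggest_parallel_jobs() {
    local total_cores="$1" threads_per_job="$2" total_mem_gb="$3" mem_per_job_gb="$4"
    local by_cpu=$(( total_cores / threads_per_job ))
    local by_mem=$(( total_mem_gb / mem_per_job_gb ))
    local jobs=$(( by_cpu < by_mem ? by_cpu : by_mem ))
    # Always allow at least one job
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi
    echo "$jobs"
}

# Example: 64 cores, 8 threads per job, 256 GB RAM, 32 GB assumed per job
suggest_parallel_jobs 64 8 256 32   # prints 8
```

On a Linux host this could be used as `export PARALLEL_JOBS=$(suggest_parallel_jobs "$(nproc)" "$MAXCORES" 256 32)`, with the two memory numbers replaced by values that match the machine.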

## 1. Environment Setup and Tool Verification

In [None]:
# Activate conda environment for bulk RNA-seq processing
mamba activate bulk_seq
echo "Activated environment: $CONDA_DEFAULT_ENV"

In [None]:
# Verify STAR aligner installation and version
which STAR
STAR --version 2>/dev/null || echo "STAR version check failed"

# Verify other required tools
which parallel
echo "GNU Parallel version: $(parallel --version 2>/dev/null | head -n1)"

# Check if STAR genome index exists
if [ -d "$STARGEN" ]; then
    echo "✓ STAR genome index found at: $STARGEN"
    ls -la "$STARGEN" | head -5
else
    echo "⚠ STAR genome index not found at: $STARGEN"
    echo "Will need to generate index first"
fi
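
A directory can exist and still hold an incomplete index, for example after an interrupted `genomeGenerate` run. The sketch below checks for a handful of core index files; the file list is an assumption based on what STAR 2.7 typically writes, and `check_star_index` is a hypothetical helper name.

```shell
# Check that a STAR index directory contains the core index files (sketch).
# The file list is an assumption based on STAR 2.7 genomeGenerate output;
# adjust it if your index layout differs.
check_star_index() {
    local index_dir="$1" missing=0 f
    for f in Genome SA SAindex genomeParameters.txt chrName.txt; do
        if [ ! -s "$index_dir/$f" ]; then
            echo "missing or empty: $index_dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as `check_star_index "$STARGEN" && echo "index looks complete"` reports each missing file on stderr and returns a non-zero status if any is absent.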

## 2. STAR Genome Index Generation (Optional)

This section is only needed if the STAR genome index doesn't exist yet. Skip if the index is already available.

In [None]:
# Generate STAR genome index (only run if index doesn't exist)
# This step typically takes 1-2 hours and requires ~30GB of RAM

if [ ! -d "$STAR_INDEX_DIR" ] || [ -z "$(ls -A "$STAR_INDEX_DIR" 2>/dev/null)" ]; then
    echo "Generating STAR genome index..."
    
    # Create the output directory
    mkdir -p "$STAR_INDEX_DIR"
    
    # Uncompress GTF file if needed
    if [ ! -f "$GTF_FILE" ] && [ -f "${GTF_FILE}.gz" ]; then
        echo "Uncompressing GTF file..."
        gunzip -c "${GTF_FILE}.gz" > "$GTF_FILE"
    fi
    
    # Run STAR genome generation; report success only if STAR exits cleanly
    if STAR --runThreadN 20 \
            --runMode genomeGenerate \
            --genomeDir "$STAR_INDEX_DIR" \
            --genomeFastaFiles "$GENOME_FASTA" \
            --sjdbGTFfile "$GTF_FILE" \
            --sjdbOverhang 100; then
        echo "✓ STAR genome index generation completed"
    else
        echo "⚠ STAR genome index generation failed" >&2
    fi
else
    echo "✓ STAR genome index already exists at: $STAR_INDEX_DIR"
fi
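
The index is built with `--sjdbOverhang 100`. The STAR manual describes the ideal value as the read length minus 1 and notes that 100 works well in most cases. To see what the data would suggest, the sketch below reports the longest read among the first reads of a FASTQ file, minus 1. `suggest_sjdb_overhang` is a hypothetical helper, and sampling only the first 10,000 reads is an assumption made to keep it fast.

```shell
# Report (max read length - 1) from the first reads of a gzipped FASTQ (sketch).
# Usage: suggest_sjdb_overhang FILE.fastq.gz [N_READS]
suggest_sjdb_overhang() {
    local fastq_gz="$1" n_reads="${2:-10000}"
    # Sequence lines are every 4th line starting at line 2
    gzip -dc "$fastq_gz" \
        | head -n $(( n_reads * 4 )) \
        | awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
               END { if (max > 0) print max - 1 }'
}
```

Because fastp trims reads, lengths vary within a file; the maximum is used here since the overhang relates to the longest reads.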

## 3. STAR Alignment Pipeline

In [None]:
# Verify processed FASTQ files are available
echo "Checking for processed FASTQ files..."

# Find the 00.CLEAN_FQ directories that contain merged FASTQ files
SAMPLE_DIRS=$(find "$PROJECT_NAME" -name "*_R1.merged.fastq.gz" -exec dirname {} \; | sort -u)

if [ -n "$SAMPLE_DIRS" ]; then
    echo "Found samples ready for alignment:"
    # Read line by line so paths containing spaces are handled correctly
    while IFS= read -r DIR; do
        SAMPLE_NAME=$(basename "$(dirname "$DIR")")
        echo "  - $SAMPLE_NAME"
        ls -la "$DIR"/*.merged.fastq.gz | head -2
    done <<< "$SAMPLE_DIRS"
else
    echo "⚠ No merged FASTQ files found. Please run 00_FASTP_FQ_FILES.ipynb first."
    exit 1
fi
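
The check above only looks for R1 files, while STAR needs both mates. A small sketch that lists every merged R1 file whose R2 partner is missing or empty (`find_unpaired_r1` is a hypothetical helper; the file naming follows the output structure described earlier):

```shell
# Print merged R1 files that have no non-empty R2 partner (sketch).
# Usage: find_unpaired_r1 PROJECT_DIR
find_unpaired_r1() {
    find "$1" -name "*_R1.merged.fastq.gz" | sort | while IFS= read -r r1; do
        # Derive the expected R2 path from the R1 path
        r2="${r1%_R1.merged.fastq.gz}_R2.merged.fastq.gz"
        if [ ! -s "$r2" ]; then
            echo "$r1"
        fi
    done
}
```

An empty result from `find_unpaired_r1 "$PROJECT_NAME"` means every sample has both mates available.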

In [None]:
# =============================================================================
# MAIN STAR ALIGNMENT PIPELINE
# =============================================================================

echo "Starting STAR alignment with $PARALLEL_JOBS parallel jobs"
echo "Using $STAR_THREADS threads per job"
echo "STAR genome index: $STARGEN"

# Process all samples with merged FASTQ files
find "$PROJECT_NAME" -name "*_R1.merged.fastq.gz" | parallel -j $PARALLEL_JOBS "
    # Extract sample information
    # Path layout: PROJECT/SAMPLE/00.CLEAN_FQ/SAMPLE_R1.merged.fastq.gz
    MERGED_R1_FILE={}
    CLEAN_FQ_DIR=\$(dirname \"\$MERGED_R1_FILE\")
    SAMPLE_DIR=\$(dirname \"\$CLEAN_FQ_DIR\")
    SAMPLE_NAME=\$(basename \"\$SAMPLE_DIR\")
    
    echo \"Processing sample: \$SAMPLE_NAME\"
    
    # Change to sample directory; stop this job if that fails
    cd \"\$SAMPLE_DIR\" || exit 1
    
    # Create mapping output directory
    mkdir -p 10.MAPPING
    
    # Run STAR alignment
    STAR \\
        --runThreadN $STAR_THREADS \\
        --genomeDir \"$STARGEN\" \\
        --readFilesIn 00.CLEAN_FQ/\${SAMPLE_NAME}_R1.merged.fastq.gz 00.CLEAN_FQ/\${SAMPLE_NAME}_R2.merged.fastq.gz \\
        --readFilesCommand zcat \\
        --outFileNamePrefix 10.MAPPING/\${SAMPLE_NAME}_STAR_ \\
        --outSAMtype BAM SortedByCoordinate \\
        --outWigType $STAR_OUTWIG_TYPE \\
        --outWigStrand $STAR_OUTWIG_STRAND \\
        --outWigNorm $STAR_OUTWIG_NORM \\
        --outReadsUnmapped Fastx \\
        --quantMode GeneCounts \\
        &>> \${SAMPLE_NAME}_STAR.log
    
    echo \"Completed alignment for: \$SAMPLE_NAME\"
"

echo "✓ STAR alignment jobs finished for all samples (see each sample's *_STAR.log for errors)"
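
The final message is printed whether or not every job succeeded. STAR writes `Log.final.out` once a mapping run has completed, so its absence is a reasonable signal that a sample did not finish. A minimal sketch under that assumption (`list_unfinished_samples` is a hypothetical helper; paths follow the output structure described earlier):

```shell
# List samples that have merged FASTQ input but no STAR final log (sketch).
# Usage: list_unfinished_samples PROJECT_DIR
list_unfinished_samples() {
    find "$1" -name "*_R1.merged.fastq.gz" | sort | while IFS= read -r r1; do
        # PROJECT/SAMPLE/00.CLEAN_FQ/SAMPLE_R1.merged.fastq.gz -> PROJECT/SAMPLE
        sample_dir=$(dirname "$(dirname "$r1")")
        sample=$(basename "$sample_dir")
        if [ ! -s "$sample_dir/10.MAPPING/${sample}_STAR_Log.final.out" ]; then
            echo "$sample"
        fi
    done
}
```

Running `list_unfinished_samples "$PROJECT_NAME"` after the alignment cell prints one sample name per line; no output means every sample produced a final log.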

## 4. Quality Assessment and Verification

In [None]:
# =============================================================================
# ALIGNMENT QUALITY ASSESSMENT AND SUMMARY
# =============================================================================

echo "=== STAR ALIGNMENT SUMMARY ==="

# Count successfully aligned samples
echo "Aligned samples:"
find "$PROJECT_NAME" -name "*_STAR_Aligned.sortedByCoord.out.bam" | wc -l

# List generated BAM files
echo -e "\nGenerated BAM files:"
find "$PROJECT_NAME" -name "*_STAR_Aligned.sortedByCoord.out.bam" | sort

# Check for gene count files
echo -e "\nGene count files:"
find "$PROJECT_NAME" -name "*_STAR_ReadsPerGene.out.tab" | wc -l

# Display alignment statistics from log files (first 5 samples only)
echo -e "\nAlignment Statistics Summary:"
find "$PROJECT_NAME" -name "*_STAR_Log.final.out" | sort | head -5 | while IFS= read -r logfile; do
    sample=$(basename "$logfile" _STAR_Log.final.out)
    echo "Sample: $sample"
    grep -E "(Uniquely mapped reads|Number of input reads)" "$logfile" | sed 's/^/  /'
    echo ""
done

# Check for bedGraph files (signal tracks)
echo -e "\nSignal tracks (bedGraph):"
find "$PROJECT_NAME" -name "*_STAR_Signal.*.out.bg" | wc -l

echo -e "\n=== STAR ALIGNMENT SUMMARY COMPLETE ==="
echo "Next step: Run downstream analysis (count matrix processing, DESeq2, etc.)"
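
For more than a handful of samples, a table is easier to scan than the grep output above. The sketch below collects two fields from every `Log.final.out` into tab-separated rows. `summarize_star_logs` is a hypothetical helper, and it assumes the usual `key |<tab>value` layout of STAR's final log.

```shell
# Summarize STAR Log.final.out files as a tab-separated table (sketch).
# Usage: summarize_star_logs PROJECT_DIR
summarize_star_logs() {
    printf 'sample\tinput_reads\tuniquely_mapped_pct\n'
    find "$1" -name "*_STAR_Log.final.out" | sort | while IFS= read -r log; do
        sample=$(basename "$log" _STAR_Log.final.out)
        # Split each line on "|" and trim whitespace around key and value
        awk -F'|' -v s="$sample" '
            { key = $1; val = $2
              gsub(/^[ \t]+|[ \t]+$/, "", key)
              gsub(/^[ \t]+|[ \t]+$/, "", val) }
            key == "Number of input reads"   { n = val }
            key == "Uniquely mapped reads %" { u = val }
            END { printf "%s\t%s\t%s\n", s, n, u }
        ' "$log"
    done
}
```

The output of `summarize_star_logs "$PROJECT_NAME"` can be redirected to a file and opened in a spreadsheet or read into R.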

In [3]:
which fastp

/nas/groups/treutlein/USERS/jjans/mambaforge/envs/bulk_seq/bin/fastp

In [5]:
which STAR

/nas/groups/treutlein/USERS/jjans/mambaforge/envs/bulk_seq/bin/STAR

In [None]:
export MAXCORES=8

ls /links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/iGABA_pre_24/*R1*.fastq.gz | parallel -j 15 --plus --rpl '{//} s:_S[0-9]+_L[0-9]+_R1_001.fastq.gz::;s:.*/::' "
    mkdir -p {/}
    cd {/}
    echo {}
    
    fastp -i {} -I {= s/_R1_/_R2_/ =} -o {/}/{/.}.clean_R1.fastq.gz -O {/}/{/.}.clean_R2.fastq.gz
" ::: /links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/iGABA_pre_24/*R1*.fastq.gz


In [30]:
export MAXCORES=8
ls /links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/iGABA_pre_24/*R1*.fastq.gz | parallel -j 15 --plus --rpl '{/..L} s:.fastq.gz::;s:.*/::;' --rpl '{/..} s:.fastq.gz::;s:.*/::;s:_S.*::' "
        mkdir -p {/..}
        cd {/..}
        echo {}
        echo {/..L}
"

Academic tradition requires you to cite works you base your article on.
If you use programs that use GNU Parallel to process data for an article in a
scientific publication, please cite:

  Tange, O. (2024, May 22). GNU Parallel 20240522 ('Tbilisi').
  Zenodo. https://doi.org/10.5281/zenodo.11247979

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

More about funding GNU Parallel and the citation notice:
https://www.gnu.org/software/parallel/parallel_design.html#citation-notice

To silence this citation notice: run 'parallel --citation' once.

Come on: You have run parallel 14 times. Isn't it about time 
you run 'parallel --citation' once to silence the citation notice?

/nas/groups/treutlein/USERS/jjans/analysis/iNeuron_morphogens/revisions/bulk_experiments/iGABA_pre_24
/links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/iGABA_pre_24/iGABA_pre_24_S1_L001_R1


In [62]:
export MAXCORES=8
export STARGEN=/local1/sequencing/PUBLIC_DATA/genomes/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/star/
ls /links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/iGABA_pre_24/*R1*.fastq.gz | parallel -j 15 --plus --rpl '{/..L} s:.fastq.gz::;s:.*/::;s:_R.*::' --rpl '{/..} s:.fastq.gz::;s:.*/::;s:_S.*::' "
        mkdir -p {/..}
        cd {/..}
        
        mkdir -p 00.CLEAN_FQ
        fastp -i {} -I {= s/_R1_/_R2_/ =} -o 00.CLEAN_FQ/{/..L}_R1.clean.fastq.gz -O 00.CLEAN_FQ/{/..L}_R2.clean.fastq.gz -h 00.CLEAN_FQ/{/..L}_fastp.html -j 00.CLEAN_FQ/{/..L}_fastp.json &>> {/..L}_fastp.log
"

# Merge the lanes once all fastp jobs have finished; merging inside each
# parallel job could concatenate lane files that are still being written
SAMPLE=iGABA_pre_24
cat "$SAMPLE"/00.CLEAN_FQ/*_R1.clean.fastq.gz > "$SAMPLE/00.CLEAN_FQ/${SAMPLE}_R1.merged.fastq.gz"
cat "$SAMPLE"/00.CLEAN_FQ/*_R2.clean.fastq.gz > "$SAMPLE/00.CLEAN_FQ/${SAMPLE}_R2.merged.fastq.gz"


/nas/groups/treutlein/USERS/jjans/analysis/iNeuron_morphogens/revisions/bulk_experiments/iGABA_pre_24
/nas/groups/treutlein/USERS/jjans/analysis/iNeuron_morphogens/revisions/bulk_experiments/iGABA


In [83]:
# Set variables for your files and directories
GENOME_FASTA="/links/groups/treutlein/USERS/jjans/resources/genomes/hsapiens/Homo_sapiens.GRCh38.dna.toplevel.fa"
GTF_FILE="/links/groups/treutlein/USERS/jjans/resources/genomes/hsapiens/Homo_sapiens.GRCh38.111.chr.gtf"
STAR_INDEX_DIR="/links/groups/treutlein/USERS/jjans/resources/genomes/STAR/Homo_sapiens.GRCh38"

# Create the output directory if it doesn't exist
mkdir -p "$STAR_INDEX_DIR"

# Run STAR to generate the genome index
STAR --runThreadN 20 \
     --runMode genomeGenerate \
     --genomeDir "$STAR_INDEX_DIR" \
     --genomeFastaFiles "$GENOME_FASTA" \
     --sjdbGTFfile "$GTF_FILE" \
     --sjdbOverhang 100


	/nas/groups/treutlein/USERS/jjans/mambaforge/envs/bulk_seq/bin/STAR-avx2 --runThreadN 20 --runMode genomeGenerate --genomeDir /links/groups/treutlein/USERS/jjans/resources/genomes/STAR/Homo_sapiens.GRCh38 --genomeFastaFiles /links/groups/treutlein/USERS/jjans/resources/genomes/hsapiens/Homo_sapiens.GRCh38.dna.toplevel.fa --sjdbGTFfile /links/groups/treutlein/USERS/jjans/resources/genomes/hsapiens/Homo_sapiens.GRCh38.111.chr.gtf --sjdbOverhang 100
	STAR version: 2.7.11b   compiled: 2024-03-19T08:38:59+0000 :/opt/conda/conda-bld/star_1710837244939/work/source
May 28 16:21:15 ..... started STAR run
May 28 16:21:15 ... starting to generate Genome files
May 28 16:21:51 ..... processing annotations GTF
May 28 16:22:17 ... starting to sort Suffix Array. This may take a long time...
May 28 16:22:30 ... sorting Suffix Array chunks and saving them to disk...
May 28 16:45:59 ... loading ch


In [85]:
export MAXCORES=8
# Earlier index candidates, superseded by the Ensembl GRCh38 index below:
# export STARGEN=/local1/sequencing/PUBLIC_DATA/genomes/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/star/
# export STARGEN=/links/groups/treutlein/USERS/jjans/software/cellranger/refdata-gex-GRCh38-2020-A/star/
export STARGEN=/links/groups/treutlein/USERS/jjans/resources/genomes/STAR/Homo_sapiens.GRCh38

ls /links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/iGABA_pre_24/*S1*R1*.fastq.gz | parallel -j 15 --plus --rpl '{/..L} s:.fastq.gz::;s:.*/::;' --rpl '{/..} s:.fastq.gz::;s:.*/::;s:_S.*::' "
        cd {/..}
        
        mkdir -p 10.MAPPING
        
        STAR \
        --runThreadN ${MAXCORES} \
        --genomeDir ${STARGEN} \
        --readFilesIn 00.CLEAN_FQ/*merged*.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix 10.MAPPING/{/..}_STAR_ \
        --outSAMtype BAM SortedByCoordinate \
        --outWigType bedGraph \
        --outWigStrand Unstranded \
        --outWigNorm RPM \
        --outReadsUnmapped Fastx \
        --quantMode GeneCounts \
        &>> {/..}_STAR.log
"


/nas/groups/treutlein/USERS/jjans/analysis/iNeuron_morphogens/revisions/bulk_experiments/iGABA_pre_24

In [87]:
export MAXCORES=8
# Earlier index candidates, superseded by the Ensembl GRCh38 index below:
# export STARGEN=/local1/sequencing/PUBLIC_DATA/genomes/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/star/
# export STARGEN=/links/groups/treutlein/USERS/jjans/software/cellranger/refdata-gex-GRCh38-2020-A/star/
export STARGEN=/links/groups/treutlein/USERS/jjans/resources/genomes/STAR/Homo_sapiens.GRCh38

ls /links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/iGABA_pre_*/*S1*R1*.fastq.gz | parallel -j 15 --plus --rpl '{/..L} s:.fastq.gz::;s:.*/::;' --rpl '{/..} s:.fastq.gz::;s:.*/::;s:_S.*::' "
        cd iGABA_pre/{/..}
        
        mkdir -p 10.MAPPING
        
        STAR \
        --runThreadN ${MAXCORES} \
        --genomeDir ${STARGEN} \
        --readFilesIn 00.CLEAN_FQ/*merged*.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix 10.MAPPING/{/..}_STAR_ \
        --outSAMtype BAM SortedByCoordinate \
        --outWigType bedGraph \
        --outWigStrand Unstranded \
        --outWigNorm RPM \
        --outReadsUnmapped Fastx \
        --quantMode GeneCounts \
        &>> {/..}_STAR.log
"


/nas/groups/treutlein/USERS/jjans/analysis/iNeuron_morphogens/revisions/bulk_experiments/iGABA_pre/iGABA_pre_20
/nas/groups/treutlein/US


In [94]:
export MAXCORES=8
export STARGEN=/links/groups/treutlein/USERS/jjans/resources/genomes/STAR/Homo_sapiens.GRCh38

ls /links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Reproducibility_409B2*/*S1*R1*.fastq.gz | parallel -j 15 --plus --rpl '{/..L} s:.fastq.gz::;s:.*/::;' --rpl '{/..} s:.fastq.gz::;s:.*/::;s:_S.*::' "
        cd Reproducibility_409B2/{/..}
        
        mkdir -p 10.MAPPING
        
        STAR \
        --runThreadN ${MAXCORES} \
        --genomeDir ${STARGEN} \
        --readFilesIn 00.CLEAN_FQ/*merged*.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix 10.MAPPING/{/..}_STAR_ \
        --outSAMtype BAM SortedByCoordinate \
        --outWigType bedGraph \
        --outWigStrand Unstranded \
        --outWigNorm RPM \
        --outReadsUnmapped Fastx \
        --quantMode GeneCounts \
        &>> {/..}_STAR.log
"


/nas/groups/treutlein/USERS/jjans/analysis/iNeuron_morphogens/revisions/bulk_experiments/Reproducibility_409B2/Reproducibility_409B2_1_11
/nas/groups/treutlein/USERS/jjans/analysis/iNe


In [106]:
export MAXCORES=8
export STARGEN=/links/groups/treutlein/USERS/jjans/resources/genomes/STAR/Homo_sapiens.GRCh38

ls /links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Stability*/*S1*R1*.fastq.gz | parallel -j 15 --plus --rpl '{/..L} s:.fastq.gz::;s:.*/::;' --rpl '{/..} s:.fastq.gz::;s:.*/::;s:_S.*::' --rpl '{S} s:.fastq.gz::;s:.*/::;s:_sample.*::' "
    echo {S}
"


Stability_d10
Stability_d10
Stability_d10
Stability_d10
Stability_d10
Stability_d10
Stability_d10
Stability_d10
Stability_d10
Stability_d10
Stability_d10
Stability_d10
Stability_d21
St


In [None]:
export MAXCORES=8
export STARGEN=/links/groups/treutlein/USERS/jjans/resources/genomes/STAR/Homo_sapiens.GRCh38

ls /links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Stability*/*S1*R1*.fastq.gz | parallel -j 15 --plus --rpl '{/..L} s:.fastq.gz::;s:.*/::;' --rpl '{/..} s:.fastq.gz::;s:.*/::;s:_S.*::' --rpl '{S} s:.fastq.gz::;s:.*/::;s:_sample.*::' "
        cd {S}/{/..}
        
        mkdir -p 10.MAPPING
        
        STAR \
        --runThreadN ${MAXCORES} \
        --genomeDir ${STARGEN} \
        --readFilesIn 00.CLEAN_FQ/*merged*.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix 10.MAPPING/{/..}_STAR_ \
        --outSAMtype BAM SortedByCoordinate \
        --outWigType bedGraph \
        --outWigStrand Unstranded \
        --outWigNorm RPM \
        --outReadsUnmapped Fastx \
        --quantMode GeneCounts \
        &>> {/..}_STAR.log
"




In [10]:
export MAXCORES=8
export STARGEN=/links/groups/treutlein/USERS/jjans/resources/genomes/STAR/Homo_sapiens.GRCh38

ls /links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Reproducibility_SC102A1*/*S1*R1*.fastq.gz | parallel -j 15 --plus --rpl '{/..L} s:.fastq.gz::;s:.*/::;' --rpl '{/..} s:.fastq.gz::;s:.*/::;s:_S[0-9].*::' "
        cd Reproducibility_SC102A1/{/..}
        
        mkdir -p 10.MAPPING
        
        STAR \
        --runThreadN ${MAXCORES} \
        --genomeDir ${STARGEN} \
        --readFilesIn 00.CLEAN_FQ/*merged*.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix 10.MAPPING/{/..}_STAR_ \
        --outSAMtype BAM SortedByCoordinate \
        --outWigType bedGraph \
        --outWigStrand Unstranded \
        --outWigNorm RPM \
        --outReadsUnmapped Fastx \
        --quantMode GeneCounts \
        &>> {/..}_STAR.log
"


/nas/groups/treutlein/USERS/jjans/analysis/iNeuron_morphogens/revisions/bulk_experiments/Reproducibility_SC102A1/Reproducibility_SC102A1_2_10
/nas/groups/treutlein/USERS/jjans/analysis


In [9]:
export MAXCORES=8
export STARGEN=/links/groups/treutlein/USERS/jjans/resources/genomes/STAR/Homo_sapiens.GRCh38

ls /links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Reproducibility_SC102A1*/*S1*R1*.fastq.gz | parallel -j 15 --plus --rpl '{/..L} s:.fastq.gz::;s:.*/::;' --rpl '{/..} s:.fastq.gz::;s:.*/::;s:_S[0-9].*::' "
echo {/..}
echo {}
echo {/..L}
"


Reproducibility_SC102A1_1_1
/links/groups/treutlein/DATA/sequencing/20240524_P2794_HSIU-CHUAN/raw/Reproducibility_SC102A1_1_1/Reproducibility_SC102A1_1_1_S1_L001_R1_001.fastq.gz
Reprod
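
The `--rpl` definitions in the cells above derive the sample name from each FASTQ path with a chain of Perl substitutions. The same logic can be tried outside GNU Parallel; the sketch below mirrors `s:.fastq.gz::;s:.*/::;s:_S[0-9].*::` with shell parameter expansion and `sed` (`sample_from_fastq` is a hypothetical helper, and the optional second argument is only there to compare patterns).

```shell
# Derive a sample name from a FASTQ path, mirroring the --rpl expressions (sketch).
# Usage: sample_from_fastq PATH [SUFFIX_REGEX]
sample_from_fastq() {
    name=${1##*/}              # s:.*/::       drop the directory part
    name=${name%.fastq.gz}     # s:.fastq.gz:: drop the extension
    # s:_S[0-9].*::  cut at the first "_S" followed by a digit
    printf '%s\n' "$name" | sed "s/${2:-_S[0-9].*}//"
}

sample_from_fastq /data/raw/Reproducibility_SC102A1_1_1_S1_L001_R1_001.fastq.gz
sample_from_fastq /data/raw/Reproducibility_SC102A1_1_1_S1_L001_R1_001.fastq.gz '_S.*'
```

The first call prints `Reproducibility_SC102A1_1_1`; the second, using the looser `_S.*` pattern from the earlier cells, prints `Reproducibility` because the cell line name `SC102A1` itself begins with `_S`. That is why the `SC102A1` cells need the `[0-9]` in the pattern. The `/data/raw/` prefix is a placeholder path.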


In [93]:
# Define the base directory
BASE_DIR="Reproducibility_SC102A1"

# Align the merged FASTQ files within each sample folder
for FOLDER in "$BASE_DIR"/*; do
    # Skip if not a directory
    if [ ! -d "$FOLDER" ]; then
        continue
    fi
    
    # Get the folder name
    FOLDER_NAME=$(basename "$FOLDER")
    
    echo "$FOLDER_NAME"
    
    # Work inside the sample folder; the subshell restores the working directory
    (
    cd "$FOLDER" || exit 1
    mkdir -p 10.MAPPING

    STAR \
    --runThreadN "${MAXCORES}" \
    --genomeDir "${STARGEN}" \
    --readFilesIn 00.CLEAN_FQ/*merged*.fastq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix "10.MAPPING/${FOLDER_NAME}_STAR_" \
    --outSAMtype BAM SortedByCoordinate \
    --outWigType bedGraph \
    --outWigStrand Unstranded \
    --outWigNorm RPM \
    --outReadsUnmapped Fastx \
    --quantMode GeneCounts \
    &>> "${FOLDER_NAME}_STAR.log"
    )

done


Reproducibility_SC102A1_1_1
Reproducibility_SC102A1_1_10
Reproducibility_SC102A1_1_11
Reproducibility_SC102A1_1_12
Reproducibility_SC102A1_1_2
Reproducibility_SC102A1_1_3
Reproducibility_SC102A1_1_4
Reproducibility_SC102A1_1_5
Reproducibility_SC102A1_1_6
Reproducibility_SC102A1_1_7
Reproducibility_SC102A1_1_8
Reproducibility_SC102A1_1_9
Reproducibility_SC102A1_2_1
Reproducibility_SC102A1_2_10
Reproducibility_SC102A1_2_11
Reproducibility_SC102A1_2_12
Reproducibility_SC102A1_2_2
Reproducibility_SC102A1_2_3
Reproducibility_SC102A1_2_4
Reproducibility_SC102A1_2_5
Reproducibility_SC102A1_2_6
Reproducibility_SC102A1_2_7
Reproducibility_SC102A1_2_8
Reproducibility_SC102A1_2_9
Reproducibility_SC102A1_3_1
Reproducibility_SC102A1_3_10
Reproducibility_SC102A1_3_11
Reproducibility_SC102A1_3_12
Reproducibility_SC102A1_3_2
Reproducibility_SC102A1_3_3
Reproducibility_SC102A1_3_4
Reproducibility_SC102A1_3_5
Reproducibility_SC102A1_3_6
Reproducibility_SC1


In [None]:
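# Merge per-lane cleaned FASTQ files within each sample folder
BASE_DIR="Reproducibility_SC102A1"
for FOLDER in "$BASE_DIR"/*/; do
    FOLDER_NAME=$(basename "$FOLDER")
    (   # subshell restores the working directory after each sample
        cd "$FOLDER/00.CLEAN_FQ" || exit 1
        cat *_R1.clean.fastq.gz > "${FOLDER_NAME}_R1.merged.fastq.gz"
        cat *_R2.clean.fastq.gz > "${FOLDER_NAME}_R2.merged.fastq.gz"
    )
done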
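
Concatenating gzip files with `cat` yields a valid multi-member gzip file, but a merge that ran while a lane was still being written would not. A quick sanity check, sketched here with hypothetical helper names, is to test the archive integrity and confirm that R1 and R2 hold the same number of reads:

```shell
# Count reads in a gzipped FASTQ (four lines per read)
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Verify a merged R1/R2 pair: both files must pass gzip's integrity test
# and contain the same number of reads (sketch).
# Usage: check_merged_pair R1.fastq.gz R2.fastq.gz
check_merged_pair() {
    gzip -t "$1" "$2" || return 1
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" != "$n2" ]; then
        echo "read count mismatch: $n1 vs $n2" >&2
        return 1
    fi
    echo "$n1"
}
```

On success the function prints the shared read count; on a corrupt archive or a mismatch it returns a non-zero status, which makes it easy to gate the alignment step on it.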
