# Day 2

Today, we will start using nf-core pipelines to find differentially abundant genes in our dataset. 
We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10

1. Please take some time to read through the paper and understand their approach, hypotheses and goals.

What was the objective of the study?

To investigate the transcriptomic effects of chronic opioid exposure and withdrawal in brain reward circuits, and how these are altered by the presence of chronic pain.

What do the conditions mean?

oxy: 2 weeks of daily treatment with oxycodone after spared nerve injury or sham surgery


sal: treatment with 0.9% saline after spared nerve injury or sham surgery

What do the genotypes mean?

SNI: spared nerve injury (surgery of left sciatic nerve: 1-2 mm sections of these nerves were removed) -> chronic pain


Sham: in sham controls the surgery was mimicked, but the nerve was left intact -> no chronic pain

Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.

What would you do?

Which groups would you compare to each other?

Please also mention which outcome you would expect to see from each comparison.

What would you do?

Since we get the raw data, we first have to run quality control (FastQC) and check the read quality of the RNA-seq runs. We may have to trim reads and remove adapter sequences.
Then we align the reads to the current mouse reference genome (HISAT2) and count the reads for each annotated gene to get a count matrix.
To find differentially expressed genes between conditions, we could run DESeq2, which handles normalization itself and therefore has to be run on the raw counts.
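
To make the point about raw counts concrete, here is a minimal NumPy sketch of the median-of-ratios idea that DESeq2 uses for its size factors. It is not DESeq2's own code (DESeq2 is an R package), the function name `size_factors` is our own, and the toy matrix contains made-up numbers.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples matrix of raw counts."""
    counts = np.asarray(counts, dtype=float)
    # Use only genes with a non-zero count in every sample, so the log is defined.
    expressed = (counts > 0).all(axis=1)
    log_counts = np.log(counts[expressed])
    # The per-gene log geometric mean across samples acts as a pseudo-reference sample.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor = median ratio of each sample to that reference.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy matrix (made-up numbers): sample 2 was sequenced twice as deeply as sample 1.
toy_counts = np.array([
    [10, 20],
    [100, 200],
    [0, 3],     # ignored for the size factors: zero in one sample
    [50, 100],
])

factors = size_factors(toy_counts)
normalized = toy_counts / factors
print(factors)
print(normalized)
```

Because the size factors are estimated from the counts themselves, feeding in values that were already normalized (for example TPM) would distort them.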

Which groups would you compare to each other? Please also mention which outcome you would expect to see from each comparison.

1. Find DEGs for treatment and genotype against their controls, to find effects specific to oxy and effects specific to SNI.
-> This means 2 comparisons:
- SNI-oxy vs SNI-sal
- SNI-sal vs Sham-sal
2. Find DEGs for SNI-oxy vs Sham-sal to capture the effects of oxy withdrawal in an SNI setting compared to the control (Sham-sal = control in both genotype and treatment).
3. Compare the overlaps to find DEGs that are specific to SNI-oxy vs Sham-sal, excluding effects that also occur due to oxy alone or due to SNI alone.
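
The overlap step in point 3 could be sketched with plain Python sets. The gene names below are placeholders and not results; in practice each set would hold the significant genes from one DESeq2 contrast.

```python
# Hypothetical DEG sets per contrast; the gene names are placeholders, not results.
deg_oxy = {"geneA", "geneB"}                          # SNI-oxy vs SNI-sal
deg_sni = {"geneB", "geneC"}                          # SNI-sal vs Sham-sal
deg_combined = {"geneA", "geneC", "geneD", "geneE"}   # SNI-oxy vs Sham-sal

# Genes that only show up when injury and oxycodone are combined:
# remove everything that is already explained by oxy alone or by SNI alone.
specific_to_combination = deg_combined - (deg_oxy | deg_sni)
print(sorted(specific_to_combination))
```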

Your group gave you a very suboptimal Excel sheet (conditions_runs_oxy_project.xlsx) to get the information you need for each run they uploaded to the SRA.<br>
So, instead of directly diving into downloading the data and starting the analysis, you first need to tidy up the lazy table.<br>
Use Python and Pandas to get the table into a more sensible order.<br>
Then, perform some overview analysis and plot the results.
1. How many samples do you have per condition?
2. How many samples do you have per genotype?
3. How often do you have each condition per genotype?

In [32]:
import pandas as pd
import os

# Paths
print("Current working directory:", os.getcwd())

# Read data; the sheet marks membership with "X"/"x" and leaves all other cells empty
df = pd.read_excel("conditions_runs_oxy_project.xlsx", index_col="Run")
df = df.fillna(False)
df = df.replace(["X", "x"], True)

# Save; keep the index, because it holds the run accessions
os.makedirs("results", exist_ok=True)
df.to_excel("results/cleaned_table.xlsx")

print("Table cleaned and saved as cleaned_table.xlsx")
print(df.head())

Current working directory: /home/chrissi/BioPrak/computational-workflows-2025/notebooks/day_02
Table cleaned and saved as cleaned_table.xlsx
            Patient  RNA-seq  DNA-seq  condition: Sal  Condition: Oxy  \
Run                                                                     
SRR23195505       ?     True    False            True           False   
SRR23195506       ?     True    False           False            True   
SRR23195507       ?     True    False            True           False   
SRR23195508       ?     True    False           False            True   
SRR23195509       ?     True    False           False            True   

             Genotype: SNI  Genotype: Sham  
Run                                         
SRR23195505           True           False  
SRR23195506          False            True  
SRR23195507          False            True  
SRR23195508           True           False  
SRR23195509           True           False  




In [None]:
import numpy as np

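# Collapse the one-hot columns into a single "Condition" and a single "Genotype" column
conditions = ["Sal", "Oxy"]
genotypes  = ["SNI", "Sham"]
# np.select expects boolean arrays, so cast explicitly in case the columns are still object dtype
condition_flags = df[["condition: Sal", "Condition: Oxy"]].astype(bool).to_numpy().T
genotype_flags = df[["Genotype: SNI", "Genotype: Sham"]].astype(bool).to_numpy().T
df["Condition"] = np.select(list(condition_flags), conditions, default=None)
df["Genotype"]  = np.select(list(genotype_flags), genotypes, default=None)
df = df.drop(['condition: Sal', 'Condition: Oxy', 'Genotype: SNI', 'Genotype: Sham'], axis=1)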

print(df.head())

            Patient  RNA-seq  DNA-seq Condition Genotype
Run                                                     
SRR23195505       ?     True    False       Sal      SNI
SRR23195506       ?     True    False       Oxy     Sham
SRR23195507       ?     True    False       Sal     Sham
SRR23195508       ?     True    False       Oxy      SNI
SRR23195509       ?     True    False       Oxy      SNI
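
Collapsing the flag columns silently assumes that every run has exactly one condition and exactly one genotype marked. A small sketch of how that could be checked while collapsing, using the first three rows from the cleaned table printed earlier; the names `flags`, `groups` and `tidy` are our own.

```python
import pandas as pd

# First three rows of the cleaned table, copied from the printed output.
flags = pd.DataFrame(
    {
        "condition: Sal": [True, False, True],
        "Condition: Oxy": [False, True, False],
        "Genotype: SNI": [True, False, False],
        "Genotype: Sham": [False, True, True],
    },
    index=pd.Index(["SRR23195505", "SRR23195506", "SRR23195507"], name="Run"),
)

groups = {
    "Condition": ["condition: Sal", "Condition: Oxy"],
    "Genotype": ["Genotype: SNI", "Genotype: Sham"],
}

tidy = pd.DataFrame(index=flags.index)
for name, cols in groups.items():
    block = flags[cols].astype(bool)
    # Each run must carry exactly one flag per group, otherwise the sheet is ambiguous.
    bad = block.index[block.sum(axis=1) != 1]
    if len(bad) > 0:
        raise ValueError(f"{name}: runs without exactly one flag: {list(bad)}")
    # idxmax returns the column holding the single True; keep the part after ": ".
    tidy[name] = block.idxmax(axis=1).str.split(": ").str[1]

print(tidy)
```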


They were kind enough to also provide the number of bases per run, so that you can estimate how much space the data will take on your cluster.<br>
Add a new column with this information (base_counts.csv) to your fancy table and sort your dataframe by this column and by condition.

Then select the 2 smallest runs from your dataset and download them from SRA (maybe an nf-core pipeline can help here?...)

In [35]:
# Sort
df = df.sort_values(by=['Condition', 'Genotype'])

print(df.head())

            Patient  RNA-seq  DNA-seq Condition Genotype
Run                                                     
SRR23195508       ?     True    False       Oxy      SNI
SRR23195509       ?     True    False       Oxy      SNI
SRR23195516       ?     True    False       Oxy      SNI
SRR23195517       ?     True    False       Oxy      SNI
SRR23195506       ?     True    False       Oxy     Sham


In [None]:
# Add Bases to dataframe
df_base = pd.read_csv("base_counts.csv", index_col="Run")
print(df_base)

df["Bases"] = df_base.Bases

print(df.head())

                  Bases
Run                    
SRR23195505  6922564500
SRR23195506  7859530800
SRR23195507  8063298900
SRR23195508  6927786900
SRR23195509  7003550100
SRR23195510  7377388500
SRR23195511  6456390900
SRR23195512  7462857900
SRR23195513  8099181600
SRR23195514  7226808600
SRR23195515  8169101700
SRR23195516  6203117700
SRR23195517  6863840400
SRR23195518  7908500400
SRR23195519  6996050100
SRR23195520  7858146000
            Patient  RNA-seq  DNA-seq Condition Genotype       Bases
Run                                                                 
SRR23195508       ?     True    False       Oxy      SNI  6927786900
SRR23195509       ?     True    False       Oxy      SNI  7003550100
SRR23195516       ?     True    False       Oxy      SNI  6203117700
SRR23195517       ?     True    False       Oxy      SNI  6863840400
SRR23195506       ?     True    False       Oxy     Sham  7859530800
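
The task asks for a sort by condition and number of bases, while the cell above sorted by condition and genotype before the bases were added. A minimal sketch of the requested sort, using five runs whose values are copied from the outputs in this notebook; `runs`, `bases`, `merged` and `ordered` are our own names.

```python
import pandas as pd

run_ids = pd.Index(
    ["SRR23195505", "SRR23195506", "SRR23195507", "SRR23195508", "SRR23195509"],
    name="Run",
)
runs = pd.DataFrame(
    {
        "Condition": ["Sal", "Oxy", "Sal", "Oxy", "Oxy"],
        "Genotype": ["SNI", "Sham", "Sham", "SNI", "SNI"],
    },
    index=run_ids,
)
bases = pd.DataFrame(
    {"Bases": [6922564500, 7859530800, 8063298900, 6927786900, 7003550100]},
    index=run_ids,
)

# join aligns on the Run index, so the row order of the two files does not matter.
merged = runs.join(bases, how="left")
# Fail early if a run has no base count.
if merged["Bases"].isna().any():
    raise ValueError("Some runs are missing from base_counts.csv")

# Sort within each condition from the smallest to the largest run.
ordered = merged.sort_values(["Condition", "Bases"])
print(ordered)
```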


In [38]:
# Save; keep the index, because it holds the run accessions
df.to_excel("results/cleaned_table.xlsx")

print("Table cleaned and saved as cleaned_table.xlsx to results folder")

Table cleaned and saved as cleaned_table.xlsx to results folder


In [None]:
# Number of samples per condition/ genotype
genotype_counts = df["Genotype"].value_counts()
print(genotype_counts)

condition_counts = df["Condition"].value_counts()
print(condition_counts)

combination_counts = df.groupby(["Genotype", "Condition"]).size()
print(combination_counts)

Genotype
SNI     8
Sham    8
Name: count, dtype: int64
Condition
Oxy    8
Sal    8
Name: count, dtype: int64
Genotype  Condition
SNI       Oxy          4
          Sal          4
Sham      Oxy          4
          Sal          4
dtype: int64
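
The task also asks for a plot of these numbers. One possible sketch with matplotlib is shown below; it rebuilds the balanced design from the counts printed above (4 runs per group) so that it runs on its own. In the notebook, `design` would simply be `df`.

```python
import itertools

import matplotlib.pyplot as plt
import pandas as pd

# Rebuild the balanced design from the printed counts: 4 runs per group.
design = pd.DataFrame(
    [
        {"Genotype": genotype, "Condition": condition}
        for genotype, condition in itertools.product(["SNI", "Sham"], ["Oxy", "Sal"])
        for _ in range(4)
    ]
)

# Rows are genotypes, columns are conditions, cells are run counts.
counts = pd.crosstab(design["Genotype"], design["Condition"])

# Grouped bar chart: one group per genotype, one bar per condition.
fig, ax = plt.subplots(figsize=(4, 3))
counts.plot.bar(ax=ax, rot=0)
ax.set_ylabel("Number of runs")
ax.set_title("Runs per genotype and condition")
fig.tight_layout()
```

A grouped bar chart answers all three questions at once: the bar heights give the counts per combination, and the row and column sums of `counts` give the totals per genotype and per condition.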


In [39]:
lowest_two = df.nsmallest(2, "Bases")

print(lowest_two)

            Patient  RNA-seq  DNA-seq Condition Genotype       Bases
Run                                                                 
SRR23195516       ?     True    False       Oxy      SNI  6203117700
SRR23195511       ?     True    False       Oxy     Sham  6456390900
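
The fetchngs call below reads the accessions from `ids.csv`. As far as we understand the pipeline, this file should contain one accession per line without a header, so it could be written directly from the selection. In this sketch, `lowest_two` is rebuilt from the printed output so that the snippet runs on its own.

```python
import pandas as pd

# The two smallest runs, as printed above.
lowest_two = pd.DataFrame(
    {"Bases": [6203117700, 6456390900]},
    index=pd.Index(["SRR23195516", "SRR23195511"], name="Run"),
)

# One accession per line, no header and no extra columns.
lowest_two.index.to_series().to_csv("ids.csv", index=False, header=False)

# Read the file back to check what the pipeline will see.
with open("ids.csv") as handle:
    written_ids = handle.read().splitlines()
print(written_ids)
```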


In [None]:
# Fetch data from SRA

!nextflow run nf-core/fetchngs --input /home/chrissi/BioPrak/computational-workflows-2025/notebooks/day_02/ids.csv -profile docker --outdir /home/chrissi/BioPrak/computational-workflows-2025/notebooks/day_02/SRR_data_fetch --max_memory "4GB"

# Result: paired-end reads
# pipeline_info -> execution_report: shows more information on the processes that were run
# samplesheet -> information on the samples that were downloaded

# Since fetching could not be completed in the time given, we shared the data manually via a USB stick.
# I put it in the SRR_data_fetch folder.

While your files are downloading, get back to the paper and explain how you would try to reproduce the analysis.<br>
When you are done, shout so we can discuss the different ideas.

Since we get the raw files, we have to run quality control and preprocess the data before the analysis, ideally in the same way as in the paper.
However, the paper barely describes its preprocessing, and the tools it mentions are listed without version numbers.

This is why we suggest using the nf-core pipelines to make the analysis reproducible.