# Day 2

Today, we will start using nf-core pipelines to find differentially expressed genes in our dataset.
We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10

1. Please take some time to read through the paper and understand their approach, hypotheses and goals.

What was the objective of the study?

The main objective was to characterize the connection between oxycodone withdrawal and pain at the transcriptomic level and to build a new mouse model based on it.

What do the conditions mean?

oxy: Mouse treated with oxycodone


sal: Mouse treated with a placebo (control)

What do the genotypes mean?

SNI: Mouse with a chronic pain model (spared nerve injury)

Sham: Mouse without the chronic pain model (control)

Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.

What would you do?

1. Get an overview of the data.
2. Preprocess it into a suitable format.
3. Run DE analysis between all possible combinations of the four groups (using oxy-SNI vs. sal-Sham as the baseline comparison).
4. Compare the DE genes of the Sham groups with those of the SNI groups (as the authors did) to see which regulators are pain-related and which are addiction-related.

Which groups would you compare to each other? Each with each (if not too computationally expensive), as described above.
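
The "each with each" scheme can be enumerated programmatically. This is a minimal sketch, assuming the four groups are labelled by joining the condition and genotype names used above; the label format is an illustrative choice, not something the study prescribes.

```python
from itertools import combinations, product

conditions = ["oxy", "sal"]
genotypes = ["SNI", "Sham"]

# The four experimental groups: every condition paired with every genotype.
# The "condition-genotype" label format is an assumption for illustration.
groups = [f"{c}-{g}" for c, g in product(conditions, genotypes)]

# All unordered pairs of groups, i.e. "each with each".
comparisons = list(combinations(groups, 2))

for first, second in comparisons:
    print(f"{first} vs. {second}")
```

With four groups this yields six pairwise comparisons.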

Please also mention which outcome you would expect to see from each comparison. Oxy vs. Sal within SNI should give a different DE gene set than Oxy vs. Sal within Sham, but the two sets should intersect. The same holds for SNI vs. Sham within oxy and within sal. Oxy-SNI vs. Sal-Sham should have the smallest intersection with the other comparisons. Genes in that intersection that also appear in the intersection between other comparisons should be removed, because they are likely side effects without a specific Oxy or SNI background.
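
The filtering idea described above can be expressed with plain Python sets. This is only a sketch: the gene names and set contents are invented placeholders, not results from the paper, and the variable names are hypothetical.

```python
# Hypothetical DE gene sets; the gene symbols are placeholders, not real results.
oxy_vs_sal_in_sni = {"Fos", "Bdnf", "Il6", "Actb"}
oxy_vs_sal_in_sham = {"Fos", "Bdnf", "Npy", "Actb"}
baseline = {"Actb", "Gapdh"}  # stands in for oxy-SNI vs. sal-Sham

# Genes that respond to oxycodone in both genotypes.
shared = oxy_vs_sal_in_sni & oxy_vs_sal_in_sham

# Drop genes that also show up in the baseline comparison.
filtered = shared - baseline

# Genes that respond to oxycodone only in the pain model.
sni_specific = oxy_vs_sal_in_sni - oxy_vs_sal_in_sham

print(sorted(filtered))
print(sorted(sni_specific))
```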

Your group gave you a very suboptimal excel sheet (conditions_runs_oxy_project.xlsx) to get the information you need for each run they uploaded to the SRA.<br>
So, instead of directly diving into downloading the data and starting the analysis, you first need to sort the lazy table.<br>
Use Python and Pandas to get the table into a more sensible order.<br>
Then, perform some overview analysis and plot the results
1. How many samples do you have per condition?
2. How many samples do you have per genotype?
3. How often do you have each condition per genotype?

In [1]:
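import pandas as pd

# Load the sheet with the run accession as index.
conditions_runs = pd.read_excel("conditions_runs_oxy_project.xlsx", index_col="Run")
# A filled cell in a marker column means the run belongs to that group.
conditions_runs["SNI"] = conditions_runs["Genotype: SNI"].notna()
conditions_runs["Oxy"] = conditions_runs["Condition: Oxy"].notna()
# Keep only the two tidy boolean columns.
conditions_runs = conditions_runs[["SNI", "Oxy"]]
conditions_runs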

Run,SNI,Oxy
SRR23195505,True,False
SRR23195506,False,True
SRR23195507,False,False
SRR23195508,True,True
SRR23195509,True,True
SRR23195510,True,False
SRR23195511,False,True
SRR23195512,False,False
SRR23195513,True,False
SRR23195514,False,True


In [2]:
condition_counts = conditions_runs['Oxy'].value_counts()
print("Samples per condition:")
print(condition_counts)

genotype_counts = conditions_runs['SNI'].value_counts()
print("\nSamples per genotype:")
print(genotype_counts)

cross_counts = pd.crosstab(conditions_runs['SNI'], conditions_runs['Oxy'])
print("\nCounts of condition per genotype:")
print(cross_counts)

Samples per condition:
Oxy
False    8
True     8
Name: count, dtype: int64

Samples per genotype:
SNI
True     8
False    8
Name: count, dtype: int64

Counts of condition per genotype:
Oxy    False  True 
SNI                
False      4      4
True       4      4
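
The task also asks for a plot of these counts. A grouped bar chart of the crosstab is one option. The following is a minimal sketch on a small stand-in frame; the run IDs and values in `toy` are invented for illustration, and in the notebook `conditions_runs` would take its place.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `conditions_runs`; run IDs and values are made up.
toy = pd.DataFrame(
    {
        "SNI": [True, True, False, False, True, False],
        "Oxy": [True, False, True, False, True, False],
    },
    index=pd.Index([f"run{i}" for i in range(1, 7)], name="Run"),
)

# Rows: genotype (SNI or not), columns: condition (Oxy or not).
counts = pd.crosstab(toy["SNI"], toy["Oxy"])

# One group of bars per genotype, one bar per condition.
ax = counts.plot(kind="bar", rot=0)
ax.set_xlabel("SNI genotype")
ax.set_ylabel("Number of samples")
ax.set_title("Samples per condition and genotype")
ax.legend(title="Oxy treatment")
plt.tight_layout()
```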


They were kind enough to also provide the number of bases per run, so that you know how much space the data will take on your cluster.<br>
Add a new column with this information (base_counts.csv) to your fancy table and sort your dataframe by the number of bases and the condition.

Then select the 2 smallest runs from your dataset and download them from SRA (maybe an nf-core pipeline can help here?...)

In [3]:
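bases = pd.read_csv("base_counts.csv", index_col="Run")

# Join the base counts onto the condition table via the shared "Run" index.
metadata = conditions_runs.merge(bases, on="Run")
# Smallest runs first.
metadata = metadata.sort_values(by="Bases", ascending=True)
metadata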

Run,SNI,Oxy,Bases
SRR23195516,True,True,6203117700
SRR23195511,False,True,6456390900
SRR23195517,True,True,6863840400
SRR23195505,True,False,6922564500
SRR23195508,True,True,6927786900
SRR23195519,False,True,6996050100
SRR23195509,True,True,7003550100
SRR23195514,False,True,7226808600
SRR23195510,True,False,7377388500
SRR23195512,False,False,7462857900
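
The cell above sorts by `Bases` only, while the task also mentions the condition. Passing a list of columns to `sort_values` is one way to sort by both. This is a minimal sketch on a stand-in frame shaped like `metadata`; the run IDs and base counts in `toy_metadata` are invented.

```python
import pandas as pd

# Hypothetical stand-in for `metadata`; run IDs and base counts are made up.
toy_metadata = pd.DataFrame(
    {
        "SNI": [True, False, True, False],
        "Oxy": [True, True, False, False],
        "Bases": [700, 500, 600, 400],
    },
    index=pd.Index(["runA", "runB", "runC", "runD"], name="Run"),
)

# Sort by condition first, then by size within each condition.
by_condition = toy_metadata.sort_values(by=["Oxy", "Bases"])

# The two smallest runs overall, regardless of condition.
two_smallest = toy_metadata.nsmallest(2, "Bases")

print(by_condition)
print(two_smallest.index.tolist())
```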


In [4]:
metadata_small = metadata.head(2)
metadata_small.index.to_series().to_csv("ids.csv", index=False, header=False)
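
Before starting a long download, it can help to check that the ID file holds one accession per line. This is a minimal sketch: the helper `read_ids` is hypothetical, and the regular expression is an assumption that only covers run accessions of the form SRR/ERR/DRR plus digits, which is narrower than the set of ID types fetchngs accepts. The demo writes the two smallest runs from the table above to a temporary file.

```python
import re
import tempfile
from pathlib import Path

# Assumption: only run accessions (SRR/ERR/DRR followed by digits) are expected.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def read_ids(path):
    """Return the IDs in `path`, one per line; raise if any looks malformed."""
    lines = Path(path).read_text().splitlines()
    ids = [line.strip() for line in lines if line.strip()]
    bad = [i for i in ids if not RUN_ACCESSION.match(i)]
    if bad:
        raise ValueError(f"Unexpected IDs: {bad}")
    return ids


# Demo on a temporary file so that the real ids.csv is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "ids.csv"
    demo_file.write_text("SRR23195516\nSRR23195511\n")
    ids = read_ids(demo_file)

print(ids)
```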

In [5]:
!nextflow run nf-core/fetchngs -profile docker --input ids.csv --outdir ./expression_data --max_memory "8GB" -resume


[1m[38;5;232m[48;5;43m N E X T F L O W [0;2m  ~  [mversion 25.04.7[m
[K
Launching[35m `https://github.com/nf-core/fetchngs` [0;2m[[0;1;36mbackstabbing_watson[0;2m] DSL2 - [36mrevision: [0;36m8ec2d934f9 [master][m
[K
[33mWARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`[39m[K


-[2m----------------------------------------------------[0m-
                                        [0;32m,--.[0;30m/[0;32m,-.[0m
[0;34m        ___     __   __   __   ___     [0;32m/,-._.--~'[0m
[0;34m  |\ | |__  __ /  ` /  \ |__) |__         [0;33m}  {[0m
[0;34m  | \| |       \__, \__/ |  \ |___     [0;32m\`-._,-`-,[0m
                                        [0;32m`._,._,'[0m
[0;35m  nf-core/fetchngs v1.12.0-g8ec2d93[0m
-[2m----------------------------------------------------[0m-
[1mCore Nextflow options[0m
  [0;34mrevision       : [0;32mmaster[0m
  [0;34mrunName        : [0;32mbackstab

While your files are downloading, get back to the paper and explain how you would try to reproduce the analysis.<br>
When you are done, give a shout so we can discuss the different ideas.
1. Use the nf-core RNA-seq pipeline for gene quantification
2. Generate a count dataframe
3. Do the DE analysis with the same tool the authors used
4. Do the same pathway / upstream regulator analysis
5. Analyze differences/similarities between both results
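
For the last step, one simple way to quantify how well a reproduced DE gene list matches the published one is the Jaccard index (size of the intersection divided by size of the union). This is a sketch only: the gene symbols are invented placeholders, and `jaccard` is a hypothetical helper, not a method taken from the paper.

```python
def jaccard(a, b):
    """Jaccard index of two collections; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical gene lists; the symbols are placeholders, not real results.
published = {"Fos", "Bdnf", "Npy", "Il6"}
reproduced = {"Fos", "Bdnf", "Il6", "Tnf", "Crh"}

similarity = jaccard(published, reproduced)
only_published = published - reproduced
only_reproduced = reproduced - published

print(f"Jaccard index: {similarity:.2f}")
print("Only in published list:", sorted(only_published))
print("Only in reproduced list:", sorted(only_reproduced))
```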