# Day 2

Today, we will start using nf-core pipelines to find differentially abundant genes in our dataset. 
We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10

1. Please take some time to read through the paper and understand their approach, hypotheses and goals.

What was the objective of the study?

Look at/find out:
- How does oxycodone withdrawal alter gene expression in reward-related brain regions?
- How are these changes modified by the presence of chronic neuropathic pain?
- Do histone deacetylases (HDAC1/HDAC2) drive these maladaptive transcriptional changes?

What do the conditions mean?

oxy: mice are given daily subcutaneous injections of oxycodone (opioid) at 30 mg/kg for 14 days.


sal: mice are given daily subcutaneous injections of saline (as control) for 14 days.

What do the genotypes mean?

SNI: animals undergo spared nerve injury (SNI) 2.5 months before drug administration to produce a chronic neuropathic pain state.


Sham: animals undergo sham surgery 2.5 months before drug administration, providing a control to compare the SNI genotype against.

In short, Sham is the control for pain: chronic pain is induced in SNI animals, while the sham surgery accounts for pain and complications caused by the surgery itself.

Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.

What would you do?

Which groups would you compare to each other?

Please also mention which outcome you would expect to see from each comparison.

Procedure of what I would do (Pipeline):
- QC: FastQC + MultiQC
- Trimming (if adapters/low-quality tails are present)
- Quantification
- Drop lowly expressed genes
- Sample QC
- Differential expression
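
The "drop lowly expressed genes" step can be sketched as follows. This is only a minimal illustration on a made-up count matrix: the gene and sample names, `MIN_COUNT` and `MIN_SAMPLES` are all assumptions and are not taken from the paper.

```python
import pandas as pd

# Hypothetical toy count matrix: rows are genes, columns are samples
counts = pd.DataFrame(
    {
        "sample_1": [0, 15, 250, 3],
        "sample_2": [1, 22, 310, 0],
        "sample_3": [0, 9, 275, 12],
        "sample_4": [2, 30, 290, 1],
    },
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
)

MIN_COUNT = 10   # assumed threshold, not taken from the paper
MIN_SAMPLES = 3  # assumed: a gene must reach MIN_COUNT in at least this many samples

# Count, per gene, how many samples reach the minimum count
n_passing = (counts >= MIN_COUNT).sum(axis=1)

# Keep only genes that pass in enough samples
filtered = counts[n_passing >= MIN_SAMPLES]
print(filtered)
```

A common choice is to set `MIN_SAMPLES` to the size of the smallest group, so that a gene expressed in only one group is not filtered out.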

Groups:
- Sham + Saline (Baseline)
- Sham + Oxy Withdrawal (Sham-W)
- SNI + Saline (SNI-only)
- SNI + Oxy Withdrawal (SNI-W)

1. Sham-W vs Baseline:
--> Withdrawal signature: HDAC1/2 repression/enrichment, dopaminergic/glutamatergic signaling shifts
2. SNI-only vs Baseline
--> neuroimmune/inflammatory activation
3. SNI-W vs SNI-only
--> amplification of withdrawal program
4. SNI-W vs Sham-W
--> withdrawal-specific changes on a pain background; overlap with #1 but larger effect sizes for immune/stress pathways
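
One way to check which pairwise comparisons are cleanly interpretable is to list all pairs of groups and keep those that differ in exactly one factor. The sketch below does that for the four groups named above; the factor names `surgery` and `treatment` are labels chosen here for illustration.

```python
from itertools import combinations

# Group labels from the list above, described as (surgery, treatment)
groups = {
    "Baseline": ("Sham", "Sal"),
    "Sham-W": ("Sham", "Oxy"),
    "SNI-only": ("SNI", "Sal"),
    "SNI-W": ("SNI", "Oxy"),
}
factor_names = ("surgery", "treatment")


def differing_factors(group_a, group_b):
    """Return the names of the factors in which two groups differ."""
    return [
        name
        for name, a, b in zip(factor_names, groups[group_a], groups[group_b])
        if a != b
    ]


# Keep only pairs that differ in a single factor
single_factor = {
    (a, b): differing_factors(a, b)[0]
    for a, b in combinations(groups, 2)
    if len(differing_factors(a, b)) == 1
}

for (a, b), factor in single_factor.items():
    print(f"{b} vs {a}: isolates the effect of {factor}")
```

Pairs that differ in both factors (for example SNI-W vs Baseline) mix the pain and withdrawal effects and are harder to interpret on their own.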

Your group gave you a very suboptimal Excel sheet (conditions_runs_oxy_project.xlsx) containing the information you need for each run they uploaded to the SRA.<br>
So, instead of diving directly into downloading the data and starting the analysis, you first need to tidy up this messy table.<br>
Use Python and Pandas to get the table into a more sensible order.<br>

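import numpy as np
import pandas as pd

# Read the sheet and use the run accession as the index
df = pd.read_excel("conditions_runs_oxy_project.xlsx", index_col="Run")

# A cell marked "x" means the flag is set; everything else (including empty cells) is False
flags = df[["condition: Sal", "Condition: Oxy"]].eq("x")

# For each run, pick the label whose flag column is set
conditions = ["Sal", "Oxy"]
df["Condition"] = np.select([flags["condition: Sal"], flags["Condition: Oxy"]], conditions, default="Unknown")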


In [None]:
import pandas as pd

df = pd.read_excel('conditions_runs_oxy_project.xlsx')


# Create new simplified columns
df["Condition"] = df.apply(
    lambda row: "Sal" if row["condition: Sal"] == "x" else ("Oxy" if row["Condition: Oxy"] == "x" else None),
    axis=1
)

df["Genotype"] = df.apply(
    lambda row: "SNI" if row["Genotype: SNI"] == "x" else ("Sham" if row["Genotype: Sham"] == "x" else None),
    axis=1
)

# Keep relevant columns only; .copy() avoids a SettingWithCopyWarning when columns are added later
clean_df = df[["Run", "Condition", "Genotype"]].copy()

clean_df

,Run,Condition,Genotype
0,SRR23195505,Sal,SNI
1,SRR23195506,Oxy,Sham
2,SRR23195507,Sal,Sham
3,SRR23195508,Oxy,SNI
4,SRR23195509,Oxy,SNI
5,SRR23195510,Sal,SNI
6,SRR23195511,Oxy,Sham
7,SRR23195512,Sal,Sham
8,SRR23195513,Sal,SNI
9,SRR23195514,Oxy,Sham


Then, perform some overview analysis and plot the results
1. How many samples do you have per condition?
2. How many samples do you have per genotype?
3. How often do you have each condition per genotype?

In [8]:
def assign_group(row):
    if row["Genotype"] == "Sham" and row["Condition"] == "Sal":
        return "Sham-Sal"
    elif row["Genotype"] == "Sham" and row["Condition"] == "Oxy":
        return "Sham-Oxy"
    elif row["Genotype"] == "SNI" and row["Condition"] == "Sal":
        return "SNI-Sal"
    elif row["Genotype"] == "SNI" and row["Condition"] == "Oxy":
        return "SNI-Oxy"
    else:
        return "Unassigned"

clean_df["Group"] = clean_df.apply(assign_group, axis=1)

# 1. Count per condition
condition_counts = clean_df["Condition"].value_counts()
print(condition_counts)

# 2. Count per genotype
genotype_counts = clean_df["Genotype"].value_counts()
print(genotype_counts)

# 3. Cross-tab condition vs genotype
cond_geno_ct = pd.crosstab(clean_df["Genotype"], clean_df["Condition"])
print(cond_geno_ct)

Condition
Sal    8
Oxy    8
Name: count, dtype: int64
Genotype
SNI     8
Sham    8
Name: count, dtype: int64
Condition  Oxy  Sal
Genotype           
SNI          4    4
Sham         4    4
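
The task also asks for plots of these counts. A minimal sketch with matplotlib is shown below; it rebuilds the cross-tab from the numbers printed above so that it runs on its own. The output file name is an assumption, and in a notebook you would reuse `cond_geno_ct` directly and drop the backend line.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for running outside a notebook; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Cross-tab values copied from the printed output above
cond_geno_ct = pd.DataFrame(
    {"Oxy": [4, 4], "Sal": [4, 4]},
    index=pd.Index(["SNI", "Sham"], name="Genotype"),
)
cond_geno_ct.columns.name = "Condition"

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# 1. Samples per condition (column sums of the cross-tab)
cond_geno_ct.sum(axis=0).plot.bar(ax=axes[0], title="Samples per condition")

# 2. Samples per genotype (row sums of the cross-tab)
cond_geno_ct.sum(axis=1).plot.bar(ax=axes[1], title="Samples per genotype")

# 3. Condition per genotype (grouped bars)
cond_geno_ct.plot.bar(ax=axes[2], title="Condition per genotype")

for ax in axes:
    ax.set_ylabel("Number of samples")

fig.tight_layout()
fig.savefig("sample_overview.png")  # hypothetical output file name
```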


They were kind enough to also provide the number of bases per run, so that you can estimate how much space the data will take up on your cluster.<br>
Add a new column with this information (base_counts.csv) to your fancy table and sort your dataframe by the number of bases and the condition.

Then select the 2 smallest runs from your dataset and download them from SRA (maybe an nf-core pipeline can help here?...)

In [4]:
bases_per_run_csv = "base_counts.csv"

bases = pd.read_csv(bases_per_run_csv, index_col="Run")
bases

Run,Bases
SRR23195505,6922564500
SRR23195506,7859530800
SRR23195507,8063298900
SRR23195508,6927786900
SRR23195509,7003550100
SRR23195510,7377388500
SRR23195511,6456390900
SRR23195512,7462857900
SRR23195513,8099181600
SRR23195514,7226808600


In [6]:
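# Add the base counts; "Run" is a column in df and the index of bases.
# The output shows an empty Bases_x next to Bases_y, which suggests the sheet
# already contained an empty "Bases" column; Bases_y holds the values from base_counts.csv.
df = df.merge(bases, on="Run")
df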

,Patient,Run,RNA-seq,DNA-seq,condition: Sal,Condition: Oxy,Genotype: SNI,Genotype: Sham,Condition,Genotype,Bases_x,Bases_y
0,?,SRR23195505,x,,x,,x,,Sal,SNI,,6922564500
1,?,SRR23195506,x,,,x,,x,Oxy,Sham,,7859530800
2,?,SRR23195507,x,,x,,,x,Sal,Sham,,8063298900
3,?,SRR23195508,x,,,x,x,,Oxy,SNI,,6927786900
4,?,SRR23195509,x,,,x,x,,Oxy,SNI,,7003550100
5,?,SRR23195510,x,,x,,x,,Sal,SNI,,7377388500
6,?,SRR23195511,x,,,x,,x,Oxy,Sham,,6456390900
7,?,SRR23195512,x,,x,,,x,Sal,Sham,,7462857900
8,?,SRR23195513,x,,x,,x,,Sal,SNI,,8099181600
9,?,SRR23195514,x,,,x,,x,Oxy,Sham,,7226808600
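
The remaining part of the task (sorting by condition and size, then picking the two smallest runs) could look roughly like this. To keep the sketch self-contained it uses only four of the rows shown above; on the real data you would apply the same calls to your full `clean_df` and `bases`. The names `annotated` and `smallest` are chosen here for illustration.

```python
import pandas as pd

# Subset of the rows shown above, used only to keep the sketch self-contained
runs = ["SRR23195505", "SRR23195506", "SRR23195507", "SRR23195511"]
clean_df = pd.DataFrame(
    {
        "Run": runs,
        "Condition": ["Sal", "Oxy", "Sal", "Oxy"],
        "Genotype": ["SNI", "Sham", "Sham", "Sham"],
    }
)
bases = pd.DataFrame(
    {"Bases": [6922564500, 7859530800, 8063298900, 6456390900]},
    index=pd.Index(runs, name="Run"),
)

# Join on the run accession: a column in clean_df, the index in bases
annotated = clean_df.merge(bases, left_on="Run", right_index=True)

# Sort by condition first, then by size within each condition
annotated = annotated.sort_values(["Condition", "Bases"]).reset_index(drop=True)

# The two smallest runs overall, regardless of condition
smallest = annotated.nsmallest(2, "Bases")
print(smallest["Run"].tolist())
```

Merging into `clean_df` rather than the raw `df` avoids the `Bases_x`/`Bases_y` suffixes, because `clean_df` has no pre-existing `Bases` column.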


In [None]:
nf-core/fetchngs pipeline
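
A possible way to prepare the download with nf-core/fetchngs is sketched below. The pipeline reads a file with one accession per line. The two accessions used here are simply the smallest among the rows listed above and should be replaced with your own selection; the file name, output directory and the `docker` profile are assumptions that depend on your setup.

```python
from pathlib import Path

# Placeholder accessions: replace with the two runs you selected
run_ids = ["SRR23195511", "SRR23195505"]

# Write one accession per line, as expected by nf-core/fetchngs
ids_file = Path("ids.csv")  # hypothetical file name
ids_file.write_text("\n".join(run_ids) + "\n")

# Command to launch from a shell; output directory and profile are assumptions
command = [
    "nextflow", "run", "nf-core/fetchngs",
    "--input", str(ids_file),
    "--outdir", "fetchngs_results",
    "-profile", "docker",
]
print(" ".join(command))
```

The sketch only prints the command instead of running it, so you can check it before starting a download of several gigabytes.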

While your files are downloading, go back to the paper and explain how you would try to reproduce the analysis.<br>
When you are done, shout so we can discuss the different ideas.