# Day 2

Today, we will start using nf-core pipelines to find differentially abundant genes in our dataset. 
We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10

1. Please take some time to read through the paper and understand their approach, hypotheses and goals.

What was the objective of the study?

The objective of the study was to investigate the effects of chronic neuropathic pain on brain function and behavior, specifically focusing on the role of the prefrontal cortex (PFC) and its neural circuits. The researchers aimed to understand how neuropathic pain alters PFC activity and connectivity, leading to cognitive and emotional impairments commonly observed in chronic pain patients. They also sought to identify potential therapeutic targets for alleviating these deficits.

What do the conditions mean?

oxy: Oxycodone treatment<br>
sal: Saline treatment

What do the genotypes mean?

SNI: Spared Nerve Injury (a model for neuropathic pain)<br>
Sham: Sham surgery (control group without nerve injury)

Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.

What would you do?

Which groups would you compare to each other?

Please also mention which outcome you would expect to see from each comparison.

I want to compare the following groups:
1. SNI + oxy vs. SNI + sal: To assess the effect of oxycodone treatment on neuropathic pain and its associated cognitive and emotional impairments.
2. Sham + oxy vs. Sham + sal: To evaluate the impact of oxycodone on normal brain function and behavior in the absence of neuropathic pain.
3. SNI + sal vs. Sham + sal: To understand the effects of neuropathic pain on brain function and behavior without any drug intervention.
4. SNI + oxy vs. Sham + oxy: To investigate how neuropathic pain alters the response to oxycodone treatment compared to normal brain function.
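
The four comparisons above are exactly the pairs of groups that differ in a single factor (surgery or treatment) while the other is held constant. As a minimal sketch, assuming a plain 2x2 design, they can be enumerated rather than written out by hand. The variable names `surgeries`, `treatments` and `contrasts` are illustrative and not part of any pipeline.

```python
from itertools import combinations, product

# The two factors of the 2x2 design described above.
surgeries = ["SNI", "Sham"]
treatments = ["oxy", "sal"]

# Every group is one (surgery, treatment) combination.
groups = list(product(surgeries, treatments))

# Keep only pairs of groups that differ in exactly one factor,
# so each contrast isolates the effect of that factor.
contrasts = [
    (a, b)
    for a, b in combinations(groups, 2)
    if sum(x != y for x, y in zip(a, b)) == 1
]

for a, b in contrasts:
    print(f"{a[0]} + {a[1]} vs. {b[0]} + {b[1]}")
```

The two remaining pairs (SNI + oxy vs. Sham + sal and SNI + sal vs. Sham + oxy) change both factors at once, so a difference between them could not be attributed to either factor alone.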

Your group gave you a very suboptimal Excel sheet (conditions_runs_oxy_project.xlsx) containing the information you need for each run they uploaded to the SRA.<br>
So, instead of diving directly into downloading the data and starting the analysis, you first need to tidy up the messy table.<br>
Use Python and pandas to get the table into a more sensible format.<br>
Then perform some overview analysis and plot the results:
1. How many samples do you have per condition?
2. How many samples do you have per genotype?
3. How often do you have each condition per genotype?

In [79]:
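import pandas as pd

# Opt in to the future pandas behaviour so replace/fillna do not silently downcast dtypes.
pd.set_option('future.no_silent_downcasting', True)

df = pd.read_excel("conditions_runs_oxy_project.xlsx")
# The "Patient" column is not needed for the analysis.
df = df.drop(columns=["Patient"])
# The sheet marks membership with "x" and leaves other cells empty: encode this as 1/0.
df = df.replace("x", 1)
df = df.fillna(0)
# Cast to object so that 0.0 can be replaced by the integer 0.
df["DNA-seq"] = df["DNA-seq"].astype(object)
df = df.replace(0.0, 0)
# Make the column naming consistent.
df = df.rename(columns={"condition: Sal": "Condition: Sal"})
print(df)
df.to_csv("conditions_runs_oxy_project.csv", index=False)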

            Run RNA-seq DNA-seq Condition: Sal Condition: Oxy Genotype: SNI  \
0   SRR23195505       1       0              1              0             1   
1   SRR23195506       1       0              0              1             0   
2   SRR23195507       1       0              1              0             0   
3   SRR23195508       1       0              0              1             1   
4   SRR23195509       1       0              0              1             1   
5   SRR23195510       1       0              1              0             1   
6   SRR23195511       1       0              0              1             0   
7   SRR23195512       1       0              1              0             0   
8   SRR23195513       1       0              1              0             1   
9   SRR23195514       1       0              0              1             0   
10  SRR23195515       1       0              1              0             0   
11  SRR23195516       1       0              0      
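
The three overview questions can be answered once the one-hot columns are collapsed into one label column per factor. The following is a minimal sketch that uses only the first five rows printed above, so the counts it produces are not those of the full dataset. The helper `collapse` and the column names `condition` and `genotype` are illustrative choices.

```python
import pandas as pd

# First five rows of the cleaned table as printed above (the full table has more runs).
df = pd.DataFrame({
    "Run": ["SRR23195505", "SRR23195506", "SRR23195507", "SRR23195508", "SRR23195509"],
    "Condition: Sal": [1, 0, 1, 0, 0],
    "Condition: Oxy": [0, 1, 0, 1, 1],
    "Genotype: SNI": [1, 0, 0, 1, 1],
    "Genotype: Sham": [0, 1, 1, 0, 0],
})


def collapse(frame, prefix):
    """Turn one-hot columns such as 'Condition: Sal' into a single label column."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    # idxmax(axis=1) returns, per row, the name of the column that holds the 1.
    labels = frame[cols].astype(int).idxmax(axis=1)
    return labels.str.replace(prefix, "", regex=False)


tidy = pd.DataFrame({
    "Run": df["Run"],
    "condition": collapse(df, "Condition: "),
    "genotype": collapse(df, "Genotype: "),
})

# 1. Samples per condition, 2. samples per genotype, 3. condition per genotype.
per_condition = tidy["condition"].value_counts()
per_genotype = tidy["genotype"].value_counts()
condition_by_genotype = pd.crosstab(tidy["genotype"], tidy["condition"])

print(per_condition)
print(per_genotype)
print(condition_by_genotype)
```

With matplotlib installed, `condition_by_genotype.plot(kind="bar")` would draw the third table as a grouped bar chart. Running the same code on the full `df` from the cell above gives the counts for all runs.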

They were also kind enough to provide the number of bases per run, so that you know how much space the data will take up on your cluster.<br>
Add this information (base_counts.csv) as a new column to your table and sort your dataframe by the number of bases and the condition.

Then select the 2 smallest runs from your dataset and download them from the SRA (maybe an nf-core pipeline can help here?...)

In [91]:
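import pandas as pd

# Number of bases per run, as provided by the group.
bases_per_run_csv = "base_counts.csv"
bases_df = pd.read_csv(bases_per_run_csv)
print(bases_df.head())

# Cleaned condition table written in the previous step.
conditions_runs_csv = "conditions_runs_oxy_project.csv"
conditions_runs_df = pd.read_csv(conditions_runs_csv)
print(conditions_runs_df.head())

# Add the base counts via the shared "Run" column, then sort by size and condition.
merged_df = pd.merge(conditions_runs_df, bases_df, on="Run")
sorted_df = merged_df.sort_values(by=["Bases", "Condition: Oxy"])
sorted_df.to_csv("sorted_conditions_runs_oxy_project.csv", index=False)
print(sorted_df.head())

# The table is sorted by ascending base count, so the first two rows are the smallest runs.
smallest_runs_df = sorted_df.head(2)
print(smallest_runs_df)
# Keep only the run accessions for the download step.
smallest_runs_df.to_csv("ids_own_implementation.csv", index=False, columns=["Run"])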

           Run       Bases
0  SRR23195505  6922564500
1  SRR23195506  7859530800
2  SRR23195507  8063298900
3  SRR23195508  6927786900
4  SRR23195509  7003550100
           Run  RNA-seq  DNA-seq  Condition: Sal  Condition: Oxy  \
0  SRR23195505        1        0               1               0   
1  SRR23195506        1        0               0               1   
2  SRR23195507        1        0               1               0   
3  SRR23195508        1        0               0               1   
4  SRR23195509        1        0               0               1   

   Genotype: SNI  Genotype: Sham  
0              1               0  
1              0               1  
2              0               1  
3              1               0  
4              1               0  
            Run  RNA-seq  DNA-seq  Condition: Sal  Condition: Oxy  \
11  SRR23195516        1        0               0               1   
6   SRR23195511        1        0               0               1   
12  SRR23195
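
As an alternative to sorting the whole table, `nsmallest` selects the smallest runs directly. The sketch below uses only the five base counts printed above, so it picks the two smallest of those five, not of the full dataset (the sorted output above lists other runs first). It also assumes that the download pipeline expects one accession per line without a header, which is worth checking against the pipeline's usage documentation. The `io.StringIO` buffer stands in for a real file.

```python
import io

import pandas as pd

# First five rows of base_counts.csv as printed above; the real file covers all runs.
bases_df = pd.DataFrame({
    "Run": ["SRR23195505", "SRR23195506", "SRR23195507", "SRR23195508", "SRR23195509"],
    "Bases": [6922564500, 7859530800, 8063298900, 6927786900, 7003550100],
})

# nsmallest returns the n rows with the lowest value in the given column.
smallest = bases_df.nsmallest(2, "Bases")

# Write one accession per line, without a header or an index column.
buffer = io.StringIO()  # stand-in for a file on disk
smallest["Run"].to_csv(buffer, index=False, header=False)

ids = buffer.getvalue().split()
print(ids)
```

Replacing `buffer` with a file name writes the same content to disk.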

In [86]:
!nextflow run nf-core/fetchngs --input ids.csv -profile docker --max_memory "4GB" --outdir fetchings -resume

Nextflow 25.04.7 is available - Please consider updating your version to it

 N E X T F L O W   ~  version 24.10.5

Launching `https://github.com/nf-core/fetchngs` [agitated_golick] DSL2 - revision: 8ec2d934f9 [master]

WARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`


------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/fetchngs v1.12.0-g8ec2d93
------------------------------------------------------
Core Nextflow 

While your files are downloading, go back to the paper and explain how you would try to reproduce the analysis.<br>
When you are done, shout so we can discuss the different ideas.