# What to expect

In notebook 3A we ran a differential gene expression analysis on the example Schistosoma mansoni dataset and used visualisation techniques to view the most significant genes. In this notebook we will apply the same methods to our chosen dataset. We will then explore the GO terms and pathways associated with the differentially expressed genes using online resources.



# Differential Expression analysis

In [None]:
import pandas as pd
from pydeseq2.dds import DeseqDataSet

# load in the counts and metadata again for your dataset (set the prefix)
prefix = ""
counts = pd.read_csv(f"analysis/{prefix}/star/ReadsPerGene.csv", index_col=0).T
metadata = pd.read_csv(f"data/{prefix}/metadata.csv", index_col=0)

# restrict to the 2 stages we want to compare

# create deseq2 dataset object

# run DESeq2
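
The "restrict to the 2 stages" step can be sketched with plain pandas. This is only an illustration on toy tables: the `stage` column, the stage names and the `toy_` variables are hypothetical, so substitute the column and values found in your own metadata.

```python
import pandas as pd

# Toy stand-ins: rows are samples, columns are genes (as after the .T above)
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
toy_counts = pd.DataFrame(
    {"geneA": [10, 12, 300, 280, 5, 7], "geneB": [50, 55, 60, 58, 52, 49]},
    index=samples,
)
# "stage" is an assumed column name; check the columns of your metadata.csv
toy_metadata = pd.DataFrame(
    {"stage": ["early", "early", "late", "late", "mid", "mid"]},
    index=samples,
)

# Keep only the samples belonging to the two stages being compared
stages = ["early", "late"]
toy_metadata_sub = toy_metadata[toy_metadata["stage"].isin(stages)]

# Subset the counts by the same sample IDs so both tables stay aligned
toy_counts_sub = toy_counts.loc[toy_metadata_sub.index]
```

Indexing the counts by the filtered metadata index keeps the two tables in the same sample order before they are passed on to the next step.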

In [None]:
# Summarize results


In [None]:
! mkdir -p "analysis/Trypanosoma/de"
! mkdir -p "analysis/Plasmodium/de"

In [None]:
# save this intermediate CSV with a sensible name
comparison = ""
res.to_csv(f"analysis/{prefix}/de/{comparison}.full.csv")

In [None]:
# Filter out results with baseMean<10 so that genes with expression close to zero don't skew results

# Filter by padj<=0.05


In [None]:
# Get list of only genes that have a fold change FC > 2 or FC < 0.5 (i.e. log2FoldChange > 1 or < -1)
# You can play with the exact thresholds here, these are just a guide to filter the lists


In [None]:
# Save the up and down regulated genes separately
up = sigs[sigs["log2FoldChange"] > 0]
up.to_csv(f"analysis/{prefix}/de/{comparison}.up.csv")
up

In [None]:
down = sigs[sigs["log2FoldChange"] < 0]
down.to_csv(f"analysis/{prefix}/de/{comparison}.down.csv")
down

# Visualisation - Volcano Plot
Following the steps from notebook 3A, create a volcano plot for this dataset, colouring the significantly up and down regulated genes. If you chose a different fold change cutoff above, use the same cutoff here so that the plot highlights the genes you actually selected. If desired, you could instead make the plot interactive with plotly (see the extension in 3A).

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# define which parameters determine if a gene is significantly up or down

# plot all the genes and then highlight downregulated and upregulated

# Add axis labels

# Add threshold lines

plt.savefig('dataset_volcano.png')
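
As a reminder of the overall shape of a volcano plot, here is a minimal sketch using matplotlib only, on invented toy values. The cutoffs, colours and `toy` names are assumptions chosen for illustration; notebook 3A remains the reference for the approach used in this course.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy results with invented values
toy = pd.DataFrame(
    {
        "log2FoldChange": [2.5, -1.8, 0.3, -3.0, 1.5, -0.2],
        "padj": [0.001, 0.01, 0.02, 0.5, 0.04, 0.9],
    }
)
toy["neg_log10_padj"] = -np.log10(toy["padj"])

# Thresholds that decide whether a gene counts as up or down
padj_cutoff = 0.05
lfc_cutoff = 1.0
is_up = (toy["padj"] <= padj_cutoff) & (toy["log2FoldChange"] > lfc_cutoff)
is_down = (toy["padj"] <= padj_cutoff) & (toy["log2FoldChange"] < -lfc_cutoff)

fig, ax = plt.subplots()
# All genes in grey first, then the significant ones drawn on top
ax.scatter(toy["log2FoldChange"], toy["neg_log10_padj"], color="lightgrey", label="not significant")
ax.scatter(toy.loc[is_up, "log2FoldChange"], toy.loc[is_up, "neg_log10_padj"], color="red", label="up")
ax.scatter(toy.loc[is_down, "log2FoldChange"], toy.loc[is_down, "neg_log10_padj"], color="blue", label="down")

# Axis labels and dashed threshold lines
ax.set_xlabel("log2 fold change")
ax.set_ylabel("-log10 adjusted p-value")
ax.axhline(-np.log10(padj_cutoff), linestyle="--", color="black")
ax.axvline(lfc_cutoff, linestyle="--", color="black")
ax.axvline(-lfc_cutoff, linestyle="--", color="black")
ax.legend()
```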

# GO Analysis and Metabolic pathways analysis

We will now use these results for GO and metabolic pathway analysis using the TriTrypDB and PlasmoDB websites. Full details of how to do this are in the presentation, but a summary is provided below.

You are welcome to perform either GO or metabolic pathway analysis on your chosen dataset, or both if you have time.

For these analyses you will need a list of gene IDs, which can either be pasted as text into the website field or uploaded as a text file. You should create separate lists for the up regulated and down regulated genes.
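
One way to produce such a list is to write the index of a results table to a plain text file, one gene ID per line. This is a sketch on a toy table: the gene IDs, the file name and the temporary output directory are placeholders, and it assumes the gene IDs are stored in the DataFrame index, as in the tables saved earlier.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the "up" table, with placeholder gene IDs as the index
toy_up = pd.DataFrame(
    {"log2FoldChange": [2.5, 3.1]},
    index=["gene_0001", "gene_0002"],
)

# Written to a temporary directory here; use your own analysis folder instead
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "toy_up_gene_ids.txt"

# One gene ID per line, ready to paste or upload
out_path.write_text("\n".join(toy_up.index) + "\n")
```

The same pattern applied to the down regulated table gives the second list.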

## GO Analysis
Paste your list of gene IDs into the website field, or upload it as a text file, using the link below for the relevant database. The DBs return a .csv file that includes the gene names (other annotations, such as cellular locations or GO terms, can be included if desired):

https://tritrypdb.org/tritrypdb/app/search/transcript/GeneByLocusTag
https://plasmodb.org/plasmo/app/search/transcript/GeneByLocusTag

After adding the list of gene IDs, click "Get Answer". When the results table appears, there will be a tab called "Analyze Results". From this you will be able to select either "Gene Ontology Enrichment" or "Metabolic Pathway Enrichment". Select the GO analysis.

The results can be downloaded as a .csv file and filtered by p-value or by Bonferroni adjusted p-value.
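
Filtering the downloaded table can also be done in pandas. The sketch below reads a small inline CSV in place of a real download; the column names (`ID`, `Name`, `P-value`, `Bonferroni`) and all numeric values are assumptions for illustration, so check the header of the file you actually download.

```python
import io

import pandas as pd

# Inline toy CSV standing in for a downloaded enrichment table.
# Column names and values are assumed, not taken from a real download.
toy_csv = io.StringIO(
    "ID,Name,P-value,Bonferroni\n"
    "GO:0006412,translation,0.00001,0.002\n"
    "GO:0006096,glycolytic process,0.0004,0.03\n"
    "GO:0007018,microtubule-based movement,0.01,0.9\n"
)
go = pd.read_csv(toy_csv)

# Keep terms passing the Bonferroni threshold, most significant first
go_sig = go[go["Bonferroni"] <= 0.05].sort_values("Bonferroni")
```

For a real file, replace `toy_csv` with the path to the downloaded .csv.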

## Metabolic Pathways
As above, you will need a text list of gene IDs, this time for ‘up’ and ‘down’ separately. Each list can be submitted to the respective DB tool for metabolic pathway analysis:

https://tritrypdb.org/tritrypdb/app/search/pathway/PathwaysByGeneList
https://plasmodb.org/plasmo/app/search/pathway/PathwaysByGeneList

After adding the list of gene IDs, click "Get Answer". The search function offers a few different ‘Pathway Sources’ (KEGG, LeishCyc, MetaCyc and TrypanoCyc for T. brucei). Start with "Any" to include all of them; the hit list can subsequently be filtered by source if desired.

Pathways with hits can be downloaded as a .csv file, and you can choose which columns to include, such as ‘Total Pathway Enzymes’, ‘Unique Gene Counts’ (i.e. hits in that pathway), EC numbers etc.