# What to expect

In notebook 3A we ran a differential gene expression analysis on the example Schistosoma mansoni dataset and used visualisation techniques to view the most significant genes. In this notebook we will apply the same methods to the dataset of our choice. We will then explore the GO terms and pathways associated with the differentially expressed genes using online resources.



# Differential Expression analysis

In [None]:
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# load in the counts and metadata again for your dataset (set the prefix)
prefix = ""
counts = pd.read_csv(f"analysis/{prefix}/star/ReadsPerGene.csv", index_col=0).T
metadata = pd.read_csv(f"data/{prefix}/metadata.csv", index_col=0)

# restrict to the 2 stages we want to compare

# create deseq2 dataset object

# Run DeSeq2
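
As a rough illustration of the "restrict to the 2 stages" step, the sketch below subsets a made-up counts table and its metadata to two stages. The column name `stage`, the stage labels and all values are assumptions for illustration only; substitute whatever your own `metadata.csv` uses.

```python
import pandas as pd

# Toy data: samples as rows, genes as columns (same orientation as `counts` above).
toy_counts = pd.DataFrame(
    {"geneA": [10, 12, 300, 280, 5, 7], "geneB": [0, 1, 2, 0, 50, 60]},
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)
# Hypothetical metadata column "stage"; your file may use a different name.
toy_metadata = pd.DataFrame(
    {"stage": ["A", "A", "B", "B", "C", "C"]},
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)

stages_to_compare = ["A", "B"]

# Keep only the samples belonging to the two chosen stages.
sub_metadata = toy_metadata[toy_metadata["stage"].isin(stages_to_compare)]
# Subset the counts to the same samples, in the same order as the metadata.
sub_counts = toy_counts.loc[sub_metadata.index]
```

Selecting the counts by the filtered metadata index keeps the two tables aligned sample for sample.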

In [None]:
import numpy as np

# Summarize results

res=

# Replace p-values of 0 with a very small number, as otherwise they cause errors
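
One possible way to handle zero p-values is sketched below on a made-up results table. A p-value of exactly 0 gives an infinite value once you take -log10 of it, which is what a volcano plot does. The replacement value used here (the smallest positive normal float) is an arbitrary assumption, and `toy_res` is invented for the example.

```python
import numpy as np
import pandas as pd

# Toy results table with the same column names as the DESeq2 output.
toy_res = pd.DataFrame(
    {"pvalue": [0.0, 1e-8, 0.2], "padj": [0.0, 3e-8, 0.3]},
    index=["geneA", "geneB", "geneC"],
)

# Smallest positive normal float: an arbitrary but convenient stand-in for zero.
tiny = np.finfo(float).tiny

for col in ["pvalue", "padj"]:
    toy_res[col] = toy_res[col].replace(0, tiny)

# -log10 is now finite for every gene.
neg_log10_padj = -np.log10(toy_res["padj"])
```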


In [None]:
! mkdir -p "analysis/Trypanosoma/de"
! mkdir -p "analysis/Plasmodium/de"

In [None]:
# save this intermediate CSV with a sensible name
comparison = ""
res.to_csv(f"analysis/{prefix}/de/{comparison}.full.csv")

In [None]:
# Filter out results with baseMean < 10 so that genes with expression close to zero don't skew results

# Filter by padj<=0.05


In [None]:
# Keep only genes with an absolute fold change > 1.5, i.e. abs(log2FoldChange) > np.log2(1.5)
# You can play with the exact thresholds here, these are just a guide to filter the lists
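
The three filters described in the cells above (expression level, adjusted p-value, fold change) can be sketched on a small invented table as follows. The values in `toy_res` are made up, and the name `toy_sigs` is used only to avoid clashing with your own `sigs`; the thresholds mirror the ones suggested in the comments.

```python
import numpy as np
import pandas as pd

# Toy results table; the numbers are invented for illustration.
toy_res = pd.DataFrame(
    {
        "baseMean": [500.0, 3.0, 120.0, 80.0, 200.0],
        "log2FoldChange": [2.0, 4.0, -1.5, 0.2, 1.0],
        "padj": [0.001, 0.0001, 0.01, 0.02, 0.5],
    },
    index=["g1", "g2", "g3", "g4", "g5"],
)

fc_cutoff = 1.5  # fold change threshold; feel free to adjust

# 1. Drop genes with very low mean expression.
expressed = toy_res[toy_res["baseMean"] >= 10]
# 2. Keep genes that pass the adjusted p-value threshold.
significant = expressed[expressed["padj"] <= 0.05]
# 3. Keep genes whose fold change is large enough in either direction.
toy_sigs = significant[significant["log2FoldChange"].abs() > np.log2(fc_cutoff)]
```

Here only `g1` and `g3` survive: `g2` is too lowly expressed, `g4` changes too little, and `g5` is not significant.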


In [None]:
# Save the up and down regulated genes separately
up = sigs[sigs["log2FoldChange"] > 0]
up.to_csv(f"analysis/{prefix}/de/{comparison}.up.csv")
up

In [None]:
down = sigs[sigs["log2FoldChange"] < 0]
down.to_csv(f"analysis/{prefix}/de/{comparison}.down.csv")
down

# Visualisation - Volcano Plot
Following the steps from notebook 3A, create a volcano plot for this dataset, colouring the significant up and down regulated genes. If you chose a different fold change cut off above, use the updated fold change here so you can see which genes are selected for. If desired, you could instead make it interactive with plotly (see extension in 3A).
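
If it helps to see the overall shape of such a plot, here is a minimal sketch using plain matplotlib rather than seaborn, on invented data. The function names `classify` and `volcano`, the colours and the toy values are all assumptions for illustration; the approach in notebook 3A may differ in detail.

```python
import numpy as np
import pandas as pd


def classify(res, padj_cutoff=0.05, fc_cutoff=1.5):
    """Label each gene as 'up', 'down' or 'not significant'."""
    lfc_cutoff = np.log2(fc_cutoff)
    sig = res["padj"] <= padj_cutoff
    labels = pd.Series("not significant", index=res.index)
    labels[sig & (res["log2FoldChange"] > lfc_cutoff)] = "up"
    labels[sig & (res["log2FoldChange"] < -lfc_cutoff)] = "down"
    return labels


def volcano(res, labels, padj_cutoff=0.05, fc_cutoff=1.5, out_png=None):
    """Draw log2 fold change against -log10(padj), coloured by label."""
    # Imported here so classify() can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    colours = {"not significant": "lightgrey", "down": "tab:blue", "up": "tab:red"}
    y = -np.log10(res["padj"])
    fig, ax = plt.subplots()
    for name, colour in colours.items():
        mask = labels == name
        ax.scatter(res.loc[mask, "log2FoldChange"], y[mask], color=colour, s=10, label=name)
    # Threshold lines for significance and fold change.
    ax.axhline(-np.log10(padj_cutoff), linestyle="--", color="black", linewidth=0.8)
    for x in (-np.log2(fc_cutoff), np.log2(fc_cutoff)):
        ax.axvline(x, linestyle="--", color="black", linewidth=0.8)
    ax.set_xlabel("log2 fold change")
    ax.set_ylabel("-log10(adjusted p-value)")
    ax.legend()
    if out_png is not None:
        fig.savefig(out_png)
    return ax


# Toy data, invented for the example.
toy = pd.DataFrame(
    {"log2FoldChange": [2.0, -1.8, 0.1, 3.0], "padj": [1e-6, 1e-4, 0.8, 0.2]},
    index=["g1", "g2", "g3", "g4"],
)
toy_labels = classify(toy)
```

Calling `volcano(toy, toy_labels, out_png="toy_volcano.png")` would then draw and save the figure. Passing the same `fc_cutoff` to both functions keeps the colouring and the threshold lines consistent.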

In [None]:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# define the significantly up or down regulated genes

# plot all the genes and then highlight downregulated and upregulated

# Add axis labels

# Add threshold lines

plt.savefig('dataset_volcano.png')

# GO Analysis and Metabolic pathways analysis

We will now use these results for GO and metabolic pathway analysis using the [TriTrypDB](https://tritrypdb.org/tritrypdb/app/user/registration) and [PlasmoDB](https://plasmodb.org/plasmo/app/user/registration) websites. Before using these websites you will need to register. Full details of how to do this analysis are in the presentation, but a summary is provided below. 

You are welcome to perform either GO or metabolic pathway analysis on your chosen dataset, or both if you have time.

For these analyses you will need the 2 CSV files of up and down regulated genes which you created above and saved as `analysis/{prefix}/de/{comparison}.up.csv` and `analysis/{prefix}/de/{comparison}.down.csv`. 
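
The search pages take a list of gene IDs, so it can be convenient to pull just the IDs out of each CSV into a plain text file first. The sketch below is one way to do that, assuming the gene IDs are in the first column of the CSV (as they would be if the results table was saved with its index). The helper name `write_gene_ids` and the demo IDs are hypothetical.

```python
import os
import tempfile

import pandas as pd


def write_gene_ids(csv_path, txt_path):
    """Write the first column of a results CSV (assumed to hold gene IDs), one per line."""
    table = pd.read_csv(csv_path, index_col=0)
    ids = [str(gene_id) for gene_id in table.index]
    with open(txt_path, "w") as handle:
        handle.write("\n".join(ids) + "\n")
    return ids


# Small self-contained demo with made-up gene IDs.
demo_dir = tempfile.mkdtemp()
demo_csv = os.path.join(demo_dir, "demo.up.csv")
demo_txt = os.path.join(demo_dir, "demo.up.txt")
pd.DataFrame(
    {"log2FoldChange": [2.1, 1.3]}, index=["gene_0001", "gene_0002"]
).to_csv(demo_csv)
demo_ids = write_gene_ids(demo_csv, demo_txt)
```

For your own data you would call it with paths such as `f"analysis/{prefix}/de/{comparison}.up.csv"` and a matching `.txt` output name.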

## Initial Analysis
First you will "Identify Genes based on List of IDs" by uploading one of the CSV files as a text file to:

https://tritrypdb.org/tritrypdb/app/search/transcript/GeneByLocusTag  
https://plasmodb.org/plasmo/app/search/transcript/GeneByLocusTag

After adding the list of gene IDs, click "Get Answer". 

When the results table appears, you will see a list with the organism, gene names (Product description) and genomic locations for the IDs you entered. Make sure the box “Show only one transcript per gene” is ticked, so that genes are not duplicated. These genes can be explored and the results can be downloaded as a .csv file.

There will now be a tab called "Analyze Results". From this you will be able to select either "Gene Ontology Enrichment" or "Metabolic Pathway Enrichment". Select the GO analysis.

## Gene Ontology Enrichment
Selecting "Gene Ontology Enrichment" will open an options page where you can change the parameters for the statistical test it will perform (the defaults are fine). When you are ready, click "Submit". The software will then perform a statistical test for each GO term to determine whether that term is enriched in your list of genes.

A table of results will appear reporting each GO term that has been found to be significantly enriched in your list. You can download this table, or explore it on the page.
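
To give a feel for what an enrichment test does, the sketch below computes a one-sided hypergeometric p-value (equivalent to a one-sided Fisher's exact test) for a single GO term. This illustrates the general idea only: the exact test and multiple-testing corrections applied by TriTrypDB and PlasmoDB may differ, the function name is hypothetical, and the numbers are made up.

```python
from math import comb


def enrichment_pvalue(n_background, n_background_with_term, n_list, n_list_with_term):
    """Probability of seeing at least n_list_with_term annotated genes
    when n_list genes are drawn at random from the background."""
    total = comb(n_background, n_list)
    upper = min(n_list, n_background_with_term)
    tail = sum(
        comb(n_background_with_term, k)
        * comb(n_background - n_background_with_term, n_list - k)
        for k in range(n_list_with_term, upper + 1)
    )
    return tail / total


# Made-up example: 20 background genes, 5 of them carry the GO term;
# our list has 5 genes, 3 of which carry the term.
p = enrichment_pvalue(20, 5, 5, 3)
```

A small p-value means that so many annotated genes in the list would be unlikely by chance alone. Because one test is run per GO term, real tools also correct for multiple testing.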

## Metabolic Pathways
Selecting "Metabolic Pathway Enrichment" from the "Analyze Results" tab will open an options page where you can change the parameters for the statistical test it will perform (the defaults are fine). When you are ready, click "Submit".

A table of results will appear reporting a list of pathways enriched in your dataset, along with statistics about the number of genes in your list that are associated with each pathway. You can click on a pathway ID to visualise it. You can download this results table, or explore it on the page.