# What to expect
In notebook 3A we ran a differential gene expression analysis on the example Schistosoma mansoni dataset and used visualisation techniques to view the most significant genes. In this notebook we will apply the same methods to a dataset of our choice. We will then explore the GO terms and pathways associated with the significant genes using online resources.

In [1]:
# import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Install PyDESeq2 and import required classes
! pip install --quiet pydeseq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

In [2]:
# load in the counts and metadata again
prefix = "Plasmodium"
counts = pd.read_csv(f"analysis/{prefix}/star/ReadsPerGene.csv", index_col=0).T
metadata = pd.read_csv(f"data/{prefix}/metadata.csv", index_col=0)

# restrict to the 2 stages we want to compare
counts_s = counts[metadata["timepoint"].isin([16,24]) & metadata["condition"].isin(["wildtype"])]
metadata_s = metadata[metadata["timepoint"].isin([16,24]) & metadata["condition"].isin(["wildtype"])]

# create deseq2 dataset object
dds = DeseqDataSet(
    counts=counts_s,
    metadata=metadata_s,
    design_factors="timepoint",  # compare samples based on the developmental "stage"
    refit_cooks=True
)

# Run DESeq2
dds.deseq2()

Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 1.53 seconds.

Fitting dispersion trend curve...
... done in 0.16 seconds.

Fitting MAP dispersions...
... done in 1.18 seconds.

Fitting LFCs...
... done in 0.61 seconds.

Calculating cook's distance...
... done in 0.01 seconds.

Replacing 0 outlier genes.



In [3]:
# Summarize results
stat_res=DeseqStats(dds)
stat_res.summary()
res = stat_res.results_df

Running Wald tests...


Log2 fold change & Wald test p-value: timepoint 24 vs 16
                    baseMean  log2FoldChange     lfcSE      stat    pvalue  \
gene                                                                         
PBANKA_0000101      0.499719        0.458866  4.079205  0.112489  0.910436   
PBANKA_0000201      0.000000             NaN       NaN       NaN       NaN   
PBANKA_0000301     10.017695       -0.514198  1.046018 -0.491577  0.623019   
PBANKA_0000401     21.489374        2.099202  0.840283  2.498208  0.012482   
PBANKA_0000600     24.726040       -1.279860  0.682604 -1.874966  0.060797   
...                      ...             ...       ...       ...       ...   
PBANKA_MIT03300     0.000000             NaN       NaN       NaN       NaN   
PBANKA_MIT03400     0.000000             NaN       NaN       NaN       NaN   
PBANKA_MIT03500  4967.510592        1.068291  0.586247  1.822255  0.068416   
PBANKA_MIT03600    80.486185        0.527777  0.587982  0.897607  0.369395   
PBANKA_

... done in 0.26 seconds.



In [4]:
! mkdir -p "analysis/{prefix}/de"

In [5]:
res.to_csv(f"analysis/{prefix}/de/16_vs_24h_wildtype.full.csv")

In [6]:
# Remove genes with baseMean < 10 so that genes with expression close to zero don't skew results
res=res[res.baseMean>=10]

# Filter by padj<=0.05
res=res[res.padj<=0.05]

In [7]:
# Keep only genes with a fold change FC > 2 or FC < 0.5, i.e. |log2FoldChange| > 1
# You can play with the exact thresholds here; these are just a guide to filter the lists
sigs=res[abs(res.log2FoldChange)>1]
sigs

gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
PBANKA_0000401,21.489374,2.099202,0.840283,2.498208,1.248229e-02,1.649136e-02
PBANKA_0007701,61.267753,2.148695,0.514617,4.175328,2.975570e-05,4.825806e-05
PBANKA_0008101,195.532688,1.897001,0.295428,6.421195,1.352086e-10,2.930848e-10
PBANKA_0100021,1402.834566,3.470364,0.160586,21.610658,1.426008e-103,3.176726e-102
PBANKA_0100041,38.141408,1.635641,0.726573,2.251174,2.437451e-02,3.125623e-02
...,...,...,...,...,...,...
PBANKA_1466121,226.572142,-1.131772,0.267453,-4.231671,2.319611e-05,3.793047e-05
PBANKA_API00095,71.627149,2.234160,0.736168,3.034850,2.406552e-03,3.381205e-03
PBANKA_MIT00800,106.386590,1.004904,0.469661,2.139635,3.238425e-02,4.095585e-02
PBANKA_MIT01000,206.692383,1.021474,0.411498,2.482329,1.305265e-02,1.722281e-02
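
The comment in the cell above equates "FC > 2 or FC < 0.5" with the filter `abs(log2FoldChange) > 1`. The following is a minimal sketch of that equivalence; the `fold_changes` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical fold changes (not from the dataset), chosen to sit on both sides of the thresholds
fold_changes = pd.Series([4.0, 2.5, 1.5, 1.0, 0.6, 0.4, 0.25], name="fold_change")

# On the log2 scale, up- and down-regulation are symmetric around 0
log2_fc = np.log2(fold_changes)

# Same form of filter as used on the results table: |log2FC| > 1
by_log2 = log2_fc.abs() > 1

# Equivalent filter expressed on the untransformed fold change
by_fc = (fold_changes > 2) | (fold_changes < 0.5)

# Both masks select the same entries
print(pd.DataFrame({"fold_change": fold_changes, "log2_fc": log2_fc, "selected": by_log2}))
```

Raising the cut-off to `abs(log2FoldChange) > 2` would correspond to FC > 4 or FC < 0.25.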


In [8]:
# Save the up and down regulated genes separately
up = sigs[sigs["log2FoldChange"] > 0]
up.to_csv(f"analysis/{prefix}/de/16_vs_24h_wildtype.up.csv")
up

gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
PBANKA_0000401,21.489374,2.099202,0.840283,2.498208,1.248229e-02,1.649136e-02
PBANKA_0007701,61.267753,2.148695,0.514617,4.175328,2.975570e-05,4.825806e-05
PBANKA_0008101,195.532688,1.897001,0.295428,6.421195,1.352086e-10,2.930848e-10
PBANKA_0100021,1402.834566,3.470364,0.160586,21.610658,1.426008e-103,3.176726e-102
PBANKA_0100041,38.141408,1.635641,0.726573,2.251174,2.437451e-02,3.125623e-02
...,...,...,...,...,...,...
PBANKA_1465821,17.110991,3.627303,1.050189,3.453951,5.524370e-04,8.232959e-04
PBANKA_API00095,71.627149,2.234160,0.736168,3.034850,2.406552e-03,3.381205e-03
PBANKA_MIT00800,106.386590,1.004904,0.469661,2.139635,3.238425e-02,4.095585e-02
PBANKA_MIT01000,206.692383,1.021474,0.411498,2.482329,1.305265e-02,1.722281e-02


In [9]:
down = sigs[sigs["log2FoldChange"] < 0]
down.to_csv(f"analysis/{prefix}/de/16_vs_24h_wildtype.down.csv")
down

gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
PBANKA_0100100,201.871191,-2.298185,0.290933,-7.899364,2.803300e-15,7.596516e-15
PBANKA_0100200,1626.922711,-1.330331,0.121823,-10.920171,9.232170e-28,3.952475e-27
PBANKA_0101000,4546.956831,-1.853167,0.126414,-14.659484,1.171605e-48,9.121150e-48
PBANKA_0101300,2578.706476,-1.617343,0.144068,-11.226222,3.031645e-29,1.351893e-28
PBANKA_0102500,1632.255934,-1.323266,0.146632,-9.024398,1.806862e-19,5.847869e-19
...,...,...,...,...,...,...
PBANKA_1465200,340.912963,-1.263252,0.279559,-4.518734,6.221062e-06,1.057951e-05
PBANKA_1465300,462.033479,-1.760952,0.220407,-7.989533,1.354503e-15,3.725427e-15
PBANKA_1465500,138.688469,-1.079225,0.332681,-3.244019,1.178560e-03,1.706491e-03
PBANKA_1466100,34.856922,-1.284692,0.603423,-2.129009,3.325355e-02,4.203458e-02


# GO Analysis and Metabolic pathways analysis

We will now use these results for GO and metabolic pathway analysis using the TriTrypDB and PlasmoDB websites. 

You are welcome to perform either GO or metabolic pathway analysis on your chosen dataset, or both if you have time.

## GO Analysis
For this you will need a list of gene IDs, either as text pasted into the website's search field or as an uploaded text file. The DBs return a .csv file that includes the gene names; other annotations, such as cellular locations or GO terms, can be included if desired:

https://tritrypdb.org/tritrypdb/app/search/transcript/GeneByLocusTag
https://plasmodb.org/plasmo/app/search/transcript/GeneByLocusTag

Note that these functions have a glitch: some genes get duplicated in the returned list (this seems to be random). In Excel it is straightforward to identify such duplications and remove them.
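
If you would rather stay in the notebook than switch to Excel, the duplicates can also be removed with pandas. This is only a sketch: the in-memory text stands in for the downloaded file, and the `Gene ID` column name and the product descriptions are assumptions, so check them against the header of the .csv you actually download.

```python
import io
import pandas as pd

# Stand-in for the downloaded .csv; column names and descriptions are assumed, not real annotations
downloaded = io.StringIO(
    "Gene ID,Product Description\n"
    "PBANKA_0000401,example product A\n"
    "PBANKA_0007701,example product B\n"
    "PBANKA_0000401,example product A\n"
)
annotations = pd.read_csv(downloaded)

# Keep the first occurrence of each gene ID and drop any repeats
deduplicated = annotations.drop_duplicates(subset="Gene ID", keep="first").reset_index(drop=True)

n_removed = len(annotations) - len(deduplicated)
print(f"Removed {n_removed} duplicated row(s)")
```

With a real download, replace `downloaded` with the path to the file.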

## Metabolic Pathways
Submit the filtered .csv files for ‘up’ and ‘down’ separately to the respective DB tools for metabolic pathway analysis:

https://tritrypdb.org/tritrypdb/app/search/pathway/PathwaysByGeneList
https://plasmodb.org/plasmo/app/search/pathway/PathwaysByGeneList

Note that the same glitch described above affects these searches, i.e. duplicated entries need to be removed afterwards.
The search function offers a few different ‘Pathway Sources’ (KEGG, LeishCyc, MetaCyc and TrypanoCyc for T. brucei). I’ve included all of them; the hit list can be filtered by source afterwards, if desired.
Pathways with hits can be downloaded as a .csv file, and the results can be customized in terms of the columns that should be included, such as ‘Total Pathway Enzymes’, ‘Unique Gene Counts’ (i.e. hits in that pathway), EC numbers etc.
I’ve included a separate column with ‘% Hits’ (Unique Gene counts * 100 / Total Pathway Enzymes).
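
The ‘% Hits’ column can be computed in pandas along these lines. The table below is made up for illustration (the pathway names and counts are hypothetical); only the column labels follow those quoted above, and the labels in your own download may differ.

```python
import pandas as pd

# Hypothetical pathway table; only the column labels are taken from the text above
pathways = pd.DataFrame(
    {
        "Pathway": ["hypothetical pathway 1", "hypothetical pathway 2"],
        "Unique Gene Counts": [5, 3],
        "Total Pathway Enzymes": [20, 10],
    }
)

# % Hits = Unique Gene Counts * 100 / Total Pathway Enzymes
pathways["% Hits"] = pathways["Unique Gene Counts"] * 100 / pathways["Total Pathway Enzymes"]

# Put the pathways with the highest proportion of hits first
pathways = pathways.sort_values("% Hits", ascending=False).reset_index(drop=True)
print(pathways)
```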

In [10]:
with open(f"analysis/{prefix}/de/16_vs_24h_wildtype.genes.txt", "w") as f:
    for gene in sigs.index:
        f.write(f"{gene}\n")