# What to expect

In notebook 3A we ran a differential gene expression analysis on the example Schistosoma mansoni dataset and used visualisation techniques to view the most significant genes. In this notebook we will apply the same methods to a dataset of our choice. We will then explore the GO terms and pathways associated with the differentially expressed genes using online resources.



In [1]:
# import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#Install PyDESeq2 and import required classes
! pip install --quiet pydeseq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

In [2]:
# load in the counts and metadata again
prefix = "Trypanosoma"
counts = pd.read_csv(f"analysis/{prefix}/star/ReadsPerGene.csv", index_col=0).T
metadata = pd.read_csv(f"data/{prefix}/metadata.csv", index_col=0)
counts = counts.fillna(0)  # fillna returns a new DataFrame, so assign the result

# restrict to the samples from the organism we want to analyse
counts_s = counts[metadata["organism"].isin(["Trypanosoma brucei brucei"])]
counts_s = counts_s.loc[:, (counts_s != 0).any(axis=0)]
metadata_s = metadata[metadata["organism"].isin(["Trypanosoma brucei brucei"])]

# create deseq2 dataset object
dds = DeseqDataSet(
    counts=counts_s,
    metadata=metadata_s,
    design_factors="condition",  # compare samples based on the "condition" column of the metadata
    refit_cooks=True
)

# Run DESeq2
dds.deseq2()

Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 1.79 seconds.

Fitting dispersion trend curve...
... done in 0.23 seconds.

Fitting MAP dispersions...
... done in 2.30 seconds.

Fitting LFCs...
... done in 0.93 seconds.

Calculating cook's distance...
... done in 0.01 seconds.

Replacing 0 outlier genes.
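
The cell above filters `counts` with a boolean mask built from `metadata`, which relies on both tables being indexed by the same sample IDs. A minimal sketch of checking that before building the dataset object is shown here; the tiny tables and the names `counts_demo` and `metadata_demo` are made up for illustration, and only the condition labels are taken from the output of this notebook:

```python
import pandas as pd

# Tiny made-up tables standing in for the real counts (samples x genes) and metadata
counts_demo = pd.DataFrame(
    {"gene_1": [10, 0, 7], "gene_2": [0, 0, 0]},
    index=["sample_a", "sample_b", "sample_c"],
)
metadata_demo = pd.DataFrame(
    {"condition": ["ascending", "peak", "peak"]},
    index=["sample_c", "sample_a", "sample_b"],  # same samples, different order
)

# Check that both tables describe the same set of samples
assert set(counts_demo.index) == set(metadata_demo.index), "sample IDs differ"

# Reorder the metadata rows to match the counts rows
metadata_demo = metadata_demo.loc[counts_demo.index]

# Drop genes with zero counts in every sample, as in the cell above
counts_demo = counts_demo.loc[:, (counts_demo != 0).any(axis=0)]
print(counts_demo.shape, list(metadata_demo["condition"]))
```

With the real data, the same two checks would be applied to `counts_s` and `metadata_s` before they are passed to `DeseqDataSet`.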



In [3]:
! mkdir -p "analysis/Trypanosoma/de"

In [4]:
# Summarize results
stat_res=DeseqStats(dds)
stat_res.summary()
res = stat_res.results_df

Running Wald tests...


Log2 fold change & Wald test p-value: condition peak vs ascending
                                  baseMean  log2FoldChange     lfcSE  \
gene                                                                   
Tb04.24M18.150                  197.292416        0.217076  0.190053   
Tb04.3I12.100                   218.408392        0.124410  0.171674   
Tb05.30F7.410                    99.278007       -1.824686  0.655682   
Tb05.5K5.100                     16.771503        0.644534  0.565982   
Tb05.5K5.110                    329.781049       -0.045490  0.139694   
...                                    ...             ...       ...   
Tb927_10_v4.snoRNA.0063:snoRNA    3.790393        0.122767  1.088272   
Tb927_10_v4.snoRNA.0064:snoRNA    0.153151        0.709747  4.425350   
Tb927_10_v4.snoRNA.0073:snoRNA    0.183423       -1.213827  4.425356   
Tb927_10_v4.snoRNA.0078:snoRNA  133.952072        0.242593  0.196860   
tmp.1.100                        48.341400        0.781146  0.795903  

... done in 0.52 seconds.



In [5]:
res.to_csv(f"analysis/{prefix}/de/slender_vs_stumpy_tbrucei.full.csv")

In [6]:
# Remove genes with baseMean < 10 so that genes with expression close to zero don't skew the results
res = res[res.baseMean >= 10]

# Keep only genes with an adjusted p-value (padj) <= 0.05
res = res[res.padj <= 0.05]

In [7]:
# Keep only genes with |log2FoldChange| > 0.5, i.e. a fold change FC > ~1.41 or FC < ~0.71
# You can play with the exact thresholds here, these are just a guide to filter the lists
sigs = res[abs(res.log2FoldChange) > 0.5]
sigs

gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
Tb05.30F7.410,99.278007,-1.824686,0.655682,-2.782883,5.387819e-03,1.327922e-02
Tb05.5K5.130,704.934698,-0.628927,0.130836,-4.806982,1.532258e-06,7.231510e-06
Tb05.5K5.150,356.075307,-0.569119,0.144114,-3.949087,7.844998e-05,2.813295e-04
Tb05.5K5.210,652.756184,-0.546718,0.123264,-4.435348,9.192383e-06,3.866796e-05
Tb05.5K5.270,559.894678,2.175802,0.136297,15.963716,2.287011e-57,2.794969e-55
...,...,...,...,...,...,...
Tb927.9.9820,1610.596099,-1.227299,0.095661,-12.829598,1.119436e-37,6.301404e-36
Tb927.9.9840,1142.358426,-0.682270,0.133546,-5.108861,3.241061e-07,1.680792e-06
Tb927.9.9870,1282.657306,-0.712261,0.095751,-7.438662,1.017103e-13,1.143687e-12
Tb927.9.9940,2501.918060,-1.650387,0.091116,-18.113086,2.512616e-73,5.427251e-71
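
The cutoff of 0.5 used in the cell above is on the log2 scale, which is easy to confuse with a linear fold change. As a quick sanity check, here is a small sketch of converting between the two; the helper names `log2fc_to_fc` and `fc_to_log2fc` are illustrative and not part of PyDESeq2:

```python
import math


def log2fc_to_fc(log2fc):
    """Convert a log2 fold change to a linear fold change."""
    return 2.0 ** log2fc


def fc_to_log2fc(fc):
    """Convert a linear fold change to a log2 fold change."""
    return math.log2(fc)


threshold = 0.5  # the |log2FoldChange| cutoff used above
up_fc = log2fc_to_fc(threshold)     # about 1.41
down_fc = log2fc_to_fc(-threshold)  # about 0.71
print(f"|log2FC| > {threshold} means FC > {up_fc:.2f} or FC < {down_fc:.2f}")

# For a stricter 1.5-fold cutoff, the equivalent log2 threshold is about 0.58
strict = fc_to_log2fc(1.5)
print(f"FC > 1.5 corresponds to |log2FC| > {strict:.2f}")
```

If you prefer to think in linear fold changes, pick the fold change first and convert it to a log2 threshold before filtering.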


In [8]:
# Save the up and down regulated genes separately
up = sigs[sigs["log2FoldChange"] > 0]
up.to_csv(f"analysis/{prefix}/de/slender_vs_stumpy_tbrucei.up.csv")
up

gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
Tb05.5K5.270,559.894678,2.175802,0.136297,15.963716,2.287011e-57,2.794969e-55
Tb05.5K5.70,711.309715,0.849127,0.114743,7.400239,1.359397e-13,1.501318e-12
Tb08.27P2.60,34.502243,1.439498,0.488999,2.943764,3.242468e-03,8.454813e-03
Tb08.27P2.90,45.643900,1.285696,0.365416,3.518446,4.340823e-04,1.362080e-03
Tb09.v4.0150,900.439292,0.607229,0.126314,4.807313,1.529727e-06,7.223235e-06
...,...,...,...,...,...,...
Tb927.9.9300,2943.061947,1.205118,0.080061,15.052421,3.328281e-51,3.323986e-49
Tb927.9.9410,1057.658097,1.239115,0.099359,12.471079,1.073619e-35,5.478995e-34
Tb927.9.9600,2503.524120,0.536295,0.081707,6.563676,5.249737e-11,4.452927e-10
Tb927.9.9620,2059.820109,0.584625,0.095175,6.142626,8.116830e-10,6.013667e-09


In [9]:
down = sigs[sigs["log2FoldChange"] < 0]
down.to_csv(f"analysis/{prefix}/de/slender_vs_stumpy_tbrucei.down.csv")
down

gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
Tb05.30F7.410,99.278007,-1.824686,0.655682,-2.782883,5.387819e-03,1.327922e-02
Tb05.5K5.130,704.934698,-0.628927,0.130836,-4.806982,1.532258e-06,7.231510e-06
Tb05.5K5.150,356.075307,-0.569119,0.144114,-3.949087,7.844998e-05,2.813295e-04
Tb05.5K5.210,652.756184,-0.546718,0.123264,-4.435348,9.192383e-06,3.866796e-05
Tb05.5K5.280,501.262274,-1.617947,0.146860,-11.016920,3.167116e-28,1.081477e-26
...,...,...,...,...,...,...
Tb927.9.9810,1413.470872,-1.451264,0.100885,-14.385271,6.402719e-47,5.773636e-45
Tb927.9.9820,1610.596099,-1.227299,0.095661,-12.829598,1.119436e-37,6.301404e-36
Tb927.9.9840,1142.358426,-0.682270,0.133546,-5.108861,3.241061e-07,1.680792e-06
Tb927.9.9870,1282.657306,-0.712261,0.095751,-7.438662,1.017103e-13,1.143687e-12


# GO Analysis and Metabolic pathways analysis

We will now use these results for GO and metabolic pathway analysis using the TriTrypDB and PlasmoDB websites. 

You are welcome to perform either GO or metabolic pathway analysis on your chosen dataset, or both if you have time.

## GO Analysis
For this you will need a list of gene IDs, which can either be pasted as text into the search field on the website or uploaded as a text file. The DBs return a .csv file that includes the gene names (other annotations, such as cellular locations or GO terms, can be included if desired):

https://tritrypdb.org/tritrypdb/app/search/transcript/GeneByLocusTag
https://plasmodb.org/plasmo/app/search/transcript/GeneByLocusTag

Note that these functions have a glitch: some genes get duplicated in the returned list (this seems to be random). In Excel it is straightforward to identify such duplications and remove them.
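
If you would rather stay in the notebook than switch to Excel, the duplicates can also be dropped with pandas. This is a minimal sketch, assuming the downloaded file has one column holding the gene ID; the column names `Gene ID` and `Product Description` and the example rows are placeholders made up for illustration, so check the header of your own download:

```python
import io

import pandas as pd

# Stand-in for a downloaded results file; in practice pass the path of your download to pd.read_csv
example_csv = io.StringIO(
    "Gene ID,Product Description\n"
    "geneA,hypothetical protein\n"
    "geneB,hypothetical protein\n"
    "geneA,hypothetical protein\n"
)
annotations = pd.read_csv(example_csv)

# Keep the first occurrence of each gene ID and drop later repeats
deduplicated = annotations.drop_duplicates(subset="Gene ID", keep="first")
n_removed = len(annotations) - len(deduplicated)
print(f"Removed {n_removed} duplicated row(s); {len(deduplicated)} genes remain")
```

Printing the number of removed rows makes it easy to see whether a given download was affected.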

## Metabolic Pathways
Submit the filtered .csv files for ‘up’ and ‘down’ separately to the respective DB tools for metabolic pathway analysis:

https://tritrypdb.org/tritrypdb/app/search/pathway/PathwaysByGeneList
https://plasmodb.org/plasmo/app/search/pathway/PathwaysByGeneList

Note that the same glitch described above affects these searches, i.e. duplicates need to be removed afterwards.

The search function offers several ‘Pathway Sources’ (KEGG, LeishCyc, MetaCyc and TrypanoCyc for T. brucei). I’ve included all of them; the hit list can be filtered by source afterwards, if desired.

Pathways with hits can be downloaded as a .csv file, and the download can be customized in terms of the columns to include, such as ‘Total Pathway Enzymes’, ‘Unique Gene Counts’ (i.e. hits in that pathway), EC numbers etc.

I’ve included a separate column with ‘% Hits’ (Unique Gene Counts * 100 / Total Pathway Enzymes).
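
The ‘% Hits’ column can be computed with pandas once the pathway table is loaded. The sketch below uses made-up rows and assumes the downloaded columns are named exactly ‘Unique Gene Counts’ and ‘Total Pathway Enzymes’; the `Pathway` column name is also an assumption, so adjust the names to match your file:

```python
import pandas as pd

# Made-up pathway rows for illustration only; real values come from the downloaded .csv
pathways = pd.DataFrame(
    {
        "Pathway": ["pathway_1", "pathway_2"],
        "Unique Gene Counts": [3, 5],
        "Total Pathway Enzymes": [30, 20],
    }
)

# % Hits = Unique Gene Counts * 100 / Total Pathway Enzymes
pathways["% Hits"] = (
    pathways["Unique Gene Counts"] * 100 / pathways["Total Pathway Enzymes"]
)

# Rank pathways by the percentage of their enzymes that were hit
pathways = pathways.sort_values("% Hits", ascending=False)
print(pathways)
```

Sorting by ‘% Hits’ rather than by the raw hit count stops large pathways from dominating the top of the list simply because they contain more enzymes.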

In [10]:
with open(f"analysis/{prefix}/de/slender_vs_stumpy_tbrucei.genes.txt", "w") as f:
    for gene in sigs.index:
        f.write(f"{gene}\n")

In [11]:
with open(f"analysis/{prefix}/de/slender_vs_stumpy_tbrucei.up_genes.txt", "w") as f:
    for gene in up.index:
        f.write(f"{gene}\n")

In [12]:
with open(f"analysis/{prefix}/de/slender_vs_stumpy_tbrucei.down_genes.txt", "w") as f:
    for gene in down.index:
        f.write(f"{gene}\n")