# What to expect
In notebook 3A we ran a differential gene expression analysis on the example dataset, Schistosoma mansoni, and used visualisation techniques to view the most significant genes. In this notebook we will apply the same methods to the dataset of our choice. We will then explore the GO terms and pathways associated with the significant genes using online resources.

In [7]:
# import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# install PyDESeq2 and import the required classes
! pip install --quiet pydeseq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

In [8]:
# load in the counts and metadata again
prefix = "Plasmodium"
counts = pd.read_csv(f"analysis/{prefix}/star/ReadsPerGene.csv", index_col=0).T
metadata = pd.read_csv(f"data/{prefix}/metadata.csv", index_col=0)

# restrict to the two wildtype timepoints we want to compare (16 and 24)
keep = metadata["timepoint"].isin([16, 24]) & metadata["condition"].isin(["wildtype"])
counts_s = counts[keep]
metadata_s = metadata[keep]

# create deseq2 dataset object
dds = DeseqDataSet(
    counts=counts_s,
    metadata=metadata_s,
    design_factors="timepoint",  # compare samples based on the "timepoint" column
    refit_cooks=True
)

# run DESeq2
dds.deseq2()
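
The subsetting above filters `counts` with a boolean mask built from `metadata`, which relies on both tables listing the same samples. Below is a minimal sketch of a more defensive variant, using made-up toy tables (the sample names `s1` to `s4`, the genes `geneA`/`geneB` and the variable `keep` are illustrative assumptions, not part of the dataset). It subsets the metadata first, then selects the matching rows of the counts by sample name, so the two tables end up in the same row order.

```python
import pandas as pd

# Toy stand-ins for the real tables (sample names and values are made up).
metadata = pd.DataFrame(
    {
        "timepoint": [16, 24, 16, 24],
        "condition": ["wildtype", "wildtype", "mutant", "wildtype"],
    },
    index=["s1", "s2", "s3", "s4"],
)
counts = pd.DataFrame(
    {"geneA": [5, 9, 3, 7], "geneB": [0, 2, 1, 4]},
    index=["s4", "s3", "s2", "s1"],  # deliberately in a different order
)

# Build the sample mask once, from the metadata only.
keep = metadata["timepoint"].isin([16, 24]) & metadata["condition"].eq("wildtype")
metadata_s = metadata[keep]

# Select the same samples from the counts by name, in the metadata's order.
counts_s = counts.loc[metadata_s.index]

# Sanity check before handing both tables to the DE analysis.
assert counts_s.index.equals(metadata_s.index)
print(counts_s)
```

Selecting with `.loc` and the metadata index raises a `KeyError` if a sample is missing from the counts table, which surfaces a mismatch early rather than during the analysis.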

In [10]:
# Summarize results
stat_res = DeseqStats(dds)
stat_res.summary()
res = stat_res.results_df

Running Wald tests...


Log2 fold change & Wald test p-value: timepoint 24 vs 16
                    baseMean  log2FoldChange     lfcSE      stat    pvalue  \
gene                                                                         
PBANKA_0000101      0.499719        0.458866  4.079205  0.112489  0.910436   
PBANKA_0000201      0.000000             NaN       NaN       NaN       NaN   
PBANKA_0000301     10.017695       -0.514198  1.046018 -0.491577  0.623019   
PBANKA_0000401     21.489374        2.099202  0.840283  2.498208  0.012482   
PBANKA_0000600     24.726040       -1.279860  0.682604 -1.874966  0.060797   
...                      ...             ...       ...       ...       ...   
PBANKA_MIT03300     0.000000             NaN       NaN       NaN       NaN   
PBANKA_MIT03400     0.000000             NaN       NaN       NaN       NaN   
PBANKA_MIT03500  4967.510592        1.068291  0.586247  1.822255  0.068416   
PBANKA_MIT03600    80.486185        0.527777  0.587982  0.897607  0.369395   
PBANKA_

... done in 0.30 seconds.



In [12]:
! mkdir -p "analysis/{prefix}/de"

In [13]:
res.to_csv(f"analysis/{prefix}/de/16_vs_24h_wildtype.full.csv")

In [14]:
# Drop genes with baseMean < 10 so that genes with near-zero expression don't skew the results
res = res[res.baseMean >= 10]
res

gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
PBANKA_0000301,10.017695,-0.514198,1.046018,-0.491577,0.623019,0.655635
PBANKA_0000401,21.489374,2.099202,0.840283,2.498208,0.012482,0.016491
PBANKA_0000600,24.726040,-1.279860,0.682604,-1.874966,0.060797,0.074545
PBANKA_0000901,20.009201,0.783186,0.733236,1.068123,0.285465,0.317623
PBANKA_0001001,169.147522,-0.278415,0.297130,-0.937015,0.348751,0.383068
...,...,...,...,...,...,...
PBANKA_MIT02700,1717.186751,0.662946,0.304095,2.180067,0.029253,0.037160
PBANKA_MIT02800,129.439561,0.717236,0.523630,1.369739,0.170769,0.197018
PBANKA_MIT03500,4967.510592,1.068291,0.586247,1.822255,0.068416,0.083311
PBANKA_MIT03600,80.486185,0.527777,0.587982,0.897607,0.369395,0.404707
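
As a small illustration of the baseMean filter above, the sketch below rebuilds the first five rows of the results printed by `stat_res.summary()` (values copied from that output; the names `toy_res` and `toy_filtered` are hypothetical) and applies the same threshold. Genes with a baseMean of zero, whose statistics are NaN, are removed by the same step.

```python
import numpy as np
import pandas as pd

# First five rows of the printed results (baseMean and log2FoldChange only).
toy_res = pd.DataFrame(
    {
        "baseMean": [0.499719, 0.000000, 10.017695, 21.489374, 24.726040],
        "log2FoldChange": [0.458866, np.nan, -0.514198, 2.099202, -1.279860],
    },
    index=pd.Index(
        [
            "PBANKA_0000101",
            "PBANKA_0000201",
            "PBANKA_0000301",
            "PBANKA_0000401",
            "PBANKA_0000600",
        ],
        name="gene",
    ),
)

# Same threshold as in the notebook: keep genes with baseMean >= 10.
toy_filtered = toy_res[toy_res.baseMean >= 10]

print(f"kept {len(toy_filtered)} of {len(toy_res)} genes")
print(toy_filtered)
```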


In [15]:
# Keep only significant genes: adjusted p-value < 0.05 and a fold change FC > 2 or FC < 0.5 (|log2FC| > 1)
sigs = res[(res.padj < 0.05) & (abs(res.log2FoldChange) > 1)]
sigs

gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
PBANKA_0000401,21.489374,2.099202,0.840283,2.498208,1.248229e-02,1.649136e-02
PBANKA_0007701,61.267753,2.148695,0.514617,4.175328,2.975570e-05,4.825806e-05
PBANKA_0008101,195.532688,1.897001,0.295428,6.421195,1.352086e-10,2.930848e-10
PBANKA_0100021,1402.834566,3.470364,0.160586,21.610658,1.426008e-103,3.176726e-102
PBANKA_0100041,38.141408,1.635641,0.726573,2.251174,2.437451e-02,3.125623e-02
...,...,...,...,...,...,...
PBANKA_1466121,226.572142,-1.131772,0.267453,-4.231671,2.319611e-05,3.793047e-05
PBANKA_API00095,71.627149,2.234160,0.736168,3.034850,2.406552e-03,3.381205e-03
PBANKA_MIT00800,106.386590,1.004904,0.469661,2.139635,3.238425e-02,4.095585e-02
PBANKA_MIT01000,206.692383,1.021474,0.411498,2.482329,1.305265e-02,1.722281e-02
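
The cutoff above is expressed on the log2 scale: a log2 fold change of +1 corresponds to a doubling (FC = 2) and -1 to a halving (FC = 0.5). The sketch below shows that correspondence and applies both cutoffs to four genes whose values are copied from the tables above. The helper `passes_cutoffs` and its parameter names are hypothetical and only restate the filter for single values.

```python
import math


def passes_cutoffs(log2_fold_change, padj, lfc_cutoff=1.0, alpha=0.05):
    """Return True if a gene passes both the padj and |log2FC| cutoffs."""
    return padj < alpha and abs(log2_fold_change) > lfc_cutoff


# A log2 fold change of +1 is a doubling; -1 is a halving.
assert math.isclose(2 ** 1.0, 2.0)
assert math.isclose(2 ** -1.0, 0.5)

# (log2FoldChange, padj) pairs copied from the tables above.
examples = {
    "PBANKA_0000401": (2.099202, 0.016491),
    "PBANKA_1466121": (-1.131772, 3.793047e-05),
    "PBANKA_MIT03500": (1.068291, 0.083311),
    "PBANKA_0000301": (-0.514198, 0.655635),
}

for gene, (lfc, padj) in examples.items():
    fold_change = 2 ** lfc  # convert back from the log2 scale
    print(f"{gene}: FC = {fold_change:.2f}, significant = {passes_cutoffs(lfc, padj)}")
```

The first two genes pass both cutoffs. PBANKA_MIT03500 has a large enough fold change but an adjusted p-value above 0.05, and PBANKA_0000301 fails both, which matches their absence from the filtered table.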


In [16]:
sigs.to_csv(f"analysis/{prefix}/de/16_vs_24h_wildtype.filtered.csv")