# Porting DESeq2 into Python using rpy2

I will use a small example of [ERCC transcripts](https://www.thermofisher.com/order/catalog/product/4456740) from [samples A and B in the MAQC data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3272078/).

In [5]:
%load_ext autoreload
%autoreload 2
import pandas as pd 
import numpy as np

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


We first read the count table. It should contain only count data, with ERCC spike-ins as rows and the 3 replicates from each of samples A and B as columns.

In [6]:
df = pd.read_table('./diffexpr/test/data/ercc.tsv')
df.head(5)

Unnamed: 0,id,A_1,A_2,A_3,B_1,B_2,B_3
0,ERCC-00002,111461,106261,107547,333944,199252,186947
1,ERCC-00003,6735,5387,5265,13937,8584,8596
2,ERCC-00004,17673,13983,15462,5065,3222,3353
3,ERCC-00009,4669,4431,4211,6939,4155,3647
4,ERCC-00012,0,2,0,0,0,0
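
A quick layout check can catch problems before the table is handed to R, since DESeq2 expects raw, non-negative integer counts. The sketch below is not part of `diffexpr`: it rebuilds the five rows shown above inline, and `check_count_table` is a hypothetical helper name introduced only for illustration.

```python
import io

import pandas as pd

# First five rows of the table printed above, inlined so the sketch runs on its own.
csv_text = """id,A_1,A_2,A_3,B_1,B_2,B_3
ERCC-00002,111461,106261,107547,333944,199252,186947
ERCC-00003,6735,5387,5265,13937,8584,8596
ERCC-00004,17673,13983,15462,5065,3222,3353
ERCC-00009,4669,4431,4211,6939,4155,3647
ERCC-00012,0,2,0,0,0,0
"""
counts = pd.read_csv(io.StringIO(csv_text))


def check_count_table(table, gene_column="id"):
    """Hypothetical helper: validate the table and return its sample columns."""
    sample_columns = [c for c in table.columns if c != gene_column]
    values = table[sample_columns]
    # Raw counts must be whole numbers and cannot be negative.
    if not all(pd.api.types.is_integer_dtype(t) for t in values.dtypes):
        raise ValueError("count columns must hold integers")
    if (values < 0).any().any():
        raise ValueError("counts cannot be negative")
    # Each gene ID should appear only once.
    if table[gene_column].duplicated().any():
        raise ValueError("gene IDs must be unique")
    return sample_columns


sample_columns = check_count_table(counts)
print(sample_columns)
```

If the check passes, the returned list holds exactly the columns that need a matching row in the design matrix.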


Next, we create a design matrix from the sample names in the count table. Note that the sample names must be used as the ```pd.DataFrame``` index.

In [7]:
sample_df = pd.DataFrame({'samplename': df.columns}) \
        .query('samplename != "id"')\
        .assign(sample = lambda d: d.samplename.str.extract('([AB])_', expand=False)) \
        .assign(replicate = lambda d: d.samplename.str.extract('_([123])', expand=False)) 
sample_df.index = sample_df.samplename
sample_df

samplename,samplename,sample,replicate
A_1,A_1,A,1
A_2,A_2,A,2
A_3,A_3,A,3
B_1,B_1,B,1
B_2,B_2,B,2
B_3,B_3,B,3
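
Because the design matrix is matched to the count table by sample name, it can help to confirm that the index lines up with the count columns before running anything. The following is a minimal sketch under that assumption. It rebuilds the design matrix from a hard-coded list of column names instead of reading the file, and uses `set_index(..., drop=False)` as an alternative to assigning the index by hand.

```python
import pandas as pd

# Column names of the count table shown earlier, hard-coded for a standalone example.
count_columns = ["id", "A_1", "A_2", "A_3", "B_1", "B_2", "B_3"]

sample_df = (
    pd.DataFrame({"samplename": count_columns})
    .query('samplename != "id"')
    .assign(
        sample=lambda d: d.samplename.str.extract(r"([AB])_", expand=False),
        replicate=lambda d: d.samplename.str.extract(r"_([123])", expand=False),
    )
    # Keep the samplename column and also use it as the index.
    .set_index("samplename", drop=False)
)

# The index should list the sample columns in the same order as the count table.
expected = [c for c in count_columns if c != "id"]
index_matches = list(sample_df.index) == expected

# A failed regex match would leave NaN in the design columns.
no_missing_labels = not sample_df[["sample", "replicate"]].isna().any().any()

print(index_matches, no_missing_labels)
```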


Running DESeq2 works just as it does in ```R```, except that the count table does not need gene IDs as row names. Instead, we tell the function which column holds the gene IDs:

In [8]:
from diffexpr.diffexpr.py_deseq import py_DESeq2

dds = py_DESeq2(count_matrix=df,
                design_matrix=sample_df,
                design_formula='~ replicate + sample',
                gene_column='id')  # <- tells DESeq2 which column holds the gene IDs

dds.run_deseq()
dds.get_deseq_result(contrast=['sample', 'B', 'A'])
res = dds.deseq_result
res.head()







INFO:DESeq2:Using contrast: ['sample', 'B', 'A']


Unnamed: 0,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,id
ERCC-00002,167917.342729,0.808857,0.047606,16.990537,9.650176e-65,1.1028769999999999e-63,ERCC-00002
ERCC-00003,7902.634073,0.521731,0.058878,8.861252,7.912103999999999e-19,4.868987e-18,ERCC-00003
ERCC-00004,10567.048228,-2.330122,0.055754,-41.792764,0.0,0.0,ERCC-00004
ERCC-00009,4672.573043,-0.19566,0.0616,-3.176286,0.001491736,0.003616329,ERCC-00009
ERCC-00012,0.384257,-1.565491,4.047562,-0.386774,0.6989237,,ERCC-00012


In [9]:
res.head(50)

Unnamed: 0,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,id
ERCC-00002,167917.342729,0.808857,0.047606,16.990537,9.650176e-65,1.1028769999999999e-63,ERCC-00002
ERCC-00003,7902.634073,0.521731,0.058878,8.861252,7.912103999999999e-19,4.868987e-18,ERCC-00003
ERCC-00004,10567.048228,-2.330122,0.055754,-41.792764,0.0,0.0,ERCC-00004
ERCC-00009,4672.573043,-0.19566,0.0616,-3.176286,0.001491736,0.003616329,ERCC-00009
ERCC-00012,0.384257,-1.565491,4.047562,-0.386774,0.6989237,,ERCC-00012
ERCC-00013,4.643218,1.997352,1.423997,1.402638,0.160725,0.2624081,ERCC-00013
ERCC-00014,57.192988,0.835153,0.376703,2.217004,0.0266228,0.05756282,ERCC-00014
ERCC-00016,1.359524,0.323383,2.395383,0.135003,0.8926097,0.9273867,ERCC-00016
ERCC-00017,0.171986,-1.199331,4.07315,-0.294448,0.7684155,,ERCC-00017
ERCC-00019,71.211359,-2.257933,0.372021,-6.06937,1.284133e-09,5.406874e-09,ERCC-00019
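
A common next step is to keep only the transcripts that pass a significance and effect-size cutoff. The sketch below inlines the ten rows printed above so that it runs on its own. The thresholds (`padj < 0.05`, `|log2FoldChange| > 1`) are assumptions chosen for illustration, not values from the original analysis. Rows with a missing `padj` (here the two transcripts with very low `baseMean`, most likely removed by DESeq2's independent filtering) compare as false and drop out automatically.

```python
import io

import pandas as pd

# The ten result rows printed above; an empty padj field is read as NaN.
csv_text = """id,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
ERCC-00002,167917.342729,0.808857,0.047606,16.990537,9.650176e-65,1.1028769999999999e-63
ERCC-00003,7902.634073,0.521731,0.058878,8.861252,7.912103999999999e-19,4.868987e-18
ERCC-00004,10567.048228,-2.330122,0.055754,-41.792764,0.0,0.0
ERCC-00009,4672.573043,-0.19566,0.0616,-3.176286,0.001491736,0.003616329
ERCC-00012,0.384257,-1.565491,4.047562,-0.386774,0.6989237,
ERCC-00013,4.643218,1.997352,1.423997,1.402638,0.160725,0.2624081
ERCC-00014,57.192988,0.835153,0.376703,2.217004,0.0266228,0.05756282
ERCC-00016,1.359524,0.323383,2.395383,0.135003,0.8926097,0.9273867
ERCC-00017,0.171986,-1.199331,4.07315,-0.294448,0.7684155,
ERCC-00019,71.211359,-2.257933,0.372021,-6.06937,1.284133e-09,5.406874e-09
"""
res = pd.read_csv(io.StringIO(csv_text))

# Illustrative cutoffs (assumptions, not taken from the original notebook).
PADJ_CUTOFF = 0.05
LFC_CUTOFF = 1.0

is_significant = res["padj"] < PADJ_CUTOFF           # NaN < x evaluates to False
is_large_change = res["log2FoldChange"].abs() > LFC_CUTOFF
significant = res[is_significant & is_large_change]

n_missing_padj = int(res["padj"].isna().sum())
print(significant["id"].tolist(), n_missing_padj)
```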


In [10]:
dds.normalized_count()  # DESeq2-normalized counts

INFO:DESeq2:Normalizing counts


Unnamed: 0,A_1,A_2,A_3,B_1,B_2,B_3,id
ERCC-00002,115018.353297,122494.471246,128809.545168,218857.357008,207880.854689,214443.474968,ERCC-00002
ERCC-00003,6949.952086,6209.970889,6305.915138,9133.911628,8955.740754,9860.313944,ERCC-00003
ERCC-00004,18237.045763,16119.180051,18518.909755,3319.456296,3361.532701,3846.164804,ERCC-00004
ERCC-00009,4818.014297,5107.922964,5043.534405,4547.622357,4334.937422,4183.406812,ERCC-00009
ERCC-00012,0.000000,2.305540,0.000000,0.000000,0.000000,0.000000,ERCC-00012
...,...,...,...,...,...,...,...
ERCC-00164,2.063831,1.152770,5.988523,3.276857,1.043306,2.294163,ERCC-00164
ERCC-00165,269.329992,246.692736,287.449123,513.811202,484.094095,489.803869,ERCC-00165
ERCC-00168,1.031916,3.458309,0.000000,4.587600,4.173225,1.147082,ERCC-00168
ERCC-00170,137.244785,148.707304,135.340629,26.870229,10.433062,32.118286,ERCC-00170
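
The normalized counts can be used to sanity-check the result table. In DESeq2, `baseMean` is the mean of the normalized counts across all samples, and the ratio of group means gives a rough fold change. The sketch below does this by hand for ERCC-00004, using the values printed above. The naive estimate ignores the `replicate` term in the design, so it should land near the fitted `log2FoldChange` (-2.330122) without matching it exactly.

```python
import math
from statistics import mean

# Normalized counts for ERCC-00004, copied from the table above.
a_counts = [18237.045763, 16119.180051, 18518.909755]
b_counts = [3319.456296, 3361.532701, 3846.164804]

# baseMean: mean of the normalized counts over all six samples.
base_mean = mean(a_counts + b_counts)

# Naive fold change: ratio of the group means on the log2 scale (B over A).
naive_lfc = math.log2(mean(b_counts) / mean(a_counts))

print(f"baseMean        = {base_mean:.3f}")
print(f"naive log2(B/A) = {naive_lfc:.3f}")
```

The sign convention matches the contrast `['sample', 'B', 'A']`: a negative value means the transcript is less abundant in sample B than in sample A.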


In [11]:
dds.comparison # show coefficients for GLM

['Intercept', 'replicate_2_vs_1', 'replicate_3_vs_1', 'sample_B_vs_A']
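
R counts positions from 1, so a coefficient's number is its position in the list above plus one relative to Python's zero-based index. Looking the position up by name avoids hard-coding it. This is a minimal sketch: `coef_position` is a hypothetical helper, not part of `diffexpr`, and the list is copied from the output above.

```python
# Coefficient names as printed by dds.comparison above.
coefficients = ["Intercept", "replicate_2_vs_1", "replicate_3_vs_1", "sample_B_vs_A"]


def coef_position(names, target):
    """Hypothetical helper: 1-based position of `target`, as R would count it."""
    return names.index(target) + 1


coef = coef_position(coefficients, "sample_B_vs_A")
print(coef)
```

If the design formula changes, the lookup still returns the right position, whereas a hard-coded number could silently point at a different coefficient.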

In [12]:
# The previous cell shows the order of the coefficients, so we can now
# pass a position as "coef" to lfcShrink.
# The comparison of interest is 'sample_B_vs_A', the 4th coefficient, so we use coef=4.
lfc_res = dds.lfcShrink(coef=4, method='apeglm')
lfc_res.head()

RRuntimeError: Error in (function (dds, coef, contrast, res, type = c("apeglm", "ashr",  : 
  type='apeglm' requires installing the Bioconductor package 'apeglm'
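
The error above means that the R installation used by rpy2 does not have the `apeglm` package. The sketch below shows one way to check for it and to build the install command. `r_package_installed` and `bioc_install_command` are hypothetical helper names. `isinstalled` comes from `rpy2.robjects.packages`, and `BiocManager::install` is the standard Bioconductor installer. The rpy2 import is deferred so that the command builder also works on a machine without R.

```python
def r_package_installed(package):
    """Return True if the R package is visible to rpy2's embedded R."""
    # Imported lazily so this code can be loaded without R available.
    from rpy2.robjects.packages import isinstalled

    return bool(isinstalled(package))


def bioc_install_command(package):
    """Build the R command that installs a Bioconductor package."""
    return (
        'if (!requireNamespace("BiocManager", quietly = TRUE)) '
        'install.packages("BiocManager"); '
        f'BiocManager::install("{package}")'
    )


command = bioc_install_command("apeglm")
print(command)
```

Running the printed command in the same R installation that rpy2 uses, and then re-running the `lfcShrink` cell, should resolve the error, assuming the installation itself succeeds.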
