### Format Pseudobulked Read Counts for DESeq2

I would like to analyze my data in DESeq2. Following the online vignettes, I've realized that the recommended models (for a time course experiment with multiple conditions) require a "full rank" model matrix. Because we only performed 1 condition per individual at time t=0, DESeq2 doesn't allow me to load in _all_ of the data natively. Therefore, to capture the effect at time t=0, I will copy those counts multiple times, create a separate counts matrix for each condition, and run DESeq2 once per condition, comparing each one to control.
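To make the "full rank" problem concrete, here is a minimal sketch on a toy layout. The sample table below is hypothetical and much smaller than the real experiment: it assumes a control (`C`) measured at both time points and one stimulation (`A`) measured only at the later one. With an intercept, a condition term, a time term, and their interaction, the condition and interaction columns come out identical, so the matrix has fewer independent columns than coefficients.

```python
import numpy as np
import pandas as pd

# Hypothetical toy layout (an assumption, not the real sample sheet):
# control 'C' at t=0 and t=1, stimulation 'A' only at t=1.
samples = pd.DataFrame({
    "COND": ["C", "C", "C", "C", "A", "A"],
    "TIME": [0, 0, 1, 1, 1, 1],
})

is_stim = (samples["COND"] == "A").astype(int).to_numpy()
is_late = (samples["TIME"] == 1).astype(int).to_numpy()

# Columns: intercept, condition, time, condition:time interaction
design = np.column_stack([
    np.ones(len(samples), dtype=int),
    is_stim,
    is_late,
    is_stim * is_late,
])

rank = np.linalg.matrix_rank(design)
is_full_rank = rank == design.shape[1]
print(f"rank {rank} of {design.shape[1]} columns, full rank: {is_full_rank}")
```

Because `A` never appears at t=0, the interaction column carries no information beyond the condition column, which is the kind of missing combination that splitting the data into one control-versus-stimulation comparison per condition is meant to sidestep.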

In [1]:
import pandas as pd
import pickle as pkl
import numpy as np
from tqdm.notebook import tqdm

In [2]:
prefix = '/data/codec/'

In [3]:
counts = pd.read_pickle(prefix + 'production.run/mrna/pkls/aggr/pseudobulk.conds.inds.pkl')

Just get the raw data, that's all I want.

In [5]:
counts['COND'].value_counts()

A    64
B    64
G    64
C    64
R    64
P    64
Name: COND, dtype: int64

In [6]:
np.unique(counts['COND'].values)

array(['A', 'B', 'C', 'G', 'P', 'R'], dtype=object)

In [7]:
counts.isna().sum().sum()

0

Create 5 separate sets of files, one for each stimulation. Each set contains the counts for the controls and for that stim condition, in a format that is easy to read into DESeq2.

In [21]:
for cond in tqdm(['A','B','G','P','R']):
    # extract out only the cond conditions and controls
    df = counts[(counts['COND'] == cond) | (counts['COND'] == 'C')]
    
    # make a new index that encompasses the entirety of the sample name
    df.index = ['-'.join([j,k]) for j,k in zip(df['COND'], df['FID'])]
    

    # With the bulk data, I had to remove and sum over duplicate genes,
    # but because this is pseudobulked from an adata object on which I already ran
    # var_names_make_unique, I think I'm good (also confirmed in a separate cell)
    
    # extract only the gene columns, then transpose because DESeq2 expects
    # genes as rows and samples as columns
    
    conds = df.iloc[:,2:].T
    
    
    # take the first two (metadata) columns and make a separate df for the coldata
    coldata_columns = df.columns[:2]
    coldata = pd.DataFrame(data=df[coldata_columns].values, index=conds.columns, columns=coldata_columns)
    

    # next line was required, was getting the following error from DESeq2:
    # every gene contains at least one zero, cannot compute log geometric means
    # this was the suggested fix: drop genes with zero total counts,
    # then add a pseudocount of 1 to every remaining entry
    conds = conds.loc[conds.sum(axis=1) > 0,:] + 1
    
    #export to csv
    conds.to_csv(prefix + 'production.run/mrna/de.csvs/pseudobulks/%s.csv' % cond)
    coldata.to_csv(prefix + 'production.run/mrna/de.csvs/pseudobulks/%s.col.csv' % cond)

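As a sanity check on the output format, the reshaping can be sketched on a tiny, made-up table. The gene names, counts, and the helper `format_for_deseq2` are hypothetical; the sketch assumes, like the loop above, that the first two columns are `COND` and `FID` and everything after them is a gene. The pseudocount step is left out so the toy values stay easy to read. The point is that the column names of the counts matrix must match the row names of the coldata, in the same order, which is what DESeq2 expects.

```python
import pandas as pd

# Hypothetical toy pseudobulk table: two metadata columns, then genes
toy = pd.DataFrame({
    "COND": ["A", "A", "C", "C"],
    "FID": ["f1", "f2", "f1", "f2"],
    "GENE1": [5, 0, 3, 2],
    "GENE2": [0, 0, 0, 0],
    "GENE3": [1, 4, 0, 7],
})

def format_for_deseq2(df, n_meta=2):
    """Return (genes x samples counts, samples x metadata coldata)."""
    df = df.copy()
    # sample name = condition + individual ID, e.g. 'A-f1'
    df.index = df["COND"] + "-" + df["FID"]
    meta_cols = df.columns[:n_meta]
    coldata = df[meta_cols]
    # transpose so genes are rows and samples are columns
    genes = df.drop(columns=meta_cols).T
    # drop genes with zero counts in every sample
    genes = genes.loc[genes.sum(axis=1) > 0]
    return genes, coldata

toy_counts, toy_coldata = format_for_deseq2(toy)

# counts columns and coldata rows should line up one-to-one
aligned = list(toy_counts.columns) == list(toy_coldata.index)
print(toy_counts)
print(toy_coldata)
print("aligned:", aligned)
```

The same alignment check could be run on each exported pair of CSVs after reading them back with `pd.read_csv(..., index_col=0)`.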


