### Format Pseudobulked Read Counts for DESeq2

I would like to analyze my data in DESeq2. I'm following the online vignettes, but the recommended models for a time course experiment with multiple conditions require a "full rank" model matrix. Because we only performed 1 condition per individual at time t=0, DESeq2 doesn't allow me to load in _all_ of the data natively. To capture the effect at time t=0, I will instead reuse those counts across several separate counts matrices and run DESeq2 multiple times, once for each condition as compared to control.
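To make the "full rank" problem concrete, here is a minimal sketch on a toy layout. The layout is an assumption for illustration only (control `C` observed at t=0 and t=1, a stim `A` observed only at t=1), and the hand-built columns only mimic what a `~ COND + TIME + COND:TIME` style design would produce; the actual design formula used downstream is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample sheet (NOT the real data):
# control 'C' at t=0 and t=1, stim 'A' only at t=1.
samples = pd.DataFrame({
    'COND': ['C', 'C', 'A', 'C', 'C', 'A'],
    'TIME': [0, 1, 1, 0, 1, 1],
})

# Indicator columns for a COND + TIME + COND:TIME style model matrix.
is_A = (samples['COND'] == 'A').astype(int)
is_t1 = (samples['TIME'] == 1).astype(int)
design = pd.DataFrame({
    'intercept': 1,
    'condA': is_A,
    'time1': is_t1,
    'condA:time1': is_A * is_t1,  # identical to condA, since A never occurs at t=0
})

rank = np.linalg.matrix_rank(design.values)
n_cols = design.shape[1]
full_rank = rank == n_cols  # fewer independent columns than columns

# A simpler stim-vs-control design on the same samples is full rank.
simple = design[['intercept', 'condA']]
simple_rank = np.linalg.matrix_rank(simple.values)
simple_full_rank = simple_rank == simple.shape[1]
```

In this toy case the interaction column duplicates the condition column, so the full design cannot be fit, while a plain stim-versus-control comparison can. That is the motivation for splitting the data per condition.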

In [1]:
import pandas as pd
import pickle as pkl
import numpy as np
from tqdm.notebook import tqdm

In [2]:
prefix = '/data/codec/production.run/'

In [3]:
counts = pd.read_pickle(prefix + 'adts/pkls/pseudobulk.cts.pkl')

Just get the raw data, that's all I want.

In [4]:
counts['CT'].value_counts()

Mono_cDC_All            384
CD4_T_Naive             384
B_Naive                 384
HSC                     384
CD4_T_Memory            384
NK                      384
CD8_T_Naive             384
B_Memory                384
pDC                     384
CD8_T_Memory_MAIT_GD    384
Mono_C                  320
cDC                     320
Mono_NC                 320
Name: CT, dtype: int64

In [5]:
np.unique(counts['CT'].values)

array(['B_Memory', 'B_Naive', 'CD4_T_Memory', 'CD4_T_Naive',
       'CD8_T_Memory_MAIT_GD', 'CD8_T_Naive', 'HSC', 'Mono_C', 'Mono_NC',
       'Mono_cDC_All', 'NK', 'cDC', 'pDC'], dtype=object)

In [6]:
counts.isna().sum().sum()

0

Create 5 separate sets of files, one for each stimulation, each containing counts for both the controls and that stim condition, in a format that is easy to read into DESeq2.

In [7]:
for cond in tqdm(['A','B','G','P','R']):
    # keep only this stim condition plus the controls; .copy() so that
    # re-indexing below doesn't operate on a slice of counts
    df = counts[(counts['COND'] == cond) | (counts['COND'] == 'C')].copy()
    
    # make a new index that encompasses the entirety of the sample name
    df.index = ['-'.join([i,j,k]) for i,j,k in zip(df['CT'], df['COND'], df['FID'])]
    

    # With the bulk data, I had to remove and sum over duplicate genes,
    # but because this is pseudobulked from an adata object, I already ran 
    # var_names_make_unique, so I think I'm good (also confirmed in a separate cell)
    
    # extract out only the genes, then rotate because that's what DESeq2 expects
    cts = df.iloc[:,3:].T
    
    
    # get new columns, and then make a separate dfs for the coldata
    coldata_columns = df.columns[:3]
    coldata = pd.DataFrame(data=df[coldata_columns].values, index=cts.columns, columns=coldata_columns)
    
#     if cond == 'P':
#         cts = cts.loc[:,~np.any(np.stack([cts.columns.str.contains('Mono_C-C'), cts.columns.str.contains('Mono_NC-C')]), axis=0)]
#         coldata = coldata.loc[~np.any(np.stack([coldata.index.str.contains('Mono_C-C'), coldata.index.str.contains('Mono_NC-C')]), axis=0),:]
    
    # next line was required; I was getting the following error from DESeq2:
    #   every gene contains at least one zero, cannot compute log geometric means
    # this was the suggested fix: drop all-zero rows, then add a pseudocount of 1
    cts = cts.loc[cts.sum(axis=1) > 0,:] + 1

    #export to csv
    cts.to_csv(prefix + 'adts/de.csvs/%s.cts.csv' % cond)
    coldata.to_csv(prefix + 'adts/de.csvs/%s.col.csv' % cond)

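As a sanity check on the export logic, the same steps can be sketched on a tiny made-up table. Everything here is illustrative: the toy values, the feature names `f1`/`f2`, and the helper name `split_for_deseq2` are hypothetical and not part of the real pipeline. The point is to confirm that the columns of the counts matrix line up, in order, with the rows of the coldata, which DESeq2 requires when building a dataset from a matrix.

```python
import pandas as pd

# Toy stand-in for the pseudobulk table: three metadata columns, then features.
# All values are made up for illustration.
toy = pd.DataFrame({
    'CT':   ['NK', 'NK', 'NK', 'pDC'],
    'COND': ['C', 'A', 'B', 'C'],
    'FID':  ['s1', 's1', 's1', 's2'],
    'f1':   [5, 0, 2, 1],
    'f2':   [0, 0, 0, 0],
})

def split_for_deseq2(counts, cond, control='C'):
    """Return (cts, coldata) for one stim condition plus the controls."""
    df = counts[counts['COND'].isin([cond, control])].copy()
    # sample name = cell type, condition, and individual joined together
    df.index = ['-'.join(t) for t in zip(df['CT'], df['COND'], df['FID'])]
    cts = df.iloc[:, 3:].T        # features as rows, samples as columns
    coldata = df.iloc[:, :3]      # one row of metadata per sample
    # drop all-zero features, then add a pseudocount of 1
    cts = cts.loc[cts.sum(axis=1) > 0, :] + 1
    return cts, coldata

cts, coldata = split_for_deseq2(toy, 'A')

# Columns of the counts matrix must match coldata rows, in the same order.
assert list(cts.columns) == list(coldata.index)
```

The same assertion could be run on each real `cts`/`coldata` pair before writing the CSVs, so that a mismatch is caught here rather than on the R side.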


