### Fix `htseq-count` Read Counts Output for DESeq2

I would like to analyze my data in DESeq2. Following the online vignettes, I realized that the recommended models (for a time course experiment with multiple conditions) require a "full rank" sample matrix. Because we only performed 1 condition per individual at time t=0, DESeq2 doesn't allow me to load in _all_ of the data natively. To capture the effect at time t=0, I will copy those counts, create a separate counts matrix for each stimulation, and run DESeq2 once per stimulation, each compared against control.
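To illustrate why the original layout is rejected, here is a minimal sketch that builds a dummy-coded design matrix by hand and checks its rank with NumPy. It assumes a design of the form `~ STIM + TIME + STIM:TIME` with `TIME` treated as a factor, which is an assumption about the time-course model and not necessarily the exact formula used here. `design_matrix` and `is_full_rank` are hypothetical helpers, and the sample layout is a toy one for a single individual and a single stim.

```python
import itertools
import numpy as np
import pandas as pd

def design_matrix(samples):
    """Dummy-coded matrix for the ASSUMED formula ~ STIM + TIME + STIM:TIME."""
    stim = pd.get_dummies(samples['STIM'], drop_first=True).astype(float)
    time = pd.get_dummies(samples['TIME'].astype(str), drop_first=True).astype(float)
    cols = [np.ones(len(samples))]                    # intercept
    cols += [stim[c].values for c in stim.columns]    # STIM main effects
    cols += [time[c].values for c in time.columns]    # TIME main effects
    cols += [stim[s].values * time[t].values          # STIM:TIME interactions
             for s, t in itertools.product(stim.columns, time.columns)]
    return np.column_stack(cols)

def is_full_rank(X):
    """True if no column is a linear combination of the others."""
    return np.linalg.matrix_rank(X) == X.shape[1]

# toy layout: every later time point has both a Control and an IFNB sample
later = pd.DataFrame([(s, t) for s in ['Control', 'IFNB'] for t in [3, 6, 9, 12]],
                     columns=['STIM', 'TIME'])

# original layout: a single unstimulated ('None') sample at t=0
original = pd.concat([pd.DataFrame({'STIM': ['None'], 'TIME': [0]}), later],
                     ignore_index=True)

# workaround: the t=0 sample relabelled once as Control and once as IFNB
duplicated = pd.concat([pd.DataFrame({'STIM': ['Control', 'IFNB'], 'TIME': [0, 0]}), later],
                       ignore_index=True)

print(is_full_rank(design_matrix(original)))
print(is_full_rank(design_matrix(duplicated)))
```

In the original layout the model has more coefficients than observed STIM/TIME combinations, so the matrix cannot be full rank. Relabelling the t=0 sample fills every combination of the two factors.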

In [9]:
import pandas as pd
import pickle as pkl
import numpy as np

In [10]:
prefix = '/data/codec/bulk.jan20/'

In [11]:
with open(prefix + "counts.pkl", "rb") as file:
    counts = pd.read_pickle(file)

Just get the raw data, that's all I want.

In [12]:
counts = counts['raw']

In [13]:
counts

IND,TIME,STIM,TSPAN6,TNMD,DPM1,SCYL3,C1orf112,FGR,CFH,FUCA2,GCLC,NFYA,...,AL356417.3,AC010616.2,AL034430.1,AP000646.1,AP006216.3,__no_feature,__ambiguous,__too_low_aQual,__not_aligned,__alignment_not_unique
1895,0,,12,0,106,54,0,1879,4,117,105,86,...,10,0,0,0,0,3813417,104400,0,0,4017668
1895,12,Control,4,0,167,122,2,230,0,64,477,86,...,0,0,0,0,0,2218082,87556,0,0,2633375
1895,12,IFNB,2,0,179,92,7,174,5,76,316,111,...,0,0,0,0,0,1893717,85918,0,0,2366138
1895,12,IFNG,7,0,187,193,9,647,12,71,836,69,...,0,0,0,0,0,2955527,137238,0,0,3947510
1895,12,PMAI,13,0,547,58,4,248,0,22,215,138,...,0,0,0,0,0,2329449,101390,0,0,3612969
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
731,9,IFNB,3,0,92,63,11,125,0,15,204,63,...,0,0,0,0,0,1657083,65585,0,0,2190246
731,9,IFNG,4,0,238,200,7,574,8,26,672,68,...,0,0,0,0,0,2748356,103415,0,0,3985939
731,9,PMAI,0,0,334,43,0,839,9,6,247,134,...,4,0,0,0,0,2802613,91004,0,0,4059545
731,9,R848,2,0,259,226,19,198,6,12,211,62,...,1,0,0,0,0,3864753,131514,0,0,5261412
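
The last five columns in the table above (those starting with `__`) are the summary counters that `htseq-count` appends, not genes. As a hedged sketch, they can be selected by name rather than by position, which avoids depending on them being exactly the last five columns. The `toy` frame below is illustrative only and reuses a few values from the table.

```python
import pandas as pd

# toy frame: two genes plus two of the htseq-count summary counters
toy = pd.DataFrame({
    'TSPAN6': [12, 4],
    'DPM1': [106, 167],
    '__no_feature': [3813417, 2218082],
    '__ambiguous': [104400, 87556],
})

# htseq-count prefixes its summary counters with a double underscore
special = [c for c in toy.columns if str(c).startswith('__')]

# keep only the gene columns
genes_only = toy.drop(columns=special)
print(genes_only.columns.tolist())
```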


Get rid of the MultiIndex; it only complicates things here.

In [14]:
counts = counts.reset_index()
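
As a small illustration of what `reset_index` does here, the sketch below uses a toy frame with the same three index levels (`IND`, `TIME`, `STIM`). It assumes the unstimulated samples carry the string `'None'` in `STIM`, as the filtering code further down does. After the reset, the levels become ordinary columns and can be filtered with plain boolean masks.

```python
import pandas as pd

# toy frame indexed the same way as the counts table
idx = pd.MultiIndex.from_tuples(
    [('1895', 0, 'None'), ('1895', 12, 'Control'), ('1895', 12, 'IFNB')],
    names=['IND', 'TIME', 'STIM'])
toy = pd.DataFrame({'TSPAN6': [12, 4, 2], 'DPM1': [106, 167, 179]}, index=idx)

# the index levels become the first three columns
flat = toy.reset_index()
print(flat.columns.tolist())

# filtering on a former index level is now a plain column comparison
subset = flat[flat['STIM'].isin(['IFNB', 'Control'])]
print(len(subset))
```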

Create 5 separate sets of files, one for each stimulation, each containing counts for both the control and the stim conditions, in a format that is easy to read into DESeq2.

In [15]:
newdfs = list()
for stim in ['IFNB', 'IFNG', 'PMAI', 'R848', 'TNFa']:
    # extract out only the stim conditions and controls
    df = counts[(counts['STIM'] == stim) | (counts['STIM'] == 'Control')]
    
    # extract the unstimulated t=0 samples (STIM == 'None'), which I would like DESeq2 to treat as time t=0 for both stim and control
    df_none = counts[(counts['STIM'] == 'None')]
    n_none = len(df_none) # number of t=0 samples, so the relabelling below does not hard-code it
    
    df_none = pd.concat([df_none]*2) # repeat it twice, once for stim and once for control
    
    # change the values in STIM to be control and our specific stim, instead of "None"
    df_none['STIM'] = ['Control']*n_none + [stim]*n_none
    
    # concatenate and then sort for nice organization
    df_new = pd.concat([df, df_none]).sort_values(['IND','TIME','STIM'])
    
    # drop the last 5 columns, the htseq-count summary counters (__no_feature etc.), which are not genes and should not go into DESeq2
    df_new = df_new.iloc[:, :-5]
    
    # make a new index that encompasses the entirety of the sample name
    df_new.index = ['-'.join([i,j,k]) for i,j,k in zip(df_new['IND'].astype(str), df_new['STIM'], df_new['TIME'].astype(str))]
    
    # there is a slight issue with 24 gene names coming up twice in the df
    # R does not allow data frames with duplicate row names, and I'm not sure why htseq-count reported them as two separate ENSG ids
    # maybe it's the GTF's fault
    # in any case, I'm going to just sum over the duplicate columns (they become rows after the transpose below)
    cols, counts_of_cols = np.unique(df_new.columns.values, return_counts=True)
    
    for dup_gene in cols[counts_of_cols > 1]: # the columns also contain some meta data, but none of those should be repeated
        new_vals = df_new[dup_gene].sum(axis=1)
        df_new.drop(dup_gene, axis=1, inplace=True)
        df_new[dup_gene] = new_vals # note these get added to the end, not in their original place, but it shouldn't matter
        
    # extract only the genes, then transpose because DESeq2 expects genes as rows and samples as columns
    cts = df_new.iloc[:,3:].T
    
    # get new columns, and then make a separate dfs for the coldata
    coldata_columns = df_new.columns[:3]
    coldata = pd.DataFrame(data=df_new[coldata_columns].values, index=cts.columns, columns=coldata_columns)
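    # NOTE: this break stops the loop after the first stim (IFNB) so that cts and coldata can be inspected below;
    # remove it to process all five stims and write the CSVs
    break
    # export to csv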
    cts.to_csv(prefix + 'counts.csvs/%s.cts.csv' % stim)
    coldata.to_csv(prefix + 'counts.csvs/%s.col.csv' % stim)
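
The duplicate-gene handling in the loop above can also be written without an explicit loop. The following is a minimal sketch on a toy frame (the gene names `GENE_A` and `GENE_B` are made up), not a drop-in replacement: it should be applied to the gene columns only, since transposing a frame that still contains the `IND`/`TIME`/`STIM` columns would mix strings with counts.

```python
import pandas as pd

# toy counts with one duplicated gene name (samples as rows, genes as columns)
toy = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                   columns=['GENE_A', 'GENE_B', 'GENE_A'],
                   index=['s1', 's2'])

# transpose so genes are rows, sum rows that share a name, then transpose back;
# sort=False keeps genes in order of first appearance
summed = toy.T.groupby(level=0, sort=False).sum().T
print(summed)
```

Unlike the loop, this keeps each summed gene at the position of its first occurrence instead of appending it to the end, although the order should not matter to DESeq2 either way.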

In [16]:
cts

,1895-Control-0,1895-IFNB-0,1895-Control-3,1895-IFNB-3,1895-Control-6,1895-IFNB-6,1895-Control-9,1895-IFNB-9,1895-Control-12,1895-IFNB-12,...,731-Control-0,731-IFNB-0,731-Control-3,731-IFNB-3,731-Control-6,731-IFNB-6,731-Control-9,731-IFNB-9,731-Control-12,731-IFNB-12
TSPAN6,12,12,1,0,0,2,19,12,4,2,...,2,2,0,2,8,10,5,3,1,1
TNMD,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DPM1,106,106,142,64,105,131,220,186,167,179,...,59,59,101,40,213,206,190,92,220,228
SCYL3,54,54,74,50,47,125,53,97,122,92,...,33,33,39,38,117,159,70,63,93,154
C1orf112,0,0,13,8,10,10,8,0,2,7,...,10,10,1,3,20,19,23,11,10,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SCO2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
SOD2,659,659,2110,1068,1675,4210,2149,2989,2967,2512,...,346,346,1030,779,3279,5729,1803,2747,2055,3477
TBCE,34,34,43,18,2,23,30,26,48,38,...,14,14,16,20,71,55,41,43,50,21
TMSB15B,5,5,0,0,2,0,0,0,0,0,...,4,4,0,0,0,6,0,4,0,1


In [17]:
coldata

,IND,TIME,STIM
1895-Control-0,1895,0,Control
1895-IFNB-0,1895,0,IFNB
1895-Control-3,1895,3,Control
1895-IFNB-3,1895,3,IFNB
1895-Control-6,1895,6,Control
1895-IFNB-6,1895,6,IFNB
1895-Control-9,1895,9,Control
1895-IFNB-9,1895,9,IFNB
1895-Control-12,1895,12,Control
1895-IFNB-12,1895,12,IFNB
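
DESeq2 expects the columns of the count matrix and the rows of the column data to be in the same order, so it may be worth checking that before exporting. Below is a minimal sketch of such a check. `check_alignment` is a hypothetical helper, not part of pandas or DESeq2, and the toy frames reuse a few values from the outputs above.

```python
import pandas as pd

def check_alignment(cts, coldata):
    """True if the count columns and coldata rows name the same unique samples in the same order."""
    return cts.columns.is_unique and list(cts.columns) == list(coldata.index)

# toy versions of the two tables shown above
coldata = pd.DataFrame(
    {'IND': ['1895', '1895'], 'TIME': [0, 0], 'STIM': ['Control', 'IFNB']},
    index=['1895-Control-0', '1895-IFNB-0'])
cts = pd.DataFrame(
    {'1895-Control-0': [12, 0, 106], '1895-IFNB-0': [12, 0, 106]},
    index=['TSPAN6', 'TNMD', 'DPM1'])

print(check_alignment(cts, coldata))

# reversing the sample order in the counts breaks the alignment
shuffled = cts[cts.columns[::-1]]
print(check_alignment(shuffled, coldata))
```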
