# Making a Counts Matrix for DESeq2 Using Python

We will now take the output from featureCounts, our read summarization program, and use it to make a counts matrix. A counts matrix is a dataframe that displays the number of reads, or counts, for each gene. Each row is a gene and each column is an experimental sample, so all samples are represented at once. This is a comprehensive way of making sense of our previous alignment files.

**Example Counts Matrix Format**

|  | Sample_1 | Sample_2 | Sample_3 | Sample_4 |
| --- | --- | --- | --- | --- |
| Gene_1 | 1 | 2 | 4 | 5 |
| Gene_2 | 12 | 8 | 7 | 10 | 
| Gene_3 | 45 | 55 | 16 | 21 |
| Gene_4 | 17 | 22 | 70 | 65 |
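
As a minimal sketch, the example table above could be built by hand as a pandas dataframe. The gene and sample names are the placeholders from the table, not real data, and `example_counts` is a name made up for illustration.

```python
import pandas as pd

# Toy counts matrix: one row per gene, one column per sample.
example_counts = pd.DataFrame(
    {
        "Sample_1": [1, 12, 45, 17],
        "Sample_2": [2, 8, 55, 22],
        "Sample_3": [4, 7, 16, 70],
        "Sample_4": [5, 10, 21, 65],
    },
    index=["Gene_1", "Gene_2", "Gene_3", "Gene_4"],
)

print(example_counts.shape)                      # (number of genes, number of samples)
print(example_counts.loc["Gene_3", "Sample_2"])  # counts for one gene in one sample
```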



Let's begin by opening Jupyter notebooks. Notebooks are documents that we use to write and execute code, and are a great way to analyze data, make figures, and export your analysis using Python and R packages. Let's use this [notebook](https://github.com/ryanmarina/BMS_bioinformatics_bootcamp_2018/blob/master/tutorials/How_to_load_jupyter_notebooks.ipynb) to learn how we can connect to Jupyter through TSCC. Once we do that, let's start importing some data:

In [2]:
#First we will import packages that contain functions we will use here. 

#Pandas is a great python package for dataframe manipulation.
import pandas as pd

read_table is the command that will read in a tab separated file as a dataframe. See what other pandas read functions are available by pressing tab after pd.read. Auto-complete will show you all the options. 

We are going to set the index of the dataframe to the first column rather than an arbitrary number. (Try this function both with and without setting the index. How is it different?) If you look at the file we are loading with less on the command line, you will see that there are comments at the top of the file on lines that start with #. We need to tell pandas to ignore those when loading the dataframe, so we will use comment = "#".

To make sure that all manipulations are doing what we expect, we will also print the shape of the dataframe (do the numbers of rows and columns make sense?) and look at the beginning of the dataframe with df.head().
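
To see what index_col and comment do without touching the real file, here is a small sketch that reads a made-up, tab-separated string shaped roughly like featureCounts output. The file contents and the names `toy_file`, `without_index` and `with_index` are illustrative assumptions.

```python
import io
import pandas as pd

# Made-up text: one comment line, a header line, and two genes.
toy_file = (
    "# illustrative comment line\n"
    "Geneid\tLength\tsample_a\tsample_b\n"
    "gene_1\t1070\t0\t3\n"
    "gene_2\t110\t5\t8\n"
)

# Without index_col, pandas numbers the rows 0, 1, ... and Geneid is a regular column.
without_index = pd.read_table(io.StringIO(toy_file), comment="#")

# With index_col=0, the first column (Geneid) becomes the row labels.
with_index = pd.read_table(io.StringIO(toy_file), comment="#", index_col=0)

print(without_index.shape)  # Geneid counts as a column here
print(with_index.shape)     # one fewer column, because Geneid is now the index
```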

In [3]:
counts = pd.read_table("/oasis/tscc/scratch/biom200/bms_2018/rna_seq/analysis/featurecounts/featureCounts.txt", 
                       index_col=0, 
                       comment="#")
print(counts.shape)
counts.head()

(53379, 9)


| Geneid | Chr | Start | End | Strand | Length | /home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_0hr_rep1_Aligned.out.sam | /home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_0hr_rep2_Aligned.out.sam | /home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_4hr_rep1_Aligned.out.sam | /home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_4hr_rep2_Aligned.out.sam |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSMUSG00000102693.1 | chr1 | 3073253 | 3074322 | + | 1070 | 0 | 0 | 0 | 0 |
| ENSMUSG00000064842.1 | chr1 | 3102016 | 3102125 | + | 110 | 0 | 0 | 0 | 0 |
| ENSMUSG00000051951.5 | chr1;chr1;chr1;chr1;chr1;chr1;chr1 | 3205901;3206523;3213439;3213609;3214482;342170... | 3207317;3207317;3215632;3216344;3216968;342190... | -;-;-;-;-;-;- | 6094 | 0 | 0 | 1 | 0 |
| ENSMUSG00000102851.1 | chr1 | 3252757 | 3253236 | + | 480 | 0 | 0 | 0 | 0 |
| ENSMUSG00000103377.1 | chr1 | 3365731 | 3368549 | - | 2819 | 1 | 0 | 0 | 0 |


In [4]:
#This is the syntax to make a list in python. Lists are surrounded by square brackets. 

#We don't care about a few columns in the dataframe, so let's get rid of them.

cols_to_drop = ['Chr','Start','End','Strand']

#The command to get rid of rows or columns is df.drop
#We provide a list of columns to drop, and the axis that contains these values (1 is columns, 0 is rows)
counts = counts.drop(cols_to_drop, axis=1)
print(counts.shape)
counts.head()

(53379, 5)


| Geneid | Length | /home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_0hr_rep1_Aligned.out.sam | /home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_0hr_rep2_Aligned.out.sam | /home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_4hr_rep1_Aligned.out.sam | /home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_4hr_rep2_Aligned.out.sam |
| --- | --- | --- | --- | --- | --- |
| ENSMUSG00000102693.1 | 1070 | 0 | 0 | 0 | 0 |
| ENSMUSG00000064842.1 | 110 | 0 | 0 | 0 | 0 |
| ENSMUSG00000051951.5 | 6094 | 0 | 0 | 1 | 0 |
| ENSMUSG00000102851.1 | 480 | 0 | 0 | 0 | 0 |
| ENSMUSG00000103377.1 | 2819 | 1 | 0 | 0 | 0 |


Notice how dropping 4 columns changed the number of columns in the dataframe from 9 to 5, while the number of rows stayed the same.
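
A quick sketch on a toy dataframe (all names and values here are made up) shows what drop does to the shape: with axis=1 the column count shrinks while the row count stays the same.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Chr": ["chr1", "chr1"], "Start": [10, 50], "Length": [100, 200], "sample_a": [3, 7]},
    index=["gene_1", "gene_2"],
)

# axis=1 tells pandas that the labels in the list are column names.
trimmed = toy.drop(["Chr", "Start"], axis=1)

print(toy.shape)      # (2, 4)
print(trimmed.shape)  # (2, 2)
```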

The column names are unwieldy because they contain the full path and the SAM file name. Let's rename them to something shorter. We can use

    counts.columns
    
to give us a list of column names. This makes it easy to copy the ones we want into a dictionary that we will make below.

In [5]:
counts.columns

Index([u'Length',
       u'/home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_0hr_rep1_Aligned.out.sam',
       u'/home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_0hr_rep2_Aligned.out.sam',
       u'/home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_4hr_rep1_Aligned.out.sam',
       u'/home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_4hr_rep2_Aligned.out.sam'],
      dtype='object')

A dictionary links each key to a value. In this instance the key is the old column name and the value is the new column name. We will use this pairing scheme to define all old:new column names and feed that into a function to rename columns.

Dictionaries are made with {"key":"value", "key2":"value2", "key3":"value3"}

In [6]:
col_names = {'/home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_0hr_rep1_Aligned.out.sam':"mouse_0hr_rep1",
       '/home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_0hr_rep2_Aligned.out.sam':"mouse_0hr_rep2",
       '/home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_4hr_rep1_Aligned.out.sam':"mouse_4hr_rep1",
       '/home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_4hr_rep2_Aligned.out.sam':"mouse_4hr_rep2"}
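
Typing the dictionary by hand works well for four samples. With many samples, one possible alternative is to build it from the column names themselves. This sketch assumes every sample column ends in `_Aligned.out.sam`, as it does here; `old_names`, `suffix` and `auto_col_names` are names made up for illustration, and only two of the four paths are listed to keep it short.

```python
import os

old_names = [
    "/home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_0hr_rep1_Aligned.out.sam",
    "/home/ucsd-train40/bms_2018/rna_seq/analysis/star_alignment/sam_files/mouse_4hr_rep2_Aligned.out.sam",
]
suffix = "_Aligned.out.sam"

# basename drops the directory part; slicing removes the shared suffix.
auto_col_names = {
    old: os.path.basename(old)[: -len(suffix)]
    for old in old_names
    if old.endswith(suffix)
}

print(auto_col_names)
```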

You can put a dictionary in the rename function to rename the columns. Let's feed in the dictionary that we made called col_names. Check the shape and head of the dataframe to make sure the changes happened as you expected them to.

In [7]:
counts = counts.rename(columns = col_names)
print(counts.shape)
counts.head()

(53379, 5)


| Geneid | Length | mouse_0hr_rep1 | mouse_0hr_rep2 | mouse_4hr_rep1 | mouse_4hr_rep2 |
| --- | --- | --- | --- | --- | --- |
| ENSMUSG00000102693.1 | 1070 | 0 | 0 | 0 | 0 |
| ENSMUSG00000064842.1 | 110 | 0 | 0 | 0 | 0 |
| ENSMUSG00000051951.5 | 6094 | 0 | 0 | 1 | 0 |
| ENSMUSG00000102851.1 | 480 | 0 | 0 | 0 | 0 |
| ENSMUSG00000103377.1 | 2819 | 1 | 0 | 0 | 0 |


You will notice that this dataframe has more than 53,000 genes. That's a lot, and many of them appear to have very few counts. So we are going to calculate the mean count across all of our samples and get rid of genes that have a mean count of 5 or less.

You will notice that our dataframe also contains length information.

In [8]:
counts.columns

Index([u'Length', u'mouse_0hr_rep1', u'mouse_0hr_rep2', u'mouse_4hr_rep1',
       u'mouse_4hr_rep2'],
      dtype='object')

Let's make a list of the columns that we want to look at when calculating the mean. We don't want to include the Length value in our mean calculation. 

In [9]:
samples = ['mouse_0hr_rep1', 'mouse_0hr_rep2', 'mouse_4hr_rep1', 'mouse_4hr_rep2']

Access only those columns by putting the list in square brackets after the name of the dataframe. Calculate the mean across each row with .mean(axis=1). Let's take a look at the first 5 results with .head().

In [10]:
counts[samples].mean(axis=1).head()

Geneid
ENSMUSG00000102693.1    0.00
ENSMUSG00000064842.1    0.00
ENSMUSG00000051951.5    0.25
ENSMUSG00000102851.1    0.00
ENSMUSG00000103377.1    0.25
dtype: float64
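
If the axis argument is confusing, this toy example (made-up numbers and names) contrasts the two directions: axis=1 gives one mean per gene (row), while axis=0 gives one mean per sample (column).

```python
import pandas as pd

toy = pd.DataFrame(
    {"sample_a": [0, 10], "sample_b": [1, 20], "Length": [500, 900]},
    index=["gene_1", "gene_2"],
)
toy_samples = ["sample_a", "sample_b"]  # leave Length out of the mean

per_gene = toy[toy_samples].mean(axis=1)    # one value per row
per_sample = toy[toy_samples].mean(axis=0)  # one value per column

print(per_gene)
print(per_sample)
```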

Set the filtering cutoff at 5. We want to keep genes that have a mean count value greater than 5. Use a boolean comparison to find out which genes have a mean greater than 5. This returns a True/False array, labeled by gene, marking which genes to keep.

In [11]:
counts[samples].mean(axis=1) > 5

Geneid
ENSMUSG00000102693.1     False
ENSMUSG00000064842.1     False
ENSMUSG00000051951.5     False
ENSMUSG00000102851.1     False
ENSMUSG00000103377.1     False
ENSMUSG00000104017.1     False
ENSMUSG00000103025.1     False
ENSMUSG00000089699.1     False
ENSMUSG00000103201.1     False
ENSMUSG00000103147.1     False
ENSMUSG00000103161.1     False
ENSMUSG00000102331.1     False
ENSMUSG00000102348.1     False
ENSMUSG00000102592.1     False
ENSMUSG00000088333.2     False
ENSMUSG00000102343.1     False
ENSMUSG00000025900.12    False
ENSMUSG00000102948.1     False
ENSMUSG00000104123.1     False
ENSMUSG00000025902.13    False
ENSMUSG00000104238.1     False
ENSMUSG00000102269.1     False
ENSMUSG00000096126.1     False
ENSMUSG00000103003.1     False
ENSMUSG00000104328.1     False
ENSMUSG00000102735.1     False
ENSMUSG00000098104.1     False
ENSMUSG00000102175.1     False
ENSMUSG00000088000.1     False
ENSMUSG00000103265.1     False
                         ...  
ENSMUSG00000064343.1      True
E

Let's save this result and call it genes_to_keep. Notice a few key things. Each True/False value is labeled by its Geneid. The Geneid is also the index of our dataframe. This is very important!

In [12]:
genes_to_keep = counts[samples].mean(axis=1) > 5

Since the Geneid is the index of both our dataframe and genes_to_keep, we can use .loc to keep only the rows where genes_to_keep is True. The syntax of the following command is dataframe.loc[]. Inside the square brackets is an array that says, for each item in the index, whether or not to keep it. True items are kept, False items are removed. Take a look at how this affects the number of rows in the dataframe with .shape. How many genes are left in our analysis?
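
Here is the same idea on a toy dataframe, as a sketch with made-up values. The mask and the dataframe share the same index, which is what lets .loc match each True or False to the right gene.

```python
import pandas as pd

toy = pd.DataFrame(
    {"sample_a": [0, 12, 4], "sample_b": [1, 30, 9]},
    index=["gene_1", "gene_2", "gene_3"],
)

# Boolean Series indexed by gene: True where the row mean is above the cutoff.
keep = toy.mean(axis=1) > 5

# .loc keeps only the rows whose mask value is True.
kept = toy.loc[keep]

print(list(kept.index))
```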

In [13]:
counts_clean = counts.loc[genes_to_keep]
print(counts_clean.shape)
counts_clean.head()

(16082, 5)


| Geneid | Length | mouse_0hr_rep1 | mouse_0hr_rep2 | mouse_4hr_rep1 | mouse_4hr_rep2 |
| --- | --- | --- | --- | --- | --- |
| ENSMUSG00000103922.1 | 1069 | 3 | 7 | 5 | 8 |
| ENSMUSG00000033845.13 | 8487 | 292 | 546 | 135 | 175 |
| ENSMUSG00000025903.14 | 7145 | 423 | 525 | 285 | 365 |
| ENSMUSG00000033813.15 | 3017 | 577 | 561 | 564 | 483 |
| ENSMUSG00000103280.1 | 1111 | 20 | 3 | 0 | 6 |


To find out how many True values were in our array, we can use .sum(). Sum in this case will count the number of Trues. (True has a value of 1, False has a value of 0). genes_to_keep.sum() should match the number of rows in our dataframe. Does it? 

In [14]:
genes_to_keep.sum()

16082
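
The same comparison can be written as an explicit check rather than done by eye. This is a sketch on a toy mask; `mask`, `values`, `n_true` and `n_rows` are made-up names standing in for genes_to_keep and the counts dataframe.

```python
import pandas as pd

mask = pd.Series([False, True, True, False], index=["g1", "g2", "g3", "g4"])
values = pd.DataFrame({"sample_a": [0, 9, 7, 1]}, index=mask.index)

n_true = int(mask.sum())             # True counts as 1, False as 0
n_rows = values.loc[mask].shape[0]   # rows that survive the filter

# The two numbers should always agree.
assert n_true == n_rows
print(n_true, n_rows)
```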

For DESeq2, we need to provide a counts matrix. We are going to use the filtered counts matrix to get rid of the genes with 0 (or nearly 0) counts. But we don't want the Length column. Remember, to access only a few columns, give the name of the dataframe followed by square brackets. Inside the square brackets, give it a list of the columns you want to use. We made a list called samples, so this is the list we will use to keep only our samples and ignore the Length column.

In [15]:
counts_clean[samples].head()

| Geneid | mouse_0hr_rep1 | mouse_0hr_rep2 | mouse_4hr_rep1 | mouse_4hr_rep2 |
| --- | --- | --- | --- | --- |
| ENSMUSG00000103922.1 | 3 | 7 | 5 | 8 |
| ENSMUSG00000033845.13 | 292 | 546 | 135 | 175 |
| ENSMUSG00000025903.14 | 423 | 525 | 285 | 365 |
| ENSMUSG00000033813.15 | 577 | 561 | 564 | 483 |
| ENSMUSG00000103280.1 | 20 | 3 | 0 | 6 |


Save this dataframe as a csv. First define the directory where you want to save it. Make sure this directory exists (make it on your command line first). Call it deseq_dir. Put it in quotes to tell python this is a string. 

To save, use .to_csv() and put in the name of the file where you want to save it. This is the directory + a meaningful filename. Spaces matter here: the directory string should end with a slash, and there should be no space inside the quotes between the directory and the file name. Follow the syntax exactly as written below.

In [17]:
deseq_dir = "/home/ucsd-train40/projects/mouse_LPS/deseq2/"

counts_clean[samples].to_csv(deseq_dir+"Mouse_LPS_counts_for_deseq2.csv")
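
One way to avoid mistakes when gluing the directory and file name together is os.path.join, which inserts the separator when it is needed. This sketch writes a toy dataframe to a temporary directory so that it can run anywhere; `out_dir` and the file name are stand-ins for deseq_dir and your real file name.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"sample_a": [3, 292]}, index=["gene_1", "gene_2"])
toy.index.name = "Geneid"

out_dir = tempfile.mkdtemp()  # stand-in for deseq_dir
out_path = os.path.join(out_dir, "toy_counts.csv")

toy.to_csv(out_path)

# Read the file back to confirm that the gene IDs and counts survived.
reloaded = pd.read_csv(out_path, index_col=0)
print(reloaded)
```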

DESeq2 also needs a conditions matrix where the row names are the sample names (that exactly match the column names from the counts matrix) and there is one column describing the condition that the sample came from. We can set the row names (index) directly when making a new dataframe by saying index = samples. Remember samples is a list of column names from our counts matrix above. 

In [18]:
conditions = pd.DataFrame(index = samples)
conditions.head()

mouse_0hr_rep1
mouse_0hr_rep2
mouse_4hr_rep1
mouse_4hr_rep2


To make a new column in a dataframe, put square brackets after the dataframe with the new column name in quotes. In this dataframe, we will make a new column called 'condition'. Set this equal to a list of the values that you would like to fill this column. In our case, the values are knockdown or control depending on the sample. Look at the conditions dataframe with head to make sure it is doing what you think it is. 

In [19]:
conditions['condition'] = ['knockdown','knockdown','control','control']

In [20]:
conditions.head()

|  | condition |
| --- | --- |
| mouse_0hr_rep1 | knockdown |
| mouse_0hr_rep2 | knockdown |
| mouse_4hr_rep1 | control |
| mouse_4hr_rep2 | control |
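
Because DESeq2 expects the row names of the conditions table to match the column names of the counts matrix, it can be worth checking that before saving. Here is a minimal sketch with toy data; `toy_counts` and `toy_conditions` are hypothetical stand-ins for the dataframes above.

```python
import pandas as pd

sample_names = ["mouse_0hr_rep1", "mouse_0hr_rep2", "mouse_4hr_rep1", "mouse_4hr_rep2"]

toy_counts = pd.DataFrame(
    [[3, 7, 5, 8], [292, 546, 135, 175]],
    index=["gene_1", "gene_2"],
    columns=sample_names,
)

toy_conditions = pd.DataFrame(index=sample_names)
toy_conditions["condition"] = ["knockdown", "knockdown", "control", "control"]

# Are the names the same, and in the same order?
names_match = list(toy_counts.columns) == list(toy_conditions.index)
print(names_match)
```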


Save this dataframe the same way we did before.

In [21]:
conditions.to_csv(deseq_dir+"Mouse_LPS_conditions_for_deseq2.csv")

We also want to save the counts matrix with the Length column. We need this to calculate TPM and FPKM, useful normalized read numbers that are used to quantify relative expression levels in RNA-seq data. Check out this [notebook]() for more details on calculating these numbers!

Take a look at that dataframe to make sure it is what we want and save it to the featurecounts directory that you have likely already made. 

In [22]:
counts_clean.head()

| Geneid | Length | mouse_0hr_rep1 | mouse_0hr_rep2 | mouse_4hr_rep1 | mouse_4hr_rep2 |
| --- | --- | --- | --- | --- | --- |
| ENSMUSG00000103922.1 | 1069 | 3 | 7 | 5 | 8 |
| ENSMUSG00000033845.13 | 8487 | 292 | 546 | 135 | 175 |
| ENSMUSG00000025903.14 | 7145 | 423 | 525 | 285 | 365 |
| ENSMUSG00000033813.15 | 3017 | 577 | 561 | 564 | 483 |
| ENSMUSG00000103280.1 | 1111 | 20 | 3 | 0 | 6 |


In [24]:
feature_counts_dir = "/home/ucsd-train40/projects/mouse_LPS/featurecounts/"

counts_clean.to_csv(feature_counts_dir+"Mouse_LPS_clean_counts_with_length.csv")
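
For reference, here is a minimal sketch of how TPM and FPKM could be computed from a table with a Length column, using the standard definitions (counts per kilobase of gene length, scaled per million). The toy numbers and variable names are made up, and this is not necessarily how the notebook mentioned above does it. Keep in mind that per-sample totals computed on a filtered table differ from totals on the full table.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Length": [1000, 2000], "sample_a": [10, 20], "sample_b": [30, 0]},
    index=["gene_1", "gene_2"],
)
toy_samples = ["sample_a", "sample_b"]

length_kb = toy["Length"] / 1000.0

# Reads per kilobase: divide each gene's counts by its length in kb.
rpk = toy[toy_samples].div(length_kb, axis=0)

# TPM: scale each sample so that its per-kilobase rates sum to one million.
tpm = rpk / rpk.sum(axis=0) * 1e6

# FPKM: per-kilobase rates divided by the sample's total counts in millions.
fpkm = rpk / (toy[toy_samples].sum(axis=0) / 1e6)

print(tpm)
print(fpkm)
```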