# Cleaning Up the Annotated RNA-seq Data

Tong Shu Li

In this notebook, we clean up the raw annotated RNA-seq data file (`Q1_Sunitha_RNAseq_36samples_annotated.raw`) and prepare it for subsequent use.

To begin, run `dos2unix` on the file to remove carriage returns (`\r`).
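If `dos2unix` is not available, the same cleanup can be approximated in Python. The sketch below is an alternative, not the method used in this notebook: the helper name `strip_carriage_returns` and the temporary file names are hypothetical. It converts CRLF line endings to LF and leaves everything else untouched.

```python
import os
import tempfile


def strip_carriage_returns(src_path, dst_path):
    """Copy src_path to dst_path, converting CRLF line endings to LF."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(line.replace(b"\r\n", b"\n"))


# Demonstration on a small temporary file with Windows line endings.
tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "example.raw")      # hypothetical file name
clean_path = os.path.join(tmp_dir, "example.clean")  # hypothetical file name
with open(raw_path, "wb") as handle:
    handle.write(b"seqname\tvalue\r\n2L52.1\t3.48662\r\n")

strip_carriage_returns(raw_path, clean_path)
with open(clean_path, "rb") as handle:
    cleaned = handle.read()
```

Working in binary mode avoids any implicit newline translation by Python itself, so the only change made is the one written in the helper.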

In [1]:
import pandas as pd # version 0.18.0

In [2]:
data = pd.read_csv("Q1_Sunitha_RNAseq_36samples_annotated.raw", sep = '\t')

In [3]:
data.head(5)

| | seqname | samples | value | cohort | replicate | age | drug | conc | add | youngref | td | wormbaseid | symbol | v1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2L52.1 | 11 | 3.48662 | 1 | 1 | 1 | h2o | 0 | 1 | 1.779111 | 0.672818 | WBGene00007063 | 2L52.1 | 2L52.1 |
| 1 | 2L52.1 | 12 | 0.963996 | 1 | 2 | 1 | h2o | 0 | 1 | 1.779111 | -0.612782 | WBGene00007063 | 2L52.1 | 2L52.1 |
| 2 | 2L52.1 | 13 | 0.886719 | 1 | 3 | 1 | h2o | 0 | 1 | 1.779111 | -0.696341 | WBGene00007063 | 2L52.1 | 2L52.1 |
| 3 | 2L52.1 | 21 | 3.52988 | 2 | 1 | 3 | h2o | 0 | 1 | 1.779111 | 0.68515 | WBGene00007063 | 2L52.1 | 2L52.1 |
| 4 | 2L52.1 | 22 | 3.39253 | 2 | 2 | 3 | h2o | 0 | 1 | 1.779111 | 0.645462 | WBGene00007063 | 2L52.1 | 2L52.1 |


In [4]:
data.shape

(690084, 14)
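The row count factors as 690084 = 36 × 19169, which would fit a long-format table with one row per gene per sample for the 36 samples named in the file. That layout is an assumption, not something the notebook states. A minimal sketch of how it could be checked, shown on toy data (the helper `rows_per_gene` and the label `geneB` are hypothetical):

```python
import pandas as pd


def rows_per_gene(df, gene_col="seqname"):
    """Count how many rows each gene contributes to a long-format table."""
    return df.groupby(gene_col).size()


# Toy table: two genes measured in two samples each.
toy = pd.DataFrame({
    "seqname": ["2L52.1", "2L52.1", "geneB", "geneB"],
    "samples": [11, 12, 11, 12],
})

counts = rows_per_gene(toy)
# True when every gene has the expected number of rows (2 in this toy case).
balanced = bool((counts == 2).all())
```

On the real data, the analogous comparison would be against 36 instead of 2.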

## Delete the redundant v1 column

The `v1` column is redundant because it is equal to the `symbol` column for all rows:

In [5]:
len(data) == (data["v1"] == data["symbol"]).sum()

True
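The comparison above counts the rows where the two columns agree and checks that the count equals the number of rows. An equivalent check, sketched here on toy data, uses `Series.equals`, which also treats missing values in the same position as equal. The helper name `columns_identical` and the toy values are illustrative assumptions.

```python
import pandas as pd


def columns_identical(df, left, right):
    """Return True if two columns hold the same values in every row."""
    return df[left].equals(df[right])


# Toy table: "v1" duplicates "symbol", while "other" differs in one row.
toy = pd.DataFrame({
    "symbol": ["a", "b", "c"],
    "v1": ["a", "b", "c"],
    "other": ["a", "x", "c"],
})

same = columns_identical(toy, "symbol", "v1")
different = columns_identical(toy, "symbol", "other")
```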

In [6]:
# drop the redundant "v1" column
data = data.drop("v1", axis = 1)

## Rename the values in the `drug` column

We will rename `h2o` and `mia` to the more informative `water` and `mianserin` in the `drug` column.

In [7]:
# map each abbreviation explicitly so that any unexpected value is left unchanged
data["drug"] = data["drug"].replace({"h2o": "water", "mia": "mianserin"})

In [8]:
data["drug"].value_counts()

mianserin    460056
water        230028
Name: drug, dtype: int64
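The two counts sum to 690084, the total number of rows, so every row received one of the two labels. A stricter variant of the renaming step can make that guarantee explicit by refusing labels it does not recognise. The following is a minimal sketch on toy data; the helper `rename_drugs` and the constant `DRUG_NAMES` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Abbreviations described in the text, mapped to their full names.
DRUG_NAMES = {"h2o": "water", "mia": "mianserin"}


def rename_drugs(series, mapping=DRUG_NAMES):
    """Map drug abbreviations to full names, failing on unknown labels."""
    unknown = set(series.unique()) - set(mapping)
    if unknown:
        raise ValueError("unexpected drug labels: {}".format(unknown))
    return series.map(mapping)


toy = pd.Series(["h2o", "mia", "mia"])
renamed = rename_drugs(toy)
```

Raising an error instead of silently relabelling means a typo or a third treatment in the input would be noticed at this step.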

## Rename the columns to be more descriptive

In [9]:
data = data.rename(
    columns = {
        "age": "day_harvested",
        "conc": "drug_conc_uM",
        "add": "day_drug_added",
        "td": "trans_drift",
        "symbol": "gene_symbol"
    }
)

In [10]:
data.head()

| | seqname | samples | value | cohort | replicate | day_harvested | drug | drug_conc_uM | day_drug_added | youngref | trans_drift | wormbaseid | gene_symbol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2L52.1 | 11 | 3.48662 | 1 | 1 | 1 | water | 0 | 1 | 1.779111 | 0.672818 | WBGene00007063 | 2L52.1 |
| 1 | 2L52.1 | 12 | 0.963996 | 1 | 2 | 1 | water | 0 | 1 | 1.779111 | -0.612782 | WBGene00007063 | 2L52.1 |
| 2 | 2L52.1 | 13 | 0.886719 | 1 | 3 | 1 | water | 0 | 1 | 1.779111 | -0.696341 | WBGene00007063 | 2L52.1 |
| 3 | 2L52.1 | 21 | 3.52988 | 2 | 1 | 3 | water | 0 | 1 | 1.779111 | 0.68515 | WBGene00007063 | 2L52.1 |
| 4 | 2L52.1 | 22 | 3.39253 | 2 | 2 | 3 | water | 0 | 1 | 1.779111 | 0.645462 | WBGene00007063 | 2L52.1 |
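By default, `DataFrame.rename` silently ignores mapping keys that do not match any column, so a misspelled key would leave the old name in place without warning. A small guard, sketched below on toy data, checks the keys first. The helper name `rename_checked` and the toy values are assumptions for illustration.

```python
import pandas as pd


def rename_checked(df, mapping):
    """Rename columns, raising if any key in mapping is not a column."""
    missing = [old for old in mapping if old not in df.columns]
    if missing:
        raise KeyError("columns not found: {}".format(missing))
    return df.rename(columns=mapping)


# Toy table with two of the original column names and made-up values.
toy = pd.DataFrame({"age": [1, 3], "conc": [0, 50]})
renamed = rename_checked(toy, {"age": "day_harvested", "conc": "drug_conc_uM"})
```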


## Save cleaned data to file

In [11]:
data.to_csv("annotated_cpm_values.tsv", sep = '\t', index = False)
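To confirm that a file written this way reads back as expected, a round-trip check can be run. The sketch below uses a toy table and a temporary path, both hypothetical, so it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Toy table with made-up values standing in for the cleaned data.
toy = pd.DataFrame({
    "seqname": ["geneA", "geneB"],
    "drug": ["water", "mianserin"],
    "value": [1.5, 2.5],
})

# Write without the index, as in the cell above, then read the file back.
out_path = os.path.join(tempfile.mkdtemp(), "toy_values.tsv")
toy.to_csv(out_path, sep="\t", index=False)
reloaded = pd.read_csv(out_path, sep="\t")

# The reloaded table should have the same columns and shape as the original.
round_trip_ok = (
    list(reloaded.columns) == list(toy.columns)
    and reloaded.shape == toy.shape
)
```

Passing `index = False` when saving is what keeps an extra unnamed index column from appearing when the file is read again.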