# Simplifying the mianserin worm data

Tong Shu Li

In this notebook, we will:

1. Remove genes that lack a drift value in any of the three replicates of any of the 12 samples.
2. Average the drift and expression values observed in the different biological replicates.

In [1]:
import pandas as pd
import numpy as np

from functools import reduce

---

## Read the full annotated data

In [2]:
full = pd.read_csv("annotated_cpm_values.tsv", sep = '\t')

In [3]:
full.head(10)

Unnamed: 0,seqname,samples,value,cohort,replicate,day_harvested,drug,drug_conc_uM,day_drug_added,youngref,trans_drift,wormbaseid,gene_symbol
0,2L52.1,11,3.48662,1,1,1,water,0,1,1.779111,0.672818,WBGene00007063,2L52.1
1,2L52.1,12,0.963996,1,2,1,water,0,1,1.779111,-0.612782,WBGene00007063,2L52.1
2,2L52.1,13,0.886719,1,3,1,water,0,1,1.779111,-0.696341,WBGene00007063,2L52.1
3,2L52.1,21,3.52988,2,1,3,water,0,1,1.779111,0.68515,WBGene00007063,2L52.1
4,2L52.1,22,3.39253,2,2,3,water,0,1,1.779111,0.645462,WBGene00007063,2L52.1
5,2L52.1,23,4.19573,2,3,3,water,0,1,1.779111,0.857953,WBGene00007063,2L52.1
6,2L52.1,31,1.64999,3,1,5,water,0,1,1.779111,-0.075347,WBGene00007063,2L52.1
7,2L52.1,32,2.83349,3,2,5,water,0,1,1.779111,0.465397,WBGene00007063,2L52.1
8,2L52.1,33,3.10817,3,3,5,water,0,1,1.779111,0.55792,WBGene00007063,2L52.1
9,2L52.1,41,2.66396,4,1,10,water,0,1,1.779111,0.403698,WBGene00007063,2L52.1


In [4]:
full.shape

(690084, 13)

## Explore the data

In [5]:
# number of unique genes
len(full["seqname"].unique())

19169

In [6]:
# check that the full transcriptome was measured for each replicate of each sample
full.groupby(["cohort", "replicate"]).apply(
    lambda f: len(f["seqname"].unique())
)

cohort  replicate
1       1            19169
        2            19169
        3            19169
2       1            19169
        2            19169
        3            19169
3       1            19169
        2            19169
        3            19169
4       1            19169
        2            19169
        3            19169
5       1            19169
        2            19169
        3            19169
6       1            19169
        2            19169
        3            19169
7       1            19169
        2            19169
        3            19169
8       1            19169
        2            19169
        3            19169
9       1            19169
        2            19169
        3            19169
10      1            19169
        2            19169
        3            19169
11      1            19169
        2            19169
        3            19169
12      1            19169
        2            19169
        3            19169
dtype: int64

In [7]:
# in each sample, how many genes were missing drift values?
full.groupby(["cohort", "replicate"]).apply(
    lambda f: f["trans_drift"].isnull().sum()
)

cohort  replicate
1       1            2857
        2            4392
        3            2899
2       1            2406
        2            2705
        3            2746
3       1            2231
        2            2847
        3            2014
4       1            2154
        2            3831
        3            1922
5       1            3068
        2            3618
        3            2991
6       1            2754
        2            5020
        3            2981
7       1            3104
        2            3451
        3            2824
8       1            3011
        2            3680
        3            2593
9       1            2095
        2            2767
        3            2322
10      1            2846
        2            2831
        3            2258
11      1            3308
        2            2075
        3            1913
12      1            3192
        2            6089
        3            2568
dtype: int64

In [8]:
# provide some statistics about genes missing drift values
full.groupby(["cohort", "replicate"]).apply(
    lambda f: f["trans_drift"].isnull().sum()
).describe()

count      36.000000
mean     2954.527778
std       858.373129
min      1913.000000
25%      2385.000000
50%      2838.500000
75%      3126.000000
max      6089.000000
dtype: float64

So some genes are missing drift values in some samples. Let's check how many genes are missing in all samples, and how many are missing in at least one sample.

In [9]:
missing_genes = {
    info: set(df[df["trans_drift"].isnull()]["seqname"])
    for info, df in full.groupby(["cohort", "replicate"])
}

print("# unique genes with no drift value in all samples:",
    len(reduce(lambda x, y: x & y, missing_genes.values()))
)

print("# unique genes missing a drift value in at least one sample:",
    len(reduce(lambda x, y: x | y, missing_genes.values()))
)

# unique genes with no drift value in all samples: 1653
# unique genes missing a drift value in at least one sample: 8290
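The same two counts can be cross-checked without building one set per sample. The sketch below uses a small made-up table (the gene names and values are illustrative, not taken from the dataset) and counts, for each gene, the number of samples lacking a drift value. A gene is missing everywhere when that count equals the number of samples, and missing somewhere when the count is above zero.

```python
import numpy as np
import pandas as pd

# Toy data: 3 genes measured in 2 samples (hypothetical values).
toy = pd.DataFrame({
    "seqname":     ["geneA", "geneA", "geneB", "geneB", "geneC", "geneC"],
    "cohort":      [1, 2, 1, 2, 1, 2],
    "trans_drift": [0.1, -0.2, np.nan, 0.3, np.nan, np.nan],
})

n_samples = toy["cohort"].nunique()

# Number of samples in which each gene lacks a drift value.
n_missing = toy["trans_drift"].isnull().groupby(toy["seqname"]).sum()

missing_in_all = set(n_missing[n_missing == n_samples].index)
missing_in_any = set(n_missing[n_missing > 0].index)

print(missing_in_all)
print(missing_in_any)
```

On the real data, the sizes of these two sets should agree with the counts printed above.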


Although expression was measured for all 19169 genes, 1653 genes lack a drift value in every one of the 36 samples. This means that, for each of these genes, either the expression value or the young reference was 0.

We will need to exclude these genes from our analysis, since they provide us with no information.
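The rows printed earlier are consistent with the drift being the natural logarithm of the expression value divided by the young reference; for the first row, ln(3.48662 / 1.779111) is approximately 0.6728. This relationship is inferred from the displayed numbers rather than stated in the notebook, so the helper below (`log_ratio_drift`, a hypothetical name) is only a sketch of why a zero in either quantity leaves the drift undefined.

```python
import numpy as np

def log_ratio_drift(value, youngref):
    """Assumed definition: ln(value / youngref), NaN if either input is 0."""
    value = np.atleast_1d(np.asarray(value, dtype=float))
    youngref = np.atleast_1d(np.asarray(youngref, dtype=float))
    drift = np.full(value.shape, np.nan)
    # The log ratio is only defined when both quantities are positive.
    ok = (value > 0) & (youngref > 0)
    drift[ok] = np.log(value[ok] / youngref[ok])
    return drift

# First entry reproduces the first row of `full`
# (value 3.48662, youngref 1.779111, trans_drift 0.672818).
# The other two entries have a zero value or a zero reference.
result = log_ratio_drift([3.48662, 0.0, 2.0], [1.779111, 1.779111, 0.0])
print(result)
```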

---

## Discard genes missing a drift value in at least one sample

To keep three replicates for every measurement, we will discard every gene that is missing a drift value in **at least one** sample.

In [10]:
bad_genes = reduce(lambda x, y: x | y, missing_genes.values())
good_genes = set(full["seqname"]) - bad_genes

temp = pd.DataFrame({"seqname": list(good_genes)})

good = pd.merge(temp, full, how = "left", on = "seqname")

In [11]:
good.shape

(391644, 13)
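The left merge above keeps every row of `full` whose gene is in `good_genes`. A boolean mask built with `Series.isin` is a common alternative that selects the same rows while preserving the original row order. The sketch below shows it on a small invented table (the gene names are placeholders).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "seqname":     ["geneA", "geneA", "geneB", "geneB", "geneC", "geneC"],
    "replicate":   [1, 2, 1, 2, 1, 2],
    "trans_drift": [0.1, -0.2, np.nan, 0.3, 0.5, 0.4],
})

# Genes with a missing drift value in at least one row are discarded.
bad_genes = set(toy.loc[toy["trans_drift"].isnull(), "seqname"])

# Keep only rows whose gene is not in the discard set.
kept = toy[~toy["seqname"].isin(bad_genes)].reset_index(drop=True)

print(kept)
```

Because no intermediate dataframe is built from a set, the row order of the result does not depend on set iteration order.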

In [12]:
good.head(10)

Unnamed: 0,seqname,samples,value,cohort,replicate,day_harvested,drug,drug_conc_uM,day_drug_added,youngref,trans_drift,wormbaseid,gene_symbol
0,C30G4.4,11,1.20691,1,1,1,water,0,1,1.019207,0.169035,WBGene00016269,C30G4.4
1,C30G4.4,12,0.963996,1,2,1,water,0,1,1.019207,-0.055693,WBGene00016269,C30G4.4
2,C30G4.4,13,0.886719,1,3,1,water,0,1,1.019207,-0.139252,WBGene00016269,C30G4.4
3,C30G4.4,21,1.76494,2,1,3,water,0,1,1.019207,0.549092,WBGene00016269,C30G4.4
4,C30G4.4,22,0.502597,2,2,3,water,0,1,1.019207,-0.706992,WBGene00016269,C30G4.4
5,C30G4.4,23,1.28441,2,3,3,water,0,1,1.019207,0.231272,WBGene00016269,C30G4.4
6,C30G4.4,31,2.02498,3,1,5,water,0,1,1.019207,0.686536,WBGene00016269,C30G4.4
7,C30G4.4,32,0.386386,3,2,5,water,0,1,1.019207,-0.969944,WBGene00016269,C30G4.4
8,C30G4.4,33,1.87477,3,3,5,water,0,1,1.019207,0.60946,WBGene00016269,C30G4.4
9,C30G4.4,41,2.02461,4,1,10,water,0,1,1.019207,0.68635,WBGene00016269,C30G4.4


In [13]:
# number of unique genes now
len(good["seqname"].unique())

10879

This reduces the number of genes under analysis from 19169 to 10879 (19169 - 8290), which is about 57% of the original set.

## Average drift values across replicates

For further plotting, we will ignore the fact that multiple batches exist. We will take the mean of the expression and drift values over the three replicates for each gene in each sample.

In [14]:
# average the numeric columns over replicates; numeric_only skips the text columns
avg = good.drop(["samples", "replicate"], axis = 1).groupby(["seqname", "cohort"], as_index = False).mean(numeric_only = True)

# add back in the metadata columns
metadata = good[["seqname", "cohort", "drug", "wormbaseid", "gene_symbol"]].drop_duplicates()

avg = pd.merge(avg, metadata, on = ["seqname", "cohort"], how = "inner")
avg = avg.rename(columns = {"cohort": "sample"})
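As a small illustration of the averaging step, the sketch below collapses replicates into one row per gene and sample on made-up numbers. It selects the numeric columns explicitly before taking the mean, which avoids any dependence on how a given pandas version treats text columns during aggregation. The column subset and the values are assumptions for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "seqname":     ["geneA"] * 4,
    "cohort":      [1, 1, 2, 2],
    "replicate":   [1, 2, 1, 2],
    "drug":        ["water", "water", "mianserin", "mianserin"],
    "value":       [1.0, 3.0, 2.0, 6.0],
    "trans_drift": [-0.5, 0.5, 0.25, 0.75],
})

# Mean of the numeric measurements over replicates.
means = (
    toy.groupby(["seqname", "cohort"], as_index=False)[["value", "trans_drift"]]
       .mean()
)

# Re-attach the text metadata, which is constant within each group.
metadata = toy[["seqname", "cohort", "drug"]].drop_duplicates()
toy_avg = means.merge(metadata, on=["seqname", "cohort"], how="inner")

print(toy_avg)
```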

In [15]:
avg.head(10)

Unnamed: 0,seqname,sample,value,day_harvested,drug_conc_uM,day_drug_added,youngref,trans_drift,drug,wormbaseid,gene_symbol
0,2RSSE.1,1,4.105727,1,0,1,4.105726,-0.012915,water,WBGene00007064,2RSSE.1
1,2RSSE.1,2,7.687213,3,0,1,4.105726,0.562429,water,WBGene00007064,2RSSE.1
2,2RSSE.1,3,7.2205,5,0,1,4.105726,0.56249,water,WBGene00007064,2RSSE.1
3,2RSSE.1,4,3.923197,10,0,1,4.105726,-0.049039,water,WBGene00007064,2RSSE.1
4,2RSSE.1,5,4.042023,3,50,1,4.105726,-0.030335,mianserin,WBGene00007064,2RSSE.1
5,2RSSE.1,6,2.93401,5,50,1,4.105726,-0.490734,mianserin,WBGene00007064,2RSSE.1
6,2RSSE.1,7,3.966477,10,50,1,4.105726,-0.126457,mianserin,WBGene00007064,2RSSE.1
7,2RSSE.1,8,4.658657,5,50,3,4.105726,0.091843,mianserin,WBGene00007064,2RSSE.1
8,2RSSE.1,9,3.88884,10,50,3,4.105726,-0.05739,mianserin,WBGene00007064,2RSSE.1
9,2RSSE.1,10,2.7337,10,50,5,4.105726,-0.477487,mianserin,WBGene00007064,2RSSE.1


In [16]:
avg.shape

(130548, 11)

We now have a cleaned dataframe containing the averaged expression and drift values for 10879 genes in each of the 12 samples.
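A quick consistency check on the result is that every gene appears exactly once per sample, so the row count equals the number of genes times the number of samples (here 10879 * 12 = 130548). The helper below, `check_complete_grid`, is a hypothetical name and is demonstrated on a toy frame rather than on `avg` itself.

```python
import pandas as pd

def check_complete_grid(df, gene_col="seqname", sample_col="sample"):
    """Return True if every gene has exactly one row for every sample."""
    n_genes = df[gene_col].nunique()
    n_samples = df[sample_col].nunique()
    no_duplicates = not df.duplicated([gene_col, sample_col]).any()
    return no_duplicates and len(df) == n_genes * n_samples

toy_grid = pd.DataFrame({
    "seqname": ["geneA", "geneA", "geneB", "geneB"],
    "sample":  [1, 2, 1, 2],
})

print(check_complete_grid(toy_grid))           # complete grid
print(check_complete_grid(toy_grid.iloc[:3]))  # geneB lacks sample 2
```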

## Save cleaned data to file

In [17]:
avg.to_csv("avg_annotated_cpm_values.tsv", sep = '\t', index = False)