# Create random gene subsets of PMID26623667

Tong Shu Li

### Background

To interpret the GO term drift analysis results for PMID26623667, we need to know how the results are distributed for randomly selected gene sets containing the same number of genes as a given GO term. We will therefore generate random gene sets of different sizes and use them as a reference when interpreting the GO term results.

### Overview

We will be doing two analyses with the randomly generated data. In the first, we will examine the general influence of gene set size on drift model parameters. In the second, we will use the randomly generated data to score the GO terms and identify outliers.

For the overview analysis, we will generate 100 random gene sets per gene set size, with gene set sizes increasing by one. For the detailed GO analysis, we will generate 10000 gene sets per gene set size, but will increase gene set size in increments of 10.
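As a quick check on the scale of the two designs, the number of gene sets can be worked out from the size range. This is a minimal sketch that assumes the 10 to 500 inclusive size range set later in this notebook; the variable names are chosen here for illustration only.

```python
# Gene set size bounds, inclusive (assumed from the Parameters section).
num_min, num_max = 10, 500

overview_sizes = range(num_min, num_max + 1)       # sizes increase by 1
detailed_sizes = range(num_min, num_max + 1, 10)   # sizes increase by 10

n_overview = len(overview_sizes) * 100     # 100 sets per size
n_detailed = len(detailed_sizes) * 10000   # 10000 sets per size

print(n_overview)  # 49100
print(n_detailed)  # 500000
```

The detailed analysis therefore produces roughly ten times as many gene sets as the overview, even though it covers about a tenth as many sizes.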

In [1]:
import json
import os
import pandas as pd

from itertools import product
from numpy import random as rand
from tqdm import tqdm

In [2]:
rand.seed(20160617)

---

## Read the expression data for PMID26623667

In [3]:
data = (pd
    .read_csv("../clean_annotated_cpm_values.tsv", sep = '\t')
    .rename(columns = {"wormbaseid": "wormbase_id"})
)

In [4]:
data.shape

(391644, 13)

In [5]:
data.head()

| | seqname | samples | cpm_value | cohort | replicate | day_harvested | drug | drug_conc_uM | day_drug_added | youngref_cpm | RLFEC | wormbase_id | gene_symbol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2RSSE.1 | 11 | 4.55942 | 1 | 1 | 1 | water | 0 | 1 | 4.105726 | 0.104814 | WBGene00007064 | 2RSSE.1 |
| 1 | 2RSSE.1 | 12 | 3.21332 | 1 | 2 | 1 | water | 0 | 1 | 4.105726 | -0.245078 | WBGene00007064 | 2RSSE.1 |
| 2 | 2RSSE.1 | 13 | 4.54444 | 1 | 3 | 1 | water | 0 | 1 | 4.105726 | 0.101521 | WBGene00007064 | 2RSSE.1 |
| 3 | 2RSSE.1 | 21 | 11.4368 | 2 | 1 | 3 | water | 0 | 1 | 4.105726 | 1.024454 | WBGene00007064 | 2RSSE.1 |
| 4 | 2RSSE.1 | 22 | 4.77467 | 2 | 2 | 3 | water | 0 | 1 | 4.105726 | 0.150942 | WBGene00007064 | 2RSSE.1 |


In [6]:
data["wormbase_id"].nunique()

10879

The dataset contains 10879 unique genes, from which we will sample at random.

---

## Create random gene sets of different sizes

### Parameters

We will generate gene sets of 10 to 500 genes inclusive, which matches the filtering requirements for the GO terms. For the overview gene sets, every size in this range is used. For the GO analysis gene sets, the size increases in steps of 10 instead of 1 to reduce the amount of data that needs to be stored.

All gene sets are stored in the `genesets/` folder, which is assumed to already exist.
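If the folder might not exist yet, it can be created up front. This is a small optional sketch rather than part of the original workflow; `out_dir` is a name introduced here for illustration.

```python
import os

# Folder where the gene set JSON files are written.
out_dir = "genesets"

# Create the folder if it is missing; do nothing if it already exists.
os.makedirs(out_dir, exist_ok=True)
```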

In [7]:
# min and max gene set size, inclusive
num_min = 10
num_max = 500

all_genes = data["wormbase_id"].unique()

### Overview gene sets

In [8]:
genesets = dict()

pop_size = 100
sizes = range(num_min, num_max + 1)

for set_size, i in tqdm(product(sizes, range(pop_size)), total = pop_size * len(sizes)):
    uid = "unif_gset_size{}_num{}".format(set_size, i)
    # replace = False so that each gene set contains set_size distinct genes
    # (rand.choice samples with replacement by default)
    genesets[uid] = sorted(list(rand.choice(all_genes, set_size, replace = False)))

assert len(genesets) == pop_size * len(sizes)

100%|██████████| 49100/49100 [00:06<00:00, 7756.03it/s]


In [9]:
with open("genesets/uniform.json", "w") as fout:
    json.dump(genesets, fout, sort_keys = True, indent = 4)

---

### GO analysis gene sets

In [10]:
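def rand_set(set_size, pop_size):
    """Generate pop_size random gene sets, each with set_size distinct genes."""
    genesets = dict()
    for i in range(pop_size):
        uid = "gset_size{}_num{}".format(set_size, i)
        # replace = False avoids repeated genes within one gene set
        # (rand.choice samples with replacement by default)
        genesets[uid] = sorted(list(rand.choice(all_genes, set_size, replace = False)))

    assert len(genesets) == pop_size, "Identifiers are not unique!"

    return genesets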
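Because `numpy.random.choice` samples with replacement unless `replace=False` is passed, it can be worth checking that a generated gene set contains no repeated genes. The sketch below illustrates the difference on a small made-up pool of gene IDs. The `toy_genes` array, the `has_duplicates` helper, and the separate `RandomState` generator are all assumptions for illustration and are not part of the original analysis.

```python
import numpy as np

# Separate generator, so the notebook's global seed is left untouched.
rng = np.random.RandomState(0)

# Hypothetical gene pool; the real notebook samples from all_genes.
toy_genes = np.array(["gene{}".format(i) for i in range(20)])

with_repl = rng.choice(toy_genes, 15)                    # default: replace=True
without_repl = rng.choice(toy_genes, 15, replace=False)  # distinct genes only

def has_duplicates(geneset):
    """Return True if any gene appears more than once in the gene set."""
    return len(set(geneset)) < len(geneset)

print(has_duplicates(with_repl))     # may be True: repeats are allowed
print(has_duplicates(without_repl))  # False
```

With a pool of about 11000 genes, repeats become likely for the larger set sizes when sampling with replacement, so a check like this is most useful for sets near the 500-gene upper bound.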

In [11]:
step = 10
pop_size = 10000

for num in tqdm(range(num_min, num_max + 1, step)):  # sizes never exceed num_max
    res = rand_set(num, pop_size)
    
    fname = os.path.join("genesets", "size_{}.json".format(num))    
    with open(fname, "w") as fout:
        json.dump(res, fout, sort_keys = True, indent = 4)

100%|██████████| 50/50 [02:42<00:00,  5.97s/it]


Model fitting for these gene sets will be done by another program.
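A downstream program can read the gene sets back with `json.load`. The following is a minimal sketch, assuming only the file layout written above (a JSON object mapping each identifier to a sorted list of WormBase IDs). It writes a small stand-in file with made-up contents to a temporary folder so that it runs on its own. The gene IDs and the file name `size_2.json` are hypothetical.

```python
import json
import os
import tempfile

# Stand-in for a file such as genesets/size_10.json: same layout, made-up contents.
example = {
    "gset_size2_num0": ["WBGene00000001", "WBGene00000002"],
    "gset_size2_num1": ["WBGene00000003", "WBGene00000004"],
}

tmp_dir = tempfile.mkdtemp()
fname = os.path.join(tmp_dir, "size_2.json")
with open(fname, "w") as fout:
    json.dump(example, fout, sort_keys = True, indent = 4)

# Read the gene sets back, as a model fitting program might.
with open(fname) as fin:
    loaded = json.load(fin)

for uid, genes in sorted(loaded.items()):
    print(uid, len(genes))
```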