# Testing Bulk Dataset - GSE53481

Here we test the bulk RNA-seq dataset available under GEO accession [GSE53481](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE53481), published with the paper [Cell-Cycle Control of Developmentally Regulated Transcription Factors Accounts for Heterogeneity in Human Pluripotent Cells](https://doi.org/10.1016/j.stemcr.2013.10.009).

<div id="toc"></div>

## Necessary Imports

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


In [8]:
import sys

# Relative paths to the project code and data directories
code = "./../../code/"
data = "./../../data/"
sys.path.append(code)  # make modules in the code directory importable

import pandas
import pypairs as pairs
from sklearn.preprocessing import QuantileTransformer
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import numpy as np
from pathlib import Path
from tqdm import tqdm_notebook as tqdm
import helper
import timeit

# Render plotly figures inline in the notebook
init_notebook_mode(connected=True)

## Loading Oscope cell-cycle (CC) only marker pairs

In [3]:
cc_marker = helper.load_ocope_marker(data)

[__set_matrix] Original Matrix 'x' has shape 19084 x 247
[__set_matrix] Removed 16689 genes that were not in 'subset_genes'. 2395 genes remaining.
[__set_matrix] Removed 61 genes that were not expressed in any samples. 2334 genes remaining.
[__set_matrix] Removed 0 samples that were not annotated in 'phases'. 247 samples remaining.
[__set_matrix] Matrix truncation done. Working with 2334 genes for 247 samples.
[sandbag] Identifying marker pairs...Processing in parallel with 10 processes...
 Done!
[sandbag] Identified 8146 marker pairs (phase: count): {'G1': 2575, 'S': 4101, 'G2M': 1470}


## Loading the human embryonic stem cell dataset GSE53481

In [5]:
# Read the tab-separated expression table
gencounts_GSE53481 = pandas.read_csv(Path(data) / "GSE53481_humanRNAseq.txt", sep='\t')
# Keep only the gene symbol, i.e. everything after the last underscore in "GENE"
genes = [s[s.rindex('_') + 1:] for s in gencounts_GSE53481["GENE"]]
gencounts_GSE53481["GENE"] = genes
gencounts_GSE53481.set_index("GENE", inplace=True)
gencounts_GSE53481.head(10)

GENE,H1.DN,H1.KO2,H1.AzLow,H1.AzHigh,H2.DN,H2.KO2,H2.AzLow,H2.AzHigh,H3.DN,H3.KO2,H3.AzLow,H3.AzHigh
LOC100289255,0.0,0.0,0.02,0.02,0.0,0.01,0.03,0.03,0.0,0.01,0.02,0.03
LOC644656,1.76,1.81,0.73,0.54,2.26,1.71,0.54,0.78,1.63,1.8,1.08,0.64
LOC646903,1.49,1.14,0.61,0.35,1.94,2.01,0.72,0.65,1.12,1.42,0.69,0.44
FLJ36644,0.11,0.12,0.0,0.01,0.12,0.2,0.07,0.03,0.15,0.16,0.12,0.05
LOC284454,0.5,0.99,0.19,0.22,0.85,0.72,0.38,0.4,0.66,0.81,0.36,0.58
LOC149773,0.01,0.0,0.03,0.01,0.03,0.01,0.06,0.02,0.07,0.02,0.04,0.03
LOC100131176,0.21,0.26,0.0,0.0,0.34,0.56,0.2,0.0,1.0,0.38,0.35,0.0
LOC100131366,0.19,0.03,0.31,0.29,0.0,0.3,0.96,0.71,0.3,0.0,0.71,0.94
FLJ42351,0.1,0.06,0.22,0.33,0.15,0.0,0.25,0.18,0.05,0.15,0.25,0.7
LOC392232,0.09,0.08,0.02,0.06,0.02,0.01,0.0,0.02,0.07,0.02,0.01,0.06
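
The gene symbols in this table are obtained by keeping everything after the last underscore in the `GENE` column. A minimal sketch of that step follows. The raw format of the `GENE` column is not shown in this notebook, so the example identifiers are hypothetical. `rsplit` is used here as an alternative to `rindex`, since it does not raise an error when an identifier contains no underscore:

```python
def gene_symbol(identifier: str) -> str:
    """Return the part after the last underscore, or the whole string if there is none."""
    return identifier.rsplit("_", 1)[-1]


# Hypothetical identifiers, for illustration only
raw_ids = ["NM_001_GENEA", "chr1_GENEB", "GENEC"]
symbols = [gene_symbol(s) for s in raw_ids]
print(symbols)
```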


In [9]:
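# Transpose to samples x genes so that each gene is one feature (column)
x = gencounts_GSE53481.T.values

# Quantile-transform every gene across the 12 samples
X_std = QuantileTransformer().fit_transform(x.astype(float))

# Transpose back to genes x samples and restore the gene and sample labels
gencounts_GSE53481_Qnorm = pandas.DataFrame(X_std.T, index=gencounts_GSE53481.index, columns=gencounts_GSE53481.columns)

gencounts_GSE53481_Qnorm.head(10)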

GENE,H1.DN,H1.KO2,H1.AzLow,H1.AzHigh,H2.DN,H2.KO2,H2.AzLow,H2.AzHigh,H3.DN,H3.KO2,H3.AzLow,H3.AzHigh
LOC100289255,1e-07,1e-07,0.6361361,0.6361361,1e-07,0.4194194,0.9999999,0.9999999,1e-07,0.4194194,0.636136,0.9999999
LOC644656,0.7273273,0.9089138,0.2728729,1e-07,0.9999999,0.6364466,1e-07,0.3634222,0.5455312,0.8185142,0.454506,0.1818422
LOC646903,0.8178995,0.6357958,0.1820635,1e-07,0.9094977,0.9999999,0.4544619,0.2727273,0.5455406,0.7275551,0.363697,0.09079514
FLJ36644,0.4547405,0.6361361,1e-07,0.09078309,0.6361361,0.9999999,0.3635214,0.1818182,0.81845,0.9089616,0.636136,0.2727273
LOC284454,0.4545657,0.9999999,1e-07,0.09058149,0.9089687,0.7271716,0.2727273,0.3634332,0.6364169,0.8183809,0.182107,0.5454545
LOC149773,0.1816817,1e-07,0.6361361,0.1816817,0.6361361,0.1816817,0.9092169,0.4094094,0.9999999,0.4094094,0.81804,0.6361361
LOC100131176,0.4544741,0.5454075,1e-07,1e-07,0.6365918,0.9089923,0.3642466,1e-07,0.9999999,0.8179049,0.727013,1e-07
LOC100131366,0.2728443,0.1814285,0.6356982,0.3641536,1e-07,0.5,0.9999999,0.7727728,0.5,1e-07,0.772773,0.9096284
FLJ42351,0.2726727,0.1814858,0.6364169,0.9089548,0.4094094,1e-07,0.7727728,0.5454278,0.0910485,0.4094094,0.772773,0.9999999
LOC392232,0.9999999,0.9090909,0.4089089,0.6816817,0.4089089,0.1361361,1e-07,0.4644645,0.8181818,0.4016517,0.136136,0.6816817
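
Because the matrix is transposed before the transform, `QuantileTransformer` treats every gene as one feature and maps its 12 sample values to approximately uniform ranks in [0, 1]. For a gene without tied values, the result is close to `rank / (n - 1)`. The sketch below is a plain-Python approximation of that mapping, not scikit-learn's implementation (which interpolates between quantiles, and whose output here shows the extremes as 1e-07 and 0.9999999 rather than exactly 0 and 1). It reproduces the LOC646903 row to about three decimals:

```python
def rank_scale(values):
    """Map distinct values to rank / (n - 1), i.e. evenly spaced points in [0, 1]."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    scaled = [0.0] * len(values)
    for rank, idx in enumerate(order):
        scaled[idx] = rank / (len(values) - 1)
    return scaled


# Raw LOC646903 values for the 12 samples, copied from the raw expression table
loc646903 = [1.49, 1.14, 0.61, 0.35, 1.94, 2.01, 0.72, 0.65, 1.12, 1.42, 0.69, 0.44]
approx = rank_scale(loc646903)
print([round(v, 4) for v in approx])
```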


In [11]:
# Score every sample against the marker pairs and predict its cell cycle phase
GSE53481_prediction = pairs.cyclone(gencounts_GSE53481_Qnorm, cc_marker, min_pairs=1, verbose=True)

[__set_matrix] Original Matrix 'x' has shape 510 x 12
[__set_matrix] Matrix truncation done. Working with 510 genes for 12 samples.
[cyclone] Preparing marker pairs, where at least one gene was not present in 'x'... Done!
[cyclone] Removed 8102 marker pairs. 8146 marker pairs remaining.
[cyclone] Calculating scores and predicting cell cycle phase... Done!
[cyclone] Calculated scores and prediction (phase: count): G1: 5, S: 2, G2M: 5


In [12]:
GSE53481_prediction_table = helper.get_prediction_table(GSE53481_prediction)
helper.DataTable(GSE53481_prediction_table)

sample,G1,G2M,S,G1_norm,G2M_norm,S_norm,prediction
H1.DN,0.606212,0.215075,0.657603,0.40991,0.14543,0.44466,G1
H1.KO2,0.659639,0.099,0.751773,0.436728,0.065545,0.497727,G1
H1.AzLow,0.429719,0.174174,0.798193,0.306485,0.124225,0.56929,S
H1.AzHigh,0.082495,0.925852,0.02719,0.079664,0.894079,0.026257,G2M
H2.DN,0.956827,0.174349,0.337011,0.651707,0.118751,0.229542,G1
H2.KO2,0.854985,0.001014,0.644535,0.569787,0.000676,0.429537,G1
H2.AzLow,0.496994,0.751503,0.324899,0.315873,0.477631,0.206495,G2M
H2.AzHigh,0.070288,0.988978,0.0,0.066355,0.933645,0.0,G2M
H3.DN,0.674372,0.58676,0.09697,0.496555,0.432044,0.071401,G1
H3.KO2,0.178068,0.038038,0.311927,0.33723,0.072037,0.590733,S
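
Each `*_norm` column in the table above is the raw score divided by the sum of the three scores of that sample. The `prediction` column is consistent with the decision rule of the original cyclone method: assign G1 or G2M if that score is at least 0.5 (taking the larger one when both qualify), otherwise assign S. The following sketch re-implements both steps on rows copied from the table. It is an illustration inferred from the output, not the pypairs source code, and the function names are hypothetical:

```python
def normalize(g1, g2m, s):
    """Divide each score by the sum of all three scores."""
    total = g1 + g2m + s
    return g1 / total, g2m / total, s / total


def assign_phase(g1, g2m, threshold=0.5):
    """S if neither G1 nor G2M reaches the threshold, else the larger of the two."""
    if g1 < threshold and g2m < threshold:
        return "S"
    return "G1" if g1 >= g2m else "G2M"


# (G1, G2M, S) raw scores copied from the prediction table
rows = {
    "H1.DN": (0.606212, 0.215075, 0.657603),
    "H1.AzLow": (0.429719, 0.174174, 0.798193),
    "H2.AzLow": (0.496994, 0.751503, 0.324899),
    "H3.DN": (0.674372, 0.58676, 0.09697),
}

phases = {name: assign_phase(g1, g2m) for name, (g1, g2m, s) in rows.items()}
h1_dn_norm = normalize(*rows["H1.DN"])
print(phases)
print([round(v, 5) for v in h1_dn_norm])
```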


## Plot prediction quality

In [14]:
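# True phases in sample order: DN and KO2 -> G1, AzLow -> S, AzHigh -> G2M, repeated for H1, H2, H3
GSE53481_labels = ['G1', 'G1','S','G2M','G1', 'G1','S','G2M','G1', 'G1','S','G2M']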

In [15]:
GSE53481_evaluation = helper.evaluate_prediction(GSE53481_prediction_table, GSE53481_labels)

F1 Score: G1: 0.9090909090909091, S: 0.4, G2M: 0.7499999999999999
Recall: G1: 0.8333333333333334, S: 0.3333333333333333, G2M: 1.0
Precision: G1: 1.0, S: 0.5, G2M: 0.6
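
The reported scores can be re-derived by hand from the true labels and the predicted phases. In the sketch below, the first ten predictions are read from the prediction table. The last two (H3.AzLow and H3.AzHigh) are not shown in that table and are assumed to be G2M, which is what the cyclone summary `G1: 5, S: 2, G2M: 5` implies. `per_class_metrics` is an illustrative function, not part of `helper`:

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


labels = ["G1", "G1", "S", "G2M"] * 3
predicted = [
    "G1", "G1", "S", "G2M",    # H1, from the prediction table
    "G1", "G1", "G2M", "G2M",  # H2, from the prediction table
    "G1", "S",                 # H3.DN, H3.KO2, from the prediction table
    "G2M", "G2M",              # H3.AzLow, H3.AzHigh: ASSUMED from the summary counts
]

metrics = {phase: per_class_metrics(labels, predicted, phase) for phase in ("G1", "S", "G2M")}
for phase, (precision, recall, f1) in metrics.items():
    print(phase, round(precision, 3), round(recall, 3), round(f1, 3))
```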


In [16]:
iplot(helper.plot_evaluation(*GSE53481_evaluation, xaxis=["G1","S","G2M"], xaxislbl="Phase"))

In [18]:
# H1 samples are rows 0-3; score columns are ordered G1 (0), G2M (1), S (2)
sample1_g1 = [GSE53481_prediction_table.iloc[i, 0] for i in range(4)]
sample1_s = [GSE53481_prediction_table.iloc[i, 2] for i in range(4)]
sample1_g2m = [GSE53481_prediction_table.iloc[i, 1] for i in range(4)]
# Named `fig` to avoid shadowing the `plot` function imported from plotly.offline
fig = helper.get_prediction_plot(sample1_g1, sample1_s, sample1_g2m, t="pie", xaxis=['DN', 'KO2', 'AzLow', 'AzHigh'], xaxislbl="H1", title="Assignment of hESC H1 cells", width=950, height=950)
iplot(fig)

In [19]:
# Average each condition over the three replicates: rows i, i+4 and i+8 are H1, H2 and H3
avg_g1 = [np.average(GSE53481_prediction_table.iloc[[i, i+4, i+8], 0].values) for i in range(4)]
avg_s = [np.average(GSE53481_prediction_table.iloc[[i, i+4, i+8], 2].values) for i in range(4)]
avg_g2m = [np.average(GSE53481_prediction_table.iloc[[i, i+4, i+8], 1].values) for i in range(4)]
# Named `fig` to avoid shadowing the `plot` function imported from plotly.offline
fig = helper.get_prediction_plot(avg_g1, avg_s, avg_g2m, t="pie", xaxis=['DN', 'KO2', 'AzLow', 'AzHigh'], xaxislbl="Average", title="Average assignment of hESC all cells", width=950, height=950)
iplot(fig)
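
The `iloc[[i, i+4, i+8], ...]` indexing used for the averages relies on the samples being ordered as four conditions repeated for H1, H2 and H3. An order-independent alternative is to group by the condition suffix of the sample name. The sketch below uses the G1 scores of the DN and KO2 samples from the prediction table (the other two conditions are left out because their H3 rows are not shown). `mean_by_condition` is a hypothetical helper, not part of `helper`:

```python
from collections import defaultdict
from statistics import mean


def mean_by_condition(scores):
    """Average scores over replicates, grouping 'H1.DN', 'H2.DN', ... by the part after the dot."""
    groups = defaultdict(list)
    for sample, value in scores.items():
        condition = sample.split(".", 1)[1]
        groups[condition].append(value)
    return {condition: mean(values) for condition, values in groups.items()}


# G1 scores copied from the prediction table
g1_scores = {
    "H1.DN": 0.606212, "H1.KO2": 0.659639,
    "H2.DN": 0.956827, "H2.KO2": 0.854985,
    "H3.DN": 0.674372, "H3.KO2": 0.178068,
}
g1_means = mean_by_condition(g1_scores)
print({condition: round(value, 6) for condition, value in g1_means.items()})
```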