# Testing Bulk Dataset - GSE71456

Here we test cell cycle phase prediction on the bulk dataset available under GEO accession number [GSE71456](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71456). Link to the paper: [Derivation and differentiation of haploid human embryonic stem cells](https://www.nature.com/articles/nature17408)

<div id="toc"></div>

## Necessary Imports

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


In [2]:
import sys
code = "./../../code/"
data = "./../../data/"
sys.path.append(code)
import pandas
import pypairs as pairs
from sklearn.preprocessing import QuantileTransformer
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import numpy as np
from pathlib import Path
from tqdm import tqdm_notebook as tqdm
import helper
import timeit

init_notebook_mode(connected=True)

## Loading Oscope CC only marker pairs

In [3]:
cc_marker = helper.load_ocope_marker(data)

[__set_matrix] Original Matrix 'x' has shape 19084 x 247
[__set_matrix] Removed 16689 genes that were not in 'subset_genes'. 2395 genes remaining.
[__set_matrix] Removed 61 genes that were not expressed in any samples. 2334 genes remaining.
[__set_matrix] Removed 0 samples that were not annotated in 'phases'. 247 samples remaining.
[__set_matrix] Matrix truncation done. Working with 2334 genes for 247 samples.
[sandbag] Identifying marker pairs...Processing in parallel with 10 processes...
 Done!
[sandbag] Identified 8146 marker pairs (phase: count): {'G1': 2575, 'S': 4101, 'G2M': 1470}


## Loading human embryonic stem cells - GSE71456

In [4]:
gencounts_GSE71456 = pandas.read_csv(
    Path(data + "GSE71456_Samples_RPKM.csv"), sep='\t', index_col=0, 
    usecols=[1,4,5,6,7,8,9,10,11,12,13,14,15,16]
)
gencounts_GSE71456.head()

Gene name,pES10 h-G1 rep1,pES10 h-G1 rep2,pES10 d-G1 rep1,pES10 d-G1 rep2,h-pES10 d-G2/M,d-pES10 d-G2/M,pES12 h-G1 rep1,pES12 h-G1 rep2,pES12 d-G1 rep1,pES12 d-G1 rep2,pES10 NPC h-G1,pES10 EB h-G1,pES10 EB d-G1
DDX11L1,0.0,0.0,0.0,0.0,0.0,0.0,0.06253,0.0,0.071163,0.030771,0.0,0.0,0.0
WASH7P,0.544188,0.611637,0.63454,0.750842,0.613818,0.78038,0.859602,0.783642,0.921835,0.858255,1.36685,1.00679,0.63645
MIR1302-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
FAM138A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
OR4G4P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
x = gencounts_GSE71456.T.values

X_std = QuantileTransformer().fit_transform(x.astype(float))

gencounts_GSE71456_Qnorm = pandas.DataFrame(X_std.T, index=gencounts_GSE71456.index, columns=gencounts_GSE71456.columns)


invalid value encountered in subtract
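
scikit-learn's `QuantileTransformer` works column by column on an array of shape (samples, features). Because `x` is the transposed table (samples × genes), each gene is transformed across the 13 samples. The following is a simplified, pure-Python sketch of that per-gene mapping for a uniform output distribution. It assumes all values are distinct; `rank_to_uniform` is a hypothetical helper written for illustration, and scikit-learn interpolates between quantiles and treats ties differently.

```python
def rank_to_uniform(values):
    """Map distinct values to evenly spaced positions in [0, 1] by rank.

    Simplified stand-in for a per-column quantile transform with a uniform
    output distribution. Ties are NOT handled the way scikit-learn does.
    """
    n = len(values)
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        out[idx] = rank / (n - 1)
    return out


# WASH7P RPKM values of the first five samples from the table above.
wash7p = [0.544188, 0.611637, 0.63454, 0.750842, 0.613818]
transformed = rank_to_uniform(wash7p)
print(transformed)
```

After the mapping, only the rank of each sample within the gene is retained; the original RPKM magnitudes are discarded.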



In [6]:
GSE71456_prediction = pairs.cyclone(gencounts_GSE71456_Qnorm, cc_marker, verbose=True)

[__set_matrix] Original Matrix 'x' has shape 63657 x 13
[__set_matrix] Matrix truncation done. Working with 63657 genes for 13 samples.
[cyclone] Preparing marker pairs, where at least one gene was not present in 'x'... Done!
[cyclone] Removed 64 marker pairs. 8146 marker pairs remaining.
[cyclone] Calculating scores and predicting cell cycle phase... Done!
[cyclone] Calculated scores and prediction (phase: count): S: 2, G1: 8, G2M: 3


In [7]:
GSE71456_prediction_table = helper.get_prediction_table(GSE71456_prediction)
helper.DataTable(GSE71456_prediction_table)

sample,G1,G2M,S,G1_norm,G2M_norm,S_norm,prediction
pES10 h-G1 rep1,0.37,0.24,0.918,0.242147,0.157068,0.600785,S
pES10 h-G1 rep2,0.999,0.862,0.0,0.536808,0.463192,0.0,G1
pES10 d-G1 rep1,0.712,0.661,0.51,0.37812,0.351036,0.270844,G1
pES10 d-G1 rep2,0.548,0.018,1.0,0.349936,0.011494,0.63857,G1
h-pES10 d-G2/M,0.371,1.0,0.0,0.270605,0.729395,0.0,G2M
d-pES10 d-G2/M,0.0,1.0,0.001,0.0,0.999001,0.000999,G2M
pES12 h-G1 rep1,0.789,0.722,0.0,0.522171,0.477829,0.0,G1
pES12 h-G1 rep2,0.366,0.028,0.985,0.26541,0.020305,0.714286,S
pES12 d-G1 rep1,0.999,0.0,1.0,0.49975,0.0,0.50025,G1
pES12 d-G1 rep2,0.909,0.086,0.289,0.707944,0.066978,0.225078,G1
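
The `*_norm` columns in the table above equal each raw score divided by the sum of the three raw scores. The `prediction` column is consistent with the usual cyclone rule: assign S when both the G1 and G2M scores are below 0.5, otherwise assign whichever of G1 and G2M is larger. The sketch below reproduces both for three rows of the table. It is an illustration under these assumptions, not the `pypairs` implementation; the function names and the tie-breaking choice are hypothetical.

```python
def normalize_scores(g1, g2m, s):
    """Divide each raw score by the sum of all three scores."""
    total = g1 + g2m + s
    if total == 0:
        return 0.0, 0.0, 0.0
    return g1 / total, g2m / total, s / total


def assign_phase(g1, g2m, threshold=0.5):
    """Assumed rule: S if G1 and G2M are both below the threshold,
    otherwise the larger of G1 and G2M (ties go to G1 here by assumption)."""
    if g1 < threshold and g2m < threshold:
        return "S"
    return "G1" if g1 >= g2m else "G2M"


# Raw (G1, G2M, S) scores copied from the prediction table.
rows = {
    "pES10 h-G1 rep1": (0.37, 0.24, 0.918),
    "pES10 d-G1 rep2": (0.548, 0.018, 1.0),
    "h-pES10 d-G2/M": (0.371, 1.0, 0.0),
}

for sample, (g1, g2m, s) in rows.items():
    g1_norm, g2m_norm, s_norm = normalize_scores(g1, g2m, s)
    print(sample, round(g1_norm, 6), round(g2m_norm, 6), round(s_norm, 6),
          assign_phase(g1, g2m))
```

Note that the S score does not enter the assignment rule directly, which is why `pES10 d-G1 rep2` is called G1 despite an S score of 1.0.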


## Plot prediction

In [8]:
labels = ["G1","G1","G1","G1","G2M","G2M","G1","G1","G1","G1","G1","G1","G1"]
GSE71456_evaluation = helper.evaluate_prediction(GSE71456_prediction_table, labels)
helper.plot_evaluation(*GSE71456_evaluation, xaxis=["G1","S","G2M"], xaxislbl="Phase")


F-score is ill-defined and being set to 0.0 in labels with no true samples.


Recall is ill-defined and being set to 0.0 in labels with no true samples.



F1 Score: G1: 0.8421052631578948, S: 0.0, G2M: 0.8
Reacall: G1: 0.7272727272727273, S: 0.0, G2M: 1.0 
Precision: G1: 1.0, S: 0.0, G2M: 0.6666666666666666 


[Figure: scatter plot of F1-Score (red circles), Recall-Score (blue squares), and Precision-Score (green triangles) for the phases G1, S, and G2M. x-axis: Phase; y-axis: F1, Recall, Precision Score.]
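
The reported scores can be reproduced from per-phase counts of true positives, false positives, and false negatives. The counts below are not printed by `helper`; they are inferred from the 13 labels (11 G1, 2 G2M) and the predicted phase counts logged by cyclone (G1: 8, S: 2, G2M: 3). `precision_recall_f1` is a hypothetical helper written for this sketch. Since no sample is labeled S, recall for S is 0/0, which explains the "ill-defined" warnings above.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 for undefined ratios."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (true positives, false positives, false negatives) per phase,
# inferred from the labels and the cyclone log (an assumption of this sketch).
counts = {
    "G1": (8, 0, 3),   # 8 of 11 true G1 samples predicted as G1
    "S": (0, 2, 0),    # 2 samples predicted as S, none labeled S
    "G2M": (2, 1, 0),  # both true G2M samples found, plus 1 extra
}

scores = {phase: precision_recall_f1(*c) for phase, c in counts.items()}
for phase, (precision, recall, f1) in scores.items():
    print(phase, round(precision, 4), round(recall, 4), round(f1, 4))
```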

In [10]:
# Raw scores for all samples (columns: 0 = G1, 1 = G2M, 2 = S); range(0, 12) skipped the last sample
sample1_g1 = GSE71456_prediction_table.iloc[:, 0].tolist()
sample1_s = GSE71456_prediction_table.iloc[:, 2].tolist()
sample1_g2m = GSE71456_prediction_table.iloc[:, 1].tolist()
plot = helper.get_prediction_plot(
    sample1_g1, sample1_s, sample1_g2m, t="scatter", xaxis=GSE71456_prediction_table.index.tolist(),
    xaxislbl="", width=950, height=950, title="Assignment of hESC")

In [11]:
iplot(plot)