Dataset #6

FarshidShekari · 2018-12-26T20:23:03Z

Hi
can you save your dataset in CSV file and send the link for download?
thanks for it

yil8 · 2018-12-27T20:40:12Z

@FarshidShekari I'm not sure which dataset you are referring to. For the original GEO/GTEx dataset, can you please follow the instructions on the README page for data accessing

FarshidShekari · 2018-12-28T11:12:24Z

I send you an email, I grateful if you reply.

yil8 · 2018-12-28T18:50:13Z

@FarshidShekari I've replied your email.

FarshidShekari · 2018-12-28T19:36:17Z

thanks, but I need that files data in CSV files separately.

yil8 · 2018-12-28T19:39:41Z

@FarshidShekari Sorry, but I only have the original data in gctx format and I'm using l1ktools v1.1 to read/process the data as documented in the README.

FarshidShekari · 2018-12-28T20:21:01Z

thank you.

FarshidShekari · 2018-12-29T15:06:00Z

Can you explain more about your datasets?

yil8 · 2018-12-29T23:05:27Z

@FarshidShekari All the details have been documented in the published paper and supplementary:
https://www.biorxiv.org/content/early/2015/12/15/034421

FarshidShekari · 2019-01-01T21:01:25Z

thanks, @yil8.
I want to know, what's happen in bgedv2.py file?

FarshidShekari · 2019-01-02T07:44:09Z

What's the GD462.GeneQuantRPKM.50FN.samplename.resk10.txt? I can't find a guidance file about it, why used it?

yil8 · 2019-01-02T19:19:22Z

@FarshidShekari The preprocess.sh contains all the details you can go through.

FarshidShekari · 2019-01-02T20:04:44Z

Yes, that's true, It just shows the steps for preparing data for the model. I read your code in each file for each step and for that reason, I asked the above question.

yil8 · 2019-01-02T21:15:45Z

@FarshidShekari The 1000G data is used as the validation dataset for RNA-Seq, see code here https://github.com/uci-cbcl/D-GEX/blob/master/H1_0-4760.py#L152 . In the paper, it says For the RNA-Seq platform, we used GEO-tr for training, the 1000 Genomes data for validation (denoted as 1000G-va), and the GTEx data for testing (denoted as GTEx-te). The validation data 1000G-va was used to do model selection and parameter tuning for all the methods.

FarshidShekari · 2019-01-04T11:21:44Z

thanks, @yil8. I am rewriting your code with cmapPy library and after this line, I can't uderstand what are you doing, Can you explain it? and when interpreter want to run this line raise below error, can you help me?
ValueError: Cross index must be 1 dimensional

yil8 · 2019-01-04T18:31:01Z

@FarshidShekari I highly recommend you not rewrite your own code with another packages other than provided within this repo since I'm not familiar with it and haven't tested it thoroughly. For line by line explanation/debugging of the code, I'm sorry but I'm afraid I won't have enough time for that. Would you please try to explore the original l1ktools as documented within the README.

FarshidShekari · 2019-01-05T15:39:08Z

I'm trying to use l1ktools but I get below error when running gct2npy file:
ImportError: No module named fcntl

FarshidShekari · 2019-01-07T09:50:09Z

@yil8 you run your code on linux or windows?

yil8 · 2019-01-07T18:34:01Z

@FarshidShekari Linux, probably ubuntu 14.04

FarshidShekari · 2019-01-07T21:11:31Z

thank you.

FarshidShekari · 2019-01-11T13:16:42Z

Hi, @yil8 Thank you for helping me.

@FarshidShekari The 1000G data is used as the validation dataset for RNA-Seq, see code here https://github.com/uci-cbcl/D-GEX/blob/master/H1_0-4760.py#L152 . In the paper, it says For the RNA-Seq platform, we used GEO-tr for training, the 1000 Genomes data for validation (denoted as 1000G-va), and the GTEx data for testing (denoted as GTEx-te). The validation data 1000G-va was used to do model selection and parameter tuning for all the methods.

I want to sumarize above your answer in below sentences:
You used GEO dataset (bgedv2_QNORM.gctx) for training and GTEx dataset (GTEx_RNASeq_RPKM_n2921x55993.gctx) for testing and 1000G dataset(GD462.GeneQuantRPKM.50FN.samplename.resk10.txt) for validation. Is correct my understanding?

yil8 · 2019-01-11T17:40:24Z

@FarshidShekari Yes, you are correct.

FarshidShekari · 2019-01-14T08:48:02Z

I want to know Why you loaded multiple files generated from GEO dataset in this below code:

    X_tr = np.load('bgedv2_X_tr_float64.npy')
    Y_tr = np.load('bgedv2_Y_tr_0-4760_float64.npy')
    Y_tr_target = np.array(Y_tr)
    X_va = np.load('bgedv2_X_va_float64.npy')  ------------>this line
    Y_va = np.load('bgedv2_Y_va_0-4760_float64.npy')------------>this line
    Y_va_target = np.array(Y_va) ------------>this line
    X_te = np.load('bgedv2_X_te_float64.npy')
    Y_te = np.load('bgedv2_Y_te_0-4760_float64.npy')
    Y_te_target = np.array(Y_te)

    X_1000G = np.load('1000G_X_float64.npy')
    Y_1000G = np.load('1000G_Y_0-4760_float64.npy')
    Y_1000G_target = np.array(Y_1000G)
    X_GTEx = np.load('GTEx_X_float64.npy')
    Y_GTEx = np.load('GTEx_Y_0-4760_float64.npy')
    Y_GTEx_target = np.array(Y_GTEx)

according to your previus answer, determined line what's for used?

yil8 · 2019-01-14T19:30:52Z

@FarshidShekari I recommend you get familiar with basic python and numpy operations, and then probably you will get through your issues quickly.

FarshidShekari · 2019-01-14T20:50:30Z

I have no problem with Numpy, just I want to know why used these for files bgedv2_X_va_float64.npy and bgedv2_Y_va_0-4760_float64.npy that generated from GEO dataset,?
Because we used GEO dataset for training and GTEx for testing and 1000G for validation.
thanks for your answering.

yil8 · 2019-01-15T19:46:13Z

@FarshidShekari There are two validation set, 1000G for RNA-Seq, and GEO-va for GEO(microarray) data. The original paper says

For the microarray platform, we used the GEO data for training, validation and testing. Specifically, we randomly partitioned the GEO data into ∼80% for training (88 807 samples denoted as GEO-tr), ∼10% for validation (11 101 samples denoted as GEO-va) and ∼10% for testing (11 101 samples denoted as GEO-te). The validation data GEO-va were used to do model selection and parameter tuning for all the methods.

5. Model selection was performed based on GEO-va for the GEO data and 1000G-va for the GTEx data. Training was run for 200 epochs. The model was evaluated on GEO-va and 1000G-va after each epoch, and the model with the best performance was saved respectively.

FarshidShekari · 2019-01-15T20:42:36Z

According to paper
one Model trained and saved in below order:
For the microarray platform, we used the GEO data for training, validation and testing. Specifically, we randomly partitioned the GEO data into ∼80% for training (88 807 samples denoted as GEO-tr), ∼10% for validation (11 101 samples denoted as GEO-va) and ∼10% for testing (11 101 samples denoted as GEO-te). The validation data GEO-va were used to do model selection and parameter tuning for all the methods.
I don't have any problem with the above sentences.

And second model, acording to paper is in below order:

For the RNA-Seq platform, we used GEO-tr for training, the
1000 Genomes data for validation (denoted as 1000G-va), and the GTEx data for testing (denoted as GTEx-te).  
The validation data 1000G-va was used to do model selection and parameter tuning for all the methods.

My qustion about the above sentences is: Why you used GEO-tr for training and after test with other platform test data?
and another question is, what's your mean from last sentence?

yil8 · 2019-01-15T23:27:10Z

@FarshidShekari

Why use GEO-tr for training and test with other playform:
Results on the GTEx data further demonstrate D-GEX, combined with the dropout regularization technique, achieves the best performance even where training and prediction were performed on datasets obtained from different platforms (microarray versus RNA-Seq). Such cross platforms generalizability implies the great potential of D-GEX to be applied to the LINCS program where training and prediction were also done separately on the microarray data and the L1000 data.
Which last sentence?

FarshidShekari · 2019-01-16T07:08:20Z

The validation data 1000G-va was used to do model selection and parameter tuning for all the methods.

yil8 · 2019-01-16T20:20:06Z

@FarshidShekari I'm not sure which part of this sentence confuses you, it literally means what it means

FarshidShekari · 2019-02-04T12:37:56Z

hi @yil8 , How you compare D-GEX result with linear regression for all target genes?

yil8 · 2019-02-05T01:43:40Z

@FarshidShekari I think it's all written in the paper

FarshidShekari · 2019-02-05T07:04:31Z

I saw in the paper, My question is how you calculated 0.4702 + 0.1234 for Linear Regression? (Are you averaged this error loss for all target genes).

yil8 · 2019-02-05T15:14:31Z

@FarshidShekari Like I said it's written in the paper :
We define the overall error as the average MAE over all target genes, and use it to evaluate the general predictive performance.
Numerics after ‘±’ are the standard deviations of prediction errors over all target genes.

FarshidShekari · 2019-02-12T06:32:57Z

Hi @yil8, I used linearRegression in this link for predict each taget gene and MAE for each one is 7.15***(each star is different number) and when average MAE on all Linear models I get 7.15292082.(number of all linearmodels is 9520) and this value with different with your report.

yil8 · 2019-02-12T09:17:21Z

@FarshidShekari I guess there must be something wrong with your code, the error of 7.15 is significantly higher than the error reported in the paper (all of which are less than 1)

FarshidShekari · 2019-02-12T11:32:30Z

This is my code:

import pandas as pd
import sklearn.linear_model
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
import numpy as np



X_tr = np.load('Datasets/bgedv2_X_tr_float64.npy')
Y_tr = np.array(np.load('Datasets/bgedv2_Y_tr_0-4760_float64.npy'))



X_te = np.load('Datasets/bgedv2_X_te_float64.npy')
Y_te = np.array(np.load('Datasets/bgedv2_Y_te_0-4760_float64.npy'))



modelCnt=len(Y_tr.transpose())

sum=0;
sum2=0;

for i in range(modelCnt):
    model = LinearRegression().fit(X_tr, Y_tr[:, i:i + 1])
    # r2 = model.score(X_tr, Y_tr[:, i:i + 1])
    pre = model.predict(X_te)
    mse = mean_absolute_error(Y_te[:, i:i + 1], pre)
    sum+=mse
    print(mse)
    # print(r2)
    print("---------------------------------------------")

print(modelCnt)
print("AVG: ",sum/modelCnt)
Y_te=00
Y_tr=00
Y_tr2 = np.array(np.load('Datasets/bgedv2_Y_tr_4760-9520_float64.npy'))
Y_te2 = np.array(np.load('Datasets/bgedv2_Y_te_4760-9520_float64.npy'))
modelCnt2=len(Y_tr2.transpose())

for i in range(modelCnt2):
    model = LinearRegression().fit(X_tr, Y_tr2[:, i:i + 1])
    # r2 = model.score(X_tr, Y_tr2[:, i:i + 1])
    pre = model.predict(X_te)
    mse = mean_absolute_error(Y_te2[:, i:i + 1], pre)
    sum2+=mse
    print(mse)
    # print(r2)
    print("---------------------------------------------")

print(modelCnt2)
print("AVG: ",sum2/modelCnt2)
sumF=(sum+sum2)
cnt=((modelCnt+modelCnt2))
print("**************************************")
print(sumF/cnt)

yil8 · 2019-02-12T14:26:46Z

@FarshidShekari I think your code looks fine. When you print(r2), what do they look like?

FarshidShekari · 2019-02-12T15:11:53Z

my code output is for two target gene:

5.90659864679571
0.7667017531506103
---------------------------------------------
6.008281658448795
0.6351918447933045

yil8 · 2019-02-13T04:02:59Z

@FarshidShekari Can you try to print the error of all 9520 genes? It seems some of your error looks reasonable, e.g. 0.766 0.635, while some others are pretty big.

FarshidShekari · 2019-02-13T06:13:39Z

@FarshidShekari Can you try to print the error of all 9520 genes? It seems some of your error looks reasonable, e.g. 0.766 0.635, while some others are pretty big.

Sorry, Not yet(because that system I use in the lab now working on other projects), I run it next week, When I rut it, I comment here.

FarshidShekari · 2019-02-18T10:24:37Z

Hi @yil8
Now, I'm running your code for 3 hidden layers, My question is, the output of this line and the next line reported in your paper?
and the second question is, are you averaged output error of models for this and this?

yil8 · 2019-02-19T20:07:10Z

@FarshidShekari 1. that's for validation dataset not for test dataset. 2. Yes, I averaged the output of those two.

FarshidShekari · 2019-02-20T06:08:06Z

@FarshidShekari Can you try to print the error of all 9520 genes? It seems some of your error looks reasonable, e.g. 0.766 0.635, while some others are pretty big.

Sorry, Not yet(because that system I use in the lab now working on other projects), I run it next week, When I rut it, I comment here.

I run it for 604 target gene and model score was 0.71730562913.

yil8 · 2019-02-20T18:37:35Z

@FarshidShekari That looks within the reasonable range of the numbers I reported. Can you do this for all the ~9000 target genes?

FarshidShekari · 2019-05-04T16:31:29Z

Hi dear
I want to know MAE that you reported in your paper is just between [0 1] or not(i.e your accuracy in reported error for GEO data set is 78% or 99.68)?
thanks for your reply.

yil8 · 2019-05-05T06:18:49Z

@FarshidShekari The two MAE on GEO and GTEx are reported in Table 1 and Table 2 in the original paper respectively, and they are neither 78% nor 99.68%.

FarshidShekari · 2019-05-05T07:41:18Z

thanks, dear @yil8

FarshidShekari · 2019-08-14T18:23:35Z

Hi @yil8
1- Why didn't you Quantile normalization in GTEx_reqnorm.py like bgedv2_reqnorm.py separately?
2- Why did you use bgedv2_GTEx_1000G_quantile_float64.npy file generated in bgedv2_reqnorm.py in Quantile normalization process?

thanks for the reply

yil8 · 2019-08-15T04:00:24Z

@FarshidShekari Sorry, but I may not be able to answer all your questions in details, given it's published about three years ago, and I'm currently mainly focusing on other business, but I'll try my best once I have time though.

FarshidShekari · 2019-08-15T06:59:15Z

your welcome @yil8 , If you could answer, I would grateful.

yil8 added the help wanted label Jan 5, 2019

yil8 closed this as completed Jan 14, 2019

Dataset #6

Dataset #6

Comments

FarshidShekari commented Dec 26, 2018

yil8 commented Dec 27, 2018

FarshidShekari commented Dec 28, 2018

yil8 commented Dec 28, 2018

FarshidShekari commented Dec 28, 2018

yil8 commented Dec 28, 2018

FarshidShekari commented Dec 28, 2018

FarshidShekari commented Dec 29, 2018

yil8 commented Dec 29, 2018

FarshidShekari commented Jan 1, 2019

FarshidShekari commented Jan 2, 2019

yil8 commented Jan 2, 2019

FarshidShekari commented Jan 2, 2019

yil8 commented Jan 2, 2019

FarshidShekari commented Jan 4, 2019

yil8 commented Jan 4, 2019

FarshidShekari commented Jan 5, 2019

FarshidShekari commented Jan 7, 2019

yil8 commented Jan 7, 2019

FarshidShekari commented Jan 7, 2019

FarshidShekari commented Jan 11, 2019

yil8 commented Jan 11, 2019

FarshidShekari commented Jan 14, 2019

yil8 commented Jan 14, 2019

FarshidShekari commented Jan 14, 2019

yil8 commented Jan 15, 2019

FarshidShekari commented Jan 15, 2019

yil8 commented Jan 15, 2019

FarshidShekari commented Jan 16, 2019

yil8 commented Jan 16, 2019

FarshidShekari commented Feb 4, 2019

yil8 commented Feb 5, 2019

FarshidShekari commented Feb 5, 2019

yil8 commented Feb 5, 2019

FarshidShekari commented Feb 12, 2019

yil8 commented Feb 12, 2019

FarshidShekari commented Feb 12, 2019

yil8 commented Feb 12, 2019

FarshidShekari commented Feb 12, 2019

yil8 commented Feb 13, 2019

FarshidShekari commented Feb 13, 2019

FarshidShekari commented Feb 18, 2019

yil8 commented Feb 19, 2019

FarshidShekari commented Feb 20, 2019

yil8 commented Feb 20, 2019

FarshidShekari commented May 4, 2019

yil8 commented May 5, 2019

FarshidShekari commented May 5, 2019

FarshidShekari commented Aug 14, 2019 • edited Loading

yil8 commented Aug 15, 2019

FarshidShekari commented Aug 15, 2019 • edited Loading

FarshidShekari commented Aug 14, 2019 •

edited

Loading

FarshidShekari commented Aug 15, 2019 •

edited

Loading