Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem to read Numeric data #28

Closed
TinaNeik opened this issue Nov 5, 2018 · 42 comments
Closed

Problem to read Numeric data #28

TinaNeik opened this issue Nov 5, 2018 · 42 comments
Labels
input error Wrong input data type or format

Comments

@TinaNeik
Copy link

TinaNeik commented Nov 5, 2018

Hi,

I'm struggling to get my numeric data to read in MVP.

I prepared the numeric format based on the instructions at https://github.com/XiaoleiLiuBio/MVP

It looks like this (Just a portion shown here. The total number of row/SNP is 52157 and total number of column/individual is 68), no header and row names:

NaN | NaN | NaN | NaN | NaN | NaN | NaN
0 | 0 | 0 | 0 | 2 | 0 | NaN
2 | 2 | 2 | 2 | 2 | 2 | 0
2 | 2 | 2 | 2 | 2 | 2 | 2
2 | 0 | 2 | 2 | 2 | 0 | 0
0 | NaN | 0 | 0 | 0 | 2 | 2
NaN | NaN | NaN | NaN | NaN | NaN | NaN
0 | 0 | 0 | 0 | 0 | 0 | 0

I understand that there shouldn't be NaN or missing data in numeric format. I don't know how to overcome this. Could you please help me?

I also got this error:

Error in big.matrix(nrow = numRows, ncol = createCols, type = type, dimnames = list(rowNames, :
A big.matrix must have at least one row and one column

I used this code to try to fix:

fileNum=matrix(1:3546676,ncol=68)

But still got the same error.

May I ask what should I do?

Appreciate your help.

Many thanks.

Regards,
Tina

@hyacz hyacz added the input error Wrong input data type or format label Nov 5, 2018
@hyacz
Copy link
Collaborator

hyacz commented Nov 6, 2018

Hi, Tina

First, you should do quality control and impute before you do GWAS.

your numeric file is separated by ' | ', you need to replace it with tab. I always recommend using a text editor (VSCode, NotePad++, or whatever you like) to format your data when dealing with small data.

Then delete the mark with too high the miss rate(>10%) and the mark with the low MAF value(<0.05), and then impute it to make sure there is no missing in your genotype file.

I usually do quality control and imputation from files in Plink format or VCF format. I don't know if there is a tool to do quality control and impute on data in numeric format. There are two functions MVP.Data.impute and MVP.Data.QC in the devel branch, you can find a way to read your file into R, then refer to these two functions.

https://github.com/XiaoleiLiuBio/MVP/blob/370bf6bab733f03b970dc6a33c83850f43fbabd6/R/MVP.Data.r#L501
https://github.com/XiaoleiLiuBio/MVP/blob/370bf6bab733f03b970dc6a33c83850f43fbabd6/R/MVP.Data.r#L552

Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 6, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 7, 2018

Hi, Tina

Sorry to reply to you so late.
I recommend that you convert the original file to a format supported by Plink, such as .ped and .map(see https://www.cog-genomics.org/plink2/formats), so that you can use more software to do data preprocessing.

I show you my qc and impute practice.

  1. qc with plink
plink --file <your_prefix> --geno 0.1 --maf 0.05 --recode vcf --out mydata.qc
  1. impute with beagle
java -Xmx8g -jar beagle.28Sep18.793.jar gt=mydata.qc.vcf out=mydata.qc.imp

you can download beagle from https://faculty.washington.edu/browning/beagle/beagle.html

  1. qc with plink again
plink --vcf mydata.qc.imp --maf 0.05 --make-bed --out mydata.clean

Then you get the clean data in bfile format, MVP can use it.

If there is anything I have not stated, you can point out that we can discuss it again.

Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 7, 2018 via email

@TinaNeik
Copy link
Author

TinaNeik commented Nov 7, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 7, 2018

Hi, Tina

I think it's hard to find the optimal threshold for quality control, depending on your propensity to analyze. Often you need to try multiple thresholds when analyzing. I usually use a slightly stricter quality control. If I don't detect the signal, I will relax the quality control condition a bit.

  1. MAF
    I usually use --maf 0.01 first, but I noticed that you only have 68 individuals, so I suggest you use strict conditions like --maf 0.05 or --maf 0.1. When a certain type of sample is too small, it cannot represent the population. So the smaller your sample size, the more stringent quality control you need for MAF.
    And the mark with small MAF value will have low filling accuracy. So I usually filter out these marker before filling.

image

Bouwman A C, Veerkamp R F. Consequences of splitting whole-genome sequencing effort over multiple breeds on imputation accuracy[J]. BMC Genetics, BioMed Central, 2014, 15(1): 105.

  1. missing
    For the same batch of data, markers with too high a miss rate are usually considered to be of low quality, and are more likely to be filtered out in order to avoid false positives.

In short, I think that the value of quality control conditions depends largely on your data. The more lenient the quality control conditions, the higher the probability of detecting a signal and the higher the false positive.

Haohao

@hyacz
Copy link
Collaborator

hyacz commented Nov 7, 2018

Can you simply replace 'B' with 'T' to make TASSEL recognize your data? What the original SNP does not affect the analysis of GWAS.

@TinaNeik
Copy link
Author

TinaNeik commented Nov 8, 2018 via email

@TinaNeik
Copy link
Author

TinaNeik commented Nov 8, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 8, 2018

Hi, Tina

Sorry, my bad English make you confused. Regarding whether you should do qc before impute, my answer is "YES, you should to control maf and missing rate before impute."

Because marker with low maf and high missing rate are more error-prone when imputing.

Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 8, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 8, 2018

Hi, Tina

If your data is organized into a 0, 1, 2 matrix that can be read directly into R, you can follow the following code to get a set of bfile formatted data accepted by plink.

# get bed file.
> suppressMessages(library(MVP))
> geno <- as.matrix(read.table('data.txt', sep='|'))
> geno
      V1  V2  V3  V4  V5  V6  V7
[1,] NaN NaN NaN NaN NaN NaN NaN
[2,]   0   0   0   0   2   0 NaN
[3,]   2   2   2   2   2   2   0
[4,]   2   2   2   2   2   2   2
[5,]   2   0   2   2   2   0   0
[6,]   0 NaN   0   0   0   2   2
[7,] NaN NaN NaN NaN NaN NaN NaN
[8,]   0   0   0   0   0   0   0
> geno[is.nan(geno)] <- NA
> geno
     V1 V2 V3 V4 V5 V6 V7
[1,] NA NA NA NA NA NA NA
[2,]  0  0  0  0  2  0 NA
[3,]  2  2  2  2  2  2  0
[4,]  2  2  2  2  2  2  2
[5,]  2  0  2  2  2  0  0
[6,]  0 NA  0  0  0  2  2
[7,] NA NA NA NA NA NA NA
[8,]  0  0  0  0  0  0  0
> write_bfile(as.big.matrix(geno)@address, 'mydata')

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
# get fam file.
> pheno <- read.table('pheno.txt', header=T)
> pheno
  ID Pheno1
1  1   51.9
2  2   49.8
3  3   30.1
4  4   45.4
5  5   35.4
6  6   40.2
7  7   49.1
> fam <- cbind(pheno$ID, pheno$ID, 0, 0, 0, pheno$Pheno1)
> write.table(fam, 'mydata.fam', quote = F, row.names = F, col.names = F, sep = '\t')

# get bim file.
> map <- read.table('map.txt', header=T)
> map
  SNP Chr  Pos
1 rs1   1 1001
2 rs2   1 2019
3 rs3   1 3221
4 rs4   1 4400
5 rs5   1 5550
6 rs6   1 6190
7 rs7   1 7990
8 rs8   1 9999
> bim <- cbind(map[, 2], map[, 1], 0, map[, 3], 0, 0)
> write.table(bim, 'mydata.bim', quote = F, row.names = F, col.names = F, sep = '\t')

you will can get plink.bed, plink.fam, plink.bim. now you can use plink --bfile mydata to access your data.

Cheers,
Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 8, 2018 via email

@TinaNeik
Copy link
Author

TinaNeik commented Nov 8, 2018 via email

@TinaNeik
Copy link
Author

TinaNeik commented Nov 8, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 9, 2018

Hi, Tina

Yes, MVP can read data from hapmap format, it should work fine.

First, may I ask why is there an error at the end of the run? What does it
mean?

This problem seems to be caused by the separator. You originally had only one phenotype, but after the conversion, 4 were generated. Rlm7 has been successfully executed. The second phenotype is all NA, so the program has an error when generating an empty big.matrix.

why is the number of individuals reduced from 68 to 65

Please check your Rlm7mvp.hmp.phe file. The Rlm7 column may have several missing values and the missing values will be deleted.

Is it write_file?

write_bfile is a function in the MVP development version that you can download from the devel branch in Github.

Cheers,
Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 9, 2018 via email

@TinaNeik
Copy link
Author

TinaNeik commented Nov 9, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 9, 2018

Hi, Tina

where did the 5 individuals go?

Individuals with phenotype or genotype deletion will be filtered, and if your phenotype is ok, check your genotype file.

What does this error mean?

These are just some warnings, not errors. Some parameters are ignored when plotting the phenotype distribution plot, and you have ever run MVP so the folder for the PCA 3D plot already exists.

Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 9, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 11, 2018

Hi, Tina

Sorry to reply to your message so late. We are very happy that you use our software and we will improve the way MVP works to reduce the occurrence of these problems. Please feel free to ask your question.

From your reply, your original file may not be fully supported by MVP. I am not sure what went wrong. Could you please share data files with me? I will check it and tell you why the MVP is not working properly.

Haohao
Email: haohaozhang@whut.edu.cn

@TinaNeik
Copy link
Author

TinaNeik commented Nov 11, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 12, 2018

Hi, Tina
I have received your data, but I have time to check this data this afternoon and I will reply you today.

Haohao

@hyacz
Copy link
Collaborator

hyacz commented Nov 12, 2018

Hi, Tina

I have checked the data, as you said there is nothing unusual here. And then I rearranged the data with the development version of MVP and got the expected results.(see bellow log) The MVP.Data module was upgraded in the development version, I guess there are some bugs in the previous version. If I find the bug, I will let you know.

This is the result: ForTina.zip
You can continue your analysis with the mvp.imp.* files in the archive.

And log:

> suppressMessages(library(MVP))
> MVP.Data(fileHMP='Rlm7_SNP_68lines_forNumericAll_forTASSEL_ForHaohao.hmp.txt',
+ filePhe='Phenotype_Rlm7_68LinesAllA_ForHaohao.txt')
Preparing data for MVP...
Reading file...
inds: 68	markers:52099

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Preparation for GENOTYPE data is done within 1s
Preparation for Genotype File is done!
Preparation for PHENOTYPE data is Done within 0s
Imputing...
Impute Genotype File is done!
[1] "Principal Component Analysis Start..."
Preparation for PC matrix is done!
Calculate KINSHIP using Vanraden method...
Preparation for Kinship matrix is done!
MVP data prepration accomplished successfully!

Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 12, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 12, 2018

Hi, Tina

Yes, the latest version of MVP can be downloaded from Github's devel branch. It is unstable and we may have some updates in a few days.

You can install it as follows

  1. Download https://github.com/XiaoleiLiuBio/MVP/archive/devel.zip
  2. Start R input:
devtools::install_local("MVP-devel")

Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 12, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 12, 2018

Hi, Tina

We've planned a lot of things to do in the MVP 2.0 release, including some new models, more complete fault tolerance mechanisms, and more visualization. It may be released early next year. We will be very grateful for any suggestions.

Cheers,
Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 12, 2018 via email

@TinaNeik
Copy link
Author

TinaNeik commented Nov 12, 2018 via email

@TinaNeik
Copy link
Author

TinaNeik commented Nov 12, 2018 via email

@TinaNeik
Copy link
Author

TinaNeik commented Nov 13, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 13, 2018

Hi, Tina

I am very sorry that you have encountered so many problems when using our software.

  1. Mac default Clang does not support OpenMP, you need to download Clang from https://cran.r-project.org/bin/macosx/tools/. Or just like you did, disable OpenMP.
  2. Can you tell me which individual names are not correctly identified?
  3. The MVP.Version function just prints some welcome messages. I first encountered this error, maybe the MVP was not installed correctly.
  4. Is 1.0.1 instead of v1.0.1, I will fix it on the README page https://github.com/XiaoleiLiuBio/MVP/releases/download/1.0.1/MVP_offline_1.0.1.zip

Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 13, 2018 via email

@TinaNeik
Copy link
Author

TinaNeik commented Nov 13, 2018 via email

@TinaNeik
Copy link
Author

TinaNeik commented Nov 14, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 14, 2018

Hi, Tina
This seems to be because some individuals have been mistakenly identified. The log show that only 173 individuals have been identified. Have you replaced the individual with S1, S2, S3...?

Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 14, 2018 via email

@TinaNeik
Copy link
Author

TinaNeik commented Nov 14, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 14, 2018

Hi, Tina
I tried to replace S1, S2, S3... as Carousy, Trilogies, MPT... as you said yesterday. But without encountering errors, Could you please share the original name data with me? It will be very helpful in identifying this issue.

Haohao

@TinaNeik
Copy link
Author

TinaNeik commented Nov 14, 2018 via email

@hyacz
Copy link
Collaborator

hyacz commented Nov 15, 2018

Hi, Tina

Yes, I replaced them and worked fine.

> suppressMessages(library(MVP))
> MVP.Data(fileHMP='Rlm7_SNP_68lines_forNumericAll_forTASSEL_ForHaohao.hmp.txt', filePhe='Phenotype_Rlm7_68LinesAllA_ForHaohao.txt')
Preparing data for MVP...
Reading file...
inds: 68	markers:52099

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Preparation for GENOTYPE data is done within 1s
Preparation for Genotype File is done!
Preparation for PHENOTYPE data is Done within 0s
Imputing...
Impute Genotype File is done!
[1] "Principal Component Analysis Start..."
Preparation for PC matrix is done!
Calculate KINSHIP using Vanraden method...
Preparation for Kinship matrix is done!
MVP data prepration accomplished successfully!
> geno <- attach.big.matrix('mvp.imp.geno.desc')
> pheno <- read.table('mvp.phe', header=T)
> map <- read.table('mvp.map', header=T)
> imMVP <- MVP(
+     phe=pheno,
+     geno=geno,
+     map=map,
+     nPC.MLM=3,
+     perc=1,
+     priority="speed",
+     vc.method="EMMA",
+     threshold=0.05,
+     method=c("MLM")
+ )
====================== Welcome to MVP ======================
       A Memory-efficient, Visualization-enhanced, and
             Parallel-accelerated Tool For GWAS
                    __  __  __   __  ___
                   |  \/  | \ \ / / | _ \
                   | |\/| |  \ V /  |  _/
                   |_|  |_|   \_/   |_|     Version: 1.1.0
  Authors: Lilin Yin, Haohao Zhang, and Xiaolei Liu
  Contact: xiaoleiliu@mail.hzau.edu.cn
============================================================
[1] "Input data has 68 individuals, 52099 markers"
[1] "Principal Component Analysis Start..."
[1] "GWAS Start..."
[1] "Mixed Linear Model (MLM) Start ..."
[1] "Calculating Kinship..."
[1] "Variance components..."
[1] "Variance Components Estimation is Done!"
[1] "Eigen-Decomposition..."
[1] "Eigen-Decomposition is Done!"
|-------------------------------------------------->100.00%
[1] "Significance Level: 9.59711318835294e-07"
[1] "Visualization Start..."
[1] "Phenotype distribution Plotting..."
[1] "PCA plot2d..."
[1] "The 3D PCA map has been temporarily disabled, and we will improve this feature in subsequent versions."
[1] "SNP_Density Plotting..."
[1] "Circular_Manhattan Plotting Rlm7.MLM..."
[1] "Rectangular_Manhattan Plotting Rlm7.MLM..."
[1] "Q_Q Plotting Rlm7.MLM..."

@hyacz hyacz closed this as completed Dec 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
input error Wrong input data type or format
Projects
None yet
Development

No branches or pull requests

2 participants