Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in svd(x, nu = 0, nv = k) : infinite or missing values in 'x' #34

Closed
philippbayer opened this issue Mar 19, 2019 · 18 comments
Closed
Projects

Comments

@philippbayer
Copy link

Hi,

I'm running an analysis with rMVP with a HapMap formatted SNP file, and it crashes in the PCA step.
The file was converted to HapMap from VCF via Tassel's export function. The same file works fine in GAPIT.

My analysis script:

library(rMVP)
MVP.Data(fileHMP="input.hmp.txt",
         filePhe="Phenotype.csv",
         fileKin=FALSE,
         filePC=FALSE,
         out="mvp.vcf",
         priority="memory",
)


genotype <- attach.big.matrix("mvp.vcf.geno.desc")

phenotype <- read.table("mvp.vcf.phe",head=TRUE)

map <- read.table("mvp.vcf.map" , head = TRUE)


MVP
for(i in 2:ncol(phenotype)){
  imMVP <- MVP(
    phe=phenotype[, c(1, i)],
    geno=genotype,
    map=map,
    nPC.GLM=5,
    nPC.MLM=3,
    nPC.FarmCPU=3,
    perc=1,
    file='tiff',
    priority="memory",,
    ncpus=12,
    vc.method="EMMA",
    maxLoop=10,
    method.bin="FaST-LMM",#"FaST-LMM","EMMA", "static"
    #permutation.threshold=TRUE,
    #permutation.rep=100,
    threshold=0.05,
    method=c("FarmCPU", 'MLM')
  )
}

The STDOUT:

Preparing data for MVP...
Reading file...
inds: 83        markers:4640569
Preparation for GENOTYPE data is done within 1m54s
83 common individuals between phenotype and genotype.
Preparation for PHENOTYPE data is Done within 0s
Imputing...
Impute Genotype File is done!
MVP data prepration accomplished successfully!

The STDERR:

Loading required package: Rcpp
Loading required package: RcppProgress
Loading required package: BH
Loading required package: bigmemory
Loading required package: biganalytics
Loading required package: foreach
Loading required package: biglm
Loading required package: DBI
Loading required package: parallel
Loading required package: MASS
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Error in svd(x, nu = 0, nv = k) : infinite or missing values in 'x'
Calls: MVP -> MVP.PCA -> prcomp -> prcomp.default -> svd
Execution halted

Have you seen this before?

@XiaoleiLiuBio
Copy link
Collaborator

I guess the error is caused by some missing values are remaining in the genotype file. If the 'allele' column in the HapMap file is 'NA', the error will occur. You can check if the HapMap file has the 'NA' information and we will update 'MVP.Data' function to skip those markers with 'NA' allele.

@philippbayer
Copy link
Author

philippbayer commented Mar 19, 2019

Thank you for your fast reply! I appreciate all of your work on this amazing package.

What I've tried now:

  • some SNPs had NA as one of the alleles - so in HapMap they looked like -/T or similar. I removed those, the error remained

  • I removed all SNPs that were not bi-allelic, the error remained

  • Double checked whether column 2 in the HapMap file has anything that isn't an allele, definitely only
    C/T
    G/A
    G/T
    C/A
    G/T
    A/G
    G/A
    G/C ...

Still the same error.

  • then I used SNPRelate to make a PCA matrix, and reformatted the output to the manuals' format:
0.0148802036697822      -0.174287100840609      -0.0481350609384221
-0.0224252021750267     -0.114742327963779      -0.0139827875349924
0.0288828570311105      -0.160927486070197      -0.031365668593198
0.0326738853578806      -0.112547120658613      -0.0256821848442263
0.0220493250426905      -0.0524834505823422     -0.0261171667960449
....

and changed the code:

library(rMVP)
MVP.Data(fileHMP="Input.hmp.hmp.txt",
         filePhe="Phenotype_no_Pheno_Indo.csv",
         fileKin=FALSE,
         filePC="SnPrelate_PCAs.tsv",
         out="mvp.vcf",
         priority="memory",
)

genotype <- attach.big.matrix("mvp.vcf.geno.desc")
phenotype <- read.table("mvp.vcf.phe",head=TRUE)
map <- read.table("mvp.vcf.map" , head = TRUE)
Covariates <- attach.big.matrix("mvp.vcf.imp.pc.desc")

for(i in 2:ncol(phenotype)){
  imMVP <- MVP(
    phe=phenotype[, c(1, i)],
    geno=genotype,
    map=map,
    #K=Kinship,
    CV.GLM=Covariates,
    CV.MLM=Covariates,
    CV.FarmCPU=Covariates,
    nPC.GLM=0,
    nPC.MLM=0,
    nPC.FarmCPU=0,
    perc=1,
    file='tiff',
    priority="memory",,
    ncpus=12,
    vc.method="EMMA",
    maxLoop=10,
    method.bin="FaST-LMM",#"FaST-LMM","EMMA", "static"
    #permutation.threshold=TRUE,
    #permutation.rep=100,
    threshold=0.05,
    method=c("FarmCPU", 'MLM')
  )
}

That gave me a different error:

Error in if (nrow < 1 | ncol < 1) stop("A big.matrix must have at least one row and one column") :
  argument is of length zero
Calls: MVP -> big.matrix -> filebacked.big.matrix
Execution halted

I checked all *.desc files and the input all has >1 row and column?

new("big.matrix.descriptor", description = list(sharedType = "FileBacked",
    filename = "mvp.vcf.geno.bin", dirname = "./", totalRows = 3714063L,
    totalCols = 83L, rowOffset = c(0, 3714063), colOffset = c(0,
    83), nrow = 3714063, ncol = 83, rowNames = NULL, colNames = NULL,
    type = "char", separated = FALSE))
new("big.matrix.descriptor", description = list(sharedType = "FileBacked",
    filename = "mvp.vcf.imp.geno.bin", dirname = "./", totalRows = 3714063L,
    totalCols = 83L, rowOffset = c(0, 3714063), colOffset = c(0,
    83), nrow = 3714063, ncol = 83, rowNames = NULL, colNames = NULL,
    type = "char", separated = FALSE))
new("big.matrix.descriptor", description = list(sharedType = "FileBacked",
    filename = "mvp.vcf.imp.pc.bin", dirname = "./", totalRows = 83L,
    totalCols = 3L, rowOffset = c(0, 83), colOffset = c(0, 3),
    nrow = 83, ncol = 3, rowNames = NULL, colNames = NULL, type = "double",
    separated = FALSE))
  • Then I thought - I can't remember older versions of MVP doing imputation, so I turned that off while turning the PCA back on:
MVP.Data(fileHMP="Input.hmp.hmp.txt",
         filePhe="Phenotype_no_Pheno_Indo.csv",
         fileKin=FALSE,
         filePC=FALSE,
         SNP.impute=NULL,
         out="mvp.vcf",
         priority="memory",
)

Back to the old error:

Error in svd(x, nu = 0, nv = k) : infinite or missing values in 'x'
Calls: MVP -> MVP.PCA -> prcomp -> prcomp.default -> svd
Execution halted

And now I'm out of ideas :)

Edit: some more things I tried:

  • the manual has each SNPs alleles as AA, GG, GC, i.e., biallelic, while my TASSEL file had them monoallelic - A, G, Y, so I changed that with a small script, still the same error

  • found 5 overlapping SNPs with duplicate positions, removed those, still same error

  • fixed the order of the alleles - TASSEL took them directly from the VCF, so A/T and T/A were possible, I sorted them all, still same error

  • set perc to 0.1 or 0.2 in the hopes that the subsampled data won't contain the 'bad' SNP that causes this error, got the same error

and again I'm out of ideas

Edit2: I should say that I got the same errors using the VCF file, converting to HapMap was my first idea and now I keep on playing with the HapMap file - this isn't a HapMap bug

Edit3: Is it possible that the imputation doesn't work 'right'? I added this piece to the MVP.PCA method just before the prcomp function is called:

    for(col in 1:ncol(big.geno)) {
            thissum <- sum(big.geno[1:83, col], na.rm=TRUE)
            if ( thissum == 0 ) {
            print(col)
            }
        }

Before I added the na.rm=T lots of NAs popped up, and the NAs are what causes the error in prcomp() - the version on GitHub and which I have installed doesn't seem to use big.PCA for the PCA?

            PCs <- prcomp(big.geno)$x[, 1:pcs.keep]
            return(list(PCs = PCs))

big.PCA seems to work with NA values.

If I replace the function call in MVP.PCA:

            #PCs <- prcomp(big.geno)$x[, 1:pcs.keep]
            PCs <- big.PCA(big.geno)$PCs[, 1:pcs.keep]

This happens in the STDOUT:

[1] "Input data has 83 individuals, 3713722 markers"
[1] "Principal Component Analysis Start..."
 means for first 10 snps:
 [1] 0 0 0 0 0 0 0 0 0 0
[1] "GWAS Start..."
[1] "Mixed Linear Model (MLM) Start ..."
[1] "Calculating Kinship..."
[1] "Z assignment..."

Hooray it works now :)

@hyacz
Copy link
Collaborator

hyacz commented Mar 20, 2019

Hi,

Thank you for using our software and sharing the details of the issue!

  • This problem seems to be due to the padding not working properly. I am trying to reproduce this issue and I will update this comment if I have made progress.
  • bigpca is a great package, but it was removed from the CRAN repository. We plan to submit rMVP to Bioconductor, so we removed the dependency on bigpca. (We plan to extend linear algebra library of bigmatrix, including PCA, Choleskey decomposition, etc.)

@hyacz hyacz added this to In progress in MVP Update Mar 20, 2019
@philippbayer
Copy link
Author

Ah I understand about bigpca - would you like me to send you my data so you can reproduce the issue more easily? I can put it onto cloudstor and send you a link via email

@hyacz
Copy link
Collaborator

hyacz commented Mar 20, 2019

yes, it will be very helpful ! my email is haohaozhang@whut.edu.cn . Are you using the latest version( 0.99.14) of rMVP? What is your operating system?

@philippbayer
Copy link
Author

In the meantime, my 'fixed' MVP run also crashed:

[1] "Input data has 83 individuals, 3713722 markers"
[1] "Principal Component Analysis Start..."
 means for first 10 snps:
 [1] 0 0 0 0 0 0 0 0 0 0
[1] "GWAS Start..."
[1] "Mixed Linear Model (MLM) Start ..."
[1] "Calculating Kinship..."
[1] "Z assignment..."
[1] "Assignment done!"

Error in cbind(matrix(1, n), CV) :
  number of rows of matrices must match (see arg 2)
Calls: MVP -> MVP.MLM -> cbind
Execution halted

I used 0.99.13, the latest version on the releases page: https://github.com/XiaoleiLiuBio/rMVP/releases Is there a chance 0.99.14 fixed this? It's not on the releases

I send you a link to the data via email

@hyacz
Copy link
Collaborator

hyacz commented Mar 21, 2019

Thanks a lot! I have received your email.

  • The 0.99.13 and 0.99.14 are the same, except that the version requirements for R are different. Bioconductor requires that the submitted package must depend on R 3.6.
  • The reason for the crash is because PCA deleted the SNP with missing, and genotype did not delete those. The length of the covariates is different from the genotype.

After I fix this bug, I will send you a development version.

@philippbayer
Copy link
Author

Thank you :)

@hyacz
Copy link
Collaborator

hyacz commented Mar 21, 2019

I updated the code and now MVP.Data will only keep a set of <out>.imp.* genotype files. Temporary files(<out>.geno.*) before imputation are automatically deleted.

Package:rMVP_0.99.15.tar.gz

Sample Code:

library(rMVP)

# Step1: Convert Hapmap to rMVP genotype
MVP.Data(
    fileHMP="UnifiedGenotyper_call_89_Pangenome_MAF001.QUAL20.DP10.noIndoPhil.sorted.forMVP.hmp.hmp.fixed.txt",
    fileKin=FALSE,
    filePC=FALSE
)

# Step2: Read data
genotype  <- attach.big.matrix("mvp.imp.geno.desc")     # Note: read the "<out>.imp" genotype
phenotype <- read.table("Phenotype_no_Pheno_Indo.csv", header = TRUE)
map       <- read.table("mvp.imp.map", head = TRUE)

# Step3: do GWAS
for(i in 2:ncol(phenotype)){
    imMVP <- MVP(
        phe=phenotype[, c(1, i)],
        geno=genotype,
        map=map,
        # nPC.GLM=5,
        nPC.MLM=3,
        nPC.FarmCPU=3,
        perc=0.1,
        file='tiff',
        priority="memory",
        ncpus=12,
        vc.method="EMMA",
        maxLoop=10,
        method.bin="FaST-LMM",#"FaST-LMM","EMMA", "static"
        threshold=0.05,
        method=c("FarmCPU", 'MLM')
    )
}

@philippbayer
Copy link
Author

Thank you very much!!

[1] "Principal Component Analysis Start..."
[1] "GWAS Start..."
[1] "Mixed Linear Model (MLM) Start ..."
[1] "Calculating Kinship..."
[1] "Z assignment..."
[1] "Assignment done!"
[1] "Variance components..."
[1] "Variance Components Estimation is Done!"
[1] "Variance components is Done!"
[1] "Eigen Decomposition..."
[1] "Eigen-Decomposition is Done!"
[...]

It works now :)

@hyacz hyacz moved this from In progress to Done in MVP Update Mar 30, 2019
@naglemi
Copy link

naglemi commented Apr 8, 2019

I'm getting this same error, even though I'm using MVP 0.99.15. Do you know why? Thanks for taking a look.

MVP.Data(fileBed= bed_input,
         filePhe= phe_input,
         fileKin=TRUE,
         filePC=FALSE,
         #type.map="double",
         type.geno = "double",
         priority="memory",
         maxLine=10000,
         out=outname)

M <- attach.big.matrix(paste0(outname, 
                              ".geno.desc"))
phenotype <- read.table(paste0(outname,
                        ".phe"),head=TRUE)
map <- read.table(paste0(outname,
                  ".map"), head = TRUE)
Kinship <- attach.big.matrix(paste0(outname,
                             ".kin.desc"))

for(i in 2:ncol(phenotype)){
  imMVP <- MVP(
    phe=phenotype,
    geno=M,
    map=map,
    K=Kinship,
    nPC.GLM=3,
    nPC.MLM=3,
    nPC.FarmCPU=3,
    #perc=0.10,
    priority="memory",
    ncpus=12,
    vc.method="EMMA",
    maxLoop=10,
    method.bin="FaST-LMM",#"FaST-LMM","EMMA", "static"
    #permutation.threshold=TRUE,
    #permutation.rep=100,
    threshold=0.05,
    method=c("MLM", "FarmCPU")
  )
}

Here is the output:

====================== Welcome to MVP ======================
      A Memory-efficient, Visualization-enhanced, and       
             Parallel-accelerated Tool For GWAS             
                    __  __  __   __  ___                    
                   |  \/  | \ \ / / | _ \                   
                   | |\/| |  \ V /  |  _/                   
                   |_|  |_|   \_/   |_|   Version: 0.99.15  
  Authors: Lilin Yin, Haohao Zhang, and Xiaolei Liu         
  Contact: xiaoleiliu@mail.hzau.edu.cn                      
============================================================
[1] "Input data has 253 individuals, 7901612 markers"
[1] "Principal Component Analysis Start..."
Error in svd(x, nu = 0, nv = k) : infinite or missing values in 'x'

Here is my Session Info showing I'm running the 0.99.15:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] compiler  tools     parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rMVP_0.99.15        MASS_7.3-51.1       biganalytics_1.1.14 biglm_0.9-1         DBI_1.0.0          
 [6] foreach_1.4.4       bigmemory_4.5.33    BH_1.69.0-1         RcppProgress_0.4.1  Rcpp_1.0.0         

loaded via a namespace (and not attached):
[1] bigmemory.sri_0.1.3 yaml_2.2.0          codetools_0.2-15    iterators_1.0.10  

@hyacz
Copy link
Collaborator

hyacz commented Apr 8, 2019

hi, naglemi
you should read the imputed genotype file.

M <- attach.big.matrix(paste0(outname,
".imp.geno.desc"))

@naglemi
Copy link

naglemi commented Apr 9, 2019

I ran MVP.Data with rMVP 0.99.15 and do not see the imp.geno.desc file among my outputs. How can I obtain this?

By the way, is there a good way for me to use MVP to obtain PCs without imputation? I think that NAs in my dataset are more often due to indels than missing data and such.

steed /scratch2/NSF_GWAS/Results/MVP 1009$ ls -lhrt
total 53G
-rw-r--r--. 1 naglemi upg6252 173M Apr  8 16:38 MAF0.05_geno882__TDZ_Shoot_Area_lntrans_0omit.map
-rw-r--r--. 1 naglemi upg6252 6.6K Apr  8 16:38 MAF0.05_geno882__TDZ_Shoot_Area_lntrans_0omit.geno.ind
-rw-r--r--. 1 naglemi upg6252  369 Apr  8 16:38 MAF0.05_geno882__TDZ_Shoot_Area_lntrans_0omit.geno.desc
-rw-r--r--. 1 naglemi upg6252  52G Apr  8 16:45 MAF0.05_geno882__TDZ_Shoot_Area_lntrans_0omit.geno.bin
-rw-r--r--. 1 naglemi upg6252  13K Apr  8 16:45 MAF0.05_geno882__TDZ_Shoot_Area_lntrans_0omit.phe
-rw-r--r--. 1 naglemi upg6252  356 Apr  8 20:17 MAF0.05_geno882__TDZ_Shoot_Area_lntrans_0omit.kin.desc
-rw-r--r--. 1 naglemi upg6252 6.0M Apr  8 20:17 MAF0.05_geno882__TDZ_Shoot_Area_lntrans_0omit.kin.bin

@hyacz
Copy link
Collaborator

hyacz commented Apr 9, 2019

rMVP will determine if your data is missing and then decide whether to impute it. If there is no missing data in your data, the .imp.geno.desc file will not be generated.

You can use the following code to confirm if there is a missing in your genotype file.

> genotype <- attach.big.matrix(genoPath)
> rMVP:::hasNA(genotype)
[1] FALSE

If the result is TRUE and no .imp.geno.desc file is generated, please see the log of MVP.Data(). I guess there may be some errors. I noticed that you have ~800,000 SNPs. In this case, using the rMVP built-in simple impute method may cause memory errors (we are working on improving this problem), I suggest you use professional impute software such as Beagle5 to handle your missing data.

rMVP does not allow NAs in Genotype when computing PCs, you can delete these SNP. generally, I won't mix SNP and InDel together for GWAS. If you are interested in InDel, you can analyze it separately. noInDel and InDel can be encoded as 0 and 1 respectively.

@naglemi
Copy link

naglemi commented Apr 12, 2019

I receive the following error from hasNA. Do you know the meaning of this? Thanks.

> rMVP:::hasNA(M)
Error in rMVP:::hasNA(M) : Expecting an external pointer: [type=S4].

@hyacz
Copy link
Collaborator

hyacz commented Apr 13, 2019

sorry, i made a mistake in example code, you should give a pointer to hasNA function instead of S4 object.
try this

rMVP:::hasNA(M@address)

@naglemi
Copy link

naglemi commented Apr 15, 2019

Interestingly, there appear to be no NA

> rMVP:::hasNA(M@address)
[1] FALSE

In this case, what might be causing the error during PCA?

@Rudrakshi1234
Copy link

I tried this code

MVP.Data(
fileHMP="genotype.txt",
fileKin=FALSE,
filePC=FALSE
)

genotype <- attach.big.matrix("mvp.geno.desc") # Note: read the ".imp" genotype
phenotype <- read.table("phenotypic_data.txt", header = TRUE)
map <- read.table("mvp.geno.map", head = TRUE)

for(i in 2:ncol(phenotype)){
imMVP <- MVP(
phe=phenotype[, c(1, i)],
geno=genotype,
map=map,
# nPC.GLM=5,
nPC.MLM=3,
nPC.FarmCPU=3,
perc=0.1,
file='tiff',
priority="memory",
ncpus=12,
vc.method="EMMA",
maxLoop=10,
method.bin="FaST-LMM",#"FaST-LMM","EMMA", "static"
threshold=0.05,
method=c("FarmCPU", 'MLM')
)
}
I am getting this error:

MVP.Data(

  • fileHMP="genotype.txt",
  • fileKin=FALSE,
  • filePC=FALSE
  • )
    Preparing data for MVP...
    Reading file...
    inds: 133 markers:20064
    Number of threads: 1
    0% 10 20 30 40 50 60 70 80 90 100%
    [----|----|----|----|----|----|----|----|----|----|
    **************************************************|
    Preparation for GENOTYPE data is done within 1s
    Imputing...
    out is NULL, impute inplace.
    Number of threads: 3
    0% 10 20 30 40 50 60 70 80 90 100%
    [----|----|----|----|----|----|----|----|----|----|
    **************************************************|
    Impute Genotype File is done!
    MVP data prepration accomplished successfully!

genotype <- attach.big.matrix("mvp.geno.desc") # Note: read the ".imp" genotype
phenotype <- read.table("phenotypic_data.txt", header = TRUE)
map <- read.table("mvp.geno.map", head = TRUE)
for(i in 2:ncol(phenotype)){

  • imMVP <- MVP(
  • phe=phenotype[, c(1, i)],
    
  • geno=genotype,
    
  • map=map,
    
  • # nPC.GLM=5,
    
  • nPC.MLM=3,
    
  • nPC.FarmCPU=3,
    
  • perc=0.1,
    
  • file='tiff',
    
  • priority="memory",
    
  • ncpus=12,
    
  • vc.method="EMMA",
    
  • maxLoop=10,
    
  • method.bin="FaST-LMM",#"FaST-LMM","EMMA", "static"
    
  • threshold=0.05,
    
  • method=c("FarmCPU", 'MLM')
    
  • )
  • }
    Error in MVP(phe = phenotype[, c(1, i)], geno = genotype, map = map, nPC.MLM = 3, :
    argument 7 matches multiple formal arguments.

I can't see any imp file in my output folder. Can someone please help :)

Regards
Rudra

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
MVP Update
  
Done
Development

No branches or pull requests

5 participants