Error at the end of getBam Counts and output is not S4. #22

Closed
nelsonchanhk opened this issue Dec 28, 2019 · 14 comments
@nelsonchanhk nelsonchanhk commented Dec 28, 2019

Hello, I'm new to ExomeDepth and am currently trying it on my custom targeted capture panel. At the end of getBamCounts, the following warnings show up.

Now parsing ~/OPERATION/RedCellNGS_hg19/BAM_testing/Normal9.bam
Parsing chromosome chr1
Parsing chromosome chr10
Parsing chromosome chr11
Parsing chromosome chr12
Parsing chromosome chr13
Parsing chromosome chr14
Parsing chromosome chr15
Parsing chromosome chr16
Parsing chromosome chr17
Parsing chromosome chr18
Parsing chromosome chr19
Parsing chromosome chr2
Parsing chromosome chr20
Parsing chromosome chr21
Parsing chromosome chr22
Parsing chromosome chr3
Parsing chromosome chr4
Parsing chromosome chr5
Parsing chromosome chr6
Parsing chromosome chr7
Parsing chromosome chr8
Parsing chromosome chr9
Number of counted fragments : 836974
There were 50 or more warnings (use warnings() to see the first 50)

warnings()
Warning messages:
1: In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:

  • in 'x': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
  • in 'y': chrX, chrY, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl0 [... truncated]
    2: In .Seqinfo.mergexy(x, y) :
    Each of the 2 combined objects has sequence levels not in the other:
  • in 'x': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
  • in 'y': chrX, chrY, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl0 [... truncated]
    3: In .Seqinfo.mergexy(x, y) :
    Each of the 2 combined objects has sequence levels not in the other:
  • in 'x': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
  • in 'y': chrX, chrY, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl0 [... truncated]
    4: In .Seqinfo.mergexy(x, y) :

Furthermore, the output of getBamCounts in my case is a list, and when I tried to convert it to a data frame, it failed with the following error.

ExomeCount.dafr <- as(ExomeCount, 'data.frame')
Error in as(ExomeCount, "data.frame") :
internal problem in as(): “tbl_df” is(object, "data.frame") is TRUE, but the metadata asserts that the 'is' relation is FALSE
typeof(ExomeCount)
[1] "list"

My script is as follows.
data(exons.hg19)
hg19<-file.path("/OPERATION/RedCellNGS_hg19/ref","hg19.fa")
bamName <- list.files("/OPERATION/RedCellNGS_hg19/BAM", pattern = '*.bam$')
bamFile <- file.path("~/OPERATION/RedCellNGS_hg19/BAM",bamName)
ExomeCount <- getBamCounts(bed.frame = exons.hg19,bam.files = bamFile,include.chr = TRUE,referenceFasta = hg19)

My sessionInfo() is as follows.
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ExomeDepth_1.1.13

loaded via a namespace (and not attached):
[1] matrixStats_0.55.0 lattice_0.20-38
[3] IRanges_2.20.1 Rsamtools_2.2.1
[5] Biostrings_2.54.0 GenomicAlignments_1.22.1
[7] bitops_1.0-6 grid_3.6.2
[9] GenomeInfoDb_1.22.0 stats4_3.6.2
[11] magrittr_1.5 zlibbioc_1.32.0
[13] XVector_0.26.0 S4Vectors_0.24.1
[15] Matrix_1.2-18 aod_1.3.1
[17] BiocParallel_1.20.1 tools_3.6.2
[19] Biobase_2.46.0 RCurl_1.95-4.12
[21] DelayedArray_0.12.1 parallel_3.6.2
[23] compiler_3.6.2 BiocGenerics_0.32.0
[25] GenomicRanges_1.38.0 SummarizedExperiment_1.16.1
[27] GenomeInfoDbData_1.2.2

Thank you very much for your help!

@nelsonchanhk nelsonchanhk changed the title Output of getBam Counts is not S4 and cannot convert to dataframe Error at the end of getBam Counts and output is not S4. Dec 28, 2019
@vplagnol vplagnol commented Dec 28, 2019

mmm I recently updated ExomeDepth to 1.1.13 to deal with some new R compatibility issues, and something odd may be happening with tibbles and data frames. It could be that you need to update your dplyr package? Some incompatibility there?

For now, I have created version 1.1.14, which does
return(data.frame(exon_count_frame))
in getBamCounts, so ideally you should not be bothered by the tibble issue anymore.
Can you please try:
> devtools::install_github("vplagnol/ExomeDepth")
and tell me what happens?
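On older versions where the strict S4 coercion still fails, a defensive workaround (a sketch, not part of the ExomeDepth API) is to use base R's as.data.frame(), which accepts both tibbles and plain list-like objects:

```r
# Sketch: as.data.frame() coerces list-like/tibble objects without the strict
# S4 as(x, 'data.frame') dispatch that triggered the error above.
# 'ExomeCount' here is a toy stand-in for the getBamCounts() output.
ExomeCount <- list(start = c(100L, 200L), end = c(150L, 250L))
ExomeCount.dafr <- as.data.frame(ExomeCount)
```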


@vplagnol vplagnol self-assigned this Dec 28, 2019
@nelsonchanhk nelsonchanhk commented Dec 29, 2019

Dear vplagnol,

Thank you for your swift reply! I just tried 1.1.14 and I can now proceed with the data frame step without the previous error, though the result is not an S4 object. You can see the output as follows.

ExomeCount.dafr <- as(ExomeCount, 'data.frame')
typeof(ExomeCount)
[1] "list"
typeof(ExomeCount.dafr)
[1] "list"

I shall try proceeding with the rest of the processing to see if it can work on my dataset.

Nevertheless, the >50 warnings at the end of getBamCounts still appear. Is this normal, or is it related to the chr notation in my BAM? I used hg19 for both my BAM and FASTA, and both use the chr notation (chr1, chr2, ..., chrX, chrY).

Thank you very much in advance and wish you a happy New Year!

[Identical getBamCounts log and .Seqinfo.mergexy warnings as in the original report.]

@vplagnol vplagnol commented Dec 29, 2019

OK, so the warnings I see suggest a mismatch between the "chr" naming convention (probably in your BAM file) and the "1..22" convention, which I think is the default I use.
Can you check what the output data frames look like? Do they have reasonable counts in them?
Have you looked at the "include.chr" option in getBamCounts, which is meant to deal with that complication?
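For reference, include.chr = TRUE makes getBamCounts prepend "chr" to the chromosome names taken from the BED/exons data frame so they match UCSC-style BAM headers. Conceptually (a toy sketch, not the package internals):

```r
# Conceptual sketch of what include.chr = TRUE does to the target names:
# Ensembl-style names ("1", "2", "X") become UCSC-style ("chr1", "chr2", "chrX").
target.chromosomes <- c("1", "2", "X")
ucsc.names <- paste0("chr", target.chromosomes)
```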

@nelsonchanhk nelsonchanhk commented Dec 31, 2019

Hello vplagnol, your quick response is much appreciated. I have attached screenshots of the data frame for reference. "include.chr" was set to TRUE. I think the counts are reasonable in regions with coverage.

Nevertheless, as I used a custom capture panel, a lot of the genes had no coverage at all. Do you think this would affect the CNV calling algorithm?

Furthermore, do you think the warnings at the end of processing are safe to ignore?

Thank you. Have a nice day and all the best in the new year!

[Screenshots of the count data frame attached.]
@nelsonchanhk nelsonchanhk commented Dec 31, 2019

Hello vplagnol, sorry for another issue. When I was trying to build my reference set from my samples, I encountered the following problem.

my.choice <- select.reference.set(test.counts=my.test, reference.counts=my.reference.set, bin.length=(ExomeCount.dafr$end-ExomeCount.dafr$start)/1000, n.bins.reduced=1000)
Optimization of the choice of aggregate reference set, this process can take some time
Error in ExomeCount.dafr$end - ExomeCount.dafr$start :
non-numeric argument to binary operator

It looks like another data frame problem. After changing the attributes to numeric, the step worked, but there were three warnings at the end. The sample data reference set also gives the same three warnings at the end of processing. Is this normal?

sapply(ExomeCount.dafr, mode)
seqnames start end exon GC HA1.bam
"character" "numeric" "numeric" "character" "numeric" "numeric"
HA2.bam HA3.bam HA4.bam HA5.bam Normal1.bam Normal2.bam
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
Normal3.bam chromosome
"numeric" "character"
my.choice <- select.reference.set(test.counts=my.test, reference.counts=my.reference.set, bin.length=(ExomeCount.dafr$end-ExomeCount.dafr$start)/1000, n.bins.reduced=1000)
Optimization of the choice of aggregate reference set, this process can take some time
Number of selected bins: 1000
Warning messages:
1: In model.matrix.default(mt, mfb, contrasts) :
non-list contrasts argument ignored
2: In model.matrix.default(mt, mfb, contrasts) :
non-list contrasts argument ignored
3: In model.matrix.default(mt, mfb, contrasts) :
non-list contrasts argument ignored
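The numeric coercion fix described above can be sketched as follows (toy data; the column names are assumed to match the getBamCounts output):

```r
# Toy data frame mimicking a count table whose start/end were read as character.
ExomeCount.dafr <- data.frame(start = c("100", "500"),
                              end   = c("300", "900"),
                              stringsAsFactors = FALSE)
# Coerce back to numeric so arithmetic like end - start works again.
ExomeCount.dafr$start <- as.numeric(ExomeCount.dafr$start)
ExomeCount.dafr$end   <- as.numeric(ExomeCount.dafr$end)
bin.length <- (ExomeCount.dafr$end - ExomeCount.dafr$start) / 1000
```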

Furthermore, at the end of ExomeDepth CNV calling, there are two warning messages.

all.exons <- new('ExomeDepth', test = my.test,reference = my.reference.selected, formula = 'cbind(test, reference) ~ 1')
Now fitting the beta-binomial model on a data frame with 185130 rows : this step can take a few minutes.
Now computing the likelihood for the different copy number states
Warning messages:
1: In aod::betabin(data = data.for.fit, formula = as.formula(formula), :
The data set contains at least one line with weight = 0.

2: In model.matrix.default(mt, mfb, contrasts) :
non-list contrasts argument ignored

Are these warnings safe to ignore?

Thank you very much!

@vplagnol vplagnol commented Dec 31, 2019

Oops, sorry, in my rewrite of the code I stupidly converted start/end to character. Now fixed in 1.1.15. Thank you for picking this up!

To your earlier question, if you use a custom panel, I would restrict the BED file to the covered regions to avoid all the 0s. Not sure how much it really matters, but it may, so play it safe...
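One rough way to do this restriction (a sketch with made-up column names, not part of ExomeDepth): drop targets where every sample has zero reads before building the reference set.

```r
# Toy count table with two sample columns; keep only targets covered
# in at least one sample.
counts <- data.frame(start = c(1, 100, 200),
                     end   = c(50, 150, 250),
                     s1    = c(10, 0, 5),
                     s2    = c(8, 0, 0))
covered <- rowSums(counts[, c("s1", "s2")]) > 0
counts.covered <- counts[covered, ]
```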

The statement "non-list contrasts argument ignored" concerns me... but I am not sure where it comes from. It must be from:
mod <- aod::betabin(data = data.for.fit, formula = as.formula(formula), random = ~ 1, link = 'logit', warnings = FALSE)
perhaps because it does not want to correct for GC content? I am not sure, but it is worrying.

@nelsonchanhk nelsonchanhk commented Dec 31, 2019

That's perfect @vplagnol , I've got it working now.

Just a couple of quick questions. I'm working on a germline targeted capture panel of about 7M, with 10 normal controls and 160 samples (I presume about 20-30% of them have CNVs in different genes). The samples were run on two occasions following the same protocol and on the same sequencer.

  1. For n.bins.reduced, should I still use 10000? How can I vary it to improve sensitivity/specificity?
  2. Shall I use the 10 normal samples as the reference set, or should I use all 170 BAMs (170 - 1) as controls?

Your support is truly appreciated! Wish you another prosperous year ahead!

@vplagnol vplagnol commented Dec 31, 2019

So for question 1, n.bins.reduced is useful when you have more than 10K bins with non-zero count, to speed things up. If you have more, downsampling should be fine, and if you have fewer, nothing should happen. So I would leave that parameter as is.

Re question 2 and the choice of reference samples, much of the accuracy of ExomeDepth depends on tight correlations between the test and reference samples. So in general it is best to have more BAM files, in order to find the most closely technically matched samples (usually the ones run in the same sequencing batch). The only exception is when you expect many CNVs in the same locations across your cases, which would obviously make it difficult to distinguish CNVs in the test sample (as the reference samples would be similarly affected). So the general answer is to NOT restrict, unless you have a very specific and well defined genetic cause of disease and you always look for relatively common CNVs in the same few genes/exons.

Thank you for the good words, and please let me know if you find what you expect to see; with the recent changes, some confirmation that nothing is broken would be good.

@nelsonchanhk nelsonchanhk commented Dec 31, 2019

@vplagnol That's very clear and helpful!

I've tried it on my samples; known deletions of about 20K and 3K-4K were not detected. I did an exploratory analysis on samples in which no SNVs/small indels could account for the phenotype. Unfortunately, no particularly relevant CNVs were detected.

On the other hand, at the reference-building stage, I set up my variables in the same way as in the tutorial, except that I used my own primary target BED. When I look at the variable my.choice, i.e. the output of select.reference.set, in my set of 176 samples only about the top 10-odd samples show meaningful values and the rest are all NA. Is this normal, or is there some problem with my script?

Furthermore, can ExomeDepth detect deletions of 20K and 3K-4K? How should I optimise the bin size to detect CNVs of such sizes?

Sorry for my many questions, but you have really guided me through using ExomeDepth and I'm much obliged to you.

my.ref.samples <- c('X11H28680.2.bam', 'X12H28270.5.bam', 'X12H28309.8.bam', 'X12H28470.7.bam', 'X12H28618.7.bam', 'X12H28723.6.bam', 'X12H28754.2.bam', 'X13H28294.9.bam', 'X13H28372.0.bam', 'X13H28504.5.bam', 'X13H28587.0.bam', 'X13H28619.4.bam', 'X13H28944.1.bam', 'X13H28944.1R.bam', 'X14H0289496.bam', 'X14H28307.4.bam', 'X14H28348.3.bam', 'X14H28384.7.bam', 'X14H28448.4.bam', 'X14H28534.4.bam', 'X14H28539.9.bam', 'X14H28622.8.bam', 'X14H28625.9.bam', 'X14H28626.6.bam', 'X14H28698.1.bam', 'X14H28699.8.bam', 'X14H28844.0.bam', 'X14H28947.2.bam', 'X14H28999.1.bam', 'X14H29009.2.bam', 'X15H28027.7.bam', 'X15H28179.7.bam', 'X15H28244.4.bam', 'X15H28299.4.bam', 'X15H28324.9.bam', 'X15H28349.0.bam', 'X15H28381.6.bam', 'X15H28452.5.bam', 'X15H28458.7.bam', 'X15H28547.8.bam', 'X15H28548.5.bam', 'X15H28571.5.bam', 'X15H28577.7.bam', 'X15H28691.2.bam', 'X15H28714.0.bam', 'X15H28715.7.bam', 'X15H28823.7.bam', 'X15H28908.7.bam', 'X15H28910.4.bam', 'X15H29145.7.bam', 'X16H28053.8.bam', 'X16H28186.9.bam', 'X16H28202.8.bam', 'X16H28241.3.bam', 'X16H28242.0.bam', 'X16H28304.3.bam', 'X16H28327.0.bam', 'X16H28379.9.bam', 'X16H28380.9.bam', 'X16H28411.6.bam', 'X16H28412.3.bam', 'X16H28462.8.bam', 'X16H28464.2.bam', 'X16H28488.6.bam', 'X16H28517.9.bam', 'X16H28540.9.bam', 'X16H28622.8.bam', 'X16H28647.9.bam', 'X16H28652.7.bam', 'X16H28656.5.bam', 'X16H28677.8.bam', 'X16H28706.1.bam', 'X16H28732.2.bam', 'X16H28786.5.bam', 'X16H28787.2.bam', 'X16H28818.9.bam', 'X16H28820.6.bam', 'X16H29092.0.bam', 'X16H29122.0.bam', 'X16H29190.7.bam', 'X16H29240.3.bam', 'X16H29325.3.bam', 'X16H29327.7.bam', 'X17H027436.0.bam', 'X17H027437.7.bam', 'X17H027439.1.bam', 'X17H028279.8.bam', 'X17H028305.0.bam', 'X17H028363.4.bam', 'X17H028450.1.bam', 'X17H028452.5.bam', 'X17H028465.9.bam', 'X17H028514.8.bam', 'X17H028529.6.bam', 'X17H028531.3.bam', 'X17H028589.4.bam', 'X17H028623.5.bam', 'X17H028626.6.bam', 'X17H028669.9.bam', 'X17H028707.8.bam', 'X17H028708.5.bam', 'X17H028710.2.bam', 'X17H028733.9.bam', 
'X17H028736.0.bam', 'X17H028737.7.bam', 'X17H028763.8.bam', 'X17H028769.0.bam', 'X17H028776.2.bam', 'X17H028827.5.bam', 'X17H028832.3.bam', 'X17H028861.5.bam', 'X17H028883.5.bam', 'X17H029017.1.bam', 'X17H029119.6.bam', 'X17H029120.6.bam', 'X17H029144.0.bam', 'X17H029179.4.bam', 'X17H029267.8.bam', 'X17H029322.2.bam', 'X17H029354.5.bam', 'X17H029403.4.bam', 'X17H029472.8.bam', 'X17H28050.7.bam', 'X17H28052.1.bam', 'X17H28102.7.bam', 'X17H28120.9.bam', 'X17H286651.bam', 'X18H028007.1.bam', 'X18H028197.9.bam', 'X18H028461.1.bam', 'X18H028488.6.bam', 'X18H028593.5.bam', 'X19H21103.bam', 'HA1.bam', 'HA10.bam', 'HA11.bam', 'HA12.bam', 'HA13.bam', 'HA14.bam', 'HA15.bam', 'HA16.bam', 'HA17.bam', 'HA18.bam', 'HA19.bam', 'HA2.bam', 'HA20.bam', 'HA21.bam', 'HA22.bam', 'HA23.bam', 'HA3.bam', 'HA4.bam', 'HA5.bam', 'HA6.bam', 'HA7.bam', 'HA8.bam', 'HA9.bam', 'HBD1.bam', 'HBD2.bam', 'HBD3.bam', 'HBD4.bam', 'HBD5.bam', 'HBD6.bam', 'HBD7.bam', 'HBD8.bam', 'Normal1.bam', 'Normal10.bam', 'Normal2.bam', 'Normal3.bam', 'Normal4.bam', 'Normal5.bam', 'Normal6.bam', 'Normal7.bam', 'Normal8.bam', 'Normal9.bam', 'RA35.bam')
my.reference.set <- as.matrix(ExomeCount.dafr[, my.ref.samples])

my.test <- ExomeCount.dafr$UW684347.bam

my.choice <- select.reference.set(test.counts=my.test, reference.counts=my.reference.set, bin.length=(ExomeCount.dafr$end-ExomeCount.dafr$start)/1000)
Optimization of the choice of aggregate reference set, this process can take some time
Number of selected bins: 2008
There were 14 warnings (use warnings() to see them)
my.matrix <- as.matrix(ExomeCount.dafr[, my.choice$reference.choice, drop = FALSE])
my.reference.selected <- apply(X = my.matrix, MAR = 1, FUN = sum)
all.exons <- new('ExomeDepth', test = my.test,reference = my.reference.selected, formula = 'cbind(test, reference) ~ 1')
Now fitting the beta-binomial model on a data frame with 2232 rows : this step can take a few minutes.
Now computing the likelihood for the different copy number states
Warning message:
In model.matrix.default(mt, mfb, contrasts) :
non-list contrasts argument ignored
all.exons <- CallCNVs(x = all.exons, transition.probability = 10^-4, chromosome = ExomeCount.dafr$chromosome, start = ExomeCount.dafr$start, end = ExomeCount.dafr$end, name = ExomeCount.dafr$exon)
Correlation between reference and tests count is 0.99962
To get meaningful result, this correlation should really be above 0.97. If this is not the case, consider the output of ExomeDepth as less reliable (i.e. most likely a high false positive rate)
Number of calls for chromosome 2 : 1
Number of calls for chromosome 6 : 1
Number of calls for chromosome 17 : 1
all.exons <- AnnotateExtra(x = all.exons, reference.annotation = Conrad.hg19.common.CNVs, min.overlap = 0.5, column.name = 'Conrad.hg19')
exons.hg19.GRanges <- GenomicRanges::GRanges(seqnames = exons.hg19$chromosome,IRanges::IRanges(start=exons.hg19$start,end=exons.hg19$end),names = exons.hg19$name)
all.exons <- AnnotateExtra(x = all.exons,reference.annotation = exons.hg19.GRanges,min.overlap = 0.0001,column.name = 'exons.hg19')
head(all.exons@CNV.calls[ order ( all.exons@CNV.calls$BF, decreasing = TRUE),])
start.p end.p type nexons start end chromosome
2 544 549 duplication 6 32520752 32552155 6
1 250 250 deletion 1 175547502 175547644 2
3 1925 1926 deletion 2 46115033 46115428 17
id BF reads.expected reads.observed reads.ratio
2 chr6:32520752-32552155 21.20 369 589 1.600
1 chr2:175547502-175547644 5.82 58 22 0.379
3 chr17:46115033-46115428 5.06 190 125 0.658
Conrad.hg19
2 CNVR2845.21,CNVR2845.25,CNVR2845.24,CNVR2845.5
1
3
exons.hg19
2 HLA-DRB1_6,HLA-DRB1_5,HLA-DRB1_4,HLA-DRB1_3,HLA-DRB1_2
1
3 COPZ2_2,COPZ2_1
my.choice
$reference.choice
[1] "X17H028861.5.bam" "HA6.bam" "X14H28625.9.bam" "X19H21103.bam"
[5] "X15H28349.0.bam" "X14H28698.1.bam"

$summary.stats
ref.samples correlations expected.BF phi
X17H028861.5.bam X17H028861.5.bam 0.9696150 4.306852 0.0010101030
HA6.bam HA6.bam 0.9684568 6.261057 0.0006421234
X14H28625.9.bam X14H28625.9.bam 0.9683212 6.924948 0.0005043947
X19H21103.bam X19H21103.bam 0.9681989 7.331545 0.0004154194
X15H28349.0.bam X15H28349.0.bam 0.9678799 7.440957 0.0003745715
X14H28698.1.bam X14H28698.1.bam 0.9678143 7.513889 0.0003396163
Normal5.bam Normal5.bam 0.9676507 7.473663 0.0003132495
X18H028488.6.bam X18H028488.6.bam 0.9672575 7.398554 0.0002966433
X15H28547.8.bam X15H28547.8.bam 0.9667066 7.377088 0.0002709542
X17H029322.2.bam X17H029322.2.bam 0.9661038 7.349389 0.0002555399
HA2.bam HA2.bam 0.9659973 7.244257 0.0002428891
Normal8.bam Normal8.bam 0.9659702 7.208335 0.0002312084
X17H028707.8.bam X17H028707.8.bam 0.9658821 7.243164 0.0002159127
HBD1.bam HBD1.bam 0.9657609 NA 0.0002010216
HA17.bam HA17.bam 0.9653922 NA NA
X14H28448.4.bam X14H28448.4.bam 0.9653091 NA NA
Normal7.bam Normal7.bam 0.9652204 NA NA
X16H28820.6.bam X16H28820.6.bam 0.9651466 NA NA
HA11.bam HA11.bam 0.9650629 NA NA
X17H028529.6.bam X17H028529.6.bam 0.9650061 NA NA
X13H28294.9.bam X13H28294.9.bam 0.9647416 NA NA
X13H28504.5.bam X13H28504.5.bam 0.9647169 NA NA
X14H28384.7.bam X14H28384.7.bam 0.9646008 NA NA
X14H28622.8.bam X14H28622.8.bam 0.9639411 NA NA
X16H28517.9.bam X16H28517.9.bam 0.9638205 NA NA
HA21.bam HA21.bam 0.9637785 NA NA
X15H28691.2.bam X15H28691.2.bam 0.9637432 NA NA
X15H28714.0.bam X15H28714.0.bam 0.9632372 NA NA
X15H28452.5.bam X15H28452.5.bam 0.9631393 NA NA

@vplagnol vplagnol commented Dec 31, 2019

At first sight, the correlation level that you find (0.99962) seems too high. High is good, but that high suggests that something is wrong, as if your reference samples exactly predicted the test sample. Odd... Are you sure that duplicate(s) of the test sample are NOT included in the potential reference set?
I am really worried about this message:
"In model.matrix.default(mt, mfb, contrasts) :
non-list contrasts argument ignored"

which I have not seen before. Also, that 0.99962 does not seem to match the numbers I see in the summary stats table, which are more realistic. And such super-high correlations would explain why you see so few CNV calls: basically, the match is nearly perfect. I am a bit concerned...

@nelsonchanhk nelsonchanhk commented Jan 1, 2020

@vplagnol That's a good suggestion. I've checked that they were not duplicates and that the test sample was excluded from the reference set. Could it be that, with more than 170 BAMs, the algorithm was able to select a BAM with very high correlation?

Previously I ran these BAMs through GATK, cnvkit, viscap and several other programs, and the results appeared to correlate across programs. I think the problem I'm now having might be related to the worrisome error message. Is there any way to troubleshoot it?

I will also try several other reference set combinations to explore. Thank you and have a nice day!

@nelsonchanhk nelsonchanhk commented Jan 2, 2020

Hello @vplagnol, the problem was partly resolved.

In my BED file, there were several big chromosomal regions that were not separated into genes/exons; they were mistakenly treated as one big exon. I've reworked my BED file and the correlation problem appears to be resolved. I can also now detect the deletions mentioned earlier, which were buried in the large chromosomal regions.

However, the error still persists.
"In model.matrix.default(mt, mfb, contrasts) :
non-list contrasts argument ignored"

@vplagnol vplagnol commented Jan 2, 2020

OK that's good news. The fact that you can see at least some of the CNVs you are meant to see tells me that the code is not broken, something good is happening. Still a bit unsure about that error message, but I think I am comfortable enough to push the updated code to CRAN. I'll try to figure out what the error message is telling me, but really happy to hear things are now working, so thank you for this.

@vplagnol vplagnol commented Jan 8, 2020

Uploading 1.1.15 now to CRAN. I will close the issue unless something else comes up (in which case re-open please).
