Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare command traceback #7

Closed
msxakk opened this issue May 27, 2015 · 16 comments
Closed

Compare command traceback #7

msxakk opened this issue May 27, 2015 · 16 comments

Comments

@msxakk
Copy link

msxakk commented May 27, 2015

Dear IsoSCM developers,

I've successfully run assembly command and what to compare differential 3'UTR usage in my dataset. When I run compare command I get the following traceback:

2015-05-27T12:07:34,440 [main] INFO executable.IsoSCM - Starting compare ...
Exception in thread "main" java.io.FileNotFoundException: /home/msxakk/Desktop/3primeUTR_lenght_work/A_assembly/tmp/NC.exon.trimmed.gtf (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.(FileInputStream.java:146)
at util.IO.bufferedScanner(IO.java:198)
at tools.ParseGTF$TranscriptIterator.(ParseGTF.java:266)
at multisample.JointSegmentation.performJointSegmentation(JointSegmentation.java:161)
at executable.IsoSCM.main(IsoSCM.java:613)

I used the following command:
java -Xmx2048m -jar IsoSCM-2.0.7.jar compare -base test -x1 /home/msxakk/Desktop/3primeUTR_lenght_work/A_assembly/A.assembly_parameters.xml -x2 /home/msxakk/Desktop/3primeUTR_lenght_work/NC_assembly/NC.assembly_parameters.xml

It looks to me as if the program is trying to find the file NC.exon.trimmed.gtf but in wrong directory .../A_assembly/tmp/ . However it is in .../NC_assembly/tmp/

This appears to be a bug. Can you help?

Best wishes
msxakk

@shenkers
Copy link
Owner

Hi msxakk,

You are correct, an updated version is available here: https://github.com/shenkers/isoscm/releases/tag/2.0.8

Let me know if this does not fix the issue.

Best,
Sol

@msxakk
Copy link
Author

msxakk commented May 27, 2015

It's running now with no trace back. We shall see...
I presume it is ok to use the assembly made using v7 and compare using v8?

@shenkers
Copy link
Owner

yes, the only change I made was where it is looking for the files, so it should be fine.

@msxakk
Copy link
Author

msxakk commented May 27, 2015

Is there a quick way to obtain gene name from the loci, i.e. how can I know the genes that have differential 3'UTR usage?

@shenkers
Copy link
Owner

there is nothing off the shelf that I'm aware of, although I agree that would be a nice extension. I can't promise a specific timeline, but I'll try and get an update for this in the next couple days...

@msxakk
Copy link
Author

msxakk commented May 28, 2015

Hi Shenkers,

Thanks for your quick replies. May I ask you a couple more questions regarding the interpretation of compare command main .txt output and few general questions:

  • According to documentation the 'confidence' value is "the likelihood that a change-point occurs at the given position, compared to a null model that no change-point at this position". I'm a bit confused here. Is this the probability of observed data (that there is point-change) if null hypothesis were true (i.e. statistical P-value) or is it the probability that it is true that there is a point-change here? In other words is it better if this value is low or high.
  • It seems to me that likelihood value refers to change-point (whatever it is). I was wondering if there is a statistical value associated with any potentially observed differential (between conditions/states) 3UTR site usage?
  • I am trying to understand how the 3UTR usage is calculated. So you divide the depth downstream (step-down) by the depth upstream (step-up). So in my mind it is like the reads downstream are treated as % of upstream. Then you use this value to subtract from 1. That means that the more reads downstream relative to upstream (bigger %) then the lower final usage value. But this seems counter intuitive. You would think that the more reads generating from 3UTR region indicate bigger usage? I would appreciate if you could explain the concept of usage a little bit here. The differential usage would then be obvious.
  • I like the figure 6B in your paper. Can you tell me how you generated this matrix? I presume you used some bioconductor package? That will be useful when I report my data.

P.S
Let me know if you prefer to open a new thread for this kind of discussion.

Thanks!
msxakk

@shenkers
Copy link
Owner

I'm happy leaving the discussion here, I may re-organize things to create a wiki and create an FAQ section at some point in the future.

  • The confidence can be thought of as the complement of the p-value, i.e. confidence = 1-(p-value), so higher is better.
  • The current version of IsoSCM does not do this estimation, although several tools such as MISO or DEXSeq can be used to make these sorts of assessments. To use MISO or DEXSeq you would need to convert the output of IsoSCM into something that is compatible with these tools. IsoSCM, DEXSeq, and MISO focus on slightly different problems, and I would rather delegate the quantification to tools that specialize in isoform quantification rather than integrate everything in IsoSCM.
  • The usage is an estimate of the relative fraction of isoforms from the locus that use the polyadenylation site at the position of the change-point (as opposed to a polyadenylation site downstream). The usage will vary between 0 (when there is equal coverage up and down stream) and 1 (when there is zero coverage downstream). I think you're confused because you're interpreting the usage as "usage of the extended isoform", when it represents "1-(usage of the extended isoform)". I guess you looked at the explanation in the readme, so let me know if there were places where that explanation could be more clear.
  • The heatmap in 6B could be generated in R (although I used a different program). The colors are proportional to the log(p-value) of a test for differential usage at tandem APA sites. If there was not a trend towards lengthening or shortening, you would expect the differential usage to be symetrically distributed about 0. If there are more lengthening events (differential usage > 0) in one of the two tissues, you could test if this difference is significant compared to a null model that there would be an equal proportion of lengthening events in each tissue (by a binomial test, for example). The upper and lower triangles of the matrix represent the two one sided tests respectively.

@msxakk
Copy link
Author

msxakk commented May 30, 2015

Hi Shenkers,

Thanks for that. I'll get back to this discussion in due course because it is important. I've got some other data to analyze for now and so can't properly reply.

Hear you soon
msxakk

shenkers added a commit that referenced this issue Jun 2, 2015
shenkers added a commit that referenced this issue Jun 2, 2015
shenkers added a commit that referenced this issue Jun 2, 2015
@shenkers
Copy link
Owner

shenkers commented Jun 2, 2015

@msxakk, I added the feature you requested (https://github.com/shenkers/isoscm/releases/tag/2.0.9), to identify loci that are identified by the 'compare' command that share structure with reference gene models.

To run this command type:

java -jar IsoSCM-2.0.9.jar diff -x [ compare_paramters.xml from compare step ] -G [ reference gene models GTF ]

The reference gene models have to use the same format as the Ensembl GTFs ( http://useast.ensembl.org/info/data/ftp/index.html ), ie they need "gene_id" and "transcript_id" attributes delimited using the same spacing characters. If you use gene models from another source you need to make sure that they match the Ensembl format or they will not be parsed correctly.

The output of this command will be a table with columns:

  1. exon - the terminal exon from the compare command
  2. locus_id - the locus_id, corresponding to the output of the compare command
  3. match_5p_3p - the transcript_id's of reference models that match both the 5' and 3' boundaries of the assembled exon
  4. match_5p_3p_type - the relative location of the matching exon in the reference transcript
  5. match_5p - the transcript_id's of reference models that match only the 5' boundaries of the assembled exon
  6. match_5p_type - the relative location of the matching exon in the reference transcript
  7. match_3p - the transcript_id's of reference models that match only the 3' boundaries of the assembled exon
  8. match_3p_type - the relative location of the matching exon in the reference transcript
  9. overlaps - the transcript_id's of reference models that overlap the assembled exon
  10. overlaps_type - the relative location of the matching exon in the reference transcript

The reference exon types will be one of [ 5p_exon, internal_exon, 3p_exon ] if it comes from a spliced transcript, and unspliced if the reference model is a single-exon transcript. While it is possible that IsoSCM will assemble novel 3'-terminal exons that were never previously annotated, the most conservative approach would be to focus on terminal exons that are consistent with existing gene models. You can do this by filtering for rows that have a match_5p_type of 3p_exon, to select loci that exactly match the 5' boundary of an annotated terminal exon.

Let me know if your run into any problems.

Best,
Sol

@adomingues
Copy link

Hi Sol,

just a follow up on some of @msxakk questions, in particular regarding the p-value. You mentioned that:

The confidence can be thought of as the complement of the p-value, i.e. confidence = 1-(p-value), so higher is better.

But if my reading is correct, confidence should never be above 1, because 0 > p-value < 1. However, I have some (1947 to be precise) confidence values > 1. Am I missing something?

To give yo some background, I am trying to filter the results of compare (currently > 30k different change points) to have only a set of UTRs that show differences between conditions that could be used for further analysis validation. Tentatively, downstream_cov smaller than the median will be removed, as will UTRs with small differential_usage. These can defined by looking at their distribution. However, I am struggling with the use of the confidence values.

Any feedback on how to filter the results is welcome.

(the fixes in v2.0.11 worked btw)

@shenkers
Copy link
Owner

Hi Antonio,

The issue with the confidence values is something I have noticed but not resolved yet. The confidence values are calculated by summing over the likelihoods of various 3'utr segmentations, comparing those that have a change point within a specified number of bins (parameter -c of the assemble command), with segmentations that have no change point within -c bins of that position. Values greater than 1.0 happen when the read coverage supports a model where two change points occur in close proximity (less than 2*c+1 bins apart). I agree that this isn't ideal, so I'll try to see if there is a better approach.

About filtering, I didn't understand what filter you are using for the downstream coverage. I would recommend a minimum upstream coverage in both conditions, as well as a minimum differential usage, as you were planning to use. Sometimes the differential segments will be very short, which I would be more cautious to believe without many replicated out otherc experimental evidence. If you want to be conservative about this case you could require that the downstream segment is above a threshold length.

Glad to hear 2.0.11 worked, thanks for letting me know.

@adomingues
Copy link

Values greater than 1.0 happen when the read coverage supports a model where two change points occur in close proximity (less than 2*c+1 bins apart). I agree that this isn't ideal, so I'll try to see if there is a better approach.

Got it. I was also surprised at the large number of confidence values == 1, about half (15954/30322), which was surprising as I naively expected the distribution to tail at 1:

confidence_distribution

So I am still unsure on which values to use. From what you told me values greater than 1.0 are harder to interpret, and the number of high confidence values, (=1.0), is a bit too high. Still, I guess that > 0.99 should be a good starting point.

About filtering, I didn't understand what filter you are using for the downstream coverage.

I calculate the median downstream_cov for both conditions, take the smallest value, and use this as the threshold for minimum coverage in both conditions. Basically, trying to take only "highish" coverage regions if that makes sense.

I would recommend a minimum upstream coverage in both conditions, as well as a minimum differential usage, as you were planning to use.

It's more or less what I am doing (above), except for downstream coverage, and minimum upstream is decided based on median coverage. The differential usage is arbitrarily set at a minimum of 0.20.

If you want to be conservative about this case you could require that the downstream segment is above a threshold length.

I have considered this. Good idea.

Thanks for your help.

@shenkers
Copy link
Owner

On Jul 22, 2015 4:29 AM, "FridayMeetsSunday" notifications@github.com
wrote:

Got it. I was also surprised at the large number of confidence values ==
1, about half (15954/30322), which was surprising as I naively expected the
distribution to tail at 1:

The segmentation algorithm used by IsoSCM reports a maximum likelihood
solution for the position of changepoints, after weighing all possible
configurations (if you are familiar with HMMs it is analogous to a viterbi
decoding). The confidence values are calculated post-hoc to try and capture
how well the changepoint can be localized (imagine a sharp drop compared to
a gradual decline in coverage). Since the confidence values are calculated
after the fact from the maximum likelihood segmentation you should not
expect a majority of low confidence change points with a tail of high
confidence values.

So I am still unsure on which values to use. From what you told me values
greater than 1.0 are harder to interpret, and the number of high confidence
values, (=1.0), is a bit too high. Still, I guess that > 0.99 should be a
good starting point.

Yes, although I would expect the values greater than one to represent real
change points, so I would not discard them. If you set the assembly step
'-c' parameter to zero you will get more conservative confidence values,
and this should also eliminate confidence values greater than one.

About filtering, I didn't understand what filter you are using for the
downstream coverage.

I calculate the median downstream_cov for both conditions, take the
smallest value, and use this as the threshold for minimum coverage in both
conditions. Basically, trying to take only "highish" coverage regions if
that makes sense.

My only hesitation about doing it this way is that it requires the long
isoform to be expressed in both conditions. At least when comparing tissues
I have seen many examples where the long isoform is present in one tissue
and completely absent from the other. If you require a minimum coverage in
the downstream segment in both conditions you will filter out these cases.
The differential usage is arbitrarily set at a minimum of 0.20.

Do you have replicates? If so, you could run the analysis comparing two
replicates to empirically choose a cutoff. A priori you don't expect to see
any differential usage between replicates, so you could choose a cutoff 'X'
where 95% of change points between replicates have a differential usage
less than X.


Reply to this email directly or view it on GitHub.

@adomingues
Copy link

First of all thank you for the detailed answer. I spent some time looking at predicted events yesterday, following your suggestions, and I can report that by and large IsoSCM does indeed a very good job at predicting change points, or at least UTR's that undergoes differential expression. It tends to over-estimate the lengthening of 3'UTRs, that is predict UTRs tend to be longer than what is suggested by coverage, but you allude to that in the paper anyway (that preditions are not bp-resolution). It is still very valuable information.
I will not reply point by point, but I will leave here my final filtering selection. The only thing I did not do was to use replicate comparison to determine the cut-off. I do have replicates, but long story short it would not be advisable to use them in this situation.

Filtering:

  • upstream_cov_WT > cutoff | upstream_cov_KD > cutoff, and cutoff being the lowest median value of upstream_cov of both samples. This might be a bit too strict but I wanted to be reasonably conservative. Higher than the 1st quartile should to the trick.
  • abs(differential_usage) > 0.20, maybe strict but helps to choose events for which one could see differences in PCR validation.
  • downstream segment length > 50. This is arbitrary. I plotted the distribution and there weren't that many very short downstream segment. 50 nt was what I considered reasonable for the species I am working with.
  • confidence > 0.99. I ended up not excluding those > 1 values, and the results look good, that is looking on IGV the coverage supports the predictions. Again, this is little too strict, but I wanted to have a set of regions that could immediately see differences between conditions (or not).

Applying these filters, 30322 events were reduced to 122. I hope this helps someone else using IsoSCM.

@qoiopipq
Copy link

I ran assemble command with c=2 (as default) then compare, I've got 8921 splicing events in total, 3827 of them have confidence values >=1. I saw the previous post about adjusting '-c' parameter, so I tried to set '-c' parameter to zero in assemble step, but I got the following trackback:
2016-01-27T14:20:48,581 [main] INFO executable.IsoSCM - Starting assemble ...
2016-01-27T14:20:49,465 [main] INFO executable.IsoSCM - tabulating splice junctions...
2016-01-27T14:21:51,295 [main] INFO executable.IsoSCM - counting junction supporting reads...
2016-01-27T14:31:14,485 [main] INFO executable.IsoSCM - identifying covered segments...
2016-01-27T14:48:21,038 [main] INFO executable.IsoSCM - merging coverage gaps less than 100 bp...
2016-01-27T14:48:28,718 [main] INFO executable.IsoSCM - filtering splice junctions...
2016-01-27T14:48:33,916 [main] INFO executable.IsoSCM - identifying spliced exons...
2016-01-27T14:48:37,654 [main] INFO executable.IsoSCM - identifying change points...
Exception in thread "main" java.lang.IllegalArgumentException: fromIndex(15) > toIndex(14)
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1006)
at java.util.ArrayList.subList(ArrayList.java:996)
at scm.ConfidenceCalculator.calculateConfidence(ConfidenceCalculator.java:31)
at scm.IdentifyChangePoints.identifyConstrainedNegativeBinomialPoints(IdentifyChangePoints.java:245)
at scm.IdentifyChangePoints.identifyNegativeBinomialChangePointsInLongSegments(IdentifyChangePoints.java:660)
at executable.IsoSCM.main(IsoSCM.java:358)

Any feedback on how to solve this problem is welcome.

@shenkers
Copy link
Owner

Hi @qoiopipq, thank you for reporting this, it appears to be a bug. I believe I have identified the problem, would it be possible for you to share the .bam file that causes this error so that I can verify the fix?

Regarding confidence values >=1, I think I will remove the -c parameter (so that 'compare' can only be run with c=0 and confidence values will be reported in the proper range).

p.s. you say "..I've got 8921 splicing events..." IsoSCM attempts to identify polyadenylation events, not splicing events.

p.p.s. your report belongs in a new issue since it is not related to the error reported on this thread. You probably wanted to reference confidence values >=1, so you posted here. Github has a built in mechanism to do this, if you want to reference another thread just use a hashtag for the issue number, for example if you want to reference this thread just type #7 and github will create a link to this page when you post your comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants