AT trim defaults parameters #26

Closed
dstreett opened this issue Sep 6, 2016 · 9 comments

@dstreett
Collaborator

dstreett commented Sep 6, 2016

Hey, @samhunter

Specific to AT trim: what should the default values be for the minimum trim length and the number of mismatches?

In general, what should the default minimum accepted length be?

All trimming algorithms will also have parameters for stranded, 3' trim, and 5' trim.

Anything I am missing? We can run tests later to get optimal values, but is there a decent first guess?

Thank you!

@msettles
Member

msettles commented Sep 6, 2016

As a first guess, use the parameters from Lucy.


@samhunter
Collaborator

By "min trim length" do you mean the minimum number of bases to trim or the minimum size that is kept after trimming?

I think Lucy is using a 10bp sliding window and continues to slide the window until 3 mismatches are encountered? At a quick glance I don't see any rationale for this strategy, and it seems like we could be a little more sensitive to short bits of poly A/T if we used a 5bp sliding window allowing for 2 mismatches, moving from either end of the sequence towards the center. Either way should produce sequences that bwa mem will map/trim, I would guess?

This is the Lucy strategy:

Poly-A/T tail removal
If the raw DNA sequences are obtained from an
EST library, some users want their poly-A/T tags
to be removed before clustering. LUCY does this
quickly after vector trimming by searching for the first
min span (10) or longer poly-T fragment within the
first initial search range (50) bases inside the
vector-free good region, then attempts to extend from
this initial poly-T seed toward the center of the sequence,
allowing no more than max error (3) mismatches
between every min span (10) consecutive T bases in
the scan. This is therefore a linear-time and linear-space
operation. The poly-A tail trimming at the other end of
the sequence is carried out similarly. If users wish to tell
LUCY that they are processing EST sequences but they
also wish to keep the poly-A/T tags for their purposes,
they can issue the keep option in combination with the
poly-A/T trimming option cdna.
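
As an illustration of the seed-and-extend strategy quoted above, here is a minimal Python sketch. It is one reading of the description, not Lucy's actual code: the function name `find_poly_t_end` and its parameter names are hypothetical, and the final step that backs off trailing mismatches is an added assumption. The defaults mirror the quoted values (10, 3, 50); passing `min_span=5, max_error=2` gives the more sensitive variant suggested above.

```python
def find_poly_t_end(seq, min_span=10, max_error=3, search_range=50, base="T"):
    """Return the index just past a 5' poly-`base` tract, or 0 if no seed is found.

    Hypothetical helper; an interpretation of the quoted description only.
    """
    seq = seq.upper()

    # Step 1: look for a seed, i.e. the first run of `min_span` consecutive
    # `base` characters lying inside the first `search_range` bases.
    seed_end = None
    run = 0
    for i, ch in enumerate(seq[:search_range]):
        run = run + 1 if ch == base else 0
        if run == min_span:
            seed_end = i + 1
            break
    if seed_end is None:
        return 0

    # Step 2: extend toward the center one base at a time, as long as the
    # last `min_span` bases contain no more than `max_error` mismatches.
    end = seed_end
    while end < len(seq):
        last_span = seq[end - min_span + 1 : end + 1]
        mismatches = sum(1 for ch in last_span if ch != base)
        if mismatches > max_error:
            break
        end += 1

    # Step 3 (assumption): back off so the tract ends on a matching base.
    while end > seed_end and seq[end - 1] != base:
        end -= 1
    return end


read = "T" * 12 + "ACGACGGCAGCAGG"
cut = find_poly_t_end(read)
print(read[cut:])  # the read with the leading poly-T tract removed
```

The poly-A tail at the other end could be handled the same way by scanning the reversed read with `base="A"`.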

@msettles
Member

msettles commented Sep 6, 2016

I'd still say that is a starting place, then modify. I have no idea what
to expect for homopolymer A/T in the genome. A sequence of length 5 with 3 A/T (2
mismatches) seems pretty likely to occur in genomic, non-polyadenylated sequence.

Matt
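
To put a rough number on that concern, here is a back-of-the-envelope Python sketch. It assumes independent bases with uniform composition (each base at probability 0.25), which real genomes do not follow, so treat the values as order-of-magnitude only. The function name `p_window_hit` is made up for this example, and it considers one fixed base (A or T), not both.

```python
from math import comb


def p_window_hit(window, max_mismatch, p_base=0.25):
    """Probability that a random window of length `window` contains at least
    `window - max_mismatch` copies of one fixed base (binomial tail).

    Assumes independent, identically distributed bases (a simplification).
    """
    min_hits = window - max_mismatch
    return sum(
        comb(window, k) * p_base**k * (1 - p_base) ** (window - k)
        for k in range(min_hits, window + 1)
    )


print(p_window_hit(5, 2))   # 5bp window, 2 mismatches: about 0.104
print(p_window_hit(10, 3))  # 10bp window, 3 mismatches: about 0.0035
```

Under these assumptions a 5bp window with 2 mismatches matches random sequence roughly 30 times more often than a 10bp window with 3 mismatches.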


@dstreett dstreett mentioned this issue Sep 7, 2016
@dstreett
Collaborator Author

dstreett commented Sep 7, 2016

Hello, again, @msettles !

So, for sickle reboot and poly AT tail remover, I wasn't planning on doing a sliding window. I was just planning on doing a simple loop starting at both ends for both of these algorithms. Any reason we should keep the sliding window?

Thanks!
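
For comparison with the sliding-window approach, a simple loop from both ends might look like the Python sketch below. This is only an illustration of the idea, not HTStream code: `run_length`, `trim_poly_at`, and the defaults `min_trim=5` and `max_mismatch=0` are placeholders, since the right defaults are exactly what this issue is asking about.

```python
def run_length(seq, base, max_mismatch=0):
    """Length of the longest prefix of `seq` that ends on `base` and contains
    at most `max_mismatch` other characters."""
    mismatches = 0
    best = 0
    for i, ch in enumerate(seq):
        if ch == base:
            best = i + 1
        else:
            mismatches += 1
            if mismatches > max_mismatch:
                break
    return best


def trim_poly_at(seq, min_trim=5, max_mismatch=0):
    """Trim poly-A or poly-T runs from both ends with one pass per end.

    Runs shorter than `min_trim` are left in place.
    """
    upper = seq.upper()

    # 5' end: take the longer of the A run and the T run.
    left = max(run_length(upper, b, max_mismatch) for b in "AT")
    if left < min_trim:
        left = 0

    # 3' end: same loop on the reversed remainder.
    rest = upper[left:][::-1]
    right = max(run_length(rest, b, max_mismatch) for b in "AT")
    if right < min_trim:
        right = 0

    return seq[left : len(seq) - right]


print(trim_poly_at("TTTTTTTTGCGCGGCCGCAAAAAAA"))
```

Note that with `max_mismatch` above zero this loop will also trim a mismatching base that sits at the very end of the read, which may or may not be desirable.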

@msettles
Member

msettles commented Sep 7, 2016

Don’t know, actually! But in talking with @shunter just now, I may have the perfect dataset (SE100) to test with. It is mouse and 5’ biased, meaning there should be A LOT of polyA/T tails of differing lengths. I think we can use the mapping results, and how, say, BWA mem soft-clips the right side of the read, to validate and tune with.

Matt
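
One way the soft-clip comparison could be set up is sketched below in Python: pull the soft-clipped lengths out of each SAM record's CIGAR string and compare them with what the trimmer removed. The helper names `soft_clips` and `three_prime_clip` are hypothetical. The sketch relies only on the SAM conventions that `S` marks soft clipping and that flag bit 0x10 means the read is stored reverse-complemented, so for those reads the original 3' end is on the left of the CIGAR.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar):
    """Return (left, right) soft-clipped lengths from a SAM CIGAR string.

    An unavailable CIGAR ("*") yields (0, 0).
    """
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard clips can sit outside soft clips; they are not part of SEQ.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def three_prime_clip(cigar, flag):
    """Soft-clipped length at the 3' end of the original read."""
    left, right = soft_clips(cigar)
    is_reverse = bool(flag & 0x10)
    return left if is_reverse else right


print(three_prime_clip("80M20S", 0))   # forward-strand read
print(three_prime_clip("20S80M", 16))  # reverse-strand read
```

A per-read comparison of this value against the trimmed tail length would give a rough signal for tuning, keeping in mind that soft clipping also happens for reasons other than polyA/T.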


@samhunter
Collaborator

There must be some information on whether poly-A trimming impacts analysis? I'm not sure I've ever seen it rigorously analyzed before, however. Anyone have a citation? Does bwa mem just happily soft-clip off all of those AAA's and map anyway? Maybe Kallisto/Salmon/etc. aren't impacted much?

@msettles
Member

msettles commented Sep 7, 2016

Well, my thoughts:

  1. Newer algorithms that do global-to-local alignment are less likely to be impacted than older global algorithms. I bet length matters, so clipping SE100 data has less impact than clipping SE50 data?
  2. Stats associated with polyA/T might be important for validating comparability; differences (especially for SE data) may be valuable for explaining problems in the data. So this is more for prep-related stats than for ‘better results’.
  3. Can’t hurt!

But with this dataset we should be able to determine that! And I have no citation.

matt


@dstreett
Collaborator Author

dstreett commented Sep 7, 2016

I was also wondering, @msettles , if there were any assumptions we could build into this. Such as T's will only appear on the 5' end and A's will only appear on the 3' end?

@msettles
Member

msettles commented Sep 7, 2016

Depends on the library preparation method. The data I have in mind has a specific set of assumptions about what polyA/T will occur and where, but generic RNAseq could have A or T at the beginning or the end of the read.

But we should think about how to specify some of those possibilities as parameters, with the default being to look at all of them. And stats for all, for now.

Matt
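
A possible shape for those parameters, sketched in Python: an options object that defaults to checking both bases at both ends, plus a tally of how many reads show a run for each (end, base) combination. Everything here (`PolyATOptions`, `tally`, the `"5p"`/`"3p"` labels, `min_trim=5`) is hypothetical and only meant to make the idea concrete.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PolyATOptions:
    bases: tuple = ("A", "T")    # which homopolymers to look for
    ends: tuple = ("5p", "3p")   # which read ends to inspect
    min_trim: int = 5            # placeholder default, to be tuned


def homopolymer_prefix(seq, base):
    """Number of consecutive `base` characters at the start of `seq`."""
    n = 0
    for ch in seq:
        if ch != base:
            break
        n += 1
    return n


def tally(reads, opts=PolyATOptions()):
    """Count reads with a run of at least `min_trim` for each (end, base)."""
    stats = Counter()
    for read in reads:
        upper = read.upper()
        for end in opts.ends:
            # Look at the 3' end by scanning the reversed read.
            view = upper if end == "5p" else upper[::-1]
            for base in opts.bases:
                if homopolymer_prefix(view, base) >= opts.min_trim:
                    stats[(end, base)] += 1
    return stats


reads = ["TTTTTTGCGC", "GCGCAAAAAA", "GCGCGCGC"]
print(tally(reads))
print(tally(reads, PolyATOptions(bases=("A",), ends=("3p",))))
```

A library prep with known strandedness could then narrow `bases` and `ends`, while the default still reports stats for all four combinations.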


@dstreett dstreett closed this as completed Jan 2, 2017