Wrong redirection on Amazon #30

sumyungguy · 2017-07-11T20:18:52Z

Hi, thanks for this addon!

I had a problem with Amazon, being unable to go to sellers' storefronts. Eventually I solved it by putting sellercentral.amazon.com on the blacklist of Skip Redirect. I guess you don't do fixes for specific websites, but I wonder then if it would be helpful to have an option to not skip redirects if they go to the same domain? Mostly I don't like click-tracking for privacy reasons, but within the same domain it doesn't matter so much, and it would reduce site breakages like this.

Go to for example this seller's page:
https://www.amazon.com/sp?seller=AEHA1AAWZIC3L

At the top of that page, click "Apevia storefront", which has the link:
https://www.amazon.com/shops/AEHA1AAWZIC3L?ref_=v_sp_storefront

That seems to go to here:
https://sellercentral.amazon.com/gp/sc-redirect/seller-page.html/ref=v_sp_storefront[blahblah-mysterious-information]&servername=www.amazon.com&type=shops

But Skip Links then just goes to the root page of www.amazon.com, instead of to the storefront.

sblask · 2017-07-11T20:55:29Z

I can't reproduce your example, but I see what you mean and your proposal doesn't sound like a bad idea at all. I'll have to do a bit of thinking first though. Given your example, matching on everything until the first slash won't work. Extracting the domain can be tough due to top level domains (https://en.wikipedia.org/wiki/Second-level_domain) which I wouldn't want to hardcode...

sblask · 2017-07-12T08:13:15Z

I found https://publicsuffix.org/ yesterday, @crssi also referenced in #29 but I am not really a fan of introducing a sync functionality or doing it manually every month. It also is a very long list and checking that for all requests with a redirect could slow down the whole process. And the whole point for me is speed up (I don't care much about the privacy aspect here).

So I thought I could implement a simple heuristic: do not skip if the domain matches or ends with the same 10? (could be configurable) characters. It's not perfect, but could cover 90% of all the cases?

crssi · 2017-07-12T09:36:49Z

OK... 2 problems

Sync or manually: what about the database is only loaded from https://publicsuffix.org/list/effective_tld_names.dat after installing extension and then once per month (or user can set days) or when user manually hit a sync button.
Speedup: skip regex comparison and lets assume that database from upper URL is processed into SQL indexed like internal database, then it might not be too costly from CPU perspective? At least not for simple one or two level TLDs.

sblask · 2017-07-12T21:43:51Z

I made a quick test and searching through the names takes something between 3 and 15 ms. A lot of variation there. Not sure whether that's acceptable, but anyway I don't think the list is right for the purpose of this ticket. What I'd need is a list of top level domain plus second level domains so that splitting by dots I can identify amazon.com or amazon.co.uk as the right domain. So I basically want com or co.uk plus one more level of subdomain. The public suffix list contains loads of stuff. I can't take the values as is, because that would exclude amazon.com for example, but I can't use them and add another subdomain if necessary either, because there are very specific ones in the list already...

sumyungguy · 2017-07-13T12:55:31Z

I hadn't realized it would be so difficult to figure out if something is in the same domain! But of course it makes sense now that I think about it.

I think you'd need more than top level domain plus second level domains. It looks like in Japan three-level public domains are very common, or things like blogspot.co.uk, or as many as five-level ones with amazonaws. I don't see that there's any robust option other than using the public suffix list. But there must be some optimized search code already available that you could use for it? For example, for domains ending in .com, there are relatively few entries to check. If you can't use the list, then I'd say probably just forget about it, because it won't work the way people expect.

The public suffix list contains loads of stuff. I can't take the values as is, because that would exclude amazon.com for example, but I can't use them and add another subdomain if necessary either, because there are very specific ones in the list already...

Not sure what you mean by that...

crssi · 2017-07-13T14:18:42Z

@sblask:
I don't think you should treat amazon.com and amazon.co.uk as the same domain.

sblask · 2017-07-13T14:44:53Z

That's not what I was talking about. I just meant that I can't just split on dots to figure out the right domain. Given this example:

https://sellercentral.amazon.com/gp/sc-redirect/seller-page.html/ref=v_sp_storefront[blahblah-mysterious-information]&servername=www.amazon.com&type=shops

I would need to extract amazon.com I could just split the domain on dot and use the last two fragments. But this would fail with amazon.co.uk, because I'd need three fragments. My idea was to take the list, identify the equivalent to com or co.uk in the given domain and then add one more fragment. But it doesn't work like this.

There is a library that seems to do the right thing: https://github.com/oncletom/tld.js but it's rather complex and there is no simple way to include it...

crssi · 2017-07-13T18:58:25Z

I am just thinking out loud here and now (brainstorming).
Let's say that https://publicsuffix.org/list/effective_tld_names.dat is used and parsed into sqlite for speeding the process using indexes.
The process is simplified for explanation clarity sake.

Let's start with amazon.co.uk as example in a TLD and domain search process.

First part, that is part to 1st . reversed .uk can be be skipped, since seach will obviously always return true

search for co.uk, 2nd . reversed, is invoked and obviously is true, so it is possible TLD candidate
we go on until false, so search for amazon.co.uk is invoked and in this case we have false
So co.uk is TLD and the last search value is a domain = amazon.co.uk

for amazon.com everything stops already at step 1, where TLD is .com and domain part is last (in this case already the first) search value = amazon.com.

Let's go with a longer example, which does not happen very often... page.domain.act.edu.au
We skip search for .au

search for edu.au returns true, so we go on
search for act.edu.eu returns true, so we go on
search for domain.act.edu.au returns false
now we know that domain.act.edu.au is a domain, TLD is act.edu.au... but we don't bother, since a domain is a value we search and not TLD.
and now we know that page is a host or subdomain of domain.act.edu.au, but we don't care, since we are in search for domain part only.

I am not sure if I have explained well, but at least I tried. :)

Cheers

sblask · 2017-07-13T21:58:35Z

There is no support for SQLite in webextensions, but I think I found a way to do this using sets - lookup should be instantaneous. I finally understood the format of the suffix list. I think this should work: create 3 sets:

exception entries (! removed)
wildcard entries (*. removed)
others

And then similarly to what you suggest split the domain and lookup bit by bit - just in the opposite direction. It goes like this:

split the domain on the first dot into head and rest
check exception entries - if rest in there the result is head.rest
check wildcard entries - if rest in there the result is previous_head.head.rest
check others - if it's in there, the result is head.rest
nothing found? repeat from step one with rest as domain

crssi · 2017-07-14T07:17:05Z

I have suggested the backward solution because I have a feeling that it would need less steps and would be more optimal (CPU regarding) and faster.
But anyway, great thinking and good job.

Cheers

sblask · 2017-07-14T07:50:38Z

It's the same thing, but I find it easier to wrap my head around it, especially with exceptions and wildcards. Might need more steps, but using sets that shouldn't be a problem. Now I just need to find the time to implement it...

crssi · 2017-07-14T08:12:40Z

You are a champ. :)
It was nice brainstorming... if you need tester for beta, let me know here and I will do it gladly. :)

Thank you

sumyungguy · 2017-07-14T10:36:42Z

There's also this:
https://github.com/wrangr/psl
which is used eg. in the Smart Referer addon:
https://github.com/meh/smart-referer/tree/master/webextension/deps

It looks very simple to include as a submodule, and to use it's just one "psl.get" call for the "from" and "to" URLs, gives you the base domain names to compare. From Smart Referer's main.js:

try {
		let toURI   = new URL(request.url);
		let fromURI = new URL(referer.value);
		
		// Check if this request can be dismissed early by either being a perfect
		// source host / target host match OR by being whitelisted somewhere
		if(fromURI.host === toURI.host || policy.allows(fromURI.host, toURI.host)) {
			return;
		}
		
		// Attempt lax matching only if we, in fact, have hostnames and not IP addresses
		let isIPAddress = !isHostnameIPAddress(fromURI.host) && !isHostnameIPAddress(toURI.host);
		if(isIPAddress && !options.strict) {
			// Parse the domain names and look for each of their base domains
			let fromHostBase = psl.get(fromURI.host);
			let toHostBase   = psl.get(toURI.host);
			
			// Allow if the both base domain names match
			if(fromHostBase && toHostBase && fromHostBase === toHostBase) {
				return;
			}
		}
	} catch(e) {
		console.exception(e);
}

Another optimization, besides the "perfect match" as above, is a simple check if the top-level and 2nd level, based on splitting by dots, are not the same. In either of those cases, which is going to be the vast majority, you don't need to do the full check. You could even do a fast pre-check for the most common two-level public domains, like .co.uk, before calling the full check.

sblask · 2017-07-14T21:40:40Z

I had a look at the psl code. For every domain it goes through all entries of the public suffix list. I would have to check two domains, so if my performance check is right, that would be up to 30 ms and that's really rather long. Now that I figured out the algorithm, it shouldn't be too hard to implement it myself.

I like the idea of the pre-check. I'll do some benchmarking and if it's still rather slow, I can add that too. I would not want to judge what the most common domains are though ;-)

petteyg · 2017-07-17T21:28:44Z

See @gorhill's implentation for publicsuffix.org. https://github.com/gorhill/publicsuffixlist.js

It is apparently much faster.

sblask · 2017-07-17T23:23:32Z

Just finished my own implementation of getting the domain using the public suffix list: b17aca7

I made a new benchmark. Not quite comparable to my previous one because that was done in the browser console and this is done on command line. Also not comparable to the one from gorhill, especially because he didn't list the actual hostnames so I couldn't use the same. I also don't handle punycode - I don't know whether the webextension API uses punycode or unicode and I didn't have good examples to test with.

Anyway a comparison of searching through all rules and using my set implementation looks like this:

ok 11 Search through rules x 1,530 ops/sec ±2.10% (87 runs sampled)
ok 12 Set implementation x 1,103,814 ops/sec ±0.57% (94 runs sampled)

(benchmark code is in test-psl)

The difference is quite significant (as expected), but even the search doesn't look quite as bad as my previous result.

Next step: integrating this into the skipping logic and adding an option.

sblask · 2017-07-22T23:23:12Z

Ok, it's done! But I could not upload a new version yet because the add-on page died on me. You can test with Firefox by cloning this repo and running npm install and npm run:stable in the directory you cloned into. For Chrome, just clone into a directory and then use "Load unpacked extension" from the extension settings.

crssi · 2017-07-23T11:25:17Z

I am eager to try, but will be back in two weekes. In the mean time your AMO might get a life back. :) Thank you

sblask · 2017-07-24T22:29:02Z

New version is out, if you find any issues, please open a new issue.

sumyungguy · 2017-07-25T00:05:06Z

Seems to work well so far, thanks!

crssi · 2017-07-26T19:38:02Z

You are a true wizard. Thank you. :)
I have removed all blacklisted stuff (also the default one) and so far so good, no breakage.

But will know more when I return from vacation in about 2 weeks and will let you know. :)

petteyg · 2017-07-30T17:08:09Z

The "same public domain" check doesn't appear to be working properly. The login form on www.chase.com redirect to chaseonline.chase.com and then to secure05b.chase.com. All are obviously chase.com, but the add-on prevents logging in and pops up a message about skipping the chaseonline.chase.com redirect.

sblask · 2017-07-30T22:58:13Z

@petteyg I can't reproduce your problem. Do I need an account? When I click on signin, I get stuff like https://secure01b.chase.com/web/auth/?fromOrigin=https://secure01b.chase.com in the browser console, but the redirect is not skipped. Are you sure you've got the checkbox unticked? Can you post the full URL from which the skip happens?

petteyg · 2017-07-30T23:48:01Z

There's my problem. I was misreading the checkbox and thought it should be on.

crssi · 2017-07-31T07:56:12Z

Nah, not your problem obviously. I made the same mistake.
Its really confusing what means "skip redirect" and what means skip skip redirect and what is whitelist and blacklist... the logic at the first glance is reversed.

@sblask I will write you more about to make more clearly when I am back in a week or so.
Anyway, the new "same domain" logic is awesome. 👍

sumyungguy · 2017-08-02T10:32:50Z

I also had the same experiences, and chalked it up to me being an idiot :-) But now I don't feel so bad... I'll open a new issue about it.

crssi · 2017-08-07T20:34:30Z

OK... reading it again, it make sense.
And you have nicely explained here #32 (comment)
Thank you @sblask, now your extension is not great, but it is awesome :)

Cheers

sblask mentioned this issue Jul 11, 2017

/_login and /adfs is missing as exception #29

Closed

crssi mentioned this issue Jul 14, 2017

Enhancement: do not redirect if target domain is the same as source tumpio/requestcontrol#29

Closed

sblask closed this as completed Jul 24, 2017

sumyungguy mentioned this issue Aug 2, 2017

Reversed logic for allow/deny #32

Closed

crssi mentioned this issue Aug 8, 2017

Request Control Filters arkenfox/user.js#149

Closed

crssi mentioned this issue Sep 29, 2017

wiki: Skip Redirect : ELI5 arkenfox/user.js#249

Closed

stoically mentioned this issue Mar 14, 2018

Same domain logic should take second+ level domains into account stoically/temporary-containers#92

Closed

crssi mentioned this issue Mar 31, 2018

SIMPLE QUESTIONS tumpio/requestcontrol#60

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong redirection on Amazon #30

Wrong redirection on Amazon #30

sumyungguy commented Jul 11, 2017 •

edited

Loading

sblask commented Jul 11, 2017

sblask commented Jul 12, 2017

crssi commented Jul 12, 2017

sblask commented Jul 12, 2017

sumyungguy commented Jul 13, 2017

crssi commented Jul 13, 2017

sblask commented Jul 13, 2017

crssi commented Jul 13, 2017

sblask commented Jul 13, 2017

crssi commented Jul 14, 2017

sblask commented Jul 14, 2017

crssi commented Jul 14, 2017

sumyungguy commented Jul 14, 2017 •

edited

Loading

sblask commented Jul 14, 2017

petteyg commented Jul 17, 2017

sblask commented Jul 17, 2017 •

edited

Loading

sblask commented Jul 22, 2017

crssi commented Jul 23, 2017 via email •

edited

Loading

sblask commented Jul 24, 2017

sumyungguy commented Jul 25, 2017

crssi commented Jul 26, 2017

petteyg commented Jul 30, 2017 •

edited

Loading

sblask commented Jul 30, 2017

petteyg commented Jul 30, 2017

crssi commented Jul 31, 2017

sumyungguy commented Aug 2, 2017

crssi commented Aug 7, 2017

Wrong redirection on Amazon #30

Wrong redirection on Amazon #30

Comments

sumyungguy commented Jul 11, 2017 • edited Loading

sblask commented Jul 11, 2017

sblask commented Jul 12, 2017

crssi commented Jul 12, 2017

sblask commented Jul 12, 2017

sumyungguy commented Jul 13, 2017

crssi commented Jul 13, 2017

sblask commented Jul 13, 2017

crssi commented Jul 13, 2017

sblask commented Jul 13, 2017

crssi commented Jul 14, 2017

sblask commented Jul 14, 2017

crssi commented Jul 14, 2017

sumyungguy commented Jul 14, 2017 • edited Loading

sblask commented Jul 14, 2017

petteyg commented Jul 17, 2017

sblask commented Jul 17, 2017 • edited Loading

sblask commented Jul 22, 2017

crssi commented Jul 23, 2017 via email • edited Loading

sblask commented Jul 24, 2017

sumyungguy commented Jul 25, 2017

crssi commented Jul 26, 2017

petteyg commented Jul 30, 2017 • edited Loading

sblask commented Jul 30, 2017

petteyg commented Jul 30, 2017

crssi commented Jul 31, 2017

sumyungguy commented Aug 2, 2017

crssi commented Aug 7, 2017

sumyungguy commented Jul 11, 2017 •

edited

Loading

sumyungguy commented Jul 14, 2017 •

edited

Loading

sblask commented Jul 17, 2017 •

edited

Loading

crssi commented Jul 23, 2017 via email •

edited

Loading

petteyg commented Jul 30, 2017 •

edited

Loading