Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip URLs of leading and trailing non-breaking space (and space, but we already did) #126

Merged
merged 10 commits into from
Feb 5, 2019

Conversation

dentarg
Copy link
Collaborator

@dentarg dentarg commented Nov 21, 2018

https://en.wikipedia.org/wiki/Non-breaking_space

Background: Got a big list of URLs (Excel file) and many values in the "URL" column started with the NO-BREAK SPACE character.

@walro
Copy link
Contributor

walro commented Nov 21, 2018

Is this on parity with what we do for the .NET counterpart? Because this feels quite specific to a single use-case. Not too sure if the cleansing should be the responsibility of this gem.

I would be onboard if we decided that the library should strip all leading, and possibly, trailing whitespace characters, but as I lead with, this feels very specific.

@dentarg
Copy link
Collaborator Author

dentarg commented Nov 28, 2018

Is this on parity with what we do for the .NET counterpart?

Looks like that:

$ ruby -e 'puts "\u00A0https://example.com/foo/\u00A0"' | mono trunk/Deploy/Normalizer/Normalizer.exe
 https://example.com/foo/ ,https://www.example.com/foo,5591070627727872931
$ ruby -e 'puts "\u0020https://example.com/foo/\u0020"' | mono trunk/Deploy/Normalizer/Normalizer.exe
" https://example.com/foo/ ",https://www.example.com/foo,5591070627727872931

Because this feels quite specific to a single use-case. Not too sure if the cleansing should be the responsibility of this gem. I would be onboard if we decided that the library should strip all leading, and possibly, trailing whitespace characters, but as I lead with, this feels very specific.

I agree that it is specific. After I did this, I realised that if we do this, we probably want to remove all whitespace, but I haven't had the time to try to do it.

Re: trailing whitespace, I saw after I did this that Addressable keeps the non-breaking space:

$ ruby -raddressable/uri -e 'p Addressable::URI.parse("https://example.com/foo/\u00A0").to_s'
"https://example.com/foo/ "

$ ruby -raddressable/uri -e 'p Addressable::URI.heuristic_parse("https://example.com/foo/\u00A0").to_s'
"https://example.com/foo/ "

Addressable::URI#normalize also keeps it:

$ ruby -raddressable/uri -e 'p Addressable::URI.heuristic_parse("https://example.com/foo/\u00A0").normalize.to_s'
"https://example.com/foo/%20"

The behaviour isn't the same for ordinary space (and heuristic_parse and parse differs):

$ ruby -raddressable/uri -e 'p Addressable::URI.parse("https://example.com/foo/\u0020").to_s'
"https://example.com/foo/ "

$ ruby -raddressable/uri -e 'p Addressable::URI.parse("https://example.com/foo/\u0020").normalize.to_s'
"https://example.com/foo/"

$ ruby -raddressable/uri -e 'p Addressable::URI.heuristic_parse("https://example.com/foo/\u0020").to_s'
"https://example.com/foo/"

Addressable::URI.parse blows up on any leading space:

$ ruby -raddressable/uri -e 'p Addressable::URI.parse("\u00A0https://example.com/foo/").to_s'
/Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:874:in `scheme=': Invalid scheme format:  https (Addressable::URI::InvalidURIError)
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:799:in `block in initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:2355:in `defer_validation'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:796:in `initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `new'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `parse'
  from -e:1:in `<main>'

$ ruby -raddressable/uri -e 'p Addressable::URI.parse("\u0020https://example.com/foo/").to_s'
/Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:874:in `scheme=': Invalid scheme format:  https (Addressable::URI::InvalidURIError)
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:799:in `block in initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:2355:in `defer_validation'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:796:in `initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `new'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `parse'
  from -e:1:in `<main>'

Addressable::URI.heuristic_parse do handle the ordinary space but not the non-breaking space:

$ ruby -raddressable/uri -e 'p Addressable::URI.heuristic_parse("\u0020https://example.com/foo/").to_s'
"https://example.com/foo/"

$ ruby -raddressable/uri -e 'p Addressable::URI.heuristic_parse("\u00A0https://example.com/foo/").to_s'
/Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:874:in `scheme=': Invalid scheme format:  https (Addressable::URI::InvalidURIError)
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:799:in `block in initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:2355:in `defer_validation'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:796:in `initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `new'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `parse'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:207:in `heuristic_parse'
  from -e:1:in `<main>'

$ ruby -raddressable/uri -e 'p Addressable::VERSION::STRING'
"2.5.2"

@dentarg
Copy link
Collaborator Author

dentarg commented Nov 28, 2018

Not too sure if the cleansing should be the responsibility of this gem.

I think it should, at least leading whitespace. An alternative could be to put it in Twingly::URL::Utilities.extract_valid_urls, but I think .parse is nicer, we use it more.

@dentarg
Copy link
Collaborator Author

dentarg commented Nov 28, 2018

Our current behaviour

$ ruby -rtwingly/url -e 'p Twingly::URL.parse("\u0020https://example.com/foo/").to_s'
"https://example.com/foo/"

$ ruby -rtwingly/url -e 'p Twingly::URL.parse("\u00A0https://example.com/foo/").to_s'
""
$ ruby -rtwingly/url -e 'p Twingly::URL::VERSION'
"5.1.1"

@walro
Copy link
Contributor

walro commented Nov 29, 2018

$ ruby -rtwingly/url -e 'p Twingly::URL.parse("\u00A0https://example.com/foo/").to_s'
""

OK, this is pretty bad behavior which I guess got you started.

I agree that it is specific. After I did this, I realised that if we do this, we probably want to remove all whitespace, but I haven't had the time to try to do it.

It sounds like we agree :)

I think it should, at least leading whitespace. An alternative could be to put it in Twingly::URL::Utilities.extract_valid_urls, but I think .parse is nicer, we use it more.

When I meant if it should be the responsibility of the gem, I was talking about the specific nbsp case, as per above - I think it's fine (or even a good thing) that we do it in general

@jage
Copy link
Contributor

jage commented Jan 31, 2019

I'm not 100% sure how we should handle this case, but a suggestion is that we add as an option. I.e. Twingly::URL.parse(url, clean: true) and do a minor release. It won't change (potentially break) any existing usage of this code, but all our projects get this ability with a small change.

In the future I think this might very well become a defaut (major release), but right now I would need time that I don't have to verify how this could affect our systems (i.e. cause new exceptions in downstream projects), and if this is the route forward (i.e. make sure everything works like this). By adding it as an option we can decide per project.

WDYT @dentarg @walro ?

@walro
Copy link
Contributor

walro commented Jan 31, 2019

Sounds good to me to make it an optional thing.

@dentarg
Copy link
Collaborator Author

dentarg commented Jan 31, 2019

I think it might take more time to make it an option than to understand how breaking it would be. I think we should make it more general as @walro and I have discussed, and if need be, bump major the major version when we release it.

@dentarg
Copy link
Collaborator Author

dentarg commented Jan 31, 2019

We already do some cleaning so also I don't think it is a good idea to add clean:. I prefer to keep the usage simple.

@walro
Copy link
Contributor

walro commented Jan 31, 2019

I think it might take more time to make it an option than to understand how breaking

Wouldn't a naïve solution work:

...
... = clean_input(input) if clean
...

@walro
Copy link
Contributor

walro commented Jan 31, 2019

We already do some cleaning so also I don't think it is a good idea to add clean:. I prefer to keep the usage simple.

Ah I see.

@jage
Copy link
Contributor

jage commented Jan 31, 2019

Sure, just to clarify I'm not worried about bumping the major, I'm worried about potential new issues in other systems that might pop up. I.e. and upgrade to twingly-url will be blocked by some other system not liking this. Or that we are "forced" to implement this in more places. But if we look into it, of course this is a good change 👍

Adding this to extract_valid_urls would be much more straight forward IMHO, "just" split or more than space. i.e.. we don't need to worry as much about compatibility (which I agree, might not be an issue here, but I don't know). But it's not the same feature, then you'd need to use the utilities method and not parse directly, but just wanted to note this anyway.

@dentarg
Copy link
Collaborator Author

dentarg commented Jan 31, 2019

Seeing that our .NET library do handle this (#126 (comment)) I don't think we should be afraid of implementing it here and start using it.

I'm worried about potential new issues in other systems that might pop up. I.e. and upgrade to twingly-url will be blocked by some other system not liking this.

I'm not that worried, and as long as we don't do like this, I think we will be fine with our usual approach. We deploy the new version to one system at a time, and observe the outcome. I think that is less work than an extending the API. I'm afraid that it wont be used at all then.

@jage
Copy link
Contributor

jage commented Jan 31, 2019

I think I got stuck on the phrase "Looks like that:". I mean, if it is on par, then it should be fine.

@jage
Copy link
Contributor

jage commented Jan 31, 2019

I'm thinking we should try get this in next week, this should, together with #128 (and maybe #123 ), make for a good release (patch or minor).

@jage
Copy link
Contributor

jage commented Feb 5, 2019

Not sure this should stop this, but just wanted to note a subtle difference:

[1] pry(main)> Twingly::URL.parse(" \u00A0 https://example.com/foo/ \u00A0 ").to_s
=> ""

vs.

$ ruby -e 'puts " \u00A0 https://example.com/foo/ \u00A0 "' | mono trunk/Deploy/Normalizer/Normalizer.exe
"   https://example.com/foo/   ",https://www.example.com/foo,5591070627727872931

@dentarg
Copy link
Collaborator Author

dentarg commented Feb 5, 2019

Hmm

$ ruby -e 'puts " \u00A0 https://example.com/foo/\u{1F4A9}"' | mono trunk/Deploy/Normalizer/Normalizer.exe
"   https://example.com/foo/💩","   https://example.com/foo/💩",9474829912736228245

$ ruby -e 'puts " \u00A0 https://example.com/foo/ \u{1F4A9}" ' | mono trunk/Deploy/Normalizer/Normalizer.exe
"   https://example.com/foo/ 💩","   https://example.com/foo/ 💩",1824702514565726115

$ ruby -e 'puts " \u{1F4A9} https://example.com/foo/ \u{1F4A9}" ' | mono trunk/Deploy/Normalizer/Normalizer.exe
" 💩 https://example.com/foo/ 💩"," 💩 https://example.com/foo/ 💩",6741913218986405642

@dentarg
Copy link
Collaborator Author

dentarg commented Feb 5, 2019

I don't think this PR is in a state to be merged right now.

For starters, as discussed above, it should be more general, and handle more whitespace (maybe we only need to add the regular space to be happy, that should fix the discrepancy seen in #126 (comment)).

Further more, I'm not sure that .parse should trim trailing whitespace. Maybe the character is part of the URL and it breaks if we remove it?

input = input.sub(/#{NBSP}\z/, "")

When we normalize, we can trim trailing whitespace, to achieve same output as the .NET normalizer.

@dentarg
Copy link
Collaborator Author

dentarg commented Feb 5, 2019

I'm not sure that .parse should trim trailing whitespace

Ah, we already do trim the regular space.

@jage
Copy link
Contributor

jage commented Feb 5, 2019

For starters, as discussed above, it should be more general, and handle more whitespace (maybe we only need to add the regular space to be happy, that should fix the discrepancy seen in #126 (comment)).

That can be added later, no need to rewrite everything in one PR. I don't think that's an argument for stopping this, we can just continue the work in a new PR?

... we only need to add the regular space to be happy ...

Not sure what you mean? Regulare space is already removed when parsing?

Further more, I'm not sure that .parse should trim trailing whitespace. Maybe the character is part of the URL and it breaks if we remove it?

Yeah, that is a risk.


I think we should close this PR if we aren't sure what we want to achieve here, might be better to work out a plan for the .parse function in an issue and then focus on the code development? It sounds like some ideas mentioned in this PR shouldn't be in .parse at all, but rather in a broader "extract URL(s) from this string".

I've actively put effort validating stuff here today, I thought you wanted this merged. I think this is ok, there is no contract between our implementations and we're more strict here AFAIK than in our .NET code, so the risk of adding this should be small (but not zero). BUT, if we're not sure about this I'd rather see this closed.

@dentarg
Copy link
Collaborator Author

dentarg commented Feb 5, 2019

Ah, we already do trim the regular space.

I had a hard time seeing where this was done, but it is the heuristic parsing in Addressable that does it:

$ ruby -rbundler/setup -r./lib/twingly/url -e 'p Addressable::URI.heuristic_parse(" https://example.com/foo/ ").to_s'
"https://example.com/foo/"

Related to #126, making the whitespace removal more general.
Strip input from both space and non-breaking space
lib/twingly/url.rb Outdated Show resolved Hide resolved
Because we want to clean the input (#126), we are always working with a
string. #126 (comment)
@walro
Copy link
Contributor

walro commented Feb 5, 2019

(maybe PR title needs a slight tweak btw)

@dentarg dentarg changed the title Handle URLs with non-breaking space Strip URLs of leading and trailing non-breaking space (and space, but we already did) Feb 5, 2019
@dentarg dentarg merged commit f3d3cbb into master Feb 5, 2019
@dentarg dentarg deleted the handle-non-breaking-space branch February 5, 2019 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants