Strip URLs of leading and trailing non-breaking space (and space, but we already did) #126

dentarg · 2018-11-21T16:34:20Z

https://en.wikipedia.org/wiki/Non-breaking_space

Background: Got a big list of URLs (Excel file) and many values in the "URL" column started with the NO-BREAK SPACE character.

https://en.wikipedia.org/wiki/Non-breaking_space

walro · 2018-11-21T17:13:08Z

Is this on parity with what we do for the .NET counterpart? Because this feels quite specific to a single use-case. Not too sure if the cleansing should be the responsibility of this gem.

I would be onboard if we decided that the library should strip all leading, and possibly, trailing whitespace characters, but as I lead with, this feels very specific.

dentarg · 2018-11-28T22:43:48Z

Is this on parity with what we do for the .NET counterpart?

Looks like that:

$ ruby -e 'puts "\u00A0https://example.com/foo/\u00A0"' | mono trunk/Deploy/Normalizer/Normalizer.exe
 https://example.com/foo/ ,https://www.example.com/foo,5591070627727872931

$ ruby -e 'puts "\u0020https://example.com/foo/\u0020"' | mono trunk/Deploy/Normalizer/Normalizer.exe
" https://example.com/foo/ ",https://www.example.com/foo,5591070627727872931

Because this feels quite specific to a single use-case. Not too sure if the cleansing should be the responsibility of this gem. I would be onboard if we decided that the library should strip all leading, and possibly, trailing whitespace characters, but as I lead with, this feels very specific.

I agree that it is specific. After I did this, I realised that if we do this, we probably want to remove all whitespace, but I haven't had the time to try to do it.

Re: trailing whitespace, I saw after I did this that Addressable keeps the non-breaking space:

$ ruby -raddressable/uri -e 'p Addressable::URI.parse("https://example.com/foo/\u00A0").to_s'
"https://example.com/foo/ "

$ ruby -raddressable/uri -e 'p Addressable::URI.heuristic_parse("https://example.com/foo/\u00A0").to_s'
"https://example.com/foo/ "

Addressable::URI#normalize also keeps it:

$ ruby -raddressable/uri -e 'p Addressable::URI.heuristic_parse("https://example.com/foo/\u00A0").normalize.to_s'
"https://example.com/foo/%20"

The behaviour isn't the same for ordinary space (and heuristic_parse and parse differs):

$ ruby -raddressable/uri -e 'p Addressable::URI.parse("https://example.com/foo/\u0020").to_s'
"https://example.com/foo/ "

$ ruby -raddressable/uri -e 'p Addressable::URI.parse("https://example.com/foo/\u0020").normalize.to_s'
"https://example.com/foo/"

$ ruby -raddressable/uri -e 'p Addressable::URI.heuristic_parse("https://example.com/foo/\u0020").to_s'
"https://example.com/foo/"

Addressable::URI.parse blows up on any leading space:

$ ruby -raddressable/uri -e 'p Addressable::URI.parse("\u00A0https://example.com/foo/").to_s'
/Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:874:in `scheme=': Invalid scheme format:  https (Addressable::URI::InvalidURIError)
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:799:in `block in initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:2355:in `defer_validation'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:796:in `initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `new'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `parse'
  from -e:1:in `<main>'

$ ruby -raddressable/uri -e 'p Addressable::URI.parse("\u0020https://example.com/foo/").to_s'
/Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:874:in `scheme=': Invalid scheme format:  https (Addressable::URI::InvalidURIError)
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:799:in `block in initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:2355:in `defer_validation'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:796:in `initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `new'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `parse'
  from -e:1:in `<main>'

Addressable::URI.heuristic_parse do handle the ordinary space but not the non-breaking space:

$ ruby -raddressable/uri -e 'p Addressable::URI.heuristic_parse("\u0020https://example.com/foo/").to_s'
"https://example.com/foo/"

$ ruby -raddressable/uri -e 'p Addressable::URI.heuristic_parse("\u00A0https://example.com/foo/").to_s'
/Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:874:in `scheme=': Invalid scheme format:  https (Addressable::URI::InvalidURIError)
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:799:in `block in initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:2355:in `defer_validation'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:796:in `initialize'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `new'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:136:in `parse'
  from /Users/dentarg/.gem/ruby/2.4.2/gems/addressable-2.5.2/lib/addressable/uri.rb:207:in `heuristic_parse'
  from -e:1:in `<main>'

$ ruby -raddressable/uri -e 'p Addressable::VERSION::STRING'
"2.5.2"

dentarg · 2018-11-28T22:46:45Z

Not too sure if the cleansing should be the responsibility of this gem.

I think it should, at least leading whitespace. An alternative could be to put it in Twingly::URL::Utilities.extract_valid_urls, but I think .parse is nicer, we use it more.

dentarg · 2018-11-28T22:51:36Z

Our current behaviour

$ ruby -rtwingly/url -e 'p Twingly::URL.parse("\u0020https://example.com/foo/").to_s'
"https://example.com/foo/"

$ ruby -rtwingly/url -e 'p Twingly::URL.parse("\u00A0https://example.com/foo/").to_s'
""

$ ruby -rtwingly/url -e 'p Twingly::URL::VERSION'
"5.1.1"

walro · 2018-11-29T06:12:52Z

$ ruby -rtwingly/url -e 'p Twingly::URL.parse("\u00A0https://example.com/foo/").to_s'
""

OK, this is pretty bad behavior which I guess got you started.

I agree that it is specific. After I did this, I realised that if we do this, we probably want to remove all whitespace, but I haven't had the time to try to do it.

It sounds like we agree :)

I think it should, at least leading whitespace. An alternative could be to put it in Twingly::URL::Utilities.extract_valid_urls, but I think .parse is nicer, we use it more.

When I meant if it should be the responsibility of the gem, I was talking about the specific nbsp case, as per above - I think it's fine (or even a good thing) that we do it in general

jage · 2019-01-31T10:25:01Z

I'm not 100% sure how we should handle this case, but a suggestion is that we add as an option. I.e. Twingly::URL.parse(url, clean: true) and do a minor release. It won't change (potentially break) any existing usage of this code, but all our projects get this ability with a small change.

In the future I think this might very well become a defaut (major release), but right now I would need time that I don't have to verify how this could affect our systems (i.e. cause new exceptions in downstream projects), and if this is the route forward (i.e. make sure everything works like this). By adding it as an option we can decide per project.

WDYT @dentarg @walro ?

walro · 2019-01-31T10:28:26Z

Sounds good to me to make it an optional thing.

dentarg · 2019-01-31T10:32:22Z

I think it might take more time to make it an option than to understand how breaking it would be. I think we should make it more general as @walro and I have discussed, and if need be, bump major the major version when we release it.

dentarg · 2019-01-31T10:34:01Z

We already do some cleaning so also I don't think it is a good idea to add clean:. I prefer to keep the usage simple.

walro · 2019-01-31T10:35:25Z

I think it might take more time to make it an option than to understand how breaking

Wouldn't a naïve solution work:

...
... = clean_input(input) if clean
...

walro · 2019-01-31T10:35:42Z

We already do some cleaning so also I don't think it is a good idea to add clean:. I prefer to keep the usage simple.

Ah I see.

jage · 2019-01-31T10:39:46Z

Sure, just to clarify I'm not worried about bumping the major, I'm worried about potential new issues in other systems that might pop up. I.e. and upgrade to twingly-url will be blocked by some other system not liking this. Or that we are "forced" to implement this in more places. But if we look into it, of course this is a good change 👍

Adding this to extract_valid_urls would be much more straight forward IMHO, "just" split or more than space. i.e.. we don't need to worry as much about compatibility (which I agree, might not be an issue here, but I don't know). But it's not the same feature, then you'd need to use the utilities method and not parse directly, but just wanted to note this anyway.

dentarg · 2019-01-31T10:46:10Z

Seeing that our .NET library do handle this (#126 (comment)) I don't think we should be afraid of implementing it here and start using it.

I'm worried about potential new issues in other systems that might pop up. I.e. and upgrade to twingly-url will be blocked by some other system not liking this.

I'm not that worried, and as long as we don't do like this, I think we will be fine with our usual approach. We deploy the new version to one system at a time, and observe the outcome. I think that is less work than an extending the API. I'm afraid that it wont be used at all then.

jage · 2019-01-31T10:50:02Z

I think I got stuck on the phrase "Looks like that:". I mean, if it is on par, then it should be fine.

jage · 2019-01-31T15:57:54Z

I'm thinking we should try get this in next week, this should, together with #128 (and maybe #123 ), make for a good release (patch or minor).

jage · 2019-02-05T09:45:32Z

Not sure this should stop this, but just wanted to note a subtle difference:

[1] pry(main)> Twingly::URL.parse(" \u00A0 https://example.com/foo/ \u00A0 ").to_s
=> ""

vs.

$ ruby -e 'puts " \u00A0 https://example.com/foo/ \u00A0 "' | mono trunk/Deploy/Normalizer/Normalizer.exe
"   https://example.com/foo/   ",https://www.example.com/foo,5591070627727872931

dentarg · 2019-02-05T11:46:35Z

Hmm

$ ruby -e 'puts " \u00A0 https://example.com/foo/\u{1F4A9}"' | mono trunk/Deploy/Normalizer/Normalizer.exe
"   https://example.com/foo/💩","   https://example.com/foo/💩",9474829912736228245

$ ruby -e 'puts " \u00A0 https://example.com/foo/ \u{1F4A9}" ' | mono trunk/Deploy/Normalizer/Normalizer.exe
"   https://example.com/foo/ 💩","   https://example.com/foo/ 💩",1824702514565726115

$ ruby -e 'puts " \u{1F4A9} https://example.com/foo/ \u{1F4A9}" ' | mono trunk/Deploy/Normalizer/Normalizer.exe
" 💩 https://example.com/foo/ 💩"," 💩 https://example.com/foo/ 💩",6741913218986405642

dentarg · 2019-02-05T11:53:49Z

I don't think this PR is in a state to be merged right now.

For starters, as discussed above, it should be more general, and handle more whitespace (maybe we only need to add the regular space to be happy, that should fix the discrepancy seen in #126 (comment)).

Further more, I'm not sure that .parse should trim trailing whitespace. Maybe the character is part of the URL and it breaks if we remove it?

twingly-url/lib/twingly/url.rb

Line 78 in 991dbaa

input = input.sub(/#{NBSP}\z/, "")

When we normalize, we can trim trailing whitespace, to achieve same output as the .NET normalizer.

dentarg · 2019-02-05T12:11:26Z

I'm not sure that .parse should trim trailing whitespace

Ah, we already do trim the regular space.

jage · 2019-02-05T12:11:32Z

For starters, as discussed above, it should be more general, and handle more whitespace (maybe we only need to add the regular space to be happy, that should fix the discrepancy seen in #126 (comment)).

That can be added later, no need to rewrite everything in one PR. I don't think that's an argument for stopping this, we can just continue the work in a new PR?

... we only need to add the regular space to be happy ...

Not sure what you mean? Regulare space is already removed when parsing?

Further more, I'm not sure that .parse should trim trailing whitespace. Maybe the character is part of the URL and it breaks if we remove it?

Yeah, that is a risk.

I think we should close this PR if we aren't sure what we want to achieve here, might be better to work out a plan for the .parse function in an issue and then focus on the code development? It sounds like some ideas mentioned in this PR shouldn't be in .parse at all, but rather in a broader "extract URL(s) from this string".

I've actively put effort validating stuff here today, I thought you wanted this merged. I think this is ok, there is no contract between our implementations and we're more strict here AFAIK than in our .NET code, so the risk of adding this should be small (but not zero). BUT, if we're not sure about this I'd rather see this closed.

dentarg · 2019-02-05T12:43:15Z

Ah, we already do trim the regular space.

I had a hard time seeing where this was done, but it is the heuristic parsing in Addressable that does it:

$ ruby -rbundler/setup -r./lib/twingly/url -e 'p Addressable::URI.heuristic_parse(" https://example.com/foo/ ").to_s'
"https://example.com/foo/"

Related to #126, making the whitespace removal more general.

Seems more common to use "leading" than "starting" when talking about whitespace.

Strip input from both space and non-breaking space

lib/twingly/url.rb

Because we want to clean the input (#126), we are always working with a string. #126 (comment)

walro · 2019-02-05T16:04:07Z

(maybe PR title needs a slight tweak btw)

dentarg added 2 commits November 21, 2018 17:25

Handle URLs with non-breaking space

7ebf239

https://en.wikipedia.org/wiki/Non-breaking_space

Be consistent, use uppercase for Unicode codepoint

e62d427

Merge branch 'master' into handle-non-breaking-space

991dbaa

jage approved these changes Feb 5, 2019

View reviewed changes

Strip input from both space and non-breaking space

759028a

Related to #126, making the whitespace removal more general.

dentarg mentioned this pull request Feb 5, 2019

Strip input from both space and non-breaking space #131

Merged

dentarg added 3 commits February 5, 2019 14:13

Terminology: starting -> leading

1bdb310

Seems more common to use "leading" than "starting" when talking about whitespace.

Only string literals get automatic freeze

31a56d3

Improve naming, use constant for whitespace regex

77d54ad

dentarg added 2 commits February 5, 2019 15:52

Improve naming

ae9e161

Merge pull request #131 from twingly/strip-whitespace

e9f58da

Strip input from both space and non-breaking space

walro reviewed Feb 5, 2019

View reviewed changes

lib/twingly/url.rb Outdated Show resolved Hide resolved

Remove dead code

e43ac10

Because we want to clean the input (#126), we are always working with a string. #126 (comment)

walro approved these changes Feb 5, 2019

View reviewed changes

dentarg changed the title ~~Handle URLs with non-breaking space~~ Strip URLs of leading and trailing non-breaking space (and space, but we already did) Feb 5, 2019

dentarg merged commit f3d3cbb into master Feb 5, 2019

dentarg deleted the handle-non-breaking-space branch February 5, 2019 16:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strip URLs of leading and trailing non-breaking space (and space, but we already did) #126

Strip URLs of leading and trailing non-breaking space (and space, but we already did) #126

dentarg commented Nov 21, 2018 •

edited

walro commented Nov 21, 2018 •

edited

dentarg commented Nov 28, 2018 •

edited

dentarg commented Nov 28, 2018

dentarg commented Nov 28, 2018 •

edited

walro commented Nov 29, 2018 •

edited

jage commented Jan 31, 2019

walro commented Jan 31, 2019

dentarg commented Jan 31, 2019

dentarg commented Jan 31, 2019

walro commented Jan 31, 2019

walro commented Jan 31, 2019

jage commented Jan 31, 2019

dentarg commented Jan 31, 2019

jage commented Jan 31, 2019

jage commented Jan 31, 2019

jage commented Feb 5, 2019

dentarg commented Feb 5, 2019

dentarg commented Feb 5, 2019 •

edited

dentarg commented Feb 5, 2019

jage commented Feb 5, 2019

dentarg commented Feb 5, 2019

walro commented Feb 5, 2019

Strip URLs of leading and trailing non-breaking space (and space, but we already did) #126

Strip URLs of leading and trailing non-breaking space (and space, but we already did) #126

Conversation

dentarg commented Nov 21, 2018 • edited

walro commented Nov 21, 2018 • edited

dentarg commented Nov 28, 2018 • edited

dentarg commented Nov 28, 2018

dentarg commented Nov 28, 2018 • edited

walro commented Nov 29, 2018 • edited

jage commented Jan 31, 2019

walro commented Jan 31, 2019

dentarg commented Jan 31, 2019

dentarg commented Jan 31, 2019

walro commented Jan 31, 2019

walro commented Jan 31, 2019

jage commented Jan 31, 2019

dentarg commented Jan 31, 2019

jage commented Jan 31, 2019

jage commented Jan 31, 2019

jage commented Feb 5, 2019

dentarg commented Feb 5, 2019

dentarg commented Feb 5, 2019 • edited

dentarg commented Feb 5, 2019

jage commented Feb 5, 2019

dentarg commented Feb 5, 2019

walro commented Feb 5, 2019

dentarg commented Nov 21, 2018 •

edited

walro commented Nov 21, 2018 •

edited

dentarg commented Nov 28, 2018 •

edited

dentarg commented Nov 28, 2018 •

edited

walro commented Nov 29, 2018 •

edited

dentarg commented Feb 5, 2019 •

edited