Strip URLs of leading and trailing non-breaking space (and space, but we already did) #126
Is this on par with what we do for the .NET counterpart? I ask because this feels quite specific to a single use case, and I'm not sure the cleansing should be the responsibility of this gem. I would be on board if we decided that the library should strip all leading, and possibly trailing, whitespace characters, but as I led with, this feels very specific. |
Looks like that:
I agree that it is specific. After I did this, I realised that if we do this, we probably want to remove all whitespace, but I haven't had the time to try to do it. Re: trailing whitespace, I saw after I did this that Addressable keeps the non-breaking space:
The behaviour isn't the same for ordinary space (and
|
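For reference, a minimal sketch in plain Ruby (no Addressable or twingly-url involved) of why the non-breaking space survives ordinary stripping. The sample URL is taken from later in this thread; the variable names are only illustrative.

```ruby
# NO-BREAK SPACE (U+00A0) surrounded by ordinary spaces on both sides.
input = " \u00A0 https://example.com/foo/ \u00A0 "

# String#strip only removes ASCII whitespace (and NUL), so it stops at
# the first non-breaking space it meets on either side.
stripped = input.strip

# In a UTF-8 string the POSIX bracket [[:space:]] matches U+00A0,
# while \s only matches ASCII whitespace.
nbsp_matches_posix_space = "\u00A0".match?(/[[:space:]]/)
nbsp_matches_backslash_s = "\u00A0".match?(/\s/)

puts stripped.inspect
puts nbsp_matches_posix_space
puts nbsp_matches_backslash_s
```

So any general whitespace removal would probably need `[[:space:]]` (or an explicit `\u00A0`) rather than `strip` or `\s`.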
I think it should, at least leading whitespace. An alternative could be to put it in |
Our current behaviour
|
OK, this is pretty bad behavior which I guess got you started.
It sounds like we agree :)
When I asked whether it should be the responsibility of the gem, I was talking about the specific nbsp case, as per above - I think it's fine (or even a good thing) that we do it in general |
I'm not 100% sure how we should handle this case, but a suggestion is that we add it as an option. I.e. In the future I think this might very well become a default (major release), but right now I would need time that I don't have to verify how this could affect our systems (i.e. cause new exceptions in downstream projects), and whether this is the route forward (i.e. make sure everything works like this). By adding it as an option we can decide per project. |
Sounds good to me to make it an optional thing. |
I think it might take more time to make it an option than to understand how breaking it would be. I think we should make it more general, as @walro and I have discussed, and if need be, bump the major version when we release it. |
We already do some cleaning, so I also don't think it is a good idea to add |
Wouldn't a naïve solution work: ...
... = clean_input(input) if clean
... |
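A rough sketch of what the naïve opt-in variant above could look like. Everything here is hypothetical: `ExampleParser`, the `clean:` keyword and the body of `clean_input` are assumptions made for illustration, not the twingly-url API, and the real `parse` would of course go on to build a URL object instead of returning the string.

```ruby
# Hypothetical module for illustration only; not the twingly-url API.
module ExampleParser
  module_function

  # Assumed implementation: remove leading and trailing Unicode
  # whitespace, which includes NO-BREAK SPACE (U+00A0).
  def clean_input(input)
    input.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
  end

  # `clean:` is the opt-in flag from the discussion. It is off by
  # default, so existing callers would see no change in behaviour.
  def parse(input, clean: false)
    input = clean_input(input) if clean
    input # a real parser would continue with the actual parsing here
  end
end

puts ExampleParser.parse("\u00A0https://example.com/", clean: true).inspect
puts ExampleParser.parse("\u00A0https://example.com/").inspect
```

The trade-off raised in the thread still applies: the flag keeps the change non-breaking, but it only helps callers that remember to pass it.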
Ah I see. |
Sure, just to clarify: I'm not worried about bumping the major, I'm worried about potential new issues in other systems that might pop up. I.e. an upgrade to twingly-url will be blocked by some other system not liking this. Or that we are "forced" to implement this in more places. But if we look into it, of course this is a good change 👍 Adding this to |
Seeing that our .NET library does handle this (#126 (comment)), I don't think we should be afraid of implementing it here and starting to use it.
I'm not that worried, and as long as we don't do like this, I think we will be fine with our usual approach. We deploy the new version to one system at a time, and observe the outcome. I think that is less work than extending the API. I'm afraid that it won't be used at all then. |
I think I got stuck on the phrase "Looks like that:". I mean, if it is on par, then it should be fine. |
Not sure this should stop this, but just wanted to note a subtle difference:
[1] pry(main)> Twingly::URL.parse(" \u00A0 https://example.com/foo/ \u00A0 ").to_s
=> ""
vs.
$ ruby -e 'puts " \u00A0 https://example.com/foo/ \u00A0 "' | mono trunk/Deploy/Normalizer/Normalizer.exe
" https://example.com/foo/ ",https://www.example.com/foo,5591070627727872931 |
Hmm
|
I don't think this PR is in a state to be merged right now. For starters, as discussed above, it should be more general, and handle more whitespace (maybe we only need to add the regular space to be happy, that should fix the discrepancy seen in #126 (comment)). Furthermore, I'm not sure that twingly-url/lib/twingly/url.rb Line 78 in 991dbaa
When we normalize, we can trim trailing whitespace, to achieve same output as the .NET normalizer. |
Ah, we already do trim the regular space. |
That can be added later, no need to rewrite everything in one PR. I don't think that's an argument for stopping this, we can just continue the work in a new PR?
Not sure what you mean? Regular space is already removed when parsing?
Yeah, that is a risk. I think we should close this PR if we aren't sure what we want to achieve here, might be better to work out a plan for the I've actively put effort validating stuff here today, I thought you wanted this merged. I think this is ok, there is no contract between our implementations and we're more strict here AFAIK than in our .NET code, so the risk of adding this should be small (but not zero). BUT, if we're not sure about this I'd rather see this closed. |
I had a hard time seeing where this was done, but it is the heuristic parsing in Addressable that does it:
|
Related to #126, making the whitespace removal more general.
Seems more common to use "leading" than "starting" when talking about whitespace.
Strip input from both space and non-breaking space
Because we want to clean the input (#126), we are always working with a string. #126 (comment)
(maybe PR title needs a slight tweak btw) |
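As an illustration of the "always working with a string" point in the commit message above, here is a small sketch that coerces the input before stripping. The helper name `strip_surrounding_whitespace` is made up for this example, and using `to_s` plus `[[:space:]]` is an assumption about how it could be done, not necessarily what the commit does.

```ruby
# Hypothetical helper; the name and approach are assumptions.
def strip_surrounding_whitespace(input)
  # to_s turns nil (and other non-string input) into a string first,
  # so the regexp below always operates on a String.
  input.to_s.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

puts strip_surrounding_whitespace(" \u00A0 https://example.com/foo/ \u00A0 ").inspect
puts strip_surrounding_whitespace(nil).inspect
```

Whitespace inside the URL is left alone; only the leading and trailing runs are removed.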
https://en.wikipedia.org/wiki/Non-breaking_space
Background: Got a big list of URLs (Excel file) and many values in the "URL" column started with the NO-BREAK SPACE character.