Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL not handled/encoded correctly #1303

Closed
ozh opened this issue Apr 6, 2013 · 36 comments
Closed

URL not handled/encoded correctly #1303

ozh opened this issue Apr 6, 2013 · 36 comments
Labels
bug Something isn't working core database
Milestone

Comments

@ozh
Copy link
Member

ozh commented Apr 6, 2013

What steps will reproduce the problem?

  1. Enter the following long URL: http://domain.com/good space/
  2. The long URL is correctly stored as http://domain.com/good%20space/
  3. Now, enter the following long URL: http://domain.com/bad%20space/
  4. The long URL is incorrectly stored as http://domain.com/bad%2520space/

What is the expected output? What do you see instead?

What I expect is that YOURLS should be smart enough to know that '%20' is a space and should be left alone. Instead, what's happening is that YOURLS is converting the '%' to '%25' and therefore, the '%20' becomes '%2520'.

Perhaps there could be a check before cleaning up the long URL to detect '%20' or whatever else it could be (e.g. '%5C' for backspace, etc).


This is a COPY of Issue 1303: %20 in long URL not handled correctly, filed on Google Code before the project was moved on Github.

@LeoColomb
Copy link
Member

Seems to be encoding - decoding URL case.
Found two ways to fix it:

  • Be sure URL is decode before encode it:
urldecode($string);
urlencode($string);
save_in_db($string);
  • Search in string caracters which says it's encoded. For exemple:
if (!preg_match("@^[a-zA-Z0-9%+-_]*$@", $string))
    urlencode($string);
save_in_db($string);

@adigitalife
Copy link
Contributor

I'm not sure how to fix this properly in the code but I've implemented a crude workaround by creating a plugin:

yourls_add_filter( 'sanitize_url', 'fix_long_url' );
function fix_long_url( $url, $unsafe_url ) {
    $search = array ( '%2520', '%2521', '%2522', '%2523', '%2524', '%2525', '%2526', '%2527', '%2528', '%2529', '%252A', '%252B', '%252C', '%252D', '%252E', '%252F', '%253D', '%253F', '%255C', '%255F' );
    $replace = array ( '%20', '%21', '%22', '%23', '%24', '%25', '%26', '%27', '%28', '%29', '%2A', '%2B', '%2C', '%2D', '%2E', '%2F', '%3D', '%3F', '%5C', '%5F' );
    $url = str_ireplace ( $search, $replace ,$url );
    return yourls_apply_filter( 'after_fix_long_url', $url, $unsafe_url );

@ColinTips
Copy link

it works using above code.
but the problem is still the same when use API.
I am using wordpress plugin:pluginbuddy-yourls;
if the url include below code, the url will be stored incorrectly.

code:
/t?e=zGU34CA7K%2BPkqB07S4%2FK0CITy7klxxrJ35Nnc0iK%2FBdaKhVAmXafYUmD5KGJ2oqiWhej1Yse2lEI%2FXHtNGaPeSdAuj9tBnnfDunBLi4cn5tl2g%3D%3D

@LeoColomb
Copy link
Member

@ColinQi Please view my PR #1365.

@ColinTips
Copy link

@LeoColomb I just can't understand, when I goto yourls/admin/ to add such encoded url, stored correctly.
but when I use API to add such encoded url, it stored incorrectly.

@uwma
Copy link

uwma commented Sep 5, 2013

Thanks to @adigitalife! His solution may not be the most elegant one, it works!
http://en.uw.ma/pYrs

@innov8ion
Copy link

I was surprised to hear about this issue so long after 1.6 was released but glad to find a temporary fix. I created and activated the following plugin per adigitalife's info on this thread. Thanks, and will be watching for a more permanent fix.

Meanwhile, should this plugin be placed in the official plugin list here to help others? https://github.com/YOURLS/YOURLS/wiki/Plugin-List

@ozh
Copy link
Member Author

ozh commented Sep 18, 2013

I've just committed something that should fix that. Please try to break and report :) Re-open this issue if needed.

@ozh ozh closed this as completed Sep 18, 2013
@innov8ion
Copy link

Very nice, ozh. Is this the one commit you made to address this? 59dff6b

@ozh
Copy link
Member Author

ozh commented Sep 20, 2013

@innov8ion yes

@LeoColomb
Copy link
Member

Definitely, there is a issue with encoding. May need a full rewrite.

@LeoColomb LeoColomb reopened this Mar 14, 2016
@markwaters
Copy link

Not sure if this is related , but when I try and shorten a link with a percentage sign in it , for example -
http://erinjo.xyz/content/all?q=%23Eurovision
When I visit / expand the link , it changes to -
http://erinjo.xyz/content/all?q=#Eurovision
Which doesn't work as expected.
HTH.

@adigitalife
Copy link
Contributor

@markwaters I think it's tricky if the actual URL contains '%23' because it's actually the same as '#' so it gets converted automatically. There's been a discussion that maybe YOURLS should do any kind of encoding at all. I'm not sure what's the current status with that.

@JeffreyDunster
Copy link

I have a similar problem with Microsoft OneDrive links. They often include %2... and %3... in their hashed IDs, which are then converted to other charters, like /, +, " ", etc.

@maustyle
Copy link

hello, i still have the same issue. i have tried to install the plugin:

Fix Long URLs
https://github.com/adigitalife/yourls-fix-long-url/

with no avail.

@adigitalife
Copy link
Contributor

It depends on the exact characters in the URL that you have a problem with. My plugin only looks at some specific characters. If your problematic characters are not in the plugin, you'll need to modify the plugin to add them.

@maustyle
Copy link

maustyle commented Apr 9, 2018

thank you !!

@PopVeKind
Copy link
Contributor

I've just read through this (current and open) thread and it seems to be evolving with different problems. I am after a better understanding of the workings of YOURLS and have a few questions about this topic,

Why Encode?

  1. The Long URL does not originate with YOURLS, so why, for any reason, does YOURLS change it?
  2. Is it not a requirement for the source (human or program) to deliver a properly working Long URL?

If an API or a Plugin or Microsoft OneDrive provides a Long URL that does not work, is that YOURLS job to fix it? If I typed http://vekind.com/ instead of http://vekind.org/ is that the responsibility of YOURLS to correct .com to .org? Why not use just what is supplied?

Decoding

For the exact same reason, that YOURLS should not encode, YOURLS should not decode. If I typed http://vekind.org/bad%20url input It is not a YOURLS job to try to figure out what I meant. Likewise, if it is from an API, Bookmark program, or Plugin. It is the responsibility of the sending program to send the correct Long URL and all YOURLS should do is save it and use it as sent!

Database etc.

Database inputs should be screened for SQL injections, etc. However, this is not at all the same as URL encoding.

If a program (or human) provides the wrong information, should we not just make a note that program has a bug? So the Plugin or API or Bookmark program can be fixed?

The Real Solution

It seems to me the real solution for these errors (that fucking URL encoding problem) is to neither decode the URL nor encode the URL.
The most logical solution is to expect (demand) that the sending program (API, Bookmark, Plugin, or Human), sends the CORRECT Long URL and then YOURLS should simply use what was sent, AS-IS!

Real Reasons

Can any YOURLS Developer explain why YOURLS should try to fix wrong inputs from external sources? Why not fix the problem at the source Plugin, Bookmark, or API?

@PopVeKind
Copy link
Contributor

@ozh @LeoColomb - Please review this logic?

Suggested Solution

  1. I would suggest that all URL encode and URL decode be removed from the Core Code.
  2. URL decode and URL encode are not needed for requesting programs that send a correct LongURL.
  3. As decode and encode are not needed on all sites, they can rightly be classified as Code Bloat,
  4. URL decode and URL encode should be offered as a Plugin for those people who need it to compensate for poorly constructed API, Plugin, or Bookmark programs that do not supply a working LongURL.

Second Suggestion

If moving this out of Core Code Bloat and into a Plugin is not desired, would it be acceptable to add a Core Option to disable all URL encode and URL decode code?

PopVeKind referenced this issue Apr 9, 2018
@ozh
Copy link
Member Author

ozh commented May 10, 2018

There is a need to encode or decode because depending on the context, URLs supplied are, or are not, raw text.

  • URL entered in the text box and click "Shorten"? Don't try to encode or decode, just use what's provided
  • but URL supplied via "prefix and shorten" (eg http://sho.rt/http://omglongurl/some~funk~chars?) will be coded
  • URL supplied via bookmarklet will be coded too, by the browser

@mackaaij
Copy link

mackaaij commented Jul 13, 2018

I think I ran into this issue when adding this long url: https://worcade.stackstorage.com/s/MzCWEihRfYldw5X?dir=/Terms of Service

If I leave the automatically generated shortcode, the shortlink works. If I customize the shortcode to 'terms', the shortlink breaks...

Jorn suggested a workaround: I now shortened the link with bit.ly and then "shortened" the link with our custom domain using Yourls.

@PopVeKind
Copy link
Contributor

@ozh I just saw your post today. It seems we can use a little logic to sort this out at the point YOURLS receives the long URL.

Decode Everything

First off, how about decode everything upon receipt? If the text box is not encoded the decoded output would be the same. Someone might copy/paste an encoded URL into the text box too. So it seems everything should be decoded upon receipt by YOURLS.
I would also encode everything just before saving to the database,

Dubble encoding

The problem seems to be double encoding without decoding. Double decoding or decoding unencoded text is not a problem.

Encode Everything

By encoding everything just before saving in the database, it reverses the first decode and makes the URL Internet ready. Encoding an unencoded URL would have no effect on its functionality.

Examples

  • Enter the following long URL: http://domain.com/good space/
  • The long URL is correctly stored as http%3A%2F%2Fdomain.com%2Fgood%20space%2F
  • Now, enter the following long URL: http%3A%2F%2Fdomain.com%2Fgood%20space%2F
  • The long URL is correctly stored as http%3A%2F%2Fdomain.com%2Fgood%20space%2F

PS

  • This is logic based on the comments on this thread.
  • I have not (yet) coded this onto a working YOURLS site.

@LeoColomb LeoColomb pinned this issue Dec 15, 2018
@rinogo
Copy link

rinogo commented Jan 3, 2019

Update

I'll leave this here in case it helps someone. I've discovered the issue, and it's unrelated to this thread. The problem was the "Keep Query String" plugin, which was activated on prod, but not on dev.

Carry on. :)


Just to add to this issue - I haven't read the entire thread, but hopefully this adds to the discussion.

I have two configurations, both essentially identical, for our dev and prod systems. The prod system produces URLs in which URL parameters have slashes encoded as %25%2F (double-encoded).

The dev system produces links that work just fine. That is, slashes in URL parameters are completely unencoded - they appear as /.

The difference between these two systems is minor. Both systems run CentOS and an essentially identical stack. Both systems run version 1.7.3 of YOURLS. I'm thinking it could be a difference in DB configurations or maybe due to the fact that the dev server runs Apache and the prod server runs Litespeed.

The most curious wrinkle on this entire mess is that the url that is stored in the YOURLS database is identical on both dev and prod. In other words, this means that the problem is somewhere in the code/stack responsible for converting a shorturl into a longurl.

Test cases

We see the same behavior via the GUI and when using the API.

longurl (input): https://test.com/narf blah.html?a=hello there&b=whatever/you/want

stored longurl in the database on both dev and prod: https://test.com/narf%20blah.html?a=hello%20there&b=whatever/you/want

converted shorturl (output) on dev (functions as desired): https://www.test.com/narf%20blah.html?a=hello%20there&b=whatever/you/want

converted shorturl (output) on prod (double-encodes slashes): https://www.test.com/narf%20blah.html?a=hello%2Bthere&b=whatever%252Fyou%252Fwant

@PopVeKind
Copy link
Contributor

@rinogo - That's good intel Rich,

the URL that is stored in the YOURLS database is identical on both dev and prod

The URL spaces were encoded as stored in the database, compared to the input.

At some point, the domain name changes and adds a www. Why? Is that a YOURLS change?

Maybe we should start gathering data on sites that are failing? Example:

Are all sites that fail running under CentOS? Debian? RedHat? FreeBSD? or Ubuntu?
Are all sites that fail running on Litespeed?
Are all sites that fail running PHP5? PHP7?
Are all sites that fail running MariaDB? MySQL?

What sites are NOT failing?

I run without this problem in:
Ubuntu/Apache/MySQL/php5
Ubuntu/Apache/MySQL/php7
Debian8/Nginx/MariaDB/php7

The error does seem to be after the long URL is retrieved from the DB.
Q. Why are we encoding anything AFTER we retrieve the URL from the Database?

Perhaps if we discover what the sites that fail have in common, it will lead to why some sites fail?

@rinogo
Copy link

rinogo commented Jan 7, 2019

Hi, @PopVeKind! I think that's a great idea! I can double-check my setup, but I think everything is working fine since I discovered that the discrepancy was due to a plugin that was installed on prod but not on dev. (I added an update to my post after the fact - perhaps you're working off of email notifications).

Regardless, I'll keep an eye on it; if we have problems, I'll definitely report back here. Thanks to you all (contributors and users alike) for making YOURLS awesome!

@jadkik
Copy link

jadkik commented Jan 9, 2020

I worked around the whole issue with encoding (or at least I hope so) by using a plugin that just uses the URL as it is sent by the user of the API or the form or wherever:

yourls_add_filter( 'encodeURI', 'do_not_encode' );

function do_not_encode( $result, $url ) {
        return $url;
}

I can't believe this issue has been open for 7 years on a project whose sole purpose it is to handle URLs.

If it's so important for the server to handle all the bad input, at the very least it should offer an option to turn off this behavior from the API if they're a well behaved client.

@ozh
Copy link
Member Author

ozh commented Jan 10, 2020

@jadkik in what context does this work? Adding URL manually + bookmarklet + adding "https://sho.rt/" in front of any URL ?

@dfbasis
Copy link

dfbasis commented Feb 11, 2020

@PopVeKind
Copy link
Contributor

I am trying to understand this and I have a question. Does anyone have a problem when entering the long URL into the yourls/admin/ textbox and then click "Shorten"?

I have been reading over this URL encoding issue and do not see this complaint. If this always works ok, I believe I need to focus on the differences between this method and the other methods.

  1. The Admin Textbox method. (yourls/admin/)
  2. The API method.
  3. The GET method "prefix and shorten" (eg http://sho.rt/http://omglongurl/some~funk~chars?)
  4. The Bookmarklet method.

Today the thought came to me that this is not a single problem, but a group of interrelated problems. My goal is to build a server to reproduce these faults on and then test to see where the code goes astray.

I will need actual faults, methods, and configurations to reproduce these problems.

@tbsampson your result is what I would expect by running your origional through urldecode. Did you submit your code via the GET method?

The PHP manual says,
Warning The superglobals $_GET and $_REQUEST are already decoded. Using urldecode() on an element in $_GET or $_REQUEST could have unexpected and dangerous results.

@dfbasis Your call also looks like it was run through urldecode, which changed the %2B%2B to ++. but when this was sent out as the redirect URL the browser changed the ++ to two spaces. Being trailing spaces, the browser just dropped them. Perhaps the urldecode should be changed to rawurldecode or by changing ++ back to %2B%2B before sending out.

@jadkik Your filter looks interesting, which method were you using when you had the failure, and which method did this filter fix for you?

@adigitalife your filter is interesting because it does not have the problem of decoding %2F, %3B, or %3D that @tbsampson is describing (not necessarily an encoding issue).

@ghost
Copy link

ghost commented Apr 16, 2020

Test URL from: #1303 (comment)
Result in YOURLS and in DB: https://test.com/narf%20blah.html?a=hello%20there&b=whatever/you/want
In the Vivaldi address bar: https://www.test.com/narf blah.html?a=hello there&b=whatever/you/want
When copied and pasted, it will come out as above when the option "Copy and Cut Encoded Address" is disabled, or as the the encoded URL in the DB when the option is enabled. The addition of www. is due to test.com's server config.
Result: YOURLS works perfectly.
Debian 10, Apache, PHP7.3, MariaDB

@YOURLS YOURLS deleted a comment from tbsampson Apr 24, 2020
@jadkik
Copy link

jadkik commented Apr 24, 2020

@ozh @PopVeKind sorry for not getting back to you on this. Here is my use case. Only via the API. I am still on 1.7.4, didn't update yet.

The main problem is with the plus-sign and at-sign in my case. YOURLS tends to replace it with an actual plus instead of a %2B, so when it reaches the front end the URL contains a space instead of a plus sign.

With the plugin:

Properly-encoded URL: https://sub.example.com/#/page/confirm/415371?email=username%2Bafter-plus-sign%40gmail.com
Shortened: http://short.exam.pl/2C
From the edit interface > Edit: https://sub.example.com/#/page/confirm/415371?email=username+after-plus-sign@gmail.com

$ curl -v http://short.exam.pl/2C
# ...
Location: https://sub.example.com/#/page/confirm/415371?email=username%2Bafter-plus-sign%40gmail.com
# ...

Without the plugin:

Properly-encoded URL: https://sub.example.com/#/page/confirm/565914?email=username%2Bafter-plus-sign%40gmail.com
Shortened: http://short.exam.pl/2D
From the edit interface > Edit: https://sub.example.com/#/page/confirm/565914?email=username+after-plus-sign@gmail.com

$ curl -v http://short.exam.pl/2D
# ...
Location: https://sub.example.com/#/page/confirm/565914?email=username+after-plus-sign@gmail.com
# ...

Usage

The properly encoded URL is generated and sent like that

import java.net.URLEncoder;

        // ...
        String longUrl = baseUrl + "/page/confirm/" +
                URLEncoder.encode(confirmCode, "UTF-8") +
                "?email=" + URLEncoder.encode(userEmail, "UTF-8");
        // ...
        OkHttpClient client = new OkHttpClient();
        RequestBody formBody = new FormBody.Builder()
                .add("format", "json")
                .add("action", "shorturl")
                .add("username", "username")
                .add("password", "password")
                .add("url", longUrl)
                .build();
        Request request = new Request.Builder()
                .url("http://short.exam.pl/yourls-api.php")
                .post(formBody)
                .build();

@jamiers99
Copy link

@jadkik So with your last post on this thread, did you find a solution. I'm not following what you read. Is there a plugin I can load that will prevent YourLS from taking tags like %20 and changing them to the actual character? I just updated to 1.7.9 and now many of my links are breaking.

@jadkik
Copy link

jadkik commented May 9, 2020

@jamiers99 I had posted that plugin which worked for me: #1303 (comment)

It worked for my use case, may break some other things like bookmark and the editing on the UI (haven't tested).

@ozh ozh unpinned this issue May 18, 2020
@ozh
Copy link
Member Author

ozh commented May 18, 2020

Closing: 2691 and associated PR will supersede

@ozh ozh closed this as completed May 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working core database
Projects
None yet
Development

No branches or pull requests