Skip to content

Make URL schema case insensitive #2620

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 14, 2021
Merged

Make URL schema case insensitive #2620

merged 1 commit into from
Oct 14, 2021

Conversation

mkykadir
Copy link
Contributor

Fixes #2619

Default port and common internet schema checks are case insensitive now. Schema stored in URL structure still respects to user's letters.

@s-ludwig
Copy link
Member

We should really aim to keep @nogc here, because this is potentially performance critical, and the plan for URL is also to work similar to GenericPath, so that at least URL(somestring).toString() is @nogc.

In fact I had a very similar situation recently, just with file extensions. Since the list of extensions in that case was fixed, it was safe to use a temporary stack allocated buffer of fixed size and write the lower-case version of the extension into it. Since we have the possibility to add user-defined schemas here, that unfortunately doesn't work without constraints.

However, I think that limiting the length of schema names in registerCommonInternetSchema to something reasonable (e.g. 128 bytes) is acceptable here - any longer schema name would then just cause isCommonInternetSchema to return false.

@mkykadir
Copy link
Contributor Author

@s-ludwig I didn't touch string due to dependency on the string type;

  • switch/case and
  • StringSet in default case

@Geod24
Copy link
Contributor

Geod24 commented Oct 13, 2021

@s-ludwig : I was looking at this and discussing with @mkykadir , and it seems to me that isCommonInternetSchema should also take a port as parameter when registering a new schema, and that should be stored in st_commonInternetSchemas.

Comment on lines 556 to 560
string lowerschema = schema;
try
lowerschema = schema.toLower();
catch (Exception)
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @s-ludwig means is something like this:

if (schema.length > 128) return false;
char[128] buffer;
buffer[0 .. schema.length] = schema[];
scope lowerSchema = buffer[0 .. schema.length];
lowerSchema.toLowerInPlace();

Unfortunately I think that it is not @nogc because it decodes and thus can throw.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach:

immutable ubyte CaseOffset = 'a' - 'A';
foreach (char c; schema)
{
    // We should only get ASCII input, anything else is rejected
    if (c & 0b1000_0000) return false;
    if (c >= 'A' && c <= 'Z')
        buffer[idx] = cast(char) (c + CaseOffset);
    else
        buffer[idx] = c;
    }
}

It's crude but should work. What do you think @s-ludwig ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should work, there is also std.ascii.toLower to make this a little bit cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little bit refactoring, works well,

char[128] lowerschema = '\0';
if (schema.length >= 128) return false;
foreach (ix, char c; schema)
{
  if (!isASCII(c)) return false;
  lowerschema[ix] = toLower(c);
}

but since StringSet requires string parameters, buffer has to be casted which requires a trusted scope

() @trusted {
  return set ? set.contains(cast(string) lowerschema) : false;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can alternatively just change contains to accept scope const(char)[] without generating a compile error here.

@Geod24
Copy link
Contributor

Geod24 commented Oct 13, 2021

We should really aim to keep @nogc here, because this is potentially performance critical, and the plan for URL is also to work similar to GenericPath, so that at least URL(somestring).toString() is @nogc.

That's not currently the case though, we always allocate when calling toString. IMO allowing the user to control where the output is written (#2621) is a better approach, as it allows composition.

I think some users will want to keep user input, some will prefer to normalize it. We would really like a way to normalize it, even if it is a method called explicitly that allocates, as it's a rare (in the grand scheme of things) event.

@s-ludwig
Copy link
Member

We should really aim to keep @nogc here, because this is potentially performance critical, and the plan for URL is also to work similar to GenericPath, so that at least URL(somestring).toString() is @nogc.

That's not currently the case though, we always allocate when calling toString. IMO allowing the user to control where the output is written (#2621) is a better approach, as it allows composition.

Hence why "the plan" ;-)
No really, the important part is the constructor. Having toString @nogc is just a perk that is easily possible when the constructor just stores the original string (and possibly a few precomputed indices).

I think some users will want to keep user input, some will prefer to normalize it. We would really like a way to normalize it, even if it is a method called explicitly that allocates, as it's a rare (in the grand scheme of things) event.

We are not only talking about user input; you might be loading a bunch of URLs out of an XML or CSV file, where allocating per item can have a considerable impact. Adding a normalization functionality makes sense, though, and I'd also see that allocate in all realistic URL implementation possibilities, unless the URL is already normalized.

OT: would see the normalization doing three things:

  1. lower case schema and host names
  2. remove explicit port if same as default port
  3. use InetPath.normalize for the path part

Do you have anything else in mind (apart from what the "fuzzy" parser we've talked about earlier already does)?

@Geod24
Copy link
Contributor

Geod24 commented Oct 13, 2021

Do you have anything else in mind (apart from what the "fuzzy" parser we've talked about earlier already does)?

Remove the ? if the query is empty.
The fuzzy parser is slightly different (it should just generate normalized URLs IMO) and I believe one can be done without the other.

try {
lowerschema = schema.toLower();
} catch (Exception) {
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this function nothrow @s-ludwig ? I think any error should be reported, and not silently ignored.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be assert(false, e.msg) instead of return.

@s-ludwig
Copy link
Member

Apart from that assert, looks good to merge!

@s-ludwig s-ludwig enabled auto-merge October 14, 2021 14:19
@s-ludwig s-ludwig merged commit bf4ad4c into vibe-d:master Oct 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

URL schema is case sensitive
3 participants