Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape all strings by default #122

Merged
merged 1 commit into from
Mar 29, 2016
Merged

Conversation

gyim
Copy link
Contributor

@gyim gyim commented Dec 12, 2015

This commit implements issue #114 (Safe HTML by default). It introduces
a few breaking changes:

  1. Since all strings are escaped by default, adding HTML markup from
    string literals or expressions will not work. Use hiccup.util/raw-str
    to prevent these values from being escaped.
  2. hiccup.util/escape-html, hiccup.core/h and hiccup.core/html
    returns a RawString object instead of a String. They can be converted to
    java.lang.String with the built-in str function.

Note that unlike hiccup.util/html, macros in hiccup.page (html4,
html5, xhtml) still return a string so that their result can be
passed directly to Ring. In the rare cases when the output is embedded
to another html page (e.g. into an iframe) it should be also tagged with
hiccup.util/raw-str.

@weavejester
Copy link
Owner

Thanks for the patch. I have a few suggested changes.

  1. Let's rename raw-str to raw-string. We don't need to abbreviate a function that isn't going to be used often.
  2. Let's remove the h function entirely. The abbreviation isn't necessary anymore.
  3. The escape-html function should probably still return a string. With automatic escaping, it's not going to be used under the html macro as it was before, so it should be treated like a utility function for edge case uses.
  4. I don't think the html4, xhtml and html5 macros should return strings. I think we should favor consistency over convenience, especially since Ring integration problems can be solved with middleware or similar.
  5. I think the RawString type itself should be moved into the hiccup.compiler namespace, as it's directly to do with the compiler. However, the hiccup.util/raw-string function should remain in the same place.
  6. You backtick a few things that could be calculated inline. I'll add comments to the lines themselves.
  7. Some more tests are necessary. In particular, what happens if you have a html inside a html? What happens if you call a function that uses html? That sort of thing.

@@ -237,8 +241,11 @@
(doall (for [expr content]
(cond
(vector? expr) (compile-element expr)
(string? expr) `(escape-html ~expr)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you know the expression is a literal string, you can just use (escape-html expr) and precompute it. Same for the next line.

@gyim
Copy link
Contributor Author

gyim commented Dec 14, 2015

Thanks for the feedback, it is very helpful, and I agree with most of your points. There are a few things though where I would take a different approach:

  1. I would introduce as few breaking changes as possible, and make these changes easy to discover and fix. Ideally, users should not notice any change unless they rely on the current non-escaping behavior (e.g. by embedding HTML entites to string literals). If this is not possible, a few minor changes are OK, but requiring hundreds of changes per project would certainly annoy users and possibly slow down the adoption. So I would not remove or break the (h) function, at least not immediately. I would make it unnecessary and mark it as obsolete, and maybe remove it a few years later.
  2. I would expose RawString as little as possible. It is an implementation detail, and users should not care about its existence: they want a java.lang.String output, not some custom object. I agree about preferring consistency over convenience though: it is not right to return RawString at one place and java.lang.String at another.

You pointed out that escape-html could return a string, and this gave me an idea: why not make everything return a string except raw-string? So how about this:

  1. html, html4, html5 and xhtml should all return a string. When you embed a "partial template" (html inside html) you have to use raw-string. I think this is a clean and consistent behavior, and using "partial templates" is probably not a very common use case. Also, if you miss it, you will immediately see it on the output (a whole HTML subpage will be escaped, which is easy to spot)
  2. h should become an alias for identity, and be marked as obsolete in the doc string. I would also rename the escape-html function so if somebody uses it direcly instead of h (which should be the minority of the use cases) it will cause a compile-time error rather than a lot of double-escaped strings (which are hard to discover if escape-html is used on non-HTML data)
  3. raw-string returns a RawString. Since it is a new function there will be no surprises for existing users. Since this is the only function that returns a RawString, it will be possible to render an entire webpage without ever creating a RawString instance.

@weavejester
Copy link
Owner

I would introduce as few breaking changes as possible

I'm going to have to disagree pretty strongly with this point. If we must introduce breaking changes, then it's better to have them all in one release, rather than spread across multiple releases. In the long term it's much less painful.

The reason for this is that any breaking change, no matter how small, carries with it a huge fixed cost. The is particularly true for a language like Clojure that doesn't allow multiple versions of the same package in the same dependency tree. For example, if someone is using the Ring-Devel package, which depends on Hiccup 1.x, they won't be able to use Hiccup 2.x.

So I'd prefer that people have to deal with all the issues with moving to Hiccup 2.0 once, rather than spread the pain across Hiccup 3.0, 4.0 and so forth.

I would expose RawString as little as possible. It is an implementation detail, and users should not care about its existence

RawString is the return value from html and raw-string, which means it's part of the API and therefore not an implementation detail. We've introduced a new type so we can distinguish between a string and the output from html. It's very much part of the design.

We shouldn't be half-hearted in how we define APIs. Either we hide an abstraction away entirely, or we embrace it. We shouldn't have an abstraction that we expect the end user to make use of, and at the same time try and pretend it's not there.

html, html4, html5 and xhtml should all return a string. When you embed a "partial template" (html inside html) you have to use raw-string. I think this is a clean and consistent behavior, and using "partial templates" is probably not a very common use case.

This would work, except that "partial templates" are a very common use case in Hiccup. Typically in Hiccup you'll have functions like:

(defn panel [title & body]
  [:div.panel [:h2.title title] body])

And if you want to speed them up, you add a html macro in:

(defn panel [title & body]
  (html [:div.panel [:h2.title title] body]))

This precomputes a lot of the output, effectively reducing the runtime problem down to a series of string concatenations, which is much more performant.

On the other hand, actually transforming the RawString into an output string is something that typically will only happen once in an application. In fact, you could easily add something like:

(extend-protocol compojure.response/Renderable
  hiccup.compiler.RawString
  (render [raw _] (str raw))

And then you could keep using html in the same way as you were before. The next version of Ring will have a similar protocol for extending the types allowed in the :body key. So from a practical perspective, not much would change.

If we're going to optimise for the common case, then html should return a RawString.

gyim pushed a commit to gyim/hiccup that referenced this pull request Dec 28, 2015
gyim pushed a commit to gyim/hiccup that referenced this pull request Dec 28, 2015
gyim pushed a commit to gyim/hiccup that referenced this pull request Dec 28, 2015
gyim pushed a commit to gyim/hiccup that referenced this pull request Dec 28, 2015
See discussion at pull request weavejester#122.
gyim pushed a commit to gyim/hiccup that referenced this pull request Dec 28, 2015
@gyim
Copy link
Contributor Author

gyim commented Dec 28, 2015

All right, I tried to finish the implementation as discussed above. I hope it's ready (or very close to ready) for a merge now.

So I'd prefer that people have to deal with all the issues with moving to Hiccup 2.0 once, rather than spread the pain across Hiccup 3.0, 4.0 and so forth.

Got it. To be honest, I would probably go with a backwards-compatible change (e.g. keep h forever, or make auto-escaping opt-in), even if it is a bit half-hearted or unconvenient. But I know that it has downsides, and it is clearly your choice.

RawString is the return value from html and raw-string, which means it's part of the API and therefore not an implementation detail. We've introduced a new type so we can distinguish between a string and the output from html. It's very much part of the design.

Let me put it in a different way.

The semantics of the RawString class is something like this: "it is a value that is almost a string, but hiccup.core/html does not escape it". This is a perfectly acceptable return value for the raw-string function, but it does not seem right to me that the top-level API should return such a value. The caller does not want an "almost a string": it cannot do anything with it except to turn it to a string immediately. But then what is the advantage of not returning a string in the first place?

It is true that it is more convenient if there are nested html calls all over the code. But nested html calls are an optimization, and while I admit that I do not have much data on this particular case, I think optimization is always a minority of use cases. People normally optimize when they measured a bottleneck - and in a typical web app there are a lot of things to optimize before HTML generation becomes a bottleneck (database queries, caching of computation-intensive data, etc.). And in a many use cases (e.g. admin pages, computation-intensive web apps) it cannot be a real bottleneck at all.

So I would not optimize for convenience of the nested html calls, at least not this way. If it is really a widely used feature, I would create an explicit shorthand function for that, or perhaps use a dynamic var-based technique to detect nested calls.

Anyway, it is my last wall-of-text ;) I just wanted to make sure that this change does not make things more difficult than necessary... If you change your mind, I would be happy to modify the code. Or I will finish it this way (if it still needs improvements).

@gyim
Copy link
Contributor Author

gyim commented Dec 28, 2015

I think the RawString type itself should be moved into the hiccup.compiler namespace, as it's directly to do with the compiler. However, the hiccup.util/raw-string function should remain in the same place.

Unfortunately, hiccup.compiler requires the hiccup.util namespace, so it would cause a circular dependency. I left RawString in hiccup.util for now, maybe it could be moved to a separate namespace.

@weavejester
Copy link
Owner

My preference is to favour consistency over convenience in APIs. If html returns a RawString, then the type of the input matches the type of the output. If html returns a string, then we introduce a degree of inconsistency.

Even if a user never nests html in their own code, what happens when they use someone else's code? Because functions are opaque, we can only optimise them if we have control over their source. This means that a performant library should use html liberally, in order to maximise the amount of HTML that can be precomputed.

Further, returning a custom type by default opens the way for additional functionality in future without breaking backward compatibility. At the moment we simply wrap a string, but we could potentially wrap a more complex structure that would allow, for instance, rendering with different whitespace options, or for different doctypes.

The fundamental problem with Hiccup 1.x is that strings are not a rich enough data type to represent everything we might want to communicate in the output. Not escaping by default is merely the most visible problem caused by the underlying design decision to rely on strings.

I also don't think we're losing much in terms of convenience. At most we're introducing an additional six characters the user has to type. At least we introduce zero and everything works as before. That seems a small price to pay for a more consistent, functional, and future-proof API.

@weavejester
Copy link
Owner

I lost track of this PR a little over Christmas. I believe there's still a little work to do on it? I'll add some notes to the source code.

@@ -152,7 +156,7 @@
"True if x is a literal value that can be rendered as-is."
[x]
(and (not (unevaluated? x))
(or (not (or (vector? x) (map? x)))
(or (not (or (vector? x) (map? x) (set? x)))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sets aren't used in Hiccup. Not sure why this is here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I misunderstood the intention of this code. This escapes literals when they are not treated specially (e.g. when you write [:span {} {:foo "bar"}] - not that it is too useful...). I thought it would be nice to treat sets the same way. But I revert this commit then.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But sets aren't used in Hiccup, so (set? x) will never be true.

@weavejester
Copy link
Owner

Aside from those two points, I think this PR can be squashed and merged.

@EwenG EwenG mentioned this pull request Jan 23, 2016
gyim pushed a commit to gyim/hiccup that referenced this pull request Mar 22, 2016
See discussion at pull request weavejester#122.
@gyim
Copy link
Contributor Author

gyim commented Mar 22, 2016

Sorry for the long delay... I modified the code according to your requests. If it's OK, I will squash the commits (the last commit is quite large, so I thought it's easier to review it this way).

Object
(^String toString [this] s)
(^boolean equals [this other]
(cond
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead write this as:

(and (instance? RawString other) (= s (.toString other)))

@weavejester
Copy link
Owner

Thanks for the update! I have one small code suggestion, and if you could remove the (set? x) change I think we're just about ready to merge.

Squashing the commits is a good idea :)

This commit implements issue weavejester#114 (Safe HTML by default). It introduces
a few breaking changes:

1. All strings that are passed to `html` are escaped. To support raw,
unescaped strings a new object type (RawString) was introduced. Use
`hiccup.util/raw-string` to prevent a string from being escaped.

2. To allow the nested usage of the `html` macro, `html` now returns a
RawString object instead of a string. It can be converted to a string
with the `str` function.

3. The `h` helper function was removed.
@gyim
Copy link
Contributor Author

gyim commented Mar 23, 2016

Done. I also removed an unnecessary test: checking whether escape-html returns a string (and not a RawString). Since this function did not change after all, and RawStrings cannot be compared to strings anymore, I think this test is no longer useful.

@weavejester
Copy link
Owner

Thanks for the updated commit! It looks good. I'll find some time this weekend to give it a last look over and manual test. If all goes well, I'll merge.

@weavejester weavejester merged commit 566286a into weavejester:master Mar 29, 2016
@weavejester
Copy link
Owner

Thanks for all your hard work on this PR, and thanks for being patient with my feedback :)

Incidentally, I think I've changed my mind about dropping h. If we keep it, but alias it to str and add a deprecation notice, then we can create Hiccup code that works in both 1.0 and 2.0.

@weavejester weavejester mentioned this pull request Jan 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants