Skip to content

Conversation

@jgberry
Copy link

@jgberry jgberry commented Nov 14, 2025

As per the specification linked in the README.md:

If the data contains characters defined in RFC 3986 as reserved characters, or contains space characters, newline characters, or other non-printing characters, those characters must be percent-encoded.

This is presently not adhered to, which can cause the generation of invalid data URLs. Consider the below example:

from data_url import DataURL
url = DataURL.from_data("text/plain", False, "Hello World!")
print(url.url)  # Prints "data:text/plain,Hello World!", not a valid data URL!

Likewise, parsing of URLs with escape characters results in DataURL.data containing encoded data:

from data_url import DataURL
url = DataURL.from_url("data:,A%20brief%20note")
print(url.data) # Prints "A%20brief%20note", expected "A brief note"

This PR address both issues by unquoting non-base64 data when parsing URLs and by quoting non-base64 data when constructing URLs.

Caveats: If users are presently URL encoding their data prior to passing it to DataURL.from_data, this would break their URLs (the URL would be double percent encoded).

@telday
Copy link
Owner

telday commented Nov 19, 2025

Hi @jgberry thanks for taking the time to make these changes! I think you have spotted a couple issues with the docs/code here. As far as the documentation goes it is intentional that we don't automatically quote/unquote data when handling data URL's, that portion of the documentation was intended to imply that the data must be pre-quoted when creating a URL. I would prefer not to update this at the moment because it would break backwards compatibility.

It's definitely an issue that creating a URL with unquoted illegal characters works, this should be updated so that an error is returned in this case and new test cases added for that. I have opened issue #17 to address this.

I do see the need for this feature though, it could be useful for users to pass unquoted data that will be transformed. I'm open to adding a new constructor function for that.

@jgberry
Copy link
Author

jgberry commented Nov 19, 2025

Thanks for taking a look at this @telday!

As far as the documentation goes it is intentional that we don't automatically quote/unquote data when handling data URL's, that portion of the documentation was intended to imply that the data must be pre-quoted when creating a URL.

That makes sense. Could the docs could be adjusted to reflect this more clearly? Currently the docs say "Otherwise [if the base64_encoded flag is set to False] the data is passed through as is". This doesn't explicitly capture that the data MUST be urlencoded if base64_encoded is false.

I'd also like to highlight how these issues currently complicate parsing for users. I'm currently working around this using logic similar to this:

url = data_url.DataURL.from_url(url_str)
if url.is_base64_encoded:
  data = url.data
else:
  data = unquote(url.data)

Perhaps an additional property could be added to parallel DataURL.encoded_data which always exposes the decoded data:

@property
def decoded_data(self):
  if url.is_base64_encoded:
    return url.data
  else:
    return unquote(url.data)

This allows users to fetch the data without concerning themselves how the data is represented.

@jgberry
Copy link
Author

jgberry commented Nov 19, 2025

@telday I've made changes based on your feedback. Let me know what you think!

@telday
Copy link
Owner

telday commented Nov 20, 2025

@jgberry thanks for the quick turnaround. I still don't think this is quite right. When I mentioned a new constructor I meant adding a new external function like construct_data_url. I'm not sure why we would update the from_byte_data function to allow data that is not bytes to be passed for construction.

I also think that by updating the decoded_data property to unquote is mixing the two concepts too much. What if we added a new property unquoted_data which just returns the decoded_data parameter but with the unquote method runs against it?

This way allows us to maintain the integrity of the data throughout but gives new options for creation and use of quoted data that is agnostic to whether or not the data is actually quoted for retrieval.

e.g. You create two urls, one quoted and one not, but you can access the data either way:

url1 = construct_quoted_data_url('application/json', false, '{"some": "json"}')
url2 = construct_data_url('text/plain', false, "data")

...
# Then we can recover the raw data either way with the unquoted property:
print(url1.unquoted_data) # {"some": "json"}
print(url2.unquoted_data) # data

We will still run into issues if the user has unintentional quote characters in their string but at least we are explicitly telling them that they are going to get unquoted data from the property.

@jgberry
Copy link
Author

jgberry commented Nov 20, 2025

When I mentioned a new constructor I meant adding a new external function like construct_data_url. I'm not sure why we would update the from_byte_data function to allow data that is not bytes to be passed for construction.

I see. I missed that external function, it would make sense to move this behavior there.

I also think that by updating the decoded_data property to unquote is mixing the two concepts too much. What if we added a new property unquoted_data which just returns the decoded_data parameter but with the unquote method runs against it?

I'm not sure this makes sense. If the data is base64 encoded, I certainly do not want to unquote it nor could I unquote it; it's a bytes object.

What I effectively want with this property is to expose the underlying data stored in the data URL. If the data in the URL is base64 encoded, then the underlying data is the base64 decoded byte string. If the data is not base64 encoded, then the underlying data is the unquoted string.

If I understand your objection correctly, you're opposed to this property returning a union of str and bytes. But, per the Mozilla data URL definition:

If the data is textual, you can embed the text (using the appropriate entities or escapes based on the enclosing document's type). Otherwise, you can specify base64 to embed base64-encoded binary data.

It is this "data" that the Mozilla documentation refers to that I am trying to expose via the decoded_data property.

This way allows us to maintain the integrity of the data throughout but gives new options for creation and use of quoted data that is agnostic to whether or not the data is actually quoted for retrieval.

Per the spec, non-base64 encoded data MUST be quoted, so I'm not sure why we must be agnostic. Is the goal to have this package handle cases outside of the spec?

All in all, it seems this feature needs a bit more thought and the issues it aims to solve are easy enough to work around. I'll close this PR for now.

@jgberry jgberry closed this Nov 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants