Skip to content

FEAT: Added XML codec#5026

Merged
dockimbel merged 14 commits intored:masterfrom
rebolek:XML-codec
Feb 11, 2022
Merged

FEAT: Added XML codec#5026
dockimbel merged 14 commits intored:masterfrom
rebolek:XML-codec

Conversation

@rebolek
Copy link
Contributor

@rebolek rebolek commented Jan 6, 2022

No description provided.

@hiiamboris
Copy link
Collaborator

This bug still persists: https://gitter.im/red/codecs?at=61b346418853d316403de86f

@rebolek
Copy link
Contributor Author

rebolek commented Jan 7, 2022

Gee, I thought it was fixed already, thanks, I’ll fix it again.

@rebolek
Copy link
Contributor Author

rebolek commented Jan 7, 2022

Found the problem, it happened only with empty tags.

@hiiamboris
Copy link
Collaborator

This PR also needs tests.

@rebolek
Copy link
Contributor Author

rebolek commented Jan 7, 2022

This PR also needs tests.

Of course, they will be added.

@greggirwin
Copy link
Contributor

@dockimbel we're still finalizing this, so don't merge just yet.

@greggirwin
Copy link
Contributor

Talking now about how best to handle options.

@dockimbel
Copy link
Member

Is there some design document or just a documentation on how the Red datatypes are mapped to XML elements?

@greggirwin
Copy link
Contributor

https://gitlab.com/rebolek/markup/-/blob/main/README.md docs the output formats. There is no schema type system beyond that (though I built one for Prolific in the past). All tags map to words, unless you say to use strings as keys, and all other data is just kept as strings from being parsed.

target: []
target-stack: []

set 'load-xml func [
Copy link
Member

@dockimbel dockimbel Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a general problem with how the codecs are implemented. They don't respect the codec API (load/save), but provide adhoc global functions that are circumventing the load/as and save/as ones instead of just been simple handy wrappers on top of them. Though, as this is a global implementation issue in most codecs, I would not veto this PR based on that. I'm just mentioning it here, so that people gets aware of that and of a future refactoring of most codecs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For codecs that need options, the load/save interface needs to change. /as doesn't give you enough control. That means adding something completely general like /with, which has been a deep discussion for this codec, and how it compares to using refinements. Then encode/decode in each codec need to support a general /with options refinement, whether they need them or not.

I'm open to all suggestions, since we may not add many more standard codecs, but users will write them and we want those to use a consistent approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today, the load-*/save-* funcs are for when you need more control. We could eliminate some options from the codecs, so they work just as the defaults wrapped by encode/decode (so the calls go load -> decode -> load-* today).

The tradeoff in removing features is more work for users. e.g. we can always output /pretty for JSON, and let users trim/lines if they want. I believe we added /ascii after research on compatibility, but we can also say "forget people stuck with that" and always use UTF-8. For XML, we can always emit one format and post-process to get the others if people want them. That eliminates /as . We can also always include metadata and let users remove or ignore it if it's there. That eliminates /meta. I think /str-keys was there because XML names can be invalid words. We can tell people who hit that problem that those aren't supported, or we can always load them as strings and post-process to words or deal with them as strings; neither of which is great. We can leave out /str-keys for now, and see who complains. Adding it later won't be a breaking change. For those issues, we can normalize the interface and push the problem around. Is that a win overall?

Then there's CSV. XML is a pain, but a hard spec. CSV is a very soft spec, with lots of variants in the wild. Wrapping those details in the codec save the user enormous pain and possible errors. If we limit the options, it becomes much less useful. So it's a great, concrete example to use when we ask "How should codecs with options be handled?"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The load-* first appeared in @giesse's JSON codec and I followed that path with CSV and XML. We can get rid of it but they allow for setting options as @greggirwin mentioned. If we can think of a different solution that would work for all codecs generaly (e.g. setting JPEG quality etc.) then I’m for getting rid of load-*/save-*.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross posting from Gitter:

We are a week away from month end, and my goal is to get XML out, along with Animation. We'll note in the blog post that there may be changes, and that we are reviewing the overall codec interface.

The original JSON interface is a legacy issue for us now. R2 didn't have codecs, and that code has been around in various forms for a very long time. Romano did the first version, I took over maintenance and kept it up with Douglas Crockford on JSON.org, then @giesse, @rebolek, and @hiiamboris contributed to improving it; but the original interface lives on. :^\

I don't want to pull anyone's focus right now, but also don't want to just hack in a different interface without thinking it through. So it will be a pretty big task. It will likely also inform other extension models or how a system catalog works, even for I/O protocols perhaps.


Char: charset [
#"^(01)" -#"^(D7FF)" #"^(E000)" - #"^(FFFD)" #"^(10000)" - #"^(10FFFF)"
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This charset eats 136 KB of memory.

#"^(200C)" - #"^(200D)" #"^(2070)" - #"^(218F)" #"^(2C00)" - #"^(2FEF)"
#"^(3001)" - #"^(D7FF)" #"^(F900)" - #"^(FDCF)" #"^(FDF0)" - #"^(FFFD)"
#"^(010000)" - #"^(0EFFFF)"
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This charset eats 120 KB of memory.


; -- Character Range
Char: {[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]}
RestrictedChar: {[#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of those definitions as these words are redefined immediatly after?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll hazard a guess that @rebolek meant to comment those out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point about the charsets size is that such memory will be allocated on loading the XML codec, whether or not he user code calls the codec or not. We should be careful to not waste memory space unless needed. I suggest moving all the significant allocations to a init function that is called the first time the codec API is invoked.

Such init callback function should be part of the codec infrastructure and called automatically when needed without having each codec to implement the logic for that.

I’ll agree on the init function, I’ll update the codec.

NameChar:
name-char: union name-start-char charset [
"-." #"0" - #"9" #"^(B7)" #"^(0300)" - #"^(036F)" #"^(203F)" - #"^(2040)"
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This charset eats 120 KB of memory.

@dockimbel
Copy link
Member

dockimbel commented Jan 19, 2022

My point about the charsets size is that such memory will be allocated on loading the XML codec, whether or not he user code calls the codec or not. We should be careful to not waste memory space unless needed. I suggest moving all the significant allocations to a init function that is called the first time the codec API is invoked.

Such init callback function should be part of the codec infrastructure and called automatically when needed without having each codec to implement the logic for that.

@dockimbel
Copy link
Member

dockimbel commented Jan 19, 2022

As another global comment, I have always been in favor of a streaming approach for XML (and other formats like JSON and CSV) rather than static mappings to a predefined datastructure. XML for Red is just a serialization format to communicate with third-party software, it is not a native format for us. We should then treat it as that and not try to insert a potentially heavy datastructure between the raw XML data and the user's datastructures. The streaming approach (beyond all performance and resource usage benefits) allows the user to map directly the XML raw data to the final datastructure and is used in the industry since two decades. I haven't looked the the generative counterpart to streaming parser, but we can certainly come up with a good design for that part too.

Again, I won't veto this PR based on that comment. I plan to implement in the future (in my "20% fun time with Red") at least one streaming parser to show how it can be done in Red efficiently.

@greggirwin
Copy link
Contributor

My point about the charsets size is that such memory will be allocated on loading the XML codec, whether or not he user code calls the codec or not. We should be careful to not waste memory space unless needed. I suggest moving all the significant allocations to a init function that is called the first time the codec API is invoked.

Such init callback function should be part of the codec infrastructure and called automatically when needed without having each codec to implement the logic for that.

I agree with not wasting space and time, but are we better off making init a best practice (even beyond codecs), or should codecs be listed in needs? Codecs that just call OS APIs (e.g. images) come basically for free, and redbin we want there all the time. The rest can be optimized out if someone cares. Nice when batteries are included, so it's almost an exclude feature, kind of like R2's various kernel choices.

@greggirwin
Copy link
Contributor

I'll be very anxious to see how streaming works, and I think @rebolek has done a SAX parser in the past. I will say that I think the static model is easy to reason about, and we should see what existing systems use streaming JSON or XML that we can test with. Formats designed for streaming are a different story. Formats not designed for streaming pose issues like "I did a whole bunch of work and then hit a malformed bit." Now I have to track state, make sure if I've taken any actions that need to be undone they can be undone with information I have, etc.

OTOH, eventing systems and aggregation are going to be key components, so thinking about it is important.

@greggirwin
Copy link
Contributor

Messaging in general is key, and how you envelope things to pass them around, etc. It's also a wide world of needs and approaches. Everything from MQTT, fire and forget, to 0MQ brokering systems.

@greggirwin
Copy link
Contributor

My point about the charsets size is that such memory will be allocated on loading the XML codec, whether or not he user code calls the codec or not. We should be careful to not waste memory space unless needed. I suggest moving all the significant allocations to a init function that is called the first time the codec API is invoked.

Such init callback function should be part of the codec infrastructure and called automatically when needed without having each codec to implement the logic for that.

@rebolek put the init change in, and I'm...I'll just say that while I agree about not wasting time and memory, this change made the code SO much harder to understand (context level vars now in funcs), and SO much more fragile (have to keep two sets of words in sync in different parts of the code), that I want to be very careful about what we prioritize. In order to save space, the ~100 rule names in load-xml are tightly packed into 15 lines.

This is not a criticism of Bolek's code, but of how we address issues. We don't have modules today, but we have worse code to live with once we could easily leave XML out. Given this change (and I hate to see the time and effort on init wasted), I'd almost say go back to the original and leave XML as something for users to add.

@rebolek
Copy link
Contributor Author

rebolek commented Jan 26, 2022

@greggirwin I can unpack the rule names but I agree that it’s much more fragile. Too bad we can’t extend objects as that would solve this problem and we didn’t have to have the words defined twice.

@hiiamboris
Copy link
Collaborator

You're overengineering and dramatizing IMO. Nenad's point was to only stash charsets into init, and maybe to optimize them using not.

@hiiamboris
Copy link
Collaborator

You need to list test file in all-tests.txt for it to be evaluated

@rebolek
Copy link
Contributor Author

rebolek commented Feb 3, 2022

@greggirwin everything should be ready for merging.

@hiiamboris
Copy link
Collaborator

Looks like compiler didn't swallow that :)

@rebolek
Copy link
Contributor Author

rebolek commented Feb 3, 2022

Looks like compiler didn't swallow that :)

Can you show me the error? I tried compiling it without a problem (after fixing a few lines).

@hiiamboris
Copy link
Collaborator

https://ci.appveyor.com/project/red/red/builds/42439531
I don't think Appveyor has detailed logs though

@hiiamboris
Copy link
Collaborator

I can't reproduce it either.

@greggirwin
Copy link
Contributor

Is AppVeyor back to being reliable? It had issues for a while. If you can both compile, and I will do that today as I haven't, that's good enough for me.

@dockimbel I think we're ready.

@greggirwin
Copy link
Contributor

@rebolek please don't forget to doc how to separate metadata from the document.

@rebolek
Copy link
Contributor Author

rebolek commented Feb 4, 2022

@greggirwin Added a function get-meta to %xml-tools.red that will grab just the metadata from a document and added info about it to the docs. Also, one question:

There is a little inconsistency in the key-val format. Attributes are stored as [word value] but XMLdecl uses [set-word: value] format. I'm not sure why, I probably copied it from the triples format and nobody noticed it. Are you fine with changing it to a word! to be consistent with attributes?

@greggirwin
Copy link
Contributor

On key-val using words instead of set-words, yes, please make that change. Good catch.

@greggirwin
Copy link
Contributor

@rebolek %xml-tools.red is missing a formats definition. I added it here to test, but will let you make the change in the PR.

>> meta: get-meta res
*** Script Error: formats has no value
*** Where: find
*** Near : formats format [do make error! "Unknown f"] 
*** Stack: get-meta  

@rebolek
Copy link
Contributor Author

rebolek commented Feb 4, 2022

@greggirwin I must've deleted it by accident and it was still present in the console session. Thanks for finding it out!

@hiiamboris
Copy link
Collaborator

Check for leaks ;)

Red []

leak-check: function [code] [
	print ["checking" mold/flat/part code 70]
	old: values-of system/words
	also do code (
		new: values-of system/words
		repeat i length? words: words-of system/words [
			unless any [
				:old/:i =? :new/:i
				all [									;-- newly loaded words
					i > length? old
					unset? :new/:i
				]
				all [									;@@ workaround for #5065
					any-function? :old/:i
					:old/:i = :new/:i
				]
			][
				w: pad words/:i 15
				s1: mold/flat/part :old/:i 30
				s2: mold/flat/part :new/:i 30
				print rejoin ["leak: "w"^-= "s1" -> "s2]
			]
		]
	)
]
do %xml.red
>> leak-check [load-xml ""]
checking [load-xml ""]
leak: key               = unset -> pretty?:
leak: pretty?           = unset -> false
leak: store-doctype     = unset -> (if include-meta? [repend tar
[]
>> leak-check [to-xml [a "b" #[none]]]
checking [to-xml [a "b" none]]
leak: key               = unset -> pretty?:
leak: pretty?           = unset -> false
>> leak-check [to-xml []]
checking [to-xml []]
*** User Error: "Invalid input data"
*** Where: do
*** Near : error-invalid-data
*** Stack: leak-check also to-xml encode

@rebolek
Copy link
Contributor Author

rebolek commented Feb 11, 2022

@hiiamboris thanks! fixed. I guess there’s nothing more left to do for now.

@greggirwin
Copy link
Contributor

Ping @dockimbel. Ready for merging here.

@dockimbel dockimbel merged commit f9efb78 into red:master Feb 11, 2022
dockimbel added a commit that referenced this pull request Feb 11, 2022
This reverts commit f9efb78, reversing
changes made to e343951.
@greggirwin
Copy link
Contributor

@dockimbel it's been a while now. Do you remember if this merge was reverted because we weren't sure about whether it should be standard or in red/code? Or because we still had questions about the options interface? Something else?

@greggirwin
Copy link
Contributor

I was looking at my draft blog post and realized XML wasn't there like I thought it was.

@rebolek
Copy link
Contributor Author

rebolek commented May 20, 2022

I believe the reason was if it should be standard or in red/code. I’m fine putting it into red/code but then we probably should have some (even temporary until we have some packaging system) easy method to include code from red/code.

@greggirwin
Copy link
Contributor

Agreed, if it goes in red/code, we need a standard solution for loading it. Nenad may also have been concerned about the effect on compile times. But we could wrap it in do to avoid that. I don't remember if we tested it, or if there are R/S bits. The performance hit shouldn't be bad in that case. Oh, wait, I have some test numbers in the blog draft. I will look at that.

@greggirwin
Copy link
Contributor

No R/S in the code that I see at a glance.

@rebolek
Copy link
Contributor Author

rebolek commented May 24, 2022

Right, there’s no R/S.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants