[FO] fn:nl, fn:tab, fn:cr #121

Closed
ChristianGruen opened this issue Aug 8, 2022 · 15 comments
Labels
Feature A change that introduces a new feature XQFO An issue related to Functions and Operators

Comments

@ChristianGruen
Contributor

The most popular custom functions in BaseX, and the most boring ones, allow users to insert new line and tab characters. It would be nice to see official variants added to the spec:

Function                   Returned character
fn:nl() as xs:string       end of line (&#xA;, &#10;)
fn:tab() as xs:string      character tabulation (&#x9;, &#9;)
fn:cr() as xs:string       carriage return (&#xD;)

The third function can possibly be dropped.
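
For illustration, the three proposed functions can be emulated today in XQuery 3.1. This is only a sketch: the local: functions are hypothetical stand-ins and not part of any specification; codepoints-to-string is the only standard function involved.

```xquery
xquery version "3.1";

(: Hypothetical stand-ins for the proposed fn:nl, fn:tab and fn:cr. :)
declare function local:nl()  as xs:string { codepoints-to-string(10) };
declare function local:tab() as xs:string { codepoints-to-string(9)  };
declare function local:cr()  as xs:string { codepoints-to-string(13) };

(: Build one tab-separated line terminated by a line feed. :)
string-join(('id', 'name', 'value'), local:tab()) || local:nl()
```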

@dnovatchev
Contributor

dnovatchev commented Aug 9, 2022

Maybe make these OS-dependent, e.g. Windows: CRLF, Unix: LF, classic Mac OS (up through version 9): CR, etc. Wikipedia summarizes eight different cases with eight different values:

  • U+000A
  • U+000B
  • U+000C
  • U+000D
  • CR (U+000D) followed by LF (U+000A)
  • U+0085
  • U+2028
  • U+2029

When processing with unparsed-text() all of these might be useful.

And/or maybe allow these to be overridden by environment settings?
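
As a rough sketch of what handling these variants looks like in XQuery 3.1 today, the following normalizes the listed line terminators to a single line feed. The function name local:normalize-newlines is hypothetical, and the sketch assumes an XML 1.0 based processor, in which U+000B and U+000C cannot occur in a string at all.

```xquery
xquery version "3.1";

(: Assumption: the processor works with XML 1.0 characters, so U+000B and
   U+000C cannot occur in a string and are left out of the pattern. :)
declare function local:normalize-newlines($text as xs:string) as xs:string {
  (: CR optionally followed by LF, or any other single line terminator :)
  replace($text, '\r\n?|[\n&#x85;&#x2028;&#x2029;]', '&#xA;')
};

(: yields the code points 97, 10, 98, 10, 99, 10, 100 :)
string-to-codepoints(local:normalize-newlines('a&#xD;&#xA;b&#xD;c&#x2028;d'))
```

For comparison, fn:unparsed-text-lines only splits at CR, LF and CRLF boundaries.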

@ChristianGruen
Contributor Author

ChristianGruen commented Aug 9, 2022

Thanks for your thoughts. I don’t recollect all the rules, but CR characters in the input are often dropped, e.g. when parsing XML documents, and explicit carriage returns are serialized as &#xD; even on Windows systems:

declare namespace output = 'http://www.w3.org/2010/xslt-xquery-serialization';
declare option output:method 'xml';
'A&#xD;&#xA;B'

(: The result ...
A&#xD;
B
:)
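
The same behaviour can be inspected without prolog options by calling fn:serialize with a parameter map. This is a sketch for experimentation, not a statement about any particular processor: with the XML method the carriage return is expected to come back as a character reference, whereas the text method performs no escaping.

```xquery
xquery version "3.1";

let $value := 'A&#xD;&#xA;B'
return (
  (: XML method: the carriage return should appear as a character reference. :)
  serialize($value, map { 'method': 'xml', 'omit-xml-declaration': true() }),
  (: Text method: no escaping; inspect the resulting code points. :)
  string-to-codepoints(serialize($value, map { 'method': 'text' }))
)
```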

XML 1 has 09, 0A and 0D embedded in the grammar rules (whereas 0B and 0C are illegal), but maybe it’s helpful to skip fn:cr, think of »new lines« more generally as something that inserts a line break, and let the processor decide how to serialize new lines, depending on the OS, as it’s already done in other cases?

Edit: I have dug up an old discussion on serializing newlines (caused by possible bugs in the test cases): https://lists.w3.org/Archives/Public/public-qt-comments/2015Oct/0201.html

@dnovatchev
Contributor

XML 1 has 09, 0A and 0D embedded in the grammar rules (whereas 0B and 0C are illegal), but maybe it’s helpful to skip fn:cr, think of »new lines« more generally as something that inserts a line break, and let the processor decide how to serialize new lines, depending on the OS, as it’s already done in other cases?

For XML, yes. But shouldn't XPath take a broader view of the allowed characters? If we are calling fn:unparsed-text(), shouldn't it give us the text without any normalization?

@ChristianGruen
Contributor Author

If we are calling fn:unparsed-text(), shouldn't it give us the text without any normalization?

I assume that CR won’t be normalized away, but FOUT1190 will still be raised if the resulting characters are not permitted XML characters, i.e., if XML 1.0 is used and if the input contains control characters like 0B.
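
A minimal sketch of how a query could react to that error. The file name input.txt is a placeholder used only for illustration. Note that FOUT1190 also covers other decoding failures, so the message is kept generic.

```xquery
xquery version "3.1";

declare namespace err = 'http://www.w3.org/2005/xqt-errors';

(: 'input.txt' is a placeholder URI used only for illustration. :)
try {
  unparsed-text('input.txt')
} catch err:FOUT1190 {
  (: Raised when the resource cannot be decoded or contains characters
     that are not permitted, e.g. U+000B under XML 1.0. :)
  'Could not decode input or found non-permitted characters: ' || $err:description
}
```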

@joewiz

joewiz commented Aug 9, 2022

Great idea to expose these values, but is there a reason you're using a function rather than a global variable like $fn:nl?

(Also, while not directly related, I thought I'd point to a utility function for referencing HTML entities by name: https://gist.github.com/joewiz/8a2c3e2320da4c24058ccee5aec156f6. I could imagine a similar package that would facilitate lookup of Unicode entities by name.)

@ChristianGruen
Contributor Author

The main reason, I think, is that the concept of built-in global variables would be something new. In the EXPath File Module, we used functions instead of variables for the same reason.

We’d also need to think about the default namespace for global variables: fn:nl() and nl() point to the same function, $fn:nl and $nl would probably not.
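
The namespace difference can be seen with existing XQuery 3.1 features. In this sketch the variables $nl and $local:nl are hypothetical examples: unprefixed function names resolve against the default function namespace, while unprefixed variable names are in no namespace.

```xquery
xquery version "3.1";

(: Two distinct variables: an unprefixed variable name is in no namespace. :)
declare variable $nl       := '&#xA;';
declare variable $local:nl := '&#xD;&#xA;';

(
  (: Both calls reach the same built-in function: true :)
  fn:string-length($nl) eq string-length($nl),
  (: The two variables are unrelated and hold different values: false :)
  $nl eq $local:nl
)
```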

I like the idea to have a function for resolving HTML entities.

@joewiz

joewiz commented Aug 9, 2022

@ChristianGruen These reasons make sense to me. Thanks!

@benibela

benibela commented Aug 9, 2022

And fn:amp

It helps to write queries that are valid XPath and XQuery

The main reason I think is that the concept of built-in global variables would be something new. In the EXPath File Module, we used functions instead of variables for the same reason.

In Xidel I have a lot of built-in variables, e.g. $line-ending and $amp
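
To illustrate the portability point: an entity reference in a string literal is XQuery syntax, so the same literal means something different in plain XPath (outside an XML host document). A short sketch:

```xquery
xquery version "3.1";

(
  (: XQuery expands the predefined entity reference, giving one character.
     The same literal in plain XPath would be five characters long. :)
  string-length('&amp;'),
  (: Works unchanged in both XPath and XQuery and returns the ampersand. :)
  codepoints-to-string(38)
)
```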

@graydon2014

And fn:amp

If we're going to have fn:amp, we should have all five magic XML characters: fn:amp, fn:quot, fn:apos, fn:lt, and fn:gt.

That's getting us up to nine -- the five magic chars and the four whitespace characters -- and I start to think it shouldn't be distinct functions. Maybe fn:char($name) where $name can be one of the nine magic names OR a codepoint number? (I would greatly prefer hex codepoint numbers but acknowledge I'm not likely to get them.)

(Yes I do want a distinct function for space; I use &#x20; a lot so future-me can be sure that space is there (in case of non-monospaced fonts) and not a typo. If we're going to function the rest of the white space we should function space, too.)

cedporter added the XQFO and Feature labels on Sep 14, 2022

@michaelhkay
Contributor

Perhaps fn:char($name as xs:string) where $name is the "name" of the character, and is one of

  • an entity name recognised in HTML5, e.g. "nbsp", "apos", "lt", "amp" etc
  • "xHHH" where HHH is the hexadecimal representation of the Unicode codepoint
  • the name of an ASCII control character such as CR, LF, TAB, FF, VT
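
The three kinds of name listed above could be prototyped in XQuery 3.1 roughly as follows. This is a sketch under stated assumptions, not the proposed function itself: local:char, local:hex and the $local:names table are hypothetical, the table holds only a small illustrative subset of names, and FF and VT are left out because they are not legal XML 1.0 characters.

```xquery
xquery version "3.1";

declare namespace map = 'http://www.w3.org/2005/xpath-functions/map';

(: Illustrative subset of names only. FF and VT are omitted because they
   are not legal XML 1.0 characters. :)
declare variable $local:names as map(xs:string, xs:string) := map {
  'amp': '&amp;', 'lt': '&lt;', 'gt': '&gt;', 'quot': '&quot;', 'apos': '&apos;',
  'nbsp': '&#xA0;',
  'TAB': '&#x9;', 'LF': '&#xA;', 'CR': '&#xD;'
};

(: Converts a string of hexadecimal digits to an integer. :)
declare function local:hex($digits as xs:string) as xs:integer {
  fold-left(
    string-to-codepoints(upper-case($digits)), 0,
    function($acc, $cp) {
      $acc * 16 + string-length(
        substring-before('0123456789ABCDEF', codepoints-to-string($cp)))
    }
  )
};

(: Hypothetical stand-in for the suggested fn:char($name as xs:string). :)
declare function local:char($name as xs:string) as xs:string {
  if (map:contains($local:names, $name)) then
    $local:names($name)
  else if (matches($name, '^x[0-9A-Fa-f]+$')) then
    codepoints-to-string(local:hex(substring($name, 2)))
  else
    error(xs:QName('local:unknown-char'), 'Unknown character name: ' || $name)
};

(: An ampersand, 'A', a tab and U+2028, concatenated. :)
local:char('amp') || local:char('x41') || local:char('TAB') || local:char('x2028')
```

A lookup table keeps the sketch small; a real implementation would need the full HTML5 entity list, which is the data-size concern raised in the following comments.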

@ndw
Contributor

ndw commented Oct 16, 2022

As soon as you do that, someone is going to propose that you should be able to give any character name from the Unicode database...which I think is conceptually a good idea even if the idea of carrying around the unicode database doesn't fill me with glee.

@michaelhkay
Contributor

I deliberately refrained from that suggestion, that would be an awful lot of data to retain. I'm not actually sure how many entity names are defined in HTML5 these days, but presumably they've chosen the set of names with a degree of pragmatism, and if the browser can know them, so can we.

@ChristianGruen
Contributor Author

ChristianGruen commented Oct 16, 2022

As far as I can see, we have no other XML functionality that uses CR and similar. If we add fn:chars, I would thus vote for adding detection of backslashed characters (\n, \r, \t) as they are well-known and also used in regular expressions.

Personally, I think &#xHHH; is sufficient as an existing solution if users know the hex code of a character anyway. My initial proposal was mostly about adding newlines (fn:nl, or fn:newline). I believe that the proper system-specific output of CR/LF should be up to the serializer. Maybe I shouldn't have mentioned CR and TAB at all.
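
For context, a short sketch of how the backslash notation and the character reference notation behave in XQuery 3.1 today: the backslash escapes are recognized only inside regular expressions, not in string literals.

```xquery
xquery version "3.1";

(
  (: In a string literal a backslash has no special meaning: length 2. :)
  string-length('\n'),
  (: In a regular expression, \n matches U+000A. The replacement '\\n'
     yields a literal backslash followed by 'n'. :)
  replace('a&#xA;b', '\n', '\\n'),
  (: The character reference form that already works: code point 10. :)
  string-to-codepoints('&#xA;')
)
```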

@michaelhkay
Contributor

michaelhkay commented Nov 20, 2022

In PR #261 I have proposed an fn:char function which allows the character to be defined as a decimal or numeric constant, as an HTML entity reference name, or as one of the escape sequences \n, \r, and \t. I think this will meet a wide range of requirements. It takes into account the needs of XPath environments other than XQuery and XSLT, where the &#xNN notation is not available - for example when an XPath expression is included in the query part of a URI; using non-ASCII characters in such a context can be challenging.
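
Based on the description above, calls to the proposed function might look like the following. This is a sketch only: it assumes a processor that implements the fn:char proposal, it will not run on an XQuery 3.1 processor, and the exact set of accepted arguments is defined by the PR rather than by this example.

```xquery
(: Assumes a processor that implements the fn:char proposal. :)
(
  (: Escape sequences for tab and newline :)
  string-join(('id', 'name'), char('\t')) || char('\n'),
  (: HTML entity reference names :)
  'Fish' || char('nbsp') || char('amp') || char('nbsp') || 'chips'
)
```

Since the arguments are plain ASCII strings, such an expression can be embedded in contexts such as a URI query part without non-ASCII characters or entity syntax.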

@ndw
Contributor

ndw commented Jan 10, 2023

Closed by #261
