Chemistry and math layout #92

Closed
NSoiffer opened this issue May 28, 2019 · 18 comments
Labels
chemistry Issues affecting markup for Chemisty

Comments

@NSoiffer
Contributor

NSoiffer commented May 28, 2019

Layout of chemical formulas is very similar to laying out math. People often use math editors to enter those formulas, which means they will show up as MathML on the web.

Here are some known differences:

  1. Chemical elements should not be in italics. For multi-letter chemical elements such as Na (sodium), that's not a problem, but for single letter ones such as H (hydrogen), that will be a problem because mi will normally use italics.
  2. Subscripts and superscripts should be positioned slightly differently from the normal typesetting rules (see the TeX book, p179). The idea is that you want the scripts to all align, regardless of whether there is a single sub/superscript or whether both are present. I think this means that if any script is present, it should be treated as if both scripts are present (i.e., use the layout rules for msubsup).
  3. Potentially automatic linebreaking rules might want to be different (brought up on 28/5/19 call, but no details were given).

Note: @davidcarlisle was tasked with looking into the TeX chemistry packages and those packages might reveal other layout differences.

'1' can be solved with MathML today by using mathvariant="normal". Alternatively, maybe the sans-serif math alphabetics in Unicode can be used if chemical elements are always sans-serif. However, these solutions don't help with semantics (#92), so another solution might be preferable. For units (meters, seconds, etc.), the MathML WG came out with a note that suggests using class="MathML-Unit". A similar thing could be done here. Alternatively, these might be tagged with some sort of "role" information and that semantic info could be pulled into the rendering. Personally, I think semantics and display should be kept separate.

'2' can be solved with MathML today with a hack: use mphantom, mspace, or something similar for the empty script when only one real script is present. If '1' is solved via some semantic info that says this is a chemical element, then the layout rules can be adjusted using this info. As with '1', I personally think semantics and display should be kept separate. Allowing none as a value for msubsup is cleaner than mphantom, etc. Another alternative is to introduce an attribute on msub and msup that says "use msubsup layout rules", e.g., msubsuplayout=true/false, with default false.
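
As a minimal sketch of the placeholder hack described above (not a proposal for the spec), the following Python builds an msubsup in which a missing script is filled with an empty mphantom. The function name `scripted` is hypothetical, scripts are assumed to be plain numbers, and whether an empty placeholder yields exactly the same shifts as a real script is an assumption that depends on the renderer.

```python
import xml.etree.ElementTree as ET

def scripted(base, sub=None, sup=None):
    """Return an <msubsup>; a missing script becomes an empty <mphantom>."""
    node = ET.Element("msubsup")
    mi = ET.SubElement(node, "mi", {"mathvariant": "normal"})
    mi.text = base
    for script in (sub, sup):
        if script is None:
            # Placeholder for the absent script, so msubsup layout rules are used.
            ET.SubElement(node, "mphantom")
        else:
            mn = ET.SubElement(node, "mn")
            mn.text = script
    return node

# Oxygen with only a subscript, still laid out as msubsup.
print(ET.tostring(scripted("O", sub="4"), encoding="unicode"))
```

An authoring tool could apply the same idea with `<none/>` or an msubsuplayout attribute in place of mphantom if either of those alternatives were adopted.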

Until we know more about linebreaking of chemical formulas (which should be really rare), I don't have any proposals.

Using some of the above to put up a straw man proposal:

<mrow>
  <msubsup>
    <mi role="math-element">Fe</mi>
    <mn>2</mn>
    <mrow> <mo>+</mo> <mn>2</mn> </mrow>
  </msubsup>
  <msub msubsuplayout="true">
    <mi role="math-element">Cr</mi>
    <mn>2</mn>
  </msub>
  <msub msubsuplayout="true">
    <mi mathvariant="normal" role="math-element">O</mi>
    <mn>4</mn>
  </msub>    
</mrow>

Here's an alternative using "none":

<mrow>
  <msubsup>
    <mi role="math-element">Fe</mi>
    <mn>2</mn>
    <mrow> <mo>+</mo> <mn>2</mn> </mrow>
  </msubsup>
  <msubsup>
    <mi role="math-element">Cr</mi>
    <mn>2</mn>
    <none/>
  </msubsup>
  <msubsup>
    <mi mathvariant="normal" role="math-element">O</mi>
    <mn>4</mn>
    <none/>    
  </msubsup>    
</mrow>
@josephwright

On line-breaking, people do break in the middle of a formula, but a break should keep each element name together with any subscripts. So one might break "C6H12O6" as "C6-H12O6" or "C6H12-O6" but not anywhere else. However, as you say, it's pretty unusual to break such cases: normally it's formal names that are 'fun'.
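
The rule above can be sketched in a few lines of Python. This is only an illustration under the assumption of a flat formula (no parentheses, charges, or bonds); the names `GROUP` and `break_candidates` are made up for the example.

```python
import re

# One group = element symbol (capital plus optional lowercase letter)
# together with its optional subscript digits.
GROUP = re.compile(r"[A-Z][a-z]?\d*")

def break_candidates(formula):
    """Return every way to hyphenate `formula` at a single group boundary."""
    groups = GROUP.findall(formula)
    return ["".join(groups[:i]) + "-" + "".join(groups[i:])
            for i in range(1, len(groups))]

print(break_candidates("C6H12O6"))  # ['C6-H12O6', 'C6H12-O6']
```

A break is never offered between a symbol and its subscript, because both belong to the same group.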

@josephwright

josephwright commented May 29, 2019

Perhaps worth noting on sub/superscripts that IUPAC have said that compound ions should have charges after any subscript numbers (https://iupac.org/wp-content/uploads/2015/07/Green-Book-PDF-Version-2011.pdf, p 51). Thus what in TeX-like terms might be expressed as SO_{4}^{2-} should have the 2- clearly after the 4, as the charge applies to the entire ion; in TeX-like terms one would use SO_{4}{}^{2-}. (See http://mirror.ctan.org/macros/latex/contrib/chemformula/chemformula_en.pdf page 11 for an 'automated' approach.)
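
To make the difference between the two TeX-like forms concrete, here is a small Python sketch that emits both. The helper names `stacked_ion` and `staggered_ion` are hypothetical; the only thing taken from the comment above is the `{}` trick that moves the charge after the subscript.

```python
def stacked_ion(symbols, count, charge):
    """Charge set directly above the subscript (not what IUPAC recommends)."""
    return f"{symbols}_{{{count}}}^{{{charge}}}"

def staggered_ion(symbols, count, charge):
    """Charge attached to an empty group, so it is set after the subscript."""
    return f"{symbols}_{{{count}}}{{}}^{{{charge}}}"

print(stacked_ion("SO", 4, "2-"))    # SO_{4}^{2-}
print(staggered_ion("SO", 4, "2-"))  # SO_{4}{}^{2-}
```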

@davidcarlisle davidcarlisle added the chemistry Issues affecting markup for Chemisty label May 29, 2019
@mhchem

mhchem commented May 29, 2019

I am the author of mhchem (for LaTeX, MathJax, KaTeX). First of all, math typesetting is well suited for chemistry, but there are many more fine details that you have not yet mentioned, like bonds, inner dashes and dots, italic prefixes, etc. Upright Greek characters are an important, but often missing, feature. You might want to take a look at https://mhchem.github.io/MathJax-mhchem/ to see a collection of examples. As there are so many fine details, a more structured approach than a thread of comments like this could be needed. From my experience, I would avoid semantic markup, i.e. giving each part of η²-C₂H₄ a description of why it is typeset as it is. The same notation (upright Greek, dash, dot, ...) can have several very special chemical meanings, depending on the field of chemistry, with new meanings being added (and forgotten) all the time. I can see in the examples above that you are suggesting using typographic semantics. This is exactly what I would recommend.

@physikerwelt
Member

@mhchem thank you for your input. I hope we will not end up using deprecated versions this time.

@davidfarmer
Contributor

davidfarmer commented May 29, 2019 via email

@mhchem

mhchem commented May 29, 2019

The mhchem syntax is not semantic in your sense. Example: it says what part goes into the superscript, but it does not say why. I see that this leads to problems with speech output and machine interpretation. But I don't think users would be willing to have a dozen commands to create a semantic right-hand superscript when a simple ^ would do the trick. I think the "straightforward", "little typing" syntax is what makes mhchem popular. (By the way, spoken chemistry is also highly context-dependent, with a lot of "homophones" like "C twelve". But my experience with that is very limited.)

@NSoiffer
Contributor Author

Great to see all this input!

@davidcarlisle was tasked with finding out a bit more about the various TeX packages for chemistry and hence getting an expert like @mhchem providing input.

I probably should have added those links into the original issue as looking at those packages can be instructive. Here are the links:

He also included a link to siunitx, a package for units that is tangentially related to this thread.

What struck me on skimming through them is how much they were focused on shortening the input. Here's an mhchem example from the pdf: \ce{Hg^2+ ->[I-] HgI2 ->[I-] [Hg^{II}I4]^2-}
[image: rendering of the \ce example above]
Note that {}s are not used around the multichar superscript, subscripts are implied, and \atop isn't used for the arrow annotations. chemformula is similar in its design to shorten input.

One of the goals of the refresh effort is to be explicit about the layout rules. If we plan to include chemical layout in MathML (which I think we should), we need to make sure MathML can handle any differences. We also need to decide whether to add things to MathML such as attributes that make it easy to handle the differences or whether we require authoring tools to make the tweaks explicit. As a simple case, mi has the default value of mathvariant="auto" to simplify tagging of mi. That's not what is desired for chemistry. There are three possible ways to deal with this:

  1. Require tools to generate mathvariant="normal" on all mi
  2. Equivalently, have tools generate mathvariant="normal" on the math element
  3. Expand the meaning of auto (the default) so that it understands that in a chemistry context (however that is specified), an upright font should be used.

Note that currently mathvariant is planned to be left in full but removed from core as a legal value on mstyle (#1, #89), and hence from math. So the second option above would be legal in the full spec, but is not legal in core.
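
As a rough sketch of what option 1 asks of an authoring tool, the Python below adds mathvariant="normal" to every single-character mi that does not already carry the attribute (multi-character mi already default to upright). The function name `force_upright` is hypothetical, and the input is assumed to be MathML without a namespace prefix; with a namespaced math element the tag would have to be matched as `{http://www.w3.org/1998/Math/MathML}mi`.

```python
import xml.etree.ElementTree as ET

def force_upright(root):
    """Set mathvariant="normal" on single-character <mi> lacking the attribute."""
    for mi in root.iter("mi"):
        text = (mi.text or "").strip()
        if len(text) == 1 and "mathvariant" not in mi.attrib:
            mi.set("mathvariant", "normal")
    return root

tree = ET.fromstring(
    "<mrow><msub><mi>H</mi><mn>2</mn></msub><mi>O</mi><mi>Na</mi></mrow>"
)
print(ET.tostring(force_upright(tree), encoding="unicode"))
```

This applies the rule blindly to every mi, so a tool would still need some way to exempt identifiers that are meant to stay italic.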

Hence, it is important to collect a list of the differences between math and chemistry. Once we have that list, then we look at various options for each and hopefully come up with a unified strategy for dealing with those differences. We may also find that MathML is missing a couple of features that need to be added.

@mhchem

mhchem commented May 29, 2019

You might also be interested in the mhchem for MathJax manual. It has more special syntax than the LaTeX version and a live "test-drive" at the end of the page where you can type in and see the results immediately.

Most parts of chemical equations are in an upright font, but not all. If you think about extending auto (option 3), you could make this quite complex. The x in \ce{NO_x} is italic, and so are the n in \ce{Fe^n+} and the i in \ce{i-Pr}. But I think this would not fit well with your aim of having a concise MathML spec that is not too difficult to implement.

Please think about chemistry-in-math and math-in-chemistry use cases, e.g. $C_p[\ce{H2O(l)}]$, \ce{CuS($hP12$)} and the examples above.

And here are a few examples that might need extended layouting options:
[four example images; not preserved]

@davidfarmer
Contributor

I just went through the first dozen or so pages of the PDF documentation
for the mhchem bundle, and it looked quite semantic to me.
By that I mean, one could write a script to parse the contents of \ce{}
and determine unambiguously what it means. Some examples:
\ce{H+}
\ce{CrO4^2-}
\ce{^227_90Th+}
\ce{(NH4)2S}.
It seems that the uses of ^ and _ are completely unambiguous. Given the (finite!) list of
elements, you can recognize numbers and the + and - symbols, and you can parse the expression.
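
A minimal Python sketch of the kind of script meant here, covering only flat inputs such as H+ or CrO4^2- (no parentheses, isotopes, or arrows). The names `FLAT`, `parse_flat`, and `KNOWN` are invented for the illustration, and the set of symbols is passed in by the caller rather than assumed to be a fixed list.

```python
import re

# Flat body of symbol+count groups, then an optional charge like ^2- or +.
FLAT = re.compile(r"((?:[A-Z][a-z]?\d*)+)(?:\^?(\d*[+-]))?")

def parse_flat(text, symbols):
    """Return ([(symbol, count), ...], charge or None) for a flat formula."""
    match = FLAT.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported input: {text!r}")
    body, charge = match.groups()
    parts = []
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", body):
        if symbol not in symbols:
            raise ValueError(f"unknown symbol: {symbol}")
        parts.append((symbol, int(digits) if digits else 1))
    return parts, charge

KNOWN = {"H", "O", "S", "Cr", "Ba"}  # illustrative subset only
print(parse_flat("CrO4^2-", KNOWN))  # ([('Cr', 1), ('O', 4)], '2-')
print(parse_flat("H+", KNOWN))       # ([('H', 1)], '+')
```

The sketch recovers structure (which symbol, how many, what sits in the superscript) but says nothing about what the superscript means.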

Reactions:
\ce{A <--> B}
\ce{A ->[H2O] B}
\ce{SO4^2- + Ba^2+ -> BaSO4 v}
Those are made of the pieces mentioned above, along with a small number of new pieces.

There are a few other things, but in each case the notation seems to be unambiguous.
Maybe the case of nested expressions (chem in math in chem) is tricky, but there are
not really that many different things that can happen.

Unless I am missing something, this situation is pretty similar to how I view LaTeX markup.

Is an equals sign distinguished from a double bond by the spaces around it?

@mhchem

mhchem commented May 30, 2019

Well, what is semantic? The mhchem syntax is a typographic notation; the transformation mhchem->LaTeX is unambiguous.
Take \ce{[Hg^{II}I4]^2-}, for instance. There are two right-hand superscripts with different meaning (oxidation number and charge). One has to look at the content to understand the meaning. Sometimes, in contexts where one is not interested in charges at all, one can even write arabic oxidation numbers (instead of roman ones).
When it comes to new concepts, chemists simply define (locally) a new meaning for a certain notation. This way, the same typographic notation could be used in different branches of chemistry and they might not even be aware of that – or even understand the meaning of the other notation.

Yes, spaces are a very important semantic element in mhchem syntax.

Don't expect a finite list of elements. Chemists make up new names all the time (D, T, M, THF), some are just conventions within a single article.

@davidcarlisle
Collaborator

One interesting thing people can do with the mhchem for MathJax live demo is to look at the generated MathML (3) code, using the option in the MathJax right-click menu to view the MathML code. One thing I notice, trying a few examples there, is the use of

            <msup>
              <mrow class="MJX-TeXAtom-ORD">
                <mrow class="MJX-TeXAtom-ORD">
                  <mpadded width="0">
                    <mphantom>
                      <mi>X</mi>
                    </mphantom>
                  </mpadded>
                </mrow>
              </mrow>
              <mrow class="MJX-TeXAtom-ORD">
                <mn>2</mn>
                <mo>+</mo>
              </mrow>
            </msup>

The phantom X appears in several places (acting as a \mathstrut to force the position of superscripts to a fixed height that does not depend on the base). It might be nice if we could make that simpler. superscriptshift might actually be enough: although that is a minimum shift rather than a fixed shift, it would be enough to force the same height on upper- and lower-case letters in the base.
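
To get a feel for how often this construct occurs in generated output, one could scan the MathML for scripts whose base is nothing but such a strut. The Python below is a sketch under the assumption that the input has no namespace prefix (as in the fragment above); `is_phantom_strut` and `strut_scripts` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

def is_phantom_strut(node):
    """True if `node` is only a zero-width <mpadded> holding an <mphantom>."""
    while node.tag == "mrow" and len(node) == 1:
        node = node[0]  # unwrap single-child mrow layers
    return (node.tag == "mpadded" and node.get("width") == "0"
            and len(node) == 1 and node[0].tag == "mphantom")

def strut_scripts(root):
    """Return the msub/msup/msubsup elements whose base is such a strut."""
    return [el for el in root.iter()
            if el.tag in ("msub", "msup", "msubsup")
            and len(el) > 0 and is_phantom_strut(el[0])]

sample = ET.fromstring(
    '<msup><mrow><mrow><mpadded width="0"><mphantom><mi>X</mi></mphantom>'
    '</mpadded></mrow></mrow><mrow><mn>2</mn><mo>+</mo></mrow></msup>'
)
print(len(strut_scripts(sample)))  # 1
```

Each element found this way is a place where a simpler construct, if one were available, could replace the phantom base.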

In full I picked this example

\ce{$K = \frac{[\ce{Hg^2+}][\ce{Hg}]}{[\ce{Hg2^2+}]}$}

which generated this MathML

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mstyle mathcolor="#a33e00">
    <mrow class="MJX-TeXAtom-ORD">
      <mi>K</mi>
      <mo>=</mo>
      <mfrac>
        <mrow>
          <mo stretchy="false">[</mo>
          <mrow class="MJX-TeXAtom-ORD">
            <mrow class="MJX-TeXAtom-ORD">
              <mi mathvariant="normal">H</mi>
              <mi mathvariant="normal">g</mi>
            </mrow>
            <msup>
              <mrow class="MJX-TeXAtom-ORD">
                <mrow class="MJX-TeXAtom-ORD">
                  <mpadded width="0">
                    <mphantom>
                      <mi>X</mi>
                    </mphantom>
                  </mpadded>
                </mrow>
              </mrow>
              <mrow class="MJX-TeXAtom-ORD">
                <mn>2</mn>
                <mo>+</mo>
              </mrow>
            </msup>
          </mrow>
          <mo stretchy="false">]</mo>
          <mo stretchy="false">[</mo>
          <mrow class="MJX-TeXAtom-ORD">
            <mrow class="MJX-TeXAtom-ORD">
              <mi mathvariant="normal">H</mi>
              <mi mathvariant="normal">g</mi>
            </mrow>
          </mrow>
          <mo stretchy="false">]</mo>
        </mrow>
        <mrow>
          <mo stretchy="false">[</mo>
          <mrow class="MJX-TeXAtom-ORD">
            <mrow class="MJX-TeXAtom-ORD">
              <mi mathvariant="normal">H</mi>
              <mi mathvariant="normal">g</mi>
            </mrow>
            <msub>
              <mrow class="MJX-TeXAtom-ORD">
                <mrow class="MJX-TeXAtom-ORD">
                  <mpadded width="0">
                    <mphantom>
                      <mi>X</mi>
                    </mphantom>
                  </mpadded>
                </mrow>
              </mrow>
              <mrow class="MJX-TeXAtom-ORD">
                <mrow class="MJX-TeXAtom-ORD">
                  <mpadded height="0">
                    <mn>2</mn>
                  </mpadded>
                </mrow>
              </mrow>
            </msub>
            <msup>
              <mrow class="MJX-TeXAtom-ORD">
                <mrow class="MJX-TeXAtom-ORD">
                  <mpadded width="0">
                    <mphantom>
                      <mi>X</mi>
                    </mphantom>
                  </mpadded>
                </mrow>
              </mrow>
              <mrow class="MJX-TeXAtom-ORD">
                <mn>2</mn>
                <mo>+</mo>
              </mrow>
            </msup>
          </mrow>
          <mo stretchy="false">]</mo>
        </mrow>
      </mfrac>
    </mrow>
  </mstyle>
</math>

@davidfarmer
Contributor

davidfarmer commented May 31, 2019 via email

@mhchem

mhchem commented May 31, 2019

I'd say that in more than 99% of the cases, one can infer the meaning from "local context". I skimmed through the Green Book and the Red Book and found at least these meanings of right superscripts:

  • charge: ^- ^2- ^3- ^+ ^2+ ^3+ ^0 (when on a particle)...
  • oxidation: ^I ^II ^III ^IV ^-I ^-II ^-III ^0 (when at an element) ^{(I)} ^{(II)} ^{(III)} ...
  • excited: ^*
  • radical: ^. ^2.
  • radical and charge: ^.- ^(2.)- ^(2.)2+ ...
  • Kröger-Vink notation has completely different semantics: ^x ^. ^.. ^2. ^' ^''
  • hapticity: \eta^2 \eta^3 \eta^4
  • number of donor atoms: \kappa^2
  • (bonding number: \lambda^5)

There are more, for sure.
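
Inferring the meaning from the superscript content alone could look roughly like the Python sketch below. It covers only the first four bullets, the patterns are a loose reading of the examples listed (not taken from the Green Book or Red Book), and `RULES` and `classify` are made-up names. Kröger-Vink notation is deliberately left out, since a lone dot would then be ambiguous without knowing the field.

```python
import re

# Loose patterns for a few of the meanings listed above.
RULES = [
    ("charge", re.compile(r"\d*[+-]|0")),
    ("oxidation number", re.compile(r"-?[IVX]+|\(-?[IVX]+\)|0")),
    ("excited state", re.compile(r"\*")),
    ("radical", re.compile(r"\d*\.")),
]

def classify(script):
    """Return every meaning whose pattern matches the whole superscript."""
    matches = [name for name, pattern in RULES if pattern.fullmatch(script)]
    return matches or ["unknown"]

for s in ["2-", "II", "*", "2.", "0", "x"]:
    print(s, classify(s))
```

A superscript 0 matches two rules, which is where the base (particle or element) has to be consulted as well.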

I don't see your point, why one should not be able to infer the meaning of A^{\mathrm{T}}. (A^T would definitely be false.) What could go wrong if a screenreader read every instance of "latin uppercase italic letter, with a right superscript upright T" as "the transpose of A" (or whatever letter)?

@davidfarmer
Contributor

davidfarmer commented Jun 1, 2019 via email

@mhchem

mhchem commented Jun 1, 2019

We are deviating too much here, I guess, but let me add 2 points. First, when you argue with "some people write it non-standard", that can be the case in chemistry too. Second, your mathematical notation is sloppy. An operator is to be set in an upright font, a variable in italics. (This is a universal scientific notation. Looking at StackExchange, the chemical community observes this much more strictly than the physical and mathematical community.)

@davidfarmer
Contributor

I appreciate that there are standards for typography, but it is a fact that most popular
introductory linear algebra textbooks typeset matrices as slanted capital letters.
The other extreme is the Wikipedia page on matrices, which sets them upright and bold.

But the more important point I want to make is that no amount of typography addresses
the issue of what A^t means. Even if the "A" is upright (and/or bold), it is impossible
to tell that A is a matrix, and it is impossible to tell that A^t is the transpose of A.
That is why I want to encode semantic information.

Actually, the motivation is making it possible to pronounce correctly without having
to guess. I am not proposing to encode the fact that A is a matrix (although I do
not object to encoding that), just encoding that "^t" is the "transpose".

The point of this thread is how to do similar encoding for chemistry, so that
charge/valence is pronounced correctly. Maybe encoding that something is an element
is more important than encoding that something is a matrix?

@mhchem

mhchem commented Jun 4, 2019

I was talking about the T or t. These are operators, so they are to be typeset upright, and there is no confusion with a variable t. Let me use the chance to bring this thread back to the chemistry topic: I recommend the IUPAC document On the use of italic and roman fonts for symbols in scientific text.

@NSoiffer
Contributor Author

Lots of good discussion, but I don't see anything here that intent (with the addition of isa) doesn't solve, so I'm closing this.
