U+2212 MINUS SIGN breaking PDF generation #4136

JulienPalard · 2017-10-09T21:42:39Z

Subject: When using U+2212 to denote a negative integer, sphinx can no longer build PDF.

Problem

Unicode character U+2212 in rst files breaks PDF generation.

Procedure to reproduce the problem

$ sphinx-quickstart
[…]
$ cd in_the_directory
$ printf "\n\nHello −4 world." >> index.rst
$ make latexpdf

Error logs / results

[…]
! Package inputenc Error: Unicode char − (U+2212)
(inputenc)                not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              
                                                  
l.86 Hello −
              4 world.
? 
(./minus.ind) [1] (./minus.aux) ){/usr/share/texlive/texmf-dist/fonts/enc/dvips
/base/8r.enc}</usr/share/texlive/texmf-dist/fonts/type1/urw/helvetic/uhvb8a.pfb
></usr/share/texlive/texmf-dist/fonts/type1/urw/helvetic/uhvbo8a.pfb></usr/shar
e/texlive/texmf-dist/fonts/type1/urw/times/utmb8a.pfb></usr/share/texlive/texmf
-dist/fonts/type1/urw/times/utmr8a.pfb>
Output written on minus.pdf (5 pages, 40041 bytes).
Transcript written on minus.log.
Latexmk: Index file 'minus.idx' was written
Latexmk: Log file says output to 'minus.pdf'
Latexmk: Errors, so I did not complete making targets
Collected error summary (may duplicate other messages):
  pdflatex: Command for 'pdflatex' gave return code 256
Latexmk: Use the -f option to force complete processing,
 unless error was exceeding maximum runs of latex/pdflatex.
Makefile:33: recipe for target 'minus.pdf' failed
make[1]: *** [minus.pdf] Error 12
make[1]: Leaving directory '/home/mdk/Downloads/test-minus/_build/latex'
Makefile:20: recipe for target 'latexpdf' failed
make: *** [latexpdf] Error 2

Expected results

I'd just expect my minus sign to be correctly rendered in a PDF file.

Environment info

OS: Debian sid
Python version: 3.6.2
Sphinx version: 1.6.4
texlive 2017.20171004-1

Looks like there are lots of problems with some unicode characters: https://tex.stackexchange.com/search?q=inputenc+2212

Maybe we should just add another DeclareUnicodeCharacter in utf8extra in latex.py?


JulienPalard · 2017-10-09T21:56:14Z

Same with Є U+0404 CYRILLIC CAPITAL LETTER UKRAINIAN IE:

! Package inputenc Error: Unicode char Є (U+404)
(inputenc)                not set up for use with LaTeX.

jfbu · 2017-10-09T22:04:58Z

U+2212 belongs to the Symbol, Math category of the Mathematical Operators block. As such, the naive stance of a LaTeX user would be that it is supposed to be used in math mode. Thus the standard LaTeX answer nowadays is to switch to a Unicode engine, xelatex or lualatex, and use the package unicode-math. Its default configuration will set up suitable math fonts which hopefully have the glyph.

As for U+0404, of course we can find thousands of similar examples: the utf8 option of inputenc cannot cover the full Unicode range (but it did get expanded in recent years, so an up-to-date TeX install is always advisable). Here again, the probably better answer is to switch to a Unicode engine, xelatex or lualatex.

You may also try the utf8x option for inputenc, with traditional pdflatex. It may work, or not. Check the 'inputenc' key. You then also need to solve the font problem: the LaTeX font must contain a glyph for the wanted Unicode codepoint.
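
For reference, the utf8x attempt could look like this in conf.py. This is only a minimal sketch, assuming Sphinx's `latex_elements` keys `'inputenc'` and `'utf8extra'` (both appear in the CPython configuration quoted later in this thread); whether utf8x and the document font actually cover the needed glyphs is not guaranteed.

```python
# conf.py (sketch): ask pdflatex to load inputenc with the utf8x option.
latex_engine = 'pdflatex'

latex_elements = {
    # Replaces the default \usepackage[utf8]{inputenc} line.
    'inputenc': r'\usepackage[utf8x]{inputenc}',
    # Emptied here, as in the CPython conf.py quoted later in the thread.
    'utf8extra': '',
}
```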

Related: #3444

glyg · 2017-10-10T08:53:54Z

Relevant tests from pandoc's default.latex template:

\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
  \usepackage[utf8]{inputenc} % no inputenc with xetex

 \ifxetex
    \usepackage{mathspec}
    \usepackage{xltxtra,xunicode}
  \else
    \usepackage{fontspec} % no fontspec
%%% hyperref
\ifxetex
  \usepackage[setpagesize=false, % page size defined by xetex
              unicode=false, % unicode breaks when used with xetex
              xetex]{hyperref}
\else
  \usepackage[unicode=true]{hyperref}
%%% (not) Babel
\ifxetex
  \usepackage{polyglossia}
  \setmainlanguage{$mainlang$}
\else
  \usepackage[$lang$]{babel}

Hope this helps
To summarize:

No inputenc
No fontspec
No babel
Packages:
- mathspec
- xltxtra,xunicode
- polyglossia

And the tricky hyperref options...

jfbu · 2017-10-10T10:12:00Z

Well, check Sphinx's latex_engine config setting. The default will use polyglossia.

One can recommend adding \usepackage{unicode-math} to preamble (which provides \setmathfont in particular).

About using mathspec (xetex only): merely loading it without using its commands does not achieve much.

As for xunicode, I think it is now obsoleted by recent fontspec; about xltxtra I am not sure. ~~Probably also (no time to check now).~~ It is xelatex specific.

Sphinx provides minimal set-up via latex_engine, using polyglossia but user should add preamble config for custom fonts. And possibly add unicode-math. This is not done by default because it ~~very~~ significantly slows down build time for PDF; but it is the de facto standard package for math mode with xelatex/lualatex. (mathspec does other things, xelatex only)
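
As an illustration of the set-up described above, a conf.py could look roughly like this. It is a sketch only: it assumes the `latex_engine` setting and the `'preamble'` key of `latex_elements`, and it leaves the choice of fonts open, since that depends on what is installed.

```python
# conf.py (sketch): Unicode engine plus unicode-math for math-mode glyphs.
latex_engine = 'xelatex'  # 'lualatex' is the other Unicode engine mentioned

latex_elements = {
    'preamble': r'''
\usepackage{unicode-math}
% A math font can then be selected with \setmathfont if needed.
''',
}
```

Loading unicode-math slows down the PDF build, which is the reason given above for not doing it by default.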

jfbu · 2017-10-10T10:27:28Z

@JulienPalard There is no problem with

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[ukrainian]{babel}% or russian
\begin{document}
Є
\end{document}

So if your document uses a suitable language it will be OK, even with traditional pdflatex.

For an isolated letter, you may need more. Please provide a minimal example of Sphinx project displaying the problem. You could modify 'babel' key for extra language used for special letters. Then package newunicodechar can facilitate using dedicated (TeX) font having that glyph.

jfbu · 2017-10-10T10:47:56Z

@JulienPalard

This:

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[ukrainian, english]{babel}
\usepackage{newunicodechar}
\newunicodechar{Є}{\foreignlanguage{ukrainian}{\IeC {\CYRIE }}}
\begin{document}
This is a Ukrainian letter: Є
\end{document}

shows (but see note below) how you can configure your Sphinx project latex preamble : you only need modify the 'babel' latex-elements key and add to preamble the two lines with newunicodechar.

It will fix your Є problem if it is a one-shot problem with this letter, with no change needed to the Sphinx-produced mark-up, only via user customization of the LaTeX document preamble.

Important side note: there appears to be a LaTeX bug with babel+ukrainian which causes extra spaces in output, sadly. As a result the thing to do is rather to get the preamble either to contain:

\usepackage[ukrainian, english]{babel}
\usepackage{newunicodechar}
\newunicodechar{Є}{{\fontencoding{T2A}\selectfont\IeC {\CYRIE }}}

or to use russian and not ukrainian:

\usepackage[russian, english]{babel}
\usepackage{newunicodechar}
\newunicodechar{Є}{\foreignlanguage{russian}{\IeC {\CYRIE }}}

jfbu · 2017-10-10T12:03:09Z

As proof of concept I have tried this in conf.py

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #
    # 'preamble': '',

    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
    'babel': r'\usepackage[russian, main=english]{babel}',

    'preamble': r'''
\usepackage{newunicodechar}
\newunicodechar{Є}{\foreignlanguage{russian}{\IeC {\CYRIE }}}
'''
}

then I compile my project containing Є character using make latexpdf.

There is no problem and the build succeeds. Warnings appear in the LaTeX log:

LaTeX Font Warning: Font shape `T2A/ptm/m/n' undefined
(Font)              using `T2A/cmr/m/n' instead on input line 74.

which says that the (TeX) Times font (from \usepackage{times}) does not exist in the T2A encoding loaded by russian language. But the Computer Modern font exists in this (TeX) encoding.

Side note: there is a strange problem that if I use

'babel': r'\usepackage[russian, english]{babel}',

and not

'babel': r'\usepackage[russian, main=english]{babel}',

then the LaTeX compilation is done with Russian as main document language. This seems a bug of LaTeX-babel as it contradicts its documentation, english being the last option. There is something strange non-Sphinx related here. ~~Perhaps interaction with newunicodechar package, anyway, I have not debugged that so far.~~ But it is not due to Sphinx.

EDIT: this is indeed a documented LaTeX-babel bug. In situations like this:

\documentclass[letterpaper,10pt,english]{report}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[ngerman, english]{babel}
\title{FOO Documentation}
\date{Oct 10, 2017}
\author{JFB}
\begin{document}
\maketitle
\tableofcontents

Foo bar.
\end{document}

the document language turns out to be German! If one suppresses the global class option english, then the document language is English (last option to babel). The babel documentation contains

WARNING Languages may be set as global and as package option at the same time,
but in such a case you should set explicitly the main language with the package
option main:
\documentclass[italian]{book}
\usepackage[ngerman,main=italian]{babel}

jfbu · 2017-10-10T12:35:24Z

Still for the Cyrillic letter, there is an alternative to using the Russian language locally. One does

latex_elements = {
    'fontenc': r'\usepackage[T2A, T1]{fontenc}',

    'preamble': r'''
\usepackage{newunicodechar}
\newunicodechar{Є}{{\fontencoding{T2A}\fontfamily{cmr}\selectfont\IeC {\CYRIE }}}
'''
}

This gives more efficient code (the babel method also sets up hyphenation patterns, which are not needed for an isolated letter). The problem is that the average LaTeX user will not know how to automate this, nor where the \IeC {\CYRIE } comes from. There is an intrinsic LaTeX problem here: \usepackage[utf8]{inputenc} makes such Unicode characters "active", but the character will not change the font by itself. It knows which fontenc encoding to use, but if the current (TeX) font is not in that encoding, it throws an error rather than temporarily changing the font by itself!

Thus, we use newunicodechar to modify the meaning, but we have to know the LICR representation. I don't have time now, but I suspect that with some hacking we could automate that, with something like \sphinxnewunicodechar{Є}{T2A/cmr}, the second argument indicating which TeX font to temporarily switch to in order to avoid a LaTeX error message of the style "no such character in font encoding T1".

JulienPalard · 2017-10-11T19:40:05Z

Wow, thanks for your extensive feedback everyone!!

I'm reviewing it all, I'll probably first try to use a mix between platex and xelatex (from python/cpython#3940 (comment)):

if language != 'ja':
    latex_engine = 'xelatex'
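
The same switch can be written as a small helper. This is a sketch: `pick_latex_engine` is a hypothetical name, and the usage lines assume `language` is set in conf.py itself rather than passed in from the command line.

```python
def pick_latex_engine(language):
    """Return the LaTeX engine name to use for a Sphinx language code."""
    # platex is kept for Japanese; every other language goes through xelatex.
    return 'platex' if language == 'ja' else 'xelatex'


# Hypothetical usage inside conf.py:
language = 'fr'
latex_engine = pick_latex_engine(language)
```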

I'm strongly against all kinds of hardcoding like \sphinxnewunicodechar{Є}. For the record, the documentation I am trying to build is the Python documentation, so I expect characters to be added regularly, and I don't want documentation writers and translators to have to update the Sphinx configuration each time. That makes no sense, so I hope we can do without it.

JulienPalard · 2017-10-11T20:10:20Z

Hum, I can't access language in my conf.py: sphinx/config.py line 150, execfile_(filename, config), does not provide it, but language: 'fr' is stored in overrides, in my specific context... I'll still try to see if it works with xelatex for English and French, and platex for Japanese.

JulienPalard · 2017-10-11T21:36:54Z

The following minimalist configuration:

latex_engine = 'xelatex'
latex_elements = {}

builds the English and French versions of the Python documentation. So I think we should add some information near the latex_engine documentation about how to choose one? I don't expect Sphinx users to know luatex, xetex, platex; I had personally never heard of xetex and platex before today.

JulienPalard · 2017-10-13T16:08:07Z

I'm opening a specific issue about the documentation subject.

jfbu · 2017-10-17T21:44:51Z

I am adding here some background info on problem of Unicode characters with pdflatex.

PDFTeX uses so-called TeX format font encodings, which have at most 256 slots for glyphs. They are documented in the file encguide.pdf, included with all LaTeX distributions and usually opened via texdoc fontenc, but it is simplest to examine the relevant files directly; in TeXLive one typically finds them in /usr/local/texlive/2017/texmf-dist/tex/latex/base/, for example the file t2aenc.dfu.

When \usepackage[utf8]{inputenc} is encountered, inputenc takes into account which encodings were declared via \usepackage[..., T1]{fontenc} (the last declared has priority). For example if T2A has been declared, the file t2aenc.dfu is executed. This file arranges for utf8-encoded characters to act as macros. These macros will check what is the current font encoding; if the encoding made no provisions for that Unicode code-point, usually an error message results (there may be some default fall-back avoiding that error message).

If the Unicode character is covered by none of the fontenc declared encodings (plus some default encodings always declared such as OT1), then the dreaded "not set-up for use with LaTeX" error message appears.
If the Unicode character is matched by some declared font encoding, but the then-current font encoding is not suitable, then an error such as Command \CYRIE unavailable in encoding T1. may arise.
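
The two failure modes can be mimicked with a toy model. This sketch is purely illustrative: the `COVERAGE` table and the `resolve` function are made up for the example, contain only two hand-picked entries, and do not reproduce how inputenc or the .dfu files are actually implemented.

```python
# Toy model only: two tiny, hand-picked entries, NOT the real .dfu tables.
COVERAGE = {
    'T1': {'é': r"\'e"},
    'T2A': {'Є': r'\CYRIE'},
}


def resolve(char, declared, current):
    """Return 'ok' or the kind of error described above for `char`."""
    covering = [enc for enc in declared if char in COVERAGE.get(enc, {})]
    if not covering:
        # No declared encoding knows this code point at all.
        return ('Unicode char %s (U+%X) not set up for use with LaTeX'
                % (char, ord(char)))
    if current not in covering:
        # Known somewhere, but not in the encoding of the current font.
        command = COVERAGE[covering[0]][char]
        return 'Command %s unavailable in encoding %s' % (command, current)
    return 'ok'


print(resolve('−', ['T2A', 'T1'], 'T1'))   # U+2212: covered by neither
print(resolve('Є', ['T2A', 'T1'], 'T1'))   # covered by T2A, current is T1
print(resolve('Є', ['T2A', 'T1'], 'T2A'))  # current encoding covers it
```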

It is possible to set-up per Unicode character a default fall-back encoding, but the problem is that the document font may not be available in that encoding. For example when using Sphinx default pdflatex set-up which does \usepackage{times}. In that case if we are lucky we may see some message like

LaTeX Font Warning: Font shape `T2A/ptm/m/n' undefined
(Font)              using `T2A/cmr/m/n' instead on input line 7.

There is, to the best of my knowledge, no official LaTeX command to set up a given Unicode character to forcefully change not only the font encoding but also the font family. And changing the font is a rather costly macro expansion in LaTeX. It would also be very inefficient to do this at each character. The LaTeX philosophy is rather that the user adds the mark-up locally to select fonts for some stretch of text.

But it is possible to do it in following way. One can use this home-made command

\newcommand*\SetUnicodeCharacterWithFont[3]{%
  \begingroup
     \def\IeC ##1%
   {\unexpanded{{\fontencoding{#2}\fontfamily{#3}\selectfont\IeC{##1}}}}
     \expandafter\xdef\csname u8\string:\detokenize{#1}\endcsname 
                      {\csname u8\string:\detokenize{#1}\endcsname}%
  \endgroup
}

and then issue commands in latex preamble such as

\SetUnicodeCharacterWithFont{Є}{T2A}{cmr}
\SetUnicodeCharacterWithFont{Ѕ}{T2A}{cmr}
\SetUnicodeCharacterWithFont{Б}{T2A}{cmr}

In this way the Є will force locally that T2A encoding is used with cmr font.

The burden on LaTeX user who has isolated exceptional Unicode characters to worry about is then "only" to know which encoding to use (here T2A) and with which font family name. The encoding must also have been passed as option to fontenc package. Compared to my earlier comments here, the user does not have to know about LaTeX internal character representation via \CYRIE macro.
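
If the list of exceptional characters is kept in conf.py, the declaration lines can be generated rather than typed by hand. This is a sketch under assumptions: `font_switch_declarations` is a hypothetical helper, and the definition of \SetUnicodeCharacterWithFont itself still has to be placed in the preamble ahead of the generated lines.

```python
def font_switch_declarations(chars, encoding='T2A', family='cmr'):
    """Build one SetUnicodeCharacterWithFont declaration per character."""
    return '\n'.join(
        r'\SetUnicodeCharacterWithFont{%s}{%s}{%s}' % (char, encoding, family)
        for char in chars
    )


# The three Cyrillic letters used as examples above.
declarations = font_switch_declarations('ЄЅБ')
print(declarations)
```

The resulting string can be appended to the `'preamble'` entry of `latex_elements`, with the chosen encoding also passed to fontenc.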

LaTeX documents using babel may declare multiple languages. When switching languages typically the font encoding is automatically modified. This is of course more efficient than doing it again and again for each character. Besides babel will use core TeX facilities for associations of hyphenation patterns to languages. But currently Sphinx does not support multiple languages per document. And even if it did, it would not be automatic, the rst sources would need to have extra mark-up to indicate the local language changes.

All of the above applies to text fonts. It is again another matter for math fonts (which I will not go into). In fact, typically utf8-encoded characters declared by the .dfu files are not set-up for math mode (although sometimes they are). For example one may see this in log file:

LaTeX Warning: Command \CYRIE invalid in math mode on input line 9.

Thus indeed the situation with Unicode and pdflatex is rather complicated. I hope not to have added too much to the confusion...

JulienPalard · 2017-10-20T21:00:13Z

People are here to write documentation, not latex. If we want people to write documentation (and we want?) forcing them to write latex macros each time they use a new out-of-ascii character is unacceptable.

However, there may be a way here to autogenerate those macros while autogenerating the .tex files, but it would require a huge database of which character is from which font? Or use xelatex by default?

jfbu · 2017-10-20T22:17:05Z

I would like to have a more focused discussion on exactly what is at stake here. Your initial report mentioning Є or U+2212 did not explain how these pop up in your project. In 90% of cases, things work because the pdfLaTeX document has a specific language; the language loads the needed font encoding and inputenc prepares the corresponding Unicode codepoints for LaTeX. In human-made LaTeX documents people load babel with enough languages to cover the needed glyphs. They decide on the corresponding font set-up. They then add mark-up to change languages locally.

It is a fact that Sphinx currently has no support for multi-lingual documents. Is this what is at stake?

Then it appeared your project is the CPython docs, and unsurprisingly in so many thousands of lines there is bound to be, sooner or later, some stray Unicode char. We could use extra raw latex markup in rst sources to use the language switching mechanism I hinted to. Or we need to find a solution to tell LaTeX that this stray Unicode character is to use some specific font suitable for that glyph in some suitable encoding. I provided macros which achieve this with minimal effort. It is not reasonable to build a database, and who is going to invest the time into this? Moreover there is not going to be a canonical choice of (TeX) font.

Then, why make xelatex default when this will break the look of all current Sphinx projects because this forces usage of new fonts hence modifies linebreaks and pagebreaks, only to fix well-known in LaTeX world fact that pdfLaTeX isn't optimal regarding Unicode support and that every user of LaTeX who has read a minimal quantity of LaTeX documentation knows very well that for ten years+ the "Unicode" engines xelatex and lualatex are there and are much more Unicode savvy ? Anyone will then use conf.py to use xelatex and need not have that done by default.

Besides, you may trust too much that xelatex will solve all Unicode problems. This is simply wrong, because the document needs suitable fonts, and if the document has too many Unicode characters, you will have to change locally to another OpenType font and we are back to Problem 1 with pdflatex. And I do not even mention the problems of math mode.

Regarding platex usage by Sphinx, here too Unicode support is lacking (for example we had a problem simply with EN DASH...). There are Unicode aware Japanese engines, but currently Sphinx is not set-up to use them, contributions welcome !

There are differences in the graphicx drivers for pdfLaTeX and xelatex, and if we switch to xelatex by default we may create problems with certain types of images in documents. For example there was a bug with Rotate not being obeyed by xelatex in included pdf images which was fixed recently upstream, but will take some time to get down to TeX Linux distros. The rendering of math mode in xelatex has issues non-existent in pdfLaTeX. The build time is increased compared to pdfLaTeX.

We at Sphinx can not fix LaTeX.

JulienPalard · 2017-10-21T09:54:58Z

The 'Є' appeared in https://docs.python.org/3.7/whatsnew/3.7.html#optimizations:

> Searching some unlucky Unicode characters (like Ukrainian capital “Є”) in a string was up to 25 times slower than searching other characters. Now it is slower only by 3 times in worst case. (Contributed by Serhiy Storchaka in bpo-24821.)

> We could use extra raw latex markup in rst sources to use the language switching mechanism I hinted to.

I don't think asking documentation people (in general) to write raw latex is a thing: they write reStructuredText, they (OK, not everybody) don't care about a PDF being generated, so they don't remotely care that latex is used to generate the PDF files.

But can this extra latex markup be added automatically while generating latex file?

> why make xelatex default

Because it looks like it works in more cases than pdflatex, and I think the goal of the default values is to provide a seamless experience.

> every user of LaTeX

We're speaking of Sphinx-doc users, not LaTeX users.

> Besides, you may trust too much that xelatex will solve all Unicode problems.

This is right, my experience covers only a single project, that's also why I'm just opening the discussion: to gather feedback, and to "document it" (here, in the issues threads) to the next ones having the same idea.

jfbu · 2017-10-21T11:30:07Z

reopening issue for better visibility during discussion

JulienPalard · 2017-10-21T12:19:38Z

To all, please note that there's specific threads about:

And let's try to keep this thread focused on having a few unicode characters from a language into a documentation in another language.

jfbu · 2017-10-21T12:25:18Z

Then I propose to close this again, because "having a few unicode characters" is solved by the advice I gave, and switching to xelatex does not solve it.

jfbu · 2017-10-21T14:02:08Z

I will give a try to building the PDFs for CPython. Please note that the

'fontenc': r'\usepackage[T1,T2A]{fontenc}'

at https://github.com/python/cpython/blob/db60a5bfa5d5f7a6f1538cc1fe76f0fda57b524e/Doc/conf.py#L97 is wrong. It should be 'fontenc': r'\usepackage[T2A,T1]{fontenc}'. But then something else must be changed for the cyrillic letters. In the order T1,T2A it may solve a problem but it slows down LaTeX compilation.

I will investigate building CPython pdf's when I can.

Also the two lines starting at https://github.com/python/cpython/blob/db60a5bfa5d5f7a6f1538cc1fe76f0fda57b524e/Doc/conf.py#L106 about \OriginalVerbatim are un-needed since Sphinx 1.5.

jfbu · 2017-10-21T14:47:16Z

@JulienPalard

I successfully built the English docs of CPython using this (hence pdflatex)

diff --git a/Doc/conf.py b/Doc/conf.py
index aaee983984..8326b1e766 100644
--- a/Doc/conf.py
+++ b/Doc/conf.py
@@ -92,9 +92,7 @@ html_split_index = True
 
 # Get LaTeX to handle Unicode correctly
 latex_elements = {
-    'inputenc': r'\usepackage[utf8x]{inputenc}',
-    'utf8extra': '',
-    'fontenc': r'\usepackage[T1,T2A]{fontenc}',
+    'fontenc': r'\usepackage[T2A,TS1,T1]{fontenc}',
 }
 
 # Additional stuff for the LaTeX preamble.
@@ -103,8 +101,11 @@ latex_elements['preamble'] = r'''
   \sphinxstrong{Python Software Foundation}\\
   Email: \sphinxemail{docs@python.org}
 }
-\let\Verbatim=\OriginalVerbatim
-\let\endVerbatim=\endOriginalVerbatim
+\usepackage{newunicodechar}
+\newunicodechar{ſ}{{\fontencoding{TS1}\fontfamily{lmr}\selectfont s}}
+\newunicodechar{K}{\ensuremath{\mathrm K}}
+\newunicodechar{−}{\textminus}
+\newunicodechar{Є}{{\fontencoding{T2A}\fontfamily{cmr}\selectfont\IeC{\CYRIE}}}
 '''
 
 # The paper size ('letter' or 'a4').

Steps:

cd Doc/
make latex
cd build/latex
make all-pdf

Using Sphinx 1.6.3 and Python 3.6.2 from an Anaconda install on Mac OS X 10.9.5.

What does "building French docs" mean? Does this simply mean setting the language to 'fr'? Anyway I will try now.

In the above, the most problematic was the long s; it is provided by the TS1 encoding but not explicitly by the textcomp package as far as I can tell. I was helped by https://tex.stackexchange.com/a/70580/4686

JulienPalard · 2017-10-21T14:50:18Z

> +\newunicodechar{ſ}{{\fontencoding{TS1}\selectfont s}}
> +\newunicodechar{K}{\ensuremath{\mathrm K}}
> +\newunicodechar{−}{\textminus}
> +\newunicodechar{Є}{{\fontencoding{T2A}\fontfamily{cmr}\selectfont\IeC{\CYRIE}}}

Did it feel right for you to ask documentation writers to learn that Є is {\fontencoding{T2A}\fontfamily{cmr}\selectfont\IeC{\CYRIE}}?

jfbu · 2017-10-21T15:00:46Z

Please, if you want me to help out, stop attacking me and direct your remark to the LaTeX team. https://www.latex-project.org/about/team/

As per my comment above I must edit it because the long s is not found in times font. Investigating.

jfbu · 2017-10-21T15:10:09Z

I have edited my comment above for ſ. Unfortunately the default cmr font does not have it in the TS1 encoding, but lmr does.
With the set-up above and language = 'fr' in CPython/Doc/conf.py I built all "French" PDF docs of CPython with no LaTeX error reported.

edit: if using the same build directory, run make clean in build/latex, or else Babel will complain about the English-to-French transition. This is a well-known problem with Babel: auxiliary files must be removed when changing the document language after compiling with another language.

jfbu · 2017-10-21T15:11:16Z

@JulienPalard

> Did it feel right for you to ask documentation writers to learn that Є is {\fontencoding{T2A}\fontfamily{cmr}\selectfont\IeC{\CYRIE}}?

At #4136 (comment) I have provided a recipe to avoid having to know about the internal LaTeX representation \CYRIE.

JulienPalard · 2017-10-21T15:29:09Z

First, sorry if it felt like aggression. You're helping and I'm grateful for this, even if I don't show it in my English, so let's be clear: thanks for the time you're spending on the issue, trying to build the cpython doc and looking at our problems. We have almost no one latex-oriented here for this, so any help is really welcome.

I tried to express surprise that you were proposing a syntax almost no documentation writer can read, let alone write, and a solution that will fail each time someone adds a new unicode character. I clearly don't want to have a documentation build fail each time someone adds a new character.

On the other hand, I heard you about xelatex "not fixing every problem". I did not expect it to, to be fair, but it looks like it works in more cases than pdflatex, as it successfully builds English and French for cpython (python/docsbuild-scripts#34). It still looks way simpler to use xelatex than to add \newunicodechar, up to the moment someone succeeds in adding a character that breaks xelatex again (missing font)?

To build the french and Japanese documentation you'll need more than language=fr, you'll need gettext_compact=0 and to give a locale_dirs. We provide a makefile in the french translation https://github.com/python/python-docs-fr to do it automatically. There's also a script to build english, japanese, and french at https://github.com/python/docsbuild-scripts I typically run it just with --skip-cache-invalidation by creating directories myself and giving me the rights to write on them, but you can also use the options to change the directories used by the script. Either way you'll have to clone python manually if I remember well the script does not clone cpython itself, it however clones translations itself.

jfbu · 2017-10-21T15:40:33Z

You are asking us at Sphinx to consider making xelatex the default because it helps build the CPython docs? I demonstrated how you can build the whole CPython Docs with pdflatex and a few additions using the newunicodechar package (if one does not want to use the macros I have provided above). I never disputed the point that pdflatex is not tailored for Unicode, and that making this work does require some LaTeX connoisseur.

The patch above using \newunicodechar declarations does not work for Japanese, because the newunicodechar package (or the macros I provided in earlier comments) is for a context with \usepackage[utf8]{inputenc}, and this does not work with the Japanese engine platex. The platex engine isn't Unicode aware, and not being Japanese I don't know how people handle Unicode issues there, apart from using the Unicode engine uplatex. I know about this Unicode-aware Japanese engine but for obvious reasons I am not familiar with it and can't easily find documentation about it. However this does provide me with background.

The newunicodechar method may be at times needed also for xelatex. See the explanation in the newunicodechar documentation

JulienPalard · 2017-10-21T15:51:56Z

> You are asking us at Sphinx to consider making xelatex the default because it helps build the CPython docs?

No, I'm proposing the idea to you at Sphinx to consider making xelatex the default because my experience building the Python docs showed me that it was seamless to build with xelatex, but very hard to build with pdflatex, and I thought "default should be the easy way, not the hard way". Defaults should lead users to an "easy path", something working in most cases, and experts wanting to write specific latex macros could optionally switch to non-default values.

xelatex may not work in all cases, but it looked like xelatex works in more cases than pdflatex, which was enough for me to think that it would make a better default. I still think xelatex is a better default, but that's only what I think. It would require more feedback to make a sound choice.

On the Python side, we're succeeding in building CPython with external logic to switch between platex and xelatex depending on the language (python/docsbuild-scripts#34).

jfbu · 2017-10-21T16:28:25Z

Thanks for the explanation relative to the internationalization. I have git cloned the French translated repo, manually created cpython/Doc/locales/fr/LC_MESSAGES symlink in a checkout of cpython 3.6 tag with my patched conf.py, then I have manually run the make latex with suitable SPHINXOPTS in a Python 3.6 environment.

I got two innocuous warnings, then I cd to build/latex and run there make clean, make all-pdf. Ok, I see some problems.

This one looks like a typo in French translation:

Le nœud qui suit immédiatement le nœud courant dans le même parent. Voir également {\hyperref[\detokenize{library/xml.dom:xml.dom.Node.previousSibling}]{\sphinxcrossref{\sphinxcode{previousSibling}}}}. Si ce nœud est le dernier de son parent, alors l’attribut sera \sphinxcode{Ǹone}. Cet attribut est en lecture seule.

this is on line 101211 of library.tex. There is Ǹ in None which looks very suspicious.

jfbu · 2017-10-21T16:32:43Z

There is a U+200b in line

Python définit \sphinxcode{pow(0, 0)} et \sphinxcode{0 ** 0} valant \sphinxcode{1}, puisque c’est courant pour les langages de programmation, et logique.

which is line 2096 of library.tex, right before puisque and this looks again very suspicious. Why should a zero-width space be there?
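
Stray characters like these can also be located before LaTeX trips over them. The following is a hedged sketch, not part of Sphinx: `find_unexpected` and its `expected` parameter are hypothetical names, and the sample string is made up for the example.

```python
import unicodedata


def find_unexpected(text, expected=''):
    """Yield (line, column, codepoint, name) for each non-ASCII character
    that is not listed in `expected`."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, char in enumerate(line, start=1):
            if ord(char) > 127 and char not in expected:
                name = unicodedata.name(char, '<unnamed>')
                yield lineno, col, 'U+%04X' % ord(char), name


# Made-up sample: a legitimate French ligature, a zero-width space,
# and the mistyped capital N.
sample = 'nœud\u200b \u01f8one'
for hit in find_unexpected(sample, expected='œ'):
    print(hit)
```

Characters that a translation legitimately uses (accented letters, ligatures, typographic quotes) go into `expected`, so only the surprises are reported.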

jfbu · 2017-10-21T16:47:42Z

I confirm I can build all French CPython Docs with pdflatex using the above. However there are bugs in the French translation strings (at least one typo and a possible stray zero-width space) which require extra steps. I did:

diff --git a/Doc/conf.py b/Doc/conf.py
index f7073d116a..433f7f259f 100644
--- a/Doc/conf.py
+++ b/Doc/conf.py
@@ -15,6 +15,7 @@ sys.path.append(os.path.abspath('tools/extensions'))
 extensions = ['sphinx.ext.coverage', 'sphinx.ext.doctest',
               'pyspecific', 'c_annotations']
 
+
 # General substitutions.
 project = 'Python'
 copyright = '2001-%s, Python Software Foundation' % time.strftime('%Y')
@@ -91,9 +92,7 @@ html_split_index = True
 
 # Get LaTeX to handle Unicode correctly
 latex_elements = {
-    'inputenc': r'\usepackage[utf8x]{inputenc}',
-    'utf8extra': '',
-    'fontenc': r'\usepackage[T1,T2A]{fontenc}',
+    'fontenc': r'\usepackage[T2A,TS1,T1]{fontenc}',
 }
 
 # Additional stuff for the LaTeX preamble.
@@ -102,8 +101,13 @@ latex_elements['preamble'] = r'''
   \sphinxstrong{Python Software Foundation}\\
   Email: \sphinxemail{docs@python.org}
 }
-\let\Verbatim=\OriginalVerbatim
-\let\endVerbatim=\endOriginalVerbatim
+\usepackage{newunicodechar}
+\newunicodechar{ſ}{{\fontencoding{TS1}\fontfamily{lmr}\selectfont s}}
+\newunicodechar{K}{\ensuremath{\mathrm K}}
+\newunicodechar{−}{\textminus}
+\newunicodechar{Є}{{\fontencoding{T2A}\fontfamily{cmr}\selectfont\IeC{\CYRIE}}}
+\usepackage{uspace}% handles Unicode spaces inclusive of zero width U+200B
+\newunicodechar{Ǹ}{N}
 '''
 
 # The paper size ('letter' or 'a4').

The uspace package handles Unicode space characters. It may not be installed in your local TeX distribution because it is relatively recent; see its repo.

You can also simply do

\DeclareUnicodeCharacter{200B}{}

to get rid of it (as it looked like a typo).
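In conf.py terms, that one-liner would go into the LaTeX preamble. A minimal sketch, assuming the 'inputenc' key is left at its utf8 default (as in the diff above, which drops utf8x) and that nothing else is needed in the preamble:

```python
# conf.py (sketch): make pdflatex silently drop U+200B ZERO WIDTH SPACE.
# Assumption: inputenc is utf8, not utf8x, so \DeclareUnicodeCharacter applies.
latex_elements = {
    'preamble': r'''
\DeclareUnicodeCharacter{200B}{}
''',
}
```

In a real conf.py this line would be merged into the existing preamble string rather than replacing it.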

As for the Ǹ, it is surely a typo, so I redeclared it to be N. I have not looked into how it should really be defined if it were needed.

Sorry for delays, but building all PDFs takes time each time.

Notice that, ironically, bugs are revealed by not using a Unicode-savvy engine, since input typos then get detected.

Still, I find the time spent on this big project worthwhile (library.pdf in French runs to 1742 pages), as it can serve as a test case in the future when Sphinx's LaTeX support changes.

JulienPalard · 2017-10-21T16:55:51Z

I fixed the Ǹ, and I don't think the zero-width space is a thing, so I also removed them.

JulienPalard · 2017-10-21T17:04:40Z

> Notice that, ironically, bugs are revealed by not using a Unicode-savvy engine, since input typos then get detected.

Yes, I take note. Also, having this in conf.py is far more suitable than having to move the logic into https://github.com/python/docsbuild-scripts: it allows one to build more naturally using the CPython makefile instead of having to rely on another tool. I like it.

Would you like to PR this on https://github.com/python/cpython (issue already opened: https://bugs.python.org/issue31589)?

jfbu · 2017-10-21T17:15:45Z

I would be glad to open a PR, but unfortunately putting this in conf.py will break Japanese translated builds, because the newunicodechar package works only with \usepackage[utf8]{inputenc} (it is not compatible with utf8x) and is not compatible with the platex engine.

But it is possible to wrap it in a LaTeX conditional so that it is skipped when the engine is platex. (Or perhaps there is some Sphinx latex_elements key we can use; offhand I don't remember, I will check now.)

~~Oh, simply I can use 'utf8extra' key.~~ no sorry, my mistake. Will go the LaTeX conditional way.

Regarding Japanese docs, I understand you build with no LaTeX errors. But do the PDFs look fine regarding the problematic characters? (i.e. particularly the long s ſ and the Є)

jfbu · 2017-10-21T18:08:40Z

@JulienPalard I have done the PR at python/cpython#4069

jfbu · 2017-10-23T18:59:53Z

I have found a way for Japanese. I could not solve the problem with pLaTeX, which Sphinx uses by default for Japanese, but after some trial and error I may have a way with upLaTeX. However, many files of the CPython documentation use the Sphinx howto document type, which uses jreport, and I do not know how to make that work with upLaTeX. The biggest file, library.tex, uses jsbook, and one can make it work with upLaTeX using the LaTeX document class option uplatex.

In conf.py:

latex_elements = {
    'extraclassoptions': 'uplatex',
    'inputenc': r'\usepackage[utf8]{inputenc}',
    'fontenc': r'\usepackage[T2A, TS1, T1]{fontenc}',
    'preamble': r'''
\ifx\XeTeXinterchartoks\undefined % not for xelatex
\ifx\directlua\undefined          % not for lualatex
  \ifx\kcatcode\undefined\else  %  extra preparation for uplatex
  \kcatcode`ſ 15
  \kcatcode`K 15
  \kcatcode`Є 15
  \kcatcode`− 15
  \fi
\DeclareUnicodeCharacter{017F}% ſ
  {{\fontencoding{TS1}\fontfamily{lmr}\selectfont s}}
\DeclareUnicodeCharacter{212A}% K
  {\ensuremath{\mathrm{K}}}
\DeclareUnicodeCharacter{0404}% Є
  {{\fontencoding{T2A}\fontfamily{cmr}\selectfont\IeC{\CYRIE}}}
\DeclareUnicodeCharacter{2212}{\textminus}
\fi\fi
'''
}

Unfortunately, the extra class option uplatex is not expected by LaTeX when it is not used with the jsbook class, so a warning is reported in each LaTeX compilation log when one tries the above with -D language=fr, for example. This is mostly invisible to the user, but I should think harder about passing uplatex as a class option only for Japanese (one can actually use one's own version of the LaTeX template to add some stuff before the document class, including a conditional \PassOptionsToClass).

Then we also need this in conf.py:

latex_additional_files = ['latexmkjarc']

and finally we need this latexmkjarc file, which is the fastest way I found to trick Sphinx into using the upLaTeX engine instead of pLaTeX for Japanese:

$latex = 'uplatex ' . $ENV{'LATEXOPTS'} . ' -kanji=utf8 %O %S';
$dvipdf = 'dvipdfmx %O -o %D %S';
$makeindex = 'rm -f %D; upmendex -U -f -d %B.dic -s python.ist %S || echo "upmendex exited with error code $? (ignoring)" && : >> %D';
add_cus_dep( "glo", "gls", 0, "makeglo" );
sub makeglo {
 return system( "upmendex -J -f -s gglo.ist -o '$_[0].gls' '$_[0].glo'" );
}

Edit: upmendex apparently does not recognize the -U option, so it should be removed. It was a mendex option to set the input/output kanji encoding to UTF-8, which perhaps makes no sense for upmendex (just as -kanji=utf8 perhaps makes no sense as an option for uplatex).

I think the -kanji=utf8 option should be removed, but I can't read the Japanese documentation and worked from Google Translate... The file above is exactly the Sphinx-provided latexmkjarc, except that I replaced platex with uplatex and mendex with upmendex.

With this set-up I can compile successfully for English, French, and Japanese; sadly, however, in Japanese only when 'manual' is chosen for the docclass, not 'howto'. So I compiled all the CPython docs successfully in Japanese, but only after forcing 'manual' everywhere.

- I can't use utf8 inputenc with pLaTeX, but I can with upLaTeX.
- Package newunicodechar refuses to work with upLaTeX, which is why I reverted to \DeclareUnicodeCharacter.
- The method above is set up to work with both pdflatex and uplatex.
- Because this method relies on Latexmk, it does not work on Windows, as Sphinx currently does not use Latexmk on Windows.
- There is no need to set latex_engine: for Japanese, Sphinx uses 'platex', but the binary actually run is the one listed in the latexmkjarc file, so we can work around the fact that Sphinx does not (yet) provide a 'uplatex' setting for latex_engine.
- Be careful when copy-pasting from here: the Kelvin sign (U+212A) may turn into a plain Latin K.
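One plausible explanation for the Kelvin sign turning into a plain K (an assumption, not something confirmed in this thread) is Unicode normalization somewhere along the copy-paste path: U+212A has a canonical decomposition to U+004B, so NFC normalization replaces it. A small Python check illustrates this; the helper name `is_kelvin_sign` is hypothetical.

```python
import unicodedata

kelvin = "\u212a"  # KELVIN SIGN, the character declared via {212A} above

# NFC normalization maps the Kelvin sign to the ordinary Latin capital K.
nfc = unicodedata.normalize("NFC", kelvin)
print(unicodedata.name(kelvin), "->", unicodedata.name(nfc))


def is_kelvin_sign(ch):
    """Return True only for the genuine U+212A character (hypothetical helper)."""
    return ch == "\u212a"
```

Checking pasted preamble text with such a helper, or writing the character as an escape, avoids silently declaring the wrong code point.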

@cocoatomo do you have any comment? do you build Japanese CPython documentation with standard Sphinx set-up using pLaTeX?

I couldn't make pLaTeX work with the characters above, except the U+2212 − which at least gives reasonable output it seems with no extra set-up. I had some partial success but then unexpected results arose from using these characters in code-blocks or literal.

Of course there might be a much better way, but as I can't read Japanese it is very hard for me to understand how LaTeX classes and engines work there.

JulienPalard · 2017-10-23T20:44:12Z

nosy @methane ^

jfbu · 2017-10-23T21:28:47Z

There is a ujreport LaTeX document class for use with upLaTeX.

I can force Sphinx to use it by this kind of patching in conf.py:

from sphinx.builders import latex

def my_default_latex_docclass(config):
    # type: (Config) -> Dict[unicode, unicode]
    """ Better default latex_docclass settings for specific languages. """
    if config.language == 'ja':
        return {'manual': 'jsbook',
                'howto': 'ujreport'}  # replace jreport by ujreport
    else:
        return {}

latex.default_latex_docclass = my_default_latex_docclass

With a patch like the above I can then build the whole CPython documentation for Japanese (without having to replace 'howto' by 'manual' as in my first try with upLaTeX).

The uplatex extra class option is ignored by the ujreport LaTeX class, and the log warning is innocuous.

@tk0miya is there an easier way to get Sphinx to use ujreport as docclass rather than jreport when language is 'ja' and document type 'howto'?

Also it would be convenient to have a 'passoptionstoclass' latex_elements key, because this can have conditional LaTeX code that can't be put safely in 'extraclassoptions'. Admittedly, this can now be done at user level using template override.

brechtm · 2020-06-09T20:12:47Z

I happened to see this issue when I was searching for tickets on another topic. I just wanted to mention that building a PDF using rinohtype might make things easier when dealing with languages other than English. As long as your fonts contain the needed glyphs, things should just work... provided the script doesn't require advanced typesetting features that are currently not supported by rinohtype. You should also know that rinohtype doesn't support typesetting of maths as of yet.

I would be interested to learn how rinohtype handles your document. I welcome any feedback in the rinohtype issue tracker.

tk0miya · 2020-06-11T11:17:41Z

> @tk0miya is there an easier way to get Sphinx to use ujreport as docclass rather than jreport when language is 'ja' and document type 'howto'?

Now Sphinx uses ujreport when latex_engine = 'uplatex'. May I close this issue?
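For a Japanese project this means the earlier monkey-patching of default_latex_docclass should no longer be needed. A minimal conf.py sketch, assuming a Sphinx release recent enough to accept 'uplatex' as a latex_engine value (the exact version is not stated in this thread):

```python
# conf.py (sketch) for a Japanese project built with upLaTeX.
# Assumption: the installed Sphinx accepts 'uplatex' for latex_engine.
language = 'ja'
latex_engine = 'uplatex'
```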

jfbu · 2020-06-11T11:22:32Z

@tk0miya: +1 for closing as far as I can see.

jfbu added builder:latex type:question labels Oct 9, 2017

JulienPalard mentioned this issue Oct 10, 2017

[WIP] bpo-31589: Build PDF using xelatex for better UTF8 support. python/cpython#3940

Merged

JulienPalard closed this as completed Oct 13, 2017

This was referenced Oct 13, 2017

Documentation: Help choosing latex_engine #4149

Closed

Add xelatex to help building french PDFs. python/psf-salt#122

Merged

jfbu reopened this Oct 21, 2017

JulienPalard mentioned this issue Oct 21, 2017

[WIP] Configuration to properly build PDFs in japanese and french python/docsbuild-scripts#34

Merged

jfbu mentioned this issue Oct 24, 2017

Feature Request: support upLaTeX for PDF builds of Japanese projects #4186

Closed

fbruetting mentioned this issue Mar 2, 2018

Unicode: Supporting middle dot & real minus JuliaLang/julia#26193

Closed

DavidLeoni mentioned this issue Jul 25, 2019

Unicode characters do not appear in PDF DavidLeoni/jupman#29

Open

coiby mentioned this issue Jun 8, 2020

Can't generate correct filename for a specific downloaded image for .tex #7801

Open

tk0miya closed this as completed Jun 11, 2020

unknownue mentioned this issue Aug 14, 2020

pdf or markdown unknownue/PyTorch.docs#2

Closed

github-actions bot locked as resolved and limited conversation to collaborators Jul 23, 2021
